Discussion:
SCHED_ULE makes 256Mbyte i386 unusable
Rick Macklem
2018-04-21 19:21:58 UTC
I decided to start a new thread on current related to SCHED_ULE, since I now see
more than just performance degradation, and on a recent current kernel.
(I cc'd a couple of the people discussing performance problems in freebsd-stable
recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".)

When testing a pNFS server on a single-core i386 with 256Mbytes of RAM using a Dec. 2017
current/head kernel, I would see about a 30% performance degradation (elapsed
run time for a kernel build over NFSv4.1) when the server kernel was built with
options SCHED_ULE
instead of
options SCHED_4BSD

Now, with a kernel from a couple of days ago, the
options SCHED_ULE
kernel becomes unusable shortly after starting testing.
I have seen two variants of this:
- Became essentially hung. All I could do was ping the machine from the network.
- Reported "vm_thread_new: kstack allocation failed"
and then any attempt to do anything gets "No more processes".
With the only difference being a kernel built with
options SCHED_4BSD
everything works and performs the same as the Dec. 2017 kernel.

I can try rolling back through the revisions, but it would be nice if someone
could suggest where to start, because it takes a couple of hours to build a
kernel on this system.

So, something has made things worse for a head/current kernel this winter, rick
Konstantin Belousov
2018-04-21 20:11:28 UTC
On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> I decided to start a new thread on current related to SCHED_ULE, since I see
> more than just performance degradation and on a recent current kernel.
> (I cc'd a couple of the people discussing performance problems in freebsd-stable
> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>
> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> current/head kernel, I would see about a 30% performance degradation (elapsed
> run time for a kernel build over NFSv4.1) when the server kernel was built with
> options SCHED_ULE
> instead of
> options SCHED_4BSD
>
> Now, with a kernel from a couple of days ago, the
> options SCHED_ULE
> kernel becomes unusable shortly after starting testing.
> I have seen two variants of this:
> - Became essentially hung. All I could do was ping the machine from the network.
> - Reported "vm_thread_new: kstack allocation failed
> and then any attempt to do anything gets "No more processes".
This is strange. It usually means that your KVA is either exhausted or
severely fragmented.

Enter ddb; it should be operational since pings are being replied to. Try to see
where the threads are stuck.
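
For example, something along these lines (the usual ddb(4) commands, assuming
the kernel has options KDB and DDB; just a suggested starting point):

    db> ps               (list all threads and their wait channels)
    db> show allchains   (see what the blocked threads are waiting on)
    db> trace <pid>      (kernel stack trace of a suspect thread)
    db> c                (continue)

If the box is only mostly hung, "sysctl debug.kdb.enter=1" from a shell, or the
console break sequence if it is enabled, should get you into the debugger.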

> with the only difference being a kernel built with
> options SCHED_4BSD
> everything works and performs the same as the Dec 2017 kernel.
>
> I can try rolling back through the revisions, but it would be nice if someone
> could suggest where to start, because it takes a couple of hours to build a
> kernel on this system.
>
> So, something has made things worse for a head/current kernel this winter, rick

There are at least two potentially relevant changes.

First is r326758 (Dec 11), which bumped KSTACK_PAGES on i386 to 4.
Second is r332489 (Apr 13), which introduced the 4/4G KVA/UVA split.

The consequences of the first one are obvious: it is much harder to find
a place to map the stack. The second change, on the other hand, provides
almost the full 4G for KVA and should have mostly compensated for the negative
effects of the first.

And, I cannot see how changing the scheduler would fix or even affect that
behaviour.
Rick Macklem
2018-04-21 23:30:55 UTC
Konstantin Belousov wrote:
>On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>> I decided to start a new thread on current related to SCHED_ULE, since I see
>> more than just performance degradation and on a recent current kernel.
>> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>>
>> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>> current/head kernel, I would see about a 30% performance degradation (elapsed
>> run time for a kernel build over NFSv4.1) when the server kernel was built with
>> options SCHED_ULE
>> instead of
>> options SCHED_4BSD
>>
>> Now, with a kernel from a couple of days ago, the
>> options SCHED_ULE
>> kernel becomes unusable shortly after starting testing.
>> I have seen two variants of this:
>> - Became essentially hung. All I could do was ping the machine from the network.
>> - Reported "vm_thread_new: kstack allocation failed
>> and then any attempt to do anything gets "No more processes".
>This is strange. It usually means that you get KVA either exhausted or
>severly fragmented.
Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
kernel is working ok now. I haven't done enough to compare performance yet.
Maybe I'll post again when I have some numbers.
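
(For anyone who wants to try the same thing: the thread count is just an nfsd
flag, so assuming the stock rc machinery it is a one-line change in
/etc/rc.conf, something like

    nfs_server_enable="YES"
    nfs_server_flags="-u -t -n 32"

where -n sets the number of nfsd server threads. That is a generic example, not
the exact flags from my test setup.)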

>Enter ddb, it should be operational since pings are replied. Try to see
>where the threads are stuck.
I didn't do this, since reducing the number of kernel threads seems to have fixed
the problem. For the pNFS server, the nfsd threads will spawn additional kernel
threads that proxy requests to the mirrored DS servers.

>> with the only difference being a kernel built with
>> options SCHED_4BSD
>> everything works and performs the same as the Dec 2017 kernel.
>>
>> I can try rolling back through the revisions, but it would be nice if someone
>> could suggest where to start, because it takes a couple of hours to build a
>> kernel on this system.
>>
>> So, something has made things worse for a head/current kernel this winter, rick
>
>There are at least two potentially relevant changes.
>
>First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
I've been running this machine with KSTACK_PAGES=4 for some time, so no change.

>Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
Could this change have resulted in the system being able to allocate fewer
kernel threads/stacks for some reason?

>Consequences of the first one are obvious, it is much harder to find
>the place to map the stack. Second change, on the other hand, provides
>almost full 4G for KVA and should have mostly compensate for the negative
>effects of the first.
>
>And, I cannot see how changing the scheduler would fix or even affect that
>behaviour.
My hunch is that the system was running near its limit for kernel threads/stacks.
Then, somehow, the timing under SCHED_ULE resulted in the nfsd trying to reach
a higher peak number of threads, and it hit the limit.
SCHED_4BSD happened to result in timing such that it stayed just below the
limit and worked.
I can think of a couple of things that might affect this:
1 - If SCHED_ULE doesn't terminate kernel threads as quickly, they wouldn't
release their resources before more new ones are spawned.
2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, the burst
could try to spawn more mirror DS worker threads at about the same time.

Anyhow, thanks for the help, rick
Konstantin Belousov
2018-04-22 12:02:41 UTC
On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
> Konstantin Belousov wrote:
> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> >> I decided to start a new thread on current related to SCHED_ULE, since I see
> >> more than just performance degradation and on a recent current kernel.
> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
> >> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
> >>
> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> >> current/head kernel, I would see about a 30% performance degradation (elapsed
> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
> >> options SCHED_ULE
> >> instead of
> >> options SCHED_4BSD
> >>
> >> Now, with a kernel from a couple of days ago, the
> >> options SCHED_ULE
> >> kernel becomes unusable shortly after starting testing.
> >> I have seen two variants of this:
> >> - Became essentially hung. All I could do was ping the machine from the network.
> >> - Reported "vm_thread_new: kstack allocation failed
> >> and then any attempt to do anything gets "No more processes".
> >This is strange. It usually means that you get KVA either exhausted or
> >severly fragmented.
> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
> kernel is working ok now. I haven't done enough to compare performance yet.
> Maybe I'll post again when I have some numbers.
>
> >Enter ddb, it should be operational since pings are replied. Try to see
> >where the threads are stuck.
> I didn't do this, since reducing the number of kernel threads seems to have fixed
> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
> threads to do proxies to the mirrored DS servers.
>
> >> with the only difference being a kernel built with
> >> options SCHED_4BSD
> >> everything works and performs the same as the Dec 2017 kernel.
> >>
> >> I can try rolling back through the revisions, but it would be nice if someone
> >> could suggest where to start, because it takes a couple of hours to build a
> >> kernel on this system.
> >>
> >> So, something has made things worse for a head/current kernel this winter, rick
> >
> >There are at least two potentially relevant changes.
> >
> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
>
> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
> Could this change have resulted in the system being able to allocate fewer
> kernel threads/stacks for some reason?
Well, it could, as anything can be buggy. But the intent of the change
was to give 4G KVA, and it did.

>
> >Consequences of the first one are obvious, it is much harder to find
> >the place to map the stack. Second change, on the other hand, provides
> >almost full 4G for KVA and should have mostly compensate for the negative
> >effects of the first.
> >
> >And, I cannot see how changing the scheduler would fix or even affect that
> >behaviour.
> My hunch is that the system was running near its limit for kernel threads/stacks.
> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
> to a higher peak number of threads and hit the limit.
> SCHED_4BSD happened to result in timing such that it stayed just below the
> limit and worked.
> I can think of a couple of things that might affect this:
> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
> they wouldn't terminate and release their resources before more new ones
> are spawned.
The scheduler has nothing to do with thread termination. It might
select running threads in a way that causes an undesired pattern to
appear, which might create some backlog of threads waiting to terminate, but
I doubt it.

> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
> could try and spawn more mirror DS worker threads at about the same time.
>
> Anyhow, thanks for the help, rick
Rick Macklem
2018-04-22 13:43:53 UTC
Konstantin Belousov wrote:
>On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>> Konstantin Belousov wrote:
>> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>> >> I decided to start a new thread on current related to SCHED_ULE, since I see
>> >> more than just performance degradation and on a recent current kernel.
>> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>> >> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>> >>
>> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>> >> current/head kernel, I would see about a 30% performance degradation (elapsed
>> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
>> >> options SCHED_ULE
>> >> instead of
>> >> options SCHED_4BSD
So, now that I have decreased the number of nfsd kernel threads to 32, it works
with both schedulers and with essentially the same performance. (i.e., the 30%
performance degradation has disappeared.)

>> >>
>> >> Now, with a kernel from a couple of days ago, the
>> >> options SCHED_ULE
>> >> kernel becomes unusable shortly after starting testing.
>> >> I have seen two variants of this:
>> >> - Became essentially hung. All I could do was ping the machine from the network.
>> >> - Reported "vm_thread_new: kstack allocation failed
>> >> and then any attempt to do anything gets "No more processes".
>> >This is strange. It usually means that you get KVA either exhausted or
>> >severly fragmented.
>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>> kernel is working ok now. I haven't done enough to compare performance yet.
>> Maybe I'll post again when I have some numbers.
>>
>> >Enter ddb, it should be operational since pings are replied. Try to see
>> >where the threads are stuck.
>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>> threads to do proxies to the mirrored DS servers.
>>
>> >> with the only difference being a kernel built with
>> >> options SCHED_4BSD
>> >> everything works and performs the same as the Dec 2017 kernel.
>> >>
>> >> I can try rolling back through the revisions, but it would be nice if someone
>> >> could suggest where to start, because it takes a couple of hours to build a
>> >> kernel on this system.
>> >>
>> >> So, something has made things worse for a head/current kernel this winter, rick
>> >
>> >There are at least two potentially relevant changes.
>> >
>> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
W.r.t. Rodney Grimes' comments about this (which didn't end up in the messages
in this thread):
I didn't see any instability when using KSTACK_PAGES=4 until this problem cropped
up and seemed to be scheduler related (but not really, it seems).
I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
Server code.
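
(For anyone reproducing this: KSTACK_PAGES is a kernel config option, i.e.

    options KSTACK_PAGES=4

in the kernel config file. I believe recent kernels also accept a
kern.kstack_pages tunable in /boot/loader.conf, but the config option is the
traditional way.)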

Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
item getting allocated on the stack, but many moderately sized ones.
(A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
that NFS needs to use. I don't think these are large enough to justify malloc/free,
but it has to use several of them.)

One thing I did try fixing was about 6 places where "struct nfsstate" ended up on
the stack. I changed the code to malloc/free them, and then when testing, to
my surprise I had a 20% performance hit and shelved the patch.
Now that I know that the server was running near its limit, I might try this one
again, to see if the performance hit doesn't occur when the machine has adequate
memory. If the performance hit goes away, I could commit this, but it wouldn't have
that much effect on the kstack usage. (It's interesting how this patch ended
up related to the issue this thread discussed.)
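
To give an idea of the sort of change I mean, here is a minimal sketch (not the
actual patch, and using the generic M_TEMP malloc type instead of whatever
NFS-specific malloc type the real code would use):

    #include <sys/param.h>
    #include <sys/malloc.h>

    struct nfsstate *stp;

    /* Was: a "struct nfsstate" local variable on the kernel stack. */
    stp = malloc(sizeof(*stp), M_TEMP, M_WAITOK | M_ZERO);

    /* ... use stp exactly as the stack copy was used ... */

    free(stp, M_TEMP);

With M_WAITOK the allocation cannot fail, so no extra error handling is needed,
but each call now goes through malloc(9) instead of just using stack space.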

>>
>> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>> Could this change have resulted in the system being able to allocate fewer
>> kernel threads/stacks for some reason?
>Well, it could, as anything can be buggy. But the intent of the change
>was to give 4G KVA, and it did.
Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
(see performance issue that went away, noted above) and any change could
have pushed it across the line, I think.

>>
>> >Consequences of the first one are obvious, it is much harder to find
>> >the place to map the stack. Second change, on the other hand, provides
>> >almost full 4G for KVA and should have mostly compensate for the negative
>> >effects of the first.
>> >
>> >And, I cannot see how changing the scheduler would fix or even affect that
>> >behaviour.
>> My hunch is that the system was running near its limit for kernel threads/stacks.
>> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
>> to a higher peak number of threads and hit the limit.
>> SCHED_4BSD happened to result in timing such that it stayed just below the
>> limit and worked.
>> I can think of a couple of things that might affect this:
>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>> they wouldn't terminate and release their resources before more new ones
>> are spawned.
>Scheduler has nothing to do with the threads termination. It might
>select running threads in a way that causes the undesired pattern to
>appear which might create some amount of backlog for termination, but
>I doubt it.
>
>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>> could try and spawn more mirror DS worker threads at about the same time.
>>
>> Anyhow, thanks for the help, rick

Have a good day, rick
Julian Elischer
2018-04-23 07:37:03 UTC
On 22/4/18 9:43 pm, Rick Macklem wrote:
> Konstantin Belousov wrote:
>> On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>>> Konstantin Belousov wrote:
>>>> On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>>>>> I decided to start a new thread on current related to SCHED_ULE, since I see
>>>>> more than just performance degradation and on a recent current kernel.
>>>>> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>>>>> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>>>>>
>>>>> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>>>>> current/head kernel, I would see about a 30% performance degradation (elapsed
>>>>> run time for a kernel build over NFSv4.1) when the server kernel was built with
>>>>> options SCHED_ULE
>>>>> instead of
>>>>> options SCHED_4BSD
> So, now that I have decreased the number of nfsd kernel threads to 32, it works
> with both schedulers and with essentially the same performance. (ie. The 30%
> performance degradation has disappeared.)
>
>>>>> Now, with a kernel from a couple of days ago, the
>>>>> options SCHED_ULE
>>>>> kernel becomes unusable shortly after starting testing.
>>>>> I have seen two variants of this:
>>>>> - Became essentially hung. All I could do was ping the machine from the network.
>>>>> - Reported "vm_thread_new: kstack allocation failed
>>>>> and then any attempt to do anything gets "No more processes".
>>>> This is strange. It usually means that you get KVA either exhausted or
>>>> severly fragmented.
>>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>>> kernel is working ok now. I haven't done enough to compare performance yet.
>>> Maybe I'll post again when I have some numbers.
>>>
>>>> Enter ddb, it should be operational since pings are replied. Try to see
>>>> where the threads are stuck.
>>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>>> threads to do proxies to the mirrored DS servers.
>>>
>>>>> with the only difference being a kernel built with
>>>>> options SCHED_4BSD
>>>>> everything works and performs the same as the Dec 2017 kernel.
>>>>>
>>>>> I can try rolling back through the revisions, but it would be nice if someone
>>>>> could suggest where to start, because it takes a couple of hours to build a
>>>>> kernel on this system.
>>>>>
>>>>> So, something has made things worse for a head/current kernel this winter, rick
>>>> There are at least two potentially relevant changes.
>>>>
>>>> First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
> W.r.t. Rodney Grimes comments about this (which didn't end up in this messages
> in the thread):
> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
> up and seemed to be scheduler related (but not really, it seems).
> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
> Server code.
>
> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
> item getting allocated on the stack, but many moderate sized ones.
> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
> that NFS needs to use. I don't think these are large enough to justify malloc/free,
> but it has to use several of them.)
>
> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
> the stack. I changes the code to malloc/free them and then when testing, to
> my surprise I had a 20% performance hit and shelved the patch.

You might try using uma(9), especially setting up a non-freeing zone,
where the system allocates what it needs and then just recycles the items.
(man uma)
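
Something along these lines, as a rough sketch of the uma(9) route rather than
a worked-out patch (the zone name is made up; UMA_ZONE_NOFREE is what makes it
a non-freeing zone):

    #include <sys/param.h>
    #include <vm/uma.h>

    static uma_zone_t nfsstate_zone;

    /* once, at subsystem init */
    nfsstate_zone = uma_zcreate("nfsstate", sizeof(struct nfsstate),
        NULL, NULL, NULL, NULL, UMA_ALIGN_PTR, UMA_ZONE_NOFREE);

    /* per request, instead of a stack copy or malloc/free */
    struct nfsstate *stp = uma_zalloc(nfsstate_zone, M_WAITOK | M_ZERO);
    /* ... use stp ... */
    uma_zfree(nfsstate_zone, stp);

Most allocations should then be satisfied from the zone's per-CPU caches, so
the cost ought to be closer to the stack version than to malloc/free.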
> Now that I know that the server was running near its limit, I might try this one
> again, to see if the performance hit doesn't occur when the machine has adequate
> memory. If the performance hit goes away, I could commit this, but it wouldn't have that much effect on the kstack usage. (It's interesting how this patch ended
> up related to the issue this thread discussed.)
>
>>>> Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>>> Could this change have resulted in the system being able to allocate fewer
>>> kernel threads/stacks for some reason?
>> Well, it could, as anything can be buggy. But the intent of the change
>> was to give 4G KVA, and it did.
> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
> (see performance issue that went away, noted above) and any change could
> have pushed it across the line, I think.
>
>>>> Consequences of the first one are obvious, it is much harder to find
>>>> the place to map the stack. Second change, on the other hand, provides
>>>> almost full 4G for KVA and should have mostly compensate for the negative
>>>> effects of the first.
>>>>
>>>> And, I cannot see how changing the scheduler would fix or even affect that
>>>> behaviour.
>>> My hunch is that the system was running near its limit for kernel threads/stacks.
>>> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
>>> to a higher peak number of threads and hit the limit.
>>> SCHED_4BSD happened to result in timing such that it stayed just below the
>>> limit and worked.
>>> I can think of a couple of things that might affect this:
>>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>>> they wouldn't terminate and release their resources before more new ones
>>> are spawned.
>> Scheduler has nothing to do with the threads termination. It might
>> select running threads in a way that causes the undesired pattern to
>> appear which might create some amount of backlog for termination, but
>> I doubt it.
>>
>>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>>> could try and spawn more mirror DS worker threads at about the same time.
>>>
>>> Anyhow, thanks for the help, rick
> Have a good day, rick
Rodney W. Grimes
2018-04-22 01:37:09 UTC
> On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> > I decided to start a new thread on current related to SCHED_ULE, since I see
> > more than just performance degradation and on a recent current kernel.
> > (I cc'd a couple of the people discussing performance problems in freebsd-stable
> > recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
> >
> > When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> > current/head kernel, I would see about a 30% performance degradation (elapsed
> > run time for a kernel build over NFSv4.1) when the server kernel was built with
> > options SCHED_ULE
> > instead of
> > options SCHED_4BSD
> >
> > Now, with a kernel from a couple of days ago, the
> > options SCHED_ULE
> > kernel becomes unusable shortly after starting testing.
> > I have seen two variants of this:
> > - Became essentially hung. All I could do was ping the machine from the network.
> > - Reported "vm_thread_new: kstack allocation failed
> > and then any attempt to do anything gets "No more processes".
> This is strange. It usually means that you get KVA either exhausted or
> severly fragmented.
>
> Enter ddb, it should be operational since pings are replied. Try to see
> where the threads are stuck.
>
> > with the only difference being a kernel built with
> > options SCHED_4BSD
> > everything works and performs the same as the Dec 2017 kernel.
> >
> > I can try rolling back through the revisions, but it would be nice if someone
> > could suggest where to start, because it takes a couple of hours to build a
> > kernel on this system.
> >
> > So, something has made things worse for a head/current kernel this winter, rick
>
> There are at least two potentially relevant changes.
>
> First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.

My experience with this bump and trying to run a GENERIC -current (meaning
it has INVARIANTS and WITNESS) is that you have a very unstable situation
until you get above 1G of memory. And NFS is a major kstack consumer.


> Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>
> Consequences of the first one are obvious, it is much harder to find
> the place to map the stack. Second change, on the other hand, provides
> almost full 4G for KVA and should have mostly compensate for the negative
> effects of the first.
>
> And, I cannot see how changing the scheduler would fix or even affect that
> behaviour.

--
Rod Grimes ***@freebsd.org
Rodney W. Grimes
2018-04-22 14:36:09 UTC
> Konstantin Belousov wrote:
> >On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
> >> Konstantin Belousov wrote:
> >> >On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
> >> >> I decided to start a new thread on current related to SCHED_ULE, since I see
> >> >> more than just performance degradation and on a recent current kernel.
> >> >> (I cc'd a couple of the people discussing performance problems in freebsd-stable
> >> >> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
> >> >>
> >> >> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
> >> >> current/head kernel, I would see about a 30% performance degradation (elapsed
> >> >> run time for a kernel build over NFSv4.1) when the server kernel was built with
> >> >> options SCHED_ULE
> >> >> instead of
> >> >> options SCHED_4BSD
> So, now that I have decreased the number of nfsd kernel threads to 32, it works
> with both schedulers and with essentially the same performance. (ie. The 30%
> performance degradation has disappeared.)
>
> >> >>
> >> >> Now, with a kernel from a couple of days ago, the
> >> >> options SCHED_ULE
> >> >> kernel becomes unusable shortly after starting testing.
> >> >> I have seen two variants of this:
> >> >> - Became essentially hung. All I could do was ping the machine from the network.
> >> >> - Reported "vm_thread_new: kstack allocation failed
> >> >> and then any attempt to do anything gets "No more processes".
> >> >This is strange. It usually means that you get KVA either exhausted or
> >> >severly fragmented.
> >> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
> >> kernel is working ok now. I haven't done enough to compare performance yet.
> >> Maybe I'll post again when I have some numbers.
> >>
> >> >Enter ddb, it should be operational since pings are replied. Try to see
> >> >where the threads are stuck.
> >> I didn't do this, since reducing the number of kernel threads seems to have fixed
> >> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
> >> threads to do proxies to the mirrored DS servers.
> >>
> >> >> with the only difference being a kernel built with
> >> >> options SCHED_4BSD
> >> >> everything works and performs the same as the Dec 2017 kernel.
> >> >>
> >> >> I can try rolling back through the revisions, but it would be nice if someone
> >> >> could suggest where to start, because it takes a couple of hours to build a
> >> >> kernel on this system.
> >> >>
> >> >> So, something has made things worse for a head/current kernel this winter, rick
> >> >
> >> >There are at least two potentially relevant changes.
> >> >
> >> >First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
> >> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
> W.r.t. Rodney Grimes comments about this (which didn't end up in this messages
> in the thread):
> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
> up and seemed to be scheduler related (but not really, it seems).
> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
> Server code.
>
> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
> item getting allocated on the stack, but many moderate sized ones.
> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
> that NFS needs to use. I don't think these are large enough to justify malloc/free,
> but it has to use several of them.)
>
> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
> the stack. I changes the code to malloc/free them and then when testing, to
> my surprise I had a 20% performance hit and shelved the patch.
> Now that I know that the server was running near its limit, I might try this one
> again, to see if the performance hit doesn't occur when the machine has adequate
> memory. If the performance hit goes away, I could commit this, but it wouldn't
> have that much effect on the kstack usage. (It's interesting how this patch ended
> up related to the issue this thread discussed.)

Anything we can do to help relieve KSTACK usage, especially on i386,
is helpful. There is a thread from quite some time back where someone
came up with a compile-time static "this function uses X bytes of
local stack" check, and a bit of clean up was done. We should pursue
this issue further.

My experience with the i386/KSTACK issues was attempting to do installs
from snapshot .iso's: I usually had to change to a custom kernel without
INVARIANTS and WITNESS, or reduce KSTACK_PAGES to 2 and suffer the small stack
problem (i.e., don't use NFS during install). Neither was very pleasant.

I have found it impractical to run the 4-page KSTACK in production
VMs using i386 due to memory requirements. I run many very lean
i386 VMs with 64MB of memory. I suspect our user base also has
many people doing this, and it would be to our advantage to try
to reduce our kernel stack needs.


> >> >Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
> >> Could this change have resulted in the system being able to allocate fewer
> >> kernel threads/stacks for some reason?
> >Well, it could, as anything can be buggy. But the intent of the change
> >was to give 4G KVA, and it did.
> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
> (see performance issue that went away, noted above) and any change could
> have pushed it across the line, I think.
>
> >>
> >> >Consequences of the first one are obvious, it is much harder to find
> >> >the place to map the stack. Second change, on the other hand, provides
> >> >almost full 4G for KVA and should have mostly compensate for the negative
> >> >effects of the first.
> >> >
> >> >And, I cannot see how changing the scheduler would fix or even affect that
> >> >behaviour.
> >> My hunch is that the system was running near its limit for kernel threads/stacks.
> >> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
> >> to a higher peak number of threads and hit the limit.
> >> SCHED_4BSD happened to result in timing such that it stayed just below the
> >> limit and worked.
> >> I can think of a couple of things that might affect this:
> >> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
> >> they wouldn't terminate and release their resources before more new ones
> >> are spawned.
> >Scheduler has nothing to do with the threads termination. It might
> >select running threads in a way that causes the undesired pattern to
> >appear which might create some amount of backlog for termination, but
> >I doubt it.
> >
> >> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
> >> could try and spawn more mirror DS worker threads at about the same time.
> >>
> >> Anyhow, thanks for the help, rick
>
> Have a good day, rick

--
Rod Grimes ***@freebsd.org
Julian Elischer
2018-04-23 07:46:51 UTC
On 22/4/18 10:36 pm, Rodney W. Grimes wrote:
>> Konstantin Belousov wrote:
>>> On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>>>> Konstantin Belousov wrote:
>>>>> On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>>>>>> I decided to start a new thread on current related to SCHED_ULE, since I see
>>>>>> more than just performance degradation and on a recent current kernel.
>>>>>> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>>>>>> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>>>>>>
>>>>>> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>>>>>> current/head kernel, I would see about a 30% performance degradation (elapsed
>>>>>> run time for a kernel build over NFSv4.1) when the server kernel was built with
>>>>>> options SCHED_ULE
>>>>>> instead of
>>>>>> options SCHED_4BSD
>> So, now that I have decreased the number of nfsd kernel threads to 32, it works
>> with both schedulers and with essentially the same performance. (ie. The 30%
>> performance degradation has disappeared.)
>>
>>>>>> Now, with a kernel from a couple of days ago, the
>>>>>> options SCHED_ULE
>>>>>> kernel becomes unusable shortly after starting testing.
>>>>>> I have seen two variants of this:
>>>>>> - Became essentially hung. All I could do was ping the machine from the network.
>>>>>> - Reported "vm_thread_new: kstack allocation failed
>>>>>> and then any attempt to do anything gets "No more processes".
>>>>> This is strange. It usually means that you get KVA either exhausted or
>>>>> severly fragmented.
>>>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>>>> kernel is working ok now. I haven't done enough to compare performance yet.
>>>> Maybe I'll post again when I have some numbers.
>>>>
>>>>> Enter ddb, it should be operational since pings are replied. Try to see
>>>>> where the threads are stuck.
>>>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>>>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>>>> threads to do proxies to the mirrored DS servers.
>>>>
>>>>>> with the only difference being a kernel built with
>>>>>> options SCHED_4BSD
>>>>>> everything works and performs the same as the Dec 2017 kernel.
>>>>>>
>>>>>> I can try rolling back through the revisions, but it would be nice if someone
>>>>>> could suggest where to start, because it takes a couple of hours to build a
>>>>>> kernel on this system.
>>>>>>
>>>>>> So, something has made things worse for a head/current kernel this winter, rick
>>>>> There are at least two potentially relevant changes.
>>>>>
>>>>> First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>>>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
>> W.r.t. Rodney Grimes comments about this (which didn't end up in this messages
>> in the thread):
>> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
>> up and seemed to be scheduler related (but not really, it seems).
>> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
>> Server code.
>>
>> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
>> item getting allocated on the stack, but many moderate sized ones.
>> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
>> that NFS needs to use. I don't think these are large enough to justify malloc/free,
>> but it has to use several of them.)
>>
>> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
>> the stack. I changes the code to malloc/free them and then when testing, to
>> my surprise I had a 20% performance hit and shelved the patch.
>> Now that I know that the server was running near its limit, I might try this one
>> again, to see if the performance hit doesn't occur when the machine has adequate
>> memory. If the performance hit goes away, I could commit this, but it wouldn't
>> have that much effect on the kstack usage. (It's interesting how this patch ended
>> up related to the issue this thread discussed.)
> Anything we can do to help relieve KSTACK usage, especially on i386
> is helpfull. These is a thread back quite some time where someone
> came up with a compile time static "this functions uses X bytes of
> local stack" and a bit of clean up was done. We should persue
> this issue further.

that was me.

use
-Wframe-larger-than=<arg>
<https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-wframe-larger-than>
and set it to something like 512 bytes

>
> My experiece with the i386/KSTACK issues was attempting to do installs
> from snapshot .iso's, I usually had to change to a custom kernel without
> INVARIANTS and WITNESS, or reduce KSTACK to 2 and suffer the small stack
> problem (ie, dont use NFS during install). Neither was very pleasant.
>
> I have found it in practical to run the 4 page KSTACK in production
> VM's using i386 due to memory requirements. I run many very lean
> i386 VM's with 64MB of memory. I suspect our user base also has
> many people doing this, and it would be to our advantage to try
> and reduce our kernel stack needs.
>
>
>>>>> Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>>>> Could this change have resulted in the system being able to allocate fewer
>>>> kernel threads/stacks for some reason?
>>> Well, it could, as anything can be buggy. But the intent of the change
>>> was to give 4G KVA, and it did.
>> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
>> (see performance issue that went away, noted above) and any change could
>> have pushed it across the line, I think.
>>
>>>>> Consequences of the first one are obvious, it is much harder to find
>>>>> the place to map the stack. Second change, on the other hand, provides
>>>>> almost full 4G for KVA and should have mostly compensate for the negative
>>>>> effects of the first.
>>>>>
>>>>> And, I cannot see how changing the scheduler would fix or even affect that
>>>>> behaviour.
>>>> My hunch is that the system was running near its limit for kernel threads/stacks.
>>>> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
>>>> to a higher peak number of threads and hit the limit.
>>>> SCHED_4BSD happened to result in timing such that it stayed just below the
>>>> limit and worked.
>>>> I can think of a couple of things that might affect this:
>>>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>>>> they wouldn't terminate and release their resources before more new ones
>>>> are spawned.
>>> Scheduler has nothing to do with the threads termination. It might
>>> select running threads in a way that causes the undesired pattern to
>>> appear which might create some amount of backlog for termination, but
>>> I doubt it.
>>>
>>>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>>>> could try and spawn more mirror DS worker threads at about the same time.
>>>>
>>>> Anyhow, thanks for the help, rick
>> Have a good day, rick
Julian Elischer
2018-04-23 07:47:37 UTC
On 22/4/18 10:36 pm, Rodney W. Grimes wrote:
>> Konstantin Belousov wrote:
>>> On Sat, Apr 21, 2018 at 11:30:55PM +0000, Rick Macklem wrote:
>>>> Konstantin Belousov wrote:
>>>>> On Sat, Apr 21, 2018 at 07:21:58PM +0000, Rick Macklem wrote:
>>>>>> I decided to start a new thread on current related to SCHED_ULE, since I see
>>>>>> more than just performance degradation and on a recent current kernel.
>>>>>> (I cc'd a couple of the people discussing performance problems in freebsd-stable
>>>>>> recently under a subject line of "Re: kern.sched.quantum: Creepy, sadistic scheduler".
>>>>>>
>>>>>> When testing a pNFS server on a single core i386 with 256Mbytes using a Dec. 2017
>>>>>> current/head kernel, I would see about a 30% performance degradation (elapsed
>>>>>> run time for a kernel build over NFSv4.1) when the server kernel was built with
>>>>>> options SCHED_ULE
>>>>>> instead of
>>>>>> options SCHED_4BSD
>> So, now that I have decreased the number of nfsd kernel threads to 32, it works
>> with both schedulers and with essentially the same performance. (ie. The 30%
>> performance degradation has disappeared.)
>>
>>>>>> Now, with a kernel from a couple of days ago, the
>>>>>> options SCHED_ULE
>>>>>> kernel becomes unusable shortly after starting testing.
>>>>>> I have seen two variants of this:
>>>>>> - Became essentially hung. All I could do was ping the machine from the network.
>>>>>> - Reported "vm_thread_new: kstack allocation failed
>>>>>> and then any attempt to do anything gets "No more processes".
>>>>> This is strange. It usually means that you get KVA either exhausted or
>>>>> severly fragmented.
>>>> Yes. I reduced the number of nfsd threads from 256->32 and the SCHED_ULE
>>>> kernel is working ok now. I haven't done enough to compare performance yet.
>>>> Maybe I'll post again when I have some numbers.
>>>>
>>>>> Enter ddb, it should be operational since pings are replied. Try to see
>>>>> where the threads are stuck.
>>>> I didn't do this, since reducing the number of kernel threads seems to have fixed
>>>> the problem. For the pNFS server, the nfsd threads will spawn additional kernel
>>>> threads to do proxies to the mirrored DS servers.
>>>>
>>>>>> with the only difference being a kernel built with
>>>>>> options SCHED_4BSD
>>>>>> everything works and performs the same as the Dec 2017 kernel.
>>>>>>
>>>>>> I can try rolling back through the revisions, but it would be nice if someone
>>>>>> could suggest where to start, because it takes a couple of hours to build a
>>>>>> kernel on this system.
>>>>>>
>>>>>> So, something has made things worse for a head/current kernel this winter, rick
>>>>> There are at least two potentially relevant changes.
>>>>>
>>>>> First is r326758 Dec 11 which bumped KSTACK_PAGES on i386 to 4.
>>>> I've been running this machine with KSTACK_PAGES=4 for some time, so no change.
>> W.r.t. Rodney Grimes comments about this (which didn't end up in this messages
>> in the thread):
>> I didn't see any instability when using KSTACK_PAGES=4 for this until this cropped
>> up and seemed to be scheduler related (but not really, it seems).
>> I bumped it to KSTACK_PAGES=4 because I needed that for the pNFS Metadata
>> Server code.
>>
>> Yes, NFS does use quite a bit of kernel stack. Unfortunately, it isn't one big
>> item getting allocated on the stack, but many moderate sized ones.
>> (A part of it is multiple instances of "struct vattr", some buried in "struct nfsvattr",
>> that NFS needs to use. I don't think these are large enough to justify malloc/free,
>> but it has to use several of them.)
>>
>> One case I did try fixing was about 6 cases where "struct nfsstate" ended up on
>> the stack. I changes the code to malloc/free them and then when testing, to
>> my surprise I had a 20% performance hit and shelved the patch.
>> Now that I know that the server was running near its limit, I might try this one
>> again, to see if the performance hit doesn't occur when the machine has adequate
>> memory. If the performance hit goes away, I could commit this, but it wouldn't
>> have that much effect on the kstack usage. (It's interesting how this patch ended
>> up related to the issue this thread discussed.)
> Anything we can do to help relieve KSTACK usage, especially on i386
> is helpfull. These is a thread back quite some time where someone
> came up with a compile time static "this functions uses X bytes of
> local stack" and a bit of clean up was done. We should persue
> this issue further.

that was me.

use
-Wframe-larger-than=<arg>
<https://clang.llvm.org/docs/ClangCommandLineReference.html#cmdoption-clang-wframe-larger-than>
and set it to something like 512 bytes (obviously you have to make
warnings non-fatal as well).
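
A tiny illustration (file and function names made up; the exact warning wording
varies between compiler versions):

    /* frame.c */
    #include <string.h>

    void
    big_frame(char *dst)
    {
            char buf[1024];     /* big local buffer => big stack frame */

            memset(buf, 'x', sizeof(buf));
            memcpy(dst, buf, sizeof(buf));
    }

    $ cc -c -Wframe-larger-than=512 frame.c

clang (and gcc) will then warn that big_frame's stack frame exceeds the 512 byte
limit. For a kernel build the flag would go into COPTFLAGS, and -Werror would
have to be relaxed so the extra warnings don't kill the build.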



>
> My experiece with the i386/KSTACK issues was attempting to do installs
> from snapshot .iso's, I usually had to change to a custom kernel without
> INVARIANTS and WITNESS, or reduce KSTACK to 2 and suffer the small stack
> problem (ie, dont use NFS during install). Neither was very pleasant.
>
> I have found it in practical to run the 4 page KSTACK in production
> VM's using i386 due to memory requirements. I run many very lean
> i386 VM's with 64MB of memory. I suspect our user base also has
> many people doing this, and it would be to our advantage to try
> and reduce our kernel stack needs.
>
>
>>>>> Second is r332489 Apr 13, which introduced 4/4G KVA/UVA split.
>>>> Could this change have resulted in the system being able to allocate fewer
>>>> kernel threads/stacks for some reason?
>>> Well, it could, as anything can be buggy. But the intent of the change
>>> was to give 4G KVA, and it did.
>> Righto. No concern here. I suspect the Dec. 2017 kernel was close to the limit
>> (see performance issue that went away, noted above) and any change could
>> have pushed it across the line, I think.
>>
>>>>> Consequences of the first one are obvious, it is much harder to find
>>>>> the place to map the stack. Second change, on the other hand, provides
>>>>> almost full 4G for KVA and should have mostly compensate for the negative
>>>>> effects of the first.
>>>>>
>>>>> And, I cannot see how changing the scheduler would fix or even affect that
>>>>> behaviour.
>>>> My hunch is that the system was running near its limit for kernel threads/stacks.
>>>> Then, somehow, the timing SCHED_ULE caused resulted in the nfsd trying to get
>>>> to a higher peak number of threads and hit the limit.
>>>> SCHED_4BSD happened to result in timing such that it stayed just below the
>>>> limit and worked.
>>>> I can think of a couple of things that might affect this:
>>>> 1 - If SCHED_ULE doesn't do the termination of kernel threads as quickly, then
>>>> they wouldn't terminate and release their resources before more new ones
>>>> are spawned.
>>> Scheduler has nothing to do with the threads termination. It might
>>> select running threads in a way that causes the undesired pattern to
>>> appear which might create some amount of backlog for termination, but
>>> I doubt it.
>>>
>>>> 2 - If SCHED_ULE handles the nfsd threads in a more "bursty" way, then the burst
>>>> could try and spawn more mirror DS worker threads at about the same time.
>>>>
>>>> Anyhow, thanks for the help, rick
>> Have a good day, rick