Discussion:
Extremely low disk throughput under high compute load
Stefan Esser
2018-04-01 15:18:04 UTC
My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
(starting from a clean state), which caused a load average of 12 for more than
1 hour, when I decided to move a directory structure holding some 10 GB to its
own ZFS file system. File sizes varied, but were mostly in the range of 500 KB.

I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus there
was nearly no disk activity caused by the buildworld.

The copying proceeded at a rate of at most 10 MB/s, but most of the time less
than 100 KB/s was transferred. The "cp" process had a PRI of 20 and thus a much
better priority than the compute-bound compiler processes, but it got just 0.2%
to 0.5% of one CPU core. Apparently, the copy process was scheduled at such a
low rate that it only managed to issue a few controller writes per second.

The system is healthy and does not show any problems or anomalies under
normal use (e.g., file copies are fast when there is no high compute load).

This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.

Is this a regression in -CURRENT?

Regards, STefan
Warner Losh
2018-04-01 16:33:54 UTC
Post by Stefan Esser
My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
(starting from a clean state) which caused a load average of 12 for more than
1 hour, when I decided to move a directory structure holding some 10 GB to its
own ZFS file system. File sizes varied, but were mostly in the range of 500 KB.
I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus there
was nearly no disk activity caused by the buildworld.
The copying proceeded at a rate of at most 10 MB/s, but most of the time less
than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
much better priority than the compute bound compiler processes, but it got
just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
at such a low rate, that it only managed to issue a few controller writes per
second.
The system is healthy and does not show any problems or anomalies under
normal use (e.g., file copies are fast, without the high compute load).
This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
Is this a regression in -CURRENT?
Does 'sync' push a lot of I/O to the disk?

Is the effective throughput of cp tiny or large? It's tiny, if I read
right, and the I/O is slow (as opposed to it all buffering in memory and
being slow to drain down), right?

Warner
Stefan Esser
2018-04-01 22:18:56 UTC
Post by Stefan Esser
My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
(starting from a clean state) which caused a load average of 12 for more than
1 hour, when I decided to move a directory structure holding some 10 GB to its
own ZFS file system. File sizes varied, but were mostly in the range of 500 KB.
I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus there
was nearly no disk activity caused by the buildworld.
The copying proceeded at a rate of at most 10 MB/s, but most of the time less
than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
much better priority than the compute bound compiler processes, but it got
just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
at such a low rate, that it only managed to issue a few controller writes per
second.
The system is healthy and does not show any problems or anomalies under
normal use (e.g., file copies are fast, without the high compute load).
This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
Is this a regression in -CURRENT?
Does 'sync' push a lot of I/O to the disk?
Each sync takes 0.7 to 1.5 seconds to complete, but since reading is so
slow, not much is written.

Normal gstat output for the 3 drives the RAIDZ1 consists of:

dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 2 2 84 39.1 0 0 0.0 7.8 ada0
0 4 4 92 66.6 0 0 0.0 26.6 ada1
0 6 6 259 66.9 0 0 0.0 36.2 ada3
dT: 1.058s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 1 1 60 70.6 0 0 0.0 6.7 ada0
0 3 3 68 71.3 0 0 0.0 20.2 ada1
0 6 6 242 65.5 0 0 0.0 28.8 ada3
dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 5 5 192 44.8 0 0 0.0 22.4 ada0
0 6 6 160 61.9 0 0 0.0 26.5 ada1
0 6 6 172 43.7 0 0 0.0 26.2 ada3

This includes the copy process and the reads caused by "make -j 8 world"
(but I assume that all the source files are already cached in ARC).

During sync:

dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
1 101 9 132 14.6 90 1760 5.6 59.7 ada0
2 110 16 267 15.0 92 1756 6.0 50.7 ada1
2 82 13 291 17.8 67 1653 7.4 34.3 ada3

ZFS is configured to flush dirty buffers after 5 seconds, so there are
not many dirty buffers in RAM at any time, anyway.
Post by Warner Losh
Is the effective throughput of cp tiny or large? It's tiny, if I read right,
and the I/O is slow (as opposed to it all buffering in memory and being slow
to drain down), right?
Yes, reading is very slow, with less than 10 read operations scheduled
per second.

Top output taken at the same time as above gstat samples:

last pid: 24306;  load averages: 12.07, 11.51, 8.13   up 2+05:41:57  00:10:22
132 processes: 10 running, 122 sleeping
CPU: 98.2% user, 0.0% nice, 1.7% system, 0.1% interrupt, 0.0% idle
Mem: 1069M Active, 1411M Inact, 269M Laundry, 20G Wired, 1076M Free
ARC: 16G Total, 1234M MFU, 14G MRU, 83M Anon, 201M Header, 786M Other
14G Compressed, 30G Uncompressed, 2.09:1 Ratio
Swap: 24G Total, 533M Used, 23G Free, 2% Inuse

PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
24284 root 1 92 0 228M 199M CPU6 6 0:11 101.34% c++
24287 root 1 91 0 269M 241M CPU3 3 0:10 101.32% c++
24266 root 1 97 0 303M 276M CPU0 0 0:17 101.13% c++
24297 root 1 85 0 213M 184M CPU1 1 0:06 98.40% c++
24281 root 1 93 0 245M 217M CPU7 7 0:12 96.76% c++
24300 root 1 76 0 114M 89268K RUN 2 0:02 83.22% c++
24303 root 1 75 0 105M 79908K CPU4 4 0:01 59.94% c++
24302 root 1 52 0 74940K 47264K wait 4 0:00 0.35% c++
24299 root 1 52 0 74960K 47268K wait 2 0:00 0.33% c++
20954 root 1 20 0 15528K 4900K zio->i 3 0:02 0.11% cp

ARC is limited to 18 GB to leave 6 GB RAM for use by kernel and user programs.

vfs.zfs.arc_meta_limit: 4500000000
vfs.zfs.arc_free_target: 42339
vfs.zfs.compressed_arc_enabled: 1
vfs.zfs.arc_grow_retry: 60
vfs.zfs.arc_shrink_shift: 7
vfs.zfs.arc_average_blocksize: 8192
vfs.zfs.arc_no_grow_shift: 5
vfs.zfs.arc_min: 2250000000
vfs.zfs.arc_max: 18000000000

Regards, STefan
Stefan Esser
2018-04-04 13:19:31 UTC
Post by Stefan Esser
Post by Stefan Esser
My i7-2600K based system with 24 GB RAM was in the midst of a buildworld -j8
(starting from a clean state) which caused a load average of 12 for more than
1 hour, when I decided to move a directory structure holding some 10 GB to its
own ZFS file system. File sizes varied, but were mostly in the range of 500 KB.
I had just thrown away /usr/obj, but /usr/src was cached in ARC and thus there
was nearly no disk activity caused by the buildworld.
The copying proceeded at a rate of at most 10 MB/s, but most of the time less
than 100 KB/s were transferred. The "cp" process had a PRIO of 20 and thus a
much better priority than the compute bound compiler processes, but it got
just 0.2% to 0.5% of 1 CPU core. Apparently, the copy process was scheduled
at such a low rate, that it only managed to issue a few controller writes per
second.
The system is healthy and does not show any problems or anomalies under
normal use (e.g., file copies are fast, without the high compute load).
This was with SCHED_ULE on a -CURRENT without WITNESS or malloc debugging.
Is this a regression in -CURRENT?
Does 'sync' push a lot of I/O to the disk?
Each sync takes 0.7 to 1.5 seconds to complete, but since reading is so
slow, not much is written.
dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 2 2 84 39.1 0 0 0.0 7.8 ada0
0 4 4 92 66.6 0 0 0.0 26.6 ada1
0 6 6 259 66.9 0 0 0.0 36.2 ada3
dT: 1.058s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 1 1 60 70.6 0 0 0.0 6.7 ada0
0 3 3 68 71.3 0 0 0.0 20.2 ada1
0 6 6 242 65.5 0 0 0.0 28.8 ada3
dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 5 5 192 44.8 0 0 0.0 22.4 ada0
0 6 6 160 61.9 0 0 0.0 26.5 ada1
0 6 6 172 43.7 0 0 0.0 26.2 ada3
This includes the copy process and the reads caused by "make -j 8 world"
(but I assume that all the source files are already cached in ARC).
I have identified the cause of the extremely low I/O performance (2 to 6 read
operations scheduled per second).

The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
up the CPU (or exits).

Any non-zero value of preempt_thresh lets the system perform I/O in parallel
with the CPU-bound processes again.
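
For reference, the decision is made by sched_shouldpreempt() in
sys/kern/sched_ule.c. Quoting from memory and slightly simplified (lower
numerical priority is better), the logic is roughly:

static inline int
sched_shouldpreempt(int pri, int cpri, int remote)
{
        /* The waking thread is not better than the running one: do nothing. */
        if (pri >= cpri)
                return (0);
        /* The idle thread is always preempted. */
        if (cpri >= PRI_MIN_IDLE)
                return (1);
        /* With preempt_thresh == 0, nothing else is ever preempted. */
        if (preempt_thresh == 0)
                return (0);
        /* Preempt if the waking thread is at or better than the threshold. */
        if (pri <= preempt_thresh)
                return (1);
        /* Otherwise only push a remote CPU when an interactive thread would
         * displace a non-interactive one. */
        if (remote && pri <= PRI_MAX_INTERACT && cpri > PRI_MAX_INTERACT)
                return (1);
        return (0);
}

So with the default preempt_thresh=0 the waking "cp" can only ever preempt
the idle thread, and here all cores were busy with compiler jobs.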

I'm not sure about the exact bias relative to the PRI values displayed by top,
but to me it looks like a process with a PRI above 72 (in top) should be
eligible to be preempted.

What value of preempt_thresh should I use to get that behavior?


And, more importantly: is preempt_thresh=0 a reasonable default???

This prevents I/O-bound processes from making reasonable progress if all CPU
cores/threads are busy. In my case, performance dropped from > 10 MB/s to just
a few hundred KB per second, i.e. by a factor of about 30. (The %busy values in
my previous mail are misleading: at 10 MB/s the disk was about 70% busy ...)


Should preempt_thresh default to some non-zero value (possibly a high one, so
that long-running compute-bound processes can be preempted)?

Regards, STefan
Andriy Gapon
2018-04-04 16:45:10 UTC
Post by Stefan Esser
I have identified the cause of the extremely low I/O performance (2 to 6 read
operations scheduled per second).
The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
up the CPU (or exits).
Any non-zero value of preempt_thresh lets the system perform I/O in parallel
with the CPU bound processes, again.
Let me guess... you have a custom kernel configuration and, unlike GENERIC
(assuming x86), it does not have 'options PREEMPTION'?
--
Andriy Gapon
Stefan Esser
2018-04-05 12:31:24 UTC
Post by Andriy Gapon
Post by Stefan Esser
I have identified the cause of the extremely low I/O performance (2 to 6 read
operations scheduled per second).
The default value of kern.sched.preempt_thresh=0 does not give any CPU to the
I/O bound process unless a (long) time slice expires (kern.sched.quantum=94488
on my system with HZ=1000) or one of the CPU bound processes voluntarily gives
up the CPU (or exits).
Any non-zero value of preempt_thresh lets the system perform I/O in parallel
with the CPU bound processes, again.
Let me guess... you have a custom kernel configuration and, unlike GENERIC
(assuming x86), it does not have 'options PREEMPTION'?
Yes, thank you for pointing that out!!!

I used to have PREEMPTION and FULL_PREEMPTION in my kernel configuration,
and apparently deleted both options when only FULL_PREEMPTION was supposed
to go ...


After looking at sched_ule.c and top/machine.c it appears that the value of
preempt_thresh corresponds to the PRI value as shown by top (or ps -l) plus
PZERO, which is calculated as PRI_MIN_KERN (80) + 20 = 100. (E.g. the cp shown
at PRI 20 by top runs at kernel priority 120.)

What I do not understand, though, is that the preemption decision seems to be
based only on the calculated priority of the new thread, but not at all on the
priority of the other running threads (except for the idle thread).

On my system, a "real" batch job (i.e. one that does not voluntarily give
up the CPU due to I/O) seems to have a PRI value of 80 to 100 (growing
over time), while an interactive process has a PRI of 20, a maximally
"niced" interactive process has 52.

So, I'd expect a reasonable default value of preempt_thresh to be slightly
above 120 (e.g. 124), to prevent I/O-heavy threads from stealing the CPU from
each other too often, and to prevent "niced" processes from doing the same ...

The two values configured into the kernel (80 for PREEMPTION and 255 for
FULL_PREEMPTION) seem to be extremes, and something in between (e.g. 124) is
not offered. It can only be configured via sysctl, and I have found no
documentation besides the kernel sources that explains the correspondence
between the threshold value and the PRI values shown by top.
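
For reference, the default is chosen at compile time in sched_ule.c, roughly
like this (paraphrased from memory; any other value can be set at run time
via the kern.sched.preempt_thresh sysctl):

#ifdef FULL_PREEMPTION
static int preempt_thresh = PRI_MAX_IDLE;   /* 255: everything may preempt  */
#elif defined(PREEMPTION)
static int preempt_thresh = PRI_MIN_KERN;   /*  80: kernel priorities only  */
#else
static int preempt_thresh = 0;              /*   0: no preemption at all    */
#endif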


Is PRI_MIN_KERN=80 really a good default value for the preemption threshold?

Regards, STefan
Andriy Gapon
2018-05-03 09:41:01 UTC
Post by Stefan Esser
After looking at sched_ule.c and top/machine.c it appears, that the value
of preempt_thresh corresponds to the PRI value as shown by top (or ps -l)
plus PZERO which is calculated as (PRI_MIN_KERN=80) + 20.
The kernel defines priorities from zero to 255.
top shows the same priorities with 100 subtracted.
At least that's how I look at it.
I think we said the same thing but in different words.
Post by Stefan Esser
What I do not understand, though, is that the decision about a preemption
is only based on the calculated new priority of the thread, but not at all
on the priority of other running threads (except the idle thread).
I don't understand this statement. A new thread to run is picked based on the
priorities of all runnable threads. The preemption decision does take into
account the priority of the currently running thread as well as that of the
new thread.
Post by Stefan Esser
On my system, a "real" batch job (i.e. one that does not voluntarily give
up the CPU due to I/O) seems to have a PRI value of 80 to 100 (growing
over time), while an interactive process has a PRI of 20, a maximally
"niced" interactive process has 52.
So, I'd expect a reasonable default value of preempt_thresh to be slightly
above 120 (e.g. 124) to prevent I/O heavy threads from stealing each other
the CPU too often, and to prevent "niced" processes from doing the same ...
The two values configured into the kernel (80 for PREEMPTION and 255 for
FULL_PREEMPTION) seem to be extremes, but something in between (e.g. 124)
is not offered (can only be configured via sysctl without any information
for the correspondence between the threshold value and the PRI value in
any document I've found, besides the kernel sources ...).
Is PRI_MIN_KERN=80 really a good default value for the preemption threshold?
Yeah, a good question...
I am not really sure about this. In my opinion it would be better to set
preempt_thresh to at least PRI_MAX_KERN, so that all threads running in the
kernel are allowed to preempt userland threads. But that would also allow
kernel threads (with priorities between PRI_MIN_KERN and PRI_MAX_KERN) to
preempt other kernel threads, and I am not sure that is always okay. The same
argument applies to higher values of preempt_thresh as well.

Perhaps a single preempt_thresh is not expressive enough?
Just a thought... maybe we need two thresholds: one that says which threads
(with better priority) are potentially allowed to preempt other threads, and
one that says which threads (with worse priority) may be preempted.
For example:
- may_preempt_prio=PRI_MAX_INTERACT
- may_be_preempted_prio=PRI_MIN_BATCH
This says that realtime, kernel and interactive threads are allowed to
preempt other threads if the other conditions are met, and that only batch
and idle threads can actually be preempted.
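
Just to make that concrete, a minimal sketch (purely hypothetical, not a
patch):

static int may_preempt_prio = PRI_MAX_INTERACT;    /* who may preempt      */
static int may_be_preempted_prio = PRI_MIN_BATCH;  /* who may be preempted */

static inline int
sched_shouldpreempt(int pri, int cpri, int remote)
{
        if (pri >= cpri)            /* not strictly better: nothing to do  */
                return (0);
        if (cpri >= PRI_MIN_IDLE)   /* the idle thread is always preempted */
                return (1);
        /* Preempt only if the new thread is "good enough" and the currently
         * running thread is "bad enough". */
        return (pri <= may_preempt_prio && cpri >= may_be_preempted_prio);
}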

Probably even the above is not flexible enough.
I think that we need preemption policies that might not be expressible as one or
two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
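
For illustration only, such a policy could be encoded as a small per-class
table (a hypothetical sketch, nothing more):

/* Thread classes, ordered from "highest" to "lowest". */
enum pclass { CL_ITHD, CL_REALTIME, CL_KERN, CL_INTERACT, CL_BATCH, CL_IDLE };

/* may_preempt[c] is a bit mask of the classes that class c may preempt. */
static const unsigned int may_preempt[] = {
        [CL_ITHD]     = 1 << CL_REALTIME | 1 << CL_KERN | 1 << CL_INTERACT |
                        1 << CL_BATCH | 1 << CL_IDLE,  /* not other ithreads */
        [CL_REALTIME] = 1 << CL_REALTIME | 1 << CL_KERN | 1 << CL_INTERACT |
                        1 << CL_BATCH | 1 << CL_IDLE,
        [CL_KERN]     = 1 << CL_INTERACT | 1 << CL_BATCH | 1 << CL_IDLE,
        [CL_INTERACT] = 1 << CL_BATCH | 1 << CL_IDLE,
        [CL_BATCH]    = 1 << CL_IDLE,
        [CL_IDLE]     = 0,
};

static inline int
class_may_preempt(enum pclass nclass, enum pclass cclass)
{
        return ((may_preempt[nclass] & (1U << cclass)) != 0);
}
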
--
Andriy Gapon
Andriy Gapon
2018-06-07 17:14:10 UTC
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
--
Andriy Gapon
Gary Jennejohn
2018-06-08 12:27:19 UTC
On Thu, 7 Jun 2018 20:14:10 +0300
Post by Andriy Gapon
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
What about SCHED_4BSD? Or is this just an example and you chose
SCHED_ULE for it?
--
Gary Jennejohn
Andriy Gapon
2018-06-08 14:18:43 UTC
Post by Gary Jennejohn
On Thu, 7 Jun 2018 20:14:10 +0300
Post by Andriy Gapon
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
What about SCHED_4BSD? Or is this just an example and you chose
SCHED_ULE for it?
I haven't looked at SCHED_4BSD code at all.
--
Andriy Gapon
Gary Jennejohn
2018-06-08 14:40:33 UTC
On Fri, 8 Jun 2018 17:18:43 +0300
Post by Andriy Gapon
Post by Gary Jennejohn
On Thu, 7 Jun 2018 20:14:10 +0300
Post by Andriy Gapon
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
What about SCHED_4BSD? Or is this just an example and you chose
SCHED_ULE for it?
I haven't looked at SCHED_4BSD code at all.
I hope you will eventually, because that's what I use. I find its scheduling
of interactive processes much better than ULE's.
--
Gary Jennejohn
Stefan Esser
2018-06-09 11:53:48 UTC
Post by Andriy Gapon
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
Hi Andriy,

I highly appreciate your effort to improve the scheduling in SCHED_ULE.

But I'm afraid that your scheme will not fix the problem. As you may know,
there are a number of problems with SCHED_ULE that lead quite a number of
users to prefer SCHED_4BSD even on multi-core systems.

The problems I'm aware of:

1) On UP systems, I/O-intensive applications may be starved by
compute-intensive processes that are allowed to consume their full quantum
of CPU time (limiting reads to some 10 per second in the worst case).

2) Similarly, on SMP systems with a load higher than the number of cores
(virtual cores in the case of HT), the compute-bound processes can slow down
a cp of a large file from hundreds of MB/s to hundreds of KB/s under certain
circumstances.

3) Programs that evenly split their load over all available cores have been
suffering from sub-optimal assignment of threads to cores. E.g. on a CPU
with 8 (virtual) cores, this resulted in 6 cores running their share of the
load in nominal time, 1 core taking twice as long because 2 threads were
scheduled to run on it, and 1 core staying mostly idle. Even if the load was
initially evenly distributed, a woken-up process that ran on one core
destroyed the symmetry, and it was not recovered. (This was a problem e.g.
for parallel programs using MPI or the like.)

4) The real-time behavior of SCHED_ULE is weak, because interactive
processes (e.g. the X server) are put into the "time-share" class and then
suffer from the problems described as 1) or 2) above. (You distinguish
time-share and batch processes, but both are allowed to consume their full
quanta even if a higher priority process in their class becomes runnable. I
think this will not give the required responsiveness, e.g. for an X server.)
Such processes should be considered I/O intensive if they often do not use
their full quantum, without taking into account the significant amount of
CPU time they may use at times. (I.e. the criterion for time-sharing should
not be the CPU time consumed, but rather the fraction of quanta that are not
fully used because the thread voluntarily gives up the CPU.) With many
real-time threads it may be hard to identify interactive threads, since they
are involuntarily interrupted too often - this must be considered in the
sampling of voluntary vs. involuntary context switches.

5) The NICE parameter has hardly any effect on the scheduling. Processes
started with nice 19 get nearly the same share of the CPU as processes at
nice 0, whereas traditionally they would only have run when a core was
otherwise idle. Nice values between 0 and 19 have even less effect (hardly
any).

I have not had time to try the patch in that review, but I think that
the cause of scheduling problems is not localized in that function.

A solution should be based on typical use cases or sample scenarios being
applied to a scheduling policy. There are some easy cases (e.g. a "random"
load of independent processes such as a parallel make run) where only cache
effects are relevant (try to keep a thread on its CPU as long as possible
and, if it is interrupted, continue it on that CPU as long as you can assume
there is still significant cached state).

There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to fix
the uneven distribution of processes for that case (but AFAIR without much
success).

Your patches may be part of the solution, with at least 3 other parts
remaining:

1) The classification into interactive and time-share should be decoupled
from the total CPU time consumed. Interactive means that the process does
not use its full quantum in a non-negligible fraction of cases. The X server
or a DBMS server should not be considered compute-intensive, or request
rates will be as low as 10 per second (if the time-share quantum is on the
order of 100 ms). (A sketch of such a test follows after this list.)

2) The scheduling should guarantee a symmetric distribution of the load for
scenarios such as parallel programs using MPI. Since OpenMP and other
mechanisms have similar requirements, this will become more relevant over
time.

3) The nice-ness of a process should be relevant, to give the user or
admin a way to adjust priorities.
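
To make 1) a bit more concrete, a possible test could look like the sketch
below. The structure and its counters are made up for illustration; this is
not existing kernel code:

struct quantum_stats {
        unsigned int voluntary;     /* quanta given up voluntarily (sleep/I/O) */
        unsigned int involuntary;   /* quanta ended by preemption              */
        unsigned int used_up;       /* quanta that were fully consumed         */
};

static int
is_interactive(const struct quantum_stats *st)
{
        unsigned int total = st->voluntary + st->involuntary + st->used_up;

        /* Interactive: the thread gives up the CPU on its own "often enough"
         * (here: in at least 25% of its quanta), no matter how much CPU it
         * burns while it is actually running. */
        return (total > 0 && st->voluntary * 4 >= total);
}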

Each of those points will require changes in different parts of the
scheduler, but I think those changes should not be considered in
isolation.

Best regards, STefan
Don Lewis
2018-06-10 01:07:15 UTC
Post by Stefan Esser
3) Programs that evenly split the load on all available cores have been
suffering from sub-optimal assignment of threads to cores. E.g. on a
CPU with 8 (virtual) cores, this resulted in 6 cores running the load
in nominal time, 1 core taking twice as long because 2 threads were
scheduled to run on it, while 1 core was mostly idle. Even if the
load was initially evenly distributed, a woken up process that ran on
one core destroyed the symmetry and it was not recovered. (This was a
problem e.g. for parallel programs using MPI or the like.)
When a core is about to go idle or first enters the idle state it will
search for the most heavily loaded core and steal a thread from it. The
core will only go to sleep if it can't find a non-running thread to
steal.

If there are N cores and N+1 runnable threads, there is a long-term load
balancer that runs periodically. It searches for the most and least
loaded cores and moves a thread from the former to the latter. That
prevents the same pair of threads from having to share the same core
indefinitely.
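
In very rough terms, here is a toy model of those two mechanisms (ignoring
locking, CPU topology and affinity; not the actual sched_ule.c code):

#define NCPU 8
static int load[NCPU];              /* runnable threads per CPU */

/* A CPU with nothing left to run tries to steal from the most loaded CPU
 * and only goes to sleep if that fails. */
static int
steal_for(int me)
{
        int busiest = 0, c;

        for (c = 1; c < NCPU; c++)
                if (load[c] > load[busiest])
                        busiest = c;
        if (busiest != me && load[busiest] > 1) {
                load[busiest]--;
                load[me]++;
                return (1);         /* found work, stay awake */
        }
        return (0);                 /* nothing to steal, go idle */
}

/* The periodic long-term balancer moves one thread from the most loaded to
 * the least loaded CPU, so with N+1 threads the "odd one out" does not stay
 * glued to the same CPU forever. */
static void
balance(void)
{
        int most = 0, least = 0, c;

        for (c = 1; c < NCPU; c++) {
                if (load[c] > load[most])
                        most = c;
                if (load[c] < load[least])
                        least = c;
        }
        if (most != least && load[most] > load[least]) {
                load[most]--;
                load[least]++;
        }
}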

There is an observed bug where a low-priority thread can get pinned to a
particular core that is already occupied by a high-priority CPU-bound thread
that never releases the CPU. The low-priority thread can't migrate to
another core that subsequently becomes available, because it is pinned. It
is not known how the thread originally got into this state. I don't see any
reason for 4BSD to be immune to this problem.
Post by Stefan Esser
4) The real time behavior of SCHED_ULE is weak due to interactive
processes (e.g. the X server) being put into the "time-share" class
and then suffering from the problems described as 1) or 2) above.
(You distinguish time-share and batch processes, which both are
allowed to consume their full quanta even of a higher priority
process in their class becomes runnable. I think this will not
give the required responsiveness e.g. for an X server.)
They should be considered I/O intensive, if they often don't use
their full quantum, without taking the significant amount of CPU
time they may use at times into account. (I.e. the criterion for
time-sharing should not be the CPU time consumed, but rather some
fraction of the quanta not being fully used due to voluntarily giving
up the CPU.) With many real-time threads it may be hard to identify
interactive threads, since they are non-voluntarily disrupted too
often - this must be considered in the sampling of voluntary vs.
non-voluntary context switches.
It can actually be worse than this. There is a bug that can cause the
wnck-applet component of the MATE desktop to consume a large amount of CPU
time, and apparently it is communicating with the Xorg server, which it
drives to 100% CPU. That makes its PRI value increase greatly, so it gets a
lower scheduling priority. Even without competing CPU load, interactive
performance is hurt. With competing CPU load it gets much worse.
Steve Kargl
2018-06-10 01:35:11 UTC
Post by Don Lewis
Post by Stefan Esser
3) Programs that evenly split the load on all available cores have been
suffering from sub-optimal assignment of threads to cores. E.g. on a
CPU with 8 (virtual) cores, this resulted in 6 cores running the load
in nominal time, 1 core taking twice as long because 2 threads were
scheduled to run on it, while 1 core was mostly idle. Even if the
load was initially evenly distributed, a woken up process that ran on
one core destroyed the symmetry and it was not recovered. (This was a
problem e.g. for parallel programs using MPI or the like.)
When a core is about to go idle or first enters the idle state it will
search for the most heavily loaded core and steal a thread from it. The
core will only go to sleep if it can't find a non-running thread to
steal.
If there are N cores and N+1 runnable threads, there is a long term load
balancer than runs periodically. It searches for the most and least
loaded cores and moves a thread from the former to the latter. That
prevents the same pair of threads from having to share the same core
indefinitely.
There is an observed bug where a low priority thread can get pinned to a
particular core that is already occupied by a high-priority CPU-bound
thread that never releases the CPU. The low priority thread can't
migrate to another core that subsequently becomes available because it
it is pinned. It is not known how the thread originally got into this
state. I don't see any reason for 4BSD to be immune to this problem.
It is a well-known problem that an over-subscribed ULE kernel
has much worse performance than a 4BSD kernel. I've posted
more than once with benchmark numbers that demonstrate the problem.
--
Steve
Rodney W. Grimes
2018-06-09 12:36:55 UTC
Post by Gary Jennejohn
On Fri, 8 Jun 2018 17:18:43 +0300
Post by Andriy Gapon
Post by Gary Jennejohn
On Thu, 7 Jun 2018 20:14:10 +0300
Post by Andriy Gapon
Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one
or two numbers. A policy could be something like this:
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
What about SCHED_4BSD? Or is this just an example and you chose
SCHED_ULE for it?
I haven't looked at SCHED_4BSD code at all.
I hope you will eventually because that's what I use. I find its
scheduling of interactive processes much better than ULE.
+1

Bruce Evans may have some info and/or changes here too.
--
Rod Grimes ***@freebsd.org