Post by Andriy Gapon
I think that we need preemption policies that might not be expressible as one or
- interrupt threads can preempt only threads from "lower" classes: real-time,
kernel, timeshare, idle;
- interrupt threads cannot preempt other interrupt threads
- real-time threads can preempt other real-time threads and threads from "lower"
classes: kernel, timeshare, idle
- kernel threads can preempt only threads from lower classes: timeshare, idle
- interactive timeshare threads can only preempt batch and idle threads
- batch threads can only preempt idle threads
Here is a sketch of the idea: https://reviews.freebsd.org/D15693
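The rules in the quoted list can be written down as a small predicate. The following is only a toy model of that list, not the code from the review linked above; the names SchedClass and can_preempt are made up for illustration, and the priority comparison inside the real-time class is left out.

```python
from enum import IntEnum


class SchedClass(IntEnum):
    """Thread classes from the quoted list, highest class first (hypothetical names)."""
    INTERRUPT = 0
    REALTIME = 1
    KERNEL = 2
    INTERACTIVE = 3  # interactive timeshare
    BATCH = 4        # batch timeshare
    IDLE = 5


def can_preempt(new: SchedClass, running: SchedClass) -> bool:
    """Return True if a newly runnable thread may preempt the running one."""
    if new is SchedClass.REALTIME and running is SchedClass.REALTIME:
        # Real-time is the only class whose members may preempt each other.
        # A real scheduler would compare priorities here; this sketch does not.
        return True
    # All other cases: only threads of a strictly lower class can be preempted.
    return running > new
```

For example, can_preempt(SchedClass.INTERRUPT, SchedClass.INTERRUPT) is False, while can_preempt(SchedClass.INTERACTIVE, SchedClass.BATCH) is True.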
Hi Andriy,
I highly appreciate your effort to improve the scheduling in SCHED_ULE.
But I'm afraid that your scheme will not fix the problem. As you may
know, there are a number of problems with SCHED_ULE, which lead quite a
number of users to prefer SCHED_4BSD even on multi-core systems.
The problems I'm aware of:
1) On UP systems, I/O intensive applications may be starved by compute
intensive processes that are allowed to consume their full quantum of
time (limiting reads to some 10 per second worst case).
2) Similarly, on SMP systems with load higher than the number of cores
(virtual cores in case of HT), the compute bound cores can slow down
a cp of a large file from 100s of MB/s to 100s of KB/s, under certain
circumstances.
3) Programs that evenly split the load on all available cores have been
suffering from sub-optimal assignment of threads to cores. E.g. on a
CPU with 8 (virtual) cores, this resulted in 6 cores running the load
in nominal time, 1 core taking twice as long because 2 threads were
scheduled to run on it, while 1 core was mostly idle. Even if the
load was initially evenly distributed, a woken up process that ran on
one core destroyed the symmetry and it was not recovered. (This was a
problem e.g. for parallel programs using MPI or the like.)
4) The real time behavior of SCHED_ULE is weak due to interactive
processes (e.g. the X server) being put into the "time-share" class
and then suffering from the problems described as 1) or 2) above.
(You distinguish time-share and batch processes, which both are
allowed to consume their full quanta even if a higher priority
process in their class becomes runnable. I think this will not
give the required responsiveness e.g. for an X server.)
They should be considered I/O intensive, if they often don't use
their full quantum, without taking the significant amount of CPU
time they may use at times into account. (I.e. the criterion for
time-sharing should not be the CPU time consumed, but rather some
fraction of the quanta not being fully used due to voluntarily giving
up the CPU.) With many real-time threads it may be hard to identify
interactive threads, since they are non-voluntarily disrupted too
often - this must be considered in the sampling of voluntary vs.
non-voluntary context switches.
5) The NICE parameter has hardly any effect on the scheduling. Processes
started with nice 19 get nearly the same share of the CPU as processes
at nice 0, while traditionally they should only run when a core would
otherwise be idle. Nice values between 0 and 19 have even less effect
(hardly any).
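To put a number on problem 1): a back-of-the-envelope model, assuming that on a UP system every wakeup of the I/O bound process has to wait for a compute bound process to finish one full quantum. The function name and the 100 ms quantum are illustrative assumptions, not values read from the scheduler.

```python
def max_requests_per_second(quantum_s: float, cpu_per_request_s: float = 0.0) -> float:
    """Upper bound on the request rate if each request waits one full quantum.

    quantum_s:          time slice the compute bound process may consume
    cpu_per_request_s:  CPU time the I/O bound process needs per request
    """
    return 1.0 / (quantum_s + cpu_per_request_s)


# With an assumed quantum of 100 ms the bound is 10 requests per second.
rate = max_requests_per_second(0.100)
```

The disk or network could deliver far more than that; in this model the limit comes only from the time the I/O bound process spends waiting for the CPU.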
I have not had time to try the patch in that review, but I think that
the cause of scheduling problems is not localized in that function.
And a solution should be based on typical use cases or sample scenarios
being applied to a scheduling policy. There are some easy cases (e.g. a
"random" load of independent processes like a parallel make run), where
only cache effects are relevant (try to keep a thread on its CPU as long
as possible and, if interrupted, continue it on that CPU if you can assume
there is still significant cached state).
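The cache affinity rule for the easy case could look roughly like the sketch below. It is a toy model, not how SCHED_ULE picks a CPU; pick_cpu and the 10 ms affinity window are hypothetical.

```python
from typing import Optional, Sequence


def pick_cpu(last_cpu: Optional[int],
             ms_since_last_run: float,
             idle_cpus: Sequence[int],
             affinity_window_ms: float = 10.0) -> Optional[int]:
    """Choose a CPU for a thread that has just become runnable."""
    # Assume the cache is still warm if the thread ran on last_cpu recently.
    cache_warm = last_cpu is not None and ms_since_last_run <= affinity_window_ms
    if cache_warm:
        return last_cpu
    # Cached state is assumed lost: any idle CPU is as good as the old one.
    if idle_cpus:
        return idle_cpus[0]
    # Nothing idle: stay where we were (None if the thread never ran).
    return last_cpu
```

A thread that ran on CPU 3 two milliseconds ago stays on CPU 3; one that last ran 50 ms ago moves to the first idle CPU, if there is one.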
There have been extensive KTR traces that showed the scheduler behavior
under specific loads, especially MPI, and there have been attempts to
fix the uneven distribution of processes for that case (but AFAIR not
with good success).
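Why the uneven distribution hurts so much for MPI-style programs can be seen from a simple makespan calculation. The sketch assumes equal work per thread, no migration during the run, and fair time-sharing among threads on the same core; the function name is made up.

```python
from collections import Counter
from typing import Sequence


def makespan(assignment: Sequence[int], work_per_thread: float = 1.0) -> float:
    """Time until the last thread finishes.

    assignment[i] is the core that thread i stays on for the whole run.
    Threads sharing a core split it evenly, so a core with n threads
    needs n * work_per_thread to finish all of them.
    """
    threads_per_core = Counter(assignment)
    return max(threads_per_core.values()) * work_per_thread


balanced = [0, 1, 2, 3, 4, 5, 6, 7]  # one thread per core
skewed = [0, 1, 2, 3, 4, 5, 6, 6]    # two threads on core 6, core 7 idle
```

Here makespan(balanced) is 1.0 and makespan(skewed) is 2.0: a single misplaced thread doubles the run time of the whole job, since a parallel program is only done when its slowest thread is.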
Your patches may be part of the solution, with at least 3 other parts
remaining:
1) The classification of interactive and time-share should be separate.
Interactive means that the process does not use its full quantum in
a non-negligible fraction of cases. The X server or a DBMS server
should not be considered compute intensive, or request rates will
be as low as 10 per second (if the time-share quantum is in the
order of 100 ms).
2) The scheduling should guarantee symmetric distribution of the load
for scenarios such as parallel programs with MPI. Since OpenMP and other
mechanisms have similar requirements, this will become more relevant
over time.
3) The nice-ness of a process should be relevant, to give the user or
admin a way to adjust priorities.
Each of those points will require changes in different parts of the
scheduler, but I think those changes should not be considered in
isolation.
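As an illustration of point 1), a classification based on how time slices end rather than on CPU time consumed might be sketched as follows. The function, its counters and the 25% threshold are assumptions for this example only; slices cut short by a higher-priority thread are deliberately left out of the statistics, so that frequent preemption by real-time threads does not hide an interactive thread.

```python
def is_interactive(voluntary: int, expired: int, preempted: int = 0,
                   threshold: float = 0.25) -> bool:
    """Classify a thread by how its recent time slices ended.

    voluntary: slices ended by the thread giving up the CPU (sleep, I/O wait)
    expired:   slices in which the full quantum was used up
    preempted: slices cut short by a higher-priority thread; ignored here,
               because they say nothing about the thread's own behavior
    """
    counted = voluntary + expired
    if counted == 0:
        return False  # no evidence yet, do not treat as interactive
    return voluntary / counted >= threshold
```

With these assumptions a thread that yields early in 30 of 100 counted slices is treated as interactive, no matter how much CPU time it burns in the other 70, and is_interactive(30, 70, preempted=1000) gives the same answer.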
Best regards, STefan