The Linux scheduler must be slightly modified to support the
symmetric multiprocessor (SMP) architecture. Each processor runs the schedule( )
function on its own, but processors must exchange
information in order to boost system performance.
When the scheduler computes the goodness of a runnable process, it should consider whether that process was previously running on the same CPU or on another one. A process that was running on the same CPU is always preferred, since the hardware cache of the CPU could still include useful data. This rule helps in reducing the number of cache misses.
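The preference for the last CPU can be pictured as a fixed bonus added to a process's weight. The following is a minimal sketch, not the kernel's actual goodness( ) code: the struct task type, its counter and priority fields, and the base weight formula are assumptions made for illustration. Only the processor field and the PROC_CHANGE_PENALTY bonus (usually 15) come from the description later in this section.

```c
#include <assert.h>
#include <stddef.h>

#define PROC_CHANGE_PENALTY 15   /* bonus value quoted later in this section */

/* Hypothetical, stripped-down process descriptor (illustration only). */
struct task {
    int counter;     /* assumed: ticks left in the current quantum */
    int priority;    /* assumed: static priority */
    int processor;   /* logical ID of the CPU that last ran the task */
};

/* Returns a weight for running p on this_cpu; higher is better. */
static int goodness_sketch(const struct task *p, int this_cpu)
{
    assert(p != NULL);
    if (p->counter == 0)
        return 0;                        /* quantum exhausted */

    int weight = p->counter + p->priority;   /* assumed base weight */
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;   /* cache may still hold useful data */
    return weight;
}
```

With two otherwise identical tasks, the one whose processor field matches this_cpu scores exactly PROC_CHANGE_PENALTY points higher.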
Let us suppose, however, that CPU 1 is running a process when a second, higher-priority process that was last running on CPU 2 becomes runnable. Now the kernel is faced with an interesting dilemma: should it immediately execute the higher-priority process on CPU 1, or should it defer that process's execution until CPU 2 becomes available? In the former case, the contents of the hardware caches are discarded; in the latter case, the parallelism of the SMP architecture may not be fully exploited when CPU 2 is running the idle process (swapper).
In order to achieve good system performance, Linux/SMP adopts an empirical rule to solve the dilemma. The adopted choice is always a compromise, and the trade-off mainly depends on the size of the hardware caches integrated into each CPU: the larger the CPU cache is, the more convenient it is to keep a process bound to that CPU.
An aligned_data table includes one data structure for each processor, which is
used mainly to obtain the descriptors of current processes quickly. Each element
is filled by every invocation of the schedule( ) function and has the following
structure:
    struct schedule_data {
        struct task_struct * curr;
        unsigned long last_schedule;
    };
The curr field points to the descriptor of the process running on the
corresponding CPU, while last_schedule specifies when schedule( ) selected curr
as the running process.
Several SMP-related fields are included in the process descriptor. In
particular, the avg_slice field keeps track of the average quantum duration of
the process, and the processor field stores the logical identifier of the last
CPU that executed it.
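The bookkeeping that schedule( ) performs on these fields, described later in this section, can be sketched as follows. This is an illustrative model rather than kernel code: the _sketch type names, the NR_CPUS_SKETCH constant, the sched_table array, and the account_switch( ) helper are hypothetical, and the time source is passed in as a plain number.

```c
#include <assert.h>

#define NR_CPUS_SKETCH 4   /* hypothetical number of CPUs */

/* Hypothetical subset of the process descriptor. */
struct task_sketch {
    unsigned long avg_slice;  /* average quantum duration */
    int processor;            /* last CPU that executed the task */
    int has_cpu;              /* 1 while the task owns a CPU */
};

/* Mirrors struct schedule_data: one element per CPU. */
struct schedule_data_sketch {
    struct task_sketch *curr;
    unsigned long last_schedule;
};

static struct schedule_data_sketch sched_table[NR_CPUS_SKETCH];

/* Records that next replaces prev on this_cpu at time now. */
static void account_switch(int this_cpu, struct task_sketch *prev,
                           struct task_sketch *next, unsigned long now)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS_SKETCH);
    struct schedule_data_sketch *sd = &sched_table[this_cpu];

    sd->curr = next;              /* per-CPU pointer to the running task */
    next->has_cpu = 1;
    next->processor = this_cpu;

    /* Length of the slice that just ended on this CPU. */
    unsigned long this_slice = now - sd->last_schedule;
    sd->last_schedule = now;

    /* Halving average: recent slices weigh more than older ones. */
    prev->avg_slice = (prev->avg_slice + this_slice) / 2;
}
```

For example, if prev has an avg_slice of 100 and its last slice lasted 300 time units, the new average is (100 + 300) / 2 = 200.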
The cacheflush_time variable contains a rough estimate of the minimal number of
CPU cycles it takes to entirely overwrite the hardware cache content. It is
initialized by the smp_tune_scheduling( ) function.
Intel Pentium processors have a hardware cache of 8 KB, so their cacheflush_time
is initialized to a few hundred CPU cycles, that is, a few microseconds. Recent
Intel processors have larger hardware caches, and therefore the minimal cache
flush time could range from 50 to 100 microseconds.
As we shall see later, if cacheflush_time is greater than the average time slice
of some currently running process, no process preemption is performed, because in
this case it is more advantageous to keep processes bound to the processors that
last executed them.
When the schedule( ) function is executed on an SMP system, it carries out the
following operations:

1. Performs the initial part of schedule( ) as usual.
2. Stores the logical identifier of the executing processor in the this_cpu
   local variable; this value is read from the processor field of prev (that
   is, of the process to be replaced).
3. Initializes the sched_data local variable so that it points to the
   schedule_data structure of the this_cpu CPU.
4. Invokes goodness( ) repeatedly to select the new process to be executed;
   this function also examines the processor field of the processes and gives
   a substantial bonus (PROC_CHANGE_PENALTY, usually 15) to the process that
   was last executed on the this_cpu CPU.
5. Sets sched_data->curr to next.
6. Sets next->has_cpu to 1 and next->processor to this_cpu.
7. Stores the current time in the t local variable.
8. Stores the duration of the last time slice of prev in the this_slice local
   variable; this value is the difference between t and
   sched_data->last_schedule.
9. Sets sched_data->last_schedule to t.
10. Sets the avg_slice field of prev to (prev->avg_slice+this_slice)/2; in
    other words, updates the average.
11. Performs the context switch.
12. After the switch, the prev local variable refers to the process that has
    just been replaced. If prev is still runnable and it is not the idle task
    of this CPU, invokes the reschedule_idle( ) function on it (described
    below).
13. Sets the has_cpu field of prev to 0.

The reschedule_idle( ) function is invoked when a process p becomes runnable
(see the earlier section "The schedule( ) Function"). On an SMP system, the
function determines whether the process should preempt the current process of
some CPU. It performs the following operations:
1. If p is a real-time process, always attempts to perform preemption: go to
   step 3.
2. Returns without attempting preemption when the following conditions hold:
   - cacheflush_time is greater than the average time slice of the current
     process. If this is true, the process is not dirtying the cache
     significantly.
   - Both p and the current process need the global kernel lock (see the
     section "Global and Local Kernel Locks" in Chapter 11) in order to access
     some critical kernel data structure. This check is performed because
     replacing a process holding the lock with another one that needs it is
     not fruitful.
3. If the p->processor CPU (the one on which p was last running) is idle,
   selects it.
4. Otherwise, computes the difference goodness(tsk, p) - goodness(tsk, tsk)
   for each task tsk running on some CPU and selects the CPU for which the
   difference is greatest, provided it is a positive value.
5. If a CPU has been selected, sets the need_resched field of the
   corresponding running process and sends a "reschedule" message to that
   processor (see the section "Interprocessor Interrupts" in Chapter 11).
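The CPU selection at the end of this procedure (prefer the idle CPU on which p last ran, otherwise take the CPU with the largest positive goodness difference) can be sketched as below. This is a simplified model under stated assumptions: struct cpu_view, pick_cpu( ), and NO_CPU are hypothetical names, the two goodness values are assumed to be precomputed for each CPU, and the real-time check, the cache and lock checks, and the interprocessor message are not modeled.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CPU (-1)   /* hypothetical marker: no CPU should be preempted */

/* Hypothetical per-CPU snapshot used only by this sketch. */
struct cpu_view {
    int is_idle;        /* 1 if the CPU is running the idle task */
    int curr_goodness;  /* goodness(tsk, tsk): weight of the running task */
    int p_goodness;     /* goodness(tsk, p): weight of p on this CPU */
};

/* Returns the CPU that p should preempt, or NO_CPU if none qualifies. */
static int pick_cpu(const struct cpu_view *cpus, int ncpus, int last_cpu)
{
    assert(cpus != NULL && ncpus > 0);
    assert(last_cpu >= 0 && last_cpu < ncpus);

    /* The CPU that last ran p is the first choice if it is idle. */
    if (cpus[last_cpu].is_idle)
        return last_cpu;

    /* Otherwise look for the greatest strictly positive difference. */
    int best = NO_CPU;
    int best_diff = 0;
    for (int i = 0; i < ncpus; i++) {
        int diff = cpus[i].p_goodness - cpus[i].curr_goodness;
        if (diff > best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}
```

When pick_cpu( ) returns a valid index, the caller would set need_resched for that CPU's running process and notify the processor; when it returns NO_CPU, no preemption takes place.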