The Linux/SMP Scheduler

The Linux scheduler must be slightly modified in order to support the symmetric multiprocessor (SMP) architecture. Actually, each processor runs the schedule(  ) function on its own, but processors must exchange information in order to boost system performance.

When the scheduler computes the goodness of a runnable process, it should consider whether that process was previously running on the same CPU or on another one. A process that was running on the same CPU is always preferred, since the hardware cache of the CPU could still include useful data. This rule helps in reducing the number of cache misses.

Let us suppose, however, that CPU 1 is running a process when a second, higher-priority process that was last running on CPU 2 becomes runnable. Now the kernel is faced with an interesting dilemma: should it immediately execute the higher-priority process on CPU 1, or should it defer that process's execution until CPU 2 becomes available? In the former case, the contents of the hardware cache are discarded; in the latter case, the parallelism of the SMP architecture may not be fully exploited when CPU 2 is running the idle process (swapper).

In order to achieve good system performance, Linux/SMP adopts an empirical rule to solve the dilemma. The adopted choice is always a compromise, and the trade-off mainly depends on the size of the hardware caches integrated into each CPU: the larger the CPU cache is, the more convenient it is to keep a process bound to that CPU.
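
The preference for the CPU that last ran a process can be pictured as a bonus added to the goodness value. The following is a minimal sketch, not the kernel's actual goodness(  ) code: the struct task type, the goodness_sketch(  ) name, and the simplified "counter plus priority" weight are assumptions made for illustration; only the processor field and the PROC_CHANGE_PENALTY bonus (usually 15) come from the text.

```c
/* Hypothetical, stripped-down process descriptor: only the fields
 * this sketch needs. The real process descriptor has many more. */
struct task {
    int counter;    /* ticks left in the current quantum */
    int priority;   /* base priority */
    int processor;  /* logical ID of the CPU that last ran the task */
};

/* Bonus for staying on the same CPU; the text gives 15 as the usual value. */
#define PROC_CHANGE_PENALTY 15

/* Simplified goodness for a conventional (non-real-time) task:
 * 0 if the quantum is exhausted, otherwise counter + priority,
 * plus the affinity bonus when the task last ran on this_cpu. */
static int goodness_sketch(const struct task *p, int this_cpu)
{
    int weight;

    if (p->counter == 0)
        return 0;
    weight = p->counter + p->priority;
    if (p->processor == this_cpu)
        weight += PROC_CHANGE_PENALTY;
    return weight;
}
```

With two otherwise identical tasks, the one whose processor field matches this_cpu scores 15 points higher, so it wins unless the other task has a clearly larger remaining quantum or priority.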

Linux/SMP scheduler data structures

An aligned_data table includes one data structure for each processor, which is used mainly to obtain the descriptors of current processes quickly. Each element is filled by every invocation of the schedule(  ) function and has the following structure:

struct schedule_data {
    struct task_struct * curr;    /* process running on this CPU */
    unsigned long last_schedule;  /* when schedule(  ) selected curr */
};

The curr field points to the descriptor of the process running on the corresponding CPU, while last_schedule specifies when schedule(  ) selected curr as the running process.
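
As a rough illustration of how such a per-CPU table can be laid out and indexed, consider the sketch below. The NR_CPUS and CACHE_LINE_BYTES values, the padding union, the cpu_curr(  ) macro, and the one-field struct task_struct are assumptions made for the example; the text only states that aligned_data holds one schedule_data element per processor.

```c
#define NR_CPUS          4   /* assumed number of CPUs */
#define CACHE_LINE_BYTES 32  /* assumed cache line size */

/* Hypothetical stand-in for the real process descriptor. */
struct task_struct {
    int pid;
};

struct schedule_data {
    struct task_struct *curr;     /* process running on this CPU */
    unsigned long last_schedule;  /* when curr was selected */
};

/* One element per CPU, padded to a full cache line so that two CPUs
 * updating their own elements do not write to the same line.
 * (Aligning the start of the array would additionally need a
 * compiler-specific attribute, which is not shown here.) */
static union {
    struct schedule_data schedule_data;
    char pad[CACHE_LINE_BYTES];
} aligned_data[NR_CPUS];

/* Descriptor of the process currently running on a given CPU. */
#define cpu_curr(cpu) (aligned_data[(cpu)].schedule_data.curr)
```

Looking up the current process of another CPU is then a single array access, for example cpu_curr(2), with no list traversal.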

Several SMP-related fields are included in the process descriptor. In particular, the avg_slice field keeps track of the average quantum duration of the process, and the processor field stores the logical identifier of the last CPU that executed it.

The cacheflush_time variable contains a rough estimate of the minimal number of CPU cycles it takes to entirely overwrite the hardware cache content. It is initialized by the smp_tune_scheduling(  ) function to:

Intel Pentium processors have a hardware cache of 8 KB, so their cacheflush_time is initialized to a few hundred CPU cycles, that is, a few microseconds. Recent Intel processors have larger hardware caches, and therefore the minimal cache flush time could range from 50 to 100 microseconds.

As we shall see later, if cacheflush_time is greater than the average time slice of some currently running process, no process preemption is performed because it is convenient in this case to bind processes to the processors that last executed them.
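
Taken at face value, the rule in the previous paragraph amounts to a scan over the average slices of the processes currently running on each CPU. The sketch below shows only that comparison; the preemption_suppressed(  ) name and the array-based interface are hypothetical, and the full check made by the kernel combines this test with a further condition that is not modeled here.

```c
/* Returns 1 if some currently running process has an average time
 * slice shorter than cacheflush_time (so preemption is not attempted),
 * 0 otherwise. avg_slice[i] is the avg_slice of the process running
 * on CPU i; both quantities are expressed in CPU cycles. */
static int preemption_suppressed(const unsigned long avg_slice[],
                                 int ncpus,
                                 unsigned long cacheflush_time)
{
    int i;

    for (i = 0; i < ncpus; i++)
        if (cacheflush_time > avg_slice[i])
            return 1;
    return 0;
}
```

The intuition is that a process whose slices are shorter than the time needed to overwrite the cache leaves most of the cache contents intact, so moving processes between CPUs would throw away still-useful data.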

The schedule(  ) function

When the schedule(  ) function is executed on an SMP system, it carries out the following operations:

  1. Performs the initial part of schedule(  ) as usual.
  2. Stores the logical identifier of the executing processor in the this_cpu local variable; this value is read from the processor field of prev (that is, of the process to be replaced).
  3. Initializes the sched_data local variable so that it points to the schedule_data structure of the this_cpu CPU.
  4. Invokes goodness(  ) repeatedly to select the new process to be executed; this function also examines the processor field of the processes and gives a consistent bonus (PROC_CHANGE_PENALTY, usually 15) to the process that was last executed on the this_cpu CPU.
  5. If needed, recomputes process dynamic priorities as usual.
  6. Sets sched_data->curr to next.
  7. Sets next->has_cpu to 1 and next->processor to this_cpu.
  8. Stores the current Time Stamp Counter value in the t local variable.
  9. Stores the last time slice duration of prev in the this_slice local variable; this value is the difference between t and sched_data->last_schedule.
  10. Sets sched_data->last_schedule to t.
  11. Sets the avg_slice field of prev to (prev->avg_slice + this_slice) / 2; in other words, updates the average.
  12. Performs the context switch.
  13. When the kernel returns here, the original previous process has been selected again by the scheduler; the prev local variable now refers to the process that has just been replaced. If prev is still runnable and it is not the idle task of this CPU, invokes the reschedule_idle(  ) function on it (see the next section).
  14. Sets the has_cpu field of prev to 0.
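
The time accounting in steps 8 through 11 can be sketched as follows. This is an illustrative reduction rather than the kernel's code: the struct task type and the account_slice(  ) helper are hypothetical, and the timestamp t is passed in as a parameter instead of being read from the Time Stamp Counter.

```c
/* Hypothetical process descriptor holding only the field used here. */
struct task {
    unsigned long avg_slice;  /* running average of slice durations */
};

struct schedule_data {
    struct task *curr;
    unsigned long last_schedule;
};

/* t is the current timestamp, in the same unit as last_schedule. */
static void account_slice(struct schedule_data *sched_data,
                          struct task *prev, unsigned long t)
{
    /* Step 9: duration of the slice that has just ended. */
    unsigned long this_slice = t - sched_data->last_schedule;

    /* Step 10: the next slice starts now. */
    sched_data->last_schedule = t;

    /* Step 11: halve the old average and the new sample. */
    prev->avg_slice = (prev->avg_slice + this_slice) / 2;
}
```

Because each update gives the newest slice a weight of one half, avg_slice behaves like an exponentially decaying average: a process that starts running in short bursts sees its average drop within a few invocations of schedule(  ).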

The reschedule_idle(  ) function

The reschedule_idle(  ) function is invoked when a process p becomes runnable (see the earlier section "The schedule(  ) function"). On an SMP system, the function determines whether the process should preempt the current process of some CPU. It performs the following operations:

  1. If p is a real-time process, always attempts to perform preemption: go to step 3.
  2. Returns immediately (does not attempt to preempt) if there is a CPU whose current process satisfies both of the following conditions: [4]
  3. If the p->processor CPU (the one on which p was last running) is idle, selects it.
  4. Otherwise, computes the difference:

    goodness(tsk, p) - goodness(tsk, tsk)

    for each task tsk running on some CPU and selects the CPU for which the difference is greatest, provided it is a positive value.

  5. If a CPU has been selected, sets the need_resched field of the corresponding running process and sends a "reschedule" message to that processor (see the section "Interprocessor Interrupts" in Chapter 11).
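
The CPU selection in steps 3 and 4 can be sketched as below. The sketch assumes that the two goodness values for each CPU have already been computed and stored in a hypothetical struct cpu_state; the pick_cpu(  ) name and the NO_CPU marker are likewise illustrative. The early-return test of step 2 and the final notification step are left out.

```c
#define NO_CPU (-1)  /* hypothetical "nothing selected" marker */

/* Hypothetical per-CPU snapshot used only by this sketch. */
struct cpu_state {
    int is_idle;        /* nonzero if the CPU runs its idle task */
    int curr_goodness;  /* goodness(tsk, tsk) for the running task tsk */
    int p_goodness;     /* goodness(tsk, p) for the candidate process p */
};

/* Returns the CPU on which p should preempt, or NO_CPU.
 * last_cpu is p->processor, the CPU that last ran p. */
static int pick_cpu(const struct cpu_state cpus[], int ncpus, int last_cpu)
{
    int i, best = NO_CPU, best_diff = 0;

    /* Step 3: prefer the last CPU if it is idle. */
    if (cpus[last_cpu].is_idle)
        return last_cpu;

    /* Step 4: otherwise take the CPU with the largest positive
     * goodness difference in favor of p. */
    for (i = 0; i < ncpus; i++) {
        int diff = cpus[i].p_goodness - cpus[i].curr_goodness;
        if (diff > best_diff) {
            best_diff = diff;
            best = i;
        }
    }
    return best;
}
```

If every difference is zero or negative, no running process is worth displacing and the function reports NO_CPU, in which case p simply waits in the run queue until some CPU invokes schedule(  ).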