Every process under Linux is dynamically allocated a 'struct task_struct' structure. The maximum number of processes that can be created on the Linux system is limited only by the amount of physical memory present, and is equal to (see kernel/fork.c:fork_init()):
/*
* The default maximum number of threads is set to a safe
* value: the thread structures can take up at most half
* of memory.
*/
max_threads = mempages / (THREAD_SIZE/PAGE_SIZE) / 2;
which on the IA32 architecture basically means num_physpages/4. So, for example, on a 512M machine you can create 32k threads, which is a considerable improvement over the 4k-epsilon limit of older (2.2 and earlier) kernels. Moreover, this can be changed at runtime using the KERN_MAX_THREADS sysctl(2) or simply using the procfs interface to kernel tunables:
# cat /proc/sys/kernel/threads-max
32764
# echo 100000 > /proc/sys/kernel/threads-max
# cat /proc/sys/kernel/threads-max
100000
# gdb -q vmlinux /proc/kcore
Core was generated by `BOOT_IMAGE=240ac18 ro root=306 video=matrox:vesa:0x118'.
#0 0x0 in ?? ()
(gdb) p max_threads
$1 = 100000
The set of processes on the Linux system is represented as a collection of 'struct task_struct' structures which are linked in two ways:
1. as a hashtable, hashed by pid, and
2. as a circular, doubly-linked list using p->next_task/p->prev_task pointers.
The hashtable is called pidhash[] and is defined in include/linux/sched.h:
/* PID hashing. (shouldnt this be dynamic?) */
#define PIDHASH_SZ (4096 >> 2)
extern struct task_struct *pidhash[PIDHASH_SZ];
#define pid_hashfn(x) ((((x) >> 8) ^ (x)) & (PIDHASH_SZ - 1))
The tasks are hashed by their pid value and the above hashing function is supposed to distribute the elements uniformly in their domain (0 to PID_MAX-1). The hashtable is used to quickly find a task by a given pid, using the find_task_by_pid() inline from include/linux/sched.h:
static inline struct task_struct *find_task_by_pid(int pid)
{
        struct task_struct *p, **htable = &pidhash[pid_hashfn(pid)];

        for (p = *htable; p && p->pid != pid; p = p->pidhash_next)
                ;

        return p;
}
The tasks on each hashlist (i.e. hashed to the same value) are linked by p->pidhash_next/pidhash_pprev, which are used by hash_pid() and unhash_pid() to insert a given process into and remove it from the hashtable. These operations are performed under the protection of the read-write spinlock tasklist_lock taken for WRITE.
The circular doubly-linked list that uses p->next_task/prev_task is maintained so that one could go through all tasks on the system easily. This is achieved by for_each_task() macro from include/linux/sched.h:
#define for_each_task(p) \
for (p = &init_task ; (p = p->next_task) != &init_task ; )
The users of for_each_task() should take tasklist_lock for READ. Note that for_each_task() is using init_task to mark the beginning (and end) of the list - this is safe because the idle task (pid 0) never exits.
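For example, a minimal sketch (using the 2.4 interfaces described above; the function itself is hypothetical) of counting runnable tasks by walking the task list under tasklist_lock taken for READ might look like this:

#include <linux/sched.h>
#include <linux/kernel.h>

static int count_running_tasks(void)
{
        struct task_struct *p;
        int running = 0;

        read_lock(&tasklist_lock);      /* readers take tasklist_lock for READ */
        for_each_task(p) {
                if (p->state == TASK_RUNNING)
                        running++;
        }
        read_unlock(&tasklist_lock);

        return running;
}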
The modifiers of the process hashtable and/or the process table links, notably fork, exit and ptrace, must take tasklist_lock for WRITE. What is more interesting is that the writers must also disable interrupts on the local cpu. The reason for this is not trivial: send_sigio() walks the task list and thus takes tasklist_lock for READ, and it is called from kill_fasync() in interrupt context. This is why writers must disable interrupts while readers need not.
Now that we understand how the task_struct structures are linked together, let us examine the members of task_struct. They loosely correspond to the members of the UNIX 'struct proc' and 'struct user' combined.
The other versions of UNIX separated the task state information into part which should be kept memory-resident at all times (called 'proc structure' which includes process state, scheduling information etc.) and part which is only needed when the process is running (called 'u area' which includes file descriptor table, disk quota information etc.). The only reason for such ugly design was that memory was a very scarce resource. Modern operating systems (well, only Linux at the moment but others, e.g. FreeBSD seem to improve in this direction towards Linux) do not need such separation and therefore maintain process state in a kernel memory-resident data structure at all times.
The task_struct structure is declared in include/linux/sched.h and is currently 1680 bytes in size.
The state field is declared as:
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
#define TASK_RUNNING 0
#define TASK_INTERRUPTIBLE 1
#define TASK_UNINTERRUPTIBLE 2
#define TASK_ZOMBIE 4
#define TASK_STOPPED 8
#define TASK_EXCLUSIVE 32
Why is TASK_EXCLUSIVE defined as 32 and not 16? Because 16 was used up by TASK_SWAPPING and I forgot to shift TASK_EXCLUSIVE up when I removed all references to TASK_SWAPPING (sometime in 2.3.x).
The volatile in the p->state declaration means it can be modified asynchronously (from an interrupt handler).
Task flags contain information about the process states which are not mutually exclusive:
unsigned long flags; /* per process flags, defined below */
/*
* Per process flags
*/
#define PF_ALIGNWARN 0x00000001 /* Print alignment warning msgs */
/* Not implemented yet, only for 486*/
#define PF_STARTING 0x00000002 /* being created */
#define PF_EXITING 0x00000004 /* getting shut down */
#define PF_FORKNOEXEC 0x00000040 /* forked but didn't exec */
#define PF_SUPERPRIV 0x00000100 /* used super-user privileges */
#define PF_DUMPCORE 0x00000200 /* dumped core */
#define PF_SIGNALED 0x00000400 /* killed by a signal */
#define PF_MEMALLOC 0x00000800 /* Allocating memory */
#define PF_VFORK 0x00001000 /* Wake up parent in mm_release */
#define PF_USEDFPU 0x00100000 /* task used FPU this quantum (SMP) */
The fields p->has_cpu,p->processor, p->counter, p->priority, p->policy and p->rt_priority are related to the scheduler and will be looked at later.
The fields p->mm and p->active_mm point to the process' address space described by mm_struct structure and to the active address space if the process doesn't have a real one (e.g. kernel threads) - this is to minimize TLB flushes on switching address spaces when the task is scheduled out. So, if we are scheduling-in the kernel thread (which has no p->mm) then its next->active_mm will be set to the prev->active_mm of the task that was scheduled-out which will be the same as prev->mm if prev->mm != NULL. The address space can be shared between threads if CLONE_VM flag is passed to the clone(2) system call or by means of vfork(2) system call.
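The relevant logic in schedule() reads roughly like this (a simplified sketch of the 2.4 code, not a verbatim quote; prev, next and this_cpu are local variables of schedule()):

struct mm_struct *oldmm = prev->active_mm;

if (!next->mm) {                        /* kernel thread: no mm of its own */
        next->active_mm = oldmm;        /* borrow the previous address space */
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next, this_cpu);
} else {                                /* user task: switch to its own mm */
        switch_mm(oldmm, next->mm, next, this_cpu);
}

if (!prev->mm) {                        /* the previous task was a kernel thread */
        prev->active_mm = NULL;
        mmdrop(oldmm);                  /* drop the reference it took when borrowing the mm */
}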
The fields p->exec_domain and p->personality relate to the personality of the task, i.e. to the way certain system calls behave in order to emulate the "personality" of foreign flavours of UNIX.
The field p->fs contains filesystem information, which under Linux means three pieces of information: the root directory's dentry and mountpoint, the alternate root's dentry and mountpoint, and the current working directory's dentry and mountpoint.
Also, this structure includes a reference count because it can be shared between cloned tasks when CLONE_FS flags are passed to the clone(2) system call.
The field p->files contains the file descriptor table. This also can be shared between tasks if CLONE_FILES is specified with clone(2) system call.
The field p->sig contains signal handlers and can be shared between cloned tasks by means of CLONE_SIGHAND flag passed to the clone(2) system call.
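To illustrate from userspace, here is a minimal sketch (a hypothetical example, not taken from the kernel sources) of clone(2) being used with the flags mentioned above, so that the child shares the address space, filesystem information, file descriptor table and signal handlers with its parent:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

static int thread_fn(void *arg)
{
        printf("child %d shares VM/fs/files/sighand with its parent\n", getpid());
        return 0;
}

int main(void)
{
        char *stack = malloc(16384);
        int pid;

        if (!stack)
                return 1;

        /* the stack grows down on i386, so pass the top of the allocated area */
        pid = clone(thread_fn, stack + 16384,
                    CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, NULL);
        if (pid < 0)
                return 1;

        waitpid(pid, NULL, __WCLONE);   /* no exit signal was requested, so use __WCLONE */
        return 0;
}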
Different books on operating systems define a "process" in different ways, starting from "instance of a program in execution" and ending with "that which is produced by clone(2) or fork(2) system calls". Under Linux, there are three kinds of processes: the idle thread(s), kernel threads and user tasks.
The idle thread is created at compile time for the first CPU and is then "manually" created for each other CPU by means of the arch-specific fork_by_hand() in arch/i386/kernel/smpboot.c, which unrolls the fork system call by hand (on some archs). Idle tasks share one init_task structure but have a private TSS structure in the per-CPU array init_tss. Idle tasks all have pid = 0 and no other task can share a pid, i.e. use the CLONE_PID flag to clone(2).
Kernel threads are created using the kernel_thread() function, which invokes the clone system call in kernel mode. Kernel threads usually have no user address space, i.e. p->mm = NULL, because they explicitly do exit_mm(), e.g. via the daemonize() function. Kernel threads can always access the kernel address space directly. They are allocated pid numbers in the low range. Running at the processor's ring 0 implies that kernel threads enjoy all I/O privileges and cannot be pre-empted by the scheduler.
User tasks are created by means of clone(2) or fork(2) system calls, both of which internally invoke kernel/fork.c:do_fork().
Let us understand what happens when a user process makes a fork(2) system call. Although the fork(2) system call is architecture-dependent due to the different ways of passing user stack and registers, the actual underlying function do_fork() that does the job is portable and is located at kernel/fork.c.
The following steps are done:
Thus tasks are created. There are several ways for tasks to terminate:
Functions implementing system calls under Linux are prefixed with 'sys_', but they are usually concerned only with argument checking or arch-specific ways to pass some information, and the actual work is done by 'do_' functions. So it is with sys_exit(), which calls do_exit() to do the work. Although other parts of the kernel sometimes invoke sys_exit(), they should really call do_exit().
The function do_exit() is found in kernel/exit.c. The points to note about do_exit():
The job of a scheduler is to arbitrate access to the current CPU between multiple processes. The scheduler is implemented in the 'main kernel file' kernel/sched.c. The corresponding header file include/linux/sched.h is included (either explicitly or indirectly) by virtually every kernel source file.
The fields of task structure relevant to scheduler include:
The scheduler's algorithm is simple, despite the great apparent complexity of the schedule() function. The function is complex because it implements three scheduling algorithms in one and also because of the subtle SMP-specifics.
The apparently 'useless' gotos in schedule() are there for a purpose - to generate the best optimised (for i386) code. Also, note that the scheduler (like most of the kernel) was completely rewritten for 2.4, so the discussion below does not apply to 2.2 or any other older kernels.
Let us look at the function in detail:
recalculate:
        {
                struct task_struct *p;

                spin_unlock_irq(&runqueue_lock);
                read_lock(&tasklist_lock);
                for_each_task(p)
                        p->counter = (p->counter >> 1) + p->priority;
                read_unlock(&tasklist_lock);
                spin_lock_irq(&runqueue_lock);
        }
Note that we drop the runqueue_lock before we recalculate, because we go through the entire set of processes, which can take a long time, during which schedule() could be called on another cpu and select a process with a goodness good enough for that cpu while we on this cpu were forced to recalculate. Admittedly this is somewhat inconsistent, because while we (on this cpu) are selecting the process with the best goodness, schedule() running on another cpu could be recalculating dynamic priorities.
Before we go on to examine the implementation of wait queues, we must acquaint ourselves with the standard Linux doubly-linked list implementation, because wait queues (as well as almost everything else in Linux) make heavy use of it. It is known in jargon as the "list.h implementation" because the most relevant file is include/linux/list.h.
The fundamental data structure here is 'struct list_head':
struct list_head {
        struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }

#define LIST_HEAD(name) \
        struct list_head name = LIST_HEAD_INIT(name)

#define INIT_LIST_HEAD(ptr) do { \
        (ptr)->next = (ptr); (ptr)->prev = (ptr); \
} while (0)

#define list_entry(ptr, type, member) \
        ((type *)((char *)(ptr)-(unsigned long)(&((type *)0)->member)))

#define list_for_each(pos, head) \
        for (pos = (head)->next; pos != (head); pos = pos->next)
The first three macros are for initialising an empty list by pointing both the next and prev pointers to itself. It is obvious from C syntactical restrictions which ones should be used where - for example, LIST_HEAD_INIT() can be used to initialise a structure member in a declaration, LIST_HEAD() can be used for a static variable declaration and INIT_LIST_HEAD() can be used inside a function.
The macro list_entry() gives access to an individual list element, for example (from fs/file_table.c:fs_may_remount_ro()):
struct super_block {
        ...
        struct list_head s_files;
        ...
} *sb = &some_super_block;

struct file {
        ...
        struct list_head f_list;
        ...
} *file;

struct list_head *p;

for (p = sb->s_files.next; p != &sb->s_files; p = p->next) {
        struct file *file = list_entry(p, struct file, f_list);
        /* do something to 'file' */
}
A good example of the use of the list_for_each() macro is in the scheduler, where we walk the runqueue looking for the process with the highest goodness:
static LIST_HEAD(runqueue_head);
struct list_head *tmp;
struct task_struct *p;

list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p)) {
                int weight = goodness(p, this_cpu, prev->active_mm);
                if (weight > c)
                        c = weight, next = p;
        }
}
Here p->run_list is declared as 'struct list_head run_list' inside the task_struct structure and serves as an anchor into the list. Removing an element from the list and adding it (to the head or tail of the list) is done by the list_del()/list_add()/list_add_tail() macros. The examples below show adding and removing a task from the runqueue:
static inline void del_from_runqueue(struct task_struct * p)
{
        nr_running--;
        list_del(&p->run_list);
        p->run_list.next = NULL;
}

static inline void add_to_runqueue(struct task_struct * p)
{
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}

static inline void move_last_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add_tail(&p->run_list, &runqueue_head);
}

static inline void move_first_runqueue(struct task_struct * p)
{
        list_del(&p->run_list);
        list_add(&p->run_list, &runqueue_head);
}
When a process requests the kernel to do something which is currently impossible but that may become possible later, the process is put to sleep and is woken up when the request is more likely to be satisfied. One of the kernel mechanisms used for this is called a 'wait queue'.
The Linux implementation allows wake-one semantics using the TASK_EXCLUSIVE flag. With waitqueues, you can either use a well-known queue and then simply call sleep_on/sleep_on_timeout/interruptible_sleep_on/interruptible_sleep_on_timeout, or you can define your own waitqueue and use add_wait_queue/remove_wait_queue to add and remove yourself from it, and wake_up/wake_up_interruptible to wake it up when needed.
An example of the first usage of waitqueues is the interaction between the page allocator (mm/page_alloc.c:__alloc_pages()) and the kswapd kernel daemon, by means of the well-known queue kswapd_wait, declared in mm/vmscan.c: kswapd sleeps on this queue in mm/vmscan.c:kswapd() and is woken up when the page allocator needs to free up some pages.
An example of autonomous waitqueue usage is the interaction between a user process requesting data via the read(2) system call and the kernel running in interrupt context to supply the data. An interrupt handler might look like this (simplified from drivers/char/rtc.c):
static DECLARE_WAIT_QUEUE_HEAD(rtc_wait);

void rtc_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
        spin_lock(&rtc_lock);
        rtc_irq_data = CMOS_READ(RTC_INTR_FLAGS);
        spin_unlock(&rtc_lock);
        wake_up_interruptible(&rtc_wait);
}
so, the interrupt handler obtains the data by reading from some device-specific I/O port (the CMOS_READ() macro turns into a couple of outb/inb instructions) and then wakes up whoever is sleeping on the rtc_wait wait queue.
Now, the read(2) system call could be implemented as:
ssize_t rtc_read(struct file *file, char *buf, size_t count, loff_t *ppos)
{
        DECLARE_WAITQUEUE(wait, current);
        unsigned long data;
        ssize_t retval;

        add_wait_queue(&rtc_wait, &wait);
        current->state = TASK_INTERRUPTIBLE;
        do {
                spin_lock_irq(&rtc_lock);
                data = rtc_irq_data;
                rtc_irq_data = 0;
                spin_unlock_irq(&rtc_lock);

                if (data != 0)
                        break;

                if (file->f_flags & O_NONBLOCK) {
                        retval = -EAGAIN;
                        goto out;
                }
                if (signal_pending(current)) {
                        retval = -ERESTARTSYS;
                        goto out;
                }
                schedule();
        } while (1);

        retval = put_user(data, (unsigned long *)buf);
        if (!retval)
                retval = sizeof(unsigned long);
out:
        current->state = TASK_RUNNING;
        remove_wait_queue(&rtc_wait, &wait);
        return retval;
}
What happens in rtc_read() is this: we declare a wait queue element pointing to the current process context, add it to the rtc_wait waitqueue and mark ourselves TASK_INTERRUPTIBLE. We then check whether any data is available; if not (and the caller did not ask for non-blocking I/O and no signal is pending), we call schedule(), which takes us off the cpu until wake_up_interruptible() from the interrupt handler (or a signal) wakes us up again. When data finally arrives, we copy it to userspace, mark ourselves TASK_RUNNING and remove ourselves from the wait queue.
It is also worth pointing out that using wait queues it is rather easy to implement poll(2) system call:
static unsigned int rtc_poll(struct file *file, poll_table *wait)
{
        unsigned long l;

        poll_wait(file, &rtc_wait, wait);

        spin_lock_irq(&rtc_lock);
        l = rtc_irq_data;
        spin_unlock_irq(&rtc_lock);

        if (l != 0)
                return POLLIN | POLLRDNORM;
        return 0;
}
All the work is done by the device-independent function poll_wait(), which does the necessary waitqueue manipulations; all we need to do is point it at the waitqueue which is woken up by our device-specific interrupt handler.
Now let us turn our attention to kernel timers. Kernel timers are used to dispatch execution of a particular function (called 'timer handler') at a specified time in the future. The main data structure is 'struct timer_list' declared in include/linux/timer.h:
struct timer_list {
        struct list_head list;
        unsigned long expires;
        unsigned long data;
        void (*function)(unsigned long);
        volatile int running;
};
The 'list' field is for linking into the internal list, protected by the timerlist_lock spinlock. The 'expires' field is the value of jiffies when the 'function' handler should be invoked with 'data' passed as a parameter. The 'running' field is used on SMP to test if the timer handler is currently running on another cpu.
The functions add_timer() and del_timer() add and remove a given timer to and from the list. When a timer expires, it is removed automatically. Before a timer is used, it must be initialised by means of the init_timer() function, and before it is added, the fields 'function' and 'expires' must be set.
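A minimal sketch of arming a one-shot timer (my_timer_fn and the one-second delay are hypothetical, for illustration only) might look like this:

#include <linux/timer.h>
#include <linux/sched.h>        /* jiffies, HZ */
#include <linux/kernel.h>

static void my_timer_fn(unsigned long data)
{
        printk("timer fired, data=%lu\n", data);
}

static struct timer_list my_timer;

static void arm_my_timer(void)
{
        init_timer(&my_timer);            /* must be done before the timer is used */
        my_timer.function = my_timer_fn;  /* handler invoked on expiry */
        my_timer.data     = 42;           /* passed to the handler */
        my_timer.expires  = jiffies + HZ; /* roughly one second from now */
        add_timer(&my_timer);             /* removed automatically when it expires */
}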
Sometimes it is reasonable to split the amount of work to be performed inside an interrupt handler into immediate (e.g. acknowledging the interrupt, updating the stats etc.) and that which can be postponed until later, when interrupts are enabled (e.g. to do some postprocessing on data, wake up processes waiting for this data etc.).
Bottom halves are the oldest mechanism for deferred execution of kernel tasks and have been available since Linux 1.x. In Linux 2.0, a new mechanism was added, called 'task queues', which will be the subject of the next section.
Bottom halves are serialized by a global_bh_lock spinlock, i.e. there can only be one bottom half running on any cpu at a time. However, when attempting to execute the handler, if global_bh_lock is not available, the bottom half is marked (i.e. scheduled) for execution - so processing can continue, as opposed to a busy loop on global_bh_lock.
There can only be 32 bottom halves registered in total. The functions required to manipulate bottom halves are as follows (all exported to modules): init_bh(int nr, void (*routine)(void)), which installs a handler in slot nr; remove_bh(int nr), which removes it; and mark_bh(int nr), which marks the bottom half for execution.
Bottom halves are globally locked tasklets, so the question "when are bottom half handlers executed?" is really "when are tasklets executed?". And the answer is, in two places: a) on each schedule() and b) on each interrupt/syscall return path in entry.S. (TODO: therefore, the schedule() case is really boring - it is like adding yet another very very slow interrupt, so why not get rid of the handle_softirq label from schedule() altogether?)
Task queues can be thought of as a dynamic extension of the old bottom halves. In fact, in the source code they are sometimes referred to as "new" bottom halves. More specifically, the old bottom halves discussed in the previous section have these limitations: there is only a fixed number (32) of them, each bottom half can be associated with only one handler function, and they are consumed with a spinlock held, so they cannot block.
So, with task queues, an arbitrary number of functions can be chained and processed one after another at a later time. One creates a new task queue using the DECLARE_TASK_QUEUE() macro and queues a task onto it using the queue_task() function. The task queue can then be processed using the run_task_queue() function. Instead of creating your own task queue (and having to consume it manually), you can use one of Linux's predefined task queues, which are consumed at well-known points:
Unless a driver uses its own task queues, it does not need to call run_task_queue() to process the queue, except under the circumstances explained below.
The reason the tq_timer/tq_scheduler task queues are consumed not only in the usual places but elsewhere (closing a tty device is but one example) becomes clear if one remembers that the driver can schedule tasks on the queue that only make sense while a particular instance of the device is still valid - which usually means until the application closes it. So, the driver may need to call run_task_queue() to flush the tasks it (and anyone else) has put on the queue, because allowing them to run at a later time may make no sense - i.e. the relevant data structures may have been freed/reused by a different instance. This is the reason you see run_task_queue() on tq_timer and tq_scheduler in places other than the timer interrupt and schedule() respectively.
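A minimal sketch of using a private task queue with the functions mentioned above (my_deferred_fn and the surrounding helpers are hypothetical):

#include <linux/tqueue.h>
#include <linux/kernel.h>

static void my_deferred_fn(void *data)
{
        printk("deferred work running, data=%p\n", data);
}

static DECLARE_TASK_QUEUE(my_tq);
static struct tq_struct my_task = { routine: my_deferred_fn };

static void defer_some_work(void)
{
        queue_task(&my_task, &my_tq);   /* chain the task onto our queue */
}

static void consume_deferred_work(void)
{
        run_task_queue(&my_tq);         /* consume everything queued so far */
}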
Not yet, will be in future revision.
Not yet, will be in future revision.
There are two mechanisms under Linux for implementing system calls: the lcall7/lcall27 call gates and the int 0x80 software interrupt.
Native Linux programs use int 0x80 whilst the binaries from foreign flavours of UNIX (Solaris, UnixWare 7 etc.) use lcall7 mechanism. The name 'lcall7' is historically misleading because it covers also lcall27 (e.g. Solaris/x86) but the handler function is called lcall7_func.
When the system boots the function arch/i386/kernel/traps.c:trap_init() is called which sets up the IDT to point vector 0x80 (of type 15, dpl 3) to the address of system_call entry from arch/i386/kernel/entry.S.
When an application makes a system call, the arguments are passed via registers and the application executes the 'int 0x80' instruction. This causes a trap into kernel mode and the processor jumps to the system_call entry point in entry.S. What this does is:
Linux supports up to 6 arguments for system calls. They are passed in %ebx, %ecx, %edx, %esi, %edi (and %ebp used temporarily, see _syscall6() in asm-i386/unistd.h) and the system call number is passed via %eax.
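For illustration, here is a minimal userspace sketch (i386-specific, a hypothetical example) of invoking a system call directly via int 0x80, passing the system call number in %eax as described above (getpid(2) is number 20 on i386 and takes no arguments):

#include <stdio.h>

int main(void)
{
        long pid;

        /* syscall number goes in %eax, the result comes back in %eax */
        __asm__ volatile ("int $0x80" : "=a" (pid) : "a" (20));
        printf("pid = %ld\n", pid);
        return 0;
}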
There are two types of atomic operations: bitmaps and atomic_t. Bitmaps are very convenient for maintaining a concept of "allocated" or "free" units from some large collection where each unit is identified by some number, for example free inodes or free blocks. They are also widely used for simple locking, for example to provide exclusive access to open a device, e.g. in arch/i386/kernel/microcode.c:
/*
* Bits in microcode_status. (31 bits of room for future expansion)
*/
#define MICROCODE_IS_OPEN 0 /* set if device is in use */
static unsigned long microcode_status;
There is no need to initialise microcode_status to 0 as BSS is zero-cleared under Linux explicitly.
/*
 * We enforce only one user at a time here with open/close.
 */
static int microcode_open(struct inode *inode, struct file *file)
{
        if (!capable(CAP_SYS_RAWIO))
                return -EPERM;

        /* one at a time, please */
        if (test_and_set_bit(MICROCODE_IS_OPEN, &microcode_status))
                return -EBUSY;

        MOD_INC_USE_COUNT;
        return 0;
}
The operations on bitmaps are: set_bit(int nr, volatile void *addr), clear_bit(int nr, volatile void *addr) and change_bit(int nr, volatile void *addr), which set, clear and toggle the given bit; test_and_set_bit(), test_and_clear_bit() and test_and_change_bit(), which do the same but also return the old value of the bit; and test_bit(), which simply returns the bit's current value.
(TODO: why 'volatile' in the above declarations?)
These operations use the LOCK_PREFIX macro, which evaluates to the bus lock instruction prefix on SMP and to nothing on UP. This guarantees atomicity of access in an SMP environment.
Sometimes bit manipulations are not convenient and instead we need to perform arithmetic operations - add, subtract, increment, decrement. The typical cases are reference counts (e.g. for inodes). This facility is provided by the atomic_t data type and operations such as atomic_read(), atomic_set(), atomic_add(), atomic_sub(), atomic_inc(), atomic_dec() and atomic_dec_and_test(), the last of which decrements and returns true if the result is zero.
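A minimal sketch of the reference-counting pattern mentioned above (my_object and its helpers are hypothetical, for illustration only):

#include <asm/atomic.h>
#include <linux/slab.h>

struct my_object {
        atomic_t refcount;
        /* ... payload ... */
};

static void my_object_get(struct my_object *obj)
{
        atomic_inc(&obj->refcount);
}

static void my_object_put(struct my_object *obj)
{
        /* atomic_dec_and_test() returns true when the count drops to zero */
        if (atomic_dec_and_test(&obj->refcount))
                kfree(obj);
}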
Since the early days of Linux support (early 90s, this century), developers have been faced with the classical problem of accessing shared data between different types of context (user process vs interrupt) and between different instances of the same context running on multiple cpus.
SMP support was added to Linux 1.3.42 on 15 Nov 1995 (the original patch was made to 1.3.37 in October the same year).
If the critical region of code may be executed by either process context or interrupt context, then the way to protect it on UP using the cli/sti instructions is:
unsigned long flags;
save_flags(flags);
cli();
/* critical code */
restore_flags(flags);
While this is ok on UP, it is obviously of no use on SMP, because the same code sequence may be executed simultaneously on another cpu; so while cli provides protection against races with interrupt context on the local cpu, it provides no protection against races between contexts running on different cpus. This is where spinlocks are useful.
There are three types of spinlocks: vanilla (basic), read-write and big-reader spinlocks. Read-write spinlocks should be used when there is a natural tendency towards 'many readers and few writers'. An example of this is access to the list of registered filesystems (see fs/super.c). The list is guarded by the read-write spinlock file_systems_lock because one needs exclusive access only when registering/unregistering a filesystem, but any process can read the file /proc/filesystems or use the sysfs(2) system call to force a read-only scan of the file_systems list. This makes it sensible to use read-write spinlocks. With read-write spinlocks, one can have multiple readers at a time but only one writer, and there can be no readers while there is a writer. Btw, it would be nice if new readers would not get the lock while there is a writer trying to get it, i.e. if Linux correctly dealt with the issue of potential writer starvation by multiple readers. This would mean that readers must be blocked while there is a writer attempting to get the lock. This is not currently the case and it is not obvious whether it should be fixed - the argument to the contrary is: readers usually take the lock for a very short time, so should they really be starved while the writer takes the lock for potentially longer periods?
Big-reader spinlocks are a form of read-write spinlocks heavily optimised for very light read access with the penalty for writes. There is a limited number of big-reader spinlocks - currently only two exist, of which one is used only on sparc64 (global irq) and the other is used for networking. In all other cases where the access pattern does not fit into any of these two scenarios one should use basic spinlocks. You cannot block while holding any kind of spinlock.
Spinlocks come in three flavours: plain, _irq() and _bh().
The reason you cannot use plain spin_lock() if you race against interrupt handlers is that if you take it and then an interrupt comes in on the same cpu, it will busy wait for the lock forever, because the lock holder has been interrupted and will not continue until the interrupt handler returns.
The most common usage of a spinlock is to access a data structure shared between user process context and interrupt handlers:
spinlock_t my_lock = SPIN_LOCK_UNLOCKED;
my_ioctl()
{
unsigned long flags;
spin_lock_irq(&my_lock, flags);
/* critical section */
spin_unlock_irq(&my_lock, flags);
}
my_irq_handler()
{
spin_lock(&lock);
/* critical section */
spin_unlock(&lock);
}
There are a couple of things to note about this example: the process context uses the _irq flavour so that interrupts are disabled on the local cpu, otherwise the interrupt handler could come in on the same cpu and spin forever on the lock the interrupted code already holds; the interrupt handler itself can use the plain spin_lock() form, because the only contention it has to guard against is the same code running on another cpu.
Sometimes, while accessing a shared data structure, one must perform operations that can block, for example to copy data to userspace. The locking primitive available for such scenarios under Linux is called a semaphore. There are two types of semaphores - basic and read-write semaphores. Depending on the initial value of the semaphore, they can be used either for mutual exclusion (initial value of 1) or to provide a more sophisticated type of access.
Read-write semaphores differ from basic semaphores in the same way as read-write spinlocks differ from basic spinlocks: one can have multiple readers at a time but only one writer, and there can be no readers while there is a writer - i.e. the writer blocks all readers and new readers block while a writer is waiting.
Also, basic semaphores can be interruptible - just use the operations down_interruptible()/up() instead of the plain down()/up() and check the value returned from down_interruptible(): if it is nonzero, the operation was interrupted.
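A minimal sketch of this (my_sem and the critical section are hypothetical):

#include <asm/semaphore.h>
#include <linux/errno.h>

static DECLARE_MUTEX(my_sem);   /* semaphore initialised to 1, i.e. used as a mutex */

static int do_something_locked(void)
{
        if (down_interruptible(&my_sem))
                return -ERESTARTSYS;    /* a signal interrupted us before we got the semaphore */
        /* critical section that may block */
        up(&my_sem);
        return 0;
}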
Using a semaphore for mutual exclusion is ideal in situations where a critical code section may call unknown functions by reference (e.g. registered by other subsystems/modules), i.e. when the caller cannot know a priori whether the called function blocks or not.
A simple example of semaphore usage is in kernel/sys.c, in the implementation of the gethostname(2)/sethostname(2) system calls.
asmlinkage long sys_sethostname(char *name, int len)
{
        int errno;

        if (!capable(CAP_SYS_ADMIN))
                return -EPERM;
        if (len < 0 || len > __NEW_UTS_LEN)
                return -EINVAL;

        down_write(&uts_sem);
        errno = -EFAULT;
        if (!copy_from_user(system_utsname.nodename, name, len)) {
                system_utsname.nodename[len] = 0;
                errno = 0;
        }
        up_write(&uts_sem);
        return errno;
}

asmlinkage long sys_gethostname(char *name, int len)
{
        int i, errno;

        if (len < 0)
                return -EINVAL;

        down_read(&uts_sem);
        i = 1 + strlen(system_utsname.nodename);
        if (i > len)
                i = len;
        errno = 0;
        if (copy_to_user(name, system_utsname.nodename, i))
                errno = -EFAULT;
        up_read(&uts_sem);
        return errno;
}
The points to note about this example are that the shared data (system_utsname) is protected by the read-write semaphore uts_sem, taken for write in sethostname and for read in gethostname, and that a semaphore (rather than a spinlock) must be used here because copy_to_user()/copy_from_user() may block.
Although the Linux implementation of semaphores and read-write semaphores is very sophisticated, there are possible scenarios one can think of which are not yet implemented, for example there is no concept of interruptible read-write semaphores. This is obviously because there are no real-world situations which require these exotic flavours of the primitives.
Linux is a monolithic operating system and despite all the modern hype about some "advantages" offered by operating systems based on micro-kernel design, the truth remains (quoting Linus Torvalds himself):
... message passing as the fundamental operation of the OS is just an exercise in
computer science masturbation. It may feel good, but you don't actually get anything DONE.
Therefore, Linux is and will always be based on the monolithic design, which means that all subsystems run in the same privileged mode and share the same address space; communication between them is achieved by the usual C function call means.
However, although separating kernel functionality into separate "processes" as is done in micro-kernels is definitely a bad idea, separating it into dynamically loadable on demand kernel modules is desirable in some circumstances (e.g. on machines with low memory or for installation kernels which could otherwise contain ISA auto-probing device drivers that are mutually exclusive). The decision whether to include support for loadable modules is made at compilation time and is determined by the CONFIG_MODULES option. Support for auto-loading modules via request_module() mechanism is a separate compilation option - CONFIG_KMOD.
The following functionality can be implemented as loadable modules under Linux:
There are a few things that cannot be implemented as modules under Linux (probably because it makes no sense for them to be modularised):
Linux provides several system calls to assist in loading modules: create_module(2), init_module(2), delete_module(2) and query_module(2).
The command interface available to users consists of the modutils programs: insmod, modprobe, rmmod, depmod and lsmod.
Apart from being able to load a module manually using either insmod or modprobe, it is also possible to have a module inserted automatically by the kernel when a particular functionality is required. The kernel interface for this is the function called request_module(name), which is exported to modules, so modules can load other modules as well. The request_module(name) function internally creates a kernel thread which execs the userspace command "modprobe -s -k module_name", using the standard exec_usermodehelper() kernel interface (which is also exported to modules). The function returns 0 on success, but it is usually not worth checking the return code from request_module(). Instead, the programming idiom is:
if (check_some_feature() == NULL)
        request_module(module);
if (check_some_feature() == NULL)
        return -ENODEV;
For example, this is done by fs/block_dev.c:get_blkfops() to load a module "block-major-N" when an attempt is made to open a block device with major number N. Obviously, there is no such module called "block-major-N" (Linux developers only choose sensible names for their modules), but it is mapped to a proper module name using the file /etc/modules.conf. However, for most well-known major numbers (and other kinds of modules) the modprobe/insmod commands know which real module to load without needing an explicit alias statement in /etc/modules.conf.
A good example of loading a module is inside the mount(2) system call. The mount(2) system call accepts the filesystem type as a string which fs/super.c:do_mount() then passes on to fs/super.c:get_fs_type():
static struct file_system_type *get_fs_type(const char *name)
{
        struct file_system_type *fs;

        read_lock(&file_systems_lock);
        fs = *(find_filesystem(name));
        if (fs && !try_inc_mod_count(fs->owner))
                fs = NULL;
        read_unlock(&file_systems_lock);

        if (!fs && (request_module(name) == 0)) {
                read_lock(&file_systems_lock);
                fs = *(find_filesystem(name));
                if (fs && !try_inc_mod_count(fs->owner))
                        fs = NULL;
                read_unlock(&file_systems_lock);
        }

        return fs;
}
A few things to note in this function: the filesystem list is searched under the read-write spinlock file_systems_lock taken for READ; if a filesystem with the given name is found, its owning module's reference count is bumped with try_inc_mod_count() and the filesystem is returned only if that succeeds; if the first search fails, request_module() is called to load the corresponding module and, if it returns success, the search is repeated; the lock is dropped around request_module() because loading a module can block.
When a module is loaded into the kernel, it can refer to any symbols that are exported as public by the kernel using the EXPORT_SYMBOL() macro, or by other currently loaded modules. If the module uses symbols from another module, it is marked as depending on that module during dependency recalculation, which is achieved by running the "depmod -a" command on boot (e.g. after installing a new kernel).
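For example, a module can make one of its own functions available to other modules like this (my_shared_fn is a hypothetical name used for illustration):

#include <linux/module.h>

int my_shared_fn(int x)
{
        return x * 2;
}
EXPORT_SYMBOL(my_shared_fn);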
Usually, one must match the set of modules with the version of the kernel interfaces they use, which under Linux simply means the "kernel version", as there is no special kernel interface versioning mechanism in general. However, there is a limited facility called "module versioning", or CONFIG_MODVERSIONS, which makes it possible to avoid recompiling modules when switching to a new kernel. What happens here is that the kernel symbol table is treated differently for internal access and for access from modules. The elements of the public (i.e. exported) part of the symbol table are built by 32bit checksumming the C declaration. So, in order to resolve a symbol used by a module during loading, the loader must match the full representation of the symbol, which includes the checksum; it will refuse to load the module if they differ. This only happens when both the kernel and the module are compiled with module versioning enabled. If either one of them uses the original symbol names, the loader simply tries to match the kernel version declared by the module against the one exported by the kernel, and refuses to load if they differ.