Process Management: Scheduling, Structures, and Signals in FreeBSD

An overview of process management in FreeBSD, focusing on process composition, scheduling policies, process structures, and signals. It covers process creation, the organization of process state, and real-time scheduling, and also discusses process groups, sessions, and jails.

What you will learn

  • What is the role of the kernel in managing the illusion of concurrent execution of multiple processes?
  • What information is contained in the process structure and user structure?
  • How does the system switch between processes in FreeBSD?
  • What is real-time scheduling and how does it differ from regular scheduling?
  • How are signals handled in FreeBSD?


CHAPTER 4

Process Management

4.1 Introduction to Process Management

A process is a program in execution. A process must have system resources, such as memory and the underlying CPU. The kernel supports the illusion of concurrent execution of multiple processes by scheduling system resources among the set of processes that are ready to execute. On a multiprocessor, multiple processes may really execute concurrently. This chapter describes the composition of a process, the method that the system uses to switch between processes, and the scheduling policy that it uses to promote sharing of the CPU. It also introduces process creation and termination, and details the signal facilities and process-debugging facilities.

Two months after the developers began the first implementation of the UNIX operating system, there were two processes: one for each of the terminals of the PDP-7. At age 10 months, and still on the PDP-7, UNIX had many processes, the fork operation, and something like the wait system call. A process executed a new program by reading in a new program on top of itself. The first PDP-11 system (First Edition UNIX) saw the introduction of exec. All these systems allowed only one process in memory at a time. When a PDP-11 with memory management (a KS-11) was obtained, the system was changed to permit several processes to remain in memory simultaneously, to reduce swapping. But this change did not apply to multiprogramming because disk I/O was synchronous. This state of affairs persisted into 1972 and the first PDP-11/45 system. True multiprogramming was finally introduced when the system was rewritten in C. Disk I/O for one process could then proceed while another process ran. The basic structure of process management in UNIX has not changed since that time [Ritchie, 1988].

A process operates in either user mode or kernel mode. In user mode, a process executes application code with the machine in a nonprivileged protection mode. When a process requests services from the operating system with a system call, it switches into the machine's privileged protection mode via a protected mechanism and then operates in kernel mode.


The resources used by a process are similarly split into two parts. The resources needed for execution in user mode are defined by the CPU architecture and typically include the CPU's general-purpose registers, the program counter, the processor-status register, and the stack-related registers, as well as the contents of the memory segments that constitute the FreeBSD notion of a program (the text, data, shared library, and stack segments).

Kernel-mode resources include those required by the underlying hardware—such as registers, program counter, and stack pointer—and also by the state required for the FreeBSD kernel to provide system services for a process. This kernel state includes parameters to the current system call, the current process's user identity, scheduling information, and so on.

As described in Section 3.1, the kernel state for each process is divided into several separate data structures, with two primary structures: the process structure and the user structure. The process structure contains information that must always remain resident in main memory, along with references to other structures that remain resident, whereas the user structure contains information that needs to be resident only when the process is executing (although user structures of other processes also may be resident). User structures are allocated dynamically through the memory-management facilities. Historically, more than one-half of the process state was stored in the user structure. In FreeBSD, the user structure is used for only a couple of structures that are referenced from the process structure. Process structures are allocated dynamically as part of process creation and are freed as part of process exit.
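This split can be pictured as a pair of C structures. The sketch below is hypothetical: every field name is invented, and the real definitions (in <sys/proc.h> and <sys/user.h>) carry far more state.

    #include <sys/types.h>                   /* pid_t, uid_t */

    /* Hypothetical sketch of the two-structure split; illustration only. */
    struct user_sketch {                     /* pageable, per-process state */
        long u_stats[8];                     /* e.g., accounting statistics */
    };

    struct proc_sketch {                     /* always resident in main memory */
        pid_t               p_pid;           /* process identifier */
        uid_t               p_uid;           /* user identity */
        int                 p_state;         /* NEW, NORMAL, or ZOMBIE */
        struct user_sketch *p_uarea;         /* needed only while executing */
    };

The point of the split shows in the last field: state reachable only through p_uarea can be paged or swapped out while the process is not running.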

Multiprogramming

The FreeBSD system supports transparent multiprogramming: the illusion of concurrent execution of multiple processes or programs. It does so by context switching—that is, by switching between the execution context of processes. A mechanism is also provided for scheduling the execution of processes—that is, for deciding which one to execute next. Facilities are provided for ensuring consistent access to data structures that are shared among processes.

Context switching is a hardware-dependent operation whose implementation is influenced by the underlying hardware facilities. Some architectures provide machine instructions that save and restore the hardware-execution context of the process, including the virtual-address space. On the others, the software must collect the hardware state from various registers and save it, then load those registers with the new hardware state. All architectures must save and restore the software state used by the kernel.

Context switching is done frequently, so increasing the speed of a context switch noticeably decreases time spent in the kernel and provides more time for execution of user applications. Since most of the work of a context switch is expended in saving and restoring the operating context of a process, reducing the amount of the information required for that context is an effective way to produce faster context switches.
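The save-then-load pattern is easy to see from user space with the POSIX <ucontext.h> routines. This is an analogy to, not an implementation of, the kernel's machine-dependent switch; some systems require _XOPEN_SOURCE to be defined for this interface.

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, work_ctx;
    static char work_stack[64 * 1024];       /* private stack for the context */

    static void
    worker(void)
    {
        printf("worker: running on its own context\n");
        swapcontext(&work_ctx, &main_ctx);   /* save worker, reload main */
    }

    int
    main(void)
    {
        getcontext(&work_ctx);               /* start from the current state */
        work_ctx.uc_stack.ss_sp = work_stack;
        work_ctx.uc_stack.ss_size = sizeof(work_stack);
        work_ctx.uc_link = &main_ctx;        /* where to go if worker returns */
        makecontext(&work_ctx, worker, 0);

        swapcontext(&main_ctx, &work_ctx);   /* save main, load worker */
        printf("main: back after the switch\n");
        return 0;
    }

Each swapcontext() call does in miniature what the text describes: store the current register and stack state in one place, then load previously saved state from another.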


The system also needs a scheduling policy to deal with problems that arise from not having enough main memory to hold the execution contexts of all processes that want to execute. The major goal of this scheduling policy is to minimize thrashing—a phenomenon that occurs when memory is in such short supply that more time is spent in the system handling page faults and scheduling processes than in user mode executing application code. The system must both detect and eliminate thrashing. It detects thrashing by observing the amount of free memory. When the system has few free memory pages and a high rate of new memory requests, it considers itself to be thrashing.

The system reduces thrashing by marking the least-recently run process as not being allowed to run. This marking allows the pageout daemon to push all the pages associated with the process to backing store. On most architectures, the kernel also can push to backing store the user structure of the marked process. The effect of these actions is to cause the process to be swapped out (see Section 5.12). The memory freed by blocking the process can then be distributed to the remaining processes, which usually can then proceed. If the thrashing continues, additional processes are selected for being blocked from running until enough memory becomes available for the remaining processes to run effectively. Eventually, enough processes complete and free their memory that blocked processes can resume execution. However, even if there is not enough memory, the blocked processes are allowed to resume execution after about 20 seconds. Usually, the thrashing condition will return, requiring that some other process be selected for being blocked (or that an administrative action be taken to reduce the load).
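The detection heuristic can be sketched in a few lines. Everything below is invented for illustration: the threshold, the request-rate test, and the process records; the kernel's actual policy works on live VM statistics.

    #include <stdio.h>

    enum { FREE_LOW_WATER = 128 };           /* hypothetical page threshold */

    struct toy_proc { const char *name; int last_ran; int blocked; };

    int
    main(void)
    {
        struct toy_proc procs[] = {
            { "a.out", 100, 0 }, { "cc1", 400, 0 }, { "make", 350, 0 },
        };
        int free_pages = 40, request_rate = 90;      /* memory is scarce */

        if (free_pages < FREE_LOW_WATER && request_rate > free_pages) {
            /* Thrashing: block the least-recently run process so its
             * pages can be pushed to backing store. */
            struct toy_proc *victim = &procs[0];
            for (int i = 1; i < 3; i++)
                if (procs[i].last_ran < victim->last_ran)
                    victim = &procs[i];
            victim->blocked = 1;
            printf("swapping out %s\n", victim->name);
        }
        return 0;
    }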

4.2 Process State

Every process in the system is assigned a unique identifier termed the process identifier (PID). PIDs are the common mechanism used by applications and by the kernel to reference processes. PIDs are used by applications when the latter are sending a signal to a process and when receiving the exit status from a deceased process. Two PIDs are of special importance to each process: the PID of the process itself and the PID of the process's parent process.

The layout of process state was completely reorganized in FreeBSD 5.2. The goal was to support multiple threads that share an address space and other resources. Threads have also been called lightweight processes in other systems. A thread is the unit of execution of a process; it requires an address space and other resources, but it can share many of those resources with other threads. Threads sharing an address space and other resources are scheduled independently and can all do system calls simultaneously. The reorganization of process state in FreeBSD 5.2 was designed to support threads that can select the set of resources to be shared, known as variable-weight processes [Aral et al., 1989].

The developers did the reorganization by moving many components of process state from the process and user structures into separate substructures for each type of state information, as shown in Figure 4.1.
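The PID's role as a process handle, described at the start of this section, is all an application needs to manage another process. In this minimal sketch, the parent uses the child's PID first to send it a signal and then to collect its termination status; the one-second pause is an arbitrary simplification.

    #include <signal.h>
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t child = fork();
        if (child == 0) {                    /* child: idle until signalled */
            for (;;)
                sleep(1);
        }
        sleep(1);                            /* let the child get started */
        kill(child, SIGTERM);                /* signal the child by PID */

        int status;
        waitpid(child, &status, 0);          /* collect the status by PID */
        if (WIFSIGNALED(status))
            printf("child %d terminated by signal %d\n",
                (int)child, WTERMSIG(status));
        return 0;
    }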


[Figure 4.1 Process state. The process entry references, directly or indirectly, a set of substructures: a thread list, whose thread entries each carry scheduling info, thread information, machine-dependent data, a thread kernel stack, and a thread control block, plus the credential, syscall vector, process group (and its session), VM space, file descriptors and file entries, resource limits, region list, signal actions, statistics, and user structure.]

The process structure references all the substructures directly or indirectly. The user structure remains primarily as a historic artifact for the benefit of debuggers. The thread structure contains just the information needed to run in the kernel: information about scheduling, a stack to use when running in the kernel, a thread control block (TCB), and other machine-dependent state. The TCB is defined by the machine architecture; it includes the general-purpose registers, stack pointers, program counter, processor-status longword, and memory-management registers.

In their lightest-weight form, FreeBSD threads share all the process resources including the PID. When additional parallel computation is needed, a new thread is created using the kse_create system call. All the scheduling and management of the threads is handled by a user-level scheduler that is notified of thread transitions via callbacks from the kernel. The user-level thread manager must also keep track of the user-level stacks being used by each of the threads, since the entire address space is shared including the area normally used for the stack. Since the threads all share a single process structure, they have only a single PID and thus show up as a single entry in the ps listing.

Many applications do not wish to share all of a process's resources. The rfork system call creates a new process entry that shares a selected set of resources from its parent. Typically the signal actions, statistics, and the stack and data parts of the address space are not shared. Unlike the lightweight thread created by
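A user-space sketch of the selective sharing that rfork provides: per the rfork(2) description, omitting RFFDG leaves parent and child sharing a single descriptor table, so the child's close() is visible to the parent. /etc/motd is just an arbitrary readable file.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        int fd = open("/etc/motd", O_RDONLY);

        pid_t pid = rfork(RFPROC);           /* new process, shared fd table */
        if (pid == 0) {
            close(fd);                       /* closes it for the parent too */
            _exit(0);
        }
        waitpid(pid, NULL, 0);

        char c;
        if (read(fd, &c, 1) == -1)
            perror("read");                  /* EBADF: the shared fd is gone */
        return 0;
    }

Passing RFPROC | RFFDG instead would copy the descriptor table, restoring ordinary fork behavior for this one resource.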


Table 4.1 Process states.

State    Description
NEW      undergoing process creation
NORMAL   thread(s) will be RUNNABLE, SLEEPING, or STOPPED
ZOMBIE   undergoing process termination

The state element of the process structure holds the current value of the process state. The possible state values are shown in Table 4.1. When a process is first created with a fork system call, it is initially marked as NEW. The state is changed to NORMAL when enough resources are allocated to the process for the latter to begin execution. From that point onward, a process's state will fluctuate among NORMAL (where its thread(s) will be either RUNNABLE—that is, preparing to be or actually executing; SLEEPING—that is, waiting for an event; or STOPPED—that is, stopped by a signal or the parent process) until the process terminates. A deceased process is marked as ZOMBIE until it has freed its resources and communicated its termination status to its parent process.

The system organizes process structures into two lists. Process entries are on the zombproc list if the process is in the ZOMBIE state; otherwise, they are on the allproc list. The two queues share the same linkage pointers in the process structure, since the lists are mutually exclusive. Segregating the dead processes from the live ones reduces the time spent both by the wait system call, which must scan the zombies for potential candidates to return, and by the scheduler and other functions that must scan all the potentially runnable processes.

Most threads, except the currently executing thread (or threads if the system is running on a multiprocessor), are also in one of two queues: a run queue or a sleep queue. Threads that are in a runnable state are placed on a run queue, whereas threads that are blocked awaiting an event are located on a sleep queue. Stopped threads awaiting an event are located on a sleep queue, or they are on neither type of queue. The run queues are organized according to thread-scheduling priority and are described in Section 4.4. The sleep queues are organized in a data structure that is hashed by event identifier. This organization optimizes finding the sleeping threads that need to be awakened when a wakeup occurs for an event. The sleep queues are described in Section 4.3.

The p_pptr pointer and related lists (p_children and p_sibling) are used in locating related processes, as shown in Figure 4.2. When a process spawns a child process, the child process is added to its parent's p_children list. The child process also keeps a backward link to its parent in its p_pptr pointer. If a process has more than one child process active at a time, the children are linked together through their p_sibling list entries. In Figure 4.2, process B is a direct descendant of process A, whereas processes C, D, and E are descendants of process B and are siblings of one another.
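The NEW-to-NORMAL-to-ZOMBIE progression, and the wait scan that ends it, can be watched from user space. In this minimal sketch the one-second sleep simply holds the child in the ZOMBIE state (visible as "Z" in ps output) until the parent reaps it.

    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int
    main(void)
    {
        pid_t pid = fork();                  /* child starts NEW, runs NORMAL */
        if (pid == 0)
            _exit(42);                       /* child terminates: ZOMBIE */

        sleep(1);                            /* child now sits on zombproc */
        int status;
        waitpid(pid, &status, 0);            /* reap: process entry is freed */
        if (WIFEXITED(status))
            printf("child exited with %d\n", WEXITSTATUS(status));
        return 0;
    }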


[Figure 4.2 Process-group hierarchy. Process A's p_children list points to process B; process B's p_children list points to processes C, D, and E, which are chained together through their p_sibling entries. Each child's p_pptr points back to its parent.]

Process B typically would be a shell that started a pipeline (see Sections 2.4 and 2.6) including processes C, D, and E. Process A probably would be the system-initialization process init (see Sections 3.1 and 14.6).

CPU time is made available to threads according to their scheduling class and scheduling priority. As shown in Table 4.2, the FreeBSD kernel has two kernel and three user scheduling classes. The kernel will always run the thread in the highest-priority class. Any kernel-interrupt threads will run in preference to anything else, followed by any top-half-kernel threads. Any runnable real-time threads are run in preference to runnable threads in the time-share and idle classes. Runnable time-share threads are run in preference to runnable threads in the idle class. The priorities of threads in the real-time and idle classes are set by the applications using the rtprio system call and are never adjusted by the kernel. The bottom-half interrupt priorities are set when the devices are configured and never change. The top-half priorities are set based on predefined priorities for each kernel subsystem and never change.

The priorities of threads running in the time-share class are adjusted by the kernel based on resource usage and recent CPU utilization. A thread has two scheduling priorities: one for scheduling user-mode execution and one for scheduling kernel-mode execution. The kg_user_pri field associated with the thread structure contains the user-mode scheduling priority, whereas the td_priority field holds the current scheduling priority. The current priority may be different from the user-mode priority when the thread is executing in the top half of the kernel. Priorities range between 0 and 255, with a lower value interpreted as a higher priority (see Table 4.2). User-mode priorities range from 128 to 255; priorities less than 128 are used only when a thread is asleep—that is, awaiting an event in the kernel—and immediately after such a thread is awakened. Threads in the kernel are given a higher priority because they typically hold shared kernel resources when they awaken. The system wants to run them as quickly as possible once they get a resource so that they can use the resource and return it before another thread requests it and gets blocked waiting for it.

When a thread goes to sleep in the kernel, it must specify whether it should be awakened and marked runnable if a signal is posted to it. In FreeBSD, a kernel thread will be awakened by a signal only if it sets the PCATCH flag when it sleeps. The msleep() interface also handles sleeps limited to a maximum time duration and the processing of restartable system calls. The msleep() interface includes a


  • Thread state: the run state of a thread (runnable, sleeping); additional status flags; if the thread is sleeping, the wait channel, the identity of the event for which the thread is waiting (see Section 4.3), and a pointer to a string describing the event
  • Machine state: the machine-dependent thread information
  • TCB: the user- and kernel-mode execution states
  • Kernel stack: the per-thread execution stack for the kernel

Historically, the kernel stack was mapped to a fixed location in the virtual address space. The reason for using a fixed mapping is that when a parent forks, its run-time stack is copied for its child. If the kernel stack is mapped to a fixed address, the child's kernel stack is mapped to the same addresses as its parent kernel stack. Thus, all its internal references, such as frame pointers and stack-variable references, work as expected.

On modern architectures with virtual address caches, mapping the user structure to a fixed address is slow and inconvenient. FreeBSD 5.2 removes this constraint by eliminating all but the top call frame from the child's stack after copying it from its parent so that it returns directly to user mode, thus avoiding stack copying and relocation problems.

Every thread that might potentially run must have its stack resident in memory because one task of its stack is to handle page faults. If it were not resident, it would page fault when the thread tried to run, and there would be no kernel stack available to service the page fault. Since a system may have many thousands of threads, the kernel stacks must be kept small to avoid wasting too much physical memory. In FreeBSD 5.2 on the PC, the kernel stack is limited to two pages of memory. Implementors must be careful when writing code that executes in the kernel to avoid using large local variables and deeply nested subroutine calls, to avoid overflowing the run-time stack.

As a safety precaution, some architectures leave an invalid page between the area for the run-time stack and the data structures that follow it. Thus, overflowing the kernel stack will cause a kernel-access fault instead of disastrously overwriting other data structures. It would be possible to simply kill the process that caused the fault and continue running. However, the FreeBSD kernel panics on a kernel-access fault because such a fault shows a fundamental design error in the kernel. By panicking and creating a crash dump, the error can usually be pinpointed and corrected.

4.3 Context Switching

The kernel switches among threads in an effort to share the CPU effectively; this activity is called context switching. When a thread executes for the duration of its time slice or when it blocks because it requires a resource that is currently unavailable, the kernel finds another thread to run and context switches to it. The system can also interrupt the currently executing thread to run a thread triggered by an asynchronous event, such as a device interrupt. Although both scenarios involve switching the execution context of the CPU, switching between threads occurs synchronously with respect to the currently executing thread, whereas servicing interrupts occurs asynchronously with respect to the current thread. In addition, interprocess context switches are classified as voluntary or involuntary. A voluntary context switch occurs when a thread blocks because it requires a resource that is unavailable. An involuntary context switch takes place when a thread executes for the duration of its time slice or when the system identifies a higher-priority thread to run.

Each type of context switching is done through a different interface. Voluntary context switching is initiated with a call to the sleep() routine, whereas an involuntary context switch is forced by direct invocation of the low-level context-switching mechanism embodied in the mi_switch() and setrunnable() routines. Asynchronous event handling is triggered by the underlying hardware and is effectively transparent to the system. Our discussion will focus on how asynchronous event handling relates to synchronizing access to kernel data structures.

Thread State

Context switching between threads requires that both the kernel- and user-mode context be changed. To simplify this change, the system ensures that all a thread’s user-mode state is located in one data structure: the thread structure (most kernel state is kept elsewhere). The following conventions apply to this localization:

  • Kernel-mode hardware-execution state: Context switching can take place in only kernel mode. The kernel's hardware-execution state is defined by the contents of the TCB that is located in the thread structure.
  • User-mode hardware-execution state: When execution is in kernel mode, the user-mode state of a thread (such as copies of the program counter, stack pointer, and general registers) always resides on the kernel's execution stack that is located in the thread structure. The kernel ensures this location of user-mode state by requiring that the system-call and trap handlers save the contents of the user-mode execution context each time that the kernel is entered (see Section 3.1).
  • The process structure: The process structure always remains resident in memory.
  • Memory resources: Memory resources of a process are effectively described by the contents of the memory-management registers located in the TCB and by the values present in the process and thread structures. As long as the process remains in memory, these values will remain valid, and context switches can be done without the associated page tables being saved and restored. However, these values need to be recalculated when the process returns to main memory after being swapped to secondary storage.


exits, it awakens its parent’s process-structure address rather than its own. Thus, the parent doing the wait will awaken independent of which child process is the first to exit. Once running, it must scan its list of children to determine which one exited.

  • When a thread does a sigpause system call, it does not want to run until it receives a signal. Thus, it needs to do an interruptible sleep on a wait channel that will never be awakened. By convention, the address of the user structure is given as the wait channel.
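Interruptible sleeps like these are what the PCATCH flag mentioned earlier requests. The kernel-style sketch below is not a standalone program: it follows the FreeBSD 5.2-era msleep() interface, and the device structure and field names are invented.

    /* Assumed kernel headers: <sys/param.h>, <sys/systm.h>,
     * <sys/lock.h>, <sys/mutex.h>, <sys/proc.h>. */

    struct mydev {                           /* hypothetical driver state */
        struct mtx sc_mtx;
        int        sc_ready;
    };

    int
    wait_for_data(struct mydev *sc)
    {
        int error = 0;

        mtx_lock(&sc->sc_mtx);
        while (sc->sc_ready == 0) {
            /* PRIBIO: kernel priority on wakeup; PCATCH: let a posted
             * signal end the sleep; hz ticks = at most one second. */
            error = msleep(&sc->sc_ready, &sc->sc_mtx,
                PRIBIO | PCATCH, "mydat", hz);
            if (error != 0 && error != EWOULDBLOCK)
                break;                       /* EINTR/ERESTART: signalled */
        }
        mtx_unlock(&sc->sc_mtx);
        return (error);
    }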

Sleeping threads are organized in an array of queues (see Figure 4.3). The sleep() and wakeup() routines hash wait channels to calculate an index into the sleep queues (a sketch of this hashing appears after the list below). The sleep() routine takes the following steps in its operation:

  1. Prevent interrupts that might cause thread-state transitions by acquiring the sched_lock mutex (mutexes are explained in the next section).
  2. Record the wait channel in the thread structure and hash the wait-channel value to locate a sleep queue for the thread.
  3. Set the thread's priority to the priority that the thread will have when the thread is awakened and set the SLEEPING flag.
  4. Place the thread at the end of the sleep queue selected in step 2.
  5. Call mi_switch() to request that a new thread be scheduled; the sched_lock mutex is released as part of switching to the other thread.
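A toy rendering of steps 2 and 4, with an invented table size and hash function; the kernel's real hash and queue discipline differ.

    #include <stddef.h>

    #define SLPQUE_SIZE 128                  /* hypothetical table size */

    struct toy_thread {
        const void        *td_wchan;         /* event being awaited */
        struct toy_thread *td_next;
    };

    static struct toy_thread *slpque[SLPQUE_SIZE];

    /* Step 2: hash the wait-channel address to a queue index. */
    static unsigned
    slpque_hash(const void *wchan)
    {
        return (unsigned)(((size_t)wchan >> 4) % SLPQUE_SIZE);
    }

    /* Steps 2 and 4: record the channel, append at the queue's tail. */
    static void
    toy_sleep(struct toy_thread *td, const void *wchan)
    {
        struct toy_thread **qp = &slpque[slpque_hash(wchan)];

        td->td_wchan = wchan;
        td->td_next = NULL;
        while (*qp != NULL)
            qp = &(*qp)->td_next;
        *qp = td;
    }

A wakeup runs the same hash, walks the one queue it selects, and compares each thread's recorded wait channel; a well-spread hash is what keeps wakeups cheap.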

[Figure 4.3 Queueing structure for sleeping threads. An array of hash-table headers indexes the sleep queues; each sleep queue is a doubly linked list of threads chained through the td_slpq.tqe_next and td_slpq.tqe_prev fields.]


A sleeping thread is not selected to execute until it is removed from a sleep queue and is marked runnable. This operation is done by the wakeup() routine, which is called to signal that an event has occurred or that a resource is available. Wakeup() is invoked with a wait channel, and it awakens all threads sleeping on that wait channel. All threads waiting for the resource are awakened to ensure that none are inadvertently left sleeping. If only one thread were awakened, it might not request the resource on which it was sleeping, and any other threads waiting for that resource would be left sleeping forever. A thread that needs an empty disk buffer in which to write data is an example of a thread that may not request the resource on which it was sleeping. Such a thread can use any available buffer. If none is available, it will try to create one by requesting that a dirty buffer be written to disk and then waiting for the I/O to complete. When the I/O finishes, the thread will awaken and will check for an empty buffer. If several are available, it may not use the one that it cleaned, leaving any other threads waiting for the buffer that it cleaned sleeping forever.

In instances where a thread will always use a resource when it becomes available, wakeup_one() can be used instead of wakeup(). The wakeup_one() routine wakes up only the first thread that it finds waiting for a resource. The assumption is that when the awakened thread is done with the resource it will issue another wakeup_one() to notify the next waiting thread that the resource is available. The succession of wakeup_one() calls will continue until all threads waiting for the resource have been awakened and had a chance to use it.

To avoid having excessive numbers of threads awakened, kernel programmers try to use wait channels with fine enough granularity that unrelated uses will not collide on the same resource. Thus, they put locks on each buffer in the buffer cache rather than putting a single lock on the buffer cache as a whole. The problem of many threads awakening for a single resource is further mitigated on a uniprocessor by the latter's inherently single-threaded operation. Although many threads will be put into the run queue at once, only one at a time can execute. Since the uniprocessor kernel runs nonpreemptively, each thread will run its system call to completion before the next one will get a chance to execute. Unless the previous user of the resource blocked in the kernel while trying to use the resource, each thread waiting for the resource will be able to get and use the resource when it is next run.

A wakeup() operation processes entries on a sleep queue from front to back. For each thread that needs to be awakened, wakeup() does the following:

  1. Removes the thread from the sleep queue
  2. Recomputes the user-mode scheduling priority if the thread has been sleeping longer than one second
  3. Makes the thread runnable if it is in a SLEEPING state and places the thread on the run queue if its process is not swapped out of main memory. If the process has been swapped out, the swapin process will be awakened to load it back into memory (see Section 5.12); if the thread is in a STOPPED state, it is


lock of that type may be held when a thread goes to sleep. At the lowest level, the hardware must provide a memory interlocked test-and-set instruction. The test-and-set instruction must allow two operations to be done on a main-memory location—the reading of the existing value followed by the writing of a new value—without any other processor being able to read or write that memory location between the two memory operations. Some architectures support more complex versions of the test-and-set instruction. For example, the PC provides a memory interlocked compare-and-swap instruction.

Spin locks are built from the hardware test-and-set instruction. A memory location is reserved for the lock, with a value of zero showing that the lock is free and a value of one showing that the lock is held. The test-and-set instruction tries to acquire the lock. The lock value is tested and the lock unconditionally set to one. If the tested value is zero, then the lock was successfully acquired and the thread may proceed. If the value is one, then some other thread held the lock, so the thread must loop doing the test-and-set until the thread holding the lock (and running on a different processor) stores a zero into the lock to show that it is done with it.

Spin locks are used for locks that are held only briefly—for example, to protect a list while adding or removing an entry from it. It is wasteful of CPU cycles to use spin locks for resources that will be held for long periods of time. For example, a spin lock would be inappropriate for a disk buffer that would need to be locked throughout the time that a disk I/O was being done. Here a sleep lock should be used. When a thread trying to acquire a sleep-type lock finds that the lock is held, it is put to sleep so that other threads can run until the lock becomes available.

The time to acquire a lock can be variable—for example, a lock needed to search and remove an item from a list. The list usually will be short, for which a spin lock would be appropriate, but will occasionally grow long. Here a hybrid lock is used; the lock spins for a while, but if unsuccessful after a specified number of attempts, it reverts to a sleep-type lock. Most architectures require 100 to 200 instructions to put a thread to sleep and then awaken it again. A spin lock that can be acquired in less than this time is going to be more efficient than a sleep lock. The hybrid lock is usually set to try for about half this time before reverting to a sleep lock. Spin locks are never appropriate on a uniprocessor, since the only way that a resource held by another thread will ever be released will be when that thread gets to run. Thus, spin locks are always converted to sleep locks when running on a uniprocessor.
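The loop just described maps directly onto C11 atomics, where atomic_flag_test_and_set() is the memory-interlocked read-then-write. A minimal user-space sketch (initialize with spinlock_t lk = { ATOMIC_FLAG_INIT };):

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;                    /* clear = free, set = held */
    } spinlock_t;

    static void
    spin_lock(spinlock_t *l)
    {
        /* Returns the old value and unconditionally sets the flag;
         * a true return means another thread already held the lock. */
        while (atomic_flag_test_and_set_explicit(&l->held,
            memory_order_acquire))
            ;                                /* spin until it was clear */
    }

    static void
    spin_unlock(spinlock_t *l)
    {
        /* The holder stores "free"; a spinning CPU's next
         * test-and-set will then succeed. */
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }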

The highest-level locking prevents threads from deadlocking when locking multiple resources. Suppose that two threads, A and B, require exclusive access to two resources, R1 and R2, to do some operation as shown in Figure 4.4. If thread A acquires R1 and thread B acquires R2, then a deadlock occurs when thread A tries to acquire R2 and thread B tries to acquire R1. To avoid deadlock, FreeBSD 5.2 maintains a partial ordering on all the locks. The two partial-ordering rules are as follows (a user-space sketch follows the list):

[Figure 4.4 Partial ordering of resources. Locks R1, R1′, and R1′′ make up Class 1; locks R2, R2′, and R2′′ make up Class 2. Threads A and B each need one lock from each class.]

  1. A thread may acquire only one lock in each class.
  2. A thread may acquire only a lock in a higher-numbered class than the highest-numbered class for which it already holds a lock.
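Following the rules is mechanical, as in this user-space sketch where POSIX mutexes stand in for kernel locks: both threads take the Class 1 lock before the Class 2 lock, so the deadlock described above cannot arise.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t r1 = PTHREAD_MUTEX_INITIALIZER;  /* Class 1 */
    static pthread_mutex_t r2 = PTHREAD_MUTEX_INITIALIZER;  /* Class 2 */

    static void *
    worker(void *name)
    {
        pthread_mutex_lock(&r1);             /* rule 2: lower class first */
        pthread_mutex_lock(&r2);
        printf("%s holds R1 and R2\n", (const char *)name);
        pthread_mutex_unlock(&r2);
        pthread_mutex_unlock(&r1);
        return NULL;
    }

    int
    main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, "thread A");
        pthread_create(&b, NULL, worker, "thread B");
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }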

In Figure 4.4 thread A holds R1 and can request R2, as R1 and R2 are in different classes and R2 is in a higher-numbered class than R1. However, thread B must release R2 before requesting R1, since R2 is in a higher class than R1. Thus, thread A will be able to acquire R2 when it is released by thread B. After thread A completes and releases R1 and R2, thread B will be able to acquire both of those locks and run to completion without deadlock.

Historically, the class members and ordering were poorly documented and unenforced. Violations were discovered when threads would deadlock and a careful analysis done to figure out what ordering had been violated. With an increasing number of developers and a growing kernel, the ad hoc method of maintaining the partial ordering of locks became untenable. A witness module was added to the kernel to derive and enforce the partial ordering of the locks. The witness module keeps track of the locks acquired and released by each thread. It also keeps track of the order in which locks are acquired relative to each other. Each time a lock is acquired, the witness module uses these two lists to verify that a lock is not being acquired in the wrong order. If a lock order violation is detected, then a message is output to the console detailing the locks involved and the locations in question. The witness module also verifies that no spin locks are held when requesting a sleep lock or voluntarily going to sleep.

The witness module can be configured to either panic or drop into the kernel debugger when an order violation occurs or some other witness check fails. When running the debugger, the witness module can output the list of locks held by the current thread to the console along with the filename and line number at which each lock was last acquired. It can also dump the current order list to the console. The code first displays the lock order tree for all the sleep locks. Then it displays


passed to mtx_lock(). The mtx_init() function specifies a type that the witness code uses to classify a mutex when doing checks of lock ordering. It is not permissible to pass the same mutex to mtx_init() multiple times without intervening calls to mtx_destroy().

The mtx_lock() function acquires a mutual exclusion lock for the currently running kernel thread. If another kernel thread is holding the mutex, the caller will sleep until the mutex is available. The mtx_lock_spin() function is similar to the mtx_lock() function except that it will spin until the mutex becomes available. Interrupts are disabled on the CPU holding the mutex during the spin and remain disabled following the acquisition of the lock.

It is possible for the same thread to recursively acquire a mutex with no ill effects if the MTX_RECURSE bit was passed to mtx_init() during the initialization of the mutex. The witness module verifies that a thread does not recurse on a nonrecursive lock. A recursive lock is useful if a resource may be locked at two or more levels in the kernel. By allowing a recursive lock, a lower layer need not check if the resource has already been locked by a higher layer; it can simply lock and release the resource as needed.

The mtx_trylock() function tries to acquire a mutual exclusion lock for the currently running kernel thread. If the mutex cannot be immediately acquired, mtx_trylock() will return 0; otherwise the mutex will be acquired and a nonzero value will be returned. The mtx_unlock() function releases a mutual exclusion lock; if a higher-priority thread is waiting for the mutex, the releasing thread will be put to sleep to allow the higher-priority thread to acquire the mutex and run. A mutex that allows recursive locking maintains a reference count showing the number of times that it has been locked. Each successful lock request must have a corresponding unlock request. The mutex is not released until the final unlock has been done, causing the reference count to drop to zero. The mtx_unlock_spin() function releases a spin-type mutual exclusion lock; interrupt state from before the lock was acquired is restored.

The mtx_destroy() function destroys a mutex so the data associated with it may be freed or otherwise overwritten. Any mutex that is destroyed must previously have been initialized with mtx_init(). It is permissible to have a single reference to a mutex when it is destroyed. It is not permissible to hold the mutex recursively or have another thread blocked on the mutex when it is destroyed. If these rules are violated, the kernel will panic.

The giant lock that protects subsystems in FreeBSD 5.2 that have not yet been converted to operate on a multiprocessor must be acquired before acquiring other mutexes. Put another way: It is impossible to acquire giant nonrecursively while holding another mutex. It is possible to acquire other mutexes while holding giant, and it is possible to acquire giant recursively while holding other mutexes. Sleeping while holding a mutex (except for giant) is almost never safe and should be avoided. There are numerous assertions that will fail if this is attempted.
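A kernel-style sketch of this API in use, following the 5.2-era signatures described above; the counter structure is invented for illustration, and this is not a standalone program.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    struct counter {                         /* hypothetical shared object */
        struct mtx c_mtx;
        int        c_val;
    };

    void
    counter_init(struct counter *c)
    {
        /* Name used for debugging and witness classification; MTX_DEF
         * asks for an ordinary (non-spin) mutex. */
        mtx_init(&c->c_mtx, "counter", NULL, MTX_DEF);
        c->c_val = 0;
    }

    int
    counter_bump(struct counter *c)
    {
        if (mtx_trylock(&c->c_mtx) == 0)     /* 0 means not acquired */
            return (-1);
        c->c_val++;
        mtx_unlock(&c->c_mtx);
        return (0);
    }

    void
    counter_destroy(struct counter *c)
    {
        mtx_destroy(&c->c_mtx);              /* no sleepers, no recursion */
    }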


Lock-Manager Locks

Interprocess synchronization to a resource typically is implemented by associating it with a lock structure. The kernel has a lock manager that manipulates a lock. The operations provided by the lock manager are

  • Request shared: Get one of many possible shared locks. If a thread holding an exclusive lock requests a shared lock, the exclusive lock will be downgraded to a shared lock.
  • Request exclusive: Stop further shared locks; when they are cleared, grant a pending upgrade (see following) if it exists, and then grant an exclusive lock. Only one exclusive lock may exist at a time, except that a thread holding an exclusive lock may get additional exclusive locks if it explicitly sets the canrecurse flag in the lock request or if the canrecurse flag was set when the lock was initialized.
  • Request upgrade: The thread must hold a shared lock that it wants to have upgraded to an exclusive lock. Other threads may get exclusive access to the resource between the time that the upgrade is requested and the time that it is granted.
  • Request exclusive upgrade: The thread must hold a shared lock that it wants to have upgraded to an exclusive lock. If the request succeeds, no other threads will have gotten exclusive access to the resource between the time that the upgrade is requested and the time that it is granted. However, if another thread has already requested an upgrade, the request will fail.
  • Request downgrade: The thread must hold an exclusive lock that it wants to have downgraded to a shared lock. If the thread holds multiple (recursive) exclusive locks, they will all be downgraded to shared locks.
  • Request release: Release one instance of a lock.
  • Request drain: Wait for all activity on the lock to end, and then mark it decommissioned. This feature is used before freeing a lock that is part of a piece of memory that is about to be released.

Locks must be initialized before their first use by calling the lockinit() function. Parameters to the lockinit() function include the following (a usage sketch follows the list):

  • A top-half kernel priority at which the thread should run once it has acquired the lock
  • Flags such as canrecurse that allow the thread currently holding an exclusive lock to get another exclusive lock rather than panicking with a "locking against myself" failure
  • A string that describes the resource that the lock protects, referred to as the wait channel message
  • An optional maximum time to wait for the lock to become available
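A kernel-style sketch of lock-manager use, following the 5.2-era lockinit() and lockmgr() interfaces as described here; the object structure, priority, and wait message are invented, and this is not a standalone program.

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/lockmgr.h>

    struct object {                          /* hypothetical shared object */
        struct lock o_lock;
    };

    void
    object_init(struct object *o)
    {
        /* priority PVFS, wait message "objlk", no timeout, no flags */
        lockinit(&o->o_lock, PVFS, "objlk", 0, 0);
    }

    void
    object_read(struct object *o, struct thread *td)
    {
        lockmgr(&o->o_lock, LK_SHARED, NULL, td);    /* one of many readers */
        /* ... examine the object ... */
        lockmgr(&o->o_lock, LK_RELEASE, NULL, td);
    }

    void
    object_write(struct object *o, struct thread *td)
    {
        lockmgr(&o->o_lock, LK_EXCLUSIVE, NULL, td); /* sole writer */
        /* ... modify the object ... */
        lockmgr(&o->o_lock, LK_RELEASE, NULL, td);
    }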
