Class 7: Internal Kernel Interfaces, part II¶
Date: 08.04.2025 Small task #6
Information about the process¶
When writing code that will execute in the kernel address space,
it is important to keep in mind all the time that the thread of execution is
associated with a certain user process, on behalf of which the kernel
performs certain operations.
Using the current
macro (asm/current.h
) we can easily (and quickly) get to all the
information that the kernel stores about the current process in the task_struct
structure (linux/sched.h
).
One exception is interrupt handling - while during the execution of the hardware interrupt handling function,
the current
macro may point to some process,
it should not be referenced (it is not related to this interrupt)
as well as you must not switch to other processes
(i.e., you can't perform blocking operations either).
Data exchange between user and kernel address space¶
To read/write something from/to the memory space of user programs, use the following functions (actually macros):
put_user(kptr, ptr)
.write a byte/word/long word into user program memory space (from under the address
ptr
); the macro definition works automagically - the size is determined by the type to whichkptr
points.get_user(kptr, ptr)
.as above, but reading
Use the following functions to copy larger areas of memory:
unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);
The former allows copying data from the user address space to the kernel address space, the latter the opposite.
In general, they behave like memcpy
,
but it is important to note that in case of an address page error of the user space,
they can cause the process to sleep until the page is downloaded from the swap.
Before copying, the correctness of the address in user space is checked.
If the beginning of the area is correct, but the rest is not, the longest possible fragment is copied.
The value returned by both functions is the number of NOT copied bytes - a non-zero value indicates a copy error.
The functions and their corresponding macro definitions are defined in the asm/uaccess.h
file.
Note that functions for blocks of powers of two are optimized.
In case of an error when copying from/to user space, syscalls should return -EFAULT
.
Mutual exclusion¶
Synchronization in the Linux kernel continues to gain importance with its development. It is worth examining more carefully the related issues described here and presented in the lecture.
All of the following functions work correctly on both single and multiprocessor systems. Many of them are implemented internally in two versions, providing higher performance on single-processor systems. For example, the implementation of spinlocks boils down to disabling kernel preemption (or disabling interrupts).
Kernel documentation: https://docs.kernel.org/locking/locktypes.html
Mutex¶
Presented in previous classes. In-kernel documentation is here.
System semaphores¶
Semaphores in Linux are used to synchronize processes and are represented by the struct semaphore
structure (defined in asm/semaphore.h
).
Semaphores can be declared using macros such as:
static DECLARE_SEMAPHORE_GENERIC(sem_general, 11);
The following operations, among others, are available:
void down(struct semaphore *sem)
int down_interruptible(struct semaphore *sem)
int down_trylock(struct semaphore *sem)
Functions that lower the system semaphore. The variants are analogous to regular locks.
void up(struct semaphore *sem)
This function raises the system semaphore.
Linux also provides reader-writer semaphores of type struct rw_semaphore
(linux/rwsem.h
).
Hands-on
Please familiarize yourself with the functions available for semaphores of type readers-writers: (linux/rwsem.h
).
Spin locks (spin locks)¶
Spin locks have a similar effect to ordinary locks, but they use active waiting instead of blocking the process. Thus, they can be used in places where blocking the process would be unacceptable (primarily interrupt handling functions), or when ordinary locks would significantly reduce performance (i.e., with a small number of protected operations).
Note that NO locking operations can be performed once a spinning lock is taken!
(The most important of the functions that do not work with spinning locks are:
copy_to/from_user
, kmalloc
, down
, sleep_on
, mutex_lock
).
Before calling them, the lock must be released.
Basic spinlocks¶
Spinlocks (defined in asm/spinlock.h
) are used in a general case in the following way:
spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
unsigned long flags;
spin_lock_irqsave(&xxx_lock, flags);
/* ... critical section ... */
spin_unlock_irqrestore(&xxx_lock, flags);
The above call is always safe (because it blocks interrupts on the local processor, and then restores the original interrupt handling).
There is also a simplified version that we can use when a given lock is NEVER used in an interrupt handler function:
spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
spin_lock(&xxx_lock);
/* ... critical section ... */
spin_unlock(&xxx_lock);
Reader-writer spinlocks¶
This type of spinlock (defined in asm/spinlock.h
) allows non-exclusive reading and exclusive full data access (e.g., writing):
rwlock_t xxx_lock = RW_LOCK_UNLOCKED;
unsigned long flags;
read_lock_irqsave(&xxx_lock, flags);
/* .. critical section reading ... */
read_unlock_irqrestore(&xxx_lock, flags);
write_lock_irqsave(&xxx_lock, flags);
/* .. critical section with exclusive access ... */
write_unlock_irqrestore(&xxx_lock, flags);
This works well with complex data structures, when most of the operations consist of traversing (reading) the structure. Only the actual change (writing) requires exclusivity.
There is also a simplified version (without interrupt blocking) of reader-writer type spinning locks for use in situations, when the use of the lock cannot appear in the interrupt handler program:
rwlock_t xxx_lock = RW_LOCK_UNLOCKED;
read_lock(&xxx_lock);
/* .. critical section reading ... */
read_unlock(&xxx_lock);
write_lock(&xxx_lock);
/* .. critical section with exclusive access ... */
write_unlock(&xxx_lock);
Debugging locks¶
The kernel has a fairly sophisticated built-in lock checking facility called lockdep. It allows detecting potential deadlocks by analyzing the sequence of locks occupied by individual kernel threads. The deadlock does not have to occur to be detected - it is sufficient, that any sequence required for a deadlock has ever been used by a thread.
The detected classes of errors are simple lock usage errors (e.g., uninitialized locks, recursive locking), deadlocks caused by cyclic lock dependencies, and deadlocks caused by incorrect use of locks inside interrupt handlers.
Lock debugging should be explicitly enabled in the kernel configuration, as it slows down the system.
Blocking operations¶
Wait queues¶
A process that must wait for a certain event to occur
(e.g., completion of an I/O operation, or arrival of data in a FIFO queue),
is removed from the ready process queue and placed in the waitqueues
The process state changes from TASK_RUNNING
to TASK_INTERRUPTIBLE
or TASK_UNINTERRUPTIBLE
.
The structure representing the head of the list in which such processes are placed is wait_queue_head_t
,
defined in the linux/wait.h
file.
Selected operations:
DECLARE_WAIT_QUEUE_HEAD(name)
.Before using
wait_queue
you need to initialize it.wait_event(name, cond)
.puts the process in the wait queue in a signal-resistant state (
TASK_UNINTERRUPTIBLE
) and sleeping until the cond occurswait_event_interruptible(name, cond)
.as above, but transitioning to the
TASK_INTERRUPTIBLE
state. (the arrival of a signal will cause a transition to theTASK_RUNNING
state), returns 0 ifcond
has occurred,-ERESTARTSYS
if interrupted by a signalwait_event_timeout(name,cond,timeout)
.like
wait_event
, but with waking up after the specified timeout even ifcond
has not occurredwait_event_interruptible_timeout(name,cond,timeout)
.as above, but transition to
TASK_INTERRUPTIBLE
state.wake_up(name)
.wakes up all waiting processes without the
WQ_FLAG_EXCLUSIVE
flag set, and one that has it set. Note: the function does not remove the process from the waiting queue - the process will 'remove' itself, after resuming operationwake_up_interruptible(name)
.as above, but only for processes in
TASK_INTERRUPTIBLE
state.wake_up_all(name)
.insert all processes (in state
TASK_INTERRUPTIBLE
orTASK_UNINTERRUPTIBLE
). located in the waiting queue to the ready process queue
The above operations are defined in linux/sched.h
(except the first one, defined in linux/wait.h
).
When using wait queues, care should be taken to avoid races. For example, if we are a reader and we want to wait for the writer to write any data to a lock-protected buffer, the waiting code might look like this:
/* definitions */
DEFINE_MUTEX(lock);
DECLARE_WAIT_QUEUE_HEAD(queue);
int pos_read, pos_write;
char *buffer;
/* loading a byte */
char character;
/* locking */
mutex_lock(&block);
/* check if we already have data to read */
while (pos_read == pos_write) {
/* we don't have - we need to remove the lock so that the writer can do anything */
mutex_unlock(&block);
/* we are waiting for a write - the condition ensures that we don't wait,
if the writer has updated pos_write between the time we gave up the lock,
and we joined the queue - otherwise the
writer's wake_up would not cover our process and we would wait
much longer (perhaps forever). The condition is checked
after being added to the waiting queue, but before the actual
waiting. */
wait_event(queue, pos_read != pos_write);
/* wait_event only ensures that at some point since the call
pos_read != pos_write occurred, but some other thread may have
emptied the buffer in the meantime - we can't assume anything
until we check this condition while holding the lock - so we take the
lock and check the while condition again, until we get it */
mutex_lock(&block);
}
/* successful - we can read the character */.
character = buffer[pos_read++];
mutex_unlock(&block);
Waiting for completion¶
This is a rather strange variant of the semaphore:
we create a structure
struct completion
, which initially has a value of 0we can perform the
complete
operation, which increases the value of the structure by 1we can perform the
wait_for_completion
operation (or its variants), which waits, until the structure has a positive value, then decreases it by 1we can perform the
complete_all
operation, which permanently sets the value of the structure toUINT_MAX
(all waiting threads are woken up, all further executions ofcomplete
andwait_for_completion
will be no-ops)
Here is a typical use of the 'wait for completion' mechanism:
struct completion event;
init_completion(&event);
/* .. pass a pointer to event to whoever is to wake up .... */
wait_for_completion(&event);
The wake-up caller calls:
complete(&event)
when the appropriate time arrives.
The implementation is in the kernel/sched/completion.c
file,
while the structure itself is declared in linux/completion.h
.
Other useful kernel features¶
The kernel library contains many other ready-to-use functions, which can prove useful when writing a great variety of drivers. Their documentation may be found under the Core API documentation section. Among the more useful ones can be mentioned:
linux/idr.h
.int
map intovoid *
with dynamic identifier allocationlinux/kref.h
.easy reference counting
linux/bitmap.h
efficient bitmap arrays
linux/btree.h
.B-trees
linux/bug.h
,asm-generic/bug.h
.Reporting critical errors in the driver code resulting from code defects and warnings (something like
assert
)linux/circ_buf.h
.cyclic buffers
linux/hash.h
.simple hash functions
linux/kernel.h
.Miscellaneous simple functions:
ALIGN(x,a)
.aligns
x
down to a multiple ofa
(a
must be a power of two)PTR_ALIGN(p, a)
.as above, but on pointers
IS_ALIGNED(x, a)
.checks if
x
is already alignedARRAY_SIZE(arr)
.array size
arr
DIV_ROUND_UP(n,d)
.n/d
, rounding up.roundup(x, y)
.rounds
x
up to a multiple ofy
.upper_32_bits(x), lower_32_bits(x)
.as in the name
might_sleep()
.denotes where the code can sleep, helps with debugging (throws error if spinlock debugging is enabled and spinlock is held)
min(x,y), max(x,y)
.as in the name
clamp(val, min, max)
.val
clipped to the range [min
,max
].
linux/kobject.h
.general object type with reference counting and visibility in sysfs (among others,
cdev
anddevice
are based on them).linux/parser.h
.a simple parser for options
linux/rbtree.h
.red-black trees
Small task #6¶
Implement "bookmark" support for the lseek
syscall (and related ones):
add a new field to the
file
structure — a stack of "bookmarks" (i.e., saved positions in the file)add a new reference point to the
lseek
syscall:SEEK_BOOKMARK
, using the bookmark at the top of the bookmarks stack as a reference pointadd two new flags to
lseek
:SEEK_PUSH
andSEEK_POP
(which can be ORed with the other flags):SEEK_PUSH
pushes the current (before calling the syscall) position in the file onto the bookmark stackSEEK_POP
removes the bookmark from the stack
In case of an error (using SEEK_BOOKMARK
or SEEK_POP
without bookmarks), the error EINVAL
should be returned and no do nothing.
The appropriate place to make changes would be the vfs_llseek
function.
Please submit a patch generated with git format-patch
.