Class 7: Internal Kernel Interfaces, part II

Date: 08.04.2025 Small task #6

Information about the process

When writing code that will execute in the kernel address space, it is important to keep in mind all the time that the thread of execution is associated with a certain user process, on behalf of which the kernel performs certain operations. Using the current macro (asm/current.h) we can easily (and quickly) get to all the information that the kernel stores about the current process in the task_struct structure (linux/sched.h).

One exception is interrupt handling - while during the execution of the hardware interrupt handling function, the current macro may point to some process, it should not be referenced (it is not related to this interrupt) as well as you must not switch to other processes (i.e., you can't perform blocking operations either).

Data exchange between user and kernel address space

To read/write something from/to the memory space of user programs, use the following functions (actually macros):

put_user(kptr, ptr).

write a byte/word/long word into user program memory space (from under the address ptr); the macro definition works automagically - the size is determined by the type to which kptr points.

get_user(kptr, ptr).

as above, but reading

Use the following functions to copy larger areas of memory:

unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);

The former allows copying data from the user address space to the kernel address space, the latter the opposite. In general, they behave like memcpy, but it is important to note that in case of an address page error of the user space, they can cause the process to sleep until the page is downloaded from the swap. Before copying, the correctness of the address in user space is checked. If the beginning of the area is correct, but the rest is not, the longest possible fragment is copied.

The value returned by both functions is the number of NOT copied bytes - a non-zero value indicates a copy error.

The functions and their corresponding macro definitions are defined in the asm/uaccess.h file. Note that functions for blocks of powers of two are optimized.

In case of an error when copying from/to user space, syscalls should return -EFAULT.

Mutual exclusion

Synchronization in the Linux kernel continues to gain importance with its development. It is worth examining more carefully the related issues described here and presented in the lecture.

All of the following functions work correctly on both single and multiprocessor systems. Many of them are implemented internally in two versions, providing higher performance on single-processor systems. For example, the implementation of spinlocks boils down to disabling kernel preemption (or disabling interrupts).

Kernel documentation: https://docs.kernel.org/locking/locktypes.html

Mutex

Presented in previous classes. In-kernel documentation is here.

System semaphores

Semaphores in Linux are used to synchronize processes and are represented by the struct semaphore structure (defined in asm/semaphore.h).

Semaphores can be declared using macros such as:

static DECLARE_SEMAPHORE_GENERIC(sem_general, 11);

The following operations, among others, are available:

void down(struct semaphore *sem)
int down_interruptible(struct semaphore *sem)
int down_trylock(struct semaphore *sem)

Functions that lower the system semaphore. The variants are analogous to regular locks.

void up(struct semaphore *sem)

This function raises the system semaphore.

Linux also provides reader-writer semaphores of type struct rw_semaphore (linux/rwsem.h).

Hands-on

Please familiarize yourself with the functions available for semaphores of type readers-writers: (linux/rwsem.h).

Spin locks (spin locks)

Spin locks have a similar effect to ordinary locks, but they use active waiting instead of blocking the process. Thus, they can be used in places where blocking the process would be unacceptable (primarily interrupt handling functions), or when ordinary locks would significantly reduce performance (i.e., with a small number of protected operations).

Note that NO locking operations can be performed once a spinning lock is taken! (The most important of the functions that do not work with spinning locks are: copy_to/from_user, kmalloc, down, sleep_on, mutex_lock). Before calling them, the lock must be released.

Basic spinlocks

Spinlocks (defined in asm/spinlock.h) are used in a general case in the following way:

spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
unsigned long flags;

spin_lock_irqsave(&xxx_lock, flags);
/* ... critical section ... */
spin_unlock_irqrestore(&xxx_lock, flags);

The above call is always safe (because it blocks interrupts on the local processor, and then restores the original interrupt handling).

There is also a simplified version that we can use when a given lock is NEVER used in an interrupt handler function:

spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;

spin_lock(&xxx_lock);
/* ... critical section ... */
spin_unlock(&xxx_lock);

Reader-writer spinlocks

This type of spinlock (defined in asm/spinlock.h) allows non-exclusive reading and exclusive full data access (e.g., writing):

rwlock_t xxx_lock = RW_LOCK_UNLOCKED;
unsigned long flags;

read_lock_irqsave(&xxx_lock, flags);
/* .. critical section reading ... */
read_unlock_irqrestore(&xxx_lock, flags);

write_lock_irqsave(&xxx_lock, flags);
/* .. critical section with exclusive access ... */
write_unlock_irqrestore(&xxx_lock, flags);

This works well with complex data structures, when most of the operations consist of traversing (reading) the structure. Only the actual change (writing) requires exclusivity.

There is also a simplified version (without interrupt blocking) of reader-writer type spinning locks for use in situations, when the use of the lock cannot appear in the interrupt handler program:

rwlock_t xxx_lock = RW_LOCK_UNLOCKED;

read_lock(&xxx_lock);
/* .. critical section reading ... */
read_unlock(&xxx_lock);

write_lock(&xxx_lock);
/* .. critical section with exclusive access ... */
write_unlock(&xxx_lock);

Debugging locks

The kernel has a fairly sophisticated built-in lock checking facility called lockdep. It allows detecting potential deadlocks by analyzing the sequence of locks occupied by individual kernel threads. The deadlock does not have to occur to be detected - it is sufficient, that any sequence required for a deadlock has ever been used by a thread.

The detected classes of errors are simple lock usage errors (e.g., uninitialized locks, recursive locking), deadlocks caused by cyclic lock dependencies, and deadlocks caused by incorrect use of locks inside interrupt handlers.

Lock debugging should be explicitly enabled in the kernel configuration, as it slows down the system.

Blocking operations

Wait queues

A process that must wait for a certain event to occur (e.g., completion of an I/O operation, or arrival of data in a FIFO queue), is removed from the ready process queue and placed in the waitqueues The process state changes from TASK_RUNNING to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE. The structure representing the head of the list in which such processes are placed is wait_queue_head_t, defined in the linux/wait.h file.

Selected operations:

DECLARE_WAIT_QUEUE_HEAD(name).

Before using wait_queue you need to initialize it.

wait_event(name, cond).

puts the process in the wait queue in a signal-resistant state (TASK_UNINTERRUPTIBLE) and sleeping until the cond occurs

wait_event_interruptible(name, cond).

as above, but transitioning to the TASK_INTERRUPTIBLE state. (the arrival of a signal will cause a transition to the TASK_RUNNING state), returns 0 if cond has occurred, -ERESTARTSYS if interrupted by a signal

wait_event_timeout(name,cond,timeout).

like wait_event, but with waking up after the specified timeout even if cond has not occurred

wait_event_interruptible_timeout(name,cond,timeout).

as above, but transition to TASK_INTERRUPTIBLE state.

wake_up(name).

wakes up all waiting processes without the WQ_FLAG_EXCLUSIVE flag set, and one that has it set. Note: the function does not remove the process from the waiting queue - the process will 'remove' itself, after resuming operation

wake_up_interruptible(name).

as above, but only for processes in TASK_INTERRUPTIBLE state.

wake_up_all(name).

insert all processes (in state TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE). located in the waiting queue to the ready process queue

The above operations are defined in linux/sched.h (except the first one, defined in linux/wait.h).

When using wait queues, care should be taken to avoid races. For example, if we are a reader and we want to wait for the writer to write any data to a lock-protected buffer, the waiting code might look like this:

/* definitions */
DEFINE_MUTEX(lock);
DECLARE_WAIT_QUEUE_HEAD(queue);
int pos_read, pos_write;
char *buffer;

/* loading a byte */
char character;
/* locking */
mutex_lock(&block);
/* check if we already have data to read */
while (pos_read == pos_write) {
    /* we don't have - we need to remove the lock so that the writer can do anything */
    mutex_unlock(&block);
    /* we are waiting for a write - the condition ensures that we don't wait,
       if the writer has updated pos_write between the time we gave up the lock,
       and we joined the queue - otherwise the
       writer's wake_up would not cover our process and we would wait
       much longer (perhaps forever). The condition is checked
       after being added to the waiting queue, but before the actual
       waiting. */
    wait_event(queue, pos_read != pos_write);
    /* wait_event only ensures that at some point since the call
       pos_read != pos_write occurred, but some other thread may have
       emptied the buffer in the meantime - we can't assume anything
       until we check this condition while holding the lock - so we take the
       lock and check the while condition again, until we get it */
    mutex_lock(&block);
}
/* successful - we can read the character */.
character = buffer[pos_read++];
mutex_unlock(&block);

Waiting for completion

This is a rather strange variant of the semaphore:

  • we create a structure struct completion, which initially has a value of 0

  • we can perform the complete operation, which increases the value of the structure by 1

  • we can perform the wait_for_completion operation (or its variants), which waits, until the structure has a positive value, then decreases it by 1

  • we can perform the complete_all operation, which permanently sets the value of the structure to UINT_MAX (all waiting threads are woken up, all further executions of complete and wait_for_completion will be no-ops)

Here is a typical use of the 'wait for completion' mechanism:

struct completion event;

init_completion(&event);

/* .. pass a pointer to event to whoever is to wake up .... */

wait_for_completion(&event);

The wake-up caller calls:

complete(&event)

when the appropriate time arrives.

The implementation is in the kernel/sched/completion.c file, while the structure itself is declared in linux/completion.h.

Other useful kernel features

The kernel library contains many other ready-to-use functions, which can prove useful when writing a great variety of drivers. Their documentation may be found under the Core API documentation section. Among the more useful ones can be mentioned:

linux/idr.h.

int map into void * with dynamic identifier allocation

linux/kref.h.

easy reference counting

linux/bitmap.h

efficient bitmap arrays

linux/btree.h.

B-trees

linux/bug.h, asm-generic/bug.h.

Reporting critical errors in the driver code resulting from code defects and warnings (something like assert)

linux/circ_buf.h.

cyclic buffers

linux/hash.h.

simple hash functions

linux/kernel.h.

Miscellaneous simple functions:

ALIGN(x,a).

aligns x down to a multiple of a (a must be a power of two)

PTR_ALIGN(p, a).

as above, but on pointers

IS_ALIGNED(x, a).

checks if x is already aligned

ARRAY_SIZE(arr).

array size arr

DIV_ROUND_UP(n,d).

n/d, rounding up.

roundup(x, y).

rounds x up to a multiple of y.

upper_32_bits(x), lower_32_bits(x).

as in the name

might_sleep().

denotes where the code can sleep, helps with debugging (throws error if spinlock debugging is enabled and spinlock is held)

min(x,y), max(x,y).

as in the name

clamp(val, min, max).

val clipped to the range [min, max].

linux/kobject.h.

general object type with reference counting and visibility in sysfs (among others, cdev and device are based on them).

linux/parser.h.

a simple parser for options

linux/rbtree.h.

red-black trees

Small task #6

Implement "bookmark" support for the lseek syscall (and related ones):

  • add a new field to the file structure — a stack of "bookmarks" (i.e., saved positions in the file)

  • add a new reference point to the lseek syscall: SEEK_BOOKMARK, using the bookmark at the top of the bookmarks stack as a reference point

  • add two new flags to lseek: SEEK_PUSH and SEEK_POP (which can be ORed with the other flags):

    • SEEK_PUSH pushes the current (before calling the syscall) position in the file onto the bookmark stack

    • SEEK_POP removes the bookmark from the stack

In case of an error (using SEEK_BOOKMARK or SEEK_POP without bookmarks), the error EINVAL should be returned and no do nothing.

The appropriate place to make changes would be the vfs_llseek function.

Please submit a patch generated with git format-patch.