Class 5: Internal Kernel Interfaces, part I¶

Date: 25.03.2025

Introduction¶

In the previous classes we discussed the procedure for compiling kernel sources. Today we will show the most commonly used internal interfaces in the kernel.

The kernel is written in C, but it cannot use the standard C language library (or any other library we are familiar with in user space). It does, however, have its own library of useful functions. Some of them are identical to those of the standard library (or very similar), but many of them have significant differences, or are completely Linux-specific.

Examples of functions available in the kernel:

most of the functions known from string.h (memcpy, strcmp, strcpy, ... )
kstrto[u](int|l|ll): functions for converting from strings to numbers, similar to the standard strto*, but with a different interface
malloc/free/calloc: do not exist, replaced by kmalloc, vmalloc, and several other memory allocators depending on the case
snprintf/sscanf: work similarly to the usual ones, but have a different set of formats (e.g., %pI4 prints an IPv4 address)
bsearch: as in the C standard
sort: like the standard qsort, but it additionally takes a function swapping two elements (NULL may be passed to use a default swap)

These functions are contained in different headers than usual: we have to use, for example, linux/string.h, linux/bsearch.h, etc.
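To illustrate how the kstrto* interface differs from strto*, here is a minimal user-space sketch. The name sketch_kstrtoint is hypothetical: it only imitates the calling convention of the kernel's kstrtoint (status in the return value, the parsed number through a pointer, at most one trailing newline accepted). Unlike the real function it is built on strtol, so it also tolerates leading whitespace.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical user-space stand-in that copies the calling convention of
 * the kernel's kstrtoint(): 0 or a negative errno is returned, and the
 * parsed number is stored through *res only on success.
 */
static int sketch_kstrtoint(const char *s, unsigned int base, int *res)
{
    char *end;
    long val;

    errno = 0;
    val = strtol(s, &end, (int)base);
    if (end == s)
        return -EINVAL;     /* no digits were parsed */
    if (*end == '\n')
        end++;              /* one trailing newline is accepted */
    if (*end != '\0')
        return -EINVAL;     /* trailing garbage */
    if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
        return -ERANGE;     /* value does not fit in an int */
    *res = (int)val;
    return 0;
}

/* Usage: the whole string must be a number, otherwise an error is returned. */
static void sketch_kstrtoint_demo(void)
{
    int v = 0;

    assert(sketch_kstrtoint("42\n", 10, &v) == 0 && v == 42);
    assert(sketch_kstrtoint("42abc", 10, &v) == -EINVAL);
    assert(sketch_kstrtoint("99999999999999999999", 10, &v) == -ERANGE);
}
```

The design point is that a partially parsed string is an error rather than a silently truncated result, which is usually what code parsing user-supplied values wants.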

printk¶

For debugging purposes and reporting important events, you can use the printk function, which works similarly to printf:

printk(KERN_WARNING "Crashed, error code: %d\n", err);

In front of the message, include its priority (note the lack of a comma), which can be (in order of ascending priority):

KERN_DEBUG: alias pr_debug()
KERN_INFO: alias pr_info()
KERN_NOTICE: alias pr_notice()
KERN_WARNING: alias pr_warn()
KERN_ERR: alias pr_err()
KERN_CRIT: alias pr_crit()
KERN_ALERT: alias pr_alert()
KERN_EMERG: alias pr_emerg()
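The missing comma works because each KERN_* constant is a string literal, and adjacent string literals are concatenated by the compiler. The sketch below reproduces two of the definitions from include/linux/kern_levels.h in user space; the helpers message_level and example_message are hypothetical and exist only for this illustration. Note that a lower digit means a higher priority (KERN_EMERG is "0", KERN_DEBUG is "7").

```c
#include <assert.h>
#include <stddef.h>

/* These mirror the definitions in include/linux/kern_levels.h. */
#define KERN_SOH     "\001"         /* ASCII Start Of Header */
#define KERN_WARNING KERN_SOH "4"
#define KERN_DEBUG   KERN_SOH "7"

/*
 * Hypothetical helper: returns the level digit encoded at the start of
 * a message, or -1 if the message carries no level prefix.
 */
static int message_level(const char *msg)
{
    assert(msg != NULL);
    if (msg[0] == '\001' && msg[1] >= '0' && msg[1] <= '7')
        return msg[1] - '0';
    return -1;
}

/* The prefix and the format string become one literal at compile time. */
static const char *example_message(void)
{
    return KERN_WARNING "Crashed, error code: %d\n";
}
```

With these definitions, message_level(example_message()) yields 4, and the format text itself starts at offset 2 of the string.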

Messages output by printk are added to the system log, which can be viewed with the dmesg command. If their priority is high enough, they are also printed immediately to the console. To change the threshold from which messages are printed directly to the console, use the command dmesg -n <level>, where 8 causes all messages to be printed and 1 only the most urgent ones (emergencies).

Documentation: basics, formats

Dynamically allocating memory for the kernel¶

There are many functions in the kernel that allow dynamic memory allocation. By far the most important and most commonly used is kmalloc (linux/slab.h):

void *kmalloc(size_t size, gfp_t flags);
void kfree(void *obj);

The kmalloc function allocates a physically contiguous area of memory of up to 32 pages (on x86 this gives just under 128 KiB; part of the memory is reserved by the kernel for the block header). Allocation is fast (buddy algorithm). The flags parameter specifies the type of allocation (GFP_* constants defined in the linux/gfp.h file) -- the most important are:

GFP_KERNEL -- the most commonly used, it can be blocking, so it can be called only from the process context or in its own thread.
GFP_NOWAIT/GFP_ATOMIC - does not block, can be called from interrupt handling routines (although usually a bad idea anyway).

void *vmalloc(size_t size);
void vfree(void *addr);

With vmalloc you can allocate an arbitrarily large area (provided that there is enough free physical memory), but it is only virtually (hence the v) contiguous: it may be physically non-contiguous, because this memory is mapped through address translation. Using this function without a good reason is not recommended.

struct page *alloc_pages(gfp_t flags, unsigned int order);
void __free_pages(struct page *page, unsigned int order);

Allocates 2**order whole, physically contiguous pages; the flags parameter specifies how to allocate the pages (as in kmalloc).
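Since alloc_pages takes an order rather than a size, the caller has to round a byte count up to a power-of-two number of pages. The kernel provides get_order() for this; the following user-space sketch only shows the arithmetic. The names SKETCH_PAGE_SIZE and order_for_size are hypothetical, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL     /* assumption: 4 KiB pages, as on x86 */

/* Smallest order such that 2**order pages can hold `size` bytes. */
static unsigned int order_for_size(size_t size)
{
    /* Number of pages needed, rounded up without risking overflow. */
    size_t pages = size / SKETCH_PAGE_SIZE + (size % SKETCH_PAGE_SIZE != 0);
    unsigned int order = 0;

    while (((size_t)1 << order) < pages)
        order++;
    return order;
}

/* Usage: three pages do not fit in order 1 (two pages), so order 2 is needed. */
static void order_for_size_demo(void)
{
    assert(order_for_size(4096) == 0);
    assert(order_for_size(4097) == 1);
    assert(order_for_size(3 * 4096) == 2);
}
```

The last case shows the cost of this interface: a request for three pages occupies four, so sizes just above a power of two waste almost half of the allocation.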

Documentation: https://docs.kernel.org/core-api/memory-allocation.html

Fixed-size allocator¶

When you have a lot of objects of identical size, it can be useful to create your own heap dedicated to a particular type of objects. The following functions are used for this:

struct kmem_cache *kmem_cache_create(
    const char *name, unsigned int size, unsigned int align,
    slab_flags_t flags,
    void (*ctor)(void *));

void kmem_cache_destroy(struct kmem_cache *cachep);

0 is usually specified as the flags parameter (as most flags are only used for debugging).

The struct kmem_cache is our private heap -- it consists of dynamically allocated pages cut into chunks of exactly the specified length with minimal overhead, additionally arranged to make the most of the CPU cache. Objects are allocated from it and returned to it with the following functions:

void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
void kmem_cache_free(struct kmem_cache *cachep, void *objp);

The memory for new objects is initialized with the constructor given when creating the cache. For simple cases, where no constructor is needed, the KMEM_CACHE macro wraps kmem_cache_create.

Error handling in the kernel¶

When writing kernel-mode code, it is important to remember that the stability of the entire system depends on the correct operation of our code -- it is absolutely necessary to handle all possible errors in our module in a way that does not interfere with the rest of the kernel, as well as to strictly control the lifetime of allocated resources (a memory leak in the kernel makes it necessary to periodically reboot the entire system).

Most non-trivial functions in the kernel can fail and return an error code as their result. Errors are described by numeric codes, the same as errno values in user code but negated (for example, a function detecting a permissions error executes return -EPERM;). Such error codes lie in the range -4095..-1 (MAX_ERRNO is 4095).

There are 4 conventions for returning error codes from functions in the kernel (and you should always check which convention is being used before using a function):

The function does not return a result other than the error code -- the return type is int, the return value is the error code, or 0 in case of success.
The function returns a numeric type (int, long, off_t, ...) -- values in the range -4095 ... -1 indicate an error code, other values indicate a "normal" result.
The function returns a pointer -- in case of an error, the negative error code is cast to the pointer type and returned. Before using the returned pointer, you need to check that it is not an encoded error.
The function returns a pointer and does not use error codes -- in the event of an error, NULL is returned, and the caller must infer the appropriate error code (an example of such a function is kmalloc -- if NULL is returned, the -ENOMEM code must be propagated upward).

The following macros (linux/err.h) are useful for handling error codes:

IS_ERR_VALUE(x): true if x (an integer) is an error code (that is, it has a value in the range -4095 ... -1)
void *ERR_PTR(long error): converts an error code from a number to a pointer
long PTR_ERR(const void *ptr): converts in the reverse direction
bool IS_ERR(const void *ptr): true if the pointer is an error code
bool IS_ERR_OR_NULL(const void *ptr): true if the pointer is an error code or NULL
void *ERR_CAST(const void *ptr): passes an error code from a pointer of one type to a pointer of another type (useful when a function returning a different pointer type has to forward the error)
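How an error code fits inside a pointer can be seen in a small user-space sketch. The three helpers below are simplified re-implementations of the macros from linux/err.h (the kernel versions carry extra annotations), and lookup_slot / read_slot are hypothetical functions invented for this example. The trick relies on the top 4095 addresses never being valid pointers.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified user-space versions of the helpers from linux/err.h. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool IS_ERR(const void *ptr)
{
    /* Error codes occupy the last MAX_ERRNO values of the address space. */
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

static int table[4] = { 10, 20, 30, 40 };

/* Hypothetical function following the "pointer or encoded error" convention. */
static int *lookup_slot(int index)
{
    if (index < 0)
        return ERR_PTR(-EINVAL);
    if (index >= 4)
        return ERR_PTR(-ENOENT);
    return &table[index];
}

/* Hypothetical caller: checks for an encoded error before dereferencing. */
static long read_slot(int index)
{
    int *slot = lookup_slot(index);

    if (IS_ERR(slot))
        return PTR_ERR(slot);   /* propagate the unmodified error code */
    return *slot;
}

static void err_ptr_demo(void)
{
    assert(read_slot(1) == 20);
    assert(read_slot(-1) == -EINVAL);
    assert(!IS_ERR(NULL));      /* NULL is not an encoded error */
}
```

Note that IS_ERR(NULL) is false, which is why functions that may return either NULL or an encoded error need IS_ERR_OR_NULL.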

When a function we call returns an error, remember (unless we have planned special handling for the error in question) to clean up all resources allocated in the current function and to propagate the unmodified error code upward. A fairly common (and recommended) idiom in the kernel is to use goto to reach a common error handling path:

resource_c *get_c(void)
{
    resource_a *a;
    resource_c *c;
    int b;
    int res;

    mutex_lock(&lock);
    a = get_a();                /* convention: pointer or encoded error */
    if (IS_ERR(a)) {
        res = PTR_ERR(a);
        goto err_a;
    }
    b = get_b();                /* convention: number or error code */
    if (IS_ERR_VALUE(b)) {
        res = b;
        goto err_b;
    }
    c = kmalloc(sizeof *c, GFP_KERNEL);     /* convention: NULL on failure */
    if (c == NULL) {
        res = -ENOMEM;
        goto err_c;
    }
    c->a = a;
    c->b = b;
    mutex_unlock(&lock);
    return c;

    /* Common error handling: release resources in reverse order */
err_c:
    release_b(b);
err_b:
    release_a(a);
err_a:
    mutex_unlock(&lock);
    return ERR_PTR(res);
}

A list of error codes can be found in asm-generic/errno-base.h and asm-generic/errno.h. Note that many of these errors have well-defined semantics, sometimes loosely related to the description, and should only be used in specific situations. Of the more important codes, the following are worth mentioning:

-EFAULT: Error when copying from/to user memory (any other use is incorrect).
-ENOMEM: Memory exhaustion (but not other types of resources).
-ENOSPC: Exhaustion of disk space or other sufficiently similar device.
-ENOENT: The specified file (or other sufficiently similar resource) was not found.
-ESRCH: The specified process was not found.
-EPERM: No permissions (unspecified) to perform the operation.
-EACCES: Operation prohibited by file system permissions.
-EEXIST: The operation failed because the file (or other resource) already exists (used e.g., for operations that create files).
-EIO: The device broke down in an unspecified way (the calling function is not at fault: e.g., scratched CD, pulled out cable, etc.).
-EINVAL: The user has given incorrect parameters (conflicting, not supported by device, etc.).
-ENOTTY: Attempting to perform an operation on an incompatible device type (e.g., trying to change terminal settings on a regular file). Used primarily to reject unknown ioctl.
-ERESTARTSYS: Used to abort a wait when the kernel needs to exit kernel mode to deliver a signal to a user process -- at the appropriate place it will be converted to -EINTR, or the system call will be restarted.
-EINTR: System call interrupted by signal -- do not use directly (return -ERESTARTSYS instead).
-ESPIPE: Attempting to change the file position on an object where such a notion does not make sense (pipe, socket, terminal...).

Important

Returning -1 instead of an error code, using an obviously invalid one, or carelessly disregarding the error code returned by the called function will incur negative points for the large assignments!

Sleep and waiting in the kernel¶

Whenever one writes code running in kernel mode, it is common to have to wait for some event to occur (mutex release, device readiness, data availability, etc.). In such cases, we need to choose which type of wait we want to use:

Active waiting: our kernel thread (or interrupt handler) occupies the CPU the entire time and continuously checks whether the event has already occurred. It may be used only to wait for events that must take place within a finite and very short time (e.g. release of a spinlock). It must not be used to wait for events that are triggered from other kernel threads (unless we are guaranteed that the thread in question is currently running and will not be preempted, as in the case of waiting for a spinlock release). This is the only type of wait that is allowed in the context of interrupt handling, or when we are already holding a spinlock.
Uninterruptible sleep: our kernel thread adds itself to the wait queue and goes to sleep, asking to be woken up when the event occurs. The processor is released and starts processing other kernel threads. The only way to wake the thread is for the event to occur - a process in this state will not respond to signals. It should only be used to wait for events that will happen within a finite and short amount of time (otherwise we may end up with an unkillable suspended process).
Sleep interruptible by signals -- as above, but the waiting can be interrupted by the arrival of a signal that requires handling (that is, calling the signal handling function or killing the process). In such a case, the waiting function returns the result -EINTR, and the waiting code should abandon what it is doing, and if possible undo all the changes it has already made, then return to user space, returning the error code -ERESTARTSYS. Use it to wait for events that do not have a limited time frame (e.g., user typing, receiving data on a socket). It must not be used when the current operation is not allowed to fail (e.g. in the release function).
Sleep interruptible by death (killable sleep) -- as above, but the wait can be interrupted only by signals that will cause the death of the process. It is a kind of compromise, allowing the suspended process to be killed in case it is too difficult to undo partial changes. If possible, it should not be used.

Simple locks¶

Because the kernel frequently works on data shared by many processes, functions that synchronize the execution of processes, locks among them, are one of its most important classes of functions.

The simplest type of locks are ordinary locks (also called mutexes). This type of lock is defined in linux/mutex.h. Such a lock is not recursive and must be released by the process that acquired it. A process attempting to acquire a lock that is already held will sleep (going into the S or D state) until the lock is released.

Locks are created in one of the following ways:

/* for locks that are global variables */
static DEFINE_MUTEX(lock);

/* for locks in dynamically allocated structures */
struct mutex lock2;
/* ... */
mutex_init(&lock2);

Among other operations, the following are available:

void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_lock_killable(struct mutex *lock);
int mutex_trylock(struct mutex *lock);

The first one uses uninterruptible sleep, the second one uses sleep interruptible by signals (returns -EINTR in case of signal-induced wake-up, 0 on success), and the third uses sleep interruptible by death (with the same return values). The fourth function does not wait for the lock to be released: it returns 1 if it acquired the lock, and 0 (doing nothing) if the lock was already held. Note that this is the opposite of the usual convention of returning 0 on success.

void mutex_unlock(struct mutex *lock);

Releases the lock and wakes up one of the processes waiting for it, if there are any.

Other types of locks and other synchronization functions will be discussed in the next class.

Documentation: https://docs.kernel.org/locking/locktypes.html

Macro `container_of`¶

A quite interesting mechanism peculiar to the Linux kernel is the macro container_of. It is used to obtain the address of the structure containing a given field based on the field address, the type of the structure, and the name of the field:

struct a {
    int x;
    int y;
    int z;
};

struct a *ptr_a = ....;
/* Having a pointer to the y field of some instance of structure a.... */
int *ptr_y = &ptr_a->y;
/* ... we can reconstruct from it a pointer to the whole structure */
struct a *ptr_a_recovered = container_of(ptr_y, struct a, y);
/* ptr_a == ptr_a_recovered */

This mechanism is used in many data structures in the kernel libraries (such as lists and kref). For example, in the list implementation, instead of having a separate list structure (in a separate allocation) that points to the actual list element, the list_head structure is simply embedded in the list element as a field, and the container_of macro is used to get a pointer to the containing structure.
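Underneath, container_of is plain pointer arithmetic: it subtracts the offset of the field within the structure from the field address. Here is a simplified user-space sketch built on the standard offsetof; the kernel version additionally checks at compile time that the pointer type matches the field type. The function a_from_y is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified: the kernel macro also verifies the type of ptr. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct a {
    int x;
    int y;
    int z;
};

/* Hypothetical helper: recovers the structure from a pointer to its y field. */
static struct a *a_from_y(int *ptr_y)
{
    return container_of(ptr_y, struct a, y);
}

static void container_of_demo(void)
{
    struct a obj = { .x = 1, .y = 2, .z = 3 };

    assert(a_from_y(&obj.y) == &obj);
    assert(a_from_y(&obj.y)->z == 3);
}
```

The result is only meaningful if the pointer really points at the named field of an instance of the structure; the macro cannot verify that at run time.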

Using a standard implementation of lists¶

The kernel contains an efficient implementation of doubly linked lists. A list consists of cyclically connected list_head structures, which are most often fields of some larger structure. Note that both the head of a list and its elements are list_head structures -- the end of a list is detected by comparison with the head address.

All list operations are available through the linux/list.h file. Some of the most useful facilities are:

list_head: structure representing the head (part) of a list
LIST_HEAD(list): macro defining and initializing a list head variable
INIT_LIST_HEAD(list): macro initializing a list head (for dynamically created lists)
list_add(what, to_what): adding what to the beginning of to_what
list_add_tail(what, to_what): adding what to the end of to_what
list_del(what): removing what from the list
list_empty(list): checking whether the list is empty
list_splice(what, before_what): joining two lists by inserting the whole list what at the beginning of before_what
list_for_each(varname, over_list): iterating the variable varname over each element of over_list
list_for_each_safe(varname, meanwhile, over_list): iterating the variable varname over each element of over_list in a way that remains safe when the current element is deleted (the variable meanwhile is used to remember the next element)
list_entry(my_list, structure, field): computes a pointer to the start of the structure that contains a field named field of type list_head (just syntactic sugar for container_of)

Documentation: https://docs.kernel.org/core-api/kernel-api.html#list-management-functions
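To show how the pieces fit together, the following user-space sketch re-implements a small subset of linux/list.h under the same names. It is a simplification: for instance, the kernel's list_del also poisons the pointers of the removed entry. The struct item type and the sum_items function are hypothetical and serve only as an example of an element type with an embedded list_head.

```c
#include <assert.h>
#include <stddef.h>

struct list_head {
    struct list_head *next, *prev;
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* An empty list is a head that points to itself in both directions. */
static void INIT_LIST_HEAD(struct list_head *list)
{
    list->next = list;
    list->prev = list;
}

static int list_empty(const struct list_head *head)
{
    return head->next == head;
}

/* Inserts entry just before head, i.e. at the end of the cyclic list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
}

/* Hypothetical element type: the list_head is embedded as a field. */
struct item {
    int value;
    struct list_head node;
};

/* Walks the list and recovers each element from its embedded node. */
static int sum_items(struct list_head *head)
{
    struct list_head *pos;
    int sum = 0;

    list_for_each(pos, head)
        sum += list_entry(pos, struct item, node)->value;
    return sum;
}
```

Because the head is itself a list_head and the list is cyclic, neither insertion nor removal needs a special case for an empty list or for the first and last elements.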

Reading¶

https://docs.kernel.org/core-api/index.html
