Class 5: Internal Kernel Interfaces, part I
Date: 25.03.2025
Introduction
In the previous classes we discussed the procedure for compiling kernel sources. Today we will show the most commonly used internal interfaces in the kernel.
The kernel is written in C, but it cannot use the standard C language library (or any other library we are familiar with in user space). It does, however, have its own library of useful functions. Some of them are identical to those of the standard library (or very similar), but many of them have significant differences, or are completely Linux-specific.
Examples of functions available in the kernel:
- most of the functions known from string.h (memcpy, strcmp, strcpy, ...)
- kstrto[u](int|l|ll): functions for converting strings to numbers, similar to the standard strto*, but with a different interface
- malloc/free/calloc: do not exist; they are replaced by kmalloc, vmalloc, and several other memory allocators, depending on the use case
- snprintf/sscanf: work similarly to the usual ones, but have a different set of formats (e.g., %pI4 prints an IPv4 address)
- bsearch: as in the C standard
- sort: like the standard qsort, but it takes an additional argument, a function swapping two elements (NULL selects a default swap)
These functions are declared in different headers than usual: we have to use, for example, linux/string.h, linux/bsearch.h, etc.
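To make the "different interface" of the kstrto* family concrete, here is a minimal user-space sketch. The name demo_kstrtoint is hypothetical; it only mimics the shape of the kernel's kstrtoint(const char *s, unsigned int base, int *res), where the return value is a status (0 or a negated errno code) and the parsed number is stored through a pointer. It is built on strtol, so details such as the handling of leading whitespace differ from the kernel implementation.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical user-space stand-in with the same shape as kstrtoint():
 * the status is the return value, the result goes through res.
 */
static int demo_kstrtoint(const char *s, unsigned int base, int *res)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(s, &end, (int)base);
	if (end == s)
		return -EINVAL;		/* no digits at all */
	if (*end == '\n')
		end++;			/* tolerate one trailing newline */
	if (*end != '\0')
		return -EINVAL;		/* trailing garbage */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	*res = (int)val;
	return 0;
}

/* Self-check exercising the three possible outcomes. */
static void demo_kstrtoint_check(void)
{
	int v = 0;

	assert(demo_kstrtoint("42\n", 10, &v) == 0 && v == 42);
	assert(demo_kstrtoint("42abc", 10, &v) == -EINVAL);
	assert(demo_kstrtoint("99999999999999999999", 10, &v) == -ERANGE);
}
```

Unlike strtol, a caller of this interface cannot forget to check for trailing garbage: any malformed input is reported through the return value.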
printk
For debugging purposes and for reporting important events, you can use the printk function, which works similarly to printf:
printk(KERN_WARNING "Crashed, error code: %d\n", err);
The message is preceded by its priority (note the lack of a comma: the priority is a string literal that gets concatenated with the format string). The available priorities, in ascending order, are:
- KERN_DEBUG: alias pr_debug() (compiled out unless DEBUG is defined or dynamic debug is enabled)
- KERN_INFO: alias pr_info()
- KERN_NOTICE: alias pr_notice()
- KERN_WARNING: alias pr_warn()
- KERN_ERR: alias pr_err()
- KERN_CRIT: alias pr_crit()
- KERN_ALERT: alias pr_alert()
- KERN_EMERG: alias pr_emerg()
Messages output by printk are included in the system log, which can be viewed using the dmesg command. If they have a high enough priority, they are also immediately output to the console. To change the priority threshold from which messages are written directly to the console, use the command dmesg -n <level>, where 8 causes all messages to be written out and 1 only the most urgent ones (KERN_EMERG).
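The priority prefix and the console threshold can be modelled in user space. In the sketch below, the KERN_* macros use the same encoding as the kernel headers (an SOH byte followed by a digit), which is why no comma is needed: adjacent string literals are concatenated. Everything prefixed with demo_ is hypothetical, including the assumed default level of 4 for messages without a prefix.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Same encoding as the kernel headers: SOH byte followed by a digit. */
#define KERN_SOH	"\001"
#define KERN_EMERG	KERN_SOH "0"
#define KERN_WARNING	KERN_SOH "4"
#define KERN_DEBUG	KERN_SOH "7"

/* Hypothetical stand-in for the threshold set with `dmesg -n <level>`. */
static int demo_console_loglevel = 8;

/*
 * Hypothetical model of printk(): strips the level prefix and prints the
 * message only if its level is below the console threshold.
 * Returns 1 if the message was printed, 0 if it was filtered out.
 */
static int demo_printk(const char *fmt, ...)
{
	const char *p = fmt;
	int level = 4;	/* assumed default for messages without a prefix */
	va_list ap;

	if (p[0] == '\001' && p[1] >= '0' && p[1] <= '7') {
		level = p[1] - '0';
		p += 2;
	}
	if (level >= demo_console_loglevel)
		return 0;

	va_start(ap, fmt);
	vprintf(p, ap);
	va_end(ap);
	return 1;
}

/* Self-check: threshold 8 shows everything, threshold 1 only level 0. */
static void demo_printk_check(void)
{
	demo_console_loglevel = 8;
	assert(demo_printk(KERN_DEBUG "debug %d\n", 1) == 1);
	demo_console_loglevel = 1;
	assert(demo_printk(KERN_WARNING "warn %d\n", 2) == 0);
	assert(demo_printk(KERN_EMERG "emerg %d\n", 3) == 1);
}
```

Note that a smaller number means a more urgent message, so "ascending priority" corresponds to descending numeric level.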
Dynamically allocating memory for the kernel
There are many functions in the kernel that allow dynamic memory allocation. By far the most important and most commonly used is kmalloc (linux/slab.h):
void *kmalloc(size_t size, gfp_t flags);
void kfree(void *obj);
The kmalloc function allocates a physically contiguous area of memory with a size of up to 32 pages (on x86 this gives just under 128 KB, since part of the memory is reserved by the kernel for the block header). The actual limit depends on the kernel version and configuration and is available as KMALLOC_MAX_SIZE. The allocation is carried out quickly (buddy algorithm). The flags parameter specifies the type of memory (GFP_* constants defined in the linux/gfp.h file). The most important are:
- GFP_KERNEL: the most commonly used; the allocation may block, so it can be used only from process context or from a dedicated kernel thread.
- GFP_NOWAIT / GFP_ATOMIC: the allocation does not block, so it can be used from interrupt handlers (although allocating memory there is usually a bad idea anyway).
void *vmalloc(size_t size);
void vfree(void *addr);
With vmalloc you can allocate an arbitrarily large area (provided that there is enough free physical memory), but it is only virtually (hence the v) contiguous: it may be physically non-contiguous, as this memory is mapped through address translation. Using this function without a good reason is not recommended.
struct page *alloc_pages(gfp_t flags, unsigned int order);
void __free_pages(struct page *page, unsigned int order);
alloc_pages allocates 2**order whole pages; the flags parameter specifies how to allocate the pages (as in kmalloc). __free_pages releases them and must be given the same order.
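Because alloc_pages takes an order rather than a byte count, a size has to be rounded up to a power-of-two number of pages. The following user-space sketch shows the arithmetic; demo_order_for_size and DEMO_PAGE_SIZE are hypothetical names, and the page size of 4096 bytes is an assumption (kernel code would normally use the existing get_order() helper instead).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL	/* assumed page size */

/*
 * Hypothetical helper: smallest order such that 2**order pages of
 * DEMO_PAGE_SIZE bytes hold at least size bytes.
 */
static unsigned int demo_order_for_size(size_t size)
{
	/* Number of whole pages needed, rounding up without overflow. */
	size_t pages = size / DEMO_PAGE_SIZE + (size % DEMO_PAGE_SIZE != 0);
	unsigned int order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return order;
}

/* Self-check: 3 pages do not fit in order 1, so order 2 (4 pages) is used. */
static void demo_order_check(void)
{
	assert(demo_order_for_size(1) == 0);
	assert(demo_order_for_size(4096) == 0);
	assert(demo_order_for_size(4097) == 1);
	assert(demo_order_for_size(3 * 4096) == 2);
}
```

The last case illustrates the cost of this interface: a request for 3 pages consumes 4, so sizes just above a power of two waste almost half of the allocation.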
Documentation: https://docs.kernel.org/core-api/memory-allocation.html
Fixed-size allocator
When you have a lot of objects of identical size, it can be useful to create your own heap dedicated to a particular type of objects. The following functions are used for this:
struct kmem_cache *kmem_cache_create(
	const char *name, unsigned int size, unsigned int align,
	slab_flags_t flags,
	void (*ctor)(void *));
void kmem_cache_destroy(struct kmem_cache *cachep);
0 is usually specified as the flags parameter (most flags are only used for debugging).
The struct kmem_cache is our private heap (cache): it consists of dynamically allocated pages cut into chunks of exactly the specified length with minimal overhead, additionally arranged to make the most of the CPU cache. Objects are allocated from it and returned to it with the following functions:
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
void kmem_cache_free(struct kmem_cache *cachep, void *objp);
The memory for a new object is initialized with the constructor given when creating the cache. For simple cases (when no constructor is needed), the KMEM_CACHE macro is a convenient wrapper around kmem_cache_create.
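The idea of a heap dedicated to objects of one size can be sketched in user space as follows. This is not how the kernel implements kmem_cache; it is a simplified model with hypothetical demo_ names, a fixed capacity (a real cache grows by allocating more pages), and no constructor support. It shows only the core trick: one block cut into equal chunks, with the free chunks linked through their own memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space model of a fixed-size object cache. */
struct demo_cache {
	size_t obj_size;	/* size of every chunk after rounding */
	char *block;		/* one allocation cut into equal chunks */
	void *free_list;	/* list threaded through the free chunks */
};

static struct demo_cache *demo_cache_create(size_t obj_size, size_t capacity)
{
	const size_t align = _Alignof(max_align_t);
	struct demo_cache *c;
	size_t i;

	/* A free chunk must hold a link pointer and stay aligned. */
	if (obj_size < sizeof(void *))
		obj_size = sizeof(void *);
	obj_size = (obj_size + align - 1) / align * align;

	c = malloc(sizeof(*c));
	if (!c)
		return NULL;
	c->block = malloc(obj_size * capacity);
	if (!c->block) {
		free(c);
		return NULL;
	}
	c->obj_size = obj_size;
	c->free_list = NULL;
	for (i = 0; i < capacity; i++) {
		void *chunk = c->block + i * obj_size;

		*(void **)chunk = c->free_list;	/* push chunk on the free list */
		c->free_list = chunk;
	}
	return c;
}

/* Pops a chunk from the free list; NULL when the pool is exhausted. */
static void *demo_cache_alloc(struct demo_cache *c)
{
	void *obj = c->free_list;

	if (obj)
		c->free_list = *(void **)obj;
	return obj;
}

/* Pushes a chunk back on the free list. */
static void demo_cache_free(struct demo_cache *c, void *obj)
{
	*(void **)obj = c->free_list;
	c->free_list = obj;
}

static void demo_cache_destroy(struct demo_cache *c)
{
	free(c->block);
	free(c);
}

/* Self-check: two objects fit, a third does not, freed chunks are reused. */
static void demo_cache_check(void)
{
	struct demo_cache *c = demo_cache_create(24, 2);
	void *a, *b;

	assert(c != NULL);
	a = demo_cache_alloc(c);
	b = demo_cache_alloc(c);
	assert(a && b && a != b);
	assert(demo_cache_alloc(c) == NULL);
	demo_cache_free(c, a);
	assert(demo_cache_alloc(c) == a);
	demo_cache_destroy(c);
}
```

Allocation and release are both a couple of pointer operations, and there is no per-object header, which is the "minimal overhead" property mentioned above.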
Error handling in the kernel
When writing kernel-mode code, it is important to remember that the stability of the entire system depends on the correct operation of our code. It is absolutely necessary to handle all possible errors in our module in a way that does not interfere with the rest of the kernel, and to strictly control the lifetime of allocated resources (a memory leak in the kernel makes it necessary to periodically reboot the entire system).
Most non-trivial functions in the kernel can fail and return an error code as their result. Numeric error codes describe the error encountered; they are the same as errno in user code, but negated (for example, a function detecting a permissions error executes return -EPERM;). The range of such error codes is -4095..-1 (MAX_ERRNO is 4095).
There are 4 conventions for returning error codes from functions in the kernel (always check which one applies before using a function):
1. The function does not return a result other than the error code: the return type is int, and the return value is the error code, or 0 in case of success.
2. The function returns a numeric type (int, long, off_t, ...): values in the range -4095..-1 indicate an error code, other values indicate a "normal" result.
3. The function returns a pointer: in case of an error, the negative error code is cast to the pointer type and returned. Before using the returned pointer, you need to check that it is not an encoded error.
4. The function returns a pointer and does not use error codes: in the event of an error, NULL is returned, and the caller must infer the appropriate error code (an example of such a function is kmalloc: if NULL is returned, -ENOMEM should be propagated to the caller).
The following macros (linux/err.h) are useful for handling error codes:
IS_ERR_VALUE(x)
true if x (an integer) is an error code (that is, it has a value in the range -4095..-1)
void *ERR_PTR(long error)
converts an error code from a number to a pointer
long PTR_ERR(const void *ptr)
converts in the reverse direction
bool IS_ERR(const void *ptr)
true if the pointer is an error code
bool IS_ERR_OR_NULL(const void *ptr)
true if the pointer is an error code or NULL
void *ERR_CAST(const void *ptr)
converts an encoded error from one pointer type to another
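The pointer encoding behind these macros can be reproduced in user space. The sketch below defines hypothetical demo_ equivalents of ERR_PTR, PTR_ERR and IS_ERR; it assumes, like the kernel does, that the top 4095 addresses never hold a valid object and that integers convert to pointers and back without loss (true on mainstream platforms, but implementation-defined in C). demo_lookup and demo_table are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* Encode a negative error code as a pointer value. */
static inline void *demo_err_ptr(long error)
{
	return (void *)(intptr_t)error;
}

/* Decode the error code stored in a pointer. */
static inline long demo_ptr_err(const void *ptr)
{
	return (long)(intptr_t)ptr;
}

/* True if the pointer lies in the last DEMO_MAX_ERRNO addresses. */
static inline bool demo_is_err(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-DEMO_MAX_ERRNO;
}

static int demo_table[3] = { 10, 20, 30 };

/* Hypothetical function using the "pointer or encoded error" convention. */
static int *demo_lookup(int index)
{
	if (index < 0 || index >= 3)
		return demo_err_ptr(-ENOENT);
	return &demo_table[index];
}

/* Self-check: a valid pointer, an encoded error, and NULL. */
static void demo_err_check(void)
{
	int *p = demo_lookup(1);

	assert(!demo_is_err(p) && *p == 20);
	p = demo_lookup(5);
	assert(demo_is_err(p) && demo_ptr_err(p) == -ENOENT);
	assert(!demo_is_err(NULL));	/* NULL is not an encoded error */
}
```

The last assertion is the reason IS_ERR_OR_NULL exists as a separate check: a function following this convention never signals failure with NULL, and a function following the NULL convention never returns an encoded error.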
When a function we call returns an error, remember (unless special handling of that particular error is planned) to clean up all resources allocated in the current function and propagate the unmodified error code to the caller. A fairly common (and recommended) idiom in the kernel is to use goto to jump to a common error handling path:
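resource_c *get_c(void)
{
	resource_a *a;
	resource_c *c;
	int b;
	int res;

	mutex_lock(&lock);
	a = get_a();
	if (IS_ERR(a)) {
		/* get_a() encodes the error in the returned pointer */
		res = PTR_ERR(a);
		goto err_a;
	}
	b = get_b();
	if (IS_ERR_VALUE(b)) {
		/* get_b() returns a negative error code directly */
		res = b;
		goto err_b;
	}
	c = kmalloc(sizeof(*c), GFP_KERNEL);
	if (c == NULL) {
		/* kmalloc() only reports NULL, so we pick the code ourselves */
		res = -ENOMEM;
		goto err_c;
	}
	c->a = a;
	c->b = b;
	mutex_unlock(&lock);
	return c;

	/* Common error handling: release in reverse order of acquisition */
err_c:
	release_b(b);
err_b:
	release_a(a);
err_a:
	mutex_unlock(&lock);
	return ERR_PTR(res);
}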
A list of error codes can be found in asm-generic/errno-base.h
and asm-generic/errno.h
.
Note that many of these errors have well-defined semantics,
sometimes loosely related to the description, and should only be used in specific
situations. Of the more important codes, the following are worth mentioning:
-EFAULT
Error when copying from/to user memory (any other use is incorrect).
-ENOMEM
Memory exhaustion (but not other types of resources).
-ENOSPC
Exhaustion of disk space or other sufficiently similar device.
-ENOENT
The specified file (or other sufficiently similar resource) was not found.
-ESRCH
The specified process was not found.
-EPERM
No permissions (unspecified) to perform the operation.
-EACCES
Operation prohibited by file system permissions.
-EEXIST
The operation failed because the file (or other resource) already exists (used e.g., for operations that create files).
-EIO
The device broke down in an unspecified way (the calling function is not at fault: e.g., scratched CD, pulled out cable, etc.).
-EINVAL
The user has given incorrect parameters (conflicting, not supported by device, etc.).
-ENOTTY
Attempting to perform an operation on an incompatible device type (e.g., trying to change terminal settings on a regular file). Used primarily to reject an unknown ioctl.
-ERESTARTSYS
Used to abort a wait when the kernel needs to leave kernel mode to deliver a signal to a user process. At the appropriate place it will be converted to -EINTR, or the system call will be restarted.
-EINTR
System call interrupted by a signal. Do not use directly (return -ERESTARTSYS instead).
-ESPIPE
Attempting to change the file position on an object for which the notion of a position does not make sense (pipe, socket, terminal...).
Important
Returning -1
instead of an error code, using an obviously invalid one,
or carelessly disregarding the error code returned by the called
function will incur negative points for the large assignments!
Sleep and waiting in the kernel
Whenever one writes code running in kernel mode, it is common to have to wait for some event to occur (mutex release, device readiness, data availability, etc.). In such cases, we need to choose which type of wait we want to use:
- Active waiting: our kernel thread (or interrupt handler) occupies the CPU the entire time and continuously checks whether the event has already occurred. It may be used only to wait for events that are guaranteed to occur within a very short, bounded time (e.g., the release of a spinlock). It must not be used to wait for events triggered from other kernel threads (unless we are guaranteed that the thread in question is currently running and will not be preempted, as in the case of waiting for a spinlock release). This is the only type of wait allowed in interrupt context or while holding a spinlock.
- Uninterruptible sleep: our kernel thread adds itself to a wait queue and goes to sleep, asking to be woken up when the event occurs. The processor is released and starts running other kernel threads. The only way to wake the thread is for the event to occur; a process in this state does not respond to signals. It should be used only to wait for events that will happen within a short, bounded time (otherwise we may end up with an unkillable suspended process).
- Sleep interruptible by signals: as above, but the wait can be interrupted by the arrival of a signal that requires handling (that is, calling a signal handler or killing the process). In such a case, the waiting function returns -EINTR, and the waiting code should abandon what it is doing, if possible undo all the changes it has already made, and then return to user space with the error code -ERESTARTSYS. Use it to wait for events that have no bounded time frame (e.g., user typing, receiving data on a socket). It must not be used when the current operation is not allowed to fail (e.g., in the release function).
- Sleep interruptible by death (killable sleep): as above, but the wait can be interrupted only by signals that will cause the death of the process. It is a compromise that allows the suspended process to be killed when undoing partial changes is too difficult. If possible, it should not be used.
Simple locks
Because kernel code frequently uses data shared by many processes, functions that synchronize the execution of processes, locks among them, are one of the most important classes of functions in the kernel.
The simplest type of lock is the ordinary lock (also called a mutex). This type of lock is defined in linux/mutex.h. Such a lock is not recursive and must be released by the process that acquired it. A process attempting to acquire an already held lock will sleep (going into the S or D state) until the lock is released.
Locks are created in one of the following ways:
/* for locks that are global variables */
static DEFINE_MUTEX(lock);
/* for locks in dynamically allocated structures */
struct mutex lock2;
/* ... */
mutex_init(&lock2);
Among other operations, the following are available:
void mutex_lock(struct mutex *lock);
int mutex_lock_interruptible(struct mutex *lock);
int mutex_lock_killable(struct mutex *lock);
int mutex_trylock(struct mutex *lock);
The first one uses uninterruptible sleep, the second one uses sleep interruptible by signals (it returns -EINTR if woken by a signal, 0 on success), and the third uses sleep interruptible by death. The fourth function does not wait for the lock to be released: if the lock is already held, it returns 0 and does nothing; if it acquires the lock, it returns 1 (note that this is the opposite of the usual 0-on-success convention).
void mutex_unlock(struct mutex *lock);
Releases the lock and wakes up one of the waiting processes, if there are any.
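The return convention of mutex_trylock is easy to get wrong, because the kernel documents it as returning 1 when the mutex was acquired and 0 on contention, unlike most kernel functions. The toy lock below is a hypothetical user-space model built on a C11 atomic flag; it supports only try-lock and unlock (no sleeping) and exists purely to show how a caller tests the result. The choice of -EBUSY in demo_try_increment is an assumption made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* Hypothetical toy lock: it can only be tried, never waited for. */
struct demo_mutex {
	atomic_flag locked;
};

#define DEMO_MUTEX_INIT { ATOMIC_FLAG_INIT }

/* Same return convention as mutex_trylock(): 1 = acquired, 0 = contended. */
static int demo_mutex_trylock(struct demo_mutex *m)
{
	return !atomic_flag_test_and_set(&m->locked);
}

static void demo_mutex_unlock(struct demo_mutex *m)
{
	atomic_flag_clear(&m->locked);
}

static struct demo_mutex demo_lock = DEMO_MUTEX_INIT;
static int demo_counter;

/* Hypothetical caller: do the work only if the lock is free right now. */
static int demo_try_increment(void)
{
	if (!demo_mutex_trylock(&demo_lock))
		return -EBUSY;	/* error code chosen for this example */
	demo_counter++;
	demo_mutex_unlock(&demo_lock);
	return 0;
}

/* Self-check: the second attempt fails until the lock is released. */
static void demo_mutex_check(void)
{
	struct demo_mutex m = DEMO_MUTEX_INIT;

	assert(demo_mutex_trylock(&m) == 1);
	assert(demo_mutex_trylock(&m) == 0);
	demo_mutex_unlock(&m);
	assert(demo_mutex_trylock(&m) == 1);
	demo_mutex_unlock(&m);
}
```

The caller converts the boolean-style result into a negative error code before returning it, so functions further up the call chain see the usual convention.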
Other types of locks and other synchronization functions will be discussed in the next class.
Documentation: https://docs.kernel.org/locking/locktypes.html
Macro container_of
A quite interesting mechanism peculiar to the Linux kernel is the macro
container_of
. It is used to obtain the address of the structure containing
a given field based on the field address, the type of the structure, and the name of the field:
struct a {
int x;
int y;
int z;
};
struct a *ptr_a = ....;
/* Having a pointer to the y field of some instance of structure a.... */
int *ptr_y = &ptr_a->y;
/* ... we can reconstruct from it a pointer to the whole structure */
struct a *ptr_a_recovered = container_of(ptr_y, struct a, y);
/* ptr_a == ptr_a_recovered */
This mechanism is used in many data structures in the kernel libraries (such as lists and kref). For example, in the list implementation, instead of having a separate list structure (in a separate allocation) that points to the actual list element, the list_head structure is simply embedded in the list element as a field, and the container_of macro is used to get a pointer to the containing structure.
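The pointer arithmetic behind container_of can be tried out in user space. The macro below is a simplified, hypothetical demo_container_of based on offsetof; the kernel's version additionally checks that the pointer type matches the member type. The structures demo_node and demo_device are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: subtract the member's offset from the member's address. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical small structure meant to be embedded in larger ones. */
struct demo_node {
	int refcount;
};

/* Hypothetical containing structure. */
struct demo_device {
	int id;
	struct demo_node node;	/* embedded, not pointed to */
};

/* Code that only receives the embedded part can still find the whole. */
static struct demo_device *demo_device_from_node(struct demo_node *node)
{
	return demo_container_of(node, struct demo_device, node);
}

/* Self-check: the recovered pointer equals the original one. */
static void demo_container_check(void)
{
	struct demo_device dev = { .id = 7, .node = { .refcount = 1 } };

	assert(demo_device_from_node(&dev.node) == &dev);
	assert(demo_device_from_node(&dev.node)->id == 7);
}
```

Embedding the small structure instead of pointing to it saves one allocation and one pointer dereference per element.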
Using a standard implementation of lists
The kernel contains an efficient implementation of doubly linked lists. A list consists of cyclically connected list_head structures, which are most often fields of some larger structure. Note that both the head of a list and its elements are list_head structures; the end of a list is detected by comparison with the address of the head. All list operations are available through the linux/list.h file.
Here are some of the list-related definitions:
list_head
structure representing the head (or an element) of a list
LIST_HEAD(list)
macro defining and initializing a list head variable
INIT_LIST_HEAD(list)
macro initializing a list head (for dynamically created lists)
list_add(what, to_what)
adds what to the beginning of to_what
list_add_tail(what, to_what)
adds what to the end of to_what
list_del(what)
removes what from the list
list_empty(list)
checks whether the list is empty
list_splice(what, before_what)
joins the lists what and before_what
list_for_each(varname, over_list)
iterates the variable varname over each element of over_list
list_for_each_safe(varname, meanwhile, over_list)
iterates the variable varname over each element of over_list in a way that is safe when list elements are deleted during the iteration (uses the variable meanwhile for this purpose)
list_entry(my_list, structure, field)
computes a pointer to the start of the structure that contains a field named field of type list_head (just syntactic sugar for container_of)
Documentation: https://docs.kernel.org/core-api/kernel-api.html#list-management-functions
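To show how the embedded list_head and list_entry fit together, here is a minimal user-space model of a few of the operations listed above. All demo_ names are hypothetical re-implementations written for this example, not the contents of linux/list.h; the structure demo_task is made up.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space model of struct list_head. */
struct demo_list_head {
	struct demo_list_head *next, *prev;
};

/* An empty list is a head that points to itself in both directions. */
#define DEMO_LIST_HEAD(name) \
	struct demo_list_head name = { &(name), &(name) }
#define demo_list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define demo_list_for_each(pos, head) \
	for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

/* Inserts item just before head, i.e. at the end of the list. */
static void demo_list_add_tail(struct demo_list_head *item,
			       struct demo_list_head *head)
{
	item->prev = head->prev;
	item->next = head;
	head->prev->next = item;
	head->prev = item;
}

/* Unlinks item from whatever list it is on. */
static void demo_list_del(struct demo_list_head *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
}

static int demo_list_empty(const struct demo_list_head *head)
{
	return head->next == head;
}

/* Hypothetical list element with the link embedded as a field. */
struct demo_task {
	int id;
	struct demo_list_head link;
};

/* Walks the list and recovers each containing structure from its link. */
static int demo_sum_ids(struct demo_list_head *head)
{
	struct demo_list_head *pos;
	int sum = 0;

	demo_list_for_each(pos, head)
		sum += demo_list_entry(pos, struct demo_task, link)->id;
	return sum;
}

/* Self-check: add two elements, iterate, delete, end up empty again. */
static void demo_list_check(void)
{
	DEMO_LIST_HEAD(tasks);
	struct demo_task a = { .id = 1 }, b = { .id = 2 };

	assert(demo_list_empty(&tasks));
	demo_list_add_tail(&a.link, &tasks);
	demo_list_add_tail(&b.link, &tasks);
	assert(demo_sum_ids(&tasks) == 3);
	demo_list_del(&a.link);
	assert(demo_sum_ids(&tasks) == 2);
	demo_list_del(&b.link);
	assert(demo_list_empty(&tasks));
}
```

The loop in demo_sum_ids stops when the iterator returns to the head, which is the "comparison with the head address" described above; the head itself is never passed to demo_list_entry because it is not embedded in a demo_task.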