.. _lab-kernel-internal-api-en:

===========================================
Class 5: Internal Kernel Interfaces, part I
===========================================

Date: 25.03.2025

.. : :ref:`small_task_5`


Introduction
============

In the previous classes we discussed the procedure for compiling kernel sources.
Today we will show the most commonly used internal interfaces in the kernel.

The kernel is written in C, but it cannot use the standard
C language library (or any other library we are familiar with in user space).
It does, however, have its own library of useful functions.  Some of them
are identical to those of the standard library (or very similar), but
many of them have significant differences, or are completely Linux-specific.

Examples of functions available in the kernel:

- most of the functions known from ``string.h`` (``memcpy``, ``strcmp``, ``strcpy``, ... )
- ``kstrto[u](int|l|ll)``: functions for converting from strings to numbers, similar
  to the standard ``strto*``, but with a different interface
- ``malloc``/``free``/``calloc``: do not exist, replaced by ``kmalloc``,
  ``vmalloc``, and several other memory allocators depending on the case
- ``snprintf``/``sscanf``: work similarly to the usual ones, but have a different
  set of formats (e.g., ``%pI4`` prints an IPv4 address)
- ``bsearch``: as in the C standard
- ``sort``: like the standard ``qsort``, but it additionally takes a function
  swapping two elements (``NULL`` selects a generic swap)

These functions are contained in different headers than usual: we have to use, for example,
``linux/string.h``, ``linux/bsearch.h``, etc.
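
As a small illustration of how the kernel variants differ from their user-space
relatives, the fragment below parses a decimal number with ``kstrtoint``.  It is
only a sketch of kernel code (it builds inside a kernel tree, not in user space),
and ``parse_limit`` is a hypothetical helper name.

.. code-block:: c

    #include <linux/kernel.h>   /* pulls in the kstrto*() declarations */

    /* Hypothetical helper: parse a decimal integer from a string. */
    static int parse_limit(const char *text, int *limit)
    {
        /*
         * Unlike strtol(), the result is stored through a pointer and the
         * return value is 0 on success or a negative error code
         * (-EINVAL for malformed input, -ERANGE on overflow).
         */
        return kstrtoint(text, 10, limit);
    }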

printk
======

For debugging purposes and reporting important events, you can use
the ``printk`` function, which works similarly to ``printf``::

 printk(KERN_WARNING "Crashed, error code: %d\n", err);

In front of the message, include its priority (note the lack of
a comma), which can be (in order of ascending priority):

- ``KERN_DEBUG``: alias ``pr_debug()`` (compiled out unless ``DEBUG`` is defined or dynamic debug is enabled)
- ``KERN_INFO``: alias ``pr_info()``
- ``KERN_NOTICE``: alias ``pr_notice()``
- ``KERN_WARNING``: alias ``pr_warn()``
- ``KERN_ERR``: alias ``pr_err()``
- ``KERN_CRIT``: alias ``pr_crit()``
- ``KERN_ALERT``: alias ``pr_alert()``
- ``KERN_EMERG``: alias ``pr_emerg()``


Messages output by ``printk`` are stored in the kernel log, which can be viewed
with the ``dmesg`` command.  If their priority is high enough, they are
also immediately output to the console.  To change the priority threshold above which
messages are written directly to the console, use the command
``dmesg -n <level>``: level 8 causes all messages to be written out,
while level 1 lets through only the most urgent ones (``KERN_EMERG``).
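
In practice the ``pr_*`` aliases are usually more convenient than a bare
``printk``.  The fragment below is a sketch of kernel code (not buildable in user
space); ``report_status`` is a hypothetical function, and defining ``pr_fmt`` to
prefix every message with the module name is a common convention rather than a
requirement.

.. code-block:: c

    /* Must be defined before the includes to take effect. */
    #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

    #include <linux/printk.h>

    /* Hypothetical helper reporting the outcome of some operation. */
    static void report_status(int err)
    {
        if (err)
            pr_err("initialization failed, error code: %d\n", err);
        else
            pr_info("initialization succeeded\n");
    }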

Documentation: `basics <https://docs.kernel.org/core-api/printk-basics.html>`_, `formats <https://docs.kernel.org/core-api/printk-formats.html>`_

.. highlight:: c

Dynamically allocating memory for the kernel
============================================

There are many functions in the kernel that allow dynamic memory allocation.
By far the most important and most commonly used is ``kmalloc`` (``linux/slab.h``)::

    void *kmalloc(size_t size, gfp_t flags);
    void kfree(void *obj);

The ``kmalloc`` function allocates a physically contiguous area of memory.
It is meant for small objects: the maximum size is bounded by ``KMALLOC_MAX_SIZE``,
which depends on the kernel version and configuration (in old kernels the limit was
32 pages, i.e. 128 KiB on x86).  Allocation is fast: small requests are served from slab
caches, which obtain their pages from the buddy allocator.  The ``flags`` parameter specifies
the type of memory (``GFP_*`` constants defined in the ``linux/gfp.h`` file) -- the most important are:

- ``GFP_KERNEL`` -- the most commonly used; the allocation may block (sleep), so it can be used
  only in process context (including kernel threads).
- ``GFP_NOWAIT``/``GFP_ATOMIC`` -- never blocks, so it can be called from
  interrupt handlers (although allocating there is usually a bad idea anyway).
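
A typical use of ``kmalloc`` and ``kfree`` might look as follows.  This is a
sketch of kernel code (it does not build in user space); ``struct sample_buf``
and both functions are hypothetical names introduced only for this example.

.. code-block:: c

    #include <linux/slab.h>
    #include <linux/string.h>

    /* Hypothetical structure used only for this example. */
    struct sample_buf {
        size_t len;
        char data[64];
    };

    static struct sample_buf *sample_buf_new(const char *src, size_t len)
    {
        struct sample_buf *buf;

        if (len > sizeof(buf->data))
            return NULL;

        /* GFP_KERNEL may sleep, so call this only from process context. */
        buf = kmalloc(sizeof(*buf), GFP_KERNEL);
        if (!buf)
            return NULL;    /* the caller should report -ENOMEM */

        buf->len = len;
        memcpy(buf->data, src, len);
        return buf;
    }

    static void sample_buf_free(struct sample_buf *buf)
    {
        kfree(buf);         /* kfree(NULL) is a no-op */
    }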

::

    void *vmalloc(size_t size);
    void vfree(void *addr);

With ``vmalloc`` you can allocate an arbitrarily large area (provided
that there is enough free physical memory), but it is only virtually (hence the **v**) contiguous:
it may be physically non-contiguous, since accesses to this memory go through address translation.
Using this function without a good reason is not recommended.

::

    struct page *alloc_pages(gfp_t flags, unsigned int order);
    void __free_pages(struct page *page, unsigned int order);

Allocates ``2**order`` whole, physically contiguous pages; the ``flags`` parameter specifies how
to allocate the pages (as in ``kmalloc``).

Documentation: https://docs.kernel.org/core-api/memory-allocation.html


Fixed-size allocator
--------------------

When you have a lot of objects of identical size, it can be useful to create
your own heap dedicated to a particular type of objects.
The following functions are used for this::

    struct kmem_cache *kmem_cache_create(
        const char *name, unsigned int size, unsigned int align,
        slab_flags_t flags,
        void (*ctor)(void *));

    void kmem_cache_destroy(struct kmem_cache *cachep);

``0`` is usually specified as the ``flags`` parameter
(as most flags are only used for debugging).

The ``struct kmem_cache`` is our private heap -- it consists of dynamically
allocated pages cut into chunks of exactly the specified length
with minimal overhead, additionally arranged to make the most of
the CPU cache.  Objects are allocated from it and returned to it with
the following functions::

    void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
    void kmem_cache_free(struct kmem_cache *cachep, void *objp);

Objects in the cache are initialized with the constructor given
when creating the cache (if any).  For simple cases where no constructor
is needed, the ``KMEM_CACHE`` macro conveniently wraps
``kmem_cache_create``.
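
The whole life cycle of such a cache can be sketched as follows.  This is kernel
code (not buildable in user space), and ``struct sample_obj`` together with all
``sample_*`` functions are hypothetical names used only for illustration.

.. code-block:: c

    #include <linux/slab.h>
    #include <linux/errno.h>

    /* Hypothetical object type stored in the cache. */
    struct sample_obj {
        int id;
        long payload;
    };

    static struct kmem_cache *sample_cache;

    static int sample_cache_init(void)
    {
        /*
         * KMEM_CACHE(struct_name, flags) names the cache after the
         * structure and takes the size and alignment from its type.
         */
        sample_cache = KMEM_CACHE(sample_obj, 0);
        if (!sample_cache)
            return -ENOMEM;
        return 0;
    }

    static struct sample_obj *sample_obj_new(int id)
    {
        struct sample_obj *obj;

        obj = kmem_cache_alloc(sample_cache, GFP_KERNEL);
        if (!obj)
            return NULL;
        obj->id = id;
        obj->payload = 0;
        return obj;
    }

    static void sample_obj_free(struct sample_obj *obj)
    {
        kmem_cache_free(sample_cache, obj);
    }

    static void sample_cache_exit(void)
    {
        /* All objects must have been freed before this point. */
        kmem_cache_destroy(sample_cache);
    }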


Error handling in the kernel
============================

When writing code that runs in kernel mode, it is important to remember that
the stability of the entire system depends on the correct operation of our code.
It is absolutely necessary to handle all
possible errors in our module in a way that does not interfere with the rest of the kernel,
as well as to strictly control the lifetime of allocated resources
(a memory leak in the kernel forces periodic reboots of the entire system).

Most non-trivial functions in the kernel can fail and return an error code as
a result.  Numeric error codes are used to describe the error encountered,
the same as ``errno`` in user code, but negated (for example,
a function detecting a permissions error executes ``return -EPERM;``).
The range of numbers for such error codes is -4095..-1 (``MAX_ERRNO`` is 4095).

There are 4 conventions for returning error codes from functions in the kernel
(and you should always check which convention is being used before using a function):

- The function does not return a result other than the error code -- the return type is ``int``,
  the return value is the error code, or 0 in case of success.
- The function returns a numeric type (``int``, ``long``, ``off_t``, ...) -- values
  in the range -4095 ... -1 indicate an error code, other values indicate a "normal" result.
- The function returns a pointer -- in case of an error, the negative error code is cast
  to the pointer type and returned.
  Before using the returned pointer, you need to check that it is not an encoded error.
- The function returns a pointer and does not use error codes -- in the event of an error,
  ``NULL`` is returned, and the caller must infer the appropriate error code
  (an example of such a function is ``kmalloc`` -- if ``NULL`` is returned,
  the ``-ENOMEM`` code must be propagated upwards).

The following macros (``linux/err.h``) are useful for handling error codes:

``IS_ERR_VALUE(x)``
  true if ``x`` (integer) is an error code (that is, it has a value in the range
  -4095 ... -1)
``void *ERR_PTR(long error)``
  converts an error code from a number to a pointer
``long PTR_ERR(const void *ptr)``
  converts in the reverse direction
``bool IS_ERR(const void *ptr)``
  true if the pointer is an error code
``bool IS_ERR_OR_NULL(const void *ptr)``
  true if the pointer is an error code or NULL
``void *ERR_CAST(const void *ptr)``
  converts an error code from a pointer to a pointer (useful for different
  pointer types)
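
The following user-space sketch mimics how the pointer-encoding helpers work; the
real definitions live in ``linux/err.h``.  The simplified re-implementations below,
as well as the ``lookup`` and ``lookup_value`` functions, are assumptions made for
illustration only, and the code assumes that ``long`` is as wide as a pointer (as
it is on Linux).

.. code-block:: c

    #include <errno.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define MAX_ERRNO 4095

    /* Simplified user-space stand-ins for the helpers from linux/err.h. */
    static inline void *ERR_PTR(long error)
    {
        return (void *)error;
    }

    static inline long PTR_ERR(const void *ptr)
    {
        return (long)ptr;
    }

    static inline bool IS_ERR(const void *ptr)
    {
        /* Error codes occupy the top 4095 values of the address space. */
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
    }

    static int table[4] = { 10, 20, 30, 40 };

    /* Hypothetical lookup: returns a valid pointer or an encoded error. */
    static int *lookup(long index)
    {
        if (index < 0)
            return ERR_PTR(-EINVAL);
        if (index >= 4)
            return ERR_PTR(-ENOENT);
        return &table[index];
    }

    /* Caller side: check for an encoded error before dereferencing. */
    static long lookup_value(long index)
    {
        int *p = lookup(index);

        if (IS_ERR(p))
            return PTR_ERR(p);
        return *p;
    }

Note that ``NULL`` is not an error pointer in this scheme, which is why
``IS_ERR_OR_NULL`` exists as a separate check.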

When a function we call returns an error, we must (unless
we have planned special handling for that particular error) clean up all
resources allocated in the current function and propagate the unmodified
error code upwards. A fairly common (and recommended) idiom in the kernel is to use
``goto`` to jump to a common error handling path::

    resource_c *get_c() {
        int res;
        mutex_lock(&lock);
        resource_a *a = get_a();
        if (IS_ERR(a)) {
            res = PTR_ERR(a);
            goto err_a;
        }
        int b = get_b();
        if (IS_ERR_VALUE(b)) {
            res = b;
            goto err_b;
        }
        resource_c *c = kmalloc(sizeof *c, GFP_KERNEL);
        if (c == NULL) {
            res = -ENOMEM;
            goto err_c;
        }
        c->a = a;
        c->b = b;
        mutex_unlock(&lock);
        return c;

        /* Common error handling */
    err_c:
        release_b(b);
    err_b:
        release_a(a);
    err_a:
        mutex_unlock(&lock);
        return ERR_PTR(res);
    }

A list of error codes can be found in ``asm-generic/errno-base.h`` and ``asm-generic/errno.h``.
Note that many of these errors have well-defined semantics,
sometimes loosely related to the description, and should only be used in specific
situations.  Of the more important codes, the following are worth mentioning:

``-EFAULT``
  Error when copying from/to user memory (any other use is
  incorrect).
``-ENOMEM``
  Memory exhaustion (but not other types of resources).
``-ENOSPC``
  Exhaustion of disk space or other sufficiently similar device.
``-ENOENT``
  The specified file (or other sufficiently similar resource) was not found.
``-ESRCH``
  The specified process was not found.
``-EPERM``
  No permissions (unspecified) to perform the operation.
``-EACCES``
  Operation prohibited by file system permissions.
``-EEXIST``
  The operation failed because the file (or other resource) already exists (used
  e.g., for operations that create files).
``-EIO``
  The device broke down in an unspecified way
  (the calling function is not at fault: e.g., scratched CD, pulled out cable, etc.).
``-EINVAL``
  The user has given incorrect parameters (conflicting, not supported by
  device, etc.).
``-ENOTTY``
  Attempting to perform an operation on an incompatible device type (e.g., trying to change
  terminal settings on a regular file).  Used primarily to reject unknown ``ioctl``.
``-ERESTARTSYS``
  Used to abort a wait when the kernel needs to leave kernel mode to deliver a
  signal to a user process -- at the appropriate place it will either be converted
  to ``-EINTR`` or the system call will be restarted.
``-EINTR``
  System call interrupted by signal -- *do not use directly* (return ``-ERESTARTSYS`` instead).
``-ESPIPE``
  Attempting to change the position of a file on an object where such a term does not make sense
  (pipe, socket, terminal...).

.. important::
    Returning ``-1`` instead of an error code, using an obviously invalid one,
    or carelessly disregarding the error code returned by the called
    function will incur negative points for the large assignments!


Sleep and waiting in the kernel
===============================

Whenever one writes code running in kernel mode, it is common to have to
wait for some event to occur (mutex release, device readiness,
data availability, etc.).  In such cases, we need to choose which type of
wait we want to use:

1. Active waiting (busy waiting): our kernel thread (or interrupt handler) occupies the CPU the entire time and
   continuously checks whether the event has already occurred.  It may be used only to
   wait for events that are guaranteed to occur within a very short, bounded
   time (e.g. release of a spinlock).  It must not be used to wait
   for events that are triggered from other kernel threads (unless we are
   guaranteed that the thread in question is currently running and will not be preempted,
   as in the case of waiting for a spinlock release).  This is the only
   type of wait that may be used in interrupt context or
   while holding a spinlock.

2. Uninterruptible sleep: our kernel thread adds itself to the wait queue
   and goes to sleep, asking to be woken up when the event occurs.  The
   processor is released and starts processing other kernel threads.  The only way
   to wake the thread is if the event occurs - a process in this state will not
   respond to signals.  It should only be used to wait for events,
   which will happen in a finite and small amount of time (otherwise we may
   end up with an unkillable suspended process).

3. Sleep interruptible by signals -- as above, but the waiting can be
   interrupted by the arrival of a signal that requires handling (that is, calling
   a signal handler or killing the process).  In such a case, the
   waiting function returns ``-EINTR``, and the waiting code should abandon
   what it is doing, undo (if possible) all the changes it has already made,
   and return to user space with the error code
   ``-ERESTARTSYS``.  Use it to wait for events that have no
   upper bound on their duration (e.g., user typing, receiving data on a socket).
   It must not be used where we are not allowed to fail
   the current operation (e.g. in the ``release`` function).

4. Sleep interruptible by death (`killable sleep`) -- as above, but the wait
   can be interrupted only by signals that will cause the death of the process.
   It is a kind of compromise: it allows the suspended process to be killed
   in cases where undoing partial changes is too difficult.
   If possible, it should not be used.

Simple locks
--------------

Kernel code frequently operates on data shared by many processes, so
functions that synchronize the execution of processes, locks among them,
are one of the most important groups of functions in the kernel.

The simplest type of lock is the ordinary lock (also called a mutex),
defined in ``linux/mutex.h``. Such a lock
is not recursive and must be released by the process that
acquired it. A process attempting to acquire an already locked mutex will
sleep (going into the S or D state) until the lock is released.

Locks are created in one of the following ways::

    /* for locks that are global variables */
    static DEFINE_MUTEX(lock);

    /* for locks in dynamically allocated structures */
    struct mutex lock2;
    /* ... */
    mutex_init(&lock2);

Among other operations, the following are available::

    void mutex_lock(struct mutex *lock);
    int mutex_lock_interruptible(struct mutex *lock);
    int mutex_lock_killable(struct mutex *lock);
    int mutex_trylock(struct mutex *lock);

The first one uses uninterruptible sleep, the second one
uses sleep interruptible by signals (returns ``-EINTR`` if woken up
by a signal, ``0`` on success), and the third uses sleep interruptible by
death.  The fourth function does not
wait for the lock to be released: if the lock is already
held, it returns ``0`` and does nothing; if it acquires the
lock, it returns ``1`` (note that this is the opposite of the usual ``0``-on-success convention).

::

    void mutex_unlock(struct mutex *lock);

Releases the lock and wakes up one of the processes waiting for it, if there are any.
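
Putting these pieces together, a critical section protected by a mutex might be
sketched as follows.  This is kernel code (not buildable in user space);
``counter``, ``counter_lock`` and ``counter_increment`` are hypothetical names,
and the function is assumed to be called from process context, e.g. on a system
call path.

.. code-block:: c

    #include <linux/mutex.h>
    #include <linux/errno.h>

    static DEFINE_MUTEX(counter_lock);
    static long counter;    /* shared data protected by counter_lock */

    /* Hypothetical helper: returns the new value or a negative error code. */
    static long counter_increment(void)
    {
        long value;

        /* Sleep until the lock is free, but give up if a signal arrives. */
        if (mutex_lock_interruptible(&counter_lock))
            return -ERESTARTSYS;

        value = ++counter;

        mutex_unlock(&counter_lock);
        return value;
    }

Nothing has been modified before the lock is taken, so in this sketch there is
nothing to undo when the wait is interrupted.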

Other types of locks and other synchronization functions will be discussed
in the next class.


Documentation: https://docs.kernel.org/locking/locktypes.html


Macro ``container_of``
======================

A quite interesting mechanism peculiar to the Linux kernel is the macro
``container_of``.  It is used to obtain the address of the structure containing
a given field based on the field address, the type of the structure, and the name of the field::

    struct a {
        int x;
        int y;
        int z;
    };

    struct a *ptr_a = ....;
    /* Having a pointer to the y field of some instance of structure a.... */
    int *ptr_y = &ptr_a->y;
    /* ... we can reconstruct from it a pointer to the whole structure */
    struct a *ptr_a_recovered = container_of(ptr_y, struct a, y);
    /* ptr_a == ptr_a_recovered */

This mechanism is used in many data structures in the kernel library
(such as lists and ``kref``). For example, in the list implementation, instead of having a separate
list node (in a separate allocation) that points to the actual
list element, the ``list_head`` structure is simply embedded
in the list element as a field, and the ``container_of`` macro is used to
get a pointer to the containing structure.
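
To show that there is no magic involved, here is a simplified user-space version
of the macro built on the standard ``offsetof``.  It is only a sketch: the kernel
definition additionally checks that the pointer type matches the field type, and
``a_from_y`` is a hypothetical helper added for the example.

.. code-block:: c

    #include <stddef.h>   /* offsetof */

    /* Simplified stand-in for the kernel macro (no type checking). */
    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct a {
        int x;
        int y;
        int z;
    };

    /* Recover the enclosing structure from a pointer to its y field. */
    static struct a *a_from_y(int *ptr_y)
    {
        return container_of(ptr_y, struct a, y);
    }

The macro simply subtracts the offset of the field within the structure from the
field address, so it is only valid for pointers that really point into an
instance of that structure.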


Using a standard implementation of lists
========================================

The kernel contains an efficient implementation of doubly linked lists. A list
consists of cyclically connected ``list_head`` structures, which are most often
fields of some larger structure. Note
that both the head of a list and its elements are ``list_head`` structures --
the end of a list is detected by comparison with the head address.

All list operations are available through the ``linux/list.h`` file.
Here are the most commonly used ones:

``list_head``
    structure representing the head (part) of a list
``LIST_HEAD(list)``
    macro defining and initializing a list head variable
``INIT_LIST_HEAD(list)``
    macro initializing a list head (for dynamically created lists)
``list_add(what, to_what)``
    adding ``what`` to the beginning of ``to_what``
``list_add_tail(what, to_what)``
    adding ``what`` to the end of ``to_what``
``list_del(what)``
    removing ``what`` from the list
``list_empty(list)``
    checking whether the list is empty
``list_splice(what, to_what)``
    joining two lists: the elements of ``what`` are inserted at the beginning of ``to_what``
``list_for_each(varname, over_list)``
    iterating the variable ``varname`` over each element of ``over_list``
``list_for_each_safe(varname, meanwhile, over_list)``
    iterating the variable ``varname`` over each element of ``over_list`` in a
    way that is safe against removal of the current element (the variable
    ``meanwhile`` holds the next element for this purpose)
``list_entry(my_list, structure, field)``
    computes a pointer to the start of the structure that contains ``my_list`` as a field named ``field`` of type ``list_head``
    (just syntactic sugar for ``container_of``)

Documentation: https://docs.kernel.org/core-api/kernel-api.html#list-management-functions
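
The way these pieces fit together can be tried out in user space with a heavily
simplified re-implementation.  The sketch below is not the kernel code: it copies
only the names and the basic behaviour of a few operations from ``linux/list.h``,
and ``struct item`` and ``sum_items`` are hypothetical names introduced for the
example.

.. code-block:: c

    #include <stddef.h>   /* offsetof */

    struct list_head {
        struct list_head *next, *prev;
    };

    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))
    #define list_entry(ptr, type, member) container_of(ptr, type, member)
    #define list_for_each(pos, head) \
        for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

    /* An empty list is a head that points to itself. */
    static void INIT_LIST_HEAD(struct list_head *list)
    {
        list->next = list;
        list->prev = list;
    }

    /* Insert new_node just before the head, i.e. at the end of the list. */
    static void list_add_tail(struct list_head *new_node, struct list_head *head)
    {
        new_node->prev = head->prev;
        new_node->next = head;
        head->prev->next = new_node;
        head->prev = new_node;
    }

    /* Unlink an entry by connecting its neighbours to each other. */
    static void list_del(struct list_head *entry)
    {
        entry->prev->next = entry->next;
        entry->next->prev = entry->prev;
    }

    /* Hypothetical list element: the list_head is embedded as a field. */
    struct item {
        int value;
        struct list_head node;
    };

    static int sum_items(struct list_head *head)
    {
        struct list_head *pos;
        int sum = 0;

        /* Iteration stops when we are back at the head. */
        list_for_each(pos, head)
            sum += list_entry(pos, struct item, node)->value;
        return sum;
    }

Because the ``list_head`` is embedded in ``struct item``, adding an element to
the list requires no extra allocation.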

.. notes::
    Andrzej::

      - Further details on how syscalls work (following up on lab3)
        * entry/entry_64.S
        * entry/common.c -> do_syscall_x64
        * SYSCALL_3; https://lwn.net/Articles/604287/
        * Example of what printf does:
          > strace on printf shows a write; man 2 write
          > SYSCALL_DEFINE3(write,
          > https://opensource.com/sites/default/files/uploads/virtualfilesystems_2-shim-layer.png
          > we get the file from the fd (fdget_pos); include/linux/fs.h
          > note the "user" annotation on the arguments
          > we call write or write_iter from file_operations
          > fs/ext4/file.c and ext4_file_operations, no write there
          > debugfs, for kernel debugging, uses copy_from_user internally
          > include/linux/uaccess.h; raw_copy_from_user; arch/x86/include/asm/uaccess_64.h
      - Further details on copy_from_user / copy_to_user in uaccess.h / uaccess_32.h, where you can see
         > SMAP / Supervisor Mode Access Prevention
         > STAC: Set AC Flag
      - Memory, kmalloc
      - Contexts: https://www.kernel.org/doc/htmldocs/kernel-hacking/basic-players.html
                  At any time each of the CPUs in a system can be:
                     1. not associated with any process, serving a hardware interrupt;
                     2. not associated with any process, serving a softirq or tasklet;
                     3. running in kernel space, associated with a process (user context);
                     4. running a process in user space.
      - Checking the context: in_irq(), in_softirq(), in_interrupt()

.. :
    .. _small_task_5:

    Small task #5
    =============

    Implement "bookmark" support for the ``lseek`` syscall (and related ones):

    - add a new field to the ``file`` structure — a stack of "bookmarks" (i.e., saved positions in the file)
    - add a new reference point to the ``lseek`` syscall: ``SEEK_BOOKMARK``, using the bookmark at the top of the bookmarks stack as a reference point
    - add two new flags to ``lseek``: ``SEEK_PUSH`` and ``SEEK_POP`` (which can be ORed with the other flags):

      - ``SEEK_PUSH`` pushes the current (before calling the syscall) position in the file onto the bookmark stack
      - ``SEEK_POP`` removes the bookmark from the stack

    In case of an error (using ``SEEK_BOOKMARK`` or ``SEEK_POP`` without bookmarks), the error ``EINVAL`` should be returned and nothing should be done.

    The appropriate place to make changes would be the ``vfs_llseek`` function.

Reading
=======

- https://docs.kernel.org/core-api/index.html