.. _lab-kernel-internal-api-part2-en:

============================================
Class 7: Internal Kernel Interfaces, part II
============================================

Date: 08.04.2025
:ref:`small_task_6`

.. notes::
    Od Andrzeja::

        = Wstęp
        - Przypomnienie, że ostatni tydzień pierwszego zadania.
        - Pytanie do studentów, czy są jakieś problemy z drugim zadaniem jak na razie.
        - Dzisiejsze zajęcia to rozwinięcie zajęć sprzed dwóch tygodni. Materiał w małym stopniu może przydać się w drugim dużym zadaniu, ale będzie bardzo ważny dla trzeciego.

        = Interfejsy wewnętrzne #2
        - Przedstaw materiały laboratoryjne, w szcególności:
        * put_user / get_user jako alternatywa dla copy_from/to_user
        * zmienne atomowe,  asm/atomic.h
        * SpinLock (spinlock arch/x86/include/asm/spinlock.h)
        * asm-generic/qspinlock.h
        * https://0xax.gitbooks.io/linux-insides/content/SyncPrim/linux-sync-1.html
        * Semafory - pokaz kod w asm/semaphore.h, ew. Rwsem.h
        * Operacje blokujące
        * Inne funkcje (na ile zostanie czas)

Information about the process
=============================

When writing code that will execute in the kernel address space,
it is important to keep in mind all the time that the thread of execution is
associated with a certain user process, on behalf of which the kernel
performs certain operations.
Using the ``current`` macro (``asm/current.h``) we can easily (and quickly) get to all the
information that the kernel stores about the current process in the ``task_struct`` structure (``linux/sched.h``).

One exception is interrupt handling - while during the execution of the hardware interrupt handling function,
the ``current`` macro may point to some process,
it should not be referenced (it is not related to this interrupt)
as well as you must not switch to other processes
(i.e., you can't perform blocking operations either).


Data exchange between user and kernel address space
===============================================================

.. highlight:: c

To read/write something from/to the memory space of
user programs, use the following functions
(actually macros):

``put_user(kptr, ptr)``.
    write a byte/word/long word into user program memory space
    (from under the address ``ptr``); the macro definition works automagically - the size
    is determined by the type to which ``kptr`` points.

``get_user(kptr, ptr)``.
    as above, but reading

Use the following functions to copy larger areas of memory::

    unsigned long copy_from_user(void *to, const void __user *from, unsigned long n);
    unsigned long copy_to_user(void __user *to, const void *from, unsigned long n);

The former allows copying data from the user address space to the kernel address space, the latter the opposite.
In general, they behave like ``memcpy``,
but it is important to note that in case of an address page error of the user space,
they can cause the process to sleep until the page is downloaded from the swap.
Before copying, the correctness of the address in user space is checked.
If the beginning of the area is correct, but the rest is not, the longest possible fragment is copied.

The value returned by both functions is the number of NOT copied bytes - a non-zero value indicates a copy error.

The functions and their corresponding macro definitions are defined in the ``asm/uaccess.h`` file.
Note that functions for blocks of powers of two are optimized.

In case of an error when copying from/to user space, syscalls should return ``-EFAULT``.


Mutual exclusion
==================

Synchronization in the Linux kernel continues to gain importance with its development.
It is worth examining more carefully the related issues described here and presented in the lecture.

All of the following functions work correctly on both single and multiprocessor systems.
Many of them are implemented internally in two versions, providing higher performance on single-processor systems.
For example, the implementation of spinlocks boils down to disabling kernel preemption (or disabling interrupts).

Kernel documentation: https://docs.kernel.org/locking/locktypes.html

Mutex
-----

Presented in previous classes.
In-kernel documentation is `here <https://docs.kernel.org/locking/mutex-design.html>`_.


System semaphores
------------------

Semaphores in Linux are used to synchronize processes and are represented by the ``struct semaphore`` structure (defined in ``asm/semaphore.h``).

Semaphores can be declared using macros such as::

   static DECLARE_SEMAPHORE_GENERIC(sem_general, 11);

The following operations, among others, are available::

    void down(struct semaphore *sem)
    int down_interruptible(struct semaphore *sem)
    int down_trylock(struct semaphore *sem)

Functions that lower the system semaphore. The variants are analogous to regular locks.

::

    void up(struct semaphore *sem)

This function raises the system semaphore.

Linux also provides reader-writer semaphores  of type ``struct rw_semaphore`` (``linux/rwsem.h``).

.. admonition:: Hands-on

    Please familiarize yourself with the functions available for semaphores of type readers-writers: (``linux/rwsem.h``).


Spin locks (spin locks)
-----------------------------

Spin locks have a similar effect to ordinary locks, but they use active waiting instead of blocking the process.
Thus, they can be used in places where blocking the process would be unacceptable (primarily interrupt handling functions),
or when ordinary locks would significantly reduce performance (i.e., with a small number of protected operations).

Note that NO locking operations can be performed once a spinning lock is taken!
(The most important of the functions that do not work with spinning locks are:
``copy_to/from_user``, ``kmalloc``, ``down``, ``sleep_on``, ``mutex_lock``).
Before calling them, the lock must be released.

Basic spinlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Spinlocks (defined in ``asm/spinlock.h``) are used in a general case in the following way::

    spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;
    unsigned long flags;

    spin_lock_irqsave(&xxx_lock, flags);
    /* ... critical section ... */
    spin_unlock_irqrestore(&xxx_lock, flags);

The above call is always safe (because it blocks interrupts on the local processor,
and then restores the original interrupt handling).

There is also a simplified version that we can use when a given lock is NEVER used in an interrupt handler function::

    spinlock_t xxx_lock = SPIN_LOCK_UNLOCKED;

    spin_lock(&xxx_lock);
    /* ... critical section ... */
    spin_unlock(&xxx_lock);


Reader-writer spinlocks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This type of spinlock (defined in ``asm/spinlock.h``) allows non-exclusive reading and exclusive full data access (e.g., writing)::

    rwlock_t xxx_lock = RW_LOCK_UNLOCKED;
    unsigned long flags;

    read_lock_irqsave(&xxx_lock, flags);
    /* .. critical section reading ... */
    read_unlock_irqrestore(&xxx_lock, flags);

    write_lock_irqsave(&xxx_lock, flags);
    /* .. critical section with exclusive access ... */
    write_unlock_irqrestore(&xxx_lock, flags);

This works well with complex data structures,
when most of the operations consist of traversing (reading) the structure.
Only the actual change (writing) requires exclusivity.

There is also a simplified version (without interrupt blocking) of reader-writer type spinning locks for use in situations,
when the use of the lock cannot appear in the interrupt handler program::

    rwlock_t xxx_lock = RW_LOCK_UNLOCKED;

    read_lock(&xxx_lock);
    /* .. critical section reading ... */
    read_unlock(&xxx_lock);

    write_lock(&xxx_lock);
    /* .. critical section with exclusive access ... */
    write_unlock(&xxx_lock);


Debugging locks
------------------

The kernel has a fairly sophisticated built-in lock checking facility called
`lockdep <https://docs.kernel.org/locking/lockdep-design.html>`_.
It allows detecting potential deadlocks by analyzing the sequence of locks occupied by individual kernel threads.
The deadlock does not have to occur to be detected - it is sufficient,
that any sequence required for a deadlock has ever been used by a thread.

The detected classes of errors are simple lock usage errors (e.g., uninitialized locks, recursive locking),
deadlocks caused by cyclic lock dependencies,
and deadlocks caused by incorrect use of locks inside interrupt handlers.

Lock debugging should be explicitly enabled in the kernel configuration, as it slows down the system.


Blocking operations
===================

Wait queues
-------------------

A process that must wait for a certain event to occur
(e.g., completion of an I/O operation, or arrival of data in a FIFO queue),
is removed from the ready process queue and placed in the waitqueues
The process state changes from ``TASK_RUNNING`` to ``TASK_INTERRUPTIBLE`` or ``TASK_UNINTERRUPTIBLE``.
The structure representing the head of the list in which such processes are placed is ``wait_queue_head_t``,
defined in the ``linux/wait.h`` file.

Selected operations:

``DECLARE_WAIT_QUEUE_HEAD(name)``.
    Before using ``wait_queue`` you need to initialize it.
``wait_event(name, cond)``.
    puts the process in the wait queue in a signal-resistant state (``TASK_UNINTERRUPTIBLE``)
    and sleeping until the cond occurs
``wait_event_interruptible(name, cond)``.
    as above, but transitioning to the ``TASK_INTERRUPTIBLE`` state.
    (the arrival of a signal will cause a transition to the ``TASK_RUNNING`` state),
    returns 0 if ``cond`` has occurred, ``-ERESTARTSYS`` if interrupted by a signal
``wait_event_timeout(name,cond,timeout)``.
    like ``wait_event``, but with waking up after the specified timeout even if ``cond`` has not occurred
``wait_event_interruptible_timeout(name,cond,timeout)``.
    as above, but transition to ``TASK_INTERRUPTIBLE`` state.
``wake_up(name)``.
    wakes up all waiting processes without the ``WQ_FLAG_EXCLUSIVE`` flag set, and one that has it set.
    Note: the function does not remove the process from the waiting queue -
    the process will 'remove' itself, after resuming operation
``wake_up_interruptible(name)``.
    as above, but only for processes in ``TASK_INTERRUPTIBLE`` state.
``wake_up_all(name)``.
    insert all processes
    (in state ``TASK_INTERRUPTIBLE`` or ``TASK_UNINTERRUPTIBLE``).
    located in the waiting queue to the ready process queue

The above operations are defined in ``linux/sched.h`` (except the first one, defined in ``linux/wait.h``).

When using wait queues, care should be taken to avoid races.
For example, if we are a reader and we want to wait for the writer to write any data to a lock-protected buffer,
the waiting code might look like this:

.. code:: c

    /* definitions */
    DEFINE_MUTEX(lock);
    DECLARE_WAIT_QUEUE_HEAD(queue);
    int pos_read, pos_write;
    char *buffer;

    /* loading a byte */
    char character;
    /* locking */
    mutex_lock(&block);
    /* check if we already have data to read */
    while (pos_read == pos_write) {
        /* we don't have - we need to remove the lock so that the writer can do anything */
        mutex_unlock(&block);
        /* we are waiting for a write - the condition ensures that we don't wait,
           if the writer has updated pos_write between the time we gave up the lock,
           and we joined the queue - otherwise the
           writer's wake_up would not cover our process and we would wait
           much longer (perhaps forever). The condition is checked
           after being added to the waiting queue, but before the actual
           waiting. */
        wait_event(queue, pos_read != pos_write);
        /* wait_event only ensures that at some point since the call
           pos_read != pos_write occurred, but some other thread may have
           emptied the buffer in the meantime - we can't assume anything
           until we check this condition while holding the lock - so we take the
           lock and check the while condition again, until we get it */
        mutex_lock(&block);
    }
    /* successful - we can read the character */.
    character = buffer[pos_read++];
    mutex_unlock(&block);


Waiting for completion
----------------------

This is a rather strange variant of the semaphore:

- we create a structure ``struct completion``, which initially has a value of 0
- we can perform the ``complete`` operation, which increases the value of the structure by 1
- we can perform the ``wait_for_completion`` operation (or its variants), which waits,
  until the structure has a positive value, then decreases it by 1
- we can perform the ``complete_all`` operation, which permanently sets the value of the structure to ``UINT_MAX``
  (all waiting threads are woken up, all further executions of ``complete`` and ``wait_for_completion`` will be no-ops)

Here is a typical use of the 'wait for completion' mechanism::

    struct completion event;

    init_completion(&event);

    /* .. pass a pointer to event to whoever is to wake up .... */

    wait_for_completion(&event);

The wake-up caller calls::

    complete(&event)

when the appropriate time arrives.

The implementation is in the ``kernel/sched/completion.c`` file,
while the structure itself is declared in ``linux/completion.h``.


Other useful kernel features
==============================

The kernel library contains many other ready-to-use functions,
which can prove useful when writing a great variety of drivers.
Their documentation may be found under the
`Core API <https://docs.kernel.org/core-api/index.html>`_ documentation section.
Among the more useful ones can be mentioned:

``linux/idr.h``.
    ``int`` map into ``void *`` with dynamic identifier allocation
``linux/kref.h``.
    easy reference counting
``linux/bitmap.h``
    efficient bitmap arrays
``linux/btree.h``.
    B-trees
``linux/bug.h``, ``asm-generic/bug.h``.
    Reporting critical errors in the driver code resulting from code defects and warnings
    (something like ``assert``)
``linux/circ_buf.h``.
    cyclic buffers
``linux/hash.h``.
    simple hash functions
``linux/kernel.h``.
    Miscellaneous simple functions:

    ``ALIGN(x,a)``.
        aligns ``x`` down to a multiple of ``a`` (``a`` must be a power of two)
    ``PTR_ALIGN(p, a)``.
        as above, but on pointers
    ``IS_ALIGNED(x, a)``.
        checks if ``x`` is already aligned
    ``ARRAY_SIZE(arr)``.
        array size ``arr``
    ``DIV_ROUND_UP(n,d)``.
        ``n/d``, rounding up.
    ``roundup(x, y)``.
        rounds ``x`` up to a multiple of ``y``.
    ``upper_32_bits(x), lower_32_bits(x)``.
        as in the name
    ``might_sleep()``.
        denotes where the code can sleep, helps with debugging
        (throws error if spinlock debugging is enabled and spinlock is held)
    ``min(x,y), max(x,y)``.
        as in the name
    ``clamp(val, min, max)``.
        ``val`` clipped to the range [``min``, ``max``].

``linux/kobject.h``.
    general object type with reference counting and visibility in sysfs
    (among others, ``cdev`` and ``device`` are based on them).
``linux/parser.h``.
    a simple parser for options
``linux/rbtree.h``.
    red-black trees


.. _small_task_6:

Small task #6
=============

Implement "bookmark" support for the ``lseek`` syscall (and related ones):

- add a new field to the ``file`` structure — a stack of "bookmarks" (i.e., saved positions in the file)
- add a new reference point to the ``lseek`` syscall: ``SEEK_BOOKMARK``, using the bookmark at the top of the bookmarks stack as a reference point
- add two new flags to ``lseek``: ``SEEK_PUSH`` and ``SEEK_POP`` (which can be ORed with the other flags):

  - ``SEEK_PUSH`` pushes the current (before calling the syscall) position in the file onto the bookmark stack
  - ``SEEK_POP`` removes the bookmark from the stack

In case of an error (using ``SEEK_BOOKMARK`` or ``SEEK_POP`` without bookmarks), the error ``EINVAL`` should be returned and no do nothing.

The appropriate place to make changes would be the ``vfs_llseek`` function.

Please submit a patch generated with ``git format-patch``.