.. _lab-syscall-en:

==========================
Class 3: kernel interfaces
==========================

Date: 11.03.2025

:ref:`small_task_3`

System calls in the Linux kernel
================================

What is a system call?
----------------------

System calls (syscalls) are a mechanism that allows processes
operating in the user space to request certain functions
from the kernel.

Syscalls are the main mechanism of communication of user processes
with the outside world -- the only thing that a program that
does not use syscalls can do is to hang or cause a runtime error.
System calls are a **stable** interface of the Linux kernel -- new ones may be added, but existing ones are not removed.

The mechanism of calling syscalls from the user level varies
for each architecture, and sometimes even for processors
and / or kernel versions within a single architecture.
Usually, this is done using an assembler
instruction specially designed for this purpose.

Syscalls are identified by numbers. Their exact list and numbering
depend on the architecture -- you can view them in the file ``/usr/include/asm/unistd*.h`` or
``arch/*/include/uapi/asm/unistd*.h`` in the Linux sources.

A value returned from the kernel in the range [-4095, -1] means an error and
is a negated standard error code (listed in ``/usr/include/asm-generic/errno.h`` and
``errno-base.h``). Other values mean success, and their
meaning depends on the syscall.
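
The return-value convention can be sketched in plain C. This is only an illustration: the helper names ``raw_is_error`` and ``raw_to_libc`` are made up for this example, and the sketch assumes the kernel's limit of 4095 for error numbers.

.. code-block:: c

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical helper: nonzero if a raw kernel return value encodes
     * an error. Assumes error codes occupy the range [-4095, -1]. */
    static int raw_is_error(long ret)
    {
        return (unsigned long)ret >= (unsigned long)-4095L;
    }

    /* Hypothetical helper that mimics a libc wrapper: on error, store
     * the positive code in errno and return -1. */
    static long raw_to_libc(long ret)
    {
        if (raw_is_error(ret)) {
            errno = (int)-ret;
            return -1;
        }
        return ret;
    }

    int main(void)
    {
        long samples[] = { 3, 0, -ENOENT, -EINVAL, -4096 };

        for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
            errno = 0;
            long r = raw_to_libc(samples[i]);
            if (r == -1 && errno != 0)
                printf("%ld -> error %d (%s)\n", samples[i], errno, strerror(errno));
            else
                printf("%ld -> success, value %ld\n", samples[i], r);
        }
        return 0;
    }

Comparing as unsigned numbers turns the range check into a single comparison.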


Review of the most important syscalls
-------------------------------------

.. admonition:: Hands-on

    Syscalls executed by a process may be inspected with ``strace``.
    This program shows *all* system calls made by the process -- including the ones made by the dynamic linker.

    Go ahead and trace a process, for instance::

        strace ls

    You'll find a list of the most common system calls below.
    Try to figure out what's happening and analyze the work of the dynamic linker
    in correspondence to the previous labs.

.. notes::
    Walk through this with students.
    The dynamic linker for ``strace ls`` should use almost all of the calls below,
    so you may just shortly explain them one by one.

    Some notes:

    - we start from implicit fork() + execve
    - openat is relative open
    - mmap has a strict correspondence to a task from previous labs /proc/.../maps (segments by the linker)
    - skip over libc/libpcre to mprotect
    - ioctl with terminal size
    - getdents -> directory entries

Process control
^^^^^^^^^^^^^^^

These syscalls are related to process management. It should be noted that
the Linux kernel and the POSIX standard (including the pthreads thread
library) use different process definitions: kernel processes correspond
to POSIX threads.  Here we use the POSIX definition.

``noreturn void _exit (int status)``
  ends the thread
``noreturn void exit_group (int status)``
  ends the process
``pid_t getpid()``
  returns the ID of the current process
``pid_t gettid()``
  returns the ID of the current thread
``pid_t fork()``
  creates a new process that is a copy of the current one (but with
  only one thread); this is a special case of the ``clone`` syscall
``int clone (int (*fn) (void *), void *newstack, int flags, void *arg, ...)``
  creates a new process or thread, quite a complicated function
``pid_t waitpid (pid_t pid, int *stat_loc, int options)``
  waits for an event (exit, stop, etc.) in the child process
``int execve (const char *path, char *const *argv, char *const *envp)``
  launches a new program in the current process, replacing the current one
``long ptrace (int request, pid_t pid, void *addr, void *data)``
  performs many operations related to tracing other processes: it allows you
  to stop a given process, single-step it, read and write its address space,
  registers, etc. Used by gdb and strace.
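
A minimal sketch of how ``fork``, ``execve`` and ``waitpid`` are typically combined to run another program. It uses the libc wrappers rather than raw syscalls; the helper name ``run_and_wait`` and the ``/bin/sh`` command line are assumptions made for this example.

.. code-block:: c

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    extern char **environ;

    /* Hypothetical helper: run argv[0] in a child process and return its
     * exit status, or -1 if it could not be run or did not exit normally. */
    static int run_and_wait(char *const argv[])
    {
        pid_t pid = fork();
        if (pid == -1) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            /* Child: replace the current program with the new one. */
            execve(argv[0], argv, environ);
            /* Only reached if execve failed. */
            perror("execve");
            _exit(127);
        }
        /* Parent: wait for this particular child to change state. */
        int status;
        if (waitpid(pid, &status, 0) == -1) {
            perror("waitpid");
            return -1;
        }
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        char *const argv[] = { "/bin/sh", "-c", "exit 3", NULL };

        printf("parent pid %d\n", (int)getpid());
        fflush(stdout);  /* avoid duplicating buffered output in the child */
        printf("child exited with status %d\n", run_and_wait(argv));
        return 0;
    }

Running this under ``strace -f`` shows which syscalls the wrappers actually issue.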

File support
^^^^^^^^^^^^

At the syscall level, open files are identified by so-called file descriptors,
i.e., small non-negative integers. Descriptors 0-2 typically correspond to
the standard input, output, and error output. Other descriptors rarely
have specific roles defined.

The most important syscalls from this group:

``ssize_t read (int fd, void *buf, size_t len)``
  reads from the file ``fd`` to the buffer; returns the number of bytes read.
  0 means the end of the file (this is not considered an error). A positive
  number, but less than len means a partial read -- this may be due to an error,
  hitting the end of the file, or simply lack of more available data at the moment.
``ssize_t write (int fd, const void *buf, size_t len)``
  writes to a file, works similarly to read.
``int open (const char *fname, int flags, mode_t mode)``
  opens a file, returns the descriptor.
``int close (int fd)``
  closes the file; may return an error if there is a problem in emptying kernel buffers.
``int ioctl (int fd, int request, void *arg)``
  performs a special operation on the file. The set of available operations
  is very dependent on the file. Mostly only applies to device files. Examples
  of special operations:

  - change of terminal settings (performed on a terminal file)
  - volume change (performed on the sound card device file)
  - reading information about the manufacturer and physical parameters
    (performed on the hard disk device file)

``int poll(struct pollfd *fds, int nfds, int timeout)``
  waits for one of the listed events to take place on one of the given
  descriptors; useful when the program can receive input from many sources
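
Because ``read`` and ``write`` may transfer fewer bytes than requested, callers usually loop. The sketch below shows one way to do that; the helper names ``write_all`` and ``copy_file`` are made up for this example, and ``/proc/self/status`` is just a file assumed to exist on any Linux system.

.. code-block:: c

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical helper: write all len bytes, retrying after partial
     * writes and EINTR. Returns 0 on success, -1 on error. */
    static int write_all(int fd, const char *buf, size_t len)
    {
        while (len > 0) {
            ssize_t n = write(fd, buf, len);
            if (n == -1) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            buf += n;
            len -= (size_t)n;
        }
        return 0;
    }

    /* Hypothetical helper: copy a file to out_fd.
     * Returns the number of bytes copied, or -1 on error. */
    static long copy_file(const char *path, int out_fd)
    {
        char buf[4096];
        long total = 0;
        int fd = open(path, O_RDONLY);

        if (fd == -1)
            return -1;
        for (;;) {
            ssize_t n = read(fd, buf, sizeof buf);
            if (n == 0)                 /* end of file, not an error */
                break;
            if (n == -1) {
                if (errno == EINTR)
                    continue;
                close(fd);
                return -1;
            }
            if (write_all(out_fd, buf, (size_t)n) == -1) {
                close(fd);
                return -1;
            }
            total += n;
        }
        return close(fd) == -1 ? -1 : total;
    }

    int main(void)
    {
        if (copy_file("/proc/self/status", STDOUT_FILENO) == -1) {
            perror("copy_file");
            return 1;
        }
        return 0;
    }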

Memory management
^^^^^^^^^^^^^^^^^

``void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)``
  creates a new memory area. If ``flags`` contains ``MAP_ANONYMOUS``, this is
  simply a new block of memory.  Otherwise, the area will be mapped to
  the given file -- reading from the area will give us the contents of the file.
  If ``flags`` contains ``MAP_SHARED``, writing to the area will also write to
  the file; otherwise (``MAP_PRIVATE``), writing to the area will create a new
  copy of the data from the file and modify only this copy. If the ``MAP_FIXED``
  flag is passed, the area will be mapped at the given address; otherwise,
  the kernel will look for a free address.
``int munmap (void *addr, size_t len)``
  unmaps (destroys) the given memory area
``int mprotect (void *addr, size_t len, int prot)``
  changes the access rights to a given memory area
``void *brk (void *addr)``
  shortens or extends the process heap segment

It should be noted that on Linux (as well as many other UNIXes) there are two
methods for the allocation of "normal" memory: ``mmap`` with the option
``MAP_ANONYMOUS`` and ``brk``. The ``malloc()`` implementation in glibc, in its default
configuration, uses the latter for small allocations, and the former for large allocations.
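
A short sketch of the anonymous-mapping path, followed by ``mprotect`` and ``munmap``. The helper name ``anon_alloc`` is made up for this example; it is not how ``malloc`` is implemented, only an illustration of the same syscalls.

.. code-block:: c

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS under strict -std= modes */
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical helper: get len bytes of zero-filled memory from mmap.
     * Returns NULL on failure. */
    static void *anon_alloc(size_t len)
    {
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    int main(void)
    {
        size_t len = (size_t)sysconf(_SC_PAGESIZE);
        char *p = anon_alloc(len);

        if (!p) {
            perror("mmap");
            return 1;
        }
        strcpy(p, "hello from an anonymous mapping");

        /* Drop write permission; a later store to p would raise SIGSEGV. */
        if (mprotect(p, len, PROT_READ) == -1) {
            perror("mprotect");
            return 1;
        }
        printf("%s\n", p);

        if (munmap(p, len) == -1) {
            perror("munmap");
            return 1;
        }
        return 0;
    }

While the program is running, the mapping is visible as a separate line in ``/proc/<pid>/maps``.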


Performing system calls on Linux
================================

.. admonition:: Hands-on

    The most direct way of invoking a system call from C is to use the ``syscall`` wrapper function,
    which expects a syscall number (or a constant from ``sys/syscall.h``) and its arguments.
    The function returns the syscall result on success, and -1 on error,
    while storing the actual (non-negated) error code in the global variable ``errno``.

    Compile the program listed below and open it in a debugger.
    Single-step it from the ``main()`` function to see how the system call is executed,
    and refer to the sections below for the exact description.

    Note the instruction address which performed the actual call -- check what was mapped there in the ``maps`` file.

    .. code-block:: c

        #include <unistd.h>
        #include <sys/syscall.h>
        #include <errno.h>

        int main()
        {
            int rc = syscall(SYS_chmod, "./f", 777); // oops
            if (rc == -1)
                return errno;
            return 0;
        }

.. notes::
    From old notes:

     - Show, compile, run, and explain ex1 from lab2

       * Additionally, compile with -H and show which files are used
       * Show what is in /usr/include/unistd.h
       * dpkg -S /usr/include/unistd.h
       * Discuss how errno works; /usr/include/errno.h; /usr/include/asm-generic/errno.h and errno-base.h
       * nm -D a.out
       * glibc source code, csu/errno
       * Puzzle: 0777 vs 777 (octal vs decimal; a common mistake to watch out for)
       * Run under gdb, show "info register" before the syscall is made and discuss the contents (rdi, rsi, etc.)


Before calling a syscall, you should put its number and parameters in certain
fixed registers of the processor. After the call, the result or error code
is also available in a register.

System calls on the x86_64 architecture
---------------------------------------

.. tip::
     There is `a useful reference <https://syscalls64.paolostivanin.com/>`_ available.

The only native mechanism of system calls on the x86_64 architecture is
the ``syscall`` instruction.  The call takes place as follows:

- the syscall number is passed in ``rax``,
- parameters are passed in: ``rdi``, ``rsi``, ``rdx``, ``r10``, ``r8``, ``r9`` (in this order),
- the ``syscall`` instruction is called,
- the contents of ``rcx`` and ``r11`` are destroyed by the kernel as a side effect of the syscall,
- the result is in ``rax``.
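
The convention above can be tried out from C with GCC-style inline assembly. This is a sketch for x86_64 only; the helper name ``raw_syscall3`` is made up for this example, and the result is the raw kernel value, so errors come back as negated error codes rather than through ``errno``.

.. code-block:: c

    #include <errno.h>
    #include <stdio.h>
    #include <sys/syscall.h>   /* SYS_write, SYS_getpid */
    #include <unistd.h>

    #if !defined(__x86_64__)
    #error "this sketch only works on x86_64"
    #endif

    /* Hypothetical helper: raw syscall with up to three arguments.
     * Number in rax, arguments in rdi, rsi, rdx; rcx and r11 are clobbered. */
    static long raw_syscall3(long nr, long a1, long a2, long a3)
    {
        long ret;

        __asm__ volatile ("syscall"
                          : "=a"(ret)
                          : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                          : "rcx", "r11", "memory");
        return ret;
    }

    int main(void)
    {
        static const char msg[] = "hello via raw syscall\n";

        long n = raw_syscall3(SYS_write, 1, (long)msg, (long)(sizeof msg - 1));
        long pid = raw_syscall3(SYS_getpid, 0, 0, 0);
        long bad = raw_syscall3(SYS_write, -1, (long)msg, 1);  /* invalid fd */

        printf("write returned %ld, getpid returned %ld (libc says %d)\n",
               n, pid, (int)getpid());
        printf("write to fd -1 returned %ld (-EBADF is %d)\n", bad, -EBADF);
        return 0;
    }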

System calls on the i386 architecture
-------------------------------------

There are 3 mechanisms of system calls on the i386 architecture:

- interrupt ``0x80`` (available on all processors)
- ``sysenter`` instruction (available on Intel processors starting from Pentium Pro)
- ``syscall`` instruction (available on AMD processors starting from K6)

The ``sysenter`` / ``syscall`` instructions were introduced in later processors
due to poor performance of interrupts on x86 processors.

If you use interrupt ``0x80``, the system call looks like this:

- the syscall number is passed in ``eax``
- parameters are passed in the following registers: ``ebx``, ``ecx``, ``edx``, ``esi``,
  ``edi``, ``ebp`` (in this order); syscalls requiring more parameters have special conventions
- syscall is called by the ``int $0x80`` instruction
- the syscall result will be in the ``eax`` register

Calls through ``syscall`` and ``sysenter`` are quite similar, but a bit more
complicated.

VDSO and vsyscall mechanisms
-----------------------------

Due to the existence of many syscall mechanisms on the i386 architecture and
the need to choose the right one for the given machine, the VDSO mechanism
was introduced. VDSO is a small shared library provided by the kernel that
contains the syscall function appropriate for a given processor. The kernel
contains several prepared versions of this library (``int 0x80``, ``syscall``,
``sysenter``) and selects the appropriate one at run time.

The x86_64 architecture does not require a syscall selection mechanism,
but an improved mechanism for the execution of certain syscalls (``clock_gettime``,
``gettimeofday``, ``time``, ``getcpu``) has been introduced there. These syscalls have optimized
versions that do not require the processor to go into kernel mode (they only
read global kernel variables, which are made available to read from the user
space in a special way). This mechanism also uses a code block exported by
the kernel to the user space: originally the fixed-address vsyscall page, and
in current kernels the VDSO.

The VDSO and vsyscall code also include the implementation of the ``sigreturn``
and ``rt_sigreturn`` functions used when returning from the user-space signal
handling function.
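
A process can find out where the kernel mapped the VDSO by reading its auxiliary vector. The sketch below assumes glibc's ``getauxval`` and the ``AT_SYSINFO_EHDR`` entry; the helper name ``vdso_base`` is made up for this example.

.. code-block:: c

    #include <elf.h>        /* ELFMAG, SELFMAG, AT_SYSINFO_EHDR */
    #include <stdio.h>
    #include <string.h>
    #include <sys/auxv.h>   /* getauxval() */

    /* Hypothetical helper: address of the VDSO's ELF header, or NULL
     * if the kernel did not map one into this process. */
    static const unsigned char *vdso_base(void)
    {
        return (const unsigned char *)getauxval(AT_SYSINFO_EHDR);
    }

    int main(void)
    {
        const unsigned char *base = vdso_base();

        if (!base) {
            puts("no VDSO mapped");
            return 0;
        }
        /* The VDSO is an ordinary ELF shared object living in memory. */
        printf("VDSO ELF header at %p, ELF magic present: %s\n",
               (const void *)base,
               memcmp(base, ELFMAG, SELFMAG) == 0 ? "yes" : "no");
        return 0;
    }

The printed address should fall inside the ``[vdso]`` line of ``/proc/self/maps``.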

System calls in libc
--------------------

Most system calls have their "wrappers" in the standard C library (libc).
These are functions whose only task is to move parameters to the right place,
call the appropriate syscall, and return the result. Note that the kernel
and libc have different conventions for passing error information --
the kernel returns the negated error code (e.g. ``-EINVAL``) directly from
the syscall, whereas the library functions always return -1 on error, and
the error code is passed in the global variable ``errno`` (error code in ``errno`` is *not* negated).

The simplified version of the write syscall wrapper (omitting vdso, errno,
cancellation point) may look like this::

    .global write
    write:
    movl $1, %eax           # syscall number
    syscall
    cmpq $-4096, %rax       # error?
    jna out                 # if not error, then exit
    neg %rax                # -EINVAL -> EINVAL etc.
    movl %eax, errno        # set errno
    movl $-1, %eax          # return -1
    out:
    ret

Not all system calls correspond directly to library functions, for many reasons:

- many syscalls have several versions with parameters of various sizes (mainly
  those related to uids / gids, pids, offsets in files, etc.). Older versions
  with smaller parameters are preserved as part of compatibility with older
  versions of libc. Examples are syscall ``getuid`` (16-bit uid) and ``getuid32``
  (32-bit uid) as well as ``lseek`` (32-bit offset) and ``_llseek`` (64-bit offset).
  Existing versions of a syscall depend strongly on the architecture -- for example,
  64-bit architectures have never had syscalls with 32-bit offsets.
- some syscalls (``ipc``, ``socketcall``) in fact have many subfunctions with
  different parameters (``shmat``, ``shmctl``, ``msgctl``, ``socket``, ``connect``,
  ``bind``, ``listen``, etc.). Each of these subfunctions has its own wrapper function in libc.
- many syscalls have semantics modified by the thread library (more about this below)
- some syscalls require special intervention from libc (``vfork``, ``clone``, etc.)
- because it is so

.. admonition:: Hands-on

    Compile the following program statically (``-static``, so we won't be spammed by the linker)
    and execute it under strace.

    .. code-block:: c

        #include <sys/time.h>
        #include <stdio.h>

        int main()
        {
            struct timeval tv;
            printf("123\n");
            gettimeofday(&tv, NULL);
            printf("%ld\n", (long)tv.tv_sec);
        }

    Where did the call to ``gettimeofday`` (syscall #96 on x86_64) go?
    If you redirect the output to a file/pipe, why is there only a single ``write``?

.. notes::
    From the old notes:

     - Show, compile, run, and explain ex2 from lab2

       * run under strace (2>&1 | less)
       * only one write, via vdso
       * cat /proc/self/maps | grep "vsyscall\|vdso"
       * Disabling vdso: vim /etc/default/grub (vdso=0); update-grub

     - Syscalls are stable, but new ones are added from time to time. They are usually wrapped in functions

       * Checking the Linux version: uname -a and possibly /proc/version
       * Checking the glibc version: ldd --version
       * A short tour of the glibc code, e.g.:

         - string/ and strcmp
         - signal/ kill, sigblock (and sysdeps/unix/syscalls.list)
         - clone

     - Show, compile, run, and explain ex3 and ex4

Other useful syscalls
=====================

.. admonition:: Hands-on

    In contrast to the fairly simple ``fork()`` system call, ``clone`` has a more advanced
    interface allowing fine-grained specification of information shared between the parent and child process.
    Moreover, instead of returning in both processes, it calls a given function on its own stack.
    Refer to the ``man clone`` page for a list of available flags.

    The code attached below presents the basic usage and how data may be shared.
    Review the code below and explain what changes when the "vm" parameter is passed.

    .. code-block:: c

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/wait.h>

        #define STACK_SIZE	65536

        int global_value = 0;
        char *heap;
        static int child_func(void *arg)
        {
            char *buf = (char *)arg;
            printf("Child sees buf = %s\n", buf);
            printf("Child sees global value = %d\n", global_value);
            printf("Child sees heap = %s\n", heap);
            strcpy(buf, "hello from child");
            global_value = 10;
            strcpy(heap, "bye");
            return 0;
        }

        int main(int argc, char *argv[])
        {
            //Allocate stack for child task
            char *stack = malloc(STACK_SIZE);
            unsigned long flags = 0;
            char buf[256];
            int status;

            if (!stack) {
                perror("Failed to allocate memory");
                exit(1);
            }
            heap = malloc(1024);
            if (!heap) {
                perror("Failed to allocate memory");
                exit(2);
            }
            if (argc == 2 && !strcmp(argv[1], "vm"))
                flags |= CLONE_VM;
            strcpy(buf, "Hello from Parent");
            strcpy(heap, "Hey");
            global_value = 5;
            if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
                perror("clone");
                exit(1);
            }
            if (wait(&status) == -1) {
                perror("Wait");
                exit(1);
            }
            printf("Child exited with status:%d\t", status);
            printf("buf:%s\t global_value=%d\n",
                    buf, global_value);
            printf("Parent heap:%s\n", heap);
            return 0;
        }

futex
-----

Traditional methods of implementing mutexes in the userspace required,
to avoid busy waiting, creating, for example, an unnamed pipe for each mutex
(used to awaken waiting processes). This approach has a number of disadvantages:

- mutexes are expensive: every mutex that may have to block its users
  takes up two file descriptors, and the mutex structure itself must also be quite large
- implementation of interprocess mutexes (e.g., in shared memory) is very heavy

Kernel 2.6 added the ``futex`` syscall (fast userspace mutex), which
significantly simplifies the implementation of mutexes::

   int futex (int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

This syscall, like ``socketcall``, is actually a wrapper for several subfunctions (selected by the ``op`` parameter):

- ``FUTEX_WAIT``: atomically checks whether ``*uaddr == val`` and, if so, puts the caller to sleep.
  If a timeout is given, the sleep lasts at most that long;
  otherwise it lasts until the caller is woken up.
- ``FUTEX_WAKE``: wakes up at most ``val`` processes waiting in ``FUTEX_WAIT``
  on the address ``uaddr``.
- … and a few more, more complicated ones

The use of futex avoids problems with the traditional implementation:

- syscall ``futex`` is called only when the mutex is already busy. There is no
  constant consumption of resources: kernel resources are used only when
  a thread is actually waiting for the mutex to be released
- mutex is a very small structure (for the basic variant, a single ``int`` is enough)
- mutexes work between processes without any special handling -- the kernel uses
  the physical address for comparisons and correctly handles references to the same
  place through different addresses in different processes, etc.
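
A minimal sketch of ``FUTEX_WAIT`` and ``FUTEX_WAKE`` working across two processes through shared memory. It is a one-shot flag rather than a full mutex, the helper names are made up for this example, and the syscall is made through ``syscall()`` because glibc provides no ``futex`` wrapper.

.. code-block:: c

    #define _GNU_SOURCE
    #include <linux/futex.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Thin hypothetical wrapper around the raw syscall. */
    static long futex(uint32_t *uaddr, int op, uint32_t val)
    {
        return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
    }

    /* Block until *flag becomes nonzero. */
    static void wait_for_flag(uint32_t *flag)
    {
        while (__atomic_load_n(flag, __ATOMIC_ACQUIRE) == 0) {
            /* Sleeps only if *flag is still 0, otherwise returns at once. */
            futex(flag, FUTEX_WAIT, 0);
        }
    }

    /* Set the flag and wake at most one waiter. */
    static void set_flag(uint32_t *flag)
    {
        __atomic_store_n(flag, 1, __ATOMIC_RELEASE);
        futex(flag, FUTEX_WAKE, 1);
    }

    /* Returns the child's exit status (42 if it was woken), or -1. */
    static int run_demo(void)
    {
        /* A single shared int is the whole synchronization object. */
        uint32_t *flag = mmap(NULL, sizeof *flag, PROT_READ | PROT_WRITE,
                              MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int status;

        if (flag == MAP_FAILED)
            return -1;
        pid_t pid = fork();
        if (pid == -1)
            return -1;
        if (pid == 0) {
            wait_for_flag(flag);
            _exit(42);
        }
        usleep(100 * 1000);   /* give the child time to fall asleep */
        set_flag(flag);
        if (waitpid(pid, &status, 0) == -1)
            return -1;
        munmap(flag, sizeof *flag);
        return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
    }

    int main(void)
    {
        printf("child woke up and exited with status %d\n", run_demo());
        return 0;
    }

The loop in ``wait_for_flag`` is needed because ``FUTEX_WAIT`` may also return without the value having changed, for example when interrupted by a signal.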


Signals
=======

Signals are a mechanism for transmitting information about asynchronous or synchronous
events to the user's process. This mechanism is very similar to the kernel-level interrupt mechanism.

There are approximately 32 signals with a fixed purpose (the list is to some extent
dependent on the architecture), and 32 real-time signals that can be used by the user
for any purpose. Important signals are:

Signals sent by the kernel caused by processor errors:

- ``SIGSEGV``: informs about violation of protection mechanisms, most often reference to a bad memory area
- ``SIGILL``: informs about the execution of an incorrect machine instruction
- ``SIGBUS``: indicates a memory access error for a reason other than an invalid address or lack of permissions. It's hard to get it on x86. On other architectures, it is often caused by, for example, access to a word at an unaligned address.
- ``SIGFPE``: floating point exception, originally reported exceptions in floating point calculations, later also used for arithmetic errors on integers (dividing by 0)
- ``SIGTRAP``: informs about hitting a breakpoint, used for debugging programs

Signals for controlling processes (sent by other processes):

- ``SIGTERM``: informs the process that it should end
- ``SIGKILL``: forcefully terminates the process
- ``SIGSTOP``: stops the process (with the possibility of continuing)
- ``SIGCONT``: continues the process
- ``SIGCHLD``: informs the parent that a child process has changed state (exited, stopped, or continued)

Signals related to terminal service:

- ``SIGHUP``: informs about disconnecting the terminal (e.g., closing the xterm window, disconnecting the ssh session)
- ``SIGINT``: indicates that Ctrl-C has been pressed
- ``SIGQUIT``: indicates that Ctrl-\\ has been pressed
- ``SIGTSTP``: indicates that Ctrl-Z has been pressed
- ``SIGTTIN``: informs about trying to read from the controlling terminal without being in the foreground group
- ``SIGTTOU``: informs about trying to write to the controlling terminal without being in the foreground group
- ``SIGWINCH``: informs about the terminal size changing

Other signals:


- ``SIGABRT``: informs about the occurrence of an unforeseen error in the program (failed assert, etc.) and the necessity of its forceful closing
- ``SIGPIPE``: indicates that you have tried to write to the pipe or socket whose other end has been closed
- ``SIGUSR1``, ``SIGUSR2``: without a predetermined purpose, intended for the user
- ``SIGIO``: informs you that an asynchronous IO operation has been finished, or that IO on a file can now be executed if the program has previously requested such information

Each signal has an assigned action that will be executed when it is delivered. This is one of:

- ignoring: nothing will happen
- executing a function: the function provided by the user will be called
- default action, depending on the signal:

  - ignore (``SIGCHLD``, ``SIGWINCH``)
  - stopping the process (``SIGSTOP``, ``SIGTSTP``, ``SIGTTIN``, ``SIGTTOU``)
  - continuation of the process (``SIGCONT``)
  - killing the process (``SIGTERM``, ``SIGKILL``, ``SIGINT``, …)
  - killing the process with a core dump (``SIGSEGV``, ``SIGQUIT``, …)

The action performed on most signals can be changed by syscalls ``signal``
(simple interface, but limited functionality) or ``sigaction`` (much greater functionality).
Signals whose actions cannot be changed are ``SIGKILL`` and ``SIGSTOP``. In addition,
although the ``SIGCONT`` action can be changed, it will always continue the process in addition
to calling the assigned action.

In addition to changing the action assigned to signals, you can also block their delivery
by syscall ``sigprocmask``. A blocked signal is not the same as an ignored signal --
an ignored signal will be discarded while a blocked signal will be waiting in
the queue until it is unblocked.
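
The difference between blocking and ignoring can be observed directly. In the sketch below a blocked ``SIGUSR1`` stays pending and its handler runs only after the signal is unblocked; the function name ``blocked_signal_demo`` is made up for this example.

.. code-block:: c

    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static volatile sig_atomic_t got_usr1 = 0;

    static void on_usr1(int signum)
    {
        (void)signum;
        got_usr1 = 1;
    }

    /* Returns 1 if the signal was held back while blocked and delivered
     * after unblocking, 0 if not, -1 on setup failure. */
    static int blocked_signal_demo(void)
    {
        struct sigaction sa;
        sigset_t block, pending;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_usr1;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGUSR1, &sa, NULL) == -1)
            return -1;

        sigemptyset(&block);
        sigaddset(&block, SIGUSR1);
        if (sigprocmask(SIG_BLOCK, &block, NULL) == -1)
            return -1;

        got_usr1 = 0;
        raise(SIGUSR1);                 /* blocked, so it only becomes pending */
        sigpending(&pending);
        int ran_while_blocked = got_usr1;
        int was_pending = sigismember(&pending, SIGUSR1);

        sigprocmask(SIG_UNBLOCK, &block, NULL);   /* handler runs here */
        int ran_after_unblock = got_usr1;

        printf("while blocked: handler ran=%d pending=%d; after unblock: handler ran=%d\n",
               ran_while_blocked, was_pending, ran_after_unblock);
        return !ran_while_blocked && was_pending == 1 && ran_after_unblock;
    }

    int main(void)
    {
        return blocked_signal_demo() == 1 ? 0 : 1;
    }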

In addition to the signals sent by the kernel, each signal can also be manually sent
by the user. Signals can be sent to entire processes (or groups of processes) via
syscall ``kill``, or to individual threads inside your own process by syscall
``tkill`` (or the corresponding function ``pthread_kill``). Sending a signal
to a process results in delivering it to an arbitrarily selected thread in the process
that does not have the signal blocked.

Signal handling
---------------

The signal handling functions are set by the ``sigaction`` syscall and have
one of the following types:

- ``void func (int signum)``
- ``void func (int signum, siginfo_t *info, ucontext_t *ctx)``

``signum`` is the number of the incoming signal. In case of using the second type,
``info`` is a structure containing information about the signal -- e.g., its source
(processor error, sent by the user, terminal, etc.) and details (error address
in the case of ``SIGSEGV``, pid of the sender, etc.). ``ctx`` is a pointer to
a structure containing the full state of the processor registers before entering the signal handler.

When a signal arrives with a handler function assigned, the kernel interrupts the process.
If the process is executing a blocking system call and is in a state of interruptible
sleep, the system call is interrupted.  Depending on the semantics selected when setting
the handler function, the call is either terminated and returns the error code ``EINTR``,
or it will be restarted when returning from the signal handler.
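
A sketch of installing the three-argument handler form with ``sigaction``. ``SA_SIGINFO`` selects that form and ``SA_RESTART`` asks for interrupted syscalls to be restarted instead of failing with ``EINTR``; the names ``install`` and ``on_signal`` are made up for this example.

.. code-block:: c

    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t last_signo = 0;
    static volatile sig_atomic_t last_sender = 0;

    /* Three-argument handler: ctx really points to a ucontext_t. */
    static void on_signal(int signum, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        last_signo = signum;
        last_sender = (sig_atomic_t)info->si_pid;   /* pid of the sender */
    }

    /* Hypothetical helper: install on_signal for the given signal. */
    static int install(int signum)
    {
        struct sigaction sa;

        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_signal;
        sa.sa_flags = SA_SIGINFO | SA_RESTART;
        sigemptyset(&sa.sa_mask);
        return sigaction(signum, &sa, NULL);
    }

    int main(void)
    {
        if (install(SIGUSR1) == -1) {
            perror("sigaction");
            return 1;
        }
        kill(getpid(), SIGUSR1);    /* send the signal to ourselves */
        printf("got signal %d from pid %d (our pid is %d)\n",
               (int)last_signo, (int)last_sender, (int)getpid());
        return 0;
    }

Dropping ``SA_RESTART`` from ``sa_flags`` makes blocking calls such as ``read`` on a terminal fail with ``EINTR`` when the handler interrupts them.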

After interrupting the process and possibly a syscall, the kernel writes the state of
the processor registers to the user stack (usually it is a "normal" stack of the program,
but it is possible to set a separate stack for signals by syscall ``sigaltstack``).
After saving the state of registers, the kernel writes to the stack (or registers)
the parameters for the signal handler and the return address. Because of
the need to clean up after the signal handler and restore the exact state of
the processor before it was called, this return address points to a special ``sigreturn``
pseudo function, which is part of the VDSO / vsyscall block. Upon returning from
the signal handler, ``sigreturn`` calls syscall ``sigreturn`` which deals with proper cleanup.

Simply returning from the signal handler is not the only way to leave it --
sometimes it is useful to use the ``siglongjmp`` call or to use information
from the ``ucontext`` structure to unwind the stack.

Signal handling is a very delicate mechanism, because it is very difficult
to control where the signal will interrupt the program (for example, we
cannot guarantee that we can use ``malloc`` while handling the signal -- this signal
could have interrupted a ``malloc`` call from the main program). For this reason,
the signal handling functions are often limited to the setting of a single variable,
which is regularly checked by the main program.
The functions which are safe to call in the asynchronous signal context are listed in ``man signal-safety``.
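
The "set a flag and check it in the main program" pattern might look as follows. This is a sketch: the names ``stop_requested`` and ``run_until_alarm`` are made up for this example, and ``SIGALRM`` from ``alarm`` merely stands in for any asynchronous signal.

.. code-block:: c

    #define _POSIX_C_SOURCE 200809L
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    /* sig_atomic_t can be written safely from a signal handler. */
    static volatile sig_atomic_t stop_requested = 0;

    static void on_alarm(int signum)
    {
        (void)signum;
        stop_requested = 1;     /* the only thing the handler does */
    }

    /* Spin until SIGALRM arrives; returns the number of loop iterations. */
    static unsigned long run_until_alarm(unsigned seconds)
    {
        struct sigaction sa;
        unsigned long iterations = 0;

        memset(&sa, 0, sizeof sa);
        sa.sa_handler = on_alarm;
        sigemptyset(&sa.sa_mask);
        if (sigaction(SIGALRM, &sa, NULL) == -1)
            return 0;

        stop_requested = 0;
        alarm(seconds);
        while (!stop_requested)     /* the main program polls the flag */
            iterations++;
        return iterations;
    }

    int main(void)
    {
        unsigned long n = run_until_alarm(1);

        /* Back in normal context, so printf is safe to call again. */
        printf("stopped after %lu iterations\n", n);
        return 0;
    }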

.. _small_task_3:

Small task #3
=============

As you scroll through the signal-safe libc functions, you may notice that ``write`` is there,
but no int-to-string conversion function is present. What if we wanted to print the signal number to stderr?
As the old interface provides no way to pass extra per-signal information to the function,
let's pass it ourselves by generating a trampoline (thunk) to do the work.


Trampolines
-----------

Trampolines are small pieces of code whose only task is to transfer control
to another location, perhaps modifying the process context in some way
or passing additional parameters.

An interesting example of using trampolines are the FFI (foreign function
interface) libraries for scripting languages. Let's assume that an API
written in C, which we want to use with Python / Lua / etc, requires passing
a pointer to a function that it will call in a given situation (a so-called
callback). Because our program is written in an interpreted dynamic language,
the FFI library supporting callbacks must support the dynamic generation of
functions callable from C. To this end, the library generates a trampoline,
which will remember the appropriate registers containing parameters, write
the trampoline identifier (to select the appropriate high-level function
to invoke), and then call the FFI library function that starts
the dynamic language interpreter with the appropriate parameters. Thanks
to such a solution, we can generate any number of functions for callbacks
at runtime and pass pointers to them to C functions.


Task
----

.. highlight:: C

Forwarding the signal handling logic to a dynamic language is one example of such an API,
but we are going to start with a simplified task here.

Write an implementation of the ``make_signal_handler`` function::

    typedef void (*sighandler_t)(int);
    sighandler_t make_signal_handler(int signum);

The ``make_signal_handler`` function should set a new signal handler using the ``signal`` libc function
and return the previous handler.
As a new handler, you need to **generate a pointer to a dynamically created printing function**
which would print the signal number using
``write(1, ...)`` while ignoring the input ``signum`` argument.
For instance::

    int main() {
        sighandler_t old_2 = make_signal_handler(2);
        sighandler_t old_12 = make_signal_handler(12);
        for (;;) pause();
    }

should be equivalent to::

    void write2(int signum) {
        write(1, "2", 1);
    }
    void write12(int signum) {
        write(1, "12", 2);
    }

    int main() {
        sighandler_t old_2 = signal(2, write2);
        sighandler_t old_12 = signal(12, write12);
        for (;;) pause();
    }

.. important::
    Such a handler could have just used a global strings table indexed by the signal number to perform the required task.
    However, such a solution would not take us closer to implementing a trampoline...

.. note::
    The ``signal`` libc function has different semantics (BSD) from the Linux ``signal`` syscall.
    It uses ``sigaction`` internally.

Hints
-----
Compile the following code (with the ``-no-pie -mcmodel=large -fno-pie`` options) and
disassemble the resulting ``.o`` file::

    void writer(int signum) {
        write(1, "12", 2);
    }

From the compiled machine code, make a template that will be filled with
a pointer to the appropriate string and its length by ``make_signal_handler``.

This example program shows how to dynamically create an executable function::

    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <string.h>
    #include <sys/mman.h>   /* mmap(), mprotect() */

    static uint8_t code[] = {
        0xB8,0xFF,0x00,0x00,0x00,   /* mov  eax,0xFF    */
        0xC3,                       /* ret              */
    };

    int main(void)
    {
        const size_t len = sizeof(code);

        /* mmap a region for our code */
        void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,  /* No PROT_EXEC */
                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (p==MAP_FAILED) {
            fprintf(stderr, "mmap() failed\n");
            return 2;
        }

        /* Copy it in (still not executable) */
        memcpy(p, code, len);

        /* Now make it execute-only */
        if (mprotect(p, len, PROT_EXEC) < 0) {
            fprintf(stderr, "mprotect failed to mark exec-only\n");
            return 2;
        }

        /* Go! */
        int (*func)(void) = (int(*)())p;
        printf("(dynamic) code returned %d\n", func());

        pause();
        return 0;
    }

.. highlight:: none


Extra Topics
=============

If you want to extend your knowledge beyond the scope of this course, consider:

Reading
-------

1. Manual section 2, in particular: ``syscall``, ``futex``, ``sigaction``
2. VDSO/vsyscall sources in ``arch/x86/entry/vdso`` (``arch/x86/vdso`` in older kernels)
3. Syscall list in ``asm/unistd*.h``
4. Ulrich Drepper "Futexes Are Tricky", 2011 - http://www.akkadia.org/drepper/futex.pdf
5. ``man 7 signal``, ``man 7 signal-safety``

Implementing
------------

Some ideas to implement:

A warm shutdown on Ctrl-C
    A warm shutdown of a system allows finishing all currently performed tasks and leaving
    external resources in a proper state.
    Typically, hitting Ctrl-C again transitions to a cold shutdown:
    the process should end as fast as possible without corrupting the external state.

    Extend the code in the collapsed section below with support for warm shutdown.
    Subsequent ``SIGINT`` signals should:

    1. Print a message, break the loop, flush libc buffers and flush all data to disk.
    2. Print a message, flush the file to disk (ignore libc buffers) and exit (use ``siglongjmp``).
    3. Kill the program.

    Be careful to read about requirements of signal safety (especially for ``siglongjmp``),
    and restore the original handler after the cancellable block.

    .. class:: details

    Code
      .. code-block:: c

        #include <unistd.h>
        #include <stdio.h>
        #include <stdbool.h>
        #include <string.h>
        #include <err.h>

        void calculator(FILE* fd) {
            int iters = 0;
            volatile long my_something = 3;

            while (iters < 1000) {
                for (int i = 0; i < 1000000000; ++i) {
                    my_something = my_something * 3 + 1; // digits of PI...
                }
                if (fwrite((void*)&my_something, sizeof(my_something), 1, fd) != 1) {
                    perror("Cannot write results");
                    break;
                }
                if (iters % 10 == 0) {
                    fflush(fd);
                }
                putchar('.');
                fflush(stdout);
                ++iters; // without this the loop never terminates
            }
            // Make sure we have our important computation on disk!
            fflush(fd);
            fsync(fileno(fd));
        }

        void cancellable(const char* filename) {
            FILE* out = fopen(filename, "w");
            if (out == NULL) {
                err(1, "Cannot open %s:", filename);
            }

            if (false /* long jump */) {
                // We're inside the long jump here.
                // Calling non-async-safe functions is illegal,
                // or even returning from main.

                // Store to disk what was written.
                // sleep(5); to test the third case.
            }

            calculator(out);
            if (fclose(out) == EOF) {
                err(3, "Cannot close %s:", filename);
            }
        }

        int main() {
            cancellable("powerbytes");
            puts("We're done!");
        }

A ``fork()`` server
    A forkserver is a technique of launching new processes of an interpreted language without much of
    the overhead of initializing the interpreter again.
    A naive use of ``fork()`` has many issues, especially with multithreaded programs, such as inheriting locked mutexes.

    A *forkserver* is a single-threaded server process (possibly with preloaded modules) that can fork itself on request.
    As the server is started early or is an independent process, no unnecessary resources would be inherited.
    For instance, you can make the ``multiprocessing`` Python module use
    `forkserver for starting new processes <https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods>`_.

    Play with arguments to ``clone`` and ``mmap`` to support proper data sharing.
    Use unnamed pipes / sockets to communicate between processes:
    the main process should have a socket interface for requests to the forkserver,
    which should enable communication with the newly spawned process through another pipe.
    The requester would then use that pipe to communicate work that needs to be done.

    The main trick is to use the ``SCM_RIGHTS`` ancillary message of Unix sockets to send the ends (file descriptors) of the pipes
    used for communication with the new process.
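
    As a starting point, passing a descriptor with ``SCM_RIGHTS`` might be sketched as below.
    The helper names ``send_fd`` and ``recv_fd`` are made up for this example, and for brevity
    both ends of the socket live in one process; in a forkserver they would sit in different processes.

    .. code-block:: c

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <sys/uio.h>
        #include <unistd.h>

        /* Control buffer sized and aligned for one file descriptor. */
        union fd_cmsg {
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        };

        /* Hypothetical helper: send fd over a Unix socket. 0 on success. */
        static int send_fd(int sock, int fd)
        {
            char byte = 'F';    /* at least one byte of ordinary data */
            struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
            union fd_cmsg u;
            struct msghdr msg;

            memset(&u, 0, sizeof u);
            memset(&msg, 0, sizeof msg);
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = u.buf;
            msg.msg_controllen = sizeof u.buf;

            struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
            c->cmsg_level = SOL_SOCKET;
            c->cmsg_type = SCM_RIGHTS;
            c->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(c), &fd, sizeof fd);
            return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
        }

        /* Hypothetical helper: receive a descriptor, or return -1. */
        static int recv_fd(int sock)
        {
            char byte;
            struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
            union fd_cmsg u;
            struct msghdr msg;
            int fd;

            memset(&u, 0, sizeof u);
            memset(&msg, 0, sizeof msg);
            msg.msg_iov = &iov;
            msg.msg_iovlen = 1;
            msg.msg_control = u.buf;
            msg.msg_controllen = sizeof u.buf;
            if (recvmsg(sock, &msg, 0) != 1)
                return -1;

            struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
            if (!c || c->cmsg_level != SOL_SOCKET || c->cmsg_type != SCM_RIGHTS)
                return -1;
            memcpy(&fd, CMSG_DATA(c), sizeof fd);
            return fd;
        }

        int main(void)
        {
            int sv[2], p[2];
            char buf[3] = { 0 };

            if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1 || pipe(p) == -1) {
                perror("setup");
                return 1;
            }
            if (send_fd(sv[0], p[1]) == -1) {
                perror("send_fd");
                return 1;
            }
            int w = recv_fd(sv[1]);     /* a new descriptor for the same pipe end */
            if (w == -1 || write(w, "hi", 2) != 2 || read(p[0], buf, 2) != 2) {
                perror("transfer");
                return 1;
            }
            printf("received fd %d, pipe delivered \"%s\"\n", w, buf);
            return 0;
        }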