.. _lab-syscall-en:

==========================
Class 3: kernel interfaces
==========================

Date: 11.03.2025

:ref:`small_task_3`

System calls in the Linux kernel
================================

What is a system call?
----------------------

System calls (syscalls) are a mechanism that allows processes running in
user space to request services from the kernel. Syscalls are the main way
user processes communicate with the outside world -- the only thing that
a program that does not use syscalls can do is to hang or cause a runtime error.

System calls are a **stable** interface of the Linux kernel -- new ones may be
added, but existing ones are never removed.

The mechanism of calling syscalls from the user level varies for each
architecture, and sometimes even for processors and / or kernel versions within
a single architecture. Usually, this is done using an assembler instruction
specially designed for this purpose.

Syscalls are identified by numbers. Their exact list and numbering depend on the
architecture -- you can view it in the file ``/usr/include/asm/unistd*.h`` or
``arch/*/include/uapi/asm/unistd*.h`` in Linux sources.

A value returned from the kernel in the range [-4095, -1] means an error and is
a negated standard error code (list in ``/usr/include/asm-generic/errno.h`` and
``errno-base.h``). Other values mean success, and their meaning depends on the
syscall.

Review of the most important syscalls
-------------------------------------

.. admonition:: Hands-on

   Syscalls executed by a process may be inspected with ``strace``. This program
   shows *all* system calls made by the process -- including the ones made by the
   dynamic linker. Go ahead and trace a process, for instance::

       strace ls

   You'll find a list of the most common system calls below. Try to figure out
   what's happening and relate the work of the dynamic linker to the previous
   labs.

.. notes::

   Walk through this with students. The dynamic linker for ``strace ls``
   should use almost all of the calls below, so you may just briefly explain
   them one by one. Some notes:

   - we start from implicit fork() + execve
   - openat is relative open
   - mmap has a strict correspondence to a task from previous labs
     /proc/.../maps (segments by the linker)
   - skip over libc/libpcre to mprotect
   - ioctl with terminal size
   - getdents -> directory entries

Process control
^^^^^^^^^^^^^^^

These syscalls are related to process management. It should be noted that the
Linux kernel and the POSIX standard (including the pthreads thread library) use
different process definitions: kernel processes correspond to POSIX threads.
Here we use the POSIX definition.

``noreturn void _exit ()``
  ends the thread

``noreturn void exit_group ()``
  ends the process

``pid_t getpid()``
  returns the ID of the current process

``pid_t gettid()``
  returns the ID of the current thread

``int fork()``
  creates a new process that is a copy of the current one (but with only one
  thread); this is a special case of the ``clone`` syscall

``int clone (int (*fn) (void *), void *newstack, int flags, void *arg, ...)``
  creates a new process or thread, quite a complicated function

``pid_t waitpid (pid_t pid, int *stat_loc, int options)``
  waits for an event (exit, stop, etc.) in the child process

``int execve (const char *path, char *const *argv, char *const *envp)``
  launches a new program in the current process, replacing the current one

``long ptrace (int request, pid_t pid, void *addr, void *data)``
  performs many operations related to tracing other processes: it allows you to
  stop a given process, single-step it, read and write its address space,
  registers, etc. Used by gdb and strace.

File support
^^^^^^^^^^^^

At the syscall level, open files are identified by so-called file descriptors,
i.e., small non-negative integers. Descriptors 0-2 typically correspond to the
standard input, output, and error output. Other descriptors rarely have
specific roles defined.
The most important syscalls from this group:

``ssize_t read (int fd, void *buf, size_t len)``
  reads from the file ``fd`` to the buffer; returns the number of bytes read.
  0 means the end of the file (this is not considered an error). A positive
  number smaller than ``len`` means a partial read -- this may be due to an
  error, hitting the end of the file, or simply a lack of more available data
  at the moment.

``ssize_t write (int fd, const void *buf, size_t len)``
  writes to a file, works similarly to read.

``int open (const char *fname, int flags, mode_t mode)``
  opens a file, returns the descriptor.

``int close (int fd)``
  closes the file; may return an error if there is a problem with flushing
  kernel buffers.

``int ioctl (int fd, int request, void *arg)``
  performs a special operation on the file. The set of available operations
  depends heavily on the file. Mostly applies only to device files. Examples of
  special operations:

  - change of terminal settings (performed on a terminal file)
  - volume change (performed on the sound card device file)
  - reading information about the manufacturer and physical parameters
    (performed on the hard disk device file)

``int poll(struct pollfd *fds, int nfds, int timeout)``
  waits for one of the listed events to take place on one of the given
  descriptors, useful when the program can receive input from many sources

Memory management
^^^^^^^^^^^^^^^^^

``void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)``
  creates a new memory area. If ``flags`` contains ``MAP_ANONYMOUS``, this is
  simply a new block of memory. Otherwise, the area will be mapped to the given
  file -- reading from the area will give us the contents of the file. If
  ``flags`` contains ``MAP_SHARED``, writing to the area will also write to the
  file; otherwise (``MAP_PRIVATE``), writing to the area will create a new copy
  of the data from the file and modify only this copy.
  If the ``MAP_FIXED`` flag is passed, the area will be mapped at the given
  address; otherwise, the kernel will look for a free address.

``int munmap (void *addr, size_t len)``
  unmaps (destroys) the given memory area

``int mprotect (void *addr, size_t len, int prot)``
  changes the access rights to a given memory area

``void *brk (void *addr)``
  shortens or extends the process heap segment

It should be noted that on Linux (as well as many other UNIXes) there are two
methods for the allocation of "normal" memory: ``mmap`` with the option
``MAP_ANONYMOUS`` and ``brk``. The usual ``malloc()`` from libc in the standard
configuration uses the latter for small allocations, and the former for large
allocations.

Performing system calls on Linux
================================

.. admonition:: Hands-on

   The most direct way of invoking a system call from C is to use the
   ``syscall`` wrapper function, which expects a syscall number (or a constant
   from ``sys/syscall.h``) and its arguments. The function returns the syscall
   result on success, and -1 on error, while storing the actual (non-negated)
   error code in the global variable ``errno``.

   Compile the program listed below and open it in a debugger. Single-step it
   from the ``main()`` function to see how the system call is executed, and
   refer to the sections below for the exact description. Note the address of
   the instruction which performed the actual call -- check what was mapped
   there in the ``maps`` file.

   .. code-block:: c

      #include <errno.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      int main() {
          int rc = syscall(SYS_chmod, "./f", 777); // oops
          if (rc == -1)
              return errno;
          return 0;
      }

.. notes::

   From old notes:

   - Show, compile, run, and explain ex1 from lab2

     * Additionally, compile with -H and show which files are used
     * Show what is in /usr/include/unistd.h
     * dpkg -S /usr/include/unistd.h
     * Discuss how errno works; /usr/include/errno.h;
       /usr/include/asm-generic/errno.h and errno-base.h
     * nm -D a.out
     * glibc source code, csu/errno
     * Puzzle: 0777 vs 777 (octal vs decimal; a common mistake to watch out for)
     * Run in gdb, show "info register" before the syscall is invoked and
       discuss the contents (rdi, rsi, etc.)

Before calling a syscall, you should put its number and parameters in certain
fixed registers of the processor. After the call, the result or error code is
also available in a register.

System calls on the x86_64 architecture
---------------------------------------

.. tip::

   There is `a useful reference `_ available.

The only native mechanism of system calls on the x86_64 architecture is the
``syscall`` instruction. The call takes place as follows:

- the syscall number is passed in ``rax``,
- parameters are passed in: ``rdi``, ``rsi``, ``rdx``, ``r10``, ``r8``, ``r9``
  (in this order),
- the ``syscall`` instruction is executed,
- the contents of ``rcx`` and ``r11`` are destroyed by the kernel as a side
  effect of the syscall,
- the result is in ``rax``.

System calls on the i386 architecture
-------------------------------------

There are 3 mechanisms of system calls on the i386 architecture:

- interrupt ``0x80`` (available on all processors)
- ``sysenter`` instruction (available on Intel processors starting from Pentium Pro)
- ``syscall`` instruction (available on AMD processors starting from K6)

The ``sysenter`` / ``syscall`` instructions were introduced in later processors
due to the poor performance of interrupts on x86 processors.
If you use interrupt ``0x80``, the system call looks like this:

- the syscall number is passed in ``eax``
- parameters are passed in the following registers: ``ebx``, ``ecx``, ``edx``,
  ``esi``, ``edi``, ``ebp`` (in this order); syscalls requiring more parameters
  have special conventions
- the syscall is invoked by the ``int $0x80`` instruction
- the syscall result will be in the ``eax`` register

Calls through ``syscall`` and ``sysenter`` are quite similar, but a bit more
complicated.

VDSO and vsyscall mechanisms
-----------------------------

Due to the existence of many syscall mechanisms on the i386 architecture and
the need to choose the right one for the given machine, the VDSO mechanism was
introduced. VDSO is a small shared library provided by the kernel that contains
the syscall function appropriate for a given processor. The kernel contains
several prepared versions of this library (``int 0x80``, ``syscall``,
``sysenter``) and selects the appropriate one at run time.

The x86_64 architecture does not require a syscall selection mechanism, but an
improved mechanism for the execution of certain syscalls (``clock_gettime``,
``gettimeofday``, ``time``, ``getcpu``) has been introduced there. These
syscalls have optimized versions that do not require the processor to go into
kernel mode (they only read global kernel variables, which are made available
to read from the user space in a special way). This mechanism also uses a code
block exported by the kernel to the user space, named vsyscall.

The VDSO and vsyscall code also includes the implementation of the
``sigreturn`` and ``rt_sigreturn`` functions used when returning from the
user-space signal handling function.

System calls in libc
--------------------

Most system calls have their "wrappers" in the standard C library (libc). These
are functions whose only task is to move parameters to the right place, call
the appropriate syscall, and return the result.
Note that the kernel and libc have different conventions for passing error
information -- the kernel returns the negated error code (e.g., ``-EINVAL``)
directly from the syscall, whereas the library functions always return -1 on
error, and the error code is passed in the global variable ``errno`` (the error
code in ``errno`` is *not* negated).

A simplified version of the write syscall wrapper (omitting VDSO, proper
``errno`` handling, and cancellation points) may look like this::

    .global write
    write:
        movl $1, %eax          # syscall number
        syscall
        cmpq $-4096, %rax      # error?
        jna out                # if not error, then exit
        neg %rax               # -EINVAL -> EINVAL etc.
        movl %eax, errno       # set errno
        movq $-1, %rax         # return -1 (full 64-bit ssize_t)
    out:
        ret

Not all system calls correspond directly to library functions, for many reasons:

- many syscalls have several versions with parameters of various sizes (mainly
  those related to uids / gids, pids, offsets in files, etc.). Older versions
  with smaller parameters are preserved for compatibility with older versions
  of libc. Examples are the syscalls ``getuid`` (16-bit uid) and ``getuid32``
  (32-bit uid) as well as ``lseek`` (32-bit offset) and ``_llseek`` (64-bit
  offset). The set of existing versions of a syscall depends strongly on the
  architecture -- for example, 64-bit architectures have never had syscalls
  with 32-bit offsets.
- some syscalls (``ipc``, ``socketcall``) in fact have many subfunctions with
  different parameters (``shmat``, ``shmctl``, ``msgctl``, ``socket``,
  ``connect``, ``bind``, ``listen``, etc.). Each of these subfunctions has its
  own wrapper function in libc.
- many syscalls have semantics modified by the thread library (more about this
  below)
- some syscalls require special intervention from libc (``vfork``, ``clone``,
  etc.)
- just because

.. admonition:: Hands-on

   Compile the following program statically (``-static``, so we won't be
   spammed by the linker) and execute it under strace.

   .. code-block:: c

      #include <stdio.h>
      #include <sys/time.h>

      int main() {
          struct timeval tv;
          printf("123\n");
          gettimeofday(&tv, NULL);
          printf("%ld\n", (long)tv.tv_sec);  /* tv_sec is a time_t, not an int */
      }

   Where did the call to ``gettimeofday`` (syscall #96 on x86_64) go?
   If you redirect the output to a file/pipe, why is there only a single
   ``write``?

.. notes::

   From the old notes:

   - Show, compile, run, and explain ex2 from lab2

     * run under strace (2>&1 | less)
     * only one write, via vdso
     * cat /proc/self/maps | grep "vsyscall\|vdso"
     * Disabling vdso: vim /etc/default/grub (vdso=0); update-grub

   - Syscalls are stable, but new ones are added from time to time. Usually
     wrapped in functions

     * Checking the Linux version: uname -a and possibly /proc/version
     * Checking the glibc version: ldd --version
     * A short tour of the glibc code, e.g.:

       - string/ and strcmp
       - signal/ kill, sigblock (and sysdeps/unix/syscalls.list)
       - clone

   - Show, compile, run, and explain ex3 and ex4

Other useful syscalls
=====================

.. admonition:: Hands-on

   In contrast to the fairly simple ``fork()`` system call, ``clone`` has a
   more advanced interface allowing fine-grained specification of what is
   shared between the parent and the child process. Moreover, instead of
   returning in both processes, it calls a given function in the child, on the
   child's own stack.

   Refer to the ``man clone`` page for a list of available flags. The code
   attached below presents the basic usage and how data may be shared.

   Review the code below and explain what the difference is when the "vm"
   parameter is passed.

   .. code-block:: c

      #define _GNU_SOURCE
      #include <sched.h>
      #include <signal.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/wait.h>

      #define STACK_SIZE 65536

      int global_value = 0;
      char *heap;

      static int child_func(void *arg) {
          char *buf = (char *)arg;
          printf("Child sees buf = %s\n", buf);
          printf("Child sees global value = %d\n", global_value);
          printf("Child sees heap = %s\n", heap);
          strcpy(buf, "hello from child");
          global_value = 10;
          strcpy(heap, "bye");
          return 0;
      }

      int main(int argc, char *argv[]) {
          // Allocate stack for child task
          char *stack = malloc(STACK_SIZE);
          unsigned long flags = 0;
          char buf[256];
          int status;

          if (!stack) {
              perror("Failed to allocate memory");
              exit(1);
          }
          heap = malloc(1024);
          if (!heap) {
              perror("Failed to allocate memory");
              exit(2);
          }
          if (argc == 2 && !strcmp(argv[1], "vm"))
              flags |= CLONE_VM;

          strcpy(buf, "Hello from Parent");
          strcpy(heap, "Hey");
          global_value = 5;
          // The stack grows downwards, so pass the top of the allocated block
          if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
              perror("clone");
              exit(1);
          }
          if (wait(&status) == -1) {
              perror("Wait");
              exit(1);
          }
          printf("Child exited with status:%d\t", status);
          printf("buf:%s\t global_value=%d\n", buf, global_value);
          printf("Parent heap:%s\n", heap);
          return 0;
      }

futex
-----

Traditional methods of implementing mutexes in the userspace required, to avoid
busy waiting, creating, for example, an unnamed pipe for each mutex (used to
awaken waiting processes).
This approach has a number of disadvantages:

- mutexes are expensive: every mutex that threads may have to wait on takes two
  file descriptors, and the mutex structure itself must be fairly large
- implementation of interprocess mutexes (e.g., in shared memory) is very heavy

The 2.6 kernel added the ``futex`` syscall (fast userspace mutex), which
significantly simplifies the implementation of mutexes::

    int futex (int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

This syscall, like ``socketcall``, is actually a wrapper for several
subfunctions (selected by the ``op`` parameter):

- ``FUTEX_WAIT``: atomically checks if ``*uaddr == val`` and, if so, goes to
  sleep. If a timeout is given, it sleeps for at most that period of time,
  otherwise it sleeps indefinitely.
- ``FUTEX_WAKE``: wakes up at most ``val`` processes waiting in ``FUTEX_WAIT``
  on the address ``uaddr``.
- … and a few more, more complicated ones

The use of futex avoids the problems of the traditional implementation:

- the ``futex`` syscall is called only when the mutex is already busy. There is
  no constant consumption of resources: kernel resources are used only when a
  thread is actually waiting for the mutex to be released
- a mutex is a very small structure (for the basic variant, a single ``int`` is
  enough)
- mutexes work between processes without any special handling -- the kernel
  uses the physical address for comparisons and correctly handles references to
  the same place through different addresses in different processes, etc.

Signals
=======

Signals are a mechanism for transmitting information about asynchronous or
synchronous events to a user process. This mechanism is very similar to the
kernel-level interrupt mechanism.

There are approximately 32 signals with a fixed purpose (the list is to some
extent dependent on the architecture), and 32 real-time signals that can be
used by the user for any purpose.
Important signals are:

Signals sent by the kernel caused by processor errors:

- ``SIGSEGV``: informs about a violation of protection mechanisms, most often a
  reference to a bad memory area
- ``SIGILL``: informs about the execution of an incorrect machine instruction
- ``SIGBUS``: indicates a memory access error for a reason other than an
  invalid address or lack of permissions. It's hard to get it on x86. On other
  architectures, it is often caused by, for example, access to a word at an
  unaligned address.
- ``SIGFPE``: floating point exception, originally reported exceptions in
  floating point calculations, later also used for arithmetic errors on
  integers (dividing by 0)
- ``SIGTRAP``: informs about hitting a breakpoint, used for debugging programs

Signals for controlling processes (sent by other processes):

- ``SIGTERM``: informs the process that it should end
- ``SIGKILL``: forcefully terminates the process
- ``SIGSTOP``: stops the process (with the possibility of continuing)
- ``SIGCONT``: continues the process
- ``SIGCHLD``: indicates a change in the status of a child process

Signals related to terminal service:

- ``SIGHUP``: informs about disconnecting the terminal (e.g., closing the xterm
  window, disconnecting the ssh session)
- ``SIGINT``: indicates that Ctrl-C has been pressed
- ``SIGQUIT``: indicates that Ctrl-\\ has been pressed
- ``SIGTSTP``: indicates that Ctrl-Z has been pressed
- ``SIGTTIN``: informs about trying to read from the controlling terminal
  without being in the foreground group
- ``SIGTTOU``: informs about trying to write to the controlling terminal
  without being in the foreground group
- ``SIGWINCH``: informs about the terminal size changing

Other signals:

- ``SIGABRT``: informs about the occurrence of an unforeseen error in the
  program (failed assert, etc.) and the necessity of its forceful closing
- ``SIGPIPE``: indicates that you have tried to write to a pipe or socket whose
  other end has been closed
- ``SIGUSR1``, ``SIGUSR2``: without a predetermined purpose, intended for the
  user
- ``SIGIO``: informs you that an asynchronous IO operation has finished, or
  that IO on a file can now be executed, if the program has previously
  requested such information

Each signal has an assigned action that will be executed when it is delivered.
This is one of:

- ignoring: nothing will happen
- executing a function: the function provided by the user will be called
- default action, depending on the signal:

  - ignore (``SIGCHLD``, ``SIGWINCH``)
  - stopping the process (``SIGSTOP``, ``SIGTSTP``, ``SIGTTIN``, ``SIGTTOU``)
  - continuation of the process (``SIGCONT``)
  - killing the process (``SIGTERM``, ``SIGKILL``, ``SIGINT``, …)
  - killing the process with a core dump (``SIGSEGV``, ``SIGQUIT``, …)

The action performed on most signals can be changed by the syscalls ``signal``
(simple interface, but limited functionality) or ``sigaction`` (much greater
functionality). Signals whose actions cannot be changed are ``SIGKILL`` and
``SIGSTOP``. In addition, although the ``SIGCONT`` action can be changed, the
signal will always continue the process in addition to calling the assigned
action.

In addition to changing the action assigned to signals, you can also block
their delivery with the syscall ``sigprocmask``. A blocked signal is not the
same as an ignored signal -- an ignored signal will be discarded, while a
blocked signal will wait in the queue until it is unblocked.

In addition to the signals sent by the kernel, each signal can also be manually
sent by the user. Signals can be sent to entire processes (or groups of
processes) via the syscall ``kill``, or to individual threads inside your own
process by the syscall ``tkill`` (or the corresponding function
``pthread_kill``).
Sending a signal to a process results in delivering it to an arbitrarily
selected thread in the process.

Signal handling
---------------

The signal handling functions are set by the ``sigaction`` syscall and have one
of the following types:

- ``void func (int signum)``
- ``void func (int signum, siginfo_t *info, ucontext_t *ctx)``

``signum`` is the number of the incoming signal. When the second type is used,
``info`` is a structure containing information about the signal -- e.g., its
source (processor error, sent by the user, terminal, etc.) and details (error
address in the case of ``SIGSEGV``, pid of the sender, etc.). ``ctx`` is a
pointer to a structure containing the full state of the processor registers
before entering the signal handler.

When a signal arrives with a handler function assigned, the kernel interrupts
the process. If the process is executing a blocking system call and is in a
state of interruptible sleep, the system call is interrupted. Depending on the
semantics selected when setting the handler function, the call is either
terminated and returns the error code ``EINTR``, or it will be restarted when
returning from the signal handler.

After interrupting the process and possibly a syscall, the kernel writes the
state of the processor registers to the user stack (usually it is the "normal"
stack of the program, but it is possible to set a separate stack for signals
with the syscall ``sigaltstack``). After saving the state of registers, the
kernel writes to the stack (or registers) the parameters for the signal handler
function and the return address. Because of the need to clean up after the
signal handler and restore the exact state of the processor before it was
called, this return address points to a special ``sigreturn`` pseudo function,
which is part of the VDSO / vsyscall block. Upon returning from the signal
handler, ``sigreturn`` calls the syscall ``sigreturn``, which deals with proper
cleanup.
Simply returning from the signal handler is not the only way to leave it --
sometimes it is useful to use the ``siglongjmp`` call or to use information
from the ``ucontext`` structure to unwind the stack.

Signal handling is a very delicate mechanism, because it is very difficult to
control where the signal will interrupt the program (for example, we cannot
guarantee that we can use ``malloc`` while handling the signal -- the signal
could have interrupted a ``malloc`` call from the main program). For this
reason, signal handling functions are often limited to setting a single
variable, which is regularly checked by the main program. The functions which
are safe to call in the asynchronous signal context are listed in
``man signal-safety``.

.. _small_task_3:

Small task #3
=============

As you scroll through the signal-safe libc functions, you may notice that
``write`` is there, but no int-to-string conversion function is present. What
if we wanted to print the signal number to standard output? As the old
interface provided no way to pass extra per-signal information to the function,
let's pass it ourselves by generating a trampoline (thunk) to do the work.

Trampolines
-----------

Trampolines are small pieces of code whose only task is to transfer control to
another location, perhaps modifying the process context in some way or passing
additional parameters.

An interesting example of using trampolines are the FFI (foreign function
interface) libraries for scripting languages. Let's assume that an API written
in C, which we want to use with Python / Lua / etc, requires passing a pointer
to a function that it will call in a given situation (a so-called callback).
Because our program is written in an interpreted dynamic language, the FFI
library supporting callbacks must support the dynamic generation of functions
callable from C.
To this end, the library generates a trampoline, which will save the
appropriate registers containing parameters, write the trampoline identifier
(to select the appropriate high-level function to invoke), and then call the
FFI library function that starts the dynamic language interpreter with the
appropriate parameters. Thanks to such a solution, we can generate any number
of functions for callbacks at runtime and pass pointers to them to C functions.

Task
----

.. highlight:: C

Forwarding the signal handling logic to a dynamic language is one example of
such an API, but we are going to start with a simplified task here.
Write an implementation of the ``make_signal_handler`` function::

    typedef void (*sighandler_t)(int);
    sighandler_t make_signal_handler(int signum);

The ``make_signal_handler`` function should set a new signal handler using the
``signal`` libc function and return the previous handler. As the new handler,
you need to **generate a pointer to a dynamically created printing function**
which prints the signal number using ``write(1, ...)`` while ignoring the input
``signum`` argument.

For instance::

    int main() {
        sighandler_t old_2 = make_signal_handler(2);
        sighandler_t old_12 = make_signal_handler(12);
        for (;;) pause();
    }

should be equivalent to::

    void write2(int signum) {
        write(1, "2", 1);
    }

    void write12(int signum) {
        write(1, "12", 2);
    }

    int main() {
        sighandler_t old_2 = signal(2, write2);
        sighandler_t old_12 = signal(12, write12);
        for (;;) pause();
    }

.. important::

   Such a handler could have just used a global table of strings indexed by the
   signal number to perform the required task. However, such a solution would
   not take us closer to implementing a trampoline...

.. note::

   The ``signal`` libc function has different semantics (BSD) from the Linux
   ``signal`` syscall. It uses ``sigaction`` internally.
Hints
-----

Compile the following code (with the ``-no-pie -mcmodel=large -fno-pie``
options) and disassemble the resulting ``.o`` file::

    void writer(int signum) {
        write(1, "12", 2);
    }

From the compiled machine code, make a template that will be filled with a
pointer to the appropriate string and its length by ``make_signal_handler``.

This example program shows how to dynamically create an executable function::

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <unistd.h>
    #include <sys/mman.h> /* mmap(), mprotect() */

    static uint8_t code[] = {
        0xB8,0xFF,0x00,0x00,0x00,   /* mov  eax,0xFF */
        0xC3,                       /* ret */
    };

    int main(void)
    {
        const size_t len = sizeof(code);

        /* mmap a region for our code */
        void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,  /* No PROT_EXEC */
                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (p==MAP_FAILED) {
            fprintf(stderr, "mmap() failed\n");
            return 2;
        }

        /* Copy it in (still not executable) */
        memcpy(p, code, len);

        /* Now make it execute-only */
        if (mprotect(p, len, PROT_EXEC) < 0) {
            fprintf(stderr, "mprotect failed to mark exec-only\n");
            return 2;
        }

        /* Go! */
        int (*func)(void) = (int (*)(void))p;
        printf("(dynamic) code returned %d\n", func());

        pause();
        return 0;
    }

.. highlight:: none

Extra Topics
============

If you want to extend your knowledge beyond the scope of this course, consider:

Reading
-------

1. Manual section 2, in particular: ``syscall``, ``futex``, ``sigaction``
2. VDSO/vsyscall sources in ``arch/x86/vdso``
3. Syscall list in ``asm/unistd*.h``
4. Ulrich Drepper "Futexes Are Tricky", 2011 - http://www.akkadia.org/drepper/futex.pdf
5. ``man 7 signal``, ``man 7 signal-safety``

Implementing
------------

Some ideas to implement:

A warm shutdown on Ctrl-C
    A warm shutdown of a system allows finishing all currently performed tasks
    and leaving external resources in a proper state. Typically, hitting Ctrl-C
    again transitions to a cold shutdown: the process should end as fast as
    possible without corrupting the external state.

    Extend the code in the collapsed section below with support for warm
    shutdown.
    Subsequent ``SIGINT`` signals should:

    1. Print a message, break the loop, flush libc buffers and flush all data
       to disk.
    2. Print a message, flush the file to disk (ignore libc buffers) and exit
       (use ``siglongjmp``).
    3. Kill the program.

    Be careful to read about the requirements of signal safety (especially for
    ``siglongjmp``), and restore the original handler after the cancellable
    block.

    .. class:: details

    Code
        .. code-block:: c

            #include <err.h>
            #include <stdbool.h>
            #include <stdio.h>
            #include <unistd.h>

            void calculator(FILE* fd) {
                int iters = 0;
                volatile long my_something = 3;
                while (iters < 1000) {
                    for (int i = 0; i < 1000000000; ++i) {
                        my_something = my_something * 3 + 1; // digits of PI...
                    }
                    if (fwrite((void*)&my_something, sizeof(my_something), 1, fd) != 1) {
                        perror("Cannot write results");
                        break;
                    }
                    if (iters % 10 == 0) {
                        fflush(fd);
                    }
                    putchar('.');
                    fflush(stdout);
                    ++iters;  // count the finished round so the loop terminates
                }
                // Make sure we have our important computation on disk!
                fflush(fd);
                fsync(fileno(fd));
            }

            void cancellable(const char* filename) {
                FILE* out = fopen(filename, "w");
                if (out == NULL) {
                    err(1, "Cannot open %s", filename);
                }
                if (false /* long jump */) {
                    // We're inside the long jump here.
                    // Calling non-async-safe functions is illegal,
                    // or even returning from main.
                    // Store to disk what was written.
                    // sleep(5); to test the third case.
                }
                calculator(out);
                if (fclose(out) == EOF) {
                    err(3, "Cannot close %s", filename);
                }
            }

            int main() {
                cancellable("powerbytes");
                puts("We're done!");
            }

A ``fork()`` server
    A forkserver is a technique of launching new processes of an interpreted
    language without much of the overhead of initializing the interpreter
    again. A naive use of ``fork()`` has many issues, especially with
    multithreaded programs, such as inheriting locked mutexes. A *forkserver*
    is a single-threaded server process (possibly with preloaded modules) that
    can fork itself on request. As the server is started early or is an
    independent process, no unnecessary resources would be inherited.
    For instance, you can make the ``multiprocessing`` Python module use
    `forkserver for starting new processes `_.

    Play with arguments to ``clone`` and ``mmap`` to support proper data
    sharing. Use unnamed pipes / sockets to communicate between processes: the
    main process should have a socket interface for requests to the forkserver,
    which should enable communication with the newly spawned process through
    another pipe. The requester would then use that pipe to communicate work
    that needs to be done. The main trick is to use ``SCM_RIGHTS`` control
    messages on a Unix socket to send the ends (file descriptors) of the pipes
    used for communication with the new process.