Class 3: kernel interfaces¶

Date: 11.03.2025

System calls in the Linux kernel¶

What is a system call?¶

System calls (syscalls) are a mechanism that allows processes operating in the user space to request certain functions from the kernel.

Syscalls are the main mechanism of communication of user processes with the outside world -- the only thing that a program that does not use syscalls can do is to hang or cause a runtime error. System calls are a stable interface of the Linux kernel -- they may only be added.

The mechanism of calling syscalls from the user level varies for each architecture, and sometimes even for processors and / or kernel versions within a single architecture. Usually, this is done using an assembler instruction specially designed for this purpose.

Syscalls are identified by numbers. Their exact list and numbering depends on architecture -- you can view it in the file /usr/include/asm/unistd*.h or arch/*/include/uapi/asm/unistd*.h in Linux sources.

A value returned from the kernel in the range [-4095, -1] means an error and is a negated standard error code (list in /usr/include/asm-generic/errno.h and errno-base.h). Other values mean success, and their meaning depends on the syscall.
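The return-value convention described above can be sketched in a few lines of C. This is only an illustration: the helper names are hypothetical, and the constant 4095 is an assumption based on the kernel's MAX_ERRNO limit.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: the kernel reserves the 4095 highest values for error codes. */
#define MAX_ERRNO 4095L

/* Hypothetical helper: returns the positive error code hidden in a raw
 * syscall result, or 0 if the result denotes success. */
int raw_result_to_errno(long ret)
{
    if (ret < 0 && ret >= -MAX_ERRNO)
        return (int)-ret;
    return 0;
}

/* Hypothetical helper: prints a human-readable interpretation. */
void describe_raw_result(long ret)
{
    int err = raw_result_to_errno(ret);
    if (err)
        printf("%ld -> error %d (%s)\n", ret, err, strerror(err));
    else
        printf("%ld -> success\n", ret);
}
```

Calling describe_raw_result(-2) from main should report error 2 (ENOENT on Linux), while any non-negative value is reported as success.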

Review of the most important syscalls¶

Hands-on

Syscalls executed by a process can be inspected with strace. This program shows all system calls made by the process -- including the ones made by the dynamic linker.

Go ahead and trace a process, for instance:

strace ls

You'll find a list of the most common system calls below. Try to figure out what's happening and analyze the work of the dynamic linker in correspondence to the previous labs.

Process control¶

These syscalls are related to process management. It should be noted that the Linux kernel and the POSIX standard (including the pthreads thread library) use different process definitions: kernel processes correspond to POSIX threads. Here we use the POSIX definition.

noreturn void exit (int status): ends the calling thread (note that the libc _exit() wrapper calls exit_group instead)
noreturn void exit_group (int status): ends the process
pid_t getpid(): returns the ID of the current process
pid_t gettid(): returns the ID of the current thread
pid_t fork(): creates a new process that is a copy of the current one (but with only one thread); this is a special case of the clone syscall
int clone (int (*fn) (void *), void *newstack, int flags, void *arg, ...): creates a new process or thread, quite a complicated function
pid_t waitpid (pid_t pid, int *stat_loc, int options): waits for an event (exit, stop, etc.) in the child process
int execve (const char *path, char *const *argv, char *const *envp): launches a new program in the current process, replacing the current one
long ptrace (int request, pid_t pid, void *addr, void *data): performs many operations related to tracing other processes: it allows you to stop a given process, single-step it, read and write its address space, registers, etc. Used by gdb and strace.
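The classic combination of fork, execve and waitpid can be sketched as below. This is a minimal illustration rather than a robust implementation; run_and_wait is a hypothetical helper name, and the exit status 127 for a failed execve is merely a shell convention adopted here.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;

/* Hypothetical helper: runs the program argv[0] in a child process and
 * returns its exit status, or -1 on failure or abnormal termination. */
int run_and_wait(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;                 /* fork failed, errno is set */

    if (pid == 0) {
        /* Child: replace the process image; execve returns only on error. */
        execve(argv[0], argv, environ);
        _exit(127);                /* assumed "could not exec" status */
    }

    /* Parent: wait for this particular child to terminate. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

For example, passing the argument vector { "/bin/sh", "-c", "exit 3", NULL } should make the helper return 3, assuming /bin/sh exists on the system.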

File support¶

At the syscall level, open files are identified by so-called file descriptors, i.e., small non-negative integers. Descriptors 0-2 typically correspond to the standard input, output, and error output. Other descriptors rarely have specific roles defined.

The most important syscalls from this group:

ssize_t read (int fd, void *buf, size_t len)

reads from the file fd to the buffer; returns the number of bytes read. 0 means the end of the file (this is not considered an error). A positive number, but less than len means a partial read -- this may be due to an error, hitting the end of the file, or simply lack of more available data at the moment.

ssize_t write (int fd, const void *buf, size_t len)

writes to a file, works similarly to read.

int open (const char *fname, int flags, mode_t mode)

opens a file, returns the descriptor.

int close (int fd)

closes the file; may return an error if there is a problem in emptying kernel buffers.

int ioctl (int fd, int request, void *arg)

performs a special operation on the file. The set of available operations depends heavily on the file. It mostly applies to device files only. Examples of special operations:

change of terminal settings (performed on a terminal file)
volume change (performed on the sound card device file)
reading information about the manufacturer and physical parameters (performed on the hard disk device file)

int poll(struct pollfd *fds, nfds_t nfds, int timeout)

waits for one of the requested events to take place on one of the listed descriptors; useful when the program can receive input from many sources
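Two of the points above, partial writes and waiting with poll, can be sketched as follows. Both helper names and the -2 "timeout" return value are hypothetical choices made for this example only.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper: keeps calling write until all len bytes are written.
 * Returns 0 on success, -1 on error (errno set by write). */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n == -1) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return -1;
        }
        p += n;                    /* partial write: advance and go on */
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: waits up to timeout_ms for fd to become readable,
 * then reads at most len bytes. Returns the byte count (0 = end of file),
 * -1 on error, or -2 if the timeout expired with no data. */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int ready = poll(&pfd, 1, timeout_ms);
    if (ready == -1)
        return -1;
    if (ready == 0)
        return -2;
    return read(fd, buf, len);
}
```

An easy way to try these out is a pipe: reading from the empty read end should time out, and after write_all on the write end the same call should return the data.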

Memory management¶

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off): creates a new memory area. If flags contains MAP_ANONYMOUS, this is simply a new block of memory. Otherwise, the area will be mapped to the given file -- reading from the area will give us the contents of the file. If flags contains MAP_SHARED, writing to the area will also write to the file; otherwise (MAP_PRIVATE), writing to the area will create a new copy of the data from the file and modify only this copy. If the MAP_FIXED flag is passed, the area will be mapped at the given address; otherwise, the kernel will look for a free address.
int munmap (void *addr, size_t len): unmaps (destroys) the given memory area
int mprotect (void *addr, size_t len, int prot): changes the access rights to a given memory area
void *brk (void *addr): shortens or extends the process heap segment

It should be noted that on Linux (as well as many other UNIXes) there are two methods for the allocation of "normal" memory: mmap with the option MAP_ANONYMOUS and brk. The usual malloc() with libc in the standard configuration uses the latter for small allocations, and the former for large allocations.
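The difference between MAP_SHARED and MAP_PRIVATE file mappings can be observed with a small experiment. The sketch below assumes that /tmp is writable; the helper name is hypothetical.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: maps a temporary one-byte file with the given
 * sharing flag (MAP_SHARED or MAP_PRIVATE), overwrites the first byte
 * through the mapping and returns the first byte as read back from the
 * file itself. Returns -1 on any failure. */
int first_byte_after_mapped_write(int sharing_flag)
{
    char path[] = "/tmp/mmap-demo-XXXXXX";   /* assumed writable location */
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);                            /* the file lives on via fd */

    int result = -1;
    if (write(fd, "a", 1) == 1) {
        char *area = mmap(NULL, 1, PROT_READ | PROT_WRITE,
                          sharing_flag, fd, 0);
        if (area != MAP_FAILED) {
            area[0] = 'b';                   /* modify through the mapping */
            munmap(area, 1);
            char c;
            if (pread(fd, &c, 1, 0) == 1)
                result = c;                  /* what the file now contains */
        }
    }
    close(fd);
    return result;
}
```

With MAP_SHARED the modification should reach the file, so the helper returns 'b'; with MAP_PRIVATE the write goes to a private copy and the file still starts with 'a'.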

Performing system calls on Linux¶

Hands-on

The most direct way of invoking a system call from C is to use the syscall wrapper function, which expects a syscall number (or a constant from sys/syscall.h) and its arguments. The function returns the syscall result on success, and -1 on error, while storing the actual (non-negated) error code in the global variable errno.

Compile the program listed below and open it in a debugger. Single-step it from the main() function to see how the system call is executed, and refer to the sections below for the exact description.

Note the instruction address which performed the actual call -- check what was mapped there in the maps file.

#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>

int main()
{
    int rc = syscall(SYS_chmod, "./f", 777); // oops
    if (rc == -1)
        return errno;
    return 0;
}

Before calling a syscall, you should put its number and parameters in certain fixed registers of the processor. After the call, the result or error code is also available in a register.

System calls on the x86_64 architecture¶

Tip

There is a useful reference available.

The only native mechanism of system calls on the x86_64 architecture is the syscall instruction. The call takes place as follows:

the syscall number is passed in rax.
parameters are passed in: rdi, rsi, rdx, r10, r8, r9 (in this order),
the syscall instruction is called,
the contents of rcx and r11 are destroyed by the kernel as a side effect of the syscall,
the result is in rax.
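The steps above can be sketched with GCC-style inline assembly. This is an illustration limited to three arguments; raw_syscall3 is a hypothetical helper, and the fallback branch for other architectures is an addition of this example that only emulates the raw convention on top of libc.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: performs a syscall with up to three arguments and
 * returns the raw kernel result (a negated error code on failure). */
long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)                    /* result in rax */
                      : "a"(nr),                     /* number in rax */
                        "D"(a1), "S"(a2), "d"(a3)    /* rdi, rsi, rdx */
                      : "rcx", "r11", "memory");     /* clobbered     */
    return ret;
#else
    /* Fallback for other architectures (assumption of this sketch):
     * rebuild the raw convention from the libc syscall() function. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

For instance, raw_syscall3(SYS_getpid, 0, 0, 0) should match getpid(), and a write to descriptor -1 should return -EBADF directly, without touching errno on x86_64.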

System calls on the i386 architecture¶

There are 3 mechanisms of system calls on the i386 architecture:

interrupt 0x80 (available on all processors)
sysenter instruction (available on Intel processors starting from Pentium Pro)
syscall instruction (available on AMD processors starting from K6)

The sysenter / syscall instructions were introduced in later processors due to poor performance of interrupts on x86 processors.

If you use interrupt 0x80, the system call looks like this:

the syscall number is passed in eax
parameters are passed in the following registers: ebx, ecx, edx, esi, edi, ebp (in this order); syscalls requiring more parameters have special conventions
syscall is called by the int $0x80 instruction
the syscall result will be in the eax register

Calls through syscall and sysenter are quite similar, but a bit more complicated.

VDSO and vsyscall mechanisms¶

Due to the existence of many syscall mechanisms on the i386 architecture and the need to choose the right one for the given machine, the VDSO mechanism was introduced. VDSO is a small shared library provided by the kernel that contains the syscall function appropriate for a given processor. The kernel contains several prepared versions of this library (int 0x80, syscall, sysenter) and selects the appropriate one at run time.

The x86_64 architecture does not require a syscall selection mechanism, but an improved mechanism for the execution of certain syscalls (clock_gettime, gettimeofday, time, getcpu) has been introduced there. These syscalls have optimized versions that do not require the processor to go into kernel mode (they only read global kernel variables, which are made available to read from the user space in a special way). This mechanism also uses a code block exported by the kernel to the user space, named vsyscall.

The VDSO and vsyscall code also include the implementation of the sigreturn and rt_sigreturn functions used when returning from the user-space signal handling function.
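One way to locate the code block exported by the kernel is to ask for the AT_SYSINFO_EHDR entry of the auxiliary vector. The sketch below assumes a libc that provides getauxval (glibc and musl do); print_vdso_base is a hypothetical helper.

```c
#include <stdio.h>
#include <sys/auxv.h>

/* Returns the address at which the kernel mapped the vDSO into this
 * process, or 0 if no such entry was provided. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Hypothetical helper: reports the vDSO location on stdout. */
void print_vdso_base(void)
{
    unsigned long base = vdso_base();
    if (base)
        printf("vDSO ELF header mapped at %#lx\n", base);
    else
        printf("no vDSO reported for this process\n");
}
```

The printed address can be compared with the [vdso] line in /proc/self/maps of the same process.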

System calls in libc¶

Most system calls have their "wrappers" in the standard C library (libc). These are functions whose only task is to move parameters to the right place, call the appropriate syscall, and return the result. Note that the kernel and libc have different conventions for passing error information -- the kernel returns the negated error code (e.g., -EINVAL) directly from the syscall, whereas the library functions always return -1 on error, and the error code is passed in the global variable errno (the error code in errno is not negated).

The simplified version of the write syscall wrapper (omitting vdso, errno, cancellation point) may look like this:

.global write
write:
movl $1, %eax           # syscall number
syscall
cmpq $-4096, %rax       # error?
jna out                 # if not error, then exit
neg %rax                # -EINVAL -> EINVAL etc.
movl %eax, errno        # set errno
movl $-1, %eax          # return -1
out:
ret

Not all system calls correspond directly to library functions, for many reasons:

many syscalls have several versions with parameters of various sizes (mainly those related to uids / gids, pids, offsets in files, etc.). Older versions with smaller parameters are preserved as part of compatibility with older versions of libc. Examples are syscall getuid (16-bit uid) and getuid32 (32-bit uid) as well as lseek (32-bit offset) and _llseek (64-bit offset). Existing versions of a syscall depend strongly on the architecture -- for example, 64-bit architectures have never had syscalls with 32-bit offsets.
some syscalls (ipc, socketcall) in fact have many subfunctions with different parameters (shmat, shmctl, msgctl, socket, connect, bind, listen, etc.). Each of these subfunctions has its own wrapper function in libc.
many syscalls have semantics modified by the thread library (more about this below)
some syscalls require special intervention from libc (vfork, clone, etc.)
because it is so

Hands-on

Compile the following program statically (-static, so we won't be spammed by the linker) and execute it under strace.

#include <sys/time.h>
#include <stdio.h>

int main()
{
    struct timeval tv;
    printf("123\n");
    gettimeofday(&tv, NULL);
    printf("%ld\n", (long)tv.tv_sec);  /* tv_sec is a time_t, not an int */
}

Where did the call to gettimeofday (syscall #96 on x86_64) go? If you redirect the output to a file/pipe, why is there only a single write?

Other useful syscalls¶

Hands-on

In contrast to the fairly simple fork() system call, clone has a more advanced interface allowing fine-grained specification of what is shared between the parent and the child process. Moreover, instead of returning in both processes, it calls a given function on its own stack. Refer to man clone for a list of available flags.

The code attached below presents the basic usage and how data may be shared. Review it and explain what changes when the "vm" argument is passed.

#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>

#define STACK_SIZE      65536

int global_value = 0;
char *heap;
static int child_func(void *arg)
{
    char *buf = (char *)arg;
    printf("Child sees buf = %s\n", buf);
    printf("Child sees global value = %d\n", global_value);
    printf("Child sees heap = %s\n", heap);
    strcpy(buf, "hello from child");
    global_value = 10;
    strcpy(heap, "bye");
    return 0;
}

int main(int argc, char *argv[])
{
    //Allocate stack for child task
    char *stack = malloc(STACK_SIZE);
    unsigned long flags = 0;
    char buf[256];
    int status;

    if (!stack) {
        perror("Failed to allocate memory\n");
        exit(1);
    }
    heap = malloc(1024);
    if (!heap) {
        perror("Failed to allocate memory\n");
        exit(2);
    }
    if (argc == 2 && !strcmp(argv[1], "vm"))
        flags |= CLONE_VM;
    strcpy(buf, "Hello from Parent");
    strcpy(heap, "Hey");
    global_value = 5;
    if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
        perror("clone");
        exit(1);
    }
    if (wait(&status) == -1) {
        perror("Wait");
        exit(1);
    }
    printf("Child exited with status:%d\t", status);
    printf("buf:%s\t global_value=%d\n",
            buf, global_value);
    printf("Parent heap:%s\n", heap);
    return 0;
}

futex¶

Traditional methods of implementing mutexes in the userspace required, to avoid busy waiting, creating, for example, an unnamed pipe for each mutex (used to awaken waiting processes). This approach has a number of disadvantages:

mutexes are expensive: every mutex that may be waited on takes two file descriptors, and the mutex structure itself must also be quite large
implementation of interprocess mutexes (e.g., in shared memory) is very heavy

The 2.6 kernel has added the futex syscall (fast userspace mutex), which allows significantly simplifying the implementation of mutexes:

int futex (int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

This syscall, like socketcall, is actually a wrapper for several subfunctions (selected by the op parameter):

FUTEX_WAIT: atomically checks if *uaddr == val and falls asleep, if so. If a timeout is given, it falls asleep at most for such a period of time, otherwise it falls asleep indefinitely.
FUTEX_WAKE: wakes up at most val processes waiting in FUTEX_WAIT on the address uaddr.
… and a few more, more complicated ones

The use of futex avoids problems with the traditional implementation:

syscall futex is called only when the mutex is already busy. There is no constant consumption of resources: kernel resources are used only when a thread is actually waiting for the mutex to be released
mutex is a very small structure (for the basic variant, a single int is enough)
mutexes work between processes without any special handling -- the kernel uses the physical address for comparisons and correctly handles references to the same place through different addresses in different processes, etc.
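The two basic operations can be tried out directly. Since glibc provides no futex() wrapper, the sketch below goes through the generic syscall() function; futex_wait and futex_wake are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: sleeps while *uaddr == expected. Returns 0 when
 * woken up, -1 with errno set otherwise (EAGAIN: the value differed,
 * ETIMEDOUT: the relative timeout expired). */
int futex_wait(int *uaddr, int expected, const struct timespec *timeout)
{
    return (int)syscall(SYS_futex, uaddr, FUTEX_WAIT, expected,
                        timeout, NULL, 0);
}

/* Hypothetical helper: wakes at most count waiters sleeping on uaddr and
 * returns how many were actually woken. */
int futex_wake(int *uaddr, int count)
{
    return (int)syscall(SYS_futex, uaddr, FUTEX_WAKE, count,
                        NULL, NULL, 0);
}
```

Even without a second thread the semantics are visible: waiting with an expected value different from the stored one fails immediately with EAGAIN, waiting with a matching value and a short timeout fails with ETIMEDOUT, and waking an address nobody waits on returns 0. A real mutex would first try an atomic compare-and-swap in user space and call these helpers only on contention.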

Signals¶

Signals are a mechanism for transmitting information about asynchronous or synchronous events to the user's process. This mechanism is very similar to the kernel-level interrupt mechanism.

There are approximately 32 signals with a fixed purpose (the list is to some extent dependent on the architecture), and 32 real-time signals that can be used by the user for any purpose. Important signals are:

Signals sent by the kernel caused by processor errors:

SIGSEGV: informs about violation of protection mechanisms, most often reference to a bad memory area
SIGILL: informs about the execution of an incorrect machine instruction
SIGBUS: indicates a memory access error for a reason other than an invalid address or lack of permissions. It's hard to get it on x86. On other architectures, it is often caused by, for example, access to a word at an unaligned address.
SIGFPE: floating point exception, originally reported exceptions in floating point calculations, later also used for arithmetic errors on integers (dividing by 0)
SIGTRAP: informs about hitting a breakpoint, used for debugging programs

Signals for controlling processes (sent by other processes):

SIGTERM: informs the process that it should end
SIGKILL: forcefully terminates the process
SIGSTOP: stops the process (with the possibility of continuing)
SIGCONT: continues the process
SIGCHLD: indicates the status of the child process

Signals related to terminal service:

SIGHUP: informs about disconnecting the terminal (e.g., closing the xterm window, disconnecting the ssh session)
SIGINT: indicates that Ctrl-C has been pressed
SIGQUIT: indicates that Ctrl-\ has been pressed
SIGTSTP: indicates that Ctrl-Z has been pressed
SIGTTIN: informs about trying to read from the controlling terminal without being in the foreground group
SIGTTOU: informs about trying to write to the controlling terminal without being in the foreground group
SIGWINCH: informs about the terminal size changing

Other signals:

SIGABRT: informs about the occurrence of an unforeseen error in the program (failed assert, etc.) and the necessity of its forceful closing
SIGPIPE: indicates that you have tried to write to the pipe or socket whose other end has been closed
SIGUSR1, SIGUSR2: without a predetermined purpose, intended for the user
SIGIO: informs you that an asynchronous IO operation has been finished, or that IO on a file can now be executed if the program has previously requested such information

Each signal has an assigned action that will be executed when it is delivered. This is one of:

ignoring: nothing will happen
executing a function: the function provided by the user will be called
default action, depending on the signal:
- ignore (SIGCHLD, SIGWINCH)
- stopping the process (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU)
- continuation of the process (SIGCONT)
- killing the process (SIGTERM, SIGKILL, SIGINT, …)
- killing the process with a core dump (SIGSEGV, SIGQUIT, …)

The action performed on most signals can be changed by syscalls signal (simple interface, but limited functionality) or sigaction (much greater functionality). Signals whose actions cannot be changed are SIGKILL and SIGSTOP. In addition, although the SIGCONT action can be changed, it will always continue the process in addition to calling the assigned action.

In addition to changing the action assigned to signals, you can also block their delivery by syscall sigprocmask. A blocked signal is not the same as an ignored signal -- an ignored signal will be discarded while a blocked signal will be waiting in the queue until it is unblocked.

In addition to the signals sent by the kernel, each signal can also be manually sent by the user. Signals can be sent to entire processes (or groups of processes) via syscall kill, or to individual threads inside your own process by syscall tkill (or the corresponding function pthread_kill). Sending a signal to a process results in delivering it to an arbitrarily selected thread in the process.

Signal handling¶

The signal handling functions are set by the sigaction syscall and have one of the following types:

void func (int signum)
void func (int signum, siginfo_t *info, ucontext_t *ctx)

signum is the number of the incoming signal. In case of using the second type, info is a structure containing information about the signal -- e.g., its source (processor error, sent by the user, terminal, etc.) and details (error address in the case of SIGSEGV, pid of the sender, etc.). ctx is a pointer to a structure containing the full state of the processor registers before entering the signal handler.

When a signal arrives with a handler function assigned, the kernel interrupts the process. If the process is executing a blocking system call and is in a state of interruptible sleep, the system call is interrupted. Depending on the semantics selected when setting the handler function, the call is either terminated and returns the error code EINTR, or it will be restarted when returning from the signal handler.

After interrupting the process and possibly a syscall, the kernel writes the state of the processor registers to the user stack (usually it is a "normal" stack of the program, but it is possible to set a separate stack for signals by syscall sigaltstack). After saving the state of registers, the kernel writes to the stack (or registers) parameters to the interrupt service function and the return address. Because of the need to clean up after the signal handler and restore the exact state of the processor before it was called, this return address points to a special sigreturn pseudo function, which is part of the VDSO / vsyscall block. Upon returning from the signal handler, sigreturn calls syscall sigreturn which deals with proper cleanup.

Simply returning from the signal handler is not the only way to leave it -- sometimes it is useful to use the siglongjmp call or to use information from the ucontext structure to unwind the stack.

Signal handling is a very delicate mechanism, because it is very difficult to control where the signal will interrupt the program (for example, we can not guarantee that we can use malloc while handling the signal -- this signal could have interrupted a malloc call from the main program). For this reason, the signal handling functions are often limited to the setting of a single variable, which is regularly checked by the main program. The functions which are safe to call in the asynchronous signal context are listed in man signal-safety.

Small task #3¶

As you scroll through the signal-safe libc functions, you may notice that write is there, but no int-to-string conversion function is present. What if we wanted to print the signal number to stderr? As the old interface provides no way to pass extra per-signal information to the handler, let's pass it ourselves by generating a trampoline (thunk) to do the work.

Trampolines¶

Trampolines are small pieces of code whose only task is to transfer control to another location, perhaps modifying the process context in some way or passing additional parameters.

An interesting example of using trampolines are the FFI (foreign function interface) libraries for scripting languages. Let's assume that an API written in C, which we want to use from Python / Lua / etc., requires passing a pointer to a function that it will call in a given situation (a so-called callback). Because our program is written in an interpreted dynamic language, the FFI library supporting callbacks must support the dynamic generation of functions callable from C. To this end, the library generates a trampoline, which saves the registers containing parameters, stores the trampoline identifier (to select the appropriate high-level function to invoke), and then calls the FFI library function that starts the dynamic language interpreter with the appropriate parameters. Thanks to this solution, we can generate any number of callback functions at runtime and pass pointers to them to C functions.

Task¶

Forwarding the signal handling logic to a dynamic language is one example of such an API, but we are going to start with a simplified task here.

Write an implementation of the make_signal_handler function:

typedef void (*sighandler_t)(int);
sighandler_t make_signal_handler(int signum);

The make_signal_handler function should set a new signal handler using the signal libc function and return the previous handler. As a new handler, you need to generate a pointer to a dynamically created printing function which would print the signal number using write(1, ...) while ignoring the input signum argument. For instance:

int main() {
    sighandler_t old_2 = make_signal_handler(2);
    sighandler_t old_12 = make_signal_handler(12);
    for (;;) pause();
}

should be equivalent to:

void write2(int signum) {
    write(1, "2", 1);
}
void write12(int signum) {
    write(1, "12", 2);
}

int main() {
    sighandler_t old_2 = signal(2, write2);
    sighandler_t old_12 = signal(12, write12);
    for (;;) pause();
}

Important

Such a handler could have just used a global strings table indexed by the signal number to perform the required task. However, such a solution would not take us closer to implementing a trampoline...

Note

The signal libc function has different semantics (BSD) from the Linux signal syscall. It uses sigaction internally.

Hints¶

Compile the following code (with the -no-pie -mcmodel=large -fno-pie options) and disassemble the resulting .o file:

void writer(int signum) {
    write(1, "12", 2);
}

From the compiled machine code, make a template that will be filled with a pointer to the appropriate string and its length by make_signal_handler.

This example program shows how to dynamically create an executable function:

#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h>   /* mmap(), mprotect() */

static uint8_t code[] = {
    0xB8,0xFF,0x00,0x00,0x00,   /* mov  eax,0xFF    */
    0xC3,                       /* ret              */
};

int main(void)
{
    const size_t len = sizeof(code);

    /* mmap a region for our code */
    void *p = mmap(NULL, len, PROT_READ|PROT_WRITE,  /* No PROT_EXEC */
            MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    if (p==MAP_FAILED) {
        fprintf(stderr, "mmap() failed\n");
        return 2;
    }

    /* Copy it in (still not executable) */
    memcpy(p, code, len);

    /* Now make it execute-only */
    if (mprotect(p, len, PROT_EXEC) < 0) {
        fprintf(stderr, "mprotect failed to mark exec-only\n");
        return 2;
    }

    /* Go! */
    int (*func)(void) = (int(*)())p;
    printf("(dynamic) code returned %d\n", func());

    pause();
    return 0;
}

Extra Topics¶

If you want to extend your knowledge beyond the scope of this course, consider:

Reading¶

Manual section 2, in particular: syscall, futex, sigaction
VDSO/vsyscall sources in arch/x86/entry/vdso (arch/x86/vdso in older kernels)
Syscall list in asm/unistd*.h
Ulrich Drepper "Futexes Are Tricky", 2011 - http://www.akkadia.org/drepper/futex.pdf
man 7 signal, man 7 signal-safety

Implementing¶

Some ideas to implement:

A warm shutdown on Ctrl-C

A warm shutdown of a system allows finishing all currently performed tasks and leaving external resources in a proper state. Typically, hitting Ctrl-C again transitions to a cold shutdown: the process should end as fast as possible without corrupting the external state.

Extend the code in the collapsed section below with support for warm shutdown. Subsequent SIGINT signals should:

Print a message, break the loop, flush libc buffers and flush all data to disk.
Print a message, flush the file to disk (ignore libc buffers) and exit (use siglongjmp).
Kill the program.

Be careful to read about the requirements of signal safety (especially for siglongjmp), and restore the original handler after the cancellable block.

Code

#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <stdbool.h>   /* for the false placeholder in cancellable() */

void calculator(FILE* fd) {
    int iters = 0;
    volatile long my_something = 3;

    while (iters < 1000) {
        for (int i = 0; i < 1000000000; ++i) {
            my_something = my_something * 3 + 1; // digits of PI...
        }
        if (fwrite((void*)&my_something, sizeof(my_something), 1, fd) != 1) {
            perror("Cannot write results\n");
            break;
        }
        ++iters;  /* count finished rounds, otherwise the loop never ends */
        if (iters % 10 == 0) {
            fflush(fd);
        }
        putchar('.');
        fflush(stdout);
    }
    // Make sure we have our important computation on disk!
    fflush(fd);
    fsync(fileno(fd));
}

void cancellable(const char* filename) {
    FILE* out = fopen(filename, "w");
    if (out == NULL) {
        err(1, "Cannot open %s:", filename);
    }

    if (false /* long jump */) {
        // We're inside the long jump here.
        // Calling non-async-safe functions is illegal,
        // or even returning from main.

        // Store to disk what was written.
        // sleep(5); to test the third case.
    }

    calculator(out);
    if (fclose(out) == EOF) {
        err(3, "Cannot close %s:", filename);
    };
}

int main() {
    cancellable("powerbytes");
    puts("We're done!");
}

A fork() server

A forkserver is a technique of launching new processes of an interpreted language without much of the overhead of initializing the interpreter again. A naive use of fork() has many issues, especially with multithreaded programs, such as inheriting locked mutexes.

A forkserver is a single-threaded server process (possibly with preloaded modules) that can fork itself on request. As the server is started early or is an independent process, no unnecessary resources are inherited. For instance, you can make the multiprocessing Python module use a forkserver for starting new processes.

Play with arguments to clone and mmap to support proper data sharing. Use unnamed pipes / sockets to communicate between processes: the main process should have a socket interface for requests to the forkserver, which should enable communication with the newly spawned process through another pipe. The requester would then use that pipe to communicate work that needs to be done.

The main trick is to use SCM_RIGHTS ancillary messages on Unix sockets to send the ends (file descriptors) of the pipes used for communication with the new process.
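Passing a descriptor with SCM_RIGHTS takes some boilerplate around sendmsg and recvmsg. The sketch below shows one possible shape of it; send_fd and recv_fd are hypothetical helper names, and one descriptor is sent per message for simplicity.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Control buffer large enough for one descriptor, correctly aligned. */
union fd_control {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Hypothetical helper: sends descriptor fd over the Unix socket sock.
 * Returns 0 on success, -1 on error. */
int send_fd(int sock, int fd)
{
    char dummy = 'F';                  /* one byte of ordinary data */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg = { 0 };

    memset(&ctl, 0, sizeof ctl);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Hypothetical helper: receives a descriptor sent by send_fd.
 * Returns the new descriptor, or -1 on error. */
int recv_fd(int sock)
{
    char dummy;
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union fd_control ctl;
    struct msghdr msg = { 0 };
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof ctl.buf;

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same open file, so a forkserver could create a pipe for each spawned worker and hand one end back to the requester this way.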