Class 2: kernel interfaces

Date: 06.03.2018

System calls in the Linux kernel

What is a system call?

System calls (syscalls) are a mechanism that allows processes operating in the user space to request certain functions from the kernel.

Syscalls are the main mechanism by which user processes communicate with the outside world – the only thing that a program that does not use syscalls can do is hang or cause a runtime error.

The mechanism of calling syscalls from the user level varies for each architecture, and sometimes even for processors and / or kernel versions within a single architecture. Usually, this is done using an assembler instruction specially designed for this purpose.

Syscalls are identified by numbers. Their exact list and numbering depend on the architecture; you can view them in /usr/include/asm/unistd*.h or, in the kernel sources, in arch/*/include/uapi/asm/unistd*.h

Before calling a syscall, you should put its number and parameters in certain fixed registers of the processor. After the call, the result or error code is also available in a register.

A value returned from the kernel in the range [-4095, -1] means an error and is a negated standard error code (list in asm-generic/errno.h and asm-generic/errno-base.h). Other values mean success and their meaning depends on the syscall.
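The error convention above can be sketched in C as follows. The helper names `kernel_ret_is_error` and `kernel_ret_to_libc` are illustrative only (they are not kernel or libc APIs), and the bound 4095 is assumed to match the kernel's `MAX_ERRNO`.

```c
#include <errno.h>

/* Assumed to match the kernel's MAX_ERRNO: the largest encodable errno. */
#define KERNEL_MAX_ERRNO 4095L

/* Nonzero if a raw syscall return value denotes an error. */
static int kernel_ret_is_error(long ret)
{
    return ret < 0 && ret >= -KERNEL_MAX_ERRNO;
}

/* Hypothetical helper: convert a raw return value to the libc convention
 * (-1 plus errno on error, the value itself otherwise). */
static long kernel_ret_to_libc(long ret)
{
    if (kernel_ret_is_error(ret)) {
        errno = (int)-ret;   /* -EINVAL -> EINVAL etc. */
        return -1;
    }
    return ret;
}
```

For example, a raw result of `-EINVAL` would be reported to the caller as `-1` with `errno` set to `EINVAL`, while a result of `42` passes through unchanged.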

System calls on the x86_64 architecture

The only native mechanism of system calls on the x86_64 architecture is the syscall instruction. The call takes place as follows:

  • the syscall number is passed in rax.
  • parameters are passed in: rdi, rsi, rdx, r10, r8, r9 (in this order).
  • the syscall instruction is called.
  • the contents of rcx and r11 are destroyed by the kernel as a side effect of the syscall.
  • the result is in rax.
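The steps above can be sketched with GCC-style inline assembly. This is a minimal illustration, not how libc does it: `raw_syscall3` is a hypothetical helper limited to three parameters, and on architectures other than x86_64 it merely falls back to the libc `syscall` function and re-creates the raw return value.

```c
#include <errno.h>
#include <sys/syscall.h>   /* SYS_* syscall numbers */
#include <unistd.h>

/* Hypothetical helper: issue a syscall with up to three parameters and
 * return the raw kernel result (negated error code on failure). */
static long raw_syscall3(long nr, long a1, long a2, long a3)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                    /* result in rax */
                     : "a"(nr),                     /* number in rax */
                       "D"(a1), "S"(a2), "d"(a3)    /* rdi, rsi, rdx */
                     : "rcx", "r11", "memory");     /* clobbered by the kernel */
    return ret;
#else
    /* Fallback for other architectures: go through libc and undo its
     * errno convention. */
    long ret = syscall(nr, a1, a2, a3);
    return ret == -1 ? -(long)errno : ret;
#endif
}
```

With this helper, `raw_syscall3(SYS_close, -1, 0, 0)` would return `-EBADF` directly instead of setting `errno`.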

System calls on the i386 architecture

There are 3 mechanisms of system calls on the i386 architecture:

  • interrupt 0x80 (available on all processors)
  • sysenter instruction (available on Intel processors starting from Pentium II)
  • syscall instruction (available on AMD processors starting from K6)

The sysenter / syscall instructions were introduced in later processors due to poor performance of interrupts on x86 processors.

If you use interrupt 0x80, the system call looks like this:

  • the syscall number is passed in eax
  • parameters are passed in the following registers: ebx, ecx, edx, esi, edi, ebp (in this order); syscalls requiring more parameters have special conventions
  • syscall is called by the int $0x80 instruction
  • the syscall result will be in the eax register

Calls through syscall and sysenter are quite similar, but a bit more complicated.

VDSO and vsyscall mechanisms

Due to the existence of many syscall mechanisms on the i386 architecture and the need to choose the right one for the given machine, the VDSO mechanism was introduced. VDSO is a small shared library provided by the kernel that contains the syscall function appropriate for a given processor. The kernel contains several prepared versions of this library (int 0x80, syscall, sysenter) and selects the appropriate one at run time.

The x86_64 architecture does not require a syscall selection mechanism, but an improved mechanism for the execution of certain syscalls (such as clock_gettime, gettimeofday, time and getcpu) has been introduced there. These syscalls have optimized versions that do not require the processor to go into kernel mode (they only read global kernel variables, which are made readable from the user space in a special way). This mechanism also uses a code block exported by the kernel to the user space, named vsyscall.

The VDSO and vsyscall code also includes the implementation of the sigreturn and rt_sigreturn functions used when returning from the user-space signal handling function.

System calls in libc

Most system calls have their “wrappers” in the standard C library (libc). These are functions whose only task is to move parameters to the right place, call the appropriate syscall, and return the result. Note that the kernel and libc have different conventions for passing error information – the kernel returns the negated error code (e.g. -EINVAL) directly from the syscall, whereas the library functions return -1 on error and pass the error code in the global variable errno (the error code in errno is not negated).

The simplified version of the write syscall wrapper (omitting vdso, errno, cancellation point) may look like this:

.global write
write:
movl $1, %eax           # syscall number
syscall
cmpq $-4096, %rax       # error?
jna out                 # if not error, then exit
neg %rax                # -EINVAL -> EINVAL etc.
movl %eax, errno        # set errno
movq $-1, %rax          # return -1 (the result is a 64-bit ssize_t)
out:
ret

Not all system calls correspond directly to library functions, for many reasons:

  • many syscalls have several versions with parameters of various sizes (mainly those related to uids / gids, pids, offsets in files, etc.). Older versions with smaller parameters are preserved as part of compatibility with older versions of libc. Examples are syscall getuid (16-bit uid) and getuid32 (32-bit uid) as well as lseek (32-bit offset) and _llseek (64-bit offset). Existing versions of a syscall depend strongly on the architecture – for example, 64-bit architectures have never had syscalls with 32-bit offsets.
  • some syscalls (ipc, socketcall) in fact have many subfunctions with different parameters (shmat, shmctl, msgctl, socket, connect, bind, listen …). Each of these subfunctions has its own wrapper function in libc.
  • many syscalls have semantics modified by the thread library (more about this below)
  • some syscalls require special intervention from libc (vfork, clone, etc.)
  • because it is so

There is also a syscall function that allows you to execute any system call directly. This is useful when invoking syscalls that do not have their own wrappers in libc. The write implementation using this function might look like this:

ssize_t write(int fd, const void *buf, size_t len) {
        return syscall(SYS_write, fd, buf, len);
}
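A typical use is a syscall for which the installed libc ships no wrapper; for a long time this was the case for gettid in glibc. A minimal sketch, where `my_gettid` is an illustrative name chosen to avoid clashing with a wrapper that newer libc versions may provide:

```c
#include <sys/syscall.h>   /* SYS_gettid */
#include <sys/types.h>
#include <unistd.h>        /* syscall() */

/* Illustrative wrapper: returns the kernel thread ID of the caller. */
static pid_t my_gettid(void)
{
    return (pid_t)syscall(SYS_gettid);
}
```

In the main thread of a process the value returned by `my_gettid()` is expected to equal `getpid()`.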

Review of the most important syscalls

Process control

These syscalls are related to process management. It should be noted that the Linux kernel and the POSIX standard (including the pthreads thread library) use different process definitions: kernel processes correspond to POSIX threads. Here we use the POSIX definition.

noreturn void _exit (int status)
ends the thread
noreturn void exit_group (int status)
ends the process
pid_t getpid()
returns the ID of the current process
pid_t gettid()
returns the ID of the current thread
pid_t fork()
creates a new process that is a copy of the current one (but with only one thread); this is a special case of the clone syscall
int clone (int (*fn) (void *), void *newstack, int flags, void *arg, ...)
creates a new process or thread, quite a complicated function
pid_t waitpid (pid_t pid, int *stat_loc, int options)
waits for an event (exit, stop, etc.) in the child process
int execve (const char *path, char *const *argv, char *const *envp)
launches a new program in the current process, replacing the current one
long ptrace (int request, pid_t pid, void *addr, void *data)
performs many operations related to tracing other processes: it allows you to stop a given process, single-step it, read and write its address space, registers, etc. Used by gdb and strace.
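How fork, _exit and waitpid fit together can be sketched as below. `run_child` is a hypothetical helper written for this illustration; it uses the libc wrappers rather than raw syscalls.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with `code`; return the exit status seen by the
 * parent, or -1 if fork/waitpid fails or the child did not exit normally. */
static int run_child(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(code);                 /* child: terminate immediately */

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;                   /* parent: wait for that child */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A real program would usually call execve in the child instead of exiting right away.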

File support

At the syscall level, open files are identified by so-called file descriptors, i.e. small non-negative integers. Descriptors 0-2 typically correspond to the standard input, output, and error output. Other descriptors rarely have specific roles defined.

The most important syscalls from this group:

ssize_t read (int fd, void *buf, size_t len)
reads from the file fd to the buffer; returns the number of bytes read. 0 means the end of the file (this is not considered an error). A positive number, but less than len means a partial read – this may be due to an error, hitting the end of the file, or simply lack of more available data at the moment.
ssize_t write (int fd, const void *buf, size_t len)
writes to a file, works similarly to read.
int open (const char *fname, int flags, mode_t mode)
opens a file, returns the descriptor.
int close (int fd)
closes the file; may return an error if there is a problem in emptying kernel buffers.
int ioctl (int fd, unsigned long request, void *arg)
performs a special operation on the file. The set of available operations depends heavily on the file; ioctl mostly applies to device files. Examples of special operations:

  • change of terminal settings (performed on a terminal file)
  • volume change (performed on the sound card device file)
  • reading information about the manufacturer and physical parameters (performed on the hard disk device file)

int poll(struct pollfd *fds, nfds_t nfds, int timeout)
waits for one of the listed events to take place on one of the given descriptors; useful when the program can receive input from many sources
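Because read may return fewer bytes than requested, callers that need a full buffer typically loop. The sketch below shows one way to do that, together with a small poll-based wait; `wait_readable` and `read_full` are hypothetical helper names, not libc functions.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Wait until fd reports an input event. Returns 1 if an event is reported
 * (data, hang-up or error), 0 on timeout, -1 on failure. */
static int wait_readable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int n;
    do {
        n = poll(&p, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);   /* retry if a signal interrupted us */
    return n;
}

/* Read up to len bytes, continuing after partial reads. Stops early only
 * at end of file. Returns the number of bytes read, or -1 on error. */
static ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n == 0)
            break;                       /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;                /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

On a pipe, for instance, `read_full` keeps reading until the buffer is full or the write end has been closed.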

Memory management

void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)
creates a new memory area. If flags contains MAP_ANONYMOUS, this is simply a new block of memory. Otherwise, the area will be mapped to the given file – reading from the area will give us the contents of the file. If flags contains MAP_SHARED, writing to the area will also write to the file; otherwise (MAP_PRIVATE), writing to the area will create a new copy of the data from the file and modify only this copy. If the MAP_FIXED flag is passed, the area will be mapped at the given address; otherwise, the kernel will look for a free address.
int munmap (void *addr, size_t len)
unmaps (destroys) the given memory area
int mprotect (void *addr, size_t len, int prot)
changes the access rights to a given memory area
void *brk (void *addr)
shortens or extends the process heap segment

It should be noted that on Linux (as well as many other UNIXes) there are two methods for the allocation of “normal” memory: mmap with the option MAP_ANONYMOUS and brk. The usual malloc() with libc in the standard configuration uses the latter for small allocations, and the former for large allocations.

futex

Traditional methods of implementing mutexes in user space required, in order to avoid busy waiting, creating for example an unnamed pipe for each mutex (used to wake up waiting processes). This approach has a number of disadvantages:

  • mutexes are expensive: every mutex that may have to block consumes two file descriptors, and the mutex structure itself must also be quite large
  • implementation of interprocess mutexes (e.g. in shared memory) is very cumbersome

The 2.6 kernel added the futex syscall (fast userspace mutex), which significantly simplifies the implementation of mutexes:

int futex (int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);

This syscall, like socketcall, is actually a wrapper for several subfunctions (selected by the op parameter):

  • FUTEX_WAIT: atomically checks whether *uaddr == val and, if so, goes to sleep. If a timeout is given, it sleeps for at most that long; otherwise it sleeps until woken up.
  • FUTEX_WAKE: wakes up at most val processes waiting in FUTEX_WAIT on the address uaddr.
  • … and a few more, more complicated ones

The use of futex avoids problems with the traditional implementation:

  • syscall futex is called only when the mutex is already busy. There is no constant consumption of resources: kernel resources are used only when a thread is actually waiting for the mutex to be released
  • mutex is a very small structure (for the basic variant, a single int is enough)
  • mutexes work between processes without any special handling – the kernel uses the physical address for comparisons and correctly handles references to the same place through different addresses in different processes, etc.
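The points above can be illustrated with a small mutex built on FUTEX_WAIT and FUTEX_WAKE. This is a sketch only: it follows the three-state scheme described in the Drepper paper listed under Reading, assumes that `atomic_int` has the same representation as the `int` the kernel expects (true on Linux with common compilers), and uses hypothetical names (`futex_mutex`, `futex_call`, `futex_mutex_lock`, `futex_mutex_unlock`).

```c
#include <linux/futex.h>   /* FUTEX_WAIT, FUTEX_WAKE */
#include <stdatomic.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* States: 0 = unlocked, 1 = locked, 2 = locked with possible waiters. */
typedef atomic_int futex_mutex;

/* There is no libc wrapper for futex, so go through syscall(). */
static long futex_call(futex_mutex *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_mutex_lock(futex_mutex *m)
{
    int c = 0;
    if (atomic_compare_exchange_strong(m, &c, 1))
        return;                         /* uncontended: no syscall at all */
    if (c != 2)
        c = atomic_exchange(m, 2);      /* announce that we are going to wait */
    while (c != 0) {
        futex_call(m, FUTEX_WAIT, 2);   /* sleeps only if *m is still 2 */
        c = atomic_exchange(m, 2);
    }
}

static void futex_mutex_unlock(futex_mutex *m)
{
    if (atomic_fetch_sub(m, 1) != 1) {  /* state was 2: someone may be asleep */
        atomic_store(m, 0);
        futex_call(m, FUTEX_WAKE, 1);   /* wake at most one waiter */
    }
}
```

In the uncontended case both lock and unlock are a single atomic operation in user space; the kernel is entered only when a thread actually has to sleep or be woken.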

Signals

Signals are a mechanism for transmitting information about asynchronous or synchronous events to a user process. This mechanism is very similar to the kernel-level interrupt mechanism.

There are approximately 32 signals with a fixed purpose (the list is to some extent dependent on the architecture), and 32 real-time signals that can be used by the user for any purpose. Important signals are:

Signals sent by the kernel caused by processor errors:

  • SIGSEGV: informs about violation of protection mechanisms, most often reference to a bad memory area
  • SIGILL: informs about the execution of an incorrect machine instruction
  • SIGBUS: indicates a memory access error for a reason other than an invalid address or lack of permissions. It’s hard to get it on x86. On other architectures, it is often caused by, for example, access to a word at an unaligned address.
  • SIGFPE: floating point exception, originally reported exceptions in floating point calculations, later also used for arithmetic errors on integers (dividing by 0)
  • SIGTRAP: informs about hitting a breakpoint, used for debugging programs

Signals for controlling processes (sent by other processes):

  • SIGTERM: informs the process that it should end
  • SIGKILL: forcefully terminates the process
  • SIGSTOP: stops the process (with the possibility of continuing)
  • SIGCONT: continues the process
  • SIGCHLD: informs the parent that the state of a child process has changed (it exited, was stopped, etc.)

Signals related to terminal service:

  • SIGHUP: informs about disconnecting the terminal (e.g. closing the xterm window, disconnecting the ssh session)
  • SIGINT: indicates that Ctrl-C has been pressed
  • SIGQUIT: indicates that Ctrl-\ has been pressed
  • SIGTSTP: indicates that Ctrl-Z has been pressed
  • SIGTTIN: informs about trying to read from the controlling terminal without being in the foreground group
  • SIGTTOU: informs about trying to write to the controlling terminal without being in the foreground group
  • SIGWINCH: informs about the terminal size changing

Other signals:

  • SIGABRT: informs about the occurrence of an unforeseen error in the program (failed assert, etc.) and the necessity of its forceful closing
  • SIGPIPE: indicates that you have tried to write to the pipe or socket whose other end has been closed
  • SIGUSR1, SIGUSR2: without a predetermined purpose, intended for the user
  • SIGIO: informs you that an asynchronous IO operation has been finished, or that IO on a file can now be executed if the program has previously requested such information

Each signal has an assigned action that will be executed when it is delivered. This is one of:

  • ignoring: nothing will happen
  • executing a function: the function provided by the user will be called
  • default action, depending on the signal:
    • ignore (SIGCHLD, SIGWINCH)
    • stopping the process (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU)
    • continuation of the process (SIGCONT)
    • killing the process (SIGTERM, SIGKILL, SIGINT, …)
    • killing the process with a core dump (SIGSEGV, SIGQUIT, …)

The action performed on most signals can be changed by the syscalls signal (simple interface, but limited functionality) or sigaction (much greater functionality). The signals whose actions cannot be changed are SIGKILL and SIGSTOP. In addition, although the SIGCONT action can be changed, it will always continue the process in addition to calling the assigned action.

In addition to changing the action assigned to signals, you can also block their delivery with the sigprocmask syscall. A blocked signal is not the same as an ignored signal – an ignored signal is discarded, while a blocked signal remains pending until it is unblocked.

In addition to the signals sent by the kernel, each signal can also be manually sent by the user. Signals can be sent to entire processes (or groups of processes) via syscall kill, or to individual threads inside your own process by syscall tkill (or the corresponding function pthread_kill). Sending a signal to a process results in delivering it to an arbitrarily selected thread in the process.

Signal handling

The signal handling functions are set by the sigaction syscall and have one of the following types:

  • void func (int signum)
  • void func (int signum, siginfo_t *info, ucontext_t *ctx)

signum is the number of the incoming signal. In case of using the second type, info is a structure containing information about the signal – e.g. its source (processor error, sent by the user, terminal, etc.) and details (error address in the case of SIGSEGV, pid of the sender, etc.). ctx is a pointer to a structure containing the full state of the processor registers before entering the signal handler.

When a signal arrives with a handler function assigned, the kernel interrupts the process. If the process is executing a blocking system call and is in the state of interruptible sleep, the system call is interrupted. Depending on the semantics selected when setting the handler function, the call is either terminated and returns the error code EINTR, or it will be restarted when returning from the signal handler.

After interrupting the process and possibly a syscall, the kernel writes the state of the processor registers to the user stack (usually it is the “normal” stack of the program, but it is possible to set a separate stack for signals with the sigaltstack syscall). After saving the state of the registers, the kernel writes the parameters of the signal handler function and the return address to the stack (or to registers). Because of the need to clean up after the signal handler and restore the exact state of the processor before it was called, this return address points to a special sigreturn pseudo-function, which is part of the VDSO / vsyscall block. Upon returning from the signal handler, this pseudo-function calls the sigreturn syscall, which performs the proper cleanup.

Simply returning from the signal handler is not the only way to leave it – sometimes it is useful to use the siglongjmp call or to use information from the ucontext structure to unwind the stack.

Signal handling is a very delicate mechanism, because it is very difficult to control where the signal will interrupt the program (for example, we cannot guarantee that it is safe to use malloc while handling a signal – the signal could have interrupted a malloc call in the main program). For this reason, signal handling functions are often limited to setting a single variable, which is regularly checked by the main program.

Reading

  1. Manual section 2, in particular: syscall, futex, sigaction
  2. VDSO/vsyscall sources in arch/x86/vdso
  3. Syscall list in asm/unistd*.h
  4. Ulrich Drepper “Futexes Are Tricky”, 2011 - http://www.akkadia.org/drepper/futex.pdf
  5. man 7 signal