Class 3: kernel interfaces¶
Date: 11.03.2025
System calls in the Linux kernel¶
What is a system call?¶
System calls (syscalls) are a mechanism that allows processes operating in the user space to request certain functions from the kernel.
Syscalls are the main mechanism by which user processes communicate with the outside world -- the only thing that a program that does not use syscalls can do is to hang or cause a runtime error. System calls are a stable interface of the Linux kernel: existing calls are not removed or changed incompatibly, so new ones may only be added.
The mechanism of calling syscalls from the user level varies for each architecture, and sometimes even for processors and / or kernel versions within a single architecture. Usually, this is done using an assembler instruction specially designed for this purpose.
Syscalls are identified by numbers. Their exact list and numbering depend on the architecture -- you can view them in the file /usr/include/asm/unistd*.h or arch/*/include/uapi/asm/unistd*.h in Linux sources.
A value returned from the kernel in the range [-4095, -1] means an error and is a negated standard error code (listed in /usr/include/asm-generic/errno.h and errno-base.h). Other values mean success, and their meaning depends on the syscall.
Review of the most important syscalls¶
Hands-on
Syscalls executed by a process may be inspected with strace.
This program shows all system calls made by the process -- including the ones done by the dynamic linker.
Go ahead and trace a process, for instance:
strace ls
You'll find a list of the most common system calls below. Try to figure out what's happening and analyze the work of the dynamic linker in correspondence to the previous labs.
Process control¶
These syscalls are related to process management. It should be noted that the Linux kernel and the POSIX standard (including the pthreads thread library) use different process definitions: kernel processes correspond to POSIX threads. Here we use the POSIX definition.
noreturn void _exit (int status)
ends the thread
noreturn void exit_group (int status)
ends the process
pid_t getpid()
returns the ID of the current process
pid_t gettid()
returns the ID of the current thread
pid_t fork()
creates a new process that is a copy of the current one (but with only one thread); this is a special case of the clone syscall
int clone (int (*fn) (void *), void *newstack, int flags, void *arg, ...)
creates a new process or thread, quite a complicated function
pid_t waitpid (pid_t pid, int *stat_loc, int options)
waits for an event (exit, stop, etc.) in the child process
int execve (const char *path, char *const *argv, char *const *envp)
launches a new program in the current process, replacing the current one
long ptrace (int request, pid_t pid, void *addr, void *data)
performs many operations related to tracing other processes: it allows you to stop a given process, single-step it, read and write its address space, registers, etc. Used by gdb and strace.
File support¶
At the syscall level, open files are identified by so-called file descriptors, i.e., small non-negative integers. Descriptors 0-2 typically correspond to the standard input, output, and error output. Other descriptors rarely have specific roles defined.
The most important syscalls from this group:
ssize_t read (int fd, void *buf, size_t len)
reads from the file fd to the buffer; returns the number of bytes read. 0 means the end of the file (this is not considered an error). A positive number less than len means a partial read -- this may be due to an error, hitting the end of the file, or simply a lack of more available data at the moment.
ssize_t write (int fd, const void *buf, size_t len)
writes to a file, works similarly to read.
int open (const char *fname, int flags, mode_t mode)
opens a file, returns the descriptor.
int close (int fd)
closes the file; may return an error if there is a problem in emptying kernel buffers.
int ioctl (int fd, int request, void *arg)
performs a special operation on the file. The set of available operations depends heavily on the file, and it mostly applies to device files. Examples of special operations:
change of terminal settings (performed on a terminal file)
volume change (performed on the sound card device file)
reading information about the manufacturer and physical parameters (performed on the hard disk device file)
int poll(struct pollfd *fds, int nfds, int timeout)
waits for one of the listed events to take place on one of the given descriptors; useful when the program can receive input from many sources
Memory management¶
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off)
creates a new memory area. If flags contains MAP_ANONYMOUS, this is simply a new block of memory. Otherwise, the area will be mapped to the given file -- reading from the area will give us the contents of the file. If flags contains MAP_SHARED, writing to the area will also write to the file; otherwise (MAP_PRIVATE), writing to the area will create a new copy of the data from the file and modify only this copy. If the MAP_FIXED flag is passed, the area will be mapped at the given address; otherwise, the kernel will look for a free address.
int munmap (void *addr, size_t len)
unmaps (destroys) the given memory area
int mprotect (void *addr, size_t len, int prot)
changes the access rights to a given memory area
void *brk (void *addr)
shortens or extends the process heap segment
It should be noted that on Linux (as well as many other UNIXes) there are two methods for the allocation of "normal" memory: mmap with the option MAP_ANONYMOUS and brk. The usual malloc() with libc in the standard configuration uses the latter (brk) for small allocations, and the former (mmap) for large allocations.
Performing system calls on Linux¶
Hands-on
The most direct way of invoking a system call from C is to use the syscall wrapper function, which expects a syscall number (or a constant from sys/syscall.h) and its arguments.
The function returns the syscall result on success, and -1 on error, while storing the actual (non-negated) error code in the global variable errno.
Compile the program listed below and open it in a debugger.
Single-step it from the main() function to see how the system call is executed, and refer to the sections below for the exact description.
Note the address of the instruction that performed the actual call -- check what is mapped there in the process's /proc/<pid>/maps file.
#include <unistd.h>
#include <sys/syscall.h>
#include <errno.h>
int main()
{
int rc = syscall(SYS_chmod, "./f", 777); // oops
if (rc == -1)
return errno;
return 0;
}
Before calling a syscall, you should put its number and parameters in certain fixed registers of the processor. After the call, the result or error code is also available in a register.
System calls on the x86_64 architecture¶
Tip
There is a useful reference available.
The only native mechanism of system calls on the x86_64 architecture is the syscall instruction. The call takes place as follows:
the syscall number is passed in rax,
parameters are passed in: rdi, rsi, rdx, r10, r8, r9 (in this order),
the syscall instruction is called,
the contents of rcx and r11 are destroyed by the kernel as a side effect of the syscall,
the result is in rax.
System calls on the i386 architecture¶
There are 3 mechanisms of system calls on the i386 architecture:
interrupt 0x80 (available on all processors)
sysenter instruction (available on Intel processors starting from Pentium Pro)
syscall instruction (available on AMD processors starting from K6)
The sysenter
/ syscall
instructions were introduced in later processors
due to poor performance of interrupts on x86 processors.
If you use interrupt 0x80, the system call looks like this:
the syscall number is passed in eax
parameters are passed in the following registers: ebx, ecx, edx, esi, edi, ebp (in this order); syscalls requiring more parameters have special conventions
the syscall is invoked by the int $0x80 instruction
the syscall result will be in the eax register
Calls through syscall
and sysenter
are quite similar, but a bit more
complicated.
VDSO and vsyscall mechanisms¶
Due to the existence of many syscall mechanisms on the i386 architecture and
the need to choose the right one for the given machine, the VDSO mechanism
was introduced. VDSO is a small shared library provided by the kernel that
contains the syscall function appropriate for a given processor. The kernel
contains several prepared versions of this library (int 0x80
, syscall
,
sysenter
) and selects the appropriate one at run time.
The x86_64 architecture does not require a syscall selection mechanism, but an improved mechanism for the execution of certain syscalls (clock_gettime, gettimeofday, time, getcpu) has been introduced there. These syscalls have optimized versions that do not require the processor to go into kernel mode (they only read global kernel variables, which are made available to read from the user space in a special way). This mechanism also uses a code block exported by the kernel to the user space, named vsyscall.
The VDSO and vsyscall code also include the implementation of the sigreturn
and rt_sigreturn
functions used when returning from the user-space signal
handling function.
System calls in libc¶
Most system calls have their "wrappers" in the standard C library (libc).
These are functions whose only task is to move parameters to the right place,
call the appropriate syscall, and return the result. Note that the kernel
and libc have different conventions for passing error information --
the kernel returns the negated error code (e.g., -EINVAL) directly from
the syscall, whereas the library functions typically return -1 on error, and
the error code is passed in the global variable errno (the error code in errno
is not negated).
The simplified version of the write syscall wrapper (omitting the vDSO, thread-local errno handling, and cancellation points) may look like this:
.global write
write:
movl $1, %eax # syscall number
syscall
cmpq $-4096, %rax # error?
jna out # if not error, then exit
neg %rax # -EINVAL -> EINVAL etc.
movl %eax, errno # set errno
movl $-1, %eax # return -1
out:
ret
Not all system calls correspond directly to library functions, for many reasons:
many syscalls have several versions with parameters of various sizes (mainly those related to uids / gids, pids, offsets in files, etc.). Older versions with smaller parameters are preserved for compatibility with older versions of libc. Examples are the syscalls getuid (16-bit uid) and getuid32 (32-bit uid) as well as lseek (32-bit offset) and _llseek (64-bit offset). The versions of a syscall that exist depend strongly on the architecture -- for example, 64-bit architectures have never had syscalls with 32-bit offsets.
some syscalls (ipc, socketcall) in fact have many subfunctions with different parameters (shmat, shmctl, msgctl, socket, connect, bind, listen, etc.). Each of these subfunctions has its own wrapper function in libc.
many syscalls have semantics modified by the thread library (more about this below)
some syscalls require special intervention from libc (vfork, clone, etc.)
because it is so
Hands-on
Compile the following program statically (-static, so we won't be spammed by the linker) and execute it under strace.
#include <sys/time.h>
#include <stdio.h>
int main()
{
struct timeval tv;
printf("123\n");
gettimeofday(&tv, NULL);
printf("%ld\n", (long)tv.tv_sec); /* tv_sec is a time_t, not an int */
}
Where did the call to gettimeofday (syscall #96 on x86_64) go?
If you redirect the output to a file/pipe, why is there only a single write?
Other useful syscalls¶
Hands-on
In contrast to the fairly simple fork()
system call, clone
has a more advanced
interface allowing fine-grained specification of information shared between the parent and child process.
Moreover, instead of returning in both processes, it calls a given function on its own stack.
Refer to the man clone
page for a list of available flags.
The code attached below presents the basic usage and how data may be shared. Review it and explain what changes when the "vm" parameter is passed.
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/wait.h>
#define STACK_SIZE 65536
int global_value = 0;
char *heap;
static int child_func(void *arg)
{
char *buf = (char *)arg;
printf("Child sees buf = %s\n", buf);
printf("Child sees global value = %d\n", global_value);
printf("Child sees heap = %s\n", heap);
strcpy(buf, "hello from child");
global_value = 10;
strcpy(heap, "bye");
return 0;
}
int main(int argc, char *argv[])
{
//Allocate stack for child task
char *stack = malloc(STACK_SIZE);
unsigned long flags = 0;
char buf[256];
int status;
if (!stack) {
perror("Failed to allocate memory\n");
exit(1);
}
heap = malloc(1024);
if (!heap) {
perror("Failed to allocate memory\n");
exit(2);
}
if (argc == 2 && !strcmp(argv[1], "vm"))
flags |= CLONE_VM;
strcpy(buf, "Hello from Parent");
strcpy(heap, "Hey");
global_value = 5;
if (clone(child_func, stack + STACK_SIZE, flags | SIGCHLD, buf) == -1) {
perror("clone");
exit(1);
}
if (wait(&status) == -1) {
perror("Wait");
exit(1);
}
printf("Child exited with status:%d\t", status);
printf("buf:%s\t global_value=%d\n",
buf, global_value);
printf("Parent heap:%s\n", heap);
return 0;
}
futex¶
Traditional methods of implementing mutexes in the userspace required, to avoid busy waiting, creating, for example, an unnamed pipe for each mutex (used to awaken waiting processes). This approach has a number of disadvantages:
mutexes are expensive: every mutex that threads may have to wait on uses two file descriptors, and the mutex structure itself must also be quite large
implementation of interprocess mutexes (e.g., in shared memory) is very heavy
The 2.6 kernel added the futex syscall (fast userspace mutex), which significantly simplifies the implementation of mutexes:
int futex (int *uaddr, int op, int val, const struct timespec *timeout, int *uaddr2, int val3);
This syscall, like socketcall, is actually a wrapper for several subfunctions (selected by the op parameter):
FUTEX_WAIT: atomically checks if *uaddr == val and falls asleep, if so. If a timeout is given, it sleeps at most for that period of time; otherwise it sleeps indefinitely.
FUTEX_WAKE: wakes up at most val processes waiting in FUTEX_WAIT on the address uaddr.
… and a few more, more complicated ones
The use of futex avoids problems with the traditional implementation:
the futex syscall is called only when the mutex is already busy. There is no constant consumption of resources: kernel resources are used only when a thread is actually waiting for the mutex to be released
the mutex is a very small structure (for the basic variant, a single int is enough)
mutexes work between processes without any special handling -- the kernel uses the physical address for comparisons and correctly handles references to the same place through different addresses in different processes, etc.
Signals¶
Signals are a mechanism for transmitting information about asynchronous or synchronous events to the user's process. This mechanism is very similar to the kernel-level interrupt mechanism.
There are approximately 32 signals with a fixed purpose (the list is to some extent dependent on the architecture), and 32 real-time signals that can be used by the user for any purpose. Important signals are:
Signals sent by the kernel caused by processor errors:
SIGSEGV: informs about violation of protection mechanisms, most often reference to a bad memory area
SIGILL: informs about the execution of an incorrect machine instruction
SIGBUS: indicates a memory access error for a reason other than an invalid address or lack of permissions. It's hard to get it on x86. On other architectures, it is often caused by, for example, access to a word at an unaligned address.
SIGFPE: floating point exception, originally reported exceptions in floating point calculations, later also used for arithmetic errors on integers (dividing by 0)
SIGTRAP: informs about hitting a breakpoint, used for debugging programs
Signals for controlling processes (sent by other processes):
SIGTERM: informs the process that it should end
SIGKILL: forcefully terminates the process
SIGSTOP: stops the process (with the possibility of continuing)
SIGCONT: continues the process
SIGCHLD: indicates the status of the child process
Signals related to terminal service:
SIGHUP: informs about disconnecting the terminal (e.g., closing the xterm window, disconnecting the ssh session)
SIGINT: indicates that Ctrl-C has been pressed
SIGQUIT: indicates that Ctrl-\ has been pressed
SIGTSTP: indicates that Ctrl-Z has been pressed
SIGTTIN: informs about trying to read from the controlling terminal without being in the foreground group
SIGTTOU: informs about trying to write to the controlling terminal without being in the foreground group
SIGWINCH: informs about the terminal size changing
Other signals:
SIGABRT: informs about the occurrence of an unforeseen error in the program (failed assert, etc.) and the necessity of its forceful closing
SIGPIPE: indicates that you have tried to write to the pipe or socket whose other end has been closed
SIGUSR1, SIGUSR2: without a predetermined purpose, intended for the user
SIGIO: informs you that an asynchronous IO operation has been finished, or that IO on a file can now be executed if the program has previously requested such information
Each signal has an assigned action that will be executed when it is delivered. This is one of:
ignoring: nothing will happen
executing a function: the function provided by the user will be called
default action, depending on the signal:
ignore (SIGCHLD, SIGWINCH)
stopping the process (SIGSTOP, SIGTSTP, SIGTTIN, SIGTTOU)
continuation of the process (SIGCONT)
killing the process (SIGTERM, SIGKILL, SIGINT, …)
killing the process with a core dump (SIGSEGV, SIGQUIT, …)
The action performed on most signals can be changed by the syscalls signal (simple interface, but limited functionality) or sigaction (much greater functionality).
Signals whose actions cannot be changed are SIGKILL and SIGSTOP. In addition, although the SIGCONT action can be changed, it will always continue the process in addition to calling the assigned action.
In addition to changing the action assigned to signals, you can also block their delivery with the sigprocmask syscall. A blocked signal is not the same as an ignored signal -- an ignored signal will be discarded while a blocked signal will be waiting in the queue until it is unblocked.
In addition to the signals sent by the kernel, each signal can also be manually sent by the user. Signals can be sent to entire processes (or groups of processes) via the kill syscall, or to individual threads inside your own process by the tkill syscall (or the corresponding function pthread_kill). Sending a signal to a process results in delivering it to an arbitrarily selected thread in the process.
Signal handling¶
The signal handling functions are set by the sigaction syscall and have one of the following types:
void func (int signum)
void func (int signum, siginfo_t *info, void *ctx)
signum is the number of the incoming signal. When the second type is used (selected with the SA_SIGINFO flag), info is a structure containing information about the signal -- e.g., its source (processor error, sent by the user, terminal, etc.) and details (error address in the case of SIGSEGV, pid of the sender, etc.). ctx is declared as void * and points to a ucontext_t structure containing the full state of the processor registers before entering the signal handler.
When a signal arrives with a handler function assigned, the kernel interrupts the process. If the process is executing a blocking system call and is in a state of interruptible sleep, the system call is interrupted. Depending on the semantics selected when setting the handler function (the SA_RESTART flag of sigaction), the call either terminates and returns the error code EINTR, or it is restarted when returning from the signal handler.
After interrupting the process and possibly a syscall, the kernel writes the state of the processor registers to the user stack (usually it is a "normal" stack of the program, but it is possible to set a separate stack for signals with the sigaltstack syscall).
After saving the state of registers, the kernel writes to the stack (or registers) the parameters of the signal handler and the return address. Because of the need to clean up after the signal handler and restore the exact state of the processor before it was called, this return address points to a special sigreturn pseudo function, which is part of the VDSO / vsyscall block. Upon returning from the signal handler, this function calls the sigreturn syscall, which deals with proper cleanup.
Simply returning from the signal handler is not the only way to leave it --
sometimes it is useful to use the siglongjmp
call or to use information
from the ucontext
structure to unwind the stack.
Signal handling is a very delicate mechanism, because it is very difficult to control where the signal will interrupt the program (for example, we cannot guarantee that we can use malloc while handling the signal -- this signal could have interrupted a malloc call from the main program). For this reason, the signal handling functions are often limited to setting a single variable, which is regularly checked by the main program.
The functions which are safe to call in the asynchronous signal context are listed in man signal-safety.
Small task #3¶
As you scroll through the signal-safe libc functions, you may notice that write is there, but no int-to-string conversion function is present. What if we wanted to print the signal number to standard output?
As the old interface provides no way to pass extra per-signal information to the handler, let's pass it ourselves by generating a trampoline (thunk) to do the work.
Trampolines¶
Trampolines are small pieces of code whose only task is to transfer control to another location, perhaps modifying the process context in some way or passing additional parameters.
An interesting example of using trampolines is the FFI (foreign function interface) libraries for scripting languages. Let's assume that an API written in C, which we want to use with Python / Lua / etc, requires passing a pointer to a function that it will call in a given situation (a so-called callback). Because our program is written in an interpreted dynamic language, the FFI library supporting callbacks must support the dynamic generation of functions callable from C. To this end, the library generates a trampoline, which will save the appropriate registers containing parameters, write the trampoline identifier (to select the appropriate high-level function to invoke), and then call the FFI library function that starts the dynamic language interpreter with the appropriate parameters. Thanks to such a solution, we can generate any number of functions for callbacks at runtime and pass pointers to them to C functions.
Task¶
Forwarding the signal handling logic to a dynamic language is one example of such an API, but we are going to start with a simplified task here.
Write an implementation of the make_signal_handler
function:
typedef void (*sighandler_t)(int);
sighandler_t make_signal_handler(int signum);
The make_signal_handler
function should set a new signal handler using the signal
libc function
and return the previous handler.
As a new handler, you need to generate a pointer to a dynamically created printing function
which would print the signal number using
write(1, ...)
while ignoring the input signum
argument.
For instance:
int main() {
sighandler_t old_2 = make_signal_handler(2);
sighandler_t old_12 = make_signal_handler(12);
for (;;) pause();
}
should be equivalent to:
void write2(int signum) {
write(1, "2", 1);
}
void write12(int signum) {
write(1, "12", 2);
}
int main() {
sighandler_t old_2 = signal(2, write2);
sighandler_t old_12 = signal(12, write12);
for (;;) pause();
}
Important
Such a handler could have just used a global strings table indexed by the signal number to perform the required task. However, such a solution would not take us closer to implementing a trampoline...
Note
The signal
libc function has different semantics (BSD) from the Linux signal
syscall.
It uses sigaction
internally.
Hints¶
Compile the following code (with the -no-pie -mcmodel=large -fno-pie options) and disassemble the resulting .o file:
#include <unistd.h> /* write() */
void writer(int signum) {
    write(1, "12", 2);
}
From the compiled machine code, make a template that will be filled with
a pointer to the appropriate string and its length by make_signal_handler
.
This example program shows how to dynamically create an executable function:
#include <stdio.h>
#include <stdint.h>
#include <unistd.h>
#include <string.h>
#include <sys/mman.h> /* mmap(), mprotect() */
static uint8_t code[] = {
0xB8,0xFF,0x00,0x00,0x00, /* mov eax,0xFF */
0xC3, /* ret */
};
int main(void)
{
const size_t len = sizeof(code);
/* mmap a region for our code */
void *p = mmap(NULL, len, PROT_READ|PROT_WRITE, /* No PROT_EXEC */
MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
if (p==MAP_FAILED) {
fprintf(stderr, "mmap() failed\n");
return 2;
}
/* Copy it in (still not executable) */
memcpy(p, code, len);
/* Now make it execute-only */
if (mprotect(p, len, PROT_EXEC) < 0) {
fprintf(stderr, "mprotect failed to mark exec-only\n");
return 2;
}
/* Go! */
int (*func)(void) = (int(*)())p;
printf("(dynamic) code returned %d\n", func());
pause();
return 0;
}
Extra Topics¶
If you want to extend your knowledge beyond the scope of this course, consider:
Reading¶
Manual section 2, in particular: syscall, futex, sigaction
VDSO/vsyscall sources in arch/x86/vdso
Syscall list in asm/unistd*.h
Ulrich Drepper "Futexes Are Tricky", 2011 - http://www.akkadia.org/drepper/futex.pdf
man 7 signal, man 7 signal-safety
Implementing¶
Some ideas to implement:
- A warm shutdown on Ctrl-C
A warm shutdown of a system allows finishing all currently performed tasks and leaving external resources in a proper state. Typically, hitting Ctrl-C again transitions to a cold shutdown: the process should end as fast as possible without corrupting the external state.
Extend the code below with support for warm shutdown. Subsequent SIGINT signals should:
Print a message, break the loop, flush libc buffers and flush all data to disk.
Print a message, flush the file to disk (ignore libc buffers) and exit (use siglongjmp).
Kill the program.
Be careful to read about the requirements of signal safety (especially for siglongjmp), and restore the original handler after the cancellable block.
- Code
#include <unistd.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <err.h>

void calculator(FILE* fd) {
    int iters = 0;
    volatile long my_something = 3;
    while (iters < 1000) {
        for (int i = 0; i < 1000000000; ++i) {
            my_something = my_something * 3 + 1; // digits of PI...
        }
        if (fwrite((void*)&my_something, sizeof(my_something), 1, fd) != 1) {
            perror("Cannot write results");
            break;
        }
        if (iters % 10 == 0) {
            fflush(fd);
        }
        putchar('.');
        fflush(stdout);
        ++iters; // count the finished iteration so that the loop terminates
    }
    // Make sure we have our important computation on disk!
    fflush(fd);
    fsync(fileno(fd));
}

void cancellable(const char* filename) {
    FILE* out = fopen(filename, "w");
    if (out == NULL) {
        err(1, "Cannot open %s:", filename);
    }
    if (false /* long jump */) {
        // We're inside the long jump here.
        // Calling non-async-safe functions is illegal,
        // or even returning from main.
        // Store to disk what was written.
        // sleep(5); to test the third case.
    }
    calculator(out);
    if (fclose(out) == EOF) {
        err(3, "Cannot close %s:", filename);
    }
}

int main() {
    cancellable("powerbytes");
    puts("We're done!");
}
- A fork() server
A forkserver is a technique of launching new processes of an interpreted language without much of the overhead of initializing the interpreter again. A naive use of fork() has many issues, especially with multithreaded programs, such as inheriting locked mutexes.
A forkserver is a single-threaded server process (possibly with preloaded modules) that can fork itself on request. As the server is started early or is an independent process, no unnecessary resources would be inherited. For instance, you can make the multiprocessing Python module use forkserver for starting new processes.
Play with arguments to clone and mmap to support proper data sharing. Use unnamed pipes / sockets to communicate between processes: the main process should have a socket interface for requests to the forkserver, which should enable communication with the newly spawned process through another pipe. The requester would then use that pipe to communicate work that needs to be done.
The main trick is to use SCM_RIGHTS ancillary messages on Unix sockets to send ends (file descriptors) of pipes used for communication with the new process.