Class 14: BPF tracing

Date: 3.06.2025

Introduction

Previous classes on BPF focused on demonstrating a multitude of tools that can be used to write BPF programs. This time, we’ll look at how to use them in practice to diagnose performance issues in individual programs and on entire machines. Out of necessity, we’ll cover only selected aspects. Excellent sources of information on this topic are Brendan Gregg’s books BPF Performance Tools and Systems Performance. His website, https://www.brendangregg.com/, also links to materials from many conferences and articles.

BPF is a great tool for analyzing problems at the point of occurrence. It usually puts little load on the machine, allowing analysis in a production environment, where problems are typically easiest to reproduce.

We’ll need a virtual machine with a kernel that supports BPF. At the beginning of the scenario for Class 6: BPF, there is a description of how to run such a machine. This time we’ll use a slightly different set of packages, which can be installed as follows:

apt install bpftrace strace iperf

And a program available from the following repository:

git clone https://github.com/brendangregg/FlameGraph.git

We will be running all the commands as root again.

bpftrace

We’ll be making changes to programs created using the bpftrace tool. We’ve used the tool before; now let’s look more closely at its syntax.

An example shown in class 6 was:

bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'

bpftrace loads and executes a program. The -e parameter means that the program’s content is the next argument. Instead of -e, you can provide the name of a file containing the program or use -, which causes the program text to be read from standard input. If the program is in a file, you can write at the top:

#!/usr/bin/bpftrace

Then make the file executable and run it by name.
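For instance, a minimal standalone script (here under the hypothetical name sleepers.bt, reusing the example above) could look like this:

```
#!/usr/bin/bpftrace
// Print a line every time some task enters do_nanosleep.
kprobe:do_nanosleep
{
	printf("PID %d sleeping...\n", pid);
}
```

Then: chmod +x sleepers.bt && ./sleepers.bt (as root).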

Additional parameters are described in the manual page: man bpftrace. That page also includes a description of the language, with the most important elements discussed below.

A file consists of a preamble with type definitions and included files, plus at least one action block. The syntax of such a block is:

probe[,probe]
/predicate/ {
  action
}

Here, probe is the name of the event(s) the program hooks into, predicate is an optional condition that must be met for the action to run, and action is the program itself.

The program syntax is inspired by the D language, which is similar to C. You can use variables — temporary variables' names must start with $, and map names (see class 6) with @. You have access to conditional expressions, loops (regular while, and for to iterate over all map elements), and arithmetic/logical operators. Over 20 built-in symbols and dozens of functions are available.
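As a quick sketch of these elements, the following one-liner (an illustrative example; requires root) combines a predicate, a temporary variable, and a map:

```
# Count vfs_read calls per PID, but only for processes named "bash".
bpftrace -e 'kprobe:vfs_read /comm == "bash"/ { $p = pid; @reads[$p] = count(); }'
```

The predicate /comm == "bash"/ filters events, $p is a temporary variable, and @reads is a map that bpftrace prints automatically on exit.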

Analyzing the example above, we attached our program to the do_nanosleep kernel function using a kprobe, and for every call, we printed information about the process ID that invoked it. pid is a built-in bpftrace symbol, and printf is a built-in function.

Another example from the manual is:

BEGIN {
        printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
}
tracepoint:syscalls:sys_enter_open,
tracepoint:syscalls:sys_enter_openat {
        printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}

Here we start with a block using a special BEGIN probe. This block runs only at program start. (There are also END blocks that run at the end.) The second block uses two probes to trace system call invocations using the tracepoint mechanism. In the action, we use the built-in symbols pid, comm, and args, and the functions printf and str. The action runs whenever either syscall is invoked, since there is no predicate. comm is the process name, and args, which is unfortunately unavailable in some probe types, is a data structure containing the probe's arguments.

You can list available probes with bpftrace -l, and get details with bpftrace -lv, e.g.:

bpftrace -lv tracepoint:syscalls:sys_enter_open

This shows that args has a filename field, which we use.
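On a typical kernel the output looks roughly like this (the exact field list may vary between kernel versions):

```
tracepoint:syscalls:sys_enter_open
    int __syscall_nr
    const char * filename
    int flags
    umode_t mode
```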

Programs often follow a pattern: data is collected in the regular blocks and stored in maps, and before termination (in an END block) the results are printed, while helper maps are cleared to prevent bpftrace from dumping them automatically. Other programs print data continuously during execution.
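A minimal sketch of this pattern might count openat calls per process name and print a summary only at exit; the clear() prevents bpftrace from automatically dumping the map a second time:

```
tracepoint:syscalls:sys_enter_openat
{
	@opens[comm] = count();
}

END
{
	print(@opens);
	clear(@opens);
}
```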

More language details, including a useful list of built-in symbols and functions, and probe types, can be found in the bpftrace manual page.

Hands-on

Modify the program above to print information only about files opened by the ls program, and only those files whose names contain /bin. You may use predicates or if statements.

The program from lab 6 used kprobe, while this one uses tracepoint. What’s the difference? Modify the program to use kprobe instead. Optionally, also make a version using fentry or kfunc.

CPU

There are many reasons why a program might run slowly—for example, it may use a non-optimal algorithm. It’s worth starting troubleshooting by checking whether the CPU is actually busy. You can do this with the top program. If it shows that the CPU is used most of the time, it's time to try diagnosing further.

Hands-on

A common scenario is having too many tasks running on a machine: the CPU is under load and the run queues are long. You can check them with the command:

runqlen.bt

After stopping, it shows a histogram indicating what portion of samples fell into which bucket. On an idle machine, queues are short, and most samples will indicate a queue length of 0. Let’s try to put some load on the machine. In a second terminal, run the iperf server:

iperf -s

And in a third terminal, the client:

iperf -c localhost

Before the client finishes, check the queue length again; it should be a bit higher. In this case, the program ran single-threaded, so on a multi-core system the client and server could each use a separate CPU. Let’s try a multi-threaded version:

iperf -c localhost -P 64

What does runqlen.bt show now? And what about 1024 threads (-P 1024)? How does the throughput change?

Let’s see how the runqlen.bt program works:

BEGIN
{
        printf("Sampling run queue length at 99 Hertz... Hit Ctrl-C to end.\n");
}

profile:hz:99
{
        $task = (struct task_struct *)curtask;
        $my_q = (struct cfs_rq *)$task->se.cfs_rq;
        $len = $my_q->nr_running;
        $len = $len > 0 ? $len - 1 : 0; // subtract currently running task
        @runqlen = lhist($len, 0, 100, 1);
}

In the version of the script shipped on disk, the cfs_rq structure is defined explicitly, but since our kernel supports BTF, this isn’t needed. BTF (BPF Type Format) is, generally, a format for metadata associated with BPF programs; here it serves as the kernel’s way of providing type information. You can view the cfs_rq structure's fields with:

bpftrace -lv 'struct cfs_rq'

This time, BEGIN is not very interesting. What's new is the profile probe. profile:hz:99 means the code runs 99 times per second. Instead of hz, you can use us, ms, s. This differs from earlier probes which responded to specific events — this one is triggered by the passage of time and runs asynchronously relative to code execution. Crucially, a copy of the BPF code runs on each CPU—so in an n-CPU system, the code executes n times per tick.

Now, think about the following question: why 99 Hz instead of 100?

The program inspects the current task’s data (task_struct) and, via the se field of type sched_entity, accesses the scheduler information (if CFS is used). The cfs_rq field holds nr_running, which, somewhat unintuitively, is the number of runnable processes on that specific CPU. The queue length is of course one less than that.

The final line is key: it builds a histogram in the @runqlen map, from 0 to 100 with step 1. As we saw with 1024 threads, values beyond the range are bucketed together. The @name syntax marks global maps; temporary local variables start with $.

Hands-on

Modify the program to show statistics separately for each CPU.

It’s worth taking a moment to consider concurrency issues. In the program above, the code runs concurrently on all CPUs and updates shared map values. Does this mean bpftrace ensures correctness of such updates? Unfortunately, no — or at least, not always. Currently, BPF doesn't offer synchronization primitives like spinlocks for tracing programs. Consider this example:

bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }'

Run it while cat /dev/zero > /dev/null is executing in two other terminals.

The count() call tallies invocations, and @mapa2 is incremented starting from 0. It seems both values should be equal, but if we wait long enough, the first will be higher.

bpftrace uses a clever trick: when it can, it allocates per-CPU maps, with separate data for each CPU. This avoids concurrency issues at the cost of more memory and a summation step at the end. count() can use per-CPU maps, but ++ cannot; it performs a classic read-modify-write and is therefore subject to race conditions.

How to check what kind of map is used? Use strace:

strace -febpf bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }' 2>&1 | grep mapa

You’ll see that @mapa1 is BPF_MAP_TYPE_PERCPU_ARRAY, and @mapa2 is BPF_MAP_TYPE_HASH. Fortunately, small inconsistencies rarely hinder diagnosing performance problems.

If queues are long, a good solution is reducing the number of active processes—by changing program settings or offloading work to another machine.

Flamegraphs

But what if the program is slow and queues are short — i.e., the issue lies within the program itself? It's worth identifying where time is spent. If we want to see what the system is doing, we can use an unexpectedly simple program:

profile:hz:99 { @[kstack()] = count(); }

kstack returns a string representing the kernel stack. We index the map with these strings, counting occurrences. Sounds odd — since the system is always doing something, won't every stack be different?

It turns out that when little is happening, the same stacks repeat frequently. If you stop collection quickly, there won’t even be many unique stacks; the whole output might be only a few hundred bytes.
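To capture such a profile into a file for later processing, you can, for example, redirect the output (requires root; stop with Ctrl-C after a few seconds, and the map is dumped on exit):

```
bpftrace -e 'profile:hz:99 { @[kstack()] = count(); }' > tracedata.bpftrace
```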

Hands-on

In a second console, start compiling the kernel with many threads (e.g., make -j 8). Rerun the stack-dumping program and stop it after a few seconds. What does the file contain now?

Now there are so many different frames that it's hard to make sense of it. Let's use a visualization tool to see where the system spends time. A popular type of graph for this purpose is the flamegraph. We've already downloaded tools to make one.

Hands-on

First, we need to collapse the stack dump from bpftrace:

cd FlameGraph
./stackcollapse-bpftrace.pl tracedata.bpftrace > tracedata.collapsed

Where tracedata.bpftrace is your generated trace file. Then create the graph:

./flamegraph.pl tracedata.collapsed > /hshare/flamegraph.svg

Since the result is an image, place it somewhere accessible outside the VM (like a shared folder), so it’s easy to view.

What does the system do during compilation?

Open the graph to examine it; a web browser is a convenient viewer. Hover over a frame to see details, and click a frame to zoom in.

The graph shows function calls from bottom to top, as in stack traces. The width of each bar shows how often that function appeared in the collected stacks; the widest bars mark the functions where most samples landed, and those are the ones worth checking against expectations.

To analyze your own program, it may be more useful to view the user-mode stack; just use ustack() instead of kstack(). To resolve symbols, the program binary should have a symbol table; otherwise, you’ll only see addresses. Also note that without extra tools, this kind of profiling is less useful for programs written in languages that run on their own virtual machine. Optimizations like inlining can also make stack traces misleading. Fortunately, there are many good profilers for user-mode programs, while bpftrace excels at analyzing kernel behavior.
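For instance, a sketch of profiling a single user-mode process might look like this (1234 is a placeholder for the target PID; if the binary was built without frame pointers, the stacks may come out truncated):

```
bpftrace -e 'profile:hz:99 /pid == 1234/ { @[ustack()] = count(); }'
```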

Further Reading

This short presentation doesn’t exhaust the topic of system monitoring using BPF or other tools. For those interested, we recommend the aforementioned books by Brendan Gregg, which cover performance analysis in great detail.