.. _lab-bpf-tracing-en:

=====================
Class 14: BPF tracing
=====================

Date: 3.06.2025

Introduction
============

Previous classes on BPF focused on demonstrating the multitude of tools that
can be used to write BPF programs. This time, we'll look at how to use them in
practice to diagnose performance issues in particular programs and on entire
machines. Out of necessity, we'll only cover selected aspects. Excellent
sources of information on this topic are Brendan Gregg's books *BPF
Performance Tools* and *Systems Performance*. His website, available at
https://www.brendangregg.com/, also contains links to articles and materials
from many conferences.

BPF is a great tool for analyzing problems at the point where they occur. It
usually doesn't put much load on the machine, which allows analysis in a
production environment, where reproducing the problems encountered is usually
easiest.

We'll need a virtual machine with a kernel that supports BPF. The beginning of
the scenario for :ref:`lab-bpf-en` describes how to run such a machine. This
time we'll use a slightly different set of packages, which can be installed as
follows::

   apt install bpftrace strace iperf

We'll also use a program available from the following repository::

   git clone https://github.com/brendangregg/FlameGraph.git

We will again be running all the commands as root.

bpftrace
========

We'll be modifying programs written for the bpftrace tool. We've used the tool
before; now let's look more closely at its syntax. An example shown in class 6
was::

   bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'

bpftrace loads and executes a program. The ``-e`` parameter means that the
program's content is the next argument. Instead of ``-e``, you can provide the
name of a file containing the program, or use ``-``, which causes the program
text to be read from standard input. If the program is in a file, you can
write at the top::

   #!/usr/bin/bpftrace

Then make the file executable and run it by name. Additional parameters are
described in the manual page: ``man bpftrace``. That page also includes a
description of the language; the most important elements are discussed below.

A file consists of a preamble with type definitions and included files, plus
at least one action block. The syntax of such a block is::

   probe[,probe] /predicate/ { action }

Here, ``probe`` is the name of the event (or events) the program hooks into,
``predicate`` is an optional condition that must be met for the action to run,
and ``action`` is the program itself. The program syntax is inspired by the D
language, which is similar to C. You can use variables: names of temporary
variables must start with ``$``, and names of maps (see class 6) with ``@``.
You have access to conditional expressions, loops (a regular ``while``, and
``for`` to iterate over all elements of a map), and arithmetic and logical
operators. Over 20 built-in symbols and dozens of functions are available.

Analyzing the example above: we attached our program to the ``do_nanosleep``
kernel function using a ``kprobe``, and for every call we printed the ID of
the process that invoked it. ``pid`` is a built-in bpftrace symbol, and
``printf`` is a built-in function.
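To tie these pieces together, here is a minimal stand-alone script (a sketch
of our own, not part of the scenario's toolset) combining a shebang line, a
probe with a predicate, and a map; the ``/pid > 1000/`` condition and the
``@sleeps`` map name are arbitrary examples::

   #!/usr/bin/bpftrace

   // Count nanosleep calls per process name, but only for PIDs above 1000
   // (an arbitrary threshold, used purely to demonstrate the predicate syntax).
   kprobe:do_nanosleep
   /pid > 1000/
   {
       @sleeps[comm] = count();
   }

Save it to a file, make it executable with ``chmod +x``, and run it; when you
press Ctrl-C, bpftrace dumps the contents of the ``@sleeps`` map automatically.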
Another example from the manual is::

   BEGIN
   {
       printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
   }

   tracepoint:syscalls:sys_enter_open,
   tracepoint:syscalls:sys_enter_openat
   {
       printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
   }

Here we start with a block using the special ``BEGIN`` probe. This block runs
only once, at program start. (There are also ``END`` blocks, which run at the
end.) The second block uses two probes to trace system call invocations via
the ``tracepoint`` mechanism. In the action we use the built-in symbols
``pid``, ``comm`` and ``args``, and the functions ``printf`` and ``str``.
There is no predicate, so the action runs on every invocation of either
syscall. ``comm`` is the process name, and ``args``, which is unfortunately
unavailable in some probe types, is a data structure containing the probe's
arguments.

You can list available probes with ``bpftrace -l`` and get details with
``bpftrace -lv``, e.g.::

   bpftrace -lv tracepoint:syscalls:sys_enter_open

This shows that ``args`` has a ``filename`` field, which we use.

Programs often follow a pattern: data is collected in regular blocks and
stored in maps, and before the program terminates (in an ``END`` block) the
results are printed, or maps that are no longer needed are cleared to prevent
them from being dumped automatically. Other programs print data as they run.
More details about the language, including a useful list of built-in symbols
and functions and of probe types, can be found in the bpftrace manual page.

.. admonition:: Hands-on

   Modify the program above to print information only about files opened by
   the ``ls`` program, and only those whose name contains ``/bin``. You may
   use predicates or ``if`` statements.

   The program from lab 6 used ``kprobe``, while this one uses ``tracepoint``.
   What's the difference? Modify the program to use ``kprobe`` instead.
   Optionally, also make a version using ``fentry`` or ``kfunc``.

.. notes ::

   Hint (if no one figures it out): uptr will be useful.

CPU
===

There are many reasons why a program might run slowly; for example, it may use
a non-optimal algorithm. It's worth starting the diagnosis by checking whether
the CPU is actually busy. You can do this with the ``top`` program. If it
shows that the CPU is busy most of the time, it's time to dig further.

.. admonition:: Hands-on

   A common scenario is simply having too many tasks running on a machine. In
   that case the CPU is under load and the run queues are long. You can check
   them with the command::

      runqlen.bt

   After it is stopped, it prints a histogram showing what portion of the
   samples fell into which bucket. On an idle machine the queues are short,
   and most samples will indicate a queue length of 0.

   Let's try to put some load on the machine. In a second terminal, run the
   ``iperf`` server::

      iperf -s

   And in a third terminal, the client::

      iperf -c localhost

   While the client is running, check the queue length again; it should be a
   bit higher. In this case the program ran single-threaded, so on a
   multi-core system the client and the server could each use a separate CPU.
   Let's try a multi-threaded version::

      iperf -c localhost -P 64

   What does ``runqlen.bt`` show now? And what about 1024 threads
   (``-P 1024``)? How does the throughput change?

Let's see how the ``runqlen.bt`` program works::

   BEGIN
   {
       printf("Sampling run queue length at 99 Hertz... Hit Ctrl-C to end.\n");
   }

   profile:hz:99
   {
       $task = (struct task_struct *)curtask;
       $my_q = (struct cfs_rq *)$task->se.cfs_rq;
       $len = $my_q->nr_running;
       $len = $len > 0 ? $len - 1 : 0; // subtract currently running task
       @runqlen = lhist($len, 0, 100, 1);
   }

In the version shipped on disk, the ``cfs_rq`` structure is defined
explicitly, but since our kernel supports BTF, this is not needed. BTF, the
BPF Type Format, is in general a format for data associated with BPF programs;
here it is a way for the kernel to provide type information. You can view the
fields of the ``cfs_rq`` structure with::

   bpftrace -lv 'struct cfs_rq'
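As a side note (not part of the original scenario): a quick way to check
whether the running kernel exposes BTF type information is to look for the
file ``/sys/kernel/btf/vmlinux``, and the same listing syntax should work for
the other kernel structure used in this program::

   # If this file exists, the kernel was built with BTF and bpftrace can
   # resolve kernel types without extra struct definitions.
   ls -l /sys/kernel/btf/vmlinux

   # List the fields of another kernel structure the same way.
   bpftrace -lv 'struct task_struct'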
This time the ``BEGIN`` block is not very interesting. What's new is the
``profile`` probe. ``profile:hz:99`` means the code runs 99 times per second.
Instead of ``hz``, you can use ``us``, ``ms`` or ``s``. This differs from the
earlier probes, which reacted to specific events: this one is triggered by the
passage of time and runs asynchronously with respect to the traced code.
Crucially, a copy of the BPF code runs on each CPU, so on an n-CPU system the
code executes n times per tick.

Now, think about the following question: why 99 Hz instead of 100?

.. notes ::

   Because 100 is a round number, so the sampling would be more likely to stay
   in lockstep with something else that also triggers 100 times per second.
   It's not a big deal here, but it would matter more when collecting stack
   traces.

The program inspects the current task's data (``task_struct``) and, via the
``se`` field of type ``sched_entity``, accesses the scheduler information (if
CFS is used). The ``cfs_rq`` field holds ``nr_running``, which, somewhat
unintuitively, is the number of runnable processes on that specific CPU. The
queue length is of course one less than that. The final line is the key one:
it builds a histogram in the ``runqlen`` map, with buckets from 0 to 100 and
step 1. As we saw with 1024 threads, values beyond the range end up bucketed
together. The ``@name`` syntax marks global maps; temporary local variables
start with ``$``.

.. admonition:: Hands-on

   Modify the program to show statistics separately for each CPU.

It's worth taking a moment to consider concurrency issues. In the program
above, the code runs concurrently on all CPUs and updates values in a shared
map. Does this mean that bpftrace ensures the correctness of such updates?
Unfortunately no, or at least not always. Currently, BPF doesn't offer
synchronization primitives such as spinlocks to tracing programs. Consider
this example::

   bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }'

while simultaneously running ``cat /dev/zero > /dev/null`` in two terminals.
The ``count()`` call counts invocations, and ``@mapa2`` starts at 0. It looks
like both values should be equal, but if we wait long enough, the first one
will be higher. bpftrace uses a clever trick: when it can, it allocates
per-CPU maps, with separate data for each CPU. This avoids concurrency issues,
at the cost of summing the values at the end and using more memory.
``count()`` allows the use of per-CPU maps, but ``++`` does not; it introduces
a classic read-modify-write race condition. How can you check what kind of map
is used? Use ``strace``::

   strace -febpf bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }' 2>&1 | grep mapa

You'll see that ``@mapa1`` is a ``BPF_MAP_TYPE_PERCPU_ARRAY``, and ``@mapa2``
is a ``BPF_MAP_TYPE_HASH``. Fortunately, small inconsistencies rarely get in
the way of diagnosing performance problems.

If the queues are long, a good solution is to reduce the number of active
processes, either by changing the program's settings or by offloading some
work to another machine.
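Before moving on, here is a small variation (a sketch of our own, not one of
the shipped tools) illustrating the "print data as the program runs" pattern
mentioned earlier: the same sampling block as in ``runqlen.bt``, plus an
``interval`` probe that prints and resets the histogram once per second::

   #!/usr/bin/bpftrace

   // Sample the CFS run queue length on every CPU, 99 times per second,
   // exactly as runqlen.bt does.
   profile:hz:99
   {
       $task = (struct task_struct *)curtask;
       $len = ((struct cfs_rq *)$task->se.cfs_rq)->nr_running;
       @runqlen = lhist($len > 0 ? $len - 1 : 0, 0, 100, 1);
   }

   // Unlike profile, an interval probe fires on a single CPU; here it is used
   // only to print the histogram periodically and start collecting afresh.
   interval:s:1
   {
       print(@runqlen);
       clear(@runqlen);
   }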
Flamegraphs
===========

But what if the program is slow and the queues are short, i.e. the issue lies
within the program itself? Then it's worth identifying where the time is
spent. If we want to see what the system is doing, we can use a surprisingly
simple program::

   profile:hz:99 { @[kstack()] = count(); }

``kstack`` returns a string representing the kernel stack. We index the map
with these strings, counting how often each one occurs. Sounds odd: since the
system is always doing something, won't every stack be different?

.. admonition:: Hands-on

   Run the above program on an idle system, redirecting the output to a file.
   Interrupt it after a few seconds and inspect the file.

It turns out that when little is happening, many stack traces repeat
frequently. If you stopped the collection quickly, there won't even be that
many unique traces; the whole file may be only a few hundred bytes.

.. admonition:: Hands-on

   In a second console, start compiling the kernel with many threads (e.g.
   ``make -j 8``). Rerun the stack-dumping program and stop it after a few
   seconds. What does the file contain now?

Now there are so many different stack traces that it's hard to make sense of
them. Let's use a visualization tool to see where the system spends its time.
A popular type of graph for this purpose is the flamegraph. We've already
downloaded the tools needed to make one.

.. admonition:: Hands-on

   First, we need to collapse the stack dump produced by ``bpftrace``::

      cd FlameGraph
      ./stackcollapse-bpftrace.pl tracedata.bpftrace > tracedata.collapsed

   where ``tracedata.bpftrace`` is the trace file you generated. Then create
   the graph::

      ./flamegraph.pl tracedata.collapsed > /hshare/flamegraph.svg

   Since the result is an image, place it somewhere accessible from outside
   the VM (such as a shared folder), so it's easy to view. What does the
   system do during compilation?

Open the graph to examine it; a convenient way is to use a web browser, where
you can move the mouse around to inspect individual entries. The graph shows
function calls from bottom to top, as seen in stack traces. The width of each
bar shows how often that function appeared in the collected stack dumps. Wider
bars indicate functions that appear more frequently on the stack; these are
the functions worth checking to see whether their presence is expected.

To analyze your own program, it may be more useful to look at the user-mode
stack: just use ``ustack()`` instead of ``kstack()``. To resolve symbols, the
program binary should have a symbol table; otherwise you'll only see
addresses. Also note that without extra tools this kind of profiling is less
useful for programs written in languages that run on their own virtual
machine. Optimizations such as inlining can also make stack traces misleading.
Fortunately, there are many good profilers for user-mode programs, while
bpftrace shines at analyzing kernel behavior.

Further Reading
===============

This short presentation doesn't exhaust the topic of system monitoring with
BPF or other tools. Interested readers are referred to the aforementioned
books by Brendan Gregg, which cover performance analysis in great detail.