.. _lab-bpf-tracing-en:

=====================
Class 14: BPF tracing
=====================

Date: 3.06.2025

Introduction
============

Previous classes on BPF focused on demonstrating the multitude of tools that
can be used to write BPF programs. This time, we'll look at how to use them in
practice to diagnose performance issues in particular programs and on entire
machines. Out of necessity, we'll only cover selected aspects. Excellent
sources of information on this topic are Brendan Gregg's books *BPF
Performance Tools* and *Systems Performance*. His website, available at
https://www.brendangregg.com/, also contains links to articles and materials
from many conferences.

BPF is a great tool for analyzing problems at the point where they occur. It
usually doesn't put much load on the machine, which allows analysis in a
production environment, where reproducing the problems encountered is usually
easiest.

We'll need a virtual machine with a kernel that supports BPF. The beginning of
the scenario for :ref:`lab-bpf-en` describes how to run such a machine. This
time we'll use a slightly different set of packages, which can be installed as
follows::

   apt install bpftrace strace iperf

We'll also use a program available from the following repository::

   git clone https://github.com/brendangregg/FlameGraph.git

We will again be running all the commands as root.

bpftrace
========

We'll be modifying programs written for the bpftrace tool. We've used the tool
before; now let's look more closely at its syntax. An example shown in class 6
was::

   bpftrace -e 'kprobe:do_nanosleep { printf("PID %d sleeping...\n", pid); }'

bpftrace loads and executes a program. The ``-e`` parameter means that the
program's content is the next argument. Instead of ``-e``, you can provide the
name of a file containing the program, or use ``-``, which causes the program
text to be read from standard input. If the program is in a file, you can
write at the top::

   #!/usr/bin/bpftrace

Then make the file executable and run it by name. Additional parameters are
described in the manual page: ``man bpftrace``. That page also includes a
description of the language; the most important elements are discussed below.

A file consists of a preamble with type definitions and included files, plus
at least one action block. The syntax of such a block is::

   probe[,probe] /predicate/ { action }

Here, ``probe`` is the name of the event (or events) the program hooks into,
``predicate`` is an optional condition that must be met for the action to run,
and ``action`` is the program itself. The program syntax is inspired by the D
language, which is similar to C. You can use variables: names of temporary
variables must start with ``$``, and names of maps (see class 6) with ``@``.
You have access to conditional expressions, loops (a regular ``while``, and
``for`` to iterate over all elements of a map), and arithmetic and logical
operators. Over 20 built-in symbols and dozens of functions are available.

Analyzing the example above: we attached our program to the ``do_nanosleep``
kernel function using a ``kprobe``, and for every call we printed the ID of
the process that invoked it. ``pid`` is a built-in bpftrace symbol, and
``printf`` is a built-in function.
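To tie these pieces together, here is a minimal stand-alone script (a sketch
of our own, not part of the scenario's toolset) combining a shebang line, a
probe with a predicate, and a map; the ``/pid > 1000/`` condition and the
``@sleeps`` map name are arbitrary examples::

   #!/usr/bin/bpftrace

   // Count nanosleep calls per process name, but only for PIDs above 1000
   // (an arbitrary threshold, used purely to demonstrate the predicate syntax).
   kprobe:do_nanosleep
   /pid > 1000/
   {
       @sleeps[comm] = count();
   }

Save it to a file, make it executable with ``chmod +x``, and run it; when you
press Ctrl-C, bpftrace dumps the contents of the ``@sleeps`` map automatically.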
Another example from the manual is::

   BEGIN
   {
       printf("Tracing open syscalls... Hit Ctrl-C to end.\n");
   }

   tracepoint:syscalls:sys_enter_open,
   tracepoint:syscalls:sys_enter_openat
   {
       printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
   }

Here we start with a block using the special ``BEGIN`` probe. This block runs
only once, at program start. (There are also ``END`` blocks, which run at the
end.) The second block uses two probes to trace system call invocations via
the ``tracepoint`` mechanism. In the action we use the built-in symbols
``pid``, ``comm`` and ``args``, and the functions ``printf`` and ``str``.
There is no predicate, so the action runs on every invocation of either
syscall. ``comm`` is the process name, and ``args``, which is unfortunately
unavailable in some probe types, is a data structure containing the probe's
arguments.

You can list available probes with ``bpftrace -l`` and get details with
``bpftrace -lv``, e.g.::

   bpftrace -lv tracepoint:syscalls:sys_enter_open

This shows that ``args`` has a ``filename`` field, which we use.

Programs often follow a pattern: data is collected in regular blocks and
stored in maps, and before the program terminates (in an ``END`` block) the
results are printed, or maps that are no longer needed are cleared to prevent
them from being dumped automatically. Other programs print data as they run.
More details about the language, including a useful list of built-in symbols
and functions and of probe types, can be found in the bpftrace manual page.

.. admonition:: Hands-on

   Modify the program above to print information only about files opened by
   the ``ls`` program, and only those whose name contains ``/bin``. You may
   use predicates or ``if`` statements.

   The program from lab 6 used ``kprobe``, while this one uses ``tracepoint``.
   What's the difference? Modify the program to use ``kprobe`` instead.
   Optionally, also make a version using ``fentry`` or ``kfunc``.

.. notes ::

   Hint (if no one figures it out): uptr will be useful.

CPU
===

There are many reasons why a program might run slowly; for example, it may use
a non-optimal algorithm. It's worth starting the diagnosis by checking whether
the CPU is actually busy. You can do this with the ``top`` program. If it
shows that the CPU is busy most of the time, it's time to dig further.

.. admonition:: Hands-on

   A common scenario is simply having too many tasks running on a machine. In
   that case the CPU is under load and the run queues are long. You can check
   them with the command::

      runqlen.bt

   After it is stopped, it prints a histogram showing what portion of the
   samples fell into which bucket. On an idle machine the queues are short,
   and most samples will indicate a queue length of 0.

   Let's try to put some load on the machine. In a second terminal, run the
   ``iperf`` server::

      iperf -s

   And in a third terminal, the client::

      iperf -c localhost

   While the client is running, check the queue length again; it should be a
   bit higher. In this case the program ran single-threaded, so on a
   multi-core system the client and the server could each use a separate CPU.
   Let's try a multi-threaded version::

      iperf -c localhost -P 64

   What does ``runqlen.bt`` show now? And what about 1024 threads
   (``-P 1024``)? How does the throughput change?

Let's see how the ``runqlen.bt`` program works::

   BEGIN
   {
       printf("Sampling run queue length at 99 Hertz... Hit Ctrl-C to end.\n");
   }

   profile:hz:99
   {
       $task = (struct task_struct *)curtask;
       $my_q = (struct cfs_rq *)$task->se.cfs_rq;
       $len = $my_q->nr_running;
       $len = $len > 0 ? $len - 1 : 0; // subtract currently running task
       @runqlen = lhist($len, 0, 100, 1);
   }

In the version shipped on disk, the ``cfs_rq`` structure is defined
explicitly, but since our kernel supports BTF, this is not needed. BTF, the
BPF Type Format, is in general a format for data associated with BPF programs;
here it is a way for the kernel to provide type information. You can view the
fields of the ``cfs_rq`` structure with::

   bpftrace -lv 'struct cfs_rq'
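As a side note (not part of the original scenario): a quick way to check
whether the running kernel exposes BTF type information is to look for the
file ``/sys/kernel/btf/vmlinux``, and the same listing syntax should work for
the other kernel structure used in this program::

   # If this file exists, the kernel was built with BTF and bpftrace can
   # resolve kernel types without extra struct definitions.
   ls -l /sys/kernel/btf/vmlinux

   # List the fields of another kernel structure the same way.
   bpftrace -lv 'struct task_struct'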
This time the ``BEGIN`` block is not very interesting. What's new is the
``profile`` probe. ``profile:hz:99`` means the code runs 99 times per second.
Instead of ``hz``, you can use ``us``, ``ms`` or ``s``. This differs from the
earlier probes, which reacted to specific events: this one is triggered by the
passage of time and runs asynchronously with respect to the traced code.
Crucially, a copy of the BPF code runs on each CPU, so on an n-CPU system the
code executes n times per tick.

Now, think about the following question: why 99 Hz instead of 100?

.. notes ::

   Because 100 is a round number, so the sampling would be more likely to stay
   in lockstep with something else that also triggers 100 times per second.
   It's not a big deal here, but it would matter more when collecting stack
   traces.

The program inspects the current task's data (``task_struct``) and, via the
``se`` field of type ``sched_entity``, accesses the scheduler information (if
CFS is used). The ``cfs_rq`` field holds ``nr_running``, which, somewhat
unintuitively, is the number of runnable processes on that specific CPU. The
queue length is of course one less than that. The final line is the key one:
it builds a histogram in the ``runqlen`` map, with buckets from 0 to 100 and
step 1. As we saw with 1024 threads, values beyond the range end up bucketed
together. The ``@name`` syntax marks global maps; temporary local variables
start with ``$``.

.. admonition:: Hands-on

   Modify the program to show statistics separately for each CPU.

It's worth taking a moment to consider concurrency issues. In the program
above, the code runs concurrently on all CPUs and updates values in a shared
map. Does this mean that bpftrace ensures the correctness of such updates?
Unfortunately no, or at least not always. Currently, BPF doesn't offer
synchronization primitives such as spinlocks to tracing programs. Consider
this example::

   bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }'

while simultaneously running ``cat /dev/zero > /dev/null`` in two terminals.
The ``count()`` call counts invocations, and ``@mapa2`` starts at 0. It looks
like both values should be equal, but if we wait long enough, the first one
will be higher. bpftrace uses a clever trick: when it can, it allocates
per-CPU maps, with separate data for each CPU. This avoids concurrency issues,
at the cost of summing the values at the end and using more memory.
``count()`` allows the use of per-CPU maps, but ``++`` does not; it introduces
a classic read-modify-write race condition. How can you check what kind of map
is used? Use ``strace``::

   strace -febpf bpftrace -e 'kprobe:vfs_read { @mapa1 = count(); @mapa2++ }' 2>&1 | grep mapa

You'll see that ``@mapa1`` is a ``BPF_MAP_TYPE_PERCPU_ARRAY``, and ``@mapa2``
is a ``BPF_MAP_TYPE_HASH``. Fortunately, small inconsistencies rarely get in
the way of diagnosing performance problems.

If the queues are long, a good solution is to reduce the number of active
processes, either by changing the program's settings or by offloading some
work to another machine.
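Before moving on, here is a small variation (a sketch of our own, not one of
the shipped tools) illustrating the "print data as the program runs" pattern
mentioned earlier: the same sampling block as in ``runqlen.bt``, plus an
``interval`` probe that prints and resets the histogram once per second::

   #!/usr/bin/bpftrace

   // Sample the CFS run queue length on every CPU, 99 times per second,
   // exactly as runqlen.bt does.
   profile:hz:99
   {
       $task = (struct task_struct *)curtask;
       $len = ((struct cfs_rq *)$task->se.cfs_rq)->nr_running;
       @runqlen = lhist($len > 0 ? $len - 1 : 0, 0, 100, 1);
   }

   // Unlike profile, an interval probe fires on a single CPU; here it is used
   // only to print the histogram periodically and start collecting afresh.
   interval:s:1
   {
       print(@runqlen);
       clear(@runqlen);
   }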
Flamegraphs
===========

But what if the program is slow and the queues are short, i.e. the issue lies
within the program itself? Then it's worth identifying where the time is
spent. If we want to see what the system is doing, we can use a surprisingly
simple program::

   profile:hz:99 { @[kstack()] = count(); }

``kstack`` returns a string representing the kernel stack. We index the map
with these strings, counting how often each one occurs. Sounds odd: since the
system is always doing something, won't every stack be different?

.. admonition:: Hands-on

   Run the above program on an idle system, redirecting the output to a file.
   Interrupt it after a few seconds and inspect the file.

It turns out that when little is happening, many stack traces repeat
frequently. If you stopped the collection quickly, there won't even be that
many unique traces; the whole file may be only a few hundred bytes.

.. admonition:: Hands-on

   In a second console, start compiling the kernel with many threads (e.g.
   ``make -j 8``). Rerun the stack-dumping program and stop it after a few
   seconds. What does the file contain now?

Now there are so many different stack traces that it's hard to make sense of
them. Let's use a visualization tool to see where the system spends its time.
A popular type of graph for this purpose is the flamegraph. We've already
downloaded the tools needed to make one.

.. admonition:: Hands-on

   First, we need to collapse the stack dump produced by ``bpftrace``::

      cd FlameGraph
      ./stackcollapse-bpftrace.pl tracedata.bpftrace > tracedata.collapsed

   where ``tracedata.bpftrace`` is the trace file you generated. Then create
   the graph::

      ./flamegraph.pl tracedata.collapsed > /hshare/flamegraph.svg

   Since the result is an image, place it somewhere accessible from outside
   the VM (such as a shared folder), so it's easy to view. What does the
   system do during compilation?

Open the graph to examine it; a convenient way is to use a web browser, where
you can move the mouse around to inspect individual entries. The graph shows
function calls from bottom to top, as seen in stack traces. The width of each
bar shows how often that function appeared in the collected stack dumps. Wider
bars indicate functions that appear more frequently on the stack; these are
the functions worth checking to see whether their presence is expected.

To analyze your own program, it may be more useful to look at the user-mode
stack: just use ``ustack()`` instead of ``kstack()``. To resolve symbols, the
program binary should have a symbol table; otherwise you'll only see
addresses. Also note that without extra tools this kind of profiling is less
useful for programs written in languages that run on their own virtual
machine. Optimizations such as inlining can also make stack traces misleading.
Fortunately, there are many good profilers for user-mode programs, while
bpftrace shines at analyzing kernel behavior.

Further Reading
===============

This short presentation doesn't exhaust the topic of system monitoring with
BPF or other tools. Interested readers are referred to the aforementioned
books by Brendan Gregg, which cover performance analysis in great detail.