.. _02-elf-en:

.. sectionauthor:: Maciej Matraszek <m.matraszek@mimuw.edu.pl>

========================
Class 2: ELF: modularity
========================

Date: 04.03.2025
:ref:`Small Task #2 <small_task_2>`

Extras
======

Same as for the previous labs

- :ref:`elf-en`
 
Scenario
====================

.. notes::

    Intro (20 mins)

    - Ask about technical issues (slack, QEMU, new students)
    - Ask about any troubles with the first task (like problems with access)
    - Plan for today:

      - How ELF and OS facilitate modularity: static and dynamic linking
      - We'll take a look at the task and explain the remaining parts (relocations)
      - We have a small task.

Modularity in general
-------------------------

Virtually every programming language has a concept of modularity: allowing building abstractions and code reuse.
This, in turn, facilitates creating libraries: a generic piece of code used by both applications and other libraries.
Although these dependency relationships are usually described as a tree, they actually tend to form a DAG:
if there are two paths to a dependency, should it have a single "instance" or a set of distinct ones?
If the module has a global state, usually the former is the expected behavior.

With library evolution (versioning), this inherently creates a resolution problem, the so-called **dependency hell**,
in which it becomes increasingly hard to provide a consistent set of libraries for a set of required applications.
With the advent of containerization, now that memory is rarely at a premium, duplication is a popular solution.
However, this leads to even simple applications weighing hundreds of megabytes.

During this lab, we'll discuss how such challenges are reflected in the context of native code execution.

Memory layout and sharing
-------------------------

At a very high level, there are two kinds of "modules" for native code: static (``.o``) and dynamic (``.so``).
Static modules are combined in an executable at compile (linking) time,
while dynamic modules are only referenced to be resolved at runtime.
In particular, statically linked code offers more opportunities for optimizations,
while shared libraries offer more portability and memory saving.
A shared library will have only one copy present in physical memory,
and virtual memory addressing makes it easy to reference it from multiple applications.

To visualize, take a look at this very simplified diagram showing virtual-to-physical memory mapping of two executables (left).
Sharing physical memory (right) enabled us to save two "blocks":


.. mermaid::
    :caption: Very simplified representation of virtual memory paging of (shared) objects.
    :config: {"theme": "base", "darkMode": "true", "themeVariables": {"lineColor": "#aaa", "primaryColor": "#fefefe" }}

    block-beta
        columns 3

        block:1
            columns 1

            block:P1:1
                columns 1
                %%title process 1

                A.o B.o C.so D.so
            end
            block:P2:1
                columns 1
                %%title process 2

                A'.o DD["D.so"] CC["C.so"] F.so
            end
        end

        space:1
        block:Memory:1
            columns 1
            %%title physical memory

            Ar["A.o"]
            Br["B.o"]
            Dr["D.so"]
            Ar2["A'.o"]
            Cr["C.so"]
            Fr["F.so"]
            f1["free"]
            f2["free"]
        end

        A.o --> Ar
        B.o --> Br
        C.so --> Cr
        D.so --> Dr

        A'.o --> Ar2
        DD --> Dr
        CC --> Cr
        F.so --> Fr

A bit more detailed view
^^^^^^^^^^^^^^^^^^^^^^^^

.. admonition:: Hands-on

    You can view a description of a process's virtual memory map by reading the
    ``/proc/<PID>/maps`` pseudo-file. (A more detailed view is available in ``/proc/<PID>/smaps``.)
    You can find a detailed description of the format `in the kernel docs <https://docs.kernel.org/filesystems/proc.html>`_,
    but for now it suffices to say that the columns represent, in order:

    - start and end addresses of the mapping,
    - permission for memory pages (rwx),
    - offset to a file (if present in last column),
    - (device and inode),
    - and the mapped file or a marker: "[heap]", "[stack]".

    Read the pseudo-file for a process of a simple program. For instance, you may use this trick to read a mapping
    of a just-spawned ``cat`` process::

        cat /proc/self/maps

    Think about what you see there and try to cross-reference the offsets and permissions with the segments of the binary file
    (look at the "Offset" column). You may see a list of sections and segments for ``cat`` with::

        readelf -lS /usr/bin/cat

    Check the addresses of libc for various processes: are they the same?

    Open the ``smaps`` pseudo-file and search for ``libc.so``. Look for ``Rss`` and ``Pss`` rows:

    RSS
        The amount of the mapping currently resident in RAM.
    PSS
        This process's proportional share of the mapping (i.e., each resident page is divided by the number of processes sharing it).

    Try to estimate how much memory would be required if each process on your machine had its own copy of libc.

    .. class:: details

    This is how it looks on my router:
        .. code-block::

            00400000-0044f000 r-xp 00000000 1f:04 4          /bin/busybox
            0045e000-0045f000 r-xp 0004e000 1f:04 4          /bin/busybox
            0045f000-00460000 rwxp 0004f000 1f:04 4          /bin/busybox
            77e24000-77e46000 r-xp 00000000 1f:04 288        /lib/libgcc_s.so.1
            77e46000-77e47000 r-xp 00012000 1f:04 288        /lib/libgcc_s.so.1
            77e47000-77e48000 rwxp 00013000 1f:04 288        /lib/libgcc_s.so.1
            77e48000-77ee4000 r-xp 00000000 1f:04 286        /lib/libc.so
            77ef3000-77ef5000 rwxp 0009b000 1f:04 286        /lib/libc.so
            77ef5000-77ef7000 rwxp 00000000 00:00 0
            7f761000-7f782000 rw-p 00000000 00:00 0          [stack]
            7ffe0000-7ffe1000 r--p 00000000 00:00 0          [vvar]
            7ffe1000-7ffe3000 r-xp 00000000 00:00 0          [vdso]
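
The column layout described above can also be picked apart programmatically.
The following is only a minimal sketch: the ``struct mapping`` type and the ``parse_maps_line`` helper are
hypothetical names introduced for illustration, and only the first four columns are parsed.

.. code-block:: c

    #include <stdio.h>

    /* The leading columns of one /proc/<PID>/maps line (hypothetical helper type). */
    struct mapping {
        unsigned long long start;   /* first address of the mapping */
        unsigned long long end;     /* first address past the mapping */
        char perms[5];              /* e.g. "r-xp", NUL-terminated */
        unsigned long long offset;  /* offset into the mapped file */
    };

    /* Parse "start-end perms offset ..." from a single line.
     * Returns 1 on success and 0 if the line does not match the format. */
    int parse_maps_line(const char *line, struct mapping *out)
    {
        int fields = sscanf(line, "%llx-%llx %4s %llx",
                            &out->start, &out->end, out->perms, &out->offset);
        return fields == 4;
    }

Feeding it the lines read from ``/proc/self/maps`` (for example with ``fgets``) gives the size of each
mapping as ``end - start``, which can then be compared against the segment sizes printed by ``readelf -l``.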

.. notes::
    Explain [heap] and [stack]. We'll talk about [vdso] and [vsyscall] in the next labs.

    We should be at 20 mins mark.
    Spend about 10 minutes to discuss the need for relocations and PIC, ASLR,
    and shortly explain the attributes seen in the cross-referencing task (as below).
    We'll work with linker in the next part.

    On a modern CPU the maps roughly reference:

    #. ``.interp``,
    #. ``.text``,
    #. ``.rodata``,
    #. sections becoming RO after init (``.data.rel.ro``),
    #. and RW data: ``.data, .bss``.

    Consider showing how it looks on an embedded system.


File types
""""""""""

Let's review ELF file types again:

``ET_REL`` (relocatable file)
  A compiled, but not yet linked file (``.o``). Usually created as the result
  of compiling a single source file. It is not possible to run it directly --
  these files are an intermediate stage of compilation and are combined by
  the linker (program ``ld``) into executable files or dynamic libraries.

  There is also an (uncommonly used) ability to combine several ``.o`` files
  into one larger file using ``ld -r``.

  As intermediate files, ``ET_REL`` can contain undefined symbols and undefined
  references -- they will be fixed up by further linking steps.

  In the ``ET_REL`` type, section headers are required and program headers
  are not used.

``ET_EXEC`` (executable file)
  A compiled and fully linked program, usually created by linking ``.o`` files
  through a linker. Such a file is ready to be launched -- all segments have
  a fixed address at which they will be available during the program's operation.
  All references in the file are also fixed -- the only exceptions are special
  types of references to shared libraries, limited to one segment. This ensures
  that almost all the memory content loaded from the executable file is identical
  in all processes executing the given program and allows for sharing the memory.

  In the ``ET_EXEC`` type, program headers are required. Section headers are not
  needed to run the program, but are used by debuggers and are usually included.

``ET_DYN`` (shared object file)
  A compiled and linked dynamic library (``.so``). Very similar to ``ET_EXEC``,
  but with the following differences:

  - although most of the content is already set (undefined references, like in
    ``ET_EXEC``, are limited to external references in one segment), the address
    at which the library will be loaded is not fixed -- the library can be loaded
    to any place in memory.

  - because the library code cannot contain references to its own address,
    a special code style is used, which is called PIC (Position-Independent
    Code). Whenever an address of an object in the library is needed, PIC code
    must somehow determine its own position and calculate the address of
    the desired object from it. This code is usually bigger and slower than
    "regular" code.

  - thanks to the above features, the program can load many dynamic libraries
    into its address space, and even load them during operation

  It should be noted that although the ``ET_DYN`` type is usually used for
  libraries, there is nothing to prevent it from being used for the main
  program as well -- this technique is called PIE (Position-Independent Executable)
  and is sometimes used because of the possibility of full randomization of the process address space.

  An example of an executable ``ET_DYN`` file is the libc library
  (``/lib/libc.so.6``) -- it prints its version information on startup. Also,
  the dynamic linker is implemented as an executable ``ET_DYN`` (to avoid an address
  conflict with the program that it loads).


Interestingly, Linux kernel modules (``.ko``) are of the ``ET_REL`` type, and are
directly loaded by the kernel -- the benefits of ``ET_EXEC`` and ``ET_DYN``
(i.e., shared memory) do not apply in kernel mode, and their disadvantages (fixed
position ``ET_EXEC``, PIC ineffectiveness) would be quite severe.
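
To check which of these types a given file has, it is enough to read the first few bytes of its ELF header.
The sketch below assumes a Linux system with ``<elf.h>``; the function name ``elf_file_type`` is
a hypothetical helper, not part of any library.

.. code-block:: c

    #include <elf.h>
    #include <stdio.h>
    #include <string.h>

    /* Return the e_type field (ET_REL, ET_EXEC, ET_DYN, ...) of the file at
     * `path`, or -1 if the file cannot be read or is not an ELF file. */
    int elf_file_type(const char *path)
    {
        /* e_type is a 16-bit field placed directly after the e_ident array,
         * at the same offset in both 32-bit and 64-bit ELF headers. */
        unsigned char header[EI_NIDENT + 2];

        FILE *f = fopen(path, "rb");
        if (f == NULL)
            return -1;
        size_t got = fread(header, 1, sizeof header, f);
        fclose(f);

        if (got != sizeof header || memcmp(header, ELFMAG, SELFMAG) != 0)
            return -1;

        /* The field is stored in the byte order declared by the file itself. */
        if (header[EI_DATA] == ELFDATA2MSB)
            return (header[EI_NIDENT] << 8) | header[EI_NIDENT + 1];
        return header[EI_NIDENT] | (header[EI_NIDENT + 1] << 8);
    }

Calling it on ``part_a.o`` from the later hands-on should give ``ET_REL``, while a PIE executable and
a shared library both report ``ET_DYN``. The same information is printed by ``readelf -h``.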

Sections
""""""""

The information contained in a section header is:

- section name (a section can have any name, but for standard sections it is customary to use names that begin with a period)
- section type
- section attributes
- size, location in the file, and section alignment
- for ``ET_EXEC`` and ``ET_DYN``: the final address of the section in memory (relative to the base address in the case of ``ET_DYN``)
- associated section IDs (for some types)

The section type determines most of its semantics. The more important types are:

``SHT_PROGBITS``
  normal section, content loaded from a file
``SHT_NOBITS``
  ordinary section, but the content is filled with zeros instead of being loaded from a file
``SHT_SYMTAB``
  symbol table -- contains information about objects contained in the file and external objects to which this file has references
``SHT_STRTAB``
  table of strings -- contains the names used by section headers and entries in the symbol table
``SHT_REL``/``SHT_RELA``
  contains information about unknown references used in a given (affiliated) section
``SHT_DYNAMIC``
  contains information for the dynamic linker

The more important section attributes are:

``SHF_WRITE``
  the section is writable at runtime
``SHF_EXECINSTR``
  the section contains executable code
``SHF_ALLOC``
  the section will be loaded into memory at runtime (sections without this
  flag are used only by build and debugging tools)

Segments
""""""""

The information contained in a program header is:

- segment type
- segment attributes
- location of the segment in the file and its address in memory
- size of the segment in the file and size of the segment in memory
  (if they are different, the remaining part is filled with zeros --
  used for sections of the type ``SHT_NOBITS``)

The more important types of segments are:

``PT_LOAD``
   "regular" segment: loads the area into memory
``PT_DYNAMIC``
   marks the area as containing information for the dynamic linker
``PT_INTERP``
   indicates the file name of the dynamic linker to be used

The only architecture-independent / system-independent attributes are
the access rights (rwx).

During linking, ``PT_LOAD`` segments are created by merging all sections
with the ``SHF_ALLOC`` flag with compatible access rights.  All other segments
that are used at runtime are contained within ``PT_LOAD`` segments.
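
The relationship between section attributes and segment access rights can be sketched as a small function.
This is a simplification under stated assumptions: real linkers follow a linker script and may split
read-only data from code, and ``segment_flags_for_section`` is a hypothetical name.

.. code-block:: c

    #include <elf.h>
    #include <stdint.h>

    /* Compute the access rights (PF_R / PF_W / PF_X) that a PT_LOAD segment
     * needs in order to hold a section with the given sh_flags.
     * Returns 0 for sections without SHF_ALLOC: they are not loaded at all. */
    uint32_t segment_flags_for_section(uint64_t sh_flags)
    {
        if (!(sh_flags & SHF_ALLOC))
            return 0;

        uint32_t p_flags = PF_R;            /* every loaded section is readable */
        if (sh_flags & SHF_WRITE)
            p_flags |= PF_W;
        if (sh_flags & SHF_EXECINSTR)
            p_flags |= PF_X;
        return p_flags;
    }

Sections for which this function returns the same value are candidates for merging into one ``PT_LOAD`` segment.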

Memory map, again
"""""""""""""""""

As we've seen, a shared library needs an extra per-process segment to enable resolving symbols.
Below is a little less simplified virtual (left) to physical (right) memory mapping for three processes:
two running the same binary *A* and one executing *B*, which use a shared library *L*.

.. mermaid::
    :caption: Somewhat less simplified representation of virtual memory and instances of the same executable.
    :config: {"theme": "base", "darkMode": "true", "themeVariables": {"lineColor": "#aaa", "primaryColor": "#fefefe" }}
    :zoom:

    block-beta
        columns 4

        block:As:2
            columns 1

            block:P1:1
                columns 1
                %%title process 1

                A1["A.text"] A1d["A.data"] L1["L.so.text"] L1r["L.so.rel"]
            end

            block:P12:1
                columns 1
                %%title process 2

                A2["A.text"] A2d["A.data"] L2["L.so.text"] L2r["L.so.rel"]
            end

            block:Pb:1
                columns 1
                %%title process 3

                B1["B.text"] B1d["B.data"] Lb["L.so.text"] Lbr["L.so.rel"]
            end
        end


        space:1
        block:Memory:1
            columns 1
            %%title physical memory

            Ar["A.text"]
            A1dr["A.data of A1"]
            A2dr["A.data of A2"]
            Lr["L.so.text"]
            f3["free"]
            L1dr["L.so.rel of A1"]
            L2dr["L.so.rel of A2"]
            f1["free"]
            Br["B.text"]
            B1dr["B.data of B"]
            Lbdr["L.so.rel of B"]
        end

    A1 --> Ar
    A2 --> Ar
    A1d --> A1dr
    A2d --> A2dr
    L1 --> Lr
    L2 --> Lr
    L1r --> L1dr
    L2r --> L2dr
    B1 --> Br
    B1d --> B1dr
    Lb --> Lr
    Lbr --> Lbdr


Relocatable code and PIC
------------------------

.. tip::
    In the tasks below, run the compiler with these flags to simplify the generated code::

      CFLAGS="-no-pie -march=haswell -fno-asynchronous-unwind-tables -fcf-protection=none -fno-stack-protector"

.. admonition:: Hands-on

    Create two C files which depend on each other like these two:


    .. code-block:: c
        :caption: part_a.c

        extern int YOUR_DATA;
        int bar(int*);


        static int DATA = 42;

        static void baz() {
            int a = 32 * 6;
        }

        int foo(int c) {
            DATA = YOUR_DATA;
            baz();
            return bar(&DATA);
        }

    .. code-block:: c
        :caption: part_b.c

        int foo(); // oopsie

        int bar(int* arg) {
            return *arg + 4;
        }

        int YOUR_DATA = 1337;

        int main() {
            return foo();
        }

    Compile these files without linking like so::

        gcc -c part_a.c -o part_a.o $CFLAGS
        gcc -c part_b.c -o part_b.o $CFLAGS

    Observe them under ``objdump -dr`` and ``readelf -a``.
    You should see various kinds of relocations, which we will discuss next.

    Link these files together by calling gcc with::

        gcc part_a.o part_b.o -o parts $CFLAGS  -Wl,-emit-relocs

    Examine the result with ``objdump``: how were the holes resolved by the linker?

    Try adding other flags like ``-fpic``, ``-fno-pic``, ``-fno-plt``, or ``-mcmodel=large``.
    Compile and examine the files again.

    Try invoking the linker directly with::

        ld part_a.o part_b.o -o parts

    What happened? Does the executable still work?
    If there is a discrepancy, consider executing gcc in verbose mode (``-v``) to see what flags it passes to the linker.

    Examine the ``crt1.o`` (or ``crt0.o``) file.

.. tip::
    Use ``objdump -xtrds`` to dump code, section table, symbols, and other data about a binary file
    in a single command.

.. notes::

    Ask the students some extra questions:

    - Why is ``baz`` not relocated?
    - Why does it work with the mismatched signature of ``foo``?
    - Find the path and show ``objdump -xdtrls .../crt1.o``.

Symbols
^^^^^^^

One of the main tasks of the ELF format is storing information about objects
contained in the file and about references to external objects.  By object,
we mean a function or a (global) variable.  From the ELF point of view,
an object is simply an area within a section (``ET_REL``) or the address space
of a program (``ET_EXEC``, ``ET_DYN``).

Symbols are names assigned to objects. The symbol can be defined (assigned to
an object in a given file) or undefined (it will be defined at the moment
of linking with the file that defines it).

The symbols are stored in the symbol table. The information stored about a symbol is:

- name
- value: position in the section (``ET_REL``) or memory (``ET_EXEC``, ``ET_DYN``)
- the containing section
- size (the size of the variable or size of the function code); it can be zero
  if we're only interested in the address
- type:

  ``STT_OBJECT``
    a global variable
  ``STT_FUNC``
    a function
  ``STT_SECTION``
    a special symbol representing the beginning of the section (used for
    internal references)

- linking rules:

  ``STB_LOCAL``
    local symbol (``static`` in C) -- will not participate in linking
  ``STB_GLOBAL``
    global symbol
  ``STB_WEAK``
    weak global symbol (``__attribute__((weak))`` in gcc) -- a special variant
    of a global symbol that automatically "loses" to the usual global symbol
    with the same name when both are defined

- visibility rules -- used to bind symbols between modules (a module is
  an executable program or a dynamic library):

  ``STV_DEFAULT``
    default rules -- the symbol is visible and can be shadowed by a symbol
    with the same name from another module
  ``STV_PROTECTED``
    the symbol is visible, but references to it from within the containing
    module will not be shadowed
  ``STV_HIDDEN``
    the symbol is not visible from outside the module -- like ``STB_LOCAL``,
    but at the module level, not the source file level
  ``STV_INTERNAL``
    like ``STV_HIDDEN``, but when the symbol is a function, we also assume that
    it will never be called from outside the module (which would be possible
    by passing the pointer). It can be used to further optimize PIC code.

  These rules can be set in gcc by the appropriate ``__attribute__``.
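
As an illustration, the file below should produce symbols with several of the binding and visibility
combinations listed above when compiled with gcc or clang. All identifiers are made up for this sketch;
the result can be inspected with ``readelf -s`` on the object file.

.. code-block:: c

    /* STB_LOCAL: `static` keeps the symbol out of linking entirely. */
    static int call_count;

    /* STB_GLOBAL with STV_DEFAULT: visible to other modules and preemptible. */
    int api_version = 2;

    /* STB_WEAK: a strong definition of log_level elsewhere would win. */
    __attribute__((weak)) int log_level(void)
    {
        return 1;
    }

    /* STB_GLOBAL with STV_HIDDEN: usable from other source files of this
     * module, but not exported from the final executable or library. */
    __attribute__((visibility("hidden"))) int bump_call_count(void)
    {
        return ++call_count;
    }

    /* STB_GLOBAL with STV_PROTECTED: exported, but references from inside
     * this module always bind to this definition. */
    __attribute__((visibility("protected"))) int entry_point(void)
    {
        return bump_call_count() + log_level();
    }

Compiling with ``-fvisibility=hidden`` changes the default for symbols without an explicit attribute,
which is a common way to limit what a shared library exports.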

Relocations
^^^^^^^^^^^

.. notes::
    Use ``readelf`` view for relocations table.
    You'll find these relocations if compiled with ``-fno-pic``.
    Remember that ``_PLT`` is downgraded to ``_PC``.

Symbols are used in the code through references (called relocations). A relocation
tells the linker that, at a given place in a section, it should replace
the bytes emitted at compile time with
the address of a symbol (or some other value unknown at compile time). Relocations
are stored in relocation tables (one for each section that requires it).
The information stored for each relocation is:

- index of the referenced symbol in the symbol table
- the relocation position in the section
- type of relocation
- addend: an additional component of the value -- the exact interpretation depends
  on the type of relocation, most often it is simply a number added to the relocated
  value. It is used, for example, when the code needs the address of ``a.y``, given
  the definition ``struct {int x, y; } a;``

There are two types of relocation tables: ``SHT_REL`` and ``SHT_RELA``.
For ``SHT_RELA``, the addend is stored in the relocation table, whereas
for ``SHT_REL``, the addend is stored as the initial content of the relocated
space. ``SHT_REL`` allows you to reduce the file size, but ``SHT_RELA`` is
required for architectures with complex relocation types (e.g., two-part
relocations of 16 bits each). The i386 architecture always uses ``SHT_REL``,
and the x86_64 architecture always uses ``SHT_RELA``.

Relocation types are very dependent on architecture. Most types of relocations
are used for dynamic linking. The basic types of relocation on i386 are:

``R_386_32``
  A 32-bit field is relocated, the relocated value is the address of
  the symbol + addend. For example, the following code::

    extern struct {
        int x;
        int y;
    } a;
    a.y = 13;

  will look like this in assembly::

    movl $13, a+4

  which translates into machine code as follows::

    c7 05 XX XX XX XX 0d 00 00 00

  where ``XX XX XX XX`` should be replaced with the address of ``a + 4``.
  The assembler will save this in the ELF file section as::

    c7 05 04 00 00 00 0d 00 00 00

  And in the relocation table for this section, it will make a relocation of
  type ``R_386_32`` referencing the symbol ``a`` at position 2 within the section
  (assuming that this code is at the very beginning of the section).

``R_386_PC32``
  A 32-bit field is relocated, the relocated value is the symbol address minus
  the field address plus the addend. This type of relocation is used for jump and
  call instructions (recall that x86 jump and call instructions encode
  the destination as the difference between the destination address
  and the address of the end of the instruction). The following code::

    extern void f(void);
    f();

  which, in assembly, is::

    call f

  will be saved in the machine code as::

    e8 XX XX XX XX .

  where ``XX XX XX XX`` is (address of ``f`` - address of the position marked ``.``,
  i.e., the end of the instruction). In the ELF file section, this will be saved as::

    e8 fc ff ff ff

  And in the relocation table there will be a relocation of the ``R_386_PC32`` type
  referencing the ``f`` symbol at position 1.  Please note that the assembler
  has set the relocation addend to ``0xfffffffc`` (i.e., -4) -- this is a correction
  included because ``R_386_PC32`` is defined as an offset from the beginning of
  the relocated field, and the jump instruction uses the offset from the end
  of the jump instruction, i.e., from the end of the relocated field.
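
The two i386 relocation types above can be sketched as functions that patch a section image in place.
This is a simplified model of what a linker does, assuming ``SHT_REL`` (the addend is read from the field
itself); the function names are hypothetical.

.. code-block:: c

    #include <stdint.h>

    /* i386 is little-endian; read and write 32-bit fields byte by byte so
     * that the sketch also works on hosts with a different byte order. */
    static uint32_t load_le32(const uint8_t *p)
    {
        return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
               (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
    }

    static void store_le32(uint8_t *p, uint32_t value)
    {
        p[0] = (uint8_t)value;
        p[1] = (uint8_t)(value >> 8);
        p[2] = (uint8_t)(value >> 16);
        p[3] = (uint8_t)(value >> 24);
    }

    /* R_386_32: field = symbol address + addend. */
    void apply_r_386_32(uint8_t *section, uint32_t offset, uint32_t sym_addr)
    {
        uint32_t addend = load_le32(section + offset);
        store_le32(section + offset, sym_addr + addend);
    }

    /* R_386_PC32: field = symbol address + addend - field address.
     * `section_addr` is the address at which the section will be loaded. */
    void apply_r_386_pc32(uint8_t *section, uint32_t section_addr,
                          uint32_t offset, uint32_t sym_addr)
    {
        uint32_t addend = load_le32(section + offset);
        uint32_t field_addr = section_addr + offset;
        store_le32(section + offset, sym_addr + addend - field_addr);
    }

For the ``call f`` example, with the section loaded at the (made-up) address ``0x08048000`` and ``f`` at
``0x08048100``, the field becomes ``0xfb``: the distance from the end of the 5-byte instruction to ``f``.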

The basic types of relocation on x86_64 are:

``R_X86_64_64``
  A 64-bit field is relocated, analogous to ``R_386_32``.

``R_X86_64_32S``
  Like ``R_X86_64_64``, but a signed 32-bit field is relocated. If the full
  64-bit value cannot be represented by this field, a linking error occurs.
  On the x86_64 architecture, most immediate parameters for instructions can
  only contain 32-bit signed numbers -- so long as the finished program fits
  into the lower 2GB of the address space, this type of relocation is used
  for most code references.  If the program becomes too large, it must be
  compiled with the ``-mcmodel=large`` option, which uses only ``mov``
  instructions to load addresses, supporting the full 64-bit range and
  using relocation type ``R_X86_64_64``.

``R_X86_64_PC32``
  Analogous to ``R_386_PC32``.
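
The range restriction behind ``R_X86_64_32S`` and ``R_X86_64_PC32`` can be expressed as a simple check.
The sketch below only models the overflow test that makes the linker report an error;
the function names are hypothetical.

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>

    /* R_X86_64_32S stores the value in a 32-bit field that the CPU
     * sign-extends to 64 bits, so the value must survive that round trip. */
    bool fits_r_x86_64_32s(int64_t value)
    {
        return value >= INT32_MIN && value <= INT32_MAX;
    }

    /* R_X86_64_PC32 stores (symbol + addend - place) in a signed 32-bit
     * field, so the symbol must lie within roughly 2 GB of the field. */
    bool fits_r_x86_64_pc32(uint64_t sym_addr, int64_t addend, uint64_t place)
    {
        int64_t value = (int64_t)(sym_addr + (uint64_t)addend - place);
        return fits_r_x86_64_32s(value);
    }

When such a check fails for a real program, the linker typically reports a "relocation truncated to fit"
error, which is the hint to switch to a larger code model.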


Large Task #1 - again
---------------------

Head over to :ref:`z1-elf` to discuss relocations there.

Shared code and PIE
-------------------

.. admonition:: Hands-on

    If you haven't tried it before, compile ``part_a.c`` with these extra flags::

        -fpic -fno-plt -mcmodel=large

    This will generate a relocatable object that uses Position-Independent Code
    (and forces some simplifications).

    In the previous code change the ``bar`` function to call ``printf("%p\n", arg)``.
    Compile and link the files again.
    Find how ``printf`` is called and how it is defined in the disassembly.
    How does it compare to calling ``foo``? These two use the same kind of relocation in ``part_b.o``.

    Try linking just ``part_b.o`` alone into an executable.
    How is ``printf`` different from ``foo``?

.. tip::
    You can set the ``LD_DEBUG`` environment variable to ask the dynamic linker to display what it is doing.
    Try ``LD_DEBUG=help ./parts`` to see a list of available options.

.. admonition:: Hands-on

    Let's compile ``part_a`` as a dynamic library::

        gcc  part_a.c $CFLAGS -o libpart_a.so -Wl,-soname=libpart_a.so -shared  -fpic

    This time, we can pass additional flags to gcc to put ``foo`` on par with ``printf``::

        gcc part_b.c -o parts  $CFLAGS -lpart_a -L. -Wl,-emit-relocs

    Examine the result under objdump. Try running the binary.
    Why doesn't it work? Find out with the above tip or using ``ldd``!

    .. class:: details

      Hint
            ``LD_LIBRARY_PATH=.``

.. notes::
    TODO: explain differences between -fPIC and -fPIE

Global Offset Table
^^^^^^^^^^^^^^^^^^^

As previously mentioned, ELF's main design goal for ``ET_EXEC`` and ``ET_DYN``
was the ability to share code and data between processes. Because external
references (i.e., relocations) obviously require modification of the memory
content in relation to the "template" contained in the file, it was decided
to gather them into one place, limiting the number of pages of memory that
cannot be shared.

This place is called GOT (Global Offset Table). There is one GOT for each
module (library or main program) that needs it. It is simply a large array
of external symbol addresses required by a given module. When we write
a dynamic library (and we use PIC), the compiler automatically generates
code that loads the appropriate address from the GOT every time it needs
the address of an external object.  For ``ET_EXEC`` files, several tricks
are used so that the compiler does not have to explicitly use the GOT, but
the GOT is still used in some form for external function calls.

The linker automatically creates GOT when linking a program or a dynamic
library. A special relocation table is created in the ``.rel.dyn`` section,
in which relocations that fill the GOT are stored (as well as all other
relocations required during the dynamic linking process). These relocations
are of the type ``R_<arch>_GLOB_DAT``, which (in the case of x86) works
identically to ``R_386_32`` or ``R_X86_64_64``, but additionally identifies
the purpose of the relocation as a GOT slot fill.
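
The idea of funnelling all external references through one writable table can be modelled in plain C.
This is only a conceptual sketch: the table, the slot names, and both functions are hypothetical stand-ins
for what the compiler, linker, and dynamic linker do behind the scenes.

.. code-block:: c

    #include <stddef.h>

    /* Stand-in for a variable defined in some other module. */
    int their_counter = 1337;

    /* One slot per external symbol used by "our" module. */
    enum { GOT_THEIR_COUNTER, GOT_SLOTS };

    /* The only per-process, writable part of the module: the code itself
     * never contains the address of their_counter. */
    static void *got[GOT_SLOTS];

    /* Role of the dynamic linker: process a GLOB_DAT-style relocation by
     * writing the resolved symbol address into the slot. */
    void bind_got_slot(size_t slot, void *symbol_address)
    {
        got[slot] = symbol_address;
    }

    /* Role of PIC code: load the address from the slot, then dereference. */
    int read_their_counter(void)
    {
        return *(int *)got[GOT_THEIR_COUNTER];
    }

Because only ``got`` is modified at load time, the pages holding the code of ``read_their_counter``
stay identical in every process and can be shared.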

PIC on i386
^^^^^^^^^^^

.. notes::
    You'll find these relocations in ``part_a.o:foo`` when compiled with the last flags suggested.

The position-independent code sequences (PIC) are often tricky and their degree
of complexity depends on the architecture. The i386 architecture is quite
average in this respect -- relative jump instructions are available, but there
are no other ways of addressing memory relative to the instruction pointer.
The basic code sequences used in the i386 architecture are:

- Finding the GOT position::

    call _l1
    _l1:
    popl %ebx
    addl $_GLOBAL_OFFSET_TABLE_+(.-_l1), %ebx

  In this sequence, the ``call`` instruction is used to push the address of the
  label ``_l1`` (i.e., the 'return' address) onto the stack.  This address is
  then popped from the stack, and the GOT address is obtained by adding
  the difference between the GOT address and the ``_l1`` address.

  The dot in the ``addl`` instruction (denoting the address of the current
  instruction) is there for historical reasons -- ``_GLOBAL_OFFSET_TABLE_``
  is a special symbol understood by the assembler as (GOT address - address
  of the current instruction). Using this symbol also emits a special
  relocation ``R_386_GOTPC`` (works like ``R_386_PC32``, but uses the GOT
  address instead of the destination symbol).

  After the sequence has been executed, the GOT address is in the register
  ``%ebx``. This is the standard register for the GOT address -- according
  to the calling conventions, it must be set to the GOT address whenever
  a PLT call is made (see below).

- Finding the address of the local variable (``static int x;``) (having
  already determined the GOT address)::

    leal x@gotoff(%ebx), %ecx

  Since most functions need to find the GOT anyway, the fact that local
  variables have a fixed offset from the GOT is used -- the address of
  the variable is simply found by adding this difference to the GOT address.
  ``x@gotoff`` is a special assembler syntax for this difference.
  This corresponds to the relocation ``R_386_GOTOFF``
  (value = symbol address + addend - GOT address).

- Finding the address of an external variable (``extern int x;``) (having
  already determined the GOT address)::

    movl x@got(%ebx), %ecx

  ``x@got`` is a special syntax denoting (address of the GOT slot holding the address
  of ``x`` minus the GOT address). This instruction simply loads the contents of
  the appropriate GOT slot. ``x@got`` corresponds to the relocation
  ``R_386_GOT32``. Using this relocation automatically creates a slot in
  the GOT for the corresponding symbol.

PIC on x86_64
^^^^^^^^^^^^^

.. notes::
    You'll find these relocations in ``libpart_a.so:foo``.

The x86_64 architecture always allows the use of memory addressing relative
to the instruction pointer. Thanks to this, you can avoid the tricky code
sequence that looks for the GOT address and directly address slots in the GOT
by offset from the instructions that use them. For example, finding the
address of an external variable (``extern int x;``) looks like this::

    movq    x@GOTPCREL(%rip), %rax

This corresponds to the relocation ``R_X86_64_GOTPCREL``.

Moreover, to reach a local variable, you do not have to use
the GOT at all -- it is enough to encode the offset between
the instruction and the given variable and use relative addressing.
This uses the relocation ``R_X86_64_PC32``, the same one that is used
by jump and call instructions.

PLT
^^^

As an optimization in relation to the above mechanisms, a special mechanism
for calling external functions was created: PLT (Procedure Linkage Table),
allowing lazy binding of functions by a dynamic linker.

PLT is a special table containing (on x86) code instead of data. Each external
function called through the PLT has an entry in the PLT. The entry for function
``f`` looks like this (i386)::

    f@plt:
    jmp *f_GOT_PLT_OFF(%ebx)
    f_unbound:
    pushl $f_REL_OFF
    jmp plt0

Or like this (x86_64)::

    f@plt:
    jmpq *f_GOT_PLT(%rip)
    f_unbound:
    pushq $f_REL_OFF
    jmp plt0

And ``plt0`` is a single special entry that looks like this (x86_64)::

    pushq _GLOBAL_OFFSET_TABLE_+8(%rip)
    jmpq *_GLOBAL_OFFSET_TABLE_+16(%rip)

Calling a function in PIC code looks like this::

    call f@plt

And, in the case of i386, it assumes that ``%ebx`` contains the GOT address.

The mechanism works as follows:

- ``f_GOT_PLT_OFF`` is an offset in the GOT of a special slot for the given PLT entry
- this slot works like a regular GOT slot, but uses ``R_<arch>_JMP_SLOT``
  instead of ``R_<arch>_GLOB_DAT``, and is initially set (via the linker)
  to the offset of the ``f_unbound`` label relative to the library base.
  What's more, the relocation ``R_<arch>_JMP_SLOT`` is placed in a special,
  separate relocation table ``.rel.plt``
- the dynamic linker, seeing this type of relocation, initially fills this slot with
  the address of the ``f_unbound`` label by adding the base address of
  the library, instead of looking for the symbol ``f``
- when the program calls ``f@plt`` for the first time, the ``jmp`` instruction
  jumps to the contents of the slot, i.e., to the ``f_unbound`` label
- the code at ``f_unbound`` pushes onto the stack the offset of the ``R_<arch>_JMP_SLOT``
  relocation corresponding to this slot inside the ``.rel.plt`` section
  (on x86_64, the index of the relocation is pushed instead of its byte offset)
- ``plt0`` code pushes the contents of a special GOT slot with offset 4 (or 8
  on x86_64) onto the stack -- this slot is previously filled by the dynamic
  linker and contains some kind of identifying handle for the given module
- the control is transferred to a special function from another special GOT slot
  with offset 8 (or 16) -- this slot is also already filled by the dynamic linker
  and contains the address of a special function that binds symbols at runtime
- the dynamic linker, using the two parameters on the stack, determines what
  symbol it is, and where to enter its address, after which the GOT slot is
  refilled with the correct address and control is passed to the function ``f``
- when the program calls ``f@plt`` next time, the slot will already be filled
  and the control will go straight to the ``f`` function
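
The control flow above can be imitated in plain C with a function pointer playing
the role of the GOT slot. This is only a toy model, not how the dynamic linker is
implemented: all names (``f_got_slot``, ``toy_resolve``, ``resolver_calls`` and so on)
are made up for illustration, and the "symbol table" is a one-element array.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    typedef int (*func_t)(int);

    /* The "library" function that callers ultimately want to reach. */
    int real_f(int x) { return x + 1; }

    /* Toy symbol table standing in for the library's dynamic symbols. */
    struct toy_symbol { const char *name; func_t addr; };
    static const struct toy_symbol toy_symtab[] = { { "f", real_f } };

    /* Counts slow-path entries so the lazy behaviour can be observed. */
    int resolver_calls = 0;

    int f_unbound(int x);

    /* Toy GOT slot for f: initially points at the stub, not at real_f. */
    func_t f_got_slot = f_unbound;

    /* Plays the role of the dynamic linker's runtime binding function. */
    func_t toy_resolve(const char *name)
    {
        for (size_t i = 0; i < sizeof toy_symtab / sizeof toy_symtab[0]; i++)
            if (strcmp(toy_symtab[i].name, name) == 0)
                return toy_symtab[i].addr;
        return NULL;
    }

    /* Slow path (f_unbound + plt0): resolve, patch the slot, then call f. */
    int f_unbound(int x)
    {
        resolver_calls++;
        f_got_slot = toy_resolve("f");
        return f_got_slot(x);
    }

    /* f@plt: always an indirect jump through the slot. */
    int f_plt(int x)
    {
        return f_got_slot(x);
    }

Calling ``f_plt`` twice takes the slow path only once: after the first call the slot
points directly at ``real_f`` and ``resolver_calls`` stays at 1.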

ET_EXEC -- special tricks for dynamic linking
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

So that the compiler needs no knowledge of the GOT / PLT mechanisms when
compiling the main program (an ``ET_EXEC`` file), two additional tricks
are used:

- if the program refers to an external function symbol, a PLT entry for this
  function within the main program is automatically created, and the address
  of this PLT entry becomes the "official" address of this function inside
  the whole process (this is required to ensure the address of this function
  is fixed while linking the program code, and ``&f`` returns the same value
  throughout the program)
- if the program refers to an external variable symbol, the linker automatically
  creates a copy of this variable in the main program data segment and emits
  to ``.rel.dyn`` a special relocation ``R_<arch>_COPY``, which will copy the
  initial contents of this variable from the module that originally defined it.
  The created copy of the variable becomes the "official" location of this
  variable at runtime, and the original variable in the defining library is no longer used.
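
One observable consequence of the first trick is that a function has a single address
throughout the process. The sketch below compares the address of ``puts`` as seen by
compiled code with the result of a runtime lookup through the dynamic linker.
The helper name ``puts_address_is_consistent`` is hypothetical, and on glibc older
than 2.34 the program has to be linked with ``-ldl``.

.. code-block:: c

    #include <dlfcn.h>
    #include <stdio.h>

    /* Returns 1 if both views of puts agree, 0 if not, -1 on lookup failure. */
    int puts_address_is_consistent(void)
    {
        /* A NULL file name yields a handle for the main program's scope. */
        void *self = dlopen(NULL, RTLD_NOW);
        if (self == NULL)
            return -1;

        void *from_dlsym = dlsym(self, "puts");  /* runtime symbol lookup */
        void *from_code = (void *)&puts;         /* address used by this code */
        dlclose(self);

        if (from_dlsym == NULL)
            return -1;
        return from_dlsym == from_code;
    }

In a non-PIE executable both values are expected to be the address of the PLT entry
in the main program, while in position-independent code both are the address
inside the defining library. Either way, the comparison should succeed.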

_DYNAMIC structure
^^^^^^^^^^^^^^^^^^


The ``_DYNAMIC`` structure is a table of key:value pairs that describes
the contents of the module to the dynamic linker. It contains mainly:

- address and size of the symbol table involved in dynamic binding
- address and size of the ``.rel.dyn`` and ``.rel.plt`` tables
- GOT address
- list of libraries required by this module
- a list of library search paths

The dynamic linker finds the ``_DYNAMIC`` structure by looking for the ``PT_DYNAMIC`` segment.
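
A program can also inspect its own ``_DYNAMIC`` table directly, because the static linker
defines a symbol of that name. The following is a minimal sketch assuming glibc's
``<link.h>`` (for the ``ElfW`` macro and the ``DT_*`` constants) and a dynamically linked
program; the helper ``count_dynamic_tag`` is a hypothetical name.

.. code-block:: c

    #include <link.h>    /* ElfW(), DT_* constants */
    #include <stddef.h>

    /* Defined by the static linker for dynamically linked objects. */
    extern ElfW(Dyn) _DYNAMIC[];

    /* Count the entries of _DYNAMIC that carry the given tag. */
    size_t count_dynamic_tag(long tag)
    {
        size_t count = 0;

        /* The table is terminated by an entry tagged DT_NULL. */
        for (const ElfW(Dyn) *entry = _DYNAMIC; entry->d_tag != DT_NULL; entry++)
            if ((long)entry->d_tag == tag)
                count++;
        return count;
    }

For example, ``count_dynamic_tag(DT_NEEDED)`` gives the number of libraries the
program directly depends on, which is what ``readelf -d`` lists as ``NEEDED`` entries.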

Program launch sequence and the dynamic linker
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Running a dynamically linked program is much more complicated than running a statically
linked one, and the kernel cannot do it all by itself. Instead, a special program called
the dynamic linker is used. It is also known as ``ld.so`` (after the name of the file
in which it was originally located). On Linux, the dynamic linker is the file
``/lib/ld-linux.so.2`` on i386 and ``/lib64/ld-linux-x86-64.so.2`` on x86_64.

The kernel recognizes dynamically linked programs by the presence of a ``PT_INTERP`` segment,
which contains the name of the file containing the dynamic linker. When it finds such a segment,
instead of passing the control to the program after loading it, it additionally loads and runs
the indicated dynamic linker (which is a file of the ``ET_DYN`` type).

The dynamic linker starts by finding its own ``_DYNAMIC`` structure and applying
its own relocations. In the next phase, it looks through the auxiliary vector (``auxv``)
provided by the kernel. This is a list of key:value pairs describing the state of the process
and its environment. It contains, for example, the in-memory location of the program
headers of the main executable file. After locating the executable file, the linker
loads its dependencies (recursively). Then it applies all relocations from ``.rel.dyn``,
points the slots covered by ``.rel.plt`` relocations at their lazy-binding stubs, and finally
transfers control to the main program (by jumping to the entry point indicated in its ELF header).
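
The auxiliary vector remains accessible to the running program, so part of this
procedure can be retraced from user code. The sketch below assumes glibc's ``getauxval``
and ``<link.h>``; it uses ``AT_PHDR`` and ``AT_PHNUM`` to find the program headers of
the main executable and returns the interpreter path stored in ``PT_INTERP``.
The function name ``interp_path`` is hypothetical.

.. code-block:: c

    #include <link.h>      /* ElfW(), PT_* constants */
    #include <stddef.h>
    #include <sys/auxv.h>  /* getauxval(), AT_* constants */

    /* Returns the PT_INTERP string of the main executable, or NULL if absent. */
    const char *interp_path(void)
    {
        const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)getauxval(AT_PHDR);
        size_t phnum = getauxval(AT_PHNUM);
        ElfW(Addr) bias = 0;

        if (phdr == NULL)
            return NULL;

        /* PT_PHDR holds the link-time address of the header table itself;
           the difference from its real address is the load bias
           (assumes a PT_PHDR entry is present, otherwise bias stays 0). */
        for (size_t i = 0; i < phnum; i++)
            if (phdr[i].p_type == PT_PHDR)
                bias = (ElfW(Addr))phdr - phdr[i].p_vaddr;

        for (size_t i = 0; i < phnum; i++)
            if (phdr[i].p_type == PT_INTERP)
                return (const char *)(bias + phdr[i].p_vaddr);

        return NULL;
    }

In a dynamically linked program the result should match the interpreter shown by
``readelf -l``; in a statically linked one there is no ``PT_INTERP`` and the function
returns ``NULL``.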

The dynamic linker remains in memory after the program has been loaded, and its services
can still be used to open additional libraries, search for symbols, etc. through
``dlopen``, ``dlsym`` and related functions. On older systems these functions are provided
by the ``libdl`` library (link with ``-ldl``); since glibc 2.34 they are part of ``libc`` itself.
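
A minimal sketch of this interface is shown below: it loads a library at runtime,
looks up a function taking and returning a ``double``, and calls it.
The helper ``call_from_library`` and the type ``unary_fn`` are hypothetical names,
and the library name used in the example afterwards is an assumption about the system.

.. code-block:: c

    #include <dlfcn.h>
    #include <stddef.h>

    typedef double (*unary_fn)(double);

    /* Load lib, look up sym, call it with arg; returns 0 on success, -1 on error. */
    int call_from_library(const char *lib, const char *sym, double arg, double *out)
    {
        void *handle = dlopen(lib, RTLD_NOW);
        if (handle == NULL)
            return -1;

        dlerror();  /* clear any stale error state before the lookup */
        unary_fn fn = (unary_fn)dlsym(handle, sym);
        if (dlerror() != NULL || fn == NULL) {
            dlclose(handle);
            return -1;
        }

        *out = fn(arg);   /* call before dlclose: the code may be unmapped after */
        dlclose(handle);
        return 0;
    }

For example, ``call_from_library("libm.so.6", "cos", 0.0, &result)`` should store the cosine
of zero in ``result``, assuming the math library is installed under that name (as on glibc systems).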

.. _small_task_2:

Small Task #2
-------------

.. notes::
    TODO: consider changing the task to ``puts`` to eliminate the fuss with variadics

Write a small wrapper dynamic library that hijacks calls to ``printf`` and prints the address
of the format string before the actual message.

The injection should happen by preloading the library with ``LD_PRELOAD`` like so::

    LD_PRELOAD=libtroll.so LD_LIBRARY_PATH=. ./parts

You can start by modifying the following code::

    #include <stdio.h>
    #include <stdarg.h>

    void my_printf(const char* fmt, ...)
    {
        printf("[%p] ", fmt);
        va_list args;
        va_start(args, fmt);
        vprintf(fmt, args);
        va_end(args);
    }

    int main()
    {
       my_printf("printing %d + %d = %s\n", 2, 7, "no idea");
    }

Make sure your code **calls** ``printf`` again to show you can write a proper wrapper:
override the function and call the original.

If you want to see it in action with typical programs like gcc,
override the ``int __printf_chk(int flag, const char* fmt, ...)`` function as well.

.. hint::

    - ``man 3 dlsym printf strcat``
    - ``man 8 ld.so``
    - `function pointers <https://www.gnu.org/software/c-intro-and-ref/manual/html_node/Declaring-Function-Pointers.html>`_
    - `variadic functions <https://en.cppreference.com/w/c/language/variadic>`_

Extra Topics
=============

If you want to extend your knowledge beyond the scope of this course, consider:

- reading about :ref:`elf_en_tls` and :ref:`elf_en_dwarf`,
- reading the whole :ref:`elf-en`,
- writing a wrapper similar to :ref:`small_task_2`, but using static linking (``--wrap=symbol`` flag to ``ld``),
- reading about `startup tricks with crt <https://github.com/kraj/uClibc/blob/master/docs/crt.txt>`_ and the
  `implementation in a small libc (uClibc) <https://github.com/kraj/uClibc/tree/master/libc/sysdeps/linux/x86_64>`_,
- reading about `linker scripts <https://wiki.osdev.org/Linker_Scripts>`_ (``.lds`` files).


Literature
==========

1. Linker and Libraries Guide, 2004, Sun Microsystems (chapters 7 and 8): http://docs.oracle.com/cd/E19683-01/817-3677/817-3677.pdf
2. gabi: http://www.uclibc.org/docs/SysV-ABI.pdf
3. psABI-i386a: http://www.uclibc.org/docs/psABI-i386.pdf
4. man dlsym, dlopen
5. ELF handling for the thread local storage: http://www.akkadia.org/drepper/tls.pdf
6. The DWARF debugging standard: http://www.dwarfstd.org/