Class 1: ELF
Date: 25.02.2025 Small Task #1
Extras
Scenario
Tip
Important links:
The main site: https://students.mimuw.edu.pl/ZSO/
Labs: https://students.mimuw.edu.pl/ZSO/PUBLIC-SO/2024-2025/
Slack: https://app.slack.com (you should get an invitation)
QEMU
We are going to use QEMU for virtualization in this course. Using QEMU makes debugging kernel changes much simpler. While it is not strictly necessary for solving the first tasks, your solution is required to compile there. (More on that later.)
Important
Please verify during this week that you can use QEMU with the provided image,
as some of you may have trouble with your machine (especially on non-x86 CPUs or with the hypervisor disabled).
The safe bet is to use the computers in the labs, as you cannot run long-running processes on the students
machine.
If you're stuck, feel free to ask us questions on Slack!
You'll find details on starting QEMU here: QEMU.
Executable Files
You may wonder what makes a file executable. For starters, it is a file permission flag, which instructs the operating system to treat the file as such.
You might have seen executable text files with a "magic" shebang (#!) at the start.
For instance, a file with the following content would be "executed" with a Python interpreter:
#!/usr/bin/python3
print(2)
But, ultimately, there must be a program natively executed by the processor.
On Linux, and many other platforms, such programs are typically stored in a standard container format: ELF.
Such files start with the magic byte sequence \x7fELF (the byte 0x7f followed by the ASCII characters E, L, F).
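As a minimal sketch of the magic check described above, the following C helper tests whether a buffer begins with the four ELF magic bytes. The function name has_elf_magic is made up for this illustration; it is not part of any library.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if buf starts with the 4-byte ELF magic, 0 otherwise.
 * has_elf_magic is a hypothetical helper name used only in this example. */
int has_elf_magic(const unsigned char *buf, size_t len)
{
    /* 0x7f followed by the ASCII letters 'E', 'L', 'F' */
    static const unsigned char magic[4] = { 0x7f, 'E', 'L', 'F' };

    return len >= sizeof magic && memcmp(buf, magic, sizeof magic) == 0;
}
```

To try it on a real file, read the first four bytes (for instance with fread) and pass them to the helper. A shebang script such as the Python example above should be rejected, because it starts with the bytes "#!" instead.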
Two main utilities to inspect ELF files are objdump and readelf.
In order to dump information about ELF structures, run:
readelf -a <file>
To see even more, including disassembled code, section table, symbols, and other data about a binary file:
objdump -xtrds <file>
If you prefer GUI applications, you may try one of many reverse engineering toolkits.
For instance, iaito from the Radare toolkit.
ELF
Hands-on
Compile a simple C program like the following:
int main()
{
    int x = 7;
    return x;
}
with gcc main.c -o main
and inspect it with the above commands.
ELF is a file format used in Linux (and many other systems) for programs,
shared libraries (.so), intermediate build results (.o), and memory
dump files (core).
Although the basic features of the ELF format are always the same, there are many elements depending on the processor architecture and sometimes on the operating system. Here we will only deal with the ELF format on the x86 architecture in the Linux system.
Basic structure
At the basic level, ELF files are composed of 4 areas:
ELF header (at the beginning of the file): contains information about the parameters of the file and the machine it is intended for, as well as information about the position of section and program headers
section headers: each header describes the type and location of one section. A section is a contiguous block of memory with uniform attributes, and can represent both code and metadata.
program headers: each header describes the type and location of one segment. A segment is a contiguous block of memory with uniform purpose and attributes from the point of view of loading and running the program.
contents of sections / segments
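The role of the ELF header as an index to the other areas can be sketched in C. The example below assumes a 64-bit ELF file, a host with the same byte order as the file (true on x86), and the Elf64_Ehdr type from the Linux <elf.h> header. The function names are hypothetical and exist only for this illustration.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/* Copies the ELF64 header out of buf. Returns 0 on success, -1 if the
 * buffer is too short, lacks the magic, or is not a 64-bit ELF file.
 * (Hypothetical helper; assumes file and host byte order match.) */
int read_elf64_header(const unsigned char *buf, size_t len, Elf64_Ehdr *out)
{
    if (len < sizeof *out)
        return -1;
    memcpy(out, buf, sizeof *out);      /* copy to avoid unaligned access */
    if (memcmp(out->e_ident, ELFMAG, SELFMAG) != 0)
        return -1;
    if (out->e_ident[EI_CLASS] != ELFCLASS64)
        return -1;
    return 0;
}

/* File offset of the i-th program header, as described by the ELF header. */
unsigned long long phdr_offset(const Elf64_Ehdr *eh, unsigned i)
{
    return eh->e_phoff + (unsigned long long)i * eh->e_phentsize;
}

/* File offset of the i-th section header, as described by the ELF header. */
unsigned long long shdr_offset(const Elf64_Ehdr *eh, unsigned i)
{
    return eh->e_shoff + (unsigned long long)i * eh->e_shentsize;
}
```

The fields e_phnum and e_shnum of the same structure give the number of program and section headers. These are the values readelf -h prints as "Number of program headers" and "Number of section headers".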
Types of ELF files
Whether sections / program headers may or must be present depends on the type of ELF file. There are 4 types of ELF files:
ET_REL (relocatable file): A compiled, but not yet linked file (.o). Usually created as the result of compiling a single source file. It is not possible to run it directly -- these files are an intermediate stage of compilation and are combined by the linker (the ld program) into executable files or dynamic libraries.
ET_EXEC (executable file): A compiled and fully linked program, usually created by linking .o files with a linker. Such a file is ready to be launched -- all segments have a fixed address at which they will be available while the program runs.
ET_DYN (shared object file): A compiled and linked dynamic library (.so).
ET_CORE (core file): A process memory dump, created when a process is killed by certain signals.
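The file type is stored in the e_type field of the ELF header. A small sketch of mapping that field to the names above, using the ET_* constants from the Linux <elf.h> header (elf_type_name is a made-up helper name):

```c
#include <elf.h>
#include <stdint.h>

/* Maps the e_type field of an ELF header to the name of its file type.
 * elf_type_name is a hypothetical helper used only for illustration. */
const char *elf_type_name(uint16_t e_type)
{
    switch (e_type) {
    case ET_REL:  return "ET_REL";   /* relocatable file (.o) */
    case ET_EXEC: return "ET_EXEC";  /* executable with fixed addresses */
    case ET_DYN:  return "ET_DYN";   /* shared object (.so) */
    case ET_CORE: return "ET_CORE";  /* core dump */
    default:      return "other";    /* ET_NONE or OS/processor-specific */
    }
}
```

Note that many current distributions configure gcc to build position-independent executables by default, so the main binary from the hands-on may well be reported as ET_DYN rather than ET_EXEC. readelf -h shows the type in its "Type:" line.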
Hint
We'll discuss how programs are composed from ET_REL and ET_DYN files in the next labs.
Sections
The standard section names used for regular code in C are:
.text: code section
.rodata: read-only data section (const int x = 3;)
.data: data section (int x = 3;)
.bss: the zeroed data section (int x = 0;)
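To see this mapping in practice, a file like the one below can be compiled with gcc -c and inspected with objdump -t or readelf -S. The placement noted in the comments is what gcc typically does on Linux, not a guarantee, and all names are invented for this sketch.

```c
/* Global objects and the sections gcc typically places them in on Linux.
 * All identifiers here are hypothetical, chosen for illustration. */
const int ro_value = 3;   /* read-only, initialized: usually .rodata */
int init_value = 3;       /* writable, non-zero initializer: usually .data */
int zero_value = 0;       /* writable, zero-initialized: usually .bss */

/* Machine code of functions goes to .text. */
int sum_values(void)
{
    return ro_value + init_value + zero_value;
}
```

In the objdump -t output, each symbol line names the section that contains the object, which makes it easy to compare against the table above.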
You will also typically find other sections such as:
.comment and .note: these are pure metadata about the file
.symtab: describes symbols (see below)
.strtab: contains string values used by .symtab
.eh_frame: unwinding tables (think: C++ exceptions)
.init, .init_array, .fini, and .fini_array: constructors and destructors executed before and after the main function, respectively
.got, .plt, .plt.got, .dynamic, .rela.dyn, .dyn*: relocations and position-independent helpers
Symbols and references
One of the main tasks of the ELF format is storing information about objects
contained in the file and about references to external objects. By object,
we mean a function or a (global) variable. From the ELF point of view,
an object is simply an area within a section (ET_REL) or the address space
of a program (ET_EXEC, ET_DYN).
Symbols are names assigned to objects. A symbol can be defined (assigned to an object in a given file) or undefined (it will be resolved at link time against a file that defines it).
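The difference between defined and undefined symbols can be observed on a small translation unit. The sketch below uses invented names (defined_counter, twice, report). The only external reference is to puts from the C library, which is declared in <stdio.h> but defined elsewhere.

```c
#include <stdio.h>

int defined_counter = 5;        /* defined symbol, visible to other files */

static int twice(int v)         /* defined symbol, local to this file */
{
    return 2 * v;
}

int report(void)                /* defined symbol, visible to other files */
{
    /* puts is only declared here, so the .o file refers to it through
     * an undefined symbol that the linker resolves against libc. */
    puts("report called");
    return twice(defined_counter);
}
```

After compiling with gcc -c (without optimization, so that twice is not inlined away), readelf -s or nm on the resulting .o file should list puts as undefined and the other three names as defined.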
Running statically linked programs
Hands-on
Take the simple main.c file from the previous example and compile it statically with:
gcc main.c -o main -static
Run it under a debugger (e.g., gdb ./main) and single-step it from the very beginning (the _start symbol),
then skip to the main function.
With gdb in a terminal, you may find these commands useful:
layout asm: split the screen to show the assembly view
b _start: set a breakpoint at the entry point of the file
ni/si: next instruction (steps over calls); step instruction (steps into calls)
In the case of statically linked programs, the entire program initialization process is performed by the kernel. The kernel reads the ELF header and program headers, and loads all segments into memory. Then it creates the initial state of the program:
the main thread stack is allocated
on the main thread stack, the following things are placed:
    program arguments (argc, argv)
    environment variables (environ)
    auxiliary vector (auxv)
the instruction pointer is set to the entry point of the program (taken from the ELF header). With the standard compilation process, this field is set by the linker to the address of the _start symbol
the program starts running
Note that _start is not a function -- it does not use the standard
parameter passing convention, nor can it return. The standard implementation
of _start passes the parameters to the main() function, and then
executes exit() with the value returned by main() as a parameter.
Function calling
Hands-on
Create a C file with contents like:
#include <stdio.h>
int f(int b) {
    return b + b;
}
char g(int a, int b, int c) {
    if (a > b)
        return 'a';
    if (b > c)
        return 'b';
    return 'c';
}
int main() {
    printf("f(42) is %d\n", f(42));
}
and compile it to assembly with:
gcc foo.c -S -fno-asynchronous-unwind-tables # this flag simplifies the output
and subsequently to a binary with:
gcc foo.s -o foo
Check that the binary works.
Now, change the assembly file so that the g function is called instead, and a proper format string is used.
You'll find details about the function calling convention below.
If you don't finish it during the lab, you can submit it until the end of the day.
Calling convention on x86 architecture on Linux
The i386 architecture basically has 7 general-purpose registers: %eax,
%ecx, %edx, %ebx, %ebp, %esi, %edi. In addition,
the %esp stack pointer and the %eflags register are also
available to user programs.
The x86_64 architecture extends all of these registers to 64 bits (%rax,
%rcx, %rdx, %rbx, %rbp, %rsi, %rdi, %rsp,
%rflags), and adds 8 new general-purpose registers (%r8 - %r15).
The standard calling conventions for i386 architecture are as follows:
the stack grows down, %esp indicates the top of the stack, which is the smallest address currently in use by the program. Any address on the stack smaller than %esp can be overwritten at any time (e.g., when a signal handler is invoked).
at the entry point to a function (i.e., immediately after executing the call instruction), %esp = -4 (mod 16), i.e., %esp+4 is 16-byte aligned, and the word at the top of the stack (at %esp) is the return address from the function
the function should return by removing the return address from the stack (increasing %esp by 4) and jumping to it. This is usually done with the ret instruction.
the contents of registers %ebx, %ebp, %esi, %edi after returning from the function must be equal to their contents at the moment it was called -- the function must either save and restore the value of these registers, or not use them at all
the contents of the registers %eax, %ecx, %edx, %eflags can be changed by a function without any consequences
if the function uses parameters, they are passed on the stack, starting at %esp+4 (i.e., immediately after the return address). The function is to leave them there -- only the return address is removed from the stack
if the function returns a value, it is stored in %eax.
And for x86_64:
the stack grows down, %rsp indicates the top of the stack. The 128 bytes below the top of the stack constitute the so-called red zone, i.e., an area that the function can use and that will not be overwritten, despite being located below the stack pointer (only the area below the red zone can be overwritten by a signal handler). This area is useful in functions that do not call other functions (so-called leaf functions), because it avoids moving the stack pointer if the function does not need a lot of space.
at the entry point to a function, %rsp = 8 (mod 16), i.e., %rsp+8 is 16-byte aligned, and the word at the top of the stack (at %rsp) is the return address from the function.
the function should return by removing the return address from the stack (increasing %rsp by 8) and jumping to it. This is usually done with the ret instruction.
the content of the stack below %rsp at the entrance to the function can be freely modified by it, and the stack above should not be modified.
the contents of the registers %rbx, %rbp, %r12 - %r15 after returning from the function must be equal to their contents at the moment it was called
the contents of registers %rax, %rcx, %rdx, %rsi, %rdi, %r8 - %r11 can be changed by the function without any consequences
parameters to the function are passed in the following registers (in order): %rdi, %rsi, %rdx, %rcx, %r8, %r9. If the function takes more than 6 parameters, the remaining ones are passed on the stack starting at %rsp+8. The function is to leave them there -- only the return address is removed from the stack.
if the function returns a value, it returns it in %rax.
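The x86_64 rules above can be checked against compiler output with a function that takes more than 6 integer parameters. In this sketch (sum8 is an invented name), the comments state where each argument is expected to be at function entry according to the list above.

```c
/* Where the x86_64 convention puts each argument at function entry.
 * sum8 is a hypothetical function used only for illustration. */
long sum8(long a, long b, long c, long d,  /* %rdi, %rsi, %rdx, %rcx */
          long e, long f,                  /* %r8, %r9 */
          long g, long h)                  /* stack: %rsp+8, %rsp+16 */
{
    /* The result is returned to the caller in %rax. */
    return a + b + c + d + e + f + g + h;
}
```

Compiling this with gcc -S -O1 -fno-asynchronous-unwind-tables should give a short listing in which the first six values are read from registers and the last two from the stack, which is a convenient reference while solving the hands-on task.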
The above list does not include passing parameters and returning values other than ints / pointers or stranger x86 registers. For more details, we refer to psABI-i386 and psABI-x86_64.
Literature
Linker and Libraries Guide, 2004, Sun Microsystems (chapters 7 and 8) - http://docs.oracle.com/cd/E19683-01/817-3677/817-3677.pdf
psABI-i386: http://www.uclibc.org/docs/psABI-i386.pdf
man dlsym, dlopen
ELF handling for the thread local storage: www.akkadia.org/drepper/tls.pdf
The DWARF debugging standard: http://www.dwarfstd.org/