1 The Basics
This section covers the
beginning of understanding the VM system. It largely consists of material
that is obvious to the average kernel hacker but not as obvious to the wannabe
who just found http://www.kernelnewbies.org. Parts of it are ix86
specific, but this author doesn't have access to any other architecture and so
has no interest in guessing what the arch-specific code does.
1.1 Segmentation
If you have never heard of
segmentation, none of this will make an iota of sense. You need a book on the
basics of Intel hardware.
Linux tries to use segmentation only when
required, but it's still important to have a good understanding of what is going
on. There are three types of addresses talked about in this section:
logical, linear and physical. A logical address consists of two parts, the
segment identifier (aka the segment selector) and an offset inside that
segment. Put a logical address through the segmentation unit and a 32
bit linear address will be revealed, which can address up to 4GB of memory.
Put a linear address through the paging unit and a physical address is
revealed, the actual physical location of the memory on the chip.
To help retrieve
segments, there are six registers whose sole purpose in life is to store a
segment selector. They are cs (code segment), ss (stack segment),
ds (data segment), es, fs and gs (three general-use registers). The cs register
has a 2 bit special field that denotes the Current Privilege Level (CPL)
of the CPU. Linux only uses levels 0 and 3: 0 for kernel space and 3 for
user.
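As a sketch of how a selector encodes this, here is an illustrative user-space decoding of the selector format (bits 0-1 are the privilege level, bit 2 the table indicator, bits 3-15 the table index). The helper names are made up for this example; they are not kernel functions.

```c
#include <assert.h>

/* A segment selector is 16 bits: bits 0-1 are the Requested
 * Privilege Level (for cs this is the CPL), bit 2 selects
 * GDT (0) or LDT (1), and bits 3-15 index into that table. */
static unsigned int selector_rpl(unsigned short sel)   { return sel & 0x3; }
static unsigned int selector_table(unsigned short sel) { return (sel >> 2) & 0x1; }
static unsigned int selector_index(unsigned short sel) { return sel >> 3; }
```

For instance, the i386 Linux user code segment selector 0x23 decodes to GDT entry 4 at privilege level 3, and the kernel code segment selector 0x10 to GDT entry 2 at privilege level 0, matching the GDT layout described below.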
Each segment is represented by an 8 byte segment descriptor. It is
laid out as follows:

Field       | Bit range (size) | Function
Limit       | 0 - 15 (16)      | The segment length in bytes unless the granularity flag is set. When set, the length is in multiples of 4096 bytes. The full limit is 20 bits; for some reason the remaining 4 bits are stored further into the descriptor
Base        | 16 - 39 (24)     | The first 24 bits of the linear address pointing to the segment
Type        | 40 - 43 (4)      | There are four types: code segment, data segment, task state segment (TSS) and local descriptor table descriptor (LDTD)
S Flag      | 44 (1)           | Clear if this is a system segment (a TSS or LDT); set for ordinary code and data segments
DPL         | 45 - 46 (2)      | Descriptor Privilege Level; the CPL must be less than or equal to this to access the segment
Present     | 47 (1)           | Always 1
Limit       | 48 - 51 (4)      | The other 4 bits of the Limit field
AVL         | 52 (1)           | Not used by Linux
Reserved    | 53 (1)           | Reserved by Intel
B Flag      | 54 (1)           | Set depending on the segment content
Granularity | 55 (1)           | When clear, the limit is in bytes; otherwise it's in multiples of 4096
Base        | 56 - 63 (8)      | The remaining 8 bits of the base address
Segment descriptors are contained in either the
Global Descriptor Table (GDT), of which one exists in the system, or
Local Descriptor Tables (LDTs). Every process can have its own LDT if
requested, but in Linux processes generally share the same one. Most processes
just use the default_ldt (include/asm-i386/desc.h). The address of the GDT
is contained in a special register called the gdtr. The currently used LDT
(generally only one in Linux) is contained in the (surprise) ldtr.
The
first entry in the GDT is 0 so that references through a NULL selector will
result in an exception. The GDT can store 2^13 (8192) descriptors, minus one
for the NULL descriptor. As Linux doesn't really use segmentation, the number
of defined segments is well below the 8192 limit.
Once upon a time the TSS and LDT segments for each
process were kept in the GDT. This limited the number of processes, which was a
Bad Thing. In these modern times, the task_struct
(include/linux/sched.h) contains an arch-specific struct called the
thread_struct (include/asm-i386/processor.h). All the register information is
kept here and loaded in during a context switch. See __switch_to in
arch/i386/kernel/process.c .
The layout of the GDT is described at the
top of include/asm-i386/desc.h . The important descriptors are
- 0 NULL
- 2 Kernel code segment
- 3 Kernel data segment
- 4 User code segment
- 5 User data segment
- 8-11 APM BIOS support
- (12 + CPU_ID * 4) TSS Segment for this CPU
- (13 + CPU_ID * 4) LDT Segment for this CPU
The TSSs and LDTs are
spread out a bit so they land on different cache lines. Using different cache
lines will be discussed in detail later, but the short version is that if they
are spread out, the different TSSs won't keep evicting each other from the
CPU's L1 or L2 cache.
Note that all processes have the same code and data segments. This
is because Linux favours using paging. One reason is that life is easier when
all processes have the same linear address space. The second is that not all
architectures support segmentation but everyone supports paging. Linux goes for
the common denominator, which brings us to the next section.
1.2 Page Tables
As above, if you haven't
heard of paging, this will make no sense if you're lucky, or cause anger and
confusion if you aren't. Instead you need a book on basic OS
design.
Linux uses a 3 level paging mechanism even when the machine it's
running on doesn't support 3 level paging. This is to support 64 bit
architectures. The three levels of the page tables are called the Page Global
Directory, the Page Middle Directory and the Page Table. It is
important to note that Linux presents a paging interface to the higher layers
even if the hardware doesn't support paging at all.
Each process has its
own PGD. It is an array of pgd_t items (include/asm/page.h)
referenced by the mm_struct contained in the task_struct
(include/linux/sched.h). On Intel, the address of the current PGD is contained
in the cr3 register, loaded from the TSS segment.
This crappy diagram
might help you figure out how the tables relate to each other and what macros
are used to get places. To access all the macros, both page.h and
pgtable.h must be included.
struct task_struct {             struct mm_struct {
    ...                              ...
    struct mm_struct *mm ------->    pgd_t *pgd ------------|
    ...                              ...                    |
}                                }                          |
                                                            |
      PGD Table                                             |
       pgd_t                                                |
       ....                                                 |
       pgd_t-| <---------pgd_offset(mm_struct, addr)--------|
       ....  |
       ....  |
             |-pmd_offset(pgd_t, addr)
             |
             |     PMD
             |    pmd_t
             |    ....
             |--> pmd_t --pte_offset(pmd_t, addr)
                  ....   |
                         |     PTE
                         |    pte_t        struct page {
                         |    ....           ...
                         |--> pte_t -->      virtual  /* Kernel virtual address */
                              ....           ...
                                           }
The size of each table is given by PTRS_PER_PGD, PTRS_PER_PMD and
PTRS_PER_PTE. In systems with only two levels of paging, PTRS_PER_PMD
is 1. When moving through or altering the page table entries, the spinlock
page_table_lock in the mm_struct must be held.
To avoid
hardware-specific defines when traversing the page tables, the
pmd_offset macro simply returns the pgd_t entry it was given, so the compiler
optimises out references to the page middle directory.
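That folding trick can be modelled in a few lines of user-space C. The toy_ types below are stand-ins for the kernel's, not the real definitions, but they show why the middle level costs nothing on two-level hardware: pmd_offset just hands back the PGD entry it was given.

```c
#include <assert.h>

/* Toy model (not the kernel's actual types) of how the PMD level
 * folds away on two-level hardware: pmd_offset() returns the PGD
 * entry itself, so the three-level walk compiles down to two. */
typedef struct { unsigned long val; } toy_pgd_t;
typedef toy_pgd_t toy_pmd_t;            /* the PMD "entry" is the PGD entry */

#define TOY_PTRS_PER_PMD 1              /* one entry: the folded level */

static toy_pmd_t *toy_pmd_offset(toy_pgd_t *pgd, unsigned long addr)
{
    (void)addr;                         /* only one PMD slot, addr unused */
    return (toy_pmd_t *)pgd;
}

/* Demonstrate that the walk through the "PMD" lands on the PGD entry. */
static int toy_pmd_folds(void)
{
    toy_pgd_t entry = { 42 };
    return toy_pmd_offset(&entry, 0x1234) -> val == 42;
}
```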
1.3 Translating Addresses
So that a
programmer doesn't need to know exactly how an address is laid out, macros are
available to work out what an address means.
This is a short list which
tries to explain what each of the macros, defines and functions is and what
it means.
macro/define | Declared in include/ | Purpose
PAGE_SHIFT   | asm/page.h    | The number of bits used for the offset within a page
PGDIR_SHIFT  | asm/pgtable.h | Number of bits to shift to reach the end of the PGD element of a linear address
PMD_SHIFT    | asm/pgtable.h | Number of bits to shift to reach the end of the PMD element of a linear address. Same as PGDIR_SHIFT on Intel
PGDIR_MASK   | asm/pgtable.h | Bit mask that strips everything below the PGD boundary
PMD_MASK     | asm/pgtable.h | Bit mask that strips everything below the PMD boundary
PAGE_MASK    | asm/page.h    | Bit mask that strips the offset within the page, leaving the page-aligned part of the address
Broadly speaking, an address looks like
the following.
             |--------------------------Linear Address--------------------------|
Which table:       PGD              PMD               PTE              Offset
Part sizes:  >--PGDIR_SIZE--<>--PMD_SIZE--<>--log2(PTRS_PER_PTE)--<>--PAGE_SIZE--<
Bit shifts:                  <--------------------PGDIR_SHIFT-------------------->
                                           <---------------PMD_SHIFT------------>
                                                                   <-PAGE_SHIFT->
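To make the shifts concrete, here is an illustrative decoding of a linear address using the standard non-PAE i386 values (PAGE_SHIFT of 12 and PGDIR_SHIFT of 22). The TOY_ names are stand-ins for this sketch, not the kernel's macros.

```c
#include <assert.h>

/* Illustrative decomposition of a 32-bit linear address on plain
 * (non-PAE) i386. The PMD level is folded away, so an address is
 * just PGD index | PTE index | offset. */
#define TOY_PAGE_SHIFT  12
#define TOY_PGDIR_SHIFT 22
#define TOY_PAGE_MASK   (~((1UL << TOY_PAGE_SHIFT) - 1))

/* Top 10 bits: which pgd_t entry in the PGD table. */
static unsigned long toy_pgd_index(unsigned long addr)
{
    return addr >> TOY_PGDIR_SHIFT;
}

/* Middle 10 bits: which pte_t entry in the page table. */
static unsigned long toy_pte_index(unsigned long addr)
{
    return (addr >> TOY_PAGE_SHIFT) &
           ((1UL << (TOY_PGDIR_SHIFT - TOY_PAGE_SHIFT)) - 1);
}

/* Bottom 12 bits: the byte offset within the page. */
static unsigned long toy_page_offset(unsigned long addr)
{
    return addr & ~TOY_PAGE_MASK;
}
```

For the kernel-space address 0xC0101234 this gives PGD index 768, PTE index 0x101 and offset 0x234.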
1.4 Page struct
The page struct is used to
keep track of all page frames in the system. Note the distinction between a
page and a page frame. A page frame is a physical "thing" in memory that can
store a page. A data structure called struct page exists for every page
frame in the system.
Up until relatively recently, the kernel used
logical addresses to refer to pages. This limited how much memory a 32 bit
machine could address to only 4GB of memory. This changed with the introduction
of things like Physical Address Extension (PAE) giving access to the high memory
area. Kernel functions that used to refer to pages with logical addresses now
must refer to them with pointers to a struct page. Obviously an
understanding of what's in this struct is beneficial.
Taken from
include/linux/mm.h, here is the struct in all its glory.
typedef struct page {
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct page *next_hash; /* Next page sharing our hash bucket in
the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
wait_queue_head_t wait; /* Page locked? Stand in line... */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
} mem_map_t;
Here is a rough description of some of the fields.
mapping   | The address_space it belongs to. References page lists, inodes, etc. See later sections
index     | Badly named. It's an offset into the page_cache_hash table. The hash is on inode and offset
next_hash | Next page in the page cache hash bucket
count     | The number of references to this page. The page is moved to the free list when this reaches 0
flags     | Flags associated with the page. See include/linux/mm.h for the list. They are prefixed with PG_ so they are easy to find
lru       | List head for whichever activity list the page is on, active or inactive
wait      | A list of processes waiting on the page if it is locked
virtual   | Kernel virtual address if the page is mapped, NULL if it isn't. Generally high-memory pages aren't and have to be kmapped in
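Since there is one struct page per page frame, a physical address maps to its struct page by simple shifting: the page frame number is the physical address divided by the page size, and it indexes the mem_map array. A sketch, assuming the i386 PAGE_SHIFT of 12; the toy_ names are illustrative, not kernel functions.

```c
#include <assert.h>

/* Sketch of the physical address <-> page frame number mapping that
 * mem_map indexing relies on: mem_map[pfn] is the struct page for
 * the frame starting at physical address pfn << PAGE_SHIFT. */
#define TOY_PAGE_SHIFT 12

static unsigned long toy_phys_to_pfn(unsigned long phys)
{
    return phys >> TOY_PAGE_SHIFT;      /* drop the in-page offset */
}

static unsigned long toy_pfn_to_phys(unsigned long pfn)
{
    return pfn << TOY_PAGE_SHIFT;       /* frame start address */
}
```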
1.5 Process Address Map
As well as
maintaining the architecture-independent page tables described above, the kernel
maintains additional information on each memory area that is mapped by a
process. Each process is assigned a struct mm_struct pointed to by the
task_struct. It is declared as follows in include/linux/sched.h
struct mm_struct {
struct vm_area_struct * mmap;
rb_root_t mm_rb;
struct vm_area_struct * mmap_cache;
pgd_t * pgd;
atomic_t mm_users;
atomic_t mm_count;
int map_count;
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock;
struct list_head mmlist;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
unsigned long cpu_vm_mask;
unsigned long swap_address;
unsigned dumpable:1;
mm_context_t context;
};
Pretty daunting at first look but relatively straightforward on closer
examination.
mmap                 | Pointer to the first vm_area_struct, each of which describes a single mapping in the address space
mm_rb                | Root of the red-black tree of vm_area_structs
mmap_cache           | The vm_area_struct that was found last. Avoids having to walk mmap every time
pgd                  | Pointer to the PGD for the process (see above)
mm_users             | How many references from user space there are
mm_count             | Total number of users
map_count            | The number of vm_area_structs on the list pointed to by mmap
mmap_sem             | Lock that must be held when changing the vm_area_struct list
page_table_lock      | Lock that must be held when traversing or changing the page tables
mmlist               | List of all mm_structs in the system. Protected by mmlist_lock
start_code, end_code | Where the code segment begins and ends
start_data, end_data | Data segment
arg_start, arg_end   | Where the arguments passed to the process live
env_start, env_end   | Environment variables
rss                  | Number of page frames assigned to the process
total_vm             | Self-explanatory
locked_vm            | Count of locked page frames
cpu_vm_mask          | A bitmask recording which CPUs have executed on this address space
swap_address         | Address to be swapped out next
dumpable             | Set if this process may be core dumped
context              | Current MMU context
With
this structure, all information about what memory a process is addressing can be
accessed.
The vm_area_structs refer to blocks of memory in the address
space that are being used. They refer to parts of the process address space
which require special handling by the page-fault handlers. There are three
areas which always exist:
Text  | Code and initialised data from the executable itself. Starts at 0x08048000
Heap  | Uninitialised data and the heap, starting after text
Stack | The stack. Grows down from __PAGE_OFFSET (default 0xC0000000)
To see these three areas, look at
/proc/PID/maps where PID is the process to be examined. A look at this file will
also show that shared libraries are mmaped in starting from 0x40000000.
A
vm_area_struct is defined as follows from include/linux/mm.h
struct vm_area_struct {
struct mm_struct * vm_mm;
unsigned long vm_start;
unsigned long vm_end;
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot;
unsigned long vm_flags;
rb_node_t vm_rb;
struct vm_area_struct *vm_next_share;
struct vm_area_struct **vm_pprev_share;
struct vm_operations_struct * vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
unsigned long vm_raend; /* XXX: put full readahead info here. */
void * vm_private_data; /* was vm_pte (shared mem) */
};
As with the mm_struct, this is daunting at first but mostly straightforward.
vm_mm           | Pointer to the mm_struct this vm_area belongs to
vm_start        | Starting virtual address
vm_end          | Ending virtual address (exclusive)
vm_next         | Pointer to the next vm_area_struct for this mm_struct. Singly linked list ordered by address
vm_page_prot    | A mapping from the vm_flags to the page protection bits
vm_flags        | Properties of the area. See include/linux/mm.h for details
vm_rb           | This vma's rb-tree node
vm_next_share   | Used to link together vm_area_structs that are shared between processes, the code section of a library for instance
vm_pprev_share  | Makes vm_next_share a doubly linked list
vm_ops          | Operations that can be performed on the vm_area. Needed to keep files on disk up to date
vm_pgoff        | Offset within the file that is mmaped
vm_file         | Pointer to the struct file that is mapped
vm_raend        | Information related to read-ahead
vm_private_data | Not sure. Was vm_pte (shared mem)
Note that there
are two ways to traverse the vm_area_structs for a process. They are kept
on a singly linked list (vm_next) ordered by address. In addition they are
stored in a Red-Black (rb) tree, an ordered balanced tree keyed on
address. This makes searching for a vm_area_struct really fast (O(log N) time),
which is important when handling page faults for instance. Obviously the penalty
is a slightly higher memory footprint. The specifics of RB trees are beyond the
scope of this document; see any half-decent data structures and algorithms book
on the subject. There is a bucket of helpful rb-related functions in
include/linux/rbtree.h so take a look there.
For the moment,
just bear in mind that when rb_nodes appear in the code, it's dealing with tree
traversal more often than not. The easiest-to-read code example for this is the
find_vma() function in mm/mmap.c .
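The lookup itself can be sketched in user space as follows. The toy_ names and the linear walk over vm_next are illustrative only; the real find_vma() searches the rb tree (and checks mmap_cache first), but it returns the same thing: the first area whose end lies above the address.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the vm_area_struct singly linked list and a
 * linear-search version of find_vma(). Note that, like the real
 * function, this returns the first area with vm_end above addr
 * even if addr falls in a gap before vm_start. */
struct toy_vma {
    unsigned long vm_start, vm_end;     /* [vm_start, vm_end) */
    struct toy_vma *vm_next;            /* next area, sorted by address */
};

static struct toy_vma *toy_find_vma(struct toy_vma *mmap, unsigned long addr)
{
    struct toy_vma *vma;
    for (vma = mmap; vma != NULL; vma = vma->vm_next)
        if (addr < vma->vm_end)         /* first area ending above addr */
            return vma;
    return NULL;                        /* no mapping covers or follows addr */
}

/* Two areas roughly like a process's text segment and stack. */
static int toy_find_vma_demo(void)
{
    struct toy_vma text  = { 0x08048000UL, 0x08050000UL, NULL };
    struct toy_vma stack = { 0xBFFF0000UL, 0xC0000000UL, NULL };
    text.vm_next = &stack;
    return toy_find_vma(&text, 0x08049000UL) == &text   /* inside text   */
        && toy_find_vma(&text, 0x20000000UL) == &stack  /* gap: next vma */
        && toy_find_vma(&text, 0xC0000000UL) == NULL;   /* above stack   */
}
```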
1.6 Relationship Between Structs
This rough
diagram attempts to describe how each of the structs described above connect to
each other. Remember that each vm_area_struct points back to its parent. Also
note that it's the linked-list relationship between the vm_area_structs that is
drawn here.
task_struct
| ---pmd /-pte->page
| / /
mm_struct ----> pgd----pmd---pte->page
/ | \ \
/ | \ ---pmd
/ | \
/ | \
vm_area_struct-->vm_area_struct-->vm_area_struct
| |
| |
struct file struct file
/ | \
/ | \
/ | \
page-->page-->page
1.7 Non-Contiguous Kernel Memory
Memory
allocated in the kernel using vmalloc is contiguous in the linear address space
but not in physical memory. vmalloc allocates the required number of pages and
maps them in order into the page tables. The upper portion of the address space
(the top of the 1GB kernel region on x86) is used for this, running from
VMALLOC_START to VMALLOC_END.
When vmalloc is called, a vm_struct is assigned to keep track of the
area. It is defined as
struct vm_struct {
unsigned long flags;
void * addr;
unsigned long size;
struct vm_struct * next;
};
- flags indicate the type of area, for example VM_ALLOC for vmalloc areas
- addr is the linear address this area starts from
- size is the PAGE_SIZE aligned size of the allocation
- next points to the next vm_struct in the list
vmalloc is described
in detail later in another section. The first area is assigned at VMALLOC_START,
about 8MB after the end of main memory. This gap is there to catch memory
overruns. Each area is also followed by one PAGE_SIZE of padding for similar
reasons. On an x86, this looks like:
3GB (PAGE_OFFSET)            one PAGE_SIZE of PAD each              4GB
 |                                   |   |                           |
 main_memory-->|---8MB---|--area--|PAD|--area--|PAD|--area--|--------|
                         |
                   VMALLOC_START
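The spacing rule in that diagram can be sketched as follows; the constants and toy_ names are illustrative, not the kernel's.

```c
#include <assert.h>

/* Sketch of the vmalloc address-space layout: each area is rounded
 * up to whole pages and followed by a one-page unmapped gap, so an
 * overrun faults instead of corrupting the next area. */
#define TOY_PAGE_SIZE 4096UL

/* Round a request up to whole pages, then add the guard page. */
static unsigned long toy_area_footprint(unsigned long request)
{
    unsigned long size = (request + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
    return size + TOY_PAGE_SIZE;        /* allocation plus PAD page */
}

/* Where the next area would start if this one begins at 'addr'. */
static unsigned long toy_next_area(unsigned long addr, unsigned long request)
{
    return addr + toy_area_footprint(request);
}
```

A 10000-byte request, for example, occupies three pages plus the guard page, so the next area starts 16KB further on.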
1.8 Page Faulting - Arch Dependent
This
section is very x86-centric in the initial stages but, like so many
other aspects of Linux, it's only at the lowest levels that it's architecture
specific.
On the i386, faults are handled in
arch/i386/mm/fault.c:do_page_fault. The state of the registers and an error
code are passed in. The fault address is passed in implicitly by being stored in
the cr2 register. In other architectures such as PPC, Sparc and Alpha, it's
passed in directly.
The code for page faulting is really heavily
commented so we won't go into detail here until we reach the non-arch-specific
stuff. To start, the following checks are made.
In-kernel vmalloc fault | Synchronize the current task's page tables with the reference page table
No context              | Occurs if in an interrupt or there is no user mm context. The appropriate exception handler is called if available
Now we are sure
there is a valid mm context to work with, so the vm_area_struct for this process
is looked up by calling find_vma, which simply traverses the RB tree of
vm_area_structs looking for the correct area. The types of fault that are
handled then are as follows.
Fault Type      | Action
No vma          | A vm_area_struct doesn't exist for this address. Send a SIGSEGV signal (segmentation fault)
Bottom of stack | Presuming the stack wasn't accessed too far down, the stack is expanded by calling expand_stack(). If that fails, a SIGSEGV is sent
Good area       | After checking that permission to access the page exists, handle_mm_fault() is called to bring the page in
Granted, this isn't particularly
detailed, but the code is really self-explanatory, the comments are ultra-clear
and it's only one function. On to the architecture-independent part....
1.9 Page Faulting - Arch Independent
Once the
arch-specific code has determined which vm_area_struct the fault occurred in
and that it is valid, mm/memory.c:handle_mm_fault() is
called.
At this point, it has been established that the area and address
are valid, so what remains is to bring in the page the fault occurred on.
The page_table_lock is acquired and pmd/pte entries are allocated if necessary.
If this fails (unlikely), the process is killed with extreme prejudice by the
arch-specific code. What is far more likely is that handle_pte_fault is
called.
There are four conditions that handle_pte_fault has to deal with,
depending on the type of fault.
Page not present && page never existed | do_no_page() is called, which tries to create a new page mapping. If it's a normal anonymous page, do_anonymous_page() allocates a page entry and adds it to the page tables. If it's a read-only fault, a system-wide zeroed page is placed in the pte. If write access is required, alloc_page is called and the appropriate properties are set.
Page not present && page does exist    | This page was swapped out at some stage, so do_swap_page() is called. This looks up the swap cache and uses the page if found; otherwise the page is swapped in from disk. The ins and outs of swapping will be covered in a later section.
Page present && writing page           | If it is allowable to write to the page, it is simply marked dirty and young (so it's less likely to get swapped out later), and the TLB is updated.
Page present and read-only && writing page | In this case, an attempt is being made to write to a page even though the pte says it can't be written to, so do_wp_page() is called. This can occur when Copy-On-Write is used, when forking a process for instance. The function has a lot of error checking in it but on close examination can be summarised as "call alloc_page for a new page, copy over the old page".
That covers basically what Linux
does during a page fault. As always, reading the code is the only way to be sure
of understanding; this is just a rough roadmap.
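The four-way decision table above can be sketched as a toy dispatch function. The TOY_ names are made up for this example; the real handle_pte_fault() in mm/memory.c does the actual work of do_no_page(), do_swap_page() and do_wp_page() rather than just naming it.

```c
#include <assert.h>

/* The four handle_pte_fault() cases as a toy decision function. */
enum toy_fault_action {
    TOY_DO_NO_PAGE,     /* not present, never existed: create a mapping */
    TOY_DO_SWAP_PAGE,   /* not present, was swapped out: swap it in     */
    TOY_MARK_DIRTY,     /* present and writable: mark dirty and young   */
    TOY_DO_WP_PAGE      /* present but read-only write: Copy-On-Write   */
};

static enum toy_fault_action toy_pte_fault(int present, int existed,
                                           int write, int writable)
{
    if (!present)
        return existed ? TOY_DO_SWAP_PAGE : TOY_DO_NO_PAGE;
    if (write && !writable)
        return TOY_DO_WP_PAGE;
    return TOY_MARK_DIRTY;
}
```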
1.10 Summary
This gives a beginning understanding of how
memory is laid out in Linux. Not all details are 100% correct, so be
careful.
This document was translated from LaTeX by HEVEA.