1 The Basics
This section covers the
beginning of understanding the VM system. It largely consists of material
that is obvious to the average kernel hacker but not as obvious to the wannabe
who just found http://www.kernelnewbies.org. Parts of it are ix86
specific, but this author doesn't have access to any other architecture and so
has no interest in guessing what the arch-specific code does.
1.1 Segmentation
If you have never heard of
segmentation, none of this will make an iota of sense. You need a book on the
basics of Intel hardware.
Linux tries to use segmentation only when
required, but it's still important to have a good understanding of what is going
on. There are three types of addresses talked about in this section:
logical, linear and physical. A logical address consists of two parts, the
segment identifier (aka the segment selector) and an offset inside that
segment. Put a logical address through the segmentation unit and a 32
bit linear address will be revealed, which can address up to 4GB of memory.
Put a linear address through the paging unit and a physical address is
revealed, the actual physical location of the memory on the chip.
To help retrieve
segments, there are six registers whose sole purpose in life is to store a
segment selector. They are cs (code segment), ss (stack segment),
ds (data segment), es, fs and gs (three general-use registers). The cs register
has a 2 bit special field that denotes the Current Privilege Level (CPL)
of the CPU. Linux only uses levels 0 and 3: 0 for kernel space and 3 for
user.
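As a sketch of how a selector encodes this, here is an illustrative user-space decoding of the selector format (bits 0-1 are the privilege level, bit 2 the table indicator, bits 3-15 the table index). The helper names are made up for this example; they are not kernel functions.

```c
#include <assert.h>

/* A segment selector is 16 bits: bits 0-1 are the Requested
 * Privilege Level (for cs this is the CPL), bit 2 selects
 * GDT (0) or LDT (1), and bits 3-15 index into that table. */
static unsigned int selector_rpl(unsigned short sel)   { return sel & 0x3; }
static unsigned int selector_table(unsigned short sel) { return (sel >> 2) & 0x1; }
static unsigned int selector_index(unsigned short sel) { return sel >> 3; }
```

For instance, the i386 Linux user code segment selector 0x23 decodes to GDT entry 4 at privilege level 3, and the kernel code segment selector 0x10 to GDT entry 2 at privilege level 0, matching the GDT layout described below.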
Each segment is represented by an 8 byte segment descriptor. It is
laid out as follows:

Field       | Bit range (size) | Function
Limit       | 0 - 15 (16)      | The segment length in bytes unless the granularity flag is set. When set, the length is in multiples of 4096 bytes. The full limit is 20 bits; for some reason the remaining 4 bits are stored further into the descriptor
Base        | 16 - 39 (24)     | The first 24 bits of the linear address pointing to the segment
Type        | 40 - 43 (4)      | There are four types: code segment, data segment, task state segment (TSS) and local descriptor table descriptor (LDTD)
S Flag      | 44 (1)           | Clear if this is a system segment (a TSS or LDT); set for ordinary code and data segments
DPL         | 45 - 46 (2)      | Descriptor Privilege Level; the CPL must be less than or equal to this to access the segment
Present     | 47 (1)           | Always 1
Limit       | 48 - 51 (4)      | The other 4 bits of the Limit field
AVL         | 52 (1)           | Not used by Linux
Reserved    | 53 (1)           | Reserved by Intel
B Flag      | 54 (1)           | Set depending on the segment content
Granularity | 55 (1)           | When clear, the limit is in bytes; otherwise it's in multiples of 4096
Base        | 56 - 63 (8)      | The remaining 8 bits of the base address
Segment descriptors are contained in either the
Global Descriptor Table (GDT), of which one exists in the system, or
Local Descriptor Tables (LDTs). Every process can have its own LDT if
requested, but in Linux processes generally share the same one. Most processes
just use the default_ldt (include/asm-i386/desc.h). The address of the GDT
is contained in a special register called the gdtr. The currently used LDT
(generally only one in Linux) is contained in the (surprise) ldtr.
The
first entry in the GDT is 0 so that references through a NULL selector will
result in an exception. The GDT can store 2^13 (8192) descriptors, minus one
for the NULL descriptor. As Linux doesn't really use segmentation, the number
of defined segments is well below the 8192 limit.
Once upon a time the TSS and LDT segments for each
process were kept in the GDT. This limited the number of processes, which was a
Bad Thing. In these modern times, the task_struct
(include/linux/sched.h) contains an arch-specific struct called the
thread_struct (include/asm-i386/processor.h). All the register information is
kept here and loaded in during a context switch. See __switch_to in
arch/i386/kernel/process.c .
The layout of the GDT is described at the
top of include/asm-i386/desc.h . The important descriptors are
- 0 NULL
- 2 Kernel code segment
- 3 Kernel data segment
- 4 User code segment
- 5 User data segment
- 8-11 APM BIOS support
- (12 + CPU_ID * 4) TSS Segment for this CPU
- (13 + CPU_ID * 4) LDT Segment for this CPU
The TSSs and LDTs are
spread out a bit so they land on different cache lines. Using different cache
lines will be discussed in detail later, but the short version is that if they
are spread out, the different TSSs won't keep evicting each other from the
CPU's L1 or L2 cache.
Note that all processes have the same code and data segments. This
is because Linux favours using paging. One reason is that life is easier when
all processes have the same linear address space. The second is that not all
architectures support segmentation but everyone supports paging. Linux goes for
the common denominator, which brings us to the next section.
1.2 Page Tables
As above, if you haven't
heard of paging, this will make no sense if you're lucky, or cause anger and
confusion if you aren't. Instead you need a book on basic OS
design.
Linux uses a 3 level paging mechanism even when the machine it's
running on doesn't support 3 level paging. This is to support 64 bit
architectures. The three levels of the page tables are called the Page Global
Directory, the Page Middle Directory and the Page Table. It is
important to note that Linux presents a paging interface to the higher layers
even if the hardware doesn't support paging at all.
Each process has its
own PGD. It is an array of pgd_t items (include/asm/page.h)
referenced by the mm_struct contained in the task_struct
(include/linux/sched.h). On Intel, the address of the current PGD is contained
in the cr3 register, loaded from the TSS segment.
This crappy diagram
might help you figure out how the tables relate to each other and what macros
are used to get places. To access all the macros, both page.h and
pgtable.h must be included.
struct task_struct {             struct mm_struct {
    ...                              ...
    struct mm_struct *mm ------->    pgd_t *pgd ------------|
    ...                              ...                    |
}                                }                          |
                                                            |
      PGD Table                                             |
       pgd_t                                                |
       ....                                                 |
       pgd_t-| <---------pgd_offset(mm_struct, addr)--------|
       ....  |
       ....  |
             |-pmd_offset(pgd_t, addr)
             |
             |     PMD
             |    pmd_t
             |    ....
             |--> pmd_t --pte_offset(pmd_t, addr)
                  ....   |
                         |     PTE
                         |    pte_t        struct page {
                         |    ....           ...
                         |--> pte_t -->      virtual  /* Kernel virtual address */
                              ....           ...
                                           }
The size of each table is given by PTRS_PER_PGD, PTRS_PER_PMD and
PTRS_PER_PTE. In systems with only two levels of paging, PTRS_PER_PMD
is 1. When moving through or altering the page table entries, the spinlock
page_table_lock in the mm_struct must be held.
To avoid
hardware-specific defines when traversing the page tables, the
pmd_offset macro simply returns the pgd_t entry it was given, so the compiler
optimises out references to the page middle directory.
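That folding trick can be modelled in a few lines of user-space C. The toy_ types below are stand-ins for the kernel's, not the real definitions, but they show why the middle level costs nothing on two-level hardware: pmd_offset just hands back the PGD entry it was given.

```c
#include <assert.h>

/* Toy model (not the kernel's actual types) of how the PMD level
 * folds away on two-level hardware: pmd_offset() returns the PGD
 * entry itself, so the three-level walk compiles down to two. */
typedef struct { unsigned long val; } toy_pgd_t;
typedef toy_pgd_t toy_pmd_t;            /* the PMD "entry" is the PGD entry */

#define TOY_PTRS_PER_PMD 1              /* one entry: the folded level */

static toy_pmd_t *toy_pmd_offset(toy_pgd_t *pgd, unsigned long addr)
{
    (void)addr;                         /* only one PMD slot, addr unused */
    return (toy_pmd_t *)pgd;
}

/* Demonstrate that the walk through the "PMD" lands on the PGD entry. */
static int toy_pmd_folds(void)
{
    toy_pgd_t entry = { 42 };
    return toy_pmd_offset(&entry, 0x1234) -> val == 42;
}
```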
1.3 Translating Addresses
So that a
programmer doesn't need to know exactly how an address is laid out, macros are
available to work out what an address means.
This is a short list which
tries to explain what each of the macros, defines and functions is and what
it means.
macro/define | Declared in include/ | Purpose
PAGE_SHIFT   | asm/page.h    | The number of bits used for the offset within a page
PGDIR_SHIFT  | asm/pgtable.h | Number of bits to shift to reach the end of the PGD element of a linear address
PMD_SHIFT    | asm/pgtable.h | Number of bits to shift to reach the end of the PMD element of a linear address. Same as PGDIR_SHIFT on Intel
PGDIR_MASK   | asm/pgtable.h | Bit mask that strips everything below the PGD boundary
PMD_MASK     | asm/pgtable.h | Bit mask that strips everything below the PMD boundary
PAGE_MASK    | asm/page.h    | Bit mask that strips the offset within the page, leaving the page-aligned part of the address
Broadly speaking, an address looks like
the following.
             |--------------------------Linear Address--------------------------|
Which table:       PGD              PMD               PTE              Offset
Part sizes:  >--PGDIR_SIZE--<>--PMD_SIZE--<>--log2(PTRS_PER_PTE)--<>--PAGE_SIZE--<
Bit shifts:                  <--------------------PGDIR_SHIFT-------------------->
                                           <---------------PMD_SHIFT------------>
                                                                   <-PAGE_SHIFT->
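To make the shifts concrete, here is an illustrative decoding of a linear address using the standard non-PAE i386 values (PAGE_SHIFT of 12 and PGDIR_SHIFT of 22). The TOY_ names are stand-ins for this sketch, not the kernel's macros.

```c
#include <assert.h>

/* Illustrative decomposition of a 32-bit linear address on plain
 * (non-PAE) i386. The PMD level is folded away, so an address is
 * just PGD index | PTE index | offset. */
#define TOY_PAGE_SHIFT  12
#define TOY_PGDIR_SHIFT 22
#define TOY_PAGE_MASK   (~((1UL << TOY_PAGE_SHIFT) - 1))

/* Top 10 bits: which pgd_t entry in the PGD table. */
static unsigned long toy_pgd_index(unsigned long addr)
{
    return addr >> TOY_PGDIR_SHIFT;
}

/* Middle 10 bits: which pte_t entry in the page table. */
static unsigned long toy_pte_index(unsigned long addr)
{
    return (addr >> TOY_PAGE_SHIFT) &
           ((1UL << (TOY_PGDIR_SHIFT - TOY_PAGE_SHIFT)) - 1);
}

/* Bottom 12 bits: the byte offset within the page. */
static unsigned long toy_page_offset(unsigned long addr)
{
    return addr & ~TOY_PAGE_MASK;
}
```

For the kernel-space address 0xC0101234 this gives PGD index 768, PTE index 0x101 and offset 0x234.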
1.4 Page struct
The page struct is used to
keep track of all page frames in the system. Note the distinction between a
page and a page frame. A page frame is a physical "thing" in memory that can
store a page. A data structure called struct page exists for every page
frame in the system.
Up until relatively recently, the kernel used
logical addresses to refer to pages. This limited how much memory a 32 bit
machine could address to only 4GB of memory. This changed with the introduction
of things like Physical Address Extension (PAE) giving access to the high memory
area. Kernel functions that used to refer to pages with logical addresses now
must refer to them with pointers to a struct page. Obviously an
understanding of what's in this struct is beneficial.
Taken from
include/linux/mm.h, here is the struct in all its glory.
typedef struct page {
struct list_head list; /* ->mapping has some page lists. */
struct address_space *mapping; /* The inode (or ...) we belong to. */
unsigned long index; /* Our offset within mapping. */
struct page *next_hash; /* Next page sharing our hash bucket in
the pagecache hash table. */
atomic_t count; /* Usage count, see below. */
unsigned long flags; /* atomic flags, some possibly
updated asynchronously */
struct list_head lru; /* Pageout list, eg. active_list;
protected by pagemap_lru_lock !! */
wait_queue_head_t wait; /* Page locked? Stand in line... */
struct page **pprev_hash; /* Complement to *next_hash. */
struct buffer_head * buffers; /* Buffer maps us to a disk block. */
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
} mem_map_t;
Here is a rough description of some of the fields.
mapping   | The address_space it belongs to. References page lists, inodes, etc. See later sections
index     | Badly named. It's an offset into the page_cache_hash table. The hash is on inode and offset
next_hash | Next page in the page cache hash bucket
count     | The number of references to this page. The page is moved to the free list when this reaches 0
flags     | Flags associated with the page. See include/linux/mm.h for the list. They are prefixed with PG_ so they are easy to find
lru       | List head for whichever activity list the page is on, active or inactive
wait      | A list of processes waiting on the page if it is locked
virtual   | Kernel virtual address if the page is mapped, NULL if it isn't. Generally high-memory pages aren't and have to be kmapped in
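Since there is one struct page per page frame, a physical address maps to its struct page by simple shifting: the page frame number is the physical address divided by the page size, and it indexes the mem_map array. A sketch, assuming the i386 PAGE_SHIFT of 12; the toy_ names are illustrative, not kernel functions.

```c
#include <assert.h>

/* Sketch of the physical address <-> page frame number mapping that
 * mem_map indexing relies on: mem_map[pfn] is the struct page for
 * the frame starting at physical address pfn << PAGE_SHIFT. */
#define TOY_PAGE_SHIFT 12

static unsigned long toy_phys_to_pfn(unsigned long phys)
{
    return phys >> TOY_PAGE_SHIFT;      /* drop the in-page offset */
}

static unsigned long toy_pfn_to_phys(unsigned long pfn)
{
    return pfn << TOY_PAGE_SHIFT;       /* frame start address */
}
```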
1.5 Process Address Map
As well as
maintaining the architecture-independent page tables described above, the kernel
maintains additional information on each memory area that is mapped by a
process. Each process is assigned a struct mm_struct pointed to by the
task_struct. It is declared as follows in include/linux/sched.h
struct mm_struct {
struct vm_area_struct * mmap;
rb_root_t mm_rb;
struct vm_area_struct * mmap_cache;
pgd_t * pgd;
atomic_t mm_users;
atomic_t mm_count;
int map_count;
struct rw_semaphore mmap_sem;
spinlock_t page_table_lock;
struct list_head mmlist;
unsigned long start_code, end_code, start_data, end_data;
unsigned long start_brk, brk, start_stack;
unsigned long arg_start, arg_end, env_start, env_end;
unsigned long rss, total_vm, locked_vm;
unsigned long def_flags;
unsigned long cpu_vm_mask;
unsigned long swap_address;
unsigned dumpable:1;
mm_context_t context;
};
Pretty daunting at first look but relatively straightforward on closer
examination.
mmap                 | Pointer to the first vm_area_struct, each of which describes a single mapping in the address space
mm_rb                | Root of the red-black tree of vm_area_structs
mmap_cache           | The vm_area_struct that was found last. Avoids having to walk mmap every time
pgd                  | Pointer to the PGD for the process (see above)
mm_users             | How many references from user space there are
mm_count             | Total number of users
map_count            | The number of vm_area_structs on the list pointed to by mmap
mmap_sem             | Lock that must be held when changing the vm_area_struct list
page_table_lock      | Lock that must be held when traversing or changing the page tables
mmlist               | List of all mm_structs in the system. Protected by mmlist_lock
start_code, end_code | Where the code segment begins and ends
start_data, end_data | Data segment
arg_start, arg_end   | Where the arguments passed to the process live
env_start, env_end   | Environment variables
rss                  | Number of page frames assigned to the process
total_vm             | Self-explanatory
locked_vm            | Count of locked page frames
cpu_vm_mask          | A bitmask recording which CPUs have executed on this address space
swap_address         | Address to be swapped out next
dumpable             | Set if this process may be core dumped
context              | Current MMU context
With
this structure, all information about what memory a process is addressing can be
accessed.
The vm_area_structs refer to blocks of memory in the address
space that are being used. They refer to parts of the process address space
which require special handling by the page-fault handlers. There are three
areas which always exist:
Text  | Code and initialised data from the executable itself. Starts at 0x08048000
Heap  | Uninitialised data and the heap, starting after text
Stack | The stack. Grows down from __PAGE_OFFSET (default 0xC0000000)
To see these three areas, look at
/proc/PID/maps where PID is the process to be examined. A look at this file will
also show that shared libraries are mmaped in starting from 0x40000000.
A
vm_area_struct is defined as follows from include/linux/mm.h
struct vm_area_struct {
struct mm_struct * vm_mm;
unsigned long vm_start;
unsigned long vm_end;
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next;
pgprot_t vm_page_prot;
unsigned long vm_flags;
rb_node_t vm_rb;
struct vm_area_struct *vm_next_share;
struct vm_area_struct **vm_pprev_share;
struct vm_operations_struct * vm_ops;
/* Information about our backing store: */
unsigned long vm_pgoff; /* Offset (within vm_file) in PAGE_SIZE
units, *not* PAGE_CACHE_SIZE */
struct file * vm_file; /* File we map to (can be NULL). */
unsigned long vm_raend; /* XXX: put full readahead info here. */
void * vm_private_data; /* was vm_pte (shared mem) */
};
As with the mm_struct, this is daunting at first but mostly straightforward.
vm_mm           | Pointer to the mm_struct this vm_area belongs to
vm_start        | Starting virtual address
vm_end          | Ending virtual address (exclusive)
vm_next         | Pointer to the next vm_area_struct for this mm_struct. Singly linked list ordered by address
vm_page_prot    | A mapping from the vm_flags to the page protection bits
vm_flags        | Properties of the area. See include/linux/mm.h for details
vm_rb           | This vma's rb-tree node
vm_next_share   | Used to link together vm_area_structs that are shared between processes, the code section of a library for instance
vm_pprev_share  | Makes vm_next_share a doubly linked list
vm_ops          | Operations that can be performed on the vm_area. Needed to keep files on disk up to date
vm_pgoff        | Offset within the file that is mmaped
vm_file         | Pointer to the struct file that is mapped
vm_raend        | Information related to read-ahead
vm_private_data | Not sure. Was vm_pte (shared mem)
Note that there
are two ways to traverse the vm_area_structs for a process. They are kept
on a singly linked list (vm_next) ordered by address. In addition they are
stored in a Red-Black (rb) tree, an ordered balanced tree keyed on
address. This makes searching for a vm_area_struct really fast (O(log N) time),
which is important when handling page faults for instance. Obviously the penalty
is a slightly higher memory footprint. The specifics of RB trees are beyond the
scope of this document; see any half-decent data structures and algorithms book
on the subject. There is a bucket of helpful rb-related functions in
include/linux/rbtree.h so take a look there.
For the moment,
just bear in mind that when rb_nodes appear in the code, it's dealing with tree
traversal more often than not. The easiest-to-read code example for this is the
find_vma() function in mm/mmap.c .
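The lookup itself can be sketched in user space as follows. The toy_ names and the linear walk over vm_next are illustrative only; the real find_vma() searches the rb tree (and checks mmap_cache first), but it returns the same thing: the first area whose end lies above the address.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of the vm_area_struct singly linked list and a
 * linear-search version of find_vma(). Note that, like the real
 * function, this returns the first area with vm_end above addr
 * even if addr falls in a gap before vm_start. */
struct toy_vma {
    unsigned long vm_start, vm_end;     /* [vm_start, vm_end) */
    struct toy_vma *vm_next;            /* next area, sorted by address */
};

static struct toy_vma *toy_find_vma(struct toy_vma *mmap, unsigned long addr)
{
    struct toy_vma *vma;
    for (vma = mmap; vma != NULL; vma = vma->vm_next)
        if (addr < vma->vm_end)         /* first area ending above addr */
            return vma;
    return NULL;                        /* no mapping covers or follows addr */
}

/* Two areas roughly like a process's text segment and stack. */
static int toy_find_vma_demo(void)
{
    struct toy_vma text  = { 0x08048000UL, 0x08050000UL, NULL };
    struct toy_vma stack = { 0xBFFF0000UL, 0xC0000000UL, NULL };
    text.vm_next = &stack;
    return toy_find_vma(&text, 0x08049000UL) == &text   /* inside text   */
        && toy_find_vma(&text, 0x20000000UL) == &stack  /* gap: next vma */
        && toy_find_vma(&text, 0xC0000000UL) == NULL;   /* above stack   */
}
```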
1.6 Relationship Between Structs
This rough
diagram attempts to describe how each of the structs described above connect to
each other. Remember that each vm_area_struct points back to its parent. Also
note that it's the linked-list relationship between the vm_area_structs that is
drawn here.
task_struct
| ---pmd /-pte->page
| / /
mm_struct ----> pgd----pmd---pte->page
/ | \ \
/ | \ ---pmd
/ | \
/ | \
vm_area_struct-->vm_area_struct-->vm_area_struct
| |
| |
struct file struct file
/ | \
/ | \
/ | \
page-->page-->page
1.7 Non-Contiguous Kernel Memory
Memory
allocated in the kernel using vmalloc is contiguous in the linear address space
but not in physical memory. vmalloc allocates the required number of pages and
maps them in order into the page tables. The upper portion of the address space
(the top of the 1GB kernel region on x86) is used for this, running from
VMALLOC_START to VMALLOC_END.
When vmalloc is called, a vm_struct is assigned to keep track of the
area. It is defined as
struct vm_struct {
unsigned long flags;
void * addr;
unsigned long size;
struct vm_struct * next;
};
- flags indicate the type of area, for example VM_ALLOC for vmalloc areas
- addr is the linear address this area starts from
- size is the PAGE_SIZE aligned size of the allocation
- next points to the next vm_struct in the list
vmalloc is described
in detail later in another section. The first area is assigned at VMALLOC_START,
about 8MB after the end of main memory. This gap is there to catch memory
overruns. Each area is also followed by one PAGE_SIZE of padding for similar
reasons. On an x86, this looks like:
3GB (PAGE_OFFSET)            one PAGE_SIZE of PAD each              4GB
 |                                   |   |                           |
 main_memory-->|---8MB---|--area--|PAD|--area--|PAD|--area--|--------|
                         |
                   VMALLOC_START
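The spacing rule in that diagram can be sketched as follows; the constants and toy_ names are illustrative, not the kernel's.

```c
#include <assert.h>

/* Sketch of the vmalloc address-space layout: each area is rounded
 * up to whole pages and followed by a one-page unmapped gap, so an
 * overrun faults instead of corrupting the next area. */
#define TOY_PAGE_SIZE 4096UL

/* Round a request up to whole pages, then add the guard page. */
static unsigned long toy_area_footprint(unsigned long request)
{
    unsigned long size = (request + TOY_PAGE_SIZE - 1) & ~(TOY_PAGE_SIZE - 1);
    return size + TOY_PAGE_SIZE;        /* allocation plus PAD page */
}

/* Where the next area would start if this one begins at 'addr'. */
static unsigned long toy_next_area(unsigned long addr, unsigned long request)
{
    return addr + toy_area_footprint(request);
}
```

A 10000-byte request, for example, occupies three pages plus the guard page, so the next area starts 16KB further on.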
1.8 Page Faulting - Arch Dependent
This
section is very x86-centric in the initial stages but, like so many
other aspects of Linux, it's only at the lowest levels that it's architecture
specific.
On the i386, faults are handled in
arch/i386/mm/fault.c:do_page_fault. The state of the registers and an error
code are passed in. The fault address is passed in implicitly by being stored in
the cr2 register. In other architectures such as PPC, Sparc and Alpha, it's
passed in directly.
The code for page faulting is really heavily
commented so we won't go into detail here until we reach the non-arch-specific
stuff. To start, the following checks are made.
In-kernel vmalloc fault | Synchronize the current task's page tables with the reference page table
No context              | Occurs if in an interrupt or there is no user mm context. The appropriate exception handler is called if available
Now we are sure
there is a valid mm context to work with, so the vm_area_struct for this process
is looked up by calling find_vma, which simply traverses the RB tree of
vm_area_structs looking for the correct area. The types of fault that are
handled then are as follows.
Fault Type      | Action
No vma          | A vm_area_struct doesn't exist for this address. Send a SIGSEGV signal (segmentation fault)
Bottom of stack | Presuming the stack wasn't accessed too far down, the stack is expanded by calling expand_stack(). If that fails, a SIGSEGV is sent
Good area       | After checking that permission to access the page exists, handle_mm_fault() is called to bring the page in
Granted, this isn't particularly
detailed, but the code is really self-explanatory, the comments are ultra-clear
and it's only one function. On to the architecture-independent part....
1.9 Page Faulting - Arch Independent
Once the
arch-specific code has determined which vm_area_struct the fault occurred in
and that it is valid, mm/memory.c:handle_mm_fault() is
called.
At this point, it has been established that the area and address
are valid, so what remains is to bring in the page the fault occurred on.
The page_table_lock is acquired and pmd/pte entries are allocated if necessary.
If this fails (unlikely), the process is killed with extreme prejudice by the
arch-specific code. What is far more likely is that handle_pte_fault is
called.
There are four conditions that handle_pte_fault has to deal with,
depending on the type of fault.
Page not present && page never existed | do_no_page() is called, which tries to create a new page mapping. If it's a normal anonymous page, do_anonymous_page() allocates a page entry and adds it to the page tables. If it's a read-only fault, a system-wide zeroed page is placed in the pte. If write access is required, alloc_page is called and the appropriate properties are set.
Page not present && page does exist    | This page was swapped out at some stage, so do_swap_page() is called. This looks up the swap cache and uses the page if found; otherwise the page is swapped in from disk. The ins and outs of swapping will be covered in a later section.
Page present && writing page           | If it is allowable to write to the page, it is simply marked dirty and young (so it's less likely to get swapped out later), and the TLB is updated.
Page present and read-only && writing page | In this case, an attempt is being made to write to a page even though the pte says it can't be written to, so do_wp_page() is called. This can occur when Copy-On-Write is used, when forking a process for instance. The function has a lot of error checking in it but on close examination can be summarised as "call alloc_page for a new page, copy over the old page".
That covers basically what Linux
does during a page fault. As always, reading the code is the only way to be sure
of understanding; this is just a rough roadmap.
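The four-way decision table above can be sketched as a toy dispatch function. The TOY_ names are made up for this example; the real handle_pte_fault() in mm/memory.c does the actual work of do_no_page(), do_swap_page() and do_wp_page() rather than just naming it.

```c
#include <assert.h>

/* The four handle_pte_fault() cases as a toy decision function. */
enum toy_fault_action {
    TOY_DO_NO_PAGE,     /* not present, never existed: create a mapping */
    TOY_DO_SWAP_PAGE,   /* not present, was swapped out: swap it in     */
    TOY_MARK_DIRTY,     /* present and writable: mark dirty and young   */
    TOY_DO_WP_PAGE      /* present but read-only write: Copy-On-Write   */
};

static enum toy_fault_action toy_pte_fault(int present, int existed,
                                           int write, int writable)
{
    if (!present)
        return existed ? TOY_DO_SWAP_PAGE : TOY_DO_NO_PAGE;
    if (write && !writable)
        return TOY_DO_WP_PAGE;
    return TOY_MARK_DIRTY;
}
```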
1.10 Summary
This gives a beginning understanding of how
memory is laid out in Linux. Not all details are 100% correct, so be
careful.
This document was translated from LaTeX by HEVEA.