
Memory management
Buddy allocator, Slab allocator

Introduction

High-level overview of the Virtual Memory subsystem

High-level overview of the Virtual Memory subsystem
(source: N. Murray, N. Horman, Understanding Virtual Memory)

Based on kvmalloc() (Jonathan Corbet, January 2017)

The kernel offers two fundamental mechanisms for allocating memory, both of which are, in turn, built on top of the kernel's page allocator (zoned buddy allocator).

As a general rule, slab allocations are preferred for all but the largest of allocations.

Allocations with vmalloc() do not need physically contiguous pages and are thus much more likely to succeed when memory is tight.

There are a number of places in the kernel where a large allocation must be physically contiguous, but there are probably even more where that doesn't matter. In the latter case, the code doesn't have a reason to care which allocation method was used to obtain its memory, as long as the memory is available. For this sort of indifferent code, it can make sense to try an allocation first with the slab allocator, then fall back to vmalloc() should that attempt fail. And, indeed, the kernel is full of code fragments that do exactly that.
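
As an illustration of this pattern, here is a minimal sketch (the helper name try_alloc_buf() is hypothetical; current kernels provide kvmalloc() for exactly this purpose):

void *try_alloc_buf(size_t size)
{
        void *buf;

        /* Try the slab allocator first; suppress the allocation-failure
         * warning, since we have a fallback. */
        buf = kmalloc(size, GFP_KERNEL | __GFP_NOWARN);
        if (buf)
                return buf;

        /* Fall back to vmalloc(), which needs only virtually contiguous
         * pages and is therefore more likely to succeed. */
        return vmalloc(size);
}

Memory obtained this way should be freed with kvfree(), which handles both kmalloc()- and vmalloc()-allocated buffers.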


Managing free page frames - Buddy allocator

The kernel often needs contiguous memory areas that span multiple page frames, so when serving memory allocation requests, it must minimize external fragmentation. This is the task of the buddy allocator. The concept of the buddy allocator is to maintain a direct-mapped table for memory blocks of various orders. The bottom-level table contains the map for the smallest allocatable units of memory (here, pages), and each level above it describes pairs of units from the levels below - hence, buddies.
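
For example, if a request arrives for a block of order 2 (four contiguous page frames) and the smallest free block available is of order 4 (sixteen frames), the order-4 block is split into two order-3 buddies, one of which is split again into two order-2 buddies; one order-2 block is returned to the caller, and the remaining order-3 and order-2 blocks are placed on the corresponding free lists.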

Data structures

Linux uses a separate buddy allocator in each zone.

The free_area field of the zone descriptor is an array of structures of type struct free_area, with one entry per allocation order.


struct free_area  free_area[MAX_ORDER];

struct free_area {
        struct list_head        free_list[MIGRATE_TYPES];
        unsigned long           nr_free;
};

Linking blocks in the buddy allocator

Linking blocks in the buddy allocator (source: W. Mauerer, Professional Linux Kernel Architecture)

Information about a block being a free area of a certain size is kept in the page structure (while the page frame is free): the private field of the first page frame in a block of 2^k free page frames holds the number k, and the _mapcount field of the descriptor holds the value PAGE_BUDDY_MAPCOUNT_VALUE (-128).

The PageBuddy() function returns true when the page frame is managed by the buddy allocator.


/*
 * include/linux/page-flags.h
 *
 * PageBuddy() indicate that the page is free and in the buddy system
 * -128 can be created very efficiently by most CPU architectures.
 */
#define PAGE_BUDDY_MAPCOUNT_VALUE (-128)

static inline int PageBuddy(struct page *page)
{
        return atomic_read(&page->_mapcount) == PAGE_BUDDY_MAPCOUNT_VALUE;
}
static inline void __SetPageBuddy(struct page *page)
{
        VM_BUG_ON(atomic_read(&page->_mapcount) != -1);
        atomic_set(&page->_mapcount, PAGE_BUDDY_MAPCOUNT_VALUE);
}

static inline void __ClearPageBuddy(struct page *page)
{
        VM_BUG_ON(!PageBuddy(page));
        atomic_set(&page->_mapcount, -1);
}

The VM_BUG_ON() macro, which comes in a few variants, dumps out a bunch of information specific to the memory-management subsystem; it is thus useful for identifying memory-management bugs.


/*
 *  linux/mm/page_alloc.c
 *
 * This function checks whether a page is free && is the buddy
 * we can do coalesce a page and its buddy if
 * (a) the buddy is not in a hole (check before calling!) &&
 * (b) the buddy is in the buddy system &&
 * (c) a page and its buddy have the same order &&
 * (d) a page and its buddy are in the same zone.
 *
 * For recording whether a page is in the buddy system, we set ->_mapcount
 * PAGE_BUDDY_MAPCOUNT_VALUE.
 *
 * For recording page's order, we use page_private(page).
 */
static inline int page_is_buddy(struct page *page, struct page *buddy, unsigned int order)
{
        ...
        if (PageBuddy(buddy) && page_order(buddy) == order) {
                if (page_zone_id(page) != page_zone_id(buddy))
                        return 0;

                VM_BUG_ON(page_count(buddy) != 0);
                return 1;
        }
        return 0;
}

In the /proc/buddyinfo file, you can view the status of memory under the supervision of the buddy allocator. Each column shows the number of free blocks of a given size (order). In the case shown (my workstation) there are 12 blocks of size 2^2 * PAGE_SIZE in the DMA zone and 47 blocks of size 2^3 * PAGE_SIZE in the Normal zone.


Node 0, zone      DMA    111     59     12      1      1      1      1      1      1      1      0
Node 0, zone   Normal   2770    753    169     47      0      0      0      0      0      0      1
Node 0, zone  HighMem     23      4      2      1      1      0      0      0      0      0      0

Additional reading:

Allocating page frames


struct page * alloc_pages(gfp_t gfp_mask, unsigned int order)

This function requests the allocation of 2^order contiguous page frames and returns a pointer to the descriptor of the first allocated frame.
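
A minimal usage sketch (the helper name demo_alloc_four_pages() is hypothetical; kernel context is assumed):

static int demo_alloc_four_pages(void)
{
        /* Allocate 2^2 = 4 physically contiguous page frames. */
        struct page *page = alloc_pages(GFP_KERNEL, 2);
        void *addr;

        if (!page)
                return -ENOMEM;

        /* Kernel virtual address of the first frame (low-memory pages). */
        addr = page_address(page);
        memset(addr, 0, 4 * PAGE_SIZE);

        /* Give all four frames back to the buddy allocator. */
        __free_pages(page, 2);
        return 0;
}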

Freeing page frames


/*
 * linux/mm/page_alloc.c
 *
 * Freeing function for a buddy system allocator.
 * At a high level, all that happens here is marking the table entry
 * at the bottom level available, and propagating the changes upward
 * as necessary.
 * At each level, we keep a list of pages, which are heads of continuous
 * free pages of length of (1 << order) and marked with _mapcount
 * PAGE_BUDDY_MAPCOUNT_VALUE. Page's order is recorded in page_private(page)
 * field.
 * So when we are allocating or freeing one, we can derive the state of the
 * other.  That is, if we allocate a small block, and both were
 * free, the remainder of the region must be split into blocks.
 * If a block is freed, and its buddy is also free, then this
 * triggers coalescing into a block of larger size.
 */

void __free_pages(struct page *page, unsigned int order)
void free_pages(unsigned long addr, unsigned int order)

These functions free the area of size 2^order * PAGE_SIZE indicated by page (or, in the case of free_pages(), by the linear address addr) and insert it into one of the queues in the free_area table. The freed area must have been allocated earlier; when it was allocated, it was split off from a block twice as big. Perhaps its buddy (that is, the other half of this larger block) is also free and the two can be merged back into one larger contiguous area. If so, the kernel can then look for a buddy of this larger piece, and so on. The resulting block is added to the appropriate queue, and the information marking those areas as belonging to the buddy allocator is updated.

Two blocks are buddies if:

  1. they have the same size, say 2^k page frames,
  2. they are located at contiguous physical addresses,
  3. the physical address of the first page frame of the first block is a multiple of 2 * 2^k * PAGE_SIZE.

While combining the areas, the kernel has to calculate two values: the address of the buddy and the index of the pair of buddies after they are combined back. The function pfn_to_page(int pfn) returns the pointer to the descriptor for a given frame number (pfn - page frame number), function page_to_pfn(struct page *page) gives the opposite mapping.


static inline unsigned long
__find_buddy_index(unsigned long page_idx, unsigned int order)
{
    return page_idx ^ (1 << order);  // bitwise XOR
}

static inline void __free_one_page(struct page *page,
struct zone *zone, unsigned int order, int migratetype)
{
    unsigned long page_idx;
    unsigned long combined_idx;
    unsigned long uninitialized_var(buddy_idx);
    struct page *buddy;
    ...
    page_idx = page_to_pfn(page) & ((1 << MAX_ORDER) - 1);
	...
    while (order < MAX_ORDER-1) {
        buddy_idx = __find_buddy_index(page_idx, order);
        buddy = page + (buddy_idx - page_idx);
        if (!page_is_buddy(page, buddy, order))
            break; /* Move the buddy up one level. */
        ...
        list_del(&buddy->lru);
        zone->free_area[order].nr_free--;
        rmv_page_order(buddy);
        ...
        combined_idx = buddy_idx & page_idx;
        page = page + (combined_idx - page_idx);
        page_idx = combined_idx;
        order++;
    }
...
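
A worked example of the index arithmetic (illustrative numbers): for a freed block starting at page_idx = 8 with order = 2, __find_buddy_index() returns 8 ^ (1 << 2) = 12, so the buddy is the order-2 block starting at index 12. If that block is free, combined_idx = 12 & 8 = 8, and merging continues at order 3 with a block starting at index 8, whose buddy would be at index 8 ^ (1 << 3) = 0.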

Freeing page frames in buddy allocator

Freeing page frames in buddy allocator (source: W. Mauerer, Professional Linux Kernel Architecture)

Freeing page frames in buddy allocator - example

Freeing page frames in buddy allocator - example (source: W. Mauerer, Professional Linux Kernel Architecture)

Buddy allocator API

The API of the buddy allocator includes, among others, the following functions (it is worth noticing the parameters and returned values): alloc_pages(gfp_mask, order), alloc_page(gfp_mask), __get_free_pages(gfp_mask, order), __get_free_page(gfp_mask), get_zeroed_page(gfp_mask), __free_pages(page, order), free_pages(addr, order), __free_page(page), free_page(addr).

Notes


Slab allocator

Introduction

The slab allocator was first used in the Solaris 2.4 system. There is a rich literature available on the subject:

  1. Jeff Bonwick from Sun Microsystems gives an engaging description of the version of the allocator for SunOS 5.4 in the article The Slab Allocator: An Object-Caching Kernel Memory Allocator.
  2. A. Nayani described in great detail the version for Linux 2.4.19 in Memory Management in Linux, in chapter 3.
  3. Mel Gorman described in great detail the version for Linux 2.4.22 (with information on what would appear in 2.6) in Understanding the Linux Virtual Memory Manager, in chapter 8.
  4. Bovet and Cesati described in detail the version for Linux 2.6 in the book Understanding the Linux Kernel, in chapter 8.
  5. Vahalia gives an outline in the book UNIX Internals: The New Frontiers, in chapter 12.10.
  6. The SLUB allocator (Jonathan Corbet, February 2007)
  7. Cramming more into struct page (Jonathan Corbet, 2013).
  8. Chris Lameter (one of the maintainers) in 2014 delivered the presentation Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB
  9. Toward a more efficient slab allocator (Jonathan Corbet, January 2015)

The slab allocator supports memory allocation for the kernel. The kernel needs many different temporary objects, such as the dentry, mm_struct, inode and files_struct structures. Temporary kernel objects can be both very small and very large; moreover, they are allocated and freed frequently, so these operations have to be performed efficiently. The buddy allocator, which operates on areas composed of entire frames of memory, is not suitable for this.

What are the benefits of the slab allocator compared to traditional methods of memory allocation?

  1. Frequently used objects are cached, so the cost of repeated initialization, destruction and calls to the page allocator is avoided.
  2. Objects of the same type are stored together, which reduces internal fragmentation.
  3. Slab coloring places objects at different offsets within their slabs, which improves the use of the hardware cache.

This is what the cache structure of the slab allocator looks like:

The slab allocator components

The slab allocator components
(source: Bovet, Cesati, Understanding the Linux Kernel)

The slab allocator does not work well for very small and very large systems. Therefore, two alternative slab allocators have been added to Linux:

  1. slob allocator - it is designed for small systems. The slob name comes from "simple list of blocks"; the allocator takes only about 600 lines of code, and a simple first-fit algorithm is used to allocate memory;
  2. slub allocator - for large systems. It tries to minimize memory overhead by operating on groups of page frames and using unused fields in the page descriptor. This complicates the descriptor structure, but it can improve performance in large systems.

The higher levels of the kernel do not have to be aware of which slab allocator is actually used.


Slab allocator in Linux

The main task of the slab allocator in Linux is to reduce the number of calls to the buddy allocator. The slab allocator maintains many different caches. For frequently used objects (such as files_struct) it maintains dedicated caches, and for other objects a number of generic caches, one for each size class (in most cases a power of two). In the /proc/slabinfo file, you can view the list of available slab allocator caches, along with the number of free and allocated objects in each cache.

Here are dedicated caches:


extern struct kmem_cache     *vm_area_cachep;
extern struct kmem_cache     *names_cachep;
extern struct kmem_cache     *files_cachep;
extern struct kmem_cache     *fs_cachep;
extern struct kmem_cache     *sighand_cachep;

There are also general caches of sizes corresponding (in most cases) to the powers of two.

Here is sample content of /proc/slabinfo. The columns contain:

name - cache name
active_objs - number of objects that are currently active (i.e. in use)
num_objs - total number of allocated objects (i.e. objects that are both in use and not in use)
object_size - size of objects in this slab, in bytes
objsperslab - number of objects stored in each slab
pagesperslab - number of pages allocated for each slab
active_slabs - number of active slabs
num_slabs - total number of slabs


Name              active    num   object objs  pages active  num 
                    objs    objs   size  per   per   slabs  slabs
                                         slab  slab
inode_cache         9174   9660    568   28    4     345    345
dentry           1532244 1532244   192   42    2   36483  36483
buffer_head       794118 794118    104   39    1   20362  20362
vm_area_struct    195838 204286    176   46    2    4441   4441
mm_struct           7658   8352    896   36    8     232    232
files_cache         5831   7314    704   46    8     159    159
signal_cache        3111   3780   1088   30    8     126    126
sighand_cache       2094   2310   2112   15    8     154    154
task_struct         2793   3258   1776   18    8     181    181
anon_vma           91453  99584     64   64    1    1556   1556
radix_tree_node   360485 553224    568   28    4   19758  19758
dma-kmalloc-8192       0      0   8192    4    8       0      0
dma-kmalloc-4096       0      0   4096    8    8       0      0
dma-kmalloc-2048       0      0   2048   16    8       0      0
dma-kmalloc-1024       0      0   1024   32    8       0      0
dma-kmalloc-512       32     32    512   32    4       1      1
dma-kmalloc-256        0      0    256   32    2       0      0
dma-kmalloc-128        0      0    128   32    1       0      0
dma-kmalloc-64         0      0     64   64    1       0      0
dma-kmalloc-32         0      0     32  128    1       0      0
dma-kmalloc-16         0      0     16  256    1       0      0
dma-kmalloc-8          0      0      8  512    1       0      0
dma-kmalloc-192        0      0    192   42    2       0      0
dma-kmalloc-96         0      0     96   42    1       0      0
kmalloc-8192         256    268   8192    4    8      67     67
kmalloc-4096        1595   1680   4096    8    8     210    210
kmalloc-2048        3200   3664   2048   16    8     229    229
kmalloc-1024        6786   7584   1024   32    8     237    237
kmalloc-512        12203  36096    512   32    4    1128   1128
kmalloc-256        64100 101408    256   32    2    3169   3169
kmalloc-128       138762 215136    128   32    1    6723   6723
kmalloc-64        482131 782976     64   64    1   12234  12234
kmalloc-32         25499  33024     32  128    1     258    258
kmalloc-16         12544  12544     16  256    1      49     49
kmalloc-8          19456  19456      8  512    1      38     38
kmalloc-192        47364 191814    192   42    2    4568   4568
kmalloc-96         43456 116298     96   42    1    2769   2769
kmem_cache           224    224    256   32    2       7      7
kmem_cache_node      384    384     64   64    1       6      6

The memory is physically allocated and initialized with whole slabs. Each slab consists of one or more page frames containing both allocated and free objects.


Data structures of the slab allocator

The slab allocator uses three data structures: the cache descriptor (kmem_cache), the slab descriptor and the object descriptor.

ATTENTION: the attached pictures show the general view of the slab allocator data structures. Details may vary depending on the implementation.

In the cache descriptor (kmem_cache), in addition to many fields for data management (such as the number of free and allocated objects and flags) there are two important elements:

  1. a pointer to an array through which recently freed objects for each CPU can be reached. The array has one entry for each CPU. Each entry points to an array_cache structure, which contains the data needed to manage the objects; right behind this data there is an array of pointers to unused objects on the slabs.
  2. an array of pointers (one entry per NUMA node) to the structure kmem_cache_node, which contains pointers to three lists: a list of full slabs (with no free objects), a list of partially full slabs, and a list of free (empty) slabs.

Allocation of objects takes place on three levels and the cost of allocation and the negative impact of these operations on cache and TLB increases from level to level:

  1. Per-CPU objects from the CPU cache.
  2. Unused objects from an existing slab.
  3. Unused objects from the new slab obtained on request from the buddy allocator.

SLAB data structures

SLAB data structures
(source: Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB)


The redzone is used to detect writes past the end of the object. All its bytes should always have the same value; if there is any deviation, it is due to a write past the object boundary. (Redzone information is only available if SLAB_RED_ZONE is set.) If the object is inactive, then its bytes typically contain poison values; any non-poison value indicates corruption by a write after free. Padding is unused data that fills up the space so that the next object is properly aligned.

A slab is a contiguous area of memory into which a certain number of objects of a specific size can be placed. Slab descriptors used to be stored either inside the slab (for small objects, smaller than 1/8 of a page) or outside it (for large objects). Nowadays the slab metadata is stored in the page descriptor.

Slab descriptors and object descriptors

Figure: Slab descriptors and object descriptors
(source: Bovet, Cesati, Understanding the Linux Kernel)

SLAB data structures - free list

SLAB data structures - free list
(source: Slab allocators in the Linux Kernel: SLAB, SLOB, SLUB)


The object descriptor is used to link free objects into a list. It is only needed for a free object, but it must be kept separate from the object itself, because we do not want to destroy a properly constructed object.

The object descriptor can be a simple integer, because to establish the cache and the slab to which the object belongs, all you need to have is a pointer to the page descriptor containing the object.

All links to free objects are kept in an array that has as many entries as there are objects on the slab. The array is kept just behind the slab descriptor and there is no direct pointer to it.
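
Conceptually (a simplified model, not the exact kernel layout), this array can be pictured as follows:

/* Simplified model: entry i of free[] holds the index of a still-free
 * object on the slab.  Allocation takes free[active] and increments
 * active; freeing decrements active and stores the freed object's
 * index at that position. */
struct slab_freelist_model {
        unsigned int active;      /* number of objects currently in use */
        unsigned int total;       /* number of objects per slab         */
        unsigned int free[];      /* indices of still-free objects      */
};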


Slab coloring

The final task of the slab allocator is optimal hardware cache use. If there is space left over after objects are packed into a slab, the remaining space is used to color the slab. Slab coloring is a scheme that attempts to have objects in different slabs use different lines in the cache. By placing objects at a different starting offset within the slab, objects will likely use different lines in the CPU cache, which helps ensure that objects from the same slab cache will be unlikely to flush each other. With this scheme, space that would otherwise be wasted fulfills a new function. The figure below shows how a page allocated from the buddy allocator is used to store objects that use coloring to align the objects to the L1 CPU cache.

To use the hardware cache better, the slab allocator will offset objects in different slabs by different amounts depending on the amount of space left over in the slab. The offset is in units of BYTES_PER_WORD unless SLAB_HWCACHE_ALIGN is set, in which case it is aligned to blocks of L1_CACHE_BYTES for alignment to the L1 hardware cache.

During cache creation, it is calculated how many objects can fit on a slab and how many bytes would be wasted. Based on wastage, two figures are calculated for the cache descriptor:

colour: the number of different offsets that can be used,
colour_off: the multiple to offset each object in the slab.

The slab allocator takes advantage of the free unused bytes to color the slab. Slabs having different colors store the first object of the slab in different memory locations, while satisfying the alignment constraint. Coloring leads to moving some of the free area of the slab from the end to the beginning.

The various colors are distributed equally among slabs of a given object type. Each slab is created with a different color from the previous one, up to the maximum available colors.
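
A worked example (illustrative numbers): if, after packing objects into a slab, 200 bytes are left over and colour_off equals the L1 cache line size of 64 bytes, then colour = 200 / 64 = 3, and consecutive slabs of the cache place their first object at offsets 0, 64 and 128 bytes, after which the sequence wraps around to 0.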

Slab coloring

Slab coloring (source: W. Mauerer, Professional Linux Kernel Architecture)


Slab allocator API

The function used to initialize general caches:


void __init kmem_cache_init(void)

The function used to create dedicated caches:


/**
 * kmem_cache_create - Create a cache.
 * @name: A string which is used in /proc/slabinfo to identify this cache.
 * @size: The size of objects to be created in this cache.
 * @align: The required alignment for the objects.
 * @flags: SLAB flags
 * @ctor: A constructor for the objects.
 */

struct kmem_cache *
kmem_cache_create (const char *name, size_t size, 
       size_t align, slab_flags_t flags, void (*ctor)(void*))

If the SLAB_HWCACHE_ALIGN flag is set when creating the cache, the object size is aligned to L1_CACHE_BYTES (the hardware cache line size - typically 32 or 64 bytes, depending on the architecture), so that the objects will be aligned in the hardware cache. This creates gaps between the objects on the slab.

The allocation of an object from a dedicated cache is performed by the kmem_cache_alloc() function. The function first attempts to allocate the object from a partially filled slab, then from a free slab. If that fails, it tries to allocate new frames from the buddy allocator. The kmem_cache_free() function is used to free memory allocated for the kernel. Empty slabs are released when the cache is deleted; the freeing function of the buddy allocator is ultimately called then.


void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
void kmem_cache_free(struct kmem_cache *cachep, void *objp);

Examples of calls to these memory allocation functions in the kernel code:


tmp = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
...
struct fs_struct *fs = kmem_cache_alloc(fs_cachep, GFP_KERNEL);
...
newf = kmem_cache_alloc(files_cachep, GFP_KERNEL);
...
pgd_t *pgd_alloc(struct mm_struct *mm)
{
        return kmem_cache_alloc(pgd_cachep, PGALLOC_GFP);
}
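
For completeness, a minimal sketch of the full life cycle of a dedicated cache (struct foo, foo_cachep and the function names are hypothetical, not taken from the kernel):

struct foo {
        int id;
        char name[56];
};

static struct kmem_cache *foo_cachep;

static int __init foo_init(void)
{
        /* Create the cache once, e.g. during module initialization. */
        foo_cachep = kmem_cache_create("foo_cache", sizeof(struct foo),
                                       0, SLAB_HWCACHE_ALIGN, NULL);
        if (!foo_cachep)
                return -ENOMEM;
        return 0;
}

static void foo_use(void)
{
        /* Allocate and free individual objects. */
        struct foo *f = kmem_cache_alloc(foo_cachep, GFP_KERNEL);

        if (f) {
                /* ... use the object ... */
                kmem_cache_free(foo_cachep, f);
        }
}

static void __exit foo_exit(void)
{
        /* Destroy the cache when it is no longer needed. */
        kmem_cache_destroy(foo_cachep);
}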

The kmalloc() function, similar to the malloc() function used in user mode, is used to allocate memory for the kernel from the general purpose caches (the parameter is the number of bytes, not the object type). The function rounds the requested size up to the size of the nearest general cache (usually the next power of two) and then calls kmem_cache_alloc() indicating the appropriate cache; for example, a request for 100 bytes is satisfied from the kmalloc-128 cache. The kfree() function is used to release such objects and is the equivalent of free() from user mode.


void *kmalloc(size_t size, gfp_t flags);
void kfree(const void *objp);

Examples of calls to kmalloc() in the kernel code:


c->wbuf = kmalloc(c->wbuf_pagesize, GFP_KERNEL);
... 
/* don't ask for more than the kmalloc() max size */
if (size > KMALLOC_MAX_SIZE)
    size = KMALLOC_MAX_SIZE;
buf = kmalloc(size, GFP_KERNEL);
if (!buf)
    return -ENOMEM;

The kmalloc() function cannot allocate high memory (highmem). There is a limit on the size of memory allocated by kmalloc(); it depends on the architecture and the configuration of the kernel. It can be assumed that this limit is 128 KB (32 frames of 4 KB each).


Some additional explanations

Based on:

Address types

Let's summarize the terminology used (after Linux Device Drivers):

user virtual addresses: addresses seen by user-space programs;
physical addresses: addresses used between the processor and the system's memory;
kernel logical addresses: the normal address space of the kernel; such addresses map linearly (usually offset by PAGE_OFFSET) to physical memory, and memory returned by kmalloc() has a logical address;
kernel virtual addresses: addresses that do not necessarily map linearly to physical addresses; memory returned by vmalloc() has a virtual (but not logical) address.

If you have a logical address, the macro __pa() returns its associated physical address. Physical addresses can be mapped back to logical addresses with __va(), but only for low-memory pages.

Memory addresses returned by kmalloc() and __get_free_pages() are also virtual addresses. Their actual value is still massaged by the MMU before it is used to address physical memory. The (virtual) address range used by kmalloc() and __get_free_pages() features a one-to-one mapping to physical memory, possibly shifted by a constant PAGE_OFFSET value; therefore these functions don't need to modify the page tables for that address range. The address range used by vmalloc(), on the other hand, is completely synthetic, and each allocation builds the (virtual) memory area by suitably setting up the page tables.

This difference can be perceived by comparing the pointers returned by the allocation functions. On some platforms (for example, the x86), addresses returned by vmalloc() are just beyond the addresses that kmalloc() uses.
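
A small illustration of the one-to-one mapping for low memory (the variable names are illustrative; the statements are assumed to appear inside some kernel function):

void *buf = kmalloc(128, GFP_KERNEL);    /* kernel logical address           */
unsigned long phys = __pa(buf);          /* logical -> physical              */
void *same = __va(phys);                 /* physical -> logical (low memory) */

/* same == buf here; addresses obtained from vmalloc() must not be
 * passed to __pa(). */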

Address types in Linux

Figure: Address types in Linux (source: Jonathan Corbet, Greg Kroah-Hartman, Alessandro Rubini, Linux Device Drivers, 3rd Edition)

High and Low Memory

The kernel (on the x86 architecture, in the default configuration) splits the 4-GB virtual address space between user space and the kernel; the same set of mappings is used in both contexts. A typical split dedicates 3 GB to user space and 1 GB to kernel space. The kernel's code and data structures must fit into that space, but the biggest consumer of kernel address space is virtual mappings for physical memory. The kernel cannot directly manipulate memory that is not mapped into the kernel's address space. The kernel, in other words, needs its own virtual address for any memory it must touch directly. Thus, for many years, the maximum amount of physical memory that could be handled by the kernel was the amount that could be mapped into the kernel's portion of the virtual address space, minus the space needed for the kernel code itself. As a result, x86-based Linux systems could work with a maximum of a little under 1 GB of physical memory.

Only the lowest portion of memory has logical addresses; the rest (high memory) does not. Before accessing a specific high-memory page, the kernel must set up an explicit virtual mapping to make that page available in the kernel's address space. Thus, many kernel data structures must be placed in low memory; high memory tends to be reserved for user-space process pages.

Low memory - Memory for which logical addresses exist in kernel space. High memory - Memory for which logical addresses do not exist, because it is beyond the address range set aside for kernel virtual addresses. On i386 systems, the boundary between low and high memory is usually set at just under 1 GB, although that boundary can be changed at kernel configuration time. This boundary is not related in any way to the old 640 KB limit found on the original PC, and its placement is not dictated by the hardware. It is, instead, a limit set by the kernel itself as it splits the 32-bit address space between kernel and user space.

High memory is used mainly by user processes and for the page cache. Low memory can be used for the same purposes as high memory, but it is also available to the kernel for its data structures. Memory for the slab allocator comes from low memory.

Historically, the kernel has used logical addresses to refer to pages of physical memory. The addition of high-memory support, however, has exposed an obvious problem with that approach - logical addresses are not available for high memory. Therefore, kernel functions that deal with memory are increasingly using pointers to struct page instead. This data structure contains the field void *virtual, which holds the kernel virtual address of the page if it is mapped, and NULL otherwise. Low-memory pages are always mapped; high-memory pages usually are not. If you want to look at this field, the proper method is to use the page_address() macro. It returns the kernel virtual address of the page, if such an address exists. For high memory, that address exists only if the page has been mapped. In most situations, you want to use a version of kmap() rather than page_address().

The kmap() function returns a kernel virtual address for any page in the system. For low-memory pages, it just returns the logical address of the page; for high-memory pages, kmap() creates a special mapping in a dedicated part of the kernel address space. Mappings created with kmap() should always be freed with kunmap(); a limited number of such mappings is available, so it is better not to hold on to them for too long. kmap() calls maintain a counter, so if two or more functions call kmap() on the same page, the right thing happens. Note also that kmap() can sleep if no mappings are available. For a non-blocking mapping, use kmap_atomic().
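
A minimal sketch of this usage pattern (the helper name clear_any_page() is hypothetical):

/* Zero the contents of a page that may reside in high memory. */
static void clear_any_page(struct page *page)
{
        void *vaddr = kmap(page);    /* may sleep for high-memory pages */

        memset(vaddr, 0, PAGE_SIZE);
        kunmap(page);                /* release the temporary mapping   */
}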

Memory Management API

The kernel obtains dynamic memory by calling functions from the rich set:

Provided that memory is available, kernel requests are satisfied immediately, and the function returns the address of a page descriptor or the linear address identifying the allocated dynamic memory area.

Low-level memory allocator

Low-level memory allocator
(source: Shinpei Kato, Advanced Operating Systems)


Janina Mincer-Daszkiewicz