1 Buddy Algorithm

This section tries to explain approximatly how the buddy algorithm works for allocing pages. All the code discussed is in page_alloc.c . To get an explanation on what a buddy algorithm is, consult any operating systems theory book such as UNIX Internals.

1.1 Starting Out

Life starts out in the inline function alloc_pages in the header file include/linux/mm.h. It's a simple wrapper function which makes sure the number of pages isn't too great before calling _alloc_pages which takes the following parameters

gfp_mask	The GFP (Get Free Pages) flags. This will tell the allocator what actions may be taken. For example GFP_WAIT is set, the allocator will not block and instead return NULL if memory is tight.
order	The power of two number of contigeous pages required

This in turn calls __alloc_pages with three parameters. The first two are the same as _alloc_pages. The third is the zonelist which is the preferred zone to alloc pages from. If that zone can't provide enough pages, the other zones are fallen back on. Which zone is calculated from the rather cryptic formula

contig_page_data.node_zonelists+(gfp_mask &
GFP_ZONEMASK

.

contig_page_data is a struct that is used to describe the memory layout in NUMA architectures. It's node_zonelists parameter is a list of zones that exist in that chunk of memory. Currently there is three sets of zones, ZONE_DMA, ZONE_NORMAL and ZONE_HIGH. GFP_ZONEMASK is defined as 0x0f. ANDing it with gfp_mask gives the index into node_zonelists that is preferred for usage.

1.2 Allocating Pages

At this stage, we've reached what is described as the "heart of the zoned buddy allocator", the __alloc_pages function, that has multiple paths and poorly documented. Beware that this description is likely to be only half accurate. Lets take the first block

        zone = zonelist->zones;
        classzone = *zone;      
        min = 1UL << order;

This is straight forward. Go to the first zone. Record what classzone we are in so we can get it balanced later if necessary. As order is a power of two, bitshifting it left order number of times will give the number of pages required.

        for (;;) {
                zone_t *z = *(zone++);  
                if (!z)                 
                        break;          
                                        
                min += z->pages_low;
                if (z->free_pages > min) { 
                        page = rmqueue(z, order);
                        if (page)       
                                return page;    
                }                       
        }

This is the first attempt to alloc the required pages. For each zone that exists on this node, it checks if there is enough pages available in total. The minimum number of pages is pages_low + min . It's unclear why it accumulates the pages_low value from each zone. If there is enough free pages, rmqueue is called which will be discussed later.

       classzone->need_balance = 1;            
        mb();                           
        if (waitqueue_active(&kswapd_wait))     
                wake_up_interruptible(&kswapd_wait);

If a page was not returned, this zone needs to be balanced because not enough pages were found to satisify the request. need_balance is picked up my kswapd later. mb() is called to make sure the value is definetly written out and then kswapd is woken up if it isn't already awake.

       zone = zonelist->zones;         
        min = 1UL << order;                     
        for (;;) {                      
                unsigned long local_min;        
                zone_t *z = *(zone++);
                if (!z)
                        break;

                local_min = z->pages_min;
                if (!(gfp_mask & __GFP_WAIT))
                        local_min >>= 2;
                min += local_min;
                if (z->free_pages > min) {
                        page = rmqueue(z, order);
                        if (page)
                                return page;
                }
        }

At first glance, this looks very like the first for() block and you'd be right it is. The main difference is that if the calling process can sleep, it'll allow more pages to be allocated even though the pages_min limit could be hit. Later this might force the process to sleep as pages are freed up.

At this point, we have hit the slow path. We are low on memory and finding it is pretty tough. This is where we go back to later if it turns out all attempts to allocate the pages fail.

        if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
                zone = zonelist->zones;
                for (;;) {
                        zone_t *z = *(zone++);
                        if (!z)
                                break;

                        page = rmqueue(z, order);
                        if (page)
                                return page;
                }
                return NULL;
        }

This block is only executed if the called is a memory allocator such as kswapd or would be forcibly killed if out of memory. Both belong to critical processes that we want to keep happy no matter what. So here, we give the pages no matter what if they are available and ignore the pages_min settings.

       if (!(gfp_mask & __GFP_WAIT))
                return NULL;

Self explanatory, if the caller has said they can't sleep, thats it, the next section should not be executed.

        page = balance_classzone(classzone, gfp_mask, order, &freed);
        if (page)
                return page;

This block attempts to rebalance the zone in a synchronous fashion. How this balancing differs from what kswapd normally does will be discussed later. For the moment, understand that similar work to kswapd is done except the pages freed will be cached by the current process.

The next block of code is a repeat of the first attempt to allocate pages. At this point we have tried to free up pages in the zones but the pages freed up in a block were not enough to satisify the request. This recheck of the whole zone sees if there was a whole new block freed up that can be used.

If successful we reach here

        if (order > 3)
                return NULL;

        current->policy |= SCHED_YIELD;
        __set_current_state(TASK_RUNNING);
        schedule();
        goto rebalance;

The first check is to make sure we don't loop trying to satisify a large request. If at this stage, we haven't found a block sizable enough, we're not likely to find it any time soon so the caller might want to start thinking of something new.

If the allocation is small, the processor is yielded so kswapd can do some work. It's already been woken up so it's perfectly possible it'll free up the necessary pages. Once it's done a bit of work, the slow path is entered again mentioned earlier.

1.3 Remove pages (rmqueue)

This section deals with what happens when rmqueue is called by __alloc_pages. This is part that does the prime work of the buddy allocator. That is, removes the pages, updates the bitmasks and adds to the free lists as appropriate.

First, it sets area

        free_area_t * area = zone->free_area + order

free_area is an array pointing to free pages of each order. This will be the first area checked to see can the allocation be satisified.

        spin_lock_irqsave(&zone->lock, flags);
        do {    
           ...
        } (curr_order < MAX_ORDER);

This while block says starting at the first free list, go through each larger list until a large enough free block is found. Onto whats inside the while block.

                head = &area->free_list;
                curr = memlist_next(head);
                
                if (curr != head) {

Self explanatory code really. If curr != head, then there is a free chunk available of this size so it can be alloced.

                page = memlist_entry(curr, struct page, list);
                if (BAD_RANGE(zone,page))
                      BUG();
                memlist_del(curr);

This simply removes the page block from the free list.

                index = page - zone->zone_mem_map;
                if (curr_order != MAX_ORDER-1)
                        MARK_USED(index, curr_order, area);
                zone->free_pages -= 1UL << order;

zone->zone_mem_map is the first page of the current zone been dealt with so index is an offset within it. MARK_USED sets the bit in the area->map which is the bitmap the buddy algorithm uses to determine if the buddy page is free or not.

                page = expand(zone, page, index, order, curr_order,area);
                spin_unlock_irqrestore(&zone->lock, flags);
                set_page_count(page, 1);

expand does the actual work of removing the pages from the freelists and juggling them around. It'll be explained in the next section. The lock is freed and IRQ's restored. set_page_count just marks the page to show it's in use.

If the pages couldn't be served by this freelist, the order (curr_order) is upped and the next freearea is examined

1.4 expand()

This function is responsible for updating all the free lists.

        unsigned long size = 1 << high;

high at the beginning is the freelist the allocation is been made from

while (high > low) {

low is the size of the actual allocation we are looking to make

                area--;
                high--;
                size >>= 1;

Move to the next smallest area, decrease the high order allocation and as we are now dealing with the next area size, divide size by two

                memlist_add_head(&(page)->list, &(area)->free_list);
                MARK_USED(index, high, area);

Add the buddy of the page set to the freelist. To understand why this is done, read a high level description of what a buddy allocator does.

                index += size;
                page += size;

Move to the next block of pages to split up and move around the free areas.

1.5 Freeing pages

This is where things get slightly trickier. According to the buddy algorithm, pages when freed should be amalgamated with their "buddy" while they are been freed. This is the trickiest section of the page allocation algorithms. The main bulk of the work is done in the function __free_pages_ok() . It takes two parameters

page	The first page of the set to be freed
order	The power of 2 pages to be freed

The first section deals with debugging code to make sure the caller isn't tyring to free a page thats in active use. It clears the bits on the page to mark it clean and not referenced. Then it makes the following check

        if (current->flags & PF_FREE_PAGES)
                goto local_freelist;

This flag is only set if the process is in the slow path for page allocation. If it is, pages that it frees should be kept for the process itself. If it is set, the following path is execute.

 local_freelist:
        if (current->nr_local_pages)
                goto back_local_freelist;
        if (in_interrupt())
                goto back_local_freelist;

        list_add(&page->list, &current->local_pages);
        page->index = order;
        current->nr_local_pages++;

The first check is to make sure the process has not already freed up the required number of pages for itself. The second check is because an interrupt doesn't have a memory context and so can't have local pages. Next the page is added to the list of local pages and the number of local pages is incremented. It's unclear why index is set to be order.

In the normal scheme of things with a normal free, the code runs as follows

        zone = page->zone;

        mask = (~0UL) << order;
        base = zone->zone_mem_map;
        page_idx = page - base;
        if (page_idx & ~mask)
                BUG();
        index = page_idx >> (1 + order);

        area = zone->free_area + order;

mask will have multiple uses, each of which will be discussed in turn. base is the beginning of the mem_map array for this zone. page_idx is what index in the mem_map array we are looking at. index is the offset within the free area bitmap. This is slightly confusing because index is used when allocating pages to mean what page inside mem_map is been referred to. area is the free area we are currrently dealing with, remmeber that zone->free_area is an array.

        zone->free_pages -= mask;

This is subtle. The value of -mask is the number of pages been freed. Remember as a helpful comment in the code says

        -mask = ~mask + 1

Next section

        while (mask + (1 << (MAX_ORDER-1))) {

This while loop will exit when the maximum order is reached no matter what mask started out as. Each iteration of the loop moves mask one bit to the left. When mask has reached the max order it can handle, this addition will overflow resulting in 0 and exiting the while loop.

                struct page *buddy1, *buddy2;

                if (area >= zone->free_area + MAX_ORDER)
                        BUG();
                if (!__test_and_change_bit(index, area->map))
                        /*
                         * the buddy page is still allocated.
                         */
                        break;

Basic declaration and a bug check. The test_and_change_bit section is not clearly understood but it is suspected that irregardless of how pages are allocated and freed, the bits for the pages we are about to free will be set to 1 if it's buddy is free.

                buddy1 = base + (page_idx ^ -mask);
                buddy2 = base + page_idx;

The value of buddy2 should be obvious. It's simply the current page_idx we are about to free up. The calculation of buddy1 is much more subtle and requires careful thought. -mask is evaluated to be the number of pages been freed which also gives an index into the mem_map array. When exclusive or'd with page_idx, it'll always give the index of the buddy. The easiest way to see this is to draw out the bitmaps.

                memlist_del(&buddy1->list);
                mask <<= 1;
                area++;
                index >>= 1;
                page_idx &= mask;

The page is now freed so remove it off the list. buddy2 has already been freed remember. The last four lines set up the while loop to try and see if the freed pages allow more buddies to be merged together.

        }
        memlist_add_head(&(base + page_idx)->list, &area->free_list);

        spin_unlock_irqrestore(&zone->lock, flags);
        return;

As this is the last page to be freed, it's added to the free list for the last area that was examined. The spinlock is released and the the function exits.

This document was translated from L^AT_EX by H^EV^EA.