1 Buddy Algorithm
This section tries to
explain approximately how the buddy algorithm works for allocating pages. All the
code discussed is in page_alloc.c. For an explanation of what a buddy
algorithm is, consult any operating systems theory book such as UNIX
Internals.
1.1 Starting Out
Life starts out in the
inline function alloc_pages in the header file include/linux/mm.h. It's
a simple wrapper function which makes sure the requested order isn't too large
before calling _alloc_pages, which takes the following parameters:
gfp_mask - The GFP (Get Free Pages) flags. These tell the allocator what
actions may be taken. For example, if __GFP_WAIT is not set, the allocator
will not block and will instead return NULL if memory is tight.
order - The size of the allocation as a power of two, i.e. 2^order contiguous
pages are required.
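As a rough illustration of how a caller uses this interface, the sketch below
requests and then releases a block of four contiguous pages. The helper
function is hypothetical; only alloc_pages, __free_pages and GFP_KERNEL come
from the kernel.
#include <linux/mm.h>

/* Hypothetical example: grab four contiguous pages, then free them. */
static int example_alloc(void)
{
        /* GFP_KERNEL includes __GFP_WAIT, so the allocator may sleep here. */
        struct page *page = alloc_pages(GFP_KERNEL, 2);  /* order 2 = 4 pages */

        if (!page)
                return -ENOMEM;   /* the request could not be satisfied */

        /* ... use the pages ... */

        __free_pages(page, 2);    /* free with the same order */
        return 0;
}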
This in turn calls __alloc_pages
with three parameters. The first two are the same as _alloc_pages. The third is
the zonelist, the preferred list of zones to allocate pages from. If one zone
can't provide enough pages, the other zones in the list are fallen back on.
Which zonelist to use is calculated from the rather cryptic expression
contig_page_data.node_zonelists + (gfp_mask & GFP_ZONEMASK).
contig_page_data is a struct that is used to
describe the memory layout on NUMA architectures. Its node_zonelists member
is an array of zone lists for that chunk of memory. Currently there are three
zones: ZONE_DMA, ZONE_NORMAL and ZONE_HIGHMEM. GFP_ZONEMASK is defined as
0x0f. ANDing it with gfp_mask gives the index into node_zonelists that is
preferred for use.
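To make the expression concrete, the sketch below spells out the index
calculation. It is not a verbatim copy of the source, but follows the types
used in 2.4 (zonelist_t, zone_t).
zonelist_t *zonelist;

/* The low bits of gfp_mask select which zonelist to start from. */
zonelist = contig_page_data.node_zonelists + (gfp_mask & GFP_ZONEMASK);

/* zonelist->zones is a NULL-terminated array of zone_t pointers: the
 * preferred zone first, followed by the zones to fall back on. */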
1.2 Allocating Pages
At this stage, we've
reached what is described as the "heart of the zoned buddy allocator", the
__alloc_pages function, which has multiple paths and is poorly documented.
Beware that this description is likely to be only half accurate. Let's take
the first block
zone = zonelist->zones;
classzone = *zone;
min = 1UL << order;
This is straightforward. Go to the first zone. Record what classzone we
are in so we can get it balanced later if necessary. order is the exponent of the
allocation size, so shifting 1 left by order gives 2^order, the number of pages
required; order 3, for example, gives 1 << 3 = 8 pages.
for (;;) {
zone_t *z = *(zone++);
if (!z)
break;
min += z->pages_low;
if (z->free_pages > min) {
page = rmqueue(z, order);
if (page)
return page;
}
}
This is the first attempt to allocate the required pages. For each zone that
exists on this node, it checks whether there are enough pages available in total.
The minimum number of free pages required is min + pages_low. It's unclear why it
accumulates the pages_low value from each zone. If there are enough free pages,
rmqueue is called, which will be discussed later.
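To illustrate the accumulation with hypothetical numbers, suppose the node has
two zones and an order-2 block is wanted:
/* order = 2, so min starts at 1UL << 2 = 4 pages.
 *
 * zone 0: pages_low = 64  ->  min = 4 + 64  = 68
 *         rmqueue is tried only if zone 0 has more than 68 free pages.
 *
 * zone 1: pages_low = 32  ->  min = 68 + 32 = 100
 *         rmqueue is tried only if zone 1 has more than 100 free pages.
 *
 * Each fallback zone therefore has to clear a progressively higher
 * watermark before it will be used.
 */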
classzone->need_balance = 1;
mb();
if (waitqueue_active(&kswapd_wait))
wake_up_interruptible(&kswapd_wait);
If a page was not returned, this zone needs to be balanced because not
enough pages were found to satisfy the request. need_balance is picked up by
kswapd later. mb() is called to make sure the value is definitely written out and
then kswapd is woken up if it isn't already awake.
zone = zonelist->zones;
min = 1UL << order;
for (;;) {
unsigned long local_min;
zone_t *z = *(zone++);
if (!z)
break;
local_min = z->pages_min;
if (!(gfp_mask & __GFP_WAIT))
local_min >>= 2;
min += local_min;
if (z->free_pages > min) {
page = rmqueue(z, order);
if (page)
return page;
}
}
At first glance, this looks very like the first for() block and you'd be
right, it is. The main difference is the threshold: pages_min is used instead of
pages_low, and for callers that cannot sleep (__GFP_WAIT not set) it is quartered,
letting them dip further into the reserve since blocking and waiting for memory
is not an option for them. Callers that can sleep may be forced to do so later
as pages are freed up.
At
this point, we have hit the slow path. We are low on memory and finding it is
pretty tough. This is the point (the rebalance label) we jump back to later if
all attempts to allocate the pages fail.
if (current->flags & (PF_MEMALLOC | PF_MEMDIE)) {
zone = zonelist->zones;
for (;;) {
zone_t *z = *(zone++);
if (!z)
break;
page = rmqueue(z, order);
if (page)
return page;
}
return NULL;
}
This block is only executed if the caller is a memory allocator itself, such as
kswapd (PF_MEMALLOC), or is about to be forcibly killed because the system is out
of memory (PF_MEMDIE). Both are critical processes that we want to keep happy no
matter what, so here the pages are handed over if they are available at all,
ignoring the pages_min settings.
if (!(gfp_mask & __GFP_WAIT))
return NULL;
Self-explanatory: if the caller has said it can't sleep, that's it, the
next section must not be executed.
page = balance_classzone(classzone, gfp_mask, order, &freed);
if (page)
return page;
This block attempts to rebalance the zone in a synchronous fashion. How
this balancing differs from what kswapd normally does will be discussed later.
For the moment, understand that similar work to kswapd is done except the pages
freed will be cached by the current process.
The next block of code is a
repeat of the first attempt to allocate pages. At this point we have tried to
free up pages in the zones, but the pages freed in one block were not enough to
satisfy the request. This recheck of all the zones sees if a whole
new block was freed up that can be used.
If that recheck also fails, we reach
here
if (order > 3)
return NULL;
current->policy |= SCHED_YIELD;
__set_current_state(TASK_RUNNING);
schedule();
goto rebalance;
The first check is to make sure we don't loop trying to satisfy a large
request. If at this stage we haven't found a block of that size, we're not
likely to find one any time soon, so the caller might want to start thinking of
something new.
If the allocation is small, the processor is yielded so
kswapd can do some work. It's already been woken up, so it's perfectly possible
it will free up the necessary pages. Once it has done a bit of work, the slow path
mentioned earlier is entered again via the rebalance label.
1.3 Remove pages (rmqueue)
This section deals
with what happens when rmqueue is called by __alloc_pages. This is the part
that does the prime work of the buddy allocator: it removes the pages,
updates the bitmaps and adds blocks to the free lists as appropriate.
First, it
sets area
free_area_t *area = zone->free_area + order;
free_area is an array with one entry per order, each holding a free list of blocks
of that size. This will be the first area checked to see if the allocation can be
satisfied.
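For reference, free_area_t looks roughly like this in the 2.4 sources (treat the
exact layout as a sketch of include/linux/mmzone.h rather than a verbatim copy):
typedef struct free_area_struct {
        struct list_head free_list;  /* blocks of 2^order free pages          */
        unsigned long   *map;        /* buddy bitmap, one bit per buddy pair  */
} free_area_t;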
spin_lock_irqsave(&zone->lock, flags);
do {
...
} while (curr_order < MAX_ORDER);
This while block says: starting at the first free list, go through each
larger list until a large enough free block is found. Now onto what's inside the
while block.
head = &area->free_list;
curr = memlist_next(head);
if (curr != head) {
Self-explanatory code really. If curr != head, then there is a free chunk
of this size available, so it can be allocated.
page = memlist_entry(curr, struct page, list);
if (BAD_RANGE(zone,page))
BUG();
memlist_del(curr);
This simply removes the page block from the free list.
index = page - zone->zone_mem_map;
if (curr_order != MAX_ORDER-1)
MARK_USED(index, curr_order, area);
zone->free_pages -= 1UL << order;
zone->zone_mem_map is the first page of the zone currently being dealt
with, so index is an offset within it. MARK_USED toggles the bit in area->map,
the bitmap the buddy algorithm uses to determine whether the buddy of a block is
free or not.
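For reference, MARK_USED is a short macro in page_alloc.c; in the 2.4 sources it
looks roughly like this. Note that it toggles the bit rather than setting it,
which is what makes the buddy check in the free path work.
#define MARK_USED(index, order, area) \
        __change_bit((index) >> (1 + (order)), (area)->map)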
page = expand(zone, page, index, order, curr_order,area);
spin_unlock_irqrestore(&zone->lock, flags);
set_page_count(page, 1);
expand does the actual work of removing the pages from the free lists and
juggling them around. It'll be explained in the next section. The lock is freed
and IRQs restored. set_page_count just marks the page to show it's in
use.
If the pages couldn't be served by this free list, the order
(curr_order) is increased and the next free area is examined.
1.4 expand()
This function is responsible for
updating all the free lists.
unsigned long size = 1 << high;
high at the beginning is the order of the free list the allocation is being made
from
while (high > low) {
low is the order of the actual allocation we are looking to make
area--;
high--;
size >>= 1;
Move to the next smallest area, decrease the high order and, as
we are now dealing with the next area size, halve size
memlist_add_head(&(page)->list, &(area)->free_list);
MARK_USED(index, high, area);
Add the buddy of the page set to the freelist. To understand why this is
done, read a high level description of what a buddy allocator does.
index += size;
page += size;
Move to the next block of pages to split up and move around the free
areas.
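A worked example with hypothetical numbers may help. Suppose an order-1
allocation (2 pages) has to be served from an order-3 block (8 pages) starting at
page index 0:
/* high = 3, low = 1, size = 8, starting at page index 0
 *
 * iteration 1: high -> 2, size -> 4
 *              pages 0-3 go back on the order-2 free list
 *              index and page advance to 4
 *
 * iteration 2: high -> 1, size -> 2
 *              pages 4-5 go back on the order-1 free list
 *              index and page advance to 6
 *
 * high == low, so the loop exits and pages 6-7 are returned to the caller.
 */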
1.5 Freeing pages
This is where things get
slightly trickier. According to the buddy algorithm, pages should be
amalgamated with their "buddy" while they are being freed. This is the trickiest
section of the page allocation algorithms. The main bulk of the work is done in
the function __free_pages_ok(). It takes two parameters:
page - The first page of the block to be freed.
order - The size of the block as a power of two, i.e. 2^order pages are
freed.
The first section deals with debugging code
to make sure the caller isn't trying to free a page that's in active use. It
clears the bits on the page to mark it clean and not referenced. Then it makes
the following check
if (current->flags & PF_FREE_PAGES)
goto local_freelist;
This flag is only set if the process is in the slow path for page
allocation, in which case the pages it frees should be kept for the process
itself. If it is set, the following path is executed.
local_freelist:
if (current->nr_local_pages)
goto back_local_freelist;
if (in_interrupt())
goto back_local_freelist;
list_add(&page->list, &current->local_pages);
page->index = order;
current->nr_local_pages++;
The first check is to make sure the process has not already built up a list of
local pages for itself. The second check is because an interrupt
doesn't have a memory context and so can't have local pages. Next the page is
added to the list of local pages and the number of local pages is incremented.
index appears to be reused here to record the order, so the size of the block is
known when the local list is processed later.
In the normal scheme of things,
with an ordinary free, the code runs as follows
zone = page->zone;
mask = (~0UL) << order;
base = zone->zone_mem_map;
page_idx = page - base;
if (page_idx & ~mask)
BUG();
index = page_idx >> (1 + order);
area = zone->free_area + order;
mask will have multiple uses, each of which will be discussed in
turn. base is the beginning of the mem_map array for this zone.
page_idx is the index in the mem_map array we are looking at.
index is the offset within the free area bitmap. This is slightly
confusing because index is used when allocating pages to mean which page inside
mem_map is being referred to. area is the free area we are currently
dealing with; remember that zone->free_area is an array.
zone->free_pages -= mask;
This is subtle. The value of -mask is the number of pages being freed, so
subtracting mask adds that many pages to the free count. Remember, as a helpful
comment in the code says,
-mask = ~mask + 1
Next section
while (mask + (1 << (MAX_ORDER-1))) {
This while loop will exit when the maximum order is reached, no matter what
mask started out as. Each iteration of the loop moves mask one bit to the left.
When mask has been shifted up to the maximum order, the addition overflows to 0
and the while loop exits.
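A concrete trace, assuming MAX_ORDER is 10 (as in 2.4), an order-2 block being
freed and 32-bit arithmetic:
/* mask               = ~0UL << 2 = 0xfffffffc   (-mask = 4 pages freed)
 * 1 << (MAX_ORDER-1)             = 0x00000200
 *
 *   0xfffffffc + 0x200 = 0x1fc   non-zero, keep coalescing
 *   0xfffffff8 + 0x200 = 0x1f8   non-zero, keep coalescing
 *      ...
 *   0xfffffe00 + 0x200 = 0       mask == ~0UL << 9, loop exits
 *
 * Coalescing therefore stops once the block has grown to order MAX_ORDER-1,
 * the largest free list the allocator maintains.
 */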
struct page *buddy1, *buddy2;
if (area >= zone->free_area + MAX_ORDER)
BUG();
if (!__test_and_change_bit(index, area->map))
/*
* the buddy page is still allocated.
*/
break;
Basic declaration and a bug check. The __test_and_change_bit section is the
subtle part. The scheme appears to be that each bit in area->map covers a pair of
buddies and is toggled on every allocation and free of either one, so it is 0 when
both buddies are in the same state and 1 when exactly one of them is free. Since
the block we are freeing is currently allocated, a previous value of 0 means its
buddy is still allocated too, and no coalescing can happen.
buddy1 = base + (page_idx ^ -mask);
buddy2 = base + page_idx;
The value of buddy2 should be obvious: it's simply the block at the current
page_idx, the one we are about to free. The calculation of buddy1 is much more
subtle and requires careful thought. -mask evaluates to the number of pages being
freed, i.e. 2^order. Since page_idx is aligned to a multiple of 2^order,
exclusive-ORing it with 2^order flips exactly the bit that distinguishes a block
from its buddy, giving the buddy's index. The easiest way to see this is to write
the indexes out in binary.
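For example, freeing an order-2 block (4 pages, so -mask = 4) at two different
positions:
/* order = 2, -mask = 4 (binary 100)
 *
 *   page_idx =  8 (binary 01000):  8 ^ 4 = 12   buddy is the block at 12
 *   page_idx = 12 (binary 01100): 12 ^ 4 =  8   buddy is the block at 8
 *
 * The XOR simply flips bit 2, so blocks 8-11 and 12-15 are each other's
 * buddies and together form the order-3 block covering pages 8-15.
 */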
memlist_del(&buddy1->list);
mask <<= 1;
area++;
index >>= 1;
page_idx &= mask;
buddy1, which was sitting on this order's free list, is removed from it because
it is about to be merged with the block being freed into a block of the next
order up. The last four lines set up the while loop to see if the newly merged
block can in turn be merged with its own buddy.
}
memlist_add_head(&(base + page_idx)->list, &area->free_list);
spin_unlock_irqrestore(&zone->lock, flags);
return;
The final, possibly coalesced, block is added to the free list of the
last area that was examined. The spinlock is released and the function
exits.