Physical address space
Buddy allocator
Physical memory is allocated big chunks at a time; to efficiently allocate many smaller and different-size objects, the memory block gets partitioned with the buddy algorithm.
First, round up the request to the next power-of-two size. Check the free list for that size; if available, pop and use one block. If empty, take a block from the next larger free list, recursively split into buddies, push the unused buddies to their sizes list, and allocate the one you need.
If no space is available, allocate pages to create a new buddy block. When releasing a block, check if its buddy is also free, and in that case, merge them into a one big free block, possibly going up recursively.
kmalloc is built on top of the slab allocator, itself sitting on top of the
buddy allocator.
vmalloc lives directly on top of the buddy allocator.
Zonal page allocation
System's memory is partitioned among each NUMA node.
Each node has an instance in the pgdata_list, with which it manages its own
chunk of memory.
Each node's memory is partitioned in "zones", each being a physical memory range addressed to specific use cases. Each zone manages its free pages, storing page descriptors in a free list.
Essential zones are the ZONE_NORMAL (upper addresses) and ZONE_DMA (lower
addresses): the first is mapped by the kernel while the second is used by
devices.
User space page caching
Physically allocated pages can be anonymous (swap) or linked to files.
Each page is cached in a struct page descriptor.
Two types of mapping exist.
Forward mapping
Forward mapping is used to find the struct page given a virtual address.
For example, given a file struct inode, it contains a struct address_space,
which stores mappings between file's offsets and corresponding physical pages.
Anonymous pages (pages not linked to files) are linked to the swapfile; when
they can be swapped out, they enter the swap cache.
The swap cache is implemented using a single global struct address_space
named swapper_space, indexed by swap location rather than file offset.
Given a struct file or a swap location it is therefore possible to get the
corresponding underlying physical struct page.
This is useful for example to check residency: the struct page can tell if
its data is stored in RAM, if it's up to date or if it's to be reloaded from
disk.
Reverse mapping
Reverse mapping from a struct page identifies VMAs mapping that physical
page.
This is necessary when evicting the page, as all referencing virtual
pages must be invalidated in the PTEs (Page Table Entry) of processes.
This struct also keeps flags representing the dirty state of the page.
The struct page contains a mapping pointer referencing either a file's
address_space or an anonymous memory range (anon_vma) of a process using
the page.
Via these struct it's possible to reach a list of VMAs using the page.
By walking these VMAs, each is checked if containing virtual pages pointing to
the evicted page, possibly removing it from PTEs.
PFRA
The Page Frame Reclaim Algorithm is used to unload pages when free space
shrinks.
It is mainly operated by the kswapd daemon, which runs periodically, but
under high loads the buddy allocator can run it itself.
Pages are reclaimed via reverse mapping.
The reclaim policy is derived from the LRU-like "clock algorithm". A circular list keeps pointers to all pages, each containing a R bit representing recent references. When a program uses a page, it sets R=1. When in need of reclaiming a page, the page under the hand (index) is evicted if R=0, otherwise, if R=1, R is set to 0 and the hand moves forward until a reclaimable page is found.
File pages are majorly clean at the time of eviction, whilst anonymous ones are always dirty. For this reason, Linux keeps two different groups based on page type, to better balance use cases.
Each group (file's and anonymous pages) applies the clock algorithm on two
different circular list: actives and inactives.
When a file is accessed, it is put in inactives with R=1.
When shrink_list is called, it runs the clock algorithm on inactives.
When refill_inactive_zone is called, it runs the algorithm on actives,
evicting pages to inactives.
mark_page_accessed sets R=1 on the page; when it gets called on a R=1
inactive page, that gets moved to actives, setting R=0.
The inactive list should to be kept balanced with the active one, with its size matching the working set (number of "recently" used pages). The PFRA algorithm can thus be called periodically reallocating actives to inactives. This is needed as a small inactive list can lead to thrashing, resulting in premature evictions and frequent reloads.