Title: Memory Management in Linux
1Memory Management in Linux
2Memory Model
- Linear Address Space
- Each process has its own memory address space
- Mapping to Physical Memory by kernel MM
- x86 architecture: 32-bit (4GB)
- Actual Physical Memory
- Can range from KB to TB
- Larger than 4GB on the i386 architecture? Needs PAE
- Linux Memory Model
- Architecture-independent (mm/)
- Must be mapped to architecture-specific code (arch/xxx/mm)
3Process Memory Layout
- User Space
- Code segment, data segment (include/asm/page.h)
- Addressable from 0 to PAGE_OFFSET-1
- Kernel Space
- Code segment, data segment
- Shared by all processes
- Addressable from PAGE_OFFSET to (unsigned)(-1)
- In i386 architecture
- PAGE_OFFSET = 0xc0000000
- This means 1GB reserved for kernel, 3GB for user
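The user/kernel split above can be checked with a little arithmetic. The following is a minimal user-space sketch, assuming the i386 value of PAGE_OFFSET quoted on the slide; the helper names is_user_address() and kernel_space_bytes() are hypothetical and are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* i386 value from the slide; other architectures use other values. */
#define PAGE_OFFSET 0xC0000000u

/* Hypothetical helper: true if a 32-bit linear address is in user space. */
static bool is_user_address(uint32_t addr)
{
    return addr < PAGE_OFFSET;
}

/* Hypothetical helper: bytes of the 4GB space that belong to the kernel. */
static uint64_t kernel_space_bytes(void)
{
    return (UINT64_C(1) << 32) - PAGE_OFFSET;
}
```

With this value, the last user address is 0xbfffffff and the kernel part is exactly 1GB.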
4A Very Abstract Model
5Architecture Specific
- Page: basic unit of memory management
- PAGE_SIZE (include/asm/page.h)
- i386 architecture: page size is 4KB
- Alpha architecture: page size is 8KB
- Addressable memory space
- 32-bit x86 architectures: 4GB
- 64-bit Alpha architecture: 8TB
- Memory management highly related to hardware
- Lots of routines implemented in assembly code
6Page Translation Tables
- Linear virtual address
- Low bits: offset in the page (e.g., 12 bits in a 32-bit arch)
- High bits: identify the page unit (e.g., 20 bits)
- Page translation
- Mapping the high bits to the base address of the physical page
- OS kernel prepares the tables (pointed to by CR3)
- Actual translation done by hardware
- Multi-level page tables
- Single-level is not efficient for a large address space
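The split of a linear address into page number and offset can be sketched in a few lines. This is a user-space illustration only, assuming 4KB pages (12 offset bits) and a flat, single-level table; the names page_number(), page_offset() and translate() and the frame_base array are hypothetical stand-ins for what the hardware does with the kernel's tables.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u                  /* 4KB pages, as on i386 */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

/* High 20 bits: which page the address falls in. */
static uint32_t page_number(uint32_t linear)
{
    return linear >> PAGE_SHIFT;
}

/* Low 12 bits: offset inside that page. */
static uint32_t page_offset(uint32_t linear)
{
    return linear & (PAGE_SIZE - 1u);
}

/*
 * Single-level translation: frame_base[n] holds the physical base
 * address of page n. The offset is carried over unchanged.
 */
static uint32_t translate(const uint32_t *frame_base, uint32_t linear)
{
    return frame_base[page_number(linear)] | page_offset(linear);
}
```

A flat table for a 32-bit space would need 2^20 entries per process, which is the inefficiency that motivates multi-level tables.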
73-Level Page Table
- 3 levels of indirect table access to address a page
- Page Global Directory -> Page Middle Directory -> Page Table -> Page
- Look for pgd_t, pmd_t, pte_t
- In include/asm-xxx/: page.h, pgalloc.h, pgtable.h
- x86 memory address
- 10-bit (table), 0-bit (table), 10-bit (table), 12-bit (page): the middle directory is folded away
- include/asm-i386/pgtable-2level.h
- Alpha memory address
- 10-bit, 10-bit, 10-bit, 13-bit (page)
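The bit layouts above determine how an address is cut into table indexes. The sketch below is a user-space illustration, assuming only the field widths listed on the slide; struct va_layout and the helper functions are hypothetical names, not the kernel's pgd_offset()/pmd_offset()/pte_offset() macros.

```c
#include <stdint.h>

/* Field widths, from the most significant field to the page offset. */
struct va_layout {
    unsigned pgd_bits, pmd_bits, pte_bits, page_bits;
};

static const struct va_layout LAYOUT_I386  = { 10, 0, 10, 12 };
static const struct va_layout LAYOUT_ALPHA = { 10, 10, 10, 13 };

/* Extract `bits` bits of addr starting at bit `shift`. */
static uint64_t field(uint64_t addr, unsigned shift, unsigned bits)
{
    return (addr >> shift) & ((UINT64_C(1) << bits) - 1);
}

/* Number of virtual address bits covered by the layout. */
static unsigned va_bits(const struct va_layout *l)
{
    return l->pgd_bits + l->pmd_bits + l->pte_bits + l->page_bits;
}

static uint64_t page_off(const struct va_layout *l, uint64_t addr)
{
    return field(addr, 0, l->page_bits);
}

static uint64_t pte_index(const struct va_layout *l, uint64_t addr)
{
    return field(addr, l->page_bits, l->pte_bits);
}

static uint64_t pmd_index(const struct va_layout *l, uint64_t addr)
{
    return field(addr, l->page_bits + l->pte_bits, l->pmd_bits);
}

static uint64_t pgd_index(const struct va_layout *l, uint64_t addr)
{
    return field(addr, l->page_bits + l->pte_bits + l->pmd_bits, l->pgd_bits);
}
```

The widths add up to 32 bits (4GB) for i386 and 43 bits (8TB) for Alpha. A 0-bit middle level always yields index 0, which is how a three-level interface can run on two-level hardware.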
8Page Table Entries
- Page Table: pte_t
- Each entry: 32-bit integer
- Fields: the high bits (page frame address), flags
- Flags: present, accessed, dirty, R/W, user, ...
- Look for _PAGE_* macros
- Functions
- Look for pte_*() macros
- Page Directories: Functions
- Look for pgd_*() and pmd_*() macros
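A page table entry packs a frame address and flag bits into one word, and the pte_*() helpers are thin wrappers around bit tests. The following user-space sketch assumes the i386 bit positions (present 0x001, R/W 0x002, user 0x004, accessed 0x020, dirty 0x040); the toy_ and PTE_F_ names are hypothetical stand-ins for the kernel's pte_t, _PAGE_* and pte_*() definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag bits; the kernel names are _PAGE_PRESENT, _PAGE_RW, and so on. */
#define PTE_F_PRESENT  0x001u
#define PTE_F_RW       0x002u
#define PTE_F_USER     0x004u
#define PTE_F_ACCESSED 0x020u
#define PTE_F_DIRTY    0x040u

#define PTE_FRAME_MASK 0xFFFFF000u      /* high 20 bits: page frame base */

typedef struct { uint32_t val; } toy_pte_t;

static bool toy_pte_present(toy_pte_t e) { return (e.val & PTE_F_PRESENT) != 0; }
static bool toy_pte_write(toy_pte_t e)   { return (e.val & PTE_F_RW) != 0; }
static bool toy_pte_young(toy_pte_t e)   { return (e.val & PTE_F_ACCESSED) != 0; }
static bool toy_pte_dirty(toy_pte_t e)   { return (e.val & PTE_F_DIRTY) != 0; }

/* Return a copy of the entry with the dirty bit set. */
static toy_pte_t toy_pte_mkdirty(toy_pte_t e)
{
    e.val |= PTE_F_DIRTY;
    return e;
}

/* Physical base address of the frame the entry points to. */
static uint32_t toy_pte_frame(toy_pte_t e)
{
    return e.val & PTE_FRAME_MASK;
}
```

Because frames are 4KB aligned, the low 12 bits of the frame address are always zero, which is what frees them up for flags.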
9Virtual Memory
- Linux Virtual Memory Management
- Mapping, allocating, and managing physical pages
- Managing secondary memory: swapping
- Linux terminology
- "Swapping" means demand paging
- Architecture independent
- A clean interface to support the different memory mappings used under different architectures
- To start: include/linux/mm.h, mm/
10Using Memory in Kernel
- Kernel space is permanently mapped
- No virtual or secondary memory
- Be careful when you write code in kernel space
- Don't use too much memory
- Never use recursive calls (the kernel stack is limited too)
- Routines
- kmalloc(), kfree()
- vmalloc(), vfree()
- Macros to allocate pages
11Managing Physical Memory
- NUMA (Non-Uniform Memory Access)
- Memory in a machine is divided into nodes
- Memory access costs are non-uniform
- Same node: fast
- Different node: slower
- Within a node, all memory has the same access time
- Data structure: pg_data_t (include/linux/mmzone.h)
- Memory Zones
- Each node has several zones
- Each zone is a different type of memory
- Data structure: zone_t (include/linux/mmzone.h)
- Pages (Frames)
- Data structure: struct page (include/linux/mm.h)
12Nodes and Zones
[Figure: pgdat_list links the nodes (pg_data_t); each node contains zones (zone_t); each zone holds free_area[0] through free_area[9], zone_pgdat, and zone_mem_map]
133 Zones
- ZONE_DMA
- 0-16M (i386)
- DMA capability: some device drivers need to use this memory for I/O
- ZONE_NORMAL
- 16M-896M (i386)
- Normal memory: direct mapped by kernel
- ZONE_HIGHMEM
- >896M (i386)
- Not used in 64-bit architectures
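The zone boundaries above amount to a simple classification of physical addresses. This is a user-space sketch assuming the i386 limits from the slide (16MB and 896MB); the enum toy_zone and zone_for_phys() are hypothetical names, since the kernel assigns pages to zones when it builds each node at boot rather than per lookup.

```c
#include <stdint.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

#define MB(x) ((uint64_t)(x) << 20)

/* Classify a physical address by the i386 zone limits. */
static enum toy_zone zone_for_phys(uint64_t phys)
{
    if (phys < MB(16))
        return TOY_ZONE_DMA;        /* reachable by legacy DMA devices */
    if (phys < MB(896))
        return TOY_ZONE_NORMAL;     /* directly mapped by the kernel */
    return TOY_ZONE_HIGHMEM;        /* not permanently mapped */
}
```

On a machine with 512MB of RAM, for example, this classification never yields the high-memory zone.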
14Physical Pages
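[Figure: mem_map_t (struct page) with fields struct list_head list, ..., atomic_t count, unsigned long flags, struct list_head lru; pages are linked into lists hanging off the zone's free_area[0] through free_area[9]; the zone also holds zone_pgdat and zone_mem_map]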
15Page Allocation
- Contiguous and non-contiguous allocation
- The Buddy System Algorithm
- Pages are allocated in blocks whose sizes are powers of 2 (1, 2, 4, 8, 16, 32, 64, 128, 256, 512 pages)
- Each size has its own free list (called a free area)
- De-allocation: if the adjacent buddy block is also free, combine the two to form a new free block of the next larger size
- Data structure: free_area_t
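The coalescing step of the buddy system can be sketched as follows. This is a toy user-space model, not the code in mm/page_alloc.c: it assumes 512 pages, ten orders (0 to 9), and a plain flag array in place of the kernel's free_area_t lists and bitmaps. The names buddy_of(), free_block() and is_free are hypothetical.

```c
#define MAX_ORDER 10    /* orders 0..9: blocks of 1..512 pages */
#define NPAGES    512

/* is_free[o][p] != 0: the order-o block starting at page p is free. */
static unsigned char is_free[MAX_ORDER][NPAGES];

/* The buddy of a block differs from it only in bit `order` of its index. */
static unsigned long buddy_of(unsigned long page, unsigned order)
{
    return page ^ (1ul << order);
}

/*
 * Free the order-`order` block starting at `page`, merging with its
 * buddy for as long as the buddy is also free. Returns the order of
 * the block that finally ends up on a free list.
 */
static unsigned free_block(unsigned long page, unsigned order)
{
    while (order + 1 < MAX_ORDER) {
        unsigned long buddy = buddy_of(page, order);

        if (!is_free[order][buddy])
            break;                      /* buddy in use: stop merging */
        is_free[order][buddy] = 0;      /* take buddy off its free list */
        page &= ~(1ul << order);        /* merged block starts at the lower half */
        order++;
    }
    is_free[order][page] = 1;
    return order;
}
```

Freeing page 0 and then page 1 as single pages leaves one free two-page block at page 0; the XOR trick works because blocks of a given order are aligned to their own size.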
16Page Allocation Functions
- Allocate a block of 2^order contiguous pages
- struct page *alloc_pages(unsigned int gfp_mask, unsigned int order)
- Allocate a single page
- struct page *alloc_page(unsigned int gfp_mask)
- Other functions: see include/linux/mm.h
- GFP flags: where to allocate the page(s)
- GFP_KERNEL, GFP_USER, GFP_DMA, ...
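Since alloc_pages() takes an order rather than a byte count, callers need to convert a size into the smallest order that covers it. The sketch below shows that arithmetic in user space, assuming 4KB pages; order_for_bytes() is a hypothetical helper written for illustration.

```c
#define PAGE_SIZE 4096ul    /* i386 page size */

/* Smallest order such that 2^order pages hold `size` bytes. */
static unsigned order_for_bytes(unsigned long size)
{
    unsigned long pages = (size + PAGE_SIZE - 1) / PAGE_SIZE;  /* round up */
    unsigned order = 0;

    while ((1ul << order) < pages)
        order++;
    return order;
}
```

A request for three pages is rounded up to order 2 (four pages), so sizes just above a power of two waste the most memory.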
17Noncontiguous Memory
- Noncontiguous page frames in a contiguous linear address range
- Not all virtual memory maps to contiguous page frames
- Makes sense for infrequent use
- To allocate (in include/linux/vmalloc.h)
- void *vmalloc(unsigned long size)
- To release
- void vfree(const void *addr)
18Contiguous vs Non-Contiguous
19Dynamic Kernel Memory
- For small memory use: 32, 64, ..., 131072 bytes
- For kernel data objects that need to be frequently allocated and released
- Examples: file descriptors, sockets, inodes, ...
- Slab Allocator
- Group memory areas (called caches) by kernel data object type
- For each type, maintain a memory cache (free list) of objects that were previously allocated and then released
- Interfaces with the page frame allocator
- Source code: mm/slab.c
20Caches
- One cache for each type of object
- Look at /proc/slabinfo
- size-N and size-N(DMA): general-purpose caches
- Data structures
- Cache: struct kmem_cache_s
- Slab: struct slab_s
- Functions
- To create a cache: kmem_cache_create(...)
- To allocate an object: kmem_cache_alloc(...)
- To free (return to cache): kmem_cache_free(...)
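The core idea of a per-type cache, reusing released objects instead of going back to the page allocator, can be shown with a free list. This is a toy user-space sketch in which malloc() stands in for the page frame allocator; struct toy_cache and the toy_cache_*() functions are hypothetical and far simpler than kmem_cache_create()/kmem_cache_alloc()/kmem_cache_free(), which manage whole slabs.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_cache {
    size_t objsize;                 /* size of every object in this cache */
    void *free_list;                /* released objects, singly linked */
    unsigned long backend_allocs;   /* how often we had to go to malloc() */
};

static void toy_cache_init(struct toy_cache *c, size_t objsize)
{
    /* Each free object stores the next pointer in its own memory. */
    c->objsize = objsize < sizeof(void *) ? sizeof(void *) : objsize;
    c->free_list = NULL;
    c->backend_allocs = 0;
}

static void *toy_cache_alloc(struct toy_cache *c)
{
    if (c->free_list) {             /* fast path: reuse a released object */
        void *obj = c->free_list;
        c->free_list = *(void **)obj;
        return obj;
    }
    c->backend_allocs++;            /* slow path: ask the backing allocator */
    return malloc(c->objsize);
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
    *(void **)obj = c->free_list;   /* push onto the free list */
    c->free_list = obj;
}
```

An allocate, free, allocate sequence hits the backing allocator only once, which is the saving the slab allocator is after for frequently recycled kernel objects.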
21Caches and Slabs
22General Purpose Cache
- For general-purpose objects
- 13 general-purpose caches in the slab allocator
- Powers of 2: 32, 64, ..., 131072
- To allocate memory in this category
- #include <linux/slab.h>
- void *kmalloc(size_t size, int flags)
- Flags: SLAB_*
- To free memory
- void kfree(const void *objp)
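Choosing one of the 13 general-purpose caches is a round-up to the next power of two. The following user-space sketch assumes only the size range given on the slide (32 to 131072 bytes); general_cache_size() and general_cache_count() are hypothetical helpers, not part of the kmalloc() interface.

```c
#include <stddef.h>

#define MIN_CACHE_SIZE 32u
#define MAX_CACHE_SIZE 131072u

/*
 * Size of the smallest general-purpose cache that fits `size` bytes,
 * or 0 if the request is larger than the biggest cache.
 */
static size_t general_cache_size(size_t size)
{
    size_t cs;

    for (cs = MIN_CACHE_SIZE; cs <= MAX_CACHE_SIZE; cs <<= 1)
        if (size <= cs)
            return cs;
    return 0;
}

/* Number of power-of-two caches between the two limits. */
static unsigned general_cache_count(void)
{
    unsigned n = 0;
    size_t cs;

    for (cs = MIN_CACHE_SIZE; cs <= MAX_CACHE_SIZE; cs <<= 1)
        n++;
    return n;
}
```

A 100-byte request is therefore served from the 128-byte cache, and the range 2^5 to 2^17 gives the 13 caches mentioned above.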
23Summary Kernel Memory
Kernel data objects
Allocation functions: kmem_cache_alloc(), kmalloc()
Caches and slabs
Kernel virtual address space
Allocation functions: alloc_pages(), vmalloc()
Page Tables
Physical Memory: nodes, zones, pages
24Process Address Space
- Abstract model of user-space memory use
25Address Space in ELF Format
- Program segments
- Code: start_code to end_code
- Data: start_data to end_data
- BSS (Heap): start_brk to brk (growable)
- Can have more than one segment (program, shared libraries, etc.)
- Run-Time segment
- Stack: start_stack (growable)
- Arguments: arg_start to arg_end
- Environment: env_start to env_end (0xbfffffff)
26Process Memory Descriptor
- One per process (in include/linux/sched.h)
- struct mm_struct {
- struct vm_area_struct *mmap;
- ...
- pgd_t *pgd;
- ...
- unsigned long start_code, end_code, start_data, end_data;
- unsigned long start_brk, brk, start_stack;
- unsigned long arg_start, arg_end, env_start, env_end;
- ...
- };
27VM Area Descriptor
- Linear Address Interval (in include/linux/mm.h)
- struct vm_area_struct {
- struct mm_struct *vm_mm;
- unsigned long vm_start;
- unsigned long vm_end;
- struct vm_area_struct *vm_next;
- ...
- };
28VM Area
29VM Area Handlers
- To create a VM area and do the mapping
- do_mmap() (in include/linux/mm.h)
- Used in system calls: execve(), brk()
- To release a VM area and shrink the address space
- do_munmap() (in mm/mmap.c)
- Used in system calls: brk()
- To find a VM area that includes a given address
- find_vma() (in mm/mmap.c)
30VM Area Lookup by Address
- Given a virtual address, need to look up a VM area fast
- Used in page fault, memory mapping, VM area operations (like locking, etc.)
- Data structure: Red-Black tree and cache
- struct mm_struct {
- ...
- rb_root_t mm_rb;
- struct vm_area_struct *mmap_cache;
- ...
- };
31Find VMA
- More data structure
- struct vm_area_struct {
- ...
- rb_node_t vm_rb;
- ...
- };
- Function
- struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
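The lookup contract of find_vma() can be sketched in user space. The version below is a simplification under stated assumptions: it walks the sorted vm_next list instead of the red-black tree, and the toy_ structures carry only the fields needed here. It keeps two properties of the real function: the one-entry mmap_cache is checked first, and the result is the first area whose vm_end lies above the address, which may start above the address too.

```c
#include <stddef.h>

struct toy_vma {
    unsigned long vm_start;     /* first address in the area */
    unsigned long vm_end;       /* first address after the area */
    struct toy_vma *vm_next;    /* next area, sorted by address */
};

struct toy_mm {
    struct toy_vma *mmap;       /* head of the sorted list */
    struct toy_vma *mmap_cache; /* result of the last successful lookup */
};

/* First area with addr < vm_end, or NULL if there is none. */
static struct toy_vma *toy_find_vma(struct toy_mm *mm, unsigned long addr)
{
    struct toy_vma *vma = mm->mmap_cache;

    /* Fast path: the cached area contains the address. */
    if (vma && vma->vm_start <= addr && addr < vma->vm_end)
        return vma;

    /* Slow path: linear scan (the kernel walks the red-black tree). */
    for (vma = mm->mmap; vma; vma = vma->vm_next)
        if (addr < vma->vm_end)
            break;

    if (vma)
        mm->mmap_cache = vma;
    return vma;
}
```

Callers therefore still have to compare vm_start with the address to tell a hit from an address that falls in a hole below the returned area.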
32Shared Memory
- Two (or more) processes can share the same physical memory region
- May be mapped to different virtual addresses
- Through system calls
- shmat(), shmdt()
- Implicitly
- Shared library
33VM Area Shared Memory
34Backing Store
- Each VM area can be mapped to a file (in secondary memory)
- Explicit memory mapping through system calls
- mmap(), munmap(), mremap()
- Implicit mapping
- Code segment (loaded from an executable binary file)
- Swapping (mapped to the swap file)
35VM Area Backing Store
- Data structure (in include/linux/mm.h)
- struct vm_area_struct {
- ...
- unsigned long vm_pgoff; /* offset in pages */
- struct file *vm_file; /* mapped file */
- ...
- };
36Swapping and Page Cache
- For pages in a process's user space
- Swap: secondary memory on the disk
- Page cache: main memory
- Data structure: 3 sets of lists
- Active pages, usually mapped by a process's PTE
- active_list (in mm/page_alloc.c)
- Inactive, unmapped, clean or dirty
- inactive_dirty_list (in mm/page_alloc.c)
- Clean pages, unmapped (one list per zone)
- zone_t.inactive_clean_list (in include/linux/mmzone.h)
37Kernel Swap Daemon
- Implemented as a kernel thread
- kswapd() (in mm/vmscan.c)
- Wake up periodically
- Wake up more frequently if memory shortage
- Check memory and if memory is tight
- Age pages that have not been used
- Move pages to inactive lists
- Write dirty pages to disk
- Swap pages out if necessary
38More kswapd()
- Call swap_out() to scan inactive page lists
- Removes page reference from the process's page table
- Actual swapping is done independently by file I/O
- Call refill_inactive_scan() to
- Scan the active_list to find unused pages
- Call age_page_down() to reduce the page->age count
- If page->age is zero, move to inactive_dirty_list
- Call page_launder() to clean dirty pages
- Scan inactive_dirty_list for dirty pages, write to disk
- Move clean pages to the zone's inactive_clean_list
39Life Cycle of a Page in Cache
- A page is read into memory from disk
- Caused by page-fault, or read-ahead
- Added to page cache active_list
- Page is dirty if written by the process
- If not used, the page->age count is gradually reduced
- If page->age is zero, moved to inactive_dirty_list
- May be moved to an inactive_clean_list later
- Page can be recovered from an inactive list
- Inactive page can be released
- reclaim_page() to free the page for use by others
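The aging part of this life cycle can be modeled as a small state machine. The sketch below is a user-space toy, not the code in mm/vmscan.c: it assumes that aging down halves the age (an assumption about the policy, not something the slides state), and the toy_ names, the list enum, and the dirty flag are all hypothetical simplifications of the three lists named above.

```c
enum toy_list { TOY_ACTIVE, TOY_INACTIVE_DIRTY, TOY_INACTIVE_CLEAN };

struct toy_page {
    unsigned age;           /* stands in for page->age */
    int dirty;              /* non-zero if the page must be written back */
    enum toy_list list;     /* which list the page currently sits on */
};

/* Assumed aging policy: halve the age on every pass. */
static void toy_age_down(struct toy_page *p)
{
    p->age /= 2;
}

/* One scan of an unused active page: age it, demote it at age zero. */
static void toy_refill_inactive(struct toy_page *p)
{
    if (p->list != TOY_ACTIVE)
        return;
    toy_age_down(p);
    if (p->age == 0)
        p->list = TOY_INACTIVE_DIRTY;
}

/* Laundering: "write" a dirty page, then move it to the clean list. */
static void toy_launder(struct toy_page *p)
{
    if (p->list != TOY_INACTIVE_DIRTY)
        return;
    p->dirty = 0;           /* stands in for the disk write */
    p->list = TOY_INACTIVE_CLEAN;
}
```

Under this assumed policy, a page that starts at age 4 and is never touched again leaves the active list on the third scan.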
40Demand Paging
- Page frame for a VM area is not in core
- Page frame is not allocated when VM area is created
- Page frame can be swapped out
- Handled by page fault
41Page Fault
- Exception handler
- Raised by address translation (hardware)
- Call do_page_fault() to handle this exception
- do_page_fault()
- Architecture-specific
- i386: arch/i386/mm/fault.c
- Find a page frame in the physical memory
- Load the missing page in
- Update the page tables
42do_page_fault()
- ...
- vma = find_vma(mm, address);
- if (!vma)
- goto bad_area;
- if (vma->vm_start <= address)
- goto good_area;
- ...
- good_area:
- ...
- handle_mm_fault(mm, vma, address, write);
- ...
- bad_area:
- ...
- force_sig_info(SIGSEGV, &info, tsk);
- ...
43handle_mm_fault()
- ...
- pgd = pgd_offset(mm, address);
- pmd = pmd_alloc(mm, pgd, address);
- if (pmd) {
- pte = pte_alloc(mm, pmd, address);
- if (pte)
- return handle_pte_fault(...);
- }
- ...
- handle_pte_fault()
- do_no_page() if the pte entry is all-zero
- do_swap_page() if the pte entry is non-zero (but not present)
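The dispatch at the end of handle_pte_fault() depends only on the contents of the entry. The following user-space sketch assumes the i386 present bit (0x001) and models just the two cases named above; enum toy_action and classify_pte() are hypothetical, and the handling of entries that are already present (for example write-protection faults) is deliberately left out.

```c
#include <stdint.h>

#define PTE_F_PRESENT 0x001u    /* kernel name: _PAGE_PRESENT */

enum toy_action {
    TOY_DO_NO_PAGE,     /* entry is all-zero: page was never mapped */
    TOY_DO_SWAP_PAGE,   /* entry is non-zero but not present: page is in swap */
    TOY_PRESENT         /* page is in memory; other handling not shown */
};

/* Decide which handler a faulting entry would be routed to. */
static enum toy_action classify_pte(uint32_t pte)
{
    if (pte & PTE_F_PRESENT)
        return TOY_PRESENT;
    if (pte == 0)
        return TOY_DO_NO_PAGE;
    return TOY_DO_SWAP_PAGE;
}
```

A swapped-out page can be found again because the non-present entry still carries information: with the present bit clear, the hardware ignores the remaining bits, so the kernel is free to store a swap location there.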
44Summary Process Memory
- Process virtual address space
Memory Area
Page Tables
Page Fault
Backing Store
Kswapd
Physical Memory
45Memory Management Summary
- Physical Page Management
- Page Table Structure
- Memory Zones and Page Allocation
- Kernel Memory Management
- Slab and Allocation Routines
- Virtual Memory Management (Process Address Space)
- Memory Layout and VM Area
- Page Cache and Swapping
- Page Fault