Title: Advanced Memory Management
Advanced Topics
- Distributed Shared Memory
- Application Controlled Memory Management
- TLB Issues
Distributed Shared Memory
- on hardware
- in the operating system
- in user space
Distributed Shared Memory
- a way to facilitate the programming of distributed systems: DSM
- the alternative is message passing
- provides a uniform, single-address-space view
- easy to use
- separates the memory view from threads
- each node can access all the memories, even though the machines do not physically share memory
- a software layer allows application software to access shared data not resident at the node
- can be integrated with the OS (IVY, Clouds, ..)
- or run on top of the OS (NOW)
Issues in the design of DSM
- virtual memory and DSM
- find where the physical page frame is
- make it coherent if duplicated
- memory model and coherence protocol
- what is guaranteed when data is accessed in parallel
- how to make shared pages coherent
- synchronization
- will be used frequently in parallel programs on DSM
- hardware support
- speed: network, CPU
- functionality: TLB, message processing
Distributed Shared Memory
- Virtual Memory Mechanism (a sketch in code follows this list)
- 1. the CPU generates a virtual address
- 2. if the page is in local memory, keep going
- else, send a request for the page to ???
- needs information on who has it, or
- needs information on who knows where the page is
- 3. the node that has the page replies with the page
- with write privilege if it is a write request, invalidating its own copy
- with read-only privilege otherwise, keeping the ownership
- when a page has to be replaced, either
- swap it to disk, or
- send it to network DRAM
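A minimal sketch of this fault path in C, assuming hypothetical helpers (page_is_local, find_owner, send_page_request, take_ownership, map_frame); the actual owner lookup depends on the ownership scheme discussed on the following slides.

    /* hedged sketch of a DSM page-fault handler; all helpers are hypothetical */
    #define PAGE_SIZE 4096UL

    typedef enum { ACCESS_READ, ACCESS_WRITE } access_t;

    void dsm_page_fault(unsigned long vaddr, access_t access)
    {
        unsigned long vpn = vaddr / PAGE_SIZE;
        if (page_is_local(vpn))
            return;                            /* step 2: keep going */
        int owner = find_owner(vpn);           /* directory, manager, or owner chain */
        void *frame = send_page_request(owner, vpn, access);
        if (access == ACCESS_WRITE)
            take_ownership(vpn);               /* the old owner invalidated its copy */
        map_frame(vpn, frame, access);         /* step 3: map with read or write privilege */
    }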
Consistency Protocol
- similar to those of distributed file systems and SMP caches
- defines the action on a write to a shared page
- invalidate (a page)
- update (a page, a word, a cache line, ...)
- the length of the write run decides which one is better
- defines the state of a page (one possible entry layout is sketched below)
- shared or exclusive
- write or read
- ownership, if any
- defines the information about
- how to find the owner
- how to find replicated copies
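A minimal per-page entry that such a protocol might keep; the field names are illustrative, not taken from any particular system.

    #include <stdint.h>

    typedef enum { PAGE_INVALID, PAGE_SHARED, PAGE_EXCLUSIVE } page_state_t;

    struct dsm_page_entry {
        page_state_t state;    /* shared or exclusive */
        int          owner;    /* node holding ownership, if any (-1: none) */
        uint64_t     copyset;  /* bitmap of nodes holding replicated copies */
    };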
Memory Model
- defines the result of a series of memory operations
- restricts the implementation of shared memory
- may disallow caching
- may allow out-of-order processing of memory operations
- sequential consistency (Lamport)
- the strongest model
- the result should be the same as some serial order of the operations
- restricts any out-of-order accesses to memory
- release consistency (illustrated below)
- based on synchronization operations
- acquire
- release
- guarantees consistency only at the end of a release operation
- allows buffering of memory accesses inside a critical section
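A small illustration of what release consistency buys: writes inside the critical section may be buffered locally and need not be visible remotely until the release. The dsm_acquire/dsm_release primitives and the types are hypothetical.

    void deposit(struct account *shared, dsm_lock_t *lock)
    {
        dsm_acquire(lock);     /* acquire: obtain updates released by other nodes */
        shared->balance += 1;  /* these writes may sit in a local buffer for now  */
        shared->count   += 1;
        dsm_release(lock);     /* release: all buffered writes must now be
                                  globally performed (e.g., propagated as diffs)  */
    }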
GLUnix
- a DSM that runs on top of the OS
- minimizes OS modification
- fast prototyping
- most applications can run without modification
- modifications needed:
- a network protocol suited for page handling
- use of network RAM as secondary storage
- a page fault handler
- page tables
GLUnix (2)
- Virtual OS layer
- captures DSM-related interrupts and syscalls generated by running application programs
- how to capture external interrupts?
- software fault isolation
- insert a check before each instruction that may cause an interrupt
- Issues
- load balancing
- a parallel computation finishes when its slowest component finishes
- communicating processes
- lots of small messages
IVY
- the first DSM
- invalidation-based
- ownership-based
- tested three algorithms for finding an owner (the third is sketched below)
- centralized server
- fixed distributed manager
- each node has the ownership of a predetermined set of pages
- dynamic distributed manager
- the page table of each node records the probable owner of each page
- if that node is not the true owner, it knows where to look next for the search
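A sketch of the dynamic distributed manager's lookup, forwarding a request along probable-owner links until the true owner is reached. The messaging helpers are hypothetical, and real IVY also shortens the chain by updating probable owners as requests and replies flow.

    /* local request: send toward our best guess of the owner */
    void request_page(unsigned long vpn)
    {
        send_msg(probable_owner[vpn], PAGE_REQ, vpn, self());
    }

    /* on receiving a request: reply if owner, otherwise forward the request */
    void on_page_request(unsigned long vpn, int requester)
    {
        if (i_am_owner(vpn))
            send_page(requester, vpn);
        else
            send_msg(probable_owner[vpn], PAGE_REQ, vpn, requester);
    }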
IVY (2)
- Process Migration
- needed for load balancing
- sends the PCB to the destination
- sends the pages containing the stack to the destination, with write privilege
- why? those pages will be transferred anyway as the stack is accessed on the destination node
- Good performance
- runs only applications that suit DSM
- some applications show even super-linear speedup
Munin
- DSM-ized the V kernel
- Software Release Consistency
- Multiple Consistency Protocols
- the performance of a consistency protocol is sensitive to
- the structure of parallel programs
- the sharing pattern inside a program
- a protocol is defined for each shared object
- an annotation is needed on each object to define the protocol
- if it is missing, a default protocol is used
Munin Annotation
- what is defined in an annotation? (one possible encoding is sketched below)
- invalidate vs. update
- replication allowed?
- delayed operation allowed?
- fixed owner?
- multiple writers allowed?
- static sharing pattern?
- if the object is accessed by a single thread in a static pattern, updates are sent to the same node even before the node requests them
- flush changes to owner?
- writable?
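One plausible encoding of such an annotation as a per-object flag set; the names mirror the list above and are illustrative, not Munin's actual declarations.

    struct munin_annotation {
        unsigned use_update       : 1;  /* update protocol rather than invalidate */
        unsigned replicable       : 1;  /* replication allowed? */
        unsigned delayed_ops      : 1;  /* operations may be delayed/batched? */
        unsigned fixed_owner      : 1;  /* ownership never migrates */
        unsigned multiple_writers : 1;  /* concurrent writers allowed? */
        unsigned static_pattern   : 1;  /* push updates to a known consumer */
        unsigned flush_to_owner   : 1;  /* flush changes back to the owner */
        unsigned writable         : 1;
    };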
Munin Release Consistency
- example: two threads update different variables that happen to live in the same page
- thread-A: lock(A); X = X + 1; unlock(A)
- thread-B: lock(B); Y = Y + 1; unlock(B)
- (X and Y are in the same page)
- when updated pages are flushed to the owner, how do we update the home page?
- introduce a twin page (diffing is sketched below)
- if there is no twin, there is no problem
- else, the differences against each twin page are applied
- where to keep the twin pages?
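A hedged sketch of the twin-and-diff idea: copy the page before the first write, then at release compare word by word and send only the changed words home. The page structure and helpers are illustrative.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096UL

    struct page {
        unsigned long  vpn;
        unsigned char *data;
        unsigned char *twin;   /* NULL until the first write */
    };

    /* on the first write to a shared page: make a twin (pristine copy) */
    void make_twin(struct page *pg)
    {
        pg->twin = malloc(PAGE_SIZE);
        memcpy(pg->twin, pg->data, PAGE_SIZE);
    }

    /* at release: diff the page against its twin, send only modified words */
    void flush_diffs(struct page *pg, int home)
    {
        uint32_t *cur = (uint32_t *)pg->data, *old = (uint32_t *)pg->twin;
        for (size_t i = 0; i < PAGE_SIZE / sizeof(uint32_t); i++)
            if (cur[i] != old[i])
                send_word_update(home, pg->vpn, i, cur[i]);  /* hypothetical */
        free(pg->twin);
        pg->twin = NULL;
    }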
Munin Directory Structure
- a hash function directs an address to an entry in the table
- the entry contains the object description (a possible layout is sketched below):
- start address and size
- the protocol defined for the object
- state: valid, writable, modified, replicated
- copyset: bitmap? linked list?
- synchq: a pointer to the synch object that governs this object
- probable owner: best guess
- home node: for bookkeeping
- access-control semaphore
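A possible C layout of such a directory entry, mirroring the fields above; the names and types are illustrative.

    #include <stdint.h>
    #include <semaphore.h>

    struct munin_dir_entry {
        void    *start;            /* start address of the object */
        size_t   size;
        int      protocol;         /* protocol chosen via the annotation */
        unsigned valid:1, writable:1, modified:1, replicated:1;
        uint64_t copyset;          /* here: a bitmap of nodes holding copies */
        void    *synchq;           /* synch object governing this object */
        int      probable_owner;   /* best guess */
        int      home;             /* home node, for bookkeeping */
        sem_t    lock;             /* access-control semaphore */
    };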
Munin (3)
- Merging Sync with Data Transfer
- a message transfer is expensive in DSM
- most sync operations are used to control shared data
- so let's merge the two into a single message
- how do we know which sync object governs which data object?
- the programmer knows
- it can be declared at variable declaration
- when a lock is released to a node
- the data it governs is sent along with it
Shasta
- Motivations
- fine-grain sharing will reduce false sharing
- run binary executables
- most commercial software is distributed this way
- insert checks into application executables (as in Blizzard-S) at loads and stores
- ordinary (unoptimized) checking overhead: 50–150%
- support an SMP as a node
- Virtual address space
- the conventional space is private
- code, static data, stack
- the shared space is dynamically allocated (following the convention of SPLASH)
Shasta Coherence Protocol
- three states
- invalid, shared, exclusive
- directory-based invalidation
- a home node is assigned to each virtual page
- the owner node is the last node that updated the page
- the directory contains
- a pointer to the owner
- a full-map bit vector of all sharers
- coherence unit and coherence information
- blocks (multiples of lines): by directory information
- lines (64–128 bytes): by the state table
Shasta (2)
- Polling instead of interrupts for coherence actions
- polling is much more efficient (only 3 instructions)
- simplifies the concurrency problem of handling a miss
- while a miss is being handled, messages related to this miss can arrive
- places to insert polling code
- wherever the protocol waits for a message
- and, depending on the desired response time:
- at every function call
- at every loop backedge
Shasta Shared Miss Check
- each load and store should be checked for whether it is a miss
- instructions that need not be checked:
- private and stack accesses
- addresses calculated from the above addresses
- check whether they use the registers used for private data
- normal operation:
- 1. check whether the target address is in the shared region
- 2. if so, look up the state in the state table
- 3. if needed, call the miss handling routine
Inserted Code for Store Check
- the check inserted before a store (Alpha assembly; the instruction numbers are referenced on the next slide):
    1. lda   rx, offset(base)   ; effective address of the store target
    2. srl   rx, 39, ry         ; high bits tell whether the address is in the shared region
    3. beq   ry, nomiss         ; not in the shared region: no check needed
    4. srl   rx, 6, rx          ; line number = address/64 = byte index into the state table
    5. ldq_u ry, 0(rx)          ; load the quadword containing the state byte
    6. extbl ry, rx, ry         ; extract the state byte for this line
    7. beq   ry, nomiss         ; 0 (exclusive): the store may proceed
    8. call  miss handler
- (the original slide also pictured the address-space layout: shared region, state table, static data, text, stack)
Inserted Code for Store Check (2)
- no register save/restore
- use unused registers
- if no unused ones can be found, insert code to secure two registers
- instructions 2 and 6 require smart address-space allocation for
- the shared region
- the state table
- what the code does:
- 1. calculate the effective address of the target
- 2., 3. check whether the target address is within the shared region
- 4. calculate the (byte) address in the state table for the target address
- 5., 6. extract the state information
- 7. if it is 0 (exclusive), go to nomiss
Shasta Optimization
- code rescheduling
- a shift instruction needs 2 cycles to produce its result
- branch delay slots can be filled with the above check code
- rx and ry are unused registers, so there is no dependency
- load checks (sketched below)
- when a line becomes invalid, store a fixed flag value into each long word of the line
- for a load check, compare the loaded long word with the flag value
- if equal, call the miss handling routine
- else, continue
- the flag value should not be one used frequently in normal computation
- not zero, not a small positive integer
- 253 was chosen in Shasta
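A C-level sketch of the flag technique (Shasta emits this as inline Alpha code; the handler and its behavior on a false positive are illustrative):

    #define MISS_FLAG 253L   /* the "unlikely" value stored into invalid lines */

    long checked_load(long *addr)
    {
        long v = *addr;              /* do the load first: the common case */
        if (v == MISS_FLAG)
            /* either a real miss, or the program truly stored 253:
               the handler consults the state table to disambiguate */
            v = load_miss_handler(addr);
        return v;
    }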
Shasta Optimization (2)
- store check
- a separate exclusive bit (1 bit) per line is kept
- a table of such bits occupies only a small space in the data cache, reducing the cache misses for looking up the state table
- batching miss checks
- if several instructions touch the same line, one check is enough for all of them
- multiple granularity
- applications may define the block size at the malloc() call
Alpha LL and SC
- the synchronization primitives of the Alpha
- a lock_flag and a lock_address per processor
- operations
- LL sets the lock_flag and the lock_address
- the lock_flag is reset if another processor writes to the line at the lock_address
- SC succeeds if the lock_flag is still set
- an exact implementation would be expensive (inefficient) for Shasta
- Alpha programming recommendations:
- for each SC there is a unique LL
- no store or load between the LL and the SC
- the SC and LL target the same line
Shasta Approach to LL and SC
- before an LL (the whole protocol is sketched in code below)
- save the state of the line in a register
- get the latest copy if the state is invalid
- before an SC
- if the saved state is exclusive: OK
- if invalid: return failure
- if shared: send a special message to the home
- at the home node
- if the requester is still a sharer, send OK
- else, send failure
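A hedged C sketch of that protocol; state names and messaging helpers are illustrative, and the successful shared case is simplified (it would also have to upgrade the line to exclusive).

    enum line_st { INVALID, SHARED, EXCLUSIVE };

    static enum line_st ll_state;      /* line state saved at the LL */

    long emulated_ll(long *addr)
    {
        ll_state = line_state(addr);
        if (ll_state == INVALID) {
            fetch_line(addr);          /* get the latest copy */
            ll_state = line_state(addr);
        }
        return *addr;
    }

    int emulated_sc(long *addr, long val)
    {
        if (ll_state == EXCLUSIVE) { *addr = val; return 1; }  /* succeeds */
        if (ll_state == INVALID)   return 0;                   /* fails    */
        /* shared: the home decides whether we are still a sharer */
        if (ask_home_still_sharer(addr)) { *addr = val; return 1; }
        return 0;
    }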
Memory Barrier
- a fence operation that forces all pending operations of the processor to be globally performed
- inefficient, because everything is blindly performed at the MB
- Shasta at an MB
- finishes all pending operations
- also executes the hardware MB instruction, for SMP nodes
System Calls
- validating arguments
- what if the arguments are in the shared region?
- copy the arguments from the shared region to local memory
- an expensive operation
- or validate the arguments using a wrapper
- make sure the arguments are in the proper states
- supporting multiple clusters
- replace all related calls with Shasta's calls:
- process management
- shared memory management
- threads that share an address space
- would need inline checking even for accesses to private data and the stack: expensive
- a page-based protocol can be used for the stack
- access to remote files
- a distributed file system is needed
Process Handling
- processes are created/terminated dynamically
- issues
- data and state information owned by a terminated process
- more processes than processors
- inactive processes may delay the servicing of requests from other processes
- solution
- a daemon process per processor while the application is running
- it shares all the data with the processes allocated to the same processor
- it runs at a low priority and handles the messages that arrive for its peer processes
Code modification
- when?
- at load time
- it may slow down loading
- but often you have only binaries
- caching will reduce the number of modifications for frequently-used code
- caveat: programs that generate code
Page Table for 64-bit OS
- Motivations
- the page table is huge for a 64-bit address space
- an inverted page table is not a solution, due to the increased physical memory sizes
- most programs use the address space sparsely
- Multi-level page tables
- PTEs are structured into an n-ary tree
- significantly reduces the PTEs kept for unused address space
- when the height of the tree is large
- too many memory references to find a PTE
- when it is too small
- we lose the benefit of multiple levels
Page Table for 64-bit OS
- Hashed page tables (a sketch follows this list)
- a hash function maps a VPN to a bucket
- a bucket is a linked list of elements, each consisting of
- a PTE (PPN, attributes, valid bit, ..)
- the VPN (almost 8 bytes)
- a next pointer (8 bytes)
- space overhead: 16 bytes per PTE
- the next pointer can be eliminated by allocating a fixed number of elements per bucket
- but the overflow problem remains
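A minimal sketch of such a hashed page table; the hash function and table size are illustrative choices.

    #include <stdint.h>
    #include <stddef.h>

    #define NBUCKETS 4096   /* illustrative table size */

    struct hpt_element {
        uint64_t vpn;               /* tag: which virtual page (~8 bytes) */
        uint64_t pte;               /* PPN + attributes + valid bit */
        struct hpt_element *next;   /* 8 bytes of overhead per element */
    };

    struct hpt_element *buckets[NBUCKETS];

    uint64_t *hpt_lookup(uint64_t vpn)
    {
        struct hpt_element *e = buckets[vpn % NBUCKETS];  /* hash: illustrative */
        for (; e != NULL; e = e->next)
            if (e->vpn == vpn)
                return &e->pte;
        return NULL;                /* not mapped */
    }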
Page Table for 64-bit OS
- Clustered page tables (element layout sketched below)
- each element of the linked list maps multiple pages:
- the VPN
- a next pointer
- n PTEs
- a memory object (in virtual address space) usually occupies multiple pages
- so the space overhead of the hashed page table is amortized
- more efficient than a linear table for a sparse address space
Clustered Page Table
- Operations
- adding a PTE
- hashed: memory allocation, list insertion, and PTE initialization for each new PTE
- clustered:
- memory allocation and list insertion once per n PTEs
- initialization for each PTE
- modifying a PTE
- modification is done for a memory object, not for a single page, so the clustered scheme is more efficient
- synchronization
- many threads use the page table concurrently
- a cluster lock for a group of pages
- reduces concurrency
- but has less blocking overhead
- finer granularity can be supported with some overhead
TLB Issues
- can this scheme support new TLB technologies such as superpages and subblocking?
- Superpage
- a superpage is 2^n times the base page size
- each TLB entry must have a size field
- why not segmentation?
- complex, because a segment's size is arbitrary and it starts at an arbitrary location
- reduces TLB misses, since each entry maps a wider region
- good for frame buffers, kernel data, DB buffer pools
- how about the file cache?
TLB Issues (2)
- Subblocking
- put multiple PPNs in one TLB entry
- it may waste TLB space
- partial subblocking
- physical memory is aligned, so
- one PPN per TLB entry
- multiple valid bits are needed
- Clustered page table
- just needs a field indicating whether a list element describes a normal cluster, a partial subblock, or a superpage (see the sketch below)
- the mechanisms for the operations are naturally similar
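One way the clustered element might carry that field; the enum, the size field, and the valid-bit mask are illustrative assumptions.

    #include <stdint.h>

    #define CLUSTER 16

    enum cpt_kind { CPT_NORMAL, CPT_SUPERPAGE, CPT_PARTIAL_SUBBLOCK };

    struct cpt_element2 {
        uint64_t base_vpn;
        struct cpt_element2 *next;
        enum cpt_kind kind;     /* normal cluster, superpage, or partial subblock */
        uint8_t  size_log2;     /* superpage: 2^n base pages */
        uint16_t valid_bits;    /* partial subblock: one valid bit per page */
        uint64_t pte[CLUSTER];  /* one PTE per page (one shared PPN if superpage) */
    };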
Application Controlled MM
- why user-controlled anything?
- computing usages are too diverse:
- multimedia data
- real time
- personal computing
- large scientific applications (usually parallel computing)
- a general OS cannot satisfy the needs arising from such diversity
- user-controlled what?
- almost any part of the OS: scheduling, memory management, file system, network protocol, security, ...
Application Controlled MM
- Mechanisms for User Control
- microkernel approach
- the parts of the OS that need customization are prepared as user processes, or
- they are prepared as library functions that can be bound into applications (ExoKernel)
- modular but inefficient
- binaries loadable into the OS
- needs a dynamic-linking method inside the OS
- efficient but insecure
External Pager
- Motivation: an application doesn't know and can't control
- the amount of memory available to it
- some programs can use as much memory as they are given
- which parts of it are kept in memory
- some data accesses are predictable
- Some solutions
- allow applications to pin pages in memory
- disables the OS's ability to share that memory
- hard to know how much to pin
- allow applications to advise the VM system
- the madvise() system call
- a very primitive yet complex mechanism
External Pager
- segment manager
- a user-level pager
- reclaims page frames
- writes back page frames
- on a page fault
- the kernel forwards the event to the segment manager
- via a signal or interrupt
- the manager reclaims a page frame
- it may write back a page
- it needs to maintain a list of free pages
External Pager
- system calls (a usage sketch follows this list)
- SetSegmentManager(seg, manager)
- specifies the manager of a segment
- MigratePages(srcSeg, dstSeg, flags)
- moves pages from one segment to another
- ModifyPageFlags(seg, flags)
- sets/clears the dirty bit, protection
- GetPageAttribute(seg, pages)
- determines the flags and mappings of pages
- the manager can be part of the application
- recursive page faults may occur
- so a manager pins its stack into memory
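A hedged sketch of how an application-resident manager might use these calls; everything beyond the four calls listed above (the upcall signature, segment handles, helper functions, flag values) is an assumption.

    /* register ourselves as the pager for a segment */
    void become_manager(segment_t seg)
    {
        SetSegmentManager(seg, my_fault_handler);
    }

    /* hypothetical fault upcall, delivered by the kernel via signal/interrupt */
    void my_fault_handler(segment_t seg, page_t faulted)
    {
        page_t victim = pick_victim(seg);    /* the manager's own policy      */
        if (page_is_dirty(seg, victim))      /* checked via GetPageAttribute() */
            writeback(seg, victim);
        MigratePages(seg, free_seg, 0);      /* return the frame to a
                                                free-page segment (flags
                                                here are illustrative)        */
        /* ... then map the freed frame for the faulted page and resume */
    }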
External Pager
- how a manager gets a free page
- from a free-page segment that it manages
- by reclaiming a page from another segment it manages
- by requesting an additional page from the kernel
- System page cache manager
- the controller of the machine's global memory pool
- segment managers get their segments from it
- it may approve, deny, or partially fulfill requests
Memory Market Model
- until now, scheduling has dealt with the time a program uses
- with multiprocessors, memory will be more contended than the CPU
- charge each process for space_used x time (a trivial sketch follows)
- an application requests an amount of DRAM initially
- the kernel allocates DRAM according to the system status and the user request
- applications choose whether they want a large memory so as to execute fast, OR
- a small memory, when the job is not very urgent
- questions remain
- interaction with CPU scheduling
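The charging rule in code form, as a trivial sketch; the units (bytes, seconds) and the rate are arbitrary choices here.

    #include <stddef.h>

    /* charge = space_used x time; units and rate are illustrative */
    double memory_charge(size_t bytes_held, double seconds, double rate)
    {
        return (double)bytes_held * seconds * rate;
    }
    /* e.g., holding 64 MB for 10 s at a rate of 1e-9 per byte-second
       costs 64*2^20 * 10 * 1e-9, i.e., about 0.67 units */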
Summary
- the external pager is a trend
- the OS should provide abstractions of the hardware that are complete in
- functionality (don't hide useful functionality)
- performance (don't hide performance)
- other user-controlled approaches
- gang scheduling on parallel machines
- scheduler activations
- user-level devices and file systems