Title: Advanced Memory Management
Advanced Topics
- Distributed Shared Memory
- Application Controlled Memory Management
- TLB Issues
Distributed Shared Memory
- on hardware
- in the operating system
- in user space
Distributed Shared Memory
- a way to facilitate the programming of distributed systems: DSM
- the alternative is message passing
- provides a uniform, single-address-space view
- easy to use
- separates the memory view from threads
- each node can access all the memories, even though the machines do not physically share memory
- a software layer allows application software to access shared data not resident at the node
- can be integrated with the OS (IVY, Clouds, ..)
- or run on top of the OS (NOW)
Issues in the design of DSM
- virtual memory and DSM
- find where the physical page frame is
- make it coherent if duplicated
- memory model and coherence protocol
- what is guaranteed when data is accessed in parallel
- how to make shared pages coherent
- synchronization
- will be used frequently in parallel programs on DSM
- hardware support
- speed: network, CPU
- functionality: TLB, message processing
Distributed Shared Memory
- Virtual Memory Mechanism (a sketch in code follows this list)
- 1. the CPU generates a virtual address
- 2. if the page is in local memory, keep going
- else, send a request for the page to ???
- needs information on who has it, or
- needs information on who knows where the page is
- 3. the node that has the page replies with the page
- with write privilege if it is a write request, invalidating its own copy
- with read-only privilege otherwise, keeping the ownership
- when a page has to be replaced, either
- swap it to disk, or
- send it to network DRAM
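A minimal sketch of this fault path in C, assuming hypothetical helpers (page_is_local, find_owner, send_page_request, take_ownership, map_frame); the actual owner lookup depends on the ownership scheme discussed on the following slides.

    /* hedged sketch of a DSM page-fault handler; all helpers are hypothetical */
    #define PAGE_SIZE 4096UL

    typedef enum { ACCESS_READ, ACCESS_WRITE } access_t;

    void dsm_page_fault(unsigned long vaddr, access_t access)
    {
        unsigned long vpn = vaddr / PAGE_SIZE;
        if (page_is_local(vpn))
            return;                            /* step 2: keep going */
        int owner = find_owner(vpn);           /* directory, manager, or owner chain */
        void *frame = send_page_request(owner, vpn, access);
        if (access == ACCESS_WRITE)
            take_ownership(vpn);               /* the old owner invalidated its copy */
        map_frame(vpn, frame, access);         /* step 3: map with read or write privilege */
    }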
Consistency Protocol
- similar to those of distributed file systems and SMP caches
- defines the action on a write to a shared page
- invalidate (a page)
- update (a page, a word, a cache line, ...)
- the length of the write run decides which one is better
- defines the state of a page (one possible entry layout is sketched below)
- shared or exclusive
- write or read
- ownership, if any
- defines the information about
- how to find the owner
- how to find replicated copies
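A minimal per-page entry that such a protocol might keep; the field names are illustrative, not taken from any particular system.

    #include <stdint.h>

    typedef enum { PAGE_INVALID, PAGE_SHARED, PAGE_EXCLUSIVE } page_state_t;

    struct dsm_page_entry {
        page_state_t state;    /* shared or exclusive */
        int          owner;    /* node holding ownership, if any (-1: none) */
        uint64_t     copyset;  /* bitmap of nodes holding replicated copies */
    };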
Memory Model
- defines the result of a series of memory operations
- restricts the implementation of shared memory
- may disallow caching
- may allow out-of-order processing of memory operations
- sequential consistency (Lamport)
- the strongest model
- the result should be the same as some serial order of the operations
- restricts any out-of-order accesses to memory
- release consistency (illustrated below)
- based on synchronization operations
- acquire
- release
- guarantees consistency only at the end of a release operation
- allows buffering of memory accesses inside a critical section
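A small illustration of what release consistency buys: writes inside the critical section may be buffered locally and need not be visible remotely until the release. The dsm_acquire/dsm_release primitives and the types are hypothetical.

    void deposit(struct account *shared, dsm_lock_t *lock)
    {
        dsm_acquire(lock);     /* acquire: obtain updates released by other nodes */
        shared->balance += 1;  /* these writes may sit in a local buffer for now  */
        shared->count   += 1;
        dsm_release(lock);     /* release: all buffered writes must now be
                                  globally performed (e.g., propagated as diffs)  */
    }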
GLUnix
- a DSM that runs on top of the OS
- minimizes OS modification
- fast prototyping
- most applications can run without modification
- modifications needed:
- a network protocol suited for page handling
- use of network RAM as secondary storage
- a page fault handler
- page tables
GLUnix (2)
- Virtual OS layer
- captures DSM-related interrupts and syscalls generated by running application programs
- how to capture external interrupts?
- software fault isolation
- insert a check before each instruction that may cause an interrupt
- Issues
- load balancing
- a parallel computation finishes when its slowest component finishes
- communicating processes
- lots of small messages
IVY
- the first DSM
- invalidation-based
- ownership-based
- tested three algorithms for finding an owner (the third is sketched below)
- centralized server
- fixed distributed manager
- each node has the ownership of a predetermined set of pages
- dynamic distributed manager
- the page table of each node records the probable owner of each page
- if that node is not the true owner, it knows where to look next for the search
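A sketch of the dynamic distributed manager's lookup, forwarding a request along probable-owner links until the true owner is reached. The messaging helpers are hypothetical, and real IVY also shortens the chain by updating probable owners as requests and replies flow.

    /* local request: send toward our best guess of the owner */
    void request_page(unsigned long vpn)
    {
        send_msg(probable_owner[vpn], PAGE_REQ, vpn, self());
    }

    /* on receiving a request: reply if owner, otherwise forward the request */
    void on_page_request(unsigned long vpn, int requester)
    {
        if (i_am_owner(vpn))
            send_page(requester, vpn);
        else
            send_msg(probable_owner[vpn], PAGE_REQ, vpn, requester);
    }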
IVY (2)
- Process Migration
- needed for load balancing
- sends the PCB to the destination
- sends the pages containing the stack to the destination, with write privilege
- why? those pages will be transferred anyway as the stack is accessed on the destination node
- Good performance
- runs only applications that suit DSM
- some applications show even super-linear speedup
Munin
- DSM-ized the V kernel
- Software Release Consistency
- Multiple Consistency Protocols
- the performance of a consistency protocol is sensitive to
- the structure of parallel programs
- the sharing pattern inside a program
- a protocol is defined for each shared object
- an annotation is needed on each object to define the protocol
- if it is missing, a default protocol is used
Munin Annotation
- what is defined in an annotation? (one possible encoding is sketched below)
- invalidate vs. update
- replication allowed?
- delayed operation allowed?
- fixed owner?
- multiple writers allowed?
- static sharing pattern?
- if the object is accessed by a single thread in a static pattern, updates are sent to the same node even before the node requests them
- flush changes to owner?
- writable?
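One plausible encoding of such an annotation as a per-object flag set; the names mirror the list above and are illustrative, not Munin's actual declarations.

    struct munin_annotation {
        unsigned use_update       : 1;  /* update protocol rather than invalidate */
        unsigned replicable       : 1;  /* replication allowed? */
        unsigned delayed_ops      : 1;  /* operations may be delayed/batched? */
        unsigned fixed_owner      : 1;  /* ownership never migrates */
        unsigned multiple_writers : 1;  /* concurrent writers allowed? */
        unsigned static_pattern   : 1;  /* push updates to a known consumer */
        unsigned flush_to_owner   : 1;  /* flush changes back to the owner */
        unsigned writable         : 1;
    };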
Munin Release Consistency
- example: two threads update different variables that happen to live in the same page
- thread-A: lock(A); X = X + 1; unlock(A)
- thread-B: lock(B); Y = Y + 1; unlock(B)
- (X and Y are in the same page)
- when updated pages are flushed to the owner, how do we update the home page?
- introduce a twin page (diffing is sketched below)
- if there is no twin, there is no problem
- else, the differences against each twin page are applied
- where to keep the twin pages?
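A hedged sketch of the twin-and-diff idea: copy the page before the first write, then at release compare word by word and send only the changed words home. The page structure and helpers are illustrative.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE 4096UL

    struct page {
        unsigned long  vpn;
        unsigned char *data;
        unsigned char *twin;   /* NULL until the first write */
    };

    /* on the first write to a shared page: make a twin (pristine copy) */
    void make_twin(struct page *pg)
    {
        pg->twin = malloc(PAGE_SIZE);
        memcpy(pg->twin, pg->data, PAGE_SIZE);
    }

    /* at release: diff the page against its twin, send only modified words */
    void flush_diffs(struct page *pg, int home)
    {
        uint32_t *cur = (uint32_t *)pg->data, *old = (uint32_t *)pg->twin;
        for (size_t i = 0; i < PAGE_SIZE / sizeof(uint32_t); i++)
            if (cur[i] != old[i])
                send_word_update(home, pg->vpn, i, cur[i]);  /* hypothetical */
        free(pg->twin);
        pg->twin = NULL;
    }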
Munin Directory Structure
- a hash function directs an address to an entry in the table
- the entry contains the object description (a possible layout is sketched below):
- start address and size
- the protocol defined for the object
- state: valid, writable, modified, replicated
- copyset: bitmap? linked list?
- synchq: a pointer to the synch object that governs this object
- probable owner: best guess
- home node: for bookkeeping
- access-control semaphore
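A possible C layout of such a directory entry, mirroring the fields above; the names and types are illustrative.

    #include <stdint.h>
    #include <semaphore.h>

    struct munin_dir_entry {
        void    *start;            /* start address of the object */
        size_t   size;
        int      protocol;         /* protocol chosen via the annotation */
        unsigned valid:1, writable:1, modified:1, replicated:1;
        uint64_t copyset;          /* here: a bitmap of nodes holding copies */
        void    *synchq;           /* synch object governing this object */
        int      probable_owner;   /* best guess */
        int      home;             /* home node, for bookkeeping */
        sem_t    lock;             /* access-control semaphore */
    };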
Munin (3)
- Merging Sync with Data Transfer
- a message transfer is expensive in DSM
- most sync operations are used to control shared data
- so let's merge the two into a single message
- how do we know which sync object governs which data object?
- the programmer knows
- it can be declared at variable declaration
- when a lock is released to a node
- the data it governs is sent along with it
Shasta
- Motivations
- fine-grain sharing will reduce false sharing
- run binary executables
- most commercial software is distributed this way
- insert checks into application executables (as in Blizzard-S) at loads and stores
- ordinary (unoptimized) checking overhead: 50–150%
- support an SMP as a node
- Virtual address space
- the conventional space is private
- code, static data, stack
- the shared space is dynamically allocated (following the convention of SPLASH)
Shasta Coherence Protocol
- three states
- invalid, shared, exclusive
- directory-based invalidation
- a home node is assigned to each virtual page
- the owner node is the last node that updated the page
- the directory contains
- a pointer to the owner
- a full-map bit vector of all sharers
- coherence unit and coherence information
- blocks (multiples of lines): by directory information
- lines (64–128 bytes): by the state table
Shasta (2)
- Polling instead of interrupts for coherence actions
- polling is much more efficient (only 3 instructions)
- simplifies the concurrency problem of handling a miss
- while a miss is being handled, messages related to this miss can arrive
- places to insert polling code
- wherever the protocol waits for a message
- and, depending on the desired response time:
- at every function call
- at every loop backedge
Shasta Shared Miss Check
- each load and store should be checked for whether it is a miss
- instructions that need not be checked:
- private and stack accesses
- addresses calculated from the above addresses
- check whether they use the registers used for private data
- normal operation:
- 1. check whether the target address is in the shared region
- 2. if so, look up the state in the state table
- 3. if needed, call the miss handling routine
Inserted Code for Store Check
- the check inserted before a store (Alpha assembly; the instruction numbers are referenced on the next slide):
    1. lda   rx, offset(base)   ; effective address of the store target
    2. srl   rx, 39, ry         ; high bits tell whether the address is in the shared region
    3. beq   ry, nomiss         ; not in the shared region: no check needed
    4. srl   rx, 6, rx          ; line number = address/64 = byte index into the state table
    5. ldq_u ry, 0(rx)          ; load the quadword containing the state byte
    6. extbl ry, rx, ry         ; extract the state byte for this line
    7. beq   ry, nomiss         ; 0 (exclusive): the store may proceed
    8. call  miss handler
- (the original slide also pictured the address-space layout: shared region, state table, static data, text, stack)
Inserted Code for Store Check (2)
- no register save/restore
- use unused registers
- if no unused ones can be found, insert code to secure two registers
- instructions 2 and 6 require smart address-space allocation for
- the shared region
- the state table
- what the code does:
- 1. calculate the effective address of the target
- 2., 3. check whether the target address is within the shared region
- 4. calculate the (byte) address in the state table for the target address
- 5., 6. extract the state information
- 7. if it is 0 (exclusive), go to nomiss
Shasta Optimization
- code rescheduling
- a shift instruction needs 2 cycles to produce its result
- branch delay slots can be filled with the above check code
- rx and ry are unused registers, so there is no dependency
- load checks (sketched below)
- when a line becomes invalid, store a fixed flag value into each long word of the line
- for a load check, compare the loaded long word with the flag value
- if equal, call the miss handling routine
- else, continue
- the flag value should not be one used frequently in normal computation
- not zero, not a small positive integer
- 253 was chosen in Shasta
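A C-level sketch of the flag technique (Shasta emits this as inline Alpha code; the handler and its behavior on a false positive are illustrative):

    #define MISS_FLAG 253L   /* the "unlikely" value stored into invalid lines */

    long checked_load(long *addr)
    {
        long v = *addr;              /* do the load first: the common case */
        if (v == MISS_FLAG)
            /* either a real miss, or the program truly stored 253:
               the handler consults the state table to disambiguate */
            v = load_miss_handler(addr);
        return v;
    }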
Shasta Optimization (2)
- store check
- a separate exclusive bit (1 bit) per line is kept
- a table of such bits occupies only a small space in the data cache, reducing the cache misses for looking up the state table
- batching miss checks
- if several instructions touch the same line, one check is enough for all of them
- multiple granularity
- applications may define the block size at the malloc() call
Alpha LL and SC
- the synchronization primitives of the Alpha
- a lock_flag and a lock_address per processor
- operations
- LL sets the lock_flag and the lock_address
- the lock_flag is reset if another processor writes to the line at the lock_address
- SC succeeds if the lock_flag is still set
- an exact implementation would be expensive (inefficient) for Shasta
- Alpha programming recommendations:
- for each SC there is a unique LL
- no store or load between the LL and the SC
- the SC and LL target the same line
Shasta Approach to LL and SC
- before an LL (the whole protocol is sketched in code below)
- save the state of the line in a register
- get the latest copy if the state is invalid
- before an SC
- if the saved state is exclusive: OK
- if invalid: return failure
- if shared: send a special message to the home
- at the home node
- if the requester is still a sharer, send OK
- else, send failure
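A hedged C sketch of that protocol; state names and messaging helpers are illustrative, and the successful shared case is simplified (it would also have to upgrade the line to exclusive).

    enum line_st { INVALID, SHARED, EXCLUSIVE };

    static enum line_st ll_state;      /* line state saved at the LL */

    long emulated_ll(long *addr)
    {
        ll_state = line_state(addr);
        if (ll_state == INVALID) {
            fetch_line(addr);          /* get the latest copy */
            ll_state = line_state(addr);
        }
        return *addr;
    }

    int emulated_sc(long *addr, long val)
    {
        if (ll_state == EXCLUSIVE) { *addr = val; return 1; }  /* succeeds */
        if (ll_state == INVALID)   return 0;                   /* fails    */
        /* shared: the home decides whether we are still a sharer */
        if (ask_home_still_sharer(addr)) { *addr = val; return 1; }
        return 0;
    }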
Memory Barrier
- a fence operation that forces all pending operations of the processor to be globally performed
- inefficient, because everything is blindly performed at the MB
- Shasta at an MB
- finishes all pending operations
- also executes the hardware MB instruction, for SMP nodes
System Calls
- validating arguments
- what if the arguments are in the shared region?
- copy the arguments from the shared region to local memory
- an expensive operation
- or validate the arguments using a wrapper
- make sure the arguments are in the proper states
- supporting multiple clusters
- replace all related calls with Shasta's calls:
- process management
- shared memory management
- threads that share an address space
- would need inline checking even for accesses to private data and the stack: expensive
- a page-based protocol can be used for the stack
- access to remote files
- a distributed file system is needed
Process Handling
- processes are created/terminated dynamically
- issues
- data and state information owned by a terminated process
- more processes than processors
- inactive processes may delay the servicing of requests from other processes
- solution
- a daemon process per processor while the application is running
- it shares all the data with the processes allocated to the same processor
- it runs at a low priority and handles the messages that arrive for its peer processes
Code modification
- when?
- at load time
- it may slow down loading
- but often you have only binaries
- caching will reduce the number of modifications for frequently-used code
- caveat: programs that generate code
Page Table for 64-bit OS
- Motivations
- the page table is huge for a 64-bit address space
- an inverted page table is not a solution, due to the increased physical memory sizes
- most programs use the address space sparsely
- Multi-level page tables
- PTEs are structured into an n-ary tree
- significantly reduces the PTEs kept for unused address space
- when the height of the tree is large
- too many memory references to find a PTE
- when it is too small
- we lose the benefit of multiple levels
Page Table for 64-bit OS
- Hashed page tables (a sketch follows this list)
- a hash function maps a VPN to a bucket
- a bucket is a linked list of elements, each consisting of
- a PTE (PPN, attributes, valid bit, ..)
- the VPN (almost 8 bytes)
- a next pointer (8 bytes)
- space overhead: 16 bytes per PTE
- the next pointer can be eliminated by allocating a fixed number of elements per bucket
- but the overflow problem remains
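A minimal sketch of such a hashed page table; the hash function and table size are illustrative choices.

    #include <stdint.h>
    #include <stddef.h>

    #define NBUCKETS 4096   /* illustrative table size */

    struct hpt_element {
        uint64_t vpn;               /* tag: which virtual page (~8 bytes) */
        uint64_t pte;               /* PPN + attributes + valid bit */
        struct hpt_element *next;   /* 8 bytes of overhead per element */
    };

    struct hpt_element *buckets[NBUCKETS];

    uint64_t *hpt_lookup(uint64_t vpn)
    {
        struct hpt_element *e = buckets[vpn % NBUCKETS];  /* hash: illustrative */
        for (; e != NULL; e = e->next)
            if (e->vpn == vpn)
                return &e->pte;
        return NULL;                /* not mapped */
    }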
Page Table for 64-bit OS
- Clustered page tables (element layout sketched below)
- each element of the linked list maps multiple pages:
- the VPN
- a next pointer
- n PTEs
- a memory object (in virtual address space) usually occupies multiple pages
- so the space overhead of the hashed page table is amortized
- more efficient than a linear table for a sparse address space
Clustered Page Table
- Operations
- adding a PTE
- hashed: memory allocation, list insertion, and PTE initialization for each new PTE
- clustered:
- memory allocation and list insertion once per n PTEs
- initialization for each PTE
- modifying a PTE
- modification is done for a memory object, not for a single page, so the clustered scheme is more efficient
- synchronization
- many threads use the page table concurrently
- a cluster lock for a group of pages
- reduces concurrency
- but has less blocking overhead
- finer granularity can be supported with some overhead
TLB Issues
- can this scheme support new TLB technologies such as superpages and subblocking?
- Superpage
- a superpage is 2^n times the base page size
- each TLB entry must have a size field
- why not segmentation?
- complex, because a segment's size is arbitrary and it starts at an arbitrary location
- reduces TLB misses, since each entry maps a wider region
- good for frame buffers, kernel data, DB buffer pools
- how about the file cache?
TLB Issues (2)
- Subblocking
- put multiple PPNs in one TLB entry
- it may waste TLB space
- partial subblocking
- physical memory is aligned, so
- one PPN per TLB entry
- multiple valid bits are needed
- Clustered page table
- just needs a field indicating whether a list element describes a normal cluster, a partial subblock, or a superpage (see the sketch below)
- the mechanisms for the operations are naturally similar
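One way the clustered element might carry that field; the enum, the size field, and the valid-bit mask are illustrative assumptions.

    #include <stdint.h>

    #define CLUSTER 16

    enum cpt_kind { CPT_NORMAL, CPT_SUPERPAGE, CPT_PARTIAL_SUBBLOCK };

    struct cpt_element2 {
        uint64_t base_vpn;
        struct cpt_element2 *next;
        enum cpt_kind kind;     /* normal cluster, superpage, or partial subblock */
        uint8_t  size_log2;     /* superpage: 2^n base pages */
        uint16_t valid_bits;    /* partial subblock: one valid bit per page */
        uint64_t pte[CLUSTER];  /* one PTE per page (one shared PPN if superpage) */
    };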
Application Controlled MM
- why user-controlled anything?
- computing usages are too diverse:
- multimedia data
- real time
- personal computing
- large scientific applications (usually parallel computing)
- a general OS cannot satisfy the needs arising from such diversity
- user-controlled what?
- almost any part of the OS: scheduling, memory management, file system, network protocol, security, ...
Application Controlled MM
- Mechanisms for User Control
- microkernel approach
- the parts of the OS that need customization are prepared as user processes, or
- they are prepared as library functions that can be bound into applications (ExoKernel)
- modular but inefficient
- binaries loadable into the OS
- needs a dynamic-linking method inside the OS
- efficient but insecure
External Pager
- Motivation: an application doesn't know and can't control
- the amount of memory available to it
- some programs can use as much memory as they are given
- which parts of it are kept in memory
- some data accesses are predictable
- Some solutions
- allow applications to pin pages in memory
- disables the OS's ability to share that memory
- hard to know how much to pin
- allow applications to advise the VM system
- the madvise() system call
- a very primitive yet complex mechanism
External Pager
- segment manager
- a user-level pager
- reclaims page frames
- writes back page frames
- on a page fault
- the kernel forwards the event to the segment manager
- via a signal or interrupt
- the manager reclaims a page frame
- it may write back a page
- it needs to maintain a list of free pages
External Pager
- system calls (a usage sketch follows this list)
- SetSegmentManager(seg, manager)
- specifies the manager of a segment
- MigratePages(srcSeg, dstSeg, flags)
- moves pages from one segment to another
- ModifyPageFlags(seg, flags)
- sets/clears the dirty bit, protection
- GetPageAttribute(seg, pages)
- determines the flags and mappings of pages
- the manager can be part of the application
- recursive page faults may occur
- so a manager pins its stack into memory
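A hedged sketch of how an application-resident manager might use these calls; everything beyond the four calls listed above (the upcall signature, segment handles, helper functions, flag values) is an assumption.

    /* register ourselves as the pager for a segment */
    void become_manager(segment_t seg)
    {
        SetSegmentManager(seg, my_fault_handler);
    }

    /* hypothetical fault upcall, delivered by the kernel via signal/interrupt */
    void my_fault_handler(segment_t seg, page_t faulted)
    {
        page_t victim = pick_victim(seg);    /* the manager's own policy      */
        if (page_is_dirty(seg, victim))      /* checked via GetPageAttribute() */
            writeback(seg, victim);
        MigratePages(seg, free_seg, 0);      /* return the frame to a
                                                free-page segment (flags
                                                here are illustrative)        */
        /* ... then map the freed frame for the faulted page and resume */
    }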
External Pager
- how a manager gets a free page
- from a free-page segment that it manages
- by reclaiming a page from another segment it manages
- by requesting an additional page from the kernel
- System page cache manager
- the controller of the machine's global memory pool
- segment managers get their segments from it
- it may approve, deny, or partially fulfill requests
Memory Market Model
- until now, scheduling has dealt with the time a program uses
- with multiprocessors, memory will be more contended than the CPU
- charge each process for space_used x time (a trivial sketch follows)
- an application requests an amount of DRAM initially
- the kernel allocates DRAM according to the system status and the user request
- applications choose whether they want a large memory so as to execute fast, OR
- a small memory, when the job is not very urgent
- questions remain
- interaction with CPU scheduling
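The charging rule in code form, as a trivial sketch; the units (bytes, seconds) and the rate are arbitrary choices here.

    #include <stddef.h>

    /* charge = space_used x time; units and rate are illustrative */
    double memory_charge(size_t bytes_held, double seconds, double rate)
    {
        return (double)bytes_held * seconds * rate;
    }
    /* e.g., holding 64 MB for 10 s at a rate of 1e-9 per byte-second
       costs 64*2^20 * 10 * 1e-9, i.e., about 0.67 units */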
Summary
- the external pager is a trend
- the OS should provide abstractions of the hardware that are complete in
- functionality (don't hide useful functionality)
- performance (don't hide performance)
- other user-controlled approaches
- gang scheduling on parallel machines
- scheduler activations
- user-level devices and file systems