Title: Implementation and Performance of Munin (Distributed Shared Memory System)
ECE 1147, Parallel Computation, Oct. 30, 2006
Dongying Li
(Original Authors J. B. Carter, et al.)
Department of Electrical and Computer Engineering, University of Toronto
2. Distributed Shared Memory
- Shared address space spanning the processors of a distributed memory multiprocessor
[Figure: processors proc1, proc2, and proc3, each shown with the shared variable X0]
3. Distributed Shared Memory
[Figure: a shared memory layer over a network connecting nodes proc0 ... procN, each with its own local memory mem0 ... memN]
4. Distributed Shared Memory
- Design objectives
  - Good performance, comparable to shared memory programs
  - No significant deviation from the shared memory coding model
  - Low communication and message passing overheads
5. Munin System
- Characteristic features
  - Software release consistency
  - Multiple consistency protocols
  - Same interface as the shared memory programming model
    - Threads, synchronization, data sharing, etc.
- Deviations
  - All shared variables are annotated with their access pattern
  - Synchronization is explicitly visible to the runtime system (important for release consistency!)
6. Contents
- Basic concepts
  - Shared object
  - Software release consistency
  - Multiple consistency protocols
- Software implementation
  - Prototype overview
  - Execution process
  - Advanced programming features
  - Data object directory and delayed update queue
  - Synchronization
- Performance
- Overview of other DSM systems
- Conclusion
7. Basic Concepts
8. Shared Object
[Figure: three 8-kilo regions of shared memory holding the variables x and y]
9. Software Release Consistency
- Sequential consistency
  - All processors observe the same order of operations
  - That order must correspond to some serial order
  - The only ordering constraint is that the reads and writes of a given processor (e.g. P1) appear in its own program order; there is no restriction on the relative ordering between processors
- Synchronous read/write
  - Writes must be propagated before moving on to the next operation
10. Software Release Consistency
- Special weak consistency protocol
- Reduction of message passing overhead
- Two categories of shared variable operations
  - Ordinary access
    - Read
    - Write
  - Synchronization access (lock, semaphore, barrier)
    - Acquire
    - Release
11. Software Release Consistency
- Before an ordinary access (read or write) is allowed to perform, all previous acquires must have been performed
- Before a release is allowed to perform, all previous ordinary accesses must have been performed
- Before an acquire is allowed to perform, all previous releases must have been performed
- Before a release is allowed to perform, all previous acquires must have been performed
- In short: the results of writes made before a release are propagated before the next processor acquires the released lock
12. Release Consistency
- Writes are propagated at release
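To make the rule concrete, here is a minimal single-process sketch in C. It is not Munin's code. It assumes two simulated nodes that each hold a private copy of a few shared words, and the names Dsm, dsm_write, dsm_read, and dsm_release are hypothetical. An ordinary write stays in the writer's local copy until the writer performs a release, at which point the buffered words are pushed to the other node.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NODES 2   /* number of simulated nodes (assumption for the sketch) */
#define WORDS 4   /* number of shared words (assumption for the sketch) */

typedef struct {
    int  copy[NODES][WORDS];   /* each node's local copy of the shared words */
    bool dirty[NODES][WORDS];  /* words written locally since the last release */
} Dsm;

/* Start with all copies zeroed and nothing buffered. */
void dsm_init(Dsm *d) { memset(d, 0, sizeof *d); }

/* Ordinary write: updates only the writer's local copy and marks it dirty. */
void dsm_write(Dsm *d, int node, int word, int value) {
    assert(node >= 0 && node < NODES && word >= 0 && word < WORDS);
    d->copy[node][word] = value;
    d->dirty[node][word] = true;
}

/* Ordinary read: served from the reader's local copy. */
int dsm_read(const Dsm *d, int node, int word) {
    return d->copy[node][word];
}

/* Release: propagate every buffered write to all nodes.
 * Returns the number of words that were propagated. */
int dsm_release(Dsm *d, int node) {
    int sent = 0;
    for (int w = 0; w < WORDS; w++) {
        if (!d->dirty[node][w])
            continue;
        for (int n = 0; n < NODES; n++)
            d->copy[n][w] = d->copy[node][w];
        d->dirty[node][w] = false;
        sent++;
    }
    return sent;
}
```

After dsm_write(&d, 0, 1, 42), node 1 still reads the old value of word 1; only after dsm_release(&d, 0) does node 1 read 42.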
13. Multiple Consistency Protocols
- No single consistency protocol is suitable for all parallelization purposes
- Shared variables are accessed in different ways within a single program
- A variable's access pattern can change during execution
- Multiple protocols allow access-pattern-oriented tuning for different shared variables
14. Multiple Consistency Protocols
- High-level sharing pattern annotation
  - Specified in the shared variable declaration
  - A combination of low-level protocol parameters
- Low-level protocol parameter
  - Specified in the shared variable directory
  - Controls a specific aspect of the protocol
15. Protocol Parameters
- I: Propagate a modification by invalidating (Y) or updating (N) remote copies?
- R: Replicas allowed on other nodes?
- D: Delayed operations (updates, invalidations) allowed?
- FO: Fixed owner (no writes at other nodes)?
- M: Multiple writers allowed?
- S: Stable sharing pattern (accessed by a fixed set of threads)?
- FL: Flush changes to the owner and invalidate the local copy?
- W: Writable?
16. Sharing annotations
- Read-only
  - Simplest pattern: once initialized, no further writes
  - Suitable for constants, etc.
- Migratory
  - Only one thread accesses the object at any given time
  - Suitable for variables accessed only inside critical sections
- Write-shared
  - Can be written concurrently by multiple threads
  - Different threads update different words of the variable
- Producer-consumer
  - Written by only one thread and read by others
  - Replicate and update the object rather than invalidate it
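17. Sharing annotations
- Example: producer-consumer

    for (t = 0; t < timesteps; t++) {      /* some number of timesteps/iterations */
        for (i = 1; i < n - 1; i++)        /* interior points only, so i-1 and i+1 stay in bounds */
            for (j = 1; j < n - 1; j++)
                temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j]
                                   + grid[i][j-1] + grid[i][j+1]);
        for (i = 1; i < n - 1; i++)        /* copy the new values back into the grid */
            for (j = 1; j < n - 1; j++)
                grid[i][j] = temp[i][j];
    }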
18. Sharing annotations
- Reduction
  - Accessed through fetch-and-operate operations (read, write, then release)
  - Example: min(), a++
- Result
  - Phase 1: multiple writers allowed
  - Phase 2: one thread (the one using the result) accesses the object exclusively
- Conventional
  - Conventional invalidation-based protocol for shared variables
19. Sharing annotations
[Figure: sequences of w(x) and r(x) accesses under the different sharing annotations]
20. Sharing annotations
Protocol parameters for each sharing annotation:

Sharing annotation   I   R   D   FO  M   S   FL  W
Read-only            N   Y   -   -   -   -   -   N
Migratory            Y   N   -   N   N   -   N   Y
Write-shared         N   Y   Y   N   Y   N   N   Y
Producer-consumer    N   Y   Y   N   Y   Y   N   Y
Reduction            N   Y   N   Y   N   -   N   Y
Result               N   Y   Y   Y   Y   -   Y   Y
Conventional         Y   Y   N   N   N   -   N   Y
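The annotation table can be encoded as a small lookup structure. The sketch below is one possible C encoding, not the layout Munin uses: the type and field names (Tri, Annotation, ProtocolParams, PARAMS) and the two helper functions are hypothetical. The Y/N/- entries are copied from the table, with NA standing for "-".

```c
#include <assert.h>

/* Three-valued table entry: NA stands for the "-" cells. */
typedef enum { NO, YES, NA } Tri;

typedef enum {
    READ_ONLY, MIGRATORY, WRITE_SHARED, PRODUCER_CONSUMER,
    REDUCTION, RESULT, CONVENTIONAL, ANNOTATION_COUNT
} Annotation;

/* One field per protocol parameter: I, R, D, FO, M, S, FL, W. */
typedef struct {
    Tri invalidate, replicas, delayed, fixed_owner,
        multi_writer, stable, flush, writable;
} ProtocolParams;

static const ProtocolParams PARAMS[ANNOTATION_COUNT] = {
    /*                        I    R    D    FO   M    S    FL   W   */
    [READ_ONLY]         = { NO,  YES, NA,  NA,  NA,  NA,  NA,  NO  },
    [MIGRATORY]         = { YES, NO,  NA,  NO,  NO,  NA,  NO,  YES },
    [WRITE_SHARED]      = { NO,  YES, YES, NO,  YES, NO,  NO,  YES },
    [PRODUCER_CONSUMER] = { NO,  YES, YES, NO,  YES, YES, NO,  YES },
    [REDUCTION]         = { NO,  YES, NO,  YES, NO,  NA,  NO,  YES },
    [RESULT]            = { NO,  YES, YES, YES, YES, NA,  YES, YES },
    [CONVENTIONAL]      = { YES, YES, NO,  NO,  NO,  NA,  NO,  YES },
};

/* True when the annotation permits concurrent writers (M is Y). */
int allows_multiple_writers(Annotation a) {
    assert(a >= 0 && a < ANNOTATION_COUNT);
    return PARAMS[a].multi_writer == YES;
}

/* True when updates or invalidations may be delayed (D is Y). */
int allows_delayed_operations(Annotation a) {
    assert(a >= 0 && a < ANNOTATION_COUNT);
    return PARAMS[a].delayed == YES;
}
```

A runtime could consult such a table once per object, for instance calling allows_multiple_writers(WRITE_SHARED) before deciding whether a write fault needs exclusive ownership.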
21. Software Implementation
22. Prototype Overview
- A simple preprocessor that converts the annotations into a suitable format
- A linker that creates the shared memory segment
- Library routines linked into the program
- Operating system support for page fault handling and page table manipulation
23. Execution Process
[Figure: sharing annotations -> Munin preprocessor -> auxiliary files -> linker -> shared data description table and shared data segment]
24. Execution Process
[Figure: on P1, the Munin root thread and the user root thread, which runs User_init(); a copy of the code and the data segment goes to each of P2 ... Pn, where a Munin worker thread runs]
25. Execution Process
[Figure: a synchronization operation involving the Munin root thread on P1 and the user thread and Munin worker thread on P2 ... Pn]
26. Advanced Programming Features
[Figure: timelines of rel(m), msg, and acq(m) events, with w(x) and r(x) accesses]
27. Advanced Programming Features
- PhaseChange()
  - Changes the producer-consumer relationship
  - Example: adaptive mesh SOR
- ChangeAnnotation()
  - Changes the access pattern during execution
- Invalidate()
- Flush()
- SingleObject()
- PreAcquire()
28. Data Object Directory
- Start address and size
- Protocol parameters
- Object state (valid, writable, invalid)
- Copyset (which remote nodes hold copies)
- Synchq (the corresponding synchronization object)
- Probable owner
- Home node
- Access control semaphore
- Links
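The directory fields listed above map naturally onto a record type. The following C sketch is only an illustration: the field types, the bitmask encoding of the copyset, the use of a plain flag in place of the access control semaphore, and the helper names (copyset_add, copyset_has, dir_lookup) are all assumptions, not Munin's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { OBJ_INVALID, OBJ_VALID, OBJ_WRITABLE } ObjectState;

typedef struct DirEntry {
    uintptr_t        start;           /* start address of the object */
    size_t           size;            /* size in bytes */
    unsigned         params;          /* protocol parameter bits (encoding is hypothetical) */
    ObjectState      state;           /* valid, writable, or invalid */
    uint32_t         copyset;         /* bit n set: node n holds a copy (at most 32 nodes here) */
    int              synchq;          /* id of the associated synchronization object, -1 if none */
    int              probable_owner;  /* best guess at the current owner node */
    int              home_node;       /* node where the object was created */
    bool             busy;            /* stand-in for the access control semaphore */
    struct DirEntry *next;            /* link to the next directory entry */
} DirEntry;

/* Record that `node` holds a copy of the object. */
void copyset_add(DirEntry *e, int node) {
    assert(node >= 0 && node < 32);
    e->copyset |= UINT32_C(1) << node;
}

/* Report whether `node` is recorded as holding a copy. */
bool copyset_has(const DirEntry *e, int node) {
    return node >= 0 && node < 32 && ((e->copyset >> node) & 1u) != 0;
}

/* Find the entry whose address range contains `addr`, as a fault
 * handler might do; returns NULL if no object covers the address. */
DirEntry *dir_lookup(DirEntry *head, uintptr_t addr) {
    for (DirEntry *e = head; e != NULL; e = e->next)
        if (addr >= e->start && addr - e->start < e->size)
            return e;
    return NULL;
}
```

A linked list keeps the sketch short; a real directory would more likely be indexed by address so that lookups on the page fault path are cheap.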
29. Delayed Update Queue
[Figure: writes w(x) and w(y) made between acq(m) and rel(m) are recorded as queue entries for x and y]
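A delayed update queue can be sketched as a small set of pending object ids that is drained at release time. The C code below is a simplified illustration with hypothetical names (Duq, duq_note_write, duq_flush); it assumes objects are identified by integer ids and that repeated writes to the same object need only one queue entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DUQ_CAPACITY 16   /* fixed capacity, chosen only for the sketch */

typedef struct {
    int ids[DUQ_CAPACITY];  /* ids of objects modified since the last release */
    int count;              /* number of pending entries */
} Duq;

void duq_init(Duq *q) { q->count = 0; }

/* Record a write to object `id`. Returns true if a new entry was queued,
 * false if the object was already pending (the writes are coalesced). */
bool duq_note_write(Duq *q, int id) {
    for (int i = 0; i < q->count; i++)
        if (q->ids[i] == id)
            return false;
    assert(q->count < DUQ_CAPACITY);
    q->ids[q->count++] = id;
    return true;
}

/* At release: pass each pending id to `send` (if given), empty the
 * queue, and return how many updates were sent. */
int duq_flush(Duq *q, void (*send)(int id)) {
    int sent = q->count;
    for (int i = 0; i < q->count; i++)
        if (send != NULL)
            send(q->ids[i]);
    q->count = 0;
    return sent;
}
```

Two writes to object 7 and one to object 9 therefore produce two updates at the release rather than three messages at write time.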
30. Multiple Writer Handling
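This slide carries only its title. One technique commonly described for handling multiple writers in software DSM systems such as Munin is to keep a "twin" (a snapshot taken before the first local write) and, at release, send only the words that differ from it. The C sketch below illustrates that idea under stated assumptions: objects are small arrays of int, and the names make_twin, make_diff, and apply_diff are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define WORDS 8   /* object size in words, chosen only for the sketch */

typedef struct {
    int index;   /* which word changed */
    int value;   /* its new value */
} DiffEntry;

/* Snapshot the object before the first local write. */
void make_twin(const int *object, int *twin) {
    memcpy(twin, object, WORDS * sizeof *object);
}

/* Compare the object with its twin and record every changed word.
 * `diff` must have room for WORDS entries. Returns the entry count. */
int make_diff(const int *object, const int *twin, DiffEntry *diff) {
    int n = 0;
    for (int i = 0; i < WORDS; i++) {
        if (object[i] != twin[i]) {
            diff[n].index = i;
            diff[n].value = object[i];
            n++;
        }
    }
    return n;
}

/* Merge a diff received from another writer into the local copy. */
void apply_diff(int *object, const DiffEntry *diff, int n) {
    for (int k = 0; k < n; k++) {
        assert(diff[k].index >= 0 && diff[k].index < WORDS);
        object[diff[k].index] = diff[k].value;
    }
}
```

If two nodes modify different words of the same object and exchange diffs, both copies end up identical, which is the situation the write-shared annotation describes.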
31. Synchronization
- Queue-based synchronization
- Request/reply mechanism with lock forwarding
- CreateLock(), AcquireLock(), ReleaseLock(), CreateBarrier(), WaitAtBarrier()
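The queue-based, lock-forwarding idea can be sketched as follows. This is a single-process illustration in C, not the Munin library: QueueLock, qlock_request, and qlock_release are hypothetical names and do not reflect the signatures of AcquireLock() or ReleaseLock(). A request is either granted at once or queued, and a release forwards the lock directly to the longest-waiting node.

```c
#include <assert.h>

#define MAX_WAITERS 8
#define NO_NODE (-1)

typedef struct {
    int owner;                 /* node holding the lock, or NO_NODE */
    int waiters[MAX_WAITERS];  /* FIFO of waiting nodes (ring buffer) */
    int head;                  /* index of the oldest waiter */
    int count;                 /* number of queued waiters */
} QueueLock;

void qlock_init(QueueLock *l) {
    l->owner = NO_NODE;
    l->head = 0;
    l->count = 0;
}

/* Request the lock for `node`. Returns 1 if granted immediately,
 * 0 if the node was appended to the wait queue. */
int qlock_request(QueueLock *l, int node) {
    if (l->owner == NO_NODE) {
        l->owner = node;
        return 1;
    }
    assert(l->count < MAX_WAITERS);
    l->waiters[(l->head + l->count) % MAX_WAITERS] = node;
    l->count++;
    return 0;
}

/* Release the lock held by `node` and forward it to the oldest waiter.
 * Returns the new owner, or NO_NODE if nobody was waiting. */
int qlock_release(QueueLock *l, int node) {
    assert(l->owner == node);
    if (l->count == 0) {
        l->owner = NO_NODE;
        return NO_NODE;
    }
    l->owner = l->waiters[l->head];
    l->head = (l->head + 1) % MAX_WAITERS;
    l->count--;
    return l->owner;
}
```

Forwarding on release means a waiting node receives the lock in one message from the previous holder instead of polling for it.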
32. Performance
33. Matrix Multiply
34. Matrix Multiply Optimized
35. SOR
36. Effect of Multiple Protocols
Execution time in seconds:

Protocol       Matrix Multiply   SOR
Multiple       72.41             27.64
Write-shared   75.59             64.48
Conventional   75.85             67.64
37. Performance Problem with Munin
- Note: inefficient performance for the task-queue model (TSP-Q, quicksort, etc.)
- E.g. speedup with MPI for TSP (16 procs)
  - code I: 8.9, code II: 13.4
- Speedup with Munin
  - code I: 6.0, code II: 8.9
- Major overhead: time threads spend waiting at the lock that protects the work queue, caused by transferring the whole work queue between threads
39. Overview of Other DSM Systems
- Clouds: per-segment (object) based consistency protocol
- Mirage: per-page based
- Orca: reliable ordered broadcast protocol
- Amber: the user is responsible for data distribution among processors
- Linda: shared variables live in a tuple space; atomic operations are insertion, removal, and reading
- Midway: uses entry consistency (a weaker consistency model than release consistency)
- DASH: hardware DSM
40. Conclusion
- Objective: an efficient DSM system with an interface similar to shared memory programming and small message passing overhead
- Special features: multiple protocols, software release consistency
- Implementation: synchronization realized by the Munin root thread and the Munin worker threads
41. Thank you