Title: Implementation and Performance of Munin (Distributed Shared Memory System)
1ECE 1147, Parallel Computation Oct. 30, 2006
Implementation and Performance of Munin
(Distributed Shared Memory System)
Dongying Li
(Original Authors J. B. Carter, et al.)
Department of Electrical and Computer Engineering,
University of Toronto
2Distributed Shared Memory
- Shared address space spanning the processors of a
distributed memory multiprocessor
[Figure: processors proc1, proc2 and proc3 and copies of a shared variable X0]
3Distributed Shared Memory
[Figure: processors proc0 ... procN, each with a local memory mem0 ... memN, connected by a network and presenting one shared memory]
4Distributed Shared Memory
- Challenges
- Good performance, comparable to shared memory programs
- No significant deviation from the shared memory coding model
- Low communication and message passing overheads
5Munin System
- Characteristic features
- Software release consistency
- Multiple consistency protocols
- Deviations from the shared memory model
- Shared variables annotated with their access pattern
- All synchronization visible to the system
6Contents
- Basic concepts
- Shared object
- Software release consistency
- Multiple consistency protocols
- Software implementation
- Prototype overview
- Execution process
- Advanced programming features
- Data object directory and delayed update queue
- Synchronization
- Performance
- Overview of other DSM systems
- Conclusion
7Basic Concepts
8Shared Object
[Figure: 8-kilobyte pages holding the shared variables x and y]
9Software Release Consistency
- Sequential consistency
- All processors observe the same order of operations
- That order must correspond to some serial order
- The only ordering constraint is that the reads/writes of each processor (e.g. P1) appear in its own program order; there is no restriction on the relative ordering between processors
- Synchronous read/write
- Writes must be propagated before moving on to the next operation
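The "some serial order" requirement can be made concrete with a small enumeration. This is only an illustrative sketch, not part of Munin: the two programs `P1` and `P2`, the register names `r1`/`r2` and the helper functions are assumptions chosen for the example. It lists every interleaving that keeps each processor's program order and collects the values the reads can return.

```python
from itertools import combinations

def interleavings(p1, p2):
    """Yield every merge of p1 and p2 that keeps each program's own order."""
    n, m = len(p1), len(p2)
    for slots in combinations(range(n + m), n):
        slot_set = set(slots)
        it1, it2 = iter(p1), iter(p2)
        yield [next(it1) if k in slot_set else next(it2) for k in range(n + m)]

def run(schedule):
    """Execute one serial order against a single memory; return (r1, r2)."""
    mem = {"x": 0, "y": 0}
    reads = {}
    for op, var, arg in schedule:
        if op == "w":
            mem[var] = arg          # write the value arg to var
        else:
            reads[arg] = mem[var]   # read var into the register named arg
    return reads["r1"], reads["r2"]

# Hypothetical test programs: each processor writes one variable,
# then reads the variable written by the other processor.
P1 = [("w", "x", 1), ("r", "y", "r1")]
P2 = [("w", "y", 1), ("r", "x", "r2")]

outcomes = {run(s) for s in interleavings(P1, P2)}
print(sorted(outcomes))
```

Under sequential consistency the pair (r1, r2) = (0, 0) never appears in `outcomes`, because any serial order places at least one write before both reads.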
10Software consistency
- Problems
- Message passing overhead
- False sharing
[Figure: timeline of writes w(x) and reads r(x), r(y) on different processors]
11Weak Consistency
- Data modifications are only propagated at synchronization
- Works fine if the program is properly synchronized through system primitives
[Figure: timeline of w(x), r(x) and r(y) with a synch point]
12Weak Consistency
[Figure: second timeline of w(x), r(y) and r(x) with a synch point]
13Software Release Consistency
- Special weak consistency protocol
- Reduction of message passing overhead
- Two categories of shared variable operations
- Ordinary access
- Read
- Write
- Synchronization access (lock, semaphore, barrier)
- Acquire
- Release
14Software Release Consistency
- Before an ordinary access (read, write) is allowed, all previous acquires must have been performed
- Before a release is allowed, all previous ordinary accesses must have been performed
- Before an acquire is allowed, all previous releases must have been performed
- Before a release is allowed, all previous acquires must have been performed
- In short, the results of writes made before a release are propagated before the next processor acquires the released lock
15Eager Release Consistency
- Writes are propagated at release
16Lazy Release Consistency
- Writes are propagated at acquire
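The difference between the two variants can be sketched in a few lines. This is a toy model under stated assumptions, not Munin's implementation: `Node`, `Lock`, `release`, `acquire` and `demo` are hypothetical names, each update is counted as one message, and the lazy variant simply stores the writes with the lock instead of tracking which acquirer has seen what.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.copy = {}       # this node's local replica of the shared data
        self.pending = {}    # writes made since the last release

    def write(self, var, value):
        self.copy[var] = value
        self.pending[var] = value

class Lock:
    def __init__(self):
        self.history = {}    # lazy mode: writes kept for later acquirers

def release(node, lock, others, eager):
    """Return the number of update messages sent at release time."""
    messages = 0
    if eager:
        for other in others:                 # push to every other replica now
            other.copy.update(node.pending)
            messages += 1
    else:
        lock.history.update(node.pending)    # only remember them with the lock
    node.pending = {}
    return messages

def acquire(node, lock):
    """Return the number of update messages received at acquire time."""
    if lock.history:
        node.copy.update(lock.history)       # pull when the lock is acquired
        return 1
    return 0

def demo(eager):
    a, b, c = Node("a"), Node("b"), Node("c")
    lock = Lock()
    a.write("x", 1)
    sent = release(a, lock, [b, c], eager)
    sent += acquire(b, lock)
    return sent, b.copy.get("x"), c.copy.get("x")
```

In this model `demo(True)` updates both `b` and `c` at the release, while `demo(False)` sends a single update to `b` when it acquires the lock and leaves `c` untouched until it synchronizes.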
17Multiple Consistency Protocols
- No single consistency protocol is suitable for all parallelization purposes
- Shared variables are accessed in different ways within a single program
- Variable access patterns change during execution
- Multiple protocols allow access-pattern-oriented tuning for different shared variables
18Multiple Consistency Protocols
- High-level sharing pattern annotation
- Specified in shared variable declaration
- Combinations of low-level protocol parameters
- Low-level protocol parameter
- Specified in shared variable directory
- Specific aspect of protocol
19Protocol Parameters
- I: Invalidate or update?
- R: Replicas allowed?
- D: Delayed operations allowed?
- FO: Fixed owner?
- M: Multiple writers allowed?
- S: Stable access pattern?
- FL: Flush changes to owner?
- W: Writable? (or write protected?)
20Sharing annotations
- Read-only
- Simplest pattern: once initialized, no further writes
- Suitable for constants etc.
- Migratory
- Only one thread can access the object at any one time
- Suitable for variables accessed only inside a critical section
- Write-shared
- Can be written concurrently by multiple threads
- Different threads update different words of the variable
- Producer-consumer
- Written by only one thread and read by others
- Replicate and update the object rather than invalidate it
21Sharing annotations
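- Example: producer-consumer

    /* repeated for some number of timesteps/iterations */
    /* interior points only, so that i-1, i+1, j-1 and j+1 stay in range */
    for (i = 1; i < n - 1; i++)
        for (j = 1; j < n - 1; j++)
            temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                 grid[i][j-1] + grid[i][j+1]);
    /* copy the new values back into the grid */
    for (i = 1; i < n - 1; i++)
        for (j = 1; j < n - 1; j++)
            grid[i][j] = temp[i][j];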
22Sharing annotations
- Reduction
- Accessed by fetch-and-operate operations (read, write, then release)
- Example: min(), a++
- Result
- Phase 1: multiple writers allowed
- Phase 2: one thread (the result) accesses it exclusively
- Conventional
- Conventional update protocol for shared variables
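The fetch-and-operate access pattern of a reduction variable can be sketched with an ordinary lock. This is a single-machine analogy using Python threads, not Munin code; `ReductionVar`, `fetch_and_op` and `parallel_min` are hypothetical names introduced for the example.

```python
import threading

class ReductionVar:
    """Hypothetical stand-in for a shared variable annotated as 'reduction'."""

    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()

    def fetch_and_op(self, op, operand):
        with self._lock:                     # acquire
            old = self.value                 # read
            self.value = op(old, operand)    # write
        return old                           # the lock was released by 'with'

def parallel_min(initial, candidates):
    """Each worker thread folds one candidate into the shared minimum."""
    shared = ReductionVar(initial)
    workers = [threading.Thread(target=shared.fetch_and_op, args=(min, c))
               for c in candidates]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return shared.value
```

Because the read and the write happen inside one critical section, the final value does not depend on the order in which the workers run.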
23Sharing annotations
                      Protocol Parameters
Sharing Annotation    I   R   D   FO  M   S   FL  W
Read-only             N   Y   -   -   -   -   -   N
Migratory             Y   N   -   N   N   -   N   Y
Write-shared          N   Y   Y   N   Y   N   N   Y
Producer-Consumer     N   Y   Y   N   Y   Y   N   Y
Reduction             N   Y   N   Y   N   -   N   Y
Result                N   Y   Y   Y   Y   -   Y   Y
Conventional          Y   Y   N   N   N   -   N   Y
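One way to read the table is as a lookup from each sharing annotation to its combination of low-level parameters. The sketch below copies the table values as they appear on the slide; the names `PROTOCOLS`, `load` and `annotations_with` are assumptions made for illustration, and "-" is decoded as `None` (parameter not applicable).

```python
FIELDS = ("I", "R", "D", "FO", "M", "S", "FL", "W")

TABLE = """
Read-only          N Y - - - - - N
Migratory          Y N - N N - N Y
Write-shared       N Y Y N Y N N Y
Producer-Consumer  N Y Y N Y Y N Y
Reduction          N Y N Y N - N Y
Result             N Y Y Y Y - Y Y
Conventional       Y Y N N N - N Y
"""

DECODE = {"Y": True, "N": False, "-": None}   # "-" means not applicable

def load(table):
    """Parse the text table into {annotation: {parameter: value}}."""
    protocols = {}
    for line in table.strip().splitlines():
        name, *flags = line.split()
        protocols[name] = dict(zip(FIELDS, (DECODE[f] for f in flags)))
    return protocols

PROTOCOLS = load(TABLE)

def annotations_with(**wanted):
    """List the annotations whose parameters match all the given values."""
    return [name for name, params in PROTOCOLS.items()
            if all(params[key] == value for key, value in wanted.items())]
```

For example, `annotations_with(M=True)` lists the annotations that allow multiple writers, and `annotations_with(I=True)` lists the invalidation-based ones.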
24Software Implementation
25Prototype Overview
- A simple preprocessor converting annotations to a suitable format
- A linker creating the shared memory segment
- Library routines linked into the program
- Operating system support for fault handling and page table manipulation
26Execution Process
[Figure: the Munin preprocessor takes the sharing annotations and produces an auxiliary file; the linker produces the shared data description table and the shared data segment]
27Execution Process
[Figure: P1 runs the Munin root thread and the user root thread (User_init()); P2 ... Pn each receive a code copy and data segment and run a Munin worker thread]
28Execution Process
[Figure: a synchronization operation involving the Munin root thread on P1 and the user thread and Munin worker thread on P2 ... Pn]
29Advanced Programming Features
- Associate data and synch
[Figure: timelines showing rel(m), the msg sent to the acquirer, acq(m), and the accesses w(x) and r(x)]
30Advanced Programming Features
- PhaseChange()
- Changes the producer-consumer relationship
- Example: adaptive mesh SOR
- ChangeAnnotation()
- Changes the access pattern during execution
- Invalidate()
- Flush()
- SingleObject()
- PreAcquire()
31Data Object Directory
- Start Address and Size
- Protocol parameters
- Object state (valid, writable, invalid)
- Copyset (which remote processors have copies)
- Synchq (corresponding synchronization object)
- Probable owner
- Home node
- Access control semaphore
- Links
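The fields listed above map naturally onto a record type. The following is a rough sketch of what one directory entry might look like; the field types, the `State` enum, the reading of "links" as a pointer to another entry, and the `lookup` helper are all assumptions for illustration, not Munin's actual data layout.

```python
import threading
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional, Set

class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    WRITABLE = "writable"

@dataclass
class DirectoryEntry:
    start: int                                  # start address
    size: int                                   # size in bytes
    params: dict                                # protocol parameters (I, R, D, ...)
    state: State = State.INVALID                # object state
    copyset: Set[int] = field(default_factory=set)   # remote nodes with copies
    synchq: Optional[str] = None                # associated synchronization object
    probable_owner: Optional[int] = None        # best guess at the current owner
    home_node: int = 0                          # node where the object was created
    access: threading.Semaphore = field(
        default_factory=lambda: threading.Semaphore(1))  # access control semaphore
    link: Optional["DirectoryEntry"] = None     # "links": assumed pointer to another entry

def lookup(entries, addr):
    """Find the entry whose address range contains addr (e.g. on a page fault)."""
    for entry in entries:
        if entry.start <= addr < entry.start + entry.size:
            return entry
    return None
```

A fault handler would typically start from a lookup like this to decide which protocol parameters apply to the faulting address.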
32Delayed Update Queue
[Figure: writes w(x) and w(y) performed between acq(m) and rel(m), with the updates for x and y queued until rel(m)]
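A delayed update queue can be pictured as a buffer that coalesces writes until the next release. The sketch below is a minimal model under that assumption; `DelayedUpdateQueue` and its methods are hypothetical names, and the `sent` list stands in for real network messages.

```python
class DelayedUpdateQueue:
    """Buffer local writes and send them as one message at release time."""

    def __init__(self):
        self.pending = {}   # variable -> latest value written locally
        self.sent = []      # log of the messages actually sent

    def write(self, var, value):
        self.pending[var] = value                  # no message yet

    def release(self):
        if self.pending:
            self.sent.append(dict(self.pending))   # one combined update message
            self.pending.clear()

queue = DelayedUpdateQueue()
queue.write("x", 1)
queue.write("x", 2)     # overwrites the queued value for x
queue.write("y", 5)
queue.release()         # a single message carries the final x and y
print(queue.sent)
```

Three writes result in one message at the release, and the intermediate value of `x` is never transmitted.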
33Multiple Writer Handling
34Multiple Writer Handling
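The two slides above were figures. As a rough illustration of how concurrent writers to different words of one object can be merged, the sketch below uses a twin-and-diff scheme of the kind described for Munin's write-shared objects: each writer keeps an unmodified twin of the object, and only the words that differ from the twin are sent and merged. The function names and the list-of-words page model are assumptions made for the example.

```python
def make_twin(page):
    """Keep an unmodified copy of the page before the first local write."""
    return list(page)

def diff(twin, page):
    """Return {word index: new value} for every word that changed."""
    return {i: new for i, (old, new) in enumerate(zip(twin, page)) if old != new}

def apply_diff(page, changes):
    """Merge another writer's changed words into this copy."""
    for index, value in changes.items():
        page[index] = value

# Two writers start from the same 8-word page and touch different words.
master = [0] * 8
copy_a, copy_b = list(master), list(master)
twin_a, twin_b = make_twin(copy_a), make_twin(copy_b)

copy_a[0], copy_a[1] = 1, 2     # writer A updates words 0 and 1
copy_b[6] = 7                   # writer B updates word 6

apply_diff(master, diff(twin_a, copy_a))
apply_diff(master, diff(twin_b, copy_b))
print(master)
```

Since only changed words are merged, the two sets of updates do not overwrite each other as long as the writers touch different words.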
35Synchronization
- Queue-based synchronization
- Request, reply and lock-forwarding mechanism
- AcquireLock(), Unlock(), WaitAtBarrier()
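The queue-based request/reply/forward idea can be sketched as a lock token that moves between nodes. This is a simplified single-process model, not the Munin routines named above: `QueueLock`, its fields and the message log are hypothetical, and a real system would exchange these messages over the network.

```python
from collections import deque

class QueueLock:
    def __init__(self, owner):
        self.owner = owner        # node currently holding the lock token
        self.held = False
        self.waiting = deque()    # FIFO queue of nodes waiting for the lock
        self.log = []             # messages exchanged, for illustration only

    def acquire(self, node):
        """Return True if the lock was obtained, False if the node must wait."""
        if not self.held and self.owner == node:
            self.held = True                       # token is local: no messages
            return True
        self.log.append(("request", node, self.owner))
        if not self.held:
            self.log.append(("reply", self.owner, node))
            self.owner, self.held = node, True     # free token moves to requester
            return True
        self.waiting.append(node)                  # lock busy: join the queue
        return False

    def release(self, node):
        if not (self.held and self.owner == node):
            raise RuntimeError("release by a node that does not hold the lock")
        if self.waiting:
            nxt = self.waiting.popleft()
            self.log.append(("forward", node, nxt))   # hand the lock on directly
            self.owner = nxt
        else:
            self.held = False                      # keep the token until asked
```

A node that re-acquires a lock it released last pays no message cost in this model, which is the case that matters for locks used repeatedly by one processor.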
36Performance
37Matrix Multiply
38Matrix Multiply Optimized
39SOR
40Effect of Multiple Protocols
Protocol        Matrix Multiply   SOR
Multiple        72.41             27.64
Write-shared    75.59             64.48
Conventional    75.85             67.64
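To compare the rows, each single-protocol result can be divided by the multiple-protocol result for the same program. The slide gives no unit, so the sketch assumes the entries are execution times (lower is better); the dictionary layout and the `ratio_to_multiple` helper are illustrative only.

```python
# Values copied from the table above; assumed to be execution times.
times = {
    "Multiple":     {"Matrix Multiply": 72.41, "SOR": 27.64},
    "Write-shared": {"Matrix Multiply": 75.59, "SOR": 64.48},
    "Conventional": {"Matrix Multiply": 75.85, "SOR": 67.64},
}

def ratio_to_multiple(protocol, program):
    """How many times larger the entry is than the multiple-protocol entry."""
    return times[protocol][program] / times["Multiple"][program]

for program in ("Matrix Multiply", "SOR"):
    for protocol in ("Write-shared", "Conventional"):
        ratio = ratio_to_multiple(protocol, program)
        print(f"{program:16s} {protocol:13s} {ratio:.2f}x")
```

Under that assumption the choice of protocol matters little for matrix multiply (ratios close to 1) and a great deal for SOR (ratios above 2).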
41Overview of Other DSM Systems
42Overview of Other DSM Systems
- Clouds: per-segment (object) based consistency protocol
- Mirage: per-page based
- Orca: reliable ordered broadcast protocol
- Amber: user responsible for the data distribution among processors
- Linda: shared variables in tuple space; atomic operations: insertion, removal, reading
- Midway: uses entry consistency (weaker than release consistency)
- DASH: hardware DSM
43Conclusion
- Objective: an efficient DSM system with a programming model similar to shared memory programming and small message passing overhead
- Special features: multiple protocols, software release consistency
- Implementation: synchronization realized by the Munin root thread and Munin worker threads
44Thank you