Title: Distributed Shared Memory (part 1)
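2. Distributed Shared Memory (DSM)
[Figure: a shared memory abstraction layered over a network that connects local memories mem0, mem1, mem2, ..., memN, one per processor proc0, proc1, proc2, ..., procN.]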
3. Shared memory programming
- Standard pthreads
- Synchronization primitives:
  - Barriers
  - Locks
  - Semaphores
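4. Sequential SOR
    for (some number of timesteps/iterations) {
        /* compute the new value of every interior point into temp */
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                temp[i][j] = 0.25 *
                    (grid[i-1][j] + grid[i+1][j] +
                     grid[i][j-1] + grid[i][j+1]);
        /* copy the new values back into grid */
        for (i = 1; i < n; i++)
            for (j = 1; j < n; j++)
                grid[i][j] = temp[i][j];
    }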
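5. Parallel SOR with Barriers (1 of 2)
    void *sor(void *arg)
    {
        int slice = (int)(intptr_t)arg;          /* this thread's slice, 0..p-1 */
        int from = (slice * (n-1)) / p + 1;      /* first row of the slice */
        int to = ((slice+1) * (n-1)) / p + 1;    /* one past the last row */
        for (some number of iterations) {
6. Parallel SOR with Barriers (2 of 2)
            for (i = from; i < to; i++)
                for (j = 1; j < n; j++)
                    temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
            barrier();   /* every slice of temp is computed before grid changes */
            for (i = from; i < to; i++)
                for (j = 1; j < n; j++)
                    grid[i][j] = temp[i][j];
            barrier();   /* grid is fully updated before the next iteration */
        }
        return NULL;
    }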
7. Differences between SMP and Software DSM
- Delay: different latencies shift tradeoffs such as block size
- Software => traps: higher cost of read/write misses
- Goals: caches aim at multiprocessor performance; a distributed system aims at transparency
- Bus vs. long networks: reliance on serialization and broadcast
8. Consequent differences in protocols and applications
- Bigger block size
  - Cost amortization; higher hit ratio for larger blocks?
  - Reduced overhead
- But therefore...
  - Migration vs. replication
  - False sharing increases
- DSM protocol is more complex: it must handle lost, corrupted, and out-of-order packets
- The above, coupled with the cost of traps => SDSM consistency cost is much higher!
9. Results of high consistency costs
- Manage sharing more carefully
- Align data to page boundaries
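One way to align data to page boundaries is to allocate each independently accessed object in whole pages. The sketch below uses the C11 function aligned_alloc. The 4096-byte page size and the names DSM_PAGE_SIZE, alloc_page_aligned, and page_of are assumptions for illustration; real code would query the page size from the operating system.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define DSM_PAGE_SIZE ((size_t)4096)  /* assumed page size */

/* Allocates at least `bytes`, rounded up to whole pages, starting on a
 * page boundary, so the object shares no page with another allocation. */
static void *alloc_page_aligned(size_t bytes)
{
    size_t rounded = (bytes + DSM_PAGE_SIZE - 1) / DSM_PAGE_SIZE * DSM_PAGE_SIZE;
    if (rounded == 0)
        rounded = DSM_PAGE_SIZE;
    return aligned_alloc(DSM_PAGE_SIZE, rounded);
}

/* Page number that an address falls on. */
static uintptr_t page_of(const void *p)
{
    return (uintptr_t)p / DSM_PAGE_SIZE;
}
```

Two arrays obtained from alloc_page_aligned never fall on the same page, so a write to one cannot invalidate another processor's copy of the other.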
10. Consistency Models
- Sequential Consistency
  - All processors observe the same order of operations
  - That order must correspond to some serial order
  - The only ordering constraint is that the reads/writes of each processor (e.g., P1) appear in that processor's program order; there are no restrictions on the relative ordering between processors.
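To make the definition concrete, the sketch below enumerates every serial order of a small two-processor program that keeps each processor's own operations in program order. The litmus test (P1: x = 1; r1 = y and P2: y = 1; r2 = x, with x and y initially 0) and the names seen, run_interleaving, and enumerate_sc_outcomes are chosen here for illustration and are not from the slides.

```c
#include <assert.h>
#include <stdbool.h>

/* seen[r1][r2] is set when some sequentially consistent order yields (r1, r2). */
static bool seen[2][2];

/* Runs one candidate order. Bit k of mask selects who issues step k
 * (0 = P1, 1 = P2). Returns false if a processor would get more than
 * its two operations, i.e. the mask is not a valid interleaving. */
static bool run_interleaving(unsigned mask)
{
    int x = 0, y = 0;
    int r1 = 0, r2 = 0;
    int pc1 = 0, pc2 = 0;
    for (int step = 0; step < 4; step++) {
        if ((mask >> step) & 1u) {      /* P2: y = 1; r2 = x */
            if (pc2 == 0) y = 1;
            else if (pc2 == 1) r2 = x;
            else return false;
            pc2++;
        } else {                        /* P1: x = 1; r1 = y */
            if (pc1 == 0) x = 1;
            else if (pc1 == 1) r1 = y;
            else return false;
            pc1++;
        }
    }
    seen[r1][r2] = true;
    return true;
}

/* Tries all 16 masks and returns how many are valid interleavings. */
static int enumerate_sc_outcomes(void)
{
    int valid = 0;
    for (int a = 0; a < 2; a++)
        for (int b = 0; b < 2; b++)
            seen[a][b] = false;
    for (unsigned mask = 0; mask < 16; mask++)
        if (run_interleaving(mask))
            valid++;
    return valid;
}
```

Under these assumptions the outcome r1 = 0, r2 = 0 never appears: in any serial order one of the two writes comes first, so the later read on the other processor must see it.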
11. Common consistency protocols
- Write update
  - Multicast update to all replicas
- Write invalidate
  - Invalidate cached copies in p2, p3
  - Cache miss if p2/p3 access X
  - Valid data from other cache
12. Conventional Implementation
- As proposed by Li & Hudak, TOCS 86.
- Use virtual memory to implement sharing.
- Shared memory divided up by virtual memory pages.
- Use single-writer, multiple-reader write-invalidate coherence protocol.
- Keep pages in one of three states:
  - invalid, read-only, read-write
13. Example
[Figure: shared memory and processors proc0, proc1, proc2, ..., procN]
14. Example: Read Access Hit
[Figure: read]
15. Example: Write Access Hit
[Figure: write]
16. Example: Read Access Miss
[Figure: read]
17. Example: Read Fault
[Figure: read, fault]
18. Example: Replication on Read
[Figure: read]
19. Example: Write Access Miss
[Figure: write]
20. Example: Write Fault
[Figure: write, fault]
21. Example: Write Invalidation
[Figure: write]
22. Example: Write Access to Read-Only
[Figure: write]
23. Example: Write Fault
[Figure: write, fault]
24. Example: Write Invalidation
[Figure: write]
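The cases in the example slides (hit, read fault with replication, write fault with invalidation, write to a read-only copy) can be traced with a small state machine over the three page states. This is a sketch only: it tracks one page, ignores data transfer and messages, and assumes that a read fault downgrades an existing writer to read-only. NPROC, dsm_page, page_init, page_read, and page_write are illustrative names.

```c
#include <assert.h>

#define NPROC 4  /* assumed number of processors */

typedef enum { INVALID, READ_ONLY, READ_WRITE } page_state;

typedef struct {
    page_state state[NPROC];  /* state of the page on each processor */
    int faults;               /* access faults taken so far */
    int invalidations;        /* remote copies invalidated so far */
} dsm_page;

/* The page starts with a single writable copy on first_writer. */
static void page_init(dsm_page *pg, int first_writer)
{
    for (int p = 0; p < NPROC; p++)
        pg->state[p] = INVALID;
    pg->state[first_writer] = READ_WRITE;
    pg->faults = 0;
    pg->invalidations = 0;
}

/* Read by processor p: a hit unless p's copy is invalid. */
static void page_read(dsm_page *pg, int p)
{
    if (pg->state[p] != INVALID)
        return;                           /* read access hit */
    pg->faults++;                         /* read fault */
    for (int q = 0; q < NPROC; q++)
        if (pg->state[q] == READ_WRITE)
            pg->state[q] = READ_ONLY;     /* writer keeps a read-only copy */
    pg->state[p] = READ_ONLY;             /* replication on read */
}

/* Write by processor p: a hit only if p already holds the writable copy. */
static void page_write(dsm_page *pg, int p)
{
    if (pg->state[p] == READ_WRITE)
        return;                           /* write access hit */
    pg->faults++;                         /* write fault (invalid or read-only) */
    for (int q = 0; q < NPROC; q++) {
        if (q != p && pg->state[q] != INVALID) {
            pg->state[q] = INVALID;       /* write invalidation */
            pg->invalidations++;
        }
    }
    pg->state[p] = READ_WRITE;            /* single writer */
}
```

At any time the page has either one read-write copy or any number of read-only copies, which is the single-writer, multiple-reader invariant.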
25. How to Remember Locations?
- Broadcast on miss (as in SMP).
- Static home.
- Dynamic home or owner.
26. Ownership and Owner Location
- Owner is the last writer.
- Owner maintains copyset.
- Every processor maintains a probable owner (not always the real owner).
27. Ownership Location
- Every read or write miss is sent to the (local) probable owner.
- If the receiver is the owner, it handles the request appropriately; otherwise it forwards the request to its own probable owner.
28. Ownership Modification
- On a write miss, the new writer becomes the owner, and all forwarders set their probable owner to the requester.
- On a read miss, the requester sets its probable owner to the responding processor.
29. Example
- Initially, owner(page0) = p0, and probable owner(page0) = p0 everywhere.
- Write miss by p1: sends a message to its probable owner (p0), handled there; new owner = p1, and probable owner(page0) on p0 = p1.
- Read miss by p2: sends a message to its probable owner (p0), forwarded to p0's probable owner (p1), handled there; probable owner(page0) on p2 becomes p1.
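The forwarding and update rules above can be replayed in a small single-page simulation. This is a sketch under simplifying assumptions: one page, three processors, and a message count that covers only the request hops, not the reply. The names page_dir, dir_init, write_miss, and read_miss are illustrative.

```c
#include <assert.h>

#define NPROC 3  /* assumed: processors p0, p1, p2 */

typedef struct {
    int owner;              /* real owner of the page */
    int prob_owner[NPROC];  /* each processor's probable owner */
} page_dir;

static void dir_init(page_dir *d, int owner)
{
    d->owner = owner;
    for (int p = 0; p < NPROC; p++)
        d->prob_owner[p] = owner;
}

/* Write miss by req: the request follows probable-owner pointers to the
 * real owner. Every processor on the path, including the old owner, then
 * points at req, which becomes the owner. Returns the request hops. */
static int write_miss(page_dir *d, int req)
{
    if (d->owner == req)
        return 0;                       /* already owner: nothing to locate */
    int msgs = 0;
    int cur = d->prob_owner[req];
    for (;;) {
        msgs++;                         /* request arrives at cur */
        assert(msgs <= NPROC);          /* pointer chains must not cycle */
        int next = d->prob_owner[cur];
        int was_owner = (cur == d->owner);
        d->prob_owner[cur] = req;
        if (was_owner)
            break;
        cur = next;
    }
    d->owner = req;
    d->prob_owner[req] = req;
    return msgs;
}

/* Read miss by req: forwarded the same way, ownership does not change,
 * and req records the responding processor as its probable owner. */
static int read_miss(page_dir *d, int req)
{
    if (d->owner == req)
        return 0;
    int msgs = 1;                       /* request to req's probable owner */
    int cur = d->prob_owner[req];
    while (cur != d->owner) {
        msgs++;                         /* forwarded one hop further */
        assert(msgs <= NPROC);
        cur = d->prob_owner[cur];
    }
    d->prob_owner[req] = cur;
    return msgs;
}
```

Replaying the slide's example with dir_init(&d, 0), write_miss(&d, 1), and read_miss(&d, 2) takes one hop for the write miss and two for the read miss, and leaves p2's probable owner at p1.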
30. Implementing synchronization
- Use messages to implement synchronization
31. Barriers
- Designate one processor as barrier manager.
- When a process waits at a barrier, it sends an arrival message to the barrier manager and waits.
- When the barrier manager has received all arrival messages, it sends a departure message to all processes.
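The manager's side of this scheme reduces to counting arrivals. The sketch below models only that bookkeeping, with message delivery left out. barrier_mgr, barrier_mgr_init, and barrier_on_arrival are illustrative names, and counting one departure message per process is an assumption.

```c
#include <assert.h>

typedef struct {
    int nprocs;    /* processes that take part in the barrier */
    int arrived;   /* arrival messages received in the current episode */
    int episodes;  /* completed barrier episodes */
} barrier_mgr;

static void barrier_mgr_init(barrier_mgr *b, int nprocs)
{
    b->nprocs = nprocs;
    b->arrived = 0;
    b->episodes = 0;
}

/* Handles one arrival message. Returns the number of departure messages
 * the manager sends in response: 0 until the last process arrives, then
 * one per process, after which the barrier is ready for reuse. */
static int barrier_on_arrival(barrier_mgr *b)
{
    b->arrived++;
    if (b->arrived < b->nprocs)
        return 0;          /* keep everyone waiting */
    b->arrived = 0;        /* reset for the next episode */
    b->episodes++;
    return b->nprocs;      /* release all waiting processes */
}
```

With n processes, each barrier episode in this model costs n arrival messages plus n departure messages, all passing through the manager.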
32. Locks
- Designate one process as the lock manager for a particular lock.
- When a process wants to acquire a lock, it sends an acquire message to the manager and waits.
- Manager forwards the message to the last acquirer.
  - If the lock is free, the last acquirer sends a lock grant message.
  - If the lock is held, it holds on to the request until the lock is free, and then sends a lock grant message.
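A sketch of this forwarding scheme follows, assuming the manager only remembers the most recent requester and each process holds at most one forwarded request. It models state changes rather than real messages. dlock, lock_init, lock_acquire, and lock_release are illustrative names, and a process is assumed not to request a lock it already holds or waits for.

```c
#include <assert.h>

#define NPROC 4  /* assumed number of processes */

typedef struct {
    int last;             /* manager: most recent requester */
    int has_lock[NPROC];  /* 1 if the lock currently resides at this process */
    int in_cs[NPROC];     /* 1 if this process is inside the critical section */
    int pending[NPROC];   /* forwarded request held by this process, or -1 */
} dlock;

/* The lock starts out free, residing at process `first`. */
static void lock_init(dlock *l, int first)
{
    for (int p = 0; p < NPROC; p++) {
        l->has_lock[p] = 0;
        l->in_cs[p] = 0;
        l->pending[p] = -1;
    }
    l->has_lock[first] = 1;
    l->last = first;
}

/* Lock grant message: the lock moves and the receiver enters. */
static void grant(dlock *l, int from, int to)
{
    l->has_lock[from] = 0;
    l->has_lock[to] = 1;
    l->in_cs[to] = 1;
}

/* Acquire by p. Returns 1 if granted at once, 0 if p has to wait. */
static int lock_acquire(dlock *l, int p)
{
    int prev = l->last;       /* manager forwards to the last acquirer */
    l->last = p;
    if (prev == p) {          /* p still has the free lock locally */
        assert(l->has_lock[p] && !l->in_cs[p]);
        l->in_cs[p] = 1;
        return 1;
    }
    if (l->has_lock[prev] && !l->in_cs[prev]) {
        grant(l, prev, p);    /* lock free: grant immediately */
        return 1;
    }
    l->pending[prev] = p;     /* lock busy: prev holds on to the request */
    return 0;
}

/* Release by p. Returns the process granted next, or -1 if none waits. */
static int lock_release(dlock *l, int p)
{
    assert(l->in_cs[p]);
    l->in_cs[p] = 0;
    int next = l->pending[p];
    if (next < 0)
        return -1;            /* lock stays free at p */
    l->pending[p] = -1;
    grant(l, p, next);
    return next;
}
```

Because each request is forwarded to the previous requester, waiting processes form a chain and the lock is handed over in request order without further involvement of the manager.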
33. Problem: False Sharing
- Concurrent access to different data within the same consistency unit.
- With a page as the consistency unit, there is a lot of opportunity for false sharing.
- Two flavors:
  - read-write
  - write-write
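34. Read-Write False Sharing
[Figure: data items x and y]
35. Read-Write False Sharing (Cont.)
[Figure: timeline with accesses w(x), w(x), w(x), r(x), r(y), r(y) and a synch point]
36. Read-Write False Sharing (Cont.)
[Figure: timeline with accesses w(x), w(x), w(x), r(x), r(y), r(y) and a synch point]
37. Write-Write False Sharing
[Figure: timeline with accesses w(x), w(x), w(x), r(x), w(y), w(y) and a synch point]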
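The cost of false sharing can be illustrated by counting page faults in a toy model of a single-writer, multiple-reader invalidate protocol. In this sketch processor P0 repeatedly writes x while P1 repeatedly reads or writes y, with the two accesses alternating. The access pattern and the names access_page and count_faults are assumptions made for illustration.

```c
#include <assert.h>

#define NPROC 2  /* assumed: processors P0 and P1 */

typedef enum { INVALID, READ_ONLY, READ_WRITE } page_state;

/* Applies one access by processor p to a page with per-processor states
 * st. Returns 1 if the access faults, 0 if it hits. */
static int access_page(page_state st[NPROC], int p, int is_write)
{
    if (is_write) {
        if (st[p] == READ_WRITE)
            return 0;
        for (int q = 0; q < NPROC; q++)
            if (q != p)
                st[q] = INVALID;        /* invalidate all other copies */
        st[p] = READ_WRITE;
        return 1;
    }
    if (st[p] != INVALID)
        return 0;
    for (int q = 0; q < NPROC; q++)
        if (st[q] == READ_WRITE)
            st[q] = READ_ONLY;          /* writer is downgraded */
    st[p] = READ_ONLY;
    return 1;
}

/* x lives on page 0, y on page y_page (0 = same page as x, 1 = its own
 * page). Each round P0 writes x, then P1 reads y (y_is_write = 0) or
 * writes y (y_is_write = 1). Returns the total number of faults. */
static int count_faults(int y_page, int y_is_write, int rounds)
{
    page_state pages[2][NPROC];
    for (int pg = 0; pg < 2; pg++)
        for (int p = 0; p < NPROC; p++)
            pages[pg][p] = INVALID;
    int faults = 0;
    for (int r = 0; r < rounds; r++) {
        faults += access_page(pages[0], 0, 1);                /* P0: w(x) */
        faults += access_page(pages[y_page], 1, y_is_write);  /* P1: y */
    }
    return faults;
}
```

In this model, with x and y on the same page every access faults, for both the read-write and the write-write flavor; with y on its own page only the first access by each processor faults.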
38. Summary
- Software shared memory on distributed memory hardware.
- Uses virtual memory.
- Home migration to improve locality
  - important because of high latencies.
- Sequential consistency suffers from false sharing.