Title: Speculative Shared-Memory Architectures
1Speculative Shared-Memory Architectures
- ECE 572 Parallel Computer Architecture
- Spring 2004
2Problem 1: Conservative Parallelization
- No parallelization unless 100% safe
- Hard-to-analyze access patterns
- Subscripted array subscripts
- Pointer accesses
- Corner cases in mostly parallel codes
for (i = 0; i < n; i++) {
    ... a[b[i]] ...
    ... a[c[i]] ...
}
(possible RAW dependence across iterations)
3Problem 2: Conservative Synchronization
- Synchronization in parallel codes is conservative
- Hard-to-analyze access patterns
- Corner cases in mostly race-free codes
- Aggressive sync not affordable
- Too time-consuming
- Too complicated
[Figure: BARRIER version vs. OPTIMIZED version]
4Thread-level Speculation
- Execute hard-to-analyze codes speculatively in parallel
- Assume no dependences and execute in parallel
- Track memory accesses, detect violations
- Squash and restart offending threads
for (i = 0; i < n; i++) {
    ... a[b[i]] ...
    ... a[c[i]] ...
}
[Figure, three animation steps: loop iterations executing in parallel, with a RAW dependence marked; labels a[5], a[6]]
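One way to see what the hardware has to catch is to check the index arrays in software. The sketch below assumes, purely for illustration, that each iteration writes a[b[i]] and then reads a[c[i]]; the slide does not say which access is the write. The function name first_cross_iteration_raw is hypothetical. It returns the first iteration whose read could depend on a write from an earlier iteration, which is the kind of iteration TLS would squash and restart if it ran too early.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not from the slides.
 * Assumption: iteration i writes a[b[i]] and then reads a[c[i]].
 * Returns the first iteration i whose read index c[i] matches the write
 * index b[j] of some earlier iteration j < i, or -1 if there is none. */
long first_cross_iteration_raw(const int *b, const int *c, size_t n)
{
    assert(b != NULL && c != NULL);
    for (size_t i = 1; i < n; i++) {
        for (size_t j = 0; j < i; j++) {
            if (b[j] == c[i])
                return (long)i;   /* iteration i must see iteration j's write */
        }
    }
    return -1;                    /* no cross-iteration RAW: fully parallel */
}
```

With b = {5, 6, 7} and c = {0, 5, 2}, iteration 1 reads a[5], which iteration 0 writes, so the function reports iteration 1. A compiler cannot run this check statically because b and c are only known at run time, which is why the loop is left serial without speculation.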
7Two TLS Extensions
- Scalable TLS [Cintra et al., ISCA'00]
- Scalable → larger speedups
- Hierarchical → built out of (future) commodity spec CMPs
- Serial codes sped up 5.3× on avg for 16 processors
- Speculative Synchronization [Martínez and Torrellas, ASPLOS'02]
- Apply TLS to explicitly parallel codes
- Speculatively execute past active barriers, locks, flags
- Use TLS's safe thread to guarantee forward progress
- 40% avg sync time reduction
8Part 1
- Scalable TLS
- Introduction
- Implementation
9Thread-Level Speculation (TLS)
- Execute hard-to-analyze codes speculatively in parallel
- Track memory accesses, detect violations
- Squash and restart offending threads
- Keep safe thread (earliest) at all times
- Use caches to buffer speculative state
- If overflow, stall until safe
- Typical: speculative loop parallelization
- Threads = loop iterations
10Speculative Chip-Multiprocessor (CMP)
[Figure: four processors (P0-P3), each with a private L1, over a shared L2 with a Memory Disambiguation Table (MDT). Source: Krishnan and Torrellas]
11Memory Disambiguation Table (MDT)
[Figure: MDT entry layout: a TAG plus, for each word (WORD 0, WORD 1), Load (L) and Store (S) bits for each of P0-P3]
12Speculative Memory Accesses
[Figure: threads ordered P0 < P1 < P2 < P3; P2 has loaded a value from memory; the MDT entry holds L and S bits for P0-P3]
13Write: Dependence Violation
[Figure: P1 stores; the MDT shows that P2's load was premature]
14Write: Dependence Violation
[Figure: P2 and its successors are squashed; the MDT clears P2's state and records P1's store]
15Write: Shielding
[Figure: P0 stores; the Store bit for P1 shields P2; the Store bit for P0 is set]
16Read: Value Forwarding
[Figure: P2 loads; the MDT shows the most recent version is in P1]
17Read: Value Forwarding
[Figure: the MDT forwards the request to P1; P1 delivers the data; the MDT records P2's load]
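The MDT walk-through on the preceding slides can be summarized as a small software model. This is a minimal sketch under stated assumptions, not the hardware design: it tracks a single word for four processors, it sets a Load bit only for a load that is not preceded by the same thread's own store, and it ignores data movement. The names mdt_word, mdt_load, and mdt_store are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NPROC 4
#define FROM_MEMORY (-1)   /* no predecessor holds a version */
#define NO_SQUASH   (-1)   /* the store caused no violation */

/* Load (L) and Store (S) bits of one word, one pair per processor. */
typedef struct {
    bool load[NPROC];
    bool store[NPROC];
} mdt_word;

/* Load by processor p: record it and return the processor that holds
 * the most recent version (closest predecessor with its S bit set). */
int mdt_load(mdt_word *w, int p)
{
    assert(p >= 0 && p < NPROC);
    if (w->store[p])
        return p;                      /* thread reads its own version */
    w->load[p] = true;                 /* exposed load: record it */
    for (int q = p - 1; q >= 0; q--)
        if (w->store[q])
            return q;                  /* value forwarding from q */
    return FROM_MEMORY;
}

/* Store by processor p: return the first successor whose load was
 * premature (it and all later threads are squashed), or NO_SQUASH. */
int mdt_store(mdt_word *w, int p)
{
    assert(p >= 0 && p < NPROC);
    int victim = NO_SQUASH;
    for (int q = p + 1; q < NPROC; q++) {
        if (w->load[q]) { victim = q; break; }  /* dependence violation */
        if (w->store[q]) break;                 /* q's S bit shields later threads */
    }
    if (victim != NO_SQUASH)
        for (int q = victim; q < NPROC; q++)    /* clear squashed threads' state */
            w->load[q] = w->store[q] = false;
    w->store[p] = true;                         /* record the store */
    return victim;
}
```

Replaying the slides: a load by P2 returns FROM_MEMORY, a later store by P1 returns 2 (P2 and P3 are squashed), a re-executed load by P2 is forwarded from P1, and a store by P0 then returns NO_SQUASH because P1's Store bit shields P2.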
18Scalable TLS
- How can we make TLS scalable?
- Can we leverage speculative CMPs?
19Attempt 1: Isolated Mapping
[Figure: two CMPs, each with its own MDT, run T1 < T2 < T3 < T4 and T5 < T6 < T7 < T8; how the two groups interact through memory (MEM) is left open (?)]
20Attempt 2: Flat TLS System
[Figure: threads T1-T8 with private L1s over two L2s; the per-chip MDTs are in question (?); a Global MDT sits at memory (MEM); region marked Speculative]
21Attempt 3: Use Thread Chunks
[Figure: chunks T1-4 and T5-8, ordered T1-4 < T5-8, each on its own L2; a single MDT sits at memory (MEM); region marked Speculative]
22Result: Hierarchical TLS System
[Figure: threads T1-T8 with private L1s in total thread order; each chip has an L2 and an MDT; a further MDT sits at memory (MEM); region marked Speculative]
23Scalable Hierarchical TLS System
[Figure: two nodes, each with a directory (DIR), an MDT, and memory (MEM)]
24Speculative Memory Accesses
[Figure: two nodes ordered N0 < N1, each with processors P0 < P1 < P2 < P3 and a local MDT with L and S bits per processor; a global MDT holds L and S bits per node (N0, N1)]
25Read: Value Forwarding, Local Hit
[Figure: P2 loads; the local MDT shows the most recent version is local]
26Read: Value Forwarding, Local Hit
[Figure: P1 and P2 highlighted as the version is supplied within the node]
27Read: Value Forwarding, Local Miss
[Figure: P2 loads; the local MDT shows no local version]
28Read: Value Forwarding, Local Miss
[Figure: the global MDT shows the most recent version is in N0]
29Read: Value Forwarding, Local Miss
[Figure: in N0, the version is to be supplied by P1]
30Read: Value Forwarding, Local Miss
[Figure: P1 and P2 highlighted as the data is delivered across nodes]
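The local-hit and local-miss read paths can be sketched as a two-level lookup. This is an illustrative model only: it covers one word on two nodes of four processors, it records stores without checking for violations, and it assumes the global Load bit is set whenever a load misses locally. The names tls_system, version_src, record_store, and hier_load are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NNODE 2
#define NPROC 4

typedef struct { bool load[NPROC]; bool store[NPROC]; } local_mdt;   /* per-processor bits */
typedef struct { bool load[NNODE]; bool store[NNODE]; } global_mdt;  /* per-node bits */
typedef struct { local_mdt local[NNODE]; global_mdt global; } tls_system;
typedef struct { int node; int proc; } version_src;  /* node == -1 means memory */

/* Record a store at both levels (violation checks omitted in this sketch). */
void record_store(tls_system *s, int n, int p)
{
    assert(n >= 0 && n < NNODE && p >= 0 && p < NPROC);
    s->local[n].store[p] = true;
    s->global.store[n] = true;
}

/* Load by processor p of node n: find who supplies the most recent version. */
version_src hier_load(tls_system *s, int n, int p)
{
    assert(n >= 0 && n < NNODE && p >= 0 && p < NPROC);
    version_src src = { -1, -1 };
    local_mdt *l = &s->local[n];
    if (l->store[p]) { src.node = n; src.proc = p; return src; }
    l->load[p] = true;
    for (int q = p - 1; q >= 0; q--)            /* local hit: predecessor on this node */
        if (l->store[q]) { src.node = n; src.proc = q; return src; }
    s->global.load[n] = true;                   /* local miss: consult the global MDT */
    for (int m = n - 1; m >= 0; m--) {
        if (!s->global.store[m]) continue;
        for (int q = NPROC - 1; q >= 0; q--)    /* latest storing thread in node m */
            if (s->local[m].store[q]) { src.node = m; src.proc = q; return src; }
    }
    return src;                                 /* no version anywhere: memory */
}
```

If P1 of N0 has stored and P2 of N1 then loads, the local MDT of N1 finds nothing, the global MDT points to N0, and N0's local MDT names P1 as the supplier, which mirrors the four local-miss slides.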
31Write: Remote Dependence Violation
[Figure: P2 stores; the local MDT shows no local shielding]
32Write: Remote Dependence Violation
[Figure: the global MDT shows a premature load in N1]
33Write: Remote Dependence Violation
[Figure: the chunk is squashed; all of its local state is cleared and N1's state in the global MDT is cleared]
34Dynamic Chunk Mapping
- On-chip: static mapping
- Simplifies design
- Imbalance → not scalable
- Use dynamic mapping across chips
- Node may hold state from several chunks
[Figure: chip-level view (P0-P3) and system-level view (N0, N1)]
35Other Issues
- Aggressive exposed loads (hide latency of MDT access; loads may initiate squash)
- Synchronizing Global MDT ops
- Commit
- Squash
36Experimental Setup
- 4-issue dynamic superscalar processors
- CC-NUMA memory system
- Scalable TLS
- Hierarchical configuration
- Node: 4-way spec CMP
- Private L1s, shared L2
- MDT
- System: 16 procs (4×4)
- Distributed G-MDT
- Speculative Sync
- Flat configuration
- Node: single processor
- Private L1/L2
- SSU
- System: 16-64 procs
37Applications
- Scalable TLS
- Mix of serial codes
- Hard-to-analyze loops
- Identified by compiler
- On avg 50% of serial time
- Different suites
- PerfectClub (track, bdna)
- HPF-2 (dsmc3d, euler)
- Univ. Hawaii (tree)
- Results reported for loops
- Speculative Sync
- Mix of parallel codes
- Parallelization methods
- Compiler: 16p (applu)
- Annotated: 16p (mst, bisort)
- Hand: 64p (ocean, 2×barnes)
38Static X-Thread Dependences
39Scalable TLS: Summary of Results
- Scalable TLS: 5.3 avg speedup on 16 procs
- For serial codes!
- Loop unrolling usually helps
- Improved cache locality
- Per-line speculative state: poor performance
- Squashes due to false sharing
40Speedups for 16 Processors
[Figure: bar chart of speedup per application; one annotation reads 6.40 (40% eff); bar labels 41, 21, 33, 90, 32, 90]
Across-the-board speedups
41Scalable TLS: What We Learned
- Hierarchical scalable TLS works
- Good speedups → scalable
- Node abstracted away → leverages commodity CMPs
- Word-based protocol increases traffic
- Later improvement studies in Illinois
- Adaptation (Prvulovic et al., ISCA 2001)
- Late disambiguation/prediction (Cintra et al., HPCA 2002)
42Part 2
- Speculative Synchronization
- Introduction
- Implementation
43Synchronization Primitives
- Barriers, locks, flags widely used
- Parallelizing compilers: mostly full barriers
- Programmers: M4 macros, OpenMP directives
- Often placed conservatively
- Hard-to-analyze memory access patterns
- Corner cases in mostly race-free codes
- Aggressive sync not affordable
- Too time-consuming
- Too complicated
44Proposal: Speculative Synchronization
- Idea: off-load synchronizing ops from processor
- Apply TLS to speculate past active barriers, locks, flags
- Detect conflicts, roll back offending threads
- Use caches to store speculative state
- Maintain 1 or more safe threads → forward progress
- Lock: owner
- Flag: producer
- Barrier: lagging threads
- Speculative threads execute past sync points
45Important Features
- Concurrency possible even if conflicts
- All in-order safe-to-spec conflicts tolerated
- No order among spec threads → simpler HW
- No MDT
- No programming effort
- Retargeted macros/directives
- Can coexist with conventional sync at run-time
46Speculative Barrier
[Figure, four animation steps: threads A, B, and C reach a BARRIER at different times; threads past the barrier before the last arrival are marked Speculative]
50Speculative Lock
[Figure, five animation steps: threads A-E move through a critical section delimited by ACQUIRE and RELEASE; threads that proceed without owning the lock are marked Speculative]
55Speculative Synchronization Unit
- Extends cache controller
- Simple hardware
- 1 extra cache line
- 1 spec bit/line
- Some control logic
56Speculative Lock Request
- Processor side
- Program SSU for speculative lock
- Checkpoint register file
- SSU side
- Initiate T&T&S (test-and-test-and-set) loop on lock variable
- Use caches as speculative buffer (like TLS)
- Set Speculative bit in lines accessed speculatively
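For reference, the test-and-test-and-set loop that the SSU runs on the lock variable is a standard spin-lock idiom. The sketch below shows it in software with C11 atomics; it is not the SSU hardware, and the names tts_lock, tts_try_acquire, tts_acquire, and tts_release are hypothetical. In the proposal the processor would keep executing speculatively while the SSU spins, rather than waiting inside tts_acquire.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_int held;   /* 0 = free, 1 = held */
} tts_lock;

/* One round of test-and-test-and-set. */
bool tts_try_acquire(tts_lock *l)
{
    assert(l != NULL);
    /* Test: plain read, so spinning stays in the local cache. */
    if (atomic_load_explicit(&l->held, memory_order_relaxed) != 0)
        return false;
    /* Test&Set: atomic exchange, attempted only when the lock looks free. */
    return atomic_exchange_explicit(&l->held, 1, memory_order_acquire) == 0;
}

/* Conventional acquire: spin until a round succeeds. */
void tts_acquire(tts_lock *l)
{
    while (!tts_try_acquire(l)) {
        /* spin */
    }
}

/* Release is an ordinary store to the lock variable. */
void tts_release(tts_lock *l)
{
    assert(l != NULL);
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

The first read avoids generating coherence traffic on every spin iteration; only the final exchange needs exclusive ownership of the line.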
57Lock Acquire
- SSU acquires lock (T&S successful)
- Clears all Speculative bits → one-shot commit
- Becomes idle
- Release (store) later by processor
58Release while Speculative
- Processor issues release, SSU still active
- SSU intercepts release (store) by processor
- SSU toggles Release bit ("already done")
- When lock becomes available later, SSU:
- Does not perform T&S
- Clears all Speculative bits → one-shot commit
59Memory Access Conflict
- External coherence actions
- Request to safe line: service normally
- Request to spec line: squash thread
- Invalidate lines marked Speculative+Dirty → one-shot squash
- Roll back: restart at sync point
- Safe threads never squashed → forward progress
- All safe-to-spec in-order dependences tolerated
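The lock-acquire, release-while-speculative, and conflict rules can be collected into a small state model. This is a sketch under simplifying assumptions: eight cache lines, any external request to a speculative line causes a squash, the test-and-set is assumed to succeed once the lock is available, and a safe dirty line is assumed to be written back before its first speculative access. All names (ssu_model, ssu_begin, ssu_access, ssu_release, ssu_lock_available, ssu_external_request) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define NLINES 8   /* illustrative cache size, in lines */

typedef struct { bool valid, dirty, spec; } cache_line;

typedef struct {
    cache_line line[NLINES];
    bool active;    /* SSU is spinning on the lock; the thread is speculative */
    bool released;  /* Release bit: the processor already issued the release */
} ssu_model;

void ssu_begin(ssu_model *s) { s->active = true; s->released = false; }

/* Processor access; marks the line Speculative while the SSU is active. */
void ssu_access(ssu_model *s, int i, bool is_write)
{
    assert(i >= 0 && i < NLINES);
    cache_line *ln = &s->line[i];
    if (s->active && !ln->spec && ln->dirty)
        ln->dirty = false;          /* assumed write-back of the safe copy */
    ln->valid = true;
    if (is_write) ln->dirty = true;
    if (s->active) ln->spec = true;
}

/* Release by the processor; returns true if the SSU intercepted it. */
bool ssu_release(ssu_model *s)
{
    if (!s->active) return false;   /* SSU idle: ordinary store */
    s->released = true;
    return true;
}

/* Lock becomes available: one-shot commit. Returns true if the SSU
 * still has to perform the test-and-set, false if already released. */
bool ssu_lock_available(ssu_model *s)
{
    bool needs_ts = !s->released;
    for (int i = 0; i < NLINES; i++) s->line[i].spec = false;
    s->active = false;
    s->released = false;
    return needs_ts;
}

/* External coherence request for line i; returns true if it squashes the thread. */
bool ssu_external_request(ssu_model *s, int i)
{
    assert(i >= 0 && i < NLINES);
    if (!s->active || !s->line[i].spec) return false;   /* safe line: service normally */
    for (int k = 0; k < NLINES; k++) {
        cache_line *ln = &s->line[k];
        if (ln->spec && ln->dirty) { ln->valid = false; ln->dirty = false; }
        ln->spec = false;
    }
    s->released = false;            /* roll back to the sync point; SSU keeps spinning */
    return true;
}
```

Because commit and squash only clear or invalidate flagged lines, both complete in one step regardless of how much state the thread touched, which is what keeps the hardware simple.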
60Support for Multiple Locks
61Speculative Flags and Barriers
- Flag spin: Test only, no T&S
- Barrier: leverage flag spin support
- Update thread counter
- If not last one, spin on flag speculatively
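The two barrier steps (update the counter, then spin on a flag unless last) correspond to a conventional counter-plus-flag barrier. The sketch below is a plain software version with C11 atomics, using a sense value so the barrier can be reused; it does not model the speculation itself, and the names spin_barrier, spin_barrier_arrive, spin_barrier_released, and spin_barrier_wait are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    atomic_int count;   /* threads that have arrived in this episode */
    atomic_int flag;    /* release flag: sense of the last completed episode */
    int nthreads;
} spin_barrier;

void spin_barrier_init(spin_barrier *b, int nthreads)
{
    assert(b != NULL && nthreads > 0);
    atomic_init(&b->count, 0);
    atomic_init(&b->flag, 0);
    b->nthreads = nthreads;
}

/* Step 1: update the thread counter. The last arrival resets the
 * counter, sets the flag, and returns true; the others return false. */
bool spin_barrier_arrive(spin_barrier *b, int sense)
{
    if (atomic_fetch_add(&b->count, 1) + 1 == b->nthreads) {
        atomic_store(&b->count, 0);
        atomic_store(&b->flag, sense);
        return true;
    }
    return false;
}

/* Step 2: the flag spin is a test only; there is no test-and-set. */
bool spin_barrier_released(spin_barrier *b, int sense)
{
    return atomic_load(&b->flag) == sense;
}

/* Conventional wait; under Speculative Synchronization the non-last
 * threads would continue speculatively instead of idling in this loop. */
void spin_barrier_wait(spin_barrier *b, int *local_sense)
{
    *local_sense = !*local_sense;
    if (!spin_barrier_arrive(b, *local_sense)) {
        while (!spin_barrier_released(b, *local_sense)) {
            /* spin */
        }
    }
}
```

Each thread keeps its own local_sense, initialized to 0, and passes its address on every call so that consecutive barrier episodes use alternating flag values.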
62Retargeted M4 Macros
No programming effort
68Speculative Sync: Summary of Results
- Average sync time reduction: 40%
- Promising for such simple hardware
- Execution time reduction: up to 15%, avg 7.5%
69Execution Time Reduction
[Figure: normalized execution time per application]
Across-the-board reduction
70Sync Time Reduction
Large reduction: 40%
Room for improvement
71Speculative Sync: What We Learned
- Speculative Synchronization very effective
- Promising speedups
- TLS's forward progress guarantee
- Critical path not affected
- Speculative buffer overflow simply stalls
- Simple hardware
- No programming effort
- Room for improvement
- WAR, WAW dependences
- False sharing
72Overall Conclusions
- TLS very promising emerging technology
- Contributions to TLS
- Built hierarchical scalable TLS
- Used commodity spec CMPs as building blocks
- Improved synchronized parallel codes
- TLS's safe-thread mechanism effective