Speculative Shared-Memory Architectures

Slides: 76

Provided by: josfernan7

1
Speculative Shared-Memory Architectures

ECE 572 Parallel Computer Architecture
Spring 2004

2
Problem 1: Conservative Parallelization

No parallelization unless 100% safe
Hard-to-analyze access patterns
Subscripted array subscripts
Pointer accesses
Corner cases in mostly parallel codes

for (i = 0; i < n; i++) { ... a[b[i]] ...; ... a[c[i]] ...; }
[Figure label: RAW dependence between iterations]
3
Problem 2: Conservative Synchronization

Synchronization in parallel codes is conservative
Hard-to-analyze access patterns
Corner cases in mostly race-free codes
Aggressive sync not affordable
Too time-consuming
Too complicated

[Figure labels: BARRIER, OPTIMIZED]
4-6
Thread-level Speculation

Execute hard-to-analyze codes speculatively in parallel
Assume no dependences and execute in parallel
Track memory accesses, detect violations
Squash and restart offending threads

for (i = 0; i < n; i++) { ... a[b[i]] ...; ... a[c[i]] ...; }
[Figure labels added on the later builds: RAW; a[5], a[6]]
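
The four steps above can be sketched in software. This is a minimal model, not the hardware scheme from the talk. It assumes, purely for illustration, that iteration i loads a[b[i]] and then stores a[c[i]] (the transcript does not say which access is the load), and that each thread runs one iteration. The names run_sequential and run_speculative are hypothetical.

```python
def run_sequential(a, b, c):
    """Reference result: iteration i loads a[b[i]], then stores a[c[i]]."""
    a = list(a)
    for i in range(len(b)):
        a[c[i]] = a[b[i]] + 1
    return a


def run_speculative(a, b, c, order):
    """Run the iterations in `order`, squashing threads on RAW violations.

    Returns (final array, number of squashed thread executions).
    """
    reads = {}    # iteration -> location it loaded
    writes = {}   # iteration -> (location, value) of its buffered store
    pending = list(order)
    squashed = 0
    while pending:
        i = pending.pop(0)
        loc = b[i]
        # Load: use the closest predecessor's buffered store if there is one,
        # otherwise the value in (non-speculative) memory.
        value = a[loc]
        for j in range(i - 1, -1, -1):
            if j in writes and writes[j][0] == loc:
                value = writes[j][1]
                break
        reads[i] = loc
        writes[i] = (c[i], value + 1)
        # Store: any successor that already loaded this location read too
        # early. This check is conservative (it ignores write shielding).
        offenders = [k for k in reads if k > i and reads[k] == c[i]]
        if offenders:
            first = min(offenders)
            victims = sorted(k for k in reads if k >= first)
            for k in victims:          # squash offender and its successors
                del reads[k], writes[k]
            squashed += len(victims)
            pending = victims + pending  # restart them
    # Commit the buffered stores in program order.
    result = list(a)
    for i in sorted(writes):
        loc, value = writes[i]
        result[loc] = value
    return result, squashed
```

Running the iterations in reverse order on a loop with a dependence chain (b = [0, 1, 2, 3], c = [1, 2, 3, 4]) forces squashes, yet the committed result matches the sequential one. With independent accesses no thread is squashed.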
7
Two TLS Extensions

Scalable TLS (Cintra et al., ISCA 2000)
Scalable → larger speedups
Hierarchical → built out of (future) commodity spec CMPs
Serial codes sped up 5.3× on avg for 16 processors

Speculative Synchronization (Martínez and Torrellas, ASPLOS 2002)
Apply TLS to explicitly parallel codes
Speculatively execute past active barriers, locks, flags
Use TLS's safe thread to guarantee forward progress
40% avg sync time reduction

8
Part 1

Scalable TLS
Introduction
Implementation

9
Thread-Level Speculation (TLS)

Execute hard-to-analyze codes speculatively in parallel
Track memory accesses, detect violations
Squash and restart offending threads

Keep safe thread (earliest) at all times
Use caches to buffer speculative state
If overflow, stall until safe
Typical speculative loop parallelization
Threads = loop iterations

10
Speculative Chip-Multiprocessor (CMP)
[Figure: processors P0-P3, each with a private L1, sharing an L2 that holds the Memory Disambiguation Table (MDT). Source: Krishnan and Torrellas]
11
Memory Disambiguation Table (MDT)
[Figure: MDT entry layout: a TAG, then for each word (WORD 0, WORD 1) one load (L) bit and one store (S) bit per processor P0-P3]
12
Speculative Memory Accesses
Speculative order: P0 < P1 < P2 < P3
P2 has loaded value from memory
[Figure: MDT entry with load (L) and store (S) bits for P0-P3]
13-14
Write Dependence Violation
P1 writes: P2's load was premature
P2 and successors squashed
Clear P2's state
Record P1's store
[Figure: MDT entry with load (L) and store (S) bits for P0-P3]
15
Write Shielding
P0 writes
Store bit for P1 shields P2
Store bit for P0 is set
[Figure: MDT entry with load (L) and store (S) bits for P0-P3]
16-17
Read Value Forwarding
P2 reads
Most recent version in P1
MDT forwards request
P1 delivers data
Record P2's load
[Figure: MDT entry with load (L) and store (S) bits for P0-P3]
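
The MDT walkthrough on slides 12-17 can be summarized as a small model. This is a sketch under assumptions, not the actual hardware: one entry per word, one load bit and one store bit per processor, processors ordered P0 < P1 < P2 < P3, and a load bit that is set only for loads not preceded by the same processor's own store. The class and method names are hypothetical.

```python
class MDT:
    """Toy Memory Disambiguation Table for one speculative CMP."""

    def __init__(self, nprocs=4):
        self.nprocs = nprocs
        self.load = {}    # word address -> list of load bits, one per processor
        self.store = {}   # word address -> list of store bits, one per processor

    def _bits(self, table, addr):
        return table.setdefault(addr, [False] * self.nprocs)

    def on_load(self, p, addr):
        """Return the processor that supplies the data, or None for memory."""
        store = self._bits(self.store, addr)
        if store[p]:
            return p                      # own version, nothing to record
        self._bits(self.load, addr)[p] = True   # record the exposed load
        for q in range(p - 1, -1, -1):    # closest predecessor with a version
            if store[q]:
                return q                  # read value forwarding
        return None

    def on_store(self, p, addr):
        """Record the store; return the list of squashed processors."""
        load = self._bits(self.load, addr)
        store = self._bits(self.store, addr)
        store[p] = True
        for q in range(p + 1, self.nprocs):
            if load[q]:                   # premature load: dependence violation
                victims = list(range(q, self.nprocs))
                for table in (self.load, self.store):
                    for bits in table.values():
                        for v in victims:
                            bits[v] = False   # clear the squashed threads' state
                return victims
            if store[q]:                  # write shielding by q's own version
                break
        return []
```

Replaying the slides: P2 loads from memory, P1's store then squashes P2 and P3, P2's re-executed load is forwarded from P1, and a later store by P0 is shielded by P1's store bit.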
18
Scalable TLS

How can we make TLS scalable?
Can we leverage speculative CMPs?

19
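Attempt 1: Isolated Mapping
[Figure: two chips, each with its own MDT, over memory (MEM). Threads T1 < T2 < T3 < T4 on one chip and T5 < T6 < T7 < T8 on the other; the link between the two orderings is marked "?"]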
20
Attempt 2: Flat TLS System
[Figure: threads T1-T8, each with a private L1, over two L2s. The per-chip MDTs are marked "?"; a Global MDT sits in front of memory (MEM). Label: Speculative]
21
Attempt 3: Use Thread Chunks
[Figure: two chunks, T1-4 < T5-8, each on its own L2, ordered by an MDT in front of memory (MEM). Label: Speculative]
22
Result: Hierarchical TLS System
Total thread order
[Figure: threads T1-T8, each with a private L1; one MDT and one L2 per chip; a further MDT in front of memory (MEM). Label: Speculative]
23
Scalable Hierarchical TLS System
[Figure: nodes, each with a directory (DIR), an MDT, and memory (MEM)]
24
Speculative Memory Accesses
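Within each node: P0 < P1 < P2 < P3
Across nodes: N0 < N1
[Figure: two nodes N0 and N1, each with processors P0-P3 and a local MDT holding load (L) and store (S) bits per processor; a global MDT holds L and S bits per node]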
25-26
Read Value Forwarding: Local Hit
P2 reads
Most recent version local
[Figure: nodes N0 and N1 with local MDTs (L and S bits for P0-P3) and the global MDT (L and S bits for N0, N1); the second build shows P1 and P2]
27-30
Read Value Forwarding: Local Miss
P2 reads
No local version
Most recent version in N0
To be supplied by P1
[Figure: nodes N0 and N1 with local MDTs (L and S bits for P0-P3) and the global MDT (L and S bits for N0, N1); the last build shows P1 and P2]
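
The local-hit and local-miss cases can be expressed as a two-level lookup. The sketch below is an illustration under assumptions: it tracks store bits only, with one list of per-processor bits per node (local MDTs) and one list of per-node bits (global MDT), and it assumes a global store bit is set only when some local store bit in that node is set. The function name locate_version and the data layout are hypothetical.

```python
PROCS_PER_NODE = 4


def locate_version(local_store, global_store, node, proc, addr):
    """Find who supplies `addr` to processor `proc` of node `node`.

    local_store:  one dict per node, addr -> store bits per processor
    global_store: dict, addr -> store bits per node
    Returns (kind, supplier_node, supplier_proc).
    """
    # Level 1: local MDT. Own version or closest local predecessor.
    local_bits = local_store[node].get(addr, [False] * PROCS_PER_NODE)
    for q in range(proc, -1, -1):
        if local_bits[q]:
            return ("local hit", node, q)
    # Level 2: global MDT. Closest predecessor node holding a version.
    node_bits = global_store.get(addr, [False] * len(local_store))
    for m in range(node - 1, -1, -1):
        if node_bits[m]:
            remote_bits = local_store[m][addr]
            # Every processor of an earlier node precedes the requester,
            # so the latest writer in that node supplies the data.
            latest = max(q for q in range(PROCS_PER_NODE) if remote_bits[q])
            return ("local miss", m, latest)
    return ("memory", None, None)
```

With P1 of N0 as the only writer of a word, a read by P2 of N0 is a local hit on P1, while a read by P2 of N1 misses locally and is supplied by P1 of N0 through the global MDT.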
31-33
Write: Remote Dependence Violation
No local shielding
Premature load in N1
Squash chunk
Clear all local state
Clear N1 state
[Figure: P2; nodes N0 and N1 with local MDTs (L and S bits for P0-P3) and the global MDT (L and S bits for N0, N1)]
34
Dynamic Chunk Mapping

On-chip static mapping
Simplifies design
Imbalance → not scalable
Use dynamic mapping across chips
Node may hold state from several chunks

[Figure: mapping at two levels. Chip: processors P0-P3. System: nodes N0, N1, with N1 appearing alongside processor labels]
35
Other Issues

Aggressive exposed loads (hide latency of MDT access; loads may initiate squash)
Synchronizing Global MDT ops
Commit
Squash

36
Experimental Setup

4-issue dynamic superscalar processors
CC-NUMA memory system

Scalable TLS
Hierarchical configuration
Node: 4-way spec CMP
Private L1s, shared L2
MDT
System: 16 procs (4×4)
Distributed G-MDT

Speculative Sync
Flat configuration
Node: single processor
Private L1 and L2
SSU
System: 16-64 procs

37
Applications

Scalable TLS
Mix of serial codes
Hard-to-analyze loops
Identified by compiler
50% avg of serial time
Different suites
PerfectClub (track, bdna)
HPF-2 (dsmc3d, euler)
Univ. Hawaii (tree)
Results reported for loops

Speculative Sync
Mix of parallel codes
Parallelization methods
Compiler: 16p (applu)
Annotated: 16p (mst, bisort)
Hand: 64p (ocean, 2×barnes)

38
Static X-Thread Dependences
39
Scalable TLS: Summary of Results

Scalable TLS: 5.3× avg speedup on 16 procs
For serial codes!
Loop unrolling usually helps
Improved cache locality
Per-line speculative state: poor performance
Squashes due to false sharing

40
Speedups for 16 Processors
[Figure: bar chart, y-axis Speedup, one bar per application. Annotation: 6.40 (40% eff). Bar labels: 41, 21, 33, 90, 32, 90]
Across-the-board speedups
41
Scalable TLS: What We Learned

Hierarchical scalable TLS works
Good speedups → scalable
Node abstracted away → leverages commodity CMPs
Word-based protocol increases traffic
Later improvement studies at Illinois
Adaptation (Prvulovic et al., ISCA 2001)
Late disambiguation/prediction (Cintra et al., HPCA 2002)

42
Part 2

Speculative Synchronization
Introduction
Implementation

43
Synchronization Primitives

Barriers, locks, flags widely used
Parallelizing compilers: mostly full barriers
Programmers: M4 macros, OpenMP directives
Often placed conservatively
Hard-to-analyze memory access patterns
Corner cases in mostly race-free codes
Aggressive sync not affordable
Too time-consuming
Too complicated

44
Proposal: Speculative Synchronization

Idea: off-load synchronizing ops from processor
Apply TLS to speculate past active barriers, locks, flags
Detect conflicts, roll back offending threads
Use caches to store speculative state
Maintain 1 or more safe threads → forward progress
Lock: owner
Flag: producer
Barrier: lagging threads
Speculative threads execute past sync points

45
Important Features

Concurrency possible even if conflicts
All in-order safe-to-spec conflicts tolerated
No order among spec threads → simpler HW
No MDT
No programming effort
Retargeted macros/directives
Can coexist with conventional sync at run-time

46-49
Speculative Barrier
[Figure sequence, four builds: threads A, B, C and a BARRIER. Build by build, more threads are drawn past the barrier, labeled Speculative, until all three are past it]
50-54
Speculative Lock
[Figure sequence, five builds: threads A-E around a critical section delimited by ACQUIRE and RELEASE. Build by build, threads move past ACQUIRE and then past RELEASE; label: Speculative]
55
Speculative Synchronization Unit

Extends cache controller
Simple hardware
1 extra cache line
1 spec bit/line
Some control logic

56
Speculative Lock Request

Processor side
Program SSU for speculative lock
Checkpoint register file
SSU side
Initiate T&T&S loop on lock variable
Use caches as speculative buffer (like TLS)
Set Speculative bit in lines accessed speculatively

57
Lock Acquire

SSU acquires lock (T&S successful)
Clears all Speculative bits → one-shot commit
Becomes idle
Release (store) later by processor

58
Release while Speculative

Processor issues release, SSU still active
SSU intercepts release (store) by processor
SSU toggles Release bit: "already done"
When lock becomes available later, the SSU:
Does not perform T&S
Clears all Speculative bits → one-shot commit

59
Memory Access Conflict

External coherence actions
Request to safe line: service normally
Request to spec line: squash thread
Invalidate lines marked Speculative and Dirty → one-shot squash
Roll back: restart at sync point
Safe threads never squashed ? forward progress
All safe-to-spec in-order dependences tolerated

60
Support for Multiple Locks
61
Speculative Flags and Barriers

Flag: spin with Test only, no T&S
Barrier: leverage flag spin support
Update thread counter
If not last one, spin on flag speculatively
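
The barrier steps above can be sketched as follows. This is a minimal illustration under assumptions: conflicts and squashes are left out, the flag is a boolean, and the class name SpecBarrier is hypothetical.

```python
class SpecBarrier:
    """Toy speculative barrier: early arrivals run ahead speculatively."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0            # thread counter
        self.flag = False         # barrier flag the early threads spin on
        self.speculative = set()  # threads executing past the barrier

    def arrive(self, tid):
        """Return the threads that commit as a result of this arrival."""
        self.count += 1                    # update thread counter
        if self.count < self.nthreads:
            self.speculative.add(tid)      # not last: spin on flag speculatively
            return []
        self.flag = True                   # last arrival toggles the flag
        committed = sorted(self.speculative)
        self.speculative.clear()           # speculative threads become safe
        return committed
```

With three threads, the first two arrivals continue speculatively and both commit when the third (lagging, and therefore safe) thread arrives.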

62
Retargeted M4 Macros
No programming effort
68
Speculative Sync: Summary of Results

Average sync time reduction: 40%
Promising for such simple hardware
Execution time reduction: up to 15%, avg 7.5%

69
Execution Time Reduction
Normalized Exec Time
Across-the-board reduction
70
Sync Time Reduction
Large reduction: 40%
Room for improvement
71
Speculative Sync: What We Learned