Speculative Shared-Memory Architectures

Provided by: josfernan7
Transcript and Presenter's Notes

1
Speculative Shared-Memory Architectures
  • ECE 572 Parallel Computer Architecture
  • Spring 2004

2
Problem 1: Conservative Parallelization
  • No parallelization unless 100% safe
  • Hard-to-analyze access patterns
  • Subscripted array subscripts
  • Pointer accesses
  • Corner cases in mostly parallel codes

RAW
for (i = 0; i < n; i++) { ... a[b[i]] ... a[c[i]] ... }
3
Problem 2: Conservative Synchronization
  • Synchronization in parallel codes conservative
  • Hard-to-analyze access patterns
  • Corner cases in mostly race-free codes
  • Aggressive sync not affordable
  • Too time consuming
  • Too complicated

BARRIER
OPTIMIZED
4
Thread-level Speculation
  • Execute speculatively in parallel hard-to-analyze
    codes
  • Assume no dependences and execute in parallel
  • Track memory accesses, detect violations
  • Squash and restart offending threads

for (i = 0; i < n; i++) { ... a[b[i]] ... a[c[i]] ... }
5
Thread-level Speculation
  • Execute speculatively in parallel hard-to-analyze
    codes
  • Assume no dependences and execute in parallel
  • Track memory accesses, detect violations
  • Squash and restart offending threads

for (i = 0; i < n; i++) { ... a[b[i]] ... a[c[i]] ... }
RAW
6
Thread-level Speculation
  • Execute speculatively in parallel hard-to-analyze
    codes
  • Assume no dependences and execute in parallel
  • Track memory accesses, detect violations
  • Squash and restart offending threads

for (i = 0; i < n; i++) { ... a[b[i]] ... a[c[i]] ... }
RAW
a[5] a[6]
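The speculate, track, detect, squash, restart cycle on these slides can be sketched in software. This is a minimal sequential simulation, not the hardware mechanism: it assumes each iteration reads a[b[i]] and writes a[c[i]] (the slide does not show which access is the read), and the names run_sequential, run_tls, and schedule are illustrative. It squashes conservatively, without write shielding.

```python
def run_sequential(a, b, c, f):
    """Reference result: iterations executed one after another."""
    a = list(a)
    for i in range(len(b)):
        a[c[i]] = f(a[b[i]])
    return a


def run_tls(a, b, c, f, schedule):
    """Run iterations in `schedule` order; return (final array, squash events).

    Assumption: `schedule` lists every iteration index exactly once.
    """
    mem = list(a)
    reads = {}    # iteration -> address it loaded speculatively
    writes = {}   # iteration -> (address, value) buffered speculatively
    pending = list(schedule)
    squashes = 0
    while pending:
        i = pending.pop(0)
        # Load: most recent version from an earlier iteration, else memory.
        value = mem[b[i]]
        for j in range(i - 1, -1, -1):
            if j in writes and writes[j][0] == b[i]:
                value = writes[j][1]
                break
        reads[i] = b[i]
        writes[i] = (c[i], f(value))
        # Store: a later iteration that already loaded this address read too
        # early (RAW violation). Squash it and every later executed iteration.
        victims = [k for k in reads if k > i and reads[k] == c[i]]
        if victims:
            squashes += 1
            redo = sorted(k for k in reads if k >= min(victims))
            for k in redo:
                del reads[k], writes[k]
            pending = redo + pending
    # Commit the buffered state in loop order.
    for i in sorted(writes):
        addr, value = writes[i]
        mem[addr] = value
    return mem, squashes
```

With a = [1, 2, 3, 4, 5, 6, 7, 8], b = [0, 1, 5, 3], c = [4, 5, 6, 7], iteration 2 reads the element that iteration 1 writes, so running iteration 2 first forces one squash, while the in-order schedule forces none. Both schedules end with the sequential result.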
7
Two TLS Extensions
  • Scalable TLS [Cintra et al., ISCA'00]
  • Scalable ⇒ larger speedups
  • Hierarchical ⇒ built out of (future) commodity
    spec CMPs
  • Serial codes sped up 5.3× on avg for 16 processors
  • Speculative Synchronization [Martínez and
    Torrellas, ASPLOS'02]
  • Apply TLS to explicitly parallel codes
  • Speculatively execute past active barriers,
    locks, flags
  • Use TLS's safe thread to guarantee forward
    progress
  • 40% avg sync time reduction

8
Part 1
  • Scalable TLS
  • Introduction
  • Implementation

9
Thread-Level Speculation (TLS)
  • Execute speculatively in parallel hard-to-analyze
    codes
  • Track memory accesses, detect violations
  • Squash and restart offending threads
  • Keep safe thread (earliest) at all times
  • Use caches to buffer speculative state
  • If overflow, stall until safe
  • Typical speculative loop parallelization
  • Threads = loop iterations

10
Speculative Chip-Multiprocessor (CMP)
[Figure: processors P0-P3, each with a private L1, sharing an L2 and a Memory Disambiguation Table (MDT); credit: Krishnan & Torrellas]
11
Memory Disambiguation Table (MDT)
[Figure: MDT entry layout: a TAG, then for each word (WORD 0, WORD 1) a row of Load (L) bits and a row of Store (S) bits, one bit per processor P0-P3]
12
Speculative Memory Accesses
Thread order: P0 < P1 < P2 < P3
P2 has loaded value from memory
[Figure: processors P0-P3 and an MDT entry with L and S bits for P0-P3]
13
Write Dependence Violation
P2's load was premature
[Figure: P1 and an MDT entry with L and S bits for P0-P3]
14
Write Dependence Violation
P2 and successors squashed
Clear P2's state
Record P1's store
[Figure: P2, P3 and an MDT entry with L and S bits for P0-P3]
15
Write Shielding
Store bit for P1 shields P2
Store bit for P0 is set
[Figure: P0 and an MDT entry with L and S bits for P0-P3]
16
Read Value Forwarding
Most recent version in P1
[Figure: P2 and an MDT entry with L and S bits for P0-P3]
17
Read Value Forwarding
P1 delivers data
Record P2's load
MDT forwards request
[Figure: P1, P2 and an MDT entry with L and S bits for P0-P3]
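The MDT walk-through above (premature load, squash, write shielding, read value forwarding) can be modeled for a single word as follows. The slides do not spell out the exact bit-update rules, so the rules here are assumptions: a load bit is set only for a load the thread did not satisfy from its own store, and a squash clears the bits of the squashed processors. The class and method names are illustrative.

```python
class MDTEntry:
    """Load/Store bits for one word, one bit per processor.

    Processor index doubles as thread order: P0 < P1 < P2 < P3.
    """

    def __init__(self, nprocs=4):
        self.load = [False] * nprocs
        self.store = [False] * nprocs

    def on_load(self, p):
        """Return the processor that supplies the value, or None for memory."""
        if self.store[p]:
            return p                      # own version; load is not exposed
        self.load[p] = True               # record the exposed load
        for q in range(p - 1, -1, -1):
            if self.store[q]:
                return q                  # read value forwarding
        return None                       # no speculative version exists

    def on_store(self, p):
        """Record the store; return the processors that must be squashed."""
        self.store[p] = True
        for q in range(p + 1, len(self.load)):
            if self.load[q]:
                # q loaded too early: squash q and all its successors.
                squashed = list(range(q, len(self.load)))
                for s in squashed:
                    self.load[s] = self.store[s] = False
                return squashed
            if self.store[q]:
                break                     # write shielding by q's store
        return []
```

Replaying the slides: P2 loads from memory, P1's store squashes P2 and P3, P2's retried load is forwarded from P1, and a later store by P0 squashes nobody because P1's store bit shields P2.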
18
Scalable TLS
  • How can we make TLS scalable?
  • Can we leverage speculative CMPs?

19
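Attempt 1: Isolated Mapping
T1 < T2 < T3 < T4
T5 < T6 < T7 < T8
[Figure: two groups of threads, each with its own MDT, above memory (MEM); a "?" marks the relation between the two groups]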
20
Attempt 2: Flat TLS System
[Figure: threads T1-T8, each with a private L1; two L2s, each with an MDT marked "?"; a Global MDT and memory (MEM); region labeled "Speculative"]
21
Attempt 3: Use Thread Chunks
T1-4 < T5-8
[Figure: chunks T1-4 and T5-8, each above an L2; one MDT and memory (MEM); region labeled "Speculative"]
22
Result: Hierarchical TLS System
Total thread order
[Figure: threads T1-T8, each with a private L1; two L2s, each with an MDT; a further MDT and memory (MEM); region labeled "Speculative"]
23
Scalable Hierarchical TLS System
[Figure: two nodes, each with a directory (DIR), an MDT, and memory (MEM)]
24
Speculative Memory Accesses
Thread order within each node: P0 < P1 < P2 < P3
Node order: N0 < N1
[Figure: nodes N0 and N1, each with processors P0-P3 and a local MDT (L and S bits for P0-P3); a global MDT with L and S bits for N0, N1]
25
Read Value Forwarding: Local Hit
Most recent version local
[Figure: P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
26
Read Value Forwarding: Local Hit
[Figure: P1, P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
27
Read Value Forwarding: Local Miss
No local version
[Figure: P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
28
Read Value Forwarding: Local Miss
Most recent version in N0
[Figure: local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
29
Read Value Forwarding: Local Miss
To be supplied by P1
[Figure: local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
30
Read Value Forwarding: Local Miss
[Figure: P1, P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
31
Write: Remote Dependence Violation
No local shielding
[Figure: P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
32
Write: Remote Dependence Violation
Premature load in N1
[Figure: local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
33
Write: Remote Dependence Violation
Squash chunk
Clear all local state
Clear N1 state
[Figure: P2; local MDTs of N0 and N1 (L and S bits for P0-P3); global MDT (L and S bits for N0, N1)]
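The two-level lookup in the slides above (local hit, local miss resolved through the global MDT, and a remote violation that squashes a whole chunk) can be sketched for one word as below. This is an illustrative model under assumptions, not the published protocol: the global MDT keeps one Load and one Store bit per node, a node's global Load bit is set on any local miss, and a squash clears the victim's bits. All names are hypothetical.

```python
class HierarchicalMDT:
    """One word tracked by per-node local MDTs plus a global MDT.

    Order: node index first (N0 < N1), then processor index (P0 < ... < P3).
    """

    def __init__(self, nnodes=2, nprocs=4):
        self.nnodes, self.nprocs = nnodes, nprocs
        self.l_load = [[False] * nprocs for _ in range(nnodes)]
        self.l_store = [[False] * nprocs for _ in range(nnodes)]
        self.g_load = [False] * nnodes
        self.g_store = [False] * nnodes

    def load(self, n, p):
        """Return (where, node, proc) for the version that satisfies the load."""
        if self.l_store[n][p]:
            return ("local", n, p)            # own version
        self.l_load[n][p] = True
        for q in range(p - 1, -1, -1):
            if self.l_store[n][q]:
                return ("local", n, q)        # local hit
        self.g_load[n] = True                 # local miss: ask the global MDT
        for m in range(n - 1, -1, -1):
            if self.g_store[m]:
                q = max(i for i in range(self.nprocs) if self.l_store[m][i])
                return ("remote", m, q)       # latest writer in that node
        return ("memory", None, None)

    def store(self, n, p):
        """Record the store; return the (node, proc) pairs that are squashed."""
        self.l_store[n][p] = True
        self.g_store[n] = True
        for q in range(p + 1, self.nprocs):
            if self.l_load[n][q]:
                return self._squash(n, q)     # premature load on this node
            if self.l_store[n][q]:
                return []                     # local shielding
        for m in range(n + 1, self.nnodes):
            if self.g_load[m]:
                return self._squash(m, 0)     # premature load in later chunk
            if self.g_store[m]:
                return []                     # shielded by that chunk
        return []

    def _squash(self, n, p):
        """Clear state for processors p.. of node n and for all later nodes."""
        victims = []
        for m in range(n, self.nnodes):
            first = p if m == n else 0
            for q in range(first, self.nprocs):
                self.l_load[m][q] = self.l_store[m][q] = False
                victims.append((m, q))
            self.g_store[m] = any(self.l_store[m])
            if first == 0:
                self.g_load[m] = False        # whole chunk squashed
        return victims
```

Replaying the slides with this model: after P1 on N0 stores, a load by P2 on N0 is a local hit, a load by P2 on N1 is a local miss supplied by P1 on N0, and a later store by P2 on N0 finds no local shielding and squashes the N1 chunk.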
34
Dynamic Chunk Mapping
  • On-chip static mapping
  • Simplifies design
  • Imbalance ⇒ not scalable
  • Use dynamic mapping across chips
  • Node may hold state from several chunks

[Figure: mapping at two levels: "Chip" (processors P0-P3) and "System" (nodes N0, N1)]
35
Other Issues
  • Aggressive exposed loads (hide latency of MDT
    access; loads may initiate squash)
  • Synchronizing Global MDT ops
  • Commit
  • Squash

36
Experimental Setup
  • 4-issue dynamic superscalar processors
  • CC-NUMA memory system
  • Scalable TLS:
  • Hierarchical configuration
  • Node: 4-way spec CMP
  • Private L1s, shared L2
  • MDT
  • System: 16 procs (4×4)
  • Distributed G-MDT
  • Speculative Sync:
  • Flat configuration
  • Node: single processor
  • Private L1+L2
  • SSU
  • System: 16-64 procs

37
Applications
  • Scalable TLS:
  • Mix of serial codes
  • Hard-to-analyze loops
  • Identified by compiler
  • 50% avg serial time
  • Different suites
  • PerfectClub (track, bdna)
  • HPF-2 (dsmc3d, euler)
  • Univ. Hawaii (tree)
  • Results reported for loops
  • Speculative Sync:
  • Mix of parallel codes
  • Parallelization methods
  • Compiler: 16p (applu)
  • Annotated: 16p (mst, bisort)
  • Hand: 64p (ocean, 2×barnes)

38
Static X-Thread Dependences
39
Scalable TLS: Summary of Results
  • Scalable TLS: 5.3 avg speedup on 16 procs
  • For serial codes!
  • Loop unrolling usually helps
  • Improved cache locality
  • Per-line speculative state: poor performance
  • Squashes due to false sharing
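The last two bullets can be illustrated with a small count of detected conflicts at two tracking granularities. This is only a sketch: the 4-word line size and the function name conflicts are assumptions chosen for the example, and addresses are word indices.

```python
WORDS_PER_LINE = 4  # hypothetical cache-line size, in words


def conflicts(successor_reads, predecessor_writes, granularity):
    """Count successor reads that collide with predecessor writes.

    `granularity` is the tracking unit in words: 1 for per-word state,
    WORDS_PER_LINE for per-line state.
    """
    written = {addr // granularity for addr in predecessor_writes}
    return sum(1 for addr in successor_reads if addr // granularity in written)
```

If a predecessor writes words 0 and 4 while a successor reads words 1, 5, and 8, per-word tracking reports no conflict, while per-line tracking reports two. Those two are false sharing: they would squash the successor although no true dependence exists.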

40
Speedups for 16 Processors
[Figure: speedup chart (axis: Speedup); annotations: ? 6.40 (40% eff); 41, 21, 33, 90, 32, 90]
Across-the-board speedups
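If the "6.40 (40 eff)" annotation is read as a speedup of 6.40 at 40% parallel efficiency, that matches the usual definition of efficiency as speedup divided by processor count. The helper below is a sketch of that arithmetic. The reading of the annotation is an assumption, and the function name is illustrative.

```python
def efficiency(speedup, procs):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / procs
```

On 16 processors, a speedup of 6.40 corresponds to an efficiency of 0.40, and the 5.3 average speedup reported for Scalable TLS corresponds to roughly 0.33.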
41
Scalable TLS: What We Learned
  • Hierarchical scalable TLS works
  • Good speedups ⇒ scalable
  • Node abstracted away ⇒ leverages commodity CMPs
  • Word-based protocol increases traffic
  • Later improvement studies in Illinois
  • Adaptation (Prvulovic et al., ISCA 2001)
  • Late disambiguation/prediction (Cintra et al.,
    HPCA 2002)

42
Part 2
  • Speculative Synchronization
  • Introduction
  • Implementation

43
Synchronization Primitives
  • Barriers, locks, flags widely used
  • Parallelizing compilers: mostly full barriers
  • Programmers: M4 macros, OpenMP directives
  • Often placed conservatively
  • Hard-to-analyze memory access patterns
  • Corner cases in mostly race-free codes
  • Aggressive sync not affordable
  • Too time-consuming
  • Too complicated

44
Proposal: Speculative Synchronization
  • Idea: off-load synchronizing ops from processor
  • Apply TLS to speculate past active barriers,
    locks, flags
  • Detect conflicts, roll back offending threads
  • Use caches to store speculative state
  • Maintain 1 or more safe threads ⇒ forward
    progress
  • Lock: owner
  • Flag: producer
  • Barrier: lagging threads
  • Speculative threads execute past sync points

45
Important Features
  • Concurrency possible even if conflicts
  • All in-order safe-to-spec conflicts tolerated
  • No order among spec threads ⇒ simpler HW
  • No MDT
  • No programming effort
  • Retargeted macros/directives
  • Can coexist with conventional sync at run-time

46
Speculative Barrier
[Figure: threads A, B, C and a BARRIER; label order on slide: A, B, C, BARRIER; speculative region marked]
47
Speculative Barrier
[Figure: threads A, B, C and a BARRIER; label order on slide: A, B, BARRIER, C; speculative region marked]
48
Speculative Barrier
[Figure: threads A, B, C and a BARRIER; label order on slide: B, BARRIER, A, C; speculative region marked]
49
Speculative Barrier
[Figure: threads A, B, C and a BARRIER; label order on slide: BARRIER, A, B, C; speculative region marked]
50
Speculative Lock
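[Figure: threads A-E and a critical section (ACQUIRE, RELEASE); label order on slide: A, B, C, D, E, ACQUIRE, RELEASE; speculative region marked]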
51
Speculative Lock
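[Figure: threads A-E and a critical section (ACQUIRE, RELEASE); label order on slide: B, C, D, E, ACQUIRE, A, RELEASE; speculative region marked]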
52
Speculative Lock
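[Figure: threads A-E and a critical section (ACQUIRE, RELEASE); label order on slide: C, D, ACQUIRE, A, B, E, RELEASE; speculative region marked]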
53
Speculative Lock
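[Figure: threads A-E and a critical section (ACQUIRE, RELEASE); label order on slide: D, ACQUIRE, A, B, C, RELEASE, E; speculative region marked]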
54
Speculative Lock
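[Figure: threads A-E and a critical section (ACQUIRE, RELEASE); label order on slide: D, ACQUIRE, B, C, RELEASE, A, E; speculative region marked]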
55
Speculative Synchronization Unit
  • Extends cache controller
  • Simple hardware
  • 1 extra cache line
  • 1 spec bit/line
  • Some control logic

56
Speculative Lock Request
  • Processor side
  • Program SSU for speculative lock
  • Checkpoint register file
  • SSU side
  • Initiate T&T&S loop on lock variable
  • Use caches as speculative buffer (like TLS)
  • Set Speculative bit in lines accessed
    speculatively

57
Lock Acquire
  • SSU acquires lock (T&S successful)
  • Clears all Speculative bits ⇒ one-shot commit
  • Becomes idle
  • Release (store) later by processor

58
Release while Speculative
  • Processor issues release, SSU still active
  • SSU intercepts release (store) by processor
  • SSU toggles Release bit: already done
  • When lock becomes available later, SSU:
  • Does not perform T&S
  • Clears all Speculative bits ⇒ one-shot commit

59
Memory Access Conflict
  • External coherence actions
  • Request to safe line: service normally
  • Request to spec line: squash thread
  • Invalidate lines marked Speculative+Dirty ⇒
    one-shot squash
  • Roll back: restart at sync point
  • Safe threads never squashed ⇒ forward progress
  • All safe-to-spec in-order dependences tolerated
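The speculative lock behavior described on the last four slides (lock request, acquire, release while speculative, and conflict) can be summarized as a small state machine. This is a behavioral sketch under assumptions, not the SSU hardware: cache lines are modeled as plain keys, the register checkpoint is implicit, and all class, method, and return-string names are illustrative.

```python
class SSU:
    """Toy model of a Speculative Synchronization Unit for one lock."""

    def __init__(self):
        self.active = False       # SSU is spinning on the lock variable
        self.release_bit = False  # processor already issued its release
        self.spec_lines = set()   # lines with the Speculative bit set

    def request_lock(self):
        """Processor programs the SSU and runs ahead speculatively."""
        self.active = True
        self.release_bit = False
        self.spec_lines = set()

    def access(self, line):
        """Processor touches a cache line; mark it if speculative."""
        if self.active:
            self.spec_lines.add(line)

    def release(self):
        """Processor's release store; True if it should reach memory."""
        if self.active:
            self.release_bit = True   # intercepted by the SSU
            return False
        return True

    def lock_available(self, ts_succeeds=True):
        """The spin loop sees the lock free."""
        if not self.active:
            return "idle"
        if self.release_bit:
            outcome = "commit without T&S"   # critical section already done
        elif ts_succeeds:
            outcome = "commit after T&S"
        else:
            return "keep spinning"
        self.spec_lines = set()              # one-shot commit
        self.active = False
        return outcome

    def external_request(self, line):
        """Incoming coherence request from another processor."""
        if self.active and line in self.spec_lines:
            self.spec_lines = set()          # one-shot squash
            self.release_bit = False         # roll back to the sync point
            return "squash"
        return "service"
```

A request to a line that was not accessed speculatively is serviced normally. A request to a speculative line squashes the thread, which stays active and restarts from the lock request.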

60
Support for Multiple Locks
61
Speculative Flags and Barriers
  • Flag spin: Test only, no T&S
  • Barrier: leverage flag spin support
  • Update thread counter
  • If not last one, spin on flag speculatively
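The barrier scheme in these bullets can be sketched as a counter plus a flag. This is an illustrative model with assumed names: threads that arrive early are recorded as speculative (they run past the barrier while spinning on the flag), and the last arrival toggles the flag, which lets the speculative threads commit.

```python
class SpecBarrier:
    """Toy model of a speculative barrier built on flag spinning."""

    def __init__(self, nthreads):
        self.nthreads = nthreads
        self.count = 0
        self.flag = False
        self.speculative = set()   # threads running past the barrier

    def arrive(self, tid):
        """Return the committed threads if `tid` is last, else None."""
        self.count += 1            # update thread counter
        if self.count == self.nthreads:
            self.count = 0
            self.flag = not self.flag          # release the barrier
            committed = sorted(self.speculative)
            self.speculative.clear()
            return committed
        self.speculative.add(tid)  # not last: spin on flag speculatively
        return None
```

With three threads arriving in the order C, A, B, threads C and A become speculative and B, the lagging safe thread, releases the barrier and lets both commit.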

62
Retargeted M4 Macros
No programming effort
68
Speculative Sync: Summary of Results
  • Average sync time reduction: 40%
  • Promising for such simple hardware
  • Execution time reduction: up to 15%, avg 7.5%

69
Execution Time Reduction
Normalized Exec Time
Across-the-board reduction
70
Sync Time Reduction
Large reduction: 40%
Room for improvement
71
Speculative Sync: What We Learned
  • Speculative Synchronization very effective
  • Promising speedups
  • TLS's forward progress guarantee
  • Critical path not affected
  • Speculative buffer overflow simply stalls
  • Simple hardware
  • No programming effort
  • Room for improvement
  • WAR, WAW dependences
  • False sharing

72
Overall Conclusions
  • TLS: very promising emerging technology
  • Contributions to TLS
  • Built hierarchical scalable TLS
  • Used commodity spec CMPs as building blocks
  • Improved synchronized parallel codes
  • TLS's safe-thread mechanism effective

73
Static X-Thread Dependences
74
Static X-Thread Dependences
75
Applications
76
Support for Multiple Locks