Title: Clustered Data Cache Designs for VLIW Processors
1. Clustered Data Cache Designs for VLIW Processors
- PhD Candidate: Enric Gibert
- Advisors: Antonio González, Jesús Sánchez
2. Motivation
- Two major problems in processor design
- Wire delays
- Energy consumption
D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997
Data from www.sandpile.org
3. Clustering
[Figure: a 4-cluster VLIW datapath. Each cluster has its own functional units (FUs) and register file; clusters are connected by register-to-register communication buses and share an L1 data cache, which is backed by the L2 cache through the memory buses.]
4. Data Cache
- Latency
- Energy
- Leakage will soon dominate energy consumption
- Cache memories will probably be the main source of leakage
SIA projections for a 64KB cache (S. Hill, Hot Chips 13):
  100 nm: 4 cycles   70 nm: 5 cycles   50 nm: 7 cycles   35 nm: 7 cycles
- In this Thesis:
- Latency Reduction Techniques
- Energy Reduction Techniques
5. Contributions of this Thesis
- Memory hierarchy for clustered VLIW processors
- Latency Reduction Techniques
- Distribution of the Data Cache among clusters
- Cost-effective cache coherence solutions
- Word-Interleaved distributed data cache
- Flexible Compiler-Managed L0 Buffers
- Energy Reduction Techniques
- Heterogeneous Multi-module Data Cache
- Unified processors
- Clustered processors
6. Evaluation Framework
- IMPACT C compiler
- Compilation, optimization and memory disambiguation
- Mediabench benchmark suite
Benchmark   Profile      Execution   |  Benchmark    Profile    Execution
adpcmdec    clinton      S_16_44     |  jpegdec      testimg    monalisa
adpcmenc    clinton      S_16_44     |  jpegenc      testimg    monalisa
epicdec     test_image   titanic     |  mpeg2dec     mei16v2    tek6
epicenc     test_image   titanic     |  pegwitdec    pegwit     techrep
g721dec     clinton      S_16_44     |  pegwitenc    pgptest    techrep
g721enc     clinton      S_16_44     |  pgpdec       pgptext    techrep
gsmdec      clinton      S_16_44     |  pgpenc       pgptest    techrep
gsmenc      clinton      S_16_44     |  rasta        ex5_c1     ex5_c1
- Microarchitectural VLIW simulator
7. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
8. Distributing the Data Cache
[Figure: the 4-cluster VLIW datapath with the centralized L1 cache removed; each cluster keeps its FUs and register file, clusters communicate through the register-to-register buses, and the L2 cache is reached through the memory buses. The question is how to distribute the L1 data cache among the clusters.]
9. MultiVLIW
[Figure: 4-cluster VLIW in which each cluster has its own L1 cache module holding cache blocks; the modules are kept coherent with an MSI cache coherence protocol and are backed by the shared L2 cache. Each cluster keeps its FUs and register file, with register-to-register communication buses between clusters.]
(Sánchez and González, MICRO-33)
10. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
11. Memory Coherence
[Figure: clusters 1-4, each with its own cache module, connected to the next memory level through the memory buses; the buses carry remote accesses, misses, replacements and other traffic, so the bus latency is non-deterministic.]
Example: one cluster issues a store to X in cycle i, and a different cluster issues a load from X in cycle i+4. With a non-deterministic bus latency, the load may reach the cache module holding X before the store does and return a stale value.
12. Coherence Solutions Overview
- Local scheduling solutions → applied to loops
- Memory Dependent Chains (MDC)
- Data Dependence Graph Transformations (DDGT)
- Store replication
- Load-store synchronization
- Software-based solutions with little hardware support
- Applicable to different configurations
- Word-interleaved cache
- Replicated distributed cache
- Flexible Compiler-Managed L0 Buffers
13. Scheme 1: Memory Dependent Chains
- Sets of memory dependent instructions
- Memory disambiguation by the compiler
- Conservative assumptions
- Assign all instructions in the same set to the same cluster (a sketch of the grouping follows)
[Figure: a store to X and a load from X linked by memory dependences, plus an ADD linked by register dependences, form one memory-dependent set; the whole set is assigned to a single cluster, so both memory instructions access the same cache module.]
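The grouping step can be pictured as a union-find over memory instructions. The following is a minimal C sketch, not the IMPACT implementation; the instruction indices and the dependence list are purely illustrative.

    #include <stdio.h>

    #define NUM_INSTR 6

    /* Union-find over memory instructions: instructions connected by a
       (possibly conservative) memory dependence end up in the same set,
       and the whole set is later assigned to a single cluster. */
    static int parent[NUM_INSTR];

    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void unite(int a, int b) { parent[find(a)] = find(b); }

    int main(void) {
        for (int i = 0; i < NUM_INSTR; i++) parent[i] = i;

        /* Memory dependences found (or conservatively assumed) by the compiler. */
        const int dep[3][2] = { {0, 1}, {1, 4}, {2, 3} };
        for (int k = 0; k < 3; k++) unite(dep[k][0], dep[k][1]);

        /* All instructions with the same representative form one memory-dependent
           set and are assigned to the same cluster / cache module. */
        for (int i = 0; i < NUM_INSTR; i++)
            printf("instr %d -> set %d\n", i, find(i));
        return 0;
    }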
14. Scheme 2: DDG Transformations (I)
- 2 transformations applied together
- Store replication → overcomes memory flow (MF) and memory output (MO) dependences
- Little support from the hardware
[Figure: the store to X is replicated to all four clusters, so every cache module receives the new value and a later load from X is correct in whichever cluster it executes.]
15. Scheme 2: DDG Transformations (II)
- Load-store synchronization → overcomes memory anti (MA) dependences
[Figure: an explicit register dependence (an add routed through the register file and the inter-cluster buses) synchronizes the load from X with the later store to X, so the store cannot overwrite X before the load has read it.]
16. Results: Memory Coherence
- Memory Dependent Chains (MDC)
- Bad: it restricts the assignment of instructions to clusters
- Good when memory disambiguation is accurate
- DDG Transformations (DDGT)
- Good when there is pressure on the memory buses
- Increases the number of local accesses
- Bad when there is pressure on the register buses
- Big increase in inter-cluster communications
- Solutions useful for different cache schemes
17. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
18. Word-Interleaved Cache
- Simplify hardware
- As compared to MultiVLIW
- Avoid replication
- Strides of +1/-1 elements are predominant
- Page interleaved
- Block interleaved
- Word interleaved → best suited
19. Architecture
[Figure: word-interleaved distributed cache. Each cluster has a local cache module with its own tag array (TAG), backed by the shared L2 cache; each cluster keeps its functional units and register file, and clusters communicate through the register-to-register buses.]
20. Instruction Scheduling (I): Loop Unrolling
[Figure: with word interleaving, a[0] and a[4] map to cluster 1's module, a[1] and a[5] to cluster 2's, a[2] and a[6] to cluster 3's, and a[3] and a[7] to cluster 4's.]
Original loop:
  for (i = 0; i < MAX; i++)
      ld r3, @a[i]
Unrolled by 4 (one access per cluster):
  for (i = 0; i < MAX; i += 4)
      ld r3, @a[i]      // cluster 1
      ld r3, @a[i+1]    // cluster 2
      ld r3, @a[i+2]    // cluster 3
      ld r3, @a[i+3]    // cluster 4
→ 100% of local accesses
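The same idea in plain C, as a rough sketch (the cluster count, array size and the owner_cluster helper are illustrative, not the thesis toolchain): unrolling by the number of clusters makes the owning cluster of every access statically known, so each load can be placed in its local cluster.

    #include <stdio.h>

    #define NUM_CLUSTERS 4          /* words interleaved across 4 cache modules */
    #define MAX 16

    /* With word interleaving, word i of the array lives in module i % NUM_CLUSTERS. */
    static int owner_cluster(int word_index) {
        return word_index % NUM_CLUSTERS;
    }

    int main(void) {
        int a[MAX];
        for (int i = 0; i < MAX; i++) a[i] = i;

        long sum = 0;
        /* Unrolled by NUM_CLUSTERS: a[i+k] always maps to cluster k, so the
           compiler can place each load in the cluster that owns the word. */
        for (int i = 0; i < MAX; i += NUM_CLUSTERS) {
            sum += a[i + 0];   /* local to cluster 0 */
            sum += a[i + 1];   /* local to cluster 1 */
            sum += a[i + 2];   /* local to cluster 2 */
            sum += a[i + 3];   /* local to cluster 3 */
        }
        printf("sum=%ld, a[5] owned by cluster %d\n", sum, owner_cluster(5));
        return 0;
    }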
21. Instruction Scheduling (II): Latency Assignment
- Assign an appropriate latency to each memory instruction
- Small assumed latencies → compact schedules (more ILP) but more stall time
- Large assumed latencies → fewer stalls but longer schedules (less ILP)
- Start with the largest latency (remote miss) and iteratively reassign more appropriate latencies (local miss, remote hit, local hit)
[Figure: a load feeding an add through the register file; the assumed load latency determines how far apart the two are scheduled.]
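A toy C sketch of this trade-off, under an assumed cost model (the latency values and the schedule_cost formula are illustrative, not the thesis scheduler): starting from the pessimistic remote-miss latency, a smaller assumed latency is kept only when it lowers the estimated schedule-plus-stall cost.

    #include <stdio.h>

    /* Illustrative latency classes for a word-interleaved clustered cache. */
    enum { LOCAL_HIT = 2, REMOTE_HIT = 4, LOCAL_MISS = 8, REMOTE_MISS = 10 };

    /* Toy cost model: scheduling a load with a larger assumed latency stretches
       the schedule; a smaller one keeps it compact but stalls whenever the real
       access takes longer than assumed. */
    static int schedule_cost(int assumed, int actual) {
        int schedule_len = 5 + assumed;                       /* compact vs. stretched */
        int stall = (actual > assumed) ? actual - assumed : 0;
        return schedule_len + stall;
    }

    int main(void) {
        int actual = REMOTE_HIT;          /* what the access really does */
        int best = REMOTE_MISS, best_cost = schedule_cost(REMOTE_MISS, actual);

        /* Start pessimistic (remote miss) and iteratively try smaller latencies,
           keeping a reassignment only when the estimated cost improves. */
        const int candidates[] = { LOCAL_MISS, REMOTE_HIT, LOCAL_HIT };
        for (int i = 0; i < 3; i++) {
            int c = schedule_cost(candidates[i], actual);
            if (c < best_cost) { best_cost = c; best = candidates[i]; }
        }
        printf("assumed latency %d gives cost %d\n", best, best_cost);
        return 0;
    }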
22. Instruction Scheduling (III): Cluster Assignment
- Assign instructions to clusters
- Non-memory instructions
- Minimize inter-cluster communications
- Maximize workload balance among clusters
- Memory instructions → 2 heuristics
- Preferred cluster (PrefClus): average preferred cluster of the memory-dependent set (sketched after this list)
- Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory-dependent set
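A simplified sketch of how a preferred cluster could be derived for a whole memory-dependent set from profiled access counts; the counts and the exact averaging used by the PrefClus heuristic in the thesis are not reproduced here, so treat this only as an illustration.

    #include <stdio.h>

    #define NUM_CLUSTERS 4
    #define SET_SIZE     3

    /* Profiled count of accesses each instruction of one memory-dependent set
       makes to words owned by each cluster (illustrative data). */
    static const int access_count[SET_SIZE][NUM_CLUSTERS] = {
        { 90,  5,  3,  2 },
        { 10, 80,  5,  5 },
        { 85,  5,  5,  5 },
    };

    int main(void) {
        /* Sum the profiled accesses of the whole set and pick the cluster
           that owns most of them as the set's preferred cluster. */
        int total[NUM_CLUSTERS] = { 0 };
        for (int i = 0; i < SET_SIZE; i++)
            for (int c = 0; c < NUM_CLUSTERS; c++)
                total[c] += access_count[i][c];

        int best = 0;
        for (int c = 1; c < NUM_CLUSTERS; c++)
            if (total[c] > total[best]) best = c;

        printf("preferred cluster for the set: %d\n", best);
        return 0;
    }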
23. Memory Accesses
- Sources of remote accesses
- Indirect accesses, memory-dependent chain restrictions, double-precision data, ...
24. Attraction Buffers
- Cost-effective mechanism → increases local accesses
[Figure: each cluster adds a small Attraction Buffer (AB) next to its cache module; words whose home module is remote (e.g. a[0] and a[4] for a cluster whose module holds a[3] and a[7]) are kept in the local AB for reuse.]
Example: i = 0; loop: load a[i]; i = i + 4
  local accesses: 0% → 50%
- Results
- 15% increase in local accesses
- 30-35% reduction in stall time
- 5-7% reduction in overall execution time
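A hypothetical sketch of the attraction-buffer behaviour in C (the entry count, indexing and load interface are assumptions, and store handling is omitted): a word fetched from a remote module is kept in the local buffer, so later accesses to it become local.

    #include <stdio.h>
    #include <stdbool.h>

    #define AB_ENTRIES 8   /* illustrative number of attraction-buffer entries */

    /* One entry of a (hypothetical) attraction buffer: a word brought from a
       remote cluster and kept locally for reuse. */
    struct ab_entry { bool valid; unsigned addr; int data; };

    static struct ab_entry ab[AB_ENTRIES];

    /* Local load: hit in the attraction buffer if present, otherwise perform
       the remote access and keep a local copy of the word. */
    static int load(unsigned addr, int remote_value, bool *was_local) {
        unsigned idx = (addr / 4) % AB_ENTRIES;
        if (ab[idx].valid && ab[idx].addr == addr) { *was_local = true; return ab[idx].data; }
        *was_local = false;
        ab[idx] = (struct ab_entry){ true, addr, remote_value };
        return remote_value;
    }

    int main(void) {
        bool local;
        load(0x100, 42, &local);                 /* first access: remote */
        int v = load(0x100, 42, &local);         /* reuse: local hit */
        printf("value=%d local=%d\n", v, local);
        return 0;
    }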
25. Performance
26. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
27. Why L0 Buffers?
- Still keep the hardware simple, but...
- ...allow dynamic binding between addresses and clusters
28. L0 Buffers
[Figure: 4-cluster VLIW with INT, FP and MEM units and a register file per cluster; a small L0 Buffer is placed in each cluster between its memory units and the shared L1 cache.]
- Small number of entries → flexibility
- Adaptive to the application: dynamic address-cluster binding
- Controlled by software → load/store hints
- Hints mark which instructions access the buffers and how
- Flexible Compiler-Managed L0 Buffers
29. Mapping Flexibility
[Figure: a 16-byte L1 cache block (e.g. holding a[0]..a[3], with the following block holding a[4]..a[7]) can have its words placed into the four clusters' L0 Buffers in different ways (e.g. linearly or interleaved), since the address-to-cluster binding is decided by software.]
30. Hints and L0-L1 Interface
- Memory hints (a hypothetical encoding is sketched at the end of this slide)
- Access or bypass the L0 Buffers
- Data mapping: linear / interleaved
- Prefetch hints → next/previous blocks
- L0 Buffers are write-through with respect to L1
- Simplifies replacements
- Makes hardware simple
- No arbitration
- No logic to pack data back correctly
- Simplifies coherence among L0 Buffers
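One way such hints could be packed into a memory instruction, purely as an assumption for illustration; the actual hint encoding used in the thesis is not specified here.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical encoding of the load/store hints described above;
       the real ISA encoding may differ. */
    enum l0_use   { L0_BYPASS = 0, L0_ACCESS = 1 };
    enum mapping  { MAP_LINEAR = 0, MAP_INTERLEAVED = 1 };
    enum prefetch { PF_NONE = 0, PF_NEXT = 1, PF_PREV = 2 };

    static uint8_t make_hint(enum l0_use u, enum mapping m, enum prefetch p) {
        return (uint8_t)(u | (m << 1) | (p << 2));   /* pack into one hint byte */
    }

    int main(void) {
        uint8_t h = make_hint(L0_ACCESS, MAP_INTERLEAVED, PF_NEXT);
        printf("hint byte = 0x%02x\n", h);
        return 0;
    }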
31. Instruction Scheduling
- Selective loop unrolling
- No unroll vs. unroll by N
- Assign latencies to memory instructions
- Critical instructions (slack) use the L0 Buffers
- Do not overflow the L0 Buffers
- Use a counter of free L0 Buffer entries per cluster
- Do not schedule a critical instruction into a cluster whose counter is 0 (see the sketch below)
- Memory coherence
- Cluster assignment: schedule instructions to
- Minimize global communications
- Maximize workload balance
- Critical → priority to clusters where the L0 Buffer can be used
- Explicit prefetching
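A minimal sketch of the free-entry bookkeeping, assuming illustrative counter values and a simple fallback policy (both are assumptions, not the thesis heuristic):

    #include <stdio.h>

    #define NUM_CLUSTERS 4

    /* Compiler-side counters of free L0-buffer entries per cluster: a critical
       memory instruction is only steered to a cluster that still has room,
       so the buffers are never overflowed at run time. */
    static int free_entries[NUM_CLUSTERS] = { 0, 2, 1, 4 };

    /* Pick a cluster for a critical memory instruction: keep the preferred
       cluster if it has a free entry, otherwise fall back to the cluster
       with the most free entries. */
    static int assign_cluster(int preferred) {
        if (free_entries[preferred] > 0) return preferred;
        int best = 0;
        for (int c = 1; c < NUM_CLUSTERS; c++)
            if (free_entries[c] > free_entries[best]) best = c;
        return best;
    }

    int main(void) {
        int cl = assign_cluster(0);          /* cluster 0 is full, fall back */
        free_entries[cl]--;                  /* reserve one entry for this load */
        printf("critical load scheduled on cluster %d\n", cl);
        return 0;
    }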
32. Number of Entries
33. Performance
34. Global Comparison

                                        MultiVLIW   Word-Interleaved   L0 Buffers
Hardware complexity (lower is better)   High        Low                Low
Software complexity (lower is better)   Low         Medium             High
Performance (higher is better)          High        Medium             High
35. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
36. Motivation
- Energy consumption → a first-class design goal
- Heterogeneity
- Vary the supply voltage and/or the threshold voltage
- Cache memory → ARM10
- D-cache → 24% of dynamic energy
- I-cache → 22% of dynamic energy
- Exploit heterogeneity in the L1 D-cache?
[Figure: structures in the processor front-end are tuned for performance, while structures in the processor back-end are tuned for energy.]
37. Multi-Module Data Cache
Instruction-Based Multi-Module (Abella and González, ICCD 2003)
[Figure: the L1 is split into a FAST cache module and a SLOW cache module, both backed by the L2 D-cache; a criticality table indexed by the instruction PC steers each access from the processor to one of the modules.]
38. Cache Configurations
39. Instruction-to-Variable Graph (IVG)
- Built with profiling information (a sketch of building it from a profile follows)
- Variables: global, local, heap
[Figure: bipartite graph linking memory instructions (LD1, LD2, ST1, LD3, ST2, LD4, LD5) to the variables they access (V1..V4); each variable ends up mapped to either the FIRST or the SECOND cache module.]
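A small C sketch of building the IVG from a profiled access trace (the trace and the instruction/variable identifiers are made up for illustration): each edge weight counts how often a static load or store touched a variable during the profiling run.

    #include <stdio.h>

    #define NUM_INSTR 5
    #define NUM_VARS  4

    /* One profiled memory access: which static load/store executed and which
       variable (global, local or heap object) it touched. */
    struct access { int instr; int var; };

    int main(void) {
        const struct access trace[] = {
            {0, 0}, {0, 0}, {1, 0}, {2, 1}, {3, 2}, {3, 3}, {4, 3},
        };
        /* The IVG is a weighted bipartite graph: edge[i][v] counts how often
           instruction i accessed variable v in the profiling run. */
        int edge[NUM_INSTR][NUM_VARS] = { 0 };
        for (unsigned k = 0; k < sizeof trace / sizeof trace[0]; k++)
            edge[trace[k].instr][trace[k].var]++;

        for (int i = 0; i < NUM_INSTR; i++)
            for (int v = 0; v < NUM_VARS; v++)
                if (edge[i][v])
                    printf("instr %d -- var %d (weight %d)\n", i, v, edge[i][v]);
        return 0;
    }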
40. Greedy Mapping Algorithm
Flow: compute IVG → compute and propagate affinities → compute mapping → schedule code
- Initial mapping → all variables to the first address space
- Assign affinities to instructions
- Express a preferred cluster for memory instructions, in the range [0,1] (sketched below)
- Propagate affinities to other instructions
- Schedule code and refine the mapping
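A sketch of one plausible affinity computation, assuming affinity 0.0 means all of an instruction's profiled accesses go to variables mapped to the first module and 1.0 means all go to the second; the exact definition and the propagation step in the thesis may differ.

    #include <stdio.h>

    #define NUM_INSTR 3
    #define NUM_VARS  3

    int main(void) {
        /* IVG edge weights from profiling (instruction x variable), illustrative. */
        const int edge[NUM_INSTR][NUM_VARS] = {
            { 10,  0,  0 },
            {  0,  6,  4 },
            {  0,  0,  8 },
        };
        /* Current variable mapping: 0 = first module, 1 = second module. */
        const int module_of[NUM_VARS] = { 0, 0, 1 };

        /* Affinity of a memory instruction: weighted fraction of its profiled
           accesses that go to the second module (0.0 = all first, 1.0 = all
           second). This value is then propagated to nearby non-memory
           instructions through the data dependence graph. */
        for (int i = 0; i < NUM_INSTR; i++) {
            int total = 0, second = 0;
            for (int v = 0; v < NUM_VARS; v++) {
                total += edge[i][v];
                if (module_of[v] == 1) second += edge[i][v];
            }
            double affinity = total ? (double)second / total : 0.0;
            printf("instr %d affinity = %.2f\n", i, affinity);
        }
        return 0;
    }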
41. Computing and Propagating Affinity
[Figure: a data dependence graph (adds, loads LD1-LD4, a multiply and a store ST1) annotated with instruction latencies; the memory instructions take their affinity (0, 1, or intermediate values such as 0.4) from whether the variables V1..V4 they access are mapped to the FIRST or the SECOND module, and this affinity is propagated to the surrounding arithmetic instructions.]
42. Cluster Assignment
- Cluster affinity + affinity range → used to
- Define a preferred cluster
- Guide the instruction-to-cluster assignment process
- Strongly preferred cluster
- Schedule the instruction in that cluster
- Weakly preferred cluster
- Schedule the instruction where global communications are minimized (see the sketch below)
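A tiny sketch of how an affinity range could separate strongly from weakly preferred clusters; the threshold values are hypothetical.

    #include <stdio.h>

    int main(void) {
        /* Affinity of an instruction and an illustrative affinity range:
           values close to 0 or 1 give a strongly preferred module/cluster,
           values in between only a weak preference. */
        double affinity = 0.4;
        double lo = 0.25, hi = 0.75;          /* hypothetical range */

        if (affinity <= lo)
            printf("strongly preferred: schedule in the first-module cluster\n");
        else if (affinity >= hi)
            printf("strongly preferred: schedule in the second-module cluster\n");
        else
            printf("weakly preferred: schedule where communications are minimized\n");
        return 0;
    }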
43. EDD Results
Best configuration by benchmark sensitivity:

                                  Memory-port sensitive   Memory-port insensitive
Memory-latency sensitive          FAST + FAST             FAST + NONE
Memory-latency insensitive        SLOW + SLOW             SLOW + NONE
44. Other Results

        BEST    UNIFIED FAST   UNIFIED SLOW
EDD     0.89    1.29           1.25
ED      0.89    1.25           1.07

- ED
- The SLOW schemes are better
- In all cases, these schemes are better than a unified cache
- 29-31% better in EDD, 19-29% better in ED
- No configuration is best for all cases
45. Reconfigurable Cache Results
- The OS can set each module in one state: FAST mode / SLOW mode / turned off
- The OS reconfigures the cache on a context switch
- Depending on the applications scheduled in and scheduled out
- Two different VDD and VTH for the cache
- Reconfiguration overhead of 1-2 cycles [Flautner et al., 2002]
- Simple heuristic to show the potential
- For each application, choose the estimated best cache configuration

        BEST DISTRIBUTED      RECONFIGURABLE SCHEME
EDD     0.89 (FAST + SLOW)    0.86
ED      0.89 (SLOW + SLOW)    0.86
46. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
47. Conclusions
- Cache partitioning is a good latency reduction technique
- Cache heterogeneity can be used to improve energy efficiency
- The most energy- and performance-efficient scheme is a distributed data cache
- Dynamic vs. static mapping between addresses and clusters
- Dynamic for performance (L0 Buffers)
- Static for energy consumption (variable-based mapping)
- Hardware vs. software-based memory coherence solutions
- Software solutions are viable
48. List of Publications
- Distributed Data Cache Memories
- ICS, 2002
- MICRO-35, 2002
- CGO-1, 2003
- MICRO-36, 2003
- IEEE Transactions on Computers, October 2005
- Concurrency and Computation: Practice and Experience (to appear late 2005 / 2006)
- Heterogeneous Data Cache Memories
- Technical report UPC-DAC-RR-ARCO-2004-4, 2004
- PACT, 2005
49. Questions?