Title: Clustered Data Cache Designs for VLIW Processors
1. Clustered Data Cache Designs for VLIW Processors
- PhD Candidate: Enric Gibert
- Advisors: Antonio González, Jesús Sánchez
2. Motivation
- Two major problems in processor design
- Wire delays
- Energy consumption
D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997
Data from www.sandpile.org
3. Clustering
[Figure: a 4-cluster VLIW datapath. Each cluster has its own functional units (FUs) and register file; clusters are connected by register-to-register communication buses and share an L1 data cache, which is backed by the L2 cache through the memory buses.]
4. Data Cache
- Latency
- Energy
- Leakage will soon dominate energy consumption
- Cache memories will probably be the main source of leakage
SIA projections for a 64KB cache (S. Hill, Hot Chips 13):
  100 nm: 4 cycles   70 nm: 5 cycles   50 nm: 7 cycles   35 nm: 7 cycles
- In this Thesis:
- Latency Reduction Techniques
- Energy Reduction Techniques
5. Contributions of this Thesis
- Memory hierarchy for clustered VLIW processors
- Latency Reduction Techniques
- Distribution of the Data Cache among clusters
- Cost-effective cache coherence solutions
- Word-Interleaved distributed data cache
- Flexible Compiler-Managed L0 Buffers
- Energy Reduction Techniques
- Heterogeneous Multi-module Data Cache
- Unified processors
- Clustered processors
6. Evaluation Framework
- IMPACT C compiler
- Compilation, optimization and memory disambiguation
- Mediabench benchmark suite
Benchmark   Profile      Execution   |  Benchmark    Profile    Execution
adpcmdec    clinton      S_16_44     |  jpegdec      testimg    monalisa
adpcmenc    clinton      S_16_44     |  jpegenc      testimg    monalisa
epicdec     test_image   titanic     |  mpeg2dec     mei16v2    tek6
epicenc     test_image   titanic     |  pegwitdec    pegwit     techrep
g721dec     clinton      S_16_44     |  pegwitenc    pgptest    techrep
g721enc     clinton      S_16_44     |  pgpdec       pgptext    techrep
gsmdec      clinton      S_16_44     |  pgpenc       pgptest    techrep
gsmenc      clinton      S_16_44     |  rasta        ex5_c1     ex5_c1
- Microarchitectural VLIW simulator
7. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
8. Distributing the Data Cache
[Figure: the 4-cluster VLIW datapath with the centralized L1 cache removed; each cluster keeps its FUs and register file, clusters communicate through the register-to-register buses, and the L2 cache is reached through the memory buses. The question is how to distribute the L1 data cache among the clusters.]
9. MultiVLIW
[Figure: 4-cluster VLIW in which each cluster has its own L1 cache module holding cache blocks; the modules are kept coherent with an MSI cache coherence protocol and are backed by the shared L2 cache. Each cluster keeps its FUs and register file, with register-to-register communication buses between clusters.]
(Sánchez and González, MICRO-33)
10. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
11. Memory Coherence
[Figure: clusters 1-4, each with its own cache module, connected to the next memory level through the memory buses; the buses carry remote accesses, misses, replacements and other traffic, so the bus latency is non-deterministic.]
Example: one cluster issues a store to X in cycle i, and a different cluster issues a load from X in cycle i+4. With a non-deterministic bus latency, the load may reach the cache module holding X before the store does and return a stale value.
12. Coherence Solutions Overview
- Local scheduling solutions → applied to loops
- Memory Dependent Chains (MDC)
- Data Dependence Graph Transformations (DDGT)
- Store replication
- Load-store synchronization
- Software-based solutions with little hardware support
- Applicable to different configurations
- Word-interleaved cache
- Replicated distributed cache
- Flexible Compiler-Managed L0 Buffers
13. Scheme 1: Memory Dependent Chains
- Sets of memory dependent instructions
- Memory disambiguation by the compiler
- Conservative assumptions
- Assign all instructions in the same set to the same cluster (a sketch of the grouping follows)
[Figure: a store to X and a load from X linked by memory dependences, plus an ADD linked by register dependences, form one memory-dependent set; the whole set is assigned to a single cluster, so both memory instructions access the same cache module.]
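The grouping step can be pictured as a union-find over memory instructions. The following is a minimal C sketch, not the IMPACT implementation; the instruction indices and the dependence list are purely illustrative.

    #include <stdio.h>

    #define NUM_INSTR 6

    /* Union-find over memory instructions: instructions connected by a
       (possibly conservative) memory dependence end up in the same set,
       and the whole set is later assigned to a single cluster. */
    static int parent[NUM_INSTR];

    static int find(int x) { return parent[x] == x ? x : (parent[x] = find(parent[x])); }
    static void unite(int a, int b) { parent[find(a)] = find(b); }

    int main(void) {
        for (int i = 0; i < NUM_INSTR; i++) parent[i] = i;

        /* Memory dependences found (or conservatively assumed) by the compiler. */
        const int dep[3][2] = { {0, 1}, {1, 4}, {2, 3} };
        for (int k = 0; k < 3; k++) unite(dep[k][0], dep[k][1]);

        /* All instructions with the same representative form one memory-dependent
           set and are assigned to the same cluster / cache module. */
        for (int i = 0; i < NUM_INSTR; i++)
            printf("instr %d -> set %d\n", i, find(i));
        return 0;
    }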
14. Scheme 2: DDG Transformations (I)
- 2 transformations applied together
- Store replication → overcomes memory flow (MF) and memory output (MO) dependences
- Little support from the hardware
[Figure: the store to X is replicated to all four clusters, so every cache module receives the new value and a later load from X is correct in whichever cluster it executes.]
15. Scheme 2: DDG Transformations (II)
- Load-store synchronization → overcomes memory anti (MA) dependences
[Figure: an explicit register dependence (an add routed through the register file and the inter-cluster buses) synchronizes the load from X with the later store to X, so the store cannot overwrite X before the load has read it.]
16. Results: Memory Coherence
- Memory Dependent Chains (MDC)
- Bad: it restricts the assignment of instructions to clusters
- Good when memory disambiguation is accurate
- DDG Transformations (DDGT)
- Good when there is pressure on the memory buses
- Increases the number of local accesses
- Bad when there is pressure on the register buses
- Big increase in inter-cluster communications
- Solutions useful for different cache schemes
17. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
18. Word-Interleaved Cache
- Simplify hardware
- As compared to MultiVLIW
- Avoid replication
- Strides of +1/-1 elements are predominant
- Page interleaved
- Block interleaved
- Word interleaved → best suited
19. Architecture
[Figure: word-interleaved distributed cache. Each cluster has a local cache module with its own tag array (TAG), backed by the shared L2 cache; each cluster keeps its functional units and register file, and clusters communicate through the register-to-register buses.]
20. Instruction Scheduling (I): Loop Unrolling
[Figure: with word interleaving, a[0] and a[4] map to cluster 1's module, a[1] and a[5] to cluster 2's, a[2] and a[6] to cluster 3's, and a[3] and a[7] to cluster 4's.]
Original loop:
  for (i = 0; i < MAX; i++)
      ld r3, @a[i]
Unrolled by 4 (one access per cluster):
  for (i = 0; i < MAX; i += 4)
      ld r3, @a[i]      // cluster 1
      ld r3, @a[i+1]    // cluster 2
      ld r3, @a[i+2]    // cluster 3
      ld r3, @a[i+3]    // cluster 4
→ 100% of local accesses
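The same idea in plain C, as a rough sketch (the cluster count, array size and the owner_cluster helper are illustrative, not the thesis toolchain): unrolling by the number of clusters makes the owning cluster of every access statically known, so each load can be placed in its local cluster.

    #include <stdio.h>

    #define NUM_CLUSTERS 4          /* words interleaved across 4 cache modules */
    #define MAX 16

    /* With word interleaving, word i of the array lives in module i % NUM_CLUSTERS. */
    static int owner_cluster(int word_index) {
        return word_index % NUM_CLUSTERS;
    }

    int main(void) {
        int a[MAX];
        for (int i = 0; i < MAX; i++) a[i] = i;

        long sum = 0;
        /* Unrolled by NUM_CLUSTERS: a[i+k] always maps to cluster k, so the
           compiler can place each load in the cluster that owns the word. */
        for (int i = 0; i < MAX; i += NUM_CLUSTERS) {
            sum += a[i + 0];   /* local to cluster 0 */
            sum += a[i + 1];   /* local to cluster 1 */
            sum += a[i + 2];   /* local to cluster 2 */
            sum += a[i + 3];   /* local to cluster 3 */
        }
        printf("sum=%ld, a[5] owned by cluster %d\n", sum, owner_cluster(5));
        return 0;
    }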
21. Instruction Scheduling (II): Latency Assignment
- Assign an appropriate latency to each memory instruction
- Small assumed latencies → compact schedules (more ILP) but more stall time
- Large assumed latencies → fewer stalls but longer schedules (less ILP)
- Start with the largest latency (remote miss) and iteratively reassign more appropriate latencies (local miss, remote hit, local hit)
[Figure: a load feeding an add through the register file; the assumed load latency determines how far apart the two are scheduled.]
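A toy C sketch of this trade-off, under an assumed cost model (the latency values and the schedule_cost formula are illustrative, not the thesis scheduler): starting from the pessimistic remote-miss latency, a smaller assumed latency is kept only when it lowers the estimated schedule-plus-stall cost.

    #include <stdio.h>

    /* Illustrative latency classes for a word-interleaved clustered cache. */
    enum { LOCAL_HIT = 2, REMOTE_HIT = 4, LOCAL_MISS = 8, REMOTE_MISS = 10 };

    /* Toy cost model: scheduling a load with a larger assumed latency stretches
       the schedule; a smaller one keeps it compact but stalls whenever the real
       access takes longer than assumed. */
    static int schedule_cost(int assumed, int actual) {
        int schedule_len = 5 + assumed;                       /* compact vs. stretched */
        int stall = (actual > assumed) ? actual - assumed : 0;
        return schedule_len + stall;
    }

    int main(void) {
        int actual = REMOTE_HIT;          /* what the access really does */
        int best = REMOTE_MISS, best_cost = schedule_cost(REMOTE_MISS, actual);

        /* Start pessimistic (remote miss) and iteratively try smaller latencies,
           keeping a reassignment only when the estimated cost improves. */
        const int candidates[] = { LOCAL_MISS, REMOTE_HIT, LOCAL_HIT };
        for (int i = 0; i < 3; i++) {
            int c = schedule_cost(candidates[i], actual);
            if (c < best_cost) { best_cost = c; best = candidates[i]; }
        }
        printf("assumed latency %d gives cost %d\n", best, best_cost);
        return 0;
    }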
22. Instruction Scheduling (III): Cluster Assignment
- Assign instructions to clusters
- Non-memory instructions
- Minimize inter-cluster communications
- Maximize workload balance among clusters
- Memory instructions → 2 heuristics
- Preferred cluster (PrefClus): average preferred cluster of the memory-dependent set (sketched after this list)
- Minimize inter-cluster communications (MinComs): minimize communications for the 1st instruction of the memory-dependent set
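A simplified sketch of how a preferred cluster could be derived for a whole memory-dependent set from profiled access counts; the counts and the exact averaging used by the PrefClus heuristic in the thesis are not reproduced here, so treat this only as an illustration.

    #include <stdio.h>

    #define NUM_CLUSTERS 4
    #define SET_SIZE     3

    /* Profiled count of accesses each instruction of one memory-dependent set
       makes to words owned by each cluster (illustrative data). */
    static const int access_count[SET_SIZE][NUM_CLUSTERS] = {
        { 90,  5,  3,  2 },
        { 10, 80,  5,  5 },
        { 85,  5,  5,  5 },
    };

    int main(void) {
        /* Sum the profiled accesses of the whole set and pick the cluster
           that owns most of them as the set's preferred cluster. */
        int total[NUM_CLUSTERS] = { 0 };
        for (int i = 0; i < SET_SIZE; i++)
            for (int c = 0; c < NUM_CLUSTERS; c++)
                total[c] += access_count[i][c];

        int best = 0;
        for (int c = 1; c < NUM_CLUSTERS; c++)
            if (total[c] > total[best]) best = c;

        printf("preferred cluster for the set: %d\n", best);
        return 0;
    }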
23. Memory Accesses
- Sources of remote accesses
- Indirect accesses, memory-dependent chain restrictions, double-precision data, ...
24. Attraction Buffers
- Cost-effective mechanism → increases local accesses
[Figure: each cluster adds a small Attraction Buffer (AB) next to its cache module; words whose home module is remote (e.g. a[0] and a[4] for a cluster whose module holds a[3] and a[7]) are kept in the local AB for reuse.]
Example: i = 0; loop: load a[i]; i = i + 4
  local accesses: 0% → 50%
- Results
- 15% increase in local accesses
- 30-35% reduction in stall time
- 5-7% reduction in overall execution time
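A hypothetical sketch of the attraction-buffer behaviour in C (the entry count, indexing and load interface are assumptions, and store handling is omitted): a word fetched from a remote module is kept in the local buffer, so later accesses to it become local.

    #include <stdio.h>
    #include <stdbool.h>

    #define AB_ENTRIES 8   /* illustrative number of attraction-buffer entries */

    /* One entry of a (hypothetical) attraction buffer: a word brought from a
       remote cluster and kept locally for reuse. */
    struct ab_entry { bool valid; unsigned addr; int data; };

    static struct ab_entry ab[AB_ENTRIES];

    /* Local load: hit in the attraction buffer if present, otherwise perform
       the remote access and keep a local copy of the word. */
    static int load(unsigned addr, int remote_value, bool *was_local) {
        unsigned idx = (addr / 4) % AB_ENTRIES;
        if (ab[idx].valid && ab[idx].addr == addr) { *was_local = true; return ab[idx].data; }
        *was_local = false;
        ab[idx] = (struct ab_entry){ true, addr, remote_value };
        return remote_value;
    }

    int main(void) {
        bool local;
        load(0x100, 42, &local);                 /* first access: remote */
        int v = load(0x100, 42, &local);         /* reuse: local hit */
        printf("value=%d local=%d\n", v, local);
        return 0;
    }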
25. Performance
26. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
27. Why L0 Buffers?
- Still keep the hardware simple, but...
- ...allow dynamic binding between addresses and clusters
28. L0 Buffers
[Figure: 4-cluster VLIW with INT, FP and MEM units and a register file per cluster; a small L0 Buffer is placed in each cluster between its memory units and the shared L1 cache.]
- Small number of entries → flexibility
- Adaptive to the application: dynamic address-cluster binding
- Controlled by software → load/store hints
- Hints mark which instructions access the buffers and how
- Flexible Compiler-Managed L0 Buffers
29. Mapping Flexibility
[Figure: a 16-byte L1 cache block (e.g. holding a[0]..a[3], with the following block holding a[4]..a[7]) can have its words placed into the four clusters' L0 Buffers in different ways (e.g. linearly or interleaved), since the address-to-cluster binding is decided by software.]
30. Hints and L0-L1 Interface
- Memory hints (a hypothetical encoding is sketched at the end of this slide)
- Access or bypass the L0 Buffers
- Data mapping: linear / interleaved
- Prefetch hints → next/previous blocks
- L0 Buffers are write-through with respect to L1
- Simplifies replacements
- Makes hardware simple
- No arbitration
- No logic to pack data back correctly
- Simplifies coherence among L0 Buffers
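One way such hints could be packed into a memory instruction, purely as an assumption for illustration; the actual hint encoding used in the thesis is not specified here.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical encoding of the load/store hints described above;
       the real ISA encoding may differ. */
    enum l0_use   { L0_BYPASS = 0, L0_ACCESS = 1 };
    enum mapping  { MAP_LINEAR = 0, MAP_INTERLEAVED = 1 };
    enum prefetch { PF_NONE = 0, PF_NEXT = 1, PF_PREV = 2 };

    static uint8_t make_hint(enum l0_use u, enum mapping m, enum prefetch p) {
        return (uint8_t)(u | (m << 1) | (p << 2));   /* pack into one hint byte */
    }

    int main(void) {
        uint8_t h = make_hint(L0_ACCESS, MAP_INTERLEAVED, PF_NEXT);
        printf("hint byte = 0x%02x\n", h);
        return 0;
    }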
31. Instruction Scheduling
- Selective loop unrolling
- No unroll vs. unroll by N
- Assign latencies to memory instructions
- Critical instructions (slack) use the L0 Buffers
- Do not overflow the L0 Buffers
- Use a counter of free L0 Buffer entries per cluster
- Do not schedule a critical instruction into a cluster whose counter is 0 (see the sketch below)
- Memory coherence
- Cluster assignment: schedule instructions to
- Minimize global communications
- Maximize workload balance
- Critical → priority to clusters where the L0 Buffer can be used
- Explicit prefetching
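A minimal sketch of the free-entry bookkeeping, assuming illustrative counter values and a simple fallback policy (both are assumptions, not the thesis heuristic):

    #include <stdio.h>

    #define NUM_CLUSTERS 4

    /* Compiler-side counters of free L0-buffer entries per cluster: a critical
       memory instruction is only steered to a cluster that still has room,
       so the buffers are never overflowed at run time. */
    static int free_entries[NUM_CLUSTERS] = { 0, 2, 1, 4 };

    /* Pick a cluster for a critical memory instruction: keep the preferred
       cluster if it has a free entry, otherwise fall back to the cluster
       with the most free entries. */
    static int assign_cluster(int preferred) {
        if (free_entries[preferred] > 0) return preferred;
        int best = 0;
        for (int c = 1; c < NUM_CLUSTERS; c++)
            if (free_entries[c] > free_entries[best]) best = c;
        return best;
    }

    int main(void) {
        int cl = assign_cluster(0);          /* cluster 0 is full, fall back */
        free_entries[cl]--;                  /* reserve one entry for this load */
        printf("critical load scheduled on cluster %d\n", cl);
        return 0;
    }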
32. Number of Entries
33. Performance
34. Global Comparison

                                        MultiVLIW   Word-Interleaved   L0 Buffers
Hardware complexity (lower is better)   High        Low                Low
Software complexity (lower is better)   Low         Medium             High
Performance (higher is better)          High        Medium             High
35. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
36. Motivation
- Energy consumption → a first-class design goal
- Heterogeneity
- Vary the supply voltage and/or the threshold voltage
- Cache memory → ARM10
- D-cache → 24% of dynamic energy
- I-cache → 22% of dynamic energy
- Exploit heterogeneity in the L1 D-cache?
[Figure: structures in the processor front-end are tuned for performance, while structures in the processor back-end are tuned for energy.]
37. Multi-Module Data Cache
Instruction-Based Multi-Module (Abella and González, ICCD 2003)
[Figure: the L1 is split into a FAST cache module and a SLOW cache module, both backed by the L2 D-cache; a criticality table indexed by the instruction PC steers each access from the processor to one of the modules.]
38. Cache Configurations
39. Instruction-to-Variable Graph (IVG)
- Built with profiling information (a sketch of building it from a profile follows)
- Variables: global, local, heap
[Figure: bipartite graph linking memory instructions (LD1, LD2, ST1, LD3, ST2, LD4, LD5) to the variables they access (V1..V4); each variable ends up mapped to either the FIRST or the SECOND cache module.]
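A small C sketch of building the IVG from a profiled access trace (the trace and the instruction/variable identifiers are made up for illustration): each edge weight counts how often a static load or store touched a variable during the profiling run.

    #include <stdio.h>

    #define NUM_INSTR 5
    #define NUM_VARS  4

    /* One profiled memory access: which static load/store executed and which
       variable (global, local or heap object) it touched. */
    struct access { int instr; int var; };

    int main(void) {
        const struct access trace[] = {
            {0, 0}, {0, 0}, {1, 0}, {2, 1}, {3, 2}, {3, 3}, {4, 3},
        };
        /* The IVG is a weighted bipartite graph: edge[i][v] counts how often
           instruction i accessed variable v in the profiling run. */
        int edge[NUM_INSTR][NUM_VARS] = { 0 };
        for (unsigned k = 0; k < sizeof trace / sizeof trace[0]; k++)
            edge[trace[k].instr][trace[k].var]++;

        for (int i = 0; i < NUM_INSTR; i++)
            for (int v = 0; v < NUM_VARS; v++)
                if (edge[i][v])
                    printf("instr %d -- var %d (weight %d)\n", i, v, edge[i][v]);
        return 0;
    }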
40. Greedy Mapping Algorithm
Flow: compute IVG → compute and propagate affinities → compute mapping → schedule code
- Initial mapping → all variables to the first address space
- Assign affinities to instructions
- Express a preferred cluster for memory instructions, in the range [0,1] (sketched below)
- Propagate affinities to other instructions
- Schedule code and refine the mapping
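A sketch of one plausible affinity computation, assuming affinity 0.0 means all of an instruction's profiled accesses go to variables mapped to the first module and 1.0 means all go to the second; the exact definition and the propagation step in the thesis may differ.

    #include <stdio.h>

    #define NUM_INSTR 3
    #define NUM_VARS  3

    int main(void) {
        /* IVG edge weights from profiling (instruction x variable), illustrative. */
        const int edge[NUM_INSTR][NUM_VARS] = {
            { 10,  0,  0 },
            {  0,  6,  4 },
            {  0,  0,  8 },
        };
        /* Current variable mapping: 0 = first module, 1 = second module. */
        const int module_of[NUM_VARS] = { 0, 0, 1 };

        /* Affinity of a memory instruction: weighted fraction of its profiled
           accesses that go to the second module (0.0 = all first, 1.0 = all
           second). This value is then propagated to nearby non-memory
           instructions through the data dependence graph. */
        for (int i = 0; i < NUM_INSTR; i++) {
            int total = 0, second = 0;
            for (int v = 0; v < NUM_VARS; v++) {
                total += edge[i][v];
                if (module_of[v] == 1) second += edge[i][v];
            }
            double affinity = total ? (double)second / total : 0.0;
            printf("instr %d affinity = %.2f\n", i, affinity);
        }
        return 0;
    }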
41. Computing and Propagating Affinity
[Figure: a data dependence graph (adds, loads LD1-LD4, a multiply and a store ST1) annotated with instruction latencies; the memory instructions take their affinity (0, 1, or intermediate values such as 0.4) from whether the variables V1..V4 they access are mapped to the FIRST or the SECOND module, and this affinity is propagated to the surrounding arithmetic instructions.]
42. Cluster Assignment
- Cluster affinity + affinity range → used to
- Define a preferred cluster
- Guide the instruction-to-cluster assignment process
- Strongly preferred cluster
- Schedule the instruction in that cluster
- Weakly preferred cluster
- Schedule the instruction where global communications are minimized (see the sketch below)
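A tiny sketch of how an affinity range could separate strongly from weakly preferred clusters; the threshold values are hypothetical.

    #include <stdio.h>

    int main(void) {
        /* Affinity of an instruction and an illustrative affinity range:
           values close to 0 or 1 give a strongly preferred module/cluster,
           values in between only a weak preference. */
        double affinity = 0.4;
        double lo = 0.25, hi = 0.75;          /* hypothetical range */

        if (affinity <= lo)
            printf("strongly preferred: schedule in the first-module cluster\n");
        else if (affinity >= hi)
            printf("strongly preferred: schedule in the second-module cluster\n");
        else
            printf("weakly preferred: schedule where communications are minimized\n");
        return 0;
    }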
43. EDD Results
Best configuration by benchmark sensitivity:

                                  Memory-port sensitive   Memory-port insensitive
Memory-latency sensitive          FAST + FAST             FAST + NONE
Memory-latency insensitive        SLOW + SLOW             SLOW + NONE
44. Other Results

        BEST    UNIFIED FAST   UNIFIED SLOW
EDD     0.89    1.29           1.25
ED      0.89    1.25           1.07

- ED
- The SLOW schemes are better
- In all cases, these schemes are better than a unified cache
- 29-31% better in EDD, 19-29% better in ED
- No configuration is best for all cases
45. Reconfigurable Cache Results
- The OS can set each module in one state: FAST mode / SLOW mode / turned off
- The OS reconfigures the cache on a context switch
- Depending on the applications scheduled in and scheduled out
- Two different VDD and VTH for the cache
- Reconfiguration overhead of 1-2 cycles [Flautner et al., 2002]
- Simple heuristic to show the potential
- For each application, choose the estimated best cache configuration

        BEST DISTRIBUTED      RECONFIGURABLE SCHEME
EDD     0.89 (FAST + SLOW)    0.86
ED      0.89 (SLOW + SLOW)    0.86
46. Presentation Outline
- Latency reduction techniques
- Software memory coherence in distributed caches
- Word-interleaved distributed cache
- Flexible Compiler-Managed L0 Buffers
- Energy reduction techniques
- Multi-Module cache for clustered VLIW processor
- Conclusions
47. Conclusions
- Cache partitioning is a good latency reduction technique
- Cache heterogeneity can be used to improve energy efficiency
- The most energy- and performance-efficient scheme is a distributed data cache
- Dynamic vs. static mapping between addresses and clusters
- Dynamic for performance (L0 Buffers)
- Static for energy consumption (variable-based mapping)
- Hardware vs. software-based memory coherence solutions
- Software solutions are viable
48. List of Publications
- Distributed Data Cache Memories
- ICS, 2002
- MICRO-35, 2002
- CGO-1, 2003
- MICRO-36, 2003
- IEEE Transactions on Computers, October 2005
- Concurrency and Computation: Practice and Experience (to appear late 2005 / 2006)
- Heterogeneous Data Cache Memories
- Technical report UPC-DAC-RR-ARCO-2004-4, 2004
- PACT, 2005
49. Questions?