Clustered Data Cache Designs for VLIW Processors (presentation transcript)
1
Clustered Data Cache Designs for VLIW Processors
  • PhD Candidate: Enric Gibert
  • Advisors: Antonio González, Jesús Sánchez

2
Motivation
  • Two major problems in processor design
  • Wire delays
  • Energy consumption

D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997.
Data from www.sandpile.org
3
Clustering
[Figure: a clustered VLIW processor. Four clusters, each with its functional units (FUs) and register file, connected by register-to-register communication buses; all clusters share the L1 cache (backed by the L2 cache) through the memory buses.]
4
Data Cache
  • Latency
  • Energy
  • Leakage will soon dominate energy consumption
  • Cache memories will probably be the main source
    of leakage

SIA projections (64KB cache):
  Technology   100nm     70nm      50nm      35nm
  Latency      4 cycles  5 cycles  7 cycles  7 cycles
(S. Hill, Hot Chips 13)
  • In this Thesis
  • Latency Reduction Techniques
  • Energy Reduction Techniques

5
Contributions of this Thesis
  • Memory hierarchy for clustered VLIW processors
  • Latency Reduction Techniques
  • Distribution of the Data Cache among clusters
  • Cost-effective cache coherence solutions
  • Word-Interleaved distributed data cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy Reduction Techniques
  • Heterogeneous Multi-module Data Cache
  • Unified processors
  • Clustered processors

6
Evaluation Framework
  • IMPACT C compiler
  • Compilation, optimization, and memory disambiguation
  • Mediabench benchmark suite

Benchmark   Profile input   Execution input
adpcmdec    clinton         S_16_44
adpcmenc    clinton         S_16_44
epicdec     test_image      titanic
epicenc     test_image      titanic
g721dec     clinton         S_16_44
g721enc     clinton         S_16_44
gsmdec      clinton         S_16_44
gsmenc      clinton         S_16_44
jpegdec     testimg         monalisa
jpegenc     testimg         monalisa
mpeg2dec    mei16v2         tek6
pegwitdec   pegwit          techrep
pegwitenc   pgptest         techrep
pgpdec      pgptext         techrep
pgpenc      pgptest         techrep
rasta       ex5_c1          ex5_c1
  • Microarchitectural VLIW simulator

7
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

8
Distributing the Data Cache
[Figure: the L1 cache is distributed among the clusters. Each cluster keeps its functional units and register file; the per-cluster cache modules connect to the shared L2 cache, and clusters communicate through register-to-register buses.]
9
MultiVLIW
[Figure: MultiVLIW. Each cluster has its own L1 cache module holding cache blocks; an MSI cache coherence protocol keeps the modules consistent with each other and with the shared L2 cache. Clusters communicate through register-to-register buses.]
(Sánchez and González, MICRO-33)
10
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

11
Memory Coherence
[Figure: per-cluster cache modules in front of the next memory level. Remote accesses, misses, replacements, and other traffic share the memory buses, so the latency of any individual access is non-deterministic.]

Example schedule:
  cycle i     store to X (one cluster)
  cycle i+4   load from X (another cluster)

With non-deterministic bus latency, the load may reach the cache module holding X before the store does, so the schedule alone cannot guarantee that the load sees the new value.
12
Coherence Solutions Overview
  • Local scheduling solutions → applied to loops
  • Memory Dependent Chains (MDC)
  • Data Dependence Graph Transformations (DDGT)
  • Store replication
  • Load-store synchronization
  • Software-based solutions with little hardware
    support
  • Applicable to different configurations
  • Word-interleaved cache
  • Replicated distributed cache
  • Flexible Compiler-Managed L0 Buffers

13
Scheme 1: Memory Dependent Chains
  • Sets of memory-dependent instructions
  • Memory disambiguation by the compiler
  • Conservative assumptions
  • Assign instructions in the same set to the same cluster (a grouping sketch follows the figure)

[Figure: loads and stores to X form a memory-dependent set (memory deps), with register deps to an ADD; the whole set is assigned to one cluster, so X is always accessed from the same cache module.]
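As an illustrative sketch of how a compiler could form these sets (an assumption made for illustration, not the actual IMPACT implementation), the following C fragment groups memory instructions with union-find over the conservative memory-dependence edges; every instruction in a set then goes to the same cluster:

    /* Sketch: grouping memory-dependent instructions with union-find.
       The toy IR and names are illustrative assumptions. */
    #include <stdio.h>

    #define NINSTR 5

    static int parent[NINSTR];

    static int find(int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];   /* path halving */
            x = parent[x];
        }
        return x;
    }

    static void unite(int a, int b) {
        parent[find(a)] = find(b);
    }

    int main(void) {
        /* Conservative memory-dependence edges from disambiguation,
           e.g. (0,1) = two instructions that may touch the same address. */
        int deps[3][2] = { {0, 1}, {1, 2}, {3, 4} };

        for (int i = 0; i < NINSTR; i++) parent[i] = i;
        for (int d = 0; d < 3; d++) unite(deps[d][0], deps[d][1]);

        /* Instructions with the same representative form one set and
           are assigned to the same cluster. */
        for (int i = 0; i < NINSTR; i++)
            printf("instr %d -> set %d\n", i, find(i));
        return 0;
    }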
14
Scheme 2: DDG Transformations (I)
  • Two transformations applied together
  • Store replication → overcomes memory-flow (MF) and memory-output (MO) dependences
  • Little support needed from the hardware

[Figure: store replication. The store to X is replicated on all four clusters, so every local cache module observes the write and a later load from X can be served locally.]
15
Scheme 2: DDG Transformations (II)
  • Load-store synchronization → overcomes memory-anti (MA) dependences (a combined sketch follows the figure)

[Figure: load-store synchronization. The load from X signals the cluster holding the conflicting store through the register file, so the store cannot overwrite X before the load has read the old value.]
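A minimal sketch of the two transformations together, on a hypothetical toy IR (the Node type, cluster count, and instruction names are all assumptions):

    /* Sketch of both DDG transformations on a toy IR; structures and
       policies are assumptions made for illustration. */
    #include <stdio.h>

    #define NCLUSTERS 4

    typedef struct { const char *op; int cluster; } Node;

    int main(void) {
        /* Store replication: emit one copy of the store per cluster so
           every local cache module observes the write (solves MF/MO). */
        Node store_copies[NCLUSTERS];
        for (int c = 0; c < NCLUSTERS; c++) {
            store_copies[c].op = "store X";
            store_copies[c].cluster = c;
        }

        /* Load-store synchronization (MA): add an explicit register-level
           edge from the load to the store copy on the load's cluster, so
           that copy cannot overwrite X before the load reads the old value. */
        Node load = { "load X", 2 };
        printf("sync edge: %s (cluster %d) -> %s (cluster %d)\n",
               load.op, load.cluster,
               store_copies[load.cluster].op,
               store_copies[load.cluster].cluster);
        return 0;
    }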
16
Results: Memory Coherence
  • Memory Dependent Chains (MDC)
  • Bad because it restricts the assignment of instructions to clusters
  • Good when memory disambiguation is accurate
  • DDG Transformations (DDGT)
  • Good when there is pressure in the memory buses
  • Increases number of local accesses
  • Bad when there is pressure in the register buses
  • Big increase in inter-cluster communications
  • Solutions useful for different cache schemes

17
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

18
Word-Interleaved Cache
  • Simplify hardware (as compared to MultiVLIW)
  • Avoid replication
  • Strides of +1/-1 elements are predominant
  • Page interleaved
  • Block interleaved
  • Word interleaved → best suited (an address-mapping sketch follows)

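To make the interleavings concrete, here is a small C sketch of which cluster owns a given address under each granularity (all sizes are illustrative assumptions):

    /* Sketch: cluster ownership under each interleaving granularity. */
    #include <stdio.h>
    #include <stdint.h>

    #define NCLUSTERS  4
    #define WORD_SIZE  4      /* bytes */
    #define BLOCK_SIZE 16     /* bytes */
    #define PAGE_SIZE  4096   /* bytes */

    static int word_cluster(uint32_t a)  { return (a / WORD_SIZE)  % NCLUSTERS; }
    static int block_cluster(uint32_t a) { return (a / BLOCK_SIZE) % NCLUSTERS; }
    static int page_cluster(uint32_t a)  { return (a / PAGE_SIZE)  % NCLUSTERS; }

    int main(void) {
        /* With stride-1 word accesses (the common case), word interleaving
           spreads consecutive elements round-robin: a[0]->0, a[1]->1, ... */
        for (uint32_t a = 0; a < 32; a += WORD_SIZE)
            printf("addr %2u: word->%d  block->%d  page->%d\n",
                   (unsigned)a, word_cluster(a), block_cluster(a),
                   page_cluster(a));
        return 0;
    }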
19
Architecture
[Figure: word-interleaved distributed cache. Each cluster has its own tag array and cache module in front of the shared L2 cache, plus functional units and a register file; clusters communicate through register-to-register buses.]
20
Instruction Scheduling (I): Loop Unrolling
[Figure: the words of array a are interleaved across the four cache modules: a[0], a[4] in cluster 1; a[1], a[5] in cluster 2; a[2], a[6] in cluster 3; a[3], a[7] in cluster 4.]

Original loop:

    for (i=0; i<MAX; i++)
        ld r3, @a[i]

Unrolled by the number of clusters (4), each load can be assigned to the cluster that owns its word, giving 100% local accesses:

    for (i=0; i<MAX; i+=4) {
        ld r3, @a[i]      /* cluster 1 */
        ld r3, @a[i+1]    /* cluster 2 */
        ld r3, @a[i+2]    /* cluster 3 */
        ld r3, @a[i+3]    /* cluster 4 */
    }
21
Instruction Scheduling (II)
  • Assign an appropriate latency to each memory instruction
  • Small assumed latencies → compact schedules, but more stall time
  • Large assumed latencies → less stall time, but longer schedules
  • Start with the largest latency (remote miss) and iteratively reassign shorter latencies (local miss, remote hit, local hit); a sketch follows the figure

[Figure: a load feeding an add through the register file; the latency assumed for the load determines how far apart the scheduler places the two instructions.]
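A minimal C sketch of the iterative latency-assignment loop described above, with illustrative cycle counts (the real scheduler decides whether each reassignment is profitable from the resulting schedule):

    /* Sketch of iterative latency assignment; the four latency classes
       come from the slide, the cycle counts are assumptions. */
    #include <stdio.h>

    enum { LOCAL_HIT, REMOTE_HIT, LOCAL_MISS, REMOTE_MISS };
    static const int latency[] = { 2, 4, 12, 14 };   /* cycles (assumed) */

    int main(void) {
        int assumed = REMOTE_MISS;   /* start with the worst case */
        /* The real scheduler keeps a reassignment only if the resulting
           schedule (length + estimated stalls) improves. */
        while (assumed > LOCAL_HIT) {
            printf("scheduling pass assuming %d cycles\n", latency[assumed]);
            assumed--;   /* remote miss -> local miss -> remote hit -> local hit */
        }
        printf("final pass assuming a local hit (%d cycles)\n",
               latency[LOCAL_HIT]);
        return 0;
    }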
22
Instruction Scheduling (III)
  • Assign instructions to clusters
  • Non-memory instructions
  • Minimize inter-cluster communications
  • Maximize workload balance among clusters
  • Memory instructions → two heuristics
  • Preferred cluster (PrefClus): average preferred cluster of the memory-dependent set
  • Minimize inter-cluster communications (MinComs): minimize communications for the first instruction of the memory-dependent set

23
Memory Accesses
  • Sources of remote accesses
  • Indirect accesses, memory-dependence chain restrictions, double-precision data, ...

24
Attraction Buffers
  • Cost-effective mechanism → increases local accesses

[Figure: Attraction Buffers (AB). Words a[0], a[4] live in cluster 1's cache module, a[1], a[5] in cluster 2's, and so on. A loop that loads a[i] with i += 4 from another cluster attracts copies of the words it touches into its local AB, raising its local accesses from 0% to 50% in the example.]
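A rough C sketch of an Attraction Buffer's behaviour, assuming a tiny direct-mapped word buffer (the entry count and fields are assumptions): the first access to a remotely-mapped word goes remote and attracts a local copy; later accesses to the same word hit locally:

    /* Sketch of an Attraction Buffer: a tiny direct-mapped word buffer
       per cluster (entry count and fields are assumptions). */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>

    #define AB_ENTRIES 8

    typedef struct { uint32_t addr; uint32_t data; bool valid; } ABEntry;
    static ABEntry ab[AB_ENTRIES];

    static ABEntry *ab_entry(uint32_t addr) {
        return &ab[(addr >> 2) % AB_ENTRIES];   /* index by word address */
    }

    static bool ab_lookup(uint32_t addr, uint32_t *data) {
        ABEntry *e = ab_entry(addr);
        if (e->valid && e->addr == addr) { *data = e->data; return true; }
        return false;
    }

    /* On a remote access, attract a local copy of the word. */
    static void ab_fill(uint32_t addr, uint32_t data) {
        ABEntry *e = ab_entry(addr);
        e->addr = addr; e->data = data; e->valid = true;
    }

    int main(void) {
        uint32_t v;
        if (!ab_lookup(0x40, &v)) {   /* first access: goes remote */
            v = 123;                  /* value returned by the remote module */
            ab_fill(0x40, v);
        }
        /* The next access to the same word is served locally. */
        printf("local hit: %d\n", ab_lookup(0x40, &v) ? 1 : 0);
        return 0;
    }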
  • Results
  • 15% increase in local accesses
  • 30-35% reduction in stall time
  • 5-7% reduction in overall execution time
25
Performance
26
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

27
Why L0 Buffers?
  • Keep the hardware simple, but...
  • ...allow a dynamic binding between addresses and clusters

28
L0 Buffers
[Figure: each cluster adds a small L0 Buffer next to its INT/FP/MEM units, between the register file and the shared L1 cache.]
  • Small number of entries → flexibility
  • Adaptive to the application: dynamic address-cluster binding
  • Controlled by software → load/store hints
  • Hints mark which instructions access the buffers, and how
  • Flexible Compiler-Managed L0 Buffers

29
Mapping Flexibility
[Figure: mapping flexibility. The words of a 16-byte L1 block can be bound to the L0 Buffers of the four clusters in different ways (e.g., linearly or interleaved), because the address-to-cluster binding is dynamic.]
30
Hints and L0-L1 Interface
  • Memory hints
  • Access or bypass the L0 Buffers
  • Data mapping: linear / interleaved
  • Prefetch hints → next/previous blocks
  • L0 Buffers are write-through with respect to L1
  • Simplifies replacements
  • Keeps the hardware simple
  • No arbitration
  • No logic to pack data back correctly
  • Simplifies coherence among L0 Buffers

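One possible encoding of these hints as per-instruction bits, purely as a sketch (the field layout is an assumption, not the thesis' actual ISA extension):

    /* Sketch of one possible per-instruction hint encoding. */
    #include <stdio.h>

    typedef struct {
        unsigned use_l0     : 1;  /* 1: access the L0 Buffer, 0: bypass it */
        unsigned interleaved: 1;  /* data mapping: 0 linear, 1 interleaved */
        unsigned prefetch   : 2;  /* 0: none, 1: next block, 2: previous block */
    } MemHints;

    int main(void) {
        MemHints h = { 1, 1, 1 };   /* use L0, interleaved, prefetch next */
        printf("use_l0=%u interleaved=%u prefetch=%u\n",
               h.use_l0, h.interleaved, h.prefetch);
        return 0;
    }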
31
Instruction Scheduling
  • Selective loop unrolling
  • No unrolling vs. unrolling by N
  • Assign latencies to memory instructions
  • Critical instructions (little slack) use the L0 Buffers
  • Do not overflow the L0 Buffers
  • Keep a counter of free L0 Buffer entries per cluster
  • Do not schedule a critical instruction on a cluster whose counter is 0 (sketch below)
  • Memory coherence
  • Cluster assignment + instruction scheduling
  • Minimize global communications
  • Maximize workload balance
  • Critical instructions → priority to clusters where the L0 Buffer can be used
  • Explicit prefetching

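A minimal sketch of the overflow-avoidance heuristic, assuming a per-cluster counter of free L0 Buffer entries (the counts and fallback policy are assumptions):

    /* Sketch of the free-entry counters used to avoid overflowing
       the L0 Buffers. */
    #include <stdio.h>

    #define NCLUSTERS 4
    static int free_entries[NCLUSTERS] = { 0, 2, 1, 0 };

    /* Pick a cluster with a free L0 entry for a critical instruction;
       -1 means none is available and the access bypasses the L0. */
    static int pick_cluster_for_critical(void) {
        for (int c = 0; c < NCLUSTERS; c++)
            if (free_entries[c] > 0) { free_entries[c]--; return c; }
        return -1;
    }

    int main(void) {
        printf("critical load -> cluster %d\n", pick_cluster_for_critical());
        return 0;
    }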
32
Number of Entries
33
Performance
34
Global Comparative
                                        MultiVLIW   Word-Interleaved   L0 Buffers
Hardware complexity (lower is better)   High        Low                Low
Software complexity (lower is better)   Low         Medium             High
Performance (higher is better)          High        Medium             High
35
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

36
Motivation
  • Energy consumption → first-class design goal
  • Heterogeneity
  • Lower supply voltage and/or higher threshold voltage in some structures
  • Cache memory energy in the ARM10
  • D-cache → 24% of dynamic energy
  • I-cache → 22% of dynamic energy
  • Exploit heterogeneity in the L1 D-cache?

[Figure: in a heterogeneous processor, some front-end/back-end structures are tuned for performance while others are tuned for energy.]
37
Multi-Module Data Cache
Instruction-Based Multi-Module (Abella and González, ICCD 2003)
[Figure: a criticality table indexed by the instruction PC steers each memory access either to a FAST cache module or to a SLOW cache module, both backed by the L2 D-cache.]
38
Cache Configurations
39
Instr.-to-Variable Graph (IVG)
  • Built with profiling information
  • Variables: global, local, and heap

[Figure: an Instruction-to-Variable Graph. Memory instructions LD1-LD5, ST1, ST2 are connected to the variables V1-V4 they access; each variable is mapped to either the FIRST or the SECOND module.]
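A small C sketch of building the IVG from profile data as an access-count matrix (toy sizes; the representation is an assumption):

    /* Sketch: the IVG as a profile-derived access-count matrix. */
    #include <stdio.h>

    #define NINSTR 4
    #define NVARS  3

    /* ivg[i][v] = profiled accesses of memory instruction i to variable v
       (globals, locals, and heap objects alike). */
    static long ivg[NINSTR][NVARS];

    static void record_access(int instr, int var) { ivg[instr][var]++; }

    int main(void) {
        record_access(0, 0);   /* e.g. LD1 touches V1 */
        record_access(0, 1);   /* LD1 also touches V2 */
        record_access(1, 1);   /* LD2 touches V2 */
        for (int i = 0; i < NINSTR; i++)
            for (int v = 0; v < NVARS; v++)
                if (ivg[i][v])
                    printf("I%d -- V%d (%ld accesses)\n", i, v, ivg[i][v]);
        return 0;
    }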
40
Greedy Mapping Algorithm
Steps: compute IVG → compute and propagate affinities → compute mapping → schedule code
  • Initial mapping → all variables to the first address space
  • Assign affinities to instructions
  • An affinity in [0,1] expresses a preferred module for each memory instruction
  • Propagate affinities to the other instructions
  • Schedule code + refine the mapping (an affinity sketch follows)

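As a sketch of the affinity computation, assuming the convention that affinity 0 means the FIRST module and 1 the SECOND (the counts, mapping, and names are illustrative):

    /* Sketch of affinity computation for one memory instruction. */
    #include <stdio.h>

    #define NVARS 3
    static long accesses[NVARS]  = { 90, 10, 0 };  /* per-variable counts */
    static int  in_second[NVARS] = { 0, 1, 0 };    /* current mapping     */

    /* Fraction of the instruction's accesses that go to SECOND-module
       variables; non-memory instructions inherit the average affinity
       of their DDG neighbours. */
    static double affinity(void) {
        long second = 0, total = 0;
        for (int v = 0; v < NVARS; v++) {
            total += accesses[v];
            if (in_second[v]) second += accesses[v];
        }
        return total ? (double)second / total : 0.5;
    }

    int main(void) {
        printf("affinity = %.2f\n", affinity());   /* 0.10 -> prefers FIRST */
        return 0;
    }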
41
Computing and Propagating Affinity
[Figure: affinity computation and propagation on a data dependence graph. Loads of variables mapped to the FIRST module get affinity 0, loads of variables mapped to the SECOND module get affinity 1, and intermediate instructions inherit averaged values (e.g., 0.4) through the dependence edges.]
42
Cluster Assignment
  • Cluster affinity and affinity range → used to
  • Define a preferred cluster
  • Guide the instruction-to-cluster assignment process
  • Strongly preferred cluster
  • Schedule the instruction in that cluster
  • Weakly preferred cluster
  • Schedule the instruction where global communications are minimized (sketch below)

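A minimal sketch of how an affinity value could be turned into a strong or weak preference; the 0.3 / 0.7 thresholds are illustrative assumptions:

    /* Sketch: classifying an affinity into a strong or weak preference. */
    #include <stdio.h>

    typedef enum { STRONG_FIRST, STRONG_SECOND, WEAK } Preference;

    static Preference classify(double affinity) {
        if (affinity <= 0.3) return STRONG_FIRST;   /* schedule there */
        if (affinity >= 0.7) return STRONG_SECOND;  /* schedule there */
        return WEAK;   /* schedule where global communications are minimized */
    }

    int main(void) {
        printf("%d %d %d\n", classify(0.1), classify(0.5), classify(0.9));
        return 0;
    }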
43
EDD Results
                               Memory ports: sensitive   Memory ports: insensitive
Memory latency: sensitive      FAST+FAST                 FAST+NONE
Memory latency: insensitive    SLOW+SLOW                 SLOW+NONE
44
Other Results
       BEST    UNIFIED FAST   UNIFIED SLOW
EDD    0.89    1.29           1.25
ED     0.89    1.25           1.07
  • ED
  • The SLOW schemes are better
  • In all cases, these schemes are better than a unified cache
  • 29-31% better in EDD, 19-29% better in ED
  • No configuration is best for all cases

45
Reconfigurable Cache Results
  • The OS can set each module in one state
  • FAST mode / SLOW mode / turned off
  • The OS reconfigures the cache on a context switch
  • Depending on the applications scheduled in and scheduled out
  • Two different VDD and VTH values for the cache
  • Reconfiguration overhead: 1-2 cycles [Flautner et al., 2002]
  • Simple heuristic to show the potential
  • For each application, choose the estimated best cache configuration (sketch below)

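A rough sketch of the reconfiguration path on a context switch (the data structures and selection function are assumptions):

    /* Sketch of the context-switch reconfiguration path. */
    #include <stdio.h>

    typedef enum { MODE_FAST, MODE_SLOW, MODE_OFF } ModuleMode;
    typedef struct { ModuleMode module[2]; } CacheConfig;

    /* Estimated best configuration per application, chosen via profiling. */
    static CacheConfig best_for(int app_id) {
        CacheConfig c = { { MODE_FAST, MODE_SLOW } };
        (void)app_id;   /* a real OS would index a per-application table */
        return c;
    }

    static void on_context_switch(int next_app) {
        CacheConfig c = best_for(next_app);
        /* Applying the new VDD/VTH modes costs only 1-2 cycles
           (Flautner et al., 2002). */
        printf("module0=%d module1=%d\n", c.module[0], c.module[1]);
    }

    int main(void) { on_context_switch(0); return 0; }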
       BEST DISTRIBUTED    RECONFIGURABLE SCHEME
EDD    0.89 (FAST+SLOW)    0.86
ED     0.89 (SLOW+SLOW)    0.86
46
Presentation Outline
  • Latency reduction techniques
  • Software memory coherence in distributed caches
  • Word-interleaved distributed cache
  • Flexible Compiler-Managed L0 Buffers
  • Energy reduction techniques
  • Multi-Module cache for clustered VLIW processor
  • Conclusions

47
Conclusions
  • Cache partitioning is a good latency-reduction technique
  • Cache heterogeneity can be exploited for energy efficiency
  • The most energy- and performance-efficient scheme is a distributed data cache
  • Dynamic vs. static mapping between addresses and clusters
  • Dynamic for performance (L0 Buffers)
  • Static for energy consumption (variable-based mapping)
  • Hardware- vs. software-based memory coherence solutions
  • Software solutions are viable

48
List of Publications
  • Distributed Data Cache Memories
  • ICS, 2002
  • MICRO-35, 2002
  • CGO-1, 2003
  • MICRO-36, 2003
  • IEEE Transactions on Computers, October 2005
  • Concurrency and Computation: Practice and Experience (to appear, late 2005 / 2006)
  • Heterogeneous Data Cache Memories
  • Technical report UPC-DAC-RR-ARCO-2004-4, 2004
  • PACT, 2005

49
Questions