Two Ways to Exploit Multi-Megabyte Caches - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Two Ways to Exploit Multi-Megabyte Caches


1
Two Ways to Exploit Multi-Megabyte Caches
  • AENAO Research Group @ Toronto
  • Kaveh Aasaraai
  • Ioana Burcea
  • Myrto Papadopoulou
  • Elham Safi
  • Jason Zebchuk
  • Andreas Moshovos

{aasaraai, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2
Future Caches Just Larger?
[Diagram: CPU with I and D caches, an interconnect, a cache of 10s-100s of MB, and main memory]
  • Big Picture Management
  • Store Metadata

3
Conventional Block Centric Cache
Fine-Grain View of Memory
L2 Cache
  • Small blocks
  • Optimize bandwidth and performance, especially for large L2/L3 caches

Big Picture Lost
4
Big Picture View
Coarse-Grain View of Memory
L2 Cache
  • Region: a 2^n-sized, aligned area of memory
  • Patterns and behavior exposed
  • Spatial locality
  • Exploit for performance/area/power
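As a quick illustration, a region tag is just the high-order address bits above the region offset; a minimal C sketch, assuming the 1KB regions used later in the talk:

```c
#include <stdint.h>

#define REGION_BITS 10  /* 1KB regions (2^10 bytes); an assumption */

/* Region tag: the address bits above the region offset. Two addresses
 * share a region tag exactly when they fall in the same region. */
static inline uint64_t region_tag(uint64_t addr) {
    return addr >> REGION_BITS;
}

/* Aligned base address of the region containing addr. */
static inline uint64_t region_base(uint64_t addr) {
    return addr & ~((1ULL << REGION_BITS) - 1);
}
```

Spatial-locality patterns become visible because every block in a 1KB region maps to the same tag.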

5
Exploiting Coarse-Grain Patterns
[Diagram: a Coarse-Grain Framework embedded in the CPU/L2 tag array, surrounded by existing coarse-grain optimizations: RegionScout, Stealth Prefetching, Circuit-Switched Coherence, Destination-Set Prediction, Coarse-Grain Coherence Tracking, Spatial Memory Streaming, and Run-time Adaptive Cache Hierarchy Management via Reference Analysis]
  • Many existing coarse-grain optimizations
  • Each adds new structures to track coarse-grain information: hard to justify for a commercial design
  • Instead, embed coarse-grain information in the tag array
  • Support many different optimizations with less area overhead: an adaptable optimization FRAMEWORK
6
RegionTracker Solution
[Diagram: L1 caches send block requests to the L2; the conventional tag array is replaced by a RegionTracker that locates data blocks in the data array and also answers region probes with region responses]
  • Manage blocks, but also track and manage regions

7
RegionTracker Summary
  • Replace conventional tag array
  • 4-core CMP with 8MB shared L2 cache
  • Within 1% of original performance
  • Up to 20% less tag area
  • Average 33% less energy consumption
  • Optimization Framework
  • Stealth Prefetching: same performance, 36% less area
  • RegionScout: 2x more snoops avoided, no area overhead

8
Road Map
  • Introduction
  • Goals
  • Coarse-Grain Cache Designs
  • RegionTracker: A Tag Array Replacement
  • RegionTracker: An Optimization Framework
  • Conclusion

9
Goals
  • Conventional Tag Array Functionality
  • Identify data block location and state
  • Leave the data array unchanged
  • Optimization Framework Functionality
  • Is Region X cached?
  • Which blocks of Region X are cached? Where?
  • Evict or migrate Region X
  • Easy to assign properties to each Region
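As an illustration only, the framework queries above could surface through an interface like this hypothetical C sketch (the type names and the 16-block presence mask are assumptions, not the actual design):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical answer to "Is Region X cached? Which blocks? Where?" */
typedef struct {
    bool     cached;        /* is any block of the region cached? */
    uint16_t present_mask;  /* bit i set => block i of the region is cached */
} region_info_t;

/* Count how many of the region's 16 blocks are cached. */
static inline int region_blocks_cached(region_info_t r) {
    int n = 0;
    for (uint16_t m = r.present_mask; m; m &= (uint16_t)(m - 1))
        n++;                /* clear lowest set bit (popcount) */
    return r.cached ? n : 0;
}
```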

10
Coarse-Grain Cache Designs
Large Block Size
[Diagram: tag array and data array holding Region X]
  • Increased bandwidth, decreased hit rates

11
Sector Cache
[Diagram: tag array and data array holding Region X]
  • Decreased hit rates

12
Sector Pool Cache
[Diagram: tag array and data array holding Region X]
  • High associativity (2-4x)

13
Decoupled Sector Cache
[Diagram: tag array, data array, and status table holding Region X]
  • Region information not exposed
  • Region replacement requires scanning multiple
    entries

14
Design Requirements
  • Small block size (64B)
  • Miss-rate does not increase
  • Lookup associativity does not increase
  • No additional access latency
  • (i.e., No scanning, no multiple block evictions)
  • Does not increase latency, area, or energy
  • Allows banking and interleaving
  • Fit in conventional tag array envelope

15
RegionTracker: A Tag Array Replacement
[Diagram: L1 caches and the data array served by three structures: Region Vector Array, Evicted Region Buffer, and Block Status Table]
  • 3 SRAM arrays, combined smaller than tag array

16
Basic Structures
Ex: 8MB, 16-way set-associative cache, 64-byte blocks, 1KB regions
[Diagram: a Region Vector Array (RVA) entry holds a region tag, a V bit, and per-block status/way fields for block0..block15; a Block Status Table (BST) entry holds per-block status]
  • An address selects a specific RVA set and BST set
  • Each RVA entry maps to multiple, consecutive BST sets
  • Each BST entry maps back to one of four RVA sets

17
Common Case Hit
Ex 8MB, 16-way set-associative cache, 64-byte
blocks, 1KB region
49
0
6
10
21
Address
Region Tag
RVA Index
Region Offset
Block Offset
Block Offset
Data Array BST Index
19
6
0
Region Vector Array (RVA)
Block Status Table (BST)
Region Tag

status
block15
block0
3
2
way
V
To Data Array
1
4
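The field extraction above can be sketched in C, assuming the bit boundaries on the slide (a 50-bit address, 64-byte blocks, 1KB regions):

```c
#include <stdint.h>

/* Extract bits [hi:lo] of an address. */
static inline uint64_t bits(uint64_t a, int hi, int lo) {
    return (a >> lo) & ((1ULL << (hi - lo + 1)) - 1);
}

/* Fields per the slide's address split (a sketch, not RTL). */
static inline uint64_t block_offset(uint64_t a)  { return bits(a, 5, 0);   }
static inline uint64_t region_offset(uint64_t a) { return bits(a, 9, 6);   }
static inline uint64_t rva_index(uint64_t a)     { return bits(a, 20, 10); }
static inline uint64_t region_tag(uint64_t a)    { return bits(a, 49, 21); }
static inline uint64_t bst_index(uint64_t a)     { return bits(a, 19, 6);  }
```

Note that the BST index (bits 19:6) overlaps the RVA index and region offset, which is how one RVA entry covers multiple consecutive BST sets.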
18
Worst Case (Rare) Region Miss
[Diagram: the address splits as in the common-case hit: Region Tag (bits 49:21), RVA Index (bits 20:10), Region Offset (bits 9:6), Block Offset (bits 5:0); BST index is bits 19:6. No RVA tag matches: the evicted region's RVA entry moves to the Evicted Region Buffer (ERB), and a pointer (Ptr) links its BST entries to the ERB entry.]
19
Methodology
[Diagram: 4-core CMP: four processors, each with I and D L1 caches, sharing an L2 over an interconnect]
  • Flexus simulator from CMU SimFlex group
  • Based on Simics full-system simulator
  • 4-core CMP modeled after Piranha
  • Private 32KB, 4-way set-associative L1 caches
  • Shared 8MB, 16-way set-associative L2 cache
  • 64-byte blocks
  • Miss-rates: functional simulation of 2 billion instructions per core
  • Performance and energy: timing simulation using SMARTS sampling methodology
  • Area and power: full-custom implementation in a 130nm commercial technology
  • 9 commercial workloads
  • WEB: SpecWEB on Apache and Zeus
  • OLTP: TPC-C on DB2 and Oracle
  • DSS: 5 TPC-H queries on DB2
20
Miss-Rates vs. Area
[Plot: relative miss-rate vs. relative tag array area (lower-left is better); labeled points include Sector Cache (0.25, 1.26) and 48-way, 52-way, 14-way, and 15-way configurations]
  • Sector Cache: 512KB sectors; SPC and RT: 1KB regions
  • Trade-offs comparable to conventional cache

21
Performance Energy
[Plots: normalized execution time (lower is better) and reduction in tag energy (higher is better)]
  • 12-way set-associative RegionTracker: 20% less area
  • Error bars: 95% confidence intervals
  • Performance within 1%, with 33% tag energy reduction

22
Road Map
  • Introduction
  • Goals
  • Coarse-Grain Cache Designs
  • RegionTracker: A Tag Array Replacement
  • RegionTracker: An Optimization Framework
  • Conclusion

23
RegionTracker: An Optimization Framework
  • Stealth Prefetching: average 20% performance improvement; drop in RegionTracker for 36% less area overhead
  • RegionScout: in-depth analysis
24
Snoop Coherence Common Case
[Diagram: one CPU broadcasts Read x; the other CPUs snoop and miss; main memory responds. Reads x1..xn each repeat the same broadcast.]
Many snoops are to non-shared regions
25
RegionScout
[Diagram: each CPU tracks its locally cached regions; when Read x misses everywhere, every node also reports a region miss, so the requester records a global region miss and marks the region non-shared]
Eliminate broadcasts for non-shared regions
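A minimal sketch of the filtering idea, assuming an exact-tag table in place of RegionScout's compact hash-based filters (names and sizes are illustrative):

```c
#include <stdbool.h>
#include <stdint.h>

#define NSR_ENTRIES 64  /* illustrative size */

/* Simplified non-shared-region table: remembers regions for which a
 * previous snoop came back as a global region miss. */
typedef struct {
    uint64_t tag[NSR_ENTRIES];
    bool     valid[NSR_ENTRIES];
} nsr_t;

/* A hit means the region is known non-shared: skip the broadcast. */
static bool needs_broadcast(const nsr_t *t, uint64_t region) {
    unsigned i = (unsigned)(region % NSR_ENTRIES);
    return !(t->valid[i] && t->tag[i] == region);
}

/* Called when a snoop for this region returned a global region miss. */
static void record_global_miss(nsr_t *t, uint64_t region) {
    unsigned i = (unsigned)(region % NSR_ENTRIES);
    t->tag[i] = region;
    t->valid[i] = true;
}
```

Any remote allocation into a recorded region would have to invalidate the entry; that handshake is omitted here.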
26
RegionTracker Implementation
  • Minimal overhead to support RegionScout
    optimization
  • Still uses less area than conventional tag array

27
RegionTracker + RegionScout
  • 4 processors, 512KB L2 Caches
  • 1KB regions

[Plot: reduction in snoop broadcasts (higher is better), compared against BlockScout (4KB)]
Avoid 41% of snoop broadcasts, with no area overhead compared to a conventional tag array
28
Result Summary
  • Replace conventional tag array
  • 20% less tag area
  • 33% less tag energy
  • Within 1% of original performance
  • Coarse-grain optimization framework
  • 36% reduction in area overhead for Stealth Prefetching
  • Filter 41% of snoop broadcasts with no area overhead compared to a conventional cache

29
Predictor Virtualization
  • Ioana Burcea
  • Joint work with
  • Stephen Somogyi
  • Babak Falsafi

30
Optimization Engines: Predictors
Predictor Virtualization
[Diagram: CMP with many CPUs, each with private L1-I and L1-D caches, connected through an interconnect to a shared L2 and main memory]
31
Motivating Trends
  • Dedicating resources to predictors is hard to justify
  • Chip multiprocessors: space dedicated to predictors x number of processors
  • Larger predictor tables mean increased performance
  • Memory hierarchies offer the opportunity
  • Increased capacity: how many apps really use the space?

Use conventional memory hierarchies to store
predictor information
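The idea can be sketched as a tiny dedicated cache backed by an ordinary in-memory table; in this hypothetical C sketch a plain array stands in for the PVTable in the memory hierarchy (names and sizes are assumptions):

```c
#include <stdint.h>
#include <string.h>

#define PVC_ENTRIES 16  /* illustrative PVCache size */

typedef struct { uint64_t key; uint32_t value; int valid; } pvc_line_t;

typedef struct {
    pvc_line_t cache[PVC_ENTRIES];  /* small dedicated PVCache */
    uint32_t  *backing;             /* predictor table in ordinary memory */
    uint64_t   misses;              /* fetches through the hierarchy */
} pv_t;

/* Look up predictor entry idx: hit in the PVCache if possible,
 * otherwise fetch from the in-memory table and fill the line. */
static uint32_t pv_lookup(pv_t *pv, uint64_t idx) {
    pvc_line_t *l = &pv->cache[idx % PVC_ENTRIES];
    if (!(l->valid && l->key == idx)) {
        pv->misses++;
        l->key = idx;
        l->value = pv->backing[idx];
        l->valid = 1;
    }
    return l->value;
}
```

The scheme only pays off when most lookups hit in the PVCache, which is exactly the reuse question the next slides address.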
32
PV Architecture contd.
Optimization Engine
[Diagram: the engine's request indexes a dedicated predictor table, which returns a prediction]
33
PV Architecture contd.
Optimization Engine
[Diagram: the dedicated predictor table is replaced by a Predictor Virtualization layer between request and prediction]
34
PV Architecture contd.
Optimization Engine
[Diagram: requests and predictions go through a PVProxy on the backside of the L1; it holds a PVCache with MSHRs, indexed from PVStart, and misses fetch entries from the PVTable stored in the L2 / main memory]
35
To Virtualize Or Not to Virtualize?
Common case:
  1. Re-use
  2. Predictor info prefetching

36
To Virtualize or Not?
  • Challenge
  • Hit in the PVCache most of the time
  • Will not work for all predictors out of the box
  • Reuse is necessary
  • Intrinsic
  • Easy to virtualize
  • Non-intrinsic
  • Must be engineered
  • More so if the predictor needs to be fast to
    start with

37
Will There Be Reuse?
  • Intrinsic
  • Multiple predictions per entry
  • We'll see an example
  • Can be engineered
  • Group temporally correlated entries together in a cache block
38
Spatial Memory Streaming
  • Footprint
  • Blocks accessed per memory region
  • Predict next time the footprint will be the same
  • Handle PC offset within region
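The footprint bookkeeping can be sketched as a per-region bitmap, assuming 1KB regions and 64-byte blocks (16 blocks per region), as in the RegionTracker slides:

```c
#include <stdint.h>

#define REGION_BITS 10  /* 1KB regions (assumed) */
#define BLOCK_BITS  6   /* 64B blocks -> 16 blocks per region */

/* Record a block access into a region's footprint bitmap: the bit
 * index is the block's position within its region. */
static inline uint32_t footprint_add(uint32_t fp, uint64_t addr) {
    unsigned blk = (unsigned)((addr >> BLOCK_BITS) &
                              ((1u << (REGION_BITS - BLOCK_BITS)) - 1));
    return fp | (1u << blk);
}
```

SMS records such a bitmap over a spatial generation, then replays it as prefetches the next time the same trigger recurs.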

39
Spatial Generations
40
Virtualizing SMS
[Diagram: the detector records patterns; on a trigger access the predictor reads patterns and issues prefetches; the pattern tables are what gets virtualized]
41
Virtualizing SMS
[Diagram: a virtual table of 1K sets is mapped into the PVCache; each 64-byte cache block packs 11 entries of {11-bit tag, 32-bit pattern}, with the remaining bits unused]
42
Packing Entries in One Cache Block
  • Index: PC + offset within spatial group
  • PC: 16 bits
  • 32 blocks in a spatial group -> 5-bit offset
  • -> 21-bit index
  • 32-bit spatial pattern
  • Pattern table: 1K sets -> 10 bits to index the table -> 11-bit tag
  • Cache block: 64 bytes -> 11 entries per cache block
  • -> Pattern table: 1K sets, 11-way set-associative
[Diagram: each cache block packs 11 {11-bit tag, 32-bit pattern} entries, remaining bits unused]
43
Memory Address Calculation
[Diagram: the 16-bit PC and 5-bit block offset form the index; 10 bits of it select a set, and the memory address is PV Start Address + (set index << 6), the low six bits (000000) being the block offset]
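A C sketch of this address calculation, assuming the low 10 bits of the 21-bit index select the set (the slide does not pin down which 10 bits are used):

```c
#include <stdint.h>

/* Form the memory address of a virtualized pattern-table set:
 * {16-bit PC, 5-bit block offset} -> 21-bit index; 10 bits pick one
 * of 1K sets; each set occupies one 64-byte block at PVStart. */
static inline uint64_t pv_table_addr(uint64_t pv_start,
                                     uint32_t pc16, uint32_t blk_off5) {
    uint32_t index = ((pc16 & 0xFFFFu) << 5) | (blk_off5 & 0x1Fu);
    uint32_t set   = index & 0x3FFu;          /* 10-bit set index */
    return pv_start + ((uint64_t)set << 6);   /* 64B per set */
}
```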
44
Simulation Infrastructure
  • SimFlex: CMU Impetus
  • Full-system simulator based on Simics
  • Base processor configuration:
  • 8-wide OoO
  • 256-entry ROB / 64-entry LSQ
  • L1D/L1I: 64KB, 4-way set-associative
  • UL2: 8MB, 16-way set-associative
  • Commercial workloads:
  • TPC-C: DB2 and Oracle
  • TPC-H: Query 1, Query 2, Query 16, Query 17
  • Web: Apache and Zeus

45
SMS Performance Potential
[Plot: SMS performance potential (higher is better)]
46
Virtualized Spatial Memory Streaming
[Plot: virtualized vs. original SMS performance (higher is better)]
Original prefetcher cost: 60KB. Virtualized prefetcher cost: <1KB. Nearly identical performance.
47
Impact of Virtualization on L2 Misses
48
Impact of Virtualization on L2 Requests
49
Coarse-Grain Tracking
  • Jason Zebchuk