Title: Region-Centric Memory Design


1
Region-Centric Memory Design
  • AENAO Research Group
  • Patrick Akl, M.A.Sc.
  • Ioana Burcea, Ph.D. C.
  • Myrto Papadopoulou, M.A.Sc. C.
  • Elham Safi, Ph.D. C.
  • Jason Zebchuk, M.A.Sc. C.
  • Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
2
Future On-Chip Caches: Just Larger?
[Figure: CPU with I and D caches, interconnect, a level labeled "10s to 100s of MB", and Main Memory]
  • Observe and Exploit Memory Access Behavior at a
    Coarse Grain

3
Conventional Block-Centric Memory Hierarchy
Conventional Fine-Grain Tracking
  • Small Blocks
  • Performance and Bandwidth
  • Several optimizations exist
  • Big picture is lost

4
Big Picture View
Supplemental Coarse-Grain Tracking
  • Region: 2^n-sized, aligned memory area
  • Concept already in use: TLBs
  • Patterns emerge in space / time
  • Exploit for performance and power
  • Expose to software
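
As a minimal sketch of what "2^n-sized, aligned" means in practice, the helpers below split a byte address into its region number, region base, and block index within the region. The 16 KB region size and 64-byte block size are assumptions chosen for illustration, and the function names are hypothetical.

```python
REGION_BITS = 14  # assumption: 16 KB regions (2**14 bytes)
BLOCK_BITS = 6    # assumption: 64-byte cache blocks

def region_of(addr: int) -> int:
    """Region number: the address with the low REGION_BITS dropped."""
    return addr >> REGION_BITS

def region_base(addr: int) -> int:
    """First byte address of the aligned region containing addr."""
    return addr & ~((1 << REGION_BITS) - 1)

def block_in_region(addr: int) -> int:
    """Index of the block holding addr, counted from the region base."""
    return (addr & ((1 << REGION_BITS) - 1)) >> BLOCK_BITS

addr = 0x12345678
print(hex(region_of(addr)), hex(region_base(addr)), block_in_region(addr))
```

Because regions are aligned powers of two, all three values come from shifts and masks, which is the same property that makes TLB page lookups cheap.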

5
This Presentation
  • Examples of Coarse-Grain Optimizations
      • Snoop Coherence
      • Thread-level speculation disambiguation
  • Region-Centric Memory Design
      • RegionTracker Cache
      • Snoop Coherence Revisited
  • Current Activities
      • Coherence Delegation
      • Predictor Virtualization

6
An Example: Snoop Coherence
[Figure: three CPUs, each with I and D caches, an interconnect, and Main Memory]
  • Conventional Considerations
      • Complexity and Correctness, NOT Power/Bandwidth
  • Can we (1) Reduce Power/Bandwidth and (2) still Leverage snoop
    coherence?
      • Remains Attractive: Simple / Design Re-use
  • Yes: Exploit Program Behavior to Dynamically Identify Requests that
    do not Need Snooping

7
Coherence Basics
[Figure: three CPUs above Main Memory; labels: X, snoop, snoop, hit]
  • Given request for memory block X (address)
  • Detect where current value resides

8
Conventional Coherence not Power-Aware/Bandwidth-Effective
[Figure: three CPUs with L2 caches above Main Memory; labels: miss, miss]
  • All L2 tags see all accesses
  • Perf. / Complexity: have L2 tags, why not use them
  • Power: all L2 tags consume power on all accesses
  • Bandwidth: broadcast all coherent requests
9
RegionScout Motivation: Sharing is Coarse
[Figure: typical memory space snapshot, colored by owner(s); axis: addresses]
  • Region: large contiguous memory area, power-of-2 size
  • CPU asks for data block X in region R
      • No one else has X
      • No one else has any block in R
  • RegionScout Exploits this Behavior
      • Layered Extension over Snoop Coherence

10
Optimization Opportunities
[Figure: labels: SWITCH, Memory]
  • Power and Bandwidth
      • Originating node: avoid asking others
      • Remote node: avoid tag lookup

11
Potential: Region Miss Frequency
[Figure: Global Region Misses as % of all requests vs. Region Size]
Even with a 16K Region, 45% of requests miss in all remote nodes
12
RegionScout at Work: Non-Shared Region Discovery
[Figure: three CPUs and Main Memory; labels: Region Miss, Region Miss, Global Region Miss, Record Non-Shared Regions, Record Locally Cached Regions]
  • First request detects a non-shared region

13
RegionScout at Work: Avoiding Snoops
[Figure: three CPUs and Main Memory; labels: Global Region Miss, Record Non-Shared Regions, Record Locally Cached Regions]
  • Subsequent request avoids snoops

14
RegionScout is Self-Correcting
[Figure: three CPUs and Main Memory; labels: Record Non-Shared Regions, Record Locally Cached Regions]
  • Request from another node invalidates non-shared
    record
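
The three "at work" slides (discovery, snoop avoidance, self-correction) can be read as one small protocol. The sketch below is a toy model of that reading, not the hardware design: `Node`, `request`, and the use of plain Python sets and dicts are assumptions made for illustration, and block-level state, evictions, and data transfer are not modeled.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.cached = {}         # region -> number of locally cached blocks
        self.non_shared = set()  # regions this node believes nobody else caches

def request(nodes, requester, region):
    """Issue a request for a block in `region`; return how it was handled."""
    if region in requester.non_shared:
        outcome = "no-snoop"     # known non-shared: skip the broadcast
    else:
        outcome = "broadcast"
        shared = False
        for node in nodes:
            if node is requester:
                continue
            # Self-correction: a remote request invalidates the non-shared record.
            node.non_shared.discard(region)
            if node.cached.get(region, 0) > 0:
                shared = True
        if not shared:           # global region miss: remember it
            requester.non_shared.add(region)
    requester.cached[region] = requester.cached.get(region, 0) + 1
    return outcome

a, b, c = Node("a"), Node("b"), Node("c")
nodes = [a, b, c]
print(request(nodes, a, 5), request(nodes, a, 5), request(nodes, b, 5))
```

In this model a stale non-shared record can never survive a remote request for the same region, which is the property the "self-correcting" slide relies on.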

15
Implementation Requirements
  • Requesting Node provides address
  • At Originating Node, from CPU
      • Have I discovered that this region is not shared?
  • At Remote Nodes, from Interconnect
      • Do I have a block in the region?

[Figure: address, with the low lg(Region Size) bits marked]
16
Remembering Non-Shared Regions
[Figure: address split into Region Tag and offset; the Region Tag looks up the Non-Shared Region Table, whose entries carry a valid bit]
Few entries: 16x4 in most experiments
  • Records non-shared regions
  • Lookup by Region portion prior to issuing a
    request
  • Snoop requests and invalidate
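
A minimal sketch of such a table follows. It reads "16x4" as 16 sets of 4 ways, which is an assumption, as are the LRU replacement policy, the modulo set index, and the class and method names.

```python
from collections import OrderedDict

class NonSharedRegionTable:
    """Small set-associative table of regions known to be non-shared."""

    def __init__(self, sets=16, ways=4):
        self.sets = [OrderedDict() for _ in range(sets)]
        self.ways = ways

    def _set_for(self, region):
        return self.sets[region % len(self.sets)]  # assumption: modulo index

    def lookup(self, region):
        """Checked before issuing a request; True means the snoop can be skipped."""
        entries = self._set_for(region)
        if region in entries:
            entries.move_to_end(region)  # refresh LRU position
            return True
        return False

    def insert(self, region):
        entries = self._set_for(region)
        entries[region] = True
        entries.move_to_end(region)
        if len(entries) > self.ways:
            entries.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, region):
        """Called when a snooped remote request touches the region."""
        self._set_for(region).pop(region, None)
```

Losing an entry to replacement only costs a missed opportunity: the next request to that region is broadcast as usual, so a very small table stays safe.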

17
What Regions are Locally Cached?
[Figure: address split into Region Tag and offset; one counter per region]
  • If we had as many counters as regions
      • Block Allocation: counter[region]++
      • Block Eviction: counter[region]--
      • Region cached only if counter[region] is non-zero
  • Not Practical
      • E.g., 16K Regions and 4G Memory → 256K counters
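
The counter arithmetic and the exact (impractical) scheme can be sketched as follows. The function names are hypothetical, and a Python `Counter` stands in for the per-region counter array.

```python
from collections import Counter

MEMORY_BYTES = 4 * 2**30   # 4G of memory
REGION_BYTES = 16 * 2**10  # 16K regions
NUM_REGIONS = MEMORY_BYTES // REGION_BYTES  # 2**32 / 2**14 = 2**18 = 256K

block_counts = Counter()   # one logical counter per region

def block_allocated(region):
    block_counts[region] += 1

def block_evicted(region):
    block_counts[region] -= 1

def region_cached(region):
    """Exact answer: the region is cached iff its counter is non-zero."""
    return block_counts[region] != 0

print(NUM_REGIONS)
```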

18
What Regions are Locally Cached?
[Figure: address split into Region Tag and offset; hash() of the Region Tag selects a counter]
  • Imprecise
      • Records a superset of locally cached Regions
      • False positives: lost opportunity, correctness preserved
  • Small: e.g., 256 entries for 1M cache
  • Power-Optimized structures described in the paper
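
A minimal sketch of the hashed-counter idea is shown below, under the assumption that hash() is a simple modulo of the region number. The class and method names are hypothetical, and the power-optimized structures mentioned above are not modeled.

```python
class HashedRegionCounters:
    """Imprecise record of locally cached regions using a few shared counters."""

    def __init__(self, entries=256):
        self.counters = [0] * entries

    def _index(self, region):
        return region % len(self.counters)  # assumption: modulo hash

    def block_allocated(self, region):
        self.counters[self._index(region)] += 1

    def block_evicted(self, region):
        self.counters[self._index(region)] -= 1

    def may_be_cached(self, region):
        """False is always exact; True may be a false positive from aliasing."""
        return self.counters[self._index(region)] != 0

crh = HashedRegionCounters()
crh.block_allocated(3)
print(crh.may_be_cached(3), crh.may_be_cached(3 + 256), crh.may_be_cached(4))
```

Two regions that alias to the same counter make the structure answer "maybe cached" for both, so a remote node may do an unnecessary tag lookup, but it never claims a cached region is absent.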

19
LFSR-Based Implementation
[Figure: address split into Region Tag and offset; hash() selects an entry in an LFSR array, followed by a Zero Detector]
  • Linear-Feedback Shift Register Array
      • Increment / Decrement / Is Zero?
  • 130nm commercial technology
  • ISLPED '06
  • Faster: 1.6x to 3.7x
  • More Energy Efficient: 1.4x to 2.3x
  • But Area: 3.2x
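
The slide does not give the circuit, so the following is only a sketch of the general LFSR-counter idea: each count is represented by one state of a shift-register sequence, increment is one forward step, decrement is the inverse step, and "is zero" is a comparison against a fixed state. The 4-bit width, the feedback taps (bit 3 XOR bit 2), and the choice of 0b0001 as the zero state are assumptions for illustration.

```python
WIDTH = 4
MASK = (1 << WIDTH) - 1
ZERO_STATE = 0b0001  # assumption: state that represents a count of zero

def lfsr_next(state: int) -> int:
    """Forward step (increment): shift left, feed back bit 3 XOR bit 2."""
    feedback = ((state >> 3) ^ (state >> 2)) & 1
    return ((state << 1) | feedback) & MASK

def lfsr_prev(state: int) -> int:
    """Backward step (decrement): exact inverse of lfsr_next."""
    top_bit = ((state & 1) ^ (state >> 3)) & 1
    return (top_bit << 3) | (state >> 1)

class LfsrCounter:
    def __init__(self):
        self.state = ZERO_STATE

    def increment(self):
        self.state = lfsr_next(self.state)

    def decrement(self):
        self.state = lfsr_prev(self.state)

    def is_zero(self):
        return self.state == ZERO_STATE
```

The states do not follow binary order, which is fine here because only increment, decrement, and the zero test are needed; in hardware a step is a shift plus one XOR, with no carry chain.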

20
Filter Rates: SPLASH-II
[Figure: Identified Global Region Misses vs. CRH Size]
Jason Cantin (Wisconsin) studied commercial workloads: 40% filter rate
21
Region-Centric Disambiguation
  • Joint work w/
  • Greg Steffan and Mihai Burcea
  • Patrick Akl
  • Andreas Moshovos

22
Speculative Parallelization Models
  • Thread level speculation
  • Transactional Memory

[Figure: Speculative Parallelization timelines (Original, Good Scenario, Bad Scenario) with accesses "read a", "read b", "write a" ordered in time]
Need to Compare Addresses Across Code Pieces
23
Ex. 2: Region-Centric Disambiguation
[Figure: Memory Space accessed by Task 1 and Task 2, Conventional vs. Region-Centric]
  • Send digest at region level
  • On a region conflict: send block-level info
  • Reduced bandwidth, potential for performance and
    power
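
The two-step exchange can be sketched as below. This is a toy cost model rather than the evaluated mechanism: the 1 KB region and 64-byte block sizes, the function names, and counting "items sent" as a stand-in for traffic are all assumptions.

```python
REGION_BITS = 10  # assumption: 1 KB regions
BLOCK_BITS = 6    # assumption: 64-byte blocks

def regions(addrs):
    return {a >> REGION_BITS for a in addrs}

def blocks(addrs):
    return {a >> BLOCK_BITS for a in addrs}

def disambiguate(writes_task1, reads_task2):
    """Return (conflicting blocks, number of items sent by task 1)."""
    digest = regions(writes_task1)        # step 1: region-level digest
    items_sent = len(digest)
    conflicting = digest & regions(reads_task2)
    if not conflicting:
        return set(), items_sent          # no region conflict: done
    # Step 2: block-level info, only for the regions that conflicted.
    detail = [a for a in writes_task1 if (a >> REGION_BITS) in conflicting]
    write_blocks = blocks(detail)
    items_sent += len(write_blocks)
    return write_blocks & blocks(reads_task2), items_sent

print(disambiguate([0x0000, 0x0040, 0x0080], [0x4000]))
```

When tasks touch disjoint regions, one region identifier replaces several block addresses; when regions overlap, the block-level step still catches the true conflicts.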

24
How Much Traffic Can We Save?
  • TLS benchmarks from STAMPEDE group (G. Steffan)
  • Approximate timing model

Potential for traffic reduction by 38%
25
Exploiting Region-Level Information
  • Region Coherence Arrays
      • Cantin, Lipasti and Smith
  • RegionScout
      • Our work
  • Both of these reduce snoop lookups (and broadcasts) in snoop
    coherence protocols
  • Spatial Memory Prefetching
      • Leverages spatial memory patterns for prefetching with
        commercial workloads
      • Impetus Group at CMU
  • Stealth Prefetching
      • Cantin, Lipasti and Smith

26
Coarse-Grain Techniques Today
[Figure: Conventional Cache (TAGS and DATA) with a separate Auxiliary Tracking structure]
  • Overhead
      • Storage: e.g., 60% of tags
      • Functionality: Restrict placement, Region Evictions
  • Loss of Information
  • Hard to justify for a commercial design

27
Rethinking Cache Design
[Figure: cache with Embedded Tracking: Dual-Grain TAGS and DATA]
  • Can we provide a common substrate for all these
    optimizations?
  • Redesign caches
      • Regions as a first-class citizen
  • RegionTracker Cache

28
RegionTracker Cache
  • Goals
      • Expose region behavior
          • Is region X cached?
          • Which blocks are?
      • Facilitate management at the region level
          • Evict/migrate region X
          • Do something with all blocks in X
  • Constraints
      • Data movement only at the block level
      • No increase in area
      • No decrease in performance
      • Complexity
      • Associativity

29
Region-Based Caches
  • Start with conventional 16-way cache and replace tag array
  • Sector Caches
      • Hit rate suffers: 20% loss
  • Sector Pool Caches
      • High Associativity: 48-way for matching a 16-way cache
  • Decoupled-Sectored Caches
      • No coarse-grain info
      • Replacements require searching
  • No previous design is adequate
  • RegionTracker
      • Meets all requirements
      • But does not save as much tag resources

30
Sector Cache
[Figure: sector cache; labels: D-way Data, D-way Region Tags, RVA, Data Array]
  • Reduced Area and Power
  • Increased miss rates (2.5% - 96% for 1 KB sectors)
  • Replacement?

31
Sector Pool Cache
[Figure: sector pool cache; labels: D-way Data, 1 DSR, Data Array]
  • M > D
  • Requires highly associative cache to achieve same
    performance as RegionTracker (48-way)

32
Decoupled-Sectored Cache
  • Has multiple block evictions
  • Requires scanning status array
  • No simple mechanism to avoid this
  • Does NOT expose region-level information

33
RegionTracker
  • In practice L < D
  • Decouple Data and Lookup organizations
  • Lower-associativity lookups with no hit-rate penalty
  • RegionTracker provides a complete solution

34
RegionTracker Cache
Block and Region Lookups: Region Tag, Way Per Block
Evict Region Blocks Lazily
Simplify replacement and reduce area: Status per block, RVA set backpointer
Can be banked and partitioned
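
To make the "region tag plus way per block" idea concrete, here is a heavily simplified sketch of a dual-grain tag structure that answers the region-level questions listed under the RegionTracker goals. It is not the RegionTracker organization itself: the dictionary in place of a set-associative region tag array, the 1 KB regions and 64-byte blocks, and all names are assumptions, and eviction here is eager rather than lazy.

```python
class DualGrainTags:
    """One entry per cached region; each entry records a data way per block."""

    def __init__(self, region_bits=10, block_bits=6):
        self.region_bits = region_bits
        self.block_bits = block_bits
        self.blocks_per_region = 1 << (region_bits - block_bits)
        self.regions = {}  # region tag -> list of ways (None = block absent)

    def _split(self, addr):
        region = addr >> self.region_bits
        block = (addr >> self.block_bits) & (self.blocks_per_region - 1)
        return region, block

    def fill(self, addr, way):
        region, block = self._split(addr)
        ways = self.regions.setdefault(region, [None] * self.blocks_per_region)
        ways[block] = way

    def lookup_block(self, addr):
        """Block lookup: the data way holding addr, or None on a miss."""
        region, block = self._split(addr)
        ways = self.regions.get(region)
        return None if ways is None else ways[block]

    def is_region_cached(self, addr):
        return (addr >> self.region_bits) in self.regions

    def cached_blocks(self, addr):
        ways = self.regions.get(addr >> self.region_bits, [])
        return [i for i, w in enumerate(ways) if w is not None]

    def evict_region(self, addr):
        """Drop the region; return (block, way) pairs whose data must be freed."""
        ways = self.regions.pop(addr >> self.region_bits, [])
        return [(i, w) for i, w in enumerate(ways) if w is not None]
```

A single region lookup answers both "is the region cached?" and "which blocks are?", which is what the earlier auxiliary-tracking designs needed a separate structure for.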
35
Region-Aware Cache: Performance vs. Area
[Figure: performance vs. area]
  • Commercial workloads: DB2, Oracle, TPC-C and
    TPC-H, Apache, Zeus
  • Simics + SimFlex, Sampling, 2K Regions

36
RegionTracker-RegionScout
[Figure: Reduction in Broadcasts; labels include BlockScout]
  • One bit per Region tag: Known to be not shared
  • 1KB Regions, Commercial workloads
  • 512KB L2 private caches
  • Filter 41% of snoops at Zero Cost compared to
    conventional cache

37
Directory Optimizations: Base Architecture
[Figure: Core, L2 Tags, L2 Data, L3 Tags, Directory, L3 Data DRAM]
38
Coherence Delegation
[Figure: Ideal Path; labels: Requesting Node, Directory Lookup, Remote L2 containing data]
  • Eliminate 3-hop overhead
  • Attract directory tracking to nodes

39
Optimization Engines: Predictors
Predictor Virtualization
[Figure: multiple CPUs, each with L1-I and L1-D caches, an Interconnect, L2, and Main Memory]
40
Motivating Trends
  • Chip multiprocessors
      • Space dedicated to predictors × number of processors
  • Larger predictor tables
      • Increased performance
  • Memory hierarchies
      • Increased capacities

Use conventional memory hierarchies to store
predictor information
41
PV Architecture
[Figure: Optimization Engine and Predictor Table; labels: entry, index, prediction]
42
PV Architecture
[Figure: Optimization Engine and Predictor Virtualization; labels: entry, index, prediction]
43
PV Architecture
[Figure: Optimization Engine (entry, index, prediction) in front of PVProxy, which contains PVCache, MSHR, and PVStart + index; PVProxy connects to L2, with PVTable in Main Memory]
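
One way to read the PV figures is that a small on-chip cache of predictor entries sits in front of a full table stored in the ordinary memory hierarchy at address PVStart + index. The sketch below models only that reading: a dictionary stands in for L2 and main memory, one entry per address is assumed, and the class name, LRU policy, and access counter are hypothetical.

```python
from collections import OrderedDict

class VirtualizedPredictor:
    def __init__(self, memory, pv_start, cache_entries=8):
        self.memory = memory          # dict standing in for L2 / main memory
        self.pv_start = pv_start      # base address of the in-memory table
        self.cache = OrderedDict()    # small on-chip cache of table entries
        self.cache_entries = cache_entries
        self.memory_accesses = 0

    def _addr(self, index):
        return self.pv_start + index  # assumption: one entry per address

    def _install(self, index, value):
        self.cache[index] = value
        self.cache.move_to_end(index)
        if len(self.cache) > self.cache_entries:
            old_index, old_value = self.cache.popitem(last=False)
            if old_value is not None:
                self.memory[self._addr(old_index)] = old_value  # write back

    def lookup(self, index):
        """Return the prediction for index, fetching it from memory on a miss."""
        if index in self.cache:
            self.cache.move_to_end(index)
            return self.cache[index]
        self.memory_accesses += 1
        value = self.memory.get(self._addr(index))
        self._install(index, value)
        return value

    def update(self, index, prediction):
        self._install(index, prediction)
```

The optimization engine keeps using the same index/prediction interface; only the storage behind it changes, which is why a large table can shrink to a small dedicated structure.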
44
Virtualized Spatial Memory Streaming
Original Prefetcher Cost: 80 KB
Virtualized Prefetcher Cost: < 1 KB
Nearly Identical Performance
45
Region-Centric Memory Design
  • AENAO Research Group
  • Patrick Akl, M.A.Sc. C.
  • Ioana Burcea, Ph.D. C.
  • Myrto Papadopoulou, M.A.Sc. C.
  • Elham Safi, Ph.D. C.
  • Jason Zebchuk, M.A.Sc. C.
  • Andreas Moshovos

{pakl, ioana, myrto, elham, zebchuk, moshovos}@eecg.toronto.edu
46
Summary
  • Caches are getting larger
      • Time to look at the big picture
  • Region-Centric Memory Design
      • Expose region-level info
      • Allow management at the region level
  • RegionScout
      • Eliminate broadcasts for snoop coherence
  • Region-Centric Disambiguation
      • Reduce bandwidth for TLS or TM
  • Region-Aware Memory
      • Same area and performance as conventional, plus region info
  • Predictor Virtualization