1
Cooperative Caching for Chip Multiprocessors
  • Jichuan Chang, Enric Herrero, Ramon Canal and
    Gurindar S. Sohi
  • HP Labs
  • Universitat Politècnica de Catalunya
  • University of Wisconsin-Madison
  • M. S. Obaidat and S. Misra (Editors), Chapter 13,
    Cooperative Networking (Wiley)

2
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

3
Motivation - Background
  • Chip multiprocessors (CMPs) both require and
    enable innovative on-chip cache designs
  • Critical for CMPs:
  • Processor/memory gap
  • Limited pin-bandwidth
  • Current designs:
  • Shared cache: sharing can lead to contention
  • Private caches: isolation can waste resources

[Figure: 4-core CMP; the path to off-chip memory is narrow and slow. A shared cache suffers capacity contention, while private caches leave capacity wasted.]
4
Motivation - Challenges
  • Key challenges
  • Growing on-chip wire delay
  • Expensive off-chip accesses
  • Destructive inter-thread interference
  • Diverse workload characteristics
  • Three important demands for CMP caching
  • Capacity: reduce off-chip accesses
  • Latency: reduce remote on-chip references
  • Isolation: reduce inter-thread interference
  • Need to combine the strengths of both private
    and shared cache designs

5
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

6
CMP Cooperative Caching
  • Form an aggregate global cache via cooperative
    private caches
  • Use private caches to attract data for fast reuse
  • Share capacity through cooperative policies
  • Throttle cooperation to find an optimal sharing
    point
  • Inspired by cooperative file/web caches
  • Similar latency tradeoff
  • Similar algorithms

7
CMP Cooperative Caching
  • Private L2 caches reduce access latency.
  • A centralized directory with duplicated tags
    maintains coherence on-chip.
  • Spilling: evicted blocks are forwarded to other
    caches for more efficient use of cache space
    (N-chance forwarding mechanism; see the sketch
    below).
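
A minimal sketch of the spilling decision under N-chance forwarding (C++; the block layout, the recirculation-count field and the limit kNChance are assumptions, not the original implementation): a block that still has chances left is forwarded to a randomly chosen peer cache instead of being dropped or written back.

    // Hypothetical N-chance forwarding decision on eviction.
    #include <cstdint>
    #include <random>

    struct Block {
        uint64_t tag = 0;
        bool dirty = false;
        uint8_t recirculation_count = 0;  // times this block has already been spilled
    };

    constexpr int kNChance = 2;           // assumed forwarding limit N

    // Returns the peer cache to spill to, or -1 to evict the block off-chip.
    int on_eviction(Block& victim, int local_cache, int num_caches, std::mt19937& rng) {
        if (victim.recirculation_count >= kNChance)
            return -1;                                 // chances used up: leave the chip
        victim.recirculation_count++;
        std::uniform_int_distribution<int> pick(0, num_caches - 2);
        int peer = pick(rng);
        if (peer >= local_cache) peer++;               // any cache except the local one
        return peer;                                   // forward (spill) to this peer
    }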

8
Distributed Cooperative Caching
  • Objective: keep the benefits of cooperative
    caching while improving scalability and energy
    consumption.
  • Distributed directory with a different tag
    allocation mechanism.

[Figure: CMP nodes with per-core L1 and L2 caches connected through an interconnect; distributed cooperative caching engines (DCEs) sit on the interconnect, with a bus to main memory.]
9
Tag Structure Comparison
10
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

11
3. Applications of CMP Cooperative Caching
  • Several techniques have appeared that take
    advantage of Cooperative Caching for CMPs:
  • For latency reduction: Cooperation Throttling
  • For adaptive repartitioning: Elastic Cooperative
    Caching
  • For performance isolation: Cooperative Cache
    Partitioning

12
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

13
Policies to Reduce Off-chip Accesses
  • Cooperation policies for capacity sharing
  • (1) Cache-to-cache transfers of clean data
  • (2) Replication-aware replacement
  • (3) Global replacement of inactive data
  • Implemented by two unified techniques
  • Policies enforced by cache replacement/placement
  • Information/data exchange supported by modifying
    the coherence protocol

14
Policy (1) - Make use of all on-chip data
  • Don't go off-chip if on-chip (clean) data exist
  • Beneficial and practical for CMPs:
  • A peer cache is much closer than next-level
    storage
  • Affordable implementations of clean ownership
  • Important for all workloads:
  • Multi-threaded: (mostly) read-only shared data
  • Single-threaded: spill into peer caches for
    later reuse (see the sketch below)
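
A hedged illustration of policy (1) with an invented directory representation: on an L2 miss, the requester first checks whether any peer cache already holds the block (even a clean copy) and sources it on-chip before paying the off-chip penalty.

    // Illustrative miss handling: prefer any on-chip copy to off-chip memory.
    #include <cstdint>
    #include <unordered_map>

    struct DirectoryEntry { uint32_t sharer_mask = 0; };  // bit i set: cache i holds the block

    // Hypothetical on-chip directory: block address -> sharer bitmask.
    using Directory = std::unordered_map<uint64_t, DirectoryEntry>;

    enum class Source { PeerCache, OffChipMemory };

    Source resolve_l2_miss(const Directory& dir, uint64_t block_addr, int requester) {
        auto it = dir.find(block_addr);
        if (it != dir.end()) {
            uint32_t others = it->second.sharer_mask & ~(1u << requester);
            if (others != 0)
                return Source::PeerCache;    // cache-to-cache transfer, even of clean data
        }
        return Source::OffChipMemory;        // no on-chip copy: go off-chip
    }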

15
Policy (2) - Control replication
  • Intuition: increase the amount of unique
    on-chip data
  • Latency/capacity tradeoff
  • Evict singlets only when no replicas exist
  • Modify the default cache replacement policy
    (see the sketch below)
  • Spill an evicted singlet into a peer cache
  • Can further reduce on-chip replication
  • Randomly choose a recipient cache for spilling
16
Policy (3) - Global cache management
  • Approximate global-LRU replacement
  • Combine global spill/reuse history with local LRU
  • Identify and replace globally inactive data:
  • The block first becomes the LRU entry in its
    local cache
  • It is set as MRU if spilled into a peer cache
  • If it later becomes the LRU entry again, it is
    evicted globally
  • 1-chance forwarding (1-Fwd): blocks can only be
    spilled once if not reused (see the sketch below)
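
One way to approximate this with a single bit of state per block; the reuse-resets-the-chance behavior is an assumption beyond what the slide states.

    // Illustrative 1-chance forwarding (1-Fwd) decision on a local LRU eviction.
    struct BlockMeta {
        bool spilled_once = false;  // set when the block is spilled into a peer cache
        bool reused = false;        // set on any hit after the spill
    };

    enum class EvictAction { SpillToPeer, EvictGlobally };

    EvictAction on_local_lru_eviction(BlockMeta& b) {
        if (!b.spilled_once) {
            b.spilled_once = true;  // first chance: spill to a peer, insert as MRU there
            return EvictAction::SpillToPeer;
        }
        if (b.reused) {             // assumption: reuse after a spill earns a new chance
            b.reused = false;
            return EvictAction::SpillToPeer;
        }
        return EvictAction::EvictGlobally;  // LRU again without reuse: globally inactive
    }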

17
Cooperation Throttling
  • Why throttling?
  • Further tradeoff between capacity and latency
  • Two probabilities help make decisions (see the
    sketch below):
  • Cooperation probability: controls replication
  • Spill probability: throttles spilling
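
A minimal sketch of the two throttling knobs (names and mechanism are illustrative): 100%/100% corresponds to fully cooperative caching, while a 0% spill probability degenerates toward plain private caches.

    // Illustrative cooperation throttling with two independent probabilities.
    #include <random>

    struct ThrottleConfig {
        double cooperation_prob = 1.0;  // probability of applying replication control
        double spill_prob = 1.0;        // probability of spilling an evicted singlet
    };

    class Throttle {
    public:
        explicit Throttle(ThrottleConfig cfg) : cfg_(cfg), rng_(std::random_device{}()) {}
        bool control_replication() { return flip(cfg_.cooperation_prob); }
        bool allow_spill()         { return flip(cfg_.spill_prob); }
    private:
        bool flip(double p) { return std::bernoulli_distribution(p)(rng_); }
        ThrottleConfig cfg_;
        std::mt19937 rng_;
    };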

[Figure: design spectrum between private and shared caches, with cooperative caching (CC 100%) in between; throttling moves the design along this spectrum.]
18
Performance Evaluation
19
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

20
CC for Adaptive Repartitioning
  • Tradeoff between minimizing off-chip misses and
    avoiding inter-thread interference
  • Elastic Cooperative Caching adapts each cache
    according to application requirements:
  • High cache requirements -> large local private
    cache
  • Low data reuse -> cache space is reassigned to
    enlarge the global shared cache used for
    spilled blocks

21
Elastic Cooperative Caching structure
  • Local shared/private cache with an independent
    repartitioning unit in each node
  • Distributed Coherence Engines maintain coherence
  • Shared partition: allocates evicted blocks
    spilled from all private regions
  • Private partition: only the local core can
    allocate
  • Every N cycles the cache is repartitioned based
    on LRU hits in the shared/private partitions
  • Evicted blocks from the private partition are
    distributed among the nodes
22
Repartitioning Unit Working Example
  • Counter > HT (high threshold) -> private;
    counter < LT (low threshold) -> shared
    (see the sketch below)
  • Independent partitioning, distributed structure
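
A hedged sketch of one repartitioning step, assuming a single activity counter per node compared against the HT/LT thresholds every N cycles (the exact counter definition and threshold values are assumptions).

    // Illustrative per-node repartitioning step of the elastic cache.
    struct NodePartition {
        int private_ways = 8;      // ways reserved for the local core
        int shared_ways  = 8;      // ways donated to the global shared pool
        unsigned hit_counter = 0;  // LRU-hit activity over the last interval
    };

    constexpr unsigned HT = 64;    // high threshold: the local core wants more capacity
    constexpr unsigned LT = 16;    // low threshold: capacity better used as shared space

    // Called every N cycles by the node's repartitioning unit.
    void repartition(NodePartition& p) {
        if (p.hit_counter > HT && p.shared_ways > 0) {
            ++p.private_ways; --p.shared_ways;   // grow the private partition by one way
        } else if (p.hit_counter < LT && p.private_ways > 1) {
            --p.private_ways; ++p.shared_ways;   // release one way to the shared pool
        }
        p.hit_counter = 0;                       // start a new measurement interval
    }
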
23
Spilled Allocator Working Example
  • Only data from the private region can be spilled
[Figure: spilled-block allocation example across the private ways and shared ways of the nodes, using a broadcast]
  • No need for perfectly updated information;
    allocation is out of the critical path
    (see the sketch below)
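
A sketch of spilled-block target selection consistent with the slide: the allocator works from a possibly stale view of how many shared ways each node currently offers, which is acceptable because the choice is off the critical path (the data structure is an assumption).

    // Illustrative spill-target choice from approximate shared-way counts.
    #include <vector>

    // shared_ways[i]: last known number of shared ways offered by node i (may be stale).
    int pick_spill_target(const std::vector<int>& shared_ways, int spilling_node) {
        int best = -1;
        for (int n = 0; n < static_cast<int>(shared_ways.size()); ++n) {
            if (n == spilling_node || shared_ways[n] == 0) continue;
            if (best < 0 || shared_ways[n] > shared_ways[best]) best = n;
        }
        return best;  // -1: no node offers shared space, so write back instead
    }
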
24
Performance and energy-efficiency evaluation
[Figure: improvements of 24% and 12% over ASR]
25
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

26
Time-share Based Partitioning
  • Throughput-fairness dilemma
  • Cooperation: taking turns to speed up
  • Multiple time-sharing partitions (MTP)
    (see the sketch below)
  • QoS guarantee:
  • Cooperatively shrink/expand across MTPs
  • Bound average slowdown over the long term
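
A simplified stand-in for the MTP idea (the round-robin rotation below is an illustration, not the actual MTP schedule): across successive epochs each thread takes its turn holding the expanded allocation, which bounds every thread's slowdown over the long term.

    // Simplified round-robin rotation of expanded cache allocations across epochs.
    #include <vector>

    struct Epoch {
        std::vector<int> ways_per_thread;  // cache ways given to each thread in this epoch
    };

    // One epoch per thread; in epoch e, thread e receives the expanded share.
    // Assumes num_threads >= 2 and expanded_ways <= total_ways.
    std::vector<Epoch> build_mtp_schedule(int num_threads, int total_ways, int expanded_ways) {
        const int base = (total_ways - expanded_ways) / (num_threads - 1);
        std::vector<Epoch> schedule(num_threads);
        for (int e = 0; e < num_threads; ++e) {
            schedule[e].ways_per_thread.assign(num_threads, base);
            schedule[e].ways_per_thread[e] = expanded_ways;  // this thread's turn to expand
        }
        return schedule;
    }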

[Figure: two time-sharing schedules of threads P1-P4 over time. First: IPC 0.52, WS 2.42, QoS -0.52, FS 1.22; second: IPC 0.52, WS 2.42, QoS 0, FS 1.97 (FS 1.22 << 1.97). Fairness improvement and the QoS guarantee are reflected in the higher FS and the bounded QoS value.]
27
MTP Benefits
  • Better than a single spatial partition (SSP)
  • MTP/long-term QoS performs almost the same as
    MTP/no QoS

[Figure: percentage of workloads achieving various fair speedup (FS) values; offline analysis based on profile information, 210 workloads (job mixes)]
workloads (job mixes)
28
Better than MTP
  • MTP issues:
  • Not needed if LRU performs better (LRU is often
    near-optimal; Stone et al., IEEE TOC '92)
  • Partitioning is more complex than SSP
  • Cooperative Cache Partitioning (CCP):
  • Integration with Cooperative Caching (CC)
  • Exploits CC's latency and LRU-based sharing
    benefits
  • Simplifies the partitioning algorithm
  • Total execution time = Epochs(CC) + Epochs(MTP)
  • Weighted by the number of threads benefiting
    from CC vs. MTP

29
Partitioning Heuristic
  • When is MTP better than CC?
  • QoS: total speedup > total slowdown (over the N
    partitions)
  • The speedup should be large
  • CC is already good at fine-grained tuning

[Figure: normalized throughput vs. allocated cache ways (16-way total, 4-core), comparing the baseline against C_shrink; annotated with thrashing_test: Speedup > (N-1) x Slowdown, spelled out as code below]
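
Written out as code, the test from the figure is simply the following (assuming Speedup is the gain of the thread that expands and Slowdown is the loss of each of the other N-1 threads that shrink):

    // thrashing_test: expansion is worthwhile only if the expanded thread's speedup
    // outweighs the combined slowdown of the other N-1 shrunk threads.
    bool thrashing_test(double speedup, double slowdown, int num_partitions) {
        return speedup > (num_partitions - 1) * slowdown;
    }
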
30
Partitioning Algorithm
  • S = all threads - supplier threads (e.g., gcc,
    swim)
  • Allocate supplier threads their gPar (guaranteed
    partition, i.e., the minimum capacity needed for
    QoS) [Yeh/Reinman, CASES '05]
  • For threads in S, initialize their C_expand and
    C_shrink
  • Run thrashing_test iteratively for each thread
    in S:
  • If thread t fails, allocate t its gPar and
    remove t from S
  • Update C_expand and C_shrink for the other
    threads in S
  • Repeat until S is empty or all threads in S pass
    the test (see the sketch below)
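
A compact sketch of the iteration above (the data layout, the thrashing_test inputs and the C_expand/C_shrink update are assumptions; gPar is the guaranteed partition from the slide).

    // Illustrative skeleton of the partitioning loop described above.
    #include <vector>

    struct Thread {
        bool   supplier = false;           // e.g., gcc, swim: gains little from extra capacity
        int    gPar = 0;                   // guaranteed partition: min. capacity needed for QoS
        int    c_expand = 0, c_shrink = 0; // candidate expanded/shrunk allocations
        double speedup = 0.0, slowdown = 0.0;
        int    allocation = 0;             // final allocation (ways)
    };

    static bool passes_thrashing_test(const Thread& t, int n) {
        return t.speedup > (n - 1) * t.slowdown;
    }

    void ccp_partition(std::vector<Thread>& threads) {
        std::vector<Thread*> S;
        for (auto& t : threads) {
            if (t.supplier) t.allocation = t.gPar;  // suppliers get their gPar up front
            else            S.push_back(&t);        // the rest are tested iteratively
        }
        bool removed = true;
        while (removed && !S.empty()) {
            removed = false;
            for (auto it = S.begin(); it != S.end(); ) {
                if (!passes_thrashing_test(**it, static_cast<int>(S.size()))) {
                    (*it)->allocation = (*it)->gPar;  // failed: pin to gPar, drop from S
                    it = S.erase(it);
                    removed = true;  // C_expand/C_shrink of the rest would be updated here
                } else {
                    ++it;
                }
            }
        }
        // Threads left in S take turns expanding/shrinking across MTP epochs (not shown).
    }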

31
Fair Speedup Results
  • Two groups of workloads:
  • PAR: MTP better than CC (partitioning helps)
  • LRU: CC better than MTP (partitioning hurts)

[Figure: percentage of workloads vs. fair speedup for the PAR group (67 out of 210 workloads) and the LRU group (143 out of 210 workloads)]
32
Performance and QoS evaluation
33
Outline
  • Motivation
  • Cooperative Caching for CMPs
  • Applications of CMP Cooperative Caching
  • Latency Reduction
  • Adaptive Repartitioning
  • Performance Isolation
  • Conclusions

34
4. Conclusions
  • The on-chip cache hierarchy plays an important
    role in CMPs and must provide fast and fair data
    accesses for multiple competing processor cores.
  • Cooperative Caching for CMPs is an effective
    framework to manage caches in such an
    environment.
  • Cooperative sharing mechanisms, and the
    philosophy of using cooperation for conflict
    resolution, can be applied to many other
    resource management problems.