Dynamic Cache Clustering for Chip Multiprocessors - PowerPoint PPT Presentation

1
Dynamic Cache Clustering for Chip Multiprocessors
  • Mohammad Hammoud, Sangyeun Cho, and Rami Melhem

Dept. of Computer Science University of Pittsburgh
2
Tiled CMP Architectures
  • Tiled CMP Architectures have recently been
    advocated as a scalable design.
  • They replicate identical building blocks (tiles)
    and connect them with a switched network-on-chip
    (NoC).
  • A tile typically incorporates a private L1 cache
    and an L2 cache bank.
  • Two traditional practices of CMP caches:
  • One bank to one core assignment (Private Scheme).
  • One bank to all cores assignment (Shared Scheme).

3
Private and Shared Schemes
  • Private Scheme
  • A core maps and locates a cache block, B, to and
    from its local L2 bank.
  • Coherence maintenance is required at both the L1
    and the L2 levels.
  • Data is read very fast, but the cache miss rate
    can be high.
  • Shared Scheme
  • A core maps and locates a cache block, B, to and
    from a target tile (selected using the home
    select, or HS, bits of B's physical address)
    referred to as the static home tile (SHT) of B.
  • Coherence is required only at the L1 level.
  • Cache miss rate is low but data reads are slow
    (NUCA design).

[Figure: B's physical address, with the home select (HS) bits that identify the SHT highlighted]
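A minimal C sketch of how the shared scheme could derive the SHT, assuming a 16-tile CMP with 64-byte cache lines and, purely for illustration, HS bits placed directly above the block offset (the slide does not specify the exact bit positions):

    /* Shared scheme: select the static home tile (SHT) of block B
     * from its physical address; bit positions are illustrative. */
    static unsigned sht_of(unsigned long long paddr)
    {
        return (unsigned)((paddr >> 6) & 0xF);  /* 4 home-select (HS) bits */
    }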
4
The Degree of Sharing
  • Sharing Degree (SD), or the number of cores that
    share a given pool of cache banks, could be set
    somewhere between the shared and the private
    designs.

[Figure: bank-to-core assignments for sharing degrees 1 through 16: 1-1 (private design), 2-2, 4-4, 8-8, and 16-16 (shared design)]
5
Static Designs' Principal Deficiency
  • The aforementioned static designs are subject to
    a principal deficiency:
  • In reality, computer applications exhibit
    different cache demands.
  • A single application may demonstrate different
    phases corresponding to distinct code regions
    invoked during its execution.
  • Program phases can be characterized by different
    L2 cache misses and durations.

They all entail static partitioning of the
available cache capacity and don't tolerate the
variability among working sets and phases of a
working set.
6
Our work
  • Dynamically monitor the behaviors of the programs
    running on different CMP cores.
  • Adapt to each program's cache demand by offering
    fine-grained bank-to-core assignments (a
    technique we refer to as cache clustering).
  • Introduce novel mapping and location strategies
    to manage dynamic cache designs in tiled CMPs.

[Figure: example cache clusters of different dimensions (CD = Cluster Dimension)]
7
Talk roadmap
  • The proposed dynamic cache clustering (DCC)
    scheme.
  • Performance metrics.
  • DCC algorithm.
  • DCC mapping strategy.
  • DCC location strategy.
  • Quantitative evaluation.
  • Concluding remarks.

8
The Proposed Scheme
  • We denote the L2 cache banks that can be assigned
    to a specific core, i, as i's cache cluster.
  • We further denote the number of banks that the
    cache cluster of core i consists of as the cache
    cluster dimension of core i (CDi).
  • We propose a dynamic cache clustering (DCC)
    scheme where
  • Each core is initially started up with a specific
    cache cluster.
  • After every time period T (a potential
    re-clustering point), the cache cluster of a core
    is dynamically contracted, expanded, or kept
    intact, depending on the cache demand experienced
    by that core.

9
Performance Metrics
  • The basic trade-offs of varying the dimension of
    a cache cluster are the average L2 access latency
    and the L2 miss rate.
  • Average L2 access latency (AAL) increases
    strictly with the cluster dimension.
  • L2 miss rate (MR) is inversely proportional to
    the cluster dimension.
  • Improving either AAL or MR doesn't necessarily
    correlate with an improvement in overall system
    performance.
  • Improving a combined metric, the average memory
    access time (AMAT), typically translates to
    better system performance (see the sketch below).
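A plain-text sketch of the combined metric, assuming the standard AMAT decomposition (the exact formula shown on the original slide is not reproduced in this transcript):

    AMATi = L1HitTime + L1MissRate × (L2AccessLatency + L2MissRate × MemoryLatency)

L2AccessLatency grows with the cluster dimension (the AAL effect) while L2MissRate shrinks with it (the MR effect), so minimizing AMATi balances the two.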

10
DCC Algorithm
  • The AMAT metric can be utilized to judiciously
    gauge the benefit of varying the cache cluster
    dimension of a certain core i.
  • At every potential re-clustering point:
  • The AMATi (AMATi,current) experienced by a
    process P running on core i is evaluated and
    stored.
  • The previously stored AMATi (AMATi,previous) is
    subtracted from AMATi,current.
  • Assume a contraction action was taken at the
    previous re-clustering point:
  • A positive difference indicates that AMATi has
    increased. Hence, we revert the decision and
    expand P's cluster.
  • A negative difference indicates that AMATi has
    decreased. We hence contract P's cluster a step
    further, predicting more benefit (see the sketch
    below).
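A minimal C sketch of this decision rule, under two stated assumptions: the names are hypothetical, and expansion/contraction doubles/halves CD (the slide spells out only the case where the previous action was a contraction; the expansion case is handled symmetrically here):

    /* Per-core DCC re-clustering decision, evaluated every period T.
     * last_action is +1 if the previous action expanded the cluster,
     * -1 if it contracted it. */
    typedef struct {
        double amat_prev;   /* AMAT stored at the last re-clustering point */
        int    cd;          /* current cluster dimension: 1, 2, 4, 8, or 16 */
        int    last_action; /* +1 = expanded, -1 = contracted */
    } core_state_t;

    static void recluster(core_state_t *c, double amat_cur)
    {
        double delta = amat_cur - c->amat_prev;   /* > 0: AMAT got worse */
        int grow = (delta > 0) ? -c->last_action  /* revert the last action */
                               :  c->last_action; /* repeat it a step further */
        if (grow > 0 && c->cd < 16) c->cd *= 2;   /* expand the cluster */
        if (grow < 0 && c->cd > 1)  c->cd /= 2;   /* contract the cluster */
        c->last_action = grow;
        c->amat_prev   = amat_cur;
    }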

11
DCC Mapping Strategy
  • Varying a cache cluster dimension (CD) of each
    core over time requires a function that maps
    cache blocks to cache clusters exactly as
    required.
  • Assume that a core i requests a cache block B. If
    CDi < 16 (for instance), B is mapped to a dynamic
    home tile (DHT) different from the static home
    tile (SHT) of B.
  • The DHT of B depends on CDi. With CDi smaller
    than 16, only a subset of bits from the HS field
    of B's physical address needs to be utilized to
    determine B's DHT (e.g., 3 bits from HS are used
    if CDi = 8).
  • We developed the following generic function to
    determine the DHT of block B, where ID is the
    binary representation of core i, MB are masking
    bits chosen according to CDi, and MBbar is the
    bitwise complement of MB (a C sketch follows):

DHT = (HS AND MB) OR (ID AND MBbar)
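A minimal C sketch of this function for a 16-tile CMP. The MB values per CD are inferred from the working example on the next slide; that they encode the 2D (row/column) tile layout is an assumption of this sketch:

    /* DHT = (HS AND MB) OR (ID AND MBbar), on 4-bit tile numbers.
     * MB is indexed by log2(CD). */
    static const unsigned MB[5] = {
        0x0,  /* CD = 1  */
        0x1,  /* CD = 2  */
        0x5,  /* CD = 4  */
        0x7,  /* CD = 8  */
        0xF   /* CD = 16 */
    };

    static unsigned dht(unsigned hs, unsigned id, int log2cd)
    {
        unsigned mb = MB[log2cd];
        return (hs & mb) | (id & ~mb & 0xF);
    }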

12
DCC Mapping Strategy A Working Example
  • Assume core 5 (ID = 0101) requests cache block B
    with HS = 1111.

CD = 16: DHT = (1111 AND 1111) OR (0101 AND 0000) = 1111 (tile 15)
CD = 8: DHT = (1111 AND 0111) OR (0101 AND 1000) = 0111 (tile 7)
CD = 4: DHT = (1111 AND 0101) OR (0101 AND 1010) = 0101 (tile 5)
CD = 2: DHT = (1111 AND 0001) OR (0101 AND 1110) = 0101 (tile 5)
CD = 1: DHT = (1111 AND 0000) OR (0101 AND 1111) = 0101 (tile 5)
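Continuing the dht() sketch from the previous slide, a hypothetical test harness reproduces this example:

    #include <stdio.h>

    int main(void)
    {
        /* Core 5 (ID = 0101) requests a block with HS = 1111. */
        for (int k = 4; k >= 0; k--)
            printf("CD = %2d -> DHT = tile %u\n", 1 << k, dht(0xF, 0x5, k));
        return 0;  /* prints tiles 15, 7, 5, 5, 5 for CD = 16, 8, 4, 2, 1 */
    }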
13
DCC Location Strategy
  • The generic mapping function we defined can't be
    used straightforwardly to locate cache blocks.
  • Assume a cache block B with HS = 1111 is requested
    by core 0 (ID = 0000) while CD0 = 8, so B is
    mapped to tile 7.
  • Assume the cache cluster of core 0 is then
    contracted (CD0 = 4) and B is afterward requested
    by core 0: the lookup now targets tile 5 and
    misses, although B still resides at tile 7.

CD = 8: DHT = (1111 AND 0111) OR (0000 AND 1000) = 0111 (tile 7)
CD = 4: DHT = (1111 AND 0101) OR (0000 AND 1010) = 0101 (tile 5)
14
DCC Location Strategy
  • Solution 1: re-copy all blocks upon a
    re-clustering action. Drawback: very expensive.
  • Solution 2: after missing at B's DHT under
    CD0 = 4 (tile 5, since
    DHT = (1111 AND 0101) OR (0000 AND 1010) = 0101),
    access B's SHT (tile 15) to locate B at tile 7.
    Drawback: slow inter-tile communication between
    tiles 0, 5, 15, 7, and lastly 0.
  • Solution 3: send the L2 request directly to B's
    SHT instead of sending it first to B's DHT and
    then possibly to B's SHT. Drawback: slow
    inter-tile communication between tiles 0, 15, 7,
    and lastly 0.
15
DCC Location Strategy
  • Solution 4: send simultaneous requests to only
    the tiles that are potential DHTs of B.
  • The potential DHTs of B can be easily determined
    by varying MB and MBbar of the DCC mapping
    function over the range of CDs 1, 2, 4, 8, and 16
    (a sketch follows the bounds below).
  • Number of messages per L2 request:
  • Upper bound: log2(NumberOfTiles) + 1 (e.g., 5
    messages for 16 tiles).
  • Lower bound: 1.
  • Average: 1 + (1/2) × log2(n) (i.e., for n = 16
    tiles, 3 messages per request).
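A minimal C sketch of this enumeration, reusing the dht() and MB[] sketch from slide 11; duplicate DHTs collapse, which is why the average message count stays below the upper bound:

    /* Collect the distinct potential DHTs of a block by applying the
     * mapping function for every CD in {1, 2, 4, 8, 16}. */
    static int potential_dhts(unsigned hs, unsigned id, unsigned out[5])
    {
        int n = 0;
        for (int k = 0; k <= 4; k++) {
            unsigned t = dht(hs, id, k);
            int dup = 0;
            for (int j = 0; j < n; j++)   /* skip tiles already listed */
                if (out[j] == t) { dup = 1; break; }
            if (!dup) out[n++] = t;
        }
        return n;  /* number of tiles to query simultaneously */
    }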
16
Quantitative Evaluation Methodology
  • System Parameters
  • We simulate a 16-way tiled CMP.
  • Simulator: Simics 3.0.29 (Solaris OS).
  • Cache line size: 64 bytes.
  • L1 I/D size/ways/latency: 16KB / 2 ways / 1 cycle.
  • L2 size/ways/latency: 512KB per bank / 16 ways /
    12 cycles.
  • Latency per hop: 3 cycles.
  • Memory latency: 300 cycles.
  • L1 and L2 replacement policy: LRU.
  • Benchmarks: SPECJBB, OCEAN, BARNES, LU, RADIX,
    FFT, MIX1 (16 copies of HMMER), MIX2 (16 copies
    of SPHINX), MIX3 (Barnes, Lu, Milc, Mcf, Bzip2,
    and Hmmer; 2 threads/copies each).

17
Comparing With Static Schemes
We first study the effect of the average L1 miss
time (AMT) across FS1, FS2, FS4, FS8, FS16, and
DCC.
[Figure: average L1 miss time (AMT) for the static schemes FS1 through FS16 (sharing degrees 1 to 16) and DCC]
  • DCC outperforms FS16, FS8, FS4, FS2, and FS1 by
    averages of 6.5%, 8.6%, 10.1%, 10%, and 4.5%,
    respectively, and by as much as 21.3%.

18
Comparing With Static Schemes
Second, we study the L2 miss rate across FS1, FS2,
FS4, FS8, FS16, and DCC.
  • No single static scheme provides the best miss
    rate for all the benchmarks.
  • DCC always provides miss rates comparable to the
    best static alternative.

19
Comparing With Static Schemes
Third, we study execution time across FS1, FS2,
FS4, FS8, FS16, and DCC.
  • The superiority of DCC in AMT translates to
    better overall performance.
  • DCC always provides performance comparable to
    the best static alternative.

20
Sensitivity Study
Fourth, we study the sensitivity of DCC to
different T, Tl, and Tg values.
  • DCC is not strongly dependent on the values of
    the parameters T, Tl, and Tg.
  • Overall, DCC performs a little better with
    T = 100K than with T = 300K.

21
Comparing With Cooperative Caching
Fifth, we compare DCC against the cooperative
caching (CC) scheme. CC is based on FS1 (the
private scheme).
[Figure: execution time for FS1, DCC, and CC]
  • DCC outperforms CC by an average of 1.59%.
  • The basic problem with CC is that it spills
    blocks without knowing whether spilling helps or
    hurts cache performance (a problem addressed
    recently at HPCA'09).

22
Concluding Remarks
  • This paper proposes DCC, a distributed cache
    management scheme for large-scale chip
    multiprocessors.
  • Contrary to static designs, DCC adapts to the
    irregularities of working sets.
  • We propose generic mapping and location
    strategies that can be utilized for both static
    designs (with different sharing degrees) and
    dynamic designs in tiled CMPs.
  • The proposed DCC location strategy can be
    improved (in regard to reducing the number of
    messages per request) by maintaining a small
    history of a specific cluster's expansions and
    contractions.
  • For instance, with an activity chain of 16-8-4,
    we can predict that a requested block can't exist
    at a DHT corresponding to CD = 1 or 2, and is
    more likely to exist at the DHTs corresponding to
    CD = 4 and 8 than at the DHT corresponding to
    CD = 16 (a sketch follows).
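As an illustration of this idea (a hypothetical extension sketched here, not part of the evaluated design), the Solution 4 enumeration could be restricted to the dimensions recorded in the activity chain, again reusing the dht() sketch from slide 11:

    /* Probe only the DHTs reachable under the recorded chain of cluster
     * dimensions, e.g. {16, 8, 4}; the CD = 1 and CD = 2 DHTs are then
     * never queried. */
    static int filter_by_history(const int chain[], int chain_len,
                                 unsigned hs, unsigned id, unsigned out[5])
    {
        int n = 0;
        for (int i = 0; i < chain_len; i++) {
            int k = 0;
            while ((1 << k) < chain[i]) k++;  /* k = log2(chain[i]) */
            unsigned t = dht(hs, id, k);
            int dup = 0;
            for (int j = 0; j < n; j++)
                if (out[j] == t) { dup = 1; break; }
            if (!dup) out[n++] = t;
        }
        return n;
    }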

23
Dynamic Cache Clustering for Chip Multiprocessors
Thank you!
  • M. Hammoud, S. Cho, and R. Melhem

Dept. of Computer Science University of Pittsburgh