Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design

1
Performance, Area and Bandwidth Implications on
Large-Scale CMP Cache Design
  • Li Zhao, Ravi Iyer,
  • Srihari Makineni, Jaideep Moses,
  • Ramesh Illikkal, Don Newell
  • Intel Corporation

2
Outline
  • Motivation
  • Overview of LCMP
  • Constraint-aware Analysis Methodology
  • Experiment Results
  • Area and Bandwidth Implications
  • Performance Evaluation
  • Summary

3
Motivation
  • CMP architectures have been widely adopted
  • SCMP: a few large out-of-order cores
  • e.g., Intel dual-core Xeon processors
  • LCMP: many small in-order cores for high throughput
  • e.g., Sun Niagara, Azul
  • Questions on cache/memory hierarchy
  • How do we prune the cache design space for LCMP
    architectures? What methodology needs to be put
    in place?
  • How should the cache be sized at each level and
    shared at each level in the hierarchy?
  • How much memory and interconnect bandwidth is
    required for scalable performance?

The goal of this paper is to perform a first-level
analysis that narrows the design space
5
Overview of LCMP
[Diagram: LCMP chip overview — each core (C) has a private L1; groups of
cores share an L2; nodes connect through an on-die interconnect to a
shared L3, the memory interface (DRAM), and the IO bridge/IO interface]
  • 16 or 32 lightweight cores on-die

7
Cache Design Considerations
  • Die area constraints
  • Only a fraction of die area (40% to 60%) may be
    available for caches
  • On-die and off-die bandwidth
  • The on-die interconnect carries the communication
    between levels of the cache hierarchy
  • Off-die memory bandwidth
  • Power consumption
  • Overall performance
  • Indicates the effectiveness of the cache design in
    supporting many simultaneous threads of execution

8
Constraint-Aware Analysis Methodology
  • Area constraints
  • Prune the design space using the area constraints
  • Estimate the area required for the L2, then apply
    the overall area constraint to this cache
  • Bandwidth constraints
  • Further prune the options that survive the area
    constraints by applying the on-die and off-die
    bandwidth constraints
  • Estimate the number of requests generated by the
    caches at each level, which depends on core
    performance and cache performance for a given
    workload
  • Overall performance
  • Compare the performance of the remaining options and
    determine the top two or three design choices
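The three-stage flow above can be sketched as a small filter pipeline. All numbers below (areas, bandwidth demands, performance scores) are illustrative placeholders, not the paper's models or data:

```python
# Sketch of the constraint-aware pruning flow: area filter, then
# bandwidth filter, then rank survivors by performance.

def prune_by_area(options, area_budget_mm2):
    """Keep configurations whose estimated cache area fits the budget."""
    return [o for o in options if o["cache_area_mm2"] <= area_budget_mm2]

def prune_by_bandwidth(options, on_die_gbs, off_die_gbs):
    """Keep configurations whose estimated demand fits both budgets."""
    return [o for o in options
            if o["on_die_bw"] <= on_die_gbs and o["off_die_bw"] <= off_die_gbs]

def top_k_by_performance(options, k=3):
    """Rank the survivors and keep the top few design choices."""
    return sorted(options, key=lambda o: o["perf"], reverse=True)[:k]

# Hypothetical design points (names and values are made up):
options = [
    {"name": "A", "cache_area_mm2": 180, "on_die_bw": 150, "off_die_bw": 45, "perf": 1.00},
    {"name": "B", "cache_area_mm2": 250, "on_die_bw": 120, "off_die_bw": 40, "perf": 1.10},
    {"name": "C", "cache_area_mm2": 190, "on_die_bw": 600, "off_die_bw": 80, "perf": 1.20},
]
survivors = top_k_by_performance(
    prune_by_bandwidth(prune_by_area(options, 200), 512, 64), k=2)
print([o["name"] for o in survivors])  # -> ['A']
```

Here B fails the area budget and C fails the on-die bandwidth budget, so only A survives to the performance-ranking stage.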

10
Experimental Setup
  • Platform simulator
  • Core model
  • Cache hierarchy
  • Interconnect model
  • Memory model
  • Area estimation tool
  • CACTI 3.2
  • Workloads and traces
  • OLTP: TPC-C
  • SAP: SAP SD 2-tier workload
  • Java: SPECjbb2005
  • Baseline configuration
  • 32 cores (4 threads/core), core CPI of 6
  • Several nodes (1 to 4 cores/node), one L2 per node
    (128K to 4M)

11
Area Constraints
[Charts: viable cache options under 200 mm2 and 300 mm2 area budgets]
  • Look for options that support 3 levels of cache
  • Assume a total die area of 400 mm2
  • Two cache-area budgets: 50% → 200 mm2, 75% → 300 mm2
  • Inclusive cache ⇒ L3 > 2×L2
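The pruning this slide describes can be sketched as a simple enumeration of (cores/node, L2, L3) points with the two filters applied. The SRAM density below is a made-up placeholder, not a CACTI estimate:

```python
# Enumerate candidate configurations and keep those that (a) fit the
# cache-area budget and (b) satisfy the inclusion rule L3 >= 2x total L2.

TOTAL_CORES = 32
MM2_PER_MB = 5.0        # hypothetical SRAM density, mm^2 per MB
AREA_BUDGET_MM2 = 200   # the 50%-of-die budget from the slide

def viable_configs():
    for cores_per_node in (1, 2, 4):
        nodes = TOTAL_CORES // cores_per_node
        for l2_kb in (128, 256, 512, 1024, 2048, 4096):
            total_l2_mb = nodes * l2_kb / 1024
            for l3_mb in (8, 16, 32):
                if l3_mb < 2 * total_l2_mb:   # inclusion constraint
                    continue
                area = (total_l2_mb + l3_mb) * MM2_PER_MB
                if area <= AREA_BUDGET_MM2:
                    yield (cores_per_node, l2_kb, l3_mb, round(area, 1))

for cfg in viable_configs():
    print(cfg)
```

For example, 4 cores/node with a 512K L2 per node (4M of L2 total) and an 8M L3 passes both filters under these assumed numbers.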

12
Sharing Impact
[Diagram: L2 sharing configurations — a private L2 per core, an L2
shared by 2 cores, and an L2 shared by 4 cores; chart shows TPC-C MPI
vs. L2 size]
  • MPI (misses per instruction) decreases as the
    sharing degree increases
  • A 512K L2 seems to be a sweet spot

13
Bandwidth Constraints
[Charts: on-die and off-die bandwidth demand vs. L3 size]
  • 4 cores/node, 8 nodes
  • On-die BW demand is around 180 GB/s with an infinite
    L3, and drops significantly with a 32M L3 cache
  • Off-die memory BW demand is between 40 and 50 GB/s,
    decreasing as the L3 cache size increases
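A back-of-envelope model shows where off-die demand in the tens of GB/s comes from: memory traffic scales with aggregate instruction rate, L3 misses per instruction, and line size. The clock frequency and miss rate below are illustrative assumptions, not data from the paper:

```python
# Rough off-die bandwidth demand:
#   traffic ~ cores x threads x (freq / CPI) x L3 MPI x line size

def offdie_bw_gbs(cores, threads, freq_ghz, cpi, l3_mpi, line_bytes=64):
    # Aggregate instruction throughput across all hardware threads,
    # assuming each thread sustains the given per-thread CPI.
    instr_per_sec = cores * threads * (freq_ghz * 1e9) / cpi
    return instr_per_sec * l3_mpi * line_bytes / 1e9

# 32 cores x 4 threads, per-thread CPI of 6 (as in the baseline),
# with an assumed 3 GHz clock and ~10 L3 misses per 1000 instructions:
demand = offdie_bw_gbs(cores=32, threads=4, freq_ghz=3.0, cpi=6, l3_mpi=0.01)
print(round(demand, 1), "GB/s")  # -> 41.0 GB/s
```

With these assumed parameters the estimate lands in the same 40-50 GB/s range the simulations report; a larger L3 lowers the effective miss rate and hence the demand.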

14
Cache Options Summary
  • Node
  • 1 to 4 cores per node
  • L2 size per core
  • Around 128K to 256K seems viable for a 32-core LCMP
  • L3 size
  • 8M to about 20M, depending on the configuration

15
Performance Evaluation (TPCC)
  • On-die BW is 512 GB/s; max sustainable memory BW
    is 64 GB/s
  • By raw performance, the 4 cores/node, 1M L2,
    32M L3 configuration is best
  • By performance per unit area, the 4 cores/node,
    512K L2, 8M L3 configuration is best
  • By performance³ per unit area, 4 cores/node with
    512K to 1M L2 and 8M to 16M L3 lead
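The three figures of merit above can each pick a different winner, which is why the paper reports all three. The performance and area numbers below are made up for illustration; the real values come from simulation:

```python
# Rank hypothetical configurations by the slide's three metrics:
# raw performance, performance/area, and performance^3/area.

configs = {
    "4c/node, 1M L2, 32M L3":  {"perf": 1.20, "area_mm2": 300},
    "4c/node, 512K L2, 8M L3": {"perf": 1.00, "area_mm2": 120},
    "4c/node, 1M L2, 16M L3":  {"perf": 1.15, "area_mm2": 180},
}

def best_by(metric):
    """Return the configuration name maximizing the given metric."""
    return max(configs, key=lambda name: metric(configs[name]))

print(best_by(lambda c: c["perf"]))                       # raw performance
print(best_by(lambda c: c["perf"] / c["area_mm2"]))       # perf per area
print(best_by(lambda c: c["perf"] ** 3 / c["area_mm2"]))  # perf^3 per area
```

With these invented numbers, raw performance favors the biggest caches, performance/area favors the smallest, and performance³/area (which weights performance more heavily) favors a middle configuration — mirroring how the metrics trade off in the paper.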

16
Performance Evaluation (SAP, SPECjbb)
17
Implications and Inferences
  • Design a 3-level cache hierarchy
  • Each node consists of four cores with 512K to 1M of
    L2 cache
  • The L3 cache size is recommended to be a minimum
    of 16M
  • Recommend that the platform support at least
    64 GB/s of memory bandwidth and 512 GB/s of
    interconnect bandwidth

18
Summary
  • Performed the first study of performance, area
    and bandwidth implications on LCMP cache design
  • Introduced a constraints-aware analysis
    methodology to explore LCMP cache design options
  • Applied this methodology to a 32-core LCMP
    architecture
  • Quickly narrowed down the design space to a small
    subset of viable options
  • Conducted an in-depth performance/area evaluation
    of these options and summarized a set of
    recommendations for architecting efficient LCMP
    platforms