Performance, Area and Bandwidth Implications on Large-Scale CMP Cache Design

1
Performance, Area and Bandwidth Implications on
Large-Scale CMP Cache Design
  • Li Zhao, Ravi Iyer,
  • Srihari Makineni, Jaideep Moses,
  • Ramesh Illikkal, Don Newell
  • Intel Corporation

2
Outline
  • Motivation
  • Overview of LCMP
  • Constraint-aware Analysis Methodology
  • Experiment Results
  • Area and Bandwidth Implications
  • Performance Evaluation
  • Summary

3
Motivation
  • CMP architectures have been widely adopted
  • SCMP: a few large out-of-order cores
  • e.g., Intel dual-core Xeon processors
  • LCMP: many small in-order cores for high throughput
  • e.g., Sun Niagara, Azul
  • Questions on cache/memory hierarchy
  • How do we prune the cache design space for LCMP
    architectures? What methodology needs to be put
    in place?
  • How should the cache be sized at each level and
    shared at each level in the hierarchy?
  • How much memory and interconnect bandwidth is
    required for scalable performance?

The goal of this paper is to perform a first-level
analysis that narrows the design space
5
Overview of LCMP
[Diagram: LCMP chip overview — each core (C) has a private L1; groups of
cores share an L2; nodes connect through an on-die interconnect to a
shared L3, the memory interface (DRAM), and the IO bridge/IO interface]
  • 16 or 32 lightweight cores on-die

7
Cache Design Considerations
  • Die area constraints
  • Only a fraction of die area (40% to 60%) may be
    available for caches
  • On-die and off-die bandwidth
  • The on-die interconnect carries the communication
    between levels of the cache hierarchy
  • Off-die memory bandwidth
  • Power consumption
  • Overall performance
  • Indicates the effectiveness of the cache design in
    supporting many simultaneous threads of execution

8
Constraint-Aware Analysis Methodology
  • Area constraints
  • Prune the design space using the area constraints
  • Estimate the area required for the L2, then apply
    the overall area constraint to this cache
  • Bandwidth constraints
  • Further prune the options that survive the area
    constraints by applying the on-die and off-die
    bandwidth constraints
  • Estimate the number of requests generated by the
    caches at each level, which depends on core
    performance and cache performance for a given
    workload
  • Overall performance
  • Compare the performance of the remaining options and
    determine the top two or three design choices
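The three-stage flow above can be sketched as a small filter pipeline. All numbers below (areas, bandwidth demands, performance scores) are illustrative placeholders, not the paper's models or data:

```python
# Sketch of the constraint-aware pruning flow: area filter, then
# bandwidth filter, then rank survivors by performance.

def prune_by_area(options, area_budget_mm2):
    """Keep configurations whose estimated cache area fits the budget."""
    return [o for o in options if o["cache_area_mm2"] <= area_budget_mm2]

def prune_by_bandwidth(options, on_die_gbs, off_die_gbs):
    """Keep configurations whose estimated demand fits both budgets."""
    return [o for o in options
            if o["on_die_bw"] <= on_die_gbs and o["off_die_bw"] <= off_die_gbs]

def top_k_by_performance(options, k=3):
    """Rank the survivors and keep the top few design choices."""
    return sorted(options, key=lambda o: o["perf"], reverse=True)[:k]

# Hypothetical design points (names and values are made up):
options = [
    {"name": "A", "cache_area_mm2": 180, "on_die_bw": 150, "off_die_bw": 45, "perf": 1.00},
    {"name": "B", "cache_area_mm2": 250, "on_die_bw": 120, "off_die_bw": 40, "perf": 1.10},
    {"name": "C", "cache_area_mm2": 190, "on_die_bw": 600, "off_die_bw": 80, "perf": 1.20},
]
survivors = top_k_by_performance(
    prune_by_bandwidth(prune_by_area(options, 200), 512, 64), k=2)
print([o["name"] for o in survivors])  # -> ['A']
```

Here B fails the area budget and C fails the on-die bandwidth budget, so only A survives to the performance-ranking stage.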

10
Experimental Setup
  • Platform simulator
  • Core model
  • Cache hierarchy
  • Interconnect model
  • Memory model
  • Area estimation tool
  • CACTI 3.2
  • Workloads and traces
  • OLTP: TPC-C
  • SAP: SAP SD 2-tier workload
  • Java: SPECjbb2005
  • Baseline configuration
  • 32 cores (4 threads/core), core CPI of 6
  • Several nodes (1 to 4 cores/node), one L2 per node
    (128K to 4M)

11
Area Constraints
[Charts: viable cache options under 200 mm2 and 300 mm2 area budgets]
  • Look for options that support 3 levels of cache
  • Assume a total die area of 400 mm2
  • Two cache-area budgets: 50% → 200 mm2, 75% → 300 mm2
  • Inclusive cache ⇒ L3 > 2×L2
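The pruning this slide describes can be sketched as a simple enumeration of (cores/node, L2, L3) points with the two filters applied. The SRAM density below is a made-up placeholder, not a CACTI estimate:

```python
# Enumerate candidate configurations and keep those that (a) fit the
# cache-area budget and (b) satisfy the inclusion rule L3 >= 2x total L2.

TOTAL_CORES = 32
MM2_PER_MB = 5.0        # hypothetical SRAM density, mm^2 per MB
AREA_BUDGET_MM2 = 200   # the 50%-of-die budget from the slide

def viable_configs():
    for cores_per_node in (1, 2, 4):
        nodes = TOTAL_CORES // cores_per_node
        for l2_kb in (128, 256, 512, 1024, 2048, 4096):
            total_l2_mb = nodes * l2_kb / 1024
            for l3_mb in (8, 16, 32):
                if l3_mb < 2 * total_l2_mb:   # inclusion constraint
                    continue
                area = (total_l2_mb + l3_mb) * MM2_PER_MB
                if area <= AREA_BUDGET_MM2:
                    yield (cores_per_node, l2_kb, l3_mb, round(area, 1))

for cfg in viable_configs():
    print(cfg)
```

For example, 4 cores/node with a 512K L2 per node (4M of L2 total) and an 8M L3 passes both filters under these assumed numbers.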

12
Sharing Impact
[Diagram: L2 sharing configurations — a private L2 per core, an L2
shared by 2 cores, and an L2 shared by 4 cores; chart shows TPC-C MPI
vs. L2 size]
  • MPI (misses per instruction) decreases as the
    sharing degree increases
  • A 512K L2 seems to be a sweet spot

13
Bandwidth Constraints
[Charts: on-die and off-die bandwidth demand vs. L3 size]
  • 4 cores/node, 8 nodes
  • On-die BW demand is around 180 GB/s with an infinite
    L3, and drops significantly with a 32M L3 cache
  • Off-die memory BW demand is between 40 and 50 GB/s,
    decreasing as the L3 cache size increases
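A back-of-envelope model shows where off-die demand in the tens of GB/s comes from: memory traffic scales with aggregate instruction rate, L3 misses per instruction, and line size. The clock frequency and miss rate below are illustrative assumptions, not data from the paper:

```python
# Rough off-die bandwidth demand:
#   traffic ~ cores x threads x (freq / CPI) x L3 MPI x line size

def offdie_bw_gbs(cores, threads, freq_ghz, cpi, l3_mpi, line_bytes=64):
    # Aggregate instruction throughput across all hardware threads,
    # assuming each thread sustains the given per-thread CPI.
    instr_per_sec = cores * threads * (freq_ghz * 1e9) / cpi
    return instr_per_sec * l3_mpi * line_bytes / 1e9

# 32 cores x 4 threads, per-thread CPI of 6 (as in the baseline),
# with an assumed 3 GHz clock and ~10 L3 misses per 1000 instructions:
demand = offdie_bw_gbs(cores=32, threads=4, freq_ghz=3.0, cpi=6, l3_mpi=0.01)
print(round(demand, 1), "GB/s")  # -> 41.0 GB/s
```

With these assumed parameters the estimate lands in the same 40-50 GB/s range the simulations report; a larger L3 lowers the effective miss rate and hence the demand.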

14
Cache Options Summary
  • Node
  • 1 to 4 cores per node
  • L2 size per core
  • Around 128K to 256K seems viable for a 32-core LCMP
  • L3 size
  • 8M to about 20M, depending on the configuration

15
Performance Evaluation (TPCC)
  • On-die BW is 512 GB/s; max sustainable memory BW
    is 64 GB/s
  • By raw performance, the 4 cores/node, 1M L2,
    32M L3 configuration is best
  • By performance per unit area, the 4 cores/node,
    512K L2, 8M L3 configuration is best
  • By performance³ per unit area, 4 cores/node with
    512K to 1M L2 and 8M to 16M L3 lead
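The three figures of merit above can each pick a different winner, which is why the paper reports all three. The performance and area numbers below are made up for illustration; the real values come from simulation:

```python
# Rank hypothetical configurations by the slide's three metrics:
# raw performance, performance/area, and performance^3/area.

configs = {
    "4c/node, 1M L2, 32M L3":  {"perf": 1.20, "area_mm2": 300},
    "4c/node, 512K L2, 8M L3": {"perf": 1.00, "area_mm2": 120},
    "4c/node, 1M L2, 16M L3":  {"perf": 1.15, "area_mm2": 180},
}

def best_by(metric):
    """Return the configuration name maximizing the given metric."""
    return max(configs, key=lambda name: metric(configs[name]))

print(best_by(lambda c: c["perf"]))                       # raw performance
print(best_by(lambda c: c["perf"] / c["area_mm2"]))       # perf per area
print(best_by(lambda c: c["perf"] ** 3 / c["area_mm2"]))  # perf^3 per area
```

With these invented numbers, raw performance favors the biggest caches, performance/area favors the smallest, and performance³/area (which weights performance more heavily) favors a middle configuration — mirroring how the metrics trade off in the paper.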

16
Performance Evaluation (SAP, SPECjbb)
17
Implications and Inferences
  • Design a 3-level cache hierarchy
  • Each node consists of four cores with 512K to 1M of
    L2 cache
  • The L3 cache size is recommended to be a minimum
    of 16M
  • Recommend that the platform support at least
    64 GB/s of memory bandwidth and 512 GB/s of
    interconnect bandwidth

18
Summary
  • Performed the first study of performance, area
    and bandwidth implications on LCMP cache design
  • Introduced a constraints-aware analysis
    methodology to explore LCMP cache design options
  • Applied this methodology to a 32-core LCMP
    architecture
  • Quickly narrowed down the design space to a small
    subset of viable options
  • Conducted an in-depth performance/area evaluation
    of these options and summarized a set of
    recommendations for architecting efficient LCMP
    platforms