Title: University of Utah
1. Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0
Naveen Muralimanohar, Rajeev Balasubramonian, Norman P. Jouppi
2. Large Caches
- Cache hierarchies will dominate chip area
- 3D stacked processors with an entire die for on-chip cache could be common
- Montecito has two private 12 MB L3 caches (27 MB including L2)
- Long global wires are required to transmit data/address
[Figure: Intel Montecito, with two regions labeled "Cache"]
3. Wire Delay/Power
- Wire delays are costly for performance and power
- Latencies of 60 cycles to reach ends of a chip at 32 nm (@ 5 GHz)
- 50% of dynamic power is in interconnect switching (Magen et al., SLIP 04)
- CACTI (version 4) access time for a 24 MB cache is 90 cycles @ 5 GHz, 65 nm technology
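As a quick sanity check on the figures above, a cycle count can be converted to absolute time by dividing by the clock frequency. The sketch below is illustrative only; the helper name `cycles_to_ns` is an assumption and is not part of CACTI.

```python
def cycles_to_ns(cycles: int, clock_ghz: float) -> float:
    """Convert a latency in clock cycles to nanoseconds.

    A clock of f GHz completes f cycles per nanosecond, so the
    latency in nanoseconds is cycles / f.
    """
    return cycles / clock_ghz


# Figures quoted on the slide, both at 5 GHz.
cross_chip_ns = cycles_to_ns(60, 5.0)    # wire latency to reach the chip edge
cache_access_ns = cycles_to_ns(90, 5.0)  # 24 MB cache access time

print(f"cross-chip wire latency: {cross_chip_ns:.1f} ns")
print(f"24 MB cache access time: {cache_access_ns:.1f} ns")
```

At 5 GHz this corresponds to 12 ns and 18 ns respectively.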
4. Contribution
- Support for various interconnect models
- Improved design space exploration
- Support for modeling Non-Uniform Cache Access (NUCA)
5. Cache Design Basics
[Figure: cache organization. Labels: input address, decoder, wordline, bitlines, tag array, data array, column muxes, sense amps, comparators, mux drivers, output drivers, data output, valid output?]
6. Existing Model - CACTI
[Figure: cache model with 4 sub-arrays and cache model with 16 sub-arrays, each annotated with decoder delay and wordline & bitline delay]
Decoder delay = H-tree delay + logic delay
7. Power/Delay Overhead of Wires
- H-tree delay increases with cache size
- H-tree power continues to dominate
- Bitlines are other major contributors to total power
8. Motivation
- The dominant role of interconnect is clear
- Lack of a tool to model interconnect in detail can impede progress
- Current solutions have limited wire options
- Orion, CACTI
- Weak wire model
- No support for modeling multi-megabyte caches
9. CACTI 6.0 Enhancements
- Incorporation of:
- Different wire models
- Different router models
- Grid topology for NUCA
- Shared bus for UCA
- Contention values for various cache configurations
- Methodology to compute optimal NUCA organization
- Improved interface that enables trade-off analysis
- Validation analysis
10. Full-swing Wires
[Figure: full-swing wire diagram with labels X, Y, Z]
11. Full-swing Wires II
Three different design points:
- 10% delay penalty
- 20% delay penalty
- 30% delay penalty
[Figure: design points plotted against repeater size]
- Caveat: repeater sizing and spacing cannot be controlled precisely all the time
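The idea of trading a bounded delay penalty for lower energy can be sketched as a simple selection over candidate repeater configurations. The example below is a minimal illustration, not CACTI's implementation: the `WireDesign` type, the `pick_design` helper, and every delay and energy value are hypothetical.

```python
from typing import NamedTuple, Sequence


class WireDesign(NamedTuple):
    name: str
    delay_ns: float   # hypothetical wire delay
    energy_pj: float  # hypothetical energy per transfer


def pick_design(designs: Sequence[WireDesign], max_delay_penalty: float) -> WireDesign:
    """Return the lowest-energy design whose delay stays within
    max_delay_penalty (e.g. 0.10 for 10%) of the fastest design."""
    best_delay = min(d.delay_ns for d in designs)
    limit = best_delay * (1.0 + max_delay_penalty)
    feasible = [d for d in designs if d.delay_ns <= limit]
    return min(feasible, key=lambda d: d.energy_pj)


# Hypothetical candidates: smaller and sparser repeaters save energy but add delay.
candidates = [
    WireDesign("delay-optimal", 1.00, 10.0),
    WireDesign("smaller repeaters", 1.08, 7.5),
    WireDesign("fewer, smaller repeaters", 1.18, 6.0),
    WireDesign("sparse repeaters", 1.28, 5.0),
]

for penalty in (0.0, 0.10, 0.20, 0.30):
    chosen = pick_design(candidates, penalty)
    print(f"{penalty:.0%} penalty -> {chosen.name} "
          f"({chosen.delay_ns} ns, {chosen.energy_pj} pJ)")
```

Because repeater sizing and spacing cannot always be controlled precisely, a real tool would likely need some margin around the chosen point; that is not modeled here.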
12. Full-Swing Wires
- Fast and simple
- Delay proportional to sqrt(RC) as against RC
- High bandwidth
- Can be pipelined
- Requires silicon area
- High energy
- Quadratic dependence on voltage
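The delay and energy trends listed above can be illustrated with a first-order model. The sketch below assumes a distributed RC wire whose unrepeated delay is 0.38·r·c·L², plus a fixed delay per repeater stage; repeater loading effects are ignored, and all constants are hypothetical placeholders rather than CACTI or ITRS parameters.

```python
import math

R_PER_MM = 1000.0    # ohm/mm, hypothetical wire resistance
C_PER_MM = 0.2e-12   # F/mm, hypothetical wire capacitance
T_REPEATER = 20e-12  # s, hypothetical fixed delay per repeater stage


def unrepeated_delay(length_mm: float) -> float:
    """Distributed RC wire without repeaters: delay grows with length squared."""
    return 0.38 * R_PER_MM * C_PER_MM * length_mm ** 2


def repeated_delay(length_mm: float) -> float:
    """Wire split into n equal segments, each driven by one repeater.

    Total delay is 0.38*r*c*L^2 / n + n * T_REPEATER, which is minimized
    near n = L * sqrt(0.38*r*c / T_REPEATER) and then grows linearly in L.
    """
    rc = 0.38 * R_PER_MM * C_PER_MM
    n = max(1, round(length_mm * math.sqrt(rc / T_REPEATER)))
    return rc * length_mm ** 2 / n + n * T_REPEATER


def switching_energy(capacitance_f: float, voltage_v: float) -> float:
    """Energy drawn from the supply for one full-swing charge: C * V^2."""
    return capacitance_f * voltage_v ** 2


for length in (5.0, 10.0):
    print(f"{length:4.1f} mm: unrepeated {unrepeated_delay(length) * 1e9:.2f} ns, "
          f"repeated {repeated_delay(length) * 1e9:.2f} ns")
```

Under these assumptions, doubling the wire length roughly quadruples the unrepeated delay but only doubles the repeated delay, and halving the voltage swing cuts the switching energy to a quarter.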
13. Low-swing Wires
[Figure: differential wires; labels: 50 mV rise, 400 mV, 50 mV drop]
14. Differential Low-swing
- Very low-power, can be routed over other modules
- Relatively slow, low-bandwidth, high area requirement, requires special transmitter and receiver
- Bitlines are a form of low-swing wire
- Optimized for speed and area as against power
- Driver and pre-charger employ full Vdd voltage
15. Delay Characteristics
[Figure: delay characteristics plot, annotated "Quadratic increase in delay"]
16. Energy Characteristics
17. Search Space of CACTI-5
- Design space with global wires optimized for delay
18. Search Space of CACTI-6
- Design space with global and low-swing wires
[Figure: design-space plot with regions labeled Low-swing, 30% Delay Penalty, Least Delay]
19. CACTI: Another Limitation
- Access delay is equal to the delay of the slowest sub-array
- Very high hit time for large caches
- Employs a separate bus for each cache bank for multi-banked caches
- Not scalable
Potential solution: NUCA. Extend CACTI to model NUCA.
Exploit different wire types and network design choices to improve the search space
20. Non-Uniform Cache Access (NUCA)
- Large cache is broken into a number of small banks
- Employs on-chip network for communication
- Access delay ∝ (distance between bank and cache controller)
[Figure: CPU and L1 connected to a grid of cache banks (Kim et al., ASPLOS 02)]
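The distance-dependent access delay can be sketched for a grid of banks. This is a minimal illustration under stated assumptions: the cache controller sits at grid corner (0, 0), a request travels to the bank and the data travels back along the same path, and the bank and per-hop cycle counts are hypothetical.

```python
def bank_latency(row: int, col: int, bank_cycles: int, hop_cycles: int) -> int:
    """Cycles to access the bank at (row, col) from a controller at (0, 0).

    The Manhattan distance gives the hop count; it is doubled for the
    request and the reply.
    """
    hops = row + col
    return bank_cycles + 2 * hops * hop_cycles


def grid_latencies(rows: int, cols: int, bank_cycles: int, hop_cycles: int) -> list:
    """Latency of every bank in a rows x cols grid."""
    return [bank_latency(r, c, bank_cycles, hop_cycles)
            for r in range(rows) for c in range(cols)]


# Hypothetical 4x4 grid: 3-cycle banks, 1 cycle per hop.
latencies = grid_latencies(4, 4, bank_cycles=3, hop_cycles=1)
nearest = min(latencies)
farthest = max(latencies)
average = sum(latencies) / len(latencies)

print(f"nearest bank: {nearest} cycles")
print(f"farthest bank: {farthest} cycles")
print(f"average over all banks: {average:.1f} cycles")
```

A uniform-access design would have to budget for the farthest bank on every access, whereas NUCA lets nearby banks respond sooner.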
21. Extension to CACTI
- On-chip network
- Wire model based on ITRS 2005 parameters
- Grid network
- 3-stage speculative router pipeline
- Network latency vs. bank access latency trade-off
- Iterate over different bank sizes
- Calculate the average network delay based on the number of banks and bank sizes
- Consider contention values for different cache configurations
- Similarly, we also consider power consumed for each organization
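The iteration described above can be sketched as a small search over bank counts. This is an illustrative sketch only, not CACTI 6.0's actual code: the bank access model, the link delay, and the contention table are hypothetical placeholders, and only the 3-cycle router pipeline is taken from the slide.

```python
import math

CACHE_KB = 32 * 1024  # 32 MB cache
ROUTER_CYCLES = 3     # 3-stage router pipeline (from the slide)
LINK_CYCLES = 1       # hypothetical link traversal delay per hop

# Hypothetical contention cycles per bank count; a real tool would take
# these from simulation of each cache configuration.
CONTENTION = {4: 20, 16: 8, 64: 4, 256: 3, 1024: 3}


def bank_access_cycles(bank_kb: int) -> int:
    """Hypothetical bank latency that grows with the square root of capacity."""
    return 2 + math.ceil(math.sqrt(bank_kb) / 16)


def avg_network_cycles(num_banks: int) -> int:
    """Average one-way network delay on a square grid, controller at a corner.

    The mean Manhattan distance from corner (0, 0) of a side x side grid
    is (side - 1) hops.
    """
    side = math.isqrt(num_banks)
    return (side - 1) * (ROUTER_CYCLES + LINK_CYCLES)


def explore():
    """Return the bank count with the lowest total latency and all totals."""
    totals = {}
    for num_banks, contention in CONTENTION.items():
        bank_kb = CACHE_KB // num_banks
        totals[num_banks] = (bank_access_cycles(bank_kb)
                             + avg_network_cycles(num_banks)
                             + contention)
    best = min(totals, key=totals.get)
    return best, totals


best, totals = explore()
for num_banks, cycles in totals.items():
    print(f"{num_banks:5d} banks: {cycles} cycles")
print(f"best organization: {best} banks")
```

The same loop could accumulate an energy estimate per organization so that power is compared alongside delay.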
22. Trade-off Analysis (32 MB Cache)
16-core CMP
23. Effect of Core Count
24. Power-Centric Design (32 MB Cache)
25. Validation
- HSPICE tool
- Predictive Technology Model (65 nm tech.)
- Analytical model that employs PTM parameters compared against HSPICE
- Distributed wordlines, bitlines, low-swing transmitters, wires, receivers
- Verified to be within 12%
26. Case Study: Heterogeneous D-NUCA
- Dynamic-NUCA
- Reduces access time by dynamic data movement
- Near-by banks are accessed more frequently
- Heterogeneous banks
- Near-by banks are made smaller and hence faster
- Access to nearby banks consumes less power
- Other banks can be made larger and more power-efficient
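The benefit of making frequently accessed nearby banks faster can be sketched as a weighted average of latencies. All hit fractions and latencies below are hypothetical values chosen for illustration; they are not results from the talk.

```python
def average_latency(hit_fractions, latencies):
    """Expected access latency: each region's latency weighted by the
    fraction of requests it satisfies."""
    if abs(sum(hit_fractions) - 1.0) > 1e-9:
        raise ValueError("hit fractions must sum to 1")
    return sum(f * lat for f, lat in zip(hit_fractions, latencies))


# Hypothetical fraction of requests served by near, middle, and far banks.
# D-NUCA migration concentrates hits in the near banks.
hit_fractions = [0.6, 0.3, 0.1]

# Hypothetical latencies in cycles for each region.
homogeneous = [8, 14, 20]    # all banks the same size
heterogeneous = [5, 14, 22]  # near banks smaller and faster, far banks larger and slower

print(f"homogeneous banks:   {average_latency(hit_fractions, homogeneous):.1f} cycles")
print(f"heterogeneous banks: {average_latency(hit_fractions, heterogeneous):.1f} cycles")
```

With these numbers the heterogeneous layout wins because the latency saved on the frequent near-bank hits outweighs the extra latency of the rarely accessed far banks.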
27. Access Frequency
[Figure: % of requests satisfied by x KB of cache]
28. Few Heterogeneous Organizations Considered by CACTI
[Figure: Model 1 and Model 2]
29. Other Applications
- Exposing wire properties
- Novel cache pipelining
- Early lookup, aggressive lookup (ISCA 07)
- Flit-reservation flow control (Peh et al., HPCA 00)
- Novel topologies
- Hybrid network (ISCA 07)
30. Conclusion
- Network parameters and contention play a critical role in deciding NUCA organization
- Wire choices have significant impact on cache properties
- CACTI 6.0 can identify models that reduce power by a factor of three for a delay penalty of 25%
- http://www.hpl.hp.com/personal/Norman_Jouppi/cacti6.html
- http://www.cs.utah.edu/rajeev/cacti6/