Interconnect Design Considerations for Large NUCA Caches - PowerPoint PPT Presentation


1
Interconnect Design Considerations for Large NUCA Caches
Naveen Muralimanohar, Rajeev Balasubramonian
2
Large Caches
Intel Montecito
  • Cache hierarchies will dominate chip area
  • Montecito has two private 12 MB L3 caches (27MB
    including L2)
  • Long global wires are required to transmit
    data/address

3
Wire Delay/Power
  • Wire delays are costly for performance and power
  • Latencies of 60 cycles to reach the ends of a chip
    at 32 nm (@ 5 GHz)
  • 50% of dynamic power is in interconnect switching
    (Magen et al., SLIP '04)
  • CACTI (version 3.2) access time for a 24 MB cache
    is 90 cycles @ 5 GHz, 65 nm technology

4
Contributions
  • Methodology to compute optimal baseline NUCA
    organization
  • Performs 51% better than prior NUCA models
  • Introduce heterogeneity in the network
  • Additional 15% improvement in performance

5
Cache Design Basics
[Diagram: cache access path. The input address drives the decoder; wordlines and bitlines read the tag and data arrays; column muxes, sense amps, comparators, mux drivers, and output drivers produce the data and valid outputs.]
6
Existing Model - CACTI
[Diagram: cache models with 4 and 16 sub-arrays.
Access delay = decoder delay + wordline/bitline delay + H-tree delay + logic delay]
7
CACTI Shortcomings
  • Access delay is equal to the delay of the slowest
    sub-array
  • Very high hit time for large caches
  • Employs a separate bus for each cache bank for
    multi-banked caches
  • Not scalable

Potential solution: NUCA. Extend CACTI to model
NUCA, and exploit different wire types and network
design choices to reduce access latency.
8
Non-Uniform Cache Access (NUCA)
  • Large cache is broken into a number of small
    banks
  • Employs on-chip network for communication
  • Access delay ∝ (distance between bank and cache
    controller)

[Diagram: CPU with L1 cache connected to a grid of L2 cache banks (Kim et al., ASPLOS '02)]
9
Extension to CACTI
  • On-chip network
  • Wire model based on ITRS 2005 parameters
  • Grid network
  • 3-stage speculative router pipeline
  • Network latency vs Bank access latency tradeoff
  • Iterate over different bank sizes
  • Calculate the average network delay based on the
    number of banks and bank sizes
  • Similarly, we consider the power consumed by each
    organization
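The sweep described above can be sketched in a few lines of Python. This is an illustrative model, not the CACTI code: the per-hop link and router latencies and the placement of the cache controller at a grid corner are assumptions, and the candidate bank access times are taken from the access-time table later in the deck.

```python
# Sketch of the bank-count sweep: for each candidate bank count, total
# access time = bank access time + average network delay to a bank on a
# 2D grid. Hop costs below are illustrative assumptions.
import math

def avg_access_time(bank_count, bank_access, hop_link=2, hop_router=3):
    """Average access latency (cycles) over all banks of a square-ish grid.

    Per-hop network delay = link traversal + router pipeline; the cache
    controller is assumed to sit at grid position (0, 0).
    """
    cols = int(math.sqrt(bank_count))
    rows = bank_count // cols
    total = 0
    for r in range(rows):
        for c in range(cols):
            hops = r + c  # Manhattan distance from the controller
            total += bank_access + hops * (hop_link + hop_router)
    return total / bank_count

# Fewer, larger banks -> slow banks but few hops; many tiny banks -> fast
# banks but long network paths. The sweep exposes the sweet spot in between.
candidates = {4: 62, 16: 17, 64: 6, 256: 4}  # bank count -> access time
best = min(candidates, key=lambda n: avg_access_time(n, candidates[n]))
```

With these assumed hop costs the minimum falls at an intermediate bank count, mirroring the U-shaped curve on the next slide.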

10
Effect of Network Delay (32MB cache)
[Chart: access latency vs. bank count for the 32 MB cache, compared against the earlier NUCA model]
11
Power-Centric Design (32 MB Cache)
12
Wire Design Space
  • Wires can be tuned for low latency or low power
  • Low-power wires: smaller, fewer repeaters
  • Fast wires: fat, low bandwidth

  • Global wires: B-wires, 8x plane
  • Semi-global wires: W-wires, 4x plane
  • Power-optimized: PW-wires, 4x plane
  • Fast, low bandwidth: L-wires, 8x plane
13
Wire Model
[Diagram: wire RC model showing side-wall capacitance (Cside-wall) and adjacent-wire coupling (Cadj). Ref: Banerjee et al., IEEE TED 2002]
Wire Type     Relative Latency  Relative Area  Dynamic Power  Static Power
B-Wire (8x)   1x                1x             2.65α          1x
B-Wire (4x)   1.6x              0.5x           2.9α           1.13x
L-Wire (8x)   0.25x             8x             1.46α          0.55x
PW-Wire (4x)  3.2x              0.5x           0.87α          0.3x
(α = switching activity factor)
65nm process
14
Access time for different link types
Bank Count  Bank Access Time  Avg (8x-wires)  Avg (4x-wires)  Avg (L-wires)
2           77                122             157             98
4           62                119             156             90
8           26                80              110             56
16          17                70              99              54
32          9                 72              99              57
64          6                 86              127             70
128         5                 108             149             101
256         4                 147             179             132
512         3                 210             240             195
(all times in cycles)
15
Cache Look-Up
Total cache access time is the sum of:
  • Network delay (4-6 bits of the address identify the
    cache bank)
  • Bank access: decoder, WL, BL delay (10-15 bits of
    the address)
  • Comparator and output driver delay (rest of the
    address)
  • Data transfer
  • The entire access happens in a sequential manner

16
Early Look-Up
Traditional Access
  • Send partial address in L-wires
  • Initiate the bank lookup
  • In parallel send the complete address
  • Complete the access

[Timeline: traditional access vs. early lookup. 10-15 bits of the address sent on L-wires start the bank lookup; the tag match completes when the full address arrives, followed by the data transfer]
  • We can hide 70% of the bank access delay

17
Aggressive Look-Up
[Diagram: aggressive lookup. An additional 8 bits of the address (e.g., 11100010) sent on L-wires are compared against the low bits of each full tag entry (e.g., 11011101111100010); the full tag match happens at the cache controller]
18
Aggressive Look-Up
  • Reduction in link delay (for address transfer)
  • Increase in traffic due to false matches: < 1%
  • Marginal increase in link overhead (additional
    8 bits)
  • More logic at the cache controller for tag match
  • Address transfer for writes happens on L-wires
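The two-phase partial tag match can be sketched as follows. The 8-bit width and the example tag values come from the slides; treating the low-order tag bits as the partial tag, and the helper names, are assumptions for illustration.

```python
# Sketch of the aggressive-lookup partial tag match: the controller sends
# only 8 bits of the tag on the L-wires, a bank speculatively matches them,
# and the full tag match confirms the hit at the controller. With 8 bits,
# a random non-matching line collides with probability 1/256, which is why
# false matches add < 1% traffic.

PARTIAL_BITS = 8
MASK = (1 << PARTIAL_BITS) - 1

def lookup(ways, full_tag):
    """Return (speculative_matches, confirmed_hits) for one set."""
    partial = full_tag & MASK
    early = [w for w in ways if (w & MASK) == partial]  # at the bank
    confirmed = [w for w in early if w == full_tag]     # at the controller
    return early, confirmed

# Tags from the slide's example: one true hit plus one false partial match
# (both ways end in the partial tag 11100010).
ways = [0b11011101111100010, 0b10101010111100010]
early, confirmed = lookup(ways, 0b11011101111100010)
```

The false match in `early` costs one wasted speculative bank read, but the confirmed result is always correct.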

19
Heterogeneous Network
  • Routers introduce significant overhead
    (especially in L-network)
  • L-wires can transfer signal across four banks in
    four cycles
  • Router adds three cycles for each hop
  • Modify network topology to take advantage of wire
    property
  • Different topology for address and data transfers
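The router-overhead argument above can be made concrete with a small sketch. The cycle counts (one bank per cycle on L-wires, three cycles per router hop) follow the bullets; placing a router only at every few banks stands in for the modified topology and is a simplifying assumption.

```python
# Sketch: L-wires cover one bank per cycle, but a 3-cycle router at every
# bank hop swamps that wire speed. Removing intermediate routers lets the
# address network actually exploit the fast wires.

WIRE_CYCLES_PER_BANK = 1  # L-wires: four banks in four cycles
ROUTER_CYCLES = 3         # per-hop router pipeline delay

def traverse(banks, banks_per_hop):
    """Cycles to cross `banks` banks with a router every `banks_per_hop`."""
    hops = banks // banks_per_hop
    return banks * WIRE_CYCLES_PER_BANK + hops * ROUTER_CYCLES

dense = traverse(8, 1)   # router at every bank
sparse = traverse(8, 4)  # router every four banks
```

Crossing eight banks takes 32 cycles with a router per bank but only 14 cycles with a router every four banks, so most of the dense-network latency is router overhead, not wire delay.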

20
Hybrid Network
  • Combination of point-to-point and bus
  • Reduction in latency
  • Reduction in power
  • Efficient use of L-wires
  • Drawback: low bandwidth

21
Experimental Setup
  • Simplescalar with contention modeled in detail
  • Single core, 8-issue out-of-order processor
  • 32 MB, 8-way set-associative, on-chip L2 cache
    (SNUCA organization)
  • 32KB L1 I-cache and 32KB L1 D-cache with a hit
    latency of 3 cycles
  • Main memory latency: 300 cycles

22
CMP Setup
  • Eight Core CMP
  • (Simplescalar tool)
  • 32 MB, 8-way set-associative
  • (SNUCA organization)
  • Two cache controllers
  • Main memory latency: 300 cycles

[Diagram: eight cores C1-C8 arranged around the shared array of L2 banks]
23
Network Model
  • Virtual-channel flow control
  • Four virtual channels per physical channel
  • Credit-based flow control (for backpressure)
  • Adaptive routing
  • Each hop must reduce the Manhattan distance between
    the source and the destination
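The minimal-adaptive-routing constraint in the last bullet can be sketched as follows; the grid coordinates and helper names are illustrative assumptions.

```python
# Sketch of minimal adaptive routing: a hop is legal only if it strictly
# reduces the Manhattan distance to the destination, so packets can adapt
# around congestion but never take non-minimal detours.

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def legal_hops(pos, dst):
    """Neighbors of `pos` that strictly reduce the distance to `dst`."""
    x, y = pos
    neighbors = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [n for n in neighbors if manhattan(n, dst) < manhattan(pos, dst)]

# From (0, 0) toward (2, 3), both the +x and +y hops are legal; the router
# chooses between them based on per-virtual-channel credits (backpressure).
hops = legal_hops((0, 0), (2, 3))
```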

24
Cache Models
Model  Bank Access (cycles)  Bank Count  Network Link  Description
1      3                     512         B-wires       Based on prior work
2      17                    16          B-wires       CACTI-L2
3      17                    16          B + L-wires   Early lookup
4      17                    16          B + L-wires   Aggressive lookup
5      17                    16          B + L-wires   Hybrid network
6      17                    16          B-wires       Upper bound
25
Performance Results (Uniprocessor)
[Chart: improvement of the CACTI-derived model over the prior-work model: 73% on average, 114% for L2-sensitive benchmarks]
Latency-sensitive benchmarks: 70% of the SPEC suite
26
Performance Results (Uniprocessor)
[Chart: improvements over Model 2: early lookup 6% average (8% L2-sensitive), aggressive lookup 8% (9%), hybrid network 15% (20%)]
Latency-sensitive benchmarks: 70% of the SPEC suite
27
Performance Results (CMP)
28
Performance Results (4X Wires)
  • Wire-delay-constrained model
  • Performance improvements are better
  • Early lookup: 7%
  • Aggressive model: 20%
  • Hybrid model: 29%

29
Conclusion
  • Network parameters play a significant role in the
    performance of large caches
  • Modified CACTI model that includes network
    overhead performs 51% better than previous
    models
  • Methodology to compute an optimal baseline NUCA

30
Conclusion
  • Wires can be tuned for different metrics
  • Routers impose non-trivial overhead
  • Address and data have different bandwidth needs
  • We introduce heterogeneity at three levels
  • Different types of wires for address and data
    transfers
  • Different topologies for address and data
    networks
  • Different architectures within address network
    (point-to-point and bus)
  • (Yields an additional performance improvement of
    15% over the optimal baseline NUCA)

31
Performance Results (Uniprocessor)
Model derived from CACTI: improvement over the model
assumed in the prior work is 73% on average (114% for
L2-sensitive benchmarks)
32
Performance Results (Uniprocessor)
Early lookup technique: average improvement over
Model 2 is 6% (L2-sensitive: 8%)
33
Performance Results (Uniprocessor)
Aggressive lookup technique: average improvement
over Model 2 is 8% (L2-sensitive: 9%)
34
Performance Results (Uniprocessor)
Hybrid model: average improvement over Model 2 is
15% (L2-sensitive: 20%)
35
Outline
  • Problem Overview
  • Cache Design Basics
  • Extensions to CACTI
  • Effect of Network Parameters
  • Wire Design Space
  • Exploiting Heterogeneous Wires
  • Results

38
Outline
  • Overview
  • Cache Design
  • Effect of Network Parameters
  • Wire Design Space
  • Exploiting Heterogeneous Wires
  • Methodology
  • Results

39
Aggressive Look-Up
[Diagram: aggressive lookup, backup. Full tag entries for ways 1..n (e.g., 11011101111100010) are matched against the additional 8 address bits (11100010) sent on L-wires for the partial tag match; the full tag match happens at the cache controller]