3D Integration for High-Performance Processor Microarchitectures
Transcript and Presenter's Notes
1
3D Integration for High-Performance Processor
Microarchitectures
  • Gabriel H. Loh

2
Outline
  • Motivation
  • 3D - what is it?
  • 3D - what can we do with it?
  • Looking toward the future

3
Wire Scaling Problems
Devices are getting faster...
Wires shrink, too...
... but their delays aren't improving much
4
Wire Scaling Problems
[Figure: one clock cycle divided into device delay vs. wire delay at 180nm, 130nm, 90nm, and 65nm; the wire-delay share grows each generation (not to scale).]
For some designs, wire delay already accounts for 50% of a clock cycle
Fetzer and Orton, ISSCC '02
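To make the trend concrete, here is a hedged back-of-the-envelope sketch (all values are illustrative assumptions, not data from the slides): a wire's distributed RC delay grows as its cross-section shrinks, so the same 1mm route keeps getting slower relative to gates.

```python
# Hedged sketch: first-order distributed-RC model of why per-mm wire delay
# fails to keep up with device scaling. All numbers are illustrative.
RHO_CU = 1.7e-8    # ohm*m, copper resistivity
C_PER_M = 2e-10    # F/m, per-length capacitance (roughly node-independent)

def wire_delay_ps(length_mm, width_nm, thickness_nm):
    """0.38*R*C Elmore delay of a distributed RC line, in picoseconds."""
    length_m = length_mm * 1e-3
    r = RHO_CU * length_m / ((width_nm * 1e-9) * (thickness_nm * 1e-9))
    c = C_PER_M * length_m
    return 0.38 * r * c * 1e12

# Same 1mm route, wire cross-section halved in each dimension:
print(wire_delay_ps(1.0, 200, 400))   # older node
print(wire_delay_ps(1.0, 100, 200))   # scaled node: ~4x the RC delay
```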
5
Power Scaling Problem
[Chart: power density (W/cm², log scale from 1 to 1000) vs. process node from 1.5µm down to 65nm. i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, P4, and Prescott climb past the "Hot Plate" line toward "Nuclear Reactor" territory; Core 2 Duo pulls back toward the hot-plate line. D'oh!]
Data: Intel, sandpile.org
6
Power Scaling Problems
  • Peak power
  • Power density
  • Average power

Consequences: giant heatsinks, liquid cooling, HVAC cooling, 1000W PSUs, power costs, battery life (and weight), environmental noise
[Slide shows product photos: Coolmax PS-CTG1000 1000W PSU, CoolIT Freezone and SCNJ-1100P Scythe Ninja PLUS CPU coolers, Toshiba PA3107U-1BRS battery, air conditioning, etc.]
7
Technology Scaling Problems
  • Physical limits to silicon, lithography, etc.

... how far? 45nm → 32nm → 22nm → 16nm → 11nm → 8nm → 5nm
Many other challenges...
The limits are inevitable, but we've been delaying the inevitable for a while now
8
3D Integration
  • One possible way to delay the inevitable for
    several generations

Moore's Law: the number of transistors on a chip doubles every two years
Traditionally accomplished by reducing feature size
Can achieve the same density scaling by stacking in the third dimension
[Photo: Gordon Moore; www.intel.com (not a direct quote!)]
9
High-Level Benefits
Wires are a major source of circuit delay and energy consumption
[Figure: the same driver-to-load circuit needs millimeters of metal when placed and routed in 2D, but only microns of metal across two layers in 3D.]
Can place and route circuits in 3D: simultaneous reduction in latency and energy
You can also mix-and-match process technologies (e.g., stack DRAM on CMOS)
10
Manufacturing Assumptions
  • Wafer Bonding

Wafer 1: thinned to 10s of microns; etched through to connect power and I/O
Wafer 2: left unthinned for structural support; the heat sink attaches to this side
(face-to-face example)
  • compatible with current manufacturing
  • 3D bonding is a relatively straightforward BEOL
    process
  • coarser connections
  • requires backside etching

This approach is currently pursued by Intel, IBM, and others
11
Key Parameters
  • For this talk, we mostly assume face-to-face
    wafer bonding

[Figure: two dies bonded face-to-face; Wafer 1 thinned to 10-20 µm, with TSVs through its backside and d2d vias between the two faces.]
  • Two types of vertical interconnect:
  • die-to-die (d2d) vias, in between the bonded faces
  • through-silicon vias (TSVs), on the backside of the thinned die
d2d latency is fast: well under one gate delay (www.tezzaron.com)
12
Not a Device Physics/Fab talk
  • 3D integration works

Computer architecture question: given 3D integration, what can (and should) we build?
Image: Intel Corporation
13
3D Landscape
[Figure: the design space along three axes]
  • Integration heterogeneity: CMOS only → mixed process (e.g., CMOS+DRAM)
  • Granularity of stacking: 2D cores → 3D cores (block-on-block, gate-on-gate, transistor-on-transistor)
  • Number of cores: single core → multi-core → many core
14
Circuit Level
  • Example: SRAMs
  • Used for caches, register files, branch predictors, register renamer, etc.
  • Questions:
  • How to organize in 3D?
  • Benefits?

[Figure: 2D SRAM array with row decoder, wordline wire delay, bitline wire delay, and column mux / sense amp / drive out.]
15
3D Organizations (1)
  • Split bitlines: bitlines split between layers
Bitline wire length is halved, and there is less wire in the row decoder
Reduces both latency and energy consumption
16
3D Organizations (2)
  • Split wordlines: wordline split between layers
Half-length wordlines, and less wire in the column mux
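A hedged first-order model of why these splits help (the component weights below are invented for illustration, not measured data): distributed-RC delay grows roughly with the square of wire length, so halving a bitline or wordline across two layers cuts that delay component by about 4x while leaving the other components unchanged.

```python
# Minimal sketch, assuming made-up per-component delays (ps):
def sram_latency_ps(wordline_frac=1.0, bitline_frac=1.0):
    decode, wordline, bitline, sense = 30.0, 25.0, 35.0, 20.0  # assumed
    # distributed RC scales ~quadratically with wire length
    return (decode + wordline * wordline_frac ** 2
            + bitline * bitline_frac ** 2 + sense)

print(sram_latency_ps())                   # 2D baseline: 110 ps
print(sram_latency_ps(bitline_frac=0.5))   # split bitlines:  83.75 ps
print(sram_latency_ps(wordline_frac=0.5))  # split wordlines: 91.25 ps
```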
17
Granularity of Stacking
Original 2D → cache on cores → banks on banks → bitcells on bitcells
Fine-grained stacking provides more wire reduction, but also requires more d2d vias
18
Empirical Results
[Chart: latency (ps) and energy (pJ) for 2D vs. 3D SRAM organizations, optimized for latency.]
For a 4-layer stack: 30% lower latency... and 10% lower energy
Puttaswamy and Loh, ICCD 2005
Puttaswamy and Loh, HPCA 2007
19
Datapath/Layout Level
  • Example: bypass network

Every ALU can forward its result to any other ALU; the network must support n results, and each mux chooses from n results
There are n ALUs → 2n muxes
  • Total area for bypass network:
  • O(n) × O(n) × 2n
  • O(n³) area
  • O(n²) wire length
20
Bit-Split Datapath
  • Place bits by significance or interleaved

n inputs and n results split over L layers; there are still n ALUs → 2n muxes
  • Total area for bypass network:
  • O(n/L) × O(n/L) × 2n
  • O(n³/L²) area
  • O(n²/L) wire length

For L=2 layers, area decreases to 0.25× of the original (75% reduction!)
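A quick sketch of the area arithmetic above (constants are arbitrary; only the ratio matters):

```python
# Each planar dimension of the bypass network scales with the number of
# results routed in that plane, so splitting the bits over L layers
# shrinks each dimension from O(n) to O(n/L).
def bypass_area(n, layers=1):
    side = n / layers            # O(n/L) per planar dimension
    return side * side * 2 * n   # footprint x 2n muxes

n = 8
print(bypass_area(n, 2) / bypass_area(n, 1))  # 0.25 -> the 75% reduction
```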
21
Bit-Split Datapath
No die-to-die vias are needed in the bypass network, since result bit i will never be forwarded to input bit j (for i ≠ j)
ALUs may need some die-to-die vias (e.g., for carry propagation)
22
Empirical Results
64-bit datapath (2D, one layer)
32 bits per layer (3D, two layers): -30.9% latency, -29.2% energy
16 bits per layer (3D, four layers): -47.2% latency, -44.9% energy
Puttaswamy and Loh, HPCA 2007
23
Thermals?
  • 3D stacking can improve wire delay and power
  • ... but power density may still increase
  • ... which can lead to higher chip temperatures

24
Microarchitecture Level
  • Arrange blocks/circuits to:
  • Reduce power density
  • Keep power close to the heat sink
  • While reducing wire (for latency and power)

First, an observation: most integer operations use small values (7, -4, 149, etc.)
0000000000000000000000000000000000000000000000000000000000001110
0000000000000000000000000000000000000000000000000000000000001001
0000000000000000000000000000000000000000000000000000000000010111
Most bits are zeros, and not used/useful
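To make the observation concrete, here is a hypothetical 16-bit-granularity significance check (an illustration only, not code from the papers):

```python
# Count how many 16-bit chunks (i.e., datapath layers) a value needs.
def significant_chunks(value_64bit):
    v = value_64bit & 0xFFFFFFFFFFFFFFFF
    chunks = 1
    for shift in (16, 32, 48):
        if v >> shift:               # anything set above this boundary?
            chunks = shift // 16 + 1
    return chunks

print(significant_chunks(14))        # 1 -> only the bits 0-15 layer needed
print(significant_chunks(1 << 40))   # 3 -> three layers needed
```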
25
Significance Partitioned Datapath
  • Bypass and ALUs similar to earlier,
  • but split everything in this fashion:

bits 0-15
bits 16-31
bits 32-47
bits 48-63
26
Thermal Herding
[Figure: four stacked datapath layers: bits 0-15 (this end closer to the heat sink), bits 16-31, bits 32-47, bits 48-63.]
Most of the time (e.g., ADD 5+9), only the one layer closest to the heat sink is active and the other three layers are inactive; a wide-value ADD (e.g., ADD 8910265539) makes all four layers active
Circuits are implemented in 3D, so latency and power are reduced whether one or all layers are active
27
Width Prediction
  • To clock-gate a block, we need to know whether to gate it early enough

By the time we know that the value only needs a few bits, it's too late to clock-gate the RF
[Figure: with prediction, a load of R5 is predicted low-width before the register-file read; the significance detector verifies the value afterward (e.g., 0x00000013 is low-width), and we stall and re-read if necessary.]
Dynamic width prediction can be very accurate (98% on average) [Loh, MICRO '02]
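A minimal sketch of a width predictor in this spirit; the PC-indexed table and last-outcome policy here are assumptions for illustration, not the paper's exact design:

```python
# Predict "narrow" if this instruction's last result was narrow; verify
# after the read and stall/re-read on a misprediction, as described above.
class WidthPredictor:
    def __init__(self, entries=1024):
        self.table = [True] * entries   # True = predict low-width

    def predict(self, pc):
        return self.table[pc % len(self.table)]

    def update(self, pc, value):
        # Narrow if nothing is set above bit 15.
        self.table[pc % len(self.table)] = (value >> 16) == 0

wp = WidthPredictor()
wp.update(0x400123, 0x13)        # last result fit in 16 bits
print(wp.predict(0x400123))      # True -> clock-gate the upper layers
```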
28
Overview of Other Components
[Figure: per-block 3D strategies]
  • Significance-partitioned using width prediction
  • 3D-stacked, but no explicit thermal control
  • Lookup/update split
  • Entry-partitioned
  • Port-stacked
29
Performance/Power Results
Resulted in +47.9% clock speed and +47.0% performance; the wire reduction due to 3D is significant (-19.3% power)
Benefits are even greater with thermal herding (-28.6% power)
Puttaswamy and Loh, HPCA 2007
30
Thermals
Thermals are a challenge for 3D processors, but not an insurmountable problem or a show-stopper
The hotspot increases by 19°C; Thermal Herding reduces this to only 8°C, and can even reduce temperature below 2D!
The remainder may require V/f scaling, better cooling, etc.
Results collected with UVa HotSpot 3.0. Preliminary results with an Intel thermal modeling tool indicate Thermal Herding is even more effective.
Puttaswamy and Loh, HPCA 2007
31
Obvious: Stack DRAM on CPU
3D-stacked DRAM, wider bus: Liu et al. [IEEE D&T 2005], Loi et al. [DAC 2006], Kgil et al. [ASPLOS 2006]
Tezzaron FaStack 3D DRAM
32
Simple 3D DRAM Stacking Works
Very low latency between the core and MC, and between the MC and DRAM; the MC runs at core speed
Greatly reduced contention on the bus between core and MC
Actual DRAM array access latency is reduced
33
Still Using 2D Interface
  • Previous approaches showed decent gains, but
    still used a traditional interface to memory

This interface only needs ~100 bits, but die-stacking provides many 1000s of connections!
34
Simple Modifications
Increase number of ranks
Add more row buffer entries
Add more memory controllers
35
Performance Impact
[Chart: speedups of +33.8% and +30.6% from the individual interface modifications.]
Total of 74.7% over the previous 3D DRAM approach (3.8x over 2D)
A better MSHR design gives a further 17.8% (106% over prev. 3D)
36
3D Memory Stacking Summary
  • Doing the simple thing works
  • Opening up the interface gets you more
  • Adjusting the microarchitecture to match the new
    interface gets you even more

These general lessons can be applied to other
components that you want to implement in 3D
37
Research Summary
  • Exploring 3D integration at many levels of
    microprocessor design
  • Additional transistors and shorter wires can be
    converted into performance and power benefits
  • Increasing the utilization of the 3D interface
    yields better results
  • Simple approaches are beneficial, too
    (low-hanging fruit)
  • Still a lot of open research
  • 3D for new microarchitectures
  • More opportunities for mixed-process integration

38
What Else?
  • Other open research problems in 3D:
  • Reliability (di/dt noise, electromigration)
  • Parametric variations, yield
  • Tools (CAD/EDA)
  • Test, DFT and Debug

[Figure: reliability/variation illustration (slow vs. fast, bad vs. good corners); http://www.bo.imm.cnr.it/researchs/elettronica/index_files/rel_res.htm]
39
Summary
  • 3D can delay the end of Moore's Law
  • this is worth many billions of $
  • more importantly, enables other areas of
    computation to continue their advancements
  • Success depends on (among other things) new
    architectures to exploit the 3D technology while
    dealing with the challenges

40
Acknowledgments (people)
  • GA Tech
  • Prof. Hsien-Hsin S. Lee, Prof. Sung Kyu Lim
  • Kiran Puttaswamy, Dean Lewis, Michael Healy, Dae
    Hyun Kim
  • Tutorials:
  • MICRO '06: Yuan Xie (Penn State), Bryan Black (Intel), John Devale (Intel), Kerry Bernstein (IBM)
  • ISCA '08: Jian Li (IBM), Jerry Bautista (Intel), Jason Cong (UCLA), Hsien-Hsin Lee (GT)
  • Intel
  • Bryan Black, John Devale, Jeff Rupley, Ned
    Brekelbaum, Don McCauley, Paul Reed

41
Acknowledgments (Other)
  • Funding
  • SRC/FCRP C2S2
  • NSF CAREER
  • State of Georgia
  • Equipment
  • Intel Corporation

42
More Info
  • www.3D.GATECH.edu
  • Starting points for 3D architecture:
  • Loh, Xie, Black, "Processor Design in Three-Dimensional Die-Stacking Technologies," IEEE Micro magazine, May/June 2007
  • Xie, Loh, Black, Bernstein, "Design Space Exploration for 3D Architectures," ACM Journal on Emerging Technologies in Computing Systems, 2(2), pp. 65-103, April 2006

43
>>>> BACKUP SLIDES <<<<
44
Example Bonding Process
1. Manufacture dies/wafers separately
2. Deposit Cu via stubs
3. Thermocompression bonding
4. CMP backside thinning (~10 µm)
5. Backside etching for power, ground, I/O
6. Dice, package, etc. (heat spreader, heat sink, etc.)
45
d2d Via Latency
1mm of wire ≈ 225 picoseconds
Die-to-die delay ≈ 8 picoseconds: << 1 gate delay/FO4
RC(d2d) ≈ 0.35 × RC(via stack)
FO4 delay ≈ 22 picoseconds (approximately one gate delay)
HSpice, (B)PTM 70nm transistor model
Sources: Puttaswamy and Loh, ICCD 2005; Intel Corporation
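Relating the slide's numbers to each other:

```python
# Values taken from the slide above.
wire_ps_per_mm, d2d_ps, fo4_ps = 225, 8, 22
print(d2d_ps / fo4_ps)           # ~0.36: a d2d hop costs about 1/3 of an FO4
print(wire_ps_per_mm / d2d_ps)   # ~28: 1mm of planar wire ~ 28 d2d hops
```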
46
Intel Stack Example
(d2d via)
Source: Intel
47
X-SEM of Bond Structure
Good bonding is indicated when no voiding or seam is seen between the metal pieces
Source: Intel
48
300mm Wafer Bonding
  • Scanning Acoustic Microscopy Image (CSAM)
  • Non-optimal bonding time/temperature
  • Good bonding across the full 300 mm wafer

Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
49
3D Bond Test Configuration
50
Sample Chain Results
Calculated from geometry, modeling 5% layer thickness variation
Calculated from geometry
Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
  • Resistance measured using backside through-silicon vias; each point is a chain of 4096 links
  • Obtained tight distributions of resistances with high yield
  • Measurements agree with calculated values based on geometry
  • Negligible contribution from bond interface resistance

Source: Intel
51
Individual Device Parameters
  • Wafer Level Testing of NMOS and PMOS
  • No difference between thin stacked wafers and
    non-bonded wafers
  • Thin/Stacked outliers due to patterning issues
    with testing pads on backside

Ref: P. R. Morrow et al., "Three-dimensional Wafer Stacking via Cu-Cu Bonding Integrated with 65 nm Strained-Si/Low-k CMOS Technology," IEEE Electron Device Letters, 27(5), 335-337 (2006).
52
Yield of Logic+Logic Stacking
Planar: 1 wafer → 10 good die; 2 wafers → 20 good die
3D: 2 wafers → 22 good die
  • Half-size die:
  • Increases individual die yield
  • Dramatically increases die count (edge effect)
  • Bonding slow die to fast die:
  • A tight process will have a small impact
  • Simple etest pre-sort can eliminate the impact

Source: Intel
53
Die vs. Wafer Stacking
Die stacking: possible application logic + memory; TSV size ~50 µm; thickness ~100 µm; bonding structure: bump; bonding pitch: bump pitch
Wafer stacking: possible application logic + logic; TSV size <5 µm; thickness ~10 µm; bonding structure <5 µm; bonding pitch <8 µm
Currently limited by 300mm 3σ alignment capability
Source: Intel
54
Dealing with Large D2D Pitches
For split-wordline organizations, we need one d2d via per wordline. What if the d2d via pitch is too large for one via per wordline?
So long as the total d2d bandwidth is low enough, we can use dovetailed layouts
55
Power Density Estimates
  • 130nm P4:
  • 3.4 GHz, µPGA478, 0.13 µm, HTT
  • 109W peak
  • 131mm²
  • 109W / 1.31cm² ≈ 83 W/cm²
  • 90nm P4 (Prescott):
  • 3.6 GHz (Model 560), LGA775, 0.09 µm, HTT
  • 151W peak power
  • 112mm²
  • 151W / 1.12cm² ≈ 134 W/cm²

D'oh! Best guess... data on sandpile.org is somewhat difficult to correlate
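The density arithmetic above, spelled out:

```python
# Peak power divided by die area, converting mm^2 to cm^2.
def power_density_w_per_cm2(peak_w, die_mm2):
    return peak_w / (die_mm2 / 100.0)

print(power_density_w_per_cm2(109, 131))  # 130nm P4:      ~83 W/cm^2
print(power_density_w_per_cm2(151, 112))  # 90nm Prescott: ~134 W/cm^2
```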
56
>>>> MICRO 2006 <<<<
57
Core-Level Design
  • Simplest approach
  • Core-on-core, cache-on-core
  • Reuse existing 2D designs
  • first step in evolutionary path to new 3D
    processors

Cache-on-core: the LLC is relatively low power, so the thermal impact should be limited
Core-on-core: likely to have severe thermal problems
58
Options
  • Use 3D to build larger last-level caches (LLC)

2D baseline: Core 1 + Core 2 with the original 4MB L2 cache (SRAM)
3D (12MB): stack 8MB (SRAM) on the original 4MB
3D (32MB): stack 32MB (DRAM); original 4MB kept for fast tags
3D (64MB): stack 64MB (DRAM); original 4MB removed
59
Performance Impact
[Chart: IPC (0-1.4, left axis) and off-chip bandwidth (0-14 GB/s, right axis) for the 4MB, 12MB, 32MB, and 64MB configurations on the Intel RMS parallel/multi-core benchmarks: conj, dSym, gauss, pcg, sMVM, sSym, sTrans, spgAVDF, spgAVIF, spgUS, svd, svm, and their average.]
Significant performance increase, and a significant reduction in off-chip BW
Black et al. MICRO 2006
60
Thermal Impact
[Chart: peak temperature (°C) for 2D 4MB, 3D 12MB, 3D 32MB, and 3D 64MB; the 2D baseline is 88.35°C and the 3D configurations stay within a few degrees (88.43, 90.27, and 92.85°C).]
No major thermal issues
Black et al. MICRO 2006
61
Baseline Core 2 Duo Thermals
[Thermal map: Core 1, L2 Cache (4MB), FP, Ld/ST, and RS labeled; coolest point 59°C, hottest 88.4°C.]
The edge temperature drop is due to an epoxy fillet
62
Thermal Cost of DRAM on CPU
[Chart: peak temperature (°C, roughly 80-89 range) for CPU (2D), CPU (3D), DRAM on CPU (3D), DRAM on CPU minus I/O (3D), and CPU on DRAM configurations, with power budgets of 92W, 86W+6.2W, 92W+6.2W, and 92W+0W. Stacking DRAM on the CPU costs only a ~3.8°C increase, and one configuration is actually better than the 2D baseline.]
63
3D-Stacked P4
  • Planar die area is 50%
  • Eliminates RC delay and drivers, reduces driver sizing
  • 25% of pipe stages removed
  • 15% perf. improvement from pipestage elimination
  • 15% power improvement from clock elimination and pipestage elimination
  • Splitting FUBs and placement benefits are additive

[Floorplan: Top / Bottom layers]
Black et al., "3D Processing Technology and Its Impact on iA32 Microprocessors," Proceedings of the International Conference on Computer Design, October 2004
64
P4 3D Tradeoff Space

                   Power (W)   Pwr (%)   Temp (°C)   Perf (%)   Vcc    Freq
Baseline           147         100       99          100        1.00   1.00
Same Power         147         100       127         129        1.00   1.18
Same Frequency     125         85        113         115        1.00   1.00
Same Temperature   97.3        66        99          108        0.92   0.92
Same Performance   68.2        46        77          100        0.82   0.82

Perf vs. Freq: 0.82% performance for 1% frequency. Freq vs. Vcc: 1% for 1% in Vcc.
Black et al. MICRO 2006
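One way to approximately reproduce the Pwr column (a hedged reconstruction, not the paper's stated model: it assumes dynamic power ∝ Vcc² × f and the 15% iso-voltage/frequency 3D power saving from the previous slide):

```python
# Relative power of each 3D design point vs. the 2D baseline (100%).
def rel_power_pct(vcc, freq, base_3d_saving=0.15):
    return 100 * (1 - base_3d_saving) * vcc**2 * freq

for name, vcc, freq in [("Same Power",       1.00, 1.18),
                        ("Same Frequency",   1.00, 1.00),
                        ("Same Temperature", 0.92, 0.92),
                        ("Same Performance", 0.82, 0.82)]:
    # Prints ~100, 85, 66, 47; the table rounds the last to 46.
    print(f"{name}: {rel_power_pct(vcc, freq):.0f}%")
```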
65
>>>> ISCA 2008 <<<<
66
Estimate for # of DRAM Layers
  • Samsung K4T51083QE DDR2 SDRAM:
  • 10.9Mb/mm² (1.36MB/mm²) in 80nm
  • 27.9Mb/mm² (3.5MB/mm²) in 50nm
  • Quad-core in 45nm: 200-300mm²
  • DC Penryn 107mm² → QC 214mm²
  • QC Barcelona 285mm²
  • 1GB per layer → 294mm² per DRAM layer
  • Should be slightly less, since peripheral logic is placed on a separate layer
  • Else, 512MB per layer (16 DRAM, 1 peripheral logic)

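The per-layer area arithmetic, spelled out:

```python
# 1GB of DRAM at the quoted 50nm density of ~3.5MB/mm^2.
density_mb_per_mm2 = 3.5
print(1024 / density_mb_per_mm2)  # ~292.6 mm^2 per 1GB layer (~294 on slide)
```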
67
Thermal Impact
Results based on the UVA HotSpot thermal simulator
Performance modeling already accounts for a 32ms refresh rate (as opposed to 64ms)
The numbers are pessimistic in that we assume max power consumption for both CPU and DRAM (e.g., if the DRAM is running at full tilt, then the CPU is probably mostly idling; if the CPU is running at full tilt, then it is probably mostly hitting in the cache, and so DRAM activity would be low)
68
Baseline Config and Workloads
69
>>>> CF 2008 <<<<
70
3D Benefits and Costs
  • Benefits:
  • More processor resources (larger caches, functional units, buffers)
  • → higher performance
  • Additional functionality
  • On-stack voltage converters, noise control, profiling, ...
  • Costs:
  • More layers of silicon → $
  • Additional fab steps (and equipment) → $
  • Higher temps → better cooling → $

71
Benefits Not For Everybody
Some markets can't afford the additional costs, and other design points may already be thermally limited; we may not want 3D in these markets
Other markets (server, gaming) may be quite willing to deal with 3D costs for the benefits
72
Converged Design Methodology
  • One microarchitecture for all segments

[Diagram: one base core microarchitecture (e.g., the base K8 microarchitecture) spawns budget, mobile, desktop, server, and gaming products.]
  • Reduces design costs
  • differentiate products via speed/power binning, cache size, # of cores

Naively using 3D for only high-end products breaks the One Design methodology
73
Goal/Motivation
  • Use 2D as the default (for low-end, mobile, etc.)
  • Make 3D optional (for high-end market segments)
  • Maintain a single overall design
  • Stackable Microarchitecture Approach
  • baseline 2D processor has everything it needs to stand alone
  • optional 3D layers augment the conventional
    baseline structures
  • 3D layers snap on for segments that need it

74
Stackable SRAMs
  • For caches, predictors, register files, etc.

[Figure: Banks 0-3 split across layers. Mux-based bank select driven by configuration: with 2 banks, address bit a_i selects and the other select input is tied to 0; with 4 banks, both a_i and a_(i-1) select.]
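A hedged sketch of the configurable bank decode implied by the figure (the actual circuit in the paper may differ; the index width is an assumption for illustration):

```python
INDEX_BITS = 10   # assumed SRAM index width, for illustration only

def select_bank(addr, num_banks):
    """Use the top log2(num_banks) index bits as the bank select."""
    bits = num_banks.bit_length() - 1        # log2 for power-of-two counts
    if bits == 0:
        return 0                             # single bank: nothing to select
    return (addr >> (INDEX_BITS - bits)) & (num_banks - 1)

addr = 0b1100000000
print(select_bank(addr, 2))   # one select bit (a_i)          -> bank 1
print(select_bank(addr, 4))   # two select bits (a_i, a_(i-1)) -> bank 3
```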
75
Another View
76
Configuring Each Layer
Step 1: Layer Identification
Step 2: Per-Layer Configuration
[Figure: Layer IDs 3, 2, 1, 0 assigned up the stack, with the bottom layer's input tied to 0; config bits follow.]
Config bits are set during assembly, at boot-up, etc.
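A sketch of Step 1; the increment-and-pass-up mechanism is an assumption consistent with the figure, not a confirmed circuit:

```python
# Each layer's ID input comes from the layer below, incremented by one;
# the bottom layer's input is hard-wired to 0.
def assign_layer_ids(num_layers):
    ids, incoming = [], 0
    for _ in range(num_layers):
        ids.append(incoming)
        incoming += 1              # pass ID+1 up to the next layer
    return ids

print(assign_layer_ids(4))        # [0, 1, 2, 3] from bottom to top
```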
77
Modular Processor Overview
2D Floorplan and parameters based on Core 2
microarchitecture
78
Performance Results
µops-per-cycle speedup
79
Design Tradeoffs
80
Modular Design Conclusions
  • With a partitionable design, 3D benefits can be
    segregated across markets
  • Baseline 2D processor fulfills needs of mass
    markets
  • Optional, stackable layers provide value for
    higher-end market segments
  • Can be combined with conventional speed/power
    binning for further product-line differentiation

81
gtgtgtgt CAD/EDA ltltltlt
82
Some 3D CAD/EDA Problems
  • Layer partitioning
  • Global routing
  • Clock routing
  • Decap placement/sizing

See also www.GTCAD.gatech.edu
83
3D Layer Partitioning
  • Goals / Approach:
  • Partition modules and/or gates into multiple layers
  • Consider bonding style: F2F, F2B, B2B
  • Maximize inter-partition interconnect for F2F
  • Minimize inter-partition interconnect for F2B and B2B

84
3D Global Routing
  • Goals / Approach:
  • Construct a 3D routing tree
  • Optimize performance (Elmore delay) and power (wirelength)
  • Thermal-aware through-via insertion
  • move through-vias (which act as thermal passages) closer to hotspots

85
3D Clock Routing
  • Goals / Approach:
  • Construct a 3D clock tree
  • Minimize clock skew under a non-uniform thermal profile
  • Bonding style imposes constraints on the # of through-vias

F2B Stack
Alternating F2F/B2B
86
3D Decap Placement and Sizing
  • Goals / Approach:
  • Place modules to reduce IR-drop and the decap required
  • Allocate whitespace for decap insertion
  • Size decaps (oxide thickness Tox) for leakage reduction

87
>>>> OTHER <<<<
88
3D-Stacked 21364
[Figures: 2D 21364 core, 3D 21364, and 3D EBox detail.]
  • EBox has a mix of self-stacking (RF, IQ: critical wires are intra-FUB)
  • and FUB-stacking (EUs: critical wires are inter-FUB)
89
Self-Bibliography
  • Low-/Circuit-level
  • DAC 2007
  • GLSVLSI 2006 (2)
  • ISCAS 2006
  • ISVLSI 2006
  • ICCD 2005
  • Architecture-level
  • IEEE Micro 2007
  • HPCA 2007
  • MICRO 2006
  • JETC 2006
  • ISCA 2008
  • CF 2008
  • Other
  • TCAD 2007