1
Three-Dimensional Integration for Multi-Processor
System-on-Chip
Searching for the architectural sweet spot
Luca Benini, DEIS, Università di Bologna, lbenini_at_deis.unibo.it
Thanks to M. Facchini, T. Carlson, P. Marchal (IMEC); C. Seiculescu, G. De Micheli (EPFL); S. Murali, A. Pullini, F. Angiolini (INoCs); S. Mitra (Stanford); I. Loi (UNIBO)
2
The communication bottleneck

Architectural issues
Traditional shared buses do not scale well: bandwidth saturation
Chip IO is pad limited
Physical issues
On-chip interconnects are becoming increasingly slow relative to logic
IOs are increasingly expensive
Consequences
Performance losses
Power/Energy cost
Design closure issues, respins or infeasibility

New architectures and design methods are
required!
3
TSV market outlook
Yole07
4
TSV performance

Delay is given by a combination of parasitics:
Horizontal wire to via base
Via delay (includes R of bases)
Horizontal wire from via top
Load
For a whole 50 µm via, the delay is 16 ps (SOI) or 18.5 ps (bulk)
For a 1.5 mm horizontal link, the delay is around 200 ps

5
So far so good, but

The area news is not so good!
The TSV itself can be small (even 2-3 µm)
But TSV pitch is not so small
Limited by wafer alignment technology!
Sub-micron alignment is not yet feasible
Micron alignment is feasible but slow and expensive!
Need large landing pads for TSVs

10 µm pitches seem to be realistic
Not all TSVs can be used for signals
Power supply, clock, thermal vias

6
TSV reliability losses

Main failure mechanisms (fabrication)
Misalignment
Void formation during the bonding phase
Dislocations and defects in copper grains
Oxide film formation over the Cu interface
Partial or full pad detachment due to thermal stress
Thermal dissipation is much harder in 3D stacks, further increasing the risk of temperature-related failures

12/14/2009
Loi Igor igor.loi_at_unibo.it
7
TSV yield
[Miyakawa HRI07]
$Y = \exp(-D_{BI} \cdot N_{BI})$
$D_{BI}$: defect frequency; $N_{BI}$: number of TSVs
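The yield model on this slide can be tried out numerically. The following is a minimal sketch, assuming the relation reads Y = exp(-D_BI * N_BI); the function name and the defect frequency of 1e-5 per TSV are hypothetical values chosen for illustration, not figures from the slide.

```python
import math


def tsv_stack_yield(defect_frequency: float, num_tsvs: int) -> float:
    """Yield under the assumed model Y = exp(-D_BI * N_BI)."""
    return math.exp(-defect_frequency * num_tsvs)


# Hypothetical defect frequency per TSV (illustration only).
D_BI = 1e-5

for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} TSVs -> yield {tsv_stack_yield(D_BI, n):.3f}")
```

Under this model the yield falls exponentially with the number of TSVs, which is one reason the TSV count is treated as a budget later in the deck.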
8
Summing up

Good power and speed
Area overhead is significant
Reliability not ideal (fabrication and aging)
Synchronization is hard (skew minimization across
layers)
Therefore
Cost and design effort are not trivial
Not just another dimension for wiring (as of
today)
Need a systematic way to deal with non-idealities

9
A medium-term vision
10
Do We Really Need It?

Multi-core logic performance is back on track,
but

John McCalpin
http://www.cs.virginia.edu/stream/

Multi-cores are bandwidth-hungry
Limited caches
Multi-threading
Virtualization

The Bandwidth Challenge
11
Scaling cores with constant BW
C: cache size; T: thread traffic; B: memory bandwidth
Using cache size to accommodate increasing thread traffic is VERY expensive; using BW can be cheaper!!
$B = T / C^{1/d}$, with $d > 1$ (2-3)
2x increased traffic drives 8x cache size (constant memory bandwidth)
4x increased traffic drives 64x cache size (constant memory bandwidth)
[IBM]
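The 8x and 64x figures can be reproduced with a short calculation. This is a sketch under the assumption that the slide's relation reads B = T / C^(1/d), so that holding B constant while scaling T by k requires scaling C by k^d; d = 3 is picked here because it matches the slide's numbers, and the function name is hypothetical.

```python
def cache_growth(traffic_growth: float, d: float = 3.0) -> float:
    """Cache scaling needed to keep B = T / C**(1/d) constant.

    If traffic T grows by a factor k, C must grow by k**d
    (assumed reading of the slide's formula).
    """
    return traffic_growth ** d


for k in (2, 4):
    print(f"{k}x traffic -> {cache_growth(k):.0f}x cache (d = 3)")
```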
12
What about Embedded MPSoCs?
NXP07
Frame rate constraint is getting too tight!
13
3D offers plenty of bandwidth
High-end packaging roadmap
Intel 07
10 µm TSV pitch → 10K vertical connections per mm²
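The density figure follows from simple geometry. A minimal sketch, assuming TSVs sit on a regular square grid; the function name and the 50% signal fraction are hypothetical (an earlier slide only notes that some TSVs go to power supply, clock and thermal vias).

```python
def tsv_per_mm2(pitch_um: float) -> float:
    """Number of TSVs per mm^2 on a square grid with the given pitch."""
    per_side = 1000.0 / pitch_um  # TSVs along one 1 mm edge
    return per_side ** 2


total = tsv_per_mm2(10.0)
signal_fraction = 0.5  # hypothetical share of TSVs usable for signals
print(f"total: {total:.0f} per mm^2, signal: {total * signal_fraction:.0f} per mm^2")
```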
14
How do we get the bandwidth?

Current low-cost SoC solution (2D): single-channel memory system interface

[Figure: block diagram. System interconnect with slave ports S1, S2, ..., Sk; SoC front-end with transaction queue and memory scheduler; memory backend; PHY (off-chip physical interface circuits); memory channel CH1]
Main bottleneck: the memory channel
15
Memory Controller example
16
Multi-channel (2D) Memory interface

Data-parallel memory system (e.g. OpenSPARC T1, T2)

[Figure: block diagram. System interconnect with slave ports S1, S2, ..., Sk; SoC front-end with transaction queue and memory scheduler; three memory backends, each with its own PHY, on channels CH1, CH2 and CH3]

Main bottlenecks
Power, pin budget of memory channels
Front-end congestion
Scheduler scalability

17
Packaging scenarios vs. IO circuits
[Figure: packaging scenarios and their IO circuits. Labels, in extraction order:]
Board: SSTL2, DDR2 16-bit, logic, PCB
SiP: DDR2 16-bit, RLC interconnect, PCB, SSTL2 without terminations
3D SiC (TSVs): DDR2 16-bit, PCB, RC interconnect
3D-ready DDR2 16-bit: CMOS, PCB
3D-ready DDR2 x-bit: RC interconnect, PCB
ORDERS OF MAGNITUDE MORE ENERGY EFFICIENT
SSTL2: stub series terminated logic, i.e. dedicated circuits for transmitting data across a transmission line, commonly used in DRAM memories
18
3D-single channel interface

Current low-cost SoC solution (2D): single-channel memory system interface

[Figure: block diagram. Slave ports S1, S2, ..., Sk; SoC front-end with queue and scheduler; memory backend; PHY; a single channel CH with CTRL, Addr, DW and DR lanes]
Fat data lanes → 1-cycle block transfers
Split R/W → reduced scheduling conflicts
ISSUE: requires changes in MEM
Main bottleneck: it is still a single channel (for transactions)
19
3D-multi channel interface

Solution 1: multiple standard channels

[Figure: block diagram. Slave ports S1, S2, ..., Sk; SoC front-end with queue and scheduler; PHY; three memory backends on channels CH1, CH2 and CH3]
Advantage: does not require functional changes to the DRAM interface
24
3D-multi channel interface

Solution 2: TDMA 3D overclocked bus

Exploits speed of TSVs
[Figure: block diagram and timing sketch. Logic die with slave ports S1, S2, ..., Sk, SoC front-end with queue and scheduler, and PHY; three memory backends on channels CH1, CH2 and CH3; time axis t with slots CH1, CH2, CH3]
Multi-channel on wide unidirectional data lanes is also possible
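The TDMA idea can be illustrated with a toy slot schedule. This is only a sketch, assuming plain round-robin slot assignment and a shared bus clocked at the channel rate times the number of channels; the slide does not specify the arbitration policy, and the function names and the 200 MHz channel clock are hypothetical.

```python
from itertools import cycle, islice


def tdma_schedule(channels, num_slots):
    """Assign bus time slots to channels in round-robin order (assumed policy)."""
    return list(islice(cycle(channels), num_slots))


def bus_clock_mhz(channel_clock_mhz, num_channels):
    """Bus clock needed so each channel still sees its own nominal rate."""
    return channel_clock_mhz * num_channels


channels = ["CH1", "CH2", "CH3"]
print(tdma_schedule(channels, 6))
print(bus_clock_mhz(200.0, len(channels)), "MHz shared-bus clock (hypothetical)")
```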
25
3D DRAM interface choices

Multi-channel 3D DRAM looks very promising
Can leverage existing DRAM organization
Relieves bandwidth bottleneck
Mitigates (by orders of magnitude) the cost
issues of off-chip multi-channel interfaces
Exploration of wide channels with unidirectional data lanes is worthwhile, coupled with a deep revision of DRAM chip interfaces

26
DRAM for wide 3D interface
[Figure: DRAM organization for a wide 3D interface. Labels: ROW Address [12:0]; Column Address [11:0]; Read Latch; Write Buff; DOUT [1 to N word]; DIN [1 to N word]; n]
27
Scalability bottleneck

A single controller becomes a bottleneck even with many slave ports
All cores need to reach it! Even using a NoC interconnect, the latency price is high
Internal management of multiple slave interfaces and many transaction queues creates complexity bottlenecks
This approach does not exploit the possibility of fine-grain distribution of 3D vias
Creates a single point of failure

Bottleneck
28
Multiple 3D DRAM interfaces

Relieves single-controller bottleneck
Cores have a friendly neighbor controller
Memory is fully accessible to everybody
Notion of vicinity in memory space
Not without issues
Area cost is increased (some hw sharing of
memctrl is lost)
Many points of entry in the memory dies. Need a
regular pattern of memory access ports for
commodity 3D RAM

29
IMIS 1.0 Example

Intimate memory interface specification

PORT
Chip footprint
80x19 cells, pitch 24µm
30
A case study

NoC Based Scalability and Modularity
Increase QoS, Predictability and Bandwidth

[Figure: 3D MPSoC floorplan. Labels: cores with L1 caches, switches (SW), ni, MC, MCU, DRAM, TSVs, FP, GPRs, PLL, IO]
Optional low-latency, high-BW channel
31
A promising approach

A high-level analysis for GP architectures
Traditional 2-Level cache hierarchy

G. Loh, "3D-Stacked Memory Architectures for Multi-Core Processors", ISCA 2008
32
Looking forward

More generally: multiple, application-specific dies
Logic-prevalent process → lots of metal, fast and leaky transistors, area-inefficient large library of cells
Memory-prevalent process → few levels of metal, low-leakage transistors, few highly specialized cell generators
From SoC to ML-SoC
Lots of opportunities! More degrees of freedom to achieve reliability, low quiescent power, low energy
If TSV technology becomes really stable (high yield, low cost)

33
3D NoCs for ML-SoCs

Designing NoCs for 3D ICs: a big challenge
Which topology? Switches on what layer and at which floorplan locations?
Meet application constraints
Bandwidth, latency
Meet 3D technology constraints
Maximum available TSV constraint
Communication between adjacent layers
NoC floorplan considering 3D layers

Automating 3D NoC design is essential!
34
2D vs 3D Synthesis
3D technology TSV constraints
TSV constraint: 7 links
36
2D vs 3D Synthesis

The TSV constraint is an important factor in determining topology
Addressing 3D floorplanning of the NoC is also crucial
Additional constraint: links only across adjacent layers

Several new issues in 3D NoC synthesis!
37
Contributions: 3D NoC Flow
Inputs
3D specs: core assignment to layers in 3D; optionally, floorplan of cores in each layer
Communication characteristics: application bandwidth requirements; latency constraints; message type of traffic flows
Technology constraints: max. TSVs across adjacent layers; links only across adjacent layers
User objectives: power consumption; latency
Models: NoC area models; NoC power models; vertical link power and latency models
NoC Topology Synthesis
Output: application-specific 3D NoC
38
3D NoC Flow

Features
Deadlock removal (routing and message-dependent), intra- and inter-layer
Floorplan of network components layer by layer
Meet 3D technology constraints
Design trade-offs possible
Inputs
IP, communication specs
Layer assignment and placement of cores
Bandwidth, latency constraints of flows
NoC area, power models
Maximum inter-layer links (TSV constraint)

39
Synthesis Approach
40
Communication Abstraction
Synthesize best topology
Build communication graph based on application specs
Edge weight: $\mu \cdot \frac{bw}{max\_bw} + (1 - \mu) \cdot \frac{min\_lat}{lat}$, where $\mu$ is a scaling parameter varied by the algorithm
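The weighting can be sketched in a few lines. This assumes the two terms on the slide are summed, giving µ·bw/max_bw + (1 − µ)·min_lat/lat per flow, and that µ lies in [0, 1]; the function name and the sample flow values are hypothetical.

```python
def edge_weight(bw, max_bw, lat, min_lat, mu):
    """Weight of one flow in the communication graph (assumed form).

    mu = 1 ranks flows purely by bandwidth, mu = 0 purely by how
    tight their latency constraint is.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * bw / max_bw + (1.0 - mu) * min_lat / lat


# Hypothetical flow: 100 MB/s out of a 400 MB/s maximum,
# latency bound of 10 cycles against a tightest bound of 5 cycles.
for mu in (0.0, 0.5, 1.0):
    print(mu, edge_weight(100.0, 400.0, 10.0, 5.0, mu))
```

Sweeping µ, as the slide says the algorithm does, shifts the emphasis between bandwidth-heavy and latency-critical flows when cores are grouped onto switches.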
41
Core to Switch Assignment

Build local partitioning graphs (LPG)
Layer-by-layer NoC design

[Figure: local partitioning graphs LPG 0 and LPG 1 over ARM cores and M nodes, with edge weights 0.2, 0.5 and 1.0]
42
Path Computation

In 3D, two important constraints must be met:
TSV, maximum switch size (frequency)

Trade-offs of TSV count (yield) vs. power-performance are possible
43
Placement of Switches
Layer assignment of switches; floorplan of each layer
Layer 1
Layer 0

Initial placement of cores taken as input
44
Placement of Switches

Layer 1
Layer 0

Switches inserted to minimize distance (weighted
by bandwidth)
Solved as a Linear Program
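The placement objective can be illustrated for the simplest case. This sketch does not reproduce the slide's linear program; it assumes Manhattan distance, a single switch and fixed core positions within one layer, in which case the bandwidth-weighted distance is minimized by the weighted median of each coordinate. All names and coordinates are hypothetical.

```python
def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for value, weight in pairs:
        running += weight
        if running >= half:
            return value
    raise ValueError("empty input")


def place_switch(cores):
    """cores: list of (x, y, bandwidth). Returns the (x, y) of the switch."""
    xs = [c[0] for c in cores]
    ys = [c[1] for c in cores]
    ws = [c[2] for c in cores]
    return weighted_median(xs, ws), weighted_median(ys, ws)


def wire_cost(pos, cores):
    """Bandwidth-weighted Manhattan distance from the switch to all cores."""
    return sum(bw * (abs(pos[0] - x) + abs(pos[1] - y)) for x, y, bw in cores)


# Hypothetical cores: the high-bandwidth core at (4, 3) pulls the switch to it.
cores = [(0, 0, 1.0), (4, 0, 1.0), (4, 3, 5.0)]
switch = place_switch(cores)
print(switch, wire_cost(switch, cores))
```

With several switches, links between switches and non-overlap constraints, the problem no longer separates per axis, which is where a full LP formulation becomes useful.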
45
Experiments
46
Example Layer Layout