Three-Dimensional Integration for Multi-Processor System-on-Chip

Transcript and Presenter's Notes

1
Three-Dimensional Integration for Multi-Processor
System-on-Chip
Searching for the architectural sweet spot
Luca Benini, DEIS, Università di Bologna,
lbenini_at_deis.unibo.it
Thanks to M. Facchini, T. Carlson, P. Marchal
(IMEC); C. Seiculescu, G. De Micheli (EPFL);
S. Murali, A. Pullini, F. Angiolini (INoCs);
S. Mitra (Stanford); I. Loi (UNIBO)
2
The communication bottleneck
  • Architectural issues
  • Traditional shared buses do not scale well:
    bandwidth saturation
  • Chip IO is pad limited
  • Physical issues
  • On-chip Interconnects become increasingly slower
    w.r.t. logic
  • IOs are increasingly expensive
  • Consequences
  • Performance losses
  • Power/Energy cost
  • Design closure issues, respins or infeasibility

New architectures and design methods are
required!
3
TSV market outlook
Yole07
4
TSV performance
  • Delay is given by combination of parasitics
  • Horizontal wire to via base
  • Via delay (includes R of bases)
  • Horizontal wire from via top
  • Load
  • For a whole via of 50µm, delay is 16/18.5ps
    (SOI/bulk)
  • For a 1.5mm horizontal link, delay is around
    200ps
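The two delay figures above can be put side by side. This is only a rough sketch using the numbers quoted on the slide; the constant and function names are illustrative, and the comparison ignores any difference in load between the two cases.

```python
# Delay figures quoted on the slide, in picoseconds.
TSV_DELAY_PS = {"SOI": 16.0, "bulk": 18.5}  # whole 50 um via
HORIZONTAL_LINK_DELAY_PS = 200.0            # 1.5 mm horizontal link


def speedup_vs_horizontal(substrate: str) -> float:
    """Ratio of the 1.5 mm horizontal link delay to the TSV delay."""
    return HORIZONTAL_LINK_DELAY_PS / TSV_DELAY_PS[substrate]


for substrate in TSV_DELAY_PS:
    print(f"{substrate}: vertical hop is about "
          f"{speedup_vs_horizontal(substrate):.1f}x faster")
```

With these figures the vertical hop comes out roughly an order of magnitude faster than the horizontal link.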

5
So far so good, but
  • The area news is not so good!
  • TSV itself can be small (even 2-3 µm)
  • But TSV pitch is not so small
  • Limited by wafer alignment technology!
  • Sub-micron alignment is not yet feasible
  • Micron alignment is feasible but slow and
    expensive!
  • Need large landing pads for TSVs
  • 10 µm pitches seem to be realistic
  • Not all TSVs can be used for signals
  • Power supply, clock, thermal vias

6
TSV reliability losses
  • Main failure mechanisms (fabrication)
  • Misalignment
  • Void formation during the bonding phase
  • Dislocations and defects of copper grains
  • Oxide film formation over the Cu interface
  • Partial or full pad detachment due to thermal
    stress
  • Thermal dissipation is much harder in 3D stacks,
    thereby further increasing the risk of
    temperature-related failures

Igor Loi, igor.loi_at_unibo.it, 12/14/2009
7
TSV yield
Miyakawa HRI07
$Y = \exp(-D_{BI} \cdot N_{BI})$
where $D_{BI}$ is the defect frequency and $N_{BI}$ is the number of TSVs
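A minimal sketch of the yield model above, read as Y = exp(-D_BI × N_BI). The defect frequency used in the loop is a hypothetical placeholder, not a value from the slide; it only illustrates how quickly yield drops as the TSV count grows.

```python
import math


def tsv_yield(defect_frequency: float, num_tsvs: int) -> float:
    """Yield Y = exp(-D_BI * N_BI) for N_BI TSVs with defect frequency D_BI."""
    return math.exp(-defect_frequency * num_tsvs)


# Hypothetical defect frequency, chosen only for illustration.
ASSUMED_DEFECT_FREQUENCY = 1e-5

for n in (100, 1_000, 10_000):
    print(f"{n:>6} TSVs -> yield {tsv_yield(ASSUMED_DEFECT_FREQUENCY, n):.4f}")
```

Because the exponent is linear in the TSV count, every extra vertical connection has a yield cost, which is one reason TSV count shows up later as a synthesis constraint.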
8
Summing up
  • Good power and speed
  • Area overhead is significant
  • Reliability not ideal (fabrication and aging)
  • Synchronization is hard (skew minimization across
    layers)
  • Therefore
  • Cost and design effort are not trivial
  • Not just another dimension for wiring (as of
    today)
  • Need a systematic way to deal with non-ideality

9
A medium-term vision
10
Do We Really Need It?
  • Multi-core logic performance is back on track,
    but

John McCalpin
http://www.cs.virginia.edu/stream/
  • Multi-cores are bandwidth-hungry
  • Limited caches
  • Multi-threading
  • Virtualization

The Bandwidth Challenge
11
Scaling cores with constant BW
Using cache size to accommodate increasing
thread traffic is VERY expensive; using BW can
be cheaper!
$T / C^{1/d} = B$, with $d > 1$ (2-3)
[Figure labels: C, T, B]
2x increased traffic drives 8x cache size
(constant memory bandwidth)
4x increased traffic drives 64x cache size
(constant memory bandwidth)
IBM
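The two data points above are consistent with reading the relation as T / C^(1/d) = B with d = 3, taking T as traffic, C as cache size and B as memory bandwidth (the slide does not define the symbols, so this reading is an assumption). Holding B constant gives C2/C1 = (T2/T1)^d, which the sketch below evaluates.

```python
def cache_scale_for_traffic(traffic_scale: float, d: float = 3.0) -> float:
    """Cache growth needed to keep bandwidth constant when traffic grows.

    Assumes T / C**(1/d) = B with B fixed, so C2/C1 = (T2/T1)**d.
    """
    return traffic_scale ** d


print(cache_scale_for_traffic(2))  # 8.0, matches "2x traffic drives 8x cache"
print(cache_scale_for_traffic(4))  # 64.0, matches "4x traffic drives 64x cache"
```

With d = 2, the low end of the quoted range, doubling traffic would still require a 4x larger cache.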
12
What about Embedded MPSoCs?
NXP07
Frame rate constraint is getting too tight!
13
3D offers plenty of bandwidth
High-end packaging roadmap
Intel 07
10 µm TSV pitch → 10K vertical connections per mm²
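The density figure follows directly from the pitch, assuming TSVs laid out on a regular square grid (the grid layout is an assumption; the function name is illustrative).

```python
def tsvs_per_mm2(pitch_um: float) -> float:
    """Vertical connections per square millimetre on a square grid."""
    per_mm = 1000.0 / pitch_um  # TSVs along one millimetre
    return per_mm ** 2


print(tsvs_per_mm2(10))  # 10000.0
```

This is an upper bound on signal connections, since some TSVs are reserved for power supply, clock, and thermal vias.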
14
How do we get the bandwidth?
  • Current low-cost SoC solution (2D): single
    channel memory system interface

[Figure: block diagram with labels System Interconnect; Slave port/s (S1, S2, ..., Sk); SoC front-end (Transaction queue, Memory scheduler); Memory backend; PHY (off-chip physical interface circuits); CH1]
Main bottleneck: the memory channel
15
Memory Controller example
16
Multi-channel (2D) Memory interface
  • Data-parallel memory system (e.g. OpenSPARC T1,
    T2)

[Figure: block diagram with labels System Interconnect; Slave port/s (S1, S2, ..., Sk); SoC front-end (Transaction queue, Memory scheduler); three Memory backend and PHY pairs on channels CH1, CH2, CH3]
  • Main bottlenecks
  • Power, pin budget of memory channels
  • Front-end congestion
  • Scheduler scalability

17
Packaging scenarios vs. IO circuits
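[Figure: memory IO options per packaging scenario, with labels in slide order: Board (SSTL2, DDR2 16-bit, Logic, PCB); SiP (DDR2 16-bit, RLC interconnect, PCB, SSTL2 without terminations); 3D SiC with TSVs (DDR2 16-bit, PCB, RC interconnect); 3D-ready DDR2 16-bit (CMOS, PCB); 3D-ready DDR2 x-bit (RC interconnect, PCB)]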
ORDERS OF MAGNITUDE MORE ENERGY EFFICIENT
SSTL2: Stub Series Terminated Logic, dedicated
circuits that transmit data across a
transmission line, commonly used in DRAM
memories
18
3D-single channel interface
  • Current low-cost SoC solution (2D): single
    channel memory system interface

[Figure: block diagram with labels Slave port/s (S1, S2, ..., Sk); SoC front-end (Queue, Scheduler); PHY; Memory backend; channel CH with CTRL, Addr, DW and DR lanes]
Fat data lanes → 1-cycle block transfers
Split R/W → reduced scheduling conflicts
ISSUE: requires changes in MEM
Main bottleneck: it is still a single channel
(for transactions)
19
3D-multi channel interface
  • Solution 1: multiple standard channels

[Figure: block diagram with labels Slave port/s (S1, S2, ..., Sk); SoC front-end (Queue, Scheduler); PHY; three Memory backends on channels CH1, CH2, CH3]
Advantage: does not require functional changes to
the DRAM interface
24
3D-multi channel interface
  • Solution 2: TDMA 3D overclocked bus

Exploits speed of TSVs
[Figure: block diagram with labels Logic DIE; Slave port/s (S1, S2, ..., Sk); SoC front-end (Queue, Scheduler); PHY; three Memory backends (CH1, CH2, CH3); time axis t with channel slots CH1, CH2, CH3]
Multi-channel on wide unidirectional data lanes
is also possible
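One way to picture the TDMA idea is a shared vertical bus clocked faster than any single channel, with the channels taking turns on successive slots. The sketch below assumes a plain round-robin slot order and a hypothetical bus clock; neither detail is given on the slide.

```python
from itertools import cycle, islice


def tdma_schedule(channels, num_slots):
    """Owner of each time slot on the shared bus, in round-robin order."""
    return list(islice(cycle(channels), num_slots))


def per_channel_rate(bus_clock_mhz, num_channels):
    """Slot rate each channel sees when the bus is shared equally."""
    return bus_clock_mhz / num_channels


print(tdma_schedule(["CH1", "CH2", "CH3"], 6))
# Hypothetical 600 MHz bus shared by three channels.
print(per_channel_rate(600.0, 3), "MHz per channel")
```

The point of overclocking is that each channel still sees its native rate while all three share one set of TSVs.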
25
3D DRAM interface choices
  • Multi-channel 3D DRAM looks very promising
  • Can leverage existing DRAM organization
  • Relieves bandwidth bottleneck
  • Mitigates (by orders of magnitude) the cost
    issues of off-chip multi-channel interfaces
  • Exploration of wide channels with unidirectional
    data lanes is worthwhile, coupled with a deep
    revision of DRAM chip interfaces

26
DRAM for wide 3D interface
[Figure: DRAM block diagram with labels Row Address [12:0]; Column Address [11:0]; Read Latch; Write Buff; DOUT [1 to N word]; DIN [1 to N word]; n]
27
Scalability bottleneck
  • Single controller becomes a bottleneck even with
    many slave ports
  • All cores need to reach it! Even using NoC
    interconnect, the latency price is high
  • Internal management of multiple slave interface
    and many transaction queues creates complexity
    bottlenecks
  • This approach does not exploit the possibility of
    fine-grain distribution of 3D vias
  • Creates a single point of failure

Bottleneck
28
Multiple 3D DRAM interfaces
  • Relieves single-controller bottleneck
  • Cores have a friendly neighbor controller
  • Memory is fully accessible to everybody
  • Notion of vicinity in memory space
  • Not without issues
  • Area cost is increased (some hw sharing of
    memctrl is lost)
  • Many points of entry in the memory dies. Need a
    regular pattern of memory access ports for
    commodity 3D RAM

29
IMIS 1.0 Example
  • Intimate memory interface specification

PORT
Chip footprint
80x19 cells, pitch 24µm
30
A case study
  • NoC Based Scalability and Modularity
  • Increase QoS, Predictability and Bandwidth

[Figure: 3D stack case study, with labels Core, L1, SW, ni, MC, MCU, TSV/TSVs, DRAM, FP, GPRs, PLL, IO]
Optional low-latency, high-BW channel
31
A promising approach
  • A high-level analysis for GP architectures
  • Traditional 2-Level cache hierarchy

G. Loh, "3D-Stacked Memory Architectures for
Multi-Core Processors", ISCA 2008
32
Looking forward
  • More generally: multiple, application-specific
    dies
  • Logic-prevalent process → lots of metal, fast
    and leaky transistors, area inefficient large
    library of cells
  • Memory-prevalent process → few levels of metal,
    low-leakage transistors, few highly specialized
    cell generators
  • From SoC to ML-SoC
  • Lots of opportunities! More degrees of freedom to
    achieve reliability, low quiescent power, low
    energy
  • If TSV technology becomes really stable (high
    yield, low cost)

33
3D NoCs for MLSoCs
  • Designing NoCs for 3D ICs: a big challenge
  • Which topology, switches on what layer and
    floorplan locations?
  • Meet application constraints
  • Bandwidth, latency
  • Meet 3D technology constraints
  • Maximum available TSV constraint
  • Communication between adjacent layers
  • NoC floorplan considering 3D layers

Automating 3D NoC design is essential!
34
2D vs 3D Synthesis
3D technology TSV constraints
TSV constraint: 7 links
36
2D vs 3D Synthesis
  • TSV constraint: an important factor in
    determining topology
  • Addressing 3D floorplanning of the NoC is also
    crucial
  • Additional constraint: links only across adjacent
    layers

Several new issues in 3D NoC synthesis!
37
Contributions: 3D NoC Flow
Inputs to NoC Topology Synthesis
  • Communication characteristics: application
    bandwidth requirements, latency constraints,
    message type of traffic flows
  • 3D specs: core assignment to layers in 3D;
    optionally, floorplan of cores in each layer
  • Technology constraints: max. TSVs across adjacent
    layers, links only across adjacent layers
  • User objectives: power consumption, latency
  • Models: NoC area models, NoC power models,
    vertical link power and latency models

Output: application-specific 3D NoC
38
3D NoC Flow
  • Features
  • Deadlock removal (routing and message-dependent)
    intra and inter layer
  • Floorplan of network components layer by layer
  • Meet 3D technology constraints
  • Design trade-offs possible
  • Inputs
  • IP, communication specs
  • Layer assignment, placement of cores
  • Bandwidth, latency constraints of flows
  • NoC area, power models
  • Maximum inter-layer links (TSV constraint)

39
Synthesis Approach
40
Communication Abstraction
Synthesize best topology
Build communication graph based on application
specs
$\mu \cdot \frac{bw}{max\_bw} + (1 - \mu) \cdot \frac{min\_lat}{lat}$, where $\mu$ is a scaling
parameter varied by the algorithm
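A minimal sketch of the communication graph, reading the expression above as a per-flow edge weight µ × bw/max_bw + (1 − µ) × min_lat/lat (the "edge weight" interpretation and the restored plus sign are assumptions). The core names and flow figures are hypothetical.

```python
def edge_weight(bw, lat, max_bw, min_lat, mu):
    """Blend of normalized bandwidth and inverse normalized latency."""
    return mu * bw / max_bw + (1.0 - mu) * min_lat / lat


def build_communication_graph(flows, mu):
    """flows maps (src, dst) to (bandwidth, latency_constraint)."""
    max_bw = max(bw for bw, _ in flows.values())
    min_lat = min(lat for _, lat in flows.values())
    return {
        pair: edge_weight(bw, lat, max_bw, min_lat, mu)
        for pair, (bw, lat) in flows.items()
    }


# Hypothetical flows: (bandwidth in MB/s, latency constraint in cycles).
flows = {("cpu", "mem"): (400.0, 4.0), ("cpu", "dsp"): (100.0, 2.0)}
for mu in (0.0, 0.5, 1.0):
    print(mu, build_communication_graph(flows, mu))
```

Sweeping µ from 0 to 1 moves the weighting from purely latency-driven to purely bandwidth-driven, which is presumably why the algorithm varies it.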
41
Core to Switch Assignment
  • Build local partitioning graphs (LPG)
  • Layer-by-layer NoC design

[Figure: local partitioning graphs LPG 0 and LPG 1 over ARM and M cores, with edge weights 0.5, 1.0 and 0.2]
42
Path Computation
  • In 3D, two important constraints must be met
  • TSV count, maximum switch size (frequency)

Trade-offs of TSV count (yield) vs.
power-performance are possible
Placement of Switches
Layer assignment of switches; floorplan of each
layer
[Figure: floorplans of Layer 1 and Layer 0]
Initial placement of cores taken as input
44
Placement of Switches
[Figure: floorplans of Layer 1 and Layer 0 with switches inserted]
Switches inserted to minimize distance (weighted
by bandwidth)
Solved as a Linear Program
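For intuition, consider the simplest instance of this placement problem: one switch, fixed core positions, and Manhattan distance. In that special case the optimum of the linear program is the bandwidth-weighted median of the core coordinates, computed independently in x and y. The sketch below covers only this simplified case (an assumption, not the authors' full formulation); the coordinates and bandwidths are hypothetical.

```python
def weighted_median(points):
    """points: list of (coordinate, weight); returns the lower weighted median."""
    ordered = sorted(points)
    half = sum(w for _, w in ordered) / 2.0
    running = 0.0
    for coord, w in ordered:
        running += w
        if running >= half:
            return coord
    raise ValueError("no points given")


def place_switch(cores):
    """cores: list of (x, y, bandwidth) for the cores attached to the switch."""
    x = weighted_median([(cx, bw) for cx, _, bw in cores])
    y = weighted_median([(cy, bw) for _, cy, bw in cores])
    return x, y


def wire_cost(pos, cores):
    """Sum of bandwidth-weighted Manhattan distances from pos to each core."""
    x, y = pos
    return sum(bw * (abs(x - cx) + abs(y - cy)) for cx, cy, bw in cores)


cores = [(0, 0, 1), (10, 0, 1), (4, 6, 5)]  # hypothetical floorplan
best = place_switch(cores)
print(best, wire_cost(best, cores))
```

The high-bandwidth core pulls the switch to its own location, which matches the slide's statement that distance is weighted by bandwidth.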
45
Experiments
46
Example Layer Layout
  • Vertical links are laid out as floorplan
    obstructions
  • NoC components based on xpipes library

47
Multi-media Case Study
  • Triple video object plane decoder (TVOPD)
  • 38 cores, lots of pipeline traffic flows
  • Core layer assignment and floorplan given as inputs

48
Generated Topology
49
Multi-media Case Study
Floorplan of each layer
50
Comparisons with Meshes
  • Different SoC benchmarks: 36 to 65 cores
  • Model bottleneck (shared memory), local memory,
    pipeline benchmarks
  • Meshes optimized for application traffic

Proposed method: 38% reduction in power, 25%
reduction in latency
51
TSV constraints
  • TSV constraints → inter-layer link constraints
  • Tighter constraints: poorer latency and power
  • Trade-off exploration possible with our algorithm
  • Run time: a few hours for all benchmarks

52
Conclusions
  • NoCs are critical for 3D ICs
  • Scalable, modular, support technology constraints
  • Presented synthesis approach for 3D NoCs
  • Topology generation, floorplanning
  • Large improvements in power, delay compared to
    existing solutions
  • Not the optimal solution for, e.g., tightly coupled
    memories
  • Need fast NoC bypass, e.g. NEC solution at
    ISSCC 2009