Yuchun Ma


Provided by: cadlab

Learn more at: http://cadlab.cs.ucla.edu


Transcript and Presenter's Notes

Title: Yuchun Ma

1
International Center for Design on
Nanotechnologies Workshop
Physical Modeling and Exploration
for 3D Microarchitecture Design

Yuchun Ma
Joint Work with Jason Cong, Yongxiang Liu,
Glenn Reinman, and Yan Zhang

2
Outline

Micro-architecture Design
3-D IC Technology
3D Architecture Exploration with 2D blocks
3D Architecture Design with cubic folded blocks
3D cubic packing algorithm
3D architecture exploration with folded blocks
Pipelining Optimization with Throughput-Aware
Floorplanning
Summary and Future Work

4
Superscalar Processors

Superscalar processing is the ability of a
microprocessor to initiate multiple instructions
into multiple pipelines so that the computations
of many instructions can be done in parallel if
they are not dependent on each other.

5
Alpha 21264
6
Performance of a microprocessor

Performance is measured as the time taken to
complete a given task, which depends on:
Operating systems
Compiler optimizations
Workload used for studying the performance
Microprocessor organization
Typically, the processor performance is measured
in MIPS or BIPS

8
Motivations of 3-D ICs

Alternative ways for device integration as we
approach the limit of CMOS scaling
Interconnect length/delay reduction
System performance improvement [Black04]
Power reduction [Black04]
Integration of heterogeneous technologies
No existing flow to evaluate 3D implementations
of architectures systematically
Performance
Thermal

[Black04]
9
Technology background

Wafer bonding 3D IC technologies
With flipping the top layer
Without flipping the top layer

Figure: A 3D IC example with two device layers.
(a) With flipping the top layer
(b) Without flipping the top layer
10
Thermal Resistive Network [Wilkerson04]

Circuit stack partitioned into tiles
Tiles connected through thermal resistances
Lateral resistances fixed
Vertical resistances ∝ 1/via (inversely proportional to the number of thermal vias)
Heat sources modeled as current sources
Current value = power
Heat sinks modeled as ground nodes
Thermal vias
After floorplanning, we can further reduce the
temperature by thermal via insertion.
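The electrical analogy above can be sketched for a single tile stack as follows. This is a minimal illustration, not the model from [Wilkerson04]: it ignores lateral resistances, treats the stack as a series chain down to the heat sink, and assumes the vertical resistance is a per-via resistance divided by the via count. The function names, the 45-degree sink temperature, and all numeric values are hypothetical.

```python
def vertical_resistance(r_single_via_k_per_w, n_vias):
    """Assumed model: n identical vias in parallel, so R is proportional to 1/n."""
    return r_single_via_k_per_w / n_vias


def stack_temperatures(layer_power_w, r_vertical_k_per_w, t_sink_c=45.0):
    """Temperatures of one tile stack modeled as a series resistor chain.

    layer_power_w[0] is the layer next to the heat sink (the "ground" node).
    r_vertical_k_per_w[i] is the resistance between layer i and the node
    below it (the sink when i == 0). Power plays the role of current and
    temperature the role of voltage.
    """
    temps = []
    t_below = t_sink_c
    for i, r in enumerate(r_vertical_k_per_w):
        # All heat generated at this layer and above must flow through r.
        heat_through = sum(layer_power_w[i:])
        t_below = t_below + r * heat_through
        temps.append(t_below)
    return temps


# Hypothetical two-layer stack: 2 W on the bottom layer, 3 W on the top layer.
powers = [2.0, 3.0]
few_vias = [1.0, vertical_resistance(8.0, 2)]    # 4.0 K/W between the layers
many_vias = [1.0, vertical_resistance(8.0, 8)]   # 1.0 K/W between the layers
print(stack_temperatures(powers, few_vias))
print(stack_temperatures(powers, many_vias))
```

Adding vias only lowers the resistance between the two device layers, so in this sketch only the top-layer temperature drops; a full tile array would also need the lateral resistances and a linear solve.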

(a) Tiles stack array
(b) Single tile stack
12
MEVA-3D

An Automated Design Flow for 3D Architecture
Evaluation (MEVA-3D)
Evaluate 3D implementations of micro-architectures
systematically and study them from both
performance and thermal perspectives.
MEVA-3D Flow
Automated 2D/3D floorplanning
Reduce the latency along critical loops in the
micro-architecture by considering interconnect
pipelining at a given target frequency.
Thermal Evaluation
Resistive network model considering white-space
and thermal via insertion.
3D router

13
3D Architecture Evaluation with Physical Planning

Optimize
BIPS (not IPC or Freq)
Consider interconnect pipelining based on early
floorplanning for critical paths
Use IPC sensitivity model [Jagannathan05]
Area/wirelength
Temperature
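Why BIPS rather than IPC or frequency alone can be seen with a small calculation. BIPS is IPC multiplied by the clock frequency in GHz. The linear IPC penalty per extra pipeline cycle used below is a hypothetical stand-in, not the sensitivity model of [Jagannathan05], and the loop names and coefficients are invented for illustration.

```python
def bips(ipc, freq_ghz):
    """Billions of instructions per second = instructions/cycle * 1e9 cycles/s."""
    return ipc * freq_ghz


def ipc_with_extra_cycles(base_ipc, extra_cycles, sensitivity):
    """Hypothetical linear model: every extra cycle on a critical loop
    removes a fixed fraction of the baseline IPC."""
    loss = sum(sensitivity[loop] * n for loop, n in extra_cycles.items())
    return base_ipc * max(0.0, 1.0 - loss)


# Assumed fractional IPC loss per extra cycle on each loop (illustrative only).
sensitivity = {"wakeup": 0.10, "l2_access": 0.02}

# At 3 GHz the floorplan needs no extra interconnect pipeline stages.
slow = bips(ipc_with_extra_cycles(1.0, {}, sensitivity), 3.0)

# At 4 GHz long wires need pipelining: one extra wakeup cycle, two extra L2 cycles.
fast = bips(
    ipc_with_extra_cycles(1.0, {"wakeup": 1, "l2_access": 2}, sensitivity), 4.0
)

print(slow, fast)
```

In this made-up case the higher frequency still wins, but the IPC loss from wire pipelining eats part of the gain, which is why the floorplanner is driven by BIPS.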

14
Design Example

An out-of-order superscalar processor
micro-architecture with 4 banks of L2 cache in
70nm technology
Critical paths

15
Baseline Processor Parameters
16
2D vs 3D Layout
Assume two device layers
3D EV6-like core (2 layers)
2D EV6-like core
BIPS 2.75
BIPS 2.94
Wakeup loop: the extra cycle is eliminated.
Branch misprediction resolution loop and L2
cache access latency: some of the extra cycles
are eliminated.
17
Simulation Results

The 3D architecture outperforms the 2D design by
about 11.7% when the frequency is 4 GHz.

18
Performance for the micro-architecture with 2D
and 3D layout at different target frequencies

3D integration can help improve the performance
by 11% by eliminating most of the wire latencies
in 2D.

19
Maximum On-Chip Temperature

3D integration shows a temperature increase of
over 4.78? on average. After thermal via
insertion, we can reduce the maximum on-chip
temperature by an average of about 62.

HS denotes a heat sink. 3D integration
allows thermal vias to be inserted to reduce the
temperature.
21
3D Design w/ Component Folding and Stacking

Explore 3D design of architectural structures
that are
Timing/Throughput Critical
Expensive in Terms of Power Consumption and/or
Thermal Output
Possible candidates for 3D component folding
Instruction Scheduling Window
Issue Queue can be partitioned into multiple
levels via matchlines or taglines.
On-Chip Caches
Regular structure lends itself to a wide range of
partitionings
Register File
Thermally critical resource also has a regular
structure

22
3D Architectural Block Design and Modeling

First explore how to design blocks in 3D
Wordline folding
Fold block horizontally
Port Partitioning
Extend ports to different layers
Tools
CACTI
Caches and cache-like structures
Register files
HSpice
Issue Queue
Then explore design space for a microprocessor
with these blocks

23
3D Issue Queue

Block folding
Fold the entries and place them on different
layers
Effectively shortens the tag lines
Port partitioning
Place tag lines and ports on multiple layers, thus
reducing both the height and width of the ISQ.
The reduction in tag and matchline wires can help
reduce both power and delay.

(a) 2D issue queue with 4 taglines (b) block
folding (c) port partitioning
24
Benefits from IQ folding

Maximum delay reduction of 50%, maximum area
reduction of 90%, and a maximum reduction in power
consumption of 40%

nL: n is the number of layers; FB: folding banks;
TP: tag/ports partitioning
25
Improvements for blocks

Port folding performs better than wordline
folding for area (72% vs 51%).
Wordline folding is more effective in reducing
the block delay (13% vs 5%).
Port folding also performs better in reducing
power (13% vs 5%).

26
3D packing with folded blocks

The exploration of the use of vertical
integration on microprocessor design requires
consideration for both physical design and
architecture.
True 3D packing
Architectural Alternative Selection
The number of layers in folded blocks
The partitioning method: block folding or port
partitioning

27
3D Corner Block List Representation

(S, L, T) composes a 3D CBL.
S: a record of block names
L: corner cubic block orientations (X-, Y- or Z-oriented)
T: the sequence {Tn, Tn-1, ..., T2} recording the
number of attached tri-branches covered by each corner
cubic block

Example: S = (1, 2, 3, 4, 5), L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
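As a data structure, the triple can be held in a small record like the one below. This is only a sketch: the class name is hypothetical, and it assumes each T entry is a run of 1s closed by a single 0 with the number of 1s giving the covered tri-branch count, which is the convention of the 2D corner block list and matches every entry in the example. It does not decode the triple into a packing.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CornerBlockList3D:
    """Hypothetical container for a 3D CBL (S, L, T)."""
    S: Tuple[str, ...]  # block names, in insertion order
    L: Tuple[str, ...]  # orientation of each corner cubic block: "X", "Y" or "Z"
    T: Tuple[str, ...]  # assumed unary strings: k ones followed by one zero

    def __post_init__(self):
        n = len(self.S)
        # The example lists n - 1 orientations and n - 1 T entries for n blocks.
        if len(self.L) != n - 1 or len(self.T) != n - 1:
            raise ValueError("L and T must each have len(S) - 1 entries")
        if any(o not in ("X", "Y", "Z") for o in self.L):
            raise ValueError("orientations must be X, Y or Z")
        for t in self.T:
            if not t.endswith("0") or set(t[:-1]) - {"1"}:
                raise ValueError(f"malformed T entry: {t!r}")

    def covered_counts(self) -> List[int]:
        """Number of covered tri-branches per T entry (count of 1s, assumed)."""
        return [t.count("1") for t in self.T]


cbl = CornerBlockList3D(
    S=("1", "2", "3", "4", "5"),
    L=("Y", "Z", "Y", "X"),
    T=("10", "110", "10", "1110"),
)
print(cbl.covered_counts())
```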
28
Packings with folded blocks

30
Performance

On average, multi-layer (3D) block configurations
have 11% lower temperature as well as a 14%
improvement in BIPS.

31
Temperatures

Temperatures can be kept below 100 degrees with thermal
vias inserted.

32
Temperature profile

1 layer

33
Temperature profile (2 layers with thermal vias)
35
Micro-architecture Pipelining Optimization

Previous works assume that the blocks are
separately designed subject to a clock frequency,
and the wire pipelining is then carried out on
the global wires of the circuits.
This is sub-optimal, because slack left over in the
block pipeline designs goes unused.
We propose a novel optimization methodology of
architecture pipelining with physical design, so
that block pipelining and interconnect pipelining
can be considered simultaneously.

36
Simultaneous Block and Interconnect Pipelining

We define path-based pipelining as the Simultaneous
Block and Interconnect Pipelining (SBIP) Problem
Represent the micro-architecture design by a path
graph G(V,E).
The delay between any two flip-flops along the
same path is less than the clock period τ.
The performance of the architecture can be
evaluated by the weighted sum of number of FFs on
ei(nei) along the paths.
Therefore the objective is to find a feasible
solution with the optimal performance.

37
MILP Formulation

We define a term a(P,v) that represents the
arrival time at node (v) along path P, which is
the longest delay from a flip-flop to the node v
along path P.
With the given clock period τ and the set of
paths P, we can then formulate the problem as the
following MILP:
Obj.: Min (weighted sum of nei over the edges ei along the paths)
s.t.
0 ≤ a(Pi,v) ≤ τ, ∀ v ∈ V and Pi passes v (1)
nei ≥ 0, ∀ ei ∈ E (2)
a(Pi,v) ≥ a(Pi,u) + dei − τ·nei, ∀ ei ∈ E,
where ei is a connection from node u to node v
along path Pi, and dei is the delay of ei. (3)
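To make the constraints concrete, the sketch below checks a given flip-flop assignment on one path against constraints (1) to (3) and evaluates a weighted flip-flop count. It is a feasibility checker, not an MILP solver. The function names, the per-path weights, and the delay values (in picoseconds) are assumptions for illustration; the check relies on the fact that the smallest arrival times satisfying (3) are obtained by propagating max(0, a(u) + d − τ·n) from a path start at time 0.

```python
def arrival_times(path_edges, clock_period):
    """Smallest arrival times satisfying constraint (3) along one path.

    path_edges is a list of (delay, n_ffs) pairs in path order.
    """
    a = 0
    arrivals = []
    for delay, n_ffs in path_edges:
        # Each flip-flop on the edge absorbs one clock period of delay;
        # arrival times are never negative (left side of constraint (1)).
        a = max(0, a + delay - clock_period * n_ffs)
        arrivals.append(a)
    return arrivals


def is_feasible(path_edges, clock_period):
    """True if the assignment meets constraints (1), (2) and (3)."""
    if any(n_ffs < 0 for _, n_ffs in path_edges):  # constraint (2)
        return False
    return all(a <= clock_period for a in arrival_times(path_edges, clock_period))


def weighted_ff_count(paths, weights):
    """Objective value: weighted sum of flip-flop counts over the paths."""
    return sum(w * sum(n for _, n in path) for path, w in zip(paths, weights))


# Hypothetical path with edge delays of 300, 400 and 500 ps and a 500 ps clock.
good = [(300, 0), (400, 1), (500, 1)]
bad = [(300, 0), (400, 0), (500, 1)]  # 700 ps accumulate before any flip-flop
print(is_feasible(good, 500), is_feasible(bad, 500))
print(weighted_ff_count([good], [1.0]))
```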

38
Graph-based heuristic algorithm

Traverse the graph to decide the optimal
insertion of flip-flops such that the weighted
sum of cycle numbers of paths is minimized
Dynamic scanning for combinational circuits
Slacks along paths are used to compute the
optimal positions for FFs.
Near-optimal method for sequential circuits
break the cycle into a path from s to t
Throughput aware floorplanning with pipelining
The path-based pipelining design guides the block
design to optimize the performance for the whole
design.
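For intuition about slack-driven flip-flop placement, here is a simple greedy pass over a single path. It is a stand-in, not the graph-based heuristic from the slides: it handles one acyclic path only, has no weights, and inserts a flip-flop as late as the clock period allows. The function name and the delay values (in picoseconds) are hypothetical.

```python
import math


def greedy_ff_insertion(edge_delays, clock_period):
    """Insert as few flip-flops as possible on each edge, as late as possible.

    Returns the number of flip-flops placed on each edge of the path.
    """
    a = 0          # arrival time since the last flip-flop
    counts = []
    for d in edge_delays:
        overshoot = a + d - clock_period
        # Smallest integer n with a + d - n * clock_period <= clock_period.
        n = max(0, math.ceil(overshoot / clock_period))
        a = a + d - n * clock_period
        counts.append(n)
    return counts


delays = [300, 400, 500]
counts = greedy_ff_insertion(delays, 500)
print(counts, sum(counts))
```

On a single path the total delay D needs at least ceil(D/τ) − 1 flip-flops, and this pass meets that bound; the interesting cases in the slides are the ones with many interacting paths and loops, which this sketch does not address.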

39
Experimental Results

We compare four sets of results: wire pipelining
(WP), the solutions obtained from the MILP solver
(MILP), the ideal upper bound used in 68 (UB), and
our graph-based heuristic approach (GH).
Impact of frequencies
Path-based pipelining gives about a 27%
performance improvement over wire pipelining

40
Integrated with floorplanning optimization

UBpost_MILP: apply the MILP approach as a post-process
at the end of floorplanning
GH: integrate our approach with throughput-driven
floorplanning

                   UBpost_MILP                        GH
Frequency (GHz)    Area (mm²)  Wire (mm)  BIPS        Area (mm²)  Wire (mm)  BIPS
2                  32.         115.6      1.492       31.8        142        1.714
3                  34.6        103.7      2.139       33.3        108.4      2.22
4                  32.4        98.7       2.776       36.1        124.3      2.828
5                  32.8        126.2      2.885       32.6        94.17      3.35
6                  36.0        108.4      3.636       33.7        100.3      3.882
7                  35.9        112.5      3.479       36.8        129.9      3.906
Comparison         1           1          1           1.003       1.05       1.091
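The Comparison row appears to be the ratio of column totals, GH relative to UBpost_MILP. That reading is an inference from the numbers, not something the slide states; the check below reproduces the wire and BIPS entries under that assumption. The area column is left out because its first UBpost_MILP value is truncated in the transcript.

```python
# Wire length (mm) and BIPS for 2 to 7 GHz, copied from the table above.
ub_wire = [115.6, 103.7, 98.7, 126.2, 108.4, 112.5]
gh_wire = [142, 108.4, 124.3, 94.17, 100.3, 129.9]
ub_bips = [1.492, 2.139, 2.776, 2.885, 3.636, 3.479]
gh_bips = [1.714, 2.22, 2.828, 3.35, 3.882, 3.906]


def ratio_of_totals(candidate, baseline):
    """Sum of the candidate column divided by the sum of the baseline column."""
    return sum(candidate) / sum(baseline)


wire_ratio = ratio_of_totals(gh_wire, ub_wire)
bips_ratio = ratio_of_totals(gh_bips, ub_bips)
print(round(wire_ratio, 2), round(bips_ratio, 3))
```

Read this way, GH uses about 5% more wire than the post-processed flow but delivers about 9% more BIPS.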
41
Summary

3D Architecture Exploration
Coupled with 3D physical planning
Consider both 3D component stacking and folding
MEVA-3D can systematically evaluate the 3D
architecture both from the performance side and
from the thermal side.
We propose an optimization methodology for
architecture pipelining with physical design
that simultaneously optimizes the pipeline design
and physical packing in terms of system
throughput. Path-based pipelining improves
performance by about 27% over wire pipelining.

42
Ongoing Work

3D Multi-core architecture design and
implementation
Deep pipeline design in microarchitecture with
interconnect considered
The slacks in 3D design may be used to enlarge
the sizes of blocks and get better performance.

43
Thank You! Mayuchun_at_tsinghua.org.cn
