Title: Yuchun Ma
1International Center for Design on
Nanotechnologies Workshop
Physical Modeling and Exploration
for 3D Microarchitecture Design
- Yuchun Ma
- Joint Work with Jason Cong, Yongxiang Liu,
- Glenn Reinman, and Yan Zhang
2Outline
- Micro-architecture Design
- 3-D IC Technology
- 3D Architecture Exploration with 2D blocks
- 3D Architecture Design with cubic folded blocks
- 3D cubic packing algorithm
- 3D architecture exploration with folded blocks
- Pipelining Optimization with Throughput-Aware
Floorplanning
- Summary and Future Work
4Superscalar Processors
- Superscalar processing is the ability of a
microprocessor to initiate multiple instructions
into multiple pipelines so that the computations
of many instructions can be done in parallel if
they are not dependent on each other.
5Alpha 21264
6Performance of a microprocessor
- Performance is measured as the time taken to
complete a given task
- Operating systems
- Compiler optimizations
- Workload used for studying the performance
- Microprocessor organization
- Typically, processor performance is measured
in MIPS or BIPS (millions or billions of
instructions per second)
8Motivations of 3-D ICs
- Alternative ways for device integration as we
approach the limit of CMOS scaling
- Interconnect length/delay reduction
- System performance improvement [Black04]
- Power reduction [Black04]
- Integration of heterogeneous technologies
- No existing flow to evaluate 3D implementations
of architectures systematically
- Performance
- Thermal
[Black04]
9Technology background
- Wafer bonding 3D IC technologies
- With flipping the top layer
- Without flipping the top layer
(a) With flipping the top layer
(b) Without flipping the top layer
A 3D IC example with two device layers
10Thermal Resistive Network [Wilkerson04]
- Circuit stack partitioned into tiles
- Tiles connected through thermal resistances
- Lateral resistances fixed
- Vertical resistances ∝ 1/via (inversely
proportional to the number of thermal vias)
- Heat sources modeled as current sources
- Current value = power
- Heat sinks modeled as ground nodes
- Thermal vias
- After floorplanning, we can further reduce the
temperature by thermal via insertion.
(a) Tiles stack array
(b) Single tile stack
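The single tile stack in (b) can be sketched as a series chain of thermal resistors. The following is a minimal illustration, not the model from [Wilkerson04]: it ignores lateral resistances, and the helper names (`vertical_resistance`, `stack_temperatures`), the unit resistance, and the power values are all assumptions made for the example. It only uses the analogy stated above: power acts as current, temperature as voltage, and the heat sink as ground.

```python
def vertical_resistance(r_unit, n_vias):
    """Assumed model: vertical resistance falls as 1/(number of thermal vias)."""
    if n_vias < 1:
        raise ValueError("n_vias must be at least 1")
    return r_unit / n_vias


def stack_temperatures(powers, resistances, ambient=0.0):
    """Temperature of each layer in one tile stack (series resistor chain).

    powers[i]      -- heat injected at layer i (the 'current source'), in W
    resistances[i] -- thermal resistance below layer i, in K/W;
                      resistances[0] connects layer 0 to the heat sink (ground)
    """
    temps = []
    t = ambient
    for i, r in enumerate(resistances):
        # All heat generated at layer i and farther from the sink crosses resistor i.
        flow = sum(powers[i:])
        t += r * flow
        temps.append(t)
    return temps


# Two device layers: layer 0 sits next to the heat sink, layer 1 on top of it.
powers = [2.0, 1.0]
no_vias = stack_temperatures(powers, [1.0, vertical_resistance(2.0, 1)])
with_vias = stack_temperatures(powers, [1.0, vertical_resistance(2.0, 4)])
print(no_vias, with_vias)
```

In this toy setup, adding vias between the two layers lowers only the temperature rise of the upper layer, which is the effect thermal via insertion is meant to exploit.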
12MEVA-3D
- An Automated Design Flow for 3D Architecture
Evaluation (MEVA-3D)
- Evaluate 3D implementations of micro-architectures
systematically and study them from both
performance and thermal perspectives.
- MEVA-3D Flow
- Automated 2D/3D floorplanning
- Reduce the latency along critical loops in the
micro-architecture by considering interconnect
pipelining at a given target frequency.
- Thermal Evaluation
- Resistive network model considering white-space
and thermal via insertion.
- 3D router
13 3D Architecture Evaluation with Physical Planning
- Optimize
- BIPS (not IPC or Freq)
- Consider interconnect pipelining based on early
floorplanning for critical paths
- Use IPC sensitivity model [Jagannathan05]
- Area/wirelength
- Temperature
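Why BIPS rather than IPC or frequency alone can be sketched with a small calculation. BIPS is IPC multiplied by the clock frequency in GHz; extra pipeline cycles on critical loops raise the achievable frequency but lower IPC. The linear sensitivity model below is a hypothetical stand-in (the slides do not give the form of the model in [Jagannathan05]), and the loop names and sensitivity values are invented for illustration.

```python
def bips(base_ipc, freq_ghz, extra_cycles, sensitivity):
    """BIPS = IPC * frequency (GHz), with IPC degraded by extra loop cycles.

    extra_cycles -- {loop name: extra cycles added by interconnect pipelining}
    sensitivity  -- {loop name: fractional IPC loss per extra cycle} (assumed linear)
    """
    loss = sum(sensitivity[loop] * n for loop, n in extra_cycles.items())
    ipc = base_ipc * (1.0 - loss)
    return ipc * freq_ghz


# Hypothetical numbers: one extra cycle on the wakeup loop costs 10% of IPC.
sens = {"wakeup": 0.10, "branch_resolution": 0.02}
with_extra = bips(1.0, 4.0, {"wakeup": 1, "branch_resolution": 0}, sens)
without_extra = bips(1.0, 4.0, {"wakeup": 0, "branch_resolution": 0}, sens)
print(with_extra, without_extra)
```

A floorplan that removes the extra wakeup cycle at the same frequency therefore scores higher in BIPS, even though neither frequency nor base IPC changed.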
14Design Example
- An out-of-order superscalar processor
micro-architecture with 4 banks of L2 cache in
70nm technology
- Critical paths
15Baseline Processor Parameters
16 2D vs 3D Layout
Assume two device layers
3D EV6-like core (2 layers)
2D EV6-like core
BIPS 2.75
BIPS 2.94
Wakeup loop: the extra cycle is eliminated.
Branch misprediction resolution loop and the L2
cache access latency: some of the extra cycles
are eliminated.
17Simulation Results
- The 3D architecture outperforms the 2D design by
about 11.7% when the frequency is 4 GHz.
18Performance for the micro-architecture with 2D
and 3D layout at different target frequencies
- 3D integration can help improve the performance
by 11% by eliminating most of the wire latencies
in 2D.
19Maximum On-Chip Temperature
- 3D integration shows a temperature increase of
over 4.78? on average. After thermal via
insertion, we can reduce the maximum on-chip
temperature by an average of about 62.
HS denotes a heat sink; 3D integration allows
thermal vias to be inserted to reduce the
temperature.
21 3D Design w/ Component Folding and Stacking
- Explore 3D design of architectural structures
that are
- Timing/Throughput Critical
- Expensive in Terms of Power Consumption and/or
Thermal Output
- Possible candidates for 3D component folding
- Instruction Scheduling Window
- Issue Queue can be partitioned into multiple
levels via matchlines or taglines.
- On-Chip Caches
- Regular structure lends itself to a wide range of
partitionings
- Register File
- Thermally critical resource also has a regular
structure
22 3D Architectural Block Design and Modeling
- First explore how to design blocks in 3D
- Wordline folding
- Fold block horizontally
- Port Partitioning
- Extend ports to different layers
- Tools
- CACTI
- Caches and cache-like structures
- Register files
- HSpice
- Issue Queue
- Then explore design space for a microprocessor
with these blocks
23 3D Issue Queue
- Block folding
- Fold the entries and place them on different
layers
- Effectively shortens the tag lines
- Port partitioning
- Place tag lines and ports on multiple layers, thus
reducing both the height and width of the ISQ.
- The reduction in tag and matchline wires can help
reduce both power and delay.
(a) 2D issue queue with 4 taglines (b) block
folding (c) port partitioning
24Benefits from IQ folding
- Maximum delay reduction of 50%, maximum area
reduction of 90% and a maximum reduction in power
consumption of 40%
nL: n = number of layers; FB: folding banks; TP:
tag/port partitioning
25Improvements for blocks
- Port folding performs better than wordline
folding for area (72% vs 51%).
- Wordline folding is more effective in reducing
the block delay (13% vs 5%).
- Port folding also performs better in reducing
power (13% vs 5%).
26 3D packing with folded blocks
- Exploring the use of vertical integration in
microprocessor design requires consideration of
both physical design and architecture.
- True 3D packing
- Architectural Alternative Selection
- The number of layers in folded blocks
- The partitioning method: block folding or port
partitioning
27 3D Corner Block List Representation
- (S, L, T) composes a 3D CBL.
- S: a record of block names
- L: corner cubic block orientation (X-, Y- or
Z-oriented)
- T: the sequence Tn, Tn-1, ..., T2 recording the
number of attached tri-branches covered by each
corner cubic block
S = (1, 2, 3, 4, 5), L = (Y, Z, Y, X), T = (10, 110, 10, 1110)
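The (S, L, T) triple above can be held in a small data structure. The sketch below only stores and sanity-checks the three lists; it does not decode a packing. Two points are assumptions inferred from the example rather than stated on the slide: that L and T each have one entry fewer than S, and that each T entry is a run of 1s closed by a single 0, with the number of 1s giving the tri-branch count (by analogy with the 2D corner block list). The class name `CBL3D` is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CBL3D:
    S: List[int]  # block names, in sequence order
    L: List[str]  # orientation of each corner cubic block: "X", "Y" or "Z"
    T: List[str]  # one bit string per entry, e.g. "110"

    def validate(self) -> None:
        n = len(self.S)
        # Assumed from the example: five blocks, four L entries, four T entries.
        if len(self.L) != n - 1 or len(self.T) != n - 1:
            raise ValueError("L and T must each have len(S) - 1 entries")
        if any(o not in ("X", "Y", "Z") for o in self.L):
            raise ValueError("orientations must be X, Y or Z")
        for t in self.T:
            # Assumed encoding: zero or more 1s followed by exactly one 0.
            if not t.endswith("0") or set(t[:-1]) - {"1"}:
                raise ValueError(f"malformed T entry: {t!r}")

    def tri_branch_counts(self) -> List[int]:
        """Number of 1s in each T entry (assumed to be the tri-branch count)."""
        return [t.count("1") for t in self.T]


example = CBL3D(S=[1, 2, 3, 4, 5],
                L=["Y", "Z", "Y", "X"],
                T=["10", "110", "10", "1110"])
example.validate()
print(example.tri_branch_counts())
```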
28Packings with folded blocks
30Performance
- On average, multi-layer (3D) block configurations
have 11% lower temperature as well as a 14%
improvement in BIPS.
31Temperatures
- Temperatures can be kept below 100 degrees with
thermal vias inserted.
32Temperature profile
33Temperature profile(2 layers with thermal vias)
35Micro-architecture Pipelining Optimization
- Previous works assume that the blocks are
separately designed subject to a clock frequency,
and the wire pipelining is then carried out on
the global wires of the circuits.
- Sub-optimal, because slack that could be utilized
in the block pipeline designs is left unused
- We propose a novel optimization methodology of
architecture pipelining with physical design, so
that block pipelining and interconnect pipelining
can be considered simultaneously.
36Simultaneous Block and Interconnect Pipelining
- We define path-based pipelining as the Simultaneous
Block and Interconnect Pipelining (SBIP) Problem
- Represent the micro-architecture design by a path
graph $G(V, E)$.
- The delay between any two flip-flops along the
same path does not exceed the clock period $T_{clk}$.
- The performance of the architecture can be
evaluated by the weighted sum of the number of FFs
$n_{e_i}$ on each edge $e_i$ along the paths.
- Therefore the objective is to find a feasible
solution with the optimal performance.
37MILP Formulation
- We define a term $a(P, v)$ that represents the
arrival time at node $v$ along path $P$, which is
the longest delay from a flip-flop to the node $v$
along path $P$.
- With the given clock period $T_{clk}$ and the set of
paths $P$, we can then formulate the problem as the
following MILP:
- Obj.: minimize the weighted sum of $n_{e_i}$ along
the paths
- s.t. $0 \le a(P_i, v) \le T_{clk}, \quad \forall v \in V$ and $P_i$ passes $v$ (1)
- $n_{e_i} \ge 0, \quad \forall e_i \in E$ (2)
- $a(P_i, v) \ge a(P_i, u) + d_{e_i} - T_{clk} \cdot n_{e_i}, \quad \forall e_i \in E$
and $e_i$ is a connection from node $u$ to node $v$
along path $P_i$ (3)
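Constraints (1) to (3) can be checked for a single path with a few lines of code. The sketch below is not an MILP solver: it takes a candidate flip-flop count per edge and reports whether the smallest arrival times allowed by (3) stay within the bounds of (1). The function names and the example delays are assumptions; delays and the clock period are given as integer picoseconds, and each edge delay is taken to be the quantity written d on that edge.

```python
def arrival_times(edge_delays, ff_counts, clock_period):
    """Smallest arrival time after each edge of one path.

    Applies a(v) = max(0, a(u) + d - clock_period * n), the tightest value
    that satisfies constraint (3) together with the lower bound in (1).
    """
    arrival = 0
    times = []
    for d, n in zip(edge_delays, ff_counts):
        arrival = max(0, arrival + d - clock_period * n)
        times.append(arrival)
    return times


def is_feasible(edge_delays, ff_counts, clock_period):
    """True if the assignment satisfies constraints (1), (2) and (3)."""
    if len(edge_delays) != len(ff_counts):
        raise ValueError("one flip-flop count is needed per edge")
    if any(n < 0 for n in ff_counts):  # constraint (2)
        return False
    times = arrival_times(edge_delays, ff_counts, clock_period)
    return all(t <= clock_period for t in times)  # upper bound in (1)


delays_ps = [200, 100, 400, 50]  # hypothetical edge delays along one path
print(is_feasible(delays_ps, [0, 1, 1, 0], 250))  # 250 ps period, i.e. 4 GHz
print(is_feasible(delays_ps, [0, 0, 1, 0], 250))
```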
38Graph-based heuristic algorithm
- Traverse the graph to decide the optimal
insertion of flip-flops such that the weighted
sum of cycle numbers of paths is minimized
- Dynamic scanning for combinational circuits
- Slacks along paths are used to compute the
optimal positions for FFs.
- Near-optimal method for sequential circuits
- Break the cycle into a path from s to t
- Throughput-aware floorplanning with pipelining
- The path-based pipelining design guides the block
design to optimize the performance for the whole
design.
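The idea of scanning a path and using the accumulated slack to decide where flip-flops go can be illustrated with a simple greedy pass over one path. This is a deliberately simplified stand-in, not the graph-based heuristic of the talk: it handles a single path with no weights and no shared edges, and the function name and example delays are hypothetical. Delays and the clock period are integer picoseconds.

```python
def greedy_flip_flops(edge_delays, clock_period):
    """Insert the fewest flip-flops on each edge, scanning from source to sink.

    On each edge, n is the smallest non-negative integer such that the arrival
    time after the edge does not exceed the clock period.
    """
    arrival = 0
    counts = []
    for d in edge_delays:
        excess = arrival + d - clock_period
        # Ceiling division for integers; a non-positive excess needs no flip-flop.
        n = max(0, -(-excess // clock_period))
        arrival = max(0, arrival + d - clock_period * n)
        counts.append(n)
    return counts


counts = greedy_flip_flops([200, 100, 400, 50], 250)
print(counts, sum(counts))
```

The first edge needs no flip-flop because its 200 ps delay fits in the 250 ps period; the slack it leaves (50 ps) is too small for the next 100 ps edge, so a flip-flop is placed there.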
39Experimental Results
- We compare the wire-pipelining results (WP), the
solutions obtained from the MILP solver (MILP),
the ideal upper bound used in 68 (UB) and our
graph-based heuristic approach (GH).
- Impact of frequencies
- The path-based pipelining gives about a 27%
performance improvement over wire pipelining
40Integrated with floorplanning optimization
- MILP approach as a post-process at the end of the
floorplanning
- Integrate our approach with the throughput-driven
floorplanning.
Frequency   UBpost_MILP                         GH
(GHz)       Area (mm^2)  Wire (mm)  BIPS        Area (mm^2)  Wire (mm)  BIPS
2           32.          115.6      1.492       31.8         142        1.714
3           34.6         103.7      2.139       33.3         108.4      2.22
4           32.4         98.7       2.776       36.1         124.3      2.828
5           32.8         126.2      2.885       32.6         94.17      3.35
6           36.0         108.4      3.636       33.7         100.3      3.882
7           35.9         112.5      3.479       36.8         129.9      3.906
Comparison  1            1          1           1.003        1.05       1.091
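The Comparison row appears to be the ratio of column averages, GH over UBpost_MILP. That reading is an inference from the numbers, not something the slide states. The short check below recomputes it for the wire length and BIPS columns; the area column is left out because its first UBpost_MILP entry reads "32." in the source.

```python
# Values copied from the table, one entry per target frequency (2 to 7 GHz).
ub_wire = [115.6, 103.7, 98.7, 126.2, 108.4, 112.5]
gh_wire = [142, 108.4, 124.3, 94.17, 100.3, 129.9]
ub_bips = [1.492, 2.139, 2.776, 2.885, 3.636, 3.479]
gh_bips = [1.714, 2.22, 2.828, 3.35, 3.882, 3.906]


def ratio_of_means(candidate, baseline):
    """Mean of the candidate column divided by the mean of the baseline column."""
    return (sum(candidate) / len(candidate)) / (sum(baseline) / len(baseline))


wire_ratio = ratio_of_means(gh_wire, ub_wire)
bips_ratio = ratio_of_means(gh_bips, ub_bips)
print(f"wire: {wire_ratio:.3f}  BIPS: {bips_ratio:.3f}")
```

Both ratios round to the values in the Comparison row (1.05 for wire length, 1.091 for BIPS): GH pays about 5% in wire length for about 9% higher BIPS.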
41Summary
- 3D Architecture Exploration
- Coupled with 3D physical planning
- Consider both 3D component stacking and folding
- MEVA-3D can systematically evaluate the 3D
architecture from both the performance side and
the thermal side.
- We propose an optimization methodology of
architecture pipelining with physical design
which simultaneously optimizes the pipeline design
and physical packing in terms of system
throughput. System performance improves by about
27% over wire pipelining.
42Ongoing Work
- 3D multi-core architecture design and
implementation
- Deep pipeline design in microarchitecture with
interconnect considered
- The slacks in 3D design may be used to enlarge
the sizes of blocks and get better performance.
43 Thank You! Mayuchun_at_tsinghua.org.cn