Title: For slide series SNS Utrecht 't Gooi
Slide 2: Closing the Gap Between ASIC and Custom: An ASIC Perspective
- David Chinnery, Kurt Keutzer
- EECS, University of California at Berkeley
Slide 3: Our questions
- How big is the speed gap between ASIC and custom?
- Where does the speed go?
- How can we close the speed gap?
Slide 4: How much is on the table?
- How big is the gap between ASIC and custom circuits?
[Figure: design spectrum with an unknown gap ("?"). Labels: manual layout, tiling, automated place & route, RTL synthesis, gate array, standard cell (2 sizes), standard cell (6 sizes), arbitrary circuits.]
Slide 5: 0.25 um Design Examples
- Very high speed custom designs
  - Alpha 21264A, 750 MHz
    - Out-of-order execution of instructions
  - IBM PowerPC 1.0 GHz integer processor, not commercial
    - In-order execution
- ASIC
  - Tensilica Xtensa processor, 150 MHz worst case
    - In-order execution
  - Average ASIC, estimated 120 to 150 MHz
Slide 6: The Gap
- How big is the gap between ASIC and custom circuits?
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz and Average ASIC 120 MHz.]
Slide 7: An interesting data point
- How big is the gap between ASIC and custom circuits?
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz, Tensilica Xtensa 150 MHz, and Average ASIC 120 MHz.]
Slide 8: Observed Gap
- How big is the gap between ASIC and custom circuits?
- 6-8× speed
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz and Average ASIC 120 MHz.]
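Note: the 6-8× figure can be checked against the clock rates quoted on the design-examples slide. A minimal sketch, assuming the gap is simply the ratio of the 1 GHz PowerPC to the 120 to 150 MHz average ASIC estimate (the variable names are illustrative):

```python
# Clock rates quoted in the slides, in MHz.
CUSTOM_MHZ = 1000.0              # IBM PowerPC 1.0 GHz integer processor
ASIC_MHZ_RANGE = (120.0, 150.0)  # "Average ASIC, estimated 120 to 150 MHz"

# The gap is largest against the slowest ASIC estimate.
gap_high = CUSTOM_MHZ / ASIC_MHZ_RANGE[0]
gap_low = CUSTOM_MHZ / ASIC_MHZ_RANGE[1]

print(f"custom/ASIC speed gap: {gap_low:.1f}x to {gap_high:.1f}x")
```

The ratios come out at roughly 6.7 and 8.3, consistent with the 6-8× range on the slide.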
Slide 9: Where does all that speed go?
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz and Average ASIC 120 MHz, a 6-8× speed gap.]
Slide 10: Where does all that speed go?
- Custom prejudice
  - ASIC designers are bad
  - ASIC CAD tools are worse
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz and Average ASIC 120 MHz.]
Slide 11: Where does all that speed go?
- What's the reality?
- Let's take a quick look
[Figure: design spectrum from manual layout to RTL synthesis, marked with PowerPC 1 GHz and Average ASIC 120 MHz.]
Slide 12: Where does the speed go?
- Maximum contribution
  - 4.20×: architecture
- Architecture
  - Reducing critical path length by inserting registers or latches
[Figure: two datapath diagrams with stages instruction fetch, instruction decode, ALU, write.]
Slide 13: Where does the speed go?
- Maximum contribution
  - 1.20×: logic design and clock skew
- Reducing levels of logic through complex functions
  - Reduces area and sometimes reduces speed
- Less overhead due to guard bands and signal wires
- Generally worse clock skew in ASICs
[Figure: circuit schematic between VDD and GND.]
Slide 14: Where does the speed go?
- Maximum contribution
  - 1.25×: good floorplanning and placement
- Reduce wire lengths by placing connected modules nearby
Slide 15: Where does the speed go?
- Maximum contribution
  - 1.25×: clever sizing of transistors and wires
Slide 16: Where does the speed go?
- Maximum contribution
  - 1.50×: through use of dynamic logic on critical paths
- Avoid slow p-transistor chains, reduced area
Slide 17: Where does the speed go?
- Maximum contribution
  - 2.00×: due to process variation and accessibility
- ASIC libraries may lag technology improvements
[Figure: chips produced versus speed, with "ASIC worst case, worst process" and "fastest custom bin" marked 2.0× apart.]
Slide 18: Full Range from ASIC to Custom
- Maximum contribution summary
  - 4.20×: architecture
  - 1.20×: logic design and clock skew
  - 1.25×: good floorplanning and placement
  - 1.25×: clever sizing of transistors and wires
  - 1.50×: through dynamic logic on critical paths
  - 2.00×: due to process variation and accessibility
- Good custom might be 23.6× better than bad ASIC.
- Your mileage may vary!
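Note: the 23.6 figure is what you get if the six maximum contributions are treated as independent multiplicative factors. A short sketch of that arithmetic (the dictionary keys are just labels taken from the bullets):

```python
import math

# Maximum contributions listed on the summary slide.
MAX_CONTRIBUTIONS = {
    "architecture": 4.20,
    "logic design and clock skew": 1.20,
    "floorplanning and placement": 1.25,
    "transistor and wire sizing": 1.25,
    "dynamic logic on critical paths": 1.50,
    "process variation and accessibility": 2.00,
}

# Assumption: the factors compound, so the overall gap is their product.
total_gap = math.prod(MAX_CONTRIBUTIONS.values())
print(f"good custom vs. bad ASIC: {total_gap:.1f}x")
```

Multiplying is only reasonable if the factors do not overlap, which is why the slide frames the result as a maximum.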
Slide 19: Full Range from ASIC to Custom
- Maximum contribution summary
  - 4.20×: architecture
  - 1.20×: logic design and clock skew
  - 1.25×: good floorplanning and placement
  - 1.25×: clever sizing of transistors and wires
  - 1.50×: through dynamic logic on critical paths
  - 2.00×: due to process variation and accessibility
- Good custom might be 23.6× better than bad ASIC.
- Let's look at all that more carefully
Slide 20: First, the facts: Critical Path Delay
- Delay is a function of
  - Gate and wire delays
[Figure: timing diagram with signals data, clock, Q1, Q2 and clock edges Tclock1, Tclock2; the critical path has 5 logic levels.]
Slide 21: Critical Path Delay
- Delay is a function of
  - Gate and wire delays
  - Data stable during
    - Setup time, before clock
[Figure: timing diagram with signals data, clock, Q1, Q2 and clock edges Tclock1, Tclock2; the critical path has 5 logic levels.]
Slide 22: Critical Path Delay
- Delay is also a function of
  - Clock skew
[Figure: timing diagram with signals data, clock, Q1, Q2 and clock edges Tclock1, Tclock2, with the clock skew between them marked.]
Slide 23: Critical Path Delay
- Delay is also a function of
  - Clock skew
  - Clock-to-Q
[Figure: timing diagram with signals data, clock, Q1, Q2 and clock edges Tclock1, Tclock2.]
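Note: the four delay components named on these slides are usually combined into a single bound on the clock period. A minimal sketch, assuming the standard register-to-register constraint (period ≥ clock-to-Q + logic and wire delay + setup + skew); the function name and all numeric values are hypothetical and are not taken from the slides:

```python
def min_cycle_time_ns(clock_to_q_ns, logic_ns, setup_ns, skew_ns):
    """Smallest clock period that lets data settle before the capturing edge."""
    return clock_to_q_ns + logic_ns + setup_ns + skew_ns

# Hypothetical values, for illustration only.
period_ns = min_cycle_time_ns(clock_to_q_ns=0.4, logic_ns=4.0,
                              setup_ns=0.3, skew_ns=0.3)
max_freq_mhz = 1000.0 / period_ns

print(f"minimum cycle time: {period_ns:.1f} ns")
print(f"maximum clock rate: {max_freq_mhz:.0f} MHz")
```

Only the logic term does useful work; the other three are the per-cycle overhead that the later pipelining comparison lumps together.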
Slide 24: 1. Architecture
- Increase speed by reducing the critical path length
  - Pipeline: add latches between gates
  - Must balance pipeline stages to maximize gain
- If we add 5 stages, why is the speed-up 4 and not 5?
[Figure: two datapath diagrams with stages instruction fetch, instruction decode, ALU, write.]
Slide 25: Pipelining Comparison
- Compare in-order execution
  - Estimate latch and clock skew overheads of 20% (overhead in 0.35 um Alpha 21264)
- ASIC, Xtensa
  - Pipelined: 4.0 ns + 1.0 ns overhead = 5.0 ns cycle
  - Unpipelined: 5 × 4.0 ns + 1.0 ns (overhead) = 21.0 ns cycle
  - Creating five pipeline stages in Xtensa gives 4.2×
- Speed-up is less due to pipelining overheads
  - Latch delay
  - Clock skew
- Limited number of pipeline stages
  - More stages increase the cost of branch misprediction and stalls
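Note: the cycle times above can be reproduced directly. A small sketch, assuming every stage has the same 4.0 ns of logic and that the 1.0 ns latch and skew overhead is paid once per clock cycle (the helper name is illustrative):

```python
STAGE_LOGIC_NS = 4.0  # logic delay per stage, from the slide
OVERHEAD_NS = 1.0     # latch delay plus clock skew per cycle, from the slide
STAGES = 5

def cycle_time_ns(stages_per_cycle):
    """Clock period when this many stages of logic sit between latches."""
    return stages_per_cycle * STAGE_LOGIC_NS + OVERHEAD_NS

pipelined_ns = cycle_time_ns(1)         # one stage per cycle
unpipelined_ns = cycle_time_ns(STAGES)  # all five stages in one cycle
speedup = unpipelined_ns / pipelined_ns

print(f"pipelined cycle:   {pipelined_ns:.1f} ns")
print(f"unpipelined cycle: {unpipelined_ns:.1f} ns")
print(f"speed-up:          {speedup:.1f}x")
```

The speed-up falls short of 5 because the overhead is not divided by the number of stages: every pipelined cycle still pays the full 1.0 ns.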
Slide 26: Can we improve the architecture and micro-architecture of ASICs?
- Not always: fundamental problem in some applications
  - PCI bus interface has cycle-to-cycle dependency
  - No opportunity for pipelining
- Bottom line
  - Unpipelined ASICs lose a factor of 4.20
  - Compared with custom and pipelined ASICs
Slide 27: But, some ASICs can be pipelined!
- If we can perform instructions in parallel for the application, then pipeline
  - Five stages in Xtensa
  - 4.2× speed-up
[Figure labels: 23.6×, 4.20×]
Slide 28: 2. Better Logic Design in Custom
- Custom designs can have specially designed structures
  - Reducing levels of logic through complex functions
    - Reduces area and sometimes reduces speed
  - Less overhead, with less guard banding and fewer signal wires
  - Superior design of regular logic like adders, multipliers
- Incorporate logic in latches
  - Reduce latch overhead
Slide 29: Clock Skew, Latch Design Comparison
- Greater clock skew in ASICs, contributing 1.10×
  - Best ASIC: 5%, 250 ps at 250 MHz
    - Xtensa in 0.25 um (at typical speeds, typical process)
  - Custom: 5%, 75 ps at 600 MHz
    - Alpha 21264 in 0.35 um
- Better latch design would also impact pipelining
  - E.g. if custom could have 0.2 ns overhead (1.0 ns for ASIC)
  - ASIC 5.0 ns cycle → 4.2 ns cycle
- 1.20× is due to logic design and clock skew
[Figure: clock skew diagram.]
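Note: the 5.0 ns to 4.2 ns example lines up with the 1.20× figure. A quick sketch, assuming the 4.0 ns of stage logic from the Pipelining Comparison slide stays fixed and only the per-cycle overhead changes:

```python
STAGE_LOGIC_NS = 4.0       # logic per stage, from the pipelining comparison
ASIC_OVERHEAD_NS = 1.0     # ASIC latch and skew overhead, from the slide
CUSTOM_OVERHEAD_NS = 0.2   # the "if we could have" custom overhead, from the slide

asic_cycle_ns = STAGE_LOGIC_NS + ASIC_OVERHEAD_NS
custom_cycle_ns = STAGE_LOGIC_NS + CUSTOM_OVERHEAD_NS
gain = asic_cycle_ns / custom_cycle_ns

print(f"ASIC cycle:   {asic_cycle_ns:.1f} ns")
print(f"custom cycle: {custom_cycle_ns:.1f} ns")
print(f"gain from lower overhead: {gain:.2f}x")
```

The ratio is about 1.19, which the slide rounds to 1.20.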
Slide 30: Can we improve ASIC logic design?
- Add custom macros to ASIC library
  - Drawback: takes time to design macros
  - Reuse, amortizing design time
  - Limited by design overhead of macros that won't be reused
- Designer must ensure ASIC description invokes predefined macros
[Figure: "custom ingredients": adder, barrel shifter, register file, MAC.]
Slide 31: Can we improve ASIC logic design?
- Add custom macros to ASIC library
  - Drawback: takes time to design macros
  - Reuse, amortizing design time
  - Limited by design overhead of macros that won't be reused
- Designer must ensure ASIC description invokes predefined macros
[Figure: "custom ingredients": adder, barrel shifter, register file, MAC. Labels: 23.6×, 1.20×]
Slide 32: 3. Floorplanning and Placement
- Increase speed by avoiding cross-chip critical path wires
  - Place interconnected modules nearby
- Long wires in ASICs due to poor final placement of modules
- Impact of long wires: BACPAC, 0.25 um ASIC, 12 million transistors
  - With shorter wires, design would be 1.25× faster
Slide 33: Can ASICs improve floorplanning?
- Use good ASIC floorplanning tools
- Improve tool recognition of similar structures that can be abutted and tiled
- Do about as well as custom in this regard
Slide 34: Can ASICs improve floorplanning?
- Use good ASIC floorplanning tools
- Improve tool recognition of similar structures that can be abutted and tiled
- Do about as well as custom: 1.25×
[Figure labels: 23.6×, 1.25×]
Slide 35: 4. Transistor and Wire Sizing
- Reduce gate and wire delays on critical path
- Size gate output to drive load of fan-out gates and wires
  - Size up transistors to drive large loads
  - Widen wires to decrease resistance
- Delay is proportional to resistance × capacitance
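Note: a rough way to see why sizing up a driver helps with a large load. This is a first-order sketch only, assuming drive resistance scales as 1/size and the gate's own output capacitance scales with size; the model, the function name, and every numeric value are hypothetical, and the extra load a bigger gate puts on the previous stage is ignored:

```python
def sized_gate_delay_s(size, unit_resistance_ohm, load_f, unit_self_cap_f):
    """First-order RC delay of a gate that is `size` times the unit gate."""
    resistance = unit_resistance_ohm / size      # wider transistors drive harder
    capacitance = load_f + size * unit_self_cap_f  # but add their own capacitance
    return resistance * capacitance

# Hypothetical unit gate: 10 kOhm drive, 2 fF self-capacitance, 50 fF load.
sizes = (1, 2, 4, 8)
delays = [sized_gate_delay_s(s, 10_000.0, 50e-15, 2e-15) for s in sizes]

for s, d in zip(sizes, delays):
    print(f"size {s}: {d * 1e12:.1f} ps")
```

With a library that offers only one or two drive strengths per cell, most of the points on this curve are simply unavailable.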
Slide 36: Impact of poor library design
- ASIC standard cell library has discrete gate sizes
- Some libraries used with insufficient range of gate drives
  - One or two sizes per cell
  - Few inverters, few buffers
  - Single gate polarities (less compact)
- Tools for sizing wires in ASIC designs not available
- After layout, resizing transistors knowing layout can give up to 20% improvement [Gavrilov 97]
- Custom gains about a factor of 1.25 due to these problems.
Slide 37: Can we improve ASIC sizing?
- ASIC libraries can be improved
  - Use library with dual polarities, several (e.g. 6) drive strengths per cell
Slide 38: Can we improve ASIC sizing?
- ASIC libraries can be improved
  - Use library with dual polarities, several (e.g. 6) drive strengths per cell
- Custom still about 1.05× better.
- Iterative transistor sizing and resynthesis
  - Can improve speed by up to 20% [Gavrilov ICCAD97]
[Figure labels: 23.6×, 1.20×]
Slide 39: 5. Dynamic Logic
- Using dynamic logic on critical paths
  - Avoids slow p-transistor chains
    - Higher speed
  - Reduces area
    - Only pull-down network and charging transistors
- Dynamic logic increases speed by about 1.50× [Nowka ICCD98]
[Figure: static CMOS gate with slow p-chain next to a clocked domino logic gate, each between VDD and GND.]
Slide 40: Dynamic Logic in ASICs?
- Dynamic logic requires careful design
  - Glitching causes incorrect result
  - More susceptible to noise
- Precharge power spike
  - Careful design of power supply for dynamic
  - Static CMOS is lower power
- ASIC tools are unable to support dynamic logic
  - Dynamic logic libraries not available
  - Unable to use library-driven static timing analysis
  - Interface of dynamic and static logic is complicated
- Custom remains 1.50× better.
Slide 41: Dynamic Logic in ASICs?
- Dynamic logic requires careful design
  - Glitching causes incorrect result
  - More susceptible to noise
- Precharge power spike
  - Careful design of power supply for dynamic
  - Static CMOS is lower power
- ASIC tools are unable to support dynamic logic
  - Dynamic logic libraries not available
  - Unable to use library-driven static timing analysis
  - Interface of dynamic and static logic is complicated
- Can't improve: custom remains 1.50× better.
Slide 42: But Dynamic Logic in Custom?
- Dynamic logic problems more pronounced in deep submicron
- Power dissipation
  - Power consumption limited by supply
  - Heat dissipation limited by packaging
- More noise
  - Higher frequencies cause more noise
  - More cross-talk noise as wires are closer
- Longer design times than static CMOS
  - Prohibitive for progressively larger designs
- Dynamic logic likely to lose its advantages by 100 nm.
  - (Sorry Mark)
Slide 43: Dynamic Logic in Custom?
- Dynamic logic problems more pronounced in deep submicron
- Power dissipation
  - Power consumption limited by supply
  - Heat dissipation limited by packaging
- More noise
  - Higher frequencies cause more noise
  - More cross-talk noise as wires are closer
- Longer design times than static CMOS
  - Prohibitive for progressively larger designs
- Dynamic logic likely to lose its advantages by 100 nm.
  - (Sorry Mark)
[Figure label: 15.7×]
Slide 44: 6. Process Variation and Accessibility
- ASIC libraries calculate worst case speeds for process
  - Speeds off a line may vary by 20% to 40%
  - Less variation in a mature process
- Custom designs can down-bin the slower chips
[Figure: chips produced versus speed, with labels "ASIC worst case, worst process", "good yield", and "fast custom, rest slower", and speed marks at 1.2× and 1.4×.]
Slide 45: Process Variation and Accessibility
- ASIC libraries calculate worst case speeds for process
  - Speeds off a line vary by 20% to 40%
  - Less variation in a mature process
- Custom designs can down-bin the slower chips
- Could run ASICs faster than worst case speeds, with high yield
[Figure: chips produced versus speed, with labels "ASIC worst case, worst process" and "acceptable ASIC yield", and a speed mark at 1.2×.]
Slide 46: Process Variation and Accessibility
- ASIC libraries calculate worst case speeds for process
  - Speeds off a line vary by 20% to 40%
  - Less variation in a mature process
- Custom designs can down-bin the slower chips
- Could run ASICs faster than worst case speeds, with high yield
- Fabrication plants vary in speed by up to 25% [Tensilica Xtensa modeling]
[Figure: chips produced versus speed for Fab A and Fab B, with label "ASIC worst case" and a speed mark at 1.2×.]
Slide 47: Process Variation and Accessibility
- ASIC libraries may not keep up with process improvements
- Technology improvements
  - Intel 0.25 um 856 process had 18% speed improvement over the life of the process generation
- ASIC libraries may lag technology improvements
[Figure: chips produced versus speed for Fab A, Fab B, and an improved process, with labels "ASIC worst case" and "acceptable ASIC yield", and speed marks at 1.2 × 1.18 and 1.4.]
Slide 48: Process Variation and Accessibility
- Total difference of 2.00× between
  - worst case ASIC speeds on worst process, with original library (lagging process improvements)
  - and fast custom with fully up-to-date technology
- ASIC libraries may lag technology improvements
[Figure: chips produced versus speed for Fab A and Fab B, with labels "ASIC worst case", "acceptable ASIC yield", and "fast customs, rest slower", and a 2.0× span.]
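Note: one plausible reading of the figure labels (1.2 × 1.18 and 1.4) is that the 2.00× total is the product of three effects. The sketch below assumes that reading; how each factor maps onto fab choice, process improvement, and speed binning is an interpretation, not something the slides state outright:

```python
# Factors as they appear in the figure labels of the preceding slides.
FAB_OR_YIELD_FACTOR = 1.2           # the "1.2" speed mark
PROCESS_IMPROVEMENT_FACTOR = 1.18   # 18% improvement over the process lifetime
FAST_BIN_FACTOR = 1.4               # the "1.4" speed mark

# Assumption: the three effects compound multiplicatively.
process_gap = (FAB_OR_YIELD_FACTOR
               * PROCESS_IMPROVEMENT_FACTOR
               * FAST_BIN_FACTOR)

print(f"process variation and accessibility: {process_gap:.2f}x")
```

The product is about 1.98, close to the 2.00× quoted as the total difference.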
Slide 49: Process Variation and Accessibility
- Can run ASICs faster than worst case speeds
  - Test what speed can run at with high yield
  - Improve speed by 30% to 40%
  - Xtensa can run at 250 MHz in 0.25 um
- Choose good fabrication company
  - May be more expensive
  - 20% better than worst processes in technology [Tensilica Xtensa modeling]
- Bottom line
  - ASICs in a slow process, at worst case speeds, lose a factor of 2.00
Slide 50: Process Variation and Accessibility
- Can run ASICs faster than worst case speeds
  - Test what speed can run at with high yield
  - Improve speed by 30% to 40%
  - Xtensa can run at 250 MHz in 0.25 um
- Choose good fabrication company
  - May be more expensive
  - 20% better than worst processes in technology [Tensilica Xtensa modeling]
- Bottom line
  - ASICs in a slow process, at worst case speeds, lose a factor of 2.00
  - But ASICs in a good process, running at better than worst case speeds, get within 20% of custom (gain 1.66× relative through best practice)
[Figure labels: 1.66×, 23.6×]
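Note: the 1.66× gain follows from the two process figures on this slide. A minimal sketch, assuming "within 20% of custom" means a remaining factor of 1.20 out of the original 2.00:

```python
WORST_CASE_PROCESS_GAP = 2.00  # slow process at worst case speeds, from the slide
REMAINING_GAP = 1.20           # "within 20% of custom", from the slide

# What best practice recovers is the ratio of the two.
best_practice_gain = WORST_CASE_PROCESS_GAP / REMAINING_GAP

print(f"gain from best practice: {best_practice_gain:.3f}x")
```

The ratio is 1.667, which the slide reports as 1.66.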
Slide 51: Custom advantages over best ASIC practice
- 1.20×: logic design and clock skew
- 1.05×: clever sizing of transistors and wires
- 1.50× (today): dynamic logic on critical paths
- 1.20×: process variation and accessibility
- Custom still 2.3× better!
Slide 52: Custom advantages over best ASIC practice
- Custom advantages relative to best ASIC methods
  - 1.20×: logic design and clock skew
  - 1.05×: clever sizing of transistors and wires
  - 1.50× (today): dynamic logic on critical paths
  - 1.20×: process variation and accessibility
- Custom still 2.3× better
- But custom only 1.5× faster at 100 nm if dynamic logic not viable.
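Note: both the 2.3× and the 1.5× figures are products of the remaining factors. A short sketch, again assuming the factors compound (the dictionary keys are labels taken from the bullets):

```python
import math

# Custom advantages that remain over best ASIC practice, from the slide.
REMAINING_FACTORS = {
    "logic design and clock skew": 1.20,
    "transistor and wire sizing": 1.05,
    "dynamic logic on critical paths": 1.50,
    "process variation and accessibility": 1.20,
}

gap_today = math.prod(REMAINING_FACTORS.values())

# If dynamic logic stops being viable, drop that factor from the product.
gap_without_dynamic = gap_today / REMAINING_FACTORS["dynamic logic on critical paths"]

print(f"custom vs. best-practice ASIC today: {gap_today:.1f}x")
print(f"without dynamic logic:               {gap_without_dynamic:.1f}x")
```

Dynamic logic is the largest single remaining term, which is why losing it moves the gap from about 2.3× to about 1.5×.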
Slide 53: Another look (NB: big is BAD!)
[Figure labels: 4.20×, 2.00×, 1.50×, 1.25×]
Slide 54: Punch Line
- ASIC performance lags custom by 8×
- Attention typically focused on detailed circuit design and layout as primary reason
- Our work indicates that
  - architecture, logic design and clock skew (1.20× to 5.00×),
  - and processing (1.20× to 2.00×),
  - play a much larger role,
  - and custom circuit design and layout offer only about 1.30×
- Dynamic logic is one other significant factor in why custom designs can do better: 1.50×
- ASIC/custom gap will narrow further (x 1.20 1.50) if custom loses dynamic logic advantage in small geometries
- Response to Bill: Yes, it's easy to make really slow ASICs if you have a critical path with long, unbuffered wires, even if you have a good architecture manufactured in a fast process.