Device and Architecture Co-Optimization for FPGA Power Reduction - PowerPoint PPT Presentation

Device and Architecture Co-Optimization for FPGA Power Reduction. Lerong Cheng, Phoebe Wong, Fei Li, Yan Lin, and Prof. Lei He, EE Department, UCLA

Slides: 39

Learn more at: http://eda.ee.ucla.edu


1
Device and Architecture Co-Optimization for FPGA
Power Reduction

Lerong Cheng, Phoebe Wong,
Fei Li, Yan Lin,
and Prof. Lei He
EE Department, UCLA
Partially supported by NSF CAREER award
CCR-0093273/0401682 and NSF grant CCR-0306682.
Address comments to lhe_at_ee.ucla.edu

2
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Conclusion

3
Evaluation of Conventional FPGA Architecture

LUT size and cluster size have been evaluated for
conventional FPGAs
Performance and area [Ahmed et al., ISFPGA'00]
Power and performance [Li et al., ISFPGA'03]
Architecture tuning leads to a 2.8X energy
difference and a 1.5X delay difference

[Figures: island-style FPGA architecture; evaluation result]

4
Evaluation of Low-Power FPGA Architecture

Field-programmable dual-Vdd for power reduction
[Lin et al., ISFPGA'05]
Applying field-programmable dual-Vdd reduces the
energy-delay product by 49%

5
Evaluation Methodology

Benchmark circuits
Logic optimization (SIS)
Tech-mapping (RASP)
Timing-driven packing (T-VPack)
Cycle-accurate power simulator (Psim)
Placement and routing (VPR)
Outputs: area, delay

6
Impact of Device Tuning

All previous work considers only architecture
tuning
Device tuning leads to an 84X power difference and
a 12X delay difference
It is necessary to perform device tuning and
architecture tuning simultaneously

7
Challenge of Device and Architecture
Co-Optimization

We consider the following architecture and device
parameters during our co-optimization
Architecture parameters
Cluster size (N)
LUT size (K)
Device parameters
Supply voltage (Vdd)
Threshold voltage (Vt)
A hyper-architecture (hyper-arch) is a
combination of the device and architecture
parameters
The number of hyper-arch combinations is large
VPR and Psim are too slow to handle such a large
number of experiments
Fast yet accurate power and delay estimation is needed

8
Outline

Background and motivation
Trace-based power and delay estimation
Trace collection
Trace-based power and delay model
Accuracy and efficiency verification of the
trace-based estimator
Device and architecture co-optimization
Conclusion

9
Trace Collection

VPR and Psim generate the trace used by Ptrace
The trace contains
Circuit element statistics
Critical path structure
Short-circuit power ratio
Switching activity
Area

Assume the trace information remains the same
when the device setting changes

10
Trace-Based Estimation (Ptrace) Framework

Inputs to Ptrace
Device independent: trace
Device dependent: circuit-level delay and power
Output of Ptrace
Chip-level delay, power, and area

11
Outline

Background and motivation
Trace-based power and delay estimation
Trace collection
Trace-based power and delay model
Accuracy and efficiency verification of the
trace-based estimator
Device and architecture co-optimization
Conclusion

12
Delay Model in VPR

Delay is calculated for each path p as
$$D_p = \sum_i N_i^p D_i$$
where N_i^p is the number of type-i elements in path p and
D_i is the delay of a type-i element
Delay of the logic elements is measured by SPICE
simulation
Elmore delay is used for interconnect wire
segments
The critical path is the path with the longest delay

13
Delay in Ptrace

Obtain the path structure of a set of longest
circuit paths
Assume that when the device setting changes, the new
critical path is still among this set of longest
paths
Delay computation
$$D = \max_{p \in \text{longest paths}} \sum_i N_i^p D_i$$
N_i^p: trace information
D_i: device-dependent parameter

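The delay re-evaluation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the Ptrace code: the element type names, the element counts, and the per-element delays are all made-up assumptions, and each stored path is represented only by how many elements of each type it contains.

```python
def path_delay(element_counts, element_delay):
    """Sum of (number of type-i elements) * (delay of a type-i element)."""
    return sum(n * element_delay[kind] for kind, n in element_counts.items())


def critical_delay(longest_paths, element_delay):
    """Largest delay among the stored set of longest paths."""
    return max(path_delay(p, element_delay) for p in longest_paths)


# Trace information (device independent): element counts per stored path.
# These counts are hypothetical.
longest_paths = [
    {"lut": 4, "wire_segment": 10, "switch": 12},
    {"lut": 6, "wire_segment": 5, "switch": 6},
]

# Device-dependent parameters: per-element delays in ns (hypothetical values).
delays_a = {"lut": 0.5, "wire_segment": 0.25, "switch": 0.125}
delays_b = {"lut": 1.5, "wire_segment": 0.25, "switch": 0.125}

print(critical_delay(longest_paths, delays_a))  # first path is critical
print(critical_delay(longest_paths, delays_b))  # second path becomes critical
```

With the second delay set the slower LUT makes the LUT-heavy path critical, which is why a set of longest paths is kept rather than a single path.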
14
Dynamic Power Model

Psim
Switch power
Switching activity is measured by timing
simulation for each node
S_i is the average switching activity
Short-circuit power
α_sc is calculated for each node
Ptrace
Switch power
Short-circuit power
α_sc is the average short-circuit power ratio for
the whole circuit

Equation legend: trace information; device-dependent parameters

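The equations on this slide are not preserved in the transcript, so the following is only a sketch using the textbook CMOS switching-power form, 0.5 * f * Vdd^2 * sum(C_i * S_i), and it assumes the short-circuit power ratio is defined relative to switch power. Both of those choices, the function names, and all numbers are assumptions for illustration and may differ from the Psim and Ptrace models.

```python
def switch_power(freq_hz, vdd, nodes):
    """Textbook switching power in watts.

    nodes: iterable of (capacitance in farads, switching activity) pairs.
    """
    return 0.5 * freq_hz * vdd ** 2 * sum(c * s for c, s in nodes)


def dynamic_power(freq_hz, vdd, nodes, alpha_sc):
    """Switch power plus short-circuit power.

    Assumes short-circuit power = alpha_sc * switch power, with a single
    circuit-wide alpha_sc as in the Ptrace bullet.
    """
    p_sw = switch_power(freq_hz, vdd, nodes)
    return p_sw * (1.0 + alpha_sc)


# Hypothetical nodes: (capacitance, switching activity).
nodes = [(1e-12, 0.5), (2e-12, 0.25)]
print(switch_power(100e6, 1.0, nodes))
print(dynamic_power(100e6, 1.0, nodes, alpha_sc=0.1))
```

In this form the capacitances and Vdd change with the device setting, while the switching activities and the short-circuit ratio are the quantities that a trace would hold fixed.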
15
Static Power Model

Psim
Without power gating
With power gating
Ptrace
Without power gating
With power gating

Equation legend: trace information; device-dependent parameters

16
Outline

Background and motivation
Trace-based power and delay estimation
Trace collection
Trace-based power and delay model
Accuracy and efficiency verification of the
trace-based estimator
Device and architecture co-optimization
Conclusion

17
Experiment Setting

Collect the trace using ITRS 70nm technology, but
apply it to both 100nm and 70nm technologies
20 MCNC benchmarks
Assume each benchmark runs at its highest
possible frequency
Power and delay are computed as the geometric mean
over the 20 benchmarks
Evaluation range

Vdd (V)   Vt (V)    LUT size (K)  Cluster size (N)
0.8-1.1   0.2-0.4   3-7           6-12

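To give a feel for how quickly the hyper-arch space grows, the sketch below enumerates every combination in the evaluation range. The ranges come from the slide; the step sizes (0.1 V for Vdd, 0.05 V for Vt) and the choice to visit every integer K and N are assumptions, since the transcript does not state the sampling used.

```python
from itertools import product


def frange(start, stop, step):
    """Inclusive list of evenly spaced floats, rounded to avoid drift."""
    n = int(round((stop - start) / step))
    return [round(start + i * step, 3) for i in range(n + 1)]


# Ranges from the slide; step sizes are assumed for illustration.
VDD = frange(0.8, 1.1, 0.1)     # supply voltage (V)
VT = frange(0.2, 0.4, 0.05)     # threshold voltage (V)
LUT_SIZES = range(3, 8)         # K = 3..7
CLUSTER_SIZES = range(6, 13)    # N = 6..12

hyper_archs = list(product(VDD, VT, LUT_SIZES, CLUSTER_SIZES))
print(len(hyper_archs))  # 4 * 5 * 5 * 7 combinations
```

Even at this coarse sampling there are hundreds of combinations, and heterogeneous-Vt classes add a further Vt dimension on top.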
18
Accuracy

Average power error is 3.4%
Average delay error is 6.4%
The delay error arises because Ptrace ignores the impact
of path branches, which VPR considers

19
Runtime

VPR and Psim for one device setting:
five days on eight 1.2GHz Intel Xeon servers
Ptrace for 20 device settings:
80 seconds on one 1.2GHz Intel Xeon server
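The runtime figures above can be turned into a rough per-setting comparison. The arithmetic below assumes that "five days on eight servers" means five days of wall-clock time with all eight servers busy, and it ignores the one-time cost of collecting the trace; both are assumptions, so the result is an order-of-magnitude estimate only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def server_seconds(wall_seconds, servers):
    """Total machine time: wall-clock time multiplied by the server count."""
    return wall_seconds * servers


# VPR and Psim: five days on eight servers for ONE device setting.
vpr_psim_per_setting = server_seconds(5 * SECONDS_PER_DAY, servers=8)

# Ptrace: 80 seconds on one server for TWENTY device settings.
ptrace_per_setting = server_seconds(80, servers=1) / 20

speedup = vpr_psim_per_setting / ptrace_per_setting
print(vpr_psim_per_setting, ptrace_per_setting, speedup)
```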

20
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Energy and delay tradeoff
ED and area tradeoff
Comparison between classes
Comparison between device tuning and architecture
tuning
Conclusion

21
Architecture Classes to Be Evaluated

Hyper-architecture classes

Hyper-arch class  Vt
Homo-Vt           Homogeneous Vt
Hetero-Vt         Heterogeneous Vt
Homo-VtG          Homogeneous Vt with power gating
Hetero-VtG        Heterogeneous Vt with power gating

Baseline case
Vdd suggested by ITRS
Architecture same as Xilinx Virtex-II
Vt optimized by our method with respect to the
above architecture and Vdd

Vdd (V)  Vt (V)  LUT size (K)  Cluster size (N)
0.9      0.3     4             8

22
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Energy and delay tradeoff
ED and area tradeoff
Comparison between classes
Comparison between device tuning and architecture
tuning
Conclusion

23
Energy and Delay Tradeoff

Dominant hyper-arch
Hyper-arch B is inferior to A if A has less
energy and smaller delay than B.
Dominant hyper-archs (dom-arch) are the
hyper-archs that are NOT inferior to any other
hyper-archs.
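The dominance rule above is a two-objective Pareto filter, which the following sketch implements directly from the slide's wording. The (energy, delay) pairs are invented for illustration and the function names are not from the original tool.

```python
def is_inferior(b, a):
    """True if hyper-arch a has less energy AND smaller delay than b.

    Each hyper-arch is represented as an (energy, delay) pair.
    """
    return a[0] < b[0] and a[1] < b[1]


def dominant_hyper_archs(points):
    """Keep the points that are not inferior to any other point."""
    return [b for b in points if not any(is_inferior(b, a) for a in points)]


# Hypothetical (energy in nJ, delay in ns) pairs.
candidates = [(1.0, 20.0), (1.2, 18.0), (1.3, 21.0), (0.9, 25.0)]
print(dominant_hyper_archs(candidates))
```

Here (1.3, 21.0) is dropped because (1.0, 20.0) beats it on both energy and delay; the other three points each win on at least one axis and so remain dom-archs.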

24
Energy and Delay Tradeoff

Hetero-Vt can reduce power
Power gating reduces more leakage power than
hetero-Vt
Hetero-Vt has less impact when power gating is
applied

25
Min-ED Hyper-Arch

Hyper-arch class  Vdd (V)  CVt (V)  IVt (V)  (N, K)  ED (nJ·ns)  ED reduction
Baseline          0.9      0.30     0.30     (8,4)   26.9        -
Homo-Vt           0.9      0.30     0.30     (6,7)   23.3        13.4%
Hetero-Vt         0.9      0.20     0.25     (8,4)   21.4        20.5%
Homo-VtG          0.9      0.25     0.25     (12,4)  11.1        58.9%
Hetero-VtG        0.9      0.20     0.25     (8,4)   11.0        59.0%

To achieve the best energy and delay tradeoff, we
find the hyper-arch with the minimum energy-delay
product (ED)
Compared to the baseline, the min-ED hyper-arch
of the conventional FPGA (Homo-Vt) reduces ED by
13.4%
For the Hetero-Vt class, ED is reduced by 20.5%
If power gating is applied, ED can be reduced by
up to 59.0%
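As a small worked example, the ED reduction column can be recomputed from the ED values in the table. The helper below is only illustrative; because it starts from the rounded ED values shown on the slide, the recomputed percentages can differ from the slide's figures by a few tenths of a point.

```python
BASELINE_ED = 26.9  # nJ*ns, from the table

# Min-ED value per class, from the table (nJ*ns).
min_ed = {
    "Homo-Vt": 23.3,
    "Hetero-Vt": 21.4,
    "Homo-VtG": 11.1,
    "Hetero-VtG": 11.0,
}


def ed_reduction_percent(baseline_ed, ed):
    """Relative ED reduction versus the baseline, in percent."""
    return 100.0 * (baseline_ed - ed) / baseline_ed


for name, ed in min_ed.items():
    print(name, round(ed_reduction_percent(BASELINE_ED, ed), 1))

# Class whose min-ED hyper-arch has the smallest ED overall.
best_class = min(min_ed, key=min_ed.get)
print(best_class)
```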

26
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Energy and delay tradeoff
ED and area tradeoff
Comparison between classes
Comparison between device tuning and architecture
tuning
Conclusion

27
ED and Area Tradeoff

Architecture tuning has a great impact on area
To achieve the best area and ED tradeoff, we find
the hyper-arch with the minimum product of area,
energy and delay (AED)

28
ED and Area Tradeoff for Classes without Power Gating

Compared to the min-ED hyper-arch, the min-AED
hyper-arch significantly reduces area with a small
ED increase

29
Sleep Transistor Size Tuning

When power gating is applied, sleep transistors
may increase area
The larger the sleep transistor, the smaller
the delay
Sleep transistor size tuning
Area overhead introduced by the sleep transistors of
logic blocks is negligible
We consider 2X, 4X, 7X, and 10X PMOS sleep
transistors for the switch buffers

30
ED and Area Tradeoff for Classes with Power Gating

The area reduction achieved by device and
architecture co-optimization compensates for the
area overhead introduced by sleep transistors

31
Min-AED Hyper-Arch

Hyper-arch class  Vdd (V)  CVt (V)  IVt (V)  (N, K)  Sleep transistor size  ED (nJ·ns)  Normalized area  AED reduction
Baseline          0.9      0.30     0.30     (8,4)   -                      26.9        1.00             -
Homo-Vt           1.0      0.30     0.30     (6,4)   -                      23.6        0.80             30.0%
Hetero-Vt         0.9      0.30     0.25     (12,4)  -                      21.3        0.77             40.0%
Homo-VtG          0.9      0.25     0.25     (12,4)  2X                     12.4        0.92             57.6%
Hetero-VtG        0.9      0.20     0.25     (12,4)  2X                     12.2        0.92             58.3%

Compared to the baseline, the min-AED hyper-arch
in the conventional FPGA class can reduce area by
20% and ED by 12.3%
In the Hetero-Vt class, ED is reduced by 20.8%
and area is reduced by 23% compared to the
baseline
If power gating is applied, ED is reduced by
54.6% and area is reduced by 8.3%
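The AED metric can be checked against the table in the same way: AED is taken here as ED multiplied by the normalized area, which follows from the definition of AED as the product of area, energy, and delay. The row labels below are descriptive keys chosen for this sketch, and since the inputs are the rounded table values, some recomputed percentages differ slightly from the slide.

```python
# (ED in nJ*ns, normalized area) per row of the table.
rows = {
    "Baseline": (26.9, 1.00),
    "Homo-Vt": (23.6, 0.80),
    "Hetero-Vt": (21.3, 0.77),
    "Power gating, CVt=0.25": (12.4, 0.92),
    "Power gating, CVt=0.20": (12.2, 0.92),
}


def aed(ed, normalized_area):
    """Area-energy-delay product, using area normalized to the baseline."""
    return ed * normalized_area


def aed_reduction_percent(row, baseline=rows["Baseline"]):
    """Relative AED reduction of a row versus the baseline, in percent."""
    return 100.0 * (1.0 - aed(*row) / aed(*baseline))


for name, row in rows.items():
    print(name, round(aed_reduction_percent(row), 1))
```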

32
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Energy and delay tradeoff
ED and area tradeoff
Comparison between classes
Comparison between device tuning and architecture
tuning
Conclusion

33
Comparison Between Classes in Similar Performance Range

Homo-Vt
Vdd (V)  Vt (V)  (N, K)  E (nJ)  D (ns)  ED (nJ·ns)
0.9      0.30    (6,6)   1.33    18.6    24.8
0.9      0.30    (10,5)  1.27    19.8    25.0
0.9      0.30    (6,4)   1.23    21.6    26.5

Hetero-Vt
Vdd (V)  CVt (V)  IVt (V)  (N, K)  E (nJ)  D (ns)  ED (nJ·ns)
0.9      0.30     0.35     (6,4)   1.16    20.1    23.3
0.9      0.30     0.35     (12,4)  1.14    20.5    23.7
0.9      0.30     0.35     (8,4)   1.09    22.1    24.1

Homo-VtG
Vdd (V)  Vt (V)  (N, K)  E (nJ)  D (ns)  ED (nJ·ns)
0.8      0.25    (10,5)  0.70    19.4    13.7
0.8      0.25    (8,4)   0.62    20.9    12.9
0.8      0.25    (12,4)  0.62    21.0    12.9

Hetero-VtG
Vdd (V)  CVt (V)  IVt (V)  (N, K)  E (nJ)  D (ns)  ED (nJ·ns)
0.9      0.25     0.30     (12,4)  0.66    18.9    12.5
0.8      0.25     0.25     (8,4)   0.62    20.9    12.9
0.8      0.25     0.25     (12,4)  0.62    21.0    12.9

Vt for logic blocks is lower than Vt for
interconnect
Vt is lower for classes with power gating

34
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Energy and delay tradeoff
ED and area tradeoff
Comparison between classes
Comparison between device tuning and architecture
tuning
Conclusion

35
Dom-Archs under Different Device Settings

For a given device setting, architecture tuning
changes delay and energy only within a small range
Device tuning has a much larger impact on delay and
energy

36
Outline

Background and motivation
Trace-based power and delay estimation
Device and architecture co-optimization
Conclusion

37
Conclusion and Discussion

The trace-based estimator provides efficient and
accurate FPGA power and delay estimation
Average power error is 3.4% and average delay
error is 6.1%
Device and architecture co-optimization reduces
ED by 20.5% and area by 23.3% when there is no
power gating
With power gating, device and architecture
co-optimization reduces ED by 54.6% and area by
8.3%
Device tuning has a more significant impact on
delay and power than architecture tuning does
In recent research, Ptrace has been extended to
consider leakage and timing yield under process
variations