Title: Power Emulation: A New Paradigm for Power Estimation
1Power Emulation: A New Paradigm for Power Estimation
- Joel Coburn, Srivaths Ravi, Anand Raghunathan
NEC Laboratories America, Inc., 4 Independence Way, Princeton, NJ 08540
2Power Emulation
Power estimation flow
Circuit Simulation
Component input statistics
Power model evaluation for each circuit component
Aggregate power consumption of individual
components
Power profile
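The estimation flow on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's implementation: the component names, the toggle-count input statistic, and the linear per-toggle coefficients are all hypothetical.

```python
from typing import Dict, List


def toggles(prev: int, curr: int) -> int:
    """Number of bits that changed between two consecutive input values."""
    return bin(prev ^ curr).count("1")


def estimate_power(trace: Dict[str, List[int]], coeff: Dict[str, float]) -> List[float]:
    """Return a per-cycle power profile aggregated over all components.

    trace: simulated input value of each component in every cycle (hypothetical data).
    coeff: assumed power cost per input toggle for each component.
    """
    n_cycles = len(next(iter(trace.values())))
    profile = []
    for t in range(1, n_cycles):
        total = 0.0
        for name, values in trace.items():
            stat = toggles(values[t - 1], values[t])  # component input statistic
            total += coeff[name] * stat               # power model evaluation
        profile.append(total)                         # aggregate over components
    return profile


trace = {"adder": [0b0000, 0b1111, 0b1111], "reg": [0b01, 0b10, 0b11]}
coeff = {"adder": 0.5, "reg": 0.25}
print(estimate_power(trace, coeff))  # [2.5, 0.25]
```

Each loop level corresponds to one box of the flow: simulation supplies the trace, the statistic is extracted per component, the model is evaluated, and the results are summed into the profile.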
3Outline
- Conventional power estimation
- Motivation for power emulation
- Power emulation techniques
- Proposed methodology
- Results
4How Power Estimation is Addressed
Levels of the design flow, with the power models and analysis used at each level
- System-level design: power models for system-level components; system-level power analysis
- High-level synthesis, RTL optimizations: power models for macroblocks and control logic; architecture-level power analysis
- Logic synthesis: power models for gates, cells, and nets; logic-level power analysis
- Transistor-level/Layout synthesis: transistor models, wire models; transistor-level power analysis
Figure: Estimation Time (seconds to days) vs. Accuracy (axis ticks 5 and 30) for the System, Algorithm, RTL, Logic, Transistor, and Layout levels. Annotation: Good speed/accuracy trade-off.
5RTL Power Estimation
- Analytical techniques
  - Correlate power consumption to simple measures of design complexity
  - Use gate count and user-specified activity factors (Glaser-91)
  - Useful for regular structures (memories and clock networks) (Liu-94)
  - Better accuracy with information-theoretic approaches (Marculescu-95, Nemani-99)
- Characterization-based macromodeling
  - Characterize a lower-level implementation of an RTL block
  - Characterize as a constant power value (Powell-90)
  - Characterize as a function of input signal statistics (Landman-95, Raghunathan-96, Mehta-96, Benini-96, Gupta-00)
  - Address training data bias, fitting data errors (Bogliolo-98, Corgnati-99)
6RTL Power Estimation
- Fast synthesis based power estimation
  - Map the design through low-effort synthesis to a netlist for power estimation (Llopis-98)
- Speedup techniques
  - Statistical sampling (Ravi-03)
  - Circuit partitioning for parallel mixed-level simulation (Chinosi-99)
- Commercial/In-house Tools
  - PowerTheater (Sequence Design)
  - PowerChecker (BullDAST)
  - CYBER RTL Power Estimation (NEC)
  - Orinoco (ChipVision)
7Power Emulation Technology
- RTL power estimation is too slow for large designs
- New paradigm for power estimation!
  - Use emulation to accelerate RTL power estimation
  - 2 to 3 orders of magnitude speedup possible!
Figure: time to decode 4 video frames.
Figure: emulation setup. Labels: Host PC, FPGA platform, Testbench, Outputs, Power.
8NEC's RTL Power Estimation Flow
Characterization-based macromodeling
Figure: flow diagram with two parts.
- Characterization flow: RTL library and synthesis conditions; synthesis and P&R; post-layout netlists; power characterization; power macro-model database; power model library generator; simulatable power model libraries (Powerlib.vhd, Powerlib.v, Powerlib.c)
- NEC's C-based design flow (CYBER): behavior; behavioral synthesis; RTL netlist; power model inference and estimation code generation; enhanced RTL; RTL simulation with testbench/stimuli (input, output, power); power profiles
9Enhanced RTL Design
- Components for power estimation
  - Power models for every component: monitor component I/O values and compute power
  - Power strobe generator: trigger power models (statistical sampling is employed for improved efficiency, since RTL simulation can also be slow for large designs)
  - Power aggregator: compute total power consumption
Figure: example enhanced RTL design. Labels: controller FSM; functional units (1, -1, >> 1); registers (reg_c0, reg_c1, reg_first, reg_last, reg_mid, reg_out); Bus 1, Bus 2, Bus 3; signals first, last, value, data, addr, out. A power model is attached to each component; the power strobe generator and the power aggregator are added alongside, and the aggregator outputs the total power.
10Power Model Architecture
Queues
Power summation
Component Inputs/Outputs
Transition count function
- What does the power model contain?
- Queues to store present and past values
- Transition count function is a simple computation
- Coefficients aggregated based on output of
transition count function
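The three parts of the power model listed above (queues, transition count function, coefficient summation) can be mimicked in software. The sketch below is illustrative only: the class name, the two-entry queue depth, and the one-coefficient-per-bit layout are assumptions, not details taken from the slides.

```python
from collections import deque
from typing import List


class PowerModel:
    """Toy macro-model: queue of I/O values, transition count, coefficient sum."""

    def __init__(self, coefficients: List[float]):
        # One assumed coefficient per monitored I/O bit.
        self.coefficients = coefficients
        # Two-entry queue holding the past and present I/O values.
        self.queue = deque([0, 0], maxlen=2)
        self.energy = 0.0

    def transition_bits(self) -> List[int]:
        """Transition count function: 1 for every bit that toggled."""
        diff = self.queue[0] ^ self.queue[1]
        return [(diff >> i) & 1 for i in range(len(self.coefficients))]

    def strobe(self, io_value: int) -> float:
        """Sample the component I/O and add the coefficients of toggled bits."""
        self.queue.append(io_value)
        bits = self.transition_bits()
        sample = sum(c for c, bit in zip(self.coefficients, bits) if bit)
        self.energy += sample
        return sample


model = PowerModel([1.0, 2.0, 4.0, 8.0])
print([model.strobe(v) for v in (0b0001, 0b0011, 0b1000)])  # [1.0, 2.0, 11.0]
```

In hardware the same structure maps to registers (the queue), XOR gates (the transition detector), and an adder tree fed by a coefficient store.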
11Power Emulation Challenges
- Size of design enhanced with power models is very large!
  - Size increases an average of 18.2X for MPEG4 sub-designs
  - Enhanced version exceeds capacity of largest Xilinx Virtex-II FPGA
Need to reduce the area requirements of power models!
Figure: size increase of each sub-design against the capacity of the XC2V8000 FPGA. Bar labels: 20.4X, 20.6X, 17.7X, 16.3X, 17.5X, 14.7X, 15.0X.
12Power Emulation Challenges
- Why area increase?
  - Resource-hungry power models used for every RTL component in the design
- How to reduce area?
  - Optimize the number of power models used
  - Make the implementations of power models resource-efficient
- Catch: ensure minimum loss of estimation accuracy due to area reduction techniques
13Area Optimization Techniques
- Clustering of power models
  - Single power model servicing multiple components
- Changing component granularity
  - Constructing power models for complex components that subsume several smaller components
- Exploiting correlation
  - Using power correlation between components to reduce the number of monitored components
- Optimizing power model implementations
  - Multi-cycling additions in power model computations
  - Using FPGA block memories for efficient storage of power model coefficients
14Clustering
- Construct a generic power model that is responsible for a cluster of (say, M) components
- Hardware must be added to support multiple components
  - Multiplexers for component I/Os and coefficient addresses
- In a given cycle, a generic power model can monitor only one component
  - Similar to sampling previously used for power estimation
  - Can cause estimation error
- But the maximum number of I/O pins from serviced components determines power model bitwidth
  - Extra bits are wasted for some components
  - Zero padding used for coefficients and I/O bits of components with bitwidth < N
- Clustering saves area because M dedicated power models are collapsed into a single power model
  - Queues and adders are shared
Figure: N-bit generic power model, where N = max(I/O) pins among serviced components. Labels: component inputs Comp_1, Comp_2, ..., Comp_N feeding M:1 multiplexers; component selector (sel_comp, log2 M bits) that changes to monitor a different component; POW_STROBE; clk, CLOCK; Power output.
Figure: coefficient ROM with hardwired component addresses (e.g., Comp_1 0000, Comp_2 0001, ..., Comp_N 1011). Labels: coefficient address (addr, K bits), coefficients (dout, N × coeff_width bits), ultra-wide memory for max coeff. BW.
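A small software model makes the sampling effect of clustering concrete. The sketch below is an assumption-laden illustration: the round-robin component selector, the scaling of each sample by M, and the availability of the selected component's previous value are choices made here for simplicity, not details confirmed by the slides.

```python
from typing import Dict, List


def pad(coeffs: List[float], width: int) -> List[float]:
    """Zero-pad a coefficient row to the width N of the generic power model."""
    return coeffs + [0.0] * (width - len(coeffs))


def toggled_cost(row: List[float], prev: int, curr: int) -> float:
    """Sum the coefficients of the bits that toggled between two values."""
    diff = prev ^ curr
    return sum(c for i, c in enumerate(row) if (diff >> i) & 1)


def exact_energy(traces: Dict[str, List[int]], coeffs: Dict[str, List[float]]) -> float:
    """Reference: one dedicated power model per component, evaluated every cycle."""
    total = 0.0
    for name, values in traces.items():
        for t in range(1, len(values)):
            total += toggled_cost(coeffs[name], values[t - 1], values[t])
    return total


def clustered_energy(traces: Dict[str, List[int]], coeffs: Dict[str, List[float]]) -> float:
    """One shared power model; only one component is monitored per cycle."""
    names = sorted(traces)
    m = len(names)
    width = max(len(c) for c in coeffs.values())      # N = widest serviced component
    rom = {n: pad(coeffs[n], width) for n in names}   # coefficient ROM rows
    total = 0.0
    for t in range(1, len(traces[names[0]])):
        sel = names[t % m]                            # assumed round-robin selector
        sample = toggled_cost(rom[sel], traces[sel][t - 1], traces[sel][t])
        total += m * sample                           # scale for unmonitored cycles
    return total


traces = {"a": [0, 1, 0, 1, 0], "b": [0, 3, 3, 0, 0]}
coeffs = {"a": [1.0], "b": [1.0, 1.0]}
print(exact_energy(traces, coeffs), clustered_energy(traces, coeffs))  # 8.0 12.0
```

In this toy trace the shared model happens to sample component b only in the cycles where it toggles, so the estimate overshoots; this is the kind of sampling error the slide warns about.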
15Clustering Area/Accuracy Trade-offs
- Estimation error increases with larger clusters
- Area first decreases and then increases with more clustering
  - Multiplexer and select logic area dominate for large clusters
Figure: area vs. error trade-offs (curves: Area, Estimate Error).
16Changing Component Granularity
- Hardware overhead for power models depends on granularity of RTL components
  - High overhead for small-granularity components (e.g., logic gates)
- Increase component granularity for power modeling
  - Estimation error increases, since internal signals are not visible to the power model
Figure: error vs. component granularity.
17Exploiting Correlations Between Component Power Consumptions
- Given two components x and y, approximate Power(y) as f( Power(x) )
  - Power correlations occur due to internal circuit structure, e.g., fanout, logic replication
  - No power model needed for y
  - f( ) should be a simple function
Figure panels: Non-linear correlation, Strong linear correlation, Weak linear correlation.
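One simple choice for f( ) is a straight line fitted to short power profiles of x and y. The sketch below assumes a least-squares linear fit and uses made-up profile values; the slides only require that f( ) be simple, so the linear form is an assumption.

```python
from typing import List, Tuple


def linear_fit(x: List[float], y: List[float]) -> Tuple[float, float, float]:
    """Least-squares fit y ~ a*x + b; also returns the Pearson correlation r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    a = sxy / sxx
    b = my - a * mx
    r = sxy / (sxx * syy) ** 0.5
    return a, b, r


# Hypothetical per-sample power values of two components.
power_x = [1.0, 2.0, 3.0, 4.0]
power_y = [2.1, 3.9, 6.1, 7.9]

a, b, r = linear_fit(power_x, power_y)
print(f"Power(y) ~ {a:.2f} * Power(x) + {b:.2f}, r = {r:.3f}")
```

If r is close to 1, the power model for y can be dropped and its power derived from the model of x using a and b; a weak or non-linear relationship would call for keeping a dedicated model.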
18Power Emulation Flow
Figure: flow diagram with numbered steps (0 to 4). Blocks: RTL design; Power model library; Resource sharing; Optimized power model library; Power model inference and estimation code generation; Optimize for area and minimize error; FPGA synthesis, P&R; Testbench; Download to FPGA and execute; Power profile.
19Optimization Methodology
Figure: flowchart of the optimization methodology.
- Inputs: enhanced RTL design, testbench, power model library, parameters target_area and k
- Apply hierarchical clustering with an area-based objective function (target_area, k) to determine the mapping of component groups to generic power models; this yields k valid solutions
- Run a short RTL simulation to generate component power profiles
- Compute the mean and variance of component power and the inter-component power correlation factor
- Determine the optimum sampling rate for each component
- For all k solutions: multi-way component swapping to minimize undersampling
- Choose the solution with the lowest undersampling
- Use inter-component power profile correlations to collapse the component list
- (i) Select component sets suitable for higher-granularity power models; (ii) update the power model library and design
- Output: power emulation ready RTL
20Experimental Procedure
Figure: experimental flow.
- Front end: design; Behavioral Synthesis (CYBER); RTL description; RTL power library; Power-enhanced RTL
- RTL Power Estimation: ModelSim; Sequence Design PowerTheater
- Power Emulation: synthesis transformation and area optimizations; create ROM init files; build coefficient ROMs (Xilinx CoreGenerator); power emulation-ready RTL; FPGA synthesis (Synplify Pro); place & route and FPGA configuration (Xilinx ISE), targeting Xilinx Virtex-II FPGA
21Experimental Results
- Evaluation on various designs, each compared with CYBER-RTL and PowerTheater
  - Up to 500X speedup compared to RTL power estimation
  - 3% loss of accuracy on average
  - Area overheads lowered to 3X
Nearly 500X speedup possible!
22Conclusions
- Power Emulation is a promising way to perform fast power estimation
  - Extends capabilities of current power estimation tools
23Thank you! Questions?
24An Example Correlation Distribution
- Components exhibiting a correlation coefficient above a threshold value (say 0.5) can be grouped together and replaced with one scaled power model
- RTL component example from Bubble Sort design
  - Contributes 1.04% of total estimated power
  - 36 components with correlation coefficient > 0.5
25Resource Sharing
- Estimation error decreases with more adders
- Area first decreases and then increases with more adders
  - Scheduling overhead dramatically affects area
  - Adders contribute less area because of dedicated carry-chains in FPGA architecture
Figure: area vs. error trade-offs (curves: Area, Estimate Error).
26Hierarchical Clustering
- Inputs: list of components, target area constraint
- Outputs: k valid solutions that meet the target area constraint
- Initially, every component forms a cluster with its own power model
- Look at pairwise cost of combining two clusters into a single cluster
- Choose the pair of clusters that combine to result in the best area savings
- Update the bitwidth of the resultant cluster
- Repeat the previous steps until k solutions meet the target area constraint or all components are in a single cluster
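The greedy merging loop described above can be sketched as follows. The area model in `cluster_area` is entirely hypothetical (a width-proportional model cost plus a multiplexer cost per extra component), and for brevity the sketch stops at the first solution that meets the target instead of collecting k solutions.

```python
from itertools import combinations
from typing import Dict, List


def cluster_area(bitwidths: List[int]) -> int:
    """Hypothetical area model: logic scales with the widest member,
    plus multiplexer cost for every extra component serviced."""
    width = max(bitwidths)  # bitwidth of the resulting cluster
    return 10 * width + 2 * width * (len(bitwidths) - 1)


def hierarchical_clustering(components: Dict[str, int], target_area: int) -> List[List[str]]:
    """Merge clusters greedily until the total area meets target_area."""

    def area(cluster: List[str]) -> int:
        return cluster_area([components[n] for n in cluster])

    def saving(pair) -> int:
        # Pairwise cost: area saved by combining two clusters into one.
        a, b = clusters[pair[0]], clusters[pair[1]]
        return area(a) + area(b) - area(a + b)

    clusters = [[name] for name in components]  # one power model per component
    while len(clusters) > 1 and sum(map(area, clusters)) > target_area:
        i, j = max(combinations(range(len(clusters)), 2), key=saving)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters


# Component name -> I/O bitwidth (made-up values).
components = {"a": 8, "b": 8, "c": 32, "d": 4}
print(hierarchical_clustering(components, target_area=450))
```

With these numbers the narrow components are merged first, while the 32-bit component keeps its own model because sharing it would force every member onto a 32-bit generic model.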
27Determining Optimum Component Sampling Rates
- Observation: components whose power consumption characteristics are associated with a higher mean or variance must be sampled more frequently
- Objective: minimize the aggregate error due to sampling
Equations: aggregate error; component weights; minimization constraints
Use solver to get values of N
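The slide's equations are not reproduced here, so the following sketch substitutes an assumed error model purely for illustration: aggregate error is taken as the sum of w_i / N_i, where w_i is a component weight derived from mean or variance and N_i is the number of samples given to component i, subject to a fixed total sample budget. Under that assumption no numerical solver is needed, because the Lagrangian condition gives N_i proportional to sqrt(w_i).

```python
from math import sqrt
from typing import Dict


def sampling_allocation(weights: Dict[str, float], budget: float) -> Dict[str, float]:
    """Minimize sum(w_i / N_i) subject to sum(N_i) == budget (assumed model).

    Setting the derivative of the Lagrangian to zero gives
    N_i = budget * sqrt(w_i) / sum_j sqrt(w_j).
    """
    norm = sum(sqrt(w) for w in weights.values())
    return {name: budget * sqrt(w) / norm for name, w in weights.items()}


# Hypothetical component weights (e.g., based on power mean or variance).
weights = {"mult": 9.0, "adder": 4.0, "reg": 1.0}
print(sampling_allocation(weights, 60))  # mult: 30, adder: 20, reg: 10
```

The result matches the observation on the slide: components with larger weights receive more samples, although the exact proportions depend on the error model chosen.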
28Minimizing Undersampling
- Estimation error introduced for a component is computed by finding its distance from the optimum sampling rate
- Aggregate undersampling for the present clustering solution
  - Minimize by using an iterative improvement algorithm based on the Kernighan-Lin heuristic
  - Move components to other clusters to reduce undersampling while ensuring that the target area constraint is not violated
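A much-simplified version of this improvement loop is sketched below. It is a greedy best-single-move search, not the full Kernighan-Lin procedure, and it assumes that each member of a cluster of size M is sampled 1/M of the time; the `area` callback and all numbers are hypothetical.

```python
from typing import Callable, Dict, List

Clusters = List[List[str]]


def undersampling(clusters: Clusters, optimum: Dict[str, float]) -> float:
    """Aggregate shortfall between optimum and actual sampling rates,
    assuming round-robin sharing (rate 1/M in a cluster of M components)."""
    return sum(max(0.0, optimum[c] - 1.0 / len(cl)) for cl in clusters for c in cl)


def improve(clusters: Clusters, optimum: Dict[str, float],
            area: Callable[[Clusters], float], target_area: float) -> Clusters:
    """Repeatedly apply the best single-component move that lowers undersampling."""
    clusters = [list(c) for c in clusters]
    while True:
        best_cost, best_move = undersampling(clusters, optimum), None
        for s, src in enumerate(clusters):
            if len(src) == 1:
                continue  # keep the number of clusters fixed
            for comp in src:
                for d in range(len(clusters)):
                    if d == s:
                        continue
                    trial = [list(c) for c in clusters]
                    trial[s].remove(comp)
                    trial[d].append(comp)
                    cost = undersampling(trial, optimum)
                    # Accept only moves that respect the area constraint.
                    if cost < best_cost - 1e-12 and area(trial) <= target_area:
                        best_cost, best_move = cost, trial
        if best_move is None:
            return clusters
        clusters = best_move


optimum = {"a": 0.5, "b": 0.1, "c": 0.1, "d": 0.2}
start = [["a", "b", "c"], ["d"]]
result = improve(start, optimum, area=lambda cl: 0.0, target_area=1.0)
print(result, undersampling(result, optimum))
```

Here component a needs to be sampled in half of the cycles but shares a model with two others, so moving one component to the second cluster removes the shortfall.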