Title: The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool
1The Design and Use of SimplePower: A Cycle-Accurate Energy Estimation Tool
- W. Ye, N. Vijaykrishnan, M. Kandemir, M. J. Irwin
- Microsystems Design Lab
- The Pennsylvania State University
2Why Power Matters
- Packaging costs, cooling costs
- Power supply rail design
- Digital noise immunity
- Battery life (in portable systems)
- Environmental concerns
- Office equipment accounted for 5% of total US commercial energy usage in 1993
- Energy Star compliant systems
3Motivation
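- Abstraction levels, highest to lowest: Behavior (System), Architectural (RTL), Logic (Gate), Transistor (Switch)
- Behavior (System) level: most analysis capacity, worst analysis accuracy, fastest analysis speed, least analysis resources, most power savings
- Transistor (Switch) level: least analysis capacity, best analysis accuracy, slowest analysis speed, most analysis resources, least power savings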
4Architectural Level Analysis Considerations
- Very computationally efficient
- Requires predefined analytical and transition-sensitive energy characterization models
- Simulation based, so it can be used to support architectural, compiler, operating system, and application level experimentation
- WattWatcher (Sente), DesignPower and PowerCompiler (Synopsys), prototype academic tools (Wattch - Princeton, Avalanche - Princeton/NEC)
5SimplePower Framework
6Functional Unit Characterization
- Transition-sensitive energy models: energy tables for adders and register files
- Transition-aware energy models: system-level interconnect
- Analytical energy models: cache and main memory
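A transition-sensitive energy table can be read as a lookup keyed on the previous and current input vectors of a unit. The sketch below is illustrative only: the 2-bit unit, the `build_toy_table` helper, and the table values are hypothetical stand-ins (a real table would come from circuit-level characterization), and the 0.5·C·Vdd² energy formula and the 3.3 V supply are assumptions, not values from the slides.

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def build_toy_table(width, cap_per_toggle=1e-13):
    """Hypothetical switch-capacitance table with one row per (prev, curr) pair.
    Placeholder values: capacitance proportional to the number of toggled bits."""
    size = 1 << width
    return {(p, c): hamming(p, c) * cap_per_toggle
            for p in range(size) for c in range(size)}

def unit_energy(table, inputs, vdd=3.3):
    """Accumulate energy cycle by cycle from consecutive input vectors."""
    total = 0.0
    for prev, curr in zip(inputs, inputs[1:]):
        total += 0.5 * table[(prev, curr)] * vdd ** 2  # assumed energy formula
    return total

table = build_toy_table(width=2)
print(len(table))  # 16 rows for a 2-bit unit
energy = unit_energy(table, [0b00, 0b11, 0b10, 0b10])
print(energy)
```

The table grows as the square of the number of input vectors, which is what motivates the compression and partitioning techniques on the later slides.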
7Switch Capacitance Table
8Table Compression
- Problem
- Results in large uncompressed table (e.g., 16-bit adder → 2^32 rows)
- Excessive simulation (e.g., 2^32 simulations!)
- Existing Solutions
- Clustering Algorithm
- Reference: Huzefa Mehta et al., "Module Energy Characterization using Clustering," DAC '96
- For a 16-bit adder, to keep 12% average error → 1000 simulation points, 97 rows
- Becomes complex for larger input widths to maintain accuracy after clustering
9Modeling Solutions
- Partitioning of unit into smaller sub-modules
- Modeling adders, multipliers, shifters, register files
- Bit dependent: decoder
- Bit independent: each bit transition can be considered independently
- Pipeline registers, logic unit in ALU
- Analytical Modeling
- Structure is too large or complicated
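For a bit-independent unit such as a pipeline register, one way to read the slide is that a per-bit energy is characterized once and then summed over the bits of the word. The sketch below assumes this interpretation; the function name and the per-bit energies `e_toggle` and `e_hold` are hypothetical placeholders, not characterized values.

```python
def bit_independent_energy(prev, curr, width, e_toggle, e_hold=0.0):
    """Energy of one cycle for a bit-independent unit.

    e_toggle and e_hold are hypothetical per-bit energies for a bit that
    changes and a bit that keeps its value, respectively.
    """
    mask = (1 << width) - 1
    toggled = bin((prev ^ curr) & mask).count("1")
    return toggled * e_toggle + (width - toggled) * e_hold

# 0x0000FFFF -> 0x00FF00FF toggles 16 of the 32 bits.
print(bit_independent_energy(0x0000FFFF, 0x00FF00FF, width=32, e_toggle=1.0))  # 16.0
```

Under this assumption the characterization needs only a handful of per-bit entries, regardless of the register width.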
10Register File
[Figure: register file block diagram with 5:32 write decoders, write data drivers, a 32 x 32 cell array, 5:32 read decoders, word line drivers, and read sense amps; each block is marked as bit independent or bit dependent]
11Decoder Characterization
- A 5:32 decoder → 2^10-row table!
- Build the 5:32 decoder out of smaller decoders
- A 2:4 decoder (with enable) → 2^6-row table
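The row counts on this slide are consistent with a table that holds one row per (previous input, current input) pair, so a unit with n input bits needs 2^(2n) rows. A minimal sketch of that arithmetic, assuming this interpretation (the `table_rows` helper is hypothetical):

```python
def table_rows(num_inputs):
    """Rows in a transition-sensitive table: one per (previous, current) input pair."""
    return 2 ** (2 * num_inputs)

monolithic = table_rows(5)   # 5:32 decoder, 5 address inputs
small = table_rows(2 + 1)    # 2:4 decoder, 2 address inputs plus enable
print(monolithic, small)     # 1024 64
```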
12Analytical Energy Model Example
- On-chip cache [Kamble], off-chip memory [Shiue]
- Energy $= E_{bus} + E_{cell} + E_{pad} + E_{main}$
- $E_{cell} = \beta \cdot wl\_length \cdot (bl\_length + 4.8) \cdot (N_{hit} + 2 N_{miss})$
- $wl\_length = m \, (T + 8L + S_t)$
- $bl\_length = C / (m L)$
- $N_{hit}$: number of hits; $N_{miss}$: number of misses; $C$: cache size; $L$: cache line size in bytes; $m$: set associativity; $T$: tag size in bits; $S_t$: number of status bits per line; $\beta = 1.44 \times 10^{-14}$ (technology parameter)
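The cell-energy term of the analytical model can be evaluated directly from the cache parameters. The sketch below transcribes the formula from this slide; the function name, the example cache configuration, and the hit and miss counts are hypothetical, and the slide does not state the units of the result.

```python
BETA = 1.44e-14  # technology parameter given on the slide

def cell_energy(C, L, m, T, St, n_hit, n_miss, beta=BETA):
    """E_cell = beta * wl_length * (bl_length + 4.8) * (N_hit + 2 * N_miss)."""
    wl_length = m * (T + 8 * L + St)   # word line length
    bl_length = C / (m * L)            # bit line length
    return beta * wl_length * (bl_length + 4.8) * (n_hit + 2 * n_miss)

# Hypothetical configuration: 8 KB, 32-byte lines, 2-way, 20 tag bits, 2 status bits.
e_cell = cell_energy(C=8192, L=32, m=2, T=20, St=2, n_hit=1000, n_miss=10)
print(f"{e_cell:.3e}")
```

Because misses are weighted twice as heavily as hits in this term, a configuration change that lowers the miss count can reduce the estimate even when the array itself grows.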
13Validation of Energy Model
[Figure: HSPICE power consumption vs. estimated power consumption]
14SimplePower Design Summary
- Supports Integer Instruction Set of SimpleScalar
- Models On-Chip Caches and Off-Chip Memory along with buses
- Provides cycle-accurate energy information across different system components
- Does not account for clock generation and distribution circuitry
- Computationally efficient
- Register file takes 0.1 second for each input sequence, as opposed to 9 minutes for the HSPICE simulation
15The Use of SimplePower
- Can reuse the technology-based files to evaluate other architectures
- Number and type of Functional Units
- Study Architectural Modifications and Optimizations
- Number of pipeline stages
- Gated-pipelining
- Study Influence of Software
- High-level Algorithmic Choices
- High-level Compiler Optimizations
- Low-level Compiler Optimizations
16Compiler Framework
Benchmark source
17Sample of Benchmark Set
18Datapath Energy Consumption
19Selectively Gated Pipeline Regs
- Pipeline registers consume a large percentage of datapath power
- 40% for 0.35 µm
- Pipeline registers have large width
- Pipeline registers are clocked every cycle
- Not all clockings are necessary
- Use the decoded control signals to selectively gate the clock of pipeline register fields
- Only simple extra logic necessary
- Can be built into the clock buffer circuit
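The effect of selective gating can be illustrated by counting how many pipeline register bits are clocked with and without gating. This is a sketch under stated assumptions: the MEM/WB field names and widths, the per-instruction field requirements, and the four-instruction trace are all hypothetical and are not taken from the SimplePower datapath.

```python
# Hypothetical MEM/WB pipeline register fields and their widths in bits.
MEM_WB_FIELDS = {"mem_data": 32, "alu_result": 32, "dest_reg": 5, "control": 2}

# Hypothetical decode result: fields each instruction class actually needs.
NEEDED = {
    "load":  {"mem_data", "dest_reg", "control"},
    "alu":   {"alu_result", "dest_reg", "control"},
    "store": {"control"},  # a store writes nothing back to the register file
}

def clocked_bits(instr_class, gated):
    """Bits of the MEM/WB register clocked for one instruction."""
    if not gated:
        return sum(MEM_WB_FIELDS.values())
    return sum(w for f, w in MEM_WB_FIELDS.items() if f in NEEDED[instr_class])

trace = ["load", "alu", "store", "alu"]
ungated_bits = sum(clocked_bits(i, gated=False) for i in trace)
gated_bits = sum(clocked_bits(i, gated=True) for i in trace)
print(ungated_bits, gated_bits)  # 284 119
```

The clocked-bit count is only a proxy; the actual energy saving depends on the characterized energy of each register field.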
20Gated Pipeline Registers
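[Figure: gated pipeline registers for the instruction SW r1, 0(r2), showing the EXE, MEM, and WB stages, the EXE/MEM and MEM/WB pipeline registers with Address, Data, and MemData fields, and the mem/wb_cntl gating signal]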
21Switch Capacitance Reduction
22Compiler Framework
Benchmark source
23Compiler Optimizations
- High-level Optimizations
- Inter- and Intra-Procedural Dataflow Optimization
- Loop Transformations
- Memory-Layout Transformation
24Data Transformation Effects
25Compiler Framework
[Figure: compiler flow from benchmark source through compiler transformations (source-to-source translation), GCC, assembly code, and GAS to object code]
26Low Level Optimizations
- Instruction Scheduling
- Register allocation
- Operand Swapping
- Register Relabeling
27Register Relabeling
- A post-compilation optimization
- Exploits corresponding fields in consecutive instructions
- Reduces bit switches on the instruction bus
- Reduces the energy of the pipeline registers and register file decoder
28Register Relabeling Example
29Solution Steps
- Construct a Register Transition Graph
- Compiler analysis to record all consecutive transitions between all possible pairs of registers
- Profiling (use training inputs to create sample traces)
- Determine important paths
- Paths that contain edges with high transition counts
- Relabel the registers
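The steps above can be sketched on a toy example. The code below builds a register transition graph from a trace of one register field and then searches for a labeling with fewer bit switches. It is a sketch under stated assumptions: the trace is hypothetical, and the exhaustive search over all labelings is a stand-in that only works for a handful of registers; it is not presented as the procedure used by the authors.

```python
from collections import Counter
from itertools import permutations

def transition_graph(trace):
    """Count transitions between consecutive register numbers in a field trace."""
    graph = Counter()
    for a, b in zip(trace, trace[1:]):
        if a != b:
            graph[tuple(sorted((a, b)))] += 1
    return graph

def hamming(a, b):
    """Number of bit positions in which two register encodings differ."""
    return bin(a ^ b).count("1")

def switching_cost(graph, labels):
    """Total bit switches if register r is encoded as labels[r]."""
    return sum(n * hamming(labels[a], labels[b]) for (a, b), n in graph.items())

def relabel(graph, registers):
    """Brute-force search over all labelings (feasible only for tiny sets)."""
    codes = range(len(registers))
    best = min(permutations(codes),
               key=lambda p: switching_cost(graph, dict(zip(registers, p))))
    return dict(zip(registers, best))

registers = [0, 1, 2, 3]
trace = [0, 1, 0, 1, 2, 3]  # hypothetical profiled register-field trace
graph = transition_graph(trace)
before = switching_cost(graph, {r: r for r in registers})
after = switching_cost(graph, relabel(graph, registers))
print(before, after)  # 6 5
```

For a realistic register count, a heuristic that walks the high-count paths first would replace the exhaustive search.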
30Register Transition Graph
[Figure: register transition graph over registers R1 to R6, with edge transition counts of 80, 35, 15, 90, 35, 5, and 100]
Registers with large transition counts should be
labeled using minimum Hamming distance
31Icache Data Bus Reduction
32Conclusions
- SimplePower - cycle-accurate simulator
- Find energy hotspots in the architecture
- Study hardware/software interaction
- Architectural experiments
- Selectively gated pipeline registers
- 18-36% energy savings in datapath
- Compiler optimization experiments
- Data transformations (memory): 62% reduction
- Register relabeling (bus): 11.7% reduction