Title: Overview
1Overview
- Motivation (Kevin)
- Thermal issues (Kevin)
- Power modeling (David)
- Thermal management (David)
- Optimal DTM (Lev)
- Clustering (Antonio)
- Power distribution (David)
- What current chips do (Lev)
- HotSpot (Kevin)
2Power modeling
- Research Power Simulators
- Wattch: Brooks and Martonosi, ISCA 2000
- SimplePower: Vijaykrishnan et al. (Penn State), ISCA 2000
- TEMPEST: Dhodapkar et al. (Intel/Wisconsin)
- PowerAnalyzer: Umich/Colorado
- AccuPower: SUNY Binghamton
- Industry Power Simulators
- IBM PowerTimer: Brooks and Bose, PACS 2000
- Intel ALPS: Gunther, et al.
3Power The Basics
- Dynamic power vs. static power
- Dynamic: switching power
- Static: leakage power
- Dynamic power dominates, but static power is increasing in importance
- Trends in each
- Static power: steady, per-cycle energy cost
- Dynamic power: capacitive and short-circuit
- Capacitive power: charging/discharging at transitions from 0→1 and 1→0
- Short-circuit power: power due to brief short-circuit current during transitions
- Mostly focus on capacitive, but recent work on others
4Capacitive Power dissipation
$\text{Power} \approx \tfrac{1}{2} C V^2 A f$
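A minimal sketch of the capacitive power relation above, reading C as switched capacitance, V as supply voltage, A as activity factor, and f as clock frequency. The function name and the numbers are illustrative assumptions, not values from the slides.

```python
def dynamic_power(c_farads, vdd_volts, activity, freq_hz):
    """Capacitive switching power: P = 1/2 * C * V^2 * A * f (watts)."""
    return 0.5 * c_farads * vdd_volts ** 2 * activity * freq_hz


# Assumed example: 1 nF of switched capacitance, 1.3 V, A = 0.1, 3 GHz.
p_nominal = dynamic_power(1e-9, 1.3, 0.1, 3.0e9)
# Halving the supply voltage alone cuts this term to one quarter.
p_half_v = dynamic_power(1e-9, 0.65, 0.1, 3.0e9)
print(f"nominal: {p_nominal:.4f} W, half voltage: {p_half_v:.4f} W")
```

The quadratic dependence on V is why voltage scaling appears so often among the power-reduction techniques discussed later in the deck.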
5Short-Circuit Power Dissipation
- Short-circuit current is caused by finite-slope input signals
- Direct current path between VDD and GND when both NMOS and PMOS transistors are conducting
6Leakage Power
- Subthreshold currents grow exponentially with increases in temperature and with decreases in threshold voltage
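As a rough illustration of the exponential trend, the sketch below uses the textbook first-order form I = I0 * exp(-Vth / (n * kT/q)). The prefactor `i0`, the slope factor `n`, and the function name are assumptions; the model deliberately ignores the temperature dependence of Vth and of the prefactor, so it shows the direction of the trend rather than real magnitudes.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def subthreshold_current(i0, vth_volts, temp_kelvin, n=1.5):
    """First-order subthreshold leakage at Vgs = 0 (same units as i0)."""
    thermal_voltage = K_BOLTZMANN * temp_kelvin / Q_ELECTRON  # kT/q, about 26 mV at 300 K
    return i0 * math.exp(-vth_volts / (n * thermal_voltage))


cool = subthreshold_current(1.0, 0.30, 318.15)    # 45C
hot = subthreshold_current(1.0, 0.30, 358.15)     # 85C
low_vth = subthreshold_current(1.0, 0.25, 318.15)
print(f"hot/cool ratio: {hot / cool:.2f}, low-Vth/cool ratio: {low_vth / cool:.2f}")
```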
7Modeling Hierarchy and Tool Flow
8Analysis Abstraction Levels
Abstraction levels, highest to lowest: Application, Behavioral, Architectural (RTL), Logic (Gate), Transistor (Circuit)
Metric | Highest level | Lowest level
Analysis capacity | Most | Least
Analysis accuracy | Worst | Best
Analysis speed | Fastest | Slowest
Analysis resources | Least | Most
Energy savings | Most | Least
9Power/Performance abstractions
- Low-level
- Hspice
- PowerMill
- Medium-Level
- RTL Models
- Architecture-level
- PennState SimplePower
- Intel Tempest
- Princeton Wattch
- IBM PowerTimer
- Umich/Colorado PowerAnalyzer
10Low-level models Hspice
- Extracted netlists from circuit/layout descriptions
- Diffusion, gate, and wiring capacitance is modeled
- Analog simulation performed
- Detailed device models used
- Large systems of equations are solved
- Can estimate dynamic and leakage power dissipation within a few percent
- Slow, only practical for 10-100K transistors
- PowerMill (Synopsys) is similar but about 10x faster
11Medium-level models RTL
- Logic simulation obtains switching events for every signal
- Structural VHDL or Verilog with zero or unit-delay timing models
- Capacitance estimates performed
- Device capacitance
- Gate sizing estimates performed, similar to synthesis
- Wiring capacitance
- Wire load estimates performed, similar to placement and routing
- Switching event and capacitance estimates provide dynamic power estimates
12Architecture level models
$\text{Power} \approx \tfrac{1}{2} C V^2 A f$
- Bottom-up Approach
- Estimate $CV^2f$ via analytical models
- Tools: Wattch, PowerAnalyzer, Tempest (mixed-mode)
- Top-Down Approach
- Estimate $CV^2f$ via empirical measurements
- Tools: PowerTimer, AccuPower, most industrial tools
- Estimate $A$ via statistics from architectural-performance simulators
13Analytical Models Capacitance
- Requires modeling wire length and estimating transistor sizes
- Related to RC delay analysis for speed along critical path
- But capacitance estimates require summing up all wire lengths, rather than only an accurate estimate of the longest one.
14Register File Capacitance Analysis
[Figure: register file array. Labels: decoders, wordlines (number of entries), bitlines (data width of entries), number of ports (on both wordlines and bitlines), pre-charge, cell, cell access transistors (N1), sense amps]
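One way to turn the labeled structure into an analytical estimate is sketched below: each wordline or bitline is a driver diffusion term, one device load per attached cell, and a wire term proportional to length. This decomposition is an assumption in the spirit of analytical array models, and every constant in `TechParams` is a made-up placeholder rather than data from any real process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TechParams:
    # Hypothetical per-device and per-micron capacitances, in farads.
    c_gate_access: float = 1.0e-15     # gate of one cell access transistor
    c_diff_access: float = 0.5e-15     # diffusion of one cell access transistor
    c_diff_driver: float = 2.0e-15     # wordline driver diffusion
    c_diff_precharge: float = 2.0e-15  # bitline pre-charge diffusion
    c_metal_per_um: float = 0.2e-15    # wire capacitance per micron
    cell_width_um: float = 2.0         # single-ported cell width
    cell_height_um: float = 2.0        # single-ported cell height
    port_pitch_um: float = 1.0         # extra width and height per added port


def regfile_line_caps(entries, data_width, ports, tech=TechParams()):
    """Return (wordline_cap, bitline_cap) in farads for one line of each kind."""
    # Each extra port adds a wordline and a bitline track through every cell.
    cell_w = tech.cell_width_um + tech.port_pitch_um * (ports - 1)
    cell_h = tech.cell_height_um + tech.port_pitch_um * (ports - 1)
    wordline_len = data_width * cell_w  # a wordline spans every bit of an entry
    bitline_len = entries * cell_h      # a bitline spans every entry
    wordline = (tech.c_diff_driver
                + data_width * tech.c_gate_access
                + wordline_len * tech.c_metal_per_um)
    bitline = (tech.c_diff_precharge
               + entries * tech.c_diff_access
               + bitline_len * tech.c_metal_per_um)
    return wordline, bitline


wl, bl = regfile_line_caps(entries=80, data_width=64, ports=4)
print(f"wordline: {wl * 1e15:.1f} fF, bitline: {bl * 1e15:.1f} fF")
```

Note how the wire terms require every line's length, not just the longest one, which is the distinction from delay analysis drawn on the previous slide.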
15Register File Model Accuracy
(Numbers in Percent)
- Validated against a register file schematic used in internal Intel design
- Compared capacitance values with estimates from a layout-level Intel tool
- Interconnect capacitance had largest errors
- Model neglects poly connections
- Differences in wire lengths: difficult to tell wire distances of schematic nodes
16Different Circuit Design Styles
- RTL and architectural-level power estimation requires the tool/user to make circuit design style assumptions
- Static vs. dynamic logic
- Single vs. double-ended bitlines in register files/caches
- Sense amp designs
- Transistor and buffer sizings
- Generic solutions are difficult because many styles are popular
- Within individual companies, circuit design styles may be fixed
17Clock Gating What, why, when?
[Figure: clock-gating circuit with signals Clock, Gate, and Gated Clock]
- Dynamic power is dissipated on clock transitions
- Gating off clock lines when they are unneeded reduces activity factor
- But putting extra gate delays into clock lines increases clock skew
- End results
- Clock gating complicates design analysis but saves power.
18Wattch An Overview
- Wattch's Design Goals
- Flexibility
- Planning-stage info
- Speed
- Modularity
- Reasonable accuracy
- Overview of Features
- Parameterized models for different CPU units
- Can vary size or design style as needed
- Abstract signal transition models for speed
- Can select different conditional clocking and input transition models as needed
- Based on SimpleScalar (has been ported to many simulators)
- Modular: can add new models for new units studied
19Unit Modeling
- Modeling Capacitance
- Models depend on structure, bitwidth, design style, etc.
- E.g., may model capacitance of a register file with bitwidth and number of ports as input parameters
- Modeling Activity Factor
- Use cycle-level simulator to determine number and type of accesses
- Reads, writes, how many ports
- Abstract model of bitline activity
20One Cycle in Wattch
- On each cycle
- determine which units are accessed
- model execution time issues
- model per-unit energy/power based on which units
used and how many ports.
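The per-cycle bookkeeping described above can be sketched as a simple accumulation loop. This is not Wattch's actual code: the trace format, unit names, and per-access energies are hypothetical, and idle units are charged nothing here, which is an idealized clock-gating assumption.

```python
def accumulate_energy(trace, energy_per_access_pj, clock_hz):
    """trace: one dict per cycle mapping unit name -> number of ports used.

    Returns (total energy per unit in pJ, average power per unit in W).
    """
    if not trace:
        raise ValueError("trace must contain at least one cycle")
    total_pj = {unit: 0.0 for unit in energy_per_access_pj}
    for accesses in trace:                   # on each cycle
        for unit, ports in accesses.items():  # units accessed this cycle
            total_pj[unit] += energy_per_access_pj[unit] * ports
    seconds = len(trace) / clock_hz
    avg_power_w = {u: e * 1e-12 / seconds for u, e in total_pj.items()}
    return total_pj, avg_power_w


# Hypothetical two-cycle trace: cycle 0 uses 2 register file ports and the ALU.
energies = {"regfile": 10.0, "alu": 5.0}
trace = [{"regfile": 2, "alu": 1}, {"alu": 1}]
totals, power = accumulate_energy(trace, energies, clock_hz=1e9)
print(totals, power)
```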
21Units Modeled by Wattch
22PowerTimer
- IBM Tool First Developed During Summer of 2000
- Continued Development 2001 -> Today
- Methodology Applied to Research and Product Power-Performance Simulators within IBM
- Currently in Beta-Release
- Working towards Full Academic Release
23PowerTimer Empirical Power
Pre-silicon, POWER4-like superscalar design
24Processor Power Density
Pre-silicon, POWER4-like superscalar design
Originally presented at PACS2002
25PowerTimer
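[Figure: PowerTimer flow. A program executable or trace drives the architectural performance simulator, which reports CPI and AF/SF data. Circuit power data (macros), tech parms, uArch parms, and the AF/SF data feed the sub-unit power computation, SubUnit Power = f(SF, uArch, Tech), which produces Power.]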
26PowerTimer Energy Models
- Energy models for uArch structures formed by
summation of circuit-level macro data
27Empirical Estimates with CPAM
- Estimate power under Input Hold and Input Switching modes
- Input Hold: all macro inputs (except clocks) held
- Can also collect data for clock gate signals
- Input Switching: apply random switching patterns with 50% switching on input pins
[Figure: macro with macro inputs, measured at two operating points]
- 0% switching (hold power)
- 50% switching power
28Example Unit
29PowerTimer Models f(SF)
Assumption: power is linearly dependent on switching factor (SF). This separates clock power and switching power.
[Figure: power vs. SF, with clock power and switching power components]
At 0% SF, power = clock power (significant without clock gating)
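Under the linearity assumption, two measured points per macro are enough to define the model: the hold power at 0% switching and the power at 50% switching. The sketch below interpolates between them; the function name and the milliwatt figures are illustrative assumptions.

```python
def macro_power(sf, hold_power, power_at_50):
    """Linear power model in switching factor sf (a fraction, 0.0 to 1.0).

    hold_power is the 0% switching (clock) power; power_at_50 is the
    measured power at 50% input switching.
    """
    slope = (power_at_50 - hold_power) / 0.5
    return hold_power + slope * sf


# Hypothetical macro: 2.0 mW hold power, 3.0 mW at 50% switching.
for sf in (0.0, 0.1, 0.5):
    print(f"SF={sf:.0%}: {macro_power(sf, 2.0, 3.0):.2f} mW")
```

At the low bit-level switching factors typical of real chips, the hold (clock) term dominates the estimate, which is the slide's point about clock power being significant without clock gating.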
30Key Activity Data
[Figure labels: changes in SF, changes in AF]
- SF -> moves along the switching power curve
- Estimated on a per-unit basis from RTL analysis
- AF -> moves along the clock power curve
- Extracted from microarchitectural statistics (Turandot)
31Microarchitectural Statistics
- Stats are very similar to tracking used in Wattch, etc.
- Differences
- Clock Gating Modes (3 modes)
- Customized Scaling Based on Circuit Style (4 styles)
- Clock Gating Modes
- $P_{\text{constrained}} = P_{\text{unconstrained}}$ (not clock-gateable)
- $P_{\text{constrained\_1}} = AF \cdot (P_{\text{clock}} + P_{\text{logic}})$ (common)
- $P_{\text{constrained\_2}} = AF \cdot P_{\text{clock}} + P_{\text{logic}}$ (rare)
- $P_{\text{constrained\_3}} = P_{\text{clock}} + AF \cdot P_{\text{logic}}$ (very rare)
- Scaling Based on Circuit Styles
- $AF_1 = \text{valid}$ (Latch-and-Mux, No Stall Gating)
- $AF_2 = \text{valid} - \text{stalls}$ (Latch-and-Mux, With Stall Gating)
- $AF_3 = \text{writes}$ (Arrays that only gate updates)
- $AF_4 = \text{writes} + \text{reads}$ (Arrays, RAM Macros)
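The sketch below reads the three gating modes as "gate both clock and logic", "gate clock only", and "gate logic only", and the four circuit styles as different ways of deriving AF from simulator statistics. The operators are an interpretation of the slide, and the function names and sample numbers are hypothetical.

```python
def activity_factor(style, valid=0.0, stalls=0.0, writes=0.0, reads=0.0):
    """AF for one of the four circuit styles; inputs are per-cycle rates."""
    if style == 1:   # latch-and-mux, no stall gating
        return valid
    if style == 2:   # latch-and-mux, with stall gating
        return valid - stalls
    if style == 3:   # arrays that only gate updates
        return writes
    if style == 4:   # arrays, RAM macros
        return writes + reads
    raise ValueError(f"unknown circuit style {style}")


def constrained_power(mode, af, p_clock, p_logic):
    """Clock-gated power for gating mode 0 (ungated) through 3."""
    if mode == 0:    # not clock-gateable
        return p_clock + p_logic
    if mode == 1:    # gate both clock and logic (common)
        return af * (p_clock + p_logic)
    if mode == 2:    # gate clock only (rare)
        return af * p_clock + p_logic
    if mode == 3:    # gate logic only (very rare)
        return p_clock + af * p_logic
    raise ValueError(f"unknown gating mode {mode}")


# Hypothetical issue queue: valid 60% of cycles, stalled 10% of cycles.
af = activity_factor(2, valid=0.6, stalls=0.1)
print(constrained_power(1, af, p_clock=4.0, p_logic=6.0))
```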
32Clock Gating Valid-Bit Gating
- Latch-based structures: execute pipelines, issue queues
[Figure: row of latches, each with a valid bit (V), sharing a clock]
33Clock Gating Modes
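- $P_{\text{constrained\_1}} = AF \cdot (P_{\text{clock}} + P_{\text{logic}})$
[Figure: circuit with labels clock, valid, Plogic, Pclock]
- $P_{\text{constrained\_2}} = AF \cdot P_{\text{clock}} + P_{\text{logic}}$
[Figure: circuit with labels clock, selection logic, valid, Pclock, Plogic]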
34Valid-bit Gating, Stalls?
- Option 1: Stalls cannot be gated
[Figure: pipeline latch with signals clk, valid, stall from previous pipestage, data from previous pipestage, data for next pipestage]
- Option 2: Stalls can be gated
[Figure: pipeline latch with signals clk, valid, stall from previous pipestage, data from previous pipestage, data for next pipestage]
35Scaling Array Structures
- Option 1: Reads and writes eligible to gate for power
[Figure: array cell with write bitline and read bitline; signals read_wordline_active, read_gate, read_data, write_wordline_active, write_gate, write_data]
36Scaling Array Structures
- Option 2: Only writes eligible to gate for power
[Figure: array cell with write bitline; signals read_entry_0 through read_entry_n, read_data, write_wordline_active, write_gate, write_data]
3712 Clock Gating Modes
Gating Mode | Valid | Valid - Stalls | Writes | Writes + Reads | Gate Both | Gate Clock | Gate Logic | Examples
0 | No | No | No | No | No | No | No | Control Logic, Buffers, Small Macros
1 | Yes | No | No | No | Yes | No | No | Issue Queues, Execute Pipelines
2 | No | Yes | No | No | Yes | No | No | Issue Queues, Execute Pipelines
3 | No | No | Yes | No | Yes | No | No | Caches
4 | No | No | No | Yes | Yes | No | No | Some Queues
5 | Yes | No | No | No | No | Yes | No | CAMs, Selection Logic
6 | No | Yes | No | No | No | Yes | No | CAMs, Selection Logic
7 | No | No | Yes | No | No | Yes | No | No known macros
8 | No | No | No | Yes | No | Yes | No | No known macros
9 | Yes | No | No | No | No | No | Yes | No known macros
10 | No | Yes | No | No | No | No | Yes | No known macros
11 | No | No | Yes | No | No | No | Yes | No known macros
12 | No | No | No | Yes | No | No | Yes | No known macros
38PowerTimer Observations
- PowerTimer works well for POWER4-like estimates and derivatives
- Scales base microarchitecture quite well
- E.g. optimal power-performance pipelining study
- Lack of run-time, bit-level SF not seen as a problem within IBM (seen as noise)
- Chip bit-level SFs are quite low (5-15%)
- Most (60-70%) power is dissipated while maintaining state (arrays, latches, clocks)
- Much state is not available in early-stage timers
39Comparing models Flexibility
- Flexibility necessary for certain studies
- Resource tradeoff analysis
- Modeling different architectures
- Purely analytical tools provide fully parameterizable power models
- Within this methodology, circuit design styles could also be studied
- PowerTimer scales power models in a user-defined manner for individual sub-units
- Constrained to structures and circuit styles currently in the library
- Perhaps mixed-mode tools could be very useful
40Comparing models Accuracy
- PowerTimer: based on validation of individual pieces
- Extensive validation of the performance model (AFs)
- Power estimates from circuits are accurate
- Circuit designers must vouch for clock gating scenarios
- Certain assumptions will limit accuracy or require more in-depth analysis
- Analytical Tools
- Inherent Issues
- Analytical estimates cannot be as accurate as SPICE analysis (C estimates, $CV^2$ approximation)
- Practical Issues
- Without industrial data, must estimate transistor sizing, bits per structure, circuit choices
41Comparing models Speed
- Performance simulation is slow enough!
- Post-Processing vs. Run-Time Estimates
- Wattch's per-cycle power estimates: roughly 30% overhead
- Post-processing (per-program power estimates) would be much faster (minimal overhead)
- PowerTimer allows both no-overhead post-processing and run-time analysis for certain studies (di/dt, thermal)
- Some clock gating modes may require run-time analysis
- Third Option: Bit Vector Dumps
- Flexible post-processing -> huge output files
42Power modeling summary
- Wattch provides excellent relative accuracy
- Underestimates full chip power (some units not modeled, etc.)
- PowerTimer models based on circuit-level power analysis
- Inaccuracy is introduced in SF/AF and scaling assumptions
43Overview
- Motivation (Kevin)
- Thermal issues (Kevin)
- Power modeling (David)
- Thermal management (David)
- Optimal DTM (Lev)
- Clustering (Antonio)
- Power distribution (David)
- What current chips do (Lev)
- HotSpot (Kevin)
44Existing Work
- Research Ideas
- DEETM: Huang and Torrellas, MICRO 2000
- DTM: Brooks and Martonosi, HPCA 2001
- Control-Theoretic DTM: Skadron, Abdelzaher, Stan, HPCA 2002
- Thermal Scheduling: Cai, Lim, Daasch, WCED 2002
- Commercial Products
- PowerPC G3 Microprocessor
- Pentium III
- Pentium 4
45Overview
- Hard to optimize power-performance at design time for all cases
- Forces conservative choices for issues like cooling, current delivery, resource sizes
- Want to explore dynamic power optimizations for run-time power management
- Dynamic Voltage/Frequency Scaling [Burd, 2000]
- Dynamic Hardware Resizing [Albonesi, 1999]
- Fetch Throttling [Sanchez, 1997]
- Global Clock Gating [Gunther, 2001]
- Speculation Control [Manne, 1998]
- Dynamic Thermal Management [Brooks, 2001; Huang, 2000]
46Important to optimize P T early
[Figure: curves labeled 12FO4, 14FO4, 18FO4, and 23FO4, plotted against a maximum power budget]
47Dynamic Thermal Management
- Goal
- Provide dynamic techniques to cool chip when needed
- Exploit natural variations due to different applications, phase behavior
- Allow designers to target average, rather than worst-case, behavior
- Design Decisions
- Mechanism and policy for triggering response?
- What should response be?
- How to select DTM trigger levels?
48Power consumption impacts cost
From Gunther, et al. Managing the Impact of
Increasing Microprocessor Power Consumption,
Intel Technology Journal, Q1, 2001
CPU
- System costs associated with power dissipation
- Thermal control cost
- Heatsinks, fans
- Power delivery
- Power supply
- Decoupling caps
49Average and Worst Case Power
- System costs are constrained by worst-case power dissipation
- Average-case power dissipation can often be much lower
- Aggressive clock gating
- Application variations
- Underutilized resources
- Not enough ILP
- Floating point units during integer code execution
- Currently about a 30% difference
- Likely to further diverge
50Dynamic Thermal Management
DTM Disabled
51DTM Definitions
52DTM When, How, and What
53DTM Trigger Mechanisms
- Mechanism: How to deduce temperature?
- Direct approach: temperature sensors providing feedback
- Implemented in some PowerPC chips (G3, G4) [Sanchez, 1997]
- Sensor quantity, placement, and precision will be discussed later
- Other indirect approaches possible
- Policy: When to begin responding?
- Trigger level set too high: packaging cost will be high
- Little advantage
- Trigger level set too low
- Frequent triggering causes performance to suffer
- Choose trigger level to exploit difference between average and worst-case power.
54DTM Initiation Mechanisms
- Operating system or microarchitectural control?
- Hardware support can significantly reduce performance penalty
- Policy Delay Settings
- For volt/freq scaling, much of the performance penalty can be attributed to enabling/disabling
- Increasing policy delay reduces overhead; smarter initiation techniques would help as well
55DTM Response Mechanisms
- Scaling Techniques
- Clock Frequency Scaling (Intel Pentium 4)
- Voltage and Frequency Scaling
- Temperature-tracking frequency scaling [Skadron03]
- Adjusts frequency to account for T-dependence of switching speed
- Microarchitectural Techniques
- Speculation Control [Manne98]
- Low-Power Cache Techniques [Huang00]
- Hierarchical Responses
- Decode Throttling [Sanchez97]
- Fetch Toggling [Brooks01]
- Feedback-controlled Fetch Gating [Skadron02]
- Migrating Computation [Skadron03]
- Dual Pipelines [Lim02]
56Dynamic Voltage/Frequency Scale
- Voltage scheduler predicts workload requirements
- Set frequency/voltage to near-optimal for energy savings
- Burd, et al., ISSCC 2000
- 5MHz @ 1.2V: 6 MIPS, 2.8mW
- 80MHz @ 3.8V: 85 MIPS, 460mW
- 70us for 1.2V <-> 3.8V
- Transmeta Crusoe
- Commercial implementation (500-700MHz, 1.2-1.6V)
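A quick arithmetic check on the two Burd et al. operating points quoted above: dividing power by instruction rate gives energy per instruction, and the ratio can be compared against the square of the voltage ratio. The function name is an assumption; the numbers are the ones on the slide.

```python
def energy_per_instruction_nj(power_mw, mips):
    """mW / MIPS = (1e-3 J/s) / (1e6 instr/s) = 1e-9 J per instruction (nJ)."""
    return power_mw / mips


low = energy_per_instruction_nj(2.8, 6)      # 5 MHz at 1.2 V
high = energy_per_instruction_nj(460.0, 85)  # 80 MHz at 3.8 V
print(f"low: {low:.2f} nJ/instr, high: {high:.2f} nJ/instr")
print(f"energy ratio: {high / low:.1f}, (3.8/1.2)^2 = {(3.8 / 1.2) ** 2:.1f}")
```

The energy ratio of roughly 11.6 sits close to the voltage-squared ratio of about 10, consistent with the $CV^2$ dependence of switching energy; running slowly at low voltage costs far less energy per instruction.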
57Temperature-Tracking Frequency
- Temperature affects
- Transistor threshold and mobility
- Ion, Ioff, Igate, delay
- ITRS: 85C for high-performance, 110C for embedded!
- So adjust frequency as f(T): TT-DFS
[Figure: plots of Ioff and Ion (NMOS)]
58Speculation Control
- Manne et al. (ISCA 98)
- Branch confidence estimator used to determine whether to speculate
- Pipeline gating based on confidence estimation
- 38% reduction in wrong-path instructions with 1% performance loss
- But Parikh et al. (HPCA 02) found much smaller savings: ED-product benefit is zero or negative
- Significant energy savings only come with significant loss of performance
- This is because many instructions are squashed early in the pipeline, so reduction in wrong-path instructions is not a useful metric
- Benefit is actually a function of prediction accuracy
- Only for very badly predicted programs do you get benefit
- Well-predicted programs suffer
59Dynamic Hardware Resizing
- Complexity Adaptive Processors
- Based on application characteristics
- Underutilized structures may be reduced with minimal performance impact
- Resize caches, issue queues, etc.
- Resize -> reduce capacitance -> reduce energy
- Of course, this only helps manage heat if it reduces heat dissipation within hot spots
- And does so for a sufficiently long duration
60DEETM
- Dynamic Energy Efficiency and Temperature Management
- Slack algorithm detects if slowdown can be tolerated
- If so, invoke techniques to reduce energy
- Temperature algorithm
- If temperature limit is reached, invokes techniques
- Techniques considered
- Filter cache, voltage scaling, etc.
61Control-theoretic DTM
- Fetch toggling
- disable fetch every N cycles
- 4/5, 2/3, 1/2, 1/3, 1/5,
- How to set the fetch rate?
- (Assume idealized temperature sensing)
[Figure: two five-stage pipeline diagrams (IF, ID, EX, MEM, WB)]
62Feedback-Control of Fetch Toggling
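- Formal feedback control
- PID: $m = K_C \left( e + K_I \int e \, dt + K_d \frac{de}{dt} \right)$
- easy to compute
- toggling = f(m)
[Figure: feedback loop. The setpoint and the measured T form the error e; the controller outputs m; the actuator (I-fetch toggling) sets power P; thermal dynamics produce temperature T; the temp. sensor returns measured T.]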
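A discrete-time sketch of a PID controller of this form is given below. The gains, the sample period, the sign convention (error = measured temperature minus setpoint, so a positive output means "throttle more"), and the clamp used for f(m) are all assumptions made for illustration; none come from the slides.

```python
class PIDController:
    """Discrete PID: m = Kc * (e + Ki * integral(e) + Kd * de/dt)."""

    def __init__(self, kc, ki, kd, dt):
        self.kc, self.ki, self.kd, self.dt = kc, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        # Rectangular integration and a backward-difference derivative.
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kc * (error + self.ki * self.integral + self.kd * derivative)


def toggling_fraction(m):
    """Hypothetical f(m): fraction of cycles with fetch disabled, clamped to [0, 1]."""
    return min(1.0, max(0.0, m))


pid = PIDController(kc=2.0, ki=0.5, kd=0.1, dt=1.0)
for measured, setpoint in [(82.8, 81.8), (82.3, 81.8), (81.7, 81.8)]:
    m = pid.update(measured - setpoint)
    print(f"T={measured}: m={m:.2f}, fetch disabled {toggling_fraction(m):.0%}")
```

The controller needs only a few multiplies and adds per sample, which matches the "easy to compute" bullet.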
63Formal Feedback Control
- Regulatory control problem: hold value to a specified setpoint
- Example: temperature
- Proved that PID controller will not allow temperature to exceed setpoint by more than 0.02
- Max power dissipation, thermal dynamics, sampling rate -> max overshoot
- This precision is excessive but illustrates the value of formal feedback control theory
64Performance Loss
- Performance loss reduced by 65%
65Migrating Computation
- When one unit overheats, migrate its functionality to a distant, spare unit (MC)
- Spare register file (Skadron et al. 2003)
- Separate core (CMP) (Heo et al. 2003)
- Microarchitectural clusters
- etc.
- Raises many interesting issues
- Cost-benefit tradeoff for that area
- Use both resources (scheduling)
- Extra power for long-distance communication
- Floorplanning
66Migrating Computation Reg File
67Thermal Scheduling (Cai 2002)
Majority of mobile apps with performance requirements
- Primary pipeline: maximal performance, complex pipeline structure
- Second pipeline: minimum power and energy consumption, very simple in-order structure, targets mobile anywhere-anytime applications
- Transparent to OS and applications
- Maximally utilizes on-die clock/power gating for energy saving
Text email, caller ID, reminders, and other non-high-performance apps with anywhere-anytime requirements
68Scheduling Algorithm (Cai 2002)
TS1
TS2
69Hybrid DTM
- DVS is attractive because of its cubic advantage
- $P \propto V^2 f$
- This factor dominates when DTM must be aggressive
- But changing DVS setting can be costly
- Resynchronize PLL
- Sensitive to sensor noise -> spurious changes
- ILP techniques are attractive because they can use instruction-level parallelism to hide/reduce impact of DTM
- Only effective when DTM is mild
- So use both!
- Need to find crossover point
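The "cubic advantage" follows if frequency is scaled roughly in proportion to voltage, which is an assumption added here rather than something the slide states. Under it, P ∝ V²f becomes P ∝ V³, as the small sketch below shows; the function name is illustrative.

```python
def relative_power(voltage_scale, freq_scale=None):
    """Power relative to nominal under P proportional to V^2 * f.

    If freq_scale is omitted, frequency is assumed to scale with voltage.
    """
    if freq_scale is None:
        freq_scale = voltage_scale
    return voltage_scale ** 2 * freq_scale


print(f"DVS, V and f both at 90%: {relative_power(0.9):.3f} of nominal power")
print(f"frequency-only, f at 90%: {relative_power(1.0, 0.9):.3f} of nominal power")
```

A 10% slowdown through DVS removes about 27% of the dynamic power, while the same slowdown from frequency scaling alone removes only 10%, which is why DVS wins when the response must be aggressive.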
70Hybrid DTM, cont.
- Combine fetch gating with DVS
- When DVS is better, use it
- Otherwise use fetch gating
- Determined by magnitude of temperature overshoot
- Crossover at FG duty cycle of 3
- FG has low overhead, which helps reduce cost of sensor noise
71Hybrid DTM, cont.
- DVS doesn't need more than two settings for thermal control
- Lower voltage cools chip faster
- FG by itself does need multiple duty cycles and hence requires PI control
- But in a hybrid configuration, FG does not require PI control
- FG is only used at mild DTM settings
- Can pick one fixed duty cycle
- This is beneficial because feedback control is vulnerable to noise
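The hybrid policy can be sketched as a three-way decision on the magnitude of the temperature overshoot: no response below the trigger, fetch gating at one fixed duty cycle for mild overshoot, and the low DVS setting beyond a crossover point. The 1.0-degree crossover and the function name are hypothetical; the 81.8C default trigger is the threshold quoted in the simulation details.

```python
def hybrid_dtm_action(temp_c, trigger_c=81.8, crossover_overshoot_c=1.0):
    """Pick a DTM response from the temperature overshoot above the trigger."""
    overshoot = temp_c - trigger_c
    if overshoot <= 0.0:
        return "none"             # below trigger: run at full speed
    if overshoot < crossover_overshoot_c:
        return "fetch_gating"     # mild DTM: one fixed FG duty cycle, no PI control
    return "dvs_low_voltage"      # aggressive DTM: switch to the low DVS setting


for t in (80.0, 82.0, 84.0):
    print(t, hybrid_dtm_action(t))
```

Because each branch selects a fixed setting, no feedback controller is involved, which reflects the slide's point that the hybrid scheme sidesteps the noise sensitivity of PI control.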
72Simulation Details
- 85C maximum temperature
- Guard band requires a trigger threshold of 81.8C
- Ambient temperature (inside computer case): 45C
- Rpackage = 0.8 K/W (old package model)
- 0.7 K/W necessary if DTM not available
- Die thickness: 0.5mm
- Currently neglecting interface material
- 9 SPEC2000 benchmarks, both integer and FP
- 4 hover near 81.8C, rest are above
- SimpleScalar/Wattch, modified to model pipeline and power of an Alpha 21364 as closely as possible
- Scaled to 130nm, 1.3V, 3.0 GHz
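To see what the two package resistances imply, the sketch below assumes a single lumped thermal resistance from die to ambient at steady state, T = T_ambient + R * P. That simplification is an assumption added here (the actual study uses a far more detailed thermal model); the temperatures and resistances are the ones listed above.

```python
def max_sustained_power(t_max_c, t_ambient_c, r_package_k_per_w):
    """Steady-state power (W) that holds the die at t_max_c under T = T_amb + R * P."""
    return (t_max_c - t_ambient_c) / r_package_k_per_w


with_dtm = max_sustained_power(85.0, 45.0, 0.8)     # cheaper package, DTM available
without_dtm = max_sustained_power(85.0, 45.0, 0.7)  # better package, no DTM
at_trigger = max_sustained_power(81.8, 45.0, 0.8)   # power that sits at the trigger
print(f"0.8 K/W: {with_dtm:.1f} W, 0.7 K/W: {without_dtm:.1f} W, trigger: {at_trigger:.1f} W")
```

Under this reading, the 0.8 K/W package sustains about 50 W at 85C versus about 57 W for the 0.7 K/W package, so DTM has to absorb the excursions above roughly 46 W, where the 81.8C trigger is reached.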
73Performance Comparison
- TT-DFS is best but can't prevent excess temperature
- Suitable for use with aggressive clock rates at low temp.
- Hybrid technique reduces DTM cost by 25% vs. DVS (DVS overhead important)
- A substantial portion of MC's benefit comes from the altered floorplan, which separates hot units
74Conclusions so far
- DTM can be used to reduce cooling costs
- Proper modeling is required
- HotSpot is publicly available at http://lava.cs.virginia.edu/HotSpot
- ILP matters
- Hybrid techniques beneficial
- Merge advantages of different schemes
- Simplify control
- Architectural techniques important in thermal design
- Growing use of clusters and redundant units opens an incredibly rich design space
75DTM Summary and Key Issues
- Dynamic optimizations translate max-power problem to average-power problem
- Heightens importance of average-power techniques like clock gating
- Key Issues
- Initiation interval
- Collection of possible response mechanisms