Overview

About This Presentation

Title:

Overview

Slides: 76

Provided by: skadronsta

Learn more at: https://www.cs.virginia.edu

Transcript and Presenter's Notes

Title: Overview

1
Overview

Motivation (Kevin)
Thermal issues (Kevin)
Power modeling (David)
Thermal management (David)
Optimal DTM (Lev)
Clustering (Antonio)
Power distribution (David)
What current chips do (Lev)
HotSpot (Kevin)

2
Power modeling

Research Power Simulators
Wattch Brooks and Martonosi ISCA2000
SimplePower Vijaykrishnan et al (Penn State)
ISCA2000
TEMPEST Dhodapkar et al (Intel/Wisconsin)
PowerAnalyzer Umich/Colorado
AccuPower SUNY Binghamton
Industry Power Simulators
IBM PowerTimer Brooks and Bose PACS2000
Intel ALPS Gunther, et al.

3
Power: The Basics

Dynamic power vs. static power
Dynamic: switching power
Static: leakage power
Dynamic power dominates, but static power is
increasing in importance
Trends in each
Static power: steady, per-cycle energy cost
Dynamic power: capacitive and short-circuit
Capacitive power: charging/discharging at
transitions from 0→1 and 1→0
Short-circuit power: power due to brief
short-circuit current during transitions
Mostly focus on capacitive, but recent work on
others

4
Capacitive Power dissipation
Power = ½ C V² A f
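
As a rough worked example of the capacitive power formula, the sketch below evaluates ½ C V² A f, reading C as switched capacitance, V as supply voltage, A as activity factor, and f as clock frequency. The function name and the input values are illustrative assumptions, not figures from the slides.

```python
def capacitive_power(c_farads, vdd_volts, activity, freq_hz):
    """Dynamic switching power: P = 1/2 * C * V^2 * A * f."""
    return 0.5 * c_farads * vdd_volts ** 2 * activity * freq_hz


# Hypothetical inputs: 1 nF switched capacitance, 1.3 V, A = 0.2, 3 GHz.
p = capacitive_power(1e-9, 1.3, 0.2, 3.0e9)
print(f"{p:.3f} W")

# Halving the supply voltage cuts this term to one quarter.
p_half_v = capacitive_power(1e-9, 0.65, 0.2, 3.0e9)
print(f"{p_half_v / p:.2f}")
```

The quadratic dependence on V is why voltage scaling shows up so prominently in the thermal management slides later in the deck.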
5
Short-Circuit Power Dissipation

Short-Circuit Current caused by finite-slope
input signals
Direct Current Path between VDD and GND when both
NMOS and PMOS transistors are conducting

6
Leakage Power

Subthreshold currents grow exponentially with
increases in temperature, decreases in threshold
voltage

7
Modeling Hierarchy and Tool Flow
8
Analysis Abstraction Levels
Abstraction Level      Analysis Capacity  Analysis Accuracy  Analysis Speed  Analysis Resources  Energy Savings
Application            Most               Worst              Fastest         Least               Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Circuit)   Least              Best               Slowest         Most                Least
9
Power/Performance abstractions

Low-level
Hspice
PowerMill
Medium-Level
RTL Models
Architecture-level
PennState SimplePower
Intel Tempest
Princeton Wattch
IBM PowerTimer
Umich/Colorado PowerAnalyzer

10
Low-level models: Hspice

Extracted netlists from circuit/layout
descriptions
Diffusion, gate, and wiring capacitance is
modeled
Analog simulation performed
Detailed device models used
Large systems of equations are solved
Can estimate dynamic and leakage power
dissipation within a few percent
Slow, only practical for 10-100K transistors
PowerMill (Synopsys) is similar but about 10x
faster

11
Medium-level models: RTL

Logic simulation obtains switching events for
every signal
Structural VHDL or Verilog with zero or
unit-delay timing models
Capacitance estimates performed
Device Capacitance
Gate sizing estimates performed, similar to
synthesis
Wiring Capacitance
Wire load estimates performed, similar to
placement and routing
Switching event and capacitance estimates provide
dynamic power estimates

12
Architecture level models
Power = ½ C V² A f

Bottom-up approach
Estimate C V² f via analytical models
Tools: Wattch, PowerAnalyzer, Tempest
(mixed-mode)
Top-down approach
Estimate C V² f via empirical measurements
Tools: PowerTimer, AccuPower, most industrial
tools
Estimate A via statistics from
architectural-performance simulators

13
Analytical Models Capacitance

Requires modeling wire length and estimating
transistor sizes
Related to RC Delay analysis for speed along
critical path
But capacitance estimates require summing up all
wire lengths, rather than only an accurate
estimate of the longest one.

14
Register File Capacitance Analysis
[Figure: register file array. Labels: decoders, wordlines (number of entries), bitlines (data width of entries), pre-charge, cell, cell access transistors (N1), sense amps, number of ports]
15
Register File Model Accuracy
(Numbers in Percent)

Validated against a register file schematic used
in internal Intel design
Compared capacitance values with estimates from a
layout-level Intel tool
Interconnect capacitance had largest errors
Model neglects poly connections
Differences in wire lengths -- difficult to tell
wire distances of schematic nodes

16
Different Circuit Design Styles

RTL and architectural level power estimation
require the tool/user to make circuit design
style assumptions
Static vs. Dynamic logic
Single vs. Double-ended bitlines in register
files/caches
Sense Amp designs
Transistor and buffer sizings
Generic solutions are difficult because many
styles are popular
Within individual companies, circuit design
styles may be fixed

17
Clock Gating: What, why, when?
[Figure: a gate signal combined with the clock to produce a gated clock]

Dynamic Power is dissipated on clock transitions
Gating off clock lines when they are unneeded
reduces activity factor
But putting extra gate delays into clock lines
increases clock skew
End results
Clock gating complicates design analysis but
saves power.

18
Wattch: An Overview

Wattch's Design Goals
Flexibility
Planning-stage info
Speed
Modularity
Reasonable accuracy

Overview of Features
Parameterized models for different CPU units
Can vary size or design style as needed
Abstract signal transition models for speed
Can select different conditional clocking and
input transition models as needed
Based on SimpleScalar (has been ported to many
simulators)
Modular: can add new models for new units studied

19
Unit Modeling

Modeling Capacitance
Models depend on structure, bitwidth, design
style, etc.
E.g., may model capacitance of a register file
with bitwidth and number of ports as input
parameters

Modeling Activity Factor
Use cycle-level simulator to determine number and
type of accesses
reads, writes, how many ports
Abstract model of bitline activity

20
One Cycle in Wattch

On each cycle
determine which units are accessed
model execution time issues
model per-unit energy/power based on which units
used and how many ports.
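
The per-cycle loop above can be sketched roughly as follows. This is not Wattch's code: the unit names, the per-access energies, and the assumption that idle units consume nothing (ideal clock gating) are all illustrative.

```python
# Hypothetical per-access energies in nJ; a real tool would derive
# these from capacitance models of each unit.
UNIT_ENERGY_NJ = {"regfile": 0.5, "icache": 1.0, "alu": 0.3}


def cycle_energy_nj(accesses, unit_energy=UNIT_ENERGY_NJ):
    """Energy for one cycle; `accesses` maps unit name -> ports used."""
    return sum(unit_energy[unit] * ports for unit, ports in accesses.items())


def average_power_w(trace, freq_hz):
    """Average power over a trace of per-cycle access dictionaries."""
    total_nj = sum(cycle_energy_nj(cycle) for cycle in trace)
    return total_nj * 1e-9 * freq_hz / len(trace)


# Three cycles: two register ports plus an ALU op, an I-cache access, idle.
trace = [{"regfile": 2, "alu": 1}, {"icache": 1}, {}]
print(f"{average_power_w(trace, 1.0e9):.3f} W")
```

The access counts are the part that comes from the cycle-level performance simulator; everything else is a lookup and a sum.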

21
Units Modeled by Wattch
22
PowerTimer

IBM tool first developed during summer of 2000
Continued development: 2001 => today
Methodology Applied to Research and Product
Power-Performance Simulators with IBM
Currently in Beta-Release
Working towards Full Academic Release

23
PowerTimer Empirical Power
Pre-silicon, POWER4-like superscalar design
24
Processor Power Density
Pre-silicon, POWER4-like superscalar
design Originally presented at PACS2002
25
PowerTimer
[Flow diagram: Program Executable or Trace => Architectural Performance Simulator => CPI and AF/SF Data. Circuit Power Data (Macros), Tech Parms, uArch Parms, and AF/SF Data => Compute Sub-Unit Power, where SubUnit Power = f(SF, uArch, Tech) => Power]
26
PowerTimer Energy Models

Energy models for uArch structures formed by
summation of circuit-level macro data

27
Empirical Estimates with CPAM

Estimate power under Input Hold and Input
Switching Modes
Input Hold: all macro inputs (except clocks) held
Can also collect data for clock gate signals
Input Switching: apply random switching patterns
with 50% switching on input pins

[Figure: a macro and its inputs, measured at 0% switching (hold power) and at 50% switching]
28
Example Unit

Made up of 5 macros

29
PowerTimer Models: f(SF)
Assumption: power is linearly dependent on Switching
Factor. This separates Clock Power and Switching
Power
[Figure: power vs. SF, split into a Clock Power component and a Switching Power component]
At 0% SF, Power = Clock Power (significant
without clock gating)
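
A minimal sketch of the linear assumption, using the two measurement points from the CPAM slide (hold power at 0% switching and power at 50% switching). The function names and the milliwatt values are hypothetical.

```python
def fit_linear_power(hold_power, power_at_50):
    """Return (clock_power, slope) for P(SF) = clock_power + slope * SF.

    SF is a fraction (0.5 means 50% switching). The intercept at
    SF = 0 is taken to be the clock power.
    """
    slope = (power_at_50 - hold_power) / 0.5
    return hold_power, slope


def macro_power(sf, hold_power, power_at_50):
    """Interpolate or extrapolate macro power at switching factor `sf`."""
    clock_power, slope = fit_linear_power(hold_power, power_at_50)
    return clock_power + slope * sf


# Hypothetical macro: 10 mW hold power, 30 mW at 50% switching.
print(f"{macro_power(0.10, 10.0, 30.0):.1f} mW")
```

With chip-level switching factors well below 50%, most evaluations of this line fall close to the intercept, which is why the clock power term matters so much.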
30
Key Activity Data
Changes in SF
Changes in AF

SF => Moves along the Switching Power Curve
Estimated on a per-unit basis from RTL Analysis
AF => Moves along the Clock Power Curve
Extracted from Microarchitectural Statistics
(Turandot)

31
Microarchitectural Statistics

Stats are very similar to tracking used in
Wattch, etc
Differences
Clock Gating Modes (3 modes)
Customized Scaling Based on Circuit Style (4
styles)
Clock Gating Modes
P_constrained = P_unconstrained (not
clock-gateable)
P_constrained_1 = AF * (Pclock + Plogic) (common)
P_constrained_2 = AF * Pclock + Plogic (rare)
P_constrained_3 = Pclock + AF * Plogic (very
rare)
Scaling Based on Circuit Styles
AF_1 = valid (Latch-and-Mux, No Stall Gating)
AF_2 = valid - stalls (Latch-and-Mux, With
Stall Gating)
AF_3 = writes (Arrays that only gate updates)
AF_4 = writes + reads (Arrays, RAM Macros)
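
The gating modes and circuit-style scalings can be sketched as below. The operators are an assumption: the transcript dropped them, and this sketch reads the modes as AF multiplying the gated component(s) and the remaining component being added. Dividing the event counts by the cycle count to get a fraction is also an assumption, and it only holds for single-ported structures.

```python
def constrained_power(mode, af, p_clock, p_logic):
    """Scale unconstrained clock and logic power by activity factor `af`."""
    if mode == 0:                      # not clock-gateable
        return p_clock + p_logic
    if mode == 1:                      # clock and logic both gated (common)
        return af * (p_clock + p_logic)
    if mode == 2:                      # only the clock is gated (rare)
        return af * p_clock + p_logic
    if mode == 3:                      # only the logic is gated (very rare)
        return p_clock + af * p_logic
    raise ValueError(f"unknown gating mode: {mode}")


def activity_factor(style, cycles, valid=0, stalls=0, writes=0, reads=0):
    """Per-cycle activity for circuit styles 1-4 (assumed normalization)."""
    counts = {1: valid, 2: valid - stalls, 3: writes, 4: writes + reads}
    return counts[style] / cycles


# Hypothetical latch-and-mux stage with stall gating over 100 cycles.
af = activity_factor(2, cycles=100, valid=60, stalls=10)
print(constrained_power(1, af, p_clock=4.0, p_logic=6.0))
```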

32
Clock Gating: Valid-Bit Gating

Latch-based structures: execute pipelines, issue
queues

[Figure: a clock distributed to latch stages, each gated by its own valid bit (V)]
33
Clock Gating Modes

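P_constrained_1 = AF * (Pclock + Plogic)

[Figure: clock gated by valid; labels: clock, valid, Plogic, Pclock]

P_constrained_2 = AF * Pclock + Plogic

[Figure: clock gated by valid, with selection logic; labels: clock, Selection Logic, valid, Pclock, Plogic]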
34
Valid-bit Gating, Stalls?

Option 1: Stalls cannot be gated

[Figure: pipeline latch; labels: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]

Option 2: Stalls can be gated

[Figure: pipeline latch; labels: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]
35
Scaling Array Structures

Option 1: Reads and writes eligible to gate for
power

[Figure: array cell; labels: Write Bitline, Read Bitline, read_wordline_active, read_gate, write_wordline_active, write_gate, Cell, read_data, write_data]
36
Scaling Array Structures

Option 2: Only writes eligible to gate for power

[Figure: array cell; labels: Write Bitline, write_wordline_active, write_gate, Cell, write_data, read_entry_0, read_entry_1, read_entry_2, read_entry_n, read_data]
37
12 Clock Gating Modes
Mode  Valid  Valid - Stalls  Writes  Writes + Reads  Gate Both  Gate Clock  Gate Logic  Examples
0     No     No              No      No              No         No          No          Control Logic, Buffers, Small Macros
1     Yes    No              No      No              Yes        No          No          Issue Queues, Execute Pipelines
2     No     Yes             No      No              Yes        No          No          Issue Queues, Execute Pipelines
3     No     No              Yes     No              Yes        No          No          Caches
4     No     No              No      Yes             Yes        No          No          Some Queues
5     Yes    No              No      No              No         Yes         No          CAMs, Selection Logic
6     No     Yes             No      No              No         Yes         No          CAMs, Selection Logic
7     No     No              Yes     No              No         Yes         No          No known macros
8     No     No              No      Yes             No         Yes         No          No known macros
9     Yes    No              No      No              No         No          Yes         No known macros
10    No     Yes             No      No              No         No          Yes         No known macros
11    No     No              Yes     No              No         No          Yes         No known macros
12    No     No              No      Yes             No         No          Yes         No known macros
38
PowerTimer Observations

PowerTimer works well for POWER4-like estimates
and derivatives
Scale base microarchitecture quite well
E.g. optimal power-performance pipelining study
Lack of run-time, bit-level SF not seen as a
problem within IBM (seen as noise)
Chip bit-level SFs are quite low (5-15%)
Most (60-70%) power is dissipated while
maintaining state (arrays, latches, clocks)
Much state is not available in early-stage timers

39
Comparing models Flexibility

Flexibility necessary for certain studies
Resource tradeoff analysis
Modeling different architectures
Purely analytical tools provide
fully-parameterizable power models
Within this methodology, circuit design styles
could also be studied
PowerTimer scales power models in a user-defined
manner for individual sub-units
Constrained to structures and circuit-styles
currently in the library
Perhaps Mixed Mode tools could be very useful

40
Comparing models Accuracy

PowerTimer -- Based on validation of individual
pieces
Extensive validation of the performance model
(AFs)
Power estimates from circuits are accurate
Circuit designers must vouch for clock gating
scenarios
Certain assumptions will limit accuracy or
require more in-depth analysis
Analytical Tools
Inherent Issues
Analytical estimates cannot be as accurate as
SPICE analysis (C estimates, CV2 approximation)
Practical Issues
Without industrial data, must estimate transistor
sizing, bits per structure, circuit choices

41
Comparing models Speed

Performance simulation is slow enough!
Post-Processing vs. Run-Time Estimates
Wattch's per-cycle power estimates: roughly 30%
overhead
Post-processing (per-program power estimates)
would be much faster (minimal overhead)
PowerTimer allows both no-overhead
post-processing and run-time analysis for certain
studies (di/dt, thermal)
Some clock gating modes may require run-time
analysis
Third option: bit vector dumps
Flexible post-processing => huge output files

42
Power modeling summary

Wattch provides excellent relative accuracy
Underestimates full chip power (some units not
modeled, etc)
PowerTimer models based on circuit-level power
analysis
Inaccuracy is introduced in SF/AF and scaling
assumptions

43
Overview

Motivation (Kevin)
Thermal issues (Kevin)
Power modeling (David)
Thermal management (David)
Optimal DTM (Lev)
Clustering (Antonio)
Power distribution (David)
What current chips do (Lev)
HotSpot (Kevin)

44
Existing Work

Research Ideas
DEETM Huang and Torrellas MICRO2000
DTM Brooks and Martonosi HPCA2001
Control-Theoretic DTM Skadron, Abdelzaher, Stan
HPCA2002
Thermal Scheduling Cai, Lim, Daasch WCED2002
Commercial Products
PowerPC G3 Microprocessor
Pentium III
Pentium 4

45
Overview

Hard to optimize power-performance at design time
for all cases
Forces conservative choices for issues like
cooling, current delivery, resource sizes
Want to explore dynamic power optimizations for
run-time power management
Dynamic Voltage/Frequency Scaling Burd, 2000
Dynamic Hardware Resizing Albonesi, 1999
Fetch Throttling Sanchez, 1997
Global Clock Gating Gunther, 2001
Speculation Control Manne, 1998
Dynamic Thermal Management Brooks, 2001; Huang,
2000

46
Important to optimize P T early
[Figure: curves labeled 12FO4, 14FO4, 18FO4, and 23FO4, with a Maximum Power Budget line]
47
Dynamic Thermal Management

Goal
Provide dynamic techniques to cool chip when
needed
Exploit natural variations due to different
applications, phase behavior,
Allow designers to target average, rather than
worst-case behavior
Design Decisions
Mechanism and policy for triggering response?
What should response be?
How to select DTM trigger levels?

48
Power consumption impacts cost
From Gunther, et al. Managing the Impact of
Increasing Microprocessor Power Consumption,
Intel Technology Journal, Q1, 2001
CPU

System costs associated with power dissipation
Thermal control cost
Heatsinks, fans
Power delivery
Power supply
Decoupling caps

49
Average and Worst Case Power

System costs are constrained by worst case power
dissipation
Average case power dissipation can often be much
lower
Aggressive Clock Gating
Applications variations
Underutilized resources
Not enough ILP
Floating point units during integer code
execution
Currently about a 30% difference
Likely to further diverge

50
Dynamic Thermal Management
DTM Disabled
51
DTM Definitions
52
DTM When, How, and What
53
DTM Trigger Mechanisms

Mechanism How to deduce temperature?
Direct approach Temperature sensors providing
feedback
Implemented in some PowerPC chips (G3, G4)
Sanchez, 1997
Sensor quantity, placement, and precision will be
discussed later
Other indirect approaches possible

Policy When to begin responding?
Trigger level set too high: packaging cost will
be high
Little advantage
Trigger level set too low:
frequent triggering causes performance to suffer
Choose trigger level to exploit difference
between average and worst-case power.

54
DTM Initiation Mechanisms

Operating system or microarchitectural control?
Hardware support can significantly reduce
performance penalty
Policy Delay Settings
For Volt/Freq scaling, much of the performance
penalty can be attributed to enabling/disabling
Increasing policy delay reduces overhead; smarter
initiation techniques would help as well

55
DTM Response Mechanisms

Scaling Techniques
Clock Frequency Scaling Intel Pentium 4
Voltage and Frequency Scaling
Temperature-tracking frequency scaling Skadron03
Adjusts frequency to account for T-dep. of
switching speed
Microarchitectural Techniques
Speculation Control Manne98
Low-Power Cache Techniques Huang00
Hierarchical Responses
Decode Throttling Sanchez97
Fetch Toggling Brooks01
Feedback controlled Fetch Gating Skadron02
Migrating Computation Skadron03
Dual Pipelines Lim02

56
Dynamic Voltage/Frequency Scale

Voltage Scheduler predicts workload requirements
Set frequency/voltage to near-optimal, energy
savings
Burd, et al., ISSCC2000
5MHz @ 1.2V: 6 MIPS, 2.8mW
80MHz @ 3.8V: 85 MIPS, 460mW
70us: 1.2V <-> 3.8V
Transmeta Crusoe
Commercial implementation (500-700MHz, 1.2-1.6V)

57
Temperature-Tracking Frequency

Temperature affects
Transistor threshold and mobility
Ion, Ioff, Igate, delay
ITRS: 85C for high-performance, 110C for
embedded!
So adjust frequency as f(T) -- TT-DFS

[Figure: plots of Ioff and Ion (NMOS)]
58
Speculation Control

Manne et al. (ISCA 98)
Branch confidence estimator used to determine
whether to speculate
Pipeline gating based on confidence estimation
38% reduction in wrong-path instructions with 1%
performance loss
But Parikh et al. (HPCA 02) found much smaller
savings: ED-product improvement is zero or negative
Significant energy savings only come with
significant loss of performance
This is because many instructions are squashed
early in the pipeline, so reduction in wrong-path
instructions is not a useful metric
Benefit is actually a function of prediction
accuracy
Only for very badly predicted programs do you get
benefit
Well-predicted programs suffer

59
Dynamic Hardware Resizing

Complexity Adaptive Processors
Based on application characteristics
Underutilized structures may be reduced with
minimal performance impact
Resize Caches, Issue Queues, etc.
Resize => Reduce Capacitance => Reduce Energy
Of course, this only helps manage heat if it
reduces heat dissipation within hot spots
And does so for a sufficiently long duration

60
DEETM

Dynamic Energy Efficiency and Temperature
Management
Slack algorithm detects if slowdown can be
tolerated
If so, invoke techniques to reduce energy
Temperature algorithm
If temperature limit is reached, invokes
techniques
Techniques considered
Filter Cache, Voltage Scaling, etc.

61
Control-theoretic DTM

Fetch toggling
disable fetch every N cycles
4/5, 2/3, 1/2, 1/3, 1/5,
How to set the fetch rate?
(Assume idealized temperature sensing)

[Figure: two pipeline diagrams, each with stages IF, ID, EX, MEM, WB]
62
Feedback-Control of Fetch Toggling

Formal feedback control
PID: m = Kc (e + Ki ∫e dt + Kd de/dt)
easy to compute
toggling = f(m)

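[Block diagram: feedback loop. The error e between the setpoint and the measured T (from the temp. sensor) drives the controller; its output m sets the actuator (I-fetch toggling); the resulting power P passes through the thermal dynamics to give temperature T]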
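
A minimal discrete-time sketch of a PID controller in the form m = Kc (e + Ki ∫e dt + Kd de/dt), with the output clamped to a toggling fraction. Several things here are assumptions rather than the presenters' design: the error is taken as measured temperature minus setpoint (so a hot chip gives positive output), f(m) is a simple clamp, the gains are arbitrary, and there is no anti-windup.

```python
class PIDController:
    """Textbook PID: m = Kc * (e + Ki * integral(e) + Kd * de/dt)."""

    def __init__(self, kc, ki, kd, dt):
        self.kc, self.ki, self.kd, self.dt = kc, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        """Advance one sample period and return the controller output m."""
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0           # no slope estimate on the first sample
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kc * (error + self.ki * self.integral
                          + self.kd * derivative)


def toggling_fraction(m):
    """Assumed f(m): clamp m to the fraction of cycles with fetch disabled."""
    return min(1.0, max(0.0, m))


pid = PIDController(kc=0.5, ki=0.1, kd=0.0, dt=1.0)
for measured in (82.8, 82.8):          # hypothetical sensor readings
    m = pid.update(measured - 81.8)    # error relative to the setpoint
    print(round(toggling_fraction(m), 3))
```

In hardware the continuous fraction would be quantized to one of the available toggling rates listed on the previous slide.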
63
Formal Feedback Control

Regulatory control problem: hold value to a
specified setpoint
Example: temperature
Proved that PID controller will not allow
temperature to exceed setpoint by more than 0.02
Max power dissipation, thermal dynamics, sampling
rate => max overshoot
This precision is excessive but illustrates the
value of formal feedback control theory

64
Performance Loss

Performance loss reduced by 65%

65
Migrating Computation

When one unit overheats, migrate its
functionality to a distant, spare unit (MC)
Spare register file (Skadron et al. 2003)
Separate core (CMP) (Heo et al. 2003)
Microarchitectural clusters
etc.
Raises many interesting issues
Cost-benefit tradeoff for that area
Use both resources (scheduling)
Extra power for long-distance communication
Floorplanning

66
Migrating Computation Reg File
67
Thermal Scheduling (Cai 2002)
Majority mobile apps with performance requirements

Primary pipeline: maximal performance, complex
pipeline structure
Second pipeline: minimum power and energy
consumption, very simple in-order structure;
targets mobile anywhere-anytime applications
Transparent to OS and applications
Maximally utilizes on-die clock/power gating for
energy savings

Text email, caller ID, reminders, and other
non-high-performance apps with anywhere-anytime
requirements
68
Scheduling Algorithm (Cai 2002)
TS1
TS2
69
Hybrid DTM

DVS is attractive because of its cubic advantage
P ∝ V²f
This factor dominates when DTM must be aggressive
But changing DVS setting can be costly
Resynchronize PLL
Sensitive to sensor noise => spurious changes
ILP techniques are attractive because they can
use instruction level parallelism to hide/reduce
impact of DTM
Only effective when DTM is mild
So use both!
Need to find crossover point
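
The "cubic advantage" can be illustrated with a short calculation. It assumes, to first order, that the achievable frequency scales linearly with supply voltage, so that P ∝ V²f becomes P ∝ V³; that linear relation and the 0.85 scale factor are assumptions for illustration.

```python
def relative_power_dvs(v_scale):
    """Relative power when V and f are scaled together: P ∝ V^2 f ∝ V^3."""
    return v_scale ** 3


def relative_power_freq_only(f_scale):
    """Relative power when only f is scaled at fixed voltage: P ∝ f."""
    return f_scale


scale = 0.85  # hypothetical: slow the clock to 85% of nominal
print(f"DVS:       {relative_power_dvs(scale):.3f}")
print(f"freq only: {relative_power_freq_only(scale):.3f}")
```

Both options cost about the same clock slowdown, but the combined voltage and frequency change removes far more power, which is why it wins when the required cooling is aggressive.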

70
Hybrid DTM, cont.

Combine fetch gating with DVS
When DVS is better, use it
Otherwise use fetch gating
Determined by magnitude of temperature overshoot
Crossover at FG duty cycle of 3
FG has low overhead, which helps reduce cost of
sensor noise
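
The selection rule described on this slide can be sketched as a threshold on temperature overshoot. The crossover value used here is a hypothetical parameter, not a number from the presentation; the 81.8 trigger is the threshold quoted in the simulation details.

```python
def choose_dtm_response(temp_c, trigger_c, crossover_overshoot_c):
    """Pick a DTM response from the size of the temperature overshoot."""
    overshoot = temp_c - trigger_c
    if overshoot <= 0:
        return "none"            # below the trigger: no response needed
    if overshoot < crossover_overshoot_c:
        return "fetch_gating"    # mild overshoot: fixed-duty-cycle FG
    return "dvs"                 # large overshoot: lower-voltage setting


# Hypothetical crossover of 0.5 degrees above the trigger.
for t in (81.0, 82.0, 83.0):
    print(t, choose_dtm_response(t, trigger_c=81.8, crossover_overshoot_c=0.5))
```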

71
Hybrid DTM, cont.

DVS doesn't need more than two settings for
thermal control
Lower voltage cools chip faster
FG by itself does need multiple duty cycles and
hence requires PI control
But in a hybrid configuration, FG does not
require PI control
FG is only used at mild DTM settings
Can pick one fixed duty cycle
This is beneficial because feedback control is
vulnerable to noise

72
Simulation Details

85C maximum temperature
Guard band requires a trigger threshold of 81.8C
Ambient temperature (inside computer case): 45C
Rpackage = 0.8 K/W (old package model)
0.7 K/W necessary if DTM not available
Die thickness 0.5mm
Currently neglecting interface material
9 SPEC2000 benchmarks, both integer and FP
4 hover near 81.8C, rest are above
SimpleScalar/Wattch, modified to model pipeline
and power of an Alpha 21364 as closely as
possible
Scaled to 130nm, 1.3V, 3.0 GHz
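
As a rough check on the package numbers above, the sketch below uses a lumped steady-state model, T = T_ambient + R_package × P, to estimate the sustained power each thermal resistance can support at the 85C limit. The single-resistance model is a simplifying assumption; it ignores localized hot spots and transient behavior.

```python
def max_sustained_power_w(t_max_c, t_ambient_c, r_package_k_per_w):
    """Largest steady-state power keeping T = T_amb + R * P at or below T_max."""
    return (t_max_c - t_ambient_c) / r_package_k_per_w


with_dtm = max_sustained_power_w(85.0, 45.0, 0.8)     # package used with DTM
without_dtm = max_sustained_power_w(85.0, 45.0, 0.7)  # needed without DTM
print(f"0.8 K/W: {with_dtm:.1f} W, 0.7 K/W: {without_dtm:.1f} W")
```

Under this model the better package buys about 7 W of headroom; DTM lets the design ship with the cheaper package and handle excursions dynamically.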

73
Performance Comparison

TT-DFS is best but can't prevent excess
temperature
Suitable for use with aggressive clock rates at
low temp.
Hybrid technique reduces DTM cost by 25% vs. DVS
(DVS overhead important)
A substantial portion of MC's benefit comes from
the altered floorplan, which separates hot units

74
Conclusions so far

DTM can be used to reduce cooling costs
Proper modeling is required
HotSpot is publicly available at
http://lava.cs.virginia.edu/HotSpot
ILP matters
Hybrid techniques beneficial
Merge advantages of different schemes
Simplify control
Architectural techniques important in thermal
design
Growing use of clusters and redundant units opens
an incredibly rich design space

75
DTM Summary and Key Issues