Overview - PowerPoint PPT Presentation

Slides: 76
Provided by: skadronsta

Transcript and Presenter's Notes

Title: Overview


1
Overview
  1. Motivation (Kevin)
  2. Thermal issues (Kevin)
  3. Power modeling (David)
  4. Thermal management (David)
  5. Optimal DTM (Lev)
  6. Clustering (Antonio)
  7. Power distribution (David)
  8. What current chips do (Lev)
  9. HotSpot (Kevin)

2
Power modeling
  • Research Power Simulators
  • Wattch: Brooks and Martonosi, ISCA 2000
  • SimplePower: Vijaykrishnan et al. (Penn State),
    ISCA 2000
  • TEMPEST: Dhodapkar et al. (Intel/Wisconsin)
  • PowerAnalyzer: Umich/Colorado
  • AccuPower: SUNY Binghamton
  • Industry Power Simulators
  • IBM PowerTimer: Brooks and Bose, PACS 2000
  • Intel ALPS: Gunther et al.

3
Power: The Basics
  • Dynamic power vs. static power
  • Dynamic: switching power
  • Static: leakage power
  • Dynamic power dominates, but static power is
    increasing in importance
  • Trends in each
  • Static power: steady, per-cycle energy cost
  • Dynamic power: capacitive and short-circuit
  • Capacitive power: charging/discharging at
    transitions from 0→1 and 1→0
  • Short-circuit power: power due to brief
    short-circuit current during transitions
  • Mostly focus on capacitive, but recent work on
    others

4
Capacitive Power Dissipation
Power = ½ C V² A f
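
As a rough illustration of the capacitive power formula on this slide, the sketch below reads C as switched capacitance, V as supply voltage, A as activity factor, and f as clock frequency. The numeric values are hypothetical and chosen only to show how the activity factor scales the result.

```python
def dynamic_power(capacitance_f, vdd_v, activity, freq_hz):
    """Capacitive switching power in watts: P = 1/2 * C * V^2 * A * f."""
    return 0.5 * capacitance_f * vdd_v ** 2 * activity * freq_hz

# Hypothetical unit: 1 nF of switched capacitance at 1.3 V and 3.0 GHz.
busy = dynamic_power(1e-9, 1.3, activity=0.5, freq_hz=3.0e9)
idle = dynamic_power(1e-9, 1.3, activity=0.1, freq_hz=3.0e9)
print(f"busy: {busy:.4f} W, idle: {idle:.4f} W")
```

Because power is linear in A, cutting the activity factor by 5x cuts this unit's dynamic power by 5x, which is the lever that clock gating pulls.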
5
Short-Circuit Power Dissipation
  • Short-Circuit Current caused by finite-slope
    input signals
  • Direct Current Path between VDD and GND when both
    NMOS and PMOS transistors are conducting

6
Leakage Power
  • Subthreshold currents grow exponentially with
    increases in temperature, decreases in threshold
    voltage
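
A minimal sketch of the dependence described on this slide, assuming the textbook first-order form for subthreshold current at zero gate bias, I proportional to exp(-Vth / (n * kT/q)). The slope factor n = 1.5 and the threshold voltages are hypothetical. The sketch ignores the temperature dependence of the prefactor and of Vth itself, so it shows the trend rather than predicting magnitudes.

```python
import math

K_OVER_Q = 8.617333e-5  # Boltzmann constant over electron charge, in V/K

def relative_subthreshold_leakage(vth_v, temp_k, n=1.5):
    """Unitless leakage trend at Vgs = 0: exp(-Vth / (n * kT/q))."""
    thermal_voltage = K_OVER_Q * temp_k
    return math.exp(-vth_v / (n * thermal_voltage))

base = relative_subthreshold_leakage(0.30, 300.0)
hotter = relative_subthreshold_leakage(0.30, 358.0)     # about 85 C
lower_vth = relative_subthreshold_leakage(0.25, 300.0)  # 50 mV lower threshold
print(f"hotter / base:    {hotter / base:.1f}x")
print(f"lower Vth / base: {lower_vth / base:.1f}x")
```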

7
Modeling Hierarchy and Tool Flow
8
Analysis Abstraction Levels
[Figure: abstraction levels, top to bottom: Application, Behavioral, Architectural (RTL), Logic (Gate), Transistor (Circuit). From top to bottom, analysis capacity goes from most to least, analysis accuracy from worst to best, analysis speed from fastest to slowest, analysis resources from least to most, and energy savings from most to least.]
9
Power/Performance abstractions
  • Low-level
  • Hspice
  • PowerMill
  • Medium-Level
  • RTL Models
  • Architecture-level
  • PennState SimplePower
  • Intel Tempest
  • Princeton Wattch
  • IBM PowerTimer
  • Umich/Colorado PowerAnalyzer

10
Low-level models: Hspice
  • Extracted netlists from circuit/layout
    descriptions
  • Diffusion, gate, and wiring capacitance is
    modeled
  • Analog simulation performed
  • Detailed device models used
  • Large systems of equations are solved
  • Can estimate dynamic and leakage power
    dissipation within a few percent
  • Slow, only practical for 10-100K transistors
  • PowerMill (Synopsys) is similar but about 10x
    faster

11
Medium-level models: RTL
  • Logic simulation obtains switching events for
    every signal
  • Structural VHDL or verilog with zero or
    unit-delay timing models
  • Capacitance estimates performed
  • Device Capacitance
  • Gate sizing estimates performed, similar to
    synthesis
  • Wiring Capacitance
  • Wire load estimates performed, similar to
    placement and routing
  • Switching event and capacitance estimates provide
    dynamic power estimates

12
Architecture level models
Power = ½ C V² A f
  • Bottom-up Approach
  • Estimate CV²f via analytical models
  • Tools: Wattch, PowerAnalyzer, Tempest
    (mixed-mode)
  • Top-Down Approach
  • Estimate CV²f via empirical measurements
  • Tools: PowerTimer, AccuPower, most industrial
    tools
  • Estimate A via statistics from
    architectural-performance simulators

13
Analytical Models: Capacitance
  • Requires modeling wire length and estimating
    transistor sizes
  • Related to RC Delay analysis for speed along
    critical path
  • But capacitance estimates require summing up all
    wire lengths, rather than only an accurate
    estimate of the longest one.

14
Register File Capacitance Analysis
[Figure: register file array, labeling the decoders, wordlines (number of entries), bitlines (data width of entries), pre-charge, cell access transistors (N1), cells, sense amps, and number of ports along both dimensions]
15
Register File Model Accuracy
(Numbers in Percent)
  • Validated against a register file schematic used
    in internal Intel design
  • Compared capacitance values with estimates from a
    layout-level Intel tool
  • Interconnect capacitance had largest errors
  • Model neglects poly connections
  • Differences in wire lengths -- difficult to tell
    wire distances of schematic nodes

16
Different Circuit Design Styles
  • RTL and architectural level power estimation
    require the tool/user to make circuit design
    style assumptions
  • Static vs. Dynamic logic
  • Single vs. Double-ended bitlines in register
    files/caches
  • Sense Amp designs
  • Transistor and buffer sizings
  • Generic solutions are difficult because many
    styles are popular
  • Within individual companies, circuit design
    styles may be fixed

17
Clock Gating: What, why, when?
[Figure: clock gating circuit with signals Clock, Gate, and Gated Clock]
  • Dynamic Power is dissipated on clock transitions
  • Gating off clock lines when they are unneeded
    reduces activity factor
  • But putting extra gate delays into clock lines
    increases clock skew
  • End results
  • Clock gating complicates design analysis but
    saves power.

18
Wattch: An Overview
  • Wattch's Design Goals
  • Flexibility
  • Planning-stage info
  • Speed
  • Modularity
  • Reasonable accuracy
  • Overview of Features
  • Parameterized models for different CPU units
  • Can vary size or design style as needed
  • Abstract signal transition models for speed
  • Can select different conditional clocking and
    input transition models as needed
  • Based on SimpleScalar (has been ported to many
    simulators)
  • Modular: can add new models for new units studied

19
Unit Modeling
  • Modeling Capacitance
  • Models depend on structure, bitwidth, design
    style, etc.
  • E.g., may model capacitance of a register file
    with bitwidth and number of ports as input
    parameters
  • Modeling Activity Factor
  • Use cycle-level simulator to determine number and
    type of accesses
  • reads, writes, how many ports
  • Abstract model of bitline activity

20
One Cycle in Wattch
  • On each cycle
  • determine which units are accessed
  • model execution time issues
  • model per-unit energy/power based on which units
    used and how many ports.
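
The per-cycle loop on this slide can be sketched roughly as follows. The unit names and per-access energies are hypothetical placeholders, not Wattch's values. Treating unused units as consuming nothing corresponds to idealized clock gating, which is only one of the conditional clocking styles the earlier slide mentions.

```python
# Hypothetical per-access energies in joules (illustrative only).
ENERGY_PER_ACCESS = {"regfile": 0.4e-9, "icache": 1.2e-9, "alu": 0.3e-9}

def cycle_energy(accesses):
    """Energy for one cycle, given {unit name: number of ports used}."""
    return sum(ENERGY_PER_ACCESS[unit] * ports for unit, ports in accesses.items())

def average_power(trace, freq_hz):
    """Average power in watts over a trace of per-cycle access dictionaries."""
    total_energy = sum(cycle_energy(cycle) for cycle in trace)
    return total_energy * freq_hz / len(trace)

trace = [
    {"icache": 1, "regfile": 2, "alu": 1},  # busy cycle
    {"icache": 1},                          # fetch only
    {},                                     # idle cycle, nothing accessed
]
print(f"{average_power(trace, freq_hz=1.0e9):.4f} W")
```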

21
Units Modeled by Wattch
22
PowerTimer
  • IBM Tool First Developed During Summer of 2000
  • Continued Development 2001 -> Today
  • Methodology Applied to Research and Product
    Power-Performance Simulators with IBM
  • Currently in Beta-Release
  • Working towards Full Academic Release

23
PowerTimer Empirical Power
Pre-silicon, POWER4-like superscalar design
24
Processor Power Density
Pre-silicon, POWER4-like superscalar
design Originally presented at PACS2002
25
PowerTimer
[Figure: PowerTimer flow. A program executable or trace drives the architectural performance simulator, which produces CPI and AF/SF data. Circuit power data (macros), tech parms, and uArch parms feed the sub-unit power computation, SubUnit Power = f(SF, uArch, Tech), which outputs power.]
26
PowerTimer Energy Models
  • Energy models for uArch structures formed by
    summation of circuit-level macro data

27
Empirical Estimates with CPAM
  • Estimate power under Input Hold and Input
    Switching modes
  • Input Hold: all macro inputs (except clocks) held
  • Can also collect data for clock gate signals
  • Input Switching: apply random switching patterns
    with 50% switching on input pins

[Figure: a macro and its inputs, annotated with 0% switching (hold power) and 50% switching power]
28
Example Unit
  • Made up of 5 macros

29
PowerTimer Models: f(SF)
Assumption: power is linearly dependent on switching
factor (SF). This separates clock power and switching
power.
[Figure: power vs. SF, showing the Clock Power and Switching Power components]
At 0% SF, Power = Clock Power (significant
without clock gating)
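
One way to read the linearity assumption, as a sketch: fit a straight line through the two CPAM measurements described earlier (hold power at 0% switching and power at 50% switching) and evaluate it at the SF of interest. The function name and the milliwatt values are hypothetical.

```python
def macro_power(sf, hold_power_w, power_at_50_w):
    """Linear model through the two measured points, SF = 0 and SF = 0.5."""
    slope = (power_at_50_w - hold_power_w) / 0.5
    return hold_power_w + slope * sf

# Hypothetical macro: 10 mW hold (clock) power, 16 mW at 50% switching.
clock_power = macro_power(0.0, 0.010, 0.016)
typical = macro_power(0.10, 0.010, 0.016)
print(f"clock only: {clock_power * 1e3:.1f} mW, at 10% SF: {typical * 1e3:.1f} mW")
```

The intercept is the clock power and the slope term is the switching power, which is the separation the slide describes.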
30
Key Activity Data
Changes in SF
Changes in AF
  • SF => Moves along the Switching Power Curve
  • Estimated on a per-unit basis from RTL Analysis
  • AF => Moves along the Clock Power Curve
  • Extracted from Microarchitectural Statistics
    (Turandot)

31
Microarchitectural Statistics
  • Stats are very similar to tracking used in
    Wattch, etc
  • Differences
  • Clock Gating Modes (3 modes)
  • Customized Scaling Based on Circuit Style (4
    styles)
  • Clock Gating Modes
  • P_constrained = P_unconstrained (not
    clock-gateable)
  • P_constrained_1 = AF * (Pclock + Plogic) (common)
  • P_constrained_2 = AF * Pclock + Plogic (rare)
  • P_constrained_3 = Pclock + AF * Plogic (very
    rare)
  • Scaling Based on Circuit Styles
  • AF_1 = valid (Latch-and-Mux, No Stall Gating)
  • AF_2 = valid - stalls (Latch-and-Mux, With
    Stall Gating)
  • AF_3 = writes (Arrays that only gate updates)
  • AF_4 = writes + reads (Arrays, RAM Macros)
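
A small sketch of the gating modes, reading the three constrained cases as AF * (Pclock + Plogic), AF * Pclock + Plogic, and Pclock + AF * Plogic. The mode names are hypothetical labels, and taking the unconstrained power to be Pclock + Plogic is an assumption made for illustration.

```python
def constrained_power(mode, af, p_clock, p_logic):
    """Macro power under a clock gating mode; af is an activity factor in [0, 1]."""
    if mode == "none":        # not clock-gateable (assumed: Pclock + Plogic)
        return p_clock + p_logic
    if mode == "gate_both":   # P_constrained_1 (common)
        return af * (p_clock + p_logic)
    if mode == "gate_clock":  # P_constrained_2 (rare)
        return af * p_clock + p_logic
    if mode == "gate_logic":  # P_constrained_3 (very rare)
        return p_clock + af * p_logic
    raise ValueError(f"unknown gating mode: {mode}")

for mode in ("none", "gate_both", "gate_clock", "gate_logic"):
    print(mode, constrained_power(mode, af=0.25, p_clock=2.0, p_logic=6.0))
```

The AF passed in would come from whichever circuit-style count applies, for example valid - stalls per cycle for a latch-and-mux structure with stall gating.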

32
Clock Gating: Valid-Bit Gating
  • Latch-Based Structures: Execute Pipelines, Issue
    Queues

[Figure: clock distributed to a row of latches, each gated by its own valid bit (V)]
33
Clock Gating Modes
  • P_constrained_1 = AF * (Pclock + Plogic)

[Figure labels: clock, valid, Plogic, Pclock]
  • P_constrained_2 = AF * Pclock + Plogic

[Figure labels: clock, Selection Logic, valid, Pclock, Plogic]
34
Valid-bit Gating, Stalls?
  • Option 1: Stalls cannot be gated

[Figure labels: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]
  • Option 2: Stalls can be gated

[Figure labels: clk, valid, Stall From Previous Pipestage, Data From Previous Pipestage, Data For Next Pipestage]
35
Scaling Array Structures
  • Option 1: Reads and Writes Eligible to Gate for
    Power

[Figure labels: Write Bitline, Read Bitline, read_wordline_active, read_gate, write_wordline_active, write_gate, Cell, read_data, write_data]
36
Scaling Array Structures
  • Option 2: Only Writes Eligible to Gate for Power

[Figure labels: Write Bitline, read_entry_0 through read_entry_n, read_data, write_wordline_active, write_gate, Cell, write_data]
37
12 Clock Gating Modes
Mode  Valid  Valid-Stalls  Writes  Writes+Reads  Gate Both  Gate Clock  Gate Logic  Examples
0     No     No            No      No            No         No          No          Control Logic, Buffers, Small Macros
1     Yes    No            No      No            Yes        No          No          Issue Queues, Execute Pipelines
2     No     Yes           No      No            Yes        No          No          Issue Queues, Execute Pipelines
3     No     No            Yes     No            Yes        No          No          Caches
4     No     No            No      Yes           Yes        No          No          Some Queues
5     Yes    No            No      No            No         Yes         No          CAMs, Selection Logic
6     No     Yes           No      No            No         Yes         No          CAMs, Selection Logic
7     No     No            Yes     No            No         Yes         No          No known macros
8     No     No            No      Yes           No         Yes         No          No known macros
9     Yes    No            No      No            No         No          Yes         No known macros
10    No     Yes           No      No            No         No          Yes         No known macros
11    No     No            Yes     No            No         No          Yes         No known macros
12    No     No            No      Yes           No         No          Yes         No known macros
38
PowerTimer Observations
  • PowerTimer works well for POWER4-like estimates
    and derivatives
  • Scale base microarchitecture quite well
  • E.g. optimal power-performance pipelining study
  • Lack of run-time, bit-level SF not seen as a
    problem within IBM (seen as noise)
  • Chip bit-level SFs are quite low (5-15%)
  • Most (60-70%) power is dissipated while
    maintaining state (arrays, latches, clocks)
  • Much state is not available in early-stage timers

39
Comparing models: Flexibility
  • Flexibility necessary for certain studies
  • Resource tradeoff analysis
  • Modeling different architectures
  • Purely analytical tools provide
    fully parameterizable power models
  • Within this methodology, circuit design styles
    could also be studied
  • PowerTimer scales power models in a user-defined
    manner for individual sub-units
  • Constrained to structures and circuit-styles
    currently in the library
  • Perhaps Mixed Mode tools could be very useful

40
Comparing models: Accuracy
  • PowerTimer -- Based on validation of individual
    pieces
  • Extensive validation of the performance model
    (AFs)
  • Power estimates from circuits are accurate
  • Circuit designers must vouch for clock gating
    scenarios
  • Certain assumptions will limit accuracy or
    require more in-depth analysis
  • Analytical Tools
  • Inherent Issues
  • Analytical estimates cannot be as accurate as
    SPICE analysis (C estimates, CV² approximation)
  • Practical Issues
  • Without industrial data, must estimate transistor
    sizing, bits per structure, circuit choices

41
Comparing models: Speed
  • Performance simulation is slow enough!
  • Post-Processing vs. Run-Time Estimates
  • Wattch's per-cycle power estimates: roughly 30%
    overhead
  • Post-processing (per-program power estimates)
    would be much faster (minimal overhead)
  • PowerTimer allows both no-overhead
    post-processing and run-time analysis for certain
    studies (di/dt, thermal)
  • Some clock gating modes may require run-time
    analysis
  • Third Option: Bit Vector Dumps
  • Flexible post-processing → huge output files

42
Power modeling summary
  • Wattch provides excellent relative accuracy
  • Underestimates full chip power (some units not
    modeled, etc)
  • PowerTimer models based on circuit-level power
    analysis
  • Inaccuracy is introduced in SF/AF and scaling
    assumptions

43
Overview
  1. Motivation (Kevin)
  2. Thermal issues (Kevin)
  3. Power modeling (David)
  4. Thermal management (David)
  5. Optimal DTM (Lev)
  6. Clustering (Antonio)
  7. Power distribution (David)
  8. What current chips do (Lev)
  9. HotSpot (Kevin)

44
Existing Work
  • Research Ideas
  • DEETM: Huang and Torrellas, MICRO 2000
  • DTM: Brooks and Martonosi, HPCA 2001
  • Control-Theoretic DTM: Skadron, Abdelzaher, Stan,
    HPCA 2002
  • Thermal Scheduling: Cai, Lim, Daasch, WCED 2002
  • Commercial Products
  • PowerPC G3 Microprocessor
  • Pentium III
  • Pentium 4

45
Overview
  • Hard to optimize power-performance at design time
    for all cases
  • Forces conservative choices for issues like
    cooling, current delivery, resource sizes
  • Want to explore dynamic power optimizations for
    run-time power management
  • Dynamic Voltage/Frequency Scaling (Burd, 2000)
  • Dynamic Hardware Resizing (Albonesi, 1999)
  • Fetch Throttling (Sanchez, 1997)
  • Global Clock Gating (Gunther, 2001)
  • Speculation Control (Manne, 1998)
  • Dynamic Thermal Management (Brooks, 2001; Huang,
    2000)

46
Important to optimize P T early
[Figure labels: 12FO4, 14FO4, 18FO4, 23FO4; Maximum Power Budget]
47
Dynamic Thermal Management
  • Goal
  • Provide dynamic techniques to cool chip when
    needed
  • Exploit natural variations due to different
    applications, phase behavior,
  • Allow designers to target average, rather than
    worst-case behavior
  • Design Decisions
  • Mechanism policy for triggering response?
  • What should response be?
  • How to select DTM trigger levels?

48
Power consumption impacts cost
From Gunther, et al. Managing the Impact of
Increasing Microprocessor Power Consumption,
Intel Technology Journal, Q1, 2001
CPU
  • System costs associated with power dissipation
  • Thermal control cost
  • Heatsinks, fans
  • Power delivery
  • Power supply
  • Decoupling caps

49
Average and Worst Case Power
  • System costs are constrained by worst case power
    dissipation
  • Average case power dissipation can often be much
    lower
  • Aggressive Clock Gating
  • Applications variations
  • Underutilized resources
  • Not enough ILP
  • Floating point units during integer code
    execution
  • Currently about a 30% difference
  • Likely to further diverge

50
Dynamic Thermal Management
DTM Disabled
51
DTM Definitions
52
DTM When, How, and What
53
DTM Trigger Mechanisms
  • Mechanism: How to deduce temperature?
  • Direct approach: temperature sensors providing
    feedback
  • Implemented in some PowerPC chips (G3, G4)
    (Sanchez, 1997)
  • Sensor quantity, placement, and precision will be
    discussed later
  • Other indirect approaches possible
  • Policy: When to begin responding?
  • Trigger level set too high: packaging cost will
    be high
  • Little advantage
  • Trigger level set too low
  • Frequent triggering causes performance to suffer
  • Choose trigger level to exploit difference
    between average and worst-case power.

54
DTM Initiation Mechanisms
  • Operating system or microarchitectural control?
  • Hardware support can significantly reduce
    performance penalty
  • Policy Delay Settings
  • For Volt/Freq scaling, much of the performance
    penalty can be attributed to enabling/disabling
  • Increasing policy delay reduces overhead; smarter
    initiation techniques would help as well

55
DTM Response Mechanisms
  • Scaling Techniques
  • Clock Frequency Scaling (Intel Pentium 4)
  • Voltage and Frequency Scaling
  • Temperature-tracking frequency scaling (Skadron03)
  • Adjusts frequency to account for temperature
    dependence of switching speed
  • Microarchitectural Techniques
  • Speculation Control (Manne98)
  • Low-Power Cache Techniques (Huang00)
  • Hierarchical Responses
  • Decode Throttling (Sanchez97)
  • Fetch Toggling (Brooks01)
  • Feedback-controlled Fetch Gating (Skadron02)
  • Migrating Computation (Skadron03)
  • Dual Pipelines (Lim02)

56
Dynamic Voltage/Frequency Scale
  • Voltage Scheduler predicts workload requirements
  • Set frequency/voltage to near-optimal, energy
    savings
  • Burd et al., ISSCC 2000
  • 5 MHz @ 1.2 V: 6 MIPS, 2.8 mW
  • 80 MHz @ 3.8 V: 85 MIPS, 460 mW
  • 70 µs to switch 1.2 V <-> 3.8 V
  • Transmeta Crusoe
  • Commercial implementation (500-700MHz, 1.2-1.6V)
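
The two Burd et al. operating points above can be compared on energy per instruction with simple arithmetic, since milliwatts divided by MIPS gives nanojoules per instruction. The function name is a hypothetical helper.

```python
def energy_per_instruction_nj(power_mw, mips):
    """Energy per instruction in nJ: (1e-3 W) / (1e6 instr/s) = 1e-9 J/instr."""
    return power_mw / mips

low = energy_per_instruction_nj(2.8, 6)      # 5 MHz @ 1.2 V
high = energy_per_instruction_nj(460.0, 85)  # 80 MHz @ 3.8 V
print(f"{low:.2f} nJ vs {high:.2f} nJ per instruction ({high / low:.1f}x)")
# prints "0.47 nJ vs 5.41 nJ per instruction (11.6x)"
```

The roughly 11.6x gap is close to the (3.8 / 1.2)² ≈ 10x that V² scaling alone would predict, which is why running slowly at low voltage saves energy and not only power.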

57
Temperature-Tracking Frequency
  • Temperature affects
  • Transistor threshold and mobility
  • Ion, Ioff, Igate, delay
  • ITRS: 85°C for high-performance, 110°C for
    embedded!
  • So adjust frequency as f(T) -- TTDFS

[Figure: Ioff and Ion for NMOS]
58
Speculation Control
  • Manne et al. (ISCA 98)
  • Branch confidence estimator used to determine
    whether to speculate
  • Pipeline gating based on confidence estimation
  • 38% reduction in wrong-path instructions with 1%
    performance loss
  • But Parikh et al. (HPCA 02) found much smaller
    savings: ED-product improvement is zero or negative
  • Significant energy savings only come with
    significant loss of performance
  • This is because many instructions are squashed
    early in the pipeline, so reduction in wrong-path
    instructions is not a useful metric
  • Benefit is actually a function of prediction
    accuracy
  • Only for very badly predicted programs do you get
    benefit
  • Well-predicted programs suffer

59
Dynamic Hardware Resizing
  • Complexity Adaptive Processors
  • Based on application characteristics
  • Underutilized structures may be reduced with
    minimal performance impact
  • Resize Caches, Issue Queues, etc.
  • Resize => Reduce Capacitance => Reduce Energy
  • Of course, this only helps manage heat if it
    reduces heat dissipation within hot spots
  • And does so for a sufficiently long duration

60
DEETM
  • Dynamic Energy Efficiency and Temperature
    Management
  • Slack algorithm detects if slowdown can be
    tolerated
  • If so, invoke techniques to reduce energy
  • Temperature algorithm
  • If temperature limit is reached, invokes
    techniques
  • Techniques considered
  • Filter Cache, Voltage Scaling, etc.

61
Control-theoretic DTM
  • Fetch toggling
  • disable fetch every N cycles
  • 4/5, 2/3, 1/2, 1/3, 1/5,
  • How to set the fetch rate?
  • (Assume idealized temperature sensing)

[Figure: two five-stage pipeline diagrams (IF, ID, EX, MEM, WB)]
62
Feedback-Control of Fetch Toggling
  • Formal feedback control
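  • PID: m = Kc (e + Ki ∫e dt + Kd de/dt)
  • easy to compute
  • toggling = f(m)

[Figure: feedback loop. The controller compares the measured T from the temp. sensor with the setpoint to form the error e and outputs m; the actuator (I-fetch toggling) changes power P, which drives the thermal dynamics and hence temperature T.]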
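
A minimal discrete-time sketch of the PID law on this slide, m = Kc (e + Ki ∫e dt + Kd de/dt). The sign convention (e = measured temperature minus setpoint), the gains, and the clamping function standing in for f(m) are all assumptions made for illustration, not the settings used in the original work.

```python
class PIDController:
    """Discrete PID: m = Kc * (e + Ki * integral(e) + Kd * de/dt)."""

    def __init__(self, kc, ki, kd, setpoint, dt):
        self.kc, self.ki, self.kd = kc, ki, kd
        self.setpoint = setpoint
        self.dt = dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, measured_temp):
        error = measured_temp - self.setpoint  # positive when too hot (assumed sign)
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0                   # no history on the first sample
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kc * (error + self.ki * self.integral + self.kd * derivative)

def toggling_fraction(m):
    """Hypothetical f(m): clamp controller output to a fraction of gated fetch cycles."""
    return min(1.0, max(0.0, m))

pid = PIDController(kc=0.5, ki=0.1, kd=0.0, setpoint=81.8, dt=1.0)
for temp in (80.0, 81.8, 82.3, 82.8):
    print(temp, round(toggling_fraction(pid.update(temp)), 3))
```

This version lets the integral term go negative while the chip is cool, which delays the response once it heats up; a practical controller would likely limit that windup.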
63
Formal Feedback Control
  • Regulatory control problem: hold value to a
    specified setpoint
  • Example: temperature
  • Proved that PID controller will not allow
    temperature to exceed setpoint by more than 0.02
  • Max power dissipation, thermal dynamics, sampling
    rate → max overshoot
  • This precision is excessive but illustrates the
    value of formal feedback control theory

64
Performance Loss
  • Performance loss reduced by 65%

65
Migrating Computation
  • When one unit overheats, migrate its
    functionality to a distant, spare unit (MC)
  • Spare register file (Skadron et al. 2003)
  • Separate core (CMP) (Heo et al. 2003)
  • Microarchitectural clusters
  • etc.
  • Raises many interesting issues
  • Cost-benefit tradeoff for that area
  • Use both resources (scheduling)
  • Extra power for long-distance communication
  • Floorplanning

66
Migrating Computation Reg File
67
Thermal Scheduling (Cai 2002)
Majority of mobile apps with performance requirements
  • Primary pipeline: maximal performance, complex
    pipeline structure
  • Second pipeline: minimum power and energy
    consumption, very simple in-order structure,
    targets mobile anywhere-anytime applications
  • Transparent to OS and applications
  • Makes maximal use of on-die clock/power gating
    to save energy

Text email, caller ID, reminders, and other
non-high-performance apps with anywhere-anytime
requirements
68
Scheduling Algorithm (Cai 2002)
[Figure labels: TS1, TS2]
69
Hybrid DTM
  • DVS is attractive because of its cubic advantage
  • P ∝ V²f
  • This factor dominates when DTM must be aggressive
  • But changing DVS setting can be costly
  • Resynchronize PLL
  • Sensitive to sensor noise → spurious changes
  • ILP techniques are attractive because they can
    use instruction level parallelism to hide/reduce
    impact of DTM
  • Only effective when DTM is mild
  • So use both!
  • Need to find crossover point

70
Hybrid DTM, cont.
  • Combine fetch gating with DVS
  • When DVS is better, use it
  • Otherwise use fetch gating
  • Determined by magnitude of temperature overshoot
  • Crossover at FG duty cycle of 3
  • FG has low overhead, which helps reduce the cost
    of sensor noise

71
Hybrid DTM, cont.
  • DVS doesn't need more than two settings for
    thermal control
  • Lower voltage cools chip faster
  • FG by itself does need multiple duty cycles and
    hence requires PI control
  • But in a hybrid configuration, FG does not
    require PI control
  • FG is only used at mild DTM settings
  • Can pick one fixed duty cycle
  • This is beneficial because feedback control is
    vulnerable to noise

72
Simulation Details
  • 85°C maximum temperature
  • Guard band requires a trigger threshold of 81.8°C
  • Ambient temperature (inside computer case): 45°C
  • Rpackage = 0.8 K/W (old package model)
  • 0.7 K/W necessary if DTM not available
  • Die thickness: 0.5 mm
  • Currently neglecting interface material
  • 9 SPEC2000 benchmarks, both integer and FP
  • 4 hover near 81.8°C, rest are above
  • SimpleScalar/Wattch, modified to model pipeline
    and power of an Alpha 21364 as closely as
    possible
  • Scaled to 130 nm, 1.3 V, 3.0 GHz
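
As a back-of-the-envelope check on the package numbers above, a single lumped thermal resistance gives the steady-state relation T = T_ambient + R_package × P. The sketch below inverts it to find the sustained power at which the die would just reach 85°C. It ignores the die, the interface material, transients, and localized hot spots, so it is only a rough bound, not the HotSpot-style model used in the simulations.

```python
def max_sustainable_power(t_max_c, t_ambient_c, r_package_k_per_w):
    """Steady-state power at which the die just reaches t_max: (Tmax - Tamb) / R."""
    return (t_max_c - t_ambient_c) / r_package_k_per_w

with_dtm = max_sustainable_power(85.0, 45.0, 0.8)
without_dtm = max_sustainable_power(85.0, 45.0, 0.7)
print(f"0.8 K/W: {with_dtm:.1f} W, 0.7 K/W: {without_dtm:.1f} W")
# prints "0.8 K/W: 50.0 W, 0.7 K/W: 57.1 W"
```

Under these assumptions the 0.7 K/W package buys about 7 W of extra headroom; DTM covers that gap dynamically so that the less capable 0.8 K/W package suffices.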

73
Performance Comparison
  • TT-DFS is best but can't prevent excess
    temperature
  • Suitable for use with aggressive clock rates at
    low temp.
  • Hybrid technique reduces DTM cost by 25% vs. DVS
    (DVS overhead important)
  • A substantial portion of MC's benefit comes from
    the altered floorplan, which separates hot units
74
Conclusions so far
  • DTM can be used to reduce cooling costs
  • Proper modeling is required
  • HotSpot is publicly available at
    http://lava.cs.virginia.edu/HotSpot
  • ILP matters
  • Hybrid techniques beneficial
  • Merge advantages of different schemes
  • Simplify control
  • Architectural techniques important in thermal
    design
  • Growing use of clusters and redundant units opens
    an incredibly rich design space

75
DTM Summary and Key Issues
  • Dynamic optimizations translate max-power problem
    to average-power problem
  • Heightens importance of average-power techniques
    like clock gating
  • Key Issues
  • Initiation interval
  • Collection of possible response mechanisms