Title: Adaptive Power/Performance Management for High-end Microprocessors
1 Adaptive Power/Performance Management for High-end Microprocessors
- Prof. Margaret Martonosi
- Dept. of Electrical Engineering
- Princeton University
2 Motivation
- The obvious: Power is a big problem
  - SIA Roadmap: Power as a grand challenge for design, packaging, etc.
  - ISSCC 05: more and more elaborate approaches taken to address the issue
- Still obvious, but less so: Power is a constellation of important sub-problems
  - Dynamic energy related to battery life
  - Dynamic energy related to thermal control
  - dI/dt
  - Leakage energy
3 Real-Power techniques
- The analogy: Real-time techniques are about bounding timing behavior subject to constraints ("fast enough")
- Real-Power: manage and bound energy/thermal behavior subject to both static and dynamic constraints
  - Energy-efficient operation with fast-enough execution
  - Abide by thermal and power constraints
  - Composable relationships between different techniques
  - Mixture of static and dynamic strategies
4 Real-Power techniques
- The philosophy: Online measurement and dynamic analysis drive full-system power adaptation
[Figure: diagram with stages Measure, Analyze, React, Model]
5 This talk
- Control theory for managing DVFS in MCD processors
  - With Qiang Wu, Philo Juang, Doug Clark
  - ASPLOS 04, HPCA 05
- Coordinated control for Chip Multiprocessors
  - With Philo Juang, Qiang Wu, Li-Shiuan Peh
  - In submission
- Brief pointers to other work
  - Counter-based power estimation and phase analysis
    - With Canturk Isci, Alper Buyuktosunoglu,
    - ACEED closed session
  - Linear programming for compiler-managed DVFS
    - With Fen Xie and Sharad Malik
    - PLDI 03, and newer work in submission
6 Control-Theoretic Power Management
- Modern processors manage increasingly complex power/performance tradeoffs
  - Many interacting heuristics being applied
  - How effective are they across varied workloads?
  - How to bound their worst-case behavior?
  - How do they all interact with each other?
- Control theory is a formal-yet-practical way of
  - Answering such questions
  - Designing robust systems
- Thus far: Apply formal control techniques to
  - dI/dt management
  - DVFS in MCD processors
  - Speed/Energy balancing in CMPs
- End goal: Apply broadly and at several composed system layers
7 Lots of unused potential
- Often, the processor has little to do
  - Capable of 4 instructions per cycle, but real execution < 2 IPC
- Why run the CPU at full speed when you don't have to?
8 DVFS using Interface Queue
[Figure: interface queue q between two clock domains running at frequencies f1 and f2; labels: demand, arrival rate λ, service rate μ]
10 DVFS using Interface Queue
[Figure: interface queue q between two clock domains running at frequencies f1 and f2; labels: demand, arrival rate λ, service rate μ]
- Feedback control using queue occupancy as the feedback signal
- Challenges in designing it formally
  - System modeling?
  - Linearization, controller design?
11 A first application example: MCD Processors
- Multiple Clock Domain processors [Semeraro et al., HPCA 02]
- Partially asynchronous approach [Marculescu et al., ISCA 03]
  - Globally Asynchronous Locally Synchronous (GALS)
- Independent clock for each domain
- Domains communicate via interface queues
[Figure: four clock domains (Ifetch/Decode, INT exec, FP exec, Ld/St exec), each with its own clock frequency (f1 to f4)]
12 Design Flow for DVFS Controller
[Figure: design-flow diagram with the following boxes and labels]
- Processor (MCD) specification (queue length, etc.)
- DVFS control specification (control interval, etc.)
- Modeling of queue/domain dynamics
- Analysis/design toolbox
- System linearization
- Linear controller design, stability analysis
- Design plan, control parameters
- Tradeoff specification (how aggressively to save energy?)
- Energy/performance tradeoff analysis
- Hardware implementation
- Reference queue qref
- Processors (MCD) with DVFS control
13 Modeling Queue/Domain Dynamics
- A stochastic queuing-domain model (Section 3.3)
[Figure: queue q in a domain at frequency f, with arrival rate λ and service rate μ]
- Average queue occupancy changes due to differing demand and service rates
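The averaged behavior described here can be sketched as a simple fluid model. This is a minimal illustration, not the stochastic model of the cited paper: it assumes occupancy changes by (arrival rate minus service rate) per interval and is clamped to the physical queue size. The function names and the `capacity` default are hypothetical.

```python
def step_queue(q, arrival_rate, service_rate, dt=1.0, capacity=16):
    """Advance the average queue occupancy by one interval.

    Assumed fluid model: dq/dt = arrival_rate - service_rate,
    clamped to the physical range [0, capacity].
    """
    q_next = q + (arrival_rate - service_rate) * dt
    return min(max(q_next, 0.0), capacity)


def simulate(q0, arrival_rate, service_rate, steps, capacity=16):
    """Return the occupancy trace over `steps` intervals at constant rates."""
    trace = [q0]
    for _ in range(steps):
        trace.append(step_queue(trace[-1], arrival_rate, service_rate,
                                capacity=capacity))
    return trace
```

When demand exceeds the service rate the queue grows until it saturates; when the service rate is higher the queue drains to empty. That occupancy trend is the signal a queue-based DVFS controller reacts to.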
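14 Linear Controller Design
- PID controller
  - Proportional gain (KP)
  - Integral gain (KI)
  - Derivative gain (KD)
[Figure: queue q with arrival rate λ and service rate μ, reference qref, and controlled frequency f]
[Figure: control block diagram: the error e between qref and the fed-back q drives the controller, which sets f for the linearized system; the linearized system outputs q and has a disturbance input]
- Implementation: modest amount of hardware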
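A textbook discrete PID loop gives a feel for how the three gains act on the queue error. This is a sketch under stated assumptions, not the controller from the cited papers: the error is taken as q minus qref (so a fuller queue raises frequency), frequency is normalized, and the class name, `f_init`, and the clamping limits are hypothetical. Integral wind-up handling is omitted.

```python
class QueuePID:
    """Discrete PID controller mapping queue occupancy to a frequency setting."""

    def __init__(self, kp, ki, kd, q_ref, f_init, f_min, f_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.q_ref = q_ref
        self.f_init = f_init
        self.f_min, self.f_max = f_min, f_max
        self.integral = 0.0     # running sum of errors (integral term)
        self.prev_error = 0.0   # previous error (for the derivative term)

    def update(self, q):
        """Return the frequency to use for the next control interval."""
        error = q - self.q_ref  # positive when the queue is fuller than qref
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        f = (self.f_init
             + self.kp * error
             + self.ki * self.integral
             + self.kd * derivative)
        # Clamp to the range the hardware can actually deliver.
        return min(max(f, self.f_min), self.f_max)
```

For example, `QueuePID(kp=0.1, ki=0.0, kd=0.0, q_ref=4, f_init=0.5, f_min=0.25, f_max=1.0).update(6)` raises the normalized frequency above `f_init` because the queue sits above its reference.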
15 Specify Energy/Performance Tradeoff
- How aggressively to save energy?
- Or preserve performance?
- A simple lever: qref position
  - Increase qref: more aggressive in saving energy
  - Decrease qref: value performance more
  - qref adjustable by OS/application
- Software/hardware cooperation
  - Software makes overall tradeoff decisions
  - Hardware implements details of speed adaptation
16 Experimental Results
- Use an MCD simulator (based on Semeraro et al., HPCA 02)
- 4 clock domains (IF, INT, FP, LS), low-overhead DVFS
[Figure: block diagram of the simulated MCD processor. Labels: External, Front End, L1 I-cache, Main memory, Fetch Unit, Load/Store, ROB/Rename/Dispatch, L2 Cache, Integer, Floating-Point, Integer queue, FP queue, Mem input queue, Integer ALUs, FP ALUs, L1 D-cache]
17 An Illustrative Example
[Figure: plots of frequency settings and queue entries]
18 Energy and Performance Results
[Figure: average results over all benchmarks]
23 Energy and Speed-balancing on CMPs
- CMPs: increasingly common platform for high-end microprocessors
  - High performance potential in a complexity-effective design
- But not all cores are useful at full speed at all times
  - Limited parallelism
  - Memory or I/O stalls
- Via a CMP's inter-core networks, can see data communication relationships
- This work: Dynamically adapt power V/f settings according to data and CPU usage
24 DVFS using Producer-Consumer Cores
[Figure: queue q between producer and consumer cores running at frequencies f1 and f2; labels: demand, arrival rate λ, service rate μ]
- Strategy appears similar to MCD
  - Identify producer-consumer relationships
  - Speed balance based on data pileups in between them
25 Parallel Code and DVFS: An Example
[Figure: a Parent Thread sends out X numbers (100 cycles/number) to Helper Threads T1, T2, and T3, which process every 2nd, every 17th, and every 10,000th number; a Receiver has to wait for all X numbers to arrive]
- When one input buffer fills, Parent thread stalls
- Observation 1: Thread T1 has the most work to do
  - Threads T2 and T3 can run more slowly
- Observation 2: All threads (especially T2 and T3) have bursty work requirements
  - Must avoid oscillations
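Observation 1 can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the parent emits one number every 100 cycles (from the slide), that a helper only works on every `stride`-th number, and that each processed number costs `work_cycles` at full speed. The `work_cycles` value and the function name are hypothetical, and the calculation covers average load only, so it says nothing about the burstiness in Observation 2.

```python
def min_relative_speed(stride, work_cycles, send_interval=100):
    """Smallest fraction of full speed at which a helper keeps up on average.

    A helper that handles every `stride`-th number receives one item per
    stride * send_interval cycles and needs work_cycles (at full speed)
    to process it. The result is capped at 1.0 (full speed).
    """
    return min(1.0, work_cycles / (stride * send_interval))


# Hypothetical cost of 150 full-speed cycles per processed number.
for stride in (2, 17, 10_000):
    print(stride, min_relative_speed(stride, work_cycles=150))
```

Under these assumptions the every-2nd-number helper needs most of full speed, while the other two need only a small fraction of it on average.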
26 Options for CMP DVFS Policies
- Static DVFS settings for whole application
  - Based on profiling or application knowledge
  - Pro: simple, no overshoot or oscillation
  - Con: hard to gather application knowledge, especially for dynamically-varying parallel applications
- Locally-controlled, uncoordinated V/f settings per core
  - Pro: simple, fast, easy to scale
  - Con: doesn't account for inter-thread relationships
- Coordinated cross-chip control of DVFS settings
  - Pro: more realistic, more flexible
  - Con: slower, possibly harder to scale
  - Which info to transfer and how fast?
27 Engineering a Coordinated Control Scheme: Back to Example
- Over a sample interval
  - T1's queue is building up
  - T2's is coming down
  - T3's is relatively stable
- Which to speed up? T1 or T2?
- Bursty behavior means that queue occupancies must be averaged out
- Inter-relationships between threads mean that local queues alone are not enough
28 Introducing Dist-PID
- 1) Determine critical path using equation
  - $q_{target} = \left( K_P (q_k - q_{k-1}) + K_I q_k - \mu_k + \mu_{k-1} \right) / K_I$
- 2) Distribute to all processors
  - Exchange qtarget between processors
  - Choose highest qtarget seen: this is the critical path
- 3) Use highest qtarget as new qref and solve equation
  - $\mu_k = \mu_{k-1} + K_I (q_k - q_{ref}) + K_P (q_k - q_{k-1})$
- Intuitively
  - Who is the critical path?
  - To preserve performance, run that processor at maximum speed
  - To save energy, run everyone else slower
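The three steps can be sketched in a few lines. This is an illustration under explicit assumptions, not the authors' implementation. The update rule is read as mu_k = mu_{k-1} + Ki*(q_k - qref) + Kp*(q_k - q_{k-1}), and qtarget as the same equation solved for qref; the signs are inferred, since the operators are missing from the slide text. The `mu` entry is taken to be each core's locally desired rate, which the slide does not define. The dictionary layout and the `mu_min`/`mu_max` clamps are hypothetical.

```python
def q_target(q_k, q_prev, mu_k, mu_prev, kp, ki):
    """Step 1: the qref that would reproduce this core's local rate mu_k."""
    return (kp * (q_k - q_prev) + ki * q_k - mu_k + mu_prev) / ki


def next_mu(q_k, q_prev, mu_prev, q_ref, kp, ki):
    """Step 3: solve the (assumed) PI update for the new rate mu_k."""
    return mu_prev + ki * (q_k - q_ref) + kp * (q_k - q_prev)


def dist_pid_step(cores, kp, ki, mu_min, mu_max):
    """One Dist-PID round over all cores.

    `cores` is a list of dicts with keys q, q_prev, mu, mu_prev.
    Returns the shared qref and the new (clamped) rate for each core.
    """
    # Step 1: every core computes its own qtarget.
    targets = [q_target(c["q"], c["q_prev"], c["mu"], c["mu_prev"], kp, ki)
               for c in cores]
    # Step 2: exchange targets; the highest one marks the critical path.
    q_ref = max(targets)
    # Step 3: every core re-solves its update with the shared qref.
    rates = [next_mu(c["q"], c["q_prev"], c["mu_prev"], q_ref, kp, ki)
             for c in cores]
    return q_ref, [min(max(r, mu_min), mu_max) for r in rates]


cores = [
    {"q": 8, "q_prev": 6, "mu": 2.0, "mu_prev": 1.5},  # queue building up
    {"q": 4, "q_prev": 4, "mu": 1.0, "mu_prev": 1.0},  # queue stable
]
q_ref, rates = dist_pid_step(cores, kp=0.5, ki=0.25, mu_min=0.25, mu_max=4.0)
```

In this reading, the core with the highest qtarget keeps the rate it asked for, while every other core is pushed toward a lower rate by the larger shared qref.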
29 Dist-PID manages oscillation/bursts better than Local approaches
[Figure: frequency (MHz) versus time]
- Because of the communication, Dist-PID knows what speed to target
- Formal approach causes controller to gently zero in on optimal speed
30 Dist-PID outperforms Local-PID
- Quicksort: fast moving, high thread pressure
- Othello: slow moving, bursty
- 183.equake: statically balanced, steady
- 181.mcf: bimodal
- 300.twolf: small but significant and easy-to-identify opportunities
[Figure: energy-delay product per benchmark]
- Dist-PID: equal or better energy-delay product than Local-PID for all benchmarks
31 Dist-PID resiliency
- Dist-PID: more resilient than local approaches to error in processor load predictions
  - Othello, quicksort
[Figure: normalized execution time]
32 Microarchitectural Issues for Distributed Management
- Key requirements
  - Managing information flow
  - Detecting thread-to-thread critical path
  - Quick, responsive changes
- Network-Driven Processor (NDP): joint project with Profs. Peh and August
  - NDP: CMP with intelligent, adaptive routers
  - For dynamic management of parallelism and power
  - Track communication rates and CPU requirements of different threads
- NDP designed to support dynamic parallelism and power management
  - Spawn threads such that related threads are co-located
  - Schedule or migrate competing threads
  - Manage energy and temperature based on the same usage stats
34 OS-level Power Estimation
[Figure: diagram with stages Measure, Analyze, React, Model]
- Use hardware performance counters to gauge processor activity
- Analyze phases and adapt
- Recognize power/thermal hotspots and control
35 Counter-Based Power Estimation: An Overview of Our Approach
- Idealized view: for all components in a processor chip
  - $Power_I = MaxPower_I \times ArchScaling_I \times AccessRate_I$
  - MaxPower_I: from die area and stressmarks
  - ArchScaling_I: from microarchitectural properties
  - AccessRate_I: from CPU performance counters!
- More realistic view: handle non-linear scaling
  - Empirical multimeter measurement
  - Adds a NonGatedPower_I term
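The per-component estimate can be sketched as a small function. It assumes the idealized form is a product of the three factors named on the slide, with the non-gated term added on top, and that chip power is the sum over components. The function names and the numeric values are hypothetical placeholders, not measured data, and the empirical non-linear fitting is not modeled.

```python
def component_power(max_power, arch_scaling, access_rate, non_gated_power=0.0):
    """Estimated power of one component, in the same units as max_power.

    access_rate is the counter-derived utilization of the component,
    arch_scaling a microarchitecture-dependent scaling factor, and
    non_gated_power the part drawn even when the component is idle.
    """
    return max_power * arch_scaling * access_rate + non_gated_power


def chip_power(components):
    """Sum the per-component estimates over all modeled components."""
    return sum(component_power(**c) for c in components)


# Placeholder values for two components (illustrative only).
components = [
    {"max_power": 10.0, "arch_scaling": 1.0, "access_rate": 0.5},
    {"max_power": 4.0, "arch_scaling": 0.5, "access_rate": 0.25,
     "non_gated_power": 1.0},
]
total = chip_power(components)
```

Sampling the counters periodically and re-evaluating such a sum yields a running power estimate that can then be checked against multimeter measurements.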
36 Counter-Based Power Estimation: General Implementation
[Figure: implementation diagram; labels: PowerModel, Multimeter]
37 Intel Pentium 4 HPC-based model: SPEC Results
38 Analyzing and Predicting Power Phases
- WWC-03 and more: Developed a range of analysis techniques for discerning (from HPC readings) similarity in power behavior for different execution phases
- Also, simple predictors for determining the likely duration of a phase once it begins
- Based on composition of absolute and normalized power
40 Results Summary
- DVFS control for MCDs
  - 23 fold increase of Power/Perf ratio
    - automatic regulation, more effective decisions
  - More resilient and complete
    - guarantee stability and efficiency under extreme cases
- DVFS control for CMPs
  - Demonstrates value of distributed control
  - Improves energy-delay product by 8 over local approach
  - Improves energy-delay product for tightly coordinated applications by 8X
  - Resilient and stable in the face of inaccurate load factors
- Higher-level Power Analysis
  - OS and compiler can have a role to play as well
41 Conclusions
- Formal control has an important role to play in future computer systems
  - With increasing gap between worst-case and average-case execution, dynamic management is imperative
  - Need verifiable bounds on adaptive response magnitude and delay
  - Need composable behavior across multiple effects
- Multiple layers (hw and sw) of power adaptivity can work together towards real-power systems
  - Necessary performance, while also meeting power/thermal targets
42 Acknowledgments
- Students
  - Philo Juang (Graduating 2005)
  - Fen Xie (Graduating 2005, co-advised with Sharad Malik)
  - Qiang Wu (Graduating 2006, adviser Doug Clark)
  - Canturk Isci (Graduating 2006)
  - Gilberto Contreras (Graduating 2007)
- PhD alums from group
  - Prof. Russ Joseph, Northwestern Univ. (IBM grad fellowship)
  - Dr. Zhigang Hu, IBM Research
  - Prof. David Brooks, Harvard Univ. (IBM grad fellowship)
- Other colleagues: Doug Clark, Sharad Malik, Pradip Bose, Alper Buyuktosunoglu
- Grants: NSF, SRC, IBM, Intel
43 Why Local-PID worked for MCDs, but not CMPs
- In MCDs, input rate is smoother, higher, and more sustained
  - Thread creation and incoming data in CMPs are burstier
- Homogeneous domains in CMPs, heterogeneous in MCDs
  - Load/Store domain forced to be more aggressive
- Formal technique worked as advertised
  - But the static qref has no basis in reality
44 Analyzing and Predicting Power Phase Behavior