
1
Adaptive Power/Performance Management for
High-end Microprocessors

Prof. Margaret Martonosi
Dept. of Electrical Engineering
Princeton University

2
Motivation

The obvious: Power is a big problem
SIA Roadmap: Power as grand challenge for design, packaging, etc.
ISSCC '05: more and more elaborate approaches taken to address the issue
Still obvious, but less so: Power is a constellation of important sub-problems
Dynamic energy related to battery life
Dynamic energy related to thermal control
dI/dt
Leakage energy

3
Real-Power techniques

The analogy: Real-time techniques are about bounding timing behavior subject to constraints.
"Fast enough"
Real-Power: manage and bound energy/thermal behavior subject to both static and dynamic constraints
Energy-efficient operation with fast-enough execution
Abide by thermal and power constraints
Composable relationships between different techniques
Mixture of static and dynamic strategies

4
Real-Power techniques

The philosophy: Online measurement and dynamic analysis drive full-system power adaptation

[Figure: cycle of four stages: Measure, Analyze, React, Model]

5
This talk

Control theory for managing DVFS in MCD processors
With Qiang Wu, Philo Juang, Doug Clark
ASPLOS '04, HPCA '05
Coordinated control for Chip Multiprocessors
With Philo Juang, Qiang Wu, Li-Shiuan Peh
In submission
Brief pointers to other work
Counter-based power estimation and phase analysis
With Canturk Isci, Alper Buyuktosunoglu,
ACEED closed session
Linear programming for compiler-managed DVFS
With Fen Xie and Sharad Malik
PLDI '03, and newer work in submission

6
Control-Theoretic Power Management

Modern processors manage increasingly complex power/performance tradeoffs
Many interacting heuristics being applied
How effective are they across varied workloads?
How to bound their worst-case behavior?
How do they all interact with each other?
Control theory is a formal-yet-practical way of
Answering such questions
Designing robust systems
Thus far: Apply formal control techniques to
dI/dt management
DVFS in MCD processors
Speed/energy balancing in CMPs
End goal: Apply broadly and at several composed system layers

7
Lots of unused potential

Often, the processor has little to do
Capable of 4 instructions per cycle, but real execution < 2 IPC
Why run the CPU at full speed when you don't have to?

8
DVFS using Interface Queue
[Figure: interface queue q between two clock domains at frequencies f1 and f2, labeled with demand, arrival rate λ, and service rate μ]
10
DVFS using Interface Queue
[Figure: interface queue q between two clock domains at frequencies f1 and f2, labeled with demand, arrival rate λ, and service rate μ]

Feedback control using the queue occupancy as the feedback signal.

Challenges in designing it formally:
System modeling?
Linearization, controller design?

11
A first application example: MCD Processors

Multiple Clock Domain processors (Semeraro et al., HPCA '02)
Partially asynchronous approach (Marculescu et al., ISCA '03)
-- Globally Asynchronous Locally Synchronous (GALS)
Independent clock for each domain
Domains communicate via interface queues

[Figure: four clock domains with independent frequencies f1 to f4: Ifetch/Decode, INT exec, FP exec, Ld/St exec]
12
Design Flow for DVFS Controller
[Figure: design flow diagram with the following boxes and labels]
Processor (MCD) specification (queue length, etc.)
DVFS control specification (control interval, etc.)
Modeling of queue/domain dynamics
Analysis/design toolbox
System linearization
Linear controller design, stability analysis
Design plan, control parameters
Tradeoff specification (how aggressively to save energy?)
Energy/performance tradeoff analysis
Hardware implementation
Reference queue qref
Processors (MCD) with DVFS control
13
Modeling Queue/domain Dynamics

A stochastic queuing-domain model (Section 3.3)

[Figure: queue q with arrival rate λ, drained by a domain running at frequency f with service rate μ]

The average queue occupancy changes because the demand (arrival) and service rates differ
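
The queue behavior described on this slide can be illustrated with a much simpler model than the stochastic one the slide cites. The following is a minimal sketch under stated assumptions: the service rate is taken to be proportional to the domain frequency, and the occupancy follows dq/dt = λ - μ, clamped to the queue size. The names `simulate_queue`, `work_per_cycle`, `q_max`, and `dt` are hypothetical and do not come from the talk.

```python
def simulate_queue(arrival_rates, freq, work_per_cycle=1.0, q0=0.0,
                   q_max=16.0, dt=1.0):
    """Track average queue occupancy for a fixed domain frequency.

    Assumptions (not from the slides): the service rate is
    mu = work_per_cycle * freq, and occupancy evolves as
    dq/dt = lambda - mu, clamped to the range [0, q_max].
    """
    q = q0
    history = []
    mu = work_per_cycle * freq
    for lam in arrival_rates:
        # Occupancy grows when demand exceeds service, shrinks otherwise.
        q = q + (lam - mu) * dt
        q = min(max(q, 0.0), q_max)
        history.append(q)
    return history


if __name__ == "__main__":
    # A burst of high demand followed by a quiet period.
    demand = [1.0] * 5 + [0.25] * 5
    print(simulate_queue(demand, freq=0.5))
```

A growing occupancy suggests the consuming domain is too slow for the current demand, and a draining queue suggests it could be slowed down, which is the signal the controller on the following slides reacts to.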
14
Linear Controller Design
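PID controller
Proportional gain (KP)
Integral gain (KI)
Derivative gain (KD)

[Figure: control block diagram of the linearized system. The queue occupancy q is compared with the reference qref to form the error e, the PID controller sets the frequency f (and hence the service rate μ), and the arrival rate λ enters as a disturbance input]

Implementation: modest amount of hardware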
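
To make the three gains concrete, here is a minimal sketch of a discrete PID loop that maps queue occupancy to a frequency setting. It is an illustration only, not the controller from the talk: the class name `PIDController`, the base frequency `f_base`, the clamping range, and the sign convention (a queue above qref raises the frequency) are all assumptions.

```python
class PIDController:
    """Discrete PID loop that maps queue occupancy to a clock frequency."""

    def __init__(self, kp, ki, kd, q_ref, f_base, f_min, f_max):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.q_ref = q_ref
        self.f_base = f_base              # hypothetical nominal frequency
        self.f_min, self.f_max = f_min, f_max
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, q, dt=1.0):
        """Return the next frequency given the sampled occupancy q."""
        error = q - self.q_ref            # positive when the queue is too full
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        f = (self.f_base
             + self.kp * error
             + self.ki * self.integral
             + self.kd * derivative)
        # Keep the result inside the legal frequency range.
        return min(max(f, self.f_min), self.f_max)


if __name__ == "__main__":
    ctrl = PIDController(kp=0.1, ki=0.01, kd=0.0, q_ref=4.0,
                         f_base=0.5, f_min=0.25, f_max=1.0)
    for q in [4.0, 6.0, 8.0, 5.0, 3.0]:
        print(q, ctrl.update(q))
```

This sketch has no anti-windup handling, so the integral term keeps accumulating while the output is clamped. A real design would need to address that.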
15
Specify Energy/Performance Tradeoff

How aggressively to save energy?
Or preserve performance?
A simple lever: qref position
Increase qref: more aggressive in saving energy
Decrease qref: value performance more
qref adjustable by OS/application
Software/hardware cooperation
Software: makes overall tradeoff decisions
Hardware: implements details of speed adaptation

16
Experimental Results

Use an MCD simulator (based on Semeraro et al., HPCA '02)
4 clock domains (IF, INT, FP, LS), low-overhead DVFS

[Figure: MCD processor block diagram. Front End: L1-ICache, Fetch Unit, ROB/Rename/Dispatch. Integer: Integer queue, Integer ALUs. Floating-Point: FP queue, FP ALUs. Load/Store: Mem input queue, L1-Dcache, L2 Cache. External: Main memory]
17
An Illustrative Example

Benchmark: Epic_Decode

[Figure: plots of frequency settings and queue entries]
18
Energy and Performance Results
Average results over all benchmarks

23
Energy and Speed-balancing on CMPs

CMPs: increasingly common platform for high-end microprocessors
High performance potential in a complexity-effective design
But, not all cores are useful at full speed at all times
Limited parallelism
Memory or I/O stalls
Via a CMP's inter-core network, we can see data communication relationships
This work: Dynamically adapt power V/f settings according to data/CPU usage

24
DVFS using Producer-Consumer Cores
[Figure: producer and consumer cores at frequencies f1 and f2 connected by queue q, labeled with demand, arrival rate λ, and service rate μ]

Strategy appears similar to MCD:
Identify producer-consumer relationships
Speed-balance based on data pileups in between them

25
Parallel Code and DVFS: An Example
Parent Thread (sends out X numbers), 100 cycles/number
Helper Thread T1: process every 2nd number
Helper Thread T2: process every 17th number
Helper Thread T3: process every 10,000th number
Receiver (has to wait for all X numbers to arrive)

When one input buffer fills, the Parent thread stalls
Observation 1: Thread T1 has the most work to do
Threads T2 and T3 can run more slowly
Observation 2: All threads (especially T2 and T3) have bursty work requirements
Must avoid oscillations
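
Observation 1 follows from simple arithmetic on how often each helper is handed a number. The sketch below assumes every processed number costs the same amount of work on every helper, which the slide does not state. The function name `relative_work` is hypothetical.

```python
def relative_work(strides):
    """Work of each helper relative to the busiest helper.

    `strides` maps a thread name to N, meaning the thread processes every
    N-th number. Assumption (not from the slides): each processed number
    costs the same on every helper.
    """
    rates = {name: 1.0 / stride for name, stride in strides.items()}
    busiest = max(rates.values())
    return {name: rate / busiest for name, rate in rates.items()}


if __name__ == "__main__":
    shares = relative_work({"T1": 2, "T2": 17, "T3": 10_000})
    for name, share in shares.items():
        print(f"{name}: {share:.4%} of the busiest helper's work")
```

Under that assumption T2 needs roughly 12% of T1's throughput and T3 a tiny fraction of it, which is why both are candidates for lower V/f settings.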

26
Options for CMP DVFS Policies

Static DVFS settings for whole application
Based on profiling or application knowledge
Pro: simple, no overshoot or oscillation
Con: hard to gather application knowledge, especially for dynamically-varying parallel applications
Locally-controlled, uncoordinated V/f settings per core
Pro: simple, fast, easy to scale
Con: doesn't account for inter-thread relationships
Coordinated cross-chip control of DVFS settings
Pro: more realistic, more flexible
Con: slower, possibly harder to scale
Which info to transfer and how fast?

27
Engineering a Coordinated Control Scheme: Back to Example

Over a sample interval:
T1's queue is building up
T2's is coming down
T3's is relatively stable
Which to speed up? T1 or T2?
Bursty behavior means that queue occupancies must be averaged out
Inter-relationships between threads mean that local queues alone are not enough
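
The slide does not say how the queue occupancies are averaged. One common option is an exponential moving average, sketched below purely as an illustration. The function name `smoothed_occupancy` and the smoothing factor `alpha` are hypothetical.

```python
def smoothed_occupancy(samples, alpha=0.25):
    """Exponentially weighted moving average of queue-occupancy samples.

    Assumption: an EMA is one possible averaging scheme, not necessarily
    the one used in the talk. Smaller alpha smooths bursts more strongly
    but reacts more slowly to real load changes.
    """
    avg = None
    out = []
    for s in samples:
        # The first sample seeds the average; later samples are blended in.
        avg = s if avg is None else alpha * s + (1 - alpha) * avg
        out.append(avg)
    return out


if __name__ == "__main__":
    bursty = [0, 0, 12, 0, 0, 0, 12, 0]
    print(smoothed_occupancy(bursty))
```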

28
Introducing Dist-PID

1) Determine critical path using equation
$q_{target} = \left(K_p (q_k - q_{k-1}) + K_i q_k - (\mu_k - \mu_{k-1})\right) / K_i$
2) Distribute to all processors
Exchange qtarget between processors
Choose highest qtarget seen: this is the critical path
3) Use highest qtarget as new qref and solve equation
$\mu_k = \mu_{k-1} + K_i (q_k - q_{ref}) + K_p (q_k - q_{k-1})$

Intuitively:
Who is the critical path?
To preserve performance, run that processor at maximum speed
To save energy, run everyone else slower
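
The three steps on this slide can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the signs in the two equations are reconstructed so that step 1 is step 3 solved for qref, the μ_k in step 1 is read as each core's locally proposed rate, and the clamp to `[mu_min, mu_max]` is added so rates stay legal. All function names and the dictionary layout are hypothetical.

```python
def q_target(kp, ki, q_k, q_prev, mu_k, mu_prev):
    """Step 1: queue target implied by a core's locally proposed rate mu_k."""
    return (kp * (q_k - q_prev) + ki * q_k - (mu_k - mu_prev)) / ki


def next_rate(kp, ki, q_k, q_prev, mu_prev, q_ref, mu_min, mu_max):
    """Step 3: PI update against the shared reference, clamped (assumption)."""
    mu = mu_prev + ki * (q_k - q_ref) + kp * (q_k - q_prev)
    return min(max(mu, mu_min), mu_max)


def dist_pid_step(cores, kp, ki, mu_min=0.1, mu_max=1.0):
    """Run one Dist-PID interval over all cores.

    `cores` is a list of dicts with keys q_k, q_prev, mu_k, mu_prev.
    Returns the shared q_ref and the new rate for each core.
    """
    targets = [q_target(kp, ki, c["q_k"], c["q_prev"], c["mu_k"], c["mu_prev"])
               for c in cores]
    # Step 2: the exchange is modeled as a simple max over all targets.
    q_ref = max(targets)
    rates = [next_rate(kp, ki, c["q_k"], c["q_prev"], c["mu_prev"],
                       q_ref, mu_min, mu_max)
             for c in cores]
    return q_ref, rates


if __name__ == "__main__":
    cores = [
        {"q_k": 6.0, "q_prev": 4.0, "mu_k": 1.0, "mu_prev": 0.5},  # queue building up
        {"q_k": 2.0, "q_prev": 2.0, "mu_k": 0.5, "mu_prev": 0.5},  # stable queue
    ]
    print(dist_pid_step(cores, kp=0.2, ki=0.1))
```

With this reading, the core that supplied the highest qtarget keeps the rate it proposed, while every other core is updated against a higher reference than its own and therefore ends up slower, matching the intuition on the slide.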

29
Dist-PID manages oscillation/bursts better than Local approaches

[Figure: frequency (MHz) versus time]

Because of the communication, Dist-PID knows what speed to target
Formal approach causes the controller to gently zero in on the optimal speed

30
Dist-PID outperforms Local-PID

Quicksort: Fast moving, high thread pressure
Othello: Slow moving, bursty
183.equake: Statically balanced, steady
181.mcf: Bimodal
300.twolf: Small but significant and easy to identify opportunities

[Figure: energy-delay product]
Dist-PID: equal or better energy-delay product than Local-PID for all benchmarks
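
For reference, the energy-delay product is energy multiplied by execution time, so a scheme only wins if its energy savings outweigh any slowdown. The sketch below shows the comparison. The function names are hypothetical and the numbers in the usage example are made up for illustration, not results from the talk.

```python
def energy_delay_product(energy, delay):
    """Energy-delay product: lower is better."""
    return energy * delay


def edp_reduction(baseline, candidate):
    """Fractional EDP reduction of `candidate` relative to `baseline`.

    Both arguments are (energy, delay) pairs in consistent units.
    A positive result means the candidate has the lower EDP.
    """
    base = energy_delay_product(*baseline)
    cand = energy_delay_product(*candidate)
    return 1.0 - cand / base


if __name__ == "__main__":
    # Illustrative values only: 20% less energy at 12.5% more delay.
    print(edp_reduction(baseline=(10.0, 2.0), candidate=(8.0, 2.25)))
```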
31
Dist-PID resiliency

Dist-PID: More resilient than local approaches to error in processor load predictions
Othello, quicksort

[Figure: normalized execution time]
32
Microarchitectural Issues for Distributed Management

Key Requirements:
Managing information flow
Detecting thread-to-thread critical path
Quick, responsive changes

Network-Driven Processor (NDP): Joint project with Profs. Peh and August.
NDP = CMP + intelligent, adaptive routers
For dynamic management of parallelism and power
Track communication rates and CPU requirements of different threads
NDP is designed to support dynamic parallelism and power management
Spawn threads such that related threads are co-located
Schedule or migrate competing threads
Manage energy and temperature based on the same usage stats


34
OS-level Power Estimation
[Figure: cycle of four stages: Measure, Analyze, React, Model]

Use hardware performance counters to gauge processor activity
Analyze phases and adapt
Recognize power/thermal hotspots and control

35
Counter-Based Power Estimation: An Overview of Our Approach

Idealized view: For all components in a processor chip

$\mathrm{Power}_i = \mathrm{MaxPower}_i \times \mathrm{ArchScaling}_i \times \mathrm{AccessRate}_i$

MaxPower_i: from die area and stressmarks
ArchScaling_i: from microarchitectural properties
AccessRate_i: from CPU performance counters!

More realistic view: Handle non-linear scaling
Empirical: multimeter measurement
NonGatedPower_i term
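
The per-component estimate can be sketched in a few lines. This is a hedged illustration of the idealized product on this slide, not the authors' model: the non-gated term is assumed here to be a per-component additive constant, the access rate is assumed to be a counter value divided by elapsed cycles, and every name and number below is hypothetical.

```python
def access_rate(event_count, cycles):
    """Accesses per cycle, derived from a hardware performance counter."""
    return event_count / cycles


def component_power(max_power, arch_scaling, rate, non_gated_power=0.0):
    """Idealized product, plus an assumed additive non-gated term."""
    return max_power * arch_scaling * rate + non_gated_power


def chip_power(components, counters, cycles):
    """Sum the estimates over all components.

    `components` maps name -> (max_power, arch_scaling, non_gated_power).
    `counters` maps name -> event count over the sampling interval.
    """
    total = 0.0
    for name, (max_power, arch_scaling, non_gated) in components.items():
        rate = access_rate(counters.get(name, 0), cycles)
        total += component_power(max_power, arch_scaling, rate, non_gated)
    return total


if __name__ == "__main__":
    # Made-up parameters for two components, in watts.
    components = {"int_alu": (10.0, 0.5, 1.0), "l1_dcache": (4.0, 1.0, 0.0)}
    counters = {"int_alu": 50, "l1_dcache": 25}
    print(chip_power(components, counters, cycles=100))
```

In practice the per-component parameters would be calibrated against measured power, which is where the multimeter on the slide comes in.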
36
Counter-Based Power Estimation: General Implementation
[Figure: implementation diagram with PowerModel and Multimeter blocks]
37
Intel Pentium 4 HPC-based model: SPEC Results
38
Analyzing and Predicting Power Phases

WWC '03 and more: Developed a range of analysis techniques for discerning (from HPC readings) similarity in power behavior for different execution phases
Also, simple predictors for determining the likely duration of a phase once it begins

Based on composition of absolute and normalized power

40
Results Summary

DVFS control for MCDs
23 fold increase of Power/Perf ratio
-- automatic regulation, more effective decisions
More resilient and complete
-- guarantee stability and efficiency under extreme cases
DVFS control for CMPs
Demonstrates value of distributed control
Improves energy-delay product by 8 over local approach
Improves energy-delay product for tightly coordinated applications by 8X
Resilient and stable in the face of inaccurate load factors
Higher-level Power Analysis
OS and compiler can have a role to play as well

41
Conclusions

Formal control has an important role to play in future computer systems
With increasing gap between worst-case and average-case execution, dynamic management is imperative
Need verifiable bounds on adaptive response magnitude and delay
Need composable behavior across multiple effects
Multiple layers (hw and sw) of power adaptivity can work together towards real-power systems
Necessary performance, while also meeting power/thermal targets

42
Acknowledgments