Adaptive Power/Performance Management for High-end Microprocessors

1
Adaptive Power/Performance Management for
High-end Microprocessors
  • Prof. Margaret Martonosi
  • Dept. of Electrical Engineering
  • Princeton University

2
Motivation
  • The obvious: Power is a big problem
  • SIA Roadmap: Power as a grand challenge for design,
    packaging, etc.
  • ISSCC '05: more and more elaborate approaches
    taken to address the issue
  • Still obvious, but less so: Power is a
    constellation of important sub-problems
  • Dynamic energy related to battery life
  • Dynamic energy related to thermal control
  • dI/dt
  • Leakage energy

3
Real-Power techniques
  • The analogy: Real-time techniques are about
    bounding timing behavior subject to constraints:
    "fast enough"
  • Real-Power: manage and bound energy/thermal
    behavior subject to both static and dynamic
    constraints
  • Energy-efficient operation with fast-enough
    execution
  • Abide by thermal and power constraints
  • Composable relationships between different
    techniques
  • Mixture of static and dynamic strategies

4
Real-Power techniques
  • The philosophy: Online measurement and dynamic
    analysis drive full-system power adaptation

[Figure: Measure, Analyze, React, Model]
5
This talk
  • Control-theory for managing DVFS in MCD
    processors
  • With Qiang Wu, Philo Juang, Doug Clark
  • ASPLOS '04, HPCA '05
  • Coordinated control for Chip Multiprocessors
  • With Philo Juang, Qiang Wu, Li-Shiuan Peh
  • In submission
  • Brief pointers to other work
  • Counter-based power estimation and phase analysis
  • With Canturk Isci, Alper Buyuktosunoglu,
  • ACEED closed session
  • Linear programming for compiler-managed DVFS
  • With Fen Xie and Sharad Malik
  • PLDI '03, and newer work in submission

6
Control-Theoretic Power Management
  • Modern processors manage increasingly complex
    power/performance tradeoffs
  • Many interacting heuristics being applied
  • How effective are they across varied workloads?
  • How to bound their worst-case behavior?
  • How do they all interact with each other?
  • Control theory is a formal-yet-practical way of:
  • Answering such questions
  • Designing robust systems
  • Thus far: Apply formal control techniques to
  • dI/dt management
  • DVFS in MCD processors
  • Speed/Energy balancing in CMPs
  • End goal: Apply broadly and at several composed
    system layers

7
Lots of unused potential
  • Often, the processor has little to do
  • Capable of 4 instructions per cycle, but real
    execution < 2 IPC
  • Why run the CPU at full speed when you don't have
    to?

8
DVFS using Interface Queue
[Figure labels: demand, arrival rate, service rate, frequency f1, frequency f2, queue q]
9
DVFS using Interface Queue
[Figure labels: demand, arrival rate, service rate, frequency f1, frequency f2, queue q]
10
DVFS using Interface Queue
[Figure labels: demand, arrival rate, service rate, frequency f1, frequency f2, queue q]
Feedback control using queue occupancy as the feedback signal.
  • Challenges in designing it formally
  • System modeling?
  • Linearization? Controller design?
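
The feedback idea on this slide can be sketched in a few lines. This is a minimal illustration, not the controller from the talk. It assumes a deterministic queue whose occupancy changes by (arrival rate minus service rate) each control interval, a service rate equal to the normalized frequency, and an incremental proportional-integral update. The names `simulate`, `lam`, `kp`, `ki` and every numeric value are hypothetical.

```python
def simulate(q0=8.0, q_ref=4.0, lam=0.6, f0=0.5, kp=0.5, ki=0.1,
             f_min=0.25, f_max=1.0, q_cap=16.0, steps=200):
    """Closed-loop sketch: adjust frequency f so queue occupancy q tracks q_ref.

    Assumptions (not from the slides): the queue changes by (lam - f) per
    control interval, and the service rate equals the normalized frequency f.
    """
    q, q_prev, f = q0, q0, f0
    history = []
    for _ in range(steps):
        # Speed up when the queue sits above the reference or is growing.
        f += ki * (q - q_ref) + kp * (q - q_prev)
        f = min(f_max, max(f_min, f))          # stay inside the frequency range
        q_prev = q
        q = min(q_cap, max(0.0, q + lam - f))  # queue cannot underflow or overflow
        history.append((q, f))
    return history


if __name__ == "__main__":
    final_q, final_f = simulate()[-1]
    print(f"final queue = {final_q:.3f}, final frequency = {final_f:.3f}")
```

With these made-up numbers the frequency first saturates at its maximum while the queue drains, then settles where the service rate matches the arrival rate and the queue sits at the reference.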

11
A first application example: MCD Processors
  • Multiple Clock Domain processors (Semeraro et al.,
    HPCA '02)
  • Partially asynchronous approach (Marculescu et
    al., ISCA '03)
  • -- Globally Asynchronous Locally Synchronous
    (GALS)
  • Independent clock for each domain
  • Domains communicate via interface queues

[Figure: four clock domains with frequencies f1 to f4: Ifetch/Decode, INT exec, FP exec, Ld/St exec]
12
Design Flow for DVFS Controller
[Figure: design-flow diagram with the following boxes and labels]
  • Processor (MCD) Specification (queue length, etc.)
  • DVFS Control Specification (control interval, etc.)
  • Modeling of Queue Domain Dynamics
  • Analysis/design toolbox
  • System Linearization
  • Linear Controller Design, Stability Analysis
  • Design plan, control parameters
  • Tradeoff Specification (how aggressively to save energy?)
  • Energy/Performance Tradeoff Analysis
  • Hardware Implementation
  • Reference queue qref
  • Processors (MCD) with DVFS Control
13
Modeling Queue/domain Dynamics
  • A stochastic queuing-domain model (Section 3.3)

[Figure labels: arrival rate, service rate, frequency f, queue q]
Average queue changes due to different demand
and service rates
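
One way to read the statement above is as a simple average-rate balance. This is a deterministic simplification, not the stochastic model the slide cites. The symbols λ (arrival rate), μ (service rate), and Δt (control interval) are notation assumed here.

```latex
\Delta q \;\approx\; (\lambda - \mu)\,\Delta t,
\qquad
q_{k+1} \;\approx\; q_k + (\lambda_k - \mu_k)\,\Delta t
```

If, as a further assumption, the service rate grows with the domain frequency f, then raising f drains the queue and lowering f lets it fill.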
14
Linear Controller Design
  • PID controller
  • Proportional gain (KP)
  • Integral gain (KI)
  • Derivative gain (KD)

[Figure: control block diagram of the linearized system; labels: qref, error e, frequency f, queue q, arrival rate, service rate, disturbance input]
Implementation: modest amount of hardware
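
For readers unfamiliar with the three gains, a textbook discrete PID update is sketched below. It is a generic illustration rather than the hardware controller from the talk. The class name `PID`, the gain values, and the choice of error as queue occupancy minus qref are assumptions.

```python
class PID:
    """Textbook discrete PID controller with a fixed control interval."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0    # running sum of errors
        self.prev_error = 0.0  # error seen at the previous interval

    def update(self, error):
        """Return the control output for one interval's error sample."""
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return (self.kp * error
                + self.ki * self.integral
                + self.kd * derivative)


if __name__ == "__main__":
    pid = PID(kp=2.0, ki=0.5, kd=1.0)   # hypothetical gains
    q_ref = 4.0                         # hypothetical reference occupancy
    for q in (6.0, 5.0):                # hypothetical queue samples
        print(pid.update(q - q_ref))
```

The output would then be mapped to a frequency adjustment. How that mapping and the gains are chosen is exactly what the design flow above is for.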
15
Specify Energy/Performance Tradeoff
  • How aggressively to save energy?
  • Or preserve performance?
  • A simple lever: qref position
  • Increase qref: more aggressive in saving energy
  • Decrease qref: value performance more
  • qref adjustable by OS/application
  • Software/hardware cooperation
  • Software: make overall tradeoff decisions
  • Hardware: implement details of speed adaptation

16
Experimental Results
  • Use an MCD simulator (based on Semeraro et al.,
    HPCA '02)
  • 4 clock domains (IF, INT, FP, LS), low-overhead
    DVFS

[Figure: simulated MCD processor. Blocks shown: Front End (L1-ICache, Fetch Unit, ROB, Rename, Dispatch); Integer (Integer queue, Integer ALUs); Floating-Point (FP queue, FP ALUs); Load/Store (Mem input queue, L1-Dcache, L2 Cache); External (Main memory)]
17
An Illustrative Example
  • Benchmark: Epic_Decode

[Figure: frequency settings and queue entries]
18
Energy and Performance Results
Average results over all benchmarks
23
Energy and Speed-balancing on CMPs
  • CMPs: increasingly common platform for high-end
    microprocessors
  • High performance potential in a
    complexity-effective design
  • But not all cores are useful at full speed at
    all times
  • Limited parallelism
  • Memory or I/O stalls
  • Via a CMP's inter-core network, we can see data
    communication relationships
  • This work: Dynamically adapt power V/f settings
    according to data and CPU usage

24
DVFS using Producer-Consumer Cores
[Figure labels: demand, arrival rate, service rate, frequency f1, frequency f2, queue q]
  • Strategy appears similar to MCD
  • Identify producer-consumer relationships
  • Speed balance based on data pileups in between
    them

25
Parallel Code and DVFS: An Example
[Figure: Parent thread (sends out X numbers; 100 cycles/number) feeding helper threads T1, T2, and T3; labels: process every 2nd number, process every 17th number, process every 10,000th number; Receiver (has to wait for all X numbers to arrive)]
  • When one input buffer fills, Parent thread stalls
  • Observation 1: Thread T1 has the most work to do
  • Threads T2 and T3 can run more slowly
  • Observation 2: All threads (especially T2 and T3)
    have bursty work requirements
  • Must avoid oscillations
26
Options for CMP DVFS Policies
  • Static DVFS settings for whole application
  • Based on profiling or application knowledge
  • Pro: simple, no overshoot or oscillation
  • Con: hard to gather application knowledge,
    especially for dynamically-varying parallel
    applications
  • Locally-controlled, uncoordinated V/f settings
    per core
  • Pro: simple, fast, easy to scale
  • Con: doesn't account for inter-thread
    relationships
  • Coordinated cross-chip control of DVFS settings
  • Pro: more realistic, more flexible
  • Con: slower, possibly harder to scale
  • Which info to transfer and how fast?

27
Engineering a Coordinated Control Scheme: Back
to Example
  • Over a sample interval
  • T1's queue is building up
  • T2's is coming down
  • T3's is relatively stable
  • Which to speed up? T1 or T2?
  • Bursty behavior means that queue occupancies must
    be averaged out
  • Inter-relationships between threads mean that
    local queues alone are not enough

28
Introducing Dist-PID
  • 1) Determine critical path using equation
  • $q_{target} = \left(K_p (q_k - q_{k-1}) + K_i q_k - \mu_k + \mu_{k-1}\right) / K_i$
  • 2) Distribute to all processors
  • Exchange qtarget between processors
  • Choose highest qtarget seen: this is the critical
    path
  • 3) Use highest qtarget as new qref and solve
    equation
  • $\mu_k = \mu_{k-1} + K_i (q_k - q_{ref}) + K_p (q_k - q_{k-1})$
  • Intuitively
  • Who is the critical path?
  • To preserve performance, run that processor at
    maximum speed
  • To save energy, run everyone else slower

29
Dist-PID manages oscillation/bursts better than
Local approaches
[Figure: frequency (MHz) versus time]
  • Because of the communication, Dist-PID knows
    what speed to target
  • Formal approach causes controller to gently
    zero in on optimal speed

30
Dist-PID outperforms Local-PID
  • Quicksort: fast moving, high thread pressure
  • Othello: slow moving, bursty
  • 183.equake: statically balanced, steady
  • 181.mcf: bimodal
  • 300.twolf: small but significant and easy to
    identify opportunities

[Figure: energy-delay product]
Dist-PID achieves equal or better energy-delay product
than Local-PID for all benchmarks
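
For reference, the energy-delay product is energy multiplied by execution time, and lower is better. The helper below shows how two schemes would be compared on that metric. The function names and the numbers are hypothetical and are not results from the talk.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def edp_ratio(candidate, baseline):
    """EDP of candidate relative to baseline; each argument is (energy, delay)."""
    return energy_delay_product(*candidate) / energy_delay_product(*baseline)


if __name__ == "__main__":
    local = (10.0, 2.0)   # made-up (joules, seconds)
    dist = (8.0, 2.0)     # made-up (joules, seconds)
    print(edp_ratio(dist, local))  # a value below 1 favors the candidate
```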
31
Dist-PID resiliency
  • Dist-PID: more resilient than local approaches to
    error in processor load predictions
  • Othello, quicksort

[Figure: normalized execution time]
32
Microarchitectural Issues for Distributed
Management
  • Key Requirements
  • Managing information flow
  • Detecting thread-to-thread critical path
  • Quick responsive changes
  • Network-Driven Processor (NDP): joint project
    with Profs. Peh and August
  • NDP = CMP + intelligent, adaptive routers
  • For dynamic management of parallelism and power
  • Track communication rates and CPU requirements of
    different threads
  • NDP designed to support dynamic parallelism and
    power management
  • Spawn threads such that related threads are
    co-located
  • Schedule or migrate competing threads
  • Manage energy and temperature based on same usage
    stats

34
OS-level Power Estimation
[Figure: Measure, Analyze, React, Model]
  • Use hardware performance counters to gauge
    processor activity
  • Analyze phases and adapt
  • Recognize power/thermal hotspots and control

35
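Counter-Based Power Estimation: An Overview of
Our Approach
  • Idealized view: for all components in a processor
    chip

$\text{Power}_i = \text{MaxPower}_i \times \text{ArchScaling}_i \times \text{AccessRate}_i$
  • MaxPower_i: die area, stressmarks
  • AccessRate_i: CPU performance counters
  • ArchScaling_i: from microarchitectural properties
  • More realistic view: handle non-linear scaling

$\text{Power}_i = \text{MaxPower}_i \times \text{ArchScaling}_i \times \text{AccessRate}_i + \text{NonGatedPower}_i$
  • Empirical: multimeter measurement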
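
A small sketch of how such a per-component model adds up to a chip-level estimate is given below. It assumes the component power is the product of the three factors plus a non-gated term, summed over components. The `Component` type, the component names, and every number are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    max_power: float              # watts at full activity
    arch_scaling: float           # unitless scaling factor
    access_rate: float            # assumed 0..1, derived from counters
    non_gated_power: float = 0.0  # watts drawn regardless of activity


def component_power(c):
    """Power of one component under the assumed model."""
    return c.max_power * c.arch_scaling * c.access_rate + c.non_gated_power


def total_power(components):
    """Chip-level estimate: sum of the per-component estimates."""
    return sum(component_power(c) for c in components)


if __name__ == "__main__":
    chip = [
        Component("int_alu", 4.0, 1.0, 0.5, 1.0),     # made-up values
        Component("l1_dcache", 8.0, 0.5, 0.25, 0.5),  # made-up values
    ]
    print(total_power(chip))
```

In a real setup the access rates would be refreshed from the hardware counters each sampling interval, and the estimate checked against a measured value.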
36
Counter-Based Power Estimation: General
Implementation
[Figure labels: PowerModel, Multimeter]
37
Intel Pentium 4 HPC-based model: SPEC Results
38
Analyzing and Predicting Power Phases
  • WWC-03 and more: Developed a range of analysis
    techniques for discerning (from HPC readings)
    similarity in power behavior for different
    execution phases
  • Also, simple predictors for determining the
    likely duration of a phase once it begins

Based on composition of absolute and normalized
power
40
Results Summary
  • DVFS control for MCDs
  • 23 fold increase of Power/Perf ratio
  • -- automatic regulation, more effective decisions
  • More resilient and complete
  • -- guarantee stability and efficiency under
    extreme cases
  • DVFS control for CMPs
  • Demonstrates value of distributed control
  • Improves energy-delay product by 8 over local
    approach
  • Improves energy-delay product for tightly
    coordinated applications by 8X
  • Resilient and stable in the face of inaccurate
    load factors
  • Higher-level Power Analysis
  • OS and compiler can have a role to play as well

41
Conclusions
  • Formal Control has an important role to play in
    future computer systems
  • With increasing gap between worst-case and
    average-case execution, dynamic management is
    imperative
  • Need verifiable bounds on adaptive response
    magnitude and delay
  • Need composable behavior across multiple effects
  • Multiple layers (hw and sw) of power adaptivity
    can work together towards real-power systems
  • Necessary performance, while also meeting
    power/thermal targets

42
Acknowledgments
  • Students
  • Philo Juang (Graduating 2005)
  • Fen Xie (Graduating 2005, co-advised with Sharad
    Malik)
  • Qiang Wu (Graduating 2006, adviser Doug Clark)
  • Canturk Isci (Graduating 2006)
  • Gilberto Contreras (Graduating 2007)
  • PhD Alums from group
  • Prof. Russ Joseph, Northwestern Univ. (IBM grad
    fellowship)
  • Dr. Zhigang Hu, IBM Research
  • Prof. David Brooks, Harvard Univ. (IBM grad
    fellowship)
  • Other colleagues: Doug Clark, Sharad Malik,
    Pradip Bose, Alper Buyuktosunoglu
  • Grants: NSF, SRC, IBM, Intel

43
Why Local-PID worked for MCDs, but not CMPs
  • In MCDs, input rate is smoother, higher, and more
    sustained
  • Thread creation and incoming data in CMPs are
    burstier
  • Homogeneous domains in CMPs, heterogeneous in
    MCDs
  • Load/Store domain forced to be more aggressive
  • Formal technique worked as advertised
  • But the static qref has no basis in reality

44
Analyzing and Predicting Power Phase Behavior