Title: High-Performance Power-Aware Computing
1. High-Performance Power-Aware Computing
- Vincent W. Freeh
- Computer Science
- NCSU
- vin@csc.ncsu.edu
2. Acknowledgements
- NCSU
- Tyler K. Bletsch
- Mark E. Femal
- Nandini Kappiah
- Feng Pan
- Daniel M. Smith
- U of Georgia
- Robert Springer
- Barry Rountree
- Prof. David K. Lowenthal
3. The case for power management
- Eric Schmidt, Google CEO: "It's not speed but power, low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption is becoming a key issue
- Power limitations
- Energy → heat; heat dissipation is costly
- Non-trivial amount of money
- Consequence
- Excessive power consumption limits performance
- Fewer nodes can operate concurrently
- Goal
- Increase power/energy efficiency
- More performance per unit power/energy
4. CPU scaling
power ∝ frequency × voltage²
- How CPU scaling works (a small sketch follows below)
  - Reduce frequency and voltage
  - Reduces power and performance
- Energy/power gears
- Frequency-voltage pair
- Power-performance setting
- Energy-time tradeoff
- Why CPU scaling?
- Large power consumer
- Mechanism exists
[Figures: power and application throughput, each as a function of frequency/voltage]
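To make the relation concrete, here is a minimal sketch (not part of the talk) that plugs the fastest and slowest Athlon-64 gears listed later on the methodology slide into the slide's P ∝ f × V² relation; this covers CPU power only, not whole-system power.

```c
/* Minimal sketch (not from the slides): relative CPU power under DVS,
 * using the relation P ∝ f × V².  Gear values are the Athlon-64
 * settings from the methodology slide. */
#include <stdio.h>

static double rel_power(double f_mhz, double v) {
    return f_mhz * v * v;                /* proportional, arbitrary units */
}

int main(void) {
    double top = rel_power(2000, 1.5);   /* gear 0: 2000 MHz, 1.5 V */
    double low = rel_power(800, 1.0);    /* gear 6:  800 MHz, 1.0 V */
    printf("lowest gear draws ~%.0f%% of top-gear CPU power\n",
           100.0 * low / top);           /* prints about 18%        */
    return 0;
}
```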
5. Is CPU scaling a win?
[Figure: power vs. time over run time T at the full gear; system power Psystem splits into PCPU and Pother, so total energy splits into ECPU and Eother]
6. Is CPU scaling a win?
[Figure: power vs. time for the reduced gear overlaid on the full gear; PCPU drops but the run stretches from T to T + ΔT, so ECPU shrinks while Eother grows]
7. Our work
- Exploit bottlenecks
- Application waiting on bottleneck resource
- Reduce power consumption (non-critical resource)
- Generally CPU not on critical path
- Bottlenecks we exploit
- Intra-node (memory)
- Inter-node (load imbalance)
- Contributions
- Impact studies [HPPAC '05, IPDPS '05]
- Varying gears/nodes [PPoPP '05, PPoPP '06 (submitted)]
- Leveraging load imbalance [SC '05]
8. Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- Processor supports 7 frequency-voltage settings (gears)
  - Frequency (MHz): 2000, 1800, 1600, 1400, 1200, 1000, 800
  - Voltage (V): 1.5, 1.4, 1.35, 1.3, 1.2, 1.1, 1.0
- Measure
  - Wall clock time (gettimeofday system call)
  - Energy (external power meter)
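The slide says wall-clock time comes from the gettimeofday system call; a minimal sketch of how such a measurement is typically wrapped around a benchmark region (region_of_interest is a placeholder name, not from the talk):

```c
/* Minimal timing sketch: gettimeofday gives microsecond-resolution
 * wall-clock time around a measured region. */
#include <stdio.h>
#include <sys/time.h>

/* Placeholder workload; in the experiments this would be a NAS kernel. */
static void region_of_interest(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0;
}

int main(void) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    region_of_interest();
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("wall-clock time: %.6f s\n", secs);
    return 0;
}
```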
9. NAS
10. CG: 1 node
[Figure: CG energy and time at 2000 MHz vs. 800 MHz]
- Not CPU bound
- Little time penalty
- Large energy savings
11. EP: 1 node
[Figure: EP energy and time]
- CPU bound
- Big time penalty
- No (little) energy savings
12. Operations per miss
- CG: 8.60
13. Multiple nodes: EP
14. Multiple nodes: LU
- S2 = 1.9, E2 = 1.03
- S4 = 3.3, E4 = 1.15
- S8 = 5.8, E8 = 1.28
- Good speedup; the E-T tradeoff holds as N increases
15. Multiple nodes: MG
- S2 = 1.2, E2 = 1.41
- S4 = 1.6, E4 = 1.99
- S8 = 2.7, E8 = 2.29
- Poor speedup; E increases as N increases
16. Phases
17. Phases: LU
18. Phase detection
- First, divide the program into blocks
  - All code in a block executes in the same gear
  - Block boundaries
    - MPI operations
    - Expected OPM changes
- Then, merge adjacent blocks into phases (see the sketch below)
  - Merge if similar memory pressure
    - Use OPM
    - Merge if |OPM_i - OPM_{i+1}| is small
  - Merge if the block is small (short time)
- Note, in the future
  - Leverage the large body of phase detection research
  - [Kennedy & Kremer 1998; Sherwood et al. 2002]
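A minimal sketch of the merge step as I read it from this slide; the thresholds and data layout are illustrative assumptions, not values from the talk.

```c
/* Sketch: merge adjacent blocks into phases when their operations-per-miss
 * (OPM) values are close or a block is too short to matter.  Thresholds
 * are illustrative only. */
#include <math.h>
#include <stdio.h>

struct block { double opm; double seconds; };

#define OPM_EPSILON  5.0     /* "similar memory pressure" threshold */
#define MIN_SECONDS  0.01    /* "short time" threshold */

/* Returns the number of phases; phase_id[i] gives the phase of block i. */
static int merge_blocks(const struct block *b, int n, int *phase_id) {
    int phases = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 &&
            (fabs(b[i].opm - b[i - 1].opm) < OPM_EPSILON ||
             b[i].seconds < MIN_SECONDS)) {
            phase_id[i] = phase_id[i - 1];   /* merge with previous block */
        } else {
            phase_id[i] = phases++;          /* start a new phase */
        }
    }
    return phases;
}

int main(void) {
    struct block b[] = { {8.6, 1.2}, {9.1, 0.9}, {70.0, 2.0}, {71.5, 0.005} };
    int id[4];
    int n = merge_blocks(b, 4, id);
    printf("%d phases: %d %d %d %d\n", n, id[0], id[1], id[2], id[3]);
    return 0;
}
```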
19. Data collection
[Diagram: MPI-jack code interposed between the MPI application and the MPI library]
- Use MPI-jack (a PMPI-style hook is sketched below)
- Pre and post hooks
- For example
- Program tracing
- Gear shifting
- Gather profile data during execution
- Define MPI-jack hook for every MPI operation
- Insert pseudo MPI call at end of loops
- Information collected
- Type of call and location (PC)
- Status (gear, time, etc)
- Statistics (uops and L2 misses for OPM calculation)
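MPI-jack's own interface is not shown in the slides; as an assumption, here is how a pre/post hook around one MPI call can be written with the standard MPI profiling (PMPI) layer, which is the usual way such interposition libraries are built.

```c
/* Sketch of a pre/post hook around MPI_Barrier using the PMPI profiling
 * interface.  This only illustrates the interposition idea; MPI-jack's
 * real API may differ.  Compile into a library linked ahead of MPI. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int MPI_Barrier(MPI_Comm comm) {
    double t0 = now();               /* pre hook: e.g., record entry time   */
    int rc = PMPI_Barrier(comm);     /* forward to the real implementation  */
    double blocked = now() - t0;     /* post hook: e.g., accumulate slack   */
    fprintf(stderr, "MPI_Barrier blocked %.6f s\n", blocked);
    return rc;
}
```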
20. Example: BT
21. Comparing two schedules
- What is the best schedule?
  - It depends on the user
- The user supplies a better() function
  - bool better(i, j)
- Several metrics can be used
  - Energy-delay product
  - Energy-delay squared [Cameron et al., SC 2004]
22. Slope metric
- This project uses slope (see the sketch below)
  - Energy-time tradeoff
  - Slope = -1 → energy savings equals time delay
- The user defines the limit
  - Limit = 0 → minimize energy
  - Limit = -∞ → minimize time
- If slope < limit, then better
- We do not advocate this metric over others
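The slides give better() only as a signature; a minimal sketch of one plausible reading, where slope is the relative energy change divided by the relative time change between two schedules. The exact formula is not spelled out in the deck, so treat it as an assumption.

```c
/* Sketch of the slope-based better() test.  Here slope = (%ΔE) / (%ΔT);
 * schedule j is "better" than i if the slope of moving from i to j is
 * below the user-supplied limit (e.g., -1.5 as in the BT example).
 * The numbers below are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct schedule { double energy; double time; };

static double slope(struct schedule i, struct schedule j) {
    double de = (j.energy - i.energy) / i.energy;  /* relative energy change */
    double dt = (j.time   - i.time)   / i.time;    /* relative time change   */
    return de / dt;
}

static bool better(struct schedule i, struct schedule j, double limit) {
    return slope(i, j) < limit;
}

int main(void) {
    struct schedule full   = { 100.0, 10.0 };
    struct schedule scaled = {  90.0, 10.5 };
    printf("slope = %.2f, better = %d\n",
           slope(full, scaled), better(full, scaled, -1.5));
    return 0;
}
```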
23. Example: BT
Solutions (is slope < -1.5?):
1. 00 → 01: slope -11.7, true
2. 01 → 02: slope -1.78, true
3. 02 → 03: slope -1.19, false
4. 02 → 12: slope -1.44, false
02 is the best
24. Benefit of multiple gears: MG
25. Current work: number of nodes, gear/phase
26. Load imbalance
27. Node bottleneck
- The best course is to keep the load balanced
- Load balancing is hard
- Slow down a node if it is not the critical node
- How to tell if a node is not critical?
- Suppose a barrier
  - All nodes must arrive before any leave
  - No benefit to arriving early
- Measure blocking time
  - Assume it is (mostly) the same between iterations
- Assumptions
- Iterative application
- Past predicts future
28. Example
[Diagram: iterations k and k+1 between synchronization points; the non-critical node runs at relative performance (t - slack)/t instead of 1]
- Reduced performance and power → energy savings
29. Measuring slack
- Blocking operations
  - Receive
  - Wait
  - Barrier
- Measure with MPI-jack
- Too frequent
  - Can be hundreds or thousands per second
  - Aggregate slack over one or more iterations
- Computing slack, S (see the sketch below)
  - Measure times of the computing and blocking phases
  - T = C1 + B1 + C2 + B2 + ... + Cn + Bn
  - Compute aggregate slack
  - S = (B1 + B2 + ... + Bn) / T
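A minimal sketch of the aggregate-slack computation defined on this slide, assuming the per-interval compute and block times have already been collected (for example by the PMPI-style hooks above); the sample values are illustrative.

```c
/* Compute aggregate slack S = (B1 + ... + Bn) / T, where
 * T = C1 + B1 + ... + Cn + Bn. */
#include <stdio.h>

static double aggregate_slack(const double *compute, const double *block, int n) {
    double total = 0.0, blocked = 0.0;
    for (int i = 0; i < n; i++) {
        total   += compute[i] + block[i];
        blocked += block[i];
    }
    return blocked / total;
}

int main(void) {
    double compute[] = { 0.90, 0.85, 0.95 };   /* seconds in computation */
    double block[]   = { 0.10, 0.20, 0.15 };   /* seconds blocked in MPI */
    printf("slack S = %.2f\n", aggregate_slack(compute, block, 3));
    return 0;
}
```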
30. Slack
[Chart: communication slack for CG, Aztec, and Sweep3d]
- Slack
  - Varies between nodes
  - Varies between applications
- Use net slack (a reduction sketch follows below)
  - Each node individually determines its slack
  - Reduction to find the minimum slack
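The slide says each node computes its own slack and a reduction finds the minimum; a minimal sketch of that step with a standard MPI collective (variable names are mine):

```c
/* Sketch: every rank contributes its locally measured slack and all ranks
 * learn the minimum ("net") slack via MPI_Allreduce with MPI_MIN. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double my_slack = 0.10 + 0.01 * rank;   /* placeholder for measured slack */
    double net_slack;
    MPI_Allreduce(&my_slack, &net_slack, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("net (minimum) slack = %.2f\n", net_slack);
    MPI_Finalize();
    return 0;
}
```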
31. Shifting
- When to reduce performance?
  - When there is enough slack
- When to increase performance?
  - When application performance suffers
- Create high and low limits for slack (see the sketch below)
- Need damping
  - Dynamically learn the range
  - It is not the same for all applications
  - The range starts small
  - Increase it if necessary
[Diagram: measured slack relative to the low and high limits selects reduce gear, same gear, or increase gear]
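A minimal sketch of the shifting decision described on this slide; the limit values and the direction convention (a higher gear number meaning a slower frequency, as in the methodology table) are my assumptions.

```c
/* Sketch: choose the next gear from measured net slack.  Gear 0 is the
 * fastest setting (2000 MHz) and gear 6 the slowest (800 MHz); the high
 * and low slack limits are illustrative and would be learned and widened
 * (damping) at run time. */
#include <stdio.h>

#define NUM_GEARS 7

static int next_gear(int gear, double slack, double low, double high) {
    if (slack > high && gear < NUM_GEARS - 1)
        return gear + 1;     /* plenty of slack: shift to a slower gear     */
    if (slack < low && gear > 0)
        return gear - 1;     /* performance suffering: shift to faster gear */
    return gear;             /* within the band: keep the same gear         */
}

int main(void) {
    int gear = 0;
    double samples[] = { 0.25, 0.30, 0.08, 0.02, 0.15 };  /* per-iteration slack */
    for (int i = 0; i < 5; i++) {
        gear = next_gear(gear, samples[i], 0.05, 0.20);
        printf("iteration %d: slack %.2f -> gear %d\n", i, samples[i], gear);
    }
    return 0;
}
```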
32. Aztec gears
33. Performance
[Charts: performance for Aztec and Sweep3d]
34. Synthetic benchmark
35. Summary
- Contributions
- Improved energy efficiency of HPC applications
- Found a simple metric for locating phase boundaries
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
- Future work
- Reduce sampling interval to handful of iterations
- Reduce algorithm time w/ modeling and prediction
- Develop AMPERE
- a message passing environment for reducing energy
- http://fortknox.csc.ncsu.edu/osr/
- vin@csc.ncsu.edu, dkl@cs.uga.edu
36. End
37. Shifting test
[Chart: NAS LU, 1 node]
38. Beta
- Hsu & Kremer [PLDI '03]
- β relates application slowdown to CPU slowdown
  - β = 1 → time is CPU dependent
  - β = 0 → time is independent of the CPU
- OPM vs. β
  - Correlated
  - log(OPM) predicts β
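For context, the usual Hsu-Kremer formulation models slowdown as T(f)/T(fmax) = 1 + β(fmax/f - 1); the slide itself does not reproduce the formula, so treat this paraphrase as an assumption. A minimal sketch:

```c
/* Sketch of the β slowdown model commonly attributed to Hsu & Kremer:
 * T(f) = T(fmax) * (1 + beta * (fmax/f - 1)).  beta = 1 means execution
 * time scales with CPU frequency; beta = 0 means it does not. */
#include <stdio.h>

static double predicted_time(double t_fmax, double beta, double fmax, double f) {
    return t_fmax * (1.0 + beta * (fmax / f - 1.0));
}

int main(void) {
    double t_full = 100.0;                                /* seconds at 2000 MHz */
    printf("beta=1.0: %.1f s at 800 MHz\n",
           predicted_time(t_full, 1.0, 2000.0, 800.0));   /* 250.0 s */
    printf("beta=0.2: %.1f s at 800 MHz\n",
           predicted_time(t_full, 0.2, 2000.0, 800.0));   /* 130.0 s */
    return 0;
}
```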
39. OPM and β and slack
- OPM is not strongly correlated with β in the multi-node case
- Why?
  - There is another bottleneck
  - Communication slack
    - Waiting time
    - E.g., MPI_Recv, MPI_Wait, MPI_Barrier
  - MG: OPM 70.6, slack 25
  - LU: OPM 73.5, slack 11
- Can predict β with
  - log(OPM) and
  - slack
40. Energy savings (synthetic)
41. Normalized MG
- With a communication bottleneck, the E-T tradeoff improves as N increases
42. SPEC FP
43. SPEC INT
44. Single node: MG
- Modest memory pressure; gears offer an E-T tradeoff
45. Dynamically adjust performance
[Diagram: gear selection over time as net slack varies]
46. Adjust performance
[Diagram: gear selection over time as net slack varies]
47. Dampening
[Diagram: gear selection over time as net slack varies, with damping]
48. Power consumption
Average for NAS suite
49. Related work: Energy conservation
- Goal: conserve energy
  - Performance degradation is acceptable
  - Usually in mobile environments (finite energy source, i.e., a battery)
- Primary goal
- Extend battery life
- Secondary goal
- Re-allocate energy
- Increase value of energy use
- Tertiary goal
- Increase energy efficiency
- More tasks per unit energy
- Example
  - Feedback-driven energy conservation
  - Control average power usage
  - P_ave = (E_0 - E_f) / T
50. Related work: Real-time DVS
- Goal
  - Reduce energy consumption
  - With no performance degradation
- Mechanism
  - Eliminate slack time in the system
- Savings
  - E_idle with frequency scaling
  - Additional E_task reduction with voltage scaling
[Diagrams: power vs. time; running at Pmax finishes before the deadline and leaves E_idle, while scaling stretches E_task out to the deadline]
51. Related work
- Previous studies in power-aware HPC
  - Cameron et al. [SC 2004, IPDPS 2005]; Freeh et al. [IPDPS 2005]
- Energy-aware server clusters
  - Many projects, e.g., Heath et al. [PPoPP 2005]
- Low-power supercomputer design
- Green Destiny (Warren et al., 2002)
- Orion Multisystems
52. Related work: Fixed installations
- Goal
  - Reduce cost (in heat generation or money)
  - The goal is not to conserve a battery
- Mechanisms
- Scaling
- Fine-grain DVS
- Coarse-grain power down
- Load balancing
53. Memory pressure
- Why the different tradeoffs?
  - CG is memory bound; the CPU is not on the critical path
  - EP is CPU bound; the CPU is on the critical path
- Operations per miss (OPM)
  - A metric of memory pressure
  - Indicates the criticality of the CPU
- Use performance counters (see the sketch below)
  - Count micro-operations and cache misses
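The slides say OPM comes from counting micro-operations and L2 misses with performance counters but do not name a counter library; a minimal sketch using PAPI (an assumption), with retired instructions standing in for micro-operations:

```c
/* Sketch: compute operations-per-miss (OPM) for a code region with PAPI.
 * PAPI_TOT_INS (retired instructions) stands in for micro-ops here, which
 * is an approximation of what the talk counts. */
#include <papi.h>
#include <stdio.h>

static void region_of_interest(void) {          /* placeholder workload */
    static volatile double a[1 << 20];
    for (int i = 0; i < (1 << 20); i++) a[i] += 1.0;
}

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);         /* instructions retired  */
    PAPI_add_event(evset, PAPI_L2_TCM);          /* L2 total cache misses */

    PAPI_start(evset);
    region_of_interest();
    PAPI_stop(evset, counts);

    double opm = (double)counts[0] / (double)(counts[1] ? counts[1] : 1);
    printf("OPM = %.2f (ops %lld, L2 misses %lld)\n", opm, counts[0], counts[1]);
    return 0;
}
```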
54. Single node: MG
55. Single node: LU