Title: Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster

1. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster
- Vincent W. Freeh NCSU
- David K. Lowenthal U. of Georgia
- Feng Pan NCSU
- Nandini Kappiah NCSU
2. The case for power management
- Eric Schmidt, Google CEO: "It's not speed but power, low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption becoming a key issue
- Power limitations
- Energy becomes heat; heat dissipation is costly
- Non-trivial amount of money
- Consequence
- Power limits performance
- Fewer nodes can operate concurrently
- Goal
- Increase power/energy efficiency
- More performance per unit power/energy
3. CPU scaling
- How does CPU scaling work?
- Reduce frequency and voltage
- Reduce power and performance
- Why CPU scaling?
- Large power consumer
- Mechanism exists
- Energy gears
- Frequency-voltage pair
- Power-performance setting
- Energy-time tradeoff
- performance ∝ frequency; power ∝ frequency × voltage²
4. Is CPU scaling a win?
[Figure: power vs. time at full speed, from 0 to T; total energy splits into E_CPU (at power P_CPU) and E_other (at power P_other), with P_system = P_CPU + P_other]
5. Is CPU scaling a win?
[Figure: power vs. time at full vs. reduced speed; the reduced gear lowers P_CPU but stretches the runtime from T to T + ΔT, during which P_other keeps accruing energy]
6. Contributions
- Given an HPC application written in MPI
- Determine the proper gear for each phase (profiling)
- Execute this solution
- Improved energy efficiency of HPC applications
- Execute program in a reduced gear
- This paper shows the benefit of multiple gears in a program
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
7. Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- Processor supports 7 frequency-voltage settings (gears):

  Frequency (MHz):  2000  1800  1600  1400  1200  1000  800
  Voltage (V):       1.5   1.4  1.35   1.3   1.2   1.1  1.0

- Measure
- Wall clock time (gettimeofday system call)
- Energy (external multimeter for power)
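The gear table above, combined with the slide-3 relation power ∝ frequency × voltage², gives a feel for the available headroom. A minimal Python sketch; the proportionality constant cancels, so only ratios relative to the fastest gear are meaningful:

```python
# Relative dynamic CPU power per gear, assuming P ∝ f * V^2
# (the proportionality from these slides; the constant C cancels).
GEARS = [  # (frequency MHz, voltage V), from the Athlon-64 table above
    (2000, 1.50), (1800, 1.40), (1600, 1.35), (1400, 1.30),
    (1200, 1.20), (1000, 1.10), (800, 1.00),
]

def relative_power(freq_mhz, volts, base=GEARS[0]):
    """Dynamic power relative to the fastest gear (gear 0)."""
    f0, v0 = base
    return (freq_mhz * volts ** 2) / (f0 * v0 ** 2)

for gear, (f, v) in enumerate(GEARS):
    print(f"gear {gear}: {f} MHz @ {v} V -> {relative_power(f, v):.2f}x")
```

By this model the slowest gear (800 MHz, 1.0 V) draws under a fifth of the fastest gear's dynamic power, which is why a memory-bound code can trade a little time for a large energy saving.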
8. Saving energy (single gear): cg
[Figure: time and energy for cg at 2000 MHz vs. 800 MHz]
- Not CPU bound
- Little time penalty
- Large energy savings
9. Saving energy (single gear): ep
[Figure: time and energy for ep across gears]
- CPU bound
- Big time penalty
- No (little) energy savings
10. Memory pressure
- Why different tradeoffs?
- CG is memory bound: CPU not on critical path
- EP is CPU bound: CPU is on critical path
- Operations per miss (OPM)
- Metric of memory pressure
- Indicates criticality of CPU
- Use performance counters
- Count micro-operations and cache misses
- Use to determine phases
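Concretely, OPM is just the ratio of retired micro-operations to L2 misses over an interval. A minimal sketch; the zero-miss convention is my assumption:

```python
# Operations-per-miss (OPM), the memory-pressure metric from this slide.
# Inputs would come from hardware performance counters (micro-ops retired,
# L2 cache misses); here they are plain integers.
def opm(uops, l2_misses):
    """High OPM -> CPU-bound (CPU on critical path); low OPM -> memory-bound."""
    if l2_misses == 0:
        return float("inf")  # no misses at all: treat as purely CPU-bound
    return uops / l2_misses

# e.g. the next slide reports OPM = 8.60 for CG:
# opm(860_000_000, 100_000_000) -> 8.6
```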
11. Operations per miss
[Table: OPM per benchmark; e.g., CG = 8.60]
12. Phases: LU
[Figure: phase trace for LU]
13. Phase detection
- First, divide the program into blocks
- All code in a block executes in the same gear
- Block boundaries
- MPI operation
- Expected OPM change
- Then, merge adjacent blocks into phases
- Merge if similar memory pressure
- Use OPM: merge if |OPM_i − OPM_j| is small
- Also merge if the block is small (short time)
- Note
- Leverage large body of phase detection research
- Kennedy & Kremer 1998; Sherwood et al. 2002
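The merge step above can be sketched in Python. This is an illustrative reading of the slide, not the paper's code; the similarity tolerance and minimum-time threshold are assumed values:

```python
# Sketch of the merge step: adjacent blocks with similar OPM (or very
# short blocks) are coalesced into one phase.
def merge_blocks(blocks, opm_tol=0.25, min_time=0.01):
    """blocks: list of dicts with 'opm' and 'time' keys. Returns phases."""
    phases = []
    for b in blocks:
        if phases:
            p = phases[-1]
            similar = abs(p["opm"] - b["opm"]) <= opm_tol * max(p["opm"], b["opm"])
            if similar or b["time"] < min_time:
                # merge: time-weighted average OPM over the combined phase
                t = p["time"] + b["time"]
                p["opm"] = (p["opm"] * p["time"] + b["opm"] * b["time"]) / t
                p["time"] = t
                continue
        phases.append(dict(b))
    return phases
```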
14. Data collection
[Diagram: MPI-jack code interposed between the MPI application and the MPI library]
- Use MPI-jack
- Pre and post hooks
- For example
- Program tracing
- Gear shifting
- Gather profile data during execution
- Define MPI-jack hook for every MPI operation
- Insert pseudo MPI call at end of loops
- Information collected
- Type of call and location (PC)
- Status (gear, time, etc)
- Statistics (uops and L2 misses for OPM
calculation)
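The hook mechanism can be illustrated with a generic Python wrapper. This is only an analogy: the real MPI-jack interposes on the MPI library itself, and the names below are invented for the sketch:

```python
import time

profile = []  # (call name, elapsed seconds), one record per hooked call

def jack(fn, pre=None, post=None):
    """Wrap fn with optional pre/post hooks, as MPI-jack does for MPI calls."""
    def wrapper(*args, **kwargs):
        if pre:
            pre(fn.__name__)           # e.g. shift gears before the call
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        if post:
            post(fn.__name__, time.perf_counter() - t0)  # e.g. record stats
        return result
    return wrapper

def record(name, elapsed):
    profile.append((name, elapsed))

# A real deployment would wrap every MPI operation, e.g.:
#   send = jack(MPI_Send, pre=shift_gear, post=record)
```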
15. Example: bt
[Figure: profile trace for bt]
16. Comparing two schedules
- What is the best schedule?
- Depends on the user
- User supplies a better() function
- bool better(i, j)
- Several metrics can be used
- Energy-delay
- Energy-delay squared (Cameron et al., SC 2004)
17. Slope metric
- Paper uses slope
- Energy-time tradeoff
- Slope = −1 → energy saving equal to time delay
- User defines the limit
- Limit 0 → minimize energy
- Limit −∞ → minimize time
- If slope < limit, then better
- We do not advocate this metric over others
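A minimal sketch of the slope test in Python. The slides do not spell out the exact formula; normalizing both deltas by the baseline schedule is my assumption:

```python
# Slope-based better() test: compare a candidate schedule (e_j, t_j)
# against the current one (e_i, t_i). Slope = -1 means the relative
# energy saving equals the relative time delay.
def slope(e_i, t_i, e_j, t_j):
    de = (e_j - e_i) / e_i          # relative energy change
    dt = (t_j - t_i) / t_i          # relative time change
    if dt == 0:
        return float("-inf")        # energy change at zero delay
    return de / dt

def better(e_i, t_i, e_j, t_j, limit=-1.0):
    """Candidate j beats i if the slope falls below the user-set limit."""
    return slope(e_i, t_i, e_j, t_j) < limit
```

With limit = −1, saving 10% energy for a 5% slowdown (slope −2) is accepted, while saving 5% energy for a 10% slowdown (slope −0.5) is rejected.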
18. Example: bt
[Figure: energy-time points for bt across gear schedules]

19. Benefit of multiple gears: mg
[Figure: energy-time points for mg; multiple gears beat any single gear]
20. Related work
- Previous studies in power-aware HPC
- Cameron et al. (SC 2004, IPDPS 2005); Freeh et al. (IPDPS 2005)
- Energy-aware server clusters
- Many projects, e.g., Heath et al. (PPoPP 2005)
- Low-power supercomputer design
- Green Destiny (Warren et al., 2002)
- Orion Multisystems
21. Summary
- Contributions
- Improved energy efficiency of HPC applications
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
- Future work
- Reduce sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Leverage load imbalance
- Develop AMPERE
- a message passing environment for reducing energy
- http://fortknox.csc.ncsu.edu/osr/
- vin@csc.ncsu.edu, dkl@cs.uga.edu
22. End
23. Algorithm
- G_k ← 0, ∀k    /* 0 is the fastest gear */
- G_final ← evaluate(program, G, 0, n, T)
- define evaluate(program, G, i, n, T)
-   if i ≥ n or G_i = g_slowest then return G fi
-   G_i ← G_i + 1
-   execute program using solution G; T′ ← (e, t)
-   if T′ is not better than T then
-     G_i ← G_i − 1    /* revert; move to the next phase */
-     G ← evaluate(program, G, i+1, n, T)
-   else    /* T′ is better than T */
-     G ← evaluate(program, G, i, n, T′)
-   fi
-   return G
- end
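The recursion on this slide is tail-recursive, so it can be written as a loop. A runnable sketch, where run(G) stands in for executing the program with gear vector G and returning its (energy, time), and better is the user-supplied comparison from slide 16:

```python
# Linear-time gear selection: slow each phase one gear at a time, keep
# the change while the result is better, then fix that phase's gear and
# move on. run and better are caller-supplied stand-ins.
def find_gears(run, better, n_phases, slowest_gear):
    G = [0] * n_phases            # gear 0 is the fastest
    best = run(G)                 # baseline (energy, time)
    i = 0
    while i < n_phases:
        if G[i] == slowest_gear:  # cannot slow this phase further
            i += 1
            continue
        G[i] += 1                 # try the next slower gear for phase i
        trial = run(G)
        if better(trial, best):
            best = trial          # keep the slower gear, retry same phase
        else:
            G[i] -= 1             # revert and move to the next phase
            i += 1
    return G, best
```

Each iteration either slows a gear or advances a phase, so the number of program executions is bounded by the n × g figure quoted on the Methodology slide.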
24. Phase sorting
- If phases were independent, no sorting would be needed
- This is NOT the case
- In SP
- slope from (0,0) to (1,0) is positive
- slope from (0,1) to (1,1) is negative
- Same behavior in all other benchmarks
- There are g^n data points
- Our approach
- Always look for the best energy-time tradeoff in a step
- By starting with the phase with the lowest OPM, we always arrive at a good solution
25. Phase sorting: sp
26. Is CPU scaling a win?
- Two reasons
- Frequency and voltage scaling
- Performance reduction less than power reduction
- Application throughput
- Throughput reduction less than performance reduction
- Assumptions
- CPU is a large power consumer
- CPU is the performance driver
- Diminishing throughput gains
- CPU power: P = ½CV²f
[Figure: power vs. frequency (performance); application throughput vs. frequency (performance)]
27. Multiple gears
- Extension
- Programs can have different E-T tradeoffs
- Portions of programs (phases) can too
- Idea
- Find the best gear
- for each phase
- How?
28. Methodology
- If there are n phases and g gears, then the number of possible solutions is g^n
- Too large to search exhaustively
- A heuristic to find the best solution
- Find the best gear for one phase, then move on to the next phase
- Once a best gear is found for a phase, it is fixed
- Running time is at most n × g executions, but usually far fewer steps
29. Trace of OPM for lu
30. Motivation
- Energy savings increasingly important
- Well-explored research area in mobile devices and server centers
- Increasing attention in high-performance computing
- Large clusters running compute-intensive, energy-consuming jobs
- Entire machines developed with low power in mind
- Green Destiny / Orion Multisystems
- Our approach: start with clusters built from high-performance, frequency-scalable processors
31. The case for power management in HPC
- Power/energy consumption a critical issue
- Energy becomes heat; heat dissipation is costly
- Limited power supply
- Non-trivial amount of money
- Consequence
- Performance limited by available power
- Fewer nodes can operate concurrently
- Opportunity: bottlenecks
- A bottleneck component limits the performance of other components
- Reduce power of some components without reducing overall performance
- Today, the CPU is
- a major power consumer (~100 W),
- rarely the bottleneck, and
- scalable in power/performance (frequency and voltage)
- Power/performance gears
32. Results: LU
[Chart data, as (Δtime %, Δenergy %) pairs:]
  Shift 0/1   (1, −6)
  Gear 1      (5, −8)
  Gear 2      (10, −10)
  Shift 1/2   (1, −6)
  Auto shift  (3, −8)
  Shift 0/2   (5, −8)
33. Normalized MG
- With a communication bottleneck, the E-T tradeoff improves as N increases
34. Jacobi iteration
35. Example: load imbalance
- Uniform allocation of power
- P_i = P_limit = P/M, for node i
- Not ideal if nodes are unevenly loaded
- Tasks execute more slowly on busy nodes
- Lightly loaded nodes may not use all their power
- Allocate power based on load
- At regular intervals, nodes exchange load information
- Each computes its individual power limit for the next interval (k)
- Note: load is one of several possible objective functions.
- Ensure the individual per-node limits for interval k sum to at most the cluster budget
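A sketch of the load-based allocation just described; the slide does not give the formula, so proportional sharing of the cluster budget P is my assumption:

```python
# Load-proportional power limits: each node's limit for the next
# interval is its share of the cluster budget P, proportional to its
# reported load, so the limits always sum to P.
def power_limits(loads, P):
    total = sum(loads)
    if total == 0:
        return [P / len(loads)] * len(loads)  # idle cluster: fall back to uniform
    return [P * load / total for load in loads]
```

Usage: with a 100 W budget and loads [1, 1, 2], the busy node gets 50 W and the other two get 25 W each.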
36. Motivation
[Figure: power vs. time at full speed, from 0 to T; total energy E = (P_SYS + P_CPU) × T]
37. Motivation
[Figure: power vs. time at reduced speed, from 0 to T + ΔT; against the full-speed energy E = (P_SYS + P_CPU) × T, the reduced gear saves CPU energy from 0 to T but adds CPU and system energy for the slower speed during ΔT]
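The two Motivation figures reduce to one comparison: scaling wins when the CPU energy saved exceeds the extra CPU and system energy spent during the stretched runtime. A worked example with illustrative numbers (the wattages and times are assumptions, not measurements from the paper):

```python
# Energy model from these slides: E = (P_SYS + P_CPU) * T.
def energy(p_sys, p_cpu, t):
    return (p_sys + p_cpu) * t

full = energy(p_sys=120, p_cpu=100, t=100)   # 220 W * 100 s = 22000 J
slow = energy(p_sys=120, p_cpu=50, t=110)    # 170 W * 110 s = 18700 J
# Here halving CPU power costs only 10% more time, so scaling wins.
# With t=160 (a 60% slowdown) the reduced gear would use 27200 J and lose.
```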
38. Is CPU scaling a win?
- Two reasons
- Power scaling
- Performance reduction less than power reduction
- Application throughput
- Throughput reduction less than performance reduction
- Assumptions
- CPU is a large power consumer
- Diminishing throughput gains
- CPU power: P = ½CV²f
[Figure: (1) power vs. frequency (performance); (2) application throughput vs. frequency (performance)]
39. Algorithm
[Animation: example with 4 phases; the gear vector G is stepped one phase at a time]
40. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) slows phase 1 by one gear; is T′ better than T? yes]

41. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) slows phase 1 by another gear; is T′ better than T? no]

42. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) reverts phase 1's gear; is T′ better than T? no]

43. Algorithm
[Animation: evaluate(foo, G, 2, 3, T) moves on to phase 2; is T′ better than T? no]

44. Algorithm
[Animation: evaluate(foo, G, 2, 3, T) reverts phase 2's gear and returns G]