Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster
1
Using Multiple Energy Gears in MPI Programs on a
Power-Scalable Cluster
  • Vincent W. Freeh NCSU
  • David K. Lowenthal U. of Georgia
  • Feng Pan NCSU
  • Nandini Kappiah NCSU

2
The case for power management
  • Eric Schmidt, Google CEO: "it's not speed but power, low power, because data centers can consume as much electricity as a small city."
  • Power/energy consumption becoming a key issue
  • Power limitations
  • Energy becomes heat; heat dissipation is costly
  • Non-trivial amount of money
  • Consequence
  • Power limits performance
  • Fewer nodes can operate concurrently
  • Goal
  • Increase power/energy efficiency
  • More performance per unit power/energy

3
CPU scaling
  • How does CPU scaling work?
  • Reduce frequency and voltage
  • Reduce power and performance
  • Why CPU scaling?
  • Large power consumer
  • Mechanism exists
  • Energy gears
  • Frequency-voltage pair
  • Power-performance setting
  • Energy-time tradeoff

performance ∝ frequency
power ∝ frequency × voltage²
4
Is CPU scaling a win?
[Figure: power vs. time at the full gear. P_system is split into P_CPU and P_other; the areas under them over runtime T give E_CPU and E_other.]
5
Is CPU scaling a win?
[Figure: the same plot at a reduced gear. P_CPU drops while P_other stays fixed; runtime grows from T to T+ΔT, so the CPU energy saved competes with the extra E_other.]
6
Contributions
  • Given an HPC application written in MPI
  • Determine the proper gear for each phase (profiling)
  • Execute this solution
  • Improved energy efficiency of HPC applications
  • Execute the program in a reduced gear
  • This paper shows the benefit of multiple gears in a program
  • Found a simple metric for phase boundary location
  • Developed a simple, effective linear-time algorithm for determining proper gears

7
Methodology
  • Cluster used 10 nodes, AMD Athlon-64
  • Processor supports 7 frequency-voltage settings
    (gears)
  • Frequency (MHz): 2000, 1800, 1600, 1400, 1200, 1000, 800
  • Voltage (V): 1.5, 1.4, 1.35, 1.3, 1.2, 1.1, 1.0
  • Measure
  • Wall clock time (gettimeofday system call)
  • Energy (external multimeter for power)

8
Saving energy (single gear) - cg
[Figure: time and energy for cg at 2000 MHz vs. 800 MHz.]
  • Not CPU bound
  • Little time penalty
  • Large energy savings

9
Saving energy (single gear) - ep
[Figure: time and energy for ep across gears (annotated 11, −3).]
  • CPU bound
  • Big time penalty
  • No (little) energy savings

10
Memory pressure
  • Why different tradeoffs?
  • CG is memory bound: the CPU is not on the critical path
  • EP is CPU bound: the CPU is on the critical path
  • Operations per miss
  • Metric of memory pressure
  • Indicates criticality of CPU
  • Use performance counters
  • Count micro operations and cache misses
  • Use to determine phases
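The OPM metric above can be sketched in a few lines; the classification threshold below is a hypothetical illustration, not a value from the slides.

```python
# Sketch of the operations-per-miss (OPM) metric: micro-ops retired
# divided by L2 cache misses. The threshold is an assumed value.
def opm(micro_ops, l2_misses):
    """Operations per miss from performance-counter readings."""
    if l2_misses == 0:
        return float("inf")  # no misses at all: purely CPU bound
    return micro_ops / l2_misses

def cpu_bound(micro_ops, l2_misses, threshold=100.0):
    # High OPM: CPU on the critical path (like ep); low OPM: memory
    # bound, so scaling is cheap (cg has OPM 8.60).
    return opm(micro_ops, l2_misses) >= threshold
```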

11
Operations per miss
[Figure: OPM per benchmark; cg is lowest at 8.60.]
12
Phases: lu
13
Phase detection
  • First, divide program into blocks
  • All code in a block executes in the same gear
  • Block boundaries
  • MPI operations
  • Expected OPM change
  • Then, merge adjacent blocks into phases
  • Merge if similar memory pressure
  • Use OPM: merge if |OPM_i − OPM_j| is small
  • Also merge if the block is small (short time)
  • Note
  • Leverage large body of phase detection research
  • Kennedy & Kremer 1998; Sherwood et al. 2002
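The merge step can be sketched as follows; the similarity tolerance, short-block cutoff, and the weighted-average OPM for a merged phase are all assumptions, not values from the slides.

```python
# Sketch of block merging: adjacent blocks with similar memory pressure
# (OPM), or very short blocks, merge into one phase. Thresholds assumed.
def merge_blocks(blocks, opm_tol=0.1, min_time=0.05):
    """blocks: list of (opm, seconds). Returns merged (opm, seconds) phases."""
    phases = []
    for opm, secs in blocks:
        if phases:
            p_opm, p_secs = phases[-1]
            similar = abs(opm - p_opm) <= opm_tol * max(p_opm, opm)
            if similar or secs < min_time:
                total = p_secs + secs
                # time-weighted OPM for the merged phase (assumption)
                phases[-1] = ((p_opm * p_secs + opm * secs) / total, total)
                continue
        phases.append((opm, secs))
    return phases
```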

14
Data collection
[Diagram: MPI-jack code interposed between the MPI application and the MPI library.]
  • Use MPI-jack
  • Pre- and post-hooks
  • For example
  • Program tracing
  • Gear shifting
  • Gather profile data during execution
  • Define an MPI-jack hook for every MPI operation
  • Insert a pseudo MPI call at the end of loops
  • Information collected
  • Type of call and location (PC)
  • Status (gear, time, etc.)
  • Statistics (uops and L2 misses for OPM calculation)

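The pre/post-hook idea can be modeled generically; this is illustrative only, since the slides do not show MPI-jack's real interface, and a timer stands in for the hardware uop/L2-miss counters.

```python
# Illustrative only: models MPI-jack-style pre/post hooks around an
# intercepted call. `profile` and `current_gear` are assumed names.
import time

profile = []       # records: (call_name, gear, elapsed_seconds)
current_gear = 0   # assumed global gear state

def jack(call_name, fn, *args, **kwargs):
    t0 = time.perf_counter()          # pre-hook: start timer/counters
    result = fn(*args, **kwargs)      # the intercepted MPI operation
    elapsed = time.perf_counter() - t0
    profile.append((call_name, current_gear, elapsed))  # post-hook
    return result

# usage: wrap a stand-in for an MPI operation
jack("MPI_Barrier", time.sleep, 0.01)
```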
15
Example: bt
16
Comparing two schedules
  • What is the best schedule?
  • Depends on the user
  • User supplies a better function
  • bool better(i, j)
  • Several metrics can be used
  • Energy-delay
  • Energy-delay squared (Cameron et al., SC 2004)

17
Slope metric
  • Paper uses slope
  • Energy-time tradeoff
  • Slope −1 → energy saving equal to time delay
  • User defines the limit
  • Limit 0 → minimize energy
  • Limit −∞ → minimize time
  • If slope < limit, then better
  • We do not advocate this metric over others
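The slope test can be sketched as follows; the handling of the zero-or-negative time-delta case is an assumption.

```python
# Sketch of the slope metric: schedule (e_j, t_j) is better than
# (e_i, t_i) when the relative energy/time slope is below the
# user-chosen limit (limit 0 minimizes energy; limit -inf, time).
def better(e_i, t_i, e_j, t_j, limit=-1.0):
    d_energy = (e_j - e_i) / e_i   # relative energy change
    d_time = (t_j - t_i) / t_i     # relative time change
    if d_time <= 0:
        return d_energy < 0        # no slower and saves energy (assumed)
    return d_energy / d_time < limit
```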

18
Example: bt
19
Benefit of multiple gears: mg
20
Related work
  • Previous studies in power-aware HPC
  • Cameron et al. (SC 2004, IPDPS 2005); Freeh et al. (IPDPS 2005)
  • Energy-aware server clusters
  • Many projects, e.g., Heath (PPoPP 2005)
  • Low-power supercomputer design
  • Green Destiny (Warren et al., 2002)
  • Orion Multisystems

21
Summary
  • Contributions
  • Improved energy efficiency of HPC applications
  • Found simple metric for phase boundary location
  • Developed simple, effective linear time algorithm
    for determining proper gears
  • Future work
  • Reduce sampling interval to handful of iterations
  • Reduce algorithm time w/ modeling and prediction
  • Leverage load imbalance
  • Develop AMPERE
  • a message passing environment for reducing energy
  • http://fortknox.csc.ncsu.edu/osr/
  • vin@csc.ncsu.edu, dkl@cs.uga.edu

22
end
23
Algorithm
  • Set G_k ← 0 for all k   /* 0 is the fastest gear */
  • G_final ← evaluate(program, G, 0, n, T)
  • define evaluate(program, G, i, n, T)
  •   if i ≥ n or G_i = g_slowest then return G fi
  •   G_i ← G_i + 1
  •   execute program using solution G; T′ ← (e, t)
  •   if ¬better(T′, T) then   /* T′ is not better than T */
  •     G_i ← G_i − 1
  •     G ← evaluate(program, G, i+1, n, T)
  •   else                     /* T′ is better than T */
  •     G ← evaluate(program, G, i, n, T′)
  •   fi
  •   return G
  • end
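The recursion above is equivalent to a simple loop. Below is a runnable sketch with a simulated benchmark run; the cost-model numbers and the `run`/`better` helpers are made up for illustration.

```python
# Runnable sketch of the greedy gear search: slow each phase down one
# gear at a time, keep the change only while it is "better".
def find_gears(n_phases, n_gears, run, better):
    """run(G) -> (energy, time); returns the chosen gear vector G."""
    G = [0] * n_phases            # gear 0 is the fastest
    best = run(G)
    i = 0
    while i < n_phases:
        if G[i] == n_gears - 1:   # already at the slowest gear
            i += 1
            continue
        G[i] += 1                 # try the next slower gear for phase i
        trial = run(G)
        if better(trial, best):
            best = trial          # keep it and retry the same phase
        else:
            G[i] -= 1             # revert and move to the next phase
            i += 1
    return G

# simulated two-phase program: phase 0 is memory bound (slowing it saves
# much energy for little time); phase 1 is CPU bound
def run(G):
    return (100 - 8 * G[0] - 1 * G[1],     # energy
            10 + 0.2 * G[0] + 1.0 * G[1])  # time

def better(trial, best, limit=-1.0):
    (e2, t2), (e1, t1) = trial, best
    d_e, d_t = (e2 - e1) / e1, (t2 - t1) / t1
    return d_t > 0 and d_e / d_t < limit
```

With this made-up model the search slows the memory-bound phase all the way down and leaves the CPU-bound phase at the fastest gear, using at most n × g runs.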

24
Phase sorting
  • If phases were independent, no sorting would be needed
  • This is NOT the case
  • In sp:
  • slope from (0,0) to (1,0) is positive
  • slope from (0,1) to (1,1) is negative
  • Same behavior in all other benchmarks
  • There are g^n data points
  • Our approach
  • Always look for the best energy-time tradeoff in a step
  • By starting with the phase with the lowest OPM, we always arrive at a good solution

25
Phase sorting: sp
26
Is CPU scaling a win?
  • Two reasons
  • Frequency and voltage scaling:
  • performance reduction is less than power reduction
  • Application throughput:
  • throughput reduction is less than performance reduction
  • Assumptions
  • CPU is a large power consumer
  • CPU driver
  • Diminishing throughput gains

[Figure: (1) CPU power P = ½CfV² vs. performance (frequency); (2) application throughput vs. performance (frequency).]
27
Multiple gears
  • Extension
  • Programs can have different E-T tradeoffs
  • Portions of programs (phases) can too
  • Idea
  • Find the best gear
  • Find the phases
  • How?

28
Methodology
  • If there are n phases and g gears, then the number of possible solutions is g^n
  • Too large to search exhaustively
  • A heuristic to find the best solution
  • Find the best gear for one phase, then move on to the next phase
  • Once a best gear is found for a phase, it is fixed
  • Running time is at most n × g steps, but usually fewer

29
Trace of OPM for lu
30
Motivation
  • Energy savings increasingly important
  • Well-explored research area in mobile devices and
    server centers
  • Increasing attention in high-performance
    computing
  • Large clusters running compute-intensive, energy
    consuming jobs
  • Entire machines developed with low power in mind
  • Green Destiny/Orion Multisystems
  • Our approach: start with clusters built from high-performance, frequency-scalable processors

31
The case for power management in HPC
  • Power/energy consumption is a critical issue
  • Energy becomes heat; heat dissipation is costly
  • Limited power supply
  • Non-trivial amount of money
  • Consequence
  • Performance limited by available power
  • Fewer nodes can operate concurrently
  • Opportunity: bottlenecks
  • A bottleneck component limits the performance of other components
  • Reduce power of some components without reducing overall performance
  • Today, the CPU is:
  • a major power consumer (100 W),
  • rarely the bottleneck, and
  • scalable in power/performance (frequency and voltage)

Power/performance gears
32
Results: lu
  • Shift 0/1: 1, −6
  • Gear 1: 5, −8
  • Gear 2: 10, −10
  • Shift 1/2: 1, −6
  • Auto shift: 3, −8
  • Shift 0/2: 5, −8
33
Normalized mg
  • With a communication bottleneck, the E-T tradeoff improves as N increases
34
Jacobi iteration
35
Example Load imbalance
  • Uniform allocation of power
  • P_i = P_limit = P/M, for node i
  • Not ideal if nodes are unevenly loaded
  • Tasks execute more slowly on busy nodes
  • Lightly loaded nodes may not use all their power
  • Allocate power based on load
  • At regular intervals, nodes exchange load information
  • Each computes an individual power limit for the next interval (k)
  • Note: load is one of several possible objective functions.

[Equations: individual power limit for node i at interval k, with an "Ensure" constraint keeping the limits within the global budget.]
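A minimal sketch of load-based allocation, assuming a simple proportional rule: each node's limit for the next interval is its share of the global budget, weighted by the loads just exchanged. The proportional weighting is an assumption; the slide notes load is only one possible objective function.

```python
# Hypothetical load-proportional power allocation across M nodes.
def allocate_power(total_power, loads):
    """loads: per-node load values; returns per-node power limits."""
    total_load = sum(loads)
    if total_load == 0:
        # no load information: fall back to uniform P/M
        return [total_power / len(loads)] * len(loads)
    return [total_power * load / total_load for load in loads]
```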
36
Motivation
[Figure: power vs. time from 0 to T; E = (P_SYS + P_CPU) × T.]
37
Motivation
[Figure: at a slower speed, runtime extends from T to T+ΔT. CPU energy is saved from 0 to T, but additional CPU and system energy is spent for the slower speed; E = (P_SYS + P_CPU) × T.]
38
Is CPU scaling a win?
  • Two reasons
  • Power scaling:
  • performance reduction is less than power reduction
  • Application throughput:
  • throughput reduction is less than performance reduction
  • Assumptions
  • CPU is a large power consumer
  • Diminishing throughput gains

[Figure: (1) CPU power P = ½CfV² vs. frequency (performance); (2) application throughput vs. frequency (performance).]
39
Algorithm
no. phases = 4
[Figure: the gear vector G across successive steps of the algorithm on a 4-phase example.]
40
Algorithm
evaluate(foo, G, 1, 3, T)
[Figure: gear vector G at this step.]
is T′ < T? yes
41
Algorithm
evaluate(foo, G, 1, 3, T)
[Figure: gear vector G at this step.]
is T′ < T? no
42
Algorithm
evaluate(foo, G, 1, 3, T)
[Figure: gear vector G at this step.]
is T′ < T? no
43
Algorithm
evaluate(foo, G, 2, 3, T)
[Figure: gear vector G at this step.]
is T′ < T? no
44
Algorithm
evaluate(foo, G, 2, 3, T)
[Figure: gear vector G at this step.]
return G