Title: Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster

1. Using Multiple Energy Gears in MPI Programs on a Power-Scalable Cluster
- Vincent W. Freeh NCSU
- David K. Lowenthal U. of Georgia
- Feng Pan NCSU
- Nandini Kappiah NCSU
2. The case for power management
- Eric Schmidt, Google CEO: "It's not speed but power, low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption becoming a key issue
- Power limitations
- Energy becomes heat; heat dissipation is costly
- Non-trivial amount of money
- Consequence
- Power limits performance
- Fewer nodes can operate concurrently
- Goal
- Increase power/energy efficiency
- More performance per unit power/energy
3. CPU scaling
- How does CPU scaling work?
- Reduce frequency and voltage
- Reduce power and performance
- Why CPU scaling?
- Large power consumer
- Mechanism exists
- Energy gears
- Frequency-voltage pair
- Power-performance setting
- Energy-time tradeoff
- performance ∝ frequency; power ∝ frequency × voltage²
4. Is CPU scaling a win?
[Figure: power vs. time at full speed, from 0 to T; total energy splits into E_CPU (at power P_CPU) and E_other (at power P_other), with P_system = P_CPU + P_other]
5. Is CPU scaling a win?
[Figure: power vs. time at full vs. reduced speed; the reduced gear lowers P_CPU but stretches the runtime from T to T + ΔT, during which P_other keeps accruing energy]
6. Contributions
- Given an HPC application written in MPI
- Determine the proper gear for each phase (profiling)
- Execute this solution
- Improved energy efficiency of HPC applications
- Execute program in a reduced gear
- This paper shows the benefit of multiple gears in a program
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
7. Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- Processor supports 7 frequency-voltage settings (gears):

  Frequency (MHz):  2000  1800  1600  1400  1200  1000  800
  Voltage (V):       1.5   1.4  1.35   1.3   1.2   1.1  1.0

- Measure
- Wall clock time (gettimeofday system call)
- Energy (external multimeter for power)
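The gear table above, combined with the slide-3 relation power ∝ frequency × voltage², gives a feel for the available headroom. A minimal Python sketch; the proportionality constant cancels, so only ratios relative to the fastest gear are meaningful:

```python
# Relative dynamic CPU power per gear, assuming P ∝ f * V^2
# (the proportionality from these slides; the constant C cancels).
GEARS = [  # (frequency MHz, voltage V), from the Athlon-64 table above
    (2000, 1.50), (1800, 1.40), (1600, 1.35), (1400, 1.30),
    (1200, 1.20), (1000, 1.10), (800, 1.00),
]

def relative_power(freq_mhz, volts, base=GEARS[0]):
    """Dynamic power relative to the fastest gear (gear 0)."""
    f0, v0 = base
    return (freq_mhz * volts ** 2) / (f0 * v0 ** 2)

for gear, (f, v) in enumerate(GEARS):
    print(f"gear {gear}: {f} MHz @ {v} V -> {relative_power(f, v):.2f}x")
```

By this model the slowest gear (800 MHz, 1.0 V) draws under a fifth of the fastest gear's dynamic power, which is why a memory-bound code can trade a little time for a large energy saving.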
8. Saving energy (single gear): cg
[Figure: time and energy for cg at 2000 MHz vs. 800 MHz]
- Not CPU bound
- Little time penalty
- Large energy savings
9. Saving energy (single gear): ep
[Figure: time and energy for ep across gears]
- CPU bound
- Big time penalty
- No (little) energy savings
10. Memory pressure
- Why different tradeoffs?
- CG is memory bound: CPU not on critical path
- EP is CPU bound: CPU is on critical path
- Operations per miss (OPM)
- Metric of memory pressure
- Indicates criticality of CPU
- Use performance counters
- Count micro-operations and cache misses
- Use to determine phases
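Concretely, OPM is just the ratio of retired micro-operations to L2 misses over an interval. A minimal sketch; the zero-miss convention is my assumption:

```python
# Operations-per-miss (OPM), the memory-pressure metric from this slide.
# Inputs would come from hardware performance counters (micro-ops retired,
# L2 cache misses); here they are plain integers.
def opm(uops, l2_misses):
    """High OPM -> CPU-bound (CPU on critical path); low OPM -> memory-bound."""
    if l2_misses == 0:
        return float("inf")  # no misses at all: treat as purely CPU-bound
    return uops / l2_misses

# e.g. the next slide reports OPM = 8.60 for CG:
# opm(860_000_000, 100_000_000) -> 8.6
```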
11. Operations per miss
[Table: OPM per benchmark; e.g., CG = 8.60]
12. Phases: LU
[Figure: phase trace for LU]
13. Phase detection
- First, divide the program into blocks
- All code in a block executes in the same gear
- Block boundaries
- MPI operation
- Expected OPM change
- Then, merge adjacent blocks into phases
- Merge if similar memory pressure
- Use OPM: merge if |OPM_i − OPM_j| is small
- Also merge if the block is small (short time)
- Note
- Leverage large body of phase detection research
- Kennedy & Kremer 1998; Sherwood et al. 2002
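The merge step above can be sketched in Python. This is an illustrative reading of the slide, not the paper's code; the similarity tolerance and minimum-time threshold are assumed values:

```python
# Sketch of the merge step: adjacent blocks with similar OPM (or very
# short blocks) are coalesced into one phase.
def merge_blocks(blocks, opm_tol=0.25, min_time=0.01):
    """blocks: list of dicts with 'opm' and 'time' keys. Returns phases."""
    phases = []
    for b in blocks:
        if phases:
            p = phases[-1]
            similar = abs(p["opm"] - b["opm"]) <= opm_tol * max(p["opm"], b["opm"])
            if similar or b["time"] < min_time:
                # merge: time-weighted average OPM over the combined phase
                t = p["time"] + b["time"]
                p["opm"] = (p["opm"] * p["time"] + b["opm"] * b["time"]) / t
                p["time"] = t
                continue
        phases.append(dict(b))
    return phases
```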
14. Data collection
[Diagram: MPI-jack code interposed between the MPI application and the MPI library]
- Use MPI-jack
- Pre and post hooks
- For example
- Program tracing
- Gear shifting
- Gather profile data during execution
- Define MPI-jack hook for every MPI operation
- Insert pseudo MPI call at end of loops
- Information collected
- Type of call and location (PC)
- Status (gear, time, etc)
- Statistics (uops and L2 misses for OPM
calculation)
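The hook mechanism can be illustrated with a generic Python wrapper. This is only an analogy: the real MPI-jack interposes on the MPI library itself, and the names below are invented for the sketch:

```python
import time

profile = []  # (call name, elapsed seconds), one record per hooked call

def jack(fn, pre=None, post=None):
    """Wrap fn with optional pre/post hooks, as MPI-jack does for MPI calls."""
    def wrapper(*args, **kwargs):
        if pre:
            pre(fn.__name__)           # e.g. shift gears before the call
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        if post:
            post(fn.__name__, time.perf_counter() - t0)  # e.g. record stats
        return result
    return wrapper

def record(name, elapsed):
    profile.append((name, elapsed))

# A real deployment would wrap every MPI operation, e.g.:
#   send = jack(MPI_Send, pre=shift_gear, post=record)
```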
15. Example: bt
[Figure: profile trace for bt]
16. Comparing two schedules
- What is the best schedule?
- Depends on the user
- User supplies a better() function
- bool better(i, j)
- Several metrics can be used
- Energy-delay
- Energy-delay squared (Cameron et al., SC 2004)
17. Slope metric
- Paper uses slope
- Energy-time tradeoff
- Slope = −1 → energy saving equal to time delay
- User defines the limit
- Limit 0 → minimize energy
- Limit −∞ → minimize time
- If slope < limit, then better
- We do not advocate this metric over others
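A minimal sketch of the slope test in Python. The slides do not spell out the exact formula; normalizing both deltas by the baseline schedule is my assumption:

```python
# Slope-based better() test: compare a candidate schedule (e_j, t_j)
# against the current one (e_i, t_i). Slope = -1 means the relative
# energy saving equals the relative time delay.
def slope(e_i, t_i, e_j, t_j):
    de = (e_j - e_i) / e_i          # relative energy change
    dt = (t_j - t_i) / t_i          # relative time change
    if dt == 0:
        return float("-inf")        # energy change at zero delay
    return de / dt

def better(e_i, t_i, e_j, t_j, limit=-1.0):
    """Candidate j beats i if the slope falls below the user-set limit."""
    return slope(e_i, t_i, e_j, t_j) < limit
```

With limit = −1, saving 10% energy for a 5% slowdown (slope −2) is accepted, while saving 5% energy for a 10% slowdown (slope −0.5) is rejected.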
18. Example: bt
[Figure: energy-time points for bt across gear schedules]

19. Benefit of multiple gears: mg
[Figure: energy-time points for mg; multiple gears beat any single gear]
20. Related work
- Previous studies in power-aware HPC
- Cameron et al. (SC 2004, IPDPS 2005); Freeh et al. (IPDPS 2005)
- Energy-aware server clusters
- Many projects, e.g., Heath et al. (PPoPP 2005)
- Low-power supercomputer design
- Green Destiny (Warren et al., 2002)
- Orion Multisystems
21. Summary
- Contributions
- Improved energy efficiency of HPC applications
- Found a simple metric for phase boundary location
- Developed a simple, effective linear-time algorithm for determining proper gears
- Future work
- Reduce sampling interval to a handful of iterations
- Reduce algorithm time with modeling and prediction
- Leverage load imbalance
- Develop AMPERE
- a message passing environment for reducing energy
- http://fortknox.csc.ncsu.edu/osr/
- vin@csc.ncsu.edu, dkl@cs.uga.edu
22. End
23. Algorithm
- G_k ← 0, ∀k    /* 0 is the fastest gear */
- G_final ← evaluate(program, G, 0, n, T)
- define evaluate(program, G, i, n, T)
-   if i ≥ n or G_i = g_slowest then return G fi
-   G_i ← G_i + 1
-   execute program using solution G; T′ ← (e, t)
-   if T′ is not better than T then
-     G_i ← G_i − 1    /* revert; move to the next phase */
-     G ← evaluate(program, G, i+1, n, T)
-   else    /* T′ is better than T */
-     G ← evaluate(program, G, i, n, T′)
-   fi
-   return G
- end
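The recursion on this slide is tail-recursive, so it can be written as a loop. A runnable sketch, where run(G) stands in for executing the program with gear vector G and returning its (energy, time), and better is the user-supplied comparison from slide 16:

```python
# Linear-time gear selection: slow each phase one gear at a time, keep
# the change while the result is better, then fix that phase's gear and
# move on. run and better are caller-supplied stand-ins.
def find_gears(run, better, n_phases, slowest_gear):
    G = [0] * n_phases            # gear 0 is the fastest
    best = run(G)                 # baseline (energy, time)
    i = 0
    while i < n_phases:
        if G[i] == slowest_gear:  # cannot slow this phase further
            i += 1
            continue
        G[i] += 1                 # try the next slower gear for phase i
        trial = run(G)
        if better(trial, best):
            best = trial          # keep the slower gear, retry same phase
        else:
            G[i] -= 1             # revert and move to the next phase
            i += 1
    return G, best
```

Each iteration either slows a gear or advances a phase, so the number of program executions is bounded by the n × g figure quoted on the Methodology slide.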
24. Phase sorting
- If phases were independent, no sorting would be needed
- This is NOT the case
- In SP
- slope from (0,0) to (1,0) is positive
- slope from (0,1) to (1,1) is negative
- Same behavior in all other benchmarks
- There are g^n data points
- Our approach
- Always look for the best energy-time tradeoff in a step
- By starting with the phase with the lowest OPM, we always arrive at a good solution
25. Phase sorting: sp
26. Is CPU scaling a win?
- Two reasons
- Frequency and voltage scaling
- Performance reduction less than power reduction
- Application throughput
- Throughput reduction less than performance reduction
- Assumptions
- CPU is a large power consumer
- CPU is the performance driver
- Diminishing throughput gains
- CPU power: P = ½CV²f
[Figure: power vs. frequency (performance); application throughput vs. frequency (performance)]
27. Multiple gears
- Extension
- Programs can have different E-T tradeoffs
- Portions of programs (phases) can too
- Idea
- Find the best gear
- for each phase
- How?
28. Methodology
- If there are n phases and g gears, then the number of possible solutions is g^n
- Too large to search exhaustively
- A heuristic to find the best solution
- Find the best gear for one phase, then move on to the next phase
- Once a best gear is found for a phase, it is fixed
- Running time is at most n × g executions, but usually far fewer steps
29. Trace of OPM for lu
30. Motivation
- Energy savings increasingly important
- Well-explored research area in mobile devices and server centers
- Increasing attention in high-performance computing
- Large clusters running compute-intensive, energy-consuming jobs
- Entire machines developed with low power in mind
- Green Destiny / Orion Multisystems
- Our approach: start with clusters built from high-performance, frequency-scalable processors
31. The case for power management in HPC
- Power/energy consumption a critical issue
- Energy becomes heat; heat dissipation is costly
- Limited power supply
- Non-trivial amount of money
- Consequence
- Performance limited by available power
- Fewer nodes can operate concurrently
- Opportunity: bottlenecks
- A bottleneck component limits the performance of other components
- Reduce power of some components without reducing overall performance
- Today, the CPU is
- a major power consumer (~100 W),
- rarely the bottleneck, and
- scalable in power/performance (frequency and voltage)
- Power/performance gears
32. Results: LU
[Chart data, as (Δtime %, Δenergy %) pairs:]
  Shift 0/1   (1, −6)
  Gear 1      (5, −8)
  Gear 2      (10, −10)
  Shift 1/2   (1, −6)
  Auto shift  (3, −8)
  Shift 0/2   (5, −8)
33. Normalized MG
- With a communication bottleneck, the E-T tradeoff improves as N increases
34. Jacobi iteration
35. Example: load imbalance
- Uniform allocation of power
- P_i = P_limit = P/M, for node i
- Not ideal if nodes are unevenly loaded
- Tasks execute more slowly on busy nodes
- Lightly loaded nodes may not use all their power
- Allocate power based on load
- At regular intervals, nodes exchange load information
- Each computes its individual power limit for the next interval (k)
- Note: load is one of several possible objective functions.
- Ensure the individual per-node limits for interval k sum to at most the cluster budget
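A sketch of the load-based allocation just described; the slide does not give the formula, so proportional sharing of the cluster budget P is my assumption:

```python
# Load-proportional power limits: each node's limit for the next
# interval is its share of the cluster budget P, proportional to its
# reported load, so the limits always sum to P.
def power_limits(loads, P):
    total = sum(loads)
    if total == 0:
        return [P / len(loads)] * len(loads)  # idle cluster: fall back to uniform
    return [P * load / total for load in loads]
```

Usage: with a 100 W budget and loads [1, 1, 2], the busy node gets 50 W and the other two get 25 W each.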
36. Motivation
[Figure: power vs. time at full speed, from 0 to T; total energy E = (P_SYS + P_CPU) × T]
37. Motivation
[Figure: power vs. time at reduced speed, from 0 to T + ΔT; against the full-speed energy E = (P_SYS + P_CPU) × T, the reduced gear saves CPU energy from 0 to T but adds CPU and system energy for the slower speed during ΔT]
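The two Motivation figures reduce to one comparison: scaling wins when the CPU energy saved exceeds the extra CPU and system energy spent during the stretched runtime. A worked example with illustrative numbers (the wattages and times are assumptions, not measurements from the paper):

```python
# Energy model from these slides: E = (P_SYS + P_CPU) * T.
def energy(p_sys, p_cpu, t):
    return (p_sys + p_cpu) * t

full = energy(p_sys=120, p_cpu=100, t=100)   # 220 W * 100 s = 22000 J
slow = energy(p_sys=120, p_cpu=50, t=110)    # 170 W * 110 s = 18700 J
# Here halving CPU power costs only 10% more time, so scaling wins.
# With t=160 (a 60% slowdown) the reduced gear would use 27200 J and lose.
```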
38. Is CPU scaling a win?
- Two reasons
- Power scaling
- Performance reduction less than power reduction
- Application throughput
- Throughput reduction less than performance reduction
- Assumptions
- CPU is a large power consumer
- Diminishing throughput gains
- CPU power: P = ½CV²f
[Figure: (1) power vs. frequency (performance); (2) application throughput vs. frequency (performance)]
39. Algorithm
[Animation: example with 4 phases; the gear vector G is stepped one phase at a time]
40. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) slows phase 1 by one gear; is T′ better than T? yes]

41. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) slows phase 1 by another gear; is T′ better than T? no]

42. Algorithm
[Animation: evaluate(foo, G, 1, 3, T) reverts phase 1's gear; is T′ better than T? no]

43. Algorithm
[Animation: evaluate(foo, G, 2, 3, T) moves on to phase 2; is T′ better than T? no]

44. Algorithm
[Animation: evaluate(foo, G, 2, 3, T) reverts phase 2's gear and returns G]