Title: High-Performance Power-Aware Computing
1. High-Performance Power-Aware Computing
- Vincent W. Freeh
- Computer Science
- NCSU
- vin@csc.ncsu.edu
2. Acknowledgements
- NCSU
- Tyler K. Bletsch
- Mark E. Femal
- Nandini Kappiah
- Feng Pan
- Daniel M. Smith
- U of Georgia
- Robert Springer
- Barry Rountree
- Prof. David K. Lowenthal
3. The case for power management
- Eric Schmidt, Google CEO: "It's not speed but power, low power, because data centers can consume as much electricity as a small city."
- Power/energy consumption is becoming a key issue
- Power limitations
- Energy → heat; heat dissipation is costly
- Non-trivial amount of money
- Consequence
- Excessive power consumption limits performance
- Fewer nodes can operate concurrently
- Goal
- Increase power/energy efficiency
- More performance per unit power/energy
4. CPU scaling
power ∝ frequency × voltage²
- How CPU scaling works (a small sketch follows below)
  - Reduce frequency and voltage
  - Reduces power and performance
- Energy/power gears
- Frequency-voltage pair
- Power-performance setting
- Energy-time tradeoff
- Why CPU scaling?
- Large power consumer
- Mechanism exists
[Figures: power and application throughput, each as a function of frequency/voltage]
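To make the relation concrete, here is a minimal sketch (not part of the talk) that plugs the fastest and slowest Athlon-64 gears listed later on the methodology slide into the slide's P ∝ f × V² relation; this covers CPU power only, not whole-system power.

```c
/* Minimal sketch (not from the slides): relative CPU power under DVS,
 * using the relation P ∝ f × V².  Gear values are the Athlon-64
 * settings from the methodology slide. */
#include <stdio.h>

static double rel_power(double f_mhz, double v) {
    return f_mhz * v * v;                /* proportional, arbitrary units */
}

int main(void) {
    double top = rel_power(2000, 1.5);   /* gear 0: 2000 MHz, 1.5 V */
    double low = rel_power(800, 1.0);    /* gear 6:  800 MHz, 1.0 V */
    printf("lowest gear draws ~%.0f%% of top-gear CPU power\n",
           100.0 * low / top);           /* prints about 18%        */
    return 0;
}
```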
5. Is CPU scaling a win?
[Figure: power vs. time over run time T at the full gear; system power Psystem splits into PCPU and Pother, so total energy splits into ECPU and Eother]
6. Is CPU scaling a win?
[Figure: power vs. time for the reduced gear overlaid on the full gear; PCPU drops but the run stretches from T to T + ΔT, so ECPU shrinks while Eother grows]
7. Our work
- Exploit bottlenecks
- Application waiting on bottleneck resource
- Reduce power consumption (non-critical resource)
- Generally CPU not on critical path
- Bottlenecks we exploit
- Intra-node (memory)
- Inter-node (load imbalance)
- Contributions
- Impact studies [HPPAC '05, IPDPS '05]
- Varying gears/nodes [PPoPP '05, PPoPP '06 (submitted)]
- Leveraging load imbalance [SC '05]
8. Methodology
- Cluster used: 10 nodes, AMD Athlon-64
- Processor supports 7 frequency-voltage settings (gears)
  - Frequency (MHz): 2000, 1800, 1600, 1400, 1200, 1000, 800
  - Voltage (V): 1.5, 1.4, 1.35, 1.3, 1.2, 1.1, 1.0
- Measure
  - Wall clock time (gettimeofday system call)
  - Energy (external power meter)
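The slide says wall-clock time comes from the gettimeofday system call; a minimal sketch of how such a measurement is typically wrapped around a benchmark region (region_of_interest is a placeholder name, not from the talk):

```c
/* Minimal timing sketch: gettimeofday gives microsecond-resolution
 * wall-clock time around a measured region. */
#include <stdio.h>
#include <sys/time.h>

/* Placeholder workload; in the experiments this would be a NAS kernel. */
static void region_of_interest(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0;
}

int main(void) {
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    region_of_interest();
    gettimeofday(&t1, NULL);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("wall-clock time: %.6f s\n", secs);
    return 0;
}
```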
9. NAS
10. CG: 1 node
[Figure: CG energy and time at 2000 MHz vs. 800 MHz]
- Not CPU bound
- Little time penalty
- Large energy savings
11. EP: 1 node
[Figure: EP energy and time]
- CPU bound
- Big time penalty
- No (little) energy savings
12. Operations per miss
- CG: 8.60
13. Multiple nodes: EP
14. Multiple nodes: LU
- S2 = 1.9, E2 = 1.03
- S4 = 3.3, E4 = 1.15
- S8 = 5.8, E8 = 1.28
- Good speedup; the E-T tradeoff holds as N increases
15. Multiple nodes: MG
- S2 = 1.2, E2 = 1.41
- S4 = 1.6, E4 = 1.99
- S8 = 2.7, E8 = 2.29
- Poor speedup; E increases as N increases
16. Phases
17. Phases: LU
18. Phase detection
- First, divide the program into blocks
  - All code in a block executes in the same gear
  - Block boundaries
    - MPI operations
    - Expected OPM changes
- Then, merge adjacent blocks into phases (see the sketch below)
  - Merge if similar memory pressure
    - Use OPM
    - Merge if |OPM_i - OPM_{i+1}| is small
  - Merge if the block is small (short time)
- Note, in the future
  - Leverage the large body of phase detection research
  - [Kennedy & Kremer 1998; Sherwood et al. 2002]
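A minimal sketch of the merge step as I read it from this slide; the thresholds and data layout are illustrative assumptions, not values from the talk.

```c
/* Sketch: merge adjacent blocks into phases when their operations-per-miss
 * (OPM) values are close or a block is too short to matter.  Thresholds
 * are illustrative only. */
#include <math.h>
#include <stdio.h>

struct block { double opm; double seconds; };

#define OPM_EPSILON  5.0     /* "similar memory pressure" threshold */
#define MIN_SECONDS  0.01    /* "short time" threshold */

/* Returns the number of phases; phase_id[i] gives the phase of block i. */
static int merge_blocks(const struct block *b, int n, int *phase_id) {
    int phases = 0;
    for (int i = 0; i < n; i++) {
        if (i > 0 &&
            (fabs(b[i].opm - b[i - 1].opm) < OPM_EPSILON ||
             b[i].seconds < MIN_SECONDS)) {
            phase_id[i] = phase_id[i - 1];   /* merge with previous block */
        } else {
            phase_id[i] = phases++;          /* start a new phase */
        }
    }
    return phases;
}

int main(void) {
    struct block b[] = { {8.6, 1.2}, {9.1, 0.9}, {70.0, 2.0}, {71.5, 0.005} };
    int id[4];
    int n = merge_blocks(b, 4, id);
    printf("%d phases: %d %d %d %d\n", n, id[0], id[1], id[2], id[3]);
    return 0;
}
```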
19. Data collection
[Diagram: MPI-jack code interposed between the MPI application and the MPI library]
- Use MPI-jack (a PMPI-style hook is sketched below)
- Pre and post hooks
- For example
- Program tracing
- Gear shifting
- Gather profile data during execution
- Define MPI-jack hook for every MPI operation
- Insert pseudo MPI call at end of loops
- Information collected
- Type of call and location (PC)
- Status (gear, time, etc)
- Statistics (uops and L2 misses for OPM calculation)
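MPI-jack's own interface is not shown in the slides; as an assumption, here is how a pre/post hook around one MPI call can be written with the standard MPI profiling (PMPI) layer, which is the usual way such interposition libraries are built.

```c
/* Sketch of a pre/post hook around MPI_Barrier using the PMPI profiling
 * interface.  This only illustrates the interposition idea; MPI-jack's
 * real API may differ.  Compile into a library linked ahead of MPI. */
#include <mpi.h>
#include <stdio.h>
#include <sys/time.h>

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int MPI_Barrier(MPI_Comm comm) {
    double t0 = now();               /* pre hook: e.g., record entry time   */
    int rc = PMPI_Barrier(comm);     /* forward to the real implementation  */
    double blocked = now() - t0;     /* post hook: e.g., accumulate slack   */
    fprintf(stderr, "MPI_Barrier blocked %.6f s\n", blocked);
    return rc;
}
```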
20. Example: BT
21. Comparing two schedules
- What is the best schedule?
  - It depends on the user
- The user supplies a better() function
  - bool better(i, j)
- Several metrics can be used
  - Energy-delay product
  - Energy-delay squared [Cameron et al., SC 2004]
22. Slope metric
- This project uses slope (see the sketch below)
  - Energy-time tradeoff
  - Slope = -1 → energy savings equals time delay
- The user defines the limit
  - Limit = 0 → minimize energy
  - Limit = -∞ → minimize time
- If slope < limit, then better
- We do not advocate this metric over others
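The slides give better() only as a signature; a minimal sketch of one plausible reading, where slope is the relative energy change divided by the relative time change between two schedules. The exact formula is not spelled out in the deck, so treat it as an assumption.

```c
/* Sketch of the slope-based better() test.  Here slope = (%ΔE) / (%ΔT);
 * schedule j is "better" than i if the slope of moving from i to j is
 * below the user-supplied limit (e.g., -1.5 as in the BT example).
 * The numbers below are illustrative only. */
#include <stdbool.h>
#include <stdio.h>

struct schedule { double energy; double time; };

static double slope(struct schedule i, struct schedule j) {
    double de = (j.energy - i.energy) / i.energy;  /* relative energy change */
    double dt = (j.time   - i.time)   / i.time;    /* relative time change   */
    return de / dt;
}

static bool better(struct schedule i, struct schedule j, double limit) {
    return slope(i, j) < limit;
}

int main(void) {
    struct schedule full   = { 100.0, 10.0 };
    struct schedule scaled = {  90.0, 10.5 };
    printf("slope = %.2f, better = %d\n",
           slope(full, scaled), better(full, scaled, -1.5));
    return 0;
}
```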
23. Example: BT
Solutions (is slope < -1.5?):
1. 00 → 01: slope -11.7, true
2. 01 → 02: slope -1.78, true
3. 02 → 03: slope -1.19, false
4. 02 → 12: slope -1.44, false
02 is the best
24. Benefit of multiple gears: MG
25. Current work: number of nodes, gear/phase
26. Load imbalance
27. Node bottleneck
- The best course is to keep the load balanced
- Load balancing is hard
- Slow down a node if it is not the critical node
- How to tell if a node is not critical?
- Suppose a barrier
  - All nodes must arrive before any leave
  - No benefit to arriving early
- Measure blocking time
  - Assume it is (mostly) the same between iterations
- Assumptions
- Iterative application
- Past predicts future
28. Example
[Diagram: iterations k and k+1 between synchronization points; the non-critical node runs at relative performance (t - slack)/t instead of 1]
- Reduced performance and power → energy savings
29. Measuring slack
- Blocking operations
  - Receive
  - Wait
  - Barrier
- Measure with MPI-jack
- Too frequent
  - Can be hundreds or thousands per second
  - Aggregate slack over one or more iterations
- Computing slack, S (see the sketch below)
  - Measure times of the computing and blocking phases
  - T = C1 + B1 + C2 + B2 + ... + Cn + Bn
  - Compute aggregate slack
  - S = (B1 + B2 + ... + Bn) / T
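A minimal sketch of the aggregate-slack computation defined on this slide, assuming the per-interval compute and block times have already been collected (for example by the PMPI-style hooks above); the sample values are illustrative.

```c
/* Compute aggregate slack S = (B1 + ... + Bn) / T, where
 * T = C1 + B1 + ... + Cn + Bn. */
#include <stdio.h>

static double aggregate_slack(const double *compute, const double *block, int n) {
    double total = 0.0, blocked = 0.0;
    for (int i = 0; i < n; i++) {
        total   += compute[i] + block[i];
        blocked += block[i];
    }
    return blocked / total;
}

int main(void) {
    double compute[] = { 0.90, 0.85, 0.95 };   /* seconds in computation */
    double block[]   = { 0.10, 0.20, 0.15 };   /* seconds blocked in MPI */
    printf("slack S = %.2f\n", aggregate_slack(compute, block, 3));
    return 0;
}
```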
30. Slack
[Chart: communication slack for CG, Aztec, and Sweep3d]
- Slack
  - Varies between nodes
  - Varies between applications
- Use net slack (a reduction sketch follows below)
  - Each node individually determines its slack
  - Reduction to find the minimum slack
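The slide says each node computes its own slack and a reduction finds the minimum; a minimal sketch of that step with a standard MPI collective (variable names are mine):

```c
/* Sketch: every rank contributes its locally measured slack and all ranks
 * learn the minimum ("net") slack via MPI_Allreduce with MPI_MIN. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double my_slack = 0.10 + 0.01 * rank;   /* placeholder for measured slack */
    double net_slack;
    MPI_Allreduce(&my_slack, &net_slack, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    if (rank == 0)
        printf("net (minimum) slack = %.2f\n", net_slack);
    MPI_Finalize();
    return 0;
}
```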
31. Shifting
- When to reduce performance?
  - When there is enough slack
- When to increase performance?
  - When application performance suffers
- Create high and low limits for slack (see the sketch below)
- Need damping
  - Dynamically learn the range
  - It is not the same for all applications
  - The range starts small
  - Increase it if necessary
[Diagram: measured slack relative to the low and high limits selects reduce gear, same gear, or increase gear]
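A minimal sketch of the shifting decision described on this slide; the limit values and the direction convention (a higher gear number meaning a slower frequency, as in the methodology table) are my assumptions.

```c
/* Sketch: choose the next gear from measured net slack.  Gear 0 is the
 * fastest setting (2000 MHz) and gear 6 the slowest (800 MHz); the high
 * and low slack limits are illustrative and would be learned and widened
 * (damping) at run time. */
#include <stdio.h>

#define NUM_GEARS 7

static int next_gear(int gear, double slack, double low, double high) {
    if (slack > high && gear < NUM_GEARS - 1)
        return gear + 1;     /* plenty of slack: shift to a slower gear     */
    if (slack < low && gear > 0)
        return gear - 1;     /* performance suffering: shift to faster gear */
    return gear;             /* within the band: keep the same gear         */
}

int main(void) {
    int gear = 0;
    double samples[] = { 0.25, 0.30, 0.08, 0.02, 0.15 };  /* per-iteration slack */
    for (int i = 0; i < 5; i++) {
        gear = next_gear(gear, samples[i], 0.05, 0.20);
        printf("iteration %d: slack %.2f -> gear %d\n", i, samples[i], gear);
    }
    return 0;
}
```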
32. Aztec gears
33. Performance
[Charts: performance for Aztec and Sweep3d]
34. Synthetic benchmark
35. Summary
- Contributions
- Improved energy efficiency of HPC applications
- Found a simple metric for locating phase boundaries
- Developed a simple, effective linear-time algorithm for determining proper gears
- Leveraged load imbalance
- Future work
- Reduce sampling interval to handful of iterations
- Reduce algorithm time w/ modeling and prediction
- Develop AMPERE
- a message passing environment for reducing energy
- http://fortknox.csc.ncsu.edu/osr/
- vin@csc.ncsu.edu, dkl@cs.uga.edu
36. End
37. Shifting test
[Chart: NAS LU, 1 node]
38. Beta
- Hsu & Kremer [PLDI '03]
- β relates application slowdown to CPU slowdown
  - β = 1 → time is CPU dependent
  - β = 0 → time is independent of the CPU
- OPM vs. β
  - Correlated
  - log(OPM) predicts β
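For context, the usual Hsu-Kremer formulation models slowdown as T(f)/T(fmax) = 1 + β(fmax/f - 1); the slide itself does not reproduce the formula, so treat this paraphrase as an assumption. A minimal sketch:

```c
/* Sketch of the β slowdown model commonly attributed to Hsu & Kremer:
 * T(f) = T(fmax) * (1 + beta * (fmax/f - 1)).  beta = 1 means execution
 * time scales with CPU frequency; beta = 0 means it does not. */
#include <stdio.h>

static double predicted_time(double t_fmax, double beta, double fmax, double f) {
    return t_fmax * (1.0 + beta * (fmax / f - 1.0));
}

int main(void) {
    double t_full = 100.0;                                /* seconds at 2000 MHz */
    printf("beta=1.0: %.1f s at 800 MHz\n",
           predicted_time(t_full, 1.0, 2000.0, 800.0));   /* 250.0 s */
    printf("beta=0.2: %.1f s at 800 MHz\n",
           predicted_time(t_full, 0.2, 2000.0, 800.0));   /* 130.0 s */
    return 0;
}
```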
39. OPM and β and slack
- OPM is not strongly correlated with β in the multi-node case
- Why?
  - There is another bottleneck
  - Communication slack
    - Waiting time
    - E.g., MPI_Recv, MPI_Wait, MPI_Barrier
  - MG: OPM 70.6, slack 25
  - LU: OPM 73.5, slack 11
- Can predict β with
  - log(OPM) and
  - slack
40. Energy savings (synthetic)
41. Normalized MG
- With a communication bottleneck, the E-T tradeoff improves as N increases
42. SPEC FP
43. SPEC INT
44. Single node: MG
- Modest memory pressure; gears offer an E-T tradeoff
45. Dynamically adjust performance
[Diagram: gear selection over time as net slack varies]
46. Adjust performance
[Diagram: gear selection over time as net slack varies]
47. Dampening
[Diagram: gear selection over time as net slack varies, with damping]
48. Power consumption
Average for NAS suite
49. Related work: Energy conservation
- Goal: conserve energy
  - Performance degradation is acceptable
  - Usually in mobile environments (finite energy source, i.e., a battery)
- Primary goal
- Extend battery life
- Secondary goal
- Re-allocate energy
- Increase value of energy use
- Tertiary goal
- Increase energy efficiency
- More tasks per unit energy
- Example
  - Feedback-driven energy conservation
  - Control average power usage
  - P_ave = (E_0 - E_f) / T
50. Related work: Real-time DVS
- Goal
  - Reduce energy consumption
  - With no performance degradation
- Mechanism
  - Eliminate slack time in the system
- Savings
  - E_idle with frequency scaling
  - Additional E_task reduction with voltage scaling
[Diagrams: power vs. time; running at Pmax finishes before the deadline and leaves E_idle, while scaling stretches E_task out to the deadline]
51. Related work
- Previous studies in power-aware HPC
  - Cameron et al. [SC 2004, IPDPS 2005]; Freeh et al. [IPDPS 2005]
- Energy-aware server clusters
  - Many projects, e.g., Heath et al. [PPoPP 2005]
- Low-power supercomputer design
- Green Destiny (Warren et al., 2002)
- Orion Multisystems
52. Related work: Fixed installations
- Goal
  - Reduce cost (in heat generation or money)
  - The goal is not to conserve a battery
- Mechanisms
- Scaling
- Fine-grain DVS
- Coarse-grain power down
- Load balancing
53. Memory pressure
- Why the different tradeoffs?
  - CG is memory bound; the CPU is not on the critical path
  - EP is CPU bound; the CPU is on the critical path
- Operations per miss (OPM)
  - A metric of memory pressure
  - Indicates the criticality of the CPU
- Use performance counters (see the sketch below)
  - Count micro-operations and cache misses
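The slides say OPM comes from counting micro-operations and L2 misses with performance counters but do not name a counter library; a minimal sketch using PAPI (an assumption), with retired instructions standing in for micro-operations:

```c
/* Sketch: compute operations-per-miss (OPM) for a code region with PAPI.
 * PAPI_TOT_INS (retired instructions) stands in for micro-ops here, which
 * is an approximation of what the talk counts. */
#include <papi.h>
#include <stdio.h>

static void region_of_interest(void) {          /* placeholder workload */
    static volatile double a[1 << 20];
    for (int i = 0; i < (1 << 20); i++) a[i] += 1.0;
}

int main(void) {
    int evset = PAPI_NULL;
    long long counts[2];

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&evset);
    PAPI_add_event(evset, PAPI_TOT_INS);         /* instructions retired  */
    PAPI_add_event(evset, PAPI_L2_TCM);          /* L2 total cache misses */

    PAPI_start(evset);
    region_of_interest();
    PAPI_stop(evset, counts);

    double opm = (double)counts[0] / (double)(counts[1] ? counts[1] : 1);
    printf("OPM = %.2f (ops %lld, L2 misses %lld)\n", opm, counts[0], counts[1]);
    return 0;
}
```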
54. Single node: MG
55. Single node: LU