High-Performance Power-Aware Computing


1
High-Performance Power-Aware Computing

Vincent W. Freeh
Computer Science
NCSU
vin_at_csc.ncsu.edu

2
Acknowledgements

NCSU
Tyler K. Bletsch
Mark E. Femal
Nandini Kappiah
Feng Pan
Daniel M. Smith

U of Georgia
Robert Springer
Barry Rountree
Prof. David K. Lowenthal

3
The case for power management

Eric Schmidt, Google CEO:
"it's not speed but power, low power, because data centers can consume as much electricity as a small city."
Power/energy consumption becoming key issue
Power limitations
Energy → heat; heat dissipation is costly
Non-trivial amount of money
Consequence
Excessive power consumption limits performance
Fewer nodes can operate concurrently
Goal
Increase power/energy efficiency
More performance per unit power/energy

4
CPU scaling
power ∝ frequency × voltage²

How does CPU scaling work?
Reduce frequency and voltage
Reduces power and performance
Energy/power gears
Frequency-voltage pair
Power-performance setting
Energy-time tradeoff
Why CPU scaling?
Large power consumer
Mechanism exists

[Figures: power vs. frequency/voltage; application throughput vs. frequency/voltage]
5
Is CPU scaling a win?
[Figure: power vs. time at full gear. Labels: P_CPU, P_other, P_system, E_CPU, E_other, T]
6
Is CPU scaling a win?
[Figure: power vs. time at full and reduced gears. Labels: P_CPU, P_other, P_system, E_CPU, E_other, T, T+ΔT]
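The comparison on these two slides can be sketched numerically. The sketch assumes, from the figure labels, that P_system = P_CPU + P_other and that energy is constant power times run time. The function names and all wattages and times are hypothetical, not measurements from the talk.

```python
def system_energy(p_cpu, p_other, seconds):
    """Energy in joules for a run at constant power: (P_CPU + P_other) * T."""
    return (p_cpu + p_other) * seconds


def scaling_is_win(p_cpu_full, p_cpu_reduced, p_other, t_full, delta_t):
    """True if the reduced gear uses less total energy than the full gear."""
    e_full = system_energy(p_cpu_full, p_other, t_full)
    e_reduced = system_energy(p_cpu_reduced, p_other, t_full + delta_t)
    return e_reduced < e_full


# Hypothetical numbers (watts, seconds), chosen only for illustration.
# Small delay: the CPU power savings outweigh the extra run time.
print(scaling_is_win(60.0, 30.0, 50.0, t_full=100.0, delta_t=5.0))
# Large delay: P_other is paid for much longer, so energy goes up.
print(scaling_is_win(60.0, 30.0, 50.0, t_full=100.0, delta_t=50.0))
```

Under this model, scaling wins only when the time penalty ΔT is small enough that the CPU savings are not cancelled by running the rest of the system for longer.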
7
Our work

Exploit bottlenecks
Application waiting on bottleneck resource
Reduce power consumption (non-critical resource)
Generally CPU not on critical path
Bottlenecks we exploit
Intra-node (memory)
Inter-node (load imbalance)
Contributions
Impact studies [HPPAC '05] [IPDPS '05]
Varying gears/nodes [PPoPP '05] [PPoPP '06 (submitted)]
Leveraging load imbalance [SC '05]

8
Methodology

Cluster used: 10 nodes, AMD Athlon-64
Processor supports 7 frequency-voltage settings (gears)
Frequency (MHz)   2000   1800   1600   1400   1200   1000   800
Voltage (V)       1.5    1.4    1.35   1.3    1.2    1.1    1.0
Measure
Wall clock time (gettimeofday system call)
Energy (external power meter)
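As a rough illustration, the gear table can be combined with the relation power ∝ frequency × voltage² to estimate how CPU power drops from gear to gear. This is a sketch of the proportionality only: it ignores static power and the rest of the system, and the names `GEARS` and `relative_power` are hypothetical.

```python
# (frequency in MHz, voltage in V) pairs from the gear table, fastest first.
GEARS = [(2000, 1.5), (1800, 1.4), (1600, 1.35), (1400, 1.3),
         (1200, 1.2), (1000, 1.1), (800, 1.0)]


def relative_power(freq_mhz, volts, top=GEARS[0]):
    """CPU power relative to the top gear, using power ~ f * V^2."""
    top_f, top_v = top
    return (freq_mhz * volts ** 2) / (top_f * top_v ** 2)


for f, v in GEARS:
    print(f"{f:5d} MHz  {v:.2f} V  relative power {relative_power(f, v):.2f}")
```

By this estimate the slowest gear draws under a fifth of the top gear's CPU power while running at 40% of its frequency.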

9
NAS
10
CG: 1 node
[Figure legend: 2000 MHz, 800 MHz]

Not CPU bound
Little time penalty
Large energy savings

11
EP: 1 node
11 -3

CPU bound
Big time penalty
No (little) energy savings

12
Operations per miss (OPM)
CG 8.60
13
Multiple nodes: EP
14
Multiple nodes: LU
S8 = 5.8, E8 = 1.28
S4 = 3.3, E4 = 1.15
S2 = 1.9, E2 = 1.03
Good speedup: E-T tradeoff as N increases
15
Multiple nodes: MG
Poor speedup: increased E as N increases
S8 = 2.7, E8 = 2.29
S4 = 1.6, E4 = 1.99
S2 = 1.2, E2 = 1.41
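The LU and MG figures can be compared with a little arithmetic. The sketch below assumes that S is the speedup on N nodes relative to one node and that E is energy normalized to the one-node run; the slides do not state the normalization, so treat that reading as an assumption. The function names are hypothetical.

```python
# nodes -> (speedup S, relative energy E), copied from the two slides.
LU = {2: (1.9, 1.03), 4: (3.3, 1.15), 8: (5.8, 1.28)}
MG = {2: (1.2, 1.41), 4: (1.6, 1.99), 8: (2.7, 2.29)}


def parallel_efficiency(nodes, speedup):
    """Fraction of ideal linear speedup achieved."""
    return speedup / nodes


def relative_energy_delay(speedup, rel_energy):
    """Energy-delay product relative to one node (relative time is 1/S)."""
    return rel_energy / speedup


for name, data in (("LU", LU), ("MG", MG)):
    for nodes, (s, e) in sorted(data.items()):
        print(f"{name} N={nodes}: efficiency {parallel_efficiency(nodes, s):.2f}, "
              f"relative E*T {relative_energy_delay(s, e):.2f}")
```

Under these assumptions, LU on 8 nodes keeps about 73% parallel efficiency, while MG on 8 nodes falls to about 34% and pays more than twice the energy.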
16
Phases
17
Phases: LU
18
Phase detection

First, divide program into blocks
All code in a block executes in the same gear
Block boundaries
MPI operation
Expect OPM change
Then, merge adjacent blocks into phases
Merge if similar memory pressure
Use OPM
|OPM_i - OPM_{i+1}| small
Merge if small (short time)
Note, in future
Leverage large body of phase detection research
[Kennedy & Kremer 1998] [Sherwood et al. 2002]
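The merge step above can be sketched as a single pass over the blocks. This is a minimal illustration, not the authors' implementation: the `Block` record, the absolute OPM tolerance `opm_tol`, and the `min_seconds` cutoff are all hypothetical, and a short block is simply folded into the phase before it.

```python
from dataclasses import dataclass


@dataclass
class Block:
    uops: int        # micro-operations retired in the block
    l2_misses: int   # L2 cache misses in the block
    seconds: float   # time spent in the block

    @property
    def opm(self):
        """Operations per miss; guards against a zero miss count."""
        return self.uops / max(self.l2_misses, 1)


def merge_blocks(blocks, opm_tol=2.0, min_seconds=0.01):
    """Merge adjacent blocks with similar OPM, or short blocks, into phases."""
    phases = []
    for b in blocks:
        if phases:
            last = phases[-1]
            similar = abs(last.opm - b.opm) <= opm_tol
            short = b.seconds < min_seconds
            if similar or short:
                # Sum raw counters so the merged OPM stays exact.
                phases[-1] = Block(last.uops + b.uops,
                                   last.l2_misses + b.l2_misses,
                                   last.seconds + b.seconds)
                continue
        phases.append(Block(b.uops, b.l2_misses, b.seconds))
    return phases


blocks = [Block(800, 100, 1.0), Block(900, 100, 1.0),
          Block(5000, 100, 1.0), Block(100, 100, 0.001)]
phases = merge_blocks(blocks)
print([(p.opm, p.seconds) for p in phases])
```

Here the first two blocks merge because their OPM values are close, and the last block merges into its predecessor only because it is too short to justify its own gear.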

19
Data collection
[Diagram: MPI application / MPI-jack code / MPI library]

Use MPI-jack
Pre and post hooks
For example
Program tracing
Gear shifting
Gather profile data during execution
Define MPI-jack hook for every MPI operation
Insert pseudo MPI call at end of loops
Information collected
Type of call and location (PC)
Status (gear, time, etc.)
Statistics (uops and L2 misses, for OPM calculation)

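The pre and post hook idea can be illustrated with a generic wrapper. This is only an analogy in Python and does not reproduce MPI-jack's actual interface: `with_hooks`, `pre_hook`, `post_hook`, `fake_barrier`, and the `trace` record layout are all hypothetical stand-ins.

```python
import time
from functools import wraps

trace = []          # profile records gathered during execution
current_gear = 0    # hypothetical gear state a hook could read or change


def with_hooks(pre, post):
    """Wrap an operation so that pre runs before it and post runs after it."""
    def decorate(op):
        @wraps(op)
        def wrapper(*args, **kwargs):
            pre(op.__name__)
            try:
                return op(*args, **kwargs)
            finally:
                post(op.__name__)
        return wrapper
    return decorate


def pre_hook(name):
    # Record the type of call and status (gear, time) on entry.
    trace.append(("enter", name, current_gear, time.perf_counter()))


def post_hook(name):
    # Record the same status on exit; a gear shift could also go here.
    trace.append(("exit", name, current_gear, time.perf_counter()))


@with_hooks(pre_hook, post_hook)
def fake_barrier():
    """Stand-in for an MPI operation; it does no communication."""
    return "done"


fake_barrier()
print([(kind, name) for kind, name, _, _ in trace])
```

The same wrapper shape serves both uses named on the slide: the hooks can only record (program tracing) or can also change state (gear shifting).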
20
Example: bt
21
Comparing two schedules

What is the best schedule?
Depends on user
User supplies a "better" function
bool better(i, j)
Several metrics can be used
Energy-delay
Energy-delay squared [Cameron et al., SC 2004]
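A user-supplied comparison built on the two metrics named above might look like the following sketch. Representing a schedule as a `(seconds, joules)` pair and the `delay_exp` parameter are assumptions made for illustration; the measurements are hypothetical.

```python
def energy_delay(joules, seconds, delay_exp=1):
    """Energy-delay product; delay_exp=2 gives energy-delay squared."""
    return joules * seconds ** delay_exp


def better(i, j, delay_exp=1):
    """True if schedule i beats schedule j; each is (seconds, joules)."""
    t_i, e_i = i
    t_j, e_j = j
    return energy_delay(e_i, t_i, delay_exp) < energy_delay(e_j, t_j, delay_exp)


fast = (100.0, 1000.0)    # hypothetical: quicker but uses more energy
frugal = (110.0, 880.0)   # hypothetical: 10% slower, 12% less energy

print(better(frugal, fast))                # energy-delay
print(better(frugal, fast, delay_exp=2))   # energy-delay squared
```

With these numbers the frugal schedule wins under energy-delay but loses under energy-delay squared, which is why the choice of metric is left to the user.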

22
Slope metric

Project uses slope
Energy-time tradeoff
Slope = -1 ⇒ energy savings = time delay
Energy-delay product
User defines the limit
Limit = 0 ⇒ minimize energy
Limit = -∞ ⇒ minimize time
If slope < limit, then better
We do not advocate this metric over others
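The slope test can be sketched as follows. The sketch assumes the slope is the relative change in energy divided by the relative change in time when moving from schedule i to a slower schedule j, which is one reading of "slope -1 means energy savings equal time delay". The function names and the measurements are hypothetical.

```python
import math


def slope(t_i, e_i, t_j, e_j):
    """Relative energy change per relative time change, from i to j."""
    dt = (t_j - t_i) / t_i
    de = (e_j - e_i) / e_i
    if dt == 0:
        # No time penalty: any energy saving is infinitely steep.
        return -math.inf if de < 0 else math.inf
    return de / dt


def is_better(slope_value, limit):
    """Accept the slower schedule only if its slope is below the limit."""
    return slope_value < limit


# Hypothetical: 5% more time buys 10% less energy.
s = slope(100.0, 1000.0, 105.0, 900.0)
print(s)
print(is_better(s, limit=-1.5))
print(is_better(s, limit=-math.inf))   # never trade time: minimize time
print(is_better(s, limit=0.0))         # accept any saving: minimize energy
```

The two extreme limits behave as the slide describes: a limit of -∞ rejects every slowdown, and a limit of 0 accepts any slowdown that saves energy at all.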

23
Example: bt
Step   Solutions   Slope    Slope < -1.5?
1      00 → 01     -11.7    true
2      01 → 02     -1.78    true
3      02 → 03     -1.19    false
4      02 → 12     -1.44    false
02 is the best
24
Benefit of multiple gears: mg
25
Current work: no. of nodes, gear/phase
26
Load imbalance
27
Node bottleneck

Best course is to keep load balanced
Load balancing is hard
Slow down if not critical node
How to tell if not critical node?
Suppose a barrier
All nodes must arrive before any leave
No benefit to arriving early
Measure block time
Assume it is (mostly) the same between iterations
Assumptions
Iterative application
Past predicts future

28
Example
[Figure: iterations k and k+1 between synch points; performance = 1 vs. performance = (t - slack)/t]
Reduced performance and power ⇒ energy savings
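The target performance (t - slack)/t can be turned into a gear choice. The sketch below uses the cluster's gear frequencies and makes a simplifying assumption that performance scales linearly with frequency, which is the worst case for a CPU-bound block. The function names are hypothetical.

```python
# Gear frequencies in MHz, fastest first.
GEAR_MHZ = [2000, 1800, 1600, 1400, 1200, 1000, 800]


def target_performance(block_time, slack):
    """Fraction of full performance that still reaches the synch point on time."""
    return (block_time - slack) / block_time


def pick_gear(block_time, slack, gears=GEAR_MHZ):
    """Slowest gear whose relative frequency meets the target performance.

    Assumes performance is proportional to frequency (an assumption,
    not a claim from the talk).
    """
    target = target_performance(block_time, slack)
    top = gears[0]
    candidates = [f for f in gears if f / top >= target]
    return min(candidates)


# Hypothetical block: 10 s of work followed by 2.5 s of waiting at a barrier.
print(target_performance(10.0, 2.5))
print(pick_gear(10.0, 2.5))
print(pick_gear(10.0, 0.0))   # no slack: stay in the top gear
```

Because the block time is assumed to repeat between iterations, the gear chosen from iteration k's slack would be applied to iteration k+1.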
29
Measuring slack