High-Performance Power-Aware Computing

1
High-Performance Power-Aware Computing
  • Vincent W. Freeh
  • Computer Science
  • NCSU
  • vin_at_csc.ncsu.edu

2
Acknowledgements
  • NCSU
  • Tyler K. Bletsch
  • Mark E. Femal
  • Nandini Kappiah
  • Feng Pan
  • Daniel M. Smith
  • U of Georgia
  • Robert Springer
  • Barry Rountree
  • Prof. David K. Lowenthal

3
The case for power management
  • Eric Schmidt, Google CEO:
  • "it's not speed but power, low power, because
    data centers can consume as much electricity as a
    small city."
  • Power/energy consumption becoming key issue
  • Power limitations
  • Energy becomes heat; heat dissipation is costly
  • Non-trivial amount of money
  • Consequence
  • Excessive power consumption limits performance
  • Fewer nodes can operate concurrently
  • Goal
  • Increase power/energy efficiency
  • More performance per unit power/energy

4
CPU scaling
power ∝ frequency × voltage²
  • How: CPU scaling
  • Reduce frequency and voltage
  • Reduces power and performance
  • Energy/power gears
  • Frequency-voltage pair
  • Power-performance setting
  • Energy-time tradeoff
  • Why CPU scaling?
  • Large power consumer
  • Mechanism exists

[Figure: power vs. frequency/voltage; application throughput vs. frequency/voltage]
5
Is CPU scaling a win?
[Figure: power vs. time at full gear over run time T; Psystem is PCPU plus Pother, and the areas under them are ECPU and Eother]
6
Is CPU scaling a win?
[Figure: power vs. time at full and reduced gear; the reduced gear lowers PCPU and Psystem while Pother stays the same, and run time grows from T to T + ΔT]
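
The question on these two slides can be written as a small energy model: total energy is system power multiplied by run time, so a lower gear wins only if the CPU power it saves outweighs the extra time during which the rest of the system keeps drawing power. The following is a minimal sketch, assuming constant power within a run; the function names and all wattages and times are hypothetical, not measurements from the talk.

```python
def system_energy(p_cpu, p_other, seconds):
    """Energy in joules for a run at constant CPU and non-CPU power (watts)."""
    return (p_cpu + p_other) * seconds


def scaling_wins(p_cpu_full, p_cpu_reduced, p_other, t_full, delta_t):
    """True if the reduced gear uses less total energy despite running longer."""
    e_full = system_energy(p_cpu_full, p_other, t_full)
    e_reduced = system_energy(p_cpu_reduced, p_other, t_full + delta_t)
    return e_reduced < e_full


# Hypothetical numbers: 80 W CPU at full gear, 40 W at a reduced gear,
# 60 W for the rest of the system, 100 s run time at full gear.
memory_bound = scaling_wins(80.0, 40.0, 60.0, t_full=100.0, delta_t=5.0)
cpu_bound = scaling_wins(80.0, 40.0, 60.0, t_full=100.0, delta_t=60.0)
```

With these made-up numbers the short delay saves energy (10,500 J against 14,000 J), while the long delay costs energy (16,000 J against 14,000 J).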
7
Our work
  • Exploit bottlenecks
  • Application waiting on bottleneck resource
  • Reduce power consumption (non-critical resource)
  • Generally CPU not on critical path
  • Bottlenecks we exploit
  • Intra-node (memory)
  • Inter-node (load imbalance)
  • Contributions
  • Impact studies: HPPAC '05, IPDPS '05
  • Varying gears/nodes: PPoPP '05, PPoPP '06
    (submitted)
  • Leveraging load imbalance: SC '05

8
Methodology
  • Cluster used: 10 nodes, AMD Athlon-64
  • Processor supports 7 frequency-voltage settings
    (gears)
    Frequency (MHz)  2000  1800  1600  1400  1200  1000   800
    Voltage (V)       1.5   1.4  1.35   1.3   1.2   1.1   1.0
  • Measure
  • Wall clock time (gettimeofday system call)
  • Energy (external power meter)
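
To get a feel for what each gear is worth, the gear table can be combined with the earlier relation power ∝ frequency × voltage². The sketch below is illustrative only: it covers dynamic CPU power, ignores static power and the rest of the system, and the function name is hypothetical.

```python
# Gear table from the methodology slide: (frequency in MHz, voltage in V).
GEARS = [(2000, 1.5), (1800, 1.4), (1600, 1.35), (1400, 1.3),
         (1200, 1.2), (1000, 1.1), (800, 1.0)]


def relative_cpu_power(gear, top=GEARS[0]):
    """Dynamic CPU power of a gear relative to the top gear, using P ∝ f * V^2."""
    f, v = gear
    f0, v0 = top
    return (f * v * v) / (f0 * v0 * v0)


# One value per gear, fastest first.
relative = [round(relative_cpu_power(g), 3) for g in GEARS]
```

Under this model the 800 MHz gear draws roughly 18% of the dynamic power of the 2000 MHz gear, while delivering 40% of the clock rate.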

9
NAS
10
CG: 1 node
[Figure: plot for CG on 1 node, with points labeled 2000 MHz and 800 MHz]
  • Not CPU bound
  • Little time penalty
  • Large energy savings

11
EP: 1 node
11 -3
  • CPU bound
  • Big time penalty
  • No (little) energy savings

12
Operations per miss (OPM)
CG: 8.60
13
Multiple nodes: EP
14
Multiple nodes: LU
  Nodes (N)   Speedup (S)   Energy (E)
  8           5.8           1.28
  4           3.3           1.15
  2           1.9           1.03
Good speedup; energy-time tradeoff as N increases
15
Multiple nodes: MG
Poor speedup; increased E as N increases
  Nodes (N)   Speedup (S)   Energy (E)
  8           2.7           2.29
  4           1.6           1.99
  2           1.2           1.41
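
The LU and MG figures can be compared on a common footing. The sketch below assumes that S is speedup and E is energy, both normalized to the one-node run (the slides do not state the normalization), and derives parallel efficiency S/N and an energy-delay product E × (1/S). The function name is hypothetical.

```python
# (speedup S_N, relative energy E_N) as listed on the LU and MG slides.
LU = {2: (1.9, 1.03), 4: (3.3, 1.15), 8: (5.8, 1.28)}
MG = {2: (1.2, 1.41), 4: (1.6, 1.99), 8: (2.7, 2.29)}


def summarize(results):
    """Per node count: parallel efficiency S/N and energy-delay product E/S."""
    summary = {}
    for nodes, (speedup, energy) in results.items():
        efficiency = speedup / nodes
        energy_delay = energy * (1.0 / speedup)  # relative energy x relative time
        summary[nodes] = (round(efficiency, 2), round(energy_delay, 2))
    return summary


lu_summary = summarize(LU)
mg_summary = summarize(MG)
```

Under that assumption, the energy-delay product on 8 nodes is about 0.22 for LU and about 0.85 for MG, which matches the slides' reading that LU scales well and MG does not.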
16
Phases
17
Phases: LU
18
Phase detection
  • First, divide program into blocks
  • All code in a block executes in the same gear
  • Block boundaries
  • MPI operation
  • Expect OPM change
  • Then, merge adjacent blocks into phases
  • Merge if similar memory pressure
  • Use OPM
  • |OPM_i - OPM_(i+1)| small
  • Merge if small (short time)
  • Note, in future
  • Leverage large body of phase detection research
  • Kennedy & Kremer 1998; Sherwood et al. 2002
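
The merge step can be sketched as a single pass over the blocks. This is a minimal illustration, not the authors' implementation: it compares each block with the running phase rather than strictly with its neighbor, and the two thresholds and the time-weighted averaging are assumptions.

```python
def merge_blocks(blocks, opm_tolerance=1.0, min_seconds=0.01):
    """Merge adjacent blocks into phases.

    Each block is (opm, seconds). A block joins the current phase when its
    OPM is close to the phase's OPM, or when the block is too short to be
    worth a gear shift. Both thresholds are hypothetical.
    """
    phases = []
    for opm, seconds in blocks:
        if phases:
            phase_opm, phase_seconds = phases[-1]
            if abs(phase_opm - opm) <= opm_tolerance or seconds < min_seconds:
                total = phase_seconds + seconds
                # Time-weighted OPM keeps the merged phase representative.
                merged_opm = (phase_opm * phase_seconds + opm * seconds) / total
                phases[-1] = (merged_opm, total)
                continue
        phases.append((opm, seconds))
    return phases


# Hypothetical blocks: two memory-bound, one very short, two compute-bound.
blocks = [(8.5, 0.2), (8.9, 0.3), (70.0, 0.001), (72.0, 0.5), (71.0, 0.4)]
phases = merge_blocks(blocks)
```

Here the five blocks collapse into two phases: the short third block is absorbed into the first phase because it is too brief to justify a shift.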

19
Data collection
[Figure: layers labeled MPI application, MPI-jack code, and MPI library]
  • Use MPI-jack
  • Pre and post hooks
  • For example
  • Program tracing
  • Gear shifting
  • Gather profile data during execution
  • Define MPI-jack hook for every MPI operation
  • Insert pseudo MPI call at end of loops
  • Information collected
  • Type of call and location (PC)
  • Status (gear, time, etc)
  • Statistics (uops and L2 misses for OPM
    calculation)
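
The pre and post hook idea can be illustrated without MPI at all. The sketch below is not MPI-jack's interface; it is a generic Python wrapper, with hypothetical names, that runs code before and after a call and records the call name and elapsed time, which is the kind of record the bullets above describe.

```python
import time

trace = []  # one record per intercepted call


def with_hooks(operation):
    """Wrap a function so that a pre hook and a post hook run around it."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()            # pre hook: note when the call began
        result = operation(*args, **kwargs)
        elapsed = time.perf_counter() - start  # post hook: measure and record
        trace.append({"call": operation.__name__, "seconds": elapsed})
        return result
    return wrapper


@with_hooks
def fake_barrier():
    """Stand-in for a blocking MPI operation (hypothetical)."""
    time.sleep(0.01)


fake_barrier()
```

A gear shift could be placed in either hook in the same way as the timing code.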

20
Example: bt
21
Comparing two schedules
  • What is the best schedule?
  • Depends on user
  • User supplies a "better" function
  • bool better(i, j)
  • Several metrics can be used
  • Energy-delay
  • Energy-delay squared (Cameron et al., SC2004)
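
A user-supplied comparison built on the energy-delay family might look like the sketch below. The `delay_exponent` parameter and the sample schedules are hypothetical; only the `better(i, j)` shape comes from the slide.

```python
def energy_delay(energy, time, delay_exponent=1):
    """Energy-delay metric E * T^n; n=1 is energy-delay, n=2 is energy-delay squared."""
    return energy * time ** delay_exponent


def better(i, j, delay_exponent=1):
    """True if schedule i beats schedule j; each schedule is (energy, time)."""
    return energy_delay(*i, delay_exponent) < energy_delay(*j, delay_exponent)


# Hypothetical schedules: (energy in joules, time in seconds).
fast = (1000.0, 10.0)
frugal = (800.0, 12.0)

frugal_wins_ed = better(frugal, fast)                      # energy-delay
frugal_wins_ed2 = better(frugal, fast, delay_exponent=2)   # energy-delay squared
```

The choice of metric changes the answer: the frugal schedule wins under energy-delay (9,600 against 10,000) but loses under energy-delay squared (115,200 against 100,000).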

22
Slope metric
  • Project uses slope
  • Energy-time tradeoff
  • Slope = -1 ⇒ energy savings = time delay
  • Energy-delay product
  • User defines the limit
  • Limit = 0 ⇒ minimize energy
  • Limit = -∞ ⇒ minimize time
  • If slope < limit, then better
  • We do not advocate this metric over others
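
The slope test can be sketched as follows, assuming each schedule is summarized by a normalized (time, energy) point and that the candidate is slower than the current schedule, so the time difference is positive. The names and the sample points are hypothetical.

```python
def slope(current, candidate):
    """Slope of the energy-time tradeoff from current to candidate.

    Each point is (time, energy), normalized so the slope is unitless.
    Assumes the candidate takes longer than the current schedule.
    """
    (t0, e0), (t1, e1) = current, candidate
    return (e1 - e0) / (t1 - t0)


def slope_better(current, candidate, limit):
    """Candidate is better if the tradeoff is steeper than the user's limit."""
    return slope(current, candidate) < limit


# Hypothetical normalized points: 2% more time buys 10% less energy.
base = (1.00, 1.00)
slower = (1.02, 0.90)

accept_moderate = slope_better(base, slower, limit=-1.5)
accept_strict = slope_better(base, slower, limit=-10.0)
```

The slope here is about -5, so the move is accepted with a limit of -1.5 and rejected with a limit of -10. A limit of 0 accepts any move that saves energy at all, which is the "minimize energy" end of the range.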

23
Example: bt
  Step   Solutions   Slope    Slope < -1.5?
  1      00 → 01     -11.7    true
  2      01 → 02     -1.78    true
  3      02 → 03     -1.19    false
  4      02 → 12     -1.44    false
02 is the best
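
The four steps can be replayed in a few lines. This sketch only walks the moves listed on the slide with their reported slopes; it does not measure anything. Reading each label as one gear digit per phase is an assumption, and the function name is hypothetical.

```python
LIMIT = -1.5

# (from, to, slope) rows as listed on the slide.
STEPS = [("00", "01", -11.7), ("01", "02", -1.78),
         ("02", "03", -1.19), ("02", "12", -1.44)]


def replay(steps, limit, start="00"):
    """Walk the candidate moves in order, accepting those steeper than the limit."""
    current = start
    for source, target, step_slope in steps:
        if source == current and step_slope < limit:
            current = target
    return current


best = replay(STEPS, LIMIT)
```

The walk accepts the first two moves, rejects the last two, and ends at 02.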
24
Benefit of multiple gears: mg
25
Current work: no. of nodes, gear/phase
26
Load imbalance
27
Node bottleneck
  • Best course is to keep load balanced
  • Load balancing is hard
  • Slow down if not critical node
  • How to tell if not critical node?
  • Suppose a barrier
  • All nodes must arrive before any leave
  • No benefit to arriving early
  • Measure block time
  • Assume it is (mostly) the same between iterations
  • Assumptions
  • Iterative application
  • Past predicts future

28
Example
[Figure: timeline between two synch points for iteration k and iteration k+1, labeled performance = 1 and performance = (t - slack)/t]
Reduced performance and power ⇒ energy savings
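
The relation performance = (t - slack)/t suggests a simple way to pick a gear for the next iteration. The sketch below uses the frequencies from the methodology slide and assumes that performance scales linearly with frequency, which is pessimistic for code that is not CPU bound. The function names and sample values are hypothetical.

```python
# Gear frequencies from the methodology slide, fastest first.
FREQUENCIES_MHZ = [2000, 1800, 1600, 1400, 1200, 1000, 800]


def target_performance(iteration_time, slack):
    """Fraction of full performance that would just absorb the slack."""
    return (iteration_time - slack) / iteration_time


def slowest_safe_gear(iteration_time, slack):
    """Index of the lowest frequency that still meets the target performance.

    Assumes performance scales linearly with frequency (a simplification).
    """
    needed_mhz = target_performance(iteration_time, slack) * FREQUENCIES_MHZ[0]
    safe = [i for i, f in enumerate(FREQUENCIES_MHZ) if f >= needed_mhz]
    return safe[-1]


# Hypothetical iteration: 1.0 s long, of which 0.25 s is slack.
gear = slowest_safe_gear(iteration_time=1.0, slack=0.25)
```

With 25% slack the target is 75% of full performance, or 1500 MHz under the linear assumption, so the sketch selects index 2 (1600 MHz).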
29
Measuring slack
  • Blocking operations
  • Receive
  • Wait
  • Barrier
  • Measure with MPI-jack
  • Too frequent
  • Can be hundreds or thousands per second
  • Aggregate slack for one or more iterations
  • Computing slack, S
  • Measure times for computing and blocking phases
  • T = C1 + B1 + C2 + B2 + … + Cn + Bn
  • Compute aggregate slack
  • S = (B1 + B2 + … + Bn)/T
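
The aggregate slack is the blocked fraction of total time, which the following sketch computes from a list of compute and blocking durations. The function name and timings are hypothetical.

```python
def aggregate_slack(phases):
    """Fraction of total time spent blocked.

    phases is a list of (compute_seconds, blocking_seconds) pairs,
    one per (C_i, B_i) in T = C1 + B1 + ... + Cn + Bn.
    """
    total = sum(c + b for c, b in phases)
    blocked = sum(b for _, b in phases)
    return blocked / total


# Hypothetical timings for three compute/block pairs.
S = aggregate_slack([(0.8, 0.2), (0.5, 0.5), (0.9, 0.1)])
```

Here 0.8 s of a 3.0 s total is spent blocked, giving a slack of about 0.27.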

30
Slack
[Figure: communication slack for CG, Aztec, and Sweep3d]
  • Slack
  • Varies between nodes
  • Varies between applications
  • Use net slack
  • Each node individually determines slack
  • Reduction to find min slack
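
One reading of "net slack" is each node's slack minus the smallest slack on any node, on the assumption that the minimum reflects communication time every node pays regardless of load. The sketch below follows that reading with a plain list standing in for the cluster; in an MPI program the minimum would come from a reduction across nodes. The function name and values are hypothetical.

```python
def net_slack(per_node_slack):
    """Subtract the minimum slack, which every node incurs, from each node's slack."""
    floor = min(per_node_slack)  # stands in for a min-reduction across nodes
    return [s - floor for s in per_node_slack]


# Hypothetical per-node slack fractions.
net = net_slack([0.10, 0.25, 0.40])
```

The node with the least slack ends up with a net slack of zero, marking it as the one on the critical path.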

31
Shifting
  • When to reduce performance?
  • When there is enough slack
  • When to increase performance?
  • When application performance suffers
  • Create high and low limit for slack
  • Need damping
  • Dynamically learn
  • Not the same for all applications
  • Range starts small
  • Increase if necessary
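
The high and low limits act as a dead band around the current gear. The sketch below shows only the shifting decision, with fixed hypothetical limits; the dynamic learning of the range described above is left out. Gear index 0 is taken to be the fastest gear, which is an assumption about numbering.

```python
def shift(gear, slack, low, high, slowest_gear=6):
    """Return the next gear index (0 is fastest) for the measured slack.

    Slack above `high` means there is room to slow down; slack below `low`
    means the node is near the critical path and should speed up. The band
    between the limits provides damping.
    """
    if slack > high and gear < slowest_gear:
        return gear + 1  # slow down: lower frequency
    if slack < low and gear > 0:
        return gear - 1  # speed up: higher frequency
    return gear


# Hypothetical limits and slack trace, one value per measurement interval.
gear = 0
history = []
for slack in [0.30, 0.28, 0.12, 0.11, 0.03]:
    gear = shift(gear, slack, low=0.05, high=0.20)
    history.append(gear)
```

The node slows down twice while slack is high, holds its gear while slack sits inside the band, and speeds up once slack falls below the low limit.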

[Figure: slack over time T, divided into reduce gear, same gear, and increase gear regions]
32
Aztec gears
33
Performance
[Figure: results for Aztec and Sweep3d]
34
Synthetic benchmark
35
Summary
  • Contributions
  • Improved energy efficiency of HPC applications
  • Found simple metric for phase boundary location
  • Developed simple, effective linear time algorithm
    for determining proper gears
  • Leveraged load imbalance
  • Future work
  • Reduce sampling interval to handful of iterations
  • Reduce algorithm time w/ modeling and prediction
  • Develop AMPERE: a message passing environment
    for reducing energy
  • http//fortknox.csc.ncsu.eduosr/
  • vin_at_csc.ncsu.edu dkl_at_cs.uga.edu

36
End