Title: There are no comprehensive, holistic studies of performance,
Improving Performance, Power, and Thermal
Efficiency in High-End Systems
Kirk W. Cameron
Scalable Performance Laboratory Department
of Computer Science and Engineering
Virginia Tech
cameron_at_ cs.vt.edu
Power Efficiency
Introduction
PowerPack II Software
- Power profiling API library: synchronized profiling of parallel applications.
- Power control API library: synchronized DVS control within a parallel application.
- Multimeter middleware: coordinates data from multiple meter sources.
- Power analyzer middleware: sorts, sifts, analyzes, and correlates profiling data.
- Performance profiler: uses common utilities to poll system performance status.
Problem Statement
Left unchecked, the fundamental drive to increase peak performance using tens of thousands of components in close proximity to one another will result in 1) an inability to sustain performance improvements, and 2) exorbitant infrastructure and operational costs for power and cooling.
Performance, Power, and Thermal Facts
- The gap between peak and achieved performance is growing.
- A 5 Megawatt supercomputer can consume $4M in energy annually.
- In just 2 hours, the Earth Simulator can produce enough heat to heat a home in the Midwest all winter long.
Projections
- Commodity components fail at an annual rate of 2-3%.
- A petaflop system of 12,000 nodes (CPU, NIC, DRAM, disk) will sustain a hardware failure once every 24 hours.
- Life expectancy of an electronic component decreases 50% for every 10°C (18°F) temperature increase.
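The facts and projections above lend themselves to quick arithmetic. The following is a minimal sketch, assuming a hypothetical electricity price of 0.10 USD per kWh (a value not stated on the poster) and reading the life-expectancy figure as a halving for every 10°C rise. The function names are illustrative only.

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, ignoring leap years


def annual_energy_cost(power_mw, usd_per_kwh):
    """Cost in USD of drawing power_mw megawatts continuously for one year."""
    kwh_per_year = power_mw * 1000.0 * HOURS_PER_YEAR
    return kwh_per_year * usd_per_kwh


def lifetime_factor(delta_c):
    """Relative life expectancy after a temperature change of delta_c (deg C).

    Assumes the rule of thumb that life expectancy halves per 10 C increase;
    a negative delta_c (cooling) therefore gives a factor above 1.
    """
    return 0.5 ** (delta_c / 10.0)


if __name__ == "__main__":
    # 0.10 USD/kWh is an assumed price, used only for illustration.
    print(annual_energy_cost(5.0, 0.10))
    print(lifetime_factor(10.0))
```

At the assumed price, a 5 MW system lands in the neighborhood of the annual energy cost quoted above, which suggests the figure is an order-of-magnitude estimate rather than a measured bill.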
Relevant approaches to the problem
Improving Performance Efficiencies
Includes a myriad of tools and modeling techniques to analyze and optimize the performance of parallel scientific applications. In our work we focus on using fast analytical modeling techniques to optimize emergent architectures such as the IBM Cell Broadband Architecture.
Improving Power Efficiencies
Exploit application slack times to operate various components in lower power modes (e.g. dynamic voltage and frequency scaling, or DVFS) to conserve power and energy. Prior to our work, there was no framework for profiling performance and power of parallel systems and applications.
Improving Thermal Efficiencies
Exploit application slack times to operate various components in lower power (and thermal) modes to reduce the heat emitted by the system. Prior to our work, there was no framework for profiling performance and thermals of parallel systems and applications.
[Figure labels: Multimeters, Resistors, Node under test]
Distributed Power Profiles
NAS codes exhibit regularity (e.g. FT on 4 nodes, above left) that reflects algorithm behavior. Intensive use of memory corresponds to decreases in CPU power and increases in memory power use (above right). Power consumption can vary with node for a single application, with the number of nodes under a fixed workload, and with varied workload under a fixed number of nodes. Results often correlate to the communication-to-computation ratio.
[Figure: measurement circuit. Labels: RS232/GBIC, Ethernet. Component power: $P_{Component} = (V_S - V_R)\,V_R / R$]
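The component power expression in the measurement diagram, P_Component = (V_S - V_R) V_R / R, can be sketched in a few lines. This is an illustrative reading, not PowerPack code: it assumes V_S is the voltage on the supply side of the sense resistor, V_R the voltage on the component side, and R the resistance in ohms. The function and parameter names are hypothetical.

```python
def component_power(v_supply, v_component, r_sense):
    """Power drawn by a component measured through a series sense resistor.

    Assumed symbol meanings (not spelled out on the poster):
      v_supply    -- V_S, voltage on the supply side of the resistor (volts)
      v_component -- V_R, voltage on the component side (volts)
      r_sense     -- R, resistance of the sense resistor (ohms)
    """
    if r_sense <= 0:
        raise ValueError("sense resistance must be positive")
    # Ohm's law gives the current through the resistor, which also
    # flows through the component.
    current = (v_supply - v_component) / r_sense
    return current * v_component


if __name__ == "__main__":
    # Hypothetical readings: a 12 V rail, 0.1 V drop across a 0.1 ohm resistor.
    print(component_power(12.0, 11.9, 0.1))
```

A small sense resistance keeps the voltage drop, and therefore the disturbance to the node under test, low.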
There are no comprehensive, holistic studies of performance, power and thermals on distributed scientific systems and workloads.
Without innovation, future HEC systems will waste performance potential, waste energy, and require extravagant cooling.
[Figure labels: 8-node Dori, Ethernet Switch, Data Collection System]
Our Approach
Observations
- Predictive models and techniques are needed to maximize performance of emergent systems.
- Additional below-peak performance may provide adequate slack times for improved power and thermal efficiencies.
Constraint
Performance is the critical constraint. Reduce power and thermals ONLY if it does not reduce performance significantly.
Our Contributions
- Portable framework to profile, analyze and optimize distributed applications for performance, power, and thermals with minimal performance impact.
- Performance-Power-Thermal tradeoff studies and optimizations of scientific workloads on various architectures.
Reducing Energy Consumption
(left) CPU Miser uses dynamic voltage and frequency scaling (DVFS) to lower average processor power consumption. Using the default cpuspeed daemon (auto) or any fixed lower frequency, performance loss is common. CPU Miser is able to reduce energy consumption without reducing performance significantly.
(above) Memory Miser uses power-scalable DRAM to lower average memory power consumption by turning off memory DIMMs based on memory use and allocation. Note that the top curve shows the amount of online memory and the bottom curve shows actual demand.
CPU Miser and Memory Miser are both capable of 30% total system energy savings with less than 1% performance loss.
Thermal Efficiency
Performance Efficiency
Thermal-Performance tradeoffs are studied using
Tempest and DVFS strategies applied to reduce
temperature in parallel scientific applications.
Optimizing Heterogeneous Multicore Systems
We use a variation of the lognP performance model to predict the cost of various process and data placement configurations at runtime. Using the performance model we can schedule process and data placement optimally for a heterogeneous multicore architecture. Results on the IBM Cell Broadband Engine show dynamic multicore scheduling using analytical modeling is a viable, accurate technique to improve performance efficiencies. Portions of this work were accomplished in collaboration with the Pearl Laboratory led by Prof. D. Nikolopoulos.
Tempest profiling techniques are automatic,
accurate, and portable.
Detailed thermal profile of FT (Class C, NP4)
Thermal regulation of FT (Class C, NP4)
Thermal regulation of IS (Class C, NP4)
Tempest Software Architecture
Single APU: $T_{APU} = T_{APU}^{p} + C_{APU}$
- $T_{APU}^{p}$: APU part that can be parallelized
- $C_{APU}$: APU sequential part
Multiple APUs: $T_{APU}(1,p) = T_{APU}(1,1)/p + C_{APU}$
- $p$: number of APUs
- $T_{APU}(1,1)$: offloaded time for 1 APU
- $T_{APU}(1,p)$: offloaded time for $p$ APUs
$T = T_{HPU} + T_{APU}(1,1)/p + C_{APU} + O_{offload} + p \cdot g$
- HPU time for one iteration
- $T_{HPU}(m,1) = \alpha_m T_{HPU}(1,1) + T_{CSW} + O_{col}$
- $T(m,p) = T_{HPU}(m,p) + T_{APU}(m,p) + O_{offload} + p \cdot g$
Time for a single iteration: $T_i = T_{HPU} + T_{APU} + O_{offload}$
Off-loaded time: $O_{offload} = O_r + O_s$
Total time: $T = \sum_i (T_{HPU,i} + T_{APU,i} + O_{offload,i})$
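To make the multiple-APU model concrete, the sketch below evaluates T = T_HPU + T_APU(1,1)/p + C_APU + O_offload + p*g for a range of APU counts and picks the p with the lowest predicted time. The placement of the plus signs is a reading of the extracted poster text, and all parameter values are hypothetical, chosen only so that the trade-off between parallel speedup and the per-APU term p*g is visible.

```python
def predicted_time(t_hpu, t_apu_1, c_apu, o_offload, g, p):
    """Predicted execution time when work is offloaded to p APUs.

    t_hpu     -- HPU time
    t_apu_1   -- offloaded time for 1 APU, T_APU(1,1)
    c_apu     -- APU sequential part
    o_offload -- offload overhead
    g         -- per-APU cost that grows linearly with p
    """
    if p < 1:
        raise ValueError("p must be at least 1")
    return t_hpu + t_apu_1 / p + c_apu + o_offload + p * g


def best_apu_count(t_hpu, t_apu_1, c_apu, o_offload, g, max_apus):
    """Return the APU count in 1..max_apus with the lowest predicted time."""
    return min(
        range(1, max_apus + 1),
        key=lambda p: predicted_time(t_hpu, t_apu_1, c_apu, o_offload, g, p),
    )


if __name__ == "__main__":
    # Hypothetical parameters in arbitrary time units.
    params = dict(t_hpu=1.0, t_apu_1=8.0, c_apu=0.5, o_offload=0.2, g=0.5)
    for p in range(1, 9):
        print(p, predicted_time(p=p, **params))
    print("best p:", best_apu_count(max_apus=8, **params))
```

With these made-up values the predicted time falls until p = 4 and then rises again, because the p*g term eventually outweighs the gain from dividing T_APU(1,1) by p.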
Distributed Thermal Profiles
A thermal profile of FT (above) reveals thermal patterns corresponding to code phases. Floating point intensive phases run hot while memory bound phases run cooler. Also, significant temperature drops occur in very short periods of time. Thermal behavior of BT (not pictured) shows temperatures synchronize with workload behavior across nodes. We also observe some nodes trend hotter than others. All of this data was obtained using Tempest.
Thermal regulation
(top and top right) The Tempest controller constrains temperature to within a threshold. Since the controller is heuristic, the temperature can exceed the threshold. However, temperature is typically controlled well using DVFS in a node. The weighted importance of thermals, performance and energy can determine the best operating point over a number of nodes.
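The idea of a heuristic, threshold-based DVFS controller can be illustrated with a simple step-down/step-up rule. This is a hypothetical sketch and not the Tempest controller: the hysteresis margin, the number of frequency levels, and all names are assumptions made for illustration.

```python
def next_level(level, n_levels, temp_c, threshold_c, margin_c=3.0):
    """Pick the next frequency level (0 = slowest, n_levels - 1 = fastest).

    Steps down one level when the temperature is above the threshold and
    steps up one level once it has cooled to below threshold - margin.
    """
    if temp_c > threshold_c and level > 0:
        return level - 1
    if temp_c < threshold_c - margin_c and level < n_levels - 1:
        return level + 1
    return level


def regulate(temps_c, n_levels, threshold_c, margin_c=3.0):
    """Replay a temperature trace and return the level chosen after each sample."""
    level = n_levels - 1  # start at the highest frequency
    history = []
    for temp_c in temps_c:
        level = next_level(level, n_levels, temp_c, threshold_c, margin_c)
        history.append(level)
    return history


if __name__ == "__main__":
    # Hypothetical trace (deg C) against a 50 C threshold and 4 levels.
    print(regulate([48, 52, 53, 49, 44], n_levels=4, threshold_c=50.0))
```

Because a rule like this reacts only after a sample crosses the threshold, the temperature can briefly overshoot, which matches the behavior described for a heuristic controller.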
Performance analysis of NAS parallel benchmarks
Avg CPU Temp for various NAS PB codes
Thermal-aware Performance Impact
(right) The performance impact of our thermal-aware DVFS controller is less than 10% for all the NAS PB codes measured. Nonetheless, we commonly reduce operating temperature by nearly 10°C (18°F), which translates to a 50% reliability improvement in some cases. On average, we reduce operating temperature by 5-7°C.
- Application: Parallel Bayesian Phylogenetic Inference (PBPI)
- Dataset: 107 sequences, each 10,000 nucleotides, 20,000 generations
- MMGP: mean error 3.2%, std. dev. 2.6%, max. error 10%
CPU Impact on Thermals
(left) For floating point intensive codes (e.g. SP, FT, EP from NAS), the CPU is a large consumer of power under load and dissipates significant heat. Energy optimizations that significantly reduce CPU heat should impact total system temperature significantly.
- PBPI executes a sampling phase at the beginning of execution
- MMGP parameters are determined during the sampling phase
- Execution is restarted after the sampling phase with MMGP
- PBPI with the sampling phase outperforms other configurations by 1-4x. Sampling phase overhead is 2.5%.
Download Tempest
Tempest is available for download from http://sourceforge.net. Related papers can be found at http://scape.cs.vt.edu.
Temperature-Performance tradeoffs
Thermal optimizations are achieved with minimal
performance impact
This work sponsored in part by the Department of
Energy Office of Science Early Career Principal
Investigator (ECPI) Program under grant number
DOE DE-FG02-04ER25608.