Title: Expanding the Computational Niche of GPGPU-Accelerated Systems
1. Expanding the Computational Niche of GPGPU-Accelerated Systems
- Rob Fowler, Allan Porterfield, Dan Reed
- Renaissance Computing Institute
- North Carolina
- April 13, 2007
- rjf, akp, dan_reed at renci.org
2. Renaissance Computing Institute
- RENCI vision
- a multidisciplinary institute
- academe, commerce and society
- broad in scope and participation
- from art to zoology
- Objectives
- enrich and empower human potential
- faculty, staff, students, collaborators
- create multidisciplinary partnerships
- science, engineering and computing
- commerce, humanities and the arts
- develop and deploy leading infrastructure
- driven by collaborative opportunities
- enable and sustain economic development
- Multidisciplinary team model
- scientists, creative artists, and computing researchers
- exploring new approaches to old and new problems
- Leverages research excellence and targets statewide opportunities
3. RENCI: What Is It?
- Statewide objectives
- create broad benefit in a competitive world
- engage industry, academia, government and citizens
- Four target application areas
- public benefit
- supporting urban planning, disaster response
- economic development
- helping companies and people with innovative ideas
- research engagement across disciplines
- catalyzing new projects and increasing success
- building multidisciplinary partnerships
- education and outreach
- providing hands-on experiences and broadening participation
- Mechanisms and approaches
- partnerships and collaborations
- infrastructure as needed to accomplish goals
4. RENCI HPC Activities
- Los Alamos Computer Science Institute
- Performance, reliability analysis for NNSA apps
- SciDAC PERC/PERI
- Performance, reliability for the Office of Science
- Biomedical computing, various sources
- Genetics, proteomics through the NC Bioportal
- Wearable devices to monitor chronic problems
- LEAD (Linked Environments for Atmospheric Discovery)
- Adaptive mesoscale weather sensing, simulation, and reaction
- VGrADS (Virtual Grid Application Development System)
- Usable Grid programming, used by LEAD, Bio-Med, hurricane modeling
- NSF Cyberinfrastructure Evaluation Center
- Honest, relevant evaluation of petascale system proposals
- SciDAC Lattice QCD Consortium
- Performance evaluation and tuning of LQCD codes
- Cyberinfrastructure
- RENCI-UNC provides HPC services to the UNC-CH campus
- CI for RENCI internal activities: BG/L, viz. systems, clusters
- Partnership in NSF Petascale proposal(s)
5. RENCI Capabilities
- Rob Fowler: Rice parallel compiler group (1996-2006), Copenhagen, Rochester. Parallel compilers, performance tools, OS for MPPs, memory hierarchy architecture and analysis, distributed O-O systems.
- Allan Porterfield: Rice parallel compiler group (Ph.D. 1989), 18 years at Tera and Cray. Processor design group, performance analysis, compilers, and runtime for multi-threaded systems.
- Dan Reed: Chancellor's Eminent Professor and RENCI director, PCAST and PITAC, NCSA director, head of CS at UIUC. Performance tools and analysis.
6. Moore's Law
- Circuit element count doubles every N months (N ≈ 18)
- Why: features shrink, semiconductor dies grow
- Corollaries: gate delays decrease; wires become relatively longer
- In the past the focus has been on making "conventional" processors faster
- Faster clocks
- Clever architecture and implementation → instruction-level parallelism
- Clever architecture (speculation, predication, etc.), HW/SW prefetching, and massive caches ease the memory-wall problem
- Problems
- Faster clocks → more power: P ∝ X·V²·F, but F_max ∝ V, so P ∝ F³ (spelled out after this list)
- where X is proportional to the average number of gates active per clock cycle
- More power goes to overhead: caches, predictors, Tomasulo logic, clock distribution, ...
- Big dies → fewer dies per wafer, lower yields, higher costs
- Together → expensive, power-hungry processors on which some signals take 6 cycles to cross
- Meanwhile, the market for high-end graphics (and gaming) drove graphics processors to ride the Moore's-law wave
- Their transistor counts and power consumption are comparable to CPUs, but ...
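Spelling out the power relation in the list above (standard dynamic-power reasoning; X is the slide's stand-in for the average number of gates switching per cycle):

  P \propto X V^2 F, \qquad F_{\max} \propto V \;\Rightarrow\; V \propto F \text{ at the clock limit} \;\Rightarrow\; P \propto X F^3

So pushing frequency alone buys performance linearly while burning power roughly as the cube of the clock rate.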
7. General Purpose Hardware
At the limit of usability?
8. The Price of Over-Generalization?
Or the curse of adding special-purpose widgets?
(Special order, $1200, from Wenger.)
9. Competing with charcoal?
(Thanks to Bob Colwell.)
10. The Killer Micros
- At Supercomputing '89 in Reno, Eugene Brooks prophesied: "No one will survive the attack of the Killer Micros!"
Will we survive the attack of the killer accelerators?
11. Fast Forward to the Late '90s
- 1995 edition → emerging battle: bear and dogs on the cover
- 1998 revision →
- Cerberus, the three-headed dog
Message: large groups of small, agile, general-purpose devices will deprive large, inflexible, specialized devices of significant parts of their market niche(s).
12. CPUs Are Changing Too: More On-Chip Parallelism, More Efficient Parallelism
- Single-thread ILP
- More efficient, smaller cores. Less speculation. Less predication.
- Multi-threading (CLP)
- Run other threads during (potential) stall cycles.
- Multi-core (CLP)
- Replicate simple core designs.
→ Dramatic improvements in performance and efficiency using general-purpose cores.
13. Special Purpose vs. Conventional
- Special purpose: 3.2 GHz, 230 SP GFLOPS; 110 DP GFLOPS ('08); 1 TF (est., 2010)
- Conventional: 3.4 GHz, 54 SP GFLOPS
(Peter Hofstee, LACSI '06)
14. Multimedia and Graphics Devices
- AMD/ATI R600: 64 unified shaders, 515 SP GFLOPS
- Nvidia G80: 128 cores organized as teams of SIMD engines, 330 SP GFLOPS
15. Intel Teraflops Research Chip
- 80 cores on the die
- > 1 TF
- A research experiment, not a general-purpose device
- But products are in the planning stages
16. Research Issues: RENCI Contribution
- Our version of the challenges
- Identify niche(s) for GPGPU acceleration
- Expand those niches
- Issues we will address
- GPGPU vs. CPU evolutionary curves
- Graphics-grade quality vs. mission-critical RAS
- Battling Amdahl's law
- Productive programmability
- New computational methods
- Approach
- Quantitative analysis and design
- Performance instrumentation
- Application performance studies
- Moving enough of an LQCD code to the GPGPU to beat Amdahl
- Language support: compilers and libraries to move larger fractions of applications onto the accelerators
- Use the same approach to generate tightly-coupled code on general-purpose multi-core chips
17. Amdahl's Law Issues
- Approximately: the value of acceleration is bounded by the fraction of the program that can be accelerated
- Scenario today
- GPGPU speeds up some kernels by 6-10x, and more
- Costs
- Capital and maintenance costs of accelerators
- Just plugging in the GPGPU increases power consumption by 28 to 90 watts
- At peak execution, a top-end GPU adds 125-185 W (or more?)
- Question: can we beat the tradeoff? (See the sketch after this list.)
- Fraction of the critical app runnable on the GPU?
- Job mix?
- Future architectures will improve this
- Increasing co-location of CPU and GPU
- Better on-chip power management
- Design for lower peak power
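A back-of-the-envelope sketch of the question above (a minimal host-side calculation; the 8x kernel speedup and the 250 W node / 150 W GPU figures are illustrative assumptions chosen from the ranges on this slide, not measurements):

#include <stdio.h>

/* Classic Amdahl form: overall speedup when a fraction f of the runtime
   is offloaded to an accelerator that runs that part s times faster. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    const double s      = 8.0;    /* assumed kernel speedup (within the 6-10x range) */
    const double p_node = 250.0;  /* assumed CPU-only node power, watts              */
    const double p_gpu  = 150.0;  /* assumed added GPU power at load (125-185 W)     */

    for (int i = 1; i <= 9; ++i) {
        double f   = 0.1 * i;                         /* accelerated fraction    */
        double sp  = amdahl_speedup(f, s);
        double ppw = sp * p_node / (p_node + p_gpu);  /* perf/watt vs. CPU-only  */
        printf("f = %.1f   speedup = %4.2fx   perf/W ratio = %4.2fx\n", f, sp, ppw);
    }
    return 0;
}

With these particular numbers the accelerated node only wins on performance per watt once a bit under half of the runtime is offloadable, which is exactly the Amdahl pressure the slide describes.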
18. RENCI Performance Work
- Performance instrumentation and analysis
- Integrate GPU counters with CPU EBS instrumentation
- Application study
- LQCD: move at least 60% of the full application to the GPU
- Data transposition to improve locality
- GPU-friendly versions of QLA, QMP, and QDP
- Execute entire loop nests in parallel on the GPU (see the sketch after this list)
- Pipeline between tiles; reduce CPU ↔ GPU transfers
- RAS issues with implementation
- End-to-end robustness must be added to the software architecture
- Errors need to be detectable and correctable
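A minimal CUDA sketch of the "whole loop nest on the GPU" idea, using a made-up caxpy kernel as a stand-in for one QLA-style vector routine (not the actual MILC/QDP code); the point is that the fields stay resident on the device so repeated kernel calls avoid CPU ↔ GPU transfers:

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical stand-in for one QLA-style vector operation: y[i] += a * x[i]
   over complex numbers stored as float2 (.x = real, .y = imag). */
__global__ void caxpy(float2 a, const float2 *x, float2 *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 xi = x[i];
        y[i].x += a.x * xi.x - a.y * xi.y;   /* real part      */
        y[i].y += a.x * xi.y + a.y * xi.x;   /* imaginary part */
    }
}

int main(void)
{
    const int n = 1 << 20;                   /* ~1M lattice-site values */
    const size_t bytes = n * sizeof(float2);
    float2 *d_x, *d_y;

    /* Allocate device-resident fields once; in a real solver they would stay
       on the GPU across the whole CG iteration rather than per kernel call. */
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemset(d_x, 0, bytes);
    cudaMemset(d_y, 0, bytes);

    float2 a = make_float2(0.5f, -0.25f);
    caxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);  /* entire loop as one launch */
    cudaDeviceSynchronize();

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

Even this toy shows the bookkeeping (device allocation, launch geometry, synchronization, error checking) that the compiler and library support proposed later would have to hide.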
19. Lattice Quantum Chromodynamics in Practice
- Computation is organized as campaigns
- Each campaign is a set of workflows
- Each workflow is a collection of serial and parallel jobs
- No one job is really huge!
- → Run modest-size jobs on a GPU-accelerated box
20. A Path to 20 TF Sustained
- Application characteristics
- The MILC QCD code achieves ~35% of peak on x86_64 clusters and BG/L; 40% is deemed achievable
- Over-partition to keep data in the L2/L3 cache
- Scalability is now limited by the CG solver
- Production tasks: ~128 CPUs (~1.5 TF peak, ~0.5 TF sustained)
- Prototype machine characteristics
- Each node
- CPUs: s = 2 sockets x 4 cores → ~100 GF
- GPUs: g = 1 or 2 at ~500 GF each → g x 500 GF
- 40 nodes → 4.0 + g x 20 TF (peak)
- Cost per node: ~$2.5k + g x $600
- Cost for a fast interconnect: ~$1k/node
- A 20+4 (40+4) TF peak machine for ~$150K ($168K)
- Sustained 20 TF depends critically on reducing the fraction that must be run on the CPUs
- Example: if 50% of the work is acceleratable and all parts run at 40% of peak →
- Need 25 TF of peak CPUs plus infinite acceleration (worked through after this list)
21. RENCI Language Effort
- Observation: OpenMP-like programming is insufficient to beat Amdahl, even on conventional multi-processors
- The "libraries for important kernels" approach is also very limited, though useful
- See MATLAB, ClearSpeed, etc.
- Our approach: supercomputer-style languages and compilers
- Languages: GAS and HPCS best of breed
- Parallelization: semi-automatic, directive-driven
- Vectorization / SIMD-ization: see Cray TMI
- Streaming and coarse-grain pipelining between cores
- See Rice HPF and CAF compilation strategies