Title: Expanding the Computational Niche of GPGPU-Accelerated Systems
1. Expanding the Computational Niche of GPGPU-Accelerated Systems
- Rob Fowler, Allan Porterfield, Dan Reed
- Renaissance Computing Institute
- North Carolina
- April 13, 2007
- rjf, akp, dan_reed at renci.org
2. Renaissance Computing Institute
- RENCI vision
- a multidisciplinary institute
- academe, commerce and society
- broad in scope and participation
- from art to zoology
- Objectives
- enrich and empower human potential
- faculty, staff, students, collaborators
- create multidisciplinary partnerships
- science, engineering and computing
- commerce, humanities and the arts
- develop and deploy leading infrastructure
- driven by collaborative opportunities
- enable and sustain economic development
- Multidisciplinary team model
- scientists, creative artists, and computing researchers
- exploring new approaches to old and new problems
- Leverages research excellence and targets statewide opportunities
3. RENCI: What Is It?
- Statewide objectives
- create broad benefit in a competitive world
- engage industry, academia, government and citizens
- Four target application areas
- public benefit
- supporting urban planning, disaster response
- economic development
- helping companies and people with innovative ideas
- research engagement across disciplines
- catalyzing new projects and increasing success
- building multidisciplinary partnerships
- education and outreach
- providing hands-on experiences and broadening participation
- Mechanisms and approaches
- partnerships and collaborations
- infrastructure as needed to accomplish goals
4. RENCI HPC Activities
- Los Alamos Computer Science Institute
- Performance, reliability analysis for NNSA apps
- SciDAC PERC/PERI
- Performance, reliability for the Office of Science
- Biomedical computing, various sources
- Genetics, proteomics through the NC Bioportal
- Wearable devices to monitor chronic problems
- LEAD (Linked Environments for Atmospheric Discovery)
- Adaptive mesoscale weather sensing, simulation, and reaction
- VGrADS (Virtual Grid Application Development System)
- Usable Grid programming, used by LEAD, Bio-Med, hurricane modeling
- NSF Cyberinfrastructure Evaluation Center
- Honest, relevant evaluation of petascale system proposals
- SciDAC Lattice QCD Consortium
- Performance evaluation and tuning of LQCD codes
- Cyberinfrastructure
- RENCI-UNC provides HPC services to the UNC-CH campus
- CI for RENCI internal activities: BG/L, viz. systems, clusters
- Partnership in NSF Petascale proposal(s)
5. RENCI Capabilities
- Rob Fowler: Rice parallel compiler group (1996-2006), Copenhagen, Rochester. Parallel compilers, performance tools, OS for MPPs, memory hierarchy architecture and analysis, distributed O-O systems.
- Allan Porterfield: Rice parallel compiler group (Ph.D. 1989), 18 years at Tera and Cray. Processor design group, performance analysis, compilers, and runtime for multi-threaded systems.
- Dan Reed: Chancellor's Eminent Professor and RENCI director, PCAST and PITAC, NCSA director, head of CS at UIUC. Performance tools and analysis.
6. Moore's Law
- Circuit element count doubles every N months (N ≈ 18)
- Why: features shrink, semiconductor dies grow
- Corollaries: gate delays decrease; wires become relatively longer
- In the past the focus has been on making "conventional" processors faster
- Faster clocks
- Clever architecture and implementation → instruction-level parallelism
- Clever architecture (speculation, predication, etc.), HW/SW prefetching, and massive caches ease the memory-wall problem
- Problems
- Faster clocks → more power: P ∝ X·V²·F, but F_max ∝ V, so P ∝ F³ (spelled out after this list)
- where X is proportional to the average number of gates active per clock cycle
- More power goes to overhead: caches, predictors, Tomasulo logic, clock distribution, ...
- Big dies → fewer dies per wafer, lower yields, higher costs
- Together → expensive, power-hungry processors on which some signals take 6 cycles to cross
- Meanwhile, the market for high-end graphics (and gaming) drove graphics processors to ride the Moore's-law wave
- Their transistor counts and power consumption are comparable to CPUs, but ...
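Spelling out the power relation in the list above (standard dynamic-power reasoning; X is the slide's stand-in for the average number of gates switching per cycle):

  P \propto X V^2 F, \qquad F_{\max} \propto V \;\Rightarrow\; V \propto F \text{ at the clock limit} \;\Rightarrow\; P \propto X F^3

So pushing frequency alone buys performance linearly while burning power roughly as the cube of the clock rate.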
7. General Purpose Hardware
At the limit of usability?
8. The Price of Over-Generalization?
Or the curse of adding special-purpose widgets?
(Special order, $1200, from Wenger.)
9. Competing with charcoal?
(Thanks to Bob Colwell.)
10. The Killer Micros
- At Supercomputing '89 in Reno, Eugene Brooks prophesied: "No one will survive the attack of the Killer Micros!"
Will we survive the attack of the killer accelerators?
11. Fast Forward to the Late '90s
- 1995 edition → emerging battle: bear and dogs on the cover
- 1998 revision →
- Cerberus, the three-headed dog
Message: large groups of small, agile, general-purpose devices will deprive large, inflexible, specialized devices of significant parts of their market niche(s).
12. CPUs Are Changing Too: More On-Chip Parallelism, More Efficient Parallelism
- Single-thread ILP
- More efficient, smaller cores. Less speculation. Less predication.
- Multi-threading (CLP)
- Run other threads during (potential) stall cycles.
- Multi-core (CLP)
- Replicate simple core designs.
→ Dramatic improvements in performance and efficiency using general-purpose cores.
13. Special Purpose vs. Conventional
- Special purpose: 3.2 GHz, 230 SP GFLOPS; 110 DP GFLOPS ('08); 1 TF (est., 2010)
- Conventional: 3.4 GHz, 54 SP GFLOPS
(Peter Hofstee, LACSI '06)
14. Multimedia and Graphics Devices
- AMD/ATI R600: 64 unified shaders, 515 SP GFLOPS
- Nvidia G80: 128 cores organized as teams of SIMD engines, 330 SP GFLOPS
15. Intel Teraflops Research Chip
- 80 cores on the die
- > 1 TF
- A research experiment, not a general-purpose device
- But products are in the planning stages
16. Research Issues: RENCI Contribution
- Our version of the challenges
- Identify niche(s) for GPGPU acceleration
- Expand those niches
- Issues we will address
- GPGPU vs. CPU evolutionary curves
- Graphics-grade quality vs. mission-critical RAS
- Battling Amdahl's law
- Productive programmability
- New computational methods
- Approach
- Quantitative analysis and design
- Performance instrumentation
- Application performance studies
- Moving enough of an LQCD code to the GPGPU to beat Amdahl
- Language support: compilers and libraries to move larger fractions of applications onto the accelerators
- Use the same approach to generate tightly-coupled code on general-purpose multi-core chips
17. Amdahl's Law Issues
- Approximately: the value of acceleration is bounded by the fraction of the program that can be accelerated
- Scenario today
- GPGPU speeds up some kernels by 6-10x, and more
- Costs
- Capital and maintenance costs of accelerators
- Just plugging in the GPGPU increases power consumption by 28 to 90 watts
- At peak execution, a top-end GPU adds 125-185 W (or more?)
- Question: can we beat the tradeoff? (See the sketch after this list.)
- Fraction of the critical app runnable on the GPU?
- Job mix?
- Future architectures will improve this
- Increasing co-location of CPU and GPU
- Better on-chip power management
- Design for lower peak power
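A back-of-the-envelope sketch of the question above (a minimal host-side calculation; the 8x kernel speedup and the 250 W node / 150 W GPU figures are illustrative assumptions chosen from the ranges on this slide, not measurements):

#include <stdio.h>

/* Classic Amdahl form: overall speedup when a fraction f of the runtime
   is offloaded to an accelerator that runs that part s times faster. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}

int main(void)
{
    const double s      = 8.0;    /* assumed kernel speedup (within the 6-10x range) */
    const double p_node = 250.0;  /* assumed CPU-only node power, watts              */
    const double p_gpu  = 150.0;  /* assumed added GPU power at load (125-185 W)     */

    for (int i = 1; i <= 9; ++i) {
        double f   = 0.1 * i;                         /* accelerated fraction    */
        double sp  = amdahl_speedup(f, s);
        double ppw = sp * p_node / (p_node + p_gpu);  /* perf/watt vs. CPU-only  */
        printf("f = %.1f   speedup = %4.2fx   perf/W ratio = %4.2fx\n", f, sp, ppw);
    }
    return 0;
}

With these particular numbers the accelerated node only wins on performance per watt once a bit under half of the runtime is offloadable, which is exactly the Amdahl pressure the slide describes.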
18. RENCI Performance Work
- Performance instrumentation and analysis
- Integrate GPU counters with CPU EBS instrumentation
- Application study
- LQCD: move at least 60% of the full application to the GPU
- Data transposition to improve locality
- GPU-friendly versions of QLA, QMP, and QDP
- Execute entire loop nests in parallel on the GPU (see the sketch after this list)
- Pipeline between tiles; reduce CPU ↔ GPU transfers
- RAS issues with implementation
- End-to-end robustness must be added to the software architecture
- Errors need to be detectable and correctable
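A minimal CUDA sketch of the "whole loop nest on the GPU" idea, using a made-up caxpy kernel as a stand-in for one QLA-style vector routine (not the actual MILC/QDP code); the point is that the fields stay resident on the device so repeated kernel calls avoid CPU ↔ GPU transfers:

#include <stdio.h>
#include <cuda_runtime.h>

/* Hypothetical stand-in for one QLA-style vector operation: y[i] += a * x[i]
   over complex numbers stored as float2 (.x = real, .y = imag). */
__global__ void caxpy(float2 a, const float2 *x, float2 *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float2 xi = x[i];
        y[i].x += a.x * xi.x - a.y * xi.y;   /* real part      */
        y[i].y += a.x * xi.y + a.y * xi.x;   /* imaginary part */
    }
}

int main(void)
{
    const int n = 1 << 20;                   /* ~1M lattice-site values */
    const size_t bytes = n * sizeof(float2);
    float2 *d_x, *d_y;

    /* Allocate device-resident fields once; in a real solver they would stay
       on the GPU across the whole CG iteration rather than per kernel call. */
    cudaMalloc((void **)&d_x, bytes);
    cudaMalloc((void **)&d_y, bytes);
    cudaMemset(d_x, 0, bytes);
    cudaMemset(d_y, 0, bytes);

    float2 a = make_float2(0.5f, -0.25f);
    caxpy<<<(n + 255) / 256, 256>>>(a, d_x, d_y, n);  /* entire loop as one launch */
    cudaDeviceSynchronize();

    printf("kernel status: %s\n", cudaGetErrorString(cudaGetLastError()));
    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

Even this toy shows the bookkeeping (device allocation, launch geometry, synchronization, error checking) that the compiler and library support proposed later would have to hide.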
19. Lattice Quantum Chromodynamics in Practice
- Computation is organized as campaigns
- Each campaign is a set of workflows
- Each workflow is a collection of serial and parallel jobs
- No one job is really huge!
- → Run modest-size jobs on a GPU-accelerated box
20. A Path to 20 TF Sustained
- Application characteristics
- The MILC QCD code achieves ~35% of peak on x86_64 clusters and BG/L; 40% is deemed achievable
- Over-partition to keep data in the L2/L3 cache
- Scalability is now limited by the CG solver
- Production tasks: ~128 CPUs (~1.5 TF peak, ~0.5 TF sustained)
- Prototype machine characteristics
- Each node
- CPUs: s = 2 sockets x 4 cores → ~100 GF
- GPUs: g = 1 or 2 at ~500 GF each → g x 500 GF
- 40 nodes → 4.0 + g x 20 TF (peak)
- Cost per node: ~$2.5k + g x $600
- Cost for a fast interconnect: ~$1k/node
- A 20+4 (40+4) TF peak machine for ~$150K ($168K)
- Sustained 20 TF depends critically on reducing the fraction that must be run on the CPUs
- Example: if 50% of the work is acceleratable and all parts run at 40% of peak →
- Need 25 TF of peak CPUs plus infinite acceleration (worked through after this list)
21. RENCI Language Effort
- Observation: OpenMP-like programming is insufficient to beat Amdahl, even on conventional multi-processors
- The "libraries for important kernels" approach is also very limited, though useful
- See MATLAB, ClearSpeed, etc.
- Our approach: supercomputer-style languages and compilers
- Languages: GAS and HPCS best of breed
- Parallelization: semi-automatic, directive-driven
- Vectorization / SIMD-ization: see Cray TMI
- Streaming and coarse-grain pipelining between cores
- See Rice HPF and CAF compilation strategies