Expanding the Computational Niche of GPGPU-Accelerated Systems

Transcript and Presenter's Notes


1
Expanding the Computational Niche of
GPGPU-Accelerated Systems
  • Rob Fowler, Allan Porterfield, Dan Reed
  • Renaissance Computing Institute
  • North Carolina
  • April 13, 2007
  • rjf, akp, dan_reed@renci.org

2
Renaissance Computing Institute
  • RENCI vision
  • a multidisciplinary institute
  • academe, commerce and society
  • broad in scope and participation
  • from art to zoology
  • Objectives
  • enrich and empower human potential
  • faculty, staff, students, collaborators
  • create multidisciplinary partnerships
  • science, engineering and computing
  • commerce, humanities and the arts
  • develop and deploy leading infrastructure
  • driven by collaborative opportunities
  • enable and sustain economic development
  • Multidisciplinary team model
  • scientists, creative artists, and computing
    researchers
  • exploring new approaches to old and new problems
  • Leverages research excellence and targets
    statewide opportunities

3
RENCI: What Is It?
  • Statewide objectives
  • create broad benefit in a competitive world
  • engage industry, academia, government and
    citizens
  • Four target application areas
  • public benefit
  • supporting urban planning, disaster response, ...
  • economic development
  • helping companies and people with innovative
    ideas
  • research engagement across disciplines
  • catalyzing new projects and increasing success
  • building multidisciplinary partnerships
  • education and outreach
  • providing hands-on experiences and broadening
    participation
  • Mechanisms and approaches
  • partnerships and collaborations
  • infrastructure as needed to accomplish goals

4
RENCI HPC Activities.
  • Los Alamos Computer Science Institute
  • Performance, reliability analysis for NNSA apps
  • SciDAC PERC/PERI
  • Performance, reliability for the DOE Office of Science
  • Biomedical computing, various sources
  • Genetics, proteomics through NC Bioportal
  • Wearable devices to monitor chronic problems
  • LEAD: Linked Environments for Atmospheric
    Discovery
  • Adaptive mesoscale weather sensing, simulation,
    and reaction
  • VGrADS: Virtual Grid Application Development
    System
  • Usable Grid programming, used by LEAD, Bio-Med,
    Hurricane modeling.
  • NSF Cyberinfrastructure Evaluation Center
  • Honest, relevant evaluation of petascale system
    proposals
  • SciDAC Lattice QCD Consortium
  • Performance evaluation and tuning of LQCD codes.
  • Cyberinfrastructure
  • RENCI-UNC provides HPC services to the UNC Chapel
    Hill campus
  • CI for RENCI internal activities: BG/L, visualization
    systems, clusters, ...
  • Partnership in NSF Petascale proposal(s)

5
RENCI Capabilities.
  • Rob Fowler: Rice parallel compiler group
    (1996-2006), Copenhagen, Rochester. Parallel
    compilers, performance tools, OS for MPPs, memory
    hierarchy architecture and analysis, distributed
    O-O systems.
  • Allan Porterfield: Rice parallel compiler group
    (Ph.D. 1989), 18 years at Tera and Cray: processor
    design group, performance analysis, compilers, and
    runtime for multi-threaded systems.
  • Dan Reed: Chancellor's Eminent Professor and
    RENCI director, PCAST and PITAC, NCSA director,
    head of CS at UIUC. Performance tools and
    analysis.

6
Moore's law
  • Circuit element count doubles every N months
    (N ≈ 18).
  • Why? Features shrink, and semiconductor dies grow.
  • Corollaries: Gate delays decrease. Wires become
    relatively longer.
  • In the past the focus has been on making
    "conventional" processors faster.
  • Faster clocks.
  • Clever architecture and implementation →
    instruction-level parallelism.
  • Clever architecture (speculation, predication,
    etc.), HW/SW prefetching, and massive caches ease
    the memory-wall problem.
  • Problems
  • Faster clocks → more power: P ∝ X·V²·F, but
    F_max ∝ V, so P ∝ F³ (spelled out after this
    list),
  • where X is proportional to the avg. number of
    gates active per clock cycle.
  • More power goes to overhead: caches, predictors,
    Tomasulo logic, clock distribution, ...
  • Big dies → fewer dies per wafer, lower yields,
    higher costs.
  • Together → expensive, power-hungry processors on
    which some signals take 6 cycles to cross.
  • Meanwhile, the market for high-end graphics (and
    gaming) drove graphics processors to ride the
    Moore's law wave.
  • Their transistor counts and power consumption are
    comparable to CPUs, but ...
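
Spelled out as a derivation (the same first-order CMOS dynamic-power
scaling argument as the bullet above; X is the activity factor defined
there):

  P \;\propto\; X\,V^{2}F, \qquad F_{\max} \;\propto\; V
  \;\;\Longrightarrow\;\; P \;\propto\; X\,F^{3}

so raising the clock from F to kF raises dynamic power by roughly a
factor of k³; doubling the clock costs about 8x the power.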

7
General Purpose Hardware
At the limit of usability?
8
The Price of Over-Generalization?
Or the curse of adding special purpose widgets?
(Special order: $1200 from Wenger.)
9
Competing with charcoal?
(Thanks to Bob Colwell.)
10
The Killer Micros
  • At Supercomputing '89 in Reno, Eugene Brooks
    prophesied:
  • "No one will survive the attack of the killer
    micros!"

Will we survive the attack of the killer
accelerators?
11
Fast Forward to the Late '90s
  • 1995 edition → emerging battle: bear and dogs
    on the cover.
  • 1998 revision →
  • Cerberus, the 3-headed dog.

Message: large groups of small, agile,
general-purpose devices will deprive large,
inflexible, specialized devices of significant
parts of their market niche(s).
12
CPUs are changing too: More On-Chip
Parallelism, More Efficient Parallelism.
  • Single-thread ILP
  • More efficient, smaller cores. Less speculation.
    Less predication.
  • Multi-threading CLP
  • Run other threads during (potential) stall
    cycles.
  • Multi-core CLP
  • Replicate simple core designs.

→ Dramatic improvements in performance and
efficiency using general-purpose cores.
13
Special Purpose vs Conventional
3.2 GHz special-purpose chip: 230 SP GFLOPS, 110 DP
GFLOPS ('08), 1 TF (est., 2010)
3.4 GHz conventional CPU: 54 SP GFLOPS
(Peter Hofstee, LACSI '06)
14
Multimedia and Graphics Devices
AMD/ATI R600: 64 unified shaders, 515 SP
GFLOPS
Nvidia G80: 128 cores organized as teams of
SIMD engines, 330 SP GFLOPS
15
Intel Teraflops Research Chip
  • 80 cores on the die.
  • > 1 TF
  • A research experiment, not a general-purpose
    device,
  • but products are in the planning stages.

16
Research Issues and RENCI Contributions.
  • Our version of the challenges
  • Identify niche(s) for GPGPU acceleration.
  • Expand those niches.
  • Issues we will address
  • GPGPU vs CPU evolutionary curves.
  • Graphics quality vs Mission-critical RAS
  • Battling Amdahl's law.
  • Productive programmability.
  • New computational methods.
  • Approach
  • Quantitative Analysis and Design
  • Performance instrumentation
  • Application Performance Studies
  • Moving enough of an LQCD code to GPGPU to beat
    Amdahl.
  • Language support: compilers and libraries to
    move larger fractions of applications onto the
    accelerators.
  • Use the same approach to generate tightly-coupled
    code on general-purpose multi-core chips.

17
Amdahl's law issues.
  • Approximately: the value of acceleration is bounded
    by the fraction of the program that can be
    accelerated (see the worked bound after this list).
  • Scenario today:
  • GPGPU speeds up some kernels by 6-10X, and more.
  • Costs:
  • Capital and maintenance costs of accelerators.
  • Just plugging in the GPGPU increases power
    consumption by 28 to 90 watts.
  • At peak execution, a top-end GPU adds 125-185 W
    (or more?).
  • Question: Can we beat the tradeoff?
  • Fraction of the critical app. runnable on GPU?
  • Job mix?
  • Future architectures will improve this:
  • Increasing co-location of CPU and GPU.
  • Better on-chip power management.
  • Design for lower peak power.
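
A worked form of the bound above (a minimal restatement of Amdahl's law;
f is the accelerable fraction and s the kernel speedup, generic symbols
rather than measured values):

  S_{\text{overall}} \;=\; \frac{1}{(1-f) + f/s},
  \qquad \lim_{s \to \infty} S_{\text{overall}} \;=\; \frac{1}{1-f}

For example, f = 0.5 with a 10X kernel speedup gives
S = 1/(0.5 + 0.05) ≈ 1.8X, and even infinite acceleration caps out at
2X, so expanding the accelerable fraction matters as much as raw kernel
speed.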

18
RENCI Performance Work.
  • Performance instrumentation and analysis:
  • Integrate GPU counters with CPU EBS
    instrumentation.
  • Application study:
  • LQCD - move at least 60% of the full app. to the
    GPU.
  • Data transposition to improve locality.
  • GPU-friendly versions of QLA, QMP, and QDP.
  • Execute entire loop nests in parallel on the GPU
    (see the sketch after this list).
  • Pipeline between tiles; reduce CPU ↔ GPU
    transfers.
  • RAS issues with implementation:
  • End-to-end robustness must be added to the
    software architecture.
  • Errors need to be detectable and correctable.
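
A minimal CUDA sketch of the last two ideas above (illustrative only;
the kernel, array names, and sizes are hypothetical stand-ins, not taken
from the actual LQCD code): run a whole 2D loop nest as a single GPU
kernel and keep the data resident on the device across solver
iterations, so host-device transfers happen only at the start and end.

// Hypothetical example: a 2D axpy-style loop nest run entirely on the GPU.
// The point is the structure (one kernel per loop nest, device-resident
// data, transfers only at the boundaries), not the particular computation.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

__global__ void axpy2d(float a, const float *x, float *y, int nx, int ny)
{
    // One GPU thread per (i, j) element replaces the inner CPU loop nest.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < nx && j < ny)
        y[j * nx + i] = a * x[j * nx + i] + y[j * nx + i];
}

int main()
{
    const int nx = 1024, ny = 1024, iters = 100;
    const size_t n = (size_t)nx * ny, bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes), *h_y = (float *)malloc(bytes);
    for (size_t k = 0; k < n; ++k) { h_x[k] = 1.0f; h_y[k] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);

    // Copy inputs to the device once, up front.
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Iterate entirely on the device: no CPU <-> GPU traffic in the loop.
    dim3 block(16, 16);
    dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    for (int it = 0; it < iters; ++it)
        axpy2d<<<grid, block>>>(0.5f, d_x, d_y, nx, ny);

    // Copy the result back only after all iterations are done.
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", h_y[0]);

    cudaFree(d_x); cudaFree(d_y);
    free(h_x); free(h_y);
    return 0;
}

The same structure generalizes to pipelining between tiles: as long as
consecutive kernels consume data already on the device, the expensive
transfers stay off the critical path.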

19
Lattice Quantum Chromodynamics in Practice
  • Computation is organized as campaigns.
  • Each campaign is a set of workflows.
  • Each workflow is a collection of serial and
    parallel jobs.
  • No one job is really huge!
  • → Run modest-size jobs on a GPU-accelerated box.

20
A Path to 20 TF Sustained.
  • Application characteristics:
  • The MILC QCD code achieves 35% of peak on x86_64
    clusters and BG/L; 40% is deemed achievable.
  • Over-partition to keep data in L2/L3 cache.
  • Scalability is now limited by the CG solver.
  • Production tasks: 128 CPUs, ~1.5 (0.5) TF peak
    (sustained).
  • Prototype machine characteristics (the arithmetic
    is spelled out after this list):
  • Each node:
  • CPUs: 2 sockets x 4 cores → ~100 GF
  • GPUs: 1 or 2 @ 500 GF → g x 500 GF
  • 40 nodes → 4.0 + g x 20 TF (peak)
  • Cost per node: $2.5k + g x $600.
  • Cost for fast interconnect: $1K/node.
  • A 20+4 (40+4) TF peak machine for $150K ($168K).
  • Sustained 20 TF depends critically on reducing
    the fraction that must be run on the CPUs.
  • Example: if 50% is acceleratable and all parts run
    at 40% of peak →
  • Need 25 TF of peak CPUs + infinite acceleration.
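
Restating the peak and sustained arithmetic above (same numbers as the
bullets: 40 nodes, about 100 GF of CPU and g x 500 GF of GPU per node):

  P_{\text{peak}} \;=\; 40 \times \bigl(100\,\text{GF} + g \times 500\,\text{GF}\bigr)
  \;=\; (4 + 20g)\ \text{TF}
  \;\Rightarrow\; 24\ \text{TF}\ (g=1),\quad 44\ \text{TF}\ (g=2)

For the sustained bound: if only half of the work is acceleratable and
the CPU half runs at 40% of CPU peak, then even with infinite
acceleration the run time is set by the CPU half, so the sustained rate
is W / [(W/2) / (0.4 P_CPU)] = 0.8 P_CPU. Reaching 20 TF sustained
therefore requires 25 TF of CPU peak, which is the "25 TF peak CPUs +
infinite acceleration" line above.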

21
RENCI Language Effort.
  • Observation: OpenMP-like programming is
    insufficient to beat Amdahl, even on conventional
    multi-processors.
  • The "libraries for important kernels" approach is
    also very limited, though useful.
  • See MATLAB, ClearSpeed, etc.
  • Our approach: supercomputer-style languages and
    compilers.
  • Languages: GAS and HPCS best of breed.
  • Parallelization: semi-automatic, directive-driven.
  • Vectorization / SIMD-ization: see Cray TMI.
  • Streaming and coarse-grain pipelining between
    cores.
  • See Rice HPF and CAF compilation strategies.