GPU Cluster for High Performance Computing

1
GPU Cluster for High Performance Computing
  • Zhe Fan, Feng Qiu, Arie Kaufman,
  • Suzanne Yoakum-Stover
  • Stony Brook University

2
Outline
  • Background
  • GPU: graphics processing unit
  • GPGPU: general-purpose computation using GPU
  • The Computational GPU Cluster
  • The Lattice Boltzmann Computation
  • Performance Evaluation
  • Application for Urban Dispersion Modeling
  • Conclusion and Future Work

3
Background: What's a GPU?
  • Graphics processing units
  • nVIDIA NV40, ATI R420
  • In COTS 3D graphics cards
  • GeForce 6800 Ultra, Radeon X800 XT
  • Modern GPU: 600M vertices, 6G pixels per second

4
Background: GPU Growth Rate
  • Driven by booming market for games
  • Rendering is embarrassingly parallel, so it can
    efficiently use a large number of computational
    units

5
Background: Graphics Pipeline
  • Vertex Processing
  • Transform 3D coords to 2D coords
  • Rasterization
  • Texture Memory
  • Fragment Processing
  • Combine colors to final image
  • Composite

6
Background: GPU Compute Power
  • High compute parallelism
  • 6 + 16 pipelines (vertex + fragment)
  • 4D vector fp32 instructions
  • More than 100 FLOPs/cycle!
  • Fast on-board memory
  • Bandwidth: 35.2 GB/sec
  • Size: 256 - 512 MB
  • Low price
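
The memory figures on this slide can be cross-checked with a short calculation. The sketch below takes the 256-bit GDDR3 bus width from the slide's pipeline diagram and the quoted 35.2 GB/sec bandwidth, and derives the effective transfer rate they imply. The variable names are illustrative, and 1 GB is assumed to mean 10^9 bytes.

```python
# Values quoted on the slide.
bus_width_bits = 256             # "256-Bit GDDR3"
bandwidth_bytes_per_s = 35.2e9   # "35.2 GB/sec", assuming 1 GB = 10**9 bytes

# A 256-bit bus moves 32 bytes per transfer.
bytes_per_transfer = bus_width_bits // 8

# Effective transfer rate implied by the two figures above.
transfers_per_s = bandwidth_bytes_per_s / bytes_per_transfer

print(f"{bytes_per_transfer} bytes per transfer")
print(f"{transfers_per_s / 1e9:.1f} billion transfers per second")
```

Under these assumptions the two numbers are consistent with an effective memory rate of about 1.1 billion transfers per second.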

Pipeline diagram labels: 6 vertex pipelines (vertex processing), rasterization, 16 fragment pipelines (fragment processing), 256-bit GDDR3 texture memory, composite

7
Background: GPGPU
  • Recent development of GPUs
  • Programmability
  • High-level language, Cg
  • 32-bit floating point
  • General-purpose computation using GPU
  • GPU accelerates certain computations by 10 times
  • Data parallelism
  • Computationally intensive
  • Relatively simple data structures and control flow

8
Background: GPGPU Examples
  • Scientific computation
  • Physically-based simulations
  • Linear algebra, conjugate gradient solver
  • Level set
  • Fast Fourier transform
  • Image and volume processing
  • Computational geometry
  • Database
  • Flow visualization
  • Global illumination
  • GPGPU language
  • Many papers. See http://www.gpgpu.org

9
Motivations
  • Scale up to a GPU cluster for large-scale problems
  • Explore its use for high-performance computing
  • Outperform COTS CPU clusters for certain
    computations
  • Motivated by
  • PlayStation2 computational cluster at NCSA
  • Graphics PC clusters
  • Humphrey et al. '02, Govindaraju et al. '03,
    etc.

10
Stony Brook Visual Computing Cluster
  • Gigabit Ethernet
  • 32 HP PCs
  • 64 Pentium Xeon 2.4GHz
  • 32 2GB DDR Memory
  • 32 120GB Hard Disk
  • 32 GeForce FX 5800 Ultra
  • 32 VolumePro 1000 (volume rendering)
  • 9 HP Sepia-2A cards (composite)
  • ServerNet (compositing network)
  • Computational GPU cluster
  • Visualization cluster

11
Computational GPU Cluster
  • Gigabit Ethernet
  • 32 HP PCs
  • 64 Pentium Xeon 2.4GHz
  • 32 2GB DDR Memory
  • 32 120GB Hard Disk
  • 32 GeForce FX 5800 Ultra

Price per item x count: $1,621 x 32, $349 x 32, $607 x 32, $150 x 32, $399 x 32
GPU peak: 16 GFLOPS x 32 (fragment pipeline capability)
By plugging in 32 GPUs, the theoretical peak is
increased by 512 GFLOPS at a price of only
$12,768: 1 GFLOPS for about $25.
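
The price/performance claim follows directly from the per-GPU figures quoted on this slide. The sketch below reproduces the arithmetic; the variable names are illustrative, and the currency is assumed to be US dollars.

```python
gpus = 32
gflops_per_gpu = 16   # fragment-pipeline peak quoted on the slide
price_per_gpu = 399   # per-GPU price quoted on the slide (assumed US dollars)

added_gflops = gpus * gflops_per_gpu             # 512
total_price = gpus * price_per_gpu               # 12768
price_per_gflops = total_price / added_gflops    # 24.9375

print(added_gflops, total_price, round(price_per_gflops, 2))
```

The result, just under 25 per GFLOPS, matches the rounded figure on the slide.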
12
GPU Cluster Architecture
  • Currently, reading data back from the GPU is slow

13
Lattice Boltzmann Model (LBM)
  • CFD method on the lattice
  • Numerical calculations
  • Stream
  • Collision
  • Yields the Navier-Stokes equations for incompressible fluid
  • Very flexible in specifying complex boundaries
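
The stream and collision steps can be illustrated with a few lines of NumPy on the D3Q19 lattice named in this slide's figure caption. This is a minimal CPU sketch, not the authors' GPU fragment program. It assumes a single-relaxation-time (BGK) collision rule, periodic boundaries, and a relaxation time `tau` of 0.6, none of which are stated on the slide; the names `equilibrium` and `step` are illustrative.

```python
import itertools
import numpy as np

# D3Q19 velocity set: the rest vector, 6 axis neighbours and
# 12 face-diagonal neighbours (the 8 cube corners are excluded).
C = np.array([c for c in itertools.product((-1, 0, 1), repeat=3)
              if sum(map(abs, c)) <= 2])
E = C.astype(float)
# Standard D3Q19 weights, selected by the number of non-zero components.
W = np.array([1 / 3, 1 / 18, 1 / 36])[np.abs(C).sum(axis=1)]


def equilibrium(rho, u):
    """Equilibrium distributions for density rho (X,Y,Z) and velocity u (3,X,Y,Z)."""
    cu = np.einsum('id,dxyz->ixyz', E, u)
    uu = np.einsum('dxyz,dxyz->xyz', u, u)
    return W[:, None, None, None] * rho * (1 + 3 * cu + 4.5 * cu**2 - 1.5 * uu)


def step(f, tau=0.6):
    """One LBM time step on distributions f of shape (19, X, Y, Z)."""
    rho = f.sum(axis=0)
    u = np.einsum('id,ixyz->dxyz', E, f) / rho
    # Collision (assumed BGK): relax each distribution towards equilibrium.
    f = f - (f - equilibrium(rho, u)) / tau
    # Stream: move each distribution one cell along its velocity (periodic).
    for i, c in enumerate(C):
        f[i] = np.roll(f[i], shift=tuple(int(k) for k in c), axis=(0, 1, 2))
    return f


rho0 = np.ones((4, 5, 6))
f0 = equilibrium(rho0, np.zeros((3, 4, 5, 6)))
f1 = step(f0)
print(C.shape, np.isclose(f1.sum(), f0.sum()))
```

Both steps touch only a cell and its nearest neighbours, which is what makes the method a good fit for data-parallel hardware.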

A cell of the D3Q19 lattice

14
LBM on a Single GPU
  • Li et al., Visual Computer 03
  • Program the fragment processing stage

15
Store LBM Data in Textures

16
Scale-up LBM
  • Domain decomposition
  • Communication
  • Read out from GPU
  • Network transfer
  • Write into GPUs

17
GPU <-> PC Data Transfer
  • Read data from GPU
  • Aggregate necessary boundary data together into a
    texture
  • Read it out in a single operation
  • Write data into GPU
  • Reverse the above procedure
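
The gather-then-read idea can be sketched on the CPU with NumPy. In this illustration a 3D array stands in for the data held in GPU textures, and a flat buffer stands in for the single texture that is read out in one operation. The helper names `face_index`, `pack_faces` and `unpack_faces` are hypothetical, and packing all six faces is an assumption, since the slide does not say which boundary data are needed.

```python
import numpy as np

# The six boundary faces of a 3D block: (axis, first or last index).
FACES = [(axis, side) for axis in range(3) for side in (0, -1)]


def face_index(axis, side):
    """Index tuple that selects one boundary face of a 3D array."""
    index = [slice(None)] * 3
    index[axis] = side
    return tuple(index)


def pack_faces(field):
    """Gather the six boundary faces into one contiguous buffer."""
    return np.concatenate([field[face_index(a, s)].ravel() for a, s in FACES])


def unpack_faces(buffer, field):
    """Reverse of pack_faces: scatter the buffer back onto the six faces."""
    offset = 0
    for a, s in FACES:
        face = field[face_index(a, s)]   # a view into field
        face[...] = buffer[offset:offset + face.size].reshape(face.shape)
        offset += face.size
    return field


field = np.random.default_rng(0).random((4, 5, 6))
buffer = pack_faces(field)                      # one buffer, one transfer
restored = unpack_faces(buffer, np.zeros_like(field))
print(buffer.size)
```

A single large transfer pays the fixed per-read cost once instead of once per face, which is the motivation the slide gives for aggregating first.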

18
Network Transfer
  • MPI
  • To minimize communication cost
  • Overlap network transfer time with computation
    time
  • Simplify communication pattern

19
To Compare with CPU Cluster Code
  • Background
  • On our CPU cluster, each node uses 1 CPU to
    compute
  • The CPU cluster code does not use SSE
  • (SSE is expected to be about 3 times faster)

Communication             CPU cluster code    GPU cluster code
Read from GPU overhead    N/A                 17
Network transfer time     Fully overlapped    Mostly overlapped
                          with computation    with computation

20
Performance Comparison
Scale-up test (each node computes an 80^3 sub-domain)

21
Performance Comparison (cont.)

22
For Further Improvement
  • Use newer generation GPUs
  • Already 2.2 times faster
  • Use PCI-Express bus
  • Much faster GPU <-> PC communication
  • 4 GB/sec in each direction
  • Multiple GPUs per PC to be feasible soon
  • Code can be optimized
  • Faster network

23
Urban Application
  • Use LBM to simulate airborne contaminant
    dispersion in complex urban environments
  • Simulation and visualization on a single GPU
  • Qiu et al., Visualization 2004

24
Simulation Area for GPU Cluster
Times Square Area of New York City
  • 1.66 km x 1.13 km
  • 91 blocks
  • 850 buildings

25
Air Flow, Times Square Area, NYC
  • 480 x 400 x 80
  • On 30 GPUs
  • 0.31 sec/step
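
These figures can be related to each other with a short calculation. The sketch below assumes each GPU holds an 80 x 80 x 80 block, reading the sub-domain size given for the scale-up test as 80 cubed; the slide does not state the block size used for this run. The variable names are illustrative.

```python
from math import prod

grid = (480, 400, 80)       # lattice size from the slide
sub_domain = (80, 80, 80)   # assumed block per GPU
seconds_per_step = 0.31     # from the slide

# Number of blocks along each axis, and the node count they imply.
blocks = tuple(g // s for g, s in zip(grid, sub_domain))
nodes = prod(blocks)

# Lattice cells updated per second across the whole cluster.
cells = prod(grid)
cells_per_second = cells / seconds_per_step

print(blocks, nodes)
print(f"{cells_per_second / 1e6:.1f} million cell updates per second")
```

Under this assumption the grid splits into 6 x 5 x 1 blocks, which matches the 30 GPUs on the slide.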

26
Dispersion Plume, Volume Rendered

27
Future Work
  • Other computations
  • E.g., PDE, FEM
  • GPUs as co-processors of CPUs
  • Carr et al., 2002, etc.
  • Online visualization and computational steering
  • Potential advantage of GPU cluster

28
Conclusions
  • Cluster of commodity GPUs for high-performance
    computing
  • LBM to simulate airborne dispersion in the urban
    environment
  • Compared with the same model on a CPU cluster, the GPU
    cluster is much faster, and better performance is
    anticipated
  • GPU cluster is very promising for scientific
    computation

29
Acknowledgements
  • NSF CCR0306438
  • Department of Homeland Security @ Environment
    Measurement Lab
  • Hewlett Packard
  • TeraRecon
  • Reviewers
  • Bin Zhang, Wei Li, Ye Zhao, Xiaoming Wei, Klaus
    Mueller, Brent Lindquist

30
Thank You!