1
GPU Computing for Battlefield HPC Applications:
Real-time SIRE Radar Image Processing

David Richie, Brown Deer Technology
James Ross, High Performance Technologies, Inc.
Song Park and Dale Shires, U.S. Army Research Lab
March 19th, 2009
2
Outline of Talk
  • Advanced Computing Strategic Technology
    Initiative
  • Battlefield Application: UWB SIRE RADAR
  • GPU acceleration of critical algorithm
  • Initial Benchmarks
  • Future work and conclusions

3
Advanced Computing Strategic Technology Initiative
Battlespace applications:
  • Company Operations and Intelligence Cells
  • Networks
  • Sensors / Information Fusion
  • Biometrics
4
Modern GPU Architectures
  • FireStream 9250
  • AMD RV770 Architecture
  • 800 SIMD superscalar processors
  • Supports SSE-like vec4 operations
  • IEEE single/double precision
  • 1 TFLOP peak single precision
  • 200 GFLOPS peak double-precision
  • 1 GB GDDR3 on-board memory
  • < 120 W max - 80 W typical
  • 8-12 GFLOPS per Watt
  • MSRP $1,000

5
Why the Interest in GPUs?
  • N-Body Simulation
  • N particles subject to gravitational force -
    the canonical O(N²) algorithm
  • As much as 123x speedup - 164 GFLOPS sustained
    (incl. div and sqrt)
  • GPU floating-point engines are powerful; the
    trick is to keep the pipeline full

[Chart: step time (sec) vs. number of particles for AMD FireStream 9170, Nvidia Quadro 4600, Athlon64 X2 5000 BE (3.2 GHz o.c.), and Intel Xeon 5150 (2.66 GHz); the GPU curve shows super-linear scaling. Benchmarks from September 2008.]
6
100 TFLOPS, battlefield deployable, by 2012?
  • What can be built today?
    • COTS solution: 2U/4U chassis - 16 RV770 GPUs -
      16 TFLOPS - 2.5 kW
  • Future assumptions
    • Architecture: assume 3x performance increase
      • RV770 (55 nm) - 800 cores - today
      • RV870 (40 nm) - 2000(?) cores - 2009
      • RV970 (32 nm?) - 2400(?) cores - 2010
    • Design: assume 2x performance increase
      • Dual-GPU boards available now, dual-slot form
        factor
      • Dual-GPU boards, single-slot via lower-power
        liquid cooling
    • Power: assume power-constrained, 200 W per board(?)
  • Result
    • 96 TFLOPS - 3.2 kW - 2 cu. ft. (2U/4U) by 2011
  • What will the software look like?
    • Programming model? Compilers? Runtime?
      Portability?
  • Impact of deployable HPC for battlefield
    applications?

7
Battlefield Application: UWB SIRE RADAR
  • Ultra Wide-Band Synchronous Impulse
    Reconstruction RADAR
  • Obstacle avoidance and concealed target detection
  • Under development by researchers at ARL/SEDD
  • Algorithms developed in MATLAB, being ported to C
    and GPUs

8
UWB SIRE RADAR
[Diagram: vehicle-mounted system - left/right Tx units, multi-channel Rx array, onboard data acquisition and processing, and the area to be imaged ahead of the vehicle.]
  • Left/right transmitter and multi-channel
    receivers
  • Onboard Data Acquisition hardware
  • Original work involved off-platform
    post-processing using MATLAB
  • Objective is real-time processing capability
  • RADAR image processing and target detection
  • Current project is focusing on achieving
    real-time capability using GPUs

9
GPU Acceleration of SIRE Back Projection
Host code using ATI Stream Brook compiler:

    ...
    float  s_data<nas>;
    float4 s_rx<na>;
    float4 s_tx<na>;
    float4 s_img<100,64>;

    streamRead(s_data, data_all);
    streamRead(s_rx, rx4);
    streamRead(s_tx, tx4);

    backprojection_gpu_kern(
        (float)na, (float)ns, (float)nrange2, (float)nxrange2,
        yref, xr_inc, r_inc, r_start, rdr,
        coef1, coef2, coef3,
        s_rx, s_tx, s_data, s_img);

    streamWrite(s_img, img);
    ...
[Flowchart of processing stages: Compute Frame Data, Transform and Extraction, Fix Moving Distortion and Filter, Get Frame Data, Interp1, Calc Rx and Tx, Back Projection (70% of computation; mapped to the GPU), Update Image]
10
GPU Kernels for Stream Computing
Host code using ATI Stream Brook compiler (same listing as the previous slide)
Stream Computing Model

[Diagram: stream computing model - input streams → kernel → output streams]
  • Organize data into 1D/2D/3D arrays
  • Kernel is implicitly applied over the domain
  • Memory access can stream or scatter/gather
  • Tune the algorithm to the architecture
    • SSE-like 4-vector operations
    • Avoid branching; use masks
  • In practice, two main challenges were encountered
    • Data transfer bottleneck
    • Keeping the floating-point pipelines full

11
UWB SIRE RADAR Initial Benchmarks
  • CPU baseline uses a single core - opportunity for
    SSE and OpenMP optimization
  • GPU implementations have opportunity for
    optimization as well
  • Impact on real-time capability
    • C / Xeon E5450: total time 45.5 sec → 13 mph
    • ATI Radeon HD 4870: total time → 34 mph
    • Amdahl's Law appears: relative cost of Back
      Projection 70% → 23%
  • Need to examine other parts of the overall
    algorithm

12
GPU Acceleration, Where Does the Time Go?
  • Breakdown of back-projection timing for the GPU
    co-processor
    • tCPU,CTL (control) - 1.3 sec (31%)
    • tCPU,XFER - 1.0 sec (24%)
    • tPCIe - 1.0 sec (24%) - 2 GB/s, or 25% of max
    • tGPU - 0.9 sec (21%)
  • Under-clocked CPU, redundant double data transfer
  • ~80% of the co-processor time is unrelated to the
    actual GPU
  • Room for improvement (vendor and programmer)

13
Ongoing and Future Work
  • Continuing optimization of the SIRE algorithm
    • Multi-core CPU and GPU
    • Multi-node distributed algorithms using GPUs
  • Ongoing efforts for other applications
    • Transmission Line Matrix (TLM) solver
    • Ray tracing
    • Other C4ISR applications
  • Practical evaluation of rapidly evolving GPU
    technology
    • Programming models, software, middleware

14
Conclusions
  • Hardware for deployable HPC (>100 TFLOPS)
    available within 3 years
  • Programming models, software, middleware are
    evolving rapidly, but consensus beginning to
    appear, e.g., stream computing, OpenCL
  • True acceleration can be achieved now in
    realistic prototypical codes
  • Most likely critical deficiency will be lack of
    experience/expertise
  • The ability to leverage this capability in future
    depends critically on working with the technology
    now

Contact: David Richie (drichie@browndeertechnology.com)