1
GPU Computing for Battlefield HPC Applications:
Real-time SIRE Radar Image Processing

David Richie, Brown Deer Technology
James Ross, High Performance Technologies, Inc.
Song Park and Dale Shires, U.S. Army Research Lab
March 19th, 2009
2
Outline of Talk
  • Advanced Computing Strategic Technology
    Initiative
  • Battlefield Application: UWB SIRE RADAR
  • GPU acceleration of critical algorithm
  • Initial Benchmarks
  • Future work and conclusions

3
Advanced Computing Strategic Technology Initiative
Battlespace applications:
  • Company Operations and Intelligence Cells
  • Networks
  • Sensors / Information Fusion
  • Biometrics
4
Modern GPU Architectures
  • FireStream 9250
  • AMD RV770 Architecture
  • 800 SIMD superscalar processors
  • Supports SSE-like vec4 operations
  • IEEE single/double precision
  • 1 TFLOP peak single precision
  • 200 GFLOPS peak double-precision
  • 1 GB GDDR3 on-board memory
  • < 120 W max - 80 W typical
  • 8-12 GFLOPS per Watt
  • MSRP $1,000

5
Why the Interest in GPUs?
  • N-Body Simulation
  • N particles subject to gravitational force -
    the canonical O(N²) algorithm
  • As much as 123x speedup - 164 GFLOPS sustained
    (incl. div and sqrt)
  • GPU floating-point engines are powerful; the
    trick is to keep the pipeline full

[Chart: step time (sec) vs. number of particles for AMD FireStream 9170, Nvidia Quadro 4600, Athlon64 X2 5000 BE (3.2 GHz o.c.), and Intel Xeon 5150 (2.66 GHz); the GPU curve shows super-linear scaling. Benchmarks from September 2008.]
6
100 TFLOPS, battlefield deployable, by 2012?
  • What can be built today?
    • COTS solution: 2U/4U chassis - 16 RV770 GPUs -
      16 TFLOPS - 2.5 kW
  • Future assumptions
    • Architecture: assume 3x performance increase
      • RV770 (55 nm) - 800 cores - today
      • RV870 (40 nm) - 2000(?) cores - 2009
      • RV970 (32 nm?) - 2400(?) cores - 2010
    • Design: assume 2x performance increase
      • Dual-GPU boards available now, dual-slot form
        factor
      • Dual-GPU boards, single-slot via lower-power
        liquid cooling
    • Power: assume power-constrained, 200 W per board(?)
  • Result
    • 96 TFLOPS - 3.2 kW - 2 cu. ft. (2U/4U) by 2011
  • What will the software look like?
    • Programming model? Compilers? Runtime?
      Portability?
  • Impact of deployable HPC for battlefield
    applications?

7
Battlefield Application: UWB SIRE RADAR
  • Ultra Wide-Band Synchronous Impulse
    Reconstruction RADAR
  • Obstacle avoidance and concealed target detection
  • Under development by researchers at ARL/SEDD
  • Algorithms developed in MATLAB, being ported to C
    and GPUs

8
UWB SIRE RADAR
[Diagram: vehicle-mounted system - left/right Tx units, multi-channel Rx array, onboard data acquisition and processing, and the area to be imaged ahead of the vehicle.]
  • Left/right transmitter and multi-channel
    receivers
  • Onboard Data Acquisition hardware
  • Original work involved off-platform
    post-processing using MATLAB
  • Objective is real-time processing capability
  • RADAR image processing and target detection
  • Current project is focusing on achieving
    real-time capability using GPUs

9
GPU Acceleration of SIRE Back Projection
Host code using ATI Stream Brook compiler:

    ...
    float  s_data<nas>;
    float4 s_rx<na>;
    float4 s_tx<na>;
    float4 s_img<100,64>;

    streamRead(s_data, data_all);
    streamRead(s_rx, rx4);
    streamRead(s_tx, tx4);

    backprojection_gpu_kern(
        (float)na, (float)ns, (float)nrange2, (float)nxrange2,
        yref, xr_inc, r_inc, r_start, rdr,
        coef1, coef2, coef3,
        s_rx, s_tx, s_data, s_img);

    streamWrite(s_img, img);
    ...
[Flowchart of processing stages: Compute Frame Data, Transform and Extraction, Fix Moving Distortion and Filter, Get Frame Data, Interp1, Calc Rx and Tx, Back Projection (70% of computation; mapped to the GPU), Update Image]
10
GPU Kernels for Stream Computing
Host code using ATI Stream Brook compiler (same listing as the previous slide)
Stream Computing Model

[Diagram: stream computing model - input streams → kernel → output streams]
  • Organize data into 1D/2D/3D arrays
  • Kernel is implicitly applied over the domain
  • Memory access can stream or scatter/gather
  • Tune the algorithm to the architecture
    • SSE-like 4-vector operations
    • Avoid branching; use masks
  • In practice, two main challenges were encountered
    • Data transfer bottleneck
    • Keeping the floating-point pipelines full

11
UWB SIRE RADAR Initial Benchmarks
  • CPU baseline uses a single core - opportunity for
    SSE and OpenMP optimization
  • GPU implementations have opportunity for
    optimization as well
  • Impact on real-time capability
    • C / Xeon E5450: total time 45.5 sec → 13 mph
    • ATI Radeon HD 4870: total time → 34 mph
    • Amdahl's Law appears: relative cost of Back
      Projection 70% → 23%
  • Need to examine other parts of the overall
    algorithm

12
GPU Acceleration, Where Does the Time Go?
  • Breakdown of back-projection timing for the GPU
    co-processor
    • tCPU,CTL (control) - 1.3 sec (31%)
    • tCPU,XFER - 1.0 sec (24%)
    • tPCIe - 1.0 sec (24%) - 2 GB/s, or 25% of max
    • tGPU - 0.9 sec (21%)
  • Under-clocked CPU, redundant double data transfer
  • ~80% of the co-processor time is unrelated to the
    actual GPU
  • Room for improvement (vendor and programmer)

13
Ongoing and Future Work
  • Continuing optimization of the SIRE algorithm
    • Multi-core CPU and GPU
    • Multi-node distributed algorithms using GPUs
  • Ongoing efforts for other applications
    • Transmission Line Matrix (TLM) solver
    • Ray tracing
    • Other C4ISR applications
  • Practical evaluation of rapidly evolving GPU
    technology
    • Programming models, software, middleware

14
Conclusions
  • Hardware for deployable HPC (>100 TFLOPS)
    available within 3 years
  • Programming models, software, middleware are
    evolving rapidly, but consensus beginning to
    appear, e.g., stream computing, OpenCL
  • True acceleration can be achieved now in
    realistic prototypical codes
  • Most likely critical deficiency will be lack of
    experience/expertise
  • The ability to leverage this capability in future
    depends critically on working with the technology
    now

Contact: David Richie (drichie@browndeertechnology.com)