Title: Challenges in Binary Translation for Desktop Supercomputing
1. Challenges in Binary Translation for Desktop Supercomputing
David Kaeli, Rodrigo Dominguez
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
2. Current trends in Many-core Computing
- The CPU industry has elected to jump off the cycle-time scaling bandwagon
- Power/thermal constraints have become a limiting factor
- We now see CPU vendors placing multiple (10s of) cores on a single chip
- Clock speeds have not changed
- The memory wall persists, and multiple cores that assume a shared-memory model place further pressure on this problem
- Software vendors are looking for new parallelization technology
  - Multi-core aware operating systems
  - Semi-automatic parallelizing compilers
3. Current trends in Many-core Computing
- There has been a renewed interest in parallel computing paradigms and languages
- Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
- Heterogeneous systems are becoming a common theme
- The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources
4. Graphics Processors
- Graphics Processing Units
  - More than 64% of Americans played a video game in 2009
  - High-end: primarily used for 3-D rendering in videogame graphics and movie animation
  - Mid/low-end: primarily used for computer displays
- Manufacturers include NVIDIA, AMD/ATI, IBM (Cell)
- A very competitive commodity market
5. GPU Performance
- GPUs provide a path for performance growth
- Cost and power usage numbers are also impressive
- Near-exponential growth in performance for GPUs!
Source: NVIDIA, 2009
6. Comparison of CPU and GPU Hardware Architectures
- CPU: cache heavy, focused on individual thread performance
- GPU: ALU heavy, massively parallel, throughput-oriented
7. CPU/GPU Relationship
- CPU (host)
- GPU with local DRAM (device)
8. A wide range of GPU apps
- Film
- Financial
- Languages
- GIS
- Holographic cinema
- Machine learning
- Mathematics research
- Military
- Mine planning
- Molecular dynamics
- MRI reconstruction
- Multispectral imaging
- N-body simulation
- Network processing
- Neural network
- Oceanographic research
- Optical inspection
- Particle physics
- 3D image analysis
- Adaptive radiation therapy
- Acoustics
- Astronomy
- Audio
- Automobile vision
- Bioinformatics
- Biological simulation
- Broadcast
- Cellular automata
- Fluid dynamics
- Computer vision
- Cryptography
- CT reconstruction
- Data mining
- Digital cinema / projections
- Electromagnetic simulation
- Equity trading
- Protein folding
- Quantum chemistry
- Ray tracing
- Radar
- Reservoir simulation
- Robotic vision / AI
- Robotic surgery
- Satellite data analysis
- Seismic imaging
- Surgery simulation
- Surveillance
- Ultrasound
- Video conferencing
- Telescope
- Video
- Visualization
- Wireless
- X-Ray
9. GPU as a General-Purpose Computing Platform
- Speedups are impressive and ever increasing!
  - Real-Time Elimination of Undersampling Artifacts: 2300x
  - Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840x
  - Genetic Algorithm: 1000x
  - Total Variation Modeling: 2600x
  - Monte Carlo Simulation of Photon Migration: 1000x
  - Stochastic Differential Equations: 675x
  - K-Nearest Neighbor Search: 470x
  - Fast Total Variation for Computer Vision: 1000x
Source: CUDA Zone at www.nvidia.com/cuda/
10. GPGPU is becoming mainstream research
- Research activities are expanding significantly
- Search results for the keyword "GPGPU" in IEEE and ACM
11. Streaming Processor Array (NVIDIA GT200 architecture)
- Hardware hierarchy: Streaming Processor Array, Texture Processor Cluster, Streaming Multiprocessor (8 SPs, 2 SFUs, texture unit)
- Execution hierarchy: grid of thread blocks; multiple thread blocks, many warps of threads; individual threads
- 240 shader cores
- 1.4B transistors
- Up to 2GB onboard memory
- 150GB/sec memory bandwidth
- 1.06 TFLOPS single precision
- CUDA and OpenCL support
- Programmable memory spaces
- Tesla S1070 provides 4 GPUs in a 1U unit
12. AMD/ATI Radeon HD 5870
- Codename "Evergreen"
- 1600 SIMD cores
- L1/L2 memory architecture
- 153GB/sec memory bandwidth
- 2.72 TFLOPS single precision
- OpenCL and DirectX 11 support
- Hidden memory microarchitecture
- Provides for vectorized operation
13. Comparison of CPU and GPU Hardware Architectures

  CPU/GPU          Single precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
  NVIDIA 285       1.06                      240     5.8           3.12
  NVIDIA 295       1.79                      480     6.2           3.80
  AMD HD 5870      2.72                      1600    14.5          0.16
  AMD HD 4890      1.36                      800     7.2           0.18
  Intel Core i7 965 0.051                    4       0.39          11.02

Source: NVIDIA, AMD, and Intel
14. AMD vs. NVIDIA

                         AMD                                   NVIDIA
  Hardware architecture  Vector                                Scalar
  Programming language   Brook, IL, OpenCL                     CUDA, OpenCL
  Programming model      SIMD vector                           SIMT
  Thread hierarchy       Single level                          Two level
  Memory exposure        Uniform space                         Multiple spaces
  Source of horsepower   Vectorization and multiple outputs    Memory space utilization, including shared memory
  Pros                   Easier programming                    More flexible programming
  Challenges             Harnessing the potential horsepower   Harnessing the potential horsepower
15. Talk Outline
- Introduction to GPUs
- Overview of the tool chains for both CUDA and OpenCL
- Motivation for pursuing this work
- Comparing intermediate representations
- Leveraging/analyzing benefits of Open64 optimization on AMD GPUs
- Comparing challenges with fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)
- Discuss PTX and IL
- Describe new common IR
- Two examples of PTX → IR → IL binary translation
- Discuss status of project and future work
16. GPU Programming Model
- Single Instruction, Multiple Threads (SIMT)
- Parallelism is implicit
- Programs (also called kernels or shaders) are generally small and contain nested loops
- Synchronization is handled explicitly
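The bullets above can be made concrete with a CPU-side sketch (our illustration, not real CUDA): the kernel body is written for a single thread, the parallelism is implicit in the thread index, and the hardware runs one instance of the body per index.

```cpp
#include <cstddef>
#include <vector>

// SIMT sketch: vecAddBody is the "kernel", written for one thread.
// The loop below stands in for the GPU, which would run all instances
// in parallel. Names here are illustrative, not CUDA's API.
void vecAddBody(std::size_t tid, const int* A, const int* B, int* C) {
    C[tid] = A[tid] + B[tid];   // the thread index selects this thread's data
}

std::vector<int> launchVecAdd(const std::vector<int>& A,
                              const std::vector<int>& B) {
    std::vector<int> C(A.size());
    for (std::size_t tid = 0; tid < A.size(); ++tid)  // GPU: all in parallel
        vecAddBody(tid, A.data(), B.data(), C.data());
    return C;
}
```

Note that the kernel itself contains no loop over elements; the launch supplies one logical thread per element, which is what "parallelism is implicit" means above.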
17. Toolchains
- Toolchain = compiler + runtime library
- NVIDIA
- AMD
18. CUDA Compiler
Compile time: C for CUDA source goes through cudafe, which splits it into a GPU path (Open64, emitting PTX) and a host path (host compiler, emitting the exe).
Execution time: the runtime and driver translate the PTX into the device binary.
PTX is included as data in the host application.
19. OpenCL (Dynamic) Compiler
Compile time: only the host program is compiled (host compiler, emitting the exe).
Execution time: the OpenCL library compiles kernels through LLVM, and the runtime and driver produce the device binary.
20. Objectives of our work
- Compare two different IRs from similar massively-threaded architectures
- Influence future IR design (an active topic in GPGPU research)
- Leverage/analyze benefits of Open64 optimizations
- Compare challenges with fundamentally different ISAs: superscalar/SIMT versus VLIW/SIMT
21. CUDA Runtime
- Device Management
  - cudaSetDevice, cudaGetDevice
- Memory Management
  - Allocation: cudaMalloc, cudaFree
  - Transfer: cudaMemcpy, cudaMemset
- Execution Control
  - Kernel launch: cudaLaunch
  - Config: cudaConfigureCall
- Thread Management
  - cudaThreadSynchronize
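The Execution Control entries above fit together in a fixed sequence: nvcc lowers a `<<<grid, block>>>` launch into cudaConfigureCall, one cudaSetupArgument per kernel argument, and a final cudaLaunch. The toy mock below (our sketch with simplified signatures, not NVIDIA's libcudart, whose real entry points take dim3 grids and stream handles) records that sequence to make the expansion visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mocked launch-control entry points: each call just records its effect
// in a launch state, mirroring how the runtime stages a kernel launch.
struct LaunchState {
    int gridDim = 0;
    int blockDim = 0;
    std::vector<std::vector<char>> args;   // raw bytes of each argument
    std::string launched;                  // kernel name passed to cudaLaunch
};
static LaunchState state;

int cudaConfigureCall(int grid, int block) {
    state = LaunchState{};                 // start a fresh launch
    state.gridDim = grid;
    state.blockDim = block;
    return 0;                              // cudaSuccess
}

int cudaSetupArgument(const void* arg, std::size_t size, std::size_t /*offset*/) {
    const char* p = static_cast<const char*>(arg);
    state.args.emplace_back(p, p + size);  // stage the argument bytes
    return 0;
}

int cudaLaunch(const char* kernelName) {
    state.launched = kernelName;           // the driver would run the kernel here
    return 0;
}
```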
22. CUDA Runtime (Vector Add example)

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    int hA[N];
    int hB[N];
    ...
    cudaMemcpy(dA, hA, sizeof(hA), HostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), HostToDevice);
    vecAdd<<<1, N>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, sizeof(hC), DeviceToHost);
}

The <<<...>>> launch expands into cudaConfigureCall, cudaSetupArgument, and cudaLaunch.
23. NVIDIA PTX
- Low-level IR (close to the ISA)
- Pseudo-assembly style syntax
- Load-store instruction set
- Strongly typed language, e.g.: cvt.s32.u16 r1, tid.x
- Unlimited virtual registers
- Predicate registers
24. AMD IL
- High-level IR
- Structured control flow (if-endif, while-end, switch-end)
- No predication
- 32-bit registers with 4 components, enabling vectorization
25. Common PTX and IL instructions

vectorAdd (PTX):
mov.u16       rh1, ctaid.x
mov.u16       rh2, ntid.x
mul.wide.u16  r1, rh1, rh2
cvt.u32.u16   r2, tid.x
add.u32       r3, r2, r1
ld.param.s32  r4, [N]
setp.le.s32   p1, r4, r3
@p1 bra       LabelA
cvt.u64.s32   rd1, r3
mul.lo.u64    rd2, rd1, 4
ld.param.u64  rd3, [A]
add.u64       rd4, rd3, rd2
ld.global.f32 f1, [rd4+0]
ld.param.u64  rd5, [B]
add.u64       rd6, rd5, rd2
ld.global.f32 f2, [rd6+0]
add.f32       f3, f1, f2
ld.param.u64  rd7, [C]
add.u64       rd8, rd7, rd2
st.global.f32 [rd8+0], f3
LabelA:
exit

- Data movement (mov)
- Memory access (ld, st)
- Arithmetic (mul, add)
- Conversion (cvt)
- Comparison and selection (setp)
- Control flow (bra); uses predication for conditional branches
26. Common PTX and IL instructions

vectorAdd (IL):
mov  r0, vThreadGrpId.x
mov  r1, cb0[0].x
imul r2, r0, r1
mov  r3, vTidInGrp.x
iadd r4, r3, r2
mov  r5, cb1[3]
ige  r6, r4, r5
if_logicalz r6
  mov  r7, r4
  imul r8, r7, l0
  mov  r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov  r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add  r15, r11, r14
  mov  r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end

- Data movement (mov)
- Memory access (uav_raw_load, uav_raw_store)
- Arithmetic (imul, iadd)
- No conversion instructions
- Comparison and selection (ige)
- Control flow (if_logicalz); structured statements
27. Ocelot Framework
- Implemented as a CUDA library
  - Intercepts library calls
- PTX emulation on the CPU
  - Parses PTX into an internal IR
  - Analysis: CFG, SSA, data flow, optimizations
- Our work
  - IR for IL programs
  - PTX IR → IL IR translation
  - AMD/CAL backend

Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31-42, New York, NY, USA, 2010. ACM.
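The "intercepts library calls" point works because the framework supplies its own implementation of the CUDA Runtime API, so the application's calls land in the translator instead of NVIDIA's library. A toy sketch of that interposition (ours, not Ocelot's code; return codes and the backing store are simplified):

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Drop-in replacements for two CUDA Runtime entry points. Linking the
// application against this library instead of libcudart routes every
// call here; a real framework would then parse the embedded PTX and
// hand execution to an emulator or an alternate backend.
static std::vector<std::string> callLog;

extern "C" int cudaMalloc(void** devPtr, std::size_t size) {
    callLog.push_back("cudaMalloc(" + std::to_string(size) + ")");
    *devPtr = ::operator new(size);   // back "device" memory with host memory
    return 0;                         // cudaSuccess
}

extern "C" int cudaFree(void* devPtr) {
    callLog.push_back("cudaFree");
    ::operator delete(devPtr);
    return 0;
}
```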
28. Translation Framework
(Diagram: the compile-time flow producing the exe, extended with the translation path.)
29. IL Control Tree
- Based on structural analysis
- Build a DFS spanning tree of the control flow graph and traverse it in postorder
- Form regions and collapse the nodes in the CFG
- Construct the control tree in the process
- Repeat until only 1 node is left in the CFG

S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
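The collapse step above can be sketched as follows (a toy version under our own simplifications, not the project's code): the pass looks for one acyclic region shape, the if-then, collapses it into its head node, and records the abstract region node. A full implementation would also match blocks, if-then-else, and loop regions, repeating until a single node remains; the recorded regions form the control tree.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Minimal CFG: each basic block name maps to its set of successors.
struct CFG {
    std::map<std::string, std::set<std::string>> succ;
};

// Find one if-then region (head -> then -> follow, plus head -> follow),
// collapse it into the head node, and record the abstract region node.
// Returns true if a region was found and collapsed.
bool collapseIfThen(CFG& g, std::vector<std::string>& tree) {
    for (auto& [head, succs] : g.succ) {
        if (succs.size() != 2) continue;
        for (const std::string& cand : succs) {
            auto it = g.succ.find(cand);
            if (it == g.succ.end() || it->second.size() != 1) continue;
            std::string thenBlk = cand;              // copy before mutating
            std::string follow  = *it->second.begin();
            if (!succs.count(follow)) continue;      // head must also reach follow
            std::string h = head;                    // copy the key, too
            g.succ.erase(thenBlk);                   // absorb the then-block
            g.succ[h] = {follow};                    // region falls through
            tree.push_back("IF(" + h + "," + thenBlk + ")");
            return true;
        }
    }
    return false;
}
```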
30. IL Control Tree
(Diagram: abstract nodes representing regions.)
31. Example 1 (if-then)

PTX:
mov.u16      ...
setp.le.s32  p1, r4, r3
@p1 bra      LabelA
cvt.u64.s32  ...
LabelA:
exit
32. Example 1 (if-then)

IL:
mov ...
ige r6, r4, r5
if_logicalz r6
  mov ...
endif
end
33. Example 2 (for-loop)

PTX:
mov.u16      ...
setp.le.s32  p1, r5, r3
@p1 bra      LabelA
cvt.u64.s32  ...
LabelB:
setp.lt.s32  p2, r4, r5
@p2 bra      LabelB
LabelA:
exit

Control tree: the entry block (BB: mov...) feeds an IF node; its cond is BB: setp..., and its true branch is a Block containing BB: cvt... and a WHILE region (cond/body around the setp... back edge), followed by BB: exit.
34. Example 2 (for-loop)

IL:
mov ...
ige r7, r4, r6
if_logicalz r7
  mov ...
  whileloop
    if_logicalz r17
      break
    endif
  endloop
endif
end
35. Other BT Challenges
- Pointer arithmetic in CUDA needs to be emulated in CAL
- Translating the Application Binary Interface (ABI), e.g., different calling conventions
- Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
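The pointer-arithmetic bullet can be illustrated with one possible encoding (ours, for illustration; not necessarily the project's actual scheme): since IL addresses memory through numbered UAV resources rather than raw addresses, a translated PTX pointer can be carried as a (buffer id, byte offset) pair, and PTX address arithmetic becomes offset arithmetic.

```cpp
#include <cstdint>
#include <vector>

// Emulated "pointer" for a resource-based target: which buffer it
// refers to, plus a byte offset within that buffer.
struct EmuPtr {
    std::uint32_t buffer;
    std::uint32_t offset;
};

// PTX `add.u64 rd4, rd3, rd2` becomes offset arithmetic on the same buffer.
EmuPtr emuAdd(EmuPtr p, std::uint32_t bytes) {
    return {p.buffer, p.offset + bytes};
}

// PTX `ld.global.f32 f1, [rd4+0]` becomes an indexed load from the buffer.
float emuLoadF32(const std::vector<std::vector<float>>& buffers, EmuPtr p) {
    return buffers[p.buffer][p.offset / sizeof(float)];
}
```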
36. Project Status
- Main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)
- 3 CUDA applications from the SDK are running
- Code quality is comparable to LLVM code generation
37. Next Steps
- Enhance translation of the control tree to support other IL constructs (e.g., switch-case)
- Implement other GPGPU abstractions (e.g., shared memory, textures)
- Handle PTX predicated instructions (since IL does not support predication directly)
38. Summary and Future Work
- GPUs are revolutionizing desktop supercomputing
- A number of critical applications have been migrated successfully
- CUDA and OpenCL have made these platforms much more accessible for general-purpose computing
- AMD presently has the highest DP FP performance
- CUDA presently produces higher-performance code for NVIDIA
- We are developing a platform that leverages the best of both worlds