Title: Challenges in Binary Translation for Desktop Supercomputing
1. Challenges in Binary Translation for Desktop Supercomputing
David Kaeli, Rodrigo Dominguez
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
2. Current trends in Many-core Computing
- The CPU industry has elected to jump off the cycle-time scaling bandwagon
- Power/thermal constraints have become a limiting factor
- We now see CPU vendors placing multiple (10s of) cores on a single chip
- Clock speeds have not changed
- The memory wall persists, and multiple cores that assume a shared-memory model place further pressure on this problem
- Software vendors are looking for new parallelization technology
  - Multi-core aware operating systems
  - Semi-automatic parallelizing compilers
3. Current trends in Many-core Computing
- There has been a renewed interest in parallel computing paradigms and languages
- Existing many-core architectures are being considered for general-purpose platforms (e.g., Cell, GPUs, DSPs)
- Heterogeneous systems are becoming a common theme
- The trend will only accelerate if proper programming frameworks are available to effectively exploit many-core resources
4. Graphics Processors
- Graphics Processing Units
  - More than 64% of Americans played a video game in 2009
  - High-end: primarily used for 3-D rendering in videogame graphics and movie animation
  - Mid/low-end: primarily used for computer displays
- Manufacturers include NVIDIA, AMD/ATI, IBM (Cell)
- A very competitive commodity market
5. GPU Performance
- GPUs provide a path for performance growth
- Cost and power usage numbers are also impressive
- Near-exponential growth in performance for GPUs!
Source: NVIDIA, 2009
6. Comparison of CPU and GPU Hardware Architectures
- CPU: cache heavy, focused on individual thread performance
- GPU: ALU heavy, massively parallel, throughput-oriented
7. CPU/GPU Relationship
- CPU (host)
- GPU with local DRAM (device)
8. A wide range of GPU apps
- Film
- Financial
- Languages
- GIS
- Holographic cinema
- Machine learning
- Mathematics research
- Military
- Mine planning
- Molecular dynamics
- MRI reconstruction
- Multispectral imaging
- N-body simulation
- Network processing
- Neural network
- Oceanographic research
- Optical inspection
- Particle physics
- 3D image analysis
- Adaptive radiation therapy
- Acoustics
- Astronomy
- Audio
- Automobile vision
- Bioinformatics
- Biological simulation
- Broadcast
- Cellular automata
- Fluid dynamics
- Computer vision
- Cryptography
- CT reconstruction
- Data mining
- Digital cinema / projections
- Electromagnetic simulation
- Equity trading
- Protein folding
- Quantum chemistry
- Ray tracing
- Radar
- Reservoir simulation
- Robotic vision / AI
- Robotic surgery
- Satellite data analysis
- Seismic imaging
- Surgery simulation
- Surveillance
- Ultrasound
- Video conferencing
- Telescope
- Video
- Visualization
- Wireless
- X-Ray
9. GPU as a General-Purpose Computing Platform
- Speedups are impressive and ever increasing!
  - Real-Time Elimination of Undersampling Artifacts: 2300x
  - Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840x
  - Genetic Algorithm: 1000x
  - Total Variation Modeling: 2600x
  - Monte Carlo Simulation of Photon Migration: 1000x
  - Stochastic Differential Equations: 675x
  - K-Nearest Neighbor Search: 470x
  - Fast Total Variation for Computer Vision: 1000x
Source: CUDA Zone at www.nvidia.com/cuda/
10. GPGPU is becoming mainstream research
- Research activities are expanding significantly
- Search results for the keyword "GPGPU" in IEEE and ACM
11. Streaming Processor Array (NVIDIA GT200 architecture)
- Hardware hierarchy: Streaming Processor Array, Texture Processor Cluster, Streaming Multiprocessor (8 SPs, 2 SFUs, texture unit)
- Execution hierarchy: grid of thread blocks; multiple thread blocks, many warps of threads; individual threads
- 240 shader cores
- 1.4B transistors
- Up to 2GB onboard memory
- 150GB/sec memory bandwidth
- 1.06 TFLOPS single precision
- CUDA and OpenCL support
- Programmable memory spaces
- Tesla S1070 provides 4 GPUs in a 1U unit
12. AMD/ATI Radeon HD 5870
- Codename "Evergreen"
- 1600 SIMD cores
- L1/L2 memory architecture
- 153GB/sec memory bandwidth
- 2.72 TFLOPS single precision
- OpenCL and DirectX 11 support
- Hidden memory microarchitecture
- Provides for vectorized operation
13. Comparison of CPU and GPU Hardware Architectures

  CPU/GPU          Single precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
  NVIDIA 285       1.06                      240     5.8           3.12
  NVIDIA 295       1.79                      480     6.2           3.80
  AMD HD 5870      2.72                      1600    14.5          0.16
  AMD HD 4890      1.36                      800     7.2           0.18
  Intel Core i7 965 0.051                    4       0.39          11.02

Source: NVIDIA, AMD, and Intel
14. AMD vs. NVIDIA

                         AMD                                   NVIDIA
  Hardware architecture  Vector                                Scalar
  Programming language   Brook, IL, OpenCL                     CUDA, OpenCL
  Programming model      SIMD vector                           SIMT
  Thread hierarchy       Single level                          Two level
  Memory exposure        Uniform space                         Multiple spaces
  Source of horsepower   Vectorization and multiple outputs    Memory space utilization, including shared memory
  Pros                   Easier programming                    More flexible programming
  Challenges             Harnessing the potential horsepower   Harnessing the potential horsepower
15. Talk Outline
- Introduction to GPUs
- Overview of the tool chains for both CUDA and OpenCL
- Motivation for pursuing this work
- Comparing intermediate representations
- Leveraging/analyzing benefits of Open64 optimization on AMD GPUs
- Comparing challenges with fundamentally different ISAs (superscalar SIMT versus VLIW SIMT)
- Discuss PTX and IL
- Describe new common IR
- Two examples of PTX → IR → IL binary translation
- Discuss status of project and future work
16. GPU Programming Model
- Single Instruction, Multiple Threads (SIMT)
- Parallelism is implicit
- Programs (also called kernels or shaders) are generally small and contain nested loops
- Synchronization is handled explicitly
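The bullets above can be made concrete with a CPU-side sketch (our illustration, not real CUDA): the kernel body is written for a single thread, the parallelism is implicit in the thread index, and the hardware runs one instance of the body per index.

```cpp
#include <cstddef>
#include <vector>

// SIMT sketch: vecAddBody is the "kernel", written for one thread.
// The loop below stands in for the GPU, which would run all instances
// in parallel. Names here are illustrative, not CUDA's API.
void vecAddBody(std::size_t tid, const int* A, const int* B, int* C) {
    C[tid] = A[tid] + B[tid];   // the thread index selects this thread's data
}

std::vector<int> launchVecAdd(const std::vector<int>& A,
                              const std::vector<int>& B) {
    std::vector<int> C(A.size());
    for (std::size_t tid = 0; tid < A.size(); ++tid)  // GPU: all in parallel
        vecAddBody(tid, A.data(), B.data(), C.data());
    return C;
}
```

Note that the kernel itself contains no loop over elements; the launch supplies one logical thread per element, which is what "parallelism is implicit" means above.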
17. Toolchains
- Toolchain = compiler + runtime library
- NVIDIA
- AMD
18. CUDA Compiler
Compile time: C for CUDA source goes through cudafe, which splits it into a GPU path (Open64, emitting PTX) and a host path (host compiler, emitting the exe).
Execution time: the runtime and driver translate the PTX into the device binary.
PTX is included as data in the host application.
19. OpenCL (Dynamic) Compiler
Compile time: only the host program is compiled (host compiler, emitting the exe).
Execution time: the OpenCL library compiles kernels through LLVM, and the runtime and driver produce the device binary.
20. Objectives of our work
- Compare two different IRs from similar massively-threaded architectures
- Influence future IR design (an active topic in GPGPU research)
- Leverage/analyze benefits of Open64 optimizations
- Compare challenges with fundamentally different ISAs: superscalar/SIMT versus VLIW/SIMT
21. CUDA Runtime
- Device Management
  - cudaSetDevice, cudaGetDevice
- Memory Management
  - Allocation: cudaMalloc, cudaFree
  - Transfer: cudaMemcpy, cudaMemset
- Execution Control
  - Kernel launch: cudaLaunch
  - Config: cudaConfigureCall
- Thread Management
  - cudaThreadSynchronize
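The Execution Control entries above fit together in a fixed sequence: nvcc lowers a `<<<grid, block>>>` launch into cudaConfigureCall, one cudaSetupArgument per kernel argument, and a final cudaLaunch. The toy mock below (our sketch with simplified signatures, not NVIDIA's libcudart, whose real entry points take dim3 grids and stream handles) records that sequence to make the expansion visible.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mocked launch-control entry points: each call just records its effect
// in a launch state, mirroring how the runtime stages a kernel launch.
struct LaunchState {
    int gridDim = 0;
    int blockDim = 0;
    std::vector<std::vector<char>> args;   // raw bytes of each argument
    std::string launched;                  // kernel name passed to cudaLaunch
};
static LaunchState state;

int cudaConfigureCall(int grid, int block) {
    state = LaunchState{};                 // start a fresh launch
    state.gridDim = grid;
    state.blockDim = block;
    return 0;                              // cudaSuccess
}

int cudaSetupArgument(const void* arg, std::size_t size, std::size_t /*offset*/) {
    const char* p = static_cast<const char*>(arg);
    state.args.emplace_back(p, p + size);  // stage the argument bytes
    return 0;
}

int cudaLaunch(const char* kernelName) {
    state.launched = kernelName;           // the driver would run the kernel here
    return 0;
}
```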
22. CUDA Runtime (Vector Add example)

__global__ void vecAdd(int *A, int *B, int *C) {
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main() {
    int hA[N];
    int hB[N];
    ...
    cudaMemcpy(dA, hA, sizeof(hA), HostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), HostToDevice);
    vecAdd<<<1, N>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, sizeof(hC), DeviceToHost);
}

The <<<...>>> launch expands into cudaConfigureCall, cudaSetupArgument, and cudaLaunch.
23. NVIDIA PTX
- Low-level IR (close to the ISA)
- Pseudo-assembly style syntax
- Load-store instruction set
- Strongly typed language, e.g.: cvt.s32.u16 r1, tid.x
- Unlimited virtual registers
- Predicate registers
24. AMD IL
- High-level IR
- Structured control flow (if-endif, while-end, switch-end)
- No predication
- 32-bit registers with 4 components, enabling vectorization
25. Common PTX and IL instructions

vectorAdd (PTX):
mov.u16       rh1, ctaid.x
mov.u16       rh2, ntid.x
mul.wide.u16  r1, rh1, rh2
cvt.u32.u16   r2, tid.x
add.u32       r3, r2, r1
ld.param.s32  r4, [N]
setp.le.s32   p1, r4, r3
@p1 bra       LabelA
cvt.u64.s32   rd1, r3
mul.lo.u64    rd2, rd1, 4
ld.param.u64  rd3, [A]
add.u64       rd4, rd3, rd2
ld.global.f32 f1, [rd4+0]
ld.param.u64  rd5, [B]
add.u64       rd6, rd5, rd2
ld.global.f32 f2, [rd6+0]
add.f32       f3, f1, f2
ld.param.u64  rd7, [C]
add.u64       rd8, rd7, rd2
st.global.f32 [rd8+0], f3
LabelA:
exit

- Data movement (mov)
- Memory access (ld, st)
- Arithmetic (mul, add)
- Conversion (cvt)
- Comparison and selection (setp)
- Control flow (bra); uses predication for conditional branches
26. Common PTX and IL instructions

vectorAdd (IL):
mov  r0, vThreadGrpId.x
mov  r1, cb0[0].x
imul r2, r0, r1
mov  r3, vTidInGrp.x
iadd r4, r3, r2
mov  r5, cb1[3]
ige  r6, r4, r5
if_logicalz r6
  mov  r7, r4
  imul r8, r7, l0
  mov  r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov  r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add  r15, r11, r14
  mov  r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end

- Data movement (mov)
- Memory access (uav_raw_load, uav_raw_store)
- Arithmetic (imul, iadd)
- No conversion instructions
- Comparison and selection (ige)
- Control flow (if_logicalz); structured statements
27. Ocelot Framework
- Implemented as a CUDA library
  - Intercepts library calls
- PTX emulation on the CPU
  - Parses PTX into an internal IR
  - Analysis: CFG, SSA, data flow, optimizations
- Our work
  - IR for IL programs
  - PTX IR → IL IR translation
  - AMD/CAL backend

Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31-42, New York, NY, USA, 2010. ACM.
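The "intercepts library calls" point works because the framework supplies its own implementation of the CUDA Runtime API, so the application's calls land in the translator instead of NVIDIA's library. A toy sketch of that interposition (ours, not Ocelot's code; return codes and the backing store are simplified):

```cpp
#include <cstddef>
#include <new>
#include <string>
#include <vector>

// Drop-in replacements for two CUDA Runtime entry points. Linking the
// application against this library instead of libcudart routes every
// call here; a real framework would then parse the embedded PTX and
// hand execution to an emulator or an alternate backend.
static std::vector<std::string> callLog;

extern "C" int cudaMalloc(void** devPtr, std::size_t size) {
    callLog.push_back("cudaMalloc(" + std::to_string(size) + ")");
    *devPtr = ::operator new(size);   // back "device" memory with host memory
    return 0;                         // cudaSuccess
}

extern "C" int cudaFree(void* devPtr) {
    callLog.push_back("cudaFree");
    ::operator delete(devPtr);
    return 0;
}
```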
28. Translation Framework
(Diagram: the compile-time flow producing the exe, extended with the translation path.)
29. IL Control Tree
- Based on structural analysis
- Build a DFS spanning tree of the control flow graph and traverse it in postorder
- Form regions and collapse the nodes in the CFG
- Construct the control tree in the process
- Repeat until only 1 node is left in the CFG

S. Muchnick. Advanced Compiler Design and Implementation, chapter 7.7. Morgan Kaufmann, 1997.
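The collapse step above can be sketched as follows (a toy version under our own simplifications, not the project's code): the pass looks for one acyclic region shape, the if-then, collapses it into its head node, and records the abstract region node. A full implementation would also match blocks, if-then-else, and loop regions, repeating until a single node remains; the recorded regions form the control tree.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Minimal CFG: each basic block name maps to its set of successors.
struct CFG {
    std::map<std::string, std::set<std::string>> succ;
};

// Find one if-then region (head -> then -> follow, plus head -> follow),
// collapse it into the head node, and record the abstract region node.
// Returns true if a region was found and collapsed.
bool collapseIfThen(CFG& g, std::vector<std::string>& tree) {
    for (auto& [head, succs] : g.succ) {
        if (succs.size() != 2) continue;
        for (const std::string& cand : succs) {
            auto it = g.succ.find(cand);
            if (it == g.succ.end() || it->second.size() != 1) continue;
            std::string thenBlk = cand;              // copy before mutating
            std::string follow  = *it->second.begin();
            if (!succs.count(follow)) continue;      // head must also reach follow
            std::string h = head;                    // copy the key, too
            g.succ.erase(thenBlk);                   // absorb the then-block
            g.succ[h] = {follow};                    // region falls through
            tree.push_back("IF(" + h + "," + thenBlk + ")");
            return true;
        }
    }
    return false;
}
```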
30. IL Control Tree
(Diagram: abstract nodes representing regions.)
31. Example 1 (if-then)

PTX:
mov.u16      ...
setp.le.s32  p1, r4, r3
@p1 bra      LabelA
cvt.u64.s32  ...
LabelA:
exit
32. Example 1 (if-then)

IL:
mov ...
ige r6, r4, r5
if_logicalz r6
  mov ...
endif
end
33. Example 2 (for-loop)

PTX:
mov.u16      ...
setp.le.s32  p1, r5, r3
@p1 bra      LabelA
cvt.u64.s32  ...
LabelB:
setp.lt.s32  p2, r4, r5
@p2 bra      LabelB
LabelA:
exit

Control tree: the entry block (BB: mov...) feeds an IF node; its cond is BB: setp..., and its true branch is a Block containing BB: cvt... and a WHILE region (cond/body around the setp... back edge), followed by BB: exit.
34. Example 2 (for-loop)

IL:
mov ...
ige r7, r4, r6
if_logicalz r7
  mov ...
  whileloop
    if_logicalz r17
      break
    endif
  endloop
endif
end
35. Other BT Challenges
- Pointer arithmetic in CUDA needs to be emulated in CAL
- Translating the Application Binary Interface (ABI), e.g., different calling conventions
- Architectural bitness: Tesla and Cypress are 32-bit architectures, but Fermi is 64-bit
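The pointer-arithmetic bullet can be illustrated with one possible encoding (ours, for illustration; not necessarily the project's actual scheme): since IL addresses memory through numbered UAV resources rather than raw addresses, a translated PTX pointer can be carried as a (buffer id, byte offset) pair, and PTX address arithmetic becomes offset arithmetic.

```cpp
#include <cstdint>
#include <vector>

// Emulated "pointer" for a resource-based target: which buffer it
// refers to, plus a byte offset within that buffer.
struct EmuPtr {
    std::uint32_t buffer;
    std::uint32_t offset;
};

// PTX `add.u64 rd4, rd3, rd2` becomes offset arithmetic on the same buffer.
EmuPtr emuAdd(EmuPtr p, std::uint32_t bytes) {
    return {p.buffer, p.offset + bytes};
}

// PTX `ld.global.f32 f1, [rd4+0]` becomes an indexed load from the buffer.
float emuLoadF32(const std::vector<std::vector<float>>& buffers, EmuPtr p) {
    return buffers[p.buffer][p.offset / sizeof(float)];
}
```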
36. Project Status
- Main CUDA library APIs are implemented (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)
- 3 CUDA applications from the SDK are running
- Code quality is comparable to LLVM code generation
37. Next Steps
- Enhance translation of the control tree to support other IL constructs (e.g., switch-case)
- Implement other GPGPU abstractions (e.g., shared memory, textures)
- Handle PTX predicated instructions (since IL does not support predication directly)
38. Summary and Future Work
- GPUs are revolutionizing desktop supercomputing
- A number of critical applications have been migrated successfully
- CUDA and OpenCL have made these platforms much more accessible for general-purpose computing
- AMD presently has the highest DP FP performance
- CUDA presently produces higher-performance code for NVIDIA
- We are developing a platform that leverages the best of both worlds