Challenges in Binary Translation for Desktop Supercomputing - PowerPoint PPT Presentation

About This Presentation
Title: Challenges in Binary Translation for Desktop Supercomputing

Description: Challenges in Binary Translation for Desktop Supercomputing. David Kaeli, Rodrigo Dominguez. Department of Electrical and Computer Engineering, Northeastern University.

Slides: 39

Transcript and Presenter's Notes

Title: Challenges in Binary Translation for Desktop Supercomputing


1
Challenges in Binary Translation for Desktop Supercomputing
David Kaeli, Rodrigo Dominguez
Department of Electrical and Computer Engineering
Northeastern University, Boston, MA
2
Current trends in Many-core Computing
  • The CPU industry has elected to jump off the
    cycle-time scaling bandwagon
  • Power/thermal constraints have become a limiting
    factor
  • We now see CPU vendors placing multiple (10s of)
    cores on a single chip
  • Clock speeds have not changed
  • The memory wall persists and multiple cores that
    assume a shared-memory model place further
    pressure on this problem
  • Software vendors are looking for new
    parallelization technology
  • Multi-core aware operating systems
  • Semi-automatic parallelizing compilers

3
Current trends in Many-core Computing
  • There has been a renewed interest in parallel
    computing paradigms and languages
  • Existing many-core architectures are being
    considered for general-purpose platforms (e.g.,
    Cell, GPUs, DSPs)
  • Heterogeneous systems are becoming a common theme
  • The trend will only accelerate if proper
    programming frameworks are available to
    effectively exploit many-core resources

4
Graphics Processors
  • Graphics Processing Units
  • More than 64% of Americans played a video game in
    2009
  • High-end - primarily used for 3-D rendering for
    videogame graphics and movie animation
  • Mid/low-end - primarily used for computer
    displays
  • Manufacturers include NVIDIA, AMD/ATI, IBM-Cell
  • Very competitive commodities market

5
GPU Performance
  • GPUs provide a path for performance growth
  • Cost and power usage numbers are also impressive

Near-exponential growth in performance for GPUs!!
Source: NVIDIA, 2009
6
Comparison of CPU and GPU Hardware
Architectures
CPU: cache heavy, focused on individual thread performance
GPU: ALU heavy, massively parallel, throughput-oriented
7
CPU/GPU Relationship
CPU (host)
GPU w/ local DRAM (device)
8
A wide range of GPU apps
  • Film
  • Financial
  • Languages
  • GIS
  • Holographic cinema
  • Machine learning
  • Mathematics research
  • Military
  • Mine planning
  • Molecular dynamics
  • MRI reconstruction
  • Multispectral imaging
  • N-body simulation
  • Network processing
  • Neural network
  • Oceanographic research
  • Optical inspection
  • Particle physics
  • 3D image analysis
  • Adaptive radiation therapy
  • Acoustics
  • Astronomy
  • Audio
  • Automobile vision
  • Bioinformatics
  • Biological simulation
  • Broadcast
  • Cellular automata
  • Fluid dynamics
  • Computer vision
  • Cryptography
  • CT reconstruction
  • Data mining
  • Digital cinema / projections
  • Electromagnetic simulation
  • Equity trading
  • Protein folding
  • Quantum chemistry
  • Ray tracing
  • Radar
  • Reservoir simulation
  • Robotic vision / AI
  • Robotic surgery
  • Satellite data analysis
  • Seismic imaging
  • Surgery simulation
  • Surveillance
  • Ultrasound
  • Video conferencing
  • Telescope
  • Video
  • Visualization
  • Wireless
  • X-Ray

9
GPU as a General Purpose Computing Platform
  • Speedups are impressive and ever increasing!

Real Time Elimination of Undersampling Artifacts: 2300X
Lattice-Boltzmann Method for Numerical Fluid Mechanics: 1840X
Genetic Algorithm: 1000X
Total Variation Modeling: 2600X
Monte Carlo Simulation of Photon Migration: 1000X
Stochastic Differential Equations: 675X
K-Nearest Neighbor Search: 470X
Fast Total Variation for Computer Vision: 1000X
Source: CUDA Zone at www.nvidia.com/cuda/
10
GPGPU is becoming mainstream research
  • Research activities are expanding significantly

Search results for the keyword GPGPU in IEEE and ACM
11
NVIDIA GT200 architecture

(Diagram: a grid of thread blocks maps onto the Streaming Processor Array; each Texture Processor Cluster contains Streaming Multiprocessors built from SP cores, SFUs, and a Texture Unit; individual threads are scheduled in warps, with multiple thread blocks and many warps in flight.)

  • 240 shader cores
  • 1.4B transistors
  • Up to 2GB onboard memory
  • 150GB/sec BW
  • 1.06 SP TFLOPS
  • CUDA and OpenCL support
  • Programmable memory spaces
  • Tesla S1070 provides 4 GPUs in a 1U unit
12
AMD/ATI Radeon HD 5870
  • Codename Evergreen
  • 1600 SIMD cores
  • L1/L2 memory architecture
  • 153GB/sec memory bandwidth
  • 2.72 TFLOPS SP
  • OpenCL and DirectX11
  • Hidden memory microarchitecture
  • Provides for vectorized operation

13
Comparison of CPU and GPU Hardware
Architectures
CPU/GPU        Single-precision TFLOPS   Cores   GFLOPS/Watt   $/GFLOP
NVIDIA 285     1.06                      240     5.8           3.12
NVIDIA 295     1.79                      480     6.2           3.80
AMD HD 5870    2.72                      1600    14.5          0.16
AMD HD 4890    1.36                      800     7.2           0.18
Intel i7 965   0.051                     4       0.39          11.02
Source: NVIDIA, AMD, and Intel
14
AMD vs. NVIDIA
                       AMD                                    NVIDIA
Hardware architecture  Vector                                 Scalar
Programming language   Brook+, IL, OpenCL                     CUDA, OpenCL
Programming model      SIMD vector                            SIMT
Thread hierarchy       Single level                           Two level
Memory exposure        Uniform space                          Multiple spaces
Source of horsepower   Vectorization and multiple outputs     Memory space utilization, including shared memory
Pros                   Easier programming                     More flexible programming
Challenges             Harnessing the potential horsepower    Harnessing the potential horsepower
15
Talk Outline
  • Introduction on GPUs
  • Overview of the tool chains for both CUDA and
    OpenCL
  • Motivation for pursuing this work
  • Comparing intermediate representations
  • Leveraging/analyzing benefits of Open64
    optimization on AMD GPUs
  • Comparing challenges with fundamentally different
    ISAs (superscalar SIMT versus VLIW SIMT)
  • Discuss PTX and IL
  • Describe new common IR
  • Two examples of PTX -> IR -> IL binary translation
  • Discuss status of project and future work

16
GPU Programming Model
  • Single Instruction Multiple Threads (SIMT)
  • Parallelism is implicit
  • Programs (also called kernels or shaders) are
    generally small and contain nested loops
    Synchronization is handled explicitly (see the
    sketch after this list)

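As a concrete illustration of the model described above, the sketch below shows a toy CUDA kernel (the kernel name, block size, and data layout are illustrative assumptions, not taken from the slides): parallelism is implicit in the built-in thread indices, while synchronization within a thread block is explicit via __syncthreads().

    // Illustrative sketch only: each thread implicitly handles one element
    // (no loop over threads is written), and the barrier is explicit.
    // Assumes the kernel is launched with one-dimensional blocks of 256 threads.
    __global__ void reverseBlock(float *data)
    {
        __shared__ float tile[256];                    // per-block scratch space
        int i = blockIdx.x * blockDim.x + threadIdx.x; // implicit parallel index
        tile[threadIdx.x] = data[i];                   // each thread loads one element
        __syncthreads();                               // explicit barrier within the block
        data[i] = tile[blockDim.x - 1 - threadIdx.x];  // write the block back reversed
    }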
17
Toolchains
  • Toolchain = compiler + runtime library

(Diagrams: the NVIDIA and AMD toolchains, shown in detail on the next two slides.)
18
CUDA Compiler
(Compile-time flow: C for CUDA source is split by cudafe into a GPU part, which Open64 compiles to PTX, and a host part, which the host compiler builds into the exe. Execution-time flow: the runtime and driver turn the PTX into a device binary.)

PTX is included as data in the host application.
19
OpenCL (Dynamic) Compiler
(Compile-time flow: the OpenCL host code is built by the host compiler into the exe. Execution-time flow: the OpenCL library invokes LLVM to compile kernels into a device binary, which is handed to the runtime and driver.)
20
Objectives of our work
  • Compare two different IRs from similar
    massively-threaded architectures
  • Influence future IR design (an active topic in
    GPGPU research)
  • Leverage/analyze benefits of Open64 optimizations
  • Compare challenges with fundamentally different
    ISAs: superscalar/SIMT versus VLIW/SIMT

21
CUDA Runtime
  • Device Management
    • cudaSetDevice, cudaGetDevice
  • Memory Management
    • Allocation: cudaMalloc, cudaFree
    • Transfer: cudaMemcpy, cudaMemset
  • Execution Control
    • Kernel launch: cudaLaunch
    • Configuration: cudaConfigureCall
  • Thread Management
    • cudaThreadSynchronize

22
CUDA Runtime (Vector Add example)
__global__ void vecAdd(int *A, int *B, int *C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    int hA[N];
    int hB[N];
    ...
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);
    vecAdd<<<1, N>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
}

The <<<...>>> kernel launch is expanded by the compiler into cudaConfigureCall, cudaSetupArgument, and cudaLaunch.
23
NVIDIA PTX
  • Low-level IR (close to ISA)
  • Pseudo-assembly style syntax
  • Load-Store instruction set
  • Strongly typed language
  • cvt.s32.u16 %r1, %tid.x
  • Unlimited virtual registers
  • Predicate registers

24
AMD IL
  • High-level IR
  • Structured control flow (if-endif, while-end,
    switch-end)
  • No predication
  • 32-bit registers (4 components) - vectorization

25
Common PTX and IL instructions
vectorAdd (PTX)
mov.u16        %rh1, %ctaid.x
mov.u16        %rh2, %ntid.x
mul.wide.u16   %r1, %rh1, %rh2
cvt.u32.u16    %r2, %tid.x
add.u32        %r3, %r2, %r1
ld.param.s32   %r4, [N]
setp.le.s32    %p1, %r4, %r3
@%p1 bra       LabelA
cvt.u64.s32    %rd1, %r3
mul.lo.u64     %rd2, %rd1, 4
ld.param.u64   %rd3, [A]
add.u64        %rd4, %rd3, %rd2
ld.global.f32  %f1, [%rd4+0]
ld.param.u64   %rd5, [B]
add.u64        %rd6, %rd5, %rd2
ld.global.f32  %f2, [%rd6+0]
add.f32        %f3, %f1, %f2
ld.param.u64   %rd7, [C]
add.u64        %rd8, %rd7, %rd2
st.global.f32  [%rd8+0], %f3
LabelA: exit
  • Data movement (mov)
  • Memory access (ld, st)
  • Arithmetic (mul, add)
  • Conversion (cvt)
  • Comparison and selection (setp)
  • Control flow (bra): uses predication for
    conditional branches

26
Common PTX and IL instructions
vectorAdd (IL)
mov r0, vThreadGrpId.x
mov r1, cb0[0].x
imul r2, r0, r1
mov r3, vTidInGrp.x
iadd r4, r3, r2
mov r5, cb1[3]
ige r6, r4, r5
if_logicalz r6
  mov r7, r4
  imul r8, r7, l0
  mov r9, cb1[0]
  iadd r10, r9, r8
  uav_raw_load_id(0) r11, r10
  mov r12, cb1[1]
  iadd r13, r12, r8
  uav_raw_load_id(0) r14, r13
  add r15, r11, r14
  mov r16, cb1[2]
  iadd r17, r16, r8
  uav_raw_store_id(0) mem.xyzw, r17, r15
endif
end
  • Data movement (mov)
  • Memory access (uav_raw)
  • Arithmetic (imul, iadd)
  • No conversion instructions
  • Comparison and Selection (ige)
  • Control flow (if_logicalz): structured
    statements

27
Ocelot Framework
  • Implemented as a CUDA library
  • Intercepts library calls (see the sketch below)
  • PTX emulation on the CPU
  • Parses PTX into an internal IR
  • Analysis: CFG, SSA, data flow, optimizations
  • Our work:
    • IR for IL programs
    • PTX IR -> IL IR translation
    • AMD/CAL backend

Andrew Kerr, Gregory Diamos, and Sudhakar Yalamanchili. Modeling GPU-CPU workloads and systems. In GPGPU '10: Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pages 31-42, New York, NY, USA, 2010. ACM.
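To make the interception point above concrete, here is a minimal sketch, assuming a drop-in replacement for the CUDA runtime library; it is not the Ocelot source or our implementation, and translateAndLaunchPTX is a hypothetical helper. Because the replacement library exports the same symbols the application already links against, no source changes are needed.

    #include <cstdio>

    typedef int cudaError_t;                  // simplified stand-in for the CUDA error enum
    static const cudaError_t cudaSuccess = 0;

    // Hypothetical translator entry point: in a real framework this would look up
    // the PTX registered for the kernel, translate PTX IR -> IL IR, and dispatch
    // the result through the CAL backend.
    static cudaError_t translateAndLaunchPTX(const char *kernelSymbol)
    {
        std::printf("translating and launching kernel %s\n", kernelSymbol);
        return cudaSuccess;
    }

    // Exporting the same symbol as libcudart intercepts the application's call.
    extern "C" cudaError_t cudaLaunch(const char *symbol)
    {
        return translateAndLaunchPTX(symbol);
    }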
28
Translation Framework
(Diagram of the translation framework, showing where it fits in the compile-time flow that produces the exe.)
29
IL Control Tree
  • Based on Structural Analysis
  • Build a DFS spanning tree of the control flow graph
    and traverse it in postorder
  • Form regions and collapse their nodes in the CFG
  • Construct the Control Tree in the process
  • Repeat until only one node is left in the CFG (see
    the sketch after this list)

S. Muchnick. Advanced Compiler Design and
Implementation, chapter 7.7. Morgan Kaufmann,
1997.
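A minimal, self-contained sketch of one reduction pass is shown below (illustrative code, not our translator): it computes a DFS postorder over a small CFG, matches a single region shape (the if-then pattern of Example 1), and collapses it into an abstract node. A full implementation matches the remaining shapes (blocks, if-then-else, self loops, while loops) and repeats until only one node remains.

    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // A tiny CFG node: a name plus successor edges.
    struct Node {
        std::string name;
        std::vector<Node*> succs;
    };

    // DFS postorder over the CFG.
    static void postorder(Node* n, std::set<Node*>& seen, std::vector<Node*>& out)
    {
        if (!seen.insert(n).second) return;
        for (Node* s : n->succs) postorder(s, seen, out);
        out.push_back(n);
    }

    // Match an if-then region: n has two successors, one of which falls
    // through to the other (the join block).
    static bool matchIfThen(Node* n, Node** thenPart, Node** join)
    {
        if (n->succs.size() != 2) return false;
        Node *a = n->succs[0], *b = n->succs[1];
        if (a->succs.size() == 1 && a->succs[0] == b) { *thenPart = a; *join = b; return true; }
        if (b->succs.size() == 1 && b->succs[0] == a) { *thenPart = b; *join = a; return true; }
        return false;
    }

    int main()
    {
        // CFG of Example 1: cond -> {then, exit}, then -> exit.
        Node exitBB{"exit", {}};
        Node thenBB{"then", {&exitBB}};
        Node condBB{"cond", {&thenBB, &exitBB}};

        std::set<Node*> seen;
        std::vector<Node*> order;
        postorder(&condBB, seen, order);

        for (Node* n : order) {
            Node *thenPart, *join;
            if (matchIfThen(n, &thenPart, &join)) {
                // Collapse the matched nodes into one abstract IF region whose
                // single successor is the join block: one reduction step of
                // structural analysis.
                std::printf("collapsing IF-THEN: cond=%s then=%s join=%s\n",
                            n->name.c_str(), thenPart->name.c_str(), join->name.c_str());
                n->name = "IF(" + n->name + "," + thenPart->name + ")";
                n->succs = { join };
            }
        }
        return 0;
    }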
30
IL Control Tree
(Diagram: the control tree, with abstract nodes representing collapsed regions.)
31
Example 1 (if-then)
PTX
mov.u16     ...
setp.le.s32 %p1, %r4, %r3
@%p1 bra    LabelA
cvt.u64.s32 ...
LabelA: exit
32
Example 1 (if-then)
IL
mov ...
ige r6, r4, r5
if_logicalz r6
  mov ...
endif
end
33
Example 2 (for-loop)
PTX

mov.u16     ...
setp.le.s32 %p1, %r5, %r3
@%p1 bra    LabelA
cvt.u64.s32 ...
LabelB:     ...
setp.lt.s32 %p2, %r4, %r5
@%p2 bra    LabelB
LabelA: exit

(Control-tree diagram: the entry basic block (mov...) forms the condition of an IF region; its true branch is a Block containing the cvt basic block followed by a WHILE region whose condition/body is the setp basic block; the IF's other successor is the exit block.)

34
Example 2 (for-loop)
IL
mov ...
ige r7, r4, r6
if_logicalz r7
  mov ...
  whileloop
    if_logicalz r17
      break
    endif
    ...            (loop body)
  endloop
endif
end
35
Other BT Challenges
  • Pointer arithmetic in CUDA needs to be emulated
    in CAL (see the sketch after this list)
  • Translating the Application Binary Interface (ABI),
    e.g., different calling conventions
  • Architectural bitness: Tesla and Cypress are
    32-bit architectures, but Fermi is 64-bit

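As a hedged illustration of the pointer-arithmetic challenge above, one way it could be emulated is to represent a translated pointer as a (resource, byte offset) pair, since CAL/IL addresses memory through bound resources rather than raw 64-bit pointers. The names below are assumptions for illustration, not our translator's actual data structures.

    // Hypothetical representation: a translated "pointer" is a resource id
    // plus a byte offset into that resource.
    struct EmulatedPtr {
        int      resourceId;   // which UAV / constant buffer the pointer refers to
        unsigned offsetBytes;  // byte offset within that resource
    };

    // A PTX pointer add (e.g., add.u64 %rd4, %rd3, %rd2) then becomes plain
    // integer arithmetic on the offset; the load itself is emitted against
    // the resource id (uav_raw_load_id) using the computed offset.
    EmulatedPtr addOffset(EmulatedPtr p, unsigned bytes)
    {
        p.offsetBytes += bytes;
        return p;
    }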
36
Project Status
  • Main CUDA library APIs are implemented
    (cudaMalloc, cudaMemcpy, cudaLaunch, etc.)
  • Three CUDA applications from the SDK are running
  • Generated code quality is comparable to LLVM
    code generation

37
Next Steps
  • Enhance translation of the Control Tree to
    support other IL constructs (e.g., switch-case)
  • Implement other GPGPU abstractions (e.g., shared
    memory, textures, etc.)
  • Handle PTX predicated instructions, since IL does
    not support predication directly (see the sketch
    after this list)

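To illustrate the predication item above, one plausible lowering, sketched here as a hypothetical helper (not the project's code), wraps the IL translation of each guarded PTX instruction in an if_logicalnz/endif pair, or if_logicalz when the guard is negated.

    #include <string>

    // Given the IL register holding the predicate value and the IL instruction
    // translated from the guarded PTX instruction, emit an unpredicated IL
    // sequence with equivalent behavior. Illustrative only.
    std::string lowerPredicated(const std::string &ilPredicateReg,
                                bool negatedGuard,
                                const std::string &ilInstruction)
    {
        std::string out;
        out += (negatedGuard ? "if_logicalz " : "if_logicalnz ") + ilPredicateReg + "\n";
        out += "  " + ilInstruction + "\n";
        out += "endif\n";
        return out;
    }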
38
Summary and Future Work
  • GPUs are revolutionizing desktop supercomputing
  • A number of critical applications have been
    migrated successfully
  • CUDA and OpenCL have made these platforms much
    more accessible for general purpose computing
  • AMD presently has the highest DP FP performance
  • CUDA presently produces higher performance code
    for NVIDIA
  • We are developing a platform that leverages the
    best of both worlds