

1
Parallel Programming with CUDA
  • Matthew Guidry
  • Charles McClendon

2
Introduction to CUDA
  • CUDA is a platform for performing massively
    parallel computations on graphics accelerators
  • CUDA was developed by NVIDIA
  • It was first available with their G8X line of
    graphics cards
  • Approximately 1 million CUDA-capable GPUs are
    shipped every week
  • CUDA presents a unique opportunity to develop
    widely deployed parallel applications

3
CUDA
  • Because of the Power Wall, the Latency Wall, etc.
    (the free lunch is over), we must find a way to
    keep our processor-intensive programs from
    slowing down to a crawl
  • With CUDA it is possible to do things like
    simulate networks of brain neurons
  • CUDA brings the possibility of ubiquitous
    supercomputing to the everyday computer

4
CUDA
  • CUDA is supported on all of NVIDIA's G8X and
    above graphics cards
  • The current CUDA GPU Architecture is branded
    Tesla
  • 8-series GPUs offer 50-200 GFLOPS

5
CUDA Compilation
  • As a programming model, CUDA is a set of
    extensions to ANSI C
  • CPU code is compiled by the host C compiler, and
    the GPU code (the kernel) is compiled by the CUDA
    compiler (nvcc); separate binaries are produced
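For concreteness, a minimal sketch of this split (the file name example.cu is hypothetical): nvcc separates the device and host portions of one source file and hands the host portion to the system C compiler.

    // example.cu: host and device code live in one file;
    // nvcc separates them at compile time
    __global__ void kernel(void) { }   // compiled by the CUDA compiler

    int main(void)                     // compiled by the host C compiler
    {
        kernel<<<1, 1>>>();            // launch the kernel on the GPU
        cudaDeviceSynchronize();       // wait for the GPU to finish
        return 0;
    }

    // Build: nvcc -o example example.cu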

6
CUDA Stack
7
Limitations of CUDA
  • Tesla does not fully support the IEEE 754 spec
    for double-precision floating-point operations
    (although this will be resolved with Fermi)
  • Code is only supported on NVIDIA hardware
  • No use of recursive functions (workarounds exist)
  • Bus latency between the host CPU and GPU

8
Thread Hierarchy
  • Thread: distributed by the CUDA runtime
    (identified by threadIdx)
  • Warp: a scheduling unit of up to 32 threads
  • Block: a user-defined group of 1 to 512 threads
    (identified by blockIdx)
  • Grid: a group of one or more blocks. A grid is
    created for each CUDA kernel function
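As a sketch of how these identifiers combine inside a kernel (a minimal example, not from the slides; the kernel name is hypothetical):

    __global__ void indexDemo(int *out, int n)
    {
        // blockIdx and threadIdx combine into a unique global index
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)             // guard threads in the last, partially full block
            out[i] = i;
    }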

9
CUDA Memory Hierarchy
  • The CUDA platform has three primary memory types
  • Local Memory: per-thread memory for automatic
    variables and register spilling
  • Shared Memory: per-block low-latency memory to
    allow for intra-block data sharing and
    synchronization. Threads can safely share data
    through this memory and can perform barrier
    synchronization through __syncthreads()
  • Global Memory: device-level memory that may be
    shared between blocks or grids
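A minimal sketch of intra-block sharing (assuming a launch of one 256-thread block; the kernel name is hypothetical):

    __global__ void reverseBlock(int *d)
    {
        __shared__ int s[256];          // low-latency, per-block memory
        int t = threadIdx.x;
        s[t] = d[t];                    // each thread stages one element
        __syncthreads();                // barrier: all writes finish before any read
        d[t] = s[blockDim.x - 1 - t];   // safely read another thread's element
    }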

10
Moving Data
CUDA allows us to copy data from one memory type
to another. This includes dereferencing pointers,
even pointers into the host's memory (main system
RAM). To facilitate this data movement, CUDA
provides cudaMemcpy()
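A minimal sketch of the round trip (buffer names are hypothetical):

    int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);   // host buffer (system RAM)
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);      // device buffer (GPU global memory)

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that operate on d_a ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_a);
    free(h_a);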
11
Optimizing Code for CUDA
  • Prevent thread starvation by breaking your
    problem down (128 execution units are available
    for use, and thousands of threads may be in
    flight); see the grid-stride sketch after this
    list
  • Utilize shared memory and avoid latency problems
    (communicating with system memory is slow)
  • Keep in mind there is no built-in way to
    synchronize threads in different blocks
  • Avoid thread divergence in warps by grouping
    threads with similar control paths into the same
    blocks
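One common way to break a problem across many threads (a hedged sketch, not from the slides) is a grid-stride loop, which lets a launch of any size cover a problem of any size:

    __global__ void scale(float *x, int n, float a)
    {
        // Each thread strides through the array, handling several
        // elements, so any grid size can cover any problem size
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
            x[i] = a * x[i];
    }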

12
Code Example
(Will be explained in more depth later.)
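The slide's code image is not preserved in the transcript. A representative first CUDA program of this era is element-wise vector addition; the following is a hedged reconstruction, not the authors' original code, and names like vecAdd are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    // Each thread adds one pair of elements
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Allocate and fill host (CPU) buffers
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device (GPU) buffers and copy the inputs over
        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch: enough 256-thread blocks to cover all n elements
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        // Copy the result back and check one element
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);   // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }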
13
Kernel Functions
  • A kernel function is the basic unit of work
    within a CUDA thread
  • Kernel functions are CUDA extensions to ANSI C
    that are compiled by the CUDA compiler and the
    object code generator

14
Kernel Limitations
  • There must be no recursion (there's no call stack)
  • There must be no static variable declarations
  • Functions must have a fixed (non-variadic) number
    of arguments
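As a sketch of how the recursion restriction is worked around (a hypothetical example): any recursive definition must be rewritten iteratively for device code.

    __device__ int factorial(int n)
    {
        // Iterative form of a recursive definition: no call stack needed
        int r = 1;
        for (int i = 2; i <= n; i++)
            r *= i;
        return r;
    }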

15
CUDA Warp
  • CUDA utilizes SIMT (Single Instruction Multiple
    Thread)
  • Warps are groups of 32 threads. Each warp
    receives a single instruction and broadcasts it
    to all of its threads.
  • CUDA provides zero-overhead warp and thread
    scheduling. Also, the overhead of thread creation
    is on the order of 1 clock.
  • Because a warp receives a single instruction, it
    will diverge and converge as each thread branches
    independently
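A minimal sketch of divergence (the kernel name is hypothetical): threads in the same warp that take different branches force the warp to execute both paths serially before reconverging.

    __global__ void divergent(int *out)
    {
        // Lanes 0-15 and 16-31 of each warp take different branches,
        // so the warp runs the two paths one after the other
        if (threadIdx.x % 32 < 16)
            out[threadIdx.x] = 1;
        else
            out[threadIdx.x] = 2;
    }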

16
CUDA Hardware
  • The primary components of the Tesla architecture
    are:
  • Streaming Multiprocessor (The 8800 has 16)
  • Scalar Processor
  • Memory hierarchy
  • Interconnection network
  • Host interface

17
Streaming Multiprocessor (SM)
  • Each SM has 8 Scalar Processors (SPs)
  • IEEE 754 32-bit floating-point support
    (incomplete support)
  • Each SP is a 1.35 GHz processor (32 GFLOPS peak)
  • Supports 32- and 64-bit integers
  • 8,192 dynamically partitioned 32-bit registers
  • Supports 768 threads in hardware (24 SIMT warps
    of 32 threads)
  • Thread scheduling done in hardware
  • 16 KB of low-latency shared memory
  • 2 Special Function Units (reciprocal square root,
    trig functions, etc.)

Each GPU has 16 SMs
18
The GPU
19
Scalar Processor
  • Supports 32-bit IEEE floating point instructions
  • FADD, FMAD, FMIN, FMAX, FSET, F2I, I2F
  • Supports 32-bit integer operations
  • IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I, SHR,
    SHL, AND, OR, XOR
  • Fully pipelined

20
Code Example Revisited
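The revisited code slide is also not preserved in the transcript; presumably it walked back through the example's launch configuration. As a hedged annotation of that configuration (names match the vecAdd sketch above):

    int threads = 256;                          // threads per block: 8 warps of 32
    int blocks  = (n + threads - 1) / threads;  // round up so every element is covered
    // <<<blocks, threads>>> is the execution configuration: it creates one
    // grid of 'blocks' blocks, and the hardware splits each block into
    // warps for scheduling
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);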
21
Myths About CUDA
  • Myth: GPUs are the only processors in a CUDA
    application
  • No; the CUDA platform is a co-processing model,
    using both the CPU and GPU
  • Myth: GPUs are very wide (1000s-wide) SIMD
    machines
  • No; a CUDA warp is only 32 threads
  • Myth: Branching is not possible on GPUs
  • Incorrect
  • Myth: GPUs are power-inefficient
  • No; performance per watt is quite good
  • Myth: CUDA is only for C or C++ programmers
  • Not true; there are third-party wrappers for
    Java, Python, and more

22
Different Types of CUDA Applications
23
Future Developments of CUDA
  • The next generation of CUDA, called Fermi, will
    be the standard on the GeForce 300 series
  • Fermi will have full support for IEEE 754 double
    precision
  • Fermi will natively support more programming
    languages
  • Also, there is a new project, OpenCL, that seeks
    to provide an abstraction layer over CUDA and
    similar platforms (AMD's Stream)

24
Things to Ponder
  • Is CUDA better than Cell?
  • How do I utilize 12,000 threads?
  • Is CUDA really relevant anyway, in a world where
    web applications are so popular?

25
Parallel Programming with CUDA
  • By Matthew Guidry
  • Charles McClendon