

1
Parallel Programming with CUDA
  • Matthew Guidry
  • Charles McClendon

2
Introduction to CUDA
  • CUDA is a platform for performing massively
    parallel computations on graphics accelerators
  • CUDA was developed by NVIDIA
  • It was first available with their G8X line of
    graphics cards
  • Approximately 1 million CUDA-capable GPUs are
    shipped every week
  • CUDA presents a unique opportunity to develop
    widely deployed parallel applications

3
CUDA
  • Because of the Power Wall, the Latency Wall, etc.
    (the free lunch is over), we must find a way to
    keep our processor-intensive programs from
    slowing down to a crawl
  • With CUDA it is possible to do things like
    simulate networks of brain neurons
  • CUDA brings the possibility of ubiquitous
    supercomputing to the everyday computer

4
CUDA
  • CUDA is supported on all of NVIDIA's G8X and
    above graphics cards
  • The current CUDA GPU Architecture is branded
    Tesla
  • 8-series GPUs offer 50-200 GFLOPS

5
CUDA Compilation
  • As a programming model, CUDA is a set of
    extensions to ANSI C
  • CPU code is compiled by the host C compiler, and
    the GPU code (the kernel) is compiled by the CUDA
    compiler (nvcc); separate binaries are produced
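For concreteness, a minimal sketch of this split (the file name example.cu is hypothetical): nvcc separates the device and host portions of one source file and hands the host portion to the system C compiler.

    // example.cu: host and device code live in one file;
    // nvcc separates them at compile time
    __global__ void kernel(void) { }   // compiled by the CUDA compiler

    int main(void)                     // compiled by the host C compiler
    {
        kernel<<<1, 1>>>();            // launch the kernel on the GPU
        cudaDeviceSynchronize();       // wait for the GPU to finish
        return 0;
    }

    // Build: nvcc -o example example.cu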

6
CUDA Stack
7
Limitations of CUDA
  • Tesla does not fully support the IEEE 754 spec
    for double-precision floating-point operations
    (although this will be resolved with Fermi)
  • Code is only supported on NVIDIA hardware
  • No use of recursive functions (workarounds exist)
  • Bus latency between the host CPU and GPU

8
Thread Hierarchy
  • Thread: distributed by the CUDA runtime
    (identified by threadIdx)
  • Warp: a scheduling unit of up to 32 threads
  • Block: a user-defined group of 1 to 512 threads
    (identified by blockIdx)
  • Grid: a group of one or more blocks. A grid is
    created for each CUDA kernel function
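As a sketch of how these identifiers combine inside a kernel (a minimal example, not from the slides; the kernel name is hypothetical):

    __global__ void indexDemo(int *out, int n)
    {
        // blockIdx and threadIdx combine into a unique global index
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)             // guard threads in the last, partially full block
            out[i] = i;
    }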

9
CUDA Memory Hierarchy
  • The CUDA platform has three primary memory types
  • Local Memory: per-thread memory for automatic
    variables and register spilling
  • Shared Memory: per-block low-latency memory to
    allow for intra-block data sharing and
    synchronization. Threads can safely share data
    through this memory and can perform barrier
    synchronization through __syncthreads()
  • Global Memory: device-level memory that may be
    shared between blocks or grids
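A minimal sketch of intra-block sharing (assuming a launch of one 256-thread block; the kernel name is hypothetical):

    __global__ void reverseBlock(int *d)
    {
        __shared__ int s[256];          // low-latency, per-block memory
        int t = threadIdx.x;
        s[t] = d[t];                    // each thread stages one element
        __syncthreads();                // barrier: all writes finish before any read
        d[t] = s[blockDim.x - 1 - t];   // safely read another thread's element
    }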

10
Moving Data
CUDA allows us to copy data from one memory type
to another. This includes dereferencing pointers,
even pointers into the host's memory (main system
RAM). To facilitate this data movement, CUDA
provides cudaMemcpy()
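A minimal sketch of the round trip (buffer names are hypothetical):

    int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h_a = (float *)malloc(bytes);   // host buffer (system RAM)
    float *d_a;
    cudaMalloc((void **)&d_a, bytes);      // device buffer (GPU global memory)

    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);  // host -> device
    // ... launch kernels that operate on d_a ...
    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost);  // device -> host

    cudaFree(d_a);
    free(h_a);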
11
Optimizing Code for CUDA
  • Prevent thread starvation by breaking your
    problem down (128 execution units are available
    for use, and thousands of threads may be in
    flight); see the grid-stride sketch after this
    list
  • Utilize shared memory and avoid latency problems
    (communicating with system memory is slow)
  • Keep in mind there is no built-in way to
    synchronize threads in different blocks
  • Avoid thread divergence in warps by grouping
    threads with similar control paths into the same
    blocks
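One common way to break a problem across many threads (a hedged sketch, not from the slides) is a grid-stride loop, which lets a launch of any size cover a problem of any size:

    __global__ void scale(float *x, int n, float a)
    {
        // Each thread strides through the array, handling several
        // elements, so any grid size can cover any problem size
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += blockDim.x * gridDim.x)
            x[i] = a * x[i];
    }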

12
Code Example
(Will be explained in more depth later.)
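The slide's code image is not preserved in the transcript. A representative first CUDA program of this era is element-wise vector addition; the following is a hedged reconstruction, not the authors' original code, and names like vecAdd are hypothetical:

    #include <stdio.h>
    #include <stdlib.h>

    // Each thread adds one pair of elements
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            c[i] = a[i] + b[i];
    }

    int main(void)
    {
        int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        // Allocate and fill host (CPU) buffers
        float *h_a = (float *)malloc(bytes);
        float *h_b = (float *)malloc(bytes);
        float *h_c = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        // Allocate device (GPU) buffers and copy the inputs over
        float *d_a, *d_b, *d_c;
        cudaMalloc((void **)&d_a, bytes);
        cudaMalloc((void **)&d_b, bytes);
        cudaMalloc((void **)&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        // Launch: enough 256-thread blocks to cover all n elements
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

        // Copy the result back and check one element
        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);   // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }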
13
Kernel Functions
  • A kernel function is the basic unit of work
    within a CUDA thread
  • Kernel functions are CUDA extensions to ANSI C
    that are compiled by the CUDA compiler and the
    object code generator

14
Kernel Limitations
  • There must be no recursion (there's no call stack)
  • There must be no static variable declarations
  • Functions must have a fixed (non-variadic) number
    of arguments
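As a sketch of how the recursion restriction is worked around (a hypothetical example): any recursive definition must be rewritten iteratively for device code.

    __device__ int factorial(int n)
    {
        // Iterative form of a recursive definition: no call stack needed
        int r = 1;
        for (int i = 2; i <= n; i++)
            r *= i;
        return r;
    }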

15
CUDA Warp
  • CUDA utilizes SIMT (Single Instruction Multiple
    Thread)
  • Warps are groups of 32 threads. Each warp
    receives a single instruction and broadcasts it
    to all of its threads.
  • CUDA provides zero-overhead warp and thread
    scheduling. Also, the overhead of thread creation
    is on the order of 1 clock.
  • Because a warp receives a single instruction, it
    will diverge and converge as each thread branches
    independently
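A minimal sketch of divergence (the kernel name is hypothetical): threads in the same warp that take different branches force the warp to execute both paths serially before reconverging.

    __global__ void divergent(int *out)
    {
        // Lanes 0-15 and 16-31 of each warp take different branches,
        // so the warp runs the two paths one after the other
        if (threadIdx.x % 32 < 16)
            out[threadIdx.x] = 1;
        else
            out[threadIdx.x] = 2;
    }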

16
CUDA Hardware
  • The primary components of the Tesla architecture
    are:
  • Streaming Multiprocessor (The 8800 has 16)
  • Scalar Processor
  • Memory hierarchy
  • Interconnection network
  • Host interface

17
Streaming Multiprocessor (SM)
  • Each SM has 8 Scalar Processors (SPs)
  • IEEE 754 32-bit floating-point support
    (incomplete support)
  • Each SP is a 1.35 GHz processor (32 GFLOPS peak)
  • Supports 32- and 64-bit integers
  • 8,192 dynamically partitioned 32-bit registers
  • Supports 768 threads in hardware (24 SIMT warps
    of 32 threads)
  • Thread scheduling done in hardware
  • 16 KB of low-latency shared memory
  • 2 Special Function Units (reciprocal square root,
    trig functions, etc.)

Each GPU has 16 SMs
18
The GPU
19
Scalar Processor
  • Supports 32-bit IEEE floating point instructions
  • FADD, FMAD, FMIN, FMAX, FSET, F2I, I2F
  • Supports 32-bit integer operations
  • IADD, IMUL24, IMAD24, IMIN, IMAX, ISET, I2I, SHR,
    SHL, AND, OR, XOR
  • Fully pipelined

20
Code Example Revisited
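The revisited code slide is also not preserved in the transcript; presumably it walked back through the example's launch configuration. As a hedged annotation of that configuration (names match the vecAdd sketch above):

    int threads = 256;                          // threads per block: 8 warps of 32
    int blocks  = (n + threads - 1) / threads;  // round up so every element is covered
    // <<<blocks, threads>>> is the execution configuration: it creates one
    // grid of 'blocks' blocks, and the hardware splits each block into
    // warps for scheduling
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);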
21
Myths About CUDA
  • Myth: GPUs are the only processors in a CUDA
    application
  • No; the CUDA platform is a co-processing model,
    using both the CPU and GPU
  • Myth: GPUs are very wide (1000s-wide) SIMD
    machines
  • No; a CUDA warp is only 32 threads
  • Myth: Branching is not possible on GPUs
  • Incorrect
  • Myth: GPUs are power-inefficient
  • No; performance per watt is quite good
  • Myth: CUDA is only for C or C++ programmers
  • Not true; there are third-party wrappers for
    Java, Python, and more

22
Different Types of CUDA Applications
23
Future Developments of CUDA
  • The next generation of CUDA, called Fermi, will
    be the standard on the GeForce 300 series
  • Fermi will have full support for IEEE 754 double
    precision
  • Fermi will natively support more programming
    languages
  • Also, there is a new project, OpenCL, that seeks
    to provide an abstraction layer over CUDA and
    similar platforms (AMD's Stream)

24
Things to Ponder
  • Is CUDA better than Cell?
  • How do I utilize 12,000 threads?
  • Is CUDA really relevant anyway, in a world where
    web applications are so popular?

25
Parallel Programming with CUDA
  • By Matthew Guidry
  • Charles McClendon