Introduction to GPU Programming for EDA - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Introduction to GPU Programming for EDA

1
Introduction to GPU Programming for EDA
  • John F. Croix
  • Cadence Design Systems, Inc.
  • Sunil P. Khatri
  • Texas A&M University
  • Acknowledgements: NVIDIA, Nascentric Inc.,
    Accelicon Inc.
  • Students: Kanupriya Gulati, Vinay Karkala,
    Kalyana Bollapalli

2
Outline
  • GPU Architecture Overview
  • GPU Programming
  • Algorithm Acceleration Guidelines
  • Case Studies
  • Conclusion
  • Q&A

2
3
Outline
  • GPU Architecture Overview
  • Evolution and architecture
  • Peak performance
  • GPU and CPU interaction: practical
    considerations
  • GPU Programming
  • Algorithm Acceleration Guidelines
  • Case Studies
  • Conclusion
  • Q&A

3
4
GPU Evolution
  • In the early days, graphics accelerators were
    primitive
  • Acceleration of graphics rendering tasks for
    (CRT) displays
  • Many hardwired graphics acceleration units
  • With VLSI technology scaling, the GPU was born
  • Many programmable processors to handle graphics
    rendering tasks
  • Increased peak memory bandwidths and peak
    performance
  • Goal was faster and more realistic rendering for
    gaming applications
  • Recently, several scientific communities began to
    leverage these GPUs
  • Initially used graphics APIs like OpenGL and
    DirectX for these tasks
  • GPU vendors recognized this interest
  • Development of C-like programming environments
    such as CUDA
  • Development of GPU architectures tuned for
    scientific computations

4
5
GPU Introduction
  • A GPU is essentially a commodity stream processor
  • Highly parallel (100s of processor cores)
  • Very fast (>900 GFLOPS of peak performance)
  • Operates in a SIMD manner. This is a key
    restriction
  • Multiple processors operate in lock-step (same
    instruction) but on different data
  • GPUs, owing to their massively parallel
    architecture, have been used to accelerate
  • Image/stream processing, data compression,
    numerical algorithms
  • Recently they have been used to accelerate CAD
    algorithms as well.
  • Inexpensive, off-the-shelf cards like the NVIDIA
    Quadro FX / 280 GTX GPU achieve impressive
    performance
  • 933 GFLOPs peak performance
  • 240 SIMD cores partitioned into 30
    Multiprocessors (MPs)
  • 4GB (Quadro) and 1GB (GTX 280) device memory with
    142 GB/s bandwidth
  • 1.4 GHz GPU operating frequency
  • Programmed with Compute Unified Device
    Architecture (CUDA) framework

6
GPU Architecture
  • In the GTX 280, there are 10 Thread Processing
    Clusters (TPCs)
  • Each has 3 Streaming Multiprocessors (SMs), which
    we will refer to as multiprocessors (MPs)
  • Each MP has 8 Streaming Processors (SPs) or
    Thread Processors (TPs). We will refer to these
    as processors.
  • 240 processors and 30 MPs in all!
  • One double-precision FP unit per SM

6
Source: NVIDIA
7
GPU vs CPU: NVIDIA 280 vs Intel i7 860 (1)
[Comparison table not preserved in transcript]
(1) http://ark.intel.com/Product.aspx?id=41316
(2) TPC = Thread Processing Cluster (24 cores)
(3) 30 multi-processors in a 280
7
8
GPU vs CPU Peak Performance Trends
  • GPU peak performance has grown aggressively.
  • Hardware has kept up with Moore's law

8
Source: NVIDIA
9
GPU Programming Model
  • The GPU is viewed as a compute device that
  • Is a coprocessor (slave) to the CPU (host)
  • Has its own DRAM (device memory) but no virtual
    memory
  • Entire design instance may not fit on the GPU!
  • Kernel is a CPU-callable function. Thread is an
    instance of a kernel.
  • GPU runs many threads in parallel.

[Figure: Host (CPU) connected to the Device (GPU) over PCIe; a kernel launched
by the host runs as many threads (instances of the kernel) on the device,
which has its own device memory]
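To make the host/device model above concrete, here is a hedged, minimal CUDA sketch (the kernel name scale, the size N and the launch configuration are illustrative assumptions, not from the slides): the host allocates device memory, queues a kernel, and each thread instance processes one element.

// Minimal sketch of the CPU (host) / GPU (device) kernel model.
// All names and sizes are illustrative; error checking omitted.
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index for this thread instance
  if (i < n)
    data[i] *= factor;                             // each thread handles one element
}

int main()
{
  const int N = 1 << 20;
  float *dData;
  cudaMalloc(&dData, N * sizeof(float));           // allocate device (DRAM) memory
  scale<<<(N + 255) / 256, 256>>>(dData, 2.0f, N); // launch many threads of the kernel
  cudaThreadSynchronize();                         // wait for the queued kernel (CUDA 2.x API)
  cudaFree(dData);
  return 0;
}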
10
Data Transfers (CPU ↔ GPU)
  • GPUs and CPUs communicate via a PCIe bus
  • This communication is expensive and should be
    minimized for target applications
  • Graphics applications usually require
  • Initial data to be sent from CPU to GPU
  • Single transfer of processed data from GPU to CPU
  • General purpose computations usually require
  • Multiple transfers between CPU and GPU (since
    conditional checks are done on the CPU)
  • Possibility of saturating the PCIe bus and
    reducing the achievable performance

10
11
GPU Threads vs. CPU Threads
  • GPU threads
  • Lightweight, small creation and scheduling
    overhead, extremely fast hardware context
    switching
  • Need to issue 1000s of GPU threads to hide global
    memory latencies (600-800 cycles)
  • CPU threads
  • Heavyweight, large scheduling overhead, slow
    context switching
  • Multi-GPU usage requires invocation of multiple
    CPU threads
  • Each CPU thread creates a GPU context
  • Context swapping is required for a CPU thread to
    access GPU memory allocated by another CPU thread

11
12
Device Memory Space Overview
  • Each thread runs on an SP and has
  • R/W per-thread registers (on-chip)
  • Limit usage (max 16K/MP)
  • R/W per-thread local memory (off-chip)
  • R/W per-block shared memory (on-chip)
  • Need to avoid bank conflicts
  • R/W per-grid global memory (off-chip)
  • Not cached, 600-800 cycle read
  • Latency hidden by parallelism
    and fast context switches
  • Main means of data transfer between host and device
  • Coalescing recommended
  • RO per-grid cached constant and texture memory
    (off-chip)
  • The host can R/W global, constant and texture
    memories (visible to all threads)

[Figure: device memory spaces. A grid of thread blocks, each block with its
own shared memory; each thread has its own registers and local memory; global,
constant and texture memory are shared by all threads and accessible from the
host]

Source: NVIDIA CUDA Programming Guide, version 1.1
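To make the memory spaces above concrete, here is a hedged CUDA sketch (the kernel name, tile size and data layout are illustrative assumptions): data is staged from global memory into per-block shared memory, with consecutive threads reading consecutive addresses so the global loads can be coalesced.

// Hedged sketch: staging global memory into per-block shared memory.
// Names and sizes are illustrative assumptions.
#define TILE 256

__global__ void sumTiles(const float *in, float *out, int n)
{
  __shared__ float tile[TILE];                    // per-block shared memory (on-chip)
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // consecutive threads read consecutive
  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // addresses, so the loads coalesce
  __syncthreads();                                // make the tile visible to the whole block

  if (threadIdx.x == 0) {                         // one thread per block sums its tile
    float s = 0.0f;
    for (int k = 0; k < TILE; ++k)
      s += tile[k];
    out[blockIdx.x] = s;                          // one partial sum per block in global memory
  }
}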
13
Outline
  • GPU Architecture Overview
  • GPU Programming
  • CPU threads
  • Conditional and Loop processing
  • Floating point
  • General GPU program structure
  • CUDA and OpenCL
  • Algorithm Acceleration Guidelines
  • Case Studies
  • Conclusion
  • Q&A

13
14
CPU Threading
  • CPU
  • All threads are equivalent
  • Read/write concurrently to the same memory
  • Synchronization primitives required to avoid
    collisions
  • GPU (NVIDIA)
  • Each CPU thread maintains a unique context
  • GPU resources (e.g. memory, code modules, address
    space) are context-specific
  • Each CPU thread can access a single context at
    once
  • Contexts must be exchanged between CPU threads to
    share GPU resources between CPU threads
  • Contexts use reference counting and are
    automatically destroyed

14
15
SIMD Conditional Processing
  • Unlike threads in a CPU-based program, SIMD
    programs cannot follow different execution paths
  • Ideal scenario
  • All GPU threads follow the same execution path
  • All processors active continuously
  • In divergent paths, some processors execute the
    then-block and others the else-block
  • Program flow cannot actually diverge. All
    instructions are executed
  • The then- and else- blocks are both executed
  • A bit is used to enable/disable processors based
    on the block being executed
  • Parallelism is reduced, impacting performance

15
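A hedged illustration of the masking described above (the kernel and variable names are assumptions): both sides of the branch are issued, and threads whose predicate is false are simply disabled while the other group executes.

// Hedged sketch of SIMD branch divergence (illustrative kernel).
__global__ void divergent(float *x, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    if (x[i] > 0.0f)         // threads in the same warp may disagree here
      x[i] = x[i] * 2.0f;    // executed with the "else" threads disabled
    else
      x[i] = -x[i];          // executed with the "then" threads disabled
  }                          // both blocks consume issue slots, so parallelism drops
}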
16
Idle Processors
  • Idle CPU processors can be dynamically
    rescheduled by OS
  • SIMD processors are not actually idle
  • All processors scheduled are following identical
    execution paths
  • Disabled (idle) processors are unavailable for
    other work and cannot be rescheduled
  • Effective utilization of processors is the
    programmer's responsibility
  • Scheduling is an art, not necessarily a science
  • Techniques will vary from chip to chip

16
17
Conditional Processing
  • if (condition) { ... } else { ... }

17
18
Nested Conditional Processing
  • if (condition) { if (condition2) { ... } else { ... } } else { ... }

18
19
Loop Processing
  • while (condition) { if (cond2) { ... } }

19
20
The Cost of Memory Access
  • Registers are extremely fast, but are a limited
    resource
  • Cached memories also tend to be small
  • For large data sets, global memory provides
    read/write access
  • Accesses take between 600 and 800 clock cycles
  • Accesses are not cached
  • To hide memory latency, the hardware provides
    fast context switches when memory is accessed
  • However, there must be enough computational work
    to do to hide the high cost of memory access
  • Programmers need to be smart
  • Compilers often don't provide the necessary
    optimizations when optimizing for speed instead
    of code size
  • It can sometimes be cheaper to recompute a result
    than perform a memory read/write

20
21
Conditional Processing
  • Version 1: hoist the access above the branch
      float a = *someVar;        // Access + Swap
      if (condition)
        ...
      else
        ...
  • Version 2: access inside each branch
      if (condition)
        ...
        float a = *someVar;      // Access + Swap
        ...
      else
        ...
        float a = *someVar;      // Access + Swap
        ...
  • Each global memory access is a point where the
    hardware must swap in other work to hide the
    600-800 cycle latency
21
22
Floating Point
  • GPUs are optimized for 32-bit accesses
  • 64-bit double-precision values fetched from
    memory as two 32-bit quantities
  • May impact performance in the event of memory
    bank conflicts
  • One double-precision unit per multi-processor1

1. http://www.ddj.com/hpc-high-performance-computing/210102115
22
23
OpenCL vs CUDA
  • CUDA uses early code binding
  • Code is compiled with normal C/C++/FORTRAN (beta)
    source code
  • Need CUDA occupancy calculator to determine
    number of threads based on resource utilization
  • Library support: BLAS, FFT, DPT
  • OpenCL
  • Late binding of OpenCL code to executable
  • OpenCL compiler/linker embedded within
    application
  • No need for CUDA occupancy calculator
  • Only supports C
  • No libraries

23
24
CUDA Occupancy Calculator
24
25
OpenCL vs CUDA
25
26
General Program Structure
  • Initialize GPU
  • Create GPU context
  • Build GPU program
  • Allocate GPU memory
  • Transfer data from CPU to GPU
  • Invoke GPU functions
  • Transfer data from GPU to CPU
  • Deallocate GPU memory
  • Finalize GPU usage

26
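The following hedged CUDA runtime sketch walks through the steps listed above in order (array names and sizes are illustrative, error checking is omitted, and context creation is implicit in the runtime API, as the next slide notes).

// Hedged sketch of the general program structure (illustrative names).
#include <cuda_runtime.h>

__global__ void kernel(float *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

int main()
{
  const int N = 1024;
  float hData[N] = {0};

  cudaSetDevice(0);                               // initialize GPU (context created implicitly)
  float *dData;
  cudaMalloc(&dData, N * sizeof(float));          // allocate GPU memory
  cudaMemcpy(dData, hData, N * sizeof(float),
             cudaMemcpyHostToDevice);             // transfer data CPU -> GPU
  kernel<<<N / 256, 256>>>(dData, N);             // invoke GPU function
  cudaMemcpy(hData, dData, N * sizeof(float),
             cudaMemcpyDeviceToHost);             // transfer data GPU -> CPU
  cudaFree(dData);                                // deallocate GPU memory
  cudaThreadExit();                               // finalize GPU usage (CUDA 2.x API)
  return 0;
}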
27
Create GPU Context
  • CUDA
  • Context creation is implicit in single-threaded
    programs
  • Multiple contexts can be explicitly created
  • Each thread maintains a context stack
  • Top context is current context
  • Threads
  • Contexts can be swapped between threads
  • A thread can only have one context active at a
    time (stack)
  • A context cannot be shared simultaneously between
    threads
  • OpenCL
  • All commands explicitly associated with a context
  • Must create a command queue to invoke kernels

27
28
Initialize GPU
  • CUDA
  • cudaGetDeviceCount()
  • cudaSetDevice()
  • cudaGetDeviceProperties()

CUDA::CUDA(int Device) : Base()
{
  mValid = false;
  int DeviceCount;
  cudaGetDeviceCount( &DeviceCount );
  if (!DeviceCount)
    return;
  Device = (Device == -1) ? DeviceCount - 1 : Device;
  cudaSetDevice( Device );
  mValid = true;
}
28
29
Initialize GPU
  • OpenCL
  • Context must be built before anything can be done
    on the GPU
  • All commands are with respect to a given context

OpenCL::OpenCL(int Device) : Base()
{
  init();   // Initialize class pointers to NULL
  cl_int RC;
  mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
  size_t Bytes;
  RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
  int NumDevices = Bytes / sizeof( cl_device_id );
  cl_device_id *Devices = new cl_device_id[ NumDevices ];
  RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
  mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
  size_t MaxWorkItemSizes[ 256 ];
  RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                        sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
  mMaxWorkItems = MaxWorkItemSizes[ 0 ];
  mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
}
29
30
Build GPU Program
  • CUDA
  • GPU code is compiled using nvcc compiler
  • Object code is statically bound to CPU executable
  • GPU code is intrinsically part of the program
  • Mapping of problem to threads performed at
    compile time

30
31
Build GPU Program
  • OpenCL
  • GPU code is bound at runtime to the GPU
  • OpenCL compiler is part of executable
  • Code can be source code or object code
  • Source code can be dynamically generated by the
    program
  • Can be stored in an external file

// Continued from constructor
char *code = shrFindFilePath( "code.cl", "." );
size_t CodeLength = 0;
char *Source = oclLoadProgSource( code, "", &CodeLength );
const char *SourceCode = Source;
mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode,
                                      &CodeLength, &RC );
RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
std::free( code );
std::free( Source );
mValid = RC == CL_SUCCESS;
31
32
Allocate/Deallocate GPU Memory
  • CUDA
  • Most frequently used allocator: cudaMalloc()
  • Returns a memory pointer to GPU memory
  • Memory pointer cannot be used by CPU directly
  • Passed to GPU calls

void *CUDA::malloc(size_t Bytes)
{
  void *Memory;
  cudaError_t RC = cudaMalloc( &Memory, Bytes );
  return( RC == cudaSuccess ? Memory : NULL );
}
void CUDA::free(void *Memory)
{
  if (Memory) cudaFree( Memory );
}
32
33
Allocate/Deallocate GPU Memory
  • OpenCL
  • Like all things, memory allocation is explicitly
    performed within a context

void *OpenCL::malloc(size_t NumBytes)
{
  size_t Size = NumBytes / 32 + (NumBytes & 31 ? 1 : 0);
  cl_int RC;
  cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE,
                                  Size, NULL, &RC );
  return( RC == CL_SUCCESS ? Memory : NULL );
}
void OpenCL::free(void *Memory)
{
  if (Memory) {
    cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
    clReleaseMemObject( Ptr );
  }
}
33
34
CPU/GPU Data Transfer
  • Data moved across PCIe bus
  • CUDA
  • Data transfer accomplished via cudaMemcpy()
    routine
  • Implicit synchronization point
  • Non-blocking copies are available
  • Direction is determined by enumeration
  • cudaMemcpyHostToDevice
  • cudaMemcpyDeviceToHost
  • Allocated memory can be bound to texture memory
  • cudaBindTexture
  • OpenCL
  • Memory transfer via clEnqueueWriteBuffer() and
    clEnqueueReadBuffer()
  • Synchronization controlled by parameters to calls
  • Default is non-blocking

34
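A hedged sketch of the transfer calls above (buffer names and sizes are illustrative): a blocking cudaMemcpy in each direction on the CUDA side, with the equivalent OpenCL enqueue calls shown as comments (CL_TRUE requests a blocking transfer).

// Hedged sketch of CPU <-> GPU transfers with the CUDA runtime (illustrative names).
#include <cuda_runtime.h>

int main()
{
  float hBuf[1024] = {0};
  float *dBuf;
  cudaMalloc(&dBuf, sizeof(hBuf));
  cudaMemcpy(dBuf, hBuf, sizeof(hBuf), cudaMemcpyHostToDevice);  // blocking copy, CPU -> GPU
  // ... launch kernels that read/write dBuf ...
  cudaMemcpy(hBuf, dBuf, sizeof(hBuf), cudaMemcpyDeviceToHost);  // blocking copy, GPU -> CPU
  cudaFree(dBuf);

  // OpenCL equivalents (assuming a queue and a cl_mem buffer dMem already exist):
  //   clEnqueueWriteBuffer(queue, dMem, CL_TRUE, 0, sizeof(hBuf), hBuf, 0, NULL, NULL);
  //   clEnqueueReadBuffer (queue, dMem, CL_TRUE, 0, sizeof(hBuf), hBuf, 0, NULL, NULL);
  return 0;
}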
35
Call GPU Functions (Kernels)
  • Functions in CPU are executed when invoked
  • GPU function calls from CPU create execution
    queue
  • CPU does not wait until the GPU function completes;
    the command is simply queued
  • GPU executes commands on the queue using its own
    ordering
  • Synchronization points cause CPU to stall to wait
    for GPU return
  • CUDA
  • cudaThreadSynchronize()

35
36
GPU Function Calls
  • GPU function calls have an associated
    dimensionality (which can be 1D, 2D or 3D)
  • CUDA
  • Extended language syntax to include problem
    dimension
  • Syntax
  • function<<< dimGrid, dimBlock >>>( arguments )
  • OpenCL
  • Must explicitly put function arguments into
    context
  • clSetKernelArg()
  • Invoke kernel using the context
  • Kernel retrieves arguments from context
    automatically

36
37
GPU Cleanup/Termination
  • CUDA
  • Manages most cleanup operations automatically as
    a context is destroyed
  • OpenCL
  • Provides low-level APIs for deallocation of all
    resources
  • Invoked in order opposite to invocation
  • clReleaseKernel()
  • clReleaseProgram()
  • clReleaseCommandQueue()
  • clReleaseContext()

37
38
Thread Batching Grids and Blocks
  • A kernel is executed as a grid of thread blocks
    (aka blocks)
  • A thread block is a batch of threads that can
    cooperate with each other by
  • Synchronizing their execution
  • Diverging execution results in performance loss
  • Efficiently sharing data through a low latency
    shared memory
  • Two threads from two different blocks cannot
    cooperate

[Figure: the host launches Kernel 1 and then Kernel 2 on the device, each as a
grid of thread blocks]
Source: NVIDIA CUDA Programming Guide, version 1.1
39
Block and Thread IDs
  • Threads and blocks have IDs
  • So each thread can identify which data it will
    operate on
  • Block ID: 1D or 2D
  • Thread ID: 1D, 2D, or 3D
  • Simplifies memory addressing when processing
    multidimensional data
  • Image processing
  • Solving PDEs on volumes
  • Other problems with underlying 1D, 2D or 3D
    geometry

Source: NVIDIA CUDA Programming Guide, version 1.1
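For example, a hedged sketch of 2D indexing (names are illustrative): each thread combines its block ID and thread ID to locate the single matrix element it owns.

// Hedged sketch: mapping 2D block and thread IDs to a matrix element.
__global__ void addOne(float *m, int width, int height)
{
  unsigned x = blockIdx.x * blockDim.x + threadIdx.x;  // column from block ID + thread ID
  unsigned y = blockIdx.y * blockDim.y + threadIdx.y;  // row from block ID + thread ID
  if (x < width && y < height)
    m[y * width + x] += 1.0f;                          // each thread updates one element
}
// Launched, e.g., with: dim3 block(16, 16); dim3 grid((W + 15) / 16, (H + 15) / 16);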
40
GPU Kernels
  • Each function is passed data to create a unique
    ID
  • Data typically specifies the spatial coordinates of
    the executing processor within the hardware
  • The ID is used to coordinate data access
  • Ensures that two threads' accesses do not collide
  • CUDA function types
  • __global__
  • Callable by CPU
  • Cannot be called by GPU
  • __device__
  • Callable by other GPU functions
  • Cannot be called by CPU
  • CUDA expands these as inline functions via nvcc
  • Adds to function resource utilization

40
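A hedged sketch of the two function types (names are illustrative): a __device__ helper callable only from GPU code, inlined by nvcc, and a __global__ kernel callable from the CPU that uses its thread ID to keep accesses from colliding.

// Hedged sketch of CUDA function types (illustrative names).
__device__ float square(float v)                   // callable only from other GPU functions
{
  return v * v;                                    // expanded inline by nvcc
}

__global__ void squareAll(float *data, int n)      // callable from the CPU
{
  int id = blockIdx.x * blockDim.x + threadIdx.x;  // unique ID per thread
  if (id < n)
    data[id] = square(data[id]);                   // the ID keeps threads from colliding
}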
41
OpenCL Kernel Invocation
  • Use C templates to simplify argument handling

template<typename T> inline cl_int
setArg(cl_kernel Kernel, unsigned Pos, T Arg)
{
  return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
}

template<> inline cl_int
setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
{
  // This routine, unlike the others, sets up shared memory by passing
  // NULL in as the pointer to the variable.
  return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
}

template<> inline cl_int
setArg(cl_kernel Kernel, unsigned Pos, int Arg)
{
  cl_int ArgInt = Arg;
  return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
}

template<> inline cl_int
setArg(cl_kernel Kernel, unsigned Pos, float Arg)
{
  cl_float ArgFloat = Arg;
  return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
}
...
template<typename T0> inline cl_int
setArgs(cl_kernel Kernel, T0 Arg0)
{
  return( setArg( Kernel, 0, Arg0 ) );
}

template<typename T0, typename T1> inline cl_int
setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
{
  return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );
}

template<typename T0, typename T1, typename T2> inline cl_int
setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
{
  return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) |
          setArg( Kernel, 2, Arg2 ) );
}
...
41
42
OpenCL Kernel Invocation
  • BLAS-like example
  • CUDA provides a BLAS library; OpenCL doesn't
  • Must write own BLAS routines in OpenCL to port
    between the two easily
  • swap() function swaps contents of 2 vectors with
    differing vector strides

void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
{
  if (!checkBLASKernel( &mSswapKernel, "Sswap" ))
    return;
  mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
  if (x && y)
    if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS)
      executeBLASKernel( mSswapKernel, n );
}
42
43
OpenCL Kernel Invocation
  • BLAS support functions

bool OpenCL::checkBLASKernel(cl_kernel *Kernel, const char *KernelName)
{
  if (!mValid) {
    mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
    return( false );
  }
  if (!(*Kernel)) {
    cl_int RC;
    *Kernel = clCreateKernel( mProgram, KernelName, &RC );
    if (RC != CL_SUCCESS) {
      mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
      return( false );
    }
  }
  return( true );
}

inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
{
  size_t Size = n;
  size_t GlobalWorkSize = Size & mMaxWorkItemsMask;
  if (Size & ~mMaxWorkItemsMask)
    GlobalWorkSize += mMaxWorkItems;
  cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL,
                                      &GlobalWorkSize, &mMaxWorkItems,
                                      0, NULL, NULL );
  clFinish( mCommandQueue );
  mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS
                                       : Base::BLAS_EXECUTION_FAILED;
}
43
44
OpenCL Kernels
  • BLAS SSWAP example

__kernel void Sswap(int n, __global float *x, int incx,
                    __global float *y, int incy)
{
  const unsigned GID = get_global_id( 0 );
  if (GID < n) {
    int lx = (incx > 0) ? 0 : ((1 - n) * incx);
    int ly = (incy > 0) ? 0 : ((1 - n) * incy);
    float temp = y[ ly + GID * incy ];
    y[ ly + GID * incy ] = x[ lx + GID * incx ];
    x[ lx + GID * incx ] = temp;
  }
}
http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
44
45
CUDA Kernels
  • CPU
  • GPU (kernel.cu)

#include "kernel.cu"
...
const unsigned int size_x = 256;
const unsigned int size_y = 4096;
...
dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();
...

#define BLOCK_DIM 16

__global__ void transpose_naive(float *odata, float *idata,
                                int width, int height)
{
  unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
  if (xIndex < width && yIndex < height) {
    unsigned int index_in  = xIndex + width * yIndex;
    unsigned int index_out = yIndex + height * xIndex;
    odata[index_out] = idata[index_in];
  }
}
45
46
Outline
  • GPU Architecture Overview
  • GPU Programming
  • Algorithm Acceleration Guidelines
  • Streams and Pinned Memory
  • Thread Scheduling
  • Parallel reduction
  • Program partitioning
  • Simultaneous graphics and algorithm processing
  • Case Studies
  • Conclusion
  • Q&A

46
47
Streams
  • Sequence of commands that execute serially
  • Allow overlapping of memory transfers and
    kernel computations from different streams
  • Hides data transfer cost
  • Implementable in CUDA devices with compute
    capability 1.1
  • Host memory must be of type pinned

[Figure: with two streams, the H→D transfer of Data2 and the D→H transfer of
Data1 overlap with kernel computation, rather than all H→D transfers, kernel
computation and D→H transfers running back-to-back]

47
48
Pinned Memory
  • Memory on the host that is mapped to the device's
    address space and thus accessible directly by a
    kernel
  • Has several advantages
  • There is no need to allocate a block in device
    memory and copy data between this block and the
    block in host memory data transfers are
    implicitly performed as needed by the kernel
  • Bandwidth between host and device memories is
    higher
  • Write-combining Memory
  • Type of pinned memory where individual writes are
    aggregated into a larger write operation
  • Avoids internal L1, L2 cache writes making more
    cache available for rest of the application
  • Is not snooped during transfers across the PCI
    Express bus, which can improve transfer
    performance by up to 40%

48
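A hedged sketch combining the two ideas above (the array names, sizes and two-chunk split are illustrative assumptions): host memory is allocated pinned with cudaMallocHost, and each stream's copies and kernel are queued with the asynchronous APIs so the transfers of one chunk overlap computation on the other.

// Hedged sketch: overlapping transfers and kernels with two CUDA streams.
// All names and sizes are illustrative; error checking omitted.
#include <cuda_runtime.h>

__global__ void work(float *d, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] *= 2.0f;
}

int main()
{
  const int N = 1 << 20, HALF = N / 2;
  float *hData, *dData;
  cudaMallocHost(&hData, N * sizeof(float));     // pinned host memory (needed for async copies)
  cudaMalloc(&dData, N * sizeof(float));

  cudaStream_t s[2];
  cudaStreamCreate(&s[0]);
  cudaStreamCreate(&s[1]);

  for (int k = 0; k < 2; ++k) {                  // queue copy-in, kernel, copy-out per stream
    int off = k * HALF;
    cudaMemcpyAsync(dData + off, hData + off, HALF * sizeof(float),
                    cudaMemcpyHostToDevice, s[k]);
    work<<<HALF / 256, 256, 0, s[k]>>>(dData + off, HALF);
    cudaMemcpyAsync(hData + off, dData + off, HALF * sizeof(float),
                    cudaMemcpyDeviceToHost, s[k]);
  }
  cudaThreadSynchronize();                       // wait for both streams to drain

  cudaStreamDestroy(s[0]);
  cudaStreamDestroy(s[1]);
  cudaFree(dData);
  cudaFreeHost(hData);
  return 0;
}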
49
Threads and Scheduling in GPU
  • GPU consists of multiprocessors, each of which
    has many processors
  • A kernel is executed as a grid of blocks
  • Thread block is a batch of threads that
    cooperate with each other by
  • Synchronizing their execution
  • Diverging execution results in performance loss
  • Efficiently sharing data through a low latency
    shared memory
  • All threads of a block reside on the same
    multiprocessor (max 1024/MP)
  • Number of blocks a multiprocessor can process at
    once depends on register and shared memory usage
    per thread

Source: NVIDIA CUDA Programming Guide, version 1.1
50
Threads and Scheduling in GPU (contd)
  • Before execution a block is split into warps
  • A warp is a set of 32 threads which execute the
    same instruction on a MP
  • Half-warp is either first 16 or second 16 threads
    of a warp
  • Full efficiency is realized when all 16 threads
    of a half-warp agree on their execution path
  • Branch divergence occurs if threads of a
    half-warp diverge via a data dependent
    conditional branch
  • The half-warp serially executes each branch path
    taken, ignoring the result from threads that are
    not on that path
  • Increases kernel execution time
  • Warps of the same block are executed in a
    time-sliced fashion

50
51
Program Parallelism
  • The GPU is designed to address applications that
    are data-parallel
  • Parallelism is an inherent factor to determine
    suitability of a problem for GPU applications
  • In fact, applications in which enough parallelism
    cannot be exposed may be slower on a GPU than on a
    single-threaded CPU
  • Since the same program is executed for each data
    element, there is no sophisticated flow control
  • Conditional checks need to be done on the CPU
  • Reduce the output of all threads, transfer
    reduced result to CPU which tests condition and
    appropriately issues further GPU threads
  • Can be expensive since transfers are done over
    the PCIe bus!

52
Parallel Reduction
  • Perform a reduction of the data before
    transferring to the CPU
  • Tree based reduction approach used within each
    thread block
  • Reduction decomposed into multiple kernels to
    reduce number of threads issued in the later
    stages of tree based reduction

[Figure: example of a tree-based SUM within a thread block, with a
__syncthreads() barrier between levels of the tree]
52
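A hedged sketch of one such reduction kernel (the kernel name and block size are illustrative): each block reduces its tile in shared memory using sequential addressing, which keeps active threads contiguous (no divergent warps) and avoids bank conflicts, leaving one partial sum per block for a follow-up kernel or the CPU.

// Hedged sketch of tree-based block reduction with sequential addressing.
#define BLOCK 256

__global__ void reduceSum(const float *in, float *out, int n)
{
  __shared__ float s[BLOCK];
  unsigned tid = threadIdx.x;
  unsigned i   = blockIdx.x * BLOCK + tid;
  s[tid] = (i < n) ? in[i] : 0.0f;               // load one element per thread
  __syncthreads();

  for (unsigned stride = BLOCK / 2; stride > 0; stride >>= 1) {
    if (tid < stride)                            // active threads stay contiguous, so
      s[tid] += s[tid + stride];                 // warps do not diverge and the strided
    __syncthreads();                             // access avoids bank conflicts
  }
  if (tid == 0)
    out[blockIdx.x] = s[0];                      // one partial sum per block
}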
53
Parallel Reduction (contd)
  • Types of optimization for efficient parallel
    reduction
  • Algorithmic optimizations
  • Avoid divergent warps
  • Avoid shared memory bank conflicts sequential
    addressing
  • First addition during global load halves the
    number of blocks
  • Code optimizations
  • Loop unrolling
  • Multiple adds per thread to increase arithmetic
    intensity of kernels (high ratio of computation
    in the kernel to global reads and writes)

53
54
Parallel Reduction (contd)
  • Example of tree based reduced sum

[Figure: tree-based reduced sum of 16 values in shared memory using
interleaved addressing; threads with IDs 0, 2, 4, ... combine pairs at each
level until element 0 holds the block total of 41]
54
55
Parallel Reduction (contd)
  • Warp divergence removed

[Figure: the same reduction with the active threads compacted into consecutive
thread IDs (0-7, then 0-3, ...), so each warp is either fully active or fully
idle and warp divergence is removed; shared-memory bank IDs are also shown]
55
56
Parallel Reduction (contd)
  • Sequential Addressing

[Figure: the reduction with sequential addressing; each active thread adds an
element that is a fixed stride away, and the stride halves at every step,
avoiding shared-memory bank conflicts]
56
57
Program Partitioning
  • Assume a subroutine S is invoked N times in an
    application, and each invocation uses x registers
  • A multiprocessor of the GPU has 16K registers, so
    the maximum parallelism is 16K/x threads (e.g. if
    x = 32, at most 512 threads can be resident)
  • The GPU can do fast hardware context switches
    between these threads, which share the 16K
    registers
  • However, data transfers between kernels will
    become a significant overhead as the number of
    partitions increases

[Figure: invocations 1, 2, 3, ... N of S partitioned over time T, annotated
with register usage x and y]
57
58
Simultaneous Graphics and Algorithm Processing
  • If the same GPU is used for graphics and
    algorithmic processing
  • GPU resources may be saturated by graphics
    application, leaving little bandwidth for other
    applications
  • The fixed size of GPU memory (without swap space)
    may cause application launch failure
  • Graphics tasks may cause cache pollution which
    may cause erratic runtimes for general purpose
    applications
  • Run warm up code to flush out caches
  • A single kernel execution cannot be longer than 5
    seconds
  • Using a separate GPU for graphics and computation
    avoids the above listed problems

58
59
Outline
  • GPU Architecture Overview
  • GPU Programming
  • Algorithm Acceleration Guidelines
  • Case Studies
  • Boolean Satisfiability
  • Fast SPICE model evaluation
  • Fault Simulation
  • SSTA
  • Conclusion
  • Q&A

59
60
Guidelines for GPU Acceleration for Software
  • Current GPUs have an expensive communication link
    to the host. Data transfers should be minimized
  • Streams should be used to overlap communication
    and computation
  • Partition kernels to increase parallelism that
    can be leveraged
  • Full efficiency is realized when all 16 threads
    of a half-warp agree on their execution path
  • Reduce warp divergence
  • Avoid bank conflicts when using shared memory
  • Kernels should have high arithmetic intensity

60
61
Case Studies
  • Two approaches for accelerating an algorithm on
    the GPU
  • Re-architecting approach
  • Applicable when the problem does not have
    inherent SIMD nature
  • May require significant algorithmic modifications
  • Examples
  • Boolean Satisfiability
  • Fault Dictionary Computation (not covered in this
    talk, slides at end)
  • Porting approach
  • Applicable when problem runtime is dominated by a
    subroutine, multiple invocations of which operate
    upon independent data
  • Partition the subroutine into GPU kernels
  • Examples
  • Accelerating SPICE by porting model evaluation on
    the GPU
  • Fault Simulation
  • Monte Carlo based statistical static timing
    analysis (SSTA)

61
62
Boolean Satisfiability (SAT)
  • Given a Boolean formula in conjunctive normal
    form (CNF)
  • Either find a satisfying truth assignment of all
    variables
  • Or prove that there is no satisfying assignment
  • Decisions: x = true, y = true
  • The unassigned literal z gets implied because of
    the unit clause rule
  • Implication: z = false
  • Iterative application of the unit clause rule is
    called Boolean constraint propagation (BCP)
  • Recent BCP based SAT solvers incorporate conflict
    driven learning
  • A learned clause represents the search space that
    has been pruned

[Figure: a CNF clause with positive and negative literals, under the decisions
x = true and y = true]
62
63
Approach
  • Complete Approaches for SAT
  • Are exact, but the algorithms do not easily lend
    themselves to parallel implementations. Examples:
    GRASP, zChaff, CirCUs, MiniSAT
  • Stochastic Approaches for SAT
  • Can execute at high speeds, are scalable, but are
    not exact. Examples: Survey Propagation, WalkSAT,
    RandomSAT
  • Present a hybrid procedure for SAT
  • Retains the best features of complete and
    stochastic approaches
  • Proposed algorithm is based on MiniSAT
    (implemented on the CPU)
  • The variable ordering heuristic of MiniSAT is
    enhanced by a survey propagation (SP) based
    procedure, which is implemented on the GPU
  • Proposed approach is called MESP (MiniSAT
    enhanced with SP)

MESP
  • Next few slides
  • Discuss the GPU based SP implementation
  • Describe our MESP approach

[Figure: MESP couples MiniSAT (CPU) with SP (GPU)]
63
64
Survey Propagation (SP) based SAT
  • Factor Graph - graphical representation of a SAT
    instance
  • Variable nodes (variables)
  • Function nodes (clauses)
  • Is a tree if it has no cycles
  • SP is an algorithm in which agreement between
    clauses and variables is reached by sending
    probabilistic messages along edges of the
    factor graph (message passing)
  • Pros: highly scalable, parallelizable, exact for
    factor graphs that are trees
  • Cons: incomplete for non-tree factor graphs

64
65
Survey Propagation Equations
  • Notation
  • α, β are clauses; i, j are variables
  • V+(i): set of all clauses where i appears in the
    positive (+) form
  • V-(i): set of all clauses where i appears in the
    negative (-) form
  • η(α→i) is a warning (a probability) from clause α
    to variable i
  • Let i appear in the + form in α
  • The η's and π's are iteratively computed until
    convergence

[Equations: the η and π message updates used during the computation, and the
bias expressions used after convergence, were shown as images and are not
preserved in this transcript]
65
66
Survey Propagation Flowchart
[Flowchart: Survey Propagation. Randomly initialize the η(α→i); fixed
variables and satisfied clauses are ignored. Compute the π's, then compute
new η(α→i). Test convergence: C = Σ|η(α→i)new - η(α→i)| ≤ ε = 0.01. If not
converged and the iteration count exceeds the maximum, declare
non-convergence. On convergence, if Σ(η(α→i)) = 0, call WalkSAT to determine
a satisfying assignment; otherwise compute the biases W, sort the variables
in decreasing order of W, and fix the first x% of the variables. If a
contradiction is found, report it and quit]
66
67
Survey Propagation on the GPU
  • Implemented GPU kernels for the following
  • Compute π's, for all variables (V) in parallel
  • Compute η's, for all clauses (C) in parallel
  • In particular, computes η(α→i) for each variable i
    in clause α
  • Check convergence (Σ|η(α→i)new - η(α→i)| ≤ ε = 0.01)
    using a reduced integer add operation over all
    literals in all clauses
  • Compute Σ(η(α→i)) (to determine if non-trivial
    convergence) using a reduced float add
    operation
  • Compute W's (biases), for all variables in parallel
  • Parallel bitonic sort to find the largest x% of
    the W's
  • CPU performs conditional checks, fixes variables
    and executes WalkSAT
67
68
Data Structure on the GPU
[Figure: GPU data structures. Static per-variable data: for each of the V
variables, the clauses it appears in, its literal index and its polarity.
Static per-clause data: for each of the C clauses, its variables and their
polarities. The η(α→i) values are written by clauses and read by variables]
  • With 1 GB of Global memory, the 280 GTX GPU can
    fit instances with up to 10M clauses and 1M
    variables

[Figure: the π+ and π- values are written by variables and read by clauses]
68
69
Survey Propagation on the GPU
  • Memory transfers between GPU and CPU
  • Single transfer for static per variable and per
    clause data
  • During the computation of π and η, there are no
    transfers at all. All intermediate data is stored
    in the global memory of the GPU
  • After convergence is detected, the sorted list of
    variables in decreasing order of biases is
    transferred (GPU → CPU)
  • After the graph is simplified, the following are
    updated (CPU → GPU)
  • Variables that are fixed (don't contribute to η
    computation)
  • Clauses that are satisfied (don't contribute to π
    computation)

69
70
Results (GPU based SP)
  • MESP is compared against
  • Braunstein et al. 2005 (B05) and MiniSAT which
    were executed on a 3.6 GHz, 3GB Intel machine
    running Linux
  • Manolios et al. 2006 (M06), which uses OpenGL on
    NVIDIA GTX 7900 (512 MB memory , 128 cores,
    750MHz) to implement survey propagation
  • For hard random instances MESP shows a 22X
    speedup over B05
  • M06 reports a 9X speedup over B05

70
71
MESP
  • SAT instance is read into MiniSAT and on the GPU
    (executing SP)
  • MiniSAT is first invoked on the instance and
    after it has made some progress, it invokes
    GPU-based SP. MiniSAT transfers to SP
  • The current assignments and
  • A subset of the current learned clauses
  • Augment the current clause database in GPU-based
    SP with 3 sets of learned clauses (LC): C1, C2 and
    C3. L is the number of literals in an LC
  • C1 (0 < L <= 10), C2 (10 < L <= 25), C3 (25 < L <= 50)
  • Statically allocate enough space in the GPU's Global
    Memory to store 8K clauses in each of C1, C2 and C3
  • Messages computed over all clauses (η) are now
    computed in 4 separate kernels, one for each set
    of clauses (C1, C2, C3 and C)
  • On convergence, SP (in MESP) fixes variables for
    which the absolute bias difference W(+) - W(-) < t

71
72
MESP
  • MiniSAT decides the next variable to assign based
    on Variable State Independent Decaying Sum
    (VSIDS) heuristic
  • VSIDS chooses next decision variable with the
    highest activity
  • Activity is the variable occurrence count, with a
    higher weight on the variables of the more
    recently added learned clauses
  • Activity of the variables in the learned clauses
    is incremented by FM
  • In MESP, GPU-based SP invocation can return with
    the following outcomes

  • SP converges and fixes certain variables S:
    MiniSAT updates the activity of the variables in S
    by FSP
  • SP converges, fixes S and determines the factor
    graph is a tree: invokes WalkSAT. If WalkSAT finds
    an assignment, the instance is solved; else the
    fixed variables in S are returned to MiniSAT
  • SP converges but does not fix any variable:
    MiniSAT continues the search
  • SP does not converge / reports a contradiction
72
73
MESP
[Figure: MESP flow. MiniSAT (complete, on the CPU) passes its current
assignments and a subset of learned clauses to Survey Propagation (stochastic,
on the GPU); the GPU attempts to converge on the SP messages and works in
conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore
fixed variables and satisfied clauses, updates the activity table for the
variables S fixed by SP, and continues the search in MiniSAT's decision tree
using the updated activities]
73
74
Results
  • MESP approach run on a GTX 280 GPU card in an
    Intel i7 machine (2.6 GHz, 9GB RAM) running Linux.
    MiniSAT run on the same CPU. Runtimes in seconds
  • D = 1% of the number of variables, FSP = FM = 1,
    C = 20, t = 0.01
  • The learned clauses on the GPU were updated at
    every 5th invocation of SP
  • Up to 24K learned clauses
  • None of these instances were solved in MESP by an
    invocation to WalkSAT

74
75
Summary
  • MESP is a GPU enhanced variable ordering
    heuristic for SAT
  • GPU based survey propagation
  • π's for all variables and η's for all clauses
    computed in parallel
  • Check convergence using a reduced integer add
    operation over all literals in all clauses
  • Test whether non-trivial convergence uses a
    reduced float add operation
  • Compute biases for all variables in parallel
  • Parallel bitonic sort to find the largest x% of
    the biases
  • Survey propagation enhances the variable ordering
    in MESP
  • Augment clause database on GPU with 3 sets of
    learned clauses
  • η's for all clauses computed in 4 different
    kernels
  • On average MESP is
  • 64% (92%) faster than MiniSAT on the original
    (3-SAT) instances

75
76
SPICE Model Evaluation on a GPU
  • SPICE is the de facto industry standard for VLSI
    circuit simulations
  • Significant motivation for accelerating SPICE
    simulations without losing accuracy
  • Increasing complexity and size of VLSI circuits
  • Increasing impact of process variations on the
    electrical behavior of circuits
  • Require Monte Carlo based simulations
  • Accelerate the computationally expensive portion
    of SPICE transistor model evaluation on a GPU
  • Proposed approach is integrated into a commercial
    SPICE accelerator tool OmegaSIM
  • Already 10-1000x faster than traditional SPICE
    implementations
  • With the proposed approach integrated, OmegaSIM
    achieves a further speedup of 2.36X (3.07X) on
    average (max)

77
Approach
  • Profiled SPICE simulations over several
    benchmarks
  • 75% of time spent in BSIM3 device model
    evaluations
  • Billions of calls to device model evaluation
    routines
  • Every device in the circuit is evaluated for
    every time step
  • Possibly repeatedly until the Newton Raphson loop
    for solving non-linear equations converges
  • Asymptotic speedup of 4X considering Amdahl's
    law.
  • These calls are parallelizable
  • Since they are independent of each other
  • Each call performs identical computations on
    different data
  • Conform to the GPUs SIMD operating paradigm

78
Approach
  • CDFG-guided manual partitioning of BSIM3
    evaluation code
  • Limitation on the available hardware resources
  • Registers (8192 per multiprocessor)
  • Shared Memory (16KB per multiprocessor)
  • Bandwidth to global memory (max. sustainable is
    80 GB/s)
  • If entire BSIM3 model is implemented as a single
    kernel
  • Number of threads that can be issued in parallel
    are not enough
  • To hide global memory access latency
  • If BSIM3 code is partitioned into many (small)
    kernels
  • Requires large amounts of data transfer across
    kernels
  • Done using global memory (not cached)
  • Negatively impacts performance
  • Proposed approach
  • Creates CDFG of the BSIM3 equations
  • Uses maximally disconnected components of this
    graph as different kernels, considering the above
    hardware limitations

79
Approach
  • Take GPU memory constraints into account
  • Global Memory
  • Used to store intermediate data which is
    generated by one kernel and needed by another
    (instead of transferring this data to host)
  • Texture Memory
  • Used for storing runtime parameters
  • Device parameters that remain unchanged
    throughout the simulation
  • Advantages
  • It is cached, unlike global memory
  • No coalescing requirements, unlike global memory
  • No bank conflicts, such as possible in shared
    memory
  • CUDA's efficient built-in texture fetching
    routines are used
  • Small texture memory loading overhead is easily
    amortized
  • Constant Memory used for storing physical
    constants
  • Most efficient when all threads access the same
    data
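A hedged sketch of the memory placement described above (all names are illustrative; this uses the legacy texture-reference API that matches the G80/GT200-era CUDA discussed here): device parameters are copied to global memory once and bound to a 1D texture so kernel reads go through the texture cache, while physical constants sit in constant memory.

// Hedged sketch: runtime parameters in texture memory, constants in constant memory.
// Illustrative names; legacy texture-reference API (CUDA 2.x era).
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texParams;  // cached reads, no coalescing rules
__constant__ float physConst[4];                       // physical constants

__global__ void evalModel(float *out, int n)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    out[i] = tex1Dfetch(texParams, i) * physConst[0];  // cached parameter fetch
}

void setup(const float *hostParams, int n)
{
  float *dParams;
  cudaMalloc(&dParams, n * sizeof(float));
  cudaMemcpy(dParams, hostParams, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaBindTexture(NULL, texParams, dParams, n * sizeof(float));  // one-time binding

  float h[4] = { 1.0f, 0.0f, 0.0f, 0.0f };               // illustrative constant values
  cudaMemcpyToSymbol(physConst, h, sizeof(h));           // load constant memory once
}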

80
Experiments
  • Proposed approach is implemented and integrated
    into a commercial SPICE accelerator tool
    OmegaSIM
  • Hardware used
  • CPU Intel Core 2 Quad, 2.4 GHz, 4GB RAM
  • GPU GeForce 8800 GTS, 128 Processors, 675 MHz,
    512 MB RAM
  • Comparing BSIM3 model evaluation alone

81
Experiments - Complete SPICE Sim
  • With increase in number of transistors, speedup
    obtained is higher
  • More device evaluation calls made in parallel,
    latencies are better hidden
  • High accuracy with single precision floating
    point implementation
  • Over 1M device evals., avg. (max.) error of 2.88
    X 10^-26 (9.0 X 10^-22) Amp.
  • Newer devices with double precision capability
    already in market

82
Conclusions
  • Significant interest in accelerating SPICE
  • 75% of the SPICE runtime spent in BSIM3 model
    evaluation allows an asymptotic speedup of 4X
  • Our approach of accelerating model evaluation
    using GPUs has been integrated with a commercial
    fast SPICE tool
  • Obtained speedup of 2.36 X on average
  • BSIM3 model evaluation can be sped up by 30-40X
    over 1M-2M calls
  • Take GPU memory constraints into account
  • Global Memory used to store intermediate data
  • Texture Memory used for storing runtime
    parameters
  • Constant Memory used for storing physical
    constants
  • Carefully partition kernels since
  • If entire BSIM3 model is implemented as a single
    kernel
  • Number of threads that can be issued in parallel
    are not enough to hide global memory access
    latency
  • If BSIM3 code is partitioned into many (small)
    kernels
  • Requires large amounts of data transfer across
    kernels done using global memory

83
Introduction Fault Simulation
  • Fault Simulation (FS) is crucial in the VLSI
    design flow
  • Given a digital design and a set of vectors V, FS
    evaluates the number of stuck-at faults (Fsim)
    tested by applying V
  • The ratio of Fsim/Ftotal is a measure of fault
    coverage
  • Current designs have millions of logic gates
  • The number of faulty variations are proportional
    to design size
  • Each of these variations needs to be simulated
    for the V vectors
  • Therefore, it is important to explore ways to
    accelerate FS
  • The ideal FS approach should be
  • Fast
  • Scalable
  • Cost effective

83
84
Approach
  • Implement a look up table (LUT) based FS
  • All gates' LUTs stored in texture memory (cached)
  • LUTs of all library gates fit in texture cache
  • To avoid cache misses during lookup
  • Individual k-input gate LUT requires 2^k entries
  • Each gate's LUT entries are located at a fixed
    offset in the texture memory as shown above
  • Gate output is obtained by
  • accessing the memory at (gate offset + input
    value)
  • Example: output of AND2 gate when inputs are 1
    and 0

[Figure: the 4-entry AND2 LUT (indices 0-3); inputs 1 and 0 form index 2,
which holds output 0]
84
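A hedged sketch of the lookup step above (the array names and the packing of the two input values into an index are assumptions consistent with the 2^k-entry layout): each thread forms an index from its gate's LUT offset and its input values and reads the output through the cached texture.

// Hedged sketch of LUT-based gate evaluation (illustrative names and packing).
#include <cuda_runtime.h>

texture<int, 1, cudaReadModeElementType> texLUT;   // all gate LUTs, read through the cache

__global__ void evalGates(const int *gateOffset, const int *inA, const int *inB,
                          int *out, int nGates)
{
  int g = blockIdx.x * blockDim.x + threadIdx.x;
  if (g < nGates) {
    int index = gateOffset[g] + (inA[g] << 1) + inB[g];  // gate offset + packed input value
    out[g] = tex1Dfetch(texLUT, index);                  // cached LUT read gives the output
  }
}
// e.g. an AND2 LUT occupies 4 consecutive entries {0, 0, 0, 1} starting at its offset.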
85
Approach
  • Evaluate two vectors for the same gate in a
    single thread
  • 1/2/3/4 input gates require 4/16/64/256 entries
    in LUT respectively
  • Our library consists of an INV and 2/3/4 input
    AND, NAND, NOR and OR gates.
  • Hence total memory required for all LUTs is 1348
    words
  • This fits in the texture memory cache (8KB per
    MP)
  • Exploit both fault and pattern parallelism
  • Fault Parallel
  • All gates at a fixed topological level are
    evaluated in parallel
  • Pattern Parallel
  • Simulations for any gate, for different patterns,
    are done in parallel

85
86
Approach
[Figure: simulation data arranged per vector 1, 2, ..., N, holding the good
circuit value and the faulty circuit value for each vector]
  • In practice, simulations for any gate, for
    different patterns, are done in 2 phases, for all
    the faults which lie in its TFI only
  • Phase 1 Good circuit simulation. Results
    returned to CPU
  • Phase 2 Faulty circuit simulation. CPU does not
    schedule a stuck-at-v fault in a pattern which
    has v as the good circuit value
  • Fault injection also performed in parallel

86
87
Approach: Fault Injection
Approach: Fault Simulation

typedef struct __align__(16) {
  int offset;      // Gate type's offset
  int a, b, c, d;  // Input values
  int m0, m1;      // Mask variables
} threadData;
87
88
Approach: Fault Detection

typedef struct __align__(16) {
  int offset;                 // Gate type's offset
  int a, b, c, d;             // Input values
  int Good_Circuit_threadID;  // Good circuit simulation thread ID
} threadData_Detect;
88
89
Approach
  • We maximize GPU performance by ensuring that
  • No data dependency exists between threads issued
    in parallel
  • The same instructions, on different data are
    executed by all threads
  • We adapt to specific G80 memory constraints
  • LUT stored in texture memory. Key advantages are
  • Texture memory is cached
  • Total LUT size easily fits into available cache
    size of 8KB/MP
  • No memory coalescing requirements
  • Efficient built-in texture fetching routines
    available in CUDA
  • Non-zero time taken to load texture memory, but
    cost easily amortized
  • Global memory writes for level i gates (and reads
    for level i+1 gates) are performed in a coalesced
    fashion

89
90
Results
  • FS on