Title: Basic CUDA Programming
1 Basic CUDA Programming
- Shin-Kai Chen
- skchen_at_twins.ee.nctu.edu.tw
- VLSI Signal Processing Laboratory
- Department of Electronics Engineering
- National Chiao Tung University
2 What will you learn in this lab?
- Concept of multicore accelerator
- Multithreaded/multicore programming
- Memory optimization
3 Slides
- Mostly from Prof. Wen-Mei Hwu of UIUC
- http://courses.ece.uiuc.edu/ece498/al/Syllabus.html
4 CUDA Hardware? Software?
5 Host-Device Architecture
CPU (host)
GPU w/ local DRAM (device)
6 G80 CUDA mode: A Device Example
7 Functional Units in G80
- Streaming Multiprocessor (SM)
- 1 instruction decoder (1 instruction / 4 cycles)
- 8 streaming processors (SP)
- Shared memory
(Figure: thread blocks being distributed to SM 0 and SM 1)
8 Setup CUDA for Windows
9 CUDA Environment Setup
- Get a GPU that supports CUDA
- http://www.nvidia.com/object/cuda_learn_products.html
- Download CUDA
- http://www.nvidia.com/object/cuda_get.html
- CUDA driver
- CUDA toolkit
- CUDA SDK (optional)
- Install CUDA
- Test CUDA
- Device Query
10 Setup CUDA for Visual Studio
- From scratch
- http://forums.nvidia.com/index.php?showtopic=30273
- CUDA VS Wizard
- http://sourceforge.net/projects/cudavswizard/
- Modified from existing project
11 Lab1: First CUDA Program
12 CUDA Computing Model
13 Data Manipulation between Host and Device
- cudaError_t cudaMalloc( void** devPtr, size_t count )
- Allocates count bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory
- cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
- Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
- kind indicates the type of memory transfer
- cudaMemcpyHostToHost
- cudaMemcpyHostToDevice
- cudaMemcpyDeviceToHost
- cudaMemcpyDeviceToDevice
- cudaError_t cudaFree( void* devPtr )
- Frees the memory space pointed to by devPtr
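As a quick illustration of these calls, here is a minimal host-side sketch of the allocate / copy / compute / copy-back / free flow; the array size, variable names, and the omitted kernel launch are illustrative only:

    #include <cuda_runtime.h>
    #define SIZE 8192

    int main(void)
    {
        int A[SIZE], B[SIZE];                 /* host arrays (A would be filled with input) */
        int *d_A, *d_B;                       /* device pointers                            */
        size_t bytes = SIZE * sizeof(int);

        cudaMalloc((void**)&d_A, bytes);      /* allocate device memory                     */
        cudaMalloc((void**)&d_B, bytes);

        cudaMemcpy(d_A, A, bytes, cudaMemcpyHostToDevice);   /* copy input to the device    */
        /* ... launch the GPU kernel here ... */
        cudaMemcpy(B, d_B, bytes, cudaMemcpyDeviceToHost);   /* copy the result back        */

        cudaFree(d_A);                        /* release device memory                      */
        cudaFree(d_B);
        return 0;
    }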
14 Example
- Functionality
- Given an integer array A holding 8192 elements
- For each element in array A, calculate A[i] % 256 and leave the result in B[i]
15 Now, go and finish your first CUDA program!!!
16
- Download http://twins.ee.nctu.edu.tw/skchen/lab1.zip
- Open the project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
- main.cu
- Random input generation, output validation, result reporting
- device.cu
- Launch GPU kernel, GPU kernel code
- parameter.h
- Fill in appropriate APIs
- GPU_kernel() in device.cu (see the sketch below)
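A minimal sketch of the kernel to fill in, assuming a single thread walks the whole array and that the per-element operation is a modulo by 256; the actual signature and constants come from the lab project (parameter.h), so treat these names as placeholders:

    #define SIZE 8192                     /* assumed here; normally taken from parameter.h */

    __global__ void GPU_kernel(int *A, int *B)
    {
        for (int i = 0; i < SIZE; i++)    /* one thread processes every element            */
            B[i] = A[i] % 256;            /* assumed per-element operation                 */
    }

    /* Host-side launch with one block and one thread:                                     */
    /* GPU_kernel<<<1, 1>>>(d_A, d_B);                                                      */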
17 Lab2: Make the Parallel Code Faster
18 Parallel Processing in CUDA
- Parallel code can be partitioned into blocks and threads
- cuda_kernel<<<nBlk, nTid>>>() (see the launch sketch below)
- Multiple tasks will be initialized, each with a different block id and thread id
- The tasks are dynamically scheduled
- Tasks within the same block will be scheduled on the same streaming multiprocessor
- Each task takes care of a single data partition according to its block id and thread id
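For illustration, a launch with an explicit execution configuration might look like the sketch below; the values of nBlk and nTid and the kernel arguments d_A, d_B are placeholders:

    int nBlk = 2;      /* number of thread blocks                                          */
    int nTid = 256;    /* number of threads per block                                      */

    /* nBlk * nTid tasks are created; each one sees its own blockIdx and threadIdx         */
    cuda_kernel<<<nBlk, nTid>>>(d_A, d_B);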
19 Locate Data Partition by Built-in Variables
- Built-in Variables
- gridDim
- x, y
- blockIdx
- x, y
- blockDim
- x, y, z
- threadIdx
- x, y, z
20 Data Partition for Previous Example
When processing 64 integer data: cuda_kernel<<<2, 2>>>()
int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;
21 Processing Single Data Partition
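The original slide illustrates this with a figure; as a rough sketch, and again assuming the A[i] % 256 operation, each task would process its own contiguous partition like this:

    #define SIZE 8192                                 /* 64 in the slide's example, 8192 in the lab */

    __global__ void cuda_kernel(int *A, int *B)
    {
        int total_task = gridDim.x * blockDim.x;                  /* number of tasks        */
        int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;   /* my task serial number  */
        int length     = SIZE / total_task;                       /* elements per task      */
        int head       = task_sn * length;                        /* start of my partition  */

        for (int i = head; i < head + length; i++)
            B[i] = A[i] % 256;                                    /* assumed operation      */
    }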
22 Parallelize Your Program!!!
23
- Partition the kernel into threads
- Increase nTid from 1 to 512
- Keep nBlk = 1
- Group threads into blocks
- Adjust nBlk and see if it helps
- Maintain the total number of threads at or below 512, e.g. nBlk * nTid <= 512 (example configurations below)
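A few example configurations one might try within the nBlk * nTid <= 512 budget (the argument list is a placeholder):

    cuda_kernel<<<1, 512>>>(d_A, d_B);   /* 1 block,  512 threads per block */
    cuda_kernel<<<4, 128>>>(d_A, d_B);   /* 4 blocks, 128 threads per block */
    cuda_kernel<<<8,  64>>>(d_A, d_B);   /* 8 blocks,  64 threads per block */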
24 Lab3: Resolve Memory Contention
25 Parallel Memory Architecture
- Memory is divided into banks to achieve high bandwidth
- Each bank can service one address per cycle
- Successive 32-bit words are assigned to successive banks
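A small illustration of this mapping; the 16-bank figure is an assumption based on G80-generation shared memory, not something stated on the slide:

    /* With 16 banks and 32-bit words, word i lives in bank i % 16.            */
    /* Accesses that hit different banks in the same cycle proceed in          */
    /* parallel; accesses that hit the same bank are serialized.               */
    int bank_of(int word_index)
    {
        return word_index % 16;    /* assumed 16 banks (G80-era hardware)      */
    }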
26 Lab2 Review
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
27 How about Interleaved Accessing?
When processing 64 integer data: cuda_kernel<<<1, 4>>>()
28 Implementation of Interleaved Accessing
cuda_kernel<<<1, 4>>>()
- head = task_sn
- stripe = total_task
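Putting the two lines above into the kernel, an interleaved version might look like this sketch (again assuming the A[i] % 256 operation): each task starts at its own serial number and strides by the total number of tasks.

    #define SIZE 8192                                 /* 64 in the slide's example, 8192 in the lab */

    __global__ void cuda_kernel(int *A, int *B)
    {
        int total_task = gridDim.x * blockDim.x;
        int task_sn    = blockIdx.x * blockDim.x + threadIdx.x;
        int head       = task_sn;                     /* start at my own serial number      */
        int stripe     = total_task;                  /* step over the other tasks          */

        for (int i = head; i < SIZE; i += stripe)
            B[i] = A[i] % 256;                        /* assumed per-element operation      */
    }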
29 Improve Your Program!!!
30
- Modify the original kernel code in an interleaving manner
- cuda_kernel() in device.cu
- Adjust nBlk and nTid as in Lab2 and examine the effect
- Maintain the total number of threads at or below 512, e.g. nBlk * nTid <= 512
31 Thank You
- http://twins.ee.nctu.edu.tw/skchen/lab3.zip
- Final project issue
- Group issue