General Purpose Computing using GPU (GPGPU) - PowerPoint PPT Presentation
Provided by: sundar (17 slides)

Transcript and Presenter's Notes
1
General Purpose Computing using GPU (GPGPU)
2
GeForce 6 Series Architecture
Courtesy GPGPU - Sam Adams
3
GeForce 8800 GPU
  • Built around programmable units
  • Unified Shader

4
GPU Architecture Flow
[Figure: pipeline flow of the 8 series compared with the 6 series]
5
CUDA C with no shader limitations!
  • Integrated host + device application C program
  • Serial or modestly parallel parts in host C code
  • Highly parallel parts in device kernel C code

Serial Code (host)
Parallel Kernel (device): KernelA<<<nBlk, nTid>>>(args)
Serial Code (host)
Parallel Kernel (device): KernelB<<<nBlk, nTid>>>(args)
6
(No Transcript)
7
Block IDs and Thread IDs
  • Each thread uses IDs to decide what data to work on
  • Block ID: 1D or 2D
  • Thread ID: 1D, 2D, or 3D
  • Simplifies memory addressing when processing
    multidimensional data

8
CUDA Memory Model Overview
  • Global memory
    • Main means of communicating R/W data between host and device
    • Contents visible to all threads
    • Long-latency access
  • Constant memory
  • Shared memory
  • Local registers

[Figure: CUDA memory model. A grid contains blocks (0,0), (1,0), ...;
each block has its own shared memory, each thread its own registers;
the host and all threads read and write global memory.]
Courtesy Nvidia Corp.
9
CUDA Example: Matrix Multiplication - A Simple Host Version in C

// Matrix multiplication on the (CPU) host
void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int i = 0; i < Width; ++i)
    for (int j = 0; j < Width; ++j) {
      double sum = 0;
      for (int k = 0; k < Width; ++k) {
        double a = M[i * Width + k];
        double b = N[k * Width + j];
        sum += a * b;
      }
      P[i * Width + j] = sum;
    }
}

[Figure: row i of M and column j of N combine into element (i, j)
of P; all matrices are WIDTH x WIDTH.]
Courtesy David Kirk
10
CUDA Example: Matrix Multiplication - The GPU Version

void MatrixMulOnDevice(float* M, float* N, float* P, int Width)
{
  int size = Width * Width * sizeof(float);
  float *Md, *Nd, *Pd;

  // Allocate and load M, N into device memory
  cudaMalloc((void**)&Md, size);
  cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
  cudaMalloc((void**)&Nd, size);
  cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);

  // Allocate P on the device
  cudaMalloc((void**)&Pd, size);

  dim3 dimGrid(1, 1);
  dim3 dimBlock(Width, Width);

  // Launch the device computation threads!
  MatrixMulKernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);

  // Read P from the device
  cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);

  // Free device matrices
  cudaFree(Md); cudaFree(Nd); cudaFree(Pd);
}
11
CUDA Example: Matrix Multiplication - The GPU Version
(Same code as the previous slide; this slide highlights the
device-memory allocation step: the cudaMalloc calls for Md, Nd, and Pd.)
12
CUDA Example: Matrix Multiplication - The GPU Version
(Same code; this slide highlights the kernel launch that performs
the matrix multiplication on the device.)
13
CUDA Example: Matrix Multiplication - The GPU Version
(Same code; this slide highlights the cudaFree calls that release
device memory.)
14
CUDA Example: Matrix Multiplication - The GPU Version (continued)

// Matrix multiplication kernel - per-thread code
__global__ void MatrixMulKernel(float* Md, float* Nd, float* Pd, int Width)
{
  // Pval stores the element of the output matrix
  // computed by this thread
  float Pval = 0;
  for (int k = 0; k < Width; ++k) {
    float Melement = Md[threadIdx.y * Width + k];
    float Nelement = Nd[k * Width + threadIdx.x];
    Pval += Melement * Nelement;
  }
  Pd[threadIdx.y * Width + threadIdx.x] = Pval;
}

[Figure: thread (tx, ty) reads row ty of Md and column tx of Nd to
compute element (ty, tx) of Pd; all matrices are WIDTH x WIDTH.]
Courtesy David Kirk
15
Problem description
  • Solve delay differential equations that describe
    the dynamics of genetic processes
  • Solve multiple equations in parallel
  • Optimize for memory access

16
Bibliography
  • "A Performance Study of General-Purpose Applications on
    Graphics Processors Using CUDA", Shuai Che, Michael Boyer,
    David Tarjan, Kevin Skadron
  • CUDA textbook, David Kirk (Nvidia) and Prof. Wen-mei Hwu (UIUC)
  • GPU Gems 2, Nvidia Corporation