Title: Compilers, Parallel Computing, and Grid Computing
OpenCL

These notes will introduce OpenCL.

ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 28, 2013, OpenCL.ppt
OpenCL (Open Computing Language)

A standard based on C for portable parallel applications. Focuses on multi-platform support (multiple CPUs, GPUs, ...), and on task-parallel as well as data-parallel applications. Very similar to CUDA, but a little more complicated because it handles heterogeneous platforms. Initiated by Apple; developed by the Khronos Group, which also manages OpenGL. Now adopted by Intel, AMD, NVIDIA, and others. OpenCL 1.0 appeared in 2008 and was released with Mac OS X 10.6 (Snow Leopard). The most recent version is OpenCL 1.2 (Nov 2011). Implementations are available for NVIDIA GPUs.

http://www.khronos.org/opencl/
Wikipedia OpenCL: http://en.wikipedia.org/wiki/OpenCL
OpenCL Programming Model

Uses a data-parallel programming model, similar to CUDA. The host program launches kernel routines as in CUDA, but allows for just-in-time compilation during host execution.

OpenCL work items correspond to CUDA threads.
OpenCL work groups correspond to CUDA thread blocks.
An OpenCL NDRange corresponds to a CUDA grid.

Work items in the same work group can be synchronized with a barrier, as in CUDA.
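As a minimal sketch of this mapping (not from the original slides; the kernel name scale and its arguments are made up for illustration), an OpenCL kernel obtains its indices from work-item functions where CUDA uses built-in variables:

// Hypothetical OpenCL kernel illustrating the CUDA correspondence:
//   get_global_id(0)   ~ blockIdx.x*blockDim.x + threadIdx.x
//   get_local_id(0)    ~ threadIdx.x
//   get_group_id(0)    ~ blockIdx.x
//   get_local_size(0)  ~ blockDim.x
__kernel void scale(__global float *x, float a, const unsigned int n)
{
    unsigned int i = get_global_id(0);   // this work item's global index
    if (i < n)
        x[i] = a * x[i];
    // barrier(CLK_LOCAL_MEM_FENCE);     // work-group barrier, like __syncthreads() in CUDA
}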
Sample OpenCL code to add two vectors

To illustrate OpenCL commands: add two vectors, A and B, to produce C. A and B are transferred to the device (GPU); the result, C, is returned to the host (CPU). Similar to the CUDA vector addition example.
Structure of OpenCL main program
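The original slide gives the structure as a diagram; the sketch below summarizes it as a host-program skeleton assembled from the steps covered on the following slides (step 5 does not appear in these notes; input data is copied to the device either at buffer creation or with clEnqueueWriteBuffer):

int main()
{
    // 1.  Identify a platform               clGetPlatformIDs()
    // 2.  Create a context                  clCreateContextFromType()
    // 3.  Create a command queue            clCreateCommandQueue()
    // 4.  Allocate memory on the device     clCreateBuffer()
    //     (input data copied at creation with CL_MEM_COPY_HOST_PTR,
    //      or later with clEnqueueWriteBuffer())
    // 6.  Create a kernel program object    clCreateProgramWithSource()
    // 7.  Build the kernel program          clBuildProgram()
    // 8.  Create a kernel object            clCreateKernel()
    // 9.  Set the kernel arguments          clSetKernelArg()
    // 10. Enqueue the kernel for execution  clEnqueueNDRangeKernel()
    // 11. Copy results back to the host     clEnqueueReadBuffer()
    //     Clean up                          clRelease...(), free()
    return 0;
}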
1. Platform

"The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."

Platforms are represented by a cl_platform_id object, initialized with clGetPlatformIDs().

http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201
clGetPlatformIDs: obtain the list of platforms available.

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id *platforms,
                        cl_uint *num_platforms)

Parameters:

num_entries: Number of cl_platform_id entries that can be added to platforms. If platforms is not NULL, num_entries must be greater than zero.

platforms: Returns the list of OpenCL platforms found. The cl_platform_id values returned in platforms can be used to identify a specific OpenCL platform. If the platforms argument is NULL, it is ignored. The number of OpenCL platforms returned is the minimum of the value specified by num_entries and the number of OpenCL platforms available.

num_platforms: Returns the number of OpenCL platforms available. If num_platforms is NULL, it is ignored.

http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
Simple code for identifying the platform

// Platform
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);

Arguments: 1 is the number of platform entries requested; &platform receives the list of OpenCL platforms found (platform IDs), in our case just one platform, identified by platform; the last argument would return the number of OpenCL platforms available and is ignored because it is NULL.
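The code above hard-codes a single platform. A common variant (not in the slides, but using only the parameters described on the previous slide) first queries how many platforms exist and then fetches them all:

// Query the number of available platforms, then retrieve all of them
cl_uint numPlatforms;
clGetPlatformIDs(0, NULL, &numPlatforms);        // platforms == NULL: just count them
cl_platform_id *platforms =
    (cl_platform_id*)malloc(numPlatforms * sizeof(cl_platform_id));
clGetPlatformIDs(numPlatforms, platforms, NULL); // fill the list of platform IDs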
2. Context

"The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects."

The OpenCL Specification version 1.1
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Code for context

// Context
cl_context_properties props[3];
props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
props[1] = (cl_context_properties) platform;
props[2] = (cl_context_properties) 0;

cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// Context info
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);
3. Command Queue

"An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order. ..."

The OpenCL Specification version 1.1
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
Simple code for creating a command queue

// Create command-queue
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);
4. Allocating memory on the device

Use clCreateBuffer():

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void *host_ptr,
                      cl_int *errcode_ret)

context: the OpenCL context, from clCreateContextFromType().
flags: bit field to specify the type of allocation/usage (CL_MEM_READ_WRITE, ...).
size: number of bytes in the buffer memory object.
host_ptr: pointer to the buffer data (may be previously allocated).
errcode_ret: returns an error code if an error occurs.

Returns a memory object.
Sample code for allocating memory on device for source data

// Source data on host, two vectors
int *A, *B;
A = new int[N];
B = new int[N];
for (int i = 0; i < N; i++) {
    A[i] = rand() % 1000;
    B[i] = rand() % 1000;
}

// Allocate GPU memory for source vectors
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(int)*N, A, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY |
        CL_MEM_COPY_HOST_PTR, sizeof(int)*N, B, NULL);
Sample code for allocating memory on device for results on GPU

// Allocate GPU memory for output vector
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int)*N, NULL, NULL);
6. Kernel Program

Simple programs might keep the kernel in the same file as the host code (as in our CUDA examples). In that case the kernel needs to be formed into strings in a character array. If the kernel is in a separate file, that file can be read into the host program as a character string.
If in the same program as the host, the kernel needs to be strings (I think it can be a single string)

// Kernel program
const char* OpenCLSource[] = {
    "__kernel void vectorAdd (const __global int* a,",
    "                         const __global int* b,",
    "                         __global int* c)",
    "{",
    "    unsigned int gid = get_global_id(0);",
    "    c[gid] = a[gid] + b[gid];",
    "}"
};

int main(int argc, char **argv)
...

__kernel: OpenCL qualifier to indicate kernel code.
__global: OpenCL qualifier to indicate kernel memory (memory objects allocated from the global memory pool).
get_global_id(0): returns the global work-item ID in the given dimension (0 here).
The double underscores are optional in OpenCL qualifiers.
Kernel in a separate file

// Load the kernel source code into the array source_str
FILE *fp;
char *source_str;
size_t source_size;

fp = fopen("vector_add_kernel.cl", "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char*)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);

http://mywiki-science.wikispaces.com/OpenCL
Create kernel program object

const char* OpenCLSource[] = { ... };

int main(int argc, char **argv)
{
    ...
    // Create OpenCL program object
    cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);

This example uses a single file for both host and kernel code. clCreateProgramWithSource() can also be used with a separate kernel file read into the host program.

Arguments: 7 is the number of strings in the kernel program array; the first NULL would give the lengths of the strings if they were not null-terminated; the last NULL is used to return an error code if there is an error.
7. Build kernel program

// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);

Arguments, in order: the program object from clCreateProgramWithSource(); the number of devices; the list of devices (if more than one); build options; a function pointer to a notification routine called when the build is complete (if given, clBuildProgram returns immediately, otherwise it returns only when the build is complete); and the arguments for the notification routine.
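Not shown on the slide: if the JIT compilation fails, the compiler messages can be retrieved with the standard call clGetProgramBuildInfo(). A sketch, reusing OpenCLProgram and GPUDevices from the earlier code (assumes <stdio.h> and <stdlib.h> are included):

// Check the build result and print the build log on failure
if (clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t logSize;
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);            // query log size
    char *log = (char*)malloc(logSize);
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG,
                          logSize, log, NULL);           // fetch the log text
    fprintf(stderr, "Kernel build failed:\n%s\n", log);
    free(log);
}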
8. Creating Kernel Objects

// Create a handle to the compiled OpenCL function
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);

Arguments: the built program from clBuildProgram(); the function name given with the __kernel qualifier; and a pointer used to return an error code (NULL here).
9. Set Kernel Arguments

// Set kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);

Arguments: the kernel object from clCreateKernel(); which argument (its index); the size of the argument; and a pointer to the data for the argument, here the memory objects from clCreateBuffer().
10. Enqueue command to execute kernel on device

// Launch the kernel
size_t WorkSize[1] = {N};          // Total number of work items
size_t localWorkSize[1] = {256};   // Number of work items in a work group

clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL,
                       WorkSize, localWorkSize, 0, NULL, NULL);

Arguments, in order: the command queue; the kernel object from clCreateKernel(); the number of dimensions of the work items (1); the offset used with the work items (NULL); an array giving the number of global work items (WorkSize); an array giving the number of work items that make up a work group (localWorkSize); the number of events to complete before this command (0); the event wait list (NULL); and the event (NULL).
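One caution (not on the slide): in OpenCL 1.x the global work size must be an exact multiple of the work-group size, so WorkSize[0] = N only works if N is a multiple of 256. A common fix, used in the second example later in these notes, is to round the global size up and have the kernel check its index against N:

// Round the global work size up to a multiple of the work-group size;
// the kernel must then test its global ID against N before writing.
size_t localWorkSize[1] = {256};
size_t WorkSize[1] = { ((N + localWorkSize[0] - 1) / localWorkSize[0]) * localWorkSize[0] };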
Function to copy from host memory to buffer object

The following function enqueues a command to write to a buffer object from host memory:

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t cb,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)

The OpenCL Specification version 1.1
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
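For reference, a sketch of how this call could be used with the buffers of this example; the sample code earlier instead copies A and B when the buffers are created (CL_MEM_COPY_HOST_PTR), so this is an alternative, not part of the original program:

// Create an empty device buffer, then enqueue a blocking write of N ints from host array A
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY,
                                   sizeof(int)*N, NULL, NULL);
clEnqueueWriteBuffer(GPUCommandQueue, GPUVector1, CL_TRUE, 0,
                     sizeof(int)*N, A, 0, NULL, NULL);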
Function to copy from buffer object to host memory

The following function enqueues a command to read from a buffer object to host memory:

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t cb,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)

The OpenCL Specification version 1.1
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
11. Copy data back from the kernel

// Copy the output back to CPU memory
int *C;
C = new int[N];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0,
                    N*sizeof(int), C, 0, NULL, NULL);

Arguments, in order: the command queue from clCreateCommandQueue(); the device buffer from clCreateBuffer(); CL_TRUE (the read is blocking); the byte offset in the buffer (0); the size of the data to read in bytes (N*sizeof(int)); the pointer to the buffer in the host to write the data (C); the number of events to complete before this command (0); the event wait list (NULL); and the event (NULL).
Results from GPU

// Print the first and last elements; C[] here is the result vector returned from the GPU
cout << "C[" << 0   << "]=" << A[0]   << "+" << B[0]   << "=" << C[0]   << "\n";
cout << "C[" << N-1 << "]=" << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";
Clean-up

// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);
Compiling

Need the OpenCL header: #include <CL/cl.h> (for Mac: #include <OpenCL/opencl.h>), and link to the OpenCL library. Compile an OpenCL host program main.c using gcc in two phases:

gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
gcc -L /path-to-lib-folder-with-OpenCL-libfile/ main.o -o host -l OpenCL

Ref: http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/
Makefile (program called scalarmulocl)

CC = g++
LD = g++ -lm
CFLAGS = -Wall -shared
CDEBUG =
LIBOCL = -L/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/lib
INCOCL = -I/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc
SRCS = scalarmulocl.cpp
OBJS = scalarmulocl.o
EXE = scalarmulocl.a

all: $(EXE)

$(OBJS): $(SRCS)
	$(CC) $(CFLAGS) $(INCOCL) -I/usr/include -c $(SRCS)

$(EXE): $(OBJS)
	$(LD) -L/usr/local/lib $(OBJS) $(LIBOCL) -o $(EXE) -l OpenCL

clean:
	rm -f $(OBJS)
	clear

References: http://mywiki-science.wikispaces.com/OpenCL
Submitted by Manisha Mishra
Includes

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>    // OpenCL header for C
#include <iostream>   // C++ input/output
using namespace std;
Another OpenCL program to add two vectors
http://www.olcf.ornl.gov/tutorials/opencl-vector-addition/

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// Enable double precision values
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

// OpenCL kernel, held as a string. Each work item takes care of one element of c
const char *kernelSource =                                   "\n" \
"__kernel void vecAdd(  __global double *a,                   \n" \
"                       __global double *b,                   \n" \
"                       __global double *c,                   \n" \
"                       const unsigned int n)                 \n" \
"{                                                            \n" \
"    //Get our global thread ID                               \n" \
"    int id = get_global_id(0);                               \n" \
"                                                             \n" \
"    //Make sure we do not go out of bounds                   \n" \
"    if (id < n)                                              \n" \
"        c[id] = a[id] + b[id];                               \n" \
"}                                                            \n" \
                                                              "\n";
int main( int argc, char* argv[] )
{
    // Length of vectors
    unsigned int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;    // OpenCL platform
    cl_device_id device_id;       // device ID
    cl_context context;           // context
    cl_command_queue queue;       // command queue
    cl_program program;           // program
    cl_kernel kernel;             // kernel

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ )
    {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;

    // Number of work items in each local work group
    localSize = 64;

    // Number of total work items - localSize must be divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);

    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);
    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                        (const char **) &kernelSource, NULL, &err);

    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);

    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // Write our data set into the input array in device memory
    err  = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                                bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                bytes, h_b, 0, NULL, NULL);

    // Set the arguments to our compute kernel
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

    // Execute the kernel over the entire range of the data set
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize,
                                 0, NULL, NULL);

    // Wait for the command queue to get serviced before reading back results
    clFinish(queue);

    // Read the results from the device
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                        bytes, h_c, 0, NULL, NULL );

    // Sum up vector c and print result divided by n, this should equal 1 within error
    double sum = 0;
    for(i=0; i<n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    // release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
Questions
More Information

Chapter 11 of Programming Massively Parallel Processors by D. B. Kirk and W-M. W. Hwu, Morgan Kaufmann, 2010.