Title: ME964 High Performance Computing for Engineering Applications
1 ME964 High Performance Computing for Engineering Applications
- Gauging Kernel Performance. Control Flow in CUDA
- Oct. 9, 2008
2 Before we get started
- Last Time
  - Guest lecturer Michael Garland, Researcher at NVIDIA: Writing Efficient CUDA Algorithms
- Today
  - Gauging the extent to which you use hardware resources in CUDA
  - Control Flow in CUDA
- Homework related
  - HW6 available for download (exclusive scan operation)
  - HW5, 2D matrix convolution, due at 11:59 PM today
2
3 Exercise: Does Matrix Multiplication Incur Shared Memory Bank Conflicts?
Scenario A. The tile matrix is computed as follows: one half-warp computes one row of the tile at a time.
(Figure: 16x16 tile, with the shared memory bank indices 0 through 15 labeled along each row.)
In Scenario A, all threads in a half-warp access the same shared memory entry, leading to a broadcast. Below, the highlighted entries correspond to the second step of computing the 5th row of the tile.
(Figure: the tile, with the entries accessed at this step highlighted.)
Also in Scenario A, the threads in a half-warp access elements in neighboring banks as they walk through the computation (see the code fragment at the end of this slide).
Scenario B. The tile matrix is computed as follows: one half-warp computes one column of the tile at a time (work this case out yourself).
(Figure: 16x16 tile, banks 0 through 15 labeled along each row, for the Scenario B access pattern.)
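To make the two scenarios concrete, here is a hypothetical fragment of the inner-product loop of a tiled matrix multiply. The names As, Bs, TILE_WIDTH, and the kernel itself are illustration-only assumptions, not taken from the homework code:

#define TILE_WIDTH 16    // assumed tile size (16x16, one half-warp per row)

__global__ void tileInnerProduct(float* C)
{
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x;   // consecutive within a half-warp
    int ty = threadIdx.y;

    // ... tiles assumed already loaded into As and Bs ...

    float Csub = 0.0f;
    for (int k = 0; k < TILE_WIDTH; ++k) {
        // Scenario A (a half-warp computes one row of the tile; threads differ in tx):
        //   As[ty][k] is a single address for the whole half-warp -> broadcast
        //   Bs[k][tx] touches 16 consecutive banks -> one bank per thread
        Csub += As[ty][k] * Bs[k][tx];
    }

    C[ty * TILE_WIDTH + tx] = Csub;   // write-back (global indexing simplified)
}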
4 Final Comments, Memory Access
- Given the GPU memory spaces and their latency, a typical programming pattern emerges at the thread level (see the sketch below):
  - Load data from device memory into shared memory (coalesced if possible)
  - Synchronize with all the other threads of the block to avoid data access hazards
  - Process the data that you just brought over in shared memory
  - Synchronize as needed
  - Write the results back to global memory (coalesced if possible)
- NOTE: for CUDA computing, always try hard to make your computation fit this model
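A minimal sketch of this load / sync / process / sync / write pattern; the kernel name, BLOCK_SIZE, the element-reversal "processing" step, and the assumption that the array length is a multiple of the block size are all made up for illustration:

#define BLOCK_SIZE 256   // assumed number of threads per block

__global__ void reverseTiles(const float* in, float* out)
{
    // assumes the array length is a multiple of BLOCK_SIZE, for simplicity
    __shared__ float tile[BLOCK_SIZE];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load from device memory into shared memory (coalesced)
    tile[threadIdx.x] = in[gid];

    // 2. Synchronize with the other threads of the block
    __syncthreads();

    // 3. Process the data just brought into shared memory (here: reverse each tile)
    float val = tile[blockDim.x - 1 - threadIdx.x];

    // 4. Synchronize as needed before shared memory is reused (not needed here)

    // 5. Write the results back to global memory (coalesced)
    out[gid] = val;
}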
4
5 CUDA Programming: Common Sense Advice
- Keep this in mind:
  - Allocating memory on the device or the host is expensive
  - Moving data back and forth between the host and the device is a killer
  - Global memory accesses are going to be slow
    - If they are not coalesced they are even slower
  - Make sure that you keep the SM
    - Occupied (currently, 24 warps can be managed concurrently)
    - Busy (avoid data starvation, have it crunch numbers)
  - If you can, avoid bank conflicts. Not that big of a deal, though.
5
6 Gauging the level of HW use
- In order to gauge how well your code uses the HW, you need to use the CUDA occupancy calculator (google it):
http://developer.download.nvidia.com/compute/cuda/CUDA_Occupancy_calculator.xls
6
7 Gauging the level of HW use (cont.)
- Three things are asked of you:
  - Number of threads per block (this is trivial to provide)
  - Number of registers per thread
  - Number of bytes of shared memory used by each block
- The last two quantities you get by adding the --ptxas-options -v flag to the compile command line:
$(CUDA_BIN_PATH)\nvcc.exe -cuda --ptxas-options -v -I"$(CUDA_INC_PATH)" -I./ -I../../common/inc -I"$(VCInstallDir)\include" -I"$(VCInstallDir)\PlatformSDK\include" -o $(ConfigurationName)\matrixmul.gen.c matrixmul.cu
- In Visual Studio, right-click the main .cu file, go to Properties, and edit the Custom Build Step by adding --ptxas-options -v
7
8 Gauging the level of HW use (cont.)
- Open the generated object file in a text editor; in a pile of stuff that doesn't make any sense you'll find a chunk of text that looks like this:
code {
    name = _Z15MatrixMulKernel6MatrixS_S_
    lmem = 0
    smem = 2112
    reg = 14
    bar = 0
    bincode {
        0x3004c815 0xe43007c0 0x10008025 0x00000003
        0xd0800205 0x00400780 0xa000000d 0x04000780
        0xa0000205 0x04000780 0x10056003 0x00000100
        0x30040601 0xc4100780 0x20000001 0x04004780
    }
}
- This is telling you that MatrixMulKernel (which is the name I gave my kernel) uses 2112 bytes of shared memory and 14 registers per thread, and that there is no use of local memory (lmem = 0)
8
9 Alternatively, in Developer Studio
This is what you are interested in: smem = 2672 bytes, registers = 9
9
10 End Discussion on Memory Spaces (Access and Latency Issues). Begin Control Flow
10
11 Objective
- Understand the implications of control flow on
  - Branch divergence overhead
  - SM execution resource utilization
- Learn better ways to write code with control flow
- Understand compiler/HW predication
  - An idea meant to reduce the impact of control flow
  - There is a cost involved with this process
11
HK-UIUC
12 Quick terminology review
- Thread: concurrent code and associated state executed on the CUDA device (in parallel with other threads)
  - The unit of parallelism in CUDA
  - The number of threads used is controlled by the user
- Warp: a group of threads executed physically in parallel on the G80
  - The number of threads in a warp is not controlled by the user
- Block: a group of threads that are executed together and form the unit of resource assignment
  - The number of blocks used is controlled by the user
- Grid: a group of thread blocks that must all complete before the next phase of the program can begin (a minimal launch sketch follows)
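As a reminder of who controls what, here is a minimal, made-up launch configuration; the kernel name and the grid/block sizes are placeholders:

__global__ void scaleKernel(float* d_data)
{
    // one thread per array element; d_data assumed to hold gridDim.x * blockDim.x floats
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    d_data[gid] *= 2.0f;
}

void launchExample(float* d_data)
{
    dim3 dimBlock(256);   // threads per block: picked by the user
    dim3 dimGrid(128);    // blocks in the grid: picked by the user
    // Warps (32 threads each on the G80) are carved out of each block by the HW, not by the user.
    scaleKernel<<<dimGrid, dimBlock>>>(d_data);
}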
12
HK-UIUC
13 How thread blocks are partitioned
- Each thread block is partitioned into warps
  - Thread IDs within a warp are consecutive and increasing
  - Remember: in multidimensional blocks, the x thread index runs first, followed by the y thread index, and finally the z thread index
  - Warp 0 starts with thread ID 0
- Partitioning is always the same
  - Thus you can use this knowledge in control flow
  - However, the exact size of warps may change from release to release
- While you can rely on ordering among threads, DO NOT rely on any ordering among warps
  - Remember, the concept of warp is not something you control through CUDA
- If there are any dependencies between threads, you must __syncthreads() to get correct results (see the sketch below)
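A small sketch of how the linearized thread ID, and hence the warp a thread lands in, follows from the x-then-y-then-z ordering; the kernel name and output array are made up, while warpSize is the CUDA built-in variable (32 on the G80):

__global__ void warpIdExample(int* d_warpOfThread)
{
    // x runs fastest, then y, then z: this is the linearization used when
    // the block is carved into warps
    int linearTid = threadIdx.x
                  + threadIdx.y * blockDim.x
                  + threadIdx.z * blockDim.x * blockDim.y;

    int warpId = linearTid / warpSize;    // warpSize is a built-in variable (32 on the G80)

    d_warpOfThread[linearTid] = warpId;   // threads 0..31 land in warp 0, and so on
}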
13
HK-UIUC
14 Control Flow Instructions
- The main performance concern with branching is divergence
  - Threads within a single warp take different paths
  - Different execution paths are serialized on the G80
  - The control paths taken by the threads in a warp are traversed one at a time until there are none left
- NOTE: don't forget that divergence can manifest itself only at the warp level. You cannot discuss this concept in relation to code executed by threads in different warps
14
HK-UIUC
15 Control Flow Instructions (cont.)
- A common case: avoid divergence when the branch condition is a function of the thread ID (both cases are sketched below)
- Example with divergence:
  - if (threadIdx.x > 2)
  - This creates two different control paths for threads in a block
  - Branch granularity < warp size: threads 0, 1, and 2 follow a different path than the rest of the threads in the first warp
- Example without divergence:
  - if (threadIdx.x / WARP_SIZE > 2)
  - This also creates two different control paths for threads in a block
  - Branch granularity is a whole multiple of warp size: all threads in any given warp follow the same path
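A compilable sketch of the two cases above; WARP_SIZE, the kernel name, and the output buffer are placeholders:

#define WARP_SIZE 32   // warp size on the G80

__global__ void divergenceDemo(int* d_out)
{
    int v;

    // Divergent: inside warp 0, threads 0..2 take the else path while
    // threads 3..31 take the if path, so the two paths are serialized.
    if (threadIdx.x > 2) v = 1; else v = 2;

    // Non-divergent: the condition changes only at whole-warp boundaries,
    // so every thread of a given warp takes the same path.
    if (threadIdx.x / WARP_SIZE > 2) v += 10;

    d_out[threadIdx.x] = v;
}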
15
HK-UIUC
16 Illustration: Parallel Reduction
- Use the parallel reduction algorithm as a vehicle to discuss the issue of control flow
- Given an array of values, "reduce" them in parallel to a single value
  - Examples
    - Sum reduction: sum of all values in the array
    - Max reduction: maximum of all values in the array
- Typical parallel implementation:
  - Recursively halve the number of threads, add two values per thread
  - Takes log(n) steps for n elements, requires n/2 threads
16
HK-UIUC
17 A Vector Reduction Example
- Assume an in-place reduction using shared memory
  - We are in the process of summing up a 512-element array
  - The shared memory is used to hold a partial sum vector
  - Each iteration brings the partial sum vector closer to the final sum
  - The final sum will be stored in element 0
17
HK-UIUC
18 A simple implementation
- Assume we have already loaded the array into
  __shared__ float partialSum[];

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)
        partialSum[t] += partialSum[t + stride];
}
18
HK-UIUC
19 The Bank Conflicts Aspect
(Figure: reduction tree over array elements 0..11; iteration 1 forms the pairwise sums 0+1, 2+3, 4+5, 6+7, 8+9, 10+11; iteration 2 forms the partial sums 0..3, 4..7, 8..11; iteration 3 forms 0..7 and 8..15.)
19
HK-UIUC
20 The Branch Divergence Aspect
(Figure: the same reduction tree over the array elements, now highlighting that in the first iteration only the even-numbered threads, i.e. Thread 0, 2, 4, 6, 8, 10, perform an addition.)
20
HK-UIUC
21 Some Observations
- In each iteration, two control flow paths are sequentially traversed for each warp
  - Threads that perform the addition and threads that do not
- Threads that do not perform the addition may still cost extra cycles, depending on the implementation of divergence
21
HK-UIUC
22 Some Observations (cont.)
- No more than half of the threads will be executing at any time
  - All odd-index threads are disabled right from the beginning!
  - On average, less than ¼ of the threads will be activated for all warps over time
  - After the 5th iteration, entire warps in each block will be disabled: poor resource utilization, but no divergence
    - This can go on for a while, up to 4 more iterations (512/32 = 16 = 2^4), where each iteration only has one thread activated until all warps retire
22
HK-UIUC
23 Shortcomings of the implementation
- Assume we have already loaded the array into
  __shared__ float partialSum[];

unsigned int t = threadIdx.x;
for (unsigned int stride = 1; stride < blockDim.x; stride *= 2)
{
    __syncthreads();
    if (t % (2*stride) == 0)                          // BAD: divergence due to interleaved branch decisions
        partialSum[t] += partialSum[t + stride];      // BAD: bank conflicts due to stride
}
23
HK-UIUC
24 A better implementation
- Assume we have already loaded the array into
  __shared__ float partialSum[]; (a full kernel sketch follows the loop)

unsigned int t = threadIdx.x;
for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
{
    __syncthreads();
    if (t < stride)
        partialSum[t] += partialSum[t + stride];
}
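For context, a minimal self-contained kernel built around this loop; the kernel name, the fixed 512-thread block, and the load / write-back steps are assumptions added for illustration:

__global__ void reduceSum(const float* g_in, float* g_out)
{
    __shared__ float partialSum[512];    // assumes a 512-thread block

    unsigned int t = threadIdx.x;
    partialSum[t] = g_in[blockIdx.x * blockDim.x + t];   // one element loaded per thread

    for (unsigned int stride = blockDim.x >> 1; stride >= 1; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)                                  // a contiguous set of threads stays active
            partialSum[t] += partialSum[t + stride];     // consecutive addresses: no bank conflicts
    }

    if (t == 0)
        g_out[blockIdx.x] = partialSum[0];               // the block's sum ends up in element 0
}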
24
HK-UIUC
25 No Divergence until < 16 sub-sums
(Figure: with the better implementation, in the late iterations only threads 0..15 of the first warp remain active, forming sub-sums such as 0+16, ..., 15+31 before the result collapses further toward element 0.)
25
HK-UIUC
26 Some Observations About the New Implementation
- Only the last 5 iterations will have divergence
- Entire warps will be shut down as the iterations progress
  - For a 512-thread block, 4 iterations suffice to shut down all but one warp in the block
  - Better resource utilization; warps will likely retire early, and thus the block executes faster
- Recall, no bank conflicts either
26
HK-UIUC
27 A Potential Further Refinement, but a Bad Idea
- For the last 6 loops only one warp is active (i.e., tids 0..31)
  - Shared memory reads and writes are SIMD synchronous within a warp
  - So skip __syncthreads() and unroll the last 6 iterations

unsigned int tid = threadIdx.x;
for (unsigned int d = n >> 1; d > 32; d >>= 1)
{
    __syncthreads();
    if (tid < d)
        shared[tid] += shared[tid + d];
}
__syncthreads();
if (tid < 32) {   // unroll last 6 predicated steps
    shared[tid] += shared[tid + 32];
    shared[tid] += shared[tid + 16];
    shared[tid] += shared[tid + 8];
    shared[tid] += shared[tid + 4];
    shared[tid] += shared[tid + 2];
    shared[tid] += shared[tid + 1];
}
27
HK-UIUC
28 A Potential Further Refinement, but a Bad Idea (cont.)
- Concluding remarks on the further refinement:
  - This would not work properly if the warp size decreases
  - It also doesn't look that attractive if the warp size increases
  - Finally, you would need __syncthreads() between each statement!
    - Having __syncthreads() in an if-statement is problematic
28
HK-UIUC
29 Control Flow Instructions
- if, switch, for, and while can significantly impact the effective instruction throughput when threads of the same warp diverge
- If this happens, the execution is serialized
  - This increases the number of instructions executed for this warp
  - When all the different execution paths have completed, the threads converge back to the same execution path
  - Not only do you execute more instructions, you also need the logic associated with this process (book-keeping)
29
30 Predicated Execution Concept
- Thread divergence can be avoided in some cases by using the concept of predication:
  <p1> LDR r1, r2, 0
- If p1 is TRUE, the assembly instruction above executes normally
- If p1 is FALSE, the instruction is treated as a NOP
30
HK-UIUC
31 Predication Example
if (x == 10)
    c = c + 1;

     LDR  r5, X
     p1 <- r5 eq 10
<p1> LDR  r1 <- C
<p1> ADD  r1, r1, 1
<p1> STR  r1 -> C
31
HK-UIUC
32 Predication very helpful for if-else
(Figure: an if-else control-flow graph with blocks A, B, C, D; with predication, the straight-line sequence A B C D is issued instead of branching.)
32
HK-UIUC
33 If-else example
As written (instructions grouped by branch):
     p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p1> inst 2 from B
<p1> ...
<p2> inst 1 from C
<p2> inst 2 from C
<p2> ...

This is what gets scheduled (instructions from B and C interleaved):
     p1, p2 <- r5 eq 10
<p1> inst 1 from B
<p2> inst 1 from C
<p1> inst 2 from B
<p2> inst 2 from C
<p1> ...

The cost is that extra instructions will be issued each time the code is executed. However, there is no branch divergence.
33
HK-UIUC
34 Instruction Predication in G80
- Your comparison instructions set condition codes (CC)
- Instructions can be predicated to write results only when the CC meets the criterion (CC != 0, CC > 0, etc.)
- The compiler tries to predict whether a branch condition is likely to produce many divergent warps
  - If that is the case, it predicates if the branch has <7 instructions
  - If that is not the case, it only predicates if the branch has <4 instructions
- Note: it's pretty bad if you predicate when it was obvious that there would have been no divergence
34
HK-UIUC
35 Instruction Predication in G80 (cont.)
- ALL predicated instructions take execution cycles
  - Those with false conditions don't write their output, and do not evaluate addresses or read operands
  - This saves branch instructions, so it can be cheaper than serializing divergent paths
- If all this business is confusing, remember this: avoid thread divergence
- It's not 100% clear to me, but I believe that there is no cost if a subset of the threads belonging to a warp sits there and does nothing while the other threads of the warp are all running the same instruction
35
HK-UIUC