Title: ME964 High Performance Computing for Engineering Applications
1. ME964: High Performance Computing for Engineering Applications
- CUDA Memory Spaces and Access Overhead
- Gauging Kernel Performance
- Oct. 2, 2008
2. Before we get started
- Last Time
  - Details on the CUDA memory spaces and the access-related overhead
  - Relevant for getting good performance out of your GPU application
  - Covered registers, constant memory, global memory
- Today
  - Wrap up discussion on CUDA memory spaces
  - Discuss the shared memory
  - Gauging the extent to which you use HW resources in CUDA
  - HW5, matrix convolution, to be posted on the class website. It also requires some reading of the parallel programming patterns book.
  - NOTE: Next Tuesday, Michael Garland, Senior Researcher at NVIDIA, is going to be our guest lecturer.
3. Examples of Coalesced Memory Access Patterns
[Figure: two access patterns in which threads 0-15 of a half-warp fetch consecutive 4-byte floats (addresses 128-188) from global memory, all falling within one 64B-aligned segment; each pattern leads to one memory transaction.]
4. Example of Non-Coalesced Memory Access
[Figure: thread-to-address mappings (misaligned or permuted float accesses spanning 32B, 64B, and 128B segments) that are non-coalesced under CUDA 1.1 and cost 16 separate memory transactions. NOTE: in CUDA 2.0 (released in summer 2008), all of these accesses are coalesced and result in one or two memory transactions.]
5. Coalesced Global Memory Access (Concluding Remarks)
- Happens when a half-warp (16 threads) accesses a contiguous region of device memory
  - 16 data elements loaded in one instruction
  - int, float: 64 bytes (fastest)
  - int2, float2: 128 bytes
  - int4, float4: 256 bytes (2 transactions)
- If un-coalesced, 16 sequential loads are issued instead
- CUDA 2.0 became more lax with these requirements; it is simpler to get coalesced memory operations
- NOTE: when you have 2D (Dx, Dy) and 3D (Dx, Dy, Dz) blocks, count on this indexing scheme for your threads when considering memory coalescence (see the sketch below)
  - 2D: the thread ID in the block for a thread of index (x, y) is x + Dx * y
  - 3D: the thread ID in the block for a thread of index (x, y, z) is x + Dx * (y + Dy * z)
  - To conclude, the x thread index runs the fastest, followed by y, and then by z
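A minimal kernel sketch (hypothetical name, assuming blockDim.x is a multiple of 16 and the arrays are properly aligned) showing how this indexing keeps global accesses coalesced:

```cuda
// Within a 2D block, threadIdx.x varies fastest, so tying the x index to
// consecutive floats keeps each half-warp inside one contiguous 64B segment.
__global__ void coalescedCopy2D(const float* in, float* out, int n)
{
    // Linear thread ID inside the block: x + Dx * y
    int tidInBlock = threadIdx.x + blockDim.x * threadIdx.y;
    // Consecutive threads (consecutive threadIdx.x) touch consecutive
    // addresses, so both the load and the store below coalesce.
    int idx = blockIdx.x * (blockDim.x * blockDim.y) + tidInBlock;
    if (idx < n)
        out[idx] = in[idx];
}
```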
6. Exercise: coalesced memory access
- Suppose b is an array of int that lives in the global memory space
- Suppose a is of type int and is a register variable
- Consider the two lines below, each supposed to be part of a kernel launched with a 1D grid
  - a = b[threadIdx.x];
  - a = b[2 * threadIdx.x];
- Are these loads leading to coalesced or non-coalesced memory transactions?
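A small kernel sketch (hypothetical name) that puts the two exercise lines in context; b points into global memory and a is a per-thread register variable:

```cuda
__global__ void exerciseLoads(const int* b, int* out)
{
    int a;
    a = b[threadIdx.x];        // stride-1 access across the half-warp
    out[threadIdx.x] = a;

    a = b[2 * threadIdx.x];    // stride-2 access across the half-warp
    out[threadIdx.x] += a;
}
```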
7. Standard Trick: Load/Store (Memory Read/Write) Clustering/Batching
- Use LD to hide LD latency (non-dependent LD ops only)
  - Use the same thread to help hide its own latency
- Instead of:
  - LD 0 (long latency)
  - Dependent MATH 0
  - LD 1 (long latency)
  - Dependent MATH 1
- Do:
  - LD 0 (long latency)
  - LD 1 (long latency - hidden)
  - MATH 0
  - MATH 1
- The compiler typically handles this on your behalf
  - But you must have enough non-dependent LDs and math
  - This is where loop unrolling comes into play and can have a significant impact (see the sketch below)
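A minimal sketch (hypothetical kernel) of this clustering: both independent loads are issued before either result is consumed, so the second load's latency overlaps the first one's:

```cuda
__global__ void batchedLoads(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i];           // LD 0 (long latency)
        float b = y[i];           // LD 1 issued before any dependent math
        float ma = a * a + 1.0f;  // MATH 0, depends only on LD 0
        float mb = b * b + 1.0f;  // MATH 1, depends only on LD 1
        out[i] = ma + mb;
    }
}
```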
8. Shared Memory
- Each SM has 16 KB of Shared Memory
  - Physically organized as 16 banks of 4-byte words
- Note that shared memory can store less data than the registers (16 KB vs. 32 KB)
- The 16 banks of the Shared Memory are organized like benches in a movie theater
  - You have 256 rows of benches. Each row has 16 benches, and each bench seats a family of four (bytes). Note that a bank represents a column of benches in the movie theater
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  - All threads in the block have read/write access
[Figure: G80 SM diagram - L1 instruction cache, multithreaded instruction buffer, register file, constant L1 cache, Shared Mem, operand select, MAD and SFU units.]
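A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of shared memory as block-visible read/write storage:

```cuda
// Each thread stages one float in shared memory; after the barrier every
// thread in the block can read a value written by another thread.
__global__ void reverseInBlock(const float* in, float* out)
{
    __shared__ float tile[256];                   // lives in the SM's 16 KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                    // every thread writes its own slot
    __syncthreads();                              // make the writes visible block-wide
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // every thread reads another thread's slot
}
```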
9. Q: Is 16 KB of Shared Memory Enough? Revisit the Matrix Multiplication Example
- One block computes one square sub-matrix Csub of size Block_Size
- One thread computes one element of Csub
- Assume that the dimensions of A and B are multiples of Block_Size and that the matrices are square
  - Doesn't have to be like this, but it keeps the example simpler and focused on the concepts of interest
[Figure: tiled matrix multiplication C = A x B; one Block_Size x Block_Size sub-matrix Csub of C is computed per block, with thread indices tx, ty and matrix dimensions hA, wA, wB.]
10. Matrix Multiplication: Shared Memory Usage
- Each block requires 2 * WIDTH^2 * 4 bytes of shared memory storage
  - For WIDTH = 16, each block requires 2 KB, so up to 8 blocks can fit into the Shared Memory of an SM
- Since each SM can only take 768 threads, each SM can only take 3 blocks of 256 threads each
- Shared memory size is not a limitation for our implementation of the Matrix Multiplication
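A sketch of the kind of tiled kernel these numbers refer to (TILE_WIDTH is an assumed name for Block_Size = WIDTH = 16; the matrices are square with a dimension that is a multiple of TILE_WIDTH):

```cuda
#define TILE_WIDTH 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int width)
{
    // Two tiles of 16 x 16 floats: 2 * 16 * 16 * 4 bytes = 2 KB of shared memory per block.
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < width / TILE_WIDTH; ++m) {
        // Each thread loads one element of each tile (coalesced along threadIdx.x).
        As[threadIdx.y][threadIdx.x] = A[row * width + m * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * width + col] = acc;
}
```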
11. Shared Memory Architecture
- Common sense observation: in a parallel machine, many threads access memory at the same time
  - To service more than one thread, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
12. Bank Addressing Examples
- No Bank Conflicts
  - Linear addressing, stride of 1
- No Bank Conflicts
  - Random 1:1 permutation
13. Bank Addressing Examples
- 2-way Bank Conflicts
  - Linear addressing, stride of 2
- 8-way Bank Conflicts
  - Linear addressing, stride of 8
14. Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access an identical address for a fetch operation, there is no bank conflict (broadcast)
- The slow case:
  - Bank Conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = the maximum number of simultaneous accesses to a single bank
15. How Addresses Map to Banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - The bank you work with = (32-bit word address) % 16
  - Same as the number of threads in a half-warp
- NOTE: there is no such thing as bank conflicts between threads belonging to different half-warps; this issue is only relevant for threads within a single half-warp
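A minimal device-side sketch (hypothetical helper) of this mapping: a 4-byte word at byte address addr in shared memory lives in bank (addr / 4) % 16 on G80:

```cuda
__device__ int bankOfByteAddress(unsigned int addr)
{
    return (addr / 4) % 16;   // 16 banks, each one 32-bit word wide
}
```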
16. Linear Addressing
- Given:
  - __shared__ float sharedM[256];
  - float foo = sharedM[baseIndex + s * threadIdx.x];
- This is bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd
[Figure: thread-to-bank mappings for strides s = 1 and s = 3, both conflict-free.]
17. The Math Behind Bank Conflicts
- We are in a half-warp, and the question is whether thread t1 and thread t2 > t1 might access the same bank of shared memory
- Let b be the base of the array (the sharedM pointer on the previous slide)
  - The two threads access words b + s * t1 and b + s * t2, so they hit the same bank exactly when s * (t2 - t1) is a multiple of 16, that is, when s * (t2 - t1) = 16 * k for some positive integer k
- How should you not choose s?
  - If s = 2, take k = 1: any threads t1 and t2 that are eight apart satisfy the condition above and will have a bank conflict (0 and 8, 1 and 9, etc.): a two-way conflict
  - If s = 4, any threads t1 and t2 that are four apart will have a bank conflict (0, 4, 8, 12; 1, 5, 9, 13; etc.): a four-way conflict
- NOTE: you can't get a bank conflict if s is odd (no combination of k, t1, t2 within a half-warp satisfies the bank conflict condition above, since an odd s shares no factor with 16). So take stride s = 1, 3, 5, etc.
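A small host-side check (an assumed helper program, not from the slides) of the condition above: for each stride s it counts how many of the 16 half-warp threads land in each bank and reports the worst case:

```cuda
#include <stdio.h>

int main(void)
{
    int strides[] = {1, 2, 3, 4, 8};
    for (int i = 0; i < 5; ++i) {
        int s = strides[i];
        int perBank[16] = {0};
        for (int t = 0; t < 16; ++t)       // threads 0..15 of a half-warp
            perBank[(s * t) % 16]++;       // word index s*t maps to bank (s*t) % 16
        int worst = 0;
        for (int bnk = 0; bnk < 16; ++bnk)
            if (perBank[bnk] > worst) worst = perBank[bnk];
        printf("stride %d: worst case %d-way access to a single bank\n", s, worst);
    }
    return 0;
}
```

For odd strides it reports 1 (no conflict); for strides 2, 4, and 8 it reports the 2-, 4-, and 8-way conflicts from the earlier slides.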
18. Data Types and Bank Conflicts
- This has no conflicts if the type of shared is 32 bits:
  - foo = shared[baseIndex + threadIdx.x];
- But not if the data type is smaller
- 4-way bank conflicts:
  - __shared__ char shared[];
  - foo = shared[baseIndex + threadIdx.x];
- 2-way bank conflicts:
  - __shared__ short shared[];
  - foo = shared[baseIndex + threadIdx.x];
19. Structs and Bank Conflicts
- Struct assignments compile into as many memory accesses as there are struct members
  - struct vector { float x, y, z; };
  - struct myType { float f; int c; };
  - __shared__ struct vector vectors[64];
  - __shared__ struct myType myTypes[64];
- This has no bank conflicts for vector: the struct size is 3 words
  - 3 accesses per thread, contiguous banks (3 shares no common factor with 16)
  - struct vector v = vectors[baseIndex + threadIdx.x];
- This has 2-way bank conflicts for myType (2 accesses per thread)
  - struct myType m = myTypes[baseIndex + threadIdx.x];
[Figure: linear thread-to-bank mapping, with threads 0-15 reading from banks 0-15.]
20. Common Array Bank Conflict Patterns: 1D
- Each thread loads 2 elements into shared memory
  - 2-way-interleaved loads result in 2-way bank conflicts:
  - int tid = threadIdx.x;
  - shared[2 * tid] = global[2 * tid];
  - shared[2 * tid + 1] = global[2 * tid + 1];
- This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic
  - Not for shared memory usage, where there are no cache line effects but there are banking effects
21. A Better Array Access Pattern
- Each thread loads one element from every consecutive group of blockDim.x elements:
  - shared[tid] = global[tid];
  - shared[tid + blockDim.x] = global[tid + blockDim.x];
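A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of this conflict-free pattern; every access is stride-1, both across shared memory banks and across global memory:

```cuda
__global__ void loadTwoPerThread(const float* global, float* result)
{
    __shared__ float shared[2 * 256];
    int tid  = threadIdx.x;
    int base = blockIdx.x * 2 * blockDim.x;

    shared[tid]              = global[base + tid];               // stride-1: no bank conflict, coalesced
    shared[tid + blockDim.x] = global[base + tid + blockDim.x];  // stride-1 again
    __syncthreads();

    // ... use the staged data; here it is simply copied back out ...
    result[base + tid]              = shared[tid];
    result[base + tid + blockDim.x] = shared[tid + blockDim.x];
}
```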
22. Vector Reduction with Bank Conflicts (assume 1024 vector entries)
[Figure: reduction tree over the array elements (floats) across steps 1, 2, 3; the interleaved access pattern maps several active threads of a half-warp onto the same shared memory bank.]
23. No Bank Conflicts
[Figure: the same reduction carried out with a sequential (contiguous) addressing pattern across steps 1, 2, 3, so the active threads of a half-warp fall into distinct banks.]
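A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of the conflict-free reduction pattern: at every step the active threads read and write contiguous shared memory words, so the 16 threads of a half-warp land in 16 different banks:

```cuda
__global__ void reduceNoConflict(const float* in, float* out)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Sequential addressing: the stride shrinks from blockDim.x/2 down to 1.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];   // one partial sum per block
}
```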
24. Common Bank Conflict Patterns (2D)
- Operating on a 2D array of floats in shared memory
  - e.g., image processing
- Example: 16x16 block
  - Threads in a block access the elements of each column simultaneously (example: bank 1, shown in purple in the figure)
  - 16-way bank conflicts
- Solution 1) pad the rows
  - Add one float to the end of each row
- Solution 2) transpose before processing
  - Suffer bank conflicts during the transpose
  - But possibly save them later
[Figure: bank index tables for the 16x16 tile, without padding (all elements of a column fall into the same bank) and with one float of padding per row (the elements of a column spread across the banks).]
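A minimal sketch of Solution 1, padding each row of the shared memory tile with one extra float. The kernel itself (a square-matrix transpose, with width a multiple of 16) is only an assumed example, chosen because its second access reads the tile column-wise: without the padding, the 16 threads of a half-warp would all hit the same bank (16-way conflict); with it they hit 16 different banks:

```cuda
#define TILE 16

__global__ void transposeTile(float* out, const float* in, int width)
{
    __shared__ float tile[TILE][TILE + 1];       // +1 float of padding per row

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise load, coalesced

    __syncthreads();

    // Swap the block indices so the global write stays coalesced; the shared
    // memory read below is the column-wise access the padding protects.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```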