Title: ME964 High Performance Computing for Engineering Applications
1. ME964: High Performance Computing for Engineering Applications
- CUDA Memory Spaces and Access Overhead
- Gauging Kernel Performance
- Oct. 2, 2008
2. Before we get started
- Last Time
  - Details on the CUDA memory spaces and the access-related overhead
  - Relevant for getting good performance out of your GPU application
  - Covered registers, constant memory, global memory
- Today
  - Wrap up discussion on CUDA memory spaces
  - Discuss the shared memory
  - Gauging the extent to which you use HW resources in CUDA
  - HW5, matrix convolution, to be posted on the class website. It also requires some reading of the parallel programming patterns book.
  - NOTE: Next Tuesday, Michael Garland, Senior Researcher at NVIDIA, is going to be our guest lecturer.
3. Examples of Coalesced Memory Access Patterns
[Figure: two access patterns in which threads 0-15 of a half-warp fetch consecutive 4-byte floats (addresses 128-188) from global memory, all falling within one 64B-aligned segment; each pattern leads to one memory transaction.]
4. Example of Non-Coalesced Memory Access
[Figure: thread-to-address mappings (misaligned or permuted float accesses spanning 32B, 64B, and 128B segments) that are non-coalesced under CUDA 1.1 and cost 16 separate memory transactions. NOTE: in CUDA 2.0 (released in summer 2008), all of these accesses are coalesced and result in one or two memory transactions.]
5. Coalesced Global Memory Access (Concluding Remarks)
- Happens when a half-warp (16 threads) accesses a contiguous region of device memory
  - 16 data elements loaded in one instruction
  - int, float: 64 bytes (fastest)
  - int2, float2: 128 bytes
  - int4, float4: 256 bytes (2 transactions)
- If un-coalesced, 16 sequential loads are issued instead
- CUDA 2.0 became more lax with these requirements; it is simpler to get coalesced memory operations
- NOTE: when you have 2D (Dx, Dy) and 3D (Dx, Dy, Dz) blocks, count on this indexing scheme for your threads when considering memory coalescence (see the sketch below)
  - 2D: the thread ID in the block for a thread of index (x, y) is x + Dx * y
  - 3D: the thread ID in the block for a thread of index (x, y, z) is x + Dx * (y + Dy * z)
  - To conclude, the x thread index runs the fastest, followed by y, and then by z
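A minimal kernel sketch (hypothetical name, assuming blockDim.x is a multiple of 16 and the arrays are properly aligned) showing how this indexing keeps global accesses coalesced:

```cuda
// Within a 2D block, threadIdx.x varies fastest, so tying the x index to
// consecutive floats keeps each half-warp inside one contiguous 64B segment.
__global__ void coalescedCopy2D(const float* in, float* out, int n)
{
    // Linear thread ID inside the block: x + Dx * y
    int tidInBlock = threadIdx.x + blockDim.x * threadIdx.y;
    // Consecutive threads (consecutive threadIdx.x) touch consecutive
    // addresses, so both the load and the store below coalesce.
    int idx = blockIdx.x * (blockDim.x * blockDim.y) + tidInBlock;
    if (idx < n)
        out[idx] = in[idx];
}
```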
6. Exercise: coalesced memory access
- Suppose b is an array of int that lives in the global memory space
- Suppose a is of type int and is a register variable
- Consider the two lines below, each supposed to be part of a kernel launched with a 1D grid
  - a = b[threadIdx.x];
  - a = b[2 * threadIdx.x];
- Are these loads leading to coalesced or non-coalesced memory transactions?
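A small kernel sketch (hypothetical name) that puts the two exercise lines in context; b points into global memory and a is a per-thread register variable:

```cuda
__global__ void exerciseLoads(const int* b, int* out)
{
    int a;
    a = b[threadIdx.x];        // stride-1 access across the half-warp
    out[threadIdx.x] = a;

    a = b[2 * threadIdx.x];    // stride-2 access across the half-warp
    out[threadIdx.x] += a;
}
```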
7. Standard Trick: Load/Store (Memory Read/Write) Clustering/Batching
- Use LD to hide LD latency (non-dependent LD ops only)
  - Use the same thread to help hide its own latency
- Instead of:
  - LD 0 (long latency)
  - Dependent MATH 0
  - LD 1 (long latency)
  - Dependent MATH 1
- Do:
  - LD 0 (long latency)
  - LD 1 (long latency - hidden)
  - MATH 0
  - MATH 1
- The compiler typically handles this on your behalf
  - But you must have enough non-dependent LDs and math
  - This is where loop unrolling comes into play and can have a significant impact (see the sketch below)
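A minimal sketch (hypothetical kernel) of this clustering: both independent loads are issued before either result is consumed, so the second load's latency overlaps the first one's:

```cuda
__global__ void batchedLoads(const float* x, const float* y, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float a = x[i];           // LD 0 (long latency)
        float b = y[i];           // LD 1 issued before any dependent math
        float ma = a * a + 1.0f;  // MATH 0, depends only on LD 0
        float mb = b * b + 1.0f;  // MATH 1, depends only on LD 1
        out[i] = ma + mb;
    }
}
```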
8. Shared Memory
- Each SM has 16 KB of Shared Memory
  - Physically organized as 16 banks of 4-byte words
- Note that shared memory can store less data than the registers (16 KB vs. 32 KB)
- The 16 banks of the Shared Memory are organized like benches in a movie theater
  - You have 256 rows of benches. Each row has 16 benches, and each bench seats a family of four (bytes). Note that a bank represents a column of benches in the movie theater
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  - All threads in the block have read/write access
[Figure: G80 SM diagram - L1 instruction cache, multithreaded instruction buffer, register file, constant L1 cache, Shared Mem, operand select, MAD and SFU units.]
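A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of shared memory as block-visible read/write storage:

```cuda
// Each thread stages one float in shared memory; after the barrier every
// thread in the block can read a value written by another thread.
__global__ void reverseInBlock(const float* in, float* out)
{
    __shared__ float tile[256];                   // lives in the SM's 16 KB shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                    // every thread writes its own slot
    __syncthreads();                              // make the writes visible block-wide
    out[i] = tile[blockDim.x - 1 - threadIdx.x];  // every thread reads another thread's slot
}
```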
9. Q: Is 16 KB of Shared Memory Enough? Revisit the Matrix Multiplication Example
- One block computes one square sub-matrix Csub of size Block_Size
- One thread computes one element of Csub
- Assume that the dimensions of A and B are multiples of Block_Size and that the matrices are square
  - Doesn't have to be like this, but it keeps the example simpler and focused on the concepts of interest
[Figure: tiled matrix multiplication C = A x B; one Block_Size x Block_Size sub-matrix Csub of C is computed per block, with thread indices tx, ty and matrix dimensions hA, wA, wB.]
10. Matrix Multiplication: Shared Memory Usage
- Each block requires 2 * WIDTH^2 * 4 bytes of shared memory storage
  - For WIDTH = 16, each block requires 2 KB, so up to 8 blocks can fit into the Shared Memory of an SM
- Since each SM can only take 768 threads, each SM can only take 3 blocks of 256 threads each
- Shared memory size is not a limitation for our implementation of the Matrix Multiplication
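A sketch of the kind of tiled kernel these numbers refer to (TILE_WIDTH is an assumed name for Block_Size = WIDTH = 16; the matrices are square with a dimension that is a multiple of TILE_WIDTH):

```cuda
#define TILE_WIDTH 16

__global__ void matMulTiled(const float* A, const float* B, float* C, int width)
{
    // Two tiles of 16 x 16 floats: 2 * 16 * 16 * 4 bytes = 2 KB of shared memory per block.
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float acc = 0.0f;

    for (int m = 0; m < width / TILE_WIDTH; ++m) {
        // Each thread loads one element of each tile (coalesced along threadIdx.x).
        As[threadIdx.y][threadIdx.x] = A[row * width + m * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(m * TILE_WIDTH + threadIdx.y) * width + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * width + col] = acc;
}
```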
11. Shared Memory Architecture
- Common sense observation: in a parallel machine, many threads access memory at the same time
  - To service more than one thread, memory is divided into banks
  - Essential to achieve high bandwidth
- Each bank can service one address per cycle
  - A memory can service as many simultaneous accesses as it has banks
- Multiple simultaneous accesses to a bank result in a bank conflict
  - Conflicting accesses are serialized
12. Bank Addressing Examples
- No Bank Conflicts
  - Linear addressing, stride of 1
- No Bank Conflicts
  - Random 1:1 permutation
13. Bank Addressing Examples
- 2-way Bank Conflicts
  - Linear addressing, stride of 2
- 8-way Bank Conflicts
  - Linear addressing, stride of 8
14. Shared Memory Bank Conflicts
- Shared memory is as fast as registers if there are no bank conflicts
- The fast case:
  - If all threads of a half-warp access different banks, there is no bank conflict
  - If all threads of a half-warp access an identical address for a fetch operation, there is no bank conflict (broadcast)
- The slow case:
  - Bank Conflict: multiple threads in the same half-warp access the same bank
  - The accesses must be serialized
  - Cost = the maximum number of simultaneous accesses to a single bank
15. How Addresses Map to Banks on G80
- Each bank has a bandwidth of 32 bits per clock cycle
- Successive 32-bit words are assigned to successive banks
- G80 has 16 banks
  - The bank you work with = (32-bit word address) % 16
  - Same as the number of threads in a half-warp
- NOTE: there is no such thing as bank conflicts between threads belonging to different half-warps; this issue is only relevant for threads within a single half-warp
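A minimal device-side sketch (hypothetical helper) of this mapping: a 4-byte word at byte address addr in shared memory lives in bank (addr / 4) % 16 on G80:

```cuda
__device__ int bankOfByteAddress(unsigned int addr)
{
    return (addr / 4) % 16;   // 16 banks, each one 32-bit word wide
}
```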
16. Linear Addressing
- Given:
  - __shared__ float sharedM[256];
  - float foo = sharedM[baseIndex + s * threadIdx.x];
- This is bank-conflict-free if s shares no common factors with the number of banks
  - 16 on G80, so s must be odd
[Figure: thread-to-bank mappings for strides s = 1 and s = 3, both conflict-free.]
17. The Math Behind Bank Conflicts
- We are in a half-warp, and the question is whether thread t1 and thread t2 > t1 might access the same bank of shared memory
- Let b be the base of the array (the sharedM pointer on the previous slide)
  - The two threads access words b + s * t1 and b + s * t2, so they hit the same bank exactly when s * (t2 - t1) is a multiple of 16, that is, when s * (t2 - t1) = 16 * k for some positive integer k
- How should you not choose s?
  - If s = 2, take k = 1: any threads t1 and t2 that are eight apart satisfy the condition above and will have a bank conflict (0 and 8, 1 and 9, etc.): a two-way conflict
  - If s = 4, any threads t1 and t2 that are four apart will have a bank conflict (0, 4, 8, 12; 1, 5, 9, 13; etc.): a four-way conflict
- NOTE: you can't get a bank conflict if s is odd (no combination of k, t1, t2 within a half-warp satisfies the bank conflict condition above, since an odd s shares no factor with 16). So take stride s = 1, 3, 5, etc.
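A small host-side check (an assumed helper program, not from the slides) of the condition above: for each stride s it counts how many of the 16 half-warp threads land in each bank and reports the worst case:

```cuda
#include <stdio.h>

int main(void)
{
    int strides[] = {1, 2, 3, 4, 8};
    for (int i = 0; i < 5; ++i) {
        int s = strides[i];
        int perBank[16] = {0};
        for (int t = 0; t < 16; ++t)       // threads 0..15 of a half-warp
            perBank[(s * t) % 16]++;       // word index s*t maps to bank (s*t) % 16
        int worst = 0;
        for (int bnk = 0; bnk < 16; ++bnk)
            if (perBank[bnk] > worst) worst = perBank[bnk];
        printf("stride %d: worst case %d-way access to a single bank\n", s, worst);
    }
    return 0;
}
```

For odd strides it reports 1 (no conflict); for strides 2, 4, and 8 it reports the 2-, 4-, and 8-way conflicts from the earlier slides.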
18. Data Types and Bank Conflicts
- This has no conflicts if the type of shared is 32 bits:
  - foo = shared[baseIndex + threadIdx.x];
- But not if the data type is smaller
- 4-way bank conflicts:
  - __shared__ char shared[];
  - foo = shared[baseIndex + threadIdx.x];
- 2-way bank conflicts:
  - __shared__ short shared[];
  - foo = shared[baseIndex + threadIdx.x];
19. Structs and Bank Conflicts
- Struct assignments compile into as many memory accesses as there are struct members
  - struct vector { float x, y, z; };
  - struct myType { float f; int c; };
  - __shared__ struct vector vectors[64];
  - __shared__ struct myType myTypes[64];
- This has no bank conflicts for vector: the struct size is 3 words
  - 3 accesses per thread, contiguous banks (3 shares no common factor with 16)
  - struct vector v = vectors[baseIndex + threadIdx.x];
- This has 2-way bank conflicts for myType (2 accesses per thread)
  - struct myType m = myTypes[baseIndex + threadIdx.x];
[Figure: linear thread-to-bank mapping, with threads 0-15 reading from banks 0-15.]
20. Common Array Bank Conflict Patterns: 1D
- Each thread loads 2 elements into shared memory
  - 2-way-interleaved loads result in 2-way bank conflicts:
  - int tid = threadIdx.x;
  - shared[2 * tid] = global[2 * tid];
  - shared[2 * tid + 1] = global[2 * tid + 1];
- This makes sense for traditional CPU threads: locality in cache line usage and reduced sharing traffic
  - Not for shared memory usage, where there are no cache line effects but there are banking effects
21. A Better Array Access Pattern
- Each thread loads one element from every consecutive group of blockDim.x elements:
  - shared[tid] = global[tid];
  - shared[tid + blockDim.x] = global[tid + blockDim.x];
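A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of this conflict-free pattern; every access is stride-1, both across shared memory banks and across global memory:

```cuda
__global__ void loadTwoPerThread(const float* global, float* result)
{
    __shared__ float shared[2 * 256];
    int tid  = threadIdx.x;
    int base = blockIdx.x * 2 * blockDim.x;

    shared[tid]              = global[base + tid];               // stride-1: no bank conflict, coalesced
    shared[tid + blockDim.x] = global[base + tid + blockDim.x];  // stride-1 again
    __syncthreads();

    // ... use the staged data; here it is simply copied back out ...
    result[base + tid]              = shared[tid];
    result[base + tid + blockDim.x] = shared[tid + blockDim.x];
}
```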
22. Vector Reduction with Bank Conflicts (assume 1024 vector entries)
[Figure: reduction tree over the array elements (floats) across steps 1, 2, 3; the interleaved access pattern maps several active threads of a half-warp onto the same shared memory bank.]
23. No Bank Conflicts
[Figure: the same reduction carried out with a sequential (contiguous) addressing pattern across steps 1, 2, 3, so the active threads of a half-warp fall into distinct banks.]
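A minimal sketch (hypothetical kernel, assuming blockDim.x == 256) of the conflict-free reduction pattern: at every step the active threads read and write contiguous shared memory words, so the 16 threads of a half-warp land in 16 different banks:

```cuda
__global__ void reduceNoConflict(const float* in, float* out)
{
    __shared__ float partial[256];
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();

    // Sequential addressing: the stride shrinks from blockDim.x/2 down to 1.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];   // one partial sum per block
}
```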
24. Common Bank Conflict Patterns (2D)
- Operating on a 2D array of floats in shared memory
  - e.g., image processing
- Example: 16x16 block
  - Threads in a block access the elements of each column simultaneously (example: bank 1, shown in purple in the figure)
  - 16-way bank conflicts
- Solution 1) pad the rows
  - Add one float to the end of each row
- Solution 2) transpose before processing
  - Suffer bank conflicts during the transpose
  - But possibly save them later
[Figure: bank index tables for the 16x16 tile, without padding (all elements of a column fall into the same bank) and with one float of padding per row (the elements of a column spread across the banks).]
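A minimal sketch of Solution 1, padding each row of the shared memory tile with one extra float. The kernel itself (a square-matrix transpose, with width a multiple of 16) is only an assumed example, chosen because its second access reads the tile column-wise: without the padding, the 16 threads of a half-warp would all hit the same bank (16-way conflict); with it they hit 16 different banks:

```cuda
#define TILE 16

__global__ void transposeTile(float* out, const float* in, int width)
{
    __shared__ float tile[TILE][TILE + 1];       // +1 float of padding per row

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];   // row-wise load, coalesced

    __syncthreads();

    // Swap the block indices so the global write stays coalesced; the shared
    // memory read below is the column-wise access the padding protects.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```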