Title: Linear Algebra I
1. Linear Algebra I
- Sashi Kumar Penta
- GPGP class presentation
- Spring 2007
2. LA 1 Outline
- Motivation
- Dense matrices and vectors
- Representation
- Vector-Vector operations
- Vector reduction type operations
- Matrix-Matrix multiplication
- Larsen et al.
- Fatahalian et al.
- Naga Govindaraju et al.
- Introduction to CUDA
- Matrix multiplication
3. LA 2 Outline
- Memory requirements for balanced computer architectures
- Sparse matrices
- Banded matrices
- Random sparse matrices
- Sparse matrix-vector multiplication
- Solving Navier-Stokes on the GPU
4. Why LA on GPUs?
- Applications of LA
- Scientific computations
- Search (LSI)
- Graphics Simulations
- The GPU is a fast streaming processor
- LA operations are easily streamable
- The result of the computation is already on the GPU and ready for display (useful for simulations)
5. Arrays as textures
- 2D arrays are the native GPU data layout
- Maximum array size on the GPU: 8K x 8K
- Texture coordinates are used to access values stored in the textures
6. Representation
Courtesy Jens Krüger
7. Matrix multiplication: why is it important?
- Inherently parallel
- Memory intensive
- Can benefit greatly from GPUs
- Used for benchmarking hardware
8. Early pre-floating-point work
- Larsen and McAllister described an early pre-floating-point implementation of matrix multiplies
- Used 2D textures and blending operations to perform the matrix product
- NVIDIA GeForce3 GPU, 8-bit fixed-point output
- 4.4 GB ops
9. L-M GPU-based nested-loop matrix multiplication

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    Z[i][j] = 0;
    // Each iteration of the following loop is a
    // quadrilateral rasterization of size N x N
    for (k = 0; k < N; k++)
      Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];
  }
10. Continued
The ability to modulate with two textures lets us multiply values from two separate textures using the graphics hardware.
11. Continued
Load texture texA with matrix A. Load texture texB with matrix B. Set the multitexturing mode to modulate. Set the framebuffer write mode to accumulate.
For i = 0 .. n-1: draw an n x n pixel rectangle with texA coords (0,i), (0,i), (n-1,i), (n-1,i) and texB coords (i,0), (i,n-1), (i,n-1), (i,0).
End. The screen contains the result of A x B.
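Viewed per pass, iteration i adds the outer product of one column of A and one row of B into the accumulation buffer; after n passes the accumulated sum equals A x B. A minimal CPU sketch of this multipass view (the function and variable names here are illustrative, not from the paper):

```c
#include <string.h>

#define N 4

/* One "rendering pass": accumulate the outer product of column i of A
   and row i of B into Z, which is what one modulate-and-accumulate
   rasterization pass computes over the whole N x N output. */
static void pass(const float A[N][N], const float B[N][N],
                 float Z[N][N], int i) {
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            Z[r][c] += A[r][i] * B[i][c];
}

/* Check that N passes reproduce the ordinary matrix product. */
int check_multipass(void) {
    float A[N][N], B[N][N], Z[N][N], C[N][N];
    memset(Z, 0, sizeof Z);
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++) {
            A[r][c] = (float)(r + c + 1);
            B[r][c] = (float)(r - c);
        }
    for (int i = 0; i < N; i++)        /* N accumulation passes */
        pass(A, B, Z, i);
    for (int r = 0; r < N; r++)        /* reference triple loop */
        for (int c = 0; c < N; c++) {
            C[r][c] = 0;
            for (int k = 0; k < N; k++)
                C[r][c] += A[r][k] * B[k][c];
        }
    for (int r = 0; r < N; r++)
        for (int c = 0; c < N; c++)
            if (Z[r][c] != C[r][c]) return 0;
    return 1;
}
```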
12. Floating-point textures
- Texture targets
- GL_TEXTURE_2D
- Coordinates have to be normalized to [0, 1]
- GL_TEXTURE_RECTANGLE_ARB
- Coordinates are not normalized
13. Operations
Courtesy Jens Krüger
- Vector-vector operations
- Reduced to 2D texture operations
- Coded in pixel shaders
- Example: Vector1 + Vector2 → Vector3
(Pipeline: static quad → vertex shader → pixel shader → render texture)
14. Vector-vector operation: saxpy
- Scalar alpha times x plus y, i.e. alpha * x + y
- y_new = y_old + alpha * x
- On the CPU:

for (int i = 0; i < N; i++)
  Y[i] = Y[i] + alpha * X[i];

- Calculations are independent
- Computational kernel: Y_new[i] = Y_old[i] + alpha * X[i]
15. Kernels = shaders

float saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float result;
    float y = tex2D(textureY, coords);
    float x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}
16. Four-channel RGBA texture

float4 saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float4 result;
    float4 y = tex2D(textureY, coords);
    float4 x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}

Four-channel textures improve efficiency in fetching and computation.
17. Computing = drawing

glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
  glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
  glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
  glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
  glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
glEnd();

Texture rectangles
18. Computing = drawing

glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
  glTexCoord2f(0.0, 0.0); glVertex2f(0.0, 0.0);
  glTexCoord2f(1.0, 0.0); glVertex2f(texSize, 0.0);
  glTexCoord2f(1.0, 1.0); glVertex2f(texSize, texSize);
  glTexCoord2f(0.0, 1.0); glVertex2f(0.0, texSize);
glEnd();

Texture2D
19. Reduction-type operations
- Maximum, vector norm, dot products, etc.
- One method: render a 1x1 output
- Only one parallel processing element (PE) is used
- Might exceed the maximum allowed shader length
- Better: compute the local maximum of each 2x2 group of elements in one kernel
- Input M x M, output M/2 x M/2
- Recursively repeated until the final 1x1 output
- Logarithmic number of iterations
20. Rendering a quad

void drawQuad(int w, int h) {
  glBegin(GL_QUADS);
    glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE1,  0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE2, -0.5,  0.5);
    glMultiTexCoord2f(GL_TEXTURE3,  0.5,  0.5);
    glVertex2f(0.0, 0.0);

    glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, -0.5);
    glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5,  0.5);
    glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5,  0.5);
    glVertex2f(w, 0.0);

    glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, 2*h - 0.5);
    glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, 2*h - 0.5);
    glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5, 2*h + 0.5);
    glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5, 2*h + 0.5);
    glVertex2f(w, h);

    glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h - 0.5);
    glMultiTexCoord2f(GL_TEXTURE1,  0.5, 2*h - 0.5);
    glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h + 0.5);
    glMultiTexCoord2f(GL_TEXTURE3,  0.5, 2*h + 0.5);
    glVertex2f(0.0, h);
  glEnd();
}
21. Maximum

float maximum(
    float2 left   : TEXCOORD0,
    float2 right  : TEXCOORD1,
    float2 top    : TEXCOORD2,
    float2 bottom : TEXCOORD3,
    uniform samplerRECT texture) : COLOR
{
    float val1 = texRECT(texture, left);
    float val2 = texRECT(texture, right);
    float val3 = texRECT(texture, top);
    float val4 = texRECT(texture, bottom);
    return max(val1, max(val2, max(val3, val4)));
}
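The reduction can be sketched on the CPU: each pass replaces every 2x2 group with its maximum, halving an m x m array until a single value remains after log2(m) passes. A small sketch (function names are illustrative; assumes m is a power of two up to 16):

```c
#include <string.h>

/* One reduction pass: each output element is the maximum of a 2x2
   group of inputs, halving the array from m x m to m/2 x m/2. */
static void reduce_pass(const float *in, float *out, int m) {
    int h = m / 2;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < h; j++) {
            float a = in[2*i*m + 2*j];
            float b = in[2*i*m + 2*j + 1];
            float c = in[(2*i+1)*m + 2*j];
            float d = in[(2*i+1)*m + 2*j + 1];
            float mx = a > b ? a : b;
            if (c > mx) mx = c;
            if (d > mx) mx = d;
            out[i*h + j] = mx;
        }
}

/* Repeatedly halve an m x m array (m a power of two) until 1x1. */
float reduce_max(float *buf, int m) {
    float tmp[64];  /* scratch; large enough for m <= 16 */
    while (m > 1) {
        reduce_pass(buf, tmp, m);
        m /= 2;
        memcpy(buf, tmp, (size_t)(m * m) * sizeof(float));
    }
    return buf[0];
}

/* Compare the recursive reduction with a plain linear scan. */
int check_reduce(void) {
    float v[16];
    float expect = 0;
    for (int i = 0; i < 16; i++) {
        v[i] = (float)((i * 7) % 13);
        if (v[i] > expect) expect = v[i];
    }
    return reduce_max(v, 4) == expect;
}
```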
22. Matrix-matrix multiplication

for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    C[i][j] = 0;
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
  }

- On CPUs:
- Suffers from poor locality: elements of B are accessed column-wise, not in sequential order in memory
- Bandwidth limited
- The working set of the inner loop prevents bandwidth amplification from the closest and fastest levels of the memory hierarchy
- Remedy: block the inputs into small sub-matrices that fit in the processor cache
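The blocking remedy can be sketched in plain C: the triple loop is split into tile loops so that each tile's working set fits in cache. A minimal sketch (the tile size and function names are illustrative):

```c
#include <string.h>

#define N 8
#define BS 4   /* tile size; chosen so three BS x BS tiles fit in cache */

/* Blocked multiply: C = A * B computed tile by tile so the inner
   loops touch only small sub-matrices of A, B, and C. */
void matmul_blocked(const float A[N][N], const float B[N][N],
                    float C[N][N]) {
    memset(C, 0, sizeof(float) * N * N);
    for (int ib = 0; ib < N; ib += BS)
        for (int jb = 0; jb < N; jb += BS)
            for (int kb = 0; kb < N; kb += BS)
                for (int i = ib; i < ib + BS; i++)
                    for (int j = jb; j < jb + BS; j++)
                        for (int k = kb; k < kb + BS; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

/* Compare the blocked version against the naive triple loop. */
int check_blocked(void) {
    float A[N][N], B[N][N], C[N][N], R[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = (float)(i + 2 * j);
            B[i][j] = (float)(i - j);
            R[i][j] = 0;
        }
    for (int i = 0; i < N; i++)          /* naive reference */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                R[i][j] += A[i][k] * B[k][j];
    matmul_blocked(A, B, C);
    return memcmp(C, R, sizeof C) == 0;
}
```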
23. Fatahalian et al.: matrix-matrix multiplication on the GPU
- 2x2 blocks of matrix elements are packed into 4-component texels
- Reads 2 rows from matrix A and 2 columns of B to compute 4 elements of the output matrix
- Each shader program performs 4 inner-loop iterations of the CPU program
- GPU texture caches are organized to capture 2D locality, so sequential access along either axis is cache coherent

for (k = 0; k < N/2; k++)
  C[i][j].xyzw += A[i][k].xxzz * B[k][j].xyxy
                + A[i][k].yyww * B[k][j].zwzw;
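Written out channel by channel, the swizzled shader body computes a 2x2 output block from 2x2 blocks of A and B. A scalar CPU sketch of that expansion (the struct and function names are illustrative, with x=m00, y=m01, z=m10, w=m11 in each texel):

```c
/* A 2x2 block packed like one RGBA texel: x=m00, y=m01, z=m10, w=m11. */
typedef struct { float x, y, z, w; } texel;

/* One shader iteration, expanded channel by channel:
   C.xyzw += A.xxzz * B.xyxy + A.yyww * B.zwzw */
static void block_mad(texel *C, texel A, texel B) {
    C->x += A.x * B.x + A.y * B.z;   /* c00 += a00*b00 + a01*b10 */
    C->y += A.x * B.y + A.y * B.w;   /* c01 += a00*b01 + a01*b11 */
    C->z += A.z * B.x + A.w * B.z;   /* c10 += a10*b00 + a11*b10 */
    C->w += A.z * B.y + A.w * B.w;   /* c11 += a10*b01 + a11*b11 */
}

/* Check the block formulation against an ordinary 4x4 product. */
int check_block_mad(void) {
    float A[4][4], B[4][4], R[4][4];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            A[i][j] = (float)(i * 4 + j);
            B[i][j] = (float)(j * 4 - i);
            R[i][j] = 0;
        }
    for (int i = 0; i < 4; i++)          /* naive reference */
        for (int k = 0; k < 4; k++)
            for (int j = 0; j < 4; j++)
                R[i][j] += A[i][k] * B[k][j];
    for (int bi = 0; bi < 2; bi++)       /* 2x2 grid of output blocks */
        for (int bj = 0; bj < 2; bj++) {
            texel C = {0, 0, 0, 0};
            for (int k = 0; k < 2; k++) {  /* N/2 iterations */
                texel a = { A[2*bi][2*k],   A[2*bi][2*k+1],
                            A[2*bi+1][2*k], A[2*bi+1][2*k+1] };
                texel b = { B[2*k][2*bj],   B[2*k][2*bj+1],
                            B[2*k+1][2*bj], B[2*k+1][2*bj+1] };
                block_mad(&C, a, b);
            }
            if (C.x != R[2*bi][2*bj]   || C.y != R[2*bi][2*bj+1] ||
                C.z != R[2*bi+1][2*bj] || C.w != R[2*bi+1][2*bj+1])
                return 0;
        }
    return 1;
}
```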
24. Multipass algorithm
- The single-pass instruction count exceeds the limits of GPU architectures for sufficiently large matrices
- Packs 4 consecutive elements from a matrix column into a texel
- Computes 4x4 matrix by 4x1 vector products
- 4x reuse of the data in texel B
- 6 fetches for 4 MAD operations: 1.5x more fetches

C[i][j].xyzw = accum[i][j].xyzw
             + A[i][k  ].xyzw * B[k/4][j].xxxx
             + A[i][k+1].xyzw * B[k/4][j].yyyy
             + A[i][k+2].xyzw * B[k/4][j].zzzz
             + A[i][k+3].xyzw * B[k/4][j].wwww;
25. Measurements: multiplication of 1024 x 1024 matrices
The available floating-point bandwidth from the closest cache on these GPUs is up to several times lower than that of contemporary CPUs to their L1 caches.
26. Naga et al.: matrix-matrix multiplication on the GPU
- A memory model for scientific algorithms on graphics processors
- Explicit blocking to avoid hardware dependencies
- Computation on tiles of size T x T is invoked by drawing quadrilaterals of size T x T on the screen
- A single fragment program evaluates the dot product of vectors of size D from the inputs X and Y
27. GPU-based nested-loop matrix multiplication

for (ib = 0; ib < N; ib += T)
  for (jb = 0; jb < N; jb += T)
    for (kb = 0; kb < N; kb += D) {
      // the following two loops are invoked by drawing
      // a quadrilateral of size T x T
      for (i = ib; i < ib + T; i++)
        for (j = jb; j < jb + T; j++)
          // the following loop is performed
          // inside a fragment program
          for (k = kb; k < kb + D; k++)
            Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];
    }
28. Measurements
- NVIDIA 7900 GTX GPU
- 17.6 GFLOPS
- 40 GB/s
- NVIDIA 6800 Ultra
- 8 GFLOPS
29. Matrix-matrix multiplication in CUDA/CTM
- 100 GFLOPS
- How?
- Use of extended memory formats (fetch4) and deeper loop unrolling (R580)
- Shared memory (G80)
- On-chip shared memory lets you keep blocks of data close to the ALUs rather than ping-ponging off-chip through an FBO
- Shared memory is only exposed through CUDA
30. Introduction to CUDA
- Thread batching
- Thread block
- A batch of threads that can cooperate
- Sharing data through fast shared memory
- Synchronize their execution
- Grid of thread blocks
- Blocks that execute the same kernel can be batched together
- Reduced thread cooperation between blocks
- Each thread has a threadID; each block has a blockID
31. Memory model
32. Matrix multiplication using CUDA
- Each thread block is responsible for computing one square sub-matrix Csub of C
- The block size of Csub is 16
33. Host code

void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load A into device global memory
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    // Load B into device global memory
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C in device global memory
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Execution configuration
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch the device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C back from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);
}
34. Device code

__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    int aStep = BLOCK_SIZE;

    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * wB;

    float Csub = 0;
35. Device code (continued)

    // Iterate through all sub-matrices of A and B
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load one sub-matrix of A and one of B
        // from global memory to shared memory
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices
        // are fully loaded by all threads
        __syncthreads();

        // Compute the product of the two sub-matrices
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the product of the two
        // sub-matrices is done before loading new ones
        __syncthreads();
    }

    // Write the block sub-matrix to global memory
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
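The kernel's index arithmetic (aBegin, aStep, A[a + wA * ty + tx], and so on) can be checked on the CPU by walking the same tiles with ordinary loops in place of blocks and threads. A sketch of such a verification (a hypothetical harness, not CUDA code):

```c
#define BLOCK_SIZE 2

/* CPU emulation of Muld: for each (bx, by) block and (tx, ty) thread,
   walk the same sub-matrix indices the kernel uses and accumulate Csub. */
static void muld_cpu(const float *A, const float *B, int wA, int wB,
                     float *C, int hA) {
    for (int by = 0; by < hA / BLOCK_SIZE; by++)
    for (int bx = 0; bx < wB / BLOCK_SIZE; bx++)
    for (int ty = 0; ty < BLOCK_SIZE; ty++)
    for (int tx = 0; tx < BLOCK_SIZE; tx++) {
        int aBegin = wA * BLOCK_SIZE * by;
        int aEnd   = aBegin + wA - 1;
        int aStep  = BLOCK_SIZE;
        int bBegin = BLOCK_SIZE * bx;
        int bStep  = BLOCK_SIZE * wB;
        float Csub = 0;
        for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) {
            float As[BLOCK_SIZE][BLOCK_SIZE], Bs[BLOCK_SIZE][BLOCK_SIZE];
            for (int y = 0; y < BLOCK_SIZE; y++)   /* all "threads" load */
                for (int x = 0; x < BLOCK_SIZE; x++) {
                    As[y][x] = A[a + wA * y + x];
                    Bs[y][x] = B[b + wB * y + x];
                }
            for (int k = 0; k < BLOCK_SIZE; k++)
                Csub += As[ty][k] * Bs[k][tx];
        }
        int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
        C[c + wB * ty + tx] = Csub;
    }
}

/* Compare the emulated kernel with a naive row-major product. */
int check_muld(void) {
    enum { HA = 4, WA = 4, WB = 4 };
    float A[HA * WA], B[WA * WB], C[HA * WB], R[HA * WB];
    for (int i = 0; i < HA * WA; i++) A[i] = (float)(i % 5);
    for (int i = 0; i < WA * WB; i++) B[i] = (float)(i % 3);
    for (int i = 0; i < HA; i++)
        for (int j = 0; j < WB; j++) {
            float s = 0;
            for (int k = 0; k < WA; k++) s += A[i * WA + k] * B[k * WB + j];
            R[i * WB + j] = s;
        }
    muld_cpu(A, B, WA, WB, C, HA);
    for (int i = 0; i < HA * WB; i++)
        if (C[i] != R[i]) return 0;
    return 1;
}
```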
36. Conclusions
- Different ways of representing matrices and vectors on the GPU
- Various algorithms for matrix-matrix multiplication and vector-vector operations on the GPU
- Computation throughput of the GPU
- Memory performance of the GPU
37. Questions?
38. Thank you
39. References
- E.S. Larsen and D. McAllister. Fast matrix multiplies using graphics hardware. Supercomputing 2001.
- Dominik Göddeke. GPGPU tutorials. http://www.mathematik.uni-dortmund.de/goeddeke/gpgpu/index.html
- Adam Moravanszky. Dense matrix algebra on the GPU. 2003. http://www.shaderx2.com/shaderx.PDF
- Nico Galoppo, Naga K. Govindaraju, Michael Henson and Dinesh Manocha. LU-GPU: Efficient algorithms for solving dense linear systems on graphics hardware. Supercomputing 2005.
- CUDA Programming Guide and CUDA BLAS. http://developer.nvidia.com/object/cuda.html
40. References (contd.)
- K. Fatahalian, J. Sugerman, and P. Hanrahan. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication. Graphics Hardware 2004.
- Jens Krüger. Linear algebra on GPUs. SIGGRAPH 2004 course notes.
- Naga K. Govindaraju, Scott Larsen, Jim Gray, Dinesh Manocha. A memory model for scientific algorithms on graphics processors. Supercomputing 2006.