Transcript and Presenter's Notes

Title: Linear Algebra I


1
Linear Algebra I
  • Sashi Kumar Penta
  • GPGPU class presentation
  • Spring 2007

2
LA 1 Outline
  • Motivation
  • Dense matrices and vectors
  • Representation
  • Vector-Vector operations
  • Vector reduction type operations
  • Matrix-Matrix multiplication
  • Larsen et al.
  • Fatahalian et al.
  • Govindaraju et al.
  • Introduction to CUDA
  • Matrix multiplication

3
LA 2 Outline
  • Memory requirements for balanced computer
    architectures
  • Sparse matrices
  • Banded matrices
  • Random sparse matrices
  • Sparse Matrix-Vector multiplication
  • Solving Navier-Stokes on the GPU

4
Why LA on GPUs?
  • Applications of LA
  • Scientific computations
  • Search (LSI)
  • Graphics Simulations
  • The GPU is a fast streaming processor
  • LA operations are easily streamable
  • The result of the computation is already on the
    GPU, ready for display (useful for simulations)

5
Arrays = textures
  • 2D arrays are the native GPU data layout
  • Maximum texture size on the GPU: 8K x 8K
  • Texture coordinates are used to access values
    stored in the textures (see the index-mapping
    sketch below)
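As a concrete illustration (not from the slides), here is one common
way to map a length-N vector onto a square texture; the helper names
below are hypothetical:

#include <math.h>

/* Hypothetical helpers: pack a length-N vector into a texSize x texSize
   texture in row-major order (assumes N <= texSize * texSize). */
int texture_side(int N)
{
    return (int)ceil(sqrt((double)N));  /* smallest square that fits N */
}

void vector_index_to_texel(int i, int texSize, int* x, int* y)
{
    *x = i % texSize;  /* column of the texel holding element i */
    *y = i / texSize;  /* row of the texel holding element i    */
}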

6
Representation
(Figure: representing matrices and vectors as 2D textures. Courtesy
Jens Krüger)
7
Matrix multiplication: Why is it important?
  • Inherently parallel
  • Memory intensive
  • Benefits greatly from GPUs
  • Used for benchmarking hardware

8
Early: pre-floating-point
  • Larsen and McAllister described an early,
    pre-floating-point implementation of matrix
    multiplies
  • Used 2D textures and blending operations to
    perform the matrix product
  • NVIDIA GeForce3 GPU, 8-bit fixed-point output
  • About 4.4 billion operations per second

9
L-M GPU-based nested-loop matrix multiplication
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    Z[i][j] = 0;
    // Each iteration of the following loop is a
    // quadrilateral rasterization of size N x N
    for (k = 0; k < N; k++)
      Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];
  }

10
Continued
Multitexturing in modulate mode lets the graphics
hardware multiply values from two separate
textures.
11
Continued
Load texture texA with matrix A.
Load texture texB with matrix B.
Set the multitexturing mode to modulate.
Set the framebuffer write mode to accumulate.
For i = 0 .. n-1:
  Draw an n x n pixel rectangle with
    texA coords (0,i), (0,i), (n-1,i), (n-1,i) and
    texB coords (i,0), (i,n-1), (i,n-1), (i,0).
End
The screen contains the result of A x B.
12
Floating point textures
  • Texture targets
  • GL_TEXTURE_2D
  • Coordinates have to be normalized to [0, 1]
  • GL_TEXTURE_RECTANGLE_ARB
  • Coordinates are not normalized (allocation
    sketched below)
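The slides do not show the allocation call itself; here is a minimal
sketch, assuming the ARB_texture_rectangle and ARB_texture_float
extensions are available (the GL_LUMINANCE32F_ARB format token comes
from the latter):

/* Sketch: allocate a texSize x texSize single-channel float texture
   on a GL_TEXTURE_RECTANGLE_ARB target (unnormalized coordinates). */
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex);
/* nearest filtering: we want exact texel fetches, not interpolation */
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
/* 32-bit float luminance texels; pass NULL now, upload data later */
glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_LUMINANCE32F_ARB,
             texSize, texSize, 0, GL_LUMINANCE, GL_FLOAT, NULL);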

13
Operations
Courtesy Jens Krüger
  • Vector-Vector Operations
  • Reduced to 2D texture operations
  • Coded in pixel shaders
  • Example: Vector1 + Vector2 → Vector3

(Diagram: a static quad is processed by the vertex
shader and the pixel shader, and the result is
written to a render texture.)
14
Vector-vector operation: saxpy
  • Scalar alpha x plus y, i.e. (alpha * x + y)
  • y_new = y_old + alpha * x
  • On the CPU:
  • for (int i = 0; i < N; i++)
  •   Y[i] = Y[i] + alpha * X[i];
  • Calculations are independent
  • Computational kernel:
  • Y_new[i] = Y_old[i] + alpha * X[i]

15
Kernels = shaders
float saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float result;
    float y = tex2D(textureY, coords);
    float x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}

16
Four-channel RGBA texture
float4 saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float4 result;
    float4 y = tex2D(textureY, coords);
    float4 x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}

Four-channel textures improve the efficiency of
both fetching and computation.
17
Computing = drawing
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
    glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
    glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
glEnd();

Texture rectangles
18
Computing = drawing
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0);  glVertex2f(0.0, 0.0);
    glTexCoord2f(1.0, 0.0);  glVertex2f(texSize, 0.0);
    glTexCoord2f(1.0, 1.0);  glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, 1.0);  glVertex2f(0.0, texSize);
glEnd();

Texture2D
19
Reduction-type operations
  • Maximum, vector norm, dot products, etc.
  • One method: render a 1x1 output
  • Uses only one parallel processing element (PE)
  • Might exceed the maximum allowed shader length
  • Better: compute the local maximum of each 2x2
    group of elements in one kernel (a CPU sketch of
    one pass follows below)
  • Input M by M, output M/2 by M/2
  • Recursively repeated until the final 1 by 1
    result
  • Logarithmic number of iterations
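For reference, a small CPU version of one 2x2 max-reduction pass (an
illustrative sketch, not taken from the slides; the function name
reduce_max_pass and the row-major layout are assumptions):

/* One pass: shrink an m x m row-major array to (m/2) x (m/2) by
   taking the maximum of each 2x2 block. Applied recursively
   (m -> m/2 -> ... -> 1), this needs log2(m) passes, which is what
   the GPU version does with one render-to-texture pass per level. */
void reduce_max_pass(const float* in, float* out, int m)
{
    int half = m / 2;
    for (int y = 0; y < half; ++y)
        for (int x = 0; x < half; ++x) {
            float a = in[(2*y)     * m + (2*x)];
            float b = in[(2*y)     * m + (2*x + 1)];
            float c = in[(2*y + 1) * m + (2*x)];
            float d = in[(2*y + 1) * m + (2*x + 1)];
            float ab = (a > b) ? a : b;
            float cd = (c > d) ? c : d;
            out[y * half + x] = (ab > cd) ? ab : cd;
        }
}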

20
Rendering a quad
void drawQuad(int w, int h)
{
    glBegin(GL_QUADS);
        glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE1,  0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE2, -0.5,  0.5);
        glMultiTexCoord2f(GL_TEXTURE3,  0.5,  0.5);
        glVertex2f(0.0, 0.0);
        glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5,  0.5);
        glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5,  0.5);
        glVertex2f(w, 0.0);
        glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5, 2*h + 0.5);
        glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5, 2*h + 0.5);
        glVertex2f(w, h);
        glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE1,  0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h + 0.5);
        glMultiTexCoord2f(GL_TEXTURE3,  0.5, 2*h + 0.5);
        glVertex2f(0.0, h);
    glEnd();
}
21
maximum
float maximum(
    float2 left   : TEXCOORD0,
    float2 right  : TEXCOORD1,
    float2 top    : TEXCOORD2,
    float2 bottom : TEXCOORD3,
    uniform samplerRECT texture) : COLOR
{
    float val1 = texRECT(texture, left);
    float val2 = texRECT(texture, right);
    float val3 = texRECT(texture, top);
    float val4 = texRECT(texture, bottom);
    return max(val1, max(val2, max(val3, val4)));
}
22
Matrix-Matrix multiplication
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    C[i][j] = 0;
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
  }

  • On CPUs
  • Suffers from poor locality: elements of B are
    accessed column-wise, not in sequential order in
    memory
  • Bandwidth limited
  • The per-loop working-set size prevents bandwidth
    amplification from the closest and fastest levels
    of the memory hierarchy

A tuned CPU implementation therefore blocks the
inputs into small submatrices that fit in the
processor cache (sketched below).
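A minimal sketch of that cache-blocking idea (illustrative only; the
function name matmul_blocked and the block size BS are assumptions,
and N is assumed to be a multiple of BS):

#define BS 32   /* chosen so that three BS x BS tiles fit in cache */

void matmul_blocked(const float* A, const float* B, float* C, int N)
{
    for (int i = 0; i < N * N; ++i)
        C[i] = 0.0f;
    /* loop over tiles; all accesses inside the innermost three loops
       stay within three cached BS x BS blocks */
    for (int ib = 0; ib < N; ib += BS)
        for (int kb = 0; kb < N; kb += BS)
            for (int jb = 0; jb < N; jb += BS)
                for (int i = ib; i < ib + BS; ++i)
                    for (int k = kb; k < kb + BS; ++k) {
                        float a = A[i * N + k];
                        for (int j = jb; j < jb + BS; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}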
23
Fatahalian et al.: Matrix-matrix multiplication on
the GPU
  • 2x2 blocks of matrix elements stored in
    4-component texels
  • Reads 2 rows from matrix A and 2 columns of B to
    compute 4 elements of the output matrix
  • Each shader program performs 4 inner-loop
    iterations of the CPU program
  • GPU texture caches are organized to capture 2D
    locality, so sequential accesses along either
    axis are cache coherent

for (k = 0; k < N/2; k++)
  C[i][j].xyzw += A[i][k].xxzz * B[k][j].xyxy
                + A[i][k].yyww * B[k][j].zwzw;
24
Multipass algorithm
  • Single-pass instruction count exceeds the limits
    of GPU architectures for sufficiently large
    matrices
  • Pack 4 consecutive elements from a matrix column
    into a texel
  • Compute 4x4 matrix by 4x1 vector products
  • 4x reuse of the data in each texel of B
  • 6 fetches for 4 MAD operations: 1.5x more fetches

C[i][j].xyzw = accum[i][j].xyzw
             + A[i][k].xyzw   * B[k/4][j].xxxx
             + A[i][k+1].xyzw * B[k/4][j].yyyy
             + A[i][k+2].xyzw * B[k/4][j].zzzz
             + A[i][k+3].xyzw * B[k/4][j].wwww;
25
Measurements: multiplication of 1024 x 1024
matrices
The available floating-point bandwidth from the
closest cache on these GPUs is up to several times
lower than that of current CPUs to their L1
caches.
26
Govindaraju et al.: Matrix-matrix multiplication on
the GPU
  • A memory model for scientific algorithms on
    graphics processors
  • Explicit blocking to avoid hardware dependencies
  • Computation on tiles of size T x T is invoked by
    drawing quadrilaterals of size T x T on the
    screen
  • A single fragment program evaluates the dot
    product of vectors of size D from the inputs X
    and Y

27
GPU-based nested loop matrix multiplication
for (ib = 0; ib < N; ib += T)
  for (jb = 0; jb < N; jb += T)
    for (kb = 0; kb < N; kb += D)
      // the following two loops are invoked by drawing
      // a quadrilateral of size T x T
      for (i = ib; i < ib + T; i++)
        for (j = jb; j < jb + T; j++)
          // the following loop is performed inside
          // a fragment program
          for (k = kb; k < kb + D; k++)
            Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];

28
Measurements
  • NVIDIA 7900 GTX GPU
  • 17.6 GFLOPS
  • 40 GB/s
  • NVIDIA 6800 Ultra
  • 8 GFLOPS

29
Matrix-Matrix multiplication in CUDA/CTM
  • 100 GFLOPS
  • How?
  • Use of extended memory formats (fetch4) and
    deeper loop unrolling (R580)
  • Shared memory (G80)
  • On-chip shared memory lets you keep blocks of
    data close to the ALUs rather than having to
    ping-pong off-chip through an FBO
  • Shared memory is only exposed through CUDA

30
Introduction to CUDA
  • Thread batching
  • Thread block
  • A batch of threads that can cooperate by
  • sharing data through fast shared memory
  • synchronizing their execution
  • Grid of thread blocks
  • Blocks that execute the same kernel can be
    batched together into a grid
  • Reduced thread cooperation across blocks
  • Each thread has a threadID within its block, and
    each block a blockID within the grid (see the
    sketch below)
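To make the ID scheme concrete, a minimal CUDA sketch (illustrative,
not from the slides): the earlier saxpy operation written as a kernel
in which each thread computes one element from its block and thread
IDs.

// Hypothetical example: saxpy in CUDA. Each thread derives a unique
// global index i from its blockID and threadID.
__global__ void saxpy(int n, float alpha, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be larger than n
        y[i] = y[i] + alpha * x[i];
}

// Launch with enough blocks of 256 threads to cover all n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, alpha, x_d, y_d);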
31
Memory model
(Figure: the CUDA memory model diagram.)
32
Matrix multiplication using CUDA
  • Each thread block is responsible for computing
    one square sub-matrix Csub of C.
  • The block size (side length of Csub) is 16

33
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load A into device global memory
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    // Load B into device global memory
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on device global memory
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Execution configuration
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
34
__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // aBegin: index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // aEnd: index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    int aStep = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * wB;
    float Csub = 0;
35
    // Iterate through all sub-matrices of A and B
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load one sub-matrix of A and one of B from global memory
        // to shared memory
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        // by all threads
        __syncthreads();

        // Compute the product of the two sub-matrices
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the product of the two sub-matrices
        // is done before loading the next pair
        __syncthreads();
    }

    // Write the block sub-matrix to global memory
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
36
Conclusions
  • Different ways of representing matrices and
    vectors on the GPU
  • Various algorithms for matrix-matrix
    multiplication and vector-vector operations on
    the GPU
  • Computation throughput of the GPU
  • Memory performance of the GPU

37
Questions?
38
Thank you
39
References
  • E. S. Larsen and D. McAllister. Fast matrix
    multiplies using graphics hardware.
    Supercomputing 2001.
  • Dominik Göddeke. GPGPU tutorials.
    http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/index.html
  • Adam Moravanszky. Dense matrix algebra on the
    GPU. 2003. http://www.shaderx2.com/shaderx.PDF
  • Nico Galoppo, Naga K. Govindaraju, Michael Henson
    and Dinesh Manocha. LU-GPU: Efficient algorithms
    for solving dense linear systems on graphics
    hardware. Supercomputing 2005.
  • CUDA Programming Guide and CUDA BLAS.
    http://developer.nvidia.com/object/cuda.html

40
References ctd..
  • K. Fatahalian, J. Sugerman, and P. Hanrahan.
    Understanding the efficiency of GPU algorithms
    for matrix-matrix multiplication. Graphics
    Hardware 2004.
  • Jens Krüger. Linear algebra on GPUs. SIGGRAPH
    2004 course notes.
  • Naga K. Govindaraju, Scott Larsen, Jim Gray, and
    Dinesh Manocha. A memory model for scientific
    algorithms on graphics processors.
    Supercomputing 2006.