Transcript and Presenter's Notes

Title: Linear Algebra I


1
Linear Algebra I
  • Sashi Kumar Penta
  • GPGPU class presentation
  • Spring 2007

2
LA 1 Outline
  • Motivation
  • Dense matrices and vectors
  • Representation
  • Vector-Vector operations
  • Vector reduction type operations
  • Matrix-Matrix multiplication
  • Larsen et al.
  • Fatahalian et al.
  • Govindaraju et al.
  • Introduction to CUDA
  • Matrix multiplication

3
LA 2 Outline
  • Memory requirements for balanced computer
    architectures
  • Sparse matrices
  • Banded matrices
  • Random sparse matrices
  • Sparse Matrix-Vector multiplication
  • Solving Navier-Stokes on the GPU

4
Why LA on GPUs?
  • Applications of LA
  • Scientific computations
  • Search (LSI)
  • Graphics Simulations
  • The GPU is a fast streaming processor
  • LA operations are easily streamable
  • The result of the computation is already on the
    GPU, ready for display (useful for simulations)

5
Arrays = textures
  • 2D arrays are the native GPU data layout
  • Maximum texture size on the GPU: 8K x 8K
  • Texture coordinates are used to access values
    stored in the textures (see the index-mapping
    sketch below)
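As a concrete illustration (not from the slides), here is one common
way to map a length-N vector onto a square texture; the helper names
below are hypothetical:

#include <math.h>

/* Hypothetical helpers: pack a length-N vector into a texSize x texSize
   texture in row-major order (assumes N <= texSize * texSize). */
int texture_side(int N)
{
    return (int)ceil(sqrt((double)N));  /* smallest square that fits N */
}

void vector_index_to_texel(int i, int texSize, int* x, int* y)
{
    *x = i % texSize;  /* column of the texel holding element i */
    *y = i / texSize;  /* row of the texel holding element i    */
}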

6
Representation
(Figure: representing matrices and vectors as 2D textures. Courtesy
Jens Krüger)
7
Matrix multiplication: Why is it important?
  • Inherently parallel
  • Memory intensive
  • Benefits greatly from GPUs
  • Used for benchmarking hardware

8
Early: pre-floating-point
  • Larsen and McAllister described an early,
    pre-floating-point implementation of matrix
    multiplies
  • Used 2D textures and blending operations to
    perform the matrix product
  • NVIDIA GeForce3 GPU, 8-bit fixed-point output
  • About 4.4 billion operations per second

9
L-M GPU-based nested-loop matrix multiplication
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    Z[i][j] = 0;
    // Each iteration of the following loop is a
    // quadrilateral rasterization of size N x N
    for (k = 0; k < N; k++)
      Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];
  }

10
Continued
Multitexturing in modulate mode lets the graphics
hardware multiply values from two separate
textures.
11
Continued
Load texture texA with matrix A.
Load texture texB with matrix B.
Set the multitexturing mode to modulate.
Set the framebuffer write mode to accumulate.
For i = 0 .. n-1:
  Draw an n x n pixel rectangle with
    texA coords (0,i), (0,i), (n-1,i), (n-1,i) and
    texB coords (i,0), (i,n-1), (i,n-1), (i,0).
End
The screen contains the result of A x B.
12
Floating point textures
  • Texture targets
  • GL_TEXTURE_2D
  • Coordinates have to be normalized to [0, 1]
  • GL_TEXTURE_RECTANGLE_ARB
  • Coordinates are not normalized (allocation
    sketched below)
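The slides do not show the allocation call itself; here is a minimal
sketch, assuming the ARB_texture_rectangle and ARB_texture_float
extensions are available (the GL_LUMINANCE32F_ARB format token comes
from the latter):

/* Sketch: allocate a texSize x texSize single-channel float texture
   on a GL_TEXTURE_RECTANGLE_ARB target (unnormalized coordinates). */
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_RECTANGLE_ARB, tex);
/* nearest filtering: we want exact texel fetches, not interpolation */
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_RECTANGLE_ARB, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
/* 32-bit float luminance texels; pass NULL now, upload data later */
glTexImage2D(GL_TEXTURE_RECTANGLE_ARB, 0, GL_LUMINANCE32F_ARB,
             texSize, texSize, 0, GL_LUMINANCE, GL_FLOAT, NULL);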

13
Operations
Courtesy Jens Krüger
  • Vector-Vector Operations
  • Reduced to 2D texture operations
  • Coded in pixel shaders
  • Example: Vector1 + Vector2 → Vector3

(Diagram: a static quad is processed by the vertex
shader and the pixel shader, and the result is
written to a render texture.)
14
Vector-vector operation: saxpy
  • Scalar alpha x plus y, i.e. (alpha * x + y)
  • y_new = y_old + alpha * x
  • On the CPU:
  • for (int i = 0; i < N; i++)
  •   Y[i] = Y[i] + alpha * X[i];
  • Calculations are independent
  • Computational kernel:
  • Y_new[i] = Y_old[i] + alpha * X[i]

15
Kernels = shaders
float saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float result;
    float y = tex2D(textureY, coords);
    float x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}

16
Four-channel RGBA texture
float4 saxpy(
    float2 coords : TEXCOORD0,
    uniform sampler2D textureY,
    uniform sampler2D textureX,
    uniform float alpha) : COLOR
{
    float4 result;
    float4 y = tex2D(textureY, coords);
    float4 x = tex2D(textureX, coords);
    result = y + alpha * x;
    return result;
}

Four-channel textures improve the efficiency of
both fetching and computation.
17
Computing = drawing
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0);         glVertex2f(0.0, 0.0);
    glTexCoord2f(texSize, 0.0);     glVertex2f(texSize, 0.0);
    glTexCoord2f(texSize, texSize); glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, texSize);     glVertex2f(0.0, texSize);
glEnd();

Texture rectangles
18
Computing = drawing
glPolygonMode(GL_FRONT, GL_FILL);
glBegin(GL_QUADS);
    glTexCoord2f(0.0, 0.0);  glVertex2f(0.0, 0.0);
    glTexCoord2f(1.0, 0.0);  glVertex2f(texSize, 0.0);
    glTexCoord2f(1.0, 1.0);  glVertex2f(texSize, texSize);
    glTexCoord2f(0.0, 1.0);  glVertex2f(0.0, texSize);
glEnd();

Texture2D
19
Reduction-type operations
  • Maximum, vector norm, dot products, etc.
  • One method: render a 1x1 output
  • Uses only one parallel processing element (PE)
  • Might exceed the maximum allowed shader length
  • Better: compute the local maximum of each 2x2
    group of elements in one kernel (a CPU sketch of
    one pass follows below)
  • Input M by M, output M/2 by M/2
  • Recursively repeated until the final 1 by 1
    result
  • Logarithmic number of iterations
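For reference, a small CPU version of one 2x2 max-reduction pass (an
illustrative sketch, not taken from the slides; the function name
reduce_max_pass and the row-major layout are assumptions):

/* One pass: shrink an m x m row-major array to (m/2) x (m/2) by
   taking the maximum of each 2x2 block. Applied recursively
   (m -> m/2 -> ... -> 1), this needs log2(m) passes, which is what
   the GPU version does with one render-to-texture pass per level. */
void reduce_max_pass(const float* in, float* out, int m)
{
    int half = m / 2;
    for (int y = 0; y < half; ++y)
        for (int x = 0; x < half; ++x) {
            float a = in[(2*y)     * m + (2*x)];
            float b = in[(2*y)     * m + (2*x + 1)];
            float c = in[(2*y + 1) * m + (2*x)];
            float d = in[(2*y + 1) * m + (2*x + 1)];
            float ab = (a > b) ? a : b;
            float cd = (c > d) ? c : d;
            out[y * half + x] = (ab > cd) ? ab : cd;
        }
}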

20
Rendering a quad
void drawQuad(int w, int h)
{
    glBegin(GL_QUADS);
        glMultiTexCoord2f(GL_TEXTURE0, -0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE1,  0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE2, -0.5,  0.5);
        glMultiTexCoord2f(GL_TEXTURE3,  0.5,  0.5);
        glVertex2f(0.0, 0.0);
        glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, -0.5);
        glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5,  0.5);
        glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5,  0.5);
        glVertex2f(w, 0.0);
        glMultiTexCoord2f(GL_TEXTURE0, 2*w - 0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE1, 2*w + 0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE2, 2*w - 0.5, 2*h + 0.5);
        glMultiTexCoord2f(GL_TEXTURE3, 2*w + 0.5, 2*h + 0.5);
        glVertex2f(w, h);
        glMultiTexCoord2f(GL_TEXTURE0, -0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE1,  0.5, 2*h - 0.5);
        glMultiTexCoord2f(GL_TEXTURE2, -0.5, 2*h + 0.5);
        glMultiTexCoord2f(GL_TEXTURE3,  0.5, 2*h + 0.5);
        glVertex2f(0.0, h);
    glEnd();
}
21
maximum
float maximum(
    float2 left   : TEXCOORD0,
    float2 right  : TEXCOORD1,
    float2 top    : TEXCOORD2,
    float2 bottom : TEXCOORD3,
    uniform samplerRECT texture) : COLOR
{
    float val1 = texRECT(texture, left);
    float val2 = texRECT(texture, right);
    float val3 = texRECT(texture, top);
    float val4 = texRECT(texture, bottom);
    return max(val1, max(val2, max(val3, val4)));
}
22
Matrix-Matrix multiplication
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++) {
    C[i][j] = 0;
    for (k = 0; k < N; k++)
      C[i][j] += A[i][k] * B[k][j];
  }

  • On CPUs
  • Suffers from poor locality: elements of B are
    accessed column-wise, not in sequential order in
    memory
  • Bandwidth limited
  • The per-loop working-set size prevents bandwidth
    amplification from the closest and fastest levels
    of the memory hierarchy

A tuned CPU implementation therefore blocks the
inputs into small submatrices that fit in the
processor cache (sketched below).
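A minimal sketch of that cache-blocking idea (illustrative only; the
function name matmul_blocked and the block size BS are assumptions,
and N is assumed to be a multiple of BS):

#define BS 32   /* chosen so that three BS x BS tiles fit in cache */

void matmul_blocked(const float* A, const float* B, float* C, int N)
{
    for (int i = 0; i < N * N; ++i)
        C[i] = 0.0f;
    /* loop over tiles; all accesses inside the innermost three loops
       stay within three cached BS x BS blocks */
    for (int ib = 0; ib < N; ib += BS)
        for (int kb = 0; kb < N; kb += BS)
            for (int jb = 0; jb < N; jb += BS)
                for (int i = ib; i < ib + BS; ++i)
                    for (int k = kb; k < kb + BS; ++k) {
                        float a = A[i * N + k];
                        for (int j = jb; j < jb + BS; ++j)
                            C[i * N + j] += a * B[k * N + j];
                    }
}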
23
Fatahalian et al.: Matrix-matrix multiplication on
the GPU
  • 2x2 blocks of matrix elements stored in
    4-component texels
  • Reads 2 rows from matrix A and 2 columns of B to
    compute 4 elements of the output matrix
  • Each shader program performs 4 inner-loop
    iterations of the CPU program
  • GPU texture caches are organized to capture 2D
    locality, so sequential accesses along either
    axis are cache coherent

for (k = 0; k < N/2; k++)
  C[i][j].xyzw += A[i][k].xxzz * B[k][j].xyxy
                + A[i][k].yyww * B[k][j].zwzw;
24
Multipass algorithm
  • Single-pass instruction count exceeds the limits
    of GPU architectures for sufficiently large
    matrices
  • Pack 4 consecutive elements from a matrix column
    into a texel
  • Compute 4x4 matrix by 4x1 vector products
  • 4x reuse of the data in each texel of B
  • 6 fetches for 4 MAD operations: 1.5x more fetches

C[i][j].xyzw = accum[i][j].xyzw
             + A[i][k].xyzw   * B[k/4][j].xxxx
             + A[i][k+1].xyzw * B[k/4][j].yyyy
             + A[i][k+2].xyzw * B[k/4][j].zzzz
             + A[i][k+3].xyzw * B[k/4][j].wwww;
25
Measurements: multiplication of 1024 x 1024
matrices
The available floating-point bandwidth from the
closest cache on these GPUs is up to several times
lower than that of current CPUs to their L1
caches.
26
Govindaraju et al.: Matrix-matrix multiplication on
the GPU
  • A memory model for scientific algorithms on
    graphics processors
  • Explicit blocking to avoid hardware dependencies
  • Computation on tiles of size T x T is invoked by
    drawing quadrilaterals of size T x T on the
    screen
  • A single fragment program evaluates the dot
    product of vectors of size D from the inputs X
    and Y

27
GPU-based nested loop matrix multiplication
for (ib = 0; ib < N; ib += T)
  for (jb = 0; jb < N; jb += T)
    for (kb = 0; kb < N; kb += D)
      // the following two loops are invoked by drawing
      // a quadrilateral of size T x T
      for (i = ib; i < ib + T; i++)
        for (j = jb; j < jb + T; j++)
          // the following loop is performed inside
          // a fragment program
          for (k = kb; k < kb + D; k++)
            Z[i][j] = Z[i][j] + X[i][k] * Y[k][j];

28
Measurements
  • NVIDIA 7900 GTX GPU
  • 17.6 GFLOPS
  • 40 GB/s
  • NVIDIA 6800 Ultra
  • 8 GFLOPS

29
Matrix-Matrix multiplication in CUDA/CTM
  • 100 GFLOPS
  • How?
  • Use of extended memory formats (fetch4) and
    deeper loop unrolling (R580)
  • Shared memory (G80)
  • On-chip shared memory lets you keep blocks of
    data close to the ALUs rather than having to
    ping-pong off-chip through an FBO
  • Shared memory is only exposed through CUDA

30
Introduction to CUDA
  • Thread batching
  • Thread block
  • A batch of threads that can cooperate by
  • sharing data through fast shared memory
  • synchronizing their execution
  • Grid of thread blocks
  • Blocks that execute the same kernel can be
    batched together into a grid
  • Reduced thread cooperation across blocks
  • Each thread has a threadID within its block, and
    each block a blockID within the grid (see the
    sketch below)
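To make the ID scheme concrete, a minimal CUDA sketch (illustrative,
not from the slides): the earlier saxpy operation written as a kernel
in which each thread computes one element from its block and thread
IDs.

// Hypothetical example: saxpy in CUDA. Each thread derives a unique
// global index i from its blockID and threadID.
__global__ void saxpy(int n, float alpha, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // guard: the grid may be larger than n
        y[i] = y[i] + alpha * x[i];
}

// Launch with enough blocks of 256 threads to cover all n elements:
// saxpy<<<(n + 255) / 256, 256>>>(n, alpha, x_d, y_d);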
31
Memory model
(Figure: the CUDA memory model diagram.)
32
Matrix multiplication using CUDA
  • Each thread block is responsible for computing
    one square sub-matrix Csub of C.
  • The block size (side length of Csub) is 16

33
void Mul(const float* A, const float* B, int hA, int wA, int wB, float* C)
{
    // Load A into device global memory
    float* Ad;
    int size = hA * wA * sizeof(float);
    cudaMalloc((void**)&Ad, size);
    cudaMemcpy(Ad, A, size, cudaMemcpyHostToDevice);

    // Load B into device global memory
    float* Bd;
    size = wA * wB * sizeof(float);
    cudaMalloc((void**)&Bd, size);
    cudaMemcpy(Bd, B, size, cudaMemcpyHostToDevice);

    // Allocate C on device global memory
    float* Cd;
    size = hA * wB * sizeof(float);
    cudaMalloc((void**)&Cd, size);

    // Execution configuration
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
    dim3 dimGrid(wB / dimBlock.x, hA / dimBlock.y);

    // Launch device computation
    Muld<<<dimGrid, dimBlock>>>(Ad, Bd, wA, wB, Cd);

    // Read C from the device
    cudaMemcpy(C, Cd, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(Ad);
    cudaFree(Bd);
    cudaFree(Cd);
}
34
__global__ void Muld(const float* A, const float* B, int wA, int wB, float* C)
{
    int bx = blockIdx.x;
    int by = blockIdx.y;
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // aBegin: index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;
    // aEnd: index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;
    int aStep = BLOCK_SIZE;
    int bBegin = BLOCK_SIZE * bx;
    int bStep = BLOCK_SIZE * wB;
    float Csub = 0;
35
    // Iterate through all sub-matrices of A and B
    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

        // Load one sub-matrix of A and one of B from global memory
        // to shared memory
        As[ty][tx] = A[a + wA * ty + tx];
        Bs[ty][tx] = B[b + wB * ty + tx];

        // Synchronize to make sure both sub-matrices are fully loaded
        // by all threads
        __syncthreads();

        // Compute the product of the two sub-matrices
        for (int k = 0; k < BLOCK_SIZE; k++)
            Csub += As[ty][k] * Bs[k][tx];

        // Synchronize to make sure the product of the two sub-matrices
        // is done before loading the next pair
        __syncthreads();
    }

    // Write the block sub-matrix to global memory
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}
36
Conclusions
  • Different ways of representing matrices and
    vectors on the GPU
  • Various algorithms for matrix-matrix
    multiplication and vector-vector operations on
    the GPU
  • Computation throughput of the GPU
  • Memory performance of the GPU

37
Questions?
38
Thank you
39
References
  • E. S. Larsen and D. McAllister. Fast matrix
    multiplies using graphics hardware.
    Supercomputing 2001.
  • Dominik Göddeke. GPGPU tutorials.
    http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/index.html
  • Adam Moravanszky. Dense matrix algebra on the
    GPU. 2003. http://www.shaderx2.com/shaderx.PDF
  • Nico Galoppo, Naga K. Govindaraju, Michael Henson
    and Dinesh Manocha. LU-GPU: Efficient algorithms
    for solving dense linear systems on graphics
    hardware. Supercomputing 2005.
  • CUDA Programming Guide and CUDA BLAS.
    http://developer.nvidia.com/object/cuda.html

40
References ctd..
  • K. Fatahalian, J. Sugerman, and P. Hanrahan.
    Understanding the efficiency of GPU algorithms
    for matrix-matrix multiplication. Graphics
    Hardware 2004.
  • Jens Krüger. Linear algebra on GPUs. SIGGRAPH
    2004 course notes.
  • Naga K. Govindaraju, Scott Larsen, Jim Gray, and
    Dinesh Manocha. A memory model for scientific
    algorithms on graphics processors.
    Supercomputing 2006.