Scheduling of QR Factorization Algorithms

1

Scheduling of QR Factorization Algorithms
on SMP and Multi-Core Architectures
Euromicro International Conference on
Parallel, Distributed and Network-Based
Processing 2008
/ FLAME Working Note 24
Gregorio Quintana-Ortí et al., Universidad Jaume I,
Spain
Ernie Chan et al., University of Texas at Austin
Presented by Yi-Gang Tai

2
Content

Examines scalable implementations of QR
factorization targeting SMP and multicore architectures
Two algorithms-by-blocks
Conventional blocked algorithm with dynamic
scheduling
Givens rotations based blocked algorithm with
dynamic scheduling
FLAME/FLASH API
SuperMatrix runtime system

3
Introduction

New many-threaded architectures change the
parameters of linear algebra libraries
Cores faster
Latencies lower
Memory per core smaller
Parallelism for smaller problems
Traditional QR factorization of dense matrices is
based on Householder transformations
Column by column execution
Execution order constrained

4
QR Decomposition

$A = QR$, where
$A$ is an $m \times n$ matrix ($m \ge n$)
$Q$ is an $m \times m$ unitary matrix
$R$ is an $m \times n$ upper triangular matrix
Many methods to compute
Givens rotations
Gram-Schmidt
Householder transformations
For $m \gg n$ and $Q$ not needed explicitly, typically
the Householder method is used
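
As a quick check of the shapes listed above, here is a minimal sketch using NumPy's built-in QR routine. The matrix sizes and random seed are arbitrary choices for illustration, and the example assumes a real matrix, so "unitary" reduces to "orthogonal".

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 3                      # arbitrary sizes with m >= n
A = rng.standard_normal((m, n))

# mode="complete" returns the full m x m Q and the m x n R described above.
Q, R = np.linalg.qr(A, mode="complete")

print(Q.shape, R.shape)                   # (6, 6) (6, 3)
print(np.allclose(Q @ R, A))              # True: A = QR
print(np.allclose(Q.T @ Q, np.eye(m)))    # True: Q is orthogonal
print(np.allclose(R, np.triu(R)))         # True: R is upper triangular
```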

5
Householder Transformation

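Partition a column vector $x$ into $x_1$ and $x_2$; then the reflector
$H = I - 2 v v^T / (v^T v)$, with $v$ chosen so that $Hx$ has zeros in place of $x_2$,
is the Householder transformation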
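
A minimal sketch of how such a reflector can be built for a real vector. It uses the textbook form I - beta * v v^T rather than the exact parameterization on the slide, which is not preserved here; the function name `householder_vector` and the sample vector are hypothetical.

```python
import numpy as np

def householder_vector(x):
    """Return (v, beta) so that (I - beta * v v^T) @ x is a multiple of e1."""
    x = np.asarray(x, dtype=float)
    norm_x = np.linalg.norm(x)
    v = x.copy()
    if norm_x == 0.0:
        return v, 0.0                      # nothing to annihilate; H = I
    # Pick the sign opposite to x[0] so that v[0] does not suffer cancellation.
    alpha = -np.copysign(norm_x, x[0])
    v[0] -= alpha
    beta = 2.0 / (v @ v)
    return v, beta

x = np.array([3.0, 4.0, 0.0, 12.0])        # sample vector with norm 13
v, beta = householder_vector(x)
H = np.eye(len(x)) - beta * np.outer(v, v)
y = H @ x                                   # all entries below the first are zero
```

Here `y` has the norm of `x` in its first entry (up to sign) and zeros elsewhere, which is the property the QR algorithms rely on.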

6
Unblocked Householder QR

Cost: $2n^2(m - n/3)$ flops
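
A minimal NumPy sketch of an unblocked Householder QR for real matrices, written as a plain column-by-column loop. It is an illustration under those assumptions, not the FLAME formulation from the presentation; `householder_qr` is a hypothetical name.

```python
import numpy as np

def householder_qr(A):
    """Unblocked Householder QR: returns Q (m x m) and R (m x n) with A = Q R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for j in range(min(m - 1, n)):
        x = R[j:, j]
        norm_x = np.linalg.norm(x)
        if norm_x == 0.0:
            continue                        # column already zero below the diagonal
        v = x.copy()
        v[0] += np.copysign(norm_x, x[0])   # reflector vector for this column
        beta = 2.0 / (v @ v)
        # Apply H = I - beta v v^T from the left to the trailing columns.
        R[j:, j:] -= beta * np.outer(v, v @ R[j:, j:])
        # Accumulate Q <- Q H so that Q R stays equal to A.
        Q[:, j:] -= beta * np.outer(Q[:, j:] @ v, v)
    return Q, np.triu(R)                    # triu clears rounding residue below the diagonal

rng = np.random.default_rng(0)
A = rng.standard_normal((7, 4))
Q, R = householder_qr(A)
print(np.allclose(Q @ R, A))                # True
```

Each column step must finish before the next one starts, which is the execution-order constraint the introduction refers to.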

7
Blocked Householder QR

Cost: if $s \ll m, n$, equal to the unblocked algorithm
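
To illustrate the blocked structure, the sketch below factors one panel of width s at a time and then updates the trailing columns with a single matrix-matrix product. It is a simplification: it forms each panel's Q explicitly with NumPy instead of accumulating the reflectors in a compact form, as production libraries typically do, and it assumes a real matrix with m >= n. The name `blocked_qr_r` is hypothetical.

```python
import numpy as np

def blocked_qr_r(A, s):
    """Return the m x n upper triangular factor R of A, computed panel by panel."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    for j in range(0, min(m, n), s):
        w = min(s, n - j)                   # width of the current panel
        # Panel factorization: rows j..m-1, columns j..j+w-1.
        Qp, Rp = np.linalg.qr(R[j:, j:j + w], mode="complete")
        R[j:, j:j + w] = Rp
        # Trailing update: the bulk of the work, done as one matrix product.
        R[j:, j + w:] = Qp.T @ R[j:, j + w:]
    return R

rng = np.random.default_rng(0)
A = rng.standard_normal((9, 6))
R = blocked_qr_r(A, s=2)
# R^T R equals A^T A for any valid R factor, regardless of sign conventions.
print(np.allclose(R.T @ R, A.T @ A))        # True
```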

8
Basic Linear Algebra Subprograms

QR factorizations are implemented so that the bulk
of the computation is performed by BLAS
Benefits
Legacy libraries need not be modified
Parallelism achieved through multithreaded BLAS
Disadvantages
Degree of parallelism limited by BLAS
The end of each BLAS routine is a synchronization point
Algorithmic variants significantly impact
performance

9
Algorithms-by-blocks

To overcome the difficulties above
Expression of the algorithm is simplified; the actual block
storage is encapsulated
Consider $m = m_s s$, $n = n_s s$, with $s$ the block size
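
A small sketch of what storage by blocks can look like: the matrix becomes a grid of independent s x s blocks, and algorithms operate on whole blocks as their unit of data. The helper names `to_blocks` and `from_blocks` are hypothetical and are not part of the FLASH API; the sketch assumes s divides both dimensions.

```python
import numpy as np

def to_blocks(A, s):
    """Split A into a nested list of s x s blocks, each stored separately."""
    m, n = A.shape
    if m % s or n % s:
        raise ValueError("this sketch assumes s divides both m and n")
    return [[A[i:i + s, j:j + s].copy() for j in range(0, n, s)]
            for i in range(0, m, s)]

def from_blocks(blocks):
    """Reassemble the full matrix from its grid of blocks."""
    return np.block(blocks)

A = np.arange(24.0).reshape(4, 6)
blocks = to_blocks(A, 2)
print(len(blocks), len(blocks[0]))          # 2 3
print(np.array_equal(from_blocks(blocks), A))  # True
```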

10
Algorithms-by-blocks I

Can be decomposed to

11
Algorithms-by-blocks II

Givens rotations
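Given a vector $x = (x_1, x_2)^T$, $\gamma$ and $\sigma$ can be
determined so that $\gamma^2 + \sigma^2 = 1$ and
$\begin{pmatrix} \gamma & \sigma \\ -\sigma & \gamma \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} \pm\|x\|_2 \\ 0 \end{pmatrix}$
So QR can be computed with a sequence of such rotations, each annihilating one subdiagonal element

Cost: $3n^2(m - n/3)$ flops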
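
A minimal sketch of QR by Givens rotations for a real matrix, writing c and s for the cosine and sine of each rotation. The elimination order (bottom-up within each column) is one common choice and is an assumption here, as are the function names.

```python
import numpy as np

def givens(a, b):
    """Return (c, s) with c*c + s*s = 1 and [[c, s], [-s, c]] @ [a, b] = [r, 0]."""
    r = np.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r

def givens_qr(A):
    """QR by Givens rotations: returns Q (m x m) and R (m x n) with A = Q R."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    Q = np.eye(m)
    for j in range(n):
        # Zero column j from the bottom up, rotating adjacent rows i-1 and i.
        for i in range(m - 1, j, -1):
            c, s = givens(R[i - 1, j], R[i, j])
            G = np.array([[c, s], [-s, c]])
            R[i - 1:i + 1, :] = G @ R[i - 1:i + 1, :]
            Q[:, i - 1:i + 1] = Q[:, i - 1:i + 1] @ G.T
    return Q, np.triu(R)                    # triu clears rounding residue

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))
Q, R = givens_qr(A)
print(np.allclose(Q @ R, A))                # True
```

Because each rotation touches only two rows, rotations acting on disjoint row pairs are independent of one another.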

12
Algorithms-by-blocks II (contd.)

Partition $A$ into $m_t$ by $m_t$ blocks of size $t$
First, compute a QR factorization with the Householder method

13
Algorithms-by-blocks II (contd.)

Next, use Householder QR to factorize
then

Cost: $2n^2(m - n/3)(1 + s/(2t))$ flops
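
The matrices being factorized in these two steps are not preserved in the transcript. As a hedged illustration of the general idea, the sketch below assumes the common incremental scheme: factor the top block, then factor the triangular result stacked on the block below it. Only two t x t blocks are used, and NumPy's QR stands in for the Householder kernels.

```python
import numpy as np

rng = np.random.default_rng(1)
t = 4                                        # block size (arbitrary)
A0 = rng.standard_normal((t, t))             # top block
A1 = rng.standard_normal((t, t))             # block below it

# Step 1 (assumed): QR of the top block alone.
R0 = np.linalg.qr(A0, mode="r")

# Step 2 (assumed): QR of the triangular factor stacked on the next block.
R = np.linalg.qr(np.vstack([R0, A1]), mode="r")

# Reference: QR of the whole 2t x t column of blocks in one go.
R_ref = np.linalg.qr(np.vstack([A0, A1]), mode="r")

# The two R factors agree up to the sign of each row.
print(np.allclose(np.abs(R), np.abs(R_ref)))   # True
```

Working one block at a time in this way gives the scheduler many small, block-sized tasks instead of a few large panel operations.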

14
FLAME/FLASH API

15
SuperMatrix

Two phases
Analyzer
Dependency DAG construction
Scheduler/dispatcher
Runtime scheduling of block computations
Out-of-order computation
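
The two phases can be sketched in a few lines: an analyzer that derives dependencies from the blocks each task reads and writes, and a dispatcher that releases a task once all of its predecessors have finished. This is a toy, single-threaded illustration and not the SuperMatrix implementation; the task names, block labels, and function names are hypothetical.

```python
from collections import deque

def build_dag(tasks):
    """Analyzer: tasks is a list of (name, reads, writes); returns {task: predecessors}."""
    deps = {i: set() for i in range(len(tasks))}
    last_writer = {}                  # block -> index of the last task that wrote it
    readers = {}                      # block -> tasks that read it since that write
    for i, (_, reads, writes) in enumerate(tasks):
        for b in reads:
            if b in last_writer:
                deps[i].add(last_writer[b])         # read after write
        for b in writes:
            if b in last_writer:
                deps[i].add(last_writer[b])         # write after write
            deps[i].update(readers.get(b, ()))      # write after read
        for b in reads:
            readers.setdefault(b, set()).add(i)
        for b in writes:
            last_writer[b] = i
            readers[b] = set()
    return deps

def dispatch(deps):
    """Dispatcher: return task indices in an order that respects the DAG."""
    remaining = {i: set(d) for i, d in deps.items()}
    ready = deque(i for i, d in remaining.items() if not d)
    order = []
    while ready:
        i = ready.popleft()
        order.append(i)               # a real runtime would hand this to an idle thread
        for j, d in remaining.items():
            if i in d:
                d.remove(i)
                if not d:
                    ready.append(j)
    return order

# Hypothetical block tasks for a 2 x 3 grid of blocks.
tasks = [
    ("QR(A00)",            {"A00"},               {"A00"}),
    ("APPLY(A00; A01)",    {"A00", "A01"},        {"A01"}),
    ("APPLY(A00; A02)",    {"A00", "A02"},        {"A02"}),
    ("QR(A00, A10)",       {"A00", "A10"},        {"A00", "A10"}),
    ("APPLY(A10; A01,A11)", {"A10", "A01", "A11"}, {"A01", "A11"}),
    ("APPLY(A10; A02,A12)", {"A10", "A02", "A12"}, {"A02", "A12"}),
]
deps = build_dag(tasks)
order = dispatch(deps)
```

In this example tasks 1 and 2 depend only on task 0, and tasks 4 and 5 are independent of each other, so a runtime with several threads could execute each pair concurrently or out of program order.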

16
Experiments

SGI Altix 350
ccNUMA architecture
8 nodes, each with 2 x 1.5 GHz Intel Itanium 2 processors,
for a total of 16 CPUs
Theoretical peak performance: 96 GFLOPS
Intel Math Kernel Library (MKL) as BLAS

17
Experiments (contd.)

18
Conclusions

QR decomposition reviewed
Two algorithms-by-blocks presented
Experiments were conducted to compare the
algorithms, though the success of AB-II appears
to require further study
Programming effort greatly reduced with
FLAME/FLASH API
