Scheduling of QR Factorization Algorithms

1
  • Scheduling of QR Factorization Algorithms
  • on SMP and Multi-Core Architectures
  • Euromicro International Conference on
    Parallel, Distributed and Network-Based
    Processing 2008 / FLAME Working Note 24
  • Gregorio Quintana-Ortí et al., Universidad Jaume I,
    Spain
  • Ernie Chan et al., University of Texas at Austin
  • Presented by Yi-Gang Tai

2
Content
  • Examines scalable implementation of QR
    factorization targeting SMP and multicore
    architectures
  • Two algorithms-by-blocks
  • Conventional blocked algorithm with dynamic
    scheduling
  • Givens-rotation-based blocked algorithm with
    dynamic scheduling
  • FLAME/FLASH API
  • SuperMatrix runtime system

3
Introduction
  • New many-threaded architectures change the
    parameters of linear algebra libraries
  • Cores faster
  • Latencies lower
  • Memory per core smaller
  • Parallelism for smaller problems
  • Traditional QR factorization of a dense matrix
    is based on Householder transformations
  • Column by column execution
  • Execution order constrained

4
QR Decomposition
  • A = QR, where
  • A is an m × n matrix (m ≥ n)
  • Q is an m × m unitary matrix
  • R is an m × n upper triangular matrix
  • Many methods to compute
  • Givens rotations
  • Gram-Schmidt
  • Householder transformations
  • When m ≫ n and Q is not needed explicitly, the
    Householder method is typically used

5
Householder Transformation
  • Partition a column vector x into x1 and x2, then
  • is the Householder transformation
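
A minimal Python sketch of the idea, assuming the standard real-valued reflector H = I − 2vvᵀ/(vᵀv), which may be normalized differently from the formula on the slide. The function names `householder_vector` and `apply_reflector` are illustrative only. Here x1 is taken to be the first entry of x and x2 the remaining entries, and the reflector maps x onto a multiple of the first unit vector.

```python
import math

def householder_vector(x):
    """Return (v, alpha) such that (I - 2 v v^T / v^T v) x = (alpha, 0, ..., 0)."""
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # Choose the sign of alpha opposite to x[0] to avoid cancellation in v[0].
    alpha = -math.copysign(norm_x, x[0])
    v = list(x)
    v[0] -= alpha
    return v, alpha

def apply_reflector(v, y):
    """Compute H y = y - 2 v (v^T y) / (v^T v) without forming H explicitly."""
    vtv = sum(vi * vi for vi in v)
    if vtv == 0.0:
        return list(y)  # zero vector: H is the identity
    scale = 2.0 * sum(vi * yi for vi, yi in zip(v, y)) / vtv
    return [yi - scale * vi for vi, yi in zip(v, y)]

x = [3.0, 4.0, 0.0]  # x1 = 3.0, x2 = (4.0, 0.0)
v, alpha = householder_vector(x)
hx = apply_reflector(v, x)  # all entries below the first are annihilated
```

Applying the reflector as a rank-one update, rather than building the m × m matrix H, is what keeps the cost per column linear in the vector length.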

6
Unblocked Householder QR
  • Cost: 2n²(m − n/3) flops
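
The column-by-column structure can be sketched in plain Python as follows. This is an illustration under assumptions, not the algorithm as typeset on the slide: it uses the standard real reflector I − 2vvᵀ/(vᵀv), computes only R, and the names `householder_qr_r` and `householder_flops` are hypothetical.

```python
import math

def householder_qr_r(a):
    """Return R from an unblocked Householder QR of a (list of m rows, m >= n)."""
    r = [row[:] for row in a]  # work on a copy
    m, n = len(r), len(r[0])
    for k in range(n):  # one reflector per column
        x = [r[i][k] for i in range(k, m)]
        norm_x = math.sqrt(sum(xi * xi for xi in x))
        if norm_x == 0.0:
            continue  # nothing to annihilate in this column
        v = x[:]
        v[0] += math.copysign(norm_x, x[0])
        vtv = sum(vi * vi for vi in v)
        # Apply H = I - 2 v v^T / (v^T v) to columns k..n-1 of the trailing rows.
        for j in range(k, n):
            dot = sum(v[i] * r[k + i][j] for i in range(m - k))
            scale = 2.0 * dot / vtv
            for i in range(m - k):
                r[k + i][j] -= scale * v[i]
    return r

def householder_flops(m, n):
    """Leading-order flop count quoted on the slide: 2 n^2 (m - n/3)."""
    return 2 * n * n * (m - n / 3)

a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = householder_qr_r(a)  # upper triangular up to rounding
```

Each column must finish before the next begins, which is the ordering constraint noted in the introduction.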

7
Blocked Householder QR
  • Cost: if s ≪ m, n, equal to unblocked

8
Basic Linear Algebra Subprograms
  • QR factorizations are implemented so that the bulk
    of the computation is performed by BLAS
  • Benefits
  • Legacy libraries need not be modified
  • Parallelism achieved through multithreaded BLAS
  • Disadvantages
  • Degree of parallelism limited by BLAS
  • The end of each BLAS routine is a synchronization
    point
  • Algorithmic variants significantly impact
    performance

9
Algorithms-by-blocks
  • To overcome the difficulties above
  • Expression of the algorithm is simplified; actual
    block storage is encapsulated
  • Consider m = m_s·s, n = n_s·s, with s the block size
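
As a rough analogy for viewing a matrix as a grid of blocks, the sketch below splits a plain list-of-rows matrix into s × s blocks, assuming m and n are exact multiples of s. The helper `to_blocks` is hypothetical and is not part of the FLAME/FLASH API.

```python
def to_blocks(a, s):
    """Split an m x n list-of-rows matrix into a grid of s x s blocks.

    Assumes m and n are multiples of s; blocks[i][j] is itself a small matrix.
    """
    m, n = len(a), len(a[0])
    if m % s or n % s:
        raise ValueError("m and n must be multiples of the block size s")
    return [[[row[j:j + s] for row in a[i:i + s]]
             for j in range(0, n, s)]
            for i in range(0, m, s)]

a = [[float(4 * i + j) for j in range(4)] for i in range(4)]
blocks = to_blocks(a, 2)  # a 2 x 2 grid of 2 x 2 blocks
```

Once the block is the unit of data, an algorithm can be written as loops over blocks, and each block operation becomes a candidate task for a scheduler.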

10
Algorithms-by-blocks I
  • Can be decomposed to

11
Algorithms-by-blocks II
  • Givens rotations
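  • Given a vector (χ1, χ2)ᵀ, γ and σ can be
    determined so that γ² + σ² = 1 and the rotation
    [γ σ; −σ γ] maps the vector to one whose second
    entry is zero
  • So QR can be computed with
  • Cost: 3n²(m − n/3) flops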
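
A small Python sketch of a single Givens rotation, assuming real arithmetic. The names `gamma` and `sigma` stand for the cosine and sine of the rotation angle; the function names `givens` and `rotate` are illustrative.

```python
import math

def givens(chi1, chi2):
    """Return (gamma, sigma) with gamma^2 + sigma^2 = 1 that rotate
    (chi1, chi2) onto (r, 0), where r = hypot(chi1, chi2)."""
    r = math.hypot(chi1, chi2)
    if r == 0.0:
        return 1.0, 0.0  # zero vector: use the identity rotation
    return chi1 / r, chi2 / r

def rotate(gamma, sigma, chi1, chi2):
    """Apply [[gamma, sigma], [-sigma, gamma]] to the vector (chi1, chi2)."""
    return gamma * chi1 + sigma * chi2, -sigma * chi1 + gamma * chi2

gamma, sigma = givens(3.0, 4.0)
top, bottom = rotate(gamma, sigma, 3.0, 4.0)  # bottom is zero up to rounding
```

Each rotation zeroes one entry, so a full factorization needs one rotation per subdiagonal entry. The slide's count of 3n²(m − n/3) is 1.5 times the Householder count of 2n²(m − n/3).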

12
Algorithms-by-blocks II (contd.)
  • Partition A into mt by mt blocks of size t
  • First, compute the QR factorization with the
    Householder method

13
Algorithms-by-blocks II (contd.)
  • Next, use Householder QR to factorize
  • then
  • Cost: 2n²(m − n/3)(1 + s/(2t)) flops
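
To get a feel for the extra work, the sketch below evaluates the cost formula, reading the trailing factor on the slide as 1 + s/(2t) (an assumption about the garbled term). The sizes used are illustrative values, not figures from the presentation, and the function names are hypothetical.

```python
def householder_flops(m, n):
    """Leading-order cost of Householder QR: 2 n^2 (m - n/3)."""
    return 2 * n * n * (m - n / 3)

def ab2_flops(m, n, s, t):
    """Cost of the second algorithm-by-blocks, assuming the factor 1 + s/(2t)."""
    return householder_flops(m, n) * (1 + s / (2 * t))

# Illustrative sizes only: s = 16 and t = 128 give a 6.25% overhead.
overhead = ab2_flops(2000, 2000, 16, 128) / householder_flops(2000, 2000)
```

Under this reading, the overhead shrinks as s becomes small relative to t and reaches 50% when s equals t.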

14
FLAME/FLASH API

15
SuperMatrix
  • Two phases
  • Analyzer
  • Dependency DAG construction
  • Scheduler/dispatcher
  • Runtime scheduling of block computations
  • Out-of-order computation
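
The two phases can be imitated with a toy, single-threaded Python model. This is only a sketch of the general idea, not the SuperMatrix implementation or its API: `analyze`, `dispatch`, and the task list are hypothetical, and the tasks are loosely modeled on block QR operations.

```python
from collections import deque

def analyze(tasks):
    """Analyzer phase: build a dependency DAG from (name, reads, writes) tasks.

    A task depends on the last writer of every block it touches and on all
    readers of a block it overwrites since that block was last written.
    """
    deps = {name: set() for name, _, _ in tasks}
    last_writer, readers = {}, {}
    for name, reads, writes in tasks:
        for blk in set(reads) | set(writes):
            if blk in last_writer:
                deps[name].add(last_writer[blk])
        for blk in writes:
            deps[name].update(readers.get(blk, ()))
        for blk in reads:
            readers.setdefault(blk, set()).add(name)
        for blk in writes:
            last_writer[blk] = name
            readers[blk] = set()
    return deps

def dispatch(tasks, deps):
    """Dispatcher phase: release each task once all its dependencies are done."""
    remaining = {name: set(d) for name, d in deps.items()}
    ready = deque(name for name, _, _ in tasks if not remaining[name])
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)  # a real runtime would hand this to an idle thread
        for other, _, _ in tasks:
            if name in remaining[other]:
                remaining[other].remove(name)
                if not remaining[other]:
                    ready.append(other)
    return order

# Hypothetical task list in program order: (name, blocks read, blocks written).
tasks = [
    ("QR(A00)", ["A00"], ["A00"]),
    ("Apply(A00->A01)", ["A00", "A01"], ["A01"]),
    ("Apply(A00->A02)", ["A00", "A02"], ["A02"]),
    ("QR(A00;A10)", ["A00", "A10"], ["A00", "A10"]),
    ("Apply(A10->A01;A11)", ["A10", "A01", "A11"], ["A01", "A11"]),
    ("Apply(A10->A02;A12)", ["A10", "A02", "A12"], ["A02", "A12"]),
]
deps = analyze(tasks)
order = dispatch(tasks, deps)
```

In this model the two `Apply(A00->...)` tasks do not depend on each other, so a multithreaded dispatcher could run them concurrently. That is the kind of out-of-order execution the slide refers to.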

16
Experiments
  • SGI Altix 350
  • ccNUMA architecture
  • 8 nodes, 2 × 1.5 GHz Intel Itanium2 each →
    16 CPUs in total
  • Theoretical peak performance: 96 GFLOPS
  • Intel Math Kernel Library (MKL) as BLAS
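
The quoted peak can be reproduced with a short calculation, assuming each Itanium2 CPU retires 4 floating-point operations per cycle (two fused multiply-add units). That per-cycle figure is an assumption added here and is not stated on the slide.

```python
nodes = 8
cpus_per_node = 2
clock_ghz = 1.5
flops_per_cycle = 4  # assumption: two fused multiply-add units per CPU

cpus = nodes * cpus_per_node
peak_gflops = cpus * clock_ghz * flops_per_cycle  # 16 * 1.5 * 4
```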

17
Experiments (contd.)

18
Conclusions
  • QR decomposition reviewed
  • Two algorithms-by-blocks presented
  • Experiments were conducted to compare the
    algorithms; the success of AB-II
    (Algorithms-by-blocks II) appears to warrant
    further study
  • Programming effort greatly reduced with
    FLAME/FLASH API