1
04-lila
  • Integrating a ScaLAPACK call in an MPI code (for
    Householder QRF)
  • MPI_OP to compute ||x|| (for Gram-Schmidt)
  • Example of construction of a datatype for
    triangular matrices, example of an MPI_OP on
    triangular matrices (for CholeskyQR)
  • RFP: trick to have contiguous memory for
    triangular matrices (for CholeskyQR)
  • Weirdest MPI_OP ever: motivation, results
  • Weirdest MPI_OP ever: how to attach attributes to
    a datatype

2
1. Integrating a ScaLAPACK call in an MPI code
  • See (a sketch of such an integration follows this
    list)
  • scalapackqrf_A.c
  • scalapackqrf_B.c
  • scalapackqr2_A.c
  • scalapackqr2_B.c
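
A minimal sketch of what integrating such a ScaLAPACK call can look like. It is an illustration only, not the content of scalapackqrf_A.c: it assumes the communicator is MPI_COMM_WORLD (the default BLACS system context), that every process owns the same number mloc of rows (so m = nprocs * mloc), that the C BLACS interface (Cblacs_*) is available, and that ScaLAPACK's Fortran routines (descinit_, pdgeqrf_) are callable from C.

    #include <stdlib.h>
    #include <mpi.h>

    /* BLACS / ScaLAPACK prototypes (Fortran symbols called from C) */
    extern void Cblacs_get(int ctxt, int what, int *val);
    extern void Cblacs_gridinit(int *ctxt, const char *order, int nprow, int npcol);
    extern void Cblacs_gridexit(int ctxt);
    extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                          const int *nb, const int *irsrc, const int *icsrc,
                          const int *ictxt, const int *lld, int *info);
    extern void pdgeqrf_(const int *m, const int *n, double *a, const int *ia,
                         const int *ja, const int *desca, double *tau,
                         double *work, const int *lwork, int *info);

    /* A is the local mloc x n block (column-major); tau has length >= n.
       On exit A holds the ScaLAPACK Householder QR factorization of the
       global m x n matrix. */
    void scalapack_qrf(int m, int n, double *A, int mloc, double *tau)
    {
        int nprocs, ictxt, desc[9], info;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* 1D (nprocs x 1) process grid on the default system context */
        Cblacs_get(-1, 0, &ictxt);
        Cblacs_gridinit(&ictxt, "Row", nprocs, 1);

        /* block-row distribution: one mloc x n block per process */
        int mb = mloc, nb = n, izero = 0, ione = 1, lld = mloc;
        descinit_(desc, &m, &n, &mb, &nb, &izero, &izero, &ictxt, &lld, &info);

        /* workspace query, then the factorization itself */
        int lwork = -1; double wsize;
        pdgeqrf_(&m, &n, A, &ione, &ione, desc, tau, &wsize, &lwork, &info);
        lwork = (int)wsize;
        double *work = malloc((size_t)lwork * sizeof(double));
        pdgeqrf_(&m, &n, A, &ione, &ione, desc, tau, work, &lwork, &info);

        free(work);
        Cblacs_gridexit(ictxt);
    }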

3
2. MPI_OP to compute ||x||
  • Consider the vector
  • x = [ 1e200, 1e200, 1e200, 1e200 ]
  • The 2-norm of x is
  • ||x|| = sqrt( x(1)^2 + x(2)^2 + x(3)^2 + x(4)^2 )
  •       = 2e200

4
2. MPI_OP to compute ||x||
  • Consider the vector
  • x = [ 1e200, 1e200, 1e200, 1e200 ]
  • The 2-norm of x is
  • ||x|| = sqrt( x(1)^2 + x(2)^2 + x(3)^2 + x(4)^2 )
  •       = 2e200
  • However, a careless implementation would
    overflow in double precision arithmetic
  • To compute z = sqrt(x^2 + y^2) without unnecessary
    overflow:
  • z = max(x,y) * sqrt( 1 + (min(x,y)/max(x,y))^2 )
  • See the LAPACK routine DLAPY2, the BLAS routine
    DNRM2, etc.

5
2. MPI_OP
  • The MPI_OP is in LILA_mpiop_dlapy2.c
  • Examples of use are in
  • check_representativity.c
  • rowmgs_v1.c
  • mgs_v1.c
  • cgs_v1.c
  • A very simple MPI_OP (but a very useful one as
    well), used to compute a norm without unnecessary
    overflow
  • The idea is that a careless computation of
    ||x||_2 would easily overflow:
  • alpha = x^T x (might overflow)
  • alpha = MPI_Allreduce(alpha) (might overflow as
    well)
  • alpha = sqrt(alpha) (too late if you have
    already overflowed ...)
  • Good implementation (a sketch follows this list):
  • alpha = ||x||_2 locally (a good sequential norm
    will not overflow)
  • alpha = MPI_Allreduce_dlapy2( alpha )
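
A minimal sketch of such an MPI_OP and its use. This is an illustration, not the actual LILA_mpiop_dlapy2.c; the helper names are made up.

    #include <math.h>
    #include <mpi.h>

    /* sqrt(x^2 + y^2) without unnecessary overflow (same idea as LAPACK DLAPY2) */
    static double dlapy2(double x, double y)
    {
        double ax = fabs(x), ay = fabs(y);
        double w = ax > ay ? ax : ay;   /* max */
        double z = ax > ay ? ay : ax;   /* min */
        if (w == 0.0) return 0.0;
        return w * sqrt(1.0 + (z / w) * (z / w));
    }

    /* The user-defined reduction: inout[i] = dlapy2(in[i], inout[i]) */
    static void mpiop_dlapy2(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        double *x = (double *)in, *y = (double *)inout;
        for (int i = 0; i < *len; i++)
            y[i] = dlapy2(x[i], y[i]);
        (void)dt;
    }

    /* Usage: each process accumulates the norm of its local part of x with
       the safe formula, then the partial norms are combined the same way. */
    double distributed_norm2(const double *xloc, int nloc, MPI_Comm comm)
    {
        double alpha = 0.0, norm;
        for (int i = 0; i < nloc; i++)
            alpha = dlapy2(alpha, xloc[i]);     /* safe local accumulation */

        MPI_Op op;
        MPI_Op_create(mpiop_dlapy2, 1 /* commutative */, &op);
        MPI_Allreduce(&alpha, &norm, 1, MPI_DOUBLE, op, comm);
        MPI_Op_free(&op);
        return norm;
    }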

6
2. Examples.
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 100
    -cgs_v0
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 100
    -cgs_v1
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 100
    -mgs_v0
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 100
    -mgs_v1
  • (and then multiply the entries of the initial
    matrix by 1e200)

7
3. EXAMPLE OF CONSTRUCTION OF A DATATYPE FOR
TRIANGULAR MATRICES, EXAMPLE OF AN MPI_OP ON
TRIANGULAR MATRICES
  • See
  • choleskyqr_A_v1.c
  • choleskyqr_B_v1.c
  • LILA_mpiop_sum_upper.c
  • Starting from choleskyqr_A_v0.c, this
  • shows how to construct a datatype for a
    triangular matrix
  • shows how to use an MPI_OP on the datatype for an
    Allreduce operation
  • Here we simply want to sum the upper triangular
    matrices together (a sketch of both pieces
    follows this list)
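
A minimal sketch of both pieces, assuming a column-major n x n upper triangular matrix with leading dimension lda that is known globally. The actual choleskyqr_*_v1.c and LILA_mpiop_sum_upper.c may differ; later slides discuss how to recover n and lda from the datatype instead of using globals.

    #include <stdlib.h>
    #include <mpi.h>

    static int g_n, g_lda;   /* matrix order and leading dimension, set by the caller */

    /* One block per column j: j+1 doubles starting at offset j*lda doubles */
    MPI_Datatype make_upper_type(int n, int lda)
    {
        int *blocklens = malloc(n * sizeof(int));
        int *disps     = malloc(n * sizeof(int));
        for (int j = 0; j < n; j++) {
            blocklens[j] = j + 1;
            disps[j]     = j * lda;
        }
        MPI_Datatype upper;
        MPI_Type_indexed(n, blocklens, disps, MPI_DOUBLE, &upper);
        MPI_Type_commit(&upper);
        free(blocklens); free(disps);
        return upper;
    }

    /* The user-defined operation receives the raw buffers plus the datatype,
       so it has to walk the upper triangular pattern itself. */
    void mpiop_sum_upper(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        MPI_Aint lb, ext;
        MPI_Type_get_extent(*dt, &lb, &ext);   /* spacing between consecutive elements */
        for (int k = 0; k < *len; k++) {
            double *a = (double *)((char *)in    + k * ext);
            double *b = (double *)((char *)inout + k * ext);
            for (int j = 0; j < g_n; j++)
                for (int i = 0; i <= j; i++)
                    b[j * g_lda + i] += a[j * g_lda + i];
        }
    }

    /* Usage:
         g_n = n; g_lda = lda;
         MPI_Datatype upper = make_upper_type(n, lda);
         MPI_Op op; MPI_Op_create(mpiop_sum_upper, 1, &op);
         MPI_Allreduce(MPI_IN_PLACE, A, 1, upper, op, comm);               */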

8
4. TRICK FOR TRIANGULAR MATRICES DATATYPES
  • See
  • check_orthogonality_RFP.c
  • choleskyqr_A_v2.c
  • choleskyqr_A_v3.c
  • choleskyqr_B_v2.c
  • choleskyqr_B_v3.c
  • A trick that uses the RFP format to do a fast
    allreduce on P triangular matrices without
    datatypes. The trick is at the user level.

9
Idea behind RFP
  • Rectangular full packed format
  • Just be careful to handle the odd and even order
    cases separately (a simplified sketch of the
    contiguous-buffer idea follows)
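
A simplified sketch of the idea: with a plain packed copy of the triangle in one contiguous buffer, the predefined MPI_SUM can be used directly, with no derived datatype and no user-defined MPI_OP. The real RFP format goes further and rearranges the triangle into a full rectangle in place, avoiding the copy, at the price of treating odd and even n separately.

    #include <stdlib.h>
    #include <mpi.h>

    void allreduce_upper_packed(double *A, int n, int lda, MPI_Comm comm)
    {
        int nt = n * (n + 1) / 2;                /* entries in the triangle */
        double *packed = malloc(nt * sizeof(double));

        /* pack: column by column */
        for (int j = 0, k = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                packed[k++] = A[j * lda + i];

        /* contiguous memory: the predefined MPI_SUM applies directly */
        MPI_Allreduce(MPI_IN_PLACE, packed, nt, MPI_DOUBLE, MPI_SUM, comm);

        /* unpack */
        for (int j = 0, k = 0; j < n; j++)
            for (int i = 0; i <= j; i++)
                A[j * lda + i] = packed[k++];

        free(packed);
    }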

10
5. A very weird MPI_OP: motivation, examples
11
Gram-Schmidt algorithm
The QR factorization of a long and skinny matrix
with its data partitioned vertically across
several processors arises in a wide range of
applications.
Input: A is block distributed by rows
Output: Q is block distributed by rows, R is global
[Figure: A partitioned by rows into A1, A2, A3 across
the processes, Q partitioned conformally into Q1, Q2,
Q3, and R replicated on every process.]
12
An example with modified Gram-Schmidt.
A nonsingular, m x 3
Q = A
r11 = ||Q1||_2 ;  Q1 = Q1 / r11
r12 = Q1^T Q2 ;  Q2 = Q2 - Q1 r12 ;  r22 = ||Q2||_2 ;  Q2 = Q2 / r22
r13 = Q1^T Q3 ;  Q3 = Q3 - Q1 r13 ;  r23 = Q2^T Q3 ;  Q3 = Q3 - Q2 r23 ;  r33 = ||Q3||_2 ;  Q3 = Q3 / r33
A = QR,  Q^T Q = I3
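
A minimal sequential sketch of the steps above for an m x 3 column-major matrix, assuming a CBLAS interface is available. This is an illustration, not the deck's mgs_v1.c.

    #include <cblas.h>

    /* Q (m x 3, column-major, leading dimension m) holds A on input and the
       orthonormal factor on output; R (3 x 3, column-major) receives r_ij.
       The strictly lower part of R is not referenced. */
    void mgs3(int m, double *Q, double *R)
    {
        for (int j = 0; j < 3; j++) {
            double *qj = Q + j * m;
            for (int i = 0; i < j; i++) {
                double *qi = Q + i * m;
                double rij = cblas_ddot(m, qi, 1, qj, 1);   /* r_ij = qi^T qj   */
                cblas_daxpy(m, -rij, qi, 1, qj, 1);         /* qj = qj - qi*rij */
                R[i + j * 3] = rij;
            }
            double rjj = cblas_dnrm2(m, qj, 1);             /* r_jj = ||qj||_2  */
            cblas_dscal(m, 1.0 / rjj, qj, 1);               /* qj = qj / r_jj   */
            R[j + j * 3] = rjj;
        }
    }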
13
Reduce Algorithms
  • The gather-scatter variant of our algorithm can
    be summarized as follows:
  • perform the local QR factorization of the matrix A
  • gather the p R factors on processor 0
  • perform a QR factorization of all the R factors
    stacked on top of each other; the R factor
    obtained is the R factor of A
  • scatter the Q factors from processor 0 to all
    the processors
  • multiply the two Q factors together locally;
    done.

14
Reduce Algorithms
  • This is the gather-scatter version of our
    algorithm.
  • This variant is not very efficient, for two
    reasons:
  • first, the communication phases (steps 2 and 4)
    heavily involve processor 0
  • second, the cost of step 3 is about (p/3) n^3, so
    it can become prohibitive for large p.
  • Note that the CholeskyQR algorithm can also be
    implemented in a gather-scatter way instead of a
    reduce-broadcast. This leads naturally to the
    algorithm presented below, where a
    reduce-broadcast version of the previous
    algorithm is described. This will be our final
    algorithm.

15 - 21
On two processes
[Figure, built up over slides 15-21: (1) each process i
computes the local factorization QR(Ai) = (Vi(0), Ri(0));
(2) the two R factors R0(0) and R1(0) are stacked and
factored, QR([R0(0); R1(0)]) = (V0(1), R0(1)); (3) the
corresponding small Q factor is formed by applying V0(1)
and V1(1) to [In; 0n], giving Q0(1) and Q1(1); (4) each
process applies its local Vi(0) to [Qi(1); 0n] to obtain
its block Qi of Q. Axes: processes vs. time.]
22
The big picture .
[Figure: the same scheme on seven processes; A is
partitioned by rows into A0, ..., A6, and after the
reduction every process holds its block Qi of Q and a
copy of R, so that A = QR. Axes: processes vs. time.]
23 - 31
The big picture .
[Figure, built up over slides 23-31: timeline of the
binary reduction tree on seven processes, with a legend
distinguishing computation steps 1-4 and communication
steps 1-2 of the two-process figure. At the end every
process holds its block Qi and the R factor.]
32
Latency but also possibility of fast panel
factorization.
  • DGEQR3 is the recursive algorithm (see Elmroth
    and Gustavson, 2000), DGEQRF and DGEQR2 are the
    LAPACK routines.
  • Times include QR and DLARFT.
  • Run on Pentium III.

[Plot: MFLOP/sec versus n, for m = 1,000,000.]
33
When only R is wanted
[Figure: the reduction tree on four processes when only
R is wanted; each process computes Ri(0) from QR(Ai),
pairs of R factors are stacked and refactored at each
level of the tree, and only the final R is kept; no Q
factors are stored or applied. Axes: processes vs. time.]
34
When only R is wanted: the MPI_Allreduce
In the case where only R is wanted, instead of
constructing our own tree, one can simply use
MPI_Allreduce with a user-defined operation. The
operation we give to MPI is basically Algorithm 2: it
stacks two upper triangular factors and takes the R
factor of the QR factorization of the stacked matrix,

    R := Rfactor( [ R1 ; R2 ] )

This binary operation is associative, and that is all
MPI needs to use a user-defined operation on a
user-defined datatype. Moreover, if we fix the signs of
the elements so that the diagonal of R holds positive
elements, then the binary operation Rfactor also becomes
commutative. The code becomes two lines:

    lapack_dgeqrf( mloc, n, A, lda, tau, dlwork, lwork, info );
    MPI_Allreduce( MPI_IN_PLACE, A, 1, MPI_UPPER, LILA_MPIOP_QR_UPPER, mpi_comm );

(A sketch of such a user-defined operation follows.)
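
A minimal sketch of what such a user-defined operation could look like. Assumptions (the actual LILA_mpiop_qr_upper.c may differ): LAPACKE is used in place of the deck's LAPACK C wrappers, the datatype describes an n x n upper triangular matrix stored column-major with leading dimension n, and n is known to the operation through a global variable (later slides discuss better ways to pass it).

    #include <stdlib.h>
    #include <mpi.h>
    #include <lapacke.h>

    static int g_n;   /* order of the R factors, assumed known to the op */

    /* inout := Rfactor( [ in ; inout ] ) */
    static void mpiop_qr_upper(void *in, void *inout, int *len, MPI_Datatype *dt)
    {
        int n = g_n;
        double *R1 = (double *)in, *R2 = (double *)inout;
        double *stack = malloc((size_t)2 * n * n * sizeof(double));
        double *tau   = malloc((size_t)n * sizeof(double));

        for (int k = 0; k < *len; k++) {
            /* stack the two upper triangles into a 2n x n column-major matrix */
            for (int j = 0; j < n; j++)
                for (int i = 0; i < 2 * n; i++) {
                    double v = 0.0;
                    if (i < n && i <= j)           v = R1[k * n * n + j * n + i];
                    else if (i >= n && i - n <= j) v = R2[k * n * n + j * n + i - n];
                    stack[j * 2 * n + i] = v;
                }

            /* QR of the stacked matrix; its R factor is the reduced result */
            LAPACKE_dgeqrf(LAPACK_COL_MAJOR, 2 * n, n, stack, 2 * n, tau);

            /* copy the upper triangle back into inout */
            for (int j = 0; j < n; j++)
                for (int i = 0; i <= j; i++)
                    R2[k * n * n + j * n + i] = stack[j * 2 * n + i];
        }
        free(stack); free(tau);
        (void)dt;
    }

    /* Usage: MPI_Op_create(mpiop_qr_upper, 0, &op);
       the commute flag is 0 because Rfactor is only commutative if the
       diagonal entries are forced to be positive. */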
35
Does it work?
  • The experiments are performed on the beowulf
    cluster at the University of Colorado at Denver.
    The cluster is made of 35 bi-pro Pentium III
    (900MHz) nodes connected with a Dolphin
    interconnect.
  • The number of operations is taken as 2mn^2 for
    all the methods
  • The block size used in ScaLAPACK is 32.
  • The code is written in C and uses MPI (mpich-2.1),
    LAPACK (3.1.1), BLAS (goto-1.10), the LAPACK C
    wrappers (http://icl.cs.utk.edu/delmas/lapwrapmw.htm)
    and the BLAS C wrappers
    (http://www.netlib.org/blas/blast-forum/cblas.tgz)
  • The codes have been tested in various
    configurations and have never failed to produce a
    correct answer; releasing those codes is on the
    agenda

36
Q and R: Strong scalability
  • In this experiment, we fix the problem size,
    m = 100,000 and n = 50, and then increase the
    number of processors.
  • Once more the algorithm rhh_qr3 is second behind
    CholeskyQR. Note that rhh_qr3 is unconditionally
    stable, while the stability of CholeskyQR depends
    on the square of the condition number of the
    initial matrix.
[Plots: MFLOP/sec/proc and time in seconds versus the
number of processors.]
37
Q and R: Weak scalability with respect to m
  • We fix the local size to be mloc = 100,000 and
    n = 50. When we increase the number of
    processors, the global m grows proportionally.
  • rhh_qr3 is the Allreduce algorithm with recursive
    panel factorization, rhh_qrf is the same with the
    LAPACK Householder QR. We see the obvious benefit
    of using recursion. See as well (6). qr2 and qrf
    correspond to the ScaLAPACK Householder QR
    factorization routines.
[Plots: MFLOP/sec/proc and time in seconds versus the
number of processors.]
38
Q and R: Weak scalability with respect to n
  • We fix the global size m = 100,000 and then we
    increase n as sqrt(p) so that the workload mn^2
    per processor remains constant.
  • Due to better performance in the local
    factorization or SYRK, CholeskyQR, rhh_qr3 and
    rhh_qrf exhibit increasing performance at the
    beginning, until the n^3 term comes into play.
[Plots: MFLOP/sec/proc and time in seconds versus
procs (n); the n^3 effect is visible.]
39
R only: Strong scalability
  • In this experiment, we fix the problem size,
    m = 100,000 and n = 50, and then increase the
    number of processors.
[Plots: MFLOP/sec/proc and time in seconds versus the
number of processors.]
40
R only: Weak scalability with respect to m
  • We fix the local size to be mloc = 100,000 and
    n = 50. When we increase the number of
    processors, the global m grows proportionally.
[Plots: MFLOP/sec/proc and time in seconds versus the
number of processors.]
41
Q and R: Strong scalability
In this experiment, we fix the problem size,
m = 1,000,000 and n = 50, and then increase the number
of processors.
Blue Gene L, frost.ncar.edu
[Plot: MFLOP/sec/proc versus the number of processors.]
42
VARIATION ON A FAIRLY COMPLEX MPI_OP
  • reducehouseholder_B_v0.c
  • LILA_mpiop_qr_upper_v0.c
  • MPI will perform a very weird MPI_OP. The
    datatype is a square matrix.
  • reducehouseholder_B_v1.c
  • LILA_mpiop_qr_upper_v1.c
  • MPI will perform a very weird MPI_OP. The
    datatype is a triangular matrix.
  • reducehouseholder_B_v2.c
  • LILA_mpiop_qr_upper_v2.c
  • The goal here, from the application point of
    view, is to perform the operation in place. This
    will be the occasion to introduce how to do so.
    In practice, this example is not good: MPI will
    work on a way too large array. Since this can be
    a mistake in programming your MPI application, we
    will discuss this as well.
  • reducehouseholder_B_v3.c
  • LILA_mpiop_qr_upper_v1.c
  • This is B_v1 with recursive QR factorization.
    Nothing to do with MPI, but it goes fast!

43
6. Weirdest MPI_OP ever: how to attach attributes
to a datatype
[Figure: an n x n upper triangular matrix stored inside
an array with leading dimension lda.]
  • In rhhB_v2, we will show how to do the
    MPI_Allreduce in place.
44
6. Weirdest MPI_OP ever: how to attach attributes
to a datatype
  • Use global variables
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 50
    -rhhB_v2_000
  • reducehouseholder_B_v2_000
  • LILA_mpiop_qr_upper_v2_000
  • Use extent / size
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 50
    -rhhB_v2_001
  • reducehouseholder_B_v2_001
  • LILA_mpiop_qr_upper_v2_001
  • Use attributes (1/2) (a sketch of the attribute
    mechanism follows this list)
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 50
    -rhhB_v2_002
  • reducehouseholder_B_v2_002
  • LILA_mpiop_qr_upper_v2_002
  • Use attributes (2/2)
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 50
    -rhhB_v2_003
  • reducehouseholder_B_v2_003
  • LILA_mpiop_qr_upper_v2_003
  • Use name (gore!)
  • mpirun -np 4 ./xtest -verbose 2 -m 1000 -n 50
    -rhhB_v2_004
  • reducehouseholder_B_v2_004

45
v2_001
[Figure: an n x n upper triangular matrix stored inside
an array with leading dimension lda.]
Two equations / two unknowns (counting in doubles):

    size   = n ( n + 1 ) / 2 = 1/2 n^2 + 1/2 n
    extent = ( n - 1 ) lda + n

  • MPI_Type_size( *(mpidatatype), &n );
  • n = n / sizeof(double);
  • n = (int) ( ( -1.00e+00 + sqrt( 1.00e+00 +
    8.00e+00 * (double)n ) ) / 2.00e+00 );
  • MPI_Type_get_extent( *(mpidatatype), &lb, &extent );
  • lda = ( extent / sizeof(double) - n ) / ( n - 1 );
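
A small self-contained check of this size/extent trick (illustration only): build the upper triangular datatype for a chosen n and lda, then recover both values from MPI_Type_size and MPI_Type_get_extent.

    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int n = 50, lda = 1000;                      /* original parameters */
        int *bl = malloc(n * sizeof(int)), *dp = malloc(n * sizeof(int));
        for (int j = 0; j < n; j++) { bl[j] = j + 1; dp[j] = j * lda; }
        MPI_Datatype upper;
        MPI_Type_indexed(n, bl, dp, MPI_DOUBLE, &upper);
        MPI_Type_commit(&upper);

        /* size = n(n+1)/2 doubles  ->  n = (-1 + sqrt(1 + 8*size)) / 2 */
        int size;
        MPI_Type_size(upper, &size);
        int s = size / (int)sizeof(double);
        int n2 = (int)((-1.0 + sqrt(1.0 + 8.0 * (double)s)) / 2.0);

        /* extent = ((n-1)*lda + n) doubles  ->  lda = (extent - n) / (n - 1) */
        MPI_Aint lb, extent;
        MPI_Type_get_extent(upper, &lb, &extent);
        int lda2 = (int)((extent / (MPI_Aint)sizeof(double) - n2) / (n2 - 1));

        printf("recovered n = %d, lda = %d\n", n2, lda2);   /* expect 50, 1000 */

        MPI_Type_free(&upper);
        free(bl); free(dp);
        MPI_Finalize();
        return 0;
    }

Compile with an MPI C compiler (e.g. mpicc) and run on a single process; it should print n = 50, lda = 1000.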