Title: Auto-optimization of linear algebra parallel routines: the Cholesky factorization
1. Auto-optimization of linear algebra parallel routines: the Cholesky factorization
- Luis-Pedro García, Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, Spain (luis.garcia@sait.upct.es)
- Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain (javiercm@ditec.um.es)
- Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (domingo@dif.um.es)
2. Outline
- Introduction
- Parallel routine for the Cholesky factorization
- Experimental Results
- Conclusions
3. Introduction
- Our goal: to obtain linear algebra parallel routines with auto-optimization capacity
- The approach: model the behavior of the algorithm
- This work: improve the model of the communication costs when
  - The routine uses different types of MPI communication mechanisms
  - The system has more than one interconnection network
  - The communication parameters vary with the volume of the communication
4. Introduction
- Theoretical and experimental study of the algorithm. Selection of the algorithmic parameters (AP).
- In linear algebra parallel routines, typical AP and system parameters (SP) are:
  - b, p = r x c and the basic library
  - k1, k2, k3, ts and tw
- An analytical model of the execution time:
  - T(n) = f(n, AP, SP)
5Parallel Cholesky factorization
- The n x n matrix is mapped through a block cyclic
2-D distribution onto a two-dimensional mesh of p
r x c processes (in ScaLAPACK style)
(a) First step (b) Second step
(c) Third step
Figure 1. Work distribution in the first three
steps, with n/b 6 and p 2 x 3
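To make the distribution concrete, the following minimal C sketch (not part of the original slides) computes which process of the r x c mesh owns each global block in a ScaLAPACK-style 2-D block-cyclic distribution, for the configuration of Figure 1 (n/b = 6, p = 2 x 3); the distribution is assumed to start at process (0, 0).

#include <stdio.h>

/* Minimal sketch (not from the slides): ScaLAPACK-style 2-D block-cyclic
 * mapping of a global block (I, J) onto an r x c process mesh.
 * With the distribution starting at process (0, 0), block row I belongs
 * to process row I mod r and block column J to process column J mod c. */
static void block_owner(int I, int J, int r, int c, int *prow, int *pcol)
{
    *prow = I % r;   /* process row holding global block row I       */
    *pcol = J % c;   /* process column holding global block column J */
}

int main(void)
{
    int n_blocks = 6;            /* n/b = 6, as in Figure 1        */
    int r = 2, c = 3;            /* p = 2 x 3 mesh, as in Figure 1 */

    for (int I = 0; I < n_blocks; I++)
        for (int J = 0; J <= I; J++) {   /* lower triangle only */
            int prow, pcol;
            block_owner(I, J, r, c, &prow, &pcol);
            printf("block (%d,%d) -> process (%d,%d)\n", I, J, prow, pcol);
        }
    return 0;
}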
6. Parallel Cholesky factorization
- The general model: t(n) = f(n, AP, SP)
- Problem size:
  - n: matrix size
- Algorithmic parameters (AP):
  - b: block size
  - p = r x c processes
- System parameters (SP), with SP = g(n, AP):
  - k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the cost of the basic arithmetic operations
  - ts(p): start-up time
  - tws(n,p), twd(n,p): word-sending times for the different types of communication
- Communication cost: tcom(n,p) = ts(p) + n · tw(n,p)
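Once ts, tws and twd have been measured, the communication model can be evaluated directly. A minimal C sketch (the function names are illustrative, not from the slides) of tcom(n,p) = ts(p) + n · tw(n,p), where the word-sending time is chosen according to the type of broadcast:

#include <stdio.h>

/* Minimal sketch (illustrative names): t_com(n, p) = t_s(p) + n * t_w(n, p).
 * t_w is tws(n, p) for a broadcast of an MPI predefined data type and
 * twd(n, p) for a broadcast of an MPI derived data type. */
typedef double (*word_time_fn)(long n, int p);   /* tws or twd, usec/word */

static double t_com(long n, int p, double ts_p, word_time_fn tw)
{
    return ts_p + (double)n * tw(n, p);          /* time in usec */
}

/* Assumed constant word-sending time, standing in for a measured table */
static double tws_example(long n, int p) { (void)n; (void)p; return 0.1; }

int main(void)
{
    printf("t_com = %g usec\n", t_com(100000, 6, 50.0, tws_example));
    return 0;
}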
7. Parallel Cholesky factorization
- Theoretical model
  - Arithmetic cost, tarit
  - Communication cost, tcom
  - Total cost: T = tarit + tcom
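The two terms can be combined into a predicted time for a candidate configuration. The C sketch below is only a rough approximation, built from the textbook n^3/3 flop count for Cholesky and an upper bound of two broadcasts of n*b words per block step (a derived-type broadcast within a process column and a predefined-type broadcast within a process row); it is not the exact cost expression used in this work.

#include <stdio.h>

/* Rough sketch, NOT the exact model of this work: predicted time
 * T = t_arit + t_com for a block Cholesky on an r x c mesh.
 *   k3       : time per flop of the level-3 kernels (usec/flop)
 *   ts       : broadcast start-up time (usec)
 *   twd, tws : word-sending times, derived / predefined data types (usec) */
static double predicted_time(long n, int b, int r, int c,
                             double k3, double ts, double twd, double tws)
{
    long   p      = (long)r * c;
    double t_arit = k3 * (double)n * n * n / (3.0 * p);   /* ~n^3/3 flops */

    /* n/b block steps, each with a derived-type broadcast in a process
     * column and a predefined-type broadcast in a process row; message
     * volume is bounded here by n*b words per broadcast. */
    double steps = (double)n / b;
    double t_com = steps * (2.0 * ts + (double)n * b * (twd + tws));

    return t_arit + t_com;
}

int main(void)
{
    /* Assumed parameter values, for illustration only */
    printf("T approx. %g usec\n",
           predicted_time(4096, 64, 2, 3, 0.002, 50.0, 0.2, 0.1));
    return 0;
}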
8. Experimental Results
- Systems:
  - P4net: a network of four Intel Pentium 4 nodes with a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
  - HPC160: a network of four quad-processor HP AlphaServer nodes, using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) or both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used.
9. Experimental Results
- How to estimate the arithmetic SPs:
  - With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm
- How to estimate the communication SPs:
  - With routines that communicate rows or columns in the logical mesh of processes
  - With a broadcast of an MPI derived data type between processes in the same column
  - With a broadcast of an MPI predefined data type between processes in the same row
- In both cases the experiments are repeated several times to obtain an average value
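As an illustration of the arithmetic measurement, a minimal C sketch (not the original benchmark code) that estimates k3,gemm by timing dgemm with a shape similar to the update step of the block algorithm and averaging over several repetitions; it assumes a CBLAS interface to the optimized BLAS and POSIX clock_gettime.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* Minimal sketch: estimate k3,gemm (usec per flop) by timing an
 * m x b by b x m update, C := C - A*B, averaged over reps runs. */
static double now_usec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e6 + t.tv_nsec * 1e-3;
}

int main(void)
{
    int m = 1024, b = 64, reps = 10;          /* illustrative sizes */
    double *A = calloc((size_t)m * b, sizeof *A);
    double *B = calloc((size_t)b * m, sizeof *B);
    double *C = calloc((size_t)m * m, sizeof *C);

    double start = now_usec();
    for (int r = 0; r < reps; r++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, m, b, -1.0, A, m, B, b, 1.0, C, m);
    double elapsed = (now_usec() - start) / reps;

    /* dgemm performs 2*m*m*b flops */
    printf("k3,gemm approx. %g usec/flop\n", elapsed / (2.0 * m * m * b));

    free(A); free(B); free(C);
    return 0;
}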
10. Experimental Results
- The lowest execution times are obtained with the versions of BLAS and LAPACK optimized for Pentium 4 and for Alpha
Table 1. Values of the arithmetic system parameters (in µsec) on Pentium 4 with BLASopt
Table 2. Values of the arithmetic system parameters (in µsec) on Alpha with CXML
11Experimental Results
- But other SPs can depend on n and b, for example
k2,potf2
Table 3. Values of k2,potf2 (in µsec) in Pentium
4 with BLASopt
Table 4. Values of k2,potf2 (in µsec) in Alpha
with CXML
12. Experimental Results
- Communication system parameters
  - Broadcast cost for an MPI predefined data type, tws
Table 5. Values of tws (in µsec) in P4net
Table 6. Values of tws (in µsec) in HPC160
13. Experimental Results
- Communication system parameters
  - Word-sending time of a broadcast for an MPI derived data type, twd
Table 7. Values of twd (in µsec) obtained experimentally for different b and p
14. Experimental Results
- Communication system parameters
  - Start-up time of the MPI broadcast, ts
  - It can be considered that ts(n,p) ≈ ts(p)
Table 8. Values of ts (in µsec) obtained experimentally for different numbers of processes
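A minimal C/MPI sketch (not the original benchmark code) of how ts and tw can be estimated: time MPI_Bcast for several message sizes, take the maximum over the processes involved, average over repetitions, and fit t(n) = ts + n · tw by least squares; here a "word" is one MPI_DOUBLE.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int sizes[] = { 1024, 4096, 16384, 65536, 262144 }; /* in doubles */
    const int nsizes  = 5, reps = 20;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;

    for (int s = 0; s < nsizes; s++) {
        int n = sizes[s];
        double *buf = malloc((size_t)n * sizeof *buf);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double local = (MPI_Wtime() - t0) / reps * 1e6;       /* usec */

        double t;                      /* slowest process sets the time */
        MPI_Reduce(&local, &t, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        free(buf);

        if (rank == 0) { sx += n; sy += t; sxx += (double)n * n; sxy += (double)n * t; }
    }

    if (rank == 0) {                   /* least-squares line t = ts + n*tw */
        double tw = (nsizes * sxy - sx * sy) / (nsizes * sxx - sx * sx);
        double ts = (sy - tw * sx) / nsizes;
        printf("ts approx. %g usec, tw approx. %g usec/word\n", ts, tw);
    }
    MPI_Finalize();
    return 0;
}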
17. Experimental Results
- Parameter selection in P4net
Table 9. Parameter selection for the Cholesky factorization in P4net
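A minimal C sketch (illustrative, with assumed system-parameter values) of the selection step itself: evaluate the analytical model for every candidate block size b and mesh shape r x c and keep the configuration with the lowest predicted time.

#include <stdio.h>

/* Crude stand-in for the analytical model T(n) = t_arit + t_com,
 * with assumed values of k3, ts and tw (in usec); the real model uses
 * the measured system parameters of the target platform. */
static double model_time(long n, int b, int r, int c)
{
    double k3 = 0.002, ts = 50.0, tw = 0.1;      /* assumed SP values */
    long   p  = (long)r * c;
    double t_arit = k3 * (double)n * n * n / (3.0 * p);
    double t_com  = ((double)n / b) * (2.0 * ts + (double)n * b * tw);
    return t_arit + t_com;
}

int main(void)
{
    long n = 4096;
    int  p_max = 4;                               /* e.g. four processes */
    int  blocks[] = { 16, 32, 64, 128 };          /* candidate block sizes */
    int  best_b = 0, best_r = 0, best_c = 0;
    double best = -1.0;

    /* Exhaustive search over the candidate algorithmic parameters */
    for (int i = 0; i < 4; i++)
        for (int r = 1; r <= p_max; r++)
            for (int c = 1; r * c <= p_max; c++) {
                double t = model_time(n, blocks[i], r, c);
                if (best < 0.0 || t < best) {
                    best = t; best_b = blocks[i]; best_r = r; best_c = c;
                }
            }

    printf("selected b = %d, p = %d x %d (predicted %g usec)\n",
           best_b, best_r, best_c, best);
    return 0;
}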
18. Experimental Results
- Parameter selection in HPC160
Table 10. Parameter selection for the Cholesky factorization in HPC160 with Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc)
19. Conclusions
- The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines
- It is necessary to use different costs for the different types of MPI communication mechanisms, and different values of the communication parameters in systems with more than one interconnection network
- It is also necessary to decide the optimal allocation of processes per node, according to the speed of the interconnection networks (hybrid systems)