1
Auto-optimization of linear algebra parallel
routines: the Cholesky factorization
  • Luis-Pedro García
  • Servicio de Apoyo a la Investigación Tecnológica
  • Universidad Politécnica de Cartagena, Spain
    luis.garcia@sait.upct.es
  • Javier Cuenca
  • Departamento de Ingeniería y Tecnología de Computadores
  • Universidad de Murcia, Spain
    javiercm@ditec.um.es
  • Domingo Giménez
  • Departamento de Informática y Sistemas
  • Universidad de Murcia, Spain
    domingo@dif.um.es
2
Outline
  • Introduction
  • Parallel routine for the Cholesky factorization
  • Experimental Results
  • Conclusions

3
Introduction
  • Our goal: to obtain linear algebra parallel
    routines with auto-optimization capacity
  • The approach: model the behavior of the algorithm
  • This work: improve the model for the
    communication costs when
  • The routine uses different types of MPI
    communication mechanisms
  • The system has more than one interconnection
    network
  • The communication parameters vary with the
    volume of the communication

4
Introduction
  • Theoretical and experimental study of the
    algorithm. AP selection (a selection sketch
    follows this slide).
  • In linear algebra parallel routines, typical AP
    and SP are:
  • AP: b, p = r x c and the basic library
  • SP: k1, k2, k3, ts and tw
  • An analytical model of the execution time:
  • T(n) = f(n, AP, SP)
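Where auto-optimization takes place, the AP are chosen by evaluating this model for every candidate configuration and keeping the cheapest one. The sketch below only illustrates that selection step; the candidate block sizes and the externally supplied model function are assumptions for the example, not the authors' code.

```c
/* Hedged sketch of AP selection (not the authors' code): evaluate the
 * analytical model T(n) = f(n, AP, SP) for every candidate block size b
 * and process grid r x c with r*c = p, and keep the cheapest one. */
#include <float.h>
#include <stddef.h>

/* The analytical model is supplied from outside; its coefficients are
 * the measured SP (k2, k3, ts, tws, twd). */
typedef double (*exec_time_model)(int n, int b, int r, int c);

void select_ap(int n, int p, exec_time_model f,
               int *best_b, int *best_r, int *best_c)
{
    static const int candidate_b[] = { 16, 32, 64, 128 };  /* example candidates */
    double best = DBL_MAX;

    for (int r = 1; r <= p; r++) {
        if (p % r != 0)
            continue;                          /* only complete r x c grids */
        int c = p / r;
        for (size_t i = 0; i < sizeof candidate_b / sizeof *candidate_b; i++) {
            double t = f(n, candidate_b[i], r, c);
            if (t < best) {
                best = t;
                *best_b = candidate_b[i];
                *best_r = r;
                *best_c = c;
            }
        }
    }
}
```

Because only a handful of (b, r x c) combinations exist for a given p, the search itself costs next to nothing compared with the factorization.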

5
Parallel Cholesky factorization
  • The n x n matrix is mapped through a block-cyclic
    2-D distribution onto a two-dimensional mesh of
    p = r x c processes (in ScaLAPACK style); an
    owner-mapping sketch follows Figure 1

Figure 1. Work distribution in the first three
steps ((a) first step, (b) second step, (c) third
step), with n/b = 6 and p = 2 x 3
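As an illustration of the distribution (a sketch under the usual ScaLAPACK assumptions of a zero source offset and a row-major process grid, not code from the presentation), the process owning a global block is obtained by reducing its block indices modulo the grid dimensions:

```c
/* Hedged sketch: owner of global block (I, J) in a 2-D block-cyclic
 * distribution over an r x c process grid (zero source offsets).
 * Block (I, J) covers matrix rows I*b .. I*b+b-1 and columns
 * J*b .. J*b+b-1. */
typedef struct { int prow, pcol; } owner_t;

owner_t block_owner(int I, int J, int r, int c)
{
    owner_t o;
    o.prow = I % r;   /* process row holding block row I       */
    o.pcol = J % c;   /* process column holding block column J */
    return o;
}
```

For the configuration of Figure 1 (n/b = 6, p = 2 x 3), block (2, 4) would therefore be held by process (0, 1).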
6
Parallel Cholesky factorization
  • The general model: t(n) = f(n, AP, SP)
  • Problem size:
  • n: matrix size
  • Algorithmic parameters (AP):
  • b: block size
  • p = r x c: processes
  • System parameters (SP): SP = g(n, AP)
  • k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk,
    cost of the basic arithmetic operations
  • ts(p): start-up time
  • tws(n,p), twd(n,p): word-sending time for
    different types of communications

tcom(n,p) = ts(p) + n tw(n,p)
7
Parallel Cholesky factorization
  • Theoretical model
  • Arithmetic cost
  • Communication cost

T = tarit + tcom
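The transcript does not reproduce the terms of the model, so purely as a point of reference (an assumption based on standard analyses of block-cyclic Cholesky, not the authors' exact expression) such a model typically has the shape

```latex
% General shape only (assumption); the paper supplies the exact
% coefficients and lower-order terms.
\begin{align*}
t_{arit} &\approx k_{3,\mathrm{gemm}}\,\frac{n^{3}}{3p}
           \;+\; \text{lower-order terms in } k_{3,\mathrm{syrk}},\,
                 k_{3,\mathrm{trsm}},\, k_{2,\mathrm{potf2}},\\
t_{com}  &\approx n_{\mathrm{msg}}\, t_{s}(p)
           \;+\; v_{s}\, t_{ws}(n,p) \;+\; v_{d}\, t_{wd}(n,p),\\
T        &= t_{arit} + t_{com},
\end{align*}
```

where n_msg grows as n/b and v_s, v_d are the word volumes sent with predefined and derived MPI data types, the two mechanisms that receive separate word-sending times in the model.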
8
Experimental Results
  • Systems
  • A network of four Intel Pentium 4 nodes (P4net)
    with a FastEthernet switch, enabling parallel
    communications between them. The MPI library used
    is MPICH.
  • A network of four quad-processor HP AlphaServer
    nodes (HPC160) using Shared Memory (HPC160smp),
    MemoryChannel (HPC160mc) and both (HPC160smp-mc)
    for the communications between processes. An MPI
    library optimized for Shared Memory and for
    MemoryChannel has been used.

9
Experimental Results
  • How to estimate the arithmetic SPs:
  • With routines performing some basic operation
    (dgemm, dsyrk, dtrsm) with the same data access
    scheme used in the algorithm (a timing sketch
    follows this list)
  • How to estimate the communication SPs:
  • With routines that communicate rows or columns in
    the logical mesh of processes
  • With a broadcast of an MPI derived data type
    between processes in the same column
  • With a broadcast of an MPI predefined data type
    between processes in the same row
  • In both cases the experiments are repeated
    several times to obtain an average value
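A minimal sketch of the first kind of measurement (assuming a CBLAS interface; the buffer sizes, repetition count and the time-per-flop convention used to read off k3,gemm are illustrative choices, not taken from the presentation):

```c
/* Hedged sketch: estimate k3,gemm by timing dgemm on a block-sized
 * trailing-matrix update with the same leading dimension (data access
 * scheme) as in the factorization, averaged over several repetitions.
 * Requires m <= n and b <= n. */
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

static double now_sec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + 1e-9 * t.tv_nsec;
}

/* Time C := C - A * B^T for an m x m update of width b, with all three
 * operands stored inside n x n arrays (lda = n). */
double estimate_k3gemm(int n, int m, int b, int reps)
{
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);

    double t0 = now_sec();
    for (int i = 0; i < reps; i++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans,
                    m, m, b, -1.0, A, n, B, n, 1.0, C, n);
    double t = (now_sec() - t0) / reps;

    free(A); free(B); free(C);
    /* 2*m*m*b flops per call; the result is a time per flop in seconds
     * (multiply by 1e6 for the microseconds used in the tables). */
    return t / (2.0 * m * (double)m * b);
}
```

Analogous loops around dsyrk, dtrsm and the unblocked potf2 give the remaining arithmetic SPs shown in Tables 1-4.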

10
Experimental Results
  • The lowest execution times are obtained with the
    optimized versions of BLAS and LAPACK for Pentium
    4 and for Alpha

Table 1. Values of arithmetic system parameters
(in µsec) in Pentium 4 with BLASopt
Table 2. Values of arithmetic system parameters
(in µsec) in Alpha with CXML
11
Experimental Results
  • But other SPs can depend on n and b, for example
    k2,potf2

Table 3. Values of k2,potf2 (in µsec) in Pentium
4 with BLASopt
Table 4. Values of k2,potf2 (in µsec) in Alpha
with CXML
12
Experimental Results
  • Communication system parameters
  • Broadcast cost for an MPI predefined data type, tws

Table 5. Values of tws (in µsecs) in P4net
Table 6. Values of tws (in µsecs) in HPC160
13
Experimental Results
  • Communication system parameters
  • Word-sending time of a broadcast for an MPI
    derived data type, twd (a measurement sketch
    follows this slide)

Table 7. Values of twd (in µsecs) obtained
experimentally for different b and p
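A minimal sketch of how such a measurement (described on slide 9) can be taken, assuming MPI: the vector datatype describing b non-contiguous columns of length m with stride lda, the repetition count and the way twd is read off the averaged time are illustrative assumptions, not the presentation's code.

```c
/* Hedged sketch: average time of a broadcast of an MPI derived datatype
 * (b columns of m doubles, separated by a stride of lda doubles), as
 * when a block column of the distributed matrix is sent. */
#include <mpi.h>
#include <stdlib.h>

double time_bcast_derived(MPI_Comm comm, int m, int b, int lda, int reps)
{
    double *buf = calloc((size_t)lda * b, sizeof *buf);
    MPI_Datatype coltype;

    MPI_Type_vector(b, m, lda, MPI_DOUBLE, &coltype);
    MPI_Type_commit(&coltype);

    MPI_Barrier(comm);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++)
        MPI_Bcast(buf, 1, coltype, 0, comm);
    double t = (MPI_Wtime() - t0) / reps;   /* average time per broadcast */

    MPI_Type_free(&coltype);
    free(buf);
    return t;   /* twd ~ (t - ts(p)) / (m * b) words (doubles) sent */
}
```

Repeating the same loop with the message described as m*b contiguous MPI_DOUBLEs gives the corresponding measurement for tws, and a one-word message gives an estimate of ts(p).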
14
Experimental Results
  • Communication system parameters
  • Startup time of MPI broadcast ts
  • Can be considered ts(n, p) = ts(p)

Table 8. Values of ts (in µsecs) obtained
experimentally for different number of processes
15
Experimental Results
  • P4net

16
Experimental Results
  • HPC160smp

17
Experimental Results
  • Parameter selection in P4net

Table 9. Parameter selection for the Cholesky
factorization in P4net
18
Experimental Results
  • Parameter selection in HPC160

Table 10. Parameter selection for the Cholesky
factorization in HPC160 with Shared Memory
(HPC160smp), MemoryChannel (HPC160mc) and both
(HPC160smp-mc)
19
Conclusions
  • The method has been applied successfully to the
    Cholesky factorization and can be applied to
    other linear algebra routines
  • It is necessary to use different costs for the
    different types of MPI communication mechanisms,
  • and to use different costs for the communication
    parameters in systems with more than one
    interconnection network.
  • It is necessary to decide the optimal allocation
    of processes per node, according to the speed of
    the interconnection networks, in hybrid systems.