Title: Auto-optimization of linear algebra parallel routines: the Cholesky factorization
1. Auto-optimization of linear algebra parallel routines: the Cholesky factorization
- Luis-Pedro García, Servicio de Apoyo a la Investigación Tecnológica, Universidad Politécnica de Cartagena, Spain (luis.garcia@sait.upct.es)
- Javier Cuenca, Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, Spain (javiercm@ditec.um.es)
- Domingo Giménez, Departamento de Informática y Sistemas, Universidad de Murcia, Spain (domingo@dif.um.es)
2. Outline
- Introduction
- Parallel routine for the Cholesky factorization
- Experimental Results
- Conclusions
3. Introduction
- Our goal: to obtain linear algebra parallel routines with auto-optimization capacity
- The approach: model the behavior of the algorithm
- This work: improve the model of the communication costs when
  - The routine uses different types of MPI communication mechanisms
  - The system has more than one interconnection network
  - The communication parameters vary with the volume of the communication
4. Introduction
- Theoretical and experimental study of the algorithm. Selection of the algorithmic parameters (AP).
- In linear algebra parallel routines, typical AP and system parameters (SP) are:
  - b, p = r x c and the basic library
  - k1, k2, k3, ts and tw
- An analytical model of the execution time:
  - T(n) = f(n, AP, SP)
5Parallel Cholesky factorization
- The n x n matrix is mapped through a block cyclic
2-D distribution onto a two-dimensional mesh of p
r x c processes (in ScaLAPACK style)
(a) First step (b) Second step
(c) Third step
Figure 1. Work distribution in the first three
steps, with n/b 6 and p 2 x 3
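To make the distribution concrete, the following minimal C sketch (not part of the original slides) computes which process of the r x c mesh owns each global block in a ScaLAPACK-style 2-D block-cyclic distribution, for the configuration of Figure 1 (n/b = 6, p = 2 x 3); the distribution is assumed to start at process (0, 0).

#include <stdio.h>

/* Minimal sketch (not from the slides): ScaLAPACK-style 2-D block-cyclic
 * mapping of a global block (I, J) onto an r x c process mesh.
 * With the distribution starting at process (0, 0), block row I belongs
 * to process row I mod r and block column J to process column J mod c. */
static void block_owner(int I, int J, int r, int c, int *prow, int *pcol)
{
    *prow = I % r;   /* process row holding global block row I       */
    *pcol = J % c;   /* process column holding global block column J */
}

int main(void)
{
    int n_blocks = 6;            /* n/b = 6, as in Figure 1        */
    int r = 2, c = 3;            /* p = 2 x 3 mesh, as in Figure 1 */

    for (int I = 0; I < n_blocks; I++)
        for (int J = 0; J <= I; J++) {   /* lower triangle only */
            int prow, pcol;
            block_owner(I, J, r, c, &prow, &pcol);
            printf("block (%d,%d) -> process (%d,%d)\n", I, J, prow, pcol);
        }
    return 0;
}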
6. Parallel Cholesky factorization
- The general model: t(n) = f(n, AP, SP)
- Problem size:
  - n: matrix size
- Algorithmic parameters (AP):
  - b: block size
  - p = r x c processes
- System parameters (SP), with SP = g(n, AP):
  - k(n,b,p): k2,potf2, k3,trsm, k3,gemm and k3,syrk, the cost of the basic arithmetic operations
  - ts(p): start-up time
  - tws(n,p), twd(n,p): word-sending times for the different types of communication
- Communication cost: tcom(n,p) = ts(p) + n · tw(n,p)
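Once ts, tws and twd have been measured, the communication model can be evaluated directly. A minimal C sketch (the function names are illustrative, not from the slides) of tcom(n,p) = ts(p) + n · tw(n,p), where the word-sending time is chosen according to the type of broadcast:

#include <stdio.h>

/* Minimal sketch (illustrative names): t_com(n, p) = t_s(p) + n * t_w(n, p).
 * t_w is tws(n, p) for a broadcast of an MPI predefined data type and
 * twd(n, p) for a broadcast of an MPI derived data type. */
typedef double (*word_time_fn)(long n, int p);   /* tws or twd, usec/word */

static double t_com(long n, int p, double ts_p, word_time_fn tw)
{
    return ts_p + (double)n * tw(n, p);          /* time in usec */
}

/* Assumed constant word-sending time, standing in for a measured table */
static double tws_example(long n, int p) { (void)n; (void)p; return 0.1; }

int main(void)
{
    printf("t_com = %g usec\n", t_com(100000, 6, 50.0, tws_example));
    return 0;
}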
7. Parallel Cholesky factorization
- Theoretical model
  - Arithmetic cost, tarit
  - Communication cost, tcom
  - Total cost: T = tarit + tcom
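The two terms can be combined into a predicted time for a candidate configuration. The C sketch below is only a rough approximation, built from the textbook n^3/3 flop count for Cholesky and an upper bound of two broadcasts of n*b words per block step (a derived-type broadcast within a process column and a predefined-type broadcast within a process row); it is not the exact cost expression used in this work.

#include <stdio.h>

/* Rough sketch, NOT the exact model of this work: predicted time
 * T = t_arit + t_com for a block Cholesky on an r x c mesh.
 *   k3       : time per flop of the level-3 kernels (usec/flop)
 *   ts       : broadcast start-up time (usec)
 *   twd, tws : word-sending times, derived / predefined data types (usec) */
static double predicted_time(long n, int b, int r, int c,
                             double k3, double ts, double twd, double tws)
{
    long   p      = (long)r * c;
    double t_arit = k3 * (double)n * n * n / (3.0 * p);   /* ~n^3/3 flops */

    /* n/b block steps, each with a derived-type broadcast in a process
     * column and a predefined-type broadcast in a process row; message
     * volume is bounded here by n*b words per broadcast. */
    double steps = (double)n / b;
    double t_com = steps * (2.0 * ts + (double)n * b * (twd + tws));

    return t_arit + t_com;
}

int main(void)
{
    /* Assumed parameter values, for illustration only */
    printf("T approx. %g usec\n",
           predicted_time(4096, 64, 2, 3, 0.002, 50.0, 0.2, 0.1));
    return 0;
}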
8. Experimental Results
- Systems:
  - P4net: a network of four Intel Pentium 4 nodes with a FastEthernet switch, enabling parallel communications between them. The MPI library used is MPICH.
  - HPC160: a network of four quad-processor HP AlphaServer nodes, using Shared Memory (HPC160smp), MemoryChannel (HPC160mc) or both (HPC160smp-mc) for the communications between processes. An MPI library optimized for Shared Memory and for MemoryChannel has been used.
9. Experimental Results
- How to estimate the arithmetic SPs:
  - With routines performing some basic operation (dgemm, dsyrk, dtrsm) with the same data access scheme used in the algorithm
- How to estimate the communication SPs:
  - With routines that communicate rows or columns in the logical mesh of processes
  - With a broadcast of an MPI derived data type between processes in the same column
  - With a broadcast of an MPI predefined data type between processes in the same row
- In both cases the experiments are repeated several times to obtain an average value
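As an illustration of the arithmetic measurement, a minimal C sketch (not the original benchmark code) that estimates k3,gemm by timing dgemm with a shape similar to the update step of the block algorithm and averaging over several repetitions; it assumes a CBLAS interface to the optimized BLAS and POSIX clock_gettime.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

/* Minimal sketch: estimate k3,gemm (usec per flop) by timing an
 * m x b by b x m update, C := C - A*B, averaged over reps runs. */
static double now_usec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e6 + t.tv_nsec * 1e-3;
}

int main(void)
{
    int m = 1024, b = 64, reps = 10;          /* illustrative sizes */
    double *A = calloc((size_t)m * b, sizeof *A);
    double *B = calloc((size_t)b * m, sizeof *B);
    double *C = calloc((size_t)m * m, sizeof *C);

    double start = now_usec();
    for (int r = 0; r < reps; r++)
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, m, b, -1.0, A, m, B, b, 1.0, C, m);
    double elapsed = (now_usec() - start) / reps;

    /* dgemm performs 2*m*m*b flops */
    printf("k3,gemm approx. %g usec/flop\n", elapsed / (2.0 * m * m * b));

    free(A); free(B); free(C);
    return 0;
}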
10. Experimental Results
- The lowest execution times are obtained with the versions of BLAS and LAPACK optimized for Pentium 4 and for Alpha
Table 1. Values of the arithmetic system parameters (in µsec) on Pentium 4 with BLASopt
Table 2. Values of the arithmetic system parameters (in µsec) on Alpha with CXML
11Experimental Results
- But other SPs can depend on n and b, for example
k2,potf2
Table 3. Values of k2,potf2 (in µsec) in Pentium
4 with BLASopt
Table 4. Values of k2,potf2 (in µsec) in Alpha
with CXML
12. Experimental Results
- Communication system parameters
  - Broadcast cost for an MPI predefined data type, tws
Table 5. Values of tws (in µsec) in P4net
Table 6. Values of tws (in µsec) in HPC160
13. Experimental Results
- Communication system parameters
  - Word-sending time of a broadcast for an MPI derived data type, twd
Table 7. Values of twd (in µsec) obtained experimentally for different b and p
14. Experimental Results
- Communication system parameters
  - Start-up time of the MPI broadcast, ts
  - It can be considered that ts(n,p) ≈ ts(p)
Table 8. Values of ts (in µsec) obtained experimentally for different numbers of processes
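A minimal C/MPI sketch (not the original benchmark code) of how ts and tw can be estimated: time MPI_Bcast for several message sizes, take the maximum over the processes involved, average over repetitions, and fit t(n) = ts + n · tw by least squares; here a "word" is one MPI_DOUBLE.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int sizes[] = { 1024, 4096, 16384, 65536, 262144 }; /* in doubles */
    const int nsizes  = 5, reps = 20;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;

    for (int s = 0; s < nsizes; s++) {
        int n = sizes[s];
        double *buf = malloc((size_t)n * sizeof *buf);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int r = 0; r < reps; r++)
            MPI_Bcast(buf, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        double local = (MPI_Wtime() - t0) / reps * 1e6;       /* usec */

        double t;                      /* slowest process sets the time */
        MPI_Reduce(&local, &t, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
        free(buf);

        if (rank == 0) { sx += n; sy += t; sxx += (double)n * n; sxy += (double)n * t; }
    }

    if (rank == 0) {                   /* least-squares line t = ts + n*tw */
        double tw = (nsizes * sxy - sx * sy) / (nsizes * sxx - sx * sx);
        double ts = (sy - tw * sx) / nsizes;
        printf("ts approx. %g usec, tw approx. %g usec/word\n", ts, tw);
    }
    MPI_Finalize();
    return 0;
}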
17. Experimental Results
- Parameter selection in P4net
Table 9. Parameter selection for the Cholesky factorization in P4net
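A minimal C sketch (illustrative, with assumed system-parameter values) of the selection step itself: evaluate the analytical model for every candidate block size b and mesh shape r x c and keep the configuration with the lowest predicted time.

#include <stdio.h>

/* Crude stand-in for the analytical model T(n) = t_arit + t_com,
 * with assumed values of k3, ts and tw (in usec); the real model uses
 * the measured system parameters of the target platform. */
static double model_time(long n, int b, int r, int c)
{
    double k3 = 0.002, ts = 50.0, tw = 0.1;      /* assumed SP values */
    long   p  = (long)r * c;
    double t_arit = k3 * (double)n * n * n / (3.0 * p);
    double t_com  = ((double)n / b) * (2.0 * ts + (double)n * b * tw);
    return t_arit + t_com;
}

int main(void)
{
    long n = 4096;
    int  p_max = 4;                               /* e.g. four processes */
    int  blocks[] = { 16, 32, 64, 128 };          /* candidate block sizes */
    int  best_b = 0, best_r = 0, best_c = 0;
    double best = -1.0;

    /* Exhaustive search over the candidate algorithmic parameters */
    for (int i = 0; i < 4; i++)
        for (int r = 1; r <= p_max; r++)
            for (int c = 1; r * c <= p_max; c++) {
                double t = model_time(n, blocks[i], r, c);
                if (best < 0.0 || t < best) {
                    best = t; best_b = blocks[i]; best_r = r; best_c = c;
                }
            }

    printf("selected b = %d, p = %d x %d (predicted %g usec)\n",
           best_b, best_r, best_c, best);
    return 0;
}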
18. Experimental Results
- Parameter selection in HPC160
Table 10. Parameter selection for the Cholesky factorization in HPC160 with Shared Memory (HPC160smp), MemoryChannel (HPC160mc) and both (HPC160smp-mc)
19. Conclusions
- The method has been applied successfully to the Cholesky factorization and can be applied to other linear algebra routines
- It is necessary to use different costs for the different types of MPI communication mechanisms, and different values of the communication parameters in systems with more than one interconnection network
- It is also necessary to decide the optimal allocation of processes per node, according to the speed of the interconnection networks (hybrid systems)