Automatic Parameterisation of Parallel Linear Algebra Routines

1
Automatic Parameterisation of Parallel Linear
Algebra Routines
  • Domingo Giménez
  • Javier Cuenca
  • José González
  • University of Murcia
  • SPAIN
  • Algèbre Linéaire et Arithmétique: Calcul
    Numérique, Symbolique et Parallèle
  • Rabat, Morocco. 28-31 May 2001

2
Outline
  • Current Situation of Linear Algebra Parallel
    Routines (LAPRs)
  • Objective
  • Approach I: Analytical Model of the LAPRs
  • Application: Jacobi Method on Origin 2000
  • Approach II: Exhaustive Executions
  • Application: Gaussian elimination on networks of
    processors
  • Validation with the LU factorization
  • Conclusions
  • Future Work

3
Current Situation of Linear Algebra Parallel
Routines (LAPRs)
  • Linear Algebra: highly optimizable operations
  • Optimizations are Platform Specific
  • Traditional method: Hand-Optimization for each
    platform

4
Problems of traditional method
  • Time-consuming
  • Incompatible with Hardware Evolution
  • Incompatible with changes in the system
    (architecture and basic libraries)
  • Unsuitable for dynamic systems
  • Misuse by non-expert users

5
Current approaches
  • ATLAS, FLAME, I-LIB
  • Analyse platform characteristics in detail
  • Sequential code
  • Empirical results of the LAPR Automation
  • High Installation Time

6
Our objective
  • Develop a methodology for obtaining Automatically
    Tuned Software
  • Execution Environment
  • Auto-tuning Software

7
Methodology
  • Routines Parameterised:
      • System parameters, Algorithmic parameters
  • System parameters obtained at installation time:
      • Analytical model of the routine and simple
        installation routines to obtain the system
        parameters
      • A reduced number of executions at installation
        time
  • Algorithmic parameters obtained at running time:
      • From the analytical model with the system
        parameters obtained in the installation process
      • From the file with information generated in the
        installation process

8
Analytical modelling
  • System parameters obtained at installation time:
      • Analytical model of the routine and simple
        installation routines to obtain the system
        parameters
  • Algorithmic parameters obtained at running time:
      • From the analytical model with the system
        parameters obtained in the installation process

9
Analytical Model
  • The behaviour of the algorithm on the platform is
    defined by:
  • Texec = f(SPs, n, APs)
  • SPs = f(n, APs): System Parameters
  • APs: Algorithmic Parameters
  • n: Problem Size

10
Analytical Model
  • System Parameters (SPs):
      • Hardware Platform
      • Physical Characteristics
      • Current Conditions
      • Basic libraries
  • How to estimate each SP?
      • 1º Obtain the kernel of performance cost of
        the LAPR
      • 2º Make an Estimation Routine from this
        kernel
  • Two Kinds of SPs:
      • Communication System Parameters (CSPs)
      • Arithmetic System Parameters (ASPs)

LAPRs Performance
11
Analytical Model
  • Arithmetic System Parameters (ASPs):
      • tc: arithmetic cost
      • but using BLAS: k1, k2 and k3
  • Computation Kernel of the LAPR → Estimation
    Routine:
      • Similar storage scheme
      • Similar quantity of data
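
The idea of an arithmetic estimation routine can be sketched as follows: time a small kernel that resembles the computation of the LAPR and divide by the number of operations. This is a minimal sketch, not the authors' routine. The pure-Python triple loop stands in for the BLAS kernel, and the names `matmul` and `estimate_k3` are hypothetical.

```python
import time

def matmul(a, b_mat):
    """Plain triple-loop product of two square matrices (lists of lists)."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b_mat[k][j]
    return c

def estimate_k3(block_size, repeats=3):
    """Estimate the cost of one arithmetic operation, in seconds.

    Assumption: the kernel is a block_size x block_size matrix product,
    which performs 2 * block_size**3 floating-point operations.
    """
    a = [[1.0] * block_size for _ in range(block_size)]
    b_mat = [[2.0] * block_size for _ in range(block_size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b_mat)
        best = min(best, time.perf_counter() - start)  # keep the fastest run
    return best / (2 * block_size ** 3)
```

Calling `estimate_k3(32)` and `estimate_k3(64)` gives one value per block size, which matches the slide's requirement that the estimation use a similar quantity of data to the real routine.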

12
Analytical Model
  • Communication System Parameters (CSPs):
      • ts: start-up time
      • tw: word-sending time
  • Communication Kernel of the LAPR → Estimation
    Routine:
      • Similar kind of communication
      • Similar quantity of data
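
One common way to obtain ts and tw from measurements is a straight-line fit. The sketch below assumes the usual linear cost model, where sending m words takes ts + m * tw. The slides define the two parameters but do not state this formula, so treat it as an assumption. The function name `fit_ts_tw` is hypothetical, and the timings would come from a communication kernel similar to the one in the LAPR.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of times[i] ~ ts + sizes[i] * tw.

    sizes: message lengths in words; times: measured transfer times.
    Returns (ts, tw) in the same time unit as `times`.
    """
    n = len(sizes)
    mean_x = sum(sizes) / n
    mean_y = sum(times) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    tw = sxy / sxx               # slope: cost per word
    ts = mean_y - tw * mean_x    # intercept: start-up cost
    return ts, tw
```

At least two different message sizes are needed, otherwise the slope is undefined.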

13
Analytical Model
  • Algorithmic Parameters (APs): values chosen in
    each execution
  • b: block size
  • p: number of processors
  • r × c: logical topology, grid configuration
    (logical 2D mesh)
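
As a small illustration of the r × c parameter, the candidate grid configurations for p processors are the factor pairs of p. The helper name `grid_configurations` is hypothetical.

```python
def grid_configurations(p):
    """Return every (r, c) with r * c == p, ordered by increasing r."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]
```

For p = 8 the candidates are 1 × 8, 2 × 4, 4 × 2 and 8 × 1; the first and last correspond to a logical ring laid out along one direction of the mesh.
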
14
The Methodology. Step by step
Pre-installing (manual):
  • 1º Make the Analytical Model: Texec = f(SPs, n, APs)
  • 2º Write the Estimation Routines for the SPs
Installing on a Platform (automatic):
  • 3º Estimate the SPs using the Estimation Routines
    of step 2
  • 4º Write a Configuration File, or include the
    information in the LAPR: for each n, the APs that
    minimize Texec
Execution:
  • The user executes the LAPR for a size n
  • The LAPR obtains the optimal APs
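
Step 4 can be sketched as a search over the APs followed by a dump to a configuration file. This is a sketch under stated assumptions: `toy_model` is a placeholder cost function invented for illustration and is NOT the analytical model from the presentation; the SP names `tc`, `ts`, `tw` follow the slides; the function names and the JSON layout are hypothetical.

```python
import json

def best_aps(model, sps, n, block_sizes, processor_counts):
    """Return the APs (b, p, r, c) with the smallest modelled time for size n."""
    best_time, best = None, None
    for p in processor_counts:
        for r in range(1, p + 1):
            if p % r:
                continue  # r must divide p to form an r x c grid
            c = p // r
            for b in block_sizes:
                t = model(sps, n, b, r, c)
                if best_time is None or t < best_time:
                    best_time, best = t, {"b": b, "p": p, "r": r, "c": c}
    return best

def build_configuration(model, sps, sizes, block_sizes, processor_counts):
    """Map each problem size to the APs that minimise the modelled time."""
    return {str(n): best_aps(model, sps, n, block_sizes, processor_counts)
            for n in sizes}

def dump_configuration(config):
    """Serialise the configuration so the LAPR can read it at run time."""
    return json.dumps(config, indent=2)

def toy_model(sps, n, b, r, c):
    """Placeholder cost model (an assumption, not the slides' formula)."""
    p = r * c
    arithmetic = sps["tc"] * n ** 3 / p
    communication = (n / b) * (r + c) * (sps["ts"] + sps["tw"] * b * n / p)
    return arithmetic + communication
```

Passing the model as a function keeps the search independent of any particular routine, so the same installation code could serve several LAPRs.
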
15
Application Example
  • LAPR: One-sided Block Jacobi Method to solve the
    Symmetric Eigenvalue Problem.
  • Message-passing with MPI
  • Logical Ring, Logical 2D-Mesh
  • Platform: SGI Origin 2000

16
Application Example. Algorithm Scheme
[Figure: algorithm scheme. Matrices B, W and D of size n, with block size b and row blocks of n/r; mesh labels 00, 01, 10, 11, 20, 21.]
17
Application Example Pre-installing.
1º Make the Analytical Model: Texec = f(SPs, n, APs)

18
Application Example Pre-installing.
2º Write the Estimation Routines for the SPs:
  • k3: matrix-matrix multiplication with DGEMM
  • k1: Givens Rotation to 2 vectors with DROT
  • ts, tw: communications along the 2 directions
    of the 2D-mesh

19
Application Example Installing
3º Estimate the SPs using the Estimation Routines:
  • k1 = 0.01 µs
  • k3 = 0.005 µs (b = 32), 0.004 µs (b = 64),
    0.003 µs (b = 128)
  • ts = 20 µs
  • tw = 0.1 µs
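
Because k3 depends on the block size, the model needs an SP value for whatever b it is evaluating. A minimal sketch of such a lookup is given below. It assumes the slide's figures pair up as k3 = 0.005, 0.004 and 0.003 µs for b = 32, 64 and 128, which is one reading of the flattened slide text. The names `K3_BY_BLOCK` and `k3_for_block` are hypothetical.

```python
# Measured k3 in microseconds per operation, keyed by block size.
# The pairing of values and block sizes is an assumed reading of the slide.
K3_BY_BLOCK = {32: 0.005, 64: 0.004, 128: 0.003}

def k3_for_block(b, table=K3_BY_BLOCK):
    """Return the k3 measured at the block size closest to b."""
    nearest = min(table, key=lambda measured: abs(measured - b))
    return table[nearest]
```

A block size that was not measured, such as b = 100, falls back to the nearest measured one (here b = 128).
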

20
Application Example Executing
Comparison of execution times using
different sets of Execution Parameters (4
processors)
21
Application Example Executing
Comparison of execution times using
different sets of Execution Parameters (8
processors)
22
Application Example Executing
  • LAPR: One-sided Block Jacobi Method
  • Algorithmic Parameters: block size, mesh topology
  • Platform: SGI Origin 2000 with message-passing
  • System Parameters: arithmetic costs,
    communication costs
  • Satisfactory Reduction of the Execution Time:
    from 25% higher than the optimal to only 2%

23
Outline
  • Current Situation of Linear Algebra Parallel
    Routines (LAPRs)
  • Objective
  • Approach I: Analytical Model of the LAPRs
  • Application: Jacobi Method on Origin 2000
  • Approach II: Exhaustive Executions
  • Application: Gaussian elimination on networks of
    processors
  • Validation with the LU factorization
  • Conclusions
  • Future Work

24
Exhaustive Execution
  • System parameters obtained at installation time:
      • Installation routines making a reduced number of
        executions at installation time
  • Algorithmic parameters obtained at running time:
      • From the file with information generated in the
        installation process

25
Exhaustive Execution
  • The behaviour of the algorithm on the platform is
    defined (as in Analytical Modelling) by:
  • Texec = f(SPs, n, APs)
  • SPs = f(n, APs): System Parameters
  • APs: Algorithmic Parameters
  • n: Problem Size

26
Exhaustive Execution
  • Identify Algorithmic Parameters (APs) (as in
    Analytical Modelling): values chosen in each
    execution
  • b: block size
  • p: number of processors
  • r × c: logical topology, grid configuration
    (logical 2D mesh)
27
The Methodology. Step by step
Pre-installing (manual):
  • 1º Determine the APs
  • 2º Decide heuristics to reduce execution time
    in the installation process
Installing on a Platform (automatic):
  • 3º Decide (the manager) the problem sizes to be
    analysed
  • 4º Execute and write a Configuration File, or
    include the information in the LAPR: for each n,
    the APs that minimize Texec
Execution:
  • The user executes the LAPR for a size n
  • The LAPR obtains the optimal APs
28
Application Example
  • LAPR: Gaussian elimination.
  • Message-passing with MPI
  • Logical Ring, rowwise block-cyclic striped
    partitioning
  • Platform: networks of processors (heterogeneous
    system)

29
Application Example Pre-installing.
1º Determine the APs (logical ring, rowwise
block-cyclic striped partitioning):
  • p: number of processors
  • b: block size for the data distribution
  • different block sizes in heterogeneous systems

[Figure: rows of the matrix assigned cyclically in blocks b0, b1, b2, b0, b1, b2, ...]
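
The rowwise block-cyclic striped partitioning can be sketched for the homogeneous case, where every processor uses the same block size b. The function names `owner_of_row` and `rows_of_processor` are hypothetical.

```python
def owner_of_row(row, b, p):
    """Rank of the processor owning `row` when blocks of b rows are
    dealt out cyclically to p processors."""
    return (row // b) % p

def rows_of_processor(rank, n, b, p):
    """All rows of an n-row matrix held by processor `rank`."""
    return [row for row in range(n) if owner_of_row(row, b, p) == rank]
```

With n = 8, b = 2 and p = 3, rows 0-1 go to processor 0, rows 2-3 to processor 1, rows 4-5 to processor 2, and rows 6-7 wrap around to processor 0.
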
30
Application Example Pre-installing.
  • 2º Decide heuristics to reduce execution time in
    the installation process:
      • Execution time varies in a continuous way with
        the problem size and the APs
      • Consider the system as homogeneous
  • Installation can finish:
      • When Analytical and Experimental predictions
        coincide
      • When a certain time has been spent on the
        installation
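
An installation by exhaustive execution with a time limit might look like the sketch below. It implements only the second stopping rule (a fixed time budget); the comparison with analytical predictions is not modelled. All names (`timed`, `install`, `lookup`) are hypothetical, and choosing the nearest installed size at run time is an assumption, since the slides do not say how sizes between installed ones are handled.

```python
import time

def timed(run):
    """Wrap a routine so that calling it returns its wall-clock time."""
    def measure(n, aps):
        start = time.perf_counter()
        run(n, aps)
        return time.perf_counter() - start
    return measure

def install(sizes, candidate_aps, measure, time_budget_s):
    """For each size, keep the candidate APs with the lowest measured time.

    Stops trying new candidates once the time budget is spent.
    """
    deadline = time.monotonic() + time_budget_s
    config = {}
    for n in sizes:
        best_time, best = None, None
        for aps in candidate_aps:
            if time.monotonic() > deadline:
                break
            elapsed = measure(n, aps)
            if best_time is None or elapsed < best_time:
                best_time, best = elapsed, aps
        if best is not None:
            config[n] = best
    return config

def lookup(config, n):
    """APs stored for the installed size closest to n (ties: smaller size)."""
    nearest = min(config, key=lambda size: (abs(size - n), size))
    return config[nearest]
```

Taking `measure` as a parameter lets the installation be tested with synthetic timings, while `timed(run)` provides real measurements of an actual routine.
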


31
Application Example Installing
  • Homogeneous Systems:
      • 3º The manager decides the problem sizes
      • 4º Execute and write a Configuration File, or
        include the information in the LAPR: for each n,
        the APs that minimize Texec
  • Heterogeneous Systems:
      • 3º The manager decides the problem sizes
      • 4º Execute, then:
          • write a Configuration File: for each n, the
            APs that minimize Texec
          • write a Speed File, with the relative speeds
            of the processors in the system

32
Application Example Installation Routines
  • RI-THE: Obtains p and b from the formula.
  • RI-HOM: Obtains p and b through a reduced number
    of executions.
  • RI-HET: 1º As RI-HOM. 2º Obtains bi for each
    processor.
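
One plausible way to obtain a block size bi per processor is to scale a base block size by the relative speeds from the Speed File. The proportional rule below is an assumption; the slides do not state how RI-HET computes bi. The names `heterogeneous_block_sizes` and `owner_of_row_het` are hypothetical.

```python
def heterogeneous_block_sizes(base_block, relative_speeds):
    """Give each processor a block proportional to its relative speed.

    The slowest processor receives `base_block` rows per block.
    """
    slowest = min(relative_speeds)
    return [max(1, round(base_block * s / slowest)) for s in relative_speeds]

def owner_of_row_het(row, block_sizes):
    """Owner of `row` when blocks of sizes b0, b1, ... repeat cyclically."""
    offset = row % sum(block_sizes)
    for rank, size in enumerate(block_sizes):
        if offset < size:
            return rank
        offset -= size
    raise AssertionError("unreachable: offset is below the cycle length")
```

With relative speeds 1, 2, 1 and a base block of 8 rows, the block sizes become 8, 16 and 8, so the faster processor holds twice as many rows in each cycle.
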

33
Application Example Systems
Three different configurations:
  • PLA_HOM: 5 SUN Ultra-1
  • PLA_HYB: 5 SUN Ultra-1, 1 SUN Ultra-5
  • PLA_HET: 1 SUN Ultra-1, 1 SUN Ultra-5, 1 SUN
    Ultra-1 (manages the file system)
34
Application Example Executing
Experimental results in PLA-HOM. Quotient
between the execution time with the parameters
from the Installation Routine and the optimum
execution time.
35
Application Example Executing
Experimental results in PLA-HYB. Quotient
between the execution time with the parameters
from the Installation Routine and the optimum
execution time.
36
Application Example Executing
Experimental results in PLA-HET. Quotient
between the execution time with the parameters
from the Installation Routine and the optimum
execution time.
37
Comparison
  • Two techniques for automatic tuning of Parallel
    Linear Algebra Routines:
      • 1. Analytical Modelling: for predictable systems
        (homogeneous, static, ...) like Origin 2000
      • 2. Exhaustive Execution: for less predictable
        systems (heterogeneous, dynamic, ...) like
        networks of workstations
  • Transparent to the user
  • Execution close to the optimum

38
Outline
  • Current Situation of Linear Algebra Parallel
    Routines (LAPRs)
  • Objective
  • Approach I: Analytical Model of the LAPRs
  • Application: Jacobi Method on Origin 2000
  • Approach II: Exhaustive Executions
  • Application: Gaussian elimination on networks of
    processors
  • Validation with the LU factorization
  • Conclusions
  • Future Work

39
Validation with the LU factorization
  • To validate the methodology it is necessary to
    experiment with:
      • More routines: block LU factorization
      • More systems:
          • Architectures: IBM SP2 and Origin 2000
          • Libraries: reference BLAS, machine BLAS, ATLAS

40
Sequential LU
Analytical Model: Texec = f(SPs, n, APs)
  • SPs: cost of arithmetic operations of different
    levels: k1, k2, k3
  • APs: block size b

[Figure: block LU scheme with block size b; panel labels LU, ES, ES, UM.]
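
To show where the block size b enters a sequential LU, here is a generic right-looking blocked LU factorization without pivoting. It is a textbook sketch, not necessarily the variant used in the presentation, and the name `blocked_lu` is hypothetical. Each step factorizes a panel, solves a triangular system for the block row, and updates the trailing matrix.

```python
def blocked_lu(a, b):
    """In-place blocked LU of a square matrix (list of lists), no pivoting.

    On return, the upper triangle of `a` holds U and the strict lower
    triangle holds the multipliers of the unit lower triangular L.
    Assumes all pivots are nonzero.
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # 1. Unblocked LU of the panel: columns k0..k1-1, rows k0..n-1.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2. Triangular solve for the block row: U12 = inv(L11) * A12.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # 3. Update of the trailing matrix: A22 -= L21 * U12.
        for i in range(k1, n):
            for k in range(k0, k1):
                lik = a[i][k]
                for j in range(k1, n):
                    a[i][j] -= lik * a[k][j]
    return a
```

In an optimized code, step 3 is the part that maps to a level-3 BLAS matrix product, which is why the best b depends on the cost of that kernel on each platform.
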
41
Sequential LU. Comparison in IBM SP2
Quotient between different execution
times and the optimum execution time
42
Sequential LU. Model execution time/optimum
execution time
Quotient between the execution time
with the parameters provided by the model and the
optimum execution time, with different basic
libraries. In SUN 1
43
Parallel LU
Analytical Model: Texec = f(SPs, n, APs)
  • SPs: cost of arithmetic operations (k1, k2, k3),
    cost of communications (ts, tw)
  • APs: block size b, number of processors p, grid
    configuration r × c

[Figure: blocks of size b distributed cyclically over a 2 × 3 processor grid; block rows alternate between 00 01 02 00 01 02 and 10 11 12 10 11 12.]
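
The distribution suggested by the figure labels is a two-dimensional block-cyclic layout, which can be sketched as follows. The function names `owner_of_block` and `layout` are hypothetical, and reading the labels as (grid row, grid column) pairs is an assumption.

```python
def owner_of_block(block_row, block_col, r, c):
    """Grid coordinates of the processor owning a block in an r x c mesh."""
    return (block_row % r, block_col % c)

def layout(block_rows, block_cols, r, c):
    """Text picture of the distribution, one string per row of blocks."""
    lines = []
    for i in range(block_rows):
        owners = (owner_of_block(i, j, r, c) for j in range(block_cols))
        lines.append(" ".join(f"{pr}{pc}" for pr, pc in owners))
    return lines
```

For a 2 × 3 grid, `layout(2, 6, 2, 3)` yields the rows `00 01 02 00 01 02` and `10 11 12 10 11 12`.
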
44
Parallel LU. Comparison in IBM SP2
Quotient between the execution time with
the parameters provided by the model and the
optimum execution time. In the sequential case,
and in parallel with 4 and 8 processors.
45
Parallel LU. Comparison in Origin 2000
Quotient between the execution time with
the parameters provided by the model and the
optimum execution time. In the sequential case,
and in parallel with 4 and 8 processors.
46
Parallel LU. Conclusions
  • The modelling of the algorithm provides
    satisfactory results in different systems:
      • Origin 2000, IBM SP2
      • reference BLAS, machine BLAS, ATLAS
  • The prediction is worse in some cases:
      • When the number of processors increases
      • In multicomputers where communications are more
        important (IBM SP2)
  • → Exhaustive Executions

47
Parallel LU. Exhaustive Execution
  • If the manager installs the routine for sizes
    512, 1536, 2560, and executions are performed for
    sizes 1024, 2048, 3072, the execution time is
    well predicted
  • The same policy can be used in the installation
    of other software
Quotient between the execution time with the parameters
provided by the installation process and the
optimum execution time. With ScaLAPACK, in IBM
SP2.
48
Conclusions
  • Parameterisation of Parallel Linear Algebra
    Routines enables development of Automatically
    Tuned Software
  • Two techniques can be used:
      • Analytical Modelling
      • Exhaustive Executions
      • or a combination of both
  • Experiments performed in different systems and
    with different routines

49
Future Work
  • We aim to develop a methodology valid for a wide
    range of systems, and to include it in the design
    of linear algebra libraries
  • It is necessary to analyse the methodology in
    more systems and with more routines
  • Architecture of an Automatically Tuned Linear
    Algebra Library
  • At the moment we are analysing routines
    individually, but it could be preferable to
    analyse algorithmic schemes

50
Architecture of an Automatically Tuned Linear
Algebra Library
[Diagram: Basic routines declaration (manager), Installation file (manager), Installation routines (designer), Basic routines library, Installation (manager), SP file, AP file, Library (designer), Compilation.]
51
Architecture of an Automatically Tuned Linear
Algebra Library
[Diagram, build step: Installation routines (designer), Library (designer).]
52
Architecture of an Automatically Tuned Linear
Algebra Library
[Diagram, build step: Basic routines declaration (manager), Installation routines (designer), Basic routines library, Library (designer).]
53
Architecture of an Automatically Tuned Linear
Algebra Library
[Diagram, build step: Basic routines declaration (manager), Installation file (manager), Installation routines (designer), Basic routines library, Installation (manager), Library (designer).]
54
Architecture of an Automatically Tuned Linear
Algebra Library
[Diagram, build step: Basic routines declaration (manager), Installation file (manager), Installation routines (designer), Basic routines library, Installation (manager), SP file, AP file, Library (designer).]