Title: Advances in the Optimization of Parallel Routines II
1Advances in the Optimization of Parallel Routines
(II)
- Domingo Giménez
- Departamento de Informática y Sistemas
- Universidad de Murcia, Spain
- dis.um.es/domingo
2Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
3Collaborations and self-references
- Autotuning routines
- J. Cuenca, J. González
- Automatic parameterization of parallel linear algebra routines. 2001
- J. Cuenca
- Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002
4Autotuning routines
- Our approach
- Routines Parameterised
- System parameters, Algorithmic parameters
- System parameters obtained at installation time
- Analytical model of the routine and simple installation routines to obtain the system parameters
- A reduced number of executions at installation time
- Algorithmic parameters
- From the analytical model with the system parameters obtained in the installation process
5Autotuning routines
6Autotuning routines
7Autotuning routines
- LAR-MOD: Analytical Model of LAR
- The behaviour of the algorithm on the platform is defined
- Texec = f(SPs, n, APs)
- SPs = f(n, APs): System Parameters
- APs: Algorithmic Parameters
- n: Problem Size
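The model Texec = f(SPs, n, APs) can be pictured as an ordinary function. The sketch below is only an illustration: the cost expression is an assumed, simplified stand-in (2n³/3 flops spread over p = r·c processes, plus one message of n·b words per block step), not the model used in the talk, and it treats the SPs as constants although the slide lets them depend on n and the APs. All class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    """SPs: measured at installation time (all in seconds)."""
    k3: float  # cost of one level-3 BLAS floating point operation
    ts: float  # message start-up time
    tw: float  # time to send one word


@dataclass(frozen=True)
class AlgorithmicParams:
    """APs: chosen by the tuner."""
    b: int  # block size
    r: int  # rows of the logical process grid
    c: int  # columns of the logical process grid


def t_exec(sp: SystemParams, n: int, ap: AlgorithmicParams) -> float:
    """Texec = f(SPs, n, APs) under an ASSUMED, simplified cost model."""
    p = ap.r * ap.c
    arithmetic = sp.k3 * (2.0 * n**3 / 3.0) / p
    if p == 1:
        return arithmetic  # no communication on a single process
    steps = n / ap.b
    communication = steps * (sp.ts + sp.tw * n * ap.b)
    return arithmetic + communication
```

Calling `t_exec` for several candidate `AlgorithmicParams` and comparing the results is the basic operation on which the rest of the autotuning process relies.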
8Autotuning routines
- LAR-MOD: Analytical Model of LAR
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
LARs Performance
9Autotuning routines
- LAR-MOD: Analytical Model of LAR
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
- Two Kinds of SPs
- Communication System Parameters (CSPs)
- Arithmetic System Parameters (ASPs)
LARs Performance
10Autotuning routines
- LAR-MOD: Analytical Model of LAR
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
- Two Kinds of SPs
- Communication System Parameters (CSPs)
- ts: start-up time
- tw: word-sending time
LARs Performance
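A common way to use ts and tw is the linear model t(m) = ts + m·tw for a message of m words; the slide names the two parameters but not the formula, so the linear form is an assumption here. Under that assumption, both parameters can be fitted by least squares from timed messages of different lengths. The function names are hypothetical.

```python
def fit_ts_tw(samples):
    """Least-squares fit of t = ts + tw * m to (m, t) pairs.

    m is the message length in words, t the measured time.
    Returns the pair (ts, tw).
    """
    k = len(samples)
    if k < 2:
        raise ValueError("need at least two message sizes")
    mean_m = sum(m for m, _ in samples) / k
    mean_t = sum(t for _, t in samples) / k
    var_m = sum((m - mean_m) ** 2 for m, _ in samples)
    if var_m == 0:
        raise ValueError("message sizes must differ")
    cov = sum((m - mean_m) * (t - mean_t) for m, t in samples)
    tw = cov / var_m
    ts = mean_t - tw * mean_m
    return ts, tw


def message_time(ts, tw, m):
    """Predicted time to send m words under the linear model."""
    return ts + tw * m
```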
11Autotuning routines
- LAR-MOD: Analytical Model of LAR
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
- Two Kinds of SPs
- Communication System Parameters (CSPs)
- Arithmetic System Parameters (ASPs)
- tc: arithmetic cost. Using BLAS: k1, k2 and k3
LARs Performance
12Autotuning routines
- LAR-MOD: Analytical Model of LAR
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
LARs Performance
- How to estimate each SP?
- 1. Obtain the kernel of performance cost of the LAR
- 2. Make an Estimation Routine from this kernel
13Autotuning routines
14Autotuning routines
- Design: making the LAR-ERs
15Autotuning routines
- LAR-ERs: Estimation Routines
- Arithmetic System Parameters (ASPs)
- Computation Kernel of the LAR → Estimation Routine
- Similar storage scheme
- Similar quantity of data
- Communication System Parameters (CSPs)
- Communication Kernel of the LAR → Estimation Routine
- Similar kind of communication
- Similar quantity of data
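An estimation routine for an arithmetic parameter might time the computation kernel on blocks of the size the LAR will use and divide by the operation count. In the sketch below a pure-Python triple loop stands in for the matrix-matrix kernel (a real estimation routine would call the installed BLAS), and k3 is taken as elapsed time over 2b³ operations. Both function names are hypothetical.

```python
import time


def gemm_kernel(a, b_mat, c, b):
    """C += A * B for b-by-b blocks stored as lists of rows."""
    for i in range(b):
        a_row, c_row = a[i], c[i]
        for k in range(b):
            a_ik = a_row[k]
            b_row = b_mat[k]
            for j in range(b):
                c_row[j] += a_ik * b_row[j]


def estimate_k3(b, repeats=3):
    """Estimate k3 (seconds per floating point operation) for block size b.

    Keeps the best of several runs to reduce timing noise.
    """
    a = [[1.0] * b for _ in range(b)]
    b_mat = [[2.0] * b for _ in range(b)]
    best = float("inf")
    for _ in range(repeats):
        c = [[0.0] * b for _ in range(b)]
        start = time.perf_counter()
        gemm_kernel(a, b_mat, c, b)
        best = min(best, time.perf_counter() - start)
    return best / (2.0 * b**3)
```

Running `estimate_k3` for each block size of interest gives one k3 value per size, which matches the idea of estimating with a similar quantity of data.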
16Autotuning routines
- Design process has finished
17Autotuning routines
- Installation: running the LAR-ERs
18Autotuning routines
- Installation: obtaining the OAP
19Autotuning routines
- Installation: obtaining the OAP
- Algorithmic Parameters (APs)
- Once the SP values are known, the optimum values for the APs (OAP) are calculated
- b: block size
- p: number of processors
- r × c: logical topology
- grid configuration (logical 2D mesh)
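One simple way to obtain the OAP, once the SPs are known, is to evaluate the model for every candidate block size and every r × c mesh and keep the cheapest combination. The exhaustive search and the cost model below are assumptions made for illustration; the slides do not state how the minimum is located, and `toy_model` is not the LU model from the talk.

```python
from itertools import product


def grids(p):
    """All logical 2D meshes (r, c) with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def toy_model(n, b, r, c, k3=1e-9, ts=1e-5, tw=1e-8):
    """ASSUMED cost model: arithmetic, a block-size penalty, communication."""
    p = r * c
    d = max(r, c)
    arithmetic = k3 * (2.0 * n**3 / 3.0) / p + k3 * b * n**2
    if p == 1:
        return arithmetic
    communication = (n / b) * d * ts + tw * n**2 * d / p
    return arithmetic + communication


def select_oap(n, max_procs, block_sizes, model=toy_model):
    """Return the b, r, c (and predicted time) that minimise the model."""
    best = None
    for p in range(1, max_procs + 1):
        for (r, c), b in product(grids(p), block_sizes):
            t = model(n, b, r, c)
            if best is None or t < best["time"]:
                best = {"time": t, "b": b, "r": r, "c": c}
    return best
```

For example, `select_oap(2000, 8, [16, 32, 64, 128])` searches all meshes with up to 8 processes; using fewer processes than are available is a valid outcome when communication dominates.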
20Autotuning routines
- Installation: putting it all together
SYSTEM MANAGER
21Autotuning routines
- Experiments
- LAR: block LU factorization.
- Platforms: IBM SP2, SGI Origin 2000, NoW
- Basic Libraries: reference BLAS, machine BLAS, ATLAS
22Autotuning routines
- LU on IBM SP2
- Quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters)
23Autotuning routines
- LU on Origin 2000
- Quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters)
24Autotuning routines
- LU on NoW
- Quotient between the execution time with the parameters selected by the model and the lowest experimental execution time (varying the value of the parameters)
25Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
26Collaborations and self-references
- Modifications to libraries hierarchy
- J. Cuenca, J. González
- Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004
27Modifications to libraries hierarchy
- In the optimization of routines, individual basic operations appear repeatedly
- LU
- QR
28Modifications to libraries hierarchy
- The information generated to install a routine could be used for a different routine without additional experiments
- ts and tw are obtained when the communication library (MPI, PVM, ...) is installed
- k3,gemm is obtained when the basic computational library (BLAS, ATLAS, ...) is installed
29Modifications to libraries hierarchy
- To determine
- the type of experiments necessary for the different routines in the library
- ts and tw obtained with ping-pong, broadcast, ...
- k3,gemm obtained for small block sizes, ...
- the format in which the data will be stored, to facilitate its use when installing other routines
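The storage format is left open in the slides. As one hypothetical possibility, the measured parameters could be kept in a JSON file keyed by the library that produced them, together with the experiment used, so that a routine installed later can read a value instead of repeating the experiment. The layout and every function name below are assumptions.

```python
import json


def record_sp(info, library, name, value, experiment):
    """Store one measured system parameter under the library that produced it."""
    info.setdefault(library, {})[name] = {
        "value": value,
        "experiment": experiment,
    }
    return info


def save_info(info, path):
    """Write the installation information as human-readable JSON."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(info, fh, indent=2, sort_keys=True)


def load_info(path):
    """Read installation information written by save_info."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def lookup_sp(info, library, name):
    """Value of a parameter recorded earlier; raises KeyError if absent."""
    return info[library][name]["value"]
```

A routine installed on top of MPI would then call `lookup_sp(info, "MPI", "ts")` rather than rerunning the ping-pong experiment.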
30Modifications to libraries hierarchy
- The method could be valid not only for one library (that I am developing) but also for other libraries that I or somebody else will develop in the future
- the type of experiments
- the format in which the data will be stored
- must be decided by the Parallel Linear Algebra Community
- and the typical hierarchy of libraries would change
31Modifications to libraries hierarchy
- Typical hierarchy of Parallel Linear Algebra libraries
ScaLAPACK
PBLAS
LAPACK
BLACS
BLAS
Communications
32Modifications to libraries hierarchy
- To include installation information in the lowest levels of the hierarchy
ScaLAPACK
PBLAS
LAPACK
BLACS
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
33Modifications to libraries hierarchy
- When installing libraries in a higher level this information can be used, and new information is generated
ScaLAPACK
PBLAS
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
34Modifications to libraries hierarchy
ScaLAPACK
Self-Optimisation Information
PBLAS
Self-Optimisation Information
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
35Modifications to libraries hierarchy
- And new libraries with autotuning capacity could be developed
Inverse Eigenvalue Problem
Least Square Problem
PDE Solver
Self-Optimisation Information
Self-Optimisation Information
Self-Optimisation Information
ScaLAPACK
Self-Optimisation Information
PBLAS
Self-Optimisation Information
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
36Modifications to libraries hierarchy
- Movement of information between routines in the different levels of the hierarchy
40Modifications to libraries hierarchy
- Architecture of a Self Optimized Linear Algebra Routine manager (SOLAR_manager)
[Diagram; recoverable labels follow]
- LAR(n, AP)
- Model: Texec = f(SP, AP, n), SP = f(AP, n)
- AP0, Optimum_AP
- Current_problem_size: nc
- Installation_information: n1 ... nw, AP1 ... APz
- Current_system_information: Current_CPUs_availability (CPU1 ... CPUp), Current_network_availability (net1-1 ... netp-p)
- SP1_manager ... SPt_manager
- SP1_information ... SPt_information: Installation_SP values (tabulated for n1 ... nw and AP1 ... APz) and Current_SP values (for nc and AP1 ... APz)
41Modifications to libraries hierarchy
42DESIGN PROCESS
[Diagram: DESIGN (LAR)]
- LAR: Linear Algebra Routine, made by the LAR Designer
- Example of LAR: Parallel Block LU factorisation
43Modelling the LAR
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL)]
44Modelling the LAR
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL)]
- Made by the LAR Designer, only once per LAR
- MODEL: Texec = f(SP, AP, n)
- SP: System Parameters
- AP: Algorithmic Parameters
- n: Problem size
45Modelling the LAR
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL)]
- MODEL for the LAR Parallel Block LU factorisation
- SP: k3, k2, ts, tw
- AP: p = r × c, b
- n: Problem size
46Implementation of SP-Estimators
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL, Implementation of SP-Estimators, SP-Estimators)]
47Implementation of SP-Estimators
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL, Implementation of SP-Estimators, SP-Estimators)]
- Estimators of Arithmetic-SP
- Computation Kernel of the LAR
- Similar storage scheme
- Similar quantity of data
- Estimators of Communication-SP
- Communication Kernel of the LAR
- Similar kind of communication
- Similar quantity of data
48INSTALLATION PROCESS
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL, Implementation of SP-Estimators, SP-Estimators); INSTALLATION]
- Installation Process: only once per platform, done by the System Manager
49Estimation of Static-SP
50Estimation of Static-SP
- Basic Libraries
- Basic Communication Library: MPI, PVM
- Basic Linear Algebra Library: reference-BLAS, machine-specific-BLAS, ATLAS
- Installation File: SP values are obtained using the information (n and AP values) of this file.
51Estimation of Static-SP
[Diagram: DESIGN (LAR, Modelling the LAR, MODEL, Implementation of SP-Estimators, SP-Estimators); INSTALLATION (Basic Libraries, Installation-File, Estimation of Static-SP, Static-SP-File)]
- Platform: Cluster of Pentium III, Fast Ethernet
- Basic Libraries: ATLAS and MPI
- Estimation of the Static-SP k3-static (in μsec)
Block size   16      32      64      128
k3-static    0.0038  0.0033  0.0030  0.0027
- Estimation of the Static-SP tw-static (in μsec)
Message size (Kbytes)   32     256    1024   2048
tw-static               0.700  0.690  0.680  0.675
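At run time the tabulated Static-SP values have to be turned into a value for the block size or message size actually in use. The sketch below takes the two tables from the slide and interpolates linearly between neighbouring entries, clamping outside the tabulated range; the interpolation rule is an assumption, since the slides do not say how intermediate sizes are handled. The names `K3_STATIC`, `TW_STATIC` and `lookup` are hypothetical.

```python
from bisect import bisect_left

# k3-static in microseconds, indexed by block size (values from the slide).
K3_STATIC = {16: 0.0038, 32: 0.0033, 64: 0.0030, 128: 0.0027}
# tw-static in microseconds, indexed by message size in Kbytes (from the slide).
TW_STATIC = {32: 0.700, 256: 0.690, 1024: 0.680, 2048: 0.675}


def lookup(table, x):
    """Value at x: linear interpolation between tabulated neighbours,
    clamped to the first or last entry outside the tabulated range."""
    keys = sorted(table)
    if x <= keys[0]:
        return table[keys[0]]
    if x >= keys[-1]:
        return table[keys[-1]]
    i = bisect_left(keys, x)
    lo, hi = keys[i - 1], keys[i]
    frac = (x - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])
```

With these tables, `lookup(K3_STATIC, 48)` falls halfway between the entries for block sizes 32 and 64.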
52RUN-TIME PROCESS
53RUN-TIME PROCESS
54RUN-TIME PROCESS
55Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Hybrid programming
- Peer to peer computing
56Collaborations and self-references
- Polylibraries
- P. Alberti, P. Alonso, J. Cuenca, A. Vidal
- Designing Polylibraries to Speed Up Parallel Computations. 2003
57Polylibraries
- Different basic libraries can be available
- Reference BLAS, machine specific BLAS, ATLAS, ...
- MPICH, machine specific MPI, PVM, ...
- Reference LAPACK, machine specific LAPACK, ...
- ScaLAPACK, PLAPACK, ...
- To use a number of different basic libraries to develop a polylibrary
58Polylibraries
- Typical parallel linear algebra libraries
hierarchy
59Polylibraries
- A possible parallel linear algebra polylibraries
hierarchy
62Polylibraries
BLACS
63Polylibraries
- The advantage of Polylibraries
- A library optimised for the system might not be available
- The characteristics of the system can change
- Which library is the best may vary according to the routines and the systems
- Even for different problem sizes or different data access schemes the preferred library can change
- In parallel systems with the file system shared by processors of different types
64Architecture of a Polylibrary
[Figure sequence: architecture of a polylibrary; the only recoverable label is Library_1]
74Polylibraries
- Combining Polylibraries with other Optimisation Techniques
- Polyalgorithms
- Algorithmic Parameters
- Block size
- Number of processors
- Logical topology of processors
75Experimental Results
- Routines of different levels in the hierarchy
- Lowest level
- GEMM matrix-matrix multiplication
- Medium level
- LU and QR factorisations
- Highest level
- a Lift-and-Project algorithm to solve the inverse additive eigenvalue problem
- an algorithm to solve the Toeplitz least square problem
76Experimental Results
- The platforms
- SGI Origin 2000
- IBM-SP2
- Different networks of processors
- SUN Workstations Ethernet
- PCs Fast-Ethernet
- PCs Myrinet
77Experimental Results GEMM
- Routine: GEMM (matrix-matrix multiplication)
- Platform: five SUN Ultra 1 / one SUN Ultra 5
- Libraries
- refBLAS, macBLAS
- ATLAS1, ATLAS2, ATLAS5
- Algorithms and Parameters
- Strassen → base size
- By blocks → block size
- Direct method
78Experimental Results GEMM
- MATRIX-MATRIX MULTIPLICATION INTERFACE
if processor is SUN Ultra 5
    if problem-size < 600
        solve using ATLAS5 and Strassen method with base size half of problem size
    else if problem-size < 1000
        solve using ATLAS5 and block method with block size 400
    else
        solve using ATLAS5 and Strassen method with base size half of problem size
    endif
else if processor is SUN Ultra 1
    if problem-size < 600
        solve using ATLAS5 and direct method
    else if problem-size < 1000
        solve using ATLAS5 and Strassen method with base size half of problem size
    else
        solve using ATLAS5 and direct method
    endif
endif
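The interface above is a decision table, and can be sketched as a small Python function that returns the chosen library, method and parameter without performing the multiplication. The thresholds and choices are copied from the slide; the function name, the processor strings and the tuple layout are hypothetical.

```python
def choose_gemm(processor, n):
    """Return (library, method, parameters) for an n-by-n product."""
    strassen = ("ATLAS5", "strassen", {"base_size": n // 2})
    blocked = ("ATLAS5", "block", {"block_size": 400})
    direct = ("ATLAS5", "direct", {})

    if processor == "SUN Ultra 5":
        if n < 600:
            return strassen
        if n < 1000:
            return blocked
        return strassen
    if processor == "SUN Ultra 1":
        if n < 600:
            return direct
        if n < 1000:
            return strassen
        return direct
    raise ValueError(f"no tuning information for processor {processor!r}")
```

Keeping the decision separate from the computation means the table can be regenerated at installation time without touching the routines that call it.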
79Experimental Results GEMM
80Experimental Results LU
- Routine: LU factorisation
- Platform: 4 Pentium III, Myrinet
- Libraries
- ATLAS
- BLAS for Pentium II
- BLAS for Pentium III
81Experimental Results LU
- The cost of parallel block LU factorisation
- Tuning Algorithmic Parameters
- block size: b
- 2D-mesh of p processors: p = r × c, d = max(r, c)
- System Parameters
- cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm
- communication parameters: ts, tw
82Experimental Results LU
83Experimental Results QR
- Routine: QR factorisation
- Platform: 8 Pentium III, Fast-Ethernet
- Libraries
- ATLAS
- BLAS for Pentium II
- BLAS for Pentium III
84Experimental Results QR
- The cost of parallel block QR factorisation
- Tuning Algorithmic Parameters
- block size: b
- 2D-mesh of p processors: p = r × c
- System Parameters
- cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm
- communication parameters: ts, tw
85Experimental Results QR
86Experimental Results LP
- Routine: Lift-and-Project method for the Inverse Additive Eigenvalue Problem
- Platform: dual Pentium III
- Libraries combinations
87Experimental Results LP
- The theoretical model of the sequential algorithm cost
- System Parameters
- ksyev → LAPACK
- k3,gemm, k3,diaggemm → BLAS-3
- k1,dot, k1,scal, k1,axpy → BLAS-1
88Experimental Results LP
89Experimental Results LP
90Polylibraries
- The method can be applied to sequential and parallel algorithms
- It can be combined with other methods of computation speed-up.
- The LIF contains the cost of an operation for each one of the routines. These costs may be different for different data sizes or access schemes.
- Could be applied to help in the development of efficient parallel libraries in other fields.
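As an illustration of how per-routine costs could drive the choice of library, the sketch below keeps, for each library, the recorded cost of each routine at a few data sizes and picks the library with the lowest cost at the size nearest to the problem. The nested-dictionary layout, the nearest-size rule and the function names are assumptions, and the numbers in `EXAMPLE_LIF` are made up for the example; they are not measurements from the talk.

```python
def nearest_cost(costs_by_size, n):
    """Cost recorded for the tabulated size closest to n."""
    size = min(costs_by_size, key=lambda s: abs(s - n))
    return costs_by_size[size]


def best_library(lif, routine, n):
    """Library with the lowest recorded cost for `routine` near size n.

    `lif` maps library name -> routine name -> {size: cost}.
    """
    candidates = {
        lib: nearest_cost(routines[routine], n)
        for lib, routines in lif.items()
        if routine in routines
    }
    if not candidates:
        raise KeyError(f"no library provides {routine!r}")
    return min(candidates, key=candidates.get)


# Made-up costs, for illustration only.
EXAMPLE_LIF = {
    "refBLAS": {"gemm": {64: 5.0, 512: 5.0}},
    "ATLAS": {"gemm": {64: 4.0, 512: 2.0}},
    "macBLAS": {"gemm": {64: 3.0, 512: 2.5}},
}
```

With these made-up values the preferred library changes with the problem size, which is the situation a polylibrary is meant to exploit.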