Title: Advances in the Optimization of Parallel Routines I
1Advances in the Optimization of Parallel Routines
(I)
- Domingo Giménez
- Departamento de Informática y Sistemas
- Universidad de Murcia, Spain
- dis.um.es/domingo
2Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
3Collaborations and autoreferences
- Modelling Linear Algebra Routines
- J. Cuenca, J. González
- Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
- Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
- J. Cuenca, L. P. García, J. González, A. Vidal
- Empirical Modelling of Parallel Linear Algebra Routines. 2003
4Collaborations and autoreferences
- Installation routines
- G. Carrillo
- Installation routines for linear algebra libraries on LANs. 2000
- G. Carrillo, J. Cuenca, J. González
- Optimización automática de rutinas paralelas de álgebra lineal. 2000
5Collaborations and autoreferences
- Autotuning routines
- J. Cuenca, J. González
- Automatic parameterization of parallel linear algebra routines. 2001
- J. Cuenca
- Some considerations about the Automatic Optimization of Parallel Linear Algebra Routines. 2002
6Collaborations and autoreferences
- Modifications to the libraries hierarchy
- J. Cuenca, J. González
- Architecture of an Automatic Tuned Linear Algebra Library. 2002 - 2004
7Collaborations and autoreferences
- Polylibraries
- P. Alberti, P. Alonso, J. Cuenca, A. Vidal
- Designing Polylibraries to Speed Up Parallel Computations. 2003
8Collaborations and autoreferences
- Algorithmic schemes
- J. P. Martínez
- Automatic Optimization in Parallel Dynamic Programming Schemes. 2004
9Collaborations and autoreferences
- Heterogeneous systems
- J. Cuenca, J. Dongarra, J. González, K. Roche
- Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
- J. Cuenca, J. P. Martínez
- Heuristics for Work Distribution of a Homogeneous Parallel Dynamic Programming Scheme on Heterogeneous Systems. 2004
10Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
11A little history
- Parallel optimization in the past
- Hand-optimization for each platform
- Time consuming
- Incompatible with hardware evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for systems with variable workloads
- Misuse by non-expert users
12A little history
- Initial solutions to this situation
- Problem-specific solutions
- Polyalgorithms
- Installation tests
13A little history
- Problem specific solutions
- Brewer (1994) Sorting Algorithms, Differential Equations
- Frigo (1997) FFTW, The Fastest Fourier Transform in the West
- LAWRA (1997) Linear Algebra With Recursive Algorithms
14A little history
- Polyalgorithms
- Brewer
- FFTW
- PHiPAC (1997) Linear Algebra
15A little history
- Installation tests
- ATLAS (2001) Dense Linear Algebra, sequential
- Carrillo, Giménez (2000) Gauss elimination, heterogeneous algorithm
- I-LIB (2000) some parallel linear algebra routines
16A little history
- Parallel optimization today
- Optimization based on computational kernels
- Systematic development of routines
- Auto-optimization of routines
- Middleware for auto-optimization
17A little history
- Optimization based on computational kernels
- Efficient kernels (BLAS) and algorithms based on these kernels
- Auto-optimization of the basic kernels (ATLAS)
18A little history
- Systematic development of routines
- FLAME project
- R. van de Geijn, E. Quintana
- Dense Linear Algebra
- Based on Object Oriented Design
- LAWRA
- Dense Linear Algebra
- For Shared Memory Systems
19A little history
- Auto-optimization of routines
- At installation time
- ATLAS: Dongarra, Whaley
- I-LIB: Kanada, Katagiri, Kuroda
- SOLAR: Cuenca, Giménez, González
- LFC: Dongarra, Roche
- At execution time
- Solve a reduced problem in each processor (Kalinov, Lastovetsky)
- Use a system evaluation tool (NWS)
20A little history
- Middleware for auto-optimization
- LFC
- Middleware for Dense Linear Algebra Software in Clusters.
- Hierarchy of autotuning libraries
- Include in the libraries installation routines to be used in the development of higher level libraries
- FIBER
- Proposal of general middleware
- Evolution of I-LIB
- mpC
- For heterogeneous systems
21A little history
- Parallel optimization in the future?
- Skeletons and languages
- Heterogeneous and variable-load systems
- Distributed systems
- P2P computing
22A little history
- Skeletons and languages
- Develop skeletons for parallel algorithmic schemes
- together with execution time models
- and provide the users with these libraries (MALLBA, Málaga-La Laguna-Barcelona) or languages (P3L, Pisa)
23A little history
- Heterogeneous and variable-load systems
- Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
- Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
- Variable-load systems treated as dynamic heterogeneous systems
24A little history
- Distributed systems
- Intrinsically heterogeneous and variable-load
- Very high cost of communications
- Special middleware is necessary (Globus, NWS)
- There can be servers to attend to client queries
25A little history
- P2P computing
- Users can join and leave dynamically
- All the users are of the same type (initially)
- It is distributed, heterogeneous and variable-load
- But special middleware is necessary
26Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
27Modelling Linear Algebra Routines
- It is necessary to predict the execution time accurately and to select
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The assignment of processes to processors
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)
28Modelling Linear Algebra Routines
- Cost of a parallel program
- arithmetic time
- communication time
- overhead, for synchronization, imbalance, process creation, ...
- overlapping of communication and computation
29Modelling Linear Algebra Routines
- Estimation of the time
- Considering computation and communication divided into a number of steps
- And taking, for each part of the formula, the value of the process which gives the highest value
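A minimal sketch of this estimation rule, assuming that each step supplies one arithmetic time and one communication time per process. The function name `estimate_time` and the sample values are hypothetical, not taken from the slides.

```python
def estimate_time(steps):
    """Estimate the total time of a stepwise parallel routine.

    `steps` is a list of steps; each step is a list with one
    (arithmetic_time, communication_time) pair per process.
    For each part of each step, the slowest process sets the cost.
    """
    total = 0.0
    for per_process in steps:
        total += max(arith for arith, _ in per_process)
        total += max(comm for _, comm in per_process)
    return total


# Two steps on two processes (made-up times, in seconds).
steps = [
    [(4.0, 1.0), (3.0, 2.0)],   # step 1: arithmetic max 4.0, communication max 2.0
    [(2.0, 0.5), (2.5, 0.25)],  # step 2: arithmetic max 2.5, communication max 0.5
]
print(estimate_time(steps))
```

Taking the maximum separately for arithmetic and communication gives an upper estimate, since the slowest process in each part need not be the same one.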
30Modelling Linear Algebra Routines
- The time depends on the problem size (n) and the system size (p)
- But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of rows (r) and columns (c) of processors in algorithms for a mesh of processors
31Modelling Linear Algebra Routines
- And on some SYSTEM PARAMETERS, which reflect the computation and communication characteristics of the system
- Typically the cost of an arithmetic operation (tc), the start-up time (ts) and the word-sending time (tw)
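The two communication parameters are commonly combined in a linear model, in which sending a message of m words costs ts + m·tw. A minimal sketch, assuming that linear model, of how ts and tw could be fitted by least squares from measured (message size, time) pairs. The function name `fit_ts_tw` and the synthetic samples are hypothetical.

```python
def fit_ts_tw(samples):
    """Least-squares fit of t = ts + m * tw from (m, t) pairs.

    m is the message size in words, t the measured transfer time.
    """
    k = len(samples)
    sum_m = sum(m for m, _ in samples)
    sum_t = sum(t for _, t in samples)
    sum_mm = sum(m * m for m, _ in samples)
    sum_mt = sum(m * t for m, t in samples)
    tw = (k * sum_mt - sum_m * sum_t) / (k * sum_mm - sum_m * sum_m)
    ts = (sum_t - tw * sum_m) / k
    return ts, tw


# Synthetic timings generated with ts = 50 and tw = 0.1 (made-up units).
samples = [(m, 50.0 + 0.1 * m) for m in (100, 1000, 10000)]
ts, tw = fit_ts_tw(samples)
print(ts, tw)
```

On a real system the samples would come from timed message exchanges, and at least two different message sizes are needed for the fit to be defined.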
32Modelling Linear Algebra Routines
- LU factorisation (Golub and Van Loan)
- Step 1 (unblocked LU factorisation)
- Step 2 (multiple lower triangular systems)
- Step 3 (multiple upper triangular systems)
- Step 4 (update of the south-east blocks)
[Figure: 3 × 3 block partition of the factorised matrix, with L11 U11, L22 U22 and L33 U33 on the diagonal, U12, U13 and U23 above it, and L21, L31 and L32 below it]
33Modelling Linear Algebra Routines
- The execution time is
- If the blocks are of size 1, the operations are all with individual elements, but if the block size is b the cost is
- With k3 and k2 the cost of operations performed with BLAS 3 or 2
34Modelling Linear Algebra Routines
- But the cost of different operations of the same level is different, and the theoretical cost could be better modelled as
- Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...
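A cost model with one System Parameter per basic routine can be sketched as follows. This is a minimal sketch, assuming the standard leading-order operation counts of sequential blocked LU (2n³/3 for the matrix-matrix updates, b·n² for the triangular solves, 2b²n/3 for the unblocked factorisations of the diagonal blocks). The coefficients are not copied from the slides, and the function name and parameter values are hypothetical.

```python
def block_lu_time(n, b, k3_gemm, k3_trsm, k2_getf2):
    """Leading-order time model of sequential blocked LU.

    n: matrix size, b: block size.
    k3_gemm, k3_trsm, k2_getf2: cost per floating point operation
    of each basic routine (one System Parameter per routine).
    """
    gemm = 2.0 * n**3 / 3.0 * k3_gemm        # updates of the south-east blocks
    trsm = b * n**2 * k3_trsm                # multiple triangular systems
    getf2 = 2.0 * b**2 * n / 3.0 * k2_getf2  # unblocked LU of diagonal blocks
    return gemm + trsm + getf2


# Made-up costs per operation, in microseconds.
params = {"k3_gemm": 0.002, "k3_trsm": 0.003, "k2_getf2": 0.010}
for b in (16, 32, 64):
    print(b, block_lu_time(2048, b, **params))
```

With constant parameters the model always favours the smallest block size, which is why the parameters themselves need to vary with b for the model to be useful for tuning.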
35Modelling Linear Algebra Routines
- The value of each System Parameter can depend on the problem size (n) and on the value of the Algorithmic Parameters (b)
- The formula has the form
- And what we want is to obtain the values of the Algorithmic Parameters (AP) with which the lowest execution time is obtained
36Modelling Linear Algebra Routines
- The values of the System Parameters could be obtained
- With installation routines associated with each linear algebra routine
- From information stored when the library was installed in the system, thus generating a hierarchy of libraries with auto-optimization
- At execution time, by testing the system conditions prior to the call to the routine
37Modelling Linear Algebra Routines
- These values can be obtained as simple values (traditional method) or as functions of the Algorithmic Parameters
- In the latter case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored
- When a problem of a particular size is being solved, the execution time is estimated with the values of the stored size closest to the real size
- And the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time
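The table lookup and selection described above can be sketched as follows. This is a minimal sketch under simplifying assumptions: a single System Parameter (the cost per operation of the matrix-matrix update), a single Algorithmic Parameter (the block size), and a model reduced to its dominant term. The table contents and all names are hypothetical.

```python
# Hypothetical table generated at installation time:
# (stored problem size, block size) -> cost per operation, in microseconds.
K3_GEMM = {
    (512, 16): 0.020, (512, 32): 0.015, (512, 64): 0.018,
    (2048, 16): 0.022, (2048, 32): 0.014, (2048, 64): 0.012,
}


def closest_size(table, n):
    """Return the stored problem size closest to the real size n."""
    sizes = sorted({size for size, _ in table})
    return min(sizes, key=lambda s: abs(s - n))


def predicted_time(table, n, b):
    """Dominant-term estimate, using the value stored for the closest size."""
    k3 = table[(closest_size(table, n), b)]
    return 2.0 * n**3 / 3.0 * k3


def select_block_size(table, n):
    """Return the block size that predicts the lowest execution time."""
    blocks = sorted({b for _, b in table})
    return min(blocks, key=lambda b: predicted_time(table, n, b))


print(select_block_size(K3_GEMM, 600))   # uses the values stored for size 512
print(select_block_size(K3_GEMM, 1800))  # uses the values stored for size 2048
```

A full model would look up every System Parameter in the same way and minimise over all the Algorithmic Parameters, not only the block size.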
38Modelling Linear Algebra Routines
- Parallel block LU factorisation
[Figure: the matrix, the distribution of computations in the first step, and the processors]
39Modelling Linear Algebra Routines
- Distribution of computations on successive steps
[Figure: distribution of computations in the second step and in the third step]
40Modelling Linear Algebra Routines
- The cost of parallel block LU factorisation
- Tuning Algorithmic Parameters
- block size: b
- 2D mesh of p processors: p = r × c, d = max(r, c)
- System Parameters
- cost of arithmetic operations: k2,getf2; k3,trsm; k3,gemm
- communication parameters: ts; tw
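The search space defined by these Algorithmic Parameters can be enumerated directly. A minimal sketch, assuming that every factorisation p = r × c is an admissible mesh and that the candidate block sizes are supplied by the routine designer. The function names are hypothetical.

```python
def meshes(p):
    """Return all (r, c) pairs with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def candidates(p, block_sizes):
    """Return all (b, r, c) combinations to be evaluated with the model."""
    return [(b, r, c) for b in block_sizes for r, c in meshes(p)]


print(meshes(8))
for b, r, c in candidates(4, [16, 32]):
    print(b, r, c, max(r, c))  # the last value is d = max(r, c)
```

Each candidate would then be passed to the execution time model, and the combination with the lowest predicted time selected.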
41Modelling Linear Algebra Routines
- The cost of parallel block QR factorisation
- Tuning Algorithmic Parameters
- block size: b
- 2D mesh of p processors: p = r × c
- System Parameters
- cost of arithmetic operations: k2,geqr2; k2,larft; k3,gemm; k3,trmm
- communication parameters: ts; tw
42Modelling Linear Algebra Routines
- The same basic operations appear repeatedly in different higher level routines
- the information generated for one routine (let's say LU) could be stored and used for other routines (e.g. QR)
- and a common format is necessary to store the information
43Modelling Linear Algebra Routines
44Modelling Linear Algebra Routines
Parallel QR factorisation
- mean refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user)
- optimum is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
- model is the execution time with the values selected with the model
IBM-SP2. 8 processors
45Modelling Linear Algebra Routines
Parameter selection for the QR algorithm
Network of Pentium III with Fast Ethernet

            p = 4           p = 8
    n     b    r    c     b    r    c
 1024    16    1    4    16    1    8
 2048    16    1    4    16    1    8
 3072    32    1    4    32    1    8
 4096    32    1    4    32    1    8
46Outline
- A little history
- Modelling Linear Algebra Routines
- Installation routines
- Autotuning routines
- Modifications to libraries hierarchy
- Polylibraries
- Algorithmic schemes
- Heterogeneous systems
- Peer to peer computing
47Installation Routines
- In the formulas (parallel block LU factorisation)
- The values of the System Parameters (k2,getf2; k3,trsm; k3,gemm; ts; tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)
48Installation Routines
- By running, at installation time, Installation Routines associated with the linear algebra routine
- And storing the information generated, to be used at run time
- Consequently, each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be detailed
49Installation Routines
- k3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c)
- Because the size of the matrix to work with decreases during the execution, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes
50Installation Routines
- k3,trsm: two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b
- Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r
- As for the previous parameter, values can be obtained for different problem sizes
51Installation Routines
- k2,getf2 corresponds to a level 2 sequential LU factorisation of size b × b
- At installation time each of the basic routines is executed varying the value of the parameters it depends on, with representative values (selected by the routine designer or the system manager)
- And the information generated is stored in a file to be used at run time, or in the code of the linear algebra routine before its installation
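An installation routine of this kind can be sketched as follows. This is a minimal sketch, assuming a pure-Python matrix product as a stand-in for the BLAS kernel and a JSON file as the storage format. The function names, the file name and the representative sizes are hypothetical.

```python
import json
import os
import tempfile
import time


def time_update(rows, b, cols):
    """Time one (rows x b) by (b x cols) product; return seconds per operation."""
    a = [[1.0] * b for _ in range(rows)]
    bt = [[1.0] * b for _ in range(cols)]  # second operand, stored transposed
    start = time.perf_counter()
    [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]
    elapsed = time.perf_counter() - start
    return elapsed / (2.0 * rows * b * cols)


def install(sizes, block_sizes, path):
    """Run the kernel for each representative (size, block size) and store the results."""
    table = {}
    for n in sizes:
        for b in block_sizes:
            table[f"{n},{b}"] = time_update(n, b, n)
    with open(path, "w") as f:
        json.dump(table, f, indent=2)
    return table


# Installation time: generate and store the table.
path = os.path.join(tempfile.gettempdir(), "k3_gemm_table.json")
install(sizes=[16, 32], block_sizes=[4, 8], path=path)

# Run time: read the stored information back.
with open(path) as f:
    stored = json.load(f)
print(stored)
```

In a real installation the timed kernel would be the library routine itself, and each measurement would usually be repeated to reduce timing noise.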
52Installation Routines
- ts and tw appear in communications of three types
- In one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c
- In another a block of size b × b is broadcast in a column, and the parameter depends on b and r
- And in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c
53Installation Routines
- In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process is completed
- The routine designer also designs the installation process, and can draw on experience to guide the installation
- The basic installation process can be designed to allow the intervention of the system manager
54Installation Routines
- Some results in different systems (physical and logical platform)
- Values of k3_DTRMM (k3_DGEMM) on the different platforms (in microseconds)
55Installation Routines
Values of k2_DGEQR2 (k2_DLARFT) on the different platforms (in microseconds)
56Installation Routines
- Typically the values of the communication
parameters are well estimated with a ping-pong