Advances in the Optimization of Parallel Routines I

Description:

Advances in the Optimization of Parallel Routines (I) Domingo Giménez ... Optimización automática de rutinas paralelas de álgebra lineal. 2000. 10/8/09 ...

Slides: 57

Provided by: javier98

Transcript and Presenter's Notes

1
Advances in the Optimization of Parallel Routines
(I)

Domingo Giménez
Departamento de Informática y Sistemas
Universidad de Murcia, Spain
dis.um.es/domingo

2
Outline

A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Peer to peer computing

3
Collaborations and autoreferences

Modelling Linear Algebra Routines
J. Cuenca, J. González
Modelling the Behaviour of Linear Algebra
Algorithms with Message-passing. 2001
Towards the Design of an Automatically Tuned
Linear Algebra Library. 2002
J. Cuenca, L. P. García, J. González, A. Vidal
Empirical Modelling of Parallel Linear Algebra
Routines. 2003

4
Collaborations and autoreferences

Installation routines
G. Carrillo
Installation routines for linear algebra
libraries on LANs. 2000
G. Carrillo, J. Cuenca, J. González
Optimización automática de rutinas paralelas de
álgebra lineal. 2000

5
Collaborations and autoreferences

Autotuning routines
J. Cuenca, J. González
Automatic parameterization of parallel linear
algebra routines. 2001
J. Cuenca
Some considerations about the Automatic
Optimization of Parallel Linear Algebra Routines.
2002

6
Collaborations and autoreferences

Modifications to the libraries hierarchy
J. Cuenca, J. González
Architecture of an Automatically Tuned Linear Algebra
Library. 2002 - 2004

7
Collaborations and autoreferences

Polylibraries
P. Alberti, P. Alonso, J. Cuenca, A. Vidal
Designing Polylibraries to Speed Up Parallel
Computations. 2003

8
Collaborations and autoreferences

Algorithmic schemes
J. P. Martínez
Automatic Optimization in Parallel Dynamic
Programming Schemes. 2004

9
Collaborations and autoreferences

Heterogeneous systems
J. Cuenca, J. Dongarra, J. González, K. Roche
Automatic Optimization of Parallel Linear Algebra
Routines in Systems with Variable Load. 2003
J. Cuenca, J. P. Martínez
Heuristics for Work Distribution of a Homogeneous
Parallel Dynamic Programming Scheme on
Heterogeneous Systems. 2004

10
Outline

A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Peer to peer computing

11
A little history

Parallel optimization in the past
Hand-optimization for each platform
Time consuming
Incompatible with hardware evolution
Incompatible with changes in the system
(architecture and basic libraries)
Unsuitable for systems with variable workloads
Misuse by non-expert users

12
A little history

Initial solutions to this situation
Problem-specific solutions
Polyalgorithms
Installation tests

13
A little history

Problem-specific solutions
Brewer (1994): sorting algorithms, differential
equations
Frigo (1997): FFTW, the Fastest Fourier Transform
in the West
LAWRA (1997): Linear Algebra With Recursive
Algorithms

14
A little history

Polyalgorithms
Brewer
FFTW
PHiPAC (1997): linear algebra

15
A little history

Installation tests
ATLAS (2001): dense linear algebra, sequential
Carrillo, Giménez (2000): Gauss elimination,
heterogeneous algorithm
I-LIB (2000): some parallel linear algebra
routines

16
A little history

Parallel optimization today
Optimization based on computational kernels
Systematic development of routines
Auto-optimization of routines
Middleware for auto-optimization

17
A little history

Optimization based on computational kernels
Efficient kernels (BLAS) and algorithms based on
these kernels
Auto-optimization of the basic kernels (ATLAS)

18
A little history

Systematic development of routines
FLAME project
R. van de Geijn, E. Quintana
Dense Linear Algebra
Based on Object Oriented Design
LAWRA
Dense Linear Algebra
For Shared Memory Systems

19
A little history

Auto-optimization of routines
At installation time
ATLAS: Dongarra, Whaley
I-LIB: Kanada, Katagiri, Kuroda
SOLAR: Cuenca, Giménez, González
LFC: Dongarra, Roche
At execution time
Solve a reduced problem in each processor
(Kalinov, Lastovetsky)
Use a system evaluation tool (NWS)

20
A little history

Middleware for auto-optimization
LFC
Middleware for Dense Linear Algebra Software in
Clusters.
Hierarchy of autotuning libraries
Include installation routines in the libraries, to
be used in the development of higher-level
libraries
FIBER
Proposal of general middleware
Evolution of I-LIB
mpC
For heterogeneous systems

21
A little history

Parallel optimization in the future?
Skeletons and languages
Heterogeneous and variable-load systems
Distributed systems
P2P computing

22
A little history

Skeletons and languages
Develop skeletons for parallel algorithmic
schemes,
together with execution time models,
and provide them to the users as libraries
(MALLBA, Málaga-La Laguna-Barcelona) or languages
(P3L, Pisa)

23
A little history

Heterogeneous and variable-load systems
Heterogeneous algorithms: unbalanced distribution
of data (static or dynamic)
Homogeneous algorithms: more processes than
processors, and assignment of processes to
processors (static or dynamic)
Variable-load systems treated as dynamic
heterogeneous systems
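
As an illustration of the heterogeneous-algorithm option (an unbalanced, static distribution of data), the following minimal sketch splits the rows of a matrix in proportion to relative processor speeds. The function name and the speed values are assumptions made for this example; they do not come from the presentation.

```python
def distribute_rows(n_rows, speeds):
    """Split n_rows among processors in proportion to their relative speeds.

    speeds is a hypothetical list of relative speeds, one per processor.
    """
    total = sum(speeds)
    shares = [n_rows * s / total for s in speeds]
    counts = [int(x) for x in shares]  # round every share down first
    # Hand out the rows lost to rounding, largest fractional part first.
    leftover = n_rows - sum(counts)
    order = sorted(range(len(speeds)),
                   key=lambda i: shares[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Three processors, the third assumed to be twice as fast as the others.
print(distribute_rows(1000, [1.0, 1.0, 2.0]))
```

A dynamic variant would recompute the speeds, and therefore the distribution, while the program runs.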

24
A little history

Distributed systems
Intrinsically heterogeneous and variable-load
Very high cost of communications
Special middleware is necessary (Globus, NWS)
There can be servers that attend to the queries of
clients

25
A little history

P2P computing
Users can join and leave dynamically
All the users are of the same type (initially)
It is distributed, heterogeneous and variable-load
But special middleware is necessary

26
Outline

A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Peer to peer computing

27
Modelling Linear Algebra Routines

It is necessary to predict the execution time
accurately and to select
The number of processes
The number of processors
Which processors
The number of rows and columns of processes (the
topology)
The assignment of processes to processors
The computational block size (in linear algebra
algorithms)
The communication block size
The algorithm (polyalgorithms)
The routine or library (polylibraries)

28
Modelling Linear Algebra Routines

Cost of a parallel program
arithmetic time
communication time
overhead, for synchronization, imbalance,
processes creation, ...
overlapping of communication and computation

29
Modelling Linear Algebra Routines

Estimation of the time
Considering computation and communication divided
into a number of steps
And taking, for each part of the formula, the value of
the process which gives the highest value.
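
The estimation rule above can be sketched as follows: the execution is divided into steps, and for each step the computation and communication terms are taken from the process that gives the highest value. The data layout (one list of per-process times per step) and the numbers are assumptions chosen for illustration.

```python
def estimated_time(steps):
    """Estimate the parallel time as a sum over steps.

    steps is a list of (comp_times, comm_times) pairs, one pair per step;
    each element of a pair is a list with one value per process.
    For each part, the slowest process determines the cost.
    """
    total = 0.0
    for comp_times, comm_times in steps:
        total += max(comp_times) + max(comm_times)
    return total


# Two steps on two processes, with made-up times in seconds.
steps = [([3.0, 2.0], [0.5, 0.7]),
         ([1.0, 1.5], [0.2, 0.1])]
print(estimated_time(steps))
```

The sketch ignores overlapping of communication and computation, which would reduce the estimate.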

30
Modelling Linear Algebra Routines

The time depends on the problem (n) and the
system (p) size
But also on some ALGORITHMIC PARAMETERS like the
block size (b) and the number of rows (r) and
columns (c) of processors in algorithms for a
mesh of processors

31
Modelling Linear Algebra Routines

And some SYSTEM PARAMETERS which reflect the
computation and communication characteristics of
the system.
Typically the cost of an arithmetic operation
(tc) and the start-up (ts) and word-sending time
(tw)

32
Modelling Linear Algebra Routines

LU factorisation (Golub - Van Loan)
Step 1 (unblocked LU factorisation)
Step 2 (multiple lower triangular systems)
Step 3 (multiple upper triangular systems)
Step 4 (update south-east blocks)
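
The four steps can be written out as a small sequential sketch. It is plain Python without pivoting, intended only to show which part of the matrix each step touches; a real routine would call the corresponding BLAS and LAPACK kernels. The function names are chosen for this example.

```python
def block_lu(a, b):
    """In-place block LU factorisation without pivoting.

    a is a square matrix stored as a list of lists, b is the block size.
    On return the strict lower triangle holds L (unit diagonal implied)
    and the upper triangle holds U.
    """
    n = len(a)
    for k in range(0, n, b):
        e = min(k + b, n)
        # Step 1: unblocked LU factorisation of the diagonal block.
        for j in range(k, e):
            for i in range(j + 1, e):
                a[i][j] /= a[j][j]
                for c in range(j + 1, e):
                    a[i][c] -= a[i][j] * a[j][c]
        # Step 2: multiple lower triangular systems, L11 * U12 = A12.
        for c in range(e, n):
            for i in range(k, e):
                a[i][c] -= sum(a[i][t] * a[t][c] for t in range(k, i))
        # Step 3: multiple upper triangular systems, L21 * U11 = A21.
        for i in range(e, n):
            for j in range(k, e):
                s = sum(a[i][t] * a[t][j] for t in range(k, j))
                a[i][j] = (a[i][j] - s) / a[j][j]
        # Step 4: update the south-east blocks, A22 = A22 - L21 * U12.
        for i in range(e, n):
            for c in range(e, n):
                a[i][c] -= sum(a[i][t] * a[t][c] for t in range(k, e))
    return a


def multiply_factors(lu):
    """Rebuild L * U from the packed factors, to check the result."""
    n = len(lu)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for t in range(min(i, j) + 1):
                l_it = 1.0 if t == i else lu[i][t]
                out[i][j] += l_it * lu[t][j]
    return out
```

Because there is no pivoting, the sketch should only be tried on matrices for which that is safe, for example strictly diagonally dominant ones.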

[Figure: 3 x 3 block partition of the factorised matrix]
L11/U11   U12       U13
L21       L22/U22   U23
L31       L32       L33/U33

33
Modelling Linear Algebra Routines

The execution time is
If the blocks are of size 1, all the operations are
performed on individual elements, but if the block
size is b the cost is
where k3 and k2 are the costs of an operation
performed with BLAS 3 or BLAS 2
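
The formulas of this slide are not preserved in the transcript. As a stand-in, the sketch below uses a leading-order model built from the standard operation count of LU factorisation (about 2n^3/3 flops). It assumes that the n/b diagonal blocks are factorised at BLAS 2 speed (k2) and that everything else runs at BLAS 3 speed (k3). This is an assumption made for illustration, not the formula shown in the presentation.

```python
def block_lu_time(n, b, k2, k3):
    """Leading-order time model for sequential block LU (an assumption).

    The n/b diagonal blocks, about 2/3 * b**3 flops each, are charged at
    the BLAS 2 cost k2; all remaining flops are charged at the BLAS 3
    cost k3. Lower-order terms are ignored.
    """
    total_flops = 2.0 * n**3 / 3.0
    level2_flops = (n / b) * (2.0 * b**3 / 3.0)
    return k2 * level2_flops + k3 * (total_flops - level2_flops)


# Hypothetical costs per operation, in seconds.
for b in (8, 32, 128):
    print(b, block_lu_time(1024, b, k2=2.0e-8, k3=2.0e-9))
```

With constant k2 and k3 this model always favours small blocks; it becomes useful for tuning only once the parameters are allowed to depend on b.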

34
Modelling Linear Algebra Routines

But the cost of different operations of the same
level is different, and the theoretical cost
could be better modelled as
Thus, the number of SYSTEM PARAMETERS increases
(one for each basic routine), and ...

35
Modelling Linear Algebra Routines

The value of each System Parameter can depend on
the problem size (n) and on the value of the
Algorithmic Parameters (b)
The formula has the form
And what we want is to obtain the values of AP
with which the lowest execution time is obtained

36
Modelling Linear Algebra Routines

The values of the System Parameters could be
obtained
With installation routines associated to each
linear algebra routine
From information stored when the library was
installed in the system, thus generating a
hierarchy of libraries with auto-optimization
At execution time by testing the system
conditions prior to the call to the routine

37
Modelling Linear Algebra Routines

These values can be obtained as simple values
(traditional method) or as functions of the
Algorithmic Parameters.
In the latter case a multidimensional table of
values, indexed by the problem size and the
Algorithmic Parameters, is stored.
When a problem of a particular size is being
solved, the execution time is estimated with the
values of the stored size closest to the real
size,
and the problem is solved with the values of the
Algorithmic Parameters which predict the lowest
execution time
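
A minimal sketch of this table-based selection, with the block size b as the only Algorithmic Parameter. The table layout, the model function and all numeric values are hypothetical; they only show the mechanism of looking up the closest stored size and choosing the parameter with the lowest predicted time.

```python
def nearest_size(table, n):
    """Return the stored problem size closest to the real size n."""
    return min(table, key=lambda size: abs(size - n))


def select_block_size(table, n, model):
    """Pick the block size with the lowest predicted execution time.

    table maps a stored problem size to {b: system parameter values};
    model(n, b, sp) returns the predicted time for those values.
    """
    entries = table[nearest_size(table, n)]
    return min(entries, key=lambda b: model(n, b, entries[b]))


def lu_model(n, b, sp):
    # Assumed leading-order model, not the formula from the slides.
    return sp["k3"] * 2.0 * n**3 / 3.0 + sp["k2"] * 2.0 * n * b**2 / 3.0


# Hypothetical values measured at installation time (seconds per flop).
table = {
    512:  {16: {"k3": 3.2e-9, "k2": 2.0e-7}, 32: {"k3": 2.4e-9, "k2": 2.0e-7}},
    1024: {16: {"k3": 3.0e-9, "k2": 2.0e-7}, 32: {"k3": 2.0e-9, "k2": 2.0e-7},
           64: {"k3": 1.9e-9, "k2": 2.0e-7}},
    2048: {32: {"k3": 2.1e-9, "k2": 2.0e-7}, 64: {"k3": 1.8e-9, "k2": 2.0e-7}},
}
print(nearest_size(table, 1000), select_block_size(table, 1000, lu_model))
```

Interpolating between stored sizes, instead of taking the closest one, would be a natural refinement.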

38
Modelling Linear Algebra Routines

Parallel block LU factorisation
[Figure: the matrix, the processors, and the
distribution of computations in the first step]

39
Modelling Linear Algebra Routines

Distribution of computations on successive steps
[Figure: second step, third step]

40
Modelling Linear Algebra Routines

The cost of parallel block LU factorisation
Tuning Algorithmic Parameters
block size b
2D mesh of p processors: p = r × c, d = max(r, c)
System Parameters
cost of arithmetic operations: k2,getf2;
k3,trsmm; k3,gemm
communication parameters: ts, tw
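
The search space of Algorithmic Parameters for a fixed number of processors p can be enumerated directly: every factorisation p = r × c gives a mesh shape, and each shape is combined with each candidate block size. The function names and the list of block sizes are assumptions for this sketch.

```python
def mesh_shapes(p):
    """All 2D mesh shapes (r, c) with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def candidate_parameters(p, block_sizes):
    """All (b, r, c) combinations to evaluate with the cost model."""
    return [(b, r, c) for b in block_sizes for r, c in mesh_shapes(p)]


print(mesh_shapes(8))
print(len(candidate_parameters(8, [16, 32, 64])))
```

Evaluating the cost formula on each candidate and keeping the minimum is cheap, since the number of combinations is small.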

41
Modelling Linear Algebra Routines

The cost of parallel block QR factorisation
Tuning Algorithmic Parameters
block size b
2D mesh of p processors: p = r × c
System Parameters
cost of arithmetic operations: k2,geqr2;
k2,larft; k3,gemm; k3,trmm
communication parameters: ts, tw

42
Modelling Linear Algebra Routines

The same basic operations appear repeatedly in
different higher-level routines,
so the information generated for one routine (let's
say LU) could be stored and used for other
routines (e.g. QR),
and a common format is necessary to store the
information
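
One possible common format is a file keyed by kernel name and block size, so that a value measured while installing LU (for example for gemm) can be read back when tuning QR. The JSON layout, the function names and the numbers below are assumptions for illustration; the presentation does not specify the format.

```python
import json
import os
import tempfile


def save_kernel_info(path, info):
    """Store the measured kernel costs in a shared JSON file."""
    with open(path, "w") as f:
        json.dump(info, f, indent=2)


def load_kernel_cost(path, kernel, b):
    """Read back the cost of one kernel for block size b."""
    with open(path) as f:
        info = json.load(f)
    return info[kernel][str(b)]  # JSON object keys are strings


# Hypothetical costs (seconds per flop), as if measured for LU.
info = {"gemm": {"16": 3.0e-9, "32": 2.0e-9},
        "trsmm": {"16": 4.0e-9, "32": 3.0e-9}}
path = os.path.join(tempfile.mkdtemp(), "kernels.json")
save_kernel_info(path, info)

# A QR tuning routine could reuse the gemm entry.
print(load_kernel_cost(path, "gemm", 32))
```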

43
Modelling Linear Algebra Routines

44
Modelling Linear Algebra Routines

Parallel QR factorisation
"mean" refers to the mean of the execution times
with representative values of the Algorithmic
Parameters (the execution time which could be
obtained by a non-expert user)
"optimum" is the lowest time of all the
executions performed with representative values
of the Algorithmic Parameters
"model" is the execution time with the values
selected with the model
IBM-SP2, 8 processors

45
Modelling Linear Algebra Routines

Parameter selection for the QR algorithm
Network of Pentium III with Fast Ethernet

            p = 4            p = 8
   n      b    r    c      b    r    c
1024     16    1    4     16    1    8
2048     16    1    4     16    1    8
3072     32    1    4     32    1    8
4096     32    1    4     32    1    8

46
Outline

A little history
Modelling Linear Algebra Routines
Installation routines
Autotuning routines
Modifications to libraries hierarchy
Polylibraries
Algorithmic schemes
Heterogeneous systems
Peer to peer computing

47
Installation Routines

In the formulas (parallel block LU factorisation)
the values of the System Parameters (k2,getf2;
k3,trsmm; k3,gemm; ts; tw) must be estimated
as functions of the problem size (n) and the
Algorithmic Parameters (b, r, c)

48
Installation Routines

By running, at installation time, the Installation
Routines associated with the linear algebra routine
And storing the information generated, to be used
at running time
Therefore, each linear algebra routine must be
designed together with the corresponding
installation routines, and the installation process
must be specified in detail

49
Installation Routines

k3,gemm is estimated by performing matrix-matrix
multiplications and updates of size
(n/r × b) × (b × n/c)
Because the size of the matrix to work with
decreases during the execution, different values
can be estimated for different problem sizes, and
the formula can be modified to include the
possibility of using these different estimates,
for example by splitting the formula into
four formulas with different problem sizes
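
A minimal sketch of such an installation measurement: it times an update of size (n/r × b) × (b × n/c) and divides by the number of floating-point operations to get a cost per operation. Plain Python loops stand in for the BLAS gemm kernel that a real installation routine would call, so the absolute values are not meaningful; the function name and the sizes are assumptions.

```python
import time


def time_per_flop_gemm(m, b, q, repeats=3):
    """Estimate the cost per operation of an (m x b) by (b x q) update.

    Performs C = C - A * B and returns the best time divided by 2*m*b*q.
    """
    a = [[1.0] * b for _ in range(m)]
    bm = [[1.0] * q for _ in range(b)]
    c = [[0.0] * q for _ in range(m)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for i in range(m):
            for j in range(q):
                c[i][j] -= sum(a[i][t] * bm[t][j] for t in range(b))
        best = min(best, time.perf_counter() - start)
    return best / (2.0 * m * b * q)


# Local block sizes for n = 256 on a 2 x 2 mesh with b = 16 (example values).
n, r, c, b = 256, 2, 2, 16
print(time_per_flop_gemm(n // r, b, n // c))
```

Repeating the call for several values of n, b, r and c yields the table of values that is stored for use at running time.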

50
Installation Routines

For k3,trsmm, two multiple triangular systems are
solved, one upper triangular of size b × n/c, and
another lower triangular of size n/r × b
Thus, two parameters are estimated, one of them
depending on n, b and c, and the other depending
on n, b and r
As for the previous parameter, values can be
obtained for different problem sizes

51
Installation Routines

k2,getf2 corresponds to a level 2 sequential LU
factorisation of size b × b
At installation time each of the basic routines
is executed varying the value of the parameters
it depends on, with representative values
(selected by the routine designer or the system
manager),
And the information generated is stored in a file
to be used at running time, or in the code of the
linear algebra routine before its installation

52
Installation Routines

ts and tw appear in communications of three types
In one of them a block of size b × b is broadcast
in a row, and this parameter depends on b and c
In another a block of size b × b is broadcast in
a column, and the parameter depends on b and r
And in the other, blocks of sizes b × n/c and n/r
× b are broadcast in each one of the columns and
rows of processors. These parameters depend on n,
b, r and c
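
For one type of communication and one mesh configuration, ts and tw can be estimated by timing messages of several sizes and fitting the usual linear model, time = ts + tw × size. The sketch below does the least-squares fit; the measurements would come from the communication library, and the synthetic timings used here are made up to show the mechanics.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of time = ts + tw * size.

    sizes are message lengths in words, times the measured durations.
    Returns the pair (ts, tw).
    """
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(times) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    tw = sxy / sxx
    ts = mean_y - tw * mean_x
    return ts, tw


# Synthetic timings generated from ts = 5e-5 s and tw = 1e-8 s per word.
sizes = [1000, 10000, 100000]
times = [5e-5 + 1e-8 * s for s in sizes]
print(fit_ts_tw(sizes, times))
```

Repeating the fit for each broadcast type and each (r, c) gives the dependence on the Algorithmic Parameters described above.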

53
Installation Routines

In practice each System Parameter depends on a
smaller number of Algorithmic Parameters,
but this is known only after the installation
process is completed.
The routine designer also designs the
installation process, and can draw on their
own experience to guide the
installation.
The basic installation process can be designed
to allow the intervention of the system manager.

54
Installation Routines