1
Advances in the Optimization of Parallel Routines
(I)
  • Domingo Giménez
  • Departamento de Informática y Sistemas
  • Universidad de Murcia, Spain
  • dis.um.es/domingo

2
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to the library hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Peer to peer computing

3
Collaborations and self-references
  • Modelling Linear Algebra Routines
  • J. Cuenca, J. González
  • Modelling the Behaviour of Linear Algebra Algorithms with Message-passing. 2001
  • Towards the Design of an Automatically Tuned Linear Algebra Library. 2002
  • J. Cuenca, L. P. García, J. González, A. Vidal
  • Empirical Modelling of Parallel Linear Algebra Routines. 2003

4
Collaborations and self-references
  • Installation routines
  • G. Carrillo
  • Installation routines for linear algebra libraries on LANs. 2000
  • G. Carrillo, J. Cuenca, J. González
  • Optimización automática de rutinas paralelas de álgebra lineal (Automatic optimization of parallel linear algebra routines). 2000

5
Collaborations and self-references
  • Autotuning routines
  • J. Cuenca, J. González
  • Automatic parameterization of parallel linear
    algebra routines. 2001
  • J. Cuenca
  • Some considerations about the Automatic
    Optimization of Parallel Linear Algebra Routines.
    2002

6
Collaborations and self-references
  • Modifications to the library hierarchy
  • J. Cuenca, J. González
  • Architecture of an Automatically Tuned Linear Algebra Library. 2002 - 2004

7
Collaborations and self-references
  • Polylibraries
  • P. Alberti, P. Alonso, J. Cuenca, A. Vidal
  • Designing Polylibraries to Speed Up Parallel
    Computations. 2003

8
Collaborations and self-references
  • Algorithmic schemes
  • J. P. Martínez
  • Automatic Optimization in Parallel Dynamic
    Programming Schemes. 2004

9
Collaborations and self-references
  • Heterogeneous systems
  • J. Cuenca, J. Dongarra, J. González, K. Roche
  • Automatic Optimization of Parallel Linear Algebra Routines in Systems with Variable Load. 2003
  • J. Cuenca, J. P. Martínez
  • Heuristics for Work Distribution of a Homogeneous
    Parallel Dynamic Programming Scheme on
    Heterogeneous Systems. 2004

10
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to the library hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Peer to peer computing

11
A little history
  • Parallel optimization in the past
  • Hand-optimization for each platform
  • Time consuming
  • Incompatible with hardware evolution
  • Incompatible with changes in the system
    (architecture and basic libraries)
  • Unsuitable for systems with variable workloads
  • Misuse by non-expert users

12
A little history
  • Initial solutions to this situation
  • Problem-specific solutions
  • Polyalgorithms
  • Installation tests

13
A little history
  • Problem specific solutions
  • Brewer (1994): Sorting Algorithms, Differential Equations
  • Frigo (1997): FFTW, the Fastest Fourier Transform in the West
  • LAWRA (1997): Linear Algebra With Recursive Algorithms

14
A little history
  • Polyalgorithms
  • Brewer
  • FFTW
  • PHiPAC (1997) Linear Algebra

15
A little history
  • Installation tests
  • ATLAS (2001): Dense Linear Algebra, sequential
  • Carrillo, Giménez (2000): Gaussian elimination, heterogeneous algorithm
  • I-LIB (2000): some parallel linear algebra routines

16
A little history
  • Parallel optimization today
  • Optimization based on computational kernels
  • Systematic development of routines
  • Auto-optimization of routines
  • Middleware for auto-optimization

17
A little history
  • Optimization based on computational kernels
  • Efficient kernels (BLAS) and algorithms based on
    these kernels
  • Auto-optimization of the basic kernels (ATLAS)

18
A little history
  • Systematic development of routines
  • FLAME project
  • R. van de Geijn, E. Quintana
  • Dense Linear Algebra
  • Based on Object Oriented Design
  • LAWRA
  • Dense Linear Algebra
  • For Shared Memory Systems

19
A little history
  • Auto-optimization of routines
  • At installation time
  • ATLAS (Dongarra, Whaley)
  • I-LIB (Kanada, Katagiri, Kuroda)
  • SOLAR (Cuenca, Giménez, González)
  • LFC (Dongarra, Roche)
  • At execution time
  • Solve a reduced problem in each processor (Kalinov, Lastovetsky)
  • Use a system evaluation tool (NWS)

20
A little history
  • Middleware for auto-optimization
  • LFC
  • Middleware for Dense Linear Algebra Software in
    Clusters.
  • Hierarchy of autotuning libraries
  • Include installation routines in the libraries, to be used in the development of higher-level libraries
  • FIBER
  • Proposal of general middleware
  • Evolution of I-LIB
  • mpC
  • For heterogeneous systems

21
A little history
  • Parallel optimization in the future?
  • Skeletons and languages
  • Heterogeneous and variable-load systems
  • Distributed systems
  • P2P computing

22
A little history
  • Skeletons and languages
  • Develop skeletons for parallel algorithmic
    schemes
  • together with execution time models
  • and provide the users with these libraries
    (MALLBA, Málaga-La Laguna-Barcelona) or languages
    (P3L, Pisa)

23
A little history
  • Heterogeneous and variable-load systems
  • Heterogeneous algorithms: unbalanced distribution of data (static or dynamic)
  • Homogeneous algorithms: more processes than processors, and assignment of processes to processors (static or dynamic)
  • Variable-load systems: treated as dynamically heterogeneous

24
A little history
  • Distributed systems
  • Intrinsically heterogeneous and variable-load
  • Very high cost of communications
  • Special middleware is necessary (Globus, NWS)
  • There can be servers to attend client queries

25
A little history
  • P2P computing
  • Users can go in and out dynamically
  • All the users are of the same type (initially)
  • It is distributed, heterogeneous and has variable load
  • But special middleware is necessary

26
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to the library hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Peer to peer computing

27
Modelling Linear Algebra Routines
  • It is necessary to accurately predict the execution time and to select:
  • The number of processes
  • The number of processors
  • Which processors
  • The number of rows and columns of processes (the
    topology)
  • The assignment of processes to processors
  • The computational block size (in linear algebra
    algorithms)
  • The communication block size
  • The algorithm (polyalgorithms)
  • The routine or library (polylibraries)

28
Modelling Linear Algebra Routines
  • Cost of a parallel program
  • arithmetic time
  • communication time
  • overhead, due to synchronization, imbalance, process creation, ...
  • overlapping of communication and computation
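Put together, the total cost has the general shape (notation assumed here for illustration, not taken from the original slides):

    T = T_{arith} + T_{comm} + T_{over} - T_{overlap}

where T_{over} collects the synchronization, imbalance and process-creation overheads and T_{overlap} is the part of the communication hidden behind computation.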

29
Modelling Linear Algebra Routines
  • Estimation of the time
  • Considering computation and communication divided into a number of steps
  • And taking, for each part of the formula, the value of the process which gives the highest cost (see the sketch below)
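In this step-wise view the estimate can be written, in an assumed generic form, as

    T \approx \sum_{steps} \left( \max_{p} t_{arith,p} + \max_{p} t_{comm,p} \right)

with the maxima taken over the processes that take part in each step.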

30
Modelling Linear Algebra Routines
  • The time depends on the problem size (n) and the system size (p)
  • But also on some ALGORITHMIC PARAMETERS like the
    block size (b) and the number of rows (r) and
    columns (c) of processors in algorithms for a
    mesh of processors

31
Modelling Linear Algebra Routines
  • And some SYSTEM PARAMETERS which reflect the
    computation and communication characteristics of
    the system.
  • Typically the cost of an arithmetic operation
    (tc) and the start-up (ts) and word-sending time
    (tw)
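With this notation a generic model has the shape (an illustrative form, not the exact formula of the original slides):

    T(n, p) = t_c \, f_{arith}(n, AP) + t_s \, f_{msg}(n, AP) + t_w \, f_{vol}(n, AP)

where f_{arith} counts the arithmetic operations, f_{msg} the number of messages and f_{vol} the number of words sent.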

32
Modelling Linear Algebra Routines
  • LU factorisation (Golub - Van Loan)
  • Step 1 (unblocked LU factorisation of the diagonal block)
  • Step 2 (multiple lower triangular systems)
  • Step 3 (multiple upper triangular systems)
  • Step 4 (update south-east blocks)

    [Figure: block structure of the factorisation, with lower triangular blocks L11, L21, L22, L31, L32, L33 and upper triangular blocks U11, U12, U13, U22, U23, U33]
33
Modelling Linear Algebra Routines
  • The execution time has the form sketched below
  • If the blocks are of size 1, the operations are all performed on individual elements, but if the block size is b the cost is expressed in terms of blocked operations
  • With k3 and k2 the costs of the operations performed with BLAS 3 or BLAS 2 routines
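A rough reconstruction of the two formulas, assuming the standard operation count for blocked LU (the constants may differ from those of the original slide): with blocks of size 1, and k the cost of an operation on individual elements,

    T(n) \approx \frac{2}{3} n^3 \, k

and with blocks of size b

    T(n, b) \approx \frac{2}{3} n^3 \, k_3 + \frac{1}{2} n^2 b \, (k_3 + k_2)

where the n^3 term comes from the BLAS 3 updates and the n^2 b terms from the multiple triangular solves and the factorisations of the b × b blocks.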

34
Modelling Linear Algebra Routines
  • But the cost of different operations of the same BLAS level is different, so the theoretical cost could be modelled more accurately with one constant per basic routine (as in the LU and QR models shown later)
  • Thus, the number of SYSTEM PARAMETERS increases (one for each basic routine), and ...

35
Modelling Linear Algebra Routines
  • The value of each System Parameter can depend on
    the problem size (n) and on the value of the
    Algorithmic Parameters (b)
  • The formula then has the form sketched below
  • And what we want is to obtain the values of the AP with which the lowest execution time is obtained
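In symbols (an assumed generic form):

    T = f(n, AP, SP(n, AP)), \qquad AP^{*} = \arg\min_{AP} f(n, AP, SP(n, AP))

that is, once the System Parameters have been measured, the Algorithmic Parameters are chosen to minimise the modelled execution time.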

36
Modelling Linear Algebra Routines
  • The values of the System Parameters could be
    obtained
  • With installation routines associated with each linear algebra routine
  • From information stored when the library was
    installed in the system, thus generating a
    hierarchy of libraries with auto-optimization
  • At execution time by testing the system
    conditions prior to the call to the routine

37
Modelling Linear Algebra Routines
  • These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters.
  • In the latter case a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored,
  • And when a problem of a particular size is being solved, the execution time is estimated with the values stored for the size closest to the real size,
  • And the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time (a selection sketch follows)
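A minimal sketch in C of this selection step. The data layout and the names (sp_entry, model_time, select_parameters) are hypothetical, not part of any existing library, and the model body uses the simplified LU-like formula sketched later in the talk rather than the exact one:

    /* Illustrative sketch (hypothetical data layout and function names). Each entry
       stores, for one problem size and one combination of Algorithmic Parameters,
       the System Parameter values measured at installation time. */
    #include <stdlib.h>
    #include <math.h>

    typedef struct {
        int n;                               /* stored problem size */
        int b, r, c;                         /* Algorithmic Parameters */
        double k3_gemm, k3_trsm, k2_getf2;   /* measured System Parameters (s/flop) */
        double ts, tw;                       /* communication parameters (s) */
    } sp_entry;

    /* Simplified LU-like execution time model T(n, AP, SP); only illustrative. */
    static double model_time(int n, const sp_entry *e)
    {
        double dn = n, p = (double)e->r * e->c;
        double arith = (2.0 / 3.0) * dn * dn * dn / p * e->k3_gemm
                     + 0.5 * dn * dn * e->b * (1.0 / e->r + 1.0 / e->c) * e->k3_trsm
                     + (2.0 / 3.0) * dn * e->b * e->b * e->k2_getf2;
        double comm  = (dn / e->b) * 2.0 * e->ts
                     + (dn * dn / e->r + dn * dn / e->c) * e->tw;
        return arith + comm;
    }

    /* Select, among the entries stored for the size closest to n, the
       Algorithmic Parameters with the lowest predicted execution time. */
    static const sp_entry *select_parameters(int n, const sp_entry *table, size_t len)
    {
        int closest = table[0].n;
        for (size_t i = 1; i < len; i++)
            if (abs(table[i].n - n) < abs(closest - n))
                closest = table[i].n;

        const sp_entry *best = NULL;
        double best_t = INFINITY;
        for (size_t i = 0; i < len; i++) {
            if (table[i].n != closest) continue;
            double t = model_time(n, &table[i]);
            if (t < best_t) { best_t = t; best = &table[i]; }
        }
        return best;
    }

The routine would then be executed with the b, r and c of the selected entry.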

38
Modelling Linear Algebra Routines
  • Parallel block LU factorisation
    [Figure: the matrix, the distribution of computations in the first step, and the mesh of processors]

39
Modelling Linear Algebra Routines
  • Distribution of computations on successive steps
    [Figure: second step and third step]

40
Modelling Linear Algebra Routines
  • The cost of parallel block LU factorisation (an illustrative sketch of the model follows)
  • Tuning Algorithmic Parameters
  • block size b
  • 2D mesh of p processors: p = r × c, d = max(r, c)
  • System Parameters
  • cost of arithmetic operations: k2,getf2, k3,trsm, k3,gemm
  • communication parameters: ts, tw
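An illustrative sketch of the shape such a model takes (the exact formula appears as an image in the original slide, so the constants here are approximate):

    T(n, b, r, c) \approx \frac{2 n^3}{3 p} k_{3,gemm}
                        + \frac{n^2 b}{2} \left( \frac{1}{r} + \frac{1}{c} \right) k_{3,trsm}
                        + \frac{2}{3} n b^2 \, k_{2,getf2}
                        + \frac{n}{b} \, \alpha \, t_s
                        + \left( \frac{n^2}{r} + \frac{n^2}{c} \right) t_w

with p = r c and \alpha a small factor that depends on the broadcast scheme (typically a function of d = max(r, c)).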

41
Modelling Linear Algebra Routines
  • The cost of parallel block QR factorisation
  • Tuning Algorithmic Parameters
  • block size b
  • 2D mesh of p processors: p = r × c
  • System Parameters
  • cost of arithmetic operations: k2,geqr2, k2,larft, k3,gemm, k3,trmm
  • communication parameters: ts, tw

42
Modelling Linear Algebra Routines
  • The same basic operations appear repeatedly in
    different higher level routines
  • the information generated for one routine (let's say LU) could be stored and used for other routines (e.g. QR)
  • and a common format is necessary to store the
    information

43
Modelling Linear Algebra Routines
44
Modelling Linear Algebra Routines
Parallel QR factorisation (IBM-SP2, 8 processors):
  • mean refers to the mean of the execution times with representative values of the Algorithmic Parameters (the execution time which could be obtained by a non-expert user)
  • optimum is the lowest time of all the executions performed with representative values of the Algorithmic Parameters
  • model is the execution time with the values selected with the model
45
Modelling Linear Algebra Routines
Parameter selection for the QR algorithm (network of Pentium III with Fast Ethernet):

            p = 4             p = 8
    n       b   r   c         b   r   c
    1024    16  1   4         16  1   8
    2048    16  1   4         16  1   8
    3072    32  1   4         32  1   8
    4096    32  1   4         32  1   8

46
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to the library hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Peer to peer computing

47
Installation Routines
  • In the formulas (parallel block LU factorisation)
  • The values of the System Parameters (k2,getf2, k3,trsm, k3,gemm, ts, tw) must be estimated as functions of the problem size (n) and the Algorithmic Parameters (b, r, c)

48
Installation Routines
  • By running, at installation time, Installation Routines associated with the linear algebra routine
  • And storing the information generated to be used at run time
  • Therefore, each linear algebra routine must be designed together with the corresponding installation routines, and the installation process must be described in detail

49
Installation Routines
  • k3,gemm is estimated by performing matrix-matrix multiplications and updates of size (n/r × b) × (b × n/c), as sketched below
  • Because during the execution the size of the matrix to work with decreases, different values can be estimated for different problem sizes, and the formula can be modified to include the possibility of these estimations with different values, for example by splitting the formula into four formulas with different problem sizes
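A minimal sketch in C of such an installation measurement, assuming a CBLAS implementation is available; the driver name estimate_k3_gemm is illustrative. It times one update of the shape used in the algorithm, with m = n/r and k = n/c, and divides by the flop count 2 m b k:

    #include <stdlib.h>
    #include <time.h>
    #include <cblas.h>

    static double now_seconds(void)
    {
        struct timespec t;
        clock_gettime(CLOCK_MONOTONIC, &t);
        return t.tv_sec + 1e-9 * t.tv_nsec;
    }

    /* Time C <- C - A*B with A of size m x b and B of size b x k, and return the
       estimated cost per floating point operation (the value used for k3,gemm). */
    double estimate_k3_gemm(int m, int b, int k)
    {
        double *A = malloc((size_t)m * b * sizeof *A);
        double *B = malloc((size_t)b * k * sizeof *B);
        double *C = malloc((size_t)m * k * sizeof *C);
        for (int i = 0; i < m * b; i++) A[i] = 1.0 / (i + 1);
        for (int i = 0; i < b * k; i++) B[i] = 1.0 / (i + 2);
        for (int i = 0; i < m * k; i++) C[i] = 1.0;

        double t0 = now_seconds();
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, k, b, -1.0, A, b, B, k, 1.0, C, k);
        double elapsed = now_seconds() - t0;

        free(A); free(B); free(C);
        return elapsed / (2.0 * m * (double)b * k);   /* seconds per flop */
    }

At installation time this would be repeated, with several repetitions, for the representative values of n, b, r and c chosen by the routine designer, so that the decrease of the matrix size during the execution is captured.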

50
Installation Routines
  • For k3,trsm, two multiple triangular systems are solved, one upper triangular of size b × n/c, and another lower triangular of size n/r × b
  • Thus, two parameters are estimated, one of them depending on n, b and c, and the other depending on n, b and r
  • As for the previous parameter, values can be obtained for different problem sizes

51
Installation Routines
  • k2,getf2 corresponds to a sequential level 2 LU factorisation of size b × b
  • At installation time each of the basic routines is executed varying the values of the parameters it depends on, with representative values (selected by the routine designer or the system manager),
  • And the information generated is stored in a file to be used at run time, or included in the code of the linear algebra routine before its installation

52
Installation Routines
  • ts and tw appear in communications of three types:
  • In one of them a block of size b × b is broadcast in a row, and this parameter depends on b and c
  • In another a block of size b × b is broadcast in a column, and the parameter depends on b and r
  • And in the other, blocks of sizes b × n/c and n/r × b are broadcast in each one of the columns and rows of processors. These parameters depend on n, b, r and c

53
Installation Routines
  • In practice each System Parameter depends on a smaller number of Algorithmic Parameters, but this is known only after the installation process has been completed.
  • The routine designer also designs the installation process, and can use his or her experience to guide the installation.
  • The basic installation process can be designed to allow the intervention of the system manager.

54
Installation Routines
  • Some results in different systems (physical and
    logical platform)
  • Values of k3_DTRMM ( k3_DGEMM) on the different
    platforms (in microseconds)

55
Installation Routines
Values of k2_DGEQR2 ( k2_DLARFT) on the
different platforms (in microseconds)
56
Installation Routines
  • Typically the values of the communication parameters are well estimated with a ping-pong test (see the sketch below)
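A minimal sketch in C with MPI of such a ping-pong test (illustrative; it assumes at least two processes and fits t(m) = ts + m tw from only two message sizes, where in practice a least-squares fit over several sizes would be preferred):

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Time a round trip of a message of 'words' doubles between ranks 0 and 1
       and return half of it (the one-way time). */
    static double pingpong(int rank, int words, int reps)
    {
        double *buf = calloc(words, sizeof *buf);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, words, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, words, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, words, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, words, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t = (MPI_Wtime() - t0) / (2.0 * reps);
        free(buf);
        return t;
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int small = 1, large = 100000, reps = 100;
        double t_small = pingpong(rank, small, reps);
        double t_large = pingpong(rank, large, reps);

        if (rank == 0) {
            /* solve t(m) = ts + m*tw from the two measurements */
            double tw = (t_large - t_small) / (large - small);
            double ts = t_small - small * tw;
            printf("ts = %g s, tw = %g s/word\n", ts, tw);
        }
        MPI_Finalize();
        return 0;
    }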