Advances in the Optimization of Parallel Routines II - PowerPoint PPT Presentation

Advances in the Optimization of Parallel Routines II

Description:

Using BLAS: k1, k2 and k3. 9/27/09. Universidad Politécnica de Valencia. 12. Autotuning routines ... BLAS. PBLAS. BLACS. Communications. Self-Optimisation ...


1
Advances in the Optimization of Parallel Routines
(II)
  • Domingo Giménez
  • Departamento de Informática y Sistemas
  • Universidad de Murcia, Spain
  • dis.um.es/domingo

2
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to libraries hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Hybrid programming
  • Peer to peer computing

3
Collaborations and self-references
  • Autotuning routines
  • J. Cuenca, J. González
  • Automatic parameterization of parallel linear
    algebra routines. 2001
  • J. Cuenca
  • Some considerations about the Automatic
    Optimization of Parallel Linear Algebra Routines.
    2002

4
Autotuning routines
  • Our approach
  • Routines Parameterised
  • System parameters, Algorithmic parameters
  • System parameters obtained at installation time
  • Analytical model of the routine and simple
    installation routines to obtain the system
    parameters
  • A reduced number of executions at installation
    time
  • Algorithmic parameters
  • From the analytical model with the system
    parameters obtained in the installation process

5
Autotuning routines
  • the scheme

6
Autotuning routines
  • Modelling
  • the LAR

7
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • The behaviour of the algorithm on the platform is
    defined by
  • Texec = f(SPs, n, APs)
  • SPs = f(n, APs): System Parameters
  • APs: Algorithmic Parameters
  • n: Problem Size
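
As a minimal sketch of what such a model can look like in code, the function below evaluates an execution-time estimate from system parameters, the problem size and algorithmic parameters. The cost formula (a cubic arithmetic term plus a per-step message term) and all names (`SystemParams`, `AlgorithmicParams`, `t_exec`) are illustrative assumptions, not the model used in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemParams:
    k3: float  # assumed cost of one level-3 BLAS floating-point operation (seconds)
    ts: float  # message start-up time (seconds)
    tw: float  # word-sending time (seconds per word)


@dataclass(frozen=True)
class AlgorithmicParams:
    b: int  # block size
    p: int  # number of processors


def t_exec(sp, n, ap):
    """Toy instance of Texec = f(SPs, n, APs); the formula is an assumption."""
    # Arithmetic work of an LU-like routine, shared evenly among p processors.
    arithmetic = (2.0 * n**3 / (3.0 * ap.p)) * sp.k3
    # One message of b*n words per block step; no communication on one processor.
    steps = n / ap.b
    communication = steps * (sp.ts + ap.b * n * sp.tw) if ap.p > 1 else 0.0
    return arithmetic + communication
```

With such a function, comparing two settings of the algorithmic parameters only requires evaluating the model twice, with no execution of the routine itself.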

8
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • System Parameters (SPs)
  • Hardware Platform
  • Physical Characteristics
  • Current Conditions
  • Basic libraries

LARs Performance
9
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • System Parameters (SPs)
  • Hardware Platform
  • Physical Characteristics
  • Current Conditions
  • Basic libraries
  • Two Kinds of SPs
  • Communication System Parameters (CSPs)
  • Arithmetic System Parameters (ASPs)

LARs Performance
10
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • System Parameters (SPs)
  • Hardware Platform
  • Physical Characteristics
  • Current Conditions
  • Basic libraries
  • Two Kinds of SPs
  • Communication System Parameters (CSPs)
  • ts: start-up time
  • tw: word-sending time

LARs Performance
11
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • System Parameters (SPs)
  • Hardware Platform
  • Physical Characteristics
  • Current Conditions
  • Basic libraries
  • Two Kinds of SPs
  • Communication System Parameters (CSPs)

LARs Performance
  • Arithmetic System Parameters (ASPs)
  • tc: arithmetic cost. Using BLAS: k1, k2 and k3

12
Autotuning routines
  • LAR-MOD: Analytical Model of LAR
  • System Parameters (SPs)
  • Hardware Platform
  • Physical Characteristics
  • Current Conditions
  • Basic libraries

LARs Performance
  • How to estimate each SP?
  • 1. Obtain the kernel of the performance cost of
    the LAR
  • 2. Make an Estimation Routine from this
    kernel
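
The two steps above can be sketched as follows for an arithmetic parameter. This is only an illustration: a naive pure-Python matrix product stands in for the computation kernel (a real estimation routine would call the installed BLAS), and the names `matmul_kernel` and `estimate_k3` are hypothetical.

```python
import time


def matmul_kernel(a, b, size):
    """Naive size x size matrix product; stands in for the computation kernel."""
    c = [[0.0] * size for _ in range(size)]
    for i in range(size):
        row_a, row_c = a[i], c[i]
        for k in range(size):
            aik = row_a[k]
            row_b = b[k]
            for j in range(size):
                row_c[j] += aik * row_b[j]
    return c


def estimate_k3(size, repetitions=3):
    """Estimate the time per floating-point operation (seconds) for this kernel."""
    a = [[1.0] * size for _ in range(size)]
    b = [[2.0] * size for _ in range(size)]
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        matmul_kernel(a, b, size)
        best = min(best, time.perf_counter() - start)  # keep the least-disturbed run
    flops = 2.0 * size**3  # one multiply and one add per inner iteration
    return best / flops
```

Running `estimate_k3` for several sizes gives one value of the parameter per block size, which is the kind of table the installation process stores.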

13
Autotuning routines
  • Design

14
Autotuning routines
  • Design
  • Making the LAR-ERs

15
Autotuning routines
  • LAR-ERs: Estimation Routines
  • Arithmetic System Parameters (ASPs)
  • Computation Kernel of the LAR → Estimation
    Routine
  • Similar storage scheme
  • Similar quantity of data
  • Communication System Parameters (CSPs)
  • Communication Kernel of the LAR → Estimation
    Routine
  • Similar kind of communication
  • Similar quantity of data

16
Autotuning routines
  • Design process has finished

17
Autotuning routines
  • Installation: running the LAR-ERs

18
Autotuning routines
  • Installation: obtaining the OAP

19
Autotuning routines
  • Installation: obtaining the OAP
  • Algorithmic Parameters (APs)
  • Once the SP values are known, the Optimum Values
    for the APs (OAP) are calculated
  • b: block size
  • p: number of processors
  • r × c: logical topology, the grid configuration
    (logical 2D mesh)
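
A minimal sketch of how the OAP could be obtained once the SP values are known: evaluate the model for every candidate combination of b, p and r × c and keep the cheapest. The cost formula in `toy_model` is a placeholder assumption, not the model from the talk, and the function names are hypothetical.

```python
from itertools import product


def grid_shapes(p):
    """All logical r x c meshes with r * c == p."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]


def toy_model(n, b, r, c, k3, ts, tw):
    """Illustrative cost model only (assumed terms, not the talk's formula)."""
    p = r * c
    d = max(r, c)
    # Parallel part shared among p processors plus an assumed sequential panel term.
    arithmetic = (2.0 * n**3 / (3.0 * p)) * k3 + b * b * n * k3
    communication = 0.0 if p == 1 else (n / b) * d * ts + (n * n * d / p) * tw
    return arithmetic + communication


def optimum_aps(n, block_sizes, max_procs, k3, ts, tw):
    """Exhaustive search over the APs; returns (estimated time, chosen APs)."""
    best = None
    for p in range(1, max_procs + 1):
        for (r, c), b in product(grid_shapes(p), block_sizes):
            t = toy_model(n, b, r, c, k3, ts, tw)
            if best is None or t < best[0]:
                best = (t, {"b": b, "p": p, "r": r, "c": c})
    return best
```

Because only the model is evaluated, the search is cheap even when the candidate space is enumerated exhaustively.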

20
Autotuning routines
  • Installation: putting it all together

SYSTEM MANAGER
21
Autotuning routines
  • Experiments
  • LAR: block LU factorization.
  • Platforms: IBM SP2, SGI Origin 2000, NoW
  • Basic Libraries: reference BLAS, machine BLAS,
    ATLAS

22
Autotuning routines
  • LU on IBM SP2
  • Quotient between the execution time with the
    parameters selected by the model and the lowest
    experimental execution time (varying the value
    of the parameters)

23
Autotuning routines
  • LU on Origin 2000
  • Quotient between the execution time with the
    parameters selected by the model and the lowest
    experimental execution time (varying the value
    of the parameters)

24
Autotuning routines
  • LU on NoW
  • Quotient between the execution time with the
    parameters selected by the model and the lowest
    experimental execution time (varying the value
    of the parameters)

25
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to libraries hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Hybrid programming
  • Peer to peer computing

26
Collaborations and self-references
  • Modifications to libraries hierarchy
  • J. Cuenca, J. González
  • Architecture of an Automatic Tuned Linear Algebra
    Library. 2002 - 2004

27
Modifications to libraries hierarchy
  • In the optimization of routines individual basic
    operations appear repeatedly
  • LU
  • QR

28
Modifications to libraries hierarchy
  • The information generated to install a routine
    could be used for a different routine
    without additional experiments
  • ts and tw are obtained when the communication
    library (MPI, PVM, ...) is installed
  • k3,gemm is obtained when the basic computational
    library (BLAS, ATLAS, ...) is installed

29
Modifications to libraries hierarchy
  • To determine:
  • the type of experiments necessary for the
    different routines in the library
  • ts and tw obtained with ping-pong, broadcast,
    ...
  • k3,gemm obtained for small block sizes, ...
  • the format in which the data will be stored, to
    make them easy to use when installing other
    routines
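
As a sketch of how ts and tw might be derived from ping-pong experiments, the function below fits the usual linear communication model, time = ts + size × tw, to measured (message size, time) pairs by least squares. The measurement itself (for example an MPI ping-pong) is omitted, and `fit_ts_tw` is a hypothetical name.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of time = ts + size * tw; returns (ts, tw)."""
    m = len(sizes)
    mean_x = sum(sizes) / m
    mean_y = sum(times) / m
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, times))
    tw = sxy / sxx            # slope: cost per word
    ts = mean_y - tw * mean_x  # intercept: start-up cost
    return ts, tw
```

The fitted pair can be stored once, when the communication library is installed, and reused by every routine whose model contains ts and tw.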

30
Modifications to libraries hierarchy
  • The method could be valid not only for one
    library (the one I am developing) but also for
    other libraries that I or somebody else will
    develop in the future
  • the type of experiments
  • the format in which the data will be stored
  • must be decided by the Parallel Linear Algebra
    Community
  • and the typical hierarchy of libraries would
    change

31
Modifications to libraries hierarchy
  • Typical hierarchy of Parallel Linear Algebra
    libraries

ScaLAPACK
PBLAS
LAPACK
BLACS
BLAS
Communications
32
Modifications to libraries hierarchy
  • To include installation information in the
    lowest levels of the hierarchy

ScaLAPACK
PBLAS
LAPACK
BLACS
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
33
Modifications to libraries hierarchy
  • When installing libraries in a higher level this
    information can be used, and new information is
    generated

ScaLAPACK
PBLAS
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
34
Modifications to libraries hierarchy
  • And so in higher levels

ScaLAPACK
Self-Optimisation Information
PBLAS
Self-Optimisation Information
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
35
Modifications to libraries hierarchy
  • And new libraries with autotuning capacity
    could be developed

Inverse Eigenvalue Problem
Least Square Problem
PDE Solver
Self-Optimisation Information
Self-Optimisation Information
Self-Optimisation Information
ScaLAPACK
Self-Optimisation Information
PBLAS
Self-Optimisation Information
LAPACK
BLACS
Self-Optimisation Information
Self-Optimisation Information
BLAS
Communications
Self-Optimisation Information
Self-Optimisation Information
36
Modifications to libraries hierarchy
  • Movement of information between routines in the
    different levels of the hierarchy

40
Modifications to libraries hierarchy
  • Architecture of a Self-Optimized Linear Algebra
    Routine manager (SOLAR_manager)
  • Labels on the diagram: LAR(n, AP) ..., AP0,
    Optimum_AP
  • Model: Texec = f(SP, AP, n), SP = f(AP, n)
  • Installation_information: n1 ... nw, AP1 ... APz
  • Current_system_information:
    Current_problem_size (nc),
    Current_network_availability (net1-1 ... netp-p),
    Current_CPUs_availability (CPU1 ... CPUp)
  • SP1_manager ... SPt_manager, each with its
    SP_information: Installation_SP_values (one SP
    value for each of n1 ... nw and AP1 ... APz) and
    Current_SP_values (one SP value for nc and each
    of AP1 ... APz)

41
Modifications to libraries hierarchy
  • Life cycle of a SOLAR

42
DESIGN PROCESS
D E S I G N
LAR
LAR: Linear Algebra Routine. Made by the LAR Designer
Example of LAR: Parallel Block LU factorisation
43
Modelling the LAR
D E S I G N
LAR
Modelling the LAR
MODEL
44
Modelling the LAR
D E S I G N
LAR
Made by the LAR-Designer. Only once per LAR
Modelling the LAR
MODEL
SP: System Parameters. AP: Algorithmic Parameters. n: Problem size
MODEL: Texec = f(SP, AP, n)
45
Modelling the LAR
D E S I G N
LAR
SP: k3, k2, ts, tw. AP: p = r x c, b. n: Problem size
Modelling the LAR
MODEL
MODEL LAR: Parallel Block LU factorisation
46
Implementation of SP-Estimators
D E S I G N
LAR
Modelling the LAR
MODEL
Implementation of SP-Estimators
SP-Estimators
47
Implementation of SP-Estimators
D E S I G N
LAR
Modelling the LAR
  • Estimators of Arithmetic-SP: Computation Kernel
    of the LAR
  • Similar storage scheme
  • Similar quantity of data
  • Estimators of Communication-SP: Communication
    Kernel of the LAR
  • Similar kind of communication
  • Similar quantity of data

MODEL
Implementation of SP-Estimators
SP-Estimators
48
INSTALLATION PROCESS
D E S I G N
LAR
Modelling the LAR
MODEL
Implementation of SP-Estimators
SP-Estimators
I N S T A L L A T I O N
Installation Process: only once per Platform. Done by the System Manager
49
Estimation of Static-SP
50
Estimation of Static-SP
Basic Libraries
  • Basic Communication Library: MPI, PVM
  • Basic Linear Algebra Library: reference-BLAS,
    machine-specific-BLAS, ATLAS
Installation File: SP values are obtained using the information (n and AP values) of this file.
51
Estimation of Static-SP
Platform: Cluster of Pentium III + Fast Ethernet. Basic Libraries: ATLAS and MPI
Estimation of the Static-SP k3-static (in μsec)
Block size   16      32      64      128
k3-static    0.0038  0.0033  0.0030  0.0027
Estimation of the Static-SP tw-static (in μsec)
Message size (Kbytes)   32     256    1024   2048
tw-static               0.700  0.690  0.680  0.675
Diagram: D E S I G N (LAR, Modelling the LAR, MODEL, Implementation of SP-Estimators, SP-Estimators); I N S T A L L A T I O N (Basic Libraries, Installation-File, Estimation of Static-SP, Static-SP-File)
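
One way a stored Static-SP table could be consulted for a block size that was not measured is linear interpolation between neighbouring entries. The sketch below uses the k3-static values from this slide (reading the first value as 0.0038); the interpolation rule and the name `lookup_sp` are assumptions for illustration.

```python
from bisect import bisect_left

# k3-static in microseconds by block size, as read from the slide.
K3_STATIC = {16: 0.0038, 32: 0.0033, 64: 0.0030, 128: 0.0027}


def lookup_sp(table, key):
    """Interpolate linearly between measured points; clamp outside the range."""
    keys = sorted(table)
    if key <= keys[0]:
        return table[keys[0]]
    if key >= keys[-1]:
        return table[keys[-1]]
    i = bisect_left(keys, key)
    lo, hi = keys[i - 1], keys[i]
    if key == hi:
        return table[hi]  # exact measurement available
    frac = (key - lo) / (hi - lo)
    return table[lo] + frac * (table[hi] - table[lo])
```

For example, `lookup_sp(K3_STATIC, 48)` falls halfway between the entries for block sizes 32 and 64.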
52
RUN-TIME PROCESS
53
RUN-TIME PROCESS
54
RUN-TIME PROCESS
55
Outline
  • A little history
  • Modelling Linear Algebra Routines
  • Installation routines
  • Autotuning routines
  • Modifications to libraries hierarchy
  • Polylibraries
  • Algorithmic schemes
  • Heterogeneous systems
  • Hybrid programming
  • Peer to peer computing

56
Collaborations and self-references
  • Polylibraries
  • P. Alberti, P. Alonso, J. Cuenca, A. Vidal
  • Designing Polylibraries to Speed Up Parallel
    Computations. 2003

57
Polylibraries
  • Different basic libraries can be available
  • Reference BLAS, machine specific BLAS, ATLAS, ...
  • MPICH, machine specific MPI, PVM, ...
  • Reference LAPACK, machine specific LAPACK, ...
  • ScaLAPACK, PLAPACK, ...
  • To use a number of different basic libraries to
    develop a polylibrary

58
Polylibraries
  • Typical parallel linear algebra libraries
    hierarchy

59
Polylibraries
  • A possible parallel linear algebra polylibraries
    hierarchy

60
Polylibraries
  • A possible parallel linear algebra polylibraries
    hierarchy

61
Polylibraries
  • A possible parallel linear algebra polylibraries
    hierarchy

62
Polylibraries
BLACS
63
Polylibraries
  • The advantage of Polylibraries
  • A library optimised for the system might not be
    available
  • The characteristics of the system can change
  • Which library is the best may vary according to
    the routines and the systems
  • Even for different problem sizes or different
    data access schemes the preferred library can
    change
  • In parallel systems with the file system shared
    by processors of different types

64
Architecture of a Polylibrary
Library_1
65
Architecture of a Polylibrary
66
Architecture of a Polylibrary
67
Architecture of a Polylibrary
68
Architecture of a Polylibrary
69
Architecture of a Polylibrary
70
Architecture of a Polylibrary
71
Architecture of a Polylibrary
72
Architecture of a Polylibrary
73
Architecture of a Polylibrary
74
Polylibraries
  • Combining Polylibraries with other Optimisation
    Techniques
  • Polyalgorithms
  • Algorithmic Parameters
  • Block size
  • Number of processors
  • Logical topology of processors

75
Experimental Results
  • Routines of different levels in the hierarchy
  • Lowest level
  • GEMM matrix-matrix multiplication
  • Medium level
  • LU and QR factorisations
  • Highest level
  • a Lift-and-Project algorithm to solve the inverse
    additive eigenvalue problem
  • an algorithm to solve the Toeplitz least square
    problem

76
Experimental Results
  • The platforms
  • SGI Origin 2000
  • IBM-SP2
  • Different networks of processors
  • SUN Workstations + Ethernet
  • PCs + Fast-Ethernet
  • PCs + Myrinet

77
Experimental Results GEMM
  • Routine: GEMM (matrix-matrix multiplication)
  • Platform: five SUN Ultra 1 / one SUN Ultra 5
  • Libraries:
  • refBLAS, macBLAS
  • ATLAS1, ATLAS2, ATLAS5
  • Algorithms and Parameters:
  • Strassen → base size
  • By blocks → block size
  • Direct method
78
Experimental Results GEMM
  • MATRIX-MATRIX MULTIPLICATION INTERFACE
  • if processor is SUN Ultra 5
  •   if problem-size < 600
  •     solve using ATLAS5 and Strassen method with
        base size half of problem size
  •   else if problem-size < 1000
  •     solve using ATLAS5 and block method with block
        size 400
  •   else
  •     solve using ATLAS5 and Strassen method with
        base size half of problem size
  •   endif
  • else if processor is SUN Ultra 1
  •   if problem-size < 600
  •     solve using ATLAS5 and direct method
  •   else if problem-size < 1000
  •     solve using ATLAS5 and Strassen method with
        base size half of problem size
  •   else
  •     solve using ATLAS5 and direct method
  •   endif
  • endif
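
The decision rules of this interface can be sketched as a small dispatch function, reading `lt` in the slide as "less than". The function only returns the selected library, method and parameter; the actual multiplication routines are not shown, and the name `choose_gemm_strategy` and the tuple format are hypothetical.

```python
def choose_gemm_strategy(processor, n):
    """Return (library, method, parameter) following the slide's decision rules."""
    if processor == "SUN Ultra 5":
        if n < 600:
            return ("ATLAS5", "strassen", n // 2)  # base size: half the problem size
        elif n < 1000:
            return ("ATLAS5", "block", 400)        # block size 400
        else:
            return ("ATLAS5", "strassen", n // 2)
    elif processor == "SUN Ultra 1":
        if n < 600:
            return ("ATLAS5", "direct", None)
        elif n < 1000:
            return ("ATLAS5", "strassen", n // 2)
        else:
            return ("ATLAS5", "direct", None)
    raise ValueError(f"no rule for processor {processor!r}")
```

A caller would use the returned triple to pick the concrete routine, so the rules can be regenerated at installation time without touching user code.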

79
Experimental Results GEMM
80
Experimental Results LU
  • Routine: LU factorisation
  • Platform: 4 Pentium III + Myrinet
  • Libraries
  • ATLAS
  • BLAS for Pentium II
  • BLAS for Pentium III

81
Experimental Results LU
  • The cost of parallel block LU factorisation
  • Tuning Algorithmic Parameters
  • block size b
  • 2D-mesh of p processors: p = r × c, d = max(r,c)
  • System Parameters
  • cost of arithmetic operations: k2,getf2,
    k3,trsmm, k3,gemm
  • communication parameters: ts, tw

82
Experimental Results LU
83
Experimental Results QR
  • Routine: QR factorisation
  • Platform: 8 Pentium III + Fast-Ethernet
  • Libraries
  • ATLAS
  • BLAS for Pentium II
  • BLAS for Pentium III

84
Experimental Results QR
  • The cost of parallel block QR factorisation
  • Tuning Algorithmic Parameters
  • block size b
  • 2D-mesh of p processors: p = r × c
  • System Parameters
  • cost of arithmetic operations: k2,geqr2,
    k2,larft, k3,gemm, k3,trmm
  • communication parameters: ts, tw

85
Experimental Results QR
86
Experimental Results LP
  • Routine: Lift-and-Project method for the Inverse
    Additive Eigenvalue Problem
  • Platform: dual Pentium III
  • Libraries: combinations

87
Experimental Results LP
  • The theoretical model of the sequential algorithm
    cost
  • System Parameters
  • ksyev → LAPACK
  • k3,gemm, k3,diaggemm → BLAS-3
  • k1,dot, k1,scal, k1,axpy → BLAS-1

88
Experimental Results LP
89
Experimental Results LP
90
Polylibraries
  • The method can be applied to sequential and
    parallel algorithms
  • It can be combined with other methods of
    computation speed up.
  • The LIF contains the cost of an operation for
    each one of the routines. These costs may be
    different for different data sizes or access
    schemes.
  • Could be applied to help in the development of
    efficient parallel libraries in other fields.
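
A minimal sketch of how per-routine costs stored in a LIF could drive the choice of library in a polylibrary. The dictionary layout, the library names `libA` and `libB`, and the cost values are invented for illustration; only the idea of selecting the library with the lowest stored cost for a given routine and data size comes from the slides.

```python
# Hypothetical LIF content: cost per operation (arbitrary units),
# indexed by library, routine and problem size.
LIF = {
    "libA": {"gemm": {200: 3.0, 800: 2.0}},
    "libB": {"gemm": {200: 2.5, 800: 2.4}},
}


def best_library(lif, routine, size):
    """Pick the library whose stored cost for the nearest measured size is lowest."""
    best_name, best_cost = None, float("inf")
    for name, routines in lif.items():
        costs = routines.get(routine)
        if not costs:
            continue  # this library does not provide the routine
        nearest = min(costs, key=lambda s: abs(s - size))
        if costs[nearest] < best_cost:
            best_name, best_cost = name, costs[nearest]
    return best_name
```

With these invented numbers the preferred library changes with the problem size, which is the situation the polylibrary is meant to exploit.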