Title: Automatic Optimization in Parallel Dynamic Programming Schemes
1. Automatic Optimization in Parallel Dynamic Programming Schemes
- Domingo Giménez
- Departamento de Informática y Sistemas
- Universidad de Murcia, Spain
- domingo_at_dif.um.es
- dis.um.es/domingo
- Juan-Pedro Martínez
- Departamento de Estadística y Matemática Aplicada
- Universidad Miguel Hernández de Elche, Spain
- jp.martinez_at_uhm.es
2. Our Goal
- General goal: to obtain parallel routines with autotuning capacity
- Previous work: linear algebra routines
- This communication: parallel dynamic programming schemes
- In the future: apply the techniques to hybrid, heterogeneous and distributed systems
3. Outline
- Modelling Parallel Routines for Autotuning
- Parallel Dynamic Programming Schemes
- Autotuning in Parallel Dynamic Programming Schemes
- Experimental Results
4. Modelling Parallel Routines for Autotuning
- It is necessary to predict accurately the execution time and to select:
- The number of processes
- The number of processors
- Which processors
- The number of rows and columns of processes (the topology)
- The assignment of processes to processors
- The computational block size (in linear algebra algorithms)
- The communication block size
- The algorithm (polyalgorithms)
- The routine or library (polylibraries)
5. Modelling Parallel Routines for Autotuning
- Cost of a parallel program (a sketch of how these terms combine follows this list):
- arithmetic time
- communication time
- overhead, for synchronization, imbalance, process creation, ...
- overlapping of communication and computation
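As a minimal sketch of how these components might combine (the slide's own formula was not preserved in this text version, so the exact form is an assumption):

    % Hedged sketch: additive cost model, with overlapped communication discounted
    T(n,p) = T_{\mathrm{arith}}(n,p) + T_{\mathrm{comm}}(n,p)
           + T_{\mathrm{overhead}}(n,p) - T_{\mathrm{overlap}}(n,p)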
6. Modelling Parallel Routines for Autotuning
- Estimation of the time:
- Computation and communication are considered as divided into a number of steps
- For each part of the formula, the value of the process which gives the highest value is taken (see the sketch below)
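A hedged reconstruction of this estimate, assuming s computation/communication steps and taking, for each term, the maximum over the processes j:

    % Assumption: per-step arithmetic and communication times of process j in step k
    T = \sum_{k=1}^{s} \left( \max_{j} t_{\mathrm{arith}}^{(k,j)}
                            + \max_{j} t_{\mathrm{comm}}^{(k,j)} \right)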
7. Modelling Parallel Routines for Autotuning
- The time depends on the problem size (n) and the system size (p)
- But also on some ALGORITHMIC PARAMETERS, like the block size (b) and the number of processors (q) used from the total available
8. Modelling Parallel Routines for Autotuning
- And on some SYSTEM PARAMETERS which reflect the computation and communication characteristics of the system
- Typically the cost of an arithmetic operation (tc), and the start-up (ts) and word-sending (tw) times (see below)
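These parameters enter the model in the usual way (the standard convention, not a formula taken from the slide): an operation count is multiplied by tc, and a message of m words costs

    t_{\mathrm{comm}}(m) = t_s + m\,t_w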
9. Modelling Parallel Routines for Autotuning
- The values of the System Parameters could be obtained:
- With installation routines associated to the routine we are installing
- From information stored when the library was installed in the system
- At execution time, by testing the system conditions prior to the call to the routine
10. Modelling Parallel Routines for Autotuning
- These values can be obtained as simple values (the traditional method) or as functions of the Algorithmic Parameters.
- In the latter case, a multidimensional table of values, as a function of the problem size and the Algorithmic Parameters, is stored.
- When a problem of a particular size is solved, the execution time is estimated with the values of the stored size closest to the real size,
- And the problem is solved with the values of the Algorithmic Parameters which predict the lowest execution time (see the sketch below).
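A minimal C sketch of this selection step. The table sp_table, the stored sizes and the function predicted_time are hypothetical names for illustration; they are not taken from the paper:

    #include <stdlib.h>
    #include <math.h>

    #define NUM_SIZES 2  /* stored problem sizes (illustrative values) */
    #define MAX_P 8      /* maximum number of processors available */

    typedef struct { double tc, ts, tw; } sp_values;

    /* Hypothetical installation data: SPs measured for each stored
       problem size and each number of processors. */
    static int stored_sizes[NUM_SIZES] = { 10000, 100000 };
    static sp_values sp_table[NUM_SIZES][MAX_P];

    /* Hypothetical model: n stages of C/p computations plus one
       communication step per stage (an assumption, not the paper's formula). */
    static double predicted_time(int C, int n, int p, sp_values sp) {
        return (double)n * C / p * sp.tc + n * (sp.ts + C * sp.tw);
    }

    /* Select the number of processors minimizing the predicted time,
       using the SP values stored for the size closest to the real C. */
    int select_p(int C, int n, int max_p) {
        int s = 0;
        for (int i = 1; i < NUM_SIZES; i++)
            if (abs(stored_sizes[i] - C) < abs(stored_sizes[s] - C)) s = i;
        int best_p = 1; double best_t = INFINITY;
        for (int p = 1; p <= max_p; p++) {
            double t = predicted_time(C, n, p, sp_table[s][p - 1]);
            if (t < best_t) { best_t = t; best_p = p; }
        }
        return best_p;
    }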
11. Parallel Dynamic Programming Schemes
- There are different Parallel Dynamic Programming Schemes.
- The simple scheme of the coins problem is used:
- Given a quantity C and n coin types of values v = (v1, v2, ..., vn), with a quantity q = (q1, q2, ..., qn) of each type, minimize the number of coins used to give C.
- But the granularity of the computation has been varied, to study the scheme and not the problem.
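For reference, one standard recurrence for this bounded coin-changing problem (an assumption; the slides do not show the paper's exact table-filling formula). With S(i, j) the minimum number of coins needed to give the quantity j using the first i coin types:

    S(i,j) = \min_{0 \le k \le q_i,\; k v_i \le j} \bigl( S(i-1,\, j - k v_i) + k \bigr),
    \qquad S(0,0) = 0, \quad S(0,j) = \infty \;\text{for}\; j > 0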
12. Parallel Dynamic Programming Schemes
- Sequential scheme:
for i = 1 to number_of_decisions
    for j = 1 to problem_size
        obtain the optimum solution with i decisions and problem size j
    endfor
    Complete the table with the formula
endfor
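A runnable C sketch of this sequential scheme, instantiated for the coins problem above (all names are illustrative):

    #include <stdio.h>
    #include <limits.h>

    #define INF (INT_MAX / 2)  /* "no solution" marker that survives +k */

    /* coins[i][j]: minimum number of coins giving quantity j with the
       first i coin types (one row per decision, one column per size).
       Stack VLA: fine for small C; real sizes would use the heap. */
    int min_coins(int C, int n, const int v[], const int q[]) {
        int coins[n + 1][C + 1];
        for (int j = 0; j <= C; j++) coins[0][j] = (j == 0) ? 0 : INF;
        for (int i = 1; i <= n; i++)           /* decisions */
            for (int j = 0; j <= C; j++) {     /* problem sizes */
                coins[i][j] = coins[i - 1][j]; /* zero coins of type i */
                for (int k = 1; k <= q[i - 1] && k * v[i - 1] <= j; k++) {
                    int t = coins[i - 1][j - k * v[i - 1]] + k;
                    if (t < coins[i][j]) coins[i][j] = t;
                }
            }
        return coins[n][C];
    }

    int main(void) {
        int v[] = { 1, 2, 5 }, q[] = { 10, 10, 10 };
        printf("%d\n", min_coins(13, 3, v, q)); /* prints 4: 5+5+2+1 */
        return 0;
    }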
13. Parallel Dynamic Programming Schemes
- Parallel scheme:
for i = 1 to number_of_decisions
    In Parallel
        for j = 1 to problem_size
            obtain the optimum solution with i decisions and problem size j
        endfor
    endInParallel
endfor
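One possible shared-memory realization of the "In Parallel" block, sketched with OpenMP (the paper's experiments use message passing; this only illustrates that the j-loop of a stage is independent once row i-1 is complete):

    /* Stage i of the coins table: prev is row i-1, cur is row i.
       Compile with an OpenMP-capable C compiler (e.g. cc -fopenmp). */
    void stage(int C, int vi, int qi, const int *prev, int *cur) {
        #pragma omp parallel for
        for (int j = 0; j <= C; j++) {
            int best = prev[j];
            for (int k = 1; k <= qi && k * vi <= j; k++)
                if (prev[j - k * vi] + k < best) best = prev[j - k * vi] + k;
            cur[j] = best;
        }
    }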
14. Parallel Dynamic Programming Schemes
- Message-passing scheme:
In each processor Pj
    for i = 1 to number_of_decisions
        communication step
        obtain the optimum solution with i decisions and the problem sizes Pj has assigned
    endfor
endInEachProcessor
[Figure: the N subproblem sizes are distributed in blocks among the processors P0, P1, P2, ..., PK-1, PK]
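A hedged MPI sketch of one stage of this scheme, assuming the block distribution of the figure and an all-gather as the communication step (the slides do not specify the actual MPI calls; names are illustrative):

    #include <mpi.h>

    /* Stage i on process P_rank: update the block of columns assigned
       to this process, reading the complete previous row. */
    void dp_stage(int block, int vi, int qi,
                  const int *prev_row, int *my_block) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int lo = rank * block;           /* first column owned by P_rank */
        for (int j = lo; j < lo + block; j++) {
            int best = prev_row[j];
            for (int k = 1; k <= qi && k * vi <= j; k++)
                if (prev_row[j - k * vi] + k < best)
                    best = prev_row[j - k * vi] + k;
            my_block[j - lo] = best;
        }
    }

    /* Communication step after each stage: assemble the full row on
       all processes before the next decision, e.g.
       MPI_Allgather(my_block, block, MPI_INT,
                     next_row, block, MPI_INT, MPI_COMM_WORLD); */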
15. Autotuning in Parallel Dynamic Programming Schemes
- Theoretical model:
- Sequential cost
- Computational parallel cost (for large qi)
- Communication cost (of one step)
- The only AP is p
- The SPs are tc, ts and tw
(a reconstruction of these costs is sketched below)
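The formulas of this slide were lost in the text conversion; a plausible reconstruction, under the assumptions that there are n decision stages of C subproblems each, that the computation is evenly distributed among the p processes, and that each stage ends with one communication step of roughly C words:

    % Assumption, not the slide's exact formulas:
    T_{\mathrm{seq}}(C,n) \approx n\,C\,t_c, \qquad
    T_{\mathrm{par}}(C,n,p) \approx \frac{n\,C}{p}\,t_c + n\,(t_s + C\,t_w)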
16. Autotuning in Parallel Dynamic Programming Schemes
- How to estimate the arithmetic SPs:
- Solving a small problem
- How to estimate the communication SPs:
- Using a ping-pong (CP1) (see the sketch below)
- Solving a small problem, varying the number of processors (CP2)
- Solving problems of selected sizes in systems of selected sizes (CP3)
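A minimal MPI ping-pong sketch for CP1, estimating ts and tw from the timings of two message sizes (illustrative; the paper's installation routines are not shown):

    #include <mpi.h>
    #include <stdio.h>

    #define M1 1000
    #define M2 100000

    /* One-way time for a message of m doubles between ranks 0 and 1. */
    static double one_way(int m, double *buf, int rank) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        if (rank == 0) {
            MPI_Send(buf, m, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, m, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, m, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, m, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
        }
        return (MPI_Wtime() - t0) / 2;   /* half a round trip */
    }

    int main(int argc, char **argv) {    /* run with: mpirun -np 2 ... */
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        static double buf[M2];
        double t1 = one_way(M1, buf, rank);
        double t2 = one_way(M2, buf, rank);
        double tw = (t2 - t1) / (M2 - M1); /* word-sending time */
        double ts = t1 - M1 * tw;          /* start-up time */
        if (rank == 0) printf("ts = %g s, tw = %g s/word\n", ts, tw);
        MPI_Finalize();
        return 0;
    }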
17. Experimental Results
- Systems:
- SUNEt: five SUN Ultra 1 and one SUN Ultra 5 (2.5 times faster), Ethernet
- PenFE: seven Pentium III, FastEthernet
- Varying:
- The problem size: C = 10000, 50000, 100000, 500000
- A large value of qi
- The granularity of the computation (the cost of a computational step)
18. Experimental Results
- CP1:
- Ping-pong (point-to-point communication)
- Does not reflect the characteristics of the system
- CP2:
- Executions with the smallest problem (C = 10000), varying the number of processors
- Reflects the characteristics of the system, but the time also changes with C
- Larger installation time (6 and 9 seconds)
- CP3:
- Executions with selected problem sizes (C = 10000, 100000) and system sizes (p = 2, 4, 6), and linear interpolation for other sizes
- Larger installation time (76 and 35 seconds)
19. Experimental Results
- Parameter selection
[Figures: parameters selected by each method on SUNEt and PenFE]
20. Experimental Results
- Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in SUNEt
21. Experimental Results
- Quotient between the execution time with the parameter selected by each of the selection methods and the lowest execution time, in PenFE
22. Experimental Results
- Three types of users are considered:
- GU (greedy user): uses all the available processors
- CU (conservative user): uses half of the available processors
- EU (expert user): uses a different number of processors depending on the granularity:
- 1 for low granularity
- half of the available processors for middle granularity
- all the processors for high granularity
23. Experimental Results
- Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in SUNEt
24. Experimental Results
- Quotient between the execution time with the parameter selected by each type of user and the lowest execution time, in PenFE
25. Conclusions and Future Work
- The inclusion of autotuning capabilities in a Parallel Dynamic Programming Scheme has been considered.
- Different ways of modelling the scheme, and of selecting the parameters, have been studied.
- Experimentally, the selection proves to be satisfactory, and useful in providing users with routines capable of reduced execution times.
- In the future we plan to apply this technique:
- to other algorithmic schemes,
- in hybrid, heterogeneous and distributed systems.