1. Automatic Selection of Loop Scheduling Algorithms Using Reinforcement Learning
- Mahbubur Rashid (1,2), Ioana Banicescu (1,2), Ricolindo L. Cariño (3)
- (1) Department of Computer Science and Engineering
- (2) Center for Computational Sciences, HPC²
- (3) Center for Advanced Vehicular Systems, HPC²
- Mississippi State University
- Partial support from NSF Grants 9984465, 0085969, 0081303, 0132618, and 0082979.
2. Outline
- Load balancing research at MSU
- Research motivation
  - The need to select the appropriate dynamic loop scheduling algorithm in a dynamic environment
- Background work
  - Dynamic loop scheduling
  - Reinforcement Learning techniques
- An integrated approach using dynamic loop scheduling and reinforcement learning for performance improvements
- Experimental setup and results
- Conclusions and future directions
3. Load Balancing Research at MSU
4. Scheduling and Load Balancing at MSU
- Objective: performance optimization for problems in computational science via dynamic scheduling and load balancing algorithm development
- Activities:
  - Derive novel loop scheduling techniques (based on probabilistic analyses)
    - Adaptive weighted factoring (2000, 01, 02, 03)
    - Adaptive factoring (2000)
  - Develop load balancing tools and libraries
    - For applications using Threads and MPI: DMCS/MOL, LB_Library (2004, 08)
    - Additional functionality of systems: Hector, Loci (2006)
  - Improve the performance of applications
    - N-body simulations, CFD simulations, quantum physics
    - Astrophysics, computational mathematics, statistics
5. Research Motivation
6. Motivation: the need to select the appropriate dynamic loop scheduling algorithm for time-stepping applications running in a dynamic environment
- Sequential form:

    Initializations
    do t = 1, nsteps
      ...
      do i = 1, N
        (loop body)
      end do
      ...
    end do
    Finalizations

- Parallel form:

    Initializations
    do t = 1, nsteps
      ...
      call LoopSchedule ( 1, N, loop_body_routine, myRank, foreman, method, ... )
      ...
    end do
    Finalizations
Property: the loop iterate execution times (1) are non-uniform, and (2) evolve with t.
Problem: how to select the scheduling method?
Proposed solution: machine learning!
7. Background Work
8. Dynamic loop scheduling algorithms
- Static chunking
- Dynamic, non-adaptive:
  - Fixed size chunking (1985)
  - Guided self-scheduling (1987)
  - Factoring (1992)
  - Weighted factoring (1996)
- Dynamic, adaptive:
  - Adaptive weighted factoring (2001-2003)
  - Adaptive factoring (2000, 2002)
- Significance of dynamic scheduling techniques:
  - Address all sources of load imbalance (algorithmic and systemic)
  - Based on probabilistic analyses
9. Machine Learning (ML)
- Supervised Learning (SL):
  - Teacher
  - Learner
  - Input-output pairs
  - Training (offline learning)
- Reinforcement Learning (RL):
  - Agent
  - Environment
  - Action, state, reward
  - Learning concurrent with problem solving
- Survey: http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html
10. Reinforcement Learning system
(Diagram.) Legend: I = set of inputs (i); R = set of rewards (r); B = policy; a = action; T = transition; s = state.
11. Reinforcement Learning (RL)
- Model-based approach:
  - Model M, utility function U_M derived from M
  - Examples: Dyna, Prioritized Sweeping, Queue-Dyna, Real-Time Dynamic Programming
- Model-free approach:
  - Action-value function Q
  - Example: Temporal Difference learning (combining Monte Carlo and Dynamic Programming ideas)
    - SARSA algorithm
    - QLEARN algorithm (see the update-rule sketch below)
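To make the two policies concrete, here is a minimal sketch of their tabular update rules in C (the language of the RL agent, per slide 23). The table sizes, learning rate alpha, and discount factor gamma are standard RL notation and illustrative values, not taken from the talk:

    #define NSTATES  8
    #define NACTIONS 8

    double Q[NSTATES][NACTIONS];   /* tabular action-value function */

    /* SARSA (on-policy): bootstraps on the action a2 actually taken
       in the next state s2. */
    void sarsa_update(int s, int a, double r, int s2, int a2,
                      double alpha, double gamma)
    {
        Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a]);
    }

    /* QLEARN (off-policy): bootstraps on the greedy value of s2,
       regardless of the action the policy takes next. */
    void qlearn_update(int s, int a, double r, int s2,
                       double alpha, double gamma)
    {
        double best = Q[s2][0];
        for (int i = 1; i < NACTIONS; i++)
            if (Q[s2][i] > best) best = Q[s2][i];
        Q[s][a] += alpha * (r + gamma * best - Q[s][a]);
    }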
12. RL system for automatic selection of loop scheduling methods
(Diagram.) Legend: I = set of inputs (set of methods, current time step, set of loop ids); R = set of rewards (loop execution time); B = policy (SARSA, QLEARN); a = action (use a particular scheduling method); s = state (application is using a method).
13. Research Approach
14. Embedding a RL system in time-stepping applications with loops
- Serial form:

    Initializations
    do t = 1, nsteps
      ...
      do i = 1, N
        (loop body)
      end do
      ...
    end do
    Finalizations

- Parallel form:

    Initializations
    do t = 1, nsteps
      ...
      call LoopSchedule ( 1, N, loop_body_rtn, myRank, foreman, method, ... )
      ...
    end do
    Finalizations
- With RL system:

    Initializations
    call RL_Init()
    do t = 1, nsteps
      time_start = time()
      call RL_Action (method)
      call LoopSchedule ( 1, N, loop_body_rtn, myRank, foreman, method, ... )
      reward = time() - time_start
      call RL_Reward (t, method, reward)
    end do
    Finalizations
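The slides name the entry points RL_Init, RL_Action, and RL_Reward but do not show their internals. Below is a minimal hypothetical sketch in C, assuming a single-state (bandit-style) value table over the scheduling methods, epsilon-greedy selection, and the negated loop time as reward; the actual agent maintains state and uses the SARSA/QLEARN updates of slide 11:

    #include <stdlib.h>

    #define NMETHODS 9                 /* e.g. STATIC..AF, MODF, EXPT */

    static double Qv[NMETHODS];        /* value estimate per method */
    static const double alpha = 0.5;   /* learning rate (assumed value) */
    static const double epsilon = 0.1; /* exploration rate (assumed value) */

    void RL_Init(void)
    {
        for (int m = 0; m < NMETHODS; m++) Qv[m] = 0.0;
    }

    void RL_Action(int *method)        /* pick a scheduling method */
    {
        if ((double)rand() / RAND_MAX < epsilon) {
            *method = rand() % NMETHODS;            /* explore */
        } else {
            *method = 0;                            /* exploit: argmax Qv */
            for (int m = 1; m < NMETHODS; m++)
                if (Qv[m] > Qv[*method]) *method = m;
        }
    }

    void RL_Reward(int t, int method, double loop_time)
    {
        (void)t;  /* the time step could index a state; unused here */
        /* shorter loop time is better, so its negation is the reward */
        Qv[method] += alpha * (-loop_time - Qv[method]);
    }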
15. Test application: simulation of wave packet dynamics using the Quantum Trajectory Method (QTM)
- Bohm, D. 1952. A Suggested Interpretation of the Quantum Theory in Terms of Hidden Variables. Phys. Rev. 85, No. 2, 166-193.
- Lopreore, C.L., R.E. Wyatt. 1999. Quantum Wavepacket Dynamics with Trajectories. Phys. Rev. Letters 82, No. 26, 5190-5193.
- Brook, R.G., P.E. Oppenheimer, C.A. Weatherford, I. Banicescu, J. Zhu. 2001. Solving the Hydrodynamic Formulation of Quantum Mechanics: A Parallel MLS Method. Int. J. of Quantum Chemistry 85, Nos. 4-5, 263-271.
- Cariño, R.L., I. Banicescu, R.K. Vadapalli, C.A. Weatherford, J. Zhu. 2004. Message-Passing Parallel Adaptive Quantum Trajectory Method. In High Performance Scientific and Engineering Computing: Hardware/Software Support, L.T. Yang and Y. Pan (Eds.). Kluwer Academic Publishers, 127-139.
16. Application (QTM) summary
- The time-dependent Schrödinger equation (TDSE):
  - $i\hbar\,\partial\Psi/\partial t = H\Psi$, with $H = -(\hbar^2/2m)\nabla^2 + V$
  - Describes the quantum-mechanical dynamics of a particle of mass $m$ moving in a potential $V$
  - $\Psi(r,t)$ is the complex wave function
- The quantum trajectory method (QTM):
  - Polar form $\Psi(r,t) = R(r,t)\exp(iS(r,t)/\hbar)$, with real-valued amplitude $R(r,t)$ and phase $S(r,t)$ functions
  - Substitute $\Psi(r,t)$ into the TDSE and separate real and imaginary parts:
    - $-\partial\rho(r,t)/\partial t = \nabla\cdot\big(\rho(r,t)\,(1/m)\nabla S(r,t)\big)$
    - $-\partial S(r,t)/\partial t = (1/2m)\,(\nabla S(r,t))^2 + V(r,t) + Q(\rho; r,t)$
  - Probability density: $\rho(r,t) = R^2(r,t)$
  - Velocity: $v(r,t) = (1/m)\nabla S(r,t)$
  - Flux: $j(r,t) = \rho(r,t)\,v(r,t)$
  - Quantum potential: $Q(\rho; r,t) = -(1/2m)\big(\nabla^2\log\rho^{1/2} + (\nabla\log\rho^{1/2})^2\big)$
17. QTM algorithm

    Initialize wave packet x(1:N), v(1:N), ρ(1:N)
    do t = 1, nsteps
      do i = 1, N
        call MWLS (i, x(1:N), ρ(1:N), p, b, ...)   ! compute Q(i)
      end do
      do i = 1, N
        call MWLS (i, x(1:N), Q(1:N), p, b, ...)   ! compute fq(i)
      end do
      do i = 1, N
        call MWLS (i, x(1:N), v(1:N), p, b, ...)   ! compute dv(i)
      end do
      do i = 1, N
        compute V(i), fc(i)
      end do
      do i = 1, N
        update ρ(i), x(i), v(i)
      end do
    end do
    Output wave packet
18. Free particle: evolution of density (figure)
19. Harmonic oscillator: evolution of density (figure)
20. Embedding a RL system in time-stepping applications with loop scheduling (recap of slide 14: serial, parallel, and RL-instrumented forms as shown above)
21. QTM application with embedded RL agents (figure)
22. Experimental Setups and Results
23. Computational platform
- HPC² at MSU hosts:
  - The 13th most advanced HPC computational resource in the world
  - The EMPIRE cluster:
    - 1038 Pentium III processors (1.0 or 1.266 GHz)
    - RedHat Linux, PBS
    - Ranked 127th in the Top 500 list of June 2002
- QTM in Fortran 90, MPICH
- RL agent in C
24. Experimental setup 1
- Simulations:
  - Free particle and harmonic oscillator
  - 501, 1001, 1501 pseudo-particles
  - 10,000 time steps
  - Number of processors: 2, 4, 8, 12, 16, 20, 24
- Dynamic loop scheduling methods:
  - Equal size chunks (STATIC, SELF, FSC)
  - Decreasing size chunks (GSS, FAC)
  - Adaptive size chunks (AWF, AF)
  - Experimental methods (MODF, EXPT)
- RL agent (techniques: SARSA, QLEARN)
25. Experimental setup 1 (cont.)
- Hypothesis:
  - The simulation performs better using dynamic scheduling methods with RL than using dynamic scheduling methods without RL
- Design:
  - Two-factor factorial experiment (factors: methods, number of processors)
  - Five (5) replicates
  - Response: average parallel execution time T_P
  - Comparison via t statistic at the 0.05 significance level, using Least Squares Means
26. Mean T_P of free particle simulation, 10^4 time steps, 501 pseudo-particles (chart)
Means with the same annotation are not different at the 0.05 significance level via t statistics using LSMEANS.
27. Experimental setup 2
- Simulations:
  - Free particle and harmonic oscillator
  - 1001 pseudo-particles
  - 500 time steps
  - Number of processors: 4, 8, 12
- Dynamic loop scheduling methods:
  - Equal size chunks (STATIC, SELF, FSC)
  - Decreasing size chunks (GSS, FAC)
  - Adaptive size chunks (AWF, AF)
  - Experimental methods (MODF, EXPT)
- RL agent (techniques: SARSA, QLEARN)
28. Experimental setup 2 (cont.)
- Hypotheses:
  - The simulation performance is not sensitive to the learning parameters or to the type of learning technique used in the RL agent
  - Each learning technique selects the dynamic loop scheduling methods in a unique pattern
- Design:
  - Two-factor factorial experiment (factors: methods, number of processors)
  - Five (5) replicates
  - Response: average parallel execution time T_P
  - Comparison via t statistic at the 0.05 significance level, using Least Squares Means
29. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 4 procs.)
30. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 8 procs.)
31. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 12 procs.)
32. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 4 procs.)
33. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 8 procs.)
34. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 12 procs.)
35. Dynamic loop scheduling method selection patterns (selection counts for QLEARN and SARSA)
36. Execution time T_P (sec) statistics with the RL techniques (4, 8, and 12 procs.)
Legend: RL 0 is QLEARN; RL 1 is SARSA.
37. Conclusions
- The performance of time-stepping applications with parallel loops benefits from the proper use of dynamic loop scheduling methods selected by RL techniques
- Dynamic loop scheduling methods using the RL agent consistently outperform dynamic loop scheduling methods without the RL agent in wave packet simulations
- The performance of the simulation is not sensitive to the learning parameters of the RL techniques used
38. Conclusions (cont.)
- The number and the pattern of dynamic loop scheduling method selections vary from one RL technique to another
- The execution time surface charts show relatively smoother surfaces for the cases using SARSA in the RL agent, indicating that this RL technique is more robust
- Future work:
  - Use of other, more novel RL techniques in the RL agent
  - Extending this approach to performance optimization of other time-stepping applications
39. Appendix
40. Fixed size chunking (FSC)
- Kruskal and Weiss (1985)
- Iterate times are i.i.d. random variables with mean $\mu$ and standard deviation $\sigma$
- Constant scheduling overhead $h$
- Homogeneous processors that start simultaneously
- Expected finishing time: $E(T) = \mu(N/P) + hN/(kP) + \sigma\sqrt{2k\log P}$
- Optimal chunk size: $k_{opt} = \big(\sqrt{2}\,Nh / (\sigma P\sqrt{\log P})\big)^{2/3}$ (see the C sketch below)
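A direct transcription of the k_opt formula into C; the function and parameter names are illustrative, not from the talk. Note that the mean µ does not appear in k_opt:

    #include <math.h>

    /* Optimal FSC chunk size per the Kruskal-Weiss formula above.
       N = iterations, P = processors, h = scheduling overhead,
       sigma = standard deviation of iterate times. */
    double fsc_chunk(double N, double P, double h, double sigma)
    {
        return pow((sqrt(2.0) * N * h) / (sigma * P * sqrt(log(P))),
                   2.0 / 3.0);
    }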
41. Guided self-scheduling (GSS)
- Polychronopoulos and Kuck (1987)
- Equal iteration times
- Homogeneous processors (need not start simultaneously)
- chunk = remaining/P
- Decreasing chunk sizes (sketched below)
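A small C sketch of one GSS scheduling request; the integer ceiling and the minimum chunk of 1 are implementation choices, not from the talk:

    /* GSS: each request receives 1/P of the iterations that remain,
       which yields the decreasing chunk sizes noted above. */
    int gss_chunk(int *remaining, int P)
    {
        int chunk = (*remaining + P - 1) / P;   /* ceil(remaining / P) */
        if (chunk < 1) chunk = 1;
        if (chunk > *remaining) chunk = *remaining;
        *remaining -= chunk;
        return chunk;
    }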
42. Factoring (FAC)
- Flynn Hummel (1990)
- batch = remaining/x_b; chunk = batch/P
- x_b "is determined by estimating the maximum portion of the remaining iterations that have a high probability of finishing before the optimal time (N/P)µ (ignoring the scheduling overhead)"
- x_b = 2 works well (FAC2; sketched below)
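A C sketch of FAC2 chunk computation, under the simplifying assumption of a single central counter of remaining iterates; WF (next slide) would scale the returned chunk by the processor weight w_r:

    /* FAC2: each new batch takes half of the remaining iterations
       (x_b = 2) and is split into P equal chunks, one per request. */
    static int batch_left = 0, batch_chunk = 0;

    int fac2_chunk(int *remaining, int P)
    {
        if (batch_left <= 0) {                    /* start a new batch */
            int batch = (*remaining + 1) / 2;     /* remaining / x_b   */
            batch_chunk = (batch + P - 1) / P;    /* batch / P         */
            if (batch_chunk < 1) batch_chunk = 1;
            batch_left = batch;
        }
        int chunk = batch_chunk;
        if (chunk > *remaining) chunk = *remaining;
        batch_left -= chunk;
        *remaining -= chunk;
        /* WF (next slide): scale this chunk by the weight w_r. */
        return chunk;
    }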
43. Weighted factoring (WF)
- Hummel, Schmidt, Uma, and Wein (1996)
- Processors may be heterogeneous
- w_r = relative speed of processor r
- chunk_r = (FAC2 chunk) × w_r
- Sample application: radar signal processing
44. Adaptive weighted factoring (AWF)
- Banicescu, Soni, Ghafoor, and Velusamy (2000)
- For time-stepping applications; weights are adapted from measured performance:
  - $p_r = \sum(\text{chunk sizes}) / \sum(\text{chunk times})$ (measured speed of processor $r$)
  - $p_{ave} = (\sum_i p_i)/P$
  - $\pi_r = p_r / p_{ave}$
  - $\pi_{tot} = \sum_i \pi_i$
  - $w_r = \pi_r P / \pi_{tot}$ (sketched below)
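A C sketch of the AWF weight computation following the formulas above. The per-processor accumulation of chunk sizes and times is assumed to happen elsewhere, and the 64-processor cap is an illustrative assumption:

    /* AWF weights: measured speeds p_r feed normalized weights w_r
       (summing to P) that scale the FAC2 chunks as in WF. The AWF-B/
       AWF-C variants (slide 46) reuse these weights with different
       chunk bases. */
    void awf_weights(int P, const double *sum_sizes, const double *sum_times,
                     double *w)
    {
        double p[64], p_ave = 0.0, pi_tot = 0.0;
        for (int r = 0; r < P; r++) {
            p[r] = sum_sizes[r] / sum_times[r];  /* speed p_r */
            p_ave += p[r] / P;
        }
        for (int r = 0; r < P; r++)
            pi_tot += p[r] / p_ave;              /* pi_tot = sum of pi_r */
        for (int r = 0; r < P; r++)
            w[r] = (p[r] / p_ave) * P / pi_tot;  /* w_r = pi_r*P/pi_tot */
    }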
45. Adaptive factoring (AF)
- Banicescu and Liu (2000)
- A generalized factoring method based on "a probabilistic and statistical model that computes the chunksize such that all processors' expected finishing time is less than the optimal time of remaining iterates without further factoring"
- $\mu_r, \sigma_r$ are estimated during runtime
- $chunk_r = \big(D + 2TR - \sqrt{D^2 + 4DTR}\,\big)/(2\mu_r)$, where $R$ = remaining iterates, $D = \sum_i \sigma_i^2/\mu_i$, $T = \big(\sum_i 1/\mu_i\big)^{-1}$ (transcribed below)
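The chunk formula transcribed into C; mu and sigma hold the runtime estimates per processor, and D and T are recomputed on each call for clarity (a real implementation would cache them):

    #include <math.h>

    /* AF chunk size for processor r, per the formula above.
       mu[i], sigma[i]: runtime estimates of mean and standard deviation
       of iterate times on processor i; R: remaining iterates. */
    double af_chunk(int P, const double *mu, const double *sigma,
                    double R, int r)
    {
        double D = 0.0, Tinv = 0.0;
        for (int i = 0; i < P; i++) {
            D    += sigma[i] * sigma[i] / mu[i];  /* sum sigma_i^2/mu_i */
            Tinv += 1.0 / mu[i];
        }
        double T = 1.0 / Tinv;                    /* (sum 1/mu_i)^-1 */
        return (D + 2.0*T*R - sqrt(D*D + 4.0*D*T*R)) / (2.0 * mu[r]);
    }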
46. AWF variants
- Adapt w_r after a chunk
- AWF-B:
  - Uses batches as in FAC2
  - chunk_r = w_r × batch/P
- AWF-C:
  - chunk_r = w_r × remaining/(2P)
- Small chunks are used to collect initial timings
47. Parallel overheads
- Loop scheduling:
  - FAC, AWF: O(P log N)
  - AF: slightly higher than FAC
- Data movement:
  - MPI_Bcast(): worst case O(PN)