1. Automatic Selection of Loop Scheduling Algorithms Using Reinforcement Learning
- Mahbubur Rashid (1,2), Ioana Banicescu (1,2), Ricolindo L. Cariño (3)
- (1) Department of Computer Science and Engineering
- (2) Center for Computational Sciences, HPC²
- (3) Center for Advanced Vehicular Systems, HPC²
- Mississippi State University
- Partial support from NSF Grants 9984465, 0085969, 0081303, 0132618, and 0082979.
2. Outline
- Load balancing research at MSU
- Research motivation
  - The need to select the appropriate dynamic loop scheduling algorithm in a dynamic environment
- Background work
  - Dynamic loop scheduling
  - Reinforcement Learning techniques
- An integrated approach using dynamic loop scheduling and reinforcement learning for performance improvements
- Experimental setup and results
- Conclusions and future directions
3. Load Balancing Research at MSU
4. Scheduling and Load Balancing at MSU
- Objective: performance optimization for problems in computational science via dynamic scheduling and load balancing algorithm development
- Activities:
  - Derive novel loop scheduling techniques (based on probabilistic analyses)
    - Adaptive weighted factoring (2000, 01, 02, 03)
    - Adaptive factoring (2000)
  - Develop load balancing tools and libraries
    - For applications using Threads and MPI: DMCS/MOL, LB_Library (2004, 08)
    - Additional functionality of systems: Hector, Loci (2006)
  - Improve the performance of applications
    - N-body simulations, CFD simulations, quantum physics
    - Astrophysics, computational mathematics, statistics
5. Research Motivation
6. Motivation: the need to select the appropriate dynamic loop scheduling algorithm for time-stepping applications running in a dynamic environment
- Sequential form:

    Initializations
    do t = 1, nsteps
      ...
      do i = 1, N
        (loop body)
      end do
      ...
    end do
    Finalizations

- Parallel form:

    Initializations
    do t = 1, nsteps
      ...
      call LoopSchedule ( 1, N, loop_body_routine, myRank, foreman, method, ... )
      ...
    end do
    Finalizations
Property: the loop iterate execution times (1) are non-uniform, and (2) evolve with t.
Problem: how to select the scheduling method?
Proposed solution: machine learning!
7. Background Work
8. Dynamic loop scheduling algorithms
- Static chunking
- Dynamic, non-adaptive:
  - Fixed size chunking (1985)
  - Guided self-scheduling (1987)
  - Factoring (1992)
  - Weighted factoring (1996)
- Dynamic, adaptive:
  - Adaptive weighted factoring (2001-2003)
  - Adaptive factoring (2000, 2002)
- Significance of dynamic scheduling techniques:
  - Address all sources of load imbalance (algorithmic and systemic)
  - Based on probabilistic analyses
9. Machine Learning (ML)
- Supervised Learning (SL):
  - Teacher
  - Learner
  - Input-output pairs
  - Training (offline learning)
- Reinforcement Learning (RL):
  - Agent
  - Environment
  - Action, state, reward
  - Learning concurrent with problem solving
- Survey: http://www-2.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/rl-survey.html
10. Reinforcement Learning system
(Diagram.) Legend: I = set of inputs (i); R = set of rewards (r); B = policy; a = action; T = transition; s = state.
11. Reinforcement Learning (RL)
- Model-based approach:
  - Model M, utility function U_M derived from M
  - Examples: Dyna, Prioritized Sweeping, Queue-Dyna, Real-Time Dynamic Programming
- Model-free approach:
  - Action-value function Q
  - Example: Temporal Difference learning (combining Monte Carlo and Dynamic Programming ideas)
    - SARSA algorithm
    - QLEARN algorithm (see the update-rule sketch below)
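To make the two policies concrete, here is a minimal sketch of their tabular update rules in C (the language of the RL agent, per slide 23). The table sizes, learning rate alpha, and discount factor gamma are standard RL notation and illustrative values, not taken from the talk:

    #define NSTATES  8
    #define NACTIONS 8

    double Q[NSTATES][NACTIONS];   /* tabular action-value function */

    /* SARSA (on-policy): bootstraps on the action a2 actually taken
       in the next state s2. */
    void sarsa_update(int s, int a, double r, int s2, int a2,
                      double alpha, double gamma)
    {
        Q[s][a] += alpha * (r + gamma * Q[s2][a2] - Q[s][a]);
    }

    /* QLEARN (off-policy): bootstraps on the greedy value of s2,
       regardless of the action the policy takes next. */
    void qlearn_update(int s, int a, double r, int s2,
                       double alpha, double gamma)
    {
        double best = Q[s2][0];
        for (int i = 1; i < NACTIONS; i++)
            if (Q[s2][i] > best) best = Q[s2][i];
        Q[s][a] += alpha * (r + gamma * best - Q[s][a]);
    }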
12. RL system for automatic selection of loop scheduling methods
(Diagram.) Legend: I = set of inputs (set of methods, current time step, set of loop ids); R = set of rewards (loop execution time); B = policy (SARSA, QLEARN); a = action (use a particular scheduling method); s = state (application is using a method).
13. Research Approach
14. Embedding a RL system in time-stepping applications with loops
- Serial form:

    Initializations
    do t = 1, nsteps
      ...
      do i = 1, N
        (loop body)
      end do
      ...
    end do
    Finalizations

- Parallel form:

    Initializations
    do t = 1, nsteps
      ...
      call LoopSchedule ( 1, N, loop_body_rtn, myRank, foreman, method, ... )
      ...
    end do
    Finalizations
- With RL system:

    Initializations
    call RL_Init()
    do t = 1, nsteps
      time_start = time()
      call RL_Action (method)
      call LoopSchedule ( 1, N, loop_body_rtn, myRank, foreman, method, ... )
      reward = time() - time_start
      call RL_Reward (t, method, reward)
    end do
    Finalizations
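The slides name the entry points RL_Init, RL_Action, and RL_Reward but do not show their internals. Below is a minimal hypothetical sketch in C, assuming a single-state (bandit-style) value table over the scheduling methods, epsilon-greedy selection, and the negated loop time as reward; the actual agent maintains state and uses the SARSA/QLEARN updates of slide 11:

    #include <stdlib.h>

    #define NMETHODS 9                 /* e.g. STATIC..AF, MODF, EXPT */

    static double Qv[NMETHODS];        /* value estimate per method */
    static const double alpha = 0.5;   /* learning rate (assumed value) */
    static const double epsilon = 0.1; /* exploration rate (assumed value) */

    void RL_Init(void)
    {
        for (int m = 0; m < NMETHODS; m++) Qv[m] = 0.0;
    }

    void RL_Action(int *method)        /* pick a scheduling method */
    {
        if ((double)rand() / RAND_MAX < epsilon) {
            *method = rand() % NMETHODS;            /* explore */
        } else {
            *method = 0;                            /* exploit: argmax Qv */
            for (int m = 1; m < NMETHODS; m++)
                if (Qv[m] > Qv[*method]) *method = m;
        }
    }

    void RL_Reward(int t, int method, double loop_time)
    {
        (void)t;  /* the time step could index a state; unused here */
        /* shorter loop time is better, so its negation is the reward */
        Qv[method] += alpha * (-loop_time - Qv[method]);
    }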
15. Test application: simulation of wave packet dynamics using the Quantum Trajectory Method (QTM)
- Bohm, D. 1952. A Suggested Interpretation of the Quantum Theory in Terms of Hidden Variables. Phys. Rev. 85, No. 2, 166-193.
- Lopreore, C.L., R.E. Wyatt. 1999. Quantum Wavepacket Dynamics with Trajectories. Phys. Rev. Letters 82, No. 26, 5190-5193.
- Brook, R.G., P.E. Oppenheimer, C.A. Weatherford, I. Banicescu, J. Zhu. 2001. Solving the Hydrodynamic Formulation of Quantum Mechanics: A Parallel MLS Method. Int. J. of Quantum Chemistry 85, Nos. 4-5, 263-271.
- Cariño, R.L., I. Banicescu, R.K. Vadapalli, C.A. Weatherford, J. Zhu. 2004. Message-Passing Parallel Adaptive Quantum Trajectory Method. In High Performance Scientific and Engineering Computing: Hardware/Software Support, L.T. Yang and Y. Pan (Eds.). Kluwer Academic Publishers, 127-139.
16. Application (QTM) summary
- The time-dependent Schrödinger equation (TDSE):
  - $i\hbar\,\partial\Psi/\partial t = H\Psi$, with $H = -(\hbar^2/2m)\nabla^2 + V$
  - Describes the quantum-mechanical dynamics of a particle of mass $m$ moving in a potential $V$
  - $\Psi(r,t)$ is the complex wave function
- The quantum trajectory method (QTM):
  - Polar form $\Psi(r,t) = R(r,t)\exp(iS(r,t)/\hbar)$, with real-valued amplitude $R(r,t)$ and phase $S(r,t)$ functions
  - Substitute $\Psi(r,t)$ into the TDSE and separate real and imaginary parts:
    - $-\partial\rho(r,t)/\partial t = \nabla\cdot\big(\rho(r,t)\,(1/m)\nabla S(r,t)\big)$
    - $-\partial S(r,t)/\partial t = (1/2m)\,(\nabla S(r,t))^2 + V(r,t) + Q(\rho; r,t)$
  - Probability density: $\rho(r,t) = R^2(r,t)$
  - Velocity: $v(r,t) = (1/m)\nabla S(r,t)$
  - Flux: $j(r,t) = \rho(r,t)\,v(r,t)$
  - Quantum potential: $Q(\rho; r,t) = -(1/2m)\big(\nabla^2\log\rho^{1/2} + (\nabla\log\rho^{1/2})^2\big)$
17. QTM algorithm

    Initialize wave packet x(1:N), v(1:N), ρ(1:N)
    do t = 1, nsteps
      do i = 1, N
        call MWLS (i, x(1:N), ρ(1:N), p, b, ...)   ! compute Q(i)
      end do
      do i = 1, N
        call MWLS (i, x(1:N), Q(1:N), p, b, ...)   ! compute fq(i)
      end do
      do i = 1, N
        call MWLS (i, x(1:N), v(1:N), p, b, ...)   ! compute dv(i)
      end do
      do i = 1, N
        compute V(i), fc(i)
      end do
      do i = 1, N
        update ρ(i), x(i), v(i)
      end do
    end do
    Output wave packet
18. Free particle: evolution of density (figure)
19. Harmonic oscillator: evolution of density (figure)
20. Embedding a RL system in time-stepping applications with loop scheduling (recap of slide 14: serial, parallel, and RL-instrumented forms as shown above)
21. QTM application with embedded RL agents (figure)
22. Experimental Setups and Results
23. Computational platform
- HPC² at MSU hosts:
  - The 13th most advanced HPC computational resource in the world
  - The EMPIRE cluster:
    - 1038 Pentium III processors (1.0 or 1.266 GHz)
    - RedHat Linux, PBS
    - Ranked 127th in the Top 500 list of June 2002
- QTM in Fortran 90, MPICH
- RL agent in C
24. Experimental setup 1
- Simulations:
  - Free particle and harmonic oscillator
  - 501, 1001, 1501 pseudo-particles
  - 10,000 time steps
  - Number of processors: 2, 4, 8, 12, 16, 20, 24
- Dynamic loop scheduling methods:
  - Equal size chunks (STATIC, SELF, FSC)
  - Decreasing size chunks (GSS, FAC)
  - Adaptive size chunks (AWF, AF)
  - Experimental methods (MODF, EXPT)
- RL agent (techniques: SARSA, QLEARN)
25. Experimental setup 1 (cont.)
- Hypothesis:
  - The simulation performs better using dynamic scheduling methods with RL than using dynamic scheduling methods without RL
- Design:
  - Two-factor factorial experiment (factors: methods, number of processors)
  - Five (5) replicates
  - Response: average parallel execution time T_P
  - Comparison via t statistic at the 0.05 significance level, using Least Squares Means
26. Mean T_P of free particle simulation, 10^4 time steps, 501 pseudo-particles (chart)
Means with the same annotation are not different at the 0.05 significance level via t statistics using LSMEANS.
27. Experimental setup 2
- Simulations:
  - Free particle and harmonic oscillator
  - 1001 pseudo-particles
  - 500 time steps
  - Number of processors: 4, 8, 12
- Dynamic loop scheduling methods:
  - Equal size chunks (STATIC, SELF, FSC)
  - Decreasing size chunks (GSS, FAC)
  - Adaptive size chunks (AWF, AF)
  - Experimental methods (MODF, EXPT)
- RL agent (techniques: SARSA, QLEARN)
28. Experimental setup 2 (cont.)
- Hypotheses:
  - The simulation performance is not sensitive to the learning parameters or to the type of learning technique used in the RL agent
  - Each learning technique selects the dynamic loop scheduling methods in a unique pattern
- Design:
  - Two-factor factorial experiment (factors: methods, number of processors)
  - Five (5) replicates
  - Response: average parallel execution time T_P
  - Comparison via t statistic at the 0.05 significance level, using Least Squares Means
29. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 4 procs.)
30. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 8 procs.)
31. Execution time T_P (sec) for all combinations of learning parameters (QLEARN, SARSA, 12 procs.)
32. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 4 procs.)
33. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 8 procs.)
34. Execution time T_P (sec) surface charts for all combinations of learning parameters (QLEARN, SARSA, 12 procs.)
35. Dynamic loop scheduling method selection patterns (selection counts for QLEARN and SARSA)
36. Execution time T_P (sec) statistics with the RL techniques (4, 8, and 12 procs.)
Legend: RL 0 is QLEARN; RL 1 is SARSA.
37. Conclusions
- The performance of time-stepping applications with parallel loops benefits from the proper use of dynamic loop scheduling methods selected by RL techniques
- Dynamic loop scheduling methods using the RL agent consistently outperform dynamic loop scheduling methods without the RL agent in wave packet simulations
- The performance of the simulation is not sensitive to the learning parameters of the RL techniques used
38. Conclusions (cont.)
- The number and the pattern of dynamic loop scheduling method selections vary from one RL technique to another
- The execution time surface charts show relatively smoother surfaces for the cases using SARSA in the RL agent, indicating that this RL technique is more robust
- Future work:
  - Use of other, more novel RL techniques in the RL agent
  - Extending this approach to performance optimization of other time-stepping applications
39. Appendix
40. Fixed size chunking (FSC)
- Kruskal and Weiss (1985)
- Iterate times are i.i.d. random variables with mean $\mu$ and standard deviation $\sigma$
- Constant scheduling overhead $h$
- Homogeneous processors that start simultaneously
- Expected finishing time: $E(T) = \mu(N/P) + hN/(kP) + \sigma\sqrt{2k\log P}$
- Optimal chunk size: $k_{opt} = \big(\sqrt{2}\,Nh / (\sigma P\sqrt{\log P})\big)^{2/3}$ (see the C sketch below)
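A direct transcription of the k_opt formula into C; the function and parameter names are illustrative, not from the talk. Note that the mean µ does not appear in k_opt:

    #include <math.h>

    /* Optimal FSC chunk size per the Kruskal-Weiss formula above.
       N = iterations, P = processors, h = scheduling overhead,
       sigma = standard deviation of iterate times. */
    double fsc_chunk(double N, double P, double h, double sigma)
    {
        return pow((sqrt(2.0) * N * h) / (sigma * P * sqrt(log(P))),
                   2.0 / 3.0);
    }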
41. Guided self-scheduling (GSS)
- Polychronopoulos and Kuck (1987)
- Equal iteration times
- Homogeneous processors (need not start simultaneously)
- chunk = remaining/P
- Decreasing chunk sizes (sketched below)
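A small C sketch of one GSS scheduling request; the integer ceiling and the minimum chunk of 1 are implementation choices, not from the talk:

    /* GSS: each request receives 1/P of the iterations that remain,
       which yields the decreasing chunk sizes noted above. */
    int gss_chunk(int *remaining, int P)
    {
        int chunk = (*remaining + P - 1) / P;   /* ceil(remaining / P) */
        if (chunk < 1) chunk = 1;
        if (chunk > *remaining) chunk = *remaining;
        *remaining -= chunk;
        return chunk;
    }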
42. Factoring (FAC)
- Flynn Hummel (1990)
- batch = remaining/x_b; chunk = batch/P
- x_b "is determined by estimating the maximum portion of the remaining iterations that have a high probability of finishing before the optimal time (N/P)µ (ignoring the scheduling overhead)"
- x_b = 2 works well (FAC2; sketched below)
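A C sketch of FAC2 chunk computation, under the simplifying assumption of a single central counter of remaining iterates; WF (next slide) would scale the returned chunk by the processor weight w_r:

    /* FAC2: each new batch takes half of the remaining iterations
       (x_b = 2) and is split into P equal chunks, one per request. */
    static int batch_left = 0, batch_chunk = 0;

    int fac2_chunk(int *remaining, int P)
    {
        if (batch_left <= 0) {                    /* start a new batch */
            int batch = (*remaining + 1) / 2;     /* remaining / x_b   */
            batch_chunk = (batch + P - 1) / P;    /* batch / P         */
            if (batch_chunk < 1) batch_chunk = 1;
            batch_left = batch;
        }
        int chunk = batch_chunk;
        if (chunk > *remaining) chunk = *remaining;
        batch_left -= chunk;
        *remaining -= chunk;
        /* WF (next slide): scale this chunk by the weight w_r. */
        return chunk;
    }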
43. Weighted factoring (WF)
- Hummel, Schmidt, Uma, and Wein (1996)
- Processors may be heterogeneous
- w_r = relative speed of processor r
- chunk_r = (FAC2 chunk) × w_r
- Sample application: radar signal processing
44. Adaptive weighted factoring (AWF)
- Banicescu, Soni, Ghafoor, and Velusamy (2000)
- For time-stepping applications; weights are adapted from measured performance:
  - $p_r = \sum(\text{chunk sizes}) / \sum(\text{chunk times})$ (measured speed of processor $r$)
  - $p_{ave} = (\sum_i p_i)/P$
  - $\pi_r = p_r / p_{ave}$
  - $\pi_{tot} = \sum_i \pi_i$
  - $w_r = \pi_r P / \pi_{tot}$ (sketched below)
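A C sketch of the AWF weight computation following the formulas above. The per-processor accumulation of chunk sizes and times is assumed to happen elsewhere, and the 64-processor cap is an illustrative assumption:

    /* AWF weights: measured speeds p_r feed normalized weights w_r
       (summing to P) that scale the FAC2 chunks as in WF. The AWF-B/
       AWF-C variants (slide 46) reuse these weights with different
       chunk bases. */
    void awf_weights(int P, const double *sum_sizes, const double *sum_times,
                     double *w)
    {
        double p[64], p_ave = 0.0, pi_tot = 0.0;
        for (int r = 0; r < P; r++) {
            p[r] = sum_sizes[r] / sum_times[r];  /* speed p_r */
            p_ave += p[r] / P;
        }
        for (int r = 0; r < P; r++)
            pi_tot += p[r] / p_ave;              /* pi_tot = sum of pi_r */
        for (int r = 0; r < P; r++)
            w[r] = (p[r] / p_ave) * P / pi_tot;  /* w_r = pi_r*P/pi_tot */
    }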
45. Adaptive factoring (AF)
- Banicescu and Liu (2000)
- A generalized factoring method based on "a probabilistic and statistical model that computes the chunksize such that all processors' expected finishing time is less than the optimal time of remaining iterates without further factoring"
- $\mu_r, \sigma_r$ are estimated during runtime
- $chunk_r = \big(D + 2TR - \sqrt{D^2 + 4DTR}\,\big)/(2\mu_r)$, where $R$ = remaining iterates, $D = \sum_i \sigma_i^2/\mu_i$, $T = \big(\sum_i 1/\mu_i\big)^{-1}$ (transcribed below)
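The chunk formula transcribed into C; mu and sigma hold the runtime estimates per processor, and D and T are recomputed on each call for clarity (a real implementation would cache them):

    #include <math.h>

    /* AF chunk size for processor r, per the formula above.
       mu[i], sigma[i]: runtime estimates of mean and standard deviation
       of iterate times on processor i; R: remaining iterates. */
    double af_chunk(int P, const double *mu, const double *sigma,
                    double R, int r)
    {
        double D = 0.0, Tinv = 0.0;
        for (int i = 0; i < P; i++) {
            D    += sigma[i] * sigma[i] / mu[i];  /* sum sigma_i^2/mu_i */
            Tinv += 1.0 / mu[i];
        }
        double T = 1.0 / Tinv;                    /* (sum 1/mu_i)^-1 */
        return (D + 2.0*T*R - sqrt(D*D + 4.0*D*T*R)) / (2.0 * mu[r]);
    }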
46. AWF variants
- Adapt w_r after a chunk
- AWF-B:
  - Uses batches as in FAC2
  - chunk_r = w_r × batch/P
- AWF-C:
  - chunk_r = w_r × remaining/(2P)
- Small chunks are used to collect initial timings
47. Parallel overheads
- Loop scheduling:
  - FAC, AWF: O(P log N)
  - AF: slightly higher than FAC
- Data movement:
  - MPI_Bcast(): worst case O(PN)