Title: Performance Control on the Grid
1Performance Control on the Grid
- Mikel Luján
- APART Workshop, March 2003
2Performance Analysis and Distributed Computing
. . .
- We want specific computational performance and resource levels
- We want these in an environment that is
- Distributed (physically)
- Heterogeneous (inherently)
- Dynamic (by necessity)
- We want them with a minimum of effort
3. . . Implies Performance and Resource Control
- There is no viable alternative apart from automating the process of achieving them:
- monitoring and analysis (sensing)
- alteration (actuation)
- This is a classical control system scenario, and we need to use established control system techniques to deal with it
4Control Systems Overview
- Control systems involve appropriate change of
actuated variables informed by (timely) feedback
of monitoring information
5Conclusions
- Traditional performance control amounts to open loop control, mediated by human performance engineers
- Distributed environments introduce enough extra complexity that closed loop control becomes essential
- Successful closed loop control demands accurate and rapid feedback data, thus affecting achievable control limits
6Outline
- Refresh John Gurd's talk from last summer
- the case for Performance Control
- FALSE (Feedback Guided Dynamic Loop Scheduling)
- An example of Performance Control on Parallel Machines
- Update of the RealityGrid project
- An example of Performance Control on the Grid
7- Part A
- FALSE - Feedback Guided Dynamic Loop Scheduling
- Len Freeman, Rupert Ford, David Hancock and Mark Bull
- Outline
- Problem definition
- Background work
- FALSE
8Problem
- Given a set of N independent tasks (a PARALLEL DO loop), how should we assign tasks to processors to minimise execution time?
- DO PARALLEL I = 1, N
-   CALL LOOP_BODY(I)
- END DO
- Minimise load imbalance overheads.
- But not at the expense of other overheads (remote access, scheduling, etc.).
9Dynamic Scheduling
- Assume that the durations of the tasks are not known (and cannot be predicted).
- Treat the set of tasks as a stack.
- Each processor takes a task from the stack.
- When a processor has completed a task, it takes the next task from the stack.
- This is known as simple self-scheduling (or dynamic scheduling); a sketch follows below.
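Below is a minimal Python sketch of simple self-scheduling, assuming a shared LIFO queue standing in for the task stack and a placeholder loop_body; the names and the threading setup are illustrative, not taken from the slides.

# Minimal sketch of simple self-scheduling: a shared stack of loop iterations,
# with each worker repeatedly taking the next task until the stack is empty.
# loop_body and the random task durations are placeholders.
import queue
import random
import threading
import time

def loop_body(i):
    # Stand-in for CALL LOOP_BODY(I): an unpredictable amount of work.
    time.sleep(random.uniform(0.001, 0.01))

def self_schedule(n_tasks, n_workers):
    tasks = queue.LifoQueue()
    for i in range(n_tasks):
        tasks.put(i)

    def worker():
        while True:
            try:
                i = tasks.get_nowait()   # sequentialised access to the stack
            except queue.Empty:
                return
            loop_body(i)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == "__main__":
    self_schedule(n_tasks=1000, n_workers=4)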
10Dynamic Scheduling (cont.)
- Problem
- The sequentialised access to the stack can be a significant bottleneck -> group the tasks together in chunks (chunk self-scheduling, Kruskal and Weiss, 1985).
11Guided Self-Scheduling
- The chunksize is changed dynamically as the number of tasks on the stack decreases. The chunksize is set to R/p, where R is the number of tasks remaining and p is the number of processors (see the sketch below).
- A number of variants (factoring, tapering, trapezoid self-scheduling) have been proposed that use different formulae for the chunksizes.
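A minimal sketch of how the guided self-scheduling chunk sizes evolve, assuming the R/p formula above with rounding up; gss_chunk_sizes is an illustrative name, not part of the original scheme's implementation.

# Minimal sketch of guided self-scheduling chunk sizes: each time a processor
# returns to the stack it takes R/p of the R remaining iterations (rounded up),
# so the chunks shrink as the stack empties.
import math

def gss_chunk_sizes(n_tasks, n_procs):
    sizes = []
    remaining = n_tasks
    while remaining > 0:
        chunk = max(1, math.ceil(remaining / n_procs))
        sizes.append(chunk)
        remaining -= chunk
    return sizes

print(gss_chunk_sizes(100, 4))   # large chunks first, then smaller and smaller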
12Guided Self-Scheduling (cont.)
- Problem
- Contention for the stack as the chunksize decreases.
- If the set of tasks is repeated, then it is likely that a task will be executed on a different processor from the one on which it was previously executed: a lack of temporal locality.
13Affinity Scheduling
- Each processor has its own stack of tasks.
- When its stack is empty, a processor steals tasks from the most heavily loaded processor.
- Access to the task stacks must be protected by locks.
- And other scheduling algorithms
14Feedback-Guided Dynamic Loop Scheduling
- DO SEQUENTIAL J = 1, NSTEPS
-   DO PARALLEL I = 1, NPOINTS
-     CALL LOOP_BODY(I)
-   END DO
- END DO
- Basic assumption: the workload of the inner (parallel) loop changes only slowly as the outer (sequential) loop is executed.
15Feedback-Guided Dynamic Loop Scheduling (cont.)
- Assume the following lower and upper loop iteration bounds on outer iteration step t: l_j^t, h_j^t, j = 1, 2, ..., p.
- Further assume the corresponding measured execution times T_j^t, j = 1, 2, ..., p.
- A piecewise constant approximation to the workload is given by the expression reconstructed below.
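The formula itself appeared only as an image on the original slide. A plausible reconstruction, following the usual FGDLS formulation (an assumption, not text from this transcript), spreads each processor's measured time evenly over the iterations it executed:

% Piecewise constant workload approximation on outer step t (reconstruction)
\[
  w^{t}(i) \;=\; \frac{T_j^{t}}{\,h_j^{t} - l_j^{t} + 1\,},
  \qquad i \in [\,l_j^{t},\, h_j^{t}\,], \quad j = 1, 2, \ldots, p .
\]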
16Graphical Representation
17FGDLS Algorithm
- Define new iteration bound limits for iteration t+1, l_j^{t+1}, h_j^{t+1}, j = 1, 2, ..., p, so that this piecewise constant function is approximately equi-partitioned amongst the p processors (a sketch follows below).
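A minimal Python sketch of this repartitioning step, assuming the piecewise constant workload model above; the function name, the 0-based inclusive bounds and the simple greedy cut are illustrative choices, not the authors' implementation.

# Sketch of the FGDLS repartitioning step (illustrative only). Given last
# step's per-processor bounds and times, build the piecewise-constant
# per-iteration workload and cut it into p pieces of roughly equal total time.
def fgdls_repartition(lower, upper, times, n_points):
    p = len(times)
    # Per-iteration workload: spread each processor's time over its iterations.
    w = [0.0] * n_points
    for l, h, t in zip(lower, upper, times):
        per_iter = t / (h - l + 1)
        for i in range(l, h + 1):
            w[i] = per_iter
    target = sum(w) / p                 # ideal load per processor
    new_lower, new_upper = [], []
    start, acc, proc = 0, 0.0, 0
    for i in range(n_points):
        acc += w[i]
        # Close a chunk once it reaches the target; the last processor
        # absorbs any remainder.
        if acc >= target and proc < p - 1:
            new_lower.append(start)
            new_upper.append(i)
            start, acc, proc = i + 1, 0.0, proc + 1
    new_lower.append(start)
    new_upper.append(n_points - 1)
    return new_lower, new_upper

# Example: 4 processors, 100 iterations, processor 0 was much slower,
# so it should receive fewer iterations at step t+1.
print(fgdls_repartition([0, 25, 50, 75], [24, 49, 74, 99],
                        [8.0, 2.0, 2.0, 2.0], 100))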
18Graphical Representation of new partition at step t+1
19Properties of FGDLS Algorithm
- Cheap to compute new partition.
- If the workload is static, a near-optimal solution is found very quickly.
- If the workload is dynamic, but slowly changing, then FGDLS defines a good partition.
20Performance Experiments (briefly)
- Synthetic workload: a Gaussian distribution whose midpoint oscillates sinusoidally.
- The period of oscillation is controllable.
- Results obtained on a 16-processor SGI Origin 2000.
23Nested Loops
- DO SEQUENTIAL J = 1, NSTEPS
-   DO PARALLEL K1 = 1, NPOINTS1
-     ...
-     DO PARALLEL KM = 1, NPOINTSM
-       CALL LOOP_BODY(K1, ..., KM)
-     END DO
-     ...
-   END DO
- END DO
- And extended to the Grid
- (see Heyman, Senar, Luque and Livny, Euro-Par 2001)
24The Problem in Terms of Performance Control
- Feedback function
- execution time per processor: real t(1:p)
- error function: max_{i,j} (t_i - t_j)
- Actuator: the FGDLS algorithm (sketched as a closed loop below)
- Output of the Actuator
- int lowerBound(1:p)
- int upperBound(1:p)
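A sketch of how these pieces fit the classical control-loop shape, in Python; run_step (the timed parallel loop) and repartition (e.g. the FGDLS sketch above) are placeholders, and the even initial partition is an assumption.

# Illustrative closed-loop structure for FGDLS as a performance controller
# (a sketch, not the authors' code). Sensing: per-processor times for one
# outer step. Error: worst pairwise imbalance, max_{i,j}(t_i - t_j).
# Actuation: a repartitioning step producing new bounds for the next step.
def imbalance(times):
    return max(times) - min(times)

def control_loop(n_steps, n_points, n_procs, run_step, repartition):
    block = n_points // n_procs                 # start from an even partition
    lower = [j * block for j in range(n_procs)]
    upper = [l + block - 1 for l in lower]
    upper[-1] = n_points - 1
    for step in range(n_steps):
        times = run_step(lower, upper)          # sensing: execute and time
        print(f"step {step}: imbalance = {imbalance(times):.3f}")
        lower, upper = repartition(lower, upper, times, n_points)  # actuation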
25- Part B
- Update of the RealityGrid project
- Ken Mayes, Mikel Luján, Graham Riley, Rupert
Ford, John Gurd and Len Freeman
26RealityGrid - Aims
- A UK e-Science testbed project.
- Predict realistic behaviour of matter
- Large-scale simulation, computational steering and high-performance visualisation.
- Techniques: Lattice Boltzmann (LB), Molecular Dynamics (MD), Monte Carlo (MC).
- Discovery of new materials through the integration of prediction and experiments (LUSI facility).
27Application Codes
- LB3D (LB): Lattice Boltzmann simulation of oil, water and surfactant (detergent) in porous media.
- Plus a reduced version, without surfactants.
- LAMMPS (MD): atomistic/molecular simulation.
- Oxford MC code TINKER
28Academic partners
- Queen Mary, University of London
- Imperial College
- University of Manchester
- University of Edinburgh
- University of Oxford
- Loughborough University
29Industrial Partners
- Schlumberger
- Edward Jenner Institute for Vaccine Research
- Silicon Graphics Inc
- Computation for Science Consortium
- Advanced Visual Systems
- Fujitsu
30Overview
- Set the context
- LB3D Use Case
- Design of our Performance Control System
- Brief look at prototype implementation
- What are we doing now?
- Summary
31Context
- Grid applications will be distributed and, in some sense, component-based.
- To deliver the "power grid" model, adaptation is key!
- Elevate users above Grid resource and performance details.
- Our work is considering adaptation and its impact on performance:
- Adaptation in coupled-modelling deployment,
- Flexibility in deployment of compositions of coupled models.
- Initial model configuration and deployment on resources.
- Adaptation due to malleable components.
- At run time, re-allocation of resources in response to changes.
32LB3D Use Case
(Diagram: tracking a parameter search. Two LB3D instances, one with Params 1 and one with Params 2, each run to T=100 with their own output rate and resolution; the user changes these dynamically, and the display keeps the simulation times equal.)
33LB3D Use Case
(Diagram: tracking a parameter study. Two LB3D instances, with Params 1 and Params 2, run with their own output rates and resolutions; the display rates are kept equal for the user.)
34Malleable LB3D - mechanisms
- LB3D will respond to requests to change resources
- Use more (or fewer) PEs (and memory) on the current system
- Move from one system to another
- Via machine-independent (parallel) checkpoint/restart
- LB3D will output science data (for remote visualisation) at higher or lower rates
- LB3D will (one day) respond to requests to continue running at higher (or lower) lattice resolution
- Each of the above affects performance (e.g. the timesteps-per-second rate)
- Each has an associated cost
35Use Case - detail
- The user might be tracking many parameter set developments (one per LB3D instance)
- Some will be uninteresting (for a while)
- Lower output rate / resolution / terminate
- Some will become interesting
- Increase output rate / resolution
- One aim: re-distribute resources amongst all LB3D instances to maintain the highest possible timestep rate
36A General Grid Application
(Diagram: a data-generation stage feeding Components 1, 2 and 3, deployed on computational Grid resources.)
Applications and components exhibit phased behaviour.
37Life is Hierarchical
- Can we use hierarchy to divide and conquer
complex system problems?
38Performance Steerers - Design
(Diagram: performance steerers handle both initial deployment and run-time adaptation, sitting between the component framework and the computational Grid resources.)
39Full System
(Diagram: the full system, including the following parts.)
- External Resource Scheduler
- Application Performance Repository
- Component Performance Repository
- Loader
- Resource Usage Monitor
40Performance Prediction
- Role of the APS (Application Performance Steerer)
- To distribute available resources amongst components such that the predicted performance of the components gives a minimum predicted execution time (a sketch follows after this list).
- Role of the CPS (Component Performance Steerer)
- To utilise resources and component performance effectors (actuators) so that the predicted execution time of the steered component is minimised.
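A hypothetical sketch of the APS role under a toy model: given a predicted execution time for each component as a function of processor count, hand out a fixed pool of processors greedily to whichever component is predicted to be slowest. The function names, the greedy policy and the perfectly parallel cost model are assumptions, not the project's algorithm.

# Hypothetical sketch of the APS resource-distribution role: allocate
# processors so that the predicted overall time (the slowest component)
# is minimised, using a simple greedy policy.
def distribute(predicted_time, n_components, total_procs):
    """predicted_time(component, procs) -> predicted execution time."""
    alloc = [1] * n_components                # every component needs >= 1 proc
    spare = total_procs - n_components
    for _ in range(spare):
        # Give the next processor to the currently slowest component.
        slowest = max(range(n_components),
                      key=lambda c: predicted_time(c, alloc[c]))
        alloc[slowest] += 1
    return alloc

# Toy prediction model: component c has work[c] units, perfectly parallel.
work = [100.0, 40.0, 60.0]
model = lambda c, p: work[c] / p
print(distribute(model, n_components=3, total_procs=16))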
41Life is Repetitive
- Many programs iterate more-or-less the same thing over and over again
- We can take advantage of this
- e.g. for load balance in weather prediction
- and, possibly, for performance control
42Application Progress Points
- Assume that the execution of the application proceeds through phases. Phase boundaries are marked by Progress Points.
- NB: decisions about performance can be taken, and actions applied, at the progress points
- These must be safe points (a sketch follows below)
43Component Progress Points
(Diagram: over time, the component reaches component progress points handled by the CPS, which map onto application progress points handled by the APS.)
- Information about progress points will be contained in some repository.
44Implementation
- APS as daemon, CPS as library
(Diagram: the APS and CPS each expose a comms interface and communicate over sockets (RPC); the CPS drives the component, including its progress points, through a component interface using procedure calls.)
45Implementation
(Diagram: the APS runs on Machine 1 and talks over sockets to Component Loaders on Machines 2 and 3. On each of those machines the Loader starts LB3D via MPIRUN and shares a shmem area with the CPS; deployment and file movement use DUROC, RSL, Globus and GridFTP.)
46Start-up Process
- GlobusRun with an RSL script starts the Component Loaders (one per machine in the Grid) plus the APS daemon.
- Loaders connect to each other.
- LB3D is started by a Loader (via MPIRUN) and calls the CPS (a library) at start-up.
- The CPS connects to the APS.
- LB3D calls the CPS at each subsequent progress point, and the CPS communicates with the APS.
- Continue until LB3D has completed (e.g. the required number of timesteps is done).
47What are we doing now?
- Finishing the prototype implementation (basic mechanisms)
- An example (a stubbed outline follows after this list)
- Every N timesteps, move LB3D between machines 2 and 3, as determined by the APS.
- At the "tstep mod N" progress point in LB3D, the APS tells the CPS, which tells the component, to checkpoint; the CPS writes certain status information to the shmem area and then LB3D (and the CPS) dies.
- The Loader on the machine it ran on communicates with the Loader on the machine it is to run on. The restart file is GridFTP'd across, along with restart information (e.g. the timestep) written to the shmem area of the new Loader.
- The new LB3D is started and the CPS manages the restart.
- Continue until there are no more timesteps.
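A stubbed, hypothetical outline of that checkpoint-and-migrate sequence; every class and method is a stand-in (MPIRUN start-up, the shmem status area and the GridFTP transfer are only simulated), and only the ordering of the steps follows the description above.

# Stubbed outline of the migration example (illustrative, not the prototype).
class Component:
    def __init__(self, name):
        self.name = name
    def checkpoint(self):
        print(f"{self.name}: writing machine-independent checkpoint")
        return f"{self.name}.chk"
    def shutdown(self):
        print(f"{self.name}: exiting after checkpoint")
    def restart(self, restart_file, timestep):
        print(f"{self.name}: restarting from {restart_file} at t={timestep}")

class Loader:
    def __init__(self, machine):
        self.machine = machine
        self.shmem = {}                      # status area shared with the CPS
    def write_shmem(self, info):
        self.shmem.update(info)
    def send(self, other, filename):
        print(f"transfer {filename}: {self.machine} -> {other.machine}")
    def start_component(self):
        return Component(f"LB3D@{self.machine}")   # stands in for MPIRUN start-up

def migrate(component, src, dst, timestep):
    restart_file = component.checkpoint()
    src.write_shmem({"tstep": timestep})
    component.shutdown()
    src.send(dst, restart_file)                    # restart file "GridFTP'd" across
    dst.write_shmem({"tstep": timestep})           # restart info for the new Loader
    new_component = dst.start_component()
    new_component.restart(restart_file, timestep)  # CPS manages the restart
    return new_component

m2, m3 = Loader("machine2"), Loader("machine3")
lb3d = m2.start_component()
lb3d = migrate(lb3d, m2, m3, timestep=100)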
48What are we doing now?
- Starting to collect performance results
- np = 4, data 64x64x64
- checkpoint file (XDR) size: 32.8 MB
- average resident tstep time for cronus: 3.384 s
- average migration tstep time to cronus: 43.718 s
- average resident tstep time for centaur3: 6.675 s
- average migration tstep time to centaur3: 55.280 s
- np = 4, data 8x8x8
- checkpoint file (XDR) size: 64 KB
- average resident tstep time for cronus: 0.017 s
- average migration tstep time to cronus: 0.504 s
- average resident tstep time for centaur3: 0.061 s
- average migration tstep time to centaur3: 3.038 s
- cronus: SGI Origin 3400; centaur3: Sun TCF
49What are we doing now?
- Developing an implementation of the performance repository
- Berkeley Database (linked as a library)
- Survey of prediction and machine learning algorithms
- runtime vs. off-line analysis
- prediction accuracy
- amount of history data required
- Learning control theory and understanding how to apply it to Performance Control.
50Summary
- Aim
- Develop an architecture that enables us to investigate different mechanisms for Performance Control of malleable component-based applications on the Grid
- Main characteristics
- dynamic behaviour and adaptation
- Design/implementation tensions
- general vs. specific purpose
- APS <-> CPS communication
- APS <-> CPS ratio
- performance prediction algorithms: accuracy vs. execution time
- APS/CPS overhead vs. application execution time
- Work in development (first year in one sentence)
- A Grid Framework for Malleable Component-based Application Migration
51Related Projects at Manchester
- Met Office FLUME: design of next-generation software
- Coupled models
- Tyndall Centre Climate Impact: integrated assessment modelling
- Coupling climate and economic models
- Computational Markets
- 1 RA and 1 PhD position available
- UK e-Science funded (Imperial College led)
- For more information check
- http://www.cs.man.ac.uk/cnc
- http://www.realitygrid.org