Title: Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids
1Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids
Aline Nascimento, Alexandre Sena, Cristina Boeres and Vinod Rebello
Instituto de Computação, Universidade Federal Fluminense, Niterói (RJ), Brazil
http://easygrid.ic.uff.br
e-mail: {depaula, asena, boeres, vinod}_at_ic.uff.br
2Talk Outline
- Introduction
- Concepts and Related Work
- The EasyGrid Application Management System
- Hybrid Scheduling
- Experimental Analysis
- Conclusions and Future Work
3Introduction
- Grid computing has become increasingly widespread
around the world - Growth in popularity means a larger number of
applications will compete for limited resources - Efficient utilisation of the grid infrastructure
will be essential to achieve good performance - Grid infrastructure is typically
- Composed of diverse heterogeneous computational
resources interconnected by networks of varying
capacities - Shared, executing both grid and local user
applications - Dynamic, resources may enter and leave without
prior notice
Hard to develop efficient grid management systems
4Introduction
- Much research is being invested in the
development of specialised middleware responsible
for - Resource discovery and controlling access
- The efficient and successful execution of
applications on the available resources - Three implementation philosophies can be defined
- Resource management systems (RMS)
- User management systems (UMS)
- Application management systems (AMS)
5Concepts
- Resource Management System
- Adopts a system-centric viewpoint
- Centralised in order to manage a number of
applications simultaneously
- Aims to maximise system utilisation, considering just the resource requirements of the applications, not their internal characteristics
- Scheduling middleware installed on a single central server, monitoring middleware on grid nodes
- User Management System
- Adopts an application-centric viewpoint
- One per application, collectively decentralised
- Utilises the grid in accordance with resource availability and the application's topology (e.g. Bag of Tasks)
- Installed on every machine from which an application can be launched
6Concepts
- Application Management System
- Also adopts an application-centric viewpoint
- One per application
- Utilises the grid in accordance with both the available resources and the application's characteristics
- Is embedded into the application, offering better portability
- Works in conjunction with a simplified RMS
- Transforms applications into system-aware versions
- Each application can be made aware of its computational requirements and adjust itself according to the grid environment
7Concepts
- The EasyGrid AMS is
- Hierarchically distributed within an application
- Decentralised amongst various applications
- Application specific
- Designed for MPI applications
- Automatically embedded into the application
- EasyGrid simplifies the process of grid enabling
existing MPI applications - The grid resources just need to offer core
middleware services and a MPI communication
library
8Objectives
- This paper focuses specifically on the problem of
scheduling processes within the EasyGrid AMS - The AMS scheduling policies are application
specific - This paper highlights the scheduling features
through the execution of bag of tasks
applications - The main goals are
- To show the viability of the proposed scheduling
strategy in the context of an Application
Management System - To quantify the quality of the results obtainable
9The EasyGrid AMS Middleware
- The EasyGrid framework is an AMS middleware for
MPI implementations with dynamic process creation
10The EasyGrid AMS Architecture
- A three level hierarchical management system
[Diagram: the Computational Grid spans Sites 1-4; a single Global Manager (GM) oversees one Site Manager (SM) per site, and each SM oversees the Host Managers (HMs) on that site's resources]
11The EasyGrid Portal
- The EasyGrid Scheduling Portal is responsible for
- Choosing the appropriate scheduling and fault
tolerance policies for the application - Determining an initial process allocation
- Compiling the system-aware application
- Managing the user's grid proxy
- Creating the MPI environment (including transferring files)
- Providing fault tolerance for the GM process
Acts like a simplified resource management system
12The EasyGrid AMS
- GM creates one SM per site
- Each SM spawns an HM on each remaining resource at its respective site
- The application processes are created dynamically according to the local policy of the HM
- Processes are created with unique communicators (see the sketch below)
- Gives rise to the hierarchical AMS structure
- Communication can only take place between parent and child processes
- HMs, SMs and the GM route messages between application processes
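As an illustration only, the sketch below shows how dynamic process creation with MPI_Comm_spawn yields a separate intercommunicator per child, so communication naturally follows the parent-child hierarchy described above. The program name ./app_task, the task-id argument and the completion message are hypothetical and are not the actual EasyGrid protocol.

```c
/* Illustrative sketch (hypothetical names, not the EasyGrid source):
 * an HM-like manager spawns one application process dynamically. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Comm child;                            /* intercommunicator unique to this child */
    int errcode;
    char *task_argv[] = { "task-0042", NULL }; /* assumed task identifier */

    MPI_Init(&argc, &argv);

    /* Spawn a single application process. The returned intercommunicator
     * connects only this manager and the new process, so messages flow
     * strictly between parent and child, as in the AMS hierarchy. */
    MPI_Comm_spawn("./app_task", task_argv, 1, MPI_INFO_NULL,
                   0, MPI_COMM_SELF, &child, &errcode);

    /* Assumed completion report: the child sends back its wall clock time. */
    double wall_time;
    MPI_Recv(&wall_time, 1, MPI_DOUBLE, 0, 0, child, MPI_STATUS_IGNORE);
    printf("task finished: wall clock = %.2f s\n", wall_time);

    MPI_Comm_disconnect(&child);
    MPI_Finalize();
    return 0;
}
```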
13Process Scheduling
- Scheduling is made complex due to the dynamic and
heterogeneous characteristics of grids - Predicting the processing power and communication
bandwidth available to an application is
difficult - Static schedulers
- Estimates assumed a priori may be quite different
at runtime - More sophisticated heuristics can be employed at
compile time
- Dynamic schedulers
- Have access to accurate runtime system information
- Decisions need to be made quickly to minimise runtime intrusion
- Hybrid Schedulers
- Combine the advantages by integrating both static
and dynamic schedulers
14Static Scheduling
- The scheduling heuristics require models that
capture the relevant characteristics of both the - application (typically represented by a DAG) and
- the target system
- To define the parameters of the architectural model, the Portal executes an MPI modelling program with the user's credentials to determine the realistic performance currently available to this user's application
- At start-up, application processes are allocated to the resources in accordance with the static schedule
15Dynamic Scheduling
- Modifying the initial allocation at run time is essential, given the difficulty of
- Extracting precise information with regard to the application's characteristics
- Predicting the performance available from shared grid resources
- The current version only reschedules processes
which have yet to be created - The dynamic schedulers are associated with each
of the management processes distributed in the 3
levels of the AMS hierarchy
16AMS Hierarchical Schedulers
- The Global Dynamic Scheduler (GDS) is associated with the GM
- A Site Dynamic Scheduler (SDS) is associated with each SM
- A Host Dynamic Scheduler (HDS) is associated with each HM
17Dynamic Scheduling
- The appropriate scheduling policies depend on
- The class of the application and the user's objectives
- Different policies may be used in different layers of the hierarchy, and even within the same layer
- The dynamic schedulers collectively
- Estimate the remaining execution time on each
resource - Verify if the allocation needs to be adjusted
- If necessary, activate the rescheduling mechanism
- The activation of the rescheduling mechanism characterises a scheduling event
18Host Dynamic Scheduler
- HDS determines both the order and the instant
that an application process should be created on
the host
- Possible scheduling policies to determine the process sequence include
- The order specified by the static scheduler
- Data flow (select any ready task)
- A second policy is necessary to indicate when the selected process may execute
- The optimal number of application processes that should execute concurrently depends on their I/O characteristics
- Often influenced by local usage restrictions (a possible combination of the two policies is sketched below)
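A minimal sketch of how the two HDS policies might combine, assuming a ready queue already ordered by the static schedule and a fixed limit on concurrently executing processes. The names task_t, select_next, may_execute_now and the value of MAX_CONCURRENT are illustrative, not taken from EasyGrid.

```c
/* Illustrative HDS policy sketch (assumed data structures, not EasyGrid code). */
#define MAX_CONCURRENT 2        /* assumed local usage restriction */

typedef struct task { int id; struct task *next; } task_t;

/* Policy 1: pick the next process to create.  Here: the order produced by
 * the static scheduler (head of the ready queue); a data-flow policy would
 * instead pick any task whose input data is already available. */
static task_t *select_next(task_t **ready_queue)
{
    task_t *t = *ready_queue;
    if (t) *ready_queue = t->next;
    return t;
}

/* Policy 2: decide whether the selected process may execute now, by
 * bounding the number of application processes running concurrently. */
static int may_execute_now(int running)
{
    return running < MAX_CONCURRENT;
}
```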
19Host Dynamic Scheduler - HDS
- When an application process terminates on a resource, the monitor makes available to the HDS the process's
- wall clock time
- CPU execution time
together with the heterogeneity factor
- Both the computational power (cp) and the estimated remaining time (ert) are added to the monitoring message and sent to the Site Manager (a possible message layout is sketched below)
[Diagram: each HDS sends its monitoring message up to the SDS of its site]
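The sketch below suggests one plausible shape for that monitoring message. The struct monitor_msg_t, the helper functions, the tag MONITOR_TAG, and the formulas used for cp and ert are assumptions for illustration; the slides do not give the exact EasyGrid definitions.

```c
/* Assumed monitoring message sent by an HM/HDS to its Site Manager
 * when an application process terminates (illustrative only). */
#include <mpi.h>

#define MONITOR_TAG 77           /* assumed message tag */

typedef struct {
    double wall_clock;   /* elapsed time of the finished process          */
    double cpu_time;     /* CPU time actually consumed by the process     */
    double cp;           /* estimated computational power of this host    */
    double ert;          /* estimated remaining time of this host's queue */
} monitor_msg_t;

/* One possible derivation, assuming a reference task cost ref_cost and
 * n_remaining tasks still queued on this host. */
static monitor_msg_t build_report(double wall, double cpu,
                                  double ref_cost, int n_remaining)
{
    monitor_msg_t m;
    m.wall_clock = wall;
    m.cpu_time   = cpu;
    m.cp         = ref_cost / cpu;     /* heterogeneity-style power estimate */
    m.ert        = n_remaining * wall; /* naive remaining-time estimate      */
    return m;
}

/* Send the report to the Site Manager over the parent intercommunicator. */
static void send_report(const monitor_msg_t *m, MPI_Comm to_site_manager)
{
    MPI_Send((void *)m, sizeof(*m), MPI_BYTE, 0, MONITOR_TAG, to_site_manager);
}
```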
20Site Dynamic Scheduler - SDS
- On receiving the cp and ert values from each resource in the site, the SDS calculates
- The ideal makespan for the remaining processes in the site
- The site imbalance index
- If the site imbalance index is above a predefined threshold, a scheduling event at the SDS is triggered (see the sketch below)
During a scheduling event:
- The SDS requests a percentage of the remaining workload from the most overloaded host
- The HDS receives the request and decides which tasks to release
- The HDS sends the tasks to be rescheduled to the SDS
- The SDS distributes the tasks amongst the underloaded hosts
[Diagram: the SDS exchanging rescheduling messages with the HDSs of its site]
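A sketch of how the SDS check might look, assuming the remaining work on each host is approximated by ert × cp, the ideal makespan is the total remaining work divided by the total power, and the imbalance index is the relative gap between the most loaded host and that ideal. The function name and the threshold value are illustrative; the actual EasyGrid formulas are not given in these slides.

```c
/* Illustrative SDS balance check (assumed formulas, not the EasyGrid ones). */
#define IMBALANCE_THRESHOLD 0.15   /* assumed value */

/* ert[i]: estimated remaining time of host i; cp[i]: its computational power. */
static int needs_scheduling_event(const double *ert, const double *cp, int n)
{
    double work = 0.0, power = 0.0, max_ert = 0.0;
    for (int i = 0; i < n; i++) {
        work  += ert[i] * cp[i];              /* remaining work units on host i */
        power += cp[i];
        if (ert[i] > max_ert) max_ert = ert[i];
    }
    if (power <= 0.0) return 0;
    double ideal = work / power;              /* ideal makespan if perfectly balanced */
    double imbalance = (max_ert - ideal) / ideal;   /* site imbalance index */
    return imbalance > IMBALANCE_THRESHOLD;   /* trigger a scheduling event? */
}
```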
21Site Dynamic Scheduler
- When not executing a scheduling event, SDS
calculates - The average estimated remaining time of the site
- The sum of computational power of site resources
[Diagram: each SDS reports these values to the GDS]
22Global Dynamic Scheduler - GDS
- On receiving the average remaining time and the total computational power from each site, the GDS calculates
- The ideal makespan for the whole application
- The system imbalance index
If the system imbalance index is above its threshold, a scheduling event at the GDS is triggered (a simplified sketch of the site selection step follows below):
- The GDS requests from the most overloaded site a percentage of its remaining workload
- The SDS receives the request and forwards it to its HDSs
- The SDS waits for each HDS to answer and sends the list of released tasks to the GDS
- The GDS distributes the tasks between the underloaded sites
- Each underloaded site reschedules the received tasks amongst its hosts
[Diagram: the GDS exchanging rescheduling messages with the SDSs, which in turn coordinate their HDSs]
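At the global level the same idea applies per site. The fragment below sketches how the GDS might pick the most overloaded site and size its request; the function names and the fraction requested are illustrative parameters, not the EasyGrid policy.

```c
/* Illustrative GDS step (assumed policy): pick the site with the largest
 * estimated remaining time and request a fraction of its pending tasks. */
#define REQUEST_FRACTION 0.5    /* assumed percentage of the remaining workload */

static int pick_overloaded_site(const double *site_ert, int n_sites)
{
    int worst = 0;
    for (int s = 1; s < n_sites; s++)
        if (site_ert[s] > site_ert[worst]) worst = s;
    return worst;
}

static int tasks_to_request(int pending_tasks_on_site)
{
    return (int)(REQUEST_FRACTION * pending_tasks_on_site);
}
```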
23Experimental Analysis
- These scheduling policies were designed for BoT
applications such as parameter sweep (PSwp) - PSwp can be represented by simple fork-join DAGs
- The scheduling policies for this class of DAG can
be seen as load balancing strategies
- Semi-controlled three-site grid environment
- Pentium IV 2.6 GHz processors with 512 MB of RAM
- Running Linux Fedora Core 2, Globus Toolkit 2.4 and LAM/MPI 7.0.6
24Experimental Analysis
25Experimental Analysis
- Same HDS policies for PSwp1 and PSwp2
- The overhead due to the AMS is very low, not exceeding 2.5%
- The cost of rescheduling is also small, less than 1.2%
- The standard deviation is less than the granularity of a single task
26Experimental Analysis
- Different HDS policies for PSwp1 and PSwp2
- PSwp1 processes execute only when resources are
idle
- The interference produced by PSwp1 is less than 0.8%
- The cost of rescheduling does not exceed 1.4%
27Experimental Analysis
- EasyGrid AMS with different scheduling policies
- Round robin (as used by MPI) without dynamic scheduling
- Near-optimal static scheduling without dynamic scheduling
- Round robin (as used by MPI) with dynamic scheduling
- Near-optimal static scheduling with dynamic scheduling
28Conclusion
- In an attempt to improve application performance, a hierarchical hybrid scheduling strategy is employed
- The low cost of the hierarchical scheduling methodology leads to an efficient execution of BoT MPI applications
- Different scheduling policies may be used at different levels of the scheduling hierarchy
- One AMS per application permits application-specific scheduling policies to be used
- System awareness permits various applications to coordinate their scheduling efforts in a decentralised manner to obtain good system utilisation
29Efficient Hierarchical Self-Scheduling for MPI Applications Executing in Computational Grids
Thanks
e-mail: {depaula, asena, boeres, vinod}_at_ic.uff.br
30Calculation of the Optimal Value
31Calculation of the Optimal Value
- Together (PSwp1 uses only the idle resources)
- When PSwp2 finishes, PSwp1
- Has already executed 56 tasks on each idle resource (7 machines)
- 56 tasks × 7 machines = 392 tasks executed
- There remain (1000 - 392) = 608 tasks, not yet executed, for the shared resources
- PSwp1 executes these 608 tasks on 18 machines (Sites 1 and 2)
- tasks/machines = 608/18 ≈ 34
- Optimal value = (56 + 34) × duration of a task (a general form of this expression is written out below)
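Written out in general terms, using only quantities already on this slide (T = 1000 total PSwp1 tasks, k = 56 tasks already completed on each of the 7 idle machines, d = the duration of one task, and 18 machines available afterwards):

```latex
\[
  t_{\mathrm{opt}} \;=\; \Bigl(k + \frac{T - 7k}{18}\Bigr)\, d
                   \;=\; \Bigl(56 + \frac{1000 - 392}{18}\Bigr)\, d
                   \;\approx\; (56 + 34)\, d
\]
```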
32Calculation of the Expected Value
- The expected value when the applications are
executing together is based on the actual value
when the application executed alone
- Example (same HDS policies)
- actual value = 57.20 (executing alone on 18 machines)
- expected value when executing on 7 dedicated machines plus 11 shared machines
- 57.20 corresponds to 18 machines
- the expected value corresponds to 12.5 machines (7 + 5.5, each shared machine contributing roughly half its capacity)
- Expected value = (57.20 × 18) / 12.5 = 82.36 (the general scaling is written out below)
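The scaling used here is a simple proportionality between the 18 dedicated machines used when the application ran alone and the effective capacity when it shares the grid (7 dedicated machines plus 11 shared machines counted at roughly half capacity, an interpretation inferred from the 12.5 figure on this slide):

```latex
\[
  t_{\mathrm{expected}} \;=\; t_{\mathrm{alone}} \times \frac{18}{7 + 11 \times 0.5}
                        \;=\; \frac{57.20 \times 18}{12.5} \;\approx\; 82.36
\]
```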
33Calculation of the Expected Value
- The expected value when the applications are
executing together is based on the actual value
when the application executed alone
- Example (different HDS policies)
- actual value = 57.67 (executing alone on 18 machines)
- at this moment PSwp2 is considered to have finished its execution
- So, PSwp1 has already executed 56 tasks on each idle resource (7 machines): 56 × 7 = 392 tasks executed
- There remain (1000 - 392) = 608 tasks, not yet executed, for the shared resources
- PSwp1 executes these 608 tasks on 18 machines (Sites 1 and 2)
- tasks/machines = 608/18 ≈ 34
- Expected value = 57.67 + 34 = 91.67
36Related Work
- According to Buyya et al., the scheduling subsystem may be
- Centralised
- A single scheduler is responsible for grid-wide decisions
- Not scalable
- Hierarchical
- Scheduling is divided into several levels, permitting different scheduling policies at each level
- Failure of the top-level scheduler results in the failure of the entire system
- Decentralised
- Various decentralised schedulers communicate with each other, offering portability and fault tolerance
- Good individual scheduling may not lead to good overall performance
37Related Work
- OurGrid
- Employs two different schedulers
- A UMS job manager, responsible for allocating application jobs on grid resources
- An RMS site manager, in charge of enforcing particular site-specific policies
- No information about the application is used
- GridFlow
- Is an RMS which focuses on service-level scheduling and workflow
- Management takes place at 3 levels: global, local and resource
38Related Work
- GrADS
- Is an RMS that utilises Autopilot to monitor adherence to a performance contract between the application's demands and the resources' capabilities
- If the contract is violated, the rescheduler takes corrective action either by suspending execution, computing a new schedule, migrating processes and then restarting, or by process swapping
- AppLeS
- Is an AMS where individual scheduling agents embedded into the application perform adaptive scheduling
- Given the user's goals, these centralised agents use the application's characteristics and system information to select viable resources