1. An Architecture for Reconfiguring MPI Applications in Dynamic and Heterogeneous Environments
12th SIAM Conference on Parallel Processing for Scientific Computing
- Kaoutar El Maghraoui, elmagk_at_cs.rpi.edu
- Department of Computer Science
- Rensselaer Polytechnic Institute
- http://wcl.cs.rpi.edu/ios/
- In Collaboration with
- Dr. Carlos Varela (Thesis Advisor)
- Dr. Boleslaw Szymanski
- Travis Desell
- February 15, 2006
2. Today's Grid Environments
- Infrastructure
- Complex, large-scale, high fault rates, and dynamic
- Applications
- Complex deployment
- Challenges
- High-level application development interface
- Designing and constructing applications for adaptability
- Late mapping of applications to Grid resources
- Monitoring and control of performance
3. MPI Challenges on Dynamic Grids
- Tailored for tightly coupled systems
- Dynamic reconfiguration
- Process mobility
- Scale up to accommodate new resources
- Shrink to accommodate leaving or slow resources
- Transparent performance monitoring and application adaptability
- Currently handled by the programmer
- Goal
- Extending MPI with dynamic reconfiguration and adaptability to dynamic computational grids
4. Approach
- Separation of concerns between the application and the middleware
- Middleware-level
- When and how to reconfigure applications?
- Application-level
- Problem solving
- Support for migration and/or malleability
- Gap-bridging software
- High-level APIs
- Library support to integrate applications and middleware
5. IOS Overview
- The Internet Operating System (IOS) is a decentralized middleware framework that provides
- Opportunistic load balancing capabilities
- Resource-level profiling
- Application-level profiling
- Goal
- Automatic reconfiguration of applications in dynamic environments (e.g., computational grids)
- Scalability to worldwide execution environments
- Modular architecture enabling evaluation of different load balancing and resource profiling strategies
- Generic interfaces to interoperate with various programming models
6. IOS Architecture
- Distributed middleware agents
- Encapsulate modules for resource profiling and reconfiguration policies
- Capable of interconnecting in various virtual topologies (hierarchical or P2P)
- Interface with high-level applications
- Interfacing with IOS agents
- Applications implement specific APIs to interface with IOS agents
- Applications need to support component migration/malleability
7. IOS Architecture
[Figure: an IOS-enabled node. An application component exchanges messages with other components, reports application profiling data through the IOS API, and receives reconfiguration requests (migrate/split/merge/replicate) from the IOS agent. The IOS agent contains three modules. The profiling module interfaces to resource profilers (CPU monitor, memory monitor, network monitor) and supplies a list of profiles, including communication profiles, inter-delay information, and available processing. The decision module evaluates the gain of a potential reconfiguration and can initiate a steal request. The protocol module sends and receives steal requests.]
8. IOS Load Balancing Strategies
- Modularity for customizable load balancing and profiling strategies, e.g.
- Random work-stealing (RS)
- Based on Cilk's work-stealing approach
- Lightly loaded nodes send work-steal packets to heavily loaded nodes
- Application topology-sensitive work-stealing (ATS)
- Extension to RS
- Collocate processes communicating frequently
- Network topology-sensitive work-stealing (NTS)
- Extension to ATS
- Considers network topology
- Minimizes WAN latencies
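The random work-stealing (RS) idea on this slide can be sketched as a small simulation in C. Everything here is an illustrative assumption rather than the IOS implementation: the `node_t` type, the load threshold, and the helpers `choose_victim` and `handle_steal` are hypothetical names invented for this sketch.

```c
#include <stdlib.h>

/* Hypothetical node descriptor: number of queued work units. */
typedef struct { int load; } node_t;

/* A lightly loaded node (the thief) picks a random peer other than itself
   to send a steal request to. Returns -1 when there is no other node. */
int choose_victim(int self, int n_nodes) {
    if (n_nodes < 2) return -1;
    int v = rand() % (n_nodes - 1);
    return v >= self ? v + 1 : v;   /* skip over self */
}

/* The node that receives a steal request hands over one work unit only if
   its load is above the assumed threshold (i.e. it is heavily loaded).
   Returns 1 if work moved, 0 otherwise. */
int handle_steal(node_t *victim, node_t *thief, int threshold) {
    if (victim->load <= threshold) return 0;
    victim->load -= 1;
    thief->load += 1;
    return 1;
}
```

ATS and NTS would replace the uniform random choice in `choose_victim` with one weighted by communication patterns or network distance; that refinement is not shown here.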
9. Reconfiguring MPI Applications with IOS
- Extending MPI
- Semi-transparent checkpointing
- Process migration support
- Integration with IOS
- Currently for iterative applications
10. The MPI/IOS Runtime Architecture
- Instrumented MPI applications
- Process Checkpointing and Migration (PCM) library
- Wrappers for some MPI native calls
- The MPI library
- The IOS runtime components
11. MPI/IOS Interactions
12. MPI Process Migration
- Implemented at the user level
- Relies on MPI communicator rearrangements and the MPI-2 spawning feature
- Instrumentation of programs with PCM calls
- Benefit: portability
- Limitation: semi-transparency
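13. Migration Example
[Figure: processes with ranks 0 to 5 in MPI_COMM_WORLD. The migrating process calls MPI_SPAWN, which creates a new process with rank 0 in a newly created communicator, and transfers its state to it.]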
14. Migration Example
MPI_Intercomm_merge merges the two communicators.
[Figure: the merged communicator, in which the original processes keep ranks 0 to 5 and the new process has rank 6.]
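15. Migration Example
MPI_Comm_create creates a new communicator.
[Figure: in the new communicator the spawned process takes rank 3, the rank of the migrating process; ranks 0, 1, 2, 4, and 5 are unchanged.]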
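The rank rearrangement in the three migration slides can be illustrated with a small helper that computes which ranks of the merged communicator make up the new one, and in what order. This is a sketch under stated assumptions, not the PCM implementation: the function name `build_rank_order` is hypothetical, and it assumes the spawned process receives rank `n_old` in the merged communicator (rank 6 in the slides' example).

```c
/* Fill order[0..n_old-1] with the merged-communicator ranks that form the
   new communicator. Position r in the array is the rank the process will
   have afterwards, so the spawned process (assumed merged rank n_old)
   is placed at the migrating process's old position.
   Returns 0 on success, -1 on invalid arguments. */
int build_rank_order(int n_old, int migrating, int *order) {
    if (n_old <= 0 || migrating < 0 || migrating >= n_old || order == 0)
        return -1;
    for (int r = 0; r < n_old; r++)
        order[r] = (r == migrating) ? n_old : r;
    return 0;
}
```

In an MPI program, an array like this could plausibly be passed to `MPI_Group_incl` to build the group handed to `MPI_Comm_create`; whether PCM does it this way is not stated in the slides.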
16. Profiling MPI Applications
- The profiling library is based on the MPI profiling interface
- Transparent interception of all MPI calls
- Goal: profile MPI applications' communication patterns
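The interception pattern behind the MPI profiling interface can be sketched without MPI itself. In a real build, a profiling library defines its own `MPI_Send` that records the call and then forwards to `PMPI_Send`. In the sketch below, `real_send` is a stand-in for `PMPI_Send`, and `profiled_send`, `profile_bytes_to`, and `MAX_RANKS` are hypothetical names chosen for illustration only.

```c
#include <stddef.h>

#define MAX_RANKS 8   /* assumed upper bound on ranks for this sketch */

/* bytes_sent[d] accumulates the bytes this process has sent to rank d. */
static size_t bytes_sent[MAX_RANKS];

/* Stand-in for the real send routine (PMPI_Send in an MPI build). */
static int real_send(const void *buf, size_t nbytes, int dest) {
    (void)buf; (void)nbytes; (void)dest;
    return 0;
}

/* Wrapper playing the role of a profiling library's MPI_Send:
   record the communication, then forward to the real routine.
   Returns -1 for a destination outside the profiled range. */
int profiled_send(const void *buf, size_t nbytes, int dest) {
    if (dest < 0 || dest >= MAX_RANKS) return -1;
    bytes_sent[dest] += nbytes;
    return real_send(buf, nbytes, dest);
}

/* Query the recorded communication profile for one destination. */
size_t profile_bytes_to(int dest) {
    return (dest >= 0 && dest < MAX_RANKS) ? bytes_sent[dest] : 0;
}
```

Because the application keeps calling the same function name, the interception stays transparent to it, which is the property the slide highlights.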
17. How to Instrument MPI Programs with PCM? (Initialization Phase)

    #include <mpi.h>
    #include "pcm.h"

    MPI_Comm PCM_COMM_WORLD;

    int main(int argc, char **argv)
    {
        int rank, n, spawnrank;   /* declarations added; types assumed */

        MPI_Init(&argc, &argv);
        PCM_COMM_WORLD = MPI_COMM_WORLD;
        PCM_Init(PCM_COMM_WORLD);
        MPI_Comm_rank(PCM_COMM_WORLD, &rank);
        MPI_Comm_size(PCM_COMM_WORLD, &n);

        spawnrank = PCM_Process_Status();
        if (spawnrank > 0) {
            // load any checkpointed data
            PCM_Load();
        }
18. How to Instrument MPI Programs with PCM? (Iterations Phase)

        for (several iterations) {
            pcm_status = PCM_Status(PCM_COMM_WORLD);
            if (pcm_status == PCM_MIGRATE) {
                // checkpoint data
                PCM_Store();
                PCM_COMM_WORLD = PCM_Reconfigure();
            }
            else if (pcm_status == PCM_RECONFIGURE) {
                PCM_COMM_WORLD = PCM_Reconfigure();
                MPI_Comm_rank(PCM_COMM_WORLD, &rank);
            }

            // Data computation.
            // Exchange of computed data with neighboring processes.
            // MPI_Send() ... MPI_Recv() ...
        }

        PCM_Finalize(PCM_COMM_WORLD);
        MPI_Finalize();
    }   /* end of main() */
19. A Reconfiguration Scenario
[Figure: a reconfiguration scenario on processor 1, showing the MPI process with rank 1 and the IOS agent.]
20. Case Study: Heat Diffusion Problem
- A problem that models heat transfer in a solid
- A two-dimensional mesh is used to represent the problem data space
- An iterative application
- Highly synchronized
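To make the iterative structure concrete, here is a minimal sequential sketch of one sweep over a two-dimensional mesh. It assumes a Jacobi-style update in which each interior cell becomes the average of its four neighbors; the slides do not say which numerical scheme the case study uses, and the mesh size and the name `jacobi_step` are hypothetical.

```c
#define NX 4   /* assumed mesh dimensions for this sketch */
#define NY 4

/* One sweep: every interior cell of `out` becomes the average of the four
   neighboring cells of `in`; boundary cells are copied unchanged.
   Returns the largest absolute change of any interior cell, which a
   caller could use as a convergence test. */
double jacobi_step(double in[NX][NY], double out[NX][NY]) {
    double max_diff = 0.0;
    for (int i = 0; i < NX; i++) {
        for (int j = 0; j < NY; j++) {
            if (i == 0 || j == 0 || i == NX - 1 || j == NY - 1) {
                out[i][j] = in[i][j];
                continue;
            }
            out[i][j] = 0.25 * (in[i - 1][j] + in[i + 1][j] +
                                in[i][j - 1] + in[i][j + 1]);
            double d = out[i][j] - in[i][j];
            if (d < 0.0) d = -d;
            if (d > max_diff) max_diff = d;
        }
    }
    return max_diff;
}
```

In a parallel version each process would own a block of the mesh and exchange its boundary cells with neighbors before every sweep, which is why a single slow process holds back all the others.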
21. Adaptation Experiments
22. Adaptation Experiments (2)
Adaptation through removing a slow processor
23. Adaptation Experiments (3)
Adaptation through migration to a better cluster
24. Empirical Results: Overhead of the PCM library
25. Reconfiguration Overhead
26. Breakdown of Reconfiguration Cost
27. Ongoing/Future Work
- Splitting and merging MPI application processes
- New reconfiguration policies on dynamic environments
- More realistic load characteristics and network latencies
- Interoperability with MPICH-G2
- Improving the PCM API
- Non-iterative applications
28. Related Work
- MPICH-G2
- Grid-enabled implementation of MPI
- http://www3.niu.edu/mpi/
- Adaptive MPI (AMPI)
- MPI implementation with light threads for process migration [Huang03]
- MPI Process Swapping
- Initial over-allocation of processors and selection of the best executing nodes [Sievert04]
- Extensions to MPI with checkpointing and restart
- SRS library [Vadhiyar03]: application stop and restart
- CoCheck [Stellner96] and StarFish [Agbaria99]: fault tolerance
- MPICH-V [Bouteiller05]: fault tolerance
29. Questions?
30. Backup Slides
31. Using the IOS Middleware
- Start IOS Peer Servers: a mechanism for peer discovery
- Start a network of IOS theaters
- Write your SALSA programs and extend all actors to autonomous actors
- Bind autonomous actors to theaters
- IOS automatically reconfigures the location of actors in the network for improved performance of the application
- IOS supports the dynamic addition and removal of theaters
32. Parallel Issues
- When running across multiple resources, communication between processes on different resources has much higher latency and lower bandwidth than communication between processes on a single resource
- Need to think about communication patterns: is it possible to reduce the amount of communication by, for example, buffering data for longer and sending larger batches of data?
33. Today: Globus
- Developed by Ian Foster and Carl Kesselman
- Grew from the I-Way (SC-95)
- Basic services for distributed computing
- Resource discovery and information services
- User authentication and access control
- Job initiation
- Communication services (Nexus and MPI)
- Applications are programmed by hand
- Many applications
- User responsible for resource mapping and all communication
- Existing users acknowledge how hard this is
34. Today: Condor
- Support for matching application requirements to resources
- User and resource provider write ClassAd specifications
- System matches ClassAds for applications with ClassAds for resources
- Selects the best match based on a user-specified priority
- Can extend to Grid via Globus (Condor-G)
- What is missing?
- User must handle application mapping tasks
- No dynamic resource selection
- No checkpoint/migration (resource re-selection)
- Performance matching is straightforward
- Priorities coded into ClassAds
35. Resource-Sensitive Model
- Decision components use a resource-sensitive model to decide, based on the profiled applications, how to balance resource consumption
- Reconfiguration decisions
- Where to migrate
- When to migrate
- How many entities to migrate
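A decision of the "when to migrate" kind can be illustrated with a deliberately simple cost model. This is not the IOS resource-sensitive model, whose details are not given in these slides; it only shows the shape of a gain evaluation. The function names, the inputs (remaining work, node speeds in work units per second, migration cost in seconds), and the rule "migrate when the estimated gain is positive" are all assumptions of this sketch.

```c
/* Estimated time saved (in seconds) by moving the remaining work to the
   candidate node. Both speeds are assumed to be strictly positive.
   A positive result means migrating is expected to finish sooner. */
double migration_gain(double work_left, double speed_here,
                      double speed_there, double migration_cost) {
    double t_stay = work_left / speed_here;
    double t_move = work_left / speed_there + migration_cost;
    return t_stay - t_move;
}

/* Hypothetical policy: migrate only when the estimated gain is positive. */
int should_migrate(double work_left, double speed_here,
                   double speed_there, double migration_cost) {
    return migration_gain(work_left, speed_here, speed_there,
                          migration_cost) > 0.0;
}
```

With 100 units of work left, a local speed of 1, a remote speed of 4, and a migration cost of 10 seconds, staying takes 100 seconds and moving takes 35, so the estimated gain is 65 seconds. A short remaining run on a slightly faster node would not repay the same cost.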
37. IOS API
- The following methods notify the profiling agent of actors entering and exiting the theater due to migration and binding
- public void addProfile(UAN uan)
- public void removeProfile(UAN uan)
- public void migrateProfile(UAN uan, UAL target)
- The profiling agent updates its actor profiles based on message sending with this method
- public void msgSend(UAN uan, Msg_INFO msgInfo)
- The profiling agent updates its actor profiles based on message reception with this method
- public void msgReceive(UAN uan, UAL targetUAL, Msg_INFO msgInfo)
- The following methods notify the profiling agent of the start and the end of a message being processed, with a UAN or UAL to identify the sending actor
- public void beginProcessing(UAN uan, Msg_INFO msgInfo)
- public void endProcessing(UAN uan, Msg_INFO msgInfo)
38. Virtual Topologies of IOS Agents
- Agents organize themselves in various network-sensitive virtual topologies to sense the underlying physical environments
- Peer-to-peer topology: agents form a P2P network to exchange profiled information.
- Cluster-to-cluster topology: agents organize themselves in groups of clusters. Cluster managers form a P2P network.
39. C2C vs. P2P Topologies
40. Parallel Decomposition of the Heat Problem
41. MPI Process Migration
- Upon a migration notification from the IOS middleware
- The migrating process saves its current state through the PCM checkpointing support
- The rest of the processes get notified about the event of a migration. Any communication is suspended until migration is done.
- The migrating process spawns a new process in the target location and sends its local checkpointed data (MPI-2)
- The newly created process restores its state
- Rearrangement of any shared communicators is performed collectively by all processes.
- Computation is then resumed
42. How to Instrument an MPI Program?
- The PCM API
- Process Checkpointing and Migration API
- Register variables with a checkpoint handler
- Store data locally or remotely in a PCM Daemon
- Restore previously checkpointed data
- Periodic probing of the status of an MPI application or MPI process
- The PCM Daemon
- Loaded on every participating node
- Communicates with IOS agents and the MPI profiling library
- Handles process migration
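The register/store/restore cycle listed above can be sketched as a tiny in-memory checkpoint registry. This is not the PCM API: the names `ckpt_register`, `ckpt_store`, and `ckpt_load`, the fixed table sizes, and the in-process byte buffer (standing in for local or remote storage in a daemon) are all assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAX_VARS    16     /* assumed limits for this sketch */
#define STORE_BYTES 1024

typedef struct { void *addr; size_t size; } ckpt_var_t;

static ckpt_var_t vars[MAX_VARS];
static int n_vars = 0;
static unsigned char store[STORE_BYTES];  /* stands in for daemon storage */

/* Register a variable with the checkpoint handler.
   Returns 0 on success, -1 if the table or the store would overflow. */
int ckpt_register(void *addr, size_t size) {
    size_t used = 0;
    for (int i = 0; i < n_vars; i++) used += vars[i].size;
    if (n_vars == MAX_VARS || used + size > STORE_BYTES) return -1;
    vars[n_vars].addr = addr;
    vars[n_vars].size = size;
    n_vars++;
    return 0;
}

/* Copy every registered variable into the store, back to back. */
void ckpt_store(void) {
    size_t off = 0;
    for (int i = 0; i < n_vars; i++) {
        memcpy(store + off, vars[i].addr, vars[i].size);
        off += vars[i].size;
    }
}

/* Copy the saved bytes back into the registered variables. */
void ckpt_load(void) {
    size_t off = 0;
    for (int i = 0; i < n_vars; i++) {
        memcpy(vars[i].addr, store + off, vars[i].size);
        off += vars[i].size;
    }
}
```

Registering variables up front is what makes this approach user-level and portable: the application states which data matters, so no system-level process image is needed, at the price of the semi-transparency noted earlier.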
43.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, totalProcessors, current_iteration;   /* declarations added */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &totalProcessors);
        current_iteration = 0;

        // Initialize and distribute data among processors

        for (several loops) {
            // Data computation.
            // Exchange of computed data with neighboring processes.
            // MPI_Send() ... MPI_Recv() ...
        }

        // Data collection
        MPI_Barrier(MPI_COMM_WORLD);
        MPI_Finalize();
        return 0;
    }