Adaptive MPI

Title: Adaptive Load Balancing for MPI Programs
Author: Milind A. Bhandarkar
Last modified by: Milind Bhandarkar
Created: 5/21/2001 7:23:36 PM

Slides: 25

Learn more at: http://charm.cs.illinois.edu

Transcript and Presenter's Notes

1
Adaptive MPI

Milind A. Bhandarkar
(milind_at_cs.uiuc.edu)

2
Motivation

Many computational science and engineering (CSE) applications exhibit dynamic behavior
Adaptive mesh refinement (AMR)
Pressure-driven crack propagation in solid propellants
Also, non-dedicated supercomputing platforms such as clusters affect processor availability
These factors cause severe load imbalance
Can the parallel language / runtime system help?

3
Load Imbalance in Crack-propagation Application
(More and more cohesive elements are activated between iterations 35 and 40.)
4
Load Imbalance in an AMR Application
(The mesh is refined at the 25th time step.)
5
Multi-partition Decomposition

Basic idea
Decompose the problem into a number of partitions, independent of the number of processors
Partitions > processors
The system maps partitions to processors
The system maps and re-maps objects as needed
Re-mapping strategies help adapt to dynamic variations
To make this work, we need
A load balancing framework and runtime support
But isn't there a high overhead of multi-partitioning?
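
The mapping and re-mapping idea above can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a simple greedy heuristic (heaviest partition first, onto the least-loaded processor), ignores migration cost, and is not taken from the Charm++ load balancing framework. The function names and the load numbers are hypothetical.

```python
import heapq

def map_partitions(loads, num_procs):
    """Greedy mapping: heaviest partition first, onto the least-loaded processor.

    `loads` holds one measured cost per partition (hypothetical units).
    Returns `owner`, where owner[p] is the processor chosen for partition p.
    """
    heap = [(0.0, proc) for proc in range(num_procs)]  # (total load, processor)
    heapq.heapify(heap)
    owner = [None] * len(loads)
    for part in sorted(range(len(loads)), key=lambda p: loads[p], reverse=True):
        total, proc = heapq.heappop(heap)
        owner[part] = proc
        heapq.heappush(heap, (total + loads[part], proc))
    return owner

def proc_loads(loads, owner, num_procs):
    """Sum the partition loads that each processor currently owns."""
    totals = [0.0] * num_procs
    for part, proc in enumerate(owner):
        totals[proc] += loads[part]
    return totals

# Eight partitions on two processors, initially with uniform cost.
loads = [1.0] * 8
owner = map_partitions(loads, 2)

# Partitions 0 and 2 become expensive (for example, after mesh refinement).
loads[0] = loads[2] = 5.0
stale = proc_loads(loads, owner, 2)   # loads under the old mapping

# Re-map using the newly measured loads.
owner = map_partitions(loads, 2)
fresh = proc_loads(loads, owner, 2)   # loads under the new mapping
```

Because there are more partitions than processors, the imbalance can be repaired by moving whole partitions, without re-partitioning the problem itself.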

6
Overhead of Multi-partition Decomposition
(Crack Propagation code, with 70k elements)
7
Charm++

Supports data-driven objects.
Singleton objects, object arrays, groups, ...
Many objects per processor, with method execution scheduled by the availability of data.
Supports object migration, with automatic forwarding.
Excellent results with highly irregular, dynamic applications.
Molecular dynamics application NAMD: speedup of 1250 on 2000 processors of ASCI Red.
Brunner, Phillips, Kale. Scalable molecular dynamics, Gordon Bell finalist, SC2000.

8
Charm++ System Mapped Objects
9
Load Balancing Framework
10
However

Many CSE applications are written in Fortran and MPI
Conversion to a parallel object-based language such as Charm++ is cumbersome
Message-driven style requires split-phase transactions
Often results in a complete rewrite
How can existing MPI applications be converted without an extensive rewrite?

11
Solution

Each partition is implemented as a user-level thread associated with a message-driven object
The communication library for these threads has the same syntax and semantics as MPI
But what about the overheads associated with threads?
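
To make the idea of blocking-style code running on top of a message-driven scheduler concrete, here is a minimal sketch that uses Python generators as stand-ins for user-level threads. It is an assumption-laden toy, not AMPI: the request tuples `("send", dest, value)` and `("recv",)` and the functions `rank_body` and `run` are hypothetical names invented for this illustration.

```python
from collections import deque

def rank_body(rank, size, log):
    """One 'rank' written in blocking style, as a generator.

    Yielding ("send", dest, value) or ("recv",) stands in for a blocking
    send or receive call (hypothetical protocol for this sketch).
    """
    right = (rank + 1) % size
    yield ("send", right, rank * 10)
    value = yield ("recv",)      # suspended here until a message is delivered
    log.append((rank, value))

def run(size):
    """Message-driven scheduler: resume a rank only when its data is available."""
    log = []
    threads = [rank_body(r, size, log) for r in range(size)]
    mailbox = [deque() for _ in range(size)]
    waiting = [False] * size
    ready = deque((r, None) for r in range(size))  # (rank, value to resume with)
    while ready:
        rank, resume = ready.popleft()
        try:
            request = threads[rank].send(resume)
            while True:
                if request[0] == "send":
                    _, dest, value = request
                    if waiting[dest]:
                        # Receiver is blocked: wake it with the message.
                        waiting[dest] = False
                        ready.append((dest, value))
                    else:
                        mailbox[dest].append(value)
                    request = threads[rank].send(None)
                elif mailbox[rank]:
                    # Receive with a message already queued: continue at once.
                    request = threads[rank].send(mailbox[rank].popleft())
                else:
                    # Receive with nothing queued: suspend this rank.
                    waiting[rank] = True
                    break
        except StopIteration:
            pass  # this rank has finished
    return sorted(log)

result = run(3)
```

The body of `rank_body` reads like ordinary blocking message-passing code; the split-phase behavior lives entirely in the scheduler, which is the property that lets existing programs avoid a rewrite.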

12
AMPI Threads Vs MD Objects (1D Decomposition)
13
AMPI Threads Vs MD Objects (3D Decomposition)
14
Thread Migration

Thread stacks may contain references to local variables
These may not be valid upon migration to a different address space
Solution: thread stacks should span the same virtual address space on any processor to which they may migrate (Isomalloc)
Split the virtual address space into per-processor allocation pools
Scalability issues
Not important on 64-bit processors
Constrained load balancing (limit each thread's migratability to fewer processors)
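
The scalability concern can be seen with a little arithmetic. The sketch below splits a reserved virtual address region into equal per-processor pools and counts how many thread stacks fit in each pool. The region sizes, base address, stack size, and function names are all hypothetical values chosen for illustration, not figures from the Isomalloc implementation.

```python
def make_pools(region_start, region_size, num_procs):
    """Split one virtual address region into equal, non-overlapping pools.

    Returns a list of (start, end) address pairs, one per processor.
    """
    pool_size = region_size // num_procs
    return [(region_start + p * pool_size, region_start + (p + 1) * pool_size)
            for p in range(num_procs)]

def stacks_per_pool(region_size, num_procs, stack_size):
    """How many fixed-size thread stacks fit in one processor's pool."""
    return (region_size // num_procs) // stack_size

MiB = 1 << 20

# Four processors sharing a hypothetical 1 GiB region that starts at 1 GiB.
pools = make_pools(0x4000_0000, 1 << 30, 4)

# 1024 processors, 1 MiB stacks (assumed sizes):
small = stacks_per_pool(1 << 30, 1024, MiB)   # region sized for a 32-bit space
large = stacks_per_pool(1 << 46, 1024, MiB)   # region sized for a 64-bit space
```

With the assumed 32-bit-sized region, each processor can create only one stack, while the 64-bit-sized region leaves room for tens of thousands. Restricting each thread to a smaller set of destination processors is the other way to reduce the number of pools that must be reserved.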

15
AMPI Issues: Thread-safety

Multiple threads are mapped to each processor
Process-wide data needs to be localized
Make them instance variables of a class
All subroutines become instance methods of this class
AMPIzer: a source-to-source translator
Based on the Polaris front-end
Recognizes all global variables
Puts them in a thread-private area
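
The transformation described above, turning process-wide variables into instance variables, can be illustrated with a small before-and-after sketch. It is written in Python for brevity rather than Fortran, it is not output of AMPIzer, and the names `process_state`, `step_shared`, and `RankState` are hypothetical.

```python
# Before: one copy of the state per process. If two ranks run as threads
# in the same process, they both update the same counter.
process_state = {"iteration": 0}

def step_shared():
    process_state["iteration"] += 1
    return process_state["iteration"]

# After: the former global is an instance variable, and the subroutine
# that used it is an instance method. Each rank owns one instance.
class RankState:
    def __init__(self):
        self.iteration = 0

    def step(self):
        self.iteration += 1
        return self.iteration

# Two ranks on one processor, each taking its first step.
shared = [step_shared(), step_shared()]     # second rank sees the first's update
rank0, rank1 = RankState(), RankState()
private = [rank0.step(), rank1.step()]      # each rank sees only its own state
```

The second rank in the shared version observes an iteration count of 2 on its first step, which is the kind of interference that localization removes.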

16
AMPI Issues: Data Migration

Thread-private data needs to be migrated with the thread
The developer has to write subroutines for packing and unpacking data
Writing separate subroutines is error-prone
Puppers (pup = pack-unpack)
A single subroutine to show the data to the runtime system
Fortran 90 generic procedures make writing the pupper easy
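
The benefit of a single pack-unpack routine can be sketched as follows. The `Pupper` class, its `doubles` method, and the `Chunk` class are hypothetical and only mimic the idea; they are not the AMPI or Charm++ pup interface. The point is that `Chunk.pup` lists each field once, so packing and unpacking cannot drift out of step.

```python
import struct

class Pupper:
    """Tiny pack/unpack helper (hypothetical, for illustration only).

    Created without a buffer it packs; created with a buffer it unpacks.
    """
    def __init__(self, buffer=None):
        self.unpacking = buffer is not None
        self.buffer = bytearray(buffer or b"")
        self.offset = 0

    def doubles(self, values):
        """Pack `values`, or return the stored list when unpacking."""
        if self.unpacking:
            (count,) = struct.unpack_from("<q", self.buffer, self.offset)
            self.offset += 8
            values = list(struct.unpack_from(f"<{count}d", self.buffer, self.offset))
            self.offset += 8 * count
        else:
            self.buffer += struct.pack("<q", len(values))
            self.buffer += struct.pack(f"<{len(values)}d", *values)
        return values

class Chunk:
    """Thread-private data of one partition (hypothetical fields)."""
    def __init__(self):
        self.coords = []
        self.temps = []

    def pup(self, p):
        # The one routine that shows the data to the runtime, in both directions.
        self.coords = p.doubles(self.coords)
        self.temps = p.doubles(self.temps)

# Pack on the source processor.
src = Chunk()
src.coords = [0.0, 0.5, 1.0]
src.temps = [300.0, 310.5]
packer = Pupper()
src.pup(packer)
wire = bytes(packer.buffer)

# Unpack on the destination processor with the same routine.
dst = Chunk()
dst.pup(Pupper(wire))
```

Adding a field to `Chunk` means adding one line to `pup`, rather than editing a pack routine and an unpack routine separately.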

17
AMPI: Other Features

Automatic checkpoint and restart
Restart on a different number of processors
The number of chunks remains the same, but they can be mapped to a different number of processors
No additional work is needed
The same pupper used for migration is also used for checkpointing and restart
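
Restarting on a different number of processors works because the checkpoint is organized by chunk, not by processor. The following sketch shows that idea only: `pickle` stands in for the pupper, the round-robin placement is an arbitrary choice, and the function names and chunk contents are hypothetical.

```python
import pickle

def checkpoint(procs):
    """Serialize all chunks, keyed by chunk id.

    `procs` is a list with one {chunk_id: state} dict per processor.
    """
    chunks = {cid: state for local in procs for cid, state in local.items()}
    return pickle.dumps(chunks)

def restart(blob, num_procs):
    """Rebuild the per-processor dicts for a possibly different processor count."""
    chunks = pickle.loads(blob)
    procs = [{} for _ in range(num_procs)]
    for cid in sorted(chunks):
        procs[cid % num_procs][cid] = chunks[cid]   # round-robin placement
    return procs

# Six chunks checkpointed on two processors, restarted on three.
before = [{0: "a", 1: "b", 2: "c"}, {3: "d", 4: "e", 5: "f"}]
after = restart(checkpoint(before), 3)
```

The set of chunks is identical before and after; only the chunk-to-processor mapping changes.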

18
Adaptive Multi-MPI

Integration of multiple MPI-based modules
Example: integrated rocket simulation
ROCFLO, ROCSOLID, ROCBURN, ROCFACE
Each module gets its own MPI_COMM_WORLD
All MPI_COMM_WORLDs form MPI_COMM_UNIVERSE
Point-to-point communication between different MPI_COMM_WORLDs uses the same AMPI functions
Communication across modules is also considered while balancing load

19
Experimental Results
20
AMR Application With Load Balancing
(The load balancer is activated at time steps 20, 40, 60, and 80.)
21
AMPI Load Balancing on Heterogeneous Clusters
(Experiment carried out on a cluster of Linux workstations.)
22
AMPI Vs MPI
(This is a scaled problem.)
23
AMPI Overhead
24
AMPI Status