Title: Dynamite-G
1. Dynamite-G
2. Topics
- Load balancing by task migration in message-passing applications
- Checkpoint and restart mechanisms
- Migrating tasks
- Performance issues
- Current status
3. Why cluster and grid computing
- Clusters and grids increasingly interesting
- more workstations
- higher performance per workstation
- faster interconnecting networks
- price/performance competitive with MPP
- enormous unused capacity
- cyclic availability
4. Issues
- Clusters are inherently inhomogeneous
- intrinsic differences in performance, memory, bandwidth
- dynamically changing background load
- ownership of nodes
- Grids add
- differences in administration
- disjoint file systems
- security etc.
5. Goals of Dynamite
- Utilise unused cycles
- Support parallel applications (PVM/MPI)
- Respect ownership
- Dynamic load redistribution at task level
- Code level transparency
- User level implementation
6Task allocation domains
Static task load
Dynamic task load
Static task allocation
Predictable reallocation
Dynamical reallocation
Static resource load
Dynamic resource load
7. Why migrate
- Performance of a parallel program usually dictated by the slowest task
- Task resource requirements and available resources both vary dynamically
- Therefore, optimal task allocation changes
- Gain must exceed cost of migration
- Resources used by long-running programs may be reclaimed by owner
8. Checkpointing/restoring infrastructure
- User level
- Implemented in the Linux ELF dynamic loader v1.9.9
- can run arbitrary code before the application starts running
- wrapping function calls
- straightforward support for shared libraries
- only need to re-link with a different loader (special option when linking)
9. Checkpointing
- Checkpointing
- signal received
- register/signal mask state saved using sigsetjmp (see the sketch below)
- process address space (text, data, heap, stack, shared libraries) dumped to a checkpoint file
- Checkpoint file is a standalone ELF executable
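A minimal sketch of the checkpoint side of this mechanism, assuming a hypothetical handler installed by the modified loader; ckpt_handler, dump_segments and the checkpoint path are illustrative names, not the actual Dynamite routines:

    #include <setjmp.h>
    #include <signal.h>

    static sigjmp_buf ckpt_env;   /* register and signal-mask state */

    /* Placeholder for the code that writes text, data, heap, stack and
       shared-library segments into a standalone ELF checkpoint file. */
    extern void dump_segments(const char *path);

    /* Installed (e.g. via sigaction) for the checkpoint signal. */
    void ckpt_handler(int sig)
    {
        (void)sig;
        if (sigsetjmp(ckpt_env, 1) == 0) {
            /* first pass: registers/mask saved, then dump the address space */
            dump_segments("/tmp/checkpoint");      /* path is illustrative */
        } else {
            /* reached via siglongjmp() after a restart; simply returning
               from the handler resumes the application code */
        }
    }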
10. Restoring
- OS kernel loads text and data segments, invokes the dynamic loader
- Dynamic loader
- recognises checkpoint file (special sections)
- restores heap, shared libraries and stack, jumps to the signal handler (siglongjmp, see the sketch below)
- Process returns from the signal handler to the application code
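A corresponding sketch of the restore side, under the same assumptions (restore_segment is an illustrative helper; ckpt_env is the buffer saved by the checkpoint handler above):

    #include <setjmp.h>

    extern sigjmp_buf ckpt_env;                    /* saved by ckpt_handler */
    extern void restore_segment(const char *name); /* illustrative helper */

    /* Called by the modified dynamic loader once it recognises the special
       checkpoint sections in the ELF file the kernel has just loaded. */
    void restore_from_checkpoint(void)
    {
        restore_segment("heap");
        restore_segment("shared-libraries");
        restore_segment("stack");
        /* jump back into the saved signal-handler context; the handler
           then returns to the application code */
        siglongjmp(ckpt_env, 1);
    }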
11. Handling kernel context
- Kernel context not automatically preserved
- open files, pipes, sockets, shared memory
- Open files important, call wrapping used (open, close, creat, ...)
- Shared file system a prerequisite
- Method allows shut-down of source node
12. Open files
- Relevant file operations are monitored
- primarily open, close, creat
- Obtain file position before migration, close file
- Reopen and reposition file after migration, as sketched below
- no mirror or proxy needed on old host
- fcntl and ioctl calls are not monitored
- not much used and very complex
- incomplete functionality
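A minimal sketch of the open-file handling, assuming a hypothetical table maintained by the wrapped open()/close()/creat() calls; open_file, save_file_state and restore_file_state are illustrative names:

    #include <fcntl.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct open_file {
        int   fd;
        char  path[4096];
        int   flags;
        off_t pos;          /* offset recorded just before migration */
    };

    /* Before checkpointing: remember the offset and close the file,
       so no mirror or proxy is needed on the old host. */
    static void save_file_state(struct open_file *f)
    {
        f->pos = lseek(f->fd, 0, SEEK_CUR);
        close(f->fd);
    }

    /* After restart on the new host (shared file system assumed):
       reopen the file on the same descriptor and seek back. */
    static void restore_file_state(struct open_file *f)
    {
        int fd = open(f->path, f->flags & ~(O_CREAT | O_TRUNC));
        if (fd != f->fd) {          /* application still uses the old number */
            dup2(fd, f->fd);
            close(fd);
        }
        lseek(f->fd, f->pos, SEEK_SET);
    }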
13. Location-independent addressing
- In standard PVM, the node identifier is encoded in the task identifier
- e.g. t80001 = task 1 running on node 8; used when routing messages between tasks
- Dynamite approach
- task identifier stays the same after migration
- routing tables maintained in all PVM daemons (see the sketch below)
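A minimal sketch of the routing idea, with a hypothetical per-daemon table keyed by the (unchanged) task identifier; route_entry, lookup_route and update_route are illustrative names, not PVM internals:

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per known task: the task id never changes, only the host
       (and therefore the route) is updated when the task migrates. */
    struct route_entry {
        uint32_t task_id;   /* e.g. t80001 keeps its id after migration */
        uint32_t host_id;   /* daemon currently responsible for the task */
    };

    static struct route_entry routes[1024];
    static size_t n_routes;

    /* Used by a daemon when forwarding a message to a task. */
    static struct route_entry *lookup_route(uint32_t task_id)
    {
        for (size_t i = 0; i < n_routes; i++)
            if (routes[i].task_id == task_id)
                return &routes[i];
        return NULL;        /* unknown task: default PVM routing */
    }

    /* Applied when a migration is announced (first on the target daemon,
       then on the source, later on the remaining daemons). */
    static void update_route(uint32_t task_id, uint32_t new_host)
    {
        struct route_entry *r = lookup_route(task_id);
        if (r)
            r->host_id = new_host;
    }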
14. Dynamite: Initial State
Two PVM tasks communicating through a network of daemons. Migrate task 2 to node B.
15. Prepare for Migration
Create a new context for task 2. Tell PVM daemon B to expect messages for task 2. Update routing tables in the daemons (first B, then A, later C).
16. Checkpointing
[Diagram: PVM daemons on nodes A, B and C; PVM task 1, the new context for task 2 on node B, and the migrating task with its program/PVM/checkpointer layers]
Send checkpoint signal to task 2. Flush connections. Checkpoint task to disk.
17. Restart Execution
[Diagram: PVM daemons on nodes A, B and C; PVM task 1 and the new PVM task 2 on node B]
Restart checkpointed task 2 on node B. Resume communications. Re-open and re-position files.
18. Connection flushing
[Time-sequence diagram: source PVMD, migrating task, remote task and remote PVMD; SIGURG and TC_MOVED notify the tasks, SIGUSR1 triggers the migration, TC_EOC messages flush the direct and daemon connections, the task replies with TM_MIG, the connections are closed (EOF) and the checkpoint is taken]
19. Connection flushing
- all tasks are notified with SIGURG and a TC_MOVED message
- migrating task M sends TC_EOC messages via all direct connections
- tasks reply to the TC_EOC messages of M
- direct connections are closed
- source PVM daemon sends a TC_EOC message to M
- migrating task M replies to the daemon with TM_MIG
- task-daemon connection is closed (see the sketch below)
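A minimal sketch of this sequence as seen from the migrating task, assuming hypothetical helpers (send_ctrl, recv_ctrl, direct_conn, daemon_conn) that stand in for the real DPVM message routines:

    #include <unistd.h>

    enum ctrl_msg { TC_MOVED, TC_EOC, TM_MIG };
    extern void send_ctrl(int conn, enum ctrl_msg m);      /* illustrative */
    extern enum ctrl_msg recv_ctrl(int conn);              /* illustrative */

    extern int direct_conn[];   /* open direct connections to other tasks */
    extern int n_direct;
    extern int daemon_conn;     /* connection to the source PVM daemon */

    /* Runs in the migrating task M after the migration signal arrives. */
    void flush_connections(void)
    {
        /* send an end-of-connection marker over every direct connection */
        for (int i = 0; i < n_direct; i++)
            send_ctrl(direct_conn[i], TC_EOC);

        /* drain each connection until the peer's TC_EOC reply, then close */
        for (int i = 0; i < n_direct; i++) {
            while (recv_ctrl(direct_conn[i]) != TC_EOC)
                ;
            close(direct_conn[i]);
        }

        /* the source daemon sends TC_EOC; reply with TM_MIG and close */
        while (recv_ctrl(daemon_conn) != TC_EOC)
            ;
        send_ctrl(daemon_conn, TM_MIG);
        close(daemon_conn);

        /* no connection can deliver new data now: safe to checkpoint */
    }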
20. Special considerations
- Critical sections
- signal blocking and unblocking (see the sketch below)
- Blocking calls
- modifications to low-level mxfer function
- Out-of-order fragments and messages
- message forwarding and sequencing
- Messages partially sent on migration
- if via direct connections, re-send entirely
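A minimal sketch of the critical-section protection, assuming SIGUSR1 is the checkpoint/migration signal as on the earlier slides; enter_critical and leave_critical are illustrative names:

    #include <signal.h>

    static sigset_t ckpt_sigs;

    /* Block the checkpoint signal while the library is in a
       non-reentrant critical section. */
    static void enter_critical(void)
    {
        sigemptyset(&ckpt_sigs);
        sigaddset(&ckpt_sigs, SIGUSR1);
        sigprocmask(SIG_BLOCK, &ckpt_sigs, NULL);
    }

    /* Unblock on exit; a pending checkpoint signal is delivered here. */
    static void leave_critical(void)
    {
        sigprocmask(SIG_UNBLOCK, &ckpt_sigs, NULL);
    }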
21. Performance
- Migration speed largely dependent on the speed of the shared file system
- and that depends mostly on the network
- NFS over 100 Mbps Ethernet:
- 0.4 s < Tmig < 15 s for 2 MB < image size < 64 MB
- Communication speed reduced due to added overhead
- 25% for 1-byte direct messages
- 2% for 100 KB indirect messages
22. Migration (Linux)
23. Ping-pong experiment (Linux)
24. Migration decider
[Diagram: migration decider, configuration file, master monitor and PVMD]
25. Decider
- Cost of a configuration derived from a weighted sum of the following (sketched below):
- average CPU load
- average memory load
- migrations
- Use of maximum instead of average optional
- accounts for interdependence of tasks
- Branch-and-bound search
- Upper bound on search time
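A minimal sketch of such a cost function; node_stats and the weights w_cpu, w_mem and w_mig are illustrative assumptions, not the actual decider parameters:

    /* Per-node statistics gathered by the monitors (illustrative layout). */
    struct node_stats {
        double cpu_load;    /* average, or maximum if that option is used */
        double mem_load;
    };

    /* Cost of a candidate task-to-node configuration: a weighted sum of
       the load terms plus a penalty per migration needed to reach it.
       The branch-and-bound search minimises this value. */
    double config_cost(const struct node_stats *nodes, int n_nodes,
                       int n_migrations,
                       double w_cpu, double w_mem, double w_mig)
    {
        double cost = w_mig * n_migrations;
        for (int i = 0; i < n_nodes; i++)
            cost += w_cpu * nodes[i].cpu_load + w_mem * nodes[i].mem_load;
        return cost;
    }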
26. Three environments
- The progress of a test program
- Undisturbed (PVM)
- Disturbed and migrated (DPVM migration)
- Disturbed but not migrated (DPVM load)
27. NAS CG Benchmark
28. 3 tasks in an FE code
29. Status
- Checkpointer operational under
- Solaris 2.5.1 and higher (UltraSparc, 32-bit)
- Linux/i386 2.0 and 2.2 (libc5 and glibc 2.0)
- PVM 3.3.x applications supported, tested on
- Pam-Crash (ESI) - car crash simulations
- CEM3D (ESI) - electromagnetics code
- Grail (UvA) - large, simple FEM code
- NAS parallel benchmarks
30. Dynamite and the Grid
- Critical analysis of usefulness nowadays
- Popular computing platform: Beowulf clusters
- Typical cluster management strategy: space sharing
- Checkpointing multiple tasks or even the whole parallel application quite useful for fault tolerance or cross-cluster migration
- File access presents complex problems
- Dynamic resource requests to the grid
31. Road to Dynamite-G
- Study and solve issues for cross-cluster migration
- No shared file system
- Authentication
- Basic infrastructure stays the same; we only use some of the Grid services (remote file access, monitoring, scheduling)
- Full integration with Globus (job submission, job management, security)
- Globus is a moving target
32. Cross-cluster checkpointing
[Diagram: PVM daemons on nodes A, B and C; PVM task 1, a helper task on node B, and the migrating task with its program/PVM/checkpointer layers]
Send checkpoint signal to task 2. Flush connections, close files. Checkpoint task to disk via the helper task (see the sketch below).
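A minimal sketch of streaming the checkpoint to a helper task over a socket instead of writing it to a shared file system; the helper's address and dump_segment_to_fd are illustrative assumptions:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Writes one memory segment to a descriptor, keeping the same ELF
       checkpoint format as the file-based mechanism (illustrative). */
    extern void dump_segment_to_fd(int fd, const char *segment);

    /* Connect to the helper task (e.g. on the target cluster) and stream
       the checkpoint; the helper stores it on its local disk. */
    int checkpoint_via_helper(const char *helper_ip, int helper_port)
    {
        int s = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port = htons(helper_port);
        inet_pton(AF_INET, helper_ip, &addr.sin_addr);
        if (s < 0 || connect(s, (struct sockaddr *)&addr, sizeof addr) < 0)
            return -1;

        dump_segment_to_fd(s, "text");
        dump_segment_to_fd(s, "data");
        dump_segment_to_fd(s, "heap");
        dump_segment_to_fd(s, "stack");
        close(s);
        return 0;
    }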
33. Socket- and file-based migration in a single cluster
34. Nodes in two different clusters
35. Performance of socket migration
- Target file format retained
- Usually, transfer to local disk (/tmp) most efficient
- For migration to local disk, no network link is crossed more than once
- Performance depends on network speed, local disk speed and memory (cache) of the target machine
- Performance compares well to the original mechanism (checkpoint to file on file server)
- Consider mechanism as standard, also for in-cluster migration
36. Issues for file access
- Moving open files with tasks appears the least complicated solution, but
- Tasks may open and close files: required files unknowable at time of migration
- Tasks may share a file
- Files need to be returned after task completion
- Connect to proxy file server on source cluster
- Security issues
- Performance
37. Some other open issues
- Checkpointing and restarting entire programs
- Saving communication context
- Checkpoint-and-stay-alive
- Cross-cluster migration (target cluster known)
- Monitoring and scheduling
- Migration cost vs. performance gain
- Migrating tasks vs. migrating entire programs
- Grid
- When to start looking for a new cluster
- How best to use available mechanisms
38. Full integration with Globus
- Upgrade our checkpointer
- Existing Grid-enabled implementation of MPI, MPICH-G2, does not use the "ch_p4" communication device; it uses its own "globus2" device
- Start from scratch?
- Support most of the fancy features in MPICH-G2, such as heterogeneity?
39. Conclusions
- Migration of tasks allows
- optimal task allocation in dynamic environment
- freeing of nodes
- Dynamite addresses the problem of migrating tasks in parallel programs
- dynamically linked programs with open files
- direct and indirect PVM connections
- MPI expected in near future
- scheduler needs further work
- Slight performance penalties in communication and migration
- The road to Dynamite-G is long, but appears worthwhile
40. Collaborations
- UvA
- Kamil Iskra
- Dick van Albada
- ESI
- Jan Clinckemaillie, Henri Luzet
- Genias
- Ad Emmen
- Univ. Indonesia, Jakarta
- Chan Bassaruddin, Judhi Santoso
- IT Bandung
- Bobby Nazief, Oerip Santoso
- Univ. of Mining and Metallurgy, Krakow
- Marian Bubak, Darius Zbik
- Univ. Wisconsin
- Miron Livny
- State Univ. Mississippi
- Ioana Banicescu