Title: MPI COLLECTIVE COMMUNICATIONS FOR GRIDS (broadcast)
1. MPI COLLECTIVE COMMUNICATIONS FOR GRIDS (broadcast)
- Progress Report
- Rakhi Gupta
- MSc (Engg.), SERC
2. Motivation
- Collective operations
- Topology-aware approaches based on network hierarchy (MPICH-G)
- MagPIe: one-WAN-link policy
- An indirect link might be faster than the direct link
3. Motivation (continued)
- Static techniques: the communication tree does not adapt dynamically to changes in network characteristics.
- Objective: to develop a system that
  - converges to the optimal tree
  - is adaptive
4. Key observation
- A broadcast operation with the same set of nodes and the same root may occur many times, within the same application or even across applications.
[Figure: broadcast performance vs. time; topology-aware broadcast on the first run of the application compared with the optimized broadcast on the i-th run]
5. Strategy
- Single-step transformation trans(2,4): select 2 random node numbers (here 2 and 4) and apply the transformation to the current tree.
[Figure: an initial tree over nodes 2-6 and the single-step modified tree produced by trans(2,4)]
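The transformation is only shown pictorially, so the following is a minimal sketch of one plausible reading: it assumes trans(a,b) re-parents node a under node b in a child-to-parent map, and leaves the tree unchanged when the move would break the tree. The function name and representation are illustrative, not the report's code.

```python
def trans(tree, a, b):
    """Apply one single-step transformation to a broadcast tree.

    `tree` maps each non-root node to its parent.  This sketch assumes
    trans(a, b) re-parents node a under node b; the move is rejected
    (tree returned unchanged) when a is the root, a == b, or b lies in
    a's subtree, since any of these would break the tree structure.
    """
    if a == b or a not in tree:
        return dict(tree)
    # Walk from b up to the root; if we pass through a, then b is in
    # a's subtree and re-parenting a under b would create a cycle.
    x = b
    while x in tree:
        if x == a:
            return dict(tree)
        x = tree[x]
    new_tree = dict(tree)
    new_tree[a] = b
    return new_tree
```

On the flat initial tree of the figure (root 1 with children 2-6), applying trans to the pair (2, 4) moves node 2 under node 4.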
6. Algorithm
- Start with a flat tree t0.
- bad_link_list ← ∅.
- Tree_list ← {t0}.
- While performance is not satisfactory:
  - tj ← trans(a,b)(ti), where (a,b) ∉ bad_link_list and 0 ≤ i < size(Tree_list).
  - If time(tj) > time(ti): bad_link_list ← bad_link_list ∪ {(a,b)}.
  - If time(tj) < MIN_TIME(Tree_list) + threshold:
    - Remove the tree with MAX_TIME(Tree_list) from Tree_list.
    - Add tj to Tree_list.
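The loop above can be sketched end to end in simulation. This is not the report's implementation: link latencies and the cost model replace the real measured broadcast times, and the trans(a,b) semantics (re-parenting a under b) is an assumption, since the report only illustrates it.

```python
import random

def trans(tree, a, b):
    """Assumed single-step transformation: re-parent a under b in a
    child -> parent map; unchanged if that would break the tree."""
    if a == b or a not in tree:
        return dict(tree)
    x = b
    while x in tree:              # reject moves that create a cycle
        if x == a:
            return dict(tree)
        x = tree[x]
    new_tree = dict(tree)
    new_tree[a] = b
    return new_tree

def bcast_time(tree, latency, root):
    """Completion time of a broadcast along `tree` under a toy cost
    model: a parent sends to its children one after another, each send
    costing the link latency (a stand-in for measured times)."""
    children = {}
    for c, p in tree.items():
        children.setdefault(p, []).append(c)
    arrive, stack = {root: 0.0}, [root]
    while stack:
        p = stack.pop()
        t = arrive[p]
        for c in sorted(children.get(p, [])):
            t += latency[frozenset((p, c))]
            arrive[c] = t
            stack.append(c)
    return max(arrive.values())

def adapt(t0, latency, root, steps=200, threshold=0.0, pool=4, seed=0):
    """The slide's search loop: keep a small Tree_list, try random
    single-step transformations, blacklist pairs that hurt, and keep
    trees that beat MIN_TIME(Tree_list) + threshold."""
    rng = random.Random(seed)
    nodes = sorted(set(t0) | set(t0.values()))
    bad_link_list, tree_list = set(), [dict(t0)]
    for _ in range(steps):
        a = rng.choice([n for n in nodes if n != root])
        b = rng.choice([n for n in nodes if n != a])
        if (a, b) in bad_link_list:
            continue
        ti = tree_list[rng.randrange(len(tree_list))]
        tj = trans(ti, a, b)
        if bcast_time(tj, latency, root) > bcast_time(ti, latency, root):
            bad_link_list.add((a, b))
        if bcast_time(tj, latency, root) < \
                min(bcast_time(t, latency, root) for t in tree_list) + threshold:
            if len(tree_list) >= pool:
                tree_list.remove(max(tree_list,
                                     key=lambda t: bcast_time(t, latency, root)))
            tree_list.append(tj)
    return min(tree_list, key=lambda t: bcast_time(t, latency, root))
```

With three nodes where the direct link 1-3 is slow but 1-2 and 2-3 are fast, the search moves node 3 under node 2, which is exactly the "indirect link faster than direct link" case from the motivation.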
7. Architecture
- The Bcast service synchronizes experiments and the MPI application. An experiment finds the broadcast tree; the MPI application uses the tree produced by the experiment.
- IPC through shared memory.
- Only one experiment can execute at any instant.
- Many applications may execute simultaneously.
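The shared-memory handoff between the service and an application might look like the following minimal sketch, using Python's `multiprocessing.shared_memory`. The segment name, length-prefixed layout, and function names are all illustrative assumptions, not the report's actual implementation.

```python
import pickle
from multiprocessing import shared_memory

def publish_tree(name, tree, capacity=1024):
    """Bcast service side: write the current broadcast tree (a
    child -> parent map) into a named shared-memory segment,
    length-prefixed so readers know how many bytes to take."""
    data = pickle.dumps(tree)
    shm = shared_memory.SharedMemory(name=name, create=True, size=capacity)
    shm.buf[:4] = len(data).to_bytes(4, "little")
    shm.buf[4:4 + len(data)] = data
    return shm  # caller keeps it alive and unlinks it on shutdown

def read_tree(name):
    """MPI application side: attach to the segment by name and
    deserialize the tree the experiment produced."""
    shm = shared_memory.SharedMemory(name=name)
    n = int.from_bytes(shm.buf[:4], "little")
    tree = pickle.loads(bytes(shm.buf[4:4 + n]))
    shm.close()
    return tree
```

A real service would add locking around updates; this only shows the publish/attach direction of the data flow.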
8. Scenario 1
[Sequence diagram: an application contacts the Bcast service, which uses Find_exec_exp() to locate the executing experiment; stopping the experiment at j to run a broadcast with root i]
9. Scenario 2
[Sequence diagram: the Bcast service uses Find_recent() and Update(i); when the broadcast at i finishes, experiment k starts]
10. Broadcast Implementation
- If the node is the root node:
  - Note time t1.
  - Send to each direct successor: the data, the sub-tree rooted at that successor, and the root info.
  - Wait for acknowledgements from all the leaf nodes.
  - Note time t2 (all acks obtained).
  - Broadcast time is t2 - t1.
- Else:
  - Wait for data from the parent.
  - If the node is not a leaf node:
    - Send to each direct successor: the data, the sub-tree rooted at that successor, and the root info.
  - Else:
    - Send an acknowledgement to the root (null message packets).
- To get reliable measurements, experimental broadcasts are repeated many times.
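The control flow above can be traced in a single-process sketch. Plain Python recursion stands in for the actual MPI point-to-point sends, and the child-to-parent map is an illustrative representation; this only shows the shape of the protocol, not the report's MPI code.

```python
def tree_broadcast(tree, root, data):
    """Walk-through of the slide's protocol: each node receives the
    data (with the sub-tree rooted at it), forwards to its direct
    successors, and leaf nodes 'acknowledge' the root."""
    children = {}
    for c, p in tree.items():
        children.setdefault(p, []).append(c)
    received, acks = {}, []

    def deliver(node):
        received[node] = data          # stands in for recv from parent
        kids = children.get(node, [])
        if kids:
            for c in kids:             # send data + sub-tree to successor
                deliver(c)
        else:
            acks.append(node)          # leaf: null-message ack to root

    deliver(root)                      # root would time t1 .. t2 around this
    return received, acks
```

The root's measured broadcast time t2 - t1 would bracket the `deliver(root)` call: it starts the sends and ends when the last leaf acknowledgement arrives.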
11. References
- Thilo Kielmann, Rutger F. H. Hofman, Henri E. Bal, Aske Plaat, and Raoul A. F. Bhoedjang. MagPIe: MPI's Collective Communication Operations for Clustered Wide Area Systems.
- Nicholas T. Karonis, Ian Foster, et al. Exploiting hierarchy in parallel computer networks to optimize collective operation performance.