Title: Grid-enabled MPI Library Tutorial
1. Grid-enabled MPI Library (Tutorial)
- Kyung-Lang Park
- Ph.D. Candidate, Department of Computer Science, Yonsei University
- (2005. 4. 13)
2. Contents
- What is MPI?
- What is the Grid?
- MPICH-G2
- Installation
- Compilation with MPICH-G2
- Launching MPI applications (mpirun)
- Initializing processes (MPI_Init)
- Communication
- Simple send and receive
- Collective communication
- MPICH-GX
3. What is MPI?
- MPI is a library specification for message passing (an API), not a library implementation
  (http://www-unix.mcs.anl.gov/mpi/)
- The goal of MPI
- Development of a widely used standard for writing message-passing programs
- Distinguish the MPI specification from MPI implementations!
4. MPI Implementations
- MPICH: A Portable Implementation of MPI
- Current version is 1.2.6
- Base of several MPI implementations
- winMPICH
- MPICH-VM
- MPICH-G2
- MPICH2
- OpenMPI (full MPI-2 standard)
- PACX-MPI (PArallel Computer eXtension)
- LAM/MPI
- FT-MPI: Harness Fault-Tolerant MPI
- LA-MPI: Network Fault-Tolerant MPI (Los Alamos)
- MPI/Pro
- IBM's MPI
- SGI's MPI
5. What is the Grid?
- Definition
- A Grid is a system that coordinates distributed resources using standard, open, general-purpose protocols and interfaces to deliver nontrivial qualities of service
- Distinctive features
- Multiple administrative domains
- Heterogeneous resources
- Includes wide-area networks
- Exposed to many kinds of faults
- Includes non-dedicated resources
6. MPICH-G2
- A Grid-enabled MPI implementation
- Allows a user to run MPI programs across multiple computers at different sites using the same command that would be used on a parallel computer
- The library extends MPICH to use services provided by the Globus Toolkit
- Does not address other issues
7. Installation
- MPICH-G2 is included in the current version of MPICH
- MPICH includes various devices (p4, g2, ...)
- configure decides the device type
- make and make install build the MPI library and tools
- G2 installation should be performed after installing the Globus Toolkit
[Figure: the MPI source distribution (src, mpid/{p4, g2, ...}, mpe, romio, include, example, doc, www, Makefile); configure builds the headers (mpichconf.h, mpid.h), then make / make install produce the MPI library under /usr/local/mpich_g2 (bin, include, lib, doc)]
8. Compilation
- Compilation is not so different from other MPI implementations
- mpicc [options] source-file
- mpicc -o cpi cpi.c
- mpicc is gcc (or cc) with the installed MPI library linked in:
- gcc -o cpi cpi.c -lmpichg2
9. Launching MPI Applications
- Launching is complex
- Because we need to cross multiple administrative domains
- Requires understanding of the Globus Toolkit
- Launching an application with mpirun
- mpirun generates a Globus RSL script from its parameters
- The generated RSL is passed to globusrun
- globusrun sends sub-requests to the servers
- The gatekeeper on each server receives the sub-request from globusrun
- Mutual authentication is performed
- The gatekeeper passes the request to globus-job-manager
- globus-job-manager forks the process
- Directly
- Or via PBS support
10. Operation Flow of Launching MPI Applications
[Figure: on the user machine, mpicc compiles the source file and links a.out against the MPI library (MPICH); mpirun (e.g. mpirun -np 2 -machinefile m a.out) generates an RSL script and passes it to globusrun (DUROC); globusrun sends requests to the gatekeeper on each remote machine; each gatekeeper starts a globus-job-manager, which forks an MPI process; the processes dynamically link the Globus 2.x libraries (COMMON, NEXUS, DUROC, DUCT, RSL, IO, GSI, MYJOB) and synchronize at a DUROC barrier]
11. MPI Application
12. MPI Initialization
- Goal: initialize the global variables
- Core variables
- struct MPIR_COMMUNICATOR MPIR_COMM_WORLD
- PtrToIdx PtrArray[MAX_PTRS]
- struct channel_t CommworldChannels
- struct commworldchannels CommWorldChannelsTable
- globus_byte_t MyGlobusGramJobContact
- globus_byte_t GramJobContactsVector
- int MPID_MyWorldSize
- int MPID_MyWorldRank
- globus_handle_t Handle
13. Flow Chart
[Figure: MPI_Init() calls MPIR_Init(), which calls MPID_Init(); MPID_Init() runs globus_init(): globus_module_activate, get base variables, get_topology(), create_my_miproto(), distribute_byte_array(), build_channel(), select_protocols(), make a unique name, fill CommWorldChannelsTable and GramJobContactsVector; then MPIR_Topology_Init(), MPIR_Init_dtes(), MPIR_Errhandler_create(), MPIR_GROUP_EMPTY, MPIR_COMM_WORLD, and MPIR_COMM_SELF are set up, and MPIR_Return() ends MPI_Init()]
14. Three Parts of Initialization
- Device initialization (globus_init)
- globus_module_activate
- Build basic information (rank, nprocs)
- Set my protocol information
- Create the channel table
- Etc.
- Initialize datatypes and error handlers
- Create the default communicators
- MPIR_COMM_WORLD
15. globus_module_activate
- Brings up the runtime libraries
- globus_module_activate(module name)
- GLOBUS_DUROC_RUNTIME_MODULE
- GLOBUS_COMMON_MODULE
- GLOBUS_IO_MODULE
- GLOBUS_NEXUS_MODULE
16. Make Basic Information (1)
- Get base information
- TCP buffer size
- globus_module_getenv("MPICH_GLOBUS2_TCP_BUFFER_SIZE")
- Saved into MpichGlobus2TcpBufsz
- rank_in_my_subjob
- globus_duroc_runtime_intra_subjob_rank()
- my_subjob_size
- globus_duroc_runtime_intra_subjob_size()
17. Make Basic Information (2)
- Call get_topology()
- Initializes the topology variables
- subjob_addresses
- nprocs: number of total processes
- nsubjobs: number of subjobs
- my_grank: group rank
- Uses the GLOBUS_DUROC API
- Divided into a master part and a slave part
18. Make Basic Information (3)
- get_topology() - subjob slave
- Receives a message from the subjob master
- intra_subjob_receive
- globus_duroc_runtime_intra_subjob_receive
- Gets nprocs and my_grank from the message
19. Make Basic Information (4)
- get_topology() - subjob master
- Getting the subjob layout
- globus_duroc_runtime_intra_subjob_structure()
- my_subjob_addr, nsubjobs, subjob_addresses
- Finding the index of subjob master 0
- It is the one with the lowest address
- sj0_master_idx
- Calculate my duroc_subjobmaster_rank
- Getting GLOBUS_DUROC_SUBJOB_INDEX
- rsl_subjob_rank
20. Make Basic Information (5)
- get_topology() - subjob master (not root)
- Make a message
- duroc_subjobmaster_rank, rsl_subjob_rank, my_subjob_size
- Send the message to subjob master 0
- globus_duroc_runtime_inter_subjob_send
- Receive a message
- nprocs, my_grank
- Make a message and send it to the slaves
21. Make Basic Information (6)
- get_topology() - subjob master (root)
- Sort subjob_addresses
- Receive messages from the other subjob masters
- globus_duroc_runtime_inter_subjob_receive()
- Build rsl_ranks and job_sizes
- Calculate nprocs and everyone's g_rank based on rsl_ranks and job_sizes
- nprocs: sum of job_sizes
- g_rank: sum of the job_sizes that come before mine
- Send messages to the subjob masters
- nprocs, g_ranks
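The root master's arithmetic above can be sketched in plain C. This is a simplified model with hypothetical array names, not the actual MPICH-G2 code:

```c
#include <assert.h>

/* Simplified model of the root master's calculation: nprocs is the
 * sum of all subjob sizes, and a process's global rank is the sum of
 * the sizes of all subjobs ordered before its own, plus its rank
 * inside its own subjob. */
static int total_nprocs(const int *job_sizes, int nsubjobs)
{
    int nprocs = 0;
    for (int i = 0; i < nsubjobs; i++)
        nprocs += job_sizes[i];
    return nprocs;
}

static int global_rank(const int *job_sizes, int my_subjob,
                       int rank_in_my_subjob)
{
    int g_rank = rank_in_my_subjob;
    for (int i = 0; i < my_subjob; i++)   /* subjobs before mine */
        g_rank += job_sizes[i];
    return g_rank;
}
```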
22. Set My Protocol
- Getting the network address and mask
- net_addr, net_mask
- Getting the interface address
- Make a passive socket and listen using globus_io
- Assigns port and Handle
- globus_io_tcp_listen()
- Getting more variables
- Make the byte stream (my_miproto)
- s_tcptype, hostname, s_port, lan_id, localhost_id
23. Create Channel Table (1)
- Distribute my_miproto to all processes
- distribute_byte_array(..., array, vector, ...)
- Flow
- Each master gathers messages from its slaves
- intra_subjob_gather
- The masters exchange the messages
- globus_duroc_runtime_inter_subjob_send (recv)
- Each master broadcasts the messages to all its slaves
- intra_subjob_bcast
- Build mi_proto_vector (array)
24. mi_proto_vector
[Figure: mi_proto_vector is an array indexed 0 .. nprocs-1; each entry holds one process's serialized protocol information: s_nprotos, s_tcptype, hostname, lan_id_lng, lan_id]
25. Create Channel Table (2)
- Call build_channel()
- Allocate channels (array)
- Set the values of each channel
- channel[i].proto_list = mp
- mp->info = tp
- Set tp's variables using mi_proto_vector
26. Create Channel Table (3)
- select_protocols()
- For all destinations i
- Set channel[i].selected_proto to a protocol usable by both source and destination
27. CommworldChannels
[Figure: struct channel_t CommworldChannels is an array indexed 0 .. nprocs-1; each channel has a proto_list and a selected_proto (struct miproto_t: type (tcp, mpi, unknown), void *info, next); the TCP info block holds hostname, port, handlep, whandle, header, to_self, connection_lock, connection_cond, attr, cancel_head, cancel_tail, send_head, send_tail]
28. Make Unique Name
- my_grank == 0
- Makes the unique name (my_commworld_id)
- hostname + getpid()
- my_grank != 0
- my_commworld_id = NULL
- distribute_byte_array
- The results are combined into my_commworld_id_vector
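A minimal sketch of building such a hostname+pid name in C; the "<hostname>:<pid>" format here is illustrative only, not necessarily the exact format MPICH-G2 uses:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build a commworld id of the form "<hostname>:<pid>".  Process 0
 * creates it; everyone else learns it via distribute_byte_array. */
static void make_commworld_id(char *buf, size_t len,
                              const char *hostname, long pid)
{
    snprintf(buf, len, "%s:%ld", hostname, pid);
}
```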
29. CommWorldChannelsTable
- Set CommWorldChannelsTable[0]
- .nprocs = MPID_MyWorldSize
- .name = my_commworld_id_vector[0]
- .channels = CommworldChannels
30. GramJobContactsVector
- Get MyGlobusGramJobContact
- GLOBUS_GRAM_JOB_CONTACT
- distribute_byte_array()
- The results are combined into GramJobContactsVector
- End of globus_init
- End of MPID_Init
31. End of globus_init()
32. The Rest of Initialization
- MPIR_Init_queue
- MPIR_Init_dtes()
- MPIR_Errhandler_create()
- Make the default communicator groups
- MPIR_COMM_WORLD
33. Init Global Variables (1)
- MPIR_Init_queue()
- Allocates MPIR_Topo_els
- MPIR_Init_dtes()
- Sets up the MPI datatypes
- Sets a map from each MPI_Datatype to a basic type
- MPI_FLOAT -> float
34. Init Global Variables (2)
- MPIR_Errhandler_create()
- 3 classes of error handling
- MPIR_Errors_are_fatal
- MPIR_Errors_return
- MPIR_Errors_warn
- Make an errhandler object (new)
- Register the function to the object
- Register the pointer into PtrArray
- MPIR_RegPointerIdx(errhandler (const), new)
35. Create Default Communicator
- MPIR_COMM_WORLD
- Allocate MPIR_COMM_WORLD
- Register the pointer into PtrArray with MPI_COMM_WORLD (const)
- Set all variables in MPIR_COMM_WORLD
- comm_type, ADIctx, group, ...
- Predefined attributes for MPI_COMM_WORLD
- Create topology information (globus2 only)
36. MPI Communication (cont.)
- Preparing communication - MPI_Init()
- Get basic information
- Gather information from each process
- Create the channel table
- Make a passive socket
- Register the listen_callback() function
- Sending a message - MPI_Send()
- Get protocol information from the channel table
- Open a socket using globus_io
- Write the data to the socket using globus_io
37. MPI Communication
- Receiving a message - listen_callback
- Accept the socket connection
- Read the data from the socket
- Copy the data into the recv queue
- Receiving a message - MPI_Recv(..buf..)
- Search the recv queue
- Copy the data from the recv queue to buf
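The interplay of the receive queues described above can be modeled in plain C. This is a toy sketch with hypothetical names, not the MPID_recvs implementation: a message arriving before its matching receive sits on the "unexpected" queue, and MPI_Recv searches that queue first:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the unexpected queue used by listen_callback/MPI_Recv. */
struct msg { int src, tag; const char *data; struct msg *next; };

static struct msg *unexpected = NULL;

static void deliver(struct msg *m)            /* listen_callback side */
{
    m->next = unexpected;
    unexpected = m;
}

static const char *try_recv(int src, int tag) /* MPI_Recv side */
{
    for (struct msg **p = &unexpected; *p; p = &(*p)->next) {
        if ((*p)->src == src && (*p)->tag == tag) {
            struct msg *m = *p;
            *p = m->next;                     /* unlink from unexpected */
            return m->data;                   /* copy to the user's buf */
        }
    }
    return NULL; /* no match: would be appended to the posted queue */
}
```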
38. MPI Communication
[Figure: Process A (rank 3) sends to Process B (rank 5): 1. B creates a passive socket; 2. A gets B's information from COMMWORLDCHANNEL (rank 5: selected protocol tcp); 3. A gets the protocol information (hostname, port, handlep, whandle, header, to_self, connection_lock, connection_cond, attr, cancel_head, cancel_tail, send_head, send_tail); 4. A makes a socket for writing; 5. connection; 6. B accepts the socket; 7. A sends the data; 8. B's listen_callback is called; 9. the data is copied to the unexpected queue of MPID_recvs; 10. it is moved to the posted queue; 11. the original buffer is deleted; 12. MPI_Recv reads the data from the posted queue]
39. Collective Communications
- G2 uses multi-level collective communications
- Comparison
- Old MPI provides only a 1-level topology
- Assumes all nodes have the same latency
- MagPIe provides 2 levels
- WAN, LAN
- G2 provides 4 levels
- WAN, LAN, intra-TCP, vendor MPI
- Topology discovery
- Assume all processes can communicate over the WAN
- The user specifies the LAN_ID
- A subjob means the processes use intra-TCP
- jobtype=mpi means the processes can use vendor MPI
40. Multiple-Level Collective Communication
- Exploiting the hierarchy
- WAN_TCP < LAN_TCP < intra-TCP < vendor MPI
[Figure: p0-p9 run on m1.utech.edu and p10-p19 on a second utech.edu machine, connected by LAN_TCP, with intra-TCP and vendor MPI inside each machine; p20-p29 run on c1.nlab.gov, reached over WAN_TCP; a broadcast proceeds by level: 1. WAN_TCP level: P0, P20; 2. LAN_TCP level: P0, P10; 3. intra-TCP level: P10, ..., P19; 4. vendor-MPI level: P0-P9 and P20-P29]
41. Multiple-Level Collective Communication
  ( (resourceManagerContact="m1.utech.edu")
    (count=10) (jobtype=mpi) (label="subjob 0")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                 (GLOBUS_LAN_ID foo))
    (directory=/homes/users/smith)
    (executable=/homes/users/smith/myapp) )
  ( (resourceManagerContact="m2.utech.edu")
    (count=10) (label="subjob 1")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                 (GLOBUS_LAN_ID foo))
    (directory=/homes/users/smith)
    (executable=/homes/users/smith/myapp) )
  ( (resourceManagerContact="c1.nlab.gov")
    (count=10) (jobtype=mpi) (label="subjob 2")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
    (directory=/users/smith)
    (executable=/users/smith/myapp) )
42. Flow of Multilevel Collective Operations
- Make the topology (MPI_Init)
- Getting the variables
- comm->globus_lan_id
- comm->localhost_id
- cluster_table()
- comm->Topology_Depths
- comm->Topology_ColorTable
- comm->Topology_ClusterSizes
- comm->Topology_ClusterIds
- comm->Topology_ClusterSets
- Update the topology (before collective operations)
- update_cluster_table()
- Make multiple_set_t (before collective operations)
- involve_sets()
43. Getting Variables
- globus_lan_id
- MPI_Init
- create_my_miproto()
- globus_libc_getenv("GLOBUS_LAN_ID")
- localhost_id
- MPI_Init
- create_my_miproto()
- globus_libc_getenv("GLOBUS_DUROC_SUBJOB_INDEX")
- atoi(duroc_subjob)
44. cluster_table()
- Topology_Depths
- get_channel(my_rank)
- Topology_depth(my_rank): number of protocols
- Topology_ColorTable
- level 0: all processes have the same color
- level 1: globus_lan_id
- level 2: localhost_id
- level 3: jobtype
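The per-level coloring can be sketched in plain C. This is a toy model: the input arrays are hypothetical stand-ins for what the real code derives from GLOBUS_LAN_ID and the subjob index, and only the first three levels are shown:

```c
#include <assert.h>

#define NPROCS 6
#define LEVELS 3

/* Toy model of Topology_ColorTable: level 0 gives every process the
 * same color (all share the WAN), level 1 colors by LAN id, and
 * level 2 colors by host id. */
static void build_color_table(int color[LEVELS][NPROCS],
                              const int *lan_id, const int *host_id)
{
    for (int p = 0; p < NPROCS; p++) {
        color[0][p] = 0;          /* level 0: everyone in the WAN */
        color[1][p] = lan_id[p];  /* level 1: same LAN, same color */
        color[2][p] = host_id[p]; /* level 2: same host, same color */
    }
}
```

With the six-process layout of the example on the next slide (four processes on one LAN spread over two hosts, two on another host), this reproduces the color table shown there.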
45. Examples
[Figure: six processes over the WAN; four in LAN cluster101 (on two hosts) and two in cluster201 (on host mercury)]
Topology_ColorTable -> procs
level 0: 0 0 0 0 0 0
level 1: 0 0 0 0 1 1
level 2: 0 0 1 1 2 2
46. cluster_table() (cont.)
- Topology_ClusterIds
- Classifies the processes that share the same color
Topology_ClusterIds -> procs
level 0: 0 0 0 0 1 1
level 1: 0 0 1 1 0 0
level 2: 0 1 0 1 0 1
47. cluster_table() (cont.)
- Topology_ClusterSizes
- Number of processes having the same color at this level
- Cluster_Sets
- Memory for the set of master processes
48. update_cluster_id_table()
- Updates the topology variables for a nonzero root
- If a cid is not zero, rotate the cids
- Updates the hidden communicator
Before rotation (the arrow marks the root):
Topology_ClusterIds -> procs
level 0: 0 0 0 0 1 1
level 1: 0 0 1 1 0 0
level 2: 0 1 0 1 0 1
49. update_cluster_id_table()
After rotation (the root's cluster id becomes 0 at every level):
Topology_ClusterIds -> procs
level 0: 0 0 0 0 1 1
level 1: 1 1 0 0 0 0
level 2: 0 1 1 0 0 1
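The rotation can be modeled in plain C. This is a toy sketch, not the MPICH-G2 code; taking the root to be process 3 (an inference from the rotated values) reproduces the before/after tables shown on these two slides:

```c
#include <assert.h>

#define NPROCS 6

/* Toy model of rotating cluster ids for a nonzero root: within the
 * color group that contains the root, every cid is shifted so that
 * the root's cid becomes 0; other color groups are untouched. */
static void rotate_cids(int *ids, const int *colors, int root)
{
    int n = 0;                    /* cluster count in the root's group */
    for (int p = 0; p < NPROCS; p++)
        if (colors[p] == colors[root] && ids[p] + 1 > n)
            n = ids[p] + 1;
    int shift = ids[root];
    for (int p = 0; p < NPROCS; p++)
        if (colors[p] == colors[root])
            ids[p] = (ids[p] - shift + n) % n;
}
```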
50. involve_set()
- Makes a useful data structure for collective communication
- The topology variables are converted into struct multiple_set_t
51. MPI_Bcast (cont.)
[Figure: MPI_Bcast(buf, comm) → comm_ptr = MPIR_To_Pointer(comm) → comm_ptr->collops->Bcast(buf); if type == MPI_INTRA, Intra_Bcast(buf), otherwise Inter_Bcast(buf) (not supported yet); with #ifdef MPID_Bcast(), Intra_Bcast(buf) dispatches to MPID_FN_Bcast(buf), the topology-aware bcast, otherwise the binomial bcast is used]
52. MPI_Bcast (cont.)
[Figure: MPID_FN_Bcast(buf) → involve(comm, set_info), allocate request, then for all sets in set_info: flat_tree_bcast(buf) at level 0, binomial_bcast(buf) otherwise; if I am the root in this set I MPI_Isend(buf), otherwise I MPI_Recv(buf) from my parent and MPI_Send(buf) onward]
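A binomial_bcast schedule of the kind named above can be sketched in plain C. This is a generic binomial tree rooted at rank 0, shown for illustration; MPICH-G2's actual send order within a set may differ:

```c
#include <assert.h>

/* Parent a non-root rank receives from in a binomial tree rooted at
 * rank 0: the rank with its lowest set bit cleared. */
static int binomial_parent(int rank)
{
    int mask = 1;
    while (!(rank & mask))
        mask <<= 1;
    return rank ^ mask;
}

/* Fill children[] with the ranks `rank` forwards the message to;
 * returns how many there are. */
static int binomial_children(int rank, int size, int *children)
{
    int n = 0, mask = 1;
    while (!(rank & mask) && mask < size) {
        if (rank + mask < size)
            children[n++] = rank + mask;
        mask <<= 1;
    }
    return n;
}
```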
53. struct multiple_set_t
[Figure: set_info for rank 0 holds num sets, each with set, size, level, root_index, my_rank_index: set {0, 20} (WAN level), set {0, 10} (LAN level), set {0..9} (intra level); set_info for rank 10 holds set {0, 10} and set {10..19}]
54. MPICH-GX
- A modified version of MPICH-G2
- Private IP support (K.R. Park)
- File-based initialization (K.W. Koh)
- mpid/globus2/init_g.c
- Efficient collective operations (K.L. Park)
- mpid/globus2/topology_intra_fns.c
- Message compression (H.J. Lee)
- You should understand how we modified MPICH-G2
55. (No transcript)
56. Next Seminar
- Homework
- Install Globus and MPICH-G2
- Compile and launch cpi
- Print the source code of MPICH-G2
- Understand the operation flow
- Initialization
- Communication
- Collective communication
- Modify the code
- Insert code printing your name when initializing an MPI process
- Insert code printing the message transfer count
- Rebuild the library
- Compile and launch cpi
- Understand the operation flow
- File-based initialization
- Proxy-based communication
- Modify the code
- File format of initialization
- Stabilizing the communication code
57. Backup - ADI of MPICH
- MPICH can support various device types through the ADI (Abstract Device Interface)
- The ADI is a small set of function definitions
- C functions or macro definitions
- All MPI functions are implemented via the ADI
- The 4 basic functions of the ADI
- Specifying a message to be sent or received
- Moving data between the API and the hardware
- Managing lists of pending messages
- Providing basic information about the execution environment
58. Backup - MPICH Layers
[Figure: three layers - the MPI layer (MPI_Send, MPI_Recv, MPI_Isend, ...) calls the ADI layer (MPID_SendDatatype, MPID_IsendDatatype, MPID_RecvDatatype, MPID_IrecvDatatype), which is implemented by a device such as globus2 (g_malloc, globus_mutex_lock, globus_dc_put, globus_io_write, g_free) or ch_p4]
59. Backup - Naming Conventions
- MPI_ (MPI_SEND)
- Functions defined by the MPI standard that user programs call directly
- MPIR_ (MPIR_GET_DTYPE_PTR)
- Internal functions used to implement the MPI_ functions, independent of the device
- MPID_ (MPID_SendDatatype)
- Device-dependent functions used to implement the MPI_ functions, implemented separately by each device
60. Backup - Using ADI
- Problem: how to support Draw3D() on a graphics device that cannot draw in 3D
- ADI approach
- Define a macro interface: Draw_Rect3D(), DRAW_MACRO_3D(), ...
- On a device with 3D support, DRAW_MACRO_3D() calls the device's Draw3D() directly
- On a device without it, DRAW_MACRO_3D() emulates it with the basic calls (for i = 0 to 2: Draw())
61. ADI Hierarchy (1)
[Figure: ADI hierarchy with the globus2 device]
62. ADI Hierarchy (2)
63. Backup - DUROC Subjob
  ( (resourceManagerContact="grid.yonsei.ac.kr")
    (count 4)
    (label="subjob 0")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                 (LD_LIBRARY_PATH /usr/local/globus/lib/))
    (directory=/usr/local..) (executable=/usr/local..) )
  ( (resourceManagerContact="cluster.yonsei.ac.kr")
    (count 2)
    (label="subjob 1")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1))
    ... )
  ( (resourceManagerContact="dcc.sogang.ac.kr")
    (count 2)
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
    ... )
  ( (resourceManagerContact="venus.kisti.re.kr")
    (count 4)
    ... )
64. Backup - DUROC Subjob
[Figure: DUROC co-allocates four subjobs (sizes 4, 2, 2, 4; SUBJOB_INDEX 0-3) across grid.yonsei.ac.kr, cluster.yonsei.ac.kr, dcc.sogang.ac.kr, and venus.kisti.re.kr; each process is labeled with its subjob index and local rank]