Grid-enabled MPI Library Tutorial (Transcript and Presenter's Notes)

1
Grid-enabled MPI Library (Tutorial)
  • Kyung-Lang Park
  • Ph.D. Candidate of the Department of Computer
    Science
  • Yonsei University
  • (2005. 4. 13)

2
Contents
  • What is MPI?
  • What is the Grid?
  • MPICH-G2
  • Installation
  • Compilation with MPICH-G2
  • Launching MPI applications (mpirun)
  • Initializing Processes (MPI_Init)
  • Communication
  • Simple Send and Receive
  • Collective Communication
  • MPICH-GX

3
What is MPI?
  • MPI is a library specification (not a library
    implementation) for message passing (an API)
    (http://www-unix.mcs.anl.gov/mpi/)
  • The goal of MPI
  • Development of a widely used standard for writing
    message-passing programs

Distinguish the MPI specification from an MPI
implementation!!!
4
MPI Implementations
  • MPICH: A Portable Implementation of MPI
  • Current version is 1.2.6
  • Base of several MPI implementations
  • winMPICH
  • MPICH-VM
  • MPICH-G2
  • MPICH2
  • Open MPI (full MPI-2 standard)
  • PACX-MPI (Parallel Computer Extension)
  • LAM/MPI
  • FT-MPI (HARNESS Fault-Tolerant MPI)
  • LA-MPI (Network Fault-Tolerant MPI, Los Alamos)
  • MPI/Pro
  • IBM's MPI
  • SGI's MPI

5
What is the Grid?
  • Definition
  • The Grid is a system that coordinates distributed
    resources using standard, open, general-purpose
    protocols and interfaces to deliver nontrivial
    qualities of service
  • Distinctive features
  • Multiple administrative domains
  • Heterogeneous resources
  • Includes wide-area networks
  • Exposed to many kinds of faults
  • Includes non-dedicated resources

6
MPICH-G2
  • A Grid-enabled MPI implementation
  • Allows a user to run MPI programs across
    multiple computers at different sites using the
    same command that would be used on a parallel
    computer
  • The library extends MPICH to use services
    provided by the Globus Toolkit
  • Does not address other issues

7
Installation
  • MPICH-G2 is included in the current version of MPICH
  • MPICH includes various devices (p4, g2, ...)
  • configure selects the device type
  • make and make install build the MPI library and
    tools
  • G2 installation should be performed after
    installing the Globus Toolkit

[Diagram: layout of the MPI source distribution (src, mpid/{p4, g2, ...}, include, mpe, lib, doc, romio, www, example, Makefile). configure builds the headers (mpichconf.h, mpid.h); make and make install build the MPI library and tools into the install tree, e.g. /usr/local/mpich_g2 (bin, include, doc, lib).]
8
Compilation
  • Compilation is not much different from other MPI
    implementations
  • mpicc [options] [source file]
  • mpicc -o cpi cpi.c
  • mpicc is gcc (or cc) with the installed MPI
    library linked in
  • gcc -o cpi cpi.c -lmpichg2

9
Launching MPI applications
  • Launching is complex
  • because we need to span multiple administrative
    domains
  • Needs an understanding of the Globus Toolkit
  • Launching an application with mpirun
  • mpirun generates a Globus RSL script from the
    parameters
  • The generated RSL is passed to globusrun
  • globusrun sends sub-requests to the servers
  • The gatekeeper on each server receives the sub-request
    from globusrun
  • Mutual authentication is performed
  • The gatekeeper passes the request to globus-job-manager
  • globus-job-manager forks the processes
  • Directly
  • Or through a local scheduler (e.g., PBS)

10
Operation Flow of Launching MPI Applications
[Diagram: operation flow of launching. On the user machine, mpicc compiles the source file and links the MPI library into a.out; mpirun (given a command and machine file, e.g. mpirun -np 2 -machinefile m a.out) generates an RSL script and passes the request to globusrun (DUROC). globusrun sends a request to the gatekeeper on each remote machine; the gatekeeper hands it to globus-job-manager, which forks the MPI processes (MPI Process 0, MPI Process 1, ...). Each process dynamically links the Globus 2.x libraries (COMMON, NEXUS, DUROC, DUCT, RSL, IO, GSI, MYJOB) and the MPICH MPI library, and the processes synchronize at a DUROC barrier.]
11
MPI Application
12
MPI Initialization
  • Goal: initialize global variables
  • Core variables
  • struct MPIR_COMMUNICATOR MPIR_COMM_WORLD
  • PtrToIdx PtrArray[MAX_PTRS]
  • struct channel_t CommworldChannels
  • struct commworldchannels CommWorldChannelsTable
  • globus_byte_t MyGlobusGramJobContact
  • globus_byte_t GramJobContactsVector
  • int MPID_MyWorldSize
  • int MPID_MyWorldRank
  • globus_handle_t Handle

13
Flow Chart
[Flow chart: MPI_Init() -> MPIR_Init() -> MPID_Init() -> globus_init(). Inside globus_init(): globus_module_activate, get base variables, get_topology(), create_my_miproto(), distribute_byte_array(), build_channel(), select_protocols(), make a unique name, build CommWorldChannelsTable and MyGlobusJobContactVector. Back in MPIR_Init(): MPIR_Topology_Init(), MPIR_Init_dtes(), MPIR_Errhandler_create(), create MPIR_GROUP_EMPTY, MPIR_COMM_WORLD and MPIR_COMM_SELF, then MPIR_Return() and the end of MPI_Init().]
14
Three Parts of Initialization
  • Device initialization (globus_init)
  • globus_module_activate
  • Make basic information (rank, nprocs)
  • Set my protocol information
  • Create channel table
  • Etc.
  • Initialize datatypes and error handlers
  • Create the default communicator
  • MPIR_COMM_WORLD

15
Globus_module_activate
  • Bring up runtime libraries
  • Globus_module_activate(Module Name)
  • GLOBUS_DUROC_RUNTIME_MODULE
  • GLOBUS_COMMON_MODULE
  • GLOBUS_IO_MODULE
  • GLOBUS_NEXUS_MODULE

16
Make Basic Information (1)
  • Get base information
  • TCP buffer size
  • globus_module_getenv(MPICH_GLOBUS2_TCP_BUFFER_SIZE)
  • Saved into MpichGlobus2TcpBufsz
  • rank_in_my_subjob
  • globus_duroc_runtime_intra_subjob_rank()
  • my_subjob_size
  • globus_duroc_runtime_intra_subjob_size()

17
Make Basic Information (2)
  • Call get_topology()
  • Initialize topology variables
  • subjob_addresses
  • nprocs: number of total processes
  • nsubjobs: number of subjobs
  • my_grank: group rank
  • Using the GLOBUS_DUROC API
  • Divided into a master part and a slave part

18
Make Basic Information (3)
  • get_topology() - subjob slave
  • Receive a message from the subjob master
  • intra_subjob_receive
  • globus_duroc_runtime_intra_subjob_receive
  • Get nprocs and my_grank from the message

19
Make Basic Information (4)
  • get_topology() - subjob master
  • Getting the subjob layout
  • globus_duroc_runtime_intra_subjob_structure()
  • my_subjob_addr, nsubjobs, subjob_addresses
  • Finding the index of subjob master 0
  • It is the one with the lowest address
  • sj0_master_idx
  • Calculate my subjobmaster_rank
  • Getting GLOBUS_DUROC_SUBJOB_INDEX
  • rsl_subjob_rank

20
Make Basic Information (5)
  • get_topology() - subjob master (not root)
  • Make a message
  • duroc_subjobmaster_rank, rsl_subjob_rank, my_subjob_size
  • Send the message to subjob master 0
  • globus_duroc_runtime_inter_subjob_send
  • Receive a message
  • nprocs, my_grank
  • Make a message and send it to the slaves

21
Make Basic Information (6)
  • get_topology() - subjob master (root)
  • Sorting subjob_addresses
  • Receive messages from the subjob masters
  • globus_duroc_runtime_inter_subjob_receive()
  • Make rsl_ranks, job_sizes
  • Calculating nprocs and everyone's g_rank based on
    rsl_ranks and job_sizes
  • nprocs = sum of job_sizes
  • g_rank = sum of the job_sizes that come before mine
  • Sending messages to the subjob masters
  • nprocs, g_ranks

22
Set my protocol
  • Getting the network address and mask
  • net_addr, net_mask
  • Getting the interface address
  • Make a passive socket and listen using globus_io
  • Assign port, Handle
  • globus_io_tcp_listen()
  • Getting more variables
  • Make a byte stream (my_miproto)
  • s_tcptype, hostname, s_port, lan_id, localhost_id

23
Create Channel Table (1)
  • Distribute my_miproto to all processes
  • distribute_byte_array(..., array, vector, ...)
  • Flow
  • The master gathers messages from the slaves
  • intra_subjob_gather
  • Each master exchanges the messages
  • globus_duroc_runtime_inter_subjob_send(recv)
  • The master bcasts the messages to all slaves
  • intra_subjob_bcast
  • Make my_miproto_vector (array)

24
My_miproto_vector
[Diagram: my_miproto_vector is an array indexed by process rank (0 .. nprocs-1); each entry holds that process's protocol byte stream: s_nprotos, s_tcptype, hostname, lan_id_lng, lan_id.]
25
Create Channel Table (2)
  • Call build_channel()
  • Allocate channels (array)
  • Set the values of each channel
  • channel[i].proto_list = mp
  • mp->info = tp
  • Set tp's variables using mi_proto_vector

26
Create Channel Table (3)
  • select_protocols()
  • For all destinations (i)
  • Set channel[i].selected_proto to a protocol usable by
    both the source and the destination

27
Commworldchannels
[Diagram: CommworldChannels is an array of struct channel_t with one entry per process (0 .. nprocs-1). Each channel has a proto_list and a selected_proto (struct miproto_t). Each miproto_t holds a type (tcp, mpi, unknown), a void *info pointer and a next pointer; for TCP the info block contains hostname, port, handlep, whandle, header, to_self, connection_lock, connection_cond, attr, cancel_head, cancel_tail, send_head, send_tail.]
28
Make Unique Name
  • my_grank == 0
  • Make a unique name (my_commworld_id)
  • hostname + getpid()
  • my_grank != 0
  • my_commworld_id = NULL
  • distribute_byte_array
  • The results are combined into my_commworld_id_vector
    (a hypothetical sketch follows after this list)
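
The slides do not show the exact code; the following is only a hypothetical sketch of how a unique commworld id could be formed from the hostname and process id (the helper name is made up, not the MPICH-G2 source):

    #include <stdio.h>
    #include <unistd.h>

    /* Hypothetical sketch: the group root (my_grank == 0) builds
     * "hostname pid" as the commworld id; other ranks contribute nothing. */
    static void make_commworld_id(int my_grank, char *buf, size_t len)
    {
        if (my_grank == 0) {
            char host[256];
            gethostname(host, sizeof(host));
            snprintf(buf, len, "%s %d", host, (int) getpid());
        } else {
            buf[0] = '\0';  /* non-root ranks send an empty id */
        }
    }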

29
CommWorldChannelsTable
  • Set CommWorldChannelsTable[0]
  • .nprocs = MPID_MyWorldSize
  • .name = my_commworld_id_vector[0]
  • .channels = CommworldChannels

30
GramJobContactsVector
  • Get MyGlobusGramJobContact
  • GLOBUS_GRAM_JOB_CONTACT
  • Distribute_byte_array()
  • Combined into GramJobcontactsVector
  • End of globus_init
  • End of MPID_Init

31
End of globus_init()
32
The Rest of Initialization
  • MPIR_Init_queue()
  • MPIR_Init_dtes()
  • MPIR_Errhandler_create()
  • Make the default communicator group
  • MPIR_COMM_WORLD

33
Init global variables (1)
  • MPIR_Init_queue()
  • Allocate MPIR_Topo_els
  • MPIR_Init_dtes()
  • Set up MPI_Datatypes
  • Set a map from MPI_Datatype to basic types
  • MPI_FLOAT -> float

34
Init global variables (2)
  • MPIR_Errhandler_create()
  • 3 error handling modes
  • MPIR_Errors_are_fatal
  • MPIR_Errors_return
  • MPIR_Errors_warn
  • Make an errhandler object (new)
  • Register a function to the object
  • Register the pointer into PtrArray
  • MPIR_RegPointerIdx(errhandler (const), new)
    (a usage sketch follows after this list)
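
From the application side these correspond to the predefined error handlers; a usage sketch with the MPI-1 style call used by MPICH 1.2.x (illustrative only):

    #include <mpi.h>

    void select_error_mode(void)
    {
        /* Default behavior: abort the job on any MPI error. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_ARE_FATAL);

        /* Alternative: have MPI calls return an error code to the caller. */
        MPI_Errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    }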

35
Create default Communicator
  • MPIR_COMM_WORLD
  • Allocate MPIR_COMM_WORLD
  • Register the pointer into PtrArray with
    MPI_COMM_WORLD (const)
  • Set all variables in MPIR_COMM_WORLD
  • comm_type, ADIctx, group, ...
  • Predefined attributes for MPI_COMM_WORLD
  • Create topology information (globus2 only)

36
MPI Communication (cont.)
  • Preparing communication - MPI_Init()
  • Get basic information
  • Gather information from each process
  • Create the channel table
  • Make a passive socket
  • Register the listen_callback() function
  • Sending a message - MPI_Send()
  • Get protocol information from the channel table
  • Open a socket using globus_io
  • Write data to the socket using globus_io

37
MPI Communication
  • Receiving a message - listen_callback
  • Accept the socket connection
  • Read data from the socket
  • Copy data into the recv queue
  • Receiving a message - MPI_Recv(..buf..)
  • Search the recv queue
  • Copy data from the recv queue to buf
    (a simple send/receive sketch follows after this list)
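
As the user sees it, the send/receive pair is ordinary MPI code; MPICH-G2 resolves the destination through the channel table and globus_io underneath. An illustrative sketch:

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);  /* from rank 0 */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }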

38
MPI Communication
[Diagram: message transfer between Process A (rank 3) and Process B (rank 5), using the COMMWORLDCHANNEL table (rank 5: selected protocol tcp) and the MPID_recvs posted/unexpected queues. Steps: 1. create a passive socket; 2. get information about the peer process; 3. get its protocol information (hostname, port, handle, ...); 4. make a socket for writing; 5. connect; 6. accept the socket; 7. send the data; 8. call listen_callback; 9. copy the data into the unexpected queue; 10. move the data to the posted queue; 11. delete the original buffer; 12. read the data from the posted queue.]
39
Collective Communications
  • G2 uses multi-level collective communications
  • Comparison
  • Old MPI provides only a 1-level topology
  • Assumes all nodes have the same latency
  • MagPIe provides 2 levels
  • WAN, LAN
  • G2 provides 4 levels
  • WAN, LAN, intraTCP, vendor MPI
  • Topology discovery
  • Assume all processes can communicate over the WAN
  • The user specifies the LAN_ID
  • Subjob means the processes use intraTCP
  • jobtype=mpi means the processes can use vendor MPI
    (a usage sketch follows after this list)
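
The application code does not change to take advantage of the multi-level scheme; a plain broadcast is enough and G2 selects the topology-aware implementation underneath. Illustrative sketch:

    #include <mpi.h>

    void broadcast_parameters(double *params, int n)
    {
        /* Rank 0 broadcasts n doubles to every process; with MPICH-G2
         * this is internally staged over the WAN / LAN / intraTCP /
         * vendor-MPI levels. */
        MPI_Bcast(params, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    }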

40
Multiple Level Collective Communication
  • Exploiting hierarchy
  • WAN_TCP < LAN_TCP < intra TCP < vendor MPI

[Diagram: processes spread over machines at utech.edu (m1.utech.edu, ...) and c1.nlab.gov, with vendor MPI inside each cluster, intra TCP between cluster nodes, LAN_TCP within a site and WAN_TCP between sites. Levels in the example: 1. WAN_TCP level: P0, P20; 2. LAN_TCP level: P0, P10; 3. intra TCP level: P10, ..., P19; 4. vendor MPI level: P0..P9 and P20..P29.]
41
Multiple Level Collective Communication
  ( (resourceManagerContact="m1.utech.edu")
    (count=10) (jobtype=mpi) (label="subjob 0")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
                 (GLOBUS_LAN_ID foo))
    (directory=/homes/users/smith)
    (executable=/homes/users/smith/myapp) )
  ( (resourceManagerContact="m2.utech.edu")
    (count=10) (label="subjob 1")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
                 (GLOBUS_LAN_ID foo))
    (directory=/homes/users/smith)
    (executable=/homes/users/smith/myapp) )
  ( (resourceManagerContact="c1.nlab.gov")
    (count=10) (jobtype=mpi) (label="subjob 2")
    (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2))
    (directory=/users/smith)
    (executable=/users/smith/myapp) )

42
Flow of Multilevel Collective operations
  • Make topology (MPI_Init)
  • Getting variables
  • comm->globus_lan_id
  • comm->localhost_id
  • cluster_table()
  • comm->Topology_Depths
  • comm->Topology_ColorTable
  • comm->Topology_ClusterSizes
  • comm->Topology_ClusterIds
  • comm->Topology_ClusterSets
  • Update topology (before collective operations)
  • update_cluster_table()
  • Make multiple_set_t (before collective operations)
  • involve_sets()

43
Getting Variables
  • globus_lan_id
  • MPI_Init
  • create_my_miproto()
  • globus_libc_getenv(GLOBUS_LAN_ID)
  • localhost_id
  • MPI_Init
  • create_my_miproto()
  • globus_libc_getenv(GLOBUS_DUROC_SUBJOB_INDEX)
  • atoi(duroc_subjob)

44
Cluster_table()
  • Topology_Depths
  • get_channel(my_rank)
  • Topology_Depths[my_rank] = number of protocols
  • Topology_ColorTable
  • level 0: all processes have the same color
  • level 1: globus_lan_id
  • level 2: localhost_id
  • level 3: jobtype
    (a hypothetical sketch of the coloring rule follows after this list)
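
A hypothetical sketch of the per-level coloring rule described above (variable and function names are made up, not the MPICH-G2 source):

    /* Hypothetical illustration of the coloring rule:
     *   level 0: everyone shares one color (WAN)
     *   level 1: processes with the same GLOBUS_LAN_ID share a color (LAN)
     *   level 2: processes in the same subjob share a color (intra TCP)
     *   level 3: processes that can use vendor MPI share a color       */
    void fill_colors(int colors[4], int lan_color, int subjob_id, int uses_vendor_mpi)
    {
        colors[0] = 0;
        colors[1] = lan_color;
        colors[2] = subjob_id;
        colors[3] = uses_vendor_mpi;
    }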

45
Examples
[Example: a WAN connecting two LANs; the processes run on hosts cluster101 and cluster201 (one LAN) and mercury (the other LAN).
Cluster_colorTable -> procs:
  level 0: 0 0 0 0 0 0
  level 1: 0 0 0 0 1 1
  level 2: 0 0 1 1 2 2]
46
Cluster_table() (cont.)
  • Topology_ClusterIds
  • Classify the processes that share the same color

Topology_ClusterIds -> procs:
  level 0: 0 0 0 0 1 1
  level 1: 0 0 1 1 0 0
  level 2: 0 1 0 1 0 1
47
Cluster_table() (cont.)
  • Topology_ClusterSizes
  • Number of processes having the same color at this level
  • Topology_ClusterSets
  • Memory for the set of master processes

48
Update_cluster_id_table()
  • Update the topology variables for a nonzero root
  • If the cid is not zero, rotate the cids
  • Update the hidden communicator

[Table: Topology_ClusterIds before the rotation (the root process is marked in the original figure):
  level 0: 0 0 0 0 1 1
  level 1: 0 0 1 1 0 0
  level 2: 0 1 0 1 0 1]
49
Update_cluster_id_table()
[Table: Topology_ClusterIds after the rotation (root marked in the original figure):
  level 0: 0 0 0 0 1 1
  level 1: 1 1 0 0 0 0
  level 2: 0 1 1 0 0 1]
50
Involve_set()
  • Make a useful data structure for collective
    communication
  • The topology variables are converted into the struct
    multiple_set_t

51
MPI_Bcast (cont.)
[Flow: MPI_Bcast(buf, ..., comm) looks up comm_ptr = MPIR_To_Pointer(comm) and calls comm_ptr->collops->Bcast(buf, ...). For an intra-communicator (MPI_INTRA) this goes to Intra_Bcast(buf); inter-communicator broadcast (Inter_Bcast) is not supported yet. If the device defines MPID_Bcast, MPID_FN_Bcast(buf) performs the topology-aware bcast; otherwise Intra_Bcast(buf) falls back to the binomial bcast.]
52
MPI_Bcast (cont.)
[Flow chart for MPID_FN_Bcast(buf): involve(comm, set_info); allocate a request; for all sets in set_info, use flat_tree_bcast(buf) at level 0 and binomial_bcast(buf) otherwise; branches distinguish "I am root in this set" (MPI_Isend(buf)) from non-root members (MPI_Recv(buf) from the parent, MPI_Send(buf) to the parent). A generic binomial-tree sketch follows below.]
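
A generic binomial-tree broadcast over ranks 0..size-1 of a communicator, shown only to illustrate the pattern named on the slide (not the MPICH-G2 internal code, which works on the sets built by involve()):

    #include <mpi.h>

    /* Broadcast one int from rank 0 along a binomial tree. */
    void binomial_bcast_int(int *buf, MPI_Comm comm)
    {
        int rank, size, mask;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        /* Receive from the parent: the rank that differs in my lowest set bit. */
        for (mask = 1; mask < size; mask <<= 1) {
            if (rank & mask) {
                MPI_Recv(buf, 1, MPI_INT, rank - mask, 0, comm, &status);
                break;
            }
        }
        /* Forward to my children. */
        for (mask >>= 1; mask > 0; mask >>= 1) {
            if (rank + mask < size)
                MPI_Send(buf, 1, MPI_INT, rank + mask, 0, comm);
        }
    }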
53
struct multiple_set_t
[Diagram: struct multiple_set_t set_info examples. Each set records its members plus size, level, root_index and my_rank_index. For rank 0, set_info holds three sets: {0, 20}, {0, 10} and {0..9}. For rank 10, it holds two sets: {0, 10} and {10..19}.]
54
MPICH-GX
  • A modified version of MPICH-G2
  • Private IP Support (K.R. Park)
  • File-based Initialization (K.W. Koh)
  • mpid/globus2/init_g.c
  • Efficient Collective Operations (K.L. Park)
  • mpid/globus2/topology_intra_fns.c
  • Message Compression (H.J. Lee)
  • You should understand how we modified MPICH-G2

55
(No Transcript)
56
Next Seminar
  • Homework
  • Install Globus and MPICH-G2
  • Compile and launch cpi
  • Print the source code of MPICH-G2
  • Understand the operation flow
  • Initialization
  • Communication
  • Collective communication
  • Modify the code
  • Insert code printing your name when initializing an
    MPI process
  • Insert code printing the number of message transfers
  • Rebuild the library
  • Compile and launch cpi
  • Understand the operation flow
  • File-based initialization
  • Proxy-based communication
  • Modify the code
  • File format of initialization
  • Stabilizing the communication code

57
Backup - ADI of MPICH
  • MPICH can support various device types using the ADI
    (Abstract Device Interface)
  • The ADI is a small set of function definitions
  • C functions or macro definitions
  • All MPI functions are implemented on top of the ADI
  • The 4 basic functions of the ADI
  • Specifying a message to be sent or received
  • Moving data between the API and the message-passing hardware
  • Managing lists of pending messages (both sent and received)
  • Providing basic information about the execution environment

58
Backup - MPICH Layers
[Diagram: MPICH layers. MPI layer: MPI_Send, MPI_Recv, MPI_Isend, ... ADI layer: MPID_SendDatatype, MPID_ISendDatatype, MPID_RecvDatatype, MPID_IRecvDatatype. Device layer (globus2, ch_p4, ...): g_malloc, globus_mutex_lock, globus_dc_put, globus_io_write, g_free.]
59
Backup - Function Naming Conventions
  • MPI_ (e.g., MPI_SEND)
  • Functions defined by the MPI standard and called
    directly by user programs
  • MPIR_ (e.g., MPIR_GET_DTYPE_PTR)
  • Internal functions used to implement the MPI_ routines
    that do not depend on the device
  • MPID_ (e.g., MPID_SendDatatype)
  • Internal functions used to implement the MPI_ routines
    that are implemented separately by each device
60
Backup - Using ADI
Problem: how to make Draw3D() work on a graphics device
that has no native 3D support

ADI Approach
  • Device-independent code (e.g., Draw_Rect3D()) calls
    DRAW_MACRO_3D() instead of calling Draw3D() directly
  • On a device with 3D support, DRAW_MACRO_3D() expands to Draw3D()
  • On a device without it, DRAW_MACRO_3D() expands to a loop of
    2D calls: for (i = 0 to 2) Draw()
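
A hypothetical C sketch of this analogy (names taken from the slide; the stub functions are made up): with native 3D support the macro expands to the hardware call, otherwise to a loop of 2D calls, while Draw_Rect3D() is written against the macro only.

    #include <stdio.h>

    /* Stub "device" calls, just for the analogy. */
    static void Draw3D(int x, int y, int z) { printf("Draw3D(%d,%d,%d)\n", x, y, z); }
    static void Draw(int x, int y)          { printf("Draw(%d,%d)\n", x, y); }

    #ifdef HAVE_NATIVE_3D
    #define DRAW_MACRO_3D(x, y, z)  Draw3D((x), (y), (z))
    #else
    /* No native 3D support: emulate by repeating a 2D call. */
    #define DRAW_MACRO_3D(x, y, z)            \
        do {                                  \
            int i_;                           \
            (void)(z);                        \
            for (i_ = 0; i_ < 3; i_++)        \
                Draw((x), (y));               \
        } while (0)
    #endif

    /* Device-independent code uses only the macro. */
    void Draw_Rect3D(int x, int y, int z)
    {
        DRAW_MACRO_3D(x, y, z);
    }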
61
ADI Hierarchy (1)
[Diagram: ADI hierarchy for the globus2 device]
62
ADI Hierarchy (2)
63
Backup - DUROC subjob
  • ( (resourceManagerContact="grid.yonsei.ac.kr")
  • (count=4)
  • (label="subjob 0")
  • (environment=(GLOBUS_DUROC_SUBJOB_INDEX 0)
    (LD_LIBRARY_PATH /usr/local/globus/lib/))
  • (directory=/usr/local..)
    (executable=/usr/local..)
  • )
  • ( (resourceManagerContact="cluster.yonsei.ac.kr")
  • (count=2)
  • (label="subjob 4")
  • (environment=(GLOBUS_DUROC_SUBJOB_INDEX 1)
  • ...
  • ( (resourceManagerContact="dcc.sogang.ac.kr")
  • (count=2)
  • (environment=(GLOBUS_DUROC_SUBJOB_INDEX 2)
  • ( (resourceManagerContact="venus.kisti.re.kr")
  • (count=4)
  • ...
64
Backup - DUROC subjob
[Diagram: DUROC co-allocation of four subjobs. SUBJOB_INDEX 0: size 4 on Grid.yonsei.ac.kr (local ranks 0-3); SUBJOB_INDEX 1: size 2 on cluster.yonsei.ac.kr (local ranks 0-1); SUBJOB_INDEX 2: size 2 on dcc.sogang.ac.kr (local ranks 0-1); SUBJOB_INDEX 3: size 4 on venus.kisti.re.kr (local ranks 0-3).]