Title: MPICH-VMI
1. MPICH-VMI
- VMI Team
- Cluster Software Tools group, NCSA
- Monday, August 18, 2003
2. Compiling with MPICH-VMI
- Include the MPICH_INSTALL_PATH/bin installation directory in your path
- This brings the compiler wrapper scripts into your environment
- mpicc and mpiCC for C and C++ codes
- mpif77 and mpif90 for F77 and F90 codes
- Some underlying compilers, such as the GNU compiler suite, do not support F90. Use mpif90 -show to determine the underlying compiler being used.
3. Compiling with MPICH-VMI
- The compiler scripts are wrappers that include all MPICH-VMI specific libraries and paths
- All underlying compiler switches are supported and passed through to the compiler
- The MPICH-VMI library is compiled with debug symbols by default
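For reference, a minimal C program (file name and contents are illustrative, not from the slides) that exercises the toolchain described above; build it with the wrapper script as mpicc -o hello_vmi hello_vmi.c:

    /* hello_vmi.c -- minimal check that the MPICH-VMI toolchain works. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }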
4. Running with MPICH-VMI
- The mpirun script is available for launching jobs
- Supports all standard arguments in addition to MPICH-VMI specific arguments
- mpirun uses ssh, rsh, or MPD for launching jobs. The default is MPD
- Provides automatic selection/failover: if an MPD ring is not available, it falls back to ssh/rsh
5. Running with MPICH-VMI
- VMI specific arguments fall into three broad categories:
- Parameters that can be tuned at runtime
- Parameters for launching GRID jobs
- Parameters to control profiling of a job
- Use the mpirun -help option to list all VMI tunable parameters
- All VMI specific parameters are optional, except that GRID jobs require some parameters to be set.
6. Running with MPICH-VMI
- Runtime Parameters
- -specfile: Specifies the underlying network transport to use. This can be a shortened network name (tcp, myrinet, or mst for Infiniband) or the path to a VMI transport definition specification file in XML format.
- -force-shell: Disables the use of MPDs for launching the job. GRID jobs require the use of ssh/rsh for job launching.
7. Running with MPICH-VMI
- Runtime Parameters
- -job-sync-timeout: Maximum number of seconds allowed for all processes to start. Default is 300 seconds.
- -debugger: Use the specified debugger to debug the MPI application. Supported debuggers are gdb and totalview.
- -mmapthreshold: Specifies the memory allocation size in bytes at and above which MMAP is used to obtain memory. By default, all allocations smaller than 4 MB come from the heap.
8. Running with MPICH-VMI
- Runtime Parameters
- -eagerlen: Specifies the message size in bytes at which to switch from the short/eager protocol to the rendezvous protocol. Default is 16 KB.
- -eagerisendcopy: Specifies the largest message size that can be completed immediately for asynchronous sends (MPI_Isend).
- -disable-short-rdma: Disables the use of the RDMA protocol for short messages.
- -short-rdma-credits: Specifies the maximum number of unacknowledged short RDMA messages. Default is 32.
9. Running with MPICH-VMI
- Runtime Parameters
- -rdmachunk: Specifies the base RDMA chunk size for the rendezvous protocol. All RDMA transfers for rendezvous are performed using the base RDMA chunk size. Default is 256 KB.
- -rdmapipeline: Specifies the maximum number of RDMA chunks in flight. The overall memory demand for RDMA is rdmachunk size × rdmapipeline length.
10. Running with MPICH-VMI
- Runtime Parameters
- -v: Verbose Level 1. Outputs VMI startup messages and makes mpirun verbose.
- -vv: Verbose Level 2. Additionally outputs any warning messages.
- -vvv: Verbose Level 3. Additionally outputs any error messages.
- -vvvv: Verbose Level 10. Excess debug output. Useful only for developers of MPICH-VMI and for submitting crash dumps.
11. Running with MPICH-VMI
- An MPICH-VMI GRID job consists of one or more subjobs.
- A subjob is launched at each site using individual mpirun commands. The specfile selected should be one of the xsite network transports (xsite-mst-tcp or xsite-myrinet-tcp).
- The higher performance SAN (Infiniband or Myrinet) is used for intra-site communication. Cross-site communication uses TCP automatically.
12. Running with MPICH-VMI
- Individual subjobs can use their preferred underlying transport as part of the GRID job.
- It is possible to have two clusters, one with Infiniband and the other with Myrinet, and to span a GRID job across them. Infiniband will be used for intra-cluster communication within the first cluster and Myrinet for communication within the second. TCP will be used for communication between the clusters.
13. Running with MPICH-VMI
- All subjobs must specify the same GRID specific parameters
- Grid Specific Parameters
- -grid-procs: Specifies the total number of processes in the job. The -np parameter to mpirun still specifies the number of processes in the subjob.
- -grid-crm: Specifies the host running the grid CRM to be used for subjob synchronization.
14. Running with MPICH-VMI
- Grid Specific Parameters
- -key: An alphanumeric string that uniquely identifies the grid job. This must be the same for all subjobs!
- -allocator-uri: The grid CRM allocator to use for MPI rank assignment. The default allocator is "default", which allocates ranks in FIFO order. Use vmi_crm_query to query the available allocators on a CRM.
15. Running with MPICH-VMI
- Profiling Specific Parameters
- -disable-profiling: Disables collection of profile data. All MPICH-VMI runs gather profiling data by default. Profile data is sent to the profile server at NCSA for use by MPICH-VMI developers to enable further optimizations.
- -profile-server: Specifies the host running the MPICH-VMI profile server. Users can retarget the profile data to their own profiling servers, to be used with the profile guided optimization features of MPICH-VMI.
16. Running with MPICH-VMI
- Executing on a single cluster using Myrinet
- mpirun -np 32 -specfile myrinet ./cpi
- Executing on a single cluster using Infiniband
- mpirun -np 32 -specfile mst ./cpi
- There is no need to recompile the executable. The transport is selected at job submission via the -specfile switch to mpirun.
17. Running with MPICH-VMI
- Executing a cross site (Grid) run
- Launch an individual mpirun at each site. You need to specify the total number of processes (-grid-procs) and the number of local processes (-np)
- Example: run a 32 processor job with 16 processors at each site
- Site A: mpirun -np 16 -specfile xsite-myrinet-tcp -grid-procs 32 -grid-crm <ip of crm host> -key gridtest ./cpi
- Site B: mpirun -np 16 -specfile xsite-myrinet-tcp -grid-procs 32 -grid-crm <ip of crm host> -key gridtest ./cpi
- The same CRM and key must be specified during each subjob submission
18. Debugging with MPICH-VMI
- MPICH-VMI supports debugging of codes using TotalView and gdb
- A primitive wrapper environment for debugging parallel programs with gdb is available.
- Caveat: MPD must be used for job launching
- All stdin, stdout, and stderr is redirected to the gdb processes on the cluster. Standard gdb commands are available in this mode.
19. Debugging with TotalView
- To launch your MPI program with TotalView
- mpirun -tv -np <procs> <prog name>
- Attaching to a running program
- Start TotalView with no arguments, press "N" in the root window, and select the MPI job to which you want to attach.
20. Profile Guided Optimization
- Motivation
- Grid computing poses immense challenges in scaling an application across geographically distributed clusters connected by relatively high latency, low bandwidth links.
- There is a need to minimize the data transferred across the low bandwidth wide area link.
- Current grid-aware MPICH implementations expect topology aware support from user applications, which may not always be feasible.
21. VMI Profile Guided Optimization
- MPICH-VMI2 profile guided optimization records the communication pattern of a specific job run. This communication pattern is analyzed, and communication ranks are assigned optimally to minimize the bottleneck traffic for a future run of the same job.
- The profile data is collected by individual nodes and sent to the head node of each cluster, which forwards it to the vmiprofile server (by default, the vmiprofile server at NCSA).
- Profiling can also be disabled, or redirected to a profile server in your own administrative domain.
- The profile data collected is MPI specific information, such as the number of short sends and receives. No personal information is collected, to respect the privacy of the end-user.
22. VMI Profile Data
- MPI communication rank specific data, used to build point-to-point communication graphs.
- Job specific information, to enable optimizations to be based on a specific job.
23. Profile Data (Job Specific)
24. Profile Data (Rank Specific)
25. Profile Tools
- Command line tools
- vmicollect
- Used for generating communication graphs.
- mincut
- Used to generate a partition for a communication topology.
- Graphic tools
- Pajek
- Used to display the communication graphs.
26. Profile Analyzer Tools
- vmicollect
- Queries the vmiprofile database to collect the profile data for a specific job.
- Outputs a communication graph in Pajek format
- Arguments
- -p <program name>
- -d <start date> <end date>
- -j <jobid>
- -a <arguments passed>
- -n <# of processors>
- If the query is not unique, the jobid, program name, and argument list of each matching entry are displayed.
27. Profile Analyzer Tools
- mincut
- Creates a partition of a communication graph generated from a given jobid.
- Outputs the partition list.
- Arguments
- -j <jobid>
- -h <hostname of database server>
- -u <username>
- -p <password>
28. VMI Grid CRM with Allocator Modules
- The VMI Grid CRM server does rank assignment to nodes using allocator modules.
- The default allocator does assignment using the standard FIFO mechanism. The other allocators available are random and mincut.
- The format of the allocator file:
- <allocatorname>:<modulepath>:<arguments>
- Ex: queue:/opt/mpich-vmi2.0/tools/random.so:host=vmiprofile.ncsa.uiuc.edu?user=root
29. Allocator Modules
- A VMI allocator module implements four functions:
- int VMI_Grid_AllocatorInit(int argc, char **argv)
- Initializes the allocator module and takes the arguments given in the allocator file.
- int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
- Allocates ranks to subjobs.
- int VMI_Grid_AllocatorTerminate()
- Cleans up the allocator module. Currently not used.
- char *VMI_Grid_AllocatorName()
- Returns a version name for the allocator.
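Below is a minimal skeleton of a custom allocator module built around the four entry points above. The header name and the success-return convention are assumptions; PCRM_JOB is treated as opaque, since its fields are not shown in these slides:

    #include <stdio.h>
    #include "vmi_crm.h"  /* assumed header that defines PCRM_JOB */

    int VMI_Grid_AllocatorInit(int argc, char **argv)
    {
        /* Receives the arguments given in the allocator file. */
        int i;
        for (i = 0; i < argc; i++)
            fprintf(stderr, "allocator arg %d: %s\n", i, argv[i]);
        return 0;  /* assumed: 0 indicates success */
    }

    int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
    {
        /* Assign MPI ranks to the subjob's processes here; a FIFO
         * allocator would walk the job's processes in arrival order.
         * The PCRM_JOB fields are not documented in these slides. */
        (void)job; (void)argc; (void)argv;
        return 0;
    }

    int VMI_Grid_AllocatorTerminate(void)
    {
        /* Clean-up hook; currently not used, per the slide above. */
        return 0;
    }

    char *VMI_Grid_AllocatorName(void)
    {
        /* Version name reported for this allocator. */
        return "example-allocator-1.0";
    }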
30. Allocator URI
- The allocator URI is specified as an mpirun option for selecting the topology based on a jobid.
- Example: -allocator-uri mincut:host=vmiprofile?user=apant?jobid=256
31. MPICH-VMI Communication Protocols
- MPICH-VMI implements two different protocols for point-to-point communication
- Eager
- For short messages (less than 16 KB by default; the user can change this parameter).
- Rendezvous
- For long messages (more than 16 KB by default; the user can change this parameter).
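As a sketch of where the cutoff bites (assuming the default eagerlen of 16 KB and at least two ranks; sizes are illustrative), the first exchange below travels via the eager path and the second via rendezvous:

    #include <stdlib.h>
    #include <mpi.h>

    /* Simple two-rank ping-pong with a message of nbytes. */
    static void ping_pong(int rank, int nbytes)
    {
        char *buf = malloc(nbytes);
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ping_pong(rank, 8 * 1024);   /* below eagerlen: eager protocol */
        ping_pong(rank, 64 * 1024);  /* above eagerlen: rendezvous     */
        MPI_Finalize();
        return 0;
    }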
32. MPICH-VMI Communication Protocols
- Eager Protocol
- The message is sent to the receiver immediately. The receiver's messaging layer has to allocate some space for the message if the corresponding receive has not been posted, i.e., an MPI_Recv has not been called. This protocol can give good performance even for large messages if the number of unexpected receives is very low, since unexpected receives require a memcpy of the incoming message to a temporary buffer. A lot of large unexpected messages can add a huge overhead.
33. MPICH-VMI Communication Protocols
- Eager Protocol (Short messages)
- Two different implementations
- Using the send/recv communication model. Relatively expensive, due to the overhead associated with send/recv communication.
- Using RDMA. Faster. Data is deposited directly into the messaging layer's buffer, from where it is copied into the application's buffer (one memcpy if the message is expected, two if it is unexpected).
34. MPICH-VMI Communication Protocols
- Rendezvous Protocol (Large messages)
- The message is sent to the receiver only after the receiver has posted a receive (called MPI_Recv) for the message. This requires an additional handshake between the communicating processes.
- This protocol is implemented in MPICH-VMI using RDMA and provides a true zero copy data transfer.
35. Tunable Parameters
- Two broad classes of tunable parameters
- Communication Protocol Tuning
- Tuning eager send/recv
- Tunable parameters: eagerisendcopy, eagerunexcount, eagerlen
- Tuning eager RDMA
- Tunable parameters: disable-short-rdma, short-rdma-credits
- Tuning rendezvous
- Tunable parameters: rdmachunk, rdmapipeline
- Miscellaneous Tuning
- Tunable parameter: mmapthreshold
36. Tuning Eager send/recv Protocol
- eagerisendcopy
- Specifies the size of the largest message that can be copied in an asynchronous eager send/recv send so that the send finishes immediately.
- To send any message, the send buffer must be registered (pinned down). In an asynchronous send, memory registration can be avoided and the send completed immediately (the application can immediately reuse the send buffer) if the message can be appended to the send packet header, since that region is already registered.
37. Tuning Eager send/recv Protocol
- Tradeoffs: eagerisendcopy
- Pro(s)
- For messages smaller than eagerisendcopy, the application does not have to wait for the send to complete, so it can carry on with computation, overlapping computation with communication.
- Message buffers smaller than eagerisendcopy will not have to be registered (registration is expensive if the buffer is not in the VMI cache).
- Con(s)
- Since messages smaller than eagerisendcopy are copied, the memcpy can add overhead.
- If the message buffer is already registered (the buffer is in the VMI cache), copying the message adds unnecessary overhead.
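A short sketch of the pattern eagerisendcopy is meant to help (the 512 byte payload is an arbitrary size assumed to be under the threshold): if the library copies the payload, the request completes right away, so MPI_Wait does not stall and the buffer is safe to reuse:

    #include <mpi.h>

    void send_small(int peer)
    {
        char msg[512];  /* assumed to be below eagerisendcopy */
        MPI_Request req;

        MPI_Isend(msg, sizeof msg, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
        /* ... do useful computation here, overlapping the send ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* quick if payload was copied */
        /* msg can now be reused immediately */
    }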
38. Tuning Eager send/recv Protocol
- eagerunexcount
- Specifies the maximum number of unexpected short message receive buffers registered (pinned down) at any time. If the number of unexpected receives exceeds this value, a temporary buffer is allocated (malloc) for each new unexpected message, the message is copied into it (memcpy), and the receive buffer is released.
39. Tuning Eager send/recv Protocol
- Tradeoffs: eagerunexcount
- Pro(s)
- If your application has a large number of unexpected receives, you can reduce the amount of pinned down (registered) memory by reducing the value of this variable.
- Con(s)
- If an application has a large number of unexpected receive buffers using pinned down memory, memory resources will be strained, possibly degrading performance.
- There is a tradeoff between the cost of keeping pinned down memory from being reused and the cost of the malloc + memcpy for using the temporary buffer.
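Independent of the eagerunexcount setting, an application can avoid the unexpected-message path altogether by posting its receives before the matching sends arrive; a minimal sketch:

    #include <mpi.h>

    /* Symmetric exchange with a peer rank: post the receive first so the
     * incoming eager message lands in the application buffer instead of
     * a temporary unexpected-message buffer. */
    void exchange(int peer, char *inbuf, char *outbuf, int nbytes)
    {
        MPI_Request rreq;

        MPI_Irecv(inbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &rreq);
        MPI_Send(outbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }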
40. Tuning Eager send/recv Protocol
- eagerlen
- Messages of size less than or equal to eagerlen use the eager protocol. Messages of size greater than eagerlen use the rendezvous protocol.
41. Tuning Eager send/recv Protocol
- Tradeoffs: eagerlen
- Pro(s)
- If the ratio of unexpected to expected receives is very low, using the eager protocol even for large messages might be faster. In that case, increase eagerlen so that large messages also use the eager protocol.
- If the network interconnect has high latency (as with TCP), using the eager protocol is more beneficial than rendezvous, since rendezvous requires a handshake between the sender and the receiver; with a high latency interconnect, that handshake hurts performance.
- Con(s)
- Messages of size less than eagerlen will have the overhead associated with send/recv communication. Rendezvous, on the other hand, can deposit data directly into the receiver's buffer, avoiding the send/recv overhead.
- If the ratio of unexpected to expected receives is very high, messages using the eager protocol will have to be temporarily buffered, which is expensive.
- Short messages using the eager protocol use a buffer pool (pinned down memory) where each buffer is the size of eagerlen. With a large eagerlen, more memory will be pinned down, increasing the total memory utilization of your application. This can negatively impact performance if your system does not have sufficient memory resources.
42. Tuning Eager RDMA Protocol
- disable-short-rdma
- Disables the use of the RDMA protocol for short messages. Only the send/recv eager protocol is used for messages smaller than eagerlen.
- In short RDMA, the sender puts the data into a series of slots whose addresses the receiver has published to the sender. Only after these slots have filled up does the sender start using the send/recv eager protocol.
43. Tuning Eager RDMA Protocol
- Tradeoffs: disable-short-rdma
- Pro(s)
- Eager RDMA improves both bandwidth and latency, since the extra overhead (memcpy etc.) associated with the eager send/recv protocol is avoided.
- Con(s)
- Short RDMA requires the receiver to permanently pin down the regions of memory it publishes to the sender for putting short messages. This requires the system to have sufficient memory resources. If your system does not have sufficient memory, either pass -disable-short-rdma to use only the eager protocol for short messages, or reduce the number of RDMA credits to lessen the amount of memory pinned down.
44. Tuning Eager RDMA Protocol
- short-rdma-credits
- Maximum number of unacknowledged short RDMA messages.
- short-rdma-credits also equals the number of RDMA slots allocated by the receiver where the sender can deposit short RDMA messages. The size of each slot is equal to eagerlen.
- If all the short RDMA slots fill up, MPICH-VMI switches to the send/recv eager protocol.
45. Tuning Eager RDMA Protocol
- Tradeoffs: short-rdma-credits
- Pro(s)
- If your application has a large number of unexpected receives, the short RDMA slots are likely to fill up quickly, increasing the ratio of eager send/recv sends and receives to eager RDMA sends and receives. If your system has sufficient memory resources, increasing short-rdma-credits can improve performance by decreasing that ratio, making communication faster.
- Con(s)
- The greater the number of slots, the more memory is pinned down for RDMA, straining memory resources.
- A large number of slots means a bigger set to poll for incoming data. That can be expensive.
46. Tuning Rendezvous Protocol
- rdmachunk
- The base chunk size for large RDMA transfers used by the rendezvous protocol.
- MPICH-VMI fragments all messages sent over the rendezvous protocol into chunks of size rdmachunk, to avoid pinning down large buffers of memory.
47. Tuning Rendezvous Protocol
- Tradeoffs: rdmachunk
- Pro(s)
- Fragmenting large RDMA transfers on both the sender and receiver sides reduces the amount of pinned down memory, conserving memory resources.
- Con(s)
- If the size of the message being sent is greater than rdmachunk, the message will be fragmented and the cost of multiple RDMA puts will be incurred.
48. Tuning Rendezvous Protocol
- rdmapipeline
- Maximum number of RDMA chunks in flight.
- It also equals the number of buffers the receiver publishes to the sender where the sender can put the data.
- Hence, the amount of registered memory for each rendezvous send is rdmapipeline × rdmachunk on both the sender's and the receiver's end.
49. Tuning Rendezvous Protocol
- Tradeoffs: rdmapipeline
- Pro(s)
- If the message to be sent is larger than rdmachunk, two or more puts will be required to deposit the data into the receiver's buffer. For each put, the receiver posts a publish to the sender after the put completes. With rdmapipeline, the sender can do multiple puts without waiting for the receiver's publish. This can be very useful, especially with high latency interconnects.
- Con(s)
- A large rdmapipeline can strain memory resources, since each communicating process has to keep pinned down (registered) memory of size rdmapipeline × rdmachunk for each rendezvous communication.
50. Tuning Misc. Parameters
- mmapthreshold
- Specifies the memory allocation size in bytes at and above which MMAP is used to obtain memory. All memory allocations of size less than mmapthreshold are done from the heap.
- VMI does not keep mmaped memory in its cache of registered memory. Therefore, there will be a cache miss for every mmaped memory buffer that needs to be registered.
51. Tuning Misc. Parameters
- Tradeoffs: mmapthreshold
- Pro(s)
- If the application allocates large buffers and uses them for messaging multiple times, having those buffers allocated from the heap helps, since VMI will keep them in its cache of registered memory. Increasing mmapthreshold is useful here, since buffers smaller than mmapthreshold will use the heap.
- TIP: To find out if your application is using mmaped memory, see the cache statistics in the profiling data.
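A sketch of the buffer-reuse pattern this tip favors (sizes and counts are illustrative): one heap buffer is registered on the first send, and every later send hits VMI's registration cache:

    #include <stdlib.h>
    #include <mpi.h>

    void stream_to(int peer, int iters, int nbytes)
    {
        /* Comes from the heap as long as nbytes < mmapthreshold. */
        char *buf = malloc(nbytes);
        int i;

        for (i = 0; i < iters; i++) {
            /* Same buffer every iteration: registered once, cache
             * hits on every subsequent send. */
            MPI_Send(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }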
52. Topology Aware Collectives
- The MPI 1.1 specification has 14 collective operations. The most popular are Bcast, Barrier, Scatter, Gather/Allgather, and Reduce/Allreduce.
- When running jobs on a grid, the latency between two compute nodes may be relatively high if the nodes reside at different sites on the grid rather than at the same site.
- Hence, it is desirable that the collectives be implemented such that the high latency link connecting one grid site to another is used the minimum possible number of times.
- The topology aware collectives implemented in MPICH-VMI are grid aware. They tend to minimize communication between nodes that do not reside at the same grid site.
- Currently MPI_Bcast, MPI_Barrier, MPI_Reduce, and MPI_Allreduce have been implemented.
53. Topology Aware Collectives
- MPI_Bcast
- Each communicator has a coordinator node at each grid site. In the current implementation, the node with the lowest grank within a site for a given communicator is designated as the coordinator for that site.
- If the root of the broadcast is the coordinator, the root does a binomial tree broadcast for the nodes within its site. The root also does a flat tree broadcast among the coordinator nodes. The coordinator nodes in turn do a binomial tree broadcast at their own sites.
- If the root of the broadcast is not the coordinator, it assumes the role of coordinator for that site and does a flat tree broadcast (among coordinators) followed by a binomial tree broadcast (within its own site).
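For illustration, a point-to-point model of the binomial tree broadcast described above; MPICH-VMI performs this internally over an intra-site communicator, so this sketch is only a model of the algorithm, not the library's code:

    #include <mpi.h>

    void binomial_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
    {
        int rank, size, relrank, mask;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        relrank = (rank - root + size) % size;  /* root becomes rank 0 */

        /* Receive from the parent: the lowest set bit of relrank
         * identifies the round in which this node gets the data. */
        for (mask = 1; mask < size; mask <<= 1) {
            if (relrank & mask) {
                int src = (rank - mask + size) % size;
                MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }
        /* Forward to children in the remaining rounds. */
        mask >>= 1;
        while (mask > 0) {
            if (relrank + mask < size) {
                int dst = (rank + mask) % size;
                MPI_Send(buf, count, type, dst, 0, comm);
            }
            mask >>= 1;
        }
    }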
54. Topology Aware Collectives
- MPI_Barrier
- Each site has a coordinator node, chosen the same way as in MPI_Bcast.
- First, there is an intra-site gather of empty barrier messages, with the coordinators acting as roots.
- Then, the coordinator with the lowest grank (called the master coordinator) blocks until it receives an empty barrier message from the rest of the coordinators.
- The master coordinator replies to the site coordinators with an empty barrier message, followed by an intra-site broadcast of barrier messages, with the coordinators acting as broadcast roots.
55. Job Startup at a Single Site (NCSA)
- mpirun executes
- Spawns processes on the nodes
- Contacts the CRM for job startup synchronization
- The CRM allocates ranks and broadcasts the location of all ranks
- vmieyes daemons are queried for the active network devices of each process
- A connection mesh is established between ranks using the available network devices
(Diagram: MPIRUN, VMIEYES, and APPLICATION components at NCSA)
56. Startup for Grid Jobs (NCSA and SDSC)
- mpirun executes at each site (NCSA and SDSC) and spawns processes on that site's nodes
- Each mpirun contacts its local subjob CRM for job startup synchronization
- Each subjob CRM acts as a proxy and registers its processes with the GRID CRM
- The GRID CRM generates the job topology and allocates ranks. The topology and ranks are forwarded by the subjob CRM servers at NCSA and SDSC
- Each subjob CRM broadcasts the ranks to all of its site's processes
- vmieyes daemons are queried for the active network devices of each process
- A connection mesh is established between ranks using the available network devices: Myrinet within NCSA and Infiniband within SDSC in this example, with TCP over the WAN
(Diagram: MPIRUN, VMIEYES, APPLICATION, and subjob CRM components at each site, connected to the GRID CRM; networks shown are Myrinet, Infiniband, and TCP-WAN)