Title: MPICH-VMI
1. MPICH-VMI
- VMI Team
- Cluster Software Tools group, NCSA
- Monday, August 18, 2003
2. Compiling with MPICH-VMI
- Include the MPICH_INSTALL_PATH/bin installation directory in your path
- This brings the compiler wrapper scripts into your environment
- mpicc and mpiCC for C and C++ codes
- mpif77 and mpif90 for F77 and F90 codes
- Some underlying compilers, such as the GNU compiler suite, do not support F90. Use mpif90 -show to determine the underlying compiler being used.
3. Compiling with MPICH-VMI
- The compiler scripts are wrappers that include all MPICH-VMI specific libraries and paths
- All underlying compiler switches are supported and passed through to the compiler
- The MPICH-VMI library is compiled with debug symbols by default
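For reference, a minimal C program (file name and contents are illustrative, not from the slides) that exercises the toolchain described above; build it with the wrapper script as mpicc -o hello_vmi hello_vmi.c:

    /* hello_vmi.c -- minimal check that the MPICH-VMI toolchain works. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }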
4. Running with MPICH-VMI
- The mpirun script is available for launching jobs
- Supports all standard arguments in addition to MPICH-VMI specific arguments
- mpirun uses ssh, rsh, or MPD for launching jobs. The default is MPD
- Provides automatic selection/failover: if an MPD ring is not available, it falls back to ssh/rsh
5. Running with MPICH-VMI
- VMI specific arguments fall into three broad categories:
- Parameters that can be tuned at runtime
- Parameters for launching GRID jobs
- Parameters to control profiling of a job
- Use the mpirun -help option to list all VMI tunable parameters
- All VMI specific parameters are optional, except that GRID jobs require some parameters to be set.
6. Running with MPICH-VMI
- Runtime Parameters
- -specfile: Specifies the underlying network transport to use. This can be a shortened network name (tcp, myrinet, or mst for Infiniband) or the path to a VMI transport definition specification file in XML format.
- -force-shell: Disables the use of MPDs for launching the job. GRID jobs require the use of ssh/rsh for job launching.
7. Running with MPICH-VMI
- Runtime Parameters
- -job-sync-timeout: Maximum number of seconds allowed for all processes to start. Default is 300 seconds.
- -debugger: Use the specified debugger to debug the MPI application. Supported debuggers are gdb and totalview.
- -mmapthreshold: Specifies the memory allocation size in bytes at and above which MMAP is used to obtain memory. By default, all allocations smaller than 4 MB come from the heap.
8. Running with MPICH-VMI
- Runtime Parameters
- -eagerlen: Specifies the message size in bytes at which to switch from the short/eager protocol to the rendezvous protocol. Default is 16 KB.
- -eagerisendcopy: Specifies the largest message size that can be completed immediately for asynchronous sends (MPI_Isend).
- -disable-short-rdma: Disables the use of the RDMA protocol for short messages.
- -short-rdma-credits: Specifies the maximum number of unacknowledged short RDMA messages. Default is 32.
9. Running with MPICH-VMI
- Runtime Parameters
- -rdmachunk: Specifies the base RDMA chunk size for the rendezvous protocol. All RDMA transfers for rendezvous are performed using the base RDMA chunk size. Default is 256 KB.
- -rdmapipeline: Specifies the maximum number of RDMA chunks in flight. The overall memory demand for RDMA is rdmachunk size × rdmapipeline length.
10. Running with MPICH-VMI
- Runtime Parameters
- -v: Verbose Level 1. Outputs VMI startup messages and makes mpirun verbose.
- -vv: Verbose Level 2. Additionally outputs any warning messages.
- -vvv: Verbose Level 3. Additionally outputs any error messages.
- -vvvv: Verbose Level 10. Excess debug output. Useful only for developers of MPICH-VMI and for submitting crash dumps.
11. Running with MPICH-VMI
- An MPICH-VMI GRID job consists of one or more subjobs.
- A subjob is launched at each site using individual mpirun commands. The specfile selected should be one of the xsite network transports (xsite-mst-tcp or xsite-myrinet-tcp).
- The higher performance SAN (Infiniband or Myrinet) is used for intra-site communication. Cross-site communication uses TCP automatically.
12. Running with MPICH-VMI
- Individual subjobs can use their preferred underlying transport as part of the GRID job.
- It is possible to have two clusters, one with Infiniband and the other with Myrinet, and to span a GRID job across them. Infiniband will be used for intra-cluster communication within the first cluster and Myrinet for communication within the second. TCP will be used for communication between the clusters.
13. Running with MPICH-VMI
- All subjobs must specify the same GRID specific parameters
- Grid Specific Parameters
- -grid-procs: Specifies the total number of processes in the job. The -np parameter to mpirun still specifies the number of processes in the subjob.
- -grid-crm: Specifies the host running the grid CRM to be used for subjob synchronization.
14. Running with MPICH-VMI
- Grid Specific Parameters
- -key: An alphanumeric string that uniquely identifies the grid job. This must be the same for all subjobs!
- -allocator-uri: The grid CRM allocator to use for MPI rank assignment. The default allocator is "default", which allocates ranks in FIFO order. Use vmi_crm_query to query the available allocators on a CRM.
15. Running with MPICH-VMI
- Profiling Specific Parameters
- -disable-profiling: Disables collection of profile data. All MPICH-VMI runs gather profiling data by default. Profile data is sent to the profile server at NCSA for use by MPICH-VMI developers to enable further optimizations.
- -profile-server: Specifies the host running the MPICH-VMI profile server. Users can retarget the profile data to their own profiling servers, to be used with the profile guided optimization features of MPICH-VMI.
16. Running with MPICH-VMI
- Executing on a single cluster using Myrinet
- mpirun -np 32 -specfile myrinet ./cpi
- Executing on a single cluster using Infiniband
- mpirun -np 32 -specfile mst ./cpi
- There is no need to recompile the executable. The transport is selected at job submission via the -specfile switch to mpirun.
17. Running with MPICH-VMI
- Executing a cross site (Grid) run
- Launch an individual mpirun at each site. You need to specify the total number of processes (-grid-procs) and the number of local processes (-np)
- Example: run a 32 processor job with 16 processors at each site
- Site A: mpirun -np 16 -specfile xsite-myrinet-tcp -grid-procs 32 -grid-crm <ip of crm host> -key gridtest ./cpi
- Site B: mpirun -np 16 -specfile xsite-myrinet-tcp -grid-procs 32 -grid-crm <ip of crm host> -key gridtest ./cpi
- The same CRM and key must be specified during each subjob submission
18. Debugging with MPICH-VMI
- MPICH-VMI supports debugging of codes using TotalView and gdb
- A primitive wrapper environment for debugging parallel programs with gdb is available.
- Caveat: MPD must be used for job launching
- All stdin, stdout, and stderr is redirected to the gdb processes on the cluster. Standard gdb commands are available in this mode.
19. Debugging with TotalView
- To launch your MPI program with TotalView
- mpirun -tv -np <procs> <prog name>
- Attaching to a running program
- Start TotalView with no arguments, press "N" in the root window, and select the MPI job to which you want to attach.
20. Profile Guided Optimization
- Motivation
- Grid computing poses immense challenges in scaling an application across geographically distributed clusters connected by relatively high latency, low bandwidth links.
- There is a need to minimize the data transferred across the low bandwidth wide area link.
- Current grid-aware MPICH implementations expect topology aware support from user applications, which may not always be feasible.
21. VMI Profile Guided Optimization
- MPICH-VMI2 profile guided optimization records the communication pattern of a specific job run. This communication pattern is analyzed, and communication ranks are assigned optimally to minimize the bottleneck traffic for a future run of the same job.
- The profile data is collected by individual nodes and sent to the head node of each cluster, which forwards it to the vmiprofile server (by default, the vmiprofile server at NCSA).
- Profiling can also be disabled, or redirected to a profile server in your own administrative domain.
- The profile data collected is MPI specific information, such as the number of short sends and receives. No personal information is collected, to respect the privacy of the end-user.
22. VMI Profile Data
- MPI communication rank specific data, used to build point-to-point communication graphs.
- Job specific information, to enable optimizations to be based on a specific job.
23. Profile Data (Job Specific)
24. Profile Data (Rank Specific)
25. Profile Tools
- Command line tools
- vmicollect
- Used for generating communication graphs.
- mincut
- Used to generate a partition for a communication topology.
- Graphic tools
- Pajek
- Used to display the communication graphs.
26. Profile Analyzer Tools
- vmicollect
- Queries the vmiprofile database to collect the profile data for a specific job.
- Outputs a communication graph in Pajek format
- Arguments
- -p <program name>
- -d <start date> <end date>
- -j <jobid>
- -a <arguments passed>
- -n <# of processors>
- If the query is not unique, the jobid, program name, and argument list of each matching entry are displayed.
27. Profile Analyzer Tools
- mincut
- Creates a partition of a communication graph generated from a given jobid.
- Outputs the partition list.
- Arguments
- -j <jobid>
- -h <hostname of database server>
- -u <username>
- -p <password>
28. VMI Grid CRM with Allocator Modules
- The VMI Grid CRM server does rank assignment to nodes using allocator modules.
- The default allocator does assignment using the standard FIFO mechanism. The other allocators available are random and mincut.
- The format of the allocator file:
- <allocatorname>:<modulepath>:<arguments>
- Ex: queue:/opt/mpich-vmi2.0/tools/random.so:host=vmiprofile.ncsa.uiuc.edu?user=root
29. Allocator Modules
- A VMI allocator module implements four functions:
- int VMI_Grid_AllocatorInit(int argc, char **argv)
- Initializes the allocator module and takes the arguments given in the allocator file.
- int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
- Allocates ranks to subjobs.
- int VMI_Grid_AllocatorTerminate()
- Cleans up the allocator module. Currently not used.
- char *VMI_Grid_AllocatorName()
- Returns a version name for the allocator.
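Below is a minimal skeleton of a custom allocator module built around the four entry points above. The header name and the success-return convention are assumptions; PCRM_JOB is treated as opaque, since its fields are not shown in these slides:

    #include <stdio.h>
    #include "vmi_crm.h"  /* assumed header that defines PCRM_JOB */

    int VMI_Grid_AllocatorInit(int argc, char **argv)
    {
        /* Receives the arguments given in the allocator file. */
        int i;
        for (i = 0; i < argc; i++)
            fprintf(stderr, "allocator arg %d: %s\n", i, argv[i]);
        return 0;  /* assumed: 0 indicates success */
    }

    int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv)
    {
        /* Assign MPI ranks to the subjob's processes here; a FIFO
         * allocator would walk the job's processes in arrival order.
         * The PCRM_JOB fields are not documented in these slides. */
        (void)job; (void)argc; (void)argv;
        return 0;
    }

    int VMI_Grid_AllocatorTerminate(void)
    {
        /* Clean-up hook; currently not used, per the slide above. */
        return 0;
    }

    char *VMI_Grid_AllocatorName(void)
    {
        /* Version name reported for this allocator. */
        return "example-allocator-1.0";
    }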
30. Allocator URI
- The allocator URI is specified as an mpirun option for selecting the topology based on a jobid.
- Example: -allocator-uri mincut:host=vmiprofile?user=apant?jobid=256
31. MPICH-VMI Communication Protocols
- MPICH-VMI implements two different protocols for point-to-point communication
- Eager
- For short messages (less than 16 KB by default; the user can change this parameter).
- Rendezvous
- For long messages (more than 16 KB by default; the user can change this parameter).
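As a sketch of where the cutoff bites (assuming the default eagerlen of 16 KB and at least two ranks; sizes are illustrative), the first exchange below travels via the eager path and the second via rendezvous:

    #include <stdlib.h>
    #include <mpi.h>

    /* Simple two-rank ping-pong with a message of nbytes. */
    static void ping_pong(int rank, int nbytes)
    {
        char *buf = malloc(nbytes);
        if (rank == 0) {
            MPI_Send(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, nbytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, nbytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        ping_pong(rank, 8 * 1024);   /* below eagerlen: eager protocol */
        ping_pong(rank, 64 * 1024);  /* above eagerlen: rendezvous     */
        MPI_Finalize();
        return 0;
    }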
32. MPICH-VMI Communication Protocols
- Eager Protocol
- The message is sent to the receiver immediately. The receiver's messaging layer has to allocate some space for the message if the corresponding receive has not been posted, i.e., an MPI_Recv has not been called. This protocol can give good performance even for large messages if the number of unexpected receives is very low, since unexpected receives require a memcpy of the incoming message to a temporary buffer. A lot of large unexpected messages can add a huge overhead.
33. MPICH-VMI Communication Protocols
- Eager Protocol (Short messages)
- Two different implementations
- Using the send/recv communication model. Relatively expensive, due to the overhead associated with send/recv communication.
- Using RDMA. Faster. Data is deposited directly into the messaging layer's buffer, from where it is copied into the application's buffer (one memcpy if the message is expected, two if it is unexpected).
34. MPICH-VMI Communication Protocols
- Rendezvous Protocol (Large messages)
- The message is sent to the receiver only after the receiver has posted a receive (called MPI_Recv) for the message. This requires an additional handshake between the communicating processes.
- This protocol is implemented in MPICH-VMI using RDMA and provides a true zero copy data transfer.
35. Tunable Parameters
- Two broad classes of tunable parameters
- Communication Protocol Tuning
- Tuning eager send/recv
- Tunable parameters: eagerisendcopy, eagerunexcount, eagerlen
- Tuning eager RDMA
- Tunable parameters: disable-short-rdma, short-rdma-credits
- Tuning rendezvous
- Tunable parameters: rdmachunk, rdmapipeline
- Miscellaneous Tuning
- Tunable parameter: mmapthreshold
36. Tuning Eager send/recv Protocol
- eagerisendcopy
- Specifies the size of the largest message that can be copied in an asynchronous eager send/recv send so that the send finishes immediately.
- To send any message, the send buffer must be registered (pinned down). In an asynchronous send, memory registration can be avoided and the send completed immediately (the application can immediately reuse the send buffer) if the message can be appended to the send packet header, since that region is already registered.
37. Tuning Eager send/recv Protocol
- Tradeoffs: eagerisendcopy
- Pro(s)
- For messages smaller than eagerisendcopy, the application does not have to wait for the send to complete, so it can carry on with computation, overlapping computation with communication.
- Message buffers smaller than eagerisendcopy will not have to be registered (registration is expensive if the buffer is not in the VMI cache).
- Con(s)
- Since messages smaller than eagerisendcopy are copied, the memcpy can add overhead.
- If the message buffer is already registered (the buffer is in the VMI cache), copying the message adds unnecessary overhead.
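A short sketch of the pattern eagerisendcopy is meant to help (the 512 byte payload is an arbitrary size assumed to be under the threshold): if the library copies the payload, the request completes right away, so MPI_Wait does not stall and the buffer is safe to reuse:

    #include <mpi.h>

    void send_small(int peer)
    {
        char msg[512];  /* assumed to be below eagerisendcopy */
        MPI_Request req;

        MPI_Isend(msg, sizeof msg, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &req);
        /* ... do useful computation here, overlapping the send ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);  /* quick if payload was copied */
        /* msg can now be reused immediately */
    }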
38. Tuning Eager send/recv Protocol
- eagerunexcount
- Specifies the maximum number of unexpected short message receive buffers registered (pinned down) at any time. If the number of unexpected receives exceeds this value, a temporary buffer is allocated (malloc) for each new unexpected message, the message is copied into it (memcpy), and the receive buffer is released.
39. Tuning Eager send/recv Protocol
- Tradeoffs: eagerunexcount
- Pro(s)
- If your application has a large number of unexpected receives, you can reduce the amount of pinned down (registered) memory by reducing the value of this variable.
- Con(s)
- If an application has a large number of unexpected receive buffers using pinned down memory, memory resources will be strained, possibly degrading performance.
- There is a tradeoff between the cost of keeping pinned down memory from being reused and the cost of the malloc + memcpy for using the temporary buffer.
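Independent of the eagerunexcount setting, an application can avoid the unexpected-message path altogether by posting its receives before the matching sends arrive; a minimal sketch:

    #include <mpi.h>

    /* Symmetric exchange with a peer rank: post the receive first so the
     * incoming eager message lands in the application buffer instead of
     * a temporary unexpected-message buffer. */
    void exchange(int peer, char *inbuf, char *outbuf, int nbytes)
    {
        MPI_Request rreq;

        MPI_Irecv(inbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD, &rreq);
        MPI_Send(outbuf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        MPI_Wait(&rreq, MPI_STATUS_IGNORE);
    }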
40. Tuning Eager send/recv Protocol
- eagerlen
- Messages of size less than or equal to eagerlen use the eager protocol. Messages of size greater than eagerlen use the rendezvous protocol.
41. Tuning Eager send/recv Protocol
- Tradeoffs: eagerlen
- Pro(s)
- If the ratio of unexpected to expected receives is very low, using the eager protocol even for large messages might be faster. In that case, increase eagerlen so that large messages also use the eager protocol.
- If the network interconnect has high latency (as with TCP), using the eager protocol is more beneficial than rendezvous, since rendezvous requires a handshake between the sender and the receiver; with a high latency interconnect, that handshake hurts performance.
- Con(s)
- Messages of size less than eagerlen will have the overhead associated with send/recv communication. Rendezvous, on the other hand, can deposit data directly into the receiver's buffer, avoiding the send/recv overhead.
- If the ratio of unexpected to expected receives is very high, messages using the eager protocol will have to be temporarily buffered, which is expensive.
- Short messages using the eager protocol use a buffer pool (pinned down memory) where each buffer is the size of eagerlen. With a large eagerlen, more memory will be pinned down, increasing the total memory utilization of your application. This can negatively impact performance if your system does not have sufficient memory resources.
42. Tuning Eager RDMA Protocol
- disable-short-rdma
- Disables the use of the RDMA protocol for short messages. Only the send/recv eager protocol is used for messages smaller than eagerlen.
- In short RDMA, the sender puts the data into a series of slots whose addresses the receiver has published to the sender. Only after these slots have filled up does the sender start using the send/recv eager protocol.
43. Tuning Eager RDMA Protocol
- Tradeoffs: disable-short-rdma
- Pro(s)
- Eager RDMA improves both bandwidth and latency, since the extra overhead (memcpy etc.) associated with the eager send/recv protocol is avoided.
- Con(s)
- Short RDMA requires the receiver to permanently pin down the regions of memory it publishes to the sender for putting short messages. This requires the system to have sufficient memory resources. If your system does not have sufficient memory, either pass -disable-short-rdma to use only the eager protocol for short messages, or reduce the number of RDMA credits to lessen the amount of memory pinned down.
44. Tuning Eager RDMA Protocol
- short-rdma-credits
- Maximum number of unacknowledged short RDMA messages.
- short-rdma-credits also equals the number of RDMA slots allocated by the receiver where the sender can deposit short RDMA messages. The size of each slot is equal to eagerlen.
- If all the short RDMA slots fill up, MPICH-VMI switches to the send/recv eager protocol.
45. Tuning Eager RDMA Protocol
- Tradeoffs: short-rdma-credits
- Pro(s)
- If your application has a large number of unexpected receives, the short RDMA slots are likely to fill up quickly, increasing the ratio of eager send/recv sends and receives to eager RDMA sends and receives. If your system has sufficient memory resources, increasing short-rdma-credits can improve performance by decreasing that ratio, making communication faster.
- Con(s)
- The greater the number of slots, the more memory is pinned down for RDMA, straining memory resources.
- A large number of slots means a bigger set to poll for incoming data. That can be expensive.
46. Tuning Rendezvous Protocol
- rdmachunk
- The base chunk size for large RDMA transfers used by the rendezvous protocol.
- MPICH-VMI fragments all messages sent over the rendezvous protocol into chunks of size rdmachunk, to avoid pinning down large buffers of memory.
47. Tuning Rendezvous Protocol
- Tradeoffs: rdmachunk
- Pro(s)
- Fragmenting large RDMA transfers on both the sender and receiver sides reduces the amount of pinned down memory, conserving memory resources.
- Con(s)
- If the size of the message being sent is greater than rdmachunk, the message will be fragmented and the cost of multiple RDMA puts will be incurred.
48. Tuning Rendezvous Protocol
- rdmapipeline
- Maximum number of RDMA chunks in flight.
- It also equals the number of buffers the receiver publishes to the sender where the sender can put the data.
- Hence, the amount of registered memory for each rendezvous send is rdmapipeline × rdmachunk on both the sender's and the receiver's end.
49. Tuning Rendezvous Protocol
- Tradeoffs: rdmapipeline
- Pro(s)
- If the message to be sent is larger than rdmachunk, two or more puts will be required to deposit the data into the receiver's buffer. For each put, the receiver posts a publish to the sender after the put completes. With rdmapipeline, the sender can do multiple puts without waiting for the receiver's publish. This can be very useful, especially with high latency interconnects.
- Con(s)
- A large rdmapipeline can strain memory resources, since each communicating process has to keep pinned down (registered) memory of size rdmapipeline × rdmachunk for each rendezvous communication.
50. Tuning Misc. Parameters
- mmapthreshold
- Specifies the memory allocation size in bytes at and above which MMAP is used to obtain memory. All memory allocations of size less than mmapthreshold are done from the heap.
- VMI does not keep mmaped memory in its cache of registered memory. Therefore, there will be a cache miss for every mmaped memory buffer that needs to be registered.
51. Tuning Misc. Parameters
- Tradeoffs: mmapthreshold
- Pro(s)
- If the application allocates large buffers and uses them for messaging multiple times, having those buffers allocated from the heap helps, since VMI will keep them in its cache of registered memory. Increasing mmapthreshold is useful here, since buffers smaller than mmapthreshold will use the heap.
- TIP: To find out if your application is using mmaped memory, see the cache statistics in the profiling data.
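A sketch of the buffer-reuse pattern this tip favors (sizes and counts are illustrative): one heap buffer is registered on the first send, and every later send hits VMI's registration cache:

    #include <stdlib.h>
    #include <mpi.h>

    void stream_to(int peer, int iters, int nbytes)
    {
        /* Comes from the heap as long as nbytes < mmapthreshold. */
        char *buf = malloc(nbytes);
        int i;

        for (i = 0; i < iters; i++) {
            /* Same buffer every iteration: registered once, cache
             * hits on every subsequent send. */
            MPI_Send(buf, nbytes, MPI_CHAR, peer, 0, MPI_COMM_WORLD);
        }
        free(buf);
    }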
52. Topology Aware Collectives
- The MPI 1.1 specification has 14 collective operations. The most popular are Bcast, Barrier, Scatter, Gather/Allgather, and Reduce/Allreduce.
- When running jobs on a grid, the latency between two compute nodes may be relatively high if the nodes reside at different sites on the grid rather than at the same site.
- Hence, it is desirable that the collectives be implemented such that the high latency link connecting one grid site to another is used the minimum possible number of times.
- The topology aware collectives implemented in MPICH-VMI are grid aware. They tend to minimize communication between nodes that do not reside at the same grid site.
- Currently MPI_Bcast, MPI_Barrier, MPI_Reduce, and MPI_Allreduce have been implemented.
53. Topology Aware Collectives
- MPI_Bcast
- Each communicator has a coordinator node at each grid site. In the current implementation, the node with the lowest grank within a site for a given communicator is designated as the coordinator for that site.
- If the root of the broadcast is the coordinator, the root does a binomial tree broadcast for the nodes within its site. The root also does a flat tree broadcast among the coordinator nodes. The coordinator nodes in turn do a binomial tree broadcast at their own sites.
- If the root of the broadcast is not the coordinator, it assumes the role of coordinator for that site and does a flat tree broadcast (among coordinators) followed by a binomial tree broadcast (within its own site).
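For illustration, a point-to-point model of the binomial tree broadcast described above; MPICH-VMI performs this internally over an intra-site communicator, so this sketch is only a model of the algorithm, not the library's code:

    #include <mpi.h>

    void binomial_bcast(void *buf, int count, MPI_Datatype type,
                        int root, MPI_Comm comm)
    {
        int rank, size, relrank, mask;

        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);
        relrank = (rank - root + size) % size;  /* root becomes rank 0 */

        /* Receive from the parent: the lowest set bit of relrank
         * identifies the round in which this node gets the data. */
        for (mask = 1; mask < size; mask <<= 1) {
            if (relrank & mask) {
                int src = (rank - mask + size) % size;
                MPI_Recv(buf, count, type, src, 0, comm, MPI_STATUS_IGNORE);
                break;
            }
        }
        /* Forward to children in the remaining rounds. */
        mask >>= 1;
        while (mask > 0) {
            if (relrank + mask < size) {
                int dst = (rank + mask) % size;
                MPI_Send(buf, count, type, dst, 0, comm);
            }
            mask >>= 1;
        }
    }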
54. Topology Aware Collectives
- MPI_Barrier
- Each site has a coordinator node, chosen the same way as in MPI_Bcast.
- First, there is an intra-site gather of empty barrier messages, with the coordinators acting as roots.
- Then, the coordinator with the lowest grank (called the master coordinator) blocks until it receives an empty barrier message from the rest of the coordinators.
- The master coordinator replies to the site coordinators with an empty barrier message, followed by an intra-site broadcast of barrier messages, with the coordinators acting as broadcast roots.
55. Job Startup at a Single Site (NCSA)
- mpirun executes
- Spawns processes on the nodes
- Contacts the CRM for job startup synchronization
- The CRM allocates ranks and broadcasts the location of all ranks
- vmieyes daemons are queried for the active network devices of each process
- A connection mesh is established between ranks using the available network devices
(Diagram: MPIRUN, VMIEYES, and APPLICATION components at NCSA)
56. Startup for Grid Jobs (NCSA and SDSC)
- mpirun executes at each site (NCSA and SDSC) and spawns processes on that site's nodes
- Each mpirun contacts its local subjob CRM for job startup synchronization
- Each subjob CRM acts as a proxy and registers its processes with the GRID CRM
- The GRID CRM generates the job topology and allocates ranks. The topology and ranks are forwarded by the subjob CRM servers at NCSA and SDSC
- Each subjob CRM broadcasts the ranks to all of its site's processes
- vmieyes daemons are queried for the active network devices of each process
- A connection mesh is established between ranks using the available network devices: Myrinet within NCSA and Infiniband within SDSC in this example, with TCP over the WAN
(Diagram: MPIRUN, VMIEYES, APPLICATION, and subjob CRM components at each site, connected to the GRID CRM; networks shown are Myrinet, Infiniband, and TCP-WAN)