1
MPICH-VMI
  • VMI Team
  • Cluster Software Tools group, NCSA
  • Monday, August 18, 2003

2
Compiling with MPICH-VMI
  • Include the MPICH_INSTALL_PATH/bin installation
    directory in your PATH
  • This brings the compiler wrapper scripts into
    your environment
  • mpicc and mpiCC for C and C++ codes
  • mpif77 and mpif90 for F77 and F90 codes
  • Some underlying compilers, such as the GNU compiler
    suite, do not support F90. Use mpif90 -show to
    determine the underlying compiler being used.
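  • For example (a hedged illustration; the program name
    cpi.c is a placeholder):
      mpicc -o cpi cpi.c
      mpif90 -show
    The -show option prints the underlying compiler command
    line without executing it.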

3
Compiling with MPICH-VMI
  • The compiler scripts are wrappers that include
    all MPICH-VMI specific libraries and paths
  • All underlying compiler switches are supported
    and passed to the compiler
  • The MPICH-VMI library by default is compiled with
    debug symbols.

4
Running with MPICH-VMI
  • mpirun script is available for launching jobs
  • Supports all standard arguments in addition to
    MPICH-VMI specific arguments
  • mpirun uses ssh, rsh, and MPD for launching jobs.
    Default is MPD
  • Provides automatic selection/failover
  • If an MPD ring is not available, it falls back to
    ssh/rsh
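  • For example, a minimal launch on 4 processes (program
    name illustrative):
      mpirun -np 4 ./cpi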

5
Running with MPICH-VMI
  • VMI-specific arguments fall into three broad
    categories
  • Parameters that can be tuned at runtime
  • Parameters for launching GRID jobs
  • Parameters to control profiling of the job
  • The mpirun -help option lists all VMI tunable
    parameters
  • All VMI-specific parameters are optional. GRID
    jobs require some parameters to be set.

6
Running with MPICH-VMI
  • Runtime Parameters
  • -specfile Specify the underlying network
    transport to use. This can be a shortened network
    name (tcp, myrinet or mst for Infiniband) or path
    to a VMI transport definition specification file
    in XML format.
  • -force-shell Disables use of MPD for launching
    the job. GRID jobs require use of ssh/rsh for job
    launching.

7
Running with MPICH-VMI
  • Runtime Parameters
  • -job-sync-timeout Maximum number of seconds
    allowed for all processes to start. Default is
    300 seconds.
  • -debugger Use the specified debugger to debug the
    MPI application. Supported debuggers are gdb and
    totalview.
  • -mmapthreshold Specifies the memory allocation
    size in bytes for which MMAP will be used to
    obtain memory. By default all memory less than 4
    MB is allocated from the heap.
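  • A hedged example combining these runtime parameters
    (values illustrative):
      mpirun -np 8 -job-sync-timeout 600 -mmapthreshold 8388608 ./cpi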

8
Running with MPICH-VMI
  • Runtime Parameters
  • -eagerlen Specifies the message size in bytes to
    switch from short/eager protocol to rendezvous
    protocol. Default is 16KB.
  • -eagerisendcopy Specifies the largest message
    size that can be completed immediately for
    asynchronous sends (MPI_Isend).
  • -disable-short-rdma Disables the use of RDMA
    protocol for short messages.
  • -short-rdma-credits Specifies the maximum number
    of unacknowledged short RDMA messages. Default is
    32.
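  • For example, raising the eager/rendezvous crossover to
    64 KB (illustrative value):
      mpirun -np 8 -specfile tcp -eagerlen 65536 ./cpi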

9
Running with MPICH-VMI
  • Runtime Parameters
  • -rdmachunk Specifies the base RDMA chunk size for
    rendezvous protocol. All RDMA transfers for
    rendezvous are performed using the base RDMA
    chunk size. Default is 256KB.
  • -rdmapipeline Specifies the maximum number of
    RDMA chunks in flight. The overall memory demand
    for RDMA is rdmachunk size × rdmapipeline length.
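  • For example, with the default 256 KB chunk and an
    illustrative pipeline depth of 4, each rendezvous
    transfer pins 256 KB × 4 = 1 MB per side.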

10
Running with MPICH-VMI
  • Runtime Parameters
  • -v Verbose Level 1. Output VMI startup messages
    and make MPIRUN verbose.
  • -vv Verbose Level 2. Additionally output any
    warning messages.
  • -vvv Verbose Level 3. Additionally output any
    error messages.
  • -vvvv Verbose Level 10. Excess Debug. Useful only
    for developers of MPICH-VMI and submitting crash
    dumps.

11
Running with MPICH-VMI
  • An MPICH-VMI GRID job consists of one or more
    subjobs.
  • A subjob is launched on each site using
    individual mpirun commands. The specfile selected
    should be one of the xsite network transports
    (xsite-mst-tcp or xsite-myrinet-tcp).
  • The higher performance SAN (Infiniband or Myrinet)
    is used for intra-site communication. Cross-site
    communication uses TCP automatically.

12
Running with MPICH-VMI
  • Individual subjobs can use their preferred
    underlying transport as part of the GRID job.
  • It's possible to have two clusters, one with
    Infiniband and the other with Myrinet, and span a
    GRID job across them. Infiniband will be used for
    intra-cluster communication within the first
    cluster and Myrinet for communication within the
    second. TCP will be used for communication between
    the clusters.

13
Running with MPICH-VMI
  • All subjobs must specify the same GRID specific
    parameters
  • Grid Specific Parameters
  • -grid-procs Specifies the total number of
    processes in the job. The -np parameter to mpirun
    still specifies the number of processes in the
    subjob
  • -grid-crm Specifies the host running the grid CRM
    to be used for subjob synchronization.

14
Running with MPICH-VMI
  • Grid Specific Parameters
  • -key Alphanumeric string that uniquely identifies
    the grid job. This should be the same for all
    subjobs!
  • -allocator-uri The grid CRM allocator to use for
    MPI rank assignment. The default allocator is
    "default", which allocates ranks in FIFO order.
    Use vmi_crm_query to query the available
    allocators on a CRM.

15
Running with MPICH-VMI
  • Profiling Specific Parameters
  • -disable-profiling Disables collection of profile
    data. All MPICH-VMI runs gather profiling data by
    default. Profile data is sent to the profile
    server at NCSA for use by MPICH-VMI developers to
    enable further optimizations.
  • -profile-server Specifies the host running the
    MPICH-VMI profile server. Users can retarget the
    profile data to their own profiling servers to be
    used with profile guided optimization features of
    MPICH-VMI.

16
Running with MPICH-VMI
  • Executing on a single cluster using Myrinet
  • mpirun -np 32 -specfile myrinet ./cpi
  • Executing on a single cluster using Infiniband
  • mpirun -np 32 -specfile mst ./cpi
  • Don't need to recompile the executable. Selection
    at job submission via the -specfile switch to
    mpirun

17
Running with MPICH-VMI
  • Executing a cross-site (Grid) run
  • Launch an individual mpirun at each site. Need to
    specify the total number of processes
    (-grid-procs) and the number of local processes
    (-np)
  • Run a 32 processor job with 16 processors at each
    site
  • Site A: mpirun -np 16 -specfile xsite-myrinet-tcp
    -grid-procs 32 -grid-crm <ip of crm host> -key
    gridtest ./cpi
  • Site B: mpirun -np 16 -specfile xsite-myrinet-tcp
    -grid-procs 32 -grid-crm <ip of crm host> -key
    gridtest ./cpi
  • The same CRM and key must be specified during
    each subjob submission

18
Debugging with MPICH-VMI
  • MPICH-VMI supports debugging of codes using
    TotalView and gdb
  • A primitive wrapper environment for debugging
    parallel programs with gdb is available.
  • Caveat: MPD must be used for job launching
  • All stdin, stdout, and stderr are redirected to
    the gdb processes on the cluster. Standard gdb
    commands are available in this mode.
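  • For example (using the -debugger switch described
    earlier; program name illustrative):
      mpirun -debugger gdb -np 4 ./cpi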

19
Debugging with TotalView
  • To launch your MPI program with TotalView
  • mpirun -tv -np <# procs> <prog name>
  • Attaching to a running program
  • Start TotalView with no arguments, press "N"
    in the root window, and select the MPI job to
    which you want to attach.

20
Profile Guided Optimization
  • Motivation
  • Grid computing poses immense challenges to scale
    an application running across geographically
    distributed clusters with relatively high latency
    and low bandwidth.
  • The need to minimize the data transferred across
    the low bandwidth wide area link.
  • Current grid-aware MPICH implementations expect
    topology-aware support from user applications,
    which may not always be feasible.

21
VMI Profile Guided Optimization
  • MPICH-VMI2 profile guided optimization records the
    communication pattern of a specific job run. This
    communication pattern is analyzed and
    communication ranks are assigned optimally to
    minimize the bottleneck traffic for a future run
    of the same job.
  • The profile data is collected by individual nodes
    and sent to the head node of each cluster, which
    forwards it to the vmiprofile server (default
    vmiprofile server at NCSA)
  • Profiling can also be disabled or can be
    redirected to a profile server in your own
    administrative domain.
  • The profile data collected is MPI-specific
    information, such as the number of short sends and
    receives; no personal information is collected, to
    respect the privacy of the end user.

22
VMI Profile Data
  • MPI communication rank specific data used to
    build point-to-point communication graphs.
  • Job specific information to enable optimizations
    to be based on a specific job.

23
Profile Data (Job Specific)
24
Profile Data (Rank Specific)
25
Profile Tools
  • Command line tools
  • vmicollect
  • Used for generating communication graphs.
  • mincut
  • Used to generate a partition for a communication
    topology.
  • Graphic tools
  • Pajek
  • Used to display the communication graphs.

26
Profile Analyzer Tools
  • vmicollect
  • Queries the vmiprofile database to collect the
    profile data for a specific job.
  • Outputs a communication graph in pajek format
  • Arguments
  • -p <program name>
  • -d <start date> <end date>
  • -j <jobid>
  • -a <arguments passed>
  • -n <# of processors>
  • If the query is not unique, the jobid, the
    program name and argument list of each matching
    entry is displayed.
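  • A hedged example query (program name and processor
    count are illustrative):
      vmicollect -p cpi -n 32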

27
Profile Analyzer Tools
  • mincut
  • Creates a partition of a communication graph
    generated from a given jobid.
  • Outputs the partition list.
  • Arguments
  • -j <jobid>
  • -h <hostname of database server>
  • -u <username>
  • -p <password>
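  • A hedged example (values are placeholders):
      mincut -j 256 -h vmiprofile.ncsa.uiuc.edu -u <username> -p <password>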

28
VMI Grid CRM with allocator modules
  • The VMI Grid CRM server does rank assignment to
    nodes using allocator modules.
  • The default allocator does assignment using the
    standard FIFO mechanism. The other allocators
    available are random and mincut.
  • The format of the allocator file
  • <allocator name> <module path> <arguments>
  • Ex: queue /opt/mpich-vmi2.0/tools/random.so
    host=vmiprofile.ncsa.uiuc.edu?user=root

29
Allocator Modules
  • A VMI allocator module implements four functions
  • int VMI_Grid_AllocatorInit(int argc, char **argv)
  • Initializes the allocator module and takes the
    arguments given in the allocator file.
  • int VMI_Grid_Allocate(PCRM_JOB job, int argc,
    char **argv)
  • Allocates ranks to subjobs.
  • int VMI_Grid_AllocatorTerminate()
  • Cleans up the allocator module. Currently not
    used.
  • char *VMI_Grid_AllocatorName()
  • Returns a version name for the allocator.
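  • A minimal skeleton of such a module is sketched below.
    The PCRM_JOB definition, header names, and return-value
    conventions are assumptions; the real definitions live
    in the MPICH-VMI CRM headers.

      /* Illustrative allocator-module skeleton; not the real
         MPICH-VMI source. PCRM_JOB is treated as opaque here. */
      #include <stdio.h>

      typedef struct crm_job *PCRM_JOB;  /* assumed CRM job handle */

      int VMI_Grid_AllocatorInit(int argc, char **argv) {
          int i;
          /* Arguments come from the allocator file entry. */
          for (i = 0; i < argc; i++)
              fprintf(stderr, "allocator arg %d: %s\n", i, argv[i]);
          return 0;                      /* assume 0 = success */
      }

      int VMI_Grid_Allocate(PCRM_JOB job, int argc, char **argv) {
          (void)job; (void)argc; (void)argv;
          /* Assign MPI ranks to the subjob's processes here,
             e.g. in FIFO order like the default allocator. */
          return 0;
      }

      int VMI_Grid_AllocatorTerminate(void) {
          return 0;                      /* cleanup; currently unused */
      }

      char *VMI_Grid_AllocatorName(void) {
          return "example-allocator-1.0"; /* version name string */
      }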

30
Allocator URI
  • The allocator URI is specified as an mpirun option
    for selecting the topology based on a jobid.
  • Example: -allocator-uri
    mincut?host=vmiprofile?user=apant?jobid=256

31
MPICH-VMI Communication Protocols
  • MPICH-VMI implements two different protocols for
    point-to-point communication
  • Eager
  • For short messages (less than 16K. User can
    change this parameter).
  • Rendezvous
  • For long messages (More than 16K. User can change
    this parameter).

32
MPICH-VMI Communication Protocols
  • Eager Protocol
  • Message is sent to the receiver immediately. The
    receiver's messaging layer has to allocate some
    space for the message if the corresponding
    receive has not been posted, i.e., an MPI_Recv
    has not been called. This protocol can give good
    performance for large messages too if the number
    of unexpected receives is very low, since
    unexpected receives require a memcpy of the
    incoming message to a temporary buffer. A lot of
    large unexpected messages can add a huge overhead.

33
MPICH-VMI Communication Protocols
  • Eager Protocol (Short messages)
  • Two different implementations
  • Using send/recv communication model. Relatively
    expensive. Overhead associated with send/recv
    communication.
  • Using RDMA. Faster. Data is deposited directly
    into the messaging layer's buffer, from where it
    is copied into the application's buffer (one
    memcpy if the message is expected, two if the
    message is unexpected).

34
MPICH-VMI Communication Protocols
  • Rendezvous Protocol (Large messages)
  • Message is sent to the receiver only after the
    receiver has posted a receive (called MPI_Recv)
    for the message. This requires an additional
    handshake between the communicating processes.
  • This protocol is implemented in MPICH-VMI using
    RDMA and provides a true zero copy data transfer.

35
Tunable Parameters
  • Two broad classes of tunable parameters
  • Communication Protocol Tuning
  • Tuning Eager send/recv
  • Tunable parameters: -eagerisendcopy,
    -eagerunexcount, -eagerlen
  • Tuning Eager RDMA
  • Tunable parameters: -disable-short-rdma,
    -short-rdma-credits
  • Tuning Rendezvous
  • Tunable parameters: -rdmachunk, -rdmapipeline
  • Miscellaneous Tuning
  • Tunable parameters: -mmapthreshold

36
Tuning Eager send/recv Protocol
  • eagerisendcopy
  • Specifies the size of the largest message that
    can be copied in an asynchronous eager send/recv
    send so that the send completes immediately.
  • To send any message, the send buffer must be
    registered (pinned down). In an asynchronous send,
    memory registration can be avoided and the send
    can be completed immediately (the application can
    immediately reuse the send buffer) if the
    message can be appended to the send packet
    header, since that region is already registered.
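  • A hedged C sketch of the buffer-reuse benefit described
    above (the 1 KB size is illustrative and assumed to be
    below the -eagerisendcopy threshold):

      #include <mpi.h>
      #include <string.h>

      void small_isend(int peer) {
          char buf[1024];              /* small, below copy threshold */
          MPI_Request req;
          memset(buf, 0, sizeof(buf));
          MPI_Isend(buf, sizeof(buf), MPI_CHAR, peer, 0,
                    MPI_COMM_WORLD, &req);
          /* With the copy optimization the payload is copied into
             the pre-registered packet header, so this wait is
             typically immediate and the buffer becomes reusable. */
          MPI_Wait(&req, MPI_STATUS_IGNORE);
          memset(buf, 1, sizeof(buf)); /* safe to reuse the buffer */
      }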

37
Tuning Eager send/recv Protocol
  • Tradeoffs: eagerisendcopy
  • Pro(s)
  • For messages of size less than eagerisendcopy,
    the application does not have to wait for the send
    to be completed, so it can carry on with any
    computation, thus overlapping computation with
    communication.
  • Message buffers of size less than eagerisendcopy
    will not have to be registered (registration is
    expensive if the buffer is not in VMI cache).
  • Con(s)
  • Since messages of size less than eagerisendcopy
    will be copied, memcpy can add overhead.
  • If the message buffer is already registered
    (buffer is in VMI cache), copying the message
    will add unnecessary overhead.

38
Tuning Eager send/recv Protocol
  • eagerunexcount
  • This specifies the maximum number of unexpected
    short message receive buffers registered (pinned
    down) at any time. If the number of unexpected
    receives exceeds this value, a temporary buffer is
    allocated (malloc) for each new unexpected
    message, the message is copied into it (memcpy),
    and the receive buffer is released.

39
Tuning Eager send/recv Protocol
  • Tradeoffs: eagerunexcount
  • Pro(s)
  • If your application has a large number of
    unexpected receives, you can reduce the amount of
    pinned down (registered) memory by reducing the
    value of this variable.
  • Con(s)
  • If an application has a large number of
    unexpected receive buffers using pinned down
    memory, memory resources will be strained.
  • Tradeoff between the cost of keeping pinned down
    memory from being reused and the cost of
    malloc + memcpy for using the temporary buffer.

40
Tuning Eager send/recv Protocol
  • eagerlen
  • Messages of size less than or equal to eagerlen
    use eager protocol. Messages of size greater than
    eagerlen use rendezvous protocol.

41
Tuning Eager send/recv Protocol
  • Tradeoffs: eagerlen
  • Pro(s)
  • If the ratio of unexpected receives to expected
    receives is very low, using eager protocol even
    for large messages might be faster. In that case,
    increase the eagerlen so that large messages also
    use eager protocol.
  • If the network interconnect has high latency
    (like TCP), using the eager protocol is more
    beneficial than Rendezvous, since Rendezvous
    requires a handshake between the sender and the
    receiver; with a high-latency interconnect that
    handshake hurts performance.
  • Con(s)
  • Messages of size less than eagerlen will have the
    overhead associated with send/recv communication.
    Rendezvous, on the other hand, can deposit data
    directly into the receiver's buffer, avoiding
    send/recv overhead.
  • If the ratio of unexpected receives to expected
    receives is very high, messages using eager
    protocol will have to be temporarily buffered
    which is expensive.
  • Short messages using eager protocol use a buffer
    pool (pinned down memory) where each buffer is
    the size of eagerlen. With large eagerlen, more
    memory will be pinned down, increasing total
    memory utilization of your application. This can
    negatively impact performance if your system does
    not have sufficient memory resources.

42
Tuning Eager RDMA Protocol
  • disable-short-rdma
  • Disables use of the RDMA protocol for short
    messages. Only the send/recv eager protocol is
    used for messages less than eagerlen.
  • In short RDMA, the sender puts the data into a
    series of slots whose addresses the receiver has
    published to the sender. Only after these slots
    have filled up does the sender start using the
    send/recv eager protocol.

43
Tuning Eager RDMA Protocol
  • Tradeoffs: disable-short-rdma
  • Pro(s)
  • Eager RDMA improves both bandwidth and latency,
    since the extra overhead (memcpy etc) associated
    with eager send/recv protocol is avoided.
  • Con(s)
  • Short RDMA requires the receiver to permanently
    pin down the regions of memory it publishes to the
    sender for putting short messages. This requires
    the system to have sufficient memory resources.
    If your system does not have sufficient memory,
    either issue -disable-short-rdma to use only the
    eager protocol for short messages or reduce the
    number of RDMA credits to lessen the amount of
    memory pinned down.

44
Tuning Eager RDMA Protocol
  • short-rdma-credits
  • Maximum number of unacknowledged short RDMA
    messages.
  • short-rdma-credits also equals the number of RDMA
    slots allocated by the receiver where the sender
    can deposit short RDMA messages. The size of each
    slot is equal to eagerlen.
  • If all the short RDMA slots are filled up,
    MPICH-VMI switches to send/recv eager protocol.
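  • With the defaults above (32 credits and a 16 KB
    eagerlen), for example, the published slots pin
    32 × 16 KB = 512 KB per connection.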

45
Tuning Eager RDMA Protocol
  • Tradeoffs: short-rdma-credits
  • Pro(s)
  • If your application has a large number of
    unexpected receives, it is likely that short RDMA
    slots will fill up quickly, increasing the ratio
    of eager send/recv sends and receives to eager
    RDMA sends and receives. If your system has
    sufficient memory resources, increasing
    short-rdma-credits will potentially improve
    performance by decreasing the ratio of eager
    send/recv messages to eager RDMA messages, thus
    making communication faster.
  • Con(s)
  • The greater the number of slots, the more memory
    is pinned down for RDMA, straining memory
    resources.
  • A large number of slots also means a bigger set to
    poll for incoming data, which can be expensive.

46
Tuning Rendezvous Protocol
  • rdmachunk
  • Base chunk size for large RDMA transfers used for
    Rendezvous protocol.
  • MPICH-VMI fragments all messages being sent over
    Rendezvous protocol into chunks of size rdmachunk
    to avoid pinning down large buffers of memory.

47
Tuning Rendezvous Protocol
  • Tradeoffs: rdmachunk
  • Pro(s)
  • Fragmenting large RDMA transfers on both sender
    and receiver sides reduces amount of pinned down
    memory, conserving memory resources.
  • Con(s)
  • If the size of the message being sent is greater
    than rdmachunk, the message will be fragmented
    and the cost of multiple RDMA puts will be
    incurred.

48
Tuning Rendezvous Protocol
  • rdmapipeline
  • Maximum number of RDMA chunks in flight.
  • It also equals the number of buffers the receiver
    publishes to the sender where the sender can put
    the data.
  • Hence, the amount of registered memory for each
    Rendezvous send is rdmapipeline × rdmachunk on
    both the sender's and the receiver's end.

49
Tuning Rendezvous Protocol
  • Tradeoffs: rdmapipeline
  • Pro(s)
  • If the message to be sent is larger than
    rdmachunk, two or more puts will be required to
    deposit the data into the receiver's buffer, and
    the receiver posts a publish to the sender as each
    put completes. With rdmapipeline, the sender can
    do multiple puts without waiting for the
    receiver's publish. This can be very useful,
    especially with high latency interconnects.
  • Con(s)
  • A large rdmapipeline can strain memory resources,
    since each communicating process has to keep
    pinned down (registered) memory of size
    rdmapipeline × rdmachunk for each rendezvous
    communication.

50
Tuning Misc. Parameters
  • mmapthreshold
  • Specifies the memory allocation size in bytes for
    which MMAP will be used to obtain memory. All
    memory allocations of size less than
    mmapthreshold are done from the heap.
  • VMI does not keep mmaped memory in its cache of
    registered memory. Therefore, there will be a
    cache miss for every mmaped memory buffer that
    needs to be registered.
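  • For example, to serve allocations of up to 16 MB from
    the heap (illustrative value):
      mpirun -np 4 -mmapthreshold 16777216 ./cpi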

51
Tuning Misc. Parameters
  • Tradeoffs: mmapthreshold
  • Pro(s)
  • If an application allocates large buffers and
    uses them for messaging multiple times, having
    those buffers allocated from the heap helps, since
    VMI will keep them in its cache of registered
    memory. Increasing mmapthreshold is useful in this
    case, since buffers of size less than
    mmapthreshold will use the heap.
  • TIP: To find out if your application is using
    mmaped memory, see the cache statistics in the
    profiling data.

52
Topology Aware Collectives
  • MPI 1.1 specification has 14 collective
    operations. Most popular are Bcast, Barrier,
    Scatter/Allscatter, Gather/Allgather,
    Reduce/Allreduce.
  • When running jobs on a grid, the latencies
    between any two compute nodes might be relatively
    high if these nodes reside on different sites on
    the grid rather than on the same site.
  • Hence, it is desirable that the collectives are
    implemented such that the high latency link
    connecting one grid site to another is used the
    minimum possible number of times.
  • The topology aware collectives implemented in
    MPICH-VMI are grid aware. They tend to minimize
    communication between nodes that do not reside on
    the same grid site.
  • Currently MPI_Bcast, MPI_Barrier, MPI_Reduce and
    MPI_Allreduce have been implemented.

53
Topology Aware Collectives
  • MPI_Bcast
  • Each communicator has a coordinator node at each
    grid site. In the current implementation, the
    node with the lowest grank within a site for a
    given communicator is designated as the
    coordinator for that site.
  • If the root of the broadcast is the coordinator,
    the root does a binomial tree broadcast for nodes
    within that site. The root also does a flat tree
    broadcast among the coordinator nodes. The
    coordinator nodes in turn do a binomial tree
    broadcast at their own sites.
  • If the root of the broadcast is not the
    coordinator, it assumes the role of coordinator
    for that site and does a flat tree broadcast
    (among coordinators) followed by a binomial tree
    broadcast (within its own site). A sketch of this
    two-level scheme appears below.
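  • An illustrative sketch of the two-level scheme using
    standard MPI calls (not MPICH-VMI's internal code; it
    assumes the root is the coordinator of its site with the
    lowest global rank, and my_site_id identifies the
    caller's grid site):

      #include <mpi.h>

      void grid_bcast(void *buf, int count, MPI_Datatype type,
                      MPI_Comm comm, int my_site_id) {
          MPI_Comm site_comm, coord_comm;
          int rank, site_rank;
          MPI_Comm_rank(comm, &rank);

          /* Group processes by site; the lowest global rank in
             each site becomes rank 0 of site_comm, i.e. the
             site coordinator. */
          MPI_Comm_split(comm, my_site_id, rank, &site_comm);
          MPI_Comm_rank(site_comm, &site_rank);

          /* Cross-site phase: coordinators form their own
             communicator and broadcast among themselves over
             the high-latency WAN link. */
          MPI_Comm_split(comm, site_rank == 0 ? 0 : MPI_UNDEFINED,
                         rank, &coord_comm);
          if (coord_comm != MPI_COMM_NULL) {
              MPI_Bcast(buf, count, type, 0, coord_comm);
              MPI_Comm_free(&coord_comm);
          }

          /* Intra-site phase: each coordinator broadcasts to
             the nodes of its own site. */
          MPI_Bcast(buf, count, type, 0, site_comm);
          MPI_Comm_free(&site_comm);
      }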

54
Topology Aware Collectives
  • MPI_Barrier
  • Each site has a coordinator node, chosen the same
    way as in MPI_Bcast.
  • First, there is an intra-site gather of empty
    barrier messages, with the coordinators acting as
    roots.
  • Then, the coordinator with the lowest grank
    (called the master coordinator) blocks until it
    receives an empty barrier message from each of the
    other coordinators.
  • The master coordinator replies to the site
    coordinators with an empty barrier message,
    followed by an intra-site broadcast of barrier
    messages, with the coordinators acting as
    broadcast roots.

55
Job Startup at Single Site
NCSA
  • mpirun executes
  • spawns processes on nodes
  • Contact CRM for job startup synchronization
  • CRM allocates ranks and broadcasts location of
    all ranks
  • Query vmieyes daemons for active network devices
    for each process
  • Establish connection mesh between ranks using
    the available network devices

(Diagram: MPIRUN, the VMIEYES daemons, and the APPLICATION processes on the cluster)
56
Startup for Grid Jobs
NCSA
  • mpirun executes at NCSA
  • spawns processes on NCSA nodes
  • Contact Subjob CRM for job startup
    synchronization at NCSA
  • Subjob CRM at NCSA broadcasts ranks to all NCSA
    processes
  • Query vmieyes daemons for active network devices
    for each process
  • Establish connection mesh between ranks using
    the available network devices. Myrinet used at
    NCSA in example, TCP for WAN
  • NCSA CRM acts as proxy and registers processes
    with GRID CRM

(Diagram: MPIRUN, VMIEYES daemons, APPLICATION processes, and a subjob CRM at each site; a GRID CRM linking NCSA and SDSC; Myrinet within NCSA, Infiniband within SDSC, TCP for the WAN)

SDSC
  • mpirun executes at SDSC
  • spawns processes on SDSC nodes
  • Contact Subjob CRM for job startup
    synchronization at SDSC
  • Subjob CRM at SDSC broadcasts ranks to all SDSC
    processes
  • Query vmieyes daemons for active network devices
    for each process
  • Establish connection mesh between ranks using
    the available network devices. Infiniband used at
    SDSC in example, TCP for WAN
  • SDSC CRM acts as proxy and registers processes
    with GRID CRM
  • GRID CRM generates job topology and allocates
    ranks. Topology and ranks forwarded by subjob CRM
    servers at NCSA and SDSC