Title: Protocols and software for exploiting Myrinet clusters
1Protocols and software for exploiting Myrinet
clusters
- Congduc Pham
- and the main contributors
- P. Geoffray,
- L. Prylli,
- B. Tourancheau,
- R. Westrelin
2Parallel machines and clusters
Cplant
Standalone workstation
3Pros for clusters
- Large supercomputers are expensive and suffer
from a short useful life span
- Performance of workstations and PCs is rapidly
improving
- The communications bandwidth between workstations
is increasing as new networking technologies and
protocols are implemented in LANs and WANs
- Workstation clusters are easier to integrate into
existing networks than special parallel
computers
- Use of clusters of workstations as a distributed
computing resource is very cost effective:
incremental growth or update of the system!
4No polemical discussion, just a statement
[Figure, from R. Buyya. Labels: Mainframe, Vector Supercomputer, Mini Computer, Workstation, PC; 1984; GigaEthernet, Giganet, SCI, Myrinet]
5The Myrinet technology
- Switch
- full crossbar
- wormhole source routing
- small latency
- Network interface
- embedded RISC processor
- programmable
- local memory
- several DMA engines
Current specifications
- Up to 200 MHz processor
- Up to 8 MB local memory
- 64-bit/66 MHz PCI bus (528 MB/s peak)
- 250 MB/s full-duplex links
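The 528 MB/s PCI peak quoted above follows directly from the bus width and clock. A minimal check of that arithmetic, assuming a nominal 66 MHz clock and 1 MB = 10^6 bytes (the function and constant names are illustrative only):

```python
# Peak PCI bandwidth from the bus width and clock quoted on the slide.
PCI_WIDTH_BITS = 64
PCI_CLOCK_HZ = 66_000_000   # nominal 66 MHz (assumption: exact round figure)
LINK_MB_PER_S = 250         # per direction, from the slide


def pci_peak_mb_per_s(width_bits: int, clock_hz: int) -> float:
    """Bytes per transfer times transfers per second, in MB/s (1 MB = 10**6 bytes)."""
    return (width_bits // 8) * clock_hz / 1e6


if __name__ == "__main__":
    peak = pci_peak_mb_per_s(PCI_WIDTH_BITS, PCI_CLOCK_HZ)
    print(f"PCI peak: {peak:.0f} MB/s")
    print(f"Link, both directions: {2 * LINK_MB_PER_S} MB/s")
```

With both directions active, the link can carry 500 MB/s in aggregate, just under the 528 MB/s peak of the host bus.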
6The raw performance is here, but
- the traditional communication software fails to
bring the hardware performance to the applications
[Figure: speed analogy. Myrinet: 200 mph. Traditional communication layers: 40 mph, 35 mph. Optimized communication layers: 180 mph, 175 mph.]
7Going faster by taking shortcuts
8Our communication architecture
- Provides a complete suite for high-performance
communications, with a focus on Myrinet-based clusters
- Viewed as layers, but bypasses the OS as much as
possible
MPI-BIP
BIP
BIP-SMP
programmable NICs break the traditional spatial
distribution of tasks
Myrinet physical layer
9BIP, the lowest protocol level
- Basic Interface for Parallelism
- very basic API
- provides a library, a kernel module and an MCP (Myrinet Control Program)
- definitely not for the end-user
- Optimizations for
- latency
- maximum throughput
- the throughput increase
- The implementation performs
- reduction of the data critical path
- distinction between small and large messages
- burst or write combining for host → NIC
- optimal cache usage
- cache snooping for NIC → host (monitoring of the
PCI bus)
- buffer alignment
- optimal fragment size
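Why an optimal fragment size exists can be illustrated with a textbook pipeline model. This is a sketch under stated assumptions, not BIP's actual tuning procedure: the number of stages, the per-fragment overhead and the bandwidth below are hypothetical values chosen only to show the trade-off. Small fragments pay the fixed overhead many times; large fragments fill the pipeline slowly.

```python
import math


def pipelined_time_us(msg_bytes, frag_bytes, stages, overhead_us, bw_bytes_per_us):
    """Time to push a message through `stages` equal pipeline stages.

    Each fragment costs a fixed overhead plus its transfer time in every stage;
    the total is (number of fragments + stages - 1) stage times.
    """
    n_frags = math.ceil(msg_bytes / frag_bytes)
    per_frag = overhead_us + frag_bytes / bw_bytes_per_us
    return (n_frags + stages - 1) * per_frag


def best_fragment(msg_bytes, candidates, stages, overhead_us, bw_bytes_per_us):
    """Return the candidate fragment size with the smallest modeled time."""
    return min(
        candidates,
        key=lambda f: pipelined_time_us(msg_bytes, f, stages, overhead_us, bw_bytes_per_us),
    )


if __name__ == "__main__":
    # Hypothetical parameters: 64 KiB message, 3 stages (host DMA, link, remote DMA),
    # 2 us overhead per fragment per stage, 100 bytes/us (100 MB/s) per stage.
    sizes = [2 ** k for k in range(8, 17)]  # 256 B .. 64 KiB
    for f in sizes:
        print(f, round(pipelined_time_us(65536, f, 3, 2.0, 100.0), 2))
    print("best:", best_fragment(65536, sizes, 3, 2.0, 100.0))
```

In this model the best size sits between the extremes; the real optimum depends on measured overheads of the host, the PCI bus and the NIC.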
10BIP, small message strategy
- Avoids handshakes between the host and the NIC
- Uses PIO to a NIC FIFO on the sending side and an
extra memory copy on the receiving side
11BIP, large message strategy
- Uses DMA on both the send side and the receive side:
higher bandwidth, offloads the CPU
- Zero-copy mechanism, pipelined transmission
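The distinction between the two strategies can be sketched as a simple cost model. All numbers and function names here are hypothetical and only illustrate the reasoning: the PIO path has no setup cost but a low bandwidth and an extra receive-side copy, while the DMA path pays a fixed setup cost and then moves data faster. The actual switch point in BIP is not given in the slides.

```python
def pio_time_us(size, pio_bw=50.0, copy_bw=200.0):
    """Small-message path: PIO into the NIC FIFO plus one receive-side copy.

    Bandwidths are in bytes per microsecond (hypothetical values).
    """
    return size / pio_bw + size / copy_bw


def dma_time_us(size, setup_us=5.0, dma_bw=120.0):
    """Large-message path: fixed host/NIC setup cost, then DMA at higher bandwidth."""
    return setup_us + size / dma_bw


def choose_path(size):
    """Pick the path with the lower modeled time for a message of `size` bytes."""
    return "pio" if pio_time_us(size) <= dma_time_us(size) else "dma"


def crossover_bytes(max_size=1 << 20):
    """Smallest message size for which the DMA path wins in this model."""
    for size in range(1, max_size + 1):
        if choose_path(size) == "dma":
            return size
    return None


if __name__ == "__main__":
    for size in (16, 128, 1024, 65536):
        print(size, choose_path(size))
    print("crossover:", crossover_bytes())
```

Below the crossover the fixed setup cost dominates, which is why avoiding handshakes matters for small messages; above it, bandwidth dominates and DMA with zero copy pays off.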
12BIP-SMP a low level for SMP machines
- SMP nodes (2 or 4 processors) viewed as the
architectures with the best performance/price ratio
- BIP-SMP provides
- management of concurrent accesses to the NIC
- low latency intra-node communications
- BIP equivalent inter-node communication
- total transparency for the applications and
end-users
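The features listed above can be pictured with a toy model. This is an illustration only and does not reflect the BIP-SMP implementation or API: the `Node` and `Cluster` classes are hypothetical, in-memory queues stand in for shared memory, and a lock stands in for whatever mechanism serializes access to the shared NIC. The point is that the caller uses the same `send` regardless of where the destination runs.

```python
import threading
from collections import deque


class Node:
    """One SMP node: several local ranks share a single NIC."""

    def __init__(self, node_id, ranks):
        self.node_id = node_id
        self.ranks = set(ranks)
        self.nic_lock = threading.Lock()               # serializes access to the shared NIC
        self.mailboxes = {r: deque() for r in ranks}   # stand-in for shared-memory queues
        self.nic_log = []                              # messages handed to the NIC


class Cluster:
    def __init__(self, nodes):
        self.rank_to_node = {r: n for n in nodes for r in n.ranks}

    def send(self, src, dst, payload):
        """Same call for every destination; the route is chosen internally."""
        src_node = self.rank_to_node[src]
        dst_node = self.rank_to_node[dst]
        if src_node is dst_node:
            # Intra-node: deliver through memory, the NIC is not involved.
            dst_node.mailboxes[dst].append((src, payload))
            return "shared-memory"
        # Inter-node: only one local process drives the NIC at a time.
        with src_node.nic_lock:
            src_node.nic_log.append((src, dst, payload))
        dst_node.mailboxes[dst].append((src, payload))
        return "network"


if __name__ == "__main__":
    cluster = Cluster([Node(0, [0, 1]), Node(1, [2, 3])])
    print(cluster.send(0, 1, "hello"))   # same node
    print(cluster.send(0, 2, "hello"))   # different node
```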
13BIP-SMP Moving data between processes
14MPI-BIP the communication middleware
- MPI-BIP adds high-level features to BIP
- based on the MPICH implementation
- provides a portable and widely-used API
- implements a credit-based flow control for small
messages
- request FIFO for multiple non-blocking operations
- provides segmentation/reassembly features to
avoid timeouts
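Credit-based flow control can be sketched generically as follows. This is a minimal illustration of the technique, not MPI-BIP's code: the `Sender` and `Receiver` classes, the slot count and the way credits are returned are all assumptions. The sender holds one credit per free receive buffer, spends one per small message, and queues messages when it has none left; the receiver returns a credit each time the application consumes a message.

```python
from collections import deque


class Receiver:
    """Holds a fixed number of buffer slots for small messages."""

    def __init__(self, slots):
        self.slots = slots
        self.buffer = deque()

    def deliver(self, msg):
        # With correct credit accounting this overflow can never happen.
        if len(self.buffer) >= self.slots:
            raise OverflowError("receive buffer overrun")
        self.buffer.append(msg)

    def consume(self):
        """Application reads one message; one slot is freed, so one credit returns."""
        return self.buffer.popleft(), 1


class Sender:
    """Sends only while it holds credits; otherwise queues locally."""

    def __init__(self, receiver, credits):
        self.receiver = receiver
        self.credits = credits
        self.pending = deque()

    def send(self, msg):
        self.pending.append(msg)
        self._flush()

    def add_credits(self, n):
        self.credits += n
        self._flush()

    def _flush(self):
        while self.pending and self.credits > 0:
            self.receiver.deliver(self.pending.popleft())
            self.credits -= 1


if __name__ == "__main__":
    rx = Receiver(slots=2)
    tx = Sender(rx, credits=2)
    for m in ("a", "b", "c"):
        tx.send(m)
    print(list(rx.buffer), list(tx.pending))   # third message waits for a credit
    msg, returned = rx.consume()
    tx.add_credits(returned)
    print(msg, list(rx.buffer), list(tx.pending))
```

Because the sender never transmits without a credit, the receiver's small-message buffers cannot be overrun, and no per-message handshake is needed.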
15Working with the BIP software suite
- installation
- run configure
- compilation and linkage
- several libraries: bip, bip-smp, mpi
- compile with bipcc
- Submitting jobs and monitoring nodes
- run myristat to know which nodes are available
- run bipconf to configure the virtual machine
- use bipload to launch programs
16WebCM a high level management tool
- web-based management tool
- integrates existing solutions into a common
framework
17The WebCM user interface
- graphical interface for myristat and bipconf
- allows submission of jobs through batch packages
- shows the user's virtual machine definition and
the user's running processes
- new functionality is added by
incorporating new software packages
18Latency BIP and MPI-BIP
19Throughput BIP and MPI-BIP
20BIP-SMP intra-node communications
21BIP-SMP inter-node communications
22What runs on our clusters?
- Genomic simulation
- Fluid dynamics
- Discrete Event Parallel Simulation
- Distributed Shared Memory System
- Want to know more?
- getting the distribution
- getting the documentation
http://resam.univ-lyon1.fr