Title: High-Performance Networking (HPN) Group
1. High-Performance Networking (HPN) Group
The HPN Group was formerly known as the SAN group.
- Distributed Shared-Memory Parallel Computing with UPC on SAN-based Clusters
- Appendix for Q3 Status Report
- DOD Project MDA904-03-R-0507
- February 5, 2004
2. Outline
- Objectives and Motivations
- Background
- Related Research
- Approach
- Results
- Conclusions and Future Plans
3. Objectives and Motivations
- Objectives
  - Support advancements for HPC with Unified Parallel C (UPC) on cluster systems exploiting high-throughput, low-latency system-area networks (SANs) and LANs
  - Design and analysis of tools to support UPC on SAN-based systems
  - Benchmarking and case studies with key UPC applications
  - Analysis of tradeoffs in application, network, service, and system design
- Motivations
  - Increasing demand in sponsor and scientific computing community for shared-memory parallel computing with UPC
  - New and emerging technologies in system-area networking and cluster computing
    - Scalable Coherent Interface (SCI)
    - Myrinet (GM)
    - InfiniBand
    - QsNet (Quadrics Elan)
    - Gigabit Ethernet and 10 Gigabit Ethernet
    - PCI Express (3GIO)
  - Clusters offer excellent cost-performance potential
4. Background
- Key sponsor applications and developments toward shared-memory parallel computing with UPC
  - More details from sponsor are requested
- UPC extends the C language to exploit parallelism (a minimal sketch appears at the end of this slide)
  - Currently runs best on shared-memory multiprocessors (notably HP/Compaq's UPC compiler)
  - First-generation UPC runtime systems becoming available for clusters (MuPC, Berkeley UPC)
- Significant potential advantage in cost-performance ratio with COTS-based cluster configurations
  - Leverage economy of scale
  - Clusters exhibit low cost relative to tightly coupled SMP, CC-NUMA, and MPP systems
  - Scalable performance with commercial off-the-shelf (COTS) technologies
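To make the UPC model referenced above concrete, the following is a minimal, hedged sketch (not project code; the array names and size are illustrative) of how UPC expresses a shared address space and affinity-driven work partitioning across threads:

    /* Minimal UPC sketch: shared data plus affinity-driven loop partitioning.
     * Illustrative only; the arrays and size here are not from the project. */
    #include <upc.h>
    #include <stdio.h>

    #define N 1024

    /* Shared arrays, distributed round-robin across threads (default block size 1). */
    shared int a[N], b[N], c[N];

    int main(void) {
        int i;

        /* Each thread initializes only the elements it has affinity to. */
        upc_forall (i = 0; i < N; i++; &a[i]) {
            a[i] = i;
            b[i] = 2 * i;
        }
        upc_barrier;

        /* Work is partitioned by affinity: the thread owning c[i] computes it.
         * Because a, b, and c share the same layout, these reads stay local. */
        upc_forall (i = 0; i < N; i++; &c[i])
            c[i] = a[i] + b[i];
        upc_barrier;

        if (MYTHREAD == 0)
            printf("c[%d] = %d (expected %d)\n", N - 1, c[N - 1], 3 * (N - 1));
        return 0;
    }

With the Berkeley UPC tools such a program would typically be compiled with upcc and launched with upcrun; on shared-memory multiprocessors the same source runs under the HP/Compaq UPC compiler.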
5. Related Research
- University of California at Berkeley
  - UPC runtime system
  - UPC-to-C translator
  - Global-Address Space Networking (GASNet) design and development
  - Application benchmarks
- George Washington University
  - UPC specification
  - UPC documentation
  - UPC testing strategies, testing suites
  - UPC benchmarking
  - UPC collective communications
  - Parallel I/O
- Michigan Tech University
  - Michigan Tech UPC (MuPC) design and development
  - UPC collective communications
  - Memory model research
  - Programmability studies
  - Test suite development
- Ohio State University
  - UPC benchmarking
- HP/Compaq
  - UPC compiler
- Intrepid
  - GCC UPC compiler
6. Approach
- Collaboration
  - HP/Compaq UPC Compiler V2.1 running in lab on new ES80 AlphaServer (Marvel)
  - Support of testing by OSU, MTU, UCB/LBNL, UF, et al. with leading UPC tools and systems for function and performance evaluation
  - Field test of newest compiler and system
- Benchmarking
  - Use and design of applications in UPC to grasp key concepts and understand performance issues
- Exploiting SAN Strengths for UPC
  - Design and develop new SCI conduit for GASNet in collaboration with UCB/LBNL
  - Evaluate DSM for SCI as an option for executing UPC
- Performance Analysis
  - Network communication experiments
  - UPC computing experiments
- Emphasis on SAN Options and Tradeoffs
  - SCI, Myrinet, InfiniBand, Quadrics, GigE, 10GigE, etc.
7. GASNet: Experimental Setup and Analysis
- Experimental Results
  - Throughput
    - Elan shows the best performance, with approx. 300 MB/s in both put and get operations
    - Myrinet and SCI very close, with 200 MB/s on put operations
    - Myrinet obtains nearly the same performance with get operations
    - SCI suffers from the reference extended API in get operations (approx. 7 MB/s) due to greatly increased latency
      - get operations will benefit the most from an extended API implementation
      - Currently being addressed in UF's design of the extended API for SCI
    - MPI suffers from high latency but still performs well on GigE, with almost 50 MB/s
  - Latency
    - Elan again performs best: put and get at 8 µs
    - Myrinet: put 20 µs, get 33 µs
    - SCI: both put and get at 25 µs, better than Myrinet get for small messages
      - Larger messages suffer from the AM RPC protocol
    - MPI latency too high to show (250 µs)
  - Elan is the best performer in low-level API tests
- Testbed
  - Elan, MPI, and SCI conduits
    - Dual 2.4 GHz Intel Xeon, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
    - Dolphin SCI D337 (2D/3D) NICs (spec 667 MB/s, 300 MB/s sustained), using PCI 64/66, 4x2 torus
    - Elan3 (spec 528 MB/s, 340 MB/s sustained), using PCI-X, in two nodes with a QM-S16 16-port switch
    - RedHat 9.0 with gcc compiler V3.3.2
  - GM (Myrinet) conduit (c/o access to cluster at MTU)
    - Dual 2.0 GHz Intel Xeon, 2 GB DDR PC2100 (DDR266) RAM
    - Myrinet 2000 (spec 250 MB/s), using PCI-X, on 8 nodes connected with a 16-port M3F-SW16 switch
    - RedHat 7.3 with Intel C compiler V7.1
- Experimental Setup
  - Elan and GM conduits executed with the extended API implemented
  - SCI and MPI conduits executed with the reference API (based on AM in the core API)
  - GASNet conduit experiments
    - Berkeley GASNet test suite
    - Average of 1000 iterations
    - Each uses bulk transfers to take advantage of implemented extended APIs (see the sketch at the end of this slide)
    - Latency results use testsmall
Testbed made available by Michigan Tech.
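As a rough illustration of how such conduit measurements are taken, below is a hedged sketch of a GASNet put-bandwidth loop in the spirit of the Berkeley test programs (this is not the Berkeley test code; ITERS and MSGSZ are illustrative, segment handling and error checking are simplified, and a two-node job with a remote segment of at least MSGSZ bytes is assumed):

    /* Hedged sketch of a GASNet extended-API bandwidth loop (not the Berkeley
     * test code).  Assumes two nodes; error checking omitted for brevity. */
    #include <gasnet.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    #define ITERS 1000             /* matches the 1000-iteration averaging above */
    #define MSGSZ (64 * 1024)      /* one transfer size; the real tests sweep many */

    int main(int argc, char **argv) {
        static char src[MSGSZ];
        gasnet_seginfo_t *seg;
        gasnet_node_t peer;
        int i;

        gasnet_init(&argc, &argv);
        gasnet_attach(NULL, 0, gasnet_getMaxLocalSegmentSize(), GASNET_PAGESIZE);

        peer = (gasnet_mynode() + 1) % gasnet_nodes();
        seg = malloc(gasnet_nodes() * sizeof(*seg));
        gasnet_getSegmentInfo(seg, gasnet_nodes());   /* where each node's segment lives */
        memset(src, 0xAB, sizeof src);

        if (gasnet_mynode() == 0) {
            struct timeval t0, t1;
            double s;
            gettimeofday(&t0, NULL);
            for (i = 0; i < ITERS; i++)
                gasnet_put_bulk(peer, seg[peer].addr, src, MSGSZ);  /* blocking bulk put */
            gettimeofday(&t1, NULL);
            s = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
            printf("put_bulk %d B: %.1f MB/s\n", MSGSZ, ITERS * (double)MSGSZ / s / 1e6);
        }

        /* keep the passive node alive until the timing loop completes */
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }

On conduits with a native extended API (Elan, GM here) the bulk put maps onto the hardware's remote-memory operations; on a core-API-only conduit such as the current SCI conduit it falls through to the AM-based reference implementation, which is what the reference-API numbers above reflect.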
8. GASNet Throughput on Conduits
For get operations, the conduit must wait for the RPC to be executed before data can be pushed back.
9. GASNet Latency on Conduits
Despite not yet having constructed the extended API, which would allow better hardware exploitation, the SCI conduit still manages to keep pace with the GM conduit for throughput and for most small-message latencies. The Q1 report shows a target possibility of 10 µs latencies.
SCI results are based on the generic GASNet version of the extended API, which limits performance.
10. UPC Benchmarks: IS from the NAS Benchmarks
- Class A executed with Berkeley UPC runtime system V1.1, using gcc V3.3.2 for Elan and MPI and Intel C V7.1 for GM
- IS (Integer Sort): lots of fine-grain communication, low computation (see the sketch at the end of this slide)
  - Communication layer should have the greatest effect on performance
- Single thread shows performance without use of the communication layer
- Poor performance in the GASNet communication system does NOT necessarily indicate poor performance in the UPC application
  - MPI results poor for GASNet but decent for UPC applications
  - Application may need to be larger to confirm this assertion
- GM conduit shows the greatest gain from parallelization (could be partly due to the better compiler)
Only two nodes available with Elan; unable to determine scalability at this point.
TCP/IP overhead outweighs the benefit of parallelization.
Code developed at GWU.
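To illustrate the fine-grain pattern noted above, here is a hedged UPC fragment (illustrative names and sizes, not the NAS IS source) contrasting per-element remote reads with a single bulk transfer; with the former the conduit's small-message latency dominates, with the latter its bandwidth does:

    /* Illustrative only (not NAS IS code): fine-grained vs. bulk remote access
     * in UPC.  Run with at least two UPC threads. */
    #include <upc.h>
    #include <stdio.h>

    #define N 4096

    /* Block size N: the whole array has affinity to thread 0. */
    shared [N] int keys[N];

    int main(void) {
        int local_copy[N];
        int i, odd = 0;

        if (MYTHREAD == 0)
            for (i = 0; i < N; i++) keys[i] = i;
        upc_barrier;

        if (MYTHREAD == 1) {
            /* Fine-grained: each keys[i] reference is a small remote read, so the
             * per-message latency of the GASNet conduit dominates. */
            for (i = 0; i < N; i++)
                if (keys[i] & 1) odd++;

            /* Bulk: one upc_memget moves the block in a single large transfer,
             * so conduit bandwidth dominates instead. */
            upc_memget(local_copy, keys, N * sizeof(int));
            for (i = 0, odd = 0; i < N; i++)
                if (local_copy[i] & 1) odd++;

            printf("odd keys: %d\n", odd);
        }
        upc_barrier;
        return 0;
    }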
11. Network Performance Tests
- Detailed understanding of high-performance cluster interconnects
  - Identifies suitable networks for UPC over clusters
  - Aids in smooth integration of interconnects with upper-layer UPC components
  - Enables optimization of network communication, unicast and collective
- Various levels of network performance analysis
  - Low-level tests
    - InfiniBand based on Virtual Interface Provider Library (VIPL)
    - SCI based on Dolphin SISCI and SCALI SCI
    - Myrinet based on Myricom GM
    - QsNet based on Quadrics Elan Communication Library
    - Host architecture issues (e.g. CPU, I/O, etc.)
  - Mid-level tests
    - Sockets
      - Dolphin SCI Sockets on SCI
      - BSD Sockets on Gigabit and 10 Gigabit Ethernet
      - GM Sockets on Myrinet
      - SOVIA on InfiniBand
    - MPI
      - InfiniBand and Myrinet based on MPI/PRO
12. Network Performance Tests
- Tests run on two Elan3 cards connected by a QM-S16 16-port switch
  - Quadrics dping used for raw tests
  - GASNet testsmall used for latency, testlarge for throughput
    - Utilizes the extended API
  - Results obtained from put operations
- Elan conduit for GASNet more than doubles the hardware latency, but still maintains sub-10 µs latency for small messages
- Conduit throughput matches the hardware
- Elan conduit does not add appreciably to performance overhead
13. Low Level vs. GASNet Conduit
- Tests run on two Myrinet 2000 cards connected by an M3F-SW16 switch
  - Myricom gm_allsize used for raw tests
  - GASNet testsmall used for latency, testlarge for throughput
    - Utilizes the extended API
  - Results obtained from puts
- GM conduit almost doubles the hardware latency, with latencies of 19 µs for small messages
- Conduit throughput follows the trend of the hardware but differs by an average of 60 MB/s for messages larger than 1024 bytes
  - Conduit peaks at 204 MB/s compared to 238 MB/s for the hardware
- GM conduit adds only a small amount of performance overhead
14. Architectural Performance Tests
- Opteron
  - Features
    - 64-bit processor
    - Real-time support of 32-bit OS
    - On-chip memory controllers
    - Eliminates the 4 GB memory barrier imposed by 32-bit systems
    - 19.2 GB/s I/O bandwidth per processor
  - Future plans
    - UPC and other parallel application benchmarks
- Pentium 4 Xeon
  - Features
    - 32-bit processor
    - Hyper-Threading technology
      - Increased CPU utilization
    - Intel NetBurst microarchitecture
      - RISC processor core
    - 4.3 GB/s I/O bandwidth
  - Future plans
    - UPC and other parallel application benchmarks
15. CPU Performance Results
- NAS Benchmarks
  - Computationally intensive: EP, FT, and IS
  - Class A problem set size
  - Opteron and Xeon comparable for floating-point operations (FT)
  - For integer operations (EP, IS), Opteron performs better than Xeon
- DOD Seminumeric Benchmark 2
  - Radix sort (a timed sketch appears at the end of this slide)
  - Measures setup time, sort time, and time to verify the sort
  - Sorting is the dominant component of execution time
- Results Analysis
  - Opteron architecture outperforms Xeon in all tests performed, for all iterations
  - Setup and verify times are around half those of the Xeon architecture
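For reference, below is a minimal, hedged sketch of an LSD radix sort with the three timed phases (setup, sort, verify) that the benchmark measures; this is a generic illustration, not the DOD benchmark code, and N is arbitrary:

    /* Generic LSD radix sort sketch (illustrative, not the DOD benchmark):
     * times the setup, sort, and verify phases separately. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (1 << 20)      /* illustrative problem size */
    #define RADIX 256        /* sort one byte per pass */

    static void radix_sort(unsigned *a, unsigned *tmp, size_t n) {
        int shift, d;
        size_t i;
        for (shift = 0; shift < 32; shift += 8) {
            size_t count[RADIX] = {0}, pos = 0;
            for (i = 0; i < n; i++) count[(a[i] >> shift) & 0xFF]++;
            for (d = 0; d < RADIX; d++) {          /* prefix sums -> bucket offsets */
                size_t c = count[d]; count[d] = pos; pos += c;
            }
            for (i = 0; i < n; i++)                /* stable scatter into tmp */
                tmp[count[(a[i] >> shift) & 0xFF]++] = a[i];
            memcpy(a, tmp, n * sizeof(unsigned));
        }
    }

    int main(void) {
        unsigned *a = malloc(N * sizeof *a), *tmp = malloc(N * sizeof *tmp);
        clock_t t0, t1, t2, t3;
        size_t i;
        int ok = 1;
        if (!a || !tmp) return 1;

        t0 = clock();
        for (i = 0; i < N; i++) a[i] = (unsigned)rand();        /* setup */
        t1 = clock();
        radix_sort(a, tmp, N);                                  /* sort */
        t2 = clock();
        for (i = 1; i < N; i++)                                 /* verify */
            if (a[i - 1] > a[i]) { ok = 0; break; }
        t3 = clock();

        printf("setup %.3fs  sort %.3fs  verify %.3fs  (%s)\n",
               (double)(t1 - t0) / CLOCKS_PER_SEC,
               (double)(t2 - t1) / CLOCKS_PER_SEC,
               (double)(t3 - t2) / CLOCKS_PER_SEC,
               ok ? "sorted" : "NOT sorted");
        free(a); free(tmp);
        return 0;
    }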
16. Memory Performance Results
- Lmbench-3.0-a3
  - Opteron latency/throughput worsen as expected at 64 KB (L1 cache size) and 1 MB (L2 cache size)
  - Xeon latency/throughput show the same trend for L1 (8 KB) but degrade earlier for L2 (256 KB instead of 512 KB)
    - Cause under investigation
  - Between CPU / L1 / L2, Opteron outperforms Xeon, but Xeon outperforms Opteron when loading data from disk into main memory
  - Write throughput for Xeon stays relatively constant for sizes < L2 cache size, suggesting use of a write-through policy between L1 and L2
  - Xeon read > Opteron write > Opteron read > Xeon write
17. File I/O Results
- Bonnie / Bonnie++
  - 10 iterations of writing and reading a 2 GB file using per-character functions and efficient block functions
  - stdio overhead is large for per-character functions
  - Efficient block reads and writes greatly reduce CPU utilization
  - Throughput results were directly proportional to CPU utilization
  - Shows the same trend as observed in the memory performance testing
    - Xeon read > Opteron write > Opteron read > Xeon write
    - Suggests memory access and I/O access might utilize the same mechanism
- AIM 9
  - 10 iterations using 5 MB files, testing sequential and random reads, writes, and copies
  - Opteron consistently outperforms Xeon by a wide margin
  - Large increase in performance for disk reads as compared to writes
  - Xeon read speeds are very high for all results, with much lower write performance
  - Opteron read speeds are also very high, and Opteron greatly outperforms the Xeon in write performance in all cases
  - Xeon sequential read is actually worse than Opteron, but still comparable
18. Conclusions and Future Plans
- Accomplishments to date
  - Baselining of UPC on shared-memory multiprocessors
  - Evaluation of promising tools for UPC on clusters
  - Leverage and extend communication and UPC layers
  - Conceptual design of new tools
  - Preliminary network and system performance analyses
  - Completed V1.0 of the GASNet Core API SCI conduit for UPC
- Key insights
  - An inefficient communication system does not necessarily translate to poor UPC application performance
  - Xeon cluster suitable for applications with a high read/write ratio
  - Opteron cluster suitable for generic applications due to comparable read/write capability
- Future plans
  - Comprehensive performance analysis of new SANs and SAN-based clusters
  - Evaluation of UPC methods and tools on various architectures and systems
  - UPC benchmarking on cluster architectures, networks, and conduits
  - Continuing effort in stabilizing/optimizing the GASNet SCI conduit
  - Cost/performance analysis for all options