Title: UPC Research at University of Florida
1. UPC Research at University of Florida
- Alan D. George, Hung-Hsun Su, Bryan Golden, Adam Leko
- HCS Research Laboratory
- University of Florida
2. UPC/SHMEM Performance Analysis Tool (PAT)
3. UPC/SHMEM PAT - Introduction
- Motivations
  - When UPC/SHMEM programs do not yield the desired or expected performance, why?
  - Due to the complexity of parallel computing, this is difficult to determine without tools for performance analysis
  - Discouraging for users, new and old; few options available for shared-memory computing in the UPC and SHMEM communities
- Goals
  - Identify important performance factors in UPC/SHMEM computing
  - Develop framework for a performance analysis tool
    - As a new tool or as an extension/redesign of existing non-UPC/SHMEM tools
  - Design with both performance and user productivity in mind
  - Attract new UPC/SHMEM users and support improved performance
  - Develop model to predict optimal performance
4. UPC/SHMEM PAT - Approach (1)
- Define layers to divide the workload and investigate important issues from different perspectives
  - Application layer deals with general parallel programming issues, as well as issues unique to the problem at hand
  - Language layer involves issues specific to the UPC or SHMEM programming model
  - Compiler layer includes the effects of different compilers and their optimization techniques on performance
  - Middleware layer includes all system software that relates to system resources, such as the communication protocol stack, OS, and runtime system
  - Hardware layer comprises key issues within system resources, such as CPU architecture, memory hierarchy, and the communication and synchronization network
5. UPC/SHMEM PAT - Approach (2)
- Research strategies
  - Tool-driven approach (bottom-up, time-saving approach)
    - Study existing tools for other programming models
    - Identify suitable factors applicable to UPC and SHMEM
    - Conduct experiments to verify the relevancy of these factors
    - Extend/create UPC/SHMEM-specific PAT
  - Layer-driven approach (top-down, comprehensive approach)
    - Identify all possible factors in each of the five layers defined
    - Conduct experiments to verify the relevancy of these factors
    - Determine suitable tool(s) able to provide measurements for these factors
    - Extend/create UPC/SHMEM-specific PAT
  - Hybrid approach (simultaneous pursuit of tool-driven and layer-driven approaches)
    - Minimize development time
    - Maximize usefulness of PAT
Hybrid Approach
6. UPC/SHMEM PAT - Areas of Study
- Research areas currently under investigation that are important to development of a successful PAT
  - Algorithm Analysis
  - Analytical Models
  - Categorization of Platforms/Systems
  - Factor Classification and Determination
  - Performance Analysis Strategies
  - Profiling/Tracing Methods
  - Program/Compiler Optimization Techniques
  - Tool Design (Production/Theoretical)
  - Tool Evaluation Strategies
  - Tool Framework/Approach (Theoretical and Experimental)
  - Usability (includes presentation methodology)
7. GASNet SCI Conduit for UPC Computing
8. GASNet SCI Conduit - Introduction and Design (1)
- Scalable Coherent Interface (SCI)
  - Low-latency, high-bandwidth SAN
  - Shared-memory capabilities
    - Requires memory exporting and importing
    - PIO (requires importing); DMA (needs 8-byte alignment)
  - Remote write 10x faster than remote read
- SCI conduit
  - AM enabling (core API)
    - Dedicated AM message channels (Command)
    - Request/response pairs to prevent deadlock
    - Flags to signal arrival of new AM (Control)
  - Put/Get enabling (extended API)
    - Global segment (Payload)
This work is in collaboration with the UPC Group at UC Berkeley.
9. GASNet SCI Conduit - Design (2)
- Active Message Transfer (see the sketch below)
  - Obtain free slot
    - Tracked locally using an array of flags
  - Package AM header
  - Transfer data
    - Short AM
      - PIO write (Header)
    - Medium AM
      - PIO write (Header)
      - PIO write (Medium Payload)
    - Long AM
      - PIO write (Header)
      - PIO write (Long Payload)
        - Payload size ≤ 1024 bytes
        - Unaligned portion of payload
      - DMA write (multiple of 64 bytes)
  - Wait for transfer completion
  - Signal AM arrival
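To make the Long AM path above concrete, here is a minimal C sketch of the send sequence. All helper names (acquire_free_slot, pio_write, dma_write, dma_wait, signal_arrival) are hypothetical stubs standing in for the conduit internals, not the actual GASNet SCI conduit API; the 1024-byte PIO cutoff and 64-byte DMA granularity are taken from the list above.

```c
/* Illustrative sketch only: the sci-style helpers below are hypothetical
 * placeholders, not the GASNet SCI conduit's real functions. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define PIO_PAYLOAD_MAX 1024   /* small or unaligned data goes via PIO */
#define DMA_ALIGN         64   /* DMA moves multiples of 64 bytes */

typedef struct { int handler; size_t nbytes; } am_header_t;

/* Stubs: the real conduit would write into imported SCI segments here. */
static int  acquire_free_slot(int node) { (void)node; return 0; }
static void pio_write(int node, int slot, const void *p, size_t n)
    { (void)p; printf("PIO write to node %d slot %d: %zu bytes\n", node, slot, n); }
static void dma_write(int node, int slot, const void *p, size_t n)
    { (void)p; printf("DMA write to node %d slot %d: %zu bytes\n", node, slot, n); }
static void dma_wait(int node, int slot) { (void)node; (void)slot; }
static void signal_arrival(int node, int slot)
    { printf("arrival flag set: node %d slot %d\n", node, slot); }

/* Long AM: header via PIO; bulk payload via DMA in 64-byte multiples;
 * a small payload or the unaligned tail goes via PIO. */
static void send_long_am(int node, am_header_t *hdr, const uint8_t *payload, size_t len)
{
    int slot = acquire_free_slot(node);        /* tracked locally via flag array */
    pio_write(node, slot, hdr, sizeof *hdr);   /* header always via PIO */

    if (len <= PIO_PAYLOAD_MAX) {
        pio_write(node, slot, payload, len);
    } else {
        size_t aligned = len & ~(size_t)(DMA_ALIGN - 1);
        dma_write(node, slot, payload, aligned);
        if (len > aligned)
            pio_write(node, slot, payload + aligned, len - aligned);
        dma_wait(node, slot);                  /* wait for transfer completion */
    }
    signal_arrival(node, slot);                /* signal AM arrival (Control flag) */
}

int main(void)
{
    static uint8_t buf[5000];
    am_header_t hdr = { .handler = 1, .nbytes = sizeof buf };
    send_long_am(1, &hdr, buf, sizeof buf);
    return 0;
}
```

The design point illustrated is that small or unaligned data always travels by PIO, while large aligned payloads use DMA and must complete before the arrival flag is raised.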
10. GASNet SCI Conduit - Results
- Objective: compare the performance of the SCI conduit to other existing conduits
- Experimental Setup
  - GASNet configured with segment Large
  - GASNet conduit experiments
    - Berkeley GASNet test suite
    - Average of 1000 iterations
    - Executed with target memory falling inside and then outside the GASNet segment
  - Latency results use testsmall
  - Throughput results use testlarge
- Analysis
  - Elan (Quadrics) shows best performance for latency of puts and gets
  - VAPI (InfiniBand) is by far the best in bandwidth; latency very good
  - GM (Myrinet) latencies a little higher than all the rest
  - Our SCI conduit shows better put latency than the MPI conduit on SCI for sizes > 64 bytes; very close to MPI on SCI for smaller messages
  - Our SCI conduit has get latency slightly higher than MPI on SCI
  - GM and SCI provide about the same throughput
    - Our SCI conduit has slightly higher bandwidth for the largest message sizes
- Quick look at estimated total cost to support 8 nodes of these interconnect architectures
  - SCI: $8,700
(via testbed made available courtesy of Michigan Tech)
11. GASNet SCI Conduit - Results (Latency)
12. GASNet SCI Conduit - Results (Bandwidth)
13. GASNet SCI Conduit - Conclusions
- Experimental version of our conduit is available as part of the Berkeley UPC 2.0 release
- Despite being limited by the existing SCI driver from the vendor, it is able to achieve performance fairly comparable to other conduits
- Enhancements to resolve driver limitations are being investigated in close collaboration with Dolphin
  - Support access to all virtual memory on the remote node
  - Minimize transfer setup overhead
14. UPC Benchmarking
15. UPC Benchmarking - Overview
- Goals
  - Produce interesting and useful benchmarks for UPC
  - Compare the performance of Berkeley UPC using various clusters/conduits and HP UPC on an HP/Compaq AlphaServer
- Testbed
  - Intel Xeon cluster
    - Nodes: dual 2.4 GHz Intel Xeons, 1 GB DDR PC2100 (DDR266) RAM, Intel SE7501BR2 server motherboard with E7501 chipset
    - SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
    - MPI: MPICH 1.2.5
    - RedHat 9.0 with gcc compiler v3.3.2, Berkeley UPC runtime system 2.0
  - AMD Opteron cluster
    - Nodes: dual AMD Opteron 240, 1 GB DDR PC2700 (DDR333) RAM, Tyan Thunder K8S server motherboard
    - InfiniBand: 4x (10 Gb/s, 800 MB/s sustained) Voltaire HCS 400LP, using PCI-X 64/100, 24-port Voltaire ISR 9024 switch
    - MPI: MPI package provided by Voltaire
    - SuSE 9.0 with gcc compiler v3.3.3, Berkeley UPC runtime system 2.0
  - HP/Compaq ES80 AlphaServer (Marvel)
    - Four 1 GHz EV7 Alpha processors, 8 GB RD1600 RAM, proprietary inter-processor connections
    - Tru64 5.1B Unix, HP UPC V2.3-test compiler
16. UPC Benchmarking - Differential Cryptanalysis for CAMEL Cipher
- Description
  - Uses 1024-bit S-Boxes
  - Given a key, encrypts data, then tries to guess the key based solely on the encrypted data using a differential attack
  - Has three main phases
    - Compute optimal difference pair based on S-Box (not very CPU-intensive)
    - Perform main differential attack (extremely CPU-intensive; brute force using optimal difference pair)
    - Analyze data from differential attack (not very CPU-intensive)
  - Computationally intensive (independent processes); several synchronization points (see the skeleton below)
- Analysis
  - Marvel attained almost perfect speedup; synchronization cost very low
  - Berkeley UPC
    - Speedup decreases with increasing number of threads (cost of synchronization increases with number of threads)
    - Run time varied greatly as number of threads increased
    - Still decent performance for 32 threads (76.25% efficiency, VAPI)
    - Performance is more sensitive to data affinity
- Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, Initial Key = 12345
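A minimal UPC skeleton of how the compute-heavy attack phase and its few synchronization points could be organized is shown below; check_key(), KEYSPACE, and the shared found/found_key variables are hypothetical placeholders, not the CAMEL benchmark's actual code.

```c
/* Illustrative UPC skeleton only: names and key-space size are placeholders. */
#include <upc_relaxed.h>
#include <stdio.h>

#define KEYSPACE (1UL << 24)          /* placeholder key-space size */

shared unsigned long found_key;       /* written by whichever thread succeeds */
shared int found;

static int check_key(unsigned long key) { return key == 12345; }   /* stub */

int main(void)
{
    if (MYTHREAD == 0) found = 0;
    upc_barrier;                      /* sync point: everyone sees found == 0 */

    /* Phase 2: embarrassingly parallel; each thread scans its own key slice */
    unsigned long chunk = KEYSPACE / THREADS;
    unsigned long lo = (unsigned long)MYTHREAD * chunk;
    unsigned long hi = (MYTHREAD == THREADS - 1) ? KEYSPACE : lo + chunk;

    for (unsigned long k = lo; k < hi && !found; k++) {
        if (check_key(k)) {
            found_key = k;            /* remote write to thread 0's copy */
            found = 1;
        }
    }

    upc_barrier;                      /* sync point before serial analysis */

    if (MYTHREAD == 0)                /* Phase 3: analyze results serially */
        printf("candidate key: %lu (found = %d)\n",
               (unsigned long)found_key, (int)found);
    return 0;
}
```

The structure matches the slide's observation: the work between barriers is independent, so synchronization cost (and its growth with thread count) dominates scalability.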
17. UPC Benchmarking - DES Differential Attack Simulator
- Description
  - S-DES (8-bit key) cipher (integer-based)
  - Creates basic components used in differential cryptanalysis (see the DDT sketch below)
    - S-Boxes, Difference Pair Tables (DPT), and Differential Distribution Tables (DDT)
  - Bandwidth-intensive application
    - Designed for high cache miss rate, so very costly in terms of memory access
- Analysis
  - With increasing number of nodes, bandwidth and NIC response time become more important
    - Interconnects with higher bandwidth and faster response times perform best
  - Marvel shows near-perfect linear speedup, but processing time of integers an issue
  - MPI conduit clearly inadequate for high-bandwidth programs
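As an illustration of the "basic components" listed above, the following C sketch builds a differential distribution table (DDT) for a small S-box; the 4-bit S-box values are arbitrary placeholders rather than those of the S-DES simulator.

```c
/* Build a DDT for a toy 4-bit S-box (values are arbitrary placeholders). */
#include <stdio.h>

#define SBOX_BITS 4
#define SBOX_SIZE (1 << SBOX_BITS)

static const unsigned sbox[SBOX_SIZE] = {
    0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
    0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7
};

int main(void)
{
    unsigned ddt[SBOX_SIZE][SBOX_SIZE] = {{0}};

    /* ddt[dx][dy] counts how often input difference dx maps to output
     * difference dy over all input pairs (x, x ^ dx). */
    for (unsigned dx = 0; dx < SBOX_SIZE; dx++)
        for (unsigned x = 0; x < SBOX_SIZE; x++) {
            unsigned dy = sbox[x] ^ sbox[x ^ dx];
            ddt[dx][dy]++;
        }

    /* High counts mark the difference pairs a differential attack exploits. */
    for (unsigned dx = 0; dx < SBOX_SIZE; dx++) {
        for (unsigned dy = 0; dy < SBOX_SIZE; dy++)
            printf("%2u ", ddt[dx][dy]);
        putchar('\n');
    }
    return 0;
}
```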
18. UPC Benchmarking - Concurrent Wave Equation
- Description
  - A vibrating string is decomposed into points
  - In the parallel version, each processor is responsible for updating the amplitude of N/P points
  - Each iteration, each processor exchanges boundary points with its nearest neighbors
  - Coarse-grained communication
  - Algorithm complexity of O(N)
- Analysis
  - Sequential C
    - Modified version was 30% faster than baseline for Xeon, but only 17% faster for Opteron
    - Opteron and Xeon sequential unmodified code have nearly identical execution times
  - UPC (see the sketch below)
    - Near-linear speedup
    - Fairly straightforward to port from sequential code
    - upc_forall loop ran faster with array[j] as the affinity expression than with &(array[j])
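A minimal UPC sketch of this kind of wave-equation update loop is shown below, using upc_forall with a pointer-to-shared affinity expression (one of the two affinity forms compared above); the blocking factor, iteration count, and update coefficients are placeholders, not the benchmark's actual parameters.

```c
/* Illustrative UPC wave-equation loop; BLOCK, step count, and coefficients
 * are placeholders, not the benchmark's actual settings. */
#include <upc_relaxed.h>
#include <stdio.h>

#define BLOCK 256
#define N     (BLOCK * THREADS)       /* each thread owns N/THREADS points */

shared [BLOCK] double oldval[N], values[N], newval[N];

int main(void)
{
    int j, step;

    if (MYTHREAD == 0)
        for (j = 0; j < N; j++)
            oldval[j] = values[j] = (double)j / N;   /* arbitrary init */
    upc_barrier;

    for (step = 0; step < 100; step++) {
        /* Each thread updates only points it has affinity to; reads of
         * values[j-1] / values[j+1] at block edges are the nearest-neighbor
         * boundary exchange. */
        upc_forall (j = 1; j < N - 1; j++; &values[j])
            newval[j] = 2.0 * values[j] - oldval[j]
                        + 0.1 * (values[j-1] - 2.0 * values[j] + values[j+1]);
        upc_barrier;

        upc_forall (j = 1; j < N - 1; j++; &values[j]) {
            oldval[j] = values[j];
            values[j] = newval[j];
        }
        upc_barrier;
    }

    if (MYTHREAD == 0)
        printf("values[N/2] = %f\n", values[N/2]);
    return 0;
}
```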
19. UPC Benchmarking - Mod 2^N Inverse
- Description
  - Given a list A of size listsize (64-bit integers), with values ranging from 0 to 2^j - 1
  - Compute
    - list B, where B[i] = A[i] right-justified
    - list C, such that (B[i] * C[i]) mod 2^j = 1 (iterative algorithm; see the sketch below)
  - Check section (gather)
    - First node checks all values to verify (B[i] * C[i]) mod 2^j = 1
  - Computation is embarrassingly parallel and very communication-intensive
  - MPI, UPC, and SHMEM versions used same algorithm
- Analysis
  - AlphaServer: relatively good; UPC, SHMEM, and MPI show comparable performance
  - Opteron
    - Suboptimal performance; communication time dominates overall execution time
    - MPI gave best results (more mature compiler), although code was much more laborious to write
    - GPSHMEM over MPI lacks good performance vs. plain MPI, although it is much easier to write with one-sided communication functions
      - However, GPSHMEM gives adequate performance for applications that are not extremely sensitive to bandwidth
    - MPI could use asynchronous calls to hide latency on the check, although communication time dominates by a large factor
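For reference, one standard iterative way to compute an inverse mod 2^j for an odd operand is Newton iteration, sketched in plain C below; this stands in for the benchmark's unspecified iterative algorithm, and the input value and j are arbitrary.

```c
/* Newton iteration for the inverse of an odd integer mod 2^j (sketch). */
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Returns c such that (b * c) mod 2^j == 1; b must be odd, 1 <= j <= 64. */
static uint64_t inverse_mod_2j(uint64_t b, unsigned j)
{
    uint64_t mask = (j == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << j) - 1);
    uint64_t c = b;                   /* correct to 3 bits: b*b == 1 (mod 8) */
    for (int it = 0; it < 5; it++)    /* each step doubles the correct bits */
        c *= 2 - b * c;               /* Newton step, wraps mod 2^64 */
    return c & mask;
}

int main(void)
{
    uint64_t a = 0x123456789abcdef0ULL;   /* arbitrary 64-bit input */
    unsigned j = 32;

    /* "Right-justify" a: shift out trailing zero bits so b is odd. */
    uint64_t b = a;
    while ((b & 1) == 0)
        b >>= 1;

    uint64_t c = inverse_mod_2j(b, j);
    uint64_t mask = (j == 64) ? ~UINT64_C(0) : ((UINT64_C(1) << j) - 1);
    assert(((b * c) & mask) == 1);    /* the "check" step from the slide */
    printf("b = %llu, inverse mod 2^%u = %llu\n",
           (unsigned long long)b, j, (unsigned long long)c);
    return 0;
}
```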
20. UPC Benchmarking - Convolution
- Description
  - Compute convolution of two sequences
    - Classic definition of convolution (not the image-processing definition); see the sketch below
  - Embarrassingly parallel (gives an idea of language overhead), O(N^2) algorithm complexity
  - Parameters: sequence sizes 100,000 elements; data types 32-bit integer and double-precision floating point
  - MPI, UPC, and SHMEM versions used same algorithm
- Analysis
  - Overall language overhead
    - MPI version required most effort to code
    - SHMEM slightly easier than MPI because of one-sided communication functions
    - UPC easiest to code (conversion of sequential code very easy), but has potentially limited performance unless optimizations (get, for, cast) are used
  - Overall language performance overhead
    - On AlphaServer: MPI had most runtime overhead in most cases; GPSHMEM performed surprisingly well
    - On Opteron: runtime overhead MPI (least) < SHMEM < Berkeley UPC
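The UPC sketch below computes the direct O(N^2) convolution and illustrates the "get" (bulk-transfer) style of optimization mentioned above, with the inner loops running on private copies; the data placement, sizes, and values are assumptions for illustration, not the benchmark's actual code.

```c
/* Illustrative UPC convolution; placement, N, and values are placeholders. */
#include <upc_relaxed.h>
#include <stdio.h>

#define N    1024                     /* per-sequence length (placeholder) */
#define OUTN (2 * N - 1)

shared [] double a[N], b[N], out[OUTN];   /* operands/result on thread 0 */

int main(void)
{
    static double la[N], lb[N], lout[OUTN];   /* private copies and results */
    size_t lo = (size_t)MYTHREAD * OUTN / THREADS;
    size_t hi = (size_t)(MYTHREAD + 1) * OUTN / THREADS;

    if (MYTHREAD == 0)
        for (size_t i = 0; i < N; i++) { a[i] = (double)i; b[i] = 1.0 / (i + 1); }
    upc_barrier;

    /* "get" optimization: two bulk transfers instead of element-wise reads */
    upc_memget(la, a, N * sizeof(double));
    upc_memget(lb, b, N * sizeof(double));

    for (size_t n = lo; n < hi; n++) {        /* this thread's output slice */
        double acc = 0.0;
        size_t kmin = (n >= N - 1) ? n - (N - 1) : 0;
        size_t kmax = (n < N - 1) ? n : N - 1;
        for (size_t k = kmin; k <= kmax; k++)
            acc += la[k] * lb[n - k];         /* out[n] = sum_k a[k]*b[n-k] */
        lout[n] = acc;
    }

    upc_memput(&out[lo], &lout[lo], (hi - lo) * sizeof(double));  /* bulk write back */
    upc_barrier;

    if (MYTHREAD == 0)
        printf("out[0] = %g, out[OUTN-1] = %g\n", out[0], out[OUTN - 1]);
    return 0;
}
```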
21. Final Conclusions
- Florida group active in three UPC research areas
  - UPC/SHMEM performance analysis tools (PAT)
  - Network and system infrastructure for UPC computing
  - UPC benchmarking and optimization
- Status
  - PAT research for UPC is recently underway
  - SCI networking infrastructure for UPC/GASNet cluster computing shows promising results
  - Broad range of UPC benchmarks under development
- Developing plans for additional UPC projects
  - Integration of UPC and RC (reconfigurable) computing
  - Simulation modeling of UPC systems and apps.