UPC Research at University of Florida
Transcript and Presenter's Notes
1
UPC Research at University of Florida
  • Alan D. George, Hung-Hsun Su, Bryan Golden, Adam
    Leko
  • HCS Research Laboratory
  • University of Florida

2
UPC/SHMEM Performance Analysis Tool (PAT)
3
UPC/SHMEM PAT - Introduction
  • Motivations
  • When UPC/SHMEM programs do not yield the desired
    or expected performance, why?
  • Due to complexity of parallel computing,
    difficult to determine without tools for
    performance analysis
  • Discouraging for users, new and old; few options
    available for shared-memory computing in the UPC
    and SHMEM communities
  • Goals
  • Identify important performance factors in
    UPC/SHMEM computing
  • Develop framework for a performance analysis tool
  • As new tool or as extension/redesign of existing
    non-UPC/SHMEM tools
  • Design with both performance and user
    productivity in mind
  • Attract new UPC/SHMEM users and support improved
    performance
  • Develop model to predict optimal performance

4
UPC/SHMEM PAT Approach (1)
  • Define layers to divide the workload and
    investigate important issues from different
    perspectives
  • Application layer deals with general parallel
    programming issues, as well as issues unique to
    the problem at hand
  • Language layer involves issues specific to the
    UPC or SHMEM programming model
  • Compiler layer includes the effects of
    different compilers and their optimization
    techniques on performance
  • Middleware layer includes all system software
    that relates to system resources, such as the
    communication protocol stack, OS, and runtime
    system
  • Hardware layer comprises key issues within
    system resources such as CPU architecture, memory
    hierarchy, communication and synchronization
    network

5
UPC/SHMEM PAT Approach (2)
  • Research strategies
  • Tool-driven approach (bottom-up, time-saving
    approach)
  • Study existing tools for other programming models
  • Identify suitable factors applicable to UPC and
    SHMEM
  • Conduct experiments to verify the relevancy of
    these factors
  • Extend/create UPC/SHMEM-specific PAT
  • Layer-driven approach (top-down, comprehensive
    approach)
  • Identify all possible factors in each of the five
    layers defined
  • Conduct experiments to verify the relevancy of
    these factors
  • Determine suitable tool(s) able to provide
    measurements for these factors
  • Extend/create UPC/SHMEM-specific PAT
  • Hybrid approach (simult. pursuit of tool-driven
    and layer-driven approaches)
  • Minimize development time
  • Maximize usefulness of PAT

Hybrid Approach
6
UPC/SHMEM PAT Areas of Study
  • Research areas currently under investigation that
    are important to development of successful PAT
  • Algorithm Analysis
  • Analytical Models
  • Categorization of Platforms/Systems
  • Factor Classification and Determination
  • Performance Analysis Strategies
  • Profiling/Tracing Methods
  • Program/Compiler Optimization Techniques
  • Tool Design (Production/Theoretical)
  • Tool Evaluation Strategies
  • Tool Framework/Approach (Theoretical and
    Experimental)
  • Usability (includes presentation methodology)

7
GASNet SCI Conduit for UPC Computing
8
GASNet SCI Conduit Introduction and Design (1)
  • Scalable Coherent Interface (SCI)
  • Low-latency, high-bandwidth SAN
  • Shared-memory capabilities
  • Require memory exporting and importing
  • PIO (requires importing); DMA (requires 8-byte
    alignment)
  • Remote write 10x faster than remote read
  • SCI conduit
  • AM enabling (core API)
  • Dedicated AM message channels (Command)
  • Request/Response pairs to prevent deadlock
  • Flags to signal arrival of new AM (Control)
  • Put/Get enabling (extended API)
  • Global segment (Payload)

This work is in collaboration with the UPC Group at
UC Berkeley.
9
GASNet SCI Conduit Design (2)
  • Active Message Transfer (see the sketch after
    this list)
  • Obtain free slot
  • Track locally using array of flags
  • Package AM header
  • Transfer data
  • Short AM
  • PIO write (Header)
  • Medium AM
  • PIO write (Header)
  • PIO write (Medium Payload)
  • Long AM
  • PIO write (Header)
  • PIO write (Long Payload)
  • Payload size ≤ 1024 bytes
  • Unaligned portion of payload
  • DMA write (multiple of 64 bytes)
  • Wait for transfer completion
  • Signal AM arrival
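
To make the transfer steps above concrete, here is a minimal C sketch of
the Long AM payload handling (PIO for payloads of at most 1024 bytes and
for the unaligned tail, DMA for the 64-byte-aligned bulk, then completion
wait and arrival signaling). The 1024-byte threshold and 64-byte DMA
granularity follow the slide; all sci_* helper names are hypothetical
placeholders, not the actual conduit or SISCI API.

    /* Illustrative-only sketch of the Long AM payload transfer described
     * above; all sci_* helpers are hypothetical placeholders. */
    #include <stddef.h>
    #include <stdint.h>

    #define PIO_THRESHOLD 1024   /* small payloads are sent with PIO writes */
    #define DMA_CHUNK       64   /* DMA moves multiples of 64 bytes         */

    void sci_pio_write(const uint8_t *src, size_t len);   /* placeholder */
    void sci_dma_write(const uint8_t *src, size_t len);   /* placeholder */
    void sci_wait_dma_completion(void);                   /* placeholder */
    void sci_signal_am_arrival(void);                     /* placeholder */

    void send_long_am_payload(const uint8_t *payload, size_t len)
    {
        if (len <= PIO_THRESHOLD) {
            /* Small payload: a single PIO write avoids DMA setup cost. */
            sci_pio_write(payload, len);
        } else {
            /* Large payload: DMA the 64-byte-aligned bulk, PIO the rest. */
            size_t dma_len  = (len / DMA_CHUNK) * DMA_CHUNK;
            size_t tail_len = len - dma_len;

            sci_dma_write(payload, dma_len);                 /* aligned bulk   */
            if (tail_len > 0)
                sci_pio_write(payload + dma_len, tail_len);  /* unaligned tail */
            sci_wait_dma_completion();     /* wait for transfer completion */
        }
        sci_signal_am_arrival();           /* flag arrival of the new AM   */
    }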

10
GASNet SCI Conduit Results
  • Objective - compare the performance of SCI
    conduit to other existing conduits
  • Experimental Setup
  • GASNet configured with segment Large
  • GASNet conduit experiments
  • Berkeley GASNet test suite
  • Average of 1000 iterations
  • Executed with target memory falling inside and
    then outside the GASNet segment
  • Latency results use testsmall
  • Throughput results use testlarge
  • Analysis
  • Elan (Quadrics) shows best performance for
    latency of puts and gets
  • VAPI (InfiniBand) has by far the best bandwidth;
    latency is also very good
  • GM (Myrinet) latencies a little higher than all
    the rest
  • Our SCI conduit shows better put latency than the
    MPI conduit on SCI for sizes > 64 bytes, and is
    very close to MPI on SCI for smaller messages
  • Our SCI conduit has latency slightly higher than
    MPI on SCI
  • GM and SCI provide about the same throughput
  • Our SCI conduit slightly higher bandwidth for
    largest message sizes
  • Quick look at estimated total cost to support 8
    nodes of these interconnect architectures
  • SCI: $8,700

via testbed made available courtesy of Michigan
Tech
11
GASNet SCI Conduit Results (Latency)
12
GASNet SCI Conduit Results (Bandwidth)
13
GASNet SCI Conduit - Conclusions
  • Experimental version of our conduit is available
    as part of Berkeley UPC 2.0 release
  • Despite being limited by existing SCI driver from
    vendor, it is able to achieve performance fairly
    comparable to other conduits
  • Enhancements to resolve driver limitations are
    being investigated in close collaboration with
    Dolphin
  • Support access to all virtual memory on the
    remote node
  • Minimize transfer setup overhead

14
UPC Benchmarking
15
UPC Benchmarking Overview
  • Goals
  • Produce interesting and useful benchmarks for UPC
  • Compare the performance of Berkeley UPC using
    various clusters/conduits and HP UPC on HP/Compaq
    AlphaServer
  • Testbed
  • Intel Xeon cluster
  • Nodes: Dual 2.4 GHz Intel Xeons, 1GB DDR PC2100
    (DDR266) RAM, Intel SE7501BR2 server motherboard
    with E7501 chipset
  • SCI: 667 MB/s (300 MB/s sustained) Dolphin SCI
    D337 (2D/3D) NICs, using PCI 64/66, 4x2 torus
  • MPI: MPICH 1.2.5
  • RedHat 9.0 with gcc compiler V 3.3.2, Berkeley
    UPC runtime system 2.0
  • AMD Opteron cluster
  • Nodes: Dual AMD Opteron 240, 1GB DDR PC2700
    (DDR333) RAM, Tyan Thunder K8S server motherboard
  • InfiniBand: 4x (10Gb/s, 800 MB/s sustained)
    Voltaire HCS 400LP, using PCI-X 64/100, 24-port
    Voltaire ISR 9024 switch
  • MPI: MPI package provided by Voltaire
  • SuSE 9.0 with gcc compiler V 3.3.3, Berkeley UPC
    runtime system 2.0
  • HP/Compaq ES80 AlphaServer (Marvel)
  • Four 1GHz EV7 Alpha processors, 8GB RD1600 RAM,
    proprietary inter-processor connections
  • Tru64 5.1B Unix, HP UPC V2.3-test compiler

16
UPC Benchmarking Differential Cryptanalysis for
CAMEL Cipher
  • Description
  • Uses 1024-bit S-Boxes
  • Given a key, encrypts data, then tries to guess
    key solely based on encrypted data using
    differential attack
  • Has three main phases
  • Compute optimal difference pair based on S-Box
    (not very CPU-intensive)
  • Performs main differential attack (extremely
    CPU-intensive, brute force using optimal
    difference pair)
  • Analyze data from differential attack (not very
    CPU-intensive)
  • Computationally intensive (independent processes)
    with several synchronization points (see the UPC
    sketch below)
  • Analysis
  • Marvel attained almost perfect speedup,
    synchronization cost very low
  • Berkeley UPC
  • Speedup decreases with increasing number of
    threads (cost of synchronization increases with
    number of threads)
  • Run time varied greatly as number of threads
    increased
  • Still decent performance for 32 threads (76.25%
    efficiency, VAPI)
  • Performance is more sensitive to data affinity

Parameters: MAINKEYLOOP = 256, NUMPAIRS = 400,000, Initial Key = 12345
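
As a rough illustration of the phase structure described above
(independent brute-force work separated by synchronization points), the
following minimal UPC sketch splits a key space across threads and uses
upc_barrier between phases. The key-space size, the try_key() test, and
the result handling are hypothetical stand-ins, not the actual CAMEL
attack code; the value 12345 simply echoes the slide's Initial Key
parameter as a dummy target.

    /* Hypothetical UPC sketch: each thread searches an independent slice
     * of the key space (phase 2), with upc_barrier as the synchronization
     * point before thread 0 analyzes the candidates (phase 3). */
    #include <upc.h>
    #include <stdio.h>
    #include <stdint.h>

    #define KEYSPACE (1UL << 20)            /* illustrative key-space size  */
    shared uint64_t candidate[THREADS];     /* one candidate key per thread */

    static int try_key(uint64_t key)        /* dummy stand-in for the attack */
    {
        return key == 12345;                /* "found" the initial key       */
    }

    int main(void)
    {
        /* Phase 2: embarrassingly parallel brute-force search. */
        for (uint64_t k = MYTHREAD; k < KEYSPACE; k += THREADS)
            if (try_key(k))
                candidate[MYTHREAD] = k;

        upc_barrier;                        /* synchronization point */

        if (MYTHREAD == 0) {
            /* Phase 3: thread 0 gathers and inspects the candidates. */
            for (int t = 0; t < THREADS; t++)
                printf("thread %d candidate: %llu\n", t,
                       (unsigned long long)candidate[t]);
        }
        return 0;
    }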
17
UPC Benchmarking DES Differential Attack Simulator
  • Description
  • S-DES (8-bit key) cipher (integer-based)
  • Creates basic components used in differential
    cryptanalysis
  • S-Boxes, Difference Pair Tables (DPT), and
    Differential Distribution Tables (DDT); see the
    DDT sketch below
  • Bandwidth-intensive application
  • Designed for high cache miss rate, so very costly
    in terms of memory access
  • Analysis
  • With increasing number of nodes, bandwidth and
    NIC response time become more important
  • Interconnects with higher bandwidth and faster
    response times perform best
  • Marvel shows near-perfect linear speedup, but
    processing time of integers an issue
  • MPI conduit clearly inadequate for high-bandwidth
    programs
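
As a concrete illustration of one component named above, the sketch below
builds a Differential Distribution Table for a small 4-bit-to-2-bit S-box:
ddt[dx][dy] counts how many inputs x map input difference dx to output
difference dy. The S-box values and the flat indexing are illustrative
simplifications, not the simulator's actual tables (real S-DES S-boxes use
a row/column bit selection that is omitted here).

    /* Minimal C sketch of DDT construction for a tiny S-box
     * (illustrative values only). */
    #include <stdio.h>

    #define SBOX_IN  16                 /* 4-bit input  -> 16 entries      */
    #define SBOX_OUT  4                 /* 2-bit output -> 4 differences   */

    static const unsigned char sbox[SBOX_IN] = {
        1, 0, 3, 2, 3, 2, 1, 0, 0, 2, 1, 3, 3, 1, 3, 2
    };

    int main(void)
    {
        unsigned ddt[SBOX_IN][SBOX_OUT] = {{0}};

        for (unsigned dx = 0; dx < SBOX_IN; dx++)
            for (unsigned x = 0; x < SBOX_IN; x++)
                ddt[dx][sbox[x] ^ sbox[x ^ dx]]++;   /* count output diffs */

        for (unsigned dx = 0; dx < SBOX_IN; dx++) {  /* print the table    */
            for (unsigned dy = 0; dy < SBOX_OUT; dy++)
                printf("%3u", ddt[dx][dy]);
            printf("\n");
        }
        return 0;
    }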

18
UPC Benchmarking Concurrent Wave Equation
  • Description
  • A vibrating string is decomposed into points
  • In the parallel version, each processor
    responsible for updating amplitude of N/P points
  • Each iteration, each processor exchanges boundary
    points with nearest neighbors
  • Coarse-grained communication
  • Algorithm complexity of O(N)
  • Analysis
  • Sequential C
  • Modified version was 30% faster than baseline for
    Xeon, but only 17% faster for Opteron
  • Opteron and Xeon sequential unmodified code have
    nearly identical execution times
  • UPC
  • Near linear speedup
  • Fairly straightforward to port from sequential
    code
  • upc_forall loop ran faster with array[j] as the
    affinity expression than with &(array[j]); see
    the sketch below
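
To make the affinity-expression point concrete, here is a minimal UPC
sketch of a point-update loop in the style of the wave benchmark. The
array name, size, and update formula are illustrative only (not the
benchmark's code); the sketch simply shows where the affinity clause sits
in a upc_forall loop.

    /* Illustrative upc_forall loop: the fourth clause is the affinity
     * expression that binds each iteration to one thread. */
    #include <upc_relaxed.h>

    #define N 1024                           /* illustrative points per thread */
    shared double amplitude[N * THREADS];    /* default cyclic distribution    */

    void update_points(void)
    {
        int j;
        /* The affinity clause can be a pointer-to-shared (&amplitude[j],
         * as here) or an equivalent integer expression (j); iteration j
         * then runs on the thread with affinity to amplitude[j]. */
        upc_forall (j = 1; j < N * THREADS - 1; j++; &amplitude[j])
            amplitude[j] = 0.5 * (amplitude[j - 1] + amplitude[j + 1]);
    }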

19
UPC Benchmarking Mod 2N Inverse
  • Description
  • Given list A of size listsize (64-bit integers),
    with values ranging from 0 to 2^j - 1
  • Compute
  • list B, where B[i] = A[i] right-justified
  • list C, such that (B[i] * C[i]) mod 2^j = 1
    (iterative algorithm; see the sketch below)
  • Check section (gather)
  • First node checks all values to verify that
    (B[i] * C[i]) mod 2^j = 1
  • Computation is embarrassingly parallel, but the
    benchmark is very communication intensive
  • MPI, UPC, and SHMEM versions used same algorithm
  • Analysis
  • AlphaServer: relatively good; UPC, SHMEM, and MPI
    show comparable performance
  • Opteron
  • Suboptimal performance, communication time
    dominates overall execution time
  • MPI gave best results (more mature compiler),
    although code was much more laborious to write
  • GPSHMEM over MPI lacks good performance vs. plain
    MPI, although much easier to write with one-sided
    communication functions
  • However, GPSHMEM gives adequate performance for
    applications that are not extremely sensitive to
    bandwidth
  • MPI could use asynchronous calls to hide latency
    on check, although communication time dominates
    by a large factor
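
The slide does not give the iterative algorithm itself; as one well-known
possibility, the sketch below uses Newton/Hensel lifting, where each step
doubles the number of correct low-order bits of the inverse. B must be odd
(hence the right-justification) for the inverse to exist; the test value
in main() is arbitrary, and this is not necessarily the benchmark's exact
scheme.

    /* Hedged C sketch of an iterative inverse mod 2^j (Newton/Hensel
     * lifting); a standard technique, shown here for illustration. */
    #include <stdio.h>
    #include <stdint.h>

    /* For odd b, returns c with (b * c) mod 2^64 == 1; masking the result
     * to j bits gives the inverse mod 2^j for any j <= 64. */
    static uint64_t inverse_mod_2_64(uint64_t b)
    {
        uint64_t c = b;                /* b*b = 1 (mod 8): 3 bits correct */
        for (int i = 0; i < 5; i++)    /* 3 -> 6 -> 12 -> 24 -> 48 -> 96  */
            c *= 2 - b * c;            /* Newton step, wraps mod 2^64     */
        return c;
    }

    int main(void)
    {
        uint64_t b = 0x123456789ABCDEF1ULL;   /* arbitrary odd test value */
        uint64_t c = inverse_mod_2_64(b);
        printf("b*c mod 2^64 = %llu\n", (unsigned long long)(b * c));  /* 1 */
        return 0;
    }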

20
UPC Benchmarking Convolution
  • Description
  • Compute convolution of two sequences
  • Classic definition of convolution (not the image
    processing definition); see the sketch below
  • Embarrassingly parallel (gives an idea of
    language overhead), O(N^2) algorithm complexity
  • Parameters: sequence sizes of 100,000 elements;
    data types: 32-bit integer and double-precision
    floating point
  • MPI, UPC, and SHMEM versions used same algorithm
  • Analysis
  • Overall language overhead
  • MPI version required most effort to code
  • SHMEM slightly easier than MPI because of
    one-sided communication functions
  • UPC easiest to code (conversion of sequential
    code very easy), but has potentially limited
    performance unless optimizations (get, for, cast)
    are used
  • Overall language performance overhead
  • On AlphaServer: MPI had the most runtime overhead
    in most cases; GPSHMEM performed surprisingly well
  • On Opteron: runtime overhead MPI (least) < SHMEM
    < Berkeley UPC
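
For reference, the sketch below shows the classic direct convolution being
benchmarked, y[n] = sum over k of x[k] * h[n-k]. The sequence lengths here
are tiny illustrative values (the benchmark uses 100,000-element
sequences); the parallel versions simply split the output indices n across
processes or threads, which is why the kernel is embarrassingly parallel.

    /* Minimal sequential C sketch of the classic O(N^2) convolution;
     * data and sizes are illustrative only. */
    #include <stdio.h>

    /* y[n] = sum_k x[k] * h[n-k], for 0 <= n < nx + nh - 1 */
    static void convolve(const double *x, int nx,
                         const double *h, int nh, double *y)
    {
        for (int n = 0; n < nx + nh - 1; n++) {
            y[n] = 0.0;
            for (int k = 0; k < nx; k++)
                if (n - k >= 0 && n - k < nh)
                    y[n] += x[k] * h[n - k];
        }
    }

    int main(void)
    {
        double x[] = {1, 2, 3}, h[] = {0.5, 0.5};
        double y[4];                              /* 3 + 2 - 1 outputs */
        convolve(x, 3, h, 2, y);
        for (int n = 0; n < 4; n++)
            printf("y[%d] = %g\n", n, y[n]);      /* 0.5 1.5 2.5 1.5 */
        return 0;
    }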

21
Final Conclusions
  • Florida group active in three UPC research areas
  • UPC/SHMEM performance analysis tools (PAT)
  • Network and system infrastructure for UPC
    computing
  • UPC benchmarking and optimization
  • Status
  • PAT research for UPC recently got underway
  • SCI networking infrastructure for UPC/GASNet
    cluster computing shows promising results
  • Broad range of UPC benchmarks under development
  • Developing plans for additional UPC projects
  • Integration of UPC and RC (reconfigurable)
    computing
  • Simulation modeling of UPC systems and apps.