Transcript and Presenter's Notes

Title: OPAL: Open Source Parallel Algorithm Library - Designing High-Performance Algorithms for SMP Clusters


1
OPAL: Open Source Parallel Algorithm Library
Designing High-Performance Algorithms for SMP Clusters
  • David A. Bader
  • Electrical & Computer Engineering Department
  • Albuquerque High Performance Computing Center
  • University of New Mexico
  • dbader@eece.unm.edu
  • http://hpc.eece.unm.edu/

2
High-Performance Applications using SMP Clusters
  • Long-term Earth science studies using terascale
    remotely-sensed global satellite imagery (4 km
    AVHRR GAC)
  • Computational Ecological Studies: Self-Organization
    of Semi-Arid Landscapes and a Test of Optimality
    Principles
  • Computational Bioinformatics: Large-Scale
    Phylogeny Reconstruction

3
Research Collaborators
  • Joseph JáJá, University of Maryland
  • Bernard Moret, CS (Experimental Algorithmics),
    University of New Mexico
  • Bruce Milne, Biology (Landscape Ecology),
    University of New Mexico
  • Tandy Warnow, CS, University of Texas-Austin
  • IBM ACTC Group (David Klepacki, John Levesque,
    and others)
  • Current Graduate Students
  • Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
  • Laboratory Alumni
  • Kavita Balakavi (Intel), Ajith Illendula (Intel)

4
Acknowledgment of Support
  • NSF CISE Postdoctoral Research Associate in
    Experimental Computer Science No. 96-25668
  • NSF BIO Division of Environmental Biology DEB
    99-10123
  • Department of Energy Sandia-University New
    Assistant Professorship Program (SUNAPP) Award
    AX-3006
  • IBM SUR Grant (UNM Vista-Azul Project)
  • NPACI/SDSC and NCSA/Alliance
  • NSF 00- Algorithms for Irregular Discrete
    Computations on SMPs

5
Outline
  • Motivation
  • SMP Cluster Programming (SIMPLE)
  • Complexity model
  • Message-Passing
  • Shared-Memory
  • OPAL Facets (parallel libraries)
  • OPAL Setting (programming framework)
  • Example SMP Algorithms

6
Motivation
  • High performance computing has been leveraging
    COTS workstation technologies
  • Commodity microprocessors
  • High-performance networks
  • Operating system and compiler technology
  • Symmetric multiprocessor (SMP)
  • Hardware support for hierarchical memory
    management
  • Multithreaded operating system kernels
  • Optimizing compilers and runtime systems

7
SMP Cluster Architectures
  • IBM SP (NPACI Blue Horizon 144x8)
  • Linux Clusters
  • Compaq AlphaServers (PSC/NSF Terascale 682x4)
  • Sun Ultra HPC (4x64)

8
Message-Passing Performance
9
Shared-Memory Performance
  • One Sun HPC E10K processor
  • Contiguous: each element of a contiguous array read
    exactly once
  • C, X: cyclic read (stride X) of a contiguous array
  • R: random access of the array

10
High Performance Algorithms for SMP Clusters
  • SIMPLE Model
  • Use a hybrid, natural combination of
    message-passing and shared-memory (see the sketch
    below)
  • Message passing interface between nodes
  • Shared-memory programming (OpenMP, POSIX Threads)
    on each SMP node
  • Methodology for adapting message-passing
    algorithms for SMP Clusters
  • Freely-available open source implementation of
    parallel algorithms, libraries, and programming
    environment, for C/C++/Fortran under the GNU
    General Public License (GPL)
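The hybrid combination described above can be illustrated with a minimal sketch (mine, not OPAL/SIMPLE code): one MPI task per SMP node handles the message passing between nodes, while OpenMP threads do the shared-memory work inside the node. The names and the toy computation are assumptions for the example.

/* Hybrid sketch: message passing between nodes, shared memory within a node.
 * Illustrative only, not the OPAL/SIMPLE API.  Build with an MPI compiler
 * wrapper and OpenMP enabled, e.g.  mpicc -fopenmp hybrid.c             */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, node_rank, num_nodes;

    /* One MPI task per SMP node; request thread support so the node's
     * threads can coexist with MPI calls made by the main thread.      */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &node_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &num_nodes);

    double local_sum = 0.0;

    /* Shared-memory parallelism on the node: each thread contributes a
     * slice of the node's work (here just a toy value).                */
    #pragma omp parallel reduction(+:local_sum)
    {
        int tid = omp_get_thread_num();
        local_sum += (double)(node_rank * omp_get_num_threads() + tid);
    }

    /* Message passing between nodes: combine the per-node results.     */
    double global_sum = 0.0;
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (node_rank == 0)
        printf("global sum = %f over %d nodes\n", global_sum, num_nodes);

    MPI_Finalize();
    return 0;
}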

11
Optimizing from MPI to SIMPLE (Regular or
Irregular Algorithms)
  • Similar Single-Program Multiple-Data (SPMD)
    paradigm
  • Replace multiple MPI tasks per node with a single
    task and multiple shared-memory threads
  • Parallelize sequential work into equivalent
    shared-memory algorithms
  • Replace MPI communication primitives with
    corresponding SIMPLE primitives (see the sketch
    below)
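As a concrete instance of the rewrite steps above (a hedged sketch, not taken from OPAL): an MPI-only reduction in which every processor of every node is a separate MPI task becomes a single task per node whose local work is parallelized with shared-memory threads, and the MPI collective involves only the nodes.

/* Sketch of the MPI -> SIMPLE rewrite pattern (illustrative, not OPAL code).
 * Before: r MPI tasks per node each summed n/(p*r) elements and joined a
 *         p*r-way MPI_Allreduce.
 * After:  1 MPI task per node; the node's n/p elements are summed by r
 *         shared-memory threads, and only p tasks join the MPI_Allreduce. */
#include <mpi.h>

double node_sum_then_allreduce(const double *a, long n_per_node)
{
    double node_sum = 0.0;

    /* The per-task sequential work, parallelized into an equivalent
     * shared-memory loop across the node's threads.                   */
    #pragma omp parallel for reduction(+:node_sum)
    for (long i = 0; i < n_per_node; i++)
        node_sum += a[i];

    /* MPI communication now happens once per node instead of once per
     * processor, shrinking the collective from p*r to p participants. */
    double total = 0.0;
    MPI_Allreduce(&node_sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return total;
}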

12
Hierarchy of SMP, Message-Passing, and SIMPLE
Libraries
13
Portability: Access from User Space
14
Parallel Complexity Models
15
SIMPLE Complexity Model: Message-Passing Primitives
16
Comparison of PRAM to SMP
  • PRAM (theory)
  • O(n) processors
  • Global clock
  • Synchronous shared-memory
  • Unit cost for computation or memory access
  • Ideal Read/Write models (EREW, CREW, CRCW)
  • SMP (practice)
  • P processors (2 to 64)
  • Asynchronous operation (no global lock-step)
  • Uniform memory access to main memory (< 600 ns),
    faster access to local cache (10-40 ns)
  • Cache-coherency at external caches
  • Contention for shared memory

17
OPAL Complexity Model
  • SMP Complexity model motivated by Helman and
    JáJá, Ramachandran
  • Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged
    between any processor and memory,
  • TC is the computational complexity (a worked
    example follows below).
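As an illustration of how the triplet is used (my example, not from the slides): for block-wise prefix-sums over n elements on p threads, assuming each thread scans a contiguous block of n/p elements and the p partial block sums are combined serially, the cost would be roughly

% Illustrative cost of block-wise prefix-sums under the (MA, ME, TC) triplet;
% assumes n/p elements per thread and a serial combine of the p partial sums.
T(n, p) \;=\; (M_A,\, M_E,\, T_C)
        \;=\; \left( O\!\left(\tfrac{n}{p}\right),\;
                     O\!\left(\tfrac{n}{p}\right),\;
                     O\!\left(\tfrac{n}{p} + p\right) \right)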

18
OPAL Facets
  • Common Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce
  • Techniques
  • Pointer-jumping
  • Balanced Trees (Prefix-Sums)
  • Symmetry Breaking (3-Coloring)
  • Parallel Prefix (List Ranking; sketched below)
  • Graph Algorithms
  • Spanning Tree
  • Euler Tour
  • Tree Functions
  • Ear Decomposition
  • Combinatorics
  • Sorting
  • Selection
  • Bioinformatics
  • (Minimum Evolution) Phylogeny Trees
  • Computational Genomics: Breakpoints, Inversions,
    Translocations
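A sketch of one of the techniques listed above, pointer-jumping applied to list ranking, written with OpenMP. It is illustrative only and not the OPAL implementation; the function name list_rank and the double-buffering layout are my choices.

/* Pointer-jumping list ranking, sketched with OpenMP.
 * next_in[i] is the successor of list node i (the tail points to itself);
 * on return, rank[i] holds the distance from i to the tail.             */
#include <stdlib.h>
#include <string.h>

void list_rank(long n, const long *next_in, long *rank)
{
    long *nxt = malloc(n * sizeof *nxt), *nxt2 = malloc(n * sizeof *nxt2);
    long *r   = malloc(n * sizeof *r),   *r2   = malloc(n * sizeof *r2);
    memcpy(nxt, next_in, n * sizeof *nxt);

    #pragma omp parallel for
    for (long i = 0; i < n; i++)
        r[i] = (nxt[i] == i) ? 0 : 1;          /* tail starts at rank 0  */

    /* Each round doubles the span a pointer covers, so O(log n) rounds
     * suffice; writes go to separate buffers so every read in a round
     * sees the previous round's values (mimicking a synchronous step).  */
    int changed = 1;
    while (changed) {
        changed = 0;
        #pragma omp parallel for reduction(|:changed)
        for (long i = 0; i < n; i++) {
            long s   = nxt[i];
            r2[i]    = r[i] + ((s == i) ? 0 : r[s]);
            nxt2[i]  = nxt[s];
            changed |= (nxt2[i] != nxt[i]);
        }
        long *t;
        t = nxt; nxt = nxt2; nxt2 = t;         /* swap the double buffers */
        t = r;   r   = r2;   r2   = t;
    }

    memcpy(rank, r, n * sizeof *r);
    free(nxt); free(nxt2); free(r); free(r2);
}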

19
SMP Complexity Model: SMP Node Primitives
  • Read/Write
  • Replicate
  • Barrier
  • Scan
  • Reduce
  • Broadcast
  • Allreduce (a node-level sketch follows below)
  • Etc.
  • SMP Complexity model motivated by Helman and
    JáJá
  • Complexity given by the triplet (MA, ME, TC)
  • MA is the number of memory accesses,
  • ME is the maximum volume of data exchanged
    between any processor and memory,
  • TC is the computational complexity.
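A sketch of how one of the node primitives above, an Allreduce over a node's threads, might be built from POSIX threads and a barrier. The names node_allreduce_sum and NODE_THREADS are assumptions for the example, not the OPAL API, and pthread_barrier_t is assumed to be available.

/* Node-level Allreduce (sum) across the threads of one SMP node.
 * Illustrative only; not OPAL code.                                      */
#include <pthread.h>

#define NODE_THREADS 8                     /* r-way SMP node size (assumed) */

static double            partial[NODE_THREADS];  /* one slot per thread    */
static pthread_barrier_t node_barrier;

/* Call once, from one thread, before the worker threads start.           */
void node_allreduce_init(void)
{
    pthread_barrier_init(&node_barrier, NULL, NODE_THREADS);
}

/* Every thread on the node calls this with its own value; all of them
 * return the node-wide sum.                                               */
double node_allreduce_sum(int my_thread, double value)
{
    partial[my_thread] = value;            /* publish this thread's term   */
    pthread_barrier_wait(&node_barrier);   /* wait until all slots written */

    double sum = 0.0;                      /* each thread reads all slots; */
    for (int t = 0; t < NODE_THREADS; t++) /* cheap for small r (2..64)    */
        sum += partial[t];

    pthread_barrier_wait(&node_barrier);   /* keep slots stable until read */
    return sum;
}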

20
OPAL Setting: Programming Environment
21
Local Context Parameters for Each Thread
22
Control Primitives
23
Memory Management Primitives
24
Example Application: Radixsort
  • Stable sort of n integers spread evenly across a
    cluster of p shared-memory r-way nodes
  • Decompose b-bit keys into ρ-bit digits
  • Perform b/ρ passes of counting sort on the digits,
    least-significant digit first
  • Counting Sort (one pass is sketched below)
  • Compute histogram of local keys
  • Communicate Alltoall primitive of histograms
  • Locally compute prefix-sums of histograms
  • Communicate (Inverse) Alltoall of prefix-sums
  • Rank each local element
  • Perform a personalized communication (1-relation)
    rearranging elements into sorted order
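A single counting-sort pass as described above, sketched in plain C for one node's keys; the cluster algorithm would insert the Alltoall exchanges at the points marked in comments. The digit width ρ (here the macro RHO) and the function names are assumptions for the example, not names from the slides or from OPAL.

/* One counting-sort pass of the radix sort on a single node.             */
#include <stdlib.h>
#include <string.h>

#define RHO     8                       /* digit width in bits (assumed)   */
#define BUCKETS (1 << RHO)              /* 2^RHO counting buckets          */

/* Stable sort of keys[0..n) by the digit starting at bit `shift`.        */
void counting_sort_pass(unsigned *keys, unsigned *tmp, long n, int shift)
{
    long count[BUCKETS] = {0};

    /* 1. Histogram of the local keys' current digit.                     */
    for (long i = 0; i < n; i++)
        count[(keys[i] >> shift) & (BUCKETS - 1)]++;

    /* (cluster version: Alltoall of the p nodes' histograms goes here)   */

    /* 2. Prefix-sums over the histogram give each bucket's start offset. */
    long offset = 0;
    for (int b = 0; b < BUCKETS; b++) {
        long c   = count[b];
        count[b] = offset;
        offset  += c;
    }

    /* (cluster version: inverse Alltoall of the prefix-sums here, then a
     *  personalized all-to-all communication routes keys to their nodes) */

    /* 3. Rank each element and place it; scanning in input order keeps
     *    the pass stable, which LSD radix sort requires.                 */
    for (long i = 0; i < n; i++) {
        unsigned d = (keys[i] >> shift) & (BUCKETS - 1);
        tmp[count[d]++] = keys[i];
    }
    memcpy(keys, tmp, n * sizeof *keys);
}

/* Radix sort = b / RHO such passes, least-significant digit first.       */
void radix_sort(unsigned *keys, long n)
{
    unsigned *tmp = malloc(n * sizeof *tmp);
    for (int shift = 0; shift < 32; shift += RHO)
        counting_sort_pass(keys, tmp, n, shift);
    free(tmp);
}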

25
(No Transcript)
26
Experimental Platform
  • Cluster of DEC AlphaServer 2100 4/275 nodes
  • four 64-bit dual-issue DEC 21064A (ev4) Alpha
    RISC processors clocked at 275MHz
  • each Alpha has two separate data and instruction
    on-chip 16KB caches. ICache is direct-mapped,
    while DCache is two-way set associative
  • each CPU has a 4MB backup (L2) cache
  • 128-bit system bus connecting CPUs to 2GB of
    shared memory
  • DEC Gigaswitch/ATM with DEC (OC-3c) 155.52 Mbps
    PCI adaptor cards
  • MPI (mpich 1.0.13), pthreads (DEC)

27
(No Transcript)
28
Execution Time of Radix Sort on an SMP Cluster
29
SMP Example Ear Decomposition
  • Ear decomposition
  • Partitions the edges of a graph, useful in
    parallel processing
  • Like peeling the layers of an onion
  • Applied to scientific computing problems
  • Computational mechanics (structural rigidity)
  • Computational biology (molecular structure, atoms
    in DNA chains)
  • Computational fluid dynamics
  • Similar to other parallel algorithms for
    combinatorial problems
  • Trivial and fast sequential algorithm
  • Efficient PRAM algorithm
  • But no known practical parallel algorithm

30
Ear Decomposition Example
(Figure: input graph with n vertices and m edges, its spanning tree, and
the output ears)
31
Ear Decomposition Complexities
  • Message Passing
  • Spanning Tree
  • Ear Decomposition
  • Shared Memory
  • Spanning Tree
  • Ear Decomposition

32
Experimental Platform
  • Sun HPC 10000
  • 64 UltraSparc II 64-bit (sparcV9) processors
    clocked at 400MHz
  • 4-way superscalar, in-order dispatch,
    out-of-order completion
  • each processor has two separate, direct-mapped
    data and instruction on-chip 16KB caches (64 byte
    cache blocks)
  • each CPU has an 8MB backup (L2) cache
  • Four-way interleaved address buses with snooping
  • 16x16 data crossbar between system boards
  • 38 clocks pin-to-pin on a load-miss to main
    memory (Uniform Memory Access)
  • Sun Workshop 6.0 Compilers, Solaris 7
  • Sun HPC ClusterTools 3.1 MPI
  • IEEE POSIX Threads (Sun)

33
Comparison of Ear Decomposition Algorithms
34
Performance of SMP Ear Decomposition on a Variety
of Input Graphs
n = 8192
35
SMP Ear Decomposition Algorithms
36
Conclusions
  • New hybrid model for SMP Clusters
  • Open Source Parallel Algorithm Library (OPAL)
  • High-Performance methodology
  • Fastest known algorithms on SMPs and SMP clusters
  • Preliminary experimental results

37
Future Work
  • Algorithms for SMP Clusters
  • Validate complexity model
  • Identify classes of efficient algorithms
  • Library of SMP algorithms
  • Methodology for algorithm-engineering
  • Clusters of Heterogeneous SMP Nodes
  • Varying node sizes
  • Nodes from different vendors and architectures
  • Hierarchical clusters of SMPs
  • Scientific Applications
  • Bioinformatics and Genomics
  • Landscape Ecology and Remote Sensing
  • Computational Fluid Dynamics