Title: OPAL: Open Source Parallel Algorithm Library - Designing High-Performance Algorithms for SMP Clusters
1. OPAL: Open Source Parallel Algorithm Library - Designing High-Performance Algorithms for SMP Clusters
- David A. Bader
- Electrical and Computer Engineering Department
- Albuquerque High Performance Computing Center
- University of New Mexico
- dbader@eece.unm.edu
- http://hpc.eece.unm.edu/
2. High-Performance Applications Using SMP Clusters
- Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)
- Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes; Test of Optimality Principles
- Computational Bioinformatics: Large-Scale Phylogeny Reconstruction
3. Research Collaborators
- Joseph JáJá, University of Maryland
- Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
- Bruce Milne, Biology (Landscape Ecology), University of New Mexico
- Tandy Warnow, CS, University of Texas at Austin
- IBM ACTC Group (David Klepacki, John Levesque, and others)
- Current Graduate Students: Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
- Laboratory Alumni: Kavita Balakavi (Intel), Ajith Illendula (Intel)
4. Acknowledgment of Support
- NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668
- NSF BIO Division of Environmental Biology DEB 99-10123
- Department of Energy Sandia-University of New Mexico Assistant Professorship Program (SUNAPP) Award AX-3006
- IBM SUR Grant (UNM Vista-Azul Project)
- NPACI/SDSC and NCSA/Alliance
- NSF 00- Algorithms for Irregular Discrete Computations on SMPs
5. Outline
- Motivation
- SMP Cluster Programming (SIMPLE)
- Complexity model
- Message-Passing
- Shared-Memory
- OPAL Facets (parallel libraries)
- OPAL Setting (programming framework)
- Example SMP Algorithms
6. Motivation
- High-performance computing has been leveraging COTS workstation technologies:
- Commodity microprocessors
- High-performance networks
- Operating system and compiler technology
- Symmetric multiprocessors (SMPs):
- Hardware support for hierarchical memory management
- Multithreaded operating system kernels
- Optimizing compilers and runtime systems
7. SMP Cluster Architectures
- IBM SP (NPACI Blue Horizon, 144x8)
- Linux Clusters
- Compaq AlphaServers (PSC/NSF Terascale, 682x4)
- Sun Ultra HPC (4x64)
8. Message-Passing Performance
9. Shared-Memory Performance
- One Sun HPC E10K processor
- Contiguous: each element of a contiguous array read exactly once
- C,X: cyclic read (stride X) of a contiguous array
- R: random access of an array
10. High-Performance Algorithms for SMP Clusters
- SIMPLE Model
- Uses a hybrid, natural combination of message passing and shared memory:
- Message-passing interface between nodes
- Shared-memory programming (OpenMP, POSIX threads) on each SMP node
- Methodology for adapting message-passing algorithms to SMP clusters
- Freely available, open-source implementation of parallel algorithms, libraries, and programming environment, for C/C++/Fortran, under the GNU General Public License (GPL)
11. Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)
- Similar Single-Program Multiple-Data (SPMD) paradigm
- Replace multiple MPI tasks per node with a single task and multiple shared-memory threads
- Parallelize sequential work into equivalent shared-memory algorithms
- Replace MPI communication primitives with corresponding SIMPLE primitives
12. Hierarchy of SMP, Message-Passing, and SIMPLE Libraries
13. Portability: Access from User Space
14. Parallel Complexity Models
15. SIMPLE Complexity Model: Message-Passing Primitives
16. Comparison of PRAM to SMP
- PRAM (theory)
- O(n) processors
- Global clock
- Synchronous shared memory
- Unit cost for computation or memory access
- Idealized read/write models (EREW, CREW, CRCW)
- SMP (practice)
- P processors (2 to 64)
- Asynchronous operation (no lock-step)
- Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
- Cache coherency at external caches
- Contention for shared memory
17. OPAL Complexity Model
- SMP complexity model motivated by Helman and JáJá, and by Ramachandran
- Complexity given by the triplet (MA, ME, TC):
- MA is the number of memory accesses
- ME is the maximum volume of data exchanged between any processor and memory
- TC is the computational complexity
18. OPAL Facets
- Common Primitives
- Read/Write
- Replicate
- Barrier
- Scan
- Reduce
- Broadcast
- Allreduce
- Techniques
- Pointer-jumping
- Balanced Trees (Prefix-Sums)
- Symmetry Breaking (3-Coloring)
- Parallel Prefix (List Ranking)
- Graph Algorithms
- Spanning Tree
- Euler Tour
- Tree Functions
- Ear Decomposition
- Combinatorics
- Sorting
- Selection
- Bioinformatics
- (Minimum Evolution) Phylogeny Trees
- Computational Genomics: Breakpoints, Inversions, Translocations
19. SMP Complexity Model: SMP Node Primitives
- Read/Write
- Replicate
- Barrier
- Scan
- Reduce
- Broadcast
- Allreduce
- Etc.
- SMP complexity model motivated by Helman and JáJá
- Complexity given by the triplet (MA, ME, TC):
- MA is the number of memory accesses
- ME is the maximum volume of data exchanged between any processor and memory
- TC is the computational complexity
20. OPAL Setting: Programming Environment
21. Local Context Parameters for Each Thread
22. Control Primitives
23. Memory Management Primitives
24. Example Application: Radixsort
- Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
- Decompose the b-bit keys into ρ-bit digits
- Perform b/ρ passes of counting sort on the digits (LSD → MSD)
- Counting Sort:
- Compute the histogram of local keys
- Communicate: Alltoall primitive on the histograms
- Locally compute prefix-sums of the histograms
- Communicate: (inverse) Alltoall of the prefix-sums
- Rank each local element
- Perform a personalized communication (1-relation) rearranging elements into sorted order
25. (No Transcript)
26. Experimental Platform
- Cluster of DEC AlphaServer 2100 4/275 nodes
- Four 64-bit dual-issue DEC 21064A (EV4) Alpha RISC processors clocked at 275 MHz
- Each Alpha has separate on-chip 16 KB data and instruction caches; the ICache is direct-mapped, while the DCache is two-way set-associative
- Each CPU has a 4 MB backup (L2) cache
- 128-bit system bus connecting the CPUs to 2 GB of shared memory
- DEC GIGAswitch/ATM with DEC OC-3c (155.52 Mbps) PCI adapter cards
- MPI (MPICH 1.0.13), POSIX threads (DEC)
27. (No Transcript)
28. Execution Time of Radix Sort on an SMP Cluster
29. SMP Example: Ear Decomposition
- Ear decomposition
- Partitions the edges of a graph; useful in parallel processing
- Like peeling the layers of an onion
- Applied to scientific computing problems:
- Computational mechanics (structural rigidity)
- Computational biology (molecular structure, atoms in DNA chains)
- Computational fluid dynamics
- Similar to other parallel algorithms for combinatorial problems:
- Trivial and fast sequential algorithm
- Efficient PRAM algorithm
- But no known practical parallel algorithm
30. Ear Decomposition Example
(Figure: an input graph with n vertices and m edges, its spanning tree, and the output ears.)
31. Ear Decomposition Complexities
- Message Passing
- Spanning Tree
- Ear Decomposition
- Shared Memory
- Spanning Tree
- Ear Decomposition
32. Experimental Platform
- Sun HPC 10000
- 64 UltraSPARC II 64-bit (SPARC V9) processors clocked at 400 MHz
- 4-way superscalar, in-order dispatch, out-of-order completion
- Each processor has two separate, direct-mapped on-chip 16 KB data and instruction caches (64-byte cache blocks)
- Each CPU has an 8 MB backup (L2) cache
- Four-way interleaved address buses with snooping
- 16x16 data crossbar between system boards
- 38 clocks pin-to-pin on a load miss to main memory (uniform memory access)
- Sun Workshop 6.0 compilers, Solaris 7
- Sun HPC ClusterTools 3.1 MPI
- IEEE POSIX threads (Sun)
33. Comparison of Ear Decomposition Algorithms
34. Performance of SMP Ear Decomposition on a Variety of Input Graphs (n = 8192)
35. SMP Ear Decomposition Algorithms
36. Conclusions
- New hybrid model for SMP Clusters
- Open Source Parallel Algorithm Library (OPAL)
- High-Performance methodology
- Fastest known algorithms on SMPs and SMP clusters
- Preliminary experimental results
37. Future Work
- Algorithms for SMP Clusters
- Validate complexity model
- Identify classes of efficient algorithms
- Library of SMP algorithms
- Methodology for algorithm-engineering
- Clusters of Heterogeneous SMP Nodes
- Varying node sizes
- Nodes from different vendors and architectures
- Hierarchical clusters of SMPs
- Scientific Applications
- Bioinformatics and Genomics
- Landscape Ecology and Remote Sensing
- Computational Fluid Dynamics