Title: OPAL: Open Source Parallel Algorithm Library - Designing High-Performance Algorithms for SMP Clusters
1. OPAL: Open Source Parallel Algorithm Library - Designing High-Performance Algorithms for SMP Clusters
- David A. Bader
- Electrical and Computer Engineering Department
- Albuquerque High Performance Computing Center
- University of New Mexico
- dbader@eece.unm.edu
- http://hpc.eece.unm.edu/
2. High-Performance Applications Using SMP Clusters
- Long-term Earth science studies using terascale remotely-sensed global satellite imagery (4 km AVHRR GAC)
- Computational Ecological Studies: Self-Organization of Semi-Arid Landscapes; Test of Optimality Principles
- Computational Bioinformatics: Large-Scale Phylogeny Reconstruction
3. Research Collaborators
- Joseph JáJá, University of Maryland
- Bernard Moret, CS (Experimental Algorithmics), University of New Mexico
- Bruce Milne, Biology (Landscape Ecology), University of New Mexico
- Tandy Warnow, CS, University of Texas at Austin
- IBM ACTC Group (David Klepacki, John Levesque, and others)
- Current Graduate Students: Mi Yan, Niranjan Prabhu, Vinila Yarlagadda
- Laboratory Alumni: Kavita Balakavi (Intel), Ajith Illendula (Intel)
4. Acknowledgment of Support
- NSF CISE Postdoctoral Research Associate in Experimental Computer Science No. 96-25668
- NSF BIO Division of Environmental Biology DEB 99-10123
- Department of Energy Sandia-University of New Mexico Assistant Professorship Program (SUNAPP) Award AX-3006
- IBM SUR Grant (UNM Vista-Azul Project)
- NPACI/SDSC and NCSA/Alliance
- NSF 00- Algorithms for Irregular Discrete Computations on SMPs
5. Outline
- Motivation
- SMP Cluster Programming (SIMPLE)
- Complexity model
- Message-Passing
- Shared-Memory
- OPAL Facets (parallel libraries)
- OPAL Setting (programming framework)
- Example SMP Algorithms
6. Motivation
- High-performance computing has been leveraging COTS workstation technologies:
- Commodity microprocessors
- High-performance networks
- Operating system and compiler technology
- Symmetric multiprocessors (SMPs):
- Hardware support for hierarchical memory management
- Multithreaded operating system kernels
- Optimizing compilers and runtime systems
7. SMP Cluster Architectures
- IBM SP (NPACI Blue Horizon, 144x8)
- Linux Clusters
- Compaq AlphaServers (PSC/NSF Terascale, 682x4)
- Sun Ultra HPC (4x64)
8. Message-Passing Performance
9. Shared-Memory Performance
- One Sun HPC E10K processor
- Contiguous: each element of a contiguous array read exactly once
- C,X: cyclic read (stride X) of a contiguous array
- R: random access of an array
10. High-Performance Algorithms for SMP Clusters
- SIMPLE Model
- Uses a hybrid, natural combination of message passing and shared memory:
- Message-passing interface between nodes
- Shared-memory programming (OpenMP, POSIX threads) on each SMP node
- Methodology for adapting message-passing algorithms to SMP clusters
- Freely available, open-source implementation of parallel algorithms, libraries, and programming environment, for C/C++/Fortran, under the GNU General Public License (GPL)
11. Optimizing from MPI to SIMPLE (Regular or Irregular Algorithms)
- Similar Single-Program Multiple-Data (SPMD) paradigm
- Replace multiple MPI tasks per node with a single task and multiple shared-memory threads
- Parallelize sequential work into equivalent shared-memory algorithms
- Replace MPI communication primitives with corresponding SIMPLE primitives
12. Hierarchy of SMP, Message-Passing, and SIMPLE Libraries
13. Portability: Access from User Space
14. Parallel Complexity Models
15. SIMPLE Complexity Model: Message-Passing Primitives
16. Comparison of PRAM to SMP
- PRAM (theory)
- O(n) processors
- Global clock
- Synchronous shared memory
- Unit cost for computation or memory access
- Idealized read/write models (EREW, CREW, CRCW)
- SMP (practice)
- P processors (2 to 64)
- Asynchronous operation (no lock-step)
- Uniform memory access to main memory (< 600 ns), faster access to local cache (10-40 ns)
- Cache coherency at external caches
- Contention for shared memory
17. OPAL Complexity Model
- SMP complexity model motivated by Helman and JáJá, and by Ramachandran
- Complexity given by the triplet (MA, ME, TC):
- MA is the number of memory accesses
- ME is the maximum volume of data exchanged between any processor and memory
- TC is the computational complexity
18. OPAL Facets
- Common Primitives
- Read/Write
- Replicate
- Barrier
- Scan
- Reduce
- Broadcast
- Allreduce
- Techniques
- Pointer-jumping
- Balanced Trees (Prefix-Sums)
- Symmetry Breaking (3-Coloring)
- Parallel Prefix (List Ranking)
- Graph Algorithms
- Spanning Tree
- Euler Tour
- Tree Functions
- Ear Decomposition
- Combinatorics
- Sorting
- Selection
- Bioinformatics
- (Minimum Evolution) Phylogeny Trees
- Computational Genomics: Breakpoints, Inversions, Translocations
19. SMP Complexity Model: SMP Node Primitives
- Read/Write
- Replicate
- Barrier
- Scan
- Reduce
- Broadcast
- Allreduce
- Etc.
- SMP complexity model motivated by Helman and JáJá
- Complexity given by the triplet (MA, ME, TC):
- MA is the number of memory accesses
- ME is the maximum volume of data exchanged between any processor and memory
- TC is the computational complexity
20. OPAL Setting: Programming Environment
21. Local Context Parameters for Each Thread
22. Control Primitives
23. Memory Management Primitives
24. Example Application: Radixsort
- Stable sort of n integers spread evenly across a cluster of p shared-memory r-way nodes
- Decompose the b-bit keys into ρ-bit digits
- Perform b/ρ passes of counting sort on the digits (LSD → MSD)
- Counting Sort:
- Compute the histogram of local keys
- Communicate: Alltoall primitive on the histograms
- Locally compute prefix-sums of the histograms
- Communicate: (inverse) Alltoall of the prefix-sums
- Rank each local element
- Perform a personalized communication (1-relation) rearranging elements into sorted order
25. (No Transcript)
26. Experimental Platform
- Cluster of DEC AlphaServer 2100 4/275 nodes
- Four 64-bit dual-issue DEC 21064A (EV4) Alpha RISC processors clocked at 275 MHz
- Each Alpha has separate on-chip 16 KB data and instruction caches; the ICache is direct-mapped, while the DCache is two-way set-associative
- Each CPU has a 4 MB backup (L2) cache
- 128-bit system bus connecting the CPUs to 2 GB of shared memory
- DEC GIGAswitch/ATM with DEC OC-3c (155.52 Mbps) PCI adapter cards
- MPI (MPICH 1.0.13), POSIX threads (DEC)
27. (No Transcript)
28. Execution Time of Radix Sort on an SMP Cluster
29. SMP Example: Ear Decomposition
- Ear decomposition
- Partitions the edges of a graph; useful in parallel processing
- Like peeling the layers of an onion
- Applied to scientific computing problems:
- Computational mechanics (structural rigidity)
- Computational biology (molecular structure, atoms in DNA chains)
- Computational fluid dynamics
- Similar to other parallel algorithms for combinatorial problems:
- Trivial and fast sequential algorithm
- Efficient PRAM algorithm
- But no known practical parallel algorithm
30. Ear Decomposition Example
(Figure: an input graph with n vertices and m edges, its spanning tree, and the output ears.)
31. Ear Decomposition Complexities
- Message Passing
- Spanning Tree
- Ear Decomposition
- Shared Memory
- Spanning Tree
- Ear Decomposition
32. Experimental Platform
- Sun HPC 10000
- 64 UltraSPARC II 64-bit (SPARC V9) processors clocked at 400 MHz
- 4-way superscalar, in-order dispatch, out-of-order completion
- Each processor has two separate, direct-mapped on-chip 16 KB data and instruction caches (64-byte cache blocks)
- Each CPU has an 8 MB backup (L2) cache
- Four-way interleaved address buses with snooping
- 16x16 data crossbar between system boards
- 38 clocks pin-to-pin on a load miss to main memory (uniform memory access)
- Sun Workshop 6.0 compilers, Solaris 7
- Sun HPC ClusterTools 3.1 MPI
- IEEE POSIX threads (Sun)
33. Comparison of Ear Decomposition Algorithms
34. Performance of SMP Ear Decomposition on a Variety of Input Graphs (n = 8192)
35. SMP Ear Decomposition Algorithms
36. Conclusions
- New hybrid model for SMP Clusters
- Open Source Parallel Algorithm Library (OPAL)
- High-Performance methodology
- Fastest known algorithms on SMPs and SMP clusters
- Preliminary experimental results
37. Future Work
- Algorithms for SMP Clusters
- Validate complexity model
- Identify classes of efficient algorithms
- Library of SMP algorithms
- Methodology for algorithm-engineering
- Clusters of Heterogeneous SMP Nodes
- Varying node sizes
- Nodes from different vendors and architectures
- Hierarchical clusters of SMPs
- Scientific Applications
- Bioinformatics and Genomics
- Landscape Ecology and Remote Sensing
- Computational Fluid Dynamics