Title: HPC Research @ UNM: X10
1. HPC Research @ UNM: X10ding Graph Analysis
- Mehmet F. Su
- ECE Dept. - University of New Mexico
- Joint work with advisor David A. Bader
- mfatihsu, dbader _at_ ece.unm.edu
2. Acknowledgment of Support
- National Science Foundation
- CAREER: High-Performance Algorithms for Scientific Applications (00-93039)
- ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
- DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles (99-10123)
- ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
- DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
- ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
- DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
- IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
- PACI: NPACI/SDSC, NCSA/Alliance, PSC
- DOE: Sandia National Laboratories
3. Outline
- About the speaker
- Graph-theoretic problems: what and why
- Our research
- IBM PERCS performance evaluation tools: SSCA-2 and X10
- Some tool ideas for better productivity
4. About the speaker: Mehmet F. Su
- Education
- BS Physics (Bilkent University, Ankara, Turkey)
- Physics Dept, Iowa State University, Ames, IA
- PhD track, ECE Dept, University of New Mexico, Albuquerque, NM
- Past and Present External Collaborations
- Condensed Matter Physics Group, Ames National Laboratory, Ames, IA
- HPC apps. in comp. biology, photonics, comp. electromagnetism
- Photonic Microsystems Technologies Group, Sandia National Laboratories, Albuquerque, NM
- HPC apps. in photonics and comp. electromagnetism
5. Large-Scale Network Problems
6. Characteristics of Graph Problems
- Graphs are of fundamental importance
- Many fast theoretical PRAM algorithms but few fast parallel implementations
- Irregular problems are challenging
- Sparse data structures
- Hard to partition data
- Poor locality hinders cache performance
- Parallel graph and tree algorithms
- Building blocks for higher-level parallel algorithms
- Hard to achieve parallel speedup (very fast sequential implementations)
7. Our Group's Impact
- Our results demonstrate the first parallel implementations of several combinatorial problems that, for arbitrary sparse instances, run faster than the best sequential implementations
- list ranking
- spanning tree, minimum spanning forest, rooted spanning tree
- ear decomposition
- tree contraction and expression evaluation
- maximum flow using push-relabel
- Our source code is freely available under the GNU General Public License (GPL).
8. Spanning Tree (Cong, Bader Ph.D. 2004, now at IBM TJ Watson)
9. High-End SMP Servers
- IBM pSeries 690 Regatta
- 32-way Power4 1.7 GHz, 32 GB RAM
- STREAM Triad: 58.9 GB/s
- IBM pSeries 575
- 2U rackmount, 8-way SMP, up to 256 GB RAM
- up to 1024-proc configuration w/ single Cluster 1600
- STREAM Triad (8 p5 1.9 GHz procs): 55.7 GB/s
10. About SSCA-2
- DARPA High Productivity Computing Systems (HPCS) Program
- Productivity Benchmarks: Scalable Synthetic Compact Application (SSCA)
- SSCA-2: Graph Analysis (directed multigraph with labeled edges)
- Simulate large-scale graph problems
- Multiple analysis techniques, single data
- Four computational kernels
- Integer and character ops., no floating point
- Emphasizes integer operations, irregular data access, choice of data structure
- Data structure not modified across kernels
11. SSCA-2 Structure
- Scalable Data Generator: produces a random, but structured, set of edges
- Kernel 1: builds the graph data structure from the set of edges
- Kernel 2: searches the multigraph for the desired maximum integer weight and the desired string weight (labels)
- Kernel 3: extracts desired subgraphs, given start vertices and path length
- Kernel 4: extracts clusters (cliques) to help identify the underlying graph structure
12. About X10
- New programming language, in development by IBM
- Better productivity, more scalability
- Shorten development/test cycle time
- Object oriented
- New ways to express
- Parallelism
- Data access
- Aggregate operations (scan, reduce etc.)
- Rule out/catch more programming errors, bugs
13. Implementation of SSCA-2
- Designed and implemented parallel shared-memory code (C with POSIX threads) for SSCA-2 (Bader/Madduri)
- Interested in X10 implementation
- Evaluate productivity with X10 and its development environment (Eclipse)
- Evaluate SSCA-2 performance on new systems once X10 is fully optimized
14. Tool Ideas for Better Productivity
- Wizard-like interfaces
- *NIXes, powerful development environments, cascaded menus shock many programmers
- Intuitive visualization for data
- With zoom/agglomeration, like online street maps
- Library/package indexing tool
- Help resolve unresolved symbols, allow manual override w/ choices
- Autoconf/Automake counterparts
- Determine external dependencies/library symbols automatically for any environment
- Better branch prediction/feedback mechanism
- Collect data over multiple runs
15. Tool Ideas (cont'd)
- Better binding, architecture-dependent optimizer
- Detect environment properties at run time
- Integrated tools to help identify performance hot spots and reasons
- Profile for cache misses, branch prediction issues, check useful tasks performed concurrently, lock contention
- Visualization to indicate high-level compiler optimizations in the Eclipse editor window
- Arrows for loop transforms, code relocations, annotations, different colors for propagated constants, evaluated expressions etc.
- Intermediate language/assembly viewer
- Compiler optimizations, register scheduling, SWP annotated
- Assembly listings from many compilers give similar info
16. IBM Collaborators
- PERCS Performance
- Ram Rajamony
- Pat Bohrer
- Mootaz Elnozahy
- X10 evaluation
- Vivek Sarkar
- Kemal Ebcioglu
- Vijay Saraswat
- Christine Halverson
- Catalina M. Danis
- Jason Ellis
- Advanced Computing Technologies
- David Klepacki
- Guojing Cong
17. Backup Slides
18. SSCA-2 Graph Analysis Overview
- Application: Graph Theory. Stresses memory access; uses integer and character operations (no floating point)
- Scalable Data Generation and 4 Computational Kernels
- Scalable Data Generator creates a set of edges between vertices to form a sparse directed multi-graph with:
- Random number of randomly sized cliques
- Random number of intra-clique directed parallel edges
- Random number of gradually 'thinning' edges linking the cliques
- No self loops
- Two types of edge weight labels: integer and character string
- only integer weights considered in present implementation
- Randomized vertex numbers
Directed weighted multigraph with no self-loops
19. Scalable Data Generation
- Creates a set of edges between vertices to form a sparse directed multigraph with:
- Random number of randomly sized cliques
- Random number of intra-clique directed parallel edges
- Random number of gradually 'thinning' edges linking the cliques
- No self loops
- Two types of edge weight labels: integer and character string
- only integer weights considered in present implementation
- Randomized vertex numbers
- Vertices should be permuted to remove any locality for Kernel 4
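The generator described above can be sketched roughly as follows. This is a simplified illustration in Python, not the SSCA-2 reference generator: the function name `generate_edges`, its parameters, and the choice of one linking edge between neighbouring cliques (a stand-in for the 'thinning' inter-clique edges) are all assumptions made for the example.

```python
import random

def generate_edges(num_vertices, max_clique_size, max_weight, seed=0):
    """Return (src, dst, weight) tuples for a clique-structured directed multigraph.

    Hypothetical helper: a simplified stand-in for the SSCA-2 data generator.
    """
    rng = random.Random(seed)
    # Split vertices 0..n-1 into consecutive cliques of random size.
    cliques, start = [], 0
    while start < num_vertices:
        size = rng.randint(1, max_clique_size)
        cliques.append(list(range(start, min(start + size, num_vertices))))
        start += size
    edges = []
    for members in cliques:
        # Connect every ordered pair inside a clique with 1-2 parallel edges.
        for u in members:
            for v in members:
                if u != v:  # no self loops
                    for _ in range(rng.randint(1, 2)):
                        edges.append((u, v, rng.randint(1, max_weight)))
    # Assumption: one edge between neighbouring cliques links them together.
    for a, b in zip(cliques, cliques[1:]):
        edges.append((rng.choice(a), rng.choice(b), rng.randint(1, max_weight)))
    # Randomize vertex numbers so clique members are no longer contiguous.
    perm = list(range(num_vertices))
    rng.shuffle(perm)
    return [(perm[u], perm[v], w) for (u, v, w) in edges]
```

Calling `generate_edges(50, 5, 10, seed=1)` yields a reproducible edge list whose vertex numbers carry no locality, which is the property the last bullet asks for.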
20. Kernel 1: Graph Generation
- Construct a sparse multi-graph from lists of tuples containing vertex identifiers, implied direction, and weights that represent data assigned to the implied edge.
- The multi-graph can be represented in any manner, but it cannot be modified between subsequent kernels accessing the data.
- There are various representations for sparse directed graphs, including (but not limited to) sparse matrices and (multi-level) linked lists.
- This kernel will be timed.
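One possible representation is a compressed sparse row (CSR) layout, sketched below in Python. This is only an illustration of the kind of structure Kernel 1 may build; it is not the Bader/Madduri C implementation, and the names `build_csr` and `out_edges` are hypothetical.

```python
def build_csr(num_vertices, edges):
    """Build a CSR-style adjacency structure from (src, dst, weight) tuples."""
    # Count out-degree per vertex, then prefix-sum into row offsets.
    offsets = [0] * (num_vertices + 1)
    for src, _, _ in edges:
        offsets[src + 1] += 1
    for v in range(num_vertices):
        offsets[v + 1] += offsets[v]
    targets = [0] * len(edges)
    weights = [0] * len(edges)
    cursor = offsets[:-1].copy()  # next free slot for each source vertex
    for src, dst, w in edges:
        slot = cursor[src]
        targets[slot] = dst
        weights[slot] = w
        cursor[src] += 1
    # Tuples are immutable, so later kernels cannot modify the graph.
    return tuple(offsets), tuple(targets), tuple(weights)

def out_edges(csr, v):
    """List (dst, weight) for every edge leaving v; parallel edges are kept."""
    offsets, targets, weights = csr
    return [(targets[i], weights[i]) for i in range(offsets[v], offsets[v + 1])]
```

Returning immutable tuples is one simple way to respect the rule that the structure is not modified between kernels.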
21. Kernel 2: Classify Large Sets
- Examine all edge weights to determine those vertex pairs with the largest integer weights and those vertex pairs with a specified string weight (label).
- The output of this kernel will be two vertex pair lists (i.e., sets) that will be saved for use in the following kernel.
- These two lists will be start sets SI and SC for integer start sets and character start sets respectively.
- The process of generating the two lists/sets will be timed.
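A minimal sequential sketch of this classification step, assuming edges are stored as hypothetical `(src, dst, int_weight, label)` tuples and that "largest" means equal to the maximum weight; the function name `classify_edges` is made up for the example.

```python
def classify_edges(edges, wanted_label):
    """Return (SI, SC): vertex pairs with the maximum integer weight,
    and vertex pairs carrying the requested string label."""
    if not edges:
        return [], []
    max_w = max(w for _, _, w, _ in edges)
    # Sets remove duplicates caused by parallel edges; sorting gives a stable order.
    s_i = sorted({(u, v) for u, v, w, _ in edges if w == max_w})
    s_c = sorted({(u, v) for u, v, _, label in edges if label == wanted_label})
    return s_i, s_c
```

For example, `classify_edges([(0, 1, 5, "a"), (1, 2, 9, "b"), (2, 0, 9, "a")], "a")` returns `([(1, 2), (2, 0)], [(0, 1), (2, 0)])`.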
22. Kernel 3: Extracting Subgraphs
- Produce a series of subgraphs consisting of vertices and edges on paths of length k from the vertex pairs start sets SI and SC.
- A possible computational kernel for graph extraction is Breadth First Search.
- The process of extracting the graph will be timed.
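The breadth-first approach mentioned above might look like the following sketch. It assumes the graph is a hypothetical dict `adj` mapping each vertex to a list of successors, and it interprets "paths of length k" as up to k hops beyond the head of the start edge; both choices are assumptions for illustration.

```python
from collections import deque

def extract_subgraph(adj, start_pair, k):
    """Collect vertices and edges reachable within k hops of a start edge (u, v)."""
    u, v = start_pair
    visited = {u, v}
    sub_edges = {(u, v)}
    frontier = deque([(v, 0)])  # (vertex, depth) pairs, expanded in BFS order
    while frontier:
        x, depth = frontier.popleft()
        if depth == k:
            continue  # depth limit reached; do not expand further
        for y in adj.get(x, ()):
            sub_edges.add((x, y))
            if y not in visited:
                visited.add(y)
                frontier.append((y, depth + 1))
    return visited, sub_edges
```

Running this once per pair in SI and SC produces the series of subgraphs the kernel asks for.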
23. Kernel 4: Clique Extraction
- Use a graph clustering algorithm to partition the vertices of the graph into subgraphs no larger than a maximum size so as to minimize the number of edges that need be cut.
- The kernel implementation should not utilize a priori knowledge of the details in the data generator or the statistics collected in the graph generation process.
- Heuristic algorithms that determine the clusters in near-linear time, O(V), are permitted.
- The process of identifying the clusters and their interconnections will be timed.
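As one example of a permitted heuristic, the sketch below grows each cluster breadth-first from an unassigned seed vertex until the size limit is hit. It is an illustrative stand-in, not the algorithm used in the authors' implementation, and it makes no claim of minimizing the cut; `greedy_clusters` and the dict-based `adj` input are hypothetical.

```python
def greedy_clusters(adj, max_size):
    """Partition vertices into clusters of at most max_size; return (clusters, cut_edges)."""
    # Treat edges as undirected when deciding which vertices belong together.
    undirected = {v: set() for v in adj}
    for u, nbrs in adj.items():
        for v in nbrs:
            undirected.setdefault(u, set()).add(v)
            undirected.setdefault(v, set()).add(u)
    assigned = {}
    clusters = []
    for seed in sorted(undirected):
        if seed in assigned:
            continue
        cluster = [seed]
        assigned[seed] = len(clusters)
        i = 0
        # Breadth-first growth: absorb unassigned neighbours until the cluster is full.
        while i < len(cluster) and len(cluster) < max_size:
            for nbr in sorted(undirected[cluster[i]]):
                if nbr not in assigned and len(cluster) < max_size:
                    assigned[nbr] = len(clusters)
                    cluster.append(nbr)
            i += 1
        clusters.append(cluster)
    # Count directed edges whose endpoints landed in different clusters.
    cut = sum(1 for u, nbrs in adj.items() for v in nbrs if assigned[u] != assigned[v])
    return clusters, cut
```

The sorting is only there to make results deterministic; dropping it keeps the work proportional to the number of vertices and edges.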
24. X10 Design
- Builds over an existing OO language (Java) to shorten learning curve
- Has new constructs for commonly used data access patterns (distributions)
- Commonly used parallel programming environments today
- Message passing, no shared memory (MPI)
- Shared memory, implicit thread control (OpenMP)
- Shared memory, explicit thread control (Threads)
- Partitioned global shared memory, explicit thread control (UPC)
- Partitioned global shared memory, implicit thread control (HPF)
- Can these not be blended?
Note: partitioned global (PG) shared memory can specify affinity to a thread
25. X10 Design (cont'd)
- Supports shared memory, allows local memory; shared memory is partitioned (places)
- Operation can run at a place where data resides (async)
- or data can be sent to a place to get evaluated (future)
- Supports short-hand definitions for array regions, data distribution, extended iterators (foreach variants)
- Generalized barriers (clocks) supporting more flexible operations (can operate/wait on multiple clocks); can freeze a variable until a clock advance (clocked final)
- Supports aggregate parallel operators (scan, reduction) in operator form (not like MPI calls)
- Supports atomic sections (unconditional, conditional); conditional sections lock on a logical condition (run when something is true)
- Weak memory consistency model (enables better optimizations)
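The ideas of partitioned data, futures, and aggregate operators can be loosely illustrated with Python's standard library. This is an analogy only, not X10 syntax or semantics: thread-pool futures stand in for work evaluated at a place, list blocks stand in for a distribution, and all function names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from itertools import accumulate
import operator

def block_distribute(data, num_places):
    """Split data into contiguous blocks, one per hypothetical 'place'."""
    block = max(1, -(-len(data) // num_places))  # ceiling division, at least 1
    return [data[i:i + block] for i in range(0, len(data), block)]

def distributed_sum(data, num_places=4):
    """Sum each block where it 'lives', then reduce the partial results."""
    places = block_distribute(data, num_places)
    with ThreadPoolExecutor(max_workers=num_places) as pool:
        # Each submit returns a future; result() blocks until its value is ready.
        futures = [pool.submit(sum, part) for part in places]
        partials = [f.result() for f in futures]
    return reduce(operator.add, partials, 0)

def prefix_sums(data):
    """Inclusive scan, the other aggregate operator named on the slide."""
    return list(accumulate(data))
```

Here `distributed_sum(list(range(10)), 3)` gives 45 and `prefix_sums([1, 2, 3, 4])` gives `[1, 3, 6, 10]`.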
26. Minimum Spanning Forest
[Figure legend: Boruvka; Best Sequential; Boruvka w/ our memory management; Our new SMP algorithm]
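For readers unfamiliar with the Boruvka baseline named in the figure, here is a minimal sequential sketch in Python. It is the textbook algorithm with a union-find structure, not the authors' SMP algorithm or their memory-managed variant; the function name `boruvka_msf` and the `(u, v, weight)` edge format are assumptions.

```python
def boruvka_msf(num_vertices, edges):
    """Return (total_weight, forest_edges) of a minimum spanning forest.

    edges: list of undirected (u, v, weight) tuples.
    """
    parent = list(range(num_vertices))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    while True:
        cheapest = {}  # component root -> (weight, edge index, u, v)
        for idx, (u, v, w) in enumerate(edges):
            ru, rv = find(u), find(v)
            if ru == rv:
                continue  # edge lies inside one component
            for r in (ru, rv):
                # Ties are broken by edge index so the choices cannot form a cycle.
                if r not in cheapest or (w, idx) < cheapest[r][:2]:
                    cheapest[r] = (w, idx, u, v)
        if not cheapest:
            break  # no edge joins two different components
        for w, idx, u, v in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                forest.append((u, v, w))
    return sum(w for _, _, w in forest), forest
```

Each round lets every component pick its cheapest outgoing edge, so the number of components at least halves per round on a connected graph.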