HPC Research @ UNM: X10


1
HPC Research @ UNM: X10 and Graph Analysis
  • Mehmet F. Su
  • ECE Dept., University of New Mexico
  • Joint work with advisor David A. Bader
  • {mfatihsu, dbader}@ece.unm.edu

2
Acknowledgment of Support
  • National Science Foundation:
  • CAREER: High-Performance Algorithms for
    Scientific Applications (00-93039)
  • ITR: Building the Tree of Life - A National
    Resource for Phyloinformatics and Computational
    Phylogenetics (EF/BIO 03-31654)
  • DEB: Ecosystem Studies - Self-Organization of
    Semi-Arid Landscapes: Test of Optimality
    Principles (99-10123)
  • ITR/AP: Reconstructing Complex Evolutionary
    Histories (01-21377)
  • DEB: Comparative Chloroplast Genomics - Integrating
    Computational Methods, Molecular Evolution, and
    Phylogeny (01-20709)
  • ITR/AP (DEB): Computing Optimal Phylogenetic Trees
    under Genome Rearrangement Metrics (01-13095)
  • DBI: Acquisition of a High Performance
    Shared-Memory Computer for Computational Science
    and Engineering (04-20513)
  • IBM PERCS / DARPA High Productivity Computing
    Systems (HPCS)
  • PACI: NPACI/SDSC, NCSA/Alliance, PSC
  • DOE: Sandia National Laboratories

3
Outline
  • About the speaker
  • Graph-theoretic problems: what and why
  • Our research
  • IBM PERCS performance evaluation tools: SSCA-2
    and X10
  • Some tool ideas for better productivity

4
About the Speaker: Mehmet F. Su
  • Education:
  • BS Physics (Bilkent University, Ankara, Turkey)
  • Physics Dept., Iowa State University, Ames, IA
  • PhD track, ECE Dept., University of New Mexico,
    Albuquerque, NM
  • Past and Present External Collaborations:
  • Condensed Matter Physics Group, Ames National
    Laboratory, Ames, IA
  • HPC applications in computational biology,
    photonics, and computational electromagnetism
  • Photonic Microsystems Technologies Group, Sandia
    National Laboratories, Albuquerque, NM
  • HPC applications in photonics and computational
    electromagnetism

5
Large-Scale Network Problems
6
Characteristics of Graph Problems
  • Graphs are of fundamental importance
  • Many fast theoretical PRAM algorithms, but few fast
    parallel implementations
  • Irregular problems are challenging:
  • Sparse data structures (see the CSR sketch after
    this list)
  • Hard to partition data
  • Poor locality hinders cache performance
  • Parallel graph and tree algorithms:
  • Building blocks for higher-level parallel
    algorithms
  • Hard to achieve parallel speedup (sequential
    implementations are very fast)
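
To make the locality point concrete, here is a minimal sketch (in Java, which X10 builds on) of a compressed sparse row (CSR) adjacency structure; the data-dependent indexing in the neighbor loop is what defeats caches and prefetchers. The class and field names are illustrative, not taken from our code.

```java
// Minimal CSR (compressed sparse row) adjacency structure.
// rowStart[v]..rowStart[v+1]-1 indexes the neighbors of vertex v in adjacency[].
final class CsrGraph {
    final int[] rowStart;   // length numVertices + 1
    final int[] adjacency;  // length numEdges; neighbor vertex ids

    CsrGraph(int[] rowStart, int[] adjacency) {
        this.rowStart = rowStart;
        this.adjacency = adjacency;
    }

    // Visiting neighbors is an indirect, data-dependent access
    // pattern: adjacency[] decides which cache lines are touched next.
    long sumNeighborIds(int v) {
        long sum = 0;
        for (int i = rowStart[v]; i < rowStart[v + 1]; i++) {
            sum += adjacency[i];
        }
        return sum;
    }
}
```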

7
Our Group's Impact
  • Our results demonstrate the first parallel
    implementations of several combinatorial problems
    that, for arbitrary sparse instances, run faster
    than the best sequential implementations:
  • list ranking
  • spanning tree, minimum spanning forest, rooted
    spanning tree
  • ear decomposition
  • tree contraction and expression evaluation
  • maximum flow using push-relabel
  • Our source code is freely available under the GNU
    General Public License (GPL).

8
Spanning Tree (Cong; Ph.D. with Bader, 2004; now at
IBM T.J. Watson)
9
High-End SMP Servers
  • IBM pSeries 690 Regatta:
  • 32-way POWER4 1.7 GHz, 32 GB RAM
  • STREAM Triad: 58.9 GB/s
  • IBM pSeries 575:
  • 2U rackmount, 8-way SMP, up to 256 GB RAM,
  • up to 1024-processor configuration with a single
    Cluster 1600
  • STREAM Triad (8 POWER5 1.9 GHz processors): 55.7 GB/s

10
About SSCA-2
  • DARPA High Productivity Computing Systems (HPCS)
    Program
  • Productivity benchmarks: Scalable Synthetic
    Compact Applications (SSCA)
  • SSCA-2: Graph Analysis (directed multigraph with
    labeled edges)
  • Simulates large-scale graph problems
  • Multiple analysis techniques, single data
    structure
  • Four computational kernels
  • Integer and character operations, no floating point
  • Emphasizes integer operations, irregular data
    access, and the choice of data structure
  • The data structure is not modified across kernels

11
SSCA-2 Structure
  • Scalable Data Generator → produces a random, but
    structured, set of edges
  • Kernel 1 → builds the graph data structure from
    the set of edges
  • Kernel 2 → searches the multigraph for a desired
    maximum integer weight and a desired string weight
    (labels)
  • Kernel 3 → extracts desired subgraphs, given
    start vertices and a path length
  • Kernel 4 → extracts clusters (cliques) to help
    identify the underlying graph structure

12
About X10
  • New programming language, under development at IBM
  • Better productivity, more scalability
  • Shortens development/test cycle time
  • Object oriented
  • New ways to express:
  • Parallelism
  • Data access
  • Aggregate operations (scan, reduce, etc.)
  • Rules out / catches more programming errors and bugs

13
Implementation of SSCA-2
  • Designed and implemented parallel shared-memory
    code (C with POSIX threads) for SSCA-2
    (Bader/Madduri)
  • Interested in an X10 implementation:
  • Evaluate productivity with X10 and its
    development environment (Eclipse)
  • Evaluate SSCA-2 performance on new systems once
    X10 is fully optimized

14
Tool Ideas for Better Productivity
  • Wizard-like interfaces
  • *NIXes, powerful development environments, and
    cascaded menus intimidate many programmers
  • Intuitive visualization for data
  • With zoom/agglomeration, like online street maps
  • Library/package indexing tool
  • Help resolve unresolved symbols; allow manual
    override with choices
  • Autoconf/Automake counterparts
  • Determine external dependencies/library symbols
    automatically for any environment
  • Better branch prediction/feedback mechanism
  • Collect data over multiple runs

15
Tool Ideas (cont'd)
  • Better binding, architecture-dependent optimizer
  • Detect environment properties at run time
  • Integrated tools to help identify performance hot
    spots and their causes
  • Profile for cache misses and branch prediction
    issues; check that useful tasks are performed
    concurrently; detect lock contention
  • Visualization to indicate high-level compiler
    optimizations in the Eclipse editor window
  • Arrows for loop transforms and code relocations;
    annotations; different colors for propagated
    constants, evaluated expressions, etc.
  • Intermediate language/assembly viewer
  • Compiler optimizations, register scheduling, and
    software pipelining (SWP) annotated
  • Assembly listings from many compilers give similar
    information

16
IBM Collaborators
  • PERCS Performance:
  • Ram Rajamony
  • Pat Bohrer
  • Mootaz Elnozahy
  • X10 Evaluation:
  • Vivek Sarkar
  • Kemal Ebcioglu
  • Vijay Saraswat
  • Christine Halverson
  • Catalina M. Danis
  • Jason Ellis
  • Advanced Computing Technologies:
  • David Klepacki
  • Guojing Cong

17
Backup Slides
18
SSCA-2 Graph Analysis Overview
  • Application: graph theory - stresses memory
    access; uses integer and character operations (no
    floating point)
  • Scalable Data Generation + 4 computational
    kernels
  • The Scalable Data Generator creates a set of edges
    between vertices to form a sparse directed
    multigraph with:
  • Random number of randomly sized cliques
  • Random number of intra-clique directed parallel
    edges
  • Random number of gradually 'thinning' edges
    linking the cliques
  • No self loops
  • Two types of edge weight labels: integer and
    character string
  • only integer weights are considered in the present
    implementation
  • Randomized vertex numbers

Directed weighted multigraph with no self-loops
19
Scalable Data Generation
  • Creates a set of edges between vertices to form
    a sparse directed multigraph with:
  • Random number of randomly sized cliques
  • Random number of intra-clique directed parallel
    edges
  • Random number of gradually 'thinning' edges
    linking the cliques
  • No self loops
  • Two types of edge weight labels: integer and
    character string
  • only integer weights are considered in the present
    implementation
  • Randomized vertex numbers
  • Vertices should be permuted to remove any
    locality for Kernel 4 (a toy sketch follows)
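
A deliberately simplified Java sketch of the generator's shape only; the real SSCA-2 generator follows precise statistical specifications that this toy version does not reproduce. All names, and the 1/(d-c) thinning probability, are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Toy edge-list generator: cliques plus "thinning" inter-clique edges.
final class EdgeListGenerator {
    record Edge(int from, int to, int weight) {}

    static List<Edge> generate(int numCliques, int maxCliqueSize,
                               int maxWeight, long seed) {
        Random rng = new Random(seed);
        List<Edge> edges = new ArrayList<>();
        int nextVertex = 0;
        int[] cliqueStart = new int[numCliques];
        int[] cliqueSize = new int[numCliques];

        // Randomly sized cliques, fully connected with directed
        // (hence parallel, u->v and v->u) intra-clique edges.
        for (int c = 0; c < numCliques; c++) {
            cliqueStart[c] = nextVertex;
            cliqueSize[c] = 1 + rng.nextInt(maxCliqueSize);
            nextVertex += cliqueSize[c];
            for (int u = cliqueStart[c]; u < nextVertex; u++)
                for (int v = cliqueStart[c]; v < nextVertex; v++)
                    if (u != v)                       // no self loops
                        edges.add(new Edge(u, v, 1 + rng.nextInt(maxWeight)));
        }
        // "Thinning" inter-clique edges: fewer links between cliques
        // that are further apart in clique index.
        for (int c = 0; c + 1 < numCliques; c++)
            for (int d = c + 1; d < numCliques; d++)
                if (rng.nextDouble() < 1.0 / (d - c)) {
                    int u = cliqueStart[c] + rng.nextInt(cliqueSize[c]);
                    int v = cliqueStart[d] + rng.nextInt(cliqueSize[d]);
                    edges.add(new Edge(u, v, 1 + rng.nextInt(maxWeight)));
                }
        return edges;   // vertex ids should then be randomly permuted
    }
}
```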

20
Kernel 1: Graph Generation
  • Construct a sparse multigraph from lists of
    tuples containing vertex identifiers, implied
    direction, and weights that represent data
    assigned to the implied edge (see the sketch
    after this list).
  • The multigraph can be represented in any manner,
    but it cannot be modified between subsequent
    kernels accessing the data.
  • There are various representations for sparse
    directed graphs, including (but not limited to)
    sparse matrices and (multi-level) linked lists.
  • This kernel will be timed.
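
A minimal sketch of what Kernel 1 could look like, assuming the common CSR (compressed sparse row) representation: one counting pass, then one placement pass. Array names are ours, not the reference implementation's.

```java
// Kernel 1 sketch: build CSR-style arrays from an edge tuple list.
final class Kernel1 {
    // Edge i runs from[i] -> to[i] with integer weight weight[i].
    static int[][] buildCsr(int numVertices, int[] from, int[] to, int[] weight) {
        int numEdges = from.length;
        int[] rowStart = new int[numVertices + 1];
        for (int i = 0; i < numEdges; i++) rowStart[from[i] + 1]++;   // count out-degrees
        for (int v = 0; v < numVertices; v++) rowStart[v + 1] += rowStart[v]; // prefix sum

        int[] adjacency = new int[numEdges];
        int[] adjWeight = new int[numEdges];
        int[] cursor = rowStart.clone();
        for (int i = 0; i < numEdges; i++) {
            int slot = cursor[from[i]]++;
            adjacency[slot] = to[i];
            adjWeight[slot] = weight[i];
        }
        // Parallel edges are kept as-is: the benchmark graph is a multigraph.
        return new int[][] { rowStart, adjacency, adjWeight };
    }
}
```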

21
Kernel 2: Classify Large Sets
  • Examine all edge weights to determine the
    vertex pairs with the largest integer weights and
    the vertex pairs with a specified string weight
    (label).
  • The output of this kernel is two vertex pair
    lists - i.e., sets - that will be saved for use
    in the following kernel.
  • These two lists will be the start sets SI and SC,
    for the integer and character start sets
    respectively (see the sketch after this list).
  • The process of generating the two lists/sets will
    be timed.
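
A sketch of the integer half of Kernel 2 over the CSR arrays from the previous sketch: one scan finds the maximum weight, a second collects every vertex pair attaining it (start set SI). The string-label set SC would use an analogous equality scan. Illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

// Kernel 2 sketch: find the vertex pairs carrying the maximum integer weight.
final class Kernel2 {
    static List<int[]> maxWeightPairs(int[] rowStart, int[] adjacency, int[] adjWeight) {
        int max = Integer.MIN_VALUE;
        for (int w : adjWeight) max = Math.max(max, w);   // pass 1: maximum weight

        List<int[]> si = new ArrayList<>();               // pass 2: collect pairs
        for (int v = 0; v + 1 < rowStart.length; v++)
            for (int i = rowStart[v]; i < rowStart[v + 1]; i++)
                if (adjWeight[i] == max) si.add(new int[] { v, adjacency[i] });
        return si;   // start set SI: (from, to) pairs with maximal weight
    }
}
```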

22
Kernel 3: Extracting Subgraphs
  • Produce a series of subgraphs consisting of the
    vertices and edges on paths of length k from the
    vertex pairs in the start sets SI and SC.
  • A possible computational kernel for graph
    extraction is breadth-first search (sketched
    after this list).
  • The process of extracting the subgraphs will be
    timed.
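
A sketch of the breadth-first-search approach suggested above, as a depth-bounded BFS from a single start vertex; a full implementation would repeat this for each pair in SI and SC and would also record the traversed edges, which this sketch omits.

```java
import java.util.ArrayDeque;

// Kernel 3 sketch: mark all vertices within k hops of a start vertex.
final class Kernel3 {
    static boolean[] verticesWithinK(int[] rowStart, int[] adjacency, int start, int k) {
        int n = rowStart.length - 1;
        int[] depth = new int[n];
        java.util.Arrays.fill(depth, -1);
        boolean[] inSubgraph = new boolean[n];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        depth[start] = 0;
        inSubgraph[start] = true;
        queue.add(start);
        while (!queue.isEmpty()) {
            int v = queue.poll();
            if (depth[v] == k) continue;          // do not expand past length k
            for (int i = rowStart[v]; i < rowStart[v + 1]; i++) {
                int w = adjacency[i];
                if (depth[w] < 0) {               // first visit wins (BFS depth)
                    depth[w] = depth[v] + 1;
                    inSubgraph[w] = true;
                    queue.add(w);
                }
            }
        }
        return inSubgraph;
    }
}
```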

23
Kernel 4: Clique Extraction
  • Use a graph clustering algorithm to partition
    the vertices of the graph into subgraphs no
    larger than a maximum size, so as to minimize the
    number of edges that must be cut.
  • The kernel implementation should not use a
    priori knowledge of the details of the data
    generator or the statistics collected in the
    graph generation process.
  • Heuristic algorithms that determine the clusters
    in near-linear time, O(V), are permitted (a
    simplistic sketch follows).
  • The process of identifying the clusters and their
    interconnections will be timed.
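
For illustration only, a simplistic near-linear greedy heuristic that respects the interface (partition vertices into clusters of bounded size); this is not the algorithm of our implementation, nor one prescribed by the benchmark.

```java
// Kernel 4 sketch: grow a cluster by BFS until maxSize, then start a new one.
final class Kernel4 {
    static int[] greedyClusters(int[] rowStart, int[] adjacency, int maxSize) {
        int n = rowStart.length - 1;
        int[] cluster = new int[n];
        java.util.Arrays.fill(cluster, -1);   // -1 = not yet assigned
        java.util.ArrayDeque<Integer> queue = new java.util.ArrayDeque<>();
        int nextCluster = 0;
        for (int s = 0; s < n; s++) {
            if (cluster[s] >= 0) continue;     // already clustered
            int size = 1;
            cluster[s] = nextCluster;
            queue.clear();
            queue.add(s);
            // Absorb unassigned neighbors breadth-first, bounded by maxSize.
            while (!queue.isEmpty() && size < maxSize) {
                int v = queue.poll();
                for (int i = rowStart[v]; i < rowStart[v + 1]; i++) {
                    int w = adjacency[i];
                    if (cluster[w] < 0 && size < maxSize) {
                        cluster[w] = nextCluster;
                        size++;
                        queue.add(w);
                    }
                }
            }
            nextCluster++;
        }
        return cluster;   // cluster id per vertex; edges across ids are "cut"
    }
}
```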

24
X10 Design
  • Builds on an existing OO language (Java) to
    shorten the learning curve
  • Has new constructs for commonly used data access
    patterns (distributions)
  • Commonly used parallel programming environments
    today:
  • Message passing, no shared memory (MPI)
  • Shared memory, implicit thread control (OpenMP)
  • Shared memory, explicit thread control (threads)
  • Partitioned global (PG) shared memory, explicit
    thread control (UPC)
  • PG shared memory, implicit thread control (HPF)
  • Can these not be blended?

Note: PG shared memory can specify affinity to a thread.
25
X10 Design (cont'd)
  • Supports shared memory and allows local memory;
    shared memory is partitioned (places)
  • An operation can run at the place where its data
    resides (async), or data can be sent to a place
    to be evaluated (future); see the rough Java
    analogy after this list
  • Supports shorthand definitions for array regions
    and data distributions, plus extended iterators
    (foreach variants)
  • Generalized barriers (clocks) support more
    flexible operations (can operate/wait on multiple
    clocks); a variable can be frozen until a clock
    advance (clocked final)
  • Supports aggregate parallel operators (scan,
    reduction) in operator form (not as MPI-style
    calls)
  • Supports atomic sections (unconditional and
    conditional); conditional sections block on a
    logical condition (run when something is true)
  • Weak memory consistency model (enables better
    optimizations)
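
X10 code itself is not shown in this deck; as a rough analogy only, X10's async and future map loosely onto Java's executor framework, sketched below. Places, clocks, distributions, and atomic sections have no one-line Java counterpart, so this hints at the flavor rather than X10 semantics.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Rough Java analogy for two X10 ideas (this is NOT X10 syntax):
//  - "async S"  ~ submit a task that runs concurrently elsewhere
//  - "future E" ~ a handle whose value is forced on demand
final class X10Analogy {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);

        // ~ async: fire-and-forget activity (X10 joins these with "finish").
        pool.submit(() -> System.out.println("async-like activity"));

        // ~ future: evaluate an expression elsewhere, force with get().
        Future<Integer> f = pool.submit(() -> 6 * 7);
        System.out.println("future-like value = " + f.get());

        pool.shutdown();
    }
}
```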

26
Minimum Spanning Forest
[Performance chart comparing four implementations:
Boruvka; best sequential; Boruvka with our memory
management; our new SMP algorithm]