High-Performance Computation for Path Problems in Graphs
1

High-Performance Computation for Path Problems in Graphs
Aydin Buluç and John R. Gilbert, University of California, Santa Barbara
SIAM Conf. on Applications of Dynamical Systems, May 20, 2009
Support: DOE Office of Science, MIT Lincoln Labs, NSF, DARPA, SGI
2
Horizontal-vertical decomposition [Mezic et al.]
Slide courtesy of Igor Mezic group, UCSB
3
Combinatorial Scientific Computing
  • "I observed that most of the coefficients in our
    matrices were zero; i.e., the nonzeros were
    'sparse' in the matrix, and that typically the
    triangular matrices associated with the forward
    and back solution provided by Gaussian
    elimination would remain sparse if pivot elements
    were chosen with care."

- Harry Markowitz, describing the 1950s work on
portfolio theory that won the 1990 Nobel Prize
for Economics
4
A few directions in CSC
  • Hybrid discrete & continuous computations
  • Multiscale combinatorial computation
  • Analysis, management, and propagation of
    uncertainty
  • Economic & game-theoretic considerations
  • Computational biology & bioinformatics
  • Computational ecology
  • Knowledge discovery & machine learning
  • Relationship analysis
  • Web search and information retrieval
  • Sparse matrix methods
  • Geometric modeling
  • . . .

5
The Parallel Computing Challenge
Two Nvidia 8800 GPUs: > 1 TFLOPS
LANL / IBM Roadrunner: > 1 PFLOPS
Intel 80-core chip: > 1 TFLOPS
  • Parallelism is no longer optional in every part of a computation.

6
The Parallel Computing Challenge
  • Efficient sequential algorithms for graph-theoretic
    problems often follow long chains of dependencies
  • Several parallelization strategies, but no silver bullet:
    • Partitioning (e.g. for preconditioning PDE solvers)
    • Pointer-jumping (e.g. for connected components)
    • Sometimes it just depends on what the input looks like
  • A few simple examples . . .

7
Sample kernel: sort a logically triangular matrix
[Figure: original matrix and its permutation to unit upper triangular form]
  • Used in sparse linear solvers (e.g. Matlab's)
  • Simple kernel, abstracts many other graph
    operations (see next)
  • Sequential: linear time, simple greedy
    topological sort (see the sketch after this list)
  • Parallel: no known method is efficient in both
    work and span; one parallel step per level;
    arbitrarily long dependent chains
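A minimal SciPy sketch of the greedy sequential kernel, assuming A is a sparse matrix with no explicitly stored zeros (the function and variable names are ours, not from the talk): repeatedly place a row that has exactly one remaining nonzero at the last open position, retire its column, and repeat.

```python
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

def sort_logically_triangular(A):
    """Greedy 'topological sort': find permutations p, q so that A[p][:, q]
    is upper triangular with a nonzero diagonal (sketch; linear in nnz(A))."""
    R, C = csr_matrix(A), csc_matrix(A)    # row-wise and column-wise access
    n = R.shape[0]
    row_nnz = np.diff(R.indptr).copy()     # remaining nonzeros in each row
    col_alive = np.ones(n, dtype=bool)
    p, q = np.empty(n, dtype=int), np.empty(n, dtype=int)
    stack = [i for i in range(n) if row_nnz[i] == 1]
    for pos in range(n - 1, -1, -1):
        while stack and row_nnz[stack[-1]] != 1:
            stack.pop()                    # discard stale stack entries
        if not stack:
            raise ValueError("matrix is not permutable to triangular form")
        i = stack.pop()
        cols = R.indices[R.indptr[i]:R.indptr[i + 1]]
        j = next(c for c in cols if col_alive[c])   # the single live column
        p[pos], q[pos] = i, j              # place row i / column j last
        col_alive[j] = False
        row_nnz[i] = 0
        for r in C.indices[C.indptr[j]:C.indptr[j + 1]]:
            if row_nnz[r] > 0:             # retiring column j shortens row r
                row_nnz[r] -= 1
                if row_nnz[r] == 1:
                    stack.append(r)
    return p, q
```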

8
Bipartite matching
[Figure: bipartite graph between two copies of the vertex set {1, ..., 5}, and its matrix A]
  • Perfect matching: a set of edges that hits each
    vertex exactly once
  • Matrix permutation to place nonzeros (or heavy
    elements) on the diagonal
  • Efficient sequential algorithms based on
    augmenting paths (see the sketch below)
  • No known work/span-efficient parallel algorithms
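For the unweighted case, SciPy's maximum_bipartite_matching (an augmenting-path routine) can be used exactly as described: find a perfect matching, then use it as a row permutation that puts a nonzero on every diagonal entry. A small sketch on a hypothetical 5x5 pattern:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import maximum_bipartite_matching

A = csr_matrix(np.array([[1, 0, 0, 1, 0],
                         [0, 1, 1, 0, 0],
                         [1, 0, 0, 0, 1],
                         [0, 0, 1, 1, 0],
                         [0, 1, 0, 0, 1]]))

perm = maximum_bipartite_matching(A, perm_type='row')  # perm[j] = row matched to column j
assert (perm >= 0).all(), "no perfect matching exists"
print(A[perm].diagonal())   # row-permuted matrix has an all-nonzero diagonal
```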

9
Strongly connected components
[Figure: directed graph G(A) and the symmetric permutation PAP^T in block triangular form]
  • Symmetric permutation to block triangular form
  • Diagonal blocks are strong Hall (irreducible /
    strongly connected)
  • Sequential: linear time by depth-first search
    [Tarjan] (see the sketch below)
  • Parallel: divide & conquer; work and span depend
    on the input [Fleischer, Hendrickson, Pinar]
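In an array-based setting the sequential step is a single library call: SciPy's connected_components with connection='strong' returns the component label of every vertex in linear time, and grouping vertices by label produces the symmetric permutation. A minimal sketch on a hypothetical 5-vertex graph (SciPy does not promise the labels come in a topological order of the condensation, so this only groups each component into a contiguous diagonal block):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

# Hypothetical digraph: {0, 1, 2} form one strong component, {3, 4} another
A = csr_matrix(np.array([[0, 1, 0, 0, 0],
                         [0, 0, 1, 0, 1],
                         [1, 0, 0, 0, 0],
                         [0, 0, 0, 0, 1],
                         [0, 0, 0, 1, 0]]))

ncomp, labels = connected_components(A, directed=True, connection='strong')
p = np.argsort(labels, kind='stable')   # group the vertices of each component
print(ncomp, labels)
print(A[p][:, p].toarray())             # one diagonal block per component
```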

10
Horizontal-vertical decomposition
  • Defined and studied by Mezic et al. in a
    dynamical systems context
  • Strongly connected components, ordered by levels
    of the DAG
  • Efficient linear-time sequential algorithms (see
    the sketch below)
  • No work/span-efficient parallel algorithms known
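A sequential sketch, assuming a SciPy sparse adjacency matrix A: take the strong components, build the acyclic condensation as the sparse triple product S^T A S (S is the vertex-to-component indicator; compare slide 31), and give each component the length of the longest path reaching it in that DAG. The function and variable names are ours:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def hv_decomposition(A):
    """Return (component label, DAG level) for each vertex of the digraph
    whose sparse adjacency matrix is A (sequential sketch)."""
    n = A.shape[0]
    ncomp, labels = connected_components(A, directed=True, connection='strong')

    # Vertex-to-component indicator S; the condensation is S^T A S
    # with self-loops (the diagonal) removed.
    S = csr_matrix((np.ones(n), (np.arange(n), labels)), shape=(n, ncomp))
    C = (S.T @ A @ S).tolil()
    C.setdiag(0)
    C = C.tocsr()
    C.eliminate_zeros()

    # Level of a component = longest path reaching it in the condensation,
    # computed by a Kahn-style topological sweep of the DAG.
    indeg = np.diff(C.tocsc().indptr)           # in-degree of each component
    level = np.zeros(ncomp, dtype=int)
    ready = [c for c in range(ncomp) if indeg[c] == 0]
    while ready:
        u = ready.pop()
        for v in C.indices[C.indptr[u]:C.indptr[u + 1]]:
            level[v] = max(level[v], level[u] + 1)
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return labels, level[labels]

# Sorting vertices by (level, component label) gives the H-V ordering.
```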

11
Strong components of 1M-vertex RMAT graph
12
Dulmage-Mendelsohn decomposition
13
Applications of D-M decomposition
  • Strongly connected components of directed graphs
  • Connected components of undirected graphs
  • Permutation to block triangular form for Ax = b
  • Minimum-size vertex cover of bipartite graphs
  • Extracting vertex separators from edge cuts for
    arbitrary graphs
  • Nonzero structure prediction for sparse matrix
    factorizations

14
Strong Hall components are independent of
choice of matching
15
The Primitives Challenge
  • By analogy to numerical linear algebra. . .
  • What should the combinatorial BLAS look like?

16
Primitives for HPC graph programming
  • Visitor-based, multithreaded [Berry, Gregor,
    Hendrickson, Lumsdaine]
    • search templates natural for many algorithms
    • relatively simple load balancing
    • complex thread interactions, race conditions
    • unclear how applicable to standard architectures
  • Array-based, data-parallel [G, Kepner,
    Reinhardt, Robinson, Shah]
    • relatively simple control structure
    • user-friendly interface
    • some algorithms hard to express naturally
    • load balancing not so easy
  • Scan-based, vectorized [Blelloch]
  • We don't know the right set of primitives yet!

17
Array-based graph algorithms study [Kepner, Fineman, Kahn, Robinson]
18
Multiple-source breadth-first search
[Figure: sparse adjacency matrix A^T and sparse matrix X with one column per BFS source]
19
Multiple-source breadth-first search
[Figure: multiplying A^T by the frontier matrix X]
20
Multiple-source breadth-first search
[Figure: the sparse product A^T X is the next frontier of every search]
  • Sparse array representation => space efficient
  • Sparse matrix-matrix multiplication => work efficient
  • Span and load balance depend on the
    matrix-multiplication implementation (see the sketch below)
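A toy SciPy sketch of the idea (graph and sources are hypothetical): each column of X carries one frontier, and a single sparse product A^T X advances every search by one level. For clarity the frontier and visited masks are dense Boolean arrays here; the slide's space argument relies on keeping X sparse as well.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 7-vertex directed graph
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)]
r, c = zip(*edges)
A = csr_matrix((np.ones(len(edges), dtype=np.int8), (r, c)), shape=(7, 7))

# One column per BFS source: searches start at vertices 0 and 4
X = csr_matrix(([1, 1], ([0, 4], [0, 1])), shape=(7, 2), dtype=np.int8)

frontier = X.toarray().astype(bool)
visited = frontier.copy()
while frontier.any():
    # One level of every search at once: the sparse product A^T X
    step = (A.T @ csr_matrix(frontier.astype(np.int8))).toarray() > 0
    frontier = step & ~visited
    visited |= frontier
print(visited.astype(int))   # visited[v, s] == 1 iff v is reachable from source s
```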

21
Matrices over semirings
  • Matrix multiplication C = AB (or matrix-vector):
    C(i,j) = A(i,1)×B(1,j) + A(i,2)×B(2,j) + ... + A(i,n)×B(n,j)
  • Replace the scalar operations × and + by ⊗ and ⊕:
    • ⊗ is associative and distributes over ⊕; its identity is 1
    • ⊕ is associative and commutative; its identity is 0,
      which annihilates under ⊗
  • Then C(i,j) = A(i,1)⊗B(1,j) ⊕ A(i,2)⊗B(2,j) ⊕ ... ⊕ A(i,n)⊗B(n,j)
  • Examples: (×, +); (and, or); (+, min); . . .
  • Same data reference pattern and control flow (see the sketch below)
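A dense NumPy sketch of that shared control flow with pluggable ⊕ and ⊗ (the function name and defaults are ours): the defaults give the (min, +) semiring, whose "0" is +inf and whose "1" is 0; passing np.logical_or / np.logical_and with zero=False gives Boolean reachability instead.

```python
import numpy as np

def semiring_matmul(A, B, add=np.minimum, mul=np.add, zero=np.inf):
    # C(i,j) = (+)_t  A(i,t) (x) B(t,j),  with (+) = add and (x) = mul
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.full((n, m), zero)
    for t in range(k):
        C = add(C, mul.outer(A[:, t], B[t, :]))   # rank-1 semiring update
    return C

# (min, +) example: least costs over paths with at most two edges
INF = np.inf
D = np.array([[0., 3., INF],
              [INF, 0., 1.],
              [2., INF, 0.]])
print(semiring_matmul(D, D))
```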

22
SpGEMM: Sparse Matrix × Sparse Matrix [Buluç, G]
  • Shortest path calculations (APSP)
  • Betweenness centrality
  • BFS from multiple source vertices
  • Subgraph / submatrix indexing
  • Graph contraction
  • Cycle detection
  • Multigrid interpolation & restriction
  • Colored intersection searching
  • Applying constraints in finite element modeling
  • Context-free parsing

23
Distributed-memory parallel sparse matrix
multiplication
  • 2D block layout
  • Outer product formulation
  • Sequential hypersparse kernel
  • Asynchronous MPI-2 implementation
  • Experiments: TACC Lonestar cluster
  • Good scaling to 256 processors

[Figure: time vs. number of cores for a 1M-vertex RMAT graph]
24
All-Pairs Shortest Paths
  • Directed graph with costs on edges
  • Find least-cost paths between all reachable
    vertex pairs
  • Several classical algorithms with (see the sketch below):
    • Work: same as matrix multiplication
    • Span: O(log² n)
  • Case study of an implementation on a multicore
    architecture: the graphics processing unit (GPU)
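One such classical scheme is repeated (min, +) squaring: about log2(n) semiring squarings of the cost matrix, each costing one matrix multiplication of work and, with parallel reductions, about log n span, which is where the log² n span comes from. A dense NumPy sketch on a hypothetical 4-vertex graph (the broadcast form below uses cubic memory and is for illustration only):

```python
import numpy as np

def apsp_by_squaring(D0):
    """All-pairs shortest paths by repeated (min, +) squaring (sketch).
    D0 holds edge costs, with 0 on the diagonal and np.inf for missing edges."""
    n = D0.shape[0]
    D = D0.copy()
    steps = 1
    while steps < n - 1:
        # One (min, +) squaring: D[i, j] = min_k (D[i, k] + D[k, j])
        D = np.min(D[:, :, None] + D[None, :, :], axis=1)
        steps *= 2
    return D

INF = np.inf
D0 = np.array([[0., 3., INF, 7.],
               [8., 0., 2., INF],
               [5., INF, 0., 1.],
               [2., INF, INF, 0.]])
print(apsp_by_squaring(D0))
```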

25
GPU characteristics
  • Powerful: two Nvidia 8800s > 1 TFLOPS
  • Inexpensive: $500 each

But
  • Difficult programming model
  • One instruction stream drives 8 arithmetic units
  • Performance is counterintuitive and fragile
  • Memory access pattern has subtle effects on cost
  • Extremely easy to underutilize the device
  • Doing it wrong easily costs 100x in time

26
Recursive All-Pairs Shortest Paths
  • Based on R-Kleene algorithm
  • Well suited for GPU architecture
  • Fast matrix-multiply kernel
  • In-place computation => low memory bandwidth
  • Few, large MatMul calls => low GPU dispatch overhead
  • Recursion stack on host CPU, not on multicore
    GPU
  • Careful tuning of GPU code

Block partition [ A  B ; C  D ]; ⊕ is min, ⊗ is add:
A = A*  (recursive call)    B = AB    C = CA    D = D ⊕ CB
D = D*  (recursive call)    B = BD    C = DC    A = A ⊕ BC
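A dense NumPy sketch of that recursion over the (min, +) semiring, assuming a zero diagonal, np.inf for missing edges and no negative cycles (the function names are ours; the GPU kernel, block sizes and memory layout of the actual implementation are not modeled):

```python
import numpy as np

def minplus(X, Y):
    """(min, +) product of two dense blocks."""
    return np.min(X[:, :, None] + Y[None, :, :], axis=1)

def r_kleene(D):
    """Recursive in-place APSP in the spirit of R-Kleene; D is overwritten
    with shortest-path costs."""
    n = D.shape[0]
    if n <= 1:
        return D
    h = n // 2
    A, B = D[:h, :h], D[:h, h:]             # views into the four blocks of D
    C, Dn = D[h:, :h], D[h:, h:]
    r_kleene(A)                             # A = A*
    B[:] = minplus(A, B)                    # B = AB
    C[:] = minplus(C, A)                    # C = CA
    Dn[:] = np.minimum(Dn, minplus(C, B))   # D = D (+) CB
    r_kleene(Dn)                            # D = D*
    B[:] = minplus(B, Dn)                   # B = BD
    C[:] = minplus(Dn, C)                   # C = DC
    A[:] = np.minimum(A, minplus(B, C))     # A = A (+) BC
    return D

D = np.array([[0., 3., np.inf, 7.],
              [8., 0., 2., np.inf],
              [5., np.inf, 0., 1.],
              [2., np.inf, np.inf, 0.]])
print(r_kleene(D))
```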
27
Execution of Recursive APSP
28
APSP: experiments and observations
Speedup of the recursive APSP on a 128-core Nvidia 8800, relative to:
  • 1-core CPU: 120x - 480x
  • 16-core CPU: 17x - 45x
  • Iterative algorithm on the 128-core GPU: 40x - 680x
  • Multiple-source SSSP (MSSSP) on the 128-core GPU: 3x
[Figure: time vs. matrix dimension]
  • Conclusions
  • High performance is achievable but not simple
  • Carefully chosen and optimized primitives will
    be key

29
H-V decomposition
  • A span-efficient, but not work-efficient,
    method for H-V decomposition uses APSP to
    determine reachability

30
Reachability: transitive closure
  • APSP => transitive closure of the adjacency matrix
  • Strong components identified by symmetric
    nonzeros (see the sketch below)
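A small dense Boolean sketch of both points (the graph is hypothetical): repeated (or, and) squaring gives the transitive closure R, and R & R.T is nonzero exactly on pairs of vertices that reach each other, i.e. pairs in the same strong component.

```python
import numpy as np

def transitive_closure(A):
    """Boolean transitive closure by repeated (or, and) squaring (sketch)."""
    R = A | np.eye(A.shape[0], dtype=bool)   # every vertex reaches itself
    steps = 1
    while steps < A.shape[0]:
        R = (R[:, :, None] & R[None, :, :]).any(axis=1)   # one Boolean squaring
        steps *= 2
    return R

A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 0, 0]], dtype=bool)

R = transitive_closure(A)
print((R & R.T).astype(int))   # block of ones on each strong component
```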

31
H-V structure: acyclic condensation
  • Acyclic condensation is a sparse matrix-matrix
    product
  • Levels identified by APSP for longest paths
  • Practically speaking, a parallel method would
    compromise between work and span efficiency

32
Remarks
  • Combinatorial algorithms are pervasive in
    scientific computing and will become more so.
  • Path computations on graphs are powerful tools,
    but efficiency is a challenge on parallel architectures.
  • Carefully chosen and implemented primitive
    operations are key.
  • Lots of exciting opportunities for research!