Title: All-Pairs Shortest Paths for Large Graphs on the GPU
1. All-Pairs Shortest Paths for Large Graphs on the GPU
- Gary J. Katz (1,2), Joe Kider (1)
- (1) University of Pennsylvania
- (2) Lockheed Martin IS&GS
2. What Will We Cover?
- Quick overview of Transitive Closure and All-Pairs Shortest Path
- Uses for Transitive Closure and All-Pairs
- GPUs: what are they and why do we care?
- The GPU problem with performing Transitive Closure and All-Pairs
- The solution: the Block Processing Method
- Memory formatting in global and shared memory
- Results
3. Previous Work
- A Blocked All-Pairs Shortest-Paths Algorithm (Venkataraman et al.)
- Parallel FPGA-based All-Pairs Shortest Path in a Directed Graph (Bondhugula et al.)
- Accelerating Large Graph Algorithms on the GPU Using CUDA (Harish)
4. NVIDIA GPU Architecture
- Issues:
  - No access to main memory
  - The programmer needs to explicitly manage the L1-like shared cache
  - Cannot synchronize across multiprocessors
  - Compute cores are not as smart as CPU cores and do not handle if statements (divergent branches) well
5. Background
- A graph G with vertices V and edges E: G = (V, E)
- For every pair of vertices u, v in V, a shortest path from u to v, where the weight of a path is the sum of the weights of its edges
6. Adjacency Matrix
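The slide's matrix figure did not survive extraction; as a minimal illustration in plain C, with a made-up 4-vertex graph (not the slide's example), an adjacency matrix looks like:

```c
/* Adjacency-matrix representation of a directed graph:
 * A[i][j] == 1 iff there is an edge from vertex i to vertex j.
 * This hypothetical example encodes the edges 0->1, 1->2, 3->0. */
enum { N = 4 };

int A[N][N] = {
    {0, 1, 0, 0},   /* vertex 0: edge to 1 */
    {0, 0, 1, 0},   /* vertex 1: edge to 2 */
    {0, 0, 0, 0},   /* vertex 2: no outgoing edges */
    {1, 0, 0, 0},   /* vertex 3: edge to 0 */
};
```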
7. Quick Overview of Transitive Closure
- The transitive closure of G is defined as the graph G* = (V, E*), where E* = {(i, j) : there is a path from vertex i to vertex j in G}. (Introduction to Algorithms, T. Cormen)

Simply stated: the transitive closure of a graph is the list of edges for any vertices that can reach each other.
(Figure: an 8-vertex example graph and its closure.)
Edges of G: (1,5) (2,1) (4,2) (4,3) (6,3) (8,6)
Edges of G*: (1,5) (2,1) (4,2) (4,3) (6,3) (8,6) (2,5) (8,3) (7,6) (7,3)
8. Warshall's Algorithm: Transitive Closure
- Computes the transitive closure of a relation
- (Alternatively: all paths in a directed graph)
- (Figure: an example adjacency matrix and its transitive closure.)
Design and Analysis of Algorithms - Chapter 8
9. Warshall's Algorithm
- Main idea: a path exists between two vertices i, j iff
  - there is an edge from i to j, or
  - there is a path from i to j going through vertex 1, or
  - there is a path from i to j going through vertices 1 and/or 2, or
  - ...
  - there is a path from i to j going through vertices 1, 2, ..., and/or k, or
  - ...
  - there is a path from i to j going through any of the other vertices
10. Warshall's Algorithm
- Idea: dynamic programming
- Let V = {1, ..., n} and, for k <= n, V_k = {1, ..., k}
- For any pair of vertices i, j in V, consider all paths from i to j whose intermediate vertices are all drawn from V_k: P_ij(k) = {p1, p2, ...}; if P_ij(k) is non-empty, then R(k)[i, j] = 1
- For any pair of vertices i, j, the answer is R(n)[i, j]; that is, R(n) is the transitive closure
- Starting with R(0) = A, the adjacency matrix, how do we get R(1) -> ... -> R(k-1) -> R(k) -> ... -> R(n)?
(Figure: paths p1, p2 from i to j with intermediate vertices in V_k.)
11. Warshall's Algorithm
- Idea: dynamic programming
- p in P_ij(k): p is a path from i to j with all intermediate vertices in V_k
- If k is not on p, then p is also a path from i to j with all intermediate vertices in V_{k-1}: p in P_ij(k-1)
(Figure: a path p from i to j that avoids vertex k, so its intermediate vertices lie in V_{k-1}.)
12. Warshall's Algorithm
- Idea: dynamic programming
- p in P_ij(k): p is a path from i to j with all intermediate vertices in V_k
- If k is on p, then we break p into p1 and p2, where
  - p1 is a path from i to k with all intermediate vertices in V_{k-1}
  - p2 is a path from k to j with all intermediate vertices in V_{k-1}
(Figure: p split at vertex k into p1 (i to k) and p2 (k to j).)
13. Warshall's Algorithm
- In the kth stage, determine whether a path exists between two vertices i, j using just vertices among 1, ..., k:
  R(k)[i, j] = R(k-1)[i, j]                          (path using just 1, ..., k-1)
               or (R(k-1)[i, k] and R(k-1)[k, j])    (path from i to k and from k to j, using just 1, ..., k-1)
(Figure: the kth stage considers paths from i to j through vertex k.)
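The recurrence above translates directly into three nested loops; a minimal C sketch (the function name and fixed size N are illustrative, not from the slides):

```c
/* Warshall's algorithm: on return, R[i][j] == 1 iff a path from i
 * to j exists. Stage k admits vertex k as an intermediate, so the
 * in-place update R[i][j] = R[i][j] || (R[i][k] && R[k][j])
 * implements R(k)[i,j] = R(k-1)[i,j] or (R(k-1)[i,k] and R(k-1)[k,j]). */
enum { N = 4 };

void transitive_closure(int R[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                R[i][j] = R[i][j] || (R[i][k] && R[k][j]);
}
```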
14. Quick Overview of All-Pairs Shortest Path
- The all-pairs shortest path of G is defined, for every pair of vertices u, v in V, as the shortest (least-weight) path from u to v, where the weight of a path is the sum of the weights of its constituent edges. (Introduction to Algorithms, T. Cormen)

Simply stated: the all-pairs shortest path of a graph is the optimal list of vertices connecting any two vertices that can reach each other.
(Figure: the same 8-vertex example graph.)
Paths: 1->5, 2->1, 4->2, 4->3, 6->3, 8->6, 2->1->5, 8->6->3, 7->8->6, 7->8->6->3
15. Uses for Transitive Closure and All-Pairs
16. Floyd-Warshall Algorithm
(Figure: the example graph, swept one vertex at a time.)
Pass 1 finds all connections that are connected through vertex 1.
Pass 6 finds all connections that are connected through vertex 6.
Pass 8 finds all connections that are connected through vertex 8.
Running time: O(V^3)
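The pass-by-pass sweep above is Floyd-Warshall in its weighted form; a CPU reference version in C (the INF sentinel and fixed size are illustrative):

```c
/* Floyd-Warshall all-pairs shortest paths. Pass k admits vertex k
 * as an intermediate: W[i][j] = min(W[i][j], W[i][k] + W[k][j]).
 * Three nested loops over V vertices give the O(V^3) running time. */
enum { N = 4 };
#define INF 1000000   /* "no edge"; small enough that INF + INF fits in an int */

void floyd_warshall(int W[N][N]) {
    for (int k = 0; k < N; k++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (W[i][k] + W[k][j] < W[i][j])
                    W[i][j] = W[i][k] + W[k][j];
}
```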
17. Parallel Floyd-Warshall
There is a shortcoming to this algorithm, though: each processing element needs global access to memory. This can be an issue for GPUs.
18. The Question
- How do we calculate the transitive closure on the GPU so as to
  - take advantage of shared memory, and
  - accommodate data sizes that do not fit in memory?
Can we perform partial processing of the data?
19. Block Processing of Floyd-Warshall
(Figure: the data matrix and an organizational structure for block processing.)
20. Block Processing of Floyd-Warshall
21. Block Processing of Floyd-Warshall
N = 4
22. Block Processing of Floyd-Warshall
K = 1:
  (i,j) updated from (i,k), (k,j)
  (5,1)  <-  (5,1), (1,1)
  (8,1)  <-  (8,1), (1,1)
  (5,4)  <-  (5,1), (1,4)
  (8,4)  <-  (8,1), (1,4)
K = 4:
  (5,1)  <-  (5,4), (4,1)
  (8,1)  <-  (8,4), (4,1)
  (5,4)  <-  (5,4), (4,4)
  (8,4)  <-  (8,4), (4,4)
W[i,j] = W[i,j] or (W[i,k] and W[k,j])
For each pass k, the cells retrieved must already have been processed to at least pass k-1.
23. Block Processing of Floyd-Warshall
Putting it all together: processing k = 1-4
  Pass 1: i = 1-4, j = 1-4
  Pass 2: i = 5-8, j = 1-4 and i = 1-4, j = 5-8
  Pass 3: i = 5-8, j = 5-8
W[i,j] = W[i,j] or (W[i,k] and W[k,j])
24. Block Processing of Floyd-Warshall
Computing k = 5-8 (N = 8)
Range: i = 5-8, j = 5-8, k = 5-8
25. Block Processing of Floyd-Warshall
Putting it all together: processing k = 5-8
  Pass 1: i = 5-8, j = 5-8
  Pass 2: i = 5-8, j = 1-4 and i = 1-4, j = 5-8
  Pass 3: i = 1-4, j = 1-4
The transitive closure is now complete for k = 1-8.
W[i,j] = W[i,j] or (W[i,k] and W[k,j])
26-33. Increasing the Number of Blocks
- Primary blocks are along the diagonal
- Secondary blocks are the rows and columns of the primary block
- Tertiary blocks are all remaining blocks
(Figures: passes 1 through 8 sweep the primary block down the diagonal.)
In total: N passes, with 3 sub-passes per pass
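The primary/secondary/tertiary ordering can be sketched on the CPU as follows (the tile helper and names are illustrative; the paper's GPU version runs each tile as a CUDA block staged through shared memory):

```c
/* Blocked Floyd-Warshall over an n x n matrix split into B x B
 * tiles. For each diagonal tile t (the primary block):
 *   sub-pass 1: relax the primary block against itself;
 *   sub-pass 2: relax the secondary blocks (tile-row/column of t);
 *   sub-pass 3: relax all remaining (tertiary) blocks.
 * That is n/B passes with 3 sub-passes each, as on the slides. */
#define INF 1000000   /* "no edge"; INF + INF still fits in an int */

static void relax_tile(int n, int *W, int bi, int bj, int bk, int B) {
    /* Floyd-Warshall update for tile (bi,bj), drawing intermediate
     * vertices k from tile bk. */
    for (int k = bk * B; k < (bk + 1) * B; k++)
        for (int i = bi * B; i < (bi + 1) * B; i++)
            for (int j = bj * B; j < (bj + 1) * B; j++)
                if (W[i * n + k] + W[k * n + j] < W[i * n + j])
                    W[i * n + j] = W[i * n + k] + W[k * n + j];
}

void blocked_floyd_warshall(int n, int *W, int B) {
    int T = n / B;                        /* tiles per side */
    for (int t = 0; t < T; t++) {
        relax_tile(n, W, t, t, t, B);     /* sub-pass 1: primary */
        for (int m = 0; m < T; m++) {     /* sub-pass 2: secondary */
            if (m == t) continue;
            relax_tile(n, W, t, m, t, B); /* tile-row of t */
            relax_tile(n, W, m, t, t, B); /* tile-column of t */
        }
        for (int i = 0; i < T; i++)       /* sub-pass 3: tertiary */
            for (int j = 0; j < T; j++)
                if (i != t && j != t)
                    relax_tile(n, W, i, j, t, B);
    }
}
```

Each sub-pass only reads tiles already processed up to the current pass, which is exactly the "processed to at least k-1" constraint from slide 22.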
34. Running It on the GPU
- Using CUDA
  - Written by NVIDIA to access the GPU as a parallel processor
  - No need to use a graphics API
- Memory indexing: CUDA provides
  - Grid dimension
  - Block dimension
  - Block id
  - Thread id
(Figure: grid dimension, block dimension, block id, and thread id in the CUDA thread hierarchy.)
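From those four values each thread typically derives a flat element index; a plain-C rendition of the standard 1-D mapping (in CUDA itself the built-ins blockIdx.x, blockDim.x, and threadIdx.x supply the arguments):

```c
/* Flat global index of a thread in a 1-D CUDA grid: each block
 * holds blockDim threads, so a thread's position is its block's
 * offset plus its index within the block. */
int global_thread_index(int blockId, int blockDim, int threadId) {
    return blockId * blockDim + threadId;
}
```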
35. Partial Memory Indexing
(Figure: partial memory indexing for the sub-passes, e.g. SP2 and SP3, over index ranges 0 to N-1.)
36. Memory Format for the All-Pairs Solution
- All-Pairs requires twice the memory footprint of Transitive Closure
(Figure: the example graph alongside the solution matrix, which stores for each pair both a connecting node and a distance: N entries of each per row, 2N values in total, from which the shortest path is reconstructed.)
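A plain-C sketch of why the footprint doubles: transitive closure needs one reachability flag per pair, while the all-pairs solution stores both a distance and a connecting node per pair (the struct and field names are illustrative, not from the paper):

```c
/* One entry of the all-pairs solution matrix. Recording the
 * connecting (intermediate) node alongside the distance lets the
 * full shortest path be reconstructed pair by pair, at twice the
 * per-cell storage of a closure flag held in an int. */
typedef struct {
    int distance;         /* least path weight from i to j */
    int connecting_node;  /* intermediate vertex on that path, or -1 */
} apsp_entry;
```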
37. Results
Shared-memory cache-efficient GPU implementation compared to a standard GPU implementation
38. Results
Shared-memory cache-efficient GPU implementation compared to a standard CPU implementation and a cache-efficient CPU implementation
39. Results
Shared-memory cache-efficient GPU implementation compared to the best variant of Han et al.'s tuned code
40. Conclusion
- Advantages of the algorithm:
  - Relatively easy to implement
  - Cheap hardware
  - Much faster than the standard CPU version
  - Works for any data size
Special thanks to NVIDIA for supporting our research.
41. Backup
42. CUDA
- Compute Unified Device Architecture
- An extension of C
- Automatically creates thousands of threads to run on a graphics card
- Can be used to create non-graphical applications
- Pros:
  - Lets the user design algorithms that will run in parallel
  - Easy to learn: an extension of C
  - Has a CPU version, implemented by kicking off threads
- Cons:
  - Low-level, C-like language
  - Requires an understanding of the GPU architecture to exploit fully