Title: Graph Algorithms
1 Graph Algorithms
- Carl Tropper
- Department of Computer Science
- McGill University
2 Definitions
- An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges.
- An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
- In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v.
- A path from a vertex v to a vertex u is a sequence ⟨v0, v1, v2, …, vk⟩ of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1, …, k-1.
- The length of a path is defined as the number of edges in the path.
3 Directed and undirected (figure)
4 More definitions
- An undirected graph is connected if every pair of vertices is connected by a path.
- A forest is an acyclic graph, and a tree is a connected acyclic graph.
- A graph that has weights associated with each edge is called a weighted graph.
5 How to represent a graph 1
- An adjacency matrix represents an undirected graph
- Θ(n²) space to store the matrix
6 How to represent a graph 2
- An adjacency list: an ordered list of nodes, with the weights stored inside the nodes
- Θ(|E|) space to store the list (both representations are sketched below)
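As a concrete illustration (not part of the original slides), here is a minimal Python sketch of both representations for the same small weighted graph; the vertex numbering and weights are invented for the example.

```python
# Two ways to store the same weighted undirected graph on 4 vertices.
INF = float("inf")  # marks "no edge" in the matrix form

# Adjacency matrix: Theta(n^2) space regardless of how many edges exist.
adj_matrix = [
    [0,   1,   INF, 4],
    [1,   0,   2,   INF],
    [INF, 2,   0,   3],
    [4,   INF, 3,   0],
]

# Adjacency list: Theta(n + |E|) space; each entry is (neighbor, weight).
adj_list = {
    0: [(1, 1), (3, 4)],
    1: [(0, 1), (2, 2)],
    2: [(1, 2), (3, 3)],
    3: [(0, 4), (2, 3)],
}
```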
7 Which is better?
- If the graph is sparse, use pointers (the adjacency list)
- If it is dense, then go with the adjacency matrix
8 MST: Prim's Algorithm
- A spanning tree of an undirected graph G is a subgraph of G which is (1) a tree and (2) contains all the vertices of G.
- In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph.
- A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
9 MST on the right (figure)
10 Prim
- Prim's is a greedy algorithm
- It is iterative: the tree grows one vertex at a time
- Start by selecting an arbitrary vertex and include it in the current MST (the root)
- Grow the current MST by inserting into it the vertex closest to one of the vertices already in the current MST, i.e. the vertex which (1) is outside the MST and (2) adds the smallest edge weight to the MST (see the sketch below)
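A minimal serial sketch of the array-based version analyzed on the next slides (Python, using the adjacency-matrix representation from slide 5; illustrative, not taken verbatim from the course):

```python
def prim_mst(adj_matrix):
    """Array-based Prim: d[v] holds the weight of the cheapest edge
    connecting v to the current tree; each round adds the closest
    outside vertex. This is the Theta(n^2) formulation."""
    n = len(adj_matrix)
    INF = float("inf")
    d = [INF] * n
    parent = [-1] * n          # tree edge that realizes d[v]
    in_tree = [False] * n
    d[0] = 0                   # arbitrary root
    for _ in range(n):
        # Select the closest vertex not yet in the tree (O(n) scan).
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        # Update candidate edges for vertices adjacent to u (O(n) scan).
        for v in range(n):
            if not in_tree[v] and adj_matrix[u][v] < d[v]:
                d[v] = adj_matrix[u][v]
                parent[v] = u
    return [(parent[v], v, d[v]) for v in range(n) if parent[v] != -1]
```

Both inner steps are O(n) scans executed n times, which is where the Θ(n²) bound on slide 12 comes from.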
11 Prim (figure)
12 Formal Prim
Lines 10-13 of the pseudocode are executed n-1 times; lines 12 and 13 take O(n) time per iteration. Prim's algorithm is therefore Θ(n²).
13 Parallel Prim
- Let p be the number of processes, and let n be the number of vertices.
- The adjacency matrix is partitioned in 1-D block fashion, with the distance vector d partitioned accordingly.
- In each step, a processor selects the locally closest node, followed by a global (all-to-one) reduction to select the globally closest node. One processor (say P0) contains the MST.
- This node is inserted into the MST, and the choice is broadcast (one-to-all) to all processors.
- Each processor updates its part of the d vector locally (these steps are sketched in the simulation below).
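The following serial simulation (purely illustrative, with invented names; a real implementation would use MPI-style reduce/broadcast primitives) walks through these four steps under the 1-D block partition:

```python
def simulated_parallel_prim(adj_matrix, p):
    """Simulate parallel Prim with the d vector split into p blocks.
    "Process" pid owns vertices [pid*n/p, (pid+1)*n/p); assumes p divides n."""
    n = len(adj_matrix)
    block = n // p
    INF = float("inf")
    d = [INF] * n
    in_tree = [False] * n
    d[0] = 0                                   # arbitrary root
    for _ in range(n):
        # Step 1: each process finds its locally closest outside vertex.
        local_mins = []
        for pid in range(p):
            cand = [v for v in range(pid * block, (pid + 1) * block)
                    if not in_tree[v]]
            if cand:
                v = min(cand, key=lambda v: d[v])
                local_mins.append((d[v], v))
        # Step 2: all-to-one reduction picks the globally closest vertex.
        _, u = min(local_mins)
        # Step 3: the choice is "broadcast"; u joins the MST everywhere.
        in_tree[u] = True
        # Step 4: each process updates its own block of the d vector.
        for v in range(n):
            if not in_tree[v] and adj_matrix[u][v] < d[v]:
                d[v] = adj_matrix[u][v]
    return d
```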
14 Parallel Prim picture (figure)
15 Analysis
- The cost to select the minimum entry is Θ(n/p).
- The cost of a broadcast is Θ(log p).
- The cost of the local update of the d vector is Θ(n/p).
- The parallel time per iteration is Θ(n/p + log p).
- The total parallel time is given by Θ(n²/p + n log p).
- The corresponding isoefficiency is Θ(p² log² p).
16 Single Source Shortest Path: Dijkstra
- For a weighted graph G = (V,E,w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V.
- Dijkstra's algorithm is similar to Prim's algorithm. It maintains a set of nodes for which the shortest paths are known.
- It grows this set by adding the node closest to the source, reached through one of the nodes in the current shortest-path set.
- The difference: Prim stores the cost of the minimal-cost edge connecting a vertex in VT to u, while Dijkstra stores the minimal cost to reach u from the source.
- It is a greedy algorithm (see the sketch below).
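To make the difference concrete, here is a Θ(n²) Dijkstra sketch in Python (illustrative, mirroring the Prim sketch above); note that the update line stores a path cost rather than an edge cost:

```python
def dijkstra(adj_matrix, source=0):
    """Array-based Dijkstra, Theta(n^2). Structurally identical to Prim;
    the only change is that l[v] is the cost of the whole path from the
    source, not just the cost of the edge joining v to the tree."""
    n = len(adj_matrix)
    INF = float("inf")
    l = [INF] * n
    done = [False] * n
    l[source] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        for v in range(n):
            if not done[v] and l[u] + adj_matrix[u][v] < l[v]:
                l[v] = l[u] + adj_matrix[u][v]   # path cost, not edge cost
    return l
```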
18 Single source analysis
- The weighted adjacency matrix is partitioned using the 1-D block mapping.
- Each process selects, locally, the node closest to the source, followed by a global reduction to select the next node.
- The node is broadcast to all processors and the l-vector is updated.
- The parallel performance of Dijkstra's algorithm is identical to that of Prim's algorithm.
19 All-Pairs Shortest Paths
- Given a weighted graph G = (V,E,w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi, vj ∈ V.
- We look at two parallel versions of Dijkstra
- And at Floyd's algorithm
20 First of all
- Execute n instances of the single-source shortest path problem, one for each of the n source vertices.
- The complexity is Θ(n³), because the complexity of the single-source shortest path algorithm is Θ(n²).
21 Two strategies
- Source partitioned: execute the n shortest path problems on n processors. Each of the n nodes gets to be a source node.
- Source parallel: partition the adjacency matrix a la Prim.
22 Dijkstra: Source Partitioned
- Use n processors; each processor Pi finds the shortest paths from vertex vi to all other vertices by executing Dijkstra's sequential single-source shortest paths algorithm.
- It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes).
- The parallel run time of this formulation is Θ(n²).
- While the algorithm is cost optimal, it can only use n processors. Therefore, the isoefficiency due to concurrency is Θ(p³).
23 Dijkstra: Source Parallel
- We want to keep more than n processors busy
- Given p processors (p > n), each single-source shortest path problem is executed by one of n partitions of the processors
- Use p/n processors for each of the problems
- From before, the parallel time is
- TP = Θ(n³/p) (computation) + Θ(n log p) (communication)
- For cost optimality, we have p = O(n²/log n), and the isoefficiency is Θ((p log p)^1.5).
24 Floyd's
- For any pair of vertices vi, vj ∈ V, consider all paths from vi to vj whose intermediate vertices belong to the set {v1, v2, …, vk}. Let p_{i,j}^{(k)} (of weight d_{i,j}^{(k)}) be the minimum-weight path among them.
- If vertex vk is not in the shortest path from vi to vj, then p_{i,j}^{(k)} is the same as p_{i,j}^{(k-1)}.
- If vk is in p_{i,j}^{(k)}, then we can break p_{i,j}^{(k)} into two paths - one from vi to vk and one from vk to vj. Each of these paths uses vertices from {v1, v2, …, vk-1}.
25 Recurrence
From our observations, the following recurrence relation follows:

d_{i,j}^{(k)} = w(vi, vj) if k = 0
d_{i,j}^{(k)} = min( d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} ) if k ≥ 1

This recurrence must be computed for each pair of nodes and for k = 1, …, n. The serial complexity is O(n³).
26 Floyd
This program computes the all-pairs shortest paths of the graph G = (V,E) with adjacency matrix A.
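Since the pseudocode figure is not reproduced in this transcript, the following Python sketch is a direct transcription of the recurrence from slide 25:

```python
def floyd_all_pairs(A):
    """Floyd's algorithm: D(0) is the adjacency matrix A; after
    iteration k, d[i][j] is the shortest i->j distance whose
    intermediate vertices all lie in {v1, ..., vk}."""
    n = len(A)
    d = [row[:] for row in A]          # copy: D(0) = A, A left intact
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d(k)[i,j] = min(d(k-1)[i,j], d(k-1)[i,k] + d(k-1)[k,j])
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d
```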
27 Parallel Floyd
- Matrix D(k) is divided into p blocks of size (n/√p) x (n/√p), i.e. n²/p elements per block.
- Each processor updates its part of the matrix during each iteration.
- To compute d_{l,r}^{(k)}, processor Pi,j must get d_{l,k}^{(k-1)} and d_{k,r}^{(k-1)}.
- In general, during the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 processes in the same column.
- Similarly, each of the √p processes containing part of the kth column sends it to the √p - 1 processes in the same row.
28 Matrix D(k)
(a) Matrix D(k) distributed by 2-D block mapping into √p x √p subblocks, and (b) the subblock of D(k) assigned to process Pi,j.
29 Communication
- In general, during the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 processes in the same column.
- Similarly, each of the √p processes containing part of the kth column sends it to the √p - 1 processes in the same row.
30 Communication Patterns
(a) Communication patterns used in the 2-D block mapping. When computing d_{i,j}^{(k)}, information must be sent to the highlighted process from two other processes along the same row and column. (b) The row and column of √p processes that contain the kth row and column send them along process columns and rows.
31 Floyd: Parallel Formulation
Floyd's parallel formulation using the 2-D block mapping. P*,j denotes all the processes in the jth column, and Pi,* denotes all the processes in the ith row. The matrix D(0) is the adjacency matrix.
32 Analysis
- During each iteration of the algorithm, the kth row and kth column of processors perform a one-to-all broadcast along their rows/columns.
- The size of this broadcast is n/√p elements, taking time Θ((n log p)/√p).
- The synchronization step takes time Θ(log p).
- The computation time is Θ(n²/p).
- There are n iterations of the algorithm (k = 1, …, n).
- The parallel run time of the 2-D block mapping formulation of Floyd's algorithm is
TP = Θ(n³/p) + Θ((n²/√p) log p)
33 Analysis II
- The above formulation can use O(n²/log² n) processors cost-optimally.
- The isoefficiency of this formulation is Θ(p^1.5 log³ p).
- This algorithm can be further improved by relaxing the strict synchronization after each iteration
- Go to the next slide
34 Pipelining Floyd
- The synchronization step in parallel Floyd's algorithm can be removed without affecting the correctness of the algorithm.
- A process starts working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D(k-1) matrix.
35 Pipelining Floyd
Communication protocol followed in the pipelined
2-D block mapping formulation of Floyd's
algorithm. Assume that process 4 at time t has
just computed a segment of the kth column of the
D(k-1) matrix. It sends the segment to processes
3 and 5. These processes receive the segment at
time t + 1 (where the time unit is the time it
takes for a matrix segment to travel over the
communication link between adjacent processes).
Similarly, processes farther away from process 4
receive the segment later. Process 1 (at the
boundary) does not forward the segment after
receiving it.
36 Pipelining Analysis
- In each step, n/√p elements of the first row are sent from process Pi,j to Pi+1,j.
- Similarly, elements of the first column are sent from process Pi,j to process Pi,j+1.
- Each such step takes time Θ(n/√p).
- After Θ(√p) steps, process P√p,√p gets the relevant elements of the first row and first column in time Θ(n).
- The values of successive rows and columns follow after time Θ(n²/p) in a pipelined mode.
- Process P√p,√p finishes its share of the shortest path computation in time Θ(n³/p) + Θ(n).
- When process P√p,√p has finished the (n-1)th iteration, it sends the relevant values of the nth row and column to the other processes.
37 Pipelining Analysis
- The overall parallel run time of this formulation is
TP = Θ(n³/p) + Θ(n)
- The pipelined formulation of Floyd's algorithm uses up to O(n²) processes efficiently.
- The corresponding isoefficiency is Θ(p^1.5).
38 Comparison of shortest path algorithms
Assumption: the parallel architecture has O(p) bisection bandwidth (the minimum communication capacity between the two halves of the network). Conclusion: pipelined Floyd is the most scalable - it has the lowest isoefficiency and run time, and can use Θ(n²) processors.
39 Transitive Closure of a Graph
- If G = (V,E) is a graph, then the transitive closure of G is defined as the graph G* = (V,E*), where E* = {(vi,vj) | there is a path from vi to vj in G}.
- The connectivity matrix of G is a matrix A* = (a*i,j) such that a*i,j = 1 if there is a path from vi to vj or i = j, and a*i,j = ∞ otherwise.
- To compute A* we assign a weight of 1 to each edge of E and use an all-pairs shortest paths algorithm on the graph (sketched below).
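A small Python sketch of this reduction (illustrative; it reuses the Floyd routine idea from slide 26 and represents "no path" as infinity, matching the connectivity-matrix definition above):

```python
def transitive_closure(adj):
    """adj[i][j] is truthy iff (vi, vj) is an edge. Assign weight 1 to
    every edge, run all-pairs shortest paths; a finite distance then
    means a path exists."""
    INF = float("inf")
    n = len(adj)
    d = [[0 if i == j else (1 if adj[i][j] else INF) for j in range(n)]
         for i in range(n)]
    for k in range(n):                 # Floyd's algorithm on 1-weights
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    # a*[i][j] = 1 if there is a path (or i == j), infinity otherwise.
    return [[1 if d[i][j] < INF else INF for j in range(n)]
            for i in range(n)]
```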
40 Connected Components of a Graph
G = (V,E), V = C1 ∪ C2 ∪ … ∪ Ck. Vertices u and v belong to the same component Ci iff there is a path from u to v and vice versa.
(figure)
41 Depth First Search
- Perform DFS on the graph to get a forest - each tree in the forest corresponds to a connected component.
- Part (b) of the figure shows the components
- Each component is a tree
42 Parallel Component algorithm
- Partition the adjacency matrix into p sub-graphs Gi and assign each Gi to a process Pi
- Each Pi computes the spanning forest of Gi
- Merge the spanning forests pairwise until there is one spanning forest
43 Parallel Components (figure)
44 Ops for merging
- The algorithm uses disjoint sets of edges.
- Operations on the disjoint sets:
- find(x) returns a pointer to the representative element of the set containing x. Each set has its own unique representative.
- union(x, y) unites the sets containing the elements x and y.
45 Merging Ops
- To merge forest A into forest B, for each edge (u,v) of A, a find operation is performed to determine if the vertices are in the same tree of B.
- If not, then the two trees (sets) of B containing u and v are united by a union operation.
- Otherwise, no union operation is necessary.
- Merging A and B requires at most 2(n-1) find operations and (n-1) union operations (sketched below).
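A compact Python sketch of find/union and the forest-merge loop (the union-by-rank and path-compression details are standard additions, not spelled out on the slides):

```python
class DisjointSets:
    """find/union over n vertices, as used to merge spanning forests."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Walk to the set's representative, compressing the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False               # u and v already in the same tree
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def merge_forests(sets, edges_of_A):
    """Merge forest A into B's set structure: at most two finds and
    one union per edge of A, matching the counts above."""
    for u, v in edges_of_A:
        sets.union(u, v)               # no-op if u, v already connected
```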
46 Parallel Block Mapping
- The n x n adjacency matrix is partitioned into p blocks.
- Each processor can compute its local spanning forest in time Θ(n²/p).
- Merging is done by embedding a logical tree into the topology. There are log p merging stages, and each takes time Θ(n). Thus, the cost due to merging is Θ(n log p).
- During each merging stage, spanning forests are sent between nearest neighbors; Θ(n) edges of the spanning forest are transmitted.
47 Performance of Block Mapping
For a cost-optimal formulation, p = O(n/log n). The corresponding isoefficiency is Θ(p² log² p).
- The performance is similar to Prim's MST algorithm and Dijkstra's single-source shortest path algorithm.
48 Sparse Graphs
- A graph G = (V,E) is sparse if |E| is much smaller than |V|²
49 Algorithms for sparse graphs
- We can reduce the complexity of dense graph algorithms by using an adjacency list instead of an adjacency matrix
- E.g. Prim's algorithm is Θ(n²) for a dense graph and Θ(|E| log n) for a sparse graph
- The key to good performance of the dense matrix algorithms was the ability to assign roughly equal workloads to all of the processors and to keep the communication local
- Floyd assigned equal-size blocks of consecutive rows and columns from the adjacency matrix → local communication
50 Sparse difficulties
- Partitioning the adjacency matrix is harder than it looks
- Assign an equal number of vertices (with their adjacency lists) to each processor? Some processors may have more links than others.
- Assign an equal number of links? Then we need to split the adjacency list of a vertex among processors → lots of communication
- It works for certain structures - next slide
51 Grid graph - a certain structure
- If the vertices have more or less the same degree, it works.
52 Maximal Independent Set
- A set of vertices I ⊂ V is called independent if no pair of vertices in I is connected via an edge in G. An independent set is called maximal if by including any other vertex not in I, the independence property is violated.
53 Who cares?
- Maximal independent sets can be used to determine how many parallel tasks from a task graph can be executed
- They are used in graph coloring algorithms
54 Simple MIS algorithm
- Start with an empty MIS I, and assign all vertices to a candidate set C.
- A vertex v from C is moved into I and all vertices adjacent to v are removed from C.
- This process is repeated until C is empty.
- Problem: this is a serial algorithm (sketched below).
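A sketch of the serial algorithm in Python (illustrative; `adj` maps each vertex to its neighbor list):

```python
def serial_mis(adj):
    """Greedy maximal independent set: move one candidate at a time
    into I and drop its neighbors from C. Inherently serial, since
    each step depends on the previous choice."""
    C = set(adj)                # all vertices start as candidates
    I = set()
    while C:
        v = C.pop()             # pick any remaining candidate
        I.add(v)
        C -= set(adj[v])        # neighbors of v can no longer join I
    return I
```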
55 Luby's algorithm
- In the beginning, C is set equal to V and I is empty
- Assign random numbers to all of the nodes of the graph
- If a vertex has a smaller random number than all of its adjacent vertices, then
- put it in I
- delete all of its adjacent neighbors from C
- Repeat the above steps until C is empty
- Takes O(log |V|) steps on average
- Luby invented this algorithm for graph coloring (a sketch of the rounds follows below)
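One possible serial rendering of Luby's rounds in Python (illustrative; each round is trivially parallel, because every vertex only compares its random number with its neighbors'):

```python
import random

def luby_mis(adj):
    """Randomized MIS: in each round, a candidate vertex joins I iff its
    random number beats those of all its candidate neighbors; winners
    and their neighbors then leave the candidate set C."""
    C = set(adj)
    I = set()
    while C:
        r = {v: random.random() for v in C}
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] if u in C)}
        I |= winners
        removed = set(winners)
        for v in winners:
            removed |= set(adj[v])
        C -= removed
    return I
```

Each round removes at least the candidate holding the globally smallest random number, so the loop always terminates; on average it takes O(log |V|) rounds, as stated above.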
56 Parallel Luby for Shared Address Space
- 3 arrays of size |V|
- I: independence array; I(i) = 1 if i is in the MIS. Initially all I(i) = 0.
- R: random number array
- C: candidate array; C(i) = 1 if i is a candidate
- Partition C among the p processes
- Each process generates a random number for each of its vertices
- Each process checks to see if the random numbers of its vertices are smaller than those of all adjacent vertices
- Each process zeroes the C entries of vertices adjacent to the newly selected independent vertices
57 Single Source Shortest Paths
- Two steps happen in each iteration:
- Extract u ∈ (V-VT) such that l[u] = min{l[v] : v ∈ V-VT}
- For each vertex v in (V-VT), compute l[v] = min{l[v], l[u] + w(u,v)}
- It makes sense to use an adjacency list for the last equation
- Use a priority queue to store the l values, with the smallest on top
- The priority queue is implemented with a binary heap
58 Johnson
Updating vertices in the heap is the big time sink: O(log n) per update and |E| updates → Θ(|E| log n) complexity (a heap-based sketch follows below).
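A heap-based sketch in Python (illustrative): since a binary heap has no cheap decrease-key, the usual trick is to push a fresh entry and skip stale ones on extraction, which keeps every update at O(log n):

```python
import heapq

def johnson_sssp(adj, source=0):
    """Single-source shortest paths with a binary heap, Theta(|E| log n).
    adj maps each vertex to a list of (neighbor, weight) pairs."""
    INF = float("inf")
    l = {v: INF for v in adj}
    l[source] = 0
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > l[u]:
            continue                       # stale entry: already improved
        for v, w in adj[u]:
            if du + w < l[v]:
                l[v] = du + w              # l[v] = min(l[v], l[u] + w(u,v))
                heapq.heappush(heap, (l[v], v))
    return l
```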
59 Parallel Johnson - first attempts
- Use one processor P0 to house the queue; the other processes update l[v] and give the results to P0
- Problems:
- The single queue is a bottleneck
- Only a small number of processes can be kept busy (|E|/|V|)
- So, distribute the queue among the processes.
- A low-latency architecture is needed to make this work
60 First attempts - continued
- We still have low speedup (O(log n)) even if each update takes O(1)
- We can extract multiple nodes from the queue: all vertices u with the same minimum distance
- Why? Nodes with the same distance can be processed in any order
- Still not enough speedup
61 Solution: speculative decomposition
- When process Pi extracts the vertex u ∈ Vi, it sends a message to the processes that store vertices adjacent to u.
- Process Pj, upon receiving this message, sets the value of l[v] stored in its priority queue to min{l[v], l[u] + w(u,v)}.
- If a shorter path has been discovered to node v, it is reinserted into the local priority queue.
- The algorithm terminates only when all the queues become empty.
62 Speculative decomposition: Distributed Memory
- Partition the queue among the processors
- Partition the vertices and adjacency lists among the processors
- Each processor updates its local queue and sends the results to the processors holding adjacent vertices
- The recipient processor updates its value for the shortest path
63 More precisely
- For example, u ∈ Pi and v ∈ Pj, and (u,v) is an edge. Pi extracts u and updates l[u].
- Pi sends a message to Pj with the new candidate value for v, which is l[u] + w(u,v).
- Pj compares l[u] + w(u,v) to sp[v], its current shortest path value for v:
- If smaller, then we have a new sp[v]
- If larger, then the message is discarded (see the handler sketch below)
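A hypothetical sketch of Pj's message handler (names invented for illustration; the surrounding send/receive machinery is omitted):

```python
import heapq

def on_update_message(local_sp, local_heap, v, candidate):
    """Run by Pj when Pi sends the candidate distance l[u] + w(u,v)
    for a vertex v that Pj owns."""
    if candidate < local_sp[v]:
        local_sp[v] = candidate                     # new shortest path to v
        heapq.heappush(local_heap, (candidate, v))  # reinsert v locally
    # if candidate >= local_sp[v], the message is simply discarded
```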
64 In the beginning
Only the processor owning the source has a non-empty priority queue. Then the wave of updates spreads across the processors.
65 Mapping: 2-D block, n/√p x n/√p mesh
66 Analysis of 2-D
- At most O(√p) processes are busy at any time, because the wave moves diagonally (a diagonal has √p processes)
- Max speedup is S = W / (W/√p) = √p
- E = 1/√p
- Lousy efficiency
67 2-D Cyclic
68 Analysis
- This improves things, because nearby vertices are spread further apart over the processors
- Bad news: more communication
69 1-D Block Mapping
70 1-D rules
- A better idea, because as the wave spreads, more processors become involved concurrently
- Assume p/2 processes are busy; then
- S = W / (W/(p/2)) = p/2
- E = 1/2
- An improvement over 2-D
- Bad side: it uses O(n) processes