Title: Graph Algorithms
1 Graph Algorithms
- Carl Tropper
- Department of Computer Science
- McGill University
2 Definitions
- An undirected graph G is a pair (V,E), where V is a finite set of points called vertices and E is a finite set of edges.
- An edge e ∈ E is an unordered pair (u,v), where u,v ∈ V.
- In a directed graph, the edge e is an ordered pair (u,v). An edge (u,v) is incident from vertex u and is incident to vertex v.
- A path from a vertex v to a vertex u is a sequence ⟨v0, v1, v2, …, vk⟩ of vertices where v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1, …, k-1.
- The length of a path is defined as the number of edges in the path.
3 Directed and undirected (figure)
4 More definitions
- An undirected graph is connected if every pair of vertices is connected by a path.
- A forest is an acyclic graph, and a tree is a connected acyclic graph.
- A graph that has weights associated with each edge is called a weighted graph.
5 How to represent a graph 1
- An adjacency matrix represents an undirected graph
- Θ(n²) space to store the matrix
6 How to represent a graph 2
- An adjacency list: an ordered list of nodes, with the weights stored inside the nodes
- Θ(|E|) space to store the list (both representations are sketched below)
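As a concrete illustration (not part of the original slides), here is a minimal Python sketch of both representations for the same small weighted graph; the vertex numbering and weights are invented for the example.

```python
# Two ways to store the same weighted undirected graph on 4 vertices.
INF = float("inf")  # marks "no edge" in the matrix form

# Adjacency matrix: Theta(n^2) space regardless of how many edges exist.
adj_matrix = [
    [0,   1,   INF, 4],
    [1,   0,   2,   INF],
    [INF, 2,   0,   3],
    [4,   INF, 3,   0],
]

# Adjacency list: Theta(n + |E|) space; each entry is (neighbor, weight).
adj_list = {
    0: [(1, 1), (3, 4)],
    1: [(0, 1), (2, 2)],
    2: [(1, 2), (3, 3)],
    3: [(0, 4), (2, 3)],
}
```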
7 Which is better?
- If the graph is sparse, use pointers (the adjacency list)
- If it is dense, then go with the adjacency matrix
8 MST: Prim's Algorithm
- A spanning tree of an undirected graph G is a subgraph of G which is (1) a tree and (2) contains all the vertices of G.
- In a weighted graph, the weight of a subgraph is the sum of the weights of the edges in the subgraph.
- A minimum spanning tree (MST) for a weighted undirected graph is a spanning tree with minimum weight.
9 MST on the right (figure)
10 Prim
- Prim's is a greedy algorithm
- It is iterative: the tree grows one vertex at a time
- Start by selecting an arbitrary vertex and include it in the current MST (the root)
- Grow the current MST by inserting into it the vertex closest to one of the vertices already in the current MST, i.e. the vertex which (1) is outside the MST and (2) adds the smallest edge weight to the MST (see the sketch below)
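A minimal serial sketch of the array-based version analyzed on the next slides (Python, using the adjacency-matrix representation from slide 5; illustrative, not taken verbatim from the course):

```python
def prim_mst(adj_matrix):
    """Array-based Prim: d[v] holds the weight of the cheapest edge
    connecting v to the current tree; each round adds the closest
    outside vertex. This is the Theta(n^2) formulation."""
    n = len(adj_matrix)
    INF = float("inf")
    d = [INF] * n
    parent = [-1] * n          # tree edge that realizes d[v]
    in_tree = [False] * n
    d[0] = 0                   # arbitrary root
    for _ in range(n):
        # Select the closest vertex not yet in the tree (O(n) scan).
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        in_tree[u] = True
        # Update candidate edges for vertices adjacent to u (O(n) scan).
        for v in range(n):
            if not in_tree[v] and adj_matrix[u][v] < d[v]:
                d[v] = adj_matrix[u][v]
                parent[v] = u
    return [(parent[v], v, d[v]) for v in range(n) if parent[v] != -1]
```

Both inner steps are O(n) scans executed n times, which is where the Θ(n²) bound on slide 12 comes from.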
11 Prim (figure)
12 Formal Prim
Lines 10-13 of the pseudocode are executed n-1 times; lines 12 and 13 take O(n) time per iteration. Prim's algorithm is therefore Θ(n²).
13 Parallel Prim
- Let p be the number of processes, and let n be the number of vertices.
- The adjacency matrix is partitioned in 1-D block fashion, with the distance vector d partitioned accordingly.
- In each step, a processor selects the locally closest node, followed by a global (all-to-one) reduction to select the globally closest node. One processor (say P0) contains the MST.
- This node is inserted into the MST, and the choice is broadcast (one-to-all) to all processors.
- Each processor updates its part of the d vector locally (these steps are sketched in the simulation below).
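The following serial simulation (purely illustrative, with invented names; a real implementation would use MPI-style reduce/broadcast primitives) walks through these four steps under the 1-D block partition:

```python
def simulated_parallel_prim(adj_matrix, p):
    """Simulate parallel Prim with the d vector split into p blocks.
    "Process" pid owns vertices [pid*n/p, (pid+1)*n/p); assumes p divides n."""
    n = len(adj_matrix)
    block = n // p
    INF = float("inf")
    d = [INF] * n
    in_tree = [False] * n
    d[0] = 0                                   # arbitrary root
    for _ in range(n):
        # Step 1: each process finds its locally closest outside vertex.
        local_mins = []
        for pid in range(p):
            cand = [v for v in range(pid * block, (pid + 1) * block)
                    if not in_tree[v]]
            if cand:
                v = min(cand, key=lambda v: d[v])
                local_mins.append((d[v], v))
        # Step 2: all-to-one reduction picks the globally closest vertex.
        _, u = min(local_mins)
        # Step 3: the choice is "broadcast"; u joins the MST everywhere.
        in_tree[u] = True
        # Step 4: each process updates its own block of the d vector.
        for v in range(n):
            if not in_tree[v] and adj_matrix[u][v] < d[v]:
                d[v] = adj_matrix[u][v]
    return d
```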
14 Parallel Prim picture (figure)
15 Analysis
- The cost to select the minimum entry is Θ(n/p).
- The cost of a broadcast is Θ(log p).
- The cost of the local update of the d vector is Θ(n/p).
- The parallel time per iteration is Θ(n/p + log p).
- The total parallel time is given by Θ(n²/p + n log p).
- The corresponding isoefficiency is Θ(p² log² p).
16 Single Source Shortest Path: Dijkstra
- For a weighted graph G = (V,E,w), the single-source shortest paths problem is to find the shortest paths from a vertex v ∈ V to all other vertices in V.
- Dijkstra's algorithm is similar to Prim's algorithm. It maintains a set of nodes for which the shortest paths are known.
- It grows this set by adding the node closest to the source, reached through one of the nodes in the current shortest-path set.
- The difference: Prim stores the cost of the minimal-cost edge connecting a vertex in VT to u, while Dijkstra stores the minimal cost to reach u from the source.
- It is a greedy algorithm (see the sketch below).
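To make the difference concrete, here is a Θ(n²) Dijkstra sketch in Python (illustrative, mirroring the Prim sketch above); note that the update line stores a path cost rather than an edge cost:

```python
def dijkstra(adj_matrix, source=0):
    """Array-based Dijkstra, Theta(n^2). Structurally identical to Prim;
    the only change is that l[v] is the cost of the whole path from the
    source, not just the cost of the edge joining v to the tree."""
    n = len(adj_matrix)
    INF = float("inf")
    l = [INF] * n
    done = [False] * n
    l[source] = 0
    for _ in range(n):
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        done[u] = True
        for v in range(n):
            if not done[v] and l[u] + adj_matrix[u][v] < l[v]:
                l[v] = l[u] + adj_matrix[u][v]   # path cost, not edge cost
    return l
```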
18 Single source analysis
- The weighted adjacency matrix is partitioned using the 1-D block mapping.
- Each process selects, locally, the node closest to the source, followed by a global reduction to select the next node.
- The node is broadcast to all processors and the l-vector is updated.
- The parallel performance of Dijkstra's algorithm is identical to that of Prim's algorithm.
19 All-Pairs Shortest Paths
- Given a weighted graph G = (V,E,w), the all-pairs shortest paths problem is to find the shortest paths between all pairs of vertices vi, vj ∈ V.
- We look at two parallel versions of Dijkstra
- And at Floyd's algorithm
20 First of all
- Execute n instances of the single-source shortest path problem, one for each of the n source vertices.
- The complexity is Θ(n³), because the complexity of the single-source shortest path algorithm is Θ(n²).
21 Two strategies
- Source partitioned: execute the n shortest path problems on n processors. Each of the n nodes gets to be a source node.
- Source parallel: partition the adjacency matrix a la Prim.
22 Dijkstra: Source Partitioned
- Use n processors; each processor Pi finds the shortest paths from vertex vi to all other vertices by executing Dijkstra's sequential single-source shortest paths algorithm.
- It requires no interprocess communication (provided that the adjacency matrix is replicated at all processes).
- The parallel run time of this formulation is Θ(n²).
- While the algorithm is cost optimal, it can only use n processors. Therefore, the isoefficiency due to concurrency is Θ(p³).
23 Dijkstra: Source Parallel
- We want to keep more than n processors busy
- Given p processors (p > n), each single-source shortest path problem is executed by one of n partitions of the processors
- Use p/n processors for each of the problems
- From before, the parallel time is
- TP = Θ(n³/p) (computation) + Θ(n log p) (communication)
- For cost optimality, we have p = O(n²/log n), and the isoefficiency is Θ((p log p)^1.5).
24 Floyd's
- For any pair of vertices vi, vj ∈ V, consider all paths from vi to vj whose intermediate vertices belong to the set {v1, v2, …, vk}. Let p_{i,j}^{(k)} (of weight d_{i,j}^{(k)}) be the minimum-weight path among them.
- If vertex vk is not in the shortest path from vi to vj, then p_{i,j}^{(k)} is the same as p_{i,j}^{(k-1)}.
- If vk is in p_{i,j}^{(k)}, then we can break p_{i,j}^{(k)} into two paths - one from vi to vk and one from vk to vj. Each of these paths uses vertices from {v1, v2, …, vk-1}.
25 Recurrence
From our observations, the following recurrence relation follows:

d_{i,j}^{(k)} = w(vi, vj) if k = 0
d_{i,j}^{(k)} = min( d_{i,j}^{(k-1)}, d_{i,k}^{(k-1)} + d_{k,j}^{(k-1)} ) if k ≥ 1

This recurrence must be computed for each pair of nodes and for k = 1, …, n. The serial complexity is O(n³).
26 Floyd
This program computes the all-pairs shortest paths of the graph G = (V,E) with adjacency matrix A.
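Since the pseudocode figure is not reproduced in this transcript, the following Python sketch is a direct transcription of the recurrence from slide 25:

```python
def floyd_all_pairs(A):
    """Floyd's algorithm: D(0) is the adjacency matrix A; after
    iteration k, d[i][j] is the shortest i->j distance whose
    intermediate vertices all lie in {v1, ..., vk}."""
    n = len(A)
    d = [row[:] for row in A]          # copy: D(0) = A, A left intact
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # d(k)[i,j] = min(d(k-1)[i,j], d(k-1)[i,k] + d(k-1)[k,j])
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    return d
```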
27 Parallel Floyd
- Matrix D(k) is divided into p blocks of size (n/√p) x (n/√p), i.e. n²/p elements per block.
- Each processor updates its part of the matrix during each iteration.
- To compute d_{l,r}^{(k)}, processor Pi,j must get d_{l,k}^{(k-1)} and d_{k,r}^{(k-1)}.
- In general, during the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 processes in the same column.
- Similarly, each of the √p processes containing part of the kth column sends it to the √p - 1 processes in the same row.
28 Matrix D(k)
(a) Matrix D(k) distributed by 2-D block mapping into √p x √p subblocks, and (b) the subblock of D(k) assigned to process Pi,j.
29 Communication
- In general, during the kth iteration, each of the √p processes containing part of the kth row sends it to the √p - 1 processes in the same column.
- Similarly, each of the √p processes containing part of the kth column sends it to the √p - 1 processes in the same row.
30 Communication Patterns
(a) Communication patterns used in the 2-D block mapping. When computing d_{i,j}^{(k)}, information must be sent to the highlighted process from two other processes along the same row and column. (b) The row and column of √p processes that contain the kth row and column send them along process columns and rows.
31 Floyd: Parallel Formulation
Floyd's parallel formulation using the 2-D block mapping. P*,j denotes all the processes in the jth column, and Pi,* denotes all the processes in the ith row. The matrix D(0) is the adjacency matrix.
32 Analysis
- During each iteration of the algorithm, the kth row and kth column of processors perform a one-to-all broadcast along their rows/columns.
- The size of this broadcast is n/√p elements, taking time Θ((n log p)/√p).
- The synchronization step takes time Θ(log p).
- The computation time is Θ(n²/p).
- There are n iterations of the algorithm (k = 1, …, n).
- The parallel run time of the 2-D block mapping formulation of Floyd's algorithm is
TP = Θ(n³/p) + Θ((n²/√p) log p)
33 Analysis II
- The above formulation can use O(n²/log² n) processors cost-optimally.
- The isoefficiency of this formulation is Θ(p^1.5 log³ p).
- This algorithm can be further improved by relaxing the strict synchronization after each iteration
- Go to the next slide
34 Pipelining Floyd
- The synchronization step in parallel Floyd's algorithm can be removed without affecting the correctness of the algorithm.
- A process starts working on the kth iteration as soon as it has computed the (k-1)th iteration and has the relevant parts of the D(k-1) matrix.
35 Pipelining Floyd
Communication protocol followed in the pipelined
2-D block mapping formulation of Floyd's
algorithm. Assume that process 4 at time t has
just computed a segment of the kth column of the
D(k-1) matrix. It sends the segment to processes
3 and 5. These processes receive the segment at
time t + 1 (where the time unit is the time it
takes for a matrix segment to travel over the
communication link between adjacent processes).
Similarly, processes farther away from process 4
receive the segment later. Process 1 (at the
boundary) does not forward the segment after
receiving it.
36 Pipelining Analysis
- In each step, n/√p elements of the first row are sent from process Pi,j to Pi+1,j.
- Similarly, elements of the first column are sent from process Pi,j to process Pi,j+1.
- Each such step takes time Θ(n/√p).
- After Θ(√p) steps, process P√p,√p gets the relevant elements of the first row and first column in time Θ(n).
- The values of successive rows and columns follow after time Θ(n²/p) in a pipelined mode.
- Process P√p,√p finishes its share of the shortest path computation in time Θ(n³/p) + Θ(n).
- When process P√p,√p has finished the (n-1)th iteration, it sends the relevant values of the nth row and column to the other processes.
37 Pipelining Analysis
- The overall parallel run time of this formulation is
TP = Θ(n³/p) + Θ(n)
- The pipelined formulation of Floyd's algorithm uses up to O(n²) processes efficiently.
- The corresponding isoefficiency is Θ(p^1.5).
38 Comparison of shortest path algorithms
Assumption: the parallel architecture has O(p) bisection bandwidth (the minimum communication capacity between the two halves of the network). Conclusion: pipelined Floyd is the most scalable - it has the lowest isoefficiency and run time, and can use Θ(n²) processors.
39 Transitive Closure of a Graph
- If G = (V,E) is a graph, then the transitive closure of G is defined as the graph G* = (V,E*), where E* = {(vi,vj) | there is a path from vi to vj in G}.
- The connectivity matrix of G is a matrix A* = (a*i,j) such that a*i,j = 1 if there is a path from vi to vj or i = j, and a*i,j = ∞ otherwise.
- To compute A* we assign a weight of 1 to each edge of E and use an all-pairs shortest paths algorithm on the graph (sketched below).
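A small Python sketch of this reduction (illustrative; it reuses the Floyd routine idea from slide 26 and represents "no path" as infinity, matching the connectivity-matrix definition above):

```python
def transitive_closure(adj):
    """adj[i][j] is truthy iff (vi, vj) is an edge. Assign weight 1 to
    every edge, run all-pairs shortest paths; a finite distance then
    means a path exists."""
    INF = float("inf")
    n = len(adj)
    d = [[0 if i == j else (1 if adj[i][j] else INF) for j in range(n)]
         for i in range(n)]
    for k in range(n):                 # Floyd's algorithm on 1-weights
        for i in range(n):
            for j in range(n):
                d[i][j] = min(d[i][j], d[i][k] + d[k][j])
    # a*[i][j] = 1 if there is a path (or i == j), infinity otherwise.
    return [[1 if d[i][j] < INF else INF for j in range(n)]
            for i in range(n)]
```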
40 Connected Components of a Graph
G = (V,E), V = C1 ∪ C2 ∪ … ∪ Ck. Vertices u and v belong to the same component Ci iff there is a path from u to v and vice versa.
(figure)
41 Depth First Search
- Perform DFS on the graph to get a forest - each tree in the forest corresponds to a connected component.
- Part (b) of the figure shows the components
- Each component is a tree
42 Parallel Component algorithm
- Partition the adjacency matrix into p sub-graphs Gi and assign each Gi to a process Pi
- Each Pi computes the spanning forest of Gi
- Merge the spanning forests pairwise until there is one spanning forest
43 Parallel Components (figure)
44 Ops for merging
- The algorithm uses disjoint sets of edges.
- Operations on the disjoint sets:
- find(x) returns a pointer to the representative element of the set containing x. Each set has its own unique representative.
- union(x, y) unites the sets containing the elements x and y.
45 Merging Ops
- To merge forest A into forest B, for each edge (u,v) of A, a find operation is performed to determine if the vertices are in the same tree of B.
- If not, then the two trees (sets) of B containing u and v are united by a union operation.
- Otherwise, no union operation is necessary.
- Merging A and B requires at most 2(n-1) find operations and (n-1) union operations (sketched below).
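A compact Python sketch of find/union and the forest-merge loop (the union-by-rank and path-compression details are standard additions, not spelled out on the slides):

```python
class DisjointSets:
    """find/union over n vertices, as used to merge spanning forests."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0] * n

    def find(self, x):
        # Walk to the set's representative, compressing the path as we go.
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False               # u and v already in the same tree
        if self.rank[rx] < self.rank[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        if self.rank[rx] == self.rank[ry]:
            self.rank[rx] += 1
        return True

def merge_forests(sets, edges_of_A):
    """Merge forest A into B's set structure: at most two finds and
    one union per edge of A, matching the counts above."""
    for u, v in edges_of_A:
        sets.union(u, v)               # no-op if u, v already connected
```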
46 Parallel Block Mapping
- The n x n adjacency matrix is partitioned into p blocks.
- Each processor can compute its local spanning forest in time Θ(n²/p).
- Merging is done by embedding a logical tree into the topology. There are log p merging stages, and each takes time Θ(n). Thus, the cost due to merging is Θ(n log p).
- During each merging stage, spanning forests are sent between nearest neighbors; Θ(n) edges of the spanning forest are transmitted.
47 Performance of Block Mapping
For a cost-optimal formulation, p = O(n/log n). The corresponding isoefficiency is Θ(p² log² p).
- The performance is similar to Prim's MST algorithm and Dijkstra's single-source shortest path algorithm.
48 Sparse Graphs
- A graph G = (V,E) is sparse if |E| is much smaller than |V|²
49 Algorithms for sparse graphs
- We can reduce the complexity of dense graph algorithms by using an adjacency list instead of an adjacency matrix
- E.g. Prim's algorithm is Θ(n²) for a dense graph and Θ(|E| log n) for a sparse graph
- The key to good performance of the dense matrix algorithms was the ability to assign roughly equal workloads to all of the processors and to keep the communication local
- Floyd assigned equal-size blocks of consecutive rows and columns from the adjacency matrix → local communication
50 Sparse difficulties
- Partitioning the adjacency matrix is harder than it looks
- Assign an equal number of vertices (with their adjacency lists) to each processor? Some processors may have more links than others.
- Assign an equal number of links? Then we need to split the adjacency list of a vertex among processors → lots of communication
- It works for certain structures - next slide
51 Grid graph - a certain structure
- If the vertices have more or less the same degree, it works.
52 Maximal Independent Set
- A set of vertices I ⊂ V is called independent if no pair of vertices in I is connected via an edge in G. An independent set is called maximal if by including any other vertex not in I, the independence property is violated.
53 Who cares?
- Maximal independent sets can be used to determine how many parallel tasks from a task graph can be executed
- They are used in graph coloring algorithms
54 Simple MIS algorithm
- Start with an empty MIS I, and assign all vertices to a candidate set C.
- A vertex v from C is moved into I and all vertices adjacent to v are removed from C.
- This process is repeated until C is empty.
- Problem: this is a serial algorithm (sketched below).
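A sketch of the serial algorithm in Python (illustrative; `adj` maps each vertex to its neighbor list):

```python
def serial_mis(adj):
    """Greedy maximal independent set: move one candidate at a time
    into I and drop its neighbors from C. Inherently serial, since
    each step depends on the previous choice."""
    C = set(adj)                # all vertices start as candidates
    I = set()
    while C:
        v = C.pop()             # pick any remaining candidate
        I.add(v)
        C -= set(adj[v])        # neighbors of v can no longer join I
    return I
```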
55 Luby's algorithm
- In the beginning, C is set equal to V and I is empty
- Assign random numbers to all of the nodes of the graph
- If a vertex has a smaller random number than all of its adjacent vertices, then
- put it in I
- delete all of its adjacent neighbors from C
- Repeat the above steps until C is empty
- Takes O(log |V|) steps on average
- Luby invented this algorithm for graph coloring (a sketch of the rounds follows below)
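One possible serial rendering of Luby's rounds in Python (illustrative; each round is trivially parallel, because every vertex only compares its random number with its neighbors'):

```python
import random

def luby_mis(adj):
    """Randomized MIS: in each round, a candidate vertex joins I iff its
    random number beats those of all its candidate neighbors; winners
    and their neighbors then leave the candidate set C."""
    C = set(adj)
    I = set()
    while C:
        r = {v: random.random() for v in C}
        winners = {v for v in C
                   if all(r[v] < r[u] for u in adj[v] if u in C)}
        I |= winners
        removed = set(winners)
        for v in winners:
            removed |= set(adj[v])
        C -= removed
    return I
```

Each round removes at least the candidate holding the globally smallest random number, so the loop always terminates; on average it takes O(log |V|) rounds, as stated above.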
56 Parallel Luby for Shared Address Space
- 3 arrays of size |V|
- I: independence array; I(i) = 1 if i is in the MIS. Initially all I(i) = 0.
- R: random number array
- C: candidate array; C(i) = 1 if i is a candidate
- Partition C among the p processes
- Each process generates a random number for each of its vertices
- Each process checks to see if the random numbers of its vertices are smaller than those of all adjacent vertices
- Each process zeroes the C entries of vertices adjacent to the newly selected independent vertices
57 Single Source Shortest Paths
- Two steps happen in each iteration:
- Extract u ∈ (V-VT) such that l[u] = min{l[v] : v ∈ V-VT}
- For each vertex v in (V-VT), compute l[v] = min{l[v], l[u] + w(u,v)}
- It makes sense to use an adjacency list for the last equation
- Use a priority queue to store the l values, with the smallest on top
- The priority queue is implemented with a binary heap
58 Johnson
Updating vertices in the heap is the big time sink: O(log n) per update and |E| updates → Θ(|E| log n) complexity (a heap-based sketch follows below).
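A heap-based sketch in Python (illustrative): since a binary heap has no cheap decrease-key, the usual trick is to push a fresh entry and skip stale ones on extraction, which keeps every update at O(log n):

```python
import heapq

def johnson_sssp(adj, source=0):
    """Single-source shortest paths with a binary heap, Theta(|E| log n).
    adj maps each vertex to a list of (neighbor, weight) pairs."""
    INF = float("inf")
    l = {v: INF for v in adj}
    l[source] = 0
    heap = [(0, source)]
    while heap:
        du, u = heapq.heappop(heap)
        if du > l[u]:
            continue                       # stale entry: already improved
        for v, w in adj[u]:
            if du + w < l[v]:
                l[v] = du + w              # l[v] = min(l[v], l[u] + w(u,v))
                heapq.heappush(heap, (l[v], v))
    return l
```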
59 Parallel Johnson - first attempts
- Use one processor P0 to house the queue; the other processes update l[v] and give the results to P0
- Problems:
- The single queue is a bottleneck
- Only a small number of processes can be kept busy (|E|/|V|)
- So, distribute the queue among the processes.
- A low-latency architecture is needed to make this work
60 First attempts - continued
- We still have low speedup (O(log n)) even if each update takes O(1)
- We can extract multiple nodes from the queue: all vertices u with the same minimum distance
- Why? Nodes with the same distance can be processed in any order
- Still not enough speedup
61 Solution: speculative decomposition
- When process Pi extracts the vertex u ∈ Vi, it sends a message to the processes that store vertices adjacent to u.
- Process Pj, upon receiving this message, sets the value of l[v] stored in its priority queue to min{l[v], l[u] + w(u,v)}.
- If a shorter path has been discovered to node v, it is reinserted into the local priority queue.
- The algorithm terminates only when all the queues become empty.
62 Speculative decomposition: Distributed Memory
- Partition the queue among the processors
- Partition the vertices and adjacency lists among the processors
- Each processor updates its local queue and sends the results to the processors holding adjacent vertices
- The recipient processor updates its value for the shortest path
63 More precisely
- For example, u ∈ Pi and v ∈ Pj, and (u,v) is an edge. Pi extracts u and updates l[u].
- Pi sends a message to Pj with the new candidate value for v, which is l[u] + w(u,v).
- Pj compares l[u] + w(u,v) to sp[v], its current shortest path value for v:
- If smaller, then we have a new sp[v]
- If larger, then the message is discarded (see the handler sketch below)
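A hypothetical sketch of Pj's message handler (names invented for illustration; the surrounding send/receive machinery is omitted):

```python
import heapq

def on_update_message(local_sp, local_heap, v, candidate):
    """Run by Pj when Pi sends the candidate distance l[u] + w(u,v)
    for a vertex v that Pj owns."""
    if candidate < local_sp[v]:
        local_sp[v] = candidate                     # new shortest path to v
        heapq.heappush(local_heap, (candidate, v))  # reinsert v locally
    # if candidate >= local_sp[v], the message is simply discarded
```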
64 In the beginning
Only the processor owning the source has a non-empty priority queue. Then the wave of updates spreads across the processors.
65 Mapping: 2-D block, n/√p x n/√p mesh
66 Analysis of 2-D
- At most O(√p) processes are busy at any time, because the wave moves diagonally (a diagonal has √p processes)
- Max speedup is S = W / (W/√p) = √p
- E = 1/√p
- Lousy efficiency
67 2-D Cyclic
68 Analysis
- This improves things, because nearby vertices are spread further apart over the processors
- Bad news: more communication
69 1-D Block Mapping
70 1-D rules
- A better idea, because as the wave spreads, more processors become involved concurrently
- Assume p/2 processes are busy; then
- S = W / (W/(p/2)) = p/2
- E = 1/2
- An improvement over 2-D
- Bad side: it uses O(n) processes