Graph Algorithms - PowerPoint PPT Presentation

About This Presentation
Title:

Graph Algorithms

Slides: 71
Provided by: carltr
Category:
Tags: algorithms | graph

Transcript and Presenter's Notes

Title: Graph Algorithms


1
Graph Algorithms
  • Carl Tropper
  • Department of Computer Science
  • McGill University

2
Definitions
  • An undirected graph G is a pair (V,E), where V is
    a finite set of points called vertices and E is a
    finite set of edges.
  • An edge e ∈ E is an unordered pair (u,v), where
    u,v ∈ V.
  • In a directed graph, the edge e is an ordered
    pair (u,v). An edge (u,v) is incident from vertex
    u and is incident to vertex v.
  • A path from a vertex v to a vertex u is a
    sequence ⟨v0, v1, v2, ..., vk⟩ of vertices where
    v0 = v, vk = u, and (vi, vi+1) ∈ E for i = 0, 1,
    ..., k-1.
  • The length of a path is defined as the number of
    edges in the path.

3
Directed and undirected
4
More definitions
  • An undirected graph is connected if every pair of
    vertices is connected by a path.
  • A forest is an acyclic graph, and a tree is a
    connected acyclic graph.
  • A graph that has weights associated with each
    edge is called a weighted graph.

5
How to represent a graph 1
  • Adjacency matrix represents undirected graph
  • Θ(n^2) space to store the matrix

6
How to represent a graph 2
  • An ordered list of nodes
  • Weights are inside nodes
  • Θ(|E|) space to store the list
7
Which is better?
  • If graph is sparse, use pointers/list
  • If dense, then go with the adjacency matrix
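The two representations can be compared with a small sketch. The function names, the 0-based vertex numbering, and the (u, v, weight) edge triples are assumptions made for illustration, not part of the slides.

```python
INF = float("inf")  # marks "no edge" in the matrix

def to_adjacency_matrix(n, edges):
    """Build an n x n weight matrix for an undirected graph: Theta(n^2) space."""
    matrix = [[INF] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = 0
    for u, v, w in edges:
        matrix[u][v] = w
        matrix[v][u] = w  # undirected: store both directions
    return matrix

def to_adjacency_list(n, edges):
    """Build per-vertex lists of (neighbor, weight): space grows with |E|."""
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))
    return adj

edges = [(0, 1, 4), (1, 2, 7), (0, 3, 1)]
matrix = to_adjacency_matrix(4, edges)
adj = to_adjacency_list(4, edges)
```

For this sparse example the matrix holds 16 entries while the lists hold 6, which is the trade-off the slide describes.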

8
MST: Prim's Algorithm
  • A spanning tree of an undirected graph G is a
    subgraph of G which is (1) a tree and (2)
    contains all the vertices of G.
  • In a weighted graph, the weight of a subgraph is
    the sum of the weights of the edges in the
    subgraph.
  • A minimum spanning tree (MST) for a weighted
    undirected graph is a spanning tree with minimum
    weight.

9
MST on the right
10
Prim
  • is a greedy algorithm
  • Is recursive
  • Start by selecting an arbitrary vertex, include
    it into the current MST (the root)
  • Grow the current MST by inserting into it the
    vertex closest to one of the vertices already in
    current MST (the vertex which (1) is outside the
    MST and (2) adds the smallest weight to the MST)

11
Prim
12
Formal Prim
  • Lines 10-13 are executed n-1 times
  • Lines 12 and 13 are executed O(n) times
  • Prim is Θ(n^2)
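The pseudocode whose line numbers this slide cites is not part of the transcript. The sketch below is one possible Θ(n^2) formulation of Prim on an adjacency matrix, not necessarily the slide's; `prim_mst`, `d`, and `parent` are illustrative names.

```python
INF = float("inf")

def prim_mst(weights, root=0):
    """Return (total MST weight, parent list) for a connected weighted graph.

    weights is an n x n symmetric matrix with INF where there is no edge.
    """
    n = len(weights)
    in_tree = [False] * n
    d = [INF] * n          # d[v]: cheapest known edge linking v to the tree
    parent = [None] * n
    d[root] = 0
    total = 0
    for _ in range(n):
        # Select the closest vertex outside the tree: O(n) per iteration.
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: d[v])
        if d[u] == INF:
            raise ValueError("graph is not connected")
        in_tree[u] = True
        total += d[u]
        # Update d for the remaining vertices: O(n) per iteration.
        for v in range(n):
            if not in_tree[v] and weights[u][v] < d[v]:
                d[v] = weights[u][v]
                parent[v] = u
    return total, parent
```

Each of the n iterations does two O(n) scans, which is where the Θ(n^2) bound comes from.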
13
Parallel Prim
  • Let p be the number of processes, and let n be
    the number of vertices.
  • The adjacency matrix is partitioned in 1-D block
    fashion, with distance vector d partitioned
    accordingly.
  • In each step, a processor selects the locally
    closest node, followed by a global (all to one)
    reduction to select globally closest node. One
    processor (say P0) contains the MST.
  • This node is inserted into MST, and the choice
    broadcast (one to all) to all processors.
  • Each processor updates its part of the d vector
    locally.
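The steps above can be simulated sequentially to make the data flow visible. This is only a sketch: the "processes" are loop iterations over contiguous blocks of the d vector, the reduction and broadcast are ordinary Python operations, and all names are assumptions.

```python
INF = float("inf")

def block_ranges(n, p):
    """Split vertex indices 0..n-1 into p contiguous blocks (1-D block mapping)."""
    size = -(-n // p)  # ceiling division
    return [range(i * size, min(n, (i + 1) * size)) for i in range(p)]

def parallel_prim_weight(weights, p, root=0):
    """Simulate parallel Prim with p processes; return the MST weight."""
    n = len(weights)
    blocks = block_ranges(n, p)
    in_tree = [False] * n
    d = [INF] * n
    d[root] = 0
    total = 0
    for _ in range(n):
        # Local step: each simulated process scans only its own block of d.
        local_best = []
        for block in blocks:
            candidates = [(d[v], v) for v in block if not in_tree[v]]
            if candidates:
                local_best.append(min(candidates))
        # All-to-one reduction: pick the globally closest vertex.
        dist, u = min(local_best)
        # One-to-all broadcast: every process learns u.
        in_tree[u] = True
        total += dist
        # Local update: each process touches only its own entries of d,
        # using its own columns of row u of the adjacency matrix.
        for block in blocks:
            for v in block:
                if not in_tree[v] and weights[u][v] < d[v]:
                    d[v] = weights[u][v]
    return total
```

The result does not depend on p; only the way the scan and update work is divided changes.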

14
Parallel Prim picture
15
Analysis
  • The cost to select the minimum entry is Θ(n/p).
  • The cost of a broadcast is Θ(log p).
  • The cost of local update of the d vector is
    Θ(n/p).
  • The parallel time per iteration is Θ(n/p + log
    p).
  • The total parallel time is given by Θ(n^2/p + n
    log p).
  • The corresponding isoefficiency is Θ(p^2 log^2 p).

16
Single-source shortest path: Dijkstra
  • For a weighted graph G = (V,E,w), the
    single-source shortest paths problem is to find
    the shortest paths from a vertex v ∈ V to all
    other vertices in V.
  • Dijkstra's algorithm is similar to Prim's
    algorithm. It maintains a set of nodes for which
    the shortest paths are known.
  • It grows this set based on the node closest to
    source using one of the nodes in the current
    shortest path set.
  • The difference: Prim stores the cost of the
    minimum-cost edge connecting a vertex in VT to u;
    Dijkstra stores the minimum cost to reach u from
    the source
  • Greedy algorithm
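A minimal sketch of the dense, array-based form of Dijkstra's algorithm, assuming non-negative weights and an adjacency matrix with INF for missing edges. The name `dijkstra_dense` and the list `l` (following the slides' l-vector) are illustrative. Compared with a Prim loop, only the update rule differs: it adds l[u] to the edge weight.

```python
INF = float("inf")

def dijkstra_dense(weights, source):
    """Return the list l of shortest-path costs from source to every vertex."""
    n = len(weights)
    done = [False] * n   # vertices whose shortest path is already known
    l = [INF] * n
    l[source] = 0
    for _ in range(n):
        # Pick the closest vertex not yet finalized.
        u = min((v for v in range(n) if not done[v]), key=lambda v: l[v])
        if l[u] == INF:
            break  # remaining vertices are unreachable
        done[u] = True
        # Relax edges out of u: cost to reach v through u.
        for v in range(n):
            if not done[v] and l[u] + weights[u][v] < l[v]:
                l[v] = l[u] + weights[u][v]
    return l
```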

17
(No Transcript)
18
Single source analysis
  • The weighted adjacency matrix is partitioned
    using the 1-D block mapping.
  • Each process selects, locally, the node closest
    to the source, followed by a global reduction to
    select next node.
  • The node is broadcast to all processors and the
    l-vector updated.
  • The difference: Prim stores the cost of the
    minimum-cost edge connecting a vertex in VT to u;
    Dijkstra stores the minimum cost to reach u from
    the source
  • The parallel performance of Dijkstra's algorithm
    is identical to that of Prim's algorithm.

19
All Pairs Shortest Path
  • Given a weighted graph G = (V,E,w), the all-pairs
    shortest paths problem is to find the shortest
    paths between all pairs of vertices vi, vj ∈ V.
  • Look at two versions of Dijkstra
  • And Floyd's algorithm

20
First of all
  • Execute n instances of the single-source shortest
    path problem, one for each of the n source
    vertices.
  • Complexity is Θ(n^3) because the complexity of
    the single-source shortest path algorithm is
    Θ(n^2)

21
Two strategies
  • Source partitioned: execute n shortest path
    problems on n processors. Each of the n nodes
    gets to be a source node.
  • Source parallel: partition the adjacency matrix
    as in parallel Prim

22
Dijkstra Source Partitioned
  • Use n processors, each processor Pi finds the
    shortest paths from vertex vi to all other
    vertices by executing Dijkstra's sequential
    single-source shortest paths algorithm.
  • It requires no interprocess communication
    (provided that the adjacency matrix is replicated
    at all processes).
  • The parallel run time of this formulation is
    Θ(n^2).
  • While the algorithm is cost optimal, it can only
    use n processors. Therefore, the isoefficiency
    due to concurrency is Θ(p^3).

23
Dijkstra Source Parallel
  • Want to keep more than n processors busy
  • Given p processors (p > n), each single-source
    shortest path problem is executed on one of n
    partitions of the processors
  • Use p/n processors for each of the problems
  • From before, the parallel time is
    TP = Θ(n^3/p) computation + Θ(n log p)
    communication
  • For cost optimality, we have p = O(n^2/log n) and
    the isoefficiency is Θ((p log p)^1.5).

24
Floyd's
  • For any pair of vertices vi, vj ∈ V, consider all
    paths from vi to vj whose intermediate vertices
    belong to the set {v1, v2, ..., vk}. Let
    p^(k)_{i,j} (of weight d^(k)_{i,j}) be the
    minimum-weight path among them.
  • If vertex vk is not in the shortest path from vi
    to vj, then p^(k)_{i,j} is the same as
    p^(k-1)_{i,j}.
  • If vk is in p^(k)_{i,j}, then we can break
    p^(k)_{i,j} into two paths: one from vi to vk and
    one from vk to vj. Each of these paths uses
    vertices from {v1, v2, ..., vk-1}.

25
Recurrence
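From our observations, the following recurrence
relation follows:
  d^(k)_{i,j} = w(vi, vj)   if k = 0
  d^(k)_{i,j} = min{ d^(k-1)_{i,j},
                     d^(k-1)_{i,k} + d^(k-1)_{k,j} }   if k ≥ 1
This equation must be computed for each pair of
nodes and for k = 1, ..., n. The serial complexity
is O(n^3).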
26
Floyd
This program computes the all-pairs shortest
paths of the graph G = (V,E) with adjacency
matrix A.
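The program shown on this slide is not in the transcript. As a stand-in, here is a minimal sketch of the standard triple-loop form of Floyd's algorithm; the function name and the use of INF for missing edges are assumptions.

```python
INF = float("inf")

def floyd_all_pairs(adjacency):
    """Return the matrix of shortest-path weights for all vertex pairs.

    adjacency[i][j] is the edge weight from i to j, INF if absent, 0 on
    the diagonal. The input matrix is not modified.
    """
    n = len(adjacency)
    d = [row[:] for row in adjacency]  # D(0) is the adjacency matrix
    for k in range(n):                 # allow vertex k as an intermediate
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```

Updating one matrix in place is safe here because row k and column k do not change during iteration k when there are no negative cycles.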
27
Parallel Floyd
  • Matrix D(k) is divided into p blocks of size
    (n/√p) x (n/√p), i.e. n^2/p elements per block
  • Each processor updates its part of the matrix
    during each iteration.
  • To compute d^(k)_{l,r}, processor Pi,j must get
    d^(k-1)_{l,k} and d^(k-1)_{k,r}.
  • In general, during the kth iteration, each of the
    √p processes containing part of the kth row sends
    it to the √p - 1 processes in the same column.
  • Similarly, each of the √p processes containing
    part of the kth column sends it to the √p - 1
    processes in the same row.
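The 2-D block mapping can be imitated in a sequential sketch, where each (pi, pj) loop body plays the role of process Pi,j and the copied row and column stand in for the broadcast segments. It assumes p is a perfect square and √p divides n; the names are illustrative.

```python
import math

INF = float("inf")

def floyd_2d_blocks(adjacency, p):
    """Simulate Floyd's algorithm on a sqrt(p) x sqrt(p) grid of processes."""
    n = len(adjacency)
    q = math.isqrt(p)
    if q * q != p or n % q != 0:
        raise ValueError("need p a perfect square and sqrt(p) dividing n")
    b = n // q                      # each block is b x b = n^2/p elements
    d = [row[:] for row in adjacency]
    for k in range(n):
        # Segments of row k and column k of D(k-1), as they would be
        # broadcast along process columns and rows.
        row_k = d[k][:]
        col_k = [d[i][k] for i in range(n)]
        for pi in range(q):
            for pj in range(q):
                # Process P(pi, pj) needs only col_k[pi*b:(pi+1)*b] and
                # row_k[pj*b:(pj+1)*b] to update its own block.
                for i in range(pi * b, (pi + 1) * b):
                    for j in range(pj * b, (pj + 1) * b):
                        via_k = col_k[i] + row_k[j]
                        if via_k < d[i][j]:
                            d[i][j] = via_k
    return d
```

With p = 1 this reduces to the serial algorithm, which gives a simple way to check the blocked version.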

28
Matrix D(k)
(a) Matrix D(k) distributed by 2-D block mapping
into √p x √p subblocks, and (b) the subblock of
D(k) assigned to process Pi,j.
29
Communication
  • In general, during the kth iteration, each of the
    √p processes containing part of the kth row sends
    it to the √p - 1 processes in the same column.
  • Similarly, each of the √p processes containing
    part of the kth column sends it to the √p - 1
    processes in the same row.

30
Communication Patterns
(a) Communication patterns used in the 2-D block
mapping. When computing d^(k)_{i,j}, information must
be sent to the highlighted process from two other
processes along the same row and column. (b) The
row and column of √p processes that contain the
kth row and column send them along process
columns and rows.
31
Floyd-Parallel Formulation
Floyd's parallel formulation using the 2-D block
mapping. P*,j denotes all the processes in the
jth column, and Pi,* denotes all the processes in
the ith row. The matrix D(0) is the adjacency
matrix.
32
Analysis
  • During each iteration of the algorithm, the kth
    row and kth column of processors perform a
    one-to-all broadcast along their rows/columns.
  • The size of this broadcast is n/√p elements,
    taking time Θ((n log p)/√p).
  • The synchronization step takes time Θ(log p).
  • The computation time is Θ(n^2/p).
  • There are n iterations of the algorithm,
    k = 1, ..., n.
  • The parallel run time of the 2-D block mapping
    formulation of Floyd's algorithm is
    TP = Θ(n^3/p) + Θ((n^2/√p) log p).

33
Analysis II
  • The above formulation can use O(n^2 / log^2 n)
    processors cost-optimally.
  • The isoefficiency of this formulation is Θ(p^1.5
    log^3 p).
  • This algorithm can be further improved by
    relaxing the strict synchronization after each
    iteration
  • Go to next slide

34
Pipelining Floyd
  • The synchronization step in parallel Floyd's
    algorithm can be removed without affecting the
    correctness of the algorithm.
  • A process starts working on the kth iteration as
    soon as it has computed the (k-1)th iteration and
    has the relevant parts of the D(k-1) matrix.

35
Pipelining Floyd
Communication protocol followed in the pipelined
2-D block mapping formulation of Floyd's
algorithm. Assume that process 4 at time t has
just computed a segment of the kth column of the
D(k-1) matrix. It sends the segment to processes
3 and 5. These processes receive the segment at
time t + 1 (where the time unit is the time it
takes for a matrix segment to travel over the
communication link between adjacent processes).
Similarly, processes farther away from process 4
receive the segment later. Process 1 (at the
boundary) does not forward the segment after
receiving it.
36
Pipelining Analysis
  • In each step, n/√p elements of the first row are
    sent from process Pi,j to Pi+1,j.
  • Similarly, elements of the first column are sent
    from process Pi,j to process Pi,j+1.
  • Each such step takes time Θ(n/√p).
  • After Θ(√p) steps, process P√p,√p gets the
    relevant elements of the first row and first
    column in time Θ(n).
  • The values of successive rows and columns follow
    after time Θ(n^2/p) in a pipelined mode.
  • Process P√p,√p finishes its share of the
    shortest path computation in time Θ(n^3/p) + Θ(n).
  • When process P√p,√p has finished the (n-1)th
    iteration, it sends the relevant values of the
    nth row and column to the other processes.

37
Pipelining Analysis
  • The overall parallel run time of this formulation
    is TP = Θ(n^3/p) + Θ(n).
  • The pipelined formulation of Floyd's algorithm
    uses up to O(n^2) processes efficiently.
  • The corresponding isoefficiency is Θ(p^1.5).

38
Comparison of shortest path algorithms
  • Assumption: the parallel architecture has O(p)
    bisection bandwidth (the minimum communication
    between the 2 halves of the network)
  • Conclusion: pipelined Floyd is the most scalable
    (lowest isoefficiency and run time) and can use
    Θ(n^2) processors
39
Transitive Closure of a Graph
  • If G = (V,E) is a graph, then the transitive
    closure of G is defined as the graph G* = (V,E*),
    where E* = {(vi,vj) | there is a path from vi to
    vj in G}.
  • The connectivity matrix of G is a matrix A* =
    (a*i,j) such that a*i,j = 1 if there is a path
    from vi to vj or i = j, and a*i,j = ∞ otherwise.
  • To compute A* we assign a weight of 1 to each
    edge of E and use an all-pairs shortest paths
    algorithm on the graph.
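A small sketch of this construction, assuming a directed graph given as (u, v) pairs with 0-based vertices. It uses Floyd's algorithm as the all-pairs step and then maps finite distances to 1; `connectivity_matrix` is a hypothetical name.

```python
INF = float("inf")

def connectivity_matrix(n, edges):
    """Return A*: 1 where a path exists (or i == j), INF otherwise."""
    # Give every edge weight 1; the diagonal is 0 so that i reaches itself.
    d = [[INF] * n for _ in range(n)]
    for i in range(n):
        d[i][i] = 0
    for u, v in edges:
        d[u][v] = 1
    # All-pairs shortest paths (Floyd's algorithm).
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # A finite distance means a path exists.
    return [[1 if d[i][j] < INF else INF for j in range(n)] for i in range(n)]
```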

40
Connected Components of a Graph
  • G = (V,E); V = C1 ∪ C2 ∪ ... ∪ Ck
  • u,v belong to Ci iff there is a path from u to v
    and vice versa

Picture
41
Depth First Search
  • Perform DFS on the graph to get a forest - each
    tree in the forest corresponds to a connected
    component.
  • Part (b) of the figure shows the components
  • Each component is a tree
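A minimal sketch of component labeling by depth-first search, assuming an undirected graph given as an edge list. An explicit stack replaces recursion; the names are illustrative.

```python
def connected_components(n, edges):
    """Return a list mapping each vertex to the index of its component."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    component = [None] * n
    count = 0
    for start in range(n):
        if component[start] is not None:
            continue  # already reached from an earlier root
        # Each new root starts one tree of the DFS forest.
        component[start] = count
        stack = [start]
        while stack:
            u = stack.pop()
            for v in adj[u]:
                if component[v] is None:
                    component[v] = count
                    stack.append(v)
        count += 1
    return component
```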

42
Parallel Component algorithm
  • Partition adjacency matrix into p sub-graphs Gi
    and assign each Gi to a process Pi
  • Each Pi computes spanning forest of Gi
  • Merge spanning forests pairwise until there is
    one spanning forest

43
Parallel Components
44
Ops for merging
  • Algorithm uses disjoint sets of edges.
  • Ops for the disjoint sets:
  • find(x): returns a pointer to the representative
    element of the set containing x. Each set has its
    own unique representative.
  • union(x, y): unites the sets containing the
    elements x and y.

45
Merging Ops
  • For merging forest A into forest B, for each edge
    (u,v) of A, a find operation is performed to
    determine if the vertices are in the same tree of
    B.
  • If not, then the two trees (sets) of B containing
    u and v are united by a union operation.
  • Otherwise, no union operation is necessary.
  • Merging A and B requires at most 2(n-1) find
    operations and (n-1) union operations.

46
Parallel Block Mapping
  • The n x n adjacency matrix is partitioned into p
    blocks.
  • Each processor can compute its local spanning
    forest in time Θ(n^2/p).
  • Merging is done by embedding a logical tree into
    the topology. There are log p merging stages, and
    each takes time Θ(n). Thus, the cost due to
    merging is Θ(n log p).
  • During each merging stage, spanning forests are
    sent between nearest neighbors. Θ(n) edges of the
    spanning forest are transmitted.

47
Performance of Block Mapping
  • For a cost-optimal formulation, p = O(n / log n).
  • The corresponding isoefficiency is Θ(p^2 log^2 p).
  • The performance is similar to Prim's and
    Dijkstra's single-source shortest path algorithm

48
Sparse Graphs
  • A graph G = (V,E) is sparse if |E| is much
    smaller than |V|^2

49
Algorithms for sparse graphs
  • Can reduce the complexity of dense graph
    algorithms by making use of an adjacency list
    instead of an adjacency matrix
  • E.g. Prim's algorithm is Θ(n^2) for a dense matrix
    and Θ(|E| log n) for a sparse matrix
  • Key to good performance of dense matrix
    algorithms was the ability to assign roughly
    equal workloads to all of the processors and to
    keep the communication local
  • Floyd: assigned equal-size blocks of consecutive
    rows and columns of the adjacency matrix =>
    local communication

50
Sparse difficulties
  • Partitioning the adjacency matrix is harder than
    it looks
  • Assign equal numbers of vertices, with their
    adjacency lists, to processors: some processors
    may have more links than others.
  • Assign equal numbers of links => need to split
    the adjacency list of a vertex among processors
    => lots of communication
  • Works for certain structures (next slide)

51
Grid graph: a certain structure
  • If vertices have more or less the same degree it
    works.

52
Maximal Independent Set
  • A set of vertices I ⊆ V is called independent if
    no pair of vertices in I is connected via an edge
    in G. An independent set is called maximal if by
    including any other vertex not in I, the
    independence property is violated.

53
Who cares?
  • Maximal independent sets can be used to find how
    many parallel tasks from a task graph can be
    executed
  • Used in graph coloring algorithms

54
Simple MIS algorithm
  • Start with an empty MIS I, and assign all
    vertices to a candidate set C.
  • A vertex v from C is moved into I and all
    vertices adjacent to v are removed from C.
  • This process is repeated until C is empty.
  • Problem: this is a serial algorithm
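A minimal sketch of the serial procedure, assuming the graph is an adjacency list indexed by vertex. Choosing the smallest-numbered candidate is an arbitrary tie-break added here to make the output deterministic.

```python
def greedy_mis(adj):
    """Return a maximal independent set of the graph as a Python set."""
    candidates = set(range(len(adj)))   # the candidate set C
    independent = set()                 # the set I
    while candidates:
        v = min(candidates)             # any vertex of C would do
        independent.add(v)
        candidates.discard(v)
        candidates.difference_update(adj[v])  # neighbors can no longer join I
    return independent
```

Each step depends on the previous one's removals, which is why this form is inherently serial.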

55
Luby's algorithm
  • In the beginning C is set equal to V, I is empty
  • Assign random numbers to all of the nodes of the
    graph
  • If a vertex has a smaller random number than all
    of its adjacent vertices, then
  • move it from C into I
  • delete all of its adjacent vertices from C
  • Repeat above steps until C is empty
  • Takes O(log |V|) steps on the average
  • Luby invented this algorithm for graph coloring
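The rounds above can be sketched sequentially; in a parallel setting every vertex's check within a round is independent of the others. The function name and the `seed` parameter are assumptions for illustration.

```python
import random

def luby_mis(adj, seed=0):
    """Return a maximal independent set using Luby-style random rounds."""
    rng = random.Random(seed)
    candidates = set(range(len(adj)))   # C starts as V
    independent = set()                 # I starts empty
    while candidates:
        # Fresh random numbers for the remaining candidates each round.
        r = {v: rng.random() for v in candidates}
        # A vertex wins if it beats every neighbor that is still a candidate.
        winners = {
            v for v in candidates
            if all(r[v] < r[u] for u in adj[v] if u in candidates)
        }
        independent |= winners
        for v in winners:
            candidates.discard(v)
            candidates.difference_update(adj[v])
    return independent
```

Two adjacent vertices cannot both win a round, since each would need the strictly smaller number, so the result is always independent.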

56
Parallel Luby for Shared Address Space
  • 3 arrays of size |V|
  • I: independence array. I(i) = 1 if i is in the
    MIS. Initially all I(i) = 0
  • R: random number array
  • C: candidate array. C(i) = 1 if i is a candidate
  • Partition C among p processes
  • Each process generates a random number for each
    of its vertices
  • Each process checks to see if the random numbers
    of its vertices are smaller than those of
    adjacent vertices
  • Process zeroes the C entries corresponding to
    adjacent vertices

57
Single Source Shortest Paths
  • 2 steps happen in each iteration
  • Extract u ∈ (V - VT) such that l[u] = min{l[v] |
    v ∈ (V - VT)}
  • For each vertex v ∈ (V - VT), compute
    l[v] = min{l[v], l[u] + w(u,v)}
  • Makes sense to use an adjacency list for the last
    equation
  • Use a priority queue to store l values with the
    smallest on top
  • Priority queue implemented with a binary heap

58
Johnson
Updating vertices in the heap is the big time
sink: O(log n) per update and |E| updates =>
Θ(|E| log n) complexity
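A sketch of the heap-based variant using Python's `heapq`. One assumption to flag: `heapq` has no decrease-key operation, so this version pushes a new entry on every improvement and skips stale entries when they are popped, which is a common substitute rather than the in-place heap update the slide describes.

```python
import heapq

INF = float("inf")

def dijkstra_heap(adj, source):
    """Return shortest-path costs from source; adj[u] is a list of (v, w)."""
    l = [INF] * len(adj)
    l[source] = 0
    heap = [(0, source)]            # smallest l value on top
    while heap:
        dist, u = heapq.heappop(heap)
        if dist > l[u]:
            continue                # stale entry: u was improved later
        for v, w in adj[u]:
            if dist + w < l[v]:
                l[v] = dist + w
                heapq.heappush(heap, (l[v], v))   # O(log n) per update
    return l
```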
59
Parallel Johnson: first attempts
  • Use one processor P0 to house the queue; the other
    processes update l[v] and give the values to P0
  • Problems:
  • Single queue is a bottleneck
  • Only a small number of processes can be kept busy
    (|E|/|V|)
  • So distribute the queue to the processes.
  • Need a low-latency architecture to make this work

60
First attempts, continued
  • Still have low speedup (O(log n)) if each update
    takes O(1)
  • Can extract multiple nodes from the queue: all
    vertices u with the same minimum distance
  • Why? Can process nodes with the same distance in
    any order
  • Still not enough speedup

61
Solution: speculative decomposition
  • When process Pi extracts the vertex u ∈ Vi, it
    sends a message to processes that store vertices
    adjacent to u.
  • Process Pj, upon receiving this message, sets the
    value of l[v] stored in its priority queue to
    min{l[v], l[u] + w(u,v)}.
  • If a shorter path has been discovered to node v,
    it is reinserted back into the local priority
    queue.
  • The algorithm terminates only when all the queues
    become empty.

62
Speculative decomposition: Distributed Memory
  • Partition queue among the processors
  • Partition vertices and adjacency lists among the
    processors
  • Each processor updates its local queue and sends
    the results to the processors with adjacent
    vertices
  • Recipient processor updates its value for
    shortest path

63
More precisely
  • For example, u ∈ Pi and v ∈ Pj and (u,v) is an
    edge. Pi extracts u and updates l[u]
  • Pi sends a message to Pj with the new value for v,
    which is l[u] + w(u,v).
  • Pj compares l[u] + w(u,v) to sp[v]
  • If smaller, then it becomes the new sp[v]
  • If larger, then discard
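The exchange above can be imitated with one priority queue per simulated process. This is a sequential sketch, not a distributed implementation: the processes take turns in a loop, a "message" is a direct push onto the owner's queue, and the cyclic `owner` mapping is an assumption chosen only for simplicity.

```python
import heapq

INF = float("inf")

def speculative_sssp(adj, source, p):
    """Simulate speculative shortest paths with p per-process queues."""
    def owner(v):
        return v % p                # hypothetical vertex-to-process mapping

    sp = [INF] * len(adj)           # sp[v]: best known cost, held by owner(v)
    queues = [[] for _ in range(p)]
    sp[source] = 0
    heapq.heappush(queues[owner(source)], (0, source))
    while any(queues):              # terminate when all queues are empty
        for i in range(p):
            if not queues[i]:
                continue
            dist, u = heapq.heappop(queues[i])
            if dist > sp[u]:
                continue            # superseded by a shorter path
            for v, w in adj[u]:
                candidate = dist + w          # message from Pi to owner(v)
                if candidate < sp[v]:
                    sp[v] = candidate         # shorter path found: (re)insert
                    heapq.heappush(queues[owner(v)], (candidate, v))
                # otherwise the message is discarded
    return sp
```

A vertex may be processed more than once if a shorter path reaches it later, which is the speculative part; with non-negative weights the values still converge to the shortest-path costs.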

64
In the beginning
  • Only the source process has a non-empty priority
    queue
  • Then the wave happens
65
Mapping: 2-D Block, n/√p x n/√p mesh

66
Analysis of 2-D
  • At most O(√p) processes are busy at any time
    because the wave moves diagonally (diagonal √p)
  • Max speedup is S = W / (W/√p) = √p
  • E = 1/√p
  • Lousy efficiency

67
2-D Cyclic
68
Analysis
  • Improves things because vertices are further
    apart
  • Bad news: more communication

69
1-D Block Mapping
70
1-D rules
  • Better idea because as the wave spreads more
    processors get involved concurrently
  • Assume p/2 processes are busy, then
  • S = W / (W/(p/2)) = p/2
  • E = 1/2
  • Improvement over 2-D
  • Bad side: uses O(n) processes