Presented by Ozgur D. Sahin - PowerPoint PPT Presentation

Slides: 24
Provided by: Ozgur6
Learn more at: https://web.ece.ucsb.edu

Transcript and Presenter's Notes

1
Presented by Ozgur D. Sahin
2
Outline
  • Introduction
  • Neighborhood Functions
  • ANF Algorithm
  • Modifications
  • Experimental Results
  • Data Mining using ANF
  • Conclusions

3
Introduction: Motivation
  • Graph-based data is becoming more important
  • Internet modeling, academic citations, phone
    records, movie databases, CAD circuits
  • Example questions:
  • How robust is the Internet to failures?
  • What are the most influential database papers?
  • What is the best opening move in tic-tac-toe?
  • Are phone call patterns in Asia similar to those
    in the U.S.?
  • Goal: Quickly answer questions on
    graph-represented data

4
Answering Questions
  • We can answer these questions if we can compute
    the following three properties related to
    connectivity and neighborhood structure
  • Graph Similarity: Decide if two graphs have
    similar connectivity/neighborhood structure
  • Subgraph Similarity: Compare how two subgraphs of
    a given graph are connected
  • Vertex Importance: Assign an importance to each
    node based on its connectivity
  • This paper provides such a tool: ANF (Approximate
    Neighborhood Function)

5
Challenges
  • The following properties should be satisfied:
  • Error Guarantees: Accurate estimates
  • Fast: Scale linearly with n (# of nodes) and m
    (# of edges)
  • Low Storage
  • Adapts to available memory
  • Parallelizable
  • Sequential scan of the edge file
  • Estimates per node

6
Definitions - Neighborhood Functions
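dist(u,v) = # of edges on the shortest path from u
to v. Two neighborhood functions are defined in
terms of dist: the individual neighborhood function
IN(x,h) for a single node x, and the neighborhood
function N(h) for the whole graph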
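As a reading aid, the two functions can be written out as below. This is a reconstruction under the assumption that IN(x,h) is the size of the set M(x,h) described on a later slide (the nodes within distance h of x) and that N(h) sums it over all nodes; V denotes the node set.

```latex
IN(x, h) = \left|\{\, v \in V : \mathrm{dist}(x, v) \le h \,\}\right|

N(h) = \left|\{\, (u, v) : u \in V,\ v \in V,\ \mathrm{dist}(u, v) \le h \,\}\right|
     = \sum_{x \in V} IN(x, h)
```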
7
Definitions - Neighborhood Functions
Generalize these two definitions to deal with
subgraphs
8
Basic ANF Algorithm
  • N(h) can be computed exactly by a graph traversal
  • A graph traversal accesses edges in random order
  • Its running time is O(nm)
  • ANF instead accesses edges in sequential order
  • M(x,h) is the set of nodes within distance h of
    node x
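The idea of building M(x,h) from sequential edge scans can be sketched as follows. This is a minimal exact version using Python sets (no approximation yet), assuming an undirected graph with nodes numbered 0..n-1; the function name `exact_neighborhood` is made up for this illustration.

```python
def exact_neighborhood(n, edges, max_h):
    """Return [N(0), N(1), ..., N(max_h)] for an undirected graph on nodes 0..n-1."""
    # prev[x] holds M(x, h-1); at h = 0 every node reaches only itself.
    prev = [{x} for x in range(n)]
    totals = [sum(len(s) for s in prev)]
    for _ in range(max_h):
        cur = [set(s) for s in prev]
        for x, y in edges:  # one sequential scan of the edge list per hop
            cur[x] |= prev[y]
            cur[y] |= prev[x]
        prev = cur
        totals.append(sum(len(s) for s in prev))
    return totals


# A ring with 5 nodes: each node reaches 1, 3, then all 5 nodes.
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
print(exact_neighborhood(5, ring, 2))  # [5, 15, 25]
```

Storing every M(x,h) as an explicit set is what makes this version expensive, which is the problem the next slides address.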

9
Basic ANF Algorithm
  • How can we compute the number of distinct
    elements in the set M(x,h)?
  • A dictionary data structure: O(n^2 log n)
    time/space
  • Use bits to mark membership: O(n^2) space
  • Use a probabilistic counting algorithm
  • Approximate set sizes using log n + r bits

10
Probabilistic counting algorithm
  • Approximate set sizes using log n + r bits
  • Instead of one bit per node, give half the nodes
    bit 0, a quarter of them bit 1, and so on (a node
    is given bit i with probability 1/2^(i+1))
  • The approximation of the size of a set is
    proportional to 2^b, where b is the least bit
    that has not been set in the bit representation
    of this set
  • Use k parallel approximations
  • M(x,h) is represented by k(log n + r) bits
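The counting scheme above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the names `random_bit`, `least_zero_bit` and `estimate_size` are invented here, a dictionary stands in for the hash function a real implementation would use, and 0.77351 is the Flajolet-Martin correction constant.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def random_bit(rng, nbits):
    """Pick bit i with probability 1/2^(i+1), capped at nbits - 1."""
    i = 0
    while i < nbits - 1 and rng.random() < 0.5:
        i += 1
    return i


def least_zero_bit(mask):
    """Index of the lowest bit that is not set in an integer mask."""
    b = 0
    while mask & (1 << b):
        b += 1
    return b


def estimate_size(items, k=64, nbits=32, seed=0):
    """Estimate the number of distinct items using k parallel bit masks."""
    rng = random.Random(seed)
    assigned = {}  # item -> its k bit positions (stand-in for a hash function)
    masks = [0] * k
    for item in items:
        if item not in assigned:
            assigned[item] = [random_bit(rng, nbits) for _ in range(k)]
        for j, bit in enumerate(assigned[item]):
            masks[j] |= 1 << bit  # duplicates set the same bits again
    avg_b = sum(least_zero_bit(m) for m in masks) / k
    return 2 ** avg_b / PHI
```

Because a repeated item always sets the same bits, duplicates do not change the masks, so the estimate depends only on the number of distinct items.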

11
Basic ANF Algorithm
  • Consider a ring with 5 nodes
  • Example for k = 3 and r = 0
  • Bit 0 is the leftmost bit in each 3-bit mask
  • M(2,1) is the union of M(2,0), M(1,0), and
    M(3,0)
  • M(2,1) = M(2,0) OR M(1,0) OR M(3,0)
  • IN(2,1) is computed from the average of the least
    zero bit positions
  • Avg = (2+1+1)/3 = 4/3, so IN(2,1) ≈
    2^(4/3)/0.77351 ≈ 3.25
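The arithmetic of this example can be replayed in code. The slide's actual bit masks are not in the transcript, so the masks below are hypothetical values chosen only so that the least zero bit positions come out as 2, 1 and 1; the helper names are likewise made up for this sketch.

```python
PHI = 0.77351  # Flajolet-Martin correction constant


def least_zero_bit(mask):
    """Position of the first '0' in a mask string whose leftmost character is bit 0."""
    pos = mask.find("0")
    return len(mask) if pos == -1 else pos


def bitwise_or(*masks):
    """OR together equal-length mask strings."""
    return "".join(
        "1" if any(m[i] == "1" for m in masks) else "0"
        for i in range(len(masks[0]))
    )


def estimate(parallel_masks):
    """Size estimate from k parallel masks: 2^(average least zero bit) / PHI."""
    avg = sum(least_zero_bit(m) for m in parallel_masks) / len(parallel_masks)
    return 2 ** avg / PHI


# Hypothetical M(x,0) masks (k = 3 masks of 3 bits each) for nodes 1, 2, 3.
m1 = ["100", "100", "100"]
m2 = ["010", "100", "001"]
m3 = ["100", "001", "100"]

# M(2,1) = M(2,0) OR M(1,0) OR M(3,0), mask by mask.
m2_h1 = [bitwise_or(a, b, c) for a, b, c in zip(m2, m1, m3)]
print(m2_h1)  # ['110', '101', '101']
print([least_zero_bit(m) for m in m2_h1])  # [2, 1, 1]
print(estimate(m2_h1))  # about 3.26; the true size of M(2,1) is 3
```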

12
Basic ANF Algorithm
13
Modifications
  • M(x,h) uses M(y,h-1) but not M(y,h-2), so only
    M(y,h-1) needs to be kept during iteration h
  • Include a mark bit to handle generalized
    neighborhood functions
  • Break bit masks into smaller pieces if they are
    larger than the available memory
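The first modification (keeping only the previous iteration's masks) can be sketched as a two-buffer loop. This is an assumption-laden sketch rather than the tool itself: the name `anf`, its parameters, and the use of Python integers as bit masks are choices made for this illustration, and the mark bit and memory partitioning are not shown.

```python
import random

PHI = 0.77351  # Flajolet-Martin correction constant


def least_zero_bit(mask):
    """Index of the lowest bit that is not set in an integer mask."""
    b = 0
    while mask & (1 << b):
        b += 1
    return b


def anf(n, edges, max_h, k=64, nbits=32, seed=0):
    """Return approximate [N(1), ..., N(max_h)] for an undirected graph."""
    rng = random.Random(seed)

    def random_bit():
        i = 0
        while i < nbits - 1 and rng.random() < 0.5:
            i += 1
        return i

    # prev[x][j] is the j-th parallel mask of M(x, h-1); M(x, 0) = {x}.
    prev = [[1 << random_bit() for _ in range(k)] for _ in range(n)]
    estimates = []
    for _ in range(max_h):
        cur = [list(masks) for masks in prev]
        for x, y in edges:  # sequential scan of the edge list
            for j in range(k):
                cur[x][j] |= prev[y][j]
                cur[y][j] |= prev[x][j]
        prev = cur  # masks for h-2 are dropped here
        total = 0.0
        for masks in prev:
            avg_b = sum(least_zero_bit(m) for m in masks) / k
            total += 2 ** avg_b / PHI
        estimates.append(total)
    return estimates
```

Only two mask arrays are alive at any time, so storage stays at roughly 2nk masks regardless of how many hops are computed.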

14
Leading Ones Compression
  • As ANF runs, most bit masks will have many
    leading 1s
  • Compress bit masks by including a counter of the
    leading ones
  • Bit shuffling of k parallel bit masks enables
    further compression
  • 11010, 11100 → 1111011000
  • Provides up to 23% speed-up
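The shuffling step in the example above can be reproduced with a short sketch. The function names are hypothetical, and masks are kept as strings for readability; a real implementation would work on packed bits.

```python
def shuffle_bits(masks):
    """Interleave equal-length mask strings: bit 0 of every mask, then bit 1, and so on."""
    return "".join("".join(bits) for bits in zip(*masks))


def compress_leading_ones(bits):
    """Return (number of leading '1's, remaining bits)."""
    stripped = bits.lstrip("1")
    return len(bits) - len(stripped), stripped


shuffled = shuffle_bits(["11010", "11100"])
print(shuffled)  # 1111011000
print(compress_leading_ones(shuffled))  # (4, '011000')
```

Shuffling groups the low-order bits of all k masks at the front, so a single counter covers the leading ones of every mask at once.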

15
Experiments
  • Data Sets: 3 real (Router, Cornell, Cora) and 4
    synthetic
  • Evaluation Metric

16
Experiments - Accuracy
  • k = 64
  • ANF achieves less than 7% error
  • ANF's error is independent of the data set
17
Experiments - Time
18
Experiments - Scalability
19
Data Mining with ANF
  • The ANF tool can be used to answer graph mining
    problems
  • Best opening move for the Tic-Tac-Toe game
  • Clustering movie classes
  • Measuring the robustness of the Internet
  • Use summarized statistics derived from the
    neighborhood function
  • Many real graphs follow a power law
  • N(h) ∝ h^H, where H is defined as the hop
    exponent
  • Use the individual hop exponent as a measure of
    importance
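One plausible way to obtain a hop exponent from a neighborhood function is a least-squares line fit in log-log space. The slides do not say how H is fitted, so the method and the name `hop_exponent` are assumptions of this sketch.

```python
import math


def hop_exponent(neighborhood):
    """Slope of log N(h) versus log h, where neighborhood[i] is N(i + 1)."""
    xs = [math.log(h) for h in range(1, len(neighborhood) + 1)]
    ys = [math.log(v) for v in neighborhood]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic check: N(h) = 3 * h^2 follows a power law with H = 2.
synthetic = [3 * h ** 2 for h in range(1, 11)]
print(hop_exponent(synthetic))
```

Applying the same fit to an individual neighborhood function IN(x,h) would give a per-node exponent of the kind the last bullet refers to.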

20
Tic-Tac-Toe
  • Show: The best opening move is the center square
  • Each possible board configuration is a node, and
    there is an edge from board x to board y if y can
    be reached from x by one move
  • Compute individual neighborhood functions for
    each of the 9 possible first moves

21
Clustering Movies
  • Consider IMDb (Internet Movie Database), where
    each movie is identified as being in one or more
    classes (such as documentaries, dramas, comedies,
    etc.)
  • Construct a graph for each class and cluster
    similar ones

22
Internet Router Data
  • How robust is the Internet to router failures?
  • Delete some number of routers and measure
    connectivity
  • Random failures do not disrupt the Internet
  • Targeted failures can dramatically disrupt it
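The delete-and-measure experiment can be illustrated on a toy graph. This sketch is not the paper's setup: it uses an exact breadth-first search instead of ANF, a made-up star topology instead of real router data, and a hypothetical `connected_pairs` helper as the connectivity measure.

```python
from collections import deque


def connected_pairs(adj, removed=frozenset()):
    """Count ordered pairs (u, v), u != v, of surviving nodes joined by a path."""
    alive = set(adj) - set(removed)
    seen = set()
    pairs = 0
    for start in alive:
        if start in seen:
            continue
        # Breadth-first search to measure the size of this component.
        queue = deque([start])
        seen.add(start)
        size = 0
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v in alive and v not in seen:
                    seen.add(v)
                    queue.append(v)
        pairs += size * (size - 1)
    return pairs


# Hypothetical toy topology: hub 0 connected to leaves 1..5.
star = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
print(connected_pairs(star))               # 30
print(connected_pairs(star, removed={3}))  # 20: losing a leaf costs little
print(connected_pairs(star, removed={0}))  # 0: losing the hub disconnects everything
```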

23
Conclusions
  • ANF uses an efficient and accurate approximation
    algorithm
  • The ANF tool provides several advantages,
    including the following:
  • Accurate
  • Fast
  • Low storage requirements
  • Parallelizable
  • ANF makes it possible to answer many interesting
    questions