Scaling link-based similarity search

1
Scaling link-based similarity search
Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the
Hungarian Academy of Sciences
Budapest University of Technology and Economics
2
Outline

Introduction
Scaling link-based similarity search
First scalable algorithm for SimRank
New similarity functions
Experiments

3
Introduction / Motivation

Similarity search on the Web

4
Approaches / Related Results

Text-based
Classic IR
Min-hash fingerprinting (Broder 98)
Pure link-based
Single-step: cocitation, bibliographic coupling
Multi-step
Companion (Dean, Henzinger, 98)
SimRank (Jeh, Widom, 02)
Hybrid
Anchor text based (Haveliwala et al. 02)

[Slide callouts: random access; quadratic]
5
Scalability requirements

Architecture

Webgraph
V = 8,000,000,000 web pages
E = 80,000,000,000 hyperlinks
Indexing
Limited main memory, stream access to graph
Within sorting time
Parallelizable
Distributed index database
Query
Limited number of DB accesses
Parallelizable

[Diagram: Webgraph → Indexing → Database (×3) → Query: Sim(u,v)? Top(u)?]
6
SimRank

Jeh and Widom, 2002
The similarity of two pages is the average
similarity of their referring pages
Formalized as a PageRank-like equation
Power iteration: quadratic storage and time
Goal: quadratic → linear
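
To make the quadratic cost concrete, here is a minimal sketch of the naive power iteration for the Jeh and Widom recurrence sim(u,v) = c / (|I(u)| |I(v)|) · Σ sim(u',v'), summed over referring pages u' of u and v' of v. The function name `simrank_naive`, the dictionary-based graph format and the value c = 0.8 are assumptions made for illustration; the table of all page pairs is exactly the quadratic storage the slide refers to.

```python
def simrank_naive(in_nbrs, c=0.8, iterations=10):
    """Naive SimRank. in_nbrs maps each page to the list of pages linking to it."""
    nodes = list(in_nbrs)
    # One score per ordered pair of pages: quadratic storage.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new[(u, v)] = 1.0
                elif not in_nbrs[u] or not in_nbrs[v]:
                    new[(u, v)] = 0.0  # no referring pages on one side
                else:
                    total = sum(sim[(a, b)]
                                for a in in_nbrs[u] for b in in_nbrs[v])
                    new[(u, v)] = c * total / (len(in_nbrs[u]) * len(in_nbrs[v]))
        sim = new
    return sim


# Tiny example: page w links to both u and v.
graph = {"w": [], "u": ["w"], "v": ["w"]}
scores = simrank_naive(graph)
```

In this toy graph u and v share their single referring page, so their score settles at c after one iteration.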

7
Randomization

For pages u and v, start two random walks from
them, following the links backwards.
Let t be the first meeting time
Jeh, Widom: sim(u,v) = expected value of c^t
Our algorithm
Monte Carlo method:
simulate N independent pairs of random walks
approximate sim with the average of c^t
Index DB: N random walks for each page
Query: calculate meeting times
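
The index/query split above can be sketched as follows. This is an in-memory toy under stated assumptions, not the authors' implementation: `build_index`, `first_meeting_time` and `query_sim` are hypothetical names, full paths are stored, the walk length is capped at a fixed `length`, and c = 0.8 is an arbitrary choice.

```python
import random


def build_index(in_nbrs, n_sim=100, length=10, seed=0):
    """index[i][page] = reverse random walk of simulation i starting at page."""
    rng = random.Random(seed)
    index = []
    for i in range(n_sim):
        walks = {}
        for page in in_nbrs:
            path, node = [page], page
            for step in range(length):
                preds = in_nbrs[node]
                if not preds:          # no referring pages: the walk stops
                    break
                node = rng.choice(preds)
                path.append(node)
            walks[page] = path
        index.append(walks)
    return index


def first_meeting_time(path_u, path_v):
    """First step t at which both walks stand on the same page, else None."""
    for t, (a, b) in enumerate(zip(path_u, path_v)):
        if a == b:
            return t
    return None


def query_sim(index, u, v, c=0.8):
    """Average of c^t over all simulations; walks that never meet add 0."""
    total = 0.0
    for walks in index:
        t = first_meeting_time(walks[u], walks[v])
        if t is not None:
            total += c ** t
    return total / len(index)
```

A query only reads the stored walks of u and v, which is why the number of database accesses does not depend on the size of the graph.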

8
Derandomization

(partially)
trick 1: pair-wise independence is enough
trick 2: anything after the first meeting is
irrelevant
→ coalescing (sticky) walks
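
One way to obtain coalescing walks is to draw, at every step, a single random in-link for each page and let every walk currently standing on that page follow it. Walks on different pages still move independently of each other, while walks that have met can never separate again. The sketch below assumes this construction; `coalescing_walks` is a hypothetical name and the graph format is the same illustrative dictionary of in-neighbour lists.

```python
import random


def coalescing_walks(in_nbrs, length, seed=0):
    """One simulation: a reverse walk from every page, sharing per-page choices."""
    rng = random.Random(seed)
    paths = {u: [u] for u in in_nbrs}
    alive = set(in_nbrs)               # walks that have not hit a dead end
    for _ in range(length):
        # One shared random in-link per page for this step.
        step = {x: rng.choice(preds) for x, preds in in_nbrs.items() if preds}
        for u in list(alive):
            here = paths[u][-1]
            if here in step:
                paths[u].append(step[here])
            else:
                alive.discard(u)       # page without in-links: walk ends
    return paths
```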

9
Compact storage

trick 3: We only need the time of the first
meeting, not the path itself

For the path of u4, the first smaller-indexed
path it meets is the path of u3; the meeting time
is 3.
Storage: L integers/path → 2 integers/path
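
A sketch of how two integers per path can be enough. For each path we keep only the smaller-indexed path it meets first and the time of that meeting. Because coalescing walks satisfy meet(u,w) ≤ max(meet(u,v), meet(v,w)), the first meeting time of any two paths can then be read off as the largest stored time on the chain of links connecting them. The names `compact_fingerprint` and `meeting_time` are hypothetical, and the brute-force construction is for illustration only; it is not the streaming indexing procedure of the talk.

```python
def compact_fingerprint(paths):
    """paths[i][t] = page of walk i at step t (walks must be coalescing).

    Returns, per walk, (meeting time, index of smaller-indexed walk) or None.
    """
    fingerprint = []
    for i, p in enumerate(paths):
        best = None
        for j in range(i):
            for t, (a, b) in enumerate(zip(p, paths[j])):
                if a == b:
                    if best is None or t < best[0]:
                        best = (t, j)
                    break
        fingerprint.append(best)
    return fingerprint


def meeting_time(fingerprint, u, v):
    """First meeting time of walks u and v, or None if they never meet."""
    t_max = 0
    while u != v:
        if u < v:
            u, v = v, u                # always climb from the larger index
        if fingerprint[u] is None:
            return None                # u met no smaller-indexed walk
        t, parent = fingerprint[u]
        t_max = max(t_max, t)
        u = parent
    return t_max


# Four hand-made coalescing walks (indices 0..3).
paths = [
    ["u1", "a", "r", "r"],
    ["u2", "b", "r", "r"],
    ["u3", "c", "d", "r"],
    ["u4", "e", "d", "r"],
]
fp = compact_fingerprint(paths)
```

Here the last walk stores only the pair (2, 2): it first meets walk 2, at step 2. Its meeting time with walk 1 is recovered by following the stored links.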
10
Gains

V = no. of pages (10^9)
N = no. of indep. simulations (100)
Indexing: stream access to the graph, V cells of
memory (or external memory)
Index database size: N·V (500 GB)
Query: 2N disk seeks, time proportional to the
number of results
Parallelizable to N machines (5 GB storage, 2
disk seeks/query each)

11
Parallelization

Each back-end server: one simulation
Query: ask N servers, merge the results
Fault tolerance:
when a server fails,
merge N-1 result sets
Load balance: ask any N servers
Adapt to workload: under heavy load, ask fewer
servers → slight loss of precision
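
The merge step can be sketched as a plain average over whichever servers answered. The function name `merge_top`, the per-server result format (a dictionary from page to its c^t contribution) and the use of None for a failed server are assumptions for illustration.

```python
from collections import defaultdict


def merge_top(result_sets, k=10):
    """Average per-page scores over the servers that responded."""
    answered = [r for r in result_sets if r is not None]
    if not answered:
        return []
    totals = defaultdict(float)
    for result in answered:
        for page, score in result.items():
            totals[page] += score
    n = len(answered)                  # N, or fewer after failures
    ranked = sorted(((s / n, p) for p, s in totals.items()), reverse=True)
    return [(p, s) for s, p in ranked[:k]]


# Three servers were asked; the second one failed.
merged = merge_top([{"a": 0.5, "b": 0.25}, None, {"a": 0.5}])
```

Dropping a server only shrinks the sample the average is taken over, which matches the slight loss of precision noted above.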

12
New similarity functions

Problem with SimRank: nodes with high in-degree
are dissimilar to all other nodes

When SimRank fails: pages u and v have k
witnesses for similarity, yet sim(u,v) is only on the order of 1/k.
13
PSimRank

Coupling → walks attract each other, as if they
were walking towards the same goal
Still, PSimRank can be computed within the same
Monte Carlo similarity search framework (all
scalability properties still hold!)

14
Extended Jaccard-coefficient

Take the k-step in-neighborhood of pages u, v
Calculate their similarity using the Jaccard coefficient
Take the exponentially weighted sum over k
Storage trick does not work: increase in disk
requirement (both storage and seeks/query);
indexing is the same as before
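
An exact, small-graph sketch of the three steps above. Several details are assumptions, since the slide does not fix them: the k-step in-neighborhood is taken to be all pages that reach the page in at most k backward steps, the weight of level k is c^k without normalization, and the names `k_step_in_neighborhood` and `extended_jaccard` are hypothetical. The scalable version in the talk is Monte Carlo based; this direct computation only illustrates the definition.

```python
def k_step_in_neighborhood(in_nbrs, start, k):
    """Pages from which `start` is reachable in 1..k steps (assumed definition)."""
    frontier, seen = {start}, set()
    for _ in range(k):
        frontier = {p for x in frontier for p in in_nbrs[x]} - seen
        seen |= frontier
    return seen


def extended_jaccard(in_nbrs, u, v, c=0.5, max_k=3):
    """Sum over k of c^k times the Jaccard coefficient of the k-step in-neighborhoods."""
    total = 0.0
    for k in range(1, max_k + 1):
        a = k_step_in_neighborhood(in_nbrs, u, k)
        b = k_step_in_neighborhood(in_nbrs, v, k)
        union = a | b
        if union:
            total += (c ** k) * len(a & b) / len(union)
    return total


# u and v share one of their three distinct referring pages.
graph = {"w1": [], "w2": [], "w3": [], "u": ["w1", "w2"], "v": ["w2", "w3"]}
score = extended_jaccard(graph, "u", "v", c=0.5, max_k=2)
```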

15
Experimental evaluation

Evaluation methodology: Haveliwala et al. 02
Uses Open Directory Project (dmoz.org)
Ground truth: similarity in directory
Familial distance: documents in the same class
are more similar than those in different classes
Compare orderings of familial distance and
calculated similarity
Stanford WebBase:
80M pages including 200K ODP pages
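
A rough sketch of what comparing the two orderings can look like. The definitions here are assumptions, not the exact measures of Haveliwala et al.: familial distance is taken as the number of steps from the query page's class up to the deepest class shared with the other page, and agreement is the fraction of candidate pairs that both orderings rank the same way. The names `familial_distance` and `ordering_agreement` are hypothetical.

```python
from itertools import combinations


def familial_distance(class_a, class_b):
    """Steps from class_a up to the deepest class containing both (assumed)."""
    common = 0
    for x, y in zip(class_a, class_b):
        if x != y:
            break
        common += 1
    return len(class_a) - common


def ordering_agreement(query_class, candidates, sim):
    """candidates: page -> directory path (tuple); sim: page -> similarity score."""
    concordant = discordant = 0
    for p, q in combinations(candidates, 2):
        dp = familial_distance(query_class, candidates[p])
        dq = familial_distance(query_class, candidates[q])
        if dp == dq or sim[p] == sim[q]:
            continue                   # ties carry no ordering information
        # A page closer in the directory should get the higher score.
        if (dp < dq) == (sim[p] > sim[q]):
            concordant += 1
        else:
            discordant += 1
    total = concordant + discordant
    return concordant / total if total else None
```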

16
Experiments 1: path length
Multi-step similarity does make sense!...
17
Experiments 2: decay factor c
...but mostly when downweighted.
18
Experiments 3: number of simul. N
Note: recall (no. of results) grows linearly.
19
Further results

In the paper:
Formal analysis of error of approximation vs.
number of independent simulations: exponential
decay of error
WAW2004:
Application of a similar random walk based Monte
Carlo method to compute Personalized PageRank
Lower bound: in the worst case (i.e., on arbitrary
graphs), a quadratic DB is required unless
approximation is allowed

20
Conclusion