Title: Scaling link-based similarity search
1Scaling link-based similarity search
Dániel Fogaras, Balázs Rácz
Computer and Automation Research Institute of the
Hungarian Academy of Sciences
Budapest University of Technology and Economics
2Outline
- Introduction
- Scaling link-based similarity search
- First scalable algorithm for SimRank
- New similarity functions
- Experiments
3Introduction / Motivation
- Similarity search on the Web
4Approaches / Related Results
- Text-based
- Classic IR
- Min-hash fingerprinting (Broder 98)
- Pure link-based
- Single-step: co-citation, bibliographic coupling
- Multi-step
- Companion (Dean, Henzinger, 98)
- SimRank (Jeh, Widom, 02)
- Hybrid
- Anchor text based (Haveliwala et al. 02)
(Slide annotations: random access; quadratic)
5Scalability requirements
- Webgraph
- V = 8,000,000,000 web pages
- E = 80,000,000,000 hyperlinks
- Indexing
- Limited main memory, stream access to graph
- Within sorting time
- Parallelizable
- Distributed index database
- Query
- Limited number of DB accesses
- Parallelizable
[Diagram: Webgraph → Indexing → Databases → Query: Sim(u,v)? Top(u)?]
6SimRank
- Jeh and Widom, 2002
- The similarity of two pages is the average similarity of their referring pages
- Formalized as a PageRank-like equation
- Power iteration: quadratic storage and time
- Goal: quadratic → linear
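To make the quadratic cost concrete, here is a minimal sketch of the naive power iteration, assuming the standard SimRank recurrence of Jeh and Widom with decay factor c. The function name `simrank_naive`, the `in_links` dictionary layout, and the toy graph are illustrative assumptions, not part of the slides.

```python
from itertools import product

def simrank_naive(in_links, c=0.8, iterations=10):
    """Naive SimRank by power iteration.

    in_links maps every node to the list of nodes linking to it.
    A score is stored for every ordered pair of nodes, which is the
    quadratic storage the slide refers to.
    """
    nodes = list(in_links)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, repeat=2):
            if u == v:
                new[(u, v)] = 1.0
            elif not in_links[u] or not in_links[v]:
                new[(u, v)] = 0.0  # no referring pages: nothing to average
            else:
                total = sum(sim[(a, b)] for a in in_links[u] for b in in_links[v])
                new[(u, v)] = c * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim

# Hypothetical toy graph: page "a" links to both "b" and "c".
toy_graph = {"a": [], "b": ["a"], "c": ["a"]}
scores = simrank_naive(toy_graph)
```

In the toy graph the only referring page of both "b" and "c" is "a", so their score is c times sim(a,a). The dictionary holds one entry per node pair, which is exactly what does not scale to a web graph.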
7Randomization
- For pages u and v, start two random walks from them, following the links backwards
- Let t be the first meeting time
- Jeh, Widom: sim(u,v) = expected value of c^t
- Our algorithm
- Monte Carlo method
- Simulate N independent pairs of random walks
- Approximate sim with the average of c^t
- Index DB: N random walks for each page
- Query: calculate meeting times
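The Monte Carlo estimate above can be sketched as follows. This version simulates the walks at query time instead of reading them from an index database, and it truncates walks at a fixed length; the function names, the `length` cutoff, and the toy graph are assumptions made for illustration.

```python
import random

def backward_walk(in_links, start, length, rng):
    """Random walk that follows links backwards; stops early at a page with no in-links."""
    path = [start]
    node = start
    for _ in range(length):
        preds = in_links[node]
        if not preds:
            break
        node = rng.choice(preds)
        path.append(node)
    return path

def first_meeting_time(path_u, path_v):
    """First step t at which both walks are on the same page, or None if they never meet."""
    for t, (a, b) in enumerate(zip(path_u, path_v)):
        if a == b:
            return t
    return None

def simrank_monte_carlo(in_links, u, v, c=0.8, n_sim=100, length=10, seed=0):
    """Average of c**t over n_sim independent pairs of walks (walks that never meet contribute 0)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sim):
        t = first_meeting_time(backward_walk(in_links, u, length, rng),
                               backward_walk(in_links, v, length, rng))
        if t is not None:
            total += c ** t
    return total / n_sim

# Hypothetical toy graph: "a" links to "b" and "c", so every pair of walks meets at t = 1.
toy_graph = {"a": [], "b": ["a"], "c": ["a"]}
estimate = simrank_monte_carlo(toy_graph, "b", "c")
```

In the indexed setting described on the slide, the N walks per page would be precomputed once and only `first_meeting_time` would run at query time.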
8Derandomization
- (partially)
- Trick 1: pairwise independence is enough
- Trick 2: anything after the first meeting is irrelevant
- → coalescing (sticky) walks
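One way to obtain coalescing walks is to draw a single random in-link per page per step and let every walk currently on that page follow it. The sketch below assumes this construction; `coalescing_walks` and the small example graph are hypothetical names and data, not taken from the slides.

```python
import random

def coalescing_walks(in_links, length, seed=0):
    """Simulate one backward walk per page, sharing the random choice made at each page.

    Walks standing on different pages use independent choices, while walks
    that have met use the same choice from then on, so they stay together.
    """
    rng = random.Random(seed)
    position = {u: u for u in in_links}   # current page of the walk started at u
    paths = {u: [u] for u in in_links}
    for _ in range(length):
        # One shared random in-link per page for this step.
        step_choice = {w: rng.choice(preds) for w, preds in in_links.items() if preds}
        # Walks on a page without in-links (or already stopped) map to None.
        position = {u: step_choice.get(w) for u, w in position.items()}
        for u, w in position.items():
            if w is not None:
                paths[u].append(w)
    return paths

# Hypothetical 4-page graph in which every page has two in-links.
example_graph = {0: [1, 2], 1: [2, 3], 2: [0, 3], 3: [0, 1]}
walks = coalescing_walks(example_graph, length=8, seed=1)
```

Because two walks on different pages never share a random choice before they meet, any single pair still behaves like a pair of independent walks up to its first meeting, which is all the SimRank estimate needs.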
9Compact storage
- Trick 3: we only need the time of the first meeting, not the path itself
For the path of u4, the first smaller-indexed path it meets is the path of u3; the meeting time is 3.
Storage: L integers/path → 2 integers/path
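The two integers per path can be illustrated with a small sketch: for each path, keep only the index of the first smaller-indexed path it meets and the meeting time. The quadratic scan below is for illustration only and is not the indexing procedure from the talk; the function name and the example paths are hypothetical, chosen so that the last path reproduces the "u4 meets u3 at time 3" case.

```python
def compact_records(paths):
    """Reduce each path to (partner_index, meeting_time), or None if it meets no earlier path.

    paths[i][t] is the page visited by walk i at step t. Among the
    smaller-indexed paths, the one met earliest is recorded.
    """
    records = []
    for i, path in enumerate(paths):
        best = None  # (meeting_time, partner_index)
        for j in range(i):
            for t, (a, b) in enumerate(zip(path, paths[j])):
                if a == b:
                    if best is None or t < best[0]:
                        best = (t, j)
                    break
        records.append(None if best is None else (best[1], best[0]))
    return records

# Hypothetical coalescing paths for u1..u4 (indices 0..3), each of length L = 5.
example_paths = [
    ["u1", "a", "b", "c", "d"],
    ["u2", "e", "b", "c", "d"],
    ["u3", "f", "g", "h", "d"],
    ["u4", "i", "j", "h", "d"],
]
records = compact_records(example_paths)
```

Here the record of the last path is `(2, 3)`: it first meets the path with index 2 (u3) at step 3, so five stored pages shrink to two integers.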
10Gains
- V = no. of pages (10^9)
- N = no. of indep. simulations (100)
- Indexing: stream access to the graph, V cells of memory (or external memory)
- Index database size: N·V (500 GB)
- Query: 2N disk seeks, time proportional to the number of results
- Parallelizable to N machines (5 GB storage, 2 disk seeks/query each)
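The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes 1 GB = 10^9 bytes; the bytes-per-entry value is derived from the slide's numbers and is not stated in the talk.

```python
V = 10**9                    # number of pages (from the slide)
N = 100                      # number of independent simulations (from the slide)
index_bytes = 500 * 10**9    # 500 GB index, assuming 1 GB = 10**9 bytes

entries = N * V                                  # one stored record per page per simulation
bytes_per_entry = index_bytes / entries          # derived, not stated on the slide
per_machine_gb = index_bytes / N / 10**9         # storage when each machine holds one simulation
seeks_single_machine = 2 * N                     # disk seeks per query without parallelization
seeks_per_machine = seeks_single_machine // N    # disk seeks per query on each of N machines
```

Under these assumptions the index works out to 5 bytes per record, 5 GB per machine, and 2 seeks per query on each machine, matching the slide.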
11Parallelization
- Each back-end server: one simulation
- Query: ask N servers, merge the results
- Fault tolerance: when a server fails, merge N-1 result sets
- Load balance: ask any N servers
- Adapt to workload: under heavy load, ask fewer servers → slight loss of precision
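The merge step can be sketched as an average over whichever servers answered. The response format (a dictionary from page to meeting time per server) and the name `merge_results` are assumptions made for illustration.

```python
def merge_results(responses, c=0.8):
    """Merge per-server results into a ranked list of (page, estimated similarity).

    responses holds one dict per server that answered, mapping a page to
    the meeting time found in that server's simulation. Failed or skipped
    servers are simply missing, so the average runs over fewer simulations.
    """
    if not responses:
        return []
    totals = {}
    for result in responses:
        for page, t in result.items():
            totals[page] = totals.get(page, 0.0) + c ** t
    n = len(responses)
    ranked = [(page, score / n) for page, score in totals.items()]
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked

# Hypothetical answers from three servers for one query page.
answers = [{"v": 1, "w": 2}, {"v": 1}, {"w": 3}]
ranking = merge_results(answers, c=0.5)
```

Dropping a response only changes the divisor and removes one sample from the average, which is why a failed or skipped server costs a little precision rather than correctness.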
12New similarity functions
- Problem with SimRank: nodes with high in-degree are dissimilar to all other nodes
When SimRank fails: pages u and v have k witnesses for similarity, yet sim(u,v) is only about 1/k.
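A short derivation shows where the 1/k comes from, assuming the standard SimRank recurrence with decay factor c. It further assumes that the k witnesses w_1, ..., w_k are exactly the pages linking to both u and v, and that distinct witnesses have zero similarity to each other.

```latex
\[
\operatorname{sim}(u,v) = \frac{c}{|I(u)|\,|I(v)|}
  \sum_{u' \in I(u)} \sum_{v' \in I(v)} \operatorname{sim}(u',v'),
\qquad \operatorname{sim}(u,u) = 1
\]
\[
I(u) = I(v) = \{w_1,\dots,w_k\},\quad \operatorname{sim}(w_i,w_j) = 0 \ (i \neq j)
\;\Longrightarrow\;
\operatorname{sim}(u,v) = \frac{c}{k^2} \sum_{i=1}^{k} \operatorname{sim}(w_i,w_i) = \frac{c}{k}
\]
```

Under these assumptions, more shared witnesses make the score smaller: only the k diagonal terms of the k^2 pairs contribute.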
13PSimRank
- Coupling → walks attract each other, as if they were walking towards the same goal
- Still, PSimRank can be computed within the same Monte Carlo similarity search framework (all scalability properties still hold!)
14Extended Jaccard-coefficient
- Take the k-step in-neighborhoods of pages u and v
- Calculate their similarity using the Jaccard coefficient
- Take the exponentially weighted sum over k
- Storage trick does not work: increased disk requirement (both storage and seeks/query); indexing is the same as before
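The definition can be sketched by exact computation on a small graph. Two assumptions are made here that the slide does not spell out: the k-step in-neighborhood is taken as the set of pages reaching the target in exactly k backward steps, and step k is weighted by c^k without further normalization. All names and the toy graph are illustrative.

```python
def k_step_in_neighborhoods(in_links, start, max_k):
    """Return [I_1, ..., I_max_k], where I_k holds the pages exactly k backward steps from start."""
    levels = []
    frontier = {start}
    for _ in range(max_k):
        frontier = {p for node in frontier for p in in_links[node]}
        levels.append(frontier)
    return levels

def extended_jaccard(in_links, u, v, c=0.5, max_k=3):
    """Sum over k of c**k times the Jaccard coefficient of the k-step in-neighborhoods."""
    levels_u = k_step_in_neighborhoods(in_links, u, max_k)
    levels_v = k_step_in_neighborhoods(in_links, v, max_k)
    score = 0.0
    for k, (a, b) in enumerate(zip(levels_u, levels_v), start=1):
        union = a | b
        if union:  # skip steps where both neighborhoods are empty
            score += (c ** k) * len(a & b) / len(union)
    return score

# Hypothetical toy graph: u and v share one of their in-neighbors and a common ancestor r.
toy_links = {"u": ["a", "b"], "v": ["b", "c"],
             "a": ["r"], "b": ["r"], "c": ["r"], "r": []}
ejs = extended_jaccard(toy_links, "u", "v")
```

For the toy graph the one-step neighborhoods overlap in one of three pages and the two-step neighborhoods coincide, giving 0.5 · 1/3 + 0.25 · 1 = 5/12.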
15Experimental evaluation
- Evaluation methodology: Haveliwala et al. 02
- Uses Open Directory Project (dmoz.org)
- Ground truth: similarity in the directory
- Familial distance: documents in the same class are more similar than those in different classes
- Compare orderings of familial distance and calculated similarity
- Stanford WebBase
- 80M pages, including 200K ODP pages
16Experiments 1: path length
Multi-step similarity does make sense...
17Experiments 2: decay factor c
...but mostly when downweighted.
18Experiments 3: number of simulations N
Note: recall (number of results) grows linearly.
19Further results
- In the paper
- Formal analysis of the approximation error vs. the number of independent simulations: exponential decay of error
- WAW2004
- Application of a similar random-walk-based Monte Carlo method to compute Personalized PageRank
- Lower bound: in the worst case (i.e., on arbitrary graphs), a quadratic DB is required unless approximation is allowed
20Conclusion
- Approximation algorithm for multi-step/recursive similarity functions
- Uses simulated random walks
- Monte Carlo method
- Scalable
- New similarity functions
- First sight of these on real(ly big) web data
- Yes, they do make sense!
21Open problems
- Theoretical
- Analysis of storage trick efficiency
- Expected query time/result set size
- Practical
- Comparison with text/anchor text methods
- Both
- Further methods
- Our methods are general
- Methods that use special features of the web graph?
- Goal: better approximation, recall
- Combined methods