Title: Random Walks in Ranking Query Results in Semistructured Databases
1Random Walks in Ranking Query Results in
Semistructured Databases
2Roadmap
- Ranking Web Pages using link structure
- Overview
- PageRank
- Hubs and Authorities
- Ranking Keyword Search Results in Semistructured Databases
- Problem Statement
- Previous Work
- Ongoing Work: Ranking using Random Walks
4Ranking Web Pages
- Rank according to
- Relevance of page to query
- Quality of page
6PageRank
- Stanford project
- Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web.
- Started Google
7PageRank
- Make use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
- Each page has a unique PageRank, independent of the keyword query
- PageRank does NOT express relevance of page to query
8PageRank is a Usage Simulation
- Random surfer
- Given a random URL
- Clicks randomly on links
- After a while gets bored and gets a new random URL
- The number of visits to each page is its PageRank.
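The random-surfer description above can be sketched as a small simulation. This is an illustration only: the four-page graph, the 0.85 probability of following a link (rather than getting bored), the step count and the name `simulate_surfer` are all assumptions, not part of the slides.

```python
import random

# Hypothetical four-page link graph (an assumption for illustration).
LINKS = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def simulate_surfer(links, steps=200_000, follow_prob=0.85, seed=0):
    """Return the fraction of visits each page receives from a random surfer."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)            # start at a random URL
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and rng.random() < follow_prob:
            current = rng.choice(out)      # click randomly on a link
        else:
            current = rng.choice(pages)    # bored: get a new random URL
    return {page: count / steps for page, count in visits.items()}

visit_share = simulate_surfer(LINKS)
for page, share in sorted(visit_share.items()):
    print(page, round(share, 3))
```

Pages that are reached through many well-visited pages collect a larger share of visits, which is the intuition behind using the visit count as a quality score.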
9PageRank Calculation Intuition
- PageRank of page P increases when pages with
large PageRanks point to P.
10PageRank Calculation
- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- d: damping factor, normally set to 0.85.
- T1, ..., Tn: pages pointing to page A
- PR(A): PageRank of page A.
- PR(Ti): PageRank of page Ti.
- C(Ti): the number of links going out of page Ti.
- Note: d is needed due to PageRank sinks
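11Example of Calculation (1)
[Figure: link graph of four pages. Page A links to Pages B and C; Page B links to Page C; Page C links to Page A; Page D links to Page C.]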
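12Example of Calculation (2)
[Figure: every page starts with PageRank 1 and passes 0.85 of it on, split evenly over its outgoing links.]
- Page A (1) to Page B: 1*0.85/2
- Page A (1) to Page C: 1*0.85/2
- Page B (1) to Page C: 1*0.85
- Page C (1) to Page A: 1*0.85
- Page D (1) to Page C: 1*0.85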
13- Each page has not passed on 0.15, so we get
- Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
- Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
- Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
- Page D: receives none, but has not transferred 0.15 = 0.15
Page A 1
Page B 0.575
Page C 2.275
Page D 0.15
14Example of Calculation (3)
Page A 1
Page B 0.575
Page C 2.275
Page D 0.15
15- Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
- Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
- Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
- Page D: receives none, but has not transferred 0.15 = 0.15
Page A 2.08375
Page B 0.575
Page C 1.19125
Page D 0.15
16Example of calculation (4)
- After 20 iterations, we get
Page A 1.490
Page B 0.783
Page C 1.577
Page D 0.15
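The worked example can be replayed with a few lines of code. This is a minimal sketch, assuming the link graph implied by the per-page sums in the example (A links to B and C, B to C, C to A, D to C); the function names `pagerank_step` and `pagerank` are illustrative.

```python
# Link graph read off the worked example: A -> B, C; B -> C; C -> A; D -> C.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

def pagerank_step(ranks, links, d=0.85):
    """One synchronous update of PR(A) = (1 - d) + d * sum(PR(Ti) / C(Ti))."""
    new_ranks = {page: 1.0 - d for page in links}   # the part "not transferred"
    for source, targets in links.items():
        if not targets:
            continue                                # sketch: a sink passes nothing on
        share = d * ranks[source] / len(targets)    # split evenly over out-links
        for target in targets:
            new_ranks[target] += share
    return new_ranks

def pagerank(links, iterations, d=0.85):
    """Run the update `iterations` times, starting every page at 1."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links, d)
    return ranks

for n in (1, 2, 20):
    print(n, {p: round(r, 3) for p, r in sorted(pagerank(LINKS, n).items())})
```

With these assumptions, one and two iterations should match the hand-computed values, and twenty iterations should land close to the figures quoted for the converged example.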
17Example - Conclusions
- Page C has the highest PageRank, and page A has the next highest: page C has the highest importance in this page graph!
- More iterations lead to convergence of PageRanks.
18Google
- Uses PageRank as one of the criteria to rank keyword query results.
- Other criteria (may) include
- Term frequencies
- Term proximities
- Term position (title, top of page, etc)
- Term characteristics (boldface, capitalized, etc)
- Link analysis information
- Category information
- Popularity information
20Hubs and Authorities
- Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999)
- HITS (Hypertext-Induced Topic Search) developed by Jon Kleinberg, while visiting IBM Almaden.
- IBM expanded HITS into Clever.
- IBM doesn't see Clever as a real-time search engine, but as a way to create constantly refreshed lists of relevant pages for categories
21Hubs and Authorities
- Rank pages according to keyword query (in contrast to PageRank)
22Hubs and Authorities
- Good hub: page that points to many good authorities.
- Good authority: page pointed to by many good hubs.
- Given a keyword query, assign a hub and an authority value to each page.
- Pages with high authority are the results of the query
23Hubs and Authorities Calculation: Root Set and Base Set
- Use the query terms to collect a root set of pages from a text-based search engine (AltaVista)
[Figure: Root Set]
24Hubs and Authorities Calculation: Root Set and Base Set (Contd)
- Expand root set into base set by including (up to a designated size cut-off)
- all pages linked to by pages in root set
- all pages that link to a page in root set
- Typical base set contains roughly 1000-5000 pages
[Figure: Base Set enclosing the Root Set]
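The expansion step can be sketched as follows. The link tables, the function name `expand_to_base_set` and the reading of the "size cut-off" as a per-page cap on in-linking pages are assumptions made for illustration.

```python
def expand_to_base_set(root_set, out_links, in_links, max_in_per_page=50):
    """Grow a root set into a base set by following links in both directions."""
    base = set(root_set)
    for page in root_set:
        # all pages linked to by this root page
        base.update(out_links.get(page, []))
        # pages that link to this root page, capped (assumed meaning of the cut-off)
        base.update(in_links.get(page, [])[:max_in_per_page])
    return base

# Hypothetical toy link tables keyed by page name.
out_links = {"r1": ["x", "y"], "r2": ["y"]}
in_links = {"r1": ["p", "q", "s"], "r2": ["q"]}
base_set = expand_to_base_set({"r1", "r2"}, out_links, in_links, max_in_per_page=2)
print(sorted(base_set))
```

In this toy run the page "s" is left out because the cap admits only the first two in-linking pages of "r1".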
25Hubs and Authorities Calculation
- Iterative algorithm on Base Set: authority weights a(p), and hub weights h(p).
- Set authority weights a(p) = 1, and hub weights h(p) = 1 for all p.
- Repeat the following two operations (and then re-normalize a and h to have unit norm)
- a(p) = h(v1) + h(v2) + h(v3), where v1, v2, v3 are the pages that point to p
- h(p) = a(v1) + a(v2) + a(v3), where v1, v2, v3 are the pages that p points to
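The two update operations can be sketched in plain Python. The three-page graph and the names `hits` and `normalize` are hypothetical, and the sketch assumes the graph has at least one link so that the norms are non-zero.

```python
import math

def normalize(weights):
    """Scale a weight dict to unit Euclidean norm."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()}

def hits(links, iterations=50):
    """Return (authority, hub) weights for a dict page -> list of linked pages."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p): sum of the hub weights of the pages that point to p
        new_auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]
        # h(p): sum of the (updated) authority weights of the pages p points to
        new_hub = {p: sum(new_auth[t] for t in links.get(p, [])) for p in pages}
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical three-page graph: X -> Y, X -> Z, Y -> Z.
LINKS = {"X": ["Y", "Z"], "Y": ["Z"], "Z": []}
authority, hub = hits(LINKS)
print({p: round(w, 3) for p, w in sorted(authority.items())})
print({p: round(w, 3) for p, w in sorted(hub.items())})
```

In this toy graph Z is pointed to by both other pages and ends up as the strongest authority, while X, which points to both, ends up as the strongest hub.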
26Example Mini Web
[Figure: mini web of three pages X, Y, Z]
H_i = M A_(i-1)
A_i = M^T H_(i-1)
27Example
[Table: weights of X, Y, Z at iterations 0, 1, 2, 3; cell values missing]
28Hubs and Authorities Calculation
- Theorem (Kleinberg, 1998). The iterates a(p) and h(p) converge to the principal eigenvectors of M^T M and M M^T, respectively, where M is the adjacency matrix of the (directed) Web subgraph.
29PageRank vs. Authorities
- PageRank
- (Google)
- computed for all web pages stored in the database prior to the query
- computes authorities only
- Trivial and fast to compute
- HITS
- (CLEVER)
- performed on the set of retrieved web pages for each query
- computes authorities and hubs
- easy to compute, but real-time execution is hard
31Keyword Search in Databases
- The label of a node is Type (Value) degree
- Query: Vagelis, Gravano
- Assume that SIGMOD 01 has 500 attendees and 50 papers. Each paper has 10 references and 2 authors.
32Result of Keyword Query
- Result is a tree T of nodes where
- each edge corresponds to an edge of the data graph
- every keyword is contained in a node of T
- no node of T is redundant (minimal)
33Example
Results
- R1: Vagelis, PREFER, SIGMOD 01, L.Gravano
- R2: Vagelis, PREFER, Fagin PODS96, Top-k ICDE2002, L.Gravano
- R3: Vagelis, PREFER, Insignificant1 paper, Insignificant2 paper, Unknown Gravano
35Previous Work
[Figure: results R1, R2, R3]
- XKeyword, DISCOVER, DBXplorer, Goldman98: score is the inverse of the path distance between nodes.
- BANKS: weighted distance
- Results output in the order R1, R2, R3
36Previous Work: Keyword Queries
- XKeyword. V. Hristidis, Y. Papakonstantinou, A. Balmin. ICDE 2003
- DISCOVER. V. Hristidis, Y. Papakonstantinou. VLDB 2002
- DBXplorer. S. Agrawal et al. ICDE 2002
- Three step architecture
- Data stored in DBMS
- Schema use
- BANKS. G. Bhalotia et al. ICDE 2002
- Database viewed as graph
- No schema info
- Steiner tree problem approximations
- Proximity searching in databases. R. Goldman et al. VLDB 1998
- Database viewed as graph
- No schema info
- hub nodes
37Previous Work
[Figure: results R1, R2, R3]
- Prior work: results output in the order R1, R2, R3
- Intuitively R3 shows a tighter connection than R1 (higher relevance between keywords)
- But R2 connects objects of higher importance than R3 (higher quality of result)
- Relevance and Quality can be contradicting factors
38Random Walks (RW)
- Score of result A-B = probability that a random walk goes from A to B
- Captures Relevance, but ignores Quality of result.
- P(A→B→C) = 1/degree(A) * 1/degree(B)
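The path probability above generalizes to longer paths: each node except the last contributes a factor of one over its degree. A minimal sketch, with hypothetical degrees and an illustrative function name `walk_probability`:

```python
from fractions import Fraction

def walk_probability(path, degree):
    """Probability that a random walk follows `path` node by node."""
    prob = Fraction(1)
    for node in path[:-1]:                 # the last node makes no further choice
        prob *= Fraction(1, degree[node])  # uniform choice among its neighbors
    return prob

# Hypothetical degrees, for illustration only.
degree = {"A": 2, "B": 5, "C": 3}
p_abc = walk_probability(["A", "B", "C"], degree)
print(p_abc)   # 1/10
```

Under this scoring, a result that passes through a node with very many neighbors, such as a conference with hundreds of attendees and papers, receives a small probability, which is why RW favors tight connections.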
39Random Walks (RW)
[Figure: results R1, R2, R3]
- RW: results output in the order R3, R2, R1
- But R2 connects objects of higher importance than R3 (higher quality of result)
- Relevance and Quality can be contradicting factors
40Random Walks PageRank (RWPR)
- Score of result A-B = probability that a random walk, starting from any node, goes through both A and B.
- Captures both Relevance and Quality of result.
- Score = PR(A) * P(A→B) + PR(B) * P(B→A)
- P(A→B) can be computed using the PageRank algorithm, setting the PageRank source to A
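One possible reading of "setting the PageRank source to A" is a PageRank run whose restart mass goes entirely to A. The sketch below follows that reading; the four-page graph, the function name `pagerank_from_source`, and the use of the resulting score of B as a stand-in for P(A→B) are assumptions, not the authors' stated method.

```python
def pagerank_from_source(links, source, d=0.85, iterations=100):
    """PageRank variant where all (1 - d) restart mass goes to `source`."""
    scores = {page: 0.0 for page in links}
    scores[source] = 1.0                          # the walk starts at the source
    for _ in range(iterations):
        new_scores = {page: 0.0 for page in links}
        new_scores[source] = 1.0 - d              # restart only at the source
        for page, targets in links.items():
            if not targets:
                continue                          # sketch: a sink passes nothing on
            share = d * scores[page] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores

# Hypothetical four-page graph: A -> B, C; B -> C; C -> A; D -> C.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
from_a = pagerank_from_source(LINKS, "A")
print({page: round(score, 3) for page, score in sorted(from_a.items())})
```

Page D cannot be reached from A in this graph, so its score stays at zero; the remaining scores indicate how strongly each page is tied to A.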
41Random Walks PageRank (RWPR)
[Figure: results R1, R2, R3]
- RWPR: results output in the order R2, R1, R3
- Assuming
42Example - Details
The following table shows the scores of the
results according to 3 ranking methods
Ranking
43Random Walk Variations
44Page vs Structured Results Ranking
45Open issues
- Efficiently calculating RW
- First thoughts: two ways
- DISCOVER-like with CNs
- BANKS-like, using shortest path progressively
- Edges must have different weights for PR and RW calculation (e.g. "Paper cites Paper" is one-way for PR but two-way for RW)
- How to assign PR and RW weights on schema graph?
46Conclusions
- The concept of Random Walks has proven very useful in ranking Web pages
- Can also be used in ranking results of queries in structured/semistructured databases.
- Problem is more complicated