Random Walks in Ranking Query Results in Semistructured Databases - PowerPoint PPT Presentation


1
Random Walks in Ranking Query Results in
Semistructured Databases

Vagelis Hristidis

2
Roadmap

Ranking Web Pages using link structure
Overview
PageRank
Hubs & Authorities
Ranking Keyword Search Results in Semistructured Databases
Problem Statement
Previous Work
Ongoing Work: Ranking using Random Walks

4
Ranking Web Pages

Rank according to
Relevance of page to query
Quality of page

5
Roadmap

Ranking Web Pages using link structure
Overview
PageRank
Hubs & Authorities
Ranking Keyword Search Results in Semistructured Databases
Problem Statement
Previous Work
Ongoing Work: Ranking using Random Walks

6
PageRank

Stanford project
Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd: The PageRank Citation Ranking: Bringing Order to the Web.
Started Google

7
PageRank

Make use of the link structure of the web to calculate a quality ranking (PageRank) for each web page.
Each page has a single PageRank, independent of the keyword query.
PageRank does NOT express the relevance of a page to the query.

8
PageRank is a Usage Simulation

Random surfer
Is given a random URL
Clicks randomly on links
After a while gets bored and gets a new random URL
The relative number of visits to each page is its PageRank.
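
The random surfer can be simulated directly. The sketch below is only an illustration: the four-page link graph, the function name simulate_random_surfer, the step count and the seed are assumptions, and "getting bored" is modelled as jumping to a random page with probability 1 - d at every step.

```python
import random
from collections import Counter


def simulate_random_surfer(links, steps=200_000, d=0.85, seed=0):
    """Estimate visit frequencies of a random surfer on a link graph.

    links maps each page to the list of pages it links to.
    With probability d the surfer follows a random outgoing link;
    otherwise (or when the page has no links) it jumps to a random page.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)  # the surfer is given a random URL
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and rng.random() < d:
            current = rng.choice(out)    # clicks randomly on a link
        else:
            current = rng.choice(pages)  # gets bored, takes a new random URL
    return {page: visits[page] / steps for page in pages}


# Hypothetical four-page web used only for illustration.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = simulate_random_surfer(links)
for page, share in freq.items():
    print(f"{page}: {share:.3f}")
```

Pages that many visited pages link to collect a large share of the visits, while a page with no incoming links is reached only through random jumps.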

9
PageRank Calculation Intuition

PageRank of page P increases when pages with
large PageRanks point to P.

10
PageRank Calculation

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
d: damping factor, normally set to 0.85.
T1, ..., Tn: pages pointing to page A.
PR(A): PageRank of page A.
PR(Ti): PageRank of page Ti.
C(Ti): the number of links going out of page Ti.
Note: d is needed due to PageRank sinks.

11
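Example of Calculation (1)
[Figure: link graph of four pages. Page A links to Pages B and C, Page B links to Page C, Page C links to Page A, Page D links to Page C]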
12
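Example of Calculation (2)
[Figure: every page starts with PageRank 1. Each link is labelled with the amount it passes on: A to B 1*0.85/2, A to C 1*0.85/2, B to C 1*0.85, C to A 1*0.85, D to C 1*0.85]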
13

Each page has not passed on 0.15, so we get
Page A: 0.85 (from Page C) + 0.15 (not transferred) = 1
Page B: 0.425 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (not transferred) = 2.275
Page D: receives none, but has not transferred 0.15 = 0.15

Page A 1
Page B 0.575
Page C 2.275
Page D 0.15
14
Example of Calculation (3)
Page A 1
Page B 0.575
Page C 2.275
Page D 0.15
15

Page A: 2.275*0.85 (from Page C) + 0.15 (not transferred) = 2.08375
Page B: 1*0.85/2 (from Page A) + 0.15 (not transferred) = 0.575
Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (not transferred) = 1.19125
Page D: receives none, but has not transferred 0.15 = 0.15

Page A 2.08375
Page B 0.575
Page C 1.19125
Page D 0.15
16
Example of calculation (4)

After 20 iterations, we get

Page A 1.490
Page B 0.783
Page C 1.577
Page D 0.15
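
The worked example can be reproduced with a few lines of code. This is a minimal sketch, assuming the link graph implied by the calculation (A links to B and C, B to C, C to A, D to C) and assuming that all pages are updated together from the previous round's values; the function names are illustrative.

```python
def pagerank_step(ranks, links, d=0.85):
    """One synchronous update of PR(A) = (1-d) + d * sum(PR(T)/C(T))."""
    new_ranks = {page: 1.0 - d for page in ranks}
    for page, targets in links.items():
        for target in targets:
            # each page passes d * PR / (number of out-links) along every link
            new_ranks[target] += d * ranks[page] / len(targets)
    return new_ranks


def pagerank(links, iterations=20, d=0.85):
    """Start every page at PageRank 1 and apply the update repeatedly."""
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        ranks = pagerank_step(ranks, links, d)
    return ranks


# Link graph read off the example: A -> B, C; B -> C; C -> A; D -> C.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}

first = pagerank(links, iterations=1)
second = pagerank(links, iterations=2)
converged = pagerank(links, iterations=100)
print(first)
print(second)
print({page: round(value, 3) for page, value in converged.items()})
```

The first two rounds give the per-page values computed by hand above, and with enough rounds the values settle near 1.490, 0.783, 1.577 and 0.15.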
17
Example - Conclusions

Page C has the highest PageRank and page A the next highest, so page C has the highest importance in this page graph!
More iterations lead to convergence of the PageRanks.

18
Google

Uses PageRank as one of the criteria to rank
keyword query results.
Other criteria (may) include
Term frequencies
Term proximities
Term position (title, top of page, etc)
Term characteristics (boldface, capitalized, etc)
Link analysis information
Category information
Popularity information

19
Roadmap

Ranking Web Pages using link structure
Overview
PageRank
Hubs & Authorities
Ranking Keyword Search Results in Semistructured Databases
Problem Statement
Previous Work
Ongoing Work: Ranking using Random Walks

20
Hubs & Authorities

Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment. JACM 46(5): 604-632 (1999)
HITS (Hypertext-Induced Topic Search) was developed by Jon Kleinberg while visiting IBM Almaden.
IBM expanded HITS into Clever.
IBM does not see Clever as a real-time search engine, but as a way to create constantly refreshed lists of relevant pages for categories.

21
Hubs & Authorities

Rank pages according to the keyword query (in contrast to PageRank)

22
Hubs & Authorities

Good hub: a page that points to many good authorities.
Good authority: a page pointed to by many good hubs.
Given a keyword query, assign a hub value and an authority value to each page.
Pages with high authority are the results of the query.

23
Hubs & Authorities Calculation: Root Set and Base Set

Use the query terms to collect a root set of pages from a text-based search engine (AltaVista).

Root Set
24
Hubs & Authorities Calculation: Root Set and Base Set (Contd)

Expand the root set into the base set by including (up to a designated size cut-off)
all pages linked to by pages in the root set
all pages that link to a page in the root set
A typical base set contains roughly 1000-5000 pages.

Base Set
Root Set
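
The expansion from root set to base set can be sketched as follows. The function name build_base_set, the two link dictionaries and the page names are hypothetical, and the cut-off is applied in the simplest possible way (stop adding pages once the limit is reached), which is an assumption rather than the rule used by HITS.

```python
def build_base_set(root_set, out_links, in_links, max_size=5000):
    """Expand a root set into a base set, up to a size cut-off.

    out_links[p] lists pages that p links to;
    in_links[p] lists pages that link to p.
    """
    base = list(dict.fromkeys(root_set))  # keep order, drop duplicates
    seen = set(base)
    for page in list(base):
        neighbours = out_links.get(page, []) + in_links.get(page, [])
        for other in neighbours:
            if len(base) >= max_size:
                return base
            if other not in seen:
                seen.add(other)
                base.append(other)
    return base


# Hypothetical link data for illustration only.
out_links = {"r1": ["x", "y"], "r2": ["y"]}
in_links = {"r1": ["z"], "r2": ["r1", "w"]}
base_set = build_base_set(["r1", "r2"], out_links, in_links)
print(base_set)
```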
25
Hubs & Authorities Calculation

Iterative algorithm on the Base Set: authority weights a(p), and hub weights h(p).
Set authority weights a(p) = 1, and hub weights h(p) = 1 for all p.
Repeat the following two operations (and then re-normalize a and h to have unit norm):
a(p) = h(v1) + h(v2) + h(v3), summing over the pages v1, v2, v3 that point to p
h(p) = a(v1) + a(v2) + a(v3), summing over the pages v1, v2, v3 that p points to
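
A minimal sketch of the iteration, assuming the usual reading of the two operations (a page's authority weight is the sum of the hub weights of the pages pointing to it, and its hub weight is the sum of the authority weights of the pages it points to) and the Euclidean norm for re-normalization. The function name hits and the tiny link graph are hypothetical.

```python
import math


def hits(links, iterations=50):
    """Iterate hub and authority weights on a base set.

    links maps each page p to the pages that p points to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    auth = {p: 1.0 for p in pages}  # a(p) = 1 for all p
    hub = {p: 1.0 for p in pages}   # h(p) = 1 for all p
    for _ in range(iterations):
        # authority update: sum the hub weights of pages pointing to p
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # hub update: sum the authority weights of pages p points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # re-normalize both vectors to unit (Euclidean) norm
        for weights in (auth, hub):
            norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
            for p in weights:
                weights[p] /= norm
    return auth, hub


# Hypothetical base set: two hub-like pages pointing to two authority-like pages.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
auth, hub = hits(links)
print(sorted(auth.items()))
print(sorted(hub.items()))
```

In this toy graph the page pointed to by both hubs ends up with the larger authority weight, and the page pointing to both authorities with the larger hub weight.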
26
Example: Mini Web

[Figure: mini web of three pages X, Y, Z]
A_i = M^T H_(i-1)
H_i = M A_(i-1)
27
Example

[Table: values for pages X, Y, Z at iterations 0, 1, 2, 3]
28
Hubs & Authorities Calculation

Theorem (Kleinberg, 1998). The iterates a(p) and h(p) converge to the principal eigenvectors of M^T M and M M^T, where M is the adjacency matrix of the (directed) Web subgraph.
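
A short sketch of why these two matrices appear, assuming M_{pq} = 1 when page p links to page q, that the authority update is applied before the hub update in each round, and ignoring the normalization constants. The superscript (k) for the round number is notation introduced here.

```latex
\begin{aligned}
a^{(k)} &= M^{T} h^{(k-1)}, \qquad h^{(k)} = M a^{(k)} \\
a^{(k)} &= M^{T} M \, a^{(k-1)} = (M^{T} M)^{k-1} a^{(1)} \\
h^{(k)} &= M M^{T} h^{(k-1)} = (M M^{T})^{k} h^{(0)}
\end{aligned}
```

Both sequences are power iterations, so after normalization they approach the principal eigenvectors of M^T M and M M^T, provided the starting vector is not orthogonal to them.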

29
PageRank vs. Authorities

PageRank
(Google)
computed for all web pages stored in the database
prior to the query
computes authorities only
Trivial and fast to compute

HITS
(CLEVER)
performed on the set of retrieved web pages for
each query
computes authorities and hubs
easy to compute, but real-time execution is hard

30
Roadmap

Ranking Web Pages using link structure
Overview
PageRank
Hubs & Authorities
Ranking Keyword Search Results in Semistructured Databases
Problem Statement
Previous Work
Ongoing Work: Ranking using Random Walks

31
Keyword Search in Databases

The label of a node is Type (Value) degree
Query: Vagelis, Gravano
Assume that SIGMOD 01 has 500 attendees and 50 papers. Each paper has 10 references and 2 authors.

32
Result of Keyword Query

Result is a tree T of nodes where
each edge corresponds to an edge of the data graph
every keyword is contained in a node of T
no node of T is redundant (minimal)

33
Example
Results
R1: Vagelis, PREFER, SIGMOD 01, L.Gravano
R2: Vagelis, PREFER, Fagin PODS96, Top-k ICDE2002, L.Gravano
R3: Vagelis, PREFER, Insignificant1 paper, Insignificant2 paper, Unknown Gravano
34
Roadmap

Ranking Web Pages using link structure
Overview
PageRank
Hubs & Authorities
Ranking Keyword Search Results in Semistructured Databases
Problem Statement
Previous Work
Ongoing Work: Ranking using Random Walks

35
Previous Work
Results R1 R2 R3
XKeyword, DISCOVER, DBXplorer, Goldman98: score is the inverse of the path distance between nodes.
BANKS: weighted distance.
Results output: R1, R2, R3
36
Previous Work: Keyword Queries

XKeyword. V. Hristidis, Y. Papakonstantinou, A. Balmin. ICDE 2003
DISCOVER. V. Hristidis, Y. Papakonstantinou. VLDB 2002
DBXplorer. S. Agrawal et al. ICDE 2002
Three step architecture
Data stored in DBMS
Schema use
BANKS. G. Bhalotia et al. ICDE 2002
Database viewed as graph
No schema info
Steiner tree problem approximations
Proximity searching in databases. R. Goldman et al. VLDB 1998
Database viewed as graph
No schema info
hub nodes

37
Previous Work
Results R1 R2 R3

Prior work: results output R1, R2, R3
Intuitively, R3 shows a tighter connection than R1 (higher relevance between keywords).
But R2 connects objects of higher importance than R3 (higher quality of result).
Relevance and Quality can be conflicting factors.

38
Random Walks (RW)

Score of a result connecting A and B = probability that a random walk goes from A to B
Captures Relevance, but ignores Quality of result.
P(A→B→C) = 1/degree(A) * 1/degree(B)
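
The path probability is a product of one factor per step. A minimal sketch, assuming the walk picks one of the current node's edges uniformly at random; the function name walk_probability and the degrees are hypothetical.

```python
def walk_probability(path, degree):
    """Probability that a random walk follows path, step by step.

    At every node the walk picks one of its edges uniformly at random,
    so each step out of node v contributes a factor 1 / degree[v].
    """
    prob = 1.0
    for node in path[:-1]:  # the last node is where the walk ends
        prob *= 1.0 / degree[node]
    return prob


# Hypothetical degrees for illustration only.
degree = {"A": 2, "B": 4, "C": 1}
print(walk_probability(["A", "B", "C"], degree))  # 1/2 * 1/4
```

A path through a high-degree node, such as a conference with hundreds of attendees and papers, receives a much lower probability than a path through low-degree nodes.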

39
Random Walks (RW)
Results R1 R2 R3

RW: results output R3, R2, R1
But R2 connects objects of higher importance than R3 (higher quality of result).
Relevance and Quality can be conflicting factors.

40
Random Walks PageRank (RWPR)

Score of a result connecting A and B = probability that a random walk starting from any node goes through both A and B.
Captures both Relevance and Quality of result.
Score = PR(A) * P(A→B) + PR(B) * P(B→A)
P(A→B) can be computed using the PageRank algorithm, setting the PageRank source to A
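
A rough sketch of this score under several assumptions: the two directions are combined by addition, PageRank is scaled to sum to 1, "setting the PageRank source to A" means that all random jumps return to A, and P(A→B) is read off as the value of B in that A-sourced vector. The function names, the damping factor and the small data graph are hypothetical.

```python
def pagerank(links, d=0.85, source=None, iterations=100):
    """PageRank as a probability vector; jumps go to source if given.

    links maps each node to its neighbours (every node needs at least one).
    """
    nodes = list(links)
    if source is None:
        jump = {n: 1.0 / len(nodes) for n in nodes}  # uniform restart
    else:
        jump = {n: 1.0 if n == source else 0.0 for n in nodes}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {n: (1.0 - d) * jump[n] for n in nodes}
        for n, targets in links.items():
            for t in targets:
                new_rank[t] += d * rank[n] / len(targets)
        rank = new_rank
    return rank


def rwpr_score(links, a, b, d=0.85):
    """Score = PR(A) * P(A->B) + PR(B) * P(B->A)."""
    pr = pagerank(links, d)
    p_ab = pagerank(links, d, source=a)[b]  # assumption: P(A->B) read off this vector
    p_ba = pagerank(links, d, source=b)[a]
    return pr[a] * p_ab + pr[b] * p_ba


# Hypothetical undirected data graph, stored as two-way adjacency lists.
links = {
    "author": ["paper1", "paper2"],
    "paper1": ["author", "conf"],
    "paper2": ["author", "conf"],
    "conf": ["paper1", "paper2"],
}
print(rwpr_score(links, "author", "conf"))
```

The score is symmetric in A and B by construction, and it grows both when the endpoints have high PageRank and when a walk from one is likely to reach the other.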

41
Random Walks PageRank (RWPR)
Results R1 R2 R3

RWPR: results output R2, R1, R3
Assuming

42
Example - Details
The following table shows the scores of the
results according to 3 ranking methods
Ranking
43
Random Walk Variations
44
Page vs Structured Results Ranking
45
Open issues

Efficiently calculating RW
First thoughts: two ways
DISCOVER-like, with CNs
BANKS-like, using shortest paths progressively
Edges must have different weights for PR and RW calculation (e.g., Paper cites Paper is one-way for PR but two-way for RW).
How to assign PR and RW weights on the schema graph?

46
Conclusions

The concept of Random Walks has proven very
useful in ranking Web pages
Can also be used in ranking results of queries in
structured/semistructured databases.
Problem is more complicated
