Random Walks in Ranking Query Results in Semistructured Databases

1
Random Walks in Ranking Query Results in
Semistructured Databases
  • Vagelis Hristidis

2
Roadmap
  • Ranking Web Pages using link structure
      • Overview
      • PageRank
      • Hubs & Authorities
  • Ranking Keyword Search Results in Semistructured
    Databases
      • Problem Statement
      • Previous Work
      • Ongoing Work: Ranking using Random Walks

4
Ranking Web Pages
  • Rank according to
      • Relevance of page to query
      • Quality of page

6
PageRank
  • Stanford project
  • Lawrence Page, Sergey Brin, Rajeev Motwani, Terry
    Winograd. The PageRank Citation Ranking: Bringing
    Order to the Web.
  • Started Google

7
PageRank
  • Make use of the link structure of the web to
    calculate a quality ranking (PageRank) for each
    web page.
  • Each page has a unique PageRank, independent of
    the keyword query
  • PageRank does NOT express relevance of page to
    query

8
PageRank is a Usage Simulation
  • Random surfer
  • Given a random URL
  • Clicks randomly on links
  • After a while gets bored and gets a new random
    URL
  • The share of visits that each page receives is (up
    to scaling) its PageRank.
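
The random-surfer picture can be tried out directly. The following is a minimal simulation sketch, not the procedure used by Google: the four-page link graph, the function name simulate_random_surfer, the step count, and the 0.85 probability of following a link are all assumptions made for illustration.

```python
import random
from collections import Counter

def simulate_random_surfer(links, steps=200_000, follow_prob=0.85, seed=0):
    """Estimate the share of visits each page receives from a random surfer.

    links maps each page to the list of pages it links to (hypothetical data).
    follow_prob is the assumed chance of clicking a link instead of getting bored.
    """
    rng = random.Random(seed)
    pages = sorted(links)
    current = rng.choice(pages)          # the surfer is given a random URL
    visits = Counter()
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if out and rng.random() < follow_prob:
            current = rng.choice(out)    # clicks randomly on a link
        else:
            current = rng.choice(pages)  # gets bored (or hits a dead end): new random URL
    return {page: visits[page] / steps for page in pages}

# Hypothetical link graph with four pages.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
freq = simulate_random_surfer(links)
for page, share in freq.items():
    print(page, round(share, 3))
```

The shares sum to 1, so they match PageRank values only up to a constant scale factor. Page D has no incoming links and is reached only through random restarts, so its share stays small.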

9
PageRank Calculation Intuition
  • PageRank of page P increases when pages with
    large PageRanks point to P.

10
PageRank Calculation
  • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • d: damping factor, normally set to 0.85.
  • T1, ..., Tn: pages pointing to page A.
  • PR(A): PageRank of page A.
  • PR(Ti): PageRank of page Ti.
  • C(Ti): the number of links going out of page Ti.
  • Note: d is needed due to PageRank sinks.
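
As a rough illustration of the formula, the sketch below applies it once to every page of a small graph. The function name pagerank_update and the three-page graph are hypothetical; only the formula itself comes from the slide.

```python
def pagerank_update(pr, out_links, d=0.85):
    """Apply PR(A) = (1-d) + d * sum(PR(Ti)/C(Ti)) once to every page.

    pr maps page -> current PageRank; out_links maps page -> pages it links to.
    """
    new_pr = {}
    for page in out_links:
        # Sum PR(Ti)/C(Ti) over all pages Ti that point to this page.
        incoming = sum(
            pr[t] / len(targets)
            for t, targets in out_links.items()
            if page in targets
        )
        new_pr[page] = (1 - d) + d * incoming
    return new_pr

# Hypothetical graph: X links to Y and Z, Y links to Z, Z links to X.
out_links = {"X": ["Y", "Z"], "Y": ["Z"], "Z": ["X"]}
pr = {page: 1.0 for page in out_links}
pr = pagerank_update(pr, out_links)
print(pr)
```

Starting from 1 for every page, Z collects rank from both X and Y, while Y receives only half of the rank that X passes on.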

11
Example of Calculation (1)
[Figure: link graph of Pages A, B, C and D, with links A → B, A → C, B → C, C → A and D → C]
12
Example of Calculation (2)
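Page A = 1, Page B = 1, Page C = 1, Page D = 1
[Figure: each link is labeled with the amount it transfers: 1 × 0.85/2 on each of the two links out of Page A, and 1 × 0.85 on the single link out of each of Pages B, C and D]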
13
  • Each page has not passed on 0.15, so we get
  • Page A = 0.85 (from Page C) + 0.15 (not
    transferred) = 1
  • Page B = 0.425 (from Page A) + 0.15 (not
    transferred) = 0.575
  • Page C = 0.85 (from Page D) + 0.85 (from Page B) +
    0.425 (from Page A) + 0.15 (not transferred) =
    2.275
  • Page D receives none, but has not transferred
    0.15, so Page D = 0.15

Page A = 1
Page B = 0.575
Page C = 2.275
Page D = 0.15
14
Example of Calculation (3)
Page A = 1
Page B = 0.575
Page C = 2.275
Page D = 0.15
15
  • Page A = 2.275 × 0.85 (from Page C) + 0.15 (not
    transferred) = 2.08375
  • Page B = 1 × 0.85/2 (from Page A) + 0.15 (not
    transferred) = 0.575
  • Page C = 0.15 × 0.85 (from Page D) +
    0.575 × 0.85 (from Page B) + 1 × 0.85/2 (from Page
    A) + 0.15 (not transferred) = 1.19125
  • Page D receives none, but has not transferred
    0.15, so Page D = 0.15

Page A = 2.08375
Page B = 0.575
Page C = 1.19125
Page D = 0.15
16
Example of calculation (4)
  • After 20 iterations, we get

Page A = 1.490
Page B = 0.783
Page C = 1.577
Page D = 0.15
17
Example - Conclusions
  • Page C has the highest PageRank and page A the
    next highest, so page C has the highest importance
    in this page graph.
  • More iterations lead to convergence of the PageRanks.
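
The convergence claim can be checked with a short sketch that iterates until the values stop changing. The link graph is read off the per-page contributions in the worked example (A → B, A → C, B → C, C → A, D → C). The function name pagerank, the tolerance and the iteration cap are assumptions.

```python
def pagerank(out_links, d=0.85, tol=1e-9, max_iter=1000):
    """Iterate the PageRank formula until no value changes by more than tol.

    Returns the final PageRanks and the number of iterations performed.
    """
    pr = {page: 1.0 for page in out_links}          # every page starts at 1
    for iteration in range(1, max_iter + 1):
        new_pr = {page: 1 - d for page in out_links}  # the part not transferred
        for source, targets in out_links.items():
            for target in targets:
                # Each page splits d * PR evenly over its outgoing links.
                new_pr[target] += d * pr[source] / len(targets)
        change = max(abs(new_pr[p] - pr[p]) for p in out_links)
        pr = new_pr
        if change < tol:
            break
    return pr, iteration

example = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks, iterations = pagerank(example)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
print("iterations:", iterations)
```

The values settle near A = 1.490, B = 0.783, C = 1.577 and D = 0.15, in line with the figures quoted after 20 iterations.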

18
Google
  • Uses PageRank as one of the criteria to rank
    keyword query results.
  • Other criteria (may) include
      • Term frequencies
      • Term proximities
      • Term position (title, top of page, etc.)
      • Term characteristics (boldface, capitalized, etc.)
      • Link analysis information
      • Category information
      • Popularity information

20
Hubs & Authorities
  • Jon M. Kleinberg. Authoritative Sources in a
    Hyperlinked Environment. JACM 46(5): 604-632
    (1999)
  • HITS (Hypertext-Induced Topic Search) was developed
    by Jon Kleinberg while visiting IBM Almaden.
  • IBM expanded HITS into Clever.
  • IBM does not see Clever as a real-time search
    engine, but as a way to create constantly refreshed
    lists of relevant pages for categories.

21
Hubs & Authorities
  • Rank pages according to keyword query (in
    contrast to PageRank)

22
Hubs & Authorities
  • Good hub: a page that points to many good
    authorities.
  • Good authority: a page pointed to by many good
    hubs.
  • Given a keyword query, assign a hub value and an
    authority value to each page.
  • Pages with high authority values are the results of
    the query.

23
Hubs & Authorities Calculation: Root Set and
Base Set
  • Use the query terms to collect a root set of pages
    from a text-based search engine (AltaVista).

[Figure: Root Set]
24
Hubs & Authorities Calculation: Root Set and
Base Set (Contd)
  • Expand the root set into the base set by including
    (up to a designated size cut-off)
      • all pages linked to by pages in the root set
      • all pages that link to a page in the root set
  • A typical base set contains roughly 1000-5000 pages.

[Figure: Base Set enclosing the Root Set]
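
The expansion step can be sketched as follows. This is a simplified illustration under stated assumptions: the function name expand_to_base_set, the dictionary representation of the link graph and the single global cut-off max_size are hypothetical, and the tiny graph is made up.

```python
def expand_to_base_set(root_set, out_links, max_size=5000):
    """Grow a root set by one link step in both directions, up to max_size pages.

    out_links maps page -> pages it links to.
    """
    base = list(dict.fromkeys(root_set))   # keep order, drop duplicates
    seen = set(base)
    candidates = []
    for page in base:
        # Pages linked to by pages in the root set.
        candidates.extend(out_links.get(page, ()))
    for source, targets in out_links.items():
        # Pages that link to a page in the root set.
        if any(t in seen for t in targets):
            candidates.append(source)
    for page in candidates:
        if len(base) >= max_size:          # designated size cut-off reached
            break
        if page not in seen:
            seen.add(page)
            base.append(page)
    return base

# Hypothetical link graph: r1 and r2 form the root set.
links = {"r1": ["a"], "b": ["r1"], "c": ["a"], "r2": []}
root = ["r1", "r2"]
base = expand_to_base_set(root, links)
print(base)
```

Page c is left out because it neither links to a root page nor is linked from one; it only shares a target with r1.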
25
Hubs & Authorities Calculation
  • Iterative algorithm on the Base Set: authority
    weights a(p) and hub weights h(p).
  • Set authority weights a(p) = 1 and hub weights
    h(p) = 1 for all p.
  • Repeat the following two operations (and then
    re-normalize a and h to have unit norm)
      • a(p) = h(v1) + h(v2) + h(v3), where v1, v2, v3
        are the pages that point to p
      • h(p) = a(v1) + a(v2) + a(v3), where v1, v2, v3
        are the pages that p points to
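
A minimal sketch of this iteration is given below, assuming a graph stored as a dictionary of outgoing links and a fixed number of rounds. The names hits and normalize, the choice to compute both updates from the previous round's weights, and the four-page graph are assumptions for illustration.

```python
import math

def normalize(weights):
    """Scale a weight dict to unit Euclidean norm (left unchanged if all zero)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: w / norm for p, w in weights.items()} if norm else dict(weights)

def hits(out_links, iterations=50):
    """Return (authority, hub) weights for a graph given as page -> linked pages."""
    pages = sorted(set(out_links) | {t for ts in out_links.values() for t in ts})
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a(p): sum of h(v) over the pages v that point to p.
        new_auth = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for p in targets:
                new_auth[p] += hub[v]
        # h(p): sum of a(v) over the pages v that p points to.
        new_hub = {p: sum(auth[v] for v in out_links.get(p, ())) for p in pages}
        # Re-normalize both weight vectors to unit norm.
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth, hub

# Hypothetical base set: h1 links to a1 and a2, h2 links to a1.
graph = {"h1": ["a1", "a2"], "h2": ["a1"]}
authority, hub = hits(graph)
print({p: round(w, 4) for p, w in authority.items()})
print({p: round(w, 4) for p, w in hub.items()})
```

In this toy graph a1 ends up with the highest authority weight because both hubs point to it, and h1 is the better hub because it points to both authorities.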
26
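Example Mini Web
[Figure: mini Web of three pages X, Y and Z with adjacency matrix M; the link structure and matrix entries are not recoverable from the transcript]
  • A_i = M^T H_{i-1}
  • H_i = M A_{i-1}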
27
Example
[Table: weights of pages X, Y and Z at iterations 0, 1, 2 and 3; the values are not recoverable from the transcript]
28
Hubs & Authorities Calculation
  • Theorem (Kleinberg, 1998). The iterates a(p) and
    h(p) converge to the principal eigenvectors of
    M^T M and M M^T, where M is the adjacency matrix of
    the (directed) Web subgraph.
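
A short sketch of why the theorem is plausible, writing the two update operations in matrix form. The vector names a_i and h_i are notation introduced here, and normalization is ignored for brevity since it only rescales the vectors.

```latex
\begin{aligned}
a_i &= M^{T} h_{i-1}, \qquad h_i = M\, a_{i-1} \\
\Rightarrow\quad a_i &= M^{T} M\, a_{i-2}, \qquad h_i = M M^{T} h_{i-2}
\end{aligned}
```

Each weight vector is therefore multiplied again and again by a fixed symmetric matrix, which is power iteration. Under the usual conditions (a unique largest eigenvalue and a start vector that is not orthogonal to its eigenvector), the normalized iterates approach the principal eigenvector.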

29
PageRank vs. Authorities
  • PageRank (Google)
      • computed for all web pages stored in the database
        prior to the query
      • computes authorities only
      • trivial and fast to compute
  • HITS (CLEVER)
      • performed on the set of retrieved web pages for
        each query
      • computes authorities and hubs
      • easy to compute, but real-time execution is hard

31
Keyword Search in Databases
  • The label of a node is Type (Value) degree
  • Query: Vagelis, Gravano
  • Assume that SIGMOD 01 has 500 attendees and 50
    papers. Each paper has 10 references and 2
    authors.

32
Result of Keyword Query
  • Result is a tree T of nodes where
      • each edge corresponds to an edge of the data
        graph
      • every keyword is contained in a node of T
      • no node of T is redundant (T is minimal)

33
Example
Results
  • R1: Vagelis, PREFER, SIGMOD 01, L.Gravano
  • R2: Vagelis, PREFER, Fagin PODS96, Top-k ICDE2002,
    L.Gravano
  • R3: Vagelis, PREFER, Insignificant1 paper,
    Insignificant2 paper, Unknown Gravano
35
Previous Work
[Figure: results R1, R2, R3]
  • XKeyword, DISCOVER, DBXplorer, Goldman98: score
    is the inverse of the path distance between nodes.
  • BANKS: weighted distance.
  • Results are output in the order R1, R2, R3.
36
Previous Work: Keyword Queries
  • XKeyword. V. Hristidis, Y. Papakonstantinou, A.
    Balmin. ICDE 2003
  • DISCOVER. V. Hristidis, Y. Papakonstantinou.
    VLDB 2002
  • DBXplorer. S. Agrawal et al. ICDE 2002
  • Three step architecture
  • Data stored in DBMS
  • Schema use
  • BANKS. G. Bhalotia et al. ICDE 2002
  • Database viewed as graph
  • No schema info
  • Steiner tree problem approximations
  • Proximity searching in databases. R. Goldman et
    al. VLDB 1998
  • Database viewed as graph
  • No schema info
  • hub nodes

37
Previous Work
[Figure: results R1, R2, R3]
  • Prior work: results are output in the order R1, R2, R3.
  • Intuitively, R3 shows a tighter connection than
    R1 (higher relevance between keywords).
  • But R2 connects objects of higher importance
    than R3 (higher quality of result).
  • Relevance and Quality can be conflicting
    factors.

38
Random Walks (RW)
  • Score of result A-B = probability that a random
    walk goes from A to B.
  • Captures Relevance, but ignores Quality of
    result.
  • P(A → B → C) = 1/degree(A) × 1/degree(B)
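
The path probability is a product of one factor per node that the walk leaves. A minimal sketch, assuming node degrees are given in a dictionary; the function name walk_probability and the degree values are hypothetical.

```python
from math import prod

def walk_probability(path, degree):
    """Probability that a random walk follows exactly the given path.

    At each node the walk picks one of degree[node] edges uniformly,
    so every node except the last contributes a factor 1/degree[node].
    """
    return prod(1 / degree[node] for node in path[:-1])

# Hypothetical degrees: A has 2 edges, B has 4 edges, C has 1 edge.
degree = {"A": 2, "B": 4, "C": 1}
p = walk_probability(["A", "B", "C"], degree)
print(p)  # 1/2 * 1/4 = 0.125
```

A path through a high-degree node receives a low score, which is how this measure penalizes loose connections.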

39
Random Walks (RW)
[Figure: results R1, R2, R3]
  • RW: results are output in the order R3, R2, R1.
  • But R2 connects objects of higher importance
    than R3 (higher quality of result).
  • Relevance and Quality can be conflicting
    factors.

40
Random Walks PageRank (RWPR)
  • Score of result A-B = probability that a random
    walk, starting from any node, goes through both A
    and B.
  • Captures both Relevance and Quality of result.
  • Score = PR(A) × P(A → B) + PR(B) × P(B → A)
  • P(A → B) can be computed using the PageRank algorithm,
    setting the PageRank source to A.
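
One possible reading of this score is sketched below; it is an assumption about the details, not the author's implementation. PageRank is computed here as a probability distribution (values sum to 1), and "setting the PageRank source to A" is interpreted as a walk whose restarts always return to A. The names visit_probabilities and rwpr_score and the five-node path graph are hypothetical.

```python
def visit_probabilities(neighbors, source=None, d=0.85, iterations=200):
    """Visit probabilities of a random walk with restarts.

    With source=None restarts go to a uniformly random node (plain PageRank,
    normalized to sum to 1); otherwise every restart returns to `source`.
    Assumes every node has at least one neighbor.
    """
    nodes = list(neighbors)
    if source is None:
        restart = {n: 1.0 / len(nodes) for n in nodes}
    else:
        restart = {n: 1.0 if n == source else 0.0 for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        new_rank = {n: (1 - d) * restart[n] for n in nodes}
        for n in nodes:
            for m in neighbors[n]:
                new_rank[m] += d * rank[n] / len(neighbors[n])
        rank = new_rank
    return rank

def rwpr_score(a, b, neighbors):
    """Score = PR(A) * P(A -> B) + PR(B) * P(B -> A)."""
    pr = visit_probabilities(neighbors)
    p_ab = visit_probabilities(neighbors, source=a)[b]
    p_ba = visit_probabilities(neighbors, source=b)[a]
    return pr[a] * p_ab + pr[b] * p_ba

# Hypothetical undirected path graph: A - X - B - Y - C.
graph = {"A": ["X"], "X": ["A", "B"], "B": ["X", "Y"], "Y": ["B", "C"], "C": ["Y"]}
near = rwpr_score("A", "B", graph)
far = rwpr_score("A", "C", graph)
print(near, far)
```

In this toy graph the pair (A, B) scores higher than (A, C): B is both closer to A and better connected than C, so the score reflects relevance and quality together.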

41
Random Walks PageRank (RWPR)
[Figure: results R1, R2, R3]
  • RWPR: results are output in the order R2, R1, R3.
  • Assuming

42
Example - Details
The following table shows the scores of the
results according to 3 ranking methods.
[Table: scores per ranking method; the values are not recoverable from the transcript]
43
Random Walk Variations
44
Page vs Structured Results Ranking
45
Open issues
  • Efficiently calculating RW
  • First thoughts: two ways
      • DISCOVER-like, with CNs (candidate networks)
      • BANKS-like, using shortest paths progressively
  • Edges must have different weights for PR and RW
    calculation (e.g., "Paper cites Paper" is one-way
    for PR but two-way for RW).
  • How to assign PR and RW weights on the schema graph?

46
Conclusions
  • The concept of Random Walks has proven very
    useful in ranking Web pages
  • Can also be used in ranking results of queries in
    structured/semistructured databases.
  • Problem is more complicated