PageSim: A Link-based Measure of Web Page Similarity

1
PageSim: A Link-based Measure of Web Page Similarity
  • Research Group Presentation
  • Allen Z. Lin, 8 Mar 2006

2
Outline
  • What & Why?
  • Existing approaches
  • PageSim: a new approach
  • Demonstrations
  • Conclusion and current work

3
What & Why?
  • Ranking similarity between web pages.
  • Applications on the Web
    • Finding related, or similar, web pages to a page.
      • Google's "Similar pages"
    • Web page classification.
      • YAHOO!'s Web Directory: http://dir.yahoo.com/
      • hierarchical structure
  • Key question: How to measure the similarity?

4
Existing approaches
  • Text-based
    • Using common features of two web pages.
    • Jaccard's coefficient, Adamic/Adar
  • Link-based
    • Using the neighbors of two web pages.
      • Common neighbor, Co-citation, SimRank
    • Using paths between two web pages.
      • Katz index, Hitting time

5
Existing approaches (cont.)
  • Notation
    • Sim(a,b): similarity score of web pages a and b.
    • I(a): in-link neighbors of web page a.
    • O(a): out-link neighbors of web page a.
  • Common neighbor method
    • $Sim(a,b) = |O(a) \cap O(b)|$
    • Example: $Sim(c,d) = 2$
  • Co-citation method
    • $Sim(a,b) = |I(a) \cap I(b)|$
    • Example: $Sim(c,d) = 2$
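
Both counting measures take only a few lines of code. The sketch below is illustrative: the six-page graph is hypothetical (the slide's figure is not part of this transcript) and was chosen so that pages c and d score 2 under both measures, as in the examples above. The names `out_links`, `in_links`, `common_neighbor` and `cocitation` are made up for this example.

```python
# Hypothetical toy graph: each page maps to the set of pages it links to.
out_links = {
    "a": {"c", "d"},
    "b": {"c", "d"},
    "c": {"e", "f"},
    "d": {"e", "f"},
    "e": set(),
    "f": set(),
}


def in_links(graph):
    """Invert an out-link map to obtain I(page) for every page."""
    result = {page: set() for page in graph}
    for source, targets in graph.items():
        for target in targets:
            result[target].add(source)
    return result


def common_neighbor(graph, a, b):
    """Sim(a,b) = |O(a) ∩ O(b)|: number of shared out-link neighbors."""
    return len(graph[a] & graph[b])


def cocitation(graph, a, b):
    """Sim(a,b) = |I(a) ∩ I(b)|: number of shared in-link neighbors."""
    incoming = in_links(graph)
    return len(incoming[a] & incoming[b])


print(common_neighbor(out_links, "c", "d"))  # shared out-links: e, f
print(cocitation(out_links, "c", "d"))       # shared in-links: a, b
```

Every shared neighbor counts as exactly 1 here, whatever its importance.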

6
Existing approaches (cont.)
  • SimRank
    • Two pages are similar if they are referenced (cited, or linked to) by similar pages.
    • 1. $Sim(u,u) = 1$; 2. $Sim(u,v) = 0$ if $|I(u)|\,|I(v)| = 0$.
  • Recursive definition
    • C is a constant between 0 and 1.
    • The iteration starts with $Sim(u,u) = 1$, and $Sim(u,v) = 0$ if $u \neq v$.
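
The recursive formula itself did not survive in this transcript. For reference, the standard SimRank recursion (Jeh and Widom) is written out below in the notation of these slides. It is supplied as an assumption about what the slide showed, not recovered from it.

```latex
Sim(u,v) = \frac{C}{|I(u)|\,|I(v)|}
           \sum_{x \in I(u)} \sum_{y \in I(v)} Sim(x, y),
\qquad u \neq v,\quad |I(u)|\,|I(v)| > 0

% Iterative form: Sim_0 is the starting assignment, and for k = 0, 1, 2, ...
Sim_{k+1}(u,v) = \frac{C}{|I(u)|\,|I(v)|}
                 \sum_{x \in I(u)} \sum_{y \in I(v)} Sim_k(x, y)
```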

7
PageSim: a new approach
  • Two problems
    • On the Web, not all links are equally important.
      • Affects: Common neighbor, Co-citation
    • A similarity measure should be able to measure the similarity between any two web pages.
      • Affects: SimRank
  • PageSim
    • Takes the above problems into account.

8
PageSim: a new approach (cont.)
  • Co-citation
    • Which page is more similar to d: c or e?
    • Suppose page a is YAHOO!'s homepage, and b is a personal web page.
    • Authoritative pages are more important.

9
PageSim: a new approach (cont.)
  • SimRank
    • Are a and b similar?
    • SimRank says "No" in each case.
    • Are the answers reasonable?

10
PageSim: a new approach (cont.)
  • Page a linking to b and c means that a considers
    • b and c to be somewhat similar to each other;
    • both b and c to be somewhat similar to a as well.
  • Page a spreads similarity to its neighbors.
  • Authoritative pages spread more similarity.

11
PageSim: a new approach (cont.)
  • PageSim
    • In PageSim, the PageRank (PR) score is used to measure the authority of a web page.
    • PR assigns global importance scores to all web pages.
    • Each page spreads its own similarity score (its PR score) to its neighbors.
    • Each page also propagates other pages' similarity scores to its neighbors.
    • After the similarity score propagation has finished, each page holds a vector of similarity scores.
  • PageRank score propagation

12
PageSim: a new approach (cont.)
  • Example: similarity propagation (page a only)
  • PR(a) = 100, PR(b) = 55, PR(c) = 102
  • Each page propagates 80% of its similarity score, split evenly among its neighbors.
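
The arithmetic for page a's score can be written out step by step. The slide's figure is not part of this transcript, so the link structure used here (a links to b and to c, and b links only to c) is an assumption. It is consistent with the entries for a in SV(b) and SV(c) on the next slide.

```latex
\begin{aligned}
a \to b       &: \quad 0.8 \times 100 / 2 = 40 \\
a \to c       &: \quad 0.8 \times 100 / 2 = 40 \\
a \to b \to c &: \quad 0.8 \times 40 / 1 = 32 \\
\text{score of } a \text{ held by } b &: \quad 40 \\
\text{score of } a \text{ held by } c &: \quad 40 + 32 = 72
\end{aligned}
```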

13
PageSim: a new approach (cont.)
  • Example: similarity propagation (cont.)
  • PR(a) = 100, PR(b) = 55, PR(c) = 102
  • Each page holds a similarity score vector (SV).
    • SV(a) = (100, 35, 82)
    • SV(b) = (40, 55, 33)
    • SV(c) = (72, 44, 102)
  • PageSim score (PS) computation
    • $PS(a,b) = \sum_i \min(SV(a)_i, SV(b)_i) = 40 + 35 + 33 = 108$
  • Two pages are more similar if they share more common similarity scores.
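
The PS computation is a sum of element-wise minima, so it can be checked directly against the three score vectors given on this slide. A minimal sketch; the names `pagesim_score` and `score_vectors` are chosen for illustration only.

```python
def pagesim_score(sv_u, sv_v):
    """PS(u,v): sum of the element-wise minima of two similarity score vectors."""
    return sum(min(x, y) for x, y in zip(sv_u, sv_v))


# Similarity score vectors copied from the slide (entries ordered a, b, c).
score_vectors = {
    "a": (100, 35, 82),
    "b": (40, 55, 33),
    "c": (72, 44, 102),
}

# min(100, 40) + min(35, 55) + min(82, 33) = 40 + 35 + 33
print(pagesim_score(score_vectors["a"], score_vectors["b"]))  # 108
```

Because `min` is symmetric in its arguments, swapping the two vectors gives the same score.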

14
PageSim: a new approach (cont.)
  • Example: similarity spreading (cont.)
  • PageSim score matrix
    • PS_matrix = (PS(u,v)), an n×n matrix (lower triangle shown):
            a     b     c
        a   217
        b   108   128
        c   189   117   219
  • PS_matrix is symmetric.
    • $PS(a,b) = PS(b,a)$
  • Any web page is most similar to itself.
    • $PS(u,u) = \max_v PS(u,v)$

15
Demonstrations
  • Example 1: single link
  • PageSim matrix (lower triangle):
            a      b      c      d
        a   100
        b   80     265
        c   64     212    469.2
        d   51.2   169.6  375.4  694.1
  • PR = (100, 185, 257.2, 318.6)
  • SimRank matrix (lower triangle):
            a   b   c   d
        a   1
        b   0   1
        c   0   0   1
        d   0   0   0   1
16
Demonstrations (cont.)
  • Example 2: loop link
  • PageSim matrix (lower triangle):
            a      b      c      d
        a   295.2
        b   246.4  295.2
        c   230.4  246.4  295.2
        d   246.4  230.4  246.4  295.2
  • PR = (100, 100, 100, 100)
  • SimRank matrix (lower triangle):
            a   b   c   d
        a   1
        b   0   1
        c   0   0   1
        d   0   0   0   1
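
A naive SimRank iteration shows why both SimRank matrices above are identity matrices. The sketch below uses the standard SimRank update rule. The loop a → b → c → d → a and the chain a → b → c → d are assumed shapes for the two examples (the figures are missing), and `simrank`, `loop` and `chain_in` are illustrative names.

```python
def simrank(in_links, c=0.8, iterations=10):
    """Naive SimRank iteration over an in-link map {page: [in-neighbors]}."""
    pages = list(in_links)
    # Starting values: Sim(u,u) = 1 and Sim(u,v) = 0 for u != v.
    sim = {u: {v: 1.0 if u == v else 0.0 for v in pages} for u in pages}
    for _ in range(iterations):
        new = {u: {} for u in pages}
        for u in pages:
            for v in pages:
                if u == v:
                    new[u][v] = 1.0
                elif not in_links[u] or not in_links[v]:
                    new[u][v] = 0.0  # a page without in-links is similar to nothing
                else:
                    total = sum(sim[x][y] for x in in_links[u] for y in in_links[v])
                    new[u][v] = c * total / (len(in_links[u]) * len(in_links[v]))
        sim = new
    return sim


# Assumed loop a -> b -> c -> d -> a, written as in-link lists.
loop = {"a": ["d"], "b": ["a"], "c": ["b"], "d": ["c"]}
# Assumed chain a -> b -> c -> d, written as in-link lists.
chain_in = {"a": [], "b": ["a"], "c": ["b"], "d": ["c"]}

for name, graph in (("loop", loop), ("chain", chain_in)):
    scores = simrank(graph)
    print(name, [[scores[u][v] for v in graph] for u in graph])
```

In both graphs no two distinct pages share an in-neighbor, directly or through similar in-neighbors, so every off-diagonal score stays at its starting value of 0.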

17
Demonstrations (cont.)
  • Example 3: more complex
  • PageSim matrix (lower triangle):
            1      2      3      4      5
        1   100.0
        2   40.0   487.6
        3   50.7   159.4  397.4
        4   10.7   238.5  130.0  275.5
        5   10.7   130.0  130.0  130.0  314.9
  • PR = (100, 40.0, 50.7, 10.7, 10.7)
  • SimRank matrix (lower triangle):
            1   2     3     4   5
        1   1
        2   0   1
        3   0   0.25  1
        4   0   0     0.5   1
        5   0   0     0.5   1   1
  • PageSim results
    • v3 is most similar to v1.
    • v4 is most similar to v2.

18
Conclusion and current work
  • Conclusion
    • Web page similarity measures: text-based and link-based.
    • PageSim: PageRank score propagation.
  • Current work
    • Propagation radius pruning.
    • How to compare the performance of two similarity measures, e.g., PageSim and SimRank?
    • Text-based measures.

Thank you!