CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB

1
CONTENT AND LINK ANALYSIS FOR SEARCHING THE WEB
  • Kemal Efe, Vijay Raghavan, and Arun Lakhotia
  • University of Louisiana
  • Presented by Lan Nie
  • 09/01/2005, Lehigh University

2
  • Introduction
  • Search engine
  • Crawl, index and retrieve information about web
    pages.
  • find all of the relevant pages
  • rank them by relevance to the user query
  • present a rank-ordered result
  • Recall and Precision
  • Early Search Engine
  • Solely keyword matching
  • Many low-quality pages; rankings rarely agreed
    with users' interests
  • Synonymy and polysemy
  • Modern Search Engine
  • Linkage structure provides valuable information.
  • Link analysis combined with content analysis
  • Substantially improves the search quality

3
  • Link Analysis
  • Authority Flow Model
  • Link: a channel for authority flow
  • A page q with authority rank r_q(i) at iteration i
    distributes all its current authority equally
    among its outgoing links.
  • However...
  • An authoritative page on a subject is likely to be
    co-cited with other authoritative pages on the
    same subject
  • The rank of a page should be augmented by its
    co-citation degree.
  • Random Walk Model
  • A surfer walks on the web graph and makes random
    decisions about where to go next
  • PageRank is a combination of authority flow model
    and random walk model
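
The authority flow step described on this slide can be sketched roughly as follows. This is only an illustration of one iteration, not the presenters' code: the three-page graph is made up, and dropping the authority of pages without outlinks is an assumption the slide does not address.

```python
def authority_flow_step(outlinks, rank):
    """One iteration: each page q splits its current rank equally among its outlinks."""
    new_rank = {page: 0.0 for page in rank}
    for q, targets in outlinks.items():
        if not targets:
            continue  # assumption: a page with no outlinks passes nothing on
        share = rank[q] / len(targets)
        for p in targets:
            new_rank[p] += share
    return new_rank

# Hypothetical graph: a -> b, a -> c, b -> c, c -> a
outlinks = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
rank = authority_flow_step(outlinks, rank)
```

After one step, page c holds half of the total authority because it receives flow from both a and b.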

4
  • Continued..
  • Authority and Hub
  • Co-citation matrix $A^T A$
  • Entry (p,q): the number of joint co-citations
    received by p and q
  • Entry (p,p): the indegree of page p.
  • Bibliographic coupling matrix $A A^T$
  • Authority / Hub
  • Diagonal term: authority is influenced by the
    number of citations
  • Non-diagonal term: authority is influenced by the
    degree of co-citation
  • Influence (co-citation) Influence (citation)
  • A more general model: different weights for
    diagonal terms and non-diagonal terms in the above
    computation
  • The HITS algorithm combines the authority and hub
    ideas
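
To make the two matrices on this slide concrete, here is a small sketch. It assumes A is the adjacency matrix with A[i][j] = 1 when page i links to page j (the slide does not define A), and the four-page graph is hypothetical.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(x, y):
    cols = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]

# Hypothetical adjacency matrix: A[i][j] = 1 if page i links to page j.
A = [
    [0, 0, 1, 1],  # page 0 cites pages 2 and 3
    [0, 0, 1, 1],  # page 1 cites pages 2 and 3
    [0, 0, 0, 1],  # page 2 cites page 3
    [0, 0, 0, 0],  # page 3 cites nothing
]

cocitation = matmul(transpose(A), A)  # A^T A: entry (p,q) counts pages citing both p and q
coupling = matmul(A, transpose(A))    # A A^T: entry (p,q) counts pages cited by both p and q
```

Here cocitation[3][3] is 3, the indegree of page 3, and cocitation[2][3] is 2 because pages 0 and 1 cite both page 2 and page 3.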

5
Content Analysis
  • Which pages are important in the Web Graph? (Link
    Analysis)
  • Which pages are relevant to the query?
    (Content Analysis)
  • Tasks of Content Analysis
  • How relevant a page is to the user query
  • Similarity between documents in vector space
  • Cosine Similarity
  • Okapi measure, Three Level Scoring, Cover Density
    Ranking
  • Where on the page to search for the query terms
  • Fields: Title, Anchor text, Abstract
  • Properties: Font, Highlighting, Capitalization,
    distance between subquery terms
  • Deal with synonymy and polysemy
  • LSI, GVSM
  • Application in classification, document search
    and relevance ranking.
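
As an illustration of the cosine similarity measure listed above, a minimal sketch follows. It assumes raw term counts and whitespace tokenization; real systems typically add term weighting (for example tf-idf), which the slide does not specify.

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va = Counter(text_a.lower().split())
    vb = Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

score = cosine_similarity("web search engine", "web search")
```

Documents sharing no terms score 0, and identical documents score 1.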

6
  • Combining Content and Link Analysis (PageRank)
  • PageRank: A Random Surfer (Brin and Page, 1998)
  • With probability 1-d, jumps to a random page; with
    probability d, follows a random outlink.
  • Rank is independent of query/topic.
  • Topic-Sensitive PageRank: Multiple Focused
    Surfers (Haveliwala, 2002)
  • A set of predefined topics (top-level categories
    of ODP), with C_t as the set of URLs in the ODP
    category t.
  • Each page is assigned a rank vector, one rank
    for each topic.
  • Each surfer is focused on a specific topic t
  • With probability 1-d, jumps to a page in C_t; with
    probability d, follows a random outlink
  • For a given query, a page's query-sensitive score
    is the inner product of the page's rank vector and
    the query's topic distribution vector.
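
The random-surfer and topic-focused variants above differ only in where the surfer jumps, which the following sketch tries to show. It is an illustration under assumptions: d = 0.85 is a commonly used value not given on the slide, the graph and topic sets are hypothetical, and sending the rank of pages without outlinks to the jump distribution is one of several possible conventions.

```python
def pagerank(outlinks, jump_set=None, d=0.85, iterations=50):
    """Power iteration; jump_set=None gives the plain random surfer."""
    pages = list(outlinks)
    jump_set = list(jump_set) if jump_set else pages
    jump = {p: (1.0 / len(jump_set) if p in jump_set else 0.0) for p in pages}
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {p: (1 - d) * jump[p] for p in pages}  # jump with probability 1-d
        for q in pages:
            targets = outlinks[q]
            if targets:
                share = d * rank[q] / len(targets)  # follow a random outlink
                for p in targets:
                    new_rank[p] += share
            else:
                for p in pages:  # assumption: dangling pages jump instead
                    new_rank[p] += d * rank[q] * jump[p]
        rank = new_rank
    return rank

# Hypothetical graph and topic sets C_t
outlinks = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": ["c"]}
topics = {"sports": ["a"], "cars": ["d"]}
topic_ranks = {t: pagerank(outlinks, jump_set=c_t) for t, c_t in topics.items()}

def query_score(page, topic_weights):
    """Inner product of the page's rank vector and the query's topic distribution."""
    return sum(w * topic_ranks[t][page] for t, w in topic_weights.items())

score = query_score("a", {"sports": 0.8, "cars": 0.2})
```

In this toy graph, page a ranks higher for the "sports" surfer than for the "cars" surfer, so a query weighted toward sports favors it.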

7
  • Combining Content and Link Analysis (HITS)
  • HITS
  • Sampling
  • Use query to collect a root set of pages from a
    textual search engine
  • Expand the root set into a base set by adding
    pages linked to and from the root set
  • Calculation of Authorities and Hubs
  • Problems
  • Tightly Knit Community (TKC) effect
  • HITS converges to the regions of the web graph
    that are highly connected. What if the TKC is
    irrelevant to the topic?
  • A page propagates the same authority weight to
    each page it links to
  • The result is dominated by one community; a page
    would be deemed unimportant if it is popular only
    in a smaller community. Example: Jaguar
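
The sampling and calculation steps of HITS listed on this slide might look roughly like the sketch below. The slide does not spell out the update rules, so this uses the standard mutually reinforcing hub and authority updates with L2 normalization; the graph, page names, and iteration count are hypothetical.

```python
import math

def expand_to_base_set(root, outlinks):
    """Add pages linked from the root set and pages linking into it."""
    base = set(root)
    for q, targets in outlinks.items():
        if q in root:
            base.update(targets)
        elif any(p in root for p in targets):
            base.add(q)
    return base

def hits(outlinks, iterations=50):
    pages = sorted(outlinks)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        auth = {p: 0.0 for p in pages}
        for q in pages:
            for p in outlinks[q]:
                auth[p] += hub[q]  # authority: sum of hub scores of citing pages
        hub = {q: sum(auth[p] for p in outlinks[q]) for q in pages}  # hub: sum over cited pages
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

# Hypothetical web graph; "z" and "z2" are unrelated to the root set.
web = {
    "h1": ["x", "y"], "h2": ["x", "y"], "h3": ["y"],
    "x": [], "y": [], "z": ["z2"], "z2": [],
}
root = {"x", "h3"}
base = expand_to_base_set(root, web)
subgraph = {p: [t for t in web[p] if t in base] for p in base}
auth, hub = hits(subgraph)
```

Because every page passes its full hub score along each link, a densely interlinked group of pages reinforces itself, which is the mechanism behind the TKC effect noted above.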

8
  • Continued..
  • Improvement of HITS
  • Chakrabarti et al., 1999
  • Outlinks in different parts of a page may point to
    different topics
  • Page splitting: outlinks within each small page
    tend to be on the same topic
  • Li et al., 2002
  • A good hub is likely to be cited; hub weights of
    pages are increased depending on their authority
    weights
  • Cohn and Chang, 2001
  • PHITS: a probabilistic model to rank a page
    within its own community rather than within the
    entire base set
  • Dean et al., 1999
  • What's Related? Given the seed page, find its
    parents, children of its parents, its children,
    parents of its children
  • Given a seed page, find pages that link to it, and
    what else they link to
  • Output the pages that are most frequently co-cited
    with the seed URL
  • Bharat and Henzinger, 1998; Chakrabarti et
    al., 1998
  • Weighted HITS
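
The co-citation idea behind "What's Related?" on this slide (find pages that link to the seed, see what else they link to, and output the most frequently co-cited pages) can be sketched as below. This is a simplified illustration, not the published algorithm; the function name and the small graph are hypothetical.

```python
from collections import Counter

def related_by_cocitation(seed, outlinks, top_n=3):
    """Count how often each page is cited together with the seed."""
    counts = Counter()
    for parent, targets in outlinks.items():
        if seed in targets:  # parent links to the seed
            for sibling in set(targets):
                if sibling != seed:
                    counts[sibling] += 1
    return counts.most_common(top_n)

# Hypothetical link graph
outlinks = {
    "p1": ["seed", "a", "b"],
    "p2": ["seed", "a"],
    "p3": ["a", "c"],
}
related = related_by_cocitation("seed", outlinks, top_n=2)
# related == [("a", 2), ("b", 1)]
```

Page c is not returned because its only parent, p3, does not link to the seed.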

9
  • Weighted HITS
  • CLEVER project (Chakrabarti et al., 1998)
  • A relevance weight is computed for each link
  • W(p, q): the number of query matches in the
    text surrounding the link from p to q
  • Query Expansion (Bharat and Henzinger, 1998)
  • A relevance weight is assigned to each page
  • Broader query Q: concatenation of the first 1000
    words from each document in the root set
  • W(p): cosine similarity between page p and the
    broader query Q
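
A link-weighted authority update in the spirit of the CLEVER bullet above could be sketched like this. It follows the slide's description of W(p, q) as a count of query matches near the link; the tokenization, the data layout, and the example links are assumptions made for illustration.

```python
def link_weight(query, surrounding_text):
    """W(p, q): number of words near the link that match a query term."""
    terms = set(query.lower().split())
    return sum(1 for word in surrounding_text.lower().split() if word in terms)

def weighted_authority_step(links, hub, query):
    """links: (p, q, text near the link) triples; each hub vote is scaled by W(p, q)."""
    auth = {}
    for p, q, text in links:
        auth[q] = auth.get(q, 0.0) + link_weight(query, text) * hub[p]
    return auth

# Hypothetical links with the text around each anchor
links = [
    ("h1", "x", "jaguar car dealers"),
    ("h2", "x", "big cats"),
    ("h2", "y", "jaguar habitat"),
]
hub = {"h1": 1.0, "h2": 1.0}
auth = weighted_authority_step(links, hub, "jaguar car")
```

With these weights, the link from h2 to x contributes nothing because its surrounding text does not mention the query, so off-topic links no longer pass authority.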