1
Combating Web Spam with TrustRank
  • Presented by
  • Nitin Mittal

2
About the Paper
  • Name
  • Combating Web Spam with TrustRank
  • Authors
  • Zoltan Gyongyi
  • Hector Garcia-Molina
  • Jan Pedersen
  • Year
  • 2004
  • Topic
  • This paper presented the first attempt to formalize the problem of combating Web spam and introduced a comprehensive solution to assist in the detection of Web spam.

3
Presentation Outline
  • Introduction
  • Preliminaries
  • TrustRank Algorithm
  • Selecting Seed Set
  • Experimental Results
  • Conclusions
  • Further improvements

4
Introduction
  • Web Spam
  • The term Web spam refers to hyperlinked pages on the World Wide Web that are created with the intention of misleading search engines.
  • Techniques of Web Spam
  • Adding thousands of keywords to a home page, so that a search engine indexes them and returns the bogus site as an answer to queries on those keywords.
  • Creating a large number of bogus web pages that all point to a single target page, since many search engines rank pages based on their incoming links.

5
Introduction (Cont.)
  • Issues with Web Spam
  • It is hard to check whether a page's content is actually related to the keywords listed on its main page, or whether those keywords were just inserted as Web spam.
  • Interlinked websites may represent useful relations between the sites, or the links may have been created only to boost the rank of each other's pages.
  • It is not an easy task for a computer to categorize web pages, due to the large number of pages on the Web.

6
Introduction (Cont.)
  • The contributions of this paper are:
  • Formalizing the problem of Web spam and of spam detection algorithms.
  • Defining metrics for assessing the efficacy of detection algorithms.
  • Presenting schemes for selecting seed sets of pages to be manually evaluated.
  • Introducing the TrustRank algorithm for determining the likelihood that pages are reputable (not spam).
  • Discussing the results of an extensive evaluation based on 31 million sites crawled by the AltaVista search engine.

7
Preliminaries
  • Web Model
  • PageRank

8
Preliminaries (Cont.)
  • 1. Web Model
  • The Web is modeled as a graph G = (V, E), consisting of a set V of N pages (vertices) and a set E of directed links (edges) that connect pages.
  • Multiple hyperlinks between two pages p and q are collapsed into a single link (p, q) ∈ E.
  • The in-degree ι(p) of a page p is the number of its inlinks.
  • The out-degree ω(p) of a page p is the number of its outlinks.
  • Pages without outlinks are referred to as non-referencing pages.
  • Pages without inlinks are referred to as unreferenced pages.
  • Pages that are both unreferenced and non-referencing at the same time are referred to as isolated pages.

9
Preliminaries (Cont.)
  • Page 1 is an unreferenced page.
  • Page 4 is a non-referencing page.

10
Preliminaries (Cont.)
  • Transition matrix T: T(p, q) = 1/ω(q) if (q, p) ∈ E, and 0 otherwise.
  • Inverse transition matrix U: U(p, q) = 1/ι(q) if (p, q) ∈ E, and 0 otherwise (both matrices are sketched in code below).
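To make the preceding definitions concrete, here is a minimal Python sketch (numpy assumed; the 4-page link structure is hypothetical, chosen only to be consistent with the example slide where page 1 is unreferenced and page 4 is non-referencing):

import numpy as np

# Hypothetical 4-page graph: page 1 has no inlinks (unreferenced),
# page 4 has no outlinks (non-referencing). Links are already collapsed,
# i.e. at most one link per ordered pair (p, q).
edges = {(1, 2), (1, 3), (2, 3), (3, 2), (2, 4), (3, 4)}
N = 4

def in_degree(p):                 # iota(p): number of inlinks of page p
    return sum(1 for (_, q) in edges if q == p)

def out_degree(p):                # omega(p): number of outlinks of page p
    return sum(1 for (q, _) in edges if q == p)

# Transition matrix T: T[p, q] = 1/omega(q) if q links to p, else 0.
# Inverse transition matrix U: U[p, q] = 1/iota(q) if p links to q, else 0.
T = np.zeros((N, N))
U = np.zeros((N, N))
for (p, q) in edges:              # link p -> q
    T[q - 1, p - 1] = 1.0 / out_degree(p)
    U[p - 1, q - 1] = 1.0 / in_degree(q)

print([p for p in range(1, N + 1) if in_degree(p) == 0])   # unreferenced -> [1]
print([p for p in range(1, N + 1) if out_degree(p) == 0])  # non-referencing -> [4]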

11
Preliminaries (Cont.)
  • 2. PageRank
  • PageRank is a well known algorithm that uses
    link information to assign global importance
    scores to all pages on the web.
  • The intuition behind PageRank is that a web page
    is important, if several other important web
    pages point to it.
  • The PageRank score r(p) of a page p is defined as
  • r(p) = α · Σ_{q : (q, p) ∈ E} r(q)/ω(q) + (1 - α) · 1/N
  • where α is a decay factor (a power-iteration sketch follows below).
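As a rough sketch (numpy assumed; α = 0.85 is just a commonly used value, and T is a transition matrix such as the one built above), the formula can be evaluated by simple power iteration; TrustRank later reuses the same iteration with a different static score distribution d:

import numpy as np

def pagerank(T, alpha=0.85, d=None, iters=100):
    # Iterates r = alpha * T r + (1 - alpha) * d.
    # With the uniform d = (1/N, ..., 1/N) this is ordinary PageRank;
    # a biased (non-uniform) d gives the personalized variant used by TrustRank.
    N = T.shape[0]
    if d is None:
        d = np.full(N, 1.0 / N)
    r = np.full(N, 1.0 / N)
    for _ in range(iters):
        r = alpha * (T @ r) + (1 - alpha) * d
    return r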

12
TrustRank Algorithm
  • 1. Assessing Trust
  • 2. Computing Trust
  • 3. The TrustRank Algorithm

13
TrustRank Algorithm (Cont.)
  • 1. Assessing Trust
  • Determining whether an initial set of pages is Web spam or not requires human evaluation. This is formalized as a binary oracle function O over all pages p ∈ V, with O(p) = 0 if p is bad and O(p) = 1 if p is good.
  • Oracle invocations are expensive and time consuming. Thus, the objective is to call the function O only on selected pages.

14
TrustRank Algorithm (Cont.)
  • To discover good pages without invoking the oracle function on the entire Web, the paper makes an important empirical observation: good pages seldom point to bad ones.
  • Spam pages can, and in fact often do, link to good pages.

15
TrustRank Algorithm (Cont.)
  • Threshold Trust Property
  • If a page p receives a trust score T(p) above a threshold value δ, then we know that it is good. Otherwise, we cannot tell anything about p.
  • Here T is the (ideal) trust function.

16
TrustRank Algorithm (Cont.)
  • 2. Computing Trust
  • There is a limited budget L of oracle (function O) invocations. The subsets of good and bad seed pages are denoted by S+ and S-, respectively. The remaining pages are not checked by the human expert, so we assign them a trust score of 1/2 to signal our lack of information.

17
TrustRank Algorithm (Cont.)
  • Trust Propagation
  • We cannot be sure that pages reachable from good seeds are indeed good.
  • The further away we are from good seed pages, the less certain we are that a page is good.
  • Trust Attenuation techniques:
  • Trust Dampening

18
TrustRank Algorithm (Cont.)
Trust Splitting
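A toy sketch of the two attenuation ideas (the graph, the seed, and β = 0.85 are all assumed for illustration): with dampening, a page k hops from a good seed receives trust β^k; with splitting, a page passes the trust it has received on to its outlinks in equal shares of 1/ω(p).

# Tiny hypothetical graph: page 1 is the good seed.
outlinks = {1: [2, 3], 2: [4], 3: [4], 4: []}
beta = 0.85                         # dampening factor (assumed value)

# Trust dampening: trust decays by beta per hop from the seed (BFS).
damp, frontier = {1: 1.0}, [1]
while frontier:
    nxt = []
    for p in frontier:
        for q in outlinks[p]:
            if q not in damp:       # keep the shortest-path (largest) value
                damp[q] = damp[p] * beta
                nxt.append(q)
    frontier = nxt

# Trust splitting: each page splits the trust it has received equally
# among its outlinks (single pass; this dict happens to be in topological order).
split = {p: 0.0 for p in outlinks}
split[1] = 1.0
for p, qs in outlinks.items():
    for q in qs:
        split[q] += split[p] / len(qs)

print(damp)    # {1: 1.0, 2: 0.85, 3: 0.85, 4: 0.7225}
print(split)   # {1: 1.0, 2: 0.5, 3: 0.5, 4: 1.0}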
19
TrustRank Algorithm (Cont.)
20
TrustRank Algorithm (Cont.)
  • Step 1: s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
  • Step 2: σ = [2, 4, 5, 1, 3, 6, 7]
  • Step 3: Invoke the oracle function on the set {2, 4, 5}.
  • Step 4: d = [0, 1/2, 0, 1/2, 0, 0, 0]
  • Step 5: Evaluate the TrustRank scores on the whole set (these steps are sketched in code below).
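A sketch of these five steps in Python (numpy assumed). The step 1 scores, the ordering σ, and the resulting distribution d are the ones shown above; the 7-site link structure and the oracle's answers (sites 2 and 4 good, site 5 spam) are assumed for illustration only:

import numpy as np

N, L, alpha, M = 7, 3, 0.85, 20       # sites, oracle budget, decay, iterations

# Hypothetical link structure among the 7 sites (not the paper's figure).
edges = [(1, 2), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6), (6, 7), (7, 1)]
out_deg = {p: sum(1 for (a, _) in edges if a == p) for p in range(1, N + 1)}
T = np.zeros((N, N))
for (p, q) in edges:
    T[q - 1, p - 1] = 1.0 / out_deg[p]

# Step 1: seed-desirability scores (e.g. inverse PageRank), as on the slide.
s = np.array([0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02])

# Step 2: rank sites by decreasing score -> sigma = [2, 4, 5, 1, 3, 6, 7].
sigma = np.argsort(-s, kind="stable") + 1

# Step 3: invoke the oracle on the top-L sites only (here {2, 4, 5}).
oracle = lambda p: 1 if p in (2, 4) else 0     # assumed answers: site 5 is spam
good = [p for p in sigma[:L] if oracle(p) == 1]

# Step 4: normalized static score distribution d = [0, 1/2, 0, 1/2, 0, 0, 0].
d = np.zeros(N)
d[[p - 1 for p in good]] = 1.0 / len(good)

# Step 5: TrustRank = biased PageRank with d, iterated M times.
t = d.copy()
for _ in range(M):
    t = alpha * (T @ t) + (1 - alpha) * d
print(t)                                       # trust scores for all 7 sites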

21
Selecting Seed Set
  • 1. Inverse PageRank
  • We could select seed pages based on the number of their outlinks.
  • Inverse PageRank is a heuristic that works well in practice (sketched below).
  • 2. High PageRank
  • Since high-PageRank pages are likely to point to other high-PageRank pages, good trust scores will also be propagated to pages that are likely to appear at the top of result sets.
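A brief sketch of the inverse-PageRank heuristic (assuming the inverse transition matrix U from the Preliminaries and the same power iteration as for PageRank): it is PageRank computed on the graph with all links reversed, so sites whose outlinks reach many other sites in few steps score high and make attractive seed candidates.

import numpy as np

def inverse_pagerank(U, alpha=0.85, iters=50):
    # Ordinary PageRank iteration, but driven by the inverse transition
    # matrix U, i.e. PageRank of the web graph with every link reversed.
    N = U.shape[0]
    s = np.full(N, 1.0 / N)
    for _ in range(iters):
        s = alpha * (U @ s) + (1 - alpha) / N
    return s

# Candidate seeds are then examined in decreasing order of this score:
# sigma = np.argsort(-inverse_pagerank(U), kind="stable") + 1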

22
Experimental Results
  • In August 2003, using the complete set of pages crawled and indexed by the AltaVista search engine, the authors grouped several billion pages into 31,003,946 sites.
  • One of the paper's authors played the role of the oracle.
  • After conducting experiments comparing the inverse PageRank and the high PageRank seed selection schemes, inverse PageRank was selected.
  • With the help of major web directories, they selected 7,900 sites out of the top 25,000.
  • The oracle was called on the top 1,250 sites, and 178 sites were selected as good seeds.

23
Experimental Results (Cont.)

PageRank Versus TrustRank
24
Experimental Results (Cont.)
  • PageRank Versus TrustRank
  • The PageRank algorithm does not incorporate any knowledge about the quality of a site, nor does it explicitly penalize badness, whereas TrustRank is meant to differentiate good and bad sites.
  • There is almost no spam in the top 5 TrustRank buckets, while, surprisingly, almost 20% of the second PageRank bucket is bad.
  • Spam is most prevalent in PageRank buckets 9 and 10, while the corresponding TrustRank buckets contain considerably less spam.

25
Experimental Results (Cont.)
  • Other strategies used to evaluate the results:
  • 1. Pairwise Orderedness
  • 2. Precision and Recall

26
Experimental Results (Cont.)
  • 1. Pairwise Orderedness
  • Pairwise orderedness is related to the ordered trust property: it measures the fraction of website pairs (p, q) for which the trust function T does not make an ordering mistake (a small sketch follows).
  • TrustRank consistently outperforms both the ignorant trust function and PageRank.
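A minimal sketch of the metric (all inputs hypothetical; ties are counted as non-violations here, which is an assumption): for each sampled pair where the oracle says one site is good and the other bad, the trust function makes a mistake if the bad site gets the strictly higher score.

def pairwise_orderedness(pairs, trust, oracle):
    # Fraction of sampled pairs (p, q) on which the trust scores do not
    # contradict the oracle's good/bad judgment.
    violations = 0
    for (p, q) in pairs:
        if oracle(p) > oracle(q) and trust[p] < trust[q]:
            violations += 1        # good site scored below a bad one
        elif oracle(q) > oracle(p) and trust[q] < trust[p]:
            violations += 1
    return (len(pairs) - violations) / len(pairs)

# Made-up example: the pair (3, 2) is violated, so orderedness is 0.5.
trust = {1: 0.9, 2: 0.5, 3: 0.4}
oracle = lambda p: 1 if p in (1, 3) else 0
print(pairwise_orderedness([(1, 2), (3, 2)], trust, oracle))   # -> 0.5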

27
Experimental Results (Cont.)
  • 2. Precision and Recall
  • Precision: the fraction of good pages among all pages in the set X that have a trust score above the average trust score.
  • Recall: the ratio between the number of good pages with a trust score above the average trust score and the total number of good pages in the set X (both metrics are sketched below).
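A small sketch of the two metrics as defined above (hypothetical sample X, trust scores, and oracle values; the average trust score of X serves as the threshold, following the definition on this slide):

def precision_recall(X, trust, oracle):
    threshold = sum(trust[p] for p in X) / len(X)       # average trust score
    above = [p for p in X if trust[p] > threshold]      # pages above average
    good_above = [p for p in above if oracle(p) == 1]
    good_total = [p for p in X if oracle(p) == 1]
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / len(good_total) if good_total else 0.0
    return precision, recall

# Made-up example: 3 of the 4 above-average pages are good (precision 0.75),
# and 3 of the 4 good pages score above average (recall 0.75).
X = [1, 2, 3, 4, 5, 6]
trust = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.2, 6: 0.1}
oracle = lambda p: 1 if p in (1, 2, 3, 5) else 0
print(precision_recall(X, trust, oracle))               # -> (0.75, 0.75)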

28
Conclusions
  • Search engines combat Web spam with a variety of ad hoc, often proprietary techniques.
  • The paper presented a solution to assist in the detection of Web spam with the help of TrustRank.
  • The results show that the presented solution can effectively identify a significant number of strongly reputable (non-spam) pages.
  • TrustRank can be used either separately to filter the index, or in combination with PageRank and other metrics to rank search results.

29
Limitation
  • Although TrustRank guarantees that top-scored sites are good ones and gives better results than PageRank, it is unable to effectively separate low-scored good sites from bad ones, due to the lack of distinguishing features (inlinks) of those sites.

30
Further improvements
  • Explore the interplay between dampening and splitting for trust propagation.
  • Instead of selecting the entire seed set at once, an iterative process could be implemented: after the oracle has evaluated some pages, the outcomes of those evaluations are used to decide which pages the oracle should evaluate next (see the sketch below).
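One possible reading of this iterative idea, as a hedged sketch (the re-scoring helper and all names are hypothetical, not from the paper): after every oracle answer, the remaining candidates are re-ranked before the next page is chosen.

def iterative_seed_selection(candidates, oracle, rescore, budget):
    # rescore(candidates, answers) is a hypothetical helper that re-ranks
    # the remaining candidates given the oracle answers collected so far.
    answers, good_seeds = {}, []
    for _ in range(budget):
        ranked = rescore(candidates, answers)
        p = next(q for q in ranked if q not in answers)   # best unevaluated page
        answers[p] = oracle(p)                            # one oracle invocation
        if answers[p] == 1:
            good_seeds.append(p)
    return good_seeds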

31
  • Thank You