Combating Web Spam with TrustRank - PowerPoint PPT Presentation

About This Presentation

Title:

Combating Web Spam with TrustRank

Slides: 32

Provided by: nitinm

Transcript and Presenter's Notes

1
Combating Web Spam with TrustRank

Presented by
Nitin Mittal

2
About the Paper

Name
Combating Web Spam with TrustRank
Authors
Zoltan Gyongyi
Hector Garcia-Molina
Jan Pedersen
Year
2004
Topic
This paper presents a first attempt to
formalize the problem of combating web spam
and introduces a comprehensive solution to
assist in the detection of web spam.

3
Presentation Outline

Introduction
Preliminaries
TrustRank Algorithm
Selecting Seed Set
Experiment Result
Conclusions
Further improvements

4
Introduction

Web Spam
The term web spam refers to hyperlinked pages
on the World Wide Web that are created with the
intention of misleading search engines.
Techniques of Web Spam
Adding thousands of keywords to a home page,
so that a search engine indexes the keywords
and returns the bogus site as an answer to queries
on those keywords.
Creating a large number of bogus web pages, all
pointing to a single target page, since many
search engines rank pages by taking incoming
links into account.

5
Introduction(Cont.)

Issues with Web Spam
It is hard to check whether a page's content is
actually related to the keywords listed on its
main page, or whether the keywords were inserted
only as web spam.
Links between websites may represent useful
relations between the sites, or they may have
been created to boost the rank of each other's
pages.
Categorizing web pages is not an easy task for a
computer, given the very large number of pages.

6
Introduction(Cont.)

The contributions of this paper are:
Formalizes the problem of web spam and spam
detection algorithms.
Defines metrics for assessing the efficacy of
detection algorithms.
Presents schemes for selecting seed sets of
pages to be manually evaluated.
Introduces the TrustRank algorithm for
determining the likelihood that pages are
reputable (not spam).
Discusses the results of an extensive evaluation
based on 31 million sites crawled by the Alta
Vista search engine.

7
Preliminaries

Web Model
Page Rank

8
Preliminaries (Cont)

1. Web Model
The web is modeled as a graph G = (V, E) consisting
of a set V of N pages (vertices) and a set E of
directed links (edges) that connect pages.
Multiple hyperlinks from a page p to a page q are
collapsed into a single link $(p, q) \in E$.
The in-degree, the number of inlinks of a page p,
is $\iota(p)$.
The out-degree, the number of outlinks of a page p,
is $\omega(p)$.
Pages without outlinks are referred to as
non-referencing pages.
Pages without inlinks are referred to as
unreferenced pages.
Pages that are both unreferenced and
non-referencing at the same time are referred to as
isolated pages.

9
Preliminaries (Cont)

Page 1 is an unreferenced page.
Page 4 is a non-referencing page.
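
The web-model definitions can be made concrete with a short Python sketch. The slide's figure is not reproduced in this transcript, so the four-page graph below is a hypothetical one, chosen only so that page 1 comes out unreferenced and page 4 non-referencing; all variable names are illustrative.

```python
# Hypothetical 4-page graph; each link is a (source, target) pair.
pages = [1, 2, 3, 4]
links = [(1, 2), (2, 3), (3, 2), (2, 4), (2, 3)]  # (2, 3) is repeated on purpose

# Multiple hyperlinks between the same two pages collapse into one edge.
edges = set(links)

# In-degree: number of inlinks. Out-degree: number of outlinks.
indegree = {p: sum(1 for (_, v) in edges if v == p) for p in pages}
outdegree = {p: sum(1 for (u, _) in edges if u == p) for p in pages}

unreferenced = [p for p in pages if indegree[p] == 0]
non_referencing = [p for p in pages if outdegree[p] == 0]
isolated = [p for p in pages if indegree[p] == 0 and outdegree[p] == 0]

print(unreferenced, non_referencing, isolated)
```

Storing the edges in a set is one simple way to get the "collapse multiple hyperlinks" behavior for free.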

10
Preliminaries (Cont)

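Transition matrix, T
$T(p, q) = 1/\omega(q)$ if $(q, p) \in E$, and 0 otherwise, where $\omega(q)$ is the out-degree of q.
Inverse transition matrix, U
$U(p, q) = 1/\iota(q)$ if $(p, q) \in E$, and 0 otherwise, where $\iota(q)$ is the in-degree of q.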
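
As a rough illustration of the two matrices, the sketch below builds them for a hypothetical four-page graph. It assumes the TrustRank paper's convention that T(p, q) is 1/out-degree(q) when q links to p, and U(p, q) is 1/in-degree(q) when p links to q; the function name `transition_matrices` and the example edges are made up for this sketch.

```python
from fractions import Fraction

def transition_matrices(n, edges):
    """Return (T, U) as n x n nested lists for pages numbered 1..n.

    Assumes `edges` contains each (source, target) link at most once.
    """
    outdeg = [0] * (n + 1)
    indeg = [0] * (n + 1)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    T = [[Fraction(0)] * n for _ in range(n)]
    U = [[Fraction(0)] * n for _ in range(n)]
    for u, v in edges:
        # T(p, q) = 1/outdeg(q) when q links to p; here q = u and p = v.
        T[v - 1][u - 1] = Fraction(1, outdeg[u])
        # U(p, q) = 1/indeg(q) when p links to q; here p = u and q = v.
        U[u - 1][v - 1] = Fraction(1, indeg[v])
    return T, U

# Hypothetical graph: 1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4.
T, U = transition_matrices(4, {(1, 2), (2, 3), (3, 2), (2, 4)})
for row in T:
    print([str(x) for x in row])
```

Exact fractions are used only to keep the entries readable; floating-point values would work equally well.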

11
Preliminaries (Cont)

2. PageRank
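PageRank is a well-known algorithm that uses
link information to assign global importance
scores to all pages on the web.
The intuition behind PageRank is that a web page
is important if several other important web
pages point to it.
The PageRank score r(p) of a page p is defined
as
$r(p) = \alpha \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \frac{1}{N}$
where $\alpha$ is a decay factor, $\omega(q)$ is the
out-degree of q, and N is the number of pages.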
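
One common way to compute such scores is simple iteration, sketched below under a few assumptions: pages are indexed 0..n-1, the decay factor defaults to 0.85, and a fixed number of iterations is used instead of a convergence test. The function name `pagerank` and its parameters are illustrative, not taken from the slides.

```python
def pagerank(n, edges, alpha=0.85, iterations=50):
    """Iteratively approximate PageRank for pages 0..n-1.

    Each page passes alpha * (its score / its out-degree) along every
    outlink and also receives a static share (1 - alpha) / n.
    Assumes `edges` lists each (source, target) link at most once.
    """
    outdeg = [0] * n
    for u, _ in edges:
        outdeg[u] += 1
    r = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for u, v in edges:
            nxt[v] += alpha * r[u] / outdeg[u]
        r = nxt
    return r

# A 3-page cycle: by symmetry every page ends up with the same score.
cycle_scores = pagerank(3, [(0, 1), (1, 2), (2, 0)])
# Two pages pointing at page 1: page 1 scores higher than its referrers.
star_scores = pagerank(3, [(0, 1), (2, 1)])
print(cycle_scores, star_scores)
```

Note that in this sketch non-referencing pages do not redistribute their score, so the scores need not sum to 1.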

12
TrustRank Algorithm

1. Assessing Trust
2. Computing Trust
3. The TrustRank Algorithm

13
TrustRank Algorithm(Cont)

1. Assessing Trust
Determining whether the pages in an initial set
are web spam requires human evaluation.
The human check of a page for spam is formalized
as a binary oracle function O over all
pages $p \in V$: O(p) = 1 if p is good and
O(p) = 0 if p is bad.
Oracle invocations are expensive and time
consuming. Thus, the objective is to call
O only on a small, carefully selected set of pages.

14
TrustRank Algorithm(Cont)

To discover good pages without invoking the
oracle function on the entire web, the paper relies
on an important empirical observation: good pages
seldom point to bad ones.
The converse does not hold: spam pages can, and in
fact often do, link to good pages.

15
TrustRank Algorithm(Cont)

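Threshold Trust Property
If a page p receives a score above a threshold
$\delta$, then we know that it is good. Otherwise,
we cannot tell anything about p.
Here T(p) is the trust score of page p. Under the
ideal trust property, T(p) would equal the
probability that p is good.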

16
TrustRank Algorithm(Cont)

2. Computing Trust
There is a limited budget L of invocations of the
oracle function O. A seed set S of L pages is
checked, and the subsets of good and bad seed
pages are denoted by $S^+$ and $S^-$, respectively.
Since the remaining pages are not checked by the
human expert, we assign them a trust score of ½ to
signal our lack of information.
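
A minimal sketch of this initial assignment follows. The seven pages, the ground truth behind the oracle, and the choice of seeds are all hypothetical; in practice the oracle is a human judge and each call counts against the budget L.

```python
def initial_trust(pages, seeds, oracle):
    """Give checked seed pages their oracle value and all others 1/2."""
    return {p: float(oracle(p)) if p in seeds else 0.5 for p in pages}

# Hypothetical ground truth standing in for the human expert.
good_pages = {1, 2, 3, 4}

def oracle(p):
    return 1 if p in good_pages else 0

pages = range(1, 8)
seeds = {1, 3, 6}  # budget L = 3 oracle invocations
t0 = initial_trust(pages, seeds, oracle)
print(t0)
```

Only pages 1, 3 and 6 get a definite score here; every other page stays at 0.5 until trust is propagated along links.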

17
TrustRank Algorithm(Cont)

Trust Propagation
We cannot be sure that pages reachable from
good seeds are indeed good.
The further away we are from good seed pages,
the less certain we are that a page is good.
Trust attenuation can be implemented in two ways:
Trust Dampening

18
TrustRank Algorithm(Cont)
Trust Splitting
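
The two attenuation ideas can be sketched together in one propagation step. This sketch assumes that dampening multiplies the trust passed along a link by a factor beta smaller than 1, and that splitting divides a page's trust equally among its outlinks; the function name `propagate_once`, the value of beta, and the example graph are illustrative only.

```python
def propagate_once(trust, edges, beta=0.5):
    """One step of trust propagation with dampening and splitting.

    Splitting: a page divides its trust equally among its outlinks.
    Dampening: whatever is passed along a link is scaled by beta.
    Assumes `edges` lists each (source, target) link at most once.
    """
    outdeg = {}
    for u, _ in edges:
        outdeg[u] = outdeg.get(u, 0) + 1
    received = {}
    for u, v in edges:
        share = beta * trust.get(u, 0.0) / outdeg[u]
        received[v] = received.get(v, 0.0) + share
    return received

# Hypothetical graph: good seed 1 links to 2 and 3, and 2 links to 4.
edges = [(1, 2), (1, 3), (2, 4)]
first = propagate_once({1: 1.0}, edges)
second = propagate_once(first, edges)
print(first, second)
```

Pages 2 and 3 each receive half of the dampened trust of page 1, and page 4, two links away from the seed, receives a further dampened amount in the second step.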
19
TrustRank Algorithm(Cont)
20
TrustRank Algorithm(Cont)

Step 1.
s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
Step 2.
$\sigma$ = [2, 4, 5, 1, 3, 6, 7]
Step 3.
Invoke the oracle function on the set {2, 4, 5}.
Step 4.
d = [0, 1/2, 0, 1/2, 0, 0, 0]
Step 5.
Evaluate the TrustRank score on the whole set.
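
The five steps can be sketched end to end as follows. This is a minimal sketch, not the authors' implementation: the seven-page link graph and the ground truth behind the oracle are hypothetical, pages are 0-indexed (index 1 is "page 2" on the slide), and the decay factor and iteration count are assumed defaults. Only the score vector s is taken from the slide.

```python
def select_good_seeds(seed_scores, oracle, budget):
    """Steps 2-3: rank pages by seed score, check only the top `budget`."""
    order = sorted(range(len(seed_scores)),
                   key=lambda p: seed_scores[p], reverse=True)
    good = [p for p in order[:budget] if oracle(p) == 1]
    return order, good

def trustrank(n, edges, good_seeds, alpha=0.85, iterations=50):
    """Steps 4-5: normalized seed vector d, then a PageRank-style
    iteration whose static score is d instead of the uniform 1/n."""
    if not good_seeds:
        raise ValueError("need at least one good seed")
    outdeg = [0] * n
    for u, _ in edges:
        outdeg[u] += 1
    d = [0.0] * n
    for p in good_seeds:
        d[p] = 1.0 / len(good_seeds)
    t = d[:]
    for _ in range(iterations):
        nxt = [(1.0 - alpha) * x for x in d]
        for u, v in edges:
            nxt[v] += alpha * t[u] / outdeg[u]
        t = nxt
    return d, t

s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]  # Step 1 (from the slide)
good_pages = {0, 1, 2, 3}                       # hypothetical ground truth

def oracle(p):
    return 1 if p in good_pages else 0

# Hypothetical link graph, each link listed once.
edges = [(0, 1), (1, 2), (1, 3), (2, 1), (3, 4), (4, 5), (4, 6), (5, 2)]

order, good_seeds = select_good_seeds(s, oracle, budget=3)
d, t = trustrank(7, edges, good_seeds)
print([p + 1 for p in order])  # 1-indexed ranking, as on the slide
print(d)
print([round(x, 2) for x in t])
```

With these inputs the ranking and the vector d agree with Steps 2 and 4 on the slide; the final scores depend on the assumed link graph.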

21
Selecting Seed Set

1. Inverse PageRank
Seed pages could be selected based on the
number of outlinks, so that trust reaches many pages.
Inverse PageRank generalizes this idea. It is a
heuristic that works well in practice.
2. High PageRank
Since high-PageRank pages are likely to point to
other high-PageRank pages, good trust scores
will also be propagated to pages that are likely
to be at the top of result sets.
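
One way to compute inverse PageRank is to run an ordinary PageRank iteration on the graph with every link reversed, so that a page scores highly when it links to many pages that themselves link onward. The sketch below assumes this formulation; the function names, the decay factor of 0.85, and the example graph are illustrative.

```python
def pagerank(n, edges, alpha=0.85, iterations=50):
    """Plain iterative PageRank for pages 0..n-1 (each link listed once)."""
    outdeg = [0] * n
    for u, _ in edges:
        outdeg[u] += 1
    r = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for u, v in edges:
            nxt[v] += alpha * r[u] / outdeg[u]
        r = nxt
    return r

def inverse_pagerank(n, edges, alpha=0.85, iterations=50):
    """PageRank on the reversed graph: rewards pages with useful outlinks."""
    reversed_edges = [(v, u) for (u, v) in edges]
    return pagerank(n, reversed_edges, alpha, iterations)

# Hypothetical graph: page 0 is a hub linking to pages 1, 2 and 3.
hub_edges = [(0, 1), (0, 2), (0, 3)]
seed_scores = inverse_pagerank(4, hub_edges)
forward_scores = pagerank(4, hub_edges)
print(seed_scores, forward_scores)
```

The hub gets the highest inverse PageRank, which makes it the most attractive seed, even though its ordinary PageRank is the lowest in this graph.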

22
Experiment Result

In August 2003, using the complete set of pages
crawled and indexed by the Alta Vista search
engine, the authors grouped several billion pages
into 31,003,946 sites.
An author of the paper played the role of the
oracle.
After experiments comparing the inverse PageRank
and high PageRank seed selection schemes,
inverse PageRank was selected.
With the help of major web directories, they
narrowed the top 25,000 sites down to 7,900.
The oracle was called on the top 1,250 of these,
and 178 sites were selected as good seeds.

23
Experiment Result (Cont.)

PageRank Versus TrustRank
24
Experiment Result (Cont.)

PageRank Versus TrustRank
The PageRank algorithm does not incorporate any
knowledge about the quality of a site, nor does
it explicitly penalize badness, whereas
TrustRank is meant to differentiate good and bad
sites.
There is almost no spam in the top 5 TrustRank
buckets, while, surprisingly, almost 20% of the
second PageRank bucket is bad.
Spam is highest in PageRank buckets 9 and 10,
while the corresponding TrustRank buckets show
different values.

25
Experiment Result (Cont.)

Other Strategies to evaluate results
1. Pairwise Orderedness
2. Precision and Recall

26
Experiment Result (Cont.)

1. Pairwise Orderedness
Pairwise orderedness is related to the ordered
trust property. It is the fraction of the pairs
of sites (p, q) for which the trust function T
does not make a mistake, that is, does not rank
a bad site at or above a good one.
TrustRank consistently outperforms both the
ignorant trust function and PageRank.
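
A small sketch of how such a metric might be computed is given below. It assumes a pair counts as a mistake when the two sites differ in their oracle value and the bad one receives a trust score at least as high as the good one; the function name, the site labels, and the scores are hypothetical.

```python
def pairwise_orderedness(trust, oracle, pairs):
    """Fraction of pairs (p, q) that the trust scores order correctly."""
    violations = 0
    for p, q in pairs:
        if trust[p] >= trust[q] and oracle[p] < oracle[q]:
            violations += 1  # bad p scored at least as high as good q
        elif trust[p] <= trust[q] and oracle[p] > oracle[q]:
            violations += 1  # good p scored no higher than bad q
    return (len(pairs) - violations) / len(pairs)

oracle = {"a": 1, "b": 0, "c": 1}  # hypothetical labels: 1 good, 0 bad
pairs = [("a", "b"), ("b", "c"), ("a", "c")]

well_ordered = pairwise_orderedness({"a": 0.9, "b": 0.2, "c": 0.5}, oracle, pairs)
badly_ordered = pairwise_orderedness({"a": 0.9, "b": 0.95, "c": 0.5}, oracle, pairs)
print(well_ordered, badly_ordered)
```

Pairs in which both sites are good, or both are bad, never count as mistakes, so only mixed pairs can lower the score.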

27
Experiment Result (Cont.)

2. Precision and Recall
Precision:
The fraction of good pages among all pages in
the set X that have a trust score above the
threshold $\delta$.
Recall:
The ratio between the number of good pages in X
with a trust score above the threshold $\delta$
and the total number of good pages in the set X.
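
Both ratios can be computed directly from a labeled sample, as in the sketch below. The cutoff is passed in as a parameter `delta`, so it can be set to whatever trust-score threshold is being evaluated; the function name, the sample, and the scores are hypothetical.

```python
def precision_recall(trust, oracle, sample, delta):
    """Precision and recall of 'trust score above delta' as a test for good."""
    above = [p for p in sample if trust[p] > delta]
    good_above = [p for p in above if oracle[p] == 1]
    all_good = [p for p in sample if oracle[p] == 1]
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / len(all_good) if all_good else 0.0
    return precision, recall

trust = {"a": 0.9, "b": 0.6, "c": 0.4, "d": 0.1}  # hypothetical scores
oracle = {"a": 1, "b": 0, "c": 1, "d": 0}         # 1 good, 0 bad
precision, recall = precision_recall(trust, oracle, list(trust), delta=0.5)
print(precision, recall)
```

Here two pages score above the cutoff and one of them is good, and one of the two good pages scores above the cutoff, so both values are 0.5.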

28
Conclusions

Search engines combat web spam with a variety of
ad hoc, often proprietary techniques.
The paper presents a solution that assists in the
detection of web spam with the help of
TrustRank.
The results show that the presented solution can
effectively identify a significant number of
strongly reputable (non-spam) pages.
TrustRank can be used either separately to
filter the index, or in combination with PageRank
and other metrics to rank search results.

29
Limitation

TrustRank guarantees that top-scored sites are
good ones and gives better results than PageRank.
However, TrustRank is unable to effectively
separate low-scored good sites from bad ones, due
to the lack of distinguishing features (inlinks)
of those sites.

30
Further improvements

Explore the interplay between dampening and
splitting for trust propagation.
Instead of selecting the entire seed set at once,
an iterative process could be used: after the
oracle has evaluated some pages, the outcome of
those evaluations would guide which pages the
oracle should evaluate next.