Title: Combating Web Spam with TrustRank
1Combating Web Spam with TrustRank
- Presented by
- Nitin Mittal
2About the Paper
- Name
- Combating Web Spam with TrustRank
- Authors
- Zoltan Gyongyi
- Hector Garcia-Molina
- Jan Pedersen
- Year
- 2004
- Topic
- This paper presents the first attempt to
formalize the problem of combating Web Spam
and introduces a comprehensive solution to
assist in the detection of Web Spam.
3Presentation Outline
- Introduction
- Preliminaries
- TrustRank Algorithm
- Selecting Seed Set
- Experiment Result
- Conclusions
- Further improvements
4Introduction
- Web Spam
- The term Web Spam refers to hyperlinked pages
on the World Wide Web that are created with the
intention of misleading search engines.
- Techniques of Web Spam
- Adding thousands of keywords to a home page,
so that a search engine indexes the keywords
and returns the bogus site as an answer to queries
on those keywords.
- Creating a large number of bogus web pages, all
pointing to a single target page, since many
search engines rank pages by taking incoming
links into account.
5Introduction(Cont.)
- Issues with Web Spam
- It is hard to tell whether a page's content is
really related to the keywords listed on its main
page or whether they were inserted just for Web Spam.
- Links between websites may represent useful
relations between the sites, or they may have
been created to boost the rank of each other's
pages.
- Categorizing web pages is not an easy task for a
computer because of the large number of web pages.
6Introduction(Cont.)
- The contributions of this paper are:
- Formalizes the problem of Web Spam and Spam
detection algorithms.
- Defines metrics for assessing the efficacy of
detection algorithms.
- Presents schemes for selecting seed sets of
pages to be manually evaluated.
- Introduces the TrustRank algorithm for
determining the likelihood that pages are
reputable (not Spam).
- Discusses the results of an extensive evaluation
based on 31 million sites crawled by the AltaVista
search engine.
7Preliminaries
8Preliminaries (Cont)
- 1. Web Model
- The Web is modeled as a graph G = (V, E) consisting
of a set V of N pages (vertices) and a set E of
directed links (edges) that connect pages.
- Multiple hyperlinks between two pages p and q are
collapsed into a single link (p, q) ∈ E.
- In-degree, the number of inlinks of a page p, is
ι(p).
- Out-degree, the number of outlinks of a page p, is
ω(p).
- Pages without outlinks are referred to as
non-referencing pages.
- Pages without inlinks are referred to as
unreferenced pages.
- Pages that are both unreferenced and
non-referencing at the same time are referred to as
isolated pages.
9Preliminaries (Cont)
- Page 1 is an unreferenced page.
- Page 4 is a non-referencing page.
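The page categories of the web model can be checked mechanically. The sketch below is only an illustration: the 4-page edge list is a hypothetical graph chosen so that page 1 has no inlinks and page 4 has no outlinks, and the label "regular" is a name made up here for pages that fall into none of the three categories.

```python
# Hypothetical 4-page graph (an assumption, not taken from the slides).
PAGES = [1, 2, 3, 4]
EDGES = {(1, 2), (2, 3), (3, 2), (2, 4)}  # a set, so duplicate links collapse

def degrees(pages, edges):
    """Return (in-degree, out-degree) dictionaries for every page."""
    indeg = {p: 0 for p in pages}
    outdeg = {p: 0 for p in pages}
    for p, q in edges:
        outdeg[p] += 1
        indeg[q] += 1
    return indeg, outdeg

def classify(pages, edges):
    """Label each page as isolated, unreferenced, non-referencing or regular."""
    indeg, outdeg = degrees(pages, edges)
    labels = {}
    for p in pages:
        if indeg[p] == 0 and outdeg[p] == 0:
            labels[p] = "isolated"
        elif indeg[p] == 0:
            labels[p] = "unreferenced"
        elif outdeg[p] == 0:
            labels[p] = "non-referencing"
        else:
            labels[p] = "regular"  # hypothetical label for all other pages
    return labels

labels = classify(PAGES, EDGES)
```

On this graph, `labels[1]` is `"unreferenced"` and `labels[4]` is `"non-referencing"`.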
10Preliminaries (Cont)
- Transition matrix, T
- Inverse transition matrix, U
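For reference, the two matrices are usually written as follows. This is a sketch that assumes ι(q) denotes the in-degree and ω(q) the out-degree of page q, and that E is the edge set of the web graph.

```latex
T(p, q) =
\begin{cases}
0 & \text{if } (q, p) \notin E \\
1/\omega(q) & \text{if } (q, p) \in E
\end{cases}
\qquad
U(p, q) =
\begin{cases}
0 & \text{if } (p, q) \notin E \\
1/\iota(q) & \text{if } (p, q) \in E
\end{cases}
```

Read this way, column q of T spreads one unit equally over the pages that q links to, while column q of U spreads one unit equally over the pages that link to q.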
11Preliminaries (Cont)
- 2. PageRank
- PageRank is a well-known algorithm that uses
link information to assign global importance
scores to all pages on the web.
- The intuition behind PageRank is that a web page
is important if several other important web
pages point to it.
- The PageRank score r(p) of a page p is defined
as
$$r(p) = \alpha \cdot \sum_{q:(q,p) \in E} \frac{r(q)}{\omega(q)} + (1 - \alpha) \cdot \frac{1}{N}$$
where α is a decay factor and ω(q) is the out-degree of q.
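The PageRank definition can be evaluated by simple fixed-point iteration. The following is a minimal sketch, not a production implementation: the 4-page graph, the decay factor 0.85 and the iteration count are all assumptions made for illustration.

```python
def pagerank(n, edges, alpha=0.85, iterations=50):
    """Iterate r(p) = alpha * sum of r(q)/outdeg(q) over inlinks q of p
    + (1 - alpha)/n, for pages numbered 1..n."""
    outdeg = [0] * (n + 1)
    for p, _ in edges:
        outdeg[p] += 1
    r = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for p, q in edges:
            # page p passes an equal share of its score along each outlink
            nxt[q - 1] += alpha * r[p - 1] / outdeg[p]
        r = nxt
    return r

# Hypothetical graph: 1 -> 2, 2 -> 3, 3 -> 2, 2 -> 4
EDGES = [(1, 2), (2, 3), (3, 2), (2, 4)]
scores = pagerank(4, EDGES)
```

Page 2 ends up with the highest score because both page 1 and page 3 point to it, while page 1, which has no inlinks, keeps only the base score (1 - α)/N. Page 4 has no outlinks, so some score leaks out and the scores do not sum to 1 in this simple version.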
12TrustRank Algorithm
- 1. Assessing Trust
- 2. Computing Trust
- 3. The TrustRank Algorithm
13TrustRank Algorithm(Cont)
- 1. Assessing Trust
- Deciding whether the pages in an initial set are
Web Spam or not requires human evaluation.
The notion of a human checking a page for spam
is formalized as a binary oracle function O over
all pages p ∈ V:
$$O(p) = \begin{cases} 0 & \text{if } p \text{ is bad} \\ 1 & \text{if } p \text{ is good} \end{cases}$$
- Oracle invocations are expensive and time
consuming. Thus, the objective is to call
function O only on selected pages.
14TrustRank Algorithm(Cont)
- To discover good pages without invoking the
oracle function on the entire web, the paper makes
an important empirical observation: good pages
seldom point to bad ones.
- Spam pages can, and in fact often do, link to
good pages.
15TrustRank Algorithm(Cont)
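- Threshold Trust Property
- $$T(p) > \delta \Leftrightarrow O(p) = 1$$
- If a page p receives a score above δ then we
know that it is good. Otherwise, we cannot tell
anything about p.
- Here T is a trust function. Under the ideal trust
property, T(p) equals the probability that p is
good, T(p) = Pr[O(p) = 1].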
16TrustRank Algorithm(Cont)
- 2. Computing Trust
- There is a limited budget L of invocations of
function O. The subsets of good and bad seed
pages are denoted by S+ and S−, respectively. The
remaining pages are not checked by the human
expert, so we assign them a trust score of ½ to
signal our lack of information.
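The trust assignment just described (oracle value for evaluated seeds, ½ for everything else) can be sketched in a few lines. The page numbering, the set of good pages and the choice of seeds below are hypothetical values used only to make the example runnable.

```python
def ignorant_trust(pages, seeds, oracle):
    """Trust is O(p) for evaluated seed pages and 1/2 for all other pages.
    The oracle is called once per seed, so len(seeds) must not exceed L."""
    checked = {p: oracle(p) for p in seeds}
    return {p: (float(checked[p]) if p in checked else 0.5) for p in pages}

# Hypothetical ground truth: pages 1-4 are good, pages 5-7 are bad.
GOOD = {1, 2, 3, 4}

def oracle(p):
    """Stand-in for the human evaluator: 1 for good pages, 0 for bad ones."""
    return 1 if p in GOOD else 0

# Budget L = 3: only pages 1, 3 and 6 are evaluated.
t0 = ignorant_trust(range(1, 8), [1, 3, 6], oracle)
```

Here `t0` is 1.0 for the good seeds 1 and 3, 0.0 for the bad seed 6, and 0.5 for the four pages that were never shown to the oracle.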
17TrustRank Algorithm(Cont)
- Trust Propagation
- We cannot be sure that pages reachable from
good seeds are indeed good.
- The further away we are from good seed pages,
the less certain we are that a page is good.
- Trust Attenuation
- Trust Dampening
18TrustRank Algorithm(Cont)
- Trust Splitting
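As a rough illustration of the two attenuation strategies, the sketch below assumes the usual reading: dampening multiplies trust by a factor β < 1 for every link travelled away from a good seed, and splitting divides a page's trust equally among its outlinks. The value β = 0.8 and the small graph are assumptions made for the example.

```python
def dampened_trust(distance, beta=0.8):
    """Trust dampening: a page `distance` links away from a good seed
    receives trust beta ** distance."""
    return beta ** distance

def split_trust(trust, edges):
    """Trust splitting: every page divides its trust equally among its
    outlinks; a page's received trust is the sum of the shares it gets."""
    outdeg = {}
    for p, _ in edges:
        outdeg[p] = outdeg.get(p, 0) + 1
    received = {p: 0.0 for p in trust}
    for p, q in edges:
        received[q] += trust[p] / outdeg[p]
    return received

# Hypothetical graph: good seeds 1 and 2 both point to page 3.
EDGES = [(1, 3), (1, 4), (2, 3), (2, 5), (2, 6)]
TRUST = {1: 1.0, 2: 1.0, 3: 0.0, 4: 0.0, 5: 0.0, 6: 0.0}
received = split_trust(TRUST, EDGES)
```

Page 3 receives 1/2 from seed 1 (two outlinks) plus 1/3 from seed 2 (three outlinks), so `received[3]` is 5/6. With β = 0.8, a page two links away from a seed gets a dampened trust of about 0.64.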
19TrustRank Algorithm(Cont)
20TrustRank Algorithm(Cont)
- Step 1.
- s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
- Step 2.
- σ = [2, 4, 5, 1, 3, 6, 7]
- Step 3.
- Invoke the oracle function on the set {2, 4, 5}.
- Step 4.
- d = [0, 1/2, 0, 1/2, 0, 0, 0]
- Step 5.
- Evaluate the TrustRank scores on the whole set.
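The five steps can be strung together in a short script. This is a sketch under stated assumptions, not the authors' implementation: the 7-page edge list is an assumed graph (the slides do not reproduce it) picked so that steps 2 to 4 come out as listed above, pages 1 to 4 are assumed good, seed desirability in step 1 is computed with inverse PageRank, and the decay factor 0.85 and 50 iterations are illustrative choices.

```python
def biased_pagerank(n, edges, d, alpha, iterations):
    """Iterate x = alpha * M x + (1 - alpha) * d, where each page splits its
    score equally over its outgoing edges. Pages are numbered 1..n."""
    outdeg = [0] * (n + 1)
    for p, _ in edges:
        outdeg[p] += 1
    x = list(d)
    for _ in range(iterations):
        nxt = [(1.0 - alpha) * v for v in d]
        for p, q in edges:
            nxt[q - 1] += alpha * x[p - 1] / outdeg[p]
        x = nxt
    return x

def trustrank(n, edges, oracle, budget, alpha=0.85, iterations=50):
    # Step 1: seed desirability s via inverse PageRank (links reversed).
    reversed_edges = [(q, p) for p, q in edges]
    s = biased_pagerank(n, reversed_edges, [1.0 / n] * n, alpha, iterations)
    # Step 2: order pages by decreasing s (ties broken by page number).
    sigma = sorted(range(1, n + 1), key=lambda p: -s[p - 1])
    # Step 3: invoke the oracle on the first `budget` pages only.
    d = [0.0] * n
    for p in sigma[:budget]:
        if oracle(p) == 1:
            d[p - 1] = 1.0
    # Step 4: normalize d so that its entries sum to 1.
    total = sum(d)
    if total == 0:
        raise ValueError("no good seed found within the oracle budget")
    d = [v / total for v in d]
    # Step 5: propagate trust from the good seeds over the whole graph.
    t = biased_pagerank(n, edges, d, alpha, iterations)
    return sigma, d, t

# Assumed graph and ground truth (hypothetical, for illustration only).
EDGES = [(1, 2), (2, 3), (2, 4), (3, 2), (4, 5), (5, 6), (5, 7), (6, 3)]
GOOD = {1, 2, 3, 4}

def oracle(p):
    return 1 if p in GOOD else 0

sigma, d, t = trustrank(7, EDGES, oracle, budget=3)
```

With a budget of L = 3, the oracle sees pages 2, 4 and 5, finds 2 and 4 to be good, and the resulting `d` puts weight 1/2 on each of them. Page 1 has no inlinks and is not a seed, so its TrustRank score stays at 0 even though it is a good page.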
21Selecting Seed Set
- 1. Inverse PageRank
- We could select seed pages based on the
number of outlinks.
- Inverse PageRank is a heuristic that works
well in practice.
- 2. High PageRank
- Because high-PageRank pages are likely to point to
other high-PageRank pages, good trust scores
will also be propagated to pages that are likely
to be at the top of result sets.
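Both seed-selection schemes rank candidate pages by a link-based score and keep the top L. A minimal sketch of the difference follows, assuming inverse PageRank is computed as ordinary PageRank on the graph with every link reversed. The 7-page graph, the decay factor and the budget of 3 are illustrative assumptions.

```python
def link_scores(n, edges, alpha=0.85, iterations=50):
    """PageRank-style scores for pages 1..n: each page splits its score
    equally over its outgoing edges, plus a uniform base score."""
    outdeg = [0] * (n + 1)
    for p, _ in edges:
        outdeg[p] += 1
    x = [1.0 / n] * n
    for _ in range(iterations):
        nxt = [(1.0 - alpha) / n] * n
        for p, q in edges:
            nxt[q - 1] += alpha * x[p - 1] / outdeg[p]
        x = nxt
    return x

def top_seeds(scores, budget):
    """Return the `budget` highest-scoring pages (ties broken by page number)."""
    order = sorted(range(1, len(scores) + 1), key=lambda p: -scores[p - 1])
    return order[:budget]

# Hypothetical 7-page graph.
EDGES = [(1, 2), (2, 3), (2, 4), (3, 2), (4, 5), (5, 6), (5, 7), (6, 3)]
REVERSED = [(q, p) for p, q in EDGES]

high_pagerank_seeds = top_seeds(link_scores(7, EDGES), 3)
inverse_pagerank_seeds = top_seeds(link_scores(7, REVERSED), 3)
```

On this graph the two schemes agree on pages 2 and 5 but differ on the third candidate: high PageRank prefers page 3, which is pointed to by others, while inverse PageRank prefers page 4, from which more of the graph can be reached.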
22Experiment Result
- In August 2003, using the complete set of pages
crawled and indexed by the AltaVista search
engine, they grouped several billion pages into
31,003,946 sites.
- One of the paper's authors played the role of the
oracle.
- After experiments comparing the
inverse PageRank and the high PageRank seed
selection schemes, inverse PageRank was selected.
- With the help of major web directories, they
narrowed the top 25,000 sites down to 7,900.
- The oracle was called on the top 1,250 of these,
and 178 sites were selected as good seeds.
23Experiment Result (Cont.)
PageRank Versus TrustRank
24Experiment Result (Cont.)
- PageRank Versus TrustRank
- The PageRank algorithm does not incorporate any
knowledge about the quality of a site, nor does
it explicitly penalize badness, whereas
TrustRank is meant to differentiate good and bad
sites.
- There is almost no spam in the top 5 TrustRank
buckets, while surprisingly almost 20% of the second
PageRank bucket is bad.
- Spam is highest in PageRank buckets 9 and 10,
while the corresponding TrustRank buckets show
different values.
25Experiment Result (Cont.)
- Other Strategies to evaluate results
- 1. Pairwise Orderedness
- 2. Precision and Recall
26Experiment Result (Cont.)
- 1. Pairwise Orderedness
- Pairwise orderedness is related to the ordered
trust property. It gives the fraction of the
pairs of sites (p, q) for which the trust
function T does not make a mistake in the
relative ordering.
- TrustRank consistently outperforms both the
ignorant function and PageRank.
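One way to compute pairwise orderedness is sketched below. It assumes a pair (p, q) counts as a mistake when the trust scores rank a bad page at least as high as a good one. The trust scores and oracle labels are made-up values for illustration.

```python
from itertools import combinations

def pairwise_orderedness(trust, oracle, pairs):
    """Fraction of pairs (p, q) that the trust scores do not misorder."""
    def violated(p, q):
        # A mistake: the bad page of the pair scores at least as high
        # as the good page.
        return ((trust[p] >= trust[q] and oracle[p] < oracle[q]) or
                (trust[p] <= trust[q] and oracle[p] > oracle[q]))
    mistakes = sum(1 for p, q in pairs if violated(p, q))
    return (len(pairs) - mistakes) / len(pairs)

# Hypothetical sample: pages 1 and 2 are good, pages 3 and 4 are bad.
ORACLE = {1: 1, 2: 1, 3: 0, 4: 0}
TRUST = {1: 0.9, 2: 0.4, 3: 0.5, 4: 0.1}
PAIRS = list(combinations(sorted(TRUST), 2))

score = pairwise_orderedness(TRUST, ORACLE, PAIRS)
```

Of the six pairs, only (2, 3) is misordered, because the good page 2 scores below the bad page 3, so `score` is 5/6.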
27Experiment Result (Cont.)
- 2. Precision and Recall
- Precision
- The fraction of good pages among all pages in
the set X that have a trust score above the
average trust score.
- Recall
- The ratio between the number of good pages
with a trust score above the average trust score
and the total number of good pages in the set X.
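Both measures follow directly from these definitions. The sketch below takes the threshold as a parameter and, as on the slide, passes in the average trust score. The sample scores and labels are hypothetical.

```python
def precision_recall(trust, oracle, threshold):
    """Precision and recall of 'score above threshold' as a test for goodness."""
    above = [p for p in trust if trust[p] > threshold]
    good_above = [p for p in above if oracle[p] == 1]
    total_good = sum(1 for p in trust if oracle[p] == 1)
    precision = len(good_above) / len(above) if above else 0.0
    recall = len(good_above) / total_good if total_good else 0.0
    return precision, recall

# Hypothetical sample X: pages 1 and 2 are good, pages 3 and 4 are bad.
ORACLE = {1: 1, 2: 1, 3: 0, 4: 0}
TRUST = {1: 0.9, 2: 0.4, 3: 0.5, 4: 0.1}

average = sum(TRUST.values()) / len(TRUST)
precision, recall = precision_recall(TRUST, ORACLE, average)
```

Pages 1 and 3 score above the average, and only page 1 of them is good, so precision is 1/2. One of the two good pages is found, so recall is also 1/2.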
28Conclusions
- Search engines combat Web Spam with a variety of
ad hoc, often proprietary techniques.
- The paper presents a solution to assist in the
detection of Web Spam with the help of
TrustRank.
- Results show that the presented solution can
effectively identify a significant number of
strongly reputable (non-Spam) pages.
- TrustRank can be used either separately to
filter the index, or in combination with PageRank
and other metrics to rank search results.
29Limitation
- TrustRank guarantees that top-scored
sites are good ones and gives better results than
PageRank. However, it is unable to effectively
separate low-scored good sites from bad ones, due
to the lack of distinguishing features (inlinks)
of the sites.
30Further improvements
- Explore the interplay between dampening and
splitting for trust propagation.
- Instead of selecting the entire seed set at once,
an iterative process could be used: after the
oracle has evaluated some pages, reconsider which
pages it should evaluate next, based on the
outcomes so far.