Web Spam Detection with Anti-Trust Rank

1
Web Spam Detection with Anti-Trust Rank
  • Vijay Krishnan
  • Rashmi Raj
  • Computer Science Department
  • Stanford University

2
The World Wide Web
  • Huge
  • Distributed content creation, linking (no
    coordination)
  • Structured databases, unstructured text,
    semi-structured data.
  • Content includes truth, lies, obsolete
    information, contradictions, ...

3
PageRank
  • Intuition: a page is important if important
    pages link to it.
  • In high-falutin terms: importance = the
    principal eigenvector of the stochastic matrix of
    the Web.
  • (A few fixups needed.)

4
PageRank
  • Web graph encoded by matrix M
  • $N \times N$ matrix ($N$ = number of web pages)
  • $M_{ij} = 1/|O(j)|$ iff there is a link from $j$ to $i$
  • $M_{ij} = 0$ otherwise
  • $O(j)$ = set of pages node $j$ links to
  • Define matrix $A$ as follows
  • $A_{ij} = \beta M_{ij} + (1-\beta)/N$, where $0 < \beta < 1$
  • $1-\beta$ is the "tax" discussed in prior lecture
  • Page rank $r$ is the first (principal) eigenvector of $A$
  • $A r = r$
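
The definition above can be turned into a small power-iteration sketch. This is an illustration rather than the presenters' code: the function name `pagerank`, the value `beta=0.85`, the iteration count, and the three-page graph are all assumptions. It also assumes every page has at least one out-link and that every link target appears as a key, so that $A$ stays column-stochastic.

```python
def pagerank(links, beta=0.85, iters=100):
    """Approximate r with A r = r, where A_ij = beta*M_ij + (1-beta)/N.

    links maps each page j to the list O(j) of pages it links to.
    Assumes every page has at least one out-link (no dead ends).
    """
    pages = sorted(links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}  # start from the uniform vector
    for _ in range(iters):
        # The (1-beta)/N "tax" term is shared equally by all pages.
        nxt = {p: (1.0 - beta) / n for p in pages}
        for j in pages:
            out = links[j]
            for i in out:
                # M_ij = 1/|O(j)| for each link j -> i
                nxt[i] += beta * r[j] / len(out)
        r = nxt
    return r


# Hypothetical 3-page graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Because each column of $A$ sums to 1 under these assumptions, the ranks keep summing to 1 across iterations.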

5
Many Random Walkers Model
  • Imagine a large number $M$ of independent,
    identical random walkers ($M \gg N$)
  • At any point in time, let $M(p)$ be the number of
    random walkers at page $p$
  • The page rank of $p$ is the fraction of random
    walkers that are expected to be at page $p$, i.e.,
    $E[M(p)]/M$.
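
The random-walker picture can be checked with a short Monte Carlo sketch. The slide does not spell out how a walker moves; this sketch assumes each walker follows a random out-link with probability `beta` and otherwise jumps to a uniformly random page, matching the matrix $A$ from the previous slide. The function name, walker count, step count, seed, and graph are all hypothetical.

```python
import random


def simulate_walkers(links, num_walkers=5000, steps=50, beta=0.85, seed=0):
    """Estimate E[M(p)]/M by moving many independent walkers."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pages = sorted(links)
    positions = [rng.choice(pages) for _ in range(num_walkers)]
    for _ in range(steps):
        for w in range(num_walkers):
            if rng.random() < beta:
                # Follow one of the current page's out-links at random.
                positions[w] = rng.choice(links[positions[w]])
            else:
                # Pay the "tax": jump to a uniformly random page.
                positions[w] = rng.choice(pages)
    counts = {p: 0 for p in pages}
    for p in positions:
        counts[p] += 1
    return {p: counts[p] / num_walkers for p in pages}


# Hypothetical graph; every page has at least one out-link.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
fractions = simulate_walkers(graph)
print(fractions)
```

With enough walkers the fractions should approach the page ranks obtained from the eigenvector formulation, up to sampling noise.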

6
Economic Considerations
  • Search has become the default gateway to the web
  • Very high premium to appear on the first page of
    search results
  • e.g., e-commerce sites
  • advertising-driven sites

7
What is Web Spam?
  • Spamming: any deliberate action solely in order
    to boost a web page's position in search engine
    results, incommensurate with the page's real value
  • Spam: web pages that are the result of spamming
  • This is a very broad definition
  • SEO industry might disagree!
  • SEO = search engine optimization
  • Approximately 10-15% of web pages are spam

8
Types of Spamming Techniques
  • Term spamming
  • Manipulating the text of web pages in order to
    appear relevant to queries
  • Link spamming
  • Creating link structures that boost page rank or
    hubs and authorities scores

9
Link Spam
  • Three kinds of web pages from a spammer's point
    of view
  • Inaccessible pages
  • Accessible pages
  • e.g., web log comments pages
  • spammer can post links to his pages
  • Own pages
  • Completely controlled by spammer
  • May span multiple domain names

10
Link Spam Detection
  • Open research area
  • One approach: TrustRank

11
Trust Rank
  • Basic principle: approximate isolation
  • It is rare for a good page to point to a bad
    (spam) page
  • Sample a set of seed pages from the web.
  • Set trust of each trusted page to 1
  • Propagate trust through links
  • Each page gets a trust value between 0 and 1
  • Use a threshold value and mark all pages below
    the trust threshold as spam
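
The steps above can be sketched as a biased PageRank in which the random jump lands only on trusted seed pages. This is one common formulation, not necessarily the exact procedure behind the slide: it splits a total trust of 1 evenly across the seeds instead of setting each seed to exactly 1, and the function name, `beta`, the threshold, and the toy graph are assumptions.

```python
def trust_rank(links, trusted_seeds, beta=0.85, iters=50):
    """Propagate trust forward along links from a trusted seed set.

    links maps each page to the list of pages it links to; every link
    target is assumed to appear as a key.
    """
    pages = sorted(links)
    # Seed distribution: all jump probability goes to trusted seeds.
    d = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
         for p in pages}
    t = dict(d)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for j in pages:
            out = links[j]
            for i in out:
                # A page splits its trust evenly among its out-links.
                nxt[i] += beta * t[j] / len(out)
        t = nxt
    return t


# Hypothetical graph: good pages never link to spam pages.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
trust = trust_rank(graph, trusted_seeds={"good1"})
threshold = 0.01  # hypothetical cut-off
flagged = sorted(p for p, v in trust.items() if v < threshold)
print(flagged)
```

In this toy graph no trusted page links to the spam pages, so they receive no trust and fall below the threshold.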

12
Anti-Trust Approach
  • Broadly based on the same approximate
    isolation principle
  • This principle also implies that the pages
    pointing to spam pages are very likely to be spam
    pages themselves.
  • Anti-Trust is propagated in the reverse direction
    along incoming links, starting from a seed set of
    spam pages.
  • A page can be classified as a spam page if it has
    Anti-Trust Rank value more than a chosen
    threshold value.
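
A minimal sketch of the reverse propagation, assuming it can be modelled as the same seed-biased iteration run on the graph with every link reversed. The function names, `beta`, the threshold, and the toy graph are hypothetical, and the authors' actual implementation may differ.

```python
def reverse_graph(links):
    """Return a map from each page to the pages that link to it."""
    rev = {p: [] for p in links}
    for src, outs in links.items():
        for dst in outs:
            rev[dst].append(src)
    return rev


def anti_trust_rank(links, spam_seeds, beta=0.85, iters=50):
    """Propagate anti-trust backwards along incoming links.

    Every link target is assumed to appear as a key of links.
    """
    rev = reverse_graph(links)
    pages = sorted(rev)
    # All jump probability goes to the known spam seeds.
    d = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0)
         for p in pages}
    a = dict(d)
    for _ in range(iters):
        nxt = {p: (1.0 - beta) * d[p] for p in pages}
        for j in pages:
            preds = rev[j]  # pages that point to j in the original graph
            for i in preds:
                # Anti-trust flows from j to the pages linking to it.
                nxt[i] += beta * a[j] / len(preds)
        a = nxt
    return a


# Hypothetical graph: spam1 links to spam2, so it inherits anti-trust.
graph = {
    "good1": ["good2"],
    "good2": ["good1"],
    "spam1": ["spam2", "good1"],
    "spam2": ["spam1"],
}
anti = anti_trust_rank(graph, spam_seeds={"spam2"})
threshold = 0.01  # hypothetical cut-off
flagged = sorted(p for p, v in anti.items() if v > threshold)
print(flagged)
```

Here `spam1` is flagged although only `spam2` was seeded, because it links to a seed page; the good pages never point at spam and keep an anti-trust of zero.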

13
Seed Set selection
  • Seed spam set chosen from pages with high page
    rank.
  • Nearly 100 URLs containing certain terms like
    viagra, gambling, hardporn as substrings are
    spam. Use these for evaluation.
  • Also some seed pages were chosen by an Oracle
    (Human Expert).
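
The substring heuristic for building an evaluation set could look roughly like the sketch below. Only the three terms named on the slide are used; the helper name, the case-insensitive match, and the example URLs are hypothetical.

```python
# Terms named on the slide; the full list used in the study is not given.
SPAM_TERMS = ("viagra", "gambling", "hardporn")


def looks_like_spam_url(url, terms=SPAM_TERMS):
    """Return True if any spam term occurs as a substring of the URL."""
    lowered = url.lower()  # case-insensitive matching is an assumption
    return any(term in lowered for term in terms)


# Hypothetical URLs for illustration only.
urls = [
    "http://cheap-viagra.example.com/",
    "http://example.org/news",
    "http://www.example.net/online-GAMBLING/tips",
]
evaluation_set = [u for u in urls if looks_like_spam_url(u)]
print(evaluation_set)
```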

14
Results
  • Overall percentage of spam pages: 0.28%.
  • Average page rank of spam / average page rank:
    2.6.
  • % of spam pages in:
  • Top 1000 Anti-Trust Rank pages: 25.3%
  • Bottom 1000 Trust Rank pages: 0.68%
  • Ratio of average page ranks of spam pages
    returned by ATR vs. TR is roughly 6.
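
The "% of spam pages in the top 1000" figures are a fraction of labelled spam among the highest-ranked pages. A minimal sketch of that measurement follows, using a hypothetical helper name and made-up scores and labels rather than the study's data.

```python
def spam_percentage_in_top_k(scores, is_spam, k):
    """Percentage of spam pages among the k highest-scoring pages."""
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    spam_count = sum(1 for page in ranked if is_spam[page])
    return 100.0 * spam_count / len(ranked)


# Hypothetical scores and labels for illustration only.
scores = {"p1": 0.40, "p2": 0.30, "p3": 0.20, "p4": 0.10}
is_spam = {"p1": True, "p2": False, "p3": True, "p4": False}
pct = spam_percentage_in_top_k(scores, is_spam, k=2)
print(pct)
```

For the bottom-k Trust Rank figure the same helper could be applied to negated trust scores, so that the least trusted pages sort first.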

15
Results

16
References
  • The PageRank citation ranking: Bringing order to
    the web. L. Page, S. Brin, R. Motwani and T.
    Winograd. Technical Report, Stanford University,
    1998.
  • Combating Web Spam with Trust Rank. Zoltan
    Gyongyi, Hector Garcia-Molina and Jan Pedersen.
    In VLDB 2004.
  • Topic-sensitive PageRank. Taher Haveliwala. In
    WWW 2002.
  • The WebGraph dataset. Online at
    http://webgraph-data.dsi.unimi.it/