Adversarial Information Retrieval - PowerPoint PPT Presentation

Slides: 24
Provided by: Ryan172

1
Adversarial Information Retrieval
  • The Manipulation of Web Content

2
Introduction
  • Examples
  • TrustRank and Other Methods

3
What is Adversarial IR?
  • Gathering, Indexing, Retrieving and Ranking
    Information
  • A subset of the information has been maliciously
    manipulated
  • The usual motive is financial gain

4
What is the Goal of AIR?
  • Detect the bad sites or communities
  • Improve precision on search engines by
    eliminating the bad guys

5
Simplest form
  • First-generation engines relied heavily on tf/idf
  • The top-ranked pages for the query "maui resort"
    were the ones containing the most occurrences of
    "maui" and "resort"
  • SEOs responded with dense repetitions of chosen
    terms
  • e.g., maui resort maui resort maui resort
  • Often, the repetitions would be in the same color
    as the background of the web page
  • Repeated terms were indexed by crawlers
  • but were not visible to humans in browsers

Pure word density cannot be trusted as an IR
signal
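
A minimal sketch of why repetition worked, assuming a scorer that simply counts query-term occurrences. This is a deliberate simplification of tf/idf (no IDF weighting, no length normalization), and the function name and page texts are hypothetical.

```python
from collections import Counter


def term_count_score(query, page_text):
    """Score a page by the raw number of query-term occurrences."""
    counts = Counter(page_text.lower().split())
    return sum(counts[term] for term in query.lower().split())


# Hypothetical pages: one honest description, one keyword-stuffed page.
honest_page = "Our maui resort offers ocean views and a quiet beach"
stuffed_page = "maui resort " * 50 + "book now"

query = "maui resort"
honest_score = term_count_score(query, honest_page)
stuffed_score = term_count_score(query, stuffed_page)

# The stuffed page wins purely on repetition.
print(honest_score, stuffed_score)
```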

6
Search Engine Spamming
  • Link-spam
  • Link-bombing
  • Spam Blogs
  • Comment Spam
  • Keyword Spam
  • Malicious Tagging

7
Spamming
  • Online tutorials for search engine persuasion
    techniques
  • How to boost your PageRank
  • Artificial links and Web communities
  • Latest trend: Google bombing
  • A community of people creates (genuine) links
    with a specific anchor text pointing to a specific
    page, usually to make a political point

8
Google Bombing

9
Our Focus
  • Link Manipulation

10
Trust Rank
  • Observation:
  • Good pages tend to link to good pages.
  • A human is the best spam detector.
  • Algorithm:
  • Select a small subset of pages and let a human
    classify them
  • Propagate the goodness of pages

11
Propagation
  • Trust function T
  • T(p) returns the probability that p is a good
    page
  • Initial values
  • T(p) = 1 if p was found to be a good page
  • T(p) = 0 if p was found to be a spam page
  • Iterations
  • Propagate trust along out-links
  • Run only a fixed number of iterations, M
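
The propagation described above can be sketched as follows. This is an illustration under assumptions, not the presenter's code: trust stays binary, every page reached from a good seed within M out-link hops is marked trusted, and the graph and function name are hypothetical.

```python
def propagate_trust(out_links, good_seeds, steps):
    """Mark pages reachable from good seeds within `steps` hops as trusted."""
    trust = {page: 0.0 for page in out_links}
    frontier = set(good_seeds)
    for page in frontier:
        trust[page] = 1.0
    for _ in range(steps):
        next_frontier = set()
        for page in frontier:
            for target in out_links[page]:
                if trust[target] == 0.0:
                    trust[target] = 1.0  # full trust is passed along unchanged
                    next_frontier.add(target)
        frontier = next_frontier
    return trust


# Hypothetical chain: a good seed eventually leads to a spam page.
graph = {"seed": ["a"], "a": ["b"], "b": ["spam"], "spam": []}

trust_m2 = propagate_trust(graph, ["seed"], steps=2)
trust_m3 = propagate_trust(graph, ["seed"], steps=3)
print(trust_m2["spam"], trust_m3["spam"])
```

With M = 2 the spam page is not yet reached; with M = 3 it receives full trust, which is the weakness of unattenuated propagation.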

12
Propagation (2)
  • Problem with propagation
  • Pages reachable from good seeds might not be good
  • The further away we are from good seed pages, the
    less certain we are that a page is good.
  • Solution: reduce trust as we move further away
    from the good seed pages (trust attenuation).

13
Trust attenuation: dampening
  • Propagate a dampened trust score β < 1 at the
    first step
  • At the n-th step, propagate a trust of β^n

14
Trust attenuation: splitting
  • The parent's trust value is split among its child
    nodes
  • Observation: the more links a page has, the less
    care went into choosing them
  • Mix dampening and splitting? β^n · (split trust)
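
A rough sketch of dampening, splitting, and their combination on a tiny hypothetical graph. Two assumptions are not from the slides: trust starts at 1 on a single good seed, and when a page is reachable by several paths the largest score is kept.

```python
def attenuated_trust(out_links, seed, beta, steps, split=False):
    """Propagate trust from one good seed with dampening and optional splitting."""
    trust = {page: 0.0 for page in out_links}
    trust[seed] = 1.0
    frontier = {seed: 1.0}
    for _ in range(steps):
        next_frontier = {}
        for page, score in frontier.items():
            targets = out_links[page]
            if not targets:
                continue
            # Splitting divides the parent's trust among its children.
            share = score / len(targets) if split else score
            passed = beta * share  # dampening multiplies by beta at every hop
            for target in targets:
                if passed > trust[target]:
                    trust[target] = passed
                    next_frontier[target] = passed
        frontier = next_frontier
    return trust


# Hypothetical graph: the seed links to a and b; a links to c.
graph = {"seed": ["a", "b"], "a": ["c"], "b": [], "c": []}

dampened = attenuated_trust(graph, "seed", beta=0.8, steps=2)
split_only = attenuated_trust(graph, "seed", beta=1.0, steps=2, split=True)
mixed = attenuated_trust(graph, "seed", beta=0.8, steps=2, split=True)
print(dampened, split_only, mixed)
```

With dampening alone, page c at distance 2 ends up with β^2; with the mix, it gets β^2 times the split share.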

15
Selection: Inverse PageRank
  • The seed set S should
  • be as small as possible
  • cover a large part of the Web
  • Coverage is related to out-links in the very same
    way PageRank is related to in-links
  • Inverse PageRank!
  • Perform PageRank on a graph with inverted links
  • G' = (V, E') where (p, q) ∈ E' ⇔ (q, p) ∈ E.
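
A small sketch of seed selection by inverse PageRank, assuming a plain power-iteration PageRank in which pages without out-links simply drop their rank mass (a simplification). The graph and helper names are hypothetical.

```python
def pagerank(out_links, beta=0.85, iterations=50):
    """Basic power-iteration PageRank; dangling pages drop their mass."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - beta) / n for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                share = beta * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank


def invert(out_links):
    """Build G' by reversing every edge: (p, q) in E' iff (q, p) in E."""
    inverted = {page: [] for page in out_links}
    for page, targets in out_links.items():
        for target in targets:
            inverted[target].append(page)
    return inverted


def select_seeds(out_links, count):
    """Rank pages by PageRank on the inverted graph and keep the top ones."""
    scores = pagerank(invert(out_links))
    return sorted(scores, key=scores.get, reverse=True)[:count]


# Hypothetical graph: "hub" reaches many pages through its out-links.
graph = {"hub": ["a", "b", "c"], "a": ["d"], "b": [], "c": [], "d": []}
seeds = select_seeds(graph, 2)
print(seeds)
```

Pages with many out-links to well-connected pages score highest, so trust seeded there can reach a large part of the graph.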

16
Algorithm
  • Select seed candidates (scores s) and order them
    by preference
  • Invoke the oracle (a human) on the first L seeds
  • Initialize and normalize the oracle response d
  • Compute the TrustRank score (as in the PageRank
    formula): t = β · T · t + (1 − β) · d
  • T is the normalized adjacency matrix (transition
    matrix) of the Web graph.
  • β is the dampening factor (usually 0.85).
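
The update t = β · T · t + (1 − β) · d can be sketched as below. Assumptions: d is uniform over the pages the oracle marked good, each page divides its trust equally among its out-links, and the graph and page names are hypothetical.

```python
def trustrank(out_links, good_seeds, beta=0.85, iterations=20):
    """Iterate t = beta * T * t + (1 - beta) * d for a fixed number of steps."""
    pages = list(out_links)
    # d: normalized oracle response, uniform over the good seeds.
    d = {page: 0.0 for page in pages}
    for seed in good_seeds:
        d[seed] = 1.0 / len(good_seeds)
    t = dict(d)
    for _ in range(iterations):
        new_t = {page: (1.0 - beta) * d[page] for page in pages}
        for page in pages:
            targets = out_links[page]
            if targets:
                share = beta * t[page] / len(targets)
                for target in targets:
                    new_t[target] += share
        t = new_t
    return t


# Hypothetical graph: good pages link among themselves; spam pages link
# to each other and to a good page, but no good page links to them.
graph = {
    "good1": ["good2"],
    "good2": ["good1", "good3"],
    "good3": ["good2"],
    "spam1": ["spam2", "good3"],
    "spam2": ["spam1"],
}
scores = trustrank(graph, good_seeds=["good1"])
print(scores)
```

Because trust only flows along out-links from the seeds, the spam pages in this example end with a score of zero.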

17
Algorithm - example
  • s = [0.08, 0.13, 0.08, 0.10, 0.09, 0.06, 0.02]
  • Ordering: [2, 4, 5, 1, 3, 6, 7]
  • L = 3: seeds 2, 4, 5; d = [0, 0.5, 0, 0.5, 0, 0, 0]
  • β = 0.85, M = 20
  • t = [0, 0.18, 0.12, 0.15, 0.13, 0.05, 0.05]
  • NB: max = 0.18
  • Issues with pages 1 and 5

18
Issues with TrustRank
  • Coverage of the seed set may not be broad enough
  • Many different topics exist, each with good pages
  • TrustRank has a bias towards communities that are
    heavily represented in the seed set
  • This inadvertently helps spammers who fool these
    communities

19
Bias towards larger partitions
  • Divide the seed set into n partitions; partition
    i has m_i nodes
  • t_i: TrustRank score calculated by using
    partition i as the seed set
  • t: TrustRank score calculated by using all the
    partitions as one combined seed set
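
The slide does not spell out how t relates to the t_i. One way to see the bias, offered here as an illustration rather than the presenter's derivation: the TrustRank update is linear in d, so with uniform seed vectors the combined score is the size-weighted average of the per-partition scores, with weights m_i / (m_1 + ... + m_n). The sketch below checks this numerically on a hypothetical graph.

```python
def trustrank(out_links, d, beta=0.85, iterations=20):
    """Iterate t = beta * T * t + (1 - beta) * d from a given seed vector d."""
    t = dict(d)
    for _ in range(iterations):
        new_t = {page: (1.0 - beta) * d[page] for page in out_links}
        for page, targets in out_links.items():
            if targets:
                share = beta * t[page] / len(targets)
                for target in targets:
                    new_t[target] += share
        t = new_t
    return t


def uniform_seed_vector(out_links, seeds):
    """Seed vector d that is uniform over `seeds` and zero elsewhere."""
    return {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in out_links}


# Hypothetical graph with two communities; one has three seeds, one has one.
graph = {
    "a": ["a2"], "b": ["a2"], "c": ["a2"], "a2": ["a"],
    "x": ["x2"], "x2": ["x"],
}
partitions = [["a", "b", "c"], ["x"]]
all_seeds = [seed for part in partitions for seed in part]
m = len(all_seeds)

t_combined = trustrank(graph, uniform_seed_vector(graph, all_seeds))
t_parts = [trustrank(graph, uniform_seed_vector(graph, part)) for part in partitions]

# Size-weighted average of the per-partition scores.
t_weighted = {
    page: sum(len(part) / m * t_i[page] for part, t_i in zip(partitions, t_parts))
    for page in graph
}
max_gap = max(abs(t_combined[p] - t_weighted[p]) for p in graph)
print(max_gap)
```

The larger partition contributes three quarters of the combined score here, so pages near it are favored regardless of their topic's quality.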

20
Basic ideas
  • Use pages labeled with topics as seed pages
  • Pages listed in highly regarded topic directories
  • Trust should be propagated by topics
  • A link between two pages is usually created in a
    topic-specific context

21
Topical TrustRank
  • Topical TrustRank
  • Partition the seed set into topically coherent
    groups
  • TrustRank is calculated for each topic
  • The final ranking is generated by a combination
    of these topic-specific trust scores
  • Note
  • TrustRank is essentially biased PageRank
  • Topical TrustRank is fundamentally the same as
    Topic-Sensitive PageRank, but for demoting spam

22
Combination of trust scores
  • Simple summation
  • default mechanism just seen
  • Quality bias
  • Each topic weighted by a bias factor
  • Summation of these weighted topic scores
  • One possible bias: the average PageRank value of
    the seed pages of the topic
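
A sketch of the two combination schemes, assuming per-topic trust scores are already available as dictionaries. All topic names, seed pages, and score values are hypothetical and chosen only to show that the bias can change the ordering.

```python
def combine_topic_scores(topic_scores, topic_bias=None):
    """Sum per-topic trust scores, optionally weighting each topic by a bias."""
    combined = {}
    for topic, scores in topic_scores.items():
        weight = 1.0 if topic_bias is None else topic_bias[topic]
        for page, score in scores.items():
            combined[page] = combined.get(page, 0.0) + weight * score
    return combined


def average_seed_pagerank(seed_pages, pagerank):
    """One possible quality bias: mean PageRank of a topic's seed pages."""
    return sum(pagerank[page] for page in seed_pages) / len(seed_pages)


# Hypothetical per-topic trust scores for two pages.
topic_scores = {
    "travel": {"p1": 0.4, "p2": 0.1},
    "health": {"p1": 0.0, "p2": 0.5},
}
# Hypothetical PageRank values of the seed pages of each topic.
seed_pagerank = {"s1": 0.04, "s2": 0.02, "s3": 0.01}
topic_seeds = {"travel": ["s1", "s2"], "health": ["s3"]}

simple = combine_topic_scores(topic_scores)
bias = {
    topic: average_seed_pagerank(seeds, seed_pagerank)
    for topic, seeds in topic_seeds.items()
}
weighted = combine_topic_scores(topic_scores, bias)
print(simple, weighted)
```

Under simple summation p2 ranks above p1; once the topic with higher-PageRank seeds is weighted more heavily, p1 moves ahead.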

23
Further Improvements
  • Seed Weighting
  • Instead of assigning an equal weight to each seed
    page, assign a weight proportional to its quality
    / importance
  • Seed Filtering
  • Filter out low-quality pages that may exist in
    topic directories
  • Finer topics
  • Use lower layers of the topic directory