Taking over Search Engines - PowerPoint PPT Presentation

Provided by: malvasia

Transcript and Presenter's Notes

Title: Taking over Search Engines


1
Taking over Search Engines
2
Web Spamming
  • What is Spamming ?
  • Spamming is the art of increasing the rank of a
    page.
  • Why ?
  • Having more visits means gaining more money.
  • How ?
  • Web search engines are the gateways to the web.
  • Get listed in the top results.

3
How much Spam out there ?
  • Real-Web data from the MSN crawler collected
    during August 2004
  • 105,484,446 Web pages

4
Why is spam bad ?
  • For Users
  • Useless pages.
  • For Search Engines
  • Wastes bandwidth, CPU cycles, storage space.
  • Pollutes corpus.
  • Distorts ranking of results.
    (Again bad news for users !)

5
Techniques
  • Web search engines use a number of measures to
    estimate the importance of a page
  • Content analysis: TF x IDF
  • Link analysis: PageRank
  • Spammers also use a number of techniques!
  • Content manipulation, i.e. terms
  • Structure manipulation, i.e. links

6
Content Manipulation 1
  • Repetition Repetition Repetition Repetition
    Repetition Repetition Repetition Repetition
  • Increases the Term Frequency
  • dumortierite dumose dumous dump dumper dumpage
    Dumping dumper dumpily
  • Makes a document relevant to many queries.
  • It is effective when using rare words (Inverse
    Document Frequency).
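
To make the effect of repetition and rare words concrete, here is a minimal sketch of a TF x IDF score. The scoring formula (raw term frequency times the log of the inverse document frequency), the function name `tf_idf` and the toy corpus are assumptions chosen for illustration; real search engines use more elaborate variants.

```python
import math
from collections import Counter

def tf_idf(term, doc, corpus):
    """TF x IDF of `term` in `doc`; `doc` and every corpus entry are token lists."""
    tf = Counter(doc)[term] / len(doc)          # term frequency in this document
    df = sum(1 for d in corpus if term in d)    # number of documents containing the term
    if df == 0:
        return 0.0
    return tf * math.log(len(corpus) / df)      # rarer term -> larger IDF factor

# Hypothetical toy corpus: one honest page and a keyword-stuffed copy of it.
honest = "cheap flights to rome and cheap hotels in rome".split()
stuffed = honest + ["dumortierite"] * 8         # a rare word repeated many times
corpus = [honest, stuffed,
          "weather in rome today".split(),
          "train tickets to milan".split()]

rare_score = tf_idf("dumortierite", stuffed, corpus)
common_score = tf_idf("rome", stuffed, corpus)
```

In this toy corpus the repeated rare word scores far higher than the common word "rome", which is why stuffing rare terms is attractive to a spammer.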

7
Where ?
  • Body, Title, Meta Tag, Anchor, Url.

8
Content Manipulation 2
  • Content Repurposing
  • Weaving
  • Insertion of spam words into a well-formed page
    copied from another web site.
  • Phrase Stitching
  • Gluing well formed sentences copied from many
    other web-sites.
  • Why ?
  • Overcomes simple statistics that may be taken
    into account by web search engines

9
The Big Picture (1)
Techniques / Boosting / Term
<a href="target.html">free, great deals, cheap, inexpensive, cheap, free</a>
Link Bombing
10
Link Manipulation
  • Links and pages from the attacker's point of view

11
Creating (Hijacked) In-Links
  • Honey pots.
  • copies of valuable content (e.g. Unix Man Pages)
    with hidden links to spam farms or target pages.
  • Web Directories, Blogs, Wikis
  • all of the above usually have high Page Rank, and
    it is possible to add outgoing links to owned
    pages.
  • Link Exchange
  • Buying expired domains
  • Creating Link Farms

12
Spamming HITS
  • HITS algorithm
  • Searches for Hubs and Authorities
  • Top-ranked pages are the most authoritative ones
  • Spam on HITS
  • Find a collection of good Hubs
  • Add links from Hubs to the target page
  • The target page is now linked from good Hubs !!
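
The attack above can be sketched with a small hub/authority iteration. This is a simplified version of HITS run on a hypothetical toy graph; the function name `hits`, the node names and the fixed iteration count are assumptions for illustration only.

```python
def hits(nodes, edges, iterations=50):
    """Simplified HITS: returns (hub, authority) score dicts for a directed graph."""
    hub = dict.fromkeys(nodes, 1.0)
    auth = dict.fromkeys(nodes, 1.0)
    for _ in range(iterations):
        # Authority of a page: sum of the hub scores of the pages linking to it.
        auth = {n: sum(hub[s] for s, t in edges if t == n) for n in nodes}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {n: v / norm for n, v in auth.items()}
        # Hub score of a page: sum of the authority scores of the pages it links to.
        hub = {n: sum(auth[t] for s, t in edges if s == n) for n in nodes}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

hubs = ["h1", "h2", "h3"]
nodes = hubs + ["a1", "a2", "target"]
clean_edges = [(h, a) for h in hubs for a in ("a1", "a2")]
spammed_edges = clean_edges + [(h, "target") for h in hubs]   # links added from good hubs

_, auth_before = hits(nodes, clean_edges)
_, auth_after = hits(nodes, spammed_edges)
```

Before the attack the target has no authority at all; once every good hub links to it, its authority score matches that of the legitimate authorities in this toy graph.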

13
PageRank
  • PageRank in one equation
  • $PR(p) = \alpha \, (M \cdot PR)(p) + (1 - \alpha) \, V_p$
  • M is the adjacency matrix of the Web Graph.
  • $\alpha$ is the damping factor (usually 0.85).
  • V is the personalization vector.
  • In the fair (uniform) case, $V_p = 1/N$, where N is the
    number of pages in the Web.
  • What happens if a page p has no outgoing links ?
  • The fraction $\alpha$ of its PR is lost, so eventually all
    the PR will be lost.
  • Solution: normalize the rows of M
    (i.e. insert links from p to every other page).
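
The equation and the dangling-page fix can be sketched as a plain power iteration. This is a minimal illustration, not a production implementation: the function name `pagerank`, the dictionary-based graph and the fixed iteration count are assumptions, and a dangling page is treated as linking to every page including itself, which is a common simplification of "every other page".

```python
def pagerank(links, alpha=0.85, v=None, iterations=100):
    """links: dict page -> list of pages it links to (every target must be a key).
    v: personalization vector as a dict summing to 1 (uniform if omitted)."""
    pages = list(links)
    if v is None:
        v = {p: 1.0 / len(pages) for p in pages}   # the "fair" case, V_p = 1/N
    pr = dict(v)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * v[p] for p in pages}
        for p in pages:
            # Dangling page: behave as if it linked to every page,
            # so that no rank leaks out of the graph.
            targets = links[p] or pages
            share = alpha * pr[p] / len(targets)
            for q in targets:
                nxt[q] += share
        pr = nxt
    return pr

web = {"a": ["b"], "b": ["c"], "c": []}   # hypothetical 3-page web; "c" is dangling
pr = pagerank(web)
```

Because the dangling page redistributes its rank, the scores keep summing to 1 across iterations.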

14
Aggregate Page Rank
  • Total page rank is affected by
  • Number of pages
  • Incoming Links
  • Outgoing Links
  • Dangling Nodes
  • Topologies that
  • Use as many pages as possible
  • minimize outgoing links
  • minimize dangling nodes

[Figure: incoming links -> WEB-SITE -> outgoing links]
15
Chain topology (more is better)
[Figure: chain topologies between I (incoming) and O (outgoing), with the PageRank of each node]
  • One page (a): PR (Web Site) = 0.34
  • Two pages (a, b): PR (Web Site) = 0.21 + 0.29 = 0.50
  • Six pages (a to f): PR (Web Site) = 0.77
16
Ring topology
[Figure: ring topology over six pages (a to f) between I (incoming) and O (outgoing), with the PageRank of each node]
  • PR (Web Site) = 0.86
17
Clique topology
[Figure: clique topology over six pages (a to f) between I (incoming) and O (outgoing), with the PageRank of each node]
  • PR (Web Site) = 0.93
18
Increasing Page Rank of a single target page
  • Complicated structures do not help
  • chain, ring and clique spread page rank across
    every node in the web site
  • To maximize the page rank of a target page a
  • all hijacked pages I must point to a
  • all boosting pages (b,c,d,e,f) must point to a
  • no links among boosting pages
  • the target page must point to all of the boosting
    pages
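
The rules above can be checked on a toy graph by comparing the target's PageRank under a star, a ring and a clique built from the same six pages. This is a sketch under simplifying assumptions: a single hijacked page I, no outgoing links O, a uniform personalization vector and a hypothetical helper `pagerank`, so the numbers will not match the figures in the slides.

```python
def pagerank(links, alpha=0.85, iterations=200):
    """Plain power iteration; dangling pages spread their rank over all pages."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - alpha) / len(pages) for p in pages}
        for p in pages:
            targets = links[p] or pages
            for q in targets:
                nxt[q] += alpha * pr[p] / len(targets)
        pr = nxt
    return pr

boosters = ["b", "c", "d", "e", "f"]
farm = ["a"] + boosters                     # "a" is the target page

# Star: boosters point only to a, and a points back to every booster.
star = {"I": ["a"], "a": list(boosters), **{x: ["a"] for x in boosters}}
# Ring: a -> b -> c -> d -> e -> f -> a.
ring = {"I": ["a"], **{farm[i]: [farm[(i + 1) % len(farm)]] for i in range(len(farm))}}
# Clique: every farm page links to every other farm page.
clique = {"I": ["a"], **{x: [y for y in farm if y != x] for x in farm}}

target_rank = {name: pagerank(graph)["a"]
               for name, graph in [("star", star), ("ring", ring), ("clique", clique)]}
```

In this toy setting the star concentrates the farm's rank on the target, while the ring and the clique share it roughly evenly among all six pages.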

19
Star topology
[Figure: star topology, target page a linked with boosting pages b to f, between I (incoming) and O (outgoing), with the PageRank of each node]
  • PR (a) = 0.43
20
Putting all together
  • Given many spam farms
  • Create highly connected topologies among target
    pages
  • Link Exchange
  • every target page will be rewarded in proportion
    to its previous page rank

21
Is it worth it ?
  • Page rank has a power-law distribution
  • if a page has a low initial PageRank, it is easy
    to improve it and to get a higher ranking
  • if a page has a high initial PageRank, it is
    hard to improve it, and even harder to overtake
    other pages
  • Consider that
  • it is cheap to generate a link farm
    automatically, but
  • spamming is expensive in terms of registered
    domains and IPs.

22
Hiding Techniques
  • Discriminate between real users and crawlers in
    order to hide spam activity from both of them

23
Content Hiding
  • Use background color for text.
  • add keywords
  • Use small 1 pixel anchor images.
  • add links

24
Cloaking
  • Identify whether the request comes from a real
    user or a search engine and provide different
    content.
  • To users
  • provide target pages.
  • To Search Engines
  • provide useful and interesting text.
  • provide a link structure that increases PageRank.
  • Solution
  • Download the same page twice.
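
One way to read the "download twice" idea is to fetch the page once while identifying as a crawler and once as an ordinary browser, then compare the two copies. The sketch below covers only the comparison step and leaves fetching out; the two-identity reading, the function names, the crude tag stripping and the 0.5 threshold are all assumptions for illustration.

```python
import re
from difflib import SequenceMatcher

def visible_words(html):
    """Crude text extraction: drop tags, lowercase, split into words."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z0-9]+", text.lower())

def looks_cloaked(copy_for_crawler, copy_for_browser, threshold=0.5):
    """Flag the page when the two downloaded copies share too little text."""
    ratio = SequenceMatcher(None,
                            visible_words(copy_for_crawler),
                            visible_words(copy_for_browser)).ratio()
    return ratio < threshold

# Hypothetical copies of the same URL, as served to a crawler and to a browser.
crawler_copy = "<html><body><h1>Unix man pages</h1><p>ls lists directory contents</p></body></html>"
browser_copy = "<html><body><h1>Cheap pills</h1><p>buy now great deals</p></body></html>"

cloaked = looks_cloaked(crawler_copy, browser_copy)
same = looks_cloaked(crawler_copy, crawler_copy)
```

A real detector would also have to tolerate legitimate differences between two downloads, such as rotating ads or timestamps, which is why a similarity threshold is used rather than an exact comparison.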

25
Redirection
  • The redirection mechanism is used to create
    doorways to target pages
  • Search Engines
  • download the page and crawl its links.
  • Users
  • are immediately redirected to a target page.

26
Why content hiding is tough
  • HTML code can be parsed trying to detect spam
    intrusions.
  • Javascript code can be parsed too, but it is more
    difficult.
  • Ultimately, the code has to be interpreted.
  • Crawling is already very expensive !

27
Link analysis algorithms against web spamming
  • TrustRank
  • Anti-Trust Rank
  • Truncated Page Rank
  • SpamRank

28
Trust Rank
  • Observation
  • Good pages tend to link to good pages.
  • A human is the best spam detector
  • Algorithm
  • Select a small subset of pages and let a human
    classify them
  • Propagate goodness of pages

29
Trust Rank Selection
  • The seed set S should
  • be as small as possible
  • cover a large part of the Web
  • Covering is related to out-links in the very same
    way PageRank is related to in-links
  • Inverse PageRank !
  • A small number of pages with the highest Inverse
    PageRank is labeled by a human expert.

30
Trust Rank Propagation
  • Initial values
  • TR(p) = 1, if p was found to be a good page
  • TR(p) = 0, otherwise
  • Iterations
  • propagate Trust in the same way as PageRank
  • splitting through out-links
  • damping (attenuation) $\alpha$
  • only a fixed number of iterations, M.
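
The propagation step can be sketched as a PageRank-style iteration whose teleport vector is the set of hand-labelled good pages. The function name `trust_rank`, the toy graph, the default of 20 iterations and the choice to simply drop the trust of pages without out-links are assumptions made for this illustration.

```python
def trust_rank(links, good_seeds, alpha=0.85, iterations=20):
    """links: dict page -> list of out-links (every target must be a key).
    good_seeds: pages a human labelled as good. Returns a dict page -> trust."""
    pages = list(links)
    seed = {p: (1.0 if p in good_seeds else 0.0) for p in pages}   # initial TR values
    trust = dict(seed)
    for _ in range(iterations):                  # only a fixed number of iterations
        nxt = {p: (1 - alpha) * seed[p] for p in pages}
        for p in pages:
            for q in links[p]:
                # Trust is split evenly over out-links and damped by alpha.
                nxt[q] += alpha * trust[p] / len(links[p])
        trust = nxt
    return trust

# Hypothetical graph: a trusted page reaching two others, plus an isolated spam pair.
web = {"good": ["a"], "a": ["b"], "b": [],
       "spam1": ["spam2"], "spam2": ["spam1"]}
trust = trust_rank(web, good_seeds={"good"})
```

Trust fades with distance from the seed, and the spam pair, which no good page links to, ends up with no trust at all.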

31
Trust Rank Results
32
Anti-Trust Rank
  • Goal
  • find spam pages
  • Algorithm
  • Obtain a seed set of spam pages labeled by hand.
    (prefer high PageRank)
  • Run the PageRank algorithm on the transposed
    adjacency matrix.
  • Use the seed set in the personalization vector.
  • Rank the pages in descending order of their
    scores.
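
The steps above can be sketched as follows. The function name `anti_trust_rank`, the toy graph and the choice to drop the score of pages without in-links in the original graph are assumptions for illustration.

```python
def anti_trust_rank(links, spam_seeds, alpha=0.85, iterations=50):
    """links: dict page -> list of out-links (every target must be a key).
    spam_seeds: pages labelled as spam by hand. Returns a dict page -> score."""
    pages = list(links)
    # Transposed graph: an edge p -> q becomes q -> p.
    reversed_links = {p: [] for p in pages}
    for p, targets in links.items():
        for q in targets:
            reversed_links[q].append(p)
    # The personalization vector is concentrated on the spam seed set.
    v = {p: (1.0 / len(spam_seeds) if p in spam_seeds else 0.0) for p in pages}
    score = dict(v)
    for _ in range(iterations):
        nxt = {p: (1 - alpha) * v[p] for p in pages}
        for p in pages:
            for q in reversed_links[p]:
                nxt[q] += alpha * score[p] / len(reversed_links[p])
        score = nxt
    return score

# Hypothetical graph: "booster" links to a known spam page; two honest pages link to each other.
web = {"booster": ["spam"], "spam": [], "honest": ["news"], "news": ["honest"]}
scores = anti_trust_rank(web, spam_seeds={"spam"})
ranking = sorted(web, key=scores.get, reverse=True)   # descending order of score
```

Pages that point to known spam receive a high score, while pages with no path to the seed set stay at zero.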

33
Anti-Trust Rank
36
Truncated Page Rank
  • Observation
  • Good pages have high page rank because of pages
    between 5 and 10 hops away
  • Spam pages gain page rank because of pages in
    their neighborhood
  • Solution
  • promote rank coming from far away
  • demote rank coming from the closest pages

37
Truncated Page Rank
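  • Rank propagates through links
  • only a fraction $\alpha$ propagates according to the
    adjacency matrix M
  • 5 steps of propagation mean
  • $\alpha M \cdot \alpha M \cdot \alpha M \cdot \alpha M \cdot \alpha M = \alpha^{5} M^{5}$
  • We can calculate the page rank of a page by
    summing up the contributions from different
    distances
  • $PR(p) = \sum_{t} \alpha^{t} M^{t} = \sum_{t} \mathrm{damping}(t) \, M^{t}$
  • We can replace $\alpha^{t}$ with a different function damping(t)
    [figure: damping function]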
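
The sum over distances can be sketched by pushing a uniform vector through the links step by step and weighting each step with damping(t). The cutoff-based damping function below is one possible choice and an assumption, since the function shown in the original figure is not part of this transcript; the function names, the toy graph and the cutoff of 2 are also hypothetical.

```python
def path_based_rank(links, damping, max_steps=30):
    """Sum of damping(t) times the contribution of paths of length t, for t = 0..max_steps.
    links: dict page -> list of out-links (every target must be a key)."""
    pages = list(links)
    contrib = {p: 1.0 / len(pages) for p in pages}       # paths of length 0
    rank = {p: damping(0) * contrib[p] for p in pages}
    for t in range(1, max_steps + 1):
        nxt = dict.fromkeys(pages, 0.0)
        for p in pages:
            for q in links[p]:
                nxt[q] += contrib[p] / len(links[p])     # one more step along the links
        contrib = nxt
        for p in pages:
            rank[p] += damping(t) * contrib[p]
    return rank

ALPHA = 0.85

def pagerank_damping(t):
    return (1 - ALPHA) * ALPHA ** t                      # ordinary PageRank weighting

def truncated_damping(t, cutoff=2):
    # Assumed shape: ignore contributions from the closest pages entirely.
    return 0.0 if t <= cutoff else (1 - ALPHA) * ALPHA ** t

# Hypothetical graph: "spam" is fed by direct boosters, "good" by a longer chain.
web = {"b1": ["spam"], "b2": ["spam"], "b3": ["spam"], "spam": [],
       "c1": ["c2"], "c2": ["c3"], "c3": ["c4"], "c4": ["good"], "good": []}
full = path_based_rank(web, pagerank_damping)
truncated = path_based_rank(web, truncated_damping)
```

In this toy graph the spam page keeps a healthy ordinary rank but loses all of it once nearby contributions are ignored, while the page supported from further away retains part of its rank.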

38
Truncated Page Rank
  • Strategy
  • Pages whose PageRank differs greatly from their
    Truncated PageRank are likely to be spam
  • Results
  • Comparable with TrustRank

39
Spam Rank
  • Observations
  • Spam pages are usually supported by low PageRank
    Pages.
  • Spammers have a limited budget, so they replicate
    only what they need for boosting PageRank.
  • Idea
  • Find missing statistical features of dishonest
    supporters.
  • Due to the self-similarity, the honest supporter
    set should have a power-law distribution of
    PageRank.

40
Spam Rank Algorithm
  • Find supporters for each page.
  • Check whether each set of supporters follows a
    power-law distribution of its PageRank.
  • Create penalties for suspicious pages.
  • Run PageRank using a personalization vector based
    on penalties.
  • Spam Rank is a Measure of Undeserved PageRank

41
Spam Rank Results
42
The End.