Title: Taking over Search Engines
1Taking over Search Engines
2Web Spamming
- What is Spamming ?
- Spamming is the art of increasingthe rank of a
page. - Why ?
- Having more visits means gaining more money.
- How ?
- Web search engines are the gateways to the web.
- Get listed in the top results.
3How much Spam out there ?
- Real-Web data from the MSN crawler collected
during August 2004 - 105,484,446 Web pages
4Why is spam bad ?
- For Users
- Useless pages.
- For Search Engines
- Wastes bandwidth, CPU cycles, storage space.
- Pollutes corpus.
- Distorts ranking of results.
(Again bad news for users !)
5Techniques
- Web Search Engines use a number of measure to
estimate the importance of a page - Content Analysis TF x IDF,
- Link Analysis PageRank,
- Also spammers use a number of techniques !
- Content Manipulation, i.e. terms
- Stucture Manipulation, i.e. links
6Content Manipulation 1
- Repetition Repetition Repetition Repetition
Repetition Repetition Repetition Repetition - Increases the Term Frequency
- dumortierite dumose dumous dump dumper dumpage
Dumping dumper dumpily - Makes a document relevant to many queries.
- It is effective when using rare words (Inverse
Document Frequency).
7Where ?
- Body, Title, Meta Tag, Anchor, Url.
8Content Manipulation 2
- Content Repurposing
- Weaving
- Insertion of spam words into a well formed page
copied another web-site. - Phrase Stitching
- Gluing well formed sentences copied from many
other web-sites. - Why ?
- Overcomes simple statistics that may be taken
into account by web search engines
9The Big Picture (1)Techniques / Boosting / Term
lta hreftarget.htmlgtfree, great deals, cheap,
inexpensive, cheap, freelt/agt
Link Bombing
10Link Manipulation
- Links and pages from the attacker point of view
11Creating (Hijacked) In-Links
- Honey pots.
- copies of valuable content (e.g. Unix Man Pages)
with hidden links to spam farms or target pages. - Web Directories, Blogs, Wikis
- all of the above usually have high Page Rank, and
it is possible to add outgoing links to owned
pages. - Link Exchange
- Buy Expired domain
- Creating Link Farms
12Spamming HITS
- HITS algorithm
- Searches for Hubs and Authorities
- Top ranked pages are the more authoritative ones
- Spam on HITS
- Find a collection of good Hubs
- Add links from Hubs to the target page
- The target page is now linked to good Hubs !!
13PageRank
- PageRank in one equation
- PR(p) ? M (1- ?) Vp
- M is the adjacency matrix of the Web Graph.
- ? is the damping factor. (usually .85)
- in case of fairness Vp1/N (N of pages
in the Web). - V is the personalization vector.
- What happens if a page p has no outgoing links ?
- ? of its PR is lost --gt all the PR will be lost
eventually. - solution normalize rows of M.
(i.e. insert links to every other page)
14Aggregate Page Rank
- Total page rank is affected by
- Number of pages
- Incoming Links
- Outgoing Links
- Dangling Nodes
- Topologies that
- Use as many pages as possible
- minimize outgoing links
- minimize dangling nodes
incoming links
WEB-SITE
outgoing links
15Chain topology (more is better)
PR (Web Site) 0.34
I
a
O
0.18
0.34
0.47
PR (Web Site) 0.210.29 0.50
I
a
O
b
0.11
0.21
0.37
0.29
I
a
b
c
d
e
f
O
0.03
0.07
0.09
0.12
0.14
0.16
0.17
0.18
PR (Web Site) 0.77
16Ring topology
I
a
O
0.18
0.34
0.47
0.18
I
a
O
0.11
0.03
b
f
0.15
0.11
PR (Web Site) 0.86
c
e
0.12
0.14
d
0.13
17Clique topology
I
a
O
0.18
0.34
0.47
0.18
I
a
O
0.04
0.03
b
f
0.15
0.15
PR (Web Site) 0.93
c
e
0.15
0.15
d
0.15
18Increasing Page Rank of a single target page
- Complicated structures do not help
- chain, ring, clique waste page rank among every
node in the website - To maximize the page rank of a target page a
- all hijacked pages I must point to a
- all boosting pages (b,c,d,e,f) must point to a
- no links among boosting pages
- the target page must point to all of the boosting
pages
19Star topology
I
a
O
0.18
0.34
0.47
0.09
b
0.09
0.09
c
f
PR (a) 0.43
I
a
O
0.09
0.03
d
e
0.09
0.09
20Putting all together
- Given many spam farms
- Create highly connected topologies among target
pages - Link Exchange
- every target page will be rewarded proportionally
to their previous page rank
21Is it worth ?
- Page rank has a power low distribution
- if a page has a low initial PageRankit is easy
to improve it and to get higher ranking - if a page as an higher initial PageRankit is
hard to improve it and it is harder to overcome
other pages - Consider that
- it is cheap to generate automatically a link
farm, but - spamming is expensive in terms of registered
domains and IPs.
22Hiding Techniques
- Discriminate between real users and crawlers in
order to hide spam activity to both of them
23Content Hiding
- Use background color for text.
- add keywords
- Use small 1 pixel anchor images.
- add links
24Cloaking
- Identify whether the request comes from a real
user or a search engine and provide different
content. - To users
- provide target pages.
- To Search Engines
- provide useful and interesting text.
- provide a link structure that increase PageRank.
- Solution
- Download the same page twice.
25Redirection
- The redirection mechanism is used to create
doorways to target pages - Search Engines
- download the page and crawl its links.
- Users
- are immediately redirected to a target page.
26Why content hiding is tough
- HTML code can be parsed trying to detect spam
intrusions. - Javascript code can be parsed too, but it is more
difficult. - Eventually, it is needed to interpret the code.
- Crawling is already very expensive !
27Link analysis algorithms against web spamming
- TrustRank
- Anti-Trust Rank
- Truncated Page Rank
- SpamRank
28Trust Rank
- Observation
- Good pages tend to link good pages.
- Human is the best spam detector
- Algorithm
- Select a small subset of pages and let a human
classify them - Propagate goodness of pages
29Trust Rank Selection
- The seed set S should
- be as small as possible
- cover a large part of the Web
- Covering is related to out-links in the very same
way PageRank is related to in-link - Inverse PageRank !
- A small number of pages with the highest Inverse
PageRank is labeled by a human expert.
30Trust Rank Propagation
- Initial values
- TR(p) 1, if p was found to be a good page
- TR(p) 0, otherwise
- Iterations
- propagate Trust in the same way as PageRank
- splitting through out-links
- damping (attenuation) ?
- only a fixed number of iteration M.
31Trust Rank Results
32Anti-Trust Rank
- Goal
- find spam pages
- Algorithm
- Obtain a seed set of spam pages labeled by hand.
(prefer high PageRank) - Compute PageRank Algorithm on the trasnposed
adjacency matrix. - Use the seed set in the personalization vector.
- Rank the pages in descending order of their
scores.
33Anti-Trust Rank
34Truncated Page Rank
- Observation
- Good pages have high page rank because of pages
between 5 and 10 hops away
35Truncated Page Rank
- Observation
- Good pages have high page rank because of pages
between 5 and 10 hops away - Spam pages gain page rank because of pages in
their neighborhood
36Truncated Page Rank
- Observation
- Good pages have high page rank because of pages
between 5 and 10 hops away - Spam pages gain page rank because of pages in
their neighborhood - Solution
- promote rank coming from far away
- demote rank coming from the closest pages
37Truncated Page Rank
- Rank propagates through links
- only a fraction ? propagates according to the
adjacency matrix M - 5 steps of propagation mean
- ?M ?M ?M ?M ?M ?5M5
- We can calculate the page rank of a page by
summing up the contributions from different
distances - PR(p) ? ?t Mt ? damping(t) Mt
- We can replace ?n with a function like this
38Truncated Page Rank
- Strategy
- Pages whose PageRank is largely different from
its Truncated PageRank are likely to be spam - Results
- Comparable with TrustRank
39Spam Rank
- Observations
- Spam pages are usually supported by low PageRank
Pages. - Spammers have a limited budget, so they replicate
only what they need for boosting PageRank. - Idea
- Find missing statistical features of dishonest
supporters. - Due to the self-similarity, the honest supporter
set should have a power-law distribution of
PageRank.
40Spam Rank Algorithm
- Find supporters for each page.
- Check whether each set of supporters follows a
power-law distribution of its PageRank. - Create penalties for suspicious pages.
- Run PageRank using a personalization vector based
on penalties. - Spam Rank is a Measure of Undeserved PageRank
41Spam Rank Results
42fine.