Title: Link Analysis and Anti-Spam
1Link Analysis and Anti-Spam
- Tie-Yan Liu
- Microsoft Research Asia
2Outline
- First Session
- Overview of Link Analysis Technologies
- PageRank and HITS
- Second Session
- More about Link Analysis Algorithms
- Third Session
- Spam and Anti-Spam
- Homework
3First Session
4Typical Search Engine Architecture
5Ranking for the Search Results
- Today's search engines may return millions of pages for a given query.
- It is not possible for the user to preview all of these results.
- An appropriate ranking is therefore very helpful.
- Ranking on relevance
- Ranking on importance
6Traditional IR Ranking
- A ranking purely on relevance
- Term frequency (tf)
- Inverse Document Frequency (idf)
- Okapi
- Many other aspects that Dr. Shuming Shi will
mention in the next course.
7Limitations of Traditional IR
- Text-based ranking function
- www.harvard.edu can hardly be recognized as one of the most authoritative pages for the query "harvard", since many other web pages contain "harvard" more often.
- The number of pages with the same relevance is still too large for users to preview.
- Pages are not sufficiently self-descriptive
- The term "search engine" usually does not appear on the web pages of search engines.
8What's More for Web Search
- In order to solve these problems
- We must leverage other information on the Web
- We must distinguish between pages with the same amount of relevance
- Link Analysis
- The Web is not just a collection of pure-text documents; the hyperlinks are also very important!
- A link from page A to page B may indicate
- A is related to B, or
- A is recommending, citing, voting for or endorsing B
- Links affect the ranking of web pages and thus have commercial value.
9Famous Link Analysis Methods
10HITS - Kleinberg's Algorithm
- HITS: Hypertext Induced Topic Selection
- For each vertex v in a subgraph of interest
- a(v): the authority of v
- h(v): the hubness of v
- A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
- Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites.
11Authority and Hubness
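[Figure: example graph with nodes 1-7, in which nodes 2, 3 and 4 link to node 1 and node 1 links to nodes 5, 6 and 7]
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)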
12Convergence of Authority and Hubness
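- Recursive dependency
- $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
- $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
- pa[v]: the parents of v (pages linking to v); ch[v]: the children of v (pages v links to)
- Using linear algebra, we can prove that a(v) and h(v) converge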
13HITS Example
Find a base subgraph
- Start with a root set R = {1, 2, 3, 4}
- {1, 2, 3, 4}: nodes relevant to the topic
- Expand the root set R to include all the children and a fixed number of parents of nodes in R
→ A new set S (the base subgraph)
14HITS Example
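Hubs and authorities: two n-dimensional vectors a and h
- HubsAuthorities(G)
- $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
- $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
- $t \leftarrow 1$
- repeat
-   for each v in V
-     do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
-        $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
-   $a_t \leftarrow a_t / \|a_t\|$
-   $h_t \leftarrow h_t / \|h_t\|$
-   $t \leftarrow t + 1$
- until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
- return $(a_t, h_t)$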
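The iteration above can be sketched in a few lines of Python. This is a minimal illustration, not the lecturer's implementation: the function names, the dictionary-based graph representation, the `eps` and `max_iter` defaults, and the toy graph are all assumptions made for the example.

```python
import math


def normalize(vec):
    """Scale a dict of scores to unit Euclidean length (zero vectors stay zero)."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    return {k: (x / norm if norm else 0.0) for k, x in vec.items()}


def distance(u, v):
    """Euclidean distance between two score dicts with the same keys."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u))


def hits(children, eps=1e-10, max_iter=1000):
    """children: dict mapping every node to the list of nodes it links to."""
    nodes = list(children)
    parents = {v: [] for v in nodes}
    for v, targets in children.items():
        for w in targets:
            parents[w].append(v)

    a = {v: 1.0 for v in nodes}  # a_0
    h = {v: 1.0 for v in nodes}  # h_0
    for _ in range(max_iter):
        # Authority: sum of the previous hub scores of the parents.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        # Hubness: sum of the previous authority scores of the children.
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        change = distance(new_a, a) + distance(new_h, h)
        a, h = new_a, new_h
        if change < eps:
            break
    return a, h


# Hypothetical toy graph: pages 1 and 2 act as hubs, pages 3 and 4 as authorities.
toy = {1: [3, 4], 2: [3], 3: [], 4: []}
authority, hubness = hits(toy)
```

On this toy graph page 3, which is cited by both hubs, ends up with a higher authority score than page 4, and page 1, which links to both authorities, ends up with a higher hubness than page 2.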
15HITS Example Results
[Figure: authority and hubness weights for nodes 1-15]
16Matrix Notation of HITS
- The authority and hubness vectors calculated by the aforementioned algorithm are the principal right and left singular vectors, respectively, of the adjacency matrix A of the base subgraph (taking A_ij = 1 when node i links to node j).
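One way to see this is to write a single update of the algorithm in matrix form. The sketch below assumes that $A_{ij} = 1$ when node i links to node j (the slides do not state the orientation) and ignores the normalization step:

```latex
\begin{aligned}
a_t &\propto A^{\top} h_{t-1}, & h_t &\propto A\, a_{t-1} \\
\Rightarrow\quad a_t &\propto A^{\top} A\, a_{t-2}, & h_t &\propto A A^{\top} h_{t-2}
\end{aligned}
```

Under that assumption the iteration is a power method on $A^{\top} A$ and $A A^{\top}$, whose principal eigenvectors are the principal right and left singular vectors of A.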
17PageRank
- Introduced by Page et al. (1998)
- The rank of a page is proportional to the ranks of its parents, but inversely proportional to its parents' outdegrees
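Written out as a formula, this statement might look as follows. The sketch reuses the pa[v] / ch[v] notation of the HITS slides and omits any normalization constant:

```latex
r(v) = \sum_{w \in pa[v]} \frac{r(w)}{|ch[w]|}
```

For example, a page whose only parents are $w_1$ (rank 0.4, two outlinks) and $w_2$ (rank 0.3, three outlinks) would receive 0.4/2 + 0.3/3 = 0.3.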
18Matrix Notation
Adjacency matrix A
19Matrix Notation
- Matrix Notation
- r = B r
- PageRank is embedded in the eigenvector of B associated with the eigenvalue 1.
20Matrix Notation
21Markov Chain Notation
- Random surfer model
- Description of a random walk through the Web graph
- The rank of a page is interpreted as the asymptotic probability that the surfer is currently browsing that page
$r_t = M r_{t-1}$, where M is the transition matrix of a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
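The update $r_t = M r_{t-1}$ can be tried out on a small graph. The following is a minimal sketch under stated assumptions: the graph, the function name `power_iteration` and the step count are made up for illustration, and every node is assumed to have at least one outlink so that M is stochastic.

```python
def power_iteration(links, steps=100):
    """links: dict node -> list of children; every node must have outlinks.

    Repeatedly applies r_t = M r_{t-1}, where column w of M spreads the
    rank of w evenly over its children.
    """
    nodes = list(links)
    r = {v: 1.0 / len(nodes) for v in nodes}  # uniform initial ranks
    for _ in range(steps):
        nxt = {v: 0.0 for v in nodes}
        for w in nodes:
            for v in links[w]:
                nxt[v] += r[w] / len(links[w])
        r = nxt
    return r


# Hypothetical three-page web: a -> b, a -> c, b -> c, c -> a.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = power_iteration(web)
```

For this strongly connected, aperiodic toy graph the ranks settle at a = 0.4, b = 0.2, c = 0.4, which is the fixed point of the update.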
22Problem
- Rank Sink Problem
- In general, many Web pages have no inlinks/outlinks
- This results in dangling edges in the graph
- E.g.
- no parent → rank 0
- $M^T$ converges to a matrix whose last column is all zero
- no children → no solution
- $M^T$ converges to the zero matrix
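The "no children" case can be reproduced with a tiny example. This is an illustrative sketch only: the three-page chain and the helper name `step` are assumptions, and the unmodified update $r_t = M r_{t-1}$ is applied without any correction for dangling pages.

```python
def step(links, r):
    """One application of r_t = M r_{t-1}; pages without children pass on nothing."""
    nxt = {v: 0.0 for v in links}
    for w, children in links.items():
        for v in children:
            nxt[v] += r[w] / len(children)
    return nxt


# Hypothetical chain 1 -> 2 -> 3, where page 3 has no children.
chain = {1: [2], 2: [3], 3: []}
r = {v: 1.0 / 3 for v in chain}
history = [r]
for _ in range(3):
    r = step(chain, r)
    history.append(r)

totals = [sum(h.values()) for h in history]
```

The total rank in `totals` shrinks at every step and reaches zero after three steps: page 1 loses its rank at once because it has no parent, and the rank that reaches page 3 has nowhere to go.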
23Modification
- The surfer will restart browsing by picking a new Web page at random
- M = (B + E)
- E: escape matrix
- M: stochastic matrix
- Still a problem?
- It is not guaranteed that M is primitive
- If M is stochastic and primitive, PageRank converges to the corresponding stationary distribution of M
24Distribution of the Mixture Model
- The probability distribution that results from combining the Markovian random walk distribution and the static rank source distribution
- $r = \varepsilon e + (1 - \varepsilon) x$
- $\varepsilon$: probability of selecting a non-linked page
PageRank
Now, the transition matrix $\varepsilon H + (1 - \varepsilon) M$ is primitive and stochastic, so $r_t$ converges to the dominant eigenvector.
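A minimal sketch of the mixture model follows. Several choices here are assumptions rather than part of the slides: the static rank source is taken to be uniform, rank held by pages without children is redistributed uniformly, and the default `eps=0.15`, the tolerance and the function name are illustrative.

```python
def pagerank(links, eps=0.15, tol=1e-12, max_iter=1000):
    """links: dict node -> list of children (children may be empty).

    With probability eps the surfer jumps to a uniformly random page;
    otherwise a random outlink of the current page is followed.
    """
    nodes = list(links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Assumption: rank sitting on pages without children is spread uniformly.
        dangling = sum(r[w] for w in nodes if not links[w])
        nxt = {v: eps / n + (1 - eps) * dangling / n for v in nodes}
        for w in nodes:
            for v in links[w]:
                nxt[v] += (1 - eps) * r[w] / len(links[w])
        diff = sum(abs(nxt[v] - r[v]) for v in nodes)
        r = nxt
        if diff < tol:
            break
    return r


# The chain 1 -> 2 -> 3, which has a page without parents and one without children.
chain = {1: [2], 2: [3], 3: []}
r = pagerank(chain)
```

Unlike the unmodified random walk, the ranks now stay a probability distribution, and every page keeps a positive rank even if it has no parent.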
25PageRank vs. HITS - Algorithm
26PageRank vs. HITS - Stability
- Are link analysis algorithms based on eigenvectors stable, in the sense that the results don't change significantly when the graph changes slightly?
- General strategy for evaluating stability
- 1. Start with the original adjacency matrix A
- 2. Perturb the matrix to get A': select k nodes in the graph to add or delete
- 3. Compute the distance d(r(A), r(A')), for some distance measure d and objective function r that measures the quality of the results of A
- 4. Compute the amount of perturbation p(A, A'), for some distance function p
- 5. Evaluate the conditions, if any, under which small values of p generate large values of d
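The five steps can be sketched as a small experiment. This is only one possible instantiation, with assumptions stated plainly: the perturbation adds edges rather than nodes, r is a mixture-model PageRank with uniform jumps, d is the L1 distance between rank vectors, p counts changed edges, and all names and the toy graph are hypothetical.

```python
def pagerank(links, eps=0.15, iters=200):
    """Objective function r; assumes every node has at least one outlink."""
    nodes = list(links)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: eps / n for v in nodes}
        for w in nodes:
            for v in links[w]:
                nxt[v] += (1 - eps) * r[w] / len(links[w])
        r = nxt
    return r


def edge_set(links):
    return {(src, dst) for src, targets in links.items() for dst in targets}


def perturb(links, extra_edges):
    """Step 2: return a copy of the graph with the given (src, dst) edges added."""
    new = {v: list(targets) for v, targets in links.items()}
    for src, dst in extra_edges:
        if dst not in new[src]:
            new[src].append(dst)
    return new


def rank_distance(r1, r2):
    """Step 3: L1 distance between two rank vectors over the same nodes."""
    return sum(abs(r1[v] - r2[v]) for v in r1)


A = {"a": ["b"], "b": ["c"], "c": ["a"], "d": ["a"]}   # step 1
A_prime = perturb(A, [("d", "c")])                      # step 2

d = rank_distance(pagerank(A), pagerank(A_prime))       # step 3
p = len(edge_set(A) ^ edge_set(A_prime))                # step 4: changed edges
```

Step 5 would then repeat this over many perturbations and look for cases where a small `p` goes together with a large `d`.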
27Stability of HITS
- Ng et al. (2001)
- A bound on the number of hyperlinks k that can be added to or deleted from one page without affecting the authority or hubness weights
- Observations
- Stability is determined by the eigengap
- Eigengap: the difference between the 1st and 2nd eigenvalues
- $A^T A$ for authorities, $A A^T$ for hubs
- If the eigengap is big, HITS will be insensitive to small perturbations, and vice versa if it is small
$\delta$: eigengap $\lambda_1 - \lambda_2$; d: maximum outdegree of G
28Stability of PageRank
- Looser bound
- Ng et al. (2001)
- Bianchini et al. (2001)
- Observations
- The parameter ε of the mixture model has a stabilization role
- If the original k pages to be modified do not have high overall PR scores, then the perturbed scores will not be far from the original
29Second Session
30Pre-PageRank
- PageRank has achieved great success in industry, and many people regard it as a breakthrough in the research field as well.
- Actually, the basic idea of PageRank had already appeared in several earlier works
- Mark 1988
- Bray 1996
- Marchiori 1997
31Mark 1988
- To calculate the score S of a document at vertex v
$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$
- v: a vertex in the hypertext graph G = (V, E)
- S(v): the global score
- s(v): the score if the document is isolated
- ch[v]: the children of the document at vertex v
- Limitation
- Requires G to be a directed acyclic graph (DAG)
- If v has a single link to w, S(v) > S(w)
- If v has a long path to w and s(v) < s(w), then S(v) > S(w)
Mark, D. M., (1988), "Network models in
geomorphology," Chapter 4 in Modeling in
Geomorphologic Systems, Edited by M. G. Anderson,
John Wiley., p.73-97.
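The score of Mark (1988) can be computed by a memoized recursion over the graph. The sketch below assumes the formula $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$ and an acyclic graph; the function name and the example scores are made up for illustration.

```python
from functools import lru_cache


def global_scores(children, local):
    """children: dict vertex -> list of child vertices (must form a DAG).
    local: dict vertex -> isolated score s(v).
    Returns S(v) = s(v) + average of S(w) over the children w of v.
    """
    @lru_cache(maxsize=None)
    def S(v):
        ch = children.get(v, ())
        if not ch:
            return local[v]
        return local[v] + sum(S(w) for w in ch) / len(ch)

    return {v: S(v) for v in local}


# Hypothetical path v -> u -> w where the target w has the best isolated score.
children = {"v": ["u"], "u": ["w"], "w": []}
local = {"v": 1.0, "u": 1.0, "w": 5.0}
scores = global_scores(children, local)
```

Here `scores` is `{"v": 7.0, "u": 6.0, "w": 5.0}`, which illustrates the limitation listed above: v outranks w although s(v) < s(w). On a graph with a cycle the recursion would not terminate, which is why a DAG is required.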
32Bray 1996
- The visibility of a site is measured by the number of other sites pointing to it
- Authority?
- The luminosity of a site is measured by the number of other sites to which it points
- Hub?
33Marchiori (1997)
- Hyper information should complement textual information to obtain the overall information
$S(v) = s(v) + h(v)$
- S(v): overall information
- s(v): textual information
- h(v): hyper information
$h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$
- F: a fading constant, F ∈ (0, 1)
- r(v, w): the rank of w after sorting the children of v by S(w)
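A small sketch of this scoring scheme is given below. It assumes that $h(v) = \sum_{w \in ch[v]} F^{r(v,w)} S(w)$, that children are sorted by S(w) in descending order with ranks starting at 1, and that the graph is acyclic; none of these details is spelled out on the slide, and the names and numbers are illustrative.

```python
def overall_information(children, textual, fading=0.5):
    """Return S(v) = s(v) + h(v) for every vertex of an acyclic link graph.

    children: dict vertex -> list of child vertices.
    textual:  dict vertex -> textual information s(v).
    fading:   the fading constant F, with 0 < F < 1.
    """
    memo = {}

    def S(v):
        if v not in memo:
            # Sort the children's overall information, best first (assumed order).
            ranked = sorted((S(w) for w in children.get(v, [])), reverse=True)
            hyper = sum(fading ** rank * score
                        for rank, score in enumerate(ranked, start=1))
            memo[v] = textual[v] + hyper
        return memo[v]

    return {v: S(v) for v in textual}


# Hypothetical page v linking to x and y.
children = {"v": ["x", "y"]}
textual = {"v": 1.0, "x": 4.0, "y": 2.0}
info = overall_information(children, textual)
```

With F = 0.5 the hyper information of v is 0.5 * 4 + 0.25 * 2 = 2.5, so S(v) = 3.5; the better child is discounted less than the weaker one.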
34Post PageRank
- Following the success of PageRank, many new algorithms have been proposed.
- Fast PageRank calculation (Haveliwala)
- Topic-sensitive PageRank
- Personalized PageRank
- LinkFusion
35Fast PageRank calculation Haveliwala 1999
- Partition the destination vector into d blocks that each fit into main memory, and compute one block at a time.
- This algorithm is quite similar in structure to the Block Nested-Loop Join algorithm in database systems, which also performs very well for data sets of moderate size but eventually loses out to more scalable approaches.
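The blocking idea can be sketched as follows. This is a simplified illustration, not Haveliwala's implementation: node ids are assumed to be integers 0..n-1, the whole edge list is rescanned for every block (a real system would also partition the link file by destination block), and the function name and data are made up.

```python
def blocked_step(edges, r, outdeg, num_blocks):
    """One rank propagation step, computing the destination vector block by block.

    edges:  list of (src, dst) pairs with integer node ids 0..n-1.
    r:      list of current ranks, indexed by node id.
    outdeg: list of outdegrees, indexed by node id.
    """
    n = len(r)
    block_size = (n + num_blocks - 1) // num_blocks
    new_r = []
    for b in range(num_blocks):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        block = [0.0] * max(hi - lo, 0)   # only this block is held "in memory"
        for src, dst in edges:            # one pass over the links per block
            if lo <= dst < hi:
                block[dst - lo] += r[src] / outdeg[src]
        new_r.extend(block)
    return new_r


edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0)]
outdeg = [2, 1, 1, 1]
r = [0.25, 0.25, 0.25, 0.25]
one_block = blocked_step(edges, r, outdeg, 1)
two_blocks = blocked_step(edges, r, outdeg, 2)
```

The result does not depend on the number of blocks; only the amount of the destination vector that has to be kept in memory at once changes.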
36Fast PageRank calculation Haveliwala 2003
- Basic observation
- The convergence rates of the PageRank values of individual pages during application of the Power Method are nonuniform. That is, many pages converge quickly, with a few pages taking much longer to converge. Furthermore, the pages that converge slowly are generally those pages with high PageRank.
37Topic-Specific PageRank Haveliwala - WWW02
- Topic-specific PageRanks
- For each page, PageRank values are precomputed per topic; for each query, the values of the most relevant topics are used.
- 16 topics
38Link Fusion Zeng, WWW04
- In a more generalized scenario, suppose there are N data types. The importance attribute of one type of object can be reinforced by both inter- and intra-type links.
- Suppose w is the attribute vector of all the objects in the URM. Link Fusion can be represented as
- $w_{new} = L_{urm}^T w_{old}$
- Such iterative calculation can be continued
- $w_n = (L_{urm}^T)^n w_0$
- The result w is the principal eigenvector of $L_{urm}^T$, which can be explained as the value of data objects regarding a specific attribute.
39Limits of Link Analysis
- Pay-for-place
- Search engine bias: organizations pay search engines for page rank
- Advertisements: organizations pay high-ranking pages for advertising space
- With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page
40Limits of Link Analysis
- Stability
- Adding even a small number of nodes/edges to the graph has a significant impact
- Topic drift
- A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
- Content evolution
- Adding/removing links/content can affect the intuitive authority rank of a page, requiring recalculation of page ranks
41Third Session
42What is Link Spam
- Since link analysis plays an important role in search engines, it has large commercial value
- Improving one's PageRank can directly increase one's clicks and thus earn more money.
- Link spam is any attempt to unfairly gain a high ranking on a search engine for a web page without improving the user experience, by means of tricky modification / manipulation of the link graph.
43Link Spamming Technologies
- Adding outlinks
- Replicate hub pages
- Adding inlinks
- Create a honey pot
- Infiltrate a web directory
- Post links on blog, wiki, etc
- Participate in link exchange
- Buy expired domains
- Create own spam farm.
44Case Study Spam HITS
- Hub score can be increased by adding outlinks to the target page
- Authority score can be increased by creating hyperlinks from high-hub-score pages to the target page.
45Case Study Spam PageRank
- Factors that influence PageRank
- $PR(t) = PR_{static}(t) + PR_{in}(t) - PR_{out}(t) - PR_{sink}(t)$
- Strategies
- Own pages are part of the spam farm, maximizing $PR_{static}$
- Accessible pages point to the spam farm, maximizing $PR_{in}$
- Links pointing outside the spam farm are suppressed, minimizing $PR_{out}(t)$
- All pages within the farm have some outlinks, minimizing $PR_{sink}(t)$
46Anti-Spam
- Early approaches
- BHITS, SALSA, DOM, revised HITS, BadRank
- State-of-the-art
- TrustRank (2004)
- Revised PageRank (VLDB2004)
- BadRank (WWW2005)
- SpamRank (WWW2005, workshop)
47TrustRank
- Basic assumption
- Good pages seldom point to spam pages, but spam pages may very likely point to good pages.
- Use TrustRank to denote the goodness of a web page, and use trust propagation to label all web pages, starting from a small human-labeled seed set.
48TrustRank
- Step 1: Initialization
- How to select seeds
- Inverse PageRank (hub pages, since they have more influence)
- High PageRank (important pages are more important to search applications)
- Step 2: Propagation
49TrustRank
- Step 3
- Trust Dampening
- Trust Splitting
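Trust propagation with dampening and splitting can be sketched as a PageRank-style iteration whose random jumps return only to the seed set. This is an illustration under assumptions: the dampening factor `alpha`, the iteration count, the function name and the toy graph are all hypothetical, and every page is assumed to have at least one outlink.

```python
def trustrank(links, seeds, alpha=0.85, iters=50):
    """links: dict page -> list of pages it links to.
    seeds: set of human-labeled good pages.
    """
    nodes = list(links)
    # Initial trust: spread evenly over the seed set, zero elsewhere.
    d = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    t = dict(d)
    for _ in range(iters):
        nxt = {v: (1 - alpha) * d[v] for v in nodes}
        for w in nodes:
            for v in links[w]:
                # Splitting: trust is divided among the outlinks of w.
                # Dampening: each hop keeps only a fraction alpha of it.
                nxt[v] += alpha * t[w] / len(links[w])
        t = nxt
    return t


# Hypothetical graph: the spam pages link to a good page, but no good page links back.
links = {
    "good": ["ok"],
    "ok": ["good"],
    "spam": ["good", "spam2"],
    "spam2": ["spam"],
}
trust = trustrank(links, seeds={"good"})
```

Because good pages never point to the spam pages in this example, no trust reaches them, even though the spam pages point to a good page.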
50BadRank
- Motivation
- Pages in a spam farm are densely connected, and many common pages exist in both the inlinks and outlinks of these pages.
- Propagate the badness of pages in the seed set to detect the other spam pages on the Web.
51BadRank
- Step 1: Initialization
- At least 3 common nodes (approximately the same, i.e. with the same domain name) in the inlink and outlink sets
- Step 2: Expansion
- ParentPenalty: if a page links to many bad pages (more than a threshold), it will also be labeled as bad.
- Delete all the links between detected bad pages before PageRank calculation.
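The expansion step and the subsequent link deletion might be sketched as below. The seed set is taken as given (step 1 is not implemented), the expansion is assumed to repeat until no new page is labeled, and the function names, threshold and toy graph are hypothetical.

```python
def expand_bad(links, seed_bad, threshold):
    """ParentPenalty: label a page bad if it links to more than `threshold` bad pages."""
    bad = set(seed_bad)
    changed = True
    while changed:  # assumption: repeat until no new page is labeled
        changed = False
        for page, children in links.items():
            if page not in bad and sum(c in bad for c in children) > threshold:
                bad.add(page)
                changed = True
    return bad


def drop_bad_links(links, bad):
    """Delete every link whose source and target are both labeled bad."""
    return {
        page: [c for c in children if not (page in bad and c in bad)]
        for page, children in links.items()
    }


links = {
    "s1": ["s2", "s3"],
    "s2": ["s1", "s3"],
    "s3": ["s1"],
    "x": ["s1", "s2", "news"],
    "news": ["blog"],
    "blog": ["news"],
}
bad = expand_bad(links, seed_bad={"s1", "s2"}, threshold=1)
cleaned = drop_bad_links(links, bad)
```

In this example page x is added to the bad set because it links to two bad pages, while s3 links to only one and stays unlabeled; PageRank would then be computed on `cleaned`.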
52Revised PageRank
- Assumption
- Pages in a spam farm are highly correlated with each other.
- Approach
- Increase the probability of jumping from nodes with large correlation coefficients.
53Revised PageRank
- Step 1: Collusion detection
- Calculate PageRank values for different ε
- Calculate the correlation coefficient between the curve of node x's PageRank and 1/ε, denoted by co-co(x).
- Step 2: ε Personalization
- Use F(ε_default, co-co(x)) to personalize the original matrix U.
- Recalculate PageRank.
54SpamRank
- Key assumption
- Supporters of an honest page should not be overly dependent on one another, i.e. they should be spread across different quality levels.
- Due to self-similarity, the honest supporter set should have a power-law distribution of PageRank.
- Spammers have a limited budget, so they do not replicate the unimportant structures.
55Summary
- Current work on anti-spam is very limited.
- Promising research directions
- Use more statistics and the properties of the transition probability matrix to detect spam
- Design a new spam-free ranking function
56Homework
57Technical Report Writing
- HITS and PageRank are both based on simple linear algebra. Can you design another link analysis algorithm based on advanced linear algebra or matrix factorization?
- The performance / sensitivity of PageRank with respect to the smoothing factor ε.
- How can the calculation of PageRank be sped up using matrix factorization, or some specific characteristics of the Markov chain?
- PageRank is the eigenvector of a 2-D matrix; can LinkFusion then be the eigenvector of a 3-D tensor?
- Stability analysis for other link analysis algorithms.
- A survey on the state-of-the-art spam technologies.
- How to design a search engine that is robust to spam?
- Other novel research topics related to link analysis.
58Requirements
- Send the report to Tie-Yan.Liu_at_microsoft.com before Dec 4 (within 1 month).
- The length should not be less than 8 pages, using the template at http://www.acm.org/sigs/pubs/proceed/template.html
- There must be something new and interesting in your report, and you had better use some experiments to support your idea.
- Never try to copy or steal already-published ideas as your technical report. We are sure we have read much more than you can find.
59Other Information
- Slides can be found at
- http://research.microsoft.com/users/tyliu/