Link Analysis and Anti-Spam - PowerPoint PPT Presentation

Slides: 60
Provided by: wuzhuoha

1
Link Analysis and Anti-Spam
  • Tie-Yan Liu
  • Microsoft Research Asia

2
Outline
  • First Session
  • Overview of Link Analysis Technologies
  • PageRank and HITS
  • Second Session
  • More about Link Analysis Algorithms
  • Third Session
  • Spam and Anti-Spam
  • Homework

3
First Session
4
Typical Search Engine Architecture
5
Ranking for the Search Results
  • Today's search engines may return millions of
    pages for a single query
  • It is not feasible for the user to
    preview all these results
  • An appropriate ranking is therefore very helpful.
  • Ranking on relevance
  • Ranking on importance

6
Traditional IR Ranking
  • A ranking purely on relevance
  • Term frequency (tf)
  • Inverse Document Frequency (idf)
  • Okapi
  • Many other aspects that Dr. Shuming Shi will
    mention in the next course.

7
Limitations of Traditional IR
  • Text-based ranking function
  • www.harvard.edu can hardly be recognized as one
    of the most authoritative pages for the query
    "harvard", since many other web pages contain
    the term "harvard" more often.
  • The number of pages with the same relevance is
    still too large for the users to preview.
  • Pages are not sufficiently self-descriptive
  • Usually the term "search engine" doesn't appear
    on the web pages of search engines.

8
What's More for Web Search
  • In order to solve these problems
  • We must leverage other information on the Web
  • We must distinguish those pages with the same
    amount of relevance
  • Link Analysis
  • The web is not just a collection of pure-text
    documents
  • the hyperlinks are also very important!
  • A link from page A to page B may indicate
  • A is related to B, or
  • A is recommending, citing, voting for or
    endorsing B
  • Links affect the ranking of web pages and thus
    have commercial value.

9
Famous Link Analysis Methods
  • HITS
  • PageRank

10
HITS - Kleinberg's Algorithm
  • HITS: Hypertext Induced Topic Selection
  • For each vertex v in a subgraph of interest
  • a(v) - the authority of v
  • h(v) - the hubness of v
  • A site is very authoritative if it receives many
    citations. Citations from important sites weigh
    more than citations from less important sites
  • Hubness shows the importance of a site. A good
    hub is a site that links to many authoritative
    sites

11
Authority and Hubness
[Figure: node 1 with parents 2, 3, 4 and children 5, 6, 7]
$h(1) = a(5) + a(6) + a(7)$
$a(1) = h(2) + h(3) + h(4)$
12
Convergence of Authority and Hubness
  • Recursive dependency
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
  • pa[v]: parents of v; ch[v]: children of v
  • Using linear algebra, we can prove that a(v) and
    h(v) converge
13
HITS Example
Find a base subgraph
  • Start with a root set R = {1, 2, 3, 4}
  • {1, 2, 3, 4} - nodes relevant to
    the topic
  • Expand the root set R to include all the
    children and a fixed number of parents of nodes
    in R

→ A new set S (base subgraph)
14
HITS Example
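Hubs and authorities: two n-dimensional vectors a and h
  • HubsAuthorities(G)
  • $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
  • $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
  • $t \leftarrow 1$
  • repeat
  • for each v in V
  • do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
  • $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  • $a_t \leftarrow a_t / \|a_t\|$
  • $h_t \leftarrow h_t / \|h_t\|$
  • $t \leftarrow t + 1$
  • until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \epsilon$
  • return $(a_t, h_t)$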
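The iteration on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy graph, the function name `hits`, and the L2 normalization are choices made here, not taken from the slides.

```python
import math

def hits(children, eps=1e-10, max_iter=1000):
    """Iterate the authority/hub updates until both vectors stop changing."""
    nodes = sorted(set(children) | {w for ws in children.values() for w in ws})
    parents = {v: [] for v in nodes}
    for v, ws in children.items():
        for w in ws:
            parents[w].append(v)
    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}

    def normalize(x):
        norm = math.sqrt(sum(val * val for val in x.values())) or 1.0
        return {v: val / norm for v, val in x.items()}

    for _ in range(max_iter):
        # a(v) sums hub scores of parents; h(v) sums authority scores of children.
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        new_h = normalize({v: sum(a[w] for w in children.get(v, [])) for v in nodes})
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h

# Hypothetical toy graph: pages 1 and 2 act as hubs pointing at pages 3 and 4.
graph = {1: [3, 4], 2: [3, 4], 3: [], 4: []}
authority, hubness = hits(graph)
```

In this toy graph the pages that are only pointed to end up with all the authority weight, and the pages that only point out end up with all the hub weight.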
15
HITS Example Results
[Figure: authority and hubness weights for nodes 1 to 15]
16
Matrix Notation of HITS
  • The authority and hubness vectors calculated by
    the aforementioned algorithm are the principal
    right and left singular vectors, respectively, of
    the adjacency matrix A of the base subgraph
    (taking A_ij = 1 when page i links to page j).
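A small pure-Python check of this matrix view, offered as a sketch: the authority vector is computed as the principal eigenvector of A^T A and the hub vector as that of A A^T, using plain power iteration. The 3-page adjacency matrix and the helper names are made up for illustration.

```python
def mat_vec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(p, q):
    qt = transpose(q)
    return [[sum(a * b for a, b in zip(row, col)) for col in qt] for row in p]

def principal_eigvec(m, iters=200):
    """Power iteration with L2 normalization, started from the all-ones vector."""
    x = [1.0] * len(m)
    for _ in range(iters):
        y = mat_vec(m, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    return x

# Hypothetical adjacency matrix: A[i][j] = 1 when page i links to page j.
A = [[0, 1, 1],
     [0, 0, 1],
     [0, 0, 0]]
At = transpose(A)
authority = principal_eigvec(mat_mul(At, A))  # eigenvector of A^T A
hubness = principal_eigvec(mat_mul(A, At))    # eigenvector of A A^T
```

Page 2 (linked from both other pages) receives the largest authority weight, and page 0 (linking to both others) the largest hub weight.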

17
PageRank
  • Introduced by Page et al. (1998)
  • A page's rank is proportional to its parents'
    ranks, but inversely proportional to its parents'
    outdegrees

18
Matrix Notation
[Figure: adjacency matrix A]
19
Matrix Notation
  • Matrix Notation
  • $r = B r$
  • PageRank is given by the eigenvector of B
    associated with the eigenvalue 1.

[Figure: matrix B]
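As a sketch of the fixed point r = B r, the code below builds B from a small link graph, with B[i][j] = 1/outdegree(j) when page j links to page i, and applies B repeatedly. The graph and the function names are illustrative assumptions. The graph is chosen to be strongly connected and aperiodic so that plain iteration converges.

```python
def build_B(children, nodes):
    """B[i][j] = 1/outdegree(j) if j links to i, else 0 (each column sums to 1)."""
    index = {v: k for k, v in enumerate(nodes)}
    B = [[0.0] * len(nodes) for _ in nodes]
    for v in nodes:
        for w in children[v]:
            B[index[w]][index[v]] = 1.0 / len(children[v])
    return B

def power_iterate(B, iters=100):
    r = [1.0 / len(B)] * len(B)
    for _ in range(iters):
        r = [sum(bij * rj for bij, rj in zip(row, r)) for row in B]
    return r

# Hypothetical strongly connected, aperiodic 3-page graph.
children = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
nodes = ["a", "b", "c"]
B = build_B(children, nodes)
r = power_iterate(B)
```

Here page "b" receives only half of the rank of "a", while "c" collects rank from both "a" and "b".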
20
Matrix Notation
21
Markov Chain Notation
  • Random surfer model
  • Description of a random walk through the Web
    graph
  • Interpreted as a transition matrix; the rank of
    a page is the asymptotic probability that a
    surfer is currently browsing that page

$r_t = M r_{t-1}$, where M is the transition matrix of a
first-order Markov chain (stochastic)
Does it converge to some sensible solution (as
$t \to \infty$) regardless of the initial ranks?
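The random surfer can also be simulated directly. The sketch below follows links uniformly at random on a small hypothetical graph and records how often each page is visited. The seeds, step count, and graph are arbitrary choices for illustration. On a well-behaved graph like this one the visit frequencies should not depend much on the starting page.

```python
import random

def simulate_surfer(children, start, steps, seed=0):
    """Follow a uniformly random outlink at every step and count page visits."""
    rng = random.Random(seed)
    visits = {v: 0 for v in children}
    page = start
    for _ in range(steps):
        page = rng.choice(children[page])
        visits[page] += 1
    return {v: n / steps for v, n in visits.items()}

# Hypothetical strongly connected, aperiodic graph (every page has an outlink).
children = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
freq_from_a = simulate_surfer(children, "a", 100_000, seed=0)
freq_from_b = simulate_surfer(children, "b", 100_000, seed=1)
```

For this graph the stationary distribution is (0.4, 0.2, 0.4) for pages a, b, c, and both runs should land close to it.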
22
Problem
  • Rank Sink Problem
  • In general, many Web pages have no
    inlinks/outlinks
  • This results in dangling edges in the graph
  • E.g.
  • no parent → rank 0
  • $M^t$ converges to a matrix
  • whose last column is all zero
  • no children → no solution
  • $M^t$ converges to the zero matrix
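A tiny example makes the problem concrete. In the hypothetical chain below, "src" has no parent and "sink" has no children. Under the unmodified update, the page without parents drops to rank 0 and the total rank leaks away through the page without children.

```python
def step(children, r):
    """One unmodified update: every page splits its rank among its children."""
    new_r = {v: 0.0 for v in children}
    for v, ws in children.items():
        for w in ws:
            new_r[w] += r[v] / len(ws)
    return new_r

# Hypothetical graph: "src" has no parents, "sink" has no children.
children = {"src": ["mid"], "mid": ["sink"], "sink": []}
r = {v: 1.0 / 3 for v in children}
totals = []
for _ in range(4):
    r = step(children, r)
    totals.append(sum(r.values()))  # total rank left in the system
```

The total rank shrinks at every step and reaches zero after three updates, so no meaningful ranking survives.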

23
Modification
  • Surfer will restart browsing by picking a new Web
    page at random
  • $M = (B + E)$
  • E: escape matrix
  • M: stochastic matrix
  • Still a problem?
  • It is not guaranteed that M is primitive
  • If M is stochastic and primitive, PageRank
    converges to corresponding stationary
    distribution of M

24
Distribution of the Mixture Model
  • The probability distribution that results from
    combining the Markovian random walk distribution
    and the static rank source distribution
  • $r = \epsilon e + (1 - \epsilon) x$
    (e: static rank source distribution, x: random
    walk distribution)
  • $\epsilon$: probability of selecting a non-linked page

PageRank
Now the transition matrix $\epsilon H + (1 - \epsilon) M$ is
primitive and stochastic, so $r_t$ converges to the
dominant eigenvector
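A minimal sketch of the mixture model follows, assuming a uniform static rank source and assuming that pages without outlinks jump uniformly to any page (the slides do not specify how dangling pages are handled). The function name and the toy graph are illustrative.

```python
def pagerank(children, eps=0.15, iters=100):
    """r = eps * e + (1 - eps) * x, with a uniform rank source e."""
    nodes = list(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_r = {v: eps / n for v in nodes}      # random jump share
        for v in nodes:
            # Assumption: a page without outlinks jumps uniformly to any page.
            targets = children[v] or nodes
            for w in targets:
                new_r[w] += (1 - eps) * r[v] / len(targets)
        r = new_r
    return r

# The chain that broke the unmodified iteration: no parent for "src",
# no children for "sink".
children = {"src": ["mid"], "mid": ["sink"], "sink": []}
r = pagerank(children)
```

With the jump term in place the ranks stay a probability distribution, and every page keeps a nonzero score.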
25
PageRank vs. HITS - Algorithm
26
PageRank vs. HITS - Stability
  • Are link analysis algorithms based on
    eigenvectors stable, in the sense that results
    don't change significantly under small changes
    to the link graph?
  • General strategy for evaluating stability
  • 1. Start with the original adjacency matrix, A
  • 2. Perturb the matrix to get A': select k nodes
    in the graph to add or delete
  • 3. Compute the distance d(r(A), r(A')), for some
    distance measure d and objective function r that
    measures the quality of the results for A
  • 4. Compute the amount of perturbation p(A, A') for
    some distance function p that measures the amount
    of perturbation
  • 5. Evaluate the conditions, if any, under which
    small values of p generate large values of d
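The five steps can be tried out on a toy graph. The sketch below is one possible instantiation, not the evaluation used in the cited work: r is PageRank with a uniform jump, d is the L1 distance between rank vectors, and p is simply the number of added edges. The graph and all names are hypothetical.

```python
def pagerank(children, eps=0.15, iters=100):
    nodes = list(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_r = {v: eps / n for v in nodes}
        for v in nodes:
            targets = children[v] or nodes  # assumption: dangling pages jump uniformly
            for w in targets:
                new_r[w] += (1 - eps) * r[v] / len(targets)
        r = new_r
    return r

def perturb(children, added_edges):
    """Return a copy of the graph with the given (source, target) links added."""
    new_children = {v: list(ws) for v, ws in children.items()}
    for v, w in added_edges:
        if w not in new_children[v]:
            new_children[v].append(w)
    return new_children

def l1_distance(r1, r2):
    return sum(abs(r1[v] - r2[v]) for v in r1)

A = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["a"]}   # step 1
added = [("d", "b")]
A_prime = perturb(A, added)                                  # step 2
d = l1_distance(pagerank(A), pagerank(A_prime))              # step 3
p = len(added)                                               # step 4
unchanged = l1_distance(pagerank(A), pagerank(perturb(A, [])))
```

Here the modified page "d" has a low score of its own, so the change in the rank vector stays small.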

27
Stability of HITS
  • Ng et al. (2001)
  • A bound on the number of hyperlinks k that can be
    added to or deleted from one page without affecting
    the authority or hubness weights
  • Observations
  • Stability is determined by the eigengap
  • Eigengap: difference between the 1st and 2nd
    eigenvalues
  • $A^T A$ for authorities, $A A^T$ for hubs
  • If the eigengap is big, HITS will be insensitive to
    small perturbations; vice versa if it is small

$\delta$: eigengap $\lambda_1 - \lambda_2$; d: maximum outdegree of G
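For a small graph the eigengap can be estimated without a linear algebra library. The sketch below uses power iteration for the largest eigenvalue and then deflation for the second one. This works for symmetric positive semidefinite matrices such as A^T A. The matrix is a hypothetical A^T A for a 3-page graph, and the helper names are made up.

```python
def mat_vec(m, x):
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in m]

def top_eigenpair(m, iters=500):
    """Power iteration; returns (eigenvalue, unit eigenvector)."""
    x = [1.0] * len(m)
    for _ in range(iters):
        y = mat_vec(m, x)
        norm = sum(v * v for v in y) ** 0.5
        x = [v / norm for v in y]
    lam = sum(xi * yi for xi, yi in zip(x, mat_vec(m, x)))  # Rayleigh quotient
    return lam, x

def eigengap(sym):
    """lambda_1 - lambda_2 for a symmetric positive semidefinite matrix."""
    n = len(sym)
    lam1, v1 = top_eigenpair(sym)
    # Deflation: remove the top eigen-direction, then iterate again.
    deflated = [[sym[i][j] - lam1 * v1[i] * v1[j] for j in range(n)]
                for i in range(n)]
    lam2, _ = top_eigenpair(deflated)
    return lam1 - lam2

# Hypothetical A^T A for the graph 0 -> 1, 0 -> 2, 1 -> 2.
AtA = [[0.0, 0.0, 0.0],
       [0.0, 1.0, 1.0],
       [0.0, 1.0, 2.0]]
gap = eigengap(AtA)
```

For this matrix the two nonzero eigenvalues are (3 ± √5)/2, so the gap is √5.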
28
Stability of PageRank
  • Looser bound
  • Ng et al (2001)
  • Bianchini et al (2001)
  • Observations
  • The parameter ε of the mixture model has a
    stabilization role
  • If the original k pages to be modified do not have
    high overall PR scores, then the perturbed scores
    will not be far from the original

29
Second Session
30
Pre-PageRank
  • PageRank has achieved great success in industry,
    and many people regard it as a breakthrough in
    the research field as well.
  • Actually, the basic idea of PageRank had already
    appeared in several earlier works
  • Mark 1988
  • Bray 1996
  • Marchiori 1997

31
Mark 1988
  • To calculate the score S of a document at
    vertex v

$S(v) = s(v) + \frac{1}{|ch(v)|} \sum_{w \in ch(v)} S(w)$

v: a vertex in the hypertext graph G = (V, E)
S(v): the global score
s(v): the score if the document is isolated
ch(v): children of the document at vertex v
  • Limitation
  • - Requires G to be a directed acyclic graph (DAG)
  • - If v has a single link to w, S(v) > S(w)
  • - If v has a long path to w and s(v) < s(w),
    then S(v) > S(w)

Mark, D. M. (1988), "Network models in
geomorphology," Chapter 4 in Modeling in
Geomorphologic Systems, edited by M. G. Anderson,
John Wiley, pp. 73-97.
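Read as the local score plus the average global score of the children, this scoring rule is a short recursion on a DAG. The sketch below assumes that reading. The graph and the local scores are invented for illustration.

```python
from functools import lru_cache

# Hypothetical DAG and local scores s(v).
children = {"root": ("x", "y"), "x": ("leaf",), "y": (), "leaf": ()}
local_score = {"root": 1.0, "x": 2.0, "y": 4.0, "leaf": 6.0}

@lru_cache(maxsize=None)
def global_score(v):
    """S(v) = s(v) + average of S(w) over the children w of v."""
    kids = children[v]
    if not kids:
        return local_score[v]
    return local_score[v] + sum(global_score(w) for w in kids) / len(kids)

scores = {v: global_score(v) for v in children}
```

The example also shows the second limitation: "x" has a single link to "leaf" and therefore outscores it (8.0 versus 6.0), although its local score is lower.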
32
Bray 1996
  • The visibility of a site is measured by the
    number of other sites pointing to it
  • Authority?
  • The luminosity of a site is measured by the
    number of other sites to which it points
  • Hub?

33
Marchiori (1997)
  • Hyper information should complement textual
    information to obtain the overall information

$S(v) = s(v) + h(v)$
- S(v): overall information
- s(v): textual information
- h(v): hyper information

  • $h(v) = \sum_{w \in ch[v]} F^{r(v, w)} S(w)$

- F: a fading constant, $F \in (0, 1)$
- r(v, w): the rank of w after sorting the children of v
  by S(w)
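A sketch of this definition on an acyclic toy graph follows. Two points are assumptions made here rather than stated on the slide: children are sorted by S(w) in descending order, and ranks start at 1, so the best child is weighted by F and the next by F squared. The graph, the scores, and the function name are hypothetical.

```python
def overall_info(v, children, textual, fading=0.5):
    """S(v) = s(v) + h(v), evaluated recursively on an acyclic graph."""
    # Assumption: children are ranked by overall information, best first.
    child_scores = sorted(
        (overall_info(w, children, textual, fading) for w in children[v]),
        reverse=True,
    )
    hyper = sum(fading ** rank * s
                for rank, s in enumerate(child_scores, start=1))
    return textual[v] + hyper

# Hypothetical site: a home page linking to two leaf pages.
children = {"home": ["docs", "blog"], "docs": [], "blog": []}
textual = {"home": 1.0, "docs": 4.0, "blog": 2.0}
home_score = overall_info("home", children, textual)
```

With F = 0.5 the home page gains 0.5 × 4.0 from "docs" and 0.25 × 2.0 from "blog" on top of its own textual score.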

34
Post PageRank
  • Following the success of PageRank, many new
    algorithms were proposed.
  • Fast PageRank calculation (Haveliwala)
  • Topic-sensitive PageRank
  • Personalized PageRank
  • LinkFusion

35
Fast PageRank calculation (Haveliwala 1999)
  • Partition the destination vector into d blocks
    that each fit into main memory, and compute
    one block at a time.
  • This algorithm is quite similar in structure to
    the Block Nested-Loop Join algorithm in database
    systems, which also performs very well for data
    sets of moderate size but eventually loses out to
    more scalable approaches.
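The block structure can be illustrated with an in-memory sketch. It only mimics the access pattern: the destination vector is filled one block at a time, and the full edge list is scanned once per block, as in a block nested-loop join. A real implementation would stream the links from disk. The edge list, the uniform jump term, and all names are assumptions for illustration.

```python
def blocked_pagerank_step(edges, r, outdeg, n, d, eps=0.15):
    """One PageRank iteration that fills the destination vector block by block."""
    block_size = (n + d - 1) // d
    new_r = [0.0] * n
    for b in range(d):
        lo, hi = b * block_size, min((b + 1) * block_size, n)
        block = [eps / n] * (hi - lo)        # only this block is "in memory"
        for src, dst in edges:               # one pass over the link list
            if lo <= dst < hi:
                block[dst - lo] += (1 - eps) * r[src] / outdeg[src]
        new_r[lo:hi] = block
    return new_r

# Hypothetical graph on pages 0..3; every page has at least one outlink.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 0)]
outdeg = [2, 1, 1, 1]
r0 = [0.25] * 4
whole = blocked_pagerank_step(edges, r0, outdeg, 4, d=1)
blocks = blocked_pagerank_step(edges, r0, outdeg, 4, d=2)
```

Splitting into blocks changes only the order of the work, so both calls produce the same vector.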

36
Fast PageRank calculation (Haveliwala 2003)
  • Basic observation
  • The convergence rates of the PageRank values of
    individual pages during application of the Power
    Method are nonuniform. That is, many pages
    converge quickly, with a few pages taking much
    longer to converge. Furthermore, the pages that
    converge slowly are generally those pages with
    high PageRank.

37
Topic-Specific PageRank (Haveliwala, WWW02)
  • Topic-specific PageRanks
  • For each page, PageRank values are precomputed
    per topic; for each query, the values of the
    most relevant topics are used.
  • 16 topics
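A minimal sketch of the idea, assuming that each topic biases the random jump towards its own pages and that query-time scores are a weighted sum of the precomputed topic vectors. The two-topic graph, the topic weights, and the function names are hypothetical. The code also assumes every page has at least one outlink.

```python
def personalized_pagerank(children, teleport, eps=0.15, iters=100):
    """PageRank whose random jump follows the given teleport distribution."""
    nodes = list(children)
    r = dict(teleport)
    for _ in range(iters):
        new_r = {v: eps * teleport[v] for v in nodes}
        for v in nodes:
            for w in children[v]:
                new_r[w] += (1 - eps) * r[v] / len(children[v])
        r = new_r
    return r

children = {"sports1": ["sports2"], "sports2": ["sports1"],
            "news1": ["news2", "sports1"], "news2": ["news1"]}
topics = {"sports": ["sports1", "sports2"], "news": ["news1", "news2"]}

# Offline: one rank vector per topic, jumping only to that topic's pages.
topic_rank = {}
for topic, pages in topics.items():
    teleport = {v: (1.0 / len(pages) if v in pages else 0.0) for v in children}
    topic_rank[topic] = personalized_pagerank(children, teleport)

# Query time: combine the precomputed vectors using topic weights for the query.
def score(page, topic_weights):
    return sum(wt * topic_rank[t][page] for t, wt in topic_weights.items())

sports_query = {"sports": 0.9, "news": 0.1}
news_query = {"sports": 0.1, "news": 0.9}
```

The expensive iteration runs once per topic offline, and a query only needs the cheap weighted sum.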

38
Link Fusion (Zeng, WWW04)
  • In a more generalized scenario, suppose there are
    N data types. The importance attribute of one
    type of object can be reinforced by both inter
    and intra-type links as
  • Suppose w is the attribute vector of all the
    objects in the URM. Link Fusion can be
    represented as
  • $w_{new} = L_{urm}^T w_{old}$
  • This iterative calculation can be continued
  • $w_n = (L_{urm}^T)^n w_0$
  • The result w is the principal eigenvector of
    $L_{urm}^T$, which can be explained as the value of
    data objects regarding a specific attribute.

39
Limits of Link Analysis
  • Pay-for-place
  • Search engine bias: organizations pay search
    engines for page rank
  • Advertisements: organizations pay high-ranking
    pages for advertising space
  • The primary effect is increased visibility to
    end users; a secondary effect is increased
    respectability due to relevance to a high-ranking
    page

40
Limits of Link Analysis
  • Stability
  • Adding even a small number of nodes/edges to the
    graph has a significant impact
  • Topic drift
  • A top authority may be a hub of pages on a
    different topic resulting in increased rank of
    the authority page
  • Content evolution
  • Adding/removing links/content can affect the
    intuitive authority rank of a page requiring
    recalculation of page ranks

41
Third Session
42
What is Link Spam
  • Since link analysis plays an important role
    in search engines, it has large commercial value
  • Improving one's PageRank can directly increase
    one's clicks and thus earn more money.
  • Link spam is an attempt to unfairly gain a
    high ranking on a search engine for a web page
    without improving the user experience, by means of
    tricky modification / manipulation of the link
    graph.

43
Link Spamming Technologies
  • Adding outlinks
  • Replicate hub pages
  • Adding inlinks
  • Create a honey pot
  • Infiltrate a web directory
  • Post links on blog, wiki, etc
  • Participate in link exchange
  • Buy expired domains
  • Create own spam farm.

44
Case Study: Spam HITS
  • Hub score can be increased by adding outlinks to
    the target page
  • Authority score can be increased by creating
    hyperlinks from high-hub-score pages to the
    target page.

45
Case Study: Spam PageRank
  • Factors that influence PageRank
  • PR(t) = PRstatic(t) + PRin(t) - PRout(t) - PRsink(t)
  • Strategies
  • Own pages are part of the spam farm, maximizing
    PRstatic(t)
  • Accessible pages point to the spam farm,
    maximizing PRin(t)
  • Links pointing outside the spam farm are
    suppressed, minimizing PRout(t)
  • All pages within the farm have some outlinks,
    minimizing PRsink(t)
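The effect of these strategies can be seen on a toy graph. In the sketch below a target page first has no inlinks. Then a hypothetical farm of k own pages is added: every farm page links to the target, and the target links only back into the farm, so there are no outside links and no sinks. PageRank with a uniform jump is an assumed baseline, and all names are invented.

```python
def pagerank(children, eps=0.15, iters=100):
    nodes = list(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_r = {v: eps / n for v in nodes}
        for v in nodes:
            targets = children[v] or nodes  # assumption: dangling pages jump uniformly
            for w in targets:
                new_r[w] += (1 - eps) * r[v] / len(targets)
        r = new_r
    return r

def add_spam_farm(children, target, k):
    """Attach k farm pages that all link to the target and are linked back."""
    farm = [f"farm{i}" for i in range(k)]
    boosted = {v: list(ws) for v, ws in children.items()}
    boosted[target] = list(farm)      # links to the outside are suppressed
    for f in farm:
        boosted[f] = [target]         # no farm page is a sink
    return boosted

web = {"g1": ["g2"], "g2": ["g1"], "t": ["g1"]}
base_rank = pagerank(web)
boosted_rank = pagerank(add_spam_farm(web, "t", k=5))
```

Each farm page contributes its random-jump share to the target, and the target recycles its score through the farm instead of giving it to outside pages.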

46
Anti-Spam
  • Early approaches
  • BHITS, SALSA, DOM, revised HITS, BadRank
  • State-of-the-art
  • TrustRank (2004)
  • Revised PageRank (VLDB2004)
  • BadRank (WWW2005)
  • SpamRank (WWW2005, workshop)

47
TrustRank
  • Basic assumption
  • Good pages seldom point to spam pages, but spam
    pages may very likely point to good pages.
  • Use TrustRank to denote the goodness of a
    webpage, and use Trust Propagation to label all
    the web pages starting from a small human-labeled
    seed set.

48
TrustRank
  • Step 1 Initialization
  • How to select seeds
  • Inverse PageRank (Hub pages, since they have more
    influence)
  • High PageRank (Important pages are more important
    to search applications)
  • Step 2 Propagation

49
TrustRank
  • Step 3
  • Trust Dampening
  • Trust Splitting
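Trust propagation can be sketched as a PageRank-style iteration whose random jump is restricted to the trusted seed set. In this illustration a page splits its trust evenly among its outlinks and the factor alpha dampens it at every hop. Seed selection is not shown. The graph, the seed set, and the function name are assumptions, and every page is assumed to have at least one outlink.

```python
def trust_rank(children, seeds, alpha=0.85, iters=100):
    """Propagate trust from the seed set along outlinks."""
    nodes = list(children)
    seed_dist = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    trust = dict(seed_dist)
    for _ in range(iters):
        new_trust = {v: (1 - alpha) * seed_dist[v] for v in nodes}
        for v in nodes:
            share = alpha * trust[v] / len(children[v])  # dampen, then split
            for w in children[v]:
                new_trust[w] += share
        trust = new_trust
    return trust

# Hypothetical web: good pages never link to spam, but spam links to good pages.
children = {"g1": ["g2"], "g2": ["g1"],
            "s1": ["g1", "s2"], "s2": ["s1"]}
trust = trust_rank(children, seeds={"g1"})
```

Because no good page links to a spam page, the spam pages receive no trust at all in this example.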

50
BadRank
  • Motivation
  • Pages in the spam farm are densely connected, and
    many common pages exist in both the inlinks and
    outlinks of these pages.
  • Propagate the badness of pages in the seed set to
    detect the other spam pages on the Web.

51
BadRank
  • Step 1 Initialization
  • At least 3 common nodes (approximately the same,
    i.e. with the same domain name) in the inlink and
    outlink sets
  • Step 2 Expansion
  • ParentPenalty: if a page links to many bad pages
    (more than a threshold), it will also be
    labeled as bad.
  • Delete all the links between detected bad pages
    before PageRank calculation.
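The two steps and the final link removal can be sketched as follows. As a simplification, common nodes are matched by exact page identity rather than by domain name, and the ParentPenalty threshold of 2 is an arbitrary choice. The graph and the function names are hypothetical.

```python
def detect_bad_pages(children, min_common=3, parent_threshold=2):
    parents = {v: set() for v in children}
    for v, ws in children.items():
        for w in ws:
            parents[w].add(v)
    # Step 1: seed set = pages sharing at least min_common nodes between
    # their inlink set and their outlink set.
    bad = {v for v in children
           if len(parents[v] & set(children[v])) >= min_common}
    # Step 2: ParentPenalty, repeated until no new page gets labeled.
    changed = True
    while changed:
        changed = False
        for v in children:
            if v not in bad and len(bad & set(children[v])) >= parent_threshold:
                bad.add(v)
                changed = True
    return bad

def drop_bad_links(children, bad):
    """Delete links between detected bad pages before ranking."""
    return {v: [w for w in ws if not (v in bad and w in bad)]
            for v, ws in children.items()}

farm = ["f1", "f2", "f3", "f4"]
children = {f: [g for g in farm if g != f] for f in farm}  # fully linked farm
children.update({"shill": ["f1", "f2", "good"],
                 "good": ["news"], "news": ["good"]})
bad = detect_bad_pages(children)
cleaned = drop_bad_links(children, bad)
```

The farm pages are found in step 1, the page that links to two of them is added in step 2, and the ordinary pair "good" and "news" is left alone.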

52
Revised PageRank
  • Assumption
  • Pages in a spam farm are highly correlated with
    each other.
  • Approach
  • Increase the probability of jumping from nodes
    with large correlation coefficients.

53
Revised PageRank
  • Step 1: Collusion detection
  • Calculate PageRank values for different ε
  • Calculate the correlation coefficient between the
    curve of node x's PageRank and 1/ε, denoted by
    co-co(x).
  • Step 2: ε personalization
  • Use F(ε_default, co-co(x)) to personalize the
    original matrix U.
  • Recalculate PageRank.
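Step 1 can be sketched on a toy graph. The code computes PageRank for several values of ε and then takes the Pearson correlation between each page's PageRank curve and 1/ε. The personalization function F of step 2 is not specified on the slide and is not implemented here. The graph (a colluding pair "c1" and "c2" that only link to each other) and the chosen ε values are assumptions.

```python
def pagerank(children, eps, iters=300):
    nodes = list(children)
    n = len(nodes)
    r = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new_r = {v: eps / n for v in nodes}
        for v in nodes:
            for w in children[v]:
                new_r[w] += (1 - eps) * r[v] / len(children[v])
        r = new_r
    return r

def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical graph: "c1" and "c2" collude; "a" and "b" are ordinary pages.
children = {"a": ["b", "c1"], "b": ["a"], "c1": ["c2"], "c2": ["c1"]}
eps_values = [0.1, 0.3, 0.5, 0.7, 0.9]
ranks = [pagerank(children, eps) for eps in eps_values]
inverse_eps = [1.0 / eps for eps in eps_values]
co_co = {x: pearson([r[x] for r in ranks], inverse_eps) for x in children}
```

The colluding pages gain rank as ε shrinks, so their curves correlate positively with 1/ε, while the ordinary pages in this example correlate negatively.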

54
SpamRank
  • Key assumption
  • Supporters of an honest page should not be overly
    dependent on one another, i.e. they should be
    spread across sources of different quality.
  • Due to the self-similarity, the honest supporter
    set should have a power-law distribution of
    PageRank.
  • Spammers have a limited budget, so they do not
    replicate the unimportant structures.

55
Summary
  • Current work on anti-spam is very limited.
  • Promising research directions
  • Use more statistics and the properties of the
    transition probability matrix to detect spam
  • Design a new spam-free ranking function

56
Homework
57
Technical Report Writing
  1. HITS and PageRank are both based on simple linear
    algebra. Can you design some other link analysis
    algorithm based on advanced linear algebra or
    matrix factorization?
  2. The performance / sensitivity of PageRank with
    respect to the smoothing factor ε.
  3. How to speed up the calculation of PageRank using
    matrix factorization, or some specific
    characteristics of the Markov chain?
  4. PageRank is the eigenvector of a 2-D matrix. Can
    LinkFusion then be the eigenvector of a 3-D
    tensor?
  5. Stability analysis for other link analysis
    algorithms.
  6. A survey on the state-of-the-art spam
    technologies.
  7. How to design a search engine that is robust to
    spam?
  8. Other novel research topics related to link
    analysis.

58
Requirements
  • Send the report to Tie-Yan.Liu_at_microsoft.com
    before Dec 4 (within 1 month).
  • The length should not be less than 8 pages, using
    the template at
    http://www.acm.org/sigs/pubs/proceed/template.html
  • There must be something new and interesting in
    your report, and you had better use some
    experiments to support your idea.
  • Never try to copy or steal already-published
    ideas for your technical report. We are sure we
    have read much more than you can find.

59
Other Information
  • Slides can be found at
  • http://research.microsoft.com/users/tyliu/