Introduction to Link Analysis - PowerPoint PPT Presentation

1
Introduction to Link Analysis
  • September 11, 2007
  • Analysis of Social Media Seminar
  • Natalie Glance

2
Content adapted from
  • CS345 Data Mining Class Slides
  • Anand Rajaraman, Jeffrey D. Ullman
  • IMA Tutorial on Measuring & Modeling the Web
  • Andrew Tomkins

3
Problem formulation (1998)
  • Suppose we are given a collection of documents on
    some broad topic
  • e.g., universities, evolution, Iraq
  • perhaps obtained through a text search
  • Can we organize these documents in some manner?
  • Link analysis techniques use links, not text

4
Link Analysis Algorithms
  • Page Rank
  • Hubs and Authorities
  • Both proposed at about the same time (1998)
  • Topic-Specific Page Rank
  • Spam Detection Algorithms
  • Mining for Communities
  • Further Reading: Link Analysis for Web 2.0

5
PageRank
  • The PageRank Citation Ranking: Bringing Order to
    the Web, L. Page, S. Brin, R. Motwani, T.
    Winograd, 1999.
  • The Anatomy of a Large-Scale Hypertextual Web
    Search Engine, S. Brin, L. Page (1998).

6
Ranking web pages
  • Web pages are not equally important
  • www.silly-billy.com vs. www.stanford.edu
  • Inlinks as votes
  • www.stanford.edu has 23,400 inlinks
  • www.silly-billy.com has 1 inlink
  • Are all inlinks equal?
  • Recursive question!

7
Simple recursive formulation
  • Each link's vote is proportional to the
    importance of its source page
  • If page P with importance x has n outlinks, each
    link gets x/n votes

8
Simple flow model
  • The web long ago

[Figure: web graph with three pages, Yahoo (y), Amazon (a), and M'soft (m)]
Flow equations:
  • $y = y/2 + a/2$
  • $a = y/2 + m$
  • $m = a/2$
9
Solving the flow equations
  • 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo scale factor
  • Additional constraint forces uniqueness
  • $y + a + m = 1$
  • $y = 2/5$, $a = 2/5$, $m = 1/5$
  • Gaussian elimination method works for small
    examples, but we need a better method for large
    graphs
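
The small example can be checked by Gaussian elimination. The following is a minimal sketch, assuming the three-page flow equations $y = y/2 + a/2$, $a = y/2 + m$, $m = a/2$ with the redundant third equation replaced by the constraint $y + a + m = 1$. The helper name `solve` is illustrative and not taken from the slides.

```python
from fractions import Fraction as F

def solve(aug):
    """Gauss-Jordan elimination on an augmented matrix [A | b] of Fractions."""
    n = len(aug)
    rows = [row[:] for row in aug]
    for col in range(n):
        # Find a row with a nonzero pivot and swap it into place.
        pivot = next(r for r in range(col, n) if rows[r][col] != 0)
        rows[col], rows[pivot] = rows[pivot], rows[col]
        # Normalize the pivot row so the pivot becomes 1.
        p = rows[col][col]
        rows[col] = [v / p for v in rows[col]]
        # Eliminate this column from every other row.
        for r in range(n):
            if r != col and rows[r][col] != 0:
                f = rows[r][col]
                rows[r] = [v - f * w for v, w in zip(rows[r], rows[col])]
    return [row[-1] for row in rows]

half = F(1, 2)
system = [
    [half, -half, F(0), F(0)],   # y - y/2 - a/2 = 0
    [-half, F(1), F(-1), F(0)],  # a - y/2 - m = 0
    [F(1), F(1), F(1), F(1)],    # y + a + m = 1 (replaces the redundant equation)
]
y, a, m = solve(system)
print(y, a, m)  # prints: 2/5 2/5 1/5
```

Elimination costs on the order of $N^3$ operations for N pages, which is why it is only practical for small examples like this one.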

10
Matrix formulation
  • Matrix M has one row and one column for each web
    page
  • Suppose page j has n outlinks
  • If j links to i, then $M_{ij} = 1/n$
  • Else $M_{ij} = 0$
  • M is a column stochastic matrix
  • Columns sum to 1
  • Suppose r is a vector with one entry per web page
  • $r_i$ is the importance score of page i
  • Call it the rank vector
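
As an illustration of the definition above, here is a minimal sketch that builds M from outlink lists and checks that each column sums to 1. The function name `build_matrix` and the dictionary layout are assumptions made for this example. The link lists are those of the three-page Yahoo/Amazon/M'soft graph.

```python
from fractions import Fraction as F

def build_matrix(out_links, pages):
    """Return M with M[i][j] = 1/n if page j has n outlinks and links to page i."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[F(0)] * n for _ in range(n)]
    for j, page in enumerate(pages):
        targets = out_links[page]
        for target in targets:
            M[index[target]][j] = F(1, len(targets))
    return M

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = build_matrix(out_links, pages)

# Column-stochastic check: every column should sum to 1.
column_sums = [sum(M[i][j] for i in range(len(pages))) for j in range(len(pages))]
for row in M:
    print([str(x) for x in row])
print(column_sums)
```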

11
Example
Suppose page j links to 3 pages, including i. Then $M_{ij} = 1/3$, so page j contributes $r_j/3$ to $r_i$.
12
Eigenvector formulation
  • The flow equations can be written
  • $r = Mr$
  • So the rank vector is an eigenvector of the
    stochastic web matrix
  • In fact, its first or principal eigenvector, with
    corresponding eigenvalue 1

13
Example
Web matrix M (rows and columns ordered y, a, m):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      1
  m      0     1/2     0
Flow equations $r = Mr$:
  • $y = y/2 + a/2$
  • $a = y/2 + m$
  • $m = a/2$
14
Power Iteration method
  • Simple iterative scheme (aka relaxation)
  • Suppose there are N web pages
  • Initialize $r^{(0)} = [1/N, \ldots, 1/N]^T$
  • Iterate $r^{(k+1)} = M r^{(k)}$
  • Stop when $\|r^{(k+1)} - r^{(k)}\|_1 < \epsilon$
  • $\|x\|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
  • Can use any other vector norm e.g., Euclidean

15
Power Iteration Example
Web matrix M (rows and columns ordered y, a, m):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      1
  m      0     1/2     0
Iterates $r^{(k)}$:
  k        y      a      m
  0       1/3    1/3    1/3
  1       1/3    1/2    1/6
  2       5/12   1/3    1/4
  3       3/8    11/24  1/6
  ...
  limit   2/5    2/5    1/5
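
The iterates in the example above can be reproduced with a short power-iteration sketch. It assumes the matrix M of the three-page example, the uniform start vector, and the L1 stopping rule. The names `mat_vec` and `power_iteration` and the tolerance value are choices made for this illustration only.

```python
from fractions import Fraction as F

# Column-stochastic matrix of the example; rows and columns are ordered y, a, m.
M = [
    [F(1, 2), F(1, 2), F(0)],
    [F(1, 2), F(0),    F(1)],
    [F(0),    F(1, 2), F(0)],
]

def mat_vec(M, r):
    """Return the matrix-vector product Mr."""
    return [sum(m * x for m, x in zip(row, r)) for row in M]

def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- Mr from the uniform vector until the L1 change drops below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [float(x) for x in mat_vec(M, r)]
        if sum(abs(p - q) for p, q in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Exact first three iterates, for comparison with the example.
r = [F(1, 3)] * 3
exact = []
for _ in range(3):
    r = mat_vec(M, r)
    exact.append(r)

ranks = power_iteration(M)
print([str(x) for x in exact[-1]])   # third iterate as exact fractions
print([round(x, 4) for x in ranks])  # approaches 2/5, 2/5, 1/5
```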
16
Random Walk Interpretation
  • Imagine a random web surfer
  • At any time t, surfer is on some page P
  • At time $t+1$, the surfer follows an outlink from P
    uniformly at random
  • Ends up on some page Q linked from P
  • Process repeats indefinitely
  • Let p(t) be a vector whose ith component is the
    probability that the surfer is at page i at time
    t
  • p(t) is a probability distribution on pages

17
The stationary distribution
  • Where is the surfer at time $t+1$?
  • Follows a link uniformly at random
  • $p(t+1) = Mp(t)$
  • Suppose the random walk reaches a state such that
    $p(t+1) = Mp(t) = p(t)$
  • Then p(t) is called a stationary distribution for
    the random walk
  • Our rank vector r satisfies $r = Mr$
  • So it is a stationary distribution for the random
    surfer

18
Existence and Uniqueness
  • A central result from the theory of random walks
    (aka Markov processes)
  • For graphs that satisfy certain conditions, the
    stationary distribution is unique and eventually
    will be reached no matter what the initial
    probability distribution at time $t = 0$.

19
Spider traps
  • A group of pages is a spider trap if there are no
    links from within the group to outside the group
  • Random surfer gets trapped
  • Spider traps violate the conditions needed for
    the random walk theorem

20
Microsoft becomes a spider trap
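[Figure: web graph of Yahoo (y), Amazon (a), and M'soft (m), where M'soft links only to itself]
Web matrix M (rows and columns ordered y, a, m):
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      0
  m      0     1/2     1
Iterates (y, a, m), starting from (1, 1, 1):
  k        y      a      m
  0        1      1      1
  1        1     1/2    3/2
  2       3/4    1/2    7/4
  3       5/8    3/8     2
  ...
  limit    0      0      3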
21
Random teleports
  • The PageRank solution for spider traps
  • At each time step, the random surfer has two
    options
  • With probability $\beta$, follow a link at random
  • With probability $1-\beta$, jump to some page uniformly
    at random
  • Common values for $\beta$ are in the range 0.8 to 0.9
  • Surfer will teleport out of spider trap within a
    few time steps

22
Matrix formulation
  • Suppose there are N pages
  • Consider a page j, with set of outlinks O(j)
  • We have $M_{ij} = 1/|O(j)|$ when j links to i and
    $M_{ij} = 0$ otherwise
  • The random teleport is equivalent to
  • adding a teleport link from j to every other page
    with probability $(1-\beta)/N$
  • reducing the probability of following each
    outlink from $1/|O(j)|$ to $\beta/|O(j)|$
  • Equivalent: tax each page a fraction $(1-\beta)$ of its
    score and redistribute evenly

23
PageRank
  • Construct the $N \times N$ matrix A as follows
  • $A_{ij} = \beta M_{ij} + (1-\beta)/N$
  • Verify that A is a stochastic matrix
  • The page rank vector r is the principal
    eigenvector of this matrix
  • satisfying $r = Ar$
  • Equivalently, r is the stationary distribution of
    the random walk with teleports
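
The following minimal sketch applies power iteration to $A = \beta M + (1-\beta)/N$ without forming A explicitly, assuming $\beta = 0.8$ and the spider-trap graph in which M'soft links only to itself. The function name `pagerank` and the tolerance are illustrative choices, not part of the slides.

```python
def pagerank(M, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration on A = beta*M + (1-beta)/N, applied without building A."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        total = sum(r)
        # (A r)_i = beta * (M r)_i + (1 - beta) * sum(r) / N
        r_next = [
            beta * sum(M[i][j] * r[j] for j in range(n)) + (1 - beta) * total / n
            for i in range(n)
        ]
        if sum(abs(p - q) for p, q in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Spider-trap example (rows and columns ordered y, a, m).
M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.5, 1.0],
]
r = pagerank(M)
# Scaled to sum to 3, the values approach 7/11, 5/11, 21/11.
print([round(3 * x, 3) for x in r])
```

The teleport term keeps the trap from absorbing all of the score: M'soft still ranks highest, but Yahoo and Amazon retain nonzero rank.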

24
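Previous example with $\beta = 0.8$
[Figure: web graph of Yahoo (y), Amazon (a), and M'soft (m), where M'soft links only to itself]
$A = 0.8 M + 0.2 T$, where every entry of T is 1/3 and M is:
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      0
  m      0     1/2     1
Resulting matrix A:
         y      a      m
  y     7/15   7/15   1/15
  a     7/15   1/15   1/15
  m     1/15   7/15   13/15
Iterates (y, a, m), starting from (1, 1, 1):
  k        y       a       m
  0        1       1       1
  1       1.00    0.60    1.40
  2       0.84    0.60    1.56
  3       0.776   0.536   1.688
  ...
  limit   7/11    5/11    21/11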
25
Dead ends
  • Pages with no outlinks are dead ends for the
    random surfer
  • Nowhere to go on next step

26
Microsoft becomes a dead end
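[Figure: web graph of Yahoo (y), Amazon (a), and M'soft (m), where M'soft has no outlinks]
$A = 0.8 M + 0.2 T$, where every entry of T is 1/3 and M is:
         y      a      m
  y     1/2    1/2     0
  a     1/2     0      0
  m      0     1/2     0
Resulting matrix A:
         y      a      m
  y     7/15   7/15   1/15
  a     7/15   1/15   1/15
  m     1/15   7/15   1/15
Iterates (y, a, m), starting from (1, 1, 1):
  k        y       a       m
  0        1       1       1
  1        1      0.6     0.6
  2       0.787   0.547   0.387
  3       0.648   0.430   0.333
  ...
  limit    0       0       0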
27
Dealing with dead-ends
  • Teleport
  • Follow random teleport links with probability 1.0
    from dead-ends
  • Adjust matrix accordingly
  • Prune and propagate
  • Preprocess the graph to eliminate dead-ends
  • Might require multiple passes
  • Compute page rank on reduced graph
  • Approximate values for deadends by propagating
    values from reduced graph
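
One possible reading of the "Teleport" option is sketched below: the score held by dead ends is redistributed uniformly at every step, which is the same as giving each dead end a teleport probability of 1.0. The function name, the dictionary-based graph format, and $\beta = 0.8$ are assumptions for this illustration. The prune-and-propagate alternative is not shown.

```python
def pagerank_with_dead_ends(out_links, beta=0.8, eps=1e-12, max_iter=1000):
    """PageRank where dead ends teleport with probability 1.0.

    out_links maps each page to the list of pages it links to.
    """
    pages = sorted(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Score sitting on dead ends is spread evenly over all pages.
        dead_mass = sum(r[p] for p in pages if not out_links[p])
        base = (1 - beta) / n + beta * dead_mass / n
        r_next = {p: base for p in pages}
        for p in pages:
            for q in out_links[p]:
                r_next[q] += beta * r[p] / len(out_links[p])
        if sum(abs(r_next[p] - r[p]) for p in pages) < eps:
            return r_next
        r = r_next
    return r

# Dead-end example: M'soft (m) has no outlinks.
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
r = pagerank_with_dead_ends(graph)
print({p: round(v, 3) for p, v in r.items()})
print(round(sum(r.values()), 6))  # total score no longer leaks away
```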

28
Computing PageRank
  • Key step is matrix-vector multiply
  • $r^{new} = A r^{old}$
  • Easy if we have enough main memory to hold A,
    $r^{old}$, $r^{new}$
  • Say N = 1 billion pages
  • We need 4 bytes for each entry (say)
  • 2 billion entries for vectors, approx 8GB
  • Matrix A has $N^2$ entries
  • $10^{18}$ is a large number!
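
The arithmetic behind these estimates can be spelled out as follows, assuming 4 bytes per entry and a dense matrix as stated above. The variable names are illustrative.

```python
N = 10**9                # one billion pages
BYTES_PER_ENTRY = 4      # as assumed above

vector_bytes = 2 * N * BYTES_PER_ENTRY      # old and new rank vectors together
matrix_entries = N ** 2                     # dense N x N matrix A
matrix_bytes = matrix_entries * BYTES_PER_ENTRY

print(vector_bytes / 10**9, "GB for the two vectors")
print(matrix_entries, "matrix entries")
print(matrix_bytes / 10**18, "exabytes for a dense matrix")
```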

29
Comparison of Query for University
Google vs. AltaVista, circa 1998
30
Problems with PageRank
  • Measures generic popularity of a page
  • Biased against topic-specific authorities
  • Ambiguous queries, e.g., Michael Jordan
  • Uses a single measure of importance
  • Other models, e.g., hubs and authorities
  • Susceptible to link spam
  • Artificial link topographies created in order to
    boost page rank

31
Topic-Specific Page Rank
  • Taher H. Haveliwala. Topic-Sensitive PageRank.
    11th International World Wide Web Conference,
    2002
  • Instead of generic popularity, can we measure
    popularity within a topic?
  • E.g., computer science, health
  • Bias the random walk
  • When the random walker teleports, he picks a page
    from a set S of web pages
  • S contains only pages that are relevant to the
    topic
  • E.g., Open Directory (DMOZ) pages for a given
    topic (www.dmoz.org)
  • For each teleport set S, we get a different rank
    vector $r_S$

32
Matrix formulation
  • $A_{ij} = \beta M_{ij} + (1-\beta)/|S|$ if $i \in S$
  • $A_{ij} = \beta M_{ij}$ otherwise
  • Show that A is stochastic
  • We have weighted all pages in the teleport set S
    equally
  • Could also assign different weights to them
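
A minimal sketch of the biased walk, assuming $\beta = 0.8$, equal weights within the teleport set, and the three-page example graph (rows and columns ordered y, a, m). The function name `biased_pagerank` and the choice of teleport set {m} are assumptions made for illustration.

```python
def biased_pagerank(M, teleport_set, beta=0.8, eps=1e-12, max_iter=1000):
    """Power iteration where teleports land only on pages in teleport_set."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        total = sum(r)
        r_next = []
        for i in range(n):
            score = beta * sum(M[i][j] * r[j] for j in range(n))
            if i in teleport_set:
                # Only pages in S receive the (1 - beta) teleport share.
                score += (1 - beta) * total / len(teleport_set)
            r_next.append(score)
        if sum(abs(p - q) for p, q in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

M = [
    [0.5, 0.5, 0.0],
    [0.5, 0.0, 1.0],
    [0.0, 0.5, 0.0],
]
generic = biased_pagerank(M, teleport_set={0, 1, 2})  # ordinary PageRank
topic = biased_pagerank(M, teleport_set={2})          # teleport only to m
print([round(x, 4) for x in generic])
print([round(x, 4) for x in topic])
```

With the teleport set restricted to m, the score of m rises relative to generic PageRank, while the vector still sums to 1.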

33
How well does TSPR work?
  • Experimental results [Haveliwala 2000]
  • Picked 16 topics
  • Teleport sets determined using DMOZ
  • E.g., arts, business, sports, ...
  • Blind study using volunteers
  • 35 test queries
  • Results ranked using Page Rank and TSPR of most
    closely related topic
  • E.g., bicycling using Sports ranking
  • In most cases volunteers preferred TSPR ranking

34
Which topic ranking to use?
  • User can pick from a menu
  • Use Bayesian classification schemes to classify
    query into a topic
  • Can use the context of the query
  • E.g., query is launched from a web page talking
    about a known topic
  • History of queries, e.g., basketball followed by
    jordan
  • User context, e.g., user's My Yahoo settings,
    bookmarks, ...

35
Scaling with topics and users
  • Suppose we wanted to cover 1000s of topics
  • Need to compute 1000s of different rank vectors
  • Need to store and retrieve them efficiently at
    query time
  • For good performance vectors must fit in memory
  • Even harder when we consider personalization
  • Each user has their own teleport vector
  • One page rank vector per user!

36
Web spam
  • Spamming: any deliberate action solely in order
    to boost a web page's position in search engine
    results, incommensurate with the page's real value
  • Spam: web pages that are the result of spamming
  • This is a very broad definition
  • SEO industry might disagree!
  • SEO = search engine optimization
  • Approximately 10-15% of web pages are spam

37
Boosting techniques
  • Term spamming
  • Manipulating the text of web pages in order to
    appear relevant to queries
  • Link spamming
  • Creating link structures that boost page rank or
    hubs and authorities scores

38
Link Farms
  • Spammer's goal
  • Maximize the page rank of target page t
  • Technique
  • Get as many links from accessible pages as
    possible to target page t
  • Construct link farm to get PageRank multiplier
    effect

39
Link Farms
One of the most common and effective
organizations for a link farm
40
TrustRank idea
  • Combating Web Spam with TrustRank, Z. Gyongyi, H.
    Garcia-Molina, J. Pedersen, VLDB, 2004.
  • Basic principle: approximate isolation
  • It is rare for a good page to point to a bad
    (spam) page
  • Sample a set of seed pages from the web
  • Have an oracle (human) identify the good pages
    and the spam pages in the seed set
  • Expensive task, so must make seed set as small as
    possible

41
Trust Propagation
  • Call the subset of seed pages that are identified
    as good the trusted pages
  • Set trust of each trusted page to 1
  • Propagate trust through links
  • Each page gets a trust value between 0 and 1
  • Use a threshold value and mark all pages below
    the trust threshold as spam

42
Example
[Figure: example web graph of seven pages, numbered 1 to 7, each labeled good or bad]
43
Rules for trust propagation
  • Trust attenuation
  • The degree of trust conferred by a trusted page
    decreases with distance
  • Trust splitting
  • The larger the number of outlinks from a page,
    the less scrutiny the page author gives each
    outlink
  • Trust is split across outlinks

44
Simple model
  • Suppose trust of page p is t(p)
  • Set of outlinks O(p)
  • For each q in O(p), p confers the trust
  • $\beta t(p)/|O(p)|$ for $0 < \beta < 1$
  • Trust is additive
  • Trust of p is the sum of the trust conferred on p
    by all its inlinked pages
  • Note similarity to Topic-Specific Page Rank
  • Within a scaling factor, trust rank = biased page
    rank with trusted pages as teleport set
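
The simple model can be sketched as a biased PageRank whose teleport set is the trusted seed pages. Everything specific in the block below is hypothetical: the seven-page graph, the choice of pages 1 and 2 as trusted seeds, $\beta = 0.85$, the fixed iteration count, and the threshold of 0.1 were picked only to make the example concrete.

```python
def trust_rank(out_links, trusted, beta=0.85, iterations=100):
    """Propagate trust from the trusted seed pages along outlinks."""
    pages = list(out_links)
    seed = {p: (1.0 / len(trusted) if p in trusted else 0.0) for p in pages}
    t = dict(seed)
    for _ in range(iterations):
        # Teleport share goes only to trusted pages.
        t_next = {p: (1 - beta) * seed[p] for p in pages}
        for p in pages:
            for q in out_links[p]:
                # Attenuation (beta) and splitting (1/|O(p)|): p confers
                # beta * t(p) / |O(p)| on each page q it links to.
                t_next[q] += beta * t[p] / len(out_links[p])
        t = t_next
    return t

# Hypothetical graph: pages 1-4 are good, pages 5-7 are spam.
web = {
    1: [2, 3],
    2: [1, 3, 4],
    3: [1, 4],
    4: [1, 2, 3, 5],   # a single good-to-bad link
    5: [6, 7],
    6: [5, 1],
    7: [5, 1],
}
trust = trust_rank(web, trusted={1, 2})
threshold = 0.1  # hypothetical cut-off
spam = sorted(p for p, v in trust.items() if v < threshold)
print({p: round(v, 3) for p, v in trust.items()})
print(spam)
```

In this toy graph the only path from the trusted pages into the spam cluster passes through one of four outlinks on page 4, so splitting and attenuation leave pages 5 to 7 below the threshold.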

45
Picking the seed set
  • Two conflicting considerations
  • Human has to inspect each seed page, so seed set
    must be as small as possible
  • Must ensure every good page gets adequate trust
    rank, so we need to make all good pages reachable
    from the seed set by short paths

46
Approaches to picking seed set
  • Suppose we want to pick a seed set of k pages
  • PageRank
  • Pick the top k pages by PageRank
  • Assume high PageRank pages are close to other
    highly ranked pages
  • We care more about high PageRank good pages

47
Inverse page rank
  • Pick the pages with the maximum number of
    outlinks
  • Can make it recursive
  • Pick pages that link to pages with many outlinks
  • Formalize as inverse PageRank
  • Construct graph G' by reversing each edge in web
    graph G
  • PageRank in G' is inverse page rank in G
  • Pick top k pages by inverse PageRank
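
A minimal sketch of seed selection by inverse PageRank: reverse every edge, run PageRank on the reversed graph, and keep the top k pages. The helper names, the four-page graph, $\beta = 0.85$, and the uniform redistribution of dead-end score are assumptions for this example.

```python
def reverse_graph(out_links):
    """Return the graph with every edge p -> q replaced by q -> p."""
    reversed_links = {p: [] for p in out_links}
    for p, targets in out_links.items():
        for q in targets:
            reversed_links[q].append(p)
    return reversed_links

def pagerank(out_links, beta=0.85, iterations=100):
    """PageRank with uniform teleport; dead-end score is spread uniformly."""
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        dead_mass = sum(r[p] for p in pages if not out_links[p])
        base = (1 - beta) / n + beta * dead_mass / n
        r_next = {p: base for p in pages}
        for p in pages:
            for q in out_links[p]:
                r_next[q] += beta * r[p] / len(out_links[p])
        r = r_next
    return r

def pick_seeds(out_links, k):
    """Pick the k pages with the highest inverse PageRank."""
    inverse_rank = pagerank(reverse_graph(out_links))
    return sorted(inverse_rank, key=inverse_rank.get, reverse=True)[:k]

# Hypothetical graph: "hub" links to every other page.
web = {"hub": ["a", "b", "c"], "a": ["b"], "b": ["c"], "c": []}
seeds = pick_seeds(web, k=1)
print(seeds)  # prints: ['hub']
```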

48
Hubs and Authorities
  • J. Kleinberg. Authoritative sources in a
    hyperlinked environment. Proc. 9th ACM-SIAM
    Symposium on Discrete Algorithms, 1998. Extended
    version in Journal of the ACM 46(1999).

49
HITS Model
  • Interesting documents fall into two classes
  • Authorities are pages containing useful
    information
  • course home pages
  • home pages of auto manufacturers
  • Hubs are pages that link to authorities
  • course bulletin
  • list of US auto manufacturers

50
Idealized view
Hubs
Authorities
51
Mutually recursive definition
  • A good hub links to many good authorities
  • A good authority is linked from many good hubs
  • Model using two scores for each node
  • Hub score and Authority score
  • Represented as vectors h and a

52
Transition Matrix A
  • HITS uses a matrix $A[i, j] = 1$ if page i links to
    page j, 0 if not
  • $A^T$, the transpose of A, is similar to the
    PageRank matrix M, but $A^T$ has 1s where M has
    fractions

53
Example
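[Figure: web graph of Yahoo (y), Amazon (a), and M'soft (m)]
Matrix A (entry in row i, column j is 1 if page i links to page j):
         y    a    m
  y      1    1    1
  a      1    0    1
  m      0    1    0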
54
Hub and Authority Equations
  • The hub score of page P is proportional to the
    sum of the authority scores of the pages it links
    to
  • $h = \lambda A a$
  • Constant $\lambda$ is a scale factor
  • The authority score of page P is proportional to
    the sum of the hub scores of the pages it is
    linked from
  • $a = \mu A^T h$
  • Constant $\mu$ is a scale factor

55
Iterative algorithm
  • Initialize h, a to all 1s
  • $h = Aa$
  • Scale h so that its max entry is 1.0
  • $a = A^T h$
  • Scale a so that its max entry is 1.0
  • Continue until h, a converge
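
The steps above can be sketched directly. This version assumes the 3-page matrix A of the Yahoo/Amazon/M'soft example and a fixed number of iterations instead of an explicit convergence test. The function names are illustrative.

```python
def scale(v):
    """Scale a vector so that its largest entry is 1.0."""
    top = max(v)
    return [x / top for x in v]

def hits(A, iterations=50):
    """Alternate h = A a and a = A^T h, rescaling after each step."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        h = scale([sum(A[i][j] * a[j] for j in range(n)) for i in range(n)])
        a = scale([sum(A[i][j] * h[i] for i in range(n)) for j in range(n)])
    return h, a

A = [
    [1, 1, 1],  # Yahoo links to Yahoo, Amazon, M'soft
    [1, 0, 1],  # Amazon links to Yahoo, M'soft
    [0, 1, 0],  # M'soft links to Amazon
]
h, a = hits(A)
print([round(x, 3) for x in h])  # prints: [1.0, 0.732, 0.268]
print([round(x, 3) for x in a])  # prints: [1.0, 0.732, 1.0]
```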

56
Example
Matrices (rows and columns ordered yahoo, amazon, msoft):
  A:      1  1  1        A^T:    1  1  0
          1  0  1                1  0  1
          0  1  0                1  1  0
Authority scores by iteration:
  a(yahoo)     1    1    1      1      ...   1
  a(amazon)    1    1    4/5    0.75   ...   0.732
  a(msoft)     1    1    1      1      ...   1
Hub scores by iteration:
  h(yahoo)     1    1      1       1       ...   1.000
  h(amazon)    1    2/3    0.71    0.73    ...   0.732
  h(msoft)     1    1/3    0.29    0.27    ...   0.268
57
Existence and Uniqueness
  • $h = \lambda A a$
  • $a = \mu A^T h$
  • $h = \lambda \mu A A^T h$
  • $a = \lambda \mu A^T A a$
  • Under reasonable assumptions about A,
  • the dual iterative algorithm converges to vectors
  • h and a such that
  • h is the principal eigenvector of the matrix $A A^T$
  • a is the principal eigenvector of the matrix $A^T A$

58
Bipartite cores
Hubs
Authorities
Most densely-connected core (primary core)
Less densely-connected core (secondary core)
59
Secondary cores
  • A single topic can have many bipartite cores
  • corresponding to different meanings, or points of
    view
  • abortion: pro-choice, pro-life
  • evolution: darwinian, intelligent design
  • jaguar: auto, Mac, NFL team, panthera onca
  • How to find such secondary cores?

60
Non-primary eigenvectors
  • $A A^T$ and $A^T A$ have the same set of eigenvalues
  • An eigenpair is the pair of eigenvectors with the
    same eigenvalue
  • The primary eigenpair (largest eigenvalue) is
    what we get from the iterative algorithm
  • Non-primary eigenpairs correspond to other
    bipartite cores
  • The eigenvalue is a measure of the density of
    links in the core

61
Finding secondary cores
  • Once we find the primary core, we can remove its
    links from the graph
  • Repeat HITS algorithm on residual graph to find
    the next bipartite core
  • Technically, not exactly equivalent to
    non-primary eigenpair model

62
Creating the graph for HITS
  • We need a well-connected graph of pages for HITS
    to work well
  • Add all pages linking into and out of search
    results

63
Sample results: search engines
  • Authorities (in 1998)
  • .346 http://www.yahoo.com/ Yahoo!
  • .291 http://www.excite.com/ Excite
  • .239 http://www.mckinley.com/ Welcome to
    Magellan!
  • .231 http://www.lycos.com/ Lycos Home Page
  • .231 http://www.altavista.digital.com/ AltaVista
    Main Page

64
PageRank and HITS
  • PageRank and HITS are two solutions to the same
    problem
  • What is the value of an inlink from S to D?
  • In the PageRank model, the value of the link
    depends on the links into S
  • In the HITS model, it depends on the value of the
    other links out of S
  • The destinies of Page Rank and HITS post-1998
    were very different
  • Why?

65
Trawling the Web for Emerging Communities [KRR 98]
  • (slides in separate ppt)

66
Link analysis over weblogs
  • How are blogs different from the Web?
  • Time-stamps -> time ordering of posts/links
  • Blog-blog links (blogrolls) & post-post links
  • Friends lists
  • Reciprocal links built in as trackbacks
  • Rank of post vs. rank of blog
  • Search results: timeliness vs. relevance/authority
  • Research topics
  • Information diffusion
  • Ranking blogs
  • Cascades
  • Community mining

67
Information diffusion & ranking
  • Implicit Structure and the Dynamics of Blogspace,
    Eytan Adar, WWE-2004
  • Macroscopic & microscopic patterns of blog
    epidemics
  • Implicit & explicit ranking algorithms that take
    advantage of infection patterns
  • iRank acts on the implicit link structure to find
    those blogs that initiate these epidemics

68
Information diffusion in blogs
  • Information diffusion through blogspace, Gruhl et
    al, WWW 2004
  • How topics spread through blogs
  • Macroscopic: long-running chatter topics vs.
    bursty spike topics
  • Microscopic: model topic diffusion as infectious
    disease

69
Information cascades
  • Finding patterns in blog shapes and blog
    evolution, M. McGlohon, J. Leskovec, C.
    Faloutsos, M. Hurst and N. Glance
  • Study topology of cascades implicit in blog post
    linking behavior
  • Temporal evolution of types of cascades for given
    blogs

70
Ranking blogs: EigenRumor Algorithm
  • The EigenRumor Algorithm for Ranking Blogs, Ko
    Fujimura, WWE-2005
  • "EigenRumor" algorithm scores each blog entry
    based on eigenvector calculations.
  • Higher score assigned to the blog entries
    submitted by a good blogger but not yet linked to
    by any other blogs based on acceptance of the
    blogger's prior work.

71
Mining weblog communities
  • Extracting Latent Weblog Communities: A
    Partitioning Algorithm for Bipartite Graphs,
    Kazunari Ishida, WWE-2005
  • Builds on Trawling for communities [KRR 1998]
  • Bipartite graphs derived from weblog update
    information and cited webpages
  • Partitioning method: weakest pairs (WP) method to
    divide a bipartite graph into complete bipartite
    subgraphs
  • Discovery of Blog Communities Based on Mutual
    Awareness, Yu-Ru Lin, Hari Sundaram, Yun Chi, Jun
    Tatemura and Belle Tseng, WWE 2006.
  • Compute mutual awareness matrix based on
    reciprocal links
  • Use an iterative, ranking-based clustering scheme
    on the mutual awareness matrix to determine the
    communities
  • PageRank is used to determine the seeds at each
    step, followed by diffusion of association to
    determine the community members.

72
Combining sentiment & link analysis
  • Modeling Trust and Influence in the Blogosphere
    Using Link Polarity, A. Kale, A. Karandikar, P.
    Kolari, A. Java, T. Finin and A. Joshi, ICWSM
    2007
  • Steps
  • Identify polarity of blog-blog link
  • Use trust propagation models to spread sentiment
    from subset of connected blogs to other blogs
  • Seed with computed base polarity of known source
    blog (blog or blog post)
  • Take into account magnitude of strength or
    weakness of sentiment
  • Uses lexicon of positive and negative words

73
Beyond blogs
  • Why We Twitter: Understanding Microblogging Usage
    and Communities, Akshay Java, Xiaodan Song, Tim
    Finin, and Belle Tseng, Joint 9th WEBKDD and 1st
    SNA-KDD Workshop, August 2007.
  • Dataset includes about 1350K posts from over 75K
    users
  • Presents usage trends, network properties, top
    hubs and authorities, community structure and
    geographic distribution.