Introduction to Link Analysis - PowerPoint PPT Presentation

1 / 72

About This Presentation

Title:

Introduction to Link Analysis

Description:

Suppose we are given a collection of documents on some broad topic ... Nowhere to go on next step. Microsoft becomes a dead end. Yahoo. M'soft. Amazon. y. a = m ... – PowerPoint PPT presentation

Number of Views:94

Avg rating:3.0/5.0

Slides: 73

Provided by: anan56

Category:

more less

Transcript and Presenter's Notes

Title: Introduction to Link Analysis

1
Introduction to Link Analysis

September 11, 2007
Analysis of Social Media Seminar
Natalie Glance

2
Content adapted from

CS345 Data Mining Class Slides
Anand Rajaraman, Jeffrey D. Ullman
IMA Tutorial on Measuring Modeling the Web
Andrew Tomkins

3
Problem formulation (1998)

Suppose we are given a collection of documents on
some broad topic
e.g., universities, evolution, iraq
perhaps obtained through a text search
Can we organize these documents in some manner?
Link analysis techniques use links, not text

4
Link Analysis Algorithms

Page Rank
Hubs and Authorities
Both proposed at about the same time (1998)
Topic-Specific Page Rank
Spam Detection Algorithms
Mining for Communities
Further Reading Link Analysis for Web 2.0

5
PageRank

The PageRank citation ranking Bringing order to
the Web, L. Page, S. Brin, R. Motwani, T.
Winograd, 1999.
The Anatomy of a Large-Scale Hypertextual Web
Search Engine, S. Brin, L. Page (1998).

6
Ranking web pages

Web pages are not equally important
www.silly-billy.com v www.stanford.edu
Inlinks as votes
www.stanford.edu has 23,400 inlinks
www.silly-billy.com has 1 inlink
Are all inlinks equal?
Recursive question!

7
Simple recursive formulation

Each links vote is proportional to the
importance of its source page
If page P with importance x has n outlinks, each
link gets x/n votes

8
Simple flow model

The web long ago

y y /2 a /2 a y /2 m m a /2
y/2
y
a/2
y/2
m
a/2
m
a
9
Solving the flow equations

3 equations, 3 unknowns, no constants
No unique solution
All solutions equivalent modulo scale factor
Additional constraint forces uniqueness
yam 1
y 2/5, a 2/5, m 1/5
Gaussian elimination method works for small
examples, but we need a better method for large
graphs

10
Matrix formulation

Matrix M has one row and one column for each web
page
Suppose page j has n outlinks
If j links to i, then Mij1/n
Else Mij0
M is a column stochastic matrix
Columns sum to 1
Suppose r is a vector with one entry per web page
ri is the importance score of page i
Call it the rank vector

11
Example
Suppose page j links to 3 pages, including i
r
12
Eigenvector formulation

The flow equations can be written
r Mr
So the rank vector is an eigenvector of the
stochastic web matrix
In fact, its first or principal eigenvector, with
corresponding eigenvalue 1

13
Example
y a m
y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0
y y /2 a /2 a y /2 m m a /2
14
Power Iteration method

Simple iterative scheme (aka relaxation)
Suppose there are N web pages
Initialize r0 1/N,.,1/NT
Iterate rk1 Mrk
Stop when rk1 - rk1 lt ?
x1 ?1iNxi is the L1 norm
Can use any other vector norm e.g., Euclidean

15
Power Iteration Example
y a m
y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0
y a m
1/3 1/3 1/3
1/3 1/2 1/6
5/12 1/3 1/4
3/8 11/24 1/6
2/5 2/5 1/5
. . .
16
Random Walk Interpretation

Imagine a random web surfer
At any time t, surfer is on some page P
At time t1, the surfer follows an outlink from P
uniformly at random
Ends up on some page Q linked from P
Process repeats indefinitely
Let p(t) be a vector whose ith component is the
probability that the surfer is at page i at time
t
p(t) is a probability distribution on pages

17
The stationary distribution

Where is the surfer at time t1?
Follows a link uniformly at random
p(t1) Mp(t)
Suppose the random walk reaches a state such that
p(t1) Mp(t) p(t)
Then p(t) is called a stationary distribution for
the random walk
Our rank vector r satisfies r Mr
So it is a stationary distribution for the random
surfer

18
Existence and Uniqueness

A central result from the theory of random walks
(aka Markov processes)
For graphs that satisfy certain conditions, the
stationary distribution is unique and eventually
will be reached no matter what the initial
probability distribution at time t 0.

19
Spider traps

A group of pages is a spider trap if there are no
links from within the group to outside the group
Random surfer gets trapped
Spider traps violate the conditions needed for
the random walk theorem

20
Microsoft becomes a spider trap
Yahoo
y a m
y 1/2 1/2 0 a 1/2 0 0 m 0 1/2 1
Msoft
Amazon
y a m
1 1 1
1 1/2 3/2
3/4 1/2 7/4
5/8 3/8 2
0 0 3
. . .
21
Random teleports

The PageRank solution for spider traps
At each time step, the random surfer has two
options
With probability ?, follow a link at random
With probability 1-?, jump to some page uniformly
at random
Common values for ? are in the range 0.8 to 0.9
Surfer will teleport out of spider trap within a
few time steps

22
Matrix formulation

Suppose there are N pages
Consider a page j, with set of outlinks O(j)
We have Mij 1/O(j) when j links to i and Mij
0 otherwise
The random teleport is equivalent to
adding a teleport link from j to every other page
with probability (1-?)/N
reducing the probability of following each
outlink from 1/O(j) to ?/O(j)
Equivalent tax each page a fraction (1-?) of its
score and redistribute evenly

23
PageRank

Construct the NXN matrix A as follows
Aij ?Mij (1-?)/N
Verify that A is a stochastic matrix
The page rank vector r is the principal
eigenvector of this matrix
satisfying r Ar
Equivalently, r is the stationary distribution of
the random walk with teleports

24
Previous example with ?0.8
1/2 1/2 0 1/2 0 0 0 1/2
1
1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3
0.2
Yahoo
0.8
y 7/15 7/15 1/15 a 7/15 1/15 1/15 m
1/15 7/15 13/15
Msoft
Amazon
y a m
1 1 1
1.00 0.60 1.40
0.84 0.60 1.56
0.776 0.536 1.688
7/11 5/11 21/11
. . .
25
Dead ends

Pages with no outlinks are dead ends for the
random surfer
Nowhere to go on next step

26
Microsoft becomes a dead end
1/2 1/2 0 1/2 0 0 0 1/2
0
1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3
0.2
Yahoo
0.8
y 7/15 7/15 1/15 a 7/15 1/15 1/15 m
1/15 7/15 1/15
Msoft
Amazon
y a m
1 1 1
1 0.6 0.6
0.787 0.547 0.387
0.648 0.430 0.333
0 0 0
. . .
27
Dealing with dead-ends

Teleport
Follow random teleport links with probability 1.0
from dead-ends
Adjust matrix accordingly
Prune and propagate
Preprocess the graph to eliminate dead-ends
Might require multiple passes
Compute page rank on reduced graph
Approximate values for deadends by propagating
values from reduced graph

28
Computing PageRank

Key step is matrix-vector multiply
rnew Arold
Easy if we have enough main memory to hold A,
rold, rnew
Say N 1 billion pages
We need 4 bytes for each entry (say)
2 billion entries for vectors, approx 8GB
Matrix A has N2 entries
1018 is a large number!

29
Comparison of Query for University
Google vs. AltaVista, circa 1998
30
Problems with PageRank

Measures generic popularity of a page
Biased against topic-specific authorities
Ambiguous queries e.g., Michael Jordan
Uses a single measure of importance
Other models e.g., hubs and authorities
Susceptible to Link spam
Artificial link topographies created in order to
boost page rank

31
Topic-Specific Page Rank

Taher H. Haveliwala. Topic-Sensitive PageRank.
11th International World Wide Web Conference,
2002
Instead of generic popularity, can we measure
popularity within a topic?
E.g., computer science, health
Bias the random walk
When the random walker teleports, he picks a page
from a set S of web pages
S contains only pages that are relevant to the
topic
E.g., Open Directory (DMOZ) pages for a given
topic (www.dmoz.org)
For each teleport set S, we get a different rank
vector rS

32
Matrix formulation

Aij ?Mij (1-?)/S if i in S
Aij ?Mij otherwise
Show that A is stochastic
We have weighted all pages in the teleport set S
equally
Could also assign different weights to them

33
How well does TSPR work?

Experimental results Haveliwala 2000
Picked 16 topics
Teleport sets determined using DMOZ
E.g., arts, business, sports,
Blind study using volunteers
35 test queries
Results ranked using Page Rank and TSPR of most
closely related topic
E.g., bicycling using Sports ranking
In most cases volunteers preferred TSPR ranking

34
Which topic ranking to use?

User can pick from a menu
Use Bayesian classification schemes to classify
query into a topic
Can use the context of the query
E.g., query is launched from a web page talking
about a known topic
History of queries e.g., basketball followed by
jordan
User context e.g., users My Yahoo settings,
bookmarks,

35
Scaling with topics and users

Suppose we wanted to cover 1000s of topics
Need to compute 1000s of different rank vectors
Need to store and retrieve them efficiently at
query time
For good performance vectors must fit in memory
Even harder when we consider personalization
Each user has their own teleport vector
One page rank vector per user!

36
Web spam

Spamming any deliberate action solely in order
to boost a web pages position in search engine
results, incommensurate with pages real value
Spam web pages that are the result of spamming
This is a very broad defintion
SEO industry might disagree!
SEO search engine optimization
Approximately 10-15 of web pages are spam

37
Boosting techniques

Term spamming
Manipulating the text of web pages in order to
appear relevant to queries
Link spamming
Creating link structures that boost page rank or
hubs and authorities scores

38
Link Farms

Spammers goal
Maximize the page rank of target page t
Technique
Get as many links from accessible pages as
possible to target page t
Construct link farm to get PageRank multiplier
effect

39
Link Farms
One of the most common and effective
organizations for a link farm
40
TrustRank idea

Combating Web Spam with TrustRank, Z. Gyongyi, H.
Garcia-Molina, J. Pedersen, VLDB, 2004.
Basic principle approximate isolation
It is rare for a good page to point to a bad
(spam) page
Sample a set of seed pages from the web
Have an oracle (human) identify the good pages
and the spam pages in the seed set
Expensive task, so must make seed set as small as
possible

41
Trust Propagation

Call the subset of seed pages that are identified
as good the trusted pages
Set trust of each trusted page to 1
Propagate trust through links
Each page gets a trust value between 0 and 1
Use a threshold value and mark all pages below
the trust threshold as spam

42
Example
1
2
3
good
4
bad
5
6
7
43
Rules for trust propagation

Trust attenuation
The degree of trust conferred by a trusted page
decreases with distance
Trust splitting
The larger the number of outlinks from a page,
the less scrutiny the page author gives each
outlink
Trust is split across outlinks

44
Simple model

Suppose trust of page p is t(p)
Set of outlinks O(p)
For each q in O(p), p confers the trust
bt(p)/O(p) for 0ltblt1
Trust is additive
Trust of p is the sum of the trust conferred on p
by all its inlinked pages
Note similarity to Topic-Specific Page Rank
Within a scaling factor, trust rank biased page
rank with trusted pages as teleport set

45
Picking the seed set

Two conflicting considerations
Human has to inspect each seed page, so seed set
must be as small as possible
Must ensure every good page gets adequate trust
rank, so need make all good pages reachable from
seed set by short paths

46
Approaches to picking seed set

Suppose we want to pick a seed set of k pages
PageRank
Pick the top k pages by PageRank
Assume high PageRank pages are close to other
highly ranked pages
We care more about high PageRank good pages

47
Inverse page rank

Pick the pages with the maximum number of
outlinks
Can make it recursive
Pick pages that link to pages with many outlinks
Formalize as inverse PageRank
Construct graph G by reversing each edge in web
graph G
PageRank in G is inverse page rank in G
Pick top k pages by inverse PageRank

48
Hubs and Authorities

J. Kleinberg. Authoritative sources in a
hyperlinked environment. Proc. 9th ACM-SIAM
Symposium on Discrete Algorithms, 1998. Extended
version in Journal of the ACM 46(1999).

49
HITS Model

Interesting documents fall into two classes
Authorities are pages containing useful
information
course home pages
home pages of auto manufacturers
Hubs are pages that link to authorities
course bulletin
list of US auto manufacturers

50
Idealized view
Hubs
Authorities
51
Mutually recursive definition

A good hub links to many good authorities
A good authority is linked from many good hubs
Model using two scores for each node
Hub score and Authority score
Represented as vectors h and a

52
Transition Matrix A

HITS uses a matrix Ai, j 1 if page i links to
page j, 0 if not
AT, the transpose of A, is similar to the
PageRank matrix M, but AT has 1s where M has
fractions

53
Example
y a m
Yahoo
y 1 1 1 a 1 0 1 m 0 1
0
A
Msoft
Amazon
54
Hub and Authority Equations

The hub score of page P is proportional to the
sum of the authority scores of the pages it links
to
h ?Aa
Constant ? is a scale factor
The authority score of page P is proportional to
the sum of the hub scores of the pages it is
linked from
a µAT h
Constant µ is scale factor

55
Iterative algorithm

Initialize h, a to all 1s
h Aa
Scale h so that its max entry is 1.0
a ATh
Scale a so that its max entry is 1.0
Continue until h, a converge

56
Example
1 1 1 A 1 0 1 0 1 0
1 1 0 AT 1 0 1 1 1 0
. . . . . . . . .
1 0.732 1

1 1 1
1 1 1
1 4/5 1
1 0.75 1
a(yahoo) a(amazon) a(msoft)
. . . . . . . . .
h(yahoo) 1 h(amazon)
1 h(msoft) 1
1 2/3 1/3
1 0.73 0.27
1.000 0.732 0.268
1 0.71 0.29
57
Existence and Uniqueness

h ?Aa
a µAT h
h ?µAAT h
a ?µATA a
Under reasonable assumptions about A,
the dual iterative algorithm converges to vectors
h and a such that
h is the principal eigenvector of the matrix AAT
a is the principal eigenvector of the matrix ATA

58
Bipartite cores
Hubs
Authorities
Most densely-connected core (primary core)
Less densely-connected core (secondary core)
59
Secondary cores

A single topic can have many bipartite cores
corresponding to different meanings, or points of
view
abortion pro-choice, pro-life
evolution darwinian, intelligent design
jaguar auto, Mac, NFL team, panthera onca
How to find such secondary cores?

60
Non-primary eigenvectors

AAT and ATA have the same set of eigenvalues
An eigenpair is the pair of eigenvectors with the
same eigenvalue
The primary eigenpair (largest eigenvalue) is
what we get from the iterative algorithm
Non-primary eigenpairs correspond to other
bipartite cores
The eigenvalue is a measure of the density of
links in the core

61
Finding secondary cores

Once we find the primary core, we can remove its
links from the graph
Repeat HITS algorithm on residual graph to find
the next bipartite core
Technically, not exactly equivalent to
non-primary eigenpair model

62
Creating the graph for HITS

We need a well-connected graph of pages for HITS
to work well
Add all pages linking into and out of search
results

63
Sample results search engines

Authorities (in 1998)
.346 http//www.yahoo.com/ Yahoo!
.291 http//www.excite.com/ Excite
.239 http//www.mckinley.com/ Welcome to
Magellan!
.231 http//www.lycos.com/ Lycos Home Page
.231 http//www.altavista.digital.com/ AltaVista
Main Page

64
PageRank and HITS

PageRank and HITS are two solutions to the same
problem
What is the value of an inlink from S to D?
In the PageRank model, the value of the link
depends on the links into S
In the HITS model, it depends on the value of the
other links out of S
The destinies of Page Rank and HITS post-1998
were very different
Why?

65
Trawling the Web for Emerging Communities KRR 98

(slides in separate ppt)

66
Link analysis over weblogs

How are Blogs different than the Web
Time-stamps -gt time ordering of posts/links
Blog-blog links (blogrolls) post-post links
Friends lists
Reciprocal links built in as trackbacks
Rank of post vs. rank of blog
Search results timeliness vs. relevance/authority
Research topics
Information diffusion
Ranking blogs
Cascades
Community mining

67
Information diffusion ranking

Implicit Structure and the Dynamic of Blogspace,
Eytan Adar, WWE-2004
Macroscopic microscopic patterns of blog
epidemics
Implicit explicit ranking algorithms that take
advantage of infection patterns
iRank acts on the implicit link structure to find
those blogs that initiate these epidemics

68
Information diffusion in blogs

Information diffusion through blogspace, Gruhl et
al, WWW 2004
How topics spread through blogs
Macroscopic long-running chatter topics vs.
bursty spike topics
Microscopic model topic diffusion as infectious
disease

69
Information cascades

Finding patterns in blog shapes and blog
evolution, M. McGlohon, J. Leskovec, C.
Faloutsos, M. Hurst and N. Glance
Study topology of cascades implicit in blog post
linking behavior
Temporal evolution of types of cascades for given
blogs

70
Ranking blogs EigenRumor Algorithm

The EigenRumor Algorithm for Ranking Blogs, Ko
Fujimura, WWE-2005
"EigenRumor" algorithm scores each blog entry
based on eigenvector calculations.
Higher score assigned to the blog entries
submitted by a good blogger but not yet linked to
by any other blogs based on acceptance of the
blogger's prior work.

71
Mining weblog communities

Extracting Latent Weblog Communities A
Partitioning Algorithm for Bipartite Graphs,
Kazunari Ishida, WWE-2005
Builds on Trawling for communities KRR 1998
Bipartite graphs derived from weblog update
information and cited webpages
Partitioning method weakest pairs (WP) method to
divide a bipartite graph into complete bipartite
subgraphs
Discovery of Blog Communities Based on Mutual
Awareness, Yu-Ru Lin,Hari Sundaram, Yun Chi, Jun
Tatemura and Belle Tseng, WWE 2006.
Compute mutual awareness matrix based on
reciprocal links
Use an iterative, ranking-based clustering scheme
on the mutual awareness matrix to determine the
communities
PageRank is used to determine the seeds at each
step, followed by diffusion of association to
determine the community members.

72
Combining sentiment link analysis

Modeling Trust and Influence in the Blogosphere
Using Link Polarity,A. Kale, A. Karandikar, P.
Kolari, A. Java, T. Finin and A. Joshi, ICWSM
2007
Steps
Identify polarity of blog-blog link
Use trust propagation models to spread sentiment
from subset of connected blogs to other blogs
Seed with computed base polarity of known source
blog (blog or blog post)
Take into account magnitude of strength or
weakness of sentiment
Uses lexicon of positive and negative words

73
Beyond blogs

Why We Twitter Understanding Microblogging Usage
and Communities,Akshay Java, Xiaodan Song, Tim
Finin, and Belle Tseng, Joint 9thWEBKDD and 1st
SNA-KDD Workshop, August 2007.
Dataset includes about 1350K posts from over 75K
users
Presents usage trends, network properties, top
hubs and authorities, community structure and
geographic distribution.

Write a Comment

User Comments (0)