Title: Introduction to Link Analysis
1Introduction to Link Analysis
- September 11, 2007
- Analysis of Social Media Seminar
- Natalie Glance
2Content adapted from
- CS345 Data Mining Class Slides
- Anand Rajaraman, Jeffrey D. Ullman
- IMA Tutorial on Measuring Modeling the Web
- Andrew Tomkins
3Problem formulation (1998)
- Suppose we are given a collection of documents on
some broad topic - e.g., universities, evolution, iraq
- perhaps obtained through a text search
- Can we organize these documents in some manner?
- Link analysis techniques use links, not text
4Link Analysis Algorithms
- Page Rank
- Hubs and Authorities
- Both proposed at about the same time (1998)
- Topic-Specific Page Rank
- Spam Detection Algorithms
- Mining for Communities
- Further Reading Link Analysis for Web 2.0
5PageRank
- The PageRank citation ranking Bringing order to
the Web, L. Page, S. Brin, R. Motwani, T.
Winograd, 1999. - The Anatomy of a Large-Scale Hypertextual Web
Search Engine, S. Brin, L. Page (1998).
6Ranking web pages
- Web pages are not equally important
- www.silly-billy.com v www.stanford.edu
- Inlinks as votes
- www.stanford.edu has 23,400 inlinks
- www.silly-billy.com has 1 inlink
- Are all inlinks equal?
- Recursive question!
7Simple recursive formulation
- Each links vote is proportional to the
importance of its source page - If page P with importance x has n outlinks, each
link gets x/n votes
8Simple flow model
y y /2 a /2 a y /2 m m a /2
y/2
y
a/2
y/2
m
a/2
m
a
9Solving the flow equations
- 3 equations, 3 unknowns, no constants
- No unique solution
- All solutions equivalent modulo scale factor
- Additional constraint forces uniqueness
- yam 1
- y 2/5, a 2/5, m 1/5
- Gaussian elimination method works for small
examples, but we need a better method for large
graphs
10Matrix formulation
- Matrix M has one row and one column for each web
page - Suppose page j has n outlinks
- If j links to i, then Mij1/n
- Else Mij0
- M is a column stochastic matrix
- Columns sum to 1
- Suppose r is a vector with one entry per web page
- ri is the importance score of page i
- Call it the rank vector
11Example
Suppose page j links to 3 pages, including i
r
12Eigenvector formulation
- The flow equations can be written
- r Mr
- So the rank vector is an eigenvector of the
stochastic web matrix - In fact, its first or principal eigenvector, with
corresponding eigenvalue 1
13Example
y a m
y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0
y y /2 a /2 a y /2 m m a /2
14Power Iteration method
- Simple iterative scheme (aka relaxation)
- Suppose there are N web pages
- Initialize r0 1/N,.,1/NT
- Iterate rk1 Mrk
- Stop when rk1 - rk1 lt ?
- x1 ?1iNxi is the L1 norm
- Can use any other vector norm e.g., Euclidean
15Power Iteration Example
y a m
y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0
y a m
1/3 1/3 1/3
1/3 1/2 1/6
5/12 1/3 1/4
3/8 11/24 1/6
2/5 2/5 1/5
. . .
16Random Walk Interpretation
- Imagine a random web surfer
- At any time t, surfer is on some page P
- At time t1, the surfer follows an outlink from P
uniformly at random - Ends up on some page Q linked from P
- Process repeats indefinitely
- Let p(t) be a vector whose ith component is the
probability that the surfer is at page i at time
t - p(t) is a probability distribution on pages
17The stationary distribution
- Where is the surfer at time t1?
- Follows a link uniformly at random
- p(t1) Mp(t)
- Suppose the random walk reaches a state such that
p(t1) Mp(t) p(t) - Then p(t) is called a stationary distribution for
the random walk - Our rank vector r satisfies r Mr
- So it is a stationary distribution for the random
surfer
18Existence and Uniqueness
- A central result from the theory of random walks
(aka Markov processes) - For graphs that satisfy certain conditions, the
stationary distribution is unique and eventually
will be reached no matter what the initial
probability distribution at time t 0.
19Spider traps
- A group of pages is a spider trap if there are no
links from within the group to outside the group - Random surfer gets trapped
- Spider traps violate the conditions needed for
the random walk theorem
20Microsoft becomes a spider trap
Yahoo
y a m
y 1/2 1/2 0 a 1/2 0 0 m 0 1/2 1
Msoft
Amazon
y a m
1 1 1
1 1/2 3/2
3/4 1/2 7/4
5/8 3/8 2
0 0 3
. . .
21Random teleports
- The PageRank solution for spider traps
- At each time step, the random surfer has two
options - With probability ?, follow a link at random
- With probability 1-?, jump to some page uniformly
at random - Common values for ? are in the range 0.8 to 0.9
- Surfer will teleport out of spider trap within a
few time steps
22Matrix formulation
- Suppose there are N pages
- Consider a page j, with set of outlinks O(j)
- We have Mij 1/O(j) when j links to i and Mij
0 otherwise - The random teleport is equivalent to
- adding a teleport link from j to every other page
with probability (1-?)/N - reducing the probability of following each
outlink from 1/O(j) to ?/O(j) - Equivalent tax each page a fraction (1-?) of its
score and redistribute evenly
23PageRank
- Construct the NXN matrix A as follows
- Aij ?Mij (1-?)/N
- Verify that A is a stochastic matrix
- The page rank vector r is the principal
eigenvector of this matrix - satisfying r Ar
- Equivalently, r is the stationary distribution of
the random walk with teleports
24Previous example with ?0.8
1/2 1/2 0 1/2 0 0 0 1/2
1
1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3
0.2
Yahoo
0.8
y 7/15 7/15 1/15 a 7/15 1/15 1/15 m
1/15 7/15 13/15
Msoft
Amazon
y a m
1 1 1
1.00 0.60 1.40
0.84 0.60 1.56
0.776 0.536 1.688
7/11 5/11 21/11
. . .
25Dead ends
- Pages with no outlinks are dead ends for the
random surfer - Nowhere to go on next step
26Microsoft becomes a dead end
1/2 1/2 0 1/2 0 0 0 1/2
0
1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3
0.2
Yahoo
0.8
y 7/15 7/15 1/15 a 7/15 1/15 1/15 m
1/15 7/15 1/15
Msoft
Amazon
y a m
1 1 1
1 0.6 0.6
0.787 0.547 0.387
0.648 0.430 0.333
0 0 0
. . .
27Dealing with dead-ends
- Teleport
- Follow random teleport links with probability 1.0
from dead-ends - Adjust matrix accordingly
- Prune and propagate
- Preprocess the graph to eliminate dead-ends
- Might require multiple passes
- Compute page rank on reduced graph
- Approximate values for deadends by propagating
values from reduced graph
28Computing PageRank
- Key step is matrix-vector multiply
- rnew Arold
- Easy if we have enough main memory to hold A,
rold, rnew - Say N 1 billion pages
- We need 4 bytes for each entry (say)
- 2 billion entries for vectors, approx 8GB
- Matrix A has N2 entries
- 1018 is a large number!
29Comparison of Query for University
Google vs. AltaVista, circa 1998
30Problems with PageRank
- Measures generic popularity of a page
- Biased against topic-specific authorities
- Ambiguous queries e.g., Michael Jordan
- Uses a single measure of importance
- Other models e.g., hubs and authorities
- Susceptible to Link spam
- Artificial link topographies created in order to
boost page rank
31Topic-Specific Page Rank
- Taher H. Haveliwala. Topic-Sensitive PageRank.
11th International World Wide Web Conference,
2002 - Instead of generic popularity, can we measure
popularity within a topic? - E.g., computer science, health
- Bias the random walk
- When the random walker teleports, he picks a page
from a set S of web pages - S contains only pages that are relevant to the
topic - E.g., Open Directory (DMOZ) pages for a given
topic (www.dmoz.org) - For each teleport set S, we get a different rank
vector rS
32Matrix formulation
- Aij ?Mij (1-?)/S if i in S
- Aij ?Mij otherwise
- Show that A is stochastic
- We have weighted all pages in the teleport set S
equally - Could also assign different weights to them
33How well does TSPR work?
- Experimental results Haveliwala 2000
- Picked 16 topics
- Teleport sets determined using DMOZ
- E.g., arts, business, sports,
- Blind study using volunteers
- 35 test queries
- Results ranked using Page Rank and TSPR of most
closely related topic - E.g., bicycling using Sports ranking
- In most cases volunteers preferred TSPR ranking
34Which topic ranking to use?
- User can pick from a menu
- Use Bayesian classification schemes to classify
query into a topic - Can use the context of the query
- E.g., query is launched from a web page talking
about a known topic - History of queries e.g., basketball followed by
jordan - User context e.g., users My Yahoo settings,
bookmarks,
35Scaling with topics and users
- Suppose we wanted to cover 1000s of topics
- Need to compute 1000s of different rank vectors
- Need to store and retrieve them efficiently at
query time - For good performance vectors must fit in memory
- Even harder when we consider personalization
- Each user has their own teleport vector
- One page rank vector per user!
36Web spam
- Spamming any deliberate action solely in order
to boost a web pages position in search engine
results, incommensurate with pages real value - Spam web pages that are the result of spamming
- This is a very broad defintion
- SEO industry might disagree!
- SEO search engine optimization
- Approximately 10-15 of web pages are spam
37Boosting techniques
- Term spamming
- Manipulating the text of web pages in order to
appear relevant to queries - Link spamming
- Creating link structures that boost page rank or
hubs and authorities scores
38Link Farms
- Spammers goal
- Maximize the page rank of target page t
- Technique
- Get as many links from accessible pages as
possible to target page t - Construct link farm to get PageRank multiplier
effect
39Link Farms
One of the most common and effective
organizations for a link farm
40TrustRank idea
- Combating Web Spam with TrustRank, Z. Gyongyi, H.
Garcia-Molina, J. Pedersen, VLDB, 2004. - Basic principle approximate isolation
- It is rare for a good page to point to a bad
(spam) page - Sample a set of seed pages from the web
- Have an oracle (human) identify the good pages
and the spam pages in the seed set - Expensive task, so must make seed set as small as
possible
41Trust Propagation
- Call the subset of seed pages that are identified
as good the trusted pages - Set trust of each trusted page to 1
- Propagate trust through links
- Each page gets a trust value between 0 and 1
- Use a threshold value and mark all pages below
the trust threshold as spam
42Example
1
2
3
good
4
bad
5
6
7
43Rules for trust propagation
- Trust attenuation
- The degree of trust conferred by a trusted page
decreases with distance - Trust splitting
- The larger the number of outlinks from a page,
the less scrutiny the page author gives each
outlink - Trust is split across outlinks
44Simple model
- Suppose trust of page p is t(p)
- Set of outlinks O(p)
- For each q in O(p), p confers the trust
- bt(p)/O(p) for 0ltblt1
- Trust is additive
- Trust of p is the sum of the trust conferred on p
by all its inlinked pages - Note similarity to Topic-Specific Page Rank
- Within a scaling factor, trust rank biased page
rank with trusted pages as teleport set
45Picking the seed set
- Two conflicting considerations
- Human has to inspect each seed page, so seed set
must be as small as possible - Must ensure every good page gets adequate trust
rank, so need make all good pages reachable from
seed set by short paths
46Approaches to picking seed set
- Suppose we want to pick a seed set of k pages
- PageRank
- Pick the top k pages by PageRank
- Assume high PageRank pages are close to other
highly ranked pages - We care more about high PageRank good pages
47Inverse page rank
- Pick the pages with the maximum number of
outlinks - Can make it recursive
- Pick pages that link to pages with many outlinks
- Formalize as inverse PageRank
- Construct graph G by reversing each edge in web
graph G - PageRank in G is inverse page rank in G
- Pick top k pages by inverse PageRank
48Hubs and Authorities
- J. Kleinberg. Authoritative sources in a
hyperlinked environment. Proc. 9th ACM-SIAM
Symposium on Discrete Algorithms, 1998. Extended
version in Journal of the ACM 46(1999).
49HITS Model
- Interesting documents fall into two classes
- Authorities are pages containing useful
information - course home pages
- home pages of auto manufacturers
- Hubs are pages that link to authorities
- course bulletin
- list of US auto manufacturers
50Idealized view
Hubs
Authorities
51Mutually recursive definition
- A good hub links to many good authorities
- A good authority is linked from many good hubs
- Model using two scores for each node
- Hub score and Authority score
- Represented as vectors h and a
52Transition Matrix A
- HITS uses a matrix Ai, j 1 if page i links to
page j, 0 if not - AT, the transpose of A, is similar to the
PageRank matrix M, but AT has 1s where M has
fractions
53Example
y a m
Yahoo
y 1 1 1 a 1 0 1 m 0 1
0
A
Msoft
Amazon
54Hub and Authority Equations
- The hub score of page P is proportional to the
sum of the authority scores of the pages it links
to - h ?Aa
- Constant ? is a scale factor
- The authority score of page P is proportional to
the sum of the hub scores of the pages it is
linked from - a µAT h
- Constant µ is scale factor
55Iterative algorithm
- Initialize h, a to all 1s
- h Aa
- Scale h so that its max entry is 1.0
- a ATh
- Scale a so that its max entry is 1.0
- Continue until h, a converge
56Example
1 1 1 A 1 0 1 0 1 0
1 1 0 AT 1 0 1 1 1 0
. . . . . . . . .
1 0.732 1
1 1 1
1 1 1
1 4/5 1
1 0.75 1
a(yahoo) a(amazon) a(msoft)
. . . . . . . . .
h(yahoo) 1 h(amazon)
1 h(msoft) 1
1 2/3 1/3
1 0.73 0.27
1.000 0.732 0.268
1 0.71 0.29
57Existence and Uniqueness
- h ?Aa
- a µAT h
- h ?µAAT h
- a ?µATA a
- Under reasonable assumptions about A,
- the dual iterative algorithm converges to vectors
- h and a such that
- h is the principal eigenvector of the matrix AAT
- a is the principal eigenvector of the matrix ATA
58Bipartite cores
Hubs
Authorities
Most densely-connected core (primary core)
Less densely-connected core (secondary core)
59Secondary cores
- A single topic can have many bipartite cores
- corresponding to different meanings, or points of
view - abortion pro-choice, pro-life
- evolution darwinian, intelligent design
- jaguar auto, Mac, NFL team, panthera onca
- How to find such secondary cores?
60Non-primary eigenvectors
- AAT and ATA have the same set of eigenvalues
- An eigenpair is the pair of eigenvectors with the
same eigenvalue - The primary eigenpair (largest eigenvalue) is
what we get from the iterative algorithm - Non-primary eigenpairs correspond to other
bipartite cores - The eigenvalue is a measure of the density of
links in the core
61Finding secondary cores
- Once we find the primary core, we can remove its
links from the graph - Repeat HITS algorithm on residual graph to find
the next bipartite core - Technically, not exactly equivalent to
non-primary eigenpair model
62Creating the graph for HITS
- We need a well-connected graph of pages for HITS
to work well - Add all pages linking into and out of search
results
63Sample results search engines
- Authorities (in 1998)
- .346 http//www.yahoo.com/ Yahoo!
- .291 http//www.excite.com/ Excite
- .239 http//www.mckinley.com/ Welcome to
Magellan! - .231 http//www.lycos.com/ Lycos Home Page
- .231 http//www.altavista.digital.com/ AltaVista
Main Page
64PageRank and HITS
- PageRank and HITS are two solutions to the same
problem - What is the value of an inlink from S to D?
- In the PageRank model, the value of the link
depends on the links into S - In the HITS model, it depends on the value of the
other links out of S - The destinies of Page Rank and HITS post-1998
were very different - Why?
65Trawling the Web for Emerging Communities KRR 98
66Link analysis over weblogs
- How are Blogs different than the Web
- Time-stamps -gt time ordering of posts/links
- Blog-blog links (blogrolls) post-post links
- Friends lists
- Reciprocal links built in as trackbacks
- Rank of post vs. rank of blog
- Search results timeliness vs. relevance/authority
- Research topics
- Information diffusion
- Ranking blogs
- Cascades
- Community mining
67Information diffusion ranking
- Implicit Structure and the Dynamic of Blogspace,
Eytan Adar, WWE-2004 - Macroscopic microscopic patterns of blog
epidemics - Implicit explicit ranking algorithms that take
advantage of infection patterns - iRank acts on the implicit link structure to find
those blogs that initiate these epidemics
68Information diffusion in blogs
- Information diffusion through blogspace, Gruhl et
al, WWW 2004 - How topics spread through blogs
- Macroscopic long-running chatter topics vs.
bursty spike topics - Microscopic model topic diffusion as infectious
disease
69Information cascades
- Finding patterns in blog shapes and blog
evolution, M. McGlohon, J. Leskovec, C.
Faloutsos, M. Hurst and N. Glance - Study topology of cascades implicit in blog post
linking behavior - Temporal evolution of types of cascades for given
blogs
70Ranking blogs EigenRumor Algorithm
- The EigenRumor Algorithm for Ranking Blogs, Ko
Fujimura, WWE-2005 - "EigenRumor" algorithm scores each blog entry
based on eigenvector calculations. - Higher score assigned to the blog entries
submitted by a good blogger but not yet linked to
by any other blogs based on acceptance of the
blogger's prior work.
71Mining weblog communities
- Extracting Latent Weblog Communities A
Partitioning Algorithm for Bipartite Graphs,
Kazunari Ishida, WWE-2005 - Builds on Trawling for communities KRR 1998
- Bipartite graphs derived from weblog update
information and cited webpages - Partitioning method weakest pairs (WP) method to
divide a bipartite graph into complete bipartite
subgraphs - Discovery of Blog Communities Based on Mutual
Awareness, Yu-Ru Lin,Hari Sundaram, Yun Chi, Jun
Tatemura and Belle Tseng, WWE 2006. - Compute mutual awareness matrix based on
reciprocal links - Use an iterative, ranking-based clustering scheme
on the mutual awareness matrix to determine the
communities - PageRank is used to determine the seeds at each
step, followed by diffusion of association to
determine the community members.
72Combining sentiment link analysis
- Modeling Trust and Influence in the Blogosphere
Using Link Polarity,A. Kale, A. Karandikar, P.
Kolari, A. Java, T. Finin and A. Joshi, ICWSM
2007 - Steps
- Identify polarity of blog-blog link
- Use trust propagation models to spread sentiment
from subset of connected blogs to other blogs - Seed with computed base polarity of known source
blog (blog or blog post) - Take into account magnitude of strength or
weakness of sentiment - Uses lexicon of positive and negative words
73Beyond blogs
- Why We Twitter Understanding Microblogging Usage
and Communities,Akshay Java, Xiaodan Song, Tim
Finin, and Belle Tseng, Joint 9thWEBKDD and 1st
SNA-KDD Workshop, August 2007. - Dataset includes about 1350K posts from over 75K
users - Presents usage trends, network properties, top
hubs and authorities, community structure and
geographic distribution.