Title: Search Engine Technology
1Search Engine Technology 11
http://www.cs.columbia.edu/radev/SET07.html
- November 15, 2007
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2SET Fall 2007
17. continued
3Slide from Reka Albert
4Slide from Reka Albert
5The strength of weak ties
- Granovetter's study: finding jobs
- Weak ties: more people can be reached through weak ties than through strong ties (e.g., through your 7th and 8th best friends)
- More here: http://en.wikipedia.org/wiki/Weak_tie
6Prestige and centrality
- Degree centrality: how many neighbors each node has.
- Closeness centrality: how close a node is to all of the other nodes.
- Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes.
- Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
- Prestige: same as centrality, but for directed graphs.
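The first two measures can be sketched in a few lines. This is a minimal illustration, assuming an undirected, connected graph stored as adjacency lists; the toy graph `GRAPH` and the function names are hypothetical, and closeness is taken here as (n - 1) divided by the sum of shortest-path distances, which is one common normalization.

```python
from collections import deque

# Hypothetical undirected toy graph (adjacency lists); not taken from the slides.
GRAPH = {
    "a": ["b", "c"],
    "b": ["a", "c", "d"],
    "c": ["a", "b"],
    "d": ["b", "e"],
    "e": ["d"],
}


def degree_centrality(graph):
    """Degree centrality: the number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def shortest_path_lengths(graph, source):
    """Breadth-first search distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def closeness_centrality(graph):
    """Closeness: (n - 1) / sum of distances to all other nodes (connected graph assumed)."""
    n = len(graph)
    result = {}
    for node in graph:
        dist = shortest_path_lengths(graph, node)
        result[node] = (n - 1) / sum(dist.values())
    return result
```

In this toy graph, node `b` has the most neighbors and is also the closest to everything else, so both measures rank it first.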
7SET Fall 2007
18. Graph-based methods: harmonic functions, random walks, PageRank
8Random walks and harmonic functions
- Drunkard's walk:
- Start at a position x on a line from 0 to 5
- What is the probability of reaching 5 before reaching 0?
- Harmonic functions:
- $P(0) = 0$
- $P(N) = 1$
- $P(x) = \tfrac{1}{2} P(x-1) + \tfrac{1}{2} P(x+1)$, for $0 < x < N$
- (in general, replace ½ with the bias in the walk)
(Figure: a line with positions 0, 1, 2, 3, 4, 5)
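For the unbiased walk, the boundary conditions and the averaging condition on this slide determine P completely. The short derivation below is a worked sketch of that closed form; it is not on the original slide.

```latex
% Rearranging the averaging condition shows that successive differences are equal:
P(x+1) - P(x) = P(x) - P(x-1), \qquad 0 < x < N
% so P is linear in x; the boundary values P(0) = 0 and P(N) = 1 then give
P(x) = \frac{x}{N}
% For N = 5: P(1) = 1/5, P(2) = 2/5, P(3) = 3/5, P(4) = 4/5.
```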
9The original Dirichlet problem
- Distribution of temperature in a sheet of metal.
- One end of the sheet has temperature $t_0$, the other end $t_1$.
- Laplace's differential equation: $\nabla^2 u = 0$
- This is a special (steady-state) case of the (transient) heat equation.
- In general, the solutions to this equation are called harmonic functions.
10Learning harmonic functions
- The method of relaxations:
- Discrete approximation.
- Assign fixed values to the boundary points.
- Assign arbitrary values to all other points.
- Adjust their values to be the average of their neighbors.
- Repeat until convergence.
- Monte Carlo method:
- Perform a random walk on the discrete representation.
- Compute f as the probability of a random walk ending in a particular fixed point.
- Eigenvector methods:
- Look at the stationary distribution of a random walk.
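The method of relaxations can be sketched for the simplest case, the line 0..n with boundary values f(0) = 0 and f(n) = 1. The function name `relax_harmonic`, the tolerance, and the in-place update order are assumptions made for this illustration.

```python
def relax_harmonic(n, tol=1e-10, max_iters=100_000):
    """Method of relaxations on the line 0..n with f(0) = 0 and f(n) = 1."""
    f = [0.0] * (n + 1)  # interior points start at an arbitrary value (0 here)
    f[n] = 1.0           # boundary values stay fixed throughout
    for _ in range(max_iters):
        largest_change = 0.0
        for x in range(1, n):
            new_value = 0.5 * (f[x - 1] + f[x + 1])  # average of the two neighbors
            largest_change = max(largest_change, abs(new_value - f[x]))
            f[x] = new_value
        if largest_change < tol:  # repeat until convergence
            break
    return f
```

Calling `relax_harmonic(5)` should approach the linear profile x/5, the harmonic function for the unbiased walk on six points.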
11Eigenvectors and eigenvalues
- An eigenvector is an implicit direction for a matrix: $Av = \lambda v$, where v (the eigenvector) is non-zero, though $\lambda$ (the eigenvalue) can be any complex number in principle.
- Computing eigenvalues: $\det(A - \lambda I) = 0$
12Eigenvectors and eigenvalues
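- Example:
- $\det(A - \lambda I) = (-1-\lambda)(-\lambda) - 3 \cdot 2 = 0$
- Then $\lambda + \lambda^2 - 6 = 0$; $\lambda_1 = 2$, $\lambda_2 = -3$
- For $\lambda_1 = 2$:
- Solutions: $x_1 = x_2$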
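The arithmetic in this example can be checked numerically. The matrix itself did not survive in the transcript, so the sketch below assumes A = [[-1, 3], [2, 0]], which is consistent with the determinant and with the solution for the first eigenvalue shown above; the helper `eigenvalues_2x2` is hypothetical and handles only real eigenvalues.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from lambda^2 - (a + d)*lambda + (ad - bc) = 0."""
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("complex eigenvalues; this sketch handles only the real case")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2


# Assumed matrix (reconstructed from the determinant on the slide, not given explicitly).
A = [[-1, 3], [2, 0]]
lam1, lam2 = eigenvalues_2x2(A[0][0], A[0][1], A[1][0], A[1][1])

# Any vector with x1 = x2 should be an eigenvector for lam1.
v = [1, 1]
Av = [A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1]]
```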
13Stochastic matrices
- Stochastic matrices: each row (or column) adds up to 1 and no value is less than 0.
- The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$.
- For $\lambda_1$, the left (principal) eigenvector is p, and the right eigenvector is 1 (the all-ones vector).
- In other words, $E^T p = p$.
14Electrical networks and random walks
- Ergodic (connected) Markov chain with transition matrix P
- Fixed vector w: $wP = w$
(Figure: electrical network on nodes a, b, c, d with resistors of 1 Ω, 1 Ω, 0.5 Ω, and 0.5 Ω)
From Doyle and Snell 2000
15Electrical networks and random walks
(Figure: electrical network on nodes a, b, c, d with resistors of 1 Ω, 1 Ω, 0.5 Ω, and 0.5 Ω, and a 1 V source)
- $v_x$ is the probability that a random walk starting at x will reach a before reaching b.
- The random walk interpretation allows us to use Monte Carlo methods to solve electrical circuits.
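A Monte Carlo estimate of such a voltage can be sketched as follows. The network below is hypothetical, since the exact wiring of the slide's figure is not recoverable from the transcript. The sketch assumes the standard correspondence in which the walk leaves a node along an edge with probability proportional to that edge's conductance (1 / resistance), with a held at 1 V and b at 0 V.

```python
import random

# Hypothetical conductances (1 / resistance in ohms); NOT the network drawn on the slide.
CONDUCTANCE = {
    "a": {"c": 1.0, "d": 2.0},
    "b": {"c": 2.0, "d": 1.0},
    "c": {"a": 1.0, "b": 2.0, "d": 1.0},
    "d": {"a": 2.0, "b": 1.0, "c": 1.0},
}


def estimate_voltage(start, trials=20_000, seed=0):
    """Estimate v_start as the fraction of random walks that reach a before b."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        node = start
        while node not in ("a", "b"):
            neighbors = list(CONDUCTANCE[node])
            weights = [CONDUCTANCE[node][nb] for nb in neighbors]
            # Step to a neighbor with probability proportional to the edge conductance.
            node = rng.choices(neighbors, weights=weights)[0]
        if node == "a":
            hits += 1
    return hits / trials
```

For this particular hypothetical network, solving the averaging equations by hand gives v_c = 0.4 and v_d = 0.6, so the estimates should land close to those values.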
16Markov chains
- A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
- Path: a sequence $(x_0, x_1, \ldots, x_n)$, with $X_i = x_{i-1} E$
- The probability of a path can be computed as a product of probabilities for each step i.
- Random walk: find $X_j$ given $x_0$, E, and j.
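Both computations on this slide fit in a few lines. The sketch below assumes a row-stochastic kernel stored as a list of rows and an initial distribution stored as a list; the two-state kernel `E`, the vector `X0`, and the function names are hypothetical.

```python
def path_probability(x0, kernel, path):
    """Probability of a path: x0[first state] times the product of the step probabilities."""
    prob = x0[path[0]]
    for s, t in zip(path, path[1:]):
        prob *= kernel[s][t]
    return prob


def distribution_after(x0, kernel, steps):
    """Apply the row-vector update x_i = x_{i-1} E the given number of times."""
    x = list(x0)
    n = len(x)
    for _ in range(steps):
        x = [sum(x[s] * kernel[s][t] for s in range(n)) for t in range(n)]
    return x


# Hypothetical two-state kernel and initial distribution (all mass on state 0).
E = [[0.9, 0.1],
     [0.5, 0.5]]
X0 = [1.0, 0.0]
```

With these values, the path 0, 0, 1 has probability 0.9 × 0.1 = 0.09, and two steps of the walk give the distribution (0.86, 0.14).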
17Stationary solutions
- The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
- E is stochastic
- E is irreducible
- E is aperiodic
- To make these conditions true:
- All rows of E add up to 1 (and no value is negative)
- Make sure that E is strongly connected
- Make sure that E is not bipartite
- Example: PageRank [Brin and Page 1998] uses teleportation
18 Example
This graph E has a second graph E' (not drawn) superimposed on it: E' is the uniform transition graph.
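Superimposing a uniform transition graph amounts to mixing each row of E with the uniform distribution. The sketch below is one way to write that; the parameter name `teleport` and its default of 0.15 are assumptions for illustration and are not stated in the slides.

```python
def add_teleportation(E, teleport=0.15):
    """Mix a row-stochastic matrix E with the uniform transition matrix.

    With probability (1 - teleport) the walk follows E; with probability
    teleport it jumps to a node chosen uniformly at random.
    """
    n = len(E)
    return [
        [(1 - teleport) * E[i][j] + teleport / n for j in range(n)]
        for i in range(n)
    ]


# Hypothetical two-node cycle: stochastic, but periodic (bipartite).
CYCLE = [[0.0, 1.0],
         [1.0, 0.0]]
MIXED = add_teleportation(CYCLE)
```

After mixing, every entry is strictly positive, so the chain is both irreducible and aperiodic, while each row still sums to 1.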
19Eigenvectors
- An eigenvector is an implicit direction for a matrix.
- $Ev = \lambda v$, where v is non-zero, though $\lambda$ can be any complex number in principle.
- The largest eigenvalue of a stochastic matrix E is real: $\lambda_1 = 1$.
- For $\lambda_1$, the left (principal) eigenvector is p, and the right eigenvector is 1 (the all-ones vector).
- In other words, $E^T p = p$.
20Computing the stationary distribution
function PowerStatDist(E)
begin
    p(0) = u   (or p(0) = [1, 0, ..., 0])
    i = 1
    repeat
        p(i) = E^T p(i-1)
        L = ||p(i) - p(i-1)||_1
        i = i + 1
    until L < ε
    return p(i-1)
end
Solution for the stationary distribution
Convergence rate is O(m)
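A runnable version of the power method might look like the sketch below. It assumes a row-stochastic matrix stored as a list of rows, starts from the uniform distribution u, and stops when the L1 change drops below a threshold; the function name `power_stat_dist`, the defaults, and the two-state example matrix are assumptions.

```python
def power_stat_dist(E, eps=1e-10, max_iters=10_000):
    """Power method for the stationary distribution p satisfying E^T p = p."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = u, the uniform distribution
    for _ in range(max_iters):
        # p(i) = E^T p(i-1): entry j collects the probability flowing into state j.
        new_p = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(new_p, p))  # L1 norm of the update
        p = new_p
        if change < eps:
            break
    return p


# Hypothetical two-state chain; its stationary distribution is (5/6, 1/6).
EXAMPLE = [[0.9, 0.1],
           [0.5, 0.5]]
STATIONARY = power_stat_dist(EXAMPLE)
```

The result can be checked by hand: balance between the two states requires 0.1 p_0 = 0.5 p_1, which together with p_0 + p_1 = 1 gives (5/6, 1/6).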
21 Example
22PageRank
- Developed at Stanford and allegedly still being used at Google.
- Not query-specific, although query-specific varieties exist.
- In general, each page is indexed along with the anchor texts pointing to it.
- Among the pages that match the user's query, Google shows the ones with the largest PageRank.
- Google also uses vector-space matching, keyword proximity, anchor text, etc.
25SET Winter 2007
19. Hubs and authorities, bipartite graphs, HITS and SALSA, models of the Web
26HITS
- Hyperlink-Induced Topic Search.
- Developed by Jon Kleinberg and colleagues at IBM Almaden as part of the CLEVER engine.
- HITS is query-specific.
- Hubs and authorities, e.g., collections of bookmarks about cars vs. actual sites about cars.
27HITS
- Each node in the graph is ranked for hubness (h) and authoritativeness (a).
- Some nodes may have high scores on both.
- Example authorities for the query "java":
- www.gamelan.com
- java.sun.com
- digitalfocus.com/digitalfocus/ (The Java developer)
- lightyear.ncsa.uiuc.edu/srp/java/javabooks.html
- sunsite.unc.edu/javafaq/javafaq.html
28HITS
- HITS algorithm:
- obtain a root set (using a search engine) related to the input query
- expand the root set by radius one on either side (typically to size 1000-5000)
- run iterations on the hub and authority scores together
- report top-ranking authorities and hubs
- Eigenvector interpretation
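The iteration step can be sketched as follows, assuming the expanded link graph is already available as a dictionary from each page to the pages it links to. The function name `hits`, the fixed iteration count, and the toy graph are assumptions. Under the eigenvector interpretation, with adjacency matrix A, the authority scores converge to the principal eigenvector of A^T A and the hub scores to that of A A^T.

```python
import math


def hits(links, iterations=50):
    """Iterate hub and authority scores together on a directed link graph."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of the pages that point to the node.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of the pages the node points to.
        hub = {n: sum(auth[t] for t in links.get(n, [])) for n in nodes}
        # Normalize both score vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical toy graph: h1 and h2 act as hubs; x and y act as authorities.
LINKS = {"h1": ["x", "y"], "h2": ["x"], "x": [], "y": []}
HUB, AUTH = hits(LINKS)
```

In the toy graph, `x` is linked from both hubs and so ends up as the top authority, while `h1` links to both authorities and ends up as the top hub.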
29Example
slide from Baldi et al.
30HITS
- HITS is now used by Ask.com and Teoma.com.
- It can also be used to identify communities (e.g., based on synonyms) as well as controversial topics.
- Example for "jaguar":
- Principal eigenvector gives pages about the animal
- The positive end of the second nonprincipal eigenvector gives pages about the football team
- The positive end of the third nonprincipal eigenvector gives pages about the car.
- Example for "abortion":
- The positive end of the second nonprincipal eigenvector gives pages on planned parenthood and reproductive rights
- The negative end of the same eigenvector includes pro-life sites.
- SALSA (Lempel and Moran 2001)
31Models of the Web
- Evolving networks: a fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
32Evolving Word-based Web
- Observations:
- Links are made based on topics
- Topics are expressed with words
- Words are distributed very unevenly (Zipf, Benford, self-triggerability laws)
- Model:
- Pick n
- Generate n lengths according to a power-law distribution
- Generate n documents using a trigram model
- Model (cont'd):
- Pick words in decreasing order of r.
- Generate hyperlinks with random directionality
- Outcome:
- Generates power-law degree distributions
- Generates topical communities
- Natural variation of PageRank: LexRank
33Readings
- MRS 11, MRS 12, paper by Church and Gale (http://citeseer.ist.psu.edu/church95poisson.html)