Search Engine Technology - PowerPoint PPT Presentation

1
Search Engine Technology (11)
http://www.cs.columbia.edu/radev/SET07.html
  • November 15, 2007
  • Prof. Dragomir R. Radev
  • radev_at_umich.edu

2
SET Fall 2007
17. continued
3
Slide from Reka Albert
4
Slide from Reka Albert
5
The strength of weak ties
  • Granovetter's study: finding jobs
  • Weak ties: more people can be reached through
    weak ties than through strong ties (e.g., through
    your 7th and 8th best friends)
  • More here: http://en.wikipedia.org/wiki/Weak_tie

6
Prestige and centrality
  • Degree centrality: how many neighbors each node
    has.
  • Closeness centrality: how close a node is to all
    of the other nodes.
  • Betweenness centrality: based on the role that a
    node plays by virtue of being on the path between
    two other nodes.
  • Eigenvector centrality: the paths in the random
    walk are weighted by the centrality of the nodes
    that the path connects.
  • Prestige: same as centrality but for directed
    graphs.
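
The first two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the four-node path graph is hypothetical, and closeness is taken to be (n - 1) divided by the sum of shortest-path distances, which is one common normalization (the slide gives no formula).

```python
from collections import deque


def degree_centrality(graph):
    """Degree centrality: the number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in graph.items()}


def closeness_centrality(graph):
    """Closeness: (reachable nodes) / (sum of shortest-path distances).

    Assumed normalization; distances come from a breadth-first search.
    """
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in graph[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical undirected path graph: a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
degree = degree_centrality(graph)
closeness = closeness_centrality(graph)
print(degree)
print(closeness)
```

In this toy graph the two inner nodes score higher than the two end nodes on both measures.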

7
SET Fall 2007
18. Graph-based methods
Harmonic functions
Random walks
PageRank
8
Random walks and harmonic functions
  • Drunkard's walk
  • Start at position 0 on a line
  • What is the prob. of reaching 5 before reaching
    0?
  • Harmonic functions
  • $P(0) = 0$
  • $P(N) = 1$
  • $P(x) = \tfrac{1}{2} P(x-1) + \tfrac{1}{2} P(x+1)$, for $0 < x < N$
  • (in general, replace ½ with the bias in the walk)
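
For the unbiased walk, the boundary-value problem above can be solved by hand. The short derivation below is a worked sketch added for illustration; the choice of starting point x = 2 in the final line is an assumption made only to give a concrete number.

```latex
% The averaging condition says consecutive differences are equal:
P(x+1) - P(x) = P(x) - P(x-1) = c \quad \text{for } 0 < x < N .
% So P is linear in x. With P(0) = 0 this gives P(x) = c\,x,
% and P(N) = 1 forces c = 1/N:
P(x) = \frac{x}{N} .
% Example with N = 5 and an assumed start at x = 2:
P(2) = \frac{2}{5} = 0.4 .
```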

[Figure: number line with positions 0 through 5]
9
The original Dirichlet problem
  • Distribution of temperature in a sheet of metal.
  • One end of the sheet has temperature $t_0$, the
    other end $t_1$.
  • Laplace's differential equation: $\nabla^2 u = 0$
  • This is a special (steady-state) case of the
    (transient) heat equation
  • In general, the solutions to this equation are
    called harmonic functions.

10
Learning harmonic functions
  • The method of relaxations:
  • Discrete approximation.
  • Assign fixed values to the boundary points.
  • Assign arbitrary values to all other points.
  • Adjust their values to be the average of their
    neighbors.
  • Repeat until convergence.
  • Monte Carlo method:
  • Perform a random walk on the discrete
    representation.
  • Compute f as the probability of a random walk
    ending in a particular fixed point.
  • Eigenvector methods:
  • Look at the stationary distribution of a random
    walk
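
The first two methods can be sketched for the one-dimensional case (a line of points 0..n with boundary values 0 and 1). This is an illustrative sketch rather than code from the lecture: the function names, tolerance, trial count, and random seed are all assumptions.

```python
import random


def relax(n, tol=1e-10, max_sweeps=100_000):
    """Method of relaxations on the points 0..n with f(0) = 0, f(n) = 1."""
    f = [0.0] * (n + 1)   # arbitrary interior values (here: all zero)
    f[n] = 1.0            # fixed boundary values
    for _ in range(max_sweeps):
        change = 0.0
        for x in range(1, n):
            # Replace each interior value by the average of its neighbors.
            new = 0.5 * (f[x - 1] + f[x + 1])
            change = max(change, abs(new - f[x]))
            f[x] = new
        if change < tol:  # repeat until convergence
            break
    return f


def monte_carlo(n, start, trials=20_000, seed=0):
    """Estimate f(start) as the fraction of random walks that end at n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = start
        while 0 < x < n:
            x += rng.choice((-1, 1))
        hits += x == n
    return hits / trials


f = relax(5)
estimate = monte_carlo(5, start=2)
print(f[2], estimate)
```

Both numbers should be close to 2/5, the exact value for an unbiased walk on 0..5 started at position 2; the Monte Carlo estimate is only accurate up to sampling noise.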

11
Eigenvectors and eigenvalues
  • An eigenvector is an implicit direction for a
    matrix: $Av = \lambda v$, where v (the eigenvector)
    is non-zero, though $\lambda$ (the eigenvalue) can be
    any complex number in principle
  • Computing eigenvalues: solve $\det(A - \lambda I) = 0$

12
Eigenvectors and eigenvalues
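  • Example
  • $\det(A - \lambda I) = (-1 - \lambda)(-\lambda) - 3 \cdot 2 = 0$
  • Then $\lambda + \lambda^2 - 6 = 0$; $\lambda_1 = 2$; $\lambda_2 = -3$
  • For $\lambda_1 = 2$:
  • Solutions: $x_1 = x_2$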
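
The arithmetic in this example can be checked numerically. The matrix itself did not survive in the transcript, so the sketch below assumes A = [[-1, 3], [2, 0]], which is one matrix consistent with the characteristic polynomial and with the solution x1 = x2 shown above; the helper name is hypothetical.

```python
import math


def eigenvalues_2x2(a, b, c, d):
    """Real eigenvalues of [[a, b], [c, d]] from det(A - lambda*I) = 0.

    The characteristic polynomial is lambda^2 - trace*lambda + det.
    """
    trace = a + d
    det = a * d - b * c
    disc = trace * trace - 4 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace + root) / 2, (trace - root) / 2


# Assumed matrix (not shown in the transcript).
A = [[-1, 3], [2, 0]]
lam1, lam2 = eigenvalues_2x2(A[0][0], A[0][1], A[1][0], A[1][1])

# Check that v = (1, 1), i.e. x1 = x2, satisfies A v = lam1 * v.
v = (1.0, 1.0)
Av = (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])
print(lam1, lam2, Av)
```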

13
Stochastic matrices
  • Stochastic matrices: each row (or column) adds up
    to 1 and no value is less than 0. Example:
  • The largest eigenvalue of a stochastic matrix E
    is real: $\lambda_1 = 1$.
  • For $\lambda_1$, the left (principal) eigenvector is p,
    the right eigenvector is the all-ones vector $\mathbf{1}$.
  • In other words, $E^T p = p$.

14
Electrical networks and random walks
  • Ergodic (connected) Markov chain with transition
    matrix P

[Figure: circuit with nodes a, b, c, d joined by resistors of 1 Ω, 1 Ω, 0.5 Ω, and 0.5 Ω; labeled $wP = w$]
From Doyle and Snell 2000
15
Electrical networks and random walks
[Figure: circuit with nodes a, b, c, d joined by resistors of 1 Ω, 1 Ω, 0.5 Ω, and 0.5 Ω; labeled 1 V]
  • $v_x$ is the probability that a random walk starting
    at x will reach a before reaching b.
  • The random walk interpretation allows us to use
    Monte Carlo methods to solve electrical circuits.

16
Markov chains
  • A homogeneous Markov chain is defined by an
    initial distribution x and a Markov kernel E.
  • Path: sequence $(x_0, x_1, \ldots, x_n)$. $X_i = x_{i-1} E$
  • The probability of a path can be computed as a
    product of probabilities for each step i.
  • Random walk: find $X_j$ given $x_0$, E, and j.
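
A small sketch of these definitions, assuming a hypothetical two-state kernel E and a start in state 0. The function names are illustrative; distributions are treated as row vectors, so one step is x E.

```python
def step(x, E):
    """One step of the chain: row vector x times kernel E."""
    n = len(E)
    return [sum(x[i] * E[i][j] for i in range(n)) for j in range(n)]


def distribution_after(x0, E, j):
    """Random walk: the distribution X_j given x0, E, and j."""
    x = list(x0)
    for _ in range(j):
        x = step(x, E)
    return x


def path_probability(path, x0, E):
    """Probability of a path as a product of per-step probabilities."""
    prob = x0[path[0]]
    for a, b in zip(path, path[1:]):
        prob *= E[a][b]
    return prob


# Hypothetical two-state kernel; each row sums to 1.
E = [[0.9, 0.1], [0.5, 0.5]]
x0 = [1.0, 0.0]
x2 = distribution_after(x0, E, 2)
p_path = path_probability([0, 0, 1], x0, E)
print(x2, p_path)
```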

17
Stationary solutions
  • The fundamental Ergodic Theorem for Markov chains
    (Grimmett and Stirzaker 1989) says that the
    Markov chain with kernel E has a stationary
    distribution p under three conditions:
  • E is stochastic
  • E is irreducible
  • E is aperiodic
  • To make these conditions true:
  • All rows of E add up to 1 (and no value is
    negative)
  • Make sure that E is strongly connected
  • Make sure that E is not bipartite
  • Example: PageRank (Brin and Page 1998) uses
    teleportation
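
One way to see how teleportation enforces all three conditions is to build the transition matrix explicitly. The sketch below is an assumption-laden illustration: the damping value 0.85, the handling of nodes without out-links, and the three-node graph are not taken from the slides.

```python
def with_teleportation(adjacency, damping=0.85):
    """Row-stochastic matrix mixing link-following with uniform jumps.

    With probability `damping` follow a random out-link; otherwise jump
    to a node chosen uniformly at random. Every entry ends up positive,
    so the chain is strongly connected and aperiodic.
    """
    n = len(adjacency)
    matrix = []
    for row in adjacency:
        out_degree = sum(row)
        if out_degree == 0:
            # Assumed convention: a node with no out-links jumps uniformly.
            matrix.append([1.0 / n] * n)
        else:
            matrix.append(
                [damping * a / out_degree + (1 - damping) / n for a in row]
            )
    return matrix


# Hypothetical graph: 0 -> 1, 1 -> 0, and node 2 has no out-links.
# Without teleportation it is periodic and not strongly connected.
adjacency = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
E = with_teleportation(adjacency)
for row in E:
    print(row)
```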

18
Example
This graph E has a second graph E' (not drawn)
superimposed on it: E' is the uniform transition
graph.
19
Eigenvectors
  • An eigenvector is an implicit direction for a
    matrix.
  • $Ev = \lambda v$, where v is non-zero, though $\lambda$ can be
    any complex number in principle.
  • The largest eigenvalue of a stochastic matrix E
    is real: $\lambda_1 = 1$.
  • For $\lambda_1$, the left (principal) eigenvector is p,
    the right eigenvector is the all-ones vector $\mathbf{1}$.
  • In other words, $E^T p = p$.

20
Computing the stationary distribution
function PowerStatDist(E)
begin
    p(0) = u    (or p(0) = (1, 0, ..., 0))
    i = 1
    repeat
        p(i) = E^T p(i-1)
        L = ||p(i) - p(i-1)||_1
        i = i + 1
    until L < ε
    return p(i-1)
end
Solution for the stationary distribution
Convergence rate is O(m)
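
The PowerStatDist pseudocode above translates almost line by line into Python. This is a sketch under stated assumptions: the default tolerance, the iteration cap, and the two-state example matrix are not from the slides.

```python
def power_stat_dist(E, eps=1e-12, max_iter=10_000):
    """Power method for the stationary distribution of a row-stochastic E."""
    n = len(E)
    p = [1.0 / n] * n  # p(0) = u, the uniform distribution
    for _ in range(max_iter):
        # p(i) = E^T p(i-1): entry j collects probability flowing into j.
        p_next = [sum(E[i][j] * p[i] for i in range(n)) for j in range(n)]
        # L = L1 norm of the change between successive iterates.
        delta = sum(abs(a - b) for a, b in zip(p_next, p))
        p = p_next
        if delta < eps:
            break
    return p


# Hypothetical two-state chain; its stationary distribution is (5/6, 1/6).
E = [[0.9, 0.1], [0.5, 0.5]]
p = power_stat_dist(E)
print(p)
```

The function returns the last iterate it computed, which avoids the off-by-one that arises if the loop counter is incremented before the return.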
21
Example
22
PageRank
  • Developed at Stanford and allegedly still being
    used at Google.
  • Not query-specific, although query-specific
    varieties exist.
  • In general, each page is indexed along with the
    anchor texts pointing to it.
  • Among the pages that match the user's query,
    Google shows the ones with the largest PageRank.
  • Google also uses vector-space matching, keyword
    proximity, anchor text, etc.

25
SET Winter 2007
19. Hubs and authorities
Bipartite graphs
HITS and SALSA
Models of the web
26
HITS
  • Hyperlink-Induced Topic Search.
  • Developed by Jon Kleinberg and colleagues at IBM
    Almaden as part of the CLEVER engine.
  • HITS is query-specific.
  • Hubs and authorities, e.g. collections of
    bookmarks about cars vs. actual sites about cars.

27
HITS
  • Each node in the graph is ranked for hubness (h)
    and authoritativeness (a).
  • Some nodes may have high scores on both.
  • Example authorities for the query "java":
  • www.gamelan.com
  • java.sun.com
  • digitalfocus.com/digitalfocus/ (The Java
    developer)
  • lightyear.ncsa.uiuc.edu/srp/java/javabooks.html
  • sunsite.unc.edu/javafaq/javafaq.html

28
HITS
  • HITS algorithm:
  • obtain root set (using a search engine) related
    to the input query
  • expand the root set by radius one on either side
    (typically to size 1000-5000)
  • run iterations on the hub and authority scores
    together
  • report top-ranking authorities and hubs
  • Eigenvector interpretation
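
The iteration step on hub and authority scores can be sketched as follows. This is a minimal illustration that skips the root-set and expansion steps: the tiny link graph is hypothetical, and the fixed iteration count and Euclidean normalization are assumptions, since the slides do not specify them.

```python
import math


def hits(links, iterations=50):
    """Iterate hub and authority scores together.

    links[u] is the list of pages that u points to.
    Returns (hub, authority) score dictionaries.
    """
    nodes = set(links) | {v for targets in links.values() for v in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages linking in.
        auth = {n: 0.0 for n in nodes}
        for u, targets in links.items():
            for v in targets:
                auth[v] += hub[u]
        # Hub score: sum of authority scores of the pages linked to.
        hub = {n: sum(auth[v] for v in links.get(n, [])) for n in nodes}
        # Normalize both vectors to unit Euclidean length.
        for scores in (auth, hub):
            norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth


# Hypothetical graph: two bookmark-style pages pointing at two sites.
links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(sorted(auth, key=auth.get, reverse=True))
print(sorted(hub, key=hub.get, reverse=True))
```

In this toy graph, a1 ends up as the top authority because both hubs point to it, and h1 ends up as the top hub because it points to both authorities.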

29
Example
slide from Baldi et al.
30
HITS
  • HITS is now used by Ask.com and Teoma.com.
  • It can also be used to identify communities
    (e.g., based on synonyms) as well as
    controversial topics.
  • Example for "jaguar":
  • Principal eigenvector gives pages about the
    animal
  • The positive end of the second nonprincipal
    eigenvector gives pages about the football team
  • The positive end of the third nonprincipal
    eigenvector gives pages about the car.
  • Example for "abortion":
  • The positive end of the second nonprincipal
    eigenvector gives pages on planned parenthood
    and reproductive rights
  • The negative end of the same eigenvector includes
    pro-life sites.
  • SALSA (Lempel and Moran 2001)

31
Models of the Web
  • Evolving networks: fundamental object of
    statistical physics, social networks,
    mathematical biology, and epidemiology
  • Erdős/Rényi 59, 60
  • Barabási/Albert 99
  • Watts/Strogatz 98
  • Kleinberg 98
  • Menczer 02
  • Radev 03

32
Evolving Word-based Web
  • Observations:
  • Links are made based on topics
  • Topics are expressed with words
  • Words are distributed very unevenly (Zipf,
    Benford, self-triggerability laws)
  • Model:
  • Pick n
  • Generate n lengths according to a power-law
    distribution
  • Generate n documents using a trigram model
  • Model (cont'd):
  • Pick words in decreasing order of r.
  • Generate hyperlinks with random directionality
  • Outcome:
  • Generates power-law degree distributions
  • Generates topical communities
  • Natural variation of PageRank: LexRank

33
Readings
  • MRS 11, MRS 12, paper by Church and Gale
    (http://citeseer.ist.psu.edu/church95poisson.html)