What is page importance



1
What is page importance?
  • Page importance is hard to define in a single way
    that satisfies everyone. There are, however, some
    desiderata
  • It should be sensitive to
  • The query
  • Or at least the topic of the query..
  • The user
  • Or at least the user population
  • The link structure of the web
  • The amount of accesses the page gets
  • It should be stable w.r.t. small random changes
    in the network link structure
  • It shouldn't be easy to subvert with intentional
    changes to link structure

How about the eloquence of the page?
The informativeness of the page?
2
Desiderata for link-based ranking
  • A page that is referenced by a lot of important
    pages (has more back links) is more important
    (Authority)
  • A page referenced by a single important page may
    be more important than that referenced by five
    unimportant pages
  • A page that references a lot of important pages
    is also important (Hub)
  • Importance can be propagated
  • Your importance is the weighted sum of the
    importance conferred on you by the pages that
    refer to you
  • The importance you confer on a page may be
    proportional to how many other pages you refer to
    (cite)
  • (Also what you say about them when you cite them!)

Different notions of importance
Question: Can we assign consistent authority/hub
values to pages?
3
Authorities and Hubs as mutually reinforcing
properties
  • Authorities and hubs related to the same query
    tend to form a bipartite subgraph of the web
    graph.
  • Suppose each page has an authority score a(p) and
    a hub score h(p)

[Figure: bipartite subgraph with hub pages on one side linking to authority pages on the other]
4
Authority and Hub Pages
  • I (Authority Computation): for each page p,
    a(p) = Σ h(q) over all q such that (q, p) ∈ E
  • O (Hub Computation): for each page p,
    h(p) = Σ a(q) over all q such that (p, q) ∈ E

[Figure: pages q1, q2, q3 pointing to p (authority case); p pointing to q1, q2, q3 (hub case)]
A set of simultaneous equations. Can we solve
these?
5
Authority and Hub Pages (8)
  • Matrix representation of operations I and O.
  • Let A be the adjacency matrix of SG: entry (p, q)
    is 1 if p has a link to q, else the entry is 0.
  • Let A^T be the transpose of A.
  • Let h_i be the vector of hub scores after i
    iterations.
  • Let a_i be the vector of authority scores after i
    iterations.
  • Operation I: a_i = A^T h_{i-1}
  • Operation O: h_i = A a_i (see the sketch below)

Normalize after every multiplication
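Here is a minimal sketch (not from the original slides) of operations I and O in this matrix form, using numpy; the normalization after every multiplication follows the note above.

```python
import numpy as np

def hits(A, iters=50):
    """Iterate a = A^T h and h = A a, normalizing after every multiplication.

    A[p, q] = 1 if page p has a link to page q (the adjacency matrix of
    the induced subgraph SG defined above)."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h                 # Operation I: authorities from hubs
        a /= np.linalg.norm(a)      # normalize to unit Euclidean length
        h = A @ a                   # Operation O: hubs from authorities
        h /= np.linalg.norm(h)
    return a, h
```

Because normalization only rescales the vectors, normalizing a before computing h gives the same normalized scores as the order used in the worked example on the next slides.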
6
Authority and Hub Pages (11)
  • Example: Initialize all scores to 1.
  • 1st Iteration
  • I operation:
  • a(q1) = 1, a(q2) = a(q3) = 0,
  • a(p1) = 3, a(p2) = 2
  • O operation: h(q1) = 5,
  • h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
  • Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0,
  • a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645,
  • h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129,
    h(p2) = 0

[Figure: example graph on pages q1, q2, q3, p1, p2]
7
Authority and Hub Pages (12)
  • After 2 Iterations:
  • a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791,
  • a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371,
  • h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0
  • After 5 Iterations:
  • a(q1) = a(q2) = a(q3) = 0,
  • a(p1) = 0.788, a(p2) = 0.615
  • h(q1) = 0.657, h(q2) = 0.369,
  • h(q3) = 0.657, h(p1) = h(p2) = 0
  • (These numbers are reproduced in the sketch below.)

[Figure: the same example graph on pages q1, q2, q3, p1, p2]
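The numbers above can be checked with the hits() sketch from the earlier slide. The exact edge set is not in the transcript; the one below is inferred from the figure and the reported scores, so treat it as an assumption.

```python
import numpy as np

# Edge set inferred from the figure and the reported scores (an assumption):
pages = ["q1", "q2", "q3", "p1", "p2"]
edges = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"),
         ("q3", "p1"), ("q3", "p2"), ("p1", "q1")]
idx = {p: i for i, p in enumerate(pages)}
A = np.zeros((len(pages), len(pages)))
for u, v in edges:
    A[idx[u], idx[v]] = 1

a, h = hits(A, iters=5)   # hits() is the sketch defined above
# a is approximately (0, 0, 0, 0.788, 0.615) for (q1, q2, q3, p1, p2)
# h is approximately (0.657, 0.369, 0.657, 0, 0), matching the slide
```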
8
What happens if you multiply a vector by a matrix?
  • In general, when you multiply a vector by a
    matrix, the vector gets scaled as well as
    rotated
  • ..except when the vector happens to be in the
    direction of one of the eigen vectors of the
    matrix
  • .. in which case it only gets scaled (stretched)
  • A (symmetric square) matrix has all real eigen
    values, and the values give an indication of the
    amount of stretching that is done for vectors in
    that direction
  • The eigen vectors of the matrix define a new
    ortho-normal space
  • You can model the multiplication of a general
    vector by the matrix in terms of
  • First decompose the general vector into its
    projections in the eigen vector directions
  • ..which means just take the dot product of the
    vector with the (unit) eigen vector
  • Then multiply the projections by the
    corresponding eigen values to get the new vector.
  • This explains why power method converges to
    principal eigen vector..
  • ..since if a vector has a non-zero projection in
    the principal eigen vector direction, then
    repeated multiplication will keep stretching the
    vector in that direction, so that eventually all
    other directions vanish by comparison..

Optional
9
(why) Does the procedure converge?
As we multiply repeatedly with M, the component
of x in the direction of the principal eigen vector
gets stretched w.r.t. the other directions, so we
finally converge to the direction of the principal
eigen vector. Necessary condition: x must have a
component in the direction of the principal eigen
vector (c1 must be non-zero).
The rate of convergence depends on the eigen gap
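A minimal sketch of the power method just described (my own illustration, not from the slides): repeated multiplication by M with rescaling so the vector does not blow up; the smaller the eigen gap, the slower the convergence.

```python
import numpy as np

def power_iteration(M, iters=100, tol=1e-10):
    """Converge to the principal eigen vector of M, provided the start
    vector has a non-zero component in that direction (c1 != 0)."""
    rng = np.random.default_rng(0)
    x = rng.random(M.shape[0])          # random start: c1 != 0 almost surely
    for _ in range(iters):
        x_new = M @ x
        x_new /= np.linalg.norm(x_new)  # rescale; only the direction matters
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new
```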
10
Can we power iterate to get other (secondary)
eigen vectors?
  • Yes: just find a matrix M2 such that M2 has the
    same eigen vectors as M, but the eigen value
    corresponding to the first eigen vector e1 is
    zeroed out.
  • Now do power iteration on M2
  • Alternately, start with a random vector v, find a
    new vector v' = v - (v·e1)e1, and do power
    iteration on M with v' (see the sketch below)

Why? 1. M2 e1 = 0    2. If e2 is the
second eigen vector of M, then
it is also an eigen vector of M2
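A sketch of the second alternative above, assuming M is symmetric (as A^T A and A A^T are) so that eigen vectors are orthogonal, and assuming e1 is the unit principal eigen vector already computed (for example by the power_iteration sketch earlier).

```python
import numpy as np

def second_eigenvector(M, e1, iters=100):
    """Power iteration after projecting out e1: v' = v - (v . e1) e1."""
    rng = np.random.default_rng(1)
    v = rng.random(M.shape[0])
    v -= (v @ e1) * e1               # remove the component along e1
    for _ in range(iters):
        v = M @ v
        v -= (v @ e1) * e1           # re-project to suppress round-off drift
        v /= np.linalg.norm(v)
    return v
```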
11
Authority and Hub Pages
  • Algorithm (summary)
  • submit q to a search engine to obtain the
    root set S
  • expand S into the base set T
  • obtain the induced subgraph SG(V, E) using T
  • initialize a(p) = h(p) = 1 for all p in V
  • for each p in V until the scores converge
  • apply Operation I
  • apply Operation O
  • normalize a(p) and h(p)
  • return pages with top authority and hub scores

12
(No Transcript)
13
Base set computation
  • can be made easy by storing the link structure of
    the Web in advance in a link structure table
    (built during crawling)
  • --Most search engines serve this
    information now. (e.g. Google's link: search)
  • parent_url    child_url
  • url1          url2
  • url1          url3

14
Authority and Hub Pages (9)
  • After each iteration of applying Operations I
    and O, normalize all authority and hub scores.
  • Repeat until the scores for each page
    converge (the convergence is guaranteed).
  • 5. Sort pages in descending authority scores.
  • 6. Display the top authority pages.

15
Handling spam links
  • Should all links be equally treated?
  • Two considerations
  • Some links may be more meaningful/important than
    other links.
  • Web site creators may trick the system to make
    their pages more authoritative by adding dummy
    pages pointing to their cover pages (spamming).

16
Handling Spam Links (contd)
  • Transverse link: a link between pages with
    different domain names.
  • Domain name: the first level of the URL of a
    page.
  • Intrinsic link: a link between pages with the same
    domain name.
  • Transverse links are more important than
    intrinsic links.
  • Two ways to incorporate this
  • Use only transverse links and discard intrinsic
    links.
  • Give lower weights to intrinsic links.

17
Handling Spam Links (contd)
  • How to give lower weights to intrinsic links?
  • In adjacency matrix A, entry (p, q) should be
    assigned as follows
  • If p has a transverse link to q, the entry is 1.
  • If p has an intrinsic link to q, the entry is c,
    where 0 < c < 1 (see the sketch below).
  • If p has no link to q, the entry is 0.
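A small sketch of this weighting rule (the helper and its domain test are my own, using the hostname as a stand-in for the slide's "first level of the URL"): transverse links keep weight 1, intrinsic links get weight c.

```python
import numpy as np
from urllib.parse import urlparse

def weighted_adjacency(pages, links, c=0.5):
    """pages: list of URLs; links: list of (p, q) URL pairs.
    Entry (p, q) is 1 for a transverse link (different domains),
    c for an intrinsic link (same domain), and 0 for no link."""
    idx = {u: i for i, u in enumerate(pages)}
    domain = lambda u: urlparse(u).netloc
    A = np.zeros((len(pages), len(pages)))
    for p, q in links:
        A[idx[p], idx[q]] = 1.0 if domain(p) != domain(q) else c
    return A
```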

18
Considering link context
  • For a given link (p, q), let V(p, q) be the
    vicinity (e.g., ±50 characters) of the link.
  • If V(p, q) contains terms in the user query
    (topic), then the link should be more useful for
    identifying authoritative pages.
  • To incorporate this: in adjacency matrix A, make
    the weight associated with link (p, q) equal to
    1 + n(p, q),
  • where n(p, q) is the number of terms in V(p, q)
    that appear in the query (see the sketch below).
  • Alternately, consider the vector similarity
    between V(p,q) and the query Q
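A sketch of the first variant (helper name and arguments are mine): count the query terms that occur within the ±50-character vicinity of the anchor and use 1 + n(p, q) as the link weight.

```python
def link_weight(page_text, anchor_start, anchor_end, query_terms, window=50):
    """Weight 1 + n(p, q), where n(p, q) is the number of query terms
    appearing within `window` characters of the link's anchor text."""
    vicinity = page_text[max(0, anchor_start - window):anchor_end + window].lower()
    n = sum(1 for t in query_terms if t.lower() in vicinity)
    return 1 + n
```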

19
(No Transcript)
20
Evaluation
  • Sample experiments
  • Rank based on large in-degree (or backlinks)
  • query: game
  • Rank  in-degree  URL
  • 1     13   http://www.gotm.org
  • 2     12   http://www.gamezero.com/team-0/
  • 3     12   http://ngp.ngpc.state.ne.us/gp.html
  • 4     12   http://www.ben2.ucla.edu/permadi/gamelink/gamelink.html
  • 5     11   http://igolfto.net/
  • 6     11   http://www.eduplace.com/geo/indexhi.html
  • Only pages 1, 2 and 4 are authoritative game
    pages.

21
Evaluation
  • Sample experiments (continued)
  • Rank based on large authority score.
  • query: game
  • Rank  Authority  URL
  • 1     0.613   http://www.gotm.org
  • 2     0.390   http://ad.doubleclick.net/jump/gamefan-network.com/
  • 3     0.342   http://www.d2realm.com/
  • 4     0.324   http://www.counter-strike.net
  • 5     0.324   http://tech-base.com/
  • 6     0.306   http://www.e3zone.com
  • All pages are authoritative game pages.

22
Authority and Hub Pages (19)
  • Sample experiments (continued)
  • Rank based on large authority score.
  • query: free email
  • Rank  Authority  URL
  • 1     0.525   http://mail.chek.com/
  • 2     0.345   http://www.hotmail.com/
  • 3     0.309   http://www.naplesnews.net/
  • 4     0.261   http://www.11mail.com/
  • 5     0.254   http://www.dwp.net/
  • 6     0.246   http://www.wptamail.com/
  • All pages are authoritative free email pages.

23
Cora thinks Rao is authoritative on Planning;
Citeseer has him down at 90th position.
How come?
  --Planning has two clusters:
    --Planning & reinforcement learning
    --Deterministic planning
  --The first is a bigger cluster
  --Rao is big in the second cluster
24
Tyranny of Majority
Which do you think are authoritative
pages? Which are good hubs? Intuitively, we
would say that 4, 8 and 5 will be authoritative
pages and 1, 2, 3, 6 and 7 will be hub pages.
[Figure: two disconnected communities: hubs 1, 2, 3 pointing to authorities 4, 5; hubs 6, 7 pointing to authority 8]
BUT the power iteration will show that only 4 and
5 have non-zero authorities (.923, .382), and only
1, 2 and 3 have non-zero hubs (.5, .7, .5).
The authority and hub mass will concentrate
completely in the first component as the
iterations increase. (See next slide)
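This behaviour can be checked with the hits() sketch from earlier. The exact edge set is not in the transcript; the one below is inferred from the figure and the reported scores, so treat it as an assumption.

```python
import numpy as np

# Inferred edge set (an assumption): bigger community 1->4, 2->4, 2->5, 3->4;
# smaller community 6->8, 7->8.
edges = [(1, 4), (2, 4), (2, 5), (3, 4), (6, 8), (7, 8)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u - 1, v - 1] = 1

a, h = hits(A, iters=100)   # hits() is the sketch defined earlier
# a is approximately (0, 0, 0, 0.92, 0.38, 0, 0, 0): all authority mass
# concentrates on 4 and 5; the smaller community's authority 8 goes to zero.
# h is approximately (0.5, 0.71, 0.5, 0, 0, 0, 0, 0).
```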
25
Tyranny of Majority (explained)
Suppose h0 and a0 are all initialized to 1.
[Figure: hub pages p1, p2, ..., pm all point to authority p; hub pages q1, q2, ..., qn all point to authority q; m > n]
With each iteration a(p) is multiplied by roughly m
and a(q) by roughly n, so the ratio a(q)/a(p)
shrinks like (n/m)^k and the smaller community's
scores vanish under normalization.
26
Impact of Bridges..
[Figure: the two communities (hubs 1, 2, 3 pointing to authorities 4, 5; hubs 6, 7 pointing to authority 8), now bridged through a new page 9]
When the graph is disconnected, only 4 and 5 have
non-zero authorities (.923, .382), and only 1, 2
and 3 have non-zero hubs (.5, .7, .5).
Bad news from the stability point of view. Can be
fixed by putting a weak link between any
two pages (saying in essence that you
expect every page to be reachable from
every other page).
When the components are bridged by adding one
page (9), the authorities change: now 4, 5 and 8
have non-zero authorities (.853, .224, .47), and 1,
2, 3, 6, 7 and 9 have non-zero hubs (.39, .49,
.39, .21, .21, .6).
27
Finding minority Communities
  • How to retrieve pages from smaller communities?
  • A method for finding pages in nth largest
    community
  • Identify the next largest community using the
    existing algorithm.
  • Destroy this community by removing links
    associated with pages having large authorities.
  • Reset all authority and hub values back to 1 and
    calculate all authority and hub values again.
  • Repeat the above n - 1 times and the next largest
    community will be the nth largest community.

28
Multiple Clusters on House
Query House (first community)
29
Authority and Hub Pages (26)
Query House (second community)
30
PageRank
31
The importance of publishing..
  • A/H algorithm was published in SODA as well as
    JACM
  • Kleinberg became very famous in the scientific
    community (and got a MacArthur Genius award)
  • Pagerank algorithm was rejected from SIGIR and
    was never explicitly published
  • Larry Page never got a genius award or even a PhD
  • (and had to be content with being a mere
    billionaire)

32
PageRank (Importance as Stationary Visit
Probability on a Markov Chain)
  • Basic Idea
  • Think of Web as a big graph. A random surfer
    keeps randomly clicking on the links.
  • The importance of a page is the probability that
    the surfer finds herself on that page
  • --Talk of transition matrix instead of adjacency
    matrix
  • Transition matrix M derived from adjacency
    matrix A
  • --If there are F(u) forward links from a
    page u, then the probability that the surfer
    clicks on any of those is 1/F(u)
    (Columns sum to 1: a stochastic matrix)
  • M is the normalized version of A^T
  • --But even a dumb user may once in a while do
    something other than follow URLs on the current
    page..
  • --Idea: Put a small probability that
    the user goes off to a page not pointed to by the
    current page.

Principal eigenvector Gives the stationary
distribution!
33
Markov Chains & the Random Surfer Model
  • Markov Chains & stationary distributions
  • Necessary conditions for existence of a unique
    steady state distribution: Aperiodicity and
    Irreducibility
  • Irreducibility: Each node can be reached from
    every other node with non-zero probability
  • Must not have sink nodes (which have no out
    links)
  • Because we can have several different steady
    state distributions based on which sink we get
    stuck in
  • If there are sink nodes, change them so that you
    can transition from them to every other node with
    low probability
  • Must not have disconnected components
  • Because we can have several different steady
    state distributions depending on which
    disconnected component we get stuck in
  • Sufficient to put a low probability link from
    every node to every other node (in addition to
    the normal weight links corresponding to actual
    hyperlinks)
  • The parameters of the random surfer model
  • c: the probability that the surfer follows a link
    on the current page
  • The larger it is, the more the surfer sticks to
    what the page says
  • M: the way the link matrix is converted to a
    Markov chain
  • Can make the links have differing transition
    probability
  • E.g. query-specific links have higher probability,
    links in bold have higher probability, etc.
  • K: the reset distribution of the surfer (a great
    thing to tweak)
  • It is quite feasible to have m different reset
    distributions corresponding to m different
    populations of users (or m possible
    topic-oriented searches)
  • It is also possible to make the reset
    distribution depend on other things such as
  • trust of the page (TrustRank)
  • recency of the page (recency-sensitive rank)

34
Computing PageRank (10)
  • Example: Suppose the Web graph has pages A, B, C, D
    with links A→C, B→C, C→D, D→A, D→B.
  • [Figure: the four-page graph A, B, C, D]
  • Adjacency matrix A (entry (u, v) = 1 if u links
    to v), rows and columns in the order A B C D:
    0 0 1 0
    0 0 1 0
    0 0 0 1
    1 1 0 0
  • Transition matrix M (column u holds the surfer's
    probabilities of leaving page u):
    0 0 0 ½
    0 0 0 ½
    1 1 0 0
    0 0 1 0
35
Computing PageRank
  • Matrix representation
  • Let M be an N × N matrix and m_uv be the entry at
    the u-th row and v-th column.
  • m_uv = 1/N_v if page v has a link to page u,
    where N_v is the number of forward links of v
  • m_uv = 0 if there is no link from v to u
  • Let R_i be the N × 1 rank vector for the i-th
    iteration, and R_0 be the initial rank vector.
  • Then R_i = M × R_{i-1}

36
Computing PageRank
  • If the ranks converge, i.e., there is a rank
    vector R such that
  • R = M × R,
  • then R is the eigenvector of matrix M with
    eigenvalue 1.
  • Convergence is guaranteed only if
  • M is aperiodic (the Web graph is not a big
    cycle). This is practically guaranteed for Web.
  • M is irreducible (the Web graph is strongly
    connected). This is usually not true.

The principal eigen value for a stochastic matrix is 1
37
Computing PageRank (6)
  • Rank sink: a page or a group of pages is a rank
    sink if it can receive rank propagation from
    its parents but cannot propagate rank to other
    pages.
  • A rank sink causes the loss of total rank.
  • Example

[Figure: example graph on A, B, C, D in which the pair (C, D) is a rank sink]
38
Computing PageRank (7)
  • A solution to the non-irreducibility and rank
    sink problem.
  • Conceptually add a link from each page v to every
    page (include self).
  • If v has no forward links originally, make all
    entries in the corresponding column in M be 1/N.
  • If v has forward links originally, replace 1/N_v
    in the corresponding column by c × 1/N_v and then
    add (1-c) × 1/N to all entries, 0 < c < 1.

Motivation comes also from random-surfer model
39
Computing PageRank (8)
Z has 1/N in the columns for sink pages and 0 otherwise;
K has 1/N in every entry
  • M* = c (M + Z) + (1 - c) K
  • M* is irreducible.
  • M* is stochastic: the sum of all entries of each
    column is 1 and there are no negative entries.
  • Therefore, if M is replaced by M* as in
  • R_i = M* × R_{i-1}
  • then the convergence is guaranteed and there
    will be no loss of the total rank (which is 1).
    (See the sketch below.)
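A minimal sketch of building M* from the adjacency matrix, following these definitions (the function name is mine):

```python
import numpy as np

def google_matrix(A, c=0.8):
    """M* = c (M + Z) + (1 - c) K, where M is the column-stochastic
    transition matrix (M[u, v] = 1/N_v if v links to u), Z has 1/N in
    the columns of sink pages, and K has 1/N everywhere."""
    N = A.shape[0]
    out_deg = A.sum(axis=1)                  # N_v: forward links of each page
    M = np.zeros((N, N))
    for v in range(N):
        if out_deg[v] > 0:
            M[:, v] = A[v, :] / out_deg[v]   # column v sums to 1
    Z = np.zeros((N, N))
    Z[:, out_deg == 0] = 1.0 / N             # patch the sink columns
    K = np.full((N, N), 1.0 / N)             # uniform reset distribution
    return c * (M + Z) + (1 - c) * K
```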

40
Computing PageRank (9)
  • Interpretation of M* based on the random walk
    model.
  • If page v has no forward links originally, a web
    surfer at v can jump to any page in the Web with
    probability 1/N.
  • If page v has forward links originally, a surfer
    at v can either follow a link to another page
    with probability c × 1/N_v, or jump to any page
    with probability (1-c) × 1/N.

41
Computing PageRank (10)
  • Example: Suppose the Web graph is the same
    four-page graph as before (A→C, B→C, C→D, D→A,
    D→B), with transition matrix M (rows and columns
    in the order A B C D):
    0 0 0 ½
    0 0 0 ½
    1 1 0 0
    0 0 1 0
42
Computing PageRank (11)
  • Example (continued): Suppose c = 0.8. All entries
    in Z are 0 and all entries in K are ¼.
  • M* = 0.8 (M + Z) + 0.2 K
  • Compute rank by iterating
  • R := M* × R
  • M* =
    0.05 0.05 0.05 0.45
    0.05 0.05 0.05 0.45
    0.85 0.85 0.05 0.05
    0.05 0.05 0.85 0.05
  • MATLAB says R(A) = .338, R(B) = .338, R(C) = .6367,
    R(D) = .6052 (reproduced in the sketch below)
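Reproducing this example with the google_matrix sketch above; the figures credited to MATLAB match the principal eigen vector scaled to unit length (as probabilities summing to 1 they are roughly A = B = 0.176, C = 0.332, D = 0.316).

```python
import numpy as np

pages = ["A", "B", "C", "D"]
A = np.zeros((4, 4))
for u, v in [("A", "C"), ("B", "C"), ("C", "D"), ("D", "A"), ("D", "B")]:
    A[pages.index(u), pages.index(v)] = 1

Mstar = google_matrix(A, c=0.8)   # google_matrix() is the sketch above
R = np.ones(4) / 4
for _ in range(100):
    R = Mstar @ R                 # R_i = M* x R_{i-1}
print(dict(zip(pages, R / np.linalg.norm(R))))
# approximately A: 0.338, B: 0.338, C: 0.637, D: 0.605 (unit-length scaling)
```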
43
Comparing PR & A/H on the same graph
[Figure: PageRank scores and A/H scores shown side by side on the same graph]
44
Combining PR & content similarity
  • Incorporate the ranks of pages into the ranking
    function of a search engine.
  • The ranking score of a web page can be a weighted
    sum of its regular similarity with a query and
    its importance.
  • ranking_score(q, d)
  •   = w × sim(q, d) + (1-w) × R(d), if sim(q, d) > 0
  •   = 0, otherwise
  • where 0 < w < 1.
  • Both sim(q, d) and R(d) need to be normalized to
    lie in [0, 1].

Who sets w?
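A one-line sketch of this combination (w is the knob the note asks about; here it is just a parameter someone must choose):

```python
def ranking_score(sim_qd, R_d, w=0.5):
    """w * sim(q, d) + (1 - w) * R(d) if sim(q, d) > 0, else 0.
    Both sim(q, d) and R(d) are assumed already normalized to [0, 1]."""
    return w * sim_qd + (1 - w) * R_d if sim_qd > 0 else 0.0
```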
45
We can pick and choose
  • Two alternate ways of computing page importance
  • I1. As authorities/hubs
  • I2. As stationary distribution over the
    underlying markov chain
  • Two alternate ways of combining importance with
    similarity
  • C1. Compute importance over a set derived from
    the top-100 similar pages
  • C2. Combine apples organges
  • aimportance bsimilarity

We can pick any pair of alternatives (even though
I1 was originally proposed with C1 and I2 with
C2)
46
Efficient computation Prioritized Sweeping
We can use asynchronous iterations where the
iteration uses some of the values updated in
the current iteration
47
Efficient Computation Preprocess
  • Remove dangling nodes
  • Pages w/ no children
  • Then repeat process
  • Since now more danglers
  • Stanford WebBase
  • 25 M pages
  • 81 M URLs in the link graph
  • After two prune iterations 19 M nodes

48
Representing Links Table
  • Stored on disk in binary format
  • Size for Stanford WebBase 1.01 GB
  • Assumed to exceed main memory

49
Algorithm 1
∀s: Source[s] = 1/N
while residual > ε {
    ∀d: Dest[d] = 0
    while not Links.eof() {
        Links.read(source, n, dest1, ..., destn)
        for j = 1..n
            Dest[destj] = Dest[destj] + Source[source]/n
    }
    ∀d: Dest[d] = c * Dest[d] + (1-c)/N   /* dampening */
    residual = ||Source - Dest||          /* recompute every few iterations */
    Source = Dest
}
50
Analysis of Algorithm 1
  • If memory is big enough to hold Source & Dest
  • IO cost per iteration is |Links|
  • Fine for a crawl of 24 M pages
  • But web ≈ 800 M pages in 2/99 (NEC
    study)
  • Increase from 320 M pages in 1997 (same
    authors)
  • If memory is big enough to hold just Dest
  • Sort Links on source field
  • Read Source sequentially during rank propagation
    step
  • Write Dest to disk to serve as Source for next
    iteration
  • IO cost per iteration is |Source| + |Dest| +
    |Links|
  • If memory can't hold Dest
  • Random access pattern will make working set =
    |Dest|
  • Thrash!!!

51
Block-Based Algorithm
  • Partition Dest into B blocks of D pages each
  • If memory = P physical pages
  • D < P-2, since we need input buffers for Source &
    Links
  • Partition Links into B files
  • Links_i only has some of the dest nodes for each
    source
  • Links_i only has dest nodes such that
  • DD·i ≤ dest < DD·(i+1)
  • where DD = number of 32-bit integers that fit in
    D pages

[Figure: the Links matrix (sparse), with source nodes as rows and dest nodes as columns, shown alongside the Source and Dest vectors]
52
Partitioned Link File
Each source node (32-bit int) is stored with its full outdegree
(16-bit) in every partition, followed by the number of its
destinations (16-bit) that fall in that partition's bucket and the
destination nodes themselves (32-bit ints):

Buckets 0-31    source 0, outdegr 4, num out 2: 12, 26
                source 1, outdegr 3, num out 1: 5
                source 2, outdegr 5, num out 3: 1, 9, 10
Buckets 32-63   source 0, outdegr 4, num out 1: 58
                source 1, outdegr 3, num out 1: 56
                source 2, outdegr 5, num out 1: 36
Buckets 64-95   source 0, outdegr 4, num out 1: 94
                source 1, outdegr 3, num out 1: 69
                source 2, outdegr 5, num out 1: 78
53
Block-based Page Rank algorithm
54
Analysis of Block Algorithm
  • IO cost per iteration =
  • B·|Source| + |Dest| + |Links|·(1+e)
  • e is the factor by which Links increased in size
  • Typically 0.1-0.3
  • Depends on the number of blocks
  • Algorithm ≈ nested-loops join

55
Comparing the Algorithms
56
Effect of collusion on PageRank
[Figure: two small clusters of pages (A, B, C) whose members link to each other]
Moral: By referring to each other, a cluster of
pages can artificially boost their
rank (although the cluster has to be big enough
to make an appreciable
difference). Solution: Put a threshold on the
number of intra-domain links that will
count. Counter: Buy two domains, and generate a
cluster among those..
57
(No Transcript)
58
(No Transcript)
59
Use of Link Information
  • PageRank defines the global importance of web
    pages but the importance is domain/topic
    independent.
  • We often need to find important/authoritative
    pages which are relevant to a given query.
  • What are important web browser pages?
  • Which pages are important game pages?
  • Idea: Use a notion of topic-specific page rank
  • Involves using a non-uniform (reset) probability

60
Topic Specific Pagerank
Haveliwala, WWW 2002
  • For each page compute k different page ranks
  • k = number of top-level hierarchies in the Open
    Directory Project
  • When computing PageRank w.r.t. a topic, say
    that with probability ε we transition to one of
    the pages of topic k
  • When a query q is issued,
  • Compute the similarity between q (& its context)
    and each of the topics
  • Take the weighted combination of the
    topic-specific page ranks, weighted by q's
    similarity to the different topics (see the
    sketch below)
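A sketch of the query-time combination (names are mine; the per-topic rank vectors are assumed to have been precomputed with topic-biased reset distributions):

```python
import numpy as np

def topic_sensitive_rank(topic_ranks, topic_sims):
    """topic_ranks: k x N array, row j = PageRank vector biased toward topic j.
    topic_sims: length-k similarities of the query (and its context) to each
    topic. Returns the blended rank for every page."""
    w = np.asarray(topic_sims, dtype=float)
    w = w / w.sum()                          # turn similarities into weights
    return w @ np.asarray(topic_ranks)       # weighted combination of ranks
```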

61
Stability of Rank Calculations
(From Ng et al.)
The leftmost column shows the original
rank calculation; the columns on the right
are the results of rank calculations when 30%
of the pages are randomly removed
62
(No Transcript)
63
More stable, because the random surfer model allows
low-probability edges to every place.
Can be made stable with subspace-based A/H values;
see Ng et al. 2001
64
Novel uses of Link Analysis
  • Link analysis algorithmsHITS, and Pagerankare
    not limited to hyperlinks
  • Citeseer/Cora use them for analyzing citations
    (the link is through citation)
  • See the irony herelink analysis ideas originated
    from citation analysis, and are now being applied
    for citation analysis ?
  • Some new work on keyword search on databases
    uses foreign-key links and link analysis to
    decide which of the tuples matching the keyword
    query are most important (the link is through
    foreign keys)
  • Sudarshan et. Al. ICDE 2002
  • Keyword search on databases is useful to make
    structured databases accessible to naïve users
    who dont know structured languages (such as
    SQL).

65
(No Transcript)
66
Query complexity
  • Complex queries (966 trials)
  • Average words: 7.03
  • Average operators ("): 4.34
  • Typical Alta Vista queries are much simpler
    (Silverstein, Henzinger, Marais and Moricz)
  • Average query words: 2.35
  • Average operators ("): 0.41
  • Forcibly adding a hub or authority node helped in
    86% of the queries

67
What about non-principal eigen vectors?
  • Principal eigen vector gives the authorities (and
    hubs)
  • What do the other ones do?
  • They may be able to show the clustering in the
    documents (see page 23 in Kleinberg paper)
  • The clusters are found by looking at the positive
    and negative ends of the secondary eigen vectors
    (the principal vector has only a +ve end)

68
More stable, because the random surfer model allows
low-probability edges to every place.
Can be made stable with subspace-based A/H values;
see Ng et al. 2001
69
Summary of Key Points
  • PageRank Iterative Algorithm
  • Rank Sinks
  • Efficiency of computation Memory!
  • Single precision Numbers.
  • Don't represent M explicitly.
  • Break arrays into Blocks.
  • Minimize IO Cost.
  • Number of iterations of PageRank.
  • Weighting of PageRank vs. doc similarity.

70
Beyond Google (and Pagerank)
  • Are backlinks reliable metric of importance?
  • It is a one-size-fits-all measure of
    importance
  • Not user specific
  • Not topic specific
  • There may be discrepancy between back links and
    actual popularity (as measured in hits)
  • The sense of the link is ignored (this is okay
    if you think that all publicity is good
    publicity)
  • Mark Twain on Classics
  • A classic is something everyone wishes they had
    already read and no one actually has read
    (paraphrase)
  • Google may be its own undoing (why would I need
    back links when I know I can get to it through
    Google?)
  • Customization, customization, customization
  • Yahoo sez about their magic bullet.. (NYT
    2/22/04)
  • "If you type in flowers, do you want to buy
    flowers, plant flowers or see pictures of
    flowers?"

71
Challenges in Web Search Engines
  • Spam
  • Text Spam
  • Link Spam
  • Cloaking
  • Content Quality
  • Anchor text quality
  • Quality Evaluation
  • Indirect feedback
  • Web Conventions
  • Articulate and develop validation
  • Duplicate Hosts
  • Mirror detection
  • Vaguely Structured Data
  • Page layout
  • The advantage of making rendering/content
    language be same

72
Spam is a serious problem
  • We have Spam Spam Spam Spam Spam with Eggs and
    Spam
  • in Email
  • Most mail transmitted is junk
  • web pages
  • Many different ways of fooling search engines
  • This is an open arms race
  • Annual conference on Email and Anti-Spam
  • Started 2004
  • Intl. workshop on AIR-Web (Adversarial Info
    Retrieval on Web)
  • Started in 2005 at WWW

73
Trust Spam (Knock-Knock. Who is there?)
  • A powerful way we avoid spam in our physical
    world is by preferring interactions only with
    trusted parties
  • Trust is propagated over social networks
  • When knocking on the doors of strangers, the
    first thing we do is to identify ourselves as a
    friend of a friend of friend
  • So they won't train their dogs/guns on us..
  • Knock-knock. Who is there? Aardvark. Okay (door
    opened) -- not funny
  • Aardvark who? Aardvark a million miles for one of
    your smiles. -- FUNNY
  • We can do it in cyber world too
  • Accept product recommendations only from trusted
    parties
  • E.g. Epinions
  • Accept mails only from individuals who you trust
    above a certain threshold
  • Bias page importance computation so that it
    counts only links from trusted sites..
  • Sort of like discounting links that are off
    topic

74
Trust Propagation
  • Trust is transitive so easy to propagate
  • ..but attenuates as it traverses as a social
    network
  • If I trust you, I trust your friend (but a little
    less than I do you), and I trust your friends
    friend even less
  • Trust may not be symmetric..
  • Trust is normally additive
  • If you are friend of two of my friends, may be I
    trust you more..
  • Distrust is difficult to propagate
  • If my friend distrusts you, then I probably
    distrust you
  • but if my enemy distrusts you?
  • is the enemy of my enemy automatically my
    friend?
  • Trust vs. Reputation
  • Trust is a user-specific metric
  • Your trust in an individual may be different from
    someone elses
  • Reputation can be thought of as an aggregate
    or one-size-fits-all version of Trust
  • Most systems such as EBay tend to use Reputation
    rather than Trust
  • Sort of the difference between User-specific vs.
    Global page rank

75
Case Study Epinions
  • Users can write reviews and also express
    trust/distrust on other users
  • Reviewers get royalties
  • so some tried to game the system
  • So, distrust measures introduced

[Figure: out-degree distribution of the Epinions trust graph (number of nodes vs. out degree)]
Guha et al., WWW 2004 compares some 81
different ways of propagating trust and
distrust on the Epinions trust matrix
76
Evaluating Trust Propagation Approaches
  • Given n users, and a sparsely populated nxn
    matrix of trusts between the users
  • And optionally an nxn matrix of distrusts between
    the users
  • Start by erasing some of the entries (but
    remember the values you erased)
  • For each trust propagation method
  • Use it to fill the nxn matrix
  • Compare the predicted values to the erased values

77
Fighting Page Spam
We saw discussion of these in the Henzinger et
al. paper
Can social networks, which gave rise to the
ideas of page importance computation, also
rescue these computations from spam?
78
TrustRank idea
Gyongyi et al, VLDB 2004
  • Tweak the default distribution used in page
    rank computation (the distribution that a bored
    user uses when she doesn't want to follow the
    links)
  • From uniform
  • To trust-based
  • Very similar in spirit to the Topic-sensitive or
    User-sensitive page rank
  • where, too, you fiddle with the default
    distribution
  • Sample a set of seed pages from the web
  • Have an oracle (human) identify the good pages
    and the spam pages in the seed set
  • Expensive task, so must make seed set as small as
    possible
  • Propagate Trust (one pass)
  • Use the normalized trust to set the initial
    distribution

Slides modified from Anand Rajaraman's lecture at
Stanford
79
Example
[Figure: a small example web graph of seven pages, with some seed pages labeled good and one labeled bad]
80
Rules for trust propagation
  • Trust attenuation
  • The degree of trust conferred by a trusted page
    decreases with distance
  • Trust splitting
  • The larger the number of outlinks from a page,
    the less scrutiny the page author gives each
    outlink
  • Trust is split across outlinks
  • Combining splitting and damping, each out link of
    a node p gets a propagated trust of
    b·t(p)/|O(p)|
  • 0 < b < 1; |O(p)| is the out degree and t(p) is the
    trust of p
  • Trust additivity
  • Propagated trust from different directions is
    added up

81
Simple model
  • Suppose the trust of page p is t(p)
  • Set of outlinks: O(p)
  • For each q ∈ O(p), p confers the trust
  • b·t(p)/|O(p)|, for 0 < b < 1
  • Trust is additive
  • Trust of p is the sum of the trust conferred on p
    by all its inlinked pages (see the sketch below)
  • Note the similarity to Topic-Specific Page Rank
  • Within a scaling factor, trust rank = biased page
    rank with trusted pages as the teleport set
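A sketch of one propagation pass under this simple model (graph representation and names are mine): each page confers b·t(p)/|O(p)| along every out-link, and incoming contributions are added up.

```python
def propagate_trust(t, out_links, b=0.85, passes=1):
    """t: dict page -> current trust (seed pages start with oracle trust).
    out_links: dict page -> list of pages it links to."""
    for _ in range(passes):
        new_t = {p: 0.0 for p in t}
        for p, outs in out_links.items():
            if not outs:
                continue
            share = b * t.get(p, 0.0) / len(outs)     # split and damp
            for q in outs:
                new_t[q] = new_t.get(q, 0.0) + share  # additive over in-links
        t = new_t
    return t
```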

82
Picking the seed set
  • Two conflicting considerations
  • Human has to inspect each seed page, so seed set
    must be as small as possible
  • Must ensure every good page gets adequate trust
    rank, so need make all good pages reachable from
    seed set by short paths

83
Approaches to picking seed set
  • Suppose we want to pick a seed set of k pages
  • The best idea would be to pick them from the
    top-k hub pages.
  • Note that trustworthiness is subjective
  • Aljazeera may be considered more trustworthy than
    NY Times by some (and the reverse by others)
  • PageRank
  • Pick the top k pages by page rank
  • Assume high page rank pages are close to other
    highly ranked pages
  • We care more about high page rank good pages

84
Inverse page rank
  • Pick the pages with the maximum number of
    outlinks
  • Can make it recursive
  • Pick pages that link to pages with many outlinks
  • Formalize as inverse page rank
  • Construct graph G' by reversing each edge in web
    graph G
  • Page Rank in G' is inverse page rank in G
  • Pick the top k pages by inverse page rank (see the
    sketch below)
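A sketch of seed selection by inverse page rank, assuming the google_matrix() helper sketched earlier: transpose the adjacency matrix to reverse every edge, run the PageRank iteration, and keep the top-k pages.

```python
import numpy as np

def pick_seed_set(A, k, c=0.85):
    """Inverse page rank: PageRank on the reversed graph (A transposed)."""
    Mstar = google_matrix(A.T, c=c)    # google_matrix() from the earlier sketch
    r = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(100):
        r = Mstar @ r
    return np.argsort(r)[::-1][:k]     # indices of the top-k pages
```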

85
Anatomy of Google (circa 1999)
  • Slides from
  • http://www.cs.huji.ac.il/sdbi/2000/google/index.htm

86
Some points
  • Fancy hits?
  • Why two types of barrels?
  • How is indexing parallelized?
  • How does Google show that it doesn't quite care
    about recall?
  • How does Google avoid crawling the same URL
    multiple times?
  • What are some of the memory saving things they
    do?
  • Do they use TF/IDF?
  • Do they normalize? (why not?)
  • Can they support proximity queries?
  • How are page synopses made?

87
Types of Web Queries
  • Navigational
  • User is looking for the address of a specific
    page (so the relevant set is a singleton!)
  • Success on these is responsible for much of the
    OOooo appeal of search engines..
  • Informational
  • User is trying to learn information about a
    specific topic (so the relevant set can be
    non-singleton)
  • Transactional
  • The user is searching with the final aim of
    conducting a transaction on that page..
  • E.g. comparison shopping

88
Search Engine Size over Time
Number of indexed pages, self-reported. Google:
about 50% of the web?
89
System Anatomy
  • High Level Overview

90
Google Search Engine Architecture
URL Server- Provides URLs to be fetched Crawler
is distributed Store Server - compresses
and stores pages for indexing Repository - holds
pages for indexing (full HTML of every
page) Indexer - parses documents, records words,
positions, font size, and capitalization Lexicon
- list of unique words found HitList - efficient
record of word locations & attributes Barrels - hold
(docID, (wordID, hitList)), sorted; each barrel has a
range of words Anchors - keep information about
links found in web pages URL Resolver - converts
relative URLs to absolute Sorter - generates Doc
Index Doc Index - inverted index of all words in
all documents (except stop words) Links - stores
info about links to each page (used for
Pagerank) Pagerank - computes a rank for
each page retrieved Searcher - answers queries
SOURCE: BRIN & PAGE
91
Major Data Structures
  • Big Files
  • virtual files spanning multiple file systems
  • addressable by 64 bit integers
  • handles allocation & deallocation of file
    descriptors, since those provided by the OS are not enough
  • supports rudimentary compression

92
Major Data Structures (2)
  • Repository
  • tradeoff between speed & compression ratio
  • chose zlib (3 to 1) over bzip (4 to 1)
  • requires no other data structure to access it

93
Major Data Structures (3)
  • Document Index
  • keeps information about each document
  • fixed width ISAM (index sequential access mode)
    index
  • includes various statistics
  • pointer to repository, if crawled, pointer to
    info lists
  • compact data structure
  • we can fetch a record in 1 disk seek during search

94
Major Data Structures (4)
  • URLs - docID file
  • used to convert URLs to docIDs
  • list of URL checksums with their docIDs
  • sorted by checksums
  • given a URL a binary search is performed
  • conversion is done in batch mode

95
Major Data Structures (4)
  • Lexicon
  • can fit in memory for reasonable price
  • currently 256 MB
  • contains 14 million words
  • 2 parts
  • a list of words
  • a hash table

96
Major Data Structures (4)
  • Hit Lists
  • includes position, font & capitalization
  • account for most of the space used in the indexes
  • 3 alternatives: simple, Huffman, hand-optimized
  • hand encoding uses 2 bytes for every hit

97
Major Data Structures (4)
  • Hit Lists (2)

98
Major Data Structures (5)
  • Forward Index
  • partially ordered
  • used 64 Barrels
  • each Barrel holds a range of wordIDs
  • requires slightly more storage
  • each wordID is stored as a relative difference
    from the minimum wordID of the Barrel
  • saves considerable time in the sorting

99
Major Data Structures (6)
  • Inverted Index
  • 64 Barrels (same as the Forward Index)
  • for each wordID the Lexicon contains a pointer to
    the Barrel that wordID falls into
  • the pointer points to a doclist with their hit
    list
  • the order of the docIDs is important
  • by docID or doc word-ranking
  • Two inverted barrelsthe short barrel/full barrel

100
Major Data Structures (7)
  • Crawling the Web
  • fast distributed crawling system
  • URLserver & Crawlers are implemented in Python
  • each Crawler keeps about 300 connections open
  • at peak time the rate is 100 pages, 600K per
    second
  • uses internal cached DNS lookup
  • synchronized IO to handle events
  • number of queues
  • Robust Carefully tested

101
Major Data Structures (8)
  • Indexing the Web
  • Parsing
  • should know to handle errors
  • HTML typos
  • kb of zeros in a middle of a TAG
  • non-ASCII characters
  • HTML Tags nested hundreds deep
  • Developed their own Parser
  • involved a fair amount of work
  • did not cause a bottleneck

102
Major Data Structures (9)
  • Indexing Documents into Barrels
  • turning words into wordIDs
  • in-memory hash table - the Lexicon
  • new additions are logged to a file
  • parallelization
  • shared lexicon of 14 million pages
  • log of all the extra words

103
Major Data Structures (10)
  • Indexing the Web
  • Sorting
  • creating the inverted index
  • produces two types of barrels
  • for titles and anchor (Short barrels)
  • for full text (full barrels)
  • sorts every barrel separately
  • running sorters at parallel
  • the sorting is done in main memory

Ranking looks at Short barrels first And then
full barrels
104
Searching
  • Algorithm
  • 1. Parse the query
  • 2. Convert word into wordIDs
  • 3. Seek to the start of the doclist in the short
    barrel for every word
  • 4. Scan through the doclists until there is a
    document that matches all of the search terms
  • 5. Compute the rank of that document
  • 6. If we're at the end of the short barrels, start
    at the doclists of the full barrel, unless we
    have enough
  • 7. If we're not at the end of any doclist, goto
    step 4
  • 8. Sort the documents by rank return the top K
  • (May jump here after 40k pages)

105
The Ranking System
  • The information
  • Position, Font Size, Capitalization
  • Anchor Text
  • PageRank
  • Hits Types
  • title ,anchor , URL etc..
  • small font, large font etc..

106
The Ranking System (2)
  • Each Hit type has its own weight
  • Counts weights increase linearly with counts at
    first but quickly taper off this is the IR score
    of the doc
  • (IDF weighting??)
  • the IR is combined with PageRank to give the
    final Rank
  • For multi-word query
  • A proximity score for every set of hits with a
    proximity type weight
  • 10 grades of proximity

107
Feedback
  • A trusted user may optionally evaluate the
    results
  • The feedback is saved
  • When modifying the ranking function we can see
    the impact of this change on all previous
    searches that were ranked

108
Results
  • Produce better results than major commercial
    search engines for most searches
  • Example query bill clinton
  • returns results from whitehouse.gov
  • email addresses of the president
  • all the results are high quality pages
  • no broken links
  • no "bill" without "clinton" and no "clinton" without "bill"

109
Storage Requirements
  • Using Compression on the repository
  • about 55 GB for all the data used by the SE
  • most of the queries can be answered by just the
    short inverted index
  • with better compression, a high quality SE can
    fit onto a 7GB drive of a new PC

110
Storage Statistics
Web Page Statistics
111
System Performance
  • It took 9 days to download 26 million pages
  • 48.5 pages per second
  • The Indexer & Crawler ran simultaneously
  • The Indexer runs at 54 pages per second
  • The sorters run in parallel using 4 machines, the
    whole process took 24 hours