Title: What is page importance?

1. What is page importance?
- Page importance is hard to define unilaterally such that it satisfies
  everyone. There are, however, some desiderata:
- It should be sensitive to
  - The query
    - Or at least the topic of the query
  - The user
    - Or at least the user population
  - The link structure of the web
  - The amount of accesses the page gets
- It should be stable w.r.t. small random changes in the network link
  structure
- It shouldn't be easy to subvert with intentional changes to link structure
How about the eloquence of the page? The informativeness of the page?
2. Desiderata for link-based ranking
- A page that is referenced by a lot of important pages (has more back links)
  is more important (Authority)
  - A page referenced by a single important page may be more important than
    one referenced by five unimportant pages
- A page that references a lot of important pages is also important (Hub)
- Importance can be propagated
  - Your importance is the weighted sum of the importance conferred on you by
    the pages that refer to you
  - The importance you confer on a page may be proportional to how many other
    pages you refer to (cite)
    - (Also what you say about them when you cite them!)
Different notions of importance.
Qn: Can we assign consistent authority/hub values to pages?
3. Authorities and Hubs as mutually reinforcing properties

- Authorities and hubs related to the same query tend to form a bipartite
  subgraph of the web graph.
- Suppose each page has an authority score a(p) and a hub score h(p).

[Figure: bipartite subgraph with hubs on one side pointing to authorities on
the other]
4. Authority and Hub Pages

- I (Authority computation): for each page p,
    a(p) = Σ_{q : (q, p) ∈ E} h(q)
- O (Hub computation): for each page p,
    h(p) = Σ_{q : (p, q) ∈ E} a(q)

[Figure: page p with in-links from q1, q2, q3 (authority case) and out-links
to q1, q2, q3 (hub case)]

A set of simultaneous equations. Can we solve these?
5. Authority and Hub Pages (8)

- Matrix representation of operations I and O:
- Let A be the adjacency matrix of SG: entry (p, q) is 1 if p has a link to
  q, else the entry is 0.
- Let A^T be the transpose of A.
- Let h_i be the vector of hub scores after i iterations.
- Let a_i be the vector of authority scores after i iterations.
- Operation I: a_i = A^T h_{i-1}
- Operation O: h_i = A a_i
Normalize after every multiplication. (A sketch follows.)
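Below is a minimal sketch of these operations in Python/numpy; the function
name and the use of the Euclidean norm for normalization are our choices,
consistent with the worked example on the next slides:

  import numpy as np

  def hits(A, iters=50):
      # Operation I: a = A^T h  (authority from pages that link to you)
      # Operation O: h = A a    (hub from pages you link to)
      n = A.shape[0]
      a, h = np.ones(n), np.ones(n)
      for _ in range(iters):
          a = A.T @ h
          a /= np.linalg.norm(a)   # normalize after every multiplication
          h = A @ a
          h /= np.linalg.norm(h)
      return a, h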
6. Authority and Hub Pages (11)

- Example: Initialize all scores to 1.
- 1st iteration:
  - I operation:
    a(q1) = 1, a(q2) = a(q3) = 0, a(p1) = 3, a(p2) = 2
  - O operation:
    h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
  - Normalization:
    a(q1) = 0.267, a(q2) = a(q3) = 0, a(p1) = 0.802, a(p2) = 0.535,
    h(q1) = 0.645, h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0

[Figure: example graph on pages q1, q2, q3, p1, p2]
7. Authority and Hub Pages (12)

- After 2 iterations:
  a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791, a(p2) = 0.609,
  h(q1) = 0.656, h(q2) = 0.371, h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0
- After 5 iterations:
  a(q1) = a(q2) = a(q3) = 0, a(p1) = 0.788, a(p2) = 0.615,
  h(q1) = 0.657, h(q2) = 0.369, h(q3) = 0.657, h(p1) = h(p2) = 0

[Figure: the same example graph on q1, q2, q3, p1, p2; reproduced in code
below]
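The example can be reproduced with the hits() sketch above; the edge set
below is our reconstruction from the reported scores (q1 -> p1, p2;
q2 -> p1; q3 -> p1, p2; p1 -> q1), so treat it as an assumption:

  import numpy as np
  A = np.array([[0, 0, 0, 1, 1],   # q1 -> p1, p2
                [0, 0, 0, 1, 0],   # q2 -> p1
                [0, 0, 0, 1, 1],   # q3 -> p1, p2
                [1, 0, 0, 0, 0],   # p1 -> q1
                [0, 0, 0, 0, 0]])  # p2 has no out-links
  a, h = hits(A, iters=5)
  # a ≈ [0, 0, 0, 0.788, 0.615] and h ≈ [0.657, 0.369, 0.657, 0, 0],
  # matching the scores reported above after 5 iterations.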
8. What happens if you multiply a vector by a matrix?

- In general, when you multiply a vector by a matrix, the vector gets scaled
  as well as rotated
  - ..except when the vector happens to be in the direction of one of the
    eigenvectors of the matrix
  - .. in which case it only gets scaled (stretched)
- A (symmetric square) matrix has all real eigenvalues, and the values give
  an indication of the amount of stretching that is done for vectors in that
  direction
- The eigenvectors of the matrix define a new orthonormal space
- You can model the multiplication of a general vector by the matrix in terms
  of:
  - First decompose the general vector into its projections in the
    eigenvector directions
    - ..which means just take the dot product of the vector with each (unit)
      eigenvector
  - Then multiply the projections by the corresponding eigenvalues to get the
    new vector
- This explains why the power method converges to the principal
  eigenvector..
  - ..since if a vector has a non-zero projection in the principal
    eigenvector direction, then repeated multiplication will keep stretching
    the vector in that direction, so that eventually all other directions
    vanish by comparison

Optional
9. (Why) Does the procedure converge?

As we multiply repeatedly with M, the component of x in the direction of the
principal eigenvector gets stretched relative to the other directions, so we
finally converge to the direction of the principal eigenvector.
Necessary condition: x must have a component in the direction of the
principal eigenvector (c1 must be non-zero).
The rate of convergence depends on the eigengap (the gap between the first
and second eigenvalues).
10. Can we power iterate to get other (secondary) eigenvectors?

- Yes: just find a matrix M2 such that M2 has the same eigenvectors as M, but
  the eigenvalue corresponding to the first eigenvector e1 is zeroed out.
- Now do power iteration on M2.
- Alternately, start with a random vector v, find a new vector
  v' = v - (v·e1)e1, and do power iteration on M with v'.

Why does this work? 1. M2 e1 = 0.  2. If e2 is the second eigenvector of M,
then it is also an eigenvector of M2. (A sketch follows.)
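A small sketch of the deflation route, on a toy symmetric matrix of our own
choosing (eigenvalues 3 and 1):

  import numpy as np

  def power_iterate(M, iters=200):
      v = np.random.rand(M.shape[0])
      for _ in range(iters):
          v = M @ v
          v /= np.linalg.norm(v)
      return v

  M = np.array([[2.0, 1.0],
                [1.0, 2.0]])
  e1 = power_iterate(M)             # principal eigenvector, ~[.707, .707]
  lam1 = e1 @ M @ e1                # Rayleigh quotient: eigenvalue ~3
  M2 = M - lam1 * np.outer(e1, e1)  # M2 e1 = 0; other eigenpairs unchanged
  e2 = power_iterate(M2)            # second eigenvector, ~[.707, -.707]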
11. Authority and Hub Pages

- Algorithm (summary)
  - submit q to a search engine to obtain the root set S
  - expand S into the base set T
  - obtain the induced subgraph SG(V, E) using T
  - initialize a(p) = h(p) = 1 for all p in V
  - for each p in V, until the scores converge:
    - apply Operation I
    - apply Operation O
    - normalize a(p) and h(p)
  - return pages with top authority & hub scores
13. Base set computation

- Can be made easy by storing the link structure of the Web in advance: a
  link structure table, built during crawling
  - Most search engines serve this information now (e.g., Google's link:
    search)

  parent_url | child_url
  -----------+----------
  url1       | url2
  url1       | url3
14. Authority and Hub Pages (9)

- After each iteration of applying Operations I and O, normalize all
  authority and hub scores.
- Repeat until the scores for each page converge (convergence is
  guaranteed).
- Sort pages in descending order of authority score.
- Display the top authority pages.
15. Handling spam links

- Should all links be treated equally?
- Two considerations:
  - Some links may be more meaningful/important than other links.
  - Web site creators may trick the system into making their pages more
    authoritative by adding dummy pages pointing to their cover pages
    (spamming).
16. Handling Spam Links (contd)

- Transverse link: a link between pages with different domain names.
  - Domain name: the first level of the URL of a page.
- Intrinsic link: a link between pages with the same domain name.
- Transverse links are more important than intrinsic links.
- Two ways to incorporate this:
  - Use only transverse links and discard intrinsic links.
  - Give lower weights to intrinsic links.

17. Handling Spam Links (contd)

- How to give lower weights to intrinsic links? In the adjacency matrix A,
  entry (p, q) should be assigned as follows:
  - If p has a transverse link to q, the entry is 1.
  - If p has an intrinsic link to q, the entry is c, where 0 < c < 1.
  - If p has no link to q, the entry is 0.
18. Considering link context

- For a given link (p, q), let V(p, q) be the vicinity (e.g., ±50 characters)
  of the link.
- If V(p, q) contains terms from the user query (topic), then the link should
  be more useful for identifying authoritative pages.
- To incorporate this: in adjacency matrix A, make the weight associated with
  link (p, q) be 1 + n(p, q), where n(p, q) is the number of terms in
  V(p, q) that appear in the query.
- Alternately, consider the vector similarity between V(p, q) and the query
  Q. (A combined sketch follows.)
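A hedged sketch combining the two weighting ideas above (intrinsic links get
weight c < 1, and links whose vicinity mentions query terms get the
1 + n(p, q) boost); the helper name and the multiplicative combination are
our assumptions:

  from urllib.parse import urlparse

  def link_weight(p_url, q_url, vicinity_text, query_terms, c=0.5):
      same_domain = urlparse(p_url).netloc == urlparse(q_url).netloc
      base = c if same_domain else 1.0          # intrinsic vs. transverse
      words = set(vicinity_text.lower().split())
      n = sum(1 for t in query_terms if t.lower() in words)
      return base * (1 + n)                     # context-sensitive boost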
20. Evaluation

- Sample experiments
- Rank based on large in-degree (or backlinks)
- Query: game

  Rank  In-degree  URL
  1     13         http://www.gotm.org
  2     12         http://www.gamezero.com/team-0/
  3     12         http://ngp.ngpc.state.ne.us/gp.html
  4     12         http://www.ben2.ucla.edu/permadi/gamelink/gamelink.html
  5     11         http://igolfto.net/
  6     11         http://www.eduplace.com/geo/indexhi.html

- Only pages 1, 2 and 4 are authoritative game pages.
21. Evaluation

- Sample experiments (continued)
- Rank based on large authority score.
- Query: game

  Rank  Authority  URL
  1     0.613      http://www.gotm.org
  2     0.390      http://ad/doubleclick/net/jump/gamefan-network.com/
  3     0.342      http://www.d2realm.com/
  4     0.324      http://www.counter-strike.net
  5     0.324      http://tech-base.com/
  6     0.306      http://www.e3zone.com

- All pages are authoritative game pages.
22. Authority and Hub Pages (19)

- Sample experiments (continued)
- Rank based on large authority score.
- Query: free email

  Rank  Authority  URL
  1     0.525      http://mail.chek.com/
  2     0.345      http://www.hotmail.com/
  3     0.309      http://www.naplesnews.net/
  4     0.261      http://www.11mail.com/
  5     0.254      http://www.dwp.net/
  6     0.246      http://www.wptamail.com/

- All pages are authoritative free email pages.
23. Cora thinks Rao is Authoritative on Planning

Citeseer has him down at 90th position. How come?
- Planning has two clusters: planning & reinforcement learning, and
  deterministic planning.
- The first is the bigger cluster; Rao is big in the second cluster.
24. Tyranny of Majority

Which do you think are the authoritative pages? Which are good hubs?
Intuitively, we would say that 4, 8 and 5 will be authoritative pages and
1, 2, 3, 6, 7 will be hub pages.

[Figure: two communities; hubs 1, 2, 3 point to authorities 4 and 5, while
hubs 6 and 7 point to authority 8]

BUT the power iteration will show that only 4 and 5 have non-zero
authorities (.923 and .382), and only 1, 2 and 3 have non-zero hubs
(.5, .7 and .5). The authority and hub mass will concentrate completely in
the first component as the iterations increase. (See the next slide, and the
demo below.)
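A demo of this effect, reusing the hits() sketch from above; the exact edge
set is our reconstruction from the reported scores, so treat it as an
assumption:

  import numpy as np
  A = np.zeros((8, 8))
  for src, dst in [(1, 4), (2, 4), (2, 5), (3, 4),   # bigger community
                   (6, 8), (7, 8)]:                  # smaller community
      A[src - 1, dst - 1] = 1
  a, h = hits(A, iters=100)
  # a ≈ [0, 0, 0, .924, .383, 0, 0, 0]: all authority mass on 4 and 5
  # h ≈ [.5, .707, .5, 0, 0, 0, 0, 0]: all hub mass on 1, 2 and 3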
25. Tyranny of Majority (explained)

Suppose h0 and a0 are all initialized to 1.

[Figure: two star-shaped components; pages p1 ... pm all point to q, and
pages q1 ... qn all point to p, with m > n]

With each pair of I/O operations, a(q) grows by a factor of m while a(p)
grows by a factor of only n, so after k iterations a(p)/a(q) = (n/m)^k. With
m > n this ratio goes to zero: after normalization, the smaller community's
scores vanish.
26. Impact of Bridges..

[Figure: the same 8-node graph, with a new page 9 bridging the two
components]

When the graph is disconnected, only 4 and 5 have non-zero authorities
(.923 and .382), and only 1, 2 and 3 have non-zero hubs (.5, .7 and .5).
When the components are bridged by adding one page (9), the authorities
change: 4, 5 and 8 have non-zero authorities (.853, .224 and .47), and
1, 2, 3, 6, 7 and 9 have non-zero hubs (.39, .49, .39, .21, .21 and .6).

Bad news from the stability point of view. Can be fixed by putting a weak
link between any two pages (saying, in essence, that you expect every page
to be reachable from every other page).
27. Finding minority communities

- How to retrieve pages from smaller communities? A method for finding pages
  in the n-th largest community (sketched below):
  - Identify the next largest community using the existing algorithm.
  - Destroy this community by removing links associated with pages having
    large authorities.
  - Reset all authority and hub values back to 1 and calculate all authority
    and hub values again.
  - Repeat the above n - 1 times, and the next largest community will be the
    n-th largest community.
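A rough sketch of this loop, reusing hits(); how many top authorities to
knock out per round is our assumption:

  import numpy as np

  def nth_community(A, n, iters=50, top=3):
      A = A.copy().astype(float)
      for _ in range(n - 1):
          a, _ = hits(A, iters)
          for p in np.argsort(a)[-top:]:  # destroy the current community by
              A[:, p] = 0                 # removing links into its top authorities
      return hits(A, iters)               # scores are recomputed from scratch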
28. Multiple Clusters on "House"

Query: House (first community)

29. Authority and Hub Pages (26)

Query: House (second community)
30. PageRank
31. The importance of publishing..

- The A/H algorithm was published in SODA as well as JACM
  - Kleinberg became very famous in the scientific community (and got a
    MacArthur "genius" award)
- The PageRank algorithm was rejected from SIGIR and was never explicitly
  published
  - Larry Page never got a genius award or even a PhD (and had to be content
    with being a mere billionaire)
32. PageRank (Importance as Stationary Visit Probability on a Markov Chain)

- Basic idea:
  - Think of the Web as a big graph. A random surfer keeps randomly clicking
    on the links.
  - The importance of a page is the probability that the surfer finds
    herself on that page.
  - Talk of a transition matrix instead of an adjacency matrix:
    - The transition matrix M is derived from the adjacency matrix A: if
      there are F(u) forward links from a page u, then the probability that
      the surfer clicks on any one of those is 1/F(u). (Columns sum to 1: a
      stochastic matrix.) M is the normalized version of A^T.
  - But even a dumb user may once in a while do something other than follow
    the URLs on the current page..
    - Idea: put a small probability that the user goes off to a page not
      pointed to by the current page.

The principal eigenvector gives the stationary distribution!
33. Markov Chains & Random Surfer Model

- Markov chains & stationary distributions
  - Necessary conditions for the existence of a unique steady-state
    distribution: aperiodicity and irreducibility
    - Irreducibility: each node can be reached from every other node with
      non-zero probability
      - Must not have sink nodes (which have no out-links), because we could
        have several different steady-state distributions based on which
        sink we get stuck in
        - If there are sink nodes, change them so that you can transition
          from them to every other node with low probability
      - Must not have disconnected components, because we could have several
        different steady-state distributions depending on which disconnected
        component we get stuck in
    - It is sufficient to put a low-probability link from every node to
      every other node (in addition to the normal-weight links corresponding
      to actual hyperlinks)
- The parameters of the random surfer model
  - c: the probability that the surfer follows the page
    - The larger it is, the more the surfer sticks to what the page says
  - M: the way the link matrix is converted to a Markov chain
    - Can make the links have differing transition probabilities
      - E.g., query-specific links have higher probability; links in bold
        have higher probability, etc.
  - K: the reset distribution of the surfer (a great thing to tweak)
    - It is quite feasible to have m different reset distributions
      corresponding to m different populations of users (or m possible
      topic-oriented searches)
    - It is also possible to make the reset distribution depend on other
      things, such as
      - trust of the page (TrustRank)
      - recency of the page (recency-sensitive rank)
34. Computing PageRank (10)

Example: Suppose the Web graph is A -> C, B -> C, C -> D, D -> A, D -> B.

Adjacency matrix (rows = source, columns = destination, order A, B, C, D):

      A B C D
  A [ 0 0 1 0 ]
  B [ 0 0 1 0 ]
  C [ 0 0 0 1 ]
  D [ 1 1 0 0 ]

Transition matrix M (columns sum to 1):

      A B C D
  A [ 0 0 0 ½ ]
  B [ 0 0 0 ½ ]
  C [ 1 1 0 0 ]
  D [ 0 0 1 0 ]
35. Computing PageRank

- Matrix representation
- Let M be an N×N matrix and m_uv be the entry at the u-th row and v-th
  column.
  - m_uv = 1/N_v if page v has a link to page u (where N_v is the number of
    forward links of v)
  - m_uv = 0 if there is no link from v to u
- Let R_i be the N×1 rank vector for the i-th iteration, and R_0 be the
  initial rank vector.
- Then R_i = M × R_{i-1}
36. Computing PageRank

- If the ranks converge, i.e., there is a rank vector R such that
  R = M × R,
  then R is the eigenvector of matrix M with eigenvalue 1.
- Convergence is guaranteed only if
  - M is aperiodic (the Web graph is not one big cycle). This is practically
    guaranteed for the Web.
  - M is irreducible (the Web graph is strongly connected). This is usually
    not true.

The principal eigenvalue of a stochastic matrix is 1.
37. Computing PageRank (6)

- Rank sink: a page or a group of pages is a rank sink if it can receive
  rank propagation from its parents but cannot propagate rank to other
  pages.
- A rank sink causes the loss of total rank.

[Figure: example with pages A and B linking into the pair (C, D), which has
no links back out; (C, D) is a rank sink]
38. Computing PageRank (7)

- A solution to the non-irreducibility and rank sink problem:
  - Conceptually add a link from each page v to every page (including
    itself).
  - If v has no forward links originally, make all entries in the
    corresponding column in M be 1/N.
  - If v has forward links originally, replace 1/N_v in the corresponding
    column by c × 1/N_v and then add (1-c) × 1/N to all entries, 0 < c < 1.

The motivation comes also from the random-surfer model.
39. Computing PageRank (8)

Let Z have columns of 1/N for sink pages and 0 otherwise, and let K have 1/N
for all entries. Define

  M' = c (M + Z) + (1 - c) K

- M' is irreducible.
- M' is stochastic: the sum of all entries of each column is 1 and there are
  no negative entries.
- Therefore, if M is replaced by M' as in
  R_i = M' × R_{i-1}
  then convergence is guaranteed and there will be no loss of the total rank
  (which is 1).
40. Computing PageRank (9)

- Interpretation of M' based on the random walk model:
  - If page v has no forward links originally, a web surfer at v can jump to
    any page in the Web with probability 1/N.
  - If page v has forward links originally, a surfer at v can either follow
    a link to another page with probability c × 1/N_v, or jump to any page
    with probability (1-c) × 1/N.
41. Computing PageRank (10)

Example: Suppose the Web graph is again A -> C, B -> C, C -> D, D -> A,
D -> B, with transition matrix M (order A, B, C, D):

      A B C D
  A [ 0 0 0 ½ ]
  B [ 0 0 0 ½ ]
  C [ 1 1 0 0 ]
  D [ 0 0 1 0 ]
42. Computing PageRank (11)

- Example (continued): Suppose c = 0.8. All entries in Z are 0 and all
  entries in K are ¼.

  M' = 0.8 (M + Z) + 0.2 K =

    [ 0.05 0.05 0.05 0.45 ]
    [ 0.05 0.05 0.05 0.45 ]
    [ 0.85 0.85 0.05 0.05 ]
    [ 0.05 0.05 0.85 0.05 ]

- Compute rank by iterating R := M' × R.
  MATLAB says: R(A) = .338, R(B) = .338, R(C) = .6367, R(D) = .6052.
  (A sketch reproducing this follows.)
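A numpy sketch that builds M' from the adjacency matrix and reproduces these
numbers (the function packaging is ours):

  import numpy as np

  def pagerank(A, c=0.8, iters=100):
      N = A.shape[0]
      out = A.sum(axis=1)
      M = np.zeros((N, N))
      for v in range(N):                 # column v = surfer's moves from v
          M[:, v] = A[v] / out[v] if out[v] else 1.0 / N   # this is M + Z
      M1 = c * M + (1 - c) / N           # M' = c(M + Z) + (1 - c)K
      R = np.ones(N) / N
      for _ in range(iters):
          R = M1 @ R
          R /= np.linalg.norm(R)         # unit vector, like MATLAB's eig
      return R

  A = np.array([[0, 0, 1, 0],   # A -> C
                [0, 0, 1, 0],   # B -> C
                [0, 0, 0, 1],   # C -> D
                [1, 1, 0, 0]])  # D -> A, B
  print(pagerank(A).round(4))   # ≈ [0.338, 0.338, 0.6366, 0.6052]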
43. Comparing PR & A/H on the same graph

[Figure: PageRank scores vs. authority/hub scores on the same example graph]
44. Combining PR & content similarity

- Incorporate the ranks of pages into the ranking function of a search
  engine.
- The ranking score of a web page can be a weighted sum of its regular
  similarity with a query and its importance:

  ranking_score(q, d) = w × sim(q, d) + (1 - w) × R(d),  if sim(q, d) > 0
                      = 0,                               otherwise

  where 0 < w < 1.
- Both sim(q, d) and R(d) need to be normalized to [0, 1].

Who sets w?
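The rule as code, with w left as the open tuning knob:

  def ranking_score(sim_qd, R_d, w=0.5):
      # both sim(q, d) and R(d) are assumed normalized to [0, 1]
      return w * sim_qd + (1 - w) * R_d if sim_qd > 0 else 0.0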
45. We can pick and choose

- Two alternate ways of computing page importance:
  - I1. As authorities/hubs
  - I2. As the stationary distribution over the underlying Markov chain
- Two alternate ways of combining importance with similarity:
  - C1. Compute importance over a set derived from the top-100 similar pages
  - C2. Combine apples & oranges: a×importance + b×similarity

We can pick any pair of alternatives (even though I1 was originally proposed
with C1, and I2 with C2).
46. Efficient computation: Prioritized Sweeping

We can use asynchronous iterations, where an iteration uses some of the
values already updated in the current iteration. (A sketch follows.)
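A sketch of one asynchronous sweep: each page's rank is overwritten in
place, so pages later in the sweep order already see the new values (the
function shape and dense-matrix form are our simplifications):

  def async_sweep(M1, R, order):
      # M1: the damped transition matrix M'; R: the current rank vector;
      # order: a (possibly prioritized) ordering of the pages
      for d in order:
          R[d] = M1[d] @ R      # uses values updated earlier in this sweep
      return R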
47. Efficient Computation: Preprocess

- Remove dangling nodes
  - Pages w/ no children
- Then repeat the process
  - Since the removals may create new danglers
- Stanford WebBase
  - 25 M pages
  - 81 M URLs in the link graph
  - After two prune iterations: 19 M nodes
48. Representing the Links Table

- Stored on disk in binary format
- Size for Stanford WebBase: 1.01 GB
  - Assumed to exceed main memory
49. Algorithm 1

  ∀s ∈ Sources: Source[s] = 1/N
  while residual > ε:
      ∀d ∈ Dest: Dest[d] = 0
      while not Links.eof():
          Links.read(source, n, dest1, ..., destn)
          for j = 1..n:
              Dest[destj] = Dest[destj] + Source[source]/n
      ∀d ∈ Dest:
          Dest[d] = c × Dest[d] + (1-c)/N   /* dampening */
      residual = ||Source - Dest||          /* recompute every few iterations */
      Source = Dest
50. Analysis of Algorithm 1

- If memory is big enough to hold Source & Dest
  - I/O cost per iteration is |Links|
  - Fine for a crawl of 24 M pages
  - But the web was ~800 M pages in 2/99 (NEC study), up from 320 M pages in
    1997 (same authors)
- If memory is big enough to hold just Dest
  - Sort Links on the source field
  - Read Source sequentially during the rank propagation step
  - Write Dest to disk to serve as Source for the next iteration
  - I/O cost per iteration is |Source| + |Dest| + |Links|
- If memory can't hold Dest
  - The random access pattern will make the working set = |Dest|
  - Thrash!!!
51. Block-Based Algorithm

- Partition Dest into B blocks of D pages each
  - If memory = P physical pages, then D < P - 2, since we need input
    buffers for Source & Links
- Partition Links into B files
  - Links_i only has some of the dest nodes for each source: those dest
    nodes such that DD×i <= dest < DD×(i+1), where DD = the number of 32-bit
    integers that fit in D pages

[Figure: Source streamed against the sparse Links partitions to fill one
Dest block at a time]
52. Partitioned Link File

  Source node   Outdegr   Num out   Destination nodes
  (32 bit int)  (16 bit)  (16 bit)  (32 bit int)
  ---------------------------------------------------
  Buckets 0-31:
  0             4         2         12, 26
  1             3         1         5
  2             5         3         1, 9, 10
  Buckets 32-63:
  0             4         1         58
  1             3         1         56
  2             5         1         36
  Buckets 64-95:
  0             4         1         94
  1             3         1         69
  2             5         1         78
53. Block-based PageRank algorithm
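The slide itself is an image; the following sketch of the block variant is
our reconstruction under the partitioning just described (read() is a
hypothetical helper that streams (source, outdegree, destinations) records
from a partitioned links file):

  def block_pagerank_iteration(source, links_files, B, D, c, N):
      dest_blocks = []
      for i in range(B):                   # one pass per Dest block
          block = [0.0] * D                # only this block is in memory
          for s, outdeg, dests in read(links_files[i]):  # Links_i on disk
              for d in dests:              # all dests fall inside block i
                  block[d - i * D] += source[s] / outdeg
          dest_blocks.append([c * x + (1 - c) / N for x in block])
      return dest_blocks                   # becomes Source next iteration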
54. Analysis of Block Algorithm

- I/O cost per iteration:
  B × |Source| + |Dest| + |Links| × (1 + e)
  - e is the factor by which Links increased in size, typically 0.1-0.3;
    depends on the number of blocks
- The algorithm is essentially a nested-loops join
55. Comparing the Algorithms
56. Effect of collusion on PageRank

[Figure: two copies of a small graph on pages A, B, C; in the second, the
pages also link to each other to boost their rank]

Moral: by referring to each other, a cluster of pages can artificially boost
their rank (although the cluster has to be big enough to make an appreciable
difference).
Solution: put a threshold on the number of intra-domain links that will
count.
Counter: buy two domains, and generate a cluster across those..
59. Use of Link Information

- PageRank defines the global importance of web pages, but the importance is
  domain/topic independent.
- We often need to find important/authoritative pages that are relevant to a
  given query.
  - What are the important web browser pages?
  - Which pages are important game pages?
- Idea: use a notion of topic-specific PageRank
  - Involves using a non-uniform reset probability
60. Topic-Specific PageRank

[Haveliwala, WWW 2002]

- For each page, compute k different PageRanks, where k = the number of
  top-level hierarchies in the Open Directory Project
  - When computing the PageRank w.r.t. a topic, say that with probability e
    we transition to one of the pages of topic k
- When a query q is issued,
  - Compute the similarity between q (and its context) and each of the
    topics
  - Take the weighted combination of the topic-specific PageRanks, weighted
    by the similarity to the different topics (sketched below)
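A sketch of the query-time mixing (pr_topic[k] is assumed to hold the
PageRank vector computed with topic k's ODP pages as the reset set):

  import numpy as np

  def topic_sensitive_score(doc, query_topic_sims, pr_topic):
      sims = np.asarray(query_topic_sims, dtype=float)
      weights = sims / sims.sum()        # similarity of q to each topic
      return sum(w * pr[doc] for w, pr in zip(weights, pr_topic))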
61. Stability of Rank Calculations

(From Ng et al.)

[Figure: the leftmost column shows the original rank calculation; the
columns on the right are the results of rank calculations when 30% of pages
are randomly removed]
63. PageRank is more stable because the random surfer model allows
low-probability edges to every place. A/H can be made stable with
subspace-based A/H values (see Ng et al. 2001).
64. Novel uses of Link Analysis

- Link analysis algorithms (HITS and PageRank) are not limited to hyperlinks
  - Citeseer/Cora use them for analyzing citations (the link is through
    citation)
    - See the irony here: link analysis ideas originated from citation
      analysis, and are now being applied back to citation analysis
  - Some new work on keyword search on databases uses foreign-key links and
    link analysis to decide which of the tuples matching the keyword query
    are most important (the link is through foreign keys)
    - Sudarshan et al., ICDE 2002
    - Keyword search on databases is useful to make structured databases
      accessible to naïve users who don't know structured languages (such as
      SQL).
66. Query complexity

- Complex queries (966 trials)
  - Average words: 7.03
  - Average operators ("): 4.34
- Typical AltaVista queries are much simpler (Silverstein, Henzinger, Marais
  and Moricz)
  - Average query words: 2.35
  - Average operators ("): 0.41
- Forcibly adding a hub or authority node helped in 86% of the queries
67. What about non-principal eigenvectors?

- The principal eigenvector gives the authorities (and hubs)
- What do the other ones do?
  - They may be able to show the clustering in the documents (see page 23 in
    the Kleinberg paper)
  - The clusters are found by looking at the positive and negative ends of
    the secondary eigenvectors (the principal eigenvector has only a +ve
    end)
69. Summary of Key Points

- PageRank: an iterative algorithm
- Rank sinks
- Efficiency of computation: memory!
  - Single-precision numbers
  - Don't represent M explicitly
  - Break arrays into blocks
  - Minimize I/O cost
- Number of iterations of PageRank
- Weighting of PageRank vs. doc similarity
70. Beyond Google (and PageRank)

- Are backlinks a reliable metric of importance?
  - It is a "one-size-fits-all" measure of importance
    - Not user specific
    - Not topic specific
  - There may be a discrepancy between backlinks and actual popularity (as
    measured in hits)
  - The sense of the link is ignored (this is okay if you think that all
    publicity is good publicity)
    - Mark Twain on classics: "A classic is something everyone wishes they
      had already read and no one actually has.." (paraphrase)
  - Google may be its own undoing (why would I need backlinks when I know I
    can get to a page through Google?)
- Customization, customization, customization
  - Yahoo sez about their magic bullet.. (NYT 2/22/04): "If you type in
    flowers, do you want to buy flowers, plant flowers or see pictures of
    flowers?"
71. Challenges in Web Search Engines

- Spam
  - Text spam
  - Link spam
  - Cloaking
- Content quality
  - Anchor text quality
- Quality evaluation
  - Indirect feedback
- Web conventions
  - Articulate and develop validation
- Duplicate hosts
  - Mirror detection
- Vaguely structured data
  - Page layout
  - The advantage of making the rendering/content language be the same
72. Spam is a serious problem

- We have Spam, Spam, Spam, Spam, Spam with Eggs and Spam
  - in email
    - Most mail transmitted is junk
  - in web pages
    - Many different ways of fooling search engines
- This is an open arms race
  - Annual conference on Email and Anti-Spam (started 2004)
  - Intl. workshop on AIRWeb (Adversarial Information Retrieval on the Web),
    started in 2005 at WWW
73. Trust & Spam (Knock-Knock. Who is there?)

- A powerful way we avoid spam in the physical world is by preferring
  interactions only with trusted parties
- Trust is propagated over social networks
  - When knocking on the doors of strangers, the first thing we do is to
    identify ourselves as a friend of a friend of a friend, so they won't
    train their dogs/guns on us..
  - Knock-knock. Who is there? Aardvark. Okay (door opened) -- not funny.
    Aardvark who? "Aardvark a million miles for one of your smiles" -- FUNNY
- We can do it in the cyber world too
  - Accept product recommendations only from trusted parties (e.g.,
    Epinions)
  - Accept mail only from individuals whom you trust above a certain
    threshold
  - Bias page importance computation so that it counts only links from
    trusted sites..
    - Sort of like discounting links that are "off topic"
74. Trust Propagation

- Trust is transitive, so it is easy to propagate
  - ..but it attenuates as it traverses a social network
    - If I trust you, I trust your friend (but a little less than I do you),
      and I trust your friend's friend even less
- Trust may not be symmetric..
- Trust is normally additive
  - If you are a friend of two of my friends, maybe I trust you more..
- Distrust is difficult to propagate
  - If my friend distrusts you, then I probably distrust you..
  - ..but if my enemy distrusts you? Is the enemy of my enemy automatically
    my friend?
- Trust vs. Reputation
  - Trust is a user-specific metric: your trust in an individual may be
    different from someone else's
  - Reputation can be thought of as an aggregate, one-size-fits-all version
    of trust
  - Most systems, such as eBay, tend to use reputation rather than trust
    - Sort of the difference between user-specific vs. global PageRank
75. Case Study: Epinions

- Users can write reviews and also express trust/distrust of other users
- Reviewers get royalties, so some tried to game the system; hence, distrust
  measures were introduced

[Figure: number of nodes vs. out-degree in the Epinions trust graph]

Guha et al. (WWW 2004) compares some 81 different ways of propagating trust
and distrust on the Epinions trust matrix.
76. Evaluating Trust Propagation Approaches

- Given n users and a sparsely populated n×n matrix of trust values between
  the users (and optionally an n×n matrix of distrust values):
  - Start by erasing some of the entries (but remember the values you
    erased)
  - For each trust propagation method:
    - Use it to fill in the n×n matrix
    - Compare the predicted values to the erased values (sketched below)
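A sketch of this protocol (the 10% masking rate and the sign-agreement
metric are our stand-ins; propagate is whichever propagation method is under
test):

  import numpy as np

  def evaluate(trust, propagate, frac=0.1, seed=0):
      rng = np.random.default_rng(seed)
      known = np.argwhere(trust != 0)                  # observed entries
      hidden = known[rng.random(len(known)) < frac]    # entries to erase
      masked = trust.copy()
      masked[hidden[:, 0], hidden[:, 1]] = 0
      predicted = propagate(masked)                    # fill in the matrix
      truth = trust[hidden[:, 0], hidden[:, 1]]
      pred = predicted[hidden[:, 0], hidden[:, 1]]
      return np.mean(np.sign(pred) == np.sign(truth))  # agreement on erased entries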
77. Fighting Page Spam

We saw discussion of these in the Henzinger et al. paper.

Can social networks, which gave rise to the ideas of page importance
computation, also rescue these computations from spam?
78. TrustRank idea

[Gyongyi et al., VLDB 2004]

- Tweak the "default" distribution used in PageRank computation (the
  distribution a bored user uses when she doesn't want to follow the links):
  - From: uniform
  - To: trust based
- Very similar in spirit to the topic-sensitive or user-sensitive PageRank,
  where too you fiddle with the default distribution
- Sample a set of "seed pages" from the web
- Have an oracle (human) identify the good pages and the spam pages in the
  seed set
  - An expensive task, so we must make the seed set as small as possible
- Propagate trust (one pass)
- Use the normalized trust to set the initial distribution

Slides modified from Anand Rajaraman's lecture at Stanford
79. Example

[Figure: a small web of pages 1-7, some labeled good and some bad, showing
how trust spreads from a good seed along links]
80. Rules for trust propagation

- Trust attenuation
  - The degree of trust conferred by a trusted page decreases with distance
- Trust splitting
  - The larger the number of outlinks from a page, the less scrutiny the
    page author gives each outlink
  - Trust is split across outlinks
  - Combining splitting and damping, each outlink of a node p gets a
    propagated trust of b×t(p)/|O(p)|, where 0 < b < 1, O(p) is the set of
    outlinks, and t(p) is the trust of p
- Trust additivity
  - Propagated trust from different directions is added up
81. Simple model

- Suppose the trust of page p is t(p), with set of outlinks O(p)
- For each q ∈ O(p), p confers the trust b×t(p)/|O(p)|, for 0 < b < 1
- Trust is additive
  - The trust of p is the sum of the trust conferred on p by all its
    inlinked pages
- Note the similarity to topic-specific PageRank
  - Within a scaling factor, TrustRank = biased PageRank with the trusted
    pages as the teleport set

(A one-pass sketch follows.)
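One propagation pass under this model (the dictionary representation is our
choice):

  def propagate_trust(t, outlinks, b=0.85):
      # t: current trust per page; outlinks[p] = O(p); b is the attenuation
      new_t = {p: 0.0 for p in t}
      for p, qs in outlinks.items():
          if qs:
              share = b * t[p] / len(qs)    # attenuate, then split
              for q in qs:
                  new_t[q] += share         # trust is additive
      return new_t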
82. Picking the seed set

- Two conflicting considerations:
  - A human has to inspect each seed page, so the seed set must be as small
    as possible
  - We must ensure every good page gets adequate trust rank, so we need to
    make all good pages reachable from the seed set by short paths
83. Approaches to picking the seed set

- Suppose we want to pick a seed set of k pages
- The best idea would be to pick them from the top-k hub pages
  - Note that trustworthiness is subjective: Al Jazeera may be considered
    more trustworthy than the NY Times by some (and the reverse by others)
- PageRank
  - Pick the top k pages by PageRank
  - Assume high-PageRank pages are close to other highly ranked pages
  - We care more about high-PageRank good pages
84. Inverse PageRank

- Pick the pages with the maximum number of outlinks
- Can make it recursive
  - Pick pages that link to pages with many outlinks
- Formalize as inverse PageRank
  - Construct graph G' by reversing each edge in the web graph G
  - PageRank in G' is inverse PageRank in G
- Pick the top k pages by inverse PageRank (see the sketch below)
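In code, this is just PageRank on the transposed adjacency matrix, reusing
the pagerank() sketch from the earlier PageRank example:

  def inverse_pagerank(A, c=0.8, iters=100):
      return pagerank(A.T, c, iters)   # reverse every edge, then rank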
85. Anatomy of Google (circa 1999)

Slides from http://www.cs.huji.ac.il/sdbi/2000/google/index.htm
86. Some points

- Fancy hits?
- Why two types of barrels?
- How is indexing parallelized?
- How does Google show that it doesn't quite care about recall?
- How does Google avoid crawling the same URL multiple times?
- What are some of the memory-saving things they do?
- Do they use TF/IDF?
- Do they normalize? (Why not?)
- Can they support proximity queries?
- How are page synopses made?
87. Types of Web Queries

- Navigational
  - The user is looking for the address of a specific page (so the relevant
    set is a singleton!)
  - Success on these is responsible for much of the "Oooo" appeal of search
    engines..
- Informational
  - The user is trying to learn information about a specific topic (so the
    relevant set can be non-singleton)
- Transactional
  - The user is searching with the final aim of conducting a transaction on
    that page..
  - E.g., comparison shopping

88. Search Engine Size over Time

[Figure: number of indexed pages over time, self-reported. Google ≈ 50% of
the web?]
89. System Anatomy
90. Google Search Engine Architecture

URL Server - provides URLs to be fetched (the crawler is distributed)
Store Server - compresses and stores pages for indexing
Repository - holds pages for indexing (full HTML of every page)
Indexer - parses documents; records words, positions, font size, and
capitalization
Lexicon - list of unique words found
HitList - efficient record of word locations & attributes
Barrels - hold (docID, (wordID, hitList)) sorted; each barrel has a range of
words
Anchors - keep information about links found in web pages
URL Resolver - converts relative URLs to absolute
Sorter - generates the Doc Index
Doc Index - inverted index of all words in all documents (except stop words)
Links - stores info about links to each page (used for PageRank)
PageRank - computes a rank for each page retrieved
Searcher - answers queries

Source: Brin & Page
91. Major Data Structures

- BigFiles
  - virtual files spanning multiple file systems
  - addressable by 64-bit integers
  - handle allocation & deallocation of file descriptors, since the OS's
    support is not enough
  - support rudimentary compression
92. Major Data Structures (2)

- Repository
  - tradeoff between speed & compression ratio
  - chose zlib (3 to 1) over bzip (4 to 1)
  - requires no other data structure to access it
93. Major Data Structures (3)

- Document Index
  - keeps information about each document
  - fixed-width ISAM (index sequential access mode) index
  - includes various statistics
    - pointer to repository; if crawled, pointer to info lists
  - compact data structure
  - we can fetch a record in 1 disk seek during search
94. Major Data Structures (4)

- URL-to-docID file
  - used to convert URLs to docIDs
  - list of URL checksums with their docIDs, sorted by checksum
  - given a URL, a binary search is performed
  - conversion is done in batch mode
95. Major Data Structures (4)

- Lexicon
  - can fit in memory for a reasonable price
    - currently 256 MB
    - contains 14 million words
  - 2 parts: a list of words and a hash table
96. Major Data Structures (4)

- Hit Lists
  - include position, font & capitalization
  - account for most of the space used in the indexes
  - 3 alternatives: simple, Huffman, hand-optimized
  - hand encoding uses 2 bytes for every hit
97. Major Data Structures (4)
98. Major Data Structures (5)

- Forward Index
  - partially ordered
  - uses 64 barrels
  - each barrel holds a range of wordIDs
  - requires slightly more storage
  - each wordID is stored as a relative difference from the minimum wordID
    of the barrel
  - saves considerable time in the sorting
99. Major Data Structures (6)

- Inverted Index
  - 64 barrels (same as the forward index)
  - for each wordID, the Lexicon contains a pointer to the barrel that the
    wordID falls into
    - the pointer points to a doclist of docIDs with their hit lists
  - the order of the docIDs is important: by docID, or by doc word-ranking
    - hence the two inverted barrels: the short barrel and the full barrel
100. Major Data Structures (7)

- Crawling the Web
  - fast distributed crawling system
  - URLserver & crawlers are implemented in Python
  - each crawler keeps about 300 connections open
  - at peak time the rate is 100 pages, 600K per second
  - uses an internal cached DNS lookup
  - synchronized IO to handle events; a number of queues
  - robust & carefully tested
101. Major Data Structures (8)

- Indexing the Web
  - Parsing
    - should know how to handle errors
      - HTML typos
      - KBs of zeros in the middle of a tag
      - non-ASCII characters
      - HTML tags nested hundreds deep
    - they developed their own parser
      - involved a fair amount of work
      - did not cause a bottleneck
102. Major Data Structures (9)

- Indexing Documents into Barrels
  - turning words into wordIDs
  - in-memory hash table: the Lexicon
  - new additions are logged to a file
  - parallelization
    - shared lexicon of 14 million words
    - log of all the extra words
103. Major Data Structures (10)

- Indexing the Web
  - Sorting
    - creating the inverted index
    - produces two types of barrels
      - for titles and anchors (short barrels)
      - for full text (full barrels)
    - sorts every barrel separately
    - runs sorters in parallel
    - the sorting is done in main memory

Ranking looks at the short barrels first, and then the full barrels.
104. Searching

- Algorithm
  1. Parse the query
  2. Convert words into wordIDs
  3. Seek to the start of the doclist in the short barrel for every word
  4. Scan through the doclists until there is a document that matches all of
     the search terms
  5. Compute the rank of that document
  6. If we're at the end of the short barrels, start at the doclists of the
     full barrel, unless we have enough
  7. If we're not at the end of any doclist, go to step 4
  8. Sort the documents by rank & return the top K
  (May jump here after 40k pages)
105. The Ranking System

- The information:
  - position, font size, capitalization
  - anchor text
  - PageRank
- Hit types:
  - title, anchor, URL, etc.
  - small font, large font, etc.
106. The Ranking System (2)

- Each hit type has its own weight
  - Count-weights increase linearly with counts at first but quickly taper
    off; this is the IR score of the doc (IDF weighting??)
  - The IR score is combined with PageRank to give the final rank
- For a multi-word query:
  - A proximity score for every set of hits, with a proximity-type weight
    - 10 grades of proximity
107. Feedback

- A trusted user may optionally evaluate the results
- The feedback is saved
- When modifying the ranking function, we can see the impact of this change
  on all previous searches that were ranked
108. Results

- Produces better results than major commercial search engines for most
  searches
- Example query: "bill clinton"
  - returns results from Whitehouse.gov
  - email addresses of the president
  - all the results are high-quality pages
  - no broken links
  - no "bill" without "clinton" & no "clinton" without "bill"
109. Storage Requirements

- Using compression on the repository:
  - about 55 GB for all the data used by the SE
  - most of the queries can be answered by just the short inverted index
  - with better compression, a high-quality SE can fit onto a 7 GB drive of
    a new PC
110. Storage Statistics

[Table: web page statistics]
111. System Performance

- It took 9 days to download 26 million pages
  - 48.5 pages per second
- The indexer & crawler ran simultaneously
  - The indexer runs at 54 pages per second
- The sorters run in parallel using 4 machines; the whole sorting process
  took 24 hours