What is page importance



1
What is page importance?
  • Page importance is hard to define in a single way
    that satisfies everyone. There are, however, some
    desiderata
  • It should be sensitive to
  • The query
  • Or at least the topic of the query..
  • The user
  • Or at least the user population
  • The link structure of the web
  • The amount of accesses the page gets
  • It should be stable w.r.t. small random changes
    in the network link structure
  • It shouldn't be easy to subvert with intentional
    changes to link structure

How about the eloquence of the page?
The informativeness of the page?
2
Desiderata for link-based ranking
  • A page that is referenced by a lot of important
    pages (has more back links) is more important
    (Authority)
  • A page referenced by a single important page may
    be more important than that referenced by five
    unimportant pages
  • A page that references a lot of important pages
    is also important (Hub)
  • Importance can be propagated
  • Your importance is the weighted sum of the
    importance conferred on you by the pages that
    refer to you
  • The importance you confer on a page may be
    proportional to how many other pages you refer to
    (cite)
  • (Also what you say about them when you cite them!)

Different notions of importance
Question: Can we assign consistent authority/hub
values to pages?
3
Authorities and Hubs as mutually reinforcing
properties
  • Authorities and hubs related to the same query
    tend to form a bipartite subgraph of the web
    graph.
  • Suppose each page has an authority score a(p) and
    a hub score h(p)

[Figure: bipartite subgraph with hub pages on one side linking to authority pages on the other]
4
Authority and Hub Pages
  • I (Authority Computation): for each page p,
    a(p) = Σ h(q) over all q such that (q, p) ∈ E
  • O (Hub Computation): for each page p,
    h(p) = Σ a(q) over all q such that (p, q) ∈ E

[Figure: pages q1, q2, q3 pointing to p (authority case); p pointing to q1, q2, q3 (hub case)]
A set of simultaneous equations. Can we solve
these?
5
Authority and Hub Pages (8)
  • Matrix representation of operations I and O.
  • Let A be the adjacency matrix of SG: entry (p, q)
    is 1 if p has a link to q, else the entry is 0.
  • Let A^T be the transpose of A.
  • Let h_i be the vector of hub scores after i
    iterations.
  • Let a_i be the vector of authority scores after i
    iterations.
  • Operation I: a_i = A^T h_{i-1}
  • Operation O: h_i = A a_i (see the sketch below)

Normalize after every multiplication
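Here is a minimal sketch (not from the original slides) of operations I and O in this matrix form, using numpy; the normalization after every multiplication follows the note above.

```python
import numpy as np

def hits(A, iters=50):
    """Iterate a = A^T h and h = A a, normalizing after every multiplication.

    A[p, q] = 1 if page p has a link to page q (the adjacency matrix of
    the induced subgraph SG defined above)."""
    n = A.shape[0]
    a, h = np.ones(n), np.ones(n)
    for _ in range(iters):
        a = A.T @ h                 # Operation I: authorities from hubs
        a /= np.linalg.norm(a)      # normalize to unit Euclidean length
        h = A @ a                   # Operation O: hubs from authorities
        h /= np.linalg.norm(h)
    return a, h
```

Because normalization only rescales the vectors, normalizing a before computing h gives the same normalized scores as the order used in the worked example on the next slides.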
6
Authority and Hub Pages (11)
  • Example: Initialize all scores to 1.
  • 1st Iteration
  • I operation:
  • a(q1) = 1, a(q2) = a(q3) = 0,
  • a(p1) = 3, a(p2) = 2
  • O operation: h(q1) = 5,
  • h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
  • Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0,
  • a(p1) = 0.802, a(p2) = 0.535, h(q1) = 0.645,
  • h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129,
    h(p2) = 0

[Figure: example graph on pages q1, q2, q3, p1, p2]
7
Authority and Hub Pages (12)
  • After 2 Iterations:
  • a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791,
  • a(p2) = 0.609, h(q1) = 0.656, h(q2) = 0.371,
  • h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0
  • After 5 Iterations:
  • a(q1) = a(q2) = a(q3) = 0,
  • a(p1) = 0.788, a(p2) = 0.615
  • h(q1) = 0.657, h(q2) = 0.369,
  • h(q3) = 0.657, h(p1) = h(p2) = 0
  • (These numbers are reproduced in the sketch below.)

[Figure: the same example graph on pages q1, q2, q3, p1, p2]
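The numbers above can be checked with the hits() sketch from the earlier slide. The exact edge set is not in the transcript; the one below is inferred from the figure and the reported scores, so treat it as an assumption.

```python
import numpy as np

# Edge set inferred from the figure and the reported scores (an assumption):
pages = ["q1", "q2", "q3", "p1", "p2"]
edges = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"),
         ("q3", "p1"), ("q3", "p2"), ("p1", "q1")]
idx = {p: i for i, p in enumerate(pages)}
A = np.zeros((len(pages), len(pages)))
for u, v in edges:
    A[idx[u], idx[v]] = 1

a, h = hits(A, iters=5)   # hits() is the sketch defined above
# a is approximately (0, 0, 0, 0.788, 0.615) for (q1, q2, q3, p1, p2)
# h is approximately (0.657, 0.369, 0.657, 0, 0), matching the slide
```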
8
What happens if you multiply a vector by a matrix?
  • In general, when you multiply a vector by a
    matrix, the vector gets scaled as well as
    rotated
  • ..except when the vector happens to be in the
    direction of one of the eigen vectors of the
    matrix
  • .. in which case it only gets scaled (stretched)
  • A (symmetric square) matrix has all real eigen
    values, and the values give an indication of the
    amount of stretching that is done for vectors in
    that direction
  • The eigen vectors of the matrix define a new
    ortho-normal space
  • You can model the multiplication of a general
    vector by the matrix in terms of
  • First decompose the general vector into its
    projections in the eigen vector directions
  • ..which means just take the dot product of the
    vector with the (unit) eigen vector
  • Then multiply the projections by the
    corresponding eigen values to get the new vector.
  • This explains why power method converges to
    principal eigen vector..
  • ..since if a vector has a non-zero projection in
    the principal eigen vector direction, then
    repeated multiplication will keep stretching the
    vector in that direction, so that eventually all
    other directions vanish by comparison..

Optional
9
(why) Does the procedure converge?
As we multiply repeatedly with M, the component
of x in the direction of the principal eigen vector
gets stretched w.r.t. the other directions, so we
finally converge to the direction of the principal
eigen vector. Necessary condition: x must have a
component in the direction of the principal eigen
vector (c1 must be non-zero).
The rate of convergence depends on the eigen gap
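A minimal sketch of the power method just described (my own illustration, not from the slides): repeated multiplication by M with rescaling so the vector does not blow up; the smaller the eigen gap, the slower the convergence.

```python
import numpy as np

def power_iteration(M, iters=100, tol=1e-10):
    """Converge to the principal eigen vector of M, provided the start
    vector has a non-zero component in that direction (c1 != 0)."""
    rng = np.random.default_rng(0)
    x = rng.random(M.shape[0])          # random start: c1 != 0 almost surely
    for _ in range(iters):
        x_new = M @ x
        x_new /= np.linalg.norm(x_new)  # rescale; only the direction matters
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x_new
```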
10
Can we power iterate to get other (secondary)
eigen vectors?
  • Yes: just find a matrix M2 such that M2 has the
    same eigen vectors as M, but the eigen value
    corresponding to the first eigen vector e1 is
    zeroed out.
  • Now do power iteration on M2
  • Alternately, start with a random vector v, find a
    new vector v' = v - (v·e1)e1, and do power
    iteration on M with v' (see the sketch below)

Why? 1. M2 e1 = 0    2. If e2 is the
second eigen vector of M, then
it is also an eigen vector of M2
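A sketch of the second alternative above, assuming M is symmetric (as A^T A and A A^T are) so that eigen vectors are orthogonal, and assuming e1 is the unit principal eigen vector already computed (for example by the power_iteration sketch earlier).

```python
import numpy as np

def second_eigenvector(M, e1, iters=100):
    """Power iteration after projecting out e1: v' = v - (v . e1) e1."""
    rng = np.random.default_rng(1)
    v = rng.random(M.shape[0])
    v -= (v @ e1) * e1               # remove the component along e1
    for _ in range(iters):
        v = M @ v
        v -= (v @ e1) * e1           # re-project to suppress round-off drift
        v /= np.linalg.norm(v)
    return v
```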
11
Authority and Hub Pages
  • Algorithm (summary)
  • submit q to a search engine to obtain the
    root set S
  • expand S into the base set T
  • obtain the induced subgraph SG(V, E) using T
  • initialize a(p) = h(p) = 1 for all p in V
  • for each p in V until the scores converge
  • apply Operation I
  • apply Operation O
  • normalize a(p) and h(p)
  • return pages with top authority and hub scores

12
(No Transcript)
13
Base set computation
  • can be made easy by storing the link structure of
    the Web in advance in a link structure table
    (built during crawling)
  • --Most search engines serve this
    information now. (e.g. Google's link: search)
  • parent_url    child_url
  • url1          url2
  • url1          url3

14
Authority and Hub Pages (9)
  • After each iteration of applying Operations I
    and O, normalize all authority and hub scores.
  • Repeat until the scores for each page
    converge (the convergence is guaranteed).
  • 5. Sort pages in descending authority scores.
  • 6. Display the top authority pages.

15
Handling spam links
  • Should all links be equally treated?
  • Two considerations
  • Some links may be more meaningful/important than
    other links.
  • Web site creators may trick the system to make
    their pages more authoritative by adding dummy
    pages pointing to their cover pages (spamming).

16
Handling Spam Links (contd)
  • Transverse link: a link between pages with
    different domain names.
  • Domain name: the first level of the URL of a
    page.
  • Intrinsic link: a link between pages with the same
    domain name.
  • Transverse links are more important than
    intrinsic links.
  • Two ways to incorporate this
  • Use only transverse links and discard intrinsic
    links.
  • Give lower weights to intrinsic links.

17
Handling Spam Links (contd)
  • How to give lower weights to intrinsic links?
  • In adjacency matrix A, entry (p, q) should be
    assigned as follows
  • If p has a transverse link to q, the entry is 1.
  • If p has an intrinsic link to q, the entry is c,
    where 0 < c < 1 (see the sketch below).
  • If p has no link to q, the entry is 0.
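A small sketch of this weighting rule (the helper and its domain test are my own, using the hostname as a stand-in for the slide's "first level of the URL"): transverse links keep weight 1, intrinsic links get weight c.

```python
import numpy as np
from urllib.parse import urlparse

def weighted_adjacency(pages, links, c=0.5):
    """pages: list of URLs; links: list of (p, q) URL pairs.
    Entry (p, q) is 1 for a transverse link (different domains),
    c for an intrinsic link (same domain), and 0 for no link."""
    idx = {u: i for i, u in enumerate(pages)}
    domain = lambda u: urlparse(u).netloc
    A = np.zeros((len(pages), len(pages)))
    for p, q in links:
        A[idx[p], idx[q]] = 1.0 if domain(p) != domain(q) else c
    return A
```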

18
Considering link context
  • For a given link (p, q), let V(p, q) be the
    vicinity (e.g., ±50 characters) of the link.
  • If V(p, q) contains terms in the user query
    (topic), then the link should be more useful for
    identifying authoritative pages.
  • To incorporate this: in adjacency matrix A, make
    the weight associated with link (p, q) equal to
    1 + n(p, q),
  • where n(p, q) is the number of terms in V(p, q)
    that appear in the query (see the sketch below).
  • Alternately, consider the vector similarity
    between V(p,q) and the query Q
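A sketch of the first variant (helper name and arguments are mine): count the query terms that occur within the ±50-character vicinity of the anchor and use 1 + n(p, q) as the link weight.

```python
def link_weight(page_text, anchor_start, anchor_end, query_terms, window=50):
    """Weight 1 + n(p, q), where n(p, q) is the number of query terms
    appearing within `window` characters of the link's anchor text."""
    vicinity = page_text[max(0, anchor_start - window):anchor_end + window].lower()
    n = sum(1 for t in query_terms if t.lower() in vicinity)
    return 1 + n
```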

19
(No Transcript)
20
Evaluation
  • Sample experiments
  • Rank based on large in-degree (or backlinks)
  • query: game
  • Rank  in-degree  URL
  • 1     13   http://www.gotm.org
  • 2     12   http://www.gamezero.com/team-0/
  • 3     12   http://ngp.ngpc.state.ne.us/gp.html
  • 4     12   http://www.ben2.ucla.edu/permadi/gamelink/gamelink.html
  • 5     11   http://igolfto.net/
  • 6     11   http://www.eduplace.com/geo/indexhi.html
  • Only pages 1, 2 and 4 are authoritative game
    pages.

21
Evaluation
  • Sample experiments (continued)
  • Rank based on large authority score.
  • query: game
  • Rank  Authority  URL
  • 1     0.613   http://www.gotm.org
  • 2     0.390   http://ad.doubleclick.net/jump/gamefan-network.com/
  • 3     0.342   http://www.d2realm.com/
  • 4     0.324   http://www.counter-strike.net
  • 5     0.324   http://tech-base.com/
  • 6     0.306   http://www.e3zone.com
  • All pages are authoritative game pages.

22
Authority and Hub Pages (19)
  • Sample experiments (continued)
  • Rank based on large authority score.
  • query: free email
  • Rank  Authority  URL
  • 1     0.525   http://mail.chek.com/
  • 2     0.345   http://www.hotmail.com/
  • 3     0.309   http://www.naplesnews.net/
  • 4     0.261   http://www.11mail.com/
  • 5     0.254   http://www.dwp.net/
  • 6     0.246   http://www.wptamail.com/
  • All pages are authoritative free email pages.

23
Cora thinks Rao is authoritative on Planning;
Citeseer has him down at 90th position.
How come?
  --Planning has two clusters:
    --Planning & reinforcement learning
    --Deterministic planning
  --The first is a bigger cluster
  --Rao is big in the second cluster
24
Tyranny of Majority
Which do you think are authoritative
pages? Which are good hubs? Intuitively, we
would say that 4, 8 and 5 will be authoritative
pages and 1, 2, 3, 6 and 7 will be hub pages.
[Figure: two disconnected communities: hubs 1, 2, 3 pointing to authorities 4, 5; hubs 6, 7 pointing to authority 8]
BUT the power iteration will show that only 4 and
5 have non-zero authorities (.923, .382), and only
1, 2 and 3 have non-zero hubs (.5, .7, .5).
The authority and hub mass will concentrate
completely in the first component as the
iterations increase. (See next slide)
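This behaviour can be checked with the hits() sketch from earlier. The exact edge set is not in the transcript; the one below is inferred from the figure and the reported scores, so treat it as an assumption.

```python
import numpy as np

# Inferred edge set (an assumption): bigger community 1->4, 2->4, 2->5, 3->4;
# smaller community 6->8, 7->8.
edges = [(1, 4), (2, 4), (2, 5), (3, 4), (6, 8), (7, 8)]
A = np.zeros((8, 8))
for u, v in edges:
    A[u - 1, v - 1] = 1

a, h = hits(A, iters=100)   # hits() is the sketch defined earlier
# a is approximately (0, 0, 0, 0.92, 0.38, 0, 0, 0): all authority mass
# concentrates on 4 and 5; the smaller community's authority 8 goes to zero.
# h is approximately (0.5, 0.71, 0.5, 0, 0, 0, 0, 0).
```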
25
Tyranny of Majority (explained)
Suppose h0 and a0 are all initialized to 1.
[Figure: hub pages p1, p2, ..., pm all point to authority p; hub pages q1, q2, ..., qn all point to authority q; m > n]
With each iteration a(p) is multiplied by roughly m
and a(q) by roughly n, so the ratio a(q)/a(p)
shrinks like (n/m)^k and the smaller community's
scores vanish under normalization.
26
Impact of Bridges..
[Figure: the two communities (hubs 1, 2, 3 pointing to authorities 4, 5; hubs 6, 7 pointing to authority 8), now bridged through a new page 9]
When the graph is disconnected, only 4 and 5 have
non-zero authorities (.923, .382), and only 1, 2
and 3 have non-zero hubs (.5, .7, .5).
Bad news from the stability point of view. Can be
fixed by putting a weak link between any
two pages (saying in essence that you
expect every page to be reachable from
every other page).
When the components are bridged by adding one
page (9), the authorities change: now 4, 5 and 8
have non-zero authorities (.853, .224, .47), and 1,
2, 3, 6, 7 and 9 have non-zero hubs (.39, .49,
.39, .21, .21, .6).
27
Finding minority Communities
  • How to retrieve pages from smaller communities?
  • A method for finding pages in nth largest
    community
  • Identify the next largest community using the
    existing algorithm.
  • Destroy this community by removing links
    associated with pages having large authorities.
  • Reset all authority and hub values back to 1 and
    calculate all authority and hub values again.
  • Repeat the above n - 1 times and the next largest
    community will be the nth largest community.

28
Multiple Clusters on House
Query House (first community)
29
Authority and Hub Pages (26)
Query House (second community)
30
PageRank
31
The importance of publishing..
  • A/H algorithm was published in SODA as well as
    JACM
  • Kleinberg became very famous in the scientific
    community (and got a MacArthur Genius award)
  • Pagerank algorithm was rejected from SIGIR and
    was never explicitly published
  • Larry Page never got a genius award or even a PhD
  • (and had to be content with being a mere
    billionaire)

32
PageRank (Importance as Stationary Visit
Probability on a Markov Chain)
  • Basic Idea
  • Think of Web as a big graph. A random surfer
    keeps randomly clicking on the links.
  • The importance of a page is the probability that
    the surfer finds herself on that page
  • --Talk of transition matrix instead of adjacency
    matrix
  • Transition matrix M derived from adjacency
    matrix A
  • --If there are F(u) forward links from a
    page u, then the probability that the surfer
    clicks on any of those is 1/F(u)
    (Columns sum to 1: a stochastic matrix)
  • M is the normalized version of A^T
  • --But even a dumb user may once in a while do
    something other than follow URLs on the current
    page..
  • --Idea: Put a small probability that
    the user goes off to a page not pointed to by the
    current page.

Principal eigenvector Gives the stationary
distribution!
33
Markov Chains & the Random Surfer Model
  • Markov Chains & stationary distributions
  • Necessary conditions for existence of a unique
    steady state distribution: Aperiodicity and
    Irreducibility
  • Irreducibility: Each node can be reached from
    every other node with non-zero probability
  • Must not have sink nodes (which have no out
    links)
  • Because we can have several different steady
    state distributions based on which sink we get
    stuck in
  • If there are sink nodes, change them so that you
    can transition from them to every other node with
    low probability
  • Must not have disconnected components
  • Because we can have several different steady
    state distributions depending on which
    disconnected component we get stuck in
  • Sufficient to put a low probability link from
    every node to every other node (in addition to
    the normal weight links corresponding to actual
    hyperlinks)
  • The parameters of the random surfer model
  • c: the probability that the surfer follows a link
    on the current page
  • The larger it is, the more the surfer sticks to
    what the page says
  • M: the way the link matrix is converted to a
    Markov chain
  • Can make the links have differing transition
    probability
  • E.g. query-specific links have higher probability,
    links in bold have higher probability, etc.
  • K: the reset distribution of the surfer (a great
    thing to tweak)
  • It is quite feasible to have m different reset
    distributions corresponding to m different
    populations of users (or m possible
    topic-oriented searches)
  • It is also possible to make the reset
    distribution depend on other things such as
  • trust of the page (TrustRank)
  • recency of the page (recency-sensitive rank)

34
Computing PageRank (10)
  • Example: Suppose the Web graph has pages A, B, C, D
    with links A→C, B→C, C→D, D→A, D→B.
  • [Figure: the four-page graph A, B, C, D]
  • Adjacency matrix A (entry (u, v) = 1 if u links
    to v), rows and columns in the order A B C D:
    0 0 1 0
    0 0 1 0
    0 0 0 1
    1 1 0 0
  • Transition matrix M (column u holds the surfer's
    probabilities of leaving page u):
    0 0 0 ½
    0 0 0 ½
    1 1 0 0
    0 0 1 0
35
Computing PageRank
  • Matrix representation
  • Let M be an N × N matrix and m_uv be the entry at
    the u-th row and v-th column.
  • m_uv = 1/N_v if page v has a link to page u,
    where N_v is the number of forward links of v
  • m_uv = 0 if there is no link from v to u
  • Let R_i be the N × 1 rank vector for the i-th
    iteration, and R_0 be the initial rank vector.
  • Then R_i = M × R_{i-1}

36
Computing PageRank
  • If the ranks converge, i.e., there is a rank
    vector R such that
  • R = M × R,
  • then R is the eigenvector of matrix M with
    eigenvalue 1.
  • Convergence is guaranteed only if
  • M is aperiodic (the Web graph is not a big
    cycle). This is practically guaranteed for Web.
  • M is irreducible (the Web graph is strongly
    connected). This is usually not true.

The principal eigen value for a stochastic matrix is 1
37
Computing PageRank (6)
  • Rank sink: a page or a group of pages is a rank
    sink if it can receive rank propagation from
    its parents but cannot propagate rank to other
    pages.
  • A rank sink causes the loss of total rank.
  • Example

[Figure: example graph on A, B, C, D in which the pair (C, D) is a rank sink]
38
Computing PageRank (7)
  • A solution to the non-irreducibility and rank
    sink problem.
  • Conceptually add a link from each page v to every
    page (include self).
  • If v has no forward links originally, make all
    entries in the corresponding column in M be 1/N.
  • If v has forward links originally, replace 1/N_v
    in the corresponding column by c × 1/N_v and then
    add (1-c) × 1/N to all entries, 0 < c < 1.

Motivation comes also from random-surfer model
39
Computing PageRank (8)
Z has 1/N in the columns for sink pages and 0 otherwise;
K has 1/N in every entry
  • M* = c (M + Z) + (1 - c) K
  • M* is irreducible.
  • M* is stochastic: the sum of all entries of each
    column is 1 and there are no negative entries.
  • Therefore, if M is replaced by M* as in
  • R_i = M* × R_{i-1}
  • then the convergence is guaranteed and there
    will be no loss of the total rank (which is 1).
    (See the sketch below.)
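A minimal sketch of building M* from the adjacency matrix, following these definitions (the function name is mine):

```python
import numpy as np

def google_matrix(A, c=0.8):
    """M* = c (M + Z) + (1 - c) K, where M is the column-stochastic
    transition matrix (M[u, v] = 1/N_v if v links to u), Z has 1/N in
    the columns of sink pages, and K has 1/N everywhere."""
    N = A.shape[0]
    out_deg = A.sum(axis=1)                  # N_v: forward links of each page
    M = np.zeros((N, N))
    for v in range(N):
        if out_deg[v] > 0:
            M[:, v] = A[v, :] / out_deg[v]   # column v sums to 1
    Z = np.zeros((N, N))
    Z[:, out_deg == 0] = 1.0 / N             # patch the sink columns
    K = np.full((N, N), 1.0 / N)             # uniform reset distribution
    return c * (M + Z) + (1 - c) * K
```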

40
Computing PageRank (9)
  • Interpretation of M* based on the random walk
    model.
  • If page v has no forward links originally, a web
    surfer at v can jump to any page in the Web with
    probability 1/N.
  • If page v has forward links originally, a surfer
    at v can either follow a link to another page
    with probability c × 1/N_v, or jump to any page
    with probability (1-c) × 1/N.

41
Computing PageRank (10)
  • Example: Suppose the Web graph is the same
    four-page graph as before (A→C, B→C, C→D, D→A,
    D→B), with transition matrix M (rows and columns
    in the order A B C D):
    0 0 0 ½
    0 0 0 ½
    1 1 0 0
    0 0 1 0
42
Computing PageRank (11)
  • Example (continued): Suppose c = 0.8. All entries
    in Z are 0 and all entries in K are ¼.
  • M* = 0.8 (M + Z) + 0.2 K
  • Compute rank by iterating
  • R := M* × R
  • M* =
    0.05 0.05 0.05 0.45
    0.05 0.05 0.05 0.45
    0.85 0.85 0.05 0.05
    0.05 0.05 0.85 0.05
  • MATLAB says R(A) = .338, R(B) = .338, R(C) = .6367,
    R(D) = .6052 (reproduced in the sketch below)
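Reproducing this example with the google_matrix sketch above; the figures credited to MATLAB match the principal eigen vector scaled to unit length (as probabilities summing to 1 they are roughly A = B = 0.176, C = 0.332, D = 0.316).

```python
import numpy as np

pages = ["A", "B", "C", "D"]
A = np.zeros((4, 4))
for u, v in [("A", "C"), ("B", "C"), ("C", "D"), ("D", "A"), ("D", "B")]:
    A[pages.index(u), pages.index(v)] = 1

Mstar = google_matrix(A, c=0.8)   # google_matrix() is the sketch above
R = np.ones(4) / 4
for _ in range(100):
    R = Mstar @ R                 # R_i = M* x R_{i-1}
print(dict(zip(pages, R / np.linalg.norm(R))))
# approximately A: 0.338, B: 0.338, C: 0.637, D: 0.605 (unit-length scaling)
```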
43
Comparing PR & A/H on the same graph
[Figure: PageRank scores and A/H scores shown side by side on the same graph]
44
Combining PR & content similarity
  • Incorporate the ranks of pages into the ranking
    function of a search engine.
  • The ranking score of a web page can be a weighted
    sum of its regular similarity with a query and
    its importance.
  • ranking_score(q, d)
  •   = w × sim(q, d) + (1-w) × R(d), if sim(q, d) > 0
  •   = 0, otherwise
  • where 0 < w < 1.
  • Both sim(q, d) and R(d) need to be normalized to
    lie in [0, 1].

Who sets w?
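A one-line sketch of this combination (w is the knob the note asks about; here it is just a parameter someone must choose):

```python
def ranking_score(sim_qd, R_d, w=0.5):
    """w * sim(q, d) + (1 - w) * R(d) if sim(q, d) > 0, else 0.
    Both sim(q, d) and R(d) are assumed already normalized to [0, 1]."""
    return w * sim_qd + (1 - w) * R_d if sim_qd > 0 else 0.0
```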
45
We can pick and choose
  • Two alternate ways of computing page importance
  • I1. As authorities/hubs
  • I2. As stationary distribution over the
    underlying markov chain
  • Two alternate ways of combining importance with
    similarity
  • C1. Compute importance over a set derived from
    the top-100 similar pages
  • C2. Combine apples organges
  • aimportance bsimilarity

We can pick any pair of alternatives (even though
I1 was originally proposed with C1 and I2 with
C2)
46
Efficient computation Prioritized Sweeping
We can use asynchronous iterations where the
iteration uses some of the values updated in
the current iteration
47
Efficient Computation Preprocess
  • Remove dangling nodes
  • Pages w/ no children
  • Then repeat process
  • Since now more danglers
  • Stanford WebBase
  • 25 M pages
  • 81 M URLs in the link graph
  • After two prune iterations 19 M nodes

48
Representing Links Table
  • Stored on disk in binary format
  • Size for Stanford WebBase 1.01 GB
  • Assumed to exceed main memory

49
Algorithm 1
∀s: Source[s] = 1/N
while residual > ε {
    ∀d: Dest[d] = 0
    while not Links.eof() {
        Links.read(source, n, dest1, ..., destn)
        for j = 1..n
            Dest[destj] = Dest[destj] + Source[source]/n
    }
    ∀d: Dest[d] = c * Dest[d] + (1-c)/N   /* dampening */
    residual = ||Source - Dest||          /* recompute every few iterations */
    Source = Dest
}
50
Analysis of Algorithm 1
  • If memory is big enough to hold Source & Dest
  • IO cost per iteration is |Links|
  • Fine for a crawl of 24 M pages
  • But web ≈ 800 M pages in 2/99 (NEC
    study)
  • Increase from 320 M pages in 1997 (same
    authors)
  • If memory is big enough to hold just Dest
  • Sort Links on source field
  • Read Source sequentially during rank propagation
    step
  • Write Dest to disk to serve as Source for next
    iteration
  • IO cost per iteration is |Source| + |Dest| +
    |Links|
  • If memory can't hold Dest
  • Random access pattern will make working set =
    |Dest|
  • Thrash!!!

51
Block-Based Algorithm
  • Partition Dest into B blocks of D pages each
  • If memory = P physical pages
  • D < P-2, since we need input buffers for Source &
    Links
  • Partition Links into B files
  • Links_i only has some of the dest nodes for each
    source
  • Links_i only has dest nodes such that
  • DD·i ≤ dest < DD·(i+1)
  • where DD = number of 32-bit integers that fit in
    D pages

[Figure: the Links matrix (sparse), with source nodes as rows and dest nodes as columns, shown alongside the Source and Dest vectors]
52
Partitioned Link File
Each source node (32-bit int) is stored with its full outdegree
(16-bit) in every partition, followed by the number of its
destinations (16-bit) that fall in that partition's bucket and the
destination nodes themselves (32-bit ints):

Buckets 0-31    source 0, outdegr 4, num out 2: 12, 26
                source 1, outdegr 3, num out 1: 5
                source 2, outdegr 5, num out 3: 1, 9, 10
Buckets 32-63   source 0, outdegr 4, num out 1: 58
                source 1, outdegr 3, num out 1: 56
                source 2, outdegr 5, num out 1: 36
Buckets 64-95   source 0, outdegr 4, num out 1: 94
                source 1, outdegr 3, num out 1: 69
                source 2, outdegr 5, num out 1: 78
53
Block-based Page Rank algorithm
54
Analysis of Block Algorithm
  • IO cost per iteration =
  • B·|Source| + |Dest| + |Links|·(1+e)
  • e is the factor by which Links increased in size
  • Typically 0.1-0.3
  • Depends on the number of blocks
  • Algorithm ≈ nested-loops join

55
Comparing the Algorithms
56
Effect of collusion on PageRank
[Figure: two small clusters of pages (A, B, C) whose members link to each other]
Moral: By referring to each other, a cluster of
pages can artificially boost their
rank (although the cluster has to be big enough
to make an appreciable
difference). Solution: Put a threshold on the
number of intra-domain links that will
count. Counter: Buy two domains, and generate a
cluster among those..
57
(No Transcript)
58
(No Transcript)
59
Use of Link Information
  • PageRank defines the global importance of web
    pages but the importance is domain/topic
    independent.
  • We often need to find important/authoritative
    pages which are relevant to a given query.
  • What are important web browser pages?
  • Which pages are important game pages?
  • Idea: Use a notion of topic-specific page rank
  • Involves using a non-uniform (reset) probability

60
Topic Specific Pagerank
Haveliwala, WWW 2002
  • For each page compute k different page ranks
  • k = number of top-level hierarchies in the Open
    Directory Project
  • When computing PageRank w.r.t. a topic, say
    that with probability ε we transition to one of
    the pages of topic k
  • When a query q is issued,
  • Compute the similarity between q (& its context)
    and each of the topics
  • Take the weighted combination of the
    topic-specific page ranks, weighted by q's
    similarity to the different topics (see the
    sketch below)
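A sketch of the query-time combination (names are mine; the per-topic rank vectors are assumed to have been precomputed with topic-biased reset distributions):

```python
import numpy as np

def topic_sensitive_rank(topic_ranks, topic_sims):
    """topic_ranks: k x N array, row j = PageRank vector biased toward topic j.
    topic_sims: length-k similarities of the query (and its context) to each
    topic. Returns the blended rank for every page."""
    w = np.asarray(topic_sims, dtype=float)
    w = w / w.sum()                          # turn similarities into weights
    return w @ np.asarray(topic_ranks)       # weighted combination of ranks
```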

61
Stability of Rank Calculations
(From Ng et al.)
The leftmost column shows the original
rank calculation; the columns on the right
are the results of rank calculations when 30%
of the pages are randomly removed
62
(No Transcript)
63
More stable, because the random surfer model allows
low-probability edges to every place.
Can be made stable with subspace-based A/H values;
see Ng et al. 2001
64
Novel uses of Link Analysis
  • Link analysis algorithmsHITS, and Pagerankare
    not limited to hyperlinks
  • Citeseer/Cora use them for analyzing citations
    (the link is through citation)
  • See the irony herelink analysis ideas originated
    from citation analysis, and are now being applied
    for citation analysis ?
  • Some new work on keyword search on databases
    uses foreign-key links and link analysis to
    decide which of the tuples matching the keyword
    query are most important (the link is through
    foreign keys)
  • Sudarshan et. Al. ICDE 2002
  • Keyword search on databases is useful to make
    structured databases accessible to naïve users
    who dont know structured languages (such as
    SQL).

65
(No Transcript)
66
Query complexity
  • Complex queries (966 trials)
  • Average words: 7.03
  • Average operators ("): 4.34
  • Typical Alta Vista queries are much simpler
    (Silverstein, Henzinger, Marais and Moricz)
  • Average query words: 2.35
  • Average operators ("): 0.41
  • Forcibly adding a hub or authority node helped in
    86% of the queries

67
What about non-principal eigen vectors?
  • Principal eigen vector gives the authorities (and
    hubs)
  • What do the other ones do?
  • They may be able to show the clustering in the
    documents (see page 23 in Kleinberg paper)
  • The clusters are found by looking at the positive
    and negative ends of the secondary eigen vectors
    (the principal vector has only a +ve end)

68
More stable, because the random surfer model allows
low-probability edges to every place.
Can be made stable with subspace-based A/H values;
see Ng et al. 2001
69
Summary of Key Points
  • PageRank Iterative Algorithm
  • Rank Sinks
  • Efficiency of computation Memory!
  • Single precision Numbers.
  • Don't represent M explicitly.
  • Break arrays into Blocks.
  • Minimize IO Cost.
  • Number of iterations of PageRank.
  • Weighting of PageRank vs. doc similarity.

70
Beyond Google (and Pagerank)
  • Are backlinks reliable metric of importance?
  • It is a one-size-fits-all measure of
    importance
  • Not user specific
  • Not topic specific
  • There may be discrepancy between back links and
    actual popularity (as measured in hits)
  • The sense of the link is ignored (this is okay
    if you think that all publicity is good
    publicity)
  • Mark Twain on Classics
  • A classic is something everyone wishes they had
    already read and no one actually has read
    (paraphrase)
  • Google may be its own undoing (why would I need
    back links when I know I can get to it through
    Google?)
  • Customization, customization, customization
  • Yahoo sez about their magic bullet.. (NYT
    2/22/04)
  • "If you type in flowers, do you want to buy
    flowers, plant flowers or see pictures of
    flowers?"

71
Challenges in Web Search Engines
  • Spam
  • Text Spam
  • Link Spam
  • Cloaking
  • Content Quality
  • Anchor text quality
  • Quality Evaluation
  • Indirect feedback
  • Web Conventions
  • Articulate and develop validation
  • Duplicate Hosts
  • Mirror detection
  • Vaguely Structured Data
  • Page layout
  • The advantage of making rendering/content
    language be same

72
Spam is a serious problem
  • We have Spam Spam Spam Spam Spam with Eggs and
    Spam
  • in Email
  • Most mail transmitted is junk
  • web pages
  • Many different ways of fooling search engines
  • This is an open arms race
  • Annual conference on Email and Anti-Spam
  • Started 2004
  • Intl. workshop on AIR-Web (Adversarial Info
    Retrieval on Web)
  • Started in 2005 at WWW

73
Trust Spam (Knock-Knock. Who is there?)
  • A powerful way we avoid spam in our physical
    world is by preferring interactions only with
    trusted parties
  • Trust is propagated over social networks
  • When knocking on the doors of strangers, the
    first thing we do is to identify ourselves as a
    friend of a friend of friend
  • So they won't train their dogs/guns on us..
  • Knock-knock. Who is there? Aardvark. Okay (door
    opened) -- not funny
  • Aardvark who? Aardvark a million miles for one of
    your smiles. -- FUNNY
  • We can do it in cyber world too
  • Accept product recommendations only from trusted
    parties
  • E.g. Epinions
  • Accept mails only from individuals who you trust
    above a certain threshold
  • Bias page importance computation so that it
    counts only links from trusted sites..
  • Sort of like discounting links that are off
    topic

74
Trust Propagation
  • Trust is transitive so easy to propagate
  • ..but attenuates as it traverses as a social
    network
  • If I trust you, I trust your friend (but a little
    less than I do you), and I trust your friends
    friend even less
  • Trust may not be symmetric..
  • Trust is normally additive
  • If you are friend of two of my friends, may be I
    trust you more..
  • Distrust is difficult to propagate
  • If my friend distrusts you, then I probably
    distrust you
  • but if my enemy distrusts you?
  • is the enemy of my enemy automatically my
    friend?
  • Trust vs. Reputation
  • Trust is a user-specific metric
  • Your trust in an individual may be different from
    someone elses
  • Reputation can be thought of as an aggregate
    or one-size-fits-all version of Trust
  • Most systems such as EBay tend to use Reputation
    rather than Trust
  • Sort of the difference between User-specific vs.
    Global page rank

75
Case Study Epinions
  • Users can write reviews and also express
    trust/distrust on other users
  • Reviewers get royalties
  • so some tried to game the system
  • So, distrust measures introduced

[Figure: out-degree distribution of the Epinions trust graph (number of nodes vs. out degree)]
Guha et al., WWW 2004 compares some 81
different ways of propagating trust and
distrust on the Epinions trust matrix
76
Evaluating Trust Propagation Approaches
  • Given n users, and a sparsely populated nxn
    matrix of trusts between the users
  • And optionally an nxn matrix of distrusts between
    the users
  • Start by erasing some of the entries (but
    remember the values you erased)
  • For each trust propagation method
  • Use it to fill the nxn matrix
  • Compare the predicted values to the erased values

77
Fighting Page Spam
We saw discussion of these in the Henzinger et
al. paper
Can social networks, which gave rise to the
ideas of page importance computation, also
rescue these computations from spam?
78
TrustRank idea
Gyongyi et al, VLDB 2004
  • Tweak the default distribution used in page
    rank computation (the distribution that a bored
    user uses when she doesn't want to follow the
    links)
  • From uniform
  • To trust-based
  • Very similar in spirit to the Topic-sensitive or
    User-sensitive page rank
  • where, too, you fiddle with the default
    distribution
  • Sample a set of seed pages from the web
  • Have an oracle (human) identify the good pages
    and the spam pages in the seed set
  • Expensive task, so must make seed set as small as
    possible
  • Propagate Trust (one pass)
  • Use the normalized trust to set the initial
    distribution

Slides modified from Anand Rajaraman's lecture at
Stanford
79
Example
[Figure: a small example web graph of seven pages, with some seed pages labeled good and one labeled bad]
80
Rules for trust propagation
  • Trust attenuation
  • The degree of trust conferred by a trusted page
    decreases with distance
  • Trust splitting
  • The larger the number of outlinks from a page,
    the less scrutiny the page author gives each
    outlink
  • Trust is split across outlinks
  • Combining splitting and damping, each out link of
    a node p gets a propagated trust of
    b·t(p)/|O(p)|
  • 0 < b < 1; |O(p)| is the out degree and t(p) is the
    trust of p
  • Trust additivity
  • Propagated trust from different directions is
    added up

81
Simple model
  • Suppose the trust of page p is t(p)
  • Set of outlinks: O(p)
  • For each q ∈ O(p), p confers the trust
  • b·t(p)/|O(p)|, for 0 < b < 1
  • Trust is additive
  • Trust of p is the sum of the trust conferred on p
    by all its inlinked pages (see the sketch below)
  • Note the similarity to Topic-Specific Page Rank
  • Within a scaling factor, trust rank = biased page
    rank with trusted pages as the teleport set
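A sketch of one propagation pass under this simple model (graph representation and names are mine): each page confers b·t(p)/|O(p)| along every out-link, and incoming contributions are added up.

```python
def propagate_trust(t, out_links, b=0.85, passes=1):
    """t: dict page -> current trust (seed pages start with oracle trust).
    out_links: dict page -> list of pages it links to."""
    for _ in range(passes):
        new_t = {p: 0.0 for p in t}
        for p, outs in out_links.items():
            if not outs:
                continue
            share = b * t.get(p, 0.0) / len(outs)     # split and damp
            for q in outs:
                new_t[q] = new_t.get(q, 0.0) + share  # additive over in-links
        t = new_t
    return t
```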

82
Picking the seed set
  • Two conflicting considerations
  • Human has to inspect each seed page, so seed set
    must be as small as possible
  • Must ensure every good page gets adequate trust
    rank, so need make all good pages reachable from
    seed set by short paths

83
Approaches to picking seed set
  • Suppose we want to pick a seed set of k pages
  • The best idea would be to pick them from the
    top-k hub pages.
  • Note that trustworthiness is subjective
  • Aljazeera may be considered more trustworthy than
    NY Times by some (and the reverse by others)
  • PageRank
  • Pick the top k pages by page rank
  • Assume high page rank pages are close to other
    highly ranked pages
  • We care more about high page rank good pages

84
Inverse page rank
  • Pick the pages with the maximum number of
    outlinks
  • Can make it recursive
  • Pick pages that link to pages with many outlinks
  • Formalize as inverse page rank
  • Construct graph G' by reversing each edge in web
    graph G
  • Page Rank in G' is inverse page rank in G
  • Pick the top k pages by inverse page rank (see the
    sketch below)
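A sketch of seed selection by inverse page rank, assuming the google_matrix() helper sketched earlier: transpose the adjacency matrix to reverse every edge, run the PageRank iteration, and keep the top-k pages.

```python
import numpy as np

def pick_seed_set(A, k, c=0.85):
    """Inverse page rank: PageRank on the reversed graph (A transposed)."""
    Mstar = google_matrix(A.T, c=c)    # google_matrix() from the earlier sketch
    r = np.ones(A.shape[0]) / A.shape[0]
    for _ in range(100):
        r = Mstar @ r
    return np.argsort(r)[::-1][:k]     # indices of the top-k pages
```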

85
Anatomy of Google (circa 1999)
  • Slides from
  • http://www.cs.huji.ac.il/sdbi/2000/google/index.htm

86
Some points
  • Fancy hits?
  • Why two types of barrels?
  • How is indexing parallelized?
  • How does Google show that it doesn't quite care
    about recall?
  • How does Google avoid crawling the same URL
    multiple times?
  • What are some of the memory saving things they
    do?
  • Do they use TF/IDF?
  • Do they normalize? (why not?)
  • Can they support proximity queries?
  • How are page synopses made?

87
Types of Web Queries
  • Navigational
  • User is looking for the address of a specific
    page (so the relevant set is a singleton!)
  • Success on these is responsible for much of the
    OOooo appeal of search engines..
  • Informational
  • User is trying to learn information about a
    specific topic (so the relevant set can be
    non-singleton)
  • Transactional
  • The user is searching with the final aim of
    conducting a transaction on that page..
  • E.g. comparison shopping

88
Search Engine Size over Time
Number of indexed pages, self-reported. Google:
about 50% of the web?
89
System Anatomy
  • High Level Overview

90
Google Search Engine Architecture
URL Server- Provides URLs to be fetched Crawler
is distributed Store Server - compresses
and stores pages for indexing Repository - holds
pages for indexing (full HTML of every
page) Indexer - parses documents, records words,
positions, font size, and capitalization Lexicon
- list of unique words found HitList - efficient
record of word locations & attributes Barrels - hold
(docID, (wordID, hitList)), sorted; each barrel has a
range of words Anchors - keep information about
links found in web pages URL Resolver - converts
relative URLs to absolute Sorter - generates Doc
Index Doc Index - inverted index of all words in
all documents (except stop words) Links - stores
info about links to each page (used for
Pagerank) Pagerank - computes a rank for
each page retrieved Searcher - answers queries
SOURCE: BRIN & PAGE
91
Major Data Structures
  • Big Files
  • virtual files spanning multiple file systems
  • addressable by 64 bit integers
  • handles allocation & deallocation of file
    descriptors, since those provided by the OS are not enough
  • supports rudimentary compression

92
Major Data Structures (2)
  • Repository
  • tradeoff between speed & compression ratio
  • chose zlib (3 to 1) over bzip (4 to 1)
  • requires no other data structure to access it

93
Major Data Structures (3)
  • Document Index
  • keeps information about each document
  • fixed width ISAM (index sequential access mode)
    index
  • includes various statistics
  • pointer to repository, if crawled, pointer to
    info lists
  • compact data structure
  • we can fetch a record in 1 disk seek during search

94
Major Data Structures (4)
  • URLs - docID file
  • used to convert URLs to docIDs
  • list of URL checksums with their docIDs
  • sorted by checksums
  • given a URL a binary search is performed
  • conversion is done in batch mode

95
Major Data Structures (4)
  • Lexicon
  • can fit in memory for reasonable price
  • currently 256 MB
  • contains 14 million words
  • 2 parts
  • a list of words
  • a hash table

96
Major Data Structures (4)
  • Hit Lists
  • includes position, font & capitalization
  • account for most of the space used in the indexes
  • 3 alternatives: simple, Huffman, hand-optimized
  • hand encoding uses 2 bytes for every hit

97
Major Data Structures (4)
  • Hit Lists (2)

98
Major Data Structures (5)
  • Forward Index
  • partially ordered
  • used 64 Barrels
  • each Barrel holds a range of wordIDs
  • requires slightly more storage
  • each wordID is stored as a relative difference
    from the minimum wordID of the Barrel
  • saves considerable time in the sorting

99
Major Data Structures (6)
  • Inverted Index
  • 64 Barrels (same as the Forward Index)
  • for each wordID the Lexicon contains a pointer to
    the Barrel that wordID falls into
  • the pointer points to a doclist with their hit
    list
  • the order of the docIDs is important
  • by docID or doc word-ranking
  • Two inverted barrelsthe short barrel/full barrel

100
Major Data Structures (7)
  • Crawling the Web
  • fast distributed crawling system
  • URLserver & Crawlers are implemented in Python
  • each Crawler keeps about 300 connections open
  • at peak time the rate is 100 pages, 600K per
    second
  • uses internal cached DNS lookup
  • synchronized IO to handle events
  • number of queues
  • Robust Carefully tested

101
Major Data Structures (8)
  • Indexing the Web
  • Parsing
  • should know to handle errors
  • HTML typos
  • kb of zeros in a middle of a TAG
  • non-ASCII characters
  • HTML Tags nested hundreds deep
  • Developed their own Parser
  • involved a fair amount of work
  • did not cause a bottleneck

102
Major Data Structures (9)
  • Indexing Documents into Barrels
  • turning words into wordIDs
  • in-memory hash table - the Lexicon
  • new additions are logged to a file
  • parallelization
  • shared lexicon of 14 million pages
  • log of all the extra words

103
Major Data Structures (10)
  • Indexing the Web
  • Sorting
  • creating the inverted index
  • produces two types of barrels
  • for titles and anchor (Short barrels)
  • for full text (full barrels)
  • sorts every barrel separately
  • running sorters at parallel
  • the sorting is done in main memory

Ranking looks at Short barrels first And then
full barrels
104
Searching
  • Algorithm
  • 1. Parse the query
  • 2. Convert word into wordIDs
  • 3. Seek to the start of the doclist in the short
    barrel for every word
  • 4. Scan through the doclists until there is a
    document that matches all of the search terms
  • 5. Compute the rank of that document
  • 6. If we're at the end of the short barrels, start
    at the doclists of the full barrel, unless we
    have enough
  • 7. If we're not at the end of any doclist, goto
    step 4
  • 8. Sort the documents by rank return the top K
  • (May jump here after 40k pages)

105
The Ranking System
  • The information
  • Position, Font Size, Capitalization
  • Anchor Text
  • PageRank
  • Hits Types
  • title ,anchor , URL etc..
  • small font, large font etc..

106
The Ranking System (2)
  • Each Hit type has its own weight
  • Counts weights increase linearly with counts at
    first but quickly taper off this is the IR score
    of the doc
  • (IDF weighting??)
  • the IR is combined with PageRank to give the
    final Rank
  • For multi-word query
  • A proximity score for every set of hits with a
    proximity type weight
  • 10 grades of proximity

107
Feedback
  • A trusted user may optionally evaluate the
    results
  • The feedback is saved
  • When modifying the ranking function we can see
    the impact of this change on all previous
    searches that were ranked

108
Results
  • Produce better results than major commercial
    search engines for most searches
  • Example query bill clinton
  • returns results from whitehouse.gov
  • email addresses of the president
  • all the results are high quality pages
  • no broken links
  • no "bill" without "clinton" and no "clinton" without "bill"

109
Storage Requirements
  • Using Compression on the repository
  • about 55 GB for all the data used by the SE
  • most of the queries can be answered by just the
    short inverted index
  • with better compression, a high quality SE can
    fit onto a 7GB drive of a new PC

110
Storage Statistics
Web Page Statistics
111
System Performance
  • It took 9 days to download 26 million pages
  • 48.5 pages per second
  • The Indexer & Crawler ran simultaneously
  • The Indexer runs at 54 pages per second
  • The sorters run in parallel using 4 machines, the
    whole process took 24 hours