Date: Fri, 15 Feb 2002 12:53:45 -0700 Subject: IOC awards presidency also to Gore (RNN)-- In a surprising, but widely anticipated move, the International Olympic Committee president just came on TV and announced that IOC decided to award a presidency


Slides: 63

Provided by: sunylearni

Learn more at: http://rakaposhi.eas.asu.edu


Transcript and Presenter's Notes


1
2/18

Date: Fri, 15 Feb 2002 12:53:45 -0700
Subject: IOC awards presidency also to Gore

(RNN)-- In a surprising, but widely anticipated move, the
International Olympic Committee president just came on TV and
announced that IOC decided to award a presidency to Albert Gore Jr.
too. Gore Jr. won the popular vote initially, but to the surprise of
TV viewers world wide, Bush was awarded the presidency by the
electoral college judges.

Mr. Bush, who "beat" Gore, still gets to keep his presidency. "We
decided to put the two men on an equal footing and we are not going
to start doing the calculations of all the different votes that
(were) given. Besides, who knows what those seniors in Palm Beach
were thinking?" said the IOC president. The specific details of
shared presidency are still being worked out--but it is expected that
Gore will be the president during the day, when Mr. Bush typically is
busy in the Gym working out.

In a separate communique the IOC suspended Florida for an indefinite
period from the union.

Speaking from his home (far) outside Nashville, a visibly elated Gore
profusely thanked Canadian people for starting this trend. He also
remarked that this will be the first presidents' day when the sitting
president can be on both coasts simultaneously. When last seen, he
was busy using the "Gettysburg" template in the latest MS Powerpoint
to prepare an eloquent speech for his
inauguration-cum-first-state-of-the-union.

--RNN

Related Sites
Gettysburg Powerpoint template
http://www.norvig.com/Gettysburg/

2
Agenda
PageRank issues (computation, collusion, etc.)
Crawling

Announcements
Next class INTERACTIVE
(read Google paper and come prepared with smart
questions/comments/answers)
Homework 2 socket closed..
Question
Are you reading the papers????????

3
Adding PageRank to a SearchEngine

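Weighted sum of importance and similarity with query

$$\mathrm{Score}(q, p) = \begin{cases} w \cdot \mathrm{sim}(q, p) + (1 - w) \cdot R(p), & \text{if } \mathrm{sim}(q, p) > 0 \\ 0, & \text{otherwise} \end{cases}$$

Where
$0 < w < 1$
$\mathrm{sim}(q, p)$ and $R(p)$ must be normalized to $[0, 1]$.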
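
The weighted combination on this slide can be sketched in a few lines of Python. The function names and the min-max normalization are illustrative assumptions; the slide only requires that both inputs lie in [0, 1].

```python
def normalize(values):
    """Min-max scale raw scores into [0, 1] (one possible normalization; an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def combined_score(sim, rank, w=0.5):
    """Blend query similarity and page importance; both are expected in [0, 1]."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie strictly between 0 and 1")
    if sim <= 0.0:
        # Pages that do not match the query get no credit for importance alone.
        return 0.0
    return w * sim + (1.0 - w) * rank
```

A page with zero similarity scores 0 no matter how high its rank is, so importance only reorders pages that already match the query.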

4
Stability of Rank Calculations
(From Ng et al.)
The leftmost column shows the original
rank calculation; the columns on the right
are the results of rank calculations when 30%
of pages are randomly removed
5
(No Transcript)
6
Effect of collusion on PageRank
[Figure: two link graphs over pages A, B, and C]
Moral: By referring to each other, a cluster of
pages can artificially boost their rank (although
the cluster has to be big enough to make an
appreciable difference).
Solution: Put a threshold on the number of
intra-domain links that will count.
Counter: Buy two domains, and generate a
cluster among those.
7
What about non-principal eigen vectors?

Principal eigen vector gives the authorities (and
hubs)
What do the other ones do?
They may be able to show the clustering in the
documents (see page 23 in Kleinberg paper)
The clusters are found by looking at the positive
and negative ends of the secondary eigen vectors
(the principal vector has only a +ve end)

8
Novel uses of Link Analysis

Link analysis algorithms (HITS and PageRank) are
not limited to hyperlinks
Citeseer/Cora use them for analyzing citations
(the link is through citation)
See the irony here: link analysis ideas originated
from citation analysis, and are now being applied
for citation analysis
Some new work on keyword search on databases
uses foreign-key links and link analysis to
decide which of the tuples matching the keyword
query are most important (the link is through
foreign keys)
Sudarshan et al., ICDE 2002
Keyword search on databases is useful to make
structured databases accessible to naïve users
who don't know structured languages (such as
SQL).

9
(No Transcript)
10
Query complexity

Complex queries (966 trials)
Average words: 7.03
Average operators ("): 4.34
Typical Alta Vista queries are much simpler
(Silverstein, Henzinger, Marais and Moricz)
Average query words: 2.35
Average operators ("): 0.41
Forcibly adding a hub or authority node helped in
86% of the queries

11
Practicality

Challenges
M no longer sparse (don't represent explicitly!)
Data too big for memory (be sneaky about disk
usage)
Stanford version of Google
24 million documents in crawl
147GB documents
259 million links
Computing PageRank: a few hours on a single 1997
workstation
But how?
Next: discussion from the Haveliwala paper

12
Efficient Computation: Preprocess

Remove dangling nodes
Pages w/ no children
Then repeat process
Since now more danglers
Stanford WebBase
25 M pages
81 M URLs in the link graph
After two prune iterations: 19 M nodes
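
The pruning step can be sketched as a loop that repeatedly drops pages with no children. The dictionary-of-sets graph representation and the `max_passes` parameter are assumptions for illustration; the slide only says that pruning is repeated because each pass can create new danglers.

```python
def prune_dangling(links, max_passes=None):
    """Remove dangling nodes (pages with no children), pass by pass.

    links: dict mapping node -> iterable of child nodes (assumed representation).
    Returns (pruned_graph, passes_that_removed_something).
    """
    graph = {u: set(vs) for u, vs in links.items()}
    # Make sure pages that appear only as link targets are present as nodes too.
    for children in list(graph.values()):
        for v in children:
            graph.setdefault(v, set())

    passes = 0
    while max_passes is None or passes < max_passes:
        dangling = {u for u, vs in graph.items() if not vs}
        if not dangling:
            break
        # Dropping danglers can leave their parents childless: hence the repeat.
        graph = {u: vs - dangling for u, vs in graph.items() if u not in dangling}
        passes += 1
    return graph, passes
```

Calling it with `max_passes=2` mirrors the two prune iterations quoted for the Stanford WebBase graph; leaving it as `None` prunes until no danglers remain.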

13
Representing Links Table

Stored on disk in binary format

Size for Stanford WebBase: 1.01 GB
Assumed to exceed main memory

14
Algorithm 1
for all s: Source[s] = 1/N
while residual > τ {
    for all d: Dest[d] = 0
    while not Links.eof() {
        Links.read(source, n, dest_1, ..., dest_n)
        for j = 1 ... n
            Dest[dest_j] = Dest[dest_j] + Source[source]/n
    }
    for all d: Dest[d] = c * Dest[d] + (1-c)/N    /* dampening */
    residual = ||Source - Dest||    /* recompute every few iterations */
    Source = Dest
}
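
Algorithm 1 translates fairly directly into Python. This is a minimal in-memory sketch: the `links` list stands in for the on-disk Links file, dangling nodes are assumed to have been pruned already, and the L1 norm and the default values of `c`, `tau`, and `max_iters` are assumptions rather than values from the slides.

```python
def pagerank_stream(links, n_pages, c=0.85, tau=1e-8, max_iters=200):
    """PageRank by repeated sequential passes over a link table.

    links: list of (source, [dest, ...]) records with non-empty dest lists.
    """
    source = [1.0 / n_pages] * n_pages
    for _ in range(max_iters):
        dest = [0.0] * n_pages
        for src, dests in links:              # one sequential scan of Links
            share = source[src] / len(dests)  # rank split evenly over out-links
            for d in dests:
                dest[d] += share
        # Dampening: keep fraction c of the propagated rank, spread the rest uniformly.
        dest = [c * x + (1.0 - c) / n_pages for x in dest]
        residual = sum(abs(s - d) for s, d in zip(source, dest))  # L1 norm (assumed)
        source = dest
        if residual <= tau:
            break
    return source
```

For simplicity the residual is recomputed on every iteration here, whereas the slide suggests doing so only every few iterations.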
15
Analysis of Algorithm 1

If memory is big enough to hold Source and Dest
IO cost per iteration is |Links|
Fine for a crawl of 24 M pages
But web ~ 800 M pages in 2/99 (NEC study)
Increase from 320 M pages in 1997 (same
authors)
If memory is big enough to hold just Dest
Sort Links on source field
Read Source sequentially during rank propagation
step
Write Dest to disk to serve as Source for next
iteration
IO cost per iteration is |Source| + |Dest| +
|Links|
If memory can't hold Dest
Random access pattern will make working set =
|Dest|
Thrash!!!

16
Block-Based Algorithm

Partition Dest into B blocks of D pages each
If memory = P physical pages
D < P-2 since need input buffers for Source and
Links
Partition Links into B files
Links_i only has some of the dest nodes for each
source
Links_i only has dest nodes such that
DD*i <= dest < DD*(i+1)
Where DD = number of 32 bit integers that fit in
D pages

[Figure: Dest, Links (sparse), and Source, with axes labeled source node and dest node]
17
Partitioned Link File
Source node     Outdegr     Num out     Destination nodes
(32 bit int)    (16 bit)    (16 bit)    (32 bit int)

Buckets 0-31
0               4           2           12, 26
1               3           1           5
2               5           3           1, 9, 10

Buckets 32-63
0               4           1           58
1               3           1           56
2               5           1           36

Buckets 64-95
0               4           1           94
1               3           1           69
2               5           1           78
18
Block-based Page Rank algorithm
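
A rough Python sketch of the block-based scheme follows, assuming the partitioned link file layout from the previous slide (source, total outdegree, destinations that fall in the block). In-memory dictionaries stand in for the B on-disk Links files, and the function names and the default `c` are illustrative assumptions.

```python
from collections import defaultdict


def partition_links(links, block_size):
    """Split (source, [dest, ...]) records into per-block link files.

    Each block entry is (source, total_outdegree, [dests inside that block]).
    """
    blocks = defaultdict(list)
    for source, dests in links:
        per_block = defaultdict(list)
        for d in dests:
            per_block[d // block_size].append(d)
        for b in sorted(per_block):
            blocks[b].append((source, len(dests), per_block[b]))
    return dict(blocks)


def block_pass(blocks, source, n_pages, block_size, c=0.85):
    """One PageRank iteration that holds only one block of Dest at a time."""
    dest = []
    n_blocks = (n_pages + block_size - 1) // block_size
    for b in range(n_blocks):
        lo = b * block_size
        hi = min(lo + block_size, n_pages)
        dest_block = [0.0] * (hi - lo)            # the only Dest slice in memory
        for src, outdegree, dests in blocks.get(b, []):
            share = source[src] / outdegree       # uses the full outdegree
            for d in dests:
                dest_block[d - lo] += share
        dest.extend(c * x + (1.0 - c) / n_pages for x in dest_block)
    return dest
```

Storing the full outdegree in every block entry is what lets each block compute the correct rank share without seeing the source's other destinations.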
19
Analysis of Block Algorithm

IO Cost per iteration
B|Source| + |Dest| + |Links|(1+e)
e is factor by which Links increased in size
Typically 0.1-0.3
Depends on number of blocks
Algorithm is similar to a nested-loops join
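
To compare the per-iteration IO costs quoted on the analysis slides, the two formulas can be written as small helpers. The function names are made up for illustration, and sizes can be in any consistent unit (bytes or disk pages).

```python
def io_cost_simple(source_size, dest_size, links_size):
    """Cost when only Dest fits in memory: |Source| + |Dest| + |Links|."""
    return source_size + dest_size + links_size


def io_cost_block(source_size, dest_size, links_size, n_blocks, growth):
    """Block algorithm: B|Source| + |Dest| + |Links|(1 + e).

    growth is e, the fractional size increase of the partitioned Links files.
    """
    return n_blocks * source_size + dest_size + links_size * (1.0 + growth)
```

With a single block and no growth the two costs coincide; every extra block adds one more scan of Source, which is where the nested-loops-join flavour comes from.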

20
Comparing the Algorithms
21
PageRank Convergence
22
PageRank Convergence
23
More stable because random surfer model allows
low-probability edges to every place. CV
Can be made stable with subspace-based A/H values
(see Ng et al. 2001)
24
Summary of Key Points

PageRank: Iterative Algorithm
Rank Sinks
Efficiency of computation: Memory!
Single-precision numbers.
Don't represent M explicitly.
Break arrays into Blocks.
Minimize IO Cost.
Number of iterations of PageRank.
Weighting of PageRank vs. doc similarity.

25
Crawlers: Main issues

General-purpose crawling
Context-specific crawling
Building topic-specific search engines

26
(No Transcript)
27
2/24
Shopping at job fairs
Push my resume
But jobs aren't what I seek
I will be your walking student advertisement
Can't live on my research stipend
Everybody wants a Google shirt
HP, Amazon
Pixar, Cray, and Ford
I just can't decide
Help me score the most free pens and free umbrellas or a coffee mug from Bell Labs
Everybody wants a Google..
Until I find a steady funder
I'll make do with cheap-a plunder
Everybody wants a Google..
Wait!
You will never never never need it
It's free I couldn't leave it
Everybody wants a Google shirt
Shameless corp'rate carrion crows
Turn your backs and show your logos
Everybody wants a Google shirt
("Everybody Wants a Google Shirt" is based on
"Everybody Wants to Rule the World" by Tears
for Fears. Alternate lyrics by Andy Collins,
Kate Deibel, Neil Spring, Steve Wolfman, and Ken
Yasuhara.)
28
Discussion

What parts of Google did you find to be in line
with what you learned until now?
What parts of Google were different?

29
Some points

Fancy hits?
Why two types of barrels?
How is indexing parallelized?
How does Google show that it doesn't quite care
about recall?
How does Google avoid crawling the same URL
multiple times?

What are some of the memory saving things they
do?
Do they use TF/IDF?
Do they normalize? (why not?)
Can they support proximity queries?
How are page synopses made?

30
Beyond Google (and Pagerank)

Are backlinks reliable metric of importance?
It is a one-size-fits-all measure of
importance
Not user specific
Not topic specific
There may be discrepancy between back links and
actual popularity (as measured in hits)
The sense of the link is ignored (this is okay
if you think that all publicity is good
publicity)
Mark Twain on Classics
A classic is something everyone wishes they had
already read and no one actually had..
(paraphrase)
Google may be its own undoing (why would I need
back links when I know I can get to it through
Google?)
Customization, customization, customization
Yahoo sez about their magic bullet.. (NYT
2/22/04)
"If you type in flowers, do you want to buy
flowers, plant flowers or see pictures of
flowers?"

31
The rest of the slides on Google as well as
crawling were not specifically discussed one at a
time, but have been discussed in essence (read:
you are still responsible for them)
32
(No Transcript)
33
(No Transcript)
34
SPIDER CASE STUDY
35
Web Crawling (Search) Strategy

Starting location(s)
Traversal order
Depth first
Breadth first
Or ???
Cycles?
Coverage?
Load?

[Figure: example link graph with nodes labeled b through j]
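
The traversal-order choice comes down to which end of the frontier the crawler takes the next URL from. A minimal sketch, assuming the link structure is already available as a dictionary (a real crawler would fetch and parse each page instead):

```python
from collections import deque


def crawl_order(graph, seeds, breadth_first=True, max_pages=None):
    """Return the order in which pages would be visited.

    graph: dict mapping URL -> list of linked URLs (stands in for fetching).
    """
    frontier = deque(seeds)
    seen = set(seeds)           # guards against cycles and duplicate visits
    order = []
    while frontier and (max_pages is None or len(order) < max_pages):
        # Queue (FIFO) gives breadth first; stack (LIFO) gives depth first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for nxt in graph.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order
```

The `seen` set addresses the "Cycles?" question and `max_pages` is a crude stand-in for coverage and load limits.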
36
Robot (2)

Some specific issues
What initial URLs to use?
Choice depends on type of search engines to be
built.
For general-purpose search engines, use URLs that
are likely to reach a large portion of the Web
such as the Yahoo home page.
For local search engines covering one or several
organizations, use URLs of the home pages of
these organizations. In addition, use appropriate
domain constraint.

37
Robot (7)

Several research issues about robots
Fetching more important pages first with limited
resources.
Can use measures of page importance
Fetching web pages in a specified subject area
such as movies and sports for creating
domain-specific search engines.
Focused crawling
Efficient re-fetch of web pages to keep web page
index up-to-date.
Keeping track of change rate of a page

38
Storing Summaries

Can't store complete page text
Whole WWW doesn't fit on any server
Stop Words
Stemming
What (compact) summary should be stored?
Per URL
Title, snippet
Per Word
URL, word number

But, look at Google's cached copy
39
(No Transcript)
40
(No Transcript)
41
Robot (4)

How to extract URLs from a web page?
Need to identify all possible tags and attributes
that hold URLs.
Anchor tag: <a href="URL" ...> ... </a>
Option tag: <option value="URL" ...> ... </option>
Map: <area href="URL" ...>
Frame: <frame src="URL" ...>
Link to an image: <img src="URL" ...>
Relative path vs. absolute path: <base href=...>
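
The tag/attribute pairs listed above can be collected with Python's standard `html.parser` and resolved against the page URL (or a `<base href>`) with `urllib.parse.urljoin`. This is a small sketch covering only the tags named on the slide; the class and function names are illustrative.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs from the slide that may hold a URL.
URL_ATTRS = {("a", "href"), ("area", "href"), ("frame", "src"),
             ("img", "src"), ("option", "value")}


class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url    # replaced if the page declares <base href=...>
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        for name, value in attrs:
            if value is None:
                continue
            if tag == "base" and name == "href":
                self.base = urljoin(self.base, value)
            elif (tag, name) in URL_ATTRS:
                # urljoin leaves absolute URLs alone and resolves relative ones.
                self.urls.append(urljoin(self.base, value))


def extract_urls(html, page_url):
    parser = LinkExtractor(page_url)
    parser.feed(html)
    return parser.urls
```

Note that an `<option value=...>` is not always a URL in practice, so a real crawler would likely filter those values further.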

42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
Focused Crawling

Classifier: Is crawled page P relevant to the
topic?
Algorithm that maps page to relevant/irrelevant
Semi-automatic
Based on page vicinity..
Distiller: Is crawled page P likely to lead to
relevant pages?
Algorithm that maps page to likely/unlikely
Could be just A/H computation, and taking HUBS
Distiller determines the priority of following
links off of P
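
One way to picture how the classifier and distiller cooperate is a crawler whose frontier is a priority queue keyed by the distiller's score for the page that contributed each link. Everything below is a hypothetical sketch: `fetch_links`, `is_relevant`, and `lead_score` are placeholder callables supplied by the caller, not part of any real focused-crawling library.

```python
import heapq


def focused_crawl(seeds, fetch_links, is_relevant, lead_score, max_pages=100):
    """Crawl best-first, following links from promising pages before others.

    fetch_links(url) -> list of out-link URLs      (placeholder for a fetcher)
    is_relevant(url) -> bool                       (the classifier)
    lead_score(url)  -> float, higher is better    (the distiller)
    """
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    relevant = []
    fetched = 0
    while frontier and fetched < max_pages:
        _, url = heapq.heappop(frontier)
        fetched += 1
        if is_relevant(url):
            relevant.append(url)
        priority = lead_score(url)   # how promising are links off this page?
        for nxt in fetch_links(url):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(frontier, (-priority, nxt))
    return relevant
```

With a small page budget, links off a high-scoring hub are exhausted before anything reachable only through a low-scoring page is fetched.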

47
(No Transcript)
48
(No Transcript)
49
Anatomy of Google (circa 1999)

Slides from
http://www.cs.huji.ac.il/sdbi/2000/google/index.htm

50
Search Engine Size over Time
Number of indexed pages, self-reported
Google: 50% of the web?
51
System Anatomy

High Level Overview

52
Google Search Engine Architecture
URL Server - provides URLs to be fetched
Crawler - is distributed
Store Server - compresses and stores pages for indexing
Repository - holds pages for indexing (full HTML of every page)
Indexer - parses documents, records words, positions, font size, and capitalization
Lexicon - list of unique words found
HitList - efficient record of word locs and attribs
Barrels - hold (docID, (wordID, hitList)) sorted; each barrel has a range of words
Anchors - keep information about links found in web pages
URL Resolver - converts relative URLs to absolute
Sorter - generates Doc Index
Doc Index - inverted index of all words in all documents (except stop words)
Links - stores info about links to each page (used for PageRank)
PageRank - computes a rank for each page retrieved
Searcher - answers queries
SOURCE: BRIN & PAGE
53
(No Transcript)
54
Major Data Structures

Big Files
virtual files spanning multiple file systems
addressable by 64 bit integers
handles allocation and deallocation of file
descriptors, since what the OS provides is not enough
supports rudimentary compression

55
Major Data Structures (2)

Repository
tradeoff between speed and compression ratio
choose zlib (3 to 1) over bzip (4 to 1)
requires no other data structure to access it

56
Major Data Structures (3)

Document Index
keeps information about each document
fixed width ISAM (index sequential access mode)
index
includes various statistics
pointer to repository, if crawled, pointer to
info lists
compact data structure
we can fetch a record in 1 disk seek during search

57
Major Data Structures (4)

URLs - docID file
used to convert URLs to docIDs
list of URL checksums with their docIDs
sorted by checksums
given a URL a binary search is performed
conversion is done in batch mode
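
The URL-to-docID file can be pictured as a checksum-sorted list searched with binary search. In this sketch `zlib.crc32` is only a stand-in checksum (the slides do not say which one is used), checksum collisions are ignored, and the function names are illustrative.

```python
import bisect
import zlib


def checksum(url):
    """Stand-in checksum; the actual function used is not given in the slides."""
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """List of (checksum, docID) pairs sorted by checksum."""
    return sorted((checksum(u), d) for u, d in url_to_docid.items())


def lookup(table, url):
    """Binary-search the sorted table; returns None for unknown URLs."""
    key = checksum(url)
    # (key,) sorts before every (key, docID), so this finds the first match.
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None
```

Batch-mode conversion would sort the checksums of the URLs to be converted and merge them against the table in one pass, rather than paying one binary search (and one disk seek) per link.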

58
Major Data Structures (4)

Lexicon
can fit in memory for reasonable price
currently 256 MB
contains 14 million words
2 parts
a list of words
a hash table

59
Major Data Structures (4)

Hit Lists
includes position, font, and capitalization
account for most of the space used in the indexes
3 alternatives: simple, Huffman, hand-optimized
hand encoding uses 2 bytes for every hit
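
As an illustration of a 2-byte hit, the sketch below packs a capitalization flag, a font size, and a word position into 16 bits. The 1/3/12 bit split follows the plain-hit layout described in the Brin and Page paper rather than anything stated on the slide, and the function names and clamping behaviour are assumptions.

```python
def pack_hit(capitalized, font_size, position):
    """Pack one hit into 16 bits: 1 bit caps, 3 bits font size, 12 bits position."""
    if not 0 <= font_size <= 7:
        raise ValueError("font_size must fit in 3 bits")
    if position < 0:
        raise ValueError("position must be non-negative")
    position = min(position, 4095)   # positions past the 12-bit range are clamped
    return (int(bool(capitalized)) << 15) | (font_size << 12) | position


def unpack_hit(hit):
    """Inverse of pack_hit: returns (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF
```

`pack_hit(...).to_bytes(2, "big")` gives the two bytes that would be written to a hit list.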

60
Major Data Structures (4)

Hit Lists (2)

61
Major Data Structures (5)

Forward Index
partially ordered
used 64 Barrels
each Barrel holds a range of wordIDs
requires slightly more storage
each wordID is stored as a relative difference
from the minimum wordID of the Barrel
saves considerable time in the sorting

62
Major Data Structures (6)

Inverted Index
64 Barrels (same as the Forward Index)
for each wordID the Lexicon contains a pointer to
the Barrel that wordID falls into
the pointer points to a doclist with their hit
list
the order of the docIDs is important
by docID or doc word-ranking
Two sets of inverted barrels: the short barrel / full barrel

63
Major Data Structures (7)

Crawling the Web
fast distributed crawling system
URLserver and Crawlers are implemented in Python
each Crawler keeps about 300 connections open
at peak time the rate is about 100 pages (600K) per
second
uses internal cached DNS lookup
asynchronous IO to handle events
number of queues
Robust and carefully tested

64
Major Data Structures (8)

Indexing the Web
Parsing
should know how to handle errors
HTML typos
KBs of zeros in the middle of a tag
non-ASCII characters
HTML Tags nested hundreds deep
Developed their own Parser
involved a fair amount of work
did not cause a bottleneck

65
Major Data Structures (9)

Indexing Documents into Barrels
turning words into wordIDs
in-memory hash table - the Lexicon
new additions are logged to a file
parallelization
shared lexicon of 14 million words
log of all the extra words

66
Major Data Structures (10)

Indexing the Web
Sorting
creating the inverted index
produces two types of barrels
for titles and anchor (Short barrels)
for full text (full barrels)
sorts every barrel separately
running sorters in parallel
the sorting is done in main memory

Ranking looks at short barrels first, and then
full barrels
67
Searching

Algorithm
1. Parse the query
2. Convert word into wordIDs
3. Seek to the start of the doclist in the short
barrel for every word
4. Scan through the doclists until there is a
document that matches all of the search terms

5. Compute the rank of that document
6. If we're at the end of the short barrels, start
at the doclists of the full barrel, unless we
have enough
7. If we're not at the end of any doclist, go to
step 4
8. Sort the documents by rank and return the top K
(May jump here after 40k pages)
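
The eight steps can be condensed into a short Python sketch. It assumes each barrel is a dictionary from wordID to a sorted list of docIDs and that `rank` is a caller-supplied scoring function; both are simplifications, and the 40k-page cutoff is left out.

```python
import heapq


def intersect_sorted(a, b):
    """docIDs present in both sorted doclists (linear merge scan)."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(query, lexicon, short_barrel, full_barrel, rank, k=10):
    words = query.lower().split()                       # step 1: parse
    if not words or any(w not in lexicon for w in words):
        return []
    word_ids = [lexicon[w] for w in words]              # step 2: words -> wordIDs
    matches, seen = [], set()
    for barrel in (short_barrel, full_barrel):          # steps 3 and 6
        docs = barrel.get(word_ids[0], [])
        for wid in word_ids[1:]:                        # step 4: match all terms
            docs = intersect_sorted(docs, barrel.get(wid, []))
        for d in docs:
            if d not in seen:
                seen.add(d)
                matches.append(d)
        if len(matches) >= k:                           # "unless we have enough"
            break
    return heapq.nlargest(k, matches, key=rank)         # steps 5 and 8
```

Because the short barrels (titles and anchors) are consulted first, a query that already has k matches there never touches the full-text barrels.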

68
The Ranking System

The information
Position, Font Size, Capitalization
Anchor Text
PageRank
Hit types
title, anchor, URL, etc.
small font, large font, etc.

69
The Ranking System (2)

Each Hit type has its own weight
Count weights increase linearly with counts at
first but quickly taper off; this gives the IR score
of the doc
(IDF weighting??)
the IR is combined with PageRank to give the
final Rank
For multi-word query
A proximity score for every set of hits with a
proximity type weight
10 grades of proximity

70
Feedback

A trusted user may optionally evaluate the
results
The feedback is saved
When modifying the ranking function we can see
the impact of this change on all previous
searches that were ranked

71
Results