The Anatomy Of A Large Scale Search Engine - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: The Anatomy Of A Large Scale Search Engine


1
The Anatomy Of A Large Scale Search Engine
  • Based on a paper by
  • Sergey Brin & Lawrence Page

Computer Science Department, Stanford University;
submitted to WWW7 (1997). Lecture by Tal Blum for the
SDBI seminar
2
Index
  • Introduction
  • Design Goals
  • System Features
  • Related Work
  • System Anatomy
  • Results & Performance
  • Conclusions
  • Future Work
  • References

3
What is Google?
  • Large-scale search engine
  • makes extensive use of hypertext
  • designed to crawl & index the web efficiently
  • gives better results
  • prototype at http://google.stanford.edu or
    http://www.google.com
  • googol = 10^100

4
Why talk about Google?
  • Engineering a SE is a challenging task
  • millions of pages, terms, queries
  • little academic research
  • SE today is not what it was 5 years ago
  • the first detailed public description of a large-scale SE
  • better results using hypertext
  • uncontrolled hypertext collections

5
The web - IR challenge
  • 2 main ways of surfing
  • high quality human maintained lists (Yahoo)
  • too slow to improve
  • cannot cover esoteric topics
  • expensive to build and maintain
  • search engines (Google, AltaVista)
  • search by keywords
  • too many low quality matches
  • people try to mislead automated search engines

6
Web Growth
7
Web Search Engine Scaling-Up: 1994-2000
  • The first SE, the WWWW (1994), had an index of 110,000
    web pages and answered about 1,500 queries per day
  • November 1997: indexes of 2-100 million web pages,
    about 20 million queries per day (AltaVista)
  • it is expected that by 2000 a SE will have an index of
    a billion web pages and handle hundreds of millions of
    queries per day

8
Web Search Engine Scaling-Up: 1999
  • Challenges in creating a search engine which
    scales even to today's web
  • Fast crawling technology
  • gather documents & keep them up to date
  • Efficient use of storage space
  • for indices and, optionally, the documents themselves
  • Handle queries quickly
  • at a rate of thousands per second

9
Google: Scaling with the Web
  • Improved hardware performance
  • exceptions: disk seek time & OS robustness
  • Google is designed to scale well to extremely
    large data sets
  • Google's data structures are optimized for fast,
    efficient access
  • Google is a centralized SE

10
Design Goals
  • Improved Search Quality
  • Junk Results
  • The number of documents has increased by many
    orders of magnitude
  • Users' ability to look at documents has not
  • As the collection size grows, we need tools with
    very high precision, even at the expense of recall
  • Use of hypertextual information
  • In Google: link structure & anchor text

11
Design Goals (2)
  • Academic Search Engine Research
  • SEs have migrated from the academic domain to the
    commercial one
  • SE technology became mostly a black art & is
    advertising oriented
  • Getting people's usage information is hard
  • it is considered commercially valuable
  • Support novel research activities on large-scale
    web data

12
System Features
  • PageRank: Bringing order to the web
  • most web SEs have largely ignored the link graph
  • 518 million hyperlinks
  • corresponds well with people's idea of importance
  • PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
  • differences from traditional methods
  • not counting links from pages equally
  • normalizing by the number of links on a page
  • different from Kleinberg's HITS (see the sketch below)
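
A minimal sketch of the PageRank iteration behind the formula above, written
in Python (the language the deck later says the crawlers use). The toy link
graph and the iteration count are illustrative assumptions; d = 0.85 follows
the paper's usual setting.

    def pagerank(links, d=0.85, iterations=50):
        """links maps each page to the list of pages it links to."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        pr = {p: 1.0 for p in pages}                  # initial rank for every page
        for _ in range(iterations):
            new_pr = {}
            for page in pages:
                # sum PR(T)/C(T) over every page T that links to this page
                incoming = sum(pr[t] / len(links[t])
                               for t, targets in links.items() if page in targets)
                new_pr[page] = (1 - d) + d * incoming
            pr = new_pr
        return pr

    # toy graph: A and B both link to C, C links back to A
    print(pagerank({"A": ["C"], "B": ["C"], "C": ["A"]}))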

13
System Features (2)
  • Anchor Text
  • Associate link text with the page it points to
  • advantages
  • anchors often provide a more accurate description
  • can exist for documents that can't be indexed
  • images, programs, databases, mp3, non-text docs,
    e-mails
  • can return web pages that hadn't been crawled
  • the idea was first used in the WWW Worm (1994)

14
System Features (3)
  • Other Features
  • Location Information
  • Use of proximity in search
  • Visualization Information
  • relative font size
  • Full raw HTML is available
  • users can view a cached version of the page
  • users can view the page as it was when indexed
  • can be used for research

15
Related Work
  • SEs have a short history (WWWW, 1994)
  • commercial services closely guard the details of
    their databases
  • work on specialized features of SE
  • especially on post-processing results of SE
  • work on Information Retrieval Systems
  • especially on well controlled environments

16
IR Differences Between the Web and
Well-Controlled Collections
  • TREC-96's Very Large Corpus is only 20 GB,
    compared to the 147 GB of Google's crawl
  • The Web is a vast collection of heterogeneous
    documents
  • language, vocabulary, format
  • things that work well for TREC often do not
    produce good results on the web
  • there is no control over what people put on the
    web

17
System Anatomy
  • High Level Overview

18
Major Data Structures
  • BigFiles
  • virtual files spanning multiple file systems
  • addressable by 64-bit integers
  • handles allocation & deallocation of file
    descriptors, since what the OSs provide is not enough
  • supports rudimentary compression (see the sketch below)
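
A minimal sketch of the BigFiles idea: one 64-bit virtual address space spread
over several ordinary files, so a single logical file can span file systems.
The 1 GB chunk size, the file naming, and the assumption that reads and writes
never cross a chunk boundary are all illustrative simplifications.

    import os

    CHUNK = 1 << 30                       # assume each underlying file holds 1 GB

    class BigFile:
        def __init__(self, prefix):
            self.prefix = prefix
            self.handles = {}             # lazily opened chunk files

        def _chunk(self, offset):
            idx, local = divmod(offset, CHUNK)
            if idx not in self.handles:
                path = f"{self.prefix}.{idx:04d}"
                mode = "r+b" if os.path.exists(path) else "w+b"
                self.handles[idx] = open(path, mode)
            return self.handles[idx], local

        def write(self, offset, data):    # offset is a 64-bit virtual address
            f, local = self._chunk(offset)
            f.seek(local)
            f.write(data)

        def read(self, offset, size):
            f, local = self._chunk(offset)
            f.seek(local)
            return f.read(size)

    big = BigFile("/tmp/repository")
    big.write(0, b"hello")
    print(big.read(0, 5))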

19
Major Data Structures (2)
  • Repository
  • tradeoff between speed & compression ratio
  • chose zlib (3 to 1) over bzip (4 to 1)
  • requires no other data structure to access it

20
Major Data Structures (3)
  • Document Index
  • keeps information about each document
  • fixed-width ISAM (Index Sequential Access Mode)
    index, ordered by docID
  • includes various statistics
  • a pointer into the repository and, if crawled, a
    pointer to its info lists
  • compact data structure
  • we can fetch a record in 1 disk seek during search
    (see the sketch below)
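
A minimal sketch of why a fixed-width record format gives one-seek lookups:
the record for docID n starts at byte n * record_size. The record fields below
(repository pointer, document length, crawled flag) are illustrative
assumptions, not the exact layout used by Google.

    import io
    import struct

    RECORD = struct.Struct("<QIB")   # repository pointer, doc length, crawled flag

    def read_doc_record(index_file, doc_id):
        index_file.seek(doc_id * RECORD.size)          # a single seek
        repo_ptr, length, crawled = RECORD.unpack(index_file.read(RECORD.size))
        return {"repo_ptr": repo_ptr, "length": length, "crawled": bool(crawled)}

    # build a tiny in-memory index with three records and fetch the third
    index = io.BytesIO()
    for doc_id in range(3):
        index.write(RECORD.pack(doc_id * 1000, 4096, 1))
    print(read_doc_record(index, 2))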

21
Major Data Structures (4)
  • URL-to-docID file
  • used to convert URLs to docIDs
  • list of URL checksums with their docIDs
  • sorted by checksums
  • given a URL, a binary search is performed, as in
    the sketch below
  • conversion is done in batch mode
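
A minimal sketch of the URL-to-docID lookup: checksums are kept sorted, so
each lookup is a binary search. The 64-bit truncated SHA-1 checksum is an
illustrative stand-in for whatever checksum function Google actually used.

    import bisect
    import hashlib

    def checksum(url):
        # 64-bit checksum of the URL (truncated SHA-1 is an assumption)
        return int.from_bytes(hashlib.sha1(url.encode()).digest()[:8], "big")

    entries = []                         # sorted list of (checksum, docID) pairs

    def add_url(url, doc_id):
        bisect.insort(entries, (checksum(url), doc_id))

    def url_to_docid(url):
        c = checksum(url)
        i = bisect.bisect_left(entries, (c, 0))
        if i < len(entries) and entries[i][0] == c:
            return entries[i][1]
        return None                      # URL not in the file yet

    add_url("http://google.stanford.edu/", 1)
    add_url("http://www.stanford.edu/", 2)
    print(url_to_docid("http://www.stanford.edu/"))    # -> 2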

22
Major Data Structures (5)
  • Lexicon
  • can fit in memory for a reasonable price
  • currently 256 MB
  • contains 14 million words
  • 2 parts
  • a list of the words
  • a hash table of pointers

23
Major Data Structures (6)
  • Hit Lists
  • include position, font & capitalization information
  • account for most of the space used in the indexes
  • 3 encoding alternatives: simple, Huffman, and
    hand-optimized
  • the hand encoding uses 2 bytes for every hit (see the
    sketch below)
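
A minimal sketch of the 2-byte hand-optimized encoding for a plain hit,
following the paper's layout: 1 capitalization bit, 3 bits of font size, and
12 bits of word position packed into 16 bits (fancy hits, which borrow
position bits for a type field, are omitted).

    def pack_plain_hit(capitalized, font_size, position):
        assert 0 <= font_size < 8 and 0 <= position < 4096
        return (int(capitalized) << 15) | (font_size << 12) | position

    def unpack_plain_hit(hit):
        return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

    hit = pack_plain_hit(True, 3, 1021)
    print(hit < 2**16, unpack_plain_hit(hit))    # fits in 2 bytes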

24
Major Data Structures (7)
  • Hit Lists (2)

25
Major Data Structures (8)
  • Forward Index
  • partially ordered
  • uses 64 barrels
  • each barrel holds a range of wordIDs
  • requires slightly more storage (duplicated docIDs)
  • each wordID is stored as a relative difference
    from the minimum wordID of its barrel (sketched below)
  • saves considerable time in the sorting phase
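
A minimal sketch of the barrel-relative wordID idea: each barrel covers a
fixed range of wordIDs, and a wordID is stored as its offset from the barrel's
minimum, so it needs fewer bits. The barrel ranges below are an illustrative
assumption based on the 14-million-word lexicon and 64 barrels mentioned in
the deck.

    NUM_BARRELS = 64
    MAX_WORDID = 14_000_000                    # roughly the lexicon size
    RANGE = MAX_WORDID // NUM_BARRELS + 1      # wordIDs covered by one barrel

    def encode(word_id):
        barrel = word_id // RANGE
        return barrel, word_id - barrel * RANGE    # (barrel, relative wordID)

    def decode(barrel, relative_id):
        return barrel * RANGE + relative_id

    barrel, rel = encode(9_123_456)
    assert decode(barrel, rel) == 9_123_456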

26
Major Data Structures (9)
  • Inverted Index
  • 64 barrels (same as the Forward Index)
  • for each wordID the Lexicon contains a pointer to
    the barrel that wordID falls into
  • the pointer points to a doclist of docIDs together
    with their hit lists
  • the order of the docIDs is important
  • by docID or by a ranking of the word's occurrence
    in each doc
  • Google chooses a compromise between the two (see the
    sketch below)
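
A minimal in-memory sketch of an inverted barrel: postings produced by the
forward pass are regrouped so each wordID maps to a doclist of (docID, hit
list) pairs. The on-disk layout, the lexicon pointer, and the two-barrel
compromise are omitted; the sample postings are made up.

    from collections import defaultdict

    forward_postings = [
        # (docID, wordID, hit list) triples as produced by the indexer
        (1, 42, [3, 17]),
        (2, 42, [5]),
        (2, 7, [1, 9]),
    ]

    inverted = defaultdict(list)               # wordID -> doclist
    for doc_id, word_id, hits in sorted(forward_postings, key=lambda p: p[1]):
        inverted[word_id].append((doc_id, hits))

    print(inverted[42])                        # [(1, [3, 17]), (2, [5])]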

27
Major Data Structures (10)
  • Crawling the Web
  • fast distributed crawling system
  • the URLserver & Crawlers are implemented in Python
  • each Crawler keeps about 300 connections open
  • at peak, the rate is about 100 pages (600 KB of data)
    per second
  • uses an internal cached DNS lookup
  • asynchronous IO to handle events
  • a number of queues to move page fetches from state
    to state
  • Robust & carefully tested (a concurrency sketch
    follows)
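
A minimal sketch of the crawler's concurrency model (asynchronous IO with a
few hundred connections in flight), using Python's asyncio. The fetch function
is a stub; the URL frontier, DNS cache, robots.txt handling and the queues
mentioned above are all omitted.

    import asyncio

    MAX_CONNECTIONS = 300                      # roughly what each crawler keeps open

    async def fetch(url):
        # stand-in for an HTTP request; a real crawler would download the
        # page here and hand it to the storeserver
        await asyncio.sleep(0.01)
        return url, "<html>...</html>"

    async def crawl(urls):
        sem = asyncio.Semaphore(MAX_CONNECTIONS)   # cap the open connections

        async def bounded_fetch(url):
            async with sem:
                return await fetch(url)

        return await asyncio.gather(*(bounded_fetch(u) for u in urls))

    pages = asyncio.run(crawl([f"http://example.com/{i}" for i in range(1000)]))
    print(len(pages))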

28
Major Data Structures (11)
  • Indexing the Web
  • Parsing
  • must know how to handle errors
  • HTML typos
  • kilobytes of zeros in the middle of a tag
  • non-ASCII characters
  • HTML tags nested hundreds deep
  • Developed their own parser
  • involved a fair amount of work
  • did not cause a bottleneck

29
Major Data Structures (12)
  • Indexing Documents into Barrels
  • turning words into wordIDs
  • in-memory hash table - the Lexicon
  • new additions are logged to a file
  • parallelization
  • shared base lexicon of 14 million words
  • a log of all the extra words (see the sketch below)
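
A minimal sketch of turning words into wordIDs with an in-memory lexicon:
known words resolve through the hash table, new words get fresh IDs and are
also appended to a log so parallel indexers can be reconciled later. The data
structures are simplified to plain Python objects.

    lexicon = {}            # word -> wordID (the in-memory hash table)
    new_word_log = []       # stand-in for the on-disk log of extra words

    def word_to_id(word):
        word_id = lexicon.get(word)
        if word_id is None:
            word_id = len(lexicon)         # assign the next free wordID
            lexicon[word] = word_id
            new_word_log.append((word, word_id))
        return word_id

    doc = "the anatomy of a large scale search engine"
    print([word_to_id(w) for w in doc.split()])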

30
Major Data Structures (13)
  • Indexing the Web
  • Sorting
  • creating the inverted index
  • produces two types of barrels
  • for title and anchor hits
  • for full text
  • sorts every barrel separately
  • running sorters in parallel
  • the sorting is done in main memory

31
Searching
  • Algorithm
  • 1. Parse the query
  • 2. Convert words into wordIDs
  • 3. Seek to the start of the doclist in the short
    barrel for every word
  • 4. Scan through the doclists until there is a
    document that matches all of the search terms
  • 5. Compute the rank of that document
  • 6. If we're at the end of the short barrels, start
    on the doclists of the full barrel, unless we
    have enough results
  • 7. If we're not at the end of any doclist, go to
    step 4
  • 8. Sort the documents by rank & return the top K
    (see the sketch below)
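
A minimal sketch of steps 3-8: walk the doclist of every query word, keep the
documents that contain all of the words, rank them, and return the top K.
Doclists are plain Python lists of (docID, hit list) pairs, the short/full
barrel distinction is dropped, and the rank function is a stub.

    def search(query_words, inverted, rank, k=10):
        doclists = [inverted.get(w, []) for w in query_words]     # step 3
        if not doclists:
            return []
        # step 4: intersect the doclists on docID
        matching = set(d for d, _ in doclists[0])
        for dl in doclists[1:]:
            matching &= {d for d, _ in dl}
        # steps 5 and 8: rank the matching documents, return the best K
        return sorted(matching, key=rank, reverse=True)[:k]

    inverted = {"bill": [(1, [3]), (2, [7])], "clinton": [(2, [8]), (5, [1])]}
    print(search(["bill", "clinton"], inverted, rank=lambda d: 1.0 / d))   # -> [2]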

32
The Ranking System
  • The information
  • Position, Font Size, Capitalization
  • Anchor Text
  • PageRank
  • Hit types
  • title, anchor, URL, etc.
  • small font, large font, etc.

33
The Ranking System (2)
  • Each hit type has its own type-weight
  • Count-weights increase linearly with counts at
    first but quickly taper off; together with the
    type-weights this gives the IR score of the doc
  • the IR score is combined with PageRank to give the
    final rank
  • For a multi-word query
  • a proximity score is computed for every set of hits,
    with a proximity/type weight (see the sketch below)
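
A minimal sketch of the ranking idea on this slide: per-type hit counts are
tapered, dotted with type weights to give an IR score, and that score is
combined with PageRank. Every weight, the taper, and the combining function
below are illustrative assumptions; the actual values are not published in
the paper.

    import math

    TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0,
                    "plain_large": 2.0, "plain_small": 1.0}

    def ir_score(hit_counts):
        score = 0.0
        for hit_type, count in hit_counts.items():
            # linear at first, then tapering off so huge counts stop helping
            tapered = min(count, 8) + math.log1p(max(count - 8, 0))
            score += TYPE_WEIGHTS.get(hit_type, 1.0) * tapered
        return score

    def final_rank(hit_counts, pagerank):
        # one simple way to fold PageRank into the IR score
        return ir_score(hit_counts) * math.log1p(pagerank)

    print(final_rank({"title": 1, "anchor": 3, "plain_small": 12}, pagerank=0.002))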

34
Feedback
  • A trusted user may optionally evaluate the
    results
  • The feedback is saved
  • When modifying the ranking function we can see
    the impact of this change on all previous
    searches that were ranked

35
Results
  • Produces better results than major commercial
    search engines for most searches
  • Example query: bill clinton
  • returns results from whitehouse.gov
  • the email address of the president
  • all the results are high quality pages
  • no broken links
  • no bill without clinton & no clinton without bill

36
Storage Requirements
  • Using Compression on the repository
  • about 55 GB for all the data used by the SE
  • most of the queries can be answered by just the
    short inverted index
  • with better compression, a high quality SE can
    fit onto a 7GB drive of a new PC

37
Storage Statistics
Web Page Statistics
38
System Performance
  • It took 9 days to download 26 million pages
  • 48.5 pages per second
  • The Indexer & Crawler ran simultaneously
  • The Indexer runs at 54 pages per second
  • The sorters run in parallel using 4 machines, the
    whole process took 24 hours

39
Conclusions
  • Scalable Search Engine
  • High Quality Search Results
  • Search techniques
  • PageRank
  • Anchor Text
  • Proximity Information
  • A Complete Architecture

40
Future Work
  • Improve search efficiency
  • Scale to 100 million pages
  • Boolean Operators
  • Text Surrounding Links
  • Personalized PageRank
  • Result Summarization

41
New Features
  • Google Scout
  • Document Caching
  • Uncle Sam (US government search)
  • Link option

42
The End