The Anatomy Of A Large Scale Search Engine
  • Based on a paper by Sergey Brin and Lawrence Page

Computer Science Department, Stanford University
Submitted to WWW7 (1997)
Lecture by Tal Blum for the SDBI seminar
  • Introduction
  • Design Goals
  • System Features
  • Related Work
  • System Anatomy
  • Results and Performance
  • Conclusions
  • Future Work
  • References

What is Google?
  • Large-scale search engine
  • makes extensive use of hypertext structure
  • designed to crawl and index the web efficiently
  • gives better results
  • prototype at http// or
  • googol = $10^{100}$

Why talk about google?
  • Engineering a search engine (SE) is a challenging task
  • millions of pages, terms, queries
  • little academic research
  • SE today is not what it was 5 years ago
  • the first detailed public description of an SE
  • better results using hypertext
  • uncontrolled hypertext collections

The web - IR challenge
  • 2 main ways to find pages on the web
  • high quality human maintained lists (Yahoo)
  • too slow to improve
  • cannot cover esoteric topics
  • expensive to build and maintain
  • search engines (google, altavista)
  • search by keywords
  • too many low quality matches
  • people try to mislead automated search engines

Web Growth
Web Search Engine Scaling-Up 1994-2000
  • First SE, WWWW (1994), had an index of 110,000 web
    pages and received about 1500 queries per day
  • November 1997: indexes of 2-100 million web pages,
    20 million queries per day (Altavista)
  • expected that by 2000 SEs will have an index of over
    a billion web pages and handle hundreds of millions
    of queries per day

Web Search Engine Scaling-Up 1999
  • Challenges in creating a search engine which
    scales even to today's web
  • Fast crawling technology
  • gather documents, keep them up to date
  • Efficient storage space
  • indices, optionally the documents
  • Handle queries quickly
  • rate of thousands per second

Google Scaling with the web
  • Improved Hardware Performance
  • exceptions: disk seek time, OS robustness
  • Google is designed to scale well to extremely
    large data sets
  • Google's data structures are optimized for fast
    and efficient access
  • Google is a centralized SE

Design Goals
  • Improved Search Quality
  • Junk Results
  • Number of documents has increased by many orders
    of magnitude
  • User ability to look at documents has not
  • As the collection size grows we need tools with
    very high precision even at the expense of recall
  • Use of hypertextual information
  • In Google: link structure and anchor text

Design Goals (2)
  • Academic Search Engine Research
  • SE has migrated from the academic domain to the
    commercial one
  • SE technology became mostly a black art,
    advertising oriented
  • Gather usage information from real users
  • considered commercially valuable
  • Support novel research activities on large-scale
    web data

System Features
  • PageRank: Bringing Order to the Web
  • most web SEs have largely ignored the link graph
  • 518 million hyperlinks
  • corresponds well with people's idea of importance
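  • $PR(A) = (1-d) + d\,(PR(T_1)/C(T_1) + \cdots + PR(T_n)/C(T_n))$
  • $T_1 \ldots T_n$ are the pages linking to A, $C(T)$ is the
    number of links going out of T, d is a damping factor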
  • difference from traditional methods
  • not counting links from pages equally
  • normalizing by the number of links in a page
  • different from Kleinberg's HITS
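
The PageRank bullet above can be tried out on a toy graph. The following is a minimal sketch, not the paper's implementation: the three-page graph, the function name `pagerank`, and the fixed iteration count are illustrative assumptions, and d = 0.85 is the value the paper reports as typical. It assumes every page has at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    Assumption: every page has at least one outgoing link.
    """
    pages = list(links)
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each page T that links here contributes PR(T) divided by
            # the number of links going out of T.
            incoming = sum(
                pr[src] / len(out) for src, out in links.items() if page in out
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
print(ranks)
```

In this toy graph C collects rank from both A and B, so it ends up ranked above B, which is only linked from A and receives half of A's rank.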

System Features (2)
  • Anchor Text
  • Associate link text with the page it points to
  • advantages
  • anchors often provide a more accurate description
    than the page itself
  • can exist for documents that can't be indexed
  • images, programs, databases, mp3, non-text docs
  • can return web pages that haven't been crawled
  • was first used in the World Wide Web Worm (1994)
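
A minimal sketch of the anchor text idea: the words of a link are indexed under the page the link points to, not only under the page that contains it. The function `index_anchors`, the data layout, and the example.org URLs are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def index_anchors(pages):
    """Build a word -> set of target URLs index from anchor text.

    pages maps a source URL to a list of (target_url, anchor_text) links.
    """
    anchor_index = defaultdict(set)
    for _source, links in pages.items():
        for target, text in links:
            for word in text.lower().split():
                # Credit the anchor words to the page being pointed at.
                anchor_index[word].add(target)
    return anchor_index


# Hypothetical crawl: one target is an image, the other was never fetched.
crawled = {
    "http://example.org/a": [("http://example.org/logo.png", "Example logo")],
    "http://example.org/b": [("http://example.org/uncrawled", "annual report")],
}
index = index_anchors(crawled)
print(index["report"])
```

Both targets become searchable even though neither has any text in the index of its own, which is the advantage listed on this slide.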

System Features (3)
  • Other Features
  • Location Information
  • Use of proximity in search
  • Visualization Information
  • Relative font size
  • Full raw HTML is available
  • users can view a cached version of the page
  • users can view the page as it was when indexed
  • can be used for research

Related Work
  • SEs have a short history (WWWW, 1994)
  • commercial services closely guard the details of
    their databases
  • work on specialized features of SE
  • especially on post-processing results of SE
  • work on Information Retrieval Systems
  • especially on well controlled environments

IR Differences Between the Web and Well
Controlled Collections
  • TREC 96's Very Large Corpus is only 20GB
    compared to the 147GB of the Google crawl
  • The Web is a vast collection of heterogeneous
    documents
  • language, vocabulary, format
  • things that work well for TREC often do not
    produce good results on the web
  • there is no control over what people put on the
    web
System Anatomy
  • High Level Overview

Major Data Structures
  • Big Files
  • virtual files spanning multiple file systems
  • addressable by 64 bit integers
  • handles allocation and deallocation of file
    descriptors, since the OS's support is not enough
  • supports rudimentary compression

Major Data Structures (2)
  • Repository
  • tradeoff between speed and compression ratio
  • chose zlib (3 to 1) over bzip (4 to 1)
  • requires no other data structure to access it
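
The speed versus ratio tradeoff can be explored with Python's standard `zlib` and `bz2` modules. This is only a sketch: the helper names `store` and `load` and the sample page are made up, and the ratios on this tiny repetitive sample say nothing about the 3 to 1 and 4 to 1 figures measured on real web pages.

```python
import bz2
import zlib


def store(html: bytes) -> bytes:
    """Compress one page the way a repository record might (zlib)."""
    return zlib.compress(html)


def load(blob: bytes) -> bytes:
    """Recover the original HTML from a stored record."""
    return zlib.decompress(blob)


# Hypothetical sample page, far more repetitive than real HTML.
sample = b"<html><body>" + b"<p>web search engine</p>" * 500 + b"</body></html>"
zlib_size = len(store(sample))
bz2_size = len(bz2.compress(sample))
print(f"original={len(sample)} zlib={zlib_size} bz2={bz2_size}")
```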

Major Data Structures (3)
  • Document Index
  • keeps information about each document
  • fixed width ISAM (indexed sequential access method)
  • includes various statistics
  • pointer into the repository and, if the document
    was crawled, a pointer to its info list
  • compact data structure
  • we can fetch a record in 1 disk seek during search

Major Data Structures (4)
  • URL to docID file
  • used to convert URLs to docIDs
  • list of URL checksums with their docIDs
  • sorted by checksums
  • given a URL a binary search is performed
  • conversion is done in batch mode
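
The lookup described on this slide can be sketched as a sorted list of (checksum, docID) pairs searched with the standard `bisect` module. The names `build_table` and `lookup` are hypothetical, and `zlib.crc32` merely stands in for whatever checksum the real system used.

```python
import bisect
import zlib


def checksum(url):
    # Assumption: CRC32 is a stand-in for the system's actual URL checksum.
    return zlib.crc32(url.encode("utf-8"))


def build_table(url_to_docid):
    """Return (checksum, docID) pairs sorted by checksum."""
    return sorted((checksum(url), doc_id) for url, doc_id in url_to_docid.items())


def lookup(table, url):
    """Binary search for the URL's checksum; None if it is not in the table."""
    key = checksum(url)
    i = bisect.bisect_left(table, (key,))
    if i < len(table) and table[i][0] == key:
        return table[i][1]
    return None


# Hypothetical URLs and docIDs.
table = build_table({
    "http://example.org/": 7,
    "http://example.org/about": 12,
    "http://example.org/news": 31,
})
print(lookup(table, "http://example.org/about"))
```

Because the table is sorted by checksum, a whole batch of URLs can also be converted in one pass by sorting the batch the same way and merging, which fits the batch mode bullet.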

Major Data Structures (4)
  • Lexicon
  • can fit in memory for a reasonable price
  • currently 256 MB
  • contains 14 million words
  • 2 parts
  • a list of words
  • a hash table

Major Data Structures (4)
  • Hit Lists
  • includes position, font, and capitalization
  • account for most of the space used in the indexes
  • 3 alternatives: simple, Huffman, hand-optimized
  • hand encoding uses 2 bytes for every hit
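
A 2-byte hit can be sketched with plain bit packing. The layout below follows the original paper's description of a plain hit (1 capitalization bit, 3 bits of relative font size, 12 bits of word position); the function names and the choice to clamp large positions to 4095 are assumptions made for this sketch.

```python
import struct


def encode_plain_hit(capitalized, font_size, position):
    """Pack a hit into 16 bits: [cap:1][font:3][position:12]."""
    if not 0 <= font_size < 7:
        # The paper reserves font value 7 (binary 111) to flag fancy hits.
        raise ValueError("font_size must be in 0..6")
    position = min(position, 4095)  # assumption: clamp to what 12 bits can hold
    return (int(capitalized) << 15) | (font_size << 12) | position


def decode_plain_hit(hit):
    """Unpack a 16-bit hit into (capitalized, font_size, position)."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = encode_plain_hit(True, 3, 100)
raw = struct.pack(">H", hit)  # exactly two bytes on disk
print(hit, len(raw), decode_plain_hit(hit))
```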

Major Data Structures (4)
  • Hit Lists (2)

Major Data Structures (5)
  • Forward Index
  • partially ordered
  • uses 64 Barrels
  • each Barrel holds a range of wordIDs
  • requires slightly more storage
  • each wordID is stored as a relative difference
    from the minimum wordID of the Barrel
  • save considerable time in the sorting
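
The relative wordID trick can be sketched as follows. The barrel width `WORDS_PER_BARREL` is a made-up value chosen so that 64 barrels cover a lexicon of roughly 14 million words; the real ranges are not given in these slides.

```python
NUM_BARRELS = 64
WORDS_PER_BARREL = 1 << 18  # hypothetical width: 64 * 2**18 covers ~16.7M wordIDs


def to_relative(word_id):
    """Return (barrel number, wordID offset from the barrel's minimum wordID)."""
    if not 0 <= word_id < NUM_BARRELS * WORDS_PER_BARREL:
        raise ValueError("wordID out of range")
    barrel = word_id // WORDS_PER_BARREL
    return barrel, word_id - barrel * WORDS_PER_BARREL


def to_absolute(barrel, relative_id):
    """Invert to_relative."""
    return barrel * WORDS_PER_BARREL + relative_id


barrel, rel = to_relative(300_000)
print(barrel, rel, to_absolute(barrel, rel))
```

Storing only the offset keeps each stored wordID small, and because every barrel already holds one contiguous wordID range, each barrel can later be sorted on its own.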

Major Data Structures (6)
  • Inverted Index
  • 64 Barrels (same as the Forward Index)
  • for each wordID the Lexicon contains a pointer to
    the Barrel that wordID falls into
  • the pointer points to a doclist of docIDs with
    their hit lists
  • the order of the docIDs is important
  • sorted by docID or by the ranking of the word in
    each doc
  • Google chose a compromise between the two

Major Data Structures (7)
  • Crawling the Web
  • fast distributed crawling system
  • URLserver and crawlers are implemented in Python
  • each crawler keeps about 300 connections open
  • at peak the rate is about 100 pages (600K of data)
    per second
  • uses an internal cached DNS lookup
  • asynchronous IO to handle events
  • a number of queues
  • robust and carefully tested

Major Data Structures (8)
  • Indexing the Web
  • Parsing
  • must handle a wide range of errors
  • HTML typos
  • kilobytes of zeros in the middle of a tag
  • non-ASCII characters
  • HTML Tags nested hundreds deep
  • Developed their own Parser
  • involved a fair amount of work
  • did not cause a bottleneck

Major Data Structures (9)
  • Indexing Documents into Barrels
  • turning words into wordIDs
  • in-memory hash table - the Lexicon
  • new additions are logged to a file
  • parallelization
  • shared lexicon of 14 million words
  • log of all the extra words

Major Data Structures (10)
  • Indexing the Web
  • Sorting
  • creating the inverted index
  • produces two types of barrels
  • for titles and anchor
  • for full text
  • sorts every barrel separately
  • running sorters in parallel
  • the sorting is done in main memory

  • Algorithm
  • 1. Parse the query
  • 2. Convert words into wordIDs
  • 3. Seek to the start of the doclist in the short
    barrel for every word
  • 4. Scan through the doclists until there is a
    document that matches all of the search terms
  • 5. Compute the rank of that document
  • 6. If we're at the end of the short barrels, start
    at the doclists of the full barrel, unless we
    have enough results
  • 7. If we're not at the end of any doclist, go to
    step 4
  • 8. Sort the documents by rank and return the top K
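
The eight steps can be sketched in a few lines, assuming each doclist is a list of docIDs sorted in ascending order. Everything here is illustrative: the in-memory dictionaries standing in for the lexicon and barrels, the `rank` callback, and the `enough` threshold are assumptions, not the system's actual interfaces.

```python
def matching_docs(doclists):
    """Yield docIDs present in every doclist (each sorted by docID)."""
    positions = [0] * len(doclists)
    while all(pos < len(dl) for pos, dl in zip(positions, doclists)):
        current = [dl[pos] for pos, dl in zip(positions, doclists)]
        target = max(current)
        if all(doc == target for doc in current):
            yield target
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are behind the largest docID.
            positions = [
                pos + (dl[pos] < target) for pos, dl in zip(positions, doclists)
            ]


def search(query, lexicon, short_barrel, full_barrel, rank, k=10, enough=10):
    word_ids = [lexicon.get(w) for w in query.lower().split()]   # steps 1-2
    if not word_ids or None in word_ids:
        return []
    results = {}
    for barrel in (short_barrel, full_barrel):                   # steps 3 and 6
        doclists = [barrel.get(w, []) for w in word_ids]
        for doc in matching_docs(doclists):                      # steps 4 and 7
            results[doc] = rank(doc)                             # step 5
        if len(results) >= enough:
            break
    return sorted(results, key=results.get, reverse=True)[:k]    # step 8


# Hypothetical data: wordID -> sorted docIDs.
lexicon = {"bill": 1, "clinton": 2}
short_barrel = {1: [3, 8], 2: [8]}
full_barrel = {1: [2, 3, 8, 9], 2: [3, 8, 9]}
scores = {3: 0.5, 8: 0.9, 9: 0.1}
top = search("bill clinton", lexicon, short_barrel, full_barrel, scores.get,
             k=2, enough=2)
print(top)
```

With `enough=1` the short barrels alone would satisfy the query and the full barrels would never be scanned, which is the point of keeping the two sets.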

The Ranking System
  • The information
  • Position, Font Size, Capitalization
  • Anchor Text
  • PageRank
  • Hit types
  • title, anchor, URL, etc.
  • small font, large font, etc.

The Ranking System (2)
  • Each Hit type has its own weight
  • Count-weights increase linearly with counts at
    first but quickly taper off
  • type-weights and count-weights are combined into
    the IR score of the doc
  • the IR score is combined with PageRank to give the
    final rank
  • For multi-word queries
  • a proximity score is computed for every set of
    matching hits, with a weight per proximity type
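
A minimal sketch of the single-word scoring described above. All numbers are invented: the type-weights, the cap at which count-weights stop growing, and the linear blend with PageRank are assumptions for illustration, since the slides do not give the actual weights or the combination function.

```python
# Hypothetical type-weights; the real values are not given in the slides.
TYPE_WEIGHTS = {"title": 8.0, "anchor": 6.0, "url": 4.0, "plain": 1.0}


def count_weight(count, cap=8):
    """Grow linearly with the count at first, then stop growing at the cap."""
    return min(count, cap)


def ir_score(hit_counts):
    """Dot product of type-weights and count-weights for one document."""
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in hit_counts.items())


def final_rank(ir, pagerank, alpha=0.5):
    # Assumption: a simple weighted sum stands in for the real combination.
    return alpha * ir + (1 - alpha) * pagerank


doc_hits = {"title": 1, "plain": 20}   # 1 title hit, 20 plain-text hits
score = ir_score(doc_hits)
print(score, final_rank(score, pagerank=2.0))
```

Capping the count-weight means repeating a word hundreds of times does not help a page beyond a certain point.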

  • A trusted user may optionally evaluate the
    returned results
  • The feedback is saved
  • When modifying the ranking function we can see
    the impact of this change on all previous
    searches that were ranked

  • Produces better results than major commercial
    search engines for most searches
  • Example query: "bill clinton"
  • return results from the
  • email addresses of the president
  • all the results are high quality pages
  • no broken links
  • no results with "bill" but not "clinton", and none
    with "clinton" but not "bill"

Storage Requirements
  • Using Compression on the repository
  • about 55 GB for all the data used by the SE
  • most of the queries can be answered by just the
    short inverted index
  • with better compression, a high quality SE can
    fit onto a 7GB drive of a new PC

Storage Statistics
Web Page Statistics
System Performance
  • It took 9 days to download 26 million pages
  • 48.5 pages per second once the crawl ran smoothly
  • The Indexer and Crawler ran simultaneously
  • The Indexer runs at 54 pages per second
  • The sorters run in parallel using 4 machines, the
    whole process took 24 hours
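
A quick arithmetic check of the crawl rates. Note that 26 million pages in 9 days averages well below 48.5 pages per second; the 48.5 figure matches the original paper's statement that the last 11 million pages were downloaded in 63 hours. Those two numbers come from the paper, not from these slides.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Whole crawl: 26 million pages in roughly 9 days.
overall_rate = 26_000_000 / (9 * SECONDS_PER_DAY)

# Final stretch (figures from the original paper): 11 million pages in 63 hours.
peak_rate = 11_000_000 / (63 * 60 * 60)

print(f"overall: {overall_rate:.1f} pages/s, final stretch: {peak_rate:.1f} pages/s")
```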

Conclusions
  • Scalable Search Engine
  • High Quality Search Results
  • Search techniques
  • PageRank
  • Anchor Text
  • Proximity Information
  • A Complete Architecture

Future Work
  • Improve search efficiency
  • Scale to 100 million pages
  • Boolean Operators
  • Text Surrounding Links
  • Personalized PageRank
  • Result Summarization

New Features
  • Google Scout
  • Documents Caching
  • Uncle Sams
  • Link option

The End
