Anatomy of a Large-Scale Hypertextual Web Search Engine

Description:

... Real-Time Embedded System Technology), Soongsil Univ, Korea ... Query : 'Bill Clinton' - Bill Clinton Sucks - high quality information available on this topic ... – PowerPoint PPT presentation

1
Anatomy of a Large-Scale Hypertextual Web Search
Engine
  • Computer Science Department
  • Sergey Brin and Lawrence Page
  • Speaker: Jin O, Kim
  • March 14, 2006

2
Table of Contents
  • Abstract
  • Introduction
  • - Web Search Engines -- Scaling Up: 1994 - 2000
  • - Google: Scaling with the Web
  • - Design Goals
  • System Features
  • - PageRank
  • - Anchor Text
  • - Other Features
  • Related Work
  • - Information Retrieval
  • - Differences Between the Web and Well
    Controlled Collections
  • System Anatomy
  • - Google Architecture Overview
  • - Major Data Structures

3
Abstract
  • Presented as a prototype of a large-scale search
    engine
  • Google is designed to crawl and index the Web
    efficiently and produce much more satisfying
    search results than existing systems.
  • Engineering a search engine is a challenging task
  • - index tens to hundreds of millions of web pages
  • - answer tens of millions of queries every day
  • How to build a practical large-scale system which
    can exploit the additional information present in
    hypertext

4
Introduction (1/4)
  • The amount of information on the web is growing
    rapidly, as well as the number of new users
    inexperienced in the art of web research.
  • Automated search engines that rely on keyword
    matching usually return too many low quality
    matches
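  • Some advertisers try to gain attention by
    misleading automated search engines
  • The name Google is a common spelling of googol,
    10^100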

5
Introduction (2/4)
  • Web Search Engines -- Scaling Up: 1994 - 2000
  • 1994: World Wide Web Worm (WWWW), index of
    110,000 pages, about 1,500 queries per day
  • 1997 (now): top search engines index 2 to 100
    million pages
  • - AltaVista: about 20 million queries per day
  • 2000 (forecast): index of 1 billion pages, 100
    million queries per day

6
Introduction (3/4)
  • Google Scaling with the Web
  • gather the web documents and keep them up to date
  • - fast crawling technology
  • - storage space must be used efficiently
  • - indexing system must process hundreds of
    gigabytes of data efficiently
  • - queries must be handled quickly
  • These tasks are becoming increasingly difficult
    as the Web grows.
  • Hardware performance and cost have improved
    dramatically to partially offset the difficulty.
  • - notable exceptions: disk seek time and
    operating system robustness
  • Designing Google: consider both the growth of the
    Web and technological changes
  • - data structures optimized for fast and
    efficient access

7
Introduction (4/4)
  • Google Scaling with the Web
  • Design Goals
  • Improved Search Quality
  • - very high precision (number of relevant
    documents returned, say in the top tens of
    results)
  • - even at the expense of recall (the total
    number of relevant documents the system is able
    to return)
  • Academic Search Engine Research
  • - 1993: 1.5% of web servers were on .com
    domains; 1997: over 60%
  • - One of our main goals in designing Google
    was to set up an environment where other
    researchers can come in quickly, process large
    chunks of the web, and produce interesting
    results that would have been very difficult to
    produce otherwise.

8
System Features (1/3)
  • PageRank: Bringing Order to the Web
  • Citation importance that corresponds well with
    people's subjective idea of importance
  • Simple text matching search
  • - performs admirably when PageRank
    prioritizes the results
  • Links from all pages are not counted equally, and
    counts are normalized by the number of links on
    a page.
  • - We assume page A has pages T1...Tn which
    point to it.
  • - d (damping factor): between 0 and 1, usually
    set to 0.85
  • - C(A) is defined as the number of links going
    out of page A
  • - PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))

9
System Features (2/3)
  • PageRank
  • A page has a high PageRank if many pages point to
    it, or if some pages that point to it have a
    high PageRank
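
The PageRank recurrence described on these slides can be iterated on a small graph. The following is a minimal Python sketch, not the system's actual implementation: the function name `pagerank`, the toy graph `toy_web`, and the fixed iteration count are assumptions made for illustration, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to (hypothetical toy input).
    C(T) is len(links[T]), the number of links going out of page T.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    # incoming[p] lists the pages T that point to p
    incoming = {p: [] for p in pages}
    for src, targets in links.items():
        for t in targets:
            incoming[t].append(src)
    for _ in range(iterations):
        # Each page passes its current rank, split evenly over its out-links
        pr = {
            p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in incoming[p])
            for p in pages
        }
    return pr


# Toy graph: A links to B and C, B links to C, C links back to A
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph C is pointed to by both A and B, so it ends up ranked above A, which in turn ranks above B.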

10
System Features (3/3)
  • Anchor Text
  • Link text is associated with the page the link
    points to
  • 1. Often provides more accurate descriptions of
    web pages than the pages themselves
  • 2. May exist for documents which cannot be
    indexed by a text-based search engine, such as
    images, programs, and databases
  • Anchor propagation is used mostly because anchor
    text can help provide better quality results
  • Other Features
  • 1. Keeps location information for all hits, so it
    makes extensive use of proximity in search
  • 2. Keeps track of some visual presentation
    details such as font size of words
  • - words in a larger or bolder font are weighted
    higher than other words
  • 3. Full raw HTML of pages is available in a
    repository
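
The anchor text idea above, crediting link text to the page the link points to, might be sketched as follows. This is only an illustration using Python's standard `html.parser`; the class `AnchorCollector`, the dictionary `anchor_index`, and the example URLs are hypothetical and do not come from the presentation.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collect (target URL, anchor text) pairs from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_href = None
        self.text_parts = []
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve the link against the page it appears on
                self.current_href = urljoin(self.base_url, href)
                self.text_parts = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.text_parts).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None


# The anchor text is stored under the link target, which here is an image
# that a text-based indexer could not otherwise describe.
anchor_index = defaultdict(list)
page = '<p>See the <a href="/logo.png">Stanford logo image</a>.</p>'
collector = AnchorCollector("http://example.edu/index.html")
collector.feed(page)
collector.close()
for target, text in collector.anchors:
    anchor_index[target].append(text)
```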

11
Related Work (1/2)
  • Information Retrieval
  • TREC -> well controlled, homogeneous collections
  • Google: 147 GB from our crawl of 24 million web
    pages
  • Vector Space Model is not enough
  • Query "Bill Clinton" -> "Bill Clinton Sucks"
  • - yet high quality information is available on
    this topic
  • Differences Between the Web and Well Controlled
    Collections
  • Web
  • - extreme variation internal to the
    documents
  • - no control over what people can put on
    the web
  • - metadata efforts have largely failed
  • - companies which specialize in
    manipulating search engines for profit

12
System Anatomy (1/3)
  • Google Architecture Overview

13
System Anatomy (2/3)
  • Google Architecture Overview
  • URL Server: sends lists of URLs to crawlers
  • Crawler: downloads web pages
  • Store Server: compresses and stores web pages
    into the repository
  • Indexer
  • - reads the repository and uncompresses the
    documents
  • - parses the documents
  • - creates forward index
  • - parses out the links
  • URL Resolver
  • - converts relative URLs to absolute URLs
    and then to docIDs
  • - generates a database of links
  • - puts the anchor text into the barrels
  • Sorter: generates the inverted index
  • Searcher: answers queries
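
The URL Resolver step (relative URL to absolute URL to docID, producing a database of links) can be illustrated with a short Python sketch. The in-memory dictionary `doc_ids`, the helper names, and the first-seen docID assignment are assumptions for illustration only; the presentation does not say how docIDs are assigned.

```python
from urllib.parse import urljoin

doc_ids = {}  # absolute URL -> docID (hypothetical in-memory stand-in)


def doc_id_for(url):
    """Return the docID for a URL, assigning the next free ID on first sight."""
    return doc_ids.setdefault(url, len(doc_ids))


def resolve_links(page_url, hrefs):
    """Turn the links parsed out of one page into (source docID, target docID) pairs."""
    src = doc_id_for(page_url)
    return [(src, doc_id_for(urljoin(page_url, href))) for href in hrefs]


links = resolve_links(
    "http://example.edu/a/index.html",
    ["b.html", "../c.html", "http://example.org/"],
)
```

A list of docID pairs like `links` is the kind of link database that a PageRank computation would take as input.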

14
System Anatomy (3/3)
  • Major Data Structures
  • Bigfiles
  • Repository
  • Document Index
  • Lexicon
  • Hit Lists
  • Forward Index
  • Inverted Index

15
Major Data Structures
  • Bigfiles
  • The operating systems do not provide enough for
    our needs
  • Virtual files spanning multiple file systems
    addressable by 64 bit integers
  • Repository
  • Contains the full HTML of every web page
    compressed using zlib
  • Documents are stored one after the other and are
    prefixed by docID, length, and URL
  • All the other data structures can be rebuilt
    from only the repository and a file which lists
    crawler errors
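
A repository of this kind (zlib-compressed pages stored one after the other, each prefixed by docID, length, and URL) might look like the sketch below. The exact header layout in `HEADER` and the function names are assumptions chosen for illustration, not the on-disk format used by the system.

```python
import struct
import zlib

# Assumed header layout: 64-bit docID, URL length, compressed page length
HEADER = struct.Struct("<QII")


def append_record(buf, doc_id, url, html):
    """Append one compressed page to the end of the repository buffer."""
    url_bytes = url.encode("utf-8")
    body = zlib.compress(html.encode("utf-8"))
    buf.extend(HEADER.pack(doc_id, len(url_bytes), len(body)))
    buf.extend(url_bytes)
    buf.extend(body)


def read_records(buf):
    """Scan the repository sequentially and yield (docID, URL, HTML)."""
    offset = 0
    while offset < len(buf):
        doc_id, url_len, body_len = HEADER.unpack_from(buf, offset)
        offset += HEADER.size
        url = bytes(buf[offset:offset + url_len]).decode("utf-8")
        offset += url_len
        html = zlib.decompress(bytes(buf[offset:offset + body_len])).decode("utf-8")
        offset += body_len
        yield doc_id, url, html


repo = bytearray()
append_record(repo, 1, "http://example.edu/", "<html>hello</html>")
append_record(repo, 2, "http://example.edu/b", "<html>world</html>")
records = list(read_records(repo))
```

Because each record carries its own lengths, the file can be read front to back without any other index, which is what makes rebuilding the other structures from the repository possible.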

16
Major Data Structures
  • Document Index
  • Fixed width ISAM index
  • Store document status, pointer to repository,
    document checksum
  • If the document has been crawled, a pointer into
    a variable-width docinfo file is stored
  • Lexicon
  • Fits in memory for a reasonable price
  • - 14 million words on a machine with 256 MB of
    main memory
  • A list of the words and a hash table of pointers

17
Major Data Structures
  • Hit Lists
  • plain hit, anchor hit, fancy hit
  • Encoding uses 2 bytes for each hit
  • Length of the hit list is stored before the hits
  • Forward Index
  • Stored in 64 barrels
  • If a document contains words that fall into a
    barrel, the docID is recorded into the barrel,
    followed by the list of wordIDs with their hit
    lists.
  • Each wordID is stored as a relative difference
    from the minimum wordID in the barrel (24 bits
    for the wordID, 8 for the hit list length).
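
The compact encodings on this slide can be illustrated with bit packing. In the original paper a plain hit holds a capitalization bit, 3 bits of font size, and 12 bits of word position; the ordering of those fields inside the 16 bits, the clamping of large positions to 4095, and all function names below are assumptions made for this sketch.

```python
def pack_plain_hit(capitalized, font_size, position):
    """Pack a plain hit into 16 bits: 1 capitalization bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size <= 6:
        raise ValueError("font size 7 is reserved to flag a fancy hit")
    position = min(position, 4095)  # clamp positions that do not fit in 12 bits
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit):
    """Return (capitalized, font_size, position) from a packed 16-bit plain hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


def pack_forward_entry(word_id, barrel_min_word_id, hit_count):
    """Pack a 24-bit wordID offset and an 8-bit hit list length into 32 bits."""
    delta = word_id - barrel_min_word_id
    if not 0 <= delta < (1 << 24) or not 0 <= hit_count < (1 << 8):
        raise ValueError("value does not fit the 24/8-bit layout")
    return (delta << 8) | hit_count


hit = pack_plain_hit(True, 3, 42)
entry = pack_forward_entry(word_id=1_000_005, barrel_min_word_id=1_000_000, hit_count=2)
```

Storing the wordID as an offset from the barrel minimum is what lets it fit in 24 bits, leaving the remaining 8 bits of the word for the hit list length.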

18
Major Data Structures
  • Inverted Index
  • The same barrels as forward index, except that
    they have been processed by the sorter.
  • For every wordID, a doclist of docIDs is
    generated, with the corresponding hit lists
  • Two sets of inverted barrels, one set for hit
    lists which include title or anchor hits and
    another set for all hit lists.
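
The step from forward barrels to inverted barrels amounts to regrouping the same entries by wordID. A minimal Python sketch under assumed in-memory structures follows (the real sorter works on barrels on disk; the function name `invert` and the sample data are hypothetical):

```python
from collections import defaultdict


def invert(forward_barrel):
    """Regroup (docID, [(wordID, hits), ...]) entries into wordID -> doclist."""
    inverted = defaultdict(list)
    for doc_id, words in forward_barrel:
        for word_id, hits in words:
            inverted[word_id].append((doc_id, hits))
    # Order the result by wordID, and each doclist by docID
    return {w: sorted(inverted[w], key=lambda e: e[0]) for w in sorted(inverted)}


# Forward barrel: docID followed by its wordIDs and hit lists
forward = [
    (7, [(2, [5, 9]), (1, [3])]),
    (4, [(2, [1])]),
]
inverted = invert(forward)
```

Looking up `inverted[2]` then yields every document containing word 2 together with its hit list, which is the access pattern the searcher needs.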