1
CS246
  • Scale of Search Engine

2
High-Level Architecture
  • Major modules for a search engine?
  • Crawler
    • Page download / refresh
  • Indexer
    • Index construction
    • PageRank computation
  • Query Processor
    • Page ranking
    • Query logging

3
General Architecture
4
Scale of Google
  • Number of pages indexed
    • Claimed to be 8B
  • Index refresh interval
    • Once per month → 1200 pages/sec
  • Number of queries per day
    • 200M in April 2003 → ~2000 queries/sec
  • Google runs on commodity Intel-Linux boxes

5
Other Statistics
  • Average page size: 15KB
  • Average query size: 40B
  • Average result size: 5KB
  • Average number of links per page: 10

6
Size of Dataset (1)
  • Total raw HTML data size
    • 8B × 15KB ≈ 120 TB!
  • Inverted index is roughly the same size as the raw
    corpus → another 120 TB for the index itself
  • With appropriate compression (3:1 compression
    ratio) → 80 TB of data residing on disk

7
Size of Dataset (2)
  • Number of disks necessary for one copy
    • (80 TB) / (100 GB per disk) = 800 disks
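
A quick sanity check of the last two slides' arithmetic (a minimal sketch in Python; every constant is taken straight from the slides):

```python
# Back-of-envelope sizing, using the constants from the slides.
PAGES = 8e9          # 8B pages indexed
PAGE_SIZE = 15e3     # 15KB average page size, in bytes
COMPRESSION = 3      # 3:1 compression ratio
DISK = 100e9         # 100GB per disk, in bytes

raw = PAGES * PAGE_SIZE                # raw HTML corpus
index = raw                            # inverted index ~ same size as corpus
on_disk = (raw + index) / COMPRESSION

print(f"raw corpus: {raw / 1e12:.0f} TB")      # ~120 TB
print(f"on disk:    {on_disk / 1e12:.0f} TB")  # ~80 TB
print(f"disks:      {on_disk / DISK:.0f}")     # ~800 disks
```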

8
Simplified Indexing Model
  • Model
    • Copy 40TB of data from disks in S to disks in D
      through the network
    • S: crawling machines
    • D: indexing machines
    • 40TB crawled data, 40TB index
    • Ignore actual processing
  • 4 × 100GB disks per machine
  • 100 machines in S and D each
  • 1GB RAM per machine


9
Data Flow
  • Disk → RAM → Network → RAM → Disk
  • No hardware is error-free
    • Disk (undetected) error rate: 1 per 10^13
    • Network error rate: 1 bit per 10^12
    • Memory soft error rate: 1 bit error per month
      per 1GB
  • Such errors typically go unnoticed for small data

10
Data Flow
  • Assuming a 100Mbit/s link between machines
  • 400GB per machine at a 10MB/s transfer rate → half
    a day just for data transfer
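
That half-day figure is easy to verify (a minimal sketch using the slide's numbers):

```python
# Each machine holds 4 x 100GB = 400GB and ships it at ~10MB/s,
# roughly the usable rate of a 100Mbit/s link.
data_per_machine = 400e9            # bytes
rate = 10e6                         # bytes/sec
hours = data_per_machine / rate / 3600
print(f"{hours:.1f} hours")         # ~11 hours, i.e. about half a day
```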

11
Errors from Disk
  • Undetected disk error rate: 1 per 10^13
  • 4×10^13 bytes read in total, 4×10^13 bytes written
    in total → 8 byte errors from disk reads/writes

12
Errors from Memory
  • 1 bit error per month per 1GB
  • 200 machines with 1GB each → 200 bit errors per
    month / 30 days ≈ 6 bit errors per day → ~6 byte
    errors per day from memory corruption

13
Errors from Network
  • 1 bit error per 10^12 bits
  • 4 × 8 × 10^13 bits transferred → 320 bit errors
    scattered across the stream → up to 320 byte errors
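
Putting the three error sources together (a minimal sketch; every rate comes from the preceding slides):

```python
# Expected byte errors while copying the 40TB dataset once.
BYTES = 4e13                        # 40TB moved through the pipeline

disk_errors = (BYTES / 1e13) * 2    # read + write at 1 error per 10^13 -> 8
net_errors = BYTES * 8 / 1e12       # 1 bit error per 10^12 bits -> 320
mem_errors = 200 * 1 / 30           # 200 machines x 1 bit err/month -> ~6/day

print(disk_errors, net_errors, int(mem_errors))   # 8.0 320.0 6
```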

14
Data Size and Errors (1)
  • During index construction/copy, something always
    goes wrong
    • 320 byte errors from the network, 8 byte errors
      from disk, 6 byte errors from memory
  • Very difficult to trace and debug
    • Particularly the disk and memory errors
    • No OS/application assumes such errors yet
    • They are pure hardware errors, but it is very
      difficult to differentiate a hardware error from
      a software bug
    • Software bugs may also cause similar errors

15
Data Size and Errors (2)
  • Very difficult to trace and debug
    • Data corruption in the middle of, say, sorting
      completely screws up the sort
  • Need a data-verification step after every
    operation (see the sketch below)
  • Algorithms and data structures must be resilient
    to data corruption
    • Checkpoints, etc.
  • ECC RAM is a must
    • Can detect most 1-bit errors
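
One common shape for that verification step is to checksum data before it leaves a machine and re-check it after every hop. A minimal sketch (the function names here are illustrative, not from the slides):

```python
import hashlib

def checksum(data: bytes) -> str:
    # A strong hash catches the kind of single-bit flips estimated above.
    return hashlib.sha256(data).hexdigest()

def verified_transfer(data: bytes, send) -> bytes:
    # 'send' stands in for any disk/network hop; illustrative only.
    expected = checksum(data)
    received = send(data)
    if checksum(received) != expected:
        raise IOError("corruption detected -- retry the transfer")
    return received
```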

16
Data Size and Reliability
  • Disk mean time to failure: 3 years → (3 × 365
    days) / 800 disks ≈ 1.4 days → roughly one disk
    failure every day!
  • Remember, this is just for one copy
  • Data organization should be very resilient to
    disk failure
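
The failure-rate arithmetic, assuming failures are spread uniformly over each disk's 3-year lifetime (a minimal sketch):

```python
MTTF_DAYS = 3 * 365     # per-disk mean time to failure
DISKS = 800             # disks holding one copy of the data

print(f"one failure every {MTTF_DAYS / DISKS:.1f} days")   # ~1.4 days
```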

17
Data Size and Crawling
  • Efficient crawling is very important
    • 1 page/sec per machine → 1200 machines just for
      crawling
    • Parallelization through threads/event queues is
      necessary
    • Complex crawling algorithms -- no, no!
  • A well-optimized crawler
    • 100 pages/sec (10 ms/page)
    • 12 machines for crawling
  • Bandwidth consumption
    • 1200 × 15KB × 8 bits ≈ 150Mbps
    • One dedicated OC3 line (155Mbps) for crawling:
      ~$400,000 per year
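
Checking the crawler sizing and bandwidth numbers (a minimal sketch; constants from the slides):

```python
CRAWL_RATE = 1200     # pages/sec needed for a monthly refresh
PER_MACHINE = 100     # pages/sec for a well-optimized crawler
PAGE_SIZE = 15e3      # bytes per page

print(CRAWL_RATE / PER_MACHINE, "machines")       # 12 machines
print(CRAWL_RATE * PAGE_SIZE * 8 / 1e6, "Mbps")   # 144 Mbps (~one OC3 line)
```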

18
Data Size and Indexing
  • Efficient indexing is very important
    • 1200 pages/sec
  • Indexing steps
    • Load pages, extract words: network/disk intensive
    • Sort words, postings: CPU intensive
    • Write sorted postings: disk intensive
  • Pipeline the indexing steps (see the sketch below)

[Diagram: indexing stages P1, P2, P3 running as a pipeline]
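
A minimal sketch of that pipeline: one thread per stage, connected by bounded queues so no stage runs far ahead of the others. The stage functions are toy stand-ins, not the real indexer:

```python
import queue
import threading

def extract_words(page: str) -> list[str]:
    return page.split()              # toy stand-in for real parsing

def write_postings(postings: list[str]) -> None:
    pass                             # toy stand-in for the disk write

load_q: queue.Queue = queue.Queue(maxsize=100)   # P1 -> P2 buffer
sort_q: queue.Queue = queue.Queue(maxsize=100)   # P2 -> P3 buffer

def loader(pages):                   # P1: load pages, extract words (I/O bound)
    for page in pages:
        load_q.put(extract_words(page))
    load_q.put(None)                 # sentinel: end of stream

def sorter():                        # P2: sort words/postings (CPU bound)
    while (words := load_q.get()) is not None:
        sort_q.put(sorted(words))
    sort_q.put(None)

def writer():                        # P3: write sorted postings (disk bound)
    while (postings := sort_q.get()) is not None:
        write_postings(postings)

pages = ["the quick brown fox", "jumps over the lazy dog"]
threads = [threading.Thread(target=loader, args=(pages,)),
           threading.Thread(target=sorter),
           threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```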
19
Data Size and Query Processing
  • Index size: 40TB → 400 disks
    • Typically fewer than 5 disks per machine
    • Potentially a 100-machine cluster to answer a query
    • If one machine goes down, the cluster goes down
  • A multi-tier index structure can be helpful
    • Tier 1: popular (high-PageRank) page index
    • Tier 2: less popular page index
    • Most queries can be answered by the tier-1 cluster
      (with fewer machines)
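
The tiered lookup the slide describes, as a minimal sketch (the tier objects and their search method are assumptions for illustration):

```python
def search(query: str, tier1, tier2, min_results: int = 10) -> list:
    # Try the small tier-1 cluster (popular pages) first.
    results = tier1.search(query)
    if len(results) >= min_results:
        return results
    # Fall back to the larger tier-2 index only when tier 1 is not enough.
    return results + tier2.search(query)
```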

20
Implication of Query Load
  • 2000 queries/sec
  • Rule of thumb: 1 query/sec per CPU
    • Depends on the number of disks, memory size, etc.
  • → 2000 machines just to answer queries
  • 5KB per answer page
    • 2000 × 5KB × 8 bits = 80 Mbps
    • Half a dedicated OC3 line (155Mbps): ~$300,000
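
The same two estimates in code (a minimal sketch; constants from the slides):

```python
QPS = 2000          # queries/sec
PER_CPU = 1         # rule of thumb: 1 query/sec per CPU
ANSWER = 5e3        # bytes per result page

print(QPS / PER_CPU, "machines")        # 2000 machines
print(QPS * ANSWER * 8 / 1e6, "Mbps")   # 80 Mbps (~half an OC3 line)
```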

21
Query Load and Replication
  • Index replication is necessary to handle the query
    load
  • Assuming a 1TB tier-1 index and a 100Mbit/s
    transfer rate
    • 8 bits × 1TB / 100Mbit/s = 80,000 sec
    • ≈ one day to refresh to a new index
  • Of course, we need to verify the transferred data
    before using it
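
Checking the refresh-time estimate (a minimal sketch):

```python
INDEX = 1e12        # 1TB tier-1 index, in bytes
LINK = 100e6        # 100Mbit/s link, in bits/sec

seconds = INDEX * 8 / LINK
print(f"{seconds:.0f} s = {seconds / 3600:.0f} hours")   # 80,000 s, ~22 hours
```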

22
Hardware at Google
  • 10,000-machine Intel-Linux cluster
  • Assuming 99.9% uptime (8-hour downtime per year)
    • ~10 machines are always down
    • A nightmare for system administrators
  • Assuming 3-year hardware replacement
    • Set up, replace, and dump ~10 machines every day
  • Heterogeneity is unavoidable
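
Both administration numbers follow directly from the assumptions (a minimal sketch):

```python
CLUSTER = 10_000
UPTIME = 0.999             # 99.9% -> ~8.8 hours of downtime per machine-year
LIFETIME_DAYS = 3 * 365    # 3-year hardware replacement cycle

print(f"{CLUSTER * (1 - UPTIME):.0f} machines down at any moment")    # ~10
print(f"{CLUSTER / LIFETIME_DAYS:.0f} machines to replace per day")   # ~9
```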