1
CS246
  • Scale of Search Engine

2
High-Level Architecture
  • Major modules for a search engine?
  • Crawler
    • Page download / refresh
  • Indexer
    • Index construction
    • PageRank computation
  • Query Processor
    • Page ranking
    • Query logging

3
General Architecture
4
Scale of Google
  • Number of pages indexed
    • Claimed to be 8B
  • Index refresh interval
    • Once per month → 1200 pages/sec
  • Number of queries per day
    • 200M in April 2003 → ~2000 queries/sec
  • Google runs on commodity Intel-Linux boxes

5
Other Statistics
  • Average page size: 15KB
  • Average query size: 40B
  • Average result size: 5KB
  • Average number of links per page: 10

6
Size of Dataset (1)
  • Total raw HTML data size
    • 8B × 15KB ≈ 120 TB!
  • Inverted index is roughly the same size as the raw
    corpus → another 120 TB for the index itself
  • With appropriate compression (3:1 compression
    ratio) → 80 TB of data residing on disk

7
Size of Dataset (2)
  • Number of disks necessary for one copy
    • (80 TB) / (100 GB per disk) = 800 disks
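
A quick sanity check of the last two slides' arithmetic (a minimal sketch in Python; every constant is taken straight from the slides):

```python
# Back-of-envelope sizing, using the constants from the slides.
PAGES = 8e9          # 8B pages indexed
PAGE_SIZE = 15e3     # 15KB average page size, in bytes
COMPRESSION = 3      # 3:1 compression ratio
DISK = 100e9         # 100GB per disk, in bytes

raw = PAGES * PAGE_SIZE                # raw HTML corpus
index = raw                            # inverted index ~ same size as corpus
on_disk = (raw + index) / COMPRESSION

print(f"raw corpus: {raw / 1e12:.0f} TB")      # ~120 TB
print(f"on disk:    {on_disk / 1e12:.0f} TB")  # ~80 TB
print(f"disks:      {on_disk / DISK:.0f}")     # ~800 disks
```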

8
Simplified Indexing Model
  • Model
    • Copy 40TB of data from disks in S to disks in D
      through the network
    • S: crawling machines
    • D: indexing machines
    • 40TB crawled data, 40TB index
    • Ignore actual processing
  • 4 × 100GB disks per machine
  • 100 machines in S and D each
  • 1GB RAM per machine


9
Data Flow
  • Disk → RAM → Network → RAM → Disk
  • No hardware is error-free
    • Disk (undetected) error rate: 1 per 10^13
    • Network error rate: 1 bit per 10^12
    • Memory soft error rate: 1 bit error per month
      per 1GB
  • Such errors typically go unnoticed for small data

10
Data Flow
  • Assuming a 100Mbit/s link between machines
  • 400GB per machine at a 10MB/s transfer rate → half
    a day just for data transfer
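
That half-day figure is easy to verify (a minimal sketch using the slide's numbers):

```python
# Each machine holds 4 x 100GB = 400GB and ships it at ~10MB/s,
# roughly the usable rate of a 100Mbit/s link.
data_per_machine = 400e9            # bytes
rate = 10e6                         # bytes/sec
hours = data_per_machine / rate / 3600
print(f"{hours:.1f} hours")         # ~11 hours, i.e. about half a day
```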

11
Errors from Disk
  • Undetected disk error rate: 1 per 10^13
  • 4×10^13 bytes read in total, 4×10^13 bytes written
    in total → 8 byte errors from disk reads/writes

12
Errors from Memory
  • 1 bit error per month per 1GB
  • 200 machines with 1GB each → 200 bit errors per
    month / 30 days ≈ 6 bit errors per day → ~6 byte
    errors per day from memory corruption

13
Errors from Network
  • 1 bit error per 10^12 bits
  • 4 × 8 × 10^13 bits transferred → 320 bit errors
    scattered across the stream → up to 320 byte errors
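
Putting the three error sources together (a minimal sketch; every rate comes from the preceding slides):

```python
# Expected byte errors while copying the 40TB dataset once.
BYTES = 4e13                        # 40TB moved through the pipeline

disk_errors = (BYTES / 1e13) * 2    # read + write at 1 error per 10^13 -> 8
net_errors = BYTES * 8 / 1e12       # 1 bit error per 10^12 bits -> 320
mem_errors = 200 * 1 / 30           # 200 machines x 1 bit err/month -> ~6/day

print(disk_errors, net_errors, int(mem_errors))   # 8.0 320.0 6
```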

14
Data Size and Errors (1)
  • During index construction/copy, something always
    goes wrong
    • 320 byte errors from the network, 8 byte errors
      from disk, 6 byte errors from memory
  • Very difficult to trace and debug
    • Particularly the disk and memory errors
    • No OS/application assumes such errors yet
    • They are pure hardware errors, but it is very
      difficult to differentiate a hardware error from
      a software bug
    • Software bugs may also cause similar errors

15
Data Size and Errors (2)
  • Very difficult to trace and debug
    • Data corruption in the middle of, say, sorting
      completely screws up the sort
  • Need a data-verification step after every
    operation (see the sketch below)
  • Algorithms and data structures must be resilient
    to data corruption
    • Checkpoints, etc.
  • ECC RAM is a must
    • Can detect most 1-bit errors
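
One common shape for that verification step is to checksum data before it leaves a machine and re-check it after every hop. A minimal sketch (the function names here are illustrative, not from the slides):

```python
import hashlib

def checksum(data: bytes) -> str:
    # A strong hash catches the kind of single-bit flips estimated above.
    return hashlib.sha256(data).hexdigest()

def verified_transfer(data: bytes, send) -> bytes:
    # 'send' stands in for any disk/network hop; illustrative only.
    expected = checksum(data)
    received = send(data)
    if checksum(received) != expected:
        raise IOError("corruption detected -- retry the transfer")
    return received
```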

16
Data Size and Reliability
  • Disk mean time to failure: 3 years → (3 × 365
    days) / 800 disks ≈ 1.4 days → roughly one disk
    failure every day!
  • Remember, this is just for one copy
  • Data organization should be very resilient to
    disk failure
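
The failure-rate arithmetic, assuming failures are spread uniformly over each disk's 3-year lifetime (a minimal sketch):

```python
MTTF_DAYS = 3 * 365     # per-disk mean time to failure
DISKS = 800             # disks holding one copy of the data

print(f"one failure every {MTTF_DAYS / DISKS:.1f} days")   # ~1.4 days
```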

17
Data Size and Crawling
  • Efficient crawling is very important
    • 1 page/sec per machine → 1200 machines just for
      crawling
    • Parallelization through threads/event queues is
      necessary
    • Complex crawling algorithms -- no, no!
  • A well-optimized crawler
    • 100 pages/sec (10 ms/page)
    • 12 machines for crawling
  • Bandwidth consumption
    • 1200 × 15KB × 8 bits ≈ 150Mbps
    • One dedicated OC3 line (155Mbps) for crawling:
      ~$400,000 per year
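
Checking the crawler sizing and bandwidth numbers (a minimal sketch; constants from the slides):

```python
CRAWL_RATE = 1200     # pages/sec needed for a monthly refresh
PER_MACHINE = 100     # pages/sec for a well-optimized crawler
PAGE_SIZE = 15e3      # bytes per page

print(CRAWL_RATE / PER_MACHINE, "machines")       # 12 machines
print(CRAWL_RATE * PAGE_SIZE * 8 / 1e6, "Mbps")   # 144 Mbps (~one OC3 line)
```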

18
Data Size and Indexing
  • Efficient indexing is very important
    • 1200 pages/sec
  • Indexing steps
    • Load pages, extract words: network/disk intensive
    • Sort words, postings: CPU intensive
    • Write sorted postings: disk intensive
  • Pipeline the indexing steps (see the sketch below)

[Diagram: indexing stages P1, P2, P3 running as a pipeline]
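
A minimal sketch of that pipeline: one thread per stage, connected by bounded queues so no stage runs far ahead of the others. The stage functions are toy stand-ins, not the real indexer:

```python
import queue
import threading

def extract_words(page: str) -> list[str]:
    return page.split()              # toy stand-in for real parsing

def write_postings(postings: list[str]) -> None:
    pass                             # toy stand-in for the disk write

load_q: queue.Queue = queue.Queue(maxsize=100)   # P1 -> P2 buffer
sort_q: queue.Queue = queue.Queue(maxsize=100)   # P2 -> P3 buffer

def loader(pages):                   # P1: load pages, extract words (I/O bound)
    for page in pages:
        load_q.put(extract_words(page))
    load_q.put(None)                 # sentinel: end of stream

def sorter():                        # P2: sort words/postings (CPU bound)
    while (words := load_q.get()) is not None:
        sort_q.put(sorted(words))
    sort_q.put(None)

def writer():                        # P3: write sorted postings (disk bound)
    while (postings := sort_q.get()) is not None:
        write_postings(postings)

pages = ["the quick brown fox", "jumps over the lazy dog"]
threads = [threading.Thread(target=loader, args=(pages,)),
           threading.Thread(target=sorter),
           threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```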
19
Data Size and Query Processing
  • Index size: 40TB → 400 disks
    • Typically fewer than 5 disks per machine
    • Potentially a 100-machine cluster to answer a query
    • If one machine goes down, the cluster goes down
  • A multi-tier index structure can be helpful
    • Tier 1: popular (high-PageRank) page index
    • Tier 2: less popular page index
    • Most queries can be answered by the tier-1 cluster
      (with fewer machines)
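
The tiered lookup the slide describes, as a minimal sketch (the tier objects and their search method are assumptions for illustration):

```python
def search(query: str, tier1, tier2, min_results: int = 10) -> list:
    # Try the small tier-1 cluster (popular pages) first.
    results = tier1.search(query)
    if len(results) >= min_results:
        return results
    # Fall back to the larger tier-2 index only when tier 1 is not enough.
    return results + tier2.search(query)
```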

20
Implication of Query Load
  • 2000 queries/sec
  • Rule of thumb: 1 query/sec per CPU
    • Depends on the number of disks, memory size, etc.
  • → 2000 machines just to answer queries
  • 5KB per answer page
    • 2000 × 5KB × 8 bits = 80 Mbps
    • Half a dedicated OC3 line (155Mbps): ~$300,000
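
The same two estimates in code (a minimal sketch; constants from the slides):

```python
QPS = 2000          # queries/sec
PER_CPU = 1         # rule of thumb: 1 query/sec per CPU
ANSWER = 5e3        # bytes per result page

print(QPS / PER_CPU, "machines")        # 2000 machines
print(QPS * ANSWER * 8 / 1e6, "Mbps")   # 80 Mbps (~half an OC3 line)
```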

21
Query Load and Replication
  • Index replication is necessary to handle the query
    load
  • Assuming a 1TB tier-1 index and a 100Mbit/s
    transfer rate
    • 8 bits × 1TB / 100Mbit/s = 80,000 sec
    • ≈ one day to refresh to a new index
  • Of course, we need to verify the transferred data
    before using it
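
Checking the refresh-time estimate (a minimal sketch):

```python
INDEX = 1e12        # 1TB tier-1 index, in bytes
LINK = 100e6        # 100Mbit/s link, in bits/sec

seconds = INDEX * 8 / LINK
print(f"{seconds:.0f} s = {seconds / 3600:.0f} hours")   # 80,000 s, ~22 hours
```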

22
Hardware at Google
  • 10,000-machine Intel-Linux cluster
  • Assuming 99.9% uptime (8-hour downtime per year)
    • ~10 machines are always down
    • A nightmare for system administrators
  • Assuming 3-year hardware replacement
    • Set up, replace, and dump ~10 machines every day
  • Heterogeneity is unavoidable
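
Both administration numbers follow directly from the assumptions (a minimal sketch):

```python
CLUSTER = 10_000
UPTIME = 0.999             # 99.9% -> ~8.8 hours of downtime per machine-year
LIFETIME_DAYS = 3 * 365    # 3-year hardware replacement cycle

print(f"{CLUSTER * (1 - UPTIME):.0f} machines down at any moment")    # ~10
print(f"{CLUSTER / LIFETIME_DAYS:.0f} machines to replace per day")   # ~9
```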