Distributed Web Crawling (a survey by Dustin Boswell)

About This Presentation
Description:

For each newUrl not in UrlsDone: UrlsTodo.insert( newUrl ) ... Previous Web Crawlers. 4 machines. 891 million. 600 pages/second. 4 machines. 24 million pages ...

Provided by: dustyb
Transcript and Presenter's Notes

1
Distributed Web Crawling (a survey by Dustin Boswell)
2
Basic Crawling Algorithm
UrlsTodo = { yahoo.com/index.html }
Repeat:
    url = UrlsTodo.getNext()
    html = Download( url )
    UrlsDone.insert( url )
    newUrls = parseForLinks( html )
    For each newUrl not in UrlsDone:
        UrlsTodo.insert( newUrl )
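The loop above can be sketched in Python roughly as follows. This is a minimal single-machine illustration, not the presenter's implementation: the names `crawl`, `parse_for_links`, and `LinkParser` are hypothetical, the `download` function is passed in by the caller, and the `max_pages` limit is an added assumption. One deliberate difference from the slide: the sketch remembers every URL it has queued, not only those it has downloaded, so a URL found on two pages is not queued twice.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def parse_for_links(base_url, html):
    parser = LinkParser()
    parser.feed(html)
    # Resolve relative links against the page they were found on.
    return [urljoin(base_url, link) for link in parser.links]


def crawl(seed_url, download, max_pages=1000):
    """Breadth-first crawl; `download` maps a URL to its HTML text."""
    urls_todo = deque([seed_url])
    urls_seen = {seed_url}  # every URL ever queued (assumption, see lead-in)
    urls_done = []          # URLs in the order they were downloaded
    while urls_todo and len(urls_done) < max_pages:
        url = urls_todo.popleft()
        html = download(url)
        urls_done.append(url)
        for new_url in parse_for_links(url, html):
            if new_url not in urls_seen:
                urls_seen.add(new_url)
                urls_todo.append(new_url)
    return urls_done


# Usage with a tiny in-memory "web" standing in for real downloads.
FAKE_WEB = {
    "http://a.test/": '<a href="/b">b</a><a href="http://c.test/">c</a>',
    "http://a.test/b": '<a href="/">home</a>',
    "http://c.test/": '<a href="http://a.test/b">b</a>',
}
crawl_order = crawl("http://a.test/", lambda url: FAKE_WEB.get(url, ""))
```

Passing `download` in as a parameter keeps the sketch testable without network access; a real crawler would supply an HTTP client there.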
3
Statistics to Keep in Mind
Documents on the web: 3 billion (by Google's count)
Avg. HTML size: 15 KB
Avg. URL length: 50 characters
Links per page: 10
External links per page: 2
Downloading the entire web in a year requires 95 urls / second!
4
Statistics to Keep in Mind
3 billion × 15 KB = 45 terabytes of HTML
3 billion × 50 chars = 150 gigabytes of URLs!!
=> multiple machines required
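The figures on this slide can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 KB = 1,000 bytes), one byte per URL character, and a 365-day year; the constant names are made up for the example.

```python
DOCS_ON_WEB = 3_000_000_000      # 3 billion documents
AVG_HTML_BYTES = 15_000          # 15 KB per page, decimal units assumed
AVG_URL_CHARS = 50               # one byte per character assumed
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Download rate needed to fetch every document once in a year.
urls_per_second = DOCS_ON_WEB / SECONDS_PER_YEAR

# Total storage for the HTML and for the URL list alone.
html_terabytes = DOCS_ON_WEB * AVG_HTML_BYTES / 1e12
url_gigabytes = DOCS_ON_WEB * AVG_URL_CHARS / 1e9

print(f"{urls_per_second:.0f} urls/second")
print(f"{html_terabytes:.0f} TB of HTML, {url_gigabytes:.0f} GB of URLs")
```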
5
Distributing the Workload
[Diagram: the Internet, Machine 0, Machine 1, ..., Machine N-1, and a LAN]
  • Each machine is assigned a fixed subset of the url-space

6
Distributing the Workload
  • machine = hash( url's domain name ) % N
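A minimal sketch of this assignment rule, under some assumptions: the slide does not name a hash function, so SHA-256 is used here only because it gives the same answer on every machine (Python's built-in `hash()` is randomized per process for strings). The helper names `domain_of`, `machine_for`, and `group_by_machine` are hypothetical.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url):
    # urlsplit only recognizes the host when a scheme or leading "//" is
    # present, so bare URLs like "cnn.com/sports" get a "//" prefix first.
    if "://" not in url and not url.startswith("//"):
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def machine_for(url, num_machines):
    """Index of the machine responsible for this URL's domain."""
    digest = hashlib.sha256(domain_of(url).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_machines


def group_by_machine(urls, num_machines):
    """Split a batch of URLs into one list per owning machine."""
    groups = defaultdict(list)
    for url in urls:
        groups[machine_for(url, num_machines)].append(url)
    return dict(groups)


EXAMPLE_URLS = [
    "cnn.com/sports", "cnn.com/weather", "cbs.com/csi_miami",
    "bbc.com/us", "bbc.com/uk", "bravo.com/queer_eye",
]
assignment = group_by_machine(EXAMPLE_URLS, 3)
```

Because only the domain is hashed, every URL on one site lands on the same machine, which is what lets each machine handle DNS lookups and request pacing for its own sites without coordinating with the others.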

7
Distributing the Workload
Example URLs: cnn.com/sports, cnn.com/weather, cbs.com/csi_miami, bbc.com/us, bbc.com/uk, bravo.com/queer_eye

8
Distributing the Workload
  • Communication: a couple of urls per page (very small)
  • DNS cache per machine
  • Maintain politeness: don't want to DoS attack someone!
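The per-machine DNS cache and the politeness rule could look roughly like the sketch below. Both classes are hypothetical illustrations: the slide does not specify a delay, so the one-second default is an assumption, and the cache ignores DNS record expiry (TTL), which a real crawler would need to respect. The clock and the resolver are injected so the logic can be exercised without waiting or touching the network.

```python
import time


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds=1.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds  # assumed value, not from the slide
        self.clock = clock
        self.next_allowed = {}

    def wait_time(self, host):
        """Seconds to wait before this host may be contacted again."""
        if host not in self.next_allowed:
            return 0.0
        return max(0.0, self.next_allowed[host] - self.clock())

    def record_request(self, host):
        """Call right after sending a request to the host."""
        self.next_allowed[host] = self.clock() + self.min_delay


class DnsCache:
    """Remembers resolved addresses so each host is looked up only once."""

    def __init__(self, resolve):
        self.resolve = resolve  # any function mapping hostname -> address
        self.cache = {}

    def lookup(self, host):
        if host not in self.cache:
            self.cache[host] = self.resolve(host)
        return self.cache[host]
```

In a real crawler the resolver could be `socket.gethostbyname`, and the fetch loop would sleep for `wait_time(host)` (or pick a URL from a different host) before each request.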

9
Software Hazards
  • Slow/Unresponsive DNS Servers
  • Slow/Unresponsive HTTP Servers

=> a parallel / asynchronous interface is desired
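One way to get such an interface in Python is `asyncio`, sketched below. This is an illustration rather than the approach of any crawler in the survey: `fetch_all` is a hypothetical helper, the `fetch` coroutine is supplied by the caller, and the timeout and concurrency limits are assumed values. The point is that one slow DNS or HTTP server only costs its own timeout, while other downloads proceed.

```python
import asyncio


async def fetch_all(urls, fetch, timeout_seconds=5.0, max_in_flight=100):
    """Fetch many URLs concurrently; slow ones yield None instead of blocking."""
    semaphore = asyncio.Semaphore(max_in_flight)  # cap on concurrent requests

    async def fetch_one(url):
        async with semaphore:
            try:
                body = await asyncio.wait_for(fetch(url), timeout_seconds)
                return url, body
            except asyncio.TimeoutError:
                return url, None  # give up on unresponsive servers

    pairs = await asyncio.gather(*(fetch_one(url) for url in urls))
    return dict(pairs)


# Usage with a fake fetcher: hosts containing "slow" never answer in time.
async def fake_fetch(url):
    await asyncio.sleep(1.0 if "slow" in url else 0.0)
    return "<html>" + url + "</html>"


results = asyncio.run(
    fetch_all(["fast.test/a", "slow.test/b"], fake_fetch, timeout_seconds=0.05)
)
```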
10
Software Hazards
  • Large or infinite-sized pages
  • Infinite links (domain.com/time100, 101, 102, ...)
  • Broken HTML
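Two of these hazards have simple guards, sketched below under stated assumptions. `read_capped` stops reading a response body after a fixed number of bytes, and `HostBudget` refuses to queue more than a fixed number of pages per host, which bounds the damage from a site that generates new links forever. Both names and both default limits are hypothetical; the slides do not say which guards any particular crawler uses.

```python
import io
from collections import Counter


def read_capped(stream, max_bytes=1_000_000, chunk_size=8192):
    """Read at most max_bytes from a binary file-like object, then stop."""
    chunks = []
    remaining = max_bytes
    while remaining > 0:
        chunk = stream.read(min(chunk_size, remaining))
        if not chunk:  # end of stream
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class HostBudget:
    """Caps how many pages may be queued for any single host."""

    def __init__(self, max_pages_per_host=10_000):
        self.max_pages = max_pages_per_host
        self.counts = Counter()

    def allow(self, host):
        """Return True and count the page, or False once the cap is hit."""
        if self.counts[host] >= self.max_pages:
            return False
        self.counts[host] += 1
        return True


# Usage with an in-memory stream standing in for an HTTP response body.
body = read_capped(io.BytesIO(b"x" * 5000), max_bytes=1024)
budget = HostBudget(max_pages_per_host=2)
decisions = [budget.allow("domain.com") for _ in range(3)]
```

A per-host cap is a blunt tool: it also truncates large legitimate sites, so the limit is a trade-off rather than a fix.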

11
Previous Web Crawlers
12
Questions?