1
Web crawlers
  • cs430 lecture
  • 02/22/01
  • Kamen Yotov

2
What is a web crawler?
  • Definition (crawler = spider): self-sufficient
    programs that index any site you point them at.
  • Useful for indexing
  • websites distributed among multiple servers
  • websites related to your own!

3
Types of web crawlers
  • Server-side (business oriented)
  • Technology behind Google, AltaVista
  • Scalable, reliable, available
  • Resource hungry
  • Client-side (customer oriented)
  • Examples: Teleport Pro, WebSnake
  • Much smaller resource requirements
  • Need guidance to proceed

4
Simple web crawler algorithm
  • Same simple algorithm for both types!
  • Let S be the set of pages we want to index
  • Initially, let S be the singleton set {p}
  • Take an element p of S
  • Parse the page p and retrieve the set of pages L
    it links to
  • Substitute S ← (S ∪ L) − {p}
  • Repeat as many times as necessary
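The loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production crawler: the toy link graph and the `get_links` callback stand in for real fetching and parsing, and a `visited` set is added so already-indexed pages are not taken from S twice.

```python
from collections import deque

def crawl(start, get_links):
    """Simple crawler loop: S starts as {start}; repeatedly take a page p
    from S, extract its links L, and substitute S <- (S u L) - {p}.
    `get_links` is a stand-in for real fetch-and-parse."""
    frontier = deque([start])   # S, represented here as a queue
    indexed = set()             # pages already taken out of S
    while frontier:
        p = frontier.popleft()
        if p in indexed:        # skip pages indexed earlier
            continue
        indexed.add(p)
        for link in get_links(p):   # L: the pages that p links to
            if link not in indexed:
                frontier.append(link)
    return indexed

# Toy link graph standing in for real pages:
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
print(sorted(crawl("a", lambda p: graph.get(p, []))))  # ['a', 'b', 'c']
```

Note that "d" is never indexed: nothing reachable from the start page links to it, which is exactly why crawl coverage depends on the choice of seed pages.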

5
Simple, or not so much?
  • Representation of S?
  • Queue, stack, deque
  • Taking elements and computing S ← S ∪ L
  • FIFO, LIFO, or a combination
  • How deep do we go?
  • Not only finding, but indexing!
  • Links are not so easy to extract

6
FIFO queue → BFS
7
LIFO queue (stack) → DFS
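The diagrams from these two slides are not in the transcript, but the point they make is easy to show in code: the same crawler loop visits pages breadth-first when the frontier is a FIFO queue and depth-first when it is a LIFO stack. The two-level toy graph below is an assumption for illustration.

```python
from collections import deque

def traversal_order(start, graph, fifo=True):
    """Visit order of the crawler loop when the frontier is a FIFO
    queue (breadth-first) vs. a LIFO stack (depth-first).
    `graph` is a toy adjacency dict standing in for link extraction."""
    frontier = deque([start])
    visited, order = set(), []
    while frontier:
        p = frontier.popleft() if fifo else frontier.pop()
        if p in visited:
            continue
        visited.add(p)
        order.append(p)
        for link in graph.get(p, []):
            if link not in visited:
                frontier.append(link)
    return order

# Two-level toy site: the root links to a and b; a links to a1, b to b1.
graph = {"/": ["a", "b"], "a": ["a1"], "b": ["b1"]}
print(traversal_order("/", graph, fifo=True))   # ['/', 'a', 'b', 'a1', 'b1']
print(traversal_order("/", graph, fifo=False))  # ['/', 'b', 'b1', 'a', 'a1']
```

BFS finishes each level of the site before going deeper, which is why breadth-first frontiers answer the "how deep do we go?" question gracefully; DFS dives down one branch first.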
8
What to search for?
  • Most crawlers search only for:
  • HTML (leaves and nodes in the tree)
  • ASCII clear text (only as leaves in the tree)
  • Some search for:
  • PDF
  • PostScript, ...
  • Important: indexing after search!

9
Links not so easy to extract
  • Relative vs. absolute URLs
  • CGI
  • Parameters
  • Dynamic generation of pages
  • Server-side scripting
  • Server-side image maps
  • Links buried in scripting code
  • Undecidable in the first place
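At least the relative-vs.-absolute problem has a standard solution: every extracted link must be resolved against the URL of the page it appears on before it can join the frontier. A sketch with Python's `urllib.parse.urljoin` (the URLs are illustrative, not from the slides):

```python
from urllib.parse import urljoin

# A crawler must normalize relative links against the page's own URL.
base = "http://www.example.com/docs/index.html"

print(urljoin(base, "intro.html"))         # http://www.example.com/docs/intro.html
print(urljoin(base, "../images/a.png"))    # http://www.example.com/images/a.png
print(urljoin(base, "/search?q=crawler"))  # http://www.example.com/search?q=crawler
print(urljoin(base, "http://other.org/"))  # absolute links pass through unchanged
```

This handles the easy cases; links assembled by client-side scripts remain invisible to this kind of static extraction, which is the "undecidable" point above.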

10
Performance issues
  • Commercial crawlers face problems!
  • Want to explore more than they can
  • Have limited computational resources
  • Need much storage space and bandwidth
  • Communication bandwidth issues
  • Connection to the backbone is not fast enough to
    crawl at the desired speed
  • Need to respect other sites, so as not to render
    them inoperable

11
An example (Google)
  • 85 people
  • 50 technical, 14 with PhDs in Computer Science
  • Central system
  • Handles 5.5 million searches per day
  • Growth rate is 20% per month
  • Contains 2,500 Linux machines
  • Has 80 terabytes of spinning disks
  • 30 new machines are installed daily
  • Cache holds 200 million pages
  • The aim is to crawl the web once per month!
  • (Larry Page, Google)

12
Typical crawling setting
  • Multi-machine, clustered environment
  • Multi-thread, parallel searching
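The multi-threaded side of this setting can be sketched with a thread pool: each round, all pages in the current frontier are fetched and parsed concurrently. This is a simplified single-machine sketch, not the clustered architecture itself; `get_links` and the toy graph stand in for real network-bound fetching.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_crawl(start, get_links, workers=4):
    """Level-by-level crawl: each round, the whole frontier is
    fetched/parsed in parallel by a thread pool."""
    visited = {start}
    frontier = [start]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier:
            # fetch and parse every frontier page concurrently
            results = pool.map(get_links, frontier)
            next_frontier = []
            for links in results:
                for link in links:
                    if link not in visited:
                        visited.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return visited

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(sorted(parallel_crawl("a", lambda p: graph.get(p, []))))  # ['a', 'b', 'c', 'd']
```

Threads pay off here because real crawling is dominated by network latency; a production multi-machine crawler additionally partitions the URL space across hosts.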

13
Netiquette
  • robots.txt
  • # robots.txt for http://www.example.com/
  • User-agent: *
  • Disallow: /cyberworld/map/
  • Disallow: /tmp/  # these will soon disappear
  • Disallow: /foo.html
  • # Cybermapper knows where to go.
  • User-agent: cybermapper
  • Disallow:
  • Site bandwidth overload
  • Restricted material
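A polite crawler checks these rules before every fetch. Python's standard library ships a parser for them; below, the slide's robots.txt is parsed from a string rather than fetched over HTTP, and "MyCrawler" is a made-up agent name matched by the `*` group.

```python
from urllib.robotparser import RobotFileParser

rules = """
User-agent: *
Disallow: /cyberworld/map/
Disallow: /tmp/
Disallow: /foo.html

User-agent: cybermapper
Disallow:
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# An ordinary crawler falls under "User-agent: *":
print(rp.can_fetch("MyCrawler", "http://www.example.com/tmp/x.html"))    # False
print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))    # True
# cybermapper has its own group with an empty Disallow, so it may go anywhere:
print(rp.can_fetch("cybermapper", "http://www.example.com/tmp/x.html"))  # True
```

An empty `Disallow:` line means "nothing is disallowed", which is how the example grants cybermapper full access while restricting everyone else.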

14
An area open for R&D!
  • Not much information on how real crawlers work
  • People who know how to do it just do it (rather
    than explain it)
  • Maybe yours will be the next best crawler!