1
Web Crawling and Automatic Discovery
  • Donna Bergmark
  • Cornell Information Systems
  • bergmark@cs.cornell.edu

2
Web Resource Discovery
  • Finding info on the Web
  • Surfing (random strategy; the goal is serendipity)
  • Searching (inverted indices; specific info)
  • Crawling (follow links; all the info)
  • Uses for crawling
  • Find stuff
  • Gather stuff
  • Check stuff

3
Definition
  • Spider = robot = crawler
  • Crawlers are computer programs that roam the
    Web with the goal of automating specific tasks
    related to the Web.

4
Crawlers and internet history
  • 1991: HTTP
  • 1992: 26 servers
  • 1993: 60 servers; self-registration; Archie
  • 1994 (early): first crawlers
  • 1996: search engines abound
  • 1998: focused crawling
  • 1999: web graph studies
  • 2002: use for digital libraries

5
So, why not write a robot?
  • You'd think a crawler would be easy to write
    (a minimal sketch follows this list)
  • Pick up the next URL
  • Connect to the server
  • GET the URL
  • When the page arrives, get its links
    (optionally do other stuff)
  • REPEAT
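
A minimal sketch of that loop in Java (java.net.http, Java 11+). The seed
URL, the page limit, and the regex-based link extraction are illustrative
assumptions, and all the error handling a real crawler needs is omitted;
the following slides show why that matters.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class NaiveCrawler {
        // crude link finder; real crawlers parse HTML properly
        private static final Pattern HREF = Pattern.compile("href=\"(http[^\"]+)\"");

        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            Deque<String> frontier = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            frontier.add("https://example.org/");           // seed URL (assumption)

            while (!frontier.isEmpty() && seen.size() < 100) {
                String url = frontier.poll();               // pick up the next URL
                if (!seen.add(url)) continue;               // already fetched
                HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                String page = client.send(req,              // connect and GET the URL
                        HttpResponse.BodyHandlers.ofString()).body();
                Matcher m = HREF.matcher(page);             // when the page arrives,
                while (m.find()) frontier.add(m.group(1));  // ...get its links, and REPEAT
            }
        }
    }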

6
The Central Crawler Function
[Diagram: per-server URL queues (Server 1, Server 2, Server 3) feed the
central fetch step: resolve the URL to an IP address via DNS, connect a
socket to the server, send the HTTP request, and wait for the response,
an HTML page. A raw-socket version of this step is sketched below.]
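
A sketch of that central function at the socket level, assuming a plain
HTTP server on port 80 (the host name is a placeholder): resolve the name
via DNS, connect, send a GET, and read the response.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintWriter;
    import java.net.InetAddress;
    import java.net.Socket;

    public class RawFetch {
        public static void main(String[] args) throws Exception {
            String host = "example.org";                   // placeholder server
            InetAddress ip = InetAddress.getByName(host);  // URL -> IP address via DNS
            try (Socket sock = new Socket(ip, 80);         // connect a socket to the server
                 PrintWriter out = new PrintWriter(sock.getOutputStream());
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(sock.getInputStream()))) {
                out.print("GET / HTTP/1.0\r\nHost: " + host + "\r\n\r\n"); // send HTTP request
                out.flush();
                String line;                               // wait for the response:
                while ((line = in.readLine()) != null)     // headers, then an HTML page
                    System.out.println(line);
            }
        }
    }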
7
Handling the HTTP Response
[Diagram: the FETCH step returns a response; non-HTML responses are
skipped, otherwise the links are extracted (to feed the frontier) and the
text is extracted (for processing).]

8
LINK Extraction
  • Finding the links is easy (sequential scan)
  • Need to clean them up and canonicalize them
    (sketched after this list)
  • Need to filter them
  • Need to check for robot exclusion
  • Need to check for duplicates
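
Cleanup and canonicalization might look like the following sketch, which
leans on java.net.URI: resolve relative links against the base page,
lowercase the host, drop the fragment, and strip the default port. The
exact rule set here is an assumption; real crawlers apply many more
normalizations.

    import java.net.URI;

    public class Canonicalize {
        public static String canonical(String base, String href) {
            URI u = URI.create(base).resolve(href).normalize(); // resolve ../ and relatives
            String host = u.getHost().toLowerCase();            // hosts are case-insensitive
            int port = u.getPort();
            if (port == 80 && "http".equals(u.getScheme())) port = -1; // strip default port
            String path = u.getPath() == null || u.getPath().isEmpty() ? "/" : u.getPath();
            return u.getScheme() + "://" + host
                    + (port == -1 ? "" : ":" + port)
                    + path
                    + (u.getQuery() == null ? "" : "?" + u.getQuery()); // fragment dropped
        }

        public static void main(String[] args) {
            System.out.println(canonical("http://Example.org:80/a/b.html", "../c.html#top"));
            // prints http://example.org/c.html
        }
    }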

9
Update the Frontier
[Diagram: the FETCH and PROCESS steps emit extracted URLs (URL1, URL2,
URL3) into the FRONTIER, the queue of pages not yet visited.]
10
Crawler Issues
  • System Considerations
  • The URL itself
  • Politeness
  • Visit Order
  • Robot Traps
  • The hidden web

11
Standard for Robot Exclusion
  • Martijn Koster (1994)
  • http://any-server:80/robots.txt
  • Maintained by the webmaster
  • Forbid access to pages, directories
  • Commonly excluded: /cgi-bin/
  • Adherence is voluntary for the crawler
    (a minimal checker is sketched below)
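
A bare-bones sketch of voluntary compliance: fetch /robots.txt from the
server and refuse any path under a Disallow rule. Real parsers also honor
User-agent groups, Allow lines, and wildcards; applying every Disallow to
every agent here is a simplifying assumption.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.ArrayList;
    import java.util.List;

    public class RobotsCheck {
        public static boolean allowed(String host, String path) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest req = HttpRequest.newBuilder(
                    URI.create("http://" + host + "/robots.txt")).GET().build();
            String robots = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
            List<String> disallowed = new ArrayList<>();
            for (String line : robots.split("\n")) {        // collect Disallow prefixes
                line = line.trim();
                if (line.toLowerCase().startsWith("disallow:"))
                    disallowed.add(line.substring("disallow:".length()).trim());
            }
            for (String prefix : disallowed)                // refuse forbidden paths
                if (!prefix.isEmpty() && path.startsWith(prefix)) return false;
            return true;
        }

        public static void main(String[] args) throws Exception {
            System.out.println(allowed("example.org", "/cgi-bin/search")); // commonly excluded
        }
    }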

12
Visit Order
  • The frontier (disciplines sketched after this list)
  • Breadth-first: FIFO queue
  • Depth-first: LIFO queue
  • Best-first: priority queue
  • Random
  • Refresh rate
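
A sketch of how the frontier's data structure fixes the visit order,
using standard JDK collections; the URLs and scores are illustrative.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.PriorityQueue;

    public class Frontiers {
        record ScoredUrl(String url, double score) {}

        public static void main(String[] args) {
            Deque<String> fifo = new ArrayDeque<>();  // breadth-first: addLast, pollFirst
            Deque<String> lifo = new ArrayDeque<>();  // depth-first:  addFirst, pollFirst
            PriorityQueue<ScoredUrl> best =           // best-first: highest score first
                    new PriorityQueue<>((a, b) -> Double.compare(b.score(), a.score()));

            for (String u : new String[] {"http://a/", "http://b/"}) {
                fifo.addLast(u);
                lifo.addFirst(u);
            }
            best.add(new ScoredUrl("http://a/", 0.3));
            best.add(new ScoredUrl("http://b/", 0.9));

            System.out.println(fifo.pollFirst());     // http://a/ (oldest first)
            System.out.println(lifo.pollFirst());     // http://b/ (newest first)
            System.out.println(best.poll().url());    // http://b/ (highest score)
        }
    }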

13
Robot Traps
  • Cycles in the Web graph
  • Infinite links on a page
  • Traps set out by the Webmaster

14
The Hidden Web
  • Dynamic pages, increasing in number
  • Subscription pages
  • Username and password pages
  • Research in progress on how crawlers can get
    into the hidden web

15
MERCATOR
16
Mercator Features
  • One file configures a crawl
  • Written in Java
  • Can add your own code
  • Extend one or more of Mercator's base classes
  • Add totally new classes called by your own code
  • Industrial-strength crawler
  • Uses its own DNS resolver and java.net package

17
The Web is a BIG Graph
  • Diameter of the Web
  • Cannot crawl even the static part completely
  • New technology: the focused crawl

18
Crawling and Crawlers
  • The Web overlays the internet
  • A crawl overlays the Web
[Diagram: a crawl spreading outward across the Web graph from a seed.]
19
Focused Crawling
20
Focused Crawling
[Diagram: the same small Web graph rooted at R, crawled two ways; a
breadth-first crawl visits pages 1-7 in level order, while a focused
crawl follows only the on-topic branch.]
21
Focused Crawling
  • Recall the cartoon for a focused crawl
  • A simple way to do it is with two knobs

22
Focusing the Crawl
  • Threshold: a page is on-topic if its correlation to
    the closest centroid is above this value
  • Cutoff: follow links from pages whose distance from
    the closest on-topic ancestor is less than this value
    (see the sketch after this list)
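
A sketch of the two knobs on term-frequency vectors: cosine correlation
to the nearest centroid decides whether a page is on-topic (threshold),
and a hop counter since the last on-topic ancestor decides whether to
keep following links (cutoff). The vectors and tuning values are
illustrative assumptions.

    import java.util.Map;

    public class FocusKnobs {
        static final double THRESHOLD = 0.3; // illustrative tuning values
        static final int CUTOFF = 1;

        // cosine correlation between a page's term vector and a centroid
        static double correlation(Map<String, Double> page, Map<String, Double> centroid) {
            double dot = 0, np = 0, nc = 0;
            for (var e : page.entrySet()) {
                dot += e.getValue() * centroid.getOrDefault(e.getKey(), 0.0);
                np += e.getValue() * e.getValue();
            }
            for (double v : centroid.values()) nc += v * v;
            return dot / (Math.sqrt(np) * Math.sqrt(nc));
        }

        // follow links while on-topic, or while still within CUTOFF hops
        // of the closest on-topic ancestor
        static boolean followLinks(double corr, int distFromOnTopic) {
            return corr > THRESHOLD || distFromOnTopic < CUTOFF;
        }

        public static void main(String[] args) {
            Map<String, Double> page = Map.of("crawler", 2.0, "web", 1.0);
            Map<String, Double> centroid = Map.of("crawler", 1.0, "focused", 1.0);
            double c = correlation(page, centroid);
            System.out.printf("corr=%.2f followLinks=%b%n", c, followLinks(c, 0));
        }
    }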

23
Illustration
[Diagram: a crawl tree showing the two knobs in action; pages whose
correlation is above the threshold (Corr > threshold) are on-topic, and
branches are cut (X) once a page is more than the cutoff (here 1) from
its closest on-topic ancestor.]
24
[Figure: only the labels "Closest" and "Furthest" survive from the
original slide.]
25
Correlation vs. Crawl Length
26
Fall 2002 Student Project
[Diagram: student-project pipeline; a query and a collection description
are turned into centroids and a dictionary of term vectors (with
Chebyshev polynomials), which configure Mercator to crawl and emit
collection URLs and HTML.]
27
Conclusion
  • We covered crawling: history, technology, and
    deployment
  • Focused crawling with tunneling
  • We have a good experimental setup for exploring
    automatic collection synthesis

28
http://mercator.comm.nsdlib.org