Title: Web Crawling and Automatic Discovery
1. Web Crawling and Automatic Discovery
- Donna Bergmark
- Cornell Information Systems
- bergmark_at_cs.cornell.edu
2. Web Resource Discovery
- Finding info on the Web
- Surfing (random strategy; the goal is serendipity)
- Searching (inverted indices; specific info)
- Crawling (follow links; all the info)
- Uses for crawling
- Find stuff
- Gather stuff
- Check stuff
3. Definition
- Spider = robot = crawler
- Crawlers are computer programs that roam the
Web with the goal of automating specific tasks
related to the Web.
4. Crawlers and Internet history
- 1991: HTTP
- 1992: 26 servers
- 1993: 60 servers; self-registration; Archie
- 1994 (early): first crawlers
- 1996: search engines abound
- 1998: focused crawling
- 1999: Web graph studies
- 2002: use for digital libraries
5. So, why not write a robot?
- You'd think a crawler would be easy to write (a minimal sketch follows this list):
- Pick up the next URL
- Connect to the server
- GET the URL
- When the page arrives, get its links (optionally do other stuff)
- REPEAT
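The loop really is that short in outline. Here is a minimal sketch of it in Python; the seed URL, page limit, and timeout are illustrative assumptions, and politeness, robot exclusion, and URL canonicalization (covered on later slides) are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])     # URLs waiting to be fetched
    seen = {seed}                # duplicate check
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()                    # pick up the next URL
        try:
            with urlopen(url, timeout=10) as resp:  # connect and GET the URL
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                                # skip unreachable pages
        fetched += 1
        parser = LinkParser()
        parser.feed(html)                           # when the page arrives, get its links
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)           # REPEAT on the new URLs

if __name__ == "__main__":
    crawl("http://example.org/")
```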
6. The Central Crawler Function
- [Diagram: per-server URL queues (Server 1, Server 2, Server 3) feed the fetch step]
- URL -> IP address via DNS
- Connect a socket to the server; send an HTTP request
- Wait for the response: an HTML page
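A hedged sketch of that single fetch step at the socket level, assuming a placeholder host and path on port 80; a production crawler such as Mercator multiplexes many such connections and keeps the per-server queues shown in the diagram.

```python
import socket

host, path = "example.org", "/"                  # placeholder URL pieces
ip = socket.gethostbyname(host)                  # URL -> IP address via DNS
with socket.create_connection((ip, 80), timeout=10) as sock:
    request = (f"GET {path} HTTP/1.0\r\n"
               f"Host: {host}\r\n"
               f"Connection: close\r\n\r\n")
    sock.sendall(request.encode("ascii"))        # send the HTTP request
    response = b""
    while chunk := sock.recv(4096):              # wait for the response
        response += chunk
# print just the status line and headers of the response
print(response.split(b"\r\n\r\n", 1)[0].decode("ascii", errors="replace"))
```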
7. Handling the HTTP Response
- [Flowchart: FETCH returns a response; extract its text and extract its links]
8. Link Extraction
- Finding the links is easy (sequential scan)
- Need to clean them up and canonicalize them (see the sketch after this list)
- Need to filter them
- Need to check for robot exclusion
- Need to check for duplicates
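A sketch of the clean-up, canonicalization, filtering, and duplicate-check steps listed above; the specific rules (HTTP-only schemes, skipping /cgi-bin/, dropping the default port) are illustrative assumptions, and robot exclusion is handled separately on a later slide.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def canonicalize(base_url, raw_href):
    """Return an absolute, fragment-free URL, or None if it is filtered out."""
    absolute = urljoin(base_url, raw_href.strip())    # resolve relative links
    absolute, _frag = urldefrag(absolute)             # drop #fragments
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):         # filter non-HTTP schemes
        return None
    if "/cgi-bin/" in parts.path:                     # example filter rule
        return None
    host = (parts.hostname or "").lower()             # canonical host name
    if parts.port and parts.port != 80:               # keep only non-default ports
        host = f"{host}:{parts.port}"
    return parts._replace(netloc=host).geturl()

seen = set()
for href in ("../a.html#top", "HTTP://Example.ORG:80/b", "mailto:x@y"):
    url = canonicalize("http://example.org/dir/page.html", href)
    if url and url not in seen:                       # duplicate check
        seen.add(url)
        print(url)
```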
9. Update the Frontier
- [Diagram: the FETCH and PROCESS stages feed newly discovered URLs (URL1, URL2, URL3) into the FRONTIER]
10. Crawler Issues
- System Considerations
- The URL itself
- Politeness
- Visit Order
- Robot Traps
- The hidden web
11. Standard for Robot Exclusion
- Martijn Koster (1994)
- http://any-server:80/robots.txt
- Maintained by the webmaster
- Forbid access to pages, directories
- Commonly excluded: /cgi-bin/
- Adherence is voluntary for the crawler (a check is sketched below)
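Because adherence is voluntary, the crawler has to check the rules itself. A minimal sketch using Python's standard urllib.robotparser; the user-agent string and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.org/robots.txt")
rp.read()                                   # fetch and parse robots.txt

url = "http://example.org/cgi-bin/search"
if rp.can_fetch("MyCrawler/0.1", url):      # the webmaster's rules decide
    print("allowed to fetch", url)
else:
    print("excluded by robots.txt:", url)
```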
12. Visit Order
- The frontier (its data structure fixes the visit order; see the sketch after this list)
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: priority queue
- Random
- Refresh rate
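A small sketch of how the frontier's data structure determines the visit order: a FIFO queue yields breadth-first, a LIFO stack depth-first, and a priority queue best-first. The scoring function and URLs are stand-in assumptions.

```python
from collections import deque
import heapq, random

frontier_bfs = deque()                  # breadth-first: FIFO queue
frontier_bfs.append("urlA"); frontier_bfs.append("urlB")
next_bfs = frontier_bfs.popleft()       # oldest URL first

frontier_dfs = []                       # depth-first: LIFO stack
frontier_dfs.append("urlA"); frontier_dfs.append("urlB")
next_dfs = frontier_dfs.pop()           # newest URL first

def score(url):                         # hypothetical relevance estimate
    return random.random()

frontier_best = []                      # best-first: priority queue
heapq.heappush(frontier_best, (-score("urlA"), "urlA"))
heapq.heappush(frontier_best, (-score("urlB"), "urlB"))
_, next_best = heapq.heappop(frontier_best)   # highest-scoring URL first

print(next_bfs, next_dfs, next_best)
```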
13. Robot Traps
- Cycles in the Web graph
- Infinite links on a page
- Traps set out by the Webmaster
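Crawlers usually defend against the traps listed above with simple limits. The sketch below shows one hedged approach with arbitrary caps on crawl depth, URL length, and pages fetched per host; real crawlers tune these and add further heuristics.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH, MAX_URL_LEN, MAX_PER_HOST = 10, 256, 1000   # arbitrary limits
pages_per_host = Counter()

def looks_like_trap(url, depth):
    host = urlparse(url).hostname or ""
    if depth > MAX_DEPTH:                     # endless cycles or link chains
        return True
    if len(url) > MAX_URL_LEN:                # auto-generated infinite URLs
        return True
    if pages_per_host[host] >= MAX_PER_HOST:  # one server dominating the crawl
        return True
    pages_per_host[host] += 1                 # record the fetch for this host
    return False

print(looks_like_trap("http://example.org/a/b/c", depth=3))
```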
14. The Hidden Web
- Dynamic pages are increasing
- Subscription pages
- Username and password pages
- Research in progress on how crawlers can get
into the hidden web
15. MERCATOR
16. Mercator Features
- One file configures a crawl
- Written in Java
- Can add your own code
- Extend one or more of Mercator's base classes
- Add totally new classes called by your own
- Industrial-strength crawler
- uses its own DNS and java.net package
17. The Web is a BIG Graph
- Diameter of the Web
- Cannot crawl even the static part completely
- New technology: the focused crawl
18. Crawling and Crawlers
- The Web overlays the Internet
- A crawl overlays the Web
- [Diagram: a crawl spreading outward from a seed page]
19. Focused Crawling
20. Focused Crawling
- [Diagram: the same Web graph rooted at R, with node numbering (1-7) for a breadth-first crawl versus a focused crawl]
21. Focused Crawling
- Recall the cartoon for a focused crawl
- A simple way to do it is with 2 knobs
22. Focusing the Crawl
- Threshold: a page is on-topic if its correlation to the closest centroid is above this value
- Cutoff: follow links from pages whose distance from the closest on-topic ancestor is less than this value (both knobs are sketched below)
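A sketch of the two-knob rule under stated assumptions: the page and centroid term vectors, the threshold of 0.3, and the cutoff of 1 are all illustrative. Correlation is computed here as cosine similarity to the closest centroid, and the distance counter resets whenever a page is on-topic; larger cutoff values permit tunneling through some off-topic pages.

```python
import math

def cosine(u, v):
    """Cosine correlation between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

THRESHOLD = 0.3   # knob 1: minimum correlation to count as on-topic
CUTOFF = 1        # knob 2: max distance from the closest on-topic ancestor

def decide(page_vector, centroids, dist_from_on_topic_ancestor):
    corr = max(cosine(page_vector, c) for c in centroids)
    on_topic = corr > THRESHOLD
    # distance resets to 0 when on-topic, otherwise grows by one link
    dist = 0 if on_topic else dist_from_on_topic_ancestor + 1
    follow_links = dist < CUTOFF
    return on_topic, follow_links, dist

page = {"crawler": 0.7, "web": 0.5}                      # illustrative vectors
centroid = {"crawler": 0.6, "robot": 0.4, "web": 0.3}
print(decide(page, [centroid], dist_from_on_topic_ancestor=0))
```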
23. Illustration
- [Diagram: crawl tree with pages numbered 1-7; pages with corr > threshold are on-topic, and with cutoff = 1 the links marked X are not followed]
24. [Figure comparing "Closest" and "Furthest"]
25. Correlation vs. Crawl Length
26. Fall 2002 Student Project
- [Architecture diagram: components include a query, collection description, centroid, centroids and dictionary, term vectors, Chebyshev polynomials, Mercator, HTML pages, and collection URLs]
27. Conclusion
- We covered crawling history, technology, and deployment
- Focused crawling with tunneling
- We have a good experimental setup for exploring automatic collection synthesis
28. http://mercator.comm.nsdlib.org