Title: Web Crawling and Automatic Discovery
1. Web Crawling and Automatic Discovery
- Donna Bergmark
- Cornell Information Systems
- bergmark_at_cs.cornell.edu
2. Web Resource Discovery
- Finding info on the Web
- Surfing (random strategy; the goal is serendipity)
- Searching (inverted indices; specific info)
- Crawling (follow links; all the info)
- Uses for crawling
- Find stuff
- Gather stuff
- Check stuff
3. Definition
- Spider = robot = crawler
- Crawlers are computer programs that roam the
Web with the goal of automating specific tasks
related to the Web.
4. Crawlers and Internet history
- 1991: HTTP
- 1992: 26 servers
- 1993: 60 servers; self-registration; Archie
- 1994 (early): first crawlers
- 1996: search engines abound
- 1998: focused crawling
- 1999: Web graph studies
- 2002: use for digital libraries
5. So, why not write a robot?
- You'd think a crawler would be easy to write (a minimal sketch follows this list):
- Pick up the next URL
- Connect to the server
- GET the URL
- When the page arrives, get its links (optionally do other stuff)
- REPEAT
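The loop really is that short in outline. Here is a minimal sketch of it in Python; the seed URL, page limit, and timeout are illustrative assumptions, and politeness, robot exclusion, and URL canonicalization (covered on later slides) are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=10):
    frontier = deque([seed])     # URLs waiting to be fetched
    seen = {seed}                # duplicate check
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()                    # pick up the next URL
        try:
            with urlopen(url, timeout=10) as resp:  # connect and GET the URL
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                                # skip unreachable pages
        fetched += 1
        parser = LinkParser()
        parser.feed(html)                           # when the page arrives, get its links
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)           # REPEAT on the new URLs

if __name__ == "__main__":
    crawl("http://example.org/")
```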
6. The Central Crawler Function
- [Diagram: per-server URL queues (Server 1, Server 2, Server 3) feed the fetch step]
- URL -> IP address via DNS
- Connect a socket to the server; send an HTTP request
- Wait for the response: an HTML page
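A hedged sketch of that single fetch step at the socket level, assuming a placeholder host and path on port 80; a production crawler such as Mercator multiplexes many such connections and keeps the per-server queues shown in the diagram.

```python
import socket

host, path = "example.org", "/"                  # placeholder URL pieces
ip = socket.gethostbyname(host)                  # URL -> IP address via DNS
with socket.create_connection((ip, 80), timeout=10) as sock:
    request = (f"GET {path} HTTP/1.0\r\n"
               f"Host: {host}\r\n"
               f"Connection: close\r\n\r\n")
    sock.sendall(request.encode("ascii"))        # send the HTTP request
    response = b""
    while chunk := sock.recv(4096):              # wait for the response
        response += chunk
# print just the status line and headers of the response
print(response.split(b"\r\n\r\n", 1)[0].decode("ascii", errors="replace"))
```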
7. Handling the HTTP Response
- [Flowchart: FETCH returns a response; extract its text and extract its links]
8. Link Extraction
- Finding the links is easy (sequential scan)
- Need to clean them up and canonicalize them (see the sketch after this list)
- Need to filter them
- Need to check for robot exclusion
- Need to check for duplicates
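A sketch of the clean-up, canonicalization, filtering, and duplicate-check steps listed above; the specific rules (HTTP-only schemes, skipping /cgi-bin/, dropping the default port) are illustrative assumptions, and robot exclusion is handled separately on a later slide.

```python
from urllib.parse import urljoin, urldefrag, urlparse

def canonicalize(base_url, raw_href):
    """Return an absolute, fragment-free URL, or None if it is filtered out."""
    absolute = urljoin(base_url, raw_href.strip())    # resolve relative links
    absolute, _frag = urldefrag(absolute)             # drop #fragments
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):         # filter non-HTTP schemes
        return None
    if "/cgi-bin/" in parts.path:                     # example filter rule
        return None
    host = (parts.hostname or "").lower()             # canonical host name
    if parts.port and parts.port != 80:               # keep only non-default ports
        host = f"{host}:{parts.port}"
    return parts._replace(netloc=host).geturl()

seen = set()
for href in ("../a.html#top", "HTTP://Example.ORG:80/b", "mailto:x@y"):
    url = canonicalize("http://example.org/dir/page.html", href)
    if url and url not in seen:                       # duplicate check
        seen.add(url)
        print(url)
```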
9. Update the Frontier
- [Diagram: the FETCH and PROCESS stages feed newly discovered URLs (URL1, URL2, URL3) into the FRONTIER]
10. Crawler Issues
- System Considerations
- The URL itself
- Politeness
- Visit Order
- Robot Traps
- The hidden web
11. Standard for Robot Exclusion
- Martijn Koster (1994)
- http://any-server:80/robots.txt
- Maintained by the webmaster
- Forbid access to pages, directories
- Commonly excluded: /cgi-bin/
- Adherence is voluntary for the crawler (a check is sketched below)
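Because adherence is voluntary, the crawler has to check the rules itself. A minimal sketch using Python's standard urllib.robotparser; the user-agent string and URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("http://example.org/robots.txt")
rp.read()                                   # fetch and parse robots.txt

url = "http://example.org/cgi-bin/search"
if rp.can_fetch("MyCrawler/0.1", url):      # the webmaster's rules decide
    print("allowed to fetch", url)
else:
    print("excluded by robots.txt:", url)
```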
12. Visit Order
- The frontier (its data structure fixes the visit order; see the sketch after this list)
- Breadth-first: FIFO queue
- Depth-first: LIFO queue
- Best-first: priority queue
- Random
- Refresh rate
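A small sketch of how the frontier's data structure determines the visit order: a FIFO queue yields breadth-first, a LIFO stack depth-first, and a priority queue best-first. The scoring function and URLs are stand-in assumptions.

```python
from collections import deque
import heapq, random

frontier_bfs = deque()                  # breadth-first: FIFO queue
frontier_bfs.append("urlA"); frontier_bfs.append("urlB")
next_bfs = frontier_bfs.popleft()       # oldest URL first

frontier_dfs = []                       # depth-first: LIFO stack
frontier_dfs.append("urlA"); frontier_dfs.append("urlB")
next_dfs = frontier_dfs.pop()           # newest URL first

def score(url):                         # hypothetical relevance estimate
    return random.random()

frontier_best = []                      # best-first: priority queue
heapq.heappush(frontier_best, (-score("urlA"), "urlA"))
heapq.heappush(frontier_best, (-score("urlB"), "urlB"))
_, next_best = heapq.heappop(frontier_best)   # highest-scoring URL first

print(next_bfs, next_dfs, next_best)
```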
13. Robot Traps
- Cycles in the Web graph
- Infinite links on a page
- Traps set out by the Webmaster
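Crawlers usually defend against the traps listed above with simple limits. The sketch below shows one hedged approach with arbitrary caps on crawl depth, URL length, and pages fetched per host; real crawlers tune these and add further heuristics.

```python
from collections import Counter
from urllib.parse import urlparse

MAX_DEPTH, MAX_URL_LEN, MAX_PER_HOST = 10, 256, 1000   # arbitrary limits
pages_per_host = Counter()

def looks_like_trap(url, depth):
    host = urlparse(url).hostname or ""
    if depth > MAX_DEPTH:                     # endless cycles or link chains
        return True
    if len(url) > MAX_URL_LEN:                # auto-generated infinite URLs
        return True
    if pages_per_host[host] >= MAX_PER_HOST:  # one server dominating the crawl
        return True
    pages_per_host[host] += 1                 # record the fetch for this host
    return False

print(looks_like_trap("http://example.org/a/b/c", depth=3))
```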
14. The Hidden Web
- Dynamic pages are increasing
- Subscription pages
- Username and password pages
- Research in progress on how crawlers can get
into the hidden web
15. MERCATOR
16. Mercator Features
- One file configures a crawl
- Written in Java
- Can add your own code
- Extend one or more of Mercator's base classes
- Add totally new classes called by your own
- Industrial-strength crawler
- uses its own DNS and java.net package
17. The Web is a BIG Graph
- Diameter of the Web
- Cannot crawl even the static part completely
- New technology: the focused crawl
18. Crawling and Crawlers
- The Web overlays the Internet
- A crawl overlays the Web
- [Diagram: a crawl spreading outward from a seed page]
19. Focused Crawling
20. Focused Crawling
- [Diagram: the same Web graph rooted at R, with node numbering (1-7) for a breadth-first crawl versus a focused crawl]
21. Focused Crawling
- Recall the cartoon for a focused crawl
- A simple way to do it is with 2 knobs
22. Focusing the Crawl
- Threshold: a page is on-topic if its correlation to the closest centroid is above this value
- Cutoff: follow links from pages whose distance from the closest on-topic ancestor is less than this value (both knobs are sketched below)
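A sketch of the two-knob rule under stated assumptions: the page and centroid term vectors, the threshold of 0.3, and the cutoff of 1 are all illustrative. Correlation is computed here as cosine similarity to the closest centroid, and the distance counter resets whenever a page is on-topic; larger cutoff values permit tunneling through some off-topic pages.

```python
import math

def cosine(u, v):
    """Cosine correlation between two sparse term vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

THRESHOLD = 0.3   # knob 1: minimum correlation to count as on-topic
CUTOFF = 1        # knob 2: max distance from the closest on-topic ancestor

def decide(page_vector, centroids, dist_from_on_topic_ancestor):
    corr = max(cosine(page_vector, c) for c in centroids)
    on_topic = corr > THRESHOLD
    # distance resets to 0 when on-topic, otherwise grows by one link
    dist = 0 if on_topic else dist_from_on_topic_ancestor + 1
    follow_links = dist < CUTOFF
    return on_topic, follow_links, dist

page = {"crawler": 0.7, "web": 0.5}                      # illustrative vectors
centroid = {"crawler": 0.6, "robot": 0.4, "web": 0.3}
print(decide(page, [centroid], dist_from_on_topic_ancestor=0))
```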
23. Illustration
- [Diagram: crawl tree with pages numbered 1-7; pages with corr > threshold are on-topic, and with cutoff = 1 the links marked X are not followed]
24. [Figure comparing "Closest" and "Furthest"]
25. Correlation vs. Crawl Length
26. Fall 2002 Student Project
- [Architecture diagram: components include a query, collection description, centroid, centroids and dictionary, term vectors, Chebyshev polynomials, Mercator, HTML pages, and collection URLs]
27. Conclusion
- We covered crawling history, technology, and deployment
- Focused crawling with tunneling
- We have a good experimental setup for exploring automatic collection synthesis
28. http://mercator.comm.nsdlib.org