Title: Web Spiders
1 Web Spiders
- Dan Reeves
- Bill Walsh
- HDIW EECS 547
- 16 February 2000
2 What is a Web Spider?
- A program that uses HTTP to automatically
- download documents from a web server
- analyze documents retrieved from a web server
- send data back to a web server
3 Spider Usage
- Search engines
- Lycos analyzes 10,000,000 Web pages a day
- Comparison shopping
- ShopBot
- Data analysis
- bidding behavior at online auctions
- Automated Web interactions
- daily comics delivery
- stock trading agent
- Other (mirroring, HTML/link validation, etc.)
4 How Humans Typically Access The Web
- Web browser
- human friendly interface
- hides details of HTTP
- Web browser is just a program written in some language
- Whatever it does, you (a programmer) can do too!
5 What Components We Need to Use the Web
- socket connection
- HTTP
- page knowledge
6 Setting Up a Socket Connection
- Programmatically (C, Perl, Java, Lisp, etc.)
- Unix command prompt
- telnet address port_number
- address is web site address
- default port_number for most web sites: 80
- e.g., telnet www.netscape.com 80
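The telnet step can also be done programmatically. The following is a minimal Python sketch, not the presenters' code: the function names `build_get_request` and `fetch` and the User-Agent string are made up for illustration. It opens a TCP socket to port 80, sends a hand-written HTTP/1.0 request, and reads until the server closes the connection.

```python
import socket


def build_get_request(host: str, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.0 GET request; every header line ends in CRLF."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "User-Agent: example-spider/0.1",  # hypothetical name, for illustration only
        "",  # the two empty strings produce the blank line
        "",  # that terminates the header section
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host: str, port: int = 80, path: str = "/") -> bytes:
    """Connect, send the request, and read the raw response until EOF."""
    with socket.create_connection((host, port), timeout=10) as sock:
        sock.sendall(build_get_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling `fetch("www.example.com")` would return the status line, headers, and body as one byte string, which is what telnet prints to the terminal.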
7 HTTP
- A well-defined specification for message formats
- Orthogonal to
- TCP/IP
- HTML
- XML
- W3C: World Wide Web Consortium
- www.w3.org
8 Page Knowledge
- Markup language: HTML, XML, free text
- Data formatting
- regular expressions
- domain-specific conventions
- freeform text
- How to get the knowledge
- coded by humans
- learning
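As a rough illustration of human-coded page knowledge, the sketch below combines the two ideas from this slide: an HTML parser for the markup, and a regular expression for a domain-specific convention. The class and function names and the "dollar sign followed by a number" price convention are assumptions chosen for the example, not something taken from the slides.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Markup knowledge: collect the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list:
    parser = LinkCollector()
    parser.feed(html)
    return parser.links


# Domain-specific convention (assumed): prices look like $12.50 or $3.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_prices(text: str) -> list:
    """Data-format knowledge: pull dollar amounts out of text or markup."""
    return [float(amount) for amount in PRICE_RE.findall(text)]
```

A spider for comparison shopping might follow the collected links and apply a pattern like `PRICE_RE` to each page; a learned extractor would replace the hand-written pattern.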
9 telnet www.netscape.com 80
telnet www.netscape.com 80
Trying 207.200.75.204...
Connected to www-ld2.netscape.com.
Escape character is '^]'.
GET /index.html HTTP/1.0
User-Agent: An Evil Spider
Accept: image/gif, */*
Accept-Language: en, de
Purpose-of-Request: Denial of Service Attack

HTTP/1.1 200 OK
Server: Netscape-Enterprise/3.6
Date: Thu, 10 Feb 2000 21:22:39 GMT
Set-Cookie: UIDC=141.213.12.1860950217760031129; domain=.netscape.com; path=/; expires=31-Dec-2010 23:59:59 GMT
Content-type: text/html
Connection: close

<!-- Hide from old browsers
if (parseFloat(navigator.appVersion) ...) location.href = "http://home.netscape.com/computing/download/upgrade_index.html"
// Stop Hiding From Old Browsers -->
Netcenter
[about 40k snipped]
window.pupPup()
Connection closed by foreign host.
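A spider has to take a response like the one in this session apart before it can analyze the page. The sketch below shows one simple way to split raw response bytes into status, headers, and body. The function name `parse_response` is hypothetical, and the parser is deliberately naive: a header that appears more than once (such as Set-Cookie) keeps only its last value.

```python
def parse_response(raw: bytes):
    """Split a raw HTTP response into (status, reason, headers, body)."""
    # The first blank line (CRLF CRLF) separates the headers from the body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line, e.g. "HTTP/1.1 200 OK"
    parts = lines[0].split(" ", 2)
    status = int(parts[1])
    reason = parts[2] if len(parts) > 2 else ""

    headers = {}
    for line in lines[1:]:
        # Split on the first colon only; values such as dates contain colons.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return status, reason, headers, body
```

Header names are lowercased because HTTP treats them as case-insensitive, so `headers["content-type"]` works whether the server sent `Content-type` or `Content-Type`.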
10 Setting Up a Proxy in Netscape
11 Example of GET
GET / HTTP/1.0
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: www.netscape.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: UIDC=141.213.12.1570937196127933529; HITO_VISITSA151853E21368C1E00B21
12 Example of GET-Based Form
GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
Referer: http://www.netscape.com/
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: lookup.netscape.com
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: UIDC=141.213.12.1570937196127933529; HITO_VISITSA151853E21368C1E00B21; NSPOPmyn12
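In a GET-based form, the field values travel in the query string of the URL. A minimal sketch of building such a URL with Python's standard library follows; the helper name `build_form_url` is hypothetical, and the field names `search` and `st_symbol` in the usage note are an assumption about how the lookup request above is structured.

```python
from urllib.parse import urlencode, urlunsplit


def build_form_url(host: str, path: str, fields: dict) -> str:
    """Encode form fields as a query string and attach them to the URL."""
    # urlencode joins name=value pairs with "&" and escapes unsafe characters.
    query = urlencode(fields)
    return urlunsplit(("http", host, path, query, ""))
```

For example, `build_form_url("lookup.netscape.com", "/lookup/Lookup.tibco", {"search": "sunw", "st_symbol": "on"})` produces a URL whose path and query match the request line of the capture, so an ordinary GET fetches the form result.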
13 Example of POST-Based Form
POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
Proxy-Connection: Keep-Alive
User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
Host: www.eecs.umich.edu
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
Accept-Encoding: gzip
Accept-Language: de, en
Accept-Charset: iso-8859-1,*,utf-8
Content-type: application/x-www-form-urlencoded
Content-length: 234

recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs+547+student&email=dreeves%40umich.edu&body=%22Beware+of+bugs+in+the+above+code%3B+I+have+only+proved+it+correct%2C+not+tried+it.%22%0D%0A--Donald+Knuth%0D%0A
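A POST-based form sends the same kind of URL-encoded name/value pairs, but in the message body, with a Content-Length header stating the body's size in bytes. The sketch below assembles such a request by hand. The function name `build_post_request` is hypothetical, and only the headers needed for the request to be well formed are included.

```python
from urllib.parse import urlencode


def build_post_request(host: str, path: str, fields: dict) -> bytes:
    """Build a raw HTTP/1.0 POST request carrying URL-encoded form fields."""
    # Spaces become "+", and characters such as @ ; , " CR LF become %XX escapes.
    body = urlencode(fields).encode("ascii")
    head = "\r\n".join([
        f"POST {path} HTTP/1.0",
        f"Host: {host}",
        "Content-Type: application/x-www-form-urlencoded",
        f"Content-Length: {len(body)}",  # must match the body length exactly
        "",
        "",  # blank line separates headers from body
    ]).encode("ascii")
    return head + body
```

Computing the length from the encoded body, rather than from the unencoded field values, matters because escaping makes the body longer than the original text.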
14 Basic Perl Web Library (web.pl)
- getURLAsString
- Given a URL, returns contents as string.
- submitForm
- Given a URL and a Perl hash of HTML form fields and contents, submits the form and returns response.
- html2text
- Uses lynx to parse HTML into a reasonable text approximation.
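The source of web.pl is not reproduced in these slides, so the following is only a rough Python analogue of the three routines, written against the descriptions above. The names `get_url_as_string`, `submit_form`, and `html_to_text` are hypothetical. Unlike html2text, which shells out to lynx, `html_to_text` here swaps in Python's built-in HTML parser and produces a much cruder text approximation.

```python
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode


def _read(resp) -> str:
    # Fall back to iso-8859-1 when the server does not declare a charset.
    charset = resp.headers.get_content_charset() or "iso-8859-1"
    return resp.read().decode(charset, errors="replace")


def get_url_as_string(url: str, timeout: float = 10) -> str:
    """Given a URL, return its contents as a string."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return _read(resp)


def submit_form(url: str, fields: dict, timeout: float = 10) -> str:
    """POST a dict of form fields to the URL and return the response text."""
    data = urlencode(fields).encode("ascii")  # passing data makes this a POST
    with urllib.request.urlopen(url, data=data, timeout=timeout) as resp:
        return _read(resp)


class _TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # contents of these tags are not page text

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Strip tags and collapse whitespace; a crude stand-in for lynx."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Keeping the three operations separate mirrors the slide: fetching, form submission, and text extraction can then be combined freely in a spider script.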
15 Example: Get Today's Dilbert and Package It for Email
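The code for this example is not included in the slide text. As a hedged sketch of the packaging half only, the function below wraps already-downloaded image bytes in a MIME email message using Python's standard library. The function name, default filename, and message text are assumptions, and the step that locates and downloads the comic image is left out because the slides do not give its URL.

```python
from email.message import EmailMessage


def package_image_email(image_bytes: bytes, sender: str, recipient: str,
                        subject: str, subtype: str = "gif",
                        filename: str = "comic.gif") -> EmailMessage:
    """Wrap image bytes in an email with a short text body and one attachment."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("Today's comic is attached.")
    # Adding an attachment converts the message to multipart/mixed.
    msg.add_attachment(image_bytes, maintype="image", subtype=subtype,
                       filename=filename)
    return msg
```

The resulting message could then be handed to `smtplib.SMTP.send_message` for delivery, with the mail server being whatever the spider's host is permitted to use.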
16 Other Issues and Gotchas
- SSL
- Perl SSLeay library
- Cookies
- Perl libraries exist
- robots.txt file
- Sometimes for the spider's benefit
- Politeness
- Don't get your domain blocked!
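The robots.txt and politeness points can be sketched with Python's standard `urllib.robotparser`. The helper names, the one-second default delay, and the `fetch` callback are assumptions for illustration; a site's own robots.txt, or its operators, may ask for something stricter.

```python
import time
from urllib.robotparser import RobotFileParser


def make_robot_rules(robots_txt: str) -> RobotFileParser:
    """Parse the text of a robots.txt file into a rule set."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules


def crawl_politely(rules, agent, urls, fetch, delay_seconds=1.0,
                   sleep=time.sleep) -> dict:
    """Fetch only the URLs robots.txt allows, pausing after each request."""
    results = {}
    for url in urls:
        if not rules.can_fetch(agent, url):
            continue  # the site asked spiders to stay out of this path
        results[url] = fetch(url)
        sleep(delay_seconds)  # spread requests out instead of hammering the server
    return results
```

Passing `fetch` and `sleep` in as parameters keeps the policy (what is allowed, how long to wait) separate from the mechanics of downloading, and makes the loop easy to exercise without touching the network.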
17 For More Information...
- All examples and links at
- http://www.eecs.umich.edu/~dreeves/hdiw/main.html