1
Web Spiders
  • Dan Reeves
  • Bill Walsh
  • HDIW EECS 547
  • 16 February 2000

2
What is a Web Spider?
  • A program that uses HTTP to automatically:
      • download documents from a web server
      • analyze documents retrieved from a web server
      • send data back to a web server
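
The "analyze" step often starts with pulling links out of a downloaded page. The following is a minimal sketch using only Python's standard library; the class name `LinkExtractor`, the helper `extract_links`, and the sample page are illustrative assumptions, not part of the original slides.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, in document order."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical sample page, for illustration only.
sample = '<a href="/index.html">Home</a> <a href="http://www.w3.org/">W3C</a>'
links = extract_links(sample, "http://www.netscape.com/")
print(links)
```

A spider would feed each extracted link back into its download queue, which is what turns a single fetch into a crawl.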

3
Spider Usage
  • Search engines
      • Lycos analyzes 10,000,000 Web pages a day
  • Comparison shopping
      • ShopBot
  • Data analysis
      • bidding behavior at online auctions
  • Automated Web interactions
      • daily comics delivery
      • stock trading agent
  • Other (mirroring, HTML/link validation, ...)

4
How Humans Typically Access The Web
  • Web browser
      • human-friendly interface
      • hides details of HTTP
  • A web browser is just a program written in some language
  • Whatever it does, you (a programmer) can do too!

5
What Components We Need to Use the Web
  • socket connection
  • HTTP
  • page knowledge

6
Setting Up a Socket Connection
  • Programmatically (C, Perl, Java, Lisp, etc.)
  • Unix command prompt
      • telnet address port_number
      • address is the web site's host name (not a full URL)
      • default port_number for most web sites: 80
      • telnet www.netscape.com 80
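
The programmatic route can be sketched in Python as follows. This is one possible illustration, not the presenters' code: the function names `build_request` and `fetch` are assumptions, and `fetch` is defined but not called here because it needs a live network connection.

```python
import socket


def build_request(host, path="/"):
    """Return a minimal HTTP/1.0 GET request as bytes."""
    lines = [
        f"GET {path} HTTP/1.0",
        f"Host: {host}",
        "",  # the empty line ends the header section
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def fetch(host, path="/", port=80, timeout=10.0):
    """Open a TCP socket, send the request, and read until the server closes."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_request(host, path))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # an empty read means the server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks)


request = build_request("www.netscape.com", "/index.html")
print(request.decode("ascii"))
```

This does by hand what the telnet session does interactively: connect to port 80, type a request, and read whatever comes back.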

7
HTTP
  • A well-defined specification for message formats
  • Orthogonal to
      • TCP/IP
      • HTML
      • XML
  • W3C: World Wide Web Consortium
      • www.w3.org

8
Page Knowledge
  • Markup language: HTML, XML, free text
  • Data formatting
      • regular expressions
      • domain-specific conventions
      • freeform text
  • How to get the knowledge
      • coded by humans
      • learning
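
As an illustration of hand-coded page knowledge expressed as a regular expression, the sketch below extracts dollar amounts from free text, the kind of thing a comparison-shopping spider might need. The pattern, the function name `extract_prices`, and the sample text are assumptions made for this example.

```python
import re

# Hand-coded page knowledge: prices look like "$12.95" or "$1,299".
# Assumes thousands are comma-grouped; "$1299" would not be matched in full.
PRICE = re.compile(r"\$(\d{1,3}(?:,\d{3})*(?:\.\d{2})?)")


def extract_prices(text):
    """Return every dollar amount found in text as a float."""
    return [float(m.replace(",", "")) for m in PRICE.findall(text)]


# Hypothetical page text, for illustration only.
page_text = "Widget: $12.95 (was $15.00). Deluxe model: $1,299"
prices = extract_prices(page_text)
print(prices)
```

Patterns like this are brittle: they encode one site's formatting conventions and break when the page layout changes, which is part of the motivation for learning the knowledge instead of coding it by hand.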

9
telnet www.netscape.com 80
    telnet www.netscape.com 80
    Trying 207.200.75.204...
    Connected to www-ld2.netscape.com.
    Escape character is '^]'.
    GET /index.html HTTP/1.0
    User-Agent: An Evil Spider
    Accept: image/gif, */*
    Accept-Language: en, de
    Purpose-of-Request: Denial of Service Attack

    HTTP/1.1 200 OK
    Server: Netscape-Enterprise/3.6
    Date: Thu, 10 Feb 2000 21:22:39 GMT
    Set-Cookie: UIDC=141.213.12.1860950217760031129; domain=.netscape.com; path=/; expires=31-Dec-2010 23:59:59 GMT
    Content-type: text/html
    Connection: close

    <!-- Hide from old browsers
    if (parseFloat(navigator.appVersion) ...) location.href = "http://home.netscape.com/computing/download/upgrade_index.html"
    // Stop Hiding From Old Browsers -->
    Netcenter
    [about 40k snipped]
    window.pupPup()
    Connection closed by foreign host.

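
The server's reply in the session above consists of a status line, a block of headers, a blank line, and then the body. A minimal sketch of splitting a raw response into those parts is shown below; the function name `parse_response` and the shortened sample response are assumptions for illustration.

```python
def parse_response(raw):
    """Split a raw HTTP response into (status_code, headers, body)."""
    # Headers and body are separated by the first blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line looks like: HTTP/1.1 200 OK
    version, code, *reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        # Header names are case-insensitive; a repeated header overwrites
        # the earlier one in this simplified version.
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


# Shortened sample modeled on the session above.
raw = (
    b"HTTP/1.1 200 OK\r\n"
    b"Server: Netscape-Enterprise/3.6\r\n"
    b"Content-type: text/html\r\n"
    b"Connection: close\r\n"
    b"\r\n"
    b"<html>...</html>"
)
status, headers, body = parse_response(raw)
print(status, headers["server"])
```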
10
Setting Up a Proxy in Netscape
11
Example of GET
    GET / HTTP/1.0
    Proxy-Connection: Keep-Alive
    User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
    Host: www.netscape.com
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: de, en
    Accept-Charset: iso-8859-1,*,utf-8
    Cookie: UIDC=141.213.12.1570937196127933529; HITO_VISITS=A151853E21368C1E00B21

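
The Cookie header in the request above carries values that the server set earlier with Set-Cookie. A spider that wants to look like a returning browser has to do the same bookkeeping. The sketch below uses Python's `http.cookies` module; the function name `cookie_header` and the cookie names and values are hypothetical, and the sketch deliberately ignores domain, path, and expiry matching.

```python
from http.cookies import SimpleCookie


def cookie_header(set_cookie_values):
    """Build a Cookie request-header value from Set-Cookie header values."""
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    # Only name=value pairs are sent back; attributes such as path stay local.
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical Set-Cookie values, for illustration only.
received = [
    "UID=abc123; path=/; domain=.example.com",
    "VISITS=7; path=/",
]
header_value = cookie_header(received)
print("Cookie:", header_value)
```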
12
Example of GET-Based Form
    GET /lookup/Lookup.tibco?search=sunw&st_symbol=on HTTP/1.0
    Referer: http://www.netscape.com/
    Proxy-Connection: Keep-Alive
    User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
    Host: lookup.netscape.com
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: de, en
    Accept-Charset: iso-8859-1,*,utf-8
    Cookie: UIDC=141.213.12.1570937196127933529; HITO_VISITS=A151853E21368C1E00B21; NSPOP=myn12

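
A GET-based form puts its fields in the query string of the URL. A minimal sketch of building such a URL with Python's `urllib.parse` follows; the helper name `build_form_url` is an assumption, and the field names `search` and `st_symbol` follow the query string in the request above and should be treated as assumptions as well.

```python
from urllib.parse import urlencode, urlunsplit


def build_form_url(host, path, fields):
    """Encode form fields into the query string of a GET URL."""
    query = urlencode(fields)
    # urlunsplit takes (scheme, netloc, path, query, fragment).
    return urlunsplit(("http", host, path, query, ""))


url = build_form_url(
    "lookup.netscape.com",
    "/lookup/Lookup.tibco",
    {"search": "sunw", "st_symbol": "on"},
)
print(url)
```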
13
Example of POST-Based Form
    POST /~dreeves/bin/quote-submit.cgi HTTP/1.0
    Referer: http://www.eecs.umich.edu/~dreeves/add-quote.html
    Proxy-Connection: Keep-Alive
    User-Agent: Mozilla/4.7 [en] (X11; U; SunOS 5.7 sun4u)
    Host: www.eecs.umich.edu
    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*
    Accept-Encoding: gzip
    Accept-Language: de, en
    Accept-Charset: iso-8859-1,*,utf-8
    Content-type: application/x-www-form-urlencoded
    Content-length: 234

    recipient=daniel&subject=QUOTE+DATABASE+SUBMISSION&name=eecs+547+student&email=dreeves%40umich.edu&body=%22Beware+of+bugs+in+the+above+code%3B+I+have+only+proved+it+correct%2C+not+tried+it.%22%0D%0A--Donald+Knuth%0D%0A

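
With POST, the encoded fields travel in the message body, and the Content-Length header must match the body's size in bytes. The sketch below shows one way to assemble such a message in Python; the function name `build_post`, the host `www.example.com`, and the path are hypothetical, and only two of the form's fields are used to keep the example short.

```python
from urllib.parse import urlencode


def build_post(host, path, fields):
    """Return a complete HTTP/1.0 POST message for a URL-encoded form."""
    # Spaces become "+" and reserved characters become %XX escapes.
    body = urlencode(fields).encode("ascii")
    head = (
        f"POST {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        "Content-Type: application/x-www-form-urlencoded\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("ascii")
    return head + body


message = build_post(
    "www.example.com",           # hypothetical host
    "/cgi-bin/submit.cgi",       # hypothetical path
    {"name": "eecs 547 student", "email": "dreeves@umich.edu"},
)
print(message.decode("ascii"))
```

Computing the length from the encoded bytes, instead of typing it by hand, avoids the mismatches that make servers truncate or reject a submission.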
14
Basic Perl Web Library (web.pl)
  • getURLAsString
      • Given a URL, returns its contents as a string.
  • submitForm
      • Given a URL and a Perl hash of HTML form fields and their contents, submits the form and returns the response.
  • html2text
      • Uses lynx to parse HTML into a reasonable text approximation.
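
For readers without web.pl at hand, the idea behind html2text can be approximated in Python. Note the swap in technique: the library described above shells out to lynx, whereas this sketch uses the standard library's `html.parser` and produces a much cruder result. The class name `TextExtractor` and the sample page are assumptions; this is not the authors' implementation.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html2text(html):
    """Return the text content of html, one text run per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical sample page, for illustration only.
page = (
    "<html><head><title>Netcenter</title>"
    "<script>var x = 1;</script></head>"
    "<body><h1>Welcome</h1><p>Hello, spider!</p></body></html>"
)
text = html2text(page)
print(text)
```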

15
Example: Get Today's Dilbert and Package It for Email
16
Other Issues and Gotchas
  • SSL
      • Perl SSLeay library
  • Cookies
      • Perl libraries exist
  • robots.txt file
      • Sometimes for the spider's benefit
  • Politeness
      • Don't get your domain blocked!
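
Honoring robots.txt and pausing between requests can be sketched with Python's `urllib.robotparser`. The robots.txt content, the agent name `ExampleSpider`, the URLs, and the helper `polite_urls` are all hypothetical; a real spider would download the site's own robots.txt first.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /private/
"""

rules = RobotFileParser()
rules.parse(robots_txt.splitlines())


def polite_urls(urls, agent="ExampleSpider", delay=1.0):
    """Yield only URLs that robots.txt allows, pausing between them."""
    first = True
    for url in urls:
        if not rules.can_fetch(agent, url):
            continue  # the site asked spiders to stay out of this path
        if not first:
            time.sleep(delay)  # space out requests to avoid hammering the server
        first = False
        yield url


candidates = [
    "http://www.example.com/index.html",
    "http://www.example.com/cgi-bin/search",
    "http://www.example.com/private/notes.html",
]
allowed = list(polite_urls(candidates, delay=0.0))
print(allowed)
```

Skipping disallowed paths also helps the spider itself, since those paths often lead to endless generated pages or CGI scripts with side effects.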

17
For More Information...
  • All examples and links at
      • http://www.eecs.umich.edu/~dreeves/hdiw/main.html