Information Retrieval - PowerPoint PPT Presentation

Title: Information Retrieval


1
Information Retrieval
March 11, 2005
  • Handout 9

2
Course Information
  • Instructor: Dragomir R. Radev (radev_at_si.umich.edu)
  • Office: 3080, West Hall Connector
  • Phone: (734) 615-5225
  • Office hours: M 11-12, Th 12-1, or via email
  • Course page: http://tangra.si.umich.edu/~radev/650/
  • Class meets on Fridays, 2:10-4:55 PM, in 409 West Hall

3
Measuring the Web
4
Bharat and Broder 1998
  • Based on crawls of HotBot, Altavista, Excite, and
    InfoSeek
  • 10,000 queries in mid and late 1997
  • Estimated size of the web: 200M pages
  • Only 1.4% of pages are indexed by all four engines
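
One way to read overlap-based estimates of this kind: sample pages from one engine's index, check whether the other engine also holds them, and compare the two overlap fractions. The sketch below is a minimal illustration under stated assumptions, not the procedure used in the study. The sets `index_a` and `index_b` and the functions `overlap_fraction` and `relative_size` are hypothetical, and a real study can only sample an index indirectly through queries.

```python
import random

def overlap_fraction(sample, other_index):
    """Fraction of sampled URLs that also appear in the other index."""
    if not sample:
        raise ValueError("sample must be non-empty")
    hits = sum(1 for url in sample if url in other_index)
    return hits / len(sample)

def relative_size(index_a, index_b, sample_size, rng):
    """Estimate |A| / |B| from two overlap fractions.

    frac_a_in_b estimates |A and B| / |A|; frac_b_in_a estimates |A and B| / |B|.
    Dividing the second by the first cancels the shared term and leaves |A| / |B|.
    """
    sample_a = rng.sample(sorted(index_a), sample_size)
    sample_b = rng.sample(sorted(index_b), sample_size)
    frac_a_in_b = overlap_fraction(sample_a, index_b)
    frac_b_in_a = overlap_fraction(sample_b, index_a)
    if frac_a_in_b == 0:
        raise ValueError("no overlap observed; the ratio is undefined")
    return frac_b_in_a / frac_a_in_b

# Toy indexes (hypothetical): A holds 600 pages, B holds 300, and 200 are shared.
index_a = {f"http://example.com/{i}" for i in range(0, 600)}
index_b = {f"http://example.com/{i}" for i in range(400, 700)}

rng = random.Random(0)
ratio = relative_size(index_a, index_b, sample_size=300, rng=rng)
print(f"estimated |A|/|B| = {ratio:.2f} (true value is 2.0)")
```

The estimate depends only on membership tests, which is why it can be run against engines that never disclose their index size.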

5
Example (from Bharat and Broder)
A similar approach by Lawrence and Giles yields an estimate of
320M pages (Lawrence and Giles 1998).
6
Crawling the web
7
Basic principles
  • The HTTP/HTML protocols
  • Following hyperlinks
  • Some problems:
      • Link extraction
      • Link normalization
      • Robot exclusion
      • Loops
      • Spider traps
      • Server overload
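
Link extraction and link normalization from the list above can be sketched with the Python standard library. This is a minimal illustration, not a complete crawler component: the names `LinkExtractor`, `normalize`, and `extract_links` are hypothetical, and the normalization rules (drop fragments, lowercase the host, drop default ports, keep only http and https) are one reasonable choice among several.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit, urlunsplit

class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def normalize(base_url, href):
    """Resolve href against base_url and reduce it to one canonical form."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlsplit(absolute)
    if parts.scheme not in ("http", "https"):
        return None  # skip mailto:, javascript:, and similar schemes
    host = parts.hostname or ""  # .hostname is already lowercased
    default_port = {"http": 80, "https": 443}[parts.scheme]
    if parts.port and parts.port != default_port:
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

def extract_links(base_url, html):
    """Return the distinct normalized links of a page, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = normalize(base_url, href)
        if url and url not in seen:
            seen.add(url)
            links.append(url)
    return links

page = (
    '<a href="../c.html#top">c</a> '
    '<a href="HTTP://WWW.Example.com:80/c.html">same page again</a> '
    '<a href="mailto:x@example.com">mail</a> '
    '<a href="d/">d</a>'
)
links = extract_links("http://www.example.com/a/b.html", page)
print(links)
```

Normalizing before deduplication matters for the loop problem as well: two spellings of the same URL would otherwise be queued and fetched twice.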

8
Example
  • U-M's root robots.txt file
  • http://www.umich.edu/robots.txt
  • User-agent: *
  • Disallow: /websvcs/projects/
  • Disallow: /%7Ewebsvcs/projects/
  • Disallow: /homepage/
  • Disallow: /%7Ehomepage/
  • Disallow: /smartgl/
  • Disallow: /%7Esmartgl/
  • Disallow: /gateway/
  • Disallow: /%7Egateway/
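
A crawler can check rules like these before each request. The sketch below uses Python's standard `urllib.robotparser` on a subset of the rules, parsed from a string so that no network access is needed. It assumes the rules apply to every user agent (`User-agent: *`) and are written in standard `Disallow: /path/` form; the agent name `ExampleBot` and the path `/admissions/` are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules from the slide, in standard robots.txt syntax.
# Assumption: the rules apply to all user agents.
ROBOTS_TXT = """\
User-agent: *
Disallow: /websvcs/projects/
Disallow: /homepage/
Disallow: /smartgl/
Disallow: /gateway/
"""

def build_parser(robots_text):
    """Parse robots.txt text that was already fetched (or, here, hard-coded)."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

rules = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical agent name; it falls under the "*" entry.
blocked = rules.can_fetch("ExampleBot", "http://www.umich.edu/homepage/index.html")
allowed = rules.can_fetch("ExampleBot", "http://www.umich.edu/admissions/")
print(blocked, allowed)
```

Disallow rules match by path prefix, so a single entry such as `/homepage/` excludes everything beneath that directory.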

9
Example crawler
  • E.g., poacher
  • http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
  • /data0/projects/perltree-index

10
ParseCommandLine();
Initialise();
$robot->run($siteRoot);

# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version
# number for WebAssay, the program name identifier, usage statement, etc.
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );

    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

# follow_url_test() - tell the robot module whether it should follow a link
sub follow_url_test { ... }

# process_get_error() - hook function invoked whenever a GET fails
sub process_get_error { ... }

# process_contents() - process the contents of a URL we've retrieved
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
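
The hook pattern in this excerpt can be illustrated with a small Python analogue. This is a sketch of the idea only and is not the WWW::Robot API: the class `MiniRobot`, its `add_hook` and `run` methods, and the `fetch` callable are hypothetical, and only the three hook names are taken from the excerpt. The toy `WEB` dictionary stands in for real HTTP requests.

```python
from collections import deque

class MiniRobot:
    """Breadth-first traversal that delegates decisions to named hooks."""

    HOOK_NAMES = ("follow-url-test", "invoke-on-contents", "invoke-on-get-error")

    def __init__(self, fetch):
        # fetch: callable url -> (contents, links); raises OSError on failure
        self.fetch = fetch
        self.hooks = {}

    def add_hook(self, name, func):
        if name not in self.HOOK_NAMES:
            raise ValueError(f"unknown hook: {name}")
        self.hooks[name] = func

    def run(self, root):
        queue, seen = deque([root]), {root}
        follow = self.hooks.get("follow-url-test", lambda url: True)
        while queue:
            url = queue.popleft()
            try:
                contents, links = self.fetch(url)
            except OSError as err:
                if "invoke-on-get-error" in self.hooks:
                    self.hooks["invoke-on-get-error"](url, err)
                continue
            if "invoke-on-contents" in self.hooks:
                self.hooks["invoke-on-contents"](url, contents)
            for link in links:
                if link not in seen and follow(link):
                    seen.add(link)  # the seen set prevents loops
                    queue.append(link)

# Toy in-memory "web" (hypothetical URLs) in place of real requests.
WEB = {
    "http://site/": ("home", ["http://site/a", "http://other/x", "http://site/missing"]),
    "http://site/a": ("page a", ["http://site/"]),
}

def fetch(url):
    if url not in WEB:
        raise OSError("404 Not Found")
    return WEB[url]

visited, errors = [], []
robot = MiniRobot(fetch)
robot.add_hook("follow-url-test", lambda url: url.startswith("http://site/"))
robot.add_hook("invoke-on-contents", lambda url, contents: visited.append(url))
robot.add_hook("invoke-on-get-error", lambda url, err: errors.append(url))
robot.run("http://site/")
print(visited, errors)
```

Keeping the traversal loop separate from the hooks lets the same robot serve different tasks: the follow test restricts the crawl to one site, and the contents hook decides what to do with each page.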
12
Focused crawling
  • Topical locality
      • Pages that are linked are similar in content, and vice versa
        (Davison 00; Menczer 02, 04; Radev et al. 04)
  • The radius-1 hypothesis
      • Given that page i is relevant to a query and that page i points
        to page j, page j is also likely to be relevant (at least,
        more so than a random web page)
  • Focused crawling
      • Keep a priority queue of candidate pages, ordered by estimated
        relevance, and fetch the most relevant first
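
The priority-queue idea can be sketched as a best-first crawl. The code below is a minimal illustration under stated assumptions, not a published algorithm: `focused_crawl`, `fetch`, and `score` are hypothetical names, the keyword-fraction relevance score is a placeholder, and each link inherits the relevance of the page that points to it, following the radius-1 hypothesis.

```python
import heapq

def focused_crawl(seed, fetch, score, max_pages):
    """Visit up to max_pages pages, expanding the highest-priority URL first.

    fetch(url) -> (text, links) and score(text) -> float come from the caller.
    A link's priority is the score of the page that pointed to it.
    """
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _neg_priority, url = heapq.heappop(frontier)
        text, links = fetch(url)
        order.append(url)
        relevance = score(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance, link))
    return order

# Toy link graph (hypothetical page names) in place of real requests.
PAGES = {
    "seed": ("web crawling overview", ["sports", "crawler"]),
    "sports": ("football scores", ["scores"]),
    "crawler": ("focused crawling of the web", ["spider"]),
    "scores": ("more football", []),
    "spider": ("web spider crawling", []),
}
TOPIC_WORDS = {"web", "crawling", "crawler", "spider"}

def fetch(url):
    return PAGES[url]

def score(text):
    """Placeholder relevance: fraction of words that are topic words."""
    words = text.split()
    return sum(word in TOPIC_WORDS for word in words) / len(words)

order = focused_crawl("seed", fetch, score, max_pages=4)
print(order)
```

With a budget of four pages, the link found on the relevant page is fetched while the link found on the off-topic page never reaches the front of the queue.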