Documenting Internet2: an IT perspective

1
Documenting Internet2: an IT perspective
...or... A joyful romp with Heritrix, JavaScript & Spotlight!
  • Eric Celeste
  • University of Minnesota (Twin Cities) Libraries
  • for the Coalition for Networked Information
  • 6 December 2005

2
background...
  • DI2 brought together
    • University of Minnesota (CBI)
    • University of Michigan (SI)
    • Internet2
  • web crawling only a small part
  • the "save everything" approach

3
briefly
  • on crawling with spiders
  • on Heritrix and JavaScript
  • on Spotlight and local files
  • on sinkholes and strategies

4
spiders on the web
5
pages
6
links
7
hosts & domains
8
robots.txt
9
scope
10
seeds
11
excluded pages
18
done!
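
The crawl walked through on the slides above (seeds, links, hosts, robots.txt-style exclusions, scope) can be sketched as a breadth-first loop. This is only an illustration under stated assumptions, not how Heritrix is implemented: `inScope`, `crawl`, the in-memory `pages` map standing in for the web, and the example.org addresses are all hypothetical.

```javascript
// Hypothetical sketch of a scoped crawl; not Heritrix's implementation.
// `pages` stands in for the web: it maps a URL to the links found on that page.
function inScope(url, allowedHosts, excludedPrefixes) {
  const u = new URL(url);
  // Scope: stay on the hosts we agreed to capture.
  if (!allowedHosts.has(u.hostname)) return false;
  // Exclusions: skip paths ruled out by robots.txt or by the crawl operator.
  return !excludedPrefixes.some((prefix) => u.pathname.startsWith(prefix));
}

function crawl(seeds, pages, allowedHosts, excludedPrefixes) {
  const queue = [...seeds];   // the frontier starts with the seeds
  const seen = new Set(seeds);
  const captured = [];
  while (queue.length > 0) {
    const url = queue.shift();
    if (!inScope(url, allowedHosts, excludedPrefixes)) continue;
    captured.push(url);
    // Follow every link on the page that has not been queued before.
    for (const link of pages.get(url) || []) {
      if (!seen.has(link)) {
        seen.add(link);
        queue.push(link);
      }
    }
  }
  return captured;
}

const pages = new Map([
  ["http://www.example.org/",
    ["http://www.example.org/about", "http://other.example.com/"]],
  ["http://www.example.org/about",
    ["http://www.example.org/", "http://www.example.org/private/notes"]],
]);
const captured = crawl(
  ["http://www.example.org/"],
  pages,
  new Set(["www.example.org"]),
  ["/private/"]
);
console.log(captured);
```

The off-host link and the excluded path are discovered but never captured, which is the point of setting scope before the crawl starts.
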
19
our crawler
  • Heritrix, from the Internet Archive (IA)
  • aiming for broad deployment, Archive-It
  • cross-platform, many users
  • simple setup, sophisticated options
  • generates ARC files

20
from ARC to archive
  • keep originals intact
  • a few large files to manage
  • can serve a mirror from the master
  • can extract files for research
  • solution requires Perl, PHP, JavaScript, MySQL

21
processing...
  • for mirroring online
    • optimizing and indexing with Perl
    • loading into MySQL database
    • presenting via PHP
  • for using on local disk
    • extracting files from ARC
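
The indexing and extraction steps above can be sketched as a single pass over an ARC file. The project's own tooling was Perl, PHP, and MySQL; this JavaScript version is a stand-in under stated assumptions: each record is taken to be one space-separated header line (URL, IP address, capture date, content type, length) followed by that many bytes of content, and the input is assumed to be already uncompressed. `parseArcRecords` and the sample data are hypothetical.

```javascript
// Hypothetical sketch: split an uncompressed ARC-style buffer into records.
// Assumes each record is a header line "URL IP DATE MIME LENGTH\n" followed by
// LENGTH bytes of content and a separating newline. Compressed ARC files would
// need to be unpacked first.
function parseArcRecords(buf) {
  const records = [];
  let pos = 0;
  while (pos < buf.length) {
    // Skip the blank line(s) that separate records.
    if (buf[pos] === 0x0a) { pos += 1; continue; }
    const eol = buf.indexOf(0x0a, pos);
    if (eol === -1) break; // truncated header: stop rather than guess
    const [url, ip, date, mime, lengthText] =
      buf.toString("latin1", pos, eol).split(" ");
    const length = Number.parseInt(lengthText, 10);
    if (!Number.isInteger(length) || length < 0) break; // malformed header
    const start = eol + 1;
    // `offset` is what an index would store so a mirror can seek into the master file.
    records.push({ url, ip, date, mime, offset: pos,
                   body: buf.subarray(start, start + length) });
    pos = start + length;
  }
  return records;
}

const body1 = "<html>hi</html>";
const body2 = "body { color: red }";
const sample = Buffer.from(
  `http://www.example.org/ 192.0.2.1 20051206120000 text/html ${body1.length}\n${body1}\n` +
  `http://www.example.org/site.css 192.0.2.1 20051206120001 text/css ${body2.length}\n${body2}\n`,
  "latin1"
);
const records = parseArcRecords(sample);
for (const r of records) console.log(r.offset, r.url, r.mime, r.body.length);
```

Keeping only the offset and header fields in a database leaves the original ARC intact while still letting a mirror serve any page, and writing each `body` to disk gives the local copy for research.
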

22
joys of javascript...
  • modifies the page after loading
  • HTML almost unmolested
  • changes explicit in code

23
are we there yet?
  • make the archive obvious
  • yet intrude as little as possible
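
The idea on the last two slides, marking a page as archived after it loads while leaving the stored HTML nearly untouched, can be sketched as below. This is not the DI2 project's script: `markArchived`, the banner id and wording, and the capture date are hypothetical placeholders.

```javascript
// Hypothetical sketch: label an archived page after it loads, so the stored
// HTML needs nothing more than a script reference added to it.
function markArchived(doc, capturedOn) {
  const banner = doc.createElement("div");
  banner.id = "archive-banner";
  banner.textContent = "Archived copy, captured " + capturedOn;
  // Insert as the first child of <body>: obvious to the reader, but small.
  doc.body.insertBefore(banner, doc.body.firstChild);
  return banner;
}

// In a browser, run once the original page has finished loading.
// The date here is a placeholder, not a real capture date.
if (typeof window !== "undefined") {
  window.addEventListener("load", () => markArchived(document, "2005-12-06"));
}
```

Because every change happens in the script, anyone reading its source can see exactly how the displayed page differs from the captured original.
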

24
global research locally
  • a web site in your pocket
  • applying local tools
  • maintaining browse-ability
  • Apple's Spotlight, one of many

25
sinkholes / strategies
  • partnership with institution
    • config, IP, retention
  • crawling far from perfect
    • no creation dates, exclusions
    • sticky traps, scripted pages (AJAX)
  • scripts still immature
    • better demarcation
    • more self-contained (not at /)

26
still...
  • capture & save what we can
  • keep it as original as possible
  • stay flexible for the future
  • have fun in the present!

27
more information
  • http://wiki.lib.umn.edu/DI2/
  • Eric Celeste <efc@umn.edu>