Title: Just-In-Time Recovery of Missing Web Pages
1Just-In-Time Recovery of Missing Web Pages
- Hypertext 2006
- Odense, Denmark
- August 25, 2006
- Terry L. Harrison, Michael L. Nelson
- Old Dominion University
- Norfolk VA, USA
2Preservation Fortress Model
Five Easy Steps for Preservation
- Get a lot of
- Buy a lot of disks, machines, tapes, etc.
- Hire an army of staff
- Load a small amount of data
- Look upon my archive ye Mighty, and despair!
image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
3Alternate Models of Preservation
- Lazy Preservation
- Let Google, IA et al. preserve your website
- Just-In-Time Preservation
- Find a good enough replacement web page
- Shared Infrastructure Preservation
- Push your content to sites that might preserve it
- Web Server Enhanced Preservation
- Use Apache modules to create archival-ready resources
image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
4Outline
- The 404 problem
- Component technologies
- web infrastructure
- lexical signatures
- OAI-PMH
- Opal
- architectural description
- analysis
5The 404 Problem
- Kahle (97) - Average page lifetime 44 days
- Koehler (99, 04) - 67% URLs lost in 4 years
- Lawrence et al. (01) - 23-53% URLs in CiteSeer papers invalid over 5 year span (3% of invalid URLs unfindable)
- Spinellis (03) - 27% URLs in CACM/Computer papers gone in 5 years
- Chan et al. (03) - 11 year half-life for URLs in D-Lib Magazine articles
- Nelson & Allen (02) - 3% objects in digital library gone in 1 year
ECDL 1999 good enough page available
PSP 2003 exact copy at new URL
Greynet 99 unavailable at any URL?
6Web Infrastructure: Refreshing & Migrating
7Lexical Signatures
- Robust Hyperlinks Cost Just Five Words Each
- Phelps & Wilensky (2000)
- http://www.cs.odu.edu/~tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
- Analysis of Lexical Signatures for Improving Information Persistence on the World Wide Web
- Park et al. (2004)
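The robust-hyperlink idea above can be illustrated with a minimal Python sketch that appends a five-term lexical signature to a URL and reads it back. The parameter name `lex-sig` is taken from the example URL on this slide; the function names and the `www.example.org` address are hypothetical, and joining terms with `+` is an assumption based on standard query-string encoding.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def add_lexical_signature(url, terms, param="lex-sig"):
    """Return url with the signature terms appended as one query parameter."""
    # Use "&" if the URL already has a query string, otherwise start one.
    separator = "&" if urlsplit(url).query else "?"
    # urlencode turns the spaces between terms into "+".
    return url + separator + urlencode({param: " ".join(terms)})


def read_lexical_signature(url, param="lex-sig"):
    """Return the list of signature terms carried by url (empty if none)."""
    values = parse_qs(urlsplit(url).query).get(param, [])
    return values[0].split() if values else []


robust = add_lexical_signature(
    "http://www.example.org/page.html",  # hypothetical address
    ["terry", "harrison", "thesis", "jcdl", "awarded"],
)
print(robust)
print(read_lexical_signature(robust))
```

If the page later returns 404, the terms recovered by `read_lexical_signature` could be submitted to a search engine to look for the page at a new location.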
8OAI-PMH
Data Providers / Repositories
Service Providers / Harvesters
A repository is a network accessible server that can process the 6 OAI-PMH requests. A repository is managed by a data provider to expose metadata to harvesters.
A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.
9OAI-PMH Aggregators
- aggregators allow for
- scalability for OAI-PMH
- load balancing
- community building
- discovery
data providers (repositories)
service providers (harvesters)
aggregator
10Overview of OAI-PMH Verbs
metadata about the repository
harvesting verbs
most verbs take arguments: dates, sets, ids, metadata formats, and resumption token (for flow control)
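As a rough illustration of the verbs and arguments named on this slide, the sketch below builds OAI-PMH request URLs in Python. The six verb names and the argument names (`from`, `until`, `set`, `identifier`, `metadataPrefix`, `resumptionToken`) come from the OAI-PMH 2.0 protocol; the helper `oai_request` and the base URL `http://opal.example.org/oai` are hypothetical.

```python
from urllib.parse import urlencode

# Verbs that return metadata about the repository itself.
REPOSITORY_VERBS = ("Identify", "ListMetadataFormats", "ListSets")
# Verbs used to harvest records.
HARVESTING_VERBS = ("ListIdentifiers", "ListRecords", "GetRecord")


def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for one of the six protocol verbs."""
    if verb not in REPOSITORY_VERBS + HARVESTING_VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    params = {"verb": verb}
    for name, value in arguments.items():
        if value is not None:
            # "from" is a Python keyword, so callers pass from_ instead.
            params[name.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


# Hypothetical repository address, used only for illustration.
print(oai_request("http://opal.example.org/oai", "Identify"))
print(oai_request("http://opal.example.org/oai", "ListRecords",
                  metadataPrefix="oai_dc", from_="2006-01-01"))
```

When a response is too large for one reply, the repository returns a resumption token, which the harvester would pass back as `resumptionToken` in a follow-up request with the same verb.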
11Observations
- One reason why the original Phelps & Wilensky vision was never realized is that it required a priori LS calculation
- idea: use the Web Infrastructure to calculate LSs as they are needed
- Mass adoption of a system will occur only if it is really, really easy to do so
- idea: digital preservation systems should require only a small number of heroes
12Description Use Cases
- Allow many web servers to use a few Opal servers that use the caches of the Web Infrastructure to generate Lexical Signatures of recently 404 URLs to find either
- the same page at a new URL
- example: bookmarked colleague is now 404
- cached info is not useful
- similar pages probably not useful
- a good enough replacement page
- example: bookmarked recipe is now 404
- cached info is useful
- similar pages probably useful
13Opal Configuration: Configure Two Things
edit httpd.conf
add / edit custom 404 page
14Opal High-Level Architecture
Interactive User
www.bar.org
opal.foo.edu
1. Get URL X
2. Custom 404 page
3. Pagetag redirects User to Opal server
4. Opal searches WI caches, creates LS
5. Opal gives user navigation options
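The redirect step in this architecture amounts to handing the missing URL to the Opal server. The slide does not show the pagetag itself, so the following Python sketch only illustrates how such a redirect target could be built; the `/opal` path, the `url` parameter name, and the function name are all hypothetical.

```python
from urllib.parse import urlencode


def opal_redirect_url(opal_server, missing_url, path="/opal", param="url"):
    """Build the address a custom 404 page could redirect the user to.

    The path and parameter name are illustrative assumptions, not the
    interface of the actual Opal server.
    """
    # Percent-encode the missing URL so it survives as a single parameter.
    return "http://" + opal_server + path + "?" + urlencode({param: missing_url})


# Host names as shown on the slide: the user asked www.bar.org for URL X.
print(opal_redirect_url("opal.foo.edu", "http://www.bar.org/X"))
```

Keeping the web server's part this small matches the goal that many web servers can share a few Opal servers with very little configuration.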
15Locating Caches
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/~tharriso
http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/~tharriso
16Internet Archive
17WI Caches Last 7-51 days
- IA caches forever, but
- may not ever crawl you
- 12 month latency
- no internal backups
Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen, Reconstructing Websites for the Lazy Webmaster, arXiv cs.IR/0512069, 2005. http://arxiv.org/abs/cs.IR/0512069
18Term Frequency × Inverse Document Frequency
- Calculating Term Frequency is easy
- frequency of term in this document
- Calculating Document Frequency is hard
- frequency of term in all documents
- assumes knowledge of entire corpus!
- Good terms appear
- frequently in a single document
- infrequently across all documents
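To make the term-weighting idea concrete, here is a minimal Python sketch that scores terms by TF × IDF over a small local corpus and keeps the top-ranked terms as a lexical signature. It assumes the whole corpus is available in memory, which is exactly the assumption the slide flags as unrealistic for the Web; the tokenizer, the logarithmic IDF form, and the tiny corpus are illustrative choices, not the method used by Opal.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lexical_signature(document, corpus, size=5):
    """Return the `size` terms of `document` with the highest TF x IDF."""
    term_freq = Counter(tokenize(document))
    doc_terms = [set(tokenize(d)) for d in corpus]
    n_docs = len(doc_terms)
    scores = {}
    for term, tf in term_freq.items():
        # Document frequency: number of corpus documents containing the term.
        df = max(1, sum(1 for terms in doc_terms if term in terms))
        scores[term] = tf * math.log(n_docs / df)
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]


corpus = [
    "the opal server finds the missing page",
    "the web server returns the page",
    "the archive stores the page",
]
print(lexical_signature(corpus[0], corpus, size=3))
```

Terms such as "the" and "page" appear in every document, so their IDF is zero and they never enter the signature, however often they occur in the page itself.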
19Scraping Google to Approximate DF
- Frequency of term across all documents
- How many documents?
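One way to read this slide is that a search engine's reported hit count for a term stands in for document frequency, while the total number of indexed documents has to be assumed. The Python sketch below shows only that arithmetic; it performs no scraping, and the function name, the logarithmic IDF form, and the sample numbers are assumptions made for illustration.

```python
import math


def estimate_idf(hit_count, corpus_size):
    """Approximate IDF from a search-engine hit count.

    hit_count   -- number of results reported for the term (stand-in for DF)
    corpus_size -- assumed number of documents in the search engine's index
    """
    # Clamp DF to [1, corpus_size] so the logarithm is defined and non-negative.
    df = min(max(hit_count, 1), corpus_size)
    return math.log(corpus_size / df)


# Illustrative numbers only: a rare term versus a very common one.
assumed_corpus_size = 1_000_000
print(estimate_idf(1_000, assumed_corpus_size))
print(estimate_idf(900_000, assumed_corpus_size))
```

Because the corpus size is a guess, the absolute values are unreliable, but the ranking of terms within one page depends only on their relative hit counts.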
20GUI - Bootstrapping
21GUI - Learned
22GUI (cont)
- simURL="http://www.cs.odu.edu/~tharriso/" baseURL="http://invivo_test.com"
- url=http://www.cs.odu.edu/~tharriso
- match=http://www.cs.odu.edu/~tharriso/')"
- Terry Harrison Profile Page Burning Man Images Other Images - (not really well sorted, sorry!) Email Terry ...
- (May 2003), AR Zipf Fellowship Awarded to Terry Harrison - Press Release - ... www.cs.odu.edu/~tharriso/ - 12k -
23Opal Server Databases
- URL database
- 404 URL → (LS, similarURL1, similarURL2, ..., similarURLN)
- similarURL → (URL, datestamp, votes, Opal server)
- Term database
- term → (Opal server, source, datestamp, DF, corpus size, IDF)
Define each URL and Term as OAI-PMH Records and we can harvest what an Opal server has learned
- can accommodate late arrivers (no cold start for them)
- pool the learning of multiple servers
- incentives to cooperate
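The two databases can be pictured as simple record types. The Python sketch below mirrors the fields listed on this slide; the class names, field types, and the `merge_terms` pooling rule (keep the record with the newest datestamp) are assumptions for illustration, not the actual Opal schema or harvesting logic.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SimilarURL:
    url: str
    datestamp: str      # ISO 8601 date, e.g. "2006-03-01"
    votes: int
    opal_server: str


@dataclass
class MissingURLRecord:
    url: str                                  # the 404 URL
    lexical_signature: List[str]
    similar_urls: List[SimilarURL] = field(default_factory=list)


@dataclass
class TermRecord:
    term: str
    opal_server: str
    source: str
    datestamp: str      # ISO 8601 date
    df: int
    corpus_size: int
    idf: float


def merge_terms(local: Dict[str, TermRecord],
                harvested: List[TermRecord]) -> Dict[str, TermRecord]:
    """Pool harvested term records, keeping the newest datestamp per term."""
    merged = dict(local)
    for record in harvested:
        current = merged.get(record.term)
        # ISO 8601 dates compare correctly as plain strings.
        if current is None or record.datestamp > current.datestamp:
            merged[record.term] = record
    return merged
```

A newly started server with an empty `local` dictionary would simply adopt everything it harvests, which is the "no cold start" benefit listed above.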
24Opal Synchronization
- Other architectures possible
- Harvesting frequency determined by individual nodes
- Opal D aggregates D.1-D.3 to Group 1
- Opal D aggregates A-C to Group 2
[Figure: Opal servers in Group 1 and Group 2 (Opal A, Opal D, Opal D.1, Opal D.2, Opal D.3) exchanging Terms and URLs]
25Discovery via OAI-PMH
26Connection Costs
- Cost_cache = (WI × N) + R
- WI = number of web infrastructure caches
- N = connections for each WI
- R = connection to get a datestamp
- Cost_paths = Rc + T + Rl
- Rc = connections to get a cached copy
- T = connections required for each term
- Rl = connections to use LS
Cost_cache = (3 × 1) + 1 = 4
Cost_paths = 1 + T + 1
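The connection-cost arithmetic on this slide can be checked with a short Python sketch. It reads the cache cost as the number of caches times the connections per cache plus one datestamp connection, and the path cost as the sum of its three parts; the function and parameter names are hypothetical, and the value T = 5 is only an example corresponding to a five-term signature.

```python
def cost_cache(wi, n, r):
    """Connections needed to locate a cached copy.

    wi -- number of web infrastructure caches queried
    n  -- connections per cache
    r  -- connections to get a datestamp
    """
    return wi * n + r


def cost_paths(rc, t, rl):
    """Connections needed to build and use a lexical signature.

    rc -- connections to get a cached copy
    t  -- connections required for the terms (one per term looked up)
    rl -- connections to use the lexical signature
    """
    return rc + t + rl


# Values from the slide: three caches, one connection each, one datestamp.
print(cost_cache(3, 1, 1))
# Example only: five terms that each need a document-frequency lookup.
print(cost_paths(1, 5, 1))
```

Since T is the only term that varies, terms whose document frequency an Opal server has already learned reduce the per-request cost directly.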
27Analysis - Cumulative Terms Learned
1 million terms; 30,000 documents; result averages after 100 iterations
28Analysis - Terms Learned Per Document
1 million terms; 30,000 documents; result averages after 100 iterations
29Load Estimation
30Future Work
- Testing on departmental server
- hard to test in-the-small
- Code optimizations
- many short cuts taken for demo system
- Google and Yahoo APIs not used; screen scraping only
- Lexical Signatures
- describe changes over time
- IDF calculation metrics
- is scraping Google valid? is it nice?
- Learning new code
- use OAI-PMH to update the system
- OpenURL resolver
- 404 URL referent
31Conclusions
- Lexical signatures can be generated just-in-time from WI caches as pages disappear
- Many web servers can be easily configured to use a single Opal server
- Multiple Opal servers can harvest each other to learn Terms and URLs more quickly