Just-In-Time Recovery of Missing Web Pages



1
Just-In-Time Recovery of Missing Web Pages
  • Hypertext 2006
  • Odense, Denmark
  • August 25, 2006
  • Terry L. Harrison, Michael L. Nelson
  • Old Dominion University
  • Norfolk, VA, USA

2
Preservation Fortress Model
Five Easy Steps for Preservation
  • Get a lot of
  • Buy a lot of disks, machines, tapes, etc.
  • Hire an army of staff
  • Load a small amount of data
  • Look upon my archive ye Mighty, and despair!

Image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
3
Alternate Models of Preservation
  • Lazy Preservation
  • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
  • Find a good enough replacement web page
  • Shared Infrastructure Preservation
  • Push your content to sites that might preserve it
  • Web Server Enhanced Preservation
  • Use Apache modules to create archival-ready
    resources

Image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
4
Outline
  • The 404 problem
  • Component technologies
  • web infrastructure
  • lexical signatures
  • OAI-PMH
  • Opal
  • architectural description
  • analysis

5
404 Problem
  • Kahle (97) - Average page lifetime 44 days
  • Koehler (99, 04) - 67% of URLs lost in 4 years
  • Lawrence et al. (01) - 23%-53% of URLs in CiteSeer
    papers invalid over 5 year span (3% of invalid
    URLs unfindable)
  • Spinellis (03) - 27% of URLs in CACM/Computer papers
    gone in 5 years
  • Chan et al. (03) - 11 year half-life for URLs in
    D-Lib Magazine articles
  • Nelson & Allen (02) - 3% of objects in digital
    library gone in 1 year

ECDL 1999: good enough page available
PSP 2003: exact copy at new URL
Greynet 99: unavailable at any URL?
6
Web Infrastructure: Refreshing & Migrating
7
Lexical Signatures
  • Robust Hyperlinks Cost Just Five Words Each
  • Phelps & Wilensky (2000)
  • http://www.cs.odu.edu/tharriso/?lex-sig=terry+harrison+thesis+jcdl+awarded
  • Analysis of Lexical Signatures for Improving
    Information Persistence on the World Wide Web
  • Park et al. (2004)
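
A minimal sketch of the robust-hyperlink idea, assuming the five signature terms are carried in a `lex-sig` query parameter as in the example URL on this slide. The function names and the `www.example.org` address are illustrative only, not part of Opal or of the Phelps & Wilensky tools.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def add_lexical_signature(url, terms, param="lex-sig"):
    """Append a lexical signature to a URL as a query parameter."""
    # urlencode encodes spaces as '+', which yields term1+term2+...
    query = urlencode({param: " ".join(terms)})
    separator = "&" if urlsplit(url).query else "?"
    return f"{url}{separator}{query}"

def read_lexical_signature(url, param="lex-sig"):
    """Recover the list of signature terms from a signed URL."""
    values = parse_qs(urlsplit(url).query).get(param, [])
    return values[0].split() if values else []

signed = add_lexical_signature(
    "http://www.example.org/page.html",  # hypothetical page
    ["terry", "harrison", "thesis", "jcdl", "awarded"],
)
terms = read_lexical_signature(signed)
```

If the page later goes 404, the recovered terms can be submitted to a search engine as an ordinary query to look for the page at its new location.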

8
OAI-PMH
Data Providers / Repositories
Service Providers / Harvesters
A repository is a network-accessible server that
can process the 6 OAI-PMH requests. A
repository is managed by a data provider to
expose metadata to harvesters.
A harvester is a client application that issues
OAI-PMH requests. A harvester is operated by a
service provider as a means of collecting
metadata from repositories.
9
OAI-PMH Aggregators
  • aggregators allow for
  • scalability for OAI-PMH
  • load balancing
  • community building
  • discovery

data providers (repositories)
service providers (harvesters)
aggregator
10
Overview of OAI-PMH Verbs
Metadata about the repository
Harvesting verbs
Most verbs take arguments: dates, sets, ids,
metadata formats, and a resumption token (for flow
control)
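
As a rough illustration of how a harvester issues these requests, the sketch below builds OAI-PMH request URLs from a verb and its arguments. The verb and argument names come from the OAI-PMH protocol; the base URL and the `oai_request` helper are assumptions made for the example.

```python
from urllib.parse import urlencode

# The six OAI-PMH verbs, grouped as on this slide.
REPOSITORY_VERBS = {"Identify", "ListMetadataFormats", "ListSets"}
HARVESTING_VERBS = {"ListIdentifiers", "ListRecords", "GetRecord"}

def oai_request(base_url, verb, **arguments):
    """Build an OAI-PMH request URL for the given verb and arguments."""
    if verb not in REPOSITORY_VERBS | HARVESTING_VERBS:
        raise ValueError(f"unknown OAI-PMH verb: {verb}")
    # 'from' is a Python keyword, so callers pass from_ and it is renamed here.
    if "from_" in arguments:
        arguments["from"] = arguments.pop("from_")
    params = {"verb": verb, **arguments}
    return f"{base_url}?{urlencode(params)}"

# Hypothetical repository address, used only for illustration.
request = oai_request(
    "http://opal.example.org/oai",
    "ListRecords",
    metadataPrefix="oai_dc",
    from_="2006-08-01",
)
```

When a response is incomplete, the repository returns a resumption token and the harvester repeats the same verb with `resumptionToken` as its only argument.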
11
Observations
  • One reason why the original Phelps & Wilensky
    vision was never realized is that it required
    a priori LS calculation
  • idea: use the Web Infrastructure to calculate LSs
    as they are needed
  • Mass adoption of a system will occur only if it
    is really, really easy to do so
  • idea: digital preservation systems should require
    only a small number of heroes

12
Description & Use Cases
  • Allow many web servers to use a few Opal servers
    that use the caches of the Web Infrastructure to
    generate Lexical Signatures of recently 404 URLs
    to find either
  • the same page at a new URL
  • example: bookmarked colleague is now 404
  • cached info is not useful
  • similar pages probably not useful
  • a good enough replacement page
  • example: bookmarked recipe is now 404
  • cached info is useful
  • similar pages probably useful

13
Opal Configuration: Configure Two Things
  • edit httpd.conf
  • add / edit custom 404 page
14
Opal High-Level Architecture
1. Interactive User: Get URL X (www.bar.org)
2. Custom 404 page
3. Pagetag redirects User to Opal server (opal.foo.edu)
4. Opal searches WI caches, creates LS
5. Opal gives user navigation options
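
The redirect in step 3 can be pictured as the custom 404 page handing the missing URL to the Opal server. The sketch below only builds that redirect target. The `/opal` path and the `url` parameter name are hypothetical, since the slides do not spell out Opal's actual request format.

```python
from urllib.parse import urlencode

def opal_redirect_url(opal_server, missing_url, param="url"):
    """Build the address a 404 page tag would send the user to.

    The parameter name is an assumption for illustration only.
    """
    # urlencode percent-encodes the missing URL so it survives as one value.
    return f"{opal_server}?{urlencode({param: missing_url})}"

redirect = opal_redirect_url("http://opal.foo.edu/opal", "http://www.bar.org/X")
```

Because the page tag lives in the custom 404 page, the web server itself needs no logic beyond serving that page.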
15
Locating Caches
http://www.google.com/search?hl=en&ie=ISO-8859-1&q=http://www.cs.odu.edu/tharriso
http://search.yahoo.com/search?fr=FP-pull-web-t&ei=UTF8&p=http://www.cs.odu.edu/tharriso
16
Internet Archive
17
WI Caches Last 7-51 days
  • IA caches forever, but
  • may not ever crawl you
  • 12 month latency
  • no internal backups

Frank McCown, Joan A. Smith, Michael L. Nelson,
Johan Bollen, "Reconstructing Websites for the
Lazy Webmaster", arXiv cs.IR/0512069, 2005.
http://arxiv.org/abs/cs.IR/0512069
18
Term Frequency × Inverse Document Frequency
  • Calculating Term Frequency is easy
  • frequency of term in this document
  • Calculating Document Frequency is hard
  • frequency of term in all documents
  • assumes knowledge of entire corpus!
  • Good terms appear
  • frequently in a single document
  • infrequently across all documents
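
A small sketch of how a five-term lexical signature could be picked with TF × IDF. It assumes document frequencies and a corpus size are already available (for example, approximated from search engine result counts). The tokenizer, the natural-log IDF, and the treatment of unseen terms are illustrative choices, not Opal's actual implementation.

```python
import math
import re
from collections import Counter

def lexical_signature(text, document_frequency, corpus_size, size=5):
    """Return the `size` terms with the highest TF x IDF score."""
    terms = re.findall(r"[a-z0-9]+", text.lower())
    term_frequency = Counter(terms)
    scores = {}
    for term, count in term_frequency.items():
        # Assumption: a term with no known DF is treated as appearing once.
        df = document_frequency.get(term, 1)
        scores[term] = count * math.log(corpus_size / df)
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:size]

# Made-up document frequencies over a made-up corpus of 1000 documents.
df_table = {"the": 1000, "pages": 400, "missing": 200, "recovers": 50, "opal": 2}
signature = lexical_signature(
    "Opal recovers missing pages. The Opal the the", df_table, 1000, size=3
)
```

Here "the" scores zero because it appears in every document, while "opal" ranks first because it is frequent in the page and rare in the corpus.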

19
Scraping Google to Approximate DF
  • Frequency of term across all documents
  • How many documents?

20
GUI - Bootstrapping
21
GUI - Learned
22
GUI (cont)
  • simURL="http://www.cs.odu.edu/tharriso/"
    baseURL="http://invivo_test.com"
  • url=http://www.cs.odu.edu/tharriso
  • match=http://www.cs.odu.edu/tharriso/')"
  • Terry Harrison Profile
    Page
    Burning Man Images
    Other Images
  • (not really well sorted, sorry!) Email
    Terry ...
  • (May 2003), AR Zipf Fellowship Awarded to
    Terry Harrison - Press Release
  • ...
    www.cs.odu.edu/
    tharriso/ - 12k -  

23
Opal Server Databases
  • URL database
  • 404 URL → (LS, similarURL1, similarURL2, ...,
    similarURLN)
  • similarURL → (URL, datestamp, votes, Opal server)
  • Term database
  • term → (Opal server, source, datestamp, DF,
    corpus size, IDF)

Define each URL and Term as OAI-PMH Records and
we can harvest what an Opal server has learned
  • can accommodate late arrivers (no cold
    start for them)
  • pool the learning of multiple servers
  • incentives to cooperate
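
One way to picture the two databases is as keyed collections of small records. The field names below follow the tuples listed on this slide. The class names, the in-memory dictionaries, and the vote-based ordering are assumptions for illustration, not Opal's storage layer.

```python
from dataclasses import dataclass, field

@dataclass
class SimilarURL:
    url: str
    datestamp: str      # ISO 8601 date, e.g. "2006-08-25"
    votes: int          # how many users chose this page as a replacement
    opal_server: str    # server that learned this mapping

@dataclass
class MissingURL:
    lexical_signature: list
    similar_urls: list = field(default_factory=list)

@dataclass
class TermRecord:
    opal_server: str
    source: str         # where the DF estimate came from
    datestamp: str
    df: int
    corpus_size: int
    idf: float

def best_candidates(entry, limit=3):
    """Return the similar URLs with the most votes first."""
    return sorted(entry.similar_urls, key=lambda s: s.votes, reverse=True)[:limit]

url_db = {}   # 404 URL -> MissingURL
term_db = {}  # term -> TermRecord

url_db["http://www.bar.org/X"] = MissingURL(
    ["terry", "harrison", "thesis", "jcdl", "awarded"],
    [
        SimilarURL("http://www.bar.org/old/X", "2006-08-01", 1, "opal.foo.edu"),
        SimilarURL("http://www.bar.org/new/X", "2006-08-20", 4, "opal.foo.edu"),
    ],
)
top = best_candidates(url_db["http://www.bar.org/X"], limit=1)
```

Each record carries a datestamp and an originating server, which is the information an OAI-PMH harvester needs to collect records selectively by date.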
24
Opal Synchronization
  • Other architectures possible
  • Harvesting frequency determined by individual
    nodes

[Figure: two groups of Opal servers exchanging Terms and URLs. Labels: Group 1, Group 2, Opal A, Opal D.1, Opal D.2, Opal D.3]
Opal D aggregates D.1-D.3 to Group 1
Opal D aggregates A-C to Group 2
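
A sketch of what pooling harvested knowledge might look like on the receiving side: term records from another Opal server are merged into the local table. Keeping the record with the newest datestamp is an assumed policy chosen for this example. The slides do not state how conflicts are resolved.

```python
def merge_terms(local, harvested):
    """Merge harvested term records into a copy of the local term table."""
    merged = dict(local)
    for term, record in harvested.items():
        current = merged.get(term)
        # ISO 8601 date strings compare correctly as plain strings.
        if current is None or record["datestamp"] > current["datestamp"]:
            merged[term] = record
    return merged

# Made-up records for illustration.
local_terms = {
    "opal": {"df": 120, "datestamp": "2006-07-01", "opal_server": "A"},
    "jcdl": {"df": 900, "datestamp": "2006-08-10", "opal_server": "A"},
}
harvested_terms = {
    "opal": {"df": 150, "datestamp": "2006-08-15", "opal_server": "D"},
    "jcdl": {"df": 850, "datestamp": "2006-06-01", "opal_server": "D"},
    "zipf": {"df": 40, "datestamp": "2006-08-01", "opal_server": "D"},
}
pooled = merge_terms(local_terms, harvested_terms)
```

A newly installed server can start from an empty local table and obtain everything its peers have learned in a single harvest.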
25
Discovery via OAI-PMH
26
Connection Costs
  • Cost_cache = (WI × N) + R
  • WI = number of web infrastructure caches
  • N = connections for each WI
  • R = connections to get a datestamp
  • Cost_paths = Rc + T + Rl
  • Rc = connections to get a cached copy
  • T = connections required for each term
  • Rl = connections to use LS

Cost_cache = (3 × 1) + 1 = 4
Cost_paths = 1 + T + 1
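
Reading the two cost expressions on this slide as Cost_cache = (WI × N) + R and Cost_paths = Rc + T + Rl, the arithmetic can be checked with a few lines of code. The function and parameter names are illustrative. T is treated here as the total number of term lookups, which is an assumption.

```python
def cost_cache(num_caches, connections_per_cache, datestamp_connections):
    """Connections needed to locate a cached copy: (WI x N) + R."""
    return num_caches * connections_per_cache + datestamp_connections

def cost_paths(cached_copy_connections, term_connections, ls_connections):
    """Connections needed to build and use a lexical signature: Rc + T + Rl."""
    return cached_copy_connections + term_connections + ls_connections

# Values from the slide: 3 caches, 1 connection each, 1 datestamp lookup.
cache_total = cost_cache(3, 1, 1)

# With Rc = 1 and Rl = 1, the path cost grows linearly in T.
path_totals = [cost_paths(1, t, 1) for t in (0, 5, 10)]
```

The term lookups dominate the path cost, which is why sharing already-learned terms between Opal servers reduces the number of outgoing connections.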
27
Analysis - Cumulative Terms Learned
1 million terms, 30,000 documents; results averaged
over 100 iterations
28
Analysis - Terms Learned Per Document
1 million terms, 30,000 documents; results averaged
over 100 iterations
29
Load Estimation
30
Future Work
  • Testing on departmental server
  • hard to test in-the-small
  • Code optimizations
  • many short cuts taken for demo system
  • Google & Yahoo APIs not used; screen scraping only
  • Lexical Signatures
  • describe changes over time
  • IDF calculation metrics
  • is scraping Google valid? is it nice?
  • Learning new code
  • use OAI-PMH to update the system
  • OpenURL resolver
  • 404 URL referent

31
Conclusions
  • Lexical signatures can be generated just-in-time
    from WI caches as pages disappear
  • Many web servers can be easily configured to use
    a single Opal server
  • Multiple Opal servers can harvest each other to
    learn Terms and URLs more quickly