Title: Thinking Differently About Web Page Preservation
1. Thinking Differently About Web Page Preservation
- Michael L. Nelson, Frank McCown, Joan A. Smith
- Old Dominion University
- Norfolk, VA
- {mln, fmccown, jsmit}@cs.odu.edu
- Library of Congress
- Brown Bag Seminar
- June 29, 2006
- Research supported in part by NSF, Library of Congress, and the Andrew Mellon Foundation
2. Background
- We can't save everything!
- If not everything, then how much?
- What does "save" mean?
3. Women and Children First
HMS Birkenhead, Danger Point, 1852
638 passengers
193 survivors
all 7 women & 13 children survived
image from http://www.btinternet.com/palmiped/Birkenhead.htm
4. We should probably save a copy of this
5. Or maybe we don't have to: the Wikipedia link is in the top 10, so we're OK, right?
6. Surely we're saving copies of this
7. 2 copies in the UK & 2 Dublin Core records. That's probably good enough, right?
8. What about the things that we know we don't need to keep?
You DO support recycling, right?
9. A higher moral calling for pack rats?
10. Just Keep the Important Stuff!
11. Lessons Learned from the AIHT
(Boring stuff: D-Lib Magazine, December 2005)
images from http://facweb.cs.depaul.edu/sgrais/collage.htm
12. Preservation: Fortress Model
Five Easy Steps for Preservation:
- Get a lot of money
- Buy a lot of disks, machines, tapes, etc.
- Hire an army of staff
- Load a small amount of data
- "Look upon my archive, ye Mighty, and despair!"
image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
13. Alternate Models of Preservation
- Lazy Preservation
  - Let Google, IA et al. preserve your website
- Just-In-Time Preservation
  - Wait for it to disappear first, then grab a good enough version
- Shared Infrastructure Preservation
  - Push your content to sites that might preserve it
- Web Server Enhanced Preservation
  - Use Apache modules to create archival-ready resources
image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
14. Lazy Preservation: How much preservation do I get if I do nothing?
15. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
16. (No transcript)
17. Web Infrastructure
18. (No transcript)
19. Cost of Preservation
20. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
21. Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?
22. Prior Work
- Is website reconstruction from the WI feasible?
  - Web repositories: G, M, Y, IA (Google, MSN, Yahoo, Internet Archive)
  - Web-repository crawler: Warrick
  - Reconstructed 24 websites
- How long do search engines keep cached content after it is removed?
23. Timeline of SE Resource Acquisition and Release
(tca: time the resource is cached; tr: time it is removed from the web server; tcr: time it is removed from the cache)
- Vulnerable: resource not yet cached (tca is not defined)
- Replicated: resource available on both the web server and the SE cache (tca < current time < tr)
- Endangered: resource removed from the web server but still cached (tr < current time < tcr)
- Unrecoverable: resource missing from both the web server and the cache (tca < tcr < current time); see the classification sketch after the references
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical report, arXiv cs.IR/0512069, 2005.
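The four states follow directly from the three timestamps above; a minimal classification sketch in Python, with the timestamp values assumed to come from server logs and cache probes:

import datetime
from typing import Optional

def resource_state(now: datetime.datetime,
                   t_ca: Optional[datetime.datetime],   # time cached
                   t_r: Optional[datetime.datetime],    # time removed from server
                   t_cr: Optional[datetime.datetime]) -> str:  # time removed from cache
    """Classify a resource per the slide-23 timeline."""
    if t_ca is None:
        return "vulnerable"      # never cached
    if t_cr is not None and t_cr <= now:
        return "unrecoverable"   # gone from both server and cache
    if t_r is not None and t_r <= now:
        return "endangered"      # gone from server, still cached
    return "replicated"          # on server and in cache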
24. (No transcript)
25. (No transcript)
26. Cached Image
27. Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions]
28. Web Repository Characteristics
- C: canonical version is stored
- M: modified version is stored (modified images are thumbnails; all others are HTML conversions)
- R: indexed but not retrievable
- S: indexed but not stored
29. SE Caching Experiment
- Create HTML, PDF, and image resources
- Place files on 4 web servers
- Remove files on a regular schedule
- Examine web server logs to determine when each page is crawled and by whom
- Query each search engine daily using a unique identifier to see if it has cached the page or image (see the sketch below)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
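A minimal sketch of the daily cache check; the query URL pattern and the result test are illustrative assumptions, not the experiment's actual scripts (which also followed each engine's "cached" link to verify the copy):

import urllib.parse
import urllib.request

def appears_in_results(engine_search_url: str, unique_id: str) -> bool:
    """Query a search engine's results page for a unique identifier
    that was planted in one of our test pages."""
    url = engine_search_url + urllib.parse.quote(unique_id)
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/0.1"})
    with urllib.request.urlopen(req) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    return unique_id in page

# e.g., run once per day per engine per planted identifier:
# appears_in_results("http://www.google.com/search?q=", "mln-test-839261")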
30. Caching of HTML Resources (mln)
31. Reconstructing a Website
[Diagram: Warrick starts from an original URL, queries each web repository's results page for cached URLs, and writes each retrieved resource to the file system.]
- Pull resources from all web repositories
- Strip off the extra header and footer HTML
- Store the most recently cached version (or the canonical version)
- Parse the HTML for links to other resources (a sketch of this loop follows)
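A minimal sketch of the crawl loop just described; the repository lookup and the header/footer stripping are stubbed out, since the real Warrick logic is repository-specific:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href/src attributes so the frontier can grow."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def reconstruct(start_url, repositories, store):
    """repositories: objects with lookup(url) -> (timestamp, content) or None;
       store: callable(url, content). Both are hypothetical interfaces."""
    frontier, seen = deque([start_url]), {start_url}
    while frontier:
        url = frontier.popleft()
        # pull from every repository, keep the most recently cached copy
        hits = [r.lookup(url) for r in repositories]
        hits = [h for h in hits if h is not None]
        if not hits:
            continue  # like resource F on slide 32: not found anywhere
        _, content = max(hits, key=lambda h: h[0])
        store(url, content)  # after stripping repository header/footer chrome
        if isinstance(content, str):  # parse HTML for further links
            parser = LinkParser()
            parser.feed(content)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)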
32. How Much Did We Reconstruct?
[Diagram: the lost website (resources A-G) next to the reconstructed website. Annotations: the link to D is missing; a link points to old resource G; F can't be found at all.]
33. Reconstruction Diagram
[Diagram: identical 50%, changed 33%, missing 17%, added 20%]
34. Websites to Reconstruct
- Reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
- Use Wget to download the current website
- Use Warrick to reconstruct it
- Calculate the reconstruction vector (see the sketch below)
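The reconstruction vector can be computed per-resource from the categories on slide 33; a minimal sketch, assuming each URL's content hash from the original and the recovered site, and assuming normalization by original-site size (one plausible choice):

def reconstruction_vector(original: dict, recovered: dict):
    """original/recovered map URL -> content hash.
    Returns (changed, missing, added) fractions, mirroring the
    changed/missing/added percentages on slide 33."""
    changed = sum(1 for u, h in original.items()
                  if u in recovered and recovered[u] != h)
    missing = sum(1 for u in original if u not in recovered)
    added = sum(1 for u in recovered if u not in original)
    n = len(original) or 1
    return (changed / n, missing / n, added / n)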
35. Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical report, arXiv cs.IR/0512069, 2005.
36. Aggregation of Websites
37. Web Repository Contributions
38. Warrick Milestones
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially blesses Warrick (mid Mar 2006)[1]
[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
39. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
40. Proposed Work
- How lazy can we afford to be?
  - Find factors influencing the success of website reconstruction from the WI
  - Perform search engine cache characterization
  - Inject server-side components into the WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies (Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006)
  - Develop a web-repository API for inclusion in Warrick
41. Factors Influencing Website Recoverability from the WI
- Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download the sites several times over the same period to capture change rates
42. Evaluation
- Use statistical analysis to test for the following factors:
  - Size
  - Makeup
  - Path depth
  - PageRank
  - Change rate
- Create a predictive model: how much of my lost website do I expect to get back?
43. Marshall TR Server running EPrints
44. We can recover the missing page and PDF, but what about the services?
45. Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions are stored by all web repos
  - Text-based
  - Comments can be inserted without changing the appearance of the page
- Injection: use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages (a sketch follows)
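A minimal sketch of the injection side. For clarity it uses plain splitting rather than a real erasure code (a production version would use something like Reed-Solomon so that any r of n chunks suffice), and the chunk-comment format is an illustrative assumption:

import base64

def make_chunks(server_file: bytes, file_id: str, n: int):
    """Split a server-side file into n base64 chunks, each wrapped in an
    HTML comment that a recovery crawler can later find and reassemble.
    NOTE: simple splitting, not a true erasure code; with a real (n, r)
    erasure code any r recovered chunks would reconstruct the file."""
    data = base64.b64encode(server_file).decode("ascii")
    size = -(-len(data) // n)  # ceiling division
    return [f"<!-- wi-chunk id={file_id} part={i}/{n} "
            f"{data[i * size:(i + 1) * size]} -->"
            for i in range(n)]  # embed one comment per crawlable HTML page

def recover(comments, file_id: str, n: int) -> bytes:
    """Reassemble the file from comments harvested out of cached pages."""
    parts = {}
    for c in comments:
        fields = c.strip("<!- >").split()
        if fields[0] == "wi-chunk" and fields[1] == f"id={file_id}":
            idx = int(fields[2].split("=")[1].split("/")[0])
            parts[idx] = fields[3]
    data = "".join(parts[i] for i in range(n))  # fails if any part is missing
    return base64.b64decode(data)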
46. Recover Server File from WI
47. Evaluation
- Find the most efficient values for n and r (chunks created / chunks recovered)
- Security
  - Develop a simple mechanism for selecting files that can be injected into the WI
  - Address encryption issues
- Reconstruct an EPrints website with a few hundred resources
48. SE Cache Characterization
- Web characterization is an active field
- Search engine caches have never been characterized
- Methodology
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Download the cached version and the live version from the Web
  - Examine HTTP headers and page content
  - Test for overlap with the Internet Archive (see the sketch below)
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
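One way to test Internet Archive overlap for a sampled URL; this sketch uses the Wayback Machine's availability endpoint, which postdates the talk and stands in here for whatever lookup the study actually used:

import json
import urllib.parse
import urllib.request

def in_internet_archive(url: str) -> bool:
    """Return True if the Wayback Machine reports any snapshot of `url`."""
    api = ("http://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api) as resp:
        info = json.load(resp)
    return bool(info.get("archived_snapshots"))

# e.g. in_internet_archive("www.cs.odu.edu") -> True/False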
49. Summary: Lazy Preservation
- When this work is completed, we will have
  - demonstrated and evaluated the lazy preservation technique
  - provided a reference implementation
  - characterized SE caching behavior
  - provided a layer of abstraction on top of SE behavior (API)
  - explored how much we can store in the WI (server-side vs. client-side representations)
50. Web Server Enhanced Preservation: How much preservation do I get if I do just a little bit?
51. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
52. WWW and DL: Separate Worlds
[Diagram: from 1994 to today, the WWW ("Crawlapalooza") and the DL ("Harvester Home Companion") remain separate worlds.]
The problem is not that the WWW doesn't work; it clearly does. The problem is that our (preservation) expectations have been lowered.
53. Data Providers / Repositories vs. Service Providers / Harvesters
- A repository is a network-accessible server that can process the 6 OAI-PMH requests. A repository is managed by a data provider to expose metadata to harvesters.
- A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.
54. Aggregators
- Aggregators sit between data providers (repositories) and service providers (harvesters), and allow for
  - scalability for OAI-PMH
  - load balancing
  - community building
  - discovery
55. OAI-PMH Data Model
[Diagram: an OAI-PMH identifier is the entry point to all records pertaining to a resource; each record carries metadata pertaining to that resource.]
56. OAI-PMH Used by Google & AcademicLive (MSN)
- Why support OAI-PMH?
  - These guys are in business (i.e., for profit)
  - How does OAI-PMH help their bottom line?
  - By improving the search and analysis process
57. Resource Harvesting with OAI-PMH
[Diagram: the OAI-PMH identifier is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource, in formats ranging from simple (e.g., Dublin Core) through more expressive to highly expressive complex-object formats.]
58. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
59. Two Problems
60. mod_oai Solution
- Integrate OAI-PMH functionality into the web server itself
- mod_oai: an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
  - written in C
  - respects values in .htaccess, httpd.conf
- Compile mod_oai on http://www.foo.edu/
  - baseURL is now http://www.foo.edu/modoai
- Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

[Diagram: the human-readable web site, prepped for machine-friendly harvesting.]
The request above says: "Give me a list of all resources, including Dublin Core metadata, dating from 2004-09-15 through today, that are of MIME type video/mpeg."
61. A Crawler's View of the Web Site
[Diagram: crawled pages under the web root, surrounded by pages that are not crawled because they are protected, generated on-the-fly (e.g., by CGI), excluded by robots.txt or a robots META tag, unadvertised & unlinked, linked only from a remote web site, or too deep.]
62. Apache's View of the Web Site
[Diagram: under the web root, Apache sees resources that require authentication, are generated on-the-fly (e.g., CGI), are tagged "no robots", or are unknown/not visible.]
63. The Problem: Defining "The Whole Site"
- For a given server, there is a set of URLs, U, and a set of files, F
  - Apache maps U -> F
  - mod_oai maps F -> U
  - Neither function is 1-1 nor onto
  - We can easily check whether a single u maps into F, but given F we cannot (easily) generate U
- Short-term issues
  - dynamic files: exporting unprocessed server-side files would be a security hole
  - IndexIgnore: httpd will hide valid URLs
  - file permissions: httpd will advertise files it cannot read
- Long-term issues
  - Alias, Location: files can be covered up by the httpd
  - UserDir: interactions between the httpd and the filesystem
64. A Webmaster's Omniscient View
[Diagram: the webmaster sees everything under the web root: dynamic, authenticated, "no robots"-tagged, orphaned, deep, and unknown/not-visible resources.]
65. HTTP GET versus OAI-PMH GetRecord
[Diagram: the same Apache web server, with mod_oai, answers both requests.]
- Human-readable: GET /headlines.html HTTP/1.1
- Machine-readable: GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl
  - returns a complex object bundling the resource with metadata such as JHOVE output, MD5, and ls
66. OAI-PMH Data Model in mod_oai
http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
[Diagram: the OAI-PMH identifier (here, the URL above) is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource.]
67. Complex Objects That Tell A Story
http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL ("Russian Nesting Doll")

<didl>
  <metadata source="jhove">...</metadata>
  <metadata source="file">...</metadata>
  <metadata source="essence">...</metadata>
  <metadata source="grep">...</metadata>
  ...
  <resource mimeType="application/pdf"
            identifier="http://foo.edu/bar.pdf"
            encoding="base64">
    SADLFJSALDJF...SLDKFJASLDJ
  </resource>
</didl>

- Resource and metadata packaged together as a complex digital object represented via an XML wrapper
- Uniform solution for simple & compound objects
- Unambiguous expression of the locator of a datastream
- Disambiguation between locators & identifiers
- OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
- OAI-PMH semantics apply to containers and set membership
- First came Lenin
- Then came Stalin
68. Resource Discovery: ListIdentifiers
- HARVESTER
  - issues a ListIdentifiers
  - finds URLs of updated resources
  - does HTTP GETs for the updates only
  - can get URLs of resources with specified MIME types
69. Preservation: ListRecords
- HARVESTER
  - issues a ListRecords
  - gets updates as MPEG-21 DIDL documents (HTTP headers, resource By-Value or By-Reference)
  - can get resources with specified MIME types
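A minimal sketch of the slide-68 discovery pattern: ask mod_oai what changed since the last harvest, then GET only those URLs. It reuses the hypothetical list_identifiers helper from the slide-60 sketch, and assumes (as on slide 66) that the OAI-PMH identifier corresponds to the resource's URL:

import urllib.request

def incremental_harvest(base_url: str, last_harvest_date: str, save) -> None:
    """Fetch only resources updated since last_harvest_date (YYYY-MM-DD)."""
    for identifier in list_identifiers(base_url,
                                       metadataPrefix="oai_dc",
                                       **{"from": last_harvest_date}):
        with urllib.request.urlopen(identifier) as resp:
            save(identifier, resp.read())

# incremental_harvest("http://www.foo.edu/modoai", "2006-06-01",
#                     save=lambda url, body: print(url, len(body)))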
70. What Does This Mean?
- For an entire web site, we can
  - serialize everything as an XML stream
  - extract it using off-the-shelf OAI-PMH harvesters
  - efficiently discover updates & additions
- For each URL, we can
  - create a preservation-ready version with configurable descriptive/technical/structural metadata
  - e.g., JHOVE output, datestamps, signatures, provenance, automatically generated summary, etc.
[Diagram: harvest the resource, extract metadata (JHOVE & other pertinent info, lexical signatures, summaries, an index & translations), wrap it all together in an XML stream: ready for the future.]
71. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
72. Research Contributions
- Thesis question: How well can Apache support web page preservation?
- Goal: to make web resources preservation-ready
  - Support refreshing (how many URLs at this site?): the counting problem
  - Support migration (what is this object?): the representation problem
- How: using OAI-PMH resource harvesting
  - Aggregate forensic metadata
  - Automate extraction
  - Encapsulate into an object
  - XML stream of information
  - Maximize preservation opportunity
  - Bring DL technology into the realm of the WWW
73. Experimentation & Evaluation
- Research solutions to the counting problem
  - Different tools yield different results: Google Sitemap != Apache file list != robot-crawled pages
  - Combine approaches for one automated, full URL listing (see the sketch below)
  - Apache logs are a detailed history of site activity
  - Compare user page requests with crawler requests
  - Compare crawled pages with the actual site tree
- Continue research on the representation problem
  - Integrate utilities into mod_oai (JHOVE, etc.)
  - Automate metadata extraction & encapsulation
- Serialize and reconstitute
  - complete back-up of a site & reconstitution through an XML stream
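A minimal sketch of combining the three URL sources named above into one listing, and of surfacing their disagreements; the three input sets are assumed to have been collected elsewhere (sitemap parse, filesystem walk, log analysis):

def url_census(sitemap_urls: set, apache_file_urls: set,
               crawled_urls: set) -> dict:
    """Union the three views of 'the whole site' and report disagreements,
    e.g. URLs Apache can serve that no crawler ever fetched."""
    return {
        "full_listing": sitemap_urls | apache_file_urls | crawled_urls,
        "never_crawled": apache_file_urls - crawled_urls,
        "crawled_but_unlisted": crawled_urls - sitemap_urls,
        "dynamic_or_orphaned": crawled_urls - apache_file_urls,
    }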
74. Summary: Web Server Enhanced Preservation
- Better web harvesting can be achieved through
  - OAI-PMH: structured access to updates
  - complex object formats: modeled representation of digital objects
- Addresses 2 key problems
  - Preservation (ListRecords): the representation problem
  - Web crawling (ListIdentifiers): the counting problem
- mod_oai reference implementation
  - Better performance than wget & crawlers
  - not a replacement for DSpace, Fedora, eprints.org, etc.
- More info
  - http://www.modoai.org/
  - http://whiskey.cs.odu.edu/
- Automatic harvesting of web resources, rich in metadata, packaged for the future
  - Today: manual. Tomorrow: automatic!
75. Summary
76. Summary
- Digital preservation is not hard, it's just big.
- Save the women and children first, of course, but there is room for many more
- Using the by-products of SEs and the WI, we can get a good amount of preservation for free
  - prediction: Google et al. will eventually see preservation as a business opportunity
- Increasing the role of the web server will solve most digital preservation problems
  - complex objects + OAI-PMH = digital preservation solution
77. "As you know, you preserve the files you have. They're not the files you might want or wish to have at a later time... if you think about it, you can have all the metadata in the world on a file and a file can be blown up."
image from http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html
78. Overview of OAI-PMH Verbs
- Metadata about the repository: Identify, ListMetadataFormats, ListSets
- Harvesting verbs: GetRecord, ListRecords, ListIdentifiers
- Most verbs take arguments: dates, sets, ids, metadata formats, and a resumption token (for flow control)
79. Enhancing Apache's Utility as a Preservation Tool
- Create a partnership between server and SE
  - Apache can serve up details about the site: the accessible portions of the site tree and changes, including additions and deletions
  - The SE can reduce crawl time and subsequent index/update times
  - Google: "Hi Apache! What's new?"
  - Apache: "Hi Google! I've got 3 new pages: xyz/blah1.html, yyy/bug2.html, and ru2.html. Oh, and I also deleted xyz/boo.html."
- Use OAI-PMH to facilitate the conversation between the SE and the server
- The data model offers many advantages
  - Both content-rich and metadata-rich
  - Supports complex objects
- The protocol's 6 verbs mesh well with the SE and server roles
  - Identify, ListMetadataFormats, ListSets, GetRecord, ListRecords, ListIdentifiers
- Enable a policy-driven relationship between site & SE
  - push content-rich harvesting to the web community
80. OAI-PMH Concepts: Typical Repository
81. OAI-PMH Concepts: mod_oai
82. OAI-PMH Data Model
[Diagram: the OAI-PMH identifier is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource, in formats ranging from simple through more expressive to highly expressive.]
83. Warrick API
- The API should provide a clear and flexible interface for web repositories
- Goals
  - Shield Warrick from changes to the WI
  - Facilitate inclusion of new web repositories
  - Minimize implementation and maintenance costs
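A minimal sketch of what such a repository abstraction could look like; the class and method names are hypothetical, not Warrick's actual API:

from abc import ABC, abstractmethod
from datetime import datetime
from typing import Optional, Tuple

class WebRepository(ABC):
    """One search engine cache or web archive. New repositories plug in
    by subclassing, shielding the Warrick core from WI changes."""

    @abstractmethod
    def lookup(self, url: str) -> Optional[Tuple[datetime, bytes]]:
        """Return (cache datestamp, cached content), or None if not held."""

    @abstractmethod
    def strip_chrome(self, content: bytes) -> bytes:
        """Remove the repository's added header/footer markup."""

class InternetArchiveRepo(WebRepository):
    def lookup(self, url):
        ...  # query the Wayback Machine for the most recent snapshot
    def strip_chrome(self, content):
        ...  # IA rewrites links in cached pages; undo that here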
84. Evaluation
- Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption & modification
85. Performance of mod_oai and wget on www.cs.odu.edu
For more detail: mod_oai: An Apache Module for Metadata Harvesting, http://arxiv.org/abs/cs.DL/0503069
86. IndexIgnore & File Permissions
87. Alias: Covering Up Files
httpd.conf:
  Alias /A /usr/local/web/htdocs/B
  Alias /B /usr/local/web/htdocs/A
The files A and B will be different from the URLs http://server/A and http://server/B.
88. UserDir: Just-in-Time Mounting of Directories
whiskey.cs.odu.edu/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
89. Complex Object Formats: Characteristics
- Representation of a digital object by means of a wrapper XML document
- The represented resource can be
  - a simple digital object (consisting of a single datastream)
  - a compound digital object (consisting of multiple datastreams)
- Unambiguous approach to convey identifiers of the digital object and its constituent datastreams
- Datastreams can be included
  - By-Value: embedding of the base64-encoded datastream
  - By-Reference: embedding the network location of the datastream
  - not mutually exclusive & equivalent
- A variety of secondary information can be included, By-Value or By-Reference
  - descriptive metadata, rights information, technical metadata, ...
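A minimal sketch of the By-Value / By-Reference choice using generic XML; the element names are simplified stand-ins, not the actual MPEG-21 DIDL schema:

import base64
import xml.etree.ElementTree as ET

def wrap_resource(identifier: str, mime: str,
                  data: bytes = None) -> ET.Element:
    """Package a datastream By-Value (base64 inside the wrapper) when
    data is supplied, else By-Reference (network location only)."""
    obj = ET.Element("object", identifier=identifier)
    res = ET.SubElement(obj, "resource", mimeType=mime)
    if data is not None:  # By-Value
        res.set("encoding", "base64")
        res.text = base64.b64encode(data).decode("ascii")
    else:                 # By-Reference: a locator, distinct from the identifier
        res.set("ref", identifier)
    return obj

# ET.dump(wrap_resource("http://foo.edu/bar.pdf", "application/pdf", b"%PDF..."))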
90. Resource Harvesting: Use Cases
- Discovery: use the content itself in the creation of services
  - search engines that make full text searchable
  - citation indexing systems that extract references from the full-text content
  - browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
- Preservation
  - periodically transfer digital content from a data repository to one or more trusted digital repositories
  - trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html