Title: Thinking Differently About Web Page Preservation
1. Thinking Differently About Web Page Preservation
- Michael L. Nelson, Frank McCown, Joan A. Smith
- Old Dominion University
- Norfolk, VA
- {mln, fmccown, jsmit}@cs.odu.edu
- Library of Congress
- Brown Bag Seminar
- June 29, 2006
- Research supported in part by NSF, Library of Congress, and the Andrew Mellon Foundation
2. Background
- We can't save everything!
- If not everything, then how much?
- What does "save" mean?
3. Women and Children First
HMS Birkenhead, Danger Point, 1852
638 passengers
193 survivors
all 7 women & 13 children survived
image from http://www.btinternet.com/palmiped/Birkenhead.htm
4. We should probably save a copy of this
5. Or maybe we don't have to: the Wikipedia link is in the top 10, so we're OK, right?
6. Surely we're saving copies of this
7. 2 copies in the UK & 2 Dublin Core records. That's probably good enough, right?
8. What about the things that we know we don't need to keep?
You DO support recycling, right?
9. A higher moral calling for pack rats?
10. Just Keep the Important Stuff!
11. Lessons Learned from the AIHT
(Boring stuff: D-Lib Magazine, December 2005)
images from http://facweb.cs.depaul.edu/sgrais/collage.htm
12. Preservation: Fortress Model
Five Easy Steps for Preservation:
- Get a lot of money
- Buy a lot of disks, machines, tapes, etc.
- Hire an army of staff
- Load a small amount of data
- "Look upon my archive, ye Mighty, and despair!"
image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
13. Alternate Models of Preservation
- Lazy Preservation
  - Let Google, IA et al. preserve your website
- Just-In-Time Preservation
  - Wait for it to disappear first, then grab a good enough version
- Shared Infrastructure Preservation
  - Push your content to sites that might preserve it
- Web Server Enhanced Preservation
  - Use Apache modules to create archival-ready resources
image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
14. Lazy Preservation: How much preservation do I get if I do nothing?
15. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
16. (No transcript)
17. Web Infrastructure
18. (No transcript)
19. Cost of Preservation
20. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
21. Research Questions
- How much digital preservation of websites is afforded by lazy preservation?
- Can we reconstruct entire websites from the WI?
- What factors contribute to the success of website reconstruction?
- Can we predict how much of a lost website can be recovered?
- How can the WI be utilized to provide preservation of server-side components?
22. Prior Work
- Is website reconstruction from the WI feasible?
  - Web repositories: G, M, Y, IA (Google, MSN, Yahoo, Internet Archive)
  - Web-repository crawler: Warrick
  - Reconstructed 24 websites
- How long do search engines keep cached content after it is removed?
23. Timeline of SE Resource Acquisition and Release
(tca: time the resource is cached; tr: time it is removed from the web server; tcr: time it is removed from the cache)
- Vulnerable: resource not yet cached (tca is not defined)
- Replicated: resource available on both the web server and the SE cache (tca < current time < tr)
- Endangered: resource removed from the web server but still cached (tr < current time < tcr)
- Unrecoverable: resource missing from both the web server and the cache (tca < tcr < current time); see the classification sketch after the references
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical report, arXiv cs.IR/0512069, 2005.
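The four states follow directly from the three timestamps above; a minimal classification sketch in Python, with the timestamp values assumed to come from server logs and cache probes:

import datetime
from typing import Optional

def resource_state(now: datetime.datetime,
                   t_ca: Optional[datetime.datetime],   # time cached
                   t_r: Optional[datetime.datetime],    # time removed from server
                   t_cr: Optional[datetime.datetime]) -> str:  # time removed from cache
    """Classify a resource per the slide-23 timeline."""
    if t_ca is None:
        return "vulnerable"      # never cached
    if t_cr is not None and t_cr <= now:
        return "unrecoverable"   # gone from both server and cache
    if t_r is not None and t_r <= now:
        return "endangered"      # gone from server, still cached
    return "replicated"          # on server and in cache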
24. (No transcript)
25. (No transcript)
26. Cached Image
27. Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Figure: the canonical PDF alongside the MSN, Yahoo, and Google cached versions]
28. Web Repository Characteristics
- C: canonical version is stored
- M: modified version is stored (modified images are thumbnails; all others are HTML conversions)
- R: indexed but not retrievable
- S: indexed but not stored
29. SE Caching Experiment
- Create HTML, PDF, and image resources
- Place files on 4 web servers
- Remove files on a regular schedule
- Examine web server logs to determine when each page is crawled and by whom
- Query each search engine daily using a unique identifier to see if it has cached the page or image (see the sketch below)
Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
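A minimal sketch of the daily cache check; the query URL pattern and the result test are illustrative assumptions, not the experiment's actual scripts (which also followed each engine's "cached" link to verify the copy):

import urllib.parse
import urllib.request

def appears_in_results(engine_search_url: str, unique_id: str) -> bool:
    """Query a search engine's results page for a unique identifier
    that was planted in one of our test pages."""
    url = engine_search_url + urllib.parse.quote(unique_id)
    req = urllib.request.Request(url, headers={"User-Agent": "cache-probe/0.1"})
    with urllib.request.urlopen(req) as resp:
        page = resp.read().decode("utf-8", errors="replace")
    return unique_id in page

# e.g., run once per day per engine per planted identifier:
# appears_in_results("http://www.google.com/search?q=", "mln-test-839261")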
30. Caching of HTML Resources (mln)
31. Reconstructing a Website
[Diagram: Warrick starts from an original URL, queries each web repository's results page for cached URLs, and writes each retrieved resource to the file system.]
- Pull resources from all web repositories
- Strip off the extra header and footer HTML
- Store the most recently cached version (or the canonical version)
- Parse the HTML for links to other resources (a sketch of this loop follows)
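A minimal sketch of the crawl loop just described; the repository lookup and the header/footer stripping are stubbed out, since the real Warrick logic is repository-specific:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect href/src attributes so the frontier can grow."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def reconstruct(start_url, repositories, store):
    """repositories: objects with lookup(url) -> (timestamp, content) or None;
       store: callable(url, content). Both are hypothetical interfaces."""
    frontier, seen = deque([start_url]), {start_url}
    while frontier:
        url = frontier.popleft()
        # pull from every repository, keep the most recently cached copy
        hits = [r.lookup(url) for r in repositories]
        hits = [h for h in hits if h is not None]
        if not hits:
            continue  # like resource F on slide 32: not found anywhere
        _, content = max(hits, key=lambda h: h[0])
        store(url, content)  # after stripping repository header/footer chrome
        if isinstance(content, str):  # parse HTML for further links
            parser = LinkParser()
            parser.feed(content)
            for link in parser.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)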
32. How Much Did We Reconstruct?
[Diagram: the lost website (resources A-G) next to the reconstructed website. Annotations: the link to D is missing; a link points to old resource G; F can't be found at all.]
33. Reconstruction Diagram
[Diagram: identical 50%, changed 33%, missing 17%, added 20%]
34. Websites to Reconstruct
- Reconstruct 24 sites in 3 categories:
  1. small (1-150 resources)
  2. medium (150-499 resources)
  3. large (500+ resources)
- Use Wget to download the current website
- Use Warrick to reconstruct it
- Calculate the reconstruction vector (see the sketch below)
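The reconstruction vector can be computed per-resource from the categories on slide 33; a minimal sketch, assuming each URL's content hash from the original and the recovered site, and assuming normalization by original-site size (one plausible choice):

def reconstruction_vector(original: dict, recovered: dict):
    """original/recovered map URL -> content hash.
    Returns (changed, missing, added) fractions, mirroring the
    changed/missing/added percentages on slide 33."""
    changed = sum(1 for u, h in original.items()
                  if u in recovered and recovered[u] != h)
    missing = sum(1 for u in original if u not in recovered)
    added = sum(1 for u in recovered if u not in original)
    n = len(original) or 1
    return (changed / n, missing / n, added / n)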
35. Results
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical report, arXiv cs.IR/0512069, 2005.
36. Aggregation of Websites
37. Web Repository Contributions
38. Warrick Milestones
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially blesses Warrick (mid Mar 2006)[1]
[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
39. Outline: Lazy Preservation
- Web Infrastructure as a Resource
- Reconstructing Web Sites
- Research Focus
40. Proposed Work
- How lazy can we afford to be?
  - Find factors influencing the success of website reconstruction from the WI
  - Perform search engine cache characterization
  - Inject server-side components into the WI for complete website reconstruction
- Improving the Warrick crawler
  - Evaluate different crawling policies (Frank McCown and Michael L. Nelson, Evaluation of Crawling Policies for a Web-repository Crawler, ACM Hypertext 2006)
  - Develop a web-repository API for inclusion in Warrick
41. Factors Influencing Website Recoverability from the WI
- Previous study did not find a statistically significant relationship between recoverability and website size or PageRank
- Methodology
  - Sample a large number of websites from dmoz.org
  - Perform several reconstructions over time using the same policy
  - Download the sites several times over the same period to capture change rates
42. Evaluation
- Use statistical analysis to test for the following factors:
  - Size
  - Makeup
  - Path depth
  - PageRank
  - Change rate
- Create a predictive model: how much of my lost website do I expect to get back?
43. Marshall TR Server running EPrints
44. We can recover the missing page and PDF, but what about the services?
45. Recovery of Web Server Components
- Recovering the client-side representation is not enough to reconstruct a dynamically produced website
- How can we inject the server-side functionality into the WI?
- Web repositories like HTML:
  - Canonical versions are stored by all web repos
  - Text-based
  - Comments can be inserted without changing the appearance of the page
- Injection: use erasure codes to break a server file into chunks and insert the chunks into the HTML comments of different pages (a sketch follows)
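A minimal sketch of the injection side. For clarity it uses plain splitting rather than a real erasure code (a production version would use something like Reed-Solomon so that any r of n chunks suffice), and the chunk-comment format is an illustrative assumption:

import base64

def make_chunks(server_file: bytes, file_id: str, n: int):
    """Split a server-side file into n base64 chunks, each wrapped in an
    HTML comment that a recovery crawler can later find and reassemble.
    NOTE: simple splitting, not a true erasure code; with a real (n, r)
    erasure code any r recovered chunks would reconstruct the file."""
    data = base64.b64encode(server_file).decode("ascii")
    size = -(-len(data) // n)  # ceiling division
    return [f"<!-- wi-chunk id={file_id} part={i}/{n} "
            f"{data[i * size:(i + 1) * size]} -->"
            for i in range(n)]  # embed one comment per crawlable HTML page

def recover(comments, file_id: str, n: int) -> bytes:
    """Reassemble the file from comments harvested out of cached pages."""
    parts = {}
    for c in comments:
        fields = c.strip("<!- >").split()
        if fields[0] == "wi-chunk" and fields[1] == f"id={file_id}":
            idx = int(fields[2].split("=")[1].split("/")[0])
            parts[idx] = fields[3]
    data = "".join(parts[i] for i in range(n))  # fails if any part is missing
    return base64.b64decode(data)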
46. Recover Server File from WI
47. Evaluation
- Find the most efficient values for n and r (chunks created / chunks recovered)
- Security
  - Develop a simple mechanism for selecting files that can be injected into the WI
  - Address encryption issues
- Reconstruct an EPrints website with a few hundred resources
48. SE Cache Characterization
- Web characterization is an active field
- Search engine caches have never been characterized
- Methodology
  - Randomly sample URLs from four popular search engines: Google, MSN, Yahoo, Ask
  - Download the cached version and the live version from the Web
  - Examine HTTP headers and page content
  - Test for overlap with the Internet Archive (see the sketch below)
  - Attempt to access various resource types (PDF, Word, PS, etc.) in each SE cache
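One way to test Internet Archive overlap for a sampled URL; this sketch uses the Wayback Machine's availability endpoint, which postdates the talk and stands in here for whatever lookup the study actually used:

import json
import urllib.parse
import urllib.request

def in_internet_archive(url: str) -> bool:
    """Return True if the Wayback Machine reports any snapshot of `url`."""
    api = ("http://archive.org/wayback/available?url="
           + urllib.parse.quote(url, safe=""))
    with urllib.request.urlopen(api) as resp:
        info = json.load(resp)
    return bool(info.get("archived_snapshots"))

# e.g. in_internet_archive("www.cs.odu.edu") -> True/False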
49. Summary: Lazy Preservation
- When this work is completed, we will have
  - demonstrated and evaluated the lazy preservation technique
  - provided a reference implementation
  - characterized SE caching behavior
  - provided a layer of abstraction on top of SE behavior (API)
  - explored how much we can store in the WI (server-side vs. client-side representations)
50. Web Server Enhanced Preservation: How much preservation do I get if I do just a little bit?
51. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
52. WWW and DL: Separate Worlds
[Diagram: from 1994 to today, the WWW ("Crawlapalooza") and the DL ("Harvester Home Companion") remain separate worlds.]
The problem is not that the WWW doesn't work; it clearly does. The problem is that our (preservation) expectations have been lowered.
53. Data Providers / Repositories vs. Service Providers / Harvesters
- A repository is a network-accessible server that can process the 6 OAI-PMH requests. A repository is managed by a data provider to expose metadata to harvesters.
- A harvester is a client application that issues OAI-PMH requests. A harvester is operated by a service provider as a means of collecting metadata from repositories.
54. Aggregators
- Aggregators sit between data providers (repositories) and service providers (harvesters), and allow for
  - scalability for OAI-PMH
  - load balancing
  - community building
  - discovery
55. OAI-PMH Data Model
[Diagram: an OAI-PMH identifier is the entry point to all records pertaining to a resource; each record carries metadata pertaining to that resource.]
56. OAI-PMH Used by Google & AcademicLive (MSN)
- Why support OAI-PMH?
  - These guys are in business (i.e., for profit)
  - How does OAI-PMH help their bottom line?
  - By improving the search and analysis process
57. Resource Harvesting with OAI-PMH
[Diagram: the OAI-PMH identifier is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource, in formats ranging from simple (e.g., Dublin Core) through more expressive to highly expressive complex-object formats.]
58. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
59. Two Problems
60. mod_oai Solution
- Integrate OAI-PMH functionality into the web server itself
- mod_oai: an Apache 2.0 module that automatically answers OAI-PMH requests for an HTTP server
  - written in C
  - respects values in .htaccess, httpd.conf
- Compile mod_oai on http://www.foo.edu/
  - baseURL is now http://www.foo.edu/modoai
- Result: web harvesting with OAI-PMH semantics (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg

[Diagram: the human-readable web site, prepped for machine-friendly harvesting.]
The request above says: "Give me a list of all resources, including Dublin Core metadata, dating from 2004-09-15 through today, that are of MIME type video/mpeg."
61. A Crawler's View of the Web Site
[Diagram: crawled pages under the web root, surrounded by pages that are not crawled because they are protected, generated on-the-fly (e.g., by CGI), excluded by robots.txt or a robots META tag, unadvertised & unlinked, linked only from a remote web site, or too deep.]
62. Apache's View of the Web Site
[Diagram: under the web root, Apache sees resources that require authentication, are generated on-the-fly (e.g., CGI), are tagged "no robots", or are unknown/not visible.]
63. The Problem: Defining "The Whole Site"
- For a given server, there is a set of URLs, U, and a set of files, F
  - Apache maps U -> F
  - mod_oai maps F -> U
  - Neither function is 1-1 nor onto
  - We can easily check whether a single u maps into F, but given F we cannot (easily) generate U
- Short-term issues
  - dynamic files: exporting unprocessed server-side files would be a security hole
  - IndexIgnore: httpd will hide valid URLs
  - file permissions: httpd will advertise files it cannot read
- Long-term issues
  - Alias, Location: files can be covered up by the httpd
  - UserDir: interactions between the httpd and the filesystem
64. A Webmaster's Omniscient View
[Diagram: the webmaster sees everything under the web root: dynamic, authenticated, "no robots"-tagged, orphaned, deep, and unknown/not-visible resources.]
65. HTTP GET versus OAI-PMH GetRecord
[Diagram: the same Apache web server, with mod_oai, answers both requests.]
- Human-readable: GET /headlines.html HTTP/1.1
- Machine-readable: GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl
  - returns a complex object bundling the resource with metadata such as JHOVE output, MD5, and ls
66. OAI-PMH Data Model in mod_oai
http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
[Diagram: the OAI-PMH identifier (here, the URL above) is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource.]
67. Complex Objects That Tell A Story
http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL ("Russian Nesting Doll")

<didl>
  <metadata source="jhove">...</metadata>
  <metadata source="file">...</metadata>
  <metadata source="essence">...</metadata>
  <metadata source="grep">...</metadata>
  ...
  <resource mimeType="application/pdf"
            identifier="http://foo.edu/bar.pdf"
            encoding="base64">
    SADLFJSALDJF...SLDKFJASLDJ
  </resource>
</didl>

- Resource and metadata packaged together as a complex digital object represented via an XML wrapper
- Uniform solution for simple & compound objects
- Unambiguous expression of the locator of a datastream
- Disambiguation between locators & identifiers
- OAI-PMH datestamp changes whenever the resource (datastreams & secondary information) changes
- OAI-PMH semantics apply to containers and set membership
- First came Lenin
- Then came Stalin
68. Resource Discovery: ListIdentifiers
- HARVESTER
  - issues a ListIdentifiers
  - finds URLs of updated resources
  - does HTTP GETs for the updates only
  - can get URLs of resources with specified MIME types
69. Preservation: ListRecords
- HARVESTER
  - issues a ListRecords
  - gets updates as MPEG-21 DIDL documents (HTTP headers, resource By-Value or By-Reference)
  - can get resources with specified MIME types
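A minimal sketch of the slide-68 discovery pattern: ask mod_oai what changed since the last harvest, then GET only those URLs. It reuses the hypothetical list_identifiers helper from the slide-60 sketch, and assumes (as on slide 66) that the OAI-PMH identifier corresponds to the resource's URL:

import urllib.request

def incremental_harvest(base_url: str, last_harvest_date: str, save) -> None:
    """Fetch only resources updated since last_harvest_date (YYYY-MM-DD)."""
    for identifier in list_identifiers(base_url,
                                       metadataPrefix="oai_dc",
                                       **{"from": last_harvest_date}):
        with urllib.request.urlopen(identifier) as resp:
            save(identifier, resp.read())

# incremental_harvest("http://www.foo.edu/modoai", "2006-06-01",
#                     save=lambda url, body: print(url, len(body)))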
70. What Does This Mean?
- For an entire web site, we can
  - serialize everything as an XML stream
  - extract it using off-the-shelf OAI-PMH harvesters
  - efficiently discover updates & additions
- For each URL, we can
  - create a preservation-ready version with configurable descriptive/technical/structural metadata
  - e.g., JHOVE output, datestamps, signatures, provenance, automatically generated summary, etc.
[Diagram: harvest the resource, extract metadata (JHOVE & other pertinent info, lexical signatures, summaries, an index & translations), wrap it all together in an XML stream: ready for the future.]
71. Outline: Web Server Enhanced Preservation
- OAI-PMH
- mod_oai: complex objects & resource harvesting
- Research Focus
72. Research Contributions
- Thesis question: How well can Apache support web page preservation?
- Goal: to make web resources preservation-ready
  - Support refreshing (how many URLs at this site?): the counting problem
  - Support migration (what is this object?): the representation problem
- How: using OAI-PMH resource harvesting
  - Aggregate forensic metadata
  - Automate extraction
  - Encapsulate into an object
  - XML stream of information
  - Maximize preservation opportunity
  - Bring DL technology into the realm of the WWW
73. Experimentation & Evaluation
- Research solutions to the counting problem
  - Different tools yield different results: Google Sitemap != Apache file list != robot-crawled pages
  - Combine approaches for one automated, full URL listing (see the sketch below)
  - Apache logs are a detailed history of site activity
  - Compare user page requests with crawler requests
  - Compare crawled pages with the actual site tree
- Continue research on the representation problem
  - Integrate utilities into mod_oai (JHOVE, etc.)
  - Automate metadata extraction & encapsulation
- Serialize and reconstitute
  - complete back-up of a site & reconstitution through an XML stream
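A minimal sketch of combining the three URL sources named above into one listing, and of surfacing their disagreements; the three input sets are assumed to have been collected elsewhere (sitemap parse, filesystem walk, log analysis):

def url_census(sitemap_urls: set, apache_file_urls: set,
               crawled_urls: set) -> dict:
    """Union the three views of 'the whole site' and report disagreements,
    e.g. URLs Apache can serve that no crawler ever fetched."""
    return {
        "full_listing": sitemap_urls | apache_file_urls | crawled_urls,
        "never_crawled": apache_file_urls - crawled_urls,
        "crawled_but_unlisted": crawled_urls - sitemap_urls,
        "dynamic_or_orphaned": crawled_urls - apache_file_urls,
    }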
74. Summary: Web Server Enhanced Preservation
- Better web harvesting can be achieved through
  - OAI-PMH: structured access to updates
  - complex object formats: modeled representation of digital objects
- Addresses 2 key problems
  - Preservation (ListRecords): the representation problem
  - Web crawling (ListIdentifiers): the counting problem
- mod_oai reference implementation
  - Better performance than wget & crawlers
  - not a replacement for DSpace, Fedora, eprints.org, etc.
- More info
  - http://www.modoai.org/
  - http://whiskey.cs.odu.edu/
- Automatic harvesting of web resources, rich in metadata, packaged for the future
  - Today: manual. Tomorrow: automatic!
75. Summary
76. Summary
- Digital preservation is not hard, it's just big.
- Save the women and children first, of course, but there is room for many more
- Using the by-products of SEs and the WI, we can get a good amount of preservation for free
  - prediction: Google et al. will eventually see preservation as a business opportunity
- Increasing the role of the web server will solve most digital preservation problems
  - complex objects + OAI-PMH = digital preservation solution
77. "As you know, you preserve the files you have. They're not the files you might want or wish to have at a later time... if you think about it, you can have all the metadata in the world on a file and a file can be blown up."
image from http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html
78. Overview of OAI-PMH Verbs
- Metadata about the repository: Identify, ListMetadataFormats, ListSets
- Harvesting verbs: GetRecord, ListRecords, ListIdentifiers
- Most verbs take arguments: dates, sets, ids, metadata formats, and a resumption token (for flow control)
79. Enhancing Apache's Utility as a Preservation Tool
- Create a partnership between server and SE
  - Apache can serve up details about the site: the accessible portions of the site tree and changes, including additions and deletions
  - The SE can reduce crawl time and subsequent index/update times
  - Google: "Hi Apache! What's new?"
  - Apache: "Hi Google! I've got 3 new pages: xyz/blah1.html, yyy/bug2.html, and ru2.html. Oh, and I also deleted xyz/boo.html."
- Use OAI-PMH to facilitate the conversation between the SE and the server
- The data model offers many advantages
  - Both content-rich and metadata-rich
  - Supports complex objects
- The protocol's 6 verbs mesh well with the SE and server roles
  - Identify, ListMetadataFormats, ListSets, GetRecord, ListRecords, ListIdentifiers
- Enable a policy-driven relationship between site & SE
  - push content-rich harvesting to the web community
80. OAI-PMH Concepts: Typical Repository
81. OAI-PMH Concepts: mod_oai
82. OAI-PMH Data Model
[Diagram: the OAI-PMH identifier is the entry point to all records pertaining to the resource; each record carries metadata pertaining to the resource, in formats ranging from simple through more expressive to highly expressive.]
83. Warrick API
- The API should provide a clear and flexible interface for web repositories
- Goals
  - Shield Warrick from changes to the WI
  - Facilitate inclusion of new web repositories
  - Minimize implementation and maintenance costs
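A minimal sketch of what such a repository abstraction could look like; the class and method names are hypothetical, not Warrick's actual API:

from abc import ABC, abstractmethod
from datetime import datetime
from typing import Optional, Tuple

class WebRepository(ABC):
    """One search engine cache or web archive. New repositories plug in
    by subclassing, shielding the Warrick core from WI changes."""

    @abstractmethod
    def lookup(self, url: str) -> Optional[Tuple[datetime, bytes]]:
        """Return (cache datestamp, cached content), or None if not held."""

    @abstractmethod
    def strip_chrome(self, content: bytes) -> bytes:
        """Remove the repository's added header/footer markup."""

class InternetArchiveRepo(WebRepository):
    def lookup(self, url):
        ...  # query the Wayback Machine for the most recent snapshot
    def strip_chrome(self, content):
        ...  # IA rewrites links in cached pages; undo that here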
84. Evaluation
- Internet Archive has endorsed the use of Warrick
- Make Warrick available on SourceForge
- Measure community adoption & modification
85. Performance of mod_oai and wget on www.cs.odu.edu
For more detail: mod_oai: An Apache Module for Metadata Harvesting, http://arxiv.org/abs/cs.DL/0503069
86. IndexIgnore & File Permissions
87. Alias: Covering Up Files
httpd.conf:
  Alias /A /usr/local/web/htdocs/B
  Alias /B /usr/local/web/htdocs/A
The files A and B will be different from the URLs http://server/A and http://server/B.
88. UserDir: Just-in-Time Mounting of Directories
whiskey.cs.odu.edu/ftp/WWW/conf% ls /home
liu_x/ mln/
whiskey.cs.odu.edu/ftp/WWW/conf% ls -d /home/tharriso
/home/tharriso/
whiskey.cs.odu.edu/ftp/WWW/conf% ls /home
liu_x/ mln/ tharriso/
89. Complex Object Formats: Characteristics
- Representation of a digital object by means of a wrapper XML document
- The represented resource can be
  - a simple digital object (consisting of a single datastream)
  - a compound digital object (consisting of multiple datastreams)
- Unambiguous approach to convey identifiers of the digital object and its constituent datastreams
- Datastreams can be included
  - By-Value: embedding of the base64-encoded datastream
  - By-Reference: embedding the network location of the datastream
  - not mutually exclusive & equivalent
- A variety of secondary information can be included, By-Value or By-Reference
  - descriptive metadata, rights information, technical metadata, ...
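A minimal sketch of the By-Value / By-Reference choice using generic XML; the element names are simplified stand-ins, not the actual MPEG-21 DIDL schema:

import base64
import xml.etree.ElementTree as ET

def wrap_resource(identifier: str, mime: str,
                  data: bytes = None) -> ET.Element:
    """Package a datastream By-Value (base64 inside the wrapper) when
    data is supplied, else By-Reference (network location only)."""
    obj = ET.Element("object", identifier=identifier)
    res = ET.SubElement(obj, "resource", mimeType=mime)
    if data is not None:  # By-Value
        res.set("encoding", "base64")
        res.text = base64.b64encode(data).decode("ascii")
    else:                 # By-Reference: a locator, distinct from the identifier
        res.set("ref", identifier)
    return obj

# ET.dump(wrap_resource("http://foo.edu/bar.pdf", "application/pdf", b"%PDF..."))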
90. Resource Harvesting: Use Cases
- Discovery: use the content itself in the creation of services
  - search engines that make full text searchable
  - citation indexing systems that extract references from the full-text content
  - browsing interfaces that include thumbnail versions of high-quality images from cultural heritage collections
- Preservation
  - periodically transfer digital content from a data repository to one or more trusted digital repositories
  - trusted digital repositories need a mechanism to automatically synchronize with the originating data repository
Ideas first presented in Van de Sompel, Nelson, Lagoze & Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html