Thinking Differently About Web Page Preservation

1
Thinking Differently About Web Page Preservation
  • Michael L. Nelson, Frank McCown, Joan A. Smith
  • Old Dominion University
  • Norfolk VA
  • {mln, fmccown, jsmit}@cs.odu.edu
  • Library of Congress
  • Brown Bag Seminar
  • June 29, 2006
  • Research supported in part by NSF, Library of
    Congress and Andrew Mellon Foundation

2
Background
  • We can't save everything!
  • if not everything, then how much?
  • what does save mean?

3
Women and Children First
HMS Birkenhead, Cape Danger, 1852
638 passengers
193 survivors
all 7 women and 13 children
image from http://www.btinternet.com/~palmiped/Birkenhead.htm
4
We should probably save a copy of this
5
Or maybe we don't have to... the Wikipedia
link is in the top 10, so we're OK, right?
6
Surely we're saving copies of this...
7
2 copies in the UK, 2 Dublin Core
records. That's probably good enough...
8
What about the things that we know we don't
need to keep?
You DO support recycling, right?
9
A higher moral calling for pack rats?
10
Just Keep the Important Stuff!
11
Lessons Learned from the AIHT
(Boring stuff: D-Lib Magazine, December 2005)
images from http://facweb.cs.depaul.edu/sgrais/collage.htm
12
Preservation Fortress Model
Five Easy Steps for Preservation
  • Get a lot of
  • Buy a lot of disks, machines, tapes, etc.
  • Hire an army of staff
  • Load a small amount of data
  • Look upon my archive ye Mighty, and despair!

image from http://www.itunisie.com/tourisme/excursion/tabarka/images/fort.jpg
13
Alternate Models of Preservation
  • Lazy Preservation
    • Let Google, IA et al. preserve your website
  • Just-In-Time Preservation
    • Wait for it to disappear first, then find a
      good enough version
  • Shared Infrastructure Preservation
    • Push your content to sites that might preserve it
  • Web Server Enhanced Preservation
    • Use Apache modules to create archival-ready
      resources

image from http://www.proex.ufes.br/arsm/knots_interlaced.htm
14
Lazy Preservation: How much preservation do I get
if I do nothing?
  • Frank McCown

15
Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus

17
Web Infrastructure
19
Cost of Preservation
20
Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus

21
Research Questions
  • How much digital preservation of websites is
    afforded by lazy preservation?
  • Can we reconstruct entire websites from the Web
    Infrastructure (WI)?
  • What factors contribute to the success of website
    reconstruction?
  • Can we predict how much of a lost website can be
    recovered?
  • How can the WI be utilized to provide
    preservation of server-side components?

22
Prior Work
  • Is website reconstruction from WI feasible?
    • Web repositories: Google (G), MSN (M), Yahoo (Y),
      Internet Archive (IA)
    • Web-repository crawler: Warrick
    • Reconstructed 24 websites
  • How long do search engines keep cached content
    after it is removed?

23
Timeline of SE Resource Acquisition and Release
  • Vulnerable resource: not yet cached (t_ca is not
    defined)
  • Replicated resource: available on web server and
    SE cache (t_ca < current time < t_r)
  • Endangered resource: removed from web server but
    still cached (t_ca < current time < t_cr)
  • Unrecoverable resource: missing from web server
    and cache (t_ca < t_cr < current time)

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical report, arXiv cs.IR/0512069, 2005.
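
The four states on this slide can be sketched as a small classifier. This is only an illustration: it reads t_ca as the time the resource is first cached, t_r as the time it is removed from the web server, and t_cr as the time it is purged from the cache, and it assumes the ordering t_ca < t_r < t_cr. The names `State` and `classify` are hypothetical.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    VULNERABLE = "vulnerable"        # on the server, not yet cached
    REPLICATED = "replicated"        # on the server and in the cache
    ENDANGERED = "endangered"        # gone from the server, still cached
    UNRECOVERABLE = "unrecoverable"  # gone from both


def classify(now: float,
             t_ca: Optional[float],
             t_r: Optional[float] = None,
             t_cr: Optional[float] = None) -> State:
    """Classify a resource at time `now`.

    A value of None means the event has not happened yet.
    Assumes t_ca < t_r < t_cr whenever those times are defined.
    """
    if t_ca is None or now < t_ca:
        return State.VULNERABLE
    if t_cr is not None and now >= t_cr:
        return State.UNRECOVERABLE
    if t_r is not None and now >= t_r:
        return State.ENDANGERED
    return State.REPLICATED
```

The window between t_r and t_cr is the only period in which a lost resource can still be pulled back from a search engine cache.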
26
Cached Image
27
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
[Figure panels: canonical, MSN version, Yahoo version, Google version]
28
Web Repository Characteristics


C: Canonical version is stored
M: Modified version is stored (modified images are thumbnails, all others are HTML conversions)
R: Indexed but not retrievable
S: Indexed but not stored
29
SE Caching Experiment
  • Create HTML, PDF, and image files
  • Place files on 4 web servers
  • Remove files on regular schedule
  • Examine web server logs to determine when each
    page is crawled and by whom
  • Query each search engine daily using unique
    identifier to see if they have cached the page or
    image

Joan A. Smith, Frank McCown, and Michael L. Nelson. Observed Web Robot Behavior on Decaying Web Subsites. D-Lib Magazine, 12(2), February 2006.
30
Caching of HTML Resources - mln
31
Reconstructing a Website
[Figure labels: Starting URL, Warrick, Original URL, Web Repo, Results page, Cached URL, Cached resource, Retrieved resource, File system]
  • Pull resources from all web repositories
  • Strip off extra header and footer HTML
  • Store most recently cached version or canonical
    version
  • Parse HTML for links to other resources
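
The "store most recently cached version or canonical version" step can be sketched as a selection rule over the copies that the repositories return. This is a minimal sketch, not Warrick's actual code: the `CachedCopy` record, the `choose_copy` function, and the policy of preferring a canonical copy before falling back to the most recent cache date are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class CachedCopy:
    repo: str                   # label of the web repository (hypothetical)
    cached_on: Optional[date]   # None if the repository reports no date
    canonical: bool             # True if the bytes match the original format


def choose_copy(copies: List[CachedCopy]) -> Optional[CachedCopy]:
    """Prefer canonical copies; break ties by the most recent cache date."""
    if not copies:
        return None
    return max(copies, key=lambda c: (c.canonical, c.cached_on or date.min))
```

Sorting on the tuple `(canonical, cached_on)` keeps the rule in one place, so a different policy only needs a different key.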

32
How Much Did We Reconstruct?
[Figure: the lost web site and the reconstructed web site, drawn as graphs of resources labeled A through G]
Missing link to D; points to old resource G
F can't be found
33
Reconstruction Diagram
added 20%
changed 33%
missing 17%
identical 50%
34
Websites to Reconstruct
  • Reconstruct 24 sites in 3 categories
    1. small (1-150 resources)
    2. medium (150-499 resources)
    3. large (500 or more resources)
  • Use Wget to download current website
  • Use Warrick to reconstruct
  • Calculate reconstruction vector
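
One way to compute such a vector is sketched below. The exact definition is an assumption here: changed and missing are taken as fractions of the original site, and added as a fraction of the reconstructed site. The function name and the URL-to-hash dictionaries are hypothetical.

```python
from typing import Dict, Tuple


def reconstruction_vector(original: Dict[str, str],
                          reconstructed: Dict[str, str]
                          ) -> Tuple[float, float, float]:
    """Return (changed, missing, added) for two URL -> content-hash maps."""
    if not original or not reconstructed:
        raise ValueError("both sites must contain at least one resource")
    orig_urls, recon_urls = set(original), set(reconstructed)
    shared = orig_urls & recon_urls
    changed = sum(1 for url in shared if original[url] != reconstructed[url])
    missing = len(orig_urls - recon_urls)
    added = len(recon_urls - orig_urls)
    return (changed / len(original),
            missing / len(original),
            added / len(reconstructed))
```

For a site downloaded with Wget and then reconstructed with Warrick, the two dictionaries would be built by hashing each file under its URL; a vector of (0, 0, 0) would mean a perfect reconstruction.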

35
Results
Frank McCown, Joan A. Smith, Michael L. Nelson,
and Johan Bollen. Reconstructing Websites for the
Lazy Webmaster, Technical Report, arXiv
cs.IR/0512069, 2005.
36
Aggregation of Websites
37
Web Repository Contributions
38
Warrick Milestones
  • www2006.org: first lost website reconstructed
    (Nov 2005)
  • DCkickball.org: first website someone else
    reconstructed without our help (late Jan 2006)
  • www.iclnet.org: first website we reconstructed
    for someone else (mid Mar 2006)
  • Internet Archive officially blesses Warrick
    (mid Mar 2006) [1]

[1] http://frankmccown.blogspot.com/2006/03/warrick-is-gaining-traction.html
39
Outline: Lazy Preservation
  • Web Infrastructure as a Resource
  • Reconstructing Web Sites
  • Research Focus

40
Proposed Work
  • How lazy can we afford to be?
  • Find factors influencing success of website
    reconstruction from the WI
  • Perform search engine cache characterization
  • Inject server-side components into WI for
    complete website reconstruction
  • Improving the Warrick crawler
  • Evaluate different crawling policies
  • Frank McCown and Michael L. Nelson, Evaluation of
    Crawling Policies for a Web-repository Crawler,
    ACM Hypertext 2006.
  • Development of web-repository API for inclusion
    in Warrick

41
Factors Influencing Website Recoverability from
the WI
  • Previous study did not find statistically
    significant relationship between recoverability
    and website size or PageRank
  • Methodology
  • Sample large number of websites - dmoz.org
  • Perform several reconstructions over time using
    same policy
  • Download sites several times over time to capture
    change rates

42
Evaluation
  • Use statistical analysis to test for the
    following factors
  • Size
  • Makeup
  • Path depth
  • PageRank
  • Change rate
  • Create a predictive model: how much of my lost
    website do I expect to get back?

43
Marshall TR Server running EPrints
44
We can recover the missing page and PDF, but
what about the services?
45
Recovery of Web Server Components
  • Recovering the client-side representation is not
    enough to reconstruct a dynamically-produced
    website
  • How can we inject the server-side functionality
    into the WI?
  • Web repositories like HTML
  • Canonical versions stored by all web repos
  • Text-based
  • Comments can be inserted without changing
    appearance of page
  • Injection: Use erasure codes to break a server
    file into chunks and insert the chunks into HTML
    comments of different pages
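
The injection idea can be illustrated with the simplest possible erasure code. The sketch below is a stand-in, not the scheme proposed in the talk: it uses a single XOR parity chunk, so a file split into k data chunks yields n = k + 1 chunks and any r = k of them are enough to rebuild it. The comment format `wi-chunk` and all function names are hypothetical.

```python
import base64
from functools import reduce
from typing import Dict, List


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size chunks and append one XOR parity chunk."""
    size = max(1, -(-len(data) // k))           # ceiling division
    padded = data.ljust(size * k, b"\0")        # pad so all chunks match
    chunks = [padded[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(_xor, chunks)]


def as_html_comment(file_id: str, index: int, length: int, chunk: bytes) -> str:
    """Wrap one chunk in an HTML comment; base64 never contains '--'."""
    payload = base64.b64encode(chunk).decode("ascii")
    return f"<!-- wi-chunk {file_id} {index} {length} {payload} -->"


def decode(found: Dict[int, bytes], k: int, length: int) -> bytes:
    """Rebuild the file from chunks keyed by index (index k is the parity)."""
    missing = [i for i in range(k + 1) if i not in found]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost chunk")
    pieces = dict(found)
    if missing and missing[0] < k:
        # XOR of the surviving data chunks and the parity is the lost chunk.
        pieces[missing[0]] = reduce(_xor, found.values())
    return b"".join(pieces[i] for i in range(k))[:length]
```

A real deployment would want a code that survives the loss of several pages, since not every page carrying a chunk will be cached.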

46
Recover Server File from WI
47
Evaluation
  • Find the most efficient values for n and r
    (chunks created/recovered)
  • Security
  • Develop simple mechanism for selecting files that
    can be injected into the WI
  • Address encryption issues
  • Reconstruct an EPrints website with a few hundred
    resources

48
SE Cache Characterization
  • Web characterization is an active field
  • Search engine caches have never been
    characterized
  • Methodology
  • Randomly sample URLs from four popular search
    engines: Google, MSN, Yahoo, Ask
  • Download cached version and live version from the
    Web
  • Examine HTTP headers and page content
  • Test for overlap with Internet Archive
  • Attempt to access various resource types (PDF,
    Word, PS, etc.) in each SE cache

49
Summary: Lazy Preservation
  • When this work is completed, we will have:
  • demonstrated and evaluated the lazy preservation
    technique
  • provided a reference implementation
  • characterized SE caching behavior
  • provided a layer of abstraction on top of SE
    behavior (API)
  • explored how much we store in the WI (server-side
    vs. client-side representations)

50
Web Server Enhanced Preservation: How much
preservation do I get if I do just a little bit?
  • Joan A. Smith

51
Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus

52
WWW and DL Separate Worlds
[Figure: the WWW and DL worlds, 1994 versus today; panel labels: Crawlapalooza, Harvester Home Companion]
The problem is not that the WWW doesn't work; it
clearly does. The problem is that our
(preservation) expectations have been lowered.
53
Data Providers / Repositories
Service Providers / Harvesters
A repository is a network accessible server that
can process the 6 OAI-PMH requests. A
repository is managed by a data provider to
expose metadata to harvesters.
A harvester is a client application that issues
OAI-PMH requests.  A harvester is operated by a
service provider as a means of collecting
metadata from repositories.
54
Aggregators
  • aggregators allow for
  • scalability for OAI-PMH
  • load balancing
  • community building
  • discovery

data providers (repositories)
service providers (harvesters)
aggregator
55
OAI-PMH data model
entry point to all records pertaining to the
resource
metadata pertaining to the resource
56
OAI-PMH Used by Google and AcademicLive (MSN)
  • Why support OAI-PMH?
  • These guys are in business (i.e., for
    profit)
  • How does OAI-PMH help their bottom line?
  • By improving the search and analysis process

57
Resource Harvesting with OAI-PMH
OAI-PMH identifier: entry point to all records
pertaining to the resource
metadata pertaining to the resource
simple
highly expressive
more expressive
highly expressive
58
Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus

59
Two Problems
60
mod_oai solution
  • Integrate OAI-PMH functionality into the web
    server itself
  • mod_oai: an Apache 2.0 module to automatically
    answer OAI-PMH requests for an HTTP server
    • written in C
    • respects values in .htaccess, httpd.conf
  • compile mod_oai on http://www.foo.edu/
  • baseURL is now http://www.foo.edu/modoai
  • Result: web harvesting with OAI-PMH semantics
    (e.g., from, until, sets)

http://www.foo.edu/modoai?verb=ListIdentifiers&metadataPrefix=oai_dc&from=2004-09-15&set=mime:video:mpeg
The human-readable web site
Prepped for machine-friendly harvesting
Give me a list of all resources, include Dublin
Core metadata, dating from 9/15/2004 through
today, and that are MIME type video-MPEG.
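
A request like this one can be assembled from its OAI-PMH arguments. The sketch below is only illustrative: the helper `oai_request` is hypothetical, and the set name `mime:video:mpeg` is an assumption about how mod_oai spells MIME-type sets.

```python
from urllib.parse import urlencode


def oai_request(base_url: str, verb: str, args: dict) -> str:
    """Build an OAI-PMH request URL; colons are left unescaped for readability."""
    params = {"verb": verb, **args}
    return base_url + "?" + urlencode(params, safe=":")


url = oai_request(
    "http://www.foo.edu/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2004-09-15", "set": "mime:video:mpeg"},
)
print(url)
```

The arguments go in a dictionary rather than keyword arguments because `from` is a reserved word in Python.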
61
A Crawler's View of the Web Site
Not crawled (protected)
web root
Not crawled (Generated on-the-fly by CGI, e.g.)
Not crawled (robots.txt or robots META tag)
Not crawled (unadvertised and unlinked)
Crawled pages
Not crawled (remote link only)
Not crawled (too deep)
Remote web site
62
Apache's View of the Web Site
Require authentication
web root
Generated on-the-fly (CGI, e.g.)
Tagged No robots
Unknown/not visible
63
The Problem: Defining The Whole Site
  • For a given server, there are a set of URLs, U,
    and a set of files F
    • Apache maps U → F
    • mod_oai maps F → U
  • Neither function is 1-1 nor onto
    • We can easily check if a single u maps to F, but
      given F we cannot (easily) generate U
  • Short-term issues
    • dynamic files
      • exporting unprocessed server-side files would be
        a security hole
    • IndexIgnore
      • httpd will hide valid URLs
    • File permissions
      • httpd will advertise files it cannot read
  • Long-term issues
    • Alias, Location
      • files can be covered up by the httpd
    • UserDir
      • interactions between the httpd and the filesystem
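
A toy example makes the "neither 1-1 nor onto" point concrete. The URLs and file names below are invented for illustration only.

```python
from typing import Dict, Set


def is_one_to_one(mapping: Dict[str, str]) -> bool:
    """True if no two URLs map to the same file."""
    return len(set(mapping.values())) == len(mapping)


def is_onto(mapping: Dict[str, str], files: Set[str]) -> bool:
    """True if every file is reachable from at least one URL."""
    return set(mapping.values()) >= files


# Hypothetical server: two URLs serve the same file, one file has no URL.
url_to_file = {
    "/": "index.html",
    "/index.html": "index.html",
    "/a.html": "a.html",
}
files = {"index.html", "a.html", "orphan.pdf"}

print(is_one_to_one(url_to_file))   # False: "/" and "/index.html" collide
print(is_onto(url_to_file, files))  # False: orphan.pdf has no URL
```

Because the mapping fails both tests, walking the file system and walking the URL space give different answers to "what is on this site?".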

64
A Webmaster's Omniscient View
Dynamic
web root
Authenticated
Tagged No robots
Orphaned
Deep
Unknown/not visible
65
HTTP Get versus OAI-PMH GetRecord
[Figure: mod_oai inside the Apache web server answers two kinds of request for the same web site]
OAI-PMH GetRecord (machine-readable): complex object with JHOVE metadata, MD-5, LS
    GET /modoai/?verb=GetRecord&identifier=headlines.html&metadataPrefix=oai_didl
HTTP GET (human-readable)
    GET /headlines.html HTTP/1.1
66
OAI-PMH data model in mod_oai
http://techreports.larc.nasa.gov/ltrs/PDF/2004/aiaa/NASA-aiaa-2004-0015.pdf
OAI-PMH identifier: entry point to all records
pertaining to the resource
metadata pertaining to the resource
67
Complex Objects That Tell A Story
http://foo.edu/bar.pdf encoded as an MPEG-21 DIDL
Russian Nesting Doll

    <didl>
      <metadata source="jhove">...</metadata>
      <metadata source="file">...</metadata>
      <metadata source="essence">...</metadata>
      <metadata source="grep">...</metadata>
      ...
      <resource mimeType="application/pdf"
                identifier="http://foo.edu/bar.pdf"
                encoding="base64">
        SADLFJSALDJF...SLDKFJASLDJ
      </resource>
    </didl>

  • Resource and metadata packaged together as a
    complex digital object represented via XML
    wrapper
  • Uniform solution for simple and compound objects
  • Unambiguous expression of locator of datastream
  • Disambiguation between locators and identifiers
  • OAI-PMH datestamp changes whenever the resource
    (datastreams and secondary information) changes
  • OAI-PMH semantics apply: about containers, set
    membership
  • First came Lenin
  • Then came Stalin
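
The by-value packaging shown on this slide can be sketched with the standard library. The element and attribute names below copy the simplified form on the slide, not the full MPEG-21 DIDL schema, and the function `wrap_resource` is hypothetical.

```python
import base64
import xml.etree.ElementTree as ET
from typing import Dict


def wrap_resource(url: str, mime_type: str, content: bytes,
                  metadata: Dict[str, str]) -> str:
    """Package a resource and its metadata in one XML wrapper (by value)."""
    didl = ET.Element("didl")
    for source, text in metadata.items():
        md = ET.SubElement(didl, "metadata", {"source": source})
        md.text = text
    resource = ET.SubElement(didl, "resource", {
        "mimeType": mime_type,
        "identifier": url,
        "encoding": "base64",
    })
    # Base64 keeps arbitrary bytes safe inside the XML text node.
    resource.text = base64.b64encode(content).decode("ascii")
    return ET.tostring(didl, encoding="unicode")


xml_doc = wrap_resource(
    "http://foo.edu/bar.pdf",
    "application/pdf",
    b"%PDF-1.4 demo",
    {"file": "PDF document"},
)
```

Decoding the text of the `resource` element with `base64.b64decode` returns the original bytes, which is what lets a harvester rebuild the file from the XML stream alone.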

68
Resource Discovery: ListIdentifiers
  • HARVESTER
    • issues a ListIdentifiers,
    • finds URLs of updated resources
    • does HTTP GETs for updates only
    • can get URLs of resources with specified MIME
      types
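
The harvester side of this exchange can be sketched as parsing a ListIdentifiers response and collecting the URLs to fetch. The sample response is invented, and treating each OAI-PMH identifier as the resource's URL is an assumption about mod_oai; the XML namespace is the standard OAI-PMH 2.0 one.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Invented response for illustration; a real one comes from the baseURL.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2006-06-29T12:00:00Z</responseDate>
  <request verb="ListIdentifiers">http://www.foo.edu/modoai</request>
  <ListIdentifiers>
    <header>
      <identifier>http://www.foo.edu/headlines.html</identifier>
      <datestamp>2006-06-01T08:00:00Z</datestamp>
    </header>
    <header status="deleted">
      <identifier>http://www.foo.edu/old.html</identifier>
      <datestamp>2006-06-02T08:00:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""


def updated_urls(response_xml: str) -> Tuple[List[str], List[str]]:
    """Return (urls_to_fetch, urls_deleted) from a ListIdentifiers response."""
    root = ET.fromstring(response_xml)
    live, deleted = [], []
    for header in root.iterfind(".//oai:header", OAI_NS):
        ident = header.findtext("oai:identifier", namespaces=OAI_NS)
        (deleted if header.get("status") == "deleted" else live).append(ident)
    return live, deleted
```

Only the URLs in the first list need an HTTP GET; the second list tells the harvester what to drop without crawling anything.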

69
Preservation: ListRecords
  • HARVESTER
    • issues a ListRecords,
    • gets updates as MPEG-21 DIDL documents (HTTP
      headers, resource By Value or By Reference)
    • can get resources with specified MIME types

70
What does this mean?
  • For an entire web site, we can
    • serialize everything as an XML stream
    • extract it using off-the-shelf OAI-PMH harvesters
    • efficiently discover updates and additions
  • For each URL, we can
    • create a preservation-ready version with
      configurable descriptive/technical/structural
      metadata
    • e.g., Jhove output, datestamps, signatures,
      provenance, automatically generated summary, etc.

[Figure labels: Harvest the resource; extract metadata (Jhove and other pertinent info); include an index, translations, lexical signatures, summaries, etc.; wrap it all together in an XML stream; ready for the future]
71
Outline: Web Server Enhanced Preservation
  • OAI-PMH
  • mod_oai: complex objects + resource harvesting
  • Research Focus

72
Research Contributions
  • Thesis Question: How well can Apache support web
    page preservation?
  • Goal: To make web resources preservation ready
    • Support refreshing (how many URLs at this
      site?): the counting problem
    • Support migration (what is this object?): the
      representation problem
  • How: Using OAI-PMH resource harvesting
    • Aggregate forensic metadata
      • Automate extraction
    • Encapsulate into an object
      • XML stream of information
    • Maximize preservation opportunity
      • Bring DL technology into the realm of WWW

73
Experimentation & Evaluation
  • Research solutions to the counting problem
    • Different tools yield different results
    • Google sitemap <> Apache file list <> robot
      crawled pages
    • Combine approaches for one automated, full URL
      listing
    • Apache logs are detailed history of site activity
    • Compare user page requests with crawlers'
      requests
    • Compare crawled pages with actual site tree
  • Continue research on the representation problem
    • Integrate utilities into mod_oai (Jhove, etc.)
    • Automate metadata extraction and encapsulation
  • Serialize and reconstitute
    • complete back-up of site and reconstitution
      through XML stream
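
Comparing the three listings for the counting problem comes down to set arithmetic. The sketch below assumes the sitemap, the Apache file list, and the crawl log have already been normalized to comparable URL paths; the function name and sample paths are hypothetical.

```python
from typing import Dict, Iterable, List


def compare_listings(sitemap: Iterable[str],
                     file_list: Iterable[str],
                     crawled: Iterable[str]) -> Dict[str, List[str]]:
    """Summarize where three views of the same site agree and disagree."""
    sitemap, file_list, crawled = set(sitemap), set(file_list), set(crawled)
    everything = sitemap | file_list | crawled
    return {
        "all": sorted(everything),                         # combined listing
        "agreed": sorted(sitemap & file_list & crawled),   # seen by all three
        "never_crawled": sorted(everything - crawled),     # robots missed these
        "only_crawled": sorted(crawled - sitemap - file_list),  # e.g. dynamic
    }


report = compare_listings(
    sitemap=["/", "/a.html"],
    file_list=["/", "/a.html", "/orphan.pdf"],
    crawled=["/", "/cgi?x=1"],
)
```

The "all" entry is the combined URL listing; the other entries show which tool is responsible for each gap.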

74
Summary: Web Server Enhanced Preservation
  • Better web harvesting can be achieved through
    • OAI-PMH: structured access to updates
    • Complex object formats: modeled representation of
      digital objects
  • Address 2 key problems
    • Preservation (ListRecords): The Representation
      Problem
    • Web crawling (ListIdentifiers): The Counting
      Problem
  • mod_oai: reference implementation
    • Better performance than wget and crawlers
    • not a replacement for DSpace, Fedora,
      eprints.org, etc.
  • More info
    • http://www.modoai.org/
    • http://whiskey.cs.odu.edu/
  • Automatic harvesting of web resources rich in
    metadata, packaged for the future

Today: manual
Tomorrow: automatic!
75
Summary
  • Michael L. Nelson

76
Summary
  • Digital preservation is not hard, it's just big.
  • Save the women and children first, of course, but
    there is room for many more
  • Using the by-product of SE and WI, we can get a
    good amount of preservation for free
    • prediction: Google et al. will eventually see
      preservation as a business opportunity
  • Increasing the role of the web server will solve
    most of the digital preservation problems
    • complex objects + OAI-PMH = digital preservation
      solution

77
As you know, you preserve the files you have.
They're not the files you might want or wish to
have at a later time... if you think about it,
you can have all the metadata in the world on a
file and a file can be blown up...
image from http://www.washingtonpost.com/wp-dyn/articles/A132-2004Dec14.html
78
Overview of OAI-PMH Verbs
metadata about the repository
harvesting verbs
most verbs take arguments: dates, sets, ids,
metadata formats and resumption token (for flow
control)
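
Flow control with a resumption token can be sketched as a loop that keeps asking until the repository returns no token. This is a sketch under assumptions: `fetch` is a caller-supplied, hypothetical function that takes the request arguments and returns the response XML as a string, so no particular HTTP library is implied.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator

OAI = "{http://www.openarchives.org/OAI/2.0/}"


def harvest_identifiers(fetch: Callable[[Dict[str, str]], str],
                        metadata_prefix: str = "oai_dc") -> Iterator[str]:
    """Yield every identifier, following resumption tokens page by page."""
    params = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    while True:
        root = ET.fromstring(fetch(params))
        for ident in root.iter(OAI + "identifier"):
            yield ident.text
        token = root.find(f".//{OAI}resumptionToken")
        if token is None or not (token.text or "").strip():
            return  # a missing or empty token marks the last page
        # A resumption token replaces all other arguments except the verb.
        params = {"verb": "ListIdentifiers",
                  "resumptionToken": token.text.strip()}
```

Passing `fetch` in keeps the loop testable with canned responses and leaves the choice of HTTP client to the caller.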
79
Enhancing Apache's utility as a preservation tool
  • Create a partnership between server and SE
    • Apache can serve up details about site,
      accessible portions of site tree, changes
      including additions and deletions
    • SE can reduce crawl time and subsequent
      index/update times
    • Google: "Hi Apache! What's new?"
    • Apache: "Hi Google! I've got 3 new pages:
      xyz/blah1.html, yyy/bug2.html, and ru2.html.
      Oh, and I also deleted xyz/boo.html."
  • Use OAI-PMH to facilitate conversation between
    the SE and the server
    • Data model offers many advantages
      • Both content-rich and metadata-rich
      • Supports complex objects
    • Protocol's 6 verbs mesh well with SE, Server
      roles
      • Identify, ListMetadataFormats, ListSets,
        GetRecord, ListIdentifiers, ListRecords
  • Enable policy-driven relationship between site
    and SE
  • push content-rich harvesting to web community

80
OAI-PMH concepts: typical repository
81
OAI-PMH concepts: mod_oai
82
OAI-PMH data model
OAI-PMH identifier: entry point to all records
pertaining to the resource
metadata pertaining to the resource
simple
highly expressive
more expressive
highly expressive
83
Warrick API
  • API should provide a clear and flexible interface
    for web repositories
  • Goals
  • Shield Warrick from changes to WI
  • Facilitate inclusion of new web repositories
  • Minimize implementation and maintenance costs
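
One possible shape for such an interface is sketched below. It is not the Warrick API: the class names, the single `lookup` method, and the in-memory stand-in repository are all hypothetical, chosen only to show how a crawler could be shielded from repository-specific details.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Optional, Tuple


class WebRepository(ABC):
    """Anything that may hold a cached copy of a URL."""

    name: str

    @abstractmethod
    def lookup(self, url: str) -> Optional[bytes]:
        """Return the cached content for url, or None if not held."""


class InMemoryRepository(WebRepository):
    """Stand-in repository backed by a dictionary, for illustration."""

    def __init__(self, name: str, pages: Dict[str, bytes]) -> None:
        self.name = name
        self._pages = pages

    def lookup(self, url: str) -> Optional[bytes]:
        return self._pages.get(url)


def recover(url: str,
            repositories: Iterable[WebRepository]
            ) -> Optional[Tuple[str, bytes]]:
    """Ask each repository in turn; return (repository name, content)."""
    for repo in repositories:
        content = repo.lookup(url)
        if content is not None:
            return repo.name, content
    return None
```

Adding a new web repository then means writing one subclass, and a change in how a search engine exposes its cache stays inside that subclass.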

84
Evaluation
  • Internet Archive has endorsed use of Warrick
  • Make Warrick available on SourceForge
  • Measure the community adoption and modification

85
performance of mod_oai and wget on www.cs.odu.edu

For more detail: mod_oai: An Apache Module for
Metadata Harvesting, http://arxiv.org/abs/cs.DL/0503069
86
IndexIgnore & File Permissions
87
Alias: Covering Up Files
httpd.conf:
    Alias /A /usr/local/web/htdocs/B
    Alias /B /usr/local/web/htdocs/A
The files A and B will be different from the
URLs http://server/A and http://server/B
88
UserDir: Just in Time mounting of directories
    whiskey.cs.odu.edu/ftp/WWW/conf ls /home
    liu_x/ mln/
    whiskey.cs.odu.edu/ftp/WWW/conf ls -d /home/tharriso
    /home/tharriso/
    whiskey.cs.odu.edu/ftp/WWW/conf ls /home
    liu_x/ mln/ tharriso/
    whiskey.cs.odu.edu/ftp/WWW/conf
89
Complex Object Formats: Characteristics
  • Representation of a digital object by means of a
    wrapper XML document.
  • Represented resource can be
    • simple digital object (consisting of a single
      datastream)
    • compound digital object (consisting of multiple
      datastreams)
  • Unambiguous approach to convey identifiers of the
    digital object and its constituent datastreams.
  • Include datastream
    • By-Value: embedding of base64-encoded datastream
    • By-Reference: embedding network location of the
      datastream
    • not mutually exclusive; equivalent
  • Include a variety of secondary information
  • By-Value
  • By-Reference
  • Descriptive metadata, rights information,
    technical metadata,

90
Resource Harvesting: Use cases
  • Discovery: use content itself in the creation of
    services
  • search engines that make full-text searchable
  • citation indexing systems that extract references
    from the full-text content
  • browsing interfaces that include thumbnail
    versions of high-quality images from cultural
    heritage collections
  • Preservation
  • periodically transfer digital content from a data
    repository to one or more trusted digital
    repositories
  • trusted digital repositories need a mechanism to
    automatically synchronize with the originating
    data repository

Ideas first presented in Van de Sompel, Nelson,
Lagoze and Warner, http://www.dlib.org/dlib/december04/vandesompel/12vandesompel.html