Title: Integrating Preservation Functions into the Web Server
1. Integrating Preservation Functions into the Web Server
- Using 3rd-Party Utilities to Generate Just-In-Time Metadata for Web Resources
Joan A. Smith, Old Dominion University
2. WWW and Digital Libraries: Vastly Different Worlds
- World Wide Web
- A disorganized free-for-all
- Near-zero metadata
- Unpredictable additions, deletions, modifications
- No preservation policy
vs
3. Web Site Preservation: 2 Problems
Guess the bean count, win the jar
The counting problem: How many pages are on that site? To save it, you have to find it.
The representation problem: What's that page all about? Future use requires understanding.
4. Digital Preservation Requirements
- Refreshing: If you don't have it, you can't preserve it
  - Resources disappear over time (Cong. Foley's web site)
  - Resources change over time (cs.odu.edu/index.html)
  - Resources can decay/degrade over time (damaged files, lost links)
- Migration: If you don't upgrade it, you can't use it
  - Format obsolescence (WordPerfect vs. PDF)
  - Format modification (XBM vs. JPEG)
  - System obsolescence (TRS-80 vs. PowerPC)
- Emulation: If you can't access it, you can't use it
  - Original bits and bytes only work in the original environment (PDP-11)
  - Obsolete systems can be emulated in a newer environment (Frogger)
  - Physical characteristics have to be interpreted in new environments
5. A Crawler's View of the Web Site
[Figure: tree of pages under the web root, with nodes marked "?" or "X"]
The crawler has run into the counting problem, and doesn't know it.
6. Pages Out of Crawler Reach
- Some pages linked from web root
- Some dynamic content
- Some orphaned pages
- Some pages protected with access controls
- Some pages too deep for a particular crawler
7. Search Engines: The Counting Problem
- HTTP cannot ask for only new or modified resources
  - Conditional GET by datestamp or ETag has limited benefit
  - Cannot get a list of pages that have been deleted, changed, or added
  - Each resource must be requested, one at a time, by name
- There is no SELECT in HTTP
  - Crawlers cannot request a list of all URLs for the site
  - Crawlers can only GET one resource at a time, by name
  - HTTP cannot give a crawler a list of resources it has
  - Undiscovered resources will not be refreshed
- Sitemaps
  - XML document lays out site structure (cf. http://www.sitemaps.org/protocol.php)
    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
        <lastmod>2005-01-01</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.8</priority>
      </url>
    </urlset>
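As a rough illustration of how a crawler might consume a Sitemap, the sketch below parses a complete version of the example on this slide using only the Python standard library. The helper name `parse_sitemap` and the closing tags in the sample are assumptions made for illustration; they are not part of the slides.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the Sitemap example on the slide.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The slide's example, with closing tags added so that it parses.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return one dict per <url> entry (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entry = {}
        for field in ("loc", "lastmod", "changefreq", "priority"):
            node = url.find("sm:" + field, SITEMAP_NS)
            if node is not None and node.text:
                entry[field] = node.text.strip()
        entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE):
        print(entry)
```

A Sitemap only lists what the webmaster chose to publish, so a crawler using it still depends on the list being complete and current.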
8. 6 Verbs of the OAI-PMH
Most verbs can take qualifying arguments: dates, sets, ids, metadata formats, and a resumption token (for flow control).
- Compatible with HTTP
- Supports OAIS model
- Can support complex object model
9. Addressing the Counting Problem Using OAI-PMH
- Advantages for Crawler
  - Single request itemizes all resources in web tree: ListIdentifiers
  - Can refine by MIME, Datestamp
- But Still Limited
  - No Dynamic URLs
  - Web root tree only
  - Same metadata as HTTP
Basic request: http://www.foo.edu/modoai/?verb=ListIdentifiers&metadataPrefix=oai_dc
Enhanced request: &from=2006-09-15&set=mime:video:mpeg
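The basic and enhanced requests on this slide are ordinary HTTP GET URLs with query arguments. A minimal sketch of how a harvester might build them is shown below; the helper `oai_request` is hypothetical, and the `mime:video:mpeg` set name is an assumption about how the MIME-based set is spelled.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = ("Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord")


def oai_request(base_url, verb, **args):
    """Build an OAI-PMH request URL (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    params = {"verb": verb}
    # `from` is a Python keyword, so callers pass it as `from_`.
    for key, value in args.items():
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


BASE = "http://www.foo.edu/modoai/"
basic = oai_request(BASE, "ListIdentifiers", metadataPrefix="oai_dc")
enhanced = oai_request(BASE, "ListIdentifiers", metadataPrefix="oai_dc",
                       from_="2006-09-15", set="mime:video:mpeg")

if __name__ == "__main__":
    print(basic)
    print(enhanced)
```

Note that `urlencode` percent-encodes the colons in the set name, which is equivalent on the wire to the unencoded form written on the slide.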
10. Web Sites: Metadata Challenged
HTML metadata
telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ
JPEG metadata
11. Archives: Metadata-Rich
12. What's In A Web Page?
13. A Web Page: Behind the Scenes
14. HTTP: Behind the Scenes
- Resource example
  - http://foo.edu/jackJill.jpg
- Note the limited metadata from the HTTP GET request
  - Binary content is not human-readable
- Additional metadata could help the digital archeologist of the future
  - Color map
  - NISO information
  - Base64 encoding of resource
  - MD5 or other hash function
  - Subject matter
- And metadata that could help preserve the Jack and Jill document
  - Language
  - Script type and version
  - Document summary/abstract
  - Keyword extraction
telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ [remaining binary JPEG data omitted]
Connection closed by foreign host.
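Two of the additional items listed on this slide, the hash and the Base64 encoding, can be derived from nothing more than the bytes the server already returns. The sketch below shows one way to do that in Python; `describe_bytes` is a hypothetical helper, and the four sample bytes are a stand-in for the JPEG signature rather than the real jackJill.jpg.

```python
import base64
import hashlib


def describe_bytes(payload, content_type):
    """Derive simple preservation metadata from raw bytes (hypothetical helper)."""
    return {
        "Content-Type": content_type,
        "Content-Length": len(payload),
        "MD5": hashlib.md5(payload).hexdigest(),
        "Base64": base64.b64encode(payload).decode("ascii"),
    }


# Stand-in bytes: the start of a JPEG/JFIF file (FF D8 FF E0), not the real image.
sample = bytes([0xFF, 0xD8, 0xFF, 0xE0])
info = describe_bytes(sample, "image/jpeg")

if __name__ == "__main__":
    for key, value in info.items():
        print(key, value)
```

Items such as the color map, NISO information, or subject matter need format-aware utilities, which is where the third-party tools discussed in the following slides come in.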
15. The Conscientious Webmaster
Preservation is important
But I'm soooo busy
How to help???
"He who waits to do a great deal of good will never do anything." -- Samuel Johnson
16. Metadata Generation Utility Examples
17. Post-Harvest Processing
Often a combination of manual and automated input
18. Harvest with Metadata
Harvest
Pre-processed resource
Metadata Magic: Get the resource together with its metadata
19. Configuring the Web Server for Metadata Magic
- http://foo.edu/example.html
- No impact to everyday users
- Regular GET => regular response
- OAI-PMH GetRecord => CRATE response
- Standard Apache Location directive
- mod_oai module configured with plug-ins
- Scripts, utilities, etc. can vary by MIME type
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
20. Metadata Magic with mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg; encoding=base64</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
  Comments: LEAD Technologies Inc. V1.01
  ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
21. Best-Effort Metadata
- Unverified
  - Utility results are not cross-checked
  - Output of analyses goes directly into XML response
- Undifferentiated
  - No categorization of output
  - Resource and metadata cohabit response
- Extemporaneous
  - Generated at time of dissemination
  - Integrates preservation functions with the web server
- A simple, easy-to-implement option for improving web resource metadata
22. YAMM?! (Yet Another Metadata Model?)
23. The MPEG-21 DIDL Model
24. Preservation Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More)]
25. Webs >> Archivists
Typical ingest scenario
Archivist
Web Sites
26. Harnessing the Web Server
- User: standard GET request and response
- Archivist: mod_oai GetRecord request and response
Self-describing resource
27. What is a Self-Describing Resource?
Standard HTTP Headers
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg
PLUS Output from built-in utilities
EXIF TOOL
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: 75.0 - 300.0mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: 1
Sharpness: 1
Saturation: 1
Color Tone: Normal
File Size: 1606 kB
File Number: 103-0315
JHOVE TOOL
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
  Comments: LEAD Technologies Inc. V1.01
  ApplicationSegments: APP0
File/Magic: JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash: 58a54e8638db432f4515eedf89f44505
CRATE: Wrapped together with the resource in simple XML
28. Metadata Generation Utility Examples
29. Web Server Configuration: conf file
- Section 1: Global Environment
  ServerType standalone
  ServerRoot "/etc/httpd"
  PidFile /var/run/httpd.pid
  ResourceConfig /dev/null
  AccessConfig /dev/null
  Timeout 300
  KeepAlive On
  MaxKeepAliveRequests 0
  KeepAliveTimeout 15
  MinSpareServers 16
  MaxSpareServers 64
  StartServers 16
  MaxClients 512
  MaxRequestsPerChild 100000
- Section 2: 'Main' server configuration
  - Operational Rules
  - Modules (mod_perl, etc.)
  - Security
  - Virtual Hosts
<Files *.pl>
  Options None
  AllowOverride None
  Order deny,allow
  Deny from all
</Files>
<IfModule mod_dir.c>
  DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>
<IfModule mod_include.c>
  Include conf/mmap.conf
</IfModule>
UseCanonicalName On
<IfModule mod_mime.c>
  TypesConfig /etc/httpd/conf/mime.types
</IfModule>
DefaultType text/plain
HostnameLookups Off
30. Apache mod_oai Location Directive
<Location /modoai>
  SetHandler modoai-handler
  modoai_oai_active ON
  <modoai_plugin>
    label md5sum
    exec /usr/bin/md5sum %s
    version /usr/bin/md5sum --version
    mime */*
  </modoai_plugin>
  <modoai_plugin>
    label file
    exec /usr/bin/file -kz %s
    version /usr/bin/file -v
    mime */*
  </modoai_plugin>
  <modoai_plugin>
    label jhove
    exec /opt/jhove/jhove -m pdf-hul %s
    version /opt/jhove/jhove -v
- Scripts
- Pipes
- Executables
- MIME-based selective processing
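The idea of MIME-based selective processing can be sketched outside Apache as a small dispatcher: each plug-in declares a MIME pattern, and only matching utilities run against a resource. The Python below is an illustrative sketch, not mod_oai's implementation. The `%s` path placeholder, the `*/*` wildcard, and the `application/pdf` pattern for the jhove entry are assumptions, and all function names are hypothetical.

```python
import fnmatch
import shlex
import subprocess

# Hypothetical in-memory mirror of the plug-in entries on the slide.
# "%s" (file path placeholder) and the MIME patterns are assumptions.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s",
     "mime": "application/pdf"},
]


def select_plugins(plugins, mime_type):
    """Keep only the plug-ins whose MIME pattern matches the resource."""
    return [p for p in plugins if fnmatch.fnmatchcase(mime_type, p["mime"])]


def build_command(template, path):
    """Split the template into argv and substitute the file path for %s."""
    return [path if token == "%s" else token for token in shlex.split(template)]


def run_plugin(plugin, path, timeout=30):
    """Run one utility and capture its output; failures are reported as text."""
    try:
        done = subprocess.run(build_command(plugin["exec"], path),
                              capture_output=True, text=True, timeout=timeout)
        return done.stdout.strip() or done.stderr.strip()
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "plug-in failed: %s" % exc


if __name__ == "__main__":
    for plugin in select_plugins(PLUGINS, "image/jpeg"):
        print(plugin["label"], build_command(plugin["exec"], "jackJill.jpg"))
```

Passing an argument list to `subprocess.run` instead of a shell string keeps file names with spaces or shell metacharacters from being interpreted, which bears on the security concerns raised later in the deck.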
31. Building a CRATE
- URI, UUID
- Standard HTTP Headers
- Plug-In Metadata
- Base64-Encoded Resource
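The four ingredients above can be combined into a single XML element with a few lines of Python. This is only a sketch of the idea: the element names `crateContent`, `crateMetadata`, `description`, `label`, `mimeType`, and `data` follow the example response in this deck, while `crate`, `identifier`, `uuid`, `httpHeaders`, and `header` are placeholders invented here because the CRATE schema is not given on the slides.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, headers, plugin_outputs, payload):
    """Assemble URI/UUID, HTTP headers, plug-in metadata, and the resource.

    Hypothetical helper; see the lead-in for which element names are invented.
    """
    crate = ET.Element("crate")
    ET.SubElement(crate, "identifier").text = uri
    ET.SubElement(crate, "uuid").text = str(uuid.uuid4())

    # Standard HTTP headers, copied through unchanged.
    http = ET.SubElement(crate, "httpHeaders")
    for name, value in headers.items():
        ET.SubElement(http, "header", name=name).text = value

    # One <description> per plug-in, holding the utility's raw output.
    meta = ET.SubElement(crate, "crateMetadata")
    for label, output in plugin_outputs.items():
        desc = ET.SubElement(meta, "description")
        ET.SubElement(desc, "label").text = label
        ET.SubElement(desc, "data").text = output

    # The resource itself, Base64-encoded so binary data survives in XML.
    content = ET.SubElement(crate, "crateContent")
    ET.SubElement(content, "mimeType").text = headers.get(
        "Content-Type", "application/octet-stream")
    ET.SubElement(content, "data").text = base64.b64encode(payload).decode("ascii")
    return crate


crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    {"Content-Type": "image/jpeg", "Content-Length": "4"},
    {"file magic": "JPEG image data"},
    bytes([0xFF, 0xD8, 0xFF, 0xE0]),  # stand-in bytes, not the real image
)

if __name__ == "__main__":
    print(ET.tostring(crate, encoding="unicode"))
```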
32. CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg; encoding=base64</mimeType>
        <data>JVBERi0xLjQKMyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
  Comments: LEAD Technologies Inc. V1.01
  ApplicationSegments: APP0
          ]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
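On the archivist's side, a response like this one has to be split back into the resource and its metadata. The following Python sketch shows one way to do that with the standard library. It assumes the crate elements inherit the OAI-PMH default namespace, as they appear to in the example; `unpack_crate` is a hypothetical helper, and the sample is abbreviated, with a Base64 payload that encodes the text "hello" rather than an image.

```python
import base64
import xml.etree.ElementTree as ET

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Abbreviated stand-in for a GetRecord response in the crate format.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header><identifier>http://foo.edu/jackJill.jpg</identifier></header>
    <crateContent>
      <mimeType>image/jpeg</mimeType>
      <data>aGVsbG8=</data>
    </crateContent>
    <crateMetadata>
      <description>
        <label>file magic</label>
        <data>JPEG image data, JFIF standard 1.00</data>
      </description>
    </crateMetadata>
  </record></GetRecord>
</OAI-PMH>
"""


def unpack_crate(xml_text):
    """Return (identifier, resource bytes, {label: utility output})."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    record = root.find("oai:GetRecord/oai:record", OAI)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI)
    encoded = record.findtext("oai:crateContent/oai:data", namespaces=OAI)
    payload = base64.b64decode(encoded)
    metadata = {}
    for desc in record.findall("oai:crateMetadata/oai:description", OAI):
        label = desc.findtext("oai:label", namespaces=OAI)
        metadata[label] = desc.findtext("oai:data", default="", namespaces=OAI).strip()
    return identifier, payload, metadata


if __name__ == "__main__":
    ident, payload, metadata = unpack_crate(SAMPLE)
    print(ident, len(payload), metadata)
```

Because the utility output is stored as undifferentiated text, the consumer gets one string per plug-in and must do any further parsing itself.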
33. Automatic, Best-Effort Metadata
- Automatic
  - Generated at time of dissemination
  - Integrates preservation functions with the web server
- Unverified
  - Utility results are not cross-checked
  - Output of analyses goes directly into XML response
- Undifferentiated
  - No categorization of output
  - Resource and metadata form complex-object response
- A simple, easy-to-implement option for improving available preservation metadata for web resources
34. Issues - Or Not?
- Web Server Performance
  - Academic vs. dot-com expectations
  - Solution options
- Utility Efficiency
  - Java-based vs. C-based
  - Market pressures
- Security
  - Metadata vs. risk
  - Access controls
35. Next Up
- mod_oai Open Source release
- Formalize/release CRATE schema definition (XSD)
- Metrics Collection and Evaluation
  - Academic sites
  - Dot-Com sites
- Examine utility compatibility and issues
- Address security concerns
36. Examples
- MPEG-21 DIDL
  - http://beatitude.cs.odu.edu:9090/washingtonpost1.txt
  - http://beatitude.cs.odu.edu:9090/modoai/?verb=GetRecord&metadataPrefix=oai_didl&identifier=http://beatitude.cs.odu.edu:9090/washingtonpost1.txt
  - http://beatitude.cs.odu.edu:9090/index.html
  - http://beatitude.cs.odu.edu:9090/modoai/?verb=GetRecord&metadataPrefix=oai_didl&identifier=http://beatitude.cs.odu.edu:9090/index.html
- CRATE
  - http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
  - http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
  - http://beatitude.cs.odu.edu:8080/index.html
  - http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://beatitude.cs.odu.edu:8080/index.html
37. Further Information
- The mod_oai project home page
  - http://www.modoai.org/
- JCDL 2007
  - Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
- IWAW 2007
  - CRATE: A Simple Model for Self-Describing Web Resources
- Authors' web pages
  - http://www.cs.odu.edu/~mln/pubs/
  - http://www.joanasmith.com/pubs.html