Transcript and Presenter's Notes

Title: Integrating Preservation Functions into the Web Server


1
Integrating Preservation Functions into the Web Server
Using 3rd-Party Utilities to Generate Just-In-Time Metadata for Web Resources

Joan A. Smith, Old Dominion University
2
WWW and Digital Libraries: Vastly Different Worlds
  • World Wide Web
    • A disorganized free-for-all
    • Near-zero metadata
    • Unpredictable additions, deletions, modifications
    • No preservation policy

vs
3
Web Site Preservation: 2 Problems
Guess the bean count, win the jar
The counting problem: How many pages are on that site? To save it, you have to find it.
The representation problem: What's that page all about? Future use requires understanding.
4
Digital Preservation Requirements
  • Refreshing: If you don't have it, you can't preserve it
    • Resources disappear over time (Cong. Foley's web site)
    • Resources change over time (cs.odu.edu/index.html)
    • Resources can decay/degrade over time (damaged files, lost links)
  • Migration: If you don't upgrade it, you can't use it
    • Format obsolescence (WordPerfect vs. PDF)
    • Format modification (XBM vs. JPEG)
    • System obsolescence (TRS-80 vs. PowerPC)
  • Emulation: If you can't access it, you can't use it
    • Original bits and bytes only work in the original environment (PDP-11)
    • Obsolete systems can be emulated in a newer environment (Frogger)
    • Physical characteristics have to be interpreted in new environments

5
A Crawler's View of the Web Site
[Figure: tree of pages below the web root; nodes are marked "?" or "X"]
The crawler has run into the counting problem, and doesn't know it.
6
Pages Out of Crawler Reach
  • Some pages linked from web root
  • Some dynamic content
  • Some orphaned pages
  • Some pages protected with access controls
  • Some pages too deep for a particular crawler

7
Search Engines: The Counting Problem
  • HTTP cannot ask for only new or modified resources
    • Conditional GET by datestamp or ETag has limited benefit
    • Cannot get a list of pages that have been deleted, changed, or added
    • Each resource must be requested, one at a time, by name
  • There is no SELECT in HTTP
    • Crawlers cannot request a list of all URLs for the site
    • Crawlers can only GET one resource at a time, by name
    • HTTP cannot give a crawler a list of resources it has
    • Undiscovered resources will not be refreshed
  • Sitemaps
    • XML document lays out site structure (cf. http://www.sitemaps.org/protocol.php)

      <?xml version="1.0" encoding="UTF-8"?>
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://www.example.com/</loc>
          <lastmod>2005-01-01</lastmod>
          <changefreq>monthly</changefreq>
          <priority>0.8</priority>
        </url>
      </urlset>
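
A sitemap gives a crawler the list of URLs that HTTP alone cannot supply. The following is a minimal sketch of how a crawler might read one, using only the Python standard library. The function names `list_sitemap_entries` and `modified_since` are illustrative and not part of any tool named in the slides; the sample document mirrors the entry shown above.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the slide's example.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Sample document mirroring the slide's single <url> entry.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <lastmod>2005-01-01</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
"""


def list_sitemap_entries(xml_text):
    """Return (loc, lastmod) pairs for every <url> element in a sitemap."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        entries.append((loc, lastmod))
    return entries


def modified_since(entries, cutoff):
    """Keep URLs whose lastmod (an ISO date string) is on or after cutoff."""
    return [loc for loc, lastmod in entries if lastmod and lastmod >= cutoff]


if __name__ == "__main__":
    for loc, lastmod in list_sitemap_entries(SAMPLE_SITEMAP):
        print(loc, lastmod)
```

Comparing ISO dates as strings works only when both sides use the same format and precision; a real crawler would likely parse them into date objects first.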

8
6 Verbs of the OAI-PMH
Most verbs can take qualifying arguments: dates, sets, IDs, metadata formats, and a resumption token (for flow control)
  • Compatible with HTTP
  • Supports OAIS model
  • Can support complex object model

9
Addressing the Counting Problem Using OAI-PMH
  • Advantages for Crawler
    • Single request itemizes all resources in web tree: ListIdentifiers
    • Can refine by MIME, Datestamp
  • But Still Limited
    • No Dynamic URLs
    • Web root tree only
    • Same metadata as HTTP

Basic request: http://www.foo.edu/modoai/?verb=ListIdentifiers&metadataPrefix=oai_dc
Enhanced request: &from=2006-09-15&set=mime:video:mpeg
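
The basic and enhanced requests differ only in their query arguments, so both can be produced by one small helper. This is a sketch using the Python standard library; `oai_request_url` is an illustrative name, and the base URL is the placeholder host from the slide.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def oai_request_url(base_url, verb, **args):
    """Build an OAI-PMH request URL from a verb and its arguments."""
    params = {"verb": verb}
    for key, value in args.items():
        # "from" is a Python keyword, so callers pass from_ instead.
        params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    base = "http://www.foo.edu/modoai/"
    print(oai_request_url(base, "ListIdentifiers", metadataPrefix="oai_dc"))
    enhanced = oai_request_url(
        base,
        "ListIdentifiers",
        metadataPrefix="oai_dc",
        from_="2006-09-15",
        set="mime:video:mpeg",
    )
    # urlencode percent-encodes the colons in the set name; parse_qs shows
    # that the server receives the original value.
    print(parse_qs(urlsplit(enhanced).query)["set"])
```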
10
Web Sites: Metadata Challenged
HTML metadata
telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ
JPEG metadata
11
Archives: Metadata-Rich
12
What's In A Web Page?
13
A Web Page Behind the Scenes
14
HTTP Behind the Scenes
  • Resource example
    • http://foo.edu/jackJill.jpg
  • Note the limited metadata from the HTTP GET request
  • Binary content is not human-readable
  • Additional metadata could help the digital archeologist of the future
    • Color map
    • NISO information
    • Base64 encoding of resource
    • MD5 or other hash function
    • Subject matter
  • And metadata that could help preserve the Jack and Jill document
    • Language
    • Script type and version
    • Document summary/abstract
    • Keyword extraction

telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ ê@XÑ9'M½ÂX4ýÃÆçÉÎ Ð?õÔÓ!RÓ@ûTÓr
pz ëÖ.éhéQ)Ùè5übgøxzè ² "2s35Rq³ÁÂCcruÃÒÿÄ ê@XÑ9'M½ÂX4ýÃÆçÉÎ
Ð?õÔÓ!RÓ@ûTÓrpz ëÖ.éhéQ)Ùè5übgøxzè
Connection closed by foreign host.
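
Two of the additional metadata items listed above, the Base64 encoding and the MD5 hash, can be computed directly from the resource bytes. The sketch below uses only the Python standard library; `describe_bytes` is an illustrative name, and the four sample bytes are the JPEG signature that appears as "ÿØÿà" at the start of the binary response.

```python
import base64
import hashlib


def describe_bytes(content):
    """Return size, MD5 digest, and Base64 text for a resource's bytes."""
    return {
        "size": len(content),
        "md5": hashlib.md5(content).hexdigest(),
        "base64": base64.b64encode(content).decode("ascii"),
    }


if __name__ == "__main__":
    # First four bytes of a JFIF file (shown as "ÿØÿà" in the telnet session).
    jpeg_signature = b"\xff\xd8\xff\xe0"
    info = describe_bytes(jpeg_signature)
    print(info["size"], info["md5"], info["base64"])
```

Base64 makes the binary content safe to embed in an XML response, and the hash lets a later reader confirm that the decoded bytes are intact.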
15
The Conscientious Webmaster
Preservation is important
But I'm soooo busy
How to help???
"He who waits to do a great deal of good will never do anything." -- Samuel Johnson
16
Metadata Generation Utility Examples
17
Post-Harvest Processing
Often a combination of manual and automated input
18
Harvest with Metadata
Harvest
Pre-processed resource
Metadata Magic: Get the resource together with its metadata
19
Configuring the Web-Server for Metadata Magic
  • http://foo.edu/example.html
  • No impact to everyday users
    • Regular GET => regular response
    • OAI-PMH GetRecord => CRATE response
  • Standard Apache Location directive
  • mod_oai module configured with plug-ins
  • Scripts, utilities, etc. can vary by MIME type

http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/example.html&metadataPrefix=crate
20
Metadata Magic with mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding=base64</mimeType>
        <data>JVBERi0xLjQK MyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
    Comments: LEAD Technologies Inc. V1.01
    ApplicationSegments: APP0
          </data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
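
An archivist receiving a response like this one needs to pull each utility's output back out. The following is a rough sketch with the Python standard library. It matches elements by local name because the slides do not state which namespace the CRATE elements belong to. `plugin_metadata` is an illustrative name, and the sample is a shortened version of the response above.

```python
import xml.etree.ElementTree as ET

# Shortened sample of a GetRecord response with metadataPrefix=crate.
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
      </header>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data>Status: Well-Formed and valid</data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree adds to a tag."""
    return tag.rsplit("}", 1)[-1]


def plugin_metadata(xml_text):
    """Map each plug-in label to its exec, version, and data fields."""
    root = ET.fromstring(xml_text)
    results = {}
    for elem in root.iter():
        if local_name(elem.tag) != "description":
            continue
        fields = {local_name(child.tag): (child.text or "").strip() for child in elem}
        results[fields.get("label", "")] = fields
    return results


if __name__ == "__main__":
    for label, fields in plugin_metadata(SAMPLE_RESPONSE).items():
        print(label, "->", fields["data"])
```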
21
Best-Effort Metadata
  • Unverified
    • Utility results are not cross-checked
    • Output of analyses goes directly into XML response
  • Undifferentiated
    • No categorization of output
    • Resource and metadata cohabit response
  • Extemporaneous
    • Generated at time of dissemination
    • Integrates preservation functions with the web server
  • A simple, easy-to-implement option for improving web resource metadata

22
YAMM?! (Yet Another Metadata Model?)
23
The MPEG-21 DIDL Model
24
Preservation Metadata
[Figure: Probability of Preservation (Low to High) plotted against Resource Metadata Available (Less to More)]
25
Webs >> Archivists
Typical ingest scenario
Archivist
Web Sites
26
Harnessing the Web Server
  • User: standard GET request and response
  • Archivist: mod_oai GetRecord request and response

Self-describing resource
27
What is a Self-Describing Resource?
Standard HTTP Headers
  Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
  ETag: "5800535-3e72-4312f924"
  Content-Length: 15986
  Content-Type: image/jpeg
PLUS Output from built-in utilities
EXIF TOOL
  File Name: 103_0315.JPG
  Camera Model Name: Canon EOS DIGITAL REBEL
  Date/Time Original: 2003:09:30 13:37:51
  Shooting Mode: Sports
  Shutter Speed: 1/2000
  Aperture: 7.1
  Metering Mode: Evaluative
  Exposure Compensation: 0
  ISO: 400
  Lens: 75.0 - 300.0mm
  Focal Length: 300.0mm
  Image Size: 3072x2048
  Quality: Normal
  Flash: Off
  White Balance: Auto
  Focus Mode: AI Servo AF
  Contrast: 1
  Sharpness: 1
  Saturation: 1
  Color Tone: Normal
  File Size: 1606 kB
  File Number: 103-0315
JHOVE TOOL
  Date: 2007-06-18 14:35:50 EDT
  RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
   ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
   LastModified: 2007-01-16 23:09:07 EST
   Size: 27750
   Format: JPEG
   Version: 1.00
   Status: Well-Formed and valid
   SignatureMatches: JPEG-hul
   MIMEtype: image/jpeg
   Profile: JFIF
   JPEGMetadata:
    CompressionType: Huffman coding, Baseline DCT
    Images:
     Number: 1
     Image:
      NisoImageMetadata:
       MIMEType: image/jpeg
       ByteOrder: big-endian
       CompressionScheme: JPEG
       ColorSpace: YCbCr
       SamplingFrequencyUnit: inch
       XSamplingFrequency: 33
       YSamplingFrequency: 26
       ImageWidth: 172
       ImageLength: 146
       BitsPerSample: 8, 8, 8
       SamplesPerPixel: 3
      Scans: 1
      QuantizationTables:
       QuantizationTable:
        Precision: 8-bit
        DestinationIdentifier: 0
      Comments: LEAD Technologies Inc. V1.01
      ApplicationSegments: APP0
File/Magic
  JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash
  58a54e8638db432f4515eedf89f44505
CRATE: Wrapped together with the resource in simple XML
28
Metadata Generation Utility Examples
29
Web Server Configuration: conf file
  • Section 1: Global Environment
    • ServerType standalone
    • ServerRoot "/etc/httpd"
    • PidFile /var/run/httpd.pid
    • ResourceConfig /dev/null
    • AccessConfig /dev/null
    • Timeout 300
    • KeepAlive On
    • MaxKeepAliveRequests 0
    • KeepAliveTimeout 15
    • MinSpareServers 16
    • MaxSpareServers 64
    • StartServers 16
    • MaxClients 512
    • MaxRequestsPerChild 100000
  • Section 2: 'Main' server configuration
    • Operational Rules
    • Modules (mod_perl, etc.)
    • Security
    • Virtual Hosts

<Files *.pl>
  Options None
  AllowOverride None
  Order deny,allow
  Deny from all
</Files>
<IfModule mod_dir.c>
  DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>
<IfModule mod_include.c>
  Include conf/mmap.conf
</IfModule>
UseCanonicalName On
<IfModule mod_mime.c>
  TypesConfig /etc/httpd/conf/mime.types
</IfModule>
DefaultType text/plain
HostnameLookups Off
30
Apache mod_oai Location Directive
<Location /modoai>
  SetHandler modoai-handler
  modoai_oai_active ON
  <modoai_plugin>
    label md5sum
    exec /usr/bin/md5sum %s
    version /usr/bin/md5sum --version
    mime */*
  </modoai_plugin>
  <modoai_plugin>
    label file
    exec /usr/bin/file -kz %s
    version /usr/bin/file -v
    mime */*
  </modoai_plugin>
  <modoai_plugin>
    label jhove
    exec /opt/jhove/jhove -m pdf-hul %s
    version /opt/jhove/jhove -v

  • Scripts
  • Pipes
  • Executables
  • MIME-based selective processing
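
MIME-based selective processing can be pictured as a lookup: each plug-in carries a MIME pattern, and only the plug-ins whose pattern matches the resource are run. The sketch below illustrates that idea in Python and is not how mod_oai itself is implemented. The plug-in table copies the labels and commands from the directive above; the `application/pdf` pattern for jhove is an assumption, since the slide cuts off before that plug-in's mime line.

```python
import shlex
from fnmatch import fnmatch

# Labels and commands from the Location directive on this slide.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    # Assumed pattern: the slide is truncated before this plug-in's mime line.
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
]


def commands_for(path, mime_type, plugins=PLUGINS):
    """Return (label, argv) for every plug-in whose MIME pattern matches."""
    selected = []
    for plugin in plugins:
        if not fnmatch(mime_type, plugin["mime"]):
            continue
        # Substitute the file path for the %s placeholder, token by token,
        # so paths containing spaces stay a single argument.
        argv = [path if token == "%s" else token for token in shlex.split(plugin["exec"])]
        selected.append((plugin["label"], argv))
    return selected


if __name__ == "__main__":
    for label, argv in commands_for("/var/www/jackJill.jpg", "image/jpeg"):
        print(label, argv)
```

Building an argument list rather than a shell string avoids quoting problems when a resource name contains spaces or shell metacharacters.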

31
Building a CRATE
  • URI, UUID
  • Standard HTTP Headers
  • Plug-In Metadata
  • Base64-Encoded Resource
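
The four ingredients above can be assembled into one XML document in a few lines. This is an illustrative sketch and not the CRATE schema. The element names `crateContent`, `crateMetadata`, `description`, `label`, and `data` appear in the mod_oai responses in this deck, while `crate`, `identifier`, `uuid`, `httpHeaders`, and `header` are hypothetical names chosen here. Deriving the UUID from the URI with `uuid5` is also an assumption; the slides do not say how the UUID is generated.

```python
import base64
import uuid
import xml.etree.ElementTree as ET


def build_crate(uri, http_headers, plugin_outputs, content):
    """Wrap a resource and its metadata in one XML document (returned as str)."""
    crate = ET.Element("crate")  # hypothetical root element name
    ET.SubElement(crate, "identifier").text = uri
    # Assumption: a name-based UUID derived from the URI.
    ET.SubElement(crate, "uuid").text = str(uuid.uuid5(uuid.NAMESPACE_URL, uri))

    headers_el = ET.SubElement(crate, "httpHeaders")  # hypothetical name
    for name, value in http_headers.items():
        ET.SubElement(headers_el, "header", name=name).text = value

    metadata_el = ET.SubElement(crate, "crateMetadata")
    for label, output in plugin_outputs.items():
        description = ET.SubElement(metadata_el, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "data").text = output

    content_el = ET.SubElement(crate, "crateContent")
    ET.SubElement(content_el, "data").text = base64.b64encode(content).decode("ascii")
    return ET.tostring(crate, encoding="unicode")


if __name__ == "__main__":
    print(build_crate(
        "http://foo.edu/jackJill.jpg",
        {"Content-Type": "image/jpeg", "Content-Length": "15986"},
        {"file": "JPEG image data, JFIF standard 1.00"},
        b"\xff\xd8\xff\xe0",
    ))
```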

32
CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding=base64</mimeType>
        <data>JVBERi0xLjQK MyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxHZ56diZdOiXjHNfEq9jOuDTzEc</data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
    Comments: LEAD Technologies Inc. V1.01
    ApplicationSegments: APP0
          ]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
33
Automatic, Best-Effort Metadata
  • Automatic
    • Generated at time of dissemination
    • Integrates preservation functions with the web server
  • Unverified
    • Utility results are not cross-checked
    • Output of analyses goes directly into XML response
  • Undifferentiated
    • No categorization of output
    • Resource and metadata form complex-object response
  • A simple, easy-to-implement option for improving available preservation metadata for web resources
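
"Best-effort" here means each utility is run when the resource is served and whatever it prints is passed along unchecked. A minimal sketch of that behavior in Python follows, assuming a hypothetical helper named `run_best_effort`. A utility that is missing, fails, or times out produces a record saying so, and the response can still be built from the other utilities' output.

```python
import subprocess
import sys


def run_best_effort(label, argv, timeout=10):
    """Run one utility and return its output without verifying it."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # A missing or hung utility must not block the response.
        return {"label": label, "ok": False, "data": str(exc)}
    output = done.stdout.strip() or done.stderr.strip()
    return {"label": label, "ok": done.returncode == 0, "data": output}


if __name__ == "__main__":
    # The Python interpreter stands in for a metadata utility so that the
    # example runs on any machine.
    print(run_best_effort("demo", [sys.executable, "-c", "print('JPEG image data')"]))
    print(run_best_effort("missing", ["/nonexistent/utility"]))
```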

34
Issues - Or Not?
  • Web Server Performance
    • Academic vs dot-com expectations
    • Solution options
  • Utility Efficiency
    • Java-based vs C-based
    • Market pressures
  • Security
    • Metadata vs risk
    • Access controls

35
Next Up
  • mod_oai Open Source release
  • Formalize/release CRATE schema definition (XSD)
  • Metrics Collection & Evaluation
    • Academic sites
    • Dot-Com sites
  • Examine utility compatibility and issues
  • Address security concerns

36
Examples
  • MPEG-21 DIDL
    • http://beatitude.cs.odu.edu:9090/washingtonpost1.txt
    • http://beatitude.cs.odu.edu:9090/modoai/?verb=GetRecord&metadataPrefix=oai_didl&identifier=http://beatitude.cs.odu.edu:9090/washingtonpost1.txt
    • http://beatitude.cs.odu.edu:9090/index.html
    • http://beatitude.cs.odu.edu:9090/modoai/?verb=GetRecord&metadataPrefix=oai_didl&identifier=http://beatitude.cs.odu.edu:9090/index.html
  • CRATE
    • http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
    • http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
    • http://beatitude.cs.odu.edu:8080/index.html
    • http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://beatitude.cs.odu.edu:8080/index.html

37
Further Information
  • The mod_oai project home page
    • http://www.modoai.org/
  • JCDL 2007
    • Generating Best-Effort Preservation Metadata for Web Resources at Time of Dissemination
  • IWAW 2007
    • CRATE: A Simple Model for Self-Describing Web Resources
  • Authors' web sites
    • http://www.cs.odu.edu/~mln/pubs/
    • http://www.joanasmith.com/pubs.html