Title: CRATE: A Simple Model for Self-Describing Web Resources
1. CRATE: A Simple Model for Self-Describing Web Resources
- International Web Archiving Workshop 2007
Joan A. Smith, Michael L. Nelson
Old Dominion University, Department of Computer Science
Norfolk, VA 23529
jsmit, mln_at_cs.odu.edu
2. WWW and Digital Libraries: Vastly Different Worlds
- World Wide Web
- A disorganized free-for-all
- Near-zero metadata
- Unpredictable additions, deletions, modifications
- No preservation policy
vs
3. Web Sites: Metadata Challenged
HTML metadata
telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ
JPEG metadata
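The exchange above shows how little a plain HTTP response says about a resource. As a minimal sketch (Python, standard library only), the header block can be parsed into a dictionary to see exactly which fields are available. The helper name `parse_headers` is illustrative and not part of mod_oai; the header values are the ones shown on the slide.

```python
from email.parser import Parser

# Header block from the slide's response for /jackJill.jpg.
RAW_HEADERS = (
    "Date: Mon, 11 Jun 2007 16:49:25 GMT\n"
    "Server: Apache/1.3.33 (Unix)\n"
    "Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT\n"
    'ETag: "5800535-3e72-4312f924"\n'
    "Accept-Ranges: bytes\n"
    "Content-Length: 15986\n"
    "Content-Type: image/jpeg\n"
)


def parse_headers(raw: str) -> dict:
    """Parse an RFC 822-style header block into a plain dict."""
    message = Parser().parsestr(raw, headersonly=True)
    return dict(message.items())


headers = parse_headers(RAW_HEADERS)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Everything the server volunteers about the image fits in these seven fields; none of them describes the image content itself.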
4. Archives: Metadata-Rich
5. YAMM?! (Yet Another Metadata Model?)
6. The MPEG-21 DIDL Model
7. Preservation Metadata
[Figure: Probability of Preservation (Low to High) versus Resource Metadata Available (Less to More)]
8. Webs >> Archivists
[Figure: Typical ingest scenario, showing the archivist and the web sites]
9. Harnessing the Web Server
- User: standard GET request and response
- Archivist: mod_oai GetRecord request and response
- Result: a self-describing resource
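The archivist's request is an ordinary OAI-PMH GetRecord URL. The following is a minimal sketch of how such a URL could be assembled in Python; the function name `getrecord_url` is illustrative, while the base URL, identifier, and `crate` metadata prefix are taken from the example later in these slides.

```python
from urllib.parse import urlencode


def getrecord_url(base_url: str, identifier: str, metadata_prefix: str = "crate") -> str:
    """Build an OAI-PMH GetRecord request URL with percent-encoded arguments."""
    query = urlencode(
        {
            "verb": "GetRecord",
            "identifier": identifier,
            "metadataPrefix": metadata_prefix,
        }
    )
    return f"{base_url}?{query}"


url = getrecord_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg")
print(url)
```

Note that `urlencode` percent-encodes the identifier (`:` and `/`), whereas the slides show it unencoded for readability.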
10. What is a Self-Describing Resource?
Standard HTTP Headers:
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg
PLUS: Output from built-in utilities
EXIF TOOL
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: 75.0 - 300.0mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: 1
Sharpness: 1
Saturation: 1
Color Tone: Normal
File Size: 1606 kB
File Number: 103-0315
JHOVE TOOL
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
  Comments: LEAD Technologies Inc. V1.01
  ApplicationSegments: APP0
File/Magic: JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26
MD5 Hash: 58a54e8638db432f4515eedf89f44505
CRATE: Wrapped together with the resource in simple XML
11. Metadata Generation Utility Examples
12. Web Server Configuration: conf file
- Section 1: Global Environment
- ServerType standalone
- ServerRoot "/etc/httpd"
- PidFile /var/run/httpd.pid
- ResourceConfig /dev/null
- AccessConfig /dev/null
- Timeout 300
- KeepAlive On
- MaxKeepAliveRequests 0
- KeepAliveTimeout 15
- MinSpareServers 16
- MaxSpareServers 64
- StartServers 16
- MaxClients 512
- MaxRequestsPerChild 100000
- Section 2: 'Main' server configuration
- Operational Rules
- Modules (mod_perl, etc.)
- Security
- Virtual Hosts
<Files .pl>
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Files>
<IfModule mod_dir.c>
    DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>
<IfModule mod_include.c>
    Include conf/mmap.conf
</IfModule>
UseCanonicalName On
<IfModule mod_mime.c>
    TypesConfig /etc/httpd/conf/mime.types
</IfModule>
DefaultType text/plain
HostnameLookups Off
13. Apache mod_oai Location Directive
- <Location /modoai>
- SetHandler modoai-handler
- modoai_oai_active ON
- <modoai_plugin>
- label md5sum
- exec /usr/bin/md5sum %s
- version /usr/bin/md5sum --version
- mime */*
- </modoai_plugin>
- <modoai_plugin>
- label file
- exec /usr/bin/file -kz %s
- version /usr/bin/file -v
- mime */*
- </modoai_plugin>
- <modoai_plugin>
- label jhove
- exec /opt/jhove/jhove -m pdf-hul %s
- version /opt/jhove/jhove -v
- Scripts
- Pipes
- Executables
- MIME-based selective processing
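The plug-in idea (any script, pipe, or executable, selected by MIME type) can be illustrated outside Apache. The sketch below is not mod_oai code: it is a hypothetical Python mirror of the plug-in entries, assuming that `%s` in an `exec` line stands for the resource's file path and that a `mime` pattern of `*/*` matches every type. The `application/pdf` pattern for the jhove entry is also an assumption, since the slide does not show that plug-in's `mime` line.

```python
import shlex
import subprocess
from fnmatch import fnmatch

# Hypothetical in-memory version of the plug-in entries on the slide.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    # Assumed pattern: the slide does not show the jhove plug-in's mime line.
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
]


def select_plugins(plugins, mime_type):
    """Return the plug-ins whose MIME pattern matches the resource type."""
    return [p for p in plugins if fnmatch(mime_type, p["mime"])]


def build_command(template, path):
    """Split the template first, then substitute, so a path with spaces stays one argument."""
    return [part.replace("%s", path) for part in shlex.split(template)]


def run_plugin(plugin, path):
    """Run one plug-in against a file and return whatever it printed."""
    result = subprocess.run(
        build_command(plugin["exec"], path),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout


selected = select_plugins(PLUGINS, "image/jpeg")
for plugin in selected:
    print(plugin["label"], build_command(plugin["exec"], "/tmp/jackJill.jpg"))
```

Passing the command as an argument list rather than a shell string avoids shell interpretation of file names, which matters when the path comes from a web request.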
14. Building a CRATE
- URI, UUID
- Standard HTTP Headers
- Plug-In Metadata
- Base64-Encoded Resource
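A rough sketch of the assembly step, in Python rather than in the Apache module itself: the element names (`crateContent`, `mimeType`, `data`, `crateMetadata`, `description`, `label`, `exec`, `version`) follow the mod_oai example in these slides, while the function `build_crate` and its arguments are illustrative. The UUID and the standard HTTP headers are left out because the slides do not show which elements carry them.

```python
import base64
import xml.etree.ElementTree as ET


def build_crate(identifier, mime_type, resource_bytes, plugin_outputs):
    """Wrap a resource and its plug-in metadata in a CRATE-style record element.

    plugin_outputs is a list of (label, command, version, output) tuples.
    """
    record = ET.Element("record")
    header = ET.SubElement(record, "header")
    ET.SubElement(header, "identifier").text = identifier

    # The resource itself travels inside the XML as Base64 text.
    content = ET.SubElement(record, "crateContent")
    ET.SubElement(content, "mimeType").text = mime_type
    ET.SubElement(content, "data").text = base64.b64encode(resource_bytes).decode("ascii")

    # One <description> per utility; output is stored as-is, unverified.
    metadata = ET.SubElement(record, "crateMetadata")
    for label, command, version, output in plugin_outputs:
        description = ET.SubElement(metadata, "description")
        ET.SubElement(description, "label").text = label
        ET.SubElement(description, "exec").text = command
        ET.SubElement(description, "version").text = version
        ET.SubElement(description, "data").text = output
    return record


crate = build_crate(
    "http://foo.edu/jackJill.jpg",
    "image/jpeg",
    b"\xff\xd8\xff\xe0",  # first four bytes of a JPEG file, standing in for the image
    [("file magic", "/usr/bin/file jackJill.jpg", "file-4.16", "JPEG image data")],
)
print(ET.tostring(crate, encoding="unicode"))
```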
15. CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate
<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg encoding=base64</mimeType>
        <data>JVBERi0xLjQK MyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxH Z56diZdOiXjHNfEq9jOuDTzEc </data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
 ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
 LastModified: 2007-01-16 23:09:07 EST
 Size: 27750
 Format: JPEG
 Version: 1.00
 Status: Well-Formed and valid
 SignatureMatches: JPEG-hul
 MIMEtype: image/jpeg
 Profile: JFIF
 JPEGMetadata:
  CompressionType: Huffman coding, Baseline DCT
  Images:
   Number: 1
   Image:
    NisoImageMetadata:
     MIMEType: image/jpeg
     ByteOrder: big-endian
     CompressionScheme: JPEG
     ColorSpace: YCbCr
     SamplingFrequencyUnit: inch
     XSamplingFrequency: 33
     YSamplingFrequency: 26
     ImageWidth: 172
     ImageLength: 146
     BitsPerSample: 8, 8, 8
     SamplesPerPixel: 3
    Scans: 1
    QuantizationTables:
     QuantizationTable:
      Precision: 8-bit
      DestinationIdentifier: 0
  Comments: LEAD Technologies Inc. V1.01
  ApplicationSegments: APP0
          ]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
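On the receiving side, an archivist's harvester would need to pull the resource and its metadata back out of such a response. The following is a minimal sketch under stated assumptions: the CRATE elements are assumed to sit in the default OAI-PMH namespace, as the example suggests; the `SAMPLE` document is a shortened stand-in for a real response; and `unpack_crate` is an illustrative name.

```python
import base64
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Shortened stand-in for a GetRecord response with metadataPrefix=crate.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header><identifier>http://foo.edu/jackJill.jpg</identifier></header>
    <crateContent>
      <mimeType>image/jpeg</mimeType>
      <data>/9j/4A==</data>
    </crateContent>
    <crateMetadata>
      <description><label>file magic</label><data>JPEG image data</data></description>
      <description><label>jhove</label><data>Status: Well-Formed and valid</data></description>
    </crateMetadata>
  </record></GetRecord>
</OAI-PMH>"""


def unpack_crate(xml_text):
    """Return (identifier, resource bytes, {plug-in label: raw output})."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    identifier = record.findtext("oai:header/oai:identifier", namespaces=OAI_NS)
    encoded = record.findtext("oai:crateContent/oai:data", namespaces=OAI_NS)
    resource = base64.b64decode(encoded)
    metadata = {
        d.findtext("oai:label", namespaces=OAI_NS): d.findtext("oai:data", namespaces=OAI_NS)
        for d in record.iterfind("oai:crateMetadata/oai:description", OAI_NS)
    }
    return identifier, resource, metadata


identifier, resource, metadata = unpack_crate(SAMPLE)
print(identifier, len(resource), list(metadata))
```

Because the plug-in output is stored as undifferentiated text, interpreting each `data` value is left to the archive.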
16. Automatic, Best-Effort Metadata
- Automatic
- Generated at time of dissemination
- Integrates preservation functions with the web server
- Unverified
- Utility results are not cross-checked
- Output of analyses goes directly into the XML response
- Undifferentiated
- No categorization of output
- Resource and metadata form a complex-object response
- A simple, easy-to-implement option for improving available preservation metadata for web resources
17. Issues - Or Not?
- Web Server Performance
- Academic vs dot-com expectations
- Solution options
- Utility Efficiency
- Java-based vs C-based
- Market pressures
- Security
- Metadata vs risk
- Access controls
18. Next Up
- mod_oai Open Source release
- Formalize/release CRATE schema definition (XSD)
- Metrics Collection and Evaluation
- Academic sites
- Dot-Com sites
- Examine utility compatibility and issues
- Address security concerns
19. Demo
- TODAY
- http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
- http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg
- AT MODOAI.ORG
- http://www.modoai.org/demos.html
20. Further Information
- The mod_oai project home page
- http://www.modoai.org/
- JCDL 2007
- Generating Best Effort Preservation Metadata For Web Resources At Time Of Dissemination
- Authors' web pages
- http://www.cs.odu.edu/~mln/pubs/
- http://www.joanasmith.com/pubs.html