CRATE: A Simple Model for Self-Describing Web Resources

1
CRATE: A Simple Model for Self-Describing Web Resources
  • International Web Archiving Workshop 2007

Joan A. Smith, Michael L. Nelson
Old Dominion University, Department of Computer Science
Norfolk, VA 23529
{jsmit, mln}@cs.odu.edu
2
WWW and Digital Libraries: Vastly Different Worlds
  • World Wide Web
  • A disorganized free-for-all
  • Near-zero metadata
  • Unpredictable additions, deletions, modifications
  • No preservation policy

vs
3
Web Sites: Metadata Challenged
HTML metadata
telnet foo.edu 80
Trying 82.165.199.160...
Connected to foo.edu.
Escape character is '^]'.
GET /jackJill.jpg HTTP/1.1
Host: foo.edu

HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg

ÿØÿà "2s35Rq³ÁÂCcruÃÒÿÄ
JPEG metadata
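To make concrete how little a plain HTTP response says about the resource, here is a minimal Python sketch that tabulates the response head shown on this slide. The function name `parse_response_head` is hypothetical, and the header text is copied from the slide with its colons restored; everything about the image itself remains locked inside the binary body.

```python
# Header text taken from the slide; the function below is an illustrative helper.
RAW_RESPONSE = """HTTP/1.1 200 OK
Date: Mon, 11 Jun 2007 16:49:25 GMT
Server: Apache/1.3.33 (Unix)
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Accept-Ranges: bytes
Content-Length: 15986
Content-Type: image/jpeg
"""


def parse_response_head(raw):
    """Split a raw HTTP response head into its status line and a header dict."""
    lines = raw.strip().splitlines()
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the header section
        # Split on the first colon only, so time values keep their colons.
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers


status, headers = parse_response_head(RAW_RESPONSE)
for name, value in headers.items():
    print(f"{name}: {value}")
```

Seven headers come back, and only Last-Modified, Content-Length, and Content-Type describe the resource at all.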
4
Archives: Metadata-Rich
5
YAMM?! (Yet Another Metadata Model?)
6
The MPEG-21 DIDL Model
7
Preservation Metadata
[Figure: Probability of Preservation (Low to High) against Resource Metadata Available (Less to More)]
8
Webs >> Archivists
[Figure: typical ingest scenario, showing Web Sites and the Archivist]
9
Harnessing the Web Server
  • User: standard GET request and response
  • Archivist: mod_oai GetRecord request and response

Self-describing resource
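As a rough illustration of the archivist's side, the following Python sketch builds a GetRecord request URL of the kind shown later in the deck. The helper name `get_record_url` is hypothetical; the base URL, identifier, and `crate` metadata prefix are the example values from the slides. Percent-encoding the identifier is a choice made here, not something the slides specify.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def get_record_url(base_url, identifier, metadata_prefix="crate"):
    """Build an OAI-PMH GetRecord URL; urlencode percent-encodes the identifier."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


url = get_record_url("http://foo.edu/modoai/", "http://foo.edu/jackJill.jpg")
print(url)

# Decoding the query string recovers the original arguments.
decoded = parse_qs(urlsplit(url).query)
print(decoded["identifier"][0])
```

An ordinary user would simply GET http://foo.edu/jackJill.jpg; the archivist asks the same server for the same resource through the mod_oai handler instead.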
10
What is a Self-Describing Resource?
Standard HTTP Headers
Last-Modified: Mon, 29 Aug 2005 12:01:40 GMT
ETag: "5800535-3e72-4312f924"
Content-Length: 15986
Content-Type: image/jpeg

PLUS: Output from built-in utilities

EXIF TOOL
File Name: 103_0315.JPG
Camera Model Name: Canon EOS DIGITAL REBEL
Date/Time Original: 2003:09:30 13:37:51
Shooting Mode: Sports
Shutter Speed: 1/2000
Aperture: 7.1
Metering Mode: Evaluative
Exposure Compensation: 0
ISO: 400
Lens: 75.0 - 300.0mm
Focal Length: 300.0mm
Image Size: 3072x2048
Quality: Normal
Flash: Off
White Balance: Auto
Focus Mode: AI Servo AF
Contrast: 1
Sharpness: 1
Saturation: 1
Color Tone: Normal
File Size: 1606 kB
File Number: 103-0315

JHOVE TOOL
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0

File/Magic: JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26

MD5 Hash: 58a54e8638db432f4515eedf89f44505

CRATE: Wrapped together with the resource in simple XML
11
Metadata Generation Utility Examples
12
Web Server Configuration: conf file
  • Section 1: Global Environment
  • ServerType standalone
  • ServerRoot "/etc/httpd"
  • PidFile /var/run/httpd.pid
  • ResourceConfig /dev/null
  • AccessConfig /dev/null
  • Timeout 300
  • KeepAlive On
  • MaxKeepAliveRequests 0
  • KeepAliveTimeout 15
  • MinSpareServers 16
  • MaxSpareServers 64
  • StartServers 16
  • MaxClients 512
  • MaxRequestsPerChild 100000
  • Section 2: 'Main' server configuration
  • Operational Rules
  • Modules (mod_perl, etc.)
  • Security
  • Virtual Hosts

<Files .pl>
    Options None
    AllowOverride None
    Order deny,allow
    Deny from all
</Files>
<IfModule mod_dir.c>
    DirectoryIndex index.htm index.html index.php index.php3 default.html index.cgi
</IfModule>
<IfModule mod_include.c>
    Include conf/mmap.conf
</IfModule>
UseCanonicalName On
<IfModule mod_mime.c>
    TypesConfig /etc/httpd/conf/mime.types
</IfModule>
DefaultType text/plain
HostnameLookups Off
13
Apache mod_oai Location Directive
  • <Location /modoai>
  • SetHandler modoai-handler
  • modoai_oai_active ON
  • <modoai_plugin>
  • label md5sum
  • exec /usr/bin/md5sum %s
  • version /usr/bin/md5sum --version
  • mime */*
  • </modoai_plugin>
  • <modoai_plugin>
  • label file
  • exec /usr/bin/file -kz %s
  • version /usr/bin/file -v
  • mime */*
  • </modoai_plugin>
  • <modoai_plugin>
  • label jhove
  • exec /opt/jhove/jhove -m pdf-hul %s
  • version /opt/jhove/jhove -v

  • Scripts
  • Pipes
  • Executables
  • MIME-based selective processing
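MIME-based selective processing can be pictured with a small Python sketch. It is not mod_oai's implementation. It assumes each plugin's `exec` line carries a `%s` placeholder for the file path and that `mime` is a glob such as `*/*`. The `application/pdf` pattern on the jhove entry is also an assumption, since the slide cuts off before that plugin's `mime` line. The names `PLUGINS` and `commands_for` are hypothetical.

```python
import shlex
from fnmatch import fnmatchcase

# Plugin table modeled on the slide; the jhove "mime" value is assumed.
PLUGINS = [
    {"label": "md5sum", "exec": "/usr/bin/md5sum %s", "mime": "*/*"},
    {"label": "file", "exec": "/usr/bin/file -kz %s", "mime": "*/*"},
    {"label": "jhove", "exec": "/opt/jhove/jhove -m pdf-hul %s", "mime": "application/pdf"},
]


def commands_for(path, mime_type, plugins=PLUGINS):
    """Return (label, argv) pairs for every plugin whose MIME glob matches."""
    selected = []
    for plugin in plugins:
        if fnmatchcase(mime_type, plugin["mime"]):
            # Replace the %s placeholder token with the resource path.
            argv = [path if token == "%s" else token
                    for token in shlex.split(plugin["exec"])]
            selected.append((plugin["label"], argv))
    return selected


jpeg_commands = commands_for("/tmp/jackJill.jpg", "image/jpeg")
pdf_commands = commands_for("/tmp/paper.pdf", "application/pdf")
for label, argv in jpeg_commands:
    print(label, argv)
```

Under these assumptions a JPEG is handed to md5sum and file only, while a PDF is also handed to the PDF-specific jhove module.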

14
Building a CRATE
  • URI, UUID
  • Standard HTTP Headers
  • Plug-In Metadata
  • Base64-Encoded Resource

15
CRATE example from mod_oai
http://foo.edu/modoai/?verb=GetRecord&identifier=http://foo.edu/jackJill.jpg&metadataPrefix=crate

<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/ http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2007-06-18T18:21:46Z</responseDate>
  <request verb="GetRecord" identifier="http://foo.edu/jackJill.jpg" metadataPrefix="crate">http://foo.edu/crate/</request>
  <GetRecord>
    <record>
      <header>
        <identifier>http://foo.edu/jackJill.jpg</identifier>
        <datestamp>2007-01-17T04:09:07Z</datestamp>
        <setSpec>mime:image:jpeg</setSpec>
      </header>
      <crateContent>
        <mimeType>image/jpeg; encoding=base64</mimeType>
        <data>JVBERi0xLjQK MyAwIG9iaiA8PAovTGVuZ3RoIDM5MjAgICCAKL0ZpIhlzHdxH Z56diZdOiXjHNfEq9jOuDTzEc </data>
      </crateContent>
      <crateMetadata>
        <description>
          <label>file magic</label>
          <exec>/usr/bin/file jackJill.jpg</exec>
          <version>file-4.16</version>
          <data>JPEG image data, JFIF standard 1.00, resolution (DPI), "LEAD Technologies Inc. V1.01", 33 x 26</data>
        </description>
        <description>
          <label>jhove</label>
          <exec>/opt/jhove/jhove -m jpeg-hul</exec>
          <version>Jhove (Rel. 1.1, 2006-06-05)</version>
          <data><![CDATA[
Date: 2007-06-18 14:35:50 EDT
RepresentationInformation: /home/crate/apache/htdocs/jackJill.jpg
ReportingModule: JPEG-hul, Rel. 1.2 (2005-08-22)
LastModified: 2007-01-16 23:09:07 EST
Size: 27750
Format: JPEG
Version: 1.00
Status: Well-Formed and valid
SignatureMatches: JPEG-hul
MIMEtype: image/jpeg
Profile: JFIF
JPEGMetadata:
CompressionType: Huffman coding, Baseline DCT
Images:
Number: 1
Image:
NisoImageMetadata:
MIMEType: image/jpeg
ByteOrder: big-endian
CompressionScheme: JPEG
ColorSpace: YCbCr
SamplingFrequencyUnit: inch
XSamplingFrequency: 33
YSamplingFrequency: 26
ImageWidth: 172
ImageLength: 146
BitsPerSample: 8, 8, 8
SamplesPerPixel: 3
Scans: 1
QuantizationTables:
QuantizationTable:
Precision: 8-bit
DestinationIdentifier: 0
Comments: LEAD Technologies Inc. V1.01
ApplicationSegments: APP0
]]></data>
        </description>
      </crateMetadata>
    </record>
  </GetRecord>
</OAI-PMH>
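On the harvesting side, a response like the one above can be unpacked with a few lines of Python. This is a sketch under assumptions: the `SAMPLE` document is a cut-down stand-in written for this example (its base64 payload is just the four JPEG magic bytes), the CRATE elements are assumed to sit in the OAI-PMH default namespace as they appear on the slide, and `unpack_crate` is a hypothetical helper name.

```python
import base64
import xml.etree.ElementTree as ET

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Cut-down stand-in for a mod_oai CRATE response; not real server output.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord><record>
    <header><identifier>http://foo.edu/jackJill.jpg</identifier></header>
    <crateContent>
      <mimeType>image/jpeg</mimeType>
      <data>/9j/4A==</data>
    </crateContent>
    <crateMetadata>
      <description>
        <label>file magic</label>
        <data>JPEG image data, JFIF standard 1.00</data>
      </description>
    </crateMetadata>
  </record></GetRecord>
</OAI-PMH>"""


def unpack_crate(xml_text):
    """Return the identifier, decoded resource bytes, and plug-in metadata."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI)
    content = record.find("oai:crateContent", OAI)
    # Map each plug-in label to the text it produced.
    metadata = {
        d.findtext("oai:label", namespaces=OAI): d.findtext("oai:data", namespaces=OAI)
        for d in record.findall("oai:crateMetadata/oai:description", OAI)
    }
    return {
        "identifier": record.findtext("oai:header/oai:identifier", namespaces=OAI),
        "mime_type": content.findtext("oai:mimeType", namespaces=OAI).strip(),
        "resource": base64.b64decode(content.findtext("oai:data", namespaces=OAI)),
        "metadata": metadata,
    }


unpacked = unpack_crate(SAMPLE)
print(unpacked["identifier"], unpacked["mime_type"], len(unpacked["resource"]))
```

The plug-in output is kept as opaque text keyed by label, which mirrors the undifferentiated way the response carries it.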
16
Automatic, Best-Effort Metadata
  • Automatic
  • Generated at time of dissemination
  • Integrates preservation functions with the web
    server
  • Unverified
  • Utility results are not cross-checked
  • Output of analyses goes directly into the XML response
  • Undifferentiated
  • No categorization of output
  • Resource and metadata form complex-object
    response
  • A simple, easy-to-implement option for improving
    available preservation metadata for web resources
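One way to read "automatic, best-effort" is sketched below in Python: each utility is run when the resource is requested, its output is passed through without checking or categorizing it, and a utility that cannot run is simply skipped. The skip-on-failure behavior, the timeout, and the name `run_best_effort` are assumptions for illustration, not a description of mod_oai. The demo uses the Python interpreter itself as a stand-in utility so that it runs anywhere.

```python
import subprocess
import sys


def run_best_effort(plugins, path, timeout=10):
    """Run each (label, argv_template) utility on path; keep whatever succeeds."""
    results = []
    for label, argv_template in plugins:
        argv = [path if arg == "%s" else arg for arg in argv_template]
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
        except (OSError, subprocess.TimeoutExpired):
            continue  # best effort: a missing or hung utility is skipped
        if proc.returncode == 0:
            # Output is recorded as-is: unverified and uncategorized.
            results.append({"label": label, "data": proc.stdout})
    return results


demo_plugins = [
    # Stand-in utility: echoes the path it was given.
    ("echo-path", [sys.executable, "-c", "import sys; print(sys.argv[1])", "%s"]),
    # Deliberately nonexistent utility to show the skip path.
    ("missing", ["/nonexistent/utility-for-demo", "%s"]),
]
demo_results = run_best_effort(demo_plugins, "/tmp/jackJill.jpg")
print(demo_results)
```

The design trade-off is the one the slide names: nothing is cross-checked, so the archivist receives more metadata at the cost of having to judge its quality later.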

17
Issues - Or Not?
  • Web Server Performance
  • Academic vs dot-com expectations
  • Solution options
  • Utility Efficiency
  • Java-based vs C-based
  • Market pressures
  • Security
  • Metadata vs risk
  • Access controls

18
Next Up
  • mod_oai Open Source release
  • Formalize/release CRATE schema definition (XSD)
  • Metrics Collection & Evaluation
  • Academic sites
  • Dot-Com sites
  • Examine utility compatibility and issues
  • Address security concerns

19
Demo
  • TODAY
  • http://beatitude.cs.odu.edu:8080/modoaitest/diag.jpg
  • http://beatitude.cs.odu.edu:8080/modoai/?verb=GetRecord&metadataPrefix=crate&identifier=http://localhost/modoaitest/diag.jpg
  • AT MODOAI.ORG
  • http://www.modoai.org/demos.html

20
Further Information
  • The mod_oai project home page
  • http://www.modoai.org/
  • JCDL 2007
  • Generating Best Effort Preservation Metadata
    For Web Resources At Time Of Dissemination
  • Authors' web pages
  • http://www.cs.odu.edu/~mln/pubs/
  • http://www.joanasmith.com/pubs.html