Title: MPEG-21 Digital Item Declaration (ISO/IEC 21000-2): an overview
1File-based storage of Digital Objects and constituent datastreams: XMLtapes and Internet Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research and Prototyping Team, Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University
liu_x_at_lanl.gov, ludab_at_lanl.gov, patrick.hochstenbach_at_ugent.be, herbertv_at_lanl.gov
2Disclaimer
- The term Digital Object (DO) is used as in Kahn/Wilensky
  - Compound object
  - Multiple datastreams of different MIME types
  - Secondary information pertaining to object and datastreams
  - Identifiers for object (and datastreams)
- This is OAIS Content Information
3XML-based representation of DOs
- Growing interest in XML-based representation of DOs in Digital Library architectures
  - Platform independence
  - Industry support
  - Longevity, potential migration paths
  - Processing tools, validation capabilities
- XML-based Compound Object formats
  - ISO/IEC 21000-2: MPEG-21 DID / DIDL
  - METS
  - IMS/CP
  - CCSDS XFDU
- Typical functionality
  - By-Value (base64) and/or By-Reference provision of constituent datastreams
  - By-Value and/or By-Reference provision of secondary information
  - Provision of identifiers
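The difference between By-Value and By-Reference provision can be sketched in a few lines of Python. This is only an illustration: the `resource` element, its attributes, and the example.org location are hypothetical and do not follow any of the formats listed above.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical datastream and network location, for illustration only.
datastream = bytes(range(256)) * 12            # 3072 bytes of binary data
location = "http://example.org/datastreams/1"

# By-Value: the datastream travels inside the XML document, base64-encoded.
by_value = ET.Element("resource", {"mimeType": "application/octet-stream"})
by_value.text = base64.b64encode(datastream).decode("ascii")

# By-Reference: the XML document only carries a pointer to the datastream.
by_reference = ET.Element(
    "resource", {"mimeType": "application/octet-stream", "ref": location}
)

encoded_size = len(by_value.text)
by_value_size = len(ET.tostring(by_value))
by_reference_size = len(ET.tostring(by_reference))

print("raw datastream:      ", len(datastream), "bytes")
print("base64 payload:      ", encoded_size, "bytes")
print("by-value element:    ", by_value_size, "bytes")
print("by-reference element:", by_reference_size, "bytes")
```

Base64 turns every 3 input bytes into 4 output characters, which is why By-Value provision inflates the XML document by roughly a third of the datastream size.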
4Storing XML-based representations of DOs
- Existing approaches
  - Storage of the XML representations as individual files in a file system
    - Poor access performance
    - Poor backup performance
  - Storage of the XML representations in (SQL, XML, object) databases
    - Long term? Data are dependent on the underlying system
  - Storage of the XML representations by concatenating many such documents into a single file such as tar or zip
    - Not XML-aware, hence no use of off-the-shelf XML tools
- Increasing storage space (base64-encoding of the constituent datastreams)
5aDORe XMLtape/ARCfile solution
- Part of LANL aDORe repository effort
  - Standards-based, modular repository architecture
    - Distributed architecture
    - Protocol-based interactions between modules
    - Usable to create interoperable federations of heterogeneous repositories
  - Actual implementation of the architecture at LANL
  - Components of aDORe software will be released
- Inspired by Internet Archive ARC file approach
  - File-based mechanism to store datastreams resulting from Web crawling
    - Concatenation of multiple datastreams into a single file
    - Metadata as separators between datastreams
  - But not suitable for storing XML-based representations of DOs
    - Metadata capabilities very limited: crawling-related only
    - Lose power of XML processing tools
6aDORe XMLtape/ARCfile solution
- Two interconnected file-based storage mechanisms
  - XMLtapes: file storage of XML-based representations of Digital Objects
  - ARCfiles: file storage of constituent datastreams of Digital Objects
  - The ARC files are interconnected with one or more XMLtapes during the ingestion process
- A protocol-based access mechanism is introduced
  - XMLtape is exposed as an autonomous OAI-PMH repository
  - ARCfile is exposed as an OpenURL Resolver
- Write once - Read many
  - Files remain stable
  - Protocol-based access mechanism remains stable
  - Indexing mechanisms can change as technologies evolve
- Storage approach is independent from the compound object format used to represent DOs as XML
  - aDORe uses MPEG-21 DIDL
7ISO/IEC 21000-2 MPEG-21 DID DIDL
[Figure: a Digital Item has a declaration, the Digital Item Declaration, which has an XML serialization, the DIDL document]
8Representing DOs using MPEG-21 DID
sample DIDL document
9aDORe XMLtape
- An XML file that concatenates the XML-based representations of multiple DOs
- Structure is defined by an XML Schema
  - http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
- Tape-level administrative section
  - Open-ended content
  - Plug-in for processing-related information, indication of related ARCfiles
    - http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
- Concatenation of records, each of which consists of
  - record-level administrative section
    - identifier and datestamp of the contained record
    - other record-level administrative information
  - a record (can be from any XML Namespace); DIDL in the case of aDORe
    - http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd
- An XMLtape is a valid and well-formed XML file
- Independent from chosen XML-based Compound Object Format
10aDORe XMLtape
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    ...
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
      <ta:recordAdmin>
        ...
      </ta:recordAdmin>
    </ta:tapeRecordAdmin>
    <ta:record>
      <didl:DIDL>...</didl:DIDL>
    </ta:record>
  </ta:tapeRecord>
</ta:tape>
aDORe ta:tape
sample XMLtape
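Because an XMLtape is ordinary well-formed XML, off-the-shelf parsers can walk its records. The following is a minimal sketch using Python's standard library, assuming the element names and namespace shown in the sample; the `payload` element is a hypothetical stand-in for a DIDL document, and the second identifier is invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {"ta": "http://library.lanl.gov/2005-08/aDORe/XMLtape/"}

# Toy tape; <payload> stands in for a DIDL document (hypothetical content).
SAMPLE_TAPE = """<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload>first</payload></ta:record>
  </ta:tapeRecord>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>info:example/record-2</ta:identifier>
      <ta:date>2005-03-30T10:00:00Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload>second</payload></ta:record>
  </ta:tapeRecord>
</ta:tape>
""".encode("utf-8")


def read_records(tape_xml):
    """Yield (identifier, datestamp, record element) for every tape record."""
    root = ET.fromstring(tape_xml)
    for tape_record in root.findall("ta:tapeRecord", NS):
        admin = tape_record.find("ta:tapeRecordAdmin", NS)
        identifier = admin.findtext("ta:identifier", namespaces=NS)
        datestamp = admin.findtext("ta:date", namespaces=NS)
        # The first child of <ta:record> is the stored document itself.
        payload = tape_record.find("ta:record", NS)[0]
        yield identifier, datestamp, payload


records = list(read_records(SAMPLE_TAPE))
for identifier, datestamp, payload in records:
    print(identifier, datestamp, payload.tag)
```

Loading the whole tape into memory is fine for a toy example; a tape of realistic size would call for a streaming parser or an index.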
11aDORe XMLtape index
[Figure: XMLtape and its index; each index entry holds the identifier and the datestamp of ingestion taken from <ta:tapeRecordAdmin>]
- Indexing
  - Can be achieved with a variety of technologies
  - Current implementation: Berkeley DB Java Edition
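One way to picture such an index is a lookup table from identifier to datestamp plus the position of the record in the tape. The sketch below is an assumption-laden illustration, not the aDORe implementation: it uses a plain Python dictionary instead of Berkeley DB Java Edition, assumes that byte offsets are what the index stores, and assumes the literal `ta:` prefix on element names. The identifiers and `doc` payloads are invented.

```python
import io
import re

# Toy tape as bytes; identifiers and <doc> payloads are hypothetical.
TAPE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">\n'
    b"<ta:tapeAdmin/>\n"
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>info:example/1</ta:identifier>"
    b"<ta:date>2005-03-29T04:31:22Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><doc>one</doc></ta:record></ta:tapeRecord>\n"
    b"<ta:tapeRecord><ta:tapeRecordAdmin>"
    b"<ta:identifier>info:example/2</ta:identifier>"
    b"<ta:date>2005-03-30T10:00:00Z</ta:date>"
    b"</ta:tapeRecordAdmin><ta:record><doc>two</doc></ta:record></ta:tapeRecord>\n"
    b"</ta:tape>\n"
)

RECORD = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
IDENTIFIER = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>")
DATE = re.compile(rb"<ta:date>(.*?)</ta:date>")


def build_index(tape_bytes):
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD.finditer(tape_bytes):
        chunk = match.group(0)
        identifier = IDENTIFIER.search(chunk).group(1).decode("utf-8")
        datestamp = DATE.search(chunk).group(1).decode("utf-8")
        index[identifier] = (datestamp, match.start(), len(chunk))
    return index


def fetch(tape_file, index, identifier):
    """Read one record from the tape without parsing the rest of the file."""
    _datestamp, offset, length = index[identifier]
    tape_file.seek(offset)
    return tape_file.read(length)


index = build_index(TAPE)
record = fetch(io.BytesIO(TAPE), index, "info:example/2")
print(index["info:example/2"][0], record[:15])
```

Because the index lives outside the tape, it can be rebuilt with a different technology at any time while the tape itself stays untouched.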
12aDORe XMLtape as OAI-PMH repository
[Figure: an OAI-PMH request is answered through the index with a DIDL document from the XMLtape]
- OAI-PMH identifier: identifier from <ta:tapeRecordAdmin>
- OAI-PMH datestamp: datetime from <ta:tapeRecordAdmin>
- OAI-PMH response: content of <ta:record>
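The mapping above can be made concrete with a small sketch: one function builds an OAI-PMH GetRecord request, another wraps a tape record in the OAI-PMH `record` / `header` / `metadata` structure. The base URL and the `didl` metadata prefix are assumptions made for the example, and `payload` is a hypothetical stand-in for a DIDL document.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlencode, urlsplit

OAI = "http://www.openarchives.org/OAI/2.0/"
BASE_URL = "http://example.org/xmltape/oai"   # hypothetical repository baseURL


def get_record_url(identifier, metadata_prefix="didl"):
    """Build an OAI-PMH GetRecord request (the prefix value is an assumption)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{BASE_URL}?{query}"


def oai_record(identifier, datestamp, payload):
    """Wrap tape data in an OAI-PMH record, following the mapping above."""
    record = ET.Element(f"{{{OAI}}}record")
    header = ET.SubElement(record, f"{{{OAI}}}header")
    ET.SubElement(header, f"{{{OAI}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI}}}metadata")
    metadata.append(payload)                  # content of <ta:record>
    return record


url = get_record_url("oai:aps.org:PhysRevA.71.040101")
params = parse_qs(urlsplit(url).query)

payload = ET.Element("payload")               # stand-in for a DIDL document
record = oai_record("oai:aps.org:PhysRevA.71.040101",
                    "2005-03-29T04:31:22Z", payload)
print(url)
print(ET.tostring(record, encoding="unicode"))
```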
13Internet Archive ARCfile
- Concatenation of binary files
- Designed and used by the Internet Archive (Wayback Machine)
  - > 400 TB web data
- Under revision by the International Internet Preservation Consortium (IIPC): WARC file format
  - Input from LANL to facilitate non-Web-crawling use case
- The ARC file format is structured as follows:
  - file header that provides administrative information about the ARC file itself
  - a sequence of document records, each consisting of
    - a header line containing some, mainly crawl-related, metadata
      - URI of the crawled document
      - timestamp of acquisition of the data
      - size of the data block
    - a response to a protocol request such as an HTTP GET
14Internet Archive ARC file
filedesc://IA-001102.arc 0 19960923142103 text/plain 76
1 0 Alexa Internet
URL IP-address Archive-date Content-type Archive-length

http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
HTTP/1.0 200 Document follows
Date: Mon, 04 Nov 1996 14:21:06 GMT
Server: NCSA/1.4.1
Content-type: text/html
Last-modified: Sat,10 Aug 1996 22:33:11 GMT
Content-length: 30

<HTML>
Hello World!!!
</HTML>
sample ARC file
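The header line that precedes each data block is plain space-separated text, so it is easy to parse. The sketch below assumes the five-field layout announced in the file header (URL, IP-address, Archive-date, Content-type, Archive-length) and that URLs contain no spaces; the `ArcHeader` type and the function name are invented for the example.

```python
from collections import namedtuple
from datetime import datetime

ArcHeader = namedtuple(
    "ArcHeader", "url ip_address archive_date content_type archive_length"
)


def parse_header_line(line):
    """Split an ARC record header line into its five fields."""
    url, ip_address, date, content_type, length = line.strip().split(" ")
    return ArcHeader(
        url=url,
        ip_address=ip_address,
        archive_date=datetime.strptime(date, "%Y%m%d%H%M%S"),
        content_type=content_type,
        archive_length=int(length),   # size of the data block in bytes
    )


header = parse_header_line(
    "http://www.dryswamp.edu:80/index.html 127.10.100.2 "
    "19961104142103 text/html 202"
)
file_header = parse_header_line(
    "filedesc://IA-001102.arc 0 19960923142103 text/plain 76"
)
print(header)
print(file_header)
```

The `archive_length` field is what lets a reader skip over a binary data block without inspecting its content.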
15Internet Archive ARC file in aDORe
filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
1 0 Internet Archive
URL IP-address Archive-date Content-type Archive-length

info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
%PDF-1.3
%âãÏÓ
290 0 obj
<<
/Linearized 1
/O 295
/H [ 3642 1057 ]
/L 415025
sample aDORe ARC file
16Internet Archive ARC file
[Figure: ARC file and its index; the index maps each URL to a datastream, and every datastream is stored with the header fields URL, IP-address, Archive-date, Content-type, Archive-length]
- Indexing
  - Can be achieved with a variety of technologies
  - Current implementation in aDORe: Heritrix toolkit
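As a rough illustration of what such an index does, the sketch below scans a toy ARC file once, remembers where each data block starts, and then retrieves a datastream with a single seek. It is not the Heritrix-based implementation: the dictionary index, the helper names, the `info:example/...` identifiers, and the payloads are all assumptions made for the example.

```python
import io


def arc_record(url, content_type, data, ip="0.0.0.0", date="20050907221344"):
    """Serialize one record: header line, data block, newline separator."""
    header = f"{url} {ip} {date} {content_type} {len(data)}\n".encode("utf-8")
    return header + data + b"\n"


# Toy ARC file; identifiers and payloads are hypothetical.
VERSION_BLOCK = (b"1 0 Internet Archive\n"
                 b"URL IP-address Archive-date Content-type Archive-length\n")
PDF = b"%PDF-1.3\n(hypothetical payload)\n"
TEXT = b"Hello World!!!\n"
ARC = (
    arc_record("filedesc://sample.arc", "text/plain", VERSION_BLOCK,
               date="20050922142103")
    + arc_record("info:example/ds/1", "application/pdf", PDF)
    + arc_record("info:example/ds/2", "text/plain", TEXT)
)


def build_index(arc_file):
    """Map URL -> (offset of data block, length), read from the header lines."""
    index = {}
    arc_file.seek(0)
    while True:
        line = arc_file.readline()
        if not line:
            break                      # end of file
        if line == b"\n":
            continue                   # separator between records
        fields = line.decode("utf-8").rstrip("\n").split(" ")
        url, length = fields[0], int(fields[-1])
        index[url] = (arc_file.tell(), length)
        arc_file.seek(length, io.SEEK_CUR)   # skip the data block unread
    return index


def fetch(arc_file, index, url):
    """Return the datastream stored under the given URL."""
    offset, length = index[url]
    arc_file.seek(offset)
    return arc_file.read(length)


arc_file = io.BytesIO(ARC)
index = build_index(arc_file)
datastream = fetch(arc_file, index, "info:example/ds/1")
print(sorted(index))
print(datastream)
```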
17ARC file as OpenURL Resolver
[Figure: an OpenURL request is resolved through the index to a datastream in the ARC file]
- Referent Identifier: datastream identifier (URL from ARC record header)
- Resolver Identifier: identifier of ARC file
18Associating an XMLtape with ARC Files (1)
- A Digital Object is represented using an XML-based Complex Object format (e.g. MPEG-21 DID)
- The resulting package (e.g. DIDL document) is stored in an XMLtape
- Constituent datastreams of the Digital Object are provided By-Reference
  - Using the ref attribute of the Resource element in MPEG-21 DID
- The value of the network location of the constituent datastream is compliant with the NISO OpenURL Framework
  - baseURL (of the ARCfile OpenURL Resolver) ?
  - url_ver = Z39.88-2004
  - rft_id = Datastream Identifier
  - res_id = ARCfile identifier
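Assembling such a network location is a matter of appending the three keys to the resolver's baseURL. The sketch below uses the resolver address and identifiers that appear in the DIDL extract of this presentation; the function name is invented for the example.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datastream_openurl(base_url, datastream_id, arcfile_id):
    """Build the OpenURL that locates a datastream inside an ARC file."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "rft_id": datastream_id,   # Referent Identifier: the datastream
        "res_id": arcfile_id,      # Resolver Identifier: the ARC file
    })
    return f"{base_url}?{query}"


openurl = datastream_openurl(
    "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver",
    "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b",
    "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2",
)
params = parse_qs(urlsplit(openurl).query)
print(openurl)
```

Note that `urlencode` percent-encodes the `:` and `/` characters inside the identifiers; a resolver decodes them again when it parses the query string.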
19Associating an XMLtape with ARC Files (1)
<?xml version="1.0" encoding="UTF-8"?>
<didl:DIDL>
  ...
  <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
    <didl:Descriptor>
      <didl:Statement mimeType="application/xml; charset=utf-8">
        <dii:Identifier>
          info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
        </dii:Identifier>
      </didl:Statement>
    </didl:Descriptor>
    <didl:Resource mimeType="application/pdf"
      ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
  </didl:Component>
  ...
</didl:DIDL>
Extract from DIDL
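Going the other way, a consumer of the DIDL document can pull the resolver address and both identifiers back out of the ref attribute. The sketch below is only illustrative: the slide does not show the namespace declarations, so the `urn:example:didl` namespace is a placeholder and elements are matched by local name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import parse_qs, urlsplit

# The namespace URI is a placeholder; the slide does not show the declaration.
EXTRACT = """<didl:Component xmlns:didl="urn:example:didl">
  <didl:Resource mimeType="application/pdf"
    ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?url_ver=Z39.88-2004&amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2&amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
</didl:Component>"""


def resource_references(xml_text):
    """Yield resolver, ARC file and datastream identifiers of each Resource."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        local_name = element.tag.rsplit("}", 1)[-1]
        if local_name == "Resource" and "ref" in element.attrib:
            parts = urlsplit(element.attrib["ref"])
            params = parse_qs(parts.query)
            yield {
                "resolver": f"{parts.scheme}://{parts.netloc}{parts.path}",
                "res_id": params["res_id"][0],
                "rft_id": params["rft_id"][0],
            }


references = list(resource_references(EXTRACT))
for reference in references:
    print(reference)
```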
20Associating an XMLtape with ARC Files (2)
- An XMLtape is associated with its corresponding
ARCfiles through a plug-in for the XMLtape-level
administrative section.
21Associating an XMLtape with ARC Files (2)
<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
    ...
</ta:tape>
XMLtape header
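Since the administrative section sits at the very start of the tape, the related ARCfiles can be discovered without reading the records that follow. A minimal sketch, assuming the namespaces and element names shown in the header above, uses a streaming parser and stops as soon as the section has been read; the function name is invented for the example.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"
TB = "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/"

HEADER = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
</ta:tape>
"""


def related_arcfiles(tape_file):
    """Return the ARCfile identifiers listed in the tape-level admin section."""
    for _event, element in ET.iterparse(tape_file, events=("end",)):
        if element.tag == f"{{{TA}}}tapeAdmin":
            # Stop here: the records after the admin section are never parsed.
            return [e.text for e in element.iter(f"{{{TB}}}ARCfileId")]
    return []


arcfiles = related_arcfiles(io.BytesIO(HEADER))
print(arcfiles)
```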
22AGENT
23aDORe XMLtape/ARCfile environment
24Implementation
- XMLtapes
  - Berkeley DB Java Edition
  - OCLC OAICat
- ARCfiles
  - Heritrix
  - OCLC OpenURL software
- XMLtape Registry
  - MySQL db
  - OCLC OAICat
- ARCfile Registry
  - MySQL db
  - OCLC OAICat
25Performance indicators
- System
  - Model: Dell 2650 2U rack-mount server
  - CPU: dual 2.8 GHz Intel Xeon processors
  - RAM: 5 GB
  - Disks: 10k RPM SCSI disks
- XMLtape
  - 1786 MB, 201872 DIDL records
  - download 100 consecutive DIDL records (787 KB) => 0.18 second
  - download static file of same size => 0.09 second
- ARCfile
  - 272 MB, 4910 files
  - download a sample PDF file (312 KB) => 0.24 second
  - download static file of same size => 0.036 second
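The figures above are easier to compare as ratios. The short calculation below derives the overhead of protocol-based access relative to a static file download, the time per DIDL record, and the average record size; it only rearranges the numbers on this slide and assumes 1 MB = 1024 KB.

```python
def overhead(via_protocol_s, static_file_s):
    """How many times slower protocol-based access is than a static download."""
    return via_protocol_s / static_file_s


xmltape_overhead = overhead(0.18, 0.09)      # 100 DIDL records vs static file
arcfile_overhead = overhead(0.24, 0.036)     # sample PDF vs static file

per_record_ms = 0.18 / 100 * 1000            # time per DIDL record
avg_record_kb = 1786 * 1024 / 201872         # assumes 1 MB = 1024 KB

print(f"XMLtape access: {xmltape_overhead:.1f}x a static download")
print(f"ARCfile access: {arcfile_overhead:.1f}x a static download")
print(f"time per DIDL record: {per_record_ms:.1f} ms")
print(f"average DIDL record size: {avg_record_kb:.1f} KB")
```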
26Software
- Software - ARC files
  - Heritrix: the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. http://crawler.archive.org/
  - NetArchive.dk: a project that plans for the preservation of Denmark's cultural heritage on the internet for future generations. http://www.netarchive.dk/
  - Many other tools: http://archive-access.sourceforge.net
- XMLtapes
  - Perl tool, XML::Tape (LANL and Ghent University), http://search.cpan.org/~hochsten/XML-Tape/
- Combined aDORe XMLtape/ARCfile environment
  - Java tool (LANL), soon to be released on SourceForge
27Conclusion
- The file-based approach is inherently simple and reduces dependency on database systems.
- The autonomy of the indexes allows retaining the files over time, while the indexes can be created using other techniques as technologies evolve.
- The protocol-based nature of the access increases flexibility in light of evolving technologies, as it introduces another layer of abstraction.
- The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features:
  - Off-the-shelf XML tools can be used to parse/validate an XMLtape
  - All DO metadata can be stored in an XML-based compound object format
- Presentation available via http://public.lanl.gov/herbertv/
  - Install TSCC codec for avi movies