MPEG21 Digital Item Declaration ISOIEC 210002: an overview - PowerPoint PPT Presentation

Provided by: jbek9
Transcript and Presenter's Notes


1
File-based storage of Digital Objects and
constituent datastreams: XMLtapes and Internet
Archive ARC files
Xiaoming Liu (1), Luda Balakireva (1), Patrick
Hochstenbach (2) and Herbert Van de Sompel (1)
(1) Digital Library Research and Prototyping Team,
Research Library, Los Alamos National Laboratory
(2) University Library, Ghent University
liu_x_at_lanl.gov, ludab_at_lanl.gov,
patrick.hochstenbach_at_ugent.be, herbertv_at_lanl.gov

2
Disclaimer
  • The term Digital Object (DO) will be used as in
    Kahn/Wilensky
  • Compound object
  • Multiple datastreams of different mime types
  • Secondary information pertaining to object and
    datastreams
  • Identifiers for object (and datastreams)
  • This is OAIS Content Information

3
XML-based representation of DOs
  • Growing interest in XML-based representation of
    DOs in Digital Library architectures
      • Platform independence
      • Industry support
      • Longevity, potential migration paths
      • Processing tools, validation capabilities
  • XML-based Compound Object formats
      • ISO/IEC 21000-2 MPEG-21 DID DIDL
      • METS
      • IMS/CP
      • CCSDS XFDU
  • Typical functionality
      • By-Value (base64) and/or By-Reference provision
        of constituent datastreams
      • By-Value and/or By-Reference provision of
        secondary information
      • Provision of identifiers

4
Storing XML-based representations of DOs
  • Existing approaches
  • storage of the XML-representations as individual
    files in a file system
  • Poor access performance
  • Poor backup performance
  • storage of the XML-representations in (SQL, XML,
    object) databases
  • Long term? Data are dependent on the underlying
    system
  • storage of the XML-representations by
    concatenating many such documents into a single
    file such as tar or zip
  • Not XML aware, hence, no use of off-the-shelf XML
    tools
  • Increasing storage space (base64-encoding of the
    constituent datastreams)
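
The storage cost of By-Value embedding can be estimated directly: base64 writes 4 characters for every 3 input bytes, so an embedded datastream grows by about one third. The sketch below is illustrative only; the payload is random data, and its size is simply borrowed from the sample PDF on a later slide.

```python
import base64
import os

# Synthetic stand-in for a binary datastream (size chosen for illustration only).
payload = os.urandom(415025)

encoded = base64.b64encode(payload)

# base64 emits 4 output characters for every 3 input bytes, rounding up.
expected_len = 4 * ((len(payload) + 2) // 3)
overhead = len(encoded) / len(payload)

print(len(payload), len(encoded), f"{overhead:.3f}")
```

Providing datastreams By-Reference avoids this growth, at the cost of a second storage mechanism for the datastreams themselves.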

5
aDORe XMLtape/ARCfile solution
  • Part of LANL aDORe repository effort
  • Standards-based, modular repository architecture
  • Distributed architecture
  • Protocol-based interactions between modules
  • Usable to create interoperable federations of
    heterogeneous repositories
  • Actual implementation of the architecture at LANL
  • Components of aDORe software will be released
  • Inspired by Internet Archive ARC file approach
  • File-based mechanism to store datastreams
    resulting from Web-crawling
  • Concatenation of multiple datastreams into a
    single file
  • Metadata as separators between datastreams
  • But not OK to store XML-based representations of
    DOs
  • Metadata capabilities very limited:
    crawling-related
  • Lose power of XML processing tools

6
aDORe XMLtape/ARCfile solution
  • Two interconnected file-based storage mechanisms
  • XMLtapes: file storage of XML-based
    representations of Digital Objects
  • ARCfiles: file storage of constituent datastreams
    of Digital Objects
  • The ARC files are interconnected with one or more
    XMLtapes during the ingestion process
  • A protocol-based access mechanism is introduced
  • XMLtape is exposed as an autonomous OAI-PMH
    repository
  • ARCfile is exposed as an OpenURL Resolver
  • Write once - Read many
  • Files remain stable
  • Protocol-based access mechanism remains stable
  • Indexing mechanisms can change as technologies
    evolve
  • Storage approach is independent from the compound
    object format used to represent DOs as XML
  • aDORe uses MPEG-21 DIDL

7
ISO/IEC 21000-2 MPEG-21 DID DIDL
[Figure: a Digital Item has a declaration (Digital Item Declaration), which has an XML serialization (DIDL document)]

8
Representing DOs using MPEG-21 DID
sample DIDL document
9
aDORe XMLtape
  • An XML file that concatenates the XML-based
    representations of multiple DOs
  • Structure is defined by an XML Schema
  • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtape.xsd
  • tape-level administrative section
  • Open-ended content
  • Plug-in for processing-related information,
    indication of related ARCfiles
  • http://purl.lanl.gov/aDORe/schemas/2005-08/XMLtapeBasics.xsd
  • concatenation of records, each of which consists
    of
  • record-level administrative section
  • identifier and datestamp of the contained record
  • other record-level administrative information
  • a record (can be from any XML Namespace); DIDL in
    case of aDORe
  • http://purl.lanl.gov/aDORe/schemas/2005-08/DIDL.xsd
  • An XMLtape is a valid and well-formed XML file
  • Independent from chosen XML-based Compound Object
    Format

10
aDORe XMLtape
  <?xml version="1.0" encoding="UTF-8"?>
  <ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
    <ta:tapeAdmin>
      ...
    </ta:tapeAdmin>
    <ta:tapeRecord>
      <ta:tapeRecordAdmin>
        <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
        <ta:date>2005-03-29T04:31:22Z</ta:date>
        <ta:recordAdmin>
          ...
        </ta:recordAdmin>
      </ta:tapeRecordAdmin>
      <ta:record>
        <didl:DIDL>...</didl:DIDL>
      </ta:record>
    </ta:tapeRecord>
  </ta:tape>

sample XMLtape (aDORe ta:tape)
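
Because an XMLtape is ordinary XML, off-the-shelf parsers can walk it record by record. The following is a minimal sketch using Python's standard library, not the aDORe implementation; the tiny tape, the `payload` element and the second identifier are hypothetical stand-ins, and only the `ta` namespace and element names come from the slide.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"  # namespace from the slide

# Tiny hypothetical tape; <payload> stands in for a real DIDL document.
SAMPLE_TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin/>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:aps.org:PhysRevA.71.040101</ta:identifier>
      <ta:date>2005-03-29T04:31:22Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload>first</payload></ta:record>
  </ta:tapeRecord>
  <ta:tapeRecord>
    <ta:tapeRecordAdmin>
      <ta:identifier>oai:example.org:record-2</ta:identifier>
      <ta:date>2005-03-30T00:00:00Z</ta:date>
    </ta:tapeRecordAdmin>
    <ta:record><payload>second</payload></ta:record>
  </ta:tapeRecord>
</ta:tape>
"""


def iter_tape_records(stream):
    """Yield (identifier, datestamp, record element) for each ta:tapeRecord."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{TA}}}tapeRecord":
            admin = elem.find(f"{{{TA}}}tapeRecordAdmin")
            identifier = admin.findtext(f"{{{TA}}}identifier")
            datestamp = admin.findtext(f"{{{TA}}}date")
            record = elem.find(f"{{{TA}}}record")
            yield identifier, datestamp, record
            # Detach the finished record's children so memory use stays low
            # on large tapes; the yielded element keeps its own content.
            elem.clear()


records = list(iter_tape_records(io.BytesIO(SAMPLE_TAPE)))
for identifier, datestamp, record in records:
    print(identifier, datestamp, record[0].text)
```

Streaming with `iterparse` matters here because a tape can hold hundreds of thousands of records and would be costly to load as a single tree.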
11
aDORe XMLtape index
[Figure: an index over the XMLtape, with one entry (identifier, datestamp of ingestion) per <ta:tapeRecordAdmin>]
  • Indexing
  • Can be achieved with a variety of technologies
  • Current implementation: Berkeley DB Java Edition

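
One way to picture such an index is a map from record identifier to datestamp, byte offset and length, so that a single record can be read without parsing the whole tape. The sketch below uses a plain Python dict in place of Berkeley DB Java Edition and locates records with regular expressions; it assumes the `ta` prefix is used exactly as on the sample slide, and the tape content and function names are hypothetical.

```python
import re

# Hypothetical two-record tape, kept in memory for the example.
TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
<ta:tapeRecord><ta:tapeRecordAdmin>
<ta:identifier>oai:example.org:1</ta:identifier>
<ta:date>2005-03-29T04:31:22Z</ta:date>
</ta:tapeRecordAdmin><ta:record><payload>first</payload></ta:record></ta:tapeRecord>
<ta:tapeRecord><ta:tapeRecordAdmin>
<ta:identifier>oai:example.org:2</ta:identifier>
<ta:date>2005-03-30T00:00:00Z</ta:date>
</ta:tapeRecordAdmin><ta:record><payload>second</payload></ta:record></ta:tapeRecord>
</ta:tape>
"""

# "<ta:tapeRecord>" does not match "<ta:tapeRecordAdmin>" because of the ">".
RECORD_RE = re.compile(rb"<ta:tapeRecord>.*?</ta:tapeRecord>", re.DOTALL)
ID_RE = re.compile(rb"<ta:identifier>(.*?)</ta:identifier>", re.DOTALL)
DATE_RE = re.compile(rb"<ta:date>(.*?)</ta:date>", re.DOTALL)


def build_index(tape_bytes):
    """Map identifier -> (datestamp, byte offset, byte length) of its record."""
    index = {}
    for match in RECORD_RE.finditer(tape_bytes):
        chunk = match.group(0)
        identifier = ID_RE.search(chunk).group(1).decode("utf-8")
        datestamp = DATE_RE.search(chunk).group(1).decode("utf-8")
        index[identifier] = (datestamp, match.start(), len(chunk))
    return index


def read_record(tape_bytes, index, identifier):
    """Return the raw bytes of one record using only the index entry."""
    _datestamp, offset, length = index[identifier]
    return tape_bytes[offset:offset + length]


index = build_index(TAPE)
print(read_record(TAPE, index, "oai:example.org:2").decode("utf-8"))
```

Since the tape is written once and never modified, an index like this can be discarded and rebuilt with a different technology at any time.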
12
aDORe XMLtape as OAI-PMH repository
[Figure: an OAI-PMH request is answered from the XMLtape via its index, returning a DIDL document]
  • OAI-PMH identifier: identifier from <ta:tapeRecordAdmin>
  • OAI-PMH datestamp: datetime from <ta:tapeRecordAdmin>
  • OAI-PMH response: content of <ta:record>
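
The mapping on this slide can be sketched as a small function that wraps the three tape-level values in an OAI-PMH `record` element (a `header` carrying identifier and datestamp, plus a `metadata` container). This is an illustration under assumptions, not the OAICat-based implementation; the function name and the `payload` element are hypothetical.

```python
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"  # OAI-PMH 2.0 namespace


def to_oai_record(identifier, datestamp, payload):
    """Build an OAI-PMH <record> from values stored in a tape record."""
    record = ET.Element(f"{{{OAI}}}record")
    header = ET.SubElement(record, f"{{{OAI}}}header")
    ET.SubElement(header, f"{{{OAI}}}identifier").text = identifier
    ET.SubElement(header, f"{{{OAI}}}datestamp").text = datestamp
    metadata = ET.SubElement(record, f"{{{OAI}}}metadata")
    metadata.append(payload)  # content of <ta:record>, e.g. a DIDL document
    return record


# Hypothetical payload standing in for the DIDL document of a tape record.
payload = ET.Element("payload")
payload.text = "DIDL goes here"

oai_record = to_oai_record(
    "oai:aps.org:PhysRevA.71.040101", "2005-03-29T04:31:22Z", payload
)
print(ET.tostring(oai_record, encoding="unicode"))
```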
13
Internet Archive ARCfile
  • Concatenation of binary files
  • Designed and used by the Internet Archive
    (Wayback machine)
  • > 400 TB web data
  • Under revision by the International Internet
    Preservation Consortium (IIPC): WARC file format
  • Input from LANL to facilitate non-Web-crawling
    use case
  • The ARC file format is structured as follows:
      • file header that provides administrative
        information about the ARC file itself
      • a sequence of document records, consisting of:
          • a header line containing some, mainly
            crawl-related, metadata:
              • URI of the crawled document
              • timestamp of acquisition of the data
              • size of the data block
          • a response to a protocol request such as an
            HTTP GET

14
Internet Archive ARC file
  filedesc://IA-001102.arc 0 19960923142103 text/plain 76
  1 0 Alexa Internet
  URL IP-address Archive-date Content-type Archive-length

  http://www.dryswamp.edu:80/index.html 127.10.100.2 19961104142103 text/html 202
  HTTP/1.0 200 Document follows
  Date: Mon, 04 Nov 1996 14:21:06 GMT
  Server: NCSA/1.4.1
  Content-type: text/html
  Last-modified: Sat,10 Aug 1996 22:33:11 GMT
  Content-length: 30

  <HTML>
  Hello World!!!
  </HTML>

sample ARC file
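
The record layout (a space-separated header line whose last field gives the size of the data block that follows) can be read with a few lines of code. The sketch below is a simplified reader, not the Heritrix one: it assumes five header fields with no embedded spaces, and the helper names and sample bytes are made up for the example.

```python
import io


def make_arc_record(url, ip, date, content_type, body):
    """Serialize one record: header line, data block, separator newline."""
    header = f"{url} {ip} {date} {content_type} {len(body)}\n"
    return header.encode("latin-1") + body + b"\n"


def read_arc_records(stream):
    """Yield (header dict, data block) for every record, filedesc included."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator between records
        url, ip, date, content_type, length = line.decode("latin-1").split()
        body = stream.read(int(length))  # the size field says how much to read
        yield {
            "url": url,
            "ip": ip,
            "date": date,
            "content_type": content_type,
            "length": int(length),
        }, body


FILEDESC_BODY = (
    b"1 0 Alexa Internet\n"
    b"URL IP-address Archive-date Content-type Archive-length\n"
)
HTTP_BODY = (
    b"HTTP/1.0 200 Document follows\n"
    b"Content-type: text/html\n"
    b"\n"
    b"<HTML>\nHello World!!!\n</HTML>\n"
)

arc_bytes = make_arc_record(
    "filedesc://IA-001102.arc", "0", "19960923142103", "text/plain", FILEDESC_BODY
) + make_arc_record(
    "http://www.dryswamp.edu:80/index.html", "127.10.100.2",
    "19961104142103", "text/html", HTTP_BODY,
)

records = list(read_arc_records(io.BytesIO(arc_bytes)))
for header, body in records:
    print(header["url"], header["content_type"], len(body))
```

Because the length is declared up front, a reader can skip over a data block without inspecting it, which is what makes offset-based indexing of ARC files practical.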
15
Internet Archive ARC file in aDORe
  filedesc://singletape.arc 0.0.0.0 20050922142103 text/plain 76
  1 0 Internet Archive
  URL IP-address Archive-date Content-type Archive-length
  info:lanl-repo/ds/39c2fa93-fa22-4c19-90af-b5f58b9b989a 0.0.0.0 20050907221344 application/pdf 415025
  %PDF-1.3
  %âãÏÓ
  290 0 obj
  <<
  /Linearized 1
  /O 295
  /H [ 3642 1057 ]
  /L 415025

sample aDORe ARC file
16
Internet Archive ARC file
[Figure: an index over the ARC file, with one entry per URL pointing to its datastream; each datastream is preceded by a header line (URL IP-address Archive-date Content-type Archive-length)]
  • Indexing
  • Can be achieved with a variety of technologies
  • Current implementation in aDORe: Heritrix toolkit

17
ARC file as OpenURL Resolver
[Figure: an OpenURL request is resolved via the index to a datastream in the ARC file]
  • Referent Identifier: datastream identifier (URL
    from ARC record header)
  • Resolver Identifier: identifier of ARC file
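
Resolution can be sketched as follows: parse the OpenURL query, check that the Resolver Identifier names this ARC file, then use the Referent Identifier as the index key to find the datastream. This is a toy in-memory model rather than the OCLC OpenURL software; the identifiers, bytes and function name are hypothetical.

```python
from urllib.parse import parse_qs

# Hypothetical identifiers and content for a one-record ARC file.
ARC_ID = "info:lanl-repo/arc/example"
DS_ID = "info:lanl-repo/ds/example-1"
BODY = b"%PDF-1.3 hypothetical datastream bytes"
HEADER = f"{DS_ID} 0.0.0.0 20050907221344 application/pdf {len(BODY)}\n".encode("ascii")
ARC_BYTES = HEADER + BODY + b"\n"

# Index: URL field of the ARC record header -> (offset, length) of the data block.
INDEX = {DS_ID: (len(HEADER), len(BODY))}


def resolve(query_string):
    """Return the datastream bytes addressed by an OpenURL query string."""
    params = parse_qs(query_string)
    if params.get("url_ver") != ["Z39.88-2004"]:
        raise ValueError("unsupported OpenURL version")
    if params.get("res_id") != [ARC_ID]:
        raise LookupError("request is addressed to a different ARC file")
    (rft_id,) = params["rft_id"]
    offset, length = INDEX[rft_id]
    return ARC_BYTES[offset:offset + length]


query = f"url_ver=Z39.88-2004&res_id={ARC_ID}&rft_id={DS_ID}"
print(resolve(query))
```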
18
Associating an XMLtape with ARC Files (1)
  • A Digital Object is represented using an
    XML-based Complex Object format (e.g. MPEG-21
    DID)
  • The resulting package (e.g. DIDL document) is
    stored in an XMLtape
  • Constituent datastreams of the Digital Object are
    provided By-Reference
  • Using the ref attribute of the Resource element
    in MPEG-21 DID
  • The value of the network location of the
    constituent datastream is compliant with the NISO
    OpenURL Framework
  • baseURL(ARCfile OpenURL Resolver)?
  • url_ver = Z39.88-2004
  • rft_id = Datastream Identifier
  • res_id = ARCfile identifier
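
Assembling such a reference is a matter of URL-encoding three key/value pairs onto the resolver's base URL. A minimal sketch follows; the function name is hypothetical, while the base URL and identifiers are taken from the DIDL extract in this presentation.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def datastream_openurl(base_url, arcfile_id, datastream_id):
    """Build the By-Reference location of a datastream as an OpenURL."""
    query = urlencode({
        "url_ver": "Z39.88-2004",
        "res_id": arcfile_id,     # Resolver Identifier: which ARC file
        "rft_id": datastream_id,  # Referent Identifier: which datastream
    })
    return f"{base_url}?{query}"


url = datastream_openurl(
    "http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver",
    "info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2",
    "info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b",
)
print(url)

# Round trip: the receiving resolver recovers the same identifiers.
params = parse_qs(urlsplit(url).query)
print(params["rft_id"][0])
```

`urlencode` percent-encodes the `:` and `/` characters inside the identifiers; a resolver decodes them back, as the round trip shows.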

19
Associating an XMLtape with ARC Files (1)
  <?xml version="1.0" encoding="UTF-8"?>
  <didl:DIDL>
    <didl:Component id="uuid-ddec9dbb-90e5-4b8a-93f3-dd1c8b781547">
      <didl:Descriptor>
        <didl:Statement mimeType="application/xml; charset=utf-8">
          <dii:Identifier>
            info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b
          </dii:Identifier>
        </didl:Statement>
      </didl:Descriptor>
      <didl:Resource mimeType="application/pdf"
        ref="http://purl.lanl.gov/aDORe/demo/adore-arcfile-resolver/resolver?
          url_ver=Z39.88-2004
          &amp;res_id=info:lanl-repo/arc/2001_4acb6e28-1ef9-11da-9e1e-d8ccd1d6c8f2
          &amp;rft_id=info:lanl-repo/ds/ba0797d3-9414-42d0-90e8-f5397e74892b"/>
    </didl:Component>
  </didl:DIDL>

Extract from DIDL
20
Associating an XMLtape with ARC Files (2)
  • An XMLtape is associated with its corresponding
    ARCfiles through a plug-in for the XMLtape-level
    administrative section.

21
Associating an XMLtape with ARC Files (2)
  <?xml version="1.0" encoding="UTF-8"?>
  <ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
    <ta:tapeAdmin>
      <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
        <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
        <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
        <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
        <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
      </tb:XMLtapeBasics>
    </ta:tapeAdmin>
    <ta:tapeRecord>
      <ta:tapeRecordAdmin>
      ...
  </ta:tape>

XMLtape header
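
A consumer can discover the associated ARC files by reading only the tape-level administrative section. The sketch below stops parsing as soon as `ta:tapeAdmin` closes; it is an illustration with Python's standard library, the function name is hypothetical, and the sample tape is shortened to a single empty record.

```python
import io
import xml.etree.ElementTree as ET

TA = "http://library.lanl.gov/2005-08/aDORe/XMLtape/"
TB = "http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/"

# Shortened tape: the header values are from the slide, the record is empty.
SAMPLE_TAPE = b"""<?xml version="1.0" encoding="UTF-8"?>
<ta:tape xmlns:ta="http://library.lanl.gov/2005-08/aDORe/XMLtape/">
  <ta:tapeAdmin>
    <tb:XMLtapeBasics xmlns:tb="http://library.lanl.gov/2005-08/aDORe/XMLtapeBasics/">
      <tb:XMLtapeId>info:lanl-repo/xmltape/singlescitape</tb:XMLtapeId>
      <tb:ARCfileId>info:lanl-repo/arc/singlescitape</tb:ARCfileId>
      <tb:processSoftware>gov.lanl.xmltape.SingleTapeWriter</tb:processSoftware>
      <tb:processTime>2005-09-07T22:13:39Z</tb:processTime>
    </tb:XMLtapeBasics>
  </ta:tapeAdmin>
  <ta:tapeRecord/>
</ta:tape>
"""


def associated_arcfiles(stream):
    """Return the ARCfile identifiers declared in the tape-level admin section."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == f"{{{TA}}}tapeAdmin":
            # The header is complete; no need to look at any tape record.
            return [e.text for e in elem.iter(f"{{{TB}}}ARCfileId")]
    return []


arcfile_ids = associated_arcfiles(io.BytesIO(SAMPLE_TAPE))
print(arcfile_ids)
```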
22
AGENT
23
aDORe XMLtape/ARCfile environment
24
Implementation
  • XMLtapes: Berkeley DB Java Edition, OCLC OAICat
  • ARCfiles: Heritrix, OCLC OpenURL software
  • XMLtape Registry: MySQL db, OCLC OAICat
  • ARCfile Registry: MySQL db, OCLC OAICat

25
Performance indicators
  • System
      • Model: Dell 2650 2U rack-mount server
      • CPU: dual 2.8 GHz Intel Xeon processors
      • RAM: 5 GB
      • Disks: 10k RPM SCSI disks
  • XMLtape
      • 1786 MB, 201872 DIDL records
      • download 100 consecutive DIDL records (787 KB)
        => 0.18 second
      • download static file of same size => 0.09 second
  • ARCfile
      • 272 MB, 4910 files
      • download a sample PDF file (312 KB) => 0.24
        second
      • download static file of same size => 0.036 second
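
Reading the figures on this slide as elapsed times, the cost of the protocol layer relative to serving a static file of the same size can be worked out directly. The short calculation below uses only the numbers given above; the dictionary layout and variable names are ours.

```python
# (time via XMLtape/ARCfile access in s, time for static file in s, size in KB)
measurements = {
    "XMLtape, 100 DIDL records": (0.18, 0.09, 787),
    "ARCfile, sample PDF": (0.24, 0.036, 312),
}

ratios = {}
throughputs = {}
for label, (access_s, static_s, size_kb) in measurements.items():
    ratios[label] = access_s / static_s        # slowdown versus a static file
    throughputs[label] = size_kb / access_s    # effective KB per second
    print(f"{label}: {ratios[label]:.1f}x static-file time, "
          f"{throughputs[label]:.0f} KB/s")
```

The XMLtape download takes about twice as long as the static file and the ARCfile download roughly 6.7 times as long, while both stay well under a second.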

26
Software
  • ARC files
      • Heritrix: the Internet Archive's open-source,
        extensible, web-scale, archival-quality web
        crawler project. http://crawler.archive.org/
      • NetArchive.dk: a project that plans for the
        preservation of Denmark's cultural heritage on
        the internet for future generations.
        http://www.netarchive.dk/
      • Many other tools:
        http://archive-access.sourceforge.net
  • XMLtapes
      • Perl tool, XMLTape (LANL and Ghent University),
        http://search.cpan.org/~hochsten/XML-Tape/
  • Combined aDORe XMLtape/ARCfile environment
      • Java tool (LANL), soon to be released on
        SourceForge

27
Conclusion
  • The file-based approach is inherently simple, and
    reduces dependency on a database system.
  • The autonomy of the indexes allows retaining the
    files over time, while the indexes can be created
    using other techniques as technologies evolve.
  • The protocol-based nature of the access increases
    the flexibility in light of evolving technologies
    as it introduces another layer of abstraction.
  • The XMLtape approach is inspired by the ARC file
    format, but provides several additional
    attractive features:
  • Off-the-shelf XML tools can be used to
    parse/validate an XMLtape
  • All DO metadata can be stored in XML-based
    compound object format
  • Presentation available via
    http://public.lanl.gov/herbertv/
  • Install TSCC codec for avi movies