Title: aDORe v1 : Architectural Highlights
1 aDORe v1 Architectural Highlights
Herbert Van de Sompel
Digital Library Research & Prototyping Team, Research Library, Los Alamos National Laboratory
Acknowledgments: Luda Balakireva, Jeroen Bekaert, Ryan Chute, Patrick Hochstenbach, Xiaoming Liu, Damien Lujan
The aDORe effort was supported by an NDIIPP grant from the Library of Congress.
2 Context
- Fact
  - LANL Research Library stores a significant scholarly collection locally (A&I databases, journal articles, ...) and creates applications based on that collection.
- Initial aDORe motivation
  - Undo tight integration between data and application
  - Uniform approach for ingesting, storing, and disseminating LANL RL data collections
- Bigger picture
  - Allow for multiple, parallel applications on top of stored content
  - Create an environment that provides guarantees regarding long-term accessibility of stored content
3 aDORe characteristics
- Standards-based
  - MPEG-21 Digital Item Declaration, MPEG-21 Digital Item Identification, URI, info URI, OAI-PMH, NISO OpenURL, SRU, Information Environment Service Registry, Internet Archive ARC file format, OAIS concepts, XML, XML Schema, XQuery.
- Component-based, highly modular
  - Multiple content repositories, Identifier Locator, Service Registry, Format Registry, Semantic Registry, Harvesting front-end, Dissemination front-end
- Protocol-based
  - Components expose (REST-based) Web services
  - All read services based on 4 standards: OAI-PMH, NISO OpenURL, SRU, XQuery.
  - Interaction between modules is protocol-driven.
4 aDORe characteristics
5 aDORe effort
- aDORe is 2 things
  - A standards-based repository federation architecture
  - Actual implementation of the architecture at LANL for local storage of digital assets
- Prototype version was in production for 2 years!
- Production version finalized June 2007.
6 aDORe overview
- Representing Digital Objects
  - MPEG-21 DID / DIDL to represent Digital Objects using XML Packages
  - Identification of Digital Objects, datastreams, and XML Packages
- Storing Digital Objects
  - Autonomous distributed repositories with OAI-PMH and OpenURL-based service interfaces
- Locating Digital Objects, datastreams, and XML Packages
  - Identifier Locator
- Registries
  - Service Registry: locating service interfaces for autonomous distributed repositories
  - Format Registry: sharing media type identifiers across autonomous distributed repositories
  - Semantic Registry: sharing intellectual content type identifiers across autonomous distributed repositories
- Providing federated access to the autonomous distributed repositories
  - OAI-PMH Federator: harvesting XML Packages
  - OpenURL Resolver: requesting services pertaining to Digital Objects, datastreams, and XML Packages
7 Representing Digital Objects
8 sample Digital Object
- Create an XML-based surrogate for each Digital Object
  - Glues all components together in a single XML Package
  - Contains all required metadata (descriptive, technical, identifiers, ...) in the XML Package
  - Initial access format for all materials is the same (XML) irrespective of their native media type
- Assign identifiers to the XML Package, the Digital Object, and the datastreams. Maintain original identifiers.
9 representing Digital Objects using MPEG-21 DID / DIDL
- An XML Package is available for every Digital Object
  - The Package is an XML document compliant with the MPEG-21 Digital Item Declaration Language: a DIDL document
- The DIDL document typically contains
  - By-Value: descriptive metadata datastream, ingest/repository related metadata
  - By-Reference: all constituent datastreams of the Digital Object
- Creation of DIDL documents can be
  - static, at ingestion time, cf. aDORe Archive
  - dynamic, via add-on capability to an existing content management system, cf. Ghent University eRez add-ons
- A new DIDL document is created when a new version of a previously ingested Digital Object is ingested (update is considered re-ingestion).
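The by-value / by-reference split described on this slide can be sketched with Python's standard XML library. This is only an illustration, not the aDORe DIDLTools API: the namespace URI is assumed to be the second-edition DIDL namespace, and the helper names, metadata snippet, and URL are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: second-edition MPEG-21 DIDL namespace.
DIDL_NS = "urn:mpeg:mpeg21:2002:02-DIDL-NS"
ET.register_namespace("didl", DIDL_NS)


def q(tag):
    """Return the namespace-qualified name of a DIDL element."""
    return f"{{{DIDL_NS}}}{tag}"


def build_didl(descriptive_xml, datastreams):
    """Build a minimal DIDL tree (hypothetical helper).

    descriptive_xml: metadata as an XML string, embedded By-Value.
    datastreams: list of (mime_type, url) pairs, included By-Reference.
    """
    didl = ET.Element(q("DIDL"))
    item = ET.SubElement(didl, q("Item"))

    # By-Value: the metadata itself travels inside the package.
    descriptor = ET.SubElement(item, q("Descriptor"))
    statement = ET.SubElement(descriptor, q("Statement"), mimeType="text/xml")
    statement.append(ET.fromstring(descriptive_xml))

    # By-Reference: the package only points at each datastream.
    for mime_type, url in datastreams:
        component = ET.SubElement(item, q("Component"))
        ET.SubElement(component, q("Resource"), mimeType=mime_type, ref=url)
    return didl


package = build_didl(
    "<title>Example article</title>",
    [("application/pdf", "http://example.org/datastream/1")],
)
print(ET.tostring(package, encoding="unicode"))
```

A real XML Package would also carry the identifiers and ingest metadata mentioned above; they are left out here to keep the structure visible.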
10 sample Digital Object
11 representing Digital Objects using MPEG-21 DID
12 Identification: digital objects, datastreams, DIDL documents
13 aDORe DIDLTools
- aDORe DIDLTools software is available from http://african.lanl.gov/aDORe/projects/DIDLTools/
14 The aDORe architecture
15 the aDORe architecture: 3 layers
- Layer 1: the aDORe repositories
  - Networked systems that host digital object content and that make that content accessible by exposing core service interfaces.
  - In LANL implementation: XMLtapes and ARCfiles (aDORe Archive)
  - Other Content Management Systems can be turned into an aDORe repository by implementing the core service interfaces.
- Layer 2: the aDORe federation components
  - Networked systems that facilitate presenting the aDORe repositories as a single logical repository; these federation components expose core service interfaces to allow access to their content.
  - Federation components are: Identifier Locator, Service Registry, Format Registry, Semantic Registry
- Layer 3: the aDORe front-ends
  - Networked systems that make digital object content hosted in the multitude of physical aDORe repositories accessible by exposing core service interfaces that present those aDORe repositories as a single logical repository
  - aDORe front-ends are: OAI-PMH Federator, OpenURL Resolver
17 The aDORe architecture
Layer 1: aDORe repositories. Hosting Digital Objects. Making hosted Digital Object content accessible.
19 aDORe repositories
- Networked systems that host digital object content and that have core service interfaces to facilitate access to that content.
- Currently 2 types in LANL implementation
  - XMLtapes: concatenating XML Packages
  - ARCfiles: concatenating datastreams
  - Combination of OAI-PMH and OpenURL-based core service interfaces
  - Generic XMLtape XQuery Resolver
- Other Content Management Systems can be turned into an aDORe repository by implementing the core service interfaces.
  - Cf. Aleph
  - Cf. Ghent University eRez
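The idea of a file that concatenates XML Packages, with an index kept separately, can be sketched as follows. This is not the actual XMLtape format: the `tape` and `package` element names, the byte-offset index layout, and the function names are assumptions made for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET

# Hypothetical record layout: <package id="...">payload</package>
PACKAGE = re.compile(rb'<package id="([^"]+)">.*?</package>', re.DOTALL)


def write_tape(packages):
    """Concatenate (identifier, payload_xml_bytes) pairs into one XML file.

    Returns the tape bytes plus an index: identifier -> (offset, length).
    """
    buf = io.BytesIO()
    index = {}
    buf.write(b"<tape>\n")
    for identifier, payload in packages:
        record = b'<package id="%s">%s</package>' % (identifier.encode(), payload)
        index[identifier] = (buf.tell(), len(record))
        buf.write(record + b"\n")
    buf.write(b"</tape>\n")
    return buf.getvalue(), index


def read_package(tape, index, identifier):
    """Fetch one package by seeking to its offset, without parsing the tape."""
    offset, length = index[identifier]
    return tape[offset:offset + length]


def rebuild_index(tape):
    """Recreate the index from the tape alone (the index is disposable)."""
    return {
        m.group(1).decode(): (m.start(), m.end() - m.start())
        for m in PACKAGE.finditer(tape)
    }


tape, index = write_tape([("pkg-1", b"<a>one</a>"), ("pkg-2", b"<a>two</a>")])
print(read_package(tape, index, "pkg-2").decode())
print(len(ET.fromstring(tape)), "packages; tape is well-formed XML")
```

Because the index holds nothing that cannot be derived from the file, it can be discarded and rebuilt with a different technique later, while the tape itself stays readable with ordinary XML tools.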
20 aDORe Archive: XMLtapes
XMLtape interfaces: oaipmh2, openurl-aDORe1, openurl-aDORe2, openurl-aDORe3
21 aDORe Archive: XMLtape XQuery Resolver
XMLtape interface: openurl-aDORe7
22 aDORe Archive: ARCfiles
ARCfile interfaces: openurl-aDORe3, openurl-aDORe4
23 The aDORe architecture
Layer 2: aDORe federation components. Facilitating the presentation of aDORe repositories as a single logical repository.
25 Identifier Locator
- Stores all identifiers of aDORe repositories (DIDLDocumentIdentifier, digital object identifier, datastream identifier)
  - Loaded by retrieving identifiers from aDORe repositories using their "give me your identifiers" OpenURL service interface
  - Stores (identifier, repository identifier) pairs
- 1 OpenURL-based service interface to the Identifier Locator
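As a rough in-memory model of the Identifier Locator described above: it maps each identifier to the repositories that hold it. The class and method names are hypothetical, the identifiers are made up, and the `load` call stands in for harvesting through a repository's "give me your identifiers" OpenURL interface.

```python
from collections import defaultdict


class IdentifierLocator:
    """Toy model: identifier -> set of repository identifiers."""

    def __init__(self):
        self._locations = defaultdict(set)

    def load(self, repository_id, identifiers):
        """Record every identifier reported by one repository."""
        for identifier in identifiers:
            self._locations[identifier].add(repository_id)

    def locate(self, identifier):
        """Return the repositories holding the identifier, sorted."""
        return sorted(self._locations.get(identifier, ()))


locator = IdentifierLocator()
locator.load("xmltape-0001", ["doc-1", "obj-A", "stream-A1"])
locator.load("xmltape-0002", ["doc-2", "obj-A"])  # a re-ingested version of obj-A
print(locator.locate("obj-A"))
```

An identifier can map to several repositories because a new version of a Digital Object is ingested as a new XML Package, possibly in a different XMLtape.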
26 Identifier Locator
Interface: openurl-aDORe2
27 Service Registry
- Stores information on all components of the aDORe environment, including
  - Identifier of the component,
  - Supported core services,
  - Location of the core service interfaces,
  - Other metadata about the component content
- Components include
  - aDORe repositories
    - XMLtapes
    - ARCfiles
  - Federation components
    - Identifier Locator
    - Registries
  - aDORe Front-ends
    - OAI-PMH Federator
    - OpenURL Resolver
28 Registries: Service Registry
- Layout follows the UK Information Environment Service Registry (IESR) specification
- OAI-PMH, OpenURL and SRU service interfaces
29 Service Registry
30 Service Registry
Interfaces: oaipmh2, SRU, openurl-aDORe6
31 openurl-aDORe1, openurl-aDORe2, openurl-aDORe6
32 Registries: Format Registry
- Stores information on aDORe media types for datastreams.
- Content
  - MIME media types
  - XML document types
  - Digital object profiles
- OAI-PMH service interface: oaipmh2
33 Registries: Semantic Registry
- Stores information on aDORe semantic content types for datastreams.
- OAI-PMH service interface: oaipmh2
34 The aDORe architecture
Layer 3: aDORe front-ends. Presenting aDORe repositories as a single logical repository.
36 Expose aDORe repositories as a single repository
- Pretend that everything that was introduced so far is just 1 repository, not hundreds, thousands, ...
- Name that repository aDORe1
- Provide core service interfaces to aDORe1, similar to those available for the autonomous aDORe repositories: OAI-PMH, OpenURL, ...
- Achieve this through the introduction of 2 components
  - OAI-PMH Federator
  - OpenURL Resolver
- These aDORe1-level core service interfaces are really the only ones that should be known to downstream applications
37 OAI-PMH Federator
- Single point of access to harvest DIDLs from aDORe1
- Interacts with Service Registry, Identifier Locator, and aDORe repositories to generate OAI-PMH responses
- Supports DIDL, and can disseminate other compound object formats (e.g. METS, Atom, ...)
- OAI-PMH Federator provides OAI-PMH interface to aDORe1
oaipmh2
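From a downstream application's point of view, harvesting from the Federator is an ordinary OAI-PMH request. The sketch below only builds the request URL; `verb`, `metadataPrefix`, `from`, and `set` are standard OAI-PMH arguments, while the base URL and the `didl` prefix value are assumptions, not values taken from the LANL deployment.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # selective harvesting by datestamp
    if set_spec is not None:
        params["set"] = set_spec    # selective harvesting by set
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint for the single logical repository aDORe1.
print(list_records_url("http://example.org/adore1/oai", "didl", from_date="2007-09-02"))
```

The application never needs to know how many XMLtapes sit behind that one endpoint; the Federator fans the request out to them.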
38 OpenURL Resolver
- Supports the core OpenURL services for aDORe1 that are also available for the autonomous aDORe repositories
- Interacts with Service Registry, Identifier Locator, and aDORe repositories to generate responses
- OpenURL Resolver provides 3 core service interfaces to aDORe1
  - obtain the most recent DIDL for a specified identifier (DIDLDocumentIdentifier, digital object identifier, datastream identifier)
  - retrieve a list of all the locations (DIDLs in aDORe1) containing a specified identifier
  - retrieve a datastream corresponding with a specified identifier (datastream identifier)
- OpenURL Resolver could support
  - Return all identifiers of aDORe1
  - XQuery
Interfaces: openurl-aDORe1, openurl-aDORe2, openurl-aDORe4
39 rft_id = identifier, svc_id = info:lanl-repo/svc/getDIDL
This is really 2 look-ups
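One plausible reading of the "2 look-ups" is sketched below: the Identifier Locator tells the resolver which repository holds the identifier, the Service Registry tells it where that repository's OpenURL interface lives, and the request is then forwarded. The two dictionaries stand in for those components, and the function names, identifiers, and URLs are hypothetical. `url_ver`, `rft_id`, and `svc_id` are standard OpenURL key/value keys.

```python
from urllib.parse import parse_qs, urlencode, urlparse

GET_DIDL = "info:lanl-repo/svc/getDIDL"  # service identifier from the slide


def build_openurl(base_url, identifier, service=GET_DIDL):
    """Encode an OpenURL request for one identifier and one service."""
    query = urlencode(
        {"url_ver": "Z39.88-2004", "rft_id": identifier, "svc_id": service}
    )
    return f"{base_url}?{query}"


def resolve(identifier, identifier_locator, service_registry):
    """Turn an aDORe1-level request into a repository-level request."""
    repository_id = identifier_locator[identifier]  # look-up 1: where is it?
    base_url = service_registry[repository_id]      # look-up 2: how to reach it?
    return build_openurl(base_url, identifier)


# Stand-ins for the Identifier Locator and Service Registry (made-up data).
locator = {"info:example/obj-1": "xmltape-0007"}
registry = {"xmltape-0007": "http://example.org/xmltape-0007/openurl"}

url = resolve("info:example/obj-1", locator, registry)
print(url)
print(parse_qs(urlparse(url).query)["svc_id"])
```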
40 OpenURL Resolver (a bit more)
- Single point of access to request services pertaining to single items from the aDORe repositories.
- Powered by a rule engine that dynamically decides which services are available for a specified item based on properties of the item (format, semantics, collection, creation date, ...).
- Interacts with Service Registry, Identifier Locator, aDORe repositories, rule engine, and transformation services to generate responses
- Retrieve a list of all services pertaining to an item with specified identifier (DIDLDocumentIdentifier, content identifier, datastream identifier)
openurl-aDORe5
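The rule-engine idea can be illustrated with a list of predicates over item properties. This is a minimal sketch, not the engine used at LANL: the `Item` fields, the rules, and every service name other than `getDIDL` are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Item:
    identifier: str
    media_type: str
    collection: str


@dataclass(frozen=True)
class Rule:
    service_id: str
    applies: Callable[[Item], bool]


# Hypothetical rules: each one ties a service to a property of the item.
RULES = [
    Rule("getDIDL", lambda item: True),
    Rule("getThumbnail", lambda item: item.media_type.startswith("image/")),
    Rule("getFullText", lambda item: item.media_type == "application/pdf"),
]


def available_services(item, rules=RULES):
    """List the services whose rule matches the item, in rule order."""
    return [rule.service_id for rule in rules if rule.applies(item)]


scan = Item("obj-1", "image/tiff", "maps")
print(available_services(scan))
```

Keeping the rules as data means a new service can be offered for existing content by adding a rule, without touching the repositories.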
41 Select an OpenURL service request from the list
42 LANL aDORe implementation
43 LANL aDORe software
- Largely based on off-the-shelf software components
  - Berkeley DB Java Edition
  - Heritrix toolkit
  - MySQL db
  - OCLC OAICat
  - OCLC OpenURL software
  - Ockham IESR service registry
- aDORe Archive software (Layer 1: XMLtape, ARCfiles) is available from http://african.lanl.gov/aDORe/projects/adoreArchive/
- Plans to one way or another make the entire LANL aDORe solution (revised Layer 1, Layer 2, Layer 3) available.
44 LANL aDORe @ 2 Sep 2007
- aDORe Archive
  - XMLtapes: 1,308
  - ARCfiles: 2,223
  - DIDL Documents: 45,444,113
  - ARCfile resources: 115,028,715
  - 4.4 TByte
- Identifier Locator
  - Identifiers: 310,253,260
45 LANL aDORe hardware
46 LANL aDORe Performance
- Ingestion
  - Preprocessing, Indexing, Registration: 12 DIDLs / second
  - System Specifications
    - Sunfire x4600 M2 Server
    - CPU: AMD 8218 dual-core 2.6GHz (x 8)
    - RAM: 16 x 2GB DDR2-667
- Retrieval
  - Sub-10ms retrieval times for individual modules
  - System Specifications
    - IBM Blade Center
    - Chassis: Model 86773XU
    - Blades: Model 885092U (x 14)
    - AMD 2.8GHz (single core)
    - RAM: 8 GB PC3200 ECC DDR SDRAM
47 aDORe Ingestion Overview
48 Conclusion
- aDORe Archive
  - The file-based approach (XMLtape/ARCfile) is inherently simple, and reduces dependency on database systems.
  - The XMLtape approach is inspired by the ARC file format, but provides several additional attractive features
    - Off-the-shelf XML tools can be used to parse/validate an XMLtape
    - All Digital Object metadata can be stored in the XML Package
  - The autonomy of the indexes allows retaining the files over time, while the indexes can be created using other techniques as technologies evolve.
    - Can throw all indexes out and just start from scratch.
  - Data integrity
    - XML Package contains SHA1 digest for each datastream of the Digital Object represented by the XML Package
    - SHA1 digest for each XMLtape and ARCfile stored in XMLtape Registry and ARCfile Registry, respectively
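The data-integrity check above amounts to recomputing a SHA1 digest and comparing it with the stored one. A minimal sketch using Python's `hashlib`; the function names and the chunked-reading approach are illustrative choices, not the aDORe Archive code.

```python
import hashlib
import io


def sha1_of_stream(stream, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a binary stream, reading in chunks
    so that large XMLtapes or ARCfiles need not fit in memory."""
    digest = hashlib.sha1()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def is_intact(stream, recorded_digest):
    """Compare a freshly computed digest with the one recorded at ingest."""
    return sha1_of_stream(stream) == recorded_digest.lower()


# In-memory stand-in for a stored datastream.
datastream = b"abc"
recorded = hashlib.sha1(datastream).hexdigest()
print(is_intact(io.BytesIO(datastream), recorded))
print(is_intact(io.BytesIO(datastream + b"!"), recorded))
```

The same routine would serve at both levels named on the slide: per datastream (digest kept in the XML Package) and per file (digest kept in the XMLtape or ARCfile Registry).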
49 Conclusion
- aDORe
  - The protocol-based nature of the access increases the flexibility in light of evolving technologies through the introduction of a layer of abstraction.
    - Can throw whichever technology out and re-implement the same protocol interface using another technology.
  - The protocol-based nature of the solution allows a fully distributed implementation.
  - The component-based nature yields scalability.
  - The standards-based design allows the use of off-the-shelf tools.
  - A standards-based approach typically allows for a less painful migration (to a new standard).
  - All kinds of Content Management Systems can be aDORe-ized.