Title: Collection Based Persistent Archives
1Collection Based Persistent Archives
Reagan Moore, Chaitan Baru, Amarnath Gupta, George Kremenek, Bertram Ludaescher, Richard Marciano, Arcot Rajasekar, Wayne Schroeder, Michael Wan, Ilya Zaslavsky, Bing Zhu
(http://www.npaci.edu/DICE/)
2Topics
- Components of a persistent archive
- Information management example
- Data management example
- Knowledge management example
3Fundamental Concept for a Persistent Archive
- Persistence requires migration over time onto new technology
- While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology.
- A persistent archive is an interoperability system.
4What Types of Interoperability are Needed?
- Data management (data sets)
- Ability to work with multiple types of storage systems, across separate administration domains
- Information management (schema)
- Ability to define a collection independent of database choice
- Ability to migrate collection onto new databases
- Knowledge management (ontology)
- Ability to map old concepts to current view of the world
- Ability to present and manipulate information associated with data sets
5Implicit Concepts
- Infrastructure independence
- Data set access
- Authentication
- Collection management
- Presentation
- Non-proprietary formatting
- Information models
- XML - Information markup language
- GML - Graphics markup language
- Functional separation of archival systems
- Accessioning workbench, archive, access workbench
6Implicit Goals
- Maintain digital objects and the information retrieval catalog description in the archive
- Provide ability to instantiate collection as needed on new technology
- Instantiate archived collection only when needed
- Implies collection can sit in the archive forever, and can still be accessed at an arbitrary point in the future
7Electronic Records Archive (ERA)
[Figure: ERA architecture diagram. Stages: Accession, Archives, Reference, Transfer. Components: Accessioning Work Bench (snap-in), Reference Workbench (snap-in), Retrieve Records, Media Handlers, Catalog, ARC, Metadata Repository, Records Repository, Wrapper, Unwrapper, Metadata wrapper, Arrangement, Query, Reference Tools, Presentation, Order Fulfillment. Access: Internet, Intranet. Record types: Text, Image, Photo, Video, Audio, Geographical Information System, Compound Records, Web, Database. Media: tape, CD, disk.]
8Common Information Model
- eXtensible Markup Language (XML)
- Use tags to define semantic context for components of the data set
- Document Type Definition (DTD)
- Provides semi-structured representation for organizing tags that can be applied to groups of digital objects
- Development of standards for tags
- California Digital Library - Encoded Archival Description
- Digital sky, Protein Data Bank, Neuroscience brain images
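The role of tags and a DTD-style content model can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tag names (`record`, `title`, `creator`, `date`, `subject`) and the dict-based `CONTENT_MODEL` are invented for the example and do not come from any published DTD. The standard library parser does not validate against real DTDs, so the check is done by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: the tag names are illustrative, not from a published DTD.
RECORD_XML = """
<record>
  <title>Survey image 0042</title>
  <creator>Example Observatory</creator>
  <date>1999-06-01</date>
</record>
"""

# A DTD-like content model written as a plain dict for this sketch:
# element name -> (required child tags, optional child tags).
CONTENT_MODEL = {
    "record": ({"title", "creator"}, {"date", "subject"}),
}


def check_record(xml_text, model):
    """Return a list of problems found when comparing a record to the model."""
    root = ET.fromstring(xml_text)
    if root.tag not in model:
        return [f"unknown root element: {root.tag}"]
    required, optional = model[root.tag]
    present = {child.tag for child in root}
    problems = [f"missing required tag: {t}" for t in sorted(required - present)]
    problems += [f"tag not in model: {t}"
                 for t in sorted(present - required - optional)]
    return problems


problems = check_record(RECORD_XML, CONTENT_MODEL)
# The tags give each value its semantic context.
fields = {child.tag: child.text for child in ET.fromstring(RECORD_XML)}
print(problems)
print(fields["title"])
```

Because the same content model applies to every record in a group, one model can describe a whole collection of digital objects.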
9Digital Object Representation
- Require non-proprietary markup language for formats that can be controlled by the archive
- HTML - text
- SVG - Scalable Vector Graphics markup language
- As standards evolve, choose next format markup language to be a superset of the previous language
- Convert to new standard on the fly as digital objects are accessed, or during a media migration
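Conversion on the fly at access time can be sketched as a chain of converters between successive format versions. Everything named here is an assumption made for illustration: the format labels (`markup-v1` to `markup-v3`) and both conversion functions are hypothetical and stand in for real format migrations.

```python
# Hypothetical format versions, ordered oldest to newest.
FORMAT_CHAIN = ["markup-v1", "markup-v2", "markup-v3"]


def v1_to_v2(body):
    # Illustrative conversion: v2 renames the <b> tag to <strong>.
    return body.replace("<b>", "<strong>").replace("</b>", "</strong>")


def v2_to_v3(body):
    # Illustrative conversion: v3 wraps content in a root element.
    return f"<doc>{body}</doc>"


# One converter per adjacent pair of versions.
CONVERTERS = {
    ("markup-v1", "markup-v2"): v1_to_v2,
    ("markup-v2", "markup-v3"): v2_to_v3,
}


def access(obj, target=FORMAT_CHAIN[-1]):
    """Return a copy of obj converted step by step to the target format."""
    start = FORMAT_CHAIN.index(obj["format"])
    end = FORMAT_CHAIN.index(target)
    if start > end:
        raise ValueError("cannot convert to an older format")
    body = obj["body"]
    for i in range(start, end):
        step = (FORMAT_CHAIN[i], FORMAT_CHAIN[i + 1])
        body = CONVERTERS[step](body)
    return {"format": target, "body": body}


# The stored object stays in its original format; conversion happens on access.
stored = {"format": "markup-v1", "body": "<b>hello</b>"}
current = access(stored)
print(current)
```

The same `access` function could be run over every object during a media migration, which would store the converted form instead of recomputing it on each access.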
10Hierarchy of Information Contexts
- Digital object context
- Meta-data to define the structure of the object
- When publishing a digital object, must also publish the context of the object
- Use collections to organize objects
- Meta-data to define the structure of the collection
- When publishing a collection, must also publish the information needed to organize the collection.
- Use knowledge context to control presentation
- Rules to map information to presentation style
- Rules that govern the generation of the digital objects
11Information Management
- XML representation of metadata attributes
- Standardization of DTDs - MOA II DTD for text
- Standardization of markup language
- XML based representation of collection structure
- Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, ...)
- XML databases - XML organized data collections
- Commercial systems - Excelon, TAMINO, Oracle8i, ...
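An XML description of collection structure can be turned into relational tables on whatever database is current. The sketch below assumes a made-up description format (`collection`, `table`, and `column` elements with `type`, `key`, and `references` attributes) and uses an in-memory SQLite database as a stand-in for the target system. None of these names come from the original slides.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical collection description; element and attribute names are
# invented for this sketch.
COLLECTION_XML = """
<collection name="images">
  <table name="artist">
    <column name="artist_id" type="INTEGER" key="primary"/>
    <column name="name" type="TEXT"/>
  </table>
  <table name="image">
    <column name="image_id" type="INTEGER" key="primary"/>
    <column name="title" type="TEXT"/>
    <column name="artist_id" type="INTEGER" references="artist(artist_id)"/>
  </table>
</collection>
"""


def ddl_from_description(xml_text):
    """Build one CREATE TABLE statement per <table> element."""
    statements = []
    for table in ET.fromstring(xml_text).findall("table"):
        parts = []
        for col in table.findall("column"):
            part = f'{col.get("name")} {col.get("type")}'
            if col.get("key") == "primary":
                part += " PRIMARY KEY"
            if col.get("references"):
                part += f' REFERENCES {col.get("references")}'
            parts.append(part)
        statements.append(
            f'CREATE TABLE {table.get("name")} ({", ".join(parts)})')
    return statements


# Instantiate the collection on one particular database (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
for stmt in ddl_from_description(COLLECTION_XML):
    conn.execute(stmt)

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
print(tables)
```

Since the description is kept apart from any database, the same XML could be replayed against a different database product later, which is the point of defining the collection independent of database choice.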
12Art Museum Image Consortium
- Information management
- Support for heterogeneous digital objects
- Automated conversion of meta-data to XML DTD
- Validation of meta-data
13AMICO Meta-data Conversion to XML
14E-mail Collection
- Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
- 2.5 GB of data
- 6 required fields
- 13 optional fields
- User defined fields (over 1000)
- Determine information model needed for persistent archive
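The ingest and recreate steps for one RFC 1036 message can be sketched with the Python standard library. The six required headers listed below are the ones RFC 1036 names. The sample message, the XML element names (`message`, `field`, `body`), and the `kind` attribute are assumptions made for this sketch and are not the DTD used in the demonstration.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# The six headers RFC 1036 requires in every news message.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]

# Hypothetical message, invented for this sketch.
RAW = """\
From: user@example.org (Example User)
Path: hostb!hosta!user
Newsgroups: misc.test
Subject: Archive test
Message-ID: <1234@example.org>
Date: Fri, 19 Nov 1999 16:14:55 GMT
Organization: Example Org
X-Project: demo

Body text of the message.
"""


def to_xml(raw):
    """Wrap each header and the body of one message in XML elements."""
    msg = message_from_string(raw)
    missing = [h for h in REQUIRED if msg[h] is None]
    if missing:
        raise ValueError(f"missing required headers: {missing}")
    root = ET.Element("message")
    for name, value in msg.items():
        kind = "required" if name in REQUIRED else "other"
        field = ET.SubElement(root, "field", name=name, kind=kind)
        field.text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    return root


def from_xml(root):
    """Recreate the message text from its XML form."""
    headers = [f'{f.get("name")}: {f.text}' for f in root.findall("field")]
    return "\n".join(headers) + "\n\n" + root.find("body").text


xml_root = to_xml(RAW)
recreated = from_xml(xml_root)
print(len(xml_root.findall("field")))
print(recreated == RAW)
```

Storing the header name as an attribute value, not as an element name, keeps the XML well-formed even for the large number of user defined fields, whose names are not known in advance.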
15XML DTD for E-mail
16Data Management Hierarchy
- Persistent Archives
- Storage of information model, data model, along with data
- Data Grid
- Access to data in a different administration domain
- Digital Library - services
- Interlib - ADEPT, UC Berkeley Digital Library
- Data Collection
- Extensible Meta-data catalog - EMCAT
- Data handling
- SDSC Storage Resource Broker - SRB
- Archival Storage
- High performance storage system - HPSS
17Storage Transparencies
- Location transparency
- Distribution of data collection across multiple physical resources
- Name transparency
- Attribute based access to data
- Protocol transparency
- Common API for access to remote data resources
- Time transparency
- Minimization of data access latency
18Digital Library Data Management
- Persistent identifiers
- Ability to move a data set without the name changing
- Data set replicas
- Management of multiple copies of a data set
- Archival backup of data sets
- Integration of disk data caches with archival storage
- Persistent archives
- Management of a collection through multiple cycles of technology evolution
19SDSC Storage Resource Broker Meta-data Catalog
[Figure: SRB and MCAT architecture diagram. Labels: Application, Resource, User, Third-party copy, Remote Proxies, MCAT, Dublin Core, DataCutter, Application Meta-data.]
20Collection Based Access
- Abstract data set naming and administration away from physical storage resource
- Data sets defined by attributes
- Logical collection used to group data sets across storage systems
- Enables support for replication of data
- Collection owned data
- Authentication controlled by data handling system
- Persistence controlled by data handling system
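A logical collection that separates names and attributes from physical storage can be sketched as a small in-memory catalog. This is a toy model, not the SRB or MCAT interface: the `Collection` class, its methods, and the location strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    logical_name: str                              # persistent name seen by users
    attributes: dict                               # descriptive metadata
    replicas: list = field(default_factory=list)   # physical locations


class Collection:
    """Toy logical collection: names and attributes live apart from storage."""

    def __init__(self):
        self._sets = {}

    def register(self, logical_name, attributes, location):
        ds = DataSet(logical_name, dict(attributes), [location])
        self._sets[logical_name] = ds
        return ds

    def replicate(self, logical_name, location):
        self._sets[logical_name].replicas.append(location)

    def move(self, logical_name, old, new):
        # The physical location changes; the logical name does not.
        reps = self._sets[logical_name].replicas
        reps[reps.index(old)] = new

    def find(self, **wanted):
        # Attribute-based access: select data sets by metadata, not by path.
        return sorted(
            name for name, ds in self._sets.items()
            if all(ds.attributes.get(k) == v for k, v in wanted.items()))

    def locations(self, logical_name):
        return list(self._sets[logical_name].replicas)


coll = Collection()
coll.register("survey/img42", {"band": "J", "year": 1999}, "unix://hostA/d1/f42")
coll.register("survey/img43", {"band": "K", "year": 1999}, "unix://hostA/d1/f43")
coll.replicate("survey/img42", "hpss://archive/c7/f42")
coll.move("survey/img42", "unix://hostA/d1/f42", "unix://hostB/d9/f42")
print(coll.find(band="J"))
print(coll.locations("survey/img42"))
```

Because users only ever see the logical name, the collection can replicate or relocate data without breaking references to it.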
21SRB Containers - Managing Archive Latency
- Create container in a logical storage resource containing at least one cacheable resource
- Create objects in containers
- Cache daemon will move filled containers to archive
- synch and purge APIs
[Figure labels: SRB client, SRB Server, UNIX, HPSS, container, Distributed Storage Resources, cached containers.]
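The container idea, many small objects aggregated in a cache and moved to the archive as one unit, can be sketched as follows. This is a toy model under stated assumptions and not the SRB API: the `ContainerStore` class, its method names, and the object-count capacity limit are invented for illustration.

```python
class ContainerStore:
    """Toy model of containers staged in a cache and flushed to an archive."""

    def __init__(self, capacity):
        self.capacity = capacity   # objects per container (illustrative limit)
        self.cache = {}            # container id -> list of (name, data)
        self.archive = {}          # container id -> list of (name, data)
        self.index = {}            # object name -> container id
        self._next_id = 0

    def _open_container(self):
        # Reuse a cached, unarchived container with room, else create one.
        for cid, objs in self.cache.items():
            if cid not in self.archive and len(objs) < self.capacity:
                return cid
        cid = self._next_id
        self._next_id += 1
        self.cache[cid] = []
        return cid

    def put(self, name, data):
        cid = self._open_container()
        self.cache[cid].append((name, data))
        self.index[name] = cid

    def sync(self, cid):
        # Copy a cached container to the archive as a single unit.
        self.archive[cid] = list(self.cache[cid])

    def purge(self, cid):
        # Drop the cached copy, but only once it is safely archived.
        if cid in self.archive:
            del self.cache[cid]

    def run_daemon(self):
        # Move every filled container to the archive, then free the cache.
        full = [c for c, objs in self.cache.items()
                if len(objs) >= self.capacity]
        for cid in full:
            self.sync(cid)
            self.purge(cid)

    def get(self, name):
        cid = self.index[name]
        if cid not in self.cache:
            # One archive access stages the whole container back into cache.
            self.cache[cid] = list(self.archive[cid])
        return dict(self.cache[cid])[name]


store = ContainerStore(capacity=2)
for n in ["a", "b", "c"]:
    store.put(n, n.upper())
store.run_daemon()
print(sorted(store.archive), sorted(store.cache))
print(store.get("a"), sorted(store.cache))
```

Reading one object stages its whole container, so later reads of neighboring objects are served from cache instead of paying the archive latency again.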
22Knowledge Management
- Knowledge-based mediation
- Conceptual-level integration
- Predictive learning models
- Rule-based ontology maps
- Map source XML to Concept Map (ontologies, views)
- Rule-based presentation and analysis
- Rules governing accessioning of data sets
- Rules governing integrity constraints
- Style sheets for presentation
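A rule-based map from source XML tags to concepts can be sketched as an ordered rule list in which the first matching rule wins. The rules, tag names, and concept names below are hypothetical, chosen only to show the mechanism. A real ontology map would be far richer.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: (source tag, optional predicate on the text, concept).
RULES = [
    ("painter", None, "Creator"),
    ("sculptor", None, "Creator"),
    ("made", lambda text: text.isdigit(), "CreationYear"),
    ("made", None, "CreationDate"),
]

SOURCE_XML = (
    "<work><painter>A. Artist</painter><made>1887</made>"
    "<room>12</room></work>"
)


def map_to_concepts(xml_text, rules):
    """Apply the first matching rule to each element; collect unmapped tags."""
    mapped, unmapped = [], []
    for elem in ET.fromstring(xml_text):
        text = elem.text or ""
        for tag, predicate, concept in rules:
            if elem.tag == tag and (predicate is None or predicate(text)):
                mapped.append((concept, text))
                break
        else:
            # No rule matched this element.
            unmapped.append(elem.tag)
    return mapped, unmapped


mapped, unmapped = map_to_concepts(SOURCE_XML, RULES)
print(mapped)
print(unmapped)
```

Keeping the rules as data, separate from the mapping code, means old concepts can be remapped to a current view by editing the rule list alone.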
23AMICO Presentation Interface
25Formatted Message Using XML DTD
26 Knowledge Representation
30Applications
- Support for distributed data collections
- Federation of data collections to form digital library
- Integration of digital libraries with archives
- Finding aids for federation of digital libraries through mediation of information
- Data grids for data access
- Persistent archives
31Communities Providing Technology
- Archival storage - HPSS, ADSM, SANs
- Data handling - Storage Resource Broker
- Databases - XML, Object relational
- Digital libraries - services, information discovery
- Data grids - collection federation, finding aids
- Computational grids - remote execution
- Library - catalogs, DTDs, finding aids
- Archivist - archival procedures
33Further Information
http://www.npaci.edu/DICE