Collection Based Persistent Archives

1
Collection Based Persistent Archives
Reagan Moore, Chaitan Baru, Amarnath Gupta,
George Kremenek, Bertram Ludaescher, Richard Marciano,
Arcot Rajasekar, Wayne Schroeder, Michael Wan,
Ilya Zaslavsky, Bing Zhu
(http://www.npaci.edu/DICE/)
2
Topics
  • Components of a persistent archive
  • Information management example
  • Data management example
  • Knowledge management example

3
Fundamental Concept for a Persistent Archive
  • Persistence requires migration over time onto new
    technology
  • While the migration occurs, a persistent archive
    must be able to interoperate with both the old
    technology and the new technology.
  • A persistent archive is an interoperability
    system.

4
What Types of Interoperability are Needed?
  • Data management (data sets)
      • Ability to work with multiple types of storage
        systems, across separate administration domains
  • Information management (schema)
      • Ability to define a collection independent of
        database choice
      • Ability to migrate collection onto new databases
  • Knowledge management (ontology)
      • Ability to map old concepts to current view of
        the world
      • Ability to present and manipulate information
        associated with data sets

5
Implicit Concepts
  • Infrastructure independence
  • Data set access
  • Authentication
  • Collection management
  • Presentation
  • Non-proprietary formatting
  • Information models
  • XML - Information markup language
  • GML - Graphics markup language
  • Functional separation of archival systems
  • Accessioning workbench, archive, access workbench

6
Implicit Goals
  • Maintain digital objects and the information
    retrieval catalog description in the archive
  • Provide ability to instantiate collection as
    needed on new technology
  • Instantiate archived collection only when needed
  • Implies collection can sit in the archive
    forever, and can still be accessed at an
    arbitrary point in the future

7
Electronic Records Archive (ERA)
[Figure: ERA architecture diagram. Stages: Transfer, Accession, Archives,
Reference. Components: Accessioning Work Bench (snap-in), Reference
Workbench (snap-in), Wrapper, Unwrapper, Metadata wrapper, Media Handlers,
Catalog, Metadata Repository, Records Repository, Arrangement, Query,
Reference Tools, Retrieve Records, Presentation, Order Fulfillment.
Media: tape, CD, disk. Networks: Internet, intranet. Record types: text,
image, photo, video, audio, Geographical Information System, compound
records, Web, database.]
8
Common Information Model
  • eXtensible Markup Language (XML)
      • Use tags to define semantic context for
        components of the data set
  • Document Type Definition (DTD)
      • Provides semi-structured representation for
        organizing tags that can be applied to groups of
        digital objects
  • Development of standards for tags
      • California Digital Library - Encoded Archival
        Description
      • Digital sky, Protein Data Bank, Neuroscience
        brain images
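
As a rough illustration of tags carrying semantic context, the sketch below parses one XML record and compares its child tags with the content model declared in a DTD. The element names, the record, and the DTD are invented for illustration and do not come from the slides. Python's standard library does not validate against a DTD, so the check is a hand-written comparison that only handles a simple comma-separated sequence.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD: the element names are assumptions for this sketch.
DTD = """
<!ELEMENT record (title, creator, date)>
<!ELEMENT title (#PCDATA)>
<!ELEMENT creator (#PCDATA)>
<!ELEMENT date (#PCDATA)>
"""

# Hypothetical record that uses the tags declared above.
RECORD = """<record>
  <title>Survey plate 42</title>
  <creator>Example Observatory</creator>
  <date>1999-06-01</date>
</record>"""


def declared_children(dtd_text, element):
    """Return the child names listed in a simple sequence content model."""
    pattern = r"<!ELEMENT\s+" + re.escape(element) + r"\s+\(([^)]*)\)>"
    match = re.search(pattern, dtd_text)
    return [name.strip() for name in match.group(1).split(",")]


def tagged_fields(xml_text):
    """Return {tag: text} for the children of the root element."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


fields = tagged_fields(RECORD)
# Not DTD validation: only checks that the tags match the declared sequence.
matches_dtd = list(fields) == declared_children(DTD, "record")
print(fields["creator"], matches_dtd)
```

A validating parser would be needed for real DTD conformance; the point here is only that the tag names, not their positions in a file, identify each component.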

9
Digital Object Representation
  • Require non-proprietary markup language for
    formats that can be controlled by the archive
  • HTML - text
  • SVG - Scalable Vector Graphics markup language
  • As standards evolve, choose next format markup
    language to be a superset of the previous
    language
  • Convert to new standard on the fly as digital
    objects are accessed, or during a media migration
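
One way to picture conversion on access is a chain of converters keyed by format version. The sketch below is a minimal illustration under assumptions: the format names `v1` to `v3` and the converter functions are hypothetical stand-ins, not real markup standards. Because each new format is taken to be a superset of the previous one, each step is assumed to lose no information.

```python
CURRENT_FORMAT = "v3"  # hypothetical name for the newest format


def v1_to_v2(body):
    # Hypothetical step: rename a tag that the newer format replaced.
    return body.replace("<b>", "<strong>").replace("</b>", "</strong>")


def v2_to_v3(body):
    # Hypothetical step: the newer format requires an enclosing element.
    return "<doc>" + body + "</doc>"


# Each stored format maps to (next format, converter to that format).
CONVERTERS = {
    "v1": ("v2", v1_to_v2),
    "v2": ("v3", v2_to_v3),
}


def access(obj):
    """Return the object in the current format, converting step by step."""
    fmt, body = obj["format"], obj["body"]
    while fmt != CURRENT_FORMAT:
        fmt, convert = CONVERTERS[fmt]
        body = convert(body)
    return {"format": fmt, "body": body}


archived = {"format": "v1", "body": "<b>x</b>"}
print(access(archived))
```

The same `access` function could be run over every object during a media migration, so that the stored copies are rewritten once instead of being converted on each read.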

10
Hierarchy of Information Contexts
  • Digital object context
      • Meta-data to define the structure of the object
      • When publishing a digital object, must also
        publish the context of the object
  • Use collections to organize objects
      • Meta-data to define the structure of the
        collection
      • When publishing a collection, must also publish
        the information needed to organize the
        collection.
  • Use knowledge context to control presentation
      • Rules to map information to presentation style
      • Rules that govern the generation of the digital
        objects

11
Information Management
  • XML representation of metadata attributes
  • Standardization of DTDs - MOA II DTD for text
  • Standardization of markup language
  • XML based representation of collection structure
  • Attributes defining the physical layout of a
    schema into relational tables (foreign keys,
    attribute data types, ...)
  • XML databases - XML organized data collections
  • Commercial systems - Excelon, TAMINO, Oracle8i, ...
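
To make the idea of an XML description of collection structure concrete, the sketch below reads a small XML schema description and instantiates it as relational tables. It is an illustration under assumptions: the element and attribute names (`collection`, `table`, `attribute`, `key`, `references`) are invented here, and SQLite stands in for whichever database the collection is migrated onto.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML description of a collection's physical layout.
SCHEMA_XML = """<collection name="demo">
  <table name="dataset">
    <attribute name="id" type="INTEGER" key="primary"/>
    <attribute name="title" type="TEXT"/>
  </table>
  <table name="replica">
    <attribute name="id" type="INTEGER" key="primary"/>
    <attribute name="dataset_id" type="INTEGER" references="dataset(id)"/>
    <attribute name="location" type="TEXT"/>
  </table>
</collection>"""


def create_statements(schema_xml):
    """Turn the XML description into CREATE TABLE statements."""
    statements = []
    for table in ET.fromstring(schema_xml).findall("table"):
        columns = []
        for attr in table.findall("attribute"):
            column = attr.get("name") + " " + attr.get("type")
            if attr.get("key") == "primary":
                column += " PRIMARY KEY"
            if attr.get("references"):
                column += " REFERENCES " + attr.get("references")
            columns.append(column)
        statements.append(
            "CREATE TABLE " + table.get("name") + " (" + ", ".join(columns) + ")"
        )
    return statements


# Instantiate the collection on a fresh database.
conn = sqlite3.connect(":memory:")
for statement in create_statements(SCHEMA_XML):
    conn.execute(statement)

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    )
]
print(tables)
```

Because the layout lives in the XML description and not in the database, the same description could be replayed against a different database later.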

12
Art Museum Image Consortium
  • Information management
  • Support for heterogeneous digital objects
  • Automated conversion of meta-data to XML DTD
  • Validation of meta-data
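
A minimal sketch of converting a flat meta-data record to XML and checking it for missing fields follows. The field names, the `object` root tag, and the required-field list are hypothetical and are not taken from the AMICO data dictionary; the check is a simple presence test, not DTD validation.

```python
import xml.etree.ElementTree as ET

# Hypothetical required fields for this sketch.
REQUIRED_FIELDS = ("title", "creator")


def missing_fields(record):
    """Return the required fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


def record_to_xml(record, root_tag="object"):
    """Wrap each field of a flat record in an element named after the field."""
    root = ET.Element(root_tag)
    for field, value in record.items():
        ET.SubElement(root, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


record = {"title": "Untitled", "creator": "A. Example", "year": 1901}
problems = missing_fields(record)
xml_text = record_to_xml(record)
print(problems, xml_text)
```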

13
AMICO Meta-data Conversion to XML
14
E-mail Collection
  • Demonstrate ability to ingest, archive, recreate,
    query, and present a digital object from a 1
    million record E-mail collection (RFC1036)
  • 2.5 GB of data
  • 6 required fields
  • 13 optional fields
  • User defined fields (over 1000)
  • Determine information model needed for persistent
    archive
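
The split into required, optional, and user-defined fields can be sketched with Python's standard `email` parser. The six required header names below are the ones listed in RFC 1036; the slides give only the count, so treating them as the six required fields is an assumption. The sample message is invented.

```python
from email import message_from_string

# Required headers as listed in RFC 1036 (assumed to be the "6 required fields").
REQUIRED = ("From", "Date", "Newsgroups", "Subject", "Message-ID", "Path")

# Invented sample message with one optional and one user-defined header.
RAW = """From: jerry@eagle.example (Jerry Schwarz)
Path: cbosgd!mhuxj!eagle!jerry
Newsgroups: news.announce
Subject: Usenet Etiquette
Message-ID: <642@eagle.example>
Date: Fri, 19 Nov 82 16:14:55 GMT
Organization: Example Labs
X-Local-Tag: demo

Body text.
"""


def split_fields(raw):
    """Return (required headers, other headers, body) or raise if incomplete."""
    msg = message_from_string(raw)
    missing = [name for name in REQUIRED if msg[name] is None]
    if missing:
        raise ValueError("missing required headers: " + ", ".join(missing))
    required_lower = {name.lower() for name in REQUIRED}
    required = {name: msg[name] for name in REQUIRED}
    other = {k: v for k, v in msg.items() if k.lower() not in required_lower}
    return required, other, msg.get_payload()


required, other, body = split_fields(RAW)
print(required["Subject"], sorted(other))
```

An information model for the archive would need a place for all three groups, since the user-defined headers are open-ended.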

15
XML DTD for E-mail
16
Data Management Hierarchy
  • Persistent Archives
      • Storage of information model, data model, along
        with data
  • Data Grid
      • Access to data in a different administration
        domain
  • Digital Library - services
      • Interlib - ADEPT, UC Berkeley Digital Library
  • Data Collection
      • Extensible Meta-data catalog - EMCAT
  • Data handling
      • SDSC Storage Resource Broker - SRB
  • Archival Storage
      • High performance storage system - HPSS

17
Storage Transparencies
  • Location transparency
      • Distribution of data collection across multiple
        physical resources
  • Name transparency
      • Attribute-based access to data
  • Protocol transparency
      • Common API for access to remote data resources
  • Time transparency
      • Minimization of data access latency
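
Name and location transparency can be pictured with a small in-memory catalog: data sets are found by attribute values, and the physical locations stay behind a lookup. This is a sketch under assumptions; the catalog layout, attribute names, and URLs are hypothetical and do not reflect the SRB or MCAT interfaces.

```python
# Hypothetical catalog entries: logical name, attributes, physical locations.
CATALOG = [
    {
        "logical_name": "sky/plate42",
        "attributes": {"survey": "demo", "band": "J"},
        "locations": ["hpss://archive.example/p/42", "file://cache.example/p/42"],
    },
    {
        "logical_name": "sky/plate43",
        "attributes": {"survey": "demo", "band": "K"},
        "locations": ["hpss://archive.example/p/43"],
    },
]


def find(**wanted):
    """Name transparency: select data sets by attribute values, not by path."""
    return [
        entry["logical_name"]
        for entry in CATALOG
        if all(entry["attributes"].get(k) == v for k, v in wanted.items())
    ]


def locate(logical_name):
    """Location transparency: callers never hard-code where the data lives."""
    for entry in CATALOG:
        if entry["logical_name"] == logical_name:
            return list(entry["locations"])
    raise KeyError(logical_name)


hits = find(survey="demo", band="J")
print(hits, locate(hits[0]))
```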

18
Digital Library Data Management
  • Persistent identifiers
      • Ability to move a data set without the name
        changing
  • Data set replicas
      • Management of multiple copies of a data set
  • Archival backup of data sets
      • Integration of disk data caches with archival
        storage
  • Persistent archives
      • Management of a collection through multiple
        cycles of technology evolution
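
The sketch below illustrates how a persistent identifier can stay fixed while a data set is replicated and moved. The `ReplicaCatalog` class, its method names, the identifier, and the locations are all hypothetical; they are meant only to show the indirection, not any actual digital library interface.

```python
class ReplicaCatalog:
    """Maps a persistent identifier to the current physical copies."""

    def __init__(self):
        self._replicas = {}  # persistent id -> list of physical locations

    def register(self, pid, location):
        """Record a new copy of the data set under its persistent id."""
        self._replicas.setdefault(pid, []).append(location)

    def move(self, pid, old, new):
        """Relocate one copy; the persistent id is unchanged."""
        locations = self._replicas[pid]
        locations[locations.index(old)] = new

    def resolve(self, pid):
        """Return the current locations of every copy."""
        return list(self._replicas[pid])


catalog = ReplicaCatalog()
catalog.register("pid:0001", "disk://old.example/a")
catalog.register("pid:0001", "tape://vault.example/a")
catalog.move("pid:0001", "disk://old.example/a", "disk://new.example/a")
print(catalog.resolve("pid:0001"))
```

Users and applications hold only `pid:0001`, so the move onto new storage is invisible to them.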

19
SDSC Storage Resource Broker Meta-data Catalog
[Figure: SRB and meta-data catalog architecture diagram. Labels:
Application, User, Resource, Remote Proxies, Third-party copy, MCAT,
Dublin Core, DataCutter, Application Meta-data.]
20
Collection Based Access
  • Abstract data set naming and administration away
    from physical storage resource
  • Data sets defined by attributes
  • Logical collection used to group data sets across
    storage systems
  • Enables support for replication of data
  • Collection owned data
  • Authentication controlled by data handling system
  • Persistence controlled by data handling system

21
SRB Containers - Managing Archive Latency
SRB client
  • Create container in a logical storage resource
    containing at least one cacheable resource
  • Create objects in containers
  • Cache daemon will move filled containers to
    archive
  • synch and purge APIs
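
The container life cycle listed above can be sketched as a small simulation. Everything here is an assumption made for illustration: the class and method names are invented, capacity is counted in objects instead of bytes, and `synch` and `purge` are given the plausible meanings "copy the cached container to the archive" and "drop the cached copy once it is archived". None of this is the SRB API.

```python
class ContainerStore:
    """Toy model of containers on a cacheable resource plus an archive."""

    def __init__(self, capacity):
        self.capacity = capacity  # objects per container (assumed unit)
        self.cache = {}           # container name -> objects on the cache
        self.archive = {}         # container name -> objects in the archive

    def create_container(self, name):
        self.cache[name] = []

    def put(self, name, obj):
        """Create an object inside a cached container."""
        if len(self.cache[name]) >= self.capacity:
            raise ValueError("container is full")
        self.cache[name].append(obj)

    def synch(self, name):
        """Assumed meaning: copy the cached container to the archive."""
        self.archive[name] = list(self.cache[name])

    def purge(self, name):
        """Assumed meaning: remove the cached copy after it is archived."""
        if self.archive.get(name) != self.cache[name]:
            raise RuntimeError("container must be synched before purge")
        del self.cache[name]

    def run_cache_daemon(self):
        """Move every filled container from the cache to the archive."""
        filled = [n for n, objs in self.cache.items() if len(objs) >= self.capacity]
        for name in filled:
            self.synch(name)
            self.purge(name)
        return filled


store = ContainerStore(capacity=2)
store.create_container("c1")
store.put("c1", "a")
first_pass = store.run_cache_daemon()   # container not yet full
store.put("c1", "b")
second_pass = store.run_cache_daemon()  # container is now full
print(first_pass, second_pass, store.archive)
```

Grouping many small objects into one container means the archive handles one large transfer instead of many small ones, which is where the latency saving would come from.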

[Figure: SRB container diagram. Labels: SRB Server, UNIX, HPSS,
container, cached containers, Distributed Storage Resources.]
22
Knowledge Management
  • Knowledge-based mediation
  • Conceptual-level integration
  • Predictive learning models
  • Rule-based ontology maps
  • Map source XML to Concept Map (ontologies, views)
  • Rule-based presentation and analysis
  • Rules governing accessioning of data sets
  • Rules governing integrity constraints
  • Style sheets for presentation
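
A rule-based ontology map can be sketched as a table of rules that send source XML tags to shared concepts. The rules, tag names, and concept names below are hypothetical; a real concept map would carry relationships between concepts, which this flat table does not model.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: source tag -> concept in the current view of the world.
RULES = {
    "artist": "Creator",
    "painter": "Creator",
    "made": "CreationDate",
    "title": "Title",
}

# Invented source record that uses older, collection-specific tags.
SOURCE = (
    "<work><painter>A. Example</painter><made>1901</made>"
    "<title>Untitled</title><room>12</room></work>"
)


def to_concepts(xml_text, rules):
    """Group element values by concept and report tags with no rule."""
    concepts, unmapped = {}, []
    for child in ET.fromstring(xml_text):
        concept = rules.get(child.tag)
        if concept is None:
            unmapped.append(child.tag)
        else:
            concepts.setdefault(concept, []).append(child.text)
    return concepts, unmapped


concepts, unmapped = to_concepts(SOURCE, RULES)
print(concepts, unmapped)
```

Keeping the rules as data means old tags can be remapped when the concept map changes, without touching the archived XML.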

23
AMICO Presentation Interface
25
Formatted Message Using XML DTD
26
Knowledge Representation
30
Applications
  • Support for distributed data collections
  • Federation of data collections to form digital
    library
  • Integration of digital libraries with archives
  • Finding aids for federation of digital libraries
    through mediation of information
  • Data grids for data access
  • Persistent archives

31
Communities Providing Technology
  • Archival storage - HPSS, ADSM, SANs
  • Data handling - Storage Resource Broker
  • Databases - XML, Object relational
  • Digital libraries - services, information
    discovery
  • Data grids - collection federation, finding aids
  • Computational grids - remote execution
  • Library - catalogs, DTDs, finding aids
  • Archivist - archival procedures

33
Further Information
http://www.npaci.edu/DICE