Digital Libraries, Data Grids, and Persistent Archives


1
Digital Libraries, Data Grids, and Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2
Data and Knowledge Systems Group
  • Staff
  • Reagan Moore
  • Ilkai Altintas
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • XuFei Qian
  • Roman Olshanowsky
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Graduate Students
  • A. Bagchi
  • S. Bansal
  • A. Behere
  • R. Bharath
  • S. Bharath
  • M. Kulrul
  • L. Sui
  • Undergraduate Interns
  • N. Cotofana
  • M. Shumaker
  • J. Trang
  • L. Yin
  • /- NN

3
Topics
  • Application of
  • Data management systems
  • Information management systems
  • Knowledge management systems
  • to
  • Distributed data collections
  • Digital libraries
  • Data Grids
  • Persistent Archives
  • by
  • Defining levels of abstraction

4
Information Management Projects
  • Digital Libraries
  • CDL - AMICO
  • DARPA/USPTO - patent digital library
  • NLM Visible Embryo digital library - GMU
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • NSF NPACI Digital Sky - Caltech 2MASS sky survey
  • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grid Environments
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NASA Information Power Grid - NASA Ames
  • NIH Biomedical Informatics Research Network
  • NSF Grid Physics Network - U Florida
  • NSF National Virtual Observatory - Johns Hopkins
    University / Caltech
  • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Archivist workbench

5
Managing Distributed Storage
  • Separate the organization of digital objects from
    their physical storage
  • Logical Name Space to manage attributes about the
    digital objects
  • Data handling system to manage interactions with
    remote storage systems
  • Create storage abstraction layer
  • Storage Resource Broker (SRB) provides data
    management system
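
The separation between logical organization and physical storage described on this slide can be sketched as a small catalog that maps one logical name to one or more physical copies. This is only an illustration under stated assumptions, not the SRB interface: `Replica`, `LogicalNameSpace`, `register`, `resolve`, and the example resource names and paths are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One physical copy of a digital object (hypothetical structure)."""
    resource: str   # name of the storage system holding the copy
    path: str       # location inside that storage system


@dataclass
class LogicalNameSpace:
    """Maps logical names to physical replicas, independent of storage."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_name: str, replica: Replica) -> None:
        # A logical name may point to several physical copies.
        self.entries.setdefault(logical_name, []).append(replica)

    def resolve(self, logical_name: str) -> list:
        # Callers see only logical names; storage details stay in the catalog.
        return list(self.entries.get(logical_name, []))


# Example values are invented for illustration.
ns = LogicalNameSpace()
ns.register("/collection/sky/image001", Replica("archive-a", "/tape/0042/img.fits"))
ns.register("/collection/sky/image001", Replica("disk-b", "/cache/img001.fits"))
resources = [r.resource for r in ns.resolve("/collection/sky/image001")]
```

Because applications hold only the logical name, a copy can be moved or added by changing catalog entries, without touching the applications.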

6
Information Management - Logical Name Space
  • Set of attributes to describe digital entities
    that are registered into the logical name space
  • SRB metadata - Unix file system semantics
  • Provenance metadata - Dublin Core
  • Resource metadata - User access control lists
  • Discipline metadata - User defined attributes
  • Each digital entity may have unique attributes
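
The four attribute groups listed above could be held in a single record per registered entity. The sketch below is one possible shape, assuming a hypothetical `DigitalEntity` class; the field names and example values are invented for illustration and are not the SRB metadata schema.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """A registered entity carrying the four attribute groups from the slide."""
    logical_name: str
    system: dict = field(default_factory=dict)      # file-system-style attributes
    provenance: dict = field(default_factory=dict)  # Dublin Core style elements
    access: dict = field(default_factory=dict)      # user -> permission
    discipline: dict = field(default_factory=dict)  # user-defined attributes

    def can_read(self, user: str) -> bool:
        # In this sketch, write permission implies read permission.
        return self.access.get(user) in {"read", "write"}


# Example values are invented for illustration.
entity = DigitalEntity(
    logical_name="/collection/sky/image001",
    system={"size": 2048, "owner": "survey"},
    provenance={"creator": "Survey Team", "date": "2001-05-01"},
    access={"alice": "write", "guest": "read"},
    discipline={"band": "J"},
)
readers = sorted(u for u in ["alice", "bob", "guest"] if entity.can_read(u))
```

Keeping the discipline attributes in an open dictionary is what lets each entity carry its own unique attributes.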

7
Information Management
  • Abstraction layer for interacting with
    information repositories
  • Manage the schema and physical table structures
    of a database
  • Extensible schema
  • User defined attributes
  • Extensible Metadata CATalog (EMCAT) manages
    collections
  • mySRB.html interface supports dynamic collection
    creation
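
One common way to get an extensible schema with user-defined attributes is an entity-attribute-value layout, where new attributes are rows rather than new columns. The sketch below uses Python's built-in `sqlite3` module to show the idea. It is an assumption for illustration and does not describe how EMCAT is implemented; the table names and helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE attribute (
        entity_id INTEGER REFERENCES entity(id),
        name TEXT,
        value TEXT
    );
""")


def add_entity(logical_name):
    cur = conn.execute(
        "INSERT INTO entity (logical_name) VALUES (?)", (logical_name,)
    )
    return cur.lastrowid


def set_attribute(entity_id, name, value):
    # A new attribute name needs no schema change, only a new row.
    conn.execute("INSERT INTO attribute VALUES (?, ?, ?)", (entity_id, name, value))


def find_by_attribute(name, value):
    rows = conn.execute(
        "SELECT e.logical_name FROM entity e "
        "JOIN attribute a ON a.entity_id = e.id "
        "WHERE a.name = ? AND a.value = ? ORDER BY e.logical_name",
        (name, value),
    )
    return [row[0] for row in rows]


first = add_entity("/survey/img1")
second = add_entity("/survey/img2")
set_attribute(first, "band", "J")
set_attribute(second, "band", "K")
set_attribute(second, "seeing", "1.2")   # attribute introduced after the fact
matches = find_by_attribute("band", "J")
```

The trade-off of this layout is that every value is stored as text, so typed comparisons need extra handling.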

8
Knowledge Management - Discovery across
Collections
  • Characterization of relationships between
    attributes
  • Semantic / logical - cross-walks
  • Procedural / temporal - records management
  • Structural / spatial - GIS
  • Abstraction layer for knowledge repositories
  • Mapping from collection attributes to discipline
    concepts
  • Model-based Mediation supports mapping from
    knowledge relationships to rule-based inference
    engines
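
A semantic cross-walk of the kind mentioned above can be sketched as a mapping from each collection's attribute names to shared concept names, so one query runs across collections. The attribute names, concept names, and functions below are hypothetical. This is a plain dictionary lookup, not model-based mediation or rule-based inference.

```python
# Hypothetical cross-walks: local attribute names -> shared concept names.
CROSSWALK_A = {"ra_deg": "right_ascension", "dec_deg": "declination"}
CROSSWALK_B = {"RA": "right_ascension", "DEC": "declination"}


def to_concepts(record, crosswalk):
    # Attributes with no mapping are left out of the concept-level view.
    return {crosswalk[k]: v for k, v in record.items() if k in crosswalk}


def query(collections, concept, predicate):
    """Return (collection name, concept-level record) pairs that match."""
    hits = []
    for name, (records, crosswalk) in collections.items():
        for record in records:
            view = to_concepts(record, crosswalk)
            if concept in view and predicate(view[concept]):
                hits.append((name, view))
    return hits


collections = {
    "A": ([{"ra_deg": 10.0, "dec_deg": -5.0}], CROSSWALK_A),
    "B": ([{"RA": 200.0, "DEC": 30.0, "note": "x"}], CROSSWALK_B),
}
hits = query(collections, "declination", lambda d: d > 0)
```

The query is phrased once in terms of the concept `declination`, and each collection keeps its own attribute names.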

9
Presentation of Digital Objects
Application
Operating System
Storage System
Display System
Digital Object
10
Technology Management
Application
Wrap Application
Operating System
Storage System
Display System
Digital Object
11
Technology Management
Application
Add Operating System Call
Operating System
Storage System
Display System
Digital Object
12
Technology Management
Application
Add Operating System Call
Operating System
Add Operating System Call
Storage System
Display System
Digital Object
13
Technology Management
Application
Add Operating System Call
Operating System
Wrap Storage System
Wrap Display System
Storage System
Display System
Digital Object
14
Technology Management
Application
Operating System
Storage System
Display System
Migrate Encoding Format
Digital Object
15
Specifying levels of Abstraction
  • Technology management becomes simpler if the
    persistent archive infrastructure operates on
    abstractions, rather than an explicit physical
    implementation of a resource
  • Can we abstract
  • Digital object
  • Storage

16
Technology Management
Application
Operating System
Storage System Abstraction
Display System Abstraction
Storage System
Display System
Digital Object Abstraction
Digital Object
17
Types of Digital Entity Abstractions
  • Logical representation
  • What does the digital entity represent?
  • What is the associated meaning?
  • Physical representation
  • What is the physical structure of the digital
    entity?

18
Levels of Abstraction for Bits
Abstraction for Digital Entity
  • Logical: I-nodes
  • Physical: Track / Sector
Digital Entity
  • Bit Stream
Abstraction for Repository
  • Logical: File Name
  • Physical: File System (NFS/AFS/NTFS)
Repository
  • Disk
19
Levels of Abstraction for Data
Abstraction for Digital Entity
  • Logical: Data Model (units, semantics)
  • Physical: Encoding Format (syntax, structure)
Digital Entity
  • Files
Abstraction for Repository
  • Logical: Name Space
  • Physical: Data Handling System - SRB/MCAT
Repository
  • File System, Archive
20
Levels of Abstraction for Information
Abstraction for Digital Entity
  • Logical: Collection Schema
  • Physical: XML Syntax
Digital Entity
  • Metadata Attributes
Abstraction for Repository
  • Logical: Database Schema
  • Physical: EMCAT/CWM
Repository
  • Database
21
Levels of Abstraction for Knowledge
Abstraction for Digital Entity
  • Logical: Relationship Schema
  • Physical: ER/UML/XMI/RDF syntax
Digital Entity
  • Concept Space (ontology instance)
Abstraction for Repository
  • Logical: Knowledge Repository Schema
  • Physical: Model-based Mediation System
Repository
  • Knowledge Repository
22
Information Management Projects
  • Digital Libraries
  • CDL - AMICO
  • DARPA/USPTO - patent digital library
  • NLM Visible Embryo digital library - GMU
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • NSF NPACI Digital Sky - Caltech 2MASS sky survey
  • NSF NSDL - UCAR / Columbia / Cornell / UCSB
  • Data Grids
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NASA Information Power Grid - NASA Ames
  • NIH Biomedical Informatics Research Network
  • NSF Grid Physics Network - U Florida
  • NSF National Virtual Observatory - Johns Hopkins
    University / Caltech
  • NSF Southern California Earthquake Center - ISI
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Archivist workbench

23
Evolution of Data Management
Collection - managed data
  • Use database to organize attributes about data
    objects
  • Separate information management from data
    storage
  • Support APIs for information discovery, data
    access
Database A
Storage
Storage Resource Broker
Integration accomplished through a data handling
system which characterizes the storage systems
24
SDSC Storage Resource Broker Meta-data Catalog
25
Evolution of Data Management
Distributed Data Collection
  • Same name space
  • Same schema
  • Separate administration domains
  • Heterogeneous database instances
Database A
Database B
Storage Resource Broker
Integration requires the ability to characterize
both the schemas and the table structures of
each information repository
26
Distributed Data Collection
  • Logical organization of distributed digital
    objects into a collection
  • Access through federated servers
  • Collection-owned data implies that the server at
    each storage repository runs under a collection
    user ID
  • Collection attributes define a global namespace
  • Self-consistent attribute update on all data
    accesses
  • Support for multiple access APIs
  • Extensible support for access to any type of
    storage system (archive, file system, database)
  • Extensible collection attributes

27
Interoperability across Data and Information
Repositories
  • Define a representation for storage that is
    independent of the implementation of the storage
    system
  • Unix file system semantics -
    Open/Close/Read/Write/Seek
  • Define a representation of a collection that is
    independent of the choice of database
  • schema, table structures
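
A storage representation that is independent of the storage implementation can be sketched as an abstract interface with the Unix-style operations named above. Everything in this block is a hypothetical illustration: `StorageDriver`, `MemoryDriver`, and `copy_object` are invented names, and the in-memory driver only stands in for a real archive, file system, or database back end.

```python
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Storage interface with Unix-style operations (hypothetical)."""

    @abstractmethod
    def open(self, name: str) -> int: ...
    @abstractmethod
    def read(self, handle: int, size: int) -> bytes: ...
    @abstractmethod
    def write(self, handle: int, data: bytes) -> int: ...
    @abstractmethod
    def seek(self, handle: int, offset: int) -> int: ...
    @abstractmethod
    def close(self, handle: int) -> None: ...


class MemoryDriver(StorageDriver):
    """In-memory stand-in for a real storage system."""

    def __init__(self):
        self._data = {}      # object name -> bytearray
        self._handles = {}   # handle -> [object name, position]
        self._next = 0

    def open(self, name):
        self._data.setdefault(name, bytearray())
        self._next += 1
        self._handles[self._next] = [name, 0]
        return self._next

    def read(self, handle, size):
        name, pos = self._handles[handle]
        chunk = bytes(self._data[name][pos:pos + size])
        self._handles[handle][1] = pos + len(chunk)
        return chunk

    def write(self, handle, data):
        name, pos = self._handles[handle]
        self._data[name][pos:pos + len(data)] = data
        self._handles[handle][1] = pos + len(data)
        return len(data)

    def seek(self, handle, offset):
        name, _ = self._handles[handle]
        # Simplification: positions are clamped to the current object length.
        pos = max(0, min(offset, len(self._data[name])))
        self._handles[handle][1] = pos
        return pos

    def close(self, handle):
        del self._handles[handle]


def copy_object(src: StorageDriver, dst: StorageDriver, name: str, chunk: int = 4) -> int:
    """Copy using only the abstract operations, so any driver pair works."""
    s, d = src.open(name), dst.open(name)
    total = 0
    while True:
        block = src.read(s, chunk)
        if not block:
            break
        total += dst.write(d, block)
    src.close(s)
    dst.close(d)
    return total


a, b = MemoryDriver(), MemoryDriver()
h = a.open("obj")
a.write(h, b"hello world")
a.close(h)
copied = copy_object(a, b, "obj")
h = b.open("obj")
b.seek(h, 6)
tail = b.read(h, 5)
b.close(h)
```

`copy_object` never inspects which driver it is given, which is the point of defining the representation separately from the storage system.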

28
Visible Embryo Project
[Network diagram. Sites and nodes: AFIP Collab WS, Image Generation, OHSU, Eolas, GST, NIC, UIC Startap, ASX200, BEN, MSWS and NT WS (two each), Oakland, HSCC, WRL, Vegas, JHU, Los Angeles, GMU, DC POP, SDSC Archive. Disk Cache appears at four locations. Link and network labels: ATD Net, 100 Gbit, OC-3, DS3, VBNS OC-12, Abilene OC-3.]
29
Data Grids
Data Grid - linking multiple data collections
  • Separate name spaces
  • Separate schema
  • Separate administration domains
  • Heterogeneous database instances
Database A
Database B
Data grid
The data grid is itself a collection that
provides mechanisms to hide latency and manage
semantics
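
The idea that the data grid is itself a collection, with its own name space layered over the member collections, can be sketched as follows. The classes and names are hypothetical, and a simple cache stands in for the latency-hiding mechanisms; semantic management is not shown.

```python
class Collection:
    """A member collection with its own name space (hypothetical stand-in)."""

    def __init__(self, objects):
        self.objects = objects
        self.fetches = 0

    def get(self, local_name):
        self.fetches += 1            # counts simulated remote accesses
        return self.objects[local_name]


class DataGrid:
    """A collection of collections with a grid-level name space and cache."""

    def __init__(self):
        self.collections = {}
        self.names = {}              # grid name -> (collection id, local name)
        self.cache = {}

    def attach(self, cid, collection):
        self.collections[cid] = collection

    def link(self, grid_name, cid, local_name):
        self.names[grid_name] = (cid, local_name)

    def get(self, grid_name):
        # Repeated requests are served from the cache to hide latency.
        if grid_name not in self.cache:
            cid, local_name = self.names[grid_name]
            self.cache[grid_name] = self.collections[cid].get(local_name)
        return self.cache[grid_name]


survey = Collection({"img-7": b"pixels"})
catalog = Collection({"row/42": b"source"})
grid = DataGrid()
grid.attach("A", survey)
grid.attach("B", catalog)
grid.link("/grid/sky/image7", "A", "img-7")
grid.link("/grid/sky/source42", "B", "row/42")

first = grid.get("/grid/sky/image7")
second = grid.get("/grid/sky/image7")
```

Each member collection keeps its own naming scheme; only the grid-level names are shared.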
30
National Virtual Observatory Data Grid
1. Portals and Workbenches
2. Knowledge Resource Management
Bulk Data Analysis
Metadata View
Data View
Catalog Analysis
3.
Standard APIs and Protocols
Concept space
4. Grid Security, Caching, Replication, Backup, Scheduling
Information Discovery
Metadata delivery
Data Discovery
Data Delivery
5.
Standard Metadata format, Data model, Wire format
Catalog Mediator
6.
Data mediator
Catalog/Image Specific Access
Compute Resources
Catalogs
Data Archives
Derived Collections
7.
31
Federated Digital Libraries
Virtual Data Grid - linking multiple data collections
  • Ability to execute processes to recreate derived
    data
Database A Services
Database B Services
Virtual Data Grid
The virtual data grid integrates data grid and
digital library technology to manage processes
32
User Interfaces
NSDL
Usage Enhancement
Delivery Presentation Aggregation - Channels
Information about collections
Core NSDL Bus
Meta-data delivery, Data delivery, Query, Global Ids, Security, Network
Metadata data access-based services
Virtual Collections Mediators
Collection Building
33
Persistent Archive
Persistent archive
  • Describe archived data as collections
  • Describe processes used to create collections
  • Manage evolution of technology
Database A (today)
Database A (tomorrow)
Virtual Data Grid
The persistent archive is itself a virtual data
grid that provides mechanisms to manage
migration to new technology
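
Migration to new technology can be illustrated with a small encoding-format migration in which the logical name stays fixed while the stored form changes. The catalog layout, the `migrate` function, and the example object are assumptions made for this sketch, not a description of an actual persistent archive.

```python
def migrate(catalog, store, logical_name, new_format, convert):
    """Re-encode one object and update the catalog; the logical name is unchanged."""
    entry = catalog[logical_name]
    new_key = f"{entry['key']}.{new_format}"
    store[new_key] = convert(store[entry["key"]])
    # Keep a record of the earlier form so the migration can be audited.
    entry["history"].append((entry["format"], entry["key"]))
    entry["format"] = new_format
    entry["key"] = new_key


# Example object: text stored in an older encoding.
store = {"obj-001": "caf\u00e9".encode("latin-1")}
catalog = {"/memos/menu": {"format": "latin-1", "key": "obj-001", "history": []}}

migrate(catalog, store, "/memos/menu", "utf-8",
        lambda raw: raw.decode("latin-1").encode("utf-8"))

entry = catalog["/memos/menu"]
text = store[entry["key"]].decode(entry["format"])
```

Users of `/memos/menu` see the same content before and after the migration; only the catalog entry and stored bytes differ.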
34
Persistent Archives
  • Storage system abstraction
  • Logical name space and data manipulations
  • Information repository abstraction
  • Logical schema and physical table structure
  • Knowledge repository abstraction
  • Topic maps and inference rules
  • Digital object abstraction
  • Data model and encoding format

35
Persistent Collection
  • Define context for archiving data - annotate
    information content
  • Create archivable form - standard encoding format
  • Archive information content along with data
  • Test closure of the collection - all digital
    objects that can be discovered in the collection
    are members of the collection
  • Test completeness of the collection - inherent
    relationships within the collection can be cast
    in terms of attributes generated from the
    annotated information.
  • Differentiate between inherent knowledge and
    anomalies / artifacts
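
The closure test can be sketched as a check that everything discoverable from a member of the collection is itself a member. Treating "discoverable" as "referenced by a member" is an assumption made for this example, and `closure_violations` is a hypothetical helper.

```python
def closure_violations(members, references):
    """Return (source, target) pairs where the target is outside the collection.

    members: set of object identifiers in the collection
    references: dict mapping a member to the identifiers discoverable from it
    """
    missing = set()
    for source, targets in references.items():
        for target in targets:
            if target not in members:
                missing.add((source, target))
    return sorted(missing)


members = {"a", "b", "c"}
references = {"a": ["b"], "b": ["c", "d"]}
violations = closure_violations(members, references)
is_closed = not violations
```

An empty result means the collection is closed under this definition; each reported pair names an object that would need to be archived or explained.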

36
Self-Instantiating Archive
  • Archive the processes that are used to control
    the ingestion process
  • Conversion to archivable form
  • Annotation of information content
  • When accessing the collection, retrieve the
    processes and the original digital objects
  • Apply the processing steps to re-create the
    information content
  • Query the result to discover desired digital
    objects
  • A self-instantiating archive is a virtual data
    grid
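
The steps above can be sketched as an archive that stores the raw objects together with the names of the processing steps, and replays those steps on access. The step registry, function names, and JSON layout are assumptions for illustration. A real archive would also need to preserve the step implementations themselves.

```python
import json

# Registry of processing steps by name. Only the names are archived here,
# so keeping this registry usable over time is assumed, not solved.
STEPS = {
    "decode_utf8": lambda raw: raw.decode("utf-8"),
    "annotate_length": lambda text: {"text": text, "length": len(text)},
}


def ingest(raw_objects, step_names):
    """Archive the raw objects together with the process that annotates them."""
    return json.dumps({
        "process": step_names,
        "objects": {name: data.hex() for name, data in raw_objects.items()},
    })


def instantiate(archive_text):
    """Re-create the information content by replaying the archived steps."""
    archive = json.loads(archive_text)
    collection = {}
    for name, hex_data in archive["objects"].items():
        value = bytes.fromhex(hex_data)
        for step in archive["process"]:
            value = STEPS[step](value)
        collection[name] = value
    return collection


def query(collection, min_length):
    """Discover objects through the re-created attributes."""
    return sorted(n for n, rec in collection.items() if rec["length"] >= min_length)


archived = ingest(
    {"memo1": b"short", "memo2": b"a longer record"},
    ["decode_utf8", "annotate_length"],
)
collection = instantiate(archived)
found = query(collection, 10)
```

The attributes used by the query are never stored; they are regenerated from the original objects each time the archive is instantiated.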

37
ERA Concept model
38
Data Management Systems
  • Distributed data collections
  • Single name space
  • Distributed data storage systems
  • Data Grid - integration of multiple data
    collections
  • Each collection has a separate name space
  • Infrastructure that interconnects the collections
    can use its own name space, containers,
    replication
  • Virtual Data Grids - federation of digital
    libraries
  • In addition, support interoperability between
    services for manipulation, presentation,
    discovery of digital objects
  • Persistent archive
  • In addition, manage evolution of technology
    components

39
Differentiating between Data, Information, and
Knowledge
  • Data
  • Digital object
  • Objects are streams of bits
  • Information
  • Any tagged data, which is treated as an
    attribute.
  • Attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal,
    structural/spatial, logical/semantic, functional

40
Knowledge Management
  • Must manage semantic relationships between the
    multiple name spaces
  • Data Grid
  • Must manage procedural relationships between
    digital library services
  • Federated digital library
  • Must manage structural relationships between
    different archivable forms - encoding formats
  • Persistent archive

41
Types of Knowledge Relationships
  • Logical / semantic
  • Digital Library cross-walks
  • Temporal / procedural
  • Workflow systems
  • Spatial / structural
  • GIS systems
  • Functional / algorithmic
  • Scientific feature analysis

42
Knowledge Based Data Grids
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
XTM DTD
Rules - KQL
(Model-based Access)
XML DTD
Information Repository
Attribute-based Query
Attributes Semantics
SDLIP
Information
(Data Handling System - SRB)
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Grids
Feature-based Query
MCAT/HDF
43
Further Information
http://www.npaci.edu/DICE