Presented at the NPACI All Hands Meeting

1
Knowledge-Based Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
9500 Gilman Drive, La Jolla, CA 92093-0505
Phone: 858 534-5073
FAX: 858 534-5152
E-mail: moore_at_sdsc.edu

Presented at the NPACI All Hands Meeting
2
Data Intensive Computing Environment
  • Staff
  • Reagan Moore
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Students - GSRA
  • Martin Kuhl
  • Liying Sui
  • Yang Yu
  • Valter Crescenzi
  • Students - Undergrad Interns
  • Peter Shin
  • Roman Olshanowsky
  • Shabbar Tambawala
  • Pratik Mukhopadhyay
  • +/- NN

3
Topics
  • Persistent archive functionality
  • Characterization of Data / Information / Knowledge
  • Integration of Digital Library, Grid
    environments, and Persistent Archives

4
Persistent Archive
  • Manage digital objects for the life of the
    republic
  • Maintain the ability to discover and access digital
    objects while the supporting hardware and software
    systems evolve

5
Fundamental Concept for a Persistent Archive
  • Persistence requires migration over time onto new
    technology
  • While the migration occurs, a persistent archive
    must be able to interoperate with both the old
    technology and the new technology.
  • A persistent archive is an interoperability
    system.
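
The interoperability requirement above can be sketched in a few lines. This is only an illustration: the `StorageBackend` interface, the in-memory stand-in, and the `MigratingArchive` class are hypothetical names, not part of any system named in the slides. While migration is in progress, a read tries the new technology first and falls back to the old one.

```python
from typing import Dict, Optional, Protocol


class StorageBackend(Protocol):
    """Hypothetical minimal interface shared by old and new storage systems."""

    def get(self, object_id: str) -> Optional[bytes]: ...
    def put(self, object_id: str, data: bytes) -> None: ...


class InMemoryBackend:
    """Stand-in for a real storage system (illustration only)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def get(self, object_id: str) -> Optional[bytes]:
        return self._objects.get(object_id)

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data


class MigratingArchive:
    """Serves objects from both technologies while migration is in progress."""

    def __init__(self, old: StorageBackend, new: StorageBackend) -> None:
        self.old = old
        self.new = new

    def get(self, object_id: str) -> Optional[bytes]:
        # Prefer the new technology, fall back to the old one.
        data = self.new.get(object_id)
        if data is None:
            data = self.old.get(object_id)
        return data

    def migrate(self, object_id: str) -> bool:
        # Copy one object forward; return False if the old system lacks it.
        data = self.old.get(object_id)
        if data is None:
            return False
        self.new.put(object_id, data)
        return True
```

Callers only ever talk to `MigratingArchive`, so the old backend can be retired once every object has been migrated, without changing client code.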

6
Implicit Concepts for Persistent Archive
  • Infrastructure independence
  • Data set access
  • Authentication
  • Collection management
  • Presentation
  • Non-proprietary formatting
  • Information models
  • XML - Information markup language
  • GML - Graphics markup language
  • Support for ingestion, management, access
  • Accessioning workbench, archive, access workbench

7
Standard Information Markup Language
  • XML representation of metadata attributes
  • Standardization of DTDs - MOA II DTD for text
  • Standardization of markup language
  • XML based representation of collection structure
  • Attributes defining the physical layout of a
    schema into relational tables (foreign keys,
    attribute data types, ...)
  • XML databases - XML organized data collections
  • Commercial systems - Excelon, TAMINO, Oracle8i, ...
  • XML based Topic Maps
  • Represent relationships between collection domain
    concepts, collection attributes
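
As a rough illustration of an XML representation of metadata attributes, the sketch below uses Python's standard `xml.etree.ElementTree`. The element names `object` and `attribute` are assumptions made for this example; they are not taken from the MOA II DTD or any other standard mentioned above.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def attributes_to_xml(object_id: str, attributes: Dict[str, str]) -> str:
    """Serialize metadata attributes as XML (element names are illustrative)."""
    root = ET.Element("object", {"id": object_id})
    for name, value in attributes.items():
        child = ET.SubElement(root, "attribute", {"name": name})
        child.text = value
    return ET.tostring(root, encoding="unicode")


def xml_to_attributes(xml_text: str) -> Dict[str, str]:
    """Recover the attribute dictionary from the XML form."""
    root = ET.fromstring(xml_text)
    return {a.get("name"): (a.text or "") for a in root.findall("attribute")}
```

The round trip (`xml_to_attributes(attributes_to_xml(...))`) returns the original dictionary, which is the property a non-proprietary metadata format needs in order to survive migration.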

8
E-mail Collection
  • Test of the scalability of the technology
  • Archived a one-million record E-mail collection
    (1999)
  • Ingestion
  • Tagged E-mail using XML syntax (6 required,
    13 optional, 1000 user-defined tags)
  • Created description of the collection
  • Aggregated E-mail into containers, stored in an
    archive
  • Retrieved collection description, created
    database, and optimized for query
  • Total time was 27 hours (used 10 Mbit/sec
    Ethernet)
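
A minimal sketch of the tagging step, assuming single-part plain-text messages. The slides do not list the actual required tags, so the four headers used here are placeholders. The second function reproduces the arithmetic behind the reported run: one million records in 27 hours averages about 10.3 records per second end to end.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

# Assumed tag set for illustration; the slides do not list the actual tags.
ASSUMED_REQUIRED_HEADERS = ("From", "To", "Date", "Subject")


def tag_email(raw_message: str) -> str:
    """Wrap selected headers and the body of one message in XML tags."""
    msg = message_from_string(raw_message)
    root = ET.Element("email")
    for header in ASSUMED_REQUIRED_HEADERS:
        value = msg.get(header)
        if value is None:
            raise ValueError(f"missing required header: {header}")
        ET.SubElement(root, header.lower()).text = value
    # Assumes a single-part message, so the payload is a plain string.
    ET.SubElement(root, "body").text = msg.get_payload()
    return ET.tostring(root, encoding="unicode")


def records_per_second(records: int, hours: float) -> float:
    """Average ingestion rate over the whole run."""
    return records / (hours * 3600.0)
```

Note that the 27 hours covers the whole cycle listed above (tagging, archiving, retrieval, database creation), so the per-second figure is an overall average, not the speed of any single step.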

9
(No Transcript)
10
What Types of Interoperability are Needed?
  • Data management (digital objects)
  • Ability to work with multiple types of storage
    systems, across separate administration domains
  • Information management (attributes)
  • Ability to define a collection independent of
    database choice
  • Ability to migrate collection onto new databases
  • Knowledge management (relationships)
  • Ability to manage relationships
  • Ability to map domain concepts to collection
    attributes
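
One way to picture "a collection independent of database choice" is to keep the collection definition as plain data and generate the database-specific form from it. The sketch below is an assumption-laden illustration: the collection contents, the concept map, and the function names are invented, and SQLite is used only because it ships with Python.

```python
import sqlite3
from typing import Dict, List

# Database-neutral collection description (hypothetical example content).
COLLECTION = {
    "name": "email",
    "attributes": {"sender": "TEXT", "subject": "TEXT", "sent": "TEXT"},
}

# Knowledge level: domain concepts mapped onto collection attributes.
CONCEPT_MAP: Dict[str, List[str]] = {
    "correspondent": ["sender"],
    "message topic": ["subject"],
}


def create_table_sql(collection: dict) -> str:
    """Render the neutral description as a generic CREATE TABLE statement."""
    columns = ", ".join(f"{n} {t}" for n, t in collection["attributes"].items())
    return f"CREATE TABLE {collection['name']} ({columns})"


def instantiate(collection: dict) -> sqlite3.Connection:
    """Instantiate the collection on one database (SQLite, for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(create_table_sql(collection))
    return conn
```

Migrating to a new database then means re-running `instantiate` against a different driver and reloading the data; the `COLLECTION` and `CONCEPT_MAP` descriptions themselves do not change.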

11
Simplest Definitions
  • Data
  • Digital object
  • Objects are streams of bits
  • Information
  • Any tagged data, which is treated as an
    attribute.
  • Attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • Relationships between attributes
  • Relationships can be procedural/temporal,
    structural/spatial, logical/semantic, functional
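
The three definitions can be written down as data structures. This is a sketch under the definitions on this slide only; the class and field names are chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Attribute:
    """Information: tagged data, i.e. a (tag, value) pair."""
    tag: str
    value: str


@dataclass
class DigitalObject:
    """Data: a stream of bits, plus the attributes associated with it."""
    bits: bytes
    attributes: List[Attribute] = field(default_factory=list)


@dataclass
class Relationship:
    """Knowledge: a typed relationship between two attribute tags."""
    kind: str  # e.g. "logical", "temporal", "spatial", "functional"
    source_tag: str
    target_tag: str
```

For example, `Relationship("logical", "author", "creator")` records that two differently named attributes carry the same meaning, without touching the underlying bits.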

12
Types of Knowledge Relationships
  • Logical / semantic
  • Digital Library cross-walks
  • Temporal / procedural
  • Workflow systems
  • Spatial / structural
  • GIS systems
  • Functional / algorithmic
  • Scientific feature analysis
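
A cross-walk is the simplest of these relationship types to show in code: a mapping from the attribute names of one vocabulary to those of another. The vocabulary below is made up for the example.

```python
from typing import Dict

# Hypothetical crosswalk between two metadata vocabularies.
CROSSWALK: Dict[str, str] = {"author": "creator", "sent": "date", "subject": "title"}


def apply_crosswalk(record: Dict[str, str], crosswalk: Dict[str, str]) -> Dict[str, str]:
    """Rename attributes via the crosswalk; unmapped attributes are kept as-is."""
    return {crosswalk.get(name, name): value for name, value in record.items()}
```

Keeping the cross-walk as data, separate from the records, is what allows it to be archived and migrated alongside the collection.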

13
(No Transcript)
14
Data Archive
  • Diagram (labels only): Ingest Services; Management;
    Access Services; Access platform; Data repositories;
    Ingestion platform; Interoperability Standards;
    Interoperability Protocols
15
Collection Based Persistent Archive
  • Diagram (labels only): Ingest Services; Management;
    Access Services; Information Repository;
    Attribute-based Query; Attributes Semantics; SDLIP;
    Information; XML DTD; (Data Handling System - SRB /
    FTP / HTTP); Data; Fields Containers Folders;
    Storage (Replicas, Persistent IDs); Grids;
    Feature-based Query; MCAT/HDF
16
Knowledge Based Persistent Archive
  • Diagram (labels only): Ingest Services; Management;
    Access Services; Knowledge or Topic-Based Query /
    Browse; Knowledge Repository for Rules;
    Relationships Between Concepts; Knowledge; XTM DTD;
    Rules - KQL; (Topic Maps / Buckets / Model-based
    Access); Information Repository; Attribute-based
    Query; Attributes Semantics; SDLIP; Information;
    XML DTD; (Data Handling System - SRB / FTP / HTTP);
    Data; Fields Containers Folders; Storage (Replicas,
    Persistent IDs); Grids; Feature-based Query;
    MCAT/HDF
17
Ingestion Processes for Collection Creation
  • Diagram (labels only): Accession Template; Closure
    Concept/Attribute; Attribute Inverse Indexing;
    Information Generation; Knowledge Generation;
    Attribute Tagging; Attribute Selection; Occurrence
    Tagging; View Management; Data Organization;
    Collection
18
Examples of Implied Knowledge: Senate Legislative
Activities
  • Structural knowledge
  • Pertinent information embedded in document
    headers
  • Procedural knowledge
  • Naming convention
  • Senator represented by last name
  • Senator represented by last name and state
  • Senator represented by last name, first name, and
    state
  • Collection knowledge
  • Referenced senators include senators no longer in
    the senate
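
The naming-convention problem can be sketched as a resolver that maps each of the three reference styles onto one canonical record. Everything concrete here is assumed for illustration: the roster is made up, and the textual forms "Last", "Last (ST)", and "Last, First (ST)" are a guess at how such references might look, not the format of the actual collection.

```python
import re
from typing import List, NamedTuple, Optional


class Senator(NamedTuple):
    last: str
    first: str
    state: str


# Made-up roster for illustration; a real one would also need former senators.
ROSTER = [
    Senator("Smith", "Ann", "OR"),
    Senator("Smith", "Bob", "NH"),
    Senator("Jones", "Carol", "TX"),
]

# Assumed textual forms: "Last", "Last (ST)", "Last, First (ST)".
REFERENCE = re.compile(
    r"^(?P<last>[A-Za-z'-]+)"
    r"(?:,\s*(?P<first>[A-Za-z'-]+))?"
    r"(?:\s*\((?P<state>[A-Z]{2})\))?$"
)


def resolve(reference: str, roster: List[Senator] = ROSTER) -> Optional[Senator]:
    """Return the unique senator matching a reference, or None if the
    reference is malformed, unknown, or ambiguous."""
    m = REFERENCE.match(reference.strip())
    if m is None:
        return None
    matches = [
        s for s in roster
        if s.last == m["last"]
        and (m["first"] is None or s.first == m["first"])
        and (m["state"] is None or s.state == m["state"])
    ]
    return matches[0] if len(matches) == 1 else None
```

A bare last name resolves only when it is unique in the roster, which is why the state and first name are needed in some references, and why the roster has to include senators who have since left the Senate.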

19
Knowledge Generation
  • Accessioning Template
  • Defines the concepts under which the data objects
    will be tagged and organized
  • Attribute selection
  • Define the attributes that represent the
    information content associated with the domain
    concepts
  • Tag attributes using minimal constraint language,
    such as XML or XMLSchema
  • Evaluate closure of mined attributes compared to
    expected attributes
  • Refine concept map
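
The closure evaluation can be read as a set comparison between the attributes the accessioning template expects and the attributes actually mined from the data. That reading is an assumption, as are the function names below.

```python
from typing import Dict, Iterable, Set


def closure_report(expected: Iterable[str], mined: Iterable[str]) -> Dict[str, Set[str]]:
    """Compare attributes found in the data with those the template expects."""
    expected_set, mined_set = set(expected), set(mined)
    return {
        "missing": expected_set - mined_set,     # expected but never found
        "unexpected": mined_set - expected_set,  # found but not in the template
    }


def is_closed(report: Dict[str, Set[str]]) -> bool:
    """The attribute set is closed when both differences are empty."""
    return not report["missing"] and not report["unexpected"]
```

A non-empty "unexpected" set is the signal to refine the concept map; a non-empty "missing" set suggests the tagging step did not capture everything the template calls for.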

20
Information Generation
  • Create occurrence index
  • (Occurrence, attribute, value)
  • This is needed to be able to recreate original
    form of digital object
  • Analyze completeness of information
  • Inverse index of attribute values
  • Identifies unexpected values - consistency
  • Analyze closure of collection
  • Are additional attributes needed to represent
    inverse index value ranges?
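
A small sketch of the two indexes, assuming an occurrence is an integer position within the digital object (the slides do not define it further). The inverse index groups occurrences by attribute value, which makes out-of-range values easy to spot.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Occurrence = Tuple[int, str, str]  # (occurrence, attribute, value)


def inverse_index(triples: List[Occurrence]) -> Dict[str, Dict[str, List[int]]]:
    """Map attribute -> value -> occurrences where that value appears."""
    index: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in triples:
        index[attribute][value].append(occurrence)
    return index


def unexpected_values(index, allowed: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Report values outside the allowed set for each constrained attribute."""
    report = {}
    for attribute, permitted in allowed.items():
        extra = set(index.get(attribute, {})) - permitted
        if extra:
            report[attribute] = extra
    return report
```

Because every triple keeps its occurrence, sorting the triples by occurrence gives back the original ordering of tagged content, which is what makes recreating the original form possible.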

21
Data Organization
  • Archive preferred views of collection
  • Original data
  • XML tagged representation
  • Minimal representation of consolidated
    information
  • Noise-free version based upon occurrence tags
  • Object-relational database version
  • Archive occurrence tagged view
  • Archive ingestion procedures that transform
    collection from the original digital objects to
    the preferred views
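
As an illustration of archiving the procedures that derive views, the sketch below produces two views from (occurrence, attribute, value) triples. The interpretation of the "minimal representation" as distinct values per attribute is an assumption, and both function names are invented for this example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

Occurrence = Tuple[int, str, str]  # (occurrence, attribute, value)


def xml_view(triples: List[Occurrence]) -> str:
    """XML tagged view: one element per triple, in occurrence order."""
    root = ET.Element("object")
    for _, attribute, value in sorted(triples):
        ET.SubElement(root, attribute).text = value
    return ET.tostring(root, encoding="unicode")


def consolidated_view(triples: List[Occurrence]) -> Dict[str, List[str]]:
    """Assumed reading of the minimal view: distinct values per attribute."""
    view: Dict[str, List[str]] = {}
    for _, attribute, value in sorted(triples):
        values = view.setdefault(attribute, [])
        if value not in values:
            values.append(value)
    return view
```

Storing functions like these with the collection means a preferred view can be regenerated from the archived occurrence-tagged form instead of being kept only as a derived copy.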

22
Information Management Projects
  • Digital Libraries
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • Digital Embryo digital library - GMU
  • NPACI Digital Sky - Caltech 2MASS sky survey
  • CDL - AMICO
  • NSF NSDL - UCAR / DLESE
  • Grid Environments
  • NASA Information Power Grid - NASA Ames
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NSF Grid Physics Network - U Fl
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Scalable archives

23
Further Information
http://www.npaci.edu/DICE