Title: Presented at the NPACI All Hands Meeting
1Knowledge-Based Persistent Archives
Reagan W. Moore
San Diego Supercomputer Center
9500 Gilman Drive, La Jolla, CA 92093-0505
Phone 858 534-5073
FAX 858 534-5152
E-mail moore_at_sdsc.edu
2Data Intensive Computing Environment
- Staff
- Reagan Moore
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Students - GSRA
- Martin Kuhl
- Liying Sui
- Yang Yu
- Valter Crescenzi
- Students - Undergrad Interns
- Peter Shin
- Roman Olshanowsky
- Shabbar Tambawala
- Pratik Mukhopadhyay
- /- NN
3Topics
- Persistent archive functionality
- Characterization of
- Data / Information / Knowledge
- Integration of Digital Library, Grid
environments, and Persistent Archives
4Persistent Archive
- Manage digital objects for the life of the
republic
- Maintain the ability to discover and access digital
objects while the supporting hardware and software
systems evolve
5Fundamental Concept for a Persistent Archive
- Persistence requires migration over time onto new
technology
- While the migration occurs, a persistent archive
must be able to interoperate with both the old
technology and the new technology.
- A persistent archive is an interoperability
system.
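The migration idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Store` interface, `MemoryStore`, and `MigratingArchive` are hypothetical names invented here, not part of any system named in the slides. The sketch shows an archive that keeps answering reads from both the old and the new storage technology while objects are copied across.

```python
from typing import Dict, Protocol


class Store(Protocol):
    """Minimal storage interface (hypothetical, for illustration only)."""

    def put(self, object_id: str, data: bytes) -> None: ...
    def get(self, object_id: str) -> bytes: ...
    def has(self, object_id: str) -> bool: ...


class MemoryStore:
    """Stand-in for one storage technology; keeps objects in a dict."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, object_id: str, data: bytes) -> None:
        self._objects[object_id] = data

    def get(self, object_id: str) -> bytes:
        return self._objects[object_id]

    def has(self, object_id: str) -> bool:
        return object_id in self._objects


class MigratingArchive:
    """Serves reads from either store while objects move from old to new."""

    def __init__(self, old: Store, new: Store) -> None:
        self.old = old
        self.new = new

    def get(self, object_id: str) -> bytes:
        # Prefer the new technology, fall back to the old one.
        if self.new.has(object_id):
            return self.new.get(object_id)
        return self.old.get(object_id)

    def migrate(self, object_id: str) -> None:
        # Copy one object; the old copy stays readable throughout.
        if not self.new.has(object_id):
            self.new.put(object_id, self.old.get(object_id))


old_store, new_store = MemoryStore(), MemoryStore()
old_store.put("doc-1", b"first record")
old_store.put("doc-2", b"second record")

archive = MigratingArchive(old_store, new_store)
archive.migrate("doc-1")  # doc-2 has not been migrated yet
```

After the partial migration, `archive.get` still returns both objects, which is the interoperability property the slide describes.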
6Implicit Concepts for Persistent Archive
- Infrastructure independence
- Data set access
- Authentication
- Collection management
- Presentation
- Non-proprietary formatting
- Information models
- XML - Information markup language
- GML - Graphics markup language
- Support for ingestion, management, access
- Accessioning workbench, archive, access workbench
7Standard Information Markup Language
- XML representation of metadata attributes
- Standardization of DTDs - MOA II DTD for text
- Standardization of markup language
- XML based representation of collection structure
- Attributes defining the physical layout of a
schema into relational tables (foreign keys,
attribute data types, ...)
- XML databases - XML organized data collections
- Commercial systems - Excelon, TAMINO, Oracle8i, ...
- XML based Topic Maps
- Represent relationships between collection domain
concepts and collection attributes
8E-mail Collection
- Test of the scalability of the technology
- Archived a one-million record E-mail collection
(1999)
- Ingestion
- Tagged E-mail using XML syntax (6 required,
13 optional, 1000 user-defined tags)
- Created description of the collection
- Aggregated E-mail into containers, stored in an
archive
- Retrieved collection description, created
database, and optimized for query
- Total time was 27 hours (used 10 Mbit/sec
Ethernet)
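As a rough illustration of the tagging step, the sketch below wraps the headers and body of one message in XML elements using the Python standard library. The element names (`sender`, `recipient`, `subject`, `date`, `body`) and the sample message are assumptions made for this example; the slides do not list the actual required or optional tags.

```python
import xml.etree.ElementTree as ET
from email import message_from_string

RAW_MESSAGE = """\
From: alice@example.org
To: bob@example.org
Subject: Archive test
Date: Mon, 01 Mar 1999 10:00:00 -0800

Body text of the message.
"""

# Hypothetical header-to-tag mapping; the real tag set is not given in the slides.
HEADER_TAGS = {"From": "sender", "To": "recipient", "Subject": "subject", "Date": "date"}


def tag_email(raw: str) -> ET.Element:
    """Wrap selected headers and the body of one message in XML elements."""
    msg = message_from_string(raw)
    root = ET.Element("email")
    for header, tag in HEADER_TAGS.items():
        value = msg.get(header)
        if value is not None:
            ET.SubElement(root, tag).text = value
    ET.SubElement(root, "body").text = msg.get_payload().strip()
    return root


record = tag_email(RAW_MESSAGE)
xml_text = ET.tostring(record, encoding="unicode")
```

For scale, one million records in 27 hours averages roughly 10 records per second across all of the steps listed above.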
10What Types of Interoperability are Needed?
- Data management (digital objects)
- Ability to work with multiple types of storage
systems, across separate administration domains
- Information management (attributes)
- Ability to define a collection independent of
database choice
- Ability to migrate collection onto new databases
- Knowledge management (relationships)
- Ability to manage relationships
- Ability to map domain concepts to collection
attributes
11Simplest Definitions
- Data
- Digital object
- Objects are streams of bits
- Information
- Any tagged data, which is treated as an
attribute.
- Attributes may be tagged data within the digital
object, or tagged data that is associated with
the digital object
- Knowledge
- Relationships between attributes
- Relationships can be procedural/temporal,
structural/spatial, logical/semantic, functional
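One way to picture the three definitions is as three small record types. The sketch below is an assumption-laden illustration, not a schema from the presentation: the class names, field names, and sample values are all invented here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class DigitalObject:
    """Data: an uninterpreted stream of bits."""
    object_id: str
    content: bytes


@dataclass
class Attribute:
    """Information: a tagged value associated with a digital object."""
    object_id: str
    tag: str
    value: str


@dataclass
class Relationship:
    """Knowledge: a typed link between two attributes, named by tag."""
    kind: str  # e.g. "temporal", "spatial", "semantic", "functional"
    source_tag: str
    target_tag: str


obj = DigitalObject("img-001", b"\x89PNG...")
attrs = [
    Attribute("img-001", "date_created", "1999-03-01"),
    Attribute("img-001", "date_archived", "1999-03-02"),
]
knowledge = [Relationship("temporal", "date_created", "date_archived")]


def attributes_of(object_id: str, attributes: List[Attribute]) -> Dict[str, str]:
    """Collect the tag -> value pairs recorded for one object."""
    return {a.tag: a.value for a in attributes if a.object_id == object_id}
```

The digital object itself carries no meaning; the attributes describe it, and the relationship records that one attribute precedes the other in time.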
12Types of Knowledge Relationships
- Logical / semantic
- Digital Library cross-walks
- Temporal / procedural
- Workflow systems
- Spatial / structural
- GIS systems
- Functional / algorithmic
- Scientific feature analysis
14Data Archive
(Diagram. Labels: Ingest Services; Management; Access Services; Access platform; Data repositories; Ingestion platform; Interoperability Standards; Interoperability Protocols)
15Collection Based Persistent Archive
(Diagram. Labels: Ingest Services; Management; Access Services; Information Repository; Attribute-based Query; Attributes Semantics; SDLIP; Information; XML DTD; Data Handling System - SRB / FTP / HTTP; Data; Fields Containers Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
16Knowledge Based Persistent Archive
(Diagram. Labels: Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; Knowledge; XTM DTD; Rules - KQL; Topic Maps / Buckets / Model-based Access; Information Repository; Attribute-based Query; Attributes Semantics; SDLIP; Information; XML DTD; Data Handling System - SRB / FTP / HTTP; Data; Fields Containers Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF)
17Ingestion Processes for Collection Creation
(Diagram. Labels: Accession Template; Closure Concept/Attribute; Attribute Inverse Indexing; Information Generation; Knowledge Generation; Attribute Tagging; Attribute Selection; Occurrence Tagging; View Management; Data Organization; Collection)
18Examples of Implied Knowledge: Senate Legislative
Activities
- Structural knowledge
- Pertinent information embedded in document
headers
- Procedural knowledge
- Naming convention
- Senator represented by last name
- Senator represented by last name and state
- Senator represented by last name, first name, and
state
- Collection knowledge
- Referenced senators include senators no longer in
the senate
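The naming convention above implies a rule that is not written in the documents themselves: a shorter form of a name is enough only while it stays unambiguous. A minimal sketch of that rule follows. The roster and the `resolve` helper are hypothetical, and the names are invented.

```python
from typing import List, Optional, Tuple

# Hypothetical roster: (last name, first name, state). Names are invented.
ROSTER: List[Tuple[str, str, str]] = [
    ("Smith", "Ann", "CA"),
    ("Smith", "Bob", "OR"),
    ("Jones", "Carol", "TX"),
]


def resolve(last: str, state: Optional[str] = None,
            first: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Return every roster entry consistent with the name parts supplied."""
    matches = []
    for entry in ROSTER:
        e_last, e_first, e_state = entry
        if e_last != last:
            continue
        if state is not None and e_state != state:
            continue
        if first is not None and e_first != first:
            continue
        matches.append(entry)
    return matches
```

Here `resolve("Jones")` finds one entry, `resolve("Smith")` finds two and needs the state to disambiguate, and a name absent from the roster returns an empty list, which is the case a collection must handle when documents reference senators no longer in the senate.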
19Knowledge Generation
- Accessioning Template
- Defines the concepts under which the data objects
will be tagged and organized
- Attribute selection
- Define the attributes that represent the
information content associated with the domain
concepts
- Tag attributes using minimal constraint language,
such as XML or XMLSchema
- Evaluate closure of mined attributes compared to
expected attributes
- Refine concept map
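The closure evaluation can be read as a set comparison between the attributes the template expects and the attributes actually mined from the collection. The sketch below makes that reading explicit; it is an interpretation rather than the procedure used in the project, and the function name and attribute names are assumptions.

```python
from typing import Dict, Iterable, Set


def closure_report(expected: Set[str], mined: Iterable[str]) -> Dict[str, Set[str]]:
    """Compare mined attribute names against the expected set."""
    mined_set = set(mined)
    return {
        "matched": expected & mined_set,
        "missing": expected - mined_set,      # expected but never found
        "unexpected": mined_set - expected,   # found but not in the template
    }


# Hypothetical attribute names, for illustration only.
expected_attributes = {"sender", "recipient", "subject", "date"}
mined_attributes = ["sender", "subject", "date", "x-priority"]
report = closure_report(expected_attributes, mined_attributes)
```

A non-empty `unexpected` set suggests the concept map needs refining, while a non-empty `missing` set suggests the template expects information the collection does not contain.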
20Information Generation
- Create occurrence index
- (Occurrence, attribute, value)
- This is needed to be able to recreate original
form of digital object
- Analyze completeness of information
- Inverse index of attribute values
- Identifies unexpected values - consistency
- Analyze closure of collection
- Are additional attributes needed to represent
inverse index value ranges?
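A small sketch may help relate the two indexes. It assumes that an occurrence is simply the position of a tagged value within the original object, which the slides do not specify. The sample record and function names are invented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (occurrence, attribute, value); occurrence is assumed to be the position
# of the value in the original object. The sample record is invented.
Occurrence = Tuple[int, str, str]

occurrence_index: List[Occurrence] = [
    (0, "sender", "alice@example.org"),
    (1, "recipient", "bob@example.org"),
    (2, "subject", "Archive test"),
    (3, "recipient", "carol@example.org"),
]


def rebuild(index: List[Occurrence]) -> List[Tuple[str, str]]:
    """Recover the original ordering of tagged values from the index."""
    return [(attr, value) for _, attr, value in sorted(index)]


def inverse_index(index: List[Occurrence]) -> Dict[str, Dict[str, List[int]]]:
    """Map attribute -> value -> list of occurrences."""
    inv: Dict[str, Dict[str, List[int]]] = defaultdict(lambda: defaultdict(list))
    for occ, attr, value in index:
        inv[attr][value].append(occ)
    return inv


inv = inverse_index(occurrence_index)
```

Listing the keys of `inv["recipient"]` shows every value seen for that attribute, which is where unexpected values and gaps in the expected value ranges become visible.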
21Data Organization
- Archive preferred views of collection
- Original data
- XML tagged representation
- Minimal representation of consolidated
information
- Noise-free version based upon occurrence tags
- Object-relational database version
- Archive occurrence tagged view
- Archive ingestion procedures that transform
collection from the original digital objects to
the preferred views
22Information Management Projects
- Digital Libraries
- NSF Digital Library Initiative, Phase II - UCSB,
Stanford
- Digital Embryo digital library - GMU
- NPACI Digital Sky - Caltech 2MASS sky survey
- CDL - AMICO
- NSF NSDL - UCAR / DLESE
- Grid Environments
- NASA Information Power Grid - NASA Ames
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford,
Caltech
- NSF Grid Physics Network - U Fl
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Scalable archives
23Further Information
http://www.npaci.edu/DICE