Title: Knowledge-Based Persistent Archives
1Knowledge-Based Persistent Archives
- Richard Marciano
- marciano_at_sdsc.edu
- San Diego Supercomputer Center
- University of California, San Diego
2Data Intensive Computing Environment
- Staff
- Reagan Moore
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Amarnath Gupta
- George Kremenek
- Bertram Ludäscher
- Richard Marciano
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Bing Zhu
- Students - GSRA
- Martin Kuhl
- Liying Sui
- Yang Yu
- Valter Crescenzi (Italy)
- Students - Undergrad Interns
- Peter Shin
- Roman Olshanowsky
- Shabbar Tambawala
- Pratik Mukhopadhyay
- Michelle Schumaker
3Digital Archives
- Problem
- How to achieve long-term preservation of information (for the archivist: records) and sustained access?
- Challenges and Opportunities
- fight archive obsolescence in the presence of rapidly changing storage, data formats, software environment, hardware, ...
- Approaches
- Time out (do nothing: assume hardware, software, data formats, etc. all work 400 years from now ...)
- Emulation (emulate hardware and software infrastructure)
- Migration (migrate to new infrastructure)
- Standardize data formats
4Archival Example: Senate Collection
- is maybe NOT what you get (a not-so-well-documented format)
5Senate Collection Example
- Rich Text Format (a documented Microsoft format)
\pard\par
\pard\b S. 345\b0\par
\pard\qr DATE INTRODUCED 02/03/1999\par
\pard SPONSOR Allard\par
\i\qc OFFICIAL TITLE\i0\par
\pard A bill to amend the Animal Welfare Act to remove the limitation that permits \ interstate movement of live birds, for the purpose of fighting, to States in which \ animal fighting is lawful.\par
\i\qc LATEST STATUS\i0\par\pard
\pard\plain \fi-1900\li1900\nowidctlpar\adjustright Feb 3, 1999\tab Read twice and\ referred to the Committee on Agriculture.\par
\pard
S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
6Senate Collection Example
- the XML can be lifted from the presentation level
S. 345 bold"off"DATE INTRODUCED 02/03/1999 bold"off"SPONSOR Allard bold"off" italic"off"OFFICIAL TITLE bold"off" italic"off"A bill to amend the
Animal Welfare Act to remove the lim\ itation
that permits interstate movement of live birds,
for the purpose of fighting\ , to States in which
animal fighting is lawful. bold"off" italic"off"LATEST STATUS
Feb 3, 1999tabRead twice and
referred to the Committee on Agriculture\ .g
SENATE AGRICULTURE
02/03/1999
Feb 3, 1999
Read twice and referred to the Committee on Agriculture
A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.
Allard, Wayne CO
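The element tags of the information-level record did not survive in the text above, so the following is only a minimal Python sketch of what the tagged record might look like. Element names are taken from the DTD on the next slide; the nesting, the assignment of each value to an element, and the helper name `build_bill` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_bill(name, committee, date_introduced, status_date, status_text,
               title, sponsor):
    """Build one <bill> element (hypothetical helper; nesting is assumed)."""
    bill = ET.Element("bill", {"bill_name": name})
    committees = ET.SubElement(bill, "committees")
    ET.SubElement(committees, "committee").text = committee
    ET.SubElement(bill, "date_introduced").text = date_introduced
    status_list = ET.SubElement(bill, "latest_status_list")
    status = ET.SubElement(status_list, "latest_status")
    ET.SubElement(status, "ls_date").text = status_date
    ET.SubElement(status, "ls_txt").text = status_text
    ET.SubElement(bill, "official_title").text = title
    ET.SubElement(bill, "sponsor").text = sponsor
    return bill


# Values copied from the Senate example; only the tagging is assumed.
bill = build_bill(
    name="S. 345",
    committee="SENATE AGRICULTURE",
    date_introduced="02/03/1999",
    status_date="Feb 3, 1999",
    status_text="Read twice and referred to the Committee on Agriculture",
    title=("A bill to amend the Animal Welfare Act to remove the limitation "
           "that permits interstate movement of live birds, for the purpose "
           "of fighting, to States in which animal fighting is lawful."),
    sponsor="Allard, Wayne CO",
)
xml_text = ET.tostring(bill, encoding="unicode")
print(xml_text)
```

The point of the sketch is that none of the RTF presentation controls (bold, italics, tabs) appear in the result: only content and its tags are kept.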
7XML as an Archival Format
- Information level schema as an XML DTD
bills (bill)
bill (committees?, congressional_record?, cosponsors?, date_introduced?, digest?, latest_status_list?, official_title?, sponsor?, statement_of_purpose?, submitted_by?, submitted_for?)
bill_name CDATA #REQUIRED
(committee) (cosponsor)
latest_status_list (latest_status)
latest_status (ls_date, ls_txt)
abstract (#PCDATA) (#PCDATA) (#PCDATA)
co_name (#PCDATA) CDATA #IMPLIED (#PCDATA) (#PCDATA) (#PCDATA)
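As a rough illustration of what the content model for `bill` requires, here is a minimal Python sketch that checks one element against it. It is not a DTD validator (the standard-library `xml.etree.ElementTree` module does not validate against DTDs); it only checks the required `bill_name` attribute and that each optional child appears at most once and in the listed order. The function name `check_bill` and the `note` element in the failing sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Child order as listed in the content model for <bill>; each child is optional ("?").
BILL_CHILDREN = [
    "committees", "congressional_record", "cosponsors", "date_introduced",
    "digest", "latest_status_list", "official_title", "sponsor",
    "statement_of_purpose", "submitted_by", "submitted_for",
]


def check_bill(bill):
    """Return a list of problems; an empty list means no problem was found."""
    problems = []
    if bill.get("bill_name") is None:
        problems.append("missing required attribute bill_name")
    last = -1  # position of the furthest child seen so far
    for child in bill:
        if child.tag not in BILL_CHILDREN:
            problems.append(f"unexpected element <{child.tag}>")
            continue
        pos = BILL_CHILDREN.index(child.tag)
        if pos <= last:
            problems.append(f"<{child.tag}> is repeated or out of order")
        last = max(last, pos)
    return problems


good = ET.fromstring(
    '<bill bill_name="S. 345">'
    "<date_introduced>02/03/1999</date_introduced>"
    "<sponsor>Allard</sponsor>"
    "</bill>"
)
bad = ET.fromstring("<bill><sponsor>Allard</sponsor><digest/><note/></bill>")

print(check_bill(good))
print(check_bill(bad))
```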
8From XML-Based to Knowledge-Based Archives
- Collection-based archival with XML: save data "as is" plus ...
- ... separate content from presentation
- ... tag your data (take a lift in the info hierarchy)
- ... use a self-describing, semistructured data format (XML)
- Knowledge-based archival: now add ...
- ... conceptual level information
- ... integrity constraints
- ... explanations/derivation rules
- archiving only results y = f(x) vs. archiving the rules/function "f" (e.g. f = the Florida procedure ...)
- employ knowledge representation languages
9 Knowledge-Based Archival Senate Example
- Data provider says
- Please archive all records of legislative activities of the 106th senate!
- Integrity constraints, e.g.
- (1) senators_with_file = UNION (sponsor, cosponsors, submitted_by)
- (2) senators sponsors co-sponsors
- Violation
- the rhs is a SUPERSET of the lhs!
- Exceptions
- (Chafee, John), (Gramm, Phil), (Miller, Zell)
- (Possible) Explanations
- senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
- Checking ICs
- IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
- IF condition THEN action
- Action: LOG, WARN, ABORT, ...
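The IF condition THEN action rule above can be sketched in Python as follows. This is a minimal illustration, not the archive's rule language: the names `Action`, `ArchiveAbort`, and `check_sponsors` are hypothetical, and the senator roster is toy data (the three exceptions come from the slide, "Allard, Wayne" from the bill example).

```python
from enum import Enum


class Action(Enum):
    LOG = "log"
    WARN = "warn"
    ABORT = "abort"


class ArchiveAbort(Exception):
    """Raised when a violated constraint is configured to stop ingestion."""


def check_sponsors(senators, sponsors, action=Action.LOG):
    """IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))."""
    exception_log = []
    for name in sorted(set(sponsors) - set(senators)):
        if action is Action.ABORT:
            raise ArchiveAbort(f"no senator record for {name}")
        if action is Action.WARN:
            print(f"warning: no senator record for {name}")
        exception_log.append(("missing_senator_info", name))
    return exception_log


# Toy data: senators with a file on the left, everyone named as sponsor on the right.
senators = {"Allard, Wayne"}
sponsors = {"Allard, Wayne", "Chafee, John", "Gramm, Phil", "Miller, Zell"}

log = check_sponsors(senators, sponsors)
for entry in log:
    print(entry)
```

Separating the condition (the set difference) from the action (LOG, WARN, ABORT) keeps the same check reusable with a stricter or more lenient policy per collection.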
10Persistent Archive
- Manage digital objects for the life of the republic
- Maintain ability to discover and access digital objects while the supporting hardware and software systems evolve
11Fundamental Concept for a Persistent Archive
- Persistence requires migration over time onto new technology
- While the migration occurs, a persistent archive must be able to interoperate with both the old technology and the new technology.
- A persistent archive is an interoperability system.
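The idea that an archive must interoperate with old and new technology during a migration can be sketched as a thin layer over two stores. This is a minimal Python illustration under stated assumptions: `DictStore` and `MigratingArchive` are hypothetical names, the stores are in-memory stand-ins for real storage systems, and the store labels are invented.

```python
class DictStore:
    """In-memory stand-in for one storage technology (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.objects = {}

    def get(self, object_id):
        return self.objects.get(object_id)

    def put(self, object_id, data):
        self.objects[object_id] = data


class MigratingArchive:
    """Serves reads from the new store when possible, else from the old one."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def read(self, object_id):
        for store in (self.new, self.old):
            data = store.get(object_id)
            if data is not None:
                return data
        raise KeyError(object_id)

    def migrate(self, object_id):
        data = self.old.get(object_id)
        if data is None:
            raise KeyError(object_id)
        # The old copy is kept; retiring it is a separate decision.
        self.new.put(object_id, data)

    def migrate_all(self):
        for object_id in list(self.old.objects):
            self.migrate(object_id)


old, new = DictStore("old-technology"), DictStore("new-technology")
old.put("bill/S.345", b"A bill to amend the Animal Welfare Act")
archive = MigratingArchive(old, new)

before = archive.read("bill/S.345")  # served by the old store
archive.migrate_all()
after = archive.read("bill/S.345")   # now served by the new store
print(before == after)
```

Callers use the same `read` call before, during, and after the migration, which is the sense in which the archive acts as an interoperability system.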
12Implicit Concepts for Persistent Archive
- Infrastructure independence
- Data set access
- Authentication
- Collection management
- Presentation
- Non-proprietary formatting
- Information models
- XML - Information markup language
- GML - Graphics markup language
- Support for ingestion, management, access
- Accessioning workbench, archive, access workbench
13Standard Information Markup Language
- XML representation of metadata attributes
- Standardization of DTDs - MOA II DTD for text
- Standardization of markup language
- XML based representation of collection structure
- Attributes defining the physical layout of a schema into relational tables (foreign keys, attribute data types, ...)
- XML databases - XML organized data collections
- Commercial systems - Excelon, TAMINO, Oracle8i, ...
- XML based Topic Maps
- Represent relationships between collection domain concepts, collection attributes
14What Types of Interoperability are Needed?
- Data management (digital objects)
- Ability to work with multiple types of storage systems, across separate administration domains
- Information management (attributes)
- Ability to define a collection independent of database choice
- Ability to migrate collection onto new databases
- Knowledge management (relationships)
- Ability to manage relationships
- Ability to map domain concepts to collection attributes
15Data Archive
- Diagram (labels only): Ingest Services; Management; Access Services; Access platform; Data repositories; Ingestion platform; Interoperability Standards; Interoperability Protocols
16Collection Based Persistent Archive
- Diagram (labels only): Ingest Services; Management; Access Services; Information Repository; Attribute-based Query; Attributes Semantics; SDLIP; Information; XML DTD; (Data Handling System - SRB / FTP / HTTP); Data; Fields Containers Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF
17Knowledge Based Persistent Archive
- Diagram (labels only): Ingest Services; Management; Access Services; Knowledge or Topic-Based Query / Browse; Knowledge Repository for Rules; Relationships Between Concepts; Knowledge; XTM DTD; Rules - KQL; (Topic Maps / Buckets / Model-based Access); Information Repository; Attribute-based Query; Attributes Semantics; SDLIP; Information; XML DTD; (Data Handling System - SRB / FTP / HTTP); Data; Fields Containers Folders; Storage (Replicas, Persistent IDs); Grids; Feature-based Query; MCAT/HDF
18Examples of Implied Knowledge: Senate Legislative Activities
- Structural knowledge
- Pertinent information embedded in document headers
- Procedural knowledge
- Naming convention
- Senator represented by last name
- Senator represented by last name and state
- Senator represented by last name, first name, and state
- Collection knowledge
- Referenced senators include senators no longer in the senate
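The three naming conventions above mean the same senator can be referenced in different ways, and a short form can be ambiguous. The following minimal Python sketch resolves a reference against a roster. The written forms are assumptions: "Allard" and "Allard, Wayne CO" appear in the bill example, while "Last ST" for last name plus state is a guess at the notation. The roster is a small toy list, and `NAME` and `resolve` are hypothetical names.

```python
import re

# Toy roster of (last name, first name, state); not the full 106th Senate.
ROSTER = [
    ("Allard", "Wayne", "CO"),
    ("Chafee", "John", "RI"),
    ("Chafee", "Lincoln", "RI"),
]

# Accepts "Last", "Last ST", "Last, First" and "Last, First ST" (assumed notations).
NAME = re.compile(
    r"^(?P<last>[^,]+?)(?:,\s*(?P<first>\S+))?(?:\s+(?P<state>[A-Z]{2}))?$"
)


def resolve(reference, roster=ROSTER):
    """Return every roster entry consistent with the reference."""
    match = NAME.match(reference.strip())
    if match is None:
        return []
    return [
        senator for senator in roster
        if senator[0] == match["last"]
        and (match["first"] is None or senator[1] == match["first"])
        and (match["state"] is None or senator[2] == match["state"])
    ]


for reference in ("Allard", "Allard, Wayne CO", "Chafee RI", "Chafee, John RI"):
    print(reference, "->", resolve(reference))
```

A reference that resolves to more than one entry, or to none, is exactly the kind of anomaly that would be logged during ingestion rather than silently guessed.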
19Information Management Projects
- Digital Libraries
- NSF Digital Library Initiative, Phase II - UCSB, Stanford
- Digital Embryo digital library - GMU
- NPACI Digital Sky - Caltech 2MASS sky survey
- CDL - AMICO
- NSF NSDL - UCAR / DLESE
- Grid Environments
- NASA Information Power Grid - NASA Ames
- DOE Data Visualization Corridor - LLNL
- DOE Particle Physics Data Grid - Stanford, Caltech
- NSF Grid Physics Network - U Fl
- Persistent Archives
- NARA Persistent Archive
- NHPRC - Scalable archives
20Persistent Archive Framework
- Persistent archive functionality - Accessioning platform
- Data management - Archive Markup Language (AML), Container management
- Collection management - Validation of collection, collection characterization
- Knowledge management - Workflow staging, procedure management for ingestion process, anomaly detection, characterization of inherent implied knowledge
- Scale - collections of millions to billions of objects
21Persistent Archive Framework
- Persistent archive functionality - Repository
- Data management - Storage system (robot, media, caching software), media migration, disaster recovery (archive namespace to container mapping)
- Collection management - Container to object mapping, object metadata storage
- Knowledge management - Transaction logging, AML migration on access or on media migration
- Scale - thousands of collections, billions of objects, petabytes of data
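The two mappings named above (archive namespace to container, container to object) can be pictured with a minimal Python sketch. This is an illustration under assumptions, not the repository's design: `Container` and `Repository` are hypothetical names, containers are modelled as in-memory byte buffers, and objects are located by offset and length.

```python
class Container:
    """Aggregates many small objects into one byte sequence."""

    def __init__(self, container_id):
        self.container_id = container_id
        self.blob = bytearray()

    def append(self, data):
        offset = len(self.blob)
        self.blob += data
        return offset, len(data)


class Repository:
    def __init__(self):
        self.containers = {}  # container id -> Container
        self.catalog = {}     # archive name -> (container id, offset, length)

    def store(self, name, data, container_id):
        if container_id not in self.containers:
            self.containers[container_id] = Container(container_id)
        offset, length = self.containers[container_id].append(data)
        self.catalog[name] = (container_id, offset, length)

    def fetch(self, name):
        container_id, offset, length = self.catalog[name]
        blob = self.containers[container_id].blob
        return bytes(blob[offset:offset + length])


repo = Repository()
repo.store("senate/106/S.345", b"Read twice and referred", container_id="c-001")
repo.store("senate/106/S.346", b"Another record", container_id="c-001")

print(repo.catalog["senate/106/S.346"])
print(repo.fetch("senate/106/S.345"))
```

Because the catalog is the only place that knows where an object lives, a container can be moved to new media by updating the mapping, and the catalog itself is what disaster recovery has to protect.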
22Persistent Archive Framework
- Persistent archive functionality - Access platform
- Data management - Data caching, container caching, disk cache management
- Information management - Collection instantiation, access query, browsing support
- Knowledge management - Order processing and workflow tracking, product authentication, usage characterization, presentation management
- Scale - Millions of accesses per day
23Further Information
http://www.npaci.edu/DICE