Knowledge-Based Persistent Archives

Title:

Knowledge-Based Persistent Archives

Description:

SCA 2001, Santa Rosa.

Provided by: bertramlu

1
Knowledge-Based Persistent Archives
  • Richard Marciano
  • marciano_at_sdsc.edu
  • San Diego Supercomputer Center
  • University of California, San Diego

2
Data Intensive Computing Environment
  • Staff
  • Reagan Moore
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Amarnath Gupta
  • George Kremenek
  • Bertram Ludäscher
  • Richard Marciano
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Bing Zhu
  • Students - GSRA
  • Martin Kuhl
  • Liying Sui
  • Yang Yu
  • Valter Crescenzi (Italy)
  • Students - Undergrad Interns
  • Peter Shin
  • Roman Olshanowsky
  • Shabbar Tambawala
  • Pratik Mukhopadhyay
  • Michelle Schumaker

3
Digital Archives
  • Problem
  • How to achieve long-term preservation of
    information (for the archivist: records) and
    sustained access?
  • Challenges and Opportunities
  • fight archive obsolescence in the presence of
    rapidly changing storage, data formats, software
    environment, and hardware
  • Approaches
  • Time out (do nothing: assume hardware,
    software, data formats, etc. all work 400 years
    from now ...)
  • Emulation (emulate hardware and software
    infrastructure)
  • Migration (migrate to new infrastructure)
  • Standardize data formats

4
Archival Example: Senate Collection
  • What you see
  • is maybe NOT what you get (a not so well
    documented format)

5
Senate Collection Example
  • Rich Text Format (a documented Microsoft
    format)

\pard\par
\pard\b S. 345\b0\par
\pard\qr DATE INTRODUCED 02/03/1999\par
\pard SPONSOR Allard\par
\i\qc OFFICIAL TITLE\i0\par
\pard A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.\par
\i\qc LATEST STATUS\i0\par\pard
\pard\plain \fi-1900\li1900\nowidctlpar\adjustright Feb 3, 1999\tab Read twice and referred to the Committee on Agriculture.\par
\pard
  • can be wrapped into XML

S. 345 bold"off" DATE INTRODUCED 02/03/1999 bold"off" SPONSOR Allard bold"off" italic"off" OFFICIAL TITLE bold"off" italic"off" A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful. bold"off" italic"off" LATEST STATUS
Feb 3, 1999 tab Read twice and referred to the Committee on Agriculture.

6
Senate Collection Example
  • the XML can be lifted from the presentation
    level

S. 345 bold"off" DATE INTRODUCED 02/03/1999 bold"off" SPONSOR Allard bold"off" italic"off" OFFICIAL TITLE bold"off" italic"off" A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful. bold"off" italic"off" LATEST STATUS
Feb 3, 1999 tab Read twice and referred to the Committee on Agriculture.
  • to the information level


SENATE AGRICULTURE
02/03/1999
Feb 3, 1999
Read twice and referred to the Committee on Agriculture
A bill to amend the Animal Welfare Act to remove the limitation that permits interstate movement of live birds, for the purpose of fighting, to States in which animal fighting is lawful.
Allard, Wayne CO
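
As a rough illustration of the lift from presentation level to information level, the sketch below assembles a tagged record for S. 345 with Python's standard xml.etree.ElementTree module. The element and attribute names (bill, bill_name, committees, committee, date_introduced, latest_status_list, latest_status, ls_date, ls_txt, official_title, sponsor) are the ones that survive in the DTD on slide 7. The nesting shown, the use of "S. 345" as the bill_name value, and the helper name build_bill are assumptions made for this example, not the archive's actual code.

```python
import xml.etree.ElementTree as ET


def build_bill(name, committee, introduced, status_date, status_text, title, sponsor):
    """Assemble one information-level <bill> record (hypothetical helper)."""
    bill = ET.Element("bill", {"bill_name": name})
    committees = ET.SubElement(bill, "committees")
    ET.SubElement(committees, "committee").text = committee
    ET.SubElement(bill, "date_introduced").text = introduced
    # The status list holds (ls_date, ls_txt) pairs, one per status entry.
    status_list = ET.SubElement(bill, "latest_status_list")
    status = ET.SubElement(status_list, "latest_status")
    ET.SubElement(status, "ls_date").text = status_date
    ET.SubElement(status, "ls_txt").text = status_text
    ET.SubElement(bill, "official_title").text = title
    ET.SubElement(bill, "sponsor").text = sponsor
    return bill


bill = build_bill(
    name="S. 345",
    committee="SENATE AGRICULTURE",
    introduced="02/03/1999",
    status_date="Feb 3, 1999",
    status_text="Read twice and referred to the Committee on Agriculture",
    title=(
        "A bill to amend the Animal Welfare Act to remove the limitation "
        "that permits interstate movement of live birds, for the purpose "
        "of fighting, to States in which animal fighting is lawful."
    ),
    sponsor="Allard, Wayne CO",
)
xml_text = ET.tostring(bill, encoding="unicode")
print(xml_text)
```

The point of the sketch is that no bold or italic markers remain: every value is addressed by what it means (date introduced, sponsor) rather than by how it was rendered.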

7
XML as an Archival Format
  • Information level schema as an XML DTD

bills (bill)
committees?, congressional_record?, cosponsors?, date_introduced?, digest?, latest_status_list?, official_title?, sponsor?, statement_of_purpose?, submitted_by?, submitted_for?)
bill_name CDATA REQUIRED
(committee)
(cosponsor)
latest_status_list (latest_status)
latest_status (ls_date, ls_txt)
abstract (PCDATA)
(PCDATA)
(PCDATA)
co_name (PCDATA)
CDATA IMPLIED
(PCDATA)
(PCDATA)
(PCDATA)

8
From XML-Based to Knowledge-Based Archives
  • Collection-based archival with XML: save data "as
    is" plus ...
  • ... separate content from presentation
  • ... tag your data (take a lift in the info
    hierarchy)
  • ... use a self-describing, semistructured data
    format (XML)
  • Knowledge-based archival: now add ...
  • ... conceptual level information
  • ... integrity constraints
  • ... explanations/derivation rules
  • archiving only results y = f(x) vs. archiving the
    rules/function "f" (e.g. f = the
    Florida procedure ...)
  • employ knowledge representation languages

9
Knowledge-Based Archival: Senate Example
  • Data provider says:
  • Please archive all records of legislative
    activities of the 106th Senate!
  • Integrity constraints, e.g.
  • (1) senators_with_file = UNION (sponsor,
    cosponsors, submitted_by)
  • (2) senators sponsors co-sponsors
  • Violation
  • the rhs is a SUPERSET of the lhs !
  • Exceptions
  • (Chafee, John), (Gramm, Phil), (Miller, Zell)
  • (Possible) Explanations
  • senators who joined (Zell), passed away (Chafee),
    were forgotten (Gramm)!?
  • Checking ICs
  • IF sponsor(X), not senator(X) THEN
    ADD(exception_log, missing_senator_info(X))
  • IF condition THEN action
  • Action: LOG, WARN,
    ABORT, ...
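
The rule "IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))" can be sketched in a few lines of Python. This is a minimal illustration, not the rule engine used in the project: the Archive class, the check_sponsors function, and the sample roster are hypothetical, and only the three exception names are taken from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    senators: set                      # names for which a senator record exists
    exception_log: list = field(default_factory=list)


def check_sponsors(archive, sponsors, action="LOG"):
    """Apply: IF sponsor(X), not senator(X) THEN action."""
    for name in sorted(sponsors):
        if name in archive.senators:
            continue
        entry = ("missing_senator_info", name)
        if action == "ABORT":
            # Stop ingestion at the first violation.
            raise ValueError(f"integrity constraint violated: {entry}")
        archive.exception_log.append(entry)
        if action == "WARN":
            print(f"warning: no senator record for {name}")
    return archive.exception_log


# Illustrative sample data; a real roster would come from the collection.
archive = Archive(senators={"Allard, Wayne"})
sponsors = {"Allard, Wayne", "Chafee, John", "Gramm, Phil", "Miller, Zell"}
log = check_sponsors(archive, sponsors)
print(log)
```

Keeping the action (LOG, WARN, ABORT) as a parameter separates the constraint itself from the policy for handling a violation, which is the split the "IF condition THEN action" form suggests.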

10
Persistent Archive
  • Manage digital objects for the life of the
    republic
  • Maintain the ability to discover and access digital
    objects while the supporting hardware and software
    systems evolve

11
Fundamental Concept for a Persistent Archive
  • Persistence requires migration over time onto new
    technology
  • While the migration occurs, a persistent archive
    must be able to interoperate with both the old
    technology and the new technology.
  • A persistent archive is an interoperability
    system.
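
One way to picture "interoperate with both the old technology and the new technology" is an archive front end that can answer requests from either store while objects are copied across in batches. The sketch below is an assumption-laden toy: the PersistentArchive class, its method names, and the use of plain dictionaries as storage systems are all hypothetical.

```python
class PersistentArchive:
    """Serves objects from the new store when present, else from the old one."""

    def __init__(self, old_store, new_store):
        self.old_store = old_store
        self.new_store = new_store

    def get(self, object_id):
        # Access keeps working no matter how far the migration has progressed.
        if object_id in self.new_store:
            return self.new_store[object_id]
        return self.old_store[object_id]

    def migrate(self, batch_size):
        """Copy up to batch_size not-yet-migrated objects; return how many moved."""
        pending = [k for k in self.old_store if k not in self.new_store]
        pending = pending[:batch_size]
        for key in pending:
            self.new_store[key] = self.old_store[key]
        return len(pending)


old_store = {"s345": b"bill text 345", "s346": b"bill text 346"}
new_store = {}
archive = PersistentArchive(old_store, new_store)

moved = archive.migrate(batch_size=1)   # migration is only half done here
print(moved, archive.get("s345"), archive.get("s346"))
```

Because get consults both stores, callers never need to know which technology currently holds an object.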

12
Implicit Concepts for Persistent Archive
  • Infrastructure independence
  • Data set access
  • Authentication
  • Collection management
  • Presentation
  • Non-proprietary formatting
  • Information models
  • XML - Information markup language
  • GML - Graphics markup language
  • Support for ingestion, management, access
  • Accessioning workbench, archive, access workbench

13
Standard Information Markup Language
  • XML representation of metadata attributes
  • Standardization of DTDs - MOA II DTD for text
  • Standardization of markup language
  • XML based representation of collection structure
  • Attributes defining the physical layout of a
    schema into relational tables (foreign keys,
    attribute data types, ...)
  • XML databases: XML-organized data collections
  • Commercial systems: Excelon, TAMINO, Oracle8i, ...
  • XML-based Topic Maps
  • Represent relationships between collection domain
    concepts, collection attributes

14
What Types of Interoperability are Needed?
  • Data management (digital objects)
  • Ability to work with multiple types of storage
    systems, across separate administration domains
  • Information management (attributes)
  • Ability to define a collection independent of
    database choice
  • Ability to migrate collection onto new databases
  • Knowledge management (relationships)
  • Ability to manage relationships
  • Ability to map domain concepts to collection
    attributes

15
Data Archive
  • Ingest Services, Management, Access Services
  • Ingestion platform, Data repositories, Access platform
  • Interoperability Standards
  • Interoperability Protocols

16
Collection Based Persistent Archive
  • Ingest Services, Management, Access Services
  • Information: Information Repository; Attribute-based
    Query; Attributes Semantics; SDLIP; XML DTD
  • (Data Handling System - SRB / FTP / HTTP)
  • Data: Fields, Containers, Folders; Storage (Replicas,
    Persistent IDs); Grids; Feature-based Query; MCAT/HDF

17
Knowledge Based Persistent Archive
  • Ingest Services, Management, Access Services
  • Knowledge: Knowledge or Topic-Based Query / Browse;
    Knowledge Repository for Rules; Relationships Between
    Concepts; XTM DTD; Rules - KQL
  • (Topic Maps / Buckets / Model-based Access)
  • Information: Information Repository; Attribute-based
    Query; Attributes Semantics; SDLIP; XML DTD
  • (Data Handling System - SRB / FTP / HTTP)
  • Data: Fields, Containers, Folders; Storage (Replicas,
    Persistent IDs); Grids; Feature-based Query; MCAT/HDF

18
Examples of Implied Knowledge: Senate Legislative
Activities
  • Structural knowledge
  • Pertinent information embedded in document
    headers
  • Procedural knowledge
  • Naming convention
  • Senator represented by last name
  • Senator represented by last name and state
  • Senator represented by last name, first name, and
    state
  • Collection knowledge
  • Referenced senators include senators no longer in
    the senate
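
The three naming conventions listed above mean the same senator can appear under different strings, so ingestion needs a resolution step. The following sketch shows one possible approach. The reference formats ("Last", "Last ST", "Last, First ST"), the Senator type, the function names, and the three-entry roster are assumptions for illustration; it also assumes single-word last names when no comma is present.

```python
from typing import NamedTuple


class Senator(NamedTuple):
    last: str
    first: str
    state: str


def parse_reference(ref):
    """Split a reference into (last, first, state); missing parts are None."""
    last, _, rest = ref.partition(",")
    if rest:
        tokens = rest.split()            # "Last, First ST"
    else:
        tokens = last.split()            # "Last" or "Last ST"
        last, tokens = tokens[0], tokens[1:]
    # Treat a trailing two-letter uppercase token as the state code.
    has_state = bool(tokens) and len(tokens[-1]) == 2 and tokens[-1].isupper()
    state = tokens.pop() if has_state else None
    first = " ".join(tokens) or None
    return last.strip(), first, state


def resolve(ref, roster):
    """Return every roster entry consistent with the reference."""
    last, first, state = parse_reference(ref)
    return [
        s for s in roster
        if s.last == last
        and (first is None or s.first == first)
        and (state is None or s.state == state)
    ]


# Illustrative roster, not the collection's actual senator list.
roster = [
    Senator("Allard", "Wayne", "CO"),
    Senator("Chafee", "John", "RI"),
    Senator("Chafee", "Lincoln", "RI"),
]
print(resolve("Allard CO", roster))
print(resolve("Chafee", roster))
```

A last-name-only reference such as "Chafee" matches two roster entries here, which is the kind of ambiguity that collection knowledge (for example, which senators were in office on a given date) would be needed to settle.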

19
Information Management Projects
  • Digital Libraries
  • NSF Digital Library Initiative, Phase II - UCSB,
    Stanford
  • Digital Embryo digital library - GMU
  • NPACI Digital Sky - Caltech 2MASS sky survey
  • CDL - AMICO
  • NSF NSDL - UCAR / DLESE
  • Grid Environments
  • NASA Information Power Grid - NASA Ames
  • DOE Data Visualization Corridor - LLNL
  • DOE Particle Physics Data Grid - Stanford,
    Caltech
  • NSF Grid Physics Network - U Fl
  • Persistent Archives
  • NARA Persistent Archive
  • NHPRC - Scalable archives

20
Persistent Archive Framework
  • Persistent archive functionality - Accessioning
    platform
  • Data management - Archive Markup Language (AML),
    Container management
  • Collection management - Validation of collection,
    collection characterization
  • Knowledge management - Workflow staging,
    procedure management for ingestion process,
    anomaly detection, characterization of inherent
    implied knowledge
  • Scale - collections of millions to billions of
    objects

21
Persistent Archive Framework
  • Persistent archive functionality - Repository
  • Data management - Storage system (robot, media,
    caching software), media migration, disaster
    recovery (archive namespace to container mapping)
  • Collection management - Container to object
    mapping, object metadata storage
  • Knowledge management - Transaction logging, AML
    migration on access or on media migration
  • Scale - thousands of collections, billions of
    objects, petabytes of data
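
The container-to-object mapping mentioned above can be pictured as a table that records, for each object, which container holds it and where. The sketch below is only a toy model under that assumption: the Extent record, the pack and read functions, and the container name are hypothetical and do not describe the actual repository software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    container: str   # name of the container holding the object
    offset: int      # byte offset of the object inside the container
    length: int      # object size in bytes


def pack(objects, container_name):
    """Concatenate objects into one container; return (bytes, mapping)."""
    blob = bytearray()
    mapping = {}
    for object_id, data in objects.items():
        mapping[object_id] = Extent(container_name, len(blob), len(data))
        blob.extend(data)
    return bytes(blob), mapping


def read(object_id, mapping, containers):
    """Fetch one object back out of its container via the mapping."""
    extent = mapping[object_id]
    blob = containers[extent.container]
    return blob[extent.offset:extent.offset + extent.length]


objects = {"s345": b"Animal Welfare Act amendment", "s346": b"another bill"}
blob, mapping = pack(objects, "container-0001")
containers = {"container-0001": blob}
print(read("s346", mapping, containers))
```

Grouping many small objects into a few large containers keeps the number of files the storage system must track small, while the mapping preserves per-object access.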

22
Persistent Archive Framework
  • Persistent archive functionality - Access
    platform
  • Data management - Data caching, container
    caching, disk cache management
  • Information management - Collection
    instantiation, access query, browsing support
  • Knowledge management - Order processing and
    workflow tracking, product authentication, usage
    characterization, presentation management
  • Scale - Millions of accesses per day

23
Further Information
http://www.npaci.edu/DICE