Preservation of Data and Records in a Knowledge-based Society


1
Preservation of Data and Records in a Knowledge-based Society
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2
Data and Knowledge Systems Group

Staff
Reagan Moore
Ilkai Altintas
Chaitan Baru
Sheau Yen Chen
Charles Cowart
Tony Fountain
Amarnath Gupta
Arun Jagatheesan
George Kremenek
Mevlut Kurul
Bertram Ludäscher
Richard Marciano
XuFei Qian
Roman Olshanowsky
Arcot Rajasekar
Abe Singer
Michael Wan
Ilya Zaslavsky

Graduate Students
A. Behere
M. Dortenzio
H. Jasso
M. Memon
H. Shin
L. Sui
G. Wang
Undergraduate Interns
N. Cotofana
D. Le
J. Tran

3
Topics

Historical perspective - innovation sources
Persistent archive infrastructure approaches
Digital entities - data, information, knowledge
Technology evolution - levels of abstraction
Automation of archival processes - data grids
Access - exposing information and knowledge

4
Research Objectives

Scalability
Automation of archival processes
Technology evolution management
Infrastructure independence
Levels of abstraction
Access
Information based discovery
Knowledge based discovery

5
Original Expertise

1998 - NSF DLI1 digital library (UCB, U Michigan,
UCSB)
Integration of archival storage behind collection
catalog
Bulk metadata management
1990 - High performance computing
Parallel computing technology
Current system is a 1.7 Tflops SP cluster
1986 - Mass storage systems
Migrated all data forward in time across
6 CPU platforms
3 mass storage systems - DataTree, UniTree, HPSS
6 types of tape media - 3480, 3490, 3490E, 3590,
3590E, 9940B
Current capacity is 6 PB, holding 415 TB of data

6
Original Project

1998 - NARA supplement to the DARPA/USPTO
Distributed Object Computation Testbed (DOCT)
Scalability
Demonstrated archiving of a 1-million E-mail
collection
1997 - DOCT built a patent digital library for
the USPTO
Scalability - 2 million patent collection
Transformative migration from Greenbook to SGML
Storage Resource Broker (data grid) for
replicating data across sites

7
(No Transcript)
8
Initial Concepts

Provided separate platforms for archival
processes
Created infrastructure independent representation
for all components of persistent archive
Digital entity data format
Storage repository
Information repository

9
ERA Concept model
10
Infrastructure Independence

Emulation
Migrate the display application to new operating
systems, preserving the look and feel of the
technology used to create the digital entity
Migration
Migrate the digital entity encoding format to a
new standard to enable more sophisticated queries
on the information and knowledge content
Are these variants of a continuum of approaches?

11
Presentation of Digital Entities
Application
Operating System
Storage System
Display System
Digital Entity
12
Technology Management - Emulation
Old Application
Wrap Application
New Operating System
New Storage System
New Display System
Digital Entity
13
Technology Management
Old Application
Add Operating System Call
New Operating System
New Storage System
New Display System
Digital Object
14
Technology Management
Old Application
Add Operating System Call
New Operating System
Add Operating System Call
Old Storage System
Old Display System
Digital Entity
15
Technology Management - Migration
New Application
New Operating System
New Storage System
New Display System
Migrate Encoding Format
Digital Entity
16
Technology Management - SDSC
New Application
New Operating System
Wrap Storage System
Wrap Display System
Old Storage System
Old Display System
Migrate Encoding Format
Digital Entity
17
Migration Advantages

By migrating the digital entity encoding format
to new standards, more sophisticated technologies
can be applied to express the information and
knowledge content inherent in collections of
digital entities.
Queries can be made on the annotated information
Analyses can be done on the annotated knowledge
to identify anomalies and artifacts

18
Specifying Levels of Abstraction

Technology management becomes simpler if the
persistent archive infrastructure operates on
abstractions, rather than an explicit physical
implementation of a resource
Need abstractions for
Digital entities
Repositories
Can generic infrastructure be created that
provides infrastructure independence for data,
information, and knowledge management?

19
Differentiating between Data, Information, and
Knowledge

Data
Digital entity
Entities are streams of bits
Information
Any semantic label.
Attributes are the semantic label and the
associated data.
Attributes may be tagged data within the digital
object, or tagged data that is associated with
the digital object
Knowledge
Relationships between attributes or semantic
labels
Relationships can be procedural/temporal,
structural/spatial, logical/semantic,
functional/algorithmic

20
Digital Entities

Digital entities are images of reality, made of
Data, the bits (zeros and ones) put on a storage
system
Information, the attributes used to assign
semantic meaning to the data
Knowledge, the structural relationships described
by a data model
Every digital entity requires information and
knowledge to be correctly interpreted and displayed

21
Types of Digital Entity Abstractions

Differentiate between a digital entity and its
storage repository
Logical representation
What naming conventions are used to assign
semantic meaning?
Physical representation
What is the physical structure of the digital
entity?

22
Levels of Abstraction for Data
Logical Semantics (units, attributes)
Physical Data Model (syntax, structure)
Abstraction for Digital Entity
Digital Entity
Files
Abstraction for Repository
Logical Name Space
Physical Data Handling System - SRB/MCAT
Repository
File System, Archive
23
Storage Repository Abstraction

Set of operations that can be performed to
manipulate digital entities
Example - Storage Resource Broker
Logical name space
Storage repository abstraction
Information repository abstraction

24
SDSC Storage Resource Broker & Meta-data
Catalog - Storage Repository Abstraction
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
25
Information Management

Abstraction layer for the operations needed to
manipulate a catalog in a database
Bulk metadata manipulation
Automated SQL generation
Separation of the schema used for the catalog
from the schema used for the information
repository
Schema extension
User defined attributes
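
The bullets above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The single `metadata` table of (entity, attribute, value) rows and every function name here are assumptions made for illustration; they are not the MCAT schema or the SRB API. The point is only that queries are generated from attribute conditions, and that user-defined attributes need no schema change.

```python
import sqlite3

def create_catalog(conn):
    # Hypothetical layout: one row per (entity, attribute, value) triple,
    # so a new user-defined attribute is just a new row, not a new column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS metadata ("
        "entity TEXT NOT NULL, attribute TEXT NOT NULL, value TEXT NOT NULL)"
    )

def bulk_load(conn, triples):
    # Bulk metadata manipulation: insert many triples in one call.
    conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", triples)

def build_query(conditions):
    # Generate SQL that finds entities matching every attribute = value pair.
    # Values travel as parameters instead of being spliced into the SQL text.
    if not conditions:
        raise ValueError("at least one condition is required")
    sub = "SELECT entity FROM metadata WHERE attribute = ? AND value = ?"
    sql = " INTERSECT ".join([sub] * len(conditions))
    params = [x for pair in conditions.items() for x in pair]
    return sql, params

def find_entities(conn, conditions):
    sql, params = build_query(conditions)
    return sorted(row[0] for row in conn.execute(sql, params))

conn = sqlite3.connect(":memory:")
create_catalog(conn)
bulk_load(conn, [
    ("msg-1", "author", "alice"), ("msg-1", "year", "1998"),
    ("msg-2", "author", "bob"), ("msg-2", "year", "1998"),
])
matches = find_entities(conn, {"author": "alice", "year": "1998"})
```

Here the caller never writes SQL; it only names attributes and values.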

26
Levels of Abstraction for Information
Logical Collection Schema
Physical XML Syntax
Abstraction for Digital Entity
Digital Entity
Metadata Attributes
Abstraction for Repository
Logical Database Schema
Physical EMCAT/CWM
Repository
Database
27
Logical Name Space

Naming transparency - find a digital entity
without knowing its name
Map from attributes to a global file name
Location transparency - access a digital entity
without knowing where it is
Map from global file name to local file name
Access transparency - access a digital entity
without knowing the type of storage system
Federated client-server architecture
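
The first two mappings above can be sketched as a pair of lookup tables. This is an illustrative toy, not the SRB implementation: the class name, method names, storage-system labels, and paths are all hypothetical. Access transparency (hiding the type of storage system behind one API) is not modeled here.

```python
from dataclasses import dataclass, field

@dataclass
class LogicalNameSpace:
    # global name -> attributes (supports naming transparency)
    attributes: dict = field(default_factory=dict)
    # global name -> list of (storage_system, local_name) replicas
    # (supports location transparency)
    replicas: dict = field(default_factory=dict)

    def register(self, global_name, attrs, locations):
        self.attributes[global_name] = dict(attrs)
        self.replicas[global_name] = list(locations)

    def find(self, **query):
        # Map from attributes to global names: return every entity whose
        # attributes match all query terms.
        return sorted(
            name for name, attrs in self.attributes.items()
            if all(attrs.get(k) == v for k, v in query.items())
        )

    def locate(self, global_name):
        # Map from global name to local file names on each storage system.
        return list(self.replicas[global_name])

ns = LogicalNameSpace()
ns.register(
    "/archive/email/000001",
    {"collection": "email", "year": "1998"},
    [("archive-a", "/tape/0001/msg1"), ("disk-b", "/data/email/msg1")],
)
found = ns.find(collection="email", year="1998")
copies = ns.locate(found[0])
```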

28
SDSC Storage Resource Broker & Meta-data
Catalog - Information Repository Abstraction
Application
Linux I/O
Web WSDL
DLL / Python
Java, NT Browsers
Prolog Predicate
Clients

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase, SQLServer
Servers
SRM
29
Knowledge Management - Characterizing Properties
of Collections

Characterization of relationships between
attributes
Semantic / logical - cross-walks
Procedural / temporal - records management
Structural / spatial - GIS
Characterization of operations needed to
manipulate a concept space in a knowledge
repository
Mapping from collection attributes to discipline
concepts
Transformation from knowledge relationships to
rules for application in inference engines

30
Levels of Abstraction for Knowledge
Logical Relationship Schema
Physical RDF syntax
Abstraction for Digital Entity
Concept Space (ontology instance)
Digital Entity
Abstraction for Repository
Logical Knowledge Repository Schema
Physical Model-based Mediation System
Repository
Knowledge Repository
31
Preservation of Data

Migration
Preserve the data bits
Preserve the digital entity name
Characterize the information and knowledge
content for presentation by new applications

32
Managing Technology Evolution

Data grids provide interoperability mechanisms to
access data in multiple administration domains
and multiple types of storage systems.
Persistent archives migrate collections from old
technology to new technology to support
presentation on new systems
Both require the ability to access heterogeneous
systems

33
Preservation - Data Grids

Name transparency
Find a file by attributes (map from attributes to
global name)
Location transparency
Access a file by a global identifier (map from
global to local file name)
Access transparency
Map from preferred API to access data mechanisms
Preserve the ability to display the system
Authenticity
Disaster recovery, replicate data across storage
systems
Audit and process management

34
SDSC Storage Resource Broker & Meta-data
Catalog - Common APIs
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
35
Authenticity

Guarantee that the digital entity has not been
changed
Collection owned entities, only accessible
through the data handling system
Support roles defining access (curation, owner,
annotation, read)
Support access controls mapping users to roles
Audit trails that record all operations on
digital entities
Digital signatures - cryptographic checksums
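
A minimal sketch of how these mechanisms fit together, assuming a toy in-memory collection. The role names come from the slide; the permissions attached to each role, the class, and its methods are assumptions for illustration. A SHA-256 digest stands in for the "cryptographic checksum"; a true digital signature would additionally sign that digest with a private key.

```python
import hashlib

# Assumed mapping from the slide's roles to permitted operations.
ROLE_PERMISSIONS = {
    "curation": {"read", "annotate", "write"},
    "owner": {"read", "annotate", "write"},
    "annotation": {"read", "annotate"},
    "read": {"read"},
}

class Collection:
    """Entities are owned by the collection and reached only through it."""

    def __init__(self, user_roles):
        self.user_roles = dict(user_roles)  # user -> role
        self.entities = {}                  # name -> bytes
        self.checksums = {}                 # name -> SHA-256 hex digest at ingestion
        self.audit_trail = []               # (user, operation, name, allowed)

    def _check(self, user, operation, name):
        role = self.user_roles.get(user)
        allowed = operation in ROLE_PERMISSIONS.get(role, set())
        # Every attempted operation is recorded, including refused ones.
        self.audit_trail.append((user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def ingest(self, user, name, data):
        self._check(user, "write", name)
        self.entities[name] = data
        self.checksums[name] = hashlib.sha256(data).hexdigest()

    def read(self, user, name):
        self._check(user, "read", name)
        return self.entities[name]

    def verify(self, name):
        # True if the stored bits still match the digest recorded at ingestion.
        current = hashlib.sha256(self.entities[name]).hexdigest()
        return current == self.checksums[name]

collection = Collection({"alice": "owner", "bob": "read"})
collection.ingest("alice", "record-1", b"original bits")
unchanged = collection.verify("record-1")
```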

36
SDSC Storage Resource Broker & Meta-data
Catalog - Preservation
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
37
Emulation versus Migration

Emulation
Characterize processes that the display
application uses to transform the digital entity
to a visual representation
Migration
Characterize processes needed to transform to a
new encoding format
Both are forms of process management

38
Self-Instantiating Archive

Archive the processes that are used to arrange,
describe, and preserve the digital entities
Annotation of information content
Conversion to archivable form
When accessing the collection, retrieve the
processes and the original digital entities
Apply the processing steps to re-create the
information content
Query the result to discover desired digital
objects

39
From File-Based to Knowledge-Based Archives ...

Conventional, file-based archives
tape archives (.tar), optionally compressed (.Z, .gz, ...)
integrity checks at bit level (CRC, checksums, ...)
self-extracting archive: add extraction script/code to the archival package
self-installing archive: like a self-extracting archive, but also automatically executes the installation script
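
For reference, the conventional file-based approach can be sketched with Python's standard tarfile and zlib modules: a gzip-compressed tar package plus a table of CRC-32 values for bit-level integrity checks. The function names and the choice to keep the CRC table outside the package are assumptions for illustration.

```python
import io
import tarfile
import zlib

def build_archive(files):
    # files: dict of name -> bytes.
    # Returns (gzip-compressed tar bytes, dict of name -> CRC-32).
    buffer = io.BytesIO()
    crcs = {}
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
            crcs[name] = zlib.crc32(data)
    return buffer.getvalue(), crcs

def verify_archive(archive_bytes, crcs):
    # Returns the names whose extracted bytes no longer match the recorded CRC-32.
    bad = []
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar.getmembers():
            data = tar.extractfile(member).read()
            if zlib.crc32(data) != crcs.get(member.name):
                bad.append(member.name)
    return bad

package, crc_table = build_archive({"a.txt": b"alpha", "b.txt": b"beta"})
failures = verify_archive(package, crc_table)
```

A check of this kind says only that the bits are unchanged; it says nothing about whether the content can still be interpreted.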

40
... From File-Based to Knowledge-Based Archives
...

Collection- and Knowledge-Based Archives
moving from files (raw data)
... via metadata descriptions to databases
raw data + schema/attributes => encode information
... via semantic constraints to knowledge bases
databases + rules => encode knowledge
lifting bit-level integrity checks (CRC/checksum) to
... syntactic integrity, e.g., well-formed XML
... structural integrity and type consistency: valid XML (w.r.t. XML Schema)
... semantic integrity: valid databases (the database satisfies the given semantic integrity constraints)
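
The three integrity levels above can be sketched with the standard library alone. Two simplifications are assumed and should be read as stand-ins: Python's standard library has no XML Schema validator, so a required-children check plays the role of structural validity, and the semantic rule (the sender must be a known sender) is a made-up example of an integrity constraint. The element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a schema: a <message> must contain these children.
REQUIRED_CHILDREN = {"sender", "date", "subject"}

def check_record(xml_text, known_senders):
    """Return the highest integrity level the record passes."""
    try:
        root = ET.fromstring(xml_text)  # syntactic: is it well-formed XML?
    except ET.ParseError:
        return "none"
    present = {child.tag for child in root}
    if root.tag != "message" or not REQUIRED_CHILDREN <= present:
        return "syntactic"              # well-formed, but not structurally valid
    if root.findtext("sender") not in known_senders:
        return "structural"             # valid structure, semantic rule violated
    return "semantic"

record = (
    "<message><sender>moore</sender><date>1998-01-01</date>"
    "<subject>test</subject></message>"
)
level = check_record(record, known_senders={"moore"})
```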

41
... From File-Based to Knowledge-Based Archives

Knowledge-Based Archives
... include semantic integrity constraints as part of the archive (could be plain-English additional context information or other knowledge about the collection)
Self-Validating Archive
... add a validator to the archive, e.g., semantic integrity constraints as logic rules, with a logic engine (e.g., Datalog/Prolog engine) as the validator
=> allows the future information user to understand the raw data and the rules (context information), and to detect rule exceptions, etc.
Self-Instantiating Archive
... similar to the self-installing archive, but at the information/knowledge level (not the file level); allows the archival ingestion process to be recreated at a later time ("looking over the archivist's shoulder")
... can include self-validation steps

42
(Simplified) Anatomy of a Self-Validating,
Self-Instantiating Archive

rule engine
rules for semantic integrity constraints => validation code
rules for ingestion transformations => re-instantiation code
collections
files
bits

rule engine
instantiation rules
validation rules
collections
files
bits
43
Archival Ingestion Network (Pipeline)

Processing steps = database transformations t
t: Source-Format/Schema -> Target-Format/Schema
if t is invertible => no information is lost
automate t using DB query/transformation languages
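
The invertibility condition can be illustrated with a toy transformation and a round-trip check. Everything here is an assumption for illustration: the "key: value" source format, the list-of-pairs target format, and the function names. A round trip over sample records is evidence that no information is lost on those samples, not a proof that t is invertible in general.

```python
def to_target(record_text):
    # t: "key: value" lines -> list of (key, value) pairs, order preserved.
    pairs = []
    for line in record_text.splitlines():
        key, sep, value = line.partition(": ")
        if not sep:
            raise ValueError(f"not a 'key: value' line: {line!r}")
        pairs.append((key, value))
    return pairs

def to_source(pairs):
    # Candidate inverse of t.
    return "\n".join(f"{key}: {value}" for key, value in pairs)

def lossy_to_target(record_text):
    # A variant of t that lower-cases keys, discarding their original spelling.
    return [(key.lower(), value) for key, value in to_target(record_text)]

def round_trips(transform, inverse, samples):
    # Empirical check: inverse(transform(x)) == x for every sample x.
    return all(inverse(transform(x)) == x for x in samples)

samples = ["From: moore\nSubject: Archive test"]
lossless = round_trips(to_target, to_source, samples)
lossy = round_trips(lossy_to_target, to_source, samples)
```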

44
Open Archival Information System (OAIS) Model

Ingest
receive and quality-assure SIPs, generate AIPs
Archive
store, refresh AIPs
Manage
populate and maintain schemas, views, ICs; access, update DI
Access
discover, describe, locate, upload DIPs

45
Knowledge Creation Roadmap

Knowledge syntax (consensus)
RDF, XMI, Topic Map
Knowledge management (recursive operations)
Oracle parallel database
Knowledge manipulation (spatial/procedural rules)
Generation of inference rules and mapping to data
models
Knowledge generation (scalable inference engine)
Application of inference rules in inference
engine

46
Knowledge Based Persistent Archive
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
Rule-based Access
Information Repository
Attribute-based Query
Attributes Semantics
Information
Encoding standards
Query Mechanisms
Data Grid
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Feature-based Query
47
Persistent Archives

Storage system abstraction
Logical name space and entity manipulation
Information repository abstraction
Logical schema and physical table structure
Knowledge repository abstraction
Topic maps and inference rules
Digital entity abstraction
Data model and encoding format

48
Archival Processes

Appraisal - determine the archivable content
Accession - determine the initial physical
location for the data, and the relationship of
the new collection to existing collections
Arrangement - add administration control,
describe the information content (provenance,
authenticity, structure, administrative), and
decompose digital objects into their components
as needed.
Description - complete the definition of
collection attributes by iterating between
arrangement, reformatting, and representation.
Preservation - build an archivable form of the
digital entities, characterize the collection
context, and manage their storage
Access - provide query mechanisms for
discovering, retrieving, and presenting the
digital entities.
Re-purposing - apply archival processes to build
a new collection context

49
(No Transcript)
50
NARA Prototype

Demonstrate ability to ingest, archive, recreate,
query, and present a digital object from a 1
million record E-mail collection (RFC1036)
2.5 GB of data
6 required fields
13 optional fields
User defined fields (over 1000)
Determine resources required to scale size of
collection

51
(No Transcript)
52
XML DTD for E-mail
53
Formatted Message Using XML DTD
54
Web-based Interface for Accessing the E-mail
Collection
55
Automation of Ingestion Process

Application of an Accessioning Template
Defines the concepts, policies or acceptance of
the collection
Creation of attributes that represent the
accessioning template concepts
Analysis of attributes for anomalies and implied
inherent knowledge

56
Information Generation Processes

Create occurrence index
(Occurrence, attribute, value)
This is needed to be able to recreate original
form of digital object
Analyze completeness of information
Inverse index of attribute values
Identifies unexpected values - consistency
Analyze closure of collection
Are additional concepts needed to represent
inverse index value ranges?
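
One possible reading of these steps, sketched in Python. The record layout (each digital object as an ordered list of attribute/value pairs), the function names, and the sample values are assumptions for illustration; here "occurrence" is taken to mean the position of the object in the collection.

```python
from collections import defaultdict

def build_occurrence_index(records):
    # One (occurrence, attribute, value) triple per tagged value, in original order.
    return [
        (occurrence, attribute, value)
        for occurrence, record in enumerate(records)
        for attribute, value in record
    ]

def rebuild_records(index):
    # Recreate the original objects from the occurrence index.
    # Objects with no attributes leave no triples and cannot be rebuilt.
    records = defaultdict(list)
    for occurrence, attribute, value in index:
        records[occurrence].append((attribute, value))
    return [records[i] for i in sorted(records)]

def build_inverse_index(index):
    # attribute -> value -> list of occurrences holding that value
    inverse = defaultdict(lambda: defaultdict(list))
    for occurrence, attribute, value in index:
        inverse[attribute][value].append(occurrence)
    return inverse

def unexpected_values(inverse, allowed):
    # Values that fall outside the expected set for an attribute.
    result = {}
    for attribute, values in inverse.items():
        if attribute in allowed:
            extra = sorted(set(values) - allowed[attribute])
            if extra:
                result[attribute] = extra
    return result

records = [
    [("chamber", "Senate"), ("type", "bill")],
    [("chamber", "Senate"), ("type", "amendmnt")],  # deliberately misspelled
]
index = build_occurrence_index(records)
inverse = build_inverse_index(index)
odd = unexpected_values(inverse, {"type": {"bill", "amendment", "order"}})
```

An unexpected value either signals an error in the data or suggests that the set of concepts needs to be extended, which is the closure question raised above.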

57
Ingestion Processes for Collection
Aggregation of original objects into
containers for storage
Data Organization
Data Storage
58
Ingestion Processes for Collection
Migration of objects into a standard
representation
Information Generation
Attribute Tagging
Attribute Selection
Data Organization
Collection Storage
59
Ingestion Processes for Collection
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection Storage
60
Application of Anomaly Detection to Thomas
Collection

List of bills, amendments, orders sponsored by
each Senator in a session of Congress
The processing rule used to describe senators is
an example of inherent knowledge within the
collection
By building occurrence tables, one can
differentiate between knowledge relationships and
anomalies or artifacts

61
Example Ingestion Network - Senate Collection
62
Information Modeling in Knowledge-Based Archival - Senate Example
Data provider says: "Please archive all records of legislative activities of the 106th Senate!"
Integrity constraints (logic rules):
(1) senators_with_file UNION (sponsor, cosponsors, submitted_by)
(2) senators sponsors co-sponsors
Violation: the rhs is strictly larger than the lhs!
Exceptions: (Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) explanations: senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
Checking ICs:
IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
IF condition THEN action
Action: LOG, WARN, ABORT, ...
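
The rule "IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))" can be sketched in plain Python rather than a logic engine. The function name, the data layout, the bill identifiers, and the placeholder senator ("Example", "Alice") are invented for illustration; only the rule itself, the action names, and the exception (Miller, Zell) come from the slide.

```python
import warnings

def check_sponsors(senators, sponsorships, action="LOG"):
    # senators: set of (last, first) names with senator information on file.
    # sponsorships: list of (bill_id, (last, first)) pairs.
    # Returns the exception log; WARN also emits a warning, ABORT raises.
    exception_log = []
    for bill, sponsor in sponsorships:
        if sponsor in senators:
            continue
        entry = ("missing_senator_info", sponsor)
        if entry not in exception_log:
            exception_log.append(entry)
        if action == "WARN":
            warnings.warn(f"no senator information for {sponsor} on {bill}")
        elif action == "ABORT":
            raise ValueError(f"integrity constraint violated by {sponsor} on {bill}")
    return exception_log

senators = {("Example", "Alice")}  # placeholder, not real collection data
sponsorships = [
    ("bill-1", ("Example", "Alice")),
    ("bill-2", ("Miller", "Zell")),
    ("bill-3", ("Miller", "Zell")),
]
log = check_sponsors(senators, sponsorships)
```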
63
Senator Naming Constraints

A senator's name can appear only once on a bill
Senator specified by
Last name
Last name and state
Last name, state, and first name
Detected anomaly: page 205 of an RTF file was
replicated.
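
A minimal sketch of the "only once on a bill" check, assuming each bill has already been reduced to a list of senator names as they appear on it. The function name, bill identifiers, and names are hypothetical. A replicated page would show up as names repeated on the same bill.

```python
from collections import Counter

def duplicate_names(bill_names):
    # bill_names: dict of bill id -> list of senator names as they appear on the bill.
    # Returns bill id -> sorted list of names that appear more than once.
    violations = {}
    for bill, names in bill_names.items():
        repeated = sorted(name for name, count in Counter(names).items() if count > 1)
        if repeated:
            violations[bill] = repeated
    return violations

bills = {
    "bill-1": ["Senator A", "Senator B"],
    "bill-2": ["Senator A", "Senator C", "Senator A"],
}
violations = duplicate_names(bills)
```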

64
Persistent Collection

Define context for archiving data - annotate
information content
Create archivable form - standard encoding format
Archive information content along with data
Test closure of the collection - all digital
objects that can be discovered in the collection
are members of the collection
Test completeness of the collection - inherent
relationships within the collection can be cast
in terms of attributes generated from the
annotated information.
Differentiate between inherent knowledge and
anomalies / artifacts

65
Growing Community Interactions

Mass Storage
IEEE Mass storage system technical committee
High performance computing
NSF National Partnership for Advanced
Computational Infrastructure - scalable computing
Digital Library
DLI2 - UCB, Stanford, UCSB - interoperability
NSDL - OAI metadata harvesting, metadata
standards
Data Grid
Global Grid Forum - infrastructure independence
Persistent Archive
InterPARES, records management, OAIS standard

66
Collaborations