Title: Preservation of Data and Records in a Knowledge-based Society
1Preservation of Data and Records in a Knowledge-based Society
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2Data and Knowledge Systems Group
- Staff
- Reagan Moore
- Ilkai Altintas
- Chaitan Baru
- Sheau Yen Chen
- Charles Cowart
- Tony Fountain
- Amarnath Gupta
- Arun Jagatheesan
- George Kremenek
- Mevlut Kurul
- Bertram Ludäscher
- Richard Marciano
- XuFei Qian
- Roman Olshanowsky
- Arcot Rajasekar
- Abe Singer
- Michael Wan
- Ilya Zaslavsky
- Graduate Students
- A. Behere
- M. Dortenzio
- H. Jasso
- M. Memon
- H. Shin
- L. Sui
- G. Wang
- Undergraduate Interns
- N. Cotofana
- D. Le
- J. Tran
- /- NN
3Topics
- Historical perspective - innovation sources
- Persistent archive infrastructure approaches
- Digital entities - data, information, knowledge
- Technology evolution - levels of abstraction
- Automation of archival processes - data grids
- Access - exposing information and knowledge
4Research Objectives
- Scalability
- Automation of archival processes
- Technology evolution management
- Infrastructure independence
- Levels of abstraction
- Access
- Information based discovery
- Knowledge based discovery
5Original Expertise
- 1998 - NSF DLI1 digital library (UCB, U Michigan, UCSB)
- Integration of archival storage behind collection catalog
- Bulk metadata management
- 1990 - High performance computing
- Parallel computing technology
- Current system is a 1.7 Tflops SP cluster
- 1986 - Mass storage systems
- Migrated all data forward in time across
- 6 CPU platforms
- 3 mass storage systems - DataTree, UniTree, HPSS
- 6 types of tape media - 3480, 3490, 3490E, 3590, 3590E, 9940B
- Current capacity is 6 PBs holding 415 TBs of data
6Original Project
- 1998 - NARA supplement to the DARPA/USPTO Distributed Object Computation Testbed (DOCT)
- Scalability
- Demonstrated archiving of a 1-million E-mail collection
- 1997 - DOCT built a patent digital library for the USPTO
- Scalability - 2 million patent collection
- Transformative migration from Greenbook to SGML
- Storage Resource Broker (data grid) for replicating data across sites
8Initial Concepts
- Provided separate platforms for archival processes
- Created infrastructure independent representation for all components of persistent archive
- Digital entity data format
- Storage repository
- Information repository
9ERA Concept model
10Infrastructure Independence
- Emulation
- Migrate the display application to new operating systems, preserving the look and feel of the technology used to create the digital entity
- Migration
- Migrate the digital entity encoding format to a new standard to enable more sophisticated queries on the information and knowledge content
- Are these variants of a continuum of approaches?
11Presentation of Digital Entities
Application
Operating System
Storage System
Display System
Digital Entity
12Technology Management - Emulation
Old Application
Wrap Application
New Operating System
New Storage System
New Display System
Digital Entity
13Technology Management
Old Application
Add Operating System Call
New Operating System
New Storage System
New Display System
Digital Object
14Technology Management
Old Application
Add Operating System Call
New Operating System
Add Operating System Call
Old Storage System
Old Display System
Digital Entity
15Technology Management - Migration
New Application
New Operating System
New Storage System
New Display System
Migrate Encoding Format
Digital Entity
16Technology Management - SDSC
New Application
New Operating System
Wrap Storage System
Wrap Display System
Old Storage System
Old Display System
Migrate Encoding Format
Digital Entity
17Migration Advantages
- By migrating the digital entity encoding format to new standards, more sophisticated technologies can be applied to express the information and knowledge content inherent in collections of digital entities.
- Queries can be made on the annotated information
- Analyses can be done on the annotated knowledge to identify anomalies and artifacts
18Specifying Levels of Abstraction
- Technology management becomes simpler if the persistent archive infrastructure operates on abstractions, rather than an explicit physical implementation of a resource
- Need abstractions for
- Digital entities
- Repositories
- Can generic infrastructure be created that provides infrastructure independence for data, information, and knowledge management?
19Differentiating between Data, Information, and
Knowledge
- Data
- Digital entity
- Entities are streams of bits
- Information
- Any semantic label
- Attributes are the semantic label and the associated data
- Attributes may be tagged data within the digital object, or tagged data that is associated with the digital object
- Knowledge
- Relationships between attributes or semantic labels
- Relationships can be procedural/temporal, structural/spatial, logical/semantic, functional/algorithmic
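The three-way distinction above can be sketched as a small data model. This is only an illustration: the class names (`DigitalEntity`, `Attribute`, `Relationship`) and the sample message are assumptions made for the example, not part of SRB or any SDSC system.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Attribute:
    """Information: a semantic label plus the data it tags."""
    label: str
    value: str


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two semantic labels."""
    kind: str    # e.g. "procedural/temporal", "structural/spatial"
    source: str  # label on one side of the relationship
    target: str  # label on the other side


@dataclass
class DigitalEntity:
    """Data: a stream of bits, plus the information and knowledge about it."""
    bits: bytes
    attributes: list = field(default_factory=list)
    relationships: list = field(default_factory=list)

    def labels(self):
        return {attribute.label for attribute in self.attributes}

    def dangling_relationships(self):
        """Relationships that mention a label this entity does not carry."""
        known = self.labels()
        return [r for r in self.relationships
                if r.source not in known or r.target not in known]


# Hypothetical e-mail message: bits, two attributes, one temporal relationship.
message = DigitalEntity(
    bits=b"Sent: 1999-01-02\nReceived: 1999-01-03\n\nhello",
    attributes=[Attribute("sent", "1999-01-02"),
                Attribute("received", "1999-01-03")],
    relationships=[Relationship("procedural/temporal", "sent", "received")],
)
```

Keeping the bits, the labels, and the relationships in separate fields means each can be migrated or re-encoded on its own schedule.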
20Digital Entities
- Digital entities are images of reality, made of
- Data, the bits (zeros and ones) put on a storage system
- Information, the attributes used to assign semantic meaning to the data
- Knowledge, the structural relationships described by a data model
- Every digital entity requires information and knowledge to correctly interpret and display
21Types of Digital Entity Abstractions
- Differentiate between a digital entity and its storage repository
- Logical representation
- What naming conventions are used to assign semantic meaning?
- Physical representation
- What is the physical structure of the digital entity?
22Levels of Abstraction for Data
Logical Semantics (units, attributes)
Physical Data Model (syntax, structure)
Abstraction for Digital Entity
Digital Entity
Files
Abstraction for Repository
Logical Name Space
Physical Data Handling System -SRB/MCAT
Repository
File System, Archive
23Storage Repository Abstraction
- Set of operations that can be performed to manipulate digital entities
- Example - Storage Resource Broker
- Logical name space
- Storage repository abstraction
- Information repository abstraction
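A storage repository abstraction of this kind can be sketched as an abstract set of operations with interchangeable drivers. The names below (`StorageDriver`, `MemoryDriver`, `DirectoryDriver`, `migrate`) are hypothetical and are not the SRB API; the point is only that migration code written against the abstraction never touches a specific storage technology.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class StorageDriver(ABC):
    """Operations every repository must support, whatever the technology."""

    @abstractmethod
    def put(self, name, data): ...

    @abstractmethod
    def get(self, name): ...

    @abstractmethod
    def names(self): ...


class MemoryDriver(StorageDriver):
    """Stand-in for an old storage system."""

    def __init__(self):
        self._blobs = {}

    def put(self, name, data):
        self._blobs[name] = data

    def get(self, name):
        return self._blobs[name]

    def names(self):
        return sorted(self._blobs)


class DirectoryDriver(StorageDriver):
    """Stand-in for a new storage system backed by a directory."""

    def __init__(self, root):
        self._root = root

    def put(self, name, data):
        with open(os.path.join(self._root, name), "wb") as handle:
            handle.write(data)

    def get(self, name):
        with open(os.path.join(self._root, name), "rb") as handle:
            return handle.read()

    def names(self):
        return sorted(os.listdir(self._root))


def migrate(old, new):
    """Copy every entity using only the abstract operations."""
    count = 0
    for name in old.names():
        new.put(name, old.get(name))
        count += 1
    return count


old_store = MemoryDriver()
old_store.put("patent-0001.sgml", b"<patent/>")
with tempfile.TemporaryDirectory() as root:
    new_store = DirectoryDriver(root)
    migrated = migrate(old_store, new_store)
    round_trip = new_store.get("patent-0001.sgml")
```

Supporting a new storage technology then means writing one more driver; `migrate` itself does not change.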
24SDSC Storage Resource Broker Meta-data
Catalog Storage Repository Abstraction
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP
Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
25Information Management
- Abstraction layer for the operations needed to manipulate a catalog in a database
- Bulk metadata manipulation
- Automated SQL generation
- Separation of the schema used for the catalog from the schema used for the information repository
- Schema extension
- User defined attributes
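One way to picture automated SQL generation with user-defined attributes is an attribute/value table queried through generated joins. The sketch below uses Python's standard `sqlite3` module; the table layout, the `build_query` and `bulk_load` helpers, and the sample records are assumptions for illustration and do not reflect the actual MCAT schema.

```python
import sqlite3


def build_query(conditions):
    """Generate a parameterized SQL query from attribute/value conditions."""
    sql = "SELECT e.name FROM entity e"
    params = []
    for i, (attr, value) in enumerate(sorted(conditions.items())):
        # One join per condition, so all conditions must hold together.
        sql += (f" JOIN metadata m{i} ON m{i}.entity_id = e.id"
                f" AND m{i}.attr = ? AND m{i}.value = ?")
        params += [attr, value]
    return sql + " ORDER BY e.name", params


def bulk_load(db, records):
    """Bulk metadata ingestion: one entity row plus one row per attribute."""
    for name, attrs in records:
        cursor = db.execute("INSERT INTO entity (name) VALUES (?)", (name,))
        db.executemany(
            "INSERT INTO metadata VALUES (?, ?, ?)",
            [(cursor.lastrowid, a, v) for a, v in attrs.items()],
        )


db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE metadata (entity_id INTEGER REFERENCES entity(id),
                       attr TEXT, value TEXT);
""")

# "project" plays the role of a user-defined attribute: it is added
# without altering any table definition.
bulk_load(db, [
    ("msg-001", {"sender": "alice", "year": "1998"}),
    ("msg-002", {"sender": "bob", "year": "1998", "project": "DOCT"}),
])

sql, params = build_query({"year": "1998", "project": "DOCT"})
matches = [row[0] for row in db.execute(sql, params)]
```

Because the caller supplies only attribute names and values, the catalog's physical schema can change without affecting clients, which is the separation the slide describes.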
26Levels of Abstraction for Information
Logical Collection Schema
Physical XML Syntax
Abstraction for Digital Entity
Digital Entity
Metadata Attributes
Abstraction for Repository
Logical Database Schema
Physical EMCAT/CWM
Repository
Database
27Logical Name Space
- Naming transparency - find a digital entity without knowing its name
- Map from attributes to a global file name
- Location transparency - access a digital entity without knowing where it is
- Map from global file name to local file name
- Access transparency - access a digital entity without knowing the type of storage system
- Federated client-server architecture
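The two mappings behind naming and location transparency can be sketched with a pair of dictionaries. The class `LogicalNameSpace`, the resource names, and the paths are all hypothetical; access transparency would additionally need a driver per storage type, which is left out here.

```python
class LogicalNameSpace:
    """Maps attributes -> global name -> local replica (illustrative only)."""

    def __init__(self):
        self._attributes = {}  # global name -> {attribute: value}
        self._replicas = {}    # global name -> [(resource, local path), ...]

    def register(self, global_name, attributes, replicas):
        self._attributes[global_name] = dict(attributes)
        self._replicas[global_name] = list(replicas)

    def find(self, **query):
        """Naming transparency: attribute values -> global names."""
        return sorted(
            name for name, attrs in self._attributes.items()
            if all(attrs.get(key) == value for key, value in query.items())
        )

    def locate(self, global_name, available):
        """Location transparency: global name -> replica on a reachable resource."""
        for resource, path in self._replicas[global_name]:
            if resource in available:
                return resource, path
        raise LookupError(f"no reachable replica of {global_name}")


ns = LogicalNameSpace()
ns.register(
    "/archive/email/msg-001",
    {"sender": "alice", "year": "1998"},
    [("tape-silo", "/hpss/a1/0001"), ("disk-cache", "/cache/0001")],
)
names = ns.find(year="1998")
replica = ns.locate(names[0], available={"disk-cache"})
```

Here the tape replica is listed first but is unreachable, so the lookup falls through to the disk copy without the caller knowing either physical location.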
28SDSC Storage Resource Broker Meta-data
Catalog Information Repository Abstraction
Application
Linux I/O
Web WSDL
DLL / Python
Java, NT Browsers
Prolog Predicate
Clients
Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase, SQLServer
Servers
SRM
29Knowledge Management - Characterizing Properties
of Collections
- Characterization of relationships between attributes
- Semantic / logical - cross-walks
- Procedural / temporal - records management
- Structural / spatial - GIS
- Characterization of operations needed to manipulate a concept space in a knowledge repository
- Mapping from collection attributes to discipline concepts
- Transformation from knowledge relationships to rules for application in inference engines
30Levels of Abstraction for Knowledge
Logical Relationship Schema
Physical RDF syntax
Abstraction for Digital Entity
Concept Space (ontology instance)
Digital Entity
Abstraction for Repository
Logical Knowledge Repository Schema
Physical Model-based Mediation System
Repository
Knowledge Repository
31Preservation of Data
- Migration
- Preserve the data bits
- Preserve the digital entity name
- Characterize the information and knowledge
content for presentation by new applications
32Managing Technology Evolution
- Data grids provide interoperability mechanisms to access data in multiple administration domains and multiple types of storage systems.
- Persistent archives migrate collections from old technology to new technology to support presentation on new systems
- Both require the ability to access heterogeneous systems
33Preservation - Data Grids
- Name transparency
- Find a file by attributes (map from attributes to global name)
- Location transparency
- Access a file by a global identifier (map from global to local file name)
- Access transparency
- Map from preferred API to access data mechanisms
- Preserve the ability to display the system
- Authenticity
- Disaster recovery, replicate data across storage systems
- Audit and process management
34SDSC Storage Resource Broker Meta-data
Catalog Common APIs
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP
Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
35Authenticity
- Guarantee that the digital entity has not been changed
- Collection owned entities, only accessible through the data handling system
- Support roles defining access (curation, owner, annotation, read)
- Support access controls mapping users to roles
- Audit trails that record all operations on digital entities
- Digital signatures - cryptographic checksums
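The authenticity mechanisms listed above (roles, access controls, audit trail, checksums) can be combined in a few lines. In this sketch the four role names come from the slide, but the permissions assigned to each role, the `Collection` class, and the user names are assumptions. SHA-256 is used as the cryptographic checksum; a true digital signature would additionally require a signing key.

```python
import hashlib

# Assumed permission sets for the roles named on the slide.
ROLE_PERMISSIONS = {
    "curation": {"read", "annotate", "write"},
    "owner": {"read", "annotate", "write"},
    "annotation": {"read", "annotate"},
    "read": {"read"},
}


class Collection:
    """Collection-owned entities, reachable only through these methods."""

    def __init__(self, user_roles):
        self._user_roles = user_roles  # user -> role
        self._data = {}
        self._digests = {}
        self.audit_trail = []          # (user, operation, name, allowed)

    def _check(self, user, operation, name):
        role = self._user_roles.get(user)
        allowed = operation in ROLE_PERMISSIONS.get(role, set())
        # Every attempt is recorded, including refused ones.
        self.audit_trail.append((user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def write(self, user, name, data):
        self._check(user, "write", name)
        self._data[name] = data
        self._digests[name] = hashlib.sha256(data).hexdigest()

    def read(self, user, name):
        self._check(user, "read", name)
        return self._data[name]

    def verify(self, name):
        """True if the stored bits still match the recorded checksum."""
        current = hashlib.sha256(self._data[name]).hexdigest()
        return current == self._digests[name]


archive = Collection({"ann": "owner", "raj": "read"})
archive.write("ann", "record-1", b"original bits")
content = archive.read("raj", "record-1")
```

Calling `archive.verify("record-1")` after any out-of-band change to the stored bytes would return `False`, which is the bit-level half of the authenticity guarantee.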
36SDSC Storage Resource Broker Meta-data
Catalog Preservation
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP
Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
37Emulation versus Migration
- Emulation
- Characterize processes that the display application uses to transform the digital entity to a visual representation
- Migration
- Characterize processes needed to transform to a new encoding format
- Both are forms of process management
38Self-Instantiating Archive
- Archive the processes that are used to arrange, describe, and preserve the digital entities
- Annotation of information content
- Conversion to archivable form
- When accessing the collection, retrieve the processes and the original digital entities
- Apply the processing steps to re-create the information content
- Query the result to discover desired digital objects
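The retrieve, re-apply, and query cycle of a self-instantiating archive can be sketched as follows. Everything here is illustrative: the step functions, the `STEP_REGISTRY`, and the JSON package layout are assumptions. Note that this sketch archives only the names of the processing steps; a real archive would also have to preserve the step definitions themselves.

```python
import json


def parse_headers(raw):
    """Annotate information content: 'Key: value' lines become attributes."""
    return dict(line.split(": ", 1) for line in raw.splitlines() if ": " in line)


def normalize_keys(record):
    """Convert to an archivable form: lower-case attribute names."""
    return {key.lower(): value for key, value in record.items()}


# Assumed lookup from archived step name to executable step.
STEP_REGISTRY = {"parse_headers": parse_headers, "normalize_keys": normalize_keys}


def build_archive(originals, step_names):
    """Store the original entities together with the processes applied to them."""
    return json.dumps({"processes": step_names, "originals": originals})


def instantiate(archive_text):
    """Re-create the information content by replaying the archived processes."""
    package = json.loads(archive_text)
    records = []
    for raw in package["originals"]:
        value = raw
        for name in package["processes"]:
            value = STEP_REGISTRY[name](value)
        records.append(value)
    return records


archive_text = build_archive(
    ["From: alice\nSubject: budget", "From: bob\nSubject: schedule"],
    ["parse_headers", "normalize_keys"],
)
collection = instantiate(archive_text)
hits = [record for record in collection if record["from"] == "bob"]
```

The derived records are never stored; they are regenerated from the originals each time the collection is instantiated.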
39From File-Based to Knowledge-Based Archives ...
- Conventional, file-based archives
- tape archives (.tar), optionally compressed (.Z, .gz, ...)
- integrity checks at bit-level (CRC, checksums, ...)
- self-extracting archive: add extraction script/code to archival package
- self-installing archive: like self-extracting archive, but also automatically execute installation script
40... From File-Based to Knowledge-Based Archives
...
- Collection- and Knowledge-Based Archives
- moving from files (raw data)
- ... via metadata descriptions to databases
- raw data + schema/attributes => encode information
- ... via semantic constraints to knowledge bases
- databases + rules => encode knowledge
- lifting bit-level integrity checks (CRC/checksum) to
- ... syntactic integrity, e.g., well-formed XML
- ... structural integrity, type consistency: valid XML (wrt. XML Schema)
- ... semantic integrity: valid databases (database satisfies the given semantic integrity constraints)
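The progression from bit-level to semantic integrity can be illustrated with four checks applied in order. This is a sketch using only the Python standard library, which has no XML Schema validator, so `structural_ok` is a hand-written stand-in for schema validation. The `<message>` layout and the `known_senders` constraint are assumptions.

```python
import zlib
import xml.etree.ElementTree as ET


def bit_level_ok(data, expected_crc):
    """Bit-level integrity: the CRC of the bytes matches the recorded CRC."""
    return zlib.crc32(data) == expected_crc


def syntactic_ok(data):
    """Syntactic integrity: the bytes parse as well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


def structural_ok(data, required=("sender", "date")):
    """Structural integrity (stand-in for schema validation)."""
    root = ET.fromstring(data)
    return root.tag == "message" and all(
        root.find(tag) is not None for tag in required)


def semantic_ok(data, known_senders):
    """Semantic integrity: the sender must be known to the collection."""
    return ET.fromstring(data).findtext("sender") in known_senders


def integrity_level(data, expected_crc, known_senders):
    """Return the highest integrity level the document satisfies."""
    if not bit_level_ok(data, expected_crc):
        return "none"
    if not syntactic_ok(data):
        return "bit-level"
    if not structural_ok(data):
        return "syntactic"
    if not semantic_ok(data, known_senders):
        return "structural"
    return "semantic"


doc = b"<message><sender>alice</sender><date>1999-01-02</date></message>"
level = integrity_level(doc, zlib.crc32(doc), known_senders={"alice"})
```

Each level presupposes the one below it, which is why the checks short-circuit in order.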
41... From File-Based to Knowledge-Based Archives
- Knowledge-Based Archives
- ... include semantic integrity constraints as part of the archive (could be in plain English: additional context information or other knowledge about the collection)
- Self-Validating Archive
- ... add a validator to the archive, e.g., semantic integrity as logic rules, with a logic engine as the validator (e.g., Datalog/Prolog engine)
- => allows the future information user to understand the raw data, the rules (context information), and detect rule exceptions, etc.
- Self-Instantiating Archive
- ... similar to the self-installing archive, but at the information/knowledge level (not file level): allows the archival ingestion process to be recreated at a later time ("looking over the archivist's shoulder")
- ... can include self-validation steps
42(Simplified) Anatomy of a Self-Validating,
Self-Instantiating Archive
- rule engine
- rules for semantic integrity constraints => validation code
- rules for ingestion transformations => re-instantiation code
- collections
- files
- bits
rule engine
instantiation rules
validation rules
collections
files
bits
43Archival Ingestion Network (Pipeline)
- Processing steps = database transformations t
- t: Source-Format & Schema -> Target-Format & Schema
- if t is invertible => no information is lost
- automate t using DB query & transformation languages
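The claim that an invertible transformation loses no information can be checked mechanically with a round trip. The source and target schemas and the sample names below are made up for illustration. A round trip over samples is evidence of invertibility, not a proof.

```python
def to_target(record):
    """t: a source 'Last, First' name field becomes two target fields."""
    last, first = record["name"].split(", ", 1)
    return {"last_name": last, "first_name": first}


def to_source(record):
    """Inverse of t: rebuild the single source name field."""
    return {"name": f'{record["last_name"]}, {record["first_name"]}'}


def lossy(record):
    """A non-invertible transformation: the first name is dropped."""
    return {"last_name": record["name"].split(", ", 1)[0]}


def is_lossless(t, t_inverse, samples):
    """True if applying t and then its inverse restores every sample."""
    return all(t_inverse(t(sample)) == sample for sample in samples)


samples = [{"name": "Smith, Ann"}, {"name": "Smith, Bob"}]
round_trip_ok = is_lossless(to_target, to_source, samples)

# Two distinct sources collapse to one target, so no inverse can exist.
collision = lossy(samples[0]) == lossy(samples[1])
```

A pipeline built only from steps that pass this kind of check can, in principle, be run backwards to recover the original form.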
44Open Archival Information System (OAIS) Model
- Ingest
- receive, quality-assure SIPs generate AIPs
- Archive
- store, refresh AIPs
- Manage
- populate, maintain schemas, views, ICs access, update DI
- Access
- discover, describe, locate, upload DIPs
45Knowledge Creation Roadmap
- Knowledge syntax (consensus)
- RDF, XMI, Topic Map
- Knowledge management (recursive operations)
- Oracle parallel database
- Knowledge manipulation (spatial/procedural rules)
- Generation of inference rules and mapping to data models
- Knowledge generation (scalable inference engine)
- Application of inference rules in inference engine
46Knowledge Based Persistent Archive
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
Rule-based Access
Information Repository
Attribute- based Query
Attributes Semantics
Information
Encoding standards
Query Mechanisms
Data Grid
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Feature-based Query
47Persistent Archives
- Storage system abstraction
- Logical name space and entity manipulation
- Information repository abstraction
- Logical schema and physical table structure
- Knowledge repository abstraction
- Topic maps and inference rules
- Digital entity abstraction
- Data model and encoding format
48Archival Processes
- Appraisal - determine the archivable content
- Accession - determine the initial physical location for the data, and the relationship of the new collection to existing collections
- Arrangement - add administration control, describe the information content (provenance, authenticity, structure, administrative), and decompose digital objects into their components as needed
- Description - complete the definition of collection attributes by iterating between arrangement, reformatting, and representation
- Preservation - build an archivable form of the digital entities, characterize the collection context, and manage their storage
- Access - provide query mechanisms for discovering, retrieving, and presenting the digital entities
- Re-purposing - apply archival processes to build a new collection context
50NARA Prototype
- Demonstrate ability to ingest, archive, recreate, query, and present a digital object from a 1 million record E-mail collection (RFC1036)
- 2.5 GB of data
- 6 required fields
- 13 optional fields
- User defined fields (over 1000)
- Determine resources required to scale size of collection
52XML DTD for E-mail
53Formatted Message Using XML DTD
54Web-based Interface for Accessing the E-mail
Collection
55Automation of Ingestion Process
- Application of an Accessioning Template
- Defines the concepts, policies or acceptance of the collection
- Creation of attributes that represent the accessioning template concepts
- Analysis of attributes for anomalies and implied inherent knowledge
56Information Generation Processes
- Create occurrence index
- (Occurrence, attribute, value)
- This is needed to be able to recreate original form of digital object
- Analyze completeness of information
- Inverse index of attribute values
- Identifies unexpected values - consistency
- Analyze closure of collection
- Are additional concepts needed to represent inverse index value ranges?
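The occurrence index and its inverse can be sketched directly from the (occurrence, attribute, value) triple described above. The helper names and the sample records are hypothetical, and the sketch assumes each attribute appears at most once per record and that no record is empty.

```python
from collections import defaultdict


def occurrence_index(records):
    """Flatten records into (occurrence, attribute, value) triples."""
    return [(i, attr, value)
            for i, record in enumerate(records)
            for attr, value in record.items()]


def recreate(triples):
    """Rebuild the original records from the occurrence index."""
    records = defaultdict(dict)
    for occurrence, attr, value in triples:
        records[occurrence][attr] = value
    return [records[i] for i in sorted(records)]


def inverse_index(triples):
    """Map each (attribute, value) pair to the occurrences that carry it."""
    index = defaultdict(list)
    for occurrence, attr, value in triples:
        index[(attr, value)].append(occurrence)
    return dict(index)


def unexpected_values(triples, expected):
    """Consistency check: values outside the expected range of an attribute."""
    return sorted({(attr, value) for _, attr, value in triples
                   if attr in expected and value not in expected[attr]})


records = [
    {"sender": "alice", "year": "1998"},
    {"sender": "bob", "year": "1998"},
    {"sender": "carol", "year": "1898"},
]
triples = occurrence_index(records)
restored = recreate(triples)
by_value = inverse_index(triples)
anomalies = unexpected_values(triples, {"year": {"1997", "1998", "1999"}})
```

The inverse index makes the range of each attribute visible at a glance, so a value such as the year "1898" stands out as a candidate anomaly.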
57Ingestion Processes for Collection
Aggregation of original objects into
containers for storage
Data Organization
Data Storage
58Ingestion Processes for Collection
Migration of objects into a standard
representation
Information Generation
Attribute Tagging
Attribute Selection
Data Organization
Collection Storage
59Ingestion Processes for Collection
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection Storage
60Application of Anomaly Detection to Thomas
Collection
- List of bills, amendments, orders sponsored by each Senator in a session of Congress
- The processing rule used to describe senators is an example of inherent knowledge within the collection
- By building occurrence tables, one can differentiate between knowledge relationships and anomalies or artifacts
61Example Ingestion Network Senate Collection
62Information Modeling in Knowledge-Based Archival - Senate Example
Data provider says: "Please archive all records of legislative activities of the 106th senate!"
Integrity constraints (Logic Rules):
(1) senators_with_file = UNION(sponsor, cosponsors, submitted_by)
(2) senators = sponsors UNION co-sponsors
Violation: the rhs is strictly larger than the lhs!
Exceptions: (Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) Explanations: senators who joined (Zell), passed away (Chafee), were forgotten (Gramm)!?
Checking ICs: IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))
General form: IF condition THEN action
Action: LOG, WARN, ABORT, ...
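The rule "IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))" can be sketched as a plain function. The three exception names are taken from the slide; the bill records, the two senators with files, and the function `check_sponsors` are made up for illustration. The sketch also applies the rule to co-sponsors and implements only the LOG and ABORT actions.

```python
def check_sponsors(senators_with_file, bills, action="LOG"):
    """Apply the integrity constraint to every sponsor and co-sponsor."""
    exception_log = []
    for bill in bills:
        for name in [bill["sponsor"], *bill["cosponsors"]]:
            if name in senators_with_file:
                continue  # constraint satisfied for this name
            if action == "ABORT":
                raise ValueError(f"no senator record for {name}")
            entry = ("missing_senator_info", name)
            if entry not in exception_log:  # log each exception once
                exception_log.append(entry)
    return exception_log


# Hypothetical data: two senators have files, three named people do not.
senators_with_file = {("Example", "Ann"), ("Sample", "Bob")}
bills = [
    {"id": "bill-1", "sponsor": ("Example", "Ann"),
     "cosponsors": [("Chafee", "John")]},
    {"id": "bill-2", "sponsor": ("Gramm", "Phil"),
     "cosponsors": [("Sample", "Bob"), ("Miller", "Zell")]},
    {"id": "bill-3", "sponsor": ("Miller", "Zell"), "cosponsors": []},
]
exceptions = check_sponsors(senators_with_file, bills)
```

Logging rather than aborting lets ingestion finish and leaves the decision (anomaly, or knowledge the archivist lacked) to a later review of the exception log.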
63Senator Naming Constraints
- A senator's name can appear only once on a bill
- Senator specified by
- Last name
- Last name and state
- Last name, state, and first name
- Detected anomaly: page 205 of an RTF file was replicated.
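Both naming constraints can be expressed as short checks: one flags a name that appears more than once on a bill (the symptom a replicated page would produce), the other finds the shortest of the three name forms that identifies a senator uniquely. The roster, the names, and the helper functions are hypothetical.

```python
from collections import Counter


def duplicate_names(bill_names):
    """Names that appear more than once on a single bill."""
    return sorted(name for name, count in Counter(bill_names).items()
                  if count > 1)


def shortest_unique_key(senator, roster):
    """Shortest of (last), (last, state), (last, state, first) that is unique."""
    for length in (1, 2, 3):
        key = senator[:length]
        if sum(1 for other in roster if other[:length] == key) == 1:
            return key
    raise ValueError(f"{senator} is not unique in the roster")


# Roster entries are (last name, state, first name); all values are made up.
roster = [
    ("Smith", "OR", "Ann"),
    ("Smith", "NH", "Bob"),
    ("Jones", "TX", "Cy"),
    ("Lee", "CA", "Al"),
    ("Lee", "CA", "Bo"),
]
repeated = duplicate_names(["Smith (OR)", "Jones", "Smith (OR)"])
key_for_jones = shortest_unique_key(("Jones", "TX", "Cy"), roster)
key_for_smith = shortest_unique_key(("Smith", "OR", "Ann"), roster)
```

A duplicate reported by the first check is not necessarily an error in the legislation; as on the slide, it may be an artifact of the source file.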
64Persistent Collection
- Define context for archiving data - annotate information content
- Create archivable form - standard encoding format
- Archive information content along with data
- Test closure of the collection - all digital objects that can be discovered in the collection are members of the collection
- Test completeness of the collection - inherent relationships within the collection can be cast in terms of attributes generated from the annotated information
- Differentiate between inherent knowledge and anomalies / artifacts
65Growing Community Interactions
- Mass Storage
- IEEE Mass storage system technical committee
- High performance computing
- NSF National Partnership for Advanced Computational Infrastructure - scalable computing
- Digital Library
- DLI2 - UCB, Stanford, UCSB - interoperability
- NSDL - OAI metadata harvesting, metadata standards
- Data Grid
- Global Grid Forum - infrastructure independence
- Persistent Archive
- InterPARES, records management, OAIS standard
66Collaborations
- Digital Libraries
- DLI2 - InterLib, CDL
- NSF NSDL - UCAR / DLESE
- NASA Information Power Grid
- DOE ASCI Data Visualization Corridor
- Astronomy
- National Virtual Observatory (NSF)
- 2MASS Project (2 Micron All Sky Survey)
- Particle Physics
- Particle Physics Data Grid (DOE)
- GriPhyN (NSF)
- BaBar (DOE)
- Medicine
- Digital Embryo (NLM)
- Earth Systems Sciences
- ESIPS (NASA)
- LTER (NSF)
- Persistent Archives
- NARA
67Evolution of Persistent Archives
- Preservation of cultural history
- Archival processes
- Preservation of documents and electronic records
- Authenticity
- Preservation of intellectual capital
- Characterization of information and knowledge content
- Archival life cycle
- Re-purposing of collections to facilitate discovery
- Promotion of re-use of archived data
- Self-instantiating archives
68Further Information
http://www.npaci.edu/DICE