Preservation of Data and Records in a Knowledge-based Society - PowerPoint PPT Presentation

1
Preservation of Data and Records in a
Knowledge-based Society
Reagan W. Moore
San Diego Supercomputer Center
moore_at_sdsc.edu
http://www.npaci.edu/DICE/
2
Data and Knowledge Systems Group
  • Staff
  • Reagan Moore
  • Ilkai Altintas
  • Chaitan Baru
  • Sheau Yen Chen
  • Charles Cowart
  • Tony Fountain
  • Amarnath Gupta
  • Arun Jagatheesan
  • George Kremenek
  • Mevlut Kurul
  • Bertram Ludäscher
  • Richard Marciano
  • XuFei Qian
  • Roman Olshanowsky
  • Arcot Rajasekar
  • Abe Singer
  • Michael Wan
  • Ilya Zaslavsky
  • Graduate Students
  • A. Behere
  • M. Dortenzio
  • H. Jasso
  • M. Memon
  • H. Shin
  • L. Sui
  • G. Wang
  • Undergraduate Interns
  • N. Cotofana
  • D. Le
  • J. Tran
  • /- NN

3
Topics
  • Historical perspective - innovation sources
  • Persistent archive infrastructure approaches
  • Digital entities - data, information, knowledge
  • Technology evolution - levels of abstraction
  • Automation of archival processes - data grids
  • Access - exposing information and knowledge

4
Research Objectives
  • Scalability
  • Automation of archival processes
  • Technology evolution management
  • Infrastructure independence
  • Levels of abstraction
  • Access
  • Information based discovery
  • Knowledge based discovery

5
Original Expertise
  • 1998 - NSF DLI1 digital library (UCB, U Michigan,
    UCSB)
  • Integration of archival storage behind collection
    catalog
  • Bulk metadata management
  • 1990 - High performance computing
  • Parallel computing technology
  • Current system is a 1.7 Tflops SP cluster
  • 1986 - Mass storage systems
  • Migrated all data forward in time across
  • 6 CPU platforms
  • 3 mass storage systems - DataTree, UniTree, HPSS
  • 6 types of tape media - 3480, 3490, 3490E, 3590,
    3590E, 9940B
  • Current capacity is 6 PBs holding 415 TBs of data

6
Original Project
  • 1998 - NARA supplement to the DARPA/USPTO
    Distributed Object Computation Testbed (DOCT)
  • Scalability
  • Demonstrated archiving of a 1-million E-mail
    collection
  • 1997 - DOCT built a patent digital library for
    the USPTO
  • Scalability - 2 million patent collection
  • Transformative migration from Greenbook to SGML
  • Storage Resource Broker (data grid) for
    replicating data across sites

7
(No Transcript)
8
Initial Concepts
  • Provided separate platforms for archival
    processes
  • Created infrastructure independent representation
    for all components of persistent archive
  • Digital entity data format
  • Storage repository
  • Information repository

9
ERA Concept model
10
Infrastructure Independence
  • Emulation
  • Migrate the display application to new operating
    systems, preserving the look and feel of the
    technology used to create the digital entity
  • Migration
  • Migrate the digital entity encoding format to a
    new standard to enable more sophisticated queries
    on the information and knowledge content
  • Are these variants of a continuum of approaches?

11
Presentation of Digital Entities
Application
Operating System
Storage System
Display System
Digital Entity
12
Technology Management - Emulation
Old Application
Wrap Application
New Operating System
New Storage System
New Display System
Digital Entity
13
Technology Management
Old Application
Add Operating System Call
New Operating System
New Storage System
New Display System
Digital Object
14
Technology Management
Old Application
Add Operating System Call
New Operating System
Add Operating System Call
Old Storage System
Old Display System
Digital Entity
15
Technology Management - Migration
New Application
New Operating System
New Storage System
New Display System
Migrate Encoding Format
Digital Entity
16
Technology Management - SDSC
New Application
New Operating System
Wrap Storage System
Wrap Display System
Old Storage System
Old Display System
Migrate Encoding Format
Digital Entity
17
Migration Advantages
  • By migrating the digital entity encoding format
    to new standards, more sophisticated technologies
    can be applied to express the information and
    knowledge content inherent in collections of
    digital entities.
  • Queries can be made on the annotated information
  • Analyses can be done on the annotated knowledge
    to identify anomalies and artifacts

18
Specifying Levels of Abstraction
  • Technology management becomes simpler if the
    persistent archive infrastructure operates on
    abstractions, rather than an explicit physical
    implementation of a resource
  • Need abstractions for
  • Digital entities
  • Repositories
  • Can generic infrastructure be created that
    provides infrastructure independence for data,
    information, and knowledge management?

19
Differentiating between Data, Information, and
Knowledge
  • Data
  • Digital entity
  • Entities are streams of bits
  • Information
  • Any semantic label.
  • Attributes are the semantic label and the
    associated data.
  • Attributes may be tagged data within the digital
    object, or tagged data that is associated with
    the digital object
  • Knowledge
  • Relationships between attributes or semantic
    labels
  • Relationships can be procedural/temporal,
    structural/spatial, logical/semantic,
    functional/algorithmic
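
The three layers above can be sketched as plain data structures. This is a minimal illustration, not the presenters' data model: the class names (DigitalEntity, Attribute, Relationship, AnnotatedEntity) and the sample record are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DigitalEntity:
    """Data: an uninterpreted stream of bits."""
    bits: bytes


@dataclass(frozen=True)
class Attribute:
    """Information: a semantic label plus the data it tags."""
    label: str
    value: str


@dataclass(frozen=True)
class Relationship:
    """Knowledge: a typed relationship between two semantic labels."""
    kind: str      # e.g. "structural/spatial" or "procedural/temporal"
    source: str    # semantic label
    target: str    # semantic label


@dataclass
class AnnotatedEntity:
    entity: DigitalEntity
    attributes: list[Attribute] = field(default_factory=list)
    relationships: list[Relationship] = field(default_factory=list)

    def labels(self) -> set[str]:
        return {a.label for a in self.attributes}

    def dangling_relationships(self) -> list[Relationship]:
        """Relationships that mention a label with no attribute behind it."""
        known = self.labels()
        return [r for r in self.relationships
                if r.source not in known or r.target not in known]


# Hypothetical sample: the attributes are tagged data inside the object.
record = AnnotatedEntity(
    entity=DigitalEntity(b"From: a@example.org\nSubject: hello\n\nbody"),
    attributes=[Attribute("From", "a@example.org"),
                Attribute("Subject", "hello")],
    relationships=[Relationship("structural/spatial", "From", "Subject")],
)
```

Keeping the bits separate from the attributes and relationships reflects the point that the bits alone cannot be interpreted or displayed.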

20
Digital Entities
  • Digital entities are images of reality, made of
  • Data, the bits (zeros and ones) put on a storage
    system
  • Information, the attributes used to assign
    semantic meaning to the data
  • Knowledge, the structural relationships described
    by a data model
  • Every digital entity requires information and
    knowledge to correctly interpret and display

21
Types of Digital Entity Abstractions
  • Differentiate between a digital entity and its
    storage repository
  • Logical representation
  • What naming conventions are used to assign
    semantic meaning?
  • Physical representation
  • What is the physical structure of the digital
    entity?

22
Levels of Abstraction for Data
Logical Semantics (units, attributes)
Physical Data Model (syntax, structure)
Abstraction for Digital Entity
Digital Entity
Files
Abstraction for Repository
Logical Name Space
Physical Data Handling System - SRB/MCAT
Repository
File System, Archive
23
Storage Repository Abstraction
  • Set of operations that can be performed to
    manipulate digital entities
  • Example - Storage Resource Broker
  • Logical name space
  • Storage repository abstraction
  • Information repository abstraction

24
SDSC Storage Resource Broker & Meta-data Catalog - Storage Repository Abstraction
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
25
Information Management
  • Abstraction layer for the operations needed to
    manipulate a catalog in a database
  • Bulk metadata manipulation
  • Automated SQL generation
  • Separation of the schema used for the catalog
    from the schema used for the information
    repository
  • Schema extension
  • User defined attributes
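
A minimal sketch of bulk metadata loading and automated SQL generation over user-defined attributes, using SQLite from the Python standard library. The two-table layout, the function names (bulk_register, build_query) and the sample entries are assumptions for illustration; they are not the MCAT schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entity (id INTEGER PRIMARY KEY, logical_name TEXT UNIQUE);
    CREATE TABLE attribute (entity_id INTEGER REFERENCES entity(id),
                            name TEXT, value TEXT);
""")


def bulk_register(rows):
    """Bulk metadata load: rows are (logical_name, {attribute: value})."""
    for logical_name, attrs in rows:
        cur = conn.execute("INSERT INTO entity (logical_name) VALUES (?)",
                           (logical_name,))
        conn.executemany(
            "INSERT INTO attribute (entity_id, name, value) VALUES (?, ?, ?)",
            [(cur.lastrowid, k, v) for k, v in attrs.items()])


def build_query(conditions):
    """Generate SQL that matches entities carrying every given attribute."""
    sql = "SELECT e.logical_name FROM entity e"
    params = []
    for i, (name, value) in enumerate(sorted(conditions.items())):
        sql += (f" JOIN attribute a{i} ON a{i}.entity_id = e.id"
                f" AND a{i}.name = ? AND a{i}.value = ?")
        params += [name, value]
    return sql + " ORDER BY e.logical_name", params


# Hypothetical entries; new attribute names need no schema change.
bulk_register([
    ("/demo/msg1", {"sender": "alice", "year": "1998"}),
    ("/demo/msg2", {"sender": "bob", "year": "1998"}),
])
sql, params = build_query({"year": "1998", "sender": "alice"})
matches = [row[0] for row in conn.execute(sql, params)]
```

Storing attributes as rows rather than columns is one way to get schema extension; the caller states conditions as a dictionary and never writes SQL.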

26
Levels of Abstraction for Information
Logical Collection Schema
Physical XML Syntax
Abstraction for Digital Entity
Digital Entity
Metadata Attributes
Abstraction for Repository
Logical Database Schema
Physical EMCAT/CWM
Repository
Database
27
Logical Name Space
  • Naming transparency - find a digital entity
    without knowing its name
  • Map from attributes to a global file name
  • Location transparency - access a digital entity
    without knowing where it is
  • Map from global file name to local file name
  • Access transparency - access a digital entity
    without knowing the type of storage system
  • Federated client-server architecture
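
The two mappings above can be sketched as a pair of dictionaries. This is an illustration under stated assumptions, not the Storage Resource Broker API: the class, its methods and the sample paths are hypothetical, and access transparency (hiding the storage protocol) is left out.

```python
class LogicalNameSpace:
    def __init__(self):
        self._attributes = {}   # global name -> {attribute: value}
        self._replicas = {}     # global name -> [(storage system, local name)]

    def register(self, global_name, attributes, replicas):
        self._attributes[global_name] = dict(attributes)
        self._replicas[global_name] = list(replicas)

    def find(self, **conditions):
        """Naming transparency: map attributes to global names."""
        return sorted(
            name for name, attrs in self._attributes.items()
            if all(attrs.get(k) == v for k, v in conditions.items()))

    def locate(self, global_name):
        """Location transparency: map a global name to local names."""
        return list(self._replicas[global_name])

    def migrate(self, global_name, old_system, new_replica):
        """Replace a replica without changing the global name."""
        self._replicas[global_name] = [
            new_replica if system == old_system else (system, local)
            for system, local in self._replicas[global_name]]


ns = LogicalNameSpace()
ns.register("/archive/patents/0001",
            {"collection": "patents", "year": "1997"},
            [("unitree", "/u/patents/0001.sgml")])
ns.migrate("/archive/patents/0001", "unitree",
           ("hpss", "/hpss/patents/0001.sgml"))
found = ns.find(collection="patents")
```

Because clients only hold the global name, moving the file to a new storage system changes one catalog entry and nothing on the client side.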

28
SDSC Storage Resource Broker & Meta-data Catalog - Information Repository Abstraction
Application
Linux I/O
Web WSDL
DLL / Python
Java, NT Browsers
Prolog Predicate
Clients

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase, SQLServer
Servers
SRM
29
Knowledge Management - Characterizing Properties
of Collections
  • Characterization of relationships between
    attributes
  • Semantic / logical - cross-walks
  • Procedural / temporal - records management
  • Structural / spatial - GIS
  • Characterization of operations needed to
    manipulate a concept space in a knowledge
    repository
  • Mapping from collection attributes to discipline
    concepts
  • Transformation from knowledge relationships to
    rules for application in inference engines

30
Levels of Abstraction for Knowledge
Logical Relationship Schema
Physical RDF syntax
Abstraction for Digital Entity
Concept Space (ontology instance)
Digital Entity
Abstraction for Repository
Logical Knowledge Repository Schema
Physical Model-based Mediation System
Repository
Knowledge Repository
31
Preservation of Data
  • Migration
  • Preserve the data bits
  • Preserve the digital entity name
  • Characterize the information and knowledge
    content for presentation by new applications

32
Managing Technology Evolution
  • Data grids provide interoperability mechanisms to
    access data in multiple administration domains
    and multiple types of storage systems.
  • Persistent archives migrate collections from old
    technology to new technology to support
    presentation on new systems
  • Both require the ability to access heterogeneous
    systems

33
Preservation - Data Grids
  • Name transparency
  • Find a file by attributes (map from attributes to
    global name)
  • Location transparency
  • Access a file by a global identifier (map from
    global to local file name)
  • Access transparency
  • Map from preferred API to access data mechanisms
  • Preserve the ability to display the system
  • Authenticity
  • Disaster recovery, replicate data across storage
    systems
  • Audit and process management

34
SDSC Storage Resource Broker & Meta-data Catalog - Common APIs
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
35
Authenticity
  • Guarantee that the digital entity has not been
    changed
  • Collection owned entities, only accessible
    through the data handling system
  • Support roles defining access (curation, owner,
    annotation, read)
  • Support access controls mapping users to roles
  • Audit trails that record all operations on
    digital entities
  • Digital signatures - cryptographic checksums
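
A minimal sketch of how these mechanisms fit together: entities are reachable only through the collection object, users map to roles, every request lands in an audit trail, and a stored SHA-256 digest is re-checked on read. The role names come from the slide; the operations granted to each role, the class and its methods are assumptions. A bare digest detects change but is not a digital signature, which would also need a signing key.

```python
import hashlib
from datetime import datetime, timezone

# Assumed mapping from the slide's roles to permitted operations.
ROLE_OPERATIONS = {
    "read": {"read"},
    "annotation": {"read", "annotate"},
    "curation": {"read", "annotate", "replace"},
    "owner": {"read", "annotate", "replace", "delete"},
}


class Collection:
    def __init__(self, user_roles):
        self._user_roles = dict(user_roles)
        self._store = {}       # name -> (bits, SHA-256 hex digest)
        self.audit_trail = []  # (timestamp, user, operation, name, allowed)

    def _authorize(self, user, operation, name):
        role = self._user_roles.get(user)
        allowed = operation in ROLE_OPERATIONS.get(role, set())
        self.audit_trail.append(
            (datetime.now(timezone.utc), user, operation, name, allowed))
        if not allowed:
            raise PermissionError(f"{user} may not {operation} {name}")

    def ingest(self, user, name, bits):
        self._authorize(user, "replace", name)
        self._store[name] = (bits, hashlib.sha256(bits).hexdigest())

    def read(self, user, name):
        self._authorize(user, "read", name)
        bits, digest = self._store[name]
        if hashlib.sha256(bits).hexdigest() != digest:
            raise ValueError(f"checksum mismatch for {name}")
        return bits


archive = Collection({"carol": "curation", "rita": "read"})
archive.ingest("carol", "memo.txt", b"original")
memo = archive.read("rita", "memo.txt")
```

Denied requests are recorded before the exception is raised, so the audit trail covers attempted as well as completed operations.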

36
SDSC Storage Resource Broker & Meta-data Catalog - Preservation
Application
Linux I/O
Web WSDL
Access APIs
DLL / Python
Java, NT Browsers
GridFTP

Consistency Management / Authorization-Authentication
Prime Server
Logical Name Space
Latency Management
Data Transport
Metadata Transport
Storage Abstraction
Catalog Abstraction
Databases DB2, Oracle, Sybase
Servers
HRM
37
Emulation versus Migration
  • Emulation
  • Characterize processes that the display
    application uses to transform the digital entity
    to a visual representation
  • Migration
  • Characterize processes needed to transform to a
    new encoding format
  • Both are forms of process management

38
Self-Instantiating Archive
  • Archive the processes that are used to arrange,
    describe, and preserve the digital entities
  • Annotation of information content
  • Conversion to archivable form
  • When accessing the collection, retrieve the
    processes and the original digital entities
  • Apply the processing steps to re-create the
    information content
  • Query the result to discover desired digital
    objects
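
The retrieve, re-apply and query cycle can be sketched as follows. This is a simplified illustration: the step registry, the JSON package layout and the function names are assumptions, and only the step names are archived here, whereas a real archive would also have to preserve the step implementations.

```python
import json

# Registry of named processing steps; the names are what gets archived.
STEPS = {
    "decode_utf8": lambda raw: raw.decode("utf-8"),
    "split_header": lambda text: dict(
        line.split(": ", 1) for line in text.splitlines() if ": " in line),
}


def ingest(originals, step_names):
    """Archive the originals together with the steps used to describe them."""
    return json.dumps({
        "steps": step_names,
        "originals": [raw.hex() for raw in originals],
    })


def instantiate(package):
    """Re-create the information content by replaying the archived steps."""
    archived = json.loads(package)
    results = []
    for hex_bits in archived["originals"]:
        value = bytes.fromhex(hex_bits)
        for name in archived["steps"]:
            value = STEPS[name](value)
        results.append(value)
    return results


# Hypothetical two-message collection.
package = ingest([b"From: alice\nSubject: budget",
                  b"From: bob\nSubject: lunch"],
                 ["decode_utf8", "split_header"])
collection = instantiate(package)
hits = [m for m in collection if m["From"] == "alice"]
```

The query runs against the re-created attributes, not against the stored bits, which is what lets the derived form be rebuilt on whatever technology is current at access time.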

39
From File-Based to Knowledge-Based Archives ...
  • Conventional, file-based archives
  • tape archives (.tar), optionally compressed (.Z,
    .gz, ...)
  • integrity checks at bit-level (CRC,
    checksums,...)
  • self-extracting archive: add extraction
    script/code to archival package
  • self-installing archive: like self-extracting
    archive, but also automatically execute
    installation script

40
... From File-Based to Knowledge-Based Archives
...
  • Collection- and Knowledge-Based Archives
  • moving from files (raw data)
  • ... via metadata descriptions to databases
  • raw data + schema/attributes => encode
    information
  • ... via semantic constraints to knowledge bases
  • databases + rules => encode knowledge
  • lifting bit-level integrity checks (CRC/checksum)
    to
  • ... syntactic integrity, e.g., well-formed XML
  • ... structural integrity, type consistency: valid
    XML (wrt. XML Schema)
  • ... semantic integrity: valid databases (database
    satisfies the given semantic integrity
    constraints)
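
The ladder of integrity checks can be sketched with the Python standard library. This is an approximation under stated assumptions: the function name and the sample document are hypothetical, and because the standard library has no XML Schema validator, the structural level is reduced to checking that required child elements exist.

```python
import zlib
import xml.etree.ElementTree as ET


def check_levels(raw, expected_crc, required_children, constraint):
    """Return the name of the first failed integrity level, or 'ok'."""
    if zlib.crc32(raw) != expected_crc:
        return "bit-level"
    try:
        root = ET.fromstring(raw)          # well-formedness check
    except ET.ParseError:
        return "syntactic"
    if any(root.find(tag) is None for tag in required_children):
        return "structural"                # stand-in for schema validation
    if not constraint(root):
        return "semantic"
    return "ok"


def sponsor_is_not_cosponsor(root):
    """Example semantic constraint on a hypothetical <bill> document."""
    cosponsors = [c.text for c in root.findall("cosponsor")]
    return root.findtext("sponsor") not in cosponsors


doc = b"<bill><sponsor>Smith</sponsor><cosponsor>Smith</cosponsor></bill>"
result = check_levels(doc, zlib.crc32(doc), ["sponsor"],
                      sponsor_is_not_cosponsor)
```

The sample document passes the checksum, is well-formed and has the required element, yet still fails: only the semantic level can see that the same name fills both roles.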

41
... From File-Based to Knowledge-Based Archives
  • Knowledge-Based Archives
  • ... include semantic integrity constraints as
    part of the archive (could be in plain English:
    additional context information or other knowledge
    about the collection)
  • Self-Validating Archive
  • ... add a validator to the archive, e.g.,
    semantic integrity as logic rules, validator =
    logic engine (e.g., Datalog/Prolog engine)
  • => allows the future information user to
    understand the raw data, the rules (context
    information), and detect rule exceptions, etc.
  • Self-Instantiating Archive
  • ... similar to the self-installing archive, but
    at the information/knowledge level (not file
    level); allows the archival ingestion process
    to be recreated at a later time ("looking over
    the archivist's shoulder")
  • ... can include self-validation steps

42
(Simplified) Anatomy of a Self-Validating,
Self-Instantiating Archive
  • rule engine
  • rules for semantic integrity constraints =>
    validation code
  • rules for ingestion transformations =>
    re-instantiation code
  • collections
  • files
  • bits

rule engine
instantiation rules
validation rules
collections
files
bits
43
Archival Ingestion Network (Pipeline)
  • Processing Steps = Database Transformations t
  • t: Source-Format/Schema -> Target-Format/Schema
  • if t is invertible => no information is lost
  • automate t using DB query/transformation
    languages
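
One way to probe whether a transformation t loses information is a round trip through a candidate inverse. The transformations, function names and samples below are assumptions for illustration, and the samples are assumed to contain only "Key: value" lines. Passing the check on samples is evidence of invertibility, not a proof.

```python
def to_pairs(text):
    """t: 'Key: value' lines -> list of (key, value) pairs."""
    return [tuple(line.split(": ", 1)) for line in text.split("\n")]


def from_pairs(pairs):
    """Candidate inverse of t: pairs -> the original lines."""
    return "\n".join(f"{key}: {value}" for key, value in pairs)


def to_lowercase_dict(text):
    """A lossy alternative: drops key case and repeated keys."""
    return {k.lower(): v for k, v in to_pairs(text)}


def is_lossless(t, t_inverse, samples):
    """True if every sample survives the round trip unchanged."""
    return all(t_inverse(t(s)) == s for s in samples)


samples = ["From: alice\nTo: bob", "Received: a\nReceived: b"]
lossless = is_lossless(to_pairs, from_pairs, samples)
lossy = is_lossless(to_lowercase_dict,
                    lambda d: from_pairs(d.items()), samples)
```

The ordered list of pairs survives the round trip, while the dictionary form does not, because it discards letter case and collapses the repeated Received key.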

44
Open Archival Information System (OAIS) Model
  • Ingest
  • receive, quality-assure SIPs; generate AIPs
  • Archive
  • store, refresh AIPs
  • Manage
  • populate, maintain schemas, views, ICs; access,
    update DI
  • Access
  • discover, describe, locate, upload DIPs

45
Knowledge Creation Roadmap
  • Knowledge syntax (consensus)
  • RDF, XMI, Topic Map
  • Knowledge management (recursive operations)
  • Oracle parallel database
  • Knowledge manipulation (spatial/procedural rules)
  • Generation of inference rules and mapping to data
    models
  • Knowledge generation (scalable inference engine)
  • Application of inference rules in inference
    engine

46
Knowledge Based Persistent Archive
Ingest Services
Management
Access Services
Knowledge or Topic-Based Query / Browse
Knowledge Repository for Rules
Relationships Between Concepts
Knowledge
Rule-based Access
Information Repository
Attribute-based Query
Attributes Semantics
Information
Encoding standards
Query Mechanisms
Data Grid
Data
Fields Containers Folders
Storage (Replicas, Persistent IDs)
Feature-based Query
47
Persistent Archives
  • Storage system abstraction
  • Logical name space and entity manipulation
  • Information repository abstraction
  • Logical schema and physical table structure
  • Knowledge repository abstraction
  • Topic maps and inference rules
  • Digital entity abstraction
  • Data model and encoding format

48
Archival Processes
  • Appraisal - determine the archivable content
  • Accession - determine the initial physical
    location for the data, and the relationship of
    the new collection to existing collections
  • Arrangement - add administration control,
    describe the information content (provenance,
    authenticity, structure, administrative), and
    decompose digital objects into their components
    as needed.
  • Description - complete the definition of
    collection attributes by iterating between
    arrangement, reformatting, and representation.
  • Preservation - build an archivable form of the
    digital entities, characterize the collection
    context, and manage their storage
  • Access - provide query mechanisms for
    discovering, retrieving, and presenting the
    digital entities.
  • Re-purposing - apply archival processes to build
    a new collection context

49
(No Transcript)
50
NARA Prototype
  • Demonstrate ability to ingest, archive, recreate,
    query, and present a digital object from a 1
    million record E-mail collection (RFC1036)
  • 2.5 GB of data
  • 6 required fields
  • 13 optional fields
  • User defined fields (over 1000)
  • Determine resources required to scale size of
    collection
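
A minimal sketch of tagging one message as XML and checking the required fields, using only the Python standard library. The six required header names are those of RFC 1036; the element names (message, field, body), the function name and the sample message are assumptions and do not reproduce the prototype's DTD.

```python
from email import message_from_string
import xml.etree.ElementTree as ET

# The six required header lines of RFC 1036.
REQUIRED = ["From", "Date", "Newsgroups", "Subject", "Message-ID", "Path"]


def to_xml(raw_message):
    """Tag each header line and the body; report missing required fields."""
    msg = message_from_string(raw_message)
    root = ET.Element("message")
    for name, value in msg.items():
        # User-defined headers are kept alongside the standard ones.
        ET.SubElement(root, "field", name=name).text = value
    ET.SubElement(root, "body").text = msg.get_payload()
    missing = [name for name in REQUIRED if name not in msg]
    return ET.tostring(root, encoding="unicode"), missing


# Hypothetical message with one user-defined field and no Path header.
raw = (
    "From: alice@example.org\n"
    "Date: 1 Jan 1998 00:00:00 GMT\n"
    "Newsgroups: misc.test\n"
    "Subject: hello\n"
    "Message-ID: <1@example.org>\n"
    "X-Custom: user defined\n"
    "\n"
    "body text\n"
)
xml_text, missing = to_xml(raw)
```

Carrying the header name as an attribute value rather than as an element name lets the large set of user-defined fields pass through without extending the markup vocabulary.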

51
(No Transcript)
52
XML DTD for E-mail
53
Formatted Message Using XML DTD
54
Web-based Interface for Accessing the E-mail
Collection
55
Automation of Ingestion Process
  • Application of an Accessioning Template
  • Defines the concepts, policies or acceptance of
    the collection
  • Creation of attributes that represent the
    accessioning template concepts
  • Analysis of attributes for anomalies and implied
    inherent knowledge

56
Information Generation Processes
  • Create occurrence index
  • (Occurrence, attribute, value)
  • This is needed to be able to recreate original
    form of digital object
  • Analyze completeness of information
  • Inverse index of attribute values
  • Identifies unexpected values - consistency
  • Analyze closure of collection
  • Are additional concepts needed to represent
    inverse index value ranges?
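
The occurrence index and the inverse index can be sketched as below. The function names, the records and the set of expected values are assumptions for illustration; records with no attributes are assumed not to occur, since they would leave no triples behind.

```python
from collections import defaultdict


def occurrence_index(records):
    """Flatten records into (occurrence, attribute, value) triples."""
    return [(i, attr, value)
            for i, record in enumerate(records)
            for attr, value in record]


def recreate(triples):
    """Rebuild the original records from the occurrence index."""
    records = defaultdict(list)
    for occurrence, attr, value in triples:
        records[occurrence].append((attr, value))
    return [records[i] for i in sorted(records)]


def inverse_index(triples):
    """Map attribute -> value -> occurrences holding that value."""
    index = defaultdict(lambda: defaultdict(list))
    for occurrence, attr, value in triples:
        index[attr][value].append(occurrence)
    return index


# Hypothetical records; "XX" stands in for an unexpected value.
records = [
    [("sponsor", "Smith"), ("state", "NH")],
    [("sponsor", "Smith"), ("state", "OR")],
    [("sponsor", "Jones"), ("state", "XX")],
]
triples = occurrence_index(records)
states = inverse_index(triples)["state"]
unexpected = sorted(v for v in states if v not in {"NH", "OR"})
```

The triples are enough to rebuild each record in its original order, and the inverse index exposes the full value range of every attribute, which is where unexpected values show up.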

57
Ingestion Processes for Collection
Aggregation of original objects into
containers for storage
Data Organization
Data Storage
58
Ingestion Processes for Collection
Migration of objects into a standard
representation
Information Generation
Attribute Tagging
Attribute Selection
Data Organization
Collection Storage
59
Ingestion Processes for Collection
Accession Template
Closure Concept/Attribute
Attribute Inverse Indexing
Information Generation
Knowledge Generation
Attribute Tagging
Attribute Selection
Occurrence Tagging
View Management
Data Organization
Collection Storage
60
Application of Anomaly Detection to Thomas
Collection
  • List of bills, amendments, orders sponsored by
    each Senator in a session of Congress
  • The processing rule used to describe senators is
    an example of inherent knowledge within the
    collection
  • By building occurrence tables, one can
    differentiate between knowledge relationships and
    anomalies or artifacts

61
Example Ingestion Network - Senate Collection
62
Information Modeling in Knowledge-Based
Archival - Senate Example
Data provider says: "Please archive all records
of legislative activities of the 106th Senate!"
Integrity constraints (Logic Rules):
(1) senators_with_file = UNION (sponsor, cosponsors, submitted_by)
(2) senators sponsors co-sponsors
Violation: the rhs is strictly larger than the lhs!
Exceptions: (Chafee, John), (Gramm, Phil), (Miller, Zell)
(Possible) Explanations: senators who joined (Zell),
passed away (Chafee), were forgotten (Gramm)!?
Checking ICs:
IF sponsor(X), not senator(X) THEN ADD(exception_log,
missing_senator_info(X))
IF condition THEN action
Action: LOG, WARN, ABORT, ...
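
The rule "IF sponsor(X), not senator(X) THEN ADD(exception_log, missing_senator_info(X))" can be sketched with Python sets in place of a logic engine. The function names and the placeholder names A to D are assumptions; the slide's own approach uses logic rules, which this example only imitates.

```python
def check_sponsors(senators_with_file, sponsor, cosponsors, submitted_by):
    """Log every name that sponsors, cosponsors or submits without a file."""
    exception_log = []
    for name in sorted(set(sponsor) | set(cosponsors) | set(submitted_by)):
        if name not in senators_with_file:
            exception_log.append(("missing_senator_info", name))
    return exception_log


def apply_action(exception_log, action="LOG"):
    """IF condition THEN action, with action one of LOG, WARN, ABORT."""
    if exception_log and action == "ABORT":
        raise RuntimeError(f"{len(exception_log)} integrity violations")
    prefix = "warning: " if action == "WARN" else ""
    return [f"{prefix}{kind}({name})" for kind, name in exception_log]


# Placeholder data: C and D appear on bills but have no senator file.
exceptions = check_sponsors(
    senators_with_file={"A", "B"},
    sponsor={"A"},
    cosponsors={"B", "C"},
    submitted_by={"D"},
)
report = apply_action(exceptions, action="LOG")
```

Separating the check from the action keeps the same constraint usable for logging during exploration and for aborting during a production ingest.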
63
Senator Naming Constraints
  • A senator's name can appear only once on a bill
  • Senator specified by
  • Last name
  • Last name and state
  • Last name, state, and first name
  • Detected anomaly, page 205 of an RTF file was
    replicated.
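
A minimal sketch of the naming constraint: each reference is resolved to a single senator using as much of (last name, state, first name) as was given, and a senator resolved more than once on the same bill is reported. The roster, the function names and the sample bill are hypothetical.

```python
from collections import Counter

# Hypothetical roster: (last name, state, first name).
ROSTER = [("Smith", "NH", "Ann"), ("Smith", "OR", "Raj"),
          ("Jones", "TX", "Lee")]


def resolve(last, state=None, first=None):
    """Return the single roster entry a reference names, else None."""
    matches = [s for s in ROSTER
               if s[0] == last
               and (state is None or s[1] == state)
               and (first is None or s[2] == first)]
    return matches[0] if len(matches) == 1 else None


def repeated_senators(references):
    """Senators named more than once on one bill (constraint violation)."""
    counts = Counter(resolve(*ref) for ref in references)
    return sorted(s for s, n in counts.items() if s is not None and n > 1)


# The first and last references name the same senator in different forms.
bill = [("Jones",), ("Smith", "NH"), ("Jones", "TX", "Lee")]
repeats = repeated_senators(bill)
```

A repeat found this way is only a candidate anomaly; as with the replicated page, deciding whether it is an artifact of the source still takes inspection.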

64
Persistent Collection
  • Define context for archiving data - annotate
    information content
  • Create archivable form - standard encoding format
  • Archive information content along with data
  • Test closure of the collection - all digital
    objects that can be discovered in the collection
    are members of the collection
  • Test completeness of the collection - inherent
    relationships within the collection can be cast
    in terms of attributes generated from the
    annotated information.
  • Differentiate between inherent knowledge and
    anomalies / artifacts

65
Growing Community Interactions
  • Mass Storage
  • IEEE Mass storage system technical committee
  • High performance computing
  • NSF National Partnership for Advanced
    Computational Infrastructure - scalable computing
  • Digital Library
  • DLI2 - UCB, Stanford, UCSB - interoperability
  • NSDL - OAI metadata harvesting, metadata
    standards
  • Data Grid
  • Global Grid Forum - infrastructure independence
  • Persistent Archive
  • InterPARES, records management, OAIS standard

66
Collaborations
  • Digital Libraries
  • DLI2 - InterLib, CDL
  • NSF NSDL - UCAR / DLESE
  • NASA Information Power Grid
  • DOE ASCI Data Visualization Corridor
  • Astronomy
  • National Virtual Observatory (NSF)
  • 2MASS Project (2 Micron All Sky Survey)
  • Particle Physics
  • Particle Physics Data Grid (DOE)
  • GriPhyN (NSF)
  • BaBar (DOE)
  • Medicine
  • Digital Embryo (NLM)
  • Earth Systems Sciences
  • ESIPS (NASA)
  • LTER (NSF)
  • Persistent Archives
  • NARA

67
Evolution of Persistent Archives
  • Preservation of cultural history
  • Archival processes
  • Preservation of documents and electronic records
  • Authenticity
  • Preservation of intellectual capital
  • Characterization of information and knowledge
    content
  • Archival life cycle
  • Re-purposing of collections to facilitate
    discovery
  • Promotion of re-use of archived data
  • Self-instantiating archives

68
Further Information
http://www.npaci.edu/DICE