Title: Developing a Distributed Data Dictionary Service
1 Developing a Distributed Data Dictionary Service
- Jim URen
- Jet Propulsion Laboratory
- California Institute of Technology
- Design Hub, KM Standards Working Group EDA Team
- April 11, 2002
2 Problem
- 1. Data dictionaries mean different things to different people
  - Vocabularies: human-readable collections of terms and definitions pertaining to a domain
  - Data element dictionaries: machine-interpretable collections of data elements (from ISO/IEC 11179)
  - Schemas (information models): structured, machine-interpretable collections of information models consisting of structured relationships between data elements
- 2. Dictionaries do not communicate with each other
3 What is Needed
- A mechanism that can be used to access, publish, update, relate and integrate data dictionaries (vocabularies, data elements, and data models)
- Mechanism must be able to span domains and subdomains, e.g., engineering, science, and administrative
- Mechanism must have both manual and automated interfaces
- Mechanism should follow the distributed service model (e.g., DNS, the Internet Domain Name Service, or the X.500 Directory)
4 A Solution
- Develop a distributed data dictionary service using
  - LDAP (Lightweight Directory Access Protocol), an Internet service protocol
  - ISO/IEC 11179, a specification for standard data elements
  - DSML (Directory Services Markup Language) XML DTD/Schema
  - Dublin Core Meta-data
- The service will store and relate vocabulary, data element, and data model information
5 Advantages of LDAP
- LDAP has many advantages, including
  - Universal Access: an Internet directory standard, widely adopted and implemented by numerous vendors and open source software solutions
  - Simple: a relatively simple, high-level protocol with a straightforward API
  - Extensible: easily extended and adapted
  - Access Control and Security: connections can be authenticated and secured using layered Internet security mechanisms
  - Multi-Platform Development: C/C++, Perl, Java, JavaScript, Python, PHP and other APIs are available, making LDAP services accessible from virtually any language, platform, or development environment
6 What is LDAP?
- An Internet Standard from an IETF working group
  - RFC 1777 Lightweight Directory Access Protocol
  - RFC 1778 String Representation of Standard Attribute Syntaxes
  - RFC 1779 String Representation of Distinguished Names
  - RFC 1959 LDAP URL Format
  - RFC LDAP API
- A distributed, hierarchical database
- Uses a multi-part naming convention to create unique records (distinguished names)
  - cn=behaviour, dc=vocabulary, dc=Part233, dc=10303, dc=ISO
  - cn=requirement_set, dc=data-element, dc=Part233, dc=10303, dc=ISO
  - cn=TBR-apha1, dc=schema, dc=Part233, dc=10303, dc=ISO
- Includes ability to implement multiple levels of security
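The multi-part naming on this slide can be sketched in Python. This is a minimal illustration, assuming the standard `attribute=value` form of LDAP distinguished names; `build_dn` and `parse_dn` are hypothetical helper names, and the sketch ignores the escaping rules that real distinguished names require.

```python
def build_dn(common_name, *domain_components):
    """Join a common name and domain components, most specific first."""
    parts = [f"cn={common_name}"] + [f"dc={dc}" for dc in domain_components]
    return ", ".join(parts)


def parse_dn(dn):
    """Split a simple DN into (attribute, value) pairs.

    Assumes values contain no commas or equals signs (no escaping).
    """
    pairs = []
    for part in dn.split(","):
        attribute, _, value = part.strip().partition("=")
        pairs.append((attribute, value))
    return pairs


# Components taken from the first example record on the slide.
dn = build_dn("behaviour", "vocabulary", "Part233", "10303", "ISO")
print(dn)
print(parse_dn(dn))
```

Reading the pairs from right to left walks the directory tree from the root (ISO) down to the individual record.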
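7 Example of an LDAP tree
[Figure: LDAP directory tree. Node labels recovered from the diagram, by level from the root down:]
- Root: ISO
- Standards: 10303, 14496, 9000, ...
- Parts: 237, 235, ..., 233, 203, 210, 209
- Dictionary branches: Vocabulary, Schema, Data Elements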
8 Advantages of ISO 11179
- An established international standard
- Widely supported: US Census Bureau, NIST, Defense Information Systems Agency, Environmental Security, DoE, DoJ, Bureau of Labor Statistics, DoT, EPA, etc.
- Flexible use of elements within the schema
- Easily implemented in an LDAP directory service: flexible, easily configured LDAP servers are well suited to the flexible 11179 schema
9 Data Dictionary Components for a given namespace
10 A Distributed Data Dictionary Service
- Using standards-based technology: LDAP Protocol, ISO 11179 meta-data schema, DSML, Dublin Core
- Prototype service viewable at http://step.jpl.nasa.gov/ldap
- Supporting Automated Processes
- Supporting Validation Scenarios
- Supporting Data Modeling Activities
- Supporting Terminology Lookups
11 A Proposed Data Element Naming Convention
- A structured, multi-part naming system
  - similar to IP addressing and URLs
  - dot-delimited names
  - follows the convention used by the Dublin Core Meta-data Initiative
- Short-name aliases could be supported in the planned distributed data dictionary service
  - e.g., author = DC.Creator, keyword = DC.Subject, etc.
- Names would consist of domains, descriptors and qualifiers.
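A dot-delimited name of this kind can be split mechanically. The sketch below is one possible reading, not the prototype's code: it assumes the first component is the domain, the second the descriptor, and anything after that a qualifier. Names that carry a subdomain (for example NBS.HR.start_date) would need a richer rule. `ALIASES`, `ElementName` and `parse_element_name` are hypothetical names.

```python
from typing import NamedTuple, Tuple

# Hypothetical alias table holding the two aliases mentioned on the slide.
ALIASES = {"author": "DC.Creator", "keyword": "DC.Subject"}


class ElementName(NamedTuple):
    domain: str
    descriptor: str
    qualifiers: Tuple[str, ...]


def parse_element_name(name):
    """Resolve a short-name alias, then split a dot-delimited name."""
    full_name = ALIASES.get(name, name)
    parts = full_name.split(".")
    # Require at least domain.descriptor, with no empty components.
    if len(parts) < 2 or not all(parts):
        raise ValueError(f"expected at least domain.descriptor, got {name!r}")
    return ElementName(parts[0], parts[1], tuple(parts[2:]))


print(parse_element_name("DC.Date.Created"))
print(parse_element_name("author"))
```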
12 Examples of the Data Element Naming Convention within JPL Domains
- Dublin Core Meta-data Initiative (a JPL-adopted standard)
  - DC.Date
  - DC.Date.Created
  - DC.Date.LastModified
- JPL's Planetary Data System (PDS)
  - PDS.Target_Name
  - PDS.Sampling_Factor
- JPL's Product Data Management System (PDMS)
  - PDMS.Version
  - PDMS.ReferenceDesignator
- JPL New Business System (NBS)
  - NBS.HR.start_date
  - NBS.HR.employee_status
13 Terminology Lookup Scenarios
- Resolving Ambiguous Terminology: an end user, needing to clarify the use and meaning of a word used in a specific context, performs a multi-domain vocabulary lookup across multiple DD services, looking for the published vocabulary of the referenced domain
- Finding the Correct Acronym: an end user, confronted with a number of new acronyms used in a presentation, accesses a local DD service to look up the acronyms within probable domains, thereby eliminating the alternative meanings (e.g., searching for STEP standards work versus the JPL STEP project)
- Enabling Improved Search Engine Performance: as a search engine scans through a document, it discovers a keyword list and finds a reserved word; the document includes a reference to a domain-specific vocabulary list in a DD service; the search engine uses this vocabulary to be certain it is indexing the keywords in the right context
- Building Glossaries for Technical Papers: an engineer or scientist writing a technical paper needs to include a glossary of relevant terms; by performing a multi-service search, terms and definitions that relate to the topic of the paper are quickly found and inserted into the paper with the corresponding attributions
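The acronym scenario can be sketched as a lookup across several services that is narrowed by domain. The code below is only an in-memory illustration: `SERVICES` stands in for separate DD services, the service and domain names are hypothetical, and the JPL definition is a placeholder rather than real dictionary content.

```python
# Hypothetical stand-ins for DD services: service -> domain -> term -> definition.
SERVICES = {
    "dd-standards": {
        "ISO 10303": {"STEP": "STandard for the Exchange of Product model data"},
    },
    "dd-jpl": {
        "JPL projects": {"STEP": "(placeholder) the JPL STEP project"},
    },
}


def lookup(term, domains=None):
    """Return (service, domain, definition) for every match of `term`.

    If `domains` is given, only those domains are searched.
    """
    matches = []
    for service, by_domain in SERVICES.items():
        for domain, vocabulary in by_domain.items():
            if domains is not None and domain not in domains:
                continue
            if term in vocabulary:
                matches.append((service, domain, vocabulary[term]))
    return matches


# Unrestricted search returns both meanings; restricting the domain
# eliminates the alternative, as in the acronym scenario.
print(lookup("STEP"))
print(lookup("STEP", domains={"ISO 10303"}))
```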
14 Validation Scenarios
- Validating Units of Measure: a system integrator receives an MCAD geometry model (e.g., STEP AP203 Part 21 file) of a component to be integrated into an assembly; automatically, a standard validation routine is performed against the schema located in a referenced data dictionary, checking for use of the units of measure called for in the contract and identified in the exchange file
- Enabling Automated Repository Check-In: as a STEP model is checked into a PDM system, an automated validation routine checks the model using the schema (located in the DD service) that is identified in the Part 21 data file
- Improving Quality of Data Handoffs: an MCAD geometry model is sent from design to thermal analysis, and validation is performed using the correct schema version as referenced in the model; validation is an automated process that occurs before any work is done with the model as it is transferred between domains
- Validating for Adequacy and Range: the PDS (NASA's Planetary Data System) central node receives a dataset description in template format to be ingested into the dataset catalogue database; automatically, a standard validation routine is performed that checks for required keywords, keyword values and value types in the dataset description against a corresponding structure stored in the PDS domain of the data dictionary service
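The last scenario (required keywords, values and value types) can be sketched as a check of one record against a dictionary entry. This is an illustrative outline under stated assumptions, not the PDS validation routine: `TEMPLATE_SCHEMA`, its rule fields, and the minimum value are hypothetical, and only the two keyword names come from the PDS examples in this deck.

```python
# Hypothetical dictionary entry; the rules and the "min" bound are illustrative.
TEMPLATE_SCHEMA = {
    "Target_Name": {"type": str, "required": True},
    "Sampling_Factor": {"type": float, "required": False, "min": 0.0},
}


def validate(record, schema):
    """Return a list of problems found in `record`; empty means valid."""
    problems = []
    for keyword, rule in schema.items():
        if keyword not in record:
            if rule.get("required", False):
                problems.append(f"missing required keyword: {keyword}")
            continue
        value = record[keyword]
        if not isinstance(value, rule["type"]):
            problems.append(f"{keyword}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            problems.append(f"{keyword}: value {value} below minimum {rule['min']}")
    # Keywords that the dictionary does not define are also reported.
    for keyword in record:
        if keyword not in schema:
            problems.append(f"unknown keyword: {keyword}")
    return problems


print(validate({"Target_Name": "MARS", "Sampling_Factor": 2.0}, TEMPLATE_SCHEMA))
print(validate({"Sampling_Factor": "2"}, TEMPLATE_SCHEMA))
```

Returning a list of problems rather than raising on the first one lets an automated check-in step report everything wrong with a submission in one pass.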
15 Data Modeling Scenarios
- Data Reuse in Modelling Activities: a data modeller, charged with developing an information model for a new application, uses data elements published in several DD services (much like a parts library), ensuring that the new information model will have compatible interfaces with data sets that share the same data elements or collections of elements
- Creating a TDP (technical data package): an application performs a schema check against objects about to be wrapped into a TDP (e.g., STEP AP232 or PDM Schema TDP) to ensure their correct structure and meta-data content
- Data Integration Enabled: an analyst, charged with integrating data from two or more data sets, accesses the correct version of each schema, as referenced in the data set, from the DD service space, allowing them to identify and map interfaces between the data sets, e.g., MCAD-ECAD-cost data
- Extending a Schema: to solve a "local" problem, a data modeller uses data elements from a published collection of data items to extend an existing official schema; the new schema is published in the DD service with traces/links back to the official schema
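The schema-extension scenario can be sketched as building a new schema that records where it came from. The structure below is an assumption made for illustration: the `derived_from` field, the `extend_schema` helper, the `LOCAL.review_status` element and the definitions are all hypothetical; only the two PDMS element names come from this deck.

```python
def extend_schema(official, extension_name, new_elements):
    """Return a new schema that adds elements and links back to its source."""
    clashes = set(official["elements"]) & set(new_elements)
    if clashes:
        raise ValueError(f"elements already defined: {sorted(clashes)}")
    return {
        "name": extension_name,
        "derived_from": official["name"],  # trace back to the official schema
        "elements": {**official["elements"], **new_elements},
    }


# Placeholder definitions; a real service would hold the published ones.
official = {
    "name": "PDMS",
    "derived_from": None,
    "elements": {
        "PDMS.Version": "(placeholder definition)",
        "PDMS.ReferenceDesignator": "(placeholder definition)",
    },
}

local = extend_schema(
    official, "PDMS-local", {"LOCAL.review_status": "(hypothetical local element)"}
)
print(local["derived_from"])
print(sorted(local["elements"]))
```

The official schema is left unmodified, so both versions can be published side by side.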
16 What's next? (Completing the prototype)
- Architecture development
  - UML Model (50%)
  - Naming Convention (50%)
  - Linking ontology (25%)
- Server configuration
  - 2nd and 3rd DD test nodes (33%)
  - Wrapping existing DD DBs (10%)
- Client configurations
  - LDAP URL (75%)
  - Java (33%)
  - Python (33%)
  - Perl (33%)
  - C/C++ (75%)
  - Unix Shell (25%)
  - PHP (25%)
  - Native clients (25%)