Managing - PowerPoint PPT Presentation

1 / 31
About This Presentation
Title:

Managing

Description:

Lawrence Berkeley National Laboratory. Open Forum on Metadata Registries ... Eve Maler & Fabrice Desre' OASIS Position Paper: Code Lists ... – PowerPoint PPT presentation

Number of Views:129
Avg rating:3.0/5.0
Slides: 32
Provided by: johnm416
Category:
Tags: eve | lawrence | managing

less

Transcript and Presenter's Notes

Title: Managing


1
Managing Using Code Sets with Database Systems
the Web
  • John McCarthy
  • jlmccarthy_at_lbl.gov
  • National Energy Research Scientific Computing
    (NERSC)
  • Lawrence Berkeley National Laboratory
  • Open Forum on Metadata Registries
  • Santa Fe NM, 23 January 2003

2
Introductory Overview
  • Definitions components
  • Representation issues
  • Code attributes -- beyond labels definitions
  • Managing change over time other challenges
  • Questions, Comments Discussion

3
Code Sets are enumerated lists of discrete
permissible values
  • e.g.,
  • used to describe manage data codes

M Male F Female U Unknown
Gender Code Set (metadata)
Data Table
4
Code set metadata can expressed as distinct
components at 2 levels
  • code set name Gender
  • version 3.1
  • definition sex of patient
  • steward Berkeley National Laboratory

code set-level components
codeM label male codeF label
female codeU label unknown
individual code value-level components
5
Code values code set metadata serve different
purposes
  • Codes are data values
  • to facilitate data entry, indexing searching
  • to join different tables (for data manipulation
    and analysis)
  • for OLAP dimensions (summary tables, statistical
    analysis,)
  • Code Sets are metadata
  • to enforce constraints for permissible values
    (foreign keys)
  • to display codes in human-readable form
  • for labels, definitions, lists of values (for
    searching data entry)

6
Code set information has increased in scope
formality
  • National international code set standards
  • Paper codebook for one machine-readable file
  • Metadata dictionaries for single databases
  • Multi-database metadata registries
  • and from simple sets of categories to complex
    classification terminology management schemes

7
Some code sets are complex classification schemes
  • e.g., International Classification of Diseases
    (ICD-9)
  • 460-519 Diseases Of The Respiratory System
  • 470-478 Other diseases of the upper respiratory
    tract
  • 474 Chronic disease of tonsils and adenoids
  • 474.1 Tonsillar and adenoid hypertrophy
  • Alternate Descriptions
  • Enlargement of tonsils or adenoids
  • Hyperplasia of tonsils or adenoids
  • Hypertrophy of tonsils or adenoids
  • Exclusions
  • adenoiditis (474.01) adenoiditis and tonsillitis
    (474.02) tonsillitis (474.0)

Note that higher level concepts are ranges, not
codes
8
Other code sets have multi-level hierarchical
structures
  • e.g.,
  • North American Industrial Classification System
    (NAICS)
  • 11 Agriculture, Forestry, Fishing and Hunting
  • 111 Crop Production
  • 1111 Oilseed and Grain Farming
  • 11111 Soybean Farming
  • 11112 Oilseed (except Soybean) Farming
  • 11113 Dry Pea and Bean Farming
  • 11114 Wheat Farming
  • 11115 Corn Farming
  • 11116 Rice Farming
  • 11119 Other Grain Farming
  • 111191 Oilseed and Grain Combination Farming
  • 111199 All Other Grain Farming

number of digits indicates level in hierarchy and
degree of generality
9
Code sets can be represented in individual or
unified tables
 
10
Unified approach is better, but less frequently
used to date
  • main drawback foreign key constraints more
    difficult
  • requires separate view-based constraints and
    triggers for data and metadata
  • adding new code sets only requires adding
    (metadata) rows
  • not tables, which require DBA and DDL
  • adding new attributes only requires modifying one
    table
  • much easier to use and maintain code sets in a
    single table
  • can use standard SQL or compiled functions for
    all code sets
  • can search for codeset information even if you
    dont know table name
  • most terminology classification systems use
    unified tables
  • essential for multi-lingual applications -- just
    add a column for language

11
Unified code set table requires more machinery
for constraints
alter table PERSONS add constraint FK_ORG
references VORGANIZATIONS
create view VORGANIZATION as select CODE, LABEL,
DESCRIPTION from CODESETS where
codesetorganization
12
Unified code table facilitates use of standard
web GUIs
13
Data dictionaries registriesenhance utility of
code sets
14
Data codes and code set labels have different
needs purposes
  • Data Codes
  • Concise uniform to optimize input, storage,
    joins indexing
  • Can appear in many different columns, tables,
    databases,
  • tradeoff of characters for readability v.
    integers for performance
  • Code Labels and Definitions
  • Have fewer occurrences, so can be changed more
    easily
  • Optimized for readability understanding
  • Require software to look up from codes in data

15
Why not use meaningful characters for data codes?
  • Easier to implement
  • doesnt require machinery to use separate labels
  • direct display of stored data values
  • Can use simple foreign key constraints
  • Easier for data entry, editing, and quality
    control
  • Typing faster than pull-down value lists for data
    entry
  • meaningful names easier to remember
  • can directly edit cells of data tables
  • easier to spot errors

16
BUTsemantically meaningful codes have serious
drawbacks
  • Names (and derived mnemonics) change
  • confusing when codes get out of synch with labels
  • metadata labels much easier to change than data
    codes
  • data codes need to be stable for doing joins
  • Problems increase with database size
  • better storage performance for integer codes
  • increasingly difficult dangerous to change data
    code values
  • especially as the number of tables databases
    increases
  • Not appropriate for multi-lingual situations
  • Unless users are willing to agree on dominant
    language

17
When should we avoid using meaningful code values?
  • When code values are likely to change frequently
  • When number of code values in a set is large
  • When size of database(s) requires integer codes
    (surrogate keys) for storage and/or performance
    of entity matching and identification
  • When scope or distributed nature of database(s)
    precludes changing code values in data
  • When we have mechanisms that facilitate hiding
    underlying numeric code and only exposing labels

18
Other code value attributes extend uses of code
sets
19
Change happens andcode sets have to keep up
  • Some causes of change
  • Scientific knowledge (e.g., AIDS changed
    definitions)
  • Treaties or Legislation (e.g, NAFTA required SIC
    ? NAICS)
  • Mergers acquisitions (e.g., Dept of Homeland
    Security)
  • Typical types of change
  • Minor incremental (usually additions and
    deletions)
  • Labels and definitions (e.g, Natl -gt Reagan, but
    code still DCA)
  • Major reorganization (e.g., for classification
    system)
  • History attributes in codesets table can help

20
What else can help manage the evolution of codes
codesets?
  • Use history tables to preserve snapshots
  • Add attributes (columns) to code value rows
  • Derivation, Ancestor(s), Pedigree, Provenance
  • Do NOT re-use codes!! (e.g., ICD needs version)
  • Avoid replacing code values in data
  • except possibly for a limited number of specific
    tables

21
What other problems challenges arise for code
sets?
  • When code sets are independent entities with own
    data
  • trade-off having code in entity code sets table
    (e.g., organization)
  • Special kinds of code sets
  • Complex (hierarchies, classifications,
    concatenations)
  • Open-ended extensible
  • Proprietary code sets
  • e.g., diagnostic health codes
  • Special code values
  • e.g., null, missing data, does not apply, etc.

22
Open-ended codes let users help extend code sets
select organization from list or enter new name
But
HTML Forms has only forced choice or free text,
not combination -so we have to provide an extra
field on the form in the database invites
variant names for same organization requires
periodic staff review and clean-up
23
Where should logic reside for extensible or
complex code sets?
Please select a country
Web Server
  • extensible codes on client browser
  • problem non-standard client code
  • Java, javascript, visual basic, etc.
  • X-Forms may provide richer tools
  • hierarchies on the web server
  • unless number of codes is small
  • round trip latency for slow connections
  • amount of data for large number of codes

Then select state or province
24
What is the optimal XML representation for code
sets?
NCI XML Representation for 11179 Data
Objects OASIS Universal Business Language --
Code Lists (CL) see Eve Maler presentation
and papers proposes creation of an XSD-based
code list marketplace NOTE CL consumers
often differ from CL producers and maintainers
UBL provides XSD schema templates for code
lists, including list level and code
level information add effective date end
date at both levels? sufficiently flexible
extensible? (e.g., code more than name,value)
25
Many challenges remain for code set research
development
  • general-purpose ways to use effdate/enddate to
    get codes valid for arbitrary dates and time
    periods
  • generic user-defined types to connect code sets
    to data element domains with inheritance
  • standard ways to integrate and utilize
  • complex code sets and classification schemes
  • flexible, extensible data structures to
    facilitate sharing exchange of code set
    information

26
What are the main take-home conclusions about
code sets?
  • use unified tables for code and code set
    attributes
  • add attributes to unified tables to extend their
    utility
  • Effdate, Enddate to support audit trail history
  • Sequence order for special sort ordering
  • Type-specific columns (type, parent code, )
  • use standard tools to extend manage unified
    tables
  • for searching, display, editing, etc.
  • use registries dictionaries for wider dynamic
    use

27
References for further reading
  • Dan Gillman, Achieving metadata registry content
    consistency -- Part 3 Value Domains
  • Eve Maler Fabrice Desre OASIS Position Paper
    Code Lists
  • Eve Maler, Schema Design Rules for UBL...and
    Maybe for You
  • HIPAACode.com, hipaanet.com, HL7 V3 Guide
  • oaspub.epa.gov/edr/codeset.startup
  • educationadvisor.com -gt 65.106.216.198/dsweb/intro
    .htm
  • Malcolm Chisholm, Managing Reference Data in
    Enterprise Databases

28
Contact Information and Acknowledgements
  • JLMcCarthy_at_LBL.gov
  • NIM Team
  • Mikhail Avrekh
  • Clayton Bagwell
  • Mark Heer
  • R.K. Owen
  • Francesca Verdier
  • EPA Data Stds Program, L8, SC32, ,et al
  • Bruce Bargmeyer, Frank Olken, Dan Gillman, Larry
    Fitzwater,

29
(No Transcript)
30
Extra Vugraphs
31
Before ISO 11179 metadata registries, there were
code sets
  • ANSI X3L8 committee on Data Representation
  • chaired by Bill Kenworthy
  • included several old timers here at this
    conference
  • responsible for maintenance of several specific
    code sets
  • became L8 committee on Metadata under Dan Gillman
  • it spearheaded development of data element
    registries 11179
  • thanks to efforts leadership of Bruce
    Bargmeyer, EPA, et al
  • code sets still a fundamental important part of
    11179
Write a Comment
User Comments (0)
About PowerShow.com