Title: Managing
1Managing Using Code Sets with Database Systems
the Web
- John McCarthy
- jlmccarthy_at_lbl.gov
- National Energy Research Scientific Computing
(NERSC) - Lawrence Berkeley National Laboratory
- Open Forum on Metadata Registries
- Santa Fe NM, 23 January 2003
2Introductory Overview
- Definitions components
- Representation issues
- Code attributes -- beyond labels definitions
- Managing change over time other challenges
- Questions, Comments Discussion
3Code Sets are enumerated lists of discrete
permissible values
- e.g.,
- used to describe manage data codes
M Male F Female U Unknown
Gender Code Set (metadata)
Data Table
4Code set metadata can expressed as distinct
components at 2 levels
- code set name Gender
- version 3.1
- definition sex of patient
- steward Berkeley National Laboratory
code set-level components
codeM label male codeF label
female codeU label unknown
individual code value-level components
5Code values code set metadata serve different
purposes
- Codes are data values
- to facilitate data entry, indexing searching
- to join different tables (for data manipulation
and analysis) - for OLAP dimensions (summary tables, statistical
analysis,) - Code Sets are metadata
- to enforce constraints for permissible values
(foreign keys) - to display codes in human-readable form
- for labels, definitions, lists of values (for
searching data entry)
6Code set information has increased in scope
formality
- National international code set standards
- Paper codebook for one machine-readable file
- Metadata dictionaries for single databases
- Multi-database metadata registries
- and from simple sets of categories to complex
classification terminology management schemes
7Some code sets are complex classification schemes
- e.g., International Classification of Diseases
(ICD-9) - 460-519 Diseases Of The Respiratory System
- 470-478 Other diseases of the upper respiratory
tract - 474 Chronic disease of tonsils and adenoids
- 474.1 Tonsillar and adenoid hypertrophy
- Alternate Descriptions
- Enlargement of tonsils or adenoids
- Hyperplasia of tonsils or adenoids
- Hypertrophy of tonsils or adenoids
- Exclusions
- adenoiditis (474.01) adenoiditis and tonsillitis
(474.02) tonsillitis (474.0)
Note that higher level concepts are ranges, not
codes
8Other code sets have multi-level hierarchical
structures
- e.g.,
- North American Industrial Classification System
(NAICS) - 11 Agriculture, Forestry, Fishing and Hunting
- 111 Crop Production
- 1111 Oilseed and Grain Farming
- 11111 Soybean Farming
- 11112 Oilseed (except Soybean) Farming
- 11113 Dry Pea and Bean Farming
- 11114 Wheat Farming
- 11115 Corn Farming
- 11116 Rice Farming
- 11119 Other Grain Farming
- 111191 Oilseed and Grain Combination Farming
- 111199 All Other Grain Farming
number of digits indicates level in hierarchy and
degree of generality
9Code sets can be represented in individual or
unified tables
10Unified approach is better, but less frequently
used to date
- main drawback foreign key constraints more
difficult - requires separate view-based constraints and
triggers for data and metadata - adding new code sets only requires adding
(metadata) rows - not tables, which require DBA and DDL
- adding new attributes only requires modifying one
table - much easier to use and maintain code sets in a
single table - can use standard SQL or compiled functions for
all code sets - can search for codeset information even if you
dont know table name - most terminology classification systems use
unified tables - essential for multi-lingual applications -- just
add a column for language
11Unified code set table requires more machinery
for constraints
alter table PERSONS add constraint FK_ORG
references VORGANIZATIONS
create view VORGANIZATION as select CODE, LABEL,
DESCRIPTION from CODESETS where
codesetorganization
12Unified code table facilitates use of standard
web GUIs
13Data dictionaries registriesenhance utility of
code sets
14Data codes and code set labels have different
needs purposes
- Data Codes
- Concise uniform to optimize input, storage,
joins indexing - Can appear in many different columns, tables,
databases, - tradeoff of characters for readability v.
integers for performance - Code Labels and Definitions
- Have fewer occurrences, so can be changed more
easily - Optimized for readability understanding
- Require software to look up from codes in data
15Why not use meaningful characters for data codes?
- Easier to implement
- doesnt require machinery to use separate labels
- direct display of stored data values
- Can use simple foreign key constraints
- Easier for data entry, editing, and quality
control - Typing faster than pull-down value lists for data
entry - meaningful names easier to remember
- can directly edit cells of data tables
- easier to spot errors
16BUTsemantically meaningful codes have serious
drawbacks
- Names (and derived mnemonics) change
- confusing when codes get out of synch with labels
- metadata labels much easier to change than data
codes - data codes need to be stable for doing joins
- Problems increase with database size
- better storage performance for integer codes
- increasingly difficult dangerous to change data
code values - especially as the number of tables databases
increases - Not appropriate for multi-lingual situations
- Unless users are willing to agree on dominant
language
17When should we avoid using meaningful code values?
- When code values are likely to change frequently
- When number of code values in a set is large
- When size of database(s) requires integer codes
(surrogate keys) for storage and/or performance
of entity matching and identification - When scope or distributed nature of database(s)
precludes changing code values in data - When we have mechanisms that facilitate hiding
underlying numeric code and only exposing labels
18Other code value attributes extend uses of code
sets
19Change happens andcode sets have to keep up
- Some causes of change
- Scientific knowledge (e.g., AIDS changed
definitions) - Treaties or Legislation (e.g, NAFTA required SIC
? NAICS) - Mergers acquisitions (e.g., Dept of Homeland
Security) - Typical types of change
- Minor incremental (usually additions and
deletions) - Labels and definitions (e.g, Natl -gt Reagan, but
code still DCA) - Major reorganization (e.g., for classification
system) - History attributes in codesets table can help
20What else can help manage the evolution of codes
codesets?
- Use history tables to preserve snapshots
- Add attributes (columns) to code value rows
- Derivation, Ancestor(s), Pedigree, Provenance
- Do NOT re-use codes!! (e.g., ICD needs version)
- Avoid replacing code values in data
- except possibly for a limited number of specific
tables
21What other problems challenges arise for code
sets?
- When code sets are independent entities with own
data - trade-off having code in entity code sets table
(e.g., organization) - Special kinds of code sets
- Complex (hierarchies, classifications,
concatenations) - Open-ended extensible
- Proprietary code sets
- e.g., diagnostic health codes
- Special code values
- e.g., null, missing data, does not apply, etc.
22Open-ended codes let users help extend code sets
select organization from list or enter new name
But
HTML Forms has only forced choice or free text,
not combination -so we have to provide an extra
field on the form in the database invites
variant names for same organization requires
periodic staff review and clean-up
23Where should logic reside for extensible or
complex code sets?
Please select a country
Web Server
- extensible codes on client browser
- problem non-standard client code
- Java, javascript, visual basic, etc.
- X-Forms may provide richer tools
- hierarchies on the web server
- unless number of codes is small
- round trip latency for slow connections
- amount of data for large number of codes
Then select state or province
24What is the optimal XML representation for code
sets?
NCI XML Representation for 11179 Data
Objects OASIS Universal Business Language --
Code Lists (CL) see Eve Maler presentation
and papers proposes creation of an XSD-based
code list marketplace NOTE CL consumers
often differ from CL producers and maintainers
UBL provides XSD schema templates for code
lists, including list level and code
level information add effective date end
date at both levels? sufficiently flexible
extensible? (e.g., code more than name,value)
25Many challenges remain for code set research
development
- general-purpose ways to use effdate/enddate to
get codes valid for arbitrary dates and time
periods - generic user-defined types to connect code sets
to data element domains with inheritance - standard ways to integrate and utilize
- complex code sets and classification schemes
- flexible, extensible data structures to
facilitate sharing exchange of code set
information
26What are the main take-home conclusions about
code sets?
- use unified tables for code and code set
attributes - add attributes to unified tables to extend their
utility - Effdate, Enddate to support audit trail history
- Sequence order for special sort ordering
- Type-specific columns (type, parent code, )
- use standard tools to extend manage unified
tables - for searching, display, editing, etc.
- use registries dictionaries for wider dynamic
use
27References for further reading
- Dan Gillman, Achieving metadata registry content
consistency -- Part 3 Value Domains - Eve Maler Fabrice Desre OASIS Position Paper
Code Lists - Eve Maler, Schema Design Rules for UBL...and
Maybe for You - HIPAACode.com, hipaanet.com, HL7 V3 Guide
- oaspub.epa.gov/edr/codeset.startup
- educationadvisor.com -gt 65.106.216.198/dsweb/intro
.htm - Malcolm Chisholm, Managing Reference Data in
Enterprise Databases
28Contact Information and Acknowledgements
- JLMcCarthy_at_LBL.gov
- NIM Team
- Mikhail Avrekh
- Clayton Bagwell
- Mark Heer
- R.K. Owen
- Francesca Verdier
- EPA Data Stds Program, L8, SC32, ,et al
- Bruce Bargmeyer, Frank Olken, Dan Gillman, Larry
Fitzwater,
29(No Transcript)
30Extra Vugraphs
31Before ISO 11179 metadata registries, there were
code sets
- ANSI X3L8 committee on Data Representation
- chaired by Bill Kenworthy
- included several old timers here at this
conference - responsible for maintenance of several specific
code sets - became L8 committee on Metadata under Dan Gillman
- it spearheaded development of data element
registries 11179 - thanks to efforts leadership of Bruce
Bargmeyer, EPA, et al - code sets still a fundamental important part of
11179