Title: Websites
1. Websites
- http://mor.nlm.nih.gov/download/rxnav/
- http://www.stccmop.org/quarry
2. Profiling Dataspaces: Understanding (and Using) Other People's Data
- David Maier
- Department of Computer Science
- Portland State University
3. With Much Support
- Dataspaces: Alon Halevy, Mike Franklin
- RxSafe: Paul Gorman, Karl Ordelheide, Judy Logan, Nick Rayner (InfoSonde)
- SACO: Shannon McWeeney, Ranjani Ramakrishnan
- Quarry: Bill Howe, James Rucker
- DIESEL: Lois Delcambre, David Archer, Susan Price, Scott Fletcher, John McCall
- Funding: NSF, AHRQ, DARPA
4. Other People's Data
- Why is other people's data hard to understand?
- They aren't around to explain it
- What they wrote down about it wasn't quite true or complete in the first place
- The kinds of data in a source have expanded beyond the original intent (schema drift)
- How do you understand the data?
5. Dataspaces
- Deal with all the data from an enterprise, in whatever models
- Data co-existence
- Might not be fully integrated, especially early on
- Pay-as-you-go services
- I'm interested in understanding sources and their relationships
"From Databases to Dataspaces: A New Abstraction for Information Management," Michael Franklin, Alon Halevy, David Maier, SIGMOD Record, December 2005.
6. Example Dataspace: RxSafe
- Consolidated medication list for rural elders
- Points in lifetime of a prescription
- Order (clinic, hospital)
- Dispensing (pharmacy)
- Approval (insurer)
- Administration (rehabilitation facility)
- Relevant Standards
- NDCD, RxNorm, NDF-RT
7. NDCD: National Drug Code Directory
8. Sample NDC: 62584-023-00
9. RxNorm: Drug Nomenclature
- RxNav from National Library of Medicine
10. NDF-RT
- National Drug File Reference Terminology
- From Veterans Affairs
- Drug class
- Chemical class
- Effects and actions
11. NDF-RT (Blue)
12. Goals of Understanding Based on Intended Purpose
- Grouping similar medications
- Connecting possible incarnations of same prescription
- Generic / Brand Name
- Combining medication information for a given patient
- Must be error preserving
13. What You Know is Wrong
- NDC and RxNorm talking about same things
- NDC tradenames: 18913
- RxNorm brand names: 7600
- Strings in common: 418
- All RxNorm relationships have explicit inverses
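An assumption such as the last one above can be tested rather than taken on faith: scan the relationship triples and report any that lack a matching inverse. The sketch below is illustrative only. The `INVERSE_OF` table, the relationship names, and the sample triples are assumptions made for the example, not data taken from RxNorm.

```python
# Hypothetical inverse table; the names are placeholders for illustration.
INVERSE_OF = {
    "ingredient_of": "has_ingredient",
    "has_ingredient": "ingredient_of",
    "tradename_of": "has_tradename",
    "has_tradename": "tradename_of",
}

def missing_inverses(triples):
    """Return the triples (s, rel, o) for which (o, inverse(rel), s) is absent."""
    present = set(triples)
    missing = []
    for s, rel, o in triples:
        inv = INVERSE_OF.get(rel)
        # A relationship with no declared inverse also counts as an exception.
        if inv is None or (o, inv, s) not in present:
            missing.append((s, rel, o))
    return missing

# Made-up triples; the third one deliberately has no inverse.
sample = [
    ("10001", "ingredient_of", "10004"),
    ("10004", "has_ingredient", "10001"),
    ("10002", "tradename_of", "10005"),
]
exceptions = missing_inverses(sample)
print(exceptions)
```

Returning the exceptions themselves, rather than a yes/no answer, keeps the evidence available when the assumption turns out to be false.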
14. What You Know is Incomplete
15. Incomplete Description
Source: National Library of Medicine
Doesn't mention atoms, attributes
Doesn't include SY, ET, OCD, OBD
16. You Might Misunderstand What You're Told
- RxNorm diagram is for instances
- Multi-ingredient drug case not covered
17. The Structure Isn't What You Thought
- RxNorm uses UMLS, not domain-specific
- More complex than this: can have several atoms in each concept
18. How Do You Make Progress?
- Tools with minimal assumptions
- Ability to check hypothesis
- Means to customize data to intended task
19. What I Want: Dataspace Charting Toolkit
- Help with Familiarization, Profiling, Enhancement
- Inspector for generic models
- Dataspace profiler
- Assumption tracker and checker
- Structure discovery techniques
- Customization to task based on discovered characteristics
20. Quarry Data Model (Howe)
- resource, property, value
- (subject, predicate, object) if you prefer
- no intrinsic distinction between literal values and resource values
- no explicit types or classes
21. Example: RxNorm
Concept, Relationship, Atom
userkey  prop           value
10001    NDC            1
10001    ORIG_CODE      123
10001    ingredient_of  10004
10001    type           DC
Up to 23M triples describing 0.6M concepts and atoms
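The triple view above can be sketched as a small in-memory structure. This is a minimal illustration, not Quarry's implementation: the `describe` helper is a hypothetical name, and the four rows are the ones shown in the example.

```python
from collections import defaultdict

# Rows from the example: (userkey, prop, value). Everything is a plain string,
# so literal values and resource keys are not distinguished.
triples = [
    ("10001", "NDC", "1"),
    ("10001", "ORIG_CODE", "123"),
    ("10001", "ingredient_of", "10004"),
    ("10001", "type", "DC"),
]

def describe(triples):
    """Group triples by resource: {userkey: {prop: [values]}}."""
    resources = defaultdict(lambda: defaultdict(list))
    for key, prop, value in triples:
        resources[key][prop].append(value)
    return resources

resources = describe(triples)
print(dict(resources["10001"]))
```

Note that `10004` in the `ingredient_of` row is stored exactly like the literal `123`; nothing in the model marks it as a reference to another resource.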
22. Example: Metadata for Scientific Data Repository
/anim-sal_estuary_7.gif
31. Quarry API
- /2004/2004-001//anim-tem_estuary_bottom.gif
- aggregate: bottom
- animation: isotem
- day: 001
- directory: images
- plottype: isotem
- region: estuary
- runid: 2004-001
- year: 2004
- /2004/2004-001//amp_plume_2d.gif
- day: 001
- directory: images
- plottype: 2d
- region: plume
- runid: 2004-001
- year: 2004
32. Behind the Scenes
- Signatures
- resources possessing the same properties clustered together
- Posit that Signatures << Resources
- Queries evaluated over Signature Extents
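The signature idea can be sketched as follows: group resources by the exact set of properties they carry, then answer a query by visiting only the signatures that contain the required properties. This is a minimal sketch under assumptions, not Quarry's code. The function names and the sample triples are hypothetical.

```python
from collections import defaultdict

def build_signatures(triples):
    """Map each signature (frozenset of properties) to its extent (set of resources)."""
    props = defaultdict(set)
    for resource, prop, _value in triples:
        props[resource].add(prop)
    extents = defaultdict(set)
    for resource, prop_set in props.items():
        extents[frozenset(prop_set)].add(resource)
    return dict(extents)

def candidates(extents, required_props):
    """Resources whose signature contains every required property."""
    required = set(required_props)
    found = set()
    for signature, extent in extents.items():
        if required <= signature:
            found |= extent
    return found

# Made-up triples, loosely modeled on image metadata.
triples = [
    ("a.gif", "region", "estuary"), ("a.gif", "year", "2004"), ("a.gif", "aggregate", "bottom"),
    ("b.gif", "region", "plume"), ("b.gif", "year", "2004"),
    ("c.gif", "region", "estuary"), ("c.gif", "year", "2004"),
]
extents = build_signatures(triples)
print(len(extents), candidates(extents, ["aggregate"]))
```

Because the number of signatures is posited to be far smaller than the number of resources, the loop in `candidates` touches few entries before any per-resource work is done.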
33. Experimental Results
- Yet Another RDF Store
- Several B-Tree indexes to support
- spo, po → s, os → p, etc.
- 3M triples
- We looked at multi-term queries
?s <p0> <o0>
?s <p1> <o1>
...
?s <pn> <on>
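A multi-term query of this shape asks for every subject that carries all of the given (predicate, object) pairs. The sketch below evaluates one by intersecting lookups in a "po → s" index. A Python dict stands in for the B-tree indexes here, and the function names and sample triples are assumptions made for illustration.

```python
from collections import defaultdict

def build_po_index(triples):
    """Index (predicate, object) -> set of subjects, i.e. the 'po -> s' access path."""
    index = defaultdict(set)
    for s, p, o in triples:
        index[(p, o)].add(s)
    return index

def multi_term_query(po_index, terms):
    """Subjects matching every (predicate, object) term; an empty term list matches nothing."""
    result = None
    for term in terms:
        subjects = po_index.get(term, set())
        result = set(subjects) if result is None else result & subjects
        if not result:
            return set()  # stop early once the intersection is empty
    return result or set()

# Made-up triples for illustration.
triples = [
    ("f1", "region", "estuary"), ("f1", "year", "2004"), ("f1", "plottype", "isotem"),
    ("f2", "region", "plume"), ("f2", "year", "2004"), ("f2", "plottype", "2d"),
]
po_index = build_po_index(triples)
matches = multi_term_query(po_index, [("year", "2004"), ("region", "estuary")])
print(matches)
```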
34. Experimental Results: Queries
3.6M triples, 606k resources, 149 signatures
35. Dataspace Profiling: NDC Examples (Rayner)
- l_seq_no is key of listings: yes
- lblcode, prodcode key of listings: no
- 45,953 vs. 45,972 (19)
- firm_name → lblcode: no
- 2931 vs. 2952 (21)
- each product listing should have >0 packages and >0 ingredients
- 44,972 (1180), 45,180 (792)
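Key and functional-dependency checks like those above reduce to counting distinct value combinations. The following is a minimal sketch, assuming rows are available as dictionaries. The column names follow the examples, while the function names and the sample values are made up for illustration.

```python
from collections import defaultdict

def key_check(rows, columns):
    """Return (distinct value combinations, total rows); equal counts mean the columns form a key."""
    distinct = {tuple(row[c] for c in columns) for row in rows}
    return len(distinct), len(rows)

def fd_violations(rows, lhs, rhs):
    """Return {lhs value: set of rhs values} for lhs values mapped to more than one rhs value."""
    seen = defaultdict(set)
    for row in rows:
        seen[tuple(row[c] for c in lhs)].add(tuple(row[c] for c in rhs))
    return {k: v for k, v in seen.items() if len(v) > 1}

# Hypothetical listings rows; the values are invented.
listings = [
    {"l_seq_no": 1, "lblcode": "0001", "prodcode": "0010", "firm_name": "Acme"},
    {"l_seq_no": 2, "lblcode": "0001", "prodcode": "0010", "firm_name": "Acme"},
    {"l_seq_no": 3, "lblcode": "0002", "prodcode": "0020", "firm_name": "Acme"},
]
print(key_check(listings, ["l_seq_no"]))
print(key_check(listings, ["lblcode", "prodcode"]))
print(fd_violations(listings, ["firm_name"], ["lblcode"]))
```

Reporting the two counts, and the offending values for a failed dependency, gives the size of the exception set instead of a bare "no".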
36. Checking Across Sources: NDC vs. RxNorm
- Ingredients
- 2794 ingredient names in NDCD
- 5145 ingredients in RxNorm
- 1570 equal strings
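A cross-source comparison of this kind can be sketched as a set intersection over name strings. The version below also reports how many names match after a mild normalization (case folding and whitespace collapsing), which is an addition for illustration and not something the counts above are known to use. The sample names are made up and do not come from NDCD or RxNorm.

```python
def normalize(name):
    """Case-fold and collapse whitespace; a deliberately mild normalization."""
    return " ".join(name.casefold().split())

def overlap_report(left, right):
    """Count distinct names on each side, exact matches, and matches after normalization."""
    left_set, right_set = set(left), set(right)
    left_norm = {normalize(n) for n in left_set}
    right_norm = {normalize(n) for n in right_set}
    return {
        "left": len(left_set),
        "right": len(right_set),
        "equal": len(left_set & right_set),
        "equal_normalized": len(left_norm & right_norm),
    }

# Made-up names for illustration only.
ndcd_names = ["ASPIRIN", "Acetaminophen", "IBUPROFEN  200"]
rxnorm_names = ["Aspirin", "Acetaminophen", "Ibuprofen 200", "Caffeine"]
report = overlap_report(ndcd_names, rxnorm_names)
print(report)
```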
37. What to Do with Flawed Assumptions?
- Track exceptions
- Refine assumption
- firm_name, location → lblcode
- Refine knowledge of world
- RxNorm has ingredient variants (which have the same type as ingredients)
- Want to track assumptions as they evolve, results of checks
38. Customization: InfoSonde
- Support customizations appropriate to discovered data characteristics
- Three-part modules
- Probe: Check or discover properties
- Switch: Present applicable customizations
- Check: Test that chosen switch is still valid
39. Linkage Extension Module
Want to extend a linkage based on a discovered functional relationship
Probe: Join satisfies FD (here, ING → DC)
Switch: Materialize functional relationship; can be used to extend original relation with DC
Check: Test that FD still holds
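The probe, switch, and check steps of this module can be sketched as three methods on one object. This is a hypothetical illustration, not InfoSonde's interface: the class and method names are invented, and the sample rows treat ING as an ingredient and DC as a class-like code purely as an assumption, since neither abbreviation is expanded here.

```python
def fd_holds(rows, lhs, rhs):
    """True if every lhs value maps to exactly one rhs value in rows."""
    mapping = {}
    for row in rows:
        if mapping.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

class LinkageExtension:
    """Hypothetical three-part module: probe, switch, check."""

    def __init__(self, lhs, rhs):
        self.lhs, self.rhs = lhs, rhs

    def probe(self, joined_rows):
        # Probe: does the join satisfy the FD lhs -> rhs?
        return fd_holds(joined_rows, self.lhs, self.rhs)

    def switch(self, joined_rows, relation):
        # Switch: materialize the functional relationship and use it to
        # extend each row of the original relation with the rhs column.
        lookup = {row[self.lhs]: row[self.rhs] for row in joined_rows}
        return [{**row, self.rhs: lookup.get(row[self.lhs])} for row in relation]

    def check(self, joined_rows):
        # Check: re-test the FD so a stale customization can be detected.
        return self.probe(joined_rows)

# Made-up rows for illustration.
joined = [
    {"ING": "aspirin", "DC": "analgesic"},
    {"ING": "aspirin", "DC": "analgesic"},
    {"ING": "warfarin", "DC": "anticoagulant"},
]
orders = [{"patient": "p1", "ING": "warfarin"}, {"patient": "p2", "ING": "aspirin"}]

module = LinkageExtension("ING", "DC")
extended = module.switch(joined, orders) if module.probe(joined) else orders
print(extended)
```

Running `check` again after new data arrives is what lets a previously valid customization be withdrawn when the dependency stops holding.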
40. My First Visit to Zürich
- See
- http://www.cs.pdx.edu/maier/france-report/report13.txt