Title: Automatic Assignment of SNOMED Categories:
1Automatic Assignment of SNOMED Categories: Preliminary and Qualitative Evaluations
P. Ruch, I. Tbahriti, J. Gobeill, R. Baud, C. Lovis and A. Geissbühler
robert.baud_at_sim.hcuge.ch
2Aspects of any terminology
Language dependent
Language independent
In vitro
In vivo
Language Epistemology
Ontology
Terminology
Model of the domain Taxonomy Relationships Formal constraints
Pragmatic domain Representation is necessary for dissemination Bag of terms
Expert domain representation Medical literature Unique preferred terms
3Text categorization tasks: for what?
- Keyword assignment
- Libraries and Digital Libraries
- ACM, IEEE, MeSH
- Encoding
- ICD for billing
- SNOMED for patient record indexing
- Gene attributes: process, function, component (GO)
4Goal
- To develop and to evaluate an automatic tool for
- Browsing SNOMED CT
- Automatic assignment of SNOMED CT categories to a text
- Automatic assignment of SNOMED CT categories
- Passage extraction to support the prediction
- Display interface
5Problem statement
- Absence of tuning/evaluation data (gold standard)
- To transform a MEDLINE/MeSH collection into a MEDLINE/SNOMED collection
- Use the UMLS for the transformation
- Tune and evaluate the categorization effectiveness
- Discussion
- Appropriateness of the collection
- Comparison of MeSH and SNOMED (different granularity)
6MEDLINE citation
- PMID 11924965
- Simple multiplex genotyping by surface-enhanced resonance Raman scattering.
- Graham D, Mallinder BJ, Whitcombe D, Watson ND, Smith WE.
- The accurate detection of DNA sequences is
essential for a variety of post human genome
projects including detection of specific gene
variants for medical diagnostics and
pharmacogenomics. A specific DNA sequence
detection assay based on surface-enhanced
resonance Raman scattering (SERRS) and an
amplification refractory mutation system (ARMS)
is reported. Initially, generation of PCR
products was achieved by using specifically
designed allele-specific SERRS active primers.
Detection by SERRS of the PCR products confirmed
the presence of the sequence tested for by the
allele-specific oligonucleotides. This lead
directly to the multiplex genotyping of human DNA
samples for the deltaF508 mutational status of
the cystic fibrosis transmembrane conductance
regulator gene using SERRS active primers in an
ARMS assay. Removal of the unincorporated primers
allowed fast and accurate analysis in this system
in a multiplex format without any separation of
amplicons. The results indicate that SERRS can be
used in modern genetic analysis and offers an
opportunity for the development of novel assays.
This is the first demonstration of the use of
SERRS in multiplex genotyping and shows potential
advantages over fluorescence as a detection
technique with considerable promise for future
development.
- Major MeSH: Cystic Fibrosis, DNA, Genotype, Polymerase Chain Reaction
- Minor MeSH: HLA-DQ Antigens Human Reverse Transcriptase
- Sequence Analysis Spectrum Analysis, Raman
- Support, Non-U.S. Gov't
- Only MeSH terms marked as major are used in our experiments!
7UMLS mapping to SNOMED-CT (via the CUI)
- PMID 11924965
- Simple multiplex genotyping by surface-enhanced resonance Raman scattering.
- Graham D, Mallinder BJ, Whitcombe D, Watson ND, Smith WE.
- The accurate detection of DNA sequences is
essential for a variety of post human genome
projects including detection of specific gene
variants for medical diagnostics and
pharmacogenomics. A specific DNA sequence
detection assay based on surface-enhanced
resonance Raman scattering (SERRS) and an
amplification refractory mutation system (ARMS)
is reported. Initially, generation of PCR
products was achieved by using specifically
designed allele-specific SERRS active primers.
Detection by SERRS of the PCR products confirmed
the presence of the sequence tested for by the
allele-specific oligonucleotides. This lead
directly to the multiplex genotyping of human DNA
samples for the deltaF508 mutational status of
the cystic fibrosis transmembrane conductance
regulator gene using SERRS active primers in an
ARMS assay. Removal of the unincorporated primers
allowed fast and accurate analysis in this system
in a multiplex format without any separation of
amplicons. The results indicate that SERRS can be
used in modern genetic analysis and offers an
opportunity for the development of novel assays.
This is the first demonstration of the use of
SERRS in multiplex genotyping and shows potential
advantages over fluorescence as a detection
technique with considerable promise for future
development.
- SNOMED: cystic fibrosis (SN190911006/SN190905008), DNA (SN024851008), polymerase chain reaction (SN258066000), genotype (SN363779003)
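The mapping step on this slide can be sketched as two table lookups joined on the CUI. This is a minimal illustration, not the authors' pipeline: the two dictionaries stand in for extracts of the UMLS, the CUI values are placeholders (hypothetical, not real identifiers), and only the SNOMED codes are taken from the slide.

```python
# Hypothetical extracts of two UMLS lookups: MeSH heading -> CUIs and
# CUI -> SNOMED CT codes. The CUI keys are placeholders, not real CUIs;
# the SNOMED codes are the ones shown on the slide.
MESH_TO_CUI = {
    "Cystic Fibrosis": ["CUI_A"],
    "DNA": ["CUI_B"],
    "Polymerase Chain Reaction": ["CUI_C"],
    "Genotype": ["CUI_D"],
}
CUI_TO_SNOMED = {
    "CUI_A": ["SN190911006", "SN190905008"],
    "CUI_B": ["SN024851008"],
    "CUI_C": ["SN258066000"],
    "CUI_D": ["SN363779003"],
}


def mesh_to_snomed(major_mesh_terms):
    """Map each major MeSH heading to SNOMED codes through shared CUIs."""
    mapped = {}
    for term in major_mesh_terms:
        codes = []
        for cui in MESH_TO_CUI.get(term, []):
            for code in CUI_TO_SNOMED.get(cui, []):
                if code not in codes:  # keep order, drop duplicates
                    codes.append(code)
        mapped[term] = codes  # empty list when no mapping exists
    return mapped


if __name__ == "__main__":
    majors = ["Cystic Fibrosis", "DNA", "Genotype", "Polymerase Chain Reaction"]
    for heading, codes in mesh_to_snomed(majors).items():
        print(heading, "->", ", ".join(codes) or "(unmapped)")
```

Headings without a CUI, or CUIs without a SNOMED code, come back with an empty list, so unmapped records can be counted rather than silently dropped.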
8General strategies for text categorization
- Retrieval based on word-matching, which attributes concepts to text based on shared words between the text and the concepts
- Cross Language IR (SAPHIRE Int., Hersh et al. 1998)
- Recent and rare
- Empirical learning of text-concept associations from a training set of texts and their associated concepts
- Reuters (Bayesian classifiers, Lewis 1992): 100 classes
- Text categorization/filtering paradigm (Sebastiani): hundreds
- → But learning conditions
- → Usual browsing tools (UMLS, GO, CLUE Browsers)
- Boolean, Completer, Exact match (morphology)
- Poor ranking and recall effectiveness
9Browsers
- Browser
- Hierarchical viewer, e.g. GO Browser (NIH)?
- Textual input
- Input: regulation cystic fibrosis transmembrane conductance
- Output: NOTHING?
- Good browser
- Input: papular hamartoma between mesoderm and ectoderm
- Output: linear papular ectodermal-mesodermal hamartoma
- in top 10
11Learning and large scale text categorization
- Limits of Machine Learning approaches
- Number of binary classifiers: computational issue
- Class sets must be static: diachronic issues
- Rare classes are ignored: experimental issues
- Lack of training data: transcendental issue
- Collection of clinical reports with SNOMED-CT?
12Data and metrics
- Cystic fibrosis collection: 1239 MEDLINE records
- Tuning/Evaluation split: 239/1000
- Top precision
- Precision of categories returned at the top of the list.
- Average precision
- Average precision over 11 recall points
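The two metrics can be sketched as follows. This is an illustration under stated assumptions: top precision is read here as precision over the first k returned categories (k = 1 by default), and the 11-point measure uses the usual interpolated precision at recall levels 0.0, 0.1, ..., 1.0. The slides do not spell out either definition, and the function names are hypothetical.

```python
def precision_at_k(ranked, relevant, k=1):
    """Fraction of the first k returned categories that are relevant."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for category in top if category in relevant) / len(top)


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    points = []  # (recall, precision) observed after each rank
    hits = 0
    for rank, category in enumerate(ranked, start=1):
        if category in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        total += max((p for r, p in points if r >= level), default=0.0)
    return total / 11


if __name__ == "__main__":
    ranked = ["cystic fibrosis", "asthma", "genotype"]
    gold = {"cystic fibrosis", "genotype"}
    print(precision_at_k(ranked, gold))
    print(eleven_point_average_precision(ranked, gold))
```

Per-record values would then be averaged over the 1000 evaluation records.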
13Basic Strategies
- Pattern matcher: thesaurus + RegEx
- word1wordn5 ? word1 _,2 wordn
- word1wordn5 ? word1 wordiwordn
- → Boolean scoring: good precision
- Vector Space (VS): Porter stems, TF-IDF weighting
- → Fine scoring: good recall
- Balanced combination of each method
- Data-driven → need a (small) sample for tuning
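One way to picture the balanced combination is a linear mix of a boolean regex match and a vector-space score. This is a sketch only: the gap tolerance (`max_gap`), the mixing weight (`weight`) and the function names are assumptions for illustration, and the vector-space scores are passed in as precomputed values rather than taken from the authors' system.

```python
import re


def term_pattern(term, max_gap=2):
    """Regex matching the term's words in order, allowing up to
    max_gap intervening words between consecutive term words."""
    words = [re.escape(w) for w in term.lower().split()]
    gap = r"\W+(?:\w+\W+){0,%d}" % max_gap
    return re.compile(r"\b" + gap.join(words) + r"\b")


def combine(text, terms, vs_scores, weight=0.5):
    """Rank terms by weight * regex match + (1 - weight) * VS score.

    vs_scores maps each term to a vector-space similarity in [0, 1];
    weight is the value that would be tuned on the small sample."""
    lowered = text.lower()
    scored = []
    for term in terms:
        boolean = 1.0 if term_pattern(term).search(lowered) else 0.0
        score = weight * boolean + (1.0 - weight) * vs_scores.get(term, 0.0)
        scored.append((term, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    abstract = "Genotyping of the cystic fibrosis gene by polymerase chain reaction"
    candidates = ["cystic fibrosis", "polymerase reaction", "asthma"]
    similarities = {"cystic fibrosis": 0.6, "polymerase reaction": 0.2, "asthma": 0.3}
    for term, score in combine(abstract, candidates, similarities):
        print(term, round(score, 3))
```

The boolean component rewards terms whose words actually occur in order in the text, while the vector-space component still ranks terms that match only partially.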
14Vector space parameters: weighting!
- TF: weight(term) = f(term frequency)
- IDF: weight(term) = f(1 / document frequency)
- Normalization: cosine, pivoted
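The three parameters can be illustrated with one common choice for each: logarithmic TF, logarithmic IDF and cosine normalization. These particular functions are assumptions for the sketch. The slides only say that the weights are functions of term frequency and inverse document frequency, and pivoted normalization is not shown here.

```python
import math
from collections import Counter


def tf_weight(freq):
    """Logarithmic term-frequency weight (one possible f)."""
    return 1.0 + math.log(freq) if freq > 0 else 0.0


def idf_weight(doc_freq, n_docs):
    """Logarithmic inverse document frequency (one possible f)."""
    return math.log(n_docs / doc_freq) if doc_freq > 0 else 0.0


def cosine_normalize(vector):
    """Scale a weight vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else dict(vector)


def weight_vector(tokens, doc_freqs, n_docs):
    """TF * IDF weights for a token list, cosine-normalized."""
    counts = Counter(tokens)
    raw = {
        term: tf_weight(count) * idf_weight(doc_freqs.get(term, 0), n_docs)
        for term, count in counts.items()
    }
    return cosine_normalize(raw)


def cosine_similarity(a, b):
    """Dot product of two already normalized vectors."""
    return sum(w * b.get(term, 0.0) for term, w in a.items())


if __name__ == "__main__":
    # Hypothetical document frequencies in a 100-document collection.
    doc_freqs = {"cystic": 10, "fibrosis": 10, "the": 100}
    concept = weight_vector(["cystic", "fibrosis"], doc_freqs, 100)
    abstract = weight_vector(["the", "cystic", "fibrosis", "gene"], doc_freqs, 100)
    print(cosine_similarity(concept, abstract))
```

Swapping `tf_weight`, `idf_weight` or the normalization function is how different weighting schemes would be compared on the tuning sample.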
15Specific normalization
- Removing meta-abbreviations
- NOS (Not otherwise specified)
- NEC (Not elsewhere classified)
- NOC (Not otherwise classifiable)
- NFQ (Not further qualified)...
- Handle/expand more than fifty SNOMED-specific abbreviations, mainly from Read codes
- ACOF ADVA AR CFIO CFSO FB FH FHM HFQ LOC MVNTA MVTA...
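The normalization step can be sketched as a regex that strips the meta-abbreviations, followed by a table lookup for expansions. The meta-abbreviation list follows the slide (with NEC as the usual short form of "not elsewhere classified"). The two expansions are illustrative assumptions only, since the slides list the abbreviations without their expansions.

```python
import re

# Meta-abbreviations to strip from term labels.
META_ABBREVIATIONS = ("NOS", "NEC", "NOC", "NFQ")
_META_RE = re.compile(r"[\s,;-]*\(?\b(?:%s)\b\)?" % "|".join(META_ABBREVIATIONS))

# Illustrative expansions only (assumed, not taken from the slides).
EXAMPLE_EXPANSIONS = {"FH": "family history", "FB": "foreign body"}


def normalize_term(term, expansions=EXAMPLE_EXPANSIONS):
    """Strip meta-abbreviations, then expand known upper-case abbreviations."""
    stripped = _META_RE.sub("", term)
    expanded = re.sub(
        r"\b[A-Z]{2,}\b",
        lambda m: expansions.get(m.group(0), m.group(0)),
        stripped,
    )
    return expanded.strip()


if __name__ == "__main__":
    for label in ["Cystic fibrosis NOS", "Asthma, NEC", "FH: Diabetes mellitus"]:
        print(label, "->", normalize_term(label))
```

Upper-case tokens that are not in the expansion table, such as DNA, are left unchanged.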
16Features: stems, noun phrases and thesauri
- PMID 11924965
- Simple multiplex genotyping by surface-enhanced resonance Raman scattering.
- Graham D, Mallinder BJ, Whitcombe D, Watson ND, Smith WE.
- The accurate detection of DNA sequences is
essential for a variety of post human genome
projects including detection of specific gene
variants for medical diagnostics and
pharmacogenomics. A specific DNA sequence
detection assay based on surface-enhanced
resonance Raman scattering (SERRS) and an
amplification refractory mutation system (ARMS)
is reported. Initially, generation of PCR
products was achieved by using specifically
designed allele-specific SERRS active primers.
Detection by SERRS of the PCR products confirmed
the presence of the sequence tested for by the
allele-specific oligonucleotides. This lead
directly to the multiplex genotyping of human DNA
samples for the deltaF508 mutational status of
the cystic fibrosis NP transmembrane
conductance regulator gene using SERRS active
primers in an ARMS assay. Removal of the
unincorporated primers allowed fast and accurate
analysis in this system in a multiplex format
without any separation of amplicons. The results
indicate that SERRS can be used in modern genetic
analysis and offers an opportunity for the
development of novel assays. This is the first
demonstration of the use of SERRS in multiplex
genotyping and shows potential advantages over
fluorescence as a detection technique with
considerable promise for future development.
- Major Terms: Cystic Fibrosis, DNA, Genotype, Polymerase Chain Reaction
17Example: ranked categories and passages
- Input
- Fibrodysplasia ossificans progressiva is a very
rare and disabling hereditary disorder of
connective tissue characterised by symmetric
congenital anomalies of the great toes and thumbs
and by progressive heterotopic ossification of
tendons, ligaments, fasciae and striated muscles.
In this case we report a 17-year-old boy who
presented with a painful swelling of the right
mandibula with trismus. Multiple heterotopic soft
tissue calcifications, severe scoliosis and
typical anomalies of toes and thumbs on the
radiographs were pathognomonic for fibrodysplasia
ossificans progressiva.
19Results
Weighting function        Top          Average
(concepts.abstracts)      Precision    Precision
dtu.dtn + regex           0.801        0.4545
lnc.lnn + regex           0.791        0.453
anc.ntn + regex           0.787        0.4515
atn.ntn + regex           0.823        0.4485
lnc.atn                   0.696        0.355
- The difference between combined and non-combined classifiers is statistically significant (15)
20Conclusions
- SNOMED categorization can be done with 80% precision
- Combined approaches are effective
- Multilingual aspects have to be considered: language tools have to be available in several languages
- There is a need for reference collections of texts duly indexed with SNOMED CT and acting as a gold standard
21Further conclusions
- SNOMED categorization is not only for scientific papers, but also and/or mainly for patient records
- Language aspects
- Epistemological aspects
- A large dissemination of SNOMED CT depends on a quality assurance policy and the availability of multilingual language tools, on top of an excellent terminology
- Future resources for SNOMED CT development and maintenance will only be available in case of a large dissemination.
22Future Work
- Need a hierarchical viewer for research: call for bids!
23- Please visit www.semanticmining.org !
- Thank you for your attention
24About MetaMap
CF collection, MeSH, UMLS distribution, not LHRC version