Automatic Assignment of SNOMED Categories

Transcript and Presenter's Notes

1
Automatic Assignment of SNOMED Categories: Preliminary and Qualitative Evaluations
P. Ruch, I. Tbahriti, J. Gobeill, R. Baud, C. Lovis and A. Geissbühler
robert.baud_at_sim.hcuge.ch
2
Aspects of any terminology
Language dependent
Language independent
In vitro
In vivo
Language Epistemology
Ontology
Terminology
Model of the domain: Taxonomy, Relationships, Formal constraints
Pragmatic domain: Representation is necessary for dissemination; Bag of terms
Expert domain representation: Medical literature; Unique preferred terms
3
Text categorization tasks: for what?
  • Keyword assignment
  • Libraries and Digital Libraries
  • ACM, IEEE, MeSH
  • Encoding
  • ICD for billing
  • SNOMED for patient record indexing
  • Gene attributes: process, function, component
    (GO)

4
Goal
  • To develop and evaluate an automatic tool for:
  • Browsing SNOMED CT
  • Automatic assignment of SNOMED CT categories to a
    text
  • Automatic assignment of SNOMED CT categories
  • Passage extraction to support the prediction
  • Display interface

5
Problem statement
  • Absence of tuning/evaluation data (gold standard)
  • To transform a MEDLINE/MeSH collection into a
    MEDLINE/SNOMED collection
  • Use the UMLS for the transformation
  • Tune and evaluate the categorization
    effectiveness
  • Discussion
  • Appropriateness of the collection
  • Comparison of MeSH and SNOMED (different
    granularity)

6
MEDLINE citation
  • PMID 11924965
  • Simple multiplex genotyping by surface-enhanced
    resonance Raman scattering.
  • Graham D, Mallinder BJ, Whitcombe D, Watson ND,
    Smith WE.
  • The accurate detection of DNA sequences is
    essential for a variety of post human genome
    projects including detection of specific gene
    variants for medical diagnostics and
    pharmacogenomics. A specific DNA sequence
    detection assay based on surface-enhanced
    resonance Raman scattering (SERRS) and an
    amplification refractory mutation system (ARMS)
    is reported. Initially, generation of PCR
    products was achieved by using specifically
    designed allele-specific SERRS active primers.
    Detection by SERRS of the PCR products confirmed
    the presence of the sequence tested for by the
    allele-specific oligonucleotides. This lead
    directly to the multiplex genotyping of human DNA
    samples for the deltaF508 mutational status of
    the cystic fibrosis transmembrane conductance
    regulator gene using SERRS active primers in an
    ARMS assay. Removal of the unincorporated primers
    allowed fast and accurate analysis in this system
    in a multiplex format without any separation of
    amplicons. The results indicate that SERRS can be
    used in modern genetic analysis and offers an
    opportunity for the development of novel assays.
    This is the first demonstration of the use of
    SERRS in multiplex genotyping and shows potential
    advantages over fluorescence as a detection
    technique with considerable promise for future
    development.
  • Major MeSH: Cystic Fibrosis; DNA; Genotype;
    Polymerase Chain Reaction
  • Minor MeSH: HLA-DQ Antigens Human Reverse
    Transcriptase Sequence Analysis Spectrum Analysis,
    Raman
  • Support, Non-U.S. Gov't
  • Only MeSH terms marked as major are used in our
    experiments!

7
UMLS mapping to SNOMED-CT (via the CUI)
  • PMID 11924965
  • SNOMED: cystic fibrosis (SN190911006/SN190905008),
    DNA (SN024851008), polymerase chain reaction
    (SN258066000), genotype (SN363779003)
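
The mapping from major MeSH terms to SNOMED codes through a shared CUI can be sketched as a two-step lookup. This is only an illustration, not the authors' implementation: the CUI keys below are placeholders rather than real UMLS identifiers, the table and function names are hypothetical, and the SNOMED codes are the ones listed on this slide. In practice both tables would be derived from the UMLS Metathesaurus.

```python
# Toy lookup tables standing in for UMLS data.
# The CUI keys are placeholders, NOT real UMLS identifiers.
MESH_TO_CUI = {
    "Cystic Fibrosis": "CUI_CF",
    "DNA": "CUI_DNA",
    "Genotype": "CUI_GENOTYPE",
    "Polymerase Chain Reaction": "CUI_PCR",
}

# SNOMED codes as listed on the slide, keyed by the placeholder CUIs.
CUI_TO_SNOMED = {
    "CUI_CF": ["SN190911006", "SN190905008"],
    "CUI_DNA": ["SN024851008"],
    "CUI_GENOTYPE": ["SN363779003"],
    "CUI_PCR": ["SN258066000"],
}


def mesh_to_snomed(major_mesh_terms):
    """Map each major MeSH term to SNOMED codes through its CUI.

    Terms without a CUI or without a SNOMED counterpart are dropped.
    """
    mapped = {}
    for term in major_mesh_terms:
        cui = MESH_TO_CUI.get(term)
        codes = CUI_TO_SNOMED.get(cui, [])
        if codes:
            mapped[term] = codes
    return mapped


citation_major_mesh = [
    "Cystic Fibrosis", "DNA", "Genotype", "Polymerase Chain Reaction",
]
gold_standard = mesh_to_snomed(citation_major_mesh)
```

Applied to every citation of a MEDLINE/MeSH collection, such a lookup would yield the MEDLINE/SNOMED collection used for tuning and evaluation.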

8
General strategies for text categorization
  • Retrieval based on word matching, which
    attributes concepts to a text based on the words
    shared between the text and the concepts
  • Cross-language IR (SAPHIRE Int., Hersh et al.
    1998)
  • Recent and rare
  • Empirical learning of text-concept associations
    from a training set of texts and their associated
    concepts
  • Reuters (Bayesian classifiers, Lewis 1992): 100
    classes
  • Text categorization/filtering paradigm
    (Sebastiani): hundreds of classes
  • → But learning conditions
  • → Usual browsing tools (UMLS, GO, CLUE Browsers)
  • Boolean, Completer, Exact match (morphology)
  • Poor ranking and recall effectiveness

9
Browsers
  • Browser
  • Hierarchical viewer, e.g. GO Browser (NIH)
  • Textual input
  • Input: regulation cystic fibrosis transmembrane
    conductance
  • Output: NOTHING
  • Good browser
  • Input: papular hamartoma between mesoderm and
    ectoderm
  • Output: linear papular ectodermal-mesodermal
    hamartoma in top 10

10
?
11
Learning and large scale text categorization
  • Limits of machine learning approaches
  • Number of binary classifiers: computational issue
  • Class sets must be static: diachronic issue
  • Rare classes are ignored: experimental issue
  • Lack of training data: transcendental issue
  • Collection of clinical reports with SNOMED-CT?

12
Data and metrics
  • Cystic fibrosis collection: 1239 MEDLINE records
  • Tuning/evaluation split: 239/1000
  • Top precision
  • Precision of the categories returned at the top of
    the list
  • Average precision
  • Average precision over 11 recall points
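
The two metrics can be sketched as follows. This is a minimal illustration under two assumptions that the slide does not confirm: "top precision" is read as precision at rank 1, and "average precision over 11 recall points" as the standard 11-point interpolated average precision. The function names are hypothetical.

```python
def top_precision(ranked, relevant):
    """Return 1.0 if the top-ranked category is relevant, else 0.0.

    Assumption: "top of the list" means rank 1.
    """
    return 1.0 if ranked and ranked[0] in relevant else 0.0


def eleven_point_average_precision(ranked, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    if not relevant:
        return 0.0
    # (recall, precision) observed at each relevant hit in the ranking.
    points = []
    hits = 0
    for rank, category in enumerate(ranked, start=1):
        if category in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    total = 0.0
    for i in range(11):
        level = i / 10
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level - 1e-12]
        total += max(candidates, default=0.0)
    return total / 11
```

Both functions score a single text; averaging them over the 1000 evaluation records would give collection-level figures.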

13
Basic Strategies
  • Pattern matcher: thesaurus, RegEx
  • word1wordn5 → word1 _,2 wordn
  • word1wordn5 → word1 wordiwordn
  • → Boolean scoring: good precision
  • Vector space (VS): Porter stems, TF-IDF
    weighting
  • → Fine scoring: good recall
  • Balanced combination of each method
  • Data-driven → need a (small) sample for tuning
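
A combination of the two strategies might look like the sketch below. It rests on several assumptions not stated on the slide: the regular expression tolerates up to two intervening words between term words, the vector-space side uses plain term-frequency cosine similarity (no Porter stemming or TF-IDF weighting), and the two scores are mixed linearly with a hypothetical weight `alpha` that would be tuned on the small sample.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def regex_score(term, text, max_gap=2):
    """Boolean score: 1.0 if the term words occur in order in the text,
    with at most max_gap intervening words between consecutive words."""
    words = [re.escape(w) for w in tokens(term)]
    if not words:
        return 0.0
    gap = r"\W+(?:\w+\W+){0,%d}" % max_gap
    pattern = r"\b" + gap.join(words) + r"\b"
    return 1.0 if re.search(pattern, text.lower()) else 0.0


def cosine_score(term, text):
    """Cosine similarity between raw term-frequency vectors."""
    a, b = Counter(tokens(term)), Counter(tokens(text))
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_score(term, text, alpha=0.5):
    """Linear mix of both scores; alpha is a hypothetical tuning weight."""
    return alpha * regex_score(term, text) + (1 - alpha) * cosine_score(term, text)


def rank_categories(terms, text, alpha=0.5):
    """Rank candidate category terms by decreasing combined score."""
    return sorted(terms, key=lambda t: combined_score(t, text, alpha), reverse=True)
```

The Boolean component rewards exact, ordered matches, while the cosine component still gives partial credit when only some term words appear in the text.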

14
Vector space parameters: weighting!
  • TF: weight(term) = f(term frequency) 1
  • IDF: weight(term) = f(document frequency^-1)
  • Normalization: cosine, pivoted
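
One common instance of these three parameters is sketched below: logarithmic term frequency, logarithmic inverse document frequency and cosine normalization. The slide does not say which functions f were actually used, so these choices and the function names are assumptions; pivoted normalization is not shown.

```python
import math
from collections import Counter


def log_tf(tf):
    """Logarithmic term frequency: 1 + ln(tf) for tf > 0."""
    return 1.0 + math.log(tf) if tf > 0 else 0.0


def idf(n_docs, doc_freq):
    """Inverse document frequency: ln(N / df); 0.0 for unseen terms."""
    return math.log(n_docs / doc_freq) if doc_freq else 0.0


def weight_vector(doc_tokens, doc_freqs, n_docs):
    """Log-TF x IDF weights, cosine-normalized to unit length."""
    counts = Counter(doc_tokens)
    raw = {
        term: log_tf(count) * idf(n_docs, doc_freqs.get(term, 0))
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()} if norm else raw
```

A term that is frequent in the text but rare in the collection receives the highest weight, and the cosine step keeps long and short texts comparable.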

15
Specific normalization
  • Removing meta-abbreviations
  • NOS (Not otherwise specified)
  • NEC (Not elsewhere classified)
  • NOC (Not otherwise classifiable)
  • NFQ (Not further qualified)...
  • Handle/expand more than fifty SNOMED-specific
    abbreviations, mainly from Read codes
  • ACOF ADVA AR CFIO CFSO FB FH FHM HFQ
    LOC MVNTA MVTA...
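
This normalization step can be sketched as a small term cleaner. The sketch is illustrative only: it strips three of the meta-abbreviations named above, and the expansion table holds a single assumed entry (FH read as "family history"), because the slide lists the abbreviations but not their expansions. The names `META_ABBREVIATIONS`, `EXPANSIONS` and `normalize_term` are hypothetical.

```python
import re

# Meta-abbreviations named on the slide; the authors' full list is not given.
META_ABBREVIATIONS = ("NOS", "NOC", "NFQ")

# Hypothetical expansion table: the expansion of FH is an assumption,
# since the slide gives the abbreviations without their expansions.
EXPANSIONS = {"FH": "family history"}


def normalize_term(term):
    """Strip meta-abbreviations and expand known abbreviations in a term."""
    # Remove the abbreviation together with any preceding comma or spaces.
    meta = r"[\s,]*\b(?:%s)\b" % "|".join(META_ABBREVIATIONS)
    cleaned = re.sub(meta, "", term)
    for abbreviation, expansion in EXPANSIONS.items():
        cleaned = re.sub(r"\b%s\b" % re.escape(abbreviation), expansion, cleaned)
    # Collapse any leftover runs of whitespace.
    return " ".join(cleaned.split())
```

For a made-up term such as "Hamartoma, NOS", the cleaner returns "Hamartoma", so that the meta-abbreviation no longer takes part in word matching.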

16
Features: stems, noun phrases and thesauri
  • PMID 11924965
  • Simple multiplex genotyping by surface-enhanced
    resonance Raman scattering.
  • Graham D, Mallinder BJ, Whitcombe D, Watson ND,
    Smith WE.
  • The accurate detection of DNA sequences is
    essential for a variety of post human genome
    projects including detection of specific gene
    variants for medical diagnostics and
    pharmacogenomics. A specific DNA sequence
    detection assay based on surface-enhanced
    resonance Raman scattering (SERRS) and an
    amplification refractory mutation system (ARMS)
    is reported. Initially, generation of PCR
    products was achieved by using specifically
    designed allele-specific SERRS active primers.
    Detection by SERRS of the PCR products confirmed
    the presence of the sequence tested for by the
    allele-specific oligonucleotides. This lead
    directly to the multiplex genotyping of human DNA
    samples for the deltaF508 mutational status of
    the cystic fibrosis NP transmembrane
    conductance regulator gene using SERRS active
    primers in an ARMS assay. Removal of the
    unincorporated primers allowed fast and accurate
    analysis in this system in a multiplex format
    without any separation of amplicons. The results
    indicate that SERRS can be used in modern genetic
    analysis and offers an opportunity for the
    development of novel assays. This is the first
    demonstration of the use of SERRS in multiplex
    genotyping and shows potential advantages over
    fluorescence as a detection technique with
    considerable promise for future development.
  • Major Terms: Cystic Fibrosis; DNA; Genotype;
    Polymerase Chain Reaction

17
Example: ranked categories and passages
  • Input
  • Fibrodysplasia ossificans progressiva is a very
    rare and disabling hereditary disorder of
    connective tissue characterised by symmetric
    congenital anomalies of the great toes and thumbs
    and by progressive heterotopic ossification of
    tendons, ligaments, fasciae and striated muscles.
    In this case we report a 17-year-old boy who
    presented with a painful swelling of the right
    mandibula with trismus. Multiple heterotopic soft
    tissue calcifications, severe scoliosis and
    typical anomalies of toes and thumbs on the
    radiographs were pathognomonic for fibrodysplasia
    ossificans progressiva.

18
(No Transcript)
19
Results
  Weighting function       Top          Average
  (concepts.abstracts)     Precision    Precision
  dtu.dtn + regex          0.801        0.4545
  lnc.lnn + regex          0.791        0.453
  anc.ntn + regex          0.787        0.4515
  atn.ntn + regex          0.823        0.4485
  lnc.atn                  0.696        0.355
  • Difference between combined and not combined
    classifiers is statistically significant (15)

20
Conclusions
  • SNOMED categorization can be done with 80%
    precision
  • Combined approaches are effective
  • Multilingual aspects have to be considered:
    language tools have to be available in several
    languages
  • There is a need for reference collections of
    texts duly indexed with SNOMED CT and acting as a
    gold standard

21
Further conclusions
  • SNOMED categorization is not only for scientific
    papers, but also and/or mainly for patient
    records
  • Language aspects
  • Epistemological aspects
  • Wide dissemination of SNOMED CT depends on a
    quality assurance policy and the availability of
    multilingual language tools, on top of an
    excellent terminology
  • Future resources for SNOMED CT development and
    maintenance will only be available if it is
    widely disseminated.

22
Future Work
  • Need a hierarchical viewer for research: call for
    bids!

23
  • Please visit www.semanticmining.org!
  • Thank you for your attention

24
About MetaMap
CF collection, MeSH, UMLS distribution, not LHRC version