April 19th, 2002

Provided by: DFK
1
Multilingual Concept Hierarchies for Medical
Information Organization and Retrieval
MUCHMORE
2
Project Overview
  • Application: Addressing a Real-Life Medical
    Scenario for Cross-Lingual Information Retrieval
  • Research & Development: Developing Novel,
    Hybrid (Corpus-/Concept-Based) Methods for
    Handling this Scenario
  • Evaluation: Evaluating the Technical
    Performance of (Combinations of) Existing and
    Novel Methods
3
User Perspective (ZInfo)
Vision BAIK Model
  • MuchMore: Provide Relevant Medical Information
    for a Specific Patient Problem
  • Automatically, from the Web
  • Independent of Language

4
User Perspective (ZInfo)
User Requirements
  • Automatic Query Generation (and Expansion),
    Identifying the Exact Problem of the Patient
  • Retrieval and Relevance Ranking of Evidence
    Based Medical Literature, Language Independent
  • Summarization and Filtering of Results
    According to a User Profile

5
User Perspective (ZInfo)
User Evaluation
  • Evaluate Usefulness: Query Generation; Relevance
    for Decisions in Diagnostics and Treatment
  • Use for Medical Cases: Part of Postgraduate
    Course in Medical Informatics
  • Problematic Issues: Different medical profiles,
    schools, experience, speciality; what is relevant
    for one user may mean less or nothing to another;
    evidence-based medicine criteria exist only for a
    small fraction of medicine
6
MuchMore Prototype
  • Overview of Prototype Functionality
  • Relation between Functionality and User
    Requirements
  • Issues Addressed by Research and Development
    within MuchMore

7
R&D in MuchMore
Semantic Annotation Based CLIR
  • Corpus Annotation (DFKI, ZInfo): PoS, Morphology,
    Phrases, Grammatical Functions; Term and Relation
    Tagging
  • Term Extraction (XRCE, EIT, CMU, CSLI): Bilingual
    Lexicon Extraction, Extension of Semantic Resources
  • Sense Disambiguation (CSLI, DFKI): Tuning and
    Extension of Semantic Resources; Combining Sense
    Disambiguation Methods
  • Relation Extraction (DFKI, CSLI): Grammatical
    Function Tagging; Extracting Semantic Relation
    Indicators; Extracting Novel Semantic Relations
  • Semantic Indexing/Retrieval (EIT, DFKI)
8
R&D in MuchMore
Additional Approaches in CLIR
  • Corpus Based CLIR
  • Bilingual Lexicon Extraction (XRCE, EIT, CMU,
    CSLI)
  • Pseudo Relevance Feedback, PRF (CMU)
  • Generalized Vector Space Model, GVSM (CMU)
  • Text Classification Based CLIR (CMU):
    Hierarchical/Flat kNN with MeSH
  • Summarization (CMU): Query, Genre Specific
9
Corpus Annotation
Annotation Evaluation
Corpus: 9000 English and German Medical
Abstracts from 41 Journals, Springer LINK
Web Site, 1 M Tokens for each Language
  • PoS
  • Lexicon Update, Remaining Error Rate 1.5%
    (EN)
  • Histologically, we found a subepidermal blister
    formation and a predominantly neutrophilic
    infiltrate. pos=VB => pos_correct=NN

Morphology
Test Set     German Nouns   MMorph   Recall (%)   Incorrect   Error Rate (%)
test-dvlp    889            617      69.40        42          6.81
test-final   989            683      69.06        79          11.57
Incorrect, e.g. Chorionzottenbiopsie => Chor
Ion Zotte Biopsie
Term and Relation Tagging: Evaluation of 8
DE/EN Parallel Abstracts, Relevant for a Query
10
Term Extraction
XRCE (Aims and Resources)
  • Aim
  • Bilingual Lexicon Extraction
  • From Comparable Corpora at Word Level; From
    Parallel Corpora at Word and Term (Multi-Word)
    Level
  • Bilingual Extension of Semantic Resource (MeSH)
  • Resources
  • Optimal Combination of Existing Resources
    (Corpus, General Dictionary, Thesaurus: MeSH)
  • Corpus Specific German Decompounding (Improves
    Recall by 25% at Equal Precision)

German                                  English
verbesserter transabdomineller Techniken   improved transabdominal techniques
Prognose des Frühcarcinoms              prognosis of early gastric cancer
Verletzungen des Gehirns                intracranial injuries
Lebensqualitaet                         quality of life
11
Term Extraction
XRCE (Results of Best Method)
  • Optimal Combination of Resources
  • Retaining only 10 best Translations for each
    Candidate
  • 1. word-to-word, comparable corpora: F1 = 0.84
  • 2.a word-to-word, parallel corpora: F1 = 0.98
  • 2.b term-to-term, parallel corpora: F1 = 0.85
  • Evaluating Separately with Individual Resources
    (F1)
  • Corpus: 0.62; MeSH: 0.51; General Dictionary:
    0.56
  • 3. MeSH Extension: 1453 new multi-word terms
    added (synonyms or new term entries), extracted
    from the Springer corpus

12
Term Extraction
EIT (Similarity Thesauri)
Method: Extract Most Frequent Terms (Single
Word) by Comparison of Term Frequencies in a
General Corpus (German SDA, English LA Times)
vs. Medical Corpus
  • Results
  • Single Word Terms (Springer Abstracts)
  • German-English 104,904 / English-German 49,454
  • Multiword Terms (Phrase Lexicon Generated from
    ICD10)
  • German Phrases 354 / English Phrases 665
  • Bilingual Phrasal Entries Generated
  • German-English 225 / English-German 246
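
The frequency comparison named in the method can be sketched as follows. This is only an illustration under stated assumptions: the slide does not give the scoring formula, so the relative-frequency ratio, the add-one smoothing, and the names `domain_terms`, `min_ratio`, and `min_count` are hypothetical, and the toy token lists stand in for the medical and general corpora.

```python
from collections import Counter


def domain_terms(medical_tokens, general_tokens, min_ratio=2.0, min_count=2):
    """Rank single-word terms that are relatively more frequent in the
    medical corpus than in the general corpus (hypothetical scoring)."""
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    scored = []
    for word, count in med.items():
        if count < min_count:
            continue  # ignore words that are rare in the medical corpus
        med_rel = count / med_total
        # Add-one smoothing: words unseen in the general corpus get a small
        # non-zero frequency instead of causing a division by zero.
        gen_rel = (gen[word] + 1) / (gen_total + 1)
        ratio = med_rel / gen_rel
        if ratio >= min_ratio:
            scored.append((word, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for the medical and general corpora.
medical = ("the patient had a fracture the fracture healed "
           "the biopsy was negative biopsy").split()
general = ("the game was on the news and the weather was fine "
           "in the city today").split()
terms = [word for word, _ in domain_terms(medical, general)]
print(terms)
```

On this toy input the function keeps the domain words and drops the function word "the", which is frequent in both corpora.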

13
Term Extraction
CMU (EBT Bilingual Lexicon)
Method
  • For each word in one language, accumulate counts
    of the number of times the translations of the
    sentences containing that word include each word
    of the other language. These co-occurrence counts
    may be restricted using word-alignment techniques.
  • Apply a variable threshold to filter out uncommon
    co-occurrences which are unlikely to be
    translations. The result is a lexicon listing
    candidate translations and their relative
    frequencies.
  • Results
  • 99,000 Bilingual Term Pairs (PubMed Parallel
    Abstracts)
  • (Estimated Error Rate < 10%)
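
A minimal sketch of the two steps described in the method, assuming sentence-aligned input. The function name `cooccurrence_lexicon` and the fixed `min_rel_freq` cut-off are assumptions for illustration; the slide speaks of a variable threshold and optional word alignment, neither of which is modelled here.

```python
from collections import Counter, defaultdict


def cooccurrence_lexicon(sentence_pairs, min_rel_freq=0.75):
    """Build a candidate translation lexicon from aligned sentence pairs.

    For every source word, count how often each target word appears in the
    translations of the sentences containing it, then keep only candidates
    whose relative frequency reaches min_rel_freq (a fixed, hypothetical
    stand-in for the variable threshold mentioned on the slide).
    """
    counts = defaultdict(Counter)
    occurrences = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        source_words = set(source_sentence.lower().split())
        target_words = set(target_sentence.lower().split())
        for word in source_words:
            occurrences[word] += 1
            counts[word].update(target_words)

    lexicon = {}
    for word, target_counts in counts.items():
        kept = {
            target: count / occurrences[word]
            for target, count in target_counts.items()
            if count / occurrences[word] >= min_rel_freq
        }
        lexicon[word] = kept
    return lexicon


pairs = [
    ("der knochen heilt", "the bone heals"),
    ("der knochen bricht", "the bone breaks"),
    ("der patient schläft", "the patient sleeps"),
]
lexicon = cooccurrence_lexicon(pairs)
print(lexicon["knochen"])
```

In this toy run "bone" survives as a candidate for "knochen", but so does the function word "the"; raw co-occurrence cannot separate them, which is presumably where word alignment or a statistical association measure comes in.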

14
Term Extraction
CSLI (Infomap System)
Represent English and German Words as Vectors
that are Produced by Recording the Number of
Co-Occurrences of the Word in Question with each
of a Set of Content-Bearing Words. Use (Cosine)
Similarity Measure on these Rows to Find Nearest
Neighbours.
1,000 (English) content-bearing words
Term (EN)        SIM    Term (DE)           SIM
bone             1.00   knochen             0.82
cancellous       0.70   knochens            0.71
osteoinductive   0.67   knochenneubildung   0.67
demineralized    0.65   spongiosa           0.64
trabeculae       0.64   knochenresorption   0.60
formation        0.60   allogenen           0.60
periosteum       0.56   knöcherne           0.59

[Figure: English words (ligament, knee joint, ...) and German words (Kreuzband, Kniegelenk, ...)]
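
The vector representation and cosine comparison can be sketched as below. This is not the Infomap implementation: the helper names, the window size, and the idea of concatenating each English sentence with its German translation so that German words co-occur with English content-bearing words are all assumptions made for the sake of a small runnable example.

```python
import math
from collections import Counter


def cooccurrence_vectors(sentences, content_words, window=10):
    """Map each word to a vector of co-occurrence counts with the
    content-bearing words found within `window` tokens of it."""
    content = set(content_words)
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i and tokens[j] in content:
                    vectors.setdefault(word, Counter())[tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbours(word, vectors, k=3):
    """Return the k words whose vectors are most similar to `word`."""
    target = vectors[word]
    scored = [(other, cosine(target, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "parallel" data: each English sentence joined with its German counterpart.
sentences = [
    "bone formation knochen knochenneubildung".split(),
    "bone resorption knochen knochenresorption".split(),
    "ligament knee kreuzband kniegelenk".split(),
]
content_words = ["bone", "formation", "resorption", "ligament", "knee"]
vectors = cooccurrence_vectors(sentences, content_words)
print(nearest_neighbours("kreuzband", vectors, k=1))
```

Because all vectors live in the same space of English content-bearing words, English and German terms can be compared directly, which is what the similarity table above shows for "bone".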
15
WSD Terms, Senses
Semantic Resource Extension and Tuning
  • Extension (DFKI)
  • Morphological Analysis (Decomposition)
  • Entzündungsgewebe (inflammation tissue) HYPONYM
    Gewebe, Körpergewebe (body tissue)
  • Gewebe, Stoff, Textilstoff (textile)
  • Semantic Similarity (Co-Occurrence Patterns)
  • Karzinom (carcinoma), Metastase (metastasis)
    SYNONYM Geschwulst, Tumor, ...
  • Tuning (CSLI, DFKI)
  • Aligning Clusters with Senses
  • C0043210|GER|P|L1254343|PF|S1496289|Frauen|3
  • C0043210|ENG|P|L1189496|PF|S1423265|Human adult
    females|0

16
WSD Algorithm
Combination of Methods (Task, Domain, General)
  • Bilingual Sense Selection (CSLI)
  • 1 Sense in L1 vs. >1 Sense in L2
  • English: blood vessel (C0005847) vs. vessel
    (polysaccharide) (C0148346)
  • German: Blutgefaesse, blood vessel (C0005847)
  • Collocations and Senses (CSLI)
  • For an ambiguous single word term that is part
    of several unambiguous multiword terms, choose
    the sense of the most frequent multiword term.
  • single word term: abortion 1) a natural process
    C0000786 (T047)
  • 2) a medical procedure C0000811 (T061)
  • multiword term: recurrent abortion C0000809
    (T047) => sense 1
  • induced abortion C0000811 (T061) => sense 2
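
The collocation rule can be written down in a few lines. The sketch below is an illustration only: `sense_from_collocations` is a hypothetical helper, senses are identified by the semantic type codes from the slide, and the corpus counts are invented for the example.

```python
def sense_from_collocations(word, multiword_senses, multiword_counts):
    """Pick the sense of the most frequent unambiguous multiword term
    that contains the ambiguous single word term."""
    best_sense, best_count = None, 0
    for term, sense in multiword_senses.items():
        if word in term.split():
            count = multiword_counts.get(term, 0)
            if count > best_count:
                best_sense, best_count = sense, count
    return best_sense


# Semantic types as on the slide: T047 = natural process, T061 = procedure.
multiword_senses = {
    "recurrent abortion": "T047",
    "induced abortion": "T061",
}
# Hypothetical corpus frequencies, invented for this example.
multiword_counts = {"recurrent abortion": 3, "induced abortion": 7}

chosen = sense_from_collocations("abortion", multiword_senses, multiword_counts)
print(chosen)
```

With these made-up counts the more frequent "induced abortion" decides the sense; a word that occurs in no known multiword term yields `None`, leaving the decision to the other methods in the combination.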

17
WSD Algorithm
Combination of Methods (Task, Domain, General)
  • Domain Specific Senses (DFKI)
  • Concept Relevance in Domain Corpus
  • Mineral: 0.030774033 (Mineralstoff, Eisen,
    Ferrum, Fluor, Kalzium, Magnesium) vs.
    4.9409806E-5 (Allanit, Alumogel, ..., Axionit,
    Beryll, ..., Wurtzit, Zirkon)
  • Instance-Based Learning (DFKI)
  • Unsupervised Context Models (n-grams)
  • Training (Learn Class Models): He drank <milk
    LIQUID>; He drank <coffee LIQUID>; He drank
    <tea LIQUID>; He drank <chocolate FOOD,
    LIQUID>
  • Application (Apply Class Models): He drank
    <chocolate FOOD, LIQUID>; He drank <Java
    GEOGRAPHICAL, LIQUID>
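
The training and application steps in the example can be mimicked with a very small context model. This is a rough sketch, not the DFKI system: the function names are hypothetical, the context is reduced to the literal left n-gram "He drank", and ambiguous training instances are assumed to split their vote evenly across their candidate classes.

```python
from collections import Counter, defaultdict


def train_class_models(instances):
    """Learn, per context, how strongly each semantic class is supported.

    instances: iterable of (context, candidate_classes). An instance with
    several candidate classes contributes a fractional vote to each
    (an assumption; the slide does not say how ambiguity is handled).
    """
    models = defaultdict(Counter)
    for context, classes in instances:
        for cls in classes:
            models[context][cls] += 1 / len(classes)
    return models


def disambiguate(context, candidate_classes, models):
    """Choose the candidate class best supported by the context model."""
    votes = models.get(context, Counter())
    return max(candidate_classes, key=lambda cls: votes[cls])


training = [
    ("He drank", ["LIQUID"]),          # milk
    ("He drank", ["LIQUID"]),          # coffee
    ("He drank", ["LIQUID"]),          # tea
    ("He drank", ["FOOD", "LIQUID"]),  # chocolate
]
models = train_class_models(training)

print(disambiguate("He drank", ["FOOD", "LIQUID"], models))          # chocolate
print(disambiguate("He drank", ["GEOGRAPHICAL", "LIQUID"], models))  # Java
```

Both ambiguous words resolve to LIQUID here because the context "He drank" has been seen mostly with unambiguous liquids.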

18
WSD Evaluation
Lexical Sample Evaluation Corpora (Medical)
  • Ambiguous: MeSH EN 847 (2.5), DE 780 (2.1);
    EWN EN 6300 (2.8), DE 4059 (1.5)
  • Evaluation (Nouns): GermaNet (40), English MeSH
    (59), German MeSH (28)

Band (tape, strap, ligament)
Fall (drop, case, instance)
Gefäss (jar, vessel)
Operation (operation, surgery)
Prüfung (survey, tryout, checkup)
Verletzung (injury, trauma)
Wahl (ballot, choice, option)
Lage (site, status, position, layer)
Gewicht (weight, importance)

19
Relation Extraction
Grammatical Function Tagging (DFKI)
  • Robust, Shallow Grammatical Function Tagger
  • EM Model (Trained on Frankfurter Rundschau, 35M
    Tokens; Adaptation on Medical Corpora Under
    Development)
  • 1.5M Types: Verb, Voice, Function,
    Nom-Head-Argument
  • Example: abarbeiten ACT SUBJ Politiker
  • Use of PoS Information; Use of Chunk
    Information Planned
  • Tags for SUBJ, OBJ, IOBJ, ACT/PAS
  • German Available, English under Development

Untersucht <PRED1PAS> wurden 30 Patienten
<PRED1SUBJ> <PRED2SUBJ>, die sich <PRED2SUBJ>
einer elektiven aortokoronaren Bypassoperation
<PRED2IOBJ> unterziehen <PRED2ACT> mussten.
(English: "30 patients who had to undergo an elective
aortocoronary bypass operation were examined.")
20
Relation Extraction
Semantic Relation Indicators (DFKI, CSLI)
Novel Semantic Relations (DFKI, CSLI)
  • Cluster 1: T047/T060 (Diagnoses), T060/T101
    (Affects), T060/T169, ...
  • Cluster 2: T101/T169, T101/T184, T101/T048, ...
  • Cluster 3: T047/T121 (Treats, Causes), T061/T121
    (Uses), T121/T184 (Treats), ...
Relation indicator verbs (as grouped on the slide)
  • differentiate, conclude, discriminate, diagnose,
    illustrate
  • reduce, treat, follow, diagnose, cure
  • suffer, demonstrate, progress, develop, die
Semantic types: T047 Disease; T048 Mental
Dysfunction; T060 Diagnostic Procedure; T101
Patient; T121 Pharm. Substance; T169 Funct.
Concept (Syndrome); T184 Sign or Symptom
21
Summarization (CMU)
Extractive Summarization
  • Maximal Marginal Relevance (MMR)
  • Find passages most relevant to query
  • Maximize information novelty (minimize passage
    redundancy)
  • Assemble extracted passages for summary
  • $\operatorname{Argmax}^{k}_{d_i \in C} \left[ \lambda\, S(Q, d_i) - (1-\lambda) \max_{d_j \in R} S(d_i, d_j) \right]$
  • $Q$: query, $d$: document, $S$: similarity
    function
  • $\lambda$: tradeoff factor between relevance & novelty
  • $k$: number of passages to include in summary

Applications
  • Re-ranking retrieved documents from IR Engine
  • Ranking passages from a document for inclusion
    in summaries
  • Ranking passages from topically-related document
    cluster for cluster summary
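
A greedy reading of the MMR criterion can be sketched as follows. The assumptions are that C is the set of candidate passages and R the set already selected (the slide does not define them), that the tradeoff factor is called `lam`, and that a simple word-overlap (Jaccard) measure stands in for the unspecified similarity function S.

```python
def jaccard(a, b):
    """Word-overlap similarity; a stand-in for the similarity function S."""
    a_words, b_words = set(a.lower().split()), set(b.lower().split())
    union = a_words | b_words
    return len(a_words & b_words) / len(union) if union else 0.0


def mmr_select(query, candidates, similarity=jaccard, lam=0.7, k=2):
    """Greedily pick k passages, trading query relevance against
    redundancy with the passages already selected."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(passage):
            redundancy = max(
                (similarity(passage, chosen) for chosen in selected),
                default=0.0,
            )
            return lam * similarity(query, passage) - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = "bone fracture treatment"
passages = [
    "bone fracture treatment with surgery",
    "bone fracture treatment with surgery today",
    "fracture treatment outcomes in children",
]
summary = mmr_select(query, passages, lam=0.7, k=2)
relevance_only = mmr_select(query, passages, lam=1.0, k=2)
print(summary)
print(relevance_only)
```

With `lam=1.0` the two near-duplicate passages are both chosen; lowering it to 0.7 makes the second pick the less relevant but novel passage, which is the behaviour the novelty term is meant to produce.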
22
Summarization (CMU)
MuchMore Application
=> INDICATIVE and QUERY-RELEVANT
Task                                             Query-Relevant (focused)                   Query-Free (generic)
INDICATIVE, for Filtering (Do I read further?)   To filter search engine results            Short abstracts
CONTENTFUL, for reading in lieu of full doc.     To solve problems for busy professionals   Executive summaries
  • MMR applies to English and German
  • Genre-based specialization (e.g. include
    conclusions for scientific articles)
  • Linguistic specialization possible
  • Summarization should apply when retrieving FULL
    articles => query-driven summaries instead of
    generic abstracts

23
Technical Evaluation
Test Data
  • Test Collection: Springer Abstracts (German and
    English)
  • Query Set: 25 of 126, Selected by ZInfo
  • Relevance Assessments
  • Assumption: Documents Retrieved by all Runs for
    one Query (Intersection) are Relevant
  • Pool Size: 500 Documents, Based on 18 Runs Done
    by CMU, CSLI and EIT
  • German (ZInfo): 959 Relevant Documents
  • English (CMU): 500 Relevant Documents (1 judge),
    964 Relevant Documents (3 judges)
24
Technical Evaluation
Methods Evaluated
  • Corpus Based: Similarity Thesaurus (EIT)
  • Example-based Translation (CMU)
  • Pseudo Relevance Feedback (CMU)
  • Generalized Vector Space Model (CMU)
  • Hybrid: Classification (CMU)
  • Hierarchical kNN, Rocchio
  • Flat kNN, Rocchio-style Classifier
  • Semantic Annotation: Extraction (DFKI, XRCE)
  • UMLS / XRCE Terms, Semantic Relations,
    EuroWordNet Terms
  • Semantic Annotation + Similarity Thesaurus

25
Technical Evaluation
TREC-Style Performance Measurements
  • Overall Performance
  • 11-point Average Precision (Interpolated)
  • Performance in the High-Precision Area
  • Assumption: User Wants to Get Most Relevant
    Documents Top-Ranked within the Result List
  • Average Interpolated Precision at Recall of 0.1
  • Exact Precision after 10 Retrieved Documents
  • Applied to Experiments Evaluating Semantic
    Annotations
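
For reference, the measures listed above can be computed for a single query as sketched below. The function names are illustrative, the document identifiers are made up, and the interpolation follows the usual TREC convention (interpolated precision at a recall level is the highest precision observed at that recall or any higher one).

```python
def interpolated_precisions(ranked_ids, relevant_ids):
    """Interpolated precision at the 11 recall levels 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant_ids)
    points = []  # (recall, precision) measured at each relevant document
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Mean of the interpolated precision over the 11 recall levels."""
    if not relevant_ids:
        return 0.0
    values = interpolated_precisions(ranked_ids, relevant_ids)
    return sum(values) / len(values)


def precision_at(ranked_ids, relevant_ids, n=10):
    """Exact precision after the first n retrieved documents."""
    relevant = set(relevant_ids)
    return sum(1 for doc_id in ranked_ids[:n] if doc_id in relevant) / n


ranked = ["d1", "d2", "d3", "d4"]  # hypothetical system output
relevant = {"d1", "d3"}            # hypothetical relevance judgements

avg_11pt = eleven_point_average_precision(ranked, relevant)
prec_at_recall_01 = interpolated_precisions(ranked, relevant)[1]
prec_at_2 = precision_at(ranked, relevant, n=2)
print(avg_11pt, prec_at_recall_01, prec_at_2)
```

The reported figures would then be averages of these per-query values over the query set.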

26
Technical Evaluation
Results Corpus Based Methods
Data Sets
  • EIT: The Springer Parallel Corpus, i.e. 9640
    Documents for English, and 9640 Documents for
    German
  • CMU: Half of the Corpus, i.e. a Test Set with
    4820 Documents in each
System                             Eng-Eng   Ger-Ger   Ger-Eng   Eng-Ger
Monolingual EIT lnu.ltn            0.1914    0.1848    N/A       N/A
Crosslingual EIT SimThes lnu.ltn   N/A       N/A       0.1258    0.1109
Monolingual PRF                    0.6782    0.5078    N/A       N/A
Crosslingual PRF                   N/A       N/A       0.5487    0.5758
EBT chi-squared                    N/A       N/A       0.5232    0.5396
Crosslingual GVSM                  (first evaluation to be completed in July, 2002)
27
Technical Evaluation
Results Hybrid Methods
Categorization (Preliminary Results)
  • Reuters-21578: 10,000 documents, 90 categories
  • Reuters Corpus Volume 1, TREC-10 version (RCV1):
    783,484 documents, 84 categories
  • Reuters Koller & Sahami subsets (ICML98): 138 to
    939 documents, 6-11 categories in a set
  • OHSUMED: 233,445 documents, 14,321 categories
System     Data Set           Macro-avg F1    Micro-avg F1
kNN        Reuters-21578      .60             .86
Rocchio    Reuters-21578      .59             .85
kNN        RCV1.TREC-10       (F0.5 .44)      (F0.5 .55)
Rocchio    RCV1.TREC-10       (F0.5 .39)      (F0.5 .49)
kNN        R-KS Subsets (3)   .85, .81, .97   .89, .80, .94
HkNN       R-KS Subsets (3)   .85, .80, .98   .86, .82, .99
Rocchio    R-KS Subsets (3)   .80, .75, .96   .82, .83, .96
HRocchio   R-KS Subsets (3)   .83, .81, .98   .78, .84, .99
kNN        OHSUMED            .26             .48
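
The gap between macro- and micro-averaged F1 in the table (for example .60 vs. .86 on Reuters-21578) follows from how the two averages are formed. The sketch below uses invented per-category counts purely to show the computation; none of the numbers come from the experiments.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_micro_f1(per_category):
    """per_category maps a category name to its (tp, fp, fn) counts."""
    # Macro: average the per-category F1 values, every category counts equally.
    macro = sum(f1(*counts) for counts in per_category.values()) / len(per_category)
    # Micro: pool the counts first, so frequent categories dominate.
    tp = sum(counts[0] for counts in per_category.values())
    fp = sum(counts[1] for counts in per_category.values())
    fn = sum(counts[2] for counts in per_category.values())
    micro = f1(tp, fp, fn)
    return macro, micro


# Invented counts: one frequent, well-classified category and one rare,
# poorly classified category.
counts = {"common": (90, 10, 10), "rare": (1, 3, 4)}
macro, micro = macro_micro_f1(counts)
print(round(macro, 3), round(micro, 3))
```

A classifier that does well on frequent categories but poorly on rare ones therefore scores much higher on micro-averaged F1, which is plausibly what the OHSUMED row (14,321 categories) reflects.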
28
Technical Evaluation
Results Hybrid Methods
Semantic Annotation: Extraction
  • Data Set: Full Springer Corpus
  • Weighting Scheme: Coordination Level Matching
    (CLM)
  • 1. Pass: Documents Preferred Containing Matching
    Terms or Semantic Relations
  • 2. Pass: All Features Using lnu.ltn
  • Rel. Assessments: German
System                  11pt AvPrec         Prec at Recall of 0.1   Prec at 10 Docs Retr
                        SemA-v3   SemA-v4   SemA-v3   SemA-v4       SemA-v3   SemA-v4
EN2DE Morph EWN         -         0.0005    -         0.0017        -         0.0040
EN2DE Morph UMLS        -         0.0933    -         0.2898        -         0.1840
EN2DE Morph UMLS XRCE   -         0.1486    -         0.4258        -         0.3360
DE2EN Morph EWN         -         0.0479    -         0.1240        -         0.0960
DE2EN Morph UMLS        0.1507    0.1392    0.3895    0.3963        0.2520    0.2920
29
Technical Evaluation
Results Hybrid Methods
Semantic Annotation + Similarity Thesaurus
  • Data Set: Full Springer Corpus
  • Weighting Scheme: Coordination Level Matching
    (CLM)
  • Rel. Assessments: German
System                               11pt AvPrec   Prec at Recall of 0.1   Prec at 10 Docs Retr
EN2DE transl. Morphology EWN         0.0276        0.1353                  0.1000
EN2DE transl. Morphology UMLS        0.1487        0.4126                  0.3320
EN2DE transl. Morphology UMLS XRCE   0.1706        0.4495                  0.3600
DE2EN transl. Morphology EWN         0.1101        0.3165                  0.2000
DE2EN transl. Morphology UMLS        0.1413        0.4038                  0.2680
30
Technical Evaluation
Summary of the Results
  • Assumption: CLIR achieves up to 75% of
    Monolingual Baseline (11pt Average Precision)
  • Corpus-based Methods (Compared to Monolingual
    PRF)
  • German-English: PRF 81%, EBT 77%, EIT 66%
  • English-German: PRF 113%, EBT 106%, EIT 60%
  • Hybrid Methods (Compared to Monolingual EIT)
  • German-English: 73% (UMLS Terms + SemRels)
  • English-German: 50% (UMLS Terms + SemRels)
  • English-German: 80% (UMLS Terms + SemRels +
    XRCE Terms)
  • German-English: 74% (SimThes + UMLS Terms +
    SemRels)
  • English-German: 80% (SimThes + UMLS Terms +
    SemRels)
  • English-German: 92% (SimThes + UMLS Terms +
    SemRels + XRCE Terms)
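
These percentages can be reproduced from the 11pt average precision values in the result tables by dividing each cross-lingual score by the monolingual score of the target language. The short check below does that arithmetic; the variable names are illustrative, and it assumes (because the numbers only match that way) that the EIT and hybrid figures are relative to the monolingual EIT run rather than monolingual PRF.

```python
def percent_of_baseline(crosslingual, monolingual):
    """Cross-lingual score as a rounded percentage of the monolingual one."""
    return round(100 * crosslingual / monolingual)


# Monolingual 11pt average precision, keyed by target language.
mono_prf = {"Eng": 0.6782, "Ger": 0.5078}
mono_eit = {"Eng": 0.1914, "Ger": 0.1848}

# (label, cross-lingual 11pt average precision, monolingual baseline)
runs = [
    ("PRF Ger-Eng", 0.5487, mono_prf["Eng"]),
    ("PRF Eng-Ger", 0.5758, mono_prf["Ger"]),
    ("EBT Ger-Eng", 0.5232, mono_prf["Eng"]),
    ("EBT Eng-Ger", 0.5396, mono_prf["Ger"]),
    ("EIT Ger-Eng", 0.1258, mono_eit["Eng"]),
    ("EIT Eng-Ger", 0.1109, mono_eit["Ger"]),
    ("UMLS Ger-Eng", 0.1392, mono_eit["Eng"]),
    ("UMLS Eng-Ger", 0.0933, mono_eit["Ger"]),
    ("UMLS+XRCE Eng-Ger", 0.1486, mono_eit["Ger"]),
    ("SimThes+UMLS Ger-Eng", 0.1413, mono_eit["Eng"]),
    ("SimThes+UMLS Eng-Ger", 0.1487, mono_eit["Ger"]),
    ("SimThes+UMLS+XRCE Eng-Ger", 0.1706, mono_eit["Ger"]),
]

summary = {label: percent_of_baseline(score, base) for label, score, base in runs}
for label, percent in summary.items():
    print(f"{label}: {percent}%")
```

Values above 100%, as for English-German PRF and EBT, mean the cross-lingual run outperformed the German monolingual baseline on this test set.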

31
Management
Deviations from the Work Plan
  • Corpus Collection
  • Comparable Medical Document Corpora are Very
    Difficult to Obtain, Anonymization Must be
    Validated by Hospital CIO
  • Work with Shuffled Parallel Corpus
  • Radiology Reports (600,000) Available in German,
    to be Obtained for English
  • Corpus Annotation
  • More Efforts on Improving PoS Tagging and
    Morphological Analysis (English and German
    Medical Specialist Lexicon)
  • Relation Extraction
  • More Efforts on Grammatical Function Tagging as
    Preprocessing for Semantic Relation Tagging and
    Extraction

32
Management
Future Prospects and Activities
  • R&D Topics
  • Ontology Development: Combining Axes in
    AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI,
    DFKI)
  • Semantic Web: Semantic Annotation of Medical
    Documents with Metadata (UMLS in Protégé)
  • Related Projects and Workshops
  • Project Proposal: IKAR/OS on KM Visualization in
    Life Sciences
  • OntoWeb SIG on LT in Ontology Development and Use
  • MuchMore Workshop with Invited Experts in Medical
    Information Access, CLIR and Semantic Annotation
    (September 2002)
  • ZInfo/MuchMore Workshop on Electronic Patient
    Records (Spring 2003)