Title: April 19th, 2002
1 Multilingual Concept Hierarchies for Medical
Information Organization and Retrieval
MUCHMORE
2 Project Overview
- Application: Addressing a Real-Life Medical Scenario
  for Cross-Lingual Information Retrieval
- Research & Development: Developing Novel, Hybrid
  (Corpus-/Concept-Based) Methods for Handling this Scenario
- Evaluation: Evaluating the Technical Performance of
  (Combinations of) Existing and Novel Methods
3 User Perspective (ZInfo)
Vision: BAIK Model
- MuchMore
- Provide Relevant Medical Information
- for a Specific Patient Problem
- Automatically, from the Web
- Independent of Language
4 User Perspective (ZInfo)
User Requirements
- Automatic Query Generation (and Expansion),
  Identifying the Exact Problem of the Patient
- Retrieval and Relevance Ranking of Evidence-Based
  Medical Literature, Language Independent
- Summarization and Filtering of Results
  According to a User Profile
5 User Perspective (ZInfo)
User Evaluation
- Evaluate Usefulness
  - Query Generation
  - Relevance for Decisions in Diagnostics and Treatment
- Use for Medical Cases
  - Part of Postgraduate Course in Medical Informatics
- Problematic Issues
  - Different medical profiles, schools, experience, speciality
  - What is relevant for one user may mean little or nothing
    to another
  - Evidence-based medicine criteria exist only for a small
    fraction of medicine
6 MuchMore Prototype
- Overview of Prototype Functionality
- Relation between Functionality and User Requirements
- Issues Addressed by Research and Development
  within MuchMore
7 R&D in MuchMore
Semantic Annotation Based CLIR
- Corpus Annotation (DFKI, ZInfo)
  - PoS, Morphology, Phrases, Grammatical Functions
  - Term and Relation Tagging
- Term Extraction (XRCE, EIT, CMU, CSLI)
  - Bilingual Lexicon Extraction, Extension of Semantic Resources
- Sense Disambiguation (CSLI, DFKI)
  - Tuning and Extension of Semantic Resources
  - Combining Sense Disambiguation Methods
- Relation Extraction (DFKI, CSLI)
  - Grammatical Function Tagging
  - Extracting Semantic Relation Indicators
  - Extracting Novel Semantic Relations
- Semantic Indexing/Retrieval (EIT, DFKI)
8 R&D in MuchMore
Additional Approaches in CLIR
- Corpus Based CLIR
  - Bilingual Lexicon Extraction (XRCE, EIT, CMU, CSLI)
  - Pseudo Relevance Feedback, PRF (CMU)
  - Generalized Vector Space Model, GVSM (CMU)
- Text Classification Based CLIR (CMU)
  - Hierarchical/Flat kNN with MeSH
- Summarization (CMU)
  - Query, Genre Specific
9 Corpus Annotation
Annotation Evaluation
Corpus: 9000 English and German Medical Abstracts
from 41 Journals, Springer LINK Web Site,
1 M Tokens for each Language
- PoS
  - Lexicon Update, Remaining Error Rate 1.5% (EN)
  - Histologically, we found a subepidermal blister
    formation and a predominantly neutrophilic
    infiltrate. pos="VB" > pos_correct="NN"
- Morphology
              German Nouns  MMorph  Recall  Incorrect  Error Rate
  test-dvlp   889           617     69.40%  42         6.81%
  test-final  989           683     69.06%  79         11.57%
  - Incorrect, e.g. Chorionzottenbiopsie > Chor Ion Zotte Biopsie
- Term and Relation Tagging
  - Evaluation of 8 DE/EN Parallel Abstracts, Relevant for a Query
10 Term Extraction
XRCE (Aims and Resources)
- Aim
  - Bilingual Lexicon Extraction
    - From Comparable Corpora at Word Level
    - From Parallel Corpora at Word and Term (Multi-Word) Level
  - Bilingual Extension of Semantic Resource (MeSH)
- Resources
  - Optimal Combination of Existing Resources
    (Corpus, General Dictionary, Thesaurus MeSH)
  - Corpus Specific German Decompounding (Improves
    Recall by 25% at Equal Precision)
verbesserter transabdomineller Techniken   improved transabdominal techniques
Prognose des Frühcarcinoms                 prognosis of early gastric cancer
Verletzungen des Gehirns                   intracranial injuries
Lebensqualitaet                            quality of life
11 Term Extraction
XRCE (Results of Best Method)
- Optimal Combination of Resources
  - Retaining only 10 best Translations for each Candidate
  - 1. word-to-word, comparable corpora: F1 0.84
  - 2.a word-to-word, parallel corpora: F1 0.98
  - 2.b term-to-term, parallel corpora: F1 0.85
- Evaluating Separately with Individual Resources (F1)
  - Corpus 0.62, MeSH 0.51, General Dictionary 0.56
- 3. MeSH Extension: 1453 new multi-word terms added
  (synonyms or new term entries), extracted from the
  Springer corpus
12 Term Extraction
EIT (Similarity Thesauri)
- Method
  - Extract Most Frequent Terms (Single Word) by Comparison
    of Term Frequencies in a General Corpus (German SDA,
    English LA Times) vs. Medical Corpus
- Results
  - Single Word Terms (Springer Abstracts)
    - German-English 104,904 / English-German 49,454
  - Multiword Terms (Phrase Lexicon Generated from ICD10)
    - German Phrases 354 / English Phrases 665
  - Bilingual Phrasal Entries Generated
    - German-English 225 / English-German 246
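
The frequency comparison in the EIT method can be sketched as follows. This is a minimal illustration, not EIT's actual procedure: the slide does not say how the two frequency lists are compared, so the relative-frequency ratio, the add-one smoothing, the `min_ratio` cut-off and the function name `domain_terms` are all assumptions made for the example, and the two token lists are toy data.

```python
from collections import Counter

def domain_terms(medical_tokens, general_tokens, top_n=10, min_ratio=2.0):
    """Return single-word terms that are relatively more frequent in the
    medical corpus than in the general corpus (hypothetical scoring)."""
    med = Counter(medical_tokens)
    gen = Counter(general_tokens)
    med_total = sum(med.values())
    gen_total = sum(gen.values())
    vocab_size = len(set(med) | set(gen))
    scored = []
    for word, count in med.items():
        med_rel = count / med_total
        # Add-one smoothing so words unseen in the general corpus
        # do not cause a division by zero.
        gen_rel = (gen[word] + 1) / (gen_total + vocab_size)
        ratio = med_rel / gen_rel
        if ratio >= min_ratio:
            scored.append((word, count, ratio))
    # Most frequent medical terms first; ratio, then spelling, break ties.
    scored.sort(key=lambda item: (-item[1], -item[2], item[0]))
    return [word for word, _, _ in scored[:top_n]]

# Toy corpora, invented for illustration only.
medical = "the bone graft and the bone marrow of the patient".split()
general = "the man and the dog of the town saw the patient and the bus".split()
print(domain_terms(medical, general))
```

Function words such as "the" are frequent in both corpora and fall below the ratio cut-off, while "bone", "graft" and "marrow" survive.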
13 Term Extraction
CMU (EBT Bilingual Lexicon)
- Method
  - For each word in one language, count how often each
    word of the other language appears in the translations
    of the sentences containing that word. These
    co-occurrence counts may be restricted using
    word-alignment techniques.
  - Apply a variable threshold to filter out uncommon
    co-occurrences which are unlikely to be translations.
    The result is a lexicon listing candidate translations
    and their relative frequencies.
- Results
  - 99,000 Bilingual Term Pairs (PubMed Parallel Abstracts)
  - (Estimated Error Rate < 10%)
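
The counting and filtering steps of the CMU method can be sketched as below. This is a simplified illustration under stated assumptions: the slide mentions a variable threshold and optional word alignment, whereas this sketch uses a single fixed relative-frequency threshold (`min_ratio`) and no alignment; the function name `build_lexicon` and the sentence pairs are invented for the example.

```python
from collections import Counter, defaultdict

def build_lexicon(sentence_pairs, min_ratio=0.75):
    """Build a candidate translation lexicon from aligned sentence pairs.

    For each source word, count the target words seen in the translations
    of the sentences containing it, then keep only target words whose
    relative frequency reaches min_ratio (a fixed threshold, which is a
    simplification of the variable threshold named on the slide).
    """
    cooc = defaultdict(Counter)   # source word -> target word -> count
    src_count = Counter()         # number of sentences containing the source word
    for src_sentence, tgt_sentence in sentence_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        for s in src_words:
            src_count[s] += 1
            for t in tgt_words:
                cooc[s][t] += 1
    lexicon = {}
    for s, counts in cooc.items():
        kept = [(t, c / src_count[s]) for t, c in counts.items()
                if c / src_count[s] >= min_ratio]
        # Highest relative frequency first, alphabetical on ties.
        lexicon[s] = sorted(kept, key=lambda pair: (-pair[1], pair[0]))
    return lexicon

# Toy parallel sentences, invented for illustration only.
pairs = [
    ("der knochen heilt", "the bone heals"),
    ("der knochen bricht", "the bone breaks"),
    ("das band heilt", "the ligament heals"),
]
lexicon = build_lexicon(pairs)
print(lexicon["knochen"])
```

With so little data, a function word like "the" passes the threshold alongside "bone"; this is the kind of noise that alignment restrictions and a stricter threshold are meant to reduce.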
14 Term Extraction
CSLI (Infomap System)
Represent English and German Words as Vectors
that are Produced by Recording the Number of
Co-Occurrences of the Word in Question with each
of a Set of Content-Bearing Words. Use (Cosine)
Similarity Measure on these Rows to Find Nearest
Neighbours.
1,000 (English) content-bearing words
Term (EN)       SIM    Term (DE)          SIM
bone            1.00   knochen            0.82
cancellous      0.70   knochens           0.71
osteoinductive  0.67   knochenneubildung  0.67
demineralized   0.65   spongiosa          0.64
trabeculae      0.64   knochenresorption  0.60
formation       0.60   allogenen          0.60
periosteum      0.56   knöcherne          0.59
[Figure: co-occurrence matrix with English words (e.g. ligament) and German words (e.g. Kreuzband, Kniegelenk) as rows and English content-bearing words (e.g. ligament, knee, joint) as columns]
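
The vector representation and cosine comparison described for the Infomap system can be sketched as follows. This is an illustration only, not the Infomap implementation: it assumes that each document is a merged English/German text from a parallel corpus, so that German words can co-occur with the English content-bearing words; the documents, the six content words and the function names are invented for the example.

```python
import math
from collections import Counter

def cooccurrence_vector(word, documents, content_words):
    """Count how often each content-bearing word appears in the
    documents that contain the given word."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        if word in tokens:
            counts.update(t for t in tokens if t in content_words)
    return [counts[c] for c in content_words]

def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(word, candidates, documents, content_words):
    """Rank candidate words by cosine similarity to the given word."""
    target = cooccurrence_vector(word, documents, content_words)
    scored = [(c, cosine(target, cooccurrence_vector(c, documents, content_words)))
              for c in candidates]
    return sorted(scored, key=lambda pair: -pair[1])

# Toy merged English/German documents, invented for illustration only.
documents = [
    "bone graft formation knochen transplantat",
    "bone marrow knochen knochenmark",
    "ligament knee kreuzband knie",
    "ligament knee joint kreuzband kniegelenk",
]
content_words = ["bone", "graft", "marrow", "ligament", "knee", "joint"]
neighbours = nearest_neighbours(
    "bone", ["knochen", "kreuzband", "knochenmark"], documents, content_words)
print(neighbours)
```

On this toy data "knochen" ranks first for "bone" and "kreuzband" last, mirroring the shape of the neighbour table on the slide.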
15 WSD: Terms, Senses
Semantic Resource Extension and Tuning
- Extension (DFKI)
  - Morphological Analysis (Decomposition)
    - Entzündungsgewebe (inflammatory tissue) HYPONYM
      Gewebe, Körpergewebe (body tissue)
    - Gewebe, Stoff, Textilstoff (textile)
  - Semantic Similarity (Co-Occurrence Patterns)
    - Karzinom (carcinoma), Metastase (metastasis)
      SYNONYM Geschwulst, Tumor, ...
- Tuning (CSLI, DFKI)
  - Aligning Clusters with Senses
    - C0043210|GER|P|L1254343|PF|S1496289|Frauen|3
    - C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0
16 WSD Algorithm
Combination of Methods (Task, Domain, General)
- Bilingual Sense Selection (CSLI)
  - 1 Sense in L1 vs. >1 Sense in L2
  - English: blood vessel (C0005847) vs. vessel
    (polysaccharide) (C0148346)
  - German: Blutgefaesse, blood vessel (C0005847)
- Collocations and Senses (CSLI)
  - For an ambiguous single word term that is part
    of several unambiguous multiword terms, choose
    the sense of the most frequent multiword term.
  - single word term: abortion
    - 1) a natural process C0000786 (T047)
    - 2) a medical procedure C0000811 (T061)
  - multiword term: recurrent abortion C0000809 (T047) > sense 1
  - multiword term: induced abortion C0000811 (T061) > sense 2
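
The collocation heuristic can be sketched with the abortion example from the slide. This is one possible reading, not the CSLI implementation: it assumes that a multiword term is linked to a sense of the single word through its UMLS semantic type (T047 or T061), it counts occurrences by plain substring matching, and the corpus text and the function name `select_sense` are invented for the example.

```python
from collections import Counter

# Senses of the ambiguous single word term, keyed by semantic type (from the slide).
SENSES = {"T047": "C0000786", "T061": "C0000811"}
# Unambiguous multiword terms and their semantic types (from the slide).
MULTIWORD = {"recurrent abortion": "T047", "induced abortion": "T061"}

def select_sense(word, senses_by_type, multiword_types, corpus_text):
    """Pick the sense whose semantic type matches that of the most
    frequent multiword term containing the ambiguous word."""
    text = corpus_text.lower()
    counts = Counter()
    for term in multiword_types:
        if word in term.split():
            counts[term] = text.count(term)   # simple substring count
    if not counts or max(counts.values()) == 0:
        return None                           # no evidence in this corpus
    best_term, _ = counts.most_common(1)[0]
    return senses_by_type.get(multiword_types[best_term])

# Toy corpus, invented for illustration only.
corpus = ("Induced abortion was performed. After induced abortion the "
          "patient recovered. Recurrent abortion is rare.")
print(select_sense("abortion", SENSES, MULTIWORD, corpus))
```

Here "induced abortion" is the more frequent multiword term, so the medical-procedure sense C0000811 is chosen for the bare word "abortion".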
17 WSD Algorithm
Combination of Methods (Task, Domain, General)
- Domain Specific Senses (DFKI)
  - Concept Relevance in Domain Corpus
  - Mineral
    - 0.030774033: Mineralstoff, Eisen, Ferrum, Fluor,
      Kalzium, Magnesium
    - 4.9409806E-5: Allanit, Alumogel, ..., Axionit,
      Beryll, ... Wurtzit, Zirkon
- Instance-Based Learning (DFKI)
  - Unsupervised Context Models (n-grams)
  - Training (Learn Class Models)
    - He drank <milk LIQUID>
    - He drank <coffee LIQUID>
    - He drank <tea LIQUID>
    - He drank <chocolate FOOD, LIQUID>
  - Application (Apply Class Models)
    - He drank <chocolate FOOD, LIQUID>
    - He drank <Java GEOGRAPHICAL, LIQUID>
18 WSD Evaluation
Lexical Sample Evaluation Corpora (Medical)
- Ambiguous: MeSH EN 847 (2.5), DE 780 (2.1);
  EWN EN 6300 (2.8), DE 4059 (1.5)
- Evaluation (Nouns): GermaNet (40), English MeSH (59),
  German MeSH (28)
Band (tape, strap, ligament)
Fall (drop, case, instance)
Gefäss (jar, vessel)
Operation (operation, surgery)
Prüfung (survey, tryout, checkup)
Verletzung (injury, trauma)
Wahl (ballot, choice, option)
Lage (site, status, position, layer)
Gewicht (weight, importance)
19 Relation Extraction
Grammatical Function Tagging (DFKI)
- Robust, Shallow Grammatical Function Tagger
  - EM Model (Trained on Frankfurter Rundschau, 35M
    Tokens; Adaptation on Medical Corpora Under Development)
  - 1.5M Types: Verb, Voice, Function, Nom-Head-Argument
    - abarbeiten ACT SUBJ Politiker
  - Use of PoS Information, Use of Chunk Information Planned
  - Tags for SUBJ, OBJ, IOBJ, ACT/PAS
  - German Available, English under Development
Untersucht <PRED1PAS> wurden 30 Patienten
<PRED1SUBJ> <PRED2SUBJ>, die sich <PRED2SUBJ>
einer elektiven aortokoronaren Bypassoperation
<PRED2IOBJ> unterziehen <PRED2ACT> mussten.
20 Relation Extraction
Semantic Relation Indicators (DFKI, CSLI)
Novel Semantic Relations (DFKI, CSLI)
Cluster 1: T047/T060 (Diagnoses), T060/T101 (Affects), T060/T169, ...
differentiate, conclude, discriminate, diagnose, illustrate
reduce, treat, follow, diagnose, cure
Cluster 3: T047/T121 (Treats, Causes), T061/T121 (Uses), T121/T184 (Treats), ...
Cluster 2: T101/T169, T101/T184, T101/T048, ...
Semantic types: T047 Disease; T048 Mental Dysfunction;
T060 Diagnostic Procedure; T101 Patient; T121 Pharm.
Substance; T169 Funct. Concept (Syndrome); T184 Sign or Symptom
suffer, demonstrate, progress, develop, die
21 Summarization (CMU)
Extractive Summarization
- Maximal Marginal Relevance (MMR)
  - Find passages most relevant to query
  - Maximize information novelty (minimize passage redundancy)
  - Assemble extracted passages for summary
- $\mathrm{MMR} = \operatorname{Argmax}^{k}_{d_i \in C} \left[ \lambda\, S(Q, d_i) - (1-\lambda) \max_{d_j \in R} S(d_i, d_j) \right]$
  - $Q$: query, $d$: document, $S$: similarity function
  - $\lambda$: tradeoff factor between relevance and novelty
  - $k$: number of passages to include in summary
- Applications
  - Re-ranking retrieved documents from IR Engine
  - Ranking passages from a document for inclusion in summaries
  - Ranking passages from topically-related document cluster
    for cluster summary
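
The MMR selection loop can be sketched as a greedy procedure. This is a minimal illustration, not CMU's summarizer: the slide does not define C and R, so the sketch assumes C is the set of candidate passages and R the set of passages already selected; the Jaccard word-overlap similarity, the function names and the example passages are stand-ins chosen for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity, used here as a stand-in for S."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def mmr_select(query, passages, similarity=jaccard, lam=0.5, k=2):
    """Greedily pick k passages, trading relevance to the query
    against redundancy with passages already selected."""
    selected = []
    candidates = list(passages)
    while candidates and len(selected) < k:
        def score(passage):
            # Redundancy is the highest similarity to any selected passage.
            redundancy = max((similarity(passage, s) for s in selected), default=0.0)
            return lam * similarity(query, passage) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy query and passages, invented for illustration only.
query = "bone graft healing"
passages = [
    "bone graft healing was fast",
    "bone graft healing was very fast",
    "graft rejection delayed healing",
]
print(mmr_select(query, passages, lam=0.5, k=2))
print(mmr_select(query, passages, lam=1.0, k=2))
```

With lam=0.5 the near-duplicate second passage is skipped in favour of the third; with lam=1.0 the selection reduces to pure relevance ranking and the two near-duplicates are both chosen.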
22 Summarization (CMU)
MuchMore Application
- INDICATIVE and QUERY-RELEVANT
Task                                            Query-Relevant (focused)                  Query-Free (generic)
INDICATIVE, for Filtering (Do I read further?)  To filter search engine results           Short abstracts
CONTENTFUL, for reading in lieu of full doc.    To solve problems for busy professionals  Executive summaries
- MMR applies to English and German
  - Genre-based specialization (e.g. include
    conclusions for scientific articles)
  - Linguistic specialization possible
- Summarization should apply when retrieving FULL
  articles: query-driven summaries instead of
  generic abstracts
23 Technical Evaluation
Test Data
- Test Collection: Springer Abstracts (German and English)
- Query Set: 25 of 126, Selected by ZInfo
- Relevance Assessments
  - Assumption: Documents Retrieved by all Runs for one
    Query (Intersection) are Relevant
  - Pool Size: 500 Documents, Based on 18 Runs Done by
    CMU, CSLI and EIT
  - German (ZInfo): 959 Relevant Documents
  - English (CMU): 500 Relevant Documents (1 judge),
    964 Relevant Documents (3 judges)
24 Technical Evaluation
Methods Evaluated
- Corpus Based Similarity Thesaurus (EIT)
- Example-based Translation (CMU)
- Pseudo Relevance Feedback (CMU)
- Generalized Vector Space Model (CMU)
- Hybrid Classification (CMU)
  - Hierarchical kNN, Rocchio
  - Flat kNN, Rocchio-style Classifier
- Semantic Annotation Extraction (DFKI, XRCE)
  - UMLS / XRCE Terms, Semantic Relations, EuroWordNet Terms
- Semantic Annotation and Similarity Thesaurus
25 Technical Evaluation
TREC-Style Performance Measurements
- Overall Performance
  - 11-point Average Precision (Interpolated)
- Performance in the High-Precision Area
  - Assumption: User Wants to Get Most Relevant
    Documents Top-ranked within the Result List
  - Average Interpolated Precision at Recall of 0.1
  - Exact Precision after 10 Retrieved Documents
- Applied to Experiments Evaluating Semantic Annotations
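
The three measures listed above can be sketched for a single query as follows. This is a minimal illustration of the standard TREC-style definitions, not the evaluation code used in the project; the function names and the five-document ranking are invented for the example, and in practice the values are averaged over all queries.

```python
def interpolated_precision_at(ranking, relevant, recall_level):
    """Highest precision at any rank whose recall is >= recall_level."""
    if not relevant:
        return 0.0
    hits = 0
    best = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
        recall = hits / len(relevant)
        if recall >= recall_level:
            best = max(best, hits / rank)
    return best

def eleven_point_average_precision(ranking, relevant):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision_at(ranking, relevant, r) for r in levels) / 11

def precision_at_k(ranking, relevant, k=10):
    """Exact precision after k retrieved documents."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

# Toy ranking for one query, invented for illustration only.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3"}
print(eleven_point_average_precision(ranking, relevant))
print(interpolated_precision_at(ranking, relevant, 0.1))
print(precision_at_k(ranking, relevant, k=10))
```

Precision at recall 0.1 and precision after 10 documents only look at the top of the ranking, which is why the slide groups them under the high-precision area.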
26 Technical Evaluation
Results: Corpus Based Methods
Data Sets
- EIT: The Springer Parallel Corpus, i.e. 9640 Documents
  for English, and 9640 Documents for German
- CMU: Half of the Corpus, i.e. a Test Set with 4820
  Documents in each.
System                            Eng-Eng  Ger-Ger  Ger-Eng  Eng-Ger
Monolingual EIT lnu.ltn           0.1914   0.1848   N/A      N/A
Crosslingual EIT SimThes lnu.ltn  N/A      N/A      0.1258   0.1109
Monolingual PRF                   0.6782   0.5078   N/A      N/A
Crosslingual PRF                  N/A      N/A      0.5487   0.5758
EBT chi-squared                   N/A      N/A      0.5232   0.5396
Crosslingual GVSM                 (first evaluation to be completed in July, 2002)
27 Technical Evaluation
Results: Hybrid Methods
Categorization (Preliminary Results)
- Reuters-21578: 10,000 documents, 90 categories
- Reuters Corpus Volume 1, TREC-10 version (RCV1):
  783,484 documents, 84 categories
- Reuters Koller & Sahami subsets (ICML98): 138 to 939
  documents, 6-11 categories in a set
- OHSUMED: 233,445 documents, 14,321 categories
System    Data Set          Macro-avg F1   Micro-avg F1
kNN       Reuters 21578     .60            .86
Rocchio   Reuters 21578     .59            .85
kNN       RCV1.TREC-10      (F0.5 .44)     (F0.5 .55)
Rocchio   RCV1.TREC-10      (F0.5 .39)     (F0.5 .49)
kNN       R-KS Subsets (3)  .85, .81, .97  .89, .80, .94
HkNN      R-KS Subsets (3)  .85, .80, .98  .86, .82, .99
Rocchio   R-KS Subsets (3)  .80, .75, .96  .82, .83, .96
HRocchio  R-KS Subsets (3)  .83, .81, .98  .78, .84, .99
kNN       OHSUMED           .26            .48
28 Technical Evaluation
Results: Hybrid Methods
Semantic Annotation Extraction
- Data Set: Full Springer Corpus
- Weighting Scheme: Coordination Level Matching (CLM)
  - 1. Pass: Documents Preferred Containing Matching
    Terms or Semantic Relations
  - 2. Pass: All Features Using lnu.ltn
- Rel. Assessments: German
                       11pt AvPrec       Prec at Recall of 0.1  Prec at 10 Docs Retr
System                 SemA-v3  SemA-v4  SemA-v3  SemA-v4       SemA-v3  SemA-v4
EN2DE Morph EWN        -        0.0005   -        0.0017        -        0.0040
EN2DE Morph UMLS       -        0.0933   -        0.2898        -        0.1840
EN2DE Morph UMLS XRCE  -        0.1486   -        0.4258        -        0.3360
DE2EN Morph EWN        -        0.0479   -        0.1240        -        0.0960
DE2EN Morph UMLS       0.1507   0.1392   0.3895   0.3963        0.2520   0.2920
29 Technical Evaluation
Results: Hybrid Methods
Semantic Annotation and Similarity Thesaurus
- Data Set: Full Springer Corpus
- Weighting Scheme: Coordination Level Matching (CLM)
- Rel. Assessments: German
System                              11pt AvPrec  Prec at Recall of 0.1  Prec at 10 Docs Retr
EN2DE transl. Morphology EWN        0.0276       0.1353                 0.1000
EN2DE transl. Morphology UMLS       0.1487       0.4126                 0.3320
EN2DE transl. Morphology UMLS XRCE  0.1706       0.4495                 0.3600
DE2EN transl. Morphology EWN        0.1101       0.3165                 0.2000
DE2EN transl. Morphology UMLS       0.1413       0.4038                 0.2680
30 Technical Evaluation
Summary of the Results
- Assumption: CLIR achieves up to 75% of Monolingual
  Baseline (11pt Average Precision)
- Corpus-based Methods (Compared to Monolingual PRF)
  - German-English: PRF 81%, EBT 77%, EIT 66%
  - English-German: PRF 113%, EBT 106%, EIT 60%
- Hybrid Methods (Compared to Monolingual EIT)
  - German-English: 73% (UMLS Terms, SemRels)
  - English-German: 50% (UMLS Terms, SemRels)
  - English-German: 80% (UMLS Terms, SemRels, XRCE Terms)
  - German-English: 74% (SimThes, UMLS Terms, SemRels)
  - English-German: 80% (SimThes, UMLS Terms, SemRels)
  - English-German: 92% (SimThes, UMLS Terms, SemRels,
    XRCE Terms)
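
The percentages in this summary can be reproduced from the 11pt average precision values in the result tables of slides 26, 28 and 29. The sketch below is a worked check, under the assumption (inferred from the numbers, not stated on the slides) that each cross-lingual run is divided by the monolingual run in the language of the retrieved documents, and that the EIT figures use EIT's own monolingual baseline. The function name `percent_of_monolingual` is made up for the example.

```python
def percent_of_monolingual(crosslingual, monolingual):
    """Cross-lingual 11pt average precision as a rounded percentage
    of the monolingual baseline."""
    return round(100 * crosslingual / monolingual)

# Monolingual baselines (11pt average precision) from the results tables.
PRF_ENG, PRF_GER = 0.6782, 0.5078
EIT_ENG, EIT_GER = 0.1914, 0.1848

corpus_based = {
    "Ger-Eng PRF": percent_of_monolingual(0.5487, PRF_ENG),
    "Ger-Eng EBT": percent_of_monolingual(0.5232, PRF_ENG),
    "Ger-Eng EIT": percent_of_monolingual(0.1258, EIT_ENG),
    "Eng-Ger PRF": percent_of_monolingual(0.5758, PRF_GER),
    "Eng-Ger EBT": percent_of_monolingual(0.5396, PRF_GER),
    "Eng-Ger EIT": percent_of_monolingual(0.1109, EIT_GER),
}
hybrid = {
    "DE2EN UMLS": percent_of_monolingual(0.1392, EIT_ENG),
    "EN2DE UMLS": percent_of_monolingual(0.0933, EIT_GER),
    "EN2DE UMLS XRCE": percent_of_monolingual(0.1486, EIT_GER),
    "DE2EN SimThes UMLS": percent_of_monolingual(0.1413, EIT_ENG),
    "EN2DE SimThes UMLS": percent_of_monolingual(0.1487, EIT_GER),
    "EN2DE SimThes UMLS XRCE": percent_of_monolingual(0.1706, EIT_GER),
}
print(corpus_based)
print(hybrid)
```

Under these assumptions the computed values agree with the percentages listed in the summary, including the two English-German runs that exceed their monolingual baseline.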
31 Management
Deviations from the Work Plan
- Corpus Collection
  - Comparable Medical Document Corpora are Very
    Difficult to Obtain, Anonymization Must be
    Validated by Hospital CIO
  - Work with Shuffled Parallel Corpus
  - Radiology Reports (600,000) Available in German,
    to be Obtained for English
- Corpus Annotation
  - More Efforts on Improving PoS Tagging and
    Morphological Analysis (English and German
    Medical Specialist Lexicon)
- Relation Extraction
  - More Efforts on Grammatical Function Tagging as
    Preprocessing for Semantic Relation Tagging and
    Extraction
32 Management
Future Prospects and Activities
- R&D Topics
  - Ontology Development: Combining Axes in
    AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI, DFKI)
  - Semantic Web: Semantic Annotation of Medical
    Documents with Metadata (UMLS in Protégé)
- Related Projects and Workshops
  - Project Proposal IKAR/OS on KM Visualization in
    Life Sciences
  - OntoWeb SIG on LT in Ontology Development and Use
  - MuchMore Workshop with Invited Experts in Medical
    Information Access, CLIR and Semantic Annotation
    (September 2002)
  - ZInfo/MuchMore Workshop on Electronic Patient
    Records (Spring 2003)