April 19th, 2002

Slides: 33

Provided by: DFK

Transcript and Presenter's Notes

1
Multilingual Concept Hierarchies for Medical
Information Organization and Retrieval
MUCHMORE
2
Project Overview
Application: Addressing a Real-Life Medical
Scenario for Cross-Lingual Information Retrieval
Research & Development: Developing Novel,
Hybrid (Corpus-/Concept-Based) Methods for
Handling this Scenario
Evaluation: Evaluating the Technical
Performance of (Combinations of) Existing and
Novel Methods
3
User Perspective (ZInfo)
Vision BAIK Model

MuchMore
Provide Relevant Medical Information
for a Specific Patient Problem
Automatically, from the Web
Independent of Language

4
User Perspective (ZInfo)
User Requirements

Automatic Query Generation (and Expansion),
Identifying the Exact Problem of the Patient
Retrieval and Relevance Ranking of Evidence
Based Medical Literature, Language Independent
Summarization and Filtering of Results
According to a User Profile

5
User Perspective (ZInfo)
User Evaluation
Evaluate Usefulness: Query Generation; Relevance
for Decisions in Diagnostics and Treatment
Use for Medical Cases: Part of Postgraduate
Course in Medical Informatics
Problematic Issues
Different medical profiles, schools, experience,
speciality
What is relevant for one user may mean less or
nothing to another
Evidence-based medicine criteria exist only for
a small fraction of medicine
6
MuchMore Prototype

Overview of Prototype Functionality
Relation between Functionality and User
Requirements
Issues Addressed by Research and Development
within MuchMore

7
R&D in MuchMore
Semantic Annotation Based CLIR

Corpus Annotation (DFKI, ZInfo)
PoS, Morphology, Phrases, Grammatical Functions
Term and Relation Tagging

Term Extraction (XRCE, EIT, CMU, CSLI)
Bilingual Lexicon Extraction, Extension of
Semantic Resources

Sense Disambiguation (CSLI, DFKI)
Tuning and Extension of Semantic Resources
Combining Sense Disambiguation Methods

Relation Extraction (DFKI, CSLI)
Grammatical Function Tagging
Extracting Semantic Relation Indicators
Extracting Novel Semantic Relations

Semantic Indexing/Retrieval (EIT, DFKI)
8
R&D in MuchMore
Additional Approaches in CLIR

Corpus Based CLIR
Bilingual Lexicon Extraction (XRCE, EIT, CMU,
CSLI)
Pseudo Relevance Feedback, PRF (CMU)
Generalized Vector Space Model, GVSM (CMU)

Text Classification Based CLIR (CMU):
Hierarchical/Flat kNN with MeSH
Summarization (CMU): Query, Genre Specific
9
Corpus Annotation
Annotation Evaluation
Corpus: 9000 English and German Medical
Abstracts from 41 Journals, Springer LINK
Web Site, 1 M Tokens for each Language

PoS
Lexicon Update, Remaining Error Rate 1.5%
(EN)
Histologically, we found a subepidermal blister
formation and a predominantly neutrophilic
infiltrate. posVB > pos_correctNN

Morphology
Test Set     German Nouns   MMorph   Recall   Incorrect   Error Rate
test-dvlp    889            617      69.40%   42          6.81%
test-final   989            683      69.06%   79          11.57%
Incorrect, e.g. Chorionzottenbiopsie > Chor
Ion Zotte Biopsie
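The Recall and Error Rate columns can be reproduced from the counts in the table. The sketch below assumes that recall is the share of nouns MMorph analysed and that the error rate is the share of incorrect analyses among the analysed nouns. That reading is an assumption, but it matches the reported figures. The function name is illustrative.

```python
def morph_scores(total_nouns, analysed, incorrect):
    """Return (recall %, error rate %) under the assumed definitions."""
    recall = 100.0 * analysed / total_nouns    # analysed nouns / all nouns
    error_rate = 100.0 * incorrect / analysed  # wrong analyses / analysed nouns
    return round(recall, 2), round(error_rate, 2)


print(morph_scores(889, 617, 42))  # test-dvlp
print(morph_scores(989, 683, 79))  # test-final
```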
Term and Relation Tagging: Evaluation of 8
DE/EN Parallel Abstracts, Relevant for a Query
10
Term Extraction
XRCE (Aims and Resources)

Aim
Bilingual Lexicon Extraction
From Comparable Corpora at Word Level
From Parallel Corpora at Word and Term
(Multi-Word) Level
Bilingual Extension of Semantic Resource (MeSH)

Resources
Optimal Combination of Existing Resources
(Corpus, General Dictionary, Thesaurus: MeSH)
Corpus Specific German Decompounding (Improves
Recall by 25% at Equal Precision)

German                                     English
verbesserter transabdomineller Techniken   improved transabdominal techniques
Prognose des Frühcarcinoms                 prognosis of early gastric cancer
Verletzungen des Gehirns                   intracranial injuries
Lebensqualitaet                            quality of live
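The slides do not describe how the corpus-specific decompounding works. As an illustration only, the sketch below splits a compound into parts that all occur in a corpus vocabulary and tries longer heads first. The function name, the `min_len` parameter and the toy vocabulary are assumptions, not XRCE's method.

```python
def decompound(word, vocab, min_len=3):
    """Split `word` into parts found in `vocab`; return [word] if no split works."""
    def split(rest):
        if rest in vocab:
            return [rest]
        # Try the longest possible head first, keeping both sides >= min_len.
        for i in range(len(rest) - min_len, min_len - 1, -1):
            head, tail = rest[:i], rest[i:]
            if head in vocab:
                parts = split(tail)
                if parts:
                    return [head] + parts
        return None

    lowered = word.lower()
    return split(lowered) or [lowered]


# Toy vocabulary (hypothetical): word forms as they might occur in a corpus.
vocab = {"lebens", "qualitaet", "chorion", "zotten", "biopsie", "chor", "ion"}
print(decompound("Lebensqualitaet", vocab))
print(decompound("Chorionzottenbiopsie", vocab))
```

Preferring long heads keeps "chorion" together instead of splitting it into "chor" and "ion", which is the kind of over-segmentation the morphology evaluation lists as incorrect.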
11
Term Extraction
XRCE (Results of Best Method)

Optimal Combination of Resources
Retaining only 10 best Translations for each
Candidate
1. word-to-word, comparable corpora: F1 = 0.84
2.a word-to-word, parallel corpora: F1 = 0.98
2.b term-to-term, parallel corpora: F1 = 0.85
Evaluating Separately with Individual Resources
(F1)
Corpus 0.62, MeSH 0.51, General Dictionary 0.56
3. MeSH Extension: 1453 new multi-word terms
added (synonyms or new term entries), extracted
from the Springer corpus

12
Term Extraction
EIT (Similarity Thesauri)
Method: Extract Most Frequent Terms (Single
Word) by Comparison of Term Frequencies in a
General Corpus (German SDA, English LA Times)
vs. Medical Corpus

Results
Single Word Terms (Springer Abstracts)
German-English 104,904 / English-German 49,454
Multiword Terms (Phrase Lexicon Generated from
ICD10)
German Phrases 354 / English Phrases 665
Bilingual Phrasal Entries Generated
German-English 225 / English-German 246
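The frequency comparison in the EIT method can be sketched as follows: keep words whose relative frequency in the medical corpus clearly exceeds their relative frequency in a general corpus, then rank them by medical-corpus frequency. The ratio threshold, the add-one smoothing and the toy token lists are assumptions made for illustration. The slides do not give the actual criterion.

```python
from collections import Counter


def domain_terms(medical_tokens, general_tokens, min_ratio=2.0, top_n=10):
    """Return the most frequent medical-corpus words that are over-represented."""
    med, gen = Counter(medical_tokens), Counter(general_tokens)
    med_total, gen_total = sum(med.values()), sum(gen.values())
    kept = []
    for word, count in med.items():
        rel_med = count / med_total
        rel_gen = (gen[word] + 1) / (gen_total + 1)  # add-one smoothing for unseen words
        if rel_med / rel_gen >= min_ratio:
            kept.append((word, count))
    kept.sort(key=lambda pair: (-pair[1], pair[0]))  # most frequent first
    return [word for word, _ in kept[:top_n]]


medical = "the tumor and the biopsy of the tumor".split()
general = "the cat and the dog of the town and the house".split()
print(domain_terms(medical, general))
```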

13
Term Extraction
CMU (EBT Bilingual Lexicon)
Method
For each word in one language, accumulate counts
of the number of times the translations of the
sentences containing that word include each word
of the other language. These co-occurrence counts
may be restricted using word-alignment techniques.
Apply a variable threshold to filter out uncommon
co-occurrences which are unlikely to be
translations. The result is a lexicon listing
candidate translations and their relative
frequencies.

Results
99,000 Bilingual Term Pairs (PubMed Parallel
Abstracts)
(Estimated Error Rate < 10%)
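A minimal sketch of the co-occurrence counting described in the method, assuming whitespace tokenisation and a fixed relative-frequency cut-off in place of the variable threshold (neither detail is given in the slides). The function name and the toy sentence pairs are illustrative.

```python
from collections import Counter, defaultdict


def build_lexicon(sentence_pairs, min_share=0.3):
    """Map each source word to candidate translations with relative frequencies."""
    cooc = defaultdict(Counter)
    for source, target in sentence_pairs:
        target_words = set(target.lower().split())
        for word in set(source.lower().split()):
            cooc[word].update(target_words)  # count once per sentence pair
    lexicon = {}
    for word, counts in cooc.items():
        total = sum(counts.values())
        # Fixed cut-off (assumption): drop rare co-occurrences.
        kept = {t: c / total for t, c in counts.items() if c / total >= min_share}
        if kept:
            lexicon[word] = dict(sorted(kept.items(), key=lambda kv: (-kv[1], kv[0])))
    return lexicon


pairs = [("der knochen", "the bone"),
         ("der tumor", "the tumor"),
         ("knochen tumor", "bone tumor")]
print(build_lexicon(pairs))
```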

14
Term Extraction
CSLI (Infomap System)
Represent English and German Words as Vectors
that are Produced by Recording the Number of
Co-Occurrences of the Word in Question with each
of a Set of Content-Bearing Words. Use (Cosine)
Similarity Measure on these Rows to Find Nearest
Neighbours.
1,000 (English) content-bearing words
Term (EN)        SIM    Term (DE)           SIM
bone             1.00   knochen             0.82
cancellous       0.70   knochens            0.71
osteoinductive   0.67   knochenneubildung   0.67
demineralized    0.65   spongiosa           0.64
trabeculae       0.64   knochenresorption   0.60
formation        0.60   allogenen           0.60
periosteum       0.56   knöcherne           0.59
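The nearest-neighbour lookup behind such a table can be sketched with plain cosine similarity over co-occurrence count vectors. The three-column vectors below are invented toy counts, not Infomap data, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbours(query, vectors, k=3):
    """Rank all words (including the query itself) by similarity to `query`."""
    target = vectors[query]
    scored = [(word, cosine(target, vec)) for word, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Toy co-occurrence counts with three content-bearing words (hypothetical).
vectors = {"bone": [4, 1, 0], "knochen": [3, 1, 0], "ligament": [0, 2, 5]}
for word, sim in nearest_neighbours("bone", vectors):
    print(f"{word} {sim:.2f}")
```

Because English and German words share the same columns, a German word can come out as a neighbour of an English query, as in the table above.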

[Figure: co-occurrence matrix. Rows: English words (e.g. ligament) and German words (e.g. Kreuzband, Kniegelenk). Columns: English content-bearing words (e.g. ligament, knee, joint).]
15
WSD Terms, Senses
Semantic Resource Extension and Tuning

Extension (DFKI)
Morphological Analysis (Decomposition)
Entzündungsgewebe (infection tissue) HYPONYM
Gewebe, Körpergewebe (body tissue)
Gewebe, Stoff, Textilstoff (textile)
Semantic Similarity (Co-Occurrence Patterns)
Karzinom (carcinoma), Metastase (metastasis)
SYNONYM Geschwulst, Tumor, ....

Tuning (CSLI, DFKI)
Aligning Clusters with Senses
C0043210|GER|P|L1254343|PF|S1496289|Frauen|3
C0043210|ENG|P|L1189496|PF|S1423265|Human adult females|0

16
WSD Algorithm
Combination of Methods (Task, Domain, General)

Bilingual Sense Selection (CSLI)
1 Sense in L1 vs. >1 Sense in L2
English: blood vessel (C0005847) vs. vessel
(polysaccharide) (C0148346)
German: Blutgefaesse, blood vessel (C0005847)

Collocations and Senses (CSLI)
For an ambiguous single word term that is part
of several unambiguous multiword terms, choose
the sense of the most frequent multiword term.
single word term: abortion
1) a natural process C0000786 (T047)
2) a medical procedure C0000811 (T061)
multiword term: recurrent abortion C0000809
(T047) > sense 1
multiword term: induced abortion C0000811
(T061) > sense 2
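The collocation heuristic can be sketched as below. Linking a multiword term to a sense through its semantic type (T047, T061) follows the abortion example. The phrase counts are invented, and the function and parameter names are illustrative assumptions.

```python
def select_sense(term, senses_by_type, multiword_types, phrase_counts):
    """Pick the sense whose semantic type matches the most frequent multiword term."""
    candidates = {phrase: n for phrase, n in phrase_counts.items()
                  if phrase in multiword_types and term in phrase.split()}
    if not candidates:
        return None  # no unambiguous multiword term observed
    best = max(candidates, key=lambda phrase: (candidates[phrase], phrase))
    return senses_by_type.get(multiword_types[best])


senses_by_type = {"T047": "C0000786", "T061": "C0000811"}      # senses of "abortion"
multiword_types = {"recurrent abortion": "T047", "induced abortion": "T061"}
phrase_counts = {"recurrent abortion": 7, "induced abortion": 3,
                 "blood vessel": 9}                             # hypothetical corpus counts
print(select_sense("abortion", senses_by_type, multiword_types, phrase_counts))
```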

17
WSD Algorithm
Combination of Methods (Task, Domain, General)

Domain Specific Senses (DFKI)
Concept Relevance in Domain Corpus
Mineral
0.030774033: Mineralstoff, Eisen, Ferrum, Fluor,
Kalzium, Magnesium
4.9409806E-5: Allanit, Alumogel, ..., Axionit,
Beryll, ..., Wurtzit, Zirkon

Instance-Based Learning (DFKI)
Unsupervised Context Models (n-grams)
Training (Learn Class Models)
He drank <milk LIQUID>
He drank <coffee LIQUID>
He drank <tea LIQUID>
He drank <chocolate FOOD, LIQUID>
Application (Apply Class Models)
He drank <chocolate FOOD, LIQUID>
He drank <Java GEOGRAPHICAL, LIQUID>
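The training and application steps can be sketched as a context model that counts, per left context, how often each semantic class was seen, and then picks the most frequent class among a word's candidate senses. Letting an ambiguous training instance count towards all of its classes is an assumption. The slides do not say how such instances are handled. Function names are illustrative.

```python
from collections import Counter, defaultdict


def train(examples):
    """examples: (left_context, classes) pairs; count each class per context."""
    model = defaultdict(Counter)
    for context, classes in examples:
        for cls in classes:
            model[context][cls] += 1
    return model


def disambiguate(model, context, candidates):
    """Return the candidate class seen most often in this context, or None."""
    counts = model.get(context, Counter())
    best = max(candidates, key=lambda cls: (counts[cls], cls))
    return best if counts[best] > 0 else None


model = train([
    ("He drank", ["LIQUID"]),           # milk
    ("He drank", ["LIQUID"]),           # coffee
    ("He drank", ["LIQUID"]),           # tea
    ("He drank", ["FOOD", "LIQUID"]),   # chocolate (ambiguous)
])
print(disambiguate(model, "He drank", ["FOOD", "LIQUID"]))          # chocolate
print(disambiguate(model, "He drank", ["GEOGRAPHICAL", "LIQUID"]))  # Java
```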

18
WSD Evaluation
Lexical Sample Evaluation Corpora (Medical)

Ambiguous: MeSH EN 847 (2.5), DE 780 (2.1);
EWN EN 6300 (2.8), DE 4059 (1.5)
Evaluation (Nouns): GermaNet (40), English MeSH
(59), German MeSH (28)

Band (tape, strap, ligament)
Fall (drop, case, instance)
Gefäss (jar, vessel)
Operation (operation, surgery)
Prüfung (survey, tryout, checkup)
Verletzung (injury, trauma)
Wahl (ballot, choice, option)
Lage (site, status, position, layer)
Gewicht (weight, importance)

19
Relation Extraction
Grammatical Function Tagging (DFKI)

Robust, Shallow Grammatical Function Tagger
EM Model (Trained on Frankfurter Rundschau, 35M
Tokens; Adaptation on Medical Corpora Under
Development)
1.5M Types: Verb, Voice, Function,
Nom-Head-Argument
abarbeiten ACT SUBJ Politiker
Use of PoS Information, Use of Chunk
Information Planned
Tags for SUBJ, OBJ, IOBJ, ACT/PAS
German Available, English under Development

Untersucht <PRED1PAS> wurden 30 Patienten
<PRED1SUBJ> <PRED2SUBJ>, die sich <PRED2SUBJ>
einer elektiven aortokoronaren Bypassoperation
<PRED2IOBJ> unterziehen <PRED2ACT> mussten.
20
Relation Extraction
Semantic Relation Indicators (DFKI, CSLI)
Novel Semantic Relations (DFKI, CSLI)
Cluster 1: T047/T060 (Diagnoses), T060/T101
(Affects), T060/T169, ...
differentiate conclude discriminate diagnose illustrate
reduce treat follow diagnose cure
Cluster 3: T047/T121 (Treats, Causes), T061/T121
(Uses), T121/T184 (Treats), ...
Cluster 2: T101/T169, T101/T184, T101/T048, ...
T047 Disease; T048 Mental Dysfunction; T060
Diagnostic Procedure; T101 Patient; T121 Pharm.
Substance; T169 Funct. Concept (Syndrom); T184
Sign or Symptom
suffer demonstrate progress develop die
21
Summarization (CMU)
Extractive Summarization

Maximal Marginal Relevance (MMR)
Find passages most relevant to query
Maximize information novelty (minimize passage
redundancy)
Assemble extracted passages for summary
$\underset{d_i \in C}{\operatorname{Argmax}_k} \left[ \lambda\, S(Q, d_i) - (1 - \lambda) \max_{d_j \in R} S(d_i, d_j) \right]$
Q: query, d: document, S: similarity
function
$\lambda$: tradeoff factor between relevance and novelty
k: number of passages to include in summary
Applications
Re-ranking retrieved documents from IR Engine
Ranking passages from a document for inclusion
in summaries
Ranking passages from topically-related document
cluster for cluster summary
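A greedy MMR selection can be sketched as follows. The slide does not define C and R; here C is read as the candidate passages and R as the passages already selected, which is an assumption. Word-overlap (Jaccard) similarity stands in for the unspecified similarity function S, and the passages are invented.

```python
def jaccard(a, b):
    """Word-overlap similarity, used here as a stand-in for S."""
    x, y = set(a.lower().split()), set(b.lower().split())
    return len(x & y) / len(x | y) if x | y else 0.0


def mmr_select(query, candidates, similarity=jaccard, k=3, lam=0.7):
    """Greedily pick k passages, trading relevance (lam) against redundancy."""
    selected, remaining = [], list(candidates)
    while remaining and len(selected) < k:
        def score(d):
            redundancy = max((similarity(d, s) for s in selected), default=0.0)
            return lam * similarity(query, d) - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


passages = ["bone cancer treatment options",
            "bone cancer treatment options today",
            "surgery for bone tumours"]
print(mmr_select("bone cancer treatment", passages, k=2, lam=0.5))
print(mmr_select("bone cancer treatment", passages, k=2, lam=1.0))
```

With lam=0.5 the near-duplicate second passage is skipped in favour of the novel third one. With lam=1.0 the ranking falls back to pure relevance.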
22
Summarization (CMU)
MuchMore Application
INDICATIVE and QUERY-RELEVANT
Task                                             Query-Relevant (focused)                   Query-Free (generic)
INDICATIVE, for Filtering (Do I read further?)   To filter search engine results            Short abstracts
CONTENTFUL, for reading in lieu of full doc.     To solve problems for busy professionals   Executive summaries

MMR applies to English and German
Genre-based specialization (e.g. include
conclusions for scientific articles)
Linguistic specialization possible
Summarization should apply when retrieving FULL
articles: query-driven summaries instead of
generic abstracts

23
Technical Evaluation
Test Data
Test Collection: Springer Abstracts (German and
English)
Query Set: 25 of 126, Selected by ZInfo
Relevance Assessments
Assumption: Documents Retrieved by all Runs for
one Query (Intersection) are Relevant
Pool Size: 500 Documents, Based on 18 Runs Done
by CMU, CSLI and EIT
German (ZInfo): 959 Relevant Documents
English (CMU): 500 Relevant Documents (1 judge),
964 Relevant Documents (3 judges)
24
Technical Evaluation
Methods Evaluated

Corpus Based
Similarity Thesaurus (EIT)
Example-based Translation (CMU)
Pseudo Relevance Feedback (CMU)
Generalized Vector Space Model (CMU)
Hybrid
Classification (CMU)
Hierarchical kNN, Rocchio
Flat kNN, Rocchio-style Classifier
Semantic Annotation Extraction (DFKI, XRCE)
UMLS / XRCE Terms, Semantic Relations
EuroWordNet Terms
Semantic Annotation Similarity Thesaurus

25
Technical Evaluation
TREC-Style Performance Measurements

Overall Performance
11-point Average Precision (Interpolated)
Performance in the High-Precision Area
Assumption: User Wants to Get Most Relevant
Documents Top-Ranked within the Result List
Average Interpolated Precision at Recall of 0.1
Exact Precision after 10 Retrieved Documents
Applied to Experiments Evaluating Semantic
Annotations
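For reference, the two kinds of measure can be sketched for a single query as below, using the usual TREC definition of interpolated precision (the highest precision at any recall at or above the given level). The document ids are invented and the function names are illustrative. Per-query values would then be averaged over the query set.

```python
def precision_at(ranking, relevant, n=10):
    """Exact precision after n retrieved documents."""
    return sum(1 for doc in ranking[:n] if doc in relevant) / n


def interpolated_precisions(ranking, relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))  # (recall, precision)
    levels = [i / 10 for i in range(11)]
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


def eleven_point_average_precision(ranking, relevant):
    values = interpolated_precisions(ranking, relevant)
    return sum(values) / len(values)


ranking = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(eleven_point_average_precision(ranking, relevant))
print(interpolated_precisions(ranking, relevant)[1])  # precision at recall 0.1
print(precision_at(ranking, relevant, n=4))
```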

26
Technical Evaluation
Results Corpus Based Methods
Data Sets
EIT: The Springer Parallel Corpus, i.e. 9640
Documents for English and 9640 Documents for
German
CMU: Half of the Corpus, i.e. a Test Set with
4820 Documents in each
System                             Eng-Eng   Ger-Ger   Ger-Eng   Eng-Ger
Monolingual EIT lnu.ltn            0.1914    0.1848    N/A       N/A
Crosslingual EIT SimThes lnu.ltn   N/A       N/A       0.1258    0.1109
Monolingual PRF                    0.6782    0.5078    N/A       N/A
Crosslingual PRF                   N/A       N/A       0.5487    0.5758
EBT chi-squared                    N/A       N/A       0.5232    0.5396
Crosslingual GVSM                  (first evaluation to be completed in July, 2002)
27
Technical Evaluation
Results Hybrid Methods
Categorization (Preliminary Results)
Reuters-21578: 10,000 documents, 90 categories
Reuters Corpus Volume 1, TREC-10 version (RCV1):
783,484 documents, 84 categories
Reuters Koller & Sahami subsets (ICML98): 138 to
939 documents, 6-11 categories in a set
OHSUMED: 233,445 documents, 14,321 categories
System     Data Set            Macro-avg F1     Micro-avg F1
kNN        Reuters 21578       .60              .86
Rocchio    Reuters 21578       .59              .85
kNN        RCV1.TREC-10        (F0.5 .44)       (F0.5 .55)
Rocchio    RCV1.TREC-10        (F0.5 .39)       (F0.5 .49)
kNN        R-KS Subsets (3)    .85, .81, .97    .89, .80, .94
HkNN       R-KS Subsets (3)    .85, .80, .98    .86, .82, .99
Rocchio    R-KS Subsets (3)    .80, .75, .96    .82, .83, .96
HRocchio   R-KS Subsets (3)    .83, .81, .98    .78, .84, .99
kNN        OHSUMED             .26              .48
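The gap between the macro- and micro-averaged columns follows from how the two averages are formed: macro-averaging gives every category equal weight, while micro-averaging pools the decisions so that large categories dominate. A minimal sketch with invented per-category counts (true positives, false positives, false negatives) follows. The function names are illustrative.

```python
def f1(tp, fp, fn):
    """F1 from true positives, false positives and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(per_category):
    """per_category: list of (tp, fp, fn) tuples, one per category."""
    macro = sum(f1(*counts) for counts in per_category) / len(per_category)
    tp, fp, fn = (sum(column) for column in zip(*per_category))
    return macro, f1(tp, fp, fn)


# One large, easy category and one small, hard category (hypothetical counts).
macro, micro = macro_micro_f1([(90, 10, 10), (1, 4, 5)])
print(round(macro, 2), round(micro, 2))
```

A collection with many small categories, as OHSUMED has, would be expected to show a low macro-average next to a higher micro-average for this reason.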
28
Technical Evaluation
Results Hybrid Methods
Semantic Annotation Extraction
Data Set: Full Springer Corpus
Weighting Scheme: Coordination Level Matching (CLM)
1. Pass: Documents Preferred Containing Matching
Terms or Semantic Relations
2. Pass: All Features Using lnu.ltn
Rel. Assessments: German
System                  11pt AvPrec         Prec at Recall of 0.1   Prec at 10 Docs Retr
                        SemA-v3   SemA-v4   SemA-v3   SemA-v4       SemA-v3   SemA-v4
EN2DE Morph EWN         -         0.0005    -         0.0017        -         0.0040
EN2DE Morph UMLS        -         0.0933    -         0.2898        -         0.1840
EN2DE Morph UMLS XRCE   -         0.1486    -         0.4258        -         0.3360
DE2EN Morph EWN         -         0.0479    -         0.1240        -         0.0960
DE2EN Morph UMLS        0.1507    0.1392    0.3895    0.3963        0.2520    0.2920
29
Technical Evaluation
Results Hybrid Methods
Semantic Annotation Similarity Thesaurus
Data Set: Full Springer Corpus
Weighting Scheme: Coordination Level Matching (CLM)
Rel. Assessments: German
System                               11pt AvPrec   Prec at Recall of 0.1   Prec at 10 Docs Retr
EN2DE transl. Morphology EWN         0.0276        0.1353                  0.1000
EN2DE transl. Morphology UMLS        0.1487        0.4126                  0.3320
EN2DE transl. Morphology UMLS XRCE   0.1706        0.4495                  0.3600
DE2EN transl. Morphology EWN         0.1101        0.3165                  0.2000
DE2EN transl. Morphology UMLS        0.1413        0.4038                  0.2680
30
Technical Evaluation
Summary of the Results

Assumption: CLIR achieves up to 75% of
Monolingual Baseline
(11pt Average Precision)
Corpus-based Methods (Compared to Monolingual
PRF)
German-English: PRF 81%, EBT 77%, EIT 66%
English-German: PRF 113%, EBT 106%, EIT 60%
Hybrid Methods (Compared to Monolingual EIT)
German-English: 73% (UMLS Terms, SemRels)
English-German: 50% (UMLS Terms, SemRels)
English-German: 80% (UMLS Terms, SemRels,
XRCE Terms)
German-English: 74% (SimThes, UMLS Terms,
SemRels)
English-German: 80% (SimThes, UMLS Terms,
SemRels)
English-German: 92% (SimThes, UMLS Terms,
SemRels, XRCE Terms)
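These percentages can be recomputed from the 11pt average precision values reported on the results slides. The sketch assumes each cross-lingual run is divided by the monolingual run in the document language (Ger-Eng against Eng-Eng, Eng-Ger against Ger-Ger). That pairing is an assumption, but it reproduces the figures. Note that the EIT percentages only come out when EIT's own monolingual run is used as the baseline.

```python
def percent_of_baseline(crosslingual, monolingual):
    """Cross-lingual average precision as a rounded percentage of the baseline."""
    return round(100 * crosslingual / monolingual)


# Values copied from the corpus-based and hybrid result tables.
mono_prf = {"Eng": 0.6782, "Ger": 0.5078}
mono_eit = {"Eng": 0.1914, "Ger": 0.1848}

print("Ger-Eng PRF", percent_of_baseline(0.5487, mono_prf["Eng"]))
print("Eng-Ger PRF", percent_of_baseline(0.5758, mono_prf["Ger"]))
print("Ger-Eng EIT", percent_of_baseline(0.1258, mono_eit["Eng"]))
print("Eng-Ger SimThes, UMLS, XRCE", percent_of_baseline(0.1706, mono_eit["Ger"]))
```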

31
Management
Deviations from the Work Plan

Corpus Collection
Comparable Medical Document Corpora are Very
Difficult to Obtain, Anonymization Must be
Validated by Hospital CIO
Work with Shuffled Parallel Corpus
Radiology Reports (600,000) Available in German,
to be Obtained for English

Corpus Annotation
More Efforts on Improving PoS Tagging and
Morphological Analysis (English and German
Medical Specialist Lexicon)

Relation Extraction
More Efforts on Grammatical Function Tagging as
Preprocessing for Semantic Relation Tagging and
Extraction

32
Management
Future Prospects and Activities

R&D Topics
Ontology Development: Combining Axes in
AGK-Thesaurus (ZInfo) with Cluster Methods (CSLI,
DFKI)
Semantic Web: Semantic Annotation of Medical
Documents with Metadata (UMLS in Protégé)

Related Projects and Workshops
Project Proposal IKAR/OS on KM Visualization in
Life Sciences
OntoWeb SIG on LT in Ontology Development and Use
MuchMore Workshop with Invited Experts in Medical
Information Access, CLIR and Semantic Annotation
(September 2002)
ZInfo/MuchMore Workshop on Electronic Patient
Records (Spring 2003)