1
The Semantic Retrieval System: Real-time System
for Classifying and Retrieving Unstructured
Pediatric Clinical Annotations
  • Charlotte Andersen
  • John Pestian
  • Karen Davis
  • Lukasz Itert
  • Pawel Matykewicz
  • Wlodzislaw Duch

Cincinnati, February 2005
2
Outline
  • The project
  • Goals
  • Focus
  • Software
  • Results
  • Plans

3
CCHRF project outline (simplified)
  • Preprocessing: INPUT (raw medical text) →
    MetaMap input
  • MetaMap software: UMLS concept discovery and
    indexing → Annotations Concept Space (UMLS
    concepts)
  • Hypothesis generation, validation, important
    relations
  • Applications: decision support systems,
    automatic medical billing
4
Long-term goals (too ambitious?)
  • IR system facilitating discoveries, helping to
    answer questions like:
  • Retrieve similar cases using discharge summaries.
  • Is X related to Y?
  • Will X help patient with Y?
  • What correlates with X?
  • What causes changes of X?
  • What are therapy options for X?
  • Automatic creation of medical billing codes from
    text.
  • Can we work out scenarios of use for our target
    system?

5
First big problem: disambiguation
  • Map raw text to some structured form, removing
    all ambiguities, expanding acronyms, etc.
  • Use the NLM's MetaMap to create XML-formatted
    data whose schema is based on the Unified Medical
    Language System's (UMLS) Semantic Network
    ontology (see the sketch after the examples
    below).
  • <semantic type>word</semantic type>
  • E.C. → <bacterium>Escherichia Coli</bacterium>
  • <patient>
  •   <FIRST-NAME>Bob</FIRST-NAME>
  •   <LAST-NAME>Nope</LAST-NAME>
  • </patient>
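
A minimal Python sketch of this raw-text → tagged-XML
mapping; the lexicon and tag names are toy assumptions,
not MetaMap's actual output format:

    # Raw text -> semantically tagged XML; the lexicon is a toy
    # stand-in for the UMLS Metathesaurus lookup MetaMap performs.
    from xml.sax.saxutils import escape

    SEMANTIC_TYPE = {  # hypothetical UMLS-style lexicon
        "E.C.": ("bacterium", "Escherichia Coli"),  # acronym -> expansion
        "fever": ("sign_or_symptom", "fever"),
    }

    def tag_terms(text):
        out = []
        for token in text.split():
            key = token.strip(",;")
            if key in SEMANTIC_TYPE:
                sem, expansion = SEMANTIC_TYPE[key]
                out.append("<%s>%s</%s>" % (sem, escape(expansion), sem))
            else:
                out.append(escape(token))
        return " ".join(out)

    print(tag_terms("Culture positive for E.C. with fever"))
    # Culture positive for <bacterium>Escherichia Coli</bacterium>
    # with <sign_or_symptom>fever</sign_or_symptom>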

6
XML or structured text
  • The final XML should include the maximum
    information that can be derived with high
    confidence for each word, including:
  • 1. Annotations for parts of speech (tree
    tagger); for which types of words?
  • 2. Tags for semantic type (135 types in UMLS,
    plus tags for other, non-medical types)
  • 3. Tags for word sense (UMLS, plus dictionaries
    such as WordNet)
  • 4. Values assigned to some semantic types, e.g.
    Temperature=high, or T=102F.
  • What we should keep depends on the scenarios of
    how the system will be used. (A sketch of such a
    layered annotation follows.)
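
A hedged sketch of one fully annotated token combining
layers 1-4 above; the attribute names are illustrative,
not a fixed AMIA/UMLS schema:

    # One token with all four annotation layers attached.
    annotation = {
        "word": "102F",
        "pos": "CD",                     # 1. part of speech (tree tagger)
        "semantic_type": "temperature",  # 2. UMLS semantic type
        "sense": "body_temperature",     # 3. word sense
        "value": 102.0, "unit": "F",     # 4. value for the semantic type
    }
    xml = ('<w pos="{pos}" sem="{semantic_type}" sense="{sense}" '
           'value="{value}" unit="{unit}">{word}</w>').format(**annotation)
    print(xml)
    # <w pos="CD" sem="temperature" sense="body_temperature"
    #    value="102.0" unit="F">102F</w>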

7
Small steps to solve the big problem
  • Main subproblems:
  • Removing patient-specific information while
    keeping all information related to a single case
    together: how to link a sequence of records for
    a single person? (See the sketch after this
    list.)
  • Text cleaning: misspellings, obtaining unique
    terms.
  • Expansion of abbreviations and acronyms.
  • Ambiguity of medical terms.
  • Ambiguity of common words: how interesting are
    common terms, and which categories/semantic
    types should be used?
  • Assigning values to some categories, e.g. blood
    pressure, temperature.
  • Check XML standards developed at AMIA.
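
A minimal sketch of the first subproblem above, assuming
each record carries a medical record number (mrn, a
hypothetical field): replace it with a keyed hash so the
records of one patient stay linked without exposing the
identifier:

    import hashlib, hmac

    SECRET_KEY = b"local-secret"  # kept outside the de-identified corpus

    def pseudonym(mrn):
        # Same MRN -> same pseudonym, so records stay linked,
        # but the MRN cannot be recovered without the key.
        return hmac.new(SECRET_KEY, mrn.encode(),
                        hashlib.sha256).hexdigest()[:12]

    record = {"mrn": "0047-112", "text": "Patient afebrile, discharged home."}
    deidentified = {"patient_id": pseudonym(record["mrn"]),
                    "text": record["text"]}
    print(deidentified)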

8
Human information retrieval
  • 3 levels:
  • First, local: recognition of terms; we bring a
    lot of background knowledge when reading the
    text, and ignore misspellings and small mistakes.
  • Second, larger units: semantic interpretation;
    discover and understand the meaning of concepts
    composed of several terms, define the semantic
    word sense for ambiguous words, expand terms and
    acronyms to reach an unambiguous interpretation.
  • Third, episodic level of processing: what is the
    whole record or text about? Knowing the category
    of the text helps in unique interpretation at
    the recognition and semantic levels.

9
Recognition
  • Pawel started some work; a short report on text
    recognition memory was written.
  • NLM has the GSpell and WedSpell spelling
    suggestion tools, and the BagOWordsPlus phrase
    retrieval tool (new, worth trying).
  • GSpell: Java classes used to propose spelling
    corrections and a unique spelling for words that
    have alternative spellings.
  • Even if spelled correctly, it may be a mistake,
    e.g. (query → suggestion, edit distance, n-gram
    score, method); a toy n-gram similarity sketch
    follows this list:
  • disease → disease    0.0  1.0                 NGrams (Correct)
  • disease → discase    1.0  0.873               NGrams
  • disease → diseased   1.0  0.873               NGrams
  • disease → decease    2.0  0.5819672267388108  NGrams
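
Not GSpell itself, but a minimal character n-gram
similarity in the same spirit, showing why near-misses
such as "discase" score close to "disease":

    # Dice coefficient over character bigrams.
    def ngrams(word, n=2):
        word = "#%s#" % word  # pad so word edges count too
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def similarity(a, b):
        ga, gb = ngrams(a), ngrams(b)
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    for candidate in ["disease", "discase", "diseased", "decease"]:
        print(candidate, round(similarity("disease", candidate), 3))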

10
Recognition cont
  • Is this an issue in our case? Can we estimate
    how serious the problems at the recognition
    level are?
  • The term may be part of a phrase, and this will
    be discovered only when the term is correctly
    recognized.
  • How do we know that we have an
    acronym/abbreviation? Frequently capital
    letters, usually 2-4 letters, and a
    morphological structure that is improbable
    according to bi-grams, e.g. DMI, CRC, IVF. (A
    heuristic sketch follows this list.)
  • Acronyms and abbreviations should be recognized
    and expanded.
  • We need probabilities of various typos (keys
    that are close, characters that are inverted,
    frequent errors, anticipation of which character
    should come next, etc.), and of errors at the
    spelling and phonological levels.
  • External dictionaries should be checked to find
    out whether the word is a specific medical term
    that is not listed in UMLS.
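
A toy sketch of the acronym heuristics above; the list
of improbable bigrams is a stand-in for a real bigram
model learned from a corpus:

    # Heuristics: all capitals, 2-4 letters, morphologically
    # improbable letter bigrams.
    RARE_BIGRAMS = {"dm", "vf", "rc", "mx", "qz"}  # assumed, corpus-derived

    def looks_like_acronym(token):
        if not (2 <= len(token) <= 4 and token.isupper()):
            return False
        bigrams = {token[i:i + 2].lower() for i in range(len(token) - 1)}
        return bool(bigrams & RARE_BIGRAMS)

    for t in ["DMI", "CRC", "IVF", "WAS", "pain"]:
        print(t, looks_like_acronym(t))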

11
Semantic level
  • Required to:
  • select the most probable term when the
    recognition process gives several alternatives
    at the same confidence level;
  • WSD, or finding the semantic word sense for
    ambiguous words.
  • A word may have correct spelling but no sense at
    the semantic level: go back to the recognition
    level and generate more similar words, to check
    which one is the most probable at the semantic
    level (see the sketch below).
  • This should in most cases give a highly probable
    term; once this is achieved, a unique semantic
    word sense is defined.
  • Semantic knowledge representation may be done
    using:
  • context vectors,
  • concept-description vectors,
  • more elaborate approaches, like frames (CYC).
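
A sketch of this recognition-to-semantic feedback loop;
suggest() and sense_score() are hypothetical stand-ins
for a speller and a semantic plausibility model:

    def suggest(word):
        # candidates ordered by spelling similarity (toy data)
        return {"discase": ["disease", "decease"]}.get(word, [word])

    def sense_score(candidate, context_word):
        plausible = {("disease", "chronic"): 0.9,
                     ("decease", "chronic"): 0.1}
        return plausible.get((candidate, context_word), 0.0)

    def resolve(word, context_word, threshold=0.5):
        for candidate in suggest(word):            # recognition level
            if sense_score(candidate, context_word) >= threshold:
                return candidate                   # unique, probable sense
        return word                                # leave unresolved

    print(resolve("discase", "chronic"))  # -> disease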

12
Semantic knowledge representation
  • Context vectors: numerical, easy to generate
    from co-occurrence. A widely used statistical
    approach, but it lacks semantics: a concept name
    and its properties may be far apart.
  • Concept-description vectors (CDV):
    knowledge-based; list the properties of
    concepts, derive information from definitions,
    dictionaries and ontologies, and pay more
    attention to unique features.
  • Frames: structured representations with more
    expressive power, with symbolic values such as
    color=blue, or color in {blue, green, etc.};
    time=admission_time, time=day before discharge,
    time=morning, etc.
  • Initially a simple vector representation should
    be sufficient for WSD (see the sketch below),
    but remember that its expressive power is
    limited. Some thinking about a simplified,
    computationally efficient frame-based
    representation should be done.
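
A minimal context-vector sketch under toy assumptions:
each occurrence of a term is represented by its
surrounding words, and occurrences are compared by
cosine similarity:

    from collections import Counter
    from math import sqrt

    def context_vector(tokens, index, window=2):
        lo, hi = max(0, index - window), index + window + 1
        return Counter(t for i, t in enumerate(tokens[lo:hi], start=lo)
                       if i != index)

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = (sqrt(sum(c * c for c in u.values()))
                * sqrt(sum(c * c for c in v.values())))
        return dot / norm if norm else 0.0

    s1 = "patient blood pressure normal after treatment".split()
    s2 = "blood pressure elevated before treatment".split()
    v1 = context_vector(s1, s1.index("pressure"))
    v2 = context_vector(s2, s2.index("pressure"))
    print(round(cosine(v1, v2), 3))  # similar contexts -> higher score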

13
Episodic level
  • Try to understand what the whole record or
    paragraph is about.
  • ACP has at least 14 distinct meanings in Medline
    abstracts; the recognition/semantic level is not
    sufficient for disambiguation.
  • Essentially requires categorization of
    documents/paragraphs.
  • The record should be placed in some category,
    and this will restrict the semantic meanings
    that are probable in this category (see the
    sketch below).
  • This is more expensive than the semantic level.
    To achieve it, categories of records should be
    identified (document classification).
  • Lukasz has made the first experiments using
    different knowledge representations with
    discharge summaries.
  • The different levels R, S, E are coupled.
    Knowing the disease, it is easier to uniquely
    expand some acronyms and provide WSD. Adding
    some XML annotation should make text
    categorization easier. Several interpretations
    should be maintained, then one selected.
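
A sketch of how a known category restricts acronym
senses; the categories and expansions are illustrative
assumptions:

    EXPANSIONS = {
        "ALL": {"oncology": "acute lymphoblastic leukemia",
                "general": "allergy"},
        "RA":  {"rheumatology": "rheumatoid arthritis",
                "cardiology": "right atrium"},
    }

    def expand(acronym, category):
        senses = EXPANSIONS.get(acronym, {})
        return senses.get(category) or acronym  # keep as-is when unknown

    print(expand("ALL", "oncology"))   # acute lymphoblastic leukemia
    print(expand("RA", "cardiology"))  # right atrium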

14
Billing codes
  • Is it feasible? Complete automation may be hard.
  • Many courses and books are on the market ($B
    annually).
  • Simplest solution: a proper database → codes
    automatically.
  • Knowledge-based approach to deriving billing
    codes from texts: look at the rules in books,
    try to analyze the text, estimate which fields
    are easy and which are difficult.
  • Memory-based approach: find similar descriptions
    that have the same codes (used in national
    censuses); see the sketch below.
  • Correlation-based: look at the statistical
    distribution of codes; correlation between digit
    values is useful for checking, sometimes for
    prediction.
  • Demo
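
A toy sketch of the memory-based approach: retrieve the
most similar past description and reuse its code; the
token-overlap similarity and the two coded cases are
assumptions:

    CODED_CASES = [  # (past description, assigned billing code)
        ("acute otitis media right ear", "382.9"),
        ("asthma exacerbation wheezing", "493.92"),
    ]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def suggest_code(description):
        words = set(description.lower().split())
        best = max(CODED_CASES,
                   key=lambda case: jaccard(words, set(case[0].split())))
        return best[1]

    print(suggest_code("otitis media of the right ear"))  # -> 382.9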

16
General questions
  • How should we proceed? Depending on the scenario
    of use, we can work on selected aspects of the
    problem, or try to put the whole system together
    and go on improving it.
  • What data can we have access to? How reliable is
    it?
  • What should we still do at the pre-processing
    stage? Anonymizing, but still linking individual
    patients?
  • How should we leverage the POS-tagged corpus?
    Compare different unsupervised taggers; check
    the improvement of supervised taggers; use POS
    as additional information in concept discovery
    and WSD; other ideas?

17
Recognition memory level
  • Cleaning the text, focusing on details: many
    misspellings; various recognition memory
    techniques may be applied to token → term
    mappings. Pawel has made a good start, but be
    careful: it is easy to introduce errors.
  • Improvements of GSpell are of interest to NLM.
  • About 1000 disambiguation rules were derived
    from >700K trigrams, but how universal are these
    rules on new texts? Are some of them too
    specific?
  • A semi-automatic approach may be based on
    context vectors: first cluster the different
    uses of "mm", "ALL", etc., and for each cluster
    try to assign a unique meaning from context (see
    the sketch below). How does it compare with
    manually derived rules? Can we combine the two
    approaches for higher confidence?
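
A sketch of the clustering idea above; greedy threshold
grouping is a toy stand-in for a real clustering
algorithm:

    # Group context vectors (here: word sets) of an ambiguous
    # token so each cluster can be assigned one meaning.
    def overlap(a, b):
        return len(a & b) / min(len(a), len(b))

    def cluster_contexts(contexts, threshold=0.5):
        clusters = []
        for ctx in contexts:
            for cl in clusters:
                if overlap(ctx, cl[0]) >= threshold:
                    cl.append(ctx)
                    break
            else:
                clusters.append([ctx])
        return clusters

    occurrences = [  # contexts of "mm" as word sets
        {"lesion", "diameter", "size"},
        {"nodule", "diameter", "size"},
        {"methadone", "maintenance", "program"},
    ]
    print(len(cluster_contexts(occurrences)))  # -> 2 candidate senses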

18
Semantic memory level
  • So far we have used only MetaMap, but we need
    phrase and concept indexing: noun phrases,
    creating equivalence classes, compression of
    information, finding concepts in whole sentences
    or large windows, not only in phrases.
  • WSD, or rather concept sense disambiguation
    (CSD): work with the context vectors in the
    compressed text.
  • Knowledge-based approach: create
    concept-description vectors from medical
    dictionaries and ontologies; this goes beyond
    context vectors by providing reference
    knowledge.
  • Knowledge discovery: assigning values to
    concepts, and concepts to numbers and
    adjectives, e.g. blood_pressure=xxx-yyy, or
    blood_pressure=normal (adjective-noun relations,
    or number-concept relations); when looking for
    relations at this stage, use fuzzy/similarity
    logic. (See the extraction sketch below.)
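
A minimal sketch of this value-assignment step using
regular expressions; the patterns are illustrative
assumptions:

    import re

    PATTERNS = [
        (r"\bBP\s*(\d{2,3})/(\d{2,3})\b",
         lambda m: ("blood_pressure", "%s/%s" % (m.group(1), m.group(2)))),
        (r"\bblood pressure\s+(normal|elevated|low)\b",
         lambda m: ("blood_pressure", m.group(1))),
        (r"\bT\s*(\d{2,3}(?:\.\d)?)\s*F\b",
         lambda m: ("temperature", m.group(1) + "F")),
    ]

    def extract_values(text):
        found = []
        for pattern, build in PATTERNS:
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append(build(m))
        return found

    print(extract_values("BP 120/80, T 102 F, blood pressure normal"))
    # [('blood_pressure', '120/80'), ('blood_pressure', 'normal'),
    #  ('temperature', '102F')]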

19
Episodic memory level
  • Document categorization: what categories? For
    billing, very detailed ones; but even rough
    categories are useful to narrow down the choices
    for acronym expansion and WSD (see the sketch
    below).
  • Lukasz: the most common categories were derived
    from the database; it is not clear how accurate
    the initial diagnosis is, but at this rough
    level it should be fine.
  • Use MeSH headings at some level? Challenge:
    select the best set of headings that will help
    to find the unique sense of words and acronyms.
  • There are many advanced approaches to text
    categorization, like kernel-based methods for
    text; a nice field, but the secret is in the
    pre-processing, finding a good feature space.
  • Relation to the 20Q (Twenty Questions) game:
    gaining confidence stepwise.
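
A toy sketch of rough categorization with keyword
counts; the per-category word lists are assumptions,
not the categories actually derived from the database:

    from collections import Counter

    CATEGORY_WORDS = {  # assumed keyword lists per category
        "oncology": {"leukemia", "chemotherapy", "blast"},
        "cardiology": {"atrium", "murmur", "ecg"},
    }

    def categorize(text):
        words = Counter(text.lower().split())
        scores = {cat: sum(words[w] for w in kws)
                  for cat, kws in CATEGORY_WORDS.items()}
        return max(scores, key=scores.get)

    print(categorize("ECG shows murmur and atrium enlargement"))
    # -> cardiology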

20
Suggestions and priorities
  • What are our priorities? All 3 levels are
    important. Where will our greatest impact be?
  • Start with document categorization? People
    usually know the document category when they
    read it; misunderstanding is certain if short
    documents are given to the wrong experts. Try
    knowledge-based clustering and supervised
    learning: recurrent NNs for structured problems,
    decision trees for many missing values, ...
  • Good categorization needs concepts/phrases; we
    should focus on concept discovery and check its
    coupling with document categorization, exploring
    parallel hypotheses.
  • Some work should also be finished at the
    recognition memory level: acronyms and
    misspellings.