1
The Semantic Retrieval System: Real-time System
for Classifying and Retrieving Unstructured
Pediatric Clinical Annotations
  • Charlotte Andersen
  • John Pestian
  • Karen Davis
  • Lukasz Itert
  • Pawel Matykewicz
  • Wlodzislaw Duch

Cincinnati, February 2005
2
Outline
  • The project
  • Goals
  • Focus
  • Software
  • Results
  • Plans

3
CCHRF project outline (simplified)
  • Preprocessing: INPUT (raw medical text) →
    MetaMap input
  • MetaMap software: UMLS concept discovery and
    indexing → Annotations Concept Space (UMLS
    concepts)
  • Hypothesis generation, validation, important
    relations
  • Applications: decision support systems,
    automatic medical billing
4
Long-term goals (too ambitious?)
  • IR system facilitating discoveries, helping to
    answer questions like:
  • Retrieve similar cases using discharge summaries.
  • Is X related to Y?
  • Will X help patient with Y?
  • What correlates with X?
  • What causes changes of X?
  • What are therapy options for X?
  • Automatic creation of medical billing codes from
    text.
  • Can we work out scenarios of use for our target
    system?

5
First big problem: disambiguation
  • Map raw text to some structured form, removing
    all ambiguities, expanding acronyms, etc.
  • Use the NLM's MetaMap to create XML-formatted
    data whose schema is based on the Unified Medical
    Language System's (UMLS) Semantic Network
    ontology (see the sketch after the examples
    below).
  • <semantic type>word</semantic type>
  • E.C. → <bacterium>Escherichia Coli</bacterium>
  • <patient>
  •   <FIRST-NAME>Bob</FIRST-NAME>
  •   <LAST-NAME>Nope</LAST-NAME>
  • </patient>
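
A minimal Python sketch of this raw-text → tagged-XML
mapping; the lexicon and tag names are toy assumptions,
not MetaMap's actual output format:

    # Raw text -> semantically tagged XML; the lexicon is a toy
    # stand-in for the UMLS Metathesaurus lookup MetaMap performs.
    from xml.sax.saxutils import escape

    SEMANTIC_TYPE = {  # hypothetical UMLS-style lexicon
        "E.C.": ("bacterium", "Escherichia Coli"),  # acronym -> expansion
        "fever": ("sign_or_symptom", "fever"),
    }

    def tag_terms(text):
        out = []
        for token in text.split():
            key = token.strip(",;")
            if key in SEMANTIC_TYPE:
                sem, expansion = SEMANTIC_TYPE[key]
                out.append("<%s>%s</%s>" % (sem, escape(expansion), sem))
            else:
                out.append(escape(token))
        return " ".join(out)

    print(tag_terms("Culture positive for E.C. with fever"))
    # Culture positive for <bacterium>Escherichia Coli</bacterium>
    # with <sign_or_symptom>fever</sign_or_symptom>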

6
XML or structured text
  • The final XML should include the maximum
    information that can be derived with high
    confidence for each word, including:
  • 1. Annotations for parts of speech (tree
    tagger); for which types of words?
  • 2. Tags for semantic type (135 types in UMLS,
    plus tags for other, non-medical types)
  • 3. Tags for word sense (UMLS, plus dictionaries
    such as WordNet)
  • 4. Values assigned to some semantic types, e.g.
    Temperature=high, or T=102F.
  • What we should keep depends on the scenarios of
    how the system will be used. (A sketch of such a
    layered annotation follows.)
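
A hedged sketch of one fully annotated token combining
layers 1-4 above; the attribute names are illustrative,
not a fixed AMIA/UMLS schema:

    # One token with all four annotation layers attached.
    annotation = {
        "word": "102F",
        "pos": "CD",                     # 1. part of speech (tree tagger)
        "semantic_type": "temperature",  # 2. UMLS semantic type
        "sense": "body_temperature",     # 3. word sense
        "value": 102.0, "unit": "F",     # 4. value for the semantic type
    }
    xml = ('<w pos="{pos}" sem="{semantic_type}" sense="{sense}" '
           'value="{value}" unit="{unit}">{word}</w>').format(**annotation)
    print(xml)
    # <w pos="CD" sem="temperature" sense="body_temperature"
    #    value="102.0" unit="F">102F</w>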

7
Small steps to solve the big problem
  • Main subproblems:
  • Removing patient-specific information while
    keeping all information related to a single case
    together: how to link a sequence of records for
    a single person? (See the sketch after this
    list.)
  • Text cleaning: misspellings, obtaining unique
    terms.
  • Expansion of abbreviations and acronyms.
  • Ambiguity of medical terms.
  • Ambiguity of common words: how interesting are
    common terms, and which categories/semantic
    types should be used?
  • Assigning values to some categories, e.g. blood
    pressure, temperature.
  • Check XML standards developed at AMIA.
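
A minimal sketch of the first subproblem above, assuming
each record carries a medical record number (mrn, a
hypothetical field): replace it with a keyed hash so the
records of one patient stay linked without exposing the
identifier:

    import hashlib, hmac

    SECRET_KEY = b"local-secret"  # kept outside the de-identified corpus

    def pseudonym(mrn):
        # Same MRN -> same pseudonym, so records stay linked,
        # but the MRN cannot be recovered without the key.
        return hmac.new(SECRET_KEY, mrn.encode(),
                        hashlib.sha256).hexdigest()[:12]

    record = {"mrn": "0047-112", "text": "Patient afebrile, discharged home."}
    deidentified = {"patient_id": pseudonym(record["mrn"]),
                    "text": record["text"]}
    print(deidentified)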

8
Human information retrieval
  • 3 levels:
  • First, local: recognition of terms; we bring a
    lot of background knowledge when reading the
    text, and ignore misspellings and small mistakes.
  • Second, larger units: semantic interpretation;
    discover and understand the meaning of concepts
    composed of several terms, define the semantic
    word sense for ambiguous words, expand terms and
    acronyms to reach an unambiguous interpretation.
  • Third, episodic level of processing: what is the
    whole record or text about? Knowing the category
    of the text helps in unique interpretation at
    the recognition and semantic levels.

9
Recognition
  • Pawel started some work; a short report on text
    recognition memory was written.
  • NLM has the GSpell and WedSpell spelling
    suggestion tools, and the BagOWordsPlus phrase
    retrieval tool (new, worth trying).
  • GSpell: Java classes used to propose spelling
    corrections and a unique spelling for words that
    have alternative spellings.
  • Even if spelled correctly, it may be a mistake,
    e.g. (query → suggestion, edit distance, n-gram
    score, method); a toy n-gram similarity sketch
    follows this list:
  • disease → disease    0.0  1.0                 NGrams (Correct)
  • disease → discase    1.0  0.873               NGrams
  • disease → diseased   1.0  0.873               NGrams
  • disease → decease    2.0  0.5819672267388108  NGrams
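
Not GSpell itself, but a minimal character n-gram
similarity in the same spirit, showing why near-misses
such as "discase" score close to "disease":

    # Dice coefficient over character bigrams.
    def ngrams(word, n=2):
        word = "#%s#" % word  # pad so word edges count too
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def similarity(a, b):
        ga, gb = ngrams(a), ngrams(b)
        return 2 * len(ga & gb) / (len(ga) + len(gb))

    for candidate in ["disease", "discase", "diseased", "decease"]:
        print(candidate, round(similarity("disease", candidate), 3))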

10
Recognition cont
  • Is this an issue in our case? Can we estimate
    how serious the problems at the recognition
    level are?
  • The term may be part of a phrase, and this will
    be discovered only when the term is correctly
    recognized.
  • How do we know that we have an
    acronym/abbreviation? Frequently capital
    letters, usually 2-4 letters, and a
    morphological structure that is improbable
    according to bi-grams, e.g. DMI, CRC, IVF. (A
    heuristic sketch follows this list.)
  • Acronyms and abbreviations should be recognized
    and expanded.
  • We need probabilities of various typos (keys
    that are close, characters that are inverted,
    frequent errors, anticipation of which character
    should come next, etc.), and of errors at the
    spelling and phonological levels.
  • External dictionaries should be checked to find
    out whether the word is a specific medical term
    that is not listed in UMLS.
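
A toy sketch of the acronym heuristics above; the list
of improbable bigrams is a stand-in for a real bigram
model learned from a corpus:

    # Heuristics: all capitals, 2-4 letters, morphologically
    # improbable letter bigrams.
    RARE_BIGRAMS = {"dm", "vf", "rc", "mx", "qz"}  # assumed, corpus-derived

    def looks_like_acronym(token):
        if not (2 <= len(token) <= 4 and token.isupper()):
            return False
        bigrams = {token[i:i + 2].lower() for i in range(len(token) - 1)}
        return bool(bigrams & RARE_BIGRAMS)

    for t in ["DMI", "CRC", "IVF", "WAS", "pain"]:
        print(t, looks_like_acronym(t))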

11
Semantic level
  • Required to:
  • select the most probable term when the
    recognition process gives several alternatives
    at the same confidence level;
  • WSD, or finding the semantic word sense for
    ambiguous words.
  • A word may have correct spelling but no sense at
    the semantic level: go back to the recognition
    level and generate more similar words, to check
    which one is the most probable at the semantic
    level (see the sketch below).
  • This should in most cases give a highly probable
    term; once this is achieved, a unique semantic
    word sense is defined.
  • Semantic knowledge representation may be done
    using:
  • context vectors,
  • concept-description vectors,
  • more elaborate approaches, like frames (CYC).
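
A sketch of this recognition-to-semantic feedback loop;
suggest() and sense_score() are hypothetical stand-ins
for a speller and a semantic plausibility model:

    def suggest(word):
        # candidates ordered by spelling similarity (toy data)
        return {"discase": ["disease", "decease"]}.get(word, [word])

    def sense_score(candidate, context_word):
        plausible = {("disease", "chronic"): 0.9,
                     ("decease", "chronic"): 0.1}
        return plausible.get((candidate, context_word), 0.0)

    def resolve(word, context_word, threshold=0.5):
        for candidate in suggest(word):            # recognition level
            if sense_score(candidate, context_word) >= threshold:
                return candidate                   # unique, probable sense
        return word                                # leave unresolved

    print(resolve("discase", "chronic"))  # -> disease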

12
Semantic knowledge representation
  • Context vectors: numerical, easy to generate
    from co-occurrence. A widely used statistical
    approach, but it lacks semantics: a concept name
    and its properties may be far apart.
  • Concept-description vectors (CDV):
    knowledge-based; list the properties of
    concepts, derive information from definitions,
    dictionaries and ontologies, and pay more
    attention to unique features.
  • Frames: structured representations with more
    expressive power, with symbolic values such as
    color=blue, or color in {blue, green, etc.};
    time=admission_time, time=day before discharge,
    time=morning, etc.
  • Initially a simple vector representation should
    be sufficient for WSD (see the sketch below),
    but remember that its expressive power is
    limited. Some thinking about a simplified,
    computationally efficient frame-based
    representation should be done.
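
A minimal context-vector sketch under toy assumptions:
each occurrence of a term is represented by its
surrounding words, and occurrences are compared by
cosine similarity:

    from collections import Counter
    from math import sqrt

    def context_vector(tokens, index, window=2):
        lo, hi = max(0, index - window), index + window + 1
        return Counter(t for i, t in enumerate(tokens[lo:hi], start=lo)
                       if i != index)

    def cosine(u, v):
        dot = sum(u[k] * v[k] for k in u)
        norm = (sqrt(sum(c * c for c in u.values()))
                * sqrt(sum(c * c for c in v.values())))
        return dot / norm if norm else 0.0

    s1 = "patient blood pressure normal after treatment".split()
    s2 = "blood pressure elevated before treatment".split()
    v1 = context_vector(s1, s1.index("pressure"))
    v2 = context_vector(s2, s2.index("pressure"))
    print(round(cosine(v1, v2), 3))  # similar contexts -> higher score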

13
Episodic level
  • Try to understand what the whole record or
    paragraph is about.
  • ACP has at least 14 distinct meanings in Medline
    abstracts; the recognition/semantic level is not
    sufficient for disambiguation.
  • Essentially requires categorization of
    documents/paragraphs.
  • The record should be placed in some category,
    and this will restrict the semantic meanings
    that are probable in this category (see the
    sketch below).
  • This is more expensive than the semantic level.
    To achieve it, categories of records should be
    identified (document classification).
  • Lukasz has made the first experiments using
    different knowledge representations with
    discharge summaries.
  • The different levels R, S, E are coupled.
    Knowing the disease, it is easier to uniquely
    expand some acronyms and provide WSD. Adding
    some XML annotation should make text
    categorization easier. Several interpretations
    should be maintained, then one selected.
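
A sketch of how a known category restricts acronym
senses; the categories and expansions are illustrative
assumptions:

    EXPANSIONS = {
        "ALL": {"oncology": "acute lymphoblastic leukemia",
                "general": "allergy"},
        "RA":  {"rheumatology": "rheumatoid arthritis",
                "cardiology": "right atrium"},
    }

    def expand(acronym, category):
        senses = EXPANSIONS.get(acronym, {})
        return senses.get(category) or acronym  # keep as-is when unknown

    print(expand("ALL", "oncology"))   # acute lymphoblastic leukemia
    print(expand("RA", "cardiology"))  # right atrium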

14
Billing codes
  • Is it feasible? Complete automation may be hard.
  • Many courses and books are on the market ($B
    annually).
  • Simplest solution: a proper database → codes
    automatically.
  • Knowledge-based approach to deriving billing
    codes from texts: look at the rules in books,
    try to analyze the text, estimate which fields
    are easy and which are difficult.
  • Memory-based approach: find similar descriptions
    that have the same codes (used in national
    censuses); see the sketch below.
  • Correlation-based: look at the statistical
    distribution of codes; correlation between digit
    values is useful for checking, sometimes for
    prediction.
  • Demo
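
A toy sketch of the memory-based approach: retrieve the
most similar past description and reuse its code; the
token-overlap similarity and the two coded cases are
assumptions:

    CODED_CASES = [  # (past description, assigned billing code)
        ("acute otitis media right ear", "382.9"),
        ("asthma exacerbation wheezing", "493.92"),
    ]

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0

    def suggest_code(description):
        words = set(description.lower().split())
        best = max(CODED_CASES,
                   key=lambda case: jaccard(words, set(case[0].split())))
        return best[1]

    print(suggest_code("otitis media of the right ear"))  # -> 382.9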

16
General questions
  • How should we proceed? Depending on the scenario
    of use, we can work on selected aspects of the
    problem, or try to put the whole system together
    and go on improving it.
  • What data can we have access to? How reliable is
    it?
  • What should we still do at the pre-processing
    stage? Anonymizing, but still linking individual
    patients?
  • How should we leverage the POS-tagged corpus?
    Compare different unsupervised taggers; check
    the improvement of supervised taggers; use POS
    as additional information in concept discovery
    and WSD; other ideas?

17
Recognition memory level
  • Cleaning the text, focusing on details: many
    misspellings; various recognition memory
    techniques may be applied to token → term
    mappings. Pawel has made a good start, but be
    careful: it is easy to introduce errors.
  • Improvements of GSpell are of interest to NLM.
  • About 1000 disambiguation rules were derived
    from >700K trigrams, but how universal are these
    rules on new texts? Are some of them too
    specific?
  • A semi-automatic approach may be based on
    context vectors: first cluster the different
    uses of "mm", "ALL", etc., and for each cluster
    try to assign a unique meaning from context (see
    the sketch below). How does it compare with
    manually derived rules? Can we combine the two
    approaches for higher confidence?
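
A sketch of the clustering idea above; greedy threshold
grouping is a toy stand-in for a real clustering
algorithm:

    # Group context vectors (here: word sets) of an ambiguous
    # token so each cluster can be assigned one meaning.
    def overlap(a, b):
        return len(a & b) / min(len(a), len(b))

    def cluster_contexts(contexts, threshold=0.5):
        clusters = []
        for ctx in contexts:
            for cl in clusters:
                if overlap(ctx, cl[0]) >= threshold:
                    cl.append(ctx)
                    break
            else:
                clusters.append([ctx])
        return clusters

    occurrences = [  # contexts of "mm" as word sets
        {"lesion", "diameter", "size"},
        {"nodule", "diameter", "size"},
        {"methadone", "maintenance", "program"},
    ]
    print(len(cluster_contexts(occurrences)))  # -> 2 candidate senses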

18
Semantic memory level
  • So far we have used only MetaMap, but we need
    phrase and concept indexing: noun phrases,
    creating equivalence classes, compression of
    information, finding concepts in whole sentences
    or large windows, not only in phrases.
  • WSD, or rather concept sense disambiguation
    (CSD): work with the context vectors in the
    compressed text.
  • Knowledge-based approach: create
    concept-description vectors from medical
    dictionaries and ontologies; this goes beyond
    context vectors by providing reference
    knowledge.
  • Knowledge discovery: assigning values to
    concepts, and concepts to numbers and
    adjectives, e.g. blood_pressure=xxx-yyy, or
    blood_pressure=normal (adjective-noun relations,
    or number-concept relations); when looking for
    relations at this stage, use fuzzy/similarity
    logic. (See the extraction sketch below.)
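
A minimal sketch of this value-assignment step using
regular expressions; the patterns are illustrative
assumptions:

    import re

    PATTERNS = [
        (r"\bBP\s*(\d{2,3})/(\d{2,3})\b",
         lambda m: ("blood_pressure", "%s/%s" % (m.group(1), m.group(2)))),
        (r"\bblood pressure\s+(normal|elevated|low)\b",
         lambda m: ("blood_pressure", m.group(1))),
        (r"\bT\s*(\d{2,3}(?:\.\d)?)\s*F\b",
         lambda m: ("temperature", m.group(1) + "F")),
    ]

    def extract_values(text):
        found = []
        for pattern, build in PATTERNS:
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                found.append(build(m))
        return found

    print(extract_values("BP 120/80, T 102 F, blood pressure normal"))
    # [('blood_pressure', '120/80'), ('blood_pressure', 'normal'),
    #  ('temperature', '102F')]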

19
Episodic memory level
  • Document categorization: what categories? For
    billing, very detailed ones; but even rough
    categories are useful to narrow down the choices
    for acronym expansion and WSD (see the sketch
    below).
  • Lukasz: the most common categories were derived
    from the database; it is not clear how accurate
    the initial diagnosis is, but at this rough
    level it should be fine.
  • Use MeSH headings at some level? Challenge:
    select the best set of headings that will help
    to find the unique sense of words and acronyms.
  • There are many advanced approaches to text
    categorization, like kernel-based methods for
    text; a nice field, but the secret is in the
    pre-processing, finding a good feature space.
  • Relation to the 20Q (Twenty Questions) game:
    gaining confidence stepwise.
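
A toy sketch of rough categorization with keyword
counts; the per-category word lists are assumptions,
not the categories actually derived from the database:

    from collections import Counter

    CATEGORY_WORDS = {  # assumed keyword lists per category
        "oncology": {"leukemia", "chemotherapy", "blast"},
        "cardiology": {"atrium", "murmur", "ecg"},
    }

    def categorize(text):
        words = Counter(text.lower().split())
        scores = {cat: sum(words[w] for w in kws)
                  for cat, kws in CATEGORY_WORDS.items()}
        return max(scores, key=scores.get)

    print(categorize("ECG shows murmur and atrium enlargement"))
    # -> cardiology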

20
Suggestions and priorities
  • What are our priorities? All 3 levels are
    important. Where will our greatest impact be?
  • Start with document categorization? People
    usually know the document category when they
    read it; misunderstanding is certain if short
    documents are given to the wrong experts. Try
    knowledge-based clustering and supervised
    learning: recurrent NNs for structured problems,
    decision trees for many missing values, ...
  • Good categorization needs concepts/phrases; we
    should focus on concept discovery and check its
    coupling with document categorization, exploring
    parallel hypotheses.
  • Some work should also be finished at the
    recognition memory level: acronyms and
    misspellings.