Identifying Abbreviation Definitions in Biomedical Text

1
Identifying Abbreviation Definitions in
Biomedical Text
  • Ariel Schwartz, Marti Hearst

2
The Problem
  • The volume of biomedical text is growing at a
    fast rate. New abbreviations are introduced
    frequently.
  • Manual abbreviation dictionaries are out of date.
  • The goal is to have a simple, fast and accurate
    algorithm to identify abbreviations and their
    definitions in biomedical text.
  • We use this algorithm as one of many
    preprocessing steps applied to biomedical texts,
    so that meaningful information can be extracted
    from them.

3
Abbreviation Examples
  • Heat-shock protein 40 (Hsp40) enables Hsp70 to
    play critical roles in a number of cellular
    processes, such as protein folding, assembly,
    degradation and translocation in vivo.
  • Glutathione S-transferase pull-down experiments
    showed the direct interaction of in vitro
    translated p110, p64, and p58 of the essential
    CBF3 kinetochore protein complex with Cbf1p, a
    basic region helix-loop-helix zipper protein
    (bHLHzip) that specifically binds to the CDEI
    region on the centromere DNA.
  • Hpa2 is a member of the Gcn5-related
    N-acetyltransferase (GNAT) superfamily, a family
    of enzymes with diverse substrates including
    histones, other proteins, arylalkylamines and
    aminoglycosides.

4
Related Work
  • Pustejovsky et al. present a solution based on
    hand-built regular expressions and syntactic
    information. Achieved 72% recall at 98%
    precision.
  • Chang et al. use linear regression on a
    pre-selected set of features. Achieved 83% recall
    at 80% precision, and 75% recall at 95%
    precision.
  • Park and Byrd present a rule-based algorithm for
    extraction of abbreviation definitions in general
    text.
  • Yoshida et al. present an approach close to ours,
    trying to first match characters on word and
    syllable boundaries.

Counting partial matches and abbreviations
missing from the gold standard, their algorithm
achieved 83% recall at 98% precision.

5
The Algorithm
  • Much simpler than other approaches.
  • Extracts abbreviation-definition candidates
    adjacent to parentheses.
  • Finds correct definitions by matching characters
    in the abbreviation to characters in the
    definition, starting from the right.
  • The first character in the abbreviation must
    match a character at the beginning of a word in
    the definition.
  • To increase precision, a few simple heuristics
    are applied to eliminate incorrect pairs.
  • Example: Heat shock transcription factor (HSF).
  • The algorithm finds the correct definition, but
    not the correct alignment: Heat shock
    tranScription Factor (the S is matched inside
    "transcription" rather than at the start of
    "shock").
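
The matching steps above can be sketched in Python. This is a minimal illustration written from the bullet points on this slide, not the authors' implementation: the function names (match_definition, show_alignment, find_pairs) are hypothetical, only the "definition (abbreviation)" order is handled, the whole text before the parenthesis is used as the candidate, and the precision heuristics are left out because the slide does not spell them out.

```python
import re
from typing import List, Optional, Tuple


def match_definition(abbrev: str, text: str) -> Optional[Tuple[str, List[int]]]:
    """Match abbrev against the text preceding it, scanning right to left.

    Returns (definition, positions), where positions are the offsets in the
    definition of the matched characters, or None if no match is found.
    """
    # Assumption: punctuation inside the abbreviation is ignored.
    letters = [c.lower() for c in abbrev if c.isalnum()]
    if not letters:
        return None
    positions: List[int] = []
    t = len(text) - 1
    for i in range(len(letters) - 1, -1, -1):
        ch = letters[i]
        # Move left until the character matches; the first abbreviation
        # character must additionally sit at the beginning of a word.
        while t >= 0 and (
            text[t].lower() != ch
            or (i == 0 and t > 0 and text[t - 1].isalnum())
        ):
            t -= 1
        if t < 0:
            return None
        positions.append(t)
        t -= 1
    # The definition starts at the word containing the first matched character.
    start = text.rfind(" ", 0, positions[-1]) + 1
    return text[start:], [p - start for p in reversed(positions)]


def show_alignment(definition: str, positions: List[int]) -> str:
    """Upper-case the matched characters and lower-case everything else."""
    marked = set(positions)
    return "".join(
        c.upper() if i in marked else c.lower() for i, c in enumerate(definition)
    )


def find_pairs(sentence: str) -> List[Tuple[str, str]]:
    """Collect (abbreviation, definition) pairs around parentheses."""
    pairs = []
    for m in re.finditer(r"\(([^()]+)\)", sentence):
        abbrev = m.group(1).strip()
        before = sentence[: m.start()].rstrip()
        found = match_definition(abbrev, before)
        if found:
            pairs.append((abbrev, found[0]))
    return pairs


if __name__ == "__main__":
    definition, positions = match_definition(
        "HSF", "Heat shock transcription factor"
    )
    print(show_alignment(definition, positions))  # Heat shock tranScription Factor
```

Running it on the HSF example reproduces the behaviour described above: the full definition is recovered, but S is aligned with the "s" inside "transcription" because the scan runs from the right and only the first character is tied to a word boundary.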

6
Results
  • On the gold standard the algorithm achieved 83%
    recall at 96% precision.
  • On a larger test collection the results were 90%
    recall at 95% precision.
  • An alternative algorithm, based on a modification
    of the Park and Byrd algorithm using decision
    lists, achieved only slightly better results:
    83% recall at 97% precision, and 90% recall at
    96% precision.
  • These results show that a very simple algorithm
    produces results comparable to those of the
    existing, more complex algorithms.

Counting partial matches and abbreviations
missing from the gold standard, our algorithm
achieved 83% recall at 99% precision.
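
For readers unfamiliar with the "recall at precision" figures, the sketch below shows how the two numbers are usually computed from a set of extracted pairs and a gold-standard set. It assumes exact-match scoring; the function name precision_recall and the toy pairs are made up for illustration and are not data from the evaluation.

```python
from typing import Iterable, Tuple

Pair = Tuple[str, str]  # (abbreviation, definition)


def precision_recall(extracted: Iterable[Pair], gold: Iterable[Pair]) -> Tuple[float, float]:
    """Return (precision, recall) under exact-match scoring."""
    extracted_set, gold_set = set(extracted), set(gold)
    correct = len(extracted_set & gold_set)
    # Precision: share of extracted pairs that are in the gold standard.
    precision = correct / len(extracted_set) if extracted_set else 0.0
    # Recall: share of gold-standard pairs that were extracted.
    recall = correct / len(gold_set) if gold_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data, invented for this example only.
    gold = [
        ("HSF", "Heat shock transcription factor"),
        ("GNAT", "Gcn5-related N-acetyltransferase"),
        ("Hsp40", "Heat-shock protein 40"),
        ("bHLHzip", "basic region helix-loop-helix zipper protein"),
    ]
    extracted = [
        ("HSF", "Heat shock transcription factor"),
        ("GNAT", "Gcn5-related N-acetyltransferase"),
        ("CDEI", "centromere DNA"),
    ]
    print(precision_recall(extracted, gold))
```

Counting partial matches or pairs missing from the gold standard as correct, as in the note above, would raise the `correct` count and therefore the reported figures.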