Entity Recognition: Current Status and Summer Plan

1
Entity Recognition: Current Status and Summer Plan
  • Jing Jiang
  • May 12, 2006

2
Update since last meeting
  • Met with Nyla (the biologist) to talk about
    training/evaluation data
  • Most annotated genes in the BioCreative data set
    are reasonable
  • To manually annotate a sample set of bee
    literature for evaluation and tuning purposes
  • Tagged some other collections (fly-bcb, songbird,
    Wnt pathway)
  • Identified some common errors and came up with
    some heuristics to fix the errors

3
Current performance
  • On BIOSIS honey bee: waiting to hear from Nyla
    for judgment on the honey bee sample
  • On Wnt pathway full-text articles (a sample of
    100 sentences, judged by Xin)
  • Precision: 92% (207 / 224)
  • Recall: 84% (207 / 245)
  • Examples
  • fly, songbird, Wnt pathway
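
The percentages on this slide follow directly from the counts in parentheses. A minimal sketch of the arithmetic, assuming 207 is the number of correctly tagged entities, 224 the number of entities the tagger produced, and 245 the number of entities in Xin's judgments (the function name is illustrative, not from the slides):

```python
def precision_recall(correct, predicted, gold):
    """Return (precision, recall) as fractions between 0 and 1."""
    precision = correct / predicted  # share of tagged entities that are right
    recall = correct / gold          # share of true entities that were found
    return precision, recall


# Counts reported for the 100-sentence Wnt pathway sample.
precision, recall = precision_recall(correct=207, predicted=224, gold=245)
print(f"precision = {precision:.1%}, recall = {recall:.1%}")
```

Rounded to whole percentages, these values match the 92 and 84 reported above.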

4
Common errors and heuristics
  • Same word/phrase tagged differently within the
    same article
  • Because of the different contexts
  • Heuristic: force the tagging to be consistent
  • Long form and its abbreviation tagged differently
  • E.g. a cDNA encoding Apis mellifera
    ultraspiracle (AMUSP) and ...
  • Heuristic: force the tagging to be consistent
  • Easily detectable false positives
  • E.g. Roughly half of Drosophila genes currently ...
  • Heuristic: compile a list (of species names,
    chemical names, etc.) and some heuristic rules
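
The three heuristics on this slide can be sketched as post-processing over the tagger's output for one article. This is only an illustration under several assumptions that the slides do not specify: mentions arrive as (text, is_gene) pairs, disagreements are resolved by majority vote with ties going to "gene", long forms and abbreviations are linked through a hand-supplied alias map, and the stoplist contents are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical stoplist: the slide only says "species names, chemical names, etc."
NON_GENE_TERMS = {"drosophila", "apis mellifera"}


def enforce_consistency(mentions, aliases=None):
    """Give every occurrence of the same phrase the same tag within an article.

    mentions: list of (text, is_gene) pairs from one article.
    aliases:  optional map from a lowercased abbreviation to its lowercased
              long form, so both vote together (assumed to be given).
    Majority vote decides; ties resolve to "gene" (an arbitrary choice).
    """
    aliases = aliases or {}

    def key(text):
        lowered = text.lower()
        return aliases.get(lowered, lowered)

    votes = defaultdict(Counter)
    for text, is_gene in mentions:
        votes[key(text)][is_gene] += 1
    resolved = {k: c[True] >= c[False] for k, c in votes.items()}
    return [(text, resolved[key(text)]) for text, _ in mentions]


def drop_listed_false_positives(mentions, stoplist=NON_GENE_TERMS):
    """Untag any mention whose text appears in the stoplist."""
    return [(text, is_gene and text.lower() not in stoplist)
            for text, is_gene in mentions]


article = [("ultraspiracle", True), ("AMUSP", False),
           ("ultraspiracle", True), ("Drosophila", True)]
consistent = enforce_consistency(article, aliases={"amusp": "ultraspiracle"})
cleaned = drop_listed_false_positives(consistent)
print(cleaned)
```

Here AMUSP inherits the gene tag from its long form, while Drosophila is untagged because it is on the list.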

5
Common errors and heuristics (cont.)
  • Conjunctive words/phrases tagged differently
  • E.g. three cbl genes (c-cbl , cblb , and cblc)
    which ...
  • Heuristic: use some rules to capture such
    conjunctive words, and tag them consistently
  • Tokenization errors
  • E.g. There is no difference in AmTRP-expressing
    cells among worker, ...
  • Heuristic: compile a list of typical suffixes
    (such as -expressing, -dependent, etc.) that
    should be separated from their prefixes
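
A rough sketch of both heuristics on this slide, using the slide's own examples. The suffix list holds only the two suffixes named above, the tokenizer is a plain whitespace split, and the regular expression handles only a parenthesized, comma-separated list ending in "and"; all function names and these simplifications are assumptions, not the rules the author planned.

```python
import re

# Only the two suffixes named on the slide; a real list would be longer.
SPLIT_SUFFIXES = ("expressing", "dependent")

# Parenthesized list containing at least one comma and the word "and".
COORD_LIST = re.compile(r"\(([^()]+?,[^()]*?\band\b[^()]+?)\)")


def split_suffix(token):
    """Split 'AmTRP-expressing' into ['AmTRP', '-expressing']."""
    for suffix in SPLIT_SUFFIXES:
        tail = "-" + suffix
        if token.lower().endswith(tail) and len(token) > len(tail):
            return [token[:-len(tail)], token[-len(tail):]]
    return [token]


def tokenize(sentence):
    """Whitespace tokenization followed by suffix splitting."""
    tokens = []
    for token in sentence.split():
        tokens.extend(split_suffix(token))
    return tokens


def coordinated_items(sentence):
    """Return the members of the first parenthesized 'a , b , and c' list."""
    match = COORD_LIST.search(sentence)
    if not match:
        return []
    parts = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", match.group(1))
    return [part.strip() for part in parts if part.strip()]


def propagate_tags(items, tagged):
    """If any member of a coordinated list is tagged, tag all of them."""
    tagged = set(tagged)
    if any(item in tagged for item in items):
        return tagged | set(items)
    return tagged


tokens = tokenize("There is no difference in AmTRP-expressing cells")
items = coordinated_items("three cbl genes (c-cbl , cblb , and cblc) which")
genes = propagate_tags(items, tagged={"cblb"})
print(tokens, items, sorted(genes))
```

Separating the suffix lets the tagger see AmTRP as its own token, and propagating tags across the list keeps c-cbl, cblb, and cblc tagged alike.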

6
Common errors and heuristics (cont.)
  • Mistakes caused by citations
  • Only in certain text (the Wnt pathway collection has
    this problem; the BIOSIS collections don't)
  • E.g. Among the downstream targets of PI 3-kinase
    are phospholipase C (6-9) , protein kinase C (10,
    11) , Rac (12-14) , and ...
  • Heuristic: remove these citations(?)
  • Controversial cases: domain, subunit, etc.
  • E.g. Alternating proline / alanine sequence of
    beta B1 subunit originates ...
  • BioCreative data set tags these as part of gene
    names
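
The citation-removal heuristic on this slide could be tried with a single regular expression. The sketch below assumes citations always appear as parenthesized numbers, ranges, or comma-separated lists, as in the example sentence; the pattern and function name are illustrative. It would also strip a legitimate parenthesized number, which may be one reason the slide marks the heuristic with a question mark.

```python
import re

# Matches "(6-9)", "(10, 11)", "(12-14)" plus any whitespace before them.
NUMERIC_CITATION = re.compile(r"\s*\(\d+(?:\s*[-,]\s*\d+)*\)")


def strip_citations(text):
    """Remove parenthesized numeric citations from a sentence."""
    return NUMERIC_CITATION.sub("", text)


sentence = ("Among the downstream targets of PI 3-kinase are "
            "phospholipase C (6-9) , protein kinase C (10, 11) , "
            "Rac (12-14) , and")
stripped = strip_citations(sentence)
print(stripped)
```

Unparenthesized numbers such as the 3 in "PI 3-kinase" are left alone, so the gene names themselves are not altered.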

7
Summer plan
  • Evaluate the performance on honey bee data based
    on Nyla's judgments
  • Implement and tune the heuristics to capture the
    common errors, and evaluate their effectiveness
  • Some heuristics may cause new errors
  • Tune on the annotated sample honey bee data
  • Based on the needs of BeeSpace, find a good
    balance between precision and recall
  • Work with Todd on the input/output format of the
    entity recognizer
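
One common way to put a number on the precision/recall balance mentioned in the plan is the F-beta score, where beta below 1 favors precision and beta above 1 favors recall. The slides do not say which measure BeeSpace would use, so both the measure and the beta values here are assumptions; the counts are the Wnt pathway figures reported earlier.

```python
def f_beta(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Wnt pathway sample: 207 correct out of 224 tagged and 245 judged entities.
p, r = 207 / 224, 207 / 245

precision_weighted = f_beta(p, r, beta=0.5)  # hypothetical: precision matters more
balanced = f_beta(p, r, beta=1.0)            # standard F1
recall_weighted = f_beta(p, r, beta=2.0)     # hypothetical: recall matters more
print(precision_weighted, balanced, recall_weighted)
```

Because precision is higher than recall on this sample, the precision-weighted score comes out highest and the recall-weighted score lowest; tuning the heuristics against whichever beta fits BeeSpace would make the trade-off explicit.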

8
Discussion