Dictionary-based Approaches for Biomedical Term Recognition

1
Dictionary-based Approaches for Biomedical Term
Recognition
  • Yoshimasa Tsuruoka
  • CREST, JST
  • (Japan Science and Technology Corporation)

2
Outline
  • Dictionary-based protein name recognition
  • Approximate string searching
  • Filtering by a naïve Bayes classifier
  • Probabilistic Term Variant Generator
  • Generation Algorithm
  • Application: Dictionary expansion

3
Background
  • Information extraction from biomedical documents
  • Recognizing technical terms (e.g. DNA, protein
    names)
  • Ex.) We measured glucocorticoid receptors ( GR )
    in mononuclear leukocytes ( MNL ) isolated

4
Technical Term Recognition
  • Machine learning based
  • Identifying the regions of terms
  • → No ID information
  • Dictionary-based
  • Comparing the strings with each entry in the
    dictionary
  • → ID information

5
Problems of Dictionary-based approaches
  • Spelling variation degrades recall
  • → Approximate string searching
  • False positives degrade precision
  • → Filtering by machine learning

6
Exact String Searching
  • Example
  • Text
  • Phorbol myristate acetate induced Egr-1 mRNA
  • Dictionary
  • EGP
  • EGR-1
  • EGR-1 binding protein
  • → None of them matches

7
Edit Distance
  • Defines the distance between two strings as the
    minimum cost of a sequence of three kinds of
    operations that turns one string into the other.
  • Substitution
  • Insertion
  • Deletion
  • Ex.) board → abord
  • Cost = 2 (delete 'a' and add 'a')
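
The uniform-cost distance on this slide can be checked with a short dynamic-programming sketch. This is the standard textbook (Levenshtein) formulation rather than code from the presentation, and the function name is illustrative only.

```python
def edit_distance(source: str, target: str) -> int:
    """Uniform-cost edit distance: each substitution, insertion,
    or deletion costs 1."""
    # previous[j] = distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2
```

Two equal-length strings that differ in three positions cannot be one operation apart, so the cost of 2 (one deletion plus one insertion) is the minimum here.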

8
Weighted Edit Distance
  • Uniform-cost edit distance is not appropriate
  • EGR-1 → EGR 1 (cost 1)
  • EGR-1 → FGR-1 (cost 1)
  • The distance of the first pair should be smaller
    than that of the second one.
  • → Weighted Edit Distance

9
Calculating Weighted Edit Distance by Dynamic
Programming
  • Dynamic Programming Matrix
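
A minimal sketch of how such a matrix might be filled is given below. The cost values (0.1 for a separator swap or a case change, 0.2 for inserting or deleting a separator, 1.0 otherwise) are invented for illustration and are not the cost table from the presentation; the function names are likewise hypothetical.

```python
SEPARATORS = set(" -/")


def substitution_cost(a: str, b: str) -> float:
    """Hypothetical costs: cheap for separator swaps and case changes."""
    if a == b:
        return 0.0
    if a in SEPARATORS and b in SEPARATORS:
        return 0.1
    if a.lower() == b.lower():
        return 0.1
    return 1.0


def indel_cost(ch: str) -> float:
    """Hypothetical cost of inserting or deleting one character."""
    return 0.2 if ch in SEPARATORS else 1.0


def weighted_edit_distance(source: str, target: str) -> float:
    rows, cols = len(source) + 1, len(target) + 1
    # matrix[i][j] = cheapest way to turn source[:i] into target[:j]
    matrix = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        matrix[i][0] = matrix[i - 1][0] + indel_cost(source[i - 1])
    for j in range(1, cols):
        matrix[0][j] = matrix[0][j - 1] + indel_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j - 1]
                + substitution_cost(source[i - 1], target[j - 1]),
                matrix[i - 1][j] + indel_cost(source[i - 1]),  # deletion
                matrix[i][j - 1] + indel_cost(target[j - 1]),  # insertion
            )
    return matrix[-1][-1]


print(weighted_edit_distance("EGR-1", "EGR 1"))  # 0.1
print(weighted_edit_distance("EGR-1", "FGR-1"))  # 1.0
```

With these assumed costs the hyphen/space pair comes out closer than the E/F pair, which is the ordering that motivates the weighted distance.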

10
Approximate String Searching
  • Setting zeros at the positions of separators in
    the first row

11
Recognizing Terms
  • To take the length of a string into account
  • Normalized Cost
  • → Longer strings are preferred

12
Cost Table
13
Filtering by Machine Learning
  • Classify each recognized term into two classes
    (accepted or rejected).
  • Ex.) the candidate in "induced Egr-1 mRNA" is either
    accepted as a protein name or rejected

14
Features for Classification
  • Features
  • Contextual features
  • Term features
  • Ex.)
  • encoding a putative zinc finger protein was
    found to derepress beta-galactosidase
  • W-1 = a, W+1 = was, Wbegin = putative, Wend =
    protein, Wmiddle = zinc, Wmiddle = finger
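
The feature set in the example can be reproduced with a small sketch. Whitespace tokenization and the function name `extract_features` are assumptions; for a one-word term this sketch emits the same word as both Wbegin and Wend, which the slide does not specify.

```python
def extract_features(tokens, start, end):
    """Build contextual and term features for the candidate tokens[start:end]."""
    features = []
    # Contextual features: the words just before and just after the term.
    if start > 0:
        features.append(("W-1", tokens[start - 1]))
    if end < len(tokens):
        features.append(("W+1", tokens[end]))
    # Term features: first word, last word, and every word in between.
    term = tokens[start:end]
    features.append(("Wbegin", term[0]))
    features.append(("Wend", term[-1]))
    features.extend(("Wmiddle", word) for word in term[1:-1])
    return features


sentence = ("encoding a putative zinc finger protein was "
            "found to derepress beta-galactosidase")
tokens = sentence.split()
for name, value in extract_features(tokens, 2, 6):
    print(name, value)
```

Here the candidate term is "putative zinc finger protein" (tokens 2 to 5), so the printed features match the ones listed on the slide.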

15
Experiment: Protein Name Recognition
  • Corpus: GENIA corpus 3.01
  • For training: 1800 abstracts
  • For testing: 200 abstracts
  • Class: Protein
  • Innermost tags
  • Dictionary: constructed from training data
  • Classifier: Naïve Bayes Model
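
A naïve Bayes filter over word features might look like the sketch below. It is a generic multinomial model with add-one smoothing, not the exact model used in the experiment; the class name and the four training examples are made up for illustration and do not come from the GENIA corpus.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        denominator = (sum(self.feature_counts[label].values())
                       + len(self.vocabulary))
        for feature in features:
            count = self.feature_counts[label][feature]
            score += math.log((count + 1) / denominator)
        return score

    def classify(self, features):
        return max(self.class_counts,
                   key=lambda label: self.log_score(features, label))


# Made-up training examples, for illustration only.
examples = [
    ([("W+1", "mRNA"), ("Wbegin", "Egr-1"), ("Wend", "Egr-1")], "accepted"),
    ([("W-1", "the"), ("Wbegin", "cell"), ("Wend", "cell")], "rejected"),
    ([("W+1", "protein"), ("Wbegin", "NF-kappa"), ("Wend", "B")], "accepted"),
    ([("W-1", "a"), ("Wbegin", "patient"), ("Wend", "patient")], "rejected"),
]
model = NaiveBayesFilter()
model.train(examples)
print(model.classify([("W+1", "mRNA"), ("Wbegin", "Egr-1")]))  # accepted
```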

16
Experimental Results
[Results table not preserved; surviving labels: Exact matching, Filtering, Approximate String Searching]
17
Feature
[Results table not preserved; surviving labels: Filtering, Approximate String Searching (th. 6.0)]
18
Automatic Generation of Spelling Variants
  • Variant Generator
  • Input: NF-Kappa B
  • Output:
  • NF-Kappa B (1.0)
  • NF Kappa B (0.9)
  • NF kappa B (0.6)
  • NF kappaB (0.5)
  • NFkappaB (0.3)
  • Each generated variant has its generation
    probability

19
Generation Algorithm
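  • Recursive generation
  • P(variant) = P(parent) × P_op
  • T cell (1.0) → T-cell (0.5), operation probability 0.5
  • T cell (1.0) → T cells (0.2), operation probability 0.2
  • T-cell (0.5) → T-cells (0.1), operation probability 0.2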
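
The recursion can be sketched as below, assuming each operation is a plain string function paired with a fixed probability. The two operations (replace the first space with a hyphen, append a plural "s") and their probabilities are chosen only to reproduce the T cell example; they are not the learned rules from the presentation, and the cut-off of 0.05 is arbitrary.

```python
def generate_variants(term, operations, threshold):
    """Return {variant: probability}, multiplying probabilities along
    each chain of operations and pruning below the threshold."""
    variants = {}

    def expand(current, probability):
        if probability < threshold:
            return
        # Skip if this variant was already reached with an equal or
        # higher probability.
        if variants.get(current, 0.0) >= probability:
            return
        variants[current] = probability
        for operation, op_probability in operations:
            changed = operation(current)
            if changed != current:
                expand(changed, probability * op_probability)

    expand(term, 1.0)
    return variants


# Hypothetical operations, picked to mirror the slide's example.
operations = [
    (lambda s: s.replace(" ", "-", 1), 0.5),
    (lambda s: s if s.endswith("s") else s + "s", 0.2),
]

result = generate_variants("T cell", operations, threshold=0.05)
for variant, probability in sorted(result.items(), key=lambda item: -item[1]):
    print(f"{variant} ({probability:.1f})")
# T cell (1.0)
# T-cell (0.5)
# T cells (0.2)
# T-cells (0.1)
```

Keeping the highest probability seen for each variant means that T-cells, which is reachable through either T-cell or T cells, is recorded once.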
20
Gathering Examples of Spelling Variation
  • Abbreviation Extraction (Schwartz 2003)
  • Extracts pairs of short forms (abbreviations) and
    long forms

21
Learning Operation Rules
  • Operations for generating variants
  • Substitution
  • Deletion
  • Insertion
  • Context
  • Character-level context: the preceding (following)
    two characters
  • Operation Probability

22
Probabilistic Rules
23
Example (1)
24
Example (2)
25
Example (3)
26
Application: Dictionary Expansion
  • Expanding each entry in the dictionary
  • Threshold of generation probability: 0.1
  • Max number of variants for each entry: 20
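
The expansion step might be sketched as follows, with the two settings from this slide as defaults. The generator is stubbed with the NF-Kappa B probabilities shown earlier in the deck plus one invented low-probability variant ("NFKB", 0.05). The entry ID "P001", the function names, the inclusive threshold, and counting the original spelling toward the limit of 20 are all assumptions.

```python
def expand_dictionary(dictionary, generate, threshold=0.1, max_variants=20):
    """Map every kept spelling variant to the ID of its dictionary entry."""
    expanded = {}
    for entry_id, term in dictionary.items():
        scored = generate(term)  # {variant: generation probability}
        kept = [(variant, p) for variant, p in scored.items() if p >= threshold]
        kept.sort(key=lambda item: item[1], reverse=True)
        for variant, _ in kept[:max_variants]:
            # First entry to claim a variant keeps it.
            expanded.setdefault(variant, entry_id)
    return expanded


def toy_generator(term):
    """Stub standing in for the probabilistic variant generator."""
    table = {
        "NF-Kappa B": {
            "NF-Kappa B": 1.0,
            "NF Kappa B": 0.9,
            "NF kappa B": 0.6,
            "NF kappaB": 0.5,
            "NFkappaB": 0.3,
            "NFKB": 0.05,  # invented value, below the threshold
        }
    }
    return table.get(term, {term: 1.0})


dictionary = {"P001": "NF-Kappa B"}  # hypothetical entry ID
expanded = expand_dictionary(dictionary, toy_generator)
print(sorted(expanded))
```

Because every variant keeps the ID of the entry it came from, a match on a generated spelling still resolves to the original dictionary entry.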

27
Conclusion
  • Dictionary-based Term Recognition
  • For boosting precision: Filtering
  • For boosting recall: Approximate string searching
  • Probabilistic Variant Generator
  • Learning from actual examples
  • Dictionary expansion by the generator improves
    recall without the loss of precision.