Dictionary-based Approaches for Biomedical Term Recognition


1
Dictionary-based Approaches for Biomedical Term Recognition

Yoshimasa Tsuruoka
CREST, JST
(Japan Science and Technology Corporation)

2
Outline

Dictionary-based protein name recognition
Approximate string searching
Filtering by a naïve Bayes classifier
Probabilistic Term Variant Generator
Generation Algorithm
Application: Dictionary expansion

3
Background

Information extraction from biomedical documents
Recognizing technical terms (e.g. DNA, protein names)
We measured glucocorticoid receptors ( GR )
in mononuclear leukocytes ( MNL ) isolated

4
Technical Term Recognition

Machine learning based
Identifying the regions of terms
→ No ID information
Dictionary-based
Comparing the strings with each entry in the dictionary
→ ID information

5
Problems of Dictionary-based approaches

Spelling variation degrades recall
→ Approximate string searching
False positives degrade precision
→ Filtering by machine learning

6
Exact String Searching

Example
Text
Phorbol myristate acetate induced Egr-1 mRNA
Dictionary
EGP
EGR-1
EGR-1 binding protein
→ None of them matches

7
Edit Distance

Defines the distance between two strings as the minimum cost of a sequence of three kinds of operations.
Substitution
Insertion
Deletion
Ex.) board → abord
Cost = 2 (delete 'a' and add 'a')
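A minimal Python sketch of the uniform-cost edit distance described on this slide, computed by dynamic programming. The function name `edit_distance` is chosen for illustration and does not come from the slides.

```python
def edit_distance(source: str, target: str) -> int:
    """Uniform-cost edit distance: each substitution, insertion, or deletion costs 1."""
    # previous[j] holds the distance between the processed prefix of source
    # and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # deleting the first i characters of source
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


print(edit_distance("board", "abord"))  # 2
```

With unit costs, the slide's example pair comes out at distance 2: one deletion and one insertion.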

8
Weighted Edit Distance

Uniform-cost edit distance is not appropriate
EGR-1 → EGR 1 (cost 1)
EGR-1 → FGR-1 (cost 1)
The distance of the first pair should be smaller than that of the second one.
→ Weighted Edit Distance

9
Calculating Weighted Edit Distance by Dynamic
Programming

Dynamic Programming Matrix
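The dynamic programming matrix can be sketched in Python as follows. The actual cost table is not preserved in this transcript, so every weight below (0.5, 2.0, 3.0) and the choice of separator characters are hypothetical values picked only to show the idea that separator and case changes should be cheaper than letter changes.

```python
SEPARATORS = {" ", "-"}  # assumption: characters that are cheap to edit


def substitution_cost(a: str, b: str) -> float:
    """Hypothetical substitution weights; not the cost table from the slides."""
    if a == b:
        return 0.0
    if a in SEPARATORS and b in SEPARATORS:
        return 0.5  # e.g. hyphen <-> space
    if a.lower() == b.lower():
        return 0.5  # case change only
    return 3.0      # any other character change


def indel_cost(c: str) -> float:
    """Hypothetical insertion/deletion weights."""
    return 0.5 if c in SEPARATORS else 2.0


def weighted_edit_distance(source: str, target: str) -> float:
    rows, cols = len(source) + 1, len(target) + 1
    matrix = [[0.0] * cols for _ in range(rows)]
    # First column: delete every character of source.
    for i in range(1, rows):
        matrix[i][0] = matrix[i - 1][0] + indel_cost(source[i - 1])
    # First row: insert every character of target.
    for j in range(1, cols):
        matrix[0][j] = matrix[0][j - 1] + indel_cost(target[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            matrix[i][j] = min(
                matrix[i - 1][j - 1] + substitution_cost(source[i - 1], target[j - 1]),
                matrix[i - 1][j] + indel_cost(source[i - 1]),  # deletion
                matrix[i][j - 1] + indel_cost(target[j - 1]),  # insertion
            )
    return matrix[-1][-1]


print(weighted_edit_distance("EGR-1", "EGR 1"))  # 0.5
print(weighted_edit_distance("EGR-1", "FGR-1"))  # 3.0
```

Under these assumed weights, the hyphen-to-space variant is much closer to the dictionary entry than the variant with a different first letter, which is the behaviour the weighted distance is meant to provide.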

10
Approximate String Searching

Setting zeros at the positions of separators in the first row, so that a match can start at any word boundary in the text
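A hedged Python sketch of this search, with the dictionary term along the rows and the text along the columns. Several details are assumptions not stated on the slide: only the space counts as a separator, first-row cells that are not at a word start are set to infinity, a match must also end at a word boundary, and unit costs are used instead of weighted ones to keep the example short.

```python
import math

SEPARATORS = {" "}  # assumption: only spaces separate words


def approximate_search(term: str, text: str):
    """Return (cost, end) for the cheapest occurrence of term in text.

    The occurrence is text[start:end] for some word-initial start.
    """
    n = len(text)
    # First row: zero wherever a match is allowed to start (word starts).
    previous = [
        0.0 if j == 0 or text[j - 1] in SEPARATORS else math.inf
        for j in range(n + 1)
    ]
    for t_char in term:
        current = [previous[0] + 1]
        for j in range(1, n + 1):
            current.append(min(
                previous[j - 1] + (t_char != text[j - 1]),  # match or substitute
                previous[j] + 1,                            # term character left unmatched
                current[j - 1] + 1,                         # extra text character
            ))
        previous = current
    # Assumption: a match may only end right before a separator or at the end of the text.
    ends = [j for j in range(n + 1) if j == n or text[j] in SEPARATORS]
    best_end = min(ends, key=lambda j: previous[j])
    return previous[best_end], best_end


text = "Phorbol myristate acetate induced Egr-1 mRNA"
cost, end = approximate_search("EGR-1", text)
print(cost, repr(text[end - 5:end]))  # 2.0 'Egr-1'
```

The dictionary entry EGR-1 is found in the text at a cost of two case substitutions, whereas exact matching finds nothing.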

11
Recognizing Terms

To take the length of a string into account
Normalized Cost
→ Longer strings are preferred
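The transcript does not preserve the normalization formula. One plausible form, assumed here purely for illustration, adds a constant to the cost and divides by the term length; the name `alpha` and its value 0.4 are hypothetical.

```python
def normalized_cost(cost: float, term: str, alpha: float = 0.4) -> float:
    """Assumed normalization: (cost + alpha) / length of the term.

    The constant alpha (hypothetical value) makes the result depend on
    length even when the raw cost is zero.
    """
    return (cost + alpha) / len(term)


# Two exact matches (raw cost 0) of different lengths.
short_score = normalized_cost(0.0, "EGR-1")
long_score = normalized_cost(0.0, "EGR-1 binding protein")
print(long_score < short_score)  # True
```

Under this assumption, when both "EGR-1" and "EGR-1 binding protein" match exactly, the longer entry receives the lower normalized cost and is preferred.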

12
Cost Table

13
Filtering by Machine Learning

Classify each recognized term into one of two classes.

Ex.) induced Egr-1 mRNA → accepted (protein name) or rejected

14
Features for Classification

Features
Contextual features
Term features
Ex.)
encoding a putative zinc finger protein was
found to derepress beta-galactosidase
W-1 = a, W+1 = was, Wbegin = putative, Wend = protein, Wmiddle = zinc, Wmiddle = finger
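The feature set in this example can be reproduced with a short Python sketch. The function name `extract_features` and the representation of a candidate as a token span are assumptions for illustration.

```python
def extract_features(tokens, start, end):
    """Features for the candidate term tokens[start:end].

    W-1 and W+1 are contextual features; Wbegin, Wend, and Wmiddle are term features.
    """
    features = []
    if start > 0:
        features.append(("W-1", tokens[start - 1]))
    if end < len(tokens):
        features.append(("W+1", tokens[end]))
    features.append(("Wbegin", tokens[start]))
    features.append(("Wend", tokens[end - 1]))
    for token in tokens[start + 1:end - 1]:
        features.append(("Wmiddle", token))
    return features


sentence = "encoding a putative zinc finger protein was found to derepress beta-galactosidase"
tokens = sentence.split()
# Candidate term: "putative zinc finger protein" (tokens 2 to 5).
for name, value in extract_features(tokens, 2, 6):
    print(name, value)
```

This prints the six features listed on the slide: W-1 a, W+1 was, Wbegin putative, Wend protein, and Wmiddle for zinc and finger.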

15
Experiment: Protein Name Recognition

Corpus: GENIA corpus 3.01
For training: 1800 abstracts
For testing: 200 abstracts
Class: Protein
Innermost tags
Dictionary: constructed from training data
Classifier: Naïve Bayes model
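A minimal sketch of a naïve Bayes filter over (feature name, value) pairs. The slides do not give the model details, so add-one smoothing, the class name `NaiveBayesFilter`, and the four toy training examples are all assumptions made for illustration; they are not data from the experiment.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesFilter:
    """Naive Bayes over discrete features with add-one smoothing (an assumption)."""

    def __init__(self):
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for feature in features:
                self.feature_counts[label][feature] += 1
                self.vocabulary.add(feature)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)  # class prior
        denominator = sum(self.feature_counts[label].values()) + len(self.vocabulary)
        for feature in features:
            score += math.log((self.feature_counts[label][feature] + 1) / denominator)
        return score

    def classify(self, features):
        return max(self.class_counts, key=lambda label: self.log_score(features, label))


# Toy, made-up training examples (not taken from the GENIA corpus).
toy_examples = [
    ([("W+1", "mRNA"), ("Wbegin", "Egr-1")], "accepted"),
    ([("W+1", "protein"), ("Wbegin", "NF-kappaB")], "accepted"),
    ([("W+1", "cells"), ("Wbegin", "T")], "rejected"),
    ([("W+1", "lines"), ("Wbegin", "cell")], "rejected"),
]
model = NaiveBayesFilter()
model.train(toy_examples)
print(model.classify([("W+1", "mRNA"), ("Wbegin", "EGR-1")]))  # accepted
```

In the experiment setup above, such a classifier would be trained on candidates found in the training abstracts and then applied to each candidate produced by the string search.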

16
Experimental Results

Exact matching
Filtering, Approximate String Searching

17
Feature

Filtering, Approximate String Searching (threshold 6.0)

18
Automatic Generation of Spelling Variants

Variant Generator

NF-Kappa B → Generator →
NF-Kappa B (1.0)
NF Kappa B (0.9)
NF kappa B (0.6)
NF kappaB (0.5)
NFkappaB (0.3)

Each generated variant has its generation probability

19
Generation Algorithm

Recursive generation
P = P × P_op

T cell (1.0)
→ T-cell (0.5), P_op = 0.5
→ T cells (0.2), P_op = 0.2
T-cell (0.5)
→ T-cells (0.1), P_op = 0.2
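The recursion can be sketched in Python as follows. The rule function `toy_rules` is a hypothetical stand-in with hand-set probabilities chosen to reproduce the T cell example; the actual generator uses learned operation rules. The pruning threshold is also an assumption here, and it must be positive for the recursion to be guaranteed to stop.

```python
def generate_variants(term, rules, threshold=0.05):
    """Recursively generate variants with probability P = P_parent * P_op.

    rules(term) returns a list of (new_term, operation_probability) pairs.
    Variants whose probability falls below threshold are pruned.
    """
    variants = {term: 1.0}

    def expand(current, probability):
        for new_term, op_probability in rules(current):
            new_probability = probability * op_probability
            if new_probability < threshold:
                continue
            # Keep only the most probable derivation of each variant.
            if new_probability > variants.get(new_term, 0.0):
                variants[new_term] = new_probability
                expand(new_term, new_probability)

    expand(term, 1.0)
    return variants


def toy_rules(term):
    """Hypothetical rules: replace a space with a hyphen (0.5), add plural 's' (0.2)."""
    results = []
    if " " in term:
        results.append((term.replace(" ", "-", 1), 0.5))
    if term.endswith("cell"):
        results.append((term + "s", 0.2))
    return results


for variant, probability in generate_variants("T cell", toy_rules).items():
    print(variant, round(probability, 2))
```

With these toy rules the sketch yields T cell (1.0), T-cell (0.5), T cells (0.2), and T-cells (0.1), matching the tree on the slide.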
20
Gathering Examples of Spelling Variation

Abbreviation Extraction (Schwartz 2003)
Extracts short and long form pairs

21
Learning Operation Rules

Operations for generating variants
Substitution
Deletion
Insertion
Context
Character-level context: the preceding (following) two characters
Operation Probability
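The transcript does not say how the operation probability is estimated. A simple relative-frequency estimate is assumed in the sketch below: the number of times an operation was applied in a context, divided by the number of times that context was observed. The observation format and the toy observations are hypothetical.

```python
from collections import Counter


def estimate_rule_probabilities(observations):
    """Estimate P(operation | context) by relative frequency (an assumption).

    Each observation is (left_context, right_context, operation), where the
    contexts are the two characters before and after the edit position and
    operation is None when the context occurred without any edit.
    """
    context_counts = Counter()
    rule_counts = Counter()
    for left, right, operation in observations:
        context_counts[(left, right)] += 1
        if operation is not None:
            rule_counts[(left, right, operation)] += 1
    return {
        rule: count / context_counts[rule[:2]]
        for rule, count in rule_counts.items()
    }


# Toy, made-up observations around the hyphen in strings like "NF-kappa".
toy_observations = [
    ("NF", "ka", ("substitute", "-", " ")),
    ("NF", "ka", ("delete", "-", "")),
    ("NF", "ka", None),
    ("NF", "ka", None),
]
for rule, probability in estimate_rule_probabilities(toy_observations).items():
    print(rule, probability)
```

Each of the two toy rules is observed once in four occurrences of the context, so each gets probability 0.25.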

22
Probabilistic Rules

23
Example (1)

24
Example (2)

25
Example (3)

26
Application: Dictionary Expansion

Expanding each entry in the dictionary
Threshold of generation probability: 0.1
Max number of variants for each entry: 20
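A sketch of the expansion step with the two limits above. The `generate` argument is a hypothetical callable that returns a mapping from variant to generation probability; whether the limit of 20 counts the original entry is not stated on the slide, and this sketch assumes it does. The toy generator reuses the NF-Kappa B probabilities from the earlier slide plus one made-up low-probability variant.

```python
def expand_dictionary(entries, generate, threshold=0.1, max_variants=20):
    """Map each entry to its most probable variants.

    Variants below the probability threshold are dropped, and at most
    max_variants are kept per entry (the original entry included).
    """
    expanded = {}
    for entry in entries:
        candidates = [
            (variant, probability)
            for variant, probability in generate(entry).items()
            if probability >= threshold
        ]
        candidates.sort(key=lambda item: item[1], reverse=True)
        expanded[entry] = [variant for variant, _ in candidates[:max_variants]]
    return expanded


def toy_generate(entry):
    """Hypothetical generator returning fixed probabilities for one entry."""
    return {
        "NF-Kappa B": 1.0,
        "NF Kappa B": 0.9,
        "NF kappa B": 0.6,
        "NF kappaB": 0.5,
        "NFkappaB": 0.3,
        "NFKappaB": 0.05,  # made-up variant, below the threshold
    }


print(expand_dictionary(["NF-Kappa B"], toy_generate))
```

The made-up variant with probability 0.05 is filtered out, and the remaining five variants are kept because they fit within the limit of 20.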

27
Conclusion

Dictionary-based Term Recognition
For boosting precision: filtering
For boosting recall: approximate string searching
Probabilistic Variant Generator
Learning from actual examples
Dictionary expansion by the generator improves recall without loss of precision.
