Title: Dictionary-based Approaches for Biomedical Term Recognition
1Dictionary-based Approaches for Biomedical Term Recognition
- Yoshimasa Tsuruoka
- CREST, JST
- (Japan Science and Technology Corporation)
2Outline
- Dictionary-based protein name recognition
- Approximate string searching
- Filtering by a naïve Bayes classifier
- Probabilistic Term Variant Generator
- Generation Algorithm
- Application: Dictionary expansion
3Background
- Information extraction from biomedical documents
- Recognizing technical terms (e.g. DNA, protein names)
- Ex.) We measured glucocorticoid receptors ( GR ) in mononuclear leukocytes ( MNL ) isolated
4Technical Term Recognition
- Machine learning based
- Identifying the regions of terms
- → No ID information
- Dictionary-based
- Comparing the strings with each entry in the dictionary
- → ID information
5Problems of Dictionary-based approaches
- Spelling variation degrades recall
- → Approximate string searching
- False positives degrade precision
- → Filtering by machine learning
6Exact String Searching
- Example
- Text
- Phorbol myristate acetate induced Egr-1 mRNA
- Dictionary
- EGP
- EGR-1
- EGR-1 binding protein
-
- → None of the entries matches
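The failure above can be reproduced with a minimal sketch of exact dictionary lookup. The function name `exact_matches` and the whitespace tokenization are assumptions made for illustration, not the talk's implementation.

```python
# Exact dictionary lookup over every contiguous token span of the text.
# Tokenization by whitespace is an assumption made for this sketch.
text = "Phorbol myristate acetate induced Egr-1 mRNA"
dictionary = {"EGP", "EGR-1", "EGR-1 binding protein"}

def exact_matches(text, dictionary):
    """Return every contiguous token span that equals a dictionary entry."""
    tokens = text.split()
    hits = []
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens) + 1):
            candidate = " ".join(tokens[i:j])
            if candidate in dictionary:
                hits.append(candidate)
    return hits

print(exact_matches(text, dictionary))  # [] because "Egr-1" differs from "EGR-1" in case
```

Even a one-character spelling difference is enough to lose the match, which is the recall problem that approximate searching is meant to address.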
7Edit Distance
- Defines the distance between two strings as the minimum cost of a sequence of three kinds of operations:
- Substitution
- Insertion
- Deletion
- Ex.) board → abord
- Cost 2 (delete a and insert a)
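As a rough illustration of the definition above, uniform-cost edit distance (each substitution, insertion, or deletion costs 1) can be computed row by row. The function name `edit_distance` is chosen for this sketch only.

```python
def edit_distance(a, b):
    """Uniform-cost edit distance: every operation costs 1."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free if equal
            ))
        prev = curr
    return prev[-1]

print(edit_distance("board", "abord"))  # 2
print(edit_distance("EGR-1", "EGR 1"))  # 1
print(edit_distance("EGR-1", "FGR-1"))  # 1
```

The last two calls return the same cost even though only the first pair looks like a spelling variant of the same name.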
8Weighted Edit Distance
- Uniform-cost edit distance is not appropriate
- EGR-1 → EGR 1: cost 1
- EGR-1 → FGR-1: cost 1
- The distance of the first pair should be smaller than that of the second one.
- → Weighted Edit Distance
9Calculating Weighted Edit Distance by Dynamic Programming
- Dynamic Programming Matrix
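A minimal sketch of filling the dynamic programming matrix with operation-dependent costs follows. The cost values and the helper names `substitution_cost` and `indel_cost` are hypothetical; the talk's actual cost table is not reproduced here.

```python
SEPARATORS = {" ", "-"}

def substitution_cost(x, y):
    """Hypothetical costs: swapping one separator for another is cheap."""
    if x == y:
        return 0.0
    if x in SEPARATORS and y in SEPARATORS:
        return 0.2
    return 1.0

def indel_cost(x):
    """Hypothetical costs: inserting or deleting a separator is cheap."""
    return 0.3 if x in SEPARATORS else 1.0

def weighted_edit_distance(a, b):
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: delete all of a[:i]
        d[i][0] = d[i - 1][0] + indel_cost(a[i - 1])
    for j in range(1, cols):          # first row: insert all of b[:j]
        d[0][j] = d[0][j - 1] + indel_cost(b[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel_cost(a[i - 1]),                    # deletion
                d[i][j - 1] + indel_cost(b[j - 1]),                    # insertion
                d[i - 1][j - 1] + substitution_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return d[-1][-1]

print(weighted_edit_distance("EGR-1", "EGR 1"))
print(weighted_edit_distance("EGR-1", "FGR-1"))
```

With these illustrative weights the hyphen/space variant ends up closer to the original than the variant with a different first letter, which is the ordering the previous slide asks for.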
10Approximate String Searching
- Setting zeros at the positions of separators in the first row, so that a match can start at any word boundary
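The idea can be sketched as follows: the dictionary entry runs along the rows, the text along the columns, and the first row is reset to zero after every separator. Everything beyond that is an assumption of this sketch: the cost values, treating only the space as a separator, requiring a match to end at a word boundary, and the name `approximate_search`.

```python
INDEL_COST = 1.0  # hypothetical insertion/deletion cost

def substitution_cost(x, y):
    """Hypothetical costs: a case difference is cheap."""
    if x == y:
        return 0.0
    if x.lower() == y.lower():
        return 0.2
    return 1.0

def approximate_search(entry, text, threshold):
    """Return (matched substring, cost) pairs whose cost is within threshold."""
    n = len(text)
    # Each cell is (cost, start offset of the match in the text).
    # First row: zero at the text start and right after every space.
    row = [(0.0, 0)]
    for j in range(1, n + 1):
        if text[j - 1] == " ":
            row.append((0.0, j))
        else:
            row.append((row[j - 1][0] + INDEL_COST, row[j - 1][1]))
    for ch in entry:
        new_row = [(row[0][0] + INDEL_COST, 0)]
        for j in range(1, n + 1):
            new_row.append(min(
                (row[j][0] + INDEL_COST, row[j][1]),                  # skip entry char
                (new_row[j - 1][0] + INDEL_COST, new_row[j - 1][1]),  # skip text char
                (row[j - 1][0] + substitution_cost(ch, text[j - 1]), row[j - 1][1]),
            ))
        row = new_row
    hits = []
    for j in range(1, n + 1):
        ends_at_boundary = (j == n) or (text[j] == " ")
        if ends_at_boundary and row[j][0] <= threshold:
            cost, start = row[j]
            hits.append((text[start:j], cost))
    return hits

text = "Phorbol myristate acetate induced Egr-1 mRNA"
print(approximate_search("EGR-1", text, threshold=1.0))
```

In this sketch the entry "EGR-1" is found in the text as "Egr-1", the case that exact matching missed.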
11Recognizing Terms
- To take the length of a string into account
- Normalized Cost
-
- → Longer strings are preferred
12Cost Table
13Filtering by Machine Learning
- Classify each recognized term into two classes: accepted (protein name) or rejected
- Ex.) induced Egr-1 mRNA
14Features for Classification
- Features
- Contextual features
- Term features
- Ex.)
- encoding a putative zinc finger protein was found to derepress beta-galactosidase
- W-1 = a, W+1 = was, Wbegin = putative, Wend = protein, Wmiddle = zinc, Wmiddle = finger
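The feature set in the example above could be extracted along these lines. This is a sketch only: the function name `extract_features`, the whitespace tokenization, and the (name, word) tuple representation are assumptions.

```python
def extract_features(tokens, start, end):
    """Features for the candidate term tokens[start:end]."""
    term = tokens[start:end]
    features = []
    # Contextual features: the words immediately before and after the term.
    if start > 0:
        features.append(("W-1", tokens[start - 1]))
    if end < len(tokens):
        features.append(("W+1", tokens[end]))
    # Term features: first word, last word, and every word in between.
    features.append(("Wbegin", term[0]))
    features.append(("Wend", term[-1]))
    for word in term[1:-1]:
        features.append(("Wmiddle", word))
    return features

sentence = ("encoding a putative zinc finger protein was found "
            "to derepress beta-galactosidase")
tokens = sentence.split()
# The candidate term "putative zinc finger protein" spans tokens 2..5.
for feature in extract_features(tokens, 2, 6):
    print(feature)
```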
15Experiment: Protein Name Recognition
- Corpus: GENIA corpus 3.01
- For training: 1800 abstracts
- For testing: 200 abstracts
- Class: Protein
- Innermost tags
- Dictionary: constructed from training data
- Classifier: Naïve Bayes model
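For readers unfamiliar with the classifier named above, here is a minimal sketch of a naïve Bayes filter with add-one smoothing. The class name, the feature strings, and the four training examples are invented for illustration and are not taken from the GENIA corpus or from the talk's implementation.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesFilter:
    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing constant (assumed)
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocabulary = set()

    def train(self, examples):
        for features, label in examples:
            self.class_counts[label] += 1
            for f in features:
                self.feature_counts[label][f] += 1
                self.vocabulary.add(f)

    def log_score(self, features, label):
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)  # log prior
        denom = (sum(self.feature_counts[label].values())
                 + self.alpha * len(self.vocabulary))
        for f in features:
            score += math.log((self.feature_counts[label][f] + self.alpha) / denom)
        return score

    def classify(self, features):
        return max(self.class_counts, key=lambda c: self.log_score(features, c))

# Toy training data, made up for this sketch.
training = [
    (["W-1=induced", "W+1=mRNA"], "accepted"),
    (["W-1=of", "W+1=protein"], "accepted"),
    (["W-1=the", "W+1=method"], "rejected"),
    (["W-1=a", "W+1=method"], "rejected"),
]
nb = NaiveBayesFilter()
nb.train(training)
print(nb.classify(["W-1=induced", "W+1=protein"]))
print(nb.classify(["W-1=the", "W+1=method"]))
```

Candidates classified as rejected would be dropped from the output of the dictionary search, which is how filtering trades a little recall for precision.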
16Experimental Results
- Table (values not recoverable): exact matching, approximate string searching, filtering
17Feature
- Table (values not recoverable): filtering with approximate string searching (threshold 6.0)
18Automatic Generation of Spelling Variants
- NF-Kappa B (1.0)
- NF Kappa B (0.9)
- NF kappa B (0.6)
- NF kappaB (0.5)
- NFkappaB (0.3)
- Generator input: NF-Kappa B
- Each generated variant has its generation probability
19Generation Algorithm
- Recursive generation
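- P' = P × Pop (P: probability of the current string, Pop: probability of the applied operation)
- T cell (1.0) → T-cell (0.5), operation probability 0.5
- T cell (1.0) → T cells (0.2), operation probability 0.2
- T-cell (0.5) → T-cells (0.1), operation probability 0.2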
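The recursion can be sketched as below. The two rewrite rules in `apply_rules` and the cut-off value are hypothetical stand-ins chosen so that the output reproduces the T cell example; the real generator learns its operations and probabilities from data.

```python
def apply_rules(term):
    """Yield (new_term, operation_probability) pairs. Rules are hypothetical."""
    if " " in term:
        yield term.replace(" ", "-", 1), 0.5   # space -> hyphen
    if not term.endswith("s"):
        yield term + "s", 0.2                  # append plural "s"

def generate(term, prob=1.0, cutoff=0.05, variants=None):
    """Recursively generate variants, multiplying probabilities along the path."""
    if variants is None:
        variants = {}
    # Stop when the probability is too low or a path at least as good is known.
    if prob < cutoff or variants.get(term, 0.0) >= prob:
        return variants
    variants[term] = prob
    for new_term, p_op in apply_rules(term):
        generate(new_term, prob * p_op, cutoff, variants)
    return variants

for variant, p in sorted(generate("T cell").items(), key=lambda kv: -kv[1]):
    print(f"{variant} ({p:.1f})")
```

Keeping only the best probability per string is a design choice of this sketch; it prevents the same variant from being expanded repeatedly when several operation orders lead to it.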
20Gathering Examples of Spelling Variation
- Abbreviation Extraction (Schwartz 2003)
- Extracts short and long form pairs
21Learning Operation Rules
- Operations for generating variants
- Substitution
- Deletion
- Insertion
- Context
- Character-level context: the preceding (following) two characters
- Operation probability
22Probabilistic Rules
23Example (1)
24Example (2)
25Example (3)
26Application: Dictionary Expansion
- Expanding each entry in the dictionary
- Threshold of generation probability: 0.1
- Max number of variants for each entry: 20
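The two settings above could be applied to a scored variant list roughly as follows. The function name `expand_entry`, the entry ID `"ID-1"`, and the low-probability variant `"NF-KB"` are hypothetical; the other variants and scores are the NF-Kappa B example from the generator slide.

```python
def expand_entry(scored_variants, threshold=0.1, max_variants=20):
    """Keep variants at or above the threshold, best first, up to max_variants."""
    kept = [(v, p) for v, p in scored_variants.items() if p >= threshold]
    kept.sort(key=lambda vp: vp[1], reverse=True)
    return [v for v, _ in kept[:max_variants]]

scored = {
    "NF-Kappa B": 1.0,
    "NF Kappa B": 0.9,
    "NF kappa B": 0.6,
    "NF kappaB": 0.5,
    "NFkappaB": 0.3,
    "NF-KB": 0.05,   # hypothetical variant below the threshold
}

# Every surviving variant points back to the same (hypothetical) entry ID.
expanded = {variant: "ID-1" for variant in expand_entry(scored)}
print(expanded)
```

Because each variant keeps the ID of the entry it came from, a match on any variant still yields ID information.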
27Conclusion
- Dictionary-based Term Recognition
- For boosting precision: Filtering
- For boosting recall: Approximate string searching
- Probabilistic Variant Generator
- Learning from actual examples
- Dictionary expansion by the generator improves recall without loss of precision.