Title: Extracting Biological Keywords from Scientific Text
1Extracting Biological Keywords from Scientific
Text
- Wen-Juan Hou and Hsin-Hsi Chen (2004). Enhancing
Performance of Protein and Gene Name Recognizers
Using Filtering and Integration Strategies.
Journal of Biomedical Informatics, 37(6),
December 2004, 448-460.
2Mining Protein Collocates
- collocate
- keywords/key phrases that co-occur with protein
names
Tag the raw material
- Preprocessing
- 1. Remove stop words
- 2. Stem
Calculate collocation value
Extract significant biological keywords
3Step 1 Tagging the Corpus
- to calculate the collocation values of words with proteins from a corpus → recognize protein names
- to improve the performance of protein name tagging → application of protein collocates
- Preparing a protein name tagged corpus and developing a high performance protein name tagger → a chicken-egg problem
- Dictionary-based approach → a partial-tagged corpus
4Step 1 Tagging the Corpus (continued)
- <NAME TYPE="PROTEIN"> Chloroperoxidase </NAME> (CPO) is a versatile heme-containing enzyme that exhibits <NAME TYPE="PROTEIN"> peroxidase </NAME>, <NAME TYPE="PROTEIN"> catalase </NAME> and <NAME TYPE="PROTEIN"> cytochrome P450 </NAME>-like activities in addition to catalyzing halogenation reactions.
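The tagged sentence above can be read back with a small regular expression. This is only a sketch: it assumes the corpus uses tags of the form `<NAME TYPE="PROTEIN"> ... </NAME>`, and the function name `protein_names` is made up for illustration.

```python
import re

# Assumed tag format: <NAME TYPE="PROTEIN"> ... </NAME>
PROTEIN_TAG = re.compile(r'<NAME TYPE="PROTEIN">\s*(.*?)\s*</NAME>')

def protein_names(tagged_text):
    """Return the tagged protein names in order of appearance."""
    return PROTEIN_TAG.findall(tagged_text)

sentence = (
    '<NAME TYPE="PROTEIN"> Chloroperoxidase </NAME> (CPO) is a versatile '
    'heme-containing enzyme that exhibits '
    '<NAME TYPE="PROTEIN"> peroxidase </NAME>, '
    '<NAME TYPE="PROTEIN"> catalase </NAME> and '
    '<NAME TYPE="PROTEIN"> cytochrome P450 </NAME>-like activities'
)
print(protein_names(sentence))
```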
5Step 2 Preprocessing
- Step 2.1 Exclusion of Stopwords
- The stoplist of Fox (1992) was adopted, but words that also appear in the protein lexicon were removed from it.
- Example: in "capsid of the lumazine", "of" is excluded.
- 387 stopwords were used
- Step 2.2 Stemming
- Stemming groups words with the same meaning and so reflects more information around the proteins.
- "inhibited" and "inhibition" are both mapped to the root form "inhibit" after stemming.
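A minimal sketch of the two preprocessing steps. The tiny stoplist, the lexicon word, and the suffix-stripping stemmer are all stand-ins invented for illustration: they are not the 387-word Fox stoplist, the actual protein lexicon, or the stemmer used in the study.

```python
# Illustrative stand-ins, not the real resources
BASE_STOPLIST = {"a", "and", "in", "is", "of", "that", "the"}
PROTEIN_LEXICON_WORDS = {"a"}  # hypothetical: a stopword that also occurs in a protein name

def naive_stem(word):
    """Strip one common suffix; a crude substitute for a real stemmer."""
    for suffix in ("ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    """Drop stopwords (except those in the protein lexicon), then stem."""
    stoplist = BASE_STOPLIST - PROTEIN_LEXICON_WORDS
    return [naive_stem(t.lower()) for t in tokens if t.lower() not in stoplist]

print(preprocess("capsid of the lumazine".split()))
print(naive_stem("inhibited"), naive_stem("inhibition"))
```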
6Step 3 Computing Collocation Statistics
Frequency
- define a collocation window to be five words on each side of protein names
- capture collocation bigrams at a distance
- both "bind" and "signal" are good collocates with proteins, but the frequencies of "bind" and "signal" are 365 and 9, respectively
- A low threshold strategy will keep both of these words, but many false candidates will pass the threshold at the same time.
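The frequency count can be sketched as follows. It assumes each tagged protein name has already been collapsed into a single placeholder token; the placeholder `<PROTEIN>` and the function name are assumptions made for this example.

```python
from collections import Counter

PROTEIN = "<PROTEIN>"  # hypothetical placeholder for a tagged protein name

def window_counts(tokens, window=5):
    """Count words that fall within `window` tokens on either side of a protein."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != PROTEIN:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] != PROTEIN:
                counts[tokens[j]] += 1
    return counts

tokens = ["enzym", "bind", PROTEIN, "signal", "bind",
          "x", "x", "x", "x", "x", "far"]
print(window_counts(tokens).most_common(3))
```

A plain frequency threshold applied to these counts shows the problem stated on the slide: a cut-off low enough to keep a rare but genuine collocate also admits frequent noise words.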
7Step 3 Computing Collocation Statistics Mean
and Variance
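- \bar{d}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{ij} is the average distance for word i in the collocation windows
- d_{ij} is the distance of the j-th occurrence of word i away from proteins in the collocation windows
- n_i is the total occurrences of word i
- s_i = \sqrt{\frac{\sum_{j=1}^{n_i} (d_{ij} - \bar{d}_i)^2}{n_i - 1}} is the standard deviation of the distances of word i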
8Step 3 Computing Collocation Statistics Mean
and Variance (continued)
- standard deviation of value zero
- the collocates and the protein names always occur at exactly the same distance, equal to the mean value
- low standard deviation
- the two words usually occur at about the same distance, i.e., near the mean value
- high standard deviation
- the collocates and the protein names occur at random distances
- "phosphoinositide" has zero standard deviation, but it occurs only twice in the corpus
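The mean and standard deviation of the distances can be computed as in the sketch below. It assumes the sample form of the standard deviation (dividing by n - 1); the distance lists are invented examples.

```python
from statistics import mean, stdev

def distance_stats(distances):
    """Return (mean, sample standard deviation) of signed distances.

    A single occurrence has no spread, so its deviation is reported as 0.0.
    """
    m = mean(distances)
    s = stdev(distances) if len(distances) > 1 else 0.0
    return m, s

# Two occurrences at the same offset: zero deviation, but very little evidence
print(distance_stats([-1, -1]))
# Occurrences scattered on both sides of the protein: high deviation
print(distance_stats([-3, 2, -1, 4]))
```

The first call illustrates the caveat about "phosphoinositide": a zero deviation from only two occurrences says little, which is why the deviation is not used on its own.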
9Step 3 Computing Collocation Statistics t-test
Model
- P(protein) is the probability of protein
- When α (the significance level) is equal to 0.005, the critical value of t is 2.576.
- If the t-value is larger than 2.576, the word is regarded as a good collocate of protein with 99.5% confidence.
N 4n - 15
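The formula on this slide did not survive extraction, so the sketch below assumes the usual collocation t-test: t = (x̄ - μ) / sqrt(s² / N), where x̄ is the observed probability of the word co-occurring with a protein, μ = P(word) · P(protein) is the probability expected under independence, and s² = x̄(1 - x̄). All counts are hypothetical, and how N was derived in the study is not reconstructed here.

```python
import math

T_CRITICAL = 2.576  # threshold quoted on the slide

def t_score(pair_count, word_count, protein_count, n):
    """t statistic for a word co-occurring with protein names (assumed form)."""
    if pair_count <= 0:
        raise ValueError("pair_count must be positive")
    x_bar = pair_count / n                          # observed co-occurrence probability
    mu = (word_count / n) * (protein_count / n)     # expected under independence
    variance = x_bar * (1 - x_bar)                  # Bernoulli variance
    return (x_bar - mu) / math.sqrt(variance / n)

# Hypothetical counts, chosen only to show both outcomes
print(t_score(30, 100, 2000, 100_000) > T_CRITICAL)
print(t_score(1, 100, 2000, 100_000) > T_CRITICAL)
```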
10Step 4 Extraction of Collocation Keywords
- Corpus: 1,514 MEDLINE abstracts from the PASTA website at the University of Sheffield
- 4,782 different stemmed words appear in the collocation windows
- 541 collocations were generated in Step 3
- the output may contain nouns, prepositions, numbers, verbs, etc.
- assign parts of speech to these words
- 154 collocation keywords with a verbal part of speech remained
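The part-of-speech filter can be sketched as below. The slides do not name the tagger, so the tags are supplied by hand here, and the use of Penn Treebank style labels (verbs start with "VB") is an assumption.

```python
def verbal_collocates(collocates, pos_tags):
    """Keep the collocates whose part-of-speech tag marks a verb."""
    return [w for w in collocates if pos_tags.get(w, "").startswith("VB")]

# Hand-written tags for illustration only
step3_output = ["bind", "of", "interact", "domain"]
tags = {"bind": "VB", "of": "IN", "interact": "VB", "domain": "NN"}
print(verbal_collocates(step3_output, tags))
```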
11The 15 Collocates with the Highest t-values in Step 3
12The 15 Collocates with the Highest t-values in Step 4
13Interpretation of interact
- The verb "interact" is located at an average position of -0.388, with a standard deviation of 3.256.
- The minus sign denotes that the word is on the left side of the protein.
- The absolute value of the distance minus 1 indicates how many words are inserted between the collocate and the protein class.
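The sign convention can be made concrete with a short sketch. The placeholder token `<PROTEIN>` and both function names are assumptions for this example.

```python
def signed_distances(tokens, word, protein="<PROTEIN>", window=5):
    """Signed offsets of `word` from each protein; negative means left of it."""
    out = []
    for i, tok in enumerate(tokens):
        if tok != protein:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        out.extend(j - i for j in range(lo, hi) if tokens[j] == word)
    return out

def words_between(distance):
    """Number of words separating the collocate from the protein."""
    return abs(distance) - 1

tokens = ["remain", "free", "to", "interact", "with",
          "<PROTEIN>", "and", "<PROTEIN>"]
print(signed_distances(tokens, "interact"))
```

Here "interact" sits two positions to the left of the first protein, so exactly one word ("with") lies between them.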
14interact appears at two different locations
- As <NAME TYPE="PROTEIN"> ubiquitin </NAME>-conjugating enzymes interact with different substrates or other accessory proteins in the ubiquitination pathway, these variable surface regions may confer distinct specificity to individual enzymes.
- The model ... remain free to interact with <NAME TYPE="PROTEIN"> ferredoxin </NAME> and <NAME TYPE="PROTEIN"> flavodoxin </NAME>, the physiological partners of <NAME TYPE="PROTEIN"> ferredoxin </NAME> NADP(+) reductase.
15Collocates mined from corpus
- act (-, -ed, -ing, -ion, -ive, -ivities, -ivity,
-s), activat (-e, -ed, -es, -ing, -ion, -or) ,
adopt (-, -ed, -s), affect (-, -ed, -ing, -s),
allow (-, -ed, -s), analys (-sed, -ses, -sis,
-zed, -zing), appear (-, -s), arrange (-d,
-ment), assembl (-ing, -y), associat (-e, -ed,
-ion), bas (-e, -ed, -is), belong (-, -ing, -s),
bind (-, -ing, -s) / bound, bond (-, -ed, -ing,
-s), bridge (-, -d, -s), calculat (-ed, -ion),
called, carr (-ied, -ier, -ies), cataly (-sed,
-ses, -stic, -ze, -zed, -zes, -zing), cause,
16Evaluation
- ask an expert to examine the resultant keywords
- check whether our keyword set contains the significant information that most experts discuss in their papers
- apply the keyword set to some application, e.g., protein name recognition, and confirm whether the performance of this application improves when the keyword set is introduced
18Extracting Key Phrases
- apply the proposed method with a window size of one word on each side of the original single keywords
- select candidates with standard deviations less than 0.4
- There are 99 compound collocation words with 99.5% confidence and standard deviations less than 0.4
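The two selection criteria can be combined as in this sketch. The candidate phrases and their statistics are invented for illustration; only the two thresholds come from the slide.

```python
def select_key_phrases(stats, max_std=0.4, min_t=2.576):
    """Keep phrases with a tight position (low deviation) and a significant t-value.

    `stats` maps a candidate phrase to a (standard deviation, t-value) pair.
    """
    return [p for p, (std, t) in stats.items() if std < max_std and t > min_t]

# Hypothetical candidates and values
candidates = {
    "bind to": (0.1, 5.0),
    "interact with": (0.0, 4.2),
    "result in": (0.9, 6.0),   # significant but position too variable
    "act as": (0.2, 1.1),      # stable position but not significant
}
print(select_key_phrases(candidates))
```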
19Examples of Compound Collocates with Prepositions
20Examples of Compound Collocates with Nouns
21Enhancing Performance of Protein Name Recognizers
Using Collocation
22Protein Name Recognizers
- Challenging issues
- variant structural characteristics
- resemblance to regular noun phrases
- similarity with other kinds of biological substances
- Approaches
- rule-based
- Kex (Fukuda, et al., 1998)
- Yapex (Olsson, et al., 2002)
- corpus-based
- HMM model (Collier, et al., 2000)
23Comparisons Kex vs. Yapex
- Kex
- 30 abstracts on SH3 domain and 50 abstracts on signal transduction
- 94.70% precision and 98.84% recall
- Yapex
- 48 documents were queried from protein binding and interaction
- 53 documents were randomly chosen from GENIA corpus
- 67.8% precision and 66.4% recall
- Changing the domain may result in variant performance
- 40.4% precision and 41.1% recall of Kex on the test data of Yapex
24Filtering Strategies
- M0 (baseline model)
- the output generated by Yapex
- M1
- For each candidate in M0, check if a collocate is found in its collocation window.
- If yes, tag the candidate as a protein name. Otherwise, discard it.
- M2
- If a collocate appears in the candidate or in the collocation window of the candidate, then tag the candidate as a protein name.
- Otherwise, discard it.
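The M1 and M2 checks can be sketched as two predicates. The function names, the representation of a candidate as a list of words, and the example words are assumptions.

```python
def m1_accepts(window_words, collocates):
    """M1: accept if any word in the collocation window is a collocate."""
    return any(w in collocates for w in window_words)

def m2_accepts(candidate_words, window_words, collocates):
    """M2: accept if a collocate is in the candidate itself or in its window."""
    in_candidate = any(w in collocates for w in candidate_words)
    return in_candidate or m1_accepts(window_words, collocates)

collocates = {"bind", "interact"}  # illustrative collocate set
print(m1_accepts(["the", "sample"], collocates))
print(m2_accepts(["bind", "protein"], ["the", "sample"], collocates))
```

M2 is the more permissive of the two: every candidate accepted by M1 is also accepted by M2.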
25Filtering Strategies (Continued)
- M3
- While checking whether a collocate co-occurs with a protein candidate, a candidate without any collocate is kept as undecidable instead of a definite no.
- After all the protein names are examined, the undecidable candidates may be considered protein names when one of their co-occurrences contains a collocate.
- M31 and M32
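A sketch of the delayed decision in M3, under one possible reading: an undecided occurrence is accepted if some other occurrence of the same candidate string has a collocate in its window. That reading, the data layout, and the function name are assumptions; M31 and M32 are not defined in the slides and are not modelled here.

```python
def m3_filter(occurrences, collocates):
    """Return one accept/reject decision per occurrence.

    `occurrences` is a list of (candidate_name, window_words) pairs.
    """
    # First pass: names that have a collocate in at least one window
    supported = {
        name
        for name, window in occurrences
        if any(w in collocates for w in window)
    }
    # Second pass: undecided occurrences inherit support from the same name
    return [name in supported for name, _ in occurrences]

occurrences = [
    ("CPO", ["bind", "to"]),        # supported directly
    ("CPO", ["the", "sample"]),     # undecided, rescued by the first occurrence
    ("buffer", ["the", "sample"]),  # never supported
]
print(m3_filter(occurrences, {"bind"}))
```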
26Experimental Results
- to enhance the precision rate without significantly reducing the recall rate
27Integration Strategies
- how to improve the recall rates
- different protein name taggers have their own specific features
- a protein name recognizer may tag a protein name that another recognizer cannot identify, or
- both of them may accept certain common proteins
- The integration strategies are used to select correct protein names proposed by multiple recognizers
28Candidates Proposed by Two Systems
29Integration Strategies
- When the protein names produced from two recognizers are totally separated (i.e., type A), retain them as the protein candidates.
- When the protein names produced from two recognizers are exactly the same (i.e., type B), retain them as the protein candidates.
- When the protein names tagged by two taggers have partial overlap (i.e., types C, D and E), two additional integration strategies are employed
- Yapex-based strategies
- Kex-based strategies
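The three situations can be told apart by comparing the spans proposed by the two recognizers. In this sketch a span is a half-open (start, end) token range, which is an assumption. The figure distinguishing types C, D and E is not available, so all partial overlaps are grouped under one label.

```python
def overlap_type(span_a, span_b):
    """Classify two (start, end) spans, end exclusive."""
    (a0, a1), (b0, b1) = span_a, span_b
    if a1 <= b0 or b1 <= a0:
        return "separate"   # type A
    if (a0, a1) == (b0, b1):
        return "same"       # type B
    return "partial"        # types C, D and E, not distinguished here

print(overlap_type((0, 2), (5, 7)))
print(overlap_type((3, 5), (3, 5)))
print(overlap_type((3, 6), (4, 8)))
```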
30Experiment Designs
- YA and KA
- Use the automatically extracted collocates to filter out the candidates
- YB and KB
- Use the terms suggested by human experts for the filtering strategies
- YA-C and KA-C
- If Yapex and Kex recommend the same protein names (i.e., type B), regard them as protein names without consideration of collocates.
- Otherwise, use the collocates proposed in this study for filtering.
- YB-C and KB-C
- Similar to YA-C and KA-C, except that the collocates are replaced by the terms suggested by human experts
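The decision rule of the "-C" designs can be sketched as a single function. The function and parameter names are assumptions; `keywords` stands for either the automatically extracted collocates (YA-C, KA-C) or the expert-suggested terms (YB-C, KB-C).

```python
def accept_candidate(proposed_by_both, window_words, keywords):
    """Agreement between the recognizers wins; otherwise fall back to filtering."""
    if proposed_by_both:
        # Same name from Yapex and Kex (type B): accept without checking keywords
        return True
    return any(w in keywords for w in window_words)

keywords = {"bind", "interact"}  # illustrative keyword set
print(accept_candidate(True, ["the", "sample"], keywords))
print(accept_candidate(False, ["the", "sample"], keywords))
print(accept_candidate(False, ["interact", "with"], keywords))
```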
31Results
- Yapex-based Integration Strategy
- Kex-based Integration Strategy
32Discussions
- The strategy of delaying the decision until clear evidence is found is workable
- The tendency M32 > M31 > M2 > M1 is still kept in the new experiments
- The set of collocates proposed by our system is more complete than the set of terms suggested by human experts
- The performances of YA, YA-C, KA, and KA-C are better than the performances of the corresponding models (i.e., YB, YB-C, KB, and KB-C)
33Discussions
- The recall rates of both Yapex- and Kex-based integration are increased
- The precision rates decrease by more than the recall rates increase
- Kex did not perform well on this test set
- Yapex performed better on this test corpus