Extracting Biological Keywords from Scientific Text


1
Extracting Biological Keywords from Scientific
Text

Wen-Juan Hou and Hsin-Hsi Chen (2004). Enhancing
Performance of Protein and Gene Name Recognizers
Using Filtering and Integration Strategies.
Journal of Biomedical Informatics, 37(6),
December 2004, 448-460.

2
Mining Protein Collocates

collocate
keywords/key phrases that co-occur with protein
names

Tag the raw material

Preprocessing
1. Remove stop words
2. Stem

Calculate collocation value
Extract significant biological keywords
3
Step 1 Tagging the Corpus

to calculate the collocation values of words with
proteins from a corpus → recognize protein
names
to improve the performance of protein name
tagging → application of protein collocates
Preparing a protein name tagged corpus and
developing a high performance protein name tagger
→ a chicken-egg problem
Dictionary-based approach → a partially tagged
corpus

4
Step 1 Tagging the Corpus (continued)

<NAME TYPE="PROTEIN">Chloroperoxidase</NAME>
(CPO) is a versatile heme-containing enzyme that
exhibits <NAME TYPE="PROTEIN">peroxidase</NAME>,
<NAME TYPE="PROTEIN">catalase</NAME> and
<NAME TYPE="PROTEIN">cytochrome P450</NAME>-like
activities in addition to catalyzing
halogenation reactions.
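
A minimal sketch of how such a tagged sentence could be read back, assuming the tags are written exactly as <NAME TYPE="PROTEIN">...</NAME>. The helper names tagged_proteins and mask_proteins, and the PROTEIN placeholder token, are illustrative choices and not part of the original study.

```python
import re

# Assumed tag syntax: <NAME TYPE="PROTEIN"> ... </NAME>, optional spaces inside.
NAME_TAG = re.compile(r'<NAME TYPE="PROTEIN">\s*(.*?)\s*</NAME>')


def tagged_proteins(text):
    """Return the protein names marked up in a tagged sentence."""
    return NAME_TAG.findall(text)


def mask_proteins(text, placeholder="PROTEIN"):
    """Replace every tagged name with a single placeholder token."""
    return NAME_TAG.sub(placeholder, text)


sentence = (
    '<NAME TYPE="PROTEIN">Chloroperoxidase</NAME> (CPO) is a versatile '
    'heme-containing enzyme that exhibits '
    '<NAME TYPE="PROTEIN">peroxidase</NAME>, '
    '<NAME TYPE="PROTEIN">catalase</NAME> and '
    '<NAME TYPE="PROTEIN">cytochrome P450</NAME>-like activities'
)
names = tagged_proteins(sentence)
masked = mask_proteins(sentence)
```

Masking every name with one token makes it easy to treat all proteins as a single class when measuring distances to neighbouring words.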

5
Step 2 Preprocessing

Step 2.1 Exclusion of Stopwords
The stoplist of Fox (1992) was adopted, but words
that also appear in the protein lexicon were
removed from it.
"capsid of the lumazine" → "of" is excluded
387 stopwords were used
Step 2.2 Stemming
Stemming groups words with the same meaning and
so reflects more information around the proteins.
"inhibited" and "inhibition" are both mapped to
the root form "inhibit" after stemming.
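
The two preprocessing steps can be sketched as follows. This is a toy illustration only: the stoplist here is a handful of words rather than the 387 from Fox (1992), the lexicon entry is hypothetical, and the suffix stripper is a stand-in for whatever stemmer the authors used (the slides do not name one).

```python
# Toy stoplist; the slides use 387 stopwords derived from Fox (1992).
STOPWORDS = {"the", "of", "a", "is", "by", "and", "was"}
# Hypothetical lexicon entry: stopwords that also occur in the protein
# lexicon are taken out of the stoplist, so they survive filtering.
PROTEIN_LEXICON = {"was"}
# Suffixes for a toy stemmer, checked longest first.
SUFFIXES = ("ion", "ing", "ed", "es", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tokens):
    """Lowercase, drop stopwords not in the protein lexicon, then stem."""
    effective_stoplist = STOPWORDS - PROTEIN_LEXICON
    lowered = [t.lower() for t in tokens]
    return [stem(t) for t in lowered if t not in effective_stoplist]


result = preprocess(["The", "enzyme", "was", "inhibited", "by", "binding"])
```

With this toy stemmer both "inhibited" and "inhibition" reduce to "inhibit", matching the example on the slide.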

6
Step 3 Computing Collocation Statistics
Frequency

define a collocation window of five words on
each side of a protein name
capture collocation bigrams at a distance
both "bind" and "signal" are good collocates of
proteins, but their frequencies are 365 and 9,
respectively
A low frequency threshold will keep both of these
words, but many false candidates will pass
the threshold at the same time.
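
A rough sketch of the frequency count, assuming every protein name has already been replaced by a single PROTEIN token (an illustrative convention, not something the slides specify). The function names are hypothetical.

```python
from collections import Counter

WINDOW = 5  # five words on each side of a protein name, as on the slide


def window_counts(tokens, protein_token="PROTEIN", window=WINDOW):
    """Count how often each word falls inside a window around a protein."""
    counts = Counter()
    for pos, tok in enumerate(tokens):
        if tok != protein_token:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for j in range(lo, hi):
            # Skip the protein itself and any other protein in the window.
            if j != pos and tokens[j] != protein_token:
                counts[tokens[j]] += 1
    return counts


def above_threshold(counts, min_count):
    """Words whose window frequency reaches the threshold."""
    return {word for word, c in counts.items() if c >= min_count}


tokens = ["a", "PROTEIN", "bind", "b", "c", "d", "e", "f", "bind"]
counts = window_counts(tokens)
```

Raising min_count removes noise but also drops rare yet useful words such as "signal", which is the weakness of a pure frequency cut-off that the slide points out.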

7
Step 3 Computing Collocation Statistics Mean
and Variance

$\bar{d}_i$ is the average distance for word i in the
collocation windows
$d_{ij}$ is the distance of the j-th occurrence of
word i away from proteins in the collocation
windows
$n_i$ is the total occurrences of
word i
$s_i$ is the standard deviation of the distances of
word i
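
Written out, these quantities are presumably the usual sample mean and sample standard deviation of the distances. The symbols $\bar{d}_i$, $d_{ij}$, $n_i$ and $s_i$ are chosen here for illustration, and the $n_i - 1$ denominator is an assumption (the common convention for collocation statistics), not something the transcript confirms.

```latex
\bar{d}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{ij},
\qquad
s_i = \sqrt{\frac{\sum_{j=1}^{n_i} \left( d_{ij} - \bar{d}_i \right)^2}{n_i - 1}}
```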

8
Step 3 Computing Collocation Statistics Mean
and Variance (continued)

standard deviation of zero
the collocate and the protein names always occur
at exactly the same distance, equal to the mean
value
low standard deviation
the two words usually occur at about the same
distance, i.e., near the mean value
high standard deviation
the collocate and the protein names occur at
random distances
"phosphoinositide" has a standard deviation of zero,
but it occurs only twice in the corpus
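
The mean and deviation of the distances can be sketched as below, again assuming proteins are masked as a PROTEIN token and using the sample standard deviation from Python's statistics module. Negative offsets mean the word is to the left of the protein. All names are illustrative.

```python
from statistics import mean, stdev


def signed_offsets(tokens, word, protein_token="PROTEIN", window=5):
    """Signed distances of `word` from each protein token in its window."""
    offsets = []
    for pos, tok in enumerate(tokens):
        if tok != protein_token:
            continue
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for j in range(lo, hi):
            if tokens[j] == word:
                offsets.append(j - pos)  # negative: left of the protein
    return offsets


def offset_stats(offsets):
    """Return (mean, sample standard deviation, number of occurrences)."""
    # stdev needs at least two data points; report 0.0 for a single one.
    spread = stdev(offsets) if len(offsets) > 1 else 0.0
    return mean(offsets), spread, len(offsets)


tokens = ["interact", "with", "PROTEIN", "and", "PROTEIN"]
offsets = signed_offsets(tokens, "interact")
stats = offset_stats(offsets)
rare = offset_stats([2, 2])  # zero deviation from only two occurrences
```

Returning the count next to the deviation makes cases like "phosphoinositide" visible: a deviation of zero backed by two occurrences is weak evidence.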

9
Step 3 Computing Collocation Statistics t-test
Model

P(protein) is the probability of protein
When α (significance level) is equal to 0.005, the
critical value of t is 2.576.
If the t-value is larger than 2.576, the word is
regarded as a good collocate of protein with
99.5% confidence.

N = 4n - 15
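
A minimal sketch of a t-test for collocations, assuming the standard textbook form t = (x̄ - μ) / sqrt(s² / N): x̄ is the observed co-occurrence probability, μ is the probability expected if the word and the protein class were independent, and s² is approximated by the Bernoulli variance x̄(1 - x̄). The exact formula on the slide is not preserved in the transcript, so this form and the function name t_score are assumptions. The parameter n_samples plays the role of N.

```python
import math

T_CRITICAL = 2.576  # critical value quoted on the slide


def t_score(pair_count, word_count, protein_count, n_samples):
    """t statistic for a word co-occurring with the protein class."""
    if pair_count == 0:
        # No co-occurrence observed: treated as not significant here.
        return 0.0
    observed = pair_count / n_samples  # sample mean
    # Expected probability under independence of word and protein.
    expected = (word_count / n_samples) * (protein_count / n_samples)
    variance = observed * (1.0 - observed)  # Bernoulli variance
    if variance == 0.0:
        return 0.0
    return (observed - expected) / math.sqrt(variance / n_samples)


def is_good_collocate(pair_count, word_count, protein_count, n_samples):
    """True when the t-value exceeds the critical value."""
    t = t_score(pair_count, word_count, protein_count, n_samples)
    return t > T_CRITICAL


weak = t_score(8, 15828, 4675, 14307668)
strong = t_score(30, 40, 50, 10000)
```

Unlike a raw frequency threshold, the statistic rewards words whose co-occurrence with proteins is high relative to their overall frequency, which is how a rare word can still pass.
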
10
Step 4 Extraction of Collocation Keywords

Corpus: 1,514 MEDLINE abstracts from the PASTA
website at Sheffield University
4,782 different stemmed words appear in the
collocation windows
541 collocations were generated in Step 3
the output may contain nouns, prepositions,
numbers, verbs, etc.
assign parts of speech to these words
154 collocation keywords that are verbs remained

11
The 15 Collocates with the Highest t-values in
Step 3
12
The 15 Collocates with the Highest t-values in
Step 4
13
Interpretation of interact

The verb "interact" is located at an average
position of -0.388 with a standard deviation of
3.256.
The minus sign denotes that the word is on the
left side of the protein.
The absolute value of the distance minus 1 indicates
how many words lie between the collocate and the
protein class.

14
"interact" appears at two different locations

As <NAME TYPE="PROTEIN">ubiquitin</NAME>-conjugating
enzymes interact with different substrates or other
accessory proteins in the ubiquitination pathway,
these variable surface regions may confer distinct
specificity to individual enzymes.

The model ... remain free to interact with
<NAME TYPE="PROTEIN">ferredoxin</NAME> and
<NAME TYPE="PROTEIN">flavodoxin</NAME>, the
physiological partners of
<NAME TYPE="PROTEIN">ferredoxin</NAME> NADP(+) reductase.

15
Collocates mined from corpus

act (-, -ed, -ing, -ion, -ive, -ivities, -ivity,
-s), activat (-e, -ed, -es, -ing, -ion, -or),
adopt (-, -ed, -s), affect (-, -ed, -ing, -s),
allow (-, -ed, -s), analy (-sed, -ses, -sis,
-zed, -zing), appear (-, -s), arrange (-d,
-ment), assembl (-ing, -y), associat (-e, -ed,
-ion), bas (-e, -ed, -is), belong (-, -ing, -s),
bind (-, -ing, -s) / bound, bond (-, -ed, -ing,
-s), bridge (-, -d, -s), calculat (-ed, -ion),
called, carr (-ied, -ier, -ies), cataly (-sed,
-ses, -stic, -ze, -zed, -zes, -zing), cause,

16
Evaluation

ask an expert to examine the resultant keywords
check whether our keyword set contains the
significant information that most experts discuss
in their papers
apply the keyword set to some application, e.g.,
protein name recognition, and confirm whether the
performance of this application improves when
the keyword set is introduced

18
Extracting Key Phrases

apply the proposed method with a window size of one
word on each side of the original single keywords
select candidates with standard deviations less
than 0.4
There are 99 compound collocates with
99.5% confidence and standard deviations less
than 0.4

19
Examples of Compound Collocates with Prepositions
20
Examples of Compound Collocates with Nouns
21
Enhancing Performance of Protein Name Recognizers
Using Collocation
22
Protein Name Recognizers

Challenging issues
variant structural characteristics
resemblance to regular noun phrases
similarity with other kinds of biological
substances
Approaches
rule-based
Kex (Fukuda, et al., 1998)
Yapex (Olsson, et al., 2002)
corpus-based
HMM model (Collier, et al., 2000)

23
Comparisons Kex vs. Yapex

Kex
30 abstracts on SH3 domain and 50 abstracts on
signal transduction
94.70% precision and 98.84% recall
Yapex
48 documents were retrieved by querying for protein
binding and interaction
53 documents were randomly chosen from the GENIA
corpus
67.8% precision and 66.4% recall
Changing the domain may change the performance
considerably
40.4% precision and 41.1% recall of Kex on the
test data of Yapex

24
Filtering Strategies

M0 (baseline model)
the output generated by Yapex
M1
For each candidate in M0, check if a collocate is
found in its collocation window.
If yes, tag the candidate as a protein name.
Otherwise, discard it.
M2
If a collocate appears in the candidate or in the
collocation window of the candidate, then tag the
candidate as a protein name.
Otherwise, discard it.
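
M1 and M2 can be sketched as two predicates over a candidate span. The assumptions here are illustrative: a candidate is given as token indices [start, end), the window is five words on each side, and the collocate set holds already-stemmed, lowercased forms. The stem argument is a placeholder for the stemmer used in preprocessing.

```python
def window_tokens(tokens, start, end, window=5):
    """Tokens within `window` words to the left and right of [start, end)."""
    return tokens[max(0, start - window):start] + tokens[end:end + window]


def keep_m1(tokens, start, end, collocates, stem=str.lower):
    """M1: keep the candidate only if a collocate is in its window."""
    context = window_tokens(tokens, start, end)
    return any(stem(t) in collocates for t in context)


def keep_m2(tokens, start, end, collocates, stem=str.lower):
    """M2: also accept a collocate that occurs inside the candidate."""
    inside = any(stem(t) in collocates for t in tokens[start:end])
    return inside or keep_m1(tokens, start, end, collocates, stem)


sent1 = ["It", "can", "bind", "ferredoxin", "tightly"]
sent2 = ["the", "binding", "protein", "is", "large"]
m1_first = keep_m1(sent1, 3, 4, {"bind"})
m1_second = keep_m1(sent2, 1, 3, {"binding"})
m2_second = keep_m2(sent2, 1, 3, {"binding"})
```

M2 is strictly more permissive than M1: every candidate M1 keeps is also kept by M2.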

25
Filtering Strategies (Continued)

M3
While checking whether a collocate co-occurs
with a protein candidate, a candidate without
any collocate is kept as undecidable instead of
being rejected outright.
After all the protein names are examined, the
undecidable candidates may be considered as
protein names when one of their co-occurrences
contains any collocate.
M31 and M32
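
One possible reading of M3, sketched under a stated assumption: an undecided occurrence is accepted if some other occurrence of the same name string was found with a collocate. The slides do not spell this out, so this is an interpretation; the function name and input layout are hypothetical.

```python
def filter_m3(candidates, has_collocate):
    """Delayed decision over all occurrences of candidate names.

    candidates: one name string per occurrence in the text.
    has_collocate: parallel list of booleans from the first pass.
    Returns one accept/reject decision per occurrence.
    """
    # Names confirmed by at least one occurrence with a collocate.
    confirmed = {name for name, ok in zip(candidates, has_collocate) if ok}
    # Undecided occurrences inherit the decision of a confirmed name.
    return [ok or name in confirmed
            for name, ok in zip(candidates, has_collocate)]


occurrences = ["ferredoxin", "CPO", "ferredoxin", "xyz"]
first_pass = [True, False, False, False]
decisions = filter_m3(occurrences, first_pass)
```

The first-pass check that fills has_collocate can be either the window-only test or the test that also looks inside the candidate.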

26
Experimental Results

Goal: to enhance the precision rate without
significantly reducing the recall rate

27
Integration Strategies

how to improve the recall rates
different protein name taggers have their own
specific features
a protein name recognizer may tag a protein name
that another recognizer cannot identify, or
both of them may accept certain common proteins
The integration strategies are used to select
correct protein names proposed by multiple
recognizers

28
Candidates Proposed by Two Systems
29
Integration Strategies

When the protein names produced from two
recognizers are totally separated (i.e., type A),
retain them as the protein candidates.
When the protein names produced from two
recognizers are exactly the same (i.e., type B),
retain them as the protein candidates.
When the protein names tagged by two taggers have
partial overlap (i.e., types C, D and E), two
additional integration strategies are employed
Yapex-based strategies
Kex-based strategies
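
A simplified sketch of merging the candidates of two recognizers. Spans are (start, end) token offsets with an exclusive end. The rule used for partial overlap, keeping the boundaries proposed by the base system, is an assumption made for illustration; the slides do not say how the Yapex-based and Kex-based strategies resolve types C, D and E.

```python
def relation(a, b):
    """Classify two (start, end) spans as separate, same or partial."""
    if a == b:
        return "same"
    if a[1] <= b[0] or b[1] <= a[0]:
        return "separate"
    return "partial"


def integrate(base, other):
    """Merge two span lists, preferring `base` wherever spans overlap."""
    result = list(base)
    for span in other:
        # Add a span from the other system only if it touches no base span.
        if all(relation(span, b) == "separate" for b in base):
            result.append(span)
    return sorted(set(result))


yapex = [(0, 2), (5, 7)]
kex = [(0, 2), (6, 8), (10, 11)]
yapex_based = integrate(yapex, kex)
kex_based = integrate(kex, yapex)
```

Swapping the argument order switches between a Yapex-based and a Kex-based merge under this assumption.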

30
Experiment Designs

YA and KA
Use the automatically extracted collocates to
filter out the candidates
YB and KB
Use the terms suggested by human experts for the
filtering strategies
YA-C and KA-C
If Yapex and Kex recommend the same protein names
(i.e., type B), regard them as protein names
without consideration of collocates.
Otherwise, use the collocates proposed in this
study for filtering.
YB-C and KB-C
Similar to YA-C and KA-C except that the collocates
are replaced by the terms suggested by human experts

31
Results

Yapex-based Integration Strategy

Kex-based Integration Strategy

32
Discussions

The strategy of delaying the decision until clear
evidence is found is workable
The tendency M32 > M31 > M2 > M1 still holds in the
new experiments
The set of collocates proposed by our system is
more complete than the set of terms suggested by
human experts
The performances of YA, YA-C, KA, and KA-C are
better than those of the corresponding
models (i.e., YB, YB-C, KB, and KB-C)

33
Discussions