Extracting Biological Keywords from Scientific Text

1
Extracting Biological Keywords from Scientific
Text
  • Wen-Juan Hou and Hsin-Hsi Chen (2004). Enhancing
    Performance of Protein and Gene Name Recognizers
    Using Filtering and Integration Strategies.
    Journal of Biomedical Informatics, 37(6),
    December 2004, 448-460.

2
Mining Protein Collocates
  • collocate: keywords/key phrases that co-occur with
    protein names
  • Pipeline
  • 1. Tag the raw material
  • 2. Preprocess: remove stop words, then stem
  • 3. Calculate collocation values
  • 4. Extract significant biological keywords
3
Step 1 Tagging the Corpus
  • to calculate the collocation values of words with
    proteins from a corpus → recognize protein
    names
  • to improve the performance of protein name
    tagging → application of protein collocates
  • Preparing a protein name tagged corpus and
    developing a high performance protein name tagger
    → a chicken-and-egg problem
  • Dictionary-based approach → a partially tagged
    corpus

4
Step 1 Tagging the Corpus (continued)
  • <NAME TYPE="PROTEIN">Chloroperoxidase</NAME>
    (CPO) is a versatile heme-containing enzyme that
    exhibits <NAME TYPE="PROTEIN">peroxidase</NAME>,
    <NAME TYPE="PROTEIN">catalase</NAME> and
    <NAME TYPE="PROTEIN">cytochrome P450</NAME>-like
    activities in addition to catalyzing
    halogenation reactions.
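
A minimal sketch of dictionary-based tagging, assuming a small in-memory lexicon and simple case-insensitive longest-match lookup. The function name `tag_proteins` and the lexicon contents are illustrative, not taken from the original system.

```python
import re

def tag_proteins(text, lexicon):
    """Wrap every lexicon entry found in `text` in NAME tags."""
    # Try longer names first so "cytochrome P450" wins over any shorter entry.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f'<NAME TYPE="PROTEIN">{m.group(1)}</NAME>', text
    )

# Hypothetical lexicon; a real one would come from a protein database.
lexicon = {"chloroperoxidase", "peroxidase", "catalase", "cytochrome P450"}
sentence = (
    "Chloroperoxidase (CPO) exhibits peroxidase, catalase and "
    "cytochrome P450-like activities."
)
tagged = tag_proteins(sentence, lexicon)
print(tagged)
```

Because only lexicon entries are tagged, names missing from the dictionary stay untagged, which is why the resulting corpus is only partially tagged.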

5
Step 2 Preprocessing
  • Step 2.1 Exclusion of Stopwords
  • The stoplist of Fox (1992) was adopted, but words
    that also appear in the protein lexicon were
    removed from it.
  • e.g., capsid of the lumazine → "of" is excluded
    from the stoplist
  • 387 stopwords were used
  • Step 2.2 Stemming
  • Stemming groups words with the same meaning and
    so captures more information around the proteins.
  • "inhibited" and "inhibition" are both mapped to
    the root form "inhibit" after stemming.
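
The two preprocessing steps can be sketched as follows. The stopword set is a tiny made-up sample, and `toy_stem` is a crude suffix stripper standing in for whatever stemmer the authors used (the slides do not name one).

```python
def build_stoplist(base_stopwords, protein_lexicon):
    """Drop from the stoplist any word that occurs inside a protein name."""
    lexicon_tokens = {
        tok.lower() for name in protein_lexicon for tok in name.split()
    }
    return {w for w in base_stopwords if w not in lexicon_tokens}

def toy_stem(word):
    """Strip one common suffix; an assumption, not a real stemming algorithm."""
    for suffix in ("ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(tokens, stoplist):
    """Remove stopwords, then stem the remaining lower-cased tokens."""
    return [toy_stem(t.lower()) for t in tokens if t.lower() not in stoplist]

# "of" and "the" occur in a protein name, so they leave the stoplist.
stoplist = build_stoplist(
    {"of", "the", "is", "by", "and"}, {"capsid of the lumazine"}
)
result = preprocess(
    ["binding", "is", "inhibited", "and", "inhibition", "of", "kinase"],
    stoplist,
)
print(sorted(stoplist))
print(result)
```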

6
Step 3 Computing Collocation Statistics
Frequency
  • define a collocation window to be five words on
    each side of protein names
  • capture collocation bigrams at a distance
  • both "bind" and "signal" are good collocates of
    proteins, but their frequencies are 365 and 9,
    respectively
  • A low frequency threshold would keep both
    words, but many false candidates would pass
    the threshold at the same time.
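
A small sketch of frequency counting inside a five-word window. It assumes the text has already been tokenized and that each tagged protein name has been collapsed into a single placeholder token `PROTEIN`; both are simplifications for illustration.

```python
from collections import Counter

def window_counts(tokens, marker="PROTEIN", window=5):
    """Count words appearing within `window` tokens of any protein marker."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != marker:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if tokens[j] != marker:
                counts[tokens[j]] += 1
    return counts

tokens = ["PROTEIN", "bind", "dna", "a", "b", "c", "d", "signal"]
counts = window_counts(tokens)
print(counts["bind"], counts["signal"])  # "signal" is 7 tokens away, outside the window
```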

7
Step 3 Computing Collocation Statistics Mean
and Variance
  • $\bar{d}_i$ is the average distance for word i in the
    collocation windows
  • $d_{ij}$ is the distance of the j-th occurrence of
    word i away from proteins in the collocation
    windows
  • $n_i$ is the total number of occurrences of
    word i
  • $s_i$ is the standard deviation of $d_{ij}$
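
For reference, the usual sample mean and sample standard deviation of the distances can be written as below. The symbols $\bar{d}_i$ (mean distance), $d_{ij}$ (distance of the j-th occurrence of word i), $n_i$ (number of occurrences) and $s_i$ (standard deviation) are assumed notation, and the $n_i - 1$ denominator is an assumption about the form the authors used.

```latex
\bar{d}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} d_{ij},
\qquad
s_i = \sqrt{\frac{\sum_{j=1}^{n_i} \left( d_{ij} - \bar{d}_i \right)^2}{n_i - 1}}
```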

8
Step 3 Computing Collocation Statistics Mean
and Variance (continued)
  • standard deviation of value zero
  • the collocates and the protein names always occur
    at exactly the same distance equal to the mean
    value
  • low standard deviation
  • two words usually occur at about the same
    distance, i.e., near the mean value
  • high standard deviation
  • the collocates and the protein names occur at
    random
  • "phosphoinositide" has a standard deviation of zero,
    but it occurs only twice in the corpus, so a zero
    deviation alone is weak evidence
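
A sketch of how the mean and standard deviation of distances could be computed. It assumes protein names are collapsed to a `PROTEIN` placeholder, that negative offsets mean "left of the protein", and that the sample standard deviation (n - 1 denominator) is used.

```python
from collections import defaultdict
from statistics import mean, stdev

def signed_offsets(tokens, marker="PROTEIN", window=5):
    """Map each word to its signed distances from nearby protein markers."""
    offsets = defaultdict(list)
    for i, tok in enumerate(tokens):
        if tok != marker:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if tokens[j] != marker:
                offsets[tokens[j]].append(j - i)  # negative = left of protein
    return offsets

def offset_stats(offs):
    """Return (mean, standard deviation); a single occurrence has deviation 0."""
    sd = stdev(offs) if len(offs) > 1 else 0.0
    return mean(offs), sd

tokens = ["interact", "with", "PROTEIN", "and", "PROTEIN"]
offsets = signed_offsets(tokens)
m, sd = offset_stats(offsets["with"])
print(offsets["interact"], m, round(sd, 3))
```

A word seen only once or twice can have a deviation of zero by chance, which is why the deviation needs to be combined with a significance test.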

9
Step 3 Computing Collocation Statistics t-test
Model
  • $P(\text{protein})$ is the probability of protein
  • When α (significance level) is equal to 0.005, the
    critical value of t is 2.576.
  • If the t-value is larger than 2.576, the word is
    regarded as a good collocate of protein with
    99.5% confidence.

N = 4n - 15
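
A sketch of the t-test in the form commonly used for collocations: t = (x̄ - μ) / sqrt(s² / N), where x̄ is the observed co-occurrence probability, μ = P(word) · P(protein) is the expectation under independence, and s² is approximated by x̄. Whether the authors used exactly this form is an assumption, and the counts below are made-up numbers.

```python
import math

T_CRITICAL = 2.576  # critical value quoted on the slide for alpha = 0.005

def t_score(pair_count, word_count, protein_count, total):
    """t statistic for a word co-occurring with proteins (needs pair_count > 0)."""
    x_bar = pair_count / total                           # observed P(word, protein)
    mu = (word_count / total) * (protein_count / total)  # expected if independent
    variance = x_bar                                     # approximation for small x_bar
    return (x_bar - mu) / math.sqrt(variance / total)

strong = t_score(pair_count=30, word_count=100, protein_count=500, total=100_000)
weak = t_score(pair_count=1, word_count=100, protein_count=500, total=100_000)
print(strong > T_CRITICAL, weak > T_CRITICAL)
```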
10
Step 4 Extraction of Collocation Keywords
  • Corpus: 1,514 MEDLINE abstracts from the PASTA
    website at Sheffield University
  • 4,782 different stemmed words appear in the
    collocation windows
  • 541 collocations generated in Step 3
  • the output may contain nouns, prepositions,
    numbers, verbs, etc.
  • assign parts of speech to these words
  • 154 collocation keywords with a verbal part of
    speech remained

11
The 15 Collocates with the Highest t-values in
Step 3
12
The 15 Collocates with the Highest t-values in
Step 4
13
Interpretation of "interact"
  • the verb "interact" is located at an average
    position of -0.388, with a standard
    deviation of 3.256.
  • The minus sign denotes that the word is on the
    left side of the protein.
  • The absolute value of distance minus 1 indicates
    how many words are inserted in between the
    collocate and the protein class.
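
The reading rule above can be expressed as a tiny helper. The function name `describe_offset` is hypothetical, and it assumes integer, non-zero offsets.

```python
def describe_offset(offset):
    """Return (side, words_in_between) for a signed collocate distance."""
    side = "left" if offset < 0 else "right"
    words_between = abs(offset) - 1
    return side, words_between

print(describe_offset(-1))  # directly to the left of the protein
print(describe_offset(3))   # two words between the protein and the collocate
```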

14
"interact" appears at two different locations
  • As <NAME TYPE="PROTEIN">ubiquitin</NAME>-conjugating
    enzymes interact with
    different substrates or other accessory proteins
    in the ubiquitination pathway, these variable
    surface regions may confer distinct specificity
    to individual enzymes.
  • The model remain free to interact with
    <NAME TYPE="PROTEIN">ferredoxin</NAME> and
    <NAME TYPE="PROTEIN">flavodoxin</NAME>, the
    physiological partners of
    <NAME TYPE="PROTEIN">ferredoxin</NAME> NADP(+) reductase.

15
Collocates mined from corpus
  • act (-, -ed, -ing, -ion, -ive, -ivities, -ivity,
    -s), activat (-e, -ed, -es, -ing, -ion, -or),
    adopt (-, -ed, -s), affect (-, -ed, -ing, -s),
    allow (-, -ed, -s), analys (-sed, -ses, -sis,
    -zed, -zing), appear (-, -s), arrange (-d,
    -ment), assembl (-ing, -y), associat (-e, -ed,
    -ion), bas (-e, -ed, -is), belong (-, -ing, -s),
    bind (-, -ing, -s) / bound, bond (-, -ed, -ing,
    -s), bridge (-, -d, -s), calculat (-ed, -ion),
    called, carr (-ied, -ier, -ies), cataly (-sed,
    -ses, -stic, -ze, -zed, -zes, -zing), cause,

16
Evaluation
  • ask an expert to examine the resulting keywords
  • check whether our keyword set contains the significant
    information that most experts discuss in
    their papers
  • apply the keyword set to some application, e.g.,
    protein name recognition, and confirm whether the
    performance of this application improves when
    the keyword set is introduced

18
Extracting Key Phrases
  • apply the proposed method with window size of one
    word on each side of the original single keywords
  • select candidates with standard deviations less
    than 0.4
  • There are 99 compound collocation words with
    99.5% confidence and standard
    deviations less than 0.4
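
The selection rule can be sketched as a two-threshold filter. The phrases and their scores below are invented for illustration; only the thresholds (t above 2.576, standard deviation below 0.4) come from the slides.

```python
def select_compounds(stats, t_min=2.576, sd_max=0.4):
    """Keep phrases whose t-value and standard deviation pass both thresholds."""
    return sorted(
        phrase
        for phrase, (t_value, sd) in stats.items()
        if t_value > t_min and sd < sd_max
    )

# Hypothetical (t-value, standard deviation) pairs.
stats = {
    "bind to": (4.1, 0.2),
    "role in": (3.0, 0.9),     # significant, but position varies too much
    "appear as": (1.2, 0.1),   # stable position, but not significant
}
selected = select_compounds(stats)
print(selected)
```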

19
Examples of Compound Collocates with Prepositions
20
Examples of Compound Collocates with Nouns
21
Enhancing Performance of Protein Name Recognizers
Using Collocation
22
Protein Name Recognizers
  • Challenging issues
  • variant structural characteristics
  • resemblance to regular noun phrases
  • similarity with other kinds of biological
    substances
  • Approaches
  • rule-based
  • Kex (Fukuda, et al., 1998)
  • Yapex (Olsson, et al., 2002)
  • corpus-based
  • HMM model (Collier, et al., 2000)

23
Comparisons Kex vs. Yapex
  • Kex
  • 30 abstracts on SH3 domain and 50 abstracts on
    signal transduction
  • 94.70% precision and 98.84% recall
  • Yapex
  • 48 documents were queried from protein binding
    and interaction
  • 53 documents were randomly chosen from GENIA
    corpus
  • 67.8% precision and 66.4% recall
  • Changing the domain can change performance
    considerably
  • 40.4% precision and 41.1% recall of Kex on the
    test data of Yapex
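
For reference, precision and recall can be computed as below. The sketch assumes exact-match scoring over (start, end) spans, which may differ from the matching criteria used in the original evaluations; the spans are made up.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over sets of (start, end) spans."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

predicted = {(0, 2), (5, 6), (9, 11)}
gold = {(0, 2), (9, 11), (14, 15), (20, 22)}
precision, recall = precision_recall(predicted, gold)
print(round(precision, 3), recall)
```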

24
Filtering Strategies
  • M0 (baseline model)
  • the output generated by Yapex
  • M1
  • For each candidate in M0, check if a collocate is
    found in its collocation window.
  • If yes, tag the candidate as a protein name.
    Otherwise, discard it.
  • M2
  • If a collocate appears in the candidate or in the
    collocation window of the candidate, then tag the
    candidate as a protein name.
  • Otherwise, discard it.
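
A sketch of the M1 and M2 filters over a token list. It assumes candidates are half-open (start, end) token spans, collocates are stemmed tokens, and the window is five tokens on each side; the function names are illustrative.

```python
def window_tokens(tokens, start, end, window=5):
    """Tokens within `window` positions to the left and right of a span."""
    return tokens[max(0, start - window):start] + tokens[end:end + window]

def m1_accept(tokens, span, collocates, window=5):
    """M1: keep the candidate only if a collocate is in its window."""
    start, end = span
    return any(t in collocates for t in window_tokens(tokens, start, end, window))

def m2_accept(tokens, span, collocates, window=5):
    """M2: also keep it if a collocate appears inside the candidate itself."""
    start, end = span
    inside = any(t in collocates for t in tokens[start:end])
    return inside or m1_accept(tokens, span, collocates, window)

collocates = {"bind"}
tokens_a = ["kinase", "bind", "receptor", "alpha", "in", "cell"]
tokens_b = ["dna", "bind", "protein", "is", "here"]

print(m1_accept(tokens_a, (2, 4), collocates))  # "bind" is in the left window
print(m1_accept(tokens_b, (1, 3), collocates))  # no collocate outside the span
print(m2_accept(tokens_b, (1, 3), collocates))  # "bind" is inside the candidate
```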

25
Filtering Strategies (Continued)
  • M3
  • While checking whether a collocate
    co-occurs with a protein candidate, a
    candidate without any collocate is marked
    undecidable instead of receiving a definite no.
  • After all the candidates have been examined, an
    undecidable candidate may be accepted as a
    protein name if any of its other occurrences
    co-occurs with a collocate.
  • M31 and M32

26
Experimental Results
  • Goal: enhance the precision rate without
    significantly reducing the recall rate

27
Integration Strategies
  • how to improve the recall rates
  • different protein name taggers have their own
    specific features
  • a protein name recognizer may tag a protein name
    that another recognizer cannot identify, or
  • both of them may accept certain common proteins
  • The integration strategies are used to select
    correct protein names proposed by multiple
    recognizers

28
Candidates Proposed by Two Systems
29
Integration Strategies
  • When the protein names produced from two
    recognizers are totally separated (i.e., type A),
    retain them as the protein candidates.
  • When the protein names produced from two
    recognizers are exactly the same (i.e., type B),
    retain them as the protein candidates.
  • When the protein names tagged by two taggers have
    partial overlap (i.e., types C, D and E), two
    additional integration strategies are employed
  • Yapex-based strategies
  • Kex-based strategies
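
A sketch of how two systems' candidates could be compared. Spans are assumed to be half-open (start, end) token ranges, and the partial-overlap types C, D and E are lumped into one "partial" label because their exact definitions are not given in the transcript.

```python
def overlap_type(span_a, span_b):
    """Classify two spans as 'separate', 'identical' or 'partial'."""
    if span_a == span_b:
        return "identical"          # type B
    a_start, a_end = span_a
    b_start, b_end = span_b
    if a_end <= b_start or b_end <= a_start:
        return "separate"           # type A
    return "partial"                # types C, D and E

print(overlap_type((0, 2), (5, 7)))
print(overlap_type((0, 2), (0, 2)))
print(overlap_type((0, 3), (2, 5)))
```

Separate and identical candidates are retained directly; only the partial cases need the Yapex-based or Kex-based strategy to pick one boundary.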

30
Experiment Designs
  • YA and KA
  • Use the collocates automatically extracted to
    filter out the candidates
  • YB and KB
  • Use the terms suggested by human experts for the
    filtering strategies
  • YA-C and KA-C
  • If Yapex and Kex recommend the same protein names
    (i.e., type B), regard them as protein names
    without consideration of collocates.
  • Otherwise, use the collocates proposed in this
    study to make filtering.
  • YB-C and KB-C
  • Similar to YA-C and KA-C, except that the collocates
    are replaced by the terms suggested by human experts

31
Results
  • Yapex-based Integration Strategy
  • Kex-based Integration Strategy

32
Discussions
  • The strategy of delaying the decision until clear
    evidence is found is workable
  • The tendency M32 > M31 > M2 > M1 still holds in the
    new experiments
  • The set of collocates proposed by our system is
    more complete than the set of terms suggested by
    human experts
  • The performances of YA, YA-C, KA, and KA-C are
    better than the performances of the corresponding
    models (i.e., YB, YB-C, KB, and KB-C)

33
Discussions
  • The recall rates of both Yapex- and Kex-based
    integration are increased
  • The precision rates decrease by more than the
    recall rates increase
  • Kex did not perform well on this test set
  • Yapex performed better on this test corpus