Title: Learning to Extract Proteins and their Interactions from Medline Abstracts
1Learning to Extract Proteins and their
Interactions from Medline Abstracts
Raymond J. Mooney, Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong
Department of Computer Sciences
Edward M. Marcotte, Arun Ramani
Institute for Cellular and Molecular Biology
University of Texas at Austin
2Biological Motivation
- The Human Genome Project has produced huge amounts of genetic data.
- The next step is analyzing and interpreting this data.
4Starting at the tip of chromosome 1...
1 taaccctaac cctaacccta accctaaccc
taaccctaac cctaacccta accctaaccc 61
taaccctaac cctaacccta accctaaccc taaccctaac
cctaacccaa ccctaaccct 121 aaccctaacc
ctaaccctaa ccctaacccc taaccctaac cctaacccta
accctaacct 181 aaccctaacc ctaaccctaa
ccctaaccct aaccctaacc ctaaccctaa cccctaaccc
241 taaccctaaa ccctaaaccc taaccctaac cctaacccta
accctaaccc caaccccaac 301 cccaacccca
accccaaccc caaccctaac ccctaaccct aaccctaacc
ctaccctaac 361 cctaacccta accctaaccc
taaccctaac ccctaacccc taaccctaac cctaacccta
421 accctaaccc taaccctaac ccctaaccct aaccctaacc
ctaaccctcg cggtaccctc 481 agccggcccg
cccgcccggg tctgacctga ggagaactgt gctccgcctt
cagagtacca 541 ccgaaatctg tgcagaggac
aacgcagctc cgccctcgcg gtgctctccg ggtctgtgct
601 gaggagaacg caactccgcc ggcgcaggcg cagagaggcg
cgccgcgccg gcgcaggcgc 661 agacacatgc
tagcgcgtcg gggtggaggc gtggcgcagg cgcagagagg
cgcgccgcgc 721 cggcgcaggc gcagagacac
atgctaccgc gtccaggggt ggaggcgtgg cgcaggcgca
781 gagaggcgca ccgcgccggc gcaggcgcag agacacatgc
tagcgcgtcc aggggtggag 841 gcgtggcgca
ggcgcagaga cgcaagccta cgggcggggg ttgggggggc
gtgtgttgca 901 ggagcaaagt cgcacggcgc
cgggctgggg cggggggagg gtggcgccgt gcacgcgcag
961 aaactcacgt cacggtggcg cggcgcagag acgggtagaa
cctcagtaat ccgaaaagcc 1021 gggatcgacc
gccccttgct tgcagccggg cactacagga cccgcttgct
cacggtgctg 1081 tgccagggcg ccccctgctg
gcgactaggg caactgcagg gctctcttgc ttagagtggt ...
5641 gctccagggc ccgctcacct tgctcctgct
ccttctgctg ctgcttctcc agctttcgct 5701
ccttcatgct gcgcagcttg gccttgccga tgcccccagc
ttggcggatg gactctagca 5761 gagtggccag
ccaccggagg ggtcaaccac ttccctggga gctccctgga
ctggagccgg 5821 gaggtgggga acagggcaag
gaggaaaggc tgctcaggca gggctgggga agcttactgt
5881 gtccaagagc ctgctgggag ggaagtcacc tcccctcaaa
cgaggagccc tgcgctgggg 5941 aggccggacc
tttggagact gtgtgtgggg gcctgggcac tgacttctgc
aaccacctga 6001 gcgcgggcat cctgtgtgca
gatactccct gcttcctctc tagcccccac cctgcagagc
6061 tggacccctg agctagccat gctctgacag tctcagttgc
acacacgagc cagcagaggg 6121 gttttgtgcc
acttctggat gctagggtta cactgggaga cacagcagtg
aagctgaaat 6181 gaaaaatgtg ttgctgtagt
ttgttattag accccttctt tccattggtt taattaggaa
6241 tggggaaccc agagcctcac ttgttcaggc tccctctgcc
ctagaagtga gaagtccaga 6301 gctctacagt
ttgaaaacca ctattttatg aaccaagtag aacaagatat
ttgaaatgga 6361 aactattcaa aaaattgaga
atttctgacc acttaacaaa cccacagaaa atccacccga
6421 gtgcactgag cacgccagaa atcaggtggc ctcaaagagc
tgctcccacc tgaaggagac 6481 gcgctgctgc
tgctgtcgtc ctgcctggcg ccttggccta caggggccgc
ggttgagggt 6541 gggagtgggg gtgcactggc
cagcacctca ggagctgggg gtggtggtgg gggcggtggg
6601 ggtggtgtta gtaccccatc ttgtaggtct gaaacacaaa
gtgtggggtg tctagggaag... and 3 x 10^9 more...
5Proteomics 101
- Genes code for proteins.
- Proteins are the basic components of biological machinery.
- Proteins accomplish their functions by interacting with other proteins.
- Knowledge of protein interactions is fundamental to understanding gene function.
- Chains of interactions compose large, complex gene networks.
6Sample Gene Network
7Yeast Gene Network
Yeast
5,800 genes
5,800 proteins x 2-10 interactions/protein
12,000 - 60,000 interactions
10,000-20,000 known => 1/3 of the way to a complete map!
8Human Gene Network
40,000 genes
>>40,000 proteins x 2-10 interactions/protein
>>80,000 - 400,000 interactions
<5,000 known => approx. 1% of the complete map!
=> We're a long way from the complete map
9Relevant Sources of Data
Biological literature: 14 million documents
DNA sequence data: 10^10 nucleotides
Gene expression data: 10^8 measurements, but...
DNA polymorphisms: 10^7 known
Gene inactivation (knockout) studies: 10^5
Protein structure data: 10^4 structures
Protein interaction data: 10^4 interactions, but...
Protein expression data: 10^4 measurements, but...
Protein location data: 10^4 measurements
10Extraction from Biomedical Literature
- An ever increasing wealth of biological information is present in millions of published articles, but retrieving it in structured form is difficult.
- Much of this literature is available through the NIH-NLM's Medline repository.
- 11 million abstracts in electronic form are available through Medline.
- Excellent source of information on protein interactions.
- Need automated information extraction to easily locate and structure this information.
11TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein
AB - Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene. Other recent studies
have identified human cyclin D1 (PRAD1) as a
putative G1 cyclin and candidate
proto-oncogene. However, the specific enzymatic
activities and, hence, the precise biochemical
mechanisms through which cyclins function to
govern cell cycle progression remain
unresolved. In the present study we have
investigated the coordinate interactions between
these two potentially oncogenic cyclins,
cyclin-dependent protein kinase subunits (cdks)
and the Rb tumor-suppressor protein. The
distribution of cyclin D isoforms was modulated
by serum factors in primary fetal rat lung
epithelial cells. Moreover, cyclin D1 was found
to be phosphorylated on tyrosine residues in vivo
and, like cyclin A, was readily phosphorylated by
pp60c-src in vitro. In synchronized human
osteosarcoma cells, cyclin D1 is induced in early
G1 and becomes associated with p9Ckshs1, a
Cdk-binding subunit. Immunoprecipitation
experiments with human osteosarcoma cells and
Ewing's sarcoma cells demonstrated that cyclin D1
is associated with both p34cdc2 and p33cdk2, and
that cyclin D1 immune complexes exhibit
appreciable histone H1 kinase activity.
Immobilized, recombinant cyclins A and D1 were found to
associate with cellular proteins in complexes
that contain the p105Rb protein. This study
identifies several common aspects of cyclin
biochemistry, including tyrosine phosphorylation
and the potential to interact directly or
indirectly with the Rb protein, that may
ultimately relate membrane-mediated signaling
events to the regulation of gene expression.
14Manually Developed IE Systems for Medline
- A number of projects have focused on the manual development of information extraction (IE) systems for biomedical literature.
- KeX for extracting protein names (Fukuda et al., 1998): extract words with special symbols, excluding those in which more than half of the characters are special symbols, hence eliminating strings such as /-.
- Suiseki for extracting protein interactions (Blaschke et al., 2001): PROT (0-2) PROT (0-2) complex NOUN between (0-3) PROT (0-3) and (0-3) PROT
15Learning Information Extractors
- Manually developing IE systems is tedious and time-consuming, and such systems do not capture all possible formats and contexts for the desired information.
- Machine learning from supervised corpora is becoming the standard approach to building information extractors.
- Recently, several learning approaches have been applied to Medline extraction (Craven & Kumlien, 1999; Tanabe & Wilbur, 2002; Raychaudhuri et al., 2002).
- We have explored the use of a variety of machine learning techniques to develop IE systems for extracting human protein names and interactions, presenting uniform results on a single, reasonably large, human-annotated corpus.
16Non-Learning Protein Extractors
- Dictionary-based extraction
- KEX (Fukuda et al., 1998)
17Learning Methods for Protein Extraction
- Rule-based pattern induction
- Rapier (Califf & Mooney, 1999)
- BWI (Freitag & Kushmerick, 2000)
- Token classification (chunking approach)
- K-nearest neighbor
- Transformation-Based Learning: Abgene (Tanabe & Wilbur, 2002)
- Support Vector Machine
- Maximum entropy
- Hidden Markov Models
- Conditional Random Fields (Lafferty, McCallum, and Pereira, 2001)
- Relational Markov Networks (Taskar, Abbeel, and Koller, 2002)
18Our Biomedical Corpora
- 750 abstracts that contain the word "human" were randomly chosen from Medline for testing protein name extraction. They contain a total of 5,206 protein references.
- 200 abstracts previously known to contain protein interactions were obtained from the Database of Interacting Proteins. They contain 1,101 interactions and 4,141 protein names.
- As negative examples for interaction extraction are rare, an extra set of 30 abstracts containing sentences with non-interacting proteins is included.
- The resulting 230 abstracts are used for testing protein interaction extraction.
19The Yapex Corpus
- 200 abstracts from Medline, manually tagged for protein names.
- 147 randomly chosen such that they contain the MeSH terms protein binding, interaction, molecular.
- 53 randomly chosen from the GENIA corpus
http://www.sics.se/humle/projects/prothalt/
20Evaluation Metrics for Information Extraction
- Precision is the percentage of extracted items that are correct.
- Recall is the percentage of correct items that are extracted.
- Extracted protein names are considered correct if the same character sequences have been human-tagged as protein names in the exact positions.
- Extracted protein interactions from an abstract are considered correct if both proteins have been human-tagged as interacting in that abstract. Positions are not taken into account.
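The two matching criteria above can be sketched in a few lines. This is only an illustration: the character offsets are made up, and representing an interaction as an unordered pair of names is an assumption about how "positions are not taken into account" would be implemented.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) for two collections of hashable items."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Protein names: (start, end, text) spans must match exactly.
# The offsets below are hypothetical.
gold_names = {(0, 8, "cyclin A"), (13, 22, "cyclin D1"), (60, 62, "Rb")}
pred_names = {(0, 8, "cyclin A"), (13, 19, "cyclin"), (60, 62, "Rb")}
name_p, name_r = precision_recall(pred_names, gold_names)

# Interactions: unordered pairs of names per abstract, positions ignored.
gold_pairs = {frozenset(("cyclin D1", "p34cdc2")),
              frozenset(("cyclin D1", "p33cdk2"))}
pred_pairs = {frozenset(("p34cdc2", "cyclin D1"))}
pair_p, pair_r = precision_recall(pred_pairs, gold_pairs)
```

The truncated span "cyclin" counts as an error for names because its end position differs, while the order of the two proteins does not matter for interactions.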
21Dictionary as Source of Domain Knowledge
- Before applying machine learning, abstracts are tagged by matching n-grams against entries from a dictionary. Tagged abstracts are used as input for subsequent methods.
- A dictionary of 42,000 protein names is used (synonyms included).
- Generalization of protein names leads to increased coverage

Original Protein Name    Generalized Name
Interleukin-1 beta       Interleukin num greek
Interferon alpha-D       Interferon greek roman
NF-IL6-beta              NF IL num greek
TR2                      TR num
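A minimal sketch of how such a generalization could be computed. The tokenization rule, the Greek-letter list, and the Roman-numeral test are all assumptions chosen so that the four rows of the table are reproduced; the slides do not give the actual rules.

```python
import re

# Assumed, incomplete list of spelled-out Greek letters.
GREEK = {"alpha", "beta", "gamma", "delta", "epsilon",
         "kappa", "lambda", "sigma", "omega"}

# Well-formed upper-case Roman numerals only, so "IL" is not a numeral.
ROMAN = re.compile(r"M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})")

def generalize(name):
    """Replace digits, Greek letters and Roman numerals by class names."""
    out = []
    # Split on hyphens and spaces, and at letter/digit boundaries.
    for tok in re.findall(r"[A-Za-z]+|\d+", name):
        if tok.isdigit():
            out.append("num")
        elif tok.lower() in GREEK:
            out.append("greek")
        elif ROMAN.fullmatch(tok):
            out.append("roman")
        else:
            out.append(tok)
    return " ".join(out)

print(generalize("Interleukin-1 beta"))
```

Note that under this assumed rule any single letter that is also a Roman numeral (for example the C in a name ending in "kinase C") would be generalized to `roman` as well.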
22Rule-based Learning Algorithms Rapier and BWI
- Rule-based learning algorithms are used for inducing patterns for extracting protein names.
- For Rapier (Califf & Mooney, 1999), each rule consists of a pre-filler pattern, a filler pattern and a post-filler pattern.
human (2) transcriptase (
- For BWI (Freitag & Kushmerick, 2000), rules are composed of contextual patterns called wrappers, recognizing the start or end of a protein name.
human transcriptase (
- High precision (> 70%) but low recall (< 25%).
23Hidden Markov Models
- We use part-of-speech information in HMMs as described in (Ray & Craven, 2001).
- We train a positive model that generates sentences containing proteins, and a null model that generates sentences containing no proteins.
- Select the model which gives the highest likelihood of generating a particular sentence, and tag the sentence using the Viterbi path in that model.
- Moderate precision (60%) and moderate recall (40%).
24Name Extraction by Token Classification (Chunking Approach)
- Since in our data no protein names directly abut each other, we can reduce the extraction problem to classifying individual words as being part of a protein name or not.
- Protein names are extracted by identifying the longest sequences of words classified as being part of a protein name.
Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
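The second step, turning per-token decisions into names, can be sketched as follows. The 0/1 labels stand in for the output of some token classifier and are hypothetical.

```python
def extract_names(tokens, is_protein):
    """Join each maximal run of tokens labeled 1 into one protein name."""
    names, current = [], []
    for tok, flag in zip(tokens, is_protein):
        if flag:
            current.append(tok)
        elif current:
            # A non-protein token closes the current run.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

tokens = "Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share".split()
labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0]  # hypothetical classifier output
names = extract_names(tokens, labels)
print(names)
```

This only works because adjacent names never abut in the data; otherwise two names in a row would be merged into one run.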
25Constructing Feature Vectors for Classification
- For each token, we take the following as features:
- Current token
- Last 2 tokens and next 2 tokens
- Output of dictionary-based tagger for these 5 tokens
- Suffix for each of the 5 tokens (last 1, 2, and 3 characters)
- Class labels for last 2 tokens
Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
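A sketch of a feature extractor following the five feature groups listed on this slide. The feature names, the `<PAD>` marker for positions outside the sentence, and the dictionary-tag values are illustrative assumptions.

```python
def token_features(tokens, dict_tags, prev_labels, i):
    """Build a feature dict for token i from a 5-token window."""
    feats = {}
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < len(tokens)
        tok = tokens[j] if inside else "<PAD>"
        feats[f"tok[{offset}]"] = tok
        # Output of the dictionary-based tagger for this window position.
        feats[f"dict[{offset}]"] = dict_tags[j] if inside else "O"
        # Suffixes of length 1, 2 and 3.
        for k in (1, 2, 3):
            feats[f"suf{k}[{offset}]"] = tok[-k:] if inside else "<PAD>"
    # Class labels already assigned to the two preceding tokens.
    for back in (1, 2):
        j = i - back
        feats[f"label[-{back}]"] = prev_labels[j] if j >= 0 else "<PAD>"
    return feats

tokens = ["the", "Rb", "protein"]
dict_tags = ["O", "PROT", "O"]   # hypothetical dictionary tagger output
prev_labels = ["O"]              # labels decided so far, left to right
feats = token_features(tokens, dict_tags, prev_labels, 1)
```

Because the last group depends on earlier predictions, tokens have to be classified left to right at test time.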
26Maximum-Entropy Token Classifier
- Distinguish among 5 types of tags:
- S(-tart), C(-ontinue), E(-nd), U(-nique), O(-ther)
- Feature templates:
- current, previous, next word, and previous tag
- part-of-speech for current, previous, next word
- word class (full), ex: FGF1 => AAA0
- word class (brief), ex: FGF1 => A0 (Collins, ACL02)
- An extraction's confidence is the minimum of its transition probabilities.
Example (4 tokens)
α_t(y) is the forward probability of getting to state y at time step t
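The two word-class templates and the confidence rule can be sketched as below. The character mapping (upper case to A, lower case to a, digit to 0) is inferred from the FGF1 examples; leaving other characters unchanged is an assumption.

```python
import re

def word_class(word):
    """Full word class: one class character per input character."""
    out = []
    for ch in word:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)  # assumption: punctuation is kept as-is
    return "".join(out)

def brief_class(word):
    """Brief word class: collapse runs of the same class character."""
    return re.sub(r"(.)\1+", r"\1", word_class(word))

def extraction_confidence(transition_probs):
    """Confidence of an extraction: its weakest tag transition."""
    return min(transition_probs)

print(word_class("FGF1"), brief_class("FGF1"))
```

Taking the minimum means a single doubtful transition inside a multi-token name is enough to lower the confidence of the whole extraction.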
27MaxEnt Greedy Extraction
- Use a Viterbi-like algorithm to find the most likely complete sequence of tags.
- Drawback: many low confidence extractions are missed.
- Want to be able to increase recall beyond Viterbi results to control precision-recall trade-off.
- Solution: use a greedy extraction algorithm on all token sequences between any two consecutive Viterbi extractions.
28Experimental Method
- 10-fold cross-validation: Average results over 10 trials with different training and (independent) test data.
- For methods which produce confidence in extractions, vary threshold for extraction in order to explore recall-precision trade-off.
- Use standard methods from information-retrieval to generate a complete precision-recall curve.
- Maximizing F-measure assumes a particular cost-benefit trade-off between incorrect and missed extractions.
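One simple way to trace a precision-recall curve from confidence-scored extractions is to lower the threshold one extraction at a time. This is a sketch only: the scores are made up, ties in confidence are not handled specially, and the balanced F-measure (F1) is used as one particular choice of trade-off.

```python
def pr_curve(scored, n_gold):
    """scored: list of (confidence, is_correct); n_gold: number of gold items.

    Returns one (threshold, precision, recall, f1) point per extraction,
    obtained by accepting every extraction at or above that confidence.
    """
    points, correct = [], 0
    ranked = sorted(scored, key=lambda s: s[0], reverse=True)
    for k, (conf, ok) in enumerate(ranked, start=1):
        correct += ok
        p, r = correct / k, correct / n_gold
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        points.append((conf, p, r, f1))
    return points

# Hypothetical extractions with confidences, evaluated against 4 gold items.
scored = [(0.9, True), (0.8, True), (0.6, False), (0.4, True)]
points = pr_curve(scored, n_gold=4)
best = max(points, key=lambda t: t[3])
```

Picking the threshold with the highest F1, as in the last line, weights missed and incorrect extractions equally; a different application could justify a different operating point on the same curve.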
29Protein Name Extraction Results (Bunescu et al., 2004)
30Graphical Models
- An intuitive representation of conditional independence between domain variables.
- Directed Models => well suited to represent temporal and causal relationships (Bayesian Networks, HMMs)
- Undirected Models => appropriate for representing statistical correlation between variables (Markov Networks)
- Generative Models => define a joint probability over observations and labels (HMMs)
- Discriminative Models => specify a probability over labels given a set of observations (Conditional Random Fields; Lafferty et al., 2001).
- Allow for arbitrary, overlapping features over the observation sequence.
31Discriminative Markov Networks
- G = (V, E): an undirected graph
- V = X ∪ Y: a set of discrete random variables
- X: observed variables
- Y: hidden variables (labels)
- C(G): the cliques of G
- V_c = X_c ∪ Y_c: the set of vertices in a clique c ∈ C(G)
- Φ = {φ_c}: the set of clique potentials
A clique potential φ_c specifies the compatibility of any possible assignment of values over the nodes in the associated clique c.
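The slide lists the ingredients but the transcript does not show how they combine. In the standard formulation of a discriminative Markov network (stated here as the textbook definition, not recovered from the slide), the clique potentials define a conditional distribution over labels, normalized over all label assignments for the given observations:

```latex
P(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}_c),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'} \prod_{c \in C(G)} \phi_c(\mathbf{x}_c, \mathbf{y}'_c)
```

Here x and y are assignments to the observed variables X and the hidden variables Y, and x_c, y_c are their restrictions to the clique c.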
32Conditional Random Fields
Lafferty et al. 2001
- CRFs are a type of discriminative Markov network used for tagging sequences.
- CRFs have shown superior or competitive performance in various tasks such as:
- Shallow Parsing (Sha & Pereira, 2003)
- Entity Recognition (McCallum & Li, 2003)
- Table Extraction (Pinto et al., 2003)
33Conditional Random Fields (CRFs) (Lafferty, McCallum & Pereira, 2001)
- Undirected graphical model for sequence segmentation.
- Log-linear model, different from MaxEnt model because of global normalization
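[Figure: linear-chain CRF with tag nodes Start, T1.tag, T2.tag, T3.tag, ..., Tn.tag, End, word nodes T1.w ... Tn.w, and capitalization nodes T1.cap ... Tn.cap; the clique potentials are labeled φ_tags, φ_tw, and φ_cap.]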
- Tj.tag: the tag (one of S, C, E, U, O) at position j
- Tj.w: true if word w occurs at position j
- Tj.cap: true if word at position j begins with capital letter,
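To make "global normalization" concrete, here is a toy sketch that scores a whole tag sequence with tag-tag, tag-word, and tag-capitalization features and normalizes over every possible tag sequence. The weights are invented, and the brute-force enumeration is for illustration only; it is exponential in sentence length and is not how CRFs are trained or decoded in practice.

```python
import itertools
import math

TAGS = ["S", "C", "E", "U", "O"]

def score(tags, words, weights):
    """Unnormalized log-score: sum of the weights of all features that fire."""
    total, prev = 0.0, "Start"
    for tag, word in zip(tags, words):
        total += weights.get(("tags", prev, tag), 0.0)            # tag-tag
        total += weights.get(("tw", tag, word), 0.0)              # tag-word
        total += weights.get(("cap", tag, word[:1].isupper()), 0.0)
        prev = tag
    return total + weights.get(("tags", prev, "End"), 0.0)

def crf_probability(tags, words, weights):
    """P(tags | words), normalized over all tag sequences of this length."""
    z = sum(math.exp(score(seq, words, weights))
            for seq in itertools.product(TAGS, repeat=len(words)))
    return math.exp(score(tags, words, weights)) / z

words = ["cyclin", "D1", "binds"]
weights = {("tags", "S", "E"): 2.0,      # hypothetical weights
           ("cap", "E", True): 1.0,
           ("tw", "O", "binds"): 1.5}
p = crf_probability(("S", "E", "O"), words, weights)
```

A MaxEnt tagger would instead normalize each tag decision separately given the previous tag; here a single normalizer Z is shared by the whole sequence.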
34Protein Name Extraction Results (Yapex)
35Collective Classification of Web Pages
Taskar, Abbeel & Koller 2002
36Collective Information Extraction
- Task
- Extracting protein/gene names from Medline abstracts.
- Approach
- Collectively classify all candidate phrases from the same abstract.
- Binary classification
- e.label = 0 => e is not a protein name
- e.label = 1 => e is a protein name
- Use two types of label correlations
- Acronyms and their long forms.
- Repetitions of the same phrase.
37Collective Information Extraction
The control of human ribosomal protein L22 (
rpL22 ) to enter into the nucleolus and its
ability to be assembled into the ribosome is
regulated by its sequence . The nuclear import of
rpL22 depends on a classical nuclear
localization signal of four lysines at positions
13 16 Once it reaches the nucleolus , the
question of whether rpL22 is assembled into the
ribosome depends upon the presence of the N -
domain .
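[Figure: candidate entities e1-e5 drawn from the text above (phrases and contexts shown: "ribosomal protein L22", "( rpL22 )", "L22", "of rpL22 depends", "whether rpL22 is"), connected by edges labeled acronym, overlap, and repetition.]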
38Relational Markov Networks
Taskar, Abbeel & Koller 2002
Discriminative Markov Networks, augmented with
clique templates
39Candidate Entities Definition
- Candidate Entities
- The set of candidate entities usually depends on the type of named entity.
- In general, could consider as candidates all phrases of length < L, where L may be task dependent.
- Two examples
- Genes, Proteins: Most entity names are base noun phrases or parts of them. Thus a candidate extraction is any contiguous sequence of tokens whose POS tags are from JJ, VBN, VBG, POS, NN, NNS, NNP, NNPS, CD, ?, and whose head is either a noun or a number.
- People, Organizations, Locations: Most entity names are sequences of proper names potentially interspersed with definite articles and prepositions.
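A sketch of candidate generation for genes and proteins under the rule above. Several details are assumptions: the head of a phrase is taken to be its last token, the one unreadable entry in the tag list is omitted, the POS tags in the example are hand-assigned, and `max_len` is a hypothetical length limit.

```python
ALLOWED = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}
HEAD_OK = {"NN", "NNS", "NNP", "NNPS", "CD"}  # nouns and numbers

def candidates(tagged, max_len=5):
    """tagged: list of (word, pos). Return all candidate entity phrases."""
    spans = []
    for start in range(len(tagged)):
        limit = min(start + max_len, len(tagged))
        for end in range(start + 1, limit + 1):
            span = tagged[start:end]
            if span[-1][1] not in ALLOWED:
                break  # any longer span from this start is also invalid
            if span[-1][1] in HEAD_OK:  # assumption: head = last token
                spans.append(" ".join(word for word, _ in span))
    return spans

tagged = [("the", "DT"), ("Rb", "NNP"), ("protein", "NN"), ("binds", "VBZ")]
cands = candidates(tagged)
print(cands)
```

The candidates overlap heavily by construction ("Rb", "Rb protein", "protein"), which is why a separate mechanism is needed to keep overlapping candidates from all being labeled as names.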
40Candidate Entities Local Features
to the antioxidant superoxide dismutase ? 1 (
SOD1 ) enzyme and
- Entity features, based on features introduced in Collins 02:
- head word, with generic placeholder for numbers => HD 0
- entity text => TXT superoxide dismutase 1
- entity type, e.g. concatenation of its words' types => TYPE a a 0
- bigrams / trigrams at entity left / right boundaries, based on combinations of lexical tokens and word types:
- Bigrams left => BL antioxidant superoxide, BL antioxidant a,
- Bigrams right => BR 0 (,
- Trigrams left => TL the antioxidant superoxide, TL the antioxidant a,
- Trigrams right => TR 0 ( SOD1, TR 0 ( A0,
- suffix / prefix lists of words and word types:
- Prefixes => PF superoxide, PF superoxide dismutase,
- Suffixes => SF 0, SF 0, SF dismutase 0,
41Overlap Template
- Entity names should not overlap => hardwired overlap potential φ_OT.
to the antioxidant superoxide dismutase ? 1 ( SOD1 ) enzyme and
[Figure: two overlapping candidate entities, e1 and e2, marked on this text.]
             e1.label=0   e1.label=1
e2.label=0       1            1
e2.label=1       1            0
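The table can be read as a function of the two labels: the potential is 0 only when both overlapping candidates are labeled as names, so any such joint assignment gets zero probability. A small sketch, assuming token spans are given as (start, end) with the end exclusive:

```python
def overlap_potential(label1, label2):
    """Hardwired potential from the table: 0 iff both labels are 1."""
    return 0 if label1 == 1 and label2 == 1 else 1

def overlaps(span1, span2):
    """True if two (start, end) token spans share at least one token."""
    return span1[0] < span2[1] and span2[0] < span1[1]

def assignment_allowed(spans, labels):
    """False if any overlapping pair of candidates is labeled 1 and 1."""
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            if (overlaps(spans[i], spans[j])
                    and overlap_potential(labels[i], labels[j]) == 0):
                return False
    return True

# Hypothetical candidate spans: the first two overlap, the third is separate.
spans = [(3, 6), (5, 6), (7, 8)]
print(assignment_allowed(spans, [1, 1, 0]), assignment_allowed(spans, [1, 0, 1]))
```

Since the potential multiplies into the joint score, a single 0 rules out the whole assignment regardless of how strongly the local features favor each candidate.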
42Repeat Template
Production of nitric oxide ( NO ) in endothelial
cells is regulated by direct interactions of
endothelial nitric oxide synthase ( eNOS ) Here
we have used the yeast two - hybrid system and
identified a novel 34 kDa protein , termed NOSIP
( eNOS interaction protein ) , which avidly binds
to the carboxyl terminal region of the eNOS
oxygenase domain .
u
v
43Acronym Template
to the antioxidant superoxide dismutase ? 1 ( SOD1 ) enzyme and
[Figure: acronym template with nodes v, v1, v2, v3, ..., vn and an OR node, where v_OR = v1 ∨ v2 ∨ v3.]
44Experimental Results
- Datasets
- Yapex: a dataset of 200 Medline abstracts, manually tagged for protein names.
- Aimed: a dataset of 225 Medline abstracts, of which 200 are known to mention protein interactions.
- CoNLL: the CoNLL 2003 English dataset.
- Compared three approaches
- LTRMN: RMN extraction using local templates and the Overlap Template.
- GLTRMN: RMN extraction using both local and global templates.
- CRF: extraction as token classification using Conditional Random Fields (Lafferty et al., 2001), with features based on current word, previous/next words, words' short/long types and POS tags (Bunescu et al., 2004).
45Experimental Results Yapex
Yapex
46Experimental Results Aimed
Aimed
47Experimental Results CoNLL
CoNLL 2003
48Protein Interaction Extraction
- Most IE methods focus on extracting individual entities.
- Protein interaction extraction requires extracting relations between entities.
- Our current results on relation extraction have focused on rule-based learning approaches.
49Rapier and BWI Revisited: the Inter-filler Approach
- Existing rule-based learning algorithms are used for inducing patterns for identifying protein interactions.
- Rules are learned for extracting inter-fillers.
SHPTPW interacts with another signaling protein, Grb7.
- Inter-fillers are sometimes very long (9 tokens on average, 215 tokens maximum!). For some rule-based learning algorithms (e.g. Rapier), the time complexity can grow exponentially in the length of inter-fillers.
50Rapier and BWI Revisited: the Role-filler Approach
- In the role-filler approach, we extract two interacting proteins into different slots, which we call the interactor and the interactee.
- A sentence is divided into segments. Interactors are associated with interactees in the same segment using simple heuristics.
We show that the S252W mutation allows the
mesenchymal splice form of FGFR2 (FGFR2c) to bind
and to be activated by the mesenchymally
expressed ligands FGF7 or FGF10 and the
epithelial splice form of FGFR2 (FGFR2b) to be
activated by FGF2, FGF6, and FGF9.
- Moderately high precision (> 60%) but low recall (< 40%).
51ELCS (Extraction using Longest Common Subsequences)
- A new method for inducing rules that extract interactions between previously tagged proteins.
- Each rule consists of a sequence of words with allowable word gaps between them (similar to Blaschke & Valencia, 2001, 2002).
Example rule: - (7) interactions (0) between (5) PROT (9) PROT (17) .
- Any pair of proteins in a sentence, if tagged as interacting, forms a positive example; otherwise it forms a negative example.
- Positive examples are repeatedly generalized to form rules until the rules become overly general and start matching negative examples.
52Generalizing Rules using Longest Common Subsequence
The self - association site appears to be formed by interactions between helices 1 and 2 of beta spectrin repeat 17 of one dimer with helix 3 of alpha spectrin repeat 1 of the other dimer to form two combined alpha - beta triple - helical segments .
Title Physical and functional interactions between the transcriptional inhibitors Id3 and ITF-2b .
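The core generalization step, finding the longest common subsequence of two token sequences, can be sketched with the standard dynamic program. This covers only the LCS itself: computing the allowed word gaps between the retained words is not shown, and the two short example sentences (with protein names already replaced by PROT) are invented.

```python
def lcs(a, b):
    """Longest common subsequence of two token lists (standard DP)."""
    m, n = len(a), len(b)
    # table[i][j] = length of the LCS of a[i:] and b[j:]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    # Walk the table from the start to recover one LCS.
    out, i, j = [], 0, 0
    while i < m and j < n:
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return out

s1 = "PROT interacts with PROT".split()
s2 = "PROT directly interacts with the PROT".split()
rule_words = lcs(s1, s2)
print(rule_words)
```

When two sentences share several subsequences of the same maximal length, this routine returns just one of them, so a full rule learner would need a policy for such ties.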
53The ELCS Framework
- A greedy-covering, bottom-up rule induction method is used to cover all the positive examples without covering many negative examples.
- We use an algorithm similar to beam search that considers only the n = 25 best rules for generalization at any time.
- The confidence level of a rule is based on the number of positive and negative examples the rule covers, while allowing some margin for noise (Cestnik, 1990).
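The slide does not give the confidence formula. One estimate commonly attributed to Cestnik (1990) is the m-estimate, sketched here as an assumption about what such a confidence could look like; the `prior` and `m` values are hypothetical parameters.

```python
def m_estimate(pos, neg, prior, m=2.0):
    """m-estimate of rule precision.

    pos, neg: positive and negative examples covered by the rule
    prior:    assumed prior probability that a covered example is positive
    m:        weight of the prior, in units of virtual examples
    """
    return (pos + m * prior) / (pos + neg + m)

# A rule covering many positives is trusted more than one covering a single
# positive, even though neither covers any negatives.
strong = m_estimate(10, 0, prior=0.5)
weak = m_estimate(1, 0, prior=0.5)
print(strong, weak)
```

Raw precision would give both rules a score of 1.0; pulling the estimate toward the prior is what provides the margin for noise.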
54Protein Interaction Extraction Results
55Protein Interaction Extraction Results (full)
56Ongoing and Future Work
- Extracted proteins and their interactions from 753,459 Medline abstracts on human biology. Evaluation of results in progress.
- Improve RMN approach with better local and global templates, better candidate entity generation, and better algorithms for probabilistic inference.
- Extend RMN approach to handle extracting relations between entities.
- Evaluate RMN approach on other biological entities and relations and on other non-biological corpora.
- Reduce human efforts by actively selecting the best training examples for human labeling.
- Combine evidence from text with other biological data sources to derive accurate, comprehensive gene networks.
57Conclusions
- We have compared a wide variety of existing machine-learning methods for extracting human protein names and interactions.
- The CRF approach performs the best of the existing methods.
- We developed a new, more general approach based on RMNs that allows collective extraction that integrates information across all potential extractions.
- For extracting protein interactions, we found that several methods for learning extraction rules outperform hand-written rules with respect to precision and noisy protein tags.
58The End