Learning to Extract Proteins and their Interactions from Medline Abstracts - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Learning to Extract Proteins and their Interactions from Medline Abstracts


1
Learning to Extract Proteins and their
Interactions from Medline Abstracts
Raymond J. Mooney Department of Computer Sciences
Razvan Bunescu, Ruifang Ge, Rohit J. Kate, Yuk Wah Wong
Edward M. Marcotte, Arun Ramani
Department of Computer Sciences
Institute for Cellular and Molecular Biology
University of Texas at Austin
2
Biological Motivation
  • The Human Genome Project has produced huge
    amounts of genetic data.
  • The next step is analyzing and interpreting this
    data.

3
(No Transcript)
4
Starting at the tip of chromosome 1...
1     taaccctaac cctaacccta accctaaccc taaccctaac cctaacccta accctaaccc
61    taaccctaac cctaacccta accctaaccc taaccctaac cctaacccaa ccctaaccct
121   aaccctaacc ctaaccctaa ccctaacccc taaccctaac cctaacccta accctaacct
...
5641  gctccagggc ccgctcacct tgctcctgct ccttctgctg ctgcttctcc agctttcgct
...
6601  ggtggtgtta gtaccccatc ttgtaggtct gaaacacaaa gtgtggggtg tctagggaag
... and 3x10^9 more nucleotides ...
5
Proteomics 101
  • Genes code for proteins.
  • Proteins are the basic components of biological
    machinery.
  • Proteins accomplish their functions by
    interacting with other proteins.
  • Knowledge of protein interactions is fundamental
    to understanding gene function.
  • Chains of interactions compose large, complex
    gene networks.

6
Sample Gene Network
7
Yeast Gene Network
Yeast
5,800 genes
5,800 proteins x 2-10 interactions/protein
→ 12,000 - 60,000 interactions
10,000 - 20,000 known → more than 1/3 of the way to a
complete map!
8
Human Gene Network
40,000 genes
>> 40,000 proteins x 2-10 interactions/protein
>> 80,000 - 400,000 interactions
< 5,000 known → approx. 1% of the complete map!
→ We're a long way from the complete map
9
Relevant Sources of Data
Biological literature: 14 million documents
DNA sequence data: 10^10 nucleotides
Gene expression data: 10^8 measurements, but...
DNA polymorphisms: 10^7 known
Gene inactivation (knockout) studies: 10^5
Protein structure data: 10^4 structures
Protein interaction data: 10^4 interactions, but...
Protein expression data: 10^4 measurements, but...
Protein location data: 10^4 measurements
10
Extraction from Biomedical Literature
  • An ever-increasing wealth of biological
    information is present in millions of published
    articles, but retrieving it in structured form is
    difficult.
  • Much of this literature is available through the
    NIH/NLM's Medline repository.
  • 11 million abstracts in electronic form are
    available through Medline.
  • Excellent source of information on protein
    interactions.
  • Need automated information extraction to easily
    locate and structure this information.

11
TI - Two potentially oncogenic cyclins, cyclin A
and cyclin D1, share common properties of subunit
configuration, tyrosine phosphorylation and
physical association with the Rb protein AB -
Originally identified as a mitotic cyclin,
cyclin A exhibits properties of growth factor
sensitivity, susceptibility to viral subversion
and association with a tumor-suppressor protein,
properties which are indicative of an
S-phase-promoting factor (SPF) as well as a
candidate proto-oncogene. Other recent studies
have identified human cyclin D1 (PRAD1) as a
putative G1 cyclin and candidate
proto-oncogene. However, the specific enzymatic
activities and, hence, the precise biochemical
mechanisms through which cyclins function to
govern cell cycle progression remain
unresolved. In the present study we have
investigated the coordinate interactions between
these two potentially oncogenic cyclins,
cyclin-dependent protein kinase subunits (cdks)
and the Rb tumor-suppressor protein. The
distribution of cyclin D isoforms was modulated
by serum factors in primary fetal rat lung
epithelial cells. Moreover, cyclin D1 was found
to be phosphorylated on tyrosine residues in vivo
and, like cyclin A, was readily phosphorylated by
pp60c-src in vitro. In synchronized human
osteosarcoma cells, cyclin D1 is induced in early
G1 and becomes associated with p9Ckshs1, a
Cdk-binding subunit. Immunoprecipitation
experiments with human osteosarcoma cells and
Ewing's sarcoma cells demonstrated that cyclin D1
is associated with both p34cdc2 and p33cdk2, and
that cyclin D1 immune complexes exhibit
appreciable histone H1 kinase activity.
Immobilized, recombinant cyclins A and D1 were found to
associate with cellular proteins in complexes
that contain the p105Rb protein. This study
identifies several common aspects of cyclin
biochemistry, including tyrosine phosphorylation
and the potential to interact directly or
indirectly with the Rb protein, that may
ultimately relate membrane-mediated signaling
events to the regulation of gene expression.
12
(Same abstract as the previous slide.)
13
(Same abstract as the previous slide.)
14
Manually Developed IE Systems for Medline
  • A number of projects have focused on the manual
    development of information extraction (IE)
    systems for biomedical literature.
  • KeX for extracting protein names (Fukuda et al.,
    1998): extract words with special symbols,
    excluding those in which more than half of the
    characters are special symbols, hence
    eliminating strings such as /-.
  • Suiseki for extracting protein interactions
    (Blaschke et al., 2001), using patterns such as:
    PROT (0-2) PROT (0-2) complex
    NOUN between (0-3) PROT (0-3) and (0-3) PROT

15
Learning Information Extractors
  • Manually developing IE systems is tedious and
    time-consuming and they do not capture all
    possible formats and contexts for the desired
    information.
  • Machine learning from supervised corpora is
    becoming the standard approach to building
    information extractors.
  • Recently, several learning approaches have been
    applied to Medline extraction (Craven & Kumlien,
    1999; Tanabe & Wilbur, 2002; Raychaudhuri et al.,
    2002).
  • We have explored the use of a variety of machine
    learning techniques to develop IE systems for
    extracting human protein names and interactions,
    presenting uniform results on a single,
    reasonably large, human-annotated corpus.

16
Non-Learning Protein Extractors
  • Dictionary-based extraction
  • KEX (Fukuda et al., 1998)

17
Learning Methods for Protein Extraction
  • Rule-based pattern induction
  • Rapier (Califf & Mooney, 1999)
  • BWI (Freitag & Kushmerick, 2000)
  • Token classification (chunking approach)
  • K-nearest neighbor
  • Transformation-Based Learning: Abgene
    (Tanabe & Wilbur, 2002)
  • Support Vector Machine
  • Maximum entropy
  • Hidden Markov Models
  • Conditional Random Fields (Lafferty, McCallum,
    and Pereira, 2001)
  • Relational Markov Networks (Taskar, Abbeel, and
    Koller, 2002)

18
Our Biomedical Corpora
  • 750 abstracts that contain the word 'human' were
    randomly chosen from Medline for testing protein
    name extraction. They contain a total of 5,206
    protein references.
  • 200 abstracts previously known to contain protein
    interactions were obtained from the Database of
    Interacting Proteins. They contain 1,101
    interactions and 4,141 protein names.
  • As negative examples for interaction extraction
    are rare, an extra set of 30 abstracts containing
    sentences with non-interacting proteins is
    included.
  • The resulting 230 abstracts are used for testing
    protein interaction extraction.

19
The Yapex Corpus
  • 200 abstracts from Medline, manually tagged for
    protein names.
  • 147 randomly chosen such that they contain the
    MeSH terms 'protein binding', 'interaction',
    'molecular'.
  • 53 randomly chosen from the GENIA corpus.

http://www.sics.se/humle/projects/prothalt/
20
Evaluation Metrics for Information Extraction
  • Precision is the percentage of extracted items
    that are correct.
  • Recall is the percentage of correct items that
    are extracted.
  • Extracted protein names are considered correct if
    the same character sequences have been
    human-tagged as protein names in the exact
    positions.
  • Extracted protein interactions from an abstract
    are considered correct if both proteins have been
    human-tagged as interacting in that abstract.
    Positions are not taken into account.
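
As a rough illustration of these two metrics (a minimal sketch; the exact matching criteria are the ones stated above, and the span tuples below are purely illustrative), precision and recall over sets of extracted and human-tagged items can be computed as follows:

    def precision_recall(extracted, gold):
        """Precision/recall over sets of items.
        For protein names an item could be an (abstract_id, start, end) span,
        so that exact positions must match; for interactions, an unordered
        protein pair per abstract (positions ignored)."""
        extracted, gold = set(extracted), set(gold)
        correct = extracted & gold
        precision = len(correct) / len(extracted) if extracted else 0.0
        recall = len(correct) / len(gold) if gold else 0.0
        return precision, recall

    # Example: 3 of 4 extractions are correct, and 3 of 5 tagged items are found.
    p, r = precision_recall(
        {("a1", 0, 2), ("a1", 5, 6), ("a2", 3, 4), ("a2", 8, 9)},
        {("a1", 0, 2), ("a1", 5, 6), ("a2", 3, 4), ("a2", 10, 11), ("a3", 1, 2)})
    print(p, r)  # 0.75 0.6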

21
Dictionary as Source of Domain Knowledge
  • Before applying machine learning, abstracts are
    tagged by matching n-grams against entries from a
    dictionary. Tagged abstracts are used as input
    for subsequent methods.
  • A dictionary of 42,000 protein names is used
    (synonyms included).
  • Generalization of protein names leads to
    increased coverage:

Original Protein Name        Generalized Name
Interleukin-1 beta           Interleukin <num> <greek>
Interferon alpha-D           Interferon <greek> <roman>
NF-IL6-beta                  NF IL <num> <greek>
TR2                          TR <num>
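
A minimal sketch of this kind of name generalization (the dictionary matcher's actual normalization rules are not given here, so the character classes and the <num>/<greek>/<roman> tokens below are illustrative assumptions):

    import re

    GREEK = {"alpha", "beta", "gamma", "delta", "epsilon", "kappa"}  # illustrative subset

    def generalize(name):
        """Replace numbers, Greek-letter words, and single Roman letters with
        class tokens so that one dictionary entry matches more name variants."""
        out = []
        for tok in re.split(r"[\s\-]+", name):
            if tok.isdigit():
                out.append("<num>")
            elif tok.lower() in GREEK:
                out.append("<greek>")
            elif len(tok) == 1 and tok.isalpha():
                out.append("<roman>")
            else:
                # split trailing digits off alphabetic stems, e.g. "TR2" -> "TR <num>"
                m = re.match(r"^([A-Za-z]+)(\d+)$", tok)
                out.extend([m.group(1), "<num>"] if m else [tok])
        return " ".join(out)

    print(generalize("Interleukin-1 beta"))  # Interleukin <num> <greek>
    print(generalize("TR2"))                 # TR <num>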
22
Rule-based Learning Algorithms: Rapier and BWI
  • Rule-based learning algorithms are used for
    inducing patterns for extracting protein names.
  • For Rapier (Califf & Mooney, 1999), each rule
    consists of a pre-filler pattern, a filler
    pattern, and a post-filler pattern, e.g.
    human (2) transcriptase (
  • For BWI (Freitag & Kushmerick, 2000), rules are
    composed of contextual patterns called wrappers,
    recognizing the start or end of a protein name,
    e.g. human transcriptase (
  • High precision (> 70%) but low recall (< 25%).

23
Hidden Markov Models
  • We use part-of-speech information in HMMs as
    described by Ray and Craven (2001).
  • We train a positive model that generates
    sentences containing proteins, and a null model
    that generates sentences containing no proteins.
  • Select the model which gives the highest
    likelihood of generating a particular sentence,
    and tag the sentence using the Viterbi path in
    that model.
  • Moderate precision (~60%) and moderate recall
    (~40%).

24
Name Extraction by Token Classification (Chunking
Approach)
  • Since in our data no protein names directly abut
    each other, we can reduce the extraction problem
    to classification of individual words as being
    part of a protein name or not.
  • Protein names are extracted by identifying the
    longest sequences of words classified as being
    part of a protein name.

Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
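
A minimal sketch of how names would be read off the per-token decisions, assuming a classifier has already labeled each token as inside (1) or outside (0) a protein name:

    def extract_names(tokens, labels):
        """Return maximal runs of tokens labeled 1 (part of a protein name).
        This works because no two protein names directly abut each other."""
        names, current = [], []
        for tok, lab in zip(tokens, labels):
            if lab == 1:
                current.append(tok)
            elif current:
                names.append(" ".join(current))
                current = []
        if current:
            names.append(" ".join(current))
        return names

    tokens = "Two potentially oncogenic cyclins , cyclin A and cyclin D1 , share".split()
    labels = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0]
    print(extract_names(tokens, labels))  # ['cyclin A', 'cyclin D1']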
25
Constructing Feature Vectors for Classification
  • For each token, we take the following as
    features:
  • Current token
  • Last 2 tokens and next 2 tokens
  • Output of dictionary-based tagger for these 5
    tokens
  • Suffix for each of the 5 tokens (last 1, 2, and 3
    characters)
  • Class labels for last 2 tokens

Two potentially oncogenic cyclins , cyclin A and
cyclin D1 , share common properties of subunit
configuration , tyrosine phosphorylation and
physical association with the Rb protein
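
A minimal sketch of a feature extractor along these lines; the feature names and the padding token are illustrative, not the exact templates used in the experiments:

    def token_features(tokens, dict_tags, prev_labels, i):
        """Build a feature dict for token i from a +/-2 token window, the
        dictionary-tagger output, suffixes, and the two previous class labels."""
        feats = {}
        for off in range(-2, 3):
            j = i + off
            tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
            tag = dict_tags[j] if 0 <= j < len(tokens) else "<PAD>"
            feats[f"w[{off}]"] = tok
            feats[f"dict[{off}]"] = tag
            for n in (1, 2, 3):
                feats[f"suf{n}[{off}]"] = tok[-n:]
        feats["label[-1]"] = prev_labels[-1] if prev_labels else "<START>"
        feats["label[-2]"] = prev_labels[-2] if len(prev_labels) > 1 else "<START>"
        return feats

    tokens = "cyclin D1 is induced in early G1".split()
    dict_tags = ["PROT", "PROT", "O", "O", "O", "O", "O"]  # illustrative tagger output
    print(token_features(tokens, dict_tags, ["S"], 1)["w[0]"])  # D1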
26
Maximum-Entropy Token Classifier
  • Distinguish among 5 types of tags:
    S(tart), C(ontinue), E(nd), U(nique), O(ther)
  • Feature templates:
  • current, previous, next word, and previous tag
  • part-of-speech for current, previous, next word
  • word class (full), e.g. FGF1 → AAA0
  • word class (brief), e.g. FGF1 → A0 (Collins,
    ACL 2002)
  • An extraction's confidence is the minimum of its
    transition probabilities (see the sketch below).

Example (4 tokens)
α_t(y) is the forward probability of reaching
state y at time step t
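
The slide states the confidence rule in terms of transition probabilities derived from the forward probabilities α_t(y); the sketch below simplifies this to per-position tag probabilities, which conveys the same min-over-the-span idea:

    def extraction_confidence(tag_probs, start, end):
        """Confidence of an extraction spanning tokens start..end (inclusive),
        taken as the minimum of the per-step probabilities of its tags:
        S at the start, C in the middle, E at the end (U for a single token)."""
        if start == end:
            return tag_probs[start]["U"]
        probs = [tag_probs[start]["S"]]
        probs += [tag_probs[j]["C"] for j in range(start + 1, end)]
        probs.append(tag_probs[end]["E"])
        return min(probs)

    # Two-token extraction whose weakest link is the E decision at the second token.
    tag_probs = [{"S": 0.9, "C": 0.02, "E": 0.02, "U": 0.05, "O": 0.01},
                 {"S": 0.01, "C": 0.1, "E": 0.7, "U": 0.04, "O": 0.15}]
    print(extraction_confidence(tag_probs, 0, 1))  # 0.7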
27
MaxEnt Greedy Extraction
  • Use a Viterbi-like algorithm to find the most
    likely complete sequence of tags.
  • Drawback: many low-confidence extractions are
    missed.
  • We want to be able to increase recall beyond the
    Viterbi results to control the precision-recall
    trade-off.
  • Solution: run a greedy extraction algorithm on
    all token sequences between any two consecutive
    Viterbi extractions.

28
Experimental Method
  • 10-fold cross-validation: average results over 10
    trials with different training and (independent)
    test data.
  • For methods that produce confidences for
    extractions, vary the threshold for extraction in
    order to explore the recall-precision trade-off.
  • Use standard methods from information retrieval
    to generate a complete precision-recall curve
    (see the sketch below).
  • Maximizing F-measure assumes a particular
    cost-benefit trade-off between incorrect and
    missed extractions.
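
A minimal sketch of the threshold-sweeping evaluation (standard ranked-retrieval style; interpolation and tie handling are omitted):

    def pr_curve(scored_extractions, gold):
        """scored_extractions: list of (confidence, item); gold: set of correct items.
        Returns (recall, precision) points obtained by lowering the threshold."""
        points, correct = [], 0
        ranked = sorted(scored_extractions, key=lambda x: -x[0])
        for k, (_, item) in enumerate(ranked, start=1):
            if item in gold:
                correct += 1
            points.append((correct / len(gold), correct / k))
        return points

    gold = {"p1", "p2", "p3"}
    scored = [(0.9, "p1"), (0.8, "x"), (0.7, "p2"), (0.4, "y"), (0.3, "p3")]
    for r, p in pr_curve(scored, gold):
        print(f"recall={r:.2f} precision={p:.2f}")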

29
Protein Name Extraction Results (Bunescu et al., 2004)
30
Graphical Models
  • An intuitive representation of conditional
    independence between domain variables.
  • Directed models → well suited to representing
    temporal and causal relationships (Bayesian
    networks, HMMs).
  • Undirected models → appropriate for representing
    statistical correlations between variables
    (Markov networks).
  • Generative models → define a joint probability
    over observations and labels (HMMs).
  • Discriminative models → specify a probability
    over labels given a set of observations
    (Conditional Random Fields; Lafferty et al.,
    2001).
  • Allow for arbitrary, overlapping features over
    the observation sequence.

31
Discriminative Markov Networks
  • G = (V, E): an undirected graph
  • V = X ∪ Y: a set of discrete random variables
  • X: observed variables
  • Y: hidden variables (labels)
  • C(G): the cliques of G
  • Vc = Xc ∪ Yc: the set of vertices in a clique
    c ∈ C(G)
  • Φ = {φc}: the set of clique potentials

A clique potential φc specifies the compatibility
of any possible assignment of values over the
nodes in the associated clique c.
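
Written out with the notation above, the conditional distribution over the hidden labels defined by such a network is the standard one:

    P(y \mid x) = \frac{1}{Z(x)} \prod_{c \in C(G)} \phi_c(x_c, y_c),
    \qquad
    Z(x) = \sum_{y'} \prod_{c \in C(G)} \phi_c(x_c, y'_c)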
32
Conditional Random Fields
Lafferty et al. 2001
  • CRFs are a type of discriminative Markov network
    used for tagging sequences.
  • CRFs have shown superior or competitive
    performance on various tasks such as:
  • Shallow parsing (Sha & Pereira, 2003)
  • Named entity recognition (McCallum & Li, 2003)
  • Table extraction (Pinto et al., 2003)
33
Conditional Random Fields (CRFs) (Lafferty,
McCallum & Pereira, 2001)
  • Undirected graphical model for sequence
    segmentation.
  • Log-linear model, differing from the MaxEnt model
    in its global normalization (see the formula below).

[Factor graph: transition potentials φ_tags over Start, T1.tag, T2.tag, T3.tag, ..., Tn.tag, End; observation potentials φ_tw linking each Tj.tag to Tj.w; capitalization potentials φ_cap linking each Tj.tag to Tj.cap.]
  • Tj.tag: the tag (one of S, C, E, U, O) at
    position j
  • Tj.w: true if word w occurs at position j
  • Tj.cap: true if the word at position j begins
    with a capital letter
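
In the standard log-linear form from Lafferty et al. (2001), with feature functions f_k playing the role of the potentials above, the globally normalized model is:

    P(y \mid x) = \frac{1}{Z(x)}
    \exp\Big( \sum_{j} \sum_{k} \lambda_k \, f_k(y_{j-1}, y_j, x, j) \Big),
    \qquad
    Z(x) = \sum_{y'} \exp\Big( \sum_{j} \sum_{k} \lambda_k \, f_k(y'_{j-1}, y'_j, x, j) \Big)

The normalizer Z(x) sums over all tag sequences; this global normalization is what distinguishes the CRF from a locally normalized MaxEnt tagger.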


34
Protein Name Extraction Results (Yapex)
35
Collective Classification of Web Pages
(Taskar, Abbeel & Koller, 2002)
36
Collective Information Extraction
  • Task
  • Extracting protein/gene names from Medline
    abstracts.
  • Approach
  • Collectively classify all candidate phrases from
    the same abstract.
  • Binary classification:
  • e.label = 0 → e is not a protein name
  • e.label = 1 → e is a protein name
  • Use two types of label correlations
  • Acronyms and their long forms.
  • Repetitions of the same phrase.

37
Collective Information Extraction
The control of human ribosomal protein L22 (
rpL22 ) to enter into the nucleolus and its
ability to be assembled into the ribosome is
regulated by its sequence . The nuclear import of
rpL22 depends on a classical nuclear
localization signal of four lysines at positions
13 16 Once it reaches the nucleolus , the
question of whether rpL22 is assembled into the
ribosome depends upon the presence of the N -
domain .
[Diagram: candidate entities e1 ... e5 over the phrases 'ribosomal protein L22', '( rpL22 )', 'of rpL22 depends', 'whether rpL22 is', and 'L22', connected by acronym, overlap, and repetition edges.]
38
Relational Markov Networks
(Taskar, Abbeel & Koller, 2002)
Discriminative Markov networks, augmented with
clique templates:
  • Overlap Template (OT)
  • Acronym Template (AT)
  • Repeat Template (RT)

39
Candidate Entities Definition
  • Candidate Entities
  • The set of candidate entities usually depends on
    the type of named entity.
  • In general, one could consider as candidates all
    phrases of length < L, where L may be task
    dependent.
  • Two examples:
  • Genes, proteins: most entity names are base
    noun phrases or parts of them. Thus a candidate
    extraction is any contiguous sequence of tokens
    whose POS tags are from {JJ, VBN, VBG, POS, NN,
    NNS, NNP, NNPS, CD} and whose head is either a
    noun or a number (see the sketch after this
    list).
  • People, organizations, locations: most entity
    names are sequences of proper names, potentially
    interspersed with definite articles and
    prepositions.
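
A minimal sketch of candidate generation along these lines for genes/proteins. It assumes tokens already carry POS tags, takes the last token of a base noun phrase as its head, and uses an assumed maximum length parameter:

    CANDIDATE_POS = {"JJ", "VBN", "VBG", "POS", "NN", "NNS", "NNP", "NNPS", "CD"}
    HEAD_POS = {"NN", "NNS", "NNP", "NNPS", "CD"}  # noun or number

    def candidate_entities(tagged_tokens, max_len=4):
        """tagged_tokens: list of (word, pos). Return all contiguous spans of
        length <= max_len whose POS tags are all candidate tags and whose last
        token (the head) is a noun or number."""
        spans = []
        n = len(tagged_tokens)
        for i in range(n):
            for j in range(i, min(i + max_len, n)):
                window = tagged_tokens[i:j + 1]
                if (all(pos in CANDIDATE_POS for _, pos in window)
                        and window[-1][1] in HEAD_POS):
                    spans.append((i, j, " ".join(w for w, _ in window)))
        return spans

    tagged = [("the", "DT"), ("ribosomal", "JJ"), ("protein", "NN"), ("L22", "NN")]
    print(candidate_entities(tagged))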

40
Candidate Entities: Local Features
to the antioxidant superoxide dismutase 1 (
SOD1 ) enzyme and
  • Entity features, based on features introduced in
    (Collins, 2002):
  • head word, with a generic placeholder for numbers
    → HD = 0
  • entity text → TXT = superoxide dismutase 1
  • entity type, e.g. the concatenation of its words'
    types → TYPE = a a 0
  • bigrams / trigrams at the entity's left / right
    boundaries, based on combinations of
    lexical tokens and word types:
  • bigrams left → BL = antioxidant superoxide,
    BL = antioxidant a, ...
  • bigrams right → BR = 0 (, ...
  • trigrams left → TL = the antioxidant
    superoxide, TL = the antioxidant a, ...
  • trigrams right → TR = 0 ( SOD1, TR = 0 ( A0, ...
  • suffix / prefix lists of words and word types:
  • prefixes → PF = superoxide, PF = superoxide
    dismutase, ...
  • suffixes → SF = 0, SF = 0, SF = dismutase 0, ...

41
Overlap Template
  • Entity names should not overlap → hardwired
    overlap potential φOT.

to the antioxidant superoxide dismutase 1 (
SOD1 ) enzyme and

[Diagram: e1 and e2 are two overlapping candidate entities in the sentence above.]

φOT                 e1.label = 0    e1.label = 1
e2.label = 0             1               1
e2.label = 1             1               0
42
Repeat Template
Production of nitric oxide ( NO ) in endothelial
cells is regulated by direct interactions of
endothelial nitric oxide synthase ( eNOS ) Here
we have used the yeast two - hybrid system and
identified a novel 34 kDa protein , termed NOSIP
( eNOS interaction protein ) , which avidly binds
to the carboxyl terminal region of the eNOS
oxygenase domain .
[Diagram: a repeat edge connects two occurrences u and v of the same candidate phrase in the text above.]
43
Acronym Template
to the antioxidant superoxide dismutase 1 (
SOD1 ) enzyme and

[Diagram: the acronym candidate ( SOD1 ) is connected to several candidate long forms v1, v2, v3 through an auxiliary OR node, vOR = v1 ∨ v2 ∨ ... ∨ vn.]
44
Experimental Results
  • Datasets:
  • Yapex: a dataset of 200 Medline abstracts,
    manually tagged for protein names.
  • Aimed: a dataset of 225 Medline abstracts, of
    which 200 are known to mention protein
    interactions.
  • CoNLL: the CoNLL 2003 English dataset.
  • Compared three approaches:
  • LT-RMN: RMN extraction using local templates plus
    the Overlap Template.
  • GLT-RMN: RMN extraction using both local and
    global templates.
  • CRF: extraction as token classification using
    Conditional Random Fields (Lafferty et al., 2001),
    with features based on the current word,
    previous/next words, the words' short/long types,
    and POS tags (Bunescu et al., 2004).

45
Experimental Results Yapex
Yapex
46
Experimental Results Aimed
Aimed
47
Experimental Results CoNLL
CoNLL 2003
48
Protein Interaction Extraction
  • Most IE methods focus on extracting individual
    entities.
  • Protein interaction extraction requires
    extracting relations between entities.
  • Our current results on relation extraction have
    focused on rule-based learning approaches.

49
Rapier and BWI Revisited: the Inter-filler
Approach
  • Existing rule-based learning algorithms are used
    for inducing patterns for identifying protein
    interactions.
  • Rules are learned for extracting inter-fillers,
    the text between two interacting protein names:
    SHPTPW interacts with another
    signaling protein, Grb7.
  • Inter-fillers are sometimes very long (9 tokens
    on average, 215 tokens maximum!). For some
    rule-based learning algorithms (e.g. Rapier), the
    time complexity can grow exponentially in the
    length of inter-fillers.

50
Rapier and BWI Revisited the Role-filler
Approach
  • In the role-filler approach, we extract two
    interacting proteins into different slots, which
    we call the interactor and the interactee.
  • A sentence is divided into segments. Interactors
    are associated with interactees in the same
    segment using simple heuristics.

We show that the S252W mutation allows the
mesenchymal splice form of FGFR2 (FGFR2c) to bind
and to be activated by the mesenchymally
expressed ligands FGF7 or FGF10 and the
epithelial splice form of FGFR2 (FGFR2b) to be
activated by FGF2, FGF6, and FGF9.
  • Moderately high precision (> 60%) but low recall
    (< 40%).

51
ELCS (Extraction using Longest Common
Subsequences)
  • A new method for inducing rules that extract
    interactions between previously tagged proteins.
  • Each rule consists of a sequence of words with
    allowable word gaps between them (similar to
    Blaschke & Valencia, 2001, 2002), e.g.
    - (7) interactions (0) between (5) PROT (9) PROT (17) .
    (see the matching sketch after this list).
  • Any pair of proteins in a sentence, if tagged as
    interacting, forms a positive example; otherwise
    it forms a negative example.
  • Positive examples are repeatedly generalized to
    form rules until the rules become overly general
    and start matching negative examples.
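
A minimal sketch of matching such a rule against a tagged sentence. It reads each parenthesized number as the maximum number of intervening words allowed before the next required word; this reading of the rule format, and the greedy first-match strategy, are assumptions rather than ELCS's exact matcher:

    def matches(rule, sentence_tokens):
        """rule: list alternating max-gap ints and required tokens, e.g.
        [7, "interactions", 0, "between", 5, "PROT", 9, "PROT", 17, "."].
        Returns True if the required tokens occur in order, with at most
        the given number of other tokens in each gap (greedy, no backtracking)."""
        pos, gap = 0, 0
        for item in rule:
            if isinstance(item, int):
                gap = item
                continue
            window = sentence_tokens[pos:pos + gap + 1]
            if item not in window:
                return False
            pos += window.index(item) + 1
        return True

    sent = "recent reports show interactions between the proteins PROT and PROT .".split()
    rule = [7, "interactions", 0, "between", 5, "PROT", 9, "PROT", 17, "."]
    print(matches(rule, sent))  # True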

52
Generalizing Rules using Longest Common
Subsequence
The self - association site appears to be formed
by interactions between helices 1 and 2 of beta
spectrin repeat 17 of one dimer with helix 3 of
alpha spectrin repeat 1 of the other dimer to
form two combined alpha - beta triple - helical
segments .
Title: Physical and functional interactions between
the transcriptional inhibitors Id3 and ITF-2b .
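
A minimal sketch of the generalization step this slide illustrates: take the longest common subsequence of two positive examples' token sequences, which keeps the shared words (with tagged protein names replaced by PROT) and discards the rest. The sentences below are shortened versions of the ones above, and ELCS's bookkeeping of the allowed word gaps is omitted:

    def lcs(a, b):
        """Longest common subsequence of two token lists, via dynamic programming."""
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m - 1, -1, -1):
            for j in range(n - 1, -1, -1):
                dp[i][j] = dp[i + 1][j + 1] + 1 if a[i] == b[j] else max(dp[i + 1][j], dp[i][j + 1])
        # recover one LCS by walking the table
        out, i, j = [], 0, 0
        while i < m and j < n:
            if a[i] == b[j]:
                out.append(a[i]); i += 1; j += 1
            elif dp[i + 1][j] >= dp[i][j + 1]:
                i += 1
            else:
                j += 1
        return out

    s1 = "interactions between helices of PROT with PROT of the other dimer".split()
    s2 = "interactions between the transcriptional inhibitors PROT and PROT .".split()
    print(lcs(s1, s2))  # ['interactions', 'between', 'PROT', 'PROT']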
53
The ELCS Framework
  • A greedy-covering, bottom-up rule induction
    method is used to cover all the positive examples
    without covering many negative examples.
  • We use an algorithm similar to beam search that
    considers only the n = 25 best rules for
    generalization at any time.
  • The confidence level of a rule is based on the
    number of positive and negative examples the rule
    covers while allowing some margin for noise
    (Cestnik, 1990).
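
One standard scoring function in this spirit is Cestnik's m-estimate of a rule's accuracy; the slide does not give ELCS's exact formula, so the following is only the generic form:

    \text{conf}(r) = \frac{p + m \cdot p_0}{p + n + m}

where p and n are the numbers of positive and negative examples covered by the rule, p_0 is the prior probability of the positive class, and m controls how strongly the estimate is pulled toward that prior.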

54
Protein Interaction Extraction Results
55
Protein Interaction Extraction Results (full)
56
Ongoing and Future Work
  • Extracted proteins and their interactions from
    753,459 Medline abstracts on human biology.
    Evaluation of the results is in progress.
  • Improve RMN approach with better local and global
    templates, better candidate entity generation,
    and better algorithms for probabilistic
    inference.
  • Extend RMN approach to handle extracting
    relations between entities.
  • Evaluate RMN approach on other biological
    entities and relations and on other
    non-biological corpora.
  • Reduce human effort by actively selecting the
    best training examples for human labeling.
  • Combine evidence from text with other biological
    data sources to derive accurate, comprehensive
    gene networks.

57
Conclusions
  • We have compared a wide variety of existing
    machine-learning methods for extracting human
    protein names and interactions.
  • The CRF approach performs best among the existing
    methods.
  • We developed a new, more general approach based on
    RMNs that allows collective extraction,
    integrating information across all potential
    extractions.
  • For extracting protein interactions, we found
    that several methods for learning extraction
    rules outperform hand-written rules with respect
    to precision and noisy protein tags.

58
The End