Title: Protein Quaternary Fold Recognition Using Conditional Graphical Models
1Protein Quaternary Fold Recognition Using
Conditional Graphical Models
- Yan Liu
- IBM Research
- Jaime Carbonell (CMU), Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
- ICML-2007 workshop
2Snapshot of Cell Biology
3Example Protein Structures
Triple beta-spiral fold in Adenovirus Fiber Shaft
Adenovirus Fibre Shaft
Virus Capsid
4Predicting Protein Structures
- Protein structure is a key determinant of protein function
- Resolving protein structures experimentally in vitro by crystallography is very expensive, and NMR can only resolve very small proteins
- The gap between known protein sequences and structures: 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
- Therefore we need to predict structures in silico
5Quaternary Folds and Alignments
- Protein fold: an identifiable regular arrangement of secondary structural elements
- Thus far, a limited number of protein folds have been discovered (1000)
- Very little research has addressed quaternary folds
- Complex structures and little labeled data
- Quaternary fold recognition
6Related Work
- Previous work in general protein structure prediction
- Sequence similarity perspective [Altschul et al., 1997; Durbin et al., 1998; Karplus et al., 1998; Jones, 2001]
- Physical forces perspective [Jones, 1998]
- Structural biology perspective [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]
- Previous work in quaternary structure prediction
- Mostly on partial tasks, e.g. classification of protein sequences, analysis of domain-domain docking or interaction types, and geometric regularities and constraints
- Computational challenges in viral fold recognition
- Complex structures, insufficient data, and low sequence similarity between membership proteins
7Conditional Random Fields
- Hidden Markov model (HMM) [Rabiner, 1989]
- Conditional random fields (CRFs) [Lafferty et al., 2001]
- Model the conditional probability directly (discriminative models, directly optimizable)
- Allow arbitrary dependencies in the observations
- Adaptive to different loss functions and regularizers
- Promising results in multiple applications
- But they need to scale up (computationally) and extend to long-distance dependencies
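To illustrate what modeling the conditional probability directly means, here is a minimal sketch of a linear-chain CRF whose normalizer is computed with the forward algorithm. The score tables `node_score` and `trans_score` and all numbers are assumptions made up for illustration; in a real CRF they would be weighted sums of features of the observation x.

```python
import itertools
import math

def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(labels, node_score, trans_score):
    """Unnormalized log-score of one label sequence."""
    score = sum(node_score[t][y] for t, y in enumerate(labels))
    score += sum(trans_score[a][b] for a, b in zip(labels, labels[1:]))
    return score

def log_partition(node_score, trans_score):
    """log Z(x) by the forward algorithm, O(T * K^2)."""
    num_states = len(trans_score)
    alpha = list(node_score[0])
    for t in range(1, len(node_score)):
        alpha = [
            node_score[t][j]
            + log_sum_exp([alpha[i] + trans_score[i][j] for i in range(num_states)])
            for j in range(num_states)
        ]
    return log_sum_exp(alpha)

def conditional_prob(labels, node_score, trans_score):
    """P(y | x) = exp(score(y, x)) / Z(x)."""
    return math.exp(sequence_score(labels, node_score, trans_score)
                    - log_partition(node_score, trans_score))

# Toy example: 3 positions, 2 states (hypothetical scores).
node_score = [[1.0, 0.2], [0.3, 0.9], [0.5, 0.5]]
trans_score = [[0.8, -0.2], [-0.4, 0.6]]
total = sum(conditional_prob(y, node_score, trans_score)
            for y in itertools.product(range(2), repeat=3))
print(total)  # the probabilities of all labelings should sum to about 1
```

The forward recursion only works because each label interacts with its immediate neighbor; long-distance dependencies break this chain structure, which is the scaling problem noted above.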
8Our Solution: Conditional Graphical Models
Figure labels: long-range dependency, local dependency
- Segmentation CRF
- Outputs: $Y = \{M, \{W_i\}\}$, where $W_i = \{p_i, q_i, s_i\}$
- Feature definition
- Node feature
- Local interaction feature
- Long-range interaction feature
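The segmentation output can be sketched as a small data structure. This is a sketch under assumptions: it reads M as the number of segments and p_i, q_i, s_i as the start position, end position, and state label of segment i, and the state names "B" and "L" are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    p: int   # assumed meaning: start position (inclusive)
    q: int   # assumed meaning: end position (inclusive)
    s: str   # assumed meaning: state label of the segment

def is_valid_segmentation(segments: List[Segment], seq_len: int) -> bool:
    """Check that the segments tile positions 0..seq_len-1 with no gaps or overlaps."""
    if not segments:
        return seq_len == 0
    expected_start = 0
    for seg in segments:
        if seg.p != expected_start or seg.q < seg.p:
            return False
        expected_start = seg.q + 1
    return expected_start == seq_len

def to_labels(segments: List[Segment]) -> List[str]:
    """Expand the segments into one state label per residue."""
    return [seg.s for seg in segments for _ in range(seg.p, seg.q + 1)]

# Hypothetical states: "B" for a structural motif, "L" for a linker region.
W = [Segment(0, 3, "L"), Segment(4, 9, "B"), Segment(10, 11, "L")]
M = len(W)
print(M, is_valid_segmentation(W, 12), "".join(to_labels(W)))
```

Working with segments rather than per-residue labels is what lets node features describe a whole structural element and lets interaction features connect two elements that are far apart in sequence.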
9Linked Segmentation CRF
- Nodes: secondary structure elements and/or simple folds
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- The L-SCRF conditional probability of y given x is defined as
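A generic log-linear form consistent with the nodes and edges described above can be sketched as follows. All symbols are assumptions for illustration: R chains, node set V (segments), edge set E (pairs of interacting segments within or between chains), node features f_k, edge features g_l, weights θ, and normalizer Z. The exact form used in this work may differ.

```latex
P(y^1, \dots, y^R \mid x^1, \dots, x^R)
  = \frac{1}{Z} \exp\Big(
      \sum_{W_i \in V} \sum_k \theta_{1,k}\, f_k(x, W_i)
    + \sum_{(W_i, W_j) \in E} \sum_l \theta_{2,l}\, g_l(x, W_i, W_j)
    \Big),
\qquad
Z = \sum_{y'} \exp\Big(
      \sum_{W'_i \in V'} \sum_k \theta_{1,k}\, f_k(x, W'_i)
    + \sum_{(W'_i, W'_j) \in E'} \sum_l \theta_{2,l}\, g_l(x, W'_i, W'_j)
    \Big)
```

Here V' and E' denote the nodes and edges induced by a candidate output y', so the graph itself changes with the segmentation being scored.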
10Linked Segmentation CRF (II)
- Objective
- Training: learn the model parameters $\theta$
- Minimize the regularized negative log loss
- Iterative search algorithms seek the direction in which the empirical feature values agree with their expectations
- Complex graphs result in huge computational complexity
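For a log-linear model, the training criterion can be written out as below. This is a sketch that assumes θ denotes the model parameters, N training pairs, and a Gaussian (L2) regularizer with variance σ²; the regularizer actually used is not stated here.

```latex
L(\theta) = -\sum_{n=1}^{N} \log P_\theta\big(y^{(n)} \mid x^{(n)}\big)
            + \frac{\lVert \theta \rVert^2}{2\sigma^2}

\frac{\partial L}{\partial \theta_k}
  = -\sum_{n=1}^{N} \Big( f_k\big(x^{(n)}, y^{(n)}\big)
      - E_{P_\theta(y \mid x^{(n)})}\big[ f_k\big(x^{(n)}, y\big) \big] \Big)
    + \frac{\theta_k}{\sigma^2}
```

Apart from the regularizer term, the gradient vanishes exactly when the empirical feature values match their model expectations. Computing that expectation requires summing over all possible outputs, which is where the cost of complex graphs arises.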
11Approximate Inference - Learning
- Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so we use:
- Contrastive divergence [Hinton & Welling, 2002]
- $\Delta\theta_k = E_{p_0}[f_k] - E_{p_1}[f_k]$
- $p_0$: estimated from empirical samples
- $p_1$: estimated from a few samples whose chains are seeded with the empirical samples
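The contrastive divergence update can be sketched on a deliberately tiny model. Everything here is an assumption for illustration: the model is a log-linear distribution over binary vectors with features f_k(y) = y_k, the learning rate is arbitrary, and the data are synthetic. Because the components of this toy model are independent, one sampling sweep already draws from the model exactly; in a structured model the sweep would be a short MCMC run that genuinely depends on its seed.

```python
import math
import random

def features(y):
    """Hypothetical feature map f(y): here simply the binary vector itself."""
    return list(y)

def mean_features(samples):
    """Average feature vector over a list of samples."""
    dim = len(features(samples[0]))
    totals = [0.0] * dim
    for y in samples:
        for k, v in enumerate(features(y)):
            totals[k] += v
    return [t / len(samples) for t in totals]

def sampling_sweep(y, theta, rng):
    """One sampling sweep under the current model, seeded at y.
    In this toy model the start point y does not affect the result."""
    return tuple(1 if rng.random() < 1.0 / (1.0 + math.exp(-t)) else 0
                 for t in theta)

def cd_update(theta, data, rng, lr=0.1):
    e_p0 = mean_features(data)                                # E_{p0}[f_k]
    one_step = [sampling_sweep(y, theta, rng) for y in data]  # chains seeded at data
    e_p1 = mean_features(one_step)                            # E_{p1}[f_k]
    return [t + lr * (a - b) for t, a, b in zip(theta, e_p0, e_p1)]

rng = random.Random(0)
data = [(1, 0)] * 120 + [(1, 1)] * 40 + [(0, 0)] * 40  # feature means: 0.8 and 0.2
theta = [0.0, 0.0]
for _ in range(500):
    theta = cd_update(theta, data, rng)
learned = [1.0 / (1.0 + math.exp(-t)) for t in theta]
print([round(p, 2) for p in learned])  # should land near the data means
```

The appeal for variable graph topologies is that the update only needs samples, never the normalizer, so it does not matter that each sample may induce a different graph.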
12Approximate Inference - Inference
- Reversible jump MCMC sampling [Green, 1995; Schmidler et al., 2001] with four types of Metropolis operators
- State switching
- Position switching
- Segment split
- Segment merge
- MAP estimate using simulated annealing reversible jump MCMC [Andrieu et al., 2000]
- Replace the sample with RJ MCMC
- Theoretically converges to the global optimum
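The four operators and the annealed search can be sketched as follows. This is a simplified illustration, not a faithful reversible jump sampler: it uses a plain Metropolis acceptance rule and omits the proposal-ratio terms that a full RJ MCMC scheme requires for moves that change the number of segments. The score function, state labels, target string, and cooling schedule are all hypothetical.

```python
import math
import random

STATES = ["B", "L"]                  # hypothetical state labels
TARGET = "LLLBBBBBBLLLBBBBBBLL"      # toy labeling used only to define a score

def labels(segs):
    """Expand (start, end, state) segments into one label per position."""
    return "".join(s * (q - p + 1) for p, q, s in segs)

def score(segs):
    """Hypothetical log-score: per-position agreement minus a per-segment penalty."""
    agree = sum(a == b for a, b in zip(labels(segs), TARGET))
    return agree - 0.5 * len(segs)

def propose(segs, rng):
    """Apply one randomly chosen operator; return segs unchanged if it does not apply."""
    segs = list(segs)
    move = rng.choice(["state", "position", "split", "merge"])
    i = rng.randrange(len(segs))
    p, q, s = segs[i]
    if move == "state":                                # state switching
        segs[i] = (p, q, rng.choice([t for t in STATES if t != s]))
    elif move == "position" and i + 1 < len(segs):     # shift boundary between i and i+1
        p2, q2, s2 = segs[i + 1]
        new_q = q + rng.choice([-1, 1])
        if p <= new_q < q2:                            # both segments stay non-empty
            segs[i], segs[i + 1] = (p, new_q, s), (new_q + 1, q2, s2)
    elif move == "split" and q > p:                    # segment split
        cut = rng.randrange(p, q)                      # first part ends at cut
        segs[i:i + 1] = [(p, cut, s), (cut + 1, q, rng.choice(STATES))]
    elif move == "merge" and i + 1 < len(segs):        # segment merge, keeps first state
        segs[i:i + 2] = [(p, segs[i + 1][1], s)]
    return segs

def anneal(n_steps=5000, t_start=2.0, t_end=0.05, seed=0):
    rng = random.Random(seed)
    current = [(0, len(TARGET) - 1, "L")]              # start from a single segment
    initial_score = score(current)
    best, best_score = current, initial_score
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (n_steps - 1))
        cand = propose(current, rng)
        delta = score(cand) - score(current)
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current = cand
            if score(current) > best_score:
                best, best_score = current, score(current)
    return best, best_score, initial_score

best, best_score, initial_score = anneal()
print(labels(best), best_score)
```

Split and merge change the number of segments, which is why a fixed-dimension sampler is not enough here. Lowering the temperature makes downhill moves increasingly unlikely, so the chain settles on a high-scoring segmentation.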
13Experiments: Target Quaternary Folds
- Triple beta-spirals [van Raaij et al., Nature 1999]
- Virus fibers in adenovirus, reovirus and PRD1
- Double barrel trimer [Benson et al., 2004]
- Coat proteins of adenovirus, PRD1, STIV, PBCV
14Features for Protein Fold Recognition
15Experiment Results: Fold Recognition
Triple beta-spirals
16Experiment Results: Alignment Prediction
17Experiment Results: Discovery of New Membership Proteins
- Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/yanliu/swissprot_list.xls
- Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005], compared with L-SCRF predictions
18Conclusion
- Conditional graphical models for protein structure prediction
- Effective representation of protein structural properties
- Feasible to incorporate different kinds of informative features
- Efficient inference algorithms for large-scale applications
- A major extension compared with previous work
- Knowledge representation through graphical models
- Ability to handle long-range interactions within one chain and between chains
- Future work
- Automatic learning of graph topology
- Applications to other domains
20Tertiary Fold Recognition: β-Helix Fold
- Histogram and ranks for known β-helices against the PDB-minus dataset
The chain graph model reduces the real running time of the SCRF model by a factor of around 50
21Fold Alignment Prediction: β-Helix
- Predicted alignment for known β-helices on cross-family validation
22Discovery of New Potential β-Helices
- Run the structural predictor to seek potential β-helices in UniProt (structurally unresolved) databases
- Full list (98 new predictions) can be accessed at www.cs.cmu.edu/yanliu/SCRF.html
- Verification on 3 proteins from different organisms whose structures were later resolved experimentally
- 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
- 1PXZ: The Major Allergen From Cedar Pollen
- GP14 of Shigella bacteriophage as a β-helix protein
- Not a single false positive!
23Previous Work
- Sequence similarity perspective
- Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997]
- Profile HMMs, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998]
- Window-based methods, e.g. PSIPRED [Jones, 2001]
- Limitation: these fail to capture structural properties and long-range dependencies
- Physical forces perspective
- Homology modeling or threading, e.g. Threader [Jones, 1998]
- Limitation: generative models based on a rough approximation of free energy perform very poorly on complex structures
- Structural biology perspective
- Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]
- Limitation: very hard to generalize due to built-in constants and fixed features
24Graphical Models
- A graphical model is a graph representation of probabilistic dependencies [Pearl, 1993; Jordan, 1999]
- Nodes: random variables
- Edges: dependency relations
- Directed graphical models (Bayesian networks)
- Undirected graphical models (Markov random fields)
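The difference between the two families can be sketched on two binary variables. All numbers are made up for illustration: the directed model multiplies conditional probability tables that are already normalized, while the undirected model multiplies arbitrary non-negative potentials and divides by a global normalizer Z.

```python
import itertools

# Directed model (Bayesian network) A -> B: P(a, b) = P(a) * P(b | a)
p_a = {0: 0.7, 1: 0.3}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.4, 1: 0.6}}

def directed_joint(a, b):
    return p_a[a] * p_b_given_a[a][b]

# Undirected model (Markov random field) A - B: P(a, b) = phi(a, b) / Z
phi = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 4.0}
Z = sum(phi.values())

def undirected_joint(a, b):
    return phi[(a, b)] / Z

pairs = list(itertools.product([0, 1], repeat=2))
print(sum(directed_joint(a, b) for a, b in pairs))    # sums to about 1
print(sum(undirected_joint(a, b) for a, b in pairs))  # sums to about 1
```

The global normalizer is what makes undirected models flexible about feature design and also what makes exact learning and inference expensive on complex graphs.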