Title: Protein Quaternary Fold Recognition Using Conditional Graphical Models
1 Protein Quaternary Fold Recognition Using Conditional Graphical Models
- Yan Liu, Jaime Carbonell
- Vanathi Gopalakrishnan (U Pitt), Peter Weigele (MIT)
- Language Technologies Institute
- School of Computer Science
- Carnegie Mellon University
- IJCAI-2007, Hyderabad, India
2 Snapshot of Cell Biology
3 Example Protein Structures
Triple beta-spiral fold in the adenovirus fiber shaft
Adenovirus fiber shaft
Virus capsid
4 Predicting Protein Structures
- Protein structure is a key determinant of protein function
- Crystallography to resolve protein structures experimentally in vitro is very expensive; NMR can only resolve very small proteins
- The gap between known protein sequences and known structures
  - 3,023,461 sequences vs. 36,247 resolved structures (1.2%)
- Therefore we need to predict structures in silico
5 Quaternary Folds and Alignments
- Protein fold
  - Identifiable regular arrangement of secondary structural elements
  - Thus far, a limited number of protein folds have been discovered (about 1,000)
- Very little research on quaternary folds
  - Complex structures and few labeled data
- Quaternary fold recognition
6 Previous Work
- Sequence similarity perspective
  - Sequence similarity searches, e.g. PSI-BLAST [Altschul et al., 1997]
  - Profile HMMs, e.g. HMMER [Durbin et al., 1998] and SAM [Karplus et al., 1998]
  - Window-based methods, e.g. PSIPRED [Jones, 2001]
  - Limitation: fail to capture structural properties and long-range dependencies
- Physical forces perspective
  - Homology modeling or threading, e.g. Threader [Jones, 1998]
  - Limitation: generative models based on a rough approximation of free energy; perform very poorly on complex structures
- Structural biology perspective
  - Painstakingly hand-engineered methods for specific structures, e.g. αα- and ββ-hairpins, β-turn and β-helix [Efimov, 1991; Wilmot and Thornton, 1990; Bradley et al., 2001]
  - Limitation: very hard to generalize due to built-in constants and fixed features
7 Conditional Random Fields
- Hidden Markov model (HMM) [Rabiner, 1989]
- Conditional random fields (CRFs) [Lafferty et al., 2001]
  - Model the conditional probability directly (discriminative models, directly optimizable)
  - Allow arbitrary dependencies in the observation
  - Adaptive to different loss functions and regularizers
  - Promising results in multiple applications
  - But need to scale up (computationally) and extend to long-distance dependencies
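As an illustration of "model the conditional probability directly", the following is a minimal sketch of a linear-chain CRF, not the authors' implementation. The score arrays `emit` and `trans` are hypothetical stand-ins for weighted feature sums that already depend on the observation x. The forward recursion computes log Z(x), and brute-force enumeration is used only to check that the conditional probabilities sum to one.

```python
import itertools
import numpy as np

def log_partition(emit, trans):
    """Forward algorithm in log space: log Z(x) for a linear-chain CRF.

    emit[t, k]  : score of label k at position t (assumed to depend on x)
    trans[j, k] : score of moving from label j to label k
    """
    alpha = emit[0].copy()
    for t in range(1, emit.shape[0]):
        # scores[j, k] = alpha[j] + trans[j, k] + emit[t, k]
        scores = alpha[:, None] + trans + emit[t][None, :]
        m = scores.max(axis=0)
        # log-sum-exp over the previous label j, for every current label k
        alpha = m + np.log(np.exp(scores - m).sum(axis=0))
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())

def sequence_score(labels, emit, trans):
    """Unnormalized log score of one complete label sequence."""
    s = emit[0, labels[0]]
    for t in range(1, len(labels)):
        s += trans[labels[t - 1], labels[t]] + emit[t, labels[t]]
    return s

def log_prob(labels, emit, trans):
    """log P(y | x) = score(y, x) - log Z(x)."""
    return sequence_score(labels, emit, trans) - log_partition(emit, trans)

# Toy problem: 4 positions, 3 labels, random scores (illustrative only).
rng = np.random.default_rng(0)
emit = rng.normal(size=(4, 3))
trans = rng.normal(size=(3, 3))
total = sum(np.exp(log_prob(y, emit, trans))
            for y in itertools.product(range(3), repeat=4))
```

The forward pass costs O(T K^2) for T positions and K labels, which is the exact-inference baseline that stops being available once long-range edges are added to the graph.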
8 Our Solution: Conditional Graphical Models
Long-range dependency
Local dependency
- Outputs Y = {M, W_i}, where W_i = {p_i, q_i, s_i}
- Feature definition
  - Node feature
  - Local interaction feature
  - Long-range interaction feature
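A minimal sketch of how such an output and the three feature types might be represented in code. The slide does not spell out the symbols, so the reading used here is an assumption: M is the number of segments, and p_i, q_i, s_i are the start position, end position, and state of segment i. All three feature functions are hypothetical examples, not the features used in the talk.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Segment:
    p: int   # assumed meaning: start position (inclusive)
    q: int   # assumed meaning: end position (inclusive)
    s: str   # assumed meaning: state label of the segment

def is_valid(segments: List[Segment], length: int) -> bool:
    """True if the segments tile positions 0..length-1 with no gaps or overlaps."""
    if not segments or segments[0].p != 0 or segments[-1].q != length - 1:
        return False
    if any(w.p > w.q for w in segments):
        return False
    return all(a.q + 1 == b.p for a, b in zip(segments, segments[1:]))

def node_feature(x: str, w: Segment) -> float:
    """Hypothetical node feature: fraction of hydrophobic residues in the segment."""
    seg = x[w.p:w.q + 1]
    return sum(c in "AVLIMFW" for c in seg) / len(seg)

def local_feature(a: Segment, b: Segment) -> float:
    """Hypothetical local interaction feature between two adjacent segments."""
    return 1.0 if (a.s, b.s) == ("strand", "turn") else 0.0

def long_range_feature(a: Segment, b: Segment) -> float:
    """Hypothetical long-range feature: two linked strands of equal length."""
    same_len = (a.q - a.p) == (b.q - b.p)
    return 1.0 if a.s == b.s == "strand" and same_len else 0.0

# Toy sequence and one candidate output Y = {M, W_i}.
x = "AVLIGSGDAVLI"
W = [Segment(0, 3, "strand"), Segment(4, 7, "turn"), Segment(8, 11, "strand")]
M = len(W)
```

Because M is part of the output, two candidate outputs for the same sequence can contain different numbers of segments, which is why the number of nodes in the graph varies.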
9 Linked Segmentation CRF
- Nodes: secondary structure elements and/or simple folds
- Edges: local interactions and long-range inter-chain and intra-chain interactions
- L-SCRF: the conditional probability of y given x is defined over this graph (equation shown on the slide)
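The equation itself is not preserved in this transcript. One general form consistent with the node and edge definitions above is given here as an assumption, not as the authors' exact notation. The symbols V (segment nodes), E (interaction edges), f_k and g_k (node and edge features), and θ (weights) are introduced only for illustration.

```latex
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
  \prod_{w_i \in V} \exp\Big( \sum_{k} \theta_{1,k}\, f_k(\mathbf{x}, w_i) \Big)
  \prod_{(w_a, w_b) \in E} \exp\Big( \sum_{k} \theta_{2,k}\, g_k(\mathbf{x}, w_a, w_b) \Big)
```

Here Z(x) is the sum of the same product over all candidate segmentations y, so both the node set V and the edge set E depend on y.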
10 Linked Segmentation CRF (II)
- Classification
- Training: learn the model parameters
  - Minimize the regularized negative log loss
  - Iterative search algorithms that seek the direction in which the empirical feature values agree with their expectations
- Complex graphs result in huge computational complexity
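The training bullets can be made concrete on a much smaller model. This is a minimal sketch assuming a single-node log-linear model (multiclass logistic regression) in place of the full L-SCRF; the feature map, the L2 regularizer weight `lam`, and the step size are all assumptions made for illustration. The gradient of the regularized negative log loss is the expected feature vector minus the empirical one, plus the regularizer term, so the search stops moving when the two agree.

```python
import numpy as np

def features(x, y, n_labels):
    """Hypothetical joint feature map f(x, y): x copied into the block for label y."""
    f = np.zeros(n_labels * x.size)
    f[y * x.size:(y + 1) * x.size] = x
    return f

def loss_and_grad(theta, X, Y, n_labels, lam):
    """Regularized negative log-likelihood and its gradient."""
    loss = 0.5 * lam * theta @ theta
    grad = lam * theta
    for x, y in zip(X, Y):
        F = np.stack([features(x, k, n_labels) for k in range(n_labels)])
        scores = F @ theta
        log_z = scores.max() + np.log(np.exp(scores - scores.max()).sum())
        p = np.exp(scores - log_z)          # model distribution P(k | x)
        loss += log_z - scores[y]
        grad = grad + p @ F - F[y]          # expected minus empirical features
    return loss, grad

# Toy data: the label depends on the sign of the first input coordinate.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
Y = (X[:, 0] > 0).astype(int)

theta = np.zeros(4)
history = []
for _ in range(500):
    loss, grad = loss_and_grad(theta, X, Y, n_labels=2, lam=0.1)
    history.append(loss)
    theta -= 0.01 * grad                    # plain gradient descent step
```

In the full model the expectation term requires inference over the whole graph, which is where the computational cost noted in the last bullet comes from.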
11 Approximate Inference of L-SCRF
- Most approximation algorithms cannot handle a variable number of nodes in the graph, but we need variable graph topologies, so
- Reversible jump MCMC sampling [Green, 1995; Schmidler et al., 2001] with four types of Metropolis operators
  - State switching
  - Position switching
  - Segment split
  - Segment merge
- Simulated annealing reversible jump MCMC [Andrieu et al., 2000]
  - Replace the sampling step with RJ MCMC
  - Theoretically converges on the global optimum
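The four operators can be sketched as moves on a segmentation, combined with an annealed Metropolis acceptance rule. This is a simplified illustration under stated assumptions: the `score` function is a toy stand-in for log P(y | x), and the acceptance test omits the proposal-ratio correction that a proper reversible jump sampler needs when a move changes the number of segments. It is therefore an annealed stochastic search using the four move types, not an exact RJ MCMC sampler.

```python
import math
import random

STATES = ("strand", "coil")

def score(segs, target, seg_penalty=0.5):
    """Toy stand-in for log P(y | x): label agreement minus a per-segment penalty."""
    agree = sum(target[i] == s for p, q, s in segs for i in range(p, q + 1))
    return agree - seg_penalty * len(segs)

def propose(segs, rng):
    """Apply one randomly chosen move; return a new list, or None if not applicable."""
    segs = list(segs)
    move = rng.choice(("state", "position", "split", "merge"))
    i = rng.randrange(len(segs))
    p, q, s = segs[i]
    if move == "state":                          # state switching
        segs[i] = (p, q, rng.choice([t for t in STATES if t != s]))
    elif move == "split" and q > p:              # segment split
        cut = rng.randint(p, q - 1)
        segs[i:i + 1] = [(p, cut, s), (cut + 1, q, rng.choice(STATES))]
    elif move in ("position", "merge") and i + 1 < len(segs):
        p2, q2, s2 = segs[i + 1]
        if move == "merge":                      # segment merge
            segs[i:i + 2] = [(p, q2, s)]
        else:                                    # position switching
            new_q = q + rng.choice((-1, 1))
            if new_q < p or new_q >= q2:         # would empty one of the segments
                return None
            segs[i:i + 2] = [(p, new_q, s), (new_q + 1, q2, s2)]
    else:
        return None
    return segs

def anneal(target, steps=5000, t0=2.0, cooling=0.999, seed=0):
    """Annealed search; the RJ proposal-ratio correction is deliberately omitted."""
    rng = random.Random(seed)
    current = [(0, len(target) - 1, STATES[0])]
    cur_score = score(current, target)
    best, best_score = current, cur_score
    temp = t0
    for _ in range(steps):
        cand = propose(current, rng)
        if cand is not None:
            delta = score(cand, target) - cur_score
            if delta >= 0 or rng.random() < math.exp(delta / temp):
                current, cur_score = cand, cur_score + delta
                if cur_score > best_score:
                    best, best_score = current, cur_score
        temp *= cooling                          # geometric cooling schedule
    return best, best_score

# Toy target labeling with four runs (illustrative only).
target = ["coil"] * 5 + ["strand"] * 6 + ["coil"] * 4 + ["strand"] * 5
best, best_score = anneal(target)
```

Split and merge are the moves that change the number of nodes, which is what makes fixed-topology approximation algorithms unsuitable here.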
12 Experiments: Target Quaternary Folds
- Triple beta-spirals [van Raaij et al., Nature 1999]
  - Virus fibers in adenovirus, reovirus and PRD1
- Double barrel trimer [Benson et al., 2004]
  - Coat protein of adenovirus, PRD1, STIV, PBCV
13 Features for Protein Fold Recognition
14 Tertiary Fold Recognition: β-Helix Fold
- Histogram and ranks for known β-helices against the PDB-minus dataset
- The chain graph model reduces the real running time of the SCRF model by a factor of about 50
15 Fold Alignment Prediction: β-Helix
- Predicted alignment for known β-helices on cross-family validation
16 Discovery of New Potential β-Helices
- Run the structural predictor to seek potential β-helices in UniProt (structurally unresolved) databases
  - Full list (98 new predictions) can be accessed at www.cs.cmu.edu/yanliu/SCRF.html
- Verification on 3 proteins with later experimentally resolved structures from different organisms
  - 1YP2: Potato Tuber ADP-Glucose Pyrophosphorylase
  - 1PXZ: The Major Allergen From Cedar Pollen
  - GP14 of Shigella bacteriophage as a β-helix protein
- No single false positive!
17 Experiment Results: Fold Recognition
Triple beta-spirals
18 Experiment Results: Alignment Prediction
19 Experiment Results: Discovery of New Membership Proteins
- Predicted membership proteins of triple beta-spirals can be accessed at http://www.cs.cmu.edu/yanliu/swissprot_list.xls
- Membership proteins of the double barrel trimer suggested by biologists [Benson, 2005] compared with L-SCRF predictions
20 Conclusion
- Conditional graphical models for protein structure prediction
  - Effective representation for protein structural properties
  - Feasible to incorporate different kinds of informative features
  - Efficient inference algorithms for large-scale applications
- A major extension compared with previous work
  - Knowledge representation through graphical models
  - Ability to handle long-range interactions within one chain and between chains
- Future work
  - Automatic learning of graph topology
  - Applications to other domains
22 Graphical Models
- A graphical model is a graph representation of probability dependencies [Pearl, 1993; Jordan, 1999]
  - Nodes: random variables
  - Edges: dependency relations
- Directed graphical models (Bayesian networks)
- Undirected graphical models (Markov random fields)
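The two model families can be contrasted on a three-variable chain. This is a minimal sketch with made-up numbers: the conditional probability tables and the pairwise potentials are hypothetical. The directed model multiplies local conditional distributions and is normalized by construction, while the undirected model multiplies non-negative potentials and needs an explicit normalizer Z.

```python
import itertools
import numpy as np

# Directed model (Bayesian network) A -> B -> C over binary variables.
p_a = np.array([0.6, 0.4])
p_b_given_a = np.array([[0.7, 0.3], [0.2, 0.8]])   # rows: a, columns: b
p_c_given_b = np.array([[0.9, 0.1], [0.5, 0.5]])   # rows: b, columns: c

def directed_joint(a, b, c):
    """P(a, b, c) = P(a) P(b | a) P(c | b)."""
    return p_a[a] * p_b_given_a[a, b] * p_c_given_b[b, c]

# Undirected model (Markov random field) A - B - C with pairwise potentials.
psi_ab = np.array([[2.0, 1.0], [1.0, 3.0]])
psi_bc = np.array([[1.5, 0.5], [0.5, 2.0]])

states = list(itertools.product((0, 1), repeat=3))
Z = sum(psi_ab[a, b] * psi_bc[b, c] for a, b, c in states)

def undirected_joint(a, b, c):
    """P(a, b, c) = psi_ab(a, b) psi_bc(b, c) / Z."""
    return psi_ab[a, b] * psi_bc[b, c] / Z

directed_total = sum(directed_joint(*s) for s in states)
undirected_total = sum(undirected_joint(*s) for s in states)
```

Enumerating all states to compute Z is only feasible for tiny graphs, which is why inference cost is the central difficulty for undirected models such as CRFs.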