Title: CS882: Protein Structure Prediction
1CS882 Protein Structure Prediction
- Jinbo Xu
- School of Computer Science, University of Waterloo
2Outline
- Protein Structure Basics
- Protein Structure Prediction
- Prediction Assessment
- Linear Program Approach to Protein Threading
3Relevance of Protein Structure in the Post-Genome Era
[Diagram relating sequence, structure, function, and medicine]
4A Protein Sequence
>gi|22330039|ref|NP_683383.1| unknown protein
protein id: At1g45196.1, Arabidopsis thaliana
MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSS
ASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL DSARSSFSV
ALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTN
KSSVFPSPGTPTYLHSMQKGW SSERVPLRSNGGRSPPNAGFLPLYSGRT
VPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYS
LY SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPS
MARSVSIHGCSETLASSSQDDIHESMKDAATDA QAVSRRDMATQMSPEG
SIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWS
KKHRGLYHGNGSKM
5Amino Acids
Side chain
Each amino acid is identified by its side chain,
which determines the properties of this amino
acid.
6Side Chain Properties
Hydrophobic amino acids tend to stay in the interior of a protein,
while hydrophilic ones tend to stay on the exterior. Oppositely
charged amino acids can form salt bridges. Polar amino acids can
participate in hydrogen bonding.
The amino acid names are colored according to their type: positively
charged, negatively charged, polar but not charged, aliphatic
(nonpolar), and aromatic. Amino acids that are essential to mammals
are marked with an asterisk (*).
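The five side-chain types named above can be written down as a small lookup table. This is a minimal sketch: the grouping below is an assumption (textbooks disagree on borderline residues such as G, P, C, H, and Y), and the names `SIDE_CHAIN_CLASSES`, `classify`, and `class_composition` are hypothetical helpers, not part of the slides.

```python
from collections import Counter

# Assumed grouping by one-letter code; borderline residues (G, P, C, H, Y)
# are classified differently in different textbooks.
SIDE_CHAIN_CLASSES = {
    "positively charged": "KRH",
    "negatively charged": "DE",
    "polar uncharged": "STNQC",
    "aliphatic (nonpolar)": "GAVLIMP",
    "aromatic": "FWY",
}

# Invert the table: residue letter -> class name.
RESIDUE_TO_CLASS = {
    residue: name
    for name, residues in SIDE_CHAIN_CLASSES.items()
    for residue in residues
}


def classify(residue):
    """Return the assumed side-chain class of a one-letter residue code."""
    return RESIDUE_TO_CLASS[residue.upper()]


def class_composition(sequence):
    """Count how many residues of each class occur in a sequence."""
    return Counter(classify(r) for r in sequence)


if __name__ == "__main__":
    # First ten residues of the example sequence on slide 4.
    print(class_composition("MPSESSYKVH"))
```

A composition like this is only a coarse summary; it says nothing about where in the fold each residue ends up.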
7Levels of Protein Structures
- Primary sequence
- Amino acid (residue) sequence
- Secondary structure
- Local arrangement, such as helix, beta sheet, loop
- Tertiary structure
- Overall spatial conformation
- Quaternary structure
- Spatial relationship among multiple chains
8Beta Sheet Examples
Anti-parallel beta sheet
Parallel beta sheet
9Beta Sheet Examples (Contd)
10Helix Examples
11Protein Structure Example
Beta Sheet
Helix
Loop
PDB ID: 12as, 2 chains
12What determines structures?
- Hydrogen bonds: essential in stabilizing the basic secondary structures
- Hydrophobic effects: strongest determinants of protein structures
- Van der Waals forces: stabilize the hydrophobic cores
- Electrostatic forces: oppositely charged side chains form salt bridges
13Domain, Motif, Fold
- Domain: a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function.
- Most proteins have multiple domains.
- The overall shape of a domain is called a fold. There are only a few thousand possible folds.
- Super-secondary structure, motif
- Frequently occurring structure patterns among multiple proteins, which do not necessarily have similar folds.
14Protein Structure Prediction
- Problem
- Given the amino acid sequence of a protein, what is its shape in three-dimensional space?
- Subproblems
- Secondary structure prediction
- Residue-residue contact prediction
- Angle prediction
15Why Prediction Needed?
- The function of a protein is determined by its structure.
- Experimental methods to determine protein structure are time-consuming and expensive.
- There is a big gap between the number of available protein sequences and the number of available structures.
16Growth of Protein Sequences and Structures
Data from http://www.dna.affrc.go.jp
17Protein Classification
- Family: the proteins in the same family are homologous, evolved from the same ancestor. Usually, the identity of two sequences is very high. Similar structures.
- Super Family: distant homologous sequences, evolved from the same ancestor. Sequence identity is around 25-30%. Similar structures.
- Fold: only shapes are similar, no homologous relationship. Usually, sequence identity is very low.
- Protein classification databases: SCOP, CATH
18Target Sequence Category
- Homology Modeling (HM) targets
- Easy HM: has a homologous protein with known structure
- Hard HM: has a distant homologous protein with known structure
- Also called Comparative Modeling (CM) targets
- Fold Recognition (FR) targets
- Can find a protein with known structure having the same fold as the target
- New Fold (NF) targets
19Observations
- Sequences determine structures
- Proteins fold into the minimum-energy state.
- Structures are more conserved than sequences. If two protein sequences share 30% identical residues, then they have a very good chance of having the same fold.
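To make the identity threshold concrete, percent identity can be computed from two already-aligned sequences. This is a minimal sketch: the function name `percent_identity` is hypothetical, and dividing by the number of gap-free columns is one assumed convention among several in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free aligned columns.

    Assumes both strings come from the same alignment, so they have
    equal length and use `gap` for insertions and deletions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Toy alignment: 5 gap-free columns, 4 of them identical.
    print(percent_identity("ACDE-G", "ACDFHG"))
```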
20Prediction Methods
- Ab initio folding: NF targets, build a structure without referring to an existing structure
- Homology Modeling: HM targets, sequence-based method
- Protein Threading: FR targets, sequence-structure alignment
- Consensus Method: vote a prediction from candidates generated by several prediction programs
21Ab Initio Folding
- Based on first principles
- Build structures purely from protein sequences, no templates used
- Unaffordable computing demands
- The paradigm is changing: knowledge-based methods are being proposed
22Ab Initio Energy Function
23Lattice Model
- Arrange all the atoms at grid points by Monte Carlo simulation
- Pure Lattice Model only works for very small proteins
- Use simplified representation
- Adding constraints such as partial NMR data or predicted residue-residue contacts can speed up convergence
Taken from Jeff Skolnick et al.
24Segment Assembly
- Also called Mini-Threading
- Baker's method
- Construct a library of small structure fragments, each of which has length 9.
- Design a threading method to predict the structure of a protein sequence segment of length 9.
- Cut a target sequence of length len into (len-9+1) overlapping sequence segments of length 9. For each sequence segment, choose some candidate fragments from the fragment library.
- Assemble these fragments by Monte Carlo simulation
- Thousands of simulations are done. The generated structures are grouped into clusters. Clusters are ranked by their average energy.
- D. Jones uses super-secondary structure as the basic construction unit.
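The segment-cutting step can be sketched as a sliding window. This is a minimal illustration, assuming the window advances one residue at a time, so a sequence of length len yields len - 9 + 1 overlapping segments; `sliding_segments` is a hypothetical helper name, and the fragment library lookup and Monte Carlo assembly are not shown.

```python
def sliding_segments(sequence, length=9):
    """Return every overlapping window of `length` residues, in order.

    A sequence shorter than `length` yields an empty list.
    """
    return [sequence[i:i + length]
            for i in range(len(sequence) - length + 1)]


if __name__ == "__main__":
    # First twelve residues of the example sequence on slide 4.
    for segment in sliding_segments("MPSESSYKVHRP"):
        print(segment)
```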
25Homology Modeling
- Search for homologous proteins with sequence search tools such as PSI-BLAST
- Multiple sequence alignment (key step)
- Identify cores and loops: conserved segments are cores, the rest are loops
- Core modeling: copy backbone coordinates from the homologous protein with known structure
- Loop modeling: search fragment library
- Side chain modeling: search rotamer library
- Refinement: tools such as WHAT IF, PROCHECK, and Verify3D can be used
26Protein Threading
- Make a structure prediction by finding an optimal alignment (placement) of a protein sequence onto each known structure (structural template)
- Alignment quality is measured by a statistics-based scoring function
- The best overall alignment among all templates may give a structure prediction
27Threading Example
28PDB New Fold Growth
Old fold
New fold
- The number of unique folds in nature is fairly small (possibly a few thousand)
- 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB
29Protein Threading Procedures
- Step 1: Construction of Template Library
- Step 2: Design of Scoring Function
- Step 3: Sequence-Structure Alignment
- Step 4: Template Selection and Model Construction
Only step 1 is relatively easy!
30Template Database
- A representative set of protein structures extracted from the PDB database. It satisfies the following conditions:
- The resolution of each representative structure should be good
- A good X-ray structure has higher priority than an NMR structure
- The sequence identity between any two representatives should be no more than 30%, in order to save computing time.
- Examples
- CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
- SCOP: http://scop.mrc-lmb.cam.ac.uk/scop/
- PDB_SELECT: http://www.cmbi.kun.nl/gv/pdbsel/
31Scoring Function
E_s (fitness score): how well a residue fits a structural environment
E_p (pairwise potential): how preferable it is to put two particular residues nearby
E_m (mutation score): sequence similarity between query and template proteins
E_g (gap score): alignment gap penalty
E_ss (secondary structure score): how consistent the secondary structures are
E = E_p + E_s + E_m + E_g + E_ss
Minimize E to find a sequence-template alignment
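The total score is a plain sum of the five terms, and the preferred alignment is the one with the lowest total. The sketch below assumes unweighted terms as written on the slide; `AlignmentEnergy`, `best_alignment`, and the numeric values are hypothetical and purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class AlignmentEnergy:
    """The five score terms of one sequence-template alignment."""
    pairwise: float              # E_p
    fitness: float               # E_s
    mutation: float              # E_m
    gap: float                   # E_g
    secondary_structure: float   # E_ss

    def total(self):
        # E = E_p + E_s + E_m + E_g + E_ss (no weights assumed)
        return (self.pairwise + self.fitness + self.mutation
                + self.gap + self.secondary_structure)


def best_alignment(candidates):
    """Return the name of the candidate alignment with the lowest E."""
    return min(candidates, key=lambda name: candidates[name].total())


if __name__ == "__main__":
    # Made-up numbers for two hypothetical candidate alignments.
    candidates = {
        "alignment_1": AlignmentEnergy(-3.0, -1.5, -2.0, 1.0, -0.5),
        "alignment_2": AlignmentEnergy(-1.0, -1.0, -1.0, 2.0, 0.0),
    }
    print(best_alignment(candidates))
```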
32Nonpairwise Threading Programs
- Sequence-sequence alignment
- Sequence-profile alignment
- Sequence-HMM model alignment
- e.g. SAMT02 (K. Karplus et al.)
- Profile-sequence alignment
- e.g. PDB-Blast (A. Godzik et al.)
- Profile-profile alignment
- e.g. PROSPECT-II (Y. Xu et al.)
- Combinations of several alignments
- e.g. 3DPS (L.A. Kelley et al), SHGU (D. Fischer)
33Pairwise Threading Algorithms
- Approximation Algorithm
- Interaction-Frozen Algorithm (A. Godzik et al.)
- Monte Carlo Sampling (S.H. Bryant et al.)
- Double dynamic programming (D. Jones et al.)
- Exact Algorithm
- Branch-and-bound (R.H. Lathrop and T.F. Smith)
- PROSPECT-I uses Divide-and-conquer (Y. Xu et al.)
- Linear programming by RAPTOR (J. Xu et al.)
34Fold Recognition and Model Building
- Protein threading algorithms can only generate sequence-template alignments
- Correct templates should be chosen from the template database
- zScore, Neural Network, SVM
- Some tools are needed to generate the 3D structure of the sequence from the sequence-template alignment
- MODELLER (http://salilab.org/modeller/modeller.html)
- MaxSprout (http://www.ebi.ac.uk/maxsprout/)
- Jackal (http://honiglab.cpmc.columbia.edu/programs/jackal/intro.html)
35CASP/CAFASP
- CASP: Critical Assessment of Structure Prediction
- CAFASP: Critical Assessment of Fully Automated Structure Prediction
CASP Predictor
CAFASP Predictor
- Won't get tired
- High-throughput
36CASP/CAFASP (contd)
- Public
- Organized by the structure community
- Evaluated by an unbiased third party
- Held every two years
- Blind
- Experimental structures are to be determined by structure centers after the competition
- Drawback: <100 targets
- Blindness
- Some centers are reluctant to release their structures
37Threading Model
- Each template is parsed as a chain of cores. Two adjacent cores are connected by a loop. Cores are the most conserved segments in a protein.
- No gap is allowed within a core.
- Only pairwise contacts between two core residues are considered, because contacts involving loop residues are not well conserved.
- Global alignment is employed
38Contact Graph
- Each residue is a vertex
- One edge between two residues if their spatial distance is within a given cutoff.
- Cores are the most conserved segments in the template
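The contact graph definition translates directly into a pairwise distance test. This is a minimal sketch, assuming one representative 3D point per residue and a 7.0 cutoff chosen only for illustration (the slide says just "given cutoff"); `contact_graph` is a hypothetical helper name.

```python
import math
from itertools import combinations


def contact_graph(coords, cutoff=7.0):
    """Return the set of edges (i, j), i < j, between residues in contact.

    `coords` holds one (x, y, z) point per residue, for example a
    C-alpha or C-beta position. The default cutoff is an assumption.
    """
    edges = set()
    for i, j in combinations(range(len(coords)), 2):
        if math.dist(coords[i], coords[j]) <= cutoff:
            edges.add((i, j))
    return edges


if __name__ == "__main__":
    # Four toy residues on a line; only neighbours 0-1 and 1-2 are close.
    toy = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (20.0, 0.0, 0.0)]
    print(sorted(contact_graph(toy)))
```

The all-pairs loop is quadratic in the number of residues, which is acceptable for single templates; restricting the vertices to core residues gives the simplified graph used for threading.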
39Simplified Contact Graph
40Contact Graph and Alignment Diagram
41Contact Graph and Alignment Diagram
42Calculation of Alignment Score
43Hardness of Protein Threading
- Protein Threading is NP-hard
- Proof
- Reduce the Max-Cut problem to this problem
- Given a graph, number its nodes in a certain order. Assume there are M nodes.
- Consider a sequence of length 2M of the form PHPH...PH
- The pairwise score is 1 only if two different types of residues are mapped to the two ends of one graph edge, otherwise 0
44Linear Integer Program
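maximize $z = 6x + 5y$ (linear function)
subject to
$3x + y \le 11$, $-x + 2y \le 5$, $x, y \ge 0$ (linear constraints)
$x, y$ integer (integral constraints, nonlinear)
Linear program: the objective with the linear constraints only
Integer program: the linear program plus the integral constraints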
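The gap between the linear program and the integer program can be checked by hand on this small example. The sketch below uses no solver: it assumes the constraints are 3x + y <= 11, -x + 2y <= 5, and x, y >= 0, enumerates the vertices of the feasible region for the relaxation, and brute-forces the integer points. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

# Each constraint is (a, b, c), meaning a*x + b*y <= c.
CONSTRAINTS = [
    (3, 1, 11),    # 3x + y <= 11
    (-1, 2, 5),    # -x + 2y <= 5
    (-1, 0, 0),    # x >= 0
    (0, -1, 0),    # y >= 0
]


def objective(x, y):
    return 6 * x + 5 * y


def feasible(x, y):
    return all(a * x + b * y <= c for a, b, c in CONSTRAINTS)


def solve_lp_relaxation():
    """Best (z, x, y) over the vertices of the bounded feasible region."""
    best = None
    for (a1, b1, c1), (a2, b2, c2) in combinations(CONSTRAINTS, 2):
        det = a1 * b2 - a2 * b1
        if det == 0:
            continue  # parallel boundary lines never intersect
        # Cramer's rule for the intersection of the two boundary lines.
        x = Fraction(c1 * b2 - c2 * b1, det)
        y = Fraction(a1 * c2 - a2 * c1, det)
        if feasible(x, y):
            candidate = (objective(x, y), x, y)
            if best is None or candidate > best:
                best = candidate
    return best


def solve_integer_program(limit=12):
    """Best (z, x, y) over integer points with 0 <= x, y < limit."""
    best = None
    for x in range(limit):
        for y in range(limit):
            if feasible(x, y):
                candidate = (objective(x, y), x, y)
                if best is None or candidate > best:
                    best = candidate
    return best


if __name__ == "__main__":
    print("LP relaxation:", solve_lp_relaxation())
    print("Integer program:", solve_integer_program())
```

The relaxation reaches z = 232/7 at the fractional point (17/7, 26/7), while the best integer point is (3, 2) with z = 28, so the relaxed optimum only bounds the integer optimum from above.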
45Linear Integer Program
- Linear programs can be solved in polynomial time
- No polynomial-time algorithm is known for integer programs so far
- Relax the integer program to a linear program and solve the linear version
- Branch-and-bound or branch-and-cut (may cost exponential time)
46Variables
- x(i,l) denotes that core i is aligned to sequence position l
- y(i,l,j,k) denotes that core i is aligned to position l and core j is aligned to position k at the same time.
47Standard Formulation
a: singleton score parameter; b: pairwise score parameter
Each y variable is 1 if and only if its two x variables are 1
Each core has only one alignment position
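The slide's equations did not survive extraction. One standard way to write the two stated conditions as a 0-1 program is sketched below; it is an assumption consistent with the text, not necessarily the exact form on the slide. The objective is written as a minimization to match the scoring-function slide, the coefficients $a_{i,l}$ and $b_{i,l,j,k}$ are the singleton and pairwise score parameters, and constraints that keep cores in order and non-overlapping are omitted.

```latex
\begin{align*}
\min \quad & \sum_{i,l} a_{i,l}\, x_{i,l} + \sum_{i,l,j,k} b_{i,l,j,k}\, y_{i,l,j,k} \\
\text{s.t.} \quad & \sum_{l} x_{i,l} = 1 && \text{for every core } i \\
& y_{i,l,j,k} \le x_{i,l}, \qquad y_{i,l,j,k} \le x_{j,k} \\
& y_{i,l,j,k} \ge x_{i,l} + x_{j,k} - 1 \\
& x_{i,l},\; y_{i,l,j,k} \in \{0, 1\}
\end{align*}
```

The three inequalities on $y$ force $y_{i,l,j,k} = 1$ exactly when both $x_{i,l}$ and $x_{j,k}$ are 1, and the equality gives each core exactly one alignment position.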
48Better Formulation
a: singleton score parameter; b: pairwise score parameter
Each y variable is 1 if and only if its two x variables are 1
Each core has only one alignment position
- 99% of real threading instances generate integral solutions directly
- The fractional solution space of this formulation is a subset of that of the previous one
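The equations for this slide are also missing. A tighter way to link the variables, sketched here as an assumption rather than the slide's exact form, replaces per-variable inequalities with marginal sums for each pair of interacting cores $i$ and $j$:

```latex
\begin{align*}
& \sum_{l} x_{i,l} = 1 && \text{for every core } i \\
& \sum_{k} y_{i,l,j,k} = x_{i,l} && \text{for every position } l \text{ of core } i \\
& \sum_{l} y_{i,l,j,k} = x_{j,k} && \text{for every position } k \text{ of core } j \\
& x_{i,l},\; y_{i,l,j,k} \in \{0, 1\}
\end{align*}
```

Because $y \ge 0$, each sum constraint implies $y_{i,l,j,k} \le x_{i,l}$ and $y_{i,l,j,k} \le x_{j,k}$, and together with $\sum_k x_{j,k} = 1$ it also implies $y_{i,l,j,k} \ge x_{i,l} + x_{j,k} - 1$. Every point feasible for the relaxed version of this system therefore satisfies the inequality-based linking constraints too, which is consistent with the subset claim above.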
49Integrality of Real Instances
50Correlation Coefficient between Fractional Solutions and Templates, Sequences
edges: the number of edges in the template contact graph.
51Integrality Summary
- 99% of real instances could be solved by linear programming directly, no additional branch-and-bound needed
- No particular template or sequence was found to generate more fractional solutions
- No particular feature of templates or sequences was found to lead to fractional solutions
- Explanation?
52zScore to Fold Recognition
- It is defined as the deviation of the alignment score from the expected score
- The expected alignment score is calculated by random shuffling
- Fix the sequence-template alignment positions
- Randomly shuffle the sequence and recalculate the alignment score
- Calculate the mean and standard deviation of the alignment scores generated by random shuffling
- zScore equals (mean - alignment score) / standard deviation
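The shuffling procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions: lower scores are better, the deviation is normalized by the standard deviation of the shuffled scores (the usual z-score convention), and `toy_score` with its `PREFERRED` string is a made-up stand-in for a real threading score evaluated at fixed alignment positions.

```python
import random
import statistics


def z_score(sequence, score_fn, n_shuffles=100, seed=0):
    """(mean shuffled score - observed score) / std of shuffled scores.

    `score_fn` scores a sequence with the alignment positions held
    fixed, so shuffling changes only which residue sits where.
    """
    rng = random.Random(seed)
    observed = score_fn(sequence)
    residues = list(sequence)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(score_fn("".join(residues)))
    mean = statistics.mean(shuffled_scores)
    std = statistics.pstdev(shuffled_scores)
    if std == 0:
        raise ValueError("all shuffled scores are identical")
    return (mean - observed) / std


# Hypothetical stand-in for a threading score: one preferred residue
# per template position, and one penalty point per mismatch.
PREFERRED = "ACDEFGHIKL"


def toy_score(sequence):
    return sum(1 for a, b in zip(sequence, PREFERRED) if a != b)


if __name__ == "__main__":
    print(z_score("ACDEFGHIKL", toy_score, n_shuffles=200))
```

A large positive value means the real sequence scores much better than its shuffled versions, which is the signal used to pick templates.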
53SVM to Fold Recognition
- SVM classification to classify threading pairs
- Features
- Template length, sequence length, alignment length
- Mutation score, fitness score, secondary structure score, pairwise score
- Gap penalty
54CAFASP3/CASP5
- Same target set: 62 targets
- Time for each target
- Individual servers: 48 hours
- Meta servers: 96 hours
- CASP5 predictors: May to September of 2002
- Resources for predictors
- No X-ray, NMR machines (of course)
- CAFASP3 predictors: no manual intervention
- CASP5 predictors: anything (servers, Google, ...)
- Evaluation
- CASP5: assessed by experts + computer
- CAFASP3: evaluated by MaxSub, a computer program. Predicted structures are superimposed onto the experimental structures to see how large a portion is superimposable.
55CASP5/CAFASP3 targets
Prediction difficulty (axis labels: Easy, Hard)
CM: Comparative Modelling, HM: Homology Modelling, FR: Fold Recognition, NF: New Fold
56State of the Art
Servers with names in italics are meta servers
(http://www.cs.bgu.ac.il/~dfischer/CAFASP3, released in December 2002.)
57RAPTOR's sensitivity on FR targets
1. RAPTOR is weak at recognizing FR(A) targets (needs improvement)
2. RAPTOR could not deal with NF targets at all (as expected)
58CAFASP3 Example
- Target size: 144
- Superimposable size within 5 Å: 118
- RMSD: 1.9
Blue: correct prediction
Red: experimental structure
Green: incorrect prediction
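The two figures reported for this target, the number of residues within a distance cutoff and the RMSD, can be computed as follows once the predicted and experimental structures are already superimposed. This is a minimal sketch: it does not perform the superposition or the subset search that a program such as MaxSub carries out, and the function names and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired, superimposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def count_within(coords_a, coords_b, cutoff=5.0):
    """Number of paired residues whose distance is at most `cutoff`."""
    return sum(1 for p, q in zip(coords_a, coords_b)
               if math.dist(p, q) <= cutoff)


if __name__ == "__main__":
    # Toy coordinates: residue distances are 0, 3, and 6.
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    experimental = [(0.0, 0.0, 0.0), (1.0, 3.0, 0.0), (2.0, 0.0, 6.0)]
    print(count_within(predicted, experimental), rmsd(predicted, experimental))
```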
59Term Project Options
- Design a machine learning approach to fold recognition
- Design a template database from scratch, based only on the PDB database, and compare it to other databases
- Protein secondary structure prediction
- Local alignment for protein threading
- Critical review of three protein threading algorithms: branch-and-bound, divide-and-conquer, and linear programming
60Open Questions
- A practical and exact algorithm for the protein threading problem allowing gaps within cores
- A practical and exact algorithm for the ab initio protein folding problem
- Investigate the conditions under which the linear program will generate integral solutions directly
61Acknowledgements
- Bonnie Berger, Introduction to Computational Molecular Biology, course notes, 2001
- Bin Ma, Bioinformatics, course notes, 2004
62Reading List
- CASP1, CASP2, CASP3, CASP4, and CASP5 Special Issues, Proteins: Structure, Function, and Genetics, 1995, 1997, 1999, 2001, 2003