Title: Structure Prediction
1. Structure Prediction
2. Methods
- Ab initio
- Heuristics
- Machine learning
- Homology modeling
- Threading
3. RNA Structure Prediction: Ab initio
- The sequence is a string over A, C, G, U
- Complementary bases attract and form base pairs, which lowers (minimizes) the energy
- We are not interested in the overall energy of the sequence, only in the process of minimization
- The bare linear sequence, with zero base pairs, has energy = 0
- The physics is embedded in the free-energy parameters/function
- Minimization of energy is the objective
4. RNA Structure Prediction: Knot-free
- Knot-free assumption
- Knot: base pairs $(i, j)$ and $(k, l)$ where $i < k < j < l$ (the two pairs cross)
- A knot-free structure forms a planar graph, which makes a DP algorithm feasible
- Any two base pairs are either disjoint or nested one inside the other
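The knot-free condition can be checked directly from a list of base pairs. The following is a minimal sketch, assuming a structure is given as a list of 0-based index pairs; the function name `is_knot_free` is hypothetical and not from the slides.

```python
from itertools import combinations

def is_knot_free(pairs):
    """Return True if no two base pairs cross, i.e. no i < k < j < l."""
    # Normalize each pair so that the smaller index comes first.
    normalized = [tuple(sorted(p)) for p in pairs]
    for (i, j), (k, l) in combinations(normalized, 2):
        # Order the two pairs so that (i, j) opens first.
        if i > k:
            i, j, k, l = k, l, i, j
        # Crossing pairs: the second pair opens inside the first
        # and closes outside it.
        if i < k < j < l:
            return False
    return True
```

Nested pairs such as (0, 9) and (2, 6) and disjoint pairs such as (0, 3) and (5, 8) pass the check; crossing pairs such as (0, 5) and (3, 8) do not.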
5. RNA Structure Prediction: Principle of optimality
- Assumption 1: base pairings do not affect each other's energy
- One can then add up the energies of all base pairs in a string and check which configuration produces the lowest energy
- The number of configurations is exponential
- A further assumption is needed
6. RNA Structure Prediction: DP Algorithm
- Assume the energy of each component can be calculated independently
- $a(r,k)$: free energy of the base pair $(r,k)$, where $r, k \in \{A, C, G, U\}$
- $a$ is zero for self-pairing (which is impossible)
7. RNA Structure Prediction: DP Algorithm
- $E(S_{i,j}) = \min$ of:
- $E(S_{i+1,j-1}) + a(r_i, r_j)$, when $i$ and $j$ pair
- $\min_{i<k<j} \{ E(S_{i,k-1}) + E(S_{k,j}) \}$, when $j$ pairs with $k$
- Compute the $n \times n$ matrix over $i$ and $j$ bottom-up, for $j-i = 0, 1, 2, \ldots$
- Complexity: $O(n^3)$
8. RNA Structure Prediction: Relaxing assumptions
- Consider special energy functions beyond the base-pairing energies $a(r,k)$
- This means distinguishing different types of base-pairing arrangements
- This allows a more practical topology
9. RNA Structure Prediction: Loops
- Say there is a base pair at $(i,j)$ and $i < u < v < w < j$
- $v$ is accessible from the base pair $(i,j)$ if there is no base pair at $(u,w)$
- A loop is the set of bases accessible from the base pair $(i,j)$
- Note: still no knots
- Some loop types: see p. 249
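The accessibility definition translates into a short filter. This sketch assumes a knot-free structure given as 0-based pairs `(u, w)` with `u < w`; the function name `loop_bases` is hypothetical.

```python
def loop_bases(pairs, i, j):
    """Bases accessible from the closing pair (i, j).

    A base v with i < v < j is accessible if no pair (u, w)
    with i < u < v < w < j encloses it.
    """
    # Only pairs strictly inside (i, j) can hide a base.
    inner = [(u, w) for (u, w) in pairs if i < u and w < j]
    return [
        v for v in range(i + 1, j)
        if not any(u < v < w for (u, w) in inner)
    ]
```

For the structure `[(0, 9), (2, 6)]`, the loop closed by (0, 9) contains bases 1, 2, 6, 7, 8: the endpoints of the inner pair are accessible, while bases 3 to 5 are hidden inside it.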
10. RNA Structure Prediction: Energy over loops
- Say the base pair $(i,j)$ closes a loop
- Within the best structure for $S_{i,j}$, the substring $S_{i+1,j-1}$ may not be in its own minimum-energy configuration
- This is because the energy of some configuration of $S_{i+1,j-1}$ plus the free energy $a(r_i,r_j)$ may be lower than what the minimum-energy configuration of the string ($i+1$ to $j-1$), computed without the base pair at $(i,j)$, would give
- This interaction was ignored at the previous assumption level
- Dynamic programming can still be used if we explicitly specify the energy parameters
11. RNA Structure Prediction: Energy over loops
- $E(S_{i,j}) = \min$ of:
- $E(S_{i+1,j})$, when $i$ is not paired
- $E(S_{i,j-1})$, when $j$ is not paired
- $\min_{i<k<j} \{ E(S_{i,k-1}) + E(S_{k,j}) \}$, when $i$ or $j$ pairs with $k$
- $E(L_{i,j})$, when $(i,j)$ form a base pair; any of the special structures may appear inside, and this case subsumes the first formula of the previous assumption
12. RNA Structure Prediction: More assumptions
- Disregard free energies that do not belong to any loop
- The sum of the energies of the components is the final energy of the string: there is no interaction between components
- Only the 4 types of loops on p. 249 are used for $E(L_{i,j})$ (more can be added if their energy parameterization is known)
13. RNA Structure Prediction: Free energies for 4 loops
- Hairpin loop of size $k$: $\xi(k)$
- Additional stabilizing energy for two adjacent base pairs (in addition to $a(r,k)$): $\eta$, a constant
- Destabilizing energy for a bulge of size $k$: $\beta(k)$
- Destabilizing energy for an interior loop of size $k$: $\gamma(k)$
14. RNA Structure Prediction: $E(L_{i,j})$
- Hairpin: $a(r_i,r_j) + \xi(j-i-1)$
- Stacked pair: $a(r_i,r_j) + \eta + E(S_{i+1,j-1})$
- Bulge on $i$: $\min_{k \ge 1} \{ a(r_i,r_j) + \beta(k) + E(S_{i+k+1,j-1}) \}$
- Bulge on $j$: $\min_{k \ge 1} \{ a(r_i,r_j) + \beta(k) + E(S_{i+1,j-k-1}) \}$
- Interior loop: $\min_{k_1,k_2 \ge 1} \{ a(r_i,r_j) + \gamma(k_1+k_2) + E(S_{i+k_1+1,j-k_2-1}) \}$
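The five cases can be combined into one function that returns the smallest candidate. This is a rough sketch under stated assumptions: the energy parameters and the table lookup `E` are passed in as callables (all names are hypothetical), loop sizes count unpaired bases (so the hairpin closed by $(i,j)$ has size $j-i-1$), and the inner substring is assumed to need at least two bases to hold the next pair. It does not check that the ends of the inner substring are actually paired.

```python
def loop_energy(i, j, pair_energy, E, xi, eta, beta, gamma):
    """Smallest energy over the loop types closed by base pair (i, j).

    pair_energy(i, j): free energy a(r_i, r_j) of the closing pair
    E(p, q):           energy already tabulated for substring p..q
    xi, beta, gamma:   size-dependent hairpin/bulge/interior terms
    eta:               constant stacking term
    """
    a = pair_energy(i, j)
    # Hairpin: every base between i and j is unpaired.
    candidates = [a + xi(j - i - 1)]
    # Stacked pair: the next pair sits directly inside (i, j).
    if i + 1 < j - 1:
        candidates.append(a + eta + E(i + 1, j - 1))
    # Bulges: k unpaired bases on one side only.
    k = 1
    while i + k + 1 < j - 1:
        candidates.append(a + beta(k) + E(i + k + 1, j - 1))  # bulge on i
        candidates.append(a + beta(k) + E(i + 1, j - k - 1))  # bulge on j
        k += 1
    # Interior loop: k1 unpaired bases near i, k2 near j.
    for k1 in range(1, j - i):
        for k2 in range(1, j - i):
            if i + k1 + 1 < j - k2 - 1:
                candidates.append(
                    a + gamma(k1 + k2) + E(i + k1 + 1, j - k2 - 1)
                )
    return min(candidates)
```

The double loop over `k1` and `k2` is the part that costs $O(n^2)$ per table entry.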
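15. RNA Structure Prediction: Complexity
- $O(n^2)$ table entries
- For each entry:
- First 2 formulae: $O(1)$ each, giving $O(n^2)$ overall
- Third formula: $O(n)$, giving $O(n^3)$
- 4.1 ($E(L)$, hairpin): $O(1)$, giving $O(n^2)$
- 4.2 (stacked pair): $O(1)$, giving $O(n^2)$
- 4.3 (bulge on $i$): $O(n)$, running over $k$, giving $O(n^3)$
- 4.4 (bulge on $j$): $O(n)$, running over $k$, giving $O(n^3)$
- 4.5 (interior loop): $O(n^2)$, running over $k_1, k_2$, giving $O(n^4)$
- Final complexity, dominated by 4.5: $O(n^4)$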
16. Protein Threading
- Interactions in proteins are between 20 x 20 residue types, as opposed to at most 4 x 4 nucleic acid bases in RNA
- Residue interactions are quite non-local, causing much more structural complexity
- Proteins have frequent loops (helices are loops)
- So prediction by ab initio methods is extremely difficult
17. Protein Threading
- The number of protein folds is small (1,000 for 20,000 proteins)
- Threading: map the target sequence onto a template fold
- Threading is an alignment problem (Torda, Fig. 1)
- Find the fold to which the target aligns optimally (minimum of the energy function)
- Needs basic scoring functions, as in sequence alignment
18. Protein Threading: Number of folds
- The more folds in the database, the more time it takes to find the correct template
- The scoring function for threading is quite imperfect, so more available templates are needed (contradictory requirements)
19. Protein Threading: Scoring functions
- A full force field is not necessarily ideal
- It involves dynamics between molecules: stretch, torsion, etc.
- These are unimportant for a static alignment
20. Protein Threading: Scoring functions
- The scoring function can be defined between residues of the same sequence that come close to each other under the alignment
- See Torda, Fig. 5
- Example scoring function (free energy)
- For a pair of residues A and B to be at distance $r$ (Torda, p. 7):
- $G_{AB}(r) = -kT \ln\left( \rho_{AB}(r) / \rho^{0}_{AB}(r) \right)$
- $\rho_{AB}(r)$ is the probability of A and B being at distance $r$
- $\rho^{0}_{AB}(r)$ is the probability of that occurring at random ($k$, $T$ as usual)
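The pair potential is a one-line computation once the two probabilities are known. The sketch below assumes the usual inverse-Boltzmann sign convention (pairs seen more often than random get negative, favorable energy) and expresses $k$ in kcal/(mol K); the function name and the default temperature of 300 K are illustrative choices, not taken from Torda.

```python
import math

# Boltzmann constant per mole (gas constant) in kcal/(mol*K).
K_B = 0.0019872041

def pair_potential(rho_obs, rho_ref, temperature=300.0):
    """Free energy for a residue pair at a given distance.

    rho_obs: observed probability of the pair at that distance
    rho_ref: probability of the same event occurring at random
    """
    return -K_B * temperature * math.log(rho_obs / rho_ref)
```

A pair observed twice as often as expected at random, `pair_potential(0.2, 0.1)`, gets a negative score; one observed half as often gets a positive score of the same magnitude.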
21. Protein Threading: Scoring functions
- Probabilities are collected from PDB proteins with known structure
- Different threading schemes use different scoring functions, but most are derived from the PDB
22. Protein Threading: Scoring functions
- Example (Setubal-Meidanis, p. 257)
- $G_1(i, t_i)$: for placing the $i$-th residue of the sequence at position $t_i$ in the fold
- $G_2(i, j, t_i, t_j)$: for simultaneous placement of $i$ and $j$, for $i < j$
- Each placement is constrained to lie within a range, say $b_i < t_i < e_i$
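One plausible way to combine the two terms is to sum $G_1$ over all residues and $G_2$ over all pairs $i < j$. The sketch below makes that assumption explicit; `threading_score` and `within_ranges` are hypothetical names, and `g1` and `g2` are caller-supplied functions standing in for real scoring tables.

```python
def threading_score(placement, g1, g2):
    """Total score of a threading.

    placement[i] is the fold position t_i assigned to residue i.
    """
    n = len(placement)
    # Single-residue terms G1(i, t_i).
    singles = sum(g1(i, placement[i]) for i in range(n))
    # Pairwise terms G2(i, j, t_i, t_j) over all i < j.
    pairwise = sum(
        g2(i, j, placement[i], placement[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return singles + pairwise

def within_ranges(placement, ranges):
    """Check b_i < t_i < e_i for every residue."""
    return all(b < t < e for t, (b, e) in zip(placement, ranges))
```

The pairwise sum is what makes the search hard: each placement affects the score of every other placement, so residues cannot be optimized independently.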
23. Protein Threading
- Optimization is not only over placements but also over the multiple folds in the database
- Accuracy is very sensitive to alignment errors
24. Protein Threading: Dynamic programming
- The advantage/disadvantage of DP is that it is deterministic
- Problem: adjacency is hard to define in 3D
25. Protein Threading: Dynamic programming
- DP tries out different combinations of adjacent residues on different parts of a template (Torda, Fig. 5c; adjacency comes from the template sequence)
- Start with a smaller number of elements and build up to the full sequence
- Alternative approach: start by placing each residue at one of its possible positions, see where the next residue should go, and continue residue by residue
26. Protein Threading: Probabilistic algorithm
- Monte Carlo simulation: randomly throw residues at positions on the fold and check the aggregate scoring function
- Simulated annealing: gradually move residues to optimize, stochastically making random shifts to avoid local optima
- Time consuming, and the result is non-deterministic
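A generic simulated-annealing loop over placements might look like the sketch below. It is not tied to any particular threading program: the `score` function is supplied by the caller, position ranges are taken as inclusive `(b, e)` bounds, and the geometric cooling schedule with its default parameters is an arbitrary illustrative choice.

```python
import math
import random

def anneal_placement(score, ranges, steps=2000,
                     t_start=5.0, t_end=0.01, seed=0):
    """Search for a low-scoring placement by simulated annealing."""
    rng = random.Random(seed)
    # Monte Carlo start: throw each residue at a random position.
    current = [rng.randint(b, e) for b, e in ranges]
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        frac = step / max(steps - 1, 1)
        temperature = t_start * (t_end / t_start) ** frac
        # Random shift: move one residue to a new position.
        i = rng.randrange(len(current))
        b, e = ranges[i]
        candidate = list(current)
        candidate[i] = rng.randint(b, e)
        candidate_score = score(candidate)
        delta = candidate_score - current_score
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = candidate, candidate_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score
```

Fixing `seed` makes a run repeatable, but different seeds can return different placements, which is the non-determinism noted above.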
27. Protein Threading: Branch and bound
- In the worst case, try all possible alignments, but prune non-useful branches of the search space using a bounding function
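The pruning idea can be illustrated on a toy threading model. Everything in this sketch is an assumption made for illustration: residues are placed at strictly increasing template positions, `g1[i][t]` is the cost of putting residue `i` at position `t`, and each skipped template position between consecutive residues costs a non-negative `gap_cost`. The bounding function adds the cheapest possible `g1` value of every unplaced residue to the partial score.

```python
from itertools import combinations

def threading_score(g1, placement, gap_cost):
    """Score of a complete placement in the toy model."""
    score = sum(g1[i][t] for i, t in enumerate(placement))
    score += sum(
        gap_cost * (placement[i] - placement[i - 1] - 1)
        for i in range(1, len(placement))
    )
    return score

def exhaustive(g1, gap_cost=1.0):
    """Worst case: try every order-preserving placement."""
    n, m = len(g1), len(g1[0])
    return min(
        (threading_score(g1, c, gap_cost), c)
        for c in combinations(range(m), n)
    )

def branch_and_bound(g1, gap_cost=1.0):
    """Same optimum as exhaustive(), skipping hopeless branches."""
    n, m = len(g1), len(g1[0])
    # suffix[i]: optimistic cost of residues i..n-1 (row minima,
    # valid as a lower bound because gap penalties are >= 0).
    suffix = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix[i] = suffix[i + 1] + min(g1[i])
    best = [float("inf"), None]

    def extend(i, prev, partial, placement):
        if i == n:
            if partial < best[0]:
                best[0], best[1] = partial, tuple(placement)
            return
        # Leave room for the n - i - 1 residues still to place.
        for t in range(prev + 1, m - (n - i - 1)):
            cost = partial + g1[i][t]
            if i > 0:
                cost += gap_cost * (t - prev - 1)
            # Bound: even the most optimistic completion cannot
            # beat the best full placement found so far.
            if cost + suffix[i + 1] >= best[0]:
                continue
            placement.append(t)
            extend(i + 1, t, cost, placement)
            placement.pop()

    extend(0, -1, 0.0, [])
    return best[0], best[1]
```

Because the bound never overestimates the cost of completing a partial placement, pruning cannot discard the optimum; the two functions should agree on the best score for any input that satisfies the assumptions.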
28. Protein Threading: Search on folds
- Divide and conquer over the space of folds
- Assumption: folds can be ordered by their goodness for the target protein
- Example: Setubal-Meidanis, p. 258
29. Protein Threading: Future
- Slow
- Subsumed by ab initio approaches of the IBM Blue Gene type of project
- De novo technique using linear programming (Xu and Li, 2003)
- Threading techniques are useful not only for structure prediction but also for the fold recognition problem: no alignment, just find the template (fold suggests function)