Protein Structural Prediction - PowerPoint PPT Presentation

Slides: 52

Provided by: root

Learn more at: http://web.stanford.edu

1
Protein Structural Prediction
2
Structure Determines Function
The Protein Folding Problem

What determines structure?
Energy
Kinematics

How can we determine structure?
Experimental methods
Computational predictions

3
Protein Structure Prediction

ab initio
Use just first principles: energy, geometry, and
kinematics
Homology
Find the best match to a database of sequences
with known 3D-structure
Threading
Meta-servers and other methods

4
Threading
MTYKLILN . NGVDGEWTYTE
Main difference between homology-based prediction
and threading: threading uses the structure to
compute the energy function during alignment

Threading is the golden mean between
homology-based prediction and molecular modeling
(?)

5
Threading Overview

Build a structural template database
Define a sequence-structure energy function
Apply a threading algorithm to query sequence
Perform local refinement of secondary structure
Report best resulting structural model

6
Threading Search Space
Protein Sequence X
Protein Structure Y
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
7
Threading Template Database

FSSP, SCOP, CATH
Remove pairs of proteins with highly similar
structures
Efficiency
Statistical skew in favor of large families

8
Threading Energy Function
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
$E_s$: how well a residue fits a structural environment
$E_p$: how preferable it is to put two particular residues nearby
$E_m$: how often a residue mutates to the template residue
$E_g$: alignment gap penalty
$E_{ss}$: compatibility with local secondary structure prediction
Total energy: $E = w_m E_m + w_s E_s + w_p E_p + w_g E_g + w_{ss} E_{ss}$
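The weighted sum on this slide can be sketched in a few lines of Python. The term values and weights below are made-up numbers for illustration only; the slides do not give the actual weights used by any threading program, and the dictionary keys are just shorthand for the five terms.

```python
# Hypothetical keys mirror the slide's terms: m (mutation), s (environment
# fitness), p (pairwise), g (gap), ss (secondary structure).
def total_energy(terms, weights):
    """Return the weighted sum of energy terms: sum of weights[k] * terms[k]."""
    mismatch = set(terms) ^ set(weights)
    if mismatch:
        raise ValueError(f"terms and weights disagree on: {sorted(mismatch)}")
    return sum(weights[k] * terms[k] for k in terms)

# Illustrative values only, not taken from any real threading run.
example_terms = {"m": -2.0, "s": -1.5, "p": -3.0, "g": 4.0, "ss": -0.5}
example_weights = {"m": 1.0, "s": 1.0, "p": 0.5, "g": 0.25, "ss": 2.0}
print(total_energy(example_terms, example_weights))
```

Keeping the weights separate from the terms makes it easy to re-tune them without recomputing the per-term scores.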
9
Threading Formulation
Contact graph captures amino acid interactions
Cores represent important local structure units
No gaps within each core

[Figure: contact graph over cores C1, C2, C3, C4, showing an interaction between cores Ci and Cj]
10
Threading Formulation
CMG (v, ?)
11
Threading Formulation

From Lathrop & Smith

12
Threading Search Space
Protein Sequence X
Protein Structure Y
How Hard is Threading?
MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
CORES
13
How Hard is Threading?

At least as hard as MAX-CUT
MAX-CUT: Given graph G = (V, E), find a cut (S,
T) of V with maximum number of edges between S
and T.
The Bad News: APX-complete even when each node
has at most B edges (where B > 2)

14
Reduction of MAX-CUT to Threading
[Figure: sequence of 01-pairs, one pair per vertex v1 ... v7]
Sequence consists of |V| 01-pairs

|V| cores, each core i has length 1 and
corresponds to vi
Let Ep(0,1) = 1: every edge labeled 0-1 or 1-0
gets a score of 1
Then, size of cut = threading score
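The correspondence between cut size and threading score can be illustrated with a small sketch: each vertex (core) is labeled 0 or 1, and every edge whose endpoints carry different labels contributes a score of 1. The graph below is a made-up example, and the brute-force search is only there to show what is being maximized; it is exponential and not a threading algorithm.

```python
from itertools import product

def cut_size(edges, labels):
    """Number of edges whose endpoints carry different labels (0 vs 1)."""
    return sum(1 for u, v in edges if labels[u] != labels[v])

def max_cut_brute_force(num_nodes, edges):
    """Try every 0/1 labeling; exponential, so only usable on tiny graphs."""
    best = max(product((0, 1), repeat=num_nodes),
               key=lambda labels: cut_size(edges, labels))
    return cut_size(edges, best), best

# Hypothetical graph: 4-cycle v0-v1-v2-v3-v0 plus the chord v0-v2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
size, labels = max_cut_brute_force(4, edges)
print(size, labels)
```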

15
Threading with Branch & Bound

Set of solutions can be partitioned into subsets
(branch)
Upper limit on a subset's solution can be
computed fast (bound)
Branch & Bound
Select subset with best possible bound
Subdivide it, and compute a bound for each subset
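A minimal best-first sketch of the select/subdivide/bound loop, under strong simplifying assumptions: each core gets a toy score per sequence position, positions must strictly increase from core to core, and pairwise terms are ignored. The bound is the partial score plus the best conceivable score of the remaining cores. This is an illustration of the general scheme, not the Lathrop & Smith algorithm; `scores` and its values are hypothetical.

```python
import heapq

def branch_and_bound(scores):
    """Best-first branch and bound over core placements.

    scores[i][k] is a toy score for placing core i at position k. Positions
    must strictly increase with i. Returns (best_total, positions) or None.
    """
    n = len(scores)
    # Optimistic score of cores i..n-1, ignoring the ordering constraint.
    suffix_best = [0.0] * (n + 1)
    for i in range(n - 1, -1, -1):
        suffix_best[i] = suffix_best[i + 1] + max(scores[i])

    # Heap entries: (-upper_bound, partial_score, positions_so_far).
    heap = [(-suffix_best[0], 0.0, ())]
    while heap:
        _, partial, placed = heapq.heappop(heap)  # subset with the best bound
        i = len(placed)
        if i == n:
            # A complete solution's bound equals its score, and no subset left
            # in the heap has a larger bound, so this solution is optimal.
            return partial, list(placed)
        start = placed[-1] + 1 if placed else 0
        for k in range(start, len(scores[i])):  # subdivide on core i's position
            new_partial = partial + scores[i][k]
            bound = new_partial + suffix_best[i + 1]
            heapq.heappush(heap, (-bound, new_partial, placed + (k,)))
    return None  # no feasible placement

# Toy instance: 3 cores, 4 candidate positions each (made-up scores).
toy_scores = [[3, 1, 0, 0], [5, 2, 4, 0], [0, 1, 1, 6]]
print(branch_and_bound(toy_scores))
```

A cheaper bound makes each step faster but prunes less; a tighter bound prunes more but costs more to compute.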

16
Threading with Branch & Bound

Key to this algorithm is the tradeoff on the lower bound:
efficient
tight

17
Threading with Integer Programming
Linear Program
maximize z = 6x + 5y (linear function)
subject to 3x + y ≤ 11, -x + 2y ≤ 5, x, y ≥ 0 (linear constraints)
Integer Program
Same objective and constraints, plus integral constraints (nonlinear): x, y ∈ {0, 1}
RAPTOR: integer programming-based threading, perhaps the best protein threading system
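The toy program on this slide is small enough to solve by enumeration, which makes the effect of the integrality constraint visible. The sketch assumes the constraints read 3x + y ≤ 11 and -x + 2y ≤ 5 with x, y ≥ 0; the inequality symbols did not survive in the transcript, so their directions are an assumption.

```python
from itertools import product

def objective(x, y):
    return 6 * x + 5 * y

def feasible(x, y):
    # Assumed directions: the transcript lost the inequality symbols.
    return 3 * x + y <= 11 and -x + 2 * y <= 5 and x >= 0 and y >= 0

def solve_by_enumeration(domain):
    """Maximize the objective over feasible (x, y) with x, y drawn from domain."""
    candidates = [(x, y) for x, y in product(domain, repeat=2) if feasible(x, y)]
    return max(candidates, key=lambda point: objective(*point))

binary_best = solve_by_enumeration((0, 1))        # x, y in {0, 1}
integer_best = solve_by_enumeration(range(0, 12)) # x, y non-negative integers
print(binary_best, objective(*binary_best))
print(integer_best, objective(*integer_best))
```

Under the same assumed constraints, the continuous relaxation reaches z = 232/7 (about 33.14) at x = 17/7, y = 26/7, which is not an integral point; this is the gap that branch and bound has to close when the relaxed solution is fractional.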
18
Threading with Integer Programming

x(i,k) denotes that core i is aligned to
sequence position k
y(i,k,j,l) denotes that core i is aligned to
position k and core j is aligned to position l
D(i) all positions where core i can be aligned
to
R(i, j, k) set of possible alignments of core j,
given that core i aligns to position k
core_i = (head_i, tail_i, length_i = tail_i - head_i + 1)

19
Threading with Integer Programming
Cores are aligned in order
Each y variable is 1 if and only if its two x
variables are 1: x and y represent exactly the
same threading
Each core has only one alignment position
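The relationship between the x and y variables can be made concrete with a small sketch. It represents a threading as one chosen position per core, derives the x and y variables that would be set to 1, and checks the ordering constraint. Building y for every pair of cores is a simplification; a real formulation would presumably only need pairs that interact in the contact graph. Function names and the example numbers are hypothetical.

```python
def build_variables(alignment):
    """alignment[i] is the sequence position chosen for core i.

    Returns x as the set of (i, k) with x(i,k) = 1 and y as the set of
    (i, k, j, l), i < j, with y(i,k,j,l) = 1; everything absent is 0.
    """
    x = {(i, k) for i, k in enumerate(alignment)}
    y = {(i, k, j, l) for (i, k) in x for (j, l) in x if i < j}
    return x, y

def is_valid_threading(alignment, core_lengths):
    """Cores keep their order and do not overlap. 'One position per core'
    holds by construction, since alignment has exactly one entry per core."""
    for i in range(len(alignment) - 1):
        if alignment[i] + core_lengths[i] > alignment[i + 1]:
            return False
    return True

x, y = build_variables([2, 10, 25])
print(sorted(x))
print(sorted(y))
print(is_valid_threading([2, 10, 25], [5, 6, 4]))
```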
20
Energy Function is Linear

Sequence substitution score
Fitness of aa in each position (example,
hydrophobicity)
Agreement with secondary structure prediction
Pairwise interaction between two cores
Gap between two successive cores

21
LP Relaxation and (again) Branch & Bound

1. Relax the integral constraint to x(i,k), y(i,k,j,l) ≥ 0
2. Solve the LP using a standard method (RAPTOR uses IBM's OSL)
3. If the resulting solution is integral, done
4. Else, select one non-integral variable (heuristically), and generate two subproblems by setting it to 0 and to 1; use Branch & Bound
In practice, in RAPTOR only 1% of the instances in the test database required step 4: almost all solutions are integral!

22
CAFASP

GOAL
The goal of CAFASP is to evaluate the performance
of fully automatic structure prediction servers
available to the community. In contrast to the
normal CASP procedure, CAFASP aims to answer the
question of how well servers do without any
intervention of experts, i.e. how well ANY user
using only automated methods can predict protein
structure. CAFASP assesses the performance of
methods without the user intervention allowed in
CASP.

23
Performance Evaluation in CAFASP3
Servers (54 in total)    Sum MaxSub score    Correct (30 FR targets)
3ds5 robetta             5.17-5.25           15-17
pmod 3ds3 pmode3         4.21-4.36           13-14
RAPTOR                   3.98                13
shgu                     3.93                13
3dsn                     3.64-3.90           12-13
pcons3                   3.75                12
fugu3 orf_c              3.38-3.67           11-12

pdbblast                 0.00                0
Servers with name in italic are meta servers.
MaxSub score ranges from 0 to 1. Therefore, the maximum total score is 30.
(http://www.cs.bgu.ac.il/~dfischer/CAFASP3, released in December 2002.)
24
One structure where RAPTOR did best
Red: true structure. Blue: correct part of
prediction. Green: wrong part of prediction.

Target size: 144
Superimposable size within 5 Å: 118
RMSD: 1.9

25
Some more results by other programs
26
Some more results by other programs
27
Some more results by other programs
28
Structural Motifs
beta helix
beta barrel
beta trefoil
29
Structural Motif Recognition

Secondary Structure Prediction
Find the α helices, β sheets, loops in a protein
sequence
Given an amino acid residue sequence, does it
fold as a
Coiled Coil?
β helix?
β barrel?
Zinc finger?
Intermediate goals towards folding
Useful information about the function of a
protein
More amenable to sequence analysis, than full
fold prediction

30
Structural Motif Recognition

Collect a database of known motifs and
corresponding amino acid subsequences
Devise a method/model to match a new sequence
to existing motif database
Verify computationally on a test set (divide
database into training and testing subsets)
Verify in lab

31
Structural Motif Recognition Methods

Alignment
Neural Nets
Hidden Markov Models
Threading
Profile-based Methods
Other Statistical Methods

32
Predicting Coiled Coils
33
Predicting Coiled Coils

NewCoils: multiply probs of frequencies in each
coiled coil position
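A rough sketch of the "multiply per-position probabilities" idea, assuming the seven heptad positions a to g and a lookup table of residue probabilities per position. The table values, the default probability for unlisted residues, and the function names are all made up for illustration; they are not the tables used by any published program. Log probabilities are summed, which is equivalent to multiplying the probabilities but avoids underflow.

```python
import math

HEPTAD = "abcdefg"

def window_log_score(window, start_register, prob_table, default=0.05):
    """Sum of log probabilities, i.e. the log of the product of per-position
    probabilities, with the first residue at heptad index start_register."""
    total = 0.0
    for offset, residue in enumerate(window):
        position = HEPTAD[(start_register + offset) % 7]
        total += math.log(prob_table.get((position, residue), default))
    return total

def best_register(window, prob_table):
    """Try all 7 heptad registers and keep the highest-scoring one."""
    scores = [window_log_score(window, r, prob_table) for r in range(7)]
    best = max(range(7), key=scores.__getitem__)
    return HEPTAD[best], scores[best]

# Toy table: leucine favored at 'a' and 'd', glutamate at 'e' and 'g'.
toy_table = {("a", "L"): 0.4, ("d", "L"): 0.4, ("e", "E"): 0.3, ("g", "E"): 0.3}
print(best_register("LAALAAA", toy_table))
```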

34
Predicting Coiled Coils

PairCoil: multiply pairwise probs of spatially
neighboring positions

Use a sliding window of length 28
Perfect score separation between true and false
examples (false = non-coiled-coil α helices)
Berger et al. PNAS 1995

35
Predicting β helices

Helix composed of three parallel β sheets
Very few solved structures, very different from
one another
Absent in eukaryotes!
Probably evolved subsequent to prok/euk split

36
Predicting β helices

Only available program: BetaWrap
The rungs subproblem
Given the location of a T2 turn of one rung, find
location of T2 turn of next rung
Distribution of turn lengths
Bonus/penalty for stacked pairs in the parallel
strands
Discard if highly charged residues in the
inward-pointing positions of β strand
From a rung to multiple rungs
Find multiple initial B2-T2-B3 rungs
Use sequence template based on hydrophobicity to
find many candidate rungs
Find optimal wrap by DP heuristic score,
based on 5 consistent rungs
Completing the parse
Find B1 strands by locally optimizing their
location

37
Predicting β helices

BetaWrap gives scores that separate true from
false β helices
Bradley et al. PNAS 2001

38
Predicting β trefoils
http://betawrappro.csail.mit.edu/
Similar idea: use a combination of domain-specific expert knowledge with statistics
WRAP-AND-PACK
WRAP: Search for antiparallel β strands to wrap a cap
PACK: Place the side chains in the interior of the wrapped β strands
39
Predicting Secondary Structure

Given amino acid sequence, classify positions
into α helices, β strands, or loops
In general, harder than protein motif
identification
Best methods rely on Neural Networks
Similarly good separation can be achieved by SVMs
PSIPRED
Given a sequence x, generate profile using
PSI-BLAST
Pass the profile to a pre-trained NN
Output classification: α helix / β strand / loops

40
PSIPRED
Profile M

Training & Testing
Start with database of determined folds (< 1.87 Å)
Remove redundancy: any pair of proteins with
high similarity (found by PSI-BLAST); 187
remaining proteins
3-fold cross validation
76% classification accuracy

41
PSIPRED server

PSIPRED PREDICTION RESULTS
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 9888788777656877765688766579
Pred: CCCCCCCCCCCCCCCCCCCCCCCCCCCC
  AA: PEPTIDEPEPTIDEPEPTIDEPEPTIDE

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 998888872100111210012112359
Pred: CCCCCCCCCCCHHHHHHHHCCCCCCCC
  AA: PTYPTYPTXXXXXXXXXXXXTEETEET

PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 91025687432236422336410232027743223653334679
Pred: CCCCCCCCCCCCCCCCCCCCCCCEEEECCCCCCCCCCCCCCCCC
  AA: THISISAPRXTEINSEQXENCETHISISAPRXTEINSEQXENCE
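Output in this style is easy to post-process. The sketch below assumes each data line has the form `Label: value` (for example `Pred: CCCHHH`), skips the legend lines by ignoring values that contain spaces, and turns the prediction string into contiguous segments. The function names are hypothetical, and the sample text is the second example from this slide with the colon separators written out.

```python
from itertools import groupby

def parse_horiz(text):
    """Collect the Conf/Pred/AA strings from PSIPRED-style horizontal output."""
    fields = {"Conf": "", "Pred": "", "AA": ""}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(":")
        value = value.strip()
        # Legend lines such as "AA: Target sequence" contain spaces; data do not.
        if sep and key in fields and value and " " not in value:
            fields[key] += value
    return fields

def segments(pred):
    """Run-length encode a prediction as (state, start, end), 1-based inclusive."""
    result, position = [], 1
    for state, run in groupby(pred):
        length = len(list(run))
        result.append((state, position, position + length - 1))
        position += length
    return result

sample = """\
Conf: Confidence (0=low, 9=high)
Pred: Predicted secondary structure (H=helix, E=strand, C=coil)
  AA: Target sequence
PSIPRED HFORMAT (PSIPRED V2.3 by David Jones)
Conf: 998888872100111210012112359
Pred: CCCCCCCCCCCHHHHHHHHCCCCCCCC
  AA: PTYPTYPTXXXXXXXXXXXXTEETEET
"""
fields = parse_horiz(sample)
for state, start, end in segments(fields["Pred"]):
    print(state, start, end)
```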
42
TRILOGY: Sequence-Structure Patterns

Identify short sequence-structure patterns, 3
amino acids
Find statistically significant ones
(hypergeometric distribution)
Correct for multiple trials
These patterns may have structural or functional
importance
Pseq: $R_1 x_{a-b} R_2 x_{c-d} R_3$
Pstr: 3 Cα-Cα distances, 3 Cα-Cβ vectors
Start with short patterns of 3 amino acids
V, I, L, M, F, Y, W, D, E, K, R, H, N,
Q, S, T, A, G, S
Extend to longer patterns
Bradley et al. PNAS 99:8500-8505, 2002

43
TRILOGY
44
TRILOGY: Extension
Glue together two 3-aa patterns that overlap in 2
amino acids
$P\text{-score} = \sum_{i=M_{pat}}^{\min(M_{seq},\, M_{str})} C(M_{seq}, i)\, C(T - M_{seq}, M_{str} - i)\, C(T, M_{str})^{-1}$
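The P-score is a hypergeometric tail probability and can be computed exactly with integer binomials. The slide does not define its symbols, so the reading used here is an assumption: T is the total number of candidate residue triplets, Mseq and Mstr are the numbers matching the sequence part and the structure part, and Mpat is the number matching both.

```python
from fractions import Fraction
from math import comb

def p_score(T, m_seq, m_str, m_pat):
    """Probability that at least m_pat of the m_str structure matches fall among
    the m_seq sequence matches, if the m_str matches were drawn at random
    from T candidates (hypergeometric tail)."""
    upper = min(m_seq, m_str)
    numerator = sum(
        comb(m_seq, i) * comb(T - m_seq, m_str - i)
        for i in range(m_pat, upper + 1)
    )
    return Fraction(numerator, comb(T, m_str))

# Small made-up numbers, chosen only so the result is easy to check by hand.
print(p_score(10, 3, 4, 3))
```

Using `Fraction` keeps the result exact; for realistic database sizes a floating-point or log-space version would be more practical.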
45
TRILOGY: Longer Patterns
β-α-β unit found in three proteins with the
TIM-barrel fold
NAD/FAD binding motif found in several folds
Type-II β turn between unpaired β strands
Helix-hairpin-helix DNA-binding motif
A β-hairpin connected with a crossover to a third
β-strand
Three strands of an anti-parallel β-sheet
A fold with repeated aligned β-sheets
Four Cysteines forming 4 S-S disulfide bonds
46
Small Libraries of Structural Fragments for
Representing Protein Structures
47
Fragment Libraries For Structure Modeling
predicted structure
known structures
48
Small Libraries of Protein Fragments

Kolodny, Koehl, Guibas, Levitt, JMB 2002
Goal
Small alphabet of protein structural fragments
that can be used to represent any structure
Generate fragments from known proteins
Cluster fragments to identify common structural
motifs
Test library accuracy on proteins not in the
initial set

49
Small Libraries of Protein Fragments

Dataset: 200 unique protein domains with most
reliable distinct structures from SCOP
36,397 residues
Divide each protein domain into consecutive
fragments beginning at random initial position
Library: four sets of backbone fragments
4, 5, 6, and 7-residue long fragments
Cluster the resulting small structures into k
clusters using cRMS, and applying k-means
clustering with simulated annealing
Cluster with k-means
Iteratively break & join clusters with simulated
annealing to optimize total variance Σ(x − µ)²
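The k-means step and the total-variance objective Σ(x − µ)² can be sketched as follows. This is plain Lloyd-style k-means on coordinate tuples with squared Euclidean distance standing in for cRMS, a naive deterministic initialization, and no simulated-annealing break/join phase; all of these are simplifying assumptions, not the procedure of the cited paper.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, iterations=100):
    """Plain k-means; returns (centroids, assignment, total_variance)."""
    centroids = [tuple(p) for p in points[:k]]  # naive deterministic start
    assignment = None
    for _ in range(iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    total = sum(squared_distance(p, centroids[a])
                for p, a in zip(points, assignment))
    return centroids, assignment, total

# Made-up 2-D points standing in for fragment descriptors.
toy_points = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
print(k_means(toy_points, 2))
```

Plain k-means only finds a local optimum of the total variance, which is presumably why the slide pairs it with a simulated-annealing search.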

50
Evaluating the Quality of a Library