Title: 3D Structure Prediction
1. 3D Structure Prediction and Assessment Pt. 2
- David Wishart
- Rm. 2123 Dent/Pharm Centre
- david.wishart_at_ualberta.ca
2. 3D Structure Generation
- X-ray Crystallography
- NMR Spectroscopy
- Homology or Comparative Modelling
- Threading (1D and 2D threading)
- Secondary Structure Prediction
- Ab initio Structure Prediction
3. Outline
- Threading (1D and 3D threading)
- Secondary Structure Prediction
- Ab initio Structure Prediction
- Structure Evaluation and Assessment
- PERL and PDB
4. Definition
- Threading - A protein fold recognition technique
that involves incrementally replacing the
sequence of a known protein structure with a
query sequence of unknown structure. The new
model structure is evaluated using a simple
heuristic measure of protein fold quality. The
process is repeated against all known 3D
structures until an optimal fit is found.
5. Why Threading?
- Secondary structure is more conserved than
primary structure
- Tertiary structure is more conserved than
secondary structure
- Therefore very remote relationships can be better
detected through 2° or 3° structural homology
instead of sequence homology
6-10. Visualizing Threading
[Animated figure spanning slides 6-10. Only the text labels survive: the sequence below and the single letters T, H, R, E.]
THREADINGSEQNCEECNQESGNI ERHTHREADINGSEQNCETHREAD
GSEQNCEQCQESGIDAERTHR...
11. Threading
- Database of 3D structures and sequences
  - Protein Data Bank (or non-redundant subset)
- Query sequence
  - Sequence < 25% identity to known structures
- Alignment protocol
  - Dynamic programming
- Evaluation protocol
  - Distance-based potential or secondary structure
- Ranking protocol
12. 2 Kinds of Threading
- 2D Threading or Prediction Based Methods (PBM)
  - Predict secondary structure (SS) or ASA of query
  - Evaluate on basis of SS and/or ASA matches
- 3D Threading or Distance Based Methods (DBM)
  - Create a 3D model of the structure
  - Evaluate using a distance-based hydrophobicity
    or pseudo-thermodynamic potential
13. 2D Threading Algorithm
- Convert PDB to a database containing sequence, SS
and ASA information
- Predict the SS and ASA for the query sequence
using a high-end algorithm
- Perform a dynamic programming alignment using the
query against the database (include sequence, SS
and ASA)
- Rank the alignments and select the most probable
fold
14. Database Conversion
>Protein1
THREADINGSEQNCEECNQESGNI
HHHHHHCCCCEEEEECCCHHHHHH
ERHTHREADINGSEQNCETHREAD
HHCCEEEEECCCCCHHHHHHHHHH
>Protein2
QWETRYEWQEDFSHAECNQESGNI
EEEEECCCCHHHHHHHHHHHHHHH
YTREWQHGFDSASQWETRA
CCCCEEEEECCCEEEEECC
>Protein3
LKHGMNSNWEDFSHAECNQESG
EEECCEEEECCCEEECCCCCCC
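A minimal sketch of how one entry of such a database could be held in memory. It assumes that each entry pairs a sequence with an equal-length string of H/E/C states, which is how the example on this slide reads. `ThreadingRecord` and `make_record` are hypothetical names, and ASA is left out because the slide does not show it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThreadingRecord:
    """One database entry: a name, a sequence, and per-residue SS states."""
    name: str
    sequence: str
    ss: str  # H (helix), E (strand) or C (coil), one letter per residue


def make_record(name: str, sequence: str, ss: str) -> ThreadingRecord:
    """Build a record after checking that sequence and SS line up 1:1."""
    if len(sequence) != len(ss):
        raise ValueError(f"{name}: sequence/SS length mismatch")
    if set(ss) - set("HEC"):
        raise ValueError(f"{name}: unexpected SS symbol")
    return ThreadingRecord(name, sequence, ss)


# Protein3 from the slide (22 residues).
protein3 = make_record(
    "Protein3",
    "LKHGMNSNWEDFSHAECNQESG",
    "EEECCEEEECCCEEECCCCCCC",
)
```

The length check is the useful part: the alignment step needs a structural state for every residue it scores.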
15. Secondary Structure
16. 2° Structure Identification
- DSSP - Database of Secondary Structures for
Proteins (swift.embl-heidelberg.de/dssp)
- VADAR - Volume Area Dihedral Angle Reporter
(redpoll.pharmacy.ualberta.ca)
- PDB - Protein Data Bank (www.rcsb.org)
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCA
HHHHHHCCEEEEEEEEEEECCHHHHHHHCCCCCCC
17. Accessible Surface Area
Reentrant Surface
Accessible Surface
Solvent Probe
Van der Waals Surface
18. ASA Calculation
- DSSP - Database of Secondary Structures for
Proteins (swift.embl-heidelberg.de/dssp)
- VADAR - Volume Area Dihedral Angle Reporter
(www.pence.ualberta.ca/ftp/vadar)
- GetArea - www.scsb.utmb.edu/getarea/area_form.html
QHTAWCLTSEQHTAAVIWDCETPGKQNGAYQEDCAMD
BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
1056298799415251510478941496989999999
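The example above pairs each residue with a one-digit ASA value and a B/P/E letter. A minimal sketch of that mapping follows. The cut-offs are inferred from this one example rather than stated on the slide: 0-2 appear as B, 4-6 as P, and 7-9 as E. The digit 3 never occurs, so placing it in B is an assumption. `asa_state` is a hypothetical name.

```python
def asa_state(digit: int) -> str:
    """Map a one-digit ASA value (0-9) to a three-state letter.

    Cut-offs inferred from the slide's example; where 3 belongs is assumed.
    """
    if not 0 <= digit <= 9:
        raise ValueError("digit must be in 0-9")
    if digit <= 3:
        return "B"
    if digit <= 6:
        return "P"
    return "E"


# The digit string from the slide, joined into one line.
digits = "1056298799415251510478941496989999999"
states = "".join(asa_state(int(d)) for d in digits)
print(states)  # BBPPBEEEEEPBPBPBPBBPEEEPBPEPEEEEEEEEE
```

Under these cut-offs the digits reproduce the slide's B/P/E string exactly.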
19. Other ASA sites
- Connolly Molecular Surface Home Page
  - http://www.biohedron.com/
- Naccess Home Page
  - http://sjh.bi.umist.ac.uk/naccess.html
- ASA Parallelization
  - http://cmag.cit.nih.gov/Asa.htm
- Protein Structure Database
  - http://www.psc.edu/biomed/pages/research/PSdb/
20. 2D Threading Algorithm
- Convert PDB to a database containing sequence, SS
and ASA information
- Predict the SS and ASA for the query sequence
using a high-end algorithm
- Perform a dynamic programming alignment using the
query against the database (include sequence, SS
and ASA)
- Rank the alignments and select the most probable
fold
21. 2° Structure Prediction
- Statistical (Chou-Fasman, GOR)
- Homology or Nearest Neighbor (Levin)
- Physico-Chemical (Lim, Eisenberg)
- Pattern Matching (Cohen, Rooman)
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann)
- Combined Approaches (Rost, Levin, Argos)
22. Chou-Fasman Statistics
23. The PHD Approach
PRFILE...
24. The PHD Algorithm
- Search the SWISS-PROT database and select high
scoring homologues
- Create a sequence profile from the resulting
multiple alignment
- Include global sequence info in the profile
- Input the profile into a trained two-layer neural
network to predict the structure and to
clean up the prediction
25. Prediction Performance
26. Best of the Best
- PredictProtein-PHD (72%)
  - http://cubic.bioc.columbia.edu/predictprotein
- Jpred (73-75%)
  - http://jura.ebi.ac.uk:8888/
- PREDATOR (75%)
  - http://www.embl-heidelberg.de/cgi/predator_serv.pl
- PSIpred (77%)
  - http://insulin.brunel.ac.uk/psipred
27. ASA Prediction
- PredictProtein-PHDacc (58%)
  - http://cubic.bioc.columbia.edu/predictprotein
- PredAcc (70%?)
  - condor.urbb.jussieu.fr/PredAccCfg.html
QHTAW...
QHTAWCLTSEQHTAAVIW
BBPPBEEEEEPBPBPBPB
28. 2D Threading Algorithm
- Convert PDB to a database containing sequence, SS
and ASA information
- Predict the SS and ASA for the query sequence
using a high-end algorithm
- Perform a dynamic programming alignment using the
query against the database (include sequence, SS
and ASA)
- Rank the alignments and select the most probable
fold
29. Dynamic Programming
      G   E   N   E   T   I   C   S
 G   60  40  30  20  20   0  10   0
 E   40  50  30  30  20   0  10   0
 N   30  30  40  20  20   0  10   0
 E   20  20  20  30  20  10  10   0
 S   20  20  20  20  20   0  10  10
 I   10  10  10  10  10  20  10   0
 S    0   0   0   0   0   0   0  10
30. $S_{ij}$ (Identity Matrix)
  A C D E F G H I K L M N P Q R S T V W Y
A 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
D 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
E 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
F 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
H 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
I 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
K 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
L 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
M 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
N 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
S 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
V 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
W 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
31. A Simple Example...
[Matrix filled in one cell at a time; state at the end of this slide:]
    A  A  T  V  D
A   1  1  0  0  0
V   0  1  1  2
V
D
32. A Simple Example...
[Completed matrix:]
    A  A  T  V  D
A   1  1  0  0  0
V   0  1  1  2  1
V   0  1  1  2  2
D   0  1  1  1  3
[Alignments shown:]
A A T V D
A - V V D

A A T V D
A V V D

A A T V D
A V - V D
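The numbers in this example can be reproduced with a short sketch. It assumes there is no gap penalty, so each cell is its own identity score plus the best value anywhere in the sub-matrix strictly above and to the left. That rule is inferred from the slide's numbers, not quoted from it, and `fill_matrix` is a hypothetical name.

```python
def fill_matrix(rows, cols, score):
    """Fill an alignment matrix with no gap penalty.

    Cell (i, j) = score(i, j) + max of all cells (k, l) with k < i, l < j.
    """
    m = [[0] * len(cols) for _ in rows]
    for i in range(len(rows)):
        for j in range(len(cols)):
            best_prev = max(
                (m[k][l] for k in range(i) for l in range(j)), default=0
            )
            m[i][j] = score(i, j) + best_prev
    return m


row_seq, col_seq = "AVVD", "AATVD"
matrix = fill_matrix(
    row_seq, col_seq, lambda i, j: int(row_seq[i] == col_seq[j])
)
for residue, row in zip(row_seq, matrix):
    print(residue, row)
# A [1, 1, 0, 0, 0]
# V [0, 1, 1, 2, 1]
# V [0, 1, 1, 2, 2]
# D [0, 1, 1, 1, 3]
```

The bottom-right value, 3, is the best achievable number of identities; an alignment is read back by tracing from that cell through the cells that supplied each maximum.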
33. Let's Include 2° Info and ASA
$S_{ij}^{strc}$:
    H  E  C
H   1  0  0
E   0  1  0
C   0  0  1
$S_{ij}^{asa}$:
    E  P  B
E   1  0  0
P   0  1  0
B   0  0  1
$S_{ij}^{total} = k_1 S_{ij}^{seq} + k_2 S_{ij}^{strc} + k_3 S_{ij}^{asa}$
34. A Simple Example...
[Matrix filled in one cell at a time, now scoring sequence plus secondary structure; state at the end of this slide:]
       E  E  E  C  C
       A  A  T  V  D
E  A   2  2  1  0  0
E  V   1  3  3  3
C  V
C  D
35. A Simple Example...
[Completed matrix:]
       E  E  E  C  C
       A  A  T  V  D
E  A   2  2  1  0  0
E  V   1  3  3  3  2
C  V   0  2  3  5  4
C  D   0  2  3  4  7
[Alignments shown:]
A A T V D
A - V V D

A A T V D
A V V D

A A T V D
A V - V D
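A minimal sketch that reproduces this matrix. It assumes the per-cell score is the weighted sum of a sequence identity term and a secondary structure identity term, with both weights set to 1 and no ASA term, since the example carries no ASA states. As before, no gap penalty is assumed, and the function names are hypothetical.

```python
def combined_score(res_a, res_b, ss_a, ss_b, k_seq=1, k_ss=1):
    """Weighted sum of sequence identity and secondary structure identity."""
    return k_seq * int(res_a == res_b) + k_ss * int(ss_a == ss_b)


def fill_matrix(n_rows, n_cols, score):
    """Cell (i, j) = score(i, j) + best cell strictly above and to the left."""
    m = [[0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            best_prev = max(
                (m[k][l] for k in range(i) for l in range(j)), default=0
            )
            m[i][j] = score(i, j) + best_prev
    return m


row_seq, row_ss = "AVVD", "EECC"
col_seq, col_ss = "AATVD", "EEECC"
matrix = fill_matrix(
    len(row_seq),
    len(col_seq),
    lambda i, j: combined_score(row_seq[i], col_seq[j], row_ss[i], col_ss[j]),
)
for residue, ss, row in zip(row_seq, row_ss, matrix):
    print(ss, residue, row)
# E A [2, 2, 1, 0, 0]
# E V [1, 3, 3, 3, 2]
# C V [0, 2, 3, 5, 4]
# C D [0, 2, 3, 4, 7]
```

Changing `k_seq` and `k_ss` shifts how much the alignment trusts structure over sequence; an ASA term would be added to `combined_score` in the same way.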
36. 2D Threading Performance
- In test sets 2D threading methods can identify
30-40% of proteins having very remote homologues
(i.e. not detected by BLAST) using minimal
non-redundant databases (<700 proteins)
- If the database is expanded 4x the performance
jumps to 70-75%
- Performs best on true homologues as opposed to
postulated analogues
37. 2D Threading Advantages
- Algorithm is easy to implement
- Algorithm is very fast (10x faster than 3D
threading approaches)
- The 2D database is small (<500 kbytes) compared
to 3D database (>1.5 Gbytes)
- Appears to be just as accurate as DBM or other 3D
threading approaches
- Very amenable to web servers
38. Servers - PredictProtein
39. Servers - 123D
40. Servers - GenThreader
41. Servers - LIBRA 1
42. More Servers - www.bronco.ualberta.ca
43. 2D Threading Disadvantages
- Reliability is not 100%, making most threading
predictions suspect unless experimental evidence
can be used to support the conclusion
- Does not produce a 3D model at the end of the
process
- Doesn't include all aspects of 2° and 3°
structure features in the prediction process
- PSI-BLAST may be just as good (faster too!)
44. Making it Better
- Include 3D threading analysis as part of the 2D
threading process -- offers another layer of
information
- Include more information about the coil state
(3-state prediction isn't good enough)
- Include other biochemical (ligands, function,
binding partners, motifs) or phylogenetic
(origin, species) information
45. Outline
- Threading (1D and 3D threading)
- Secondary Structure Prediction
- Ab initio Structure Prediction
- Structure Evaluation and Assessment
- PERL and PDB
46. Ab Initio Prediction
- Predicting the 3D structure without any prior
knowledge
- Used when homology modelling or threading have
failed (no homologues are evident)
- Equivalent to solving the Protein Folding
Problem
- Still a research problem
47. Polypeptides can be...
- Represented by a range of approaches or
approximations including
  - all atom representations in cartesian space
  - all atom representations in dihedral space
  - simplified atomic versions in dihedral space
  - tube/cylinder/ribbon representations
  - lattice models
48. Ab Initio Folding
- Two Central Problems
  - Sampling conformational space ($10^{100}$)
  - The energy minimum problem
- The Sampling Problem (Solutions)
  - Lattice models, off-lattice models, simplified
    chain methods, parallelism
- The Energy Problem (Solutions)
  - Threading energies, packing assessment, topology
    assessment
49. A Simple 2D Lattice
3.5 Å
50. Lattice Folding
51. Lattice Algorithm
1) Build an n x m matrix (a 2D array)
2) Choose an arbitrary point as your N terminal
residue (start residue)
3) Add or subtract 1 from the x or y position of
the start residue
4) Check to see if the new point (residue) is off
the lattice or is already occupied
5) Evaluate the energy
6) Go to step 3) and repeat until done
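The chain-growing steps above can be sketched as follows. This is a minimal illustration, not the lecture's code. It assumes each new residue is placed next to the most recently added one, and that a walk which boxes itself in is discarded and restarted. The energy step is left out, and `grow_chain` and its parameters are hypothetical names.

```python
import random


def grow_chain(n_residues, width, height, rng, max_tries=1000):
    """Grow a self-avoiding chain on a width x height square lattice.

    Returns a list of (x, y) positions, or None if every attempt dead-ends.
    """
    moves = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    for _ in range(max_tries):
        start = (rng.randrange(width), rng.randrange(height))
        chain = [start]
        occupied = {start}
        while len(chain) < n_residues:
            x, y = chain[-1]
            # Keep only moves that stay on the lattice and hit a free site.
            free = [
                (x + dx, y + dy)
                for dx, dy in moves
                if 0 <= x + dx < width
                and 0 <= y + dy < height
                and (x + dx, y + dy) not in occupied
            ]
            if not free:
                break  # dead end: restart from a new start point
            nxt = rng.choice(free)
            chain.append(nxt)
            occupied.add(nxt)
        if len(chain) == n_residues:
            return chain
    return None


chain = grow_chain(10, 6, 6, random.Random(0))
print(chain)
```

Passing the random generator in explicitly keeps runs reproducible, which helps when comparing the energies of many generated chains.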
52. Lattice Energy Algorithm
- Red = hydrophobic, Blue = hydrophilic
- If Red is near empty space, E = E + 1
- If Blue is near empty space, E = E - 1
- If Red is near another Red, E = E - 1
- If Blue is near another Blue, E = E + 0
- If Blue is near Red, E = E + 0
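A minimal sketch of these scoring rules on a 2D square lattice. Several details are assumptions, because the slide does not specify them: "near" is taken to mean the four adjacent lattice sites, every empty adjacent site is scored separately, residues that are neighbours along the chain are not scored as contacts, and each Red-Red contact is counted once. Residues are written as `H` (Red, hydrophobic) and `P` (Blue, hydrophilic); `lattice_energy` is a hypothetical name.

```python
def lattice_energy(chain, kinds):
    """Score a lattice chain.

    chain: list of (x, y) positions, one per residue.
    kinds: string of 'H' (Red, hydrophobic) or 'P' (Blue, hydrophilic).
    """
    index_at = {pos: i for i, pos in enumerate(chain)}
    energy = 0
    for i, (x, y) in enumerate(chain):
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            j = index_at.get((x + dx, y + dy))
            if j is None:
                # Empty neighbouring site: +1 for Red, -1 for Blue.
                energy += 1 if kinds[i] == "H" else -1
            elif j > i + 1:
                # Non-bonded contact, visited once per pair.
                if kinds[i] == "H" and kinds[j] == "H":
                    energy -= 1
                # Blue-Blue and Blue-Red contacts add 0.
    return energy


# A four-residue chain folded into a 2 x 2 square.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
print(lattice_energy(square, "HPPH"))  # -1
```

In the `HPPH` square the two Red residues contribute +3 in total (four exposed faces minus one Red-Red contact) and the two Blue residues contribute -4, giving -1.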
53. More Complex Lattices
54. 3D Lattices
55. Really Complex 3D Lattices
J. Skolnick
56. Lattice Methods
Advantages
- Easiest and quickest way to build a polypeptide
- Implicitly includes excluded volume
- More complex lattices allow reasonably accurate
representation
Disadvantages
- At best, only an approximation to the real thing
- Does not allow accurate constructs
- Complex lattices are as costly as the real thing
57. Non-Lattice Models
[Figure: peptide backbone spanning residues i and i+1, with atom labels H, R, C, N, O and distance labels 3.5 Å, 1.53 Å, 1.00 Å, 1.32 Å, 1.47 Å, 1.24 Å]
58. Vistraj and Foldtraj
- Chris Hogue and Howard Feldman (SLRI)
- Uses simplified Cα chain to represent polypeptide
backbone
- Generates a simplified self-avoiding chain of
100 residues in 3 sec
- Uses a binary tree search to look for potential
collisions in 3D space
- Reconstructs full polypeptide from Cα atoms
59. Simplified Chain Representation
[Figure: chain positions 1-4 with angles θ and φ; caption: Spherical Coordinates]
60. The Search Sphere
[Figure labels: Helix, Coil, β-Sheet]
61. Building a Cα Peptide Chain
n = 3, n = 5, n = 7, n = 9
62. Simplified Chain Representation
Reconstructing backbone atoms from Cα atoms
64. Best Method So Far...
Rosetta - David Baker
65. Blue Gene and Protein Folding
66. Outline
- Threading (1D and 3D threading)
- Secondary Structure Prediction
- Ab initio Structure Prediction
- Structure Evaluation and Assessment
- PERL and PDB
67. Why Assess Structure?
- A structure can (and often does) have mistakes
- A poor structure will lead to poor models of
mechanism or relationship
- Unusual parts of a structure may indicate
something important (or an error)
68. Famous bad structures
- Azotobacter ferredoxin (wrong space group)
- Zn-metallothionein (mistraced chain)
- Alpha bungarotoxin (poor stereochemistry)
- Yeast enolase (mistraced chain)
- Ras P21 oncogene (mistraced chain)
- Gene V protein (poor stereochemistry)
69. How to Assess Structure?
- Assess experimental fit (look at R factor or
rmsd)
- Assess correctness of overall fold (look at
disposition of hydrophobes)
- Assess structure quality (packing,
stereochemistry, bad contacts, etc.)
70. A Good Protein Structure...
X-ray structure
- R = 0.59 random chain
- R = 0.45 initial structure
- R = 0.35 getting there
- R = 0.25 typical protein
- R = 0.15 best case
- R = 0.05 small molecule
NMR structure
- rmsd = 4 Å random
- rmsd = 2 Å initial fit
- rmsd = 1.5 Å OK
- rmsd = 0.8 Å typical
- rmsd = 0.4 Å best case
- rmsd = 0.2 Å dream on
71. A Good Protein Structure...
- Minimizes disallowed torsion angles
- Maximizes number of hydrogen bonds
- Maximizes buried hydrophobic ASA
- Maximizes exposed hydrophilic ASA
- Minimizes interstitial cavities or spaces
72. A Good Protein Structure...
- Minimizes number of bad contacts
- Minimizes number of buried charges
- Minimizes radius of gyration
- Minimizes covalent and noncovalent (van der Waals
and coulombic) energies
73. Radius and Radius of Gyration
- $\mathrm{RAD} = 3.875 \times \mathrm{NUMRES}^{0.333}$ (Folded)
- $\mathrm{RADG} = 0.41 \times (110 \times \mathrm{NUMRES})^{0.5}$ (Unfolded)
[Figure labels: Radius; Radius of Gyration]
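A minimal sketch that evaluates the two expressions on this slide for a given chain length. It reads the trailing 0.333 and 0.5 as exponents, which is the usual form of such scaling relations but is an interpretation of the flattened text. The function names are hypothetical, and the slide gives no units.

```python
def folded_radius(num_res: int) -> float:
    """RAD = 3.875 x NUMRES^0.333 (folded protein)."""
    return 3.875 * num_res ** 0.333


def unfolded_radius_of_gyration(num_res: int) -> float:
    """RADG = 0.41 x (110 x NUMRES)^0.5 (unfolded chain)."""
    return 0.41 * (110 * num_res) ** 0.5


for n in (50, 100, 200):
    print(n, round(folded_radius(n), 1), round(unfolded_radius_of_gyration(n), 1))
```

For a 100-residue chain the folded estimate is about 18 and the unfolded estimate about 43.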
74. Packing Volume
Loose Packing Dense Packing Protein
Proteins are Densely Packed
75. Accessible Surface Area
76. Accessible Surface Area
- Solvation free energy is related to ASA
- $\Delta G = \sum_i \Delta\sigma_i A_i$
- Proteins typically have 60% of their ASA
comprised of polar atoms or residues
- Proteins typically have 40% of their ASA
comprised of nonpolar atoms or residues
- ΔASA (obs - exp.) reveals shape/roughness
77. Structure Validation Servers
- WhatIf Web Server - http://www.cmbi.kun.nl:1100/WIWWWI/
- Biotech Validation Suite - http://biotech.ebi.ac.uk:8400/cgi-bin/sendquery
- Verify3D - http://www.doe-mbi.ucla.edu/Services/Verify_3D/
- VADAR - http://redpoll.pharmacy.ualberta.ca
82. Structure Validation Programs
- PROCHECK - http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
- PROSA II - http://lore.came.sbg.ac.at/People/mo/Prosa/prosa.html
- VADAR - http://www.pence.ualberta.ca/ftp/vadar/
- DSSP - http://www.embl-heidelberg.de/dssp/
83. Procheck
84. Slides Located At...
http://redpoll.pharmacy.ualberta.ca