Title: Predicting and Classifying Protein Structures
1Predicting and Classifying Protein Structures
- Michel Dumontier, Ph.D.
- Carleton University
- michel_at_bioinfocg.com
2Outline
- 3D Structure Determination
- Validation
- Structure Classification
- Structure Prediction
- Secondary Structure
3Structure Validation
- A structure can (and often does) have mistakes
- A poor structure will lead to poor models of mechanism or relationship
- Unusual parts of a structure may indicate something important (or an error)
4Famous bad structures
- Azotobacter ferredoxin (wrong space group)
- Zn-metallothionein (mistraced chain)
- Alpha bungarotoxin (poor stereochemistry)
- Yeast enolase (mistraced chain)
- Ras P21 oncogene (mistraced chain)
- Gene V protein (poor stereochemistry)
5Structure Validation
- Assess experimental fit
- look at Resolution, R-Factor or RMSD
- Assess correctness of overall fold
- look at disposition of hydrophobic residues
- Assess structure quality
- packing
- stereochemistry
- contacts...
6X-Ray Resolution
- Resolution (Å): Meaning
- >4.0: Coordinates meaningless.
- 3.0 - 4.0: Fold possibly correct, but errors are very likely. Many sidechains placed with wrong rotamer.
- 2.5 - 3.0: Fold likely correct except that some surface loops might be mis-modelled. Several long, thin sidechains (Lys, Glu, Gln, etc.) and small sidechains (Ser, Val, Thr, etc.) likely to have wrong rotamers.
- 2.0 - 2.5: As 2.5 - 3.0, but number of sidechains in wrong rotamer is considerably less. Many small errors can normally be detected. Fold normally correct and number of errors in surface loops is small.
- 1.5 - 2.0: Few residues have wrong rotamer. Many small errors can normally be detected. Fold always correct, also in surface loops.
- 0.5 - 1.5: Threonines may have wrong chirality on the C-beta.
7A Good Protein Structure..
X-ray structure (R-factor)
- R = 0.59: random chain
- R = 0.45: initial structure
- R = 0.35: getting there
- R = 0.25: typical protein
- R = 0.15: best case
- R = 0.05: small molecule
NMR structure (RMSD)
- RMSD = 4 Å: random
- RMSD = 2 Å: initial fit
- RMSD = 1.5 Å: OK
- RMSD = 0.8 Å: typical
- RMSD = 0.4 Å: best case
- RMSD = 0.2 Å: dream on
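The crystallographic R-factor quoted above compares observed and calculated structure-factor amplitudes: R = Σ| |Fobs| - |Fcalc| | / Σ|Fobs|. The following is a minimal sketch of that ratio, assuming the two amplitude lists are already on the same scale and in the same reflection order; the function name `r_factor` is illustrative, not taken from any refinement package.

```python
def r_factor(f_obs, f_calc):
    """Conventional crystallographic R-factor.

    Assumes f_obs and f_calc are paired amplitude lists on a common scale.
    """
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty lists of equal length")
    # Sum of absolute amplitude differences over all reflections
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    # Normalise by the total observed amplitude
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


if __name__ == "__main__":
    # Three made-up reflections, for illustration only
    print(r_factor([100.0, 50.0, 50.0], [90.0, 60.0, 50.0]))
```

Lower values mean the model reproduces the diffraction data more closely, which is why the scale above runs from a random chain down to a small molecule.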
8A Good Protein Structure..
- Minimizes disallowed torsion angles
- Maximizes number of hydrogen bonds
- Maximizes buried hydrophobic ASA
- Maximizes exposed hydrophilic ASA
- Minimizes interstitial cavities or spaces
9A Good Protein Structure..
- Minimizes number of bad contacts
- Minimizes number of buried charges
- Minimizes radius of gyration
- Minimizes covalent and noncovalent (van der Waals
and coulombic) energies
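Of the criteria above, the radius of gyration is the simplest to compute directly from coordinates. A minimal sketch follows, assuming every atom is given equal weight (a mass-weighted version would weight each term by atomic mass); the function name is illustrative.

```python
import math


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) points, in the input units."""
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    # Geometric centre of the point set
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    # Mean squared distance from the centre
    mean_sq = sum(
        (x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 for x, y, z in coords
    ) / n
    return math.sqrt(mean_sq)


if __name__ == "__main__":
    square = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)]
    print(radius_of_gyration(square))
```

For chains of the same length, a compact globular fold gives a smaller value than an extended one.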
10Structure Validation Servers
- WHAT IF
- http://swift.cmbi.kun.nl/WIWWWI/
- Verify3D
- http://www.doe-mbi.ucla.edu/Services/Verify_3D/
- VADAR
- http://redpoll.pharmacy.ualberta.ca
13Structure Validation Programs
- PROCHECK
- http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
- VADAR
- http://www.pence.ca/software/vadar/latest/vadar.html
- DSSP
- http://www.cmbi.kun.nl/gv/dssp/
14Procheck
15Outline
- 3D Structure Determination
- Validation
- Structure Classification
- Structure Prediction
- Secondary Structure
16Domains are ubiquitous in proteins
Large proteins are composed of compact, semi-independent units: domains.
Reasons: modularity, folding efficiency
2MCP.PDB
17Protein Domains: an alphabet of functional modules
18SCOP
- The SCOP database aims to provide a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known.
- Created by manual inspection and aided by automated methods
- Consists of four hierarchical categories: Class, Fold, Superfamily and Family.
- http://scop.mrc-lmb.cam.ac.uk/scop
19structural classification
The eight most frequent SCOP superfolds
20Semi-automated consensus domain definition -
Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA
http://www.biochem.ucl.ac.uk/bsm/cath/
Jones S et al. (1998) Domain assignment for protein structures using a consensus approach: characterization and analysis. Protein Science 7:233-242
21CATH - Class
Class 4 Few Secondary Structures
Class 2 Mainly Beta
Class 3 Mixed Alpha/Beta
Secondary structure content (automatic)
22CATH - Architecture
Super Roll
Barrel
2-Layer Sandwich
Orientation of secondary structures (manual)
23CATH - Topology
Serine Protease
Aconitase, domain 4
TIM Barrel
Topological connection and number of secondary
structures
24CATH - Homology
Dihydropteroate (DHP) synthetase
FMN dependent fluorescent proteins
7-stranded glycosidases
Superfamily: clusters of similar structures and functions
25Conserved Domain Database
Automated (objective) domain definition using sequence.
CDD: from SMART and Pfam
CDART: from CDD and GenBank
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
26Homologous domains have similar structures
1PLS/2DYN: 23% sequence identity
1PLS - PH domain (Human pleckstrin)
2DYN - PH domain (Human dynamin)
27Homology and Structural Similarity
Proteins that diverge in evolution maintain their global fold!
Russell et al. (1997) J Mol Biol 269:423-439
28Superposition
- Important as a means to identify protein motifs and fold families
- Non-evolutionary structural relationships
Structural similarity between Calmodulin and Acetylcholinesterase
29RMSD metric
To calculate the RMSD, a pairwise correspondence
of points has to be defined first.
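Once the pairwise correspondence is fixed, the coordinate RMSD is the square root of the mean squared distance between paired points. A minimal sketch, assuming the two lists are already paired index by index (for example, matched C-alpha atoms) and that no superposition is applied; the function name `rmsd` is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Coordinate RMSD between two equally long, index-paired lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b):
        # Squared distance between the i-th point of A and the i-th point of B
        total += (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    b = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(a, b))
```

The value depends on how the two structures are placed relative to each other, so it is only meaningful for a stated correspondence and orientation.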
30RMSDopt
$\mathrm{RMSD}_{opt} = \min(\mathrm{RMSD}_{coord})$
$\mathrm{RMSD}_{opt} = \mathrm{RMSD}_{coord}(A,\ R_s (B - T_s))$
The translation vector $T_s$ and the rotation matrix $R_s$ define a superposition of the vector set B on A.
An analytic solution of the superposition problem is available, but not straightforward (involves an eigenvalue problem).
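One widely used analytic route to the optimal superposition is the Kabsch algorithm, which obtains the rotation from a singular value decomposition of the 3x3 covariance matrix (closely related to the eigenvalue formulation mentioned above). The sketch below is one possible implementation, not necessarily the method behind these slides. It assumes NumPy is available and that A and B are already paired, equal-length coordinate sets; the function name is illustrative. It moves B's centroid onto A's rather than using a single translation vector, which gives the same minimum.

```python
import numpy as np


def kabsch_superpose(A, B):
    """Superpose point set B onto A; return (optimal RMSD, transformed copy of B)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape or A.ndim != 2 or A.shape[1] != 3:
        raise ValueError("A and B must both have shape (N, 3)")
    # Remove the translation by centring both sets on their centroids
    centroid_a = A.mean(axis=0)
    centroid_b = B.mean(axis=0)
    A0 = A - centroid_a
    B0 = B - centroid_b
    # 3x3 covariance matrix between the centred sets
    H = B0.T @ A0
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    # Rotate centred B, then move it onto A's centroid
    B_fit = (R @ B0.T).T + centroid_a
    rmsd_opt = np.sqrt(((A - B_fit) ** 2).sum(axis=1).mean())
    return rmsd_opt, B_fit


if __name__ == "__main__":
    A = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]]
    # A rotated 90 degrees about z, then shifted by (5, -1, 2)
    B = [[5, -1, 2], [5, 0, 2], [3, -1, 2], [5, -1, 5]]
    value, _ = kabsch_superpose(A, B)
    print(value)
```

Because B here is a rigid-body copy of A, the optimal RMSD should be zero to within floating-point error.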
31Superposition in practice
- Pre-aligned structures
- VAST www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
- FSSP www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
- Homstrad www-cryst.bioc.cam.ac.uk/homstrad/
- PDBsum www.biochem.ucl.ac.uk/bsm/pdbsum/
- DALI www.ebi.ac.uk/dali/
- On the fly
- CE cl.sdsc.edu/ce.html
- FAST biowulf.bu.edu/FAST/
32Outline
- 3D Structure Determination
- Validation
- Structure Classification
- Structure Prediction
- Secondary Structure
33Secondary (2°) Structure
34Secondary Structure Prediction
- One of the first fields to emerge in bioinformatics (1967)
- Grew from a simple observation that certain amino acids or combinations of amino acids seemed to prefer to be in certain secondary structures
- Subject of hundreds of papers and dozens of books, many methods
35 2° Structure Prediction
- Statistical (Chou-Fasman, GOR)
- Homology or Nearest Neighbor (Levin)
- Physico-Chemical (Lim, Eisenberg)
- Pattern Matching (Cohen, Rooman)
- Neural Nets (Qian & Sejnowski, Karplus)
- Evolutionary Methods (Barton, Niemann)
- Combined Approaches (Rost, Levin, Argos)
36Secondary Structure Prediction
37Chou-Fasman Statistics
38Simplified C-F Algorithm
- Select a window of 7 residues
- Calculate average Pa over this window and assign that value to the central residue
- Repeat the calculation for Pb and Pc
- Slide the window down one residue and repeat until sequence is complete
- Analyze the resulting plot and assign each residue the secondary structure state (H, B, C) with the highest value.
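The windowing steps above can be sketched in a few lines. This is a minimal illustration of the simplified procedure, not the full Chou-Fasman method. The propensity table below is a made-up three-residue toy and should be replaced with published Chou-Fasman parameters; the function name and the choice to shorten the window at the sequence ends are also assumptions.

```python
# Toy propensities for illustration only; NOT the published Chou-Fasman values.
TOY_PROPENSITIES = {
    "H": {"A": 1.4, "V": 1.0, "G": 0.6},  # helix (Pa)
    "B": {"A": 0.8, "V": 1.7, "G": 0.7},  # beta (Pb)
    "C": {"A": 0.7, "V": 0.6, "G": 1.5},  # coil (Pc)
}


def sliding_window_assign(seq, propensities, window=7):
    """Assign each residue the state with the highest window-averaged propensity."""
    half = window // 2
    states = sorted(propensities)
    assignment = []
    for i in range(len(seq)):
        # Window centred on residue i, truncated at the sequence ends (assumption)
        segment = seq[max(0, i - half): i + half + 1]
        averages = {
            s: sum(propensities[s][aa] for aa in segment) / len(segment)
            for s in states
        }
        # Pick the state whose averaged propensity is largest
        assignment.append(max(states, key=lambda s: averages[s]))
    return "".join(assignment)


if __name__ == "__main__":
    print(sliding_window_assign("AAAAAAAGGGGGGGVVVVVVV", TOY_PROPENSITIES))
```

Plotting the three averaged values against residue number gives the kind of helix, beta, and coil profile that the assignment is read from.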
39Simplified C-F Algorithm
[Figure: helix, beta, and coil profiles plotted against residue number (10 to 60)]
40Limitations of Chou-Fasman
- Does not take into account
- long range information (>3 residues away)
- structure class
- Does not include
- related sequences or alignments in prediction process
- Only about 55% accurate
41The PHD Algorithm
- Search the SWISS-PROT database and select high scoring homologues
- Create a sequence profile from the resulting multiple alignment
- Include global sequence info in the profile
- Input the profile into a trained two-layer neural network to predict the structure and to clean up the prediction
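The profile step can be illustrated with per-column residue frequencies from a multiple alignment. This is only a sketch of the general idea of a sequence profile; the actual PHD input encoding is not described here, and the function name and the gap handling (gaps are ignored) are assumptions.

```python
from collections import Counter


def column_profile(alignment):
    """Return one {residue: frequency} dict per alignment column, ignoring '-' gaps."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("alignment rows must be non-empty and of equal length")
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, skipping gap characters
        counts = Counter(aa for aa in column if aa != "-")
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


if __name__ == "__main__":
    # Made-up three-sequence alignment, for illustration only
    for position, freqs in enumerate(column_profile(["ACD", "ACE", "A-E"]), start=1):
        print(position, freqs)
```

Each column's frequency vector, rather than a single residue, is what a profile-based predictor would receive for that position.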
42Prediction Performance
43Best of the Best
- PredictProtein-PHD (72%)
- http://cubic.bioc.columbia.edu/predictprotein/
- Jpred (73-75%)
- http://www.compbio.dundee.ac.uk/www-jpred/submit.html
- SAM-T02 (75%)
- http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
- PSIpred (77%)
- http://bioinf.cs.ucl.ac.uk/psipred/psiform.html
45Evaluating Secondary Structure Predictions
- Historically problematic due to tester bias (developer trains and tests their own predictions)
- Some predictions were up to 10% off
- Move to make testing independent and test sets as large as possible
- EVA: evaluation of protein secondary structure prediction
46EVA
- >10 different methods evaluated as new structures are deposited in the PDB
- Results posted on the web and updated weekly
- http://cubic.bioc.columbia.edu/eva
47EVA
48Secondary Structure Evaluation
- Q3 score
- standard method in evaluating performance: 3 states (H, C, B) evaluated like a multiple choice exam with 3 choices. Same as % correct
- SOV (segment overlap score)
- more useful measure of how segments overlap and how much overlap exists
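Q3 is simply the percentage of residues whose predicted state matches the observed one. A minimal sketch, assuming both strings use the same three-letter state alphabet (H, B, C) and are aligned position by position; the function name `q3` is illustrative. SOV is not shown because it requires segment-level bookkeeping.

```python
def q3(predicted, observed):
    """Percentage of positions where the predicted state equals the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    # Count exact per-residue matches
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


if __name__ == "__main__":
    print(q3("HHHCCBBB", "HHHCCCBB"))
```

Because every residue counts equally, Q3 can stay high even when a predicted segment is shifted or broken up, which is the gap SOV is meant to address.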
49Homology Modeling
- Similar sequences usually share the same fold.
- Structure models can be constructed from alignments with proteins having a 3D structure.
- When no suitable template structure can be found, possible templates are found using threading
- More with Boris in 3.3 and 3.5
50ab initio Protein Structure Prediction
- Predicting the 3D structure without any prior knowledge
- Used when homology modeling or threading have failed (no homologues are evident)
- Equivalent to solving the Protein Folding Problem
- Still an active research problem
- Howard's Lecture 5.2
51Conclusions
- Protein structures are now sufficiently abundant and well defined that they can be classified using well-developed rules of taxonomy
- Distant relationships and common rules of folding can be uncovered through fold classification and comparison
52Conclusions
- Structure prediction is still one of the key areas of active research in bioinformatics and computational biology
- Significant strides have been made over the past decade through the use of larger databases, machine learning methods and faster computers