Predicting and Classifying Protein Structures - PowerPoint PPT Presentation

1
Predicting and Classifying Protein Structures
  • Michel Dumontier, Ph.D.
  • Carleton University
  • michel_at_bioinfocg.com

2
Outline
  • 3D Structure Determination
  • Validation
  • Structure Classification
  • Structure Prediction
  • Secondary Structure

3
Structure Validation
  • A structure can (and often does) have mistakes
  • A poor structure will lead to poor models of
    mechanism or relationship
  • Unusual parts of a structure may indicate
    something important (or an error)

4
Famous bad structures
  • Azotobacter ferredoxin (wrong space group)
  • Zn-metallothionein (mistraced chain)
  • Alpha bungarotoxin (poor stereochemistry)
  • Yeast enolase (mistraced chain)
  • Ras P21 oncogene (mistraced chain)
  • Gene V protein (poor stereochemistry)

5
Structure Validation
  • Assess experimental fit
      • look at Resolution, R-Factor or RMSD
  • Assess correctness of overall fold
      • look at disposition of hydrophobic residues
  • Assess structure quality
      • packing
      • stereochemistry
      • contacts...

6
X-Ray Resolution
  • Resolution (Å): Meaning
  • > 4.0: Coordinates meaningless.
  • 3.0 - 4.0: Fold possibly correct, but errors
    are very likely. Many sidechains placed with
    wrong rotamer.
  • 2.5 - 3.0: Fold likely correct except that some
    surface loops might be mis-modelled. Several
    long, thin sidechains (lys, glu, gln, etc) and
    small sidechains (ser, val, thr, etc) likely to
    have wrong rotamers.
  • 2.0 - 2.5: As 2.5 - 3.0, but number of sidechains
    in wrong rotamer is considerably less. Many
    small errors can normally be detected. Fold
    normally correct and number of errors in
    surface loops is small.
  • 1.5 - 2.0: Few residues have wrong rotamer. Many
    small errors can normally be detected. Fold
    always correct, also in surface loops.
  • 0.5 - 1.5: Threonines may have wrong
    chirality on the C-beta.

7
A Good Protein Structure..
X-ray structure (R-factor)
  • R = 0.59 random chain
  • R = 0.45 initial structure
  • R = 0.35 getting there
  • R = 0.25 typical protein
  • R = 0.15 best case
  • R = 0.05 small molecule
NMR structure (RMSD)
  • RMSD = 4 Å random
  • RMSD = 2 Å initial fit
  • RMSD = 1.5 Å OK
  • RMSD = 0.8 Å typical
  • RMSD = 0.4 Å best case
  • RMSD = 0.2 Å dream on
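
For reference, the R values listed for X-ray structures are crystallographic R-factors. The slide does not spell out the formula; the conventional definition, summed over all measured reflections hkl, is sketched below. The symbols F_obs (observed structure-factor amplitude) and F_calc (amplitude calculated from the model) are standard notation, not taken from the slide.

```latex
R = \frac{\sum_{hkl} \bigl|\, |F_{obs}(hkl)| - |F_{calc}(hkl)| \,\bigr|}
         {\sum_{hkl} |F_{obs}(hkl)|}
```

A lower R means the model reproduces the measured diffraction data more closely, which is why the list runs from a random chain (R = 0.59) down to a small molecule (R = 0.05).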

8
A Good Protein Structure..
  • Minimizes disallowed torsion angles
  • Maximizes number of hydrogen bonds
  • Maximizes buried hydrophobic ASA
  • Maximizes exposed hydrophilic ASA
  • Minimizes interstitial cavities or spaces

9
A Good Protein Structure..
  • Minimizes number of bad contacts
  • Minimizes number of buried charges
  • Minimizes radius of gyration
  • Minimizes covalent and noncovalent (van der Waals
    and coulombic) energies

10
Structure Validation Servers
  • WHAT IF: http://swift.cmbi.kun.nl/WIWWWI/
  • Verify3D: http://www.doe-mbi.ucla.edu/Services/Verify_3D/
  • VADAR: http://redpoll.pharmacy.ualberta.ca

13
Structure Validation Programs
  • PROCHECK: http://www.biochem.ucl.ac.uk/roman/procheck/procheck.html
  • VADAR: http://www.pence.ca/software/vadar/latest/vadar.html
  • DSSP: http://www.cmbi.kun.nl/gv/dssp/

14
Procheck
15
Outline
  • 3D Structure Determination
  • Validation
  • Structure Classification
  • Structure Prediction
  • Secondary Structure

16
Domains are ubiquitous in proteins
Large proteins are composed of compact,
semi-independent units - domains.
Reason: modularity, folding efficiency
2MCP.PDB
17
Protein Domains: an alphabet of functional modules
18
SCOP
  • The SCOP database aims to provide a detailed and
    comprehensive description of the structural and
    evolutionary relationships between all proteins
    whose structure is known.
  • Created by manual inspection and aided by
    automated methods
  • Consists of four hierarchical categories:
    Class, Fold, Superfamily and Family.
  • http://scop.mrc-lmb.cam.ac.uk/scop

19
structural classification
The eight most frequent SCOP superfolds
20
Semi-automated consensus domain definition -
Structure (CATH)
Dihydrolipoamide dehydrogenase 1LPFA
http://www.biochem.ucl.ac.uk/bsm/cath/
Jones S et al. (1998) Domain assignment for
protein structures using a consensus approach:
Characterization and analysis. Protein Science
7:233-242
21
CATH - Class
  • Class 1: Mainly Alpha
  • Class 2: Mainly Beta
  • Class 3: Mixed Alpha/Beta
  • Class 4: Few Secondary Structures

Secondary structure content (automatic)
22
CATH - Architecture
  • Roll
  • Super Roll
  • Barrel
  • 2-Layer Sandwich

Orientation of secondary structures (manual)
23
CATH - Topology
  • L-fucose Isomerase
  • Serine Protease
  • Aconitase, domain 4
  • TIM Barrel

Topological connection and number of secondary
structures
24
CATH - Homology
  • Alanine racemase
  • Dihydropteroate (DHP) synthetase
  • FMN dependent fluorescent proteins
  • 7-stranded glycosidases

Superfamily: clusters of similar structures and
functions
25
Conserved Domain Database
Automated (objective) domain definition using
sequence.
CDD: from SMART and Pfam
CDART: from CDD and GenBank
http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml
26
Homologous domains have similar structures
1PLS/2DYN: 23% sequence identity
1PLS - PH domain (Human pleckstrin)
2DYN - PH domain (Human dynamin)
27
Homology and Structural Similarity
Proteins that diverge in evolution maintain their
global fold!
Russell et al. (1997) J Mol Biol 269:423-439
28
Superposition
  • Important as a means to identify protein motifs
    and fold families
  • Non-evolutionary structural relationships

Structural similarity between Calmodulin and
Acetylcholinesterase
29
RMSD metric
To calculate the RMSD, a pairwise correspondence
of points has to be defined first.
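
As a minimal sketch of the coordinate RMSD described here: once the pairwise correspondence is fixed (point i in one set matches point i in the other), the RMSD is the square root of the mean squared distance between matched points. The function name `rmsd` and the list-of-tuples input format are illustrative assumptions, not something given on the slide.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) points, where point i of coords_a corresponds to point i
    of coords_b. No superposition is performed here."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty point sets of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        # squared Euclidean distance between the matched pair
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))
```

For example, `rmsd([(0, 0, 0)], [(3, 4, 0)])` evaluates to 5.0, the plain distance between the single matched pair.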
30
RMSDopt
$\mathrm{RMSD}_{opt} = \min(\mathrm{RMSD}_{coord})$
$\mathrm{RMSD}_{opt} = \mathrm{RMSD}_{coord}(A,\; R_s \times (B - T_s))$
The translation vector $T_s$ and the rotation matrix
$R_s$ define a superposition of the vector set B on
A.
An analytic solution of the superposition problem
is available, but not straightforward (involves
an eigenvalue problem).
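
One common way to obtain the optimal superposition is the Kabsch algorithm, sketched below. Note the substitution: it uses a singular value decomposition rather than the eigenvalue formulation mentioned on the slide, though both solve the same least-squares problem. The sketch assumes NumPy is available; the function name `superpose` and the N x 3 array layout are illustrative choices.

```python
import numpy as np


def superpose(A, B):
    """Rigidly superpose point set B onto A (both N x 3, rows matched
    pairwise). Returns (optimal RMSD, transformed copy of B)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    if A.shape != B.shape or A.ndim != 2 or A.shape[1] != 3:
        raise ValueError("A and B must both be N x 3 arrays")

    # Translation: move both sets so their centroids sit at the origin.
    centroid_a = A.mean(axis=0)
    centroid_b = B.mean(axis=0)
    A0 = A - centroid_a
    B0 = B - centroid_b

    # Rotation: SVD of the 3 x 3 covariance matrix between the sets.
    H = B0.T @ A0
    U, _, Vt = np.linalg.svd(H)
    # Flip one axis if needed so the result is a proper rotation
    # (determinant +1) and not a mirror image.
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

    # Apply the rotation to centred B, then move it onto A's centroid.
    B_fitted = B0 @ R.T + centroid_a
    rmsd_opt = float(np.sqrt(((A - B_fitted) ** 2).sum(axis=1).mean()))
    return rmsd_opt, B_fitted
```

If B is just a rotated and translated copy of A, the returned RMSD should be zero up to floating-point error.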
31
Superposition in practice
  • Pre-aligned structures
      • VAST: www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
      • FSSP: www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html
      • Homstrad: www-cryst.bioc.cam.ac.uk/homstrad/
      • PDBsum: www.biochem.ucl.ac.uk/bsm/pdbsum/
      • DALI: www.ebi.ac.uk/dali/
  • On the fly
      • CE: cl.sdsc.edu/ce.html
      • FAST: biowulf.bu.edu/FAST/

32
Outline
  • 3D Structure Determination
  • Validation
  • Structure Classification
  • Structure Prediction
  • Secondary Structure

33
Secondary (2°) Structure
34
Secondary Structure Prediction
  • One of the first fields to emerge in
    bioinformatics (1967)
  • Grew from a simple observation that certain amino
    acids or combinations of amino acids seemed to
    prefer to be in certain secondary structures
  • Subject of hundreds of papers and dozens of
    books, many methods

35
2° Structure Prediction
  • Statistical (Chou-Fasman, GOR)
  • Homology or Nearest Neighbor (Levin)
  • Physico-Chemical (Lim, Eisenberg)
  • Pattern Matching (Cohen, Rooman)
  • Neural Nets (Qian & Sejnowski, Karplus)
  • Evolutionary Methods (Barton, Niemann)
  • Combined Approaches (Rost, Levin, Argos)

36
Secondary Structure Prediction
37
Chou-Fasman Statistics
38
Simplified C-F Algorithm
  • Select a window of 7 residues
  • Calculate the average helix propensity (Pa) over
    this window and assign that value to the central
    residue
  • Repeat the calculation for the beta (Pb) and coil
    (Pc) propensities
  • Slide the window down one residue and repeat
    until sequence is complete
  • Analyze the resulting plot and assign each residue
    the secondary structure (H, B, C) with the highest
    value.
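
The steps above can be sketched in a few lines of Python. This is only an illustration of the sliding-window idea. The propensity table `TOY_PROPENSITIES` holds placeholder numbers for three residues, not the published Chou-Fasman parameters, and shrinking the window at the sequence ends is an assumption the slide does not address.

```python
# Placeholder propensities (helix, beta, coil) for illustration only;
# these are NOT the published Chou-Fasman values.
TOY_PROPENSITIES = {
    "A": (1.4, 0.8, 0.7),
    "V": (1.0, 1.7, 0.6),
    "G": (0.6, 0.8, 1.6),
}


def chou_fasman_simplified(sequence, propensities, window=7):
    """Assign H, B or C to each residue from window-averaged propensities.

    propensities maps a one-letter residue code to a
    (P_helix, P_beta, P_coil) tuple.
    """
    half = window // 2
    states = "HBC"
    prediction = []
    for i in range(len(sequence)):
        # Window centred on residue i, truncated at the sequence ends.
        start = max(0, i - half)
        end = min(len(sequence), i + half + 1)
        residues = sequence[start:end]
        averages = [
            sum(propensities[res][k] for res in residues) / len(residues)
            for k in range(3)
        ]
        # Highest average wins; ties resolve in the order H, B, C.
        prediction.append(states[averages.index(max(averages))])
    return "".join(prediction)
```

With the toy table, a run of seven alanines is assigned all H and a run of seven valines all B, because every window average is dominated by a single residue type.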

39
Simplified C-F Algorithm
[Plot: helix, beta and coil propensity curves versus residue number (10 to 60)]
40
Limitations of Chou-Fasman
  • Does not take into account
      • long range information (>3 residues away)
      • structure class
  • Does not include
      • related sequences or alignments in prediction
        process
  • Only about 55% accurate

41
The PHD Algorithm
  • Search the SWISS-PROT database and select high
    scoring homologues
  • Create a sequence profile from the resulting
    multiple alignment
  • Include global sequence info in the profile
  • Input the profile into a trained two-layer neural
    network to predict the structure and to
    clean-up the prediction
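
To make the "sequence profile" step concrete, here is a minimal sketch that turns a multiple alignment into per-column residue frequencies. It is not PHD's actual profile, which carries additional information such as the global sequence features mentioned above. The function name `sequence_profile`, the use of "-" for gaps and the decision to ignore gaps when computing frequencies are all assumptions made for illustration.

```python
from collections import Counter


def sequence_profile(alignment):
    """Per-column residue frequencies for a list of aligned sequences.

    All sequences must have the same length; '-' marks a gap and is
    ignored. Returns one {residue: frequency} dict per column.
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
        total = sum(counts.values())
        # An all-gap column yields an empty dict.
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile
```

Each column dictionary could then be expanded into a fixed-length vector (one entry per amino acid) before being fed to a neural network.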

42
Prediction Performance
43
Best of the Best
  • PredictProtein-PHD (72%)
      • http://cubic.bioc.columbia.edu/predictprotein/
  • Jpred (73-75%)
      • http://www.compbio.dundee.ac.uk/www-jpred/submit.html
  • SAM-T02 (75%)
      • http://www.cse.ucsc.edu/research/compbio/HMM-apps/T02-query.html
  • PSIpred (77%)
      • http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

45
Evaluating Secondary Structure Predictions
  • Historically problematic due to tester bias
    (developer trains and tests their own
    predictions)
  • Some predictions were up to 10% off
  • Move to make testing independent and test sets as
    large as possible
  • EVA: evaluation of protein secondary structure
    prediction

46
EVA
  • >10 different methods evaluated as new structures
    are deposited in the PDB
  • Results posted on the web and updated weekly
  • http://cubic.bioc.columbia.edu/eva

47
EVA
48
Secondary Structure Evaluation
  • Q3 score
      • standard method in evaluating performance: 3
        states (H, C, B) evaluated like a multiple choice
        exam with 3 choices. Same as % correct
  • SOV (segment overlap score)
      • more useful measure of how segments overlap and
        how much overlap exists
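
The Q3 score can be sketched as a per-residue percentage of matching states. The function name `q3_score` and the string encoding of predicted and observed states are illustrative assumptions. SOV is not shown because it depends on a segment-level definition that the slide does not give.

```python
def q3_score(predicted, observed):
    """Percentage of residues whose predicted state (H, B or C)
    matches the observed state."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty state strings of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)
```

For example, `q3_score("HHHCCBBB", "HHHCCBBC")` gives 87.5, since seven of the eight residues agree.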

49
Homology Modeling
  • Similar sequences usually share the same fold.
  • Structure models can be constructed from
    alignments with proteins having a 3D structure.
  • When no suitable template structure can be found,
    possible templates are found using threading
  • More with Boris in 3.3 and 3.5

50
ab initio Protein Structure Prediction
  • Predicting the 3D structure without any prior
    knowledge
  • Used when homology modeling or threading have
    failed (no homologues are evident)
  • Equivalent to solving the Protein Folding
    Problem
  • Still an active research problem
  • Howard's Lecture 5.2

51
Conclusions
  • Protein structures are now sufficiently abundant
    and well defined that they can be classified
    using well-developed rules of taxonomy
  • Distant relationships and common rules of folding
    can be uncovered through fold classification and
    comparison

52
Conclusions
  • Structure prediction is still one of the key
    areas of active research in bioinformatics and
    computational biology
  • Significant strides have been made over the past
    decade through the use of larger databases,
    machine learning methods and faster computers