Protein Function Analysis using Computational Mutagenesis PowerPoint PPT Presentation


Title: Protein Function Analysis using Computational Mutagenesis


1
Protein Function Analysis using Computational Mutagenesis
  • CASB workshop, 9/23/10

Iosif Vaisman
Laboratory for Structural Bioinformatics, proteins.gmu.edu
Department of Bioinformatics and Computational Biology
2
Delaunay simplices classification
3
Protein representation (Crambin)
4
Neighbor identification in proteins: Voronoi/Delaunay
Tessellation in 2D
5
Neighbor identification in proteins: Voronoi/Delaunay
Tessellation in 2D
Figure panels: Voronoi Tessellation, Delaunay Tessellation
6
Delaunay tessellation of Crambin
7
Delaunay Tessellation of Protein Structure
Abstract each amino acid to a point: Cα or center of mass (figure example: D (Asp))
Atomic coordinates: Protein Data Bank (PDB)
Delaunay tessellation: 3D tiling of space into
non-overlapping, irregular tetrahedral simplices.
Each simplex objectively defines a quadruplet of
nearest-neighbor amino acids at its vertices.
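The empty-circumsphere property that defines a Delaunay simplex can be illustrated with a brute-force sketch. It assumes each residue has already been reduced to a single 3D point; the five coordinates below are invented, and the function names are illustrative rather than taken from the presentation. A real protein would call for an efficient routine from a computational geometry library rather than this exhaustive loop over all quadruplets.

```python
from itertools import combinations

def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def circumsphere(p0, p1, p2, p3):
    """Return (center, radius squared) of the sphere through four points,
    or None if the points are (nearly) coplanar."""
    # The center x satisfies 2 * (p_i - p0) . x = |p_i|^2 - |p0|^2 for i = 1..3.
    rows = [[2 * (p[k] - p0[k]) for k in range(3)] for p in (p1, p2, p3)]
    rhs = [sum(c * c for c in p) - sum(c * c for c in p0) for p in (p1, p2, p3)]
    d = det3(rows)
    if abs(d) < 1e-12:
        return None
    center = []
    for col in range(3):  # Cramer's rule, one coordinate at a time
        m = [row[:] for row in rows]
        for r in range(3):
            m[r][col] = rhs[r]
        center.append(det3(m) / d)
    r2 = sum((center[k] - p0[k]) ** 2 for k in range(3))
    return center, r2

def delaunay_simplices(points, eps=1e-9):
    """Keep every quadruplet whose circumsphere contains no other point."""
    simplices = []
    for quad in combinations(range(len(points)), 4):
        sphere = circumsphere(*(points[i] for i in quad))
        if sphere is None:
            continue
        center, r2 = sphere
        if all(sum((points[j][k] - center[k]) ** 2 for k in range(3)) >= r2 - eps
               for j in range(len(points)) if j not in quad):
            simplices.append(quad)
    return simplices

# Hypothetical residue points: four corners plus one interior point (index 4).
points = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4), (1, 1, 1)]
simplices = delaunay_simplices(points)
```

With these points the outer tetrahedron is rejected because the interior point lies inside its circumsphere, so every retained simplex has the interior point as one of its four nearest-neighbor vertices.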
8
Compositional propensities of Delaunay simplices
f: observed quadruplet frequency; p_ijkl = C a_i a_j a_k a_l; a: residue frequency
AAAA: C = 4! / 4! = 1
AAAV: C = 4! / (3! x 1!) = 4
AAVV: C = 4! / (2! x 2!) = 6
AAVR: C = 4! / (2! x 1! x 1!) = 12
AVRS: C = 4! / (1! x 1! x 1! x 1!) = 24
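The permutation factor C and the resulting expected frequency can be sketched as follows. The log-likelihood form log(f / p), the base-10 logarithm, the uniform residue frequencies, and the observed frequency in the example are assumptions made for illustration; only the factor C and the product C a_i a_j a_k a_l come from the slide.

```python
from collections import Counter
from math import factorial, log10

def permutation_factor(quad):
    """C = 4! / prod(t_n!), where t_n counts each distinct residue type."""
    denom = 1
    for count in Counter(quad).values():
        denom *= factorial(count)
    return factorial(len(quad)) // denom

def expected_frequency(quad, residue_freq):
    """p_ijkl = C * a_i * a_j * a_k * a_l."""
    p = permutation_factor(quad)
    for res in quad:
        p *= residue_freq[res]
    return p

def log_likelihood(observed_freq, quad, residue_freq):
    """Assumed form: log10 of observed over expected quadruplet frequency."""
    return log10(observed_freq / expected_frequency(quad, residue_freq))

# Hypothetical uniform residue frequencies and a made-up observed frequency.
uniform = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}
example = log_likelihood(1.5e-3, "AVRS", uniform)  # observed 10x above expected
```

A positive value means the quadruplet occurs more often than its composition alone would predict, and a negative value means it occurs less often.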

9
Counting Quadruplets
  • Assuming order independence among residues
    comprising Delaunay simplices, the maximum number
    of distinct quadruplets (20 amino acid types,
    repetition allowed) forming such simplices is 8855
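The count of 8855 can be checked by enumerating unordered quadruplets with repetition over a 20-letter amino acid alphabet. The one-letter alphabet string is supplied here for illustration; grouping by composition pattern connects the total to the five composition types (AAAA, AAAV, AAVV, AAVR, AVRS).

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residue types

# Unordered quadruplets, repetition allowed.
quadruplets = list(combinations_with_replacement(AMINO_ACIDS, 4))

# Closed form: multisets of size 4 drawn from 20 types = C(20 + 4 - 1, 4).
by_formula = comb(20 + 4 - 1, 4)

# Composition pattern of each quadruplet, e.g. (2, 1, 1) for the AAVR type.
patterns = Counter(
    tuple(sorted(Counter(q).values(), reverse=True)) for q in quadruplets
)
```

Here `patterns` maps each composition pattern to the number of quadruplets of that type, and the five counts add up to the total.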

10
Log-likelihood of amino acid quadruplets with
different compositions
11
Log-likelihood of amino acid quadruplets
12
Log-likelihood of amino acid quadruplets
13
Computational Mutagenesis Methodology
  • Observations:
  • Relatively few mutant and wild-type (wt) structures
    of the same protein have been solved
  • Tessellations of mutant and wt protein structures
    are very similar or identical
  • Approach:
  • Obtain a topological score (TSmut) and 3D-1D
    potential profile vector (Qmut) for any mutant
    protein by using the wt structure tessellation as
    a template
  • Simply change the residue label at the given
    point(s) and recompute

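Figure: the four simplices sharing position 5, shown with R5 and with I5 (mutation R5 → I5): s(R,D,A,L), s(R,G,F,L), s(R,D,K,S), s(R,S,C,G) and s(I,D,A,L), s(I,G,F,L), s(I,D,K,S), s(I,S,C,G). The other vertices (D3, K4, L6, F7, A22, G62, C63, S64) are the same in both panels. Panel labels: (TSwt, Qwt) and (TSmut, Qmut).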
14
Computational Mutagenesis Methodology
  • Scalar: Residual Score of a mutant
  • Residual score = TSmut - TSwt, the (mutant - wt)
    topological score difference (empirical measure of
    relative structural change due to mutation)
  • Vector: Residual Profile of a mutant
  • R = Qmut - Qwt, the (mutant - wt) 3D-1D potential
    profile difference (environmental perturbation
    score at every position in structure)
  • Denote R = <EC1, EC2, EC3, ..., ECN>
  • ECi = qi,mut - qi,wt: relative Environmental
    Change at position i
  • Geometric property: if the mutant is due to a single
    substitution at position j, then ECj = mutant
    residual score (epicenter of impact)
  • The only other nonzero EC components correspond
    to neighboring positions that participate in
    simplices with j
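A minimal sketch of the relabel-and-recompute step and of the geometric property, under stated assumptions: the four simplices around position 5 follow the figure labels, while the fifth simplex, position V30, every numeric score, and the choice of substituting I for R at position 5 are invented for illustration.

```python
# Hypothetical log-likelihood scores keyed by the sorted residue quadruplet.
SCORES = {
    ("A", "D", "L", "R"): 0.10, ("F", "G", "L", "R"): -0.20,
    ("D", "K", "R", "S"): 0.45, ("C", "G", "R", "S"): 0.05,
    ("A", "D", "I", "L"): 0.30, ("F", "G", "I", "L"): 0.25,
    ("D", "I", "K", "S"): -0.15, ("C", "G", "I", "S"): -0.10,
    ("A", "F", "L", "V"): 0.20,
}

# Tessellation of the wt structure as position quadruplets; it is reused
# unchanged for the mutant. The last simplex does not touch position 5.
SIMPLICES = [(5, 3, 22, 6), (5, 62, 7, 6), (5, 3, 4, 64), (5, 64, 63, 62),
             (22, 6, 7, 30)]
WT = {3: "D", 4: "K", 5: "R", 6: "L", 7: "F", 22: "A", 30: "V",
      62: "G", 63: "C", 64: "S"}

def simplex_score(seq, simplex):
    """Order-independent score of the residues sitting at a simplex."""
    return SCORES[tuple(sorted(seq[p] for p in simplex))]

def topological_score(seq):
    """TS: sum of scores over all simplices."""
    return sum(simplex_score(seq, s) for s in SIMPLICES)

def potential_profile(seq):
    """Q: for each position, sum of scores of the simplices it belongs to."""
    q = {pos: 0.0 for pos in seq}
    for s in SIMPLICES:
        score = simplex_score(seq, s)
        for pos in s:
            q[pos] += score
    return q

# Change only the residue label at position 5 and recompute.
mutant = dict(WT)
mutant[5] = "I"

residual_score = topological_score(mutant) - topological_score(WT)
q_wt, q_mut = potential_profile(WT), potential_profile(mutant)
residual_profile = {pos: q_mut[pos] - q_wt[pos] for pos in WT}
nonzero = {pos for pos, ec in residual_profile.items() if abs(ec) > 1e-12}
```

In this toy case EC5 equals the residual score, and position 30, which shares no simplex with position 5, has an EC of zero.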

15
Approach 1: Protein Topological Score (TS)
  • Obtained by summing the log-likelihood scores of
    all simplicial quadruplets defined by the protein
    tessellation
  • Global measure of protein sequence-structure
    compatibility
  • Total (empirical or statistical) potential of the
    protein

TS = Σ s(i,j,k,l), sum taken over all simplex
quadruplets (i,j,k,l) in the entire tessellation.
Figure: close-up view of only the four simplices that use
R at position 5 as a vertex (hypothetical): s(R,D,A,L),
s(R,G,F,L), s(R,D,K,S), s(R,S,C,G); vertices D3, K4, R5, L6,
F7, A22, G62, C63, S64.
16
Approach 2: Residue Environment Scores
  • For each amino acid position, locally sum the
    log-likelihood scores s(i,j,k,l) of only simplex
    quadruplets that include it as a vertex

Example: q5 = q(R5) = Σ s(i,j,k,l), sum
over all simplex quadruplets (i,j,k,l) that
include amino acid R5
Figure: the four simplices that share R5: s(R,D,A,L),
s(R,G,F,L), s(R,D,K,S), s(R,S,C,G); vertices D3, K4, R5, L6,
F7, A22, G62, C63, S64.
  • The scores of all amino acid positions in the
    protein structure form a 3D-1D Potential Profile
    vector Q = <q1, q2, q3, ..., qN> (N = length of
    primary sequence in solved structure)
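The link between the global score and the per-residue profile can be seen in a few lines. Since every simplex has four vertices, each simplex score is counted four times across the profile, so the profile entries sum to 4 x TS. The numeric scores below are invented; only the four position quadruplets follow the figure labels.

```python
# Hypothetical scores for the four simplices that share R5.
simplex_scores = {
    (5, 3, 22, 6): 0.10,     # s(R,D,A,L)
    (5, 62, 7, 6): -0.20,    # s(R,G,F,L)
    (5, 3, 4, 64): 0.45,     # s(R,D,K,S)
    (5, 64, 63, 62): 0.05,   # s(R,S,C,G)
}

# Approach 1: one global number for the whole (toy) tessellation.
ts = sum(simplex_scores.values())

# Approach 2: accumulate each simplex score onto its four vertex positions.
profile = {}
for simplex, score in simplex_scores.items():
    for pos in simplex:
        profile[pos] = profile.get(pos, 0.0) + score

profile_total = sum(profile.values())
```

In this close-up every simplex contains position 5, so q5 coincides with TS; in a full tessellation q5 would cover only the subset of simplices around R5.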

17
Reversibility Analysis
Forward Mutation: S1,E1 (reference PDB) → S1,E2 (calculated mutant)
Reverse Mutation: S2,E2 (mutant PDB) → S2,E1 (calculated reference)
18
Reversibility of mutations (T4 lysozyme)
Protein   Mutation   Score change
1l63      T26E       -2.49
180l      E26T        2.01
1l63      A82S        1.49
123l      S82A       -1.49
1l63      V87M       -0.28
1cu3      M87V        0.22
1l63      A93C       -1.98
138l      C93A        1.78
1l63      T152S      -1.08
1goj      S152T       1.12
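One way to read these numbers is to pair each forward mutation with its reverse and check that the two score changes have opposite signs and similar magnitudes. The values are copied from the table above; treating the sum of a pair as an asymmetry measure is an illustrative choice, not something stated in the presentation.

```python
# (forward mutation, forward score change, reverse mutation, reverse score change)
pairs = [
    ("T26E", -2.49, "E26T", 2.01),
    ("A82S", 1.49, "S82A", -1.49),
    ("V87M", -0.28, "M87V", 0.22),
    ("A93C", -1.98, "C93A", 1.78),
    ("T152S", -1.08, "S152T", 1.12),
]

def is_reverse(fwd, rev):
    """True if rev swaps the two residue letters of fwd at the same position."""
    return fwd[0] == rev[-1] and fwd[-1] == rev[0] and fwd[1:-1] == rev[1:-1]

# Perfect reversibility would give forward + reverse = 0 for every pair.
asymmetry = {fwd: round(f + r, 2) for fwd, f, rev, r in pairs}
opposite_signs = all(f * r < 0 for _, f, _, r in pairs)
largest_gap = max(abs(a) for a in asymmetry.values())
```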
19
Reversibility Analysis
20
Functional Effects of Amino Acid Substitutions
  • Change in protein stability:
  • Effect on melting temperature: ΔTm = Tm (mutant)
    - Tm (wt)
  • Effect on thermal denaturation: ΔΔG = ΔG (mutant)
    - ΔG (wt)
  • Effect on denaturant denaturation: ΔΔGH2O = ΔGH2O
    (mutant) - ΔGH2O (wt)
  • Change in protein activity:
  • Mutant enzymatic activity relative to wt
  • Mutant strength of DNA binding relative to wt
  • Disease potential of human coding nsSNPs:
  • Neutral polymorphism or disease-associated
    mutation?
  • For protein targets of inhibitor drugs:
  • Continued susceptibility or (degree of)
    resistance that patients with the mutant protein
    have to the inhibitor
  • Inhibitor binding energy to mutant target
    relative to wt

21
Examples of Experimental Mutagenesis Data
22
Example: HIV-1 Protease (PR)
23
HIV-1 PR Dataset Example: Residual Profiles of
536 Experimental Mutants


24
Experimental Mutants: Residual Scores Elucidate
the Structure-Function Relationship
536 HIV-1 protease mutants
4041 lac repressor mutants
630 hIL-3 mutants
371 gene V protein mutants
25
Universal Model Approach: 8635 Experimental
Mutants from 7 Proteins
26
Universal Model Approach: 980 Experimental
Mutants from 20 Proteins
27
Structure-Function Correlation Based on Residual
Scores: nsSNPs
  • 1790 nsSNPs corresponding to single amino acid
    substitutions in several hundred proteins with
    tessellatable structures
  • Function: 1332 nsSNPs associated with disease,
    458 neutral
  • Data obtained from Swiss-Prot and HPI

28
Structure-Function Correlation Based on Residual
Scores: Drug Susceptibility
29
Algorithm Performance: 2015 T4 Lysozyme Mutants
30
Learning Curves for HIV-1 protease and T4
lysozyme mutants
31
Real-World Application: T4 Lysozyme Predictions
  • Experimental data (not part of training set)
    obtained from ProTherm database
  • Result: predictions match experiments for 30/35
    (86%) of the mutants

32
T4 Lysozyme Mutational Array
Training set mutants (n = 2015): Active, Inactive
Predicted test set mutants (n = 1101): Active, Inactive
33
GVP Mutational Array
34
Support Vector Regression
Capriotti et al. SVM regression (for comparison):
r = 0.71, standard error = 1.3 kcal/mol, y
0.5223x 0.4705
35
Conclusions
  • Computational mutagenesis derived from a
    four-body, knowledge-based statistical potential
    uniquely characterizes each protein mutant using
    both sequential and structural features
  • Attributes correlate well with mutant function -
    valuable for developing accurate machine learning
    based predictive models

36
Acknowledgements
Structural Bioinformatics Laboratory (GMU):
Tariq Alsheddi (structure alignment)
David Bostick (topological similarity)
Andrew Carr (functional sites, visualization)
Sunita Kumari (structural genomics)
Yong Luo (evolutionary structure analysis)
Majid Masso (mutagenesis, HIV-1 protease, LAC repressor, T4 lysozyme, SNP)
Ewy Mathe (mutagenesis, p53)
Olivia Peters (protein-protein interfaces)
Vadim Ravich (HIV RT mutagenesis)
Greg Reck (hydration potentials, amyloids)
Todd Taylor (statistical potentials, secondary structure, topology, protein stability)
Bill Zhang (mutagenesis, BRCA1)
Collaborators:
John Grefenstette (GMU)
Curt Jamison (GMU)
Dmitri Klimov (GMU)
Dan Carr (GMU)
Estela Blaisten (GMU)
Vladimir Karginov (IB)
Unpublished data:
Clyde Hutchison (UNC)
Ron Swanstrom (UNC)
Funding: NSF, NIH-Innovative Biologics, GMU-INOVA Research Fund
40
Evaluating Algorithm Performance
  • Overall goal: develop a model with known examples
    to accurately predict the class (or value) of
    instances that have not yet been assayed
    experimentally (potentially great savings of time
    and money)
  • Ideal situation: split a large original dataset
    into 3 subsets:
  • Training set (learn model)
  • Validation set (optimize model by tweaking model
    parameters)
  • Test set (evaluate model on new data not used to
    develop model)
  • Errors measured at each step (resubstitution,
    validation, generalization)
  • Approaches:
  • Tenfold cross-validation (10-fold CV)
  • Leave-one-out CV (i.e., jackknife or N-fold CV,
    N = dataset size)
  • split (e.g., use only 2/3 for training, 1/3
    held out for testing)

41
Evaluating Algorithm Performance
  • 10-fold CV:
  • Randomly split the dataset instances into 10
    equally-sized subsets
  • Hold out subset 1; combine subsets 2-10 into one
    training set for learning a model; use the trained
    model to predict classes of instances in subset 1
  • Repeat previous step 9 more times (e.g., hold out
    subset 2, combine subsets 1 and 3-10 together to
    train a model, use model to predict subset 2,
    etc.)
  • We end up with 10 models, each trained using 90%
    of the original dataset, and each used to predict
    the held-out 10% subset.
  • In the end, each instance has one class
    prediction; compare to actual class
  • LOOCV (leave-one-out CV, jackknife, or N-fold
    CV):
  • Similar to above, but each subset contains only 1
    instance
  • Deterministic: no randomness to which instances
    are grouped as subsets
  • Overall prediction accuracy provides rough idea
    of how a model trained with the full dataset will
    perform
  • split (self-explanatory)
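The hold-out loop described above can be sketched in plain Python. The nearest-centroid learner, the single numeric feature, and the toy data are placeholders chosen only to keep the example self-contained; they are not the models or features used in the presentation.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k subsets of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def train_centroids(xs, ys):
    """Placeholder learner: mean feature value per class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}

def predict_nearest(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(x - model[label]))

def cross_validate(xs, ys, k, seed=0):
    """Return one held-out prediction per instance."""
    predictions = [None] * len(xs)
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train_idx = [i for i in range(len(xs)) if i not in held]
        model = train_centroids([xs[i] for i in train_idx],
                                [ys[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict_nearest(model, xs[i])
    return predictions

# Toy dataset: 10 clearly negative and 10 clearly positive feature values.
xs = [-(1 + 0.1 * i) for i in range(10)] + [1 + 0.1 * i for i in range(10)]
ys = ["Neg"] * 10 + ["Pos"] * 10

pred_10fold = cross_validate(xs, ys, k=10)       # 10 models, 2 held out each
pred_loocv = cross_validate(xs, ys, k=len(xs))   # N models, 1 held out each
accuracy_10fold = sum(p == y for p, y in zip(pred_10fold, ys)) / len(ys)
```

Setting `k` equal to the dataset size turns the same routine into LOOCV, where the shuffle no longer affects which instances are grouped together.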

42
Evaluating Algorithm Performance
  • Assume instances belong to two generic classes
    (Pos/Neg)
  • Results of comparing predictions with actual
    classes based on the approaches described
    (10-fold CV, LOOCV, split) can be summarized in
    a confusion matrix
  • Classification performance measures:
  • accuracy = (TP + TN) / (TP + FP + TN + FN)
  • sensitivity = TP / (TP + FN)
  • specificity = TN / (TN + FP)
  • precision = TP / (TP + FP)
  • BER = 0.5 x [FP / (FP + TN) + FN / (FN + TP)]
  • MCC = (TP x TN - FP x FN) / sqrt[(TP + FN)(TP + FP)(TN + FN)(TN + FP)]
  • AUC = area under ROC curve (plot of sensitivity
    vs. 1 - specificity)
  • For regression models: correlation coefficient,
    standard error

Confusion matrix:
              Predicted Pos   Predicted Neg
Actual Pos    TP              FN
Actual Neg    FP              TN
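The measures listed on this slide can be computed directly from the four confusion-matrix counts. The counts in the example are invented, and returning 0.0 for MCC when its denominator is zero is a common convention adopted here as an assumption.

```python
from math import sqrt

def classification_metrics(tp, fp, tn, fn):
    """Compute the slide's performance measures from confusion-matrix counts."""
    mcc_denominator = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        # Balanced error rate: mean of false positive and false negative rates.
        "BER": 0.5 * (fp / (fp + tn) + fn / (fn + tp)),
        # Matthews correlation coefficient (0.0 if undefined, by convention).
        "MCC": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }

# Hypothetical counts: 50 actual positives and 50 actual negatives.
metrics = classification_metrics(tp=40, fp=5, tn=45, fn=10)
```

For a balanced dataset like this one, BER is simply one minus the average of sensitivity and specificity.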
43
ROC Curve
  • Plot of true positive rate (sensitivity) versus
    false positive rate (1 - specificity) in the unit
    square
  • AUC = probability that classifier will rank a
    randomly chosen positive instance higher than a
    randomly chosen negative one
  • AUC ≈ 0.5 (ROC close to diagonal line joining
    points (0,0) and (1,1)) suggests no signal in
    dataset and that trained model is not likely to
    perform any better than random guessing
  • AUC = 1 (piecewise linear ROC joining (0,0) to
    (0,1) and (0,1) to (1,1)) indicates a perfect
    classifier
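The ranking interpretation of AUC can be checked against the area under the ROC curve on a small example. The classifier scores below are invented, and the ROC construction assumes no positive and negative instance share the same score.

```python
def auc_by_ranking(scores_pos, scores_neg):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

def roc_points(scores_pos, scores_neg):
    """ROC points (FPR, TPR) obtained by lowering the threshold one score at a time."""
    labeled = sorted([(s, 1) for s in scores_pos] + [(s, 0) for s in scores_neg],
                     reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in labeled:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / len(scores_neg), tp / len(scores_pos)))
    return points

def trapezoid_area(points):
    """Area under a piecewise linear curve given as (x, y) points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical classifier scores for four positive and four negative instances.
pos = [0.9, 0.8, 0.6, 0.55]
neg = [0.7, 0.5, 0.4, 0.3]

auc_rank = auc_by_ranking(pos, neg)
auc_area = trapezoid_area(roc_points(pos, neg))
perfect = auc_by_ranking([0.9, 0.8], [0.2, 0.1])
```

Both routes give the same value for these scores, and a classifier that scores every positive above every negative reaches an AUC of 1.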