Title: Amino acid conservation is strongly associated with functional and structural significance, and thus can be used to identify functionally or structurally important features, such as active sites and protein-protein interaction domains. However,
1Inferring Property Selection Pressure from
Positional Residue ConservationRose Hoberman,
Yong Lu, Judith Klein-Seetharamanand Roni
Rosenfeld Carnegie Mellon University,
University of Pittsburgh Medical School,
Pittsburgh, PA
Summary of Results for GPCRA
Evaluating Gaussian Goodness-of-Fit Assumption
We believe that conserved properties will most
often have a unimodal distributions, therefore we
use a Gaussian to model property
conservation. Method
Introduction
- Amino acid conservation is strongly associated
with functional and structural significance, and
thus can be used to identify functionally or
structurally important features, such as active
sites and protein-protein interaction domains.
However, identifying positional conservation is
only a first step. Different positions have
different functional and structural roles, and
thus the amino acids in each position will be
subject to different constraints. A more
difficult task is to understand the specific
selective pressures that have influenced the
distribution of residues at each position.
- Fit a maximum-likelihood Gaussian to amino acid
frequencies in property space - From Gaussian calculate expected amino acid
frequencies - Calculate goodness-of-fit to quantify
difference between expected and observed amino
acid frequencies - Identifies unimodal distributions
- Penalizes missing amino acids (holes)
Objectives
- Develop a method for systematically identifying
conservation of specific physical and chemical
properties at each site in a multiple sequence
alignment. - Design a test to determine whether the apparent
conservation of a particular property in a
position is statistically significant.
- Statistical Significance
- The probability of a property appearing conserved
purely by chance is high when the entropy is low
(when sequence conservation is high). Therefore
we use a Monte-Carlo method to estimate p-values
for each significant property. For each
position, property pair - Generate a large set of random (shuffled)
property scales - Calculate goodness-of-fit of each random scale
to a Gaussian - Distribution of these random values allows
us to estimate the probability of getting a
value more extreme than the observed
value merely by chance. - We use a Bonferroni correction to correct for
testing of multiple properties. For the chance of
a type I error to be no more than 5 per
position, then a Bonferroni correction for
multiple testing (of 240 properties) yields a
signi?cance threshold of a 0.05/240 0.0002.
Summary of results for GPCRA, where significant
property conservation at a position is indicated
by a dot, and the color of the dot indicates the
type of property predicted to be conserved.
Entropy and location of transmembrane helices are
also indicated. Reference positions are taken
from the Rhodopsin PDB structure.
Validation
Method
We compared our predictions for the GPCRA family
to current knowledge about the structure,
function, and dynamics of GPCR. We analyzed only
those positions with significant property
conservation and low sequence conservation
(entropy of 3.5 - 4). Our method identifies many
instances of known property conservation
Previous Work
Existing methods use amino acid similarity
matrices, set-theoretic methods, or ML
optimization of only a small handful of
properties 1,2 and generally dont provide a
test of statistical significance.
- Charge conserved at 134
- part of D/E R Y motif of importance to binding
and activation of G-protein - Size conserved at 54, 80, 87, 123, 132, 153, 299
- helix faces one or two other helices
- Dynamic properties conserved
- especially in third cytoplasmic loop
- in Rhodopsin this is the most flexible
interhelical loop - in TM at 215 and 269
- close proximity to retinal ligand the
ligand-binding pocket is the most rigid part of
crystal structure
Mapping Amino Acid Distributions into Property
Space
Results
Downloaded about 250 physical, chemical, and
structural properties of amino acids from the
online database PDBase 3. Each property scale
assigns a numerical value to each of the 20 amino
acids.
1 Hydrophobicity
2 Volume
3 Net charge
4 Transfer free energy
...
240 Average flexibility
- We applied our method to the
- G-Protein Receptor Family A.
- GPCRA is a family of transmembrane proteins
containing 7 transmembrane helices - Respond to a variety of ligands bind to
G-proteins - Diverse in sequence, but believed to share
similar structure (only known structure is for
Rhodopsin) - Used PFAM alignment of GPCRA proteins, but
analyzed only the 253 positions aligned to
nongapped positions in Rhodopsin. - Used significance threshold of a 0.025/240
0.0001
Conclusions
A C D E F G H I K L M N P Q R S T V W Y 0.66 2.87 0.10 0.87 3.15 1.64 2.17 1.67 0.09 2.77 0.67 0.87 1.52 0.00 0.85 0.07 0.07 1.87 3.77 2.67
- Proposed a method for systematically identifying
conservation of specific physical-chemical
properties at each site in a multiple sequence
alignment. - Our test identifies which properties are
significantly conserved in each position in a
protein family. - We show that our automated method correctly
identifies many instances of known property
conservation in Rhodopsin. - We make additional predictions regarding
previously unknown sites of property
conservation, providing guidance for future
site-directed mutagenesis studies by experimental
biologists.
FAMLRIVM LAMLRIVC IAMLRIVM P-EL-IVP GAELRIVA PGEI
RIVS L-EVYIVA L-EVRIVM I-MLKIVP WAELRIVP HAELYIVS
YAILYIVP WAML-IVA
For each column in an alignment we calculate the
frequency of each amino acid at that site. By
mapping the amino acids to their respective
property values we can then generate a histogram
of showing the distribution of property values at
this site. For example, the first column in the
alignment at left is shown below plotted against
two different property scales.
Sequence vs. Property Conservation
Future Work
- Simple extension to multivariate Gaussian to
identify conservation of more than one property. - Incorporate method into a full phylogenetic model
to model the effect of codon distances as well as
reduce bias from unequal availability of
subfamilies and non-independence of sequences.
We expect conserved properties to have lower
variance (indicated by red arrows above), so at
this site hydrophobicity (left) appears more
conserved than the radius of side chain (right).
This graph compares locations of significant
property conservation (yellow lines) with
sequence conservation (indicated by entropy,
black lines) in the GPCR family A. In GPCR the
TM helices (red bars) are much more highly
conserved than the loops (blue bars), and the
entropy and property conservation both illustrate
this trend. However, property conservation can
also be found in many positions with very low
sequence conservation (high entropy).
- References
- 1 Valdar, W. 2002. Scoring residue
conservation. Proteins, 4822741. - 2 Koshi, J, Mindell, D, and Goldstein, R.
1997. Beyond mutation matrices
Physical-chemistry based evolutionary models.
Genome Inform Ser Workshop, Genome Inform,
78089. - 3 PDbase
- www.scsb.utmb.edu/comp_biol.html/venkat/prop.html
However, low variance doesnt always indicate a
conserved property. In the graph on the left, for
example, the missing cystein makes it is unlikely
that this is the sole property that is important
at this site.
For Additional Information Please Contact
roseh_at_cs.cmu.edu
2- 3.3 Gaussian Goodness-of-Fit
- Fit a maximum-likelihood Gaussian to amino acid
frequencies in property space - From (discretized) Gaussian calculate expected AA
frequencies - Calculate goodness-of-fit to learned
Gaussian - Identifies unimodal distributions
- Penalizes missing amino acids (holes)
- Use Monte-Carlo method to calculate statistical
significance (p-value) - Otherwise will have high false discovery rate
when entropy is low
- Estimating statistical significance
- What is the probability of a property being
conserved purely by chance? - Generate a large set of random (shuffled)
property scales - Calculate goodness-of-fit to a Gaussian
- p-value estimated from distribution of the
statistic under random null-hypothesis
However, the Bonferroni is an overly conservative
adjustment, especially since there are signi?cant
correlations between many of the properties we
are testing.