Title: Point mutation trends across cSNP databases: Similarities and predictions1
Monica M. Horvath, John W. Fondon III, and Harold R. Garner
McDermott Center for Human Growth and Development, Center for
Biomedical Inventions, and Departments of Biochemistry and Internal
Medicine, University of Texas Southwestern Medical Center, Dallas, TX.
Summary
We present a cSNP classification method that both contrasts SNP
databases and has the potential to illuminate the relative mutational
load of genes caused by codon bias. We group cSNPs gleaned from five
public databases by their wild-type and mutant codons, i.e. into codon
mutation classes (CMCs; 576 are possible, such as ACG->ATG). The CMC
frequencies in a database are assembled into a BLOSUM-style matrix
describing the likelihood of observing each possible single-base codon
change, as tuned by the intertwined effects of mutation rate and
selection. The rankings of the CMCs in any database are reshuffled
according to the population stratification of the typical genotyping
experiment producing that resource's data: cohort size, cohort scope,
and degree of sample pooling. This phenomenon exists because the
spectrum of observed variants in such experiments is an inseparable
function of both a mutation's intrinsic rate and its typical impact on
the encoded protein. Analysis reveals that a considerable fraction of
mutation in functional genes can be described by a few CMCs, regardless
of gene identity or population stratification in the genotyping
experiment. We compare the databases in light of their point variation
spectra and then demonstrate that this information may be applied to
predict a large fraction of variation in individual genes. This has
implications for steering directed genotyping, estimating relative gene
mutational load, and developing hypotheses about mutation discovery and
population structure.
Table 1 Compilation of cSNP datasets
To compare cSNP datasets of differing origins on a relative basis, each
database is deconvoluted into a distribution of mutation preferences
across cSNP categories using a biologically relevant classification
system. Since this study focuses on gene coding regions, a
frame-dependent trinucleotide classification metric is appropriate
(fig. 1). There are 138 silent and 438 nonsynonymous (nonsyn) codon
mutation classes (CMCs) possible.
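The class counts above can be checked by brute-force enumeration. The following is a minimal sketch, assuming the standard genetic code and assuming that stop-to-stop changes are counted as silent; the function name `codon_mutation_classes` is illustrative and not taken from the poster.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_mutation_classes():
    """Yield (wild_type, mutant) for every single-base change of every codon."""
    for wt in CODON_TABLE:
        for pos in range(3):
            for base in BASES:
                if base != wt[pos]:
                    yield wt, wt[:pos] + base + wt[pos + 1:]

cmcs = list(codon_mutation_classes())
# A class is silent when both codons encode the same residue (or both are stops).
silent = [c for c in cmcs if CODON_TABLE[c[0]] == CODON_TABLE[c[1]]]
nonsyn = [c for c in cmcs if CODON_TABLE[c[0]] != CODON_TABLE[c[1]]]
print(len(cmcs), len(silent), len(nonsyn))  # 576 138 438
```

Under these assumptions the enumeration gives 64 codons x 9 single-base changes = 576 classes, split into 138 silent and 438 nonsynonymous, which matches the counts quoted in the text.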
Figure 1 Mutability values can be constructed from the raw counts of
cSNPs in a database
[Flow diagram; panels in order]
Grab cSNPs from a database:
CTGGCTg/aGTA, CCg/aGATTTGC, GCATCGGc/tAA, GCTTAAAt/gTT,
GTc/tAAATATC, ATGCCc/aTAGC, CTAg/aTAAGTA
Annotate according to codon:
GGT->AGT (G/S), CGG->CAG (R/Q), GGC->GGT (G/G), ATT->AGT (I/S),
CAA->TAA (Q/X), CCC->CCA (P/P), AGT->AAT (S/N)
Calculate codon mutation class frequencies:
(GGT->AGT) (G/S) 0.0034 (GGT)
(CGG->CAG) (R/Q) 0.0096 (CGG)
(GGC->GGT) (G/G) 0.0039 (GGC)
(ATT->AGT) (I/S) 0.0004 (ATT)
(CAA->TAA) (Q/X) 0.0013 (CAA)
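The counting step of figure 1 can be sketched as follows. This is a minimal illustration, assuming that a CMC frequency is simply the fraction of all annotated cSNPs in a database that fall into that class; the poster does not spell out its normalization (the figure lists the wild-type codon beside each value, so a codon-level correction may also be involved). The function name `cmc_frequencies` and the toy input are hypothetical.

```python
from collections import Counter

def cmc_frequencies(csnps):
    """Map each (wild_type, mutant) codon pair to its share of all cSNPs.

    Assumption: frequency = class count / total count, with no correction
    for codon usage.
    """
    counts = Counter(csnps)
    total = sum(counts.values())
    return {cmc: n / total for cmc, n in counts.items()}

# Toy input: the seven annotated cSNPs of figure 1, plus one repeat of
# CGG->CAG so that the classes are not all equally frequent.
observed = [
    ("GGT", "AGT"), ("CGG", "CAG"), ("GGC", "GGT"), ("ATT", "AGT"),
    ("CAA", "TAA"), ("CCC", "CCA"), ("AGT", "AAT"), ("CGG", "CAG"),
]
freqs = cmc_frequencies(observed)
print(freqs[("CGG", "CAG")])  # 0.25
```

Applied to a whole database, the resulting dictionary plays the role of the CMC distribution that the later comparisons and predictions draw on.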
Codon mutation class distributions reveal distinguishing features of
cSNP datasets
The characteristics of each database's CMC distribution (fig. 2) depend
upon the population stratification of the genotyped cohort (table 1).
Database-specific distributions of silent cSNPs across CMCs are
statistically identical (fig. 3 and p > 0.25 in Wilcoxon rank-sum
tests), as expected if such alleles are mostly neutral. Therefore, CMC
frequency differentials between datasets are due to nonsilent mutation.
Table 3 details the top 10 nonsynonymous CMCs from each dataset.
Frequent CMCs of one resource often rank highly in any other database,
so that only 18 CMCs describe the 10 most observable events in all 5
collections. BLOSUM62 values provide a rough estimate of the typical
impact of the amino acid substitution. Because of the conservative
nature of the replacement (BLOSUM62 score of 3), cSNPs such as V->I are
more frequent in CGAP than in the HGMD. The CGAP mouse data are
dominated by both conservative CMCs and an excess of silent mutation,
which is a consequence of the small, unstratified population examined
in CGAP's discovery process. Such a low-throughput method will have
power to detect only extremely common, and therefore mostly benign,
alleles. These data illustrate how mutation rate and impact are
inseparably coupled in shaping the observable allele frequencies in a
genotyping study. The identities of the most observable CMCs in any SNP
database are reshuffled according to the population stratification of
the studies that generated it.
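The overlap statement above (18 CMCs covering the top 10 of all five collections) amounts to taking the union of per-database top-k lists. The sketch below illustrates that bookkeeping on invented numbers; the database names, frequencies, and function names are hypothetical and are not data from the poster.

```python
def top_cmcs(freqs, k):
    """Return the k most frequent CMCs of one database, most frequent first."""
    ranked = sorted(freqs.items(), key=lambda item: item[1], reverse=True)
    return [cmc for cmc, _ in ranked[:k]]

def shared_top(databases, k):
    """Union of the top-k CMCs over all databases."""
    union = set()
    for freqs in databases.values():
        union.update(top_cmcs(freqs, k))
    return union

# Hypothetical toy distributions for two made-up databases.
toy = {
    "db_A": {"CGG->CAG": 0.30, "GTC->ATC": 0.20, "GGT->AGT": 0.10},
    "db_B": {"CGG->CAG": 0.25, "GGT->AGT": 0.15, "GTC->ATC": 0.05},
}
print(sorted(shared_top(toy, 2)))  # ['CGG->CAG', 'GGT->AGT', 'GTC->ATC']
```

With five databases and k = 10 the union could hold anywhere from 10 classes (identical top lists) to 50 (no overlap), so a union of 18 indicates heavily shared top ranks.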
Figure 2 A large fraction of gene cSNPs can be described by a few codon
mutation classes (CMCs)
[Plot axes: number of CMCs; codon mutation class frequency]
40.8% of HGMD mutations in only 22 CMCs (expect 5.1%)
27.4% of dbSNP cSNPs in only 29 CMCs (expect 6.3%)
28.8% of SNP consortium cSNPs in only 29 CMCs (expect 8.5%)
34.2% of mouse CGAP cSNPs in only 29 CMCs (expect 9.4%)
Figure 3 The pattern of synonymous mutation is statistically identical
between SNP databases
19% of all silent cSNPs fall into only four classes: TCG->TCA (Ser),
ACG->ACA (Thr), CCG->CCA (Pro), GCG->GCA (Ala)
Table 2 cSNP datasets differ primarily in the dispersion of
nonconservative variants
The relative likelihood of finding typically nonconservative mutations
is approximately HGMD >> TSC > dbSNP > CGAP. The relative sequencing
depth is HGMD >> TSC (N = 24) > dbSNP > CGAP (N > 5). Therefore, a study
that genotypes only small populations will discover an excess of
typically neutral variants.
A high rank represents a frequently observed mutation class in a
database.
Figure 4 Given information about the design of a genotyping study, we
know what pattern of mutations to expect.
As expected given the small number of individuals resequenced, the
pattern of UTSW PGA nonsynonymous cSNPs matches the profile of dbSNP
mutation better than that of HGMD mutation. The cSNP data for UT
Southwestern come from work done for an NHLBI-funded Program for
Genomic Applications grant [6], in which genes relevant to cardiac
disease were resequenced in a group of 25 unrelated individuals.
Figure 5 Sequence context-associated nonrandomness does not disappear
upon examination of SNPs less constrained by selection
[Plot: jSNP trinucleotide mutation class frequency (3' UTR SNPs)]
Point Mutation Prediction
The predictability of mutation for a single gene using database-derived
codon mutation class (CMC) trends can be assessed historically. For a
given coding sequence, a list of all possible cSNPs is generated, and
each is assigned its CMC value as derived from the database of choice
(HGMD in table 3). Such an assignment represents a probability estimate
for observing that variant in a population stratified similarly to the
one that built the employed CMC distribution. Once the potential cSNPs
are ranked in descending order of these probabilities, a top fraction
of predictions can be examined and queried in the published literature.
For a set of seven unrelated genes representing a broad spectrum of
allele frequencies and inheritance modes, we recalculated the HGMD CMC
distribution with those genes' mutations excluded from the database,
constructed a ranked set of cSNP predictions, and then looked up the
highly probable subsets in the HGMD resource to benchmark prediction
accuracy (table 3). All seven genes have an easily predicted volume of
mutational space despite their wide spectrum of allele inheritance
modes, which illustrates the generality of CMC-described point mutation
trends. When considering only the top quarter percentile (0.25%) of
ranked nonsynonymous substitutions, 56/64 predictions (87.5%) in this
select fraction exist and are associated with or causative of a disease
state. Depending on a gene's cSNP saturation, this method is between
6.3- and 181-fold more accurate than predicting variants randomly.
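The ranking procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function name `rank_csnp_predictions` is hypothetical, CMCs missing from the lookup table score 0, and no silent/nonsynonymous filter is applied (the poster ranks nonsynonymous substitutions). The two scores in the toy table are the example values shown in figure 1.

```python
BASES = "ACGT"

def rank_csnp_predictions(cds, cmc_freq):
    """List every single-base codon change in `cds`, highest CMC score first.

    Returns tuples (score, codon_index, wild_type, mutant). `cmc_freq` maps
    (wild_type, mutant) codon pairs to a database-derived frequency.
    """
    predictions = []
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    for start in range(0, usable, 3):
        wt = cds[start:start + 3]
        for pos in range(3):
            for base in BASES:
                if base == wt[pos]:
                    continue
                mut = wt[:pos] + base + wt[pos + 1:]
                score = cmc_freq.get((wt, mut), 0.0)  # unseen class: 0 (assumption)
                predictions.append((score, start // 3, wt, mut))
    predictions.sort(key=lambda p: p[0], reverse=True)
    return predictions

# Toy coding sequence and a two-entry frequency table.
toy_freq = {("CGG", "CAG"): 0.0096, ("GGT", "AGT"): 0.0034}
ranked = rank_csnp_predictions("ATGCGGGGT", toy_freq)
print(ranked[0])  # (0.0096, 1, 'CGG', 'CAG')
```

For the benchmark described in the text, the frequency table would be rebuilt with the target gene's own mutations left out, so that the ranking is not informed by the variants it is later scored against.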
Table 3 A subset of gene point mutations can be predicted
(a) Percentage of predicted point mutations that have been
experimentally observed according to the HGMD: (# correct predictions /
# predictions made) x 100.
(b) Calculated as in (a) but using the null model, which predicts
mutations randomly across a gene.
(c) Percentage of HGMD-detailed point mutations that were predicted for
each gene using HGMD CMC frequencies: (# correct predictions / # known
mutations) x 100.
(d) Ratio of (% accuracy of cSNP prediction) / (% accuracy of null
model) for the top 0.25% mutation prediction threshold level.
Genes: (e) Factor 9, (f) cystic fibrosis transmembrane conductance
regulator, (g) connexin 32, (h) hydroxymethylbilane synthase, (i)
paired box homeotic 6, (j) alpha-1-antitrypsin, (k) xeroderma
pigmentosum, complementation group A.
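The quantities (a) through (d) in the legend are simple ratios, sketched below. One assumption is labeled loudly: the null-model accuracy is taken here to be the expected hit rate of uniformly random predictions, i.e. known mutations divided by all possible mutations in the gene. The poster does not state how its null model was evaluated, and the function name and toy numbers are hypothetical.

```python
def prediction_metrics(predicted, known, n_possible):
    """Return (accuracy %, null accuracy %, coverage %, fold improvement).

    predicted  : variants in the examined top fraction of the ranking
    known      : variants catalogued for the gene (e.g. in the HGMD)
    n_possible : number of possible point mutations considered for the gene
    """
    predicted, known = set(predicted), set(known)
    hits = len(predicted & known)
    accuracy = 100.0 * hits / len(predicted)          # legend (a)
    null_accuracy = 100.0 * len(known) / n_possible   # legend (b), assumed form
    coverage = 100.0 * hits / len(known)              # legend (c)
    fold = accuracy / null_accuracy                   # legend (d)
    return accuracy, null_accuracy, coverage, fold

# Invented example: 8 predictions, 7 of which are among 50 known variants,
# in a gene with 1000 possible point mutations.
predicted = [f"v{i}" for i in range(8)]
known = [f"v{i}" for i in range(1, 51)]
accuracy, null_accuracy, coverage, fold = prediction_metrics(predicted, known, 1000)
print(accuracy, null_accuracy, coverage, fold)
```

Because the null accuracy shrinks as fewer of a gene's possible mutations are known, the fold improvement depends strongly on how saturated the gene's mutation catalogue is, which is consistent with the wide range of fold values reported in the text.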
Conclusions
1. Convergent data from five cSNP datasets show a general,
gene-independent pattern of point mutation in which a considerable
fraction of cSNPs can be described and consistently predicted by a
handful of trinucleotide sequence contexts.
2. cSNP databases differ in the distribution of nonsynonymous
mutations: as the genotyped cohort decreases in size and individuals
are more randomly selected, nonconservative mutations become rarer
while the proportion of observed conservative mutations and silent
substitutions increases dramatically.
3. Since sequence diversity is a result not only of intrinsic mutation
rates but also of the evolutionary forces that act on the targeted DNA
sequence, mutation rates cannot be calculated directly from cSNP
databases. However, deconvolution of a cSNP database into a
distribution of mutation preferences for cSNP sequence contexts allows
relative comparison of cSNP datasets detected in differing species,
populations, and environments.
4. Trinucleotide mutation preferences gleaned from cSNP databases
permit prediction of the most likely handful of human mutations that
will be found in a similarly stratified population. They may both shed
light on how the relative mutation likelihood of gene families differs
and steer genotyping studies.
References
- Horvath, M.M., Fondon, J.W. III, and Garner, H.R. (2003). Low hanging
fruit: A subset of human cSNPs is both highly nonuniform and
predictable. Gene 312, 197-206.
- Krawczak, M., et al. (2000). Human Gene Mutation Database: A
Biomedical Information and Research Resource. Hum Mutat 15, 45-51.
- Smigielski, E.M., et al. (2000). dbSNP: A database of single
nucleotide polymorphisms. Nucleic Acids Res 28, 352-5.
- Thorisson, G.A. and Stein, L.D. (2003). The SNP Consortium Web Site:
Past, present, and future. Nucleic Acids Res 31(1), 124-7.
- Riggins, G.J. and Strausberg, R.L. (2001). Genome and genetic
resources from the Cancer Genome Anatomy Project. Hum Mol Genet 10,
663-7.
- Haga, H., Yamada, R., Ohnishi, Y., Nakamura, Y., and Tanaka, T.
(2002). Gene-based SNP discovery as part of the Japanese Millennium
Genome Project: identification of 190,562 genetic variations in the
human genome. J Hum Genet 47, 605-610.
- UT Southwestern PGA data: http://pga.swmed.edu.
Acknowledgements
This work was supported by the National
Institutes of Health, NHLBI Program in Genomic
Applications grant P50 CA70907, and the State of
Texas Advanced Technology Program.
Statement of Problem
Efforts to catalog cSNPs (coding SNPs) have accelerated due to their
presumed value in phenotype association studies. The occurrence of a
point mutation event is well known to be highly dependent on the local
DNA sequence context. However, identification and subsequent deposition
of a specific cSNP into a database is not simply a matter of the
event's inherent mutation rate; it also depends substantially upon the
structure of the genotyped population (e.g. size and stratification)
and on the effect of selection on the new allele. The goal of this
study is to sort cSNPs from public genotyping efforts into a set of
sequence context categories in order to pinpoint any unusually well- or
underpopulated categories, to identify associations between mutation
and coding sequence context, and to uncover any general coding region
mutability trends in the human genome.