Title: Computational Complexity
Abstract T31
Optimized measured genotype analysis for
genome-wide quantitative trait loci mapping
using dense SNP chips
Jeffrey R. O'Connell
University of Maryland School of Medicine, Baltimore MD 21201
Introduction
Advances in genotyping and
sequencing technology in the last decade have
increased the number of polymorphisms available
for genetic analyses in humans from an average of
1000 short tandem repeats (STRs) to over 2
million single nucleotide polymorphisms (SNPs).
This technology is now making cost-effective
large scale genotyping a reality in many other
species. The release of the BovineSNP50 marks the
beginning of a new era in marker assisted
selection to guide animal breeding of
economically important traits in the dairy
industry. The analytical and computational
challenges posed by genome-wide association
analysis (GWAS) with large SNP panels and large
numbers of animals will require novel approaches
and optimized software. A software package is
presented that provides a flexible and powerful
tool to meet many of the GWAS challenges.
Quantitative Trait Analysis
Quantitative trait locus
(QTL) mapping is used to identify alleles that
contribute to variation of traits of health and
economic importance. The Measured Genotype (MG)
model is a mixed model for testing association
between a quantitative trait and genotype(s) in
pedigree data. The MG model treats the genotype
of an individual as a measured covariate while
incorporating a polygenic component to control
residual correlation due to familial
relationships, and thus includes genetic data as both
fixed and random effects. The focus of this model
is the effect size of each genotype rather than
the additive variance of each individual.
Measured Genotype Model
The MG mixed model equation is
$$Y_i = \mu + \sum_j b_j c_{ij} + b_m g_{im} + a_i + e_i,$$
where $Y_i$ is the phenotype of the $i$th individual, $\mu$ is the overall mean, $c_{ij}$ are environmental covariates, $g_{im}$ is the coded genotype at the $m$th marker with effect $b_m$, $a_i$ is the additive polygenic component, and $e_i$ is the residual error. The $b$ terms are regression coefficients that measure the effect size of each covariate. If we assume multivariate normality, $Y \sim N(Xb, \sigma^2 V)$, then the log likelihood is, up to an additive constant,
$$\log L = -\tfrac{1}{2} \log|\sigma^2 V| - \tfrac{1}{2} (Y - Xb)' V^{-1} (Y - Xb) / \sigma^2,$$
where the variance-covariance matrix $V = h^2 R + (1 - h^2) W$ is a function of the heritability $h^2$, the relationship matrix $R$ (twice the kinship matrix), and a diagonal matrix of weights $W$. The mean and covariates are specified in the design matrix $X$, with genotypes coded as in the table "Covariate Coding of SNP Genotype by Genetic Model".
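To make the likelihood concrete, here is a minimal NumPy sketch that evaluates it for given parameter values. It is an illustration, not the software's implementation: the function name `mg_log_likelihood`, the argument order, and the three-animal toy pedigree are assumptions, and the constant term involving 2*pi is omitted.

```python
import numpy as np

def mg_log_likelihood(y, X, b, sigma2, h2, R, w):
    """Log likelihood of Y ~ N(Xb, sigma2 * V), omitting the constant -n/2 * log(2*pi)."""
    n = len(y)
    V = h2 * R + (1.0 - h2) * np.diag(w)       # V = h2*R + (1 - h2)*W
    resid = y - X @ b
    _, logdet_V = np.linalg.slogdet(V)         # log|V|; V is positive definite
    quad = resid @ np.linalg.solve(V, resid)   # (Y - Xb)' V^{-1} (Y - Xb)
    # log|sigma2 * V| = n*log(sigma2) + log|V|
    return -0.5 * (n * np.log(sigma2) + logdet_V) - 0.5 * quad / sigma2

# Toy pedigree (hypothetical): two unrelated parents and their offspring.
R = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
w = np.ones(3)                                  # equal weights
X = np.column_stack([np.ones(3), [0, 1, 2]])    # mean + additive genotype code
y = np.array([10.0, 11.5, 12.0])                # made-up phenotypes
b = np.array([10.0, 1.0])                       # made-up coefficients
ll = mg_log_likelihood(y, X, b, sigma2=1.0, h2=0.3, R=R, w=w)
```

In an ML search this function would be evaluated repeatedly as the heritability changes, which is why the cost of handling V matters.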
Results
Presented are the GWAS results for milk
yield under an additive genetic model. The two
most significant SNPs have p-values < 1e-80 and
are located at positions 236533 and 443936 on
chromosome 14 near the major gene DGAT. These
results agree with prior expectations given the
existing literature on effect size of DGAT and
selection pressure on milk yield.
- Current Research and Extensions
- Major additions being implemented into software include
  - Multivariate trait models for correlated phenotypes to improve power to detect SNPs contributing to multiple traits
  - Low-density genotyping panels
    - Bottom-up regression-model building to identify SNPs that best predict trait
  - Genomic prediction
    - Apply SNP estimates to predict genetic merit of animals
  - Multilocus analysis
    - Fine-mapping with additional SNPs to determine most likely functional SNP
  - Haplotype analysis
    - Increased power if haplotype is better predictor of causal variant than component SNPs
  - Evaluating the weighted regression and replacing A matrix by genomic matrix estimated from the data
  - Implementing p
- Discussion
- The Measured Genotype model is a flexible
regression-based analytical tool for genetic
analysis of quantitative traits in human and
animal pedigrees. The approach is suitable for
GWAS, fine-mapping, and genomic selection in
large pedigrees. We have presented GWAS results
for milk yield that show that DGAT is the most
significant locus, thus agreeing with
expectation. Additional fine mapping and
bioinformatics will be required to discover the
causative variants underlying these QTLs. As
additional animals are genotyped the power to
resolve QTL location will improve. However, as
with all dense matrix models, there will be a limit
on the size of the R matrix that is
computationally feasible.
- Contact Information
- Email: joconnel_at_medicine.umaryland.edu
- Computational Complexity
- The complexity of the MG model for a single SNP is determined by the complexity of the matrix operations in the formulas used to iteratively determine the maximum likelihood (ML) estimates of $b$ and $\sigma^2$:
  - $b = (X' V^{-1} X)^{-1} X' V^{-1} Y$, independent of $\sigma^2$
  - $\sigma^2 = \frac{1}{n} (Y - Xb)' V^{-1} (Y - Xb)$, where $n$ is the dimension of $V$
- The major cost is computing $V^{-1}$, which is $O(n^3)$; that is, it grows as the cube of the dimension.
- Thus the total GWAS complexity is this per-evaluation cost multiplied by
  - (# SNPs) x (# Genetic Models) x (# ML iterations).
- As $n$ increases from tens to hundreds to thousands, the computational time for a single SNP increases from seconds to minutes to even hours.
- Improving MG Computational Performance
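As a baseline for the improvement described in this section, the direct ML update for b and the residual variance at a fixed heritability can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the package's code: the function name `ml_estimates_direct` and the toy data are hypothetical. It forms the inverse of V explicitly, which is the cubic-cost step.

```python
import numpy as np

def ml_estimates_direct(y, X, V):
    """Estimates of b and sigma2 for a fixed V, forming V^{-1} explicitly."""
    n = len(y)
    V_inv = np.linalg.inv(V)                  # the O(n^3) step
    XtVi = X.T @ V_inv                        # X' V^{-1}
    b = np.linalg.solve(XtVi @ X, XtVi @ y)   # b = (X'V^{-1}X)^{-1} X'V^{-1}Y
    resid = y - X @ b
    sigma2 = resid @ V_inv @ resid / n        # (1/n)(Y - Xb)'V^{-1}(Y - Xb)
    return b, sigma2

# Toy data (hypothetical): two unrelated parents and their offspring.
R = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
W = np.eye(3)
h2 = 0.3                                       # heritability held fixed here
V = h2 * R + (1.0 - h2) * W
X = np.column_stack([np.ones(3), [0, 1, 2]])   # mean + additive genotype code
y = np.array([10.0, 11.5, 12.0])
b_hat, sigma2_hat = ml_estimates_direct(y, X, V)
```

In a full analysis this update would sit inside a search over the heritability and be repeated for every SNP and genetic model, so the inversion cost is paid many times.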
- Since the relationship matrix $R$ is positive definite and $W$ is positive definite diagonal, $W^{-1/2} R W^{-1/2} = P D P'$, where $P$ is orthogonal, $D$ is the diagonal matrix of eigenvalues, and $P'$ is the transpose of $P$. Let $Q = W^{1/2} P$. Thus, we can write the variance-covariance matrix $V$ as
  - $V = h^2 R + (1 - h^2) W = h^2 W^{1/2} P D P' W^{1/2} + (1 - h^2) W^{1/2} P P' W^{1/2}$
  - $= Q h^2 D Q' + Q (1 - h^2) Q'$
  - $= Q (h^2 D + (1 - h^2) I) Q'$
  - $= Q M(h^2) Q'$, where $M(h^2) = h^2 D + (1 - h^2) I$
- Thus, $V^{-1} = W^{-1/2} P M^{-1}(h^2) P' W^{-1/2} = S M^{-1}(h^2) T$, where $S = W^{-1/2} P$ and $T = P' W^{-1/2} = S'$. Substituting into the log-likelihood equation and reorganizing matrix multiplications gives
  - $X' V^{-1} X = X' (S M^{-1} T) X = (X'S) M^{-1} (TX)$
- Since $X'S$ and $TX$ each cost $O(pn^2)$ to compute and inverting the diagonal matrix $M$ costs $O(n)$, the complexity of $X' V^{-1} X$ is reduced from $O(n^3)$ to $O(pn^2)$.
- When $p \ll n$ (the number of columns of $X$, that is, the covariates and SNP terms in the model, is much smaller than the number of animals), computational performance is significantly improved. If no missing data is assumed, then the $S$ and $T$ matrices are independent of the SNP, thus requiring a single diagonalization for the GWAS.
- Performance of the Algorithm
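The diagonalization shortcut can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions rather than the software's implementation: the function names `diagonalize` and `ml_estimates_fast`, the weights, and the toy data are hypothetical. The eigendecomposition is done once; each later evaluation only scales by the diagonal of the inverse of M(h2).

```python
import numpy as np

def diagonalize(R, w):
    """One-time step: W^{-1/2} R W^{-1/2} = P D P'. Returns S = W^{-1/2} P and the eigenvalues."""
    w_inv_sqrt = 1.0 / np.sqrt(w)
    A = R * np.outer(w_inv_sqrt, w_inv_sqrt)   # W^{-1/2} R W^{-1/2}
    d, P = np.linalg.eigh(A)                   # symmetric eigendecomposition
    S = w_inv_sqrt[:, None] * P                # S = W^{-1/2} P, and T = S'
    return S, d

def ml_estimates_fast(y, X, S, d, h2):
    """Estimates of b and sigma2 using V^{-1} = S M^{-1}(h2) S' without forming V^{-1}."""
    n = len(y)
    m_inv = 1.0 / (h2 * d + (1.0 - h2))        # diagonal of M^{-1}(h2), O(n)
    Xs = S.T @ X                               # T X, O(p n^2)
    ys = S.T @ y                               # T Y
    XtViX = (Xs.T * m_inv) @ Xs                # (X'S) M^{-1} (T X)
    XtViy = (Xs.T * m_inv) @ ys                # (X'S) M^{-1} (T Y)
    b = np.linalg.solve(XtViX, XtViy)
    r = ys - Xs @ b                            # rotated residual T (Y - Xb)
    sigma2 = (r * m_inv) @ r / n
    return b, sigma2

# Toy data (hypothetical): two unrelated parents and their offspring.
R = np.array([[1.0, 0.0, 0.5],
              [0.0, 1.0, 0.5],
              [0.5, 0.5, 1.0]])
w = np.array([1.0, 2.0, 0.5])                  # made-up diagonal weights
X = np.column_stack([np.ones(3), [0, 1, 2]])   # mean + additive genotype code
y = np.array([10.0, 11.5, 12.0])

S, d = diagonalize(R, w)                       # reused for every SNP and every h2
b_fast, s2_fast = ml_estimates_fast(y, X, S, d, h2=0.3)
```

With complete data, `S` and `d` do not depend on the SNP, so only the genotype column of `X` needs to be rotated for each new marker.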
Results
The next graph shows results excluding
major signals on chromosome 14 to provide better
resolution of the SNP effects across the genome.
Points above the horizontal line are significant
at a genome-wide p-value of 0.05 using a
Bonferroni correction for multiple testing.
Significant signals are present across the genome
with chromosomes 3 and 15 showing the two
strongest clusters. Genotypes within clusters are
generally correlated through linkage and/or
linkage disequilibrium, and thus do not generally
represent independent signals. The results fit
the oligogenic mixed model well: a few genes of
detectable effect size and many genes of small
effect size.
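The Bonferroni cutoff behind the horizontal line is simple arithmetic: the genome-wide level is divided by the number of tests. The sketch below assumes a hypothetical count of 50,000 SNPs, since the number actually tested is not stated here, and assumes the plot uses a -log10(p) axis.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-SNP p-value cutoff that holds the genome-wide error rate at alpha."""
    return alpha / n_tests

n_snps = 50_000   # hypothetical; the number of SNPs tested is not given in the text
cutoff = bonferroni_threshold(0.05, n_snps)
line_height = -math.log10(cutoff)   # line position, assuming a -log10(p) axis
```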
Covariate Coding of SNP Genotype by Genetic Model
Genotype  Additive  Dominant  Recessive  3 Genotype
AA        0         1         0          0  0
AB        1         1         1          1  0
BB        2         0         1          0  1
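The coding table can be expressed as a small lookup that builds the genotype columns of the design matrix. This is an illustrative sketch: the names `CODING` and `code_genotypes` are hypothetical, and it assumes the two rightmost values in each row both belong to the 3 Genotype model.

```python
# Covariate codes per genotype, transcribed from the table above.
# Assumption: the last two values of each row form the 3 Genotype coding.
CODING = {
    "additive":  {"AA": (0,), "AB": (1,), "BB": (2,)},
    "dominant":  {"AA": (1,), "AB": (1,), "BB": (0,)},
    "recessive": {"AA": (0,), "AB": (1,), "BB": (1,)},
    "3genotype": {"AA": (0, 0), "AB": (1, 0), "BB": (0, 1)},
}

def code_genotypes(genotypes, model):
    """Return one row of covariate values per genotype for the chosen model."""
    table = CODING[model]
    return [list(table[g]) for g in genotypes]

additive_cols = code_genotypes(["AA", "AB", "BB", "AB"], "additive")
three_cols = code_genotypes(["AA", "AB", "BB", "AB"], "3genotype")
```

The 3 Genotype model contributes two columns, so it estimates separate effects for AB and BB relative to AA, while the other models contribute a single column.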