Title: R Packages for Genome-Wide Association Studies
1. R Packages for Genome-Wide Association Studies
- Qunyuan Zhang
- Division of Statistical Genomics
- Statistical Genetics Forum
- March 10, 2008
2. What is R?
- R is a free software environment for statistical computing and graphics.
- Runs on a wide variety of UNIX platforms, Windows and MacOS (interactive or batch mode)
- Free and open source; can be downloaded from cran.r-project.org
- Wide range of packages (base and contributed); novel methods available
- Concise grammar and good structure (functions, data objects, methods and classes)
- Help from manuals and email group
- Slow, time- and memory-consuming (can be overcome by parallel computation and/or integration with C)
- Popular, used by 70-80% of statisticians
3. R Task Views
http://cran.r-project.org/web/views/
4. Statistical Genetics Packages in R
http://cran.r-project.org/web/views/Genetics.html
- Population Genetics: genetics (basic), Geneland (spatial structures of genetic data), rmetasim (population genetics simulations), hapsim (simulation), popgen (clustering SNP genotype data and SNP simulation), hierfstat (hierarchical F-statistics of genetic data), hwde (modeling genotypic disequilibria), Biodem (biodemographical analysis), kinship (pedigree analysis), adegenet (population structure), ape and apTreeshape (phylogenetic and evolution analyses), ouch (Ornstein-Uhlenbeck models), PHYLOGR (simulation and GLS model), stepwise (recombination breakpoints)
- Linkage and Association: gap (both population and family data, sample size calculations, probability of familial disease aggregation, kinship calculation, linkage and association analyses, haplotype frequencies), tdthap (haplotype transmission/disequilibrium tests, TDT for haplotypes), powerpkg (power analyses for the affected sib pair and the TDT design), hapassoc (likelihood inference of trait associations with haplotypes in GLMs), haplo.ccs (haplotype and covariate relative risks in case-control data by weighted logistic regression), haplo.stats (haplotype analysis for unrelated subjects), ldDesign (experiment design for association and LD studies), LDheatmap (heatmap of pairwise LD), mapLD (LD and haplotype blocks), pbatR (R version of PBAT), GenABEL and SNPassoc (for GWAS)
- QTL mapping for data from experimental crosses: bqtl (inbred crosses and recombinant inbred lines), qtl (genome-wide scans), qtlDesign (designing QTL experiments and power computations), qtlbim (Bayesian Interval QTL Mapping)
- Sequence and Array Data Processing: seqinr, BioConductor packages
5. GenABEL
Aulchenko Y.S., Ripke S., Isaacs A., van Duijn C.M. GenABEL: an R package for genome-wide association analysis. Bioinformatics. 2007; 23(10):1294-6.
- GenABEL: genome-wide SNP association analysis
- A package for genome-wide association analysis between quantitative or binary traits and single-nucleotide polymorphisms (SNPs).
- Version: 1.3-5
- Depends: R (>= 2.4.0), methods, genetics, haplo.stats, qvalue, MASS
- Date: 2008-02-17
- Author: Yurii Aulchenko, with contributions from Maksim Struchalin, Stephan Ripke and Toby Johnson
- Maintainer: Yurii Aulchenko <i.aoultchenko at erasmusmc.nl>
- License: GPL (>= 2)
- In views: Genetics
- CRAN checks: GenABEL results
6. GenABEL: Data Objects
phdata: phenotypic data (data frame)
gtdata: genotypic data (snp.data-class)
- 2-bit storage
- 0 00
- 1 01
- 2 10
- 11
- Saves 75% of storage
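The 75% figure follows from storing each genotype in 2 bits instead of a full byte, so four genotypes fit into one byte. The base-R sketch below illustrates the idea only: the function names `pack2bit` and `unpack2bit` and the bit order within a byte are assumptions made for this example, not GenABEL's internal code, and the codes 0 to 3 are treated as opaque 2-bit values.

```r
# Pack 2-bit codes (0..3) four to a byte. Illustrative helper, not GenABEL code.
pack2bit <- function(codes) {
  stopifnot(all(codes %in% 0:3))
  n <- length(codes)
  # Pad with zeros so the length is a multiple of 4
  padded <- c(codes, rep(0L, (4 - n %% 4) %% 4))
  m <- matrix(as.integer(padded), nrow = 4)   # one column per output byte
  # First code goes to the two lowest bits (assumed bit order)
  as.raw(m[1, ] + 4L * m[2, ] + 16L * m[3, ] + 64L * m[4, ])
}

# Recover the first n codes from the packed bytes
unpack2bit <- function(bytes, n) {
  b <- as.integer(bytes)
  m <- rbind(b %% 4L, (b %/% 4L) %% 4L, (b %/% 16L) %% 4L, b %/% 64L)
  as.vector(m)[seq_len(n)]
}

codes  <- c(1, 2, 3, 0, 2)
packed <- pack2bit(codes)
length(packed)                     # 2 bytes hold the 5 codes
unpack2bit(packed, length(codes))
```

With one byte per genotype the same five codes would need five bytes; at scale the packed form needs a quarter of that.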
load.gwaa.data(phenofile = "pheno.dat", genofile = "geno.raw")
convert.snp.text(): from text file (GenABEL default format)
convert.snp.ped(): from Linkage, Merlin, Mach, and similar files
convert.snp.mach(): from Mach format
convert.snp.tped(): from PLINK TPED format
convert.snp.illumina(): from Illumina/Affymetrix-like format
7. GenABEL: Data Manipulation
- snp.subset(): subset data by SNP names or by QC criteria
- add.phdata(): merge extra phenotypic data into the gwaa.data-class
- ztransform(): standard normalization of phenotypes
- rntransform(): rank-normalization of phenotypes
- npsubtreated(): non-parametric adjustment of phenotypes for medicated subjects
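The two phenotype transformations can be illustrated in base R. This is a minimal sketch of the general techniques, not the code of ztransform() or rntransform(): the helper names are made up for the example, and the (rank - 0.5) / n offset is one common convention assumed here.

```r
# z-transformation: centre to mean 0 and scale to standard deviation 1
z_transform <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

# Rank-based inverse-normal transformation: replace each value by the
# standard normal quantile of its rank; missing values stay missing
rank_normal <- function(x) {
  n <- sum(!is.na(x))
  r <- rank(x, na.last = "keep")
  qnorm((r - 0.5) / n)
}

set.seed(1)
trait <- rexp(200)            # a skewed, simulated phenotype
z  <- z_transform(trait)
rn <- rank_normal(trait)
c(mean = mean(z), sd = sd(z))
range(rn)
```

The z-transformation keeps the shape of the distribution, skew included, while the rank-based version forces the values onto normal quantiles.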
8. GenABEL: QC and Summarization
- summary.snp.data(): summary of SNP data (number of observed genotypes, call rate, allelic frequency, genotypic distribution, P-value of HWE test)
- check.trait(): summary of phenotypic data and outlier check based on a specified p/FDR cut-off
- check.marker(): SNP selection based on call rate, allele frequency and deviation from HWE
- HWE.show(): shows HWE tables, Chi2 and exact HWE P-values
- perid.summary(): call rate and heterozygosity per person
- ibs(): matrix of average IBS for a group of people and a given set of SNPs
- hom(): average homozygosity (inbreeding) for a set of people, across multiple markers
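The per-SNP quantities listed for the summary (call rate, allele frequency, genotype counts, HWE P-value) can be computed by hand for a single marker. The sketch below uses base R with an assumed coding of genotypes as the number of B alleles (0, 1, 2, NA for missing); `snp_summary` is a hypothetical helper, and it uses the 1-d.f. chi-square approximation rather than an exact HWE test.

```r
snp_summary <- function(g) {
  called <- g[!is.na(g)]
  n <- length(called)
  obs <- c(AA = sum(called == 0), AB = sum(called == 1), BB = sum(called == 2))
  q <- (obs[["AB"]] + 2 * obs[["BB"]]) / (2 * n)   # frequency of allele B
  p <- 1 - q
  # Genotype counts expected under Hardy-Weinberg equilibrium
  expd <- n * c(p^2, 2 * p * q, q^2)
  # Chi-square statistic (undefined for a monomorphic SNP, where expd has zeros)
  chi2 <- sum((obs - expd)^2 / expd)
  list(call_rate = n / length(g),
       freq_B    = q,
       counts    = obs,
       chi2      = chi2,
       p_hwe     = pchisq(chi2, df = 1, lower.tail = FALSE))
}

g <- c(0, 1, 1, 2, 0, 1, NA, 0, 1, 2)
str(snp_summary(g))
```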
9. GenABEL: SNP Association Scans
- scan.glm(): SNP association test using GLM in R library
- scan.glm("y ~ x1 + x2 + CRSNP", family = gaussian(), data, snpsubset, idsubset)
- scan.glm("y ~ x1 + x2 + CRSNP", family = binomial(), data, snpsubset, idsubset)
- scan.glm.2D(): 2-SNP interaction scan
- Fast Scan (calls C language)
- ccfast(): case-control association analysis by computing chi-square test from 2x2 (allelic) or 2x3 (genotypic) tables
- emp.ccfast(): genome-wide significance (permutation) for ccfast() scan
- qtscore(): association test (GLM) for a trait (quantitative or categorical)
- emp.qtscore(): genome-wide significance (permutation) for qtscore() scan
- mmscore(): score test for association between a trait and genetic polymorphism, in samples of related individuals (needs stratification variable; scores are computed within strata and then added up)
- egscore(): association test, adjusted for possible stratification by principal components of genomic kinship matrix (SNP correlation matrix)
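What a GLM-based scan does can be sketched in base R: the same model is fitted once per SNP, with the SNP substituted into the formula each time. The code below is an illustration on simulated data, not GenABEL's implementation; `scan_glm_sketch`, the simulated variables and the additive 0/1/2 genotype coding are assumptions for this example.

```r
set.seed(42)
n_id  <- 300
n_snp <- 20

# Simulated genotypes coded as the number of B alleles (0/1/2)
geno <- matrix(rbinom(n_id * n_snp, size = 2, prob = 0.3), nrow = n_id,
               dimnames = list(NULL, paste0("snp", seq_len(n_snp))))
pheno <- data.frame(x1 = rnorm(n_id), x2 = rbinom(n_id, 1, 0.5))
# Quantitative trait with a true additive effect of snp5
pheno$y <- 0.5 * geno[, "snp5"] + 0.3 * pheno$x1 + rnorm(n_id)

# Fit y ~ x1 + x2 + SNP once per SNP; keep the SNP effect and its p-value
scan_glm_sketch <- function(pheno, geno, family = gaussian()) {
  res <- t(sapply(colnames(geno), function(s) {
    d <- cbind(pheno, SNP = geno[, s])
    fit <- glm(y ~ x1 + x2 + SNP, family = family, data = d)
    coef(summary(fit))["SNP", c(1, 4)]   # estimate and p-value columns
  }))
  colnames(res) <- c("effB", "P1df")
  as.data.frame(res)
}

res <- scan_glm_sketch(pheno, geno)
head(res[order(res$P1df), ])
```

Refitting a full GLM for every SNP is what makes this kind of scan slow at genome-wide scale, which is the motivation for the fast, C-based score tests listed above.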
10. GenABEL: Haplotype Association Scans
- scan.haplo(): haplotype association test using GLM in R library
- scan.haplo.2D(): 2-haplotype interaction scan
- (haplo.stats package required)
- Sliding window strategy
- Posterior prob. of haplotypes via EM algorithm
- GLM-based score test for haplotype-trait association (Schaid DJ, Rowland CM, Tines DE, Jacobson RM, Poland GA. 2002. Score tests for association of traits with haplotypes when linkage phase is ambiguous. Am J Hum Genet 70:425-434.)
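The EM step can be illustrated for the simplest case of two biallelic SNPs, where only double heterozygotes have ambiguous phase. This is a base-R sketch under stated assumptions (genotypes coded 0/1/2, no missing data, hypothetical function name `em_haplo2`), not the multi-locus algorithm implemented in haplo.stats.

```r
# EM estimate of two-SNP haplotype frequencies from unphased genotypes.
# g1, g2: number of B alleles (0/1/2) at SNP 1 and SNP 2, no missing values.
em_haplo2 <- function(g1, g2, tol = 1e-10, max_iter = 1000) {
  ambiguous <- g1 == 1 & g2 == 1            # double heterozygotes
  n_amb <- sum(ambiguous)
  # Haplotypes are labelled by the allele (0 = A, 1 = B) at SNP 1 and SNP 2
  known <- c("00" = 0, "01" = 0, "10" = 0, "11" = 0)
  for (i in which(!ambiguous)) {
    # With at most one heterozygous SNP the two haplotypes are determined
    h1 <- c(g1[i] == 2, g2[i] == 2)
    h2 <- c(g1[i] >= 1, g2[i] >= 1)
    for (h in list(h1, h2)) {
      key <- paste0(as.integer(h), collapse = "")
      known[key] <- known[key] + 1
    }
  }
  freq <- c("00" = 0.25, "01" = 0.25, "10" = 0.25, "11" = 0.25)
  n_hap <- 2 * length(g1)
  for (iter in seq_len(max_iter)) {
    # E-step: posterior probability that a double heterozygote carries 00/11
    cis   <- freq[["00"]] * freq[["11"]]
    trans <- freq[["01"]] * freq[["10"]]
    w <- if (cis + trans > 0) cis / (cis + trans) else 0.5
    expected <- known + n_amb * c(w, 1 - w, 1 - w, w)
    # M-step: update haplotype frequencies from the expected counts
    new_freq <- expected / n_hap
    converged <- max(abs(new_freq - freq)) < tol
    freq <- new_freq
    if (converged) break
  }
  freq
}

g <- c(0, 0, 1, 1, 2, 2)
em_haplo2(g, g)   # two SNPs in complete LD
```

The quantity `w` is the posterior phase probability for the ambiguous genotypes; with more SNPs per window the same weighting extends to every haplotype pair compatible with a genotype.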
11. GenABEL: GWAS results from scan.glm, scan.haplo, ccfast, qtscore, emp.ccfast, emp.qtscore
- scan.gwaa-class
- Names: snpnames, list of names of SNPs tested
- P1df: p-values of 1-d.f. (additive or allelic) test for association
- P2df: p-values of 2-d.f. (genotypic) test for association
- Pc1df: p-values from the 1-d.f. test for association between SNP and trait; the statistic is corrected for possible inflation
- effB: effect of the B allele in allelic test
- effAB: effect of the AB genotype in genotypic test
- effBB: effect of the BB genotype in genotypic test
- Map: list of map positions of the SNPs
- Chromosome: list of chromosomes the SNPs belong to
- Idnames: list of subjects used in analysis
- Lambda: inflation factor estimate, as computed using lower portion (say, 90%) of the distribution, and standard error of the estimate
- Formula: formula/function used to compute p-values
- Family: family of the link function / nature of the test
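The relation between Lambda, P1df and Pc1df can be sketched in base R. Note the swap: this example uses the simpler median-based estimate of the inflation factor, not the lower-portion estimate described above, and the test statistics are simulated, so all variable names and numbers are illustrative.

```r
set.seed(7)
# Simulated 1-d.f. chi-square statistics, inflated by a factor of 1.2
chi2 <- 1.2 * rchisq(10000, df = 1)

# Median-based inflation factor: observed median divided by the
# median of a 1-d.f. chi-square distribution
lambda <- median(chi2) / qchisq(0.5, df = 1)

# Genomic-control correction: divide the statistics by lambda (if above 1)
chi2_c <- chi2 / max(lambda, 1)
P1df  <- pchisq(chi2,   df = 1, lower.tail = FALSE)
Pc1df <- pchisq(chi2_c, df = 1, lower.tail = FALSE)
lambda
```

Because the corrected statistics are never larger than the originals, the corrected p-values are never smaller than the uncorrected ones.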
12. GenABEL: Table and Graphic Functions
- descriptives.marker(): table of marker info.
- descriptives.trait(): table of trait info.
- descriptives.scan(): table of scan results
- plot.scan.gwaa(): plot of scan results
- plot.check.marker(): plot of marker data (QC etc.)
13. GenABEL: Computer Efficiency
2000 subjects x 500K chip
- Memory: 3.2 G
- Loading time: 4 min.
- SNP summary: 1 min.
- Call ccfast: 0.5 min.
- Call qtscore: 2 min.
- Total: < 10 min.
- Permutation test (N = 10,000): 73-120 hrs (3-5 days)
Intel Xeon 2.8GHz processor, SuSE Linux 9.2, R 2.4.1
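The permutation test is expensive because every permutation repeats the whole scan: at roughly 0.5 min per scan, 10,000 permutations come to about 83 hours. A small base-R sketch of a max-statistic permutation test is shown below; it is not the code of emp.ccfast() or emp.qtscore(), and the simulated data, the helper `scan_stat` and the n * r^2 score-type statistic are assumptions for the example.

```r
set.seed(3)
n_id <- 200; n_snp <- 50; n_perm <- 200
geno  <- matrix(rbinom(n_id * n_snp, 2, 0.3), nrow = n_id)
trait <- rnorm(n_id) + 0.8 * geno[, 1]    # SNP 1 has a true effect

# 1-d.f. score-type statistic per SNP: n times the squared correlation
scan_stat <- function(y, g) length(y) * as.vector(cor(y, g))^2

obs <- scan_stat(trait, geno)

# Each permutation shuffles the trait and repeats the whole scan,
# keeping only the largest statistic across all SNPs
max_null <- replicate(n_perm, max(scan_stat(sample(trait), geno)))

# Genome-wide empirical p-value: share of permutations whose maximum
# reaches the observed statistic (with the usual +1 correction)
p_emp <- sapply(obs, function(s) (sum(max_null >= s) + 1) / (n_perm + 1))
head(sort(p_emp))
```

The smallest attainable empirical p-value is 1 / (n_perm + 1), which is why genome-wide thresholds call for thousands of permutations.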
14. SNPassoc
SNPassoc: an R package to perform whole genome association studies. Juan R. González, et al. Bioinformatics, 2007; 23(5):654-655.
- SNPassoc: SNPs-based whole genome association studies
- This package carries out the most common analyses performed in whole genome association studies. These analyses include descriptive statistics and exploratory analysis of missing values, calculation of Hardy-Weinberg equilibrium, analysis of association based on generalized linear models (either for quantitative or binary traits), and analysis of multiple SNPs (haplotype and epistasis analysis). Permutation test and related tests (sum statistic and truncated product) are also implemented.
- Version: 1.4-9
- Depends: R (>= 2.4.0), haplo.stats, survival, mvtnorm
- Date: 2007-Oct-16
- Author: Juan R González, Lluís Armengol, Elisabet Guinó, Xavier Solé, and Víctor Moreno
- Maintainer: Juan R González <jrgonzalez at imim.es>
- License: GPL version 2 or newer
- URL: http://www.r-project.org and http://davinci.crg.es/estivill_lab/snpassoc
- In views: Genetics
- CRAN checks: SNPassoc results
15. SNPassoc: Data Summary
- setupSNP(data = snp-pheno.table, info = map.table, colSNPs, sep = "/", ...)
- summary()
- allele frequencies
- percentage of missing values
- HWE test
16. SNPassoc: Association Tests
- WGassociation(y ~ x1 + x2, data, model = (codominant, dominant, recessive, overdominant, log-additive or all), quantitative, level = 0.95)
- scanWGassociation(): only p values
- association(): only for selected SNPs; can do stratified and GxE interaction analyses
- Results
- Summary: a summary table by genes/chromosomes
- WGstats: detailed output (case-control numbers, percentages, odds ratios / mean differences, 95% confidence intervals, P-value for the likelihood ratio test of association, AIC, etc.)
- Pvalues: a table of p-values for each genetic model for each SNP
- Plot: p values in the -log scale, via plot.WGassociation()
- Labels: returns the names of the SNPs analyzed
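The genetic models named in the model argument differ only in how the three genotypes are coded before the regression. The base-R sketch below spells out the usual codings for a biallelic SNP; the 0/1/2 input coding and the data frame layout are assumptions for illustration, not SNPassoc's internal representation.

```r
# Genotype coded as the number of copies of the variant allele B
g <- c(0, 1, 2, 1, 0, 2)

model_coding <- data.frame(
  genotype     = c("AA", "AB", "BB")[g + 1],
  dominant     = as.integer(g >= 1),   # AB and BB vs AA
  recessive    = as.integer(g == 2),   # BB vs AA and AB
  overdominant = as.integer(g == 1),   # AB vs AA and BB
  log_additive = g                     # per-allele trend (0, 1, 2)
)
# The codominant model keeps the genotype as a 3-level factor (2 d.f.)
model_coding$codominant <- factor(model_coding$genotype,
                                  levels = c("AA", "AB", "BB"))
model_coding
```

Each of the first four codings gives a 1-d.f. test; only the codominant model estimates separate effects for AB and BB.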
17. SNPassoc: Multiple-SNP Analysis
- SNP-SNP Interaction
- interactionPval()
- epistasis analysis between all pairs of SNPs (and covariates)
- Haplotype Analysis
- haplo.glm(): using the R package haplo.stats
- association analysis of haplotypes with a response via GLM
- haplo.interaction()
- interactions between haplotypes (and covariates)
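For a single pair of SNPs, an epistasis test amounts to comparing a model with main effects only against one that adds the interaction term. The following base-R sketch shows that likelihood ratio test on simulated case-control data; it assumes log-additive (0/1/2) coding and is not the code of interactionPval(), which covers all pairs of SNPs.

```r
set.seed(11)
n <- 500
snp1 <- rbinom(n, 2, 0.4)
snp2 <- rbinom(n, 2, 0.3)
# Binary trait simulated with a true SNP1 x SNP2 interaction
eta  <- -1 + 0.2 * snp1 + 0.2 * snp2 + 0.8 * snp1 * snp2
case <- rbinom(n, 1, plogis(eta))

main_only <- glm(case ~ snp1 + snp2, family = binomial())
with_int  <- glm(case ~ snp1 * snp2, family = binomial())

# Likelihood ratio test of the interaction term (1 d.f.)
lr_stat <- deviance(main_only) - deviance(with_int)
p_interaction <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
p_interaction
```

Scanning all pairs repeats this comparison once per pair, so the number of tests grows quadratically with the number of SNPs.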
18. SNPassoc: Computer Efficiency
- 1000 subjects x 3000 SNPs
- 5 min.: import data
- 40 min.: setupSNP()
- 30 min.: scanWGassociation(), only p values (including permutation test)
- Memory usage: 750 MB