Title: Handling and analyzing data from a genome-wide association study
1 Handling and analyzing data from a genome-wide association study
- Laura Scott
- Biostatistics Department
- Center for Statistical Genetics
- University of Michigan
2 Outline
- Storing large amounts of genotype data
- Quality control
- Generating initial association analysis
- Viewing results
- Imputation of missing SNP genotypes
- Storing results and planning specialized analysis
3 Genotype data is huge
- 500,000 SNPs × 2,000 cases and controls = 1,000,000,000 genotypes!
- Need compact ways to store data
- If each genotype is stored as 00, 01, or 11, the file will look like:
- Person 100011001010001001001100.
- 000000100001000010001000.
4 Genotype data is huge
- 500,000 SNPs × 2,000 cases and controls = 1,000,000,000 genotypes!
- Need compact ways to store data
- If each genotype is stored as 00, 01, or 11, the file will look like:
- SNP 100011001010001001001100.
- 000000100001000010001000.
5 Genotype data is huge
- 500,000 SNPs × 2,000 cases and controls = 1,000,000,000 genotypes!
- Need compact ways to store data
- If each genotype is stored as 00, 01, or 11, the file will look like:
- 100011001010001001001100.
- 000000100001000010001000.
- Total file space for 300K SNPs: 4 Gigabytes
- Largest chromosome file: 0.4 Gigabytes
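As a rough illustration of compact storage, the two-digit codes above can be packed four to a byte. This is only a sketch: the function names are hypothetical, and the pair 10, which appears in the example rows but is not listed among the codes, is treated here simply as a fourth code without assigning it a meaning.

```python
# Two-bit codes taken from the slide's example rows; names are hypothetical.
CODES = {"00": 0b00, "01": 0b01, "10": 0b10, "11": 0b11}
DECODE = {value: text for text, value in CODES.items()}


def pack_genotypes(genotypes):
    """Pack a list of two-character genotype codes into bytes, four per byte."""
    packed = bytearray((len(genotypes) + 3) // 4)
    for i, code in enumerate(genotypes):
        packed[i // 4] |= CODES[code] << (2 * (i % 4))
    return bytes(packed)


def unpack_genotypes(packed, n):
    """Recover the first n genotype codes from the packed bytes."""
    return [DECODE[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n)]


# First example row from the slide, split into two-digit genotypes.
row = "100011001010001001001100"
genotypes = [row[i:i + 2] for i in range(0, len(row), 2)]
packed = pack_genotypes(genotypes)
```

At two bits per genotype, 1,000,000,000 genotypes need about 250,000,000 bytes before any headers or indexes.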
6 Need to do extensive planning for genotype data before it arrives
- Chromosome datasets are too large for SAS and other commonly used analytic packages
- Need programs to select and write out genotype data in multiple formats
- Tests of procedures with large-scale trial datasets
7 Gather other data needed for analysis
- SNP information
- Chromosome
- Position
- SNP annotation
- Gene
- Function
- Translation of called allele to a standard allele
- Example: forward strand of a given genome build
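One way to sketch the allele translation step, assuming each SNP's annotation records whether the assay reads the forward ('+') or reverse ('-') strand. The function name and the strand encoding are assumptions made for illustration only.

```python
# Base complements used when an allele was called on the reverse strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def to_forward_strand(allele, strand):
    """Return the allele as read on the forward strand.

    strand is '+' (already forward) or '-' (reverse); this encoding is assumed.
    """
    if strand == "+":
        return allele
    if strand == "-":
        return COMPLEMENT[allele]
    raise ValueError(f"unknown strand: {strand!r}")
```

A/T and C/G SNPs look the same on both strands, so strand information for them has to come from the annotation rather than from the alleles themselves.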
8 How good is the data?
- Identify and remove bad samples and SNPs
- Compute summary statistics
- Percent successfully genotyped samples
- Average genotyping success rate
- Duplicate sample error rate
- Non-Mendelian inheritance error rates (errors not consistent with normal transmission of chromosomes in family members)
9 Identify bad samples and remove
- Poor quality samples
- Sample genotype success rate < 95% to 97.5%
- Greater proportion of heterozygous genotypes than expected
- Related individuals (if independent samples)
- Based on pair-wise comparisons of similarity of genotypes
- Sample switches
- Wrong sex
- Regions of homozygosity in cell line
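The per-sample checks above can be sketched as follows, assuming each sample's genotypes are held as allele pairs with None for a failed call. The function name and data layout are hypothetical, and the 95% cutoff is just the lower end of the range quoted on the slide.

```python
def sample_summary(genotypes):
    """Return (call rate, heterozygosity) for one sample.

    genotypes: list of allele pairs such as ("A", "G"), or None for a failed call.
    """
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if called:
        het_rate = sum(a != b for a, b in called) / len(called)
    else:
        het_rate = float("nan")
    return call_rate, het_rate


def is_poor_quality(genotypes, min_call_rate=0.95):
    """Flag a sample whose genotype success rate falls below the cutoff."""
    call_rate, _ = sample_summary(genotypes)
    return call_rate < min_call_rate
```

Heterozygosity is returned rather than thresholded, since "greater than expected" is judged against the distribution across all samples.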
10 Identify poor quality SNPs and remove
- Observed genotype proportions are not consistent with those expected from the allele frequencies (Hardy-Weinberg equilibrium (HWE))
- HWE p-value < 10^-4 to 10^-6
- Look for deviation from expected distribution of p-values under the null
- Genotyping success rate < 95%
- Duplicate sample or Non-Mendelian error rate is elevated
- Differential missingness in cases and controls
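A minimal sketch of an HWE check from genotype counts, using the one-degree-of-freedom chi-square approximation for brevity. This is a swapped-in simple technique rather than what any particular QC program does; exact tests are commonly preferred when counts are small. The function name is hypothetical.

```python
import math


def hwe_chisq(n_aa, n_ab, n_bb):
    """Chi-square test (1 df) of observed genotype counts against HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic SNPs).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A SNP would then be flagged when p_value falls below the chosen cutoff, for example somewhere in the 10^-4 to 10^-6 range given above.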
11 Programs are available for large scale quality control analysis
- Plink
- Duplicate error rates, sample relatedness, HWE, ...
- Developed by Shaun Purcell
- http://pngu.mgh.harvard.edu/~purcell/plink/
- GAINQC: software used for the quality control analysis of the GAIN project
- Duplicate error rates, sample relatedness, HWE, ...
- Developed by Shyam Gopalakrishnan and Gonçalo Abecasis
- Available from gopalakr_at_umich.edu
12 Initial analysis is straightforward once everything is in place
- Case/control association
- Use a test that is not affected by deviations from HWE
- Cochran-Armitage test for trend
- Equivalent to score test in logistic regression
- TDT or other family-based test
- Quantitative trait association
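The Cochran-Armitage trend test mentioned above can be sketched from a 2 x 3 table of genotype counts. This assumes additive scores 0, 1, 2 for the three genotypes and uses the chi-square form that equals N times the squared correlation between case status and score; the function name and argument layout are hypothetical.

```python
import math


def armitage_trend(cases, controls, scores=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases, controls: counts per genotype, ordered like scores.
    Returns (chi-square with 1 df, p-value).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n = sum(totals)   # all samples
    r = sum(cases)    # all cases
    sum_rx = sum(c * x for c, x in zip(cases, scores))
    sum_nx = sum(t * x for t, x in zip(totals, scores))
    sum_nx2 = sum(t * x * x for t, x in zip(totals, scores))
    # Equals n * (correlation between case status and genotype score)^2.
    numerator = n * (n * sum_rx - r * sum_nx) ** 2
    denominator = r * (n - r) * (n * sum_nx2 - sum_nx ** 2)
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2))


# Made-up counts for illustration only.
chi2, p_value = armitage_trend(cases=(10, 20, 30), controls=(30, 20, 10))
```

The denominator is zero for a monomorphic SNP or when all samples are cases or all are controls, so such SNPs need to be filtered out beforehand.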
13 Programs are available for large scale case-control or family-based analysis
- Plink
- Case/control, TDT, quantitative traits
- Developed by Shaun Purcell
- http://pngu.mgh.harvard.edu/~purcell/plink/
- Merlin
- Quantitative traits in independent samples or families, ability to impute genotypes for untyped individuals based on genotyped family members
- Developed by Gonçalo Abecasis
- http://www.sph.umich.edu/csg/abecasis/Merlin/
14 Are the results believable?
- Are stronger associations correlated with poorer quality control measures?
- Is there a strong deviation from expected distribution of p-values?
- Is there confounding from differences in the genetic origins of case and control samples (population stratification)?
- Genomic control
- Eigenstrat analysis
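Genomic control can be sketched as below, assuming one-degree-of-freedom chi-square statistics from the genome-wide scan. The inflation factor is the median observed statistic divided by the median of a chi-square distribution with 1 df (about 0.455), and values below 1 are conventionally set to 1. The function name is hypothetical.

```python
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control(chi2_values):
    """Return (inflation factor lambda, chi-square values divided by lambda)."""
    lam = statistics.median(chi2_values) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # do not inflate statistics when lambda < 1
    return lam, [c / lam for c in chi2_values]
```

A lambda close to 1 suggests little overall inflation; a clearly larger value points to stratification or other systematic problems worth investigating.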
15 Seeing from many different angles is believing (sometimes)
- Plink graphical output
- User added custom tracks in the UCSC browser
- http://genome.ucsc.edu/
- http://genome.ucsc.edu/goldenPath/help/hgTracksHelp.html#CustomTracks
- Homemade graphs
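For the custom track option, one possibility is to write association results as a bedGraph track of -log10 p-values. This is a sketch only: the function name, track name and input layout (chromosome, 1-based position, p-value) are assumptions, and bedGraph is just one of the track formats the browser accepts.

```python
import math


def bedgraph_lines(results, name="GWAS_assoc"):
    """Build bedGraph custom-track lines from (chrom, position, p_value) tuples.

    Positions are assumed 1-based; bedGraph uses 0-based, half-open intervals.
    """
    lines = [f'track type=bedGraph name="{name}" '
             f'description="-log10 association p-values"']
    for chrom, pos, p_value in sorted(results):
        score = -math.log10(p_value)
        lines.append(f"chr{chrom}\t{pos - 1}\t{pos}\t{score:.3f}")
    return lines
```

The resulting lines can be saved to a text file and uploaded through the browser's custom track page.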
16 FUSION T2D association
17 Many different ways to display similar data
- Zeggini et al. (2007) Science 316:1336-1341
- Diabetes Genetics Initiative (2007) Science 316:1331-1336
- Scott et al. (2007) Science 316:1341-1345
18 Getting more for your genotyping dollars: imputation of SNP genotypes
- Impute/predict genotypes for
- Missing data within genotyped markers
- Untyped markers
- Uses haplotype structure of an existing sample, such as the HapMap samples, to infer data for samples with a sparser marker set
19 Observed genotypes
Study Sample
HapMap
Gonçalo Abecasis
20 Identify match among reference
Gonçalo Abecasis
21 Phase chromosomes, impute missing genotypes
Gonçalo Abecasis
22 Imputing genotype data allows much more thorough analysis
- Allows testing of untyped variation
- Allows easy combination of data across genotyping platforms
- Provides complete data for analysis with multiple SNPs
23 Imputed data takes care to generate, analyze and understand
- Requires large scale computing resources
- Need to assess quality of imputation
- Compare imputed genotypes to actual genotypes
- Error rates are higher than for genotyped SNPs
- Works less well for rarer alleles
- Best to take account of the uncertainty of imputed SNPs in analysis
- Need ways to take into account fractional genotype counts
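Two of the points above can be sketched in a few lines, assuming an imputation program reports three posterior probabilities per genotype (AA, AB, BB). The function names and this data layout are hypothetical. The expected allele count ("dosage") is one simple way to carry fractional genotype information into an analysis, and concordance with known genotypes is one simple quality measure.

```python
def expected_dosage(p_aa, p_ab, p_bb):
    """Expected number of B alleles (0 to 2) given genotype probabilities."""
    return p_ab + 2 * p_bb


def concordance(imputed_probs, true_genotypes):
    """Fraction of genotypes whose most likely imputed value matches the truth.

    imputed_probs: list of (p_aa, p_ab, p_bb) tuples.
    true_genotypes: list of B-allele counts (0, 1 or 2) from actual genotyping.
    """
    matches = 0
    for probs, truth in zip(imputed_probs, true_genotypes):
        best = max(range(3), key=lambda g: probs[g])
        matches += best == truth
    return matches / len(true_genotypes)
```

Concordance can look good for rare alleles simply because most genotypes are the common homozygote, so it is worth examining separately by allele frequency.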
24 Imputation programs are available
- IMPUTE
- Developed by Jonathan Marchini
- Nature Genetics, Advance online publication
- http://www.stats.ox.ac.uk/~marchini/software
- Mach 1.0, Markov Chain Haplotyping
- Developed by Gonçalo Abecasis
- http://www.sph.umich.edu/csg/abecasis/MACH/
25 Need to store results and prepare for large scale specialized analysis
- System to store, view, merge results
- SQL database
- Plink
- Testing speed of specialized analyses in different statistical packages
- Potential development of software to run large scale specialized analysis
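As one illustration of the SQL database option, results could be kept in a small SQLite table. The table name, columns and function names here are assumptions, and SQLite is chosen only because it ships with Python; a multi-user study would more likely use a server database.

```python
import sqlite3


def store_results(conn, rows):
    """Insert (snp, chrom, pos, p_value) rows, replacing earlier results per SNP."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS assoc ("
        "snp TEXT PRIMARY KEY, chrom TEXT, pos INTEGER, p_value REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO assoc VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def top_hits(conn, threshold):
    """Return (snp, p_value) pairs below the threshold, strongest first."""
    cursor = conn.execute(
        "SELECT snp, p_value FROM assoc WHERE p_value < ? ORDER BY p_value",
        (threshold,),
    )
    return cursor.fetchall()
```

Keeping one row per SNP makes it straightforward to merge results from different analyses with a join on the SNP name.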
26 Summary: ideally, what needs to happen before getting the data
- Ability to store, select and write out genotype data in multiple formats for quality control and association analysis
- Identification of primary quality control and analysis programs
- Systems to store, view, merge results
- Adequate computing resources to do intensive computing
- Testing of standard and specialized processes with large-scale trial datasets
27 FUSION study
- U Michigan: Michael Boehnke, Karen Conneely, Charles Ding, William Duren, Terry Gliedt, Larry Hu, Anne Jackson, Xiao-Yi Li, Andrew Skol, Heather Stringham, Peggy White, Cristen Willer, Fang Xiang, Rui Xiao
- U Michigan: Gonçalo Abecasis, Yun Li, Jun Ding, Paul Scheet
- CIDR: Kimberly Doheny, Elizabeth Pugh
- NHGRI / NIH: Francis Collins, Lori Bonnycastle, Peter Chines, Michael Erdos, Narisu Narisu, L. Prokunina-Olsson, Nancy Riebow, Andrew Sprau, Amy Swift, Maurine Tong
- Calvin College: Randall Pruim
- USC: Richard Bergman, Thomas Buchanan, Richard Watanabe
- UNC-Chapel Hill: Karen Mohlke, Kyle Gaulton, Jason Luo, Li Qin
- National Public Health Institute Helsinki: Jaakko Tuomilehto, Timo Valle