Title: Assessment of genotype data quality and the effect of data quality on analysis
1 Assessment of genotype data quality and the effect of data quality on analysis
- Laura Scott
- March 29, 2006
2 Questions
- What methods can be used to assess the quality of genotype data?
- What criteria are used to eliminate bad markers?
- How do markers with poor genotype quality affect tests of association and linkage?
- When are genotype errors not really errors?
3 What is a genotyping error?
- Genotype error: the observed genotype and true genotype don't match
- Truth: AA, Observed: AB (or BB)
- Missing data: genotype exists but is not called
- Truth: AA, Observed: no call
4 Terminology
- Blinded: genotyping group is unaware of the duplicates or family relationships
- Unblinded: genotyping group is aware of duplicates or family relationships and uses the information to
- Form clusters (Illumina)
- Adjust algorithms (Affy)
5 Sources of apparent genotyping error
- Sample quality
- Sample specific characteristics
- Human/machine error in sample handling
- Error in genotyping procedure
- Human error in data manipulation or test
- Genomic DNA characteristics
6 Methods of detecting genotyping errors/questionable data
7 Samples with poor success rate
- Poor quality DNA preparation
- Low DNA concentration
- Some whole genome amplification methods produce poor quality samples
- Poor genotyping
8 SNPs with poor success rates
- Poor genotyping
- Deletions
- SNP under primer
- Sample contamination
9 Duplicate sample error rates
- For each SNP
- For each duplicate pair where both samples have data
- Number of discrepant pairs / total pairs
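The per-SNP calculation above can be sketched as follows. This is a minimal illustration, not the procedure used in the talk: the function name, the use of `None` for an uncalled genotype, and the example pairs are all assumptions made here.

```python
def duplicate_discordance(pairs):
    """Discordance between duplicate genotypes for a single SNP.

    pairs: list of (genotype_1, genotype_2) tuples, one per duplicated
    sample; a genotype is a string such as "AB", or None if not called
    (hypothetical encoding chosen for this sketch).
    Returns (discrepant_pairs, compared_pairs, rate).
    """
    def normalize(genotype):
        # Treat "AB" and "BA" as the same genotype.
        return "".join(sorted(genotype))

    # Only pairs where both duplicates have data are compared.
    compared = [(normalize(a), normalize(b))
                for a, b in pairs if a is not None and b is not None]
    discrepant = sum(1 for a, b in compared if a != b)
    rate = discrepant / len(compared) if compared else float("nan")
    return discrepant, len(compared), rate


# Made-up example: four duplicate pairs, one of them missing a call.
example_pairs = [("AA", "AA"), ("AB", "AA"), ("BB", None), ("AB", "BA")]
discrepant, compared, rate = duplicate_discordance(example_pairs)
print(f"{discrepant} discrepant of {compared} compared pairs (rate {rate:.3f})")
```

Pairs with a missing call are excluded from both the numerator and the denominator, matching the "both have data" condition on the slide.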
10 Power to detect genotype errors by number of duplicates and error rate
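The figure for this slide is not available here. As a rough sketch of how power depends on the number of duplicates and the error rate, one simple model (an assumption made here, not necessarily the one behind the slide) treats each duplicate pair as independently discrepant with a fixed probability and asks how likely it is that at least one discrepancy is seen.

```python
def prob_at_least_one_discrepancy(n_duplicates, pair_error_rate):
    """P(at least one discrepant pair) among n independent duplicate pairs.

    pair_error_rate is the assumed probability that a single duplicate
    pair is discrepant at the SNP of interest.
    """
    return 1.0 - (1.0 - pair_error_rate) ** n_duplicates


# Illustrative grid of duplicate counts and error rates (made-up values).
for rate in (0.001, 0.01, 0.05):
    row = ", ".join(
        f"n={n}: {prob_at_least_one_discrepancy(n, rate):.3f}"
        for n in (10, 50, 100, 500)
    )
    print(f"error rate {rate}: {row}")
```

Under this model, low error rates need many duplicate pairs before a discrepancy is likely to show up at all.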
11 Detection of Mendelian inheritance inconsistencies using parent-child trios
[Figure: example parent-child trios with genotypes Aa and aa, labeled "Consistent", "Inconsistent", and "Uninformative, consistent"]
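A minimal sketch of a Mendelian consistency check for one SNP in one trio, assuming genotypes are written as two-character strings such as "Aa". The function name and the example trios are illustrative choices made here and are not taken from the slide's figure.

```python
from itertools import product


def mendel_consistent(father, mother, child):
    """True if the child genotype can arise from one allele of each parent.

    Genotypes are two-character strings such as "Aa" (assumed encoding).
    """
    # Every genotype obtainable by taking one allele from each parent.
    possible = {"".join(sorted(f + m)) for f, m in product(father, mother)}
    return "".join(sorted(child)) in possible


# Illustrative trios (father, mother, child).
print(mendel_consistent("Aa", "aa", "Aa"))  # child can inherit A and a
print(mendel_consistent("aa", "aa", "Aa"))  # no parent carries A
```

A check like this flags only inconsistent trios; many errors remain consistent with Mendelian inheritance and go undetected.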
12 Reasons for inconsistencies in Mendelian inheritance in parent-child trios
- Child is not consistent with the parents
- Possibilities
- Genotyping error
- New mutation
- Deletion
- Sample contamination
13 Hardy-Weinberg equilibrium
- In 1908 Hardy and Weinberg separately noticed that the allele frequency in a population could be used to calculate the expected genotype frequencies in randomly mating populations at equilibrium
- If $p$ = frequency of the A allele, the expected genotype frequencies are
- AA: $p^2$
- AB: $2p(1-p)$
- BB: $(1-p)^2$
$\chi^2 = \sum \frac{(\text{obs count} - \text{exp count})^2}{\text{exp count}}$ with 1 df
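A short worked sketch of the 1 df goodness-of-fit test for Hardy-Weinberg proportions. The function name and the example counts are assumptions for illustration; the p-value uses the identity that the upper tail of a chi-square with 1 df equals erfc(sqrt(x/2)).

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions from genotype counts.

    Returns (chi2, p_value) with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of the A allele
    if p in (0.0, 1.0):
        raise ValueError("monomorphic marker: expected counts are zero")
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts showing a deficit of heterozygotes.
chi2, p_value = hwe_chi_square(30, 40, 30)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```

With these counts the expected values are 25, 50, and 25, so the statistic is 1 + 2 + 1 = 4.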
14 Testing Hardy-Weinberg in samples with rarer alleles
[Figure panels: Exact test; $\chi^2$ test]
The $\chi^2$ test is often very anti-conservative when allele counts are small
Wigginton JE, Cutler DJ, Abecasis GR. A note on exact tests of Hardy-Weinberg equilibrium. Am J Hum Genet 2005; 76: 887-893.
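A sketch of an exact Hardy-Weinberg test by direct enumeration: conditional on the observed allele counts, it sums the probabilities of all heterozygote counts that are no more likely than the one observed. This is a plain enumeration written for illustration, not the optimized recursion of the cited paper, and the function name is an assumption.

```python
import math


def hwe_exact_p(n_aa, n_ab, n_bb):
    """Exact HWE p-value conditional on the observed allele counts."""
    n = n_aa + n_ab + n_bb
    n_a = 2 * n_aa + n_ab  # copies of allele A
    n_b = 2 * n_bb + n_ab  # copies of allele B

    def log_prob(het):
        # log P(het heterozygotes | n individuals, n_a copies of A)
        hom_a = (n_a - het) // 2
        hom_b = (n_b - het) // 2
        return (het * math.log(2)
                + math.lgamma(n + 1)
                - math.lgamma(hom_a + 1)
                - math.lgamma(het + 1)
                - math.lgamma(hom_b + 1)
                + math.lgamma(n_a + 1)
                + math.lgamma(n_b + 1)
                - math.lgamma(2 * n + 1))

    # The heterozygote count must have the same parity as n_a.
    probs = {het: math.exp(log_prob(het))
             for het in range(n_a % 2, min(n_a, n_b) + 1, 2)}
    p_observed = probs[n_ab]
    # Small tolerance so ties with the observed probability are included.
    total = sum(p for p in probs.values() if p <= p_observed * (1 + 1e-9))
    return min(1.0, total)


# Made-up small sample with a rare allele and no heterozygotes.
print(f"exact p = {hwe_exact_p(1, 0, 49):.4f}")
```

Because every possible configuration is enumerated, the result does not rely on the large-sample approximation behind the chi-square test.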
15 Ability to detect Hardy-Weinberg deviations
Cox DG, Kraft P. Quantification of the Power of Hardy-Weinberg Equilibrium Testing to Detect Genotyping Error. Hum Hered 2006; 61: 10-14.
16 Power to detect Hardy-Weinberg deviations
17 Tight double recombinants
- Need family data
- Identification of genotypes that are very unlikely given the surrounding haplotype
Abecasis: http://www.sph.umich.edu/csg/abecasis/Merlin/tour/error.html
18 Sample switches
- Samples switched in original tubes
- Mistakes in sample handling when plates are built
- Mis-orientation of sample plates during
genotyping
19 Placement of samples in plates for genotyping to detect sample switches
- Mix placement of cases and controls
- Make unique patterns of sex by row and column of individuals
- Place duplicate samples on separate plates in different places
- Position control samples so that genotyping quality can be assessed before the end of the project
20 Effect of genotyping error on case/control association tests
- Sample genotypes AA, AB, BB from a population in HWE
- Allele frequencies $p$, $q$
- Genotype frequencies $p^2$, $2pq$, $q^2$
- Traditional estimate of allele frequency from observed genotype frequencies is dependent on assumption of HWE
- Estimates may be biased when HWE is violated
Karen Conneely and Michael Boehnke, unpublished
21 Bias in allele frequency estimates small for most levels of missing genotypes
- Missing data from early stage Illumina genotyping
- 5.1% of heterozygotes (6037/118959)
- 0.7% of homozygotes (2312/335783)
Karen Conneely and Michael Boehnke, unpublished
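A small worked sketch of how differential loss of heterozygotes and homozygotes shifts an allele-counting estimate of allele frequency. It assumes a population in HWE and that missingness depends only on whether a genotype is heterozygous or homozygous; the function name and the chosen true frequencies are illustrative, while the two missing rates are the counts reported on the slide.

```python
def observed_allele_freq(p, miss_het, miss_hom):
    """Allele-counting estimate of p after genotype-dependent loss.

    p: true frequency of allele A in a population assumed to be in HWE.
    miss_het, miss_hom: assumed probabilities that a heterozygous or
    homozygous genotype is not called.
    """
    q = 1.0 - p
    aa = p * p * (1.0 - miss_hom)      # retained AA genotypes
    ab = 2 * p * q * (1.0 - miss_het)  # retained AB genotypes
    bb = q * q * (1.0 - miss_hom)      # retained BB genotypes
    return (aa + ab / 2) / (aa + ab + bb)


MISS_HET = 6037 / 118959   # missing rate among heterozygotes (slide)
MISS_HOM = 2312 / 335783   # missing rate among homozygotes (slide)

for true_p in (0.05, 0.2, 0.5):
    estimate = observed_allele_freq(true_p, MISS_HET, MISS_HOM)
    print(f"true p = {true_p:.2f}, estimate = {estimate:.4f}, "
          f"bias = {estimate - true_p:+.4f}")
```

At these missing rates the estimate moves by well under one percentage point, and at p = 0.5 the two homozygote classes balance so there is no bias.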
22 Asymptotic bias in association test due to loss of genotypes
- Assuming equal sample sizes, t-statistic is estimated
- When there is truly no difference between cases and controls, and no genotyping error or loss,
- When genotypes are missing, the variance will be under- or overestimated
Karen Conneely and Michael Boehnke, unpublished
23 Asymptotic bias in association test due to genotyping error
- The test will be similarly biased if there is systematic genotyping error
- AB mistyped as AA → anticonservative test
- AA mistyped as AB → conservative test
- Exception: when AB → AA and AB → BB with equal frequency, biases cancel out. In this case, the test is valid and no power is lost.
Karen Conneely and Michael Boehnke, unpublished
24 Robust tests for case/control association
- Genotype frequency based tests are robust to mistyping and missing data because they make no assumption of HWE
- 2x3 $\chi^2$ test of equal genotype frequencies in cases and controls
- Armitage's test for trend (Armitage, Biometrics 11:375-86, 1955; Sasieni, Biometrics 53:1253-61, 1997)
- Equivalent to using logistic regression with score test
- Type 1 error is correct
- Power greater for Armitage's test than for 2x3 test if model is additive on log odds scale
- Small loss of power under systematic mistyping/loss
Karen Conneely and Michael Boehnke, unpublished
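The two genotype-based tests named above can be sketched from a 2x3 table of genotype counts. This is an illustrative implementation using the standard textbook formulas with additive weights 0, 1, 2; the function names and the example counts are assumptions, not data from the talk.

```python
import math


def armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test; returns (chi2, p_value) with 1 df.

    cases, controls: genotype counts in the order (AA, AB, BB).
    """
    col_totals = [r + s for r, s in zip(cases, controls)]
    total = sum(col_totals)
    n_cases = sum(cases)
    n_controls = total - n_cases
    sum_xr = sum(x * r for x, r in zip(weights, cases))
    sum_xn = sum(x * n for x, n in zip(weights, col_totals))
    sum_x2n = sum(x * x * n for x, n in zip(weights, col_totals))
    numerator = total * (total * sum_xr - n_cases * sum_xn) ** 2
    denominator = n_cases * n_controls * (total * sum_x2n - sum_xn ** 2)
    chi2 = numerator / denominator
    return chi2, math.erfc(math.sqrt(chi2 / 2))  # chi-square tail, 1 df


def genotype_2x3(cases, controls):
    """Pearson chi-square on the 2x3 table; returns (chi2, p_value), 2 df."""
    col_totals = [r + s for r, s in zip(cases, controls)]
    total = sum(col_totals)
    chi2 = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2, math.exp(-chi2 / 2)  # chi-square tail, 2 df


# Made-up counts (AA, AB, BB) with a perfectly linear trend.
example_cases, example_controls = (10, 20, 30), (30, 20, 10)
print("trend:", armitage_trend(example_cases, example_controls))
print("2x3:  ", genotype_2x3(example_cases, example_controls))
```

In this example both statistics equal 20, but the trend test refers it to 1 df rather than 2, which is where its extra power under an additive model comes from.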
25 Linkage analysis
26 Effect of a single mis-genotyped marker on linkage
Yonan AL, Palmer AA, Gilliam TC. Hardy-Weinberg disequilibrium identified genotyping error of the serotonin transporter (SLC6A4) promoter polymorphism. Psychiatr Genet 2006; 16: 31-34.
27 Effect of data cleaning on linkage
[Figure: linkage results precleaning and postcleaning; panels: BMI Tecumseh; BMI Tecumseh and Maywood]
Chang YP, Kim JD, Schwander K et al. The impact of data quality on the identification of complex disease genes: experience from the Family Blood Pressure Program. Eur J Hum Genet 2006; 14: 469-477.
28 Hardy-Weinberg deviations caused by presence of disease associated alleles
- Deviations from Hardy-Weinberg (HWD) can be caused by disease association with marker
- Hard to distinguish between disease related and genotyping error related HWD
- Let A = non-risk allele, $q$ = frequency of A, $g$ = genotype frequency
- Express HWD as a measure defined from $g_{AA}$ and $(1-q)^2$
Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 2005; 76: 967-986.
29 Is HWD in cases and controls consistent with a genetic disease model?
- Parameterize the HWD measure in terms of risk of disease for each genotype
- Find best fitting additive, dominant, recessive, multiplicative or general model
- Ask whether the expected counts from this model differ significantly from those observed
- Assess using simulations
Wittke-Thompson JK, Pluzhnikov A, Cox NJ. Rational inferences about departures from Hardy-Weinberg equilibrium. Am J Hum Genet 2005; 76: 967-986.
30 HWD for cases and controls
HWD measure defined from $g_{AA}$ and $(1-q)^2$: > 1 indicates excess homozygotes, < 1 indicates a deficit of homozygotes
[Figure: HWD in cases and controls; panels: Dominant, Recessive]
31 HWD for cases and controls
[Figure: HWD in cases and controls; panels: Additive, Multiplicative]
32 HWD summary
- Test 60 polymorphisms with HWE departures
- Find 34 are consistent with tested biological models
- Does not prove HWD is due to disease
33 Detection of deletions from data with multiple null genotypes and Mendelian inconsistencies
- HapMap data
- Phase 1 with 1.3 M SNPs
- 269 individuals
- Assess similarity of patterns of
- Mendelian inconsistencies
- Null genotypes
- Calculate the binomial probability of observing each pattern n times in m markers relative to background rates
- Use observed heterozygotes / expected heterozygotes < 0.4 or 0.7 to help confirm
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
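The two quantities mentioned above can be sketched as follows. This is an illustration under simple assumptions, not the cited paper's pipeline: the binomial tail treats markers as independent with a fixed background rate, and the function names, the background rate, and the example counts are all made up here.

```python
from math import comb


def binomial_tail(n, m, background_rate):
    """P(X >= n) for X ~ Binomial(m, background_rate).

    Probability of seeing a pattern at least n times in m markers if it
    occurred independently at the assumed background rate.
    """
    return sum(
        comb(m, k) * background_rate ** k * (1 - background_rate) ** (m - k)
        for k in range(n, m + 1)
    )


def het_ratio(n_aa, n_ab, n_bb):
    """Observed heterozygotes divided by the HWE-expected number."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return n_ab / (2 * n * p * (1 - p))


# Made-up example: a pattern seen at 4 of 6 consecutive markers when the
# assumed background rate is 1%.
print(f"tail probability = {binomial_tail(4, 6, 0.01):.2e}")
# Made-up genotype counts with a strong deficit of heterozygotes.
print(f"het ratio = {het_ratio(45, 10, 45):.2f}")
```

A very small tail probability together with a low heterozygote ratio would point toward a deletion rather than random genotyping failure.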
34 Detection of deletions
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
35 SNPs with deviations from expected tests are found in close proximity to each other
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
36 See clustered patterns of deviations from expected test results
[Figure: Size distribution of putative deletions]
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
37 Detection of deletions using Illumina
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
38 Deletions summary
- Identified 541 potential deletions
- Confirmed majority of deletions that were tested by more stringent methods
- Deletions often in high $r^2$ with surrounding SNPs
- Find deletions of a few genes: olfactory receptors, drug metabolism, sex steroid hormones
McCarroll SA, Hadnott TN, Perry GH et al. Common deletion polymorphisms in the human genome. Nat Genet 2006; 38: 86-92.
39 Summary
- Genotype errors can arise in many different ways
- Multiple methods exist for determining genotype data quality
- Poor quality genotyping data can strongly affect some analyses
- Some poor quality SNPs may have an underlying biological basis
40 More references
- Review of sources of genotyping errors
- Pompanon F, Bonin A, Bellemain E, Taberlet P. Genotyping errors: causes, consequences and solutions. Nat Rev Genet 2005; 6: 847-859.
- Error checking programs
- Douglas JA, Boehnke M, Lange K. A multipoint method for detecting genotyping errors and mutations in sibling-pair linkage data. Am J Hum Genet 2000; 66: 1287-1297.
- Wigginton JE, Abecasis GR. PEDSTATS: descriptive statistics, graphics and quality assessment for gene mapping data. Bioinformatics 2005; 21: 3445-3447.
- O'Connell JR, Weeks DE. PedCheck: a program for identification of genotype incompatibilities in linkage analysis. Am J Hum Genet 1998; 63: 259-266.