Title: Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
1 Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Ion Mandoiu, CSE Department, University of Connecticut
Joint work with Justin Kennedy and Bogdan Pasaniuc
2 Outline
- Introduction
- Likelihood Sensitivity Approach to Error Detection
- HMM-Based Algorithms
- Experimental Results
- Conclusion
3 Single Nucleotide Polymorphisms
- Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
- High density in the human genome: ≈ 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs
ataggtccCtatttcgcgcCgtatacacgggActata
ataggtccGtatttcgcgcCgtatacacgggTctata
ataggtccCtatttcgcgcCgtatacacgggTctata
4 Haplotypes and Genotypes
- Diploids: two homologous copies of each chromosome
  - One inherited from mother and one from father
- Haplotype: description of SNP alleles on a chromosome
  - 0/1 vector: 0 for major allele, 1 for minor
- Genotype: description of alleles on both chromosomes
  - 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles
[Figure: two haplotypes per individual and the resulting genotype]
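The 0/1/2 coding above can be made concrete with a short sketch. The function names are illustrative only and do not come from the talk; the second helper lists the haplotype pairs that could explain a given genotype, which is the ambiguity that phasing has to resolve.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype.

    0 -> both chromosomes carry the major allele
    1 -> both chromosomes carry the minor allele
    2 -> the chromosomes carry different alleles (heterozygous)
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def compatible_haplotype_pairs(genotype):
    """Enumerate all unordered haplotype pairs that explain a genotype."""
    pairs = [([], [])]
    for g in genotype:
        if g in (0, 1):
            # Homozygous site: both haplotypes carry the same allele.
            pairs = [(x + [g], y + [g]) for x, y in pairs]
        else:
            # Heterozygous site: either phase is possible.
            pairs = [(x + [a], y + [1 - a]) for x, y in pairs for a in (0, 1)]
    # Drop mirror images: (h1, h2) and (h2, h1) describe the same individual.
    unique = {tuple(sorted((tuple(x), tuple(y)))) for x, y in pairs}
    return sorted(unique)


genotype = genotype_from_haplotypes([0, 1, 1, 0], [0, 1, 0, 1])
pairs = compatible_haplotype_pairs(genotype)
```

A genotype with k heterozygous sites (k ≥ 1) is explained by 2^(k-1) unordered haplotype pairs, so the number of candidate phasings grows quickly with the number of SNPs.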
5 Why SNP Genotypes?
- Identification and fine mapping of disease-related genes
- Methods: linkage analysis, allele-sharing, association studies
- Genotype data: large pedigrees, sibling pairs, trios, unrelated individuals
6 Genotyping Errors
- A real problem despite advances in genotyping technology
  - [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
- Error types
  - Systematic errors (e.g., assay failure) detected by departure from HWE [Hosking et al. 2004]
  - For pedigree data, some errors detected as Mendelian Inconsistencies (MIs)
- Undetected errors
  - E.g., if mother/father/child are all heterozygous, any error is Mendelian consistent
  - Only 30% detectable as MIs for trios [Gordon et al. 1999]
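A minimal sketch of a single-SNP Mendelian consistency check, assuming the 0/1/2 genotype coding used in this talk (0 = homozygous major, 1 = homozygous minor, 2 = heterozygous). The function names are hypothetical. The helper `detectable_single_changes` counts how many single-genotype changes in a trio would surface as an MI, which illustrates why the all-heterozygous case hides every error.

```python
def alleles(g):
    """Return the set of alleles a parent with genotype g can transmit."""
    return {0: {0}, 1: {1}, 2: {0, 1}}[g]


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype at one SNP can be formed by taking
    one allele from each parent."""
    for m in alleles(mother):
        for f in alleles(father):
            if (m if m == f else 2) == child:
                return True
    return False


def detectable_single_changes(mother, father, child):
    """Count single-genotype changes that would show up as an MI.

    Returns (detected, total) over the 6 possible changes: each of the
    three genotypes replaced by each of the two other values."""
    trio = [mother, father, child]
    detected = total = 0
    for i in range(3):
        for new in (0, 1, 2):
            if new == trio[i]:
                continue
            changed = list(trio)
            changed[i] = new
            total += 1
            if not is_mendelian_consistent(*changed):
                detected += 1
    return detected, total


all_het = detectable_single_changes(2, 2, 2)   # no change is detectable
all_hom = detectable_single_changes(0, 0, 0)   # some changes are detectable
```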
7 Effects of Undetected Genotyping Errors
- Even low error levels can have large effects for some study designs (e.g., rare alleles, haplotype-based)
- Error rates as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04]
- 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]
8 Related Work
- Improved genotype calling algorithms
  - [Di et al. 05, Rabbee & Speed 06, Nicolae et al. 06]
- Explicit modeling in analysis methods
  - [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]
  - Computationally complex
- Separate error detection step
  - [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]
  - Detected errors can be retyped, imputed, or ignored in downstream analyses
10 Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
11 Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
L(T): likelihood of best phasing for original trio T
12 Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Mother: 0 1 2 1 0 2
Father: 0 2 2 1 0 2
Child:  0 2 2 1 0 2
- Large change in likelihood suggests likely error
  - Flag genotype as an error if L(T')/L(T) > R, where T' is the modified trio and R is the detection threshold (e.g., R = 10^4)
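The flagging rule can be sketched as a loop over all genotypes of a trio. This is an illustration under stated assumptions, not the authors' code: the modified trio T' is assumed to be the original trio with the tested genotype marked as missing, `likelihood` is any caller-supplied function returning L(T), and `toy_likelihood` is a deliberately crude stand-in that ignores haplotype structure and exists only so the sketch runs end to end.

```python
MISSING = None  # assumed marker for an untyped genotype


def flag_errors(trio, likelihood, threshold=1e4):
    """Flag genotypes whose removal raises the trio likelihood by more
    than `threshold`.

    trio: dict with keys "mother", "father", "child", each a list of
          0/1/2 genotypes (MISSING for untyped SNPs).
    likelihood: function mapping a trio to L(T) > 0.
    Returns a list of (member, snp_index, ratio) for flagged genotypes.
    """
    base = likelihood(trio)
    flagged = []
    for member, genotypes in trio.items():
        for i, g in enumerate(genotypes):
            if g is MISSING:
                continue
            # Build T': a copy of the trio with the tested genotype removed.
            modified = {m: list(gs) for m, gs in trio.items()}
            modified[member][i] = MISSING
            ratio = likelihood(modified) / base
            if ratio > threshold:
                flagged.append((member, i, ratio))
    return flagged


def toy_likelihood(trio, freq=(0.6, 0.00001, 0.39999)):
    """Stand-in likelihood: independent genotype frequencies per SNP.
    A missing genotype contributes a factor of 1."""
    value = 1.0
    for genotypes in trio.values():
        for g in genotypes:
            if g is not MISSING:
                value *= freq[g]
    return value


trio = {"mother": [0, 2, 0], "father": [0, 0, 2], "child": [0, 1, 2]}
flagged = flag_errors(trio, toy_likelihood)
```

In practice the likelihood would come from a haplotype model such as the HMM discussed later in the talk; only the ratio test itself is shown here.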
13 Implementation in FAMHAP [Becker et al. 06]
- Window-based algorithm
  - For each window including the SNP under test, generate list of H most frequent haplotypes (default H = 50)
  - Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
  - Flag genotype as an error if L(T')/L(T) > R for at least one window
14 Limitations of FAMHAP Implementation
- Truncating the list of haplotypes to size H may lead to sub-optimal phasings and inaccurate L(T) values
- False positives caused by nearby errors (due to the use of multiple short windows)
- Our approach
  - HMM model of haplotype diversity → all haplotypes are represented, no need for short windows
  - Alternate likelihood functions → scalable runtime
16 HMM Model
(Figure from Rastas et al. 07)
- Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel & Shamir 05]
- Unlike [Scheet & Stephens 06], recombination rates not modeled explicitly
- Block-free model; paths with high transition probability correspond to founder haplotypes
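A rough sketch of this kind of model, assuming a left-to-right HMM with K founder states at each SNP level, transitions only between consecutive levels, and a 0/1 emission per state. The dictionary layout, the function names, and the random parameters are all placeholders chosen for illustration. The forward algorithm gives the total probability that the model emits a given haplotype.

```python
import random
from itertools import product


def random_hmm(n_snps, k, seed=0):
    """Build a toy HMM with k founder states per SNP level.
    All parameters are random placeholders, not trained values."""
    rng = random.Random(seed)

    def dist():
        w = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(w)
        return [x / total for x in w]

    return {
        "init": dist(),                                   # P(state at SNP 0)
        "trans": [[dist() for _ in range(k)]              # trans[j][a][b]
                  for _ in range(n_snps - 1)],
        "emit1": [[rng.random() for _ in range(k)]        # P(allele 1 | state)
                  for _ in range(n_snps)],
    }


def emission(hmm, j, state, allele):
    """Probability that `state` at level j emits `allele` (0 or 1)."""
    p1 = hmm["emit1"][j][state]
    return p1 if allele == 1 else 1.0 - p1


def haplotype_probability(hmm, hap):
    """Forward algorithm: total probability that the HMM emits `hap`,
    summed over all state paths. Runs in O(N * K^2) time."""
    k = len(hmm["init"])
    f = [hmm["init"][s] * emission(hmm, 0, s, hap[0]) for s in range(k)]
    for j in range(1, len(hap)):
        f = [sum(f[a] * hmm["trans"][j - 1][a][b] for a in range(k))
             * emission(hmm, j, b, hap[j]) for b in range(k)]
    return sum(f)


hmm = random_hmm(n_snps=5, k=3)
# Every haplotype of length 5 gets a probability, so nothing is truncated.
total = sum(haplotype_probability(hmm, h) for h in product((0, 1), repeat=5))
```

Because every 0/1 vector receives some probability, the model represents all haplotypes at once, in contrast to a truncated list of the H most frequent ones.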
17 HMM Training
- Previous works use EM training of HMM based on unrelated genotype data
- Our 2-step algorithm exploits pedigree info
  - Step 1: infer haplotypes using pedigree-aware algorithm based on entropy-minimization
  - Step 2: train HMM based on inferred haplotypes, using Baum-Welch
18 Complexity of Computing Maximum Phasing Probability
- For unrelated genotypes, computing maximum phasing probability is hard to approximate within a factor of O(f^(1/2 - ε)) unless ZPP = NP, where f is the number of founders
- For trios, hard to approximate within O(f^(1/4 - ε))
- Reductions from the clique problem
19 Alternate Likelihood Functions
- Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio
- Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes
- Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
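The three likelihood functions can be compared on a tiny example by brute force. The sketch below is only an illustration under assumptions: it enumerates every haplotype 4-tuple (mother transmitted, mother untransmitted, father transmitted, father untransmitted) compatible with the trio, and it reads ViterbiProb as the maximum over those 4-tuples of the product of each haplotype's best single-path probability, which relies on the four paths being chosen independently. The HMM layout and all names are hypothetical, and the enumeration is exponential in the number of SNPs, so this is not the efficient algorithm from the talk.

```python
import random
from itertools import product

MISSING = None


def random_hmm(n_snps, k, seed=1):
    """Toy HMM with random placeholder parameters (not trained values)."""
    rng = random.Random(seed)

    def dist():
        w = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(w)
        return [x / total for x in w]

    return {"init": dist(),
            "trans": [[dist() for _ in range(k)] for _ in range(n_snps - 1)],
            "emit1": [[rng.random() for _ in range(k)] for _ in range(n_snps)]}


def hap_score(hmm, hap, combine):
    """combine=sum: total (forward) probability of emitting hap.
    combine=max: probability of the best single path emitting hap."""
    k = len(hmm["init"])

    def e(j, s):
        p1 = hmm["emit1"][j][s]
        return p1 if hap[j] == 1 else 1.0 - p1

    v = [hmm["init"][s] * e(0, s) for s in range(k)]
    for j in range(1, len(hap)):
        v = [combine(v[a] * hmm["trans"][j - 1][a][b] for a in range(k)) * e(j, b)
             for b in range(k)]
    return combine(v)


def geno(a, b):
    return a if a == b else 2


def phasings(mother, father, child):
    """Yield all haplotype 4-tuples compatible with the trio genotypes."""
    per_snp = []
    for gm, gf, gc in zip(mother, father, child):
        per_snp.append([(mt, mu, ft, fu)
                        for mt, mu, ft, fu in product((0, 1), repeat=4)
                        if gm in (MISSING, geno(mt, mu))
                        and gf in (MISSING, geno(ft, fu))
                        and gc in (MISSING, geno(mt, ft))])
    for choice in product(*per_snp):
        yield tuple(zip(*choice))  # four haplotypes, one allele per SNP


def likelihoods(hmm, mother, father, child):
    total_prob, viterbi_prob, viterbi_haps = 0.0, 0.0, 0.0
    for haps in phasings(mother, father, child):
        p = v = 1.0
        for h in haps:
            p *= hap_score(hmm, h, sum)
            v *= hap_score(hmm, h, max)
        total_prob += p
        if v > viterbi_prob:
            viterbi_prob, viterbi_haps = v, p  # p belongs to the Viterbi haplotypes
    return {"ViterbiProb": viterbi_prob,
            "ViterbiHaps": viterbi_haps,
            "TotalProb": total_prob}


hmm = random_hmm(n_snps=6, k=3)
scores = likelihoods(hmm, [0, 1, 2, 1, 0, 2], [0, 2, 2, 1, 0, 2], [0, 2, 2, 1, 0, 2])
```

By construction ViterbiProb ≤ ViterbiHaps ≤ TotalProb, and a Mendelian-inconsistent trio has no compatible 4-tuple, so all three values are zero in that case.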
20 Efficient Computation of Viterbi Probability for Trios
- For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi's algorithm in time [formula missing]
- K^3 speed-up by factoring common terms
21 Overall Runtimes
- Viterbi probability
  - Likelihoods of all 3N modified trios can be computed within time [formula missing] using forward-backward algorithm
  - Overall runtime for M trios: [formula missing]
- Probability of Viterbi haplotypes
  - Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms
  - Overall runtime: [formula missing]
- Total trio probability
  - Similar pre-computation speed-up and forward-backward algorithm
  - Overall runtime: [formula missing]
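The forward-backward idea behind these runtimes can be illustrated on a simplified case: one haplotype instead of a trio, with one allele at a time treated as missing. This is a sketch under that simplifying assumption, with hypothetical names and random placeholder parameters. One forward pass and one backward pass are enough to obtain the likelihood of every single-position modification, instead of rerunning the forward algorithm once per position.

```python
import random


def random_hmm(n_snps, k, seed=2):
    """Toy HMM with random placeholder parameters (not trained values)."""
    rng = random.Random(seed)

    def dist():
        w = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(w)
        return [x / total for x in w]

    return {"init": dist(),
            "trans": [[dist() for _ in range(k)] for _ in range(n_snps - 1)],
            "emit1": [[rng.random() for _ in range(k)] for _ in range(n_snps)]}


def emit(hmm, j, s, allele):
    p1 = hmm["emit1"][j][s]
    return p1 if allele == 1 else 1.0 - p1


def forward_backward(hmm, hap):
    """pre[j][s]: P(hap[0..j-1], state s at level j), before emitting hap[j]
    fwd[j][s]: P(hap[0..j],   state s at level j)
    bwd[j][s]: P(hap[j+1..]  | state s at level j)"""
    n, k = len(hap), len(hmm["init"])
    pre, fwd, bwd = [None] * n, [None] * n, [None] * n
    pre[0] = list(hmm["init"])
    for j in range(n):
        fwd[j] = [pre[j][s] * emit(hmm, j, s, hap[j]) for s in range(k)]
        if j + 1 < n:
            pre[j + 1] = [sum(fwd[j][a] * hmm["trans"][j][a][b] for a in range(k))
                          for b in range(k)]
    bwd[n - 1] = [1.0] * k
    for j in range(n - 2, -1, -1):
        bwd[j] = [sum(hmm["trans"][j][a][b] * emit(hmm, j + 1, b, hap[j + 1])
                      * bwd[j + 1][b] for b in range(k)) for a in range(k)]
    return pre, fwd, bwd


def haplotype_probability(hmm, hap):
    _, fwd, _ = forward_backward(hmm, hap)
    return sum(fwd[-1])


def masked_probabilities(hmm, hap):
    """Probability of hap with allele j treated as missing, for every j,
    from a single forward-backward pass (O(N * K^2) in total)."""
    pre, _, bwd = forward_backward(hmm, hap)
    k = len(hmm["init"])
    return [sum(pre[j][s] * bwd[j][s] for s in range(k)) for j in range(len(hap))]


hmm = random_hmm(n_snps=6, k=3)
hap = [0, 1, 1, 0, 1, 0]
ratios = [m / haplotype_probability(hmm, hap) for m in masked_probabilities(hmm, hap)]
```

The list `ratios` plays the role of L(T')/L(T) in this single-haplotype setting; the trio versions in the talk apply the same pre-computation to 4-tuples of paths.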
23 Datasets
- Real dataset [Becker et al. 2006]
  - 35 SNP loci on chromosome 16 covering a region of 91 kb
  - 551 trios
- Synthetic datasets
  - 35 SNPs, 30-551 trios
  - Preserved missing data pattern of real dataset
  - Haplotypes assigned to trios based on frequencies inferred from real dataset
  - 1% error rate, four error insertion models
    - Random allele
    - Random genotype
    - Heterozygous-to-homozygous
    - Homozygous-to-heterozygous
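One possible reading of the four error insertion models, written as a sketch. The slide gives only the model names, so the exact definitions below are assumptions: "random allele" flips one of the two alleles chosen at random, "random genotype" replaces the genotype with a different one chosen uniformly, and the last two models change only heterozygous or only homozygous genotypes. Function names and the `insert_errors` driver are hypothetical.

```python
import random


def random_allele(g, rng):
    """Flip one randomly chosen allele of the genotype (assumed definition)."""
    a, b = {0: (0, 0), 1: (1, 1), 2: (0, 1)}[g]
    if rng.random() < 0.5:
        a = 1 - a
    else:
        b = 1 - b
    return a if a == b else 2


def random_genotype(g, rng):
    """Replace the genotype by one of the two other genotypes."""
    return rng.choice([x for x in (0, 1, 2) if x != g])


def het_to_hom(g, rng):
    """Heterozygous genotypes become a random homozygote; others unchanged."""
    return rng.choice((0, 1)) if g == 2 else g


def hom_to_het(g, rng):
    """Homozygous genotypes become heterozygous; others unchanged."""
    return 2 if g in (0, 1) else g


def insert_errors(genotypes, model, rate=0.01, seed=0):
    """Apply `model` to each typed genotype with probability `rate`.
    Returns the noisy genotypes and the indices that actually changed."""
    rng = random.Random(seed)
    noisy, changed = [], []
    for i, g in enumerate(genotypes):
        if g is not None and rng.random() < rate:
            new = model(g, rng)
            if new != g:
                changed.append(i)
            noisy.append(new)
        else:
            noisy.append(g)  # missing genotypes (None) are left untouched
    return noisy, changed


noisy, changed = insert_errors([0, 1, 2, None, 2, 0] * 100, random_allele, rate=0.01)
```

Keeping the list of changed positions gives the ground truth needed to score a detector's sensitivity and false positive rate on synthetic data.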
24 Experimental Setup
- Two strategies for handling MIs
  - Set all three individuals to unknown prior to error detection, or
  - Set child only to unknown (preserving parents' original data)
- Two testing strategies
  - Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1
  - Simultaneously test three SNP genotypes at the same locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3
25 Comparison with FAMHAP (Random Allele Errors)
26 Children vs. Parents (Random Allele Errors)
27 Error Model Comparison (TrioProb-1 Parents)
28 Error Model Comparison (TrioProb-1 Children)
29 TrioProb-1 Results on Real Dataset
- [Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
  - 23 SNP genotypes were identified as true errors
  - 41 × 3 - 23 = 100 resequenced SNP genotypes agree with original calls
- Predictive value for R = 10^4 is between 18/26 = 69% and 24/26 = 92%, compared to 23/41 = 56% for FAMHAP-3
30 Pedigree Info vs. Sample Size Effect
31 Unrelated vs. Trio Likelihood Sensitivity
[Figure panels: Unrelated ViterbiProb-1 likelihood ratios (children); Trio ViterbiProb-1 likelihood ratios (children)]
32 Combining Likelihood Functions (Children, Random Allele Model)
33 Combining Likelihood Functions (Parents, Random Allele Model)
35 Conclusion
- Proposed efficient methods for error detection in trio genotype data based on an HMM model of haplotype diversity
  - Significantly improved detection accuracy compared to FAMHAP
  - High sensitivity even for very low FP rates
  - Runtime linear in the number of SNPs and trios
- Ongoing work
  - Iterative error detection
    - Fix MIs using likelihood before error detection
    - Correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training)
  - Integration with genotype calling algorithms
    - Combine low-level intensity data with haplotype-based likelihoods
    - Most useful when less pedigree info is available (unrelated individuals, sibling pairs w/o parent genotypes, parents in trios)
  - Locus-specific thresholds, p-values
    - Via simulations similar to [Douglas et al. 00]
36 Questions?