Genotype Error Detection using Hidden Markov Models of Haplotype Diversity - PowerPoint PPT Presentation

About This Presentation

Title:

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

Description:

Genotype Error Detection using Hidden Markov Models of Haplotype Diversity Ion Mandoiu CSE Department, University of Connecticut Joint work with Justin Kennedy and ... – PowerPoint PPT presentation

Slides: 36

Provided by: uco117

Learn more at: https://dna.engr.uconn.edu

Transcript and Presenter's Notes

Title: Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

1
Genotype Error Detection using Hidden Markov
Models of Haplotype Diversity
Ion Mandoiu CSE Department, University of
Connecticut Joint work with Justin Kennedy and
Bogdan Pasaniuc
2
Outline

Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion

3
Single Nucleotide Polymorphisms

Main form of variation between individual
genomes: single nucleotide polymorphisms (SNPs)
High density in the human genome: $\approx 1 \times 10^7$ SNPs
out of a total of $3 \times 10^9$ base pairs

ataggtccCtatttcgcgcCgtatacacgggActata
ataggtccGtatttcgcgcCgtatacacgggTctata
ataggtccCtatttcgcgcCgtatacacgggTctata
4
Haplotypes and Genotypes

Diploids: two homologous copies of each
chromosome
One inherited from mother and one from father
Haplotype: description of SNP alleles on a
chromosome
0/1 vector: 0 for major allele, 1 for minor
Genotype: description of alleles on both
chromosomes
0/1/2 vector: 0 (1) if both chromosomes contain
the major (minor) allele, 2 if the chromosomes
contain different alleles

two haplotypes per individual
genotype
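
The 0/1/2 encoding above can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` is an illustrative choice and does not come from the presentation.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype vector."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    # Same allele on both chromosomes: keep that allele code (0 or 1).
    # Different alleles: code 2 (heterozygous).
    return [a if a == b else 2 for a, b in zip(h1, h2)]


print(genotype_from_haplotypes([0, 1, 0, 1], [0, 1, 1, 0]))  # [0, 1, 2, 2]
```

Note that the mapping loses information: at a site coded 2 the genotype does not say which chromosome carries the minor allele, which is why phasing is needed.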
5
Why SNP Genotypes?

Identification and fine mapping of
disease-related genes
Methods: linkage analysis, allele-sharing,
association studies
Genotype data: large pedigrees, sibling pairs,
trios, unrelated individuals

6
Genotyping Errors

A real problem despite advances in genotyping
technology
[Zaitlen et al. 2005] found 1.1% inconsistencies
among the 20 million dbSNP genotypes typed
multiple times
Error types
Systematic errors (e.g., assay failure) detected
by departure from HWE [Hosking et al. 2004]
For pedigree data, some errors detected as
Mendelian Inconsistencies (MIs)
Undetected errors
E.g., if mother/father/child are all
heterozygous, any error is Mendelian consistent
Only 30% detectable as MIs for trios [Gordon et
al. 1999]
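
The all-heterozygous example can be checked with a short Python sketch of a single-SNP Mendelian consistency test. It assumes the 0/1/2 genotype coding from the earlier slide. The helper names are hypothetical.

```python
# 0/1/2 genotype code -> pair of alleles (0 = major, 1 = minor)
ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}


def is_mendelian_consistent(mother, father, child):
    """True if the child genotype can arise from one maternal and one paternal allele."""
    possible = {a if a == b else 2 for a in ALLELES[mother] for b in ALLELES[father]}
    return child in possible


def detectable_errors(mother, father, child):
    """List single-genotype changes (member index, wrong code) that create an MI."""
    trio = (mother, father, child)
    found = []
    for member in range(3):
        for wrong in (0, 1, 2):
            if wrong == trio[member]:
                continue
            changed = list(trio)
            changed[member] = wrong
            if not is_mendelian_consistent(*changed):
                found.append((member, wrong))
    return found


# All three heterozygous: no single error shows up as an MI.
print(detectable_errors(2, 2, 2))  # []
```

For comparison, `detectable_errors(0, 0, 0)` reports four of the six possible single errors, so how many errors are detectable depends strongly on the true genotypes.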

7
Effects of Undetected Genotyping Errors

Even low error levels can have large effects for
some study designs (e.g., rare alleles,
haplotype-based)
Errors as low as 0.1% can increase Type I error
rates in haplotype sharing transmission
disequilibrium test (HS-TDT) [Knapp & Becker 04]
1% errors decrease power by 10-50% for linkage,
and by 5-20% for association [Douglas et al. 00,
Abecasis et al. 01]

8
Related Work

Improved genotype calling algorithms
[Di et al. 05, Rabbee & Speed 06, Nicolae et al. 06]
Explicit modeling in analysis methods
[Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]
Computationally complex
Separate error detection step
[Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]
Detected errors can be retyped, imputed, or
ignored in downstream analyses

9
Outline

Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion

10
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
11
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
Likelihood of best phasing for original trio T
12
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
Mother
Father
0 1 2 1 0 2
0 2 2 1 0 2
Child
0 2 2 1 0 2
?

Large change in likelihood suggests likely error
Flag genotype as an error if $L(T')/L(T) > R$,
where $T'$ is the modified trio and $R$ is the
detection threshold (e.g., $R = 10^4$)
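
The flagging rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the FAMHAP or HMM implementation: it assumes the modified trio T' is obtained by marking the tested genotype as missing, and it takes the likelihood function as a caller-supplied argument. The names `flag_errors`, `MISSING`, and `toy_likelihood` are hypothetical.

```python
MISSING = None  # assumed marker for an untyped genotype


def flag_errors(trio, likelihood, R=1e4):
    """Flag (member, locus) pairs whose removal raises the likelihood by more than R.

    trio: dict mapping member name -> list of 0/1/2 genotypes (or MISSING).
    likelihood: callable taking such a dict and returning a positive float.
    """
    base = likelihood(trio)
    flagged = []
    for member, genotypes in trio.items():
        for locus, g in enumerate(genotypes):
            if g is MISSING:
                continue
            # Build T': a copy of the trio with one genotype marked missing.
            modified = {m: list(gs) for m, gs in trio.items()}
            modified[member][locus] = MISSING
            if likelihood(modified) / base > R:
                flagged.append((member, locus))
    return flagged


def toy_likelihood(trio):
    """Made-up likelihood: one genotype (child, locus 1) is very implausible."""
    value = 1.0
    for member, genotypes in trio.items():
        for locus, g in enumerate(genotypes):
            if g is not MISSING:
                value *= 1e-6 if (member, locus) == ("child", 1) else 0.5
    return value


trio = {"mother": [0, 1, 2], "father": [0, 2, 2], "child": [0, 2, 2]}
print(flag_errors(trio, toy_likelihood))  # [('child', 1)]
```

Called this way, the sketch evaluates the likelihood once per typed genotype. A real likelihood would come from a phasing model.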

13
Implementation in FAMHAP [Becker et al. 06]

Window-based algorithm
For each window including the SNP under test,
generate list of $H$ most frequent haplotypes
(default $H = 50$)
Find most likely trio phasings by pruned search
over the $H^4$ quadruples of frequent haplotypes
Flag genotype as an error if $L(T')/L(T) > R$ for
at least one window

14
Limitations of FAMHAP Implementation

Truncating the list of haplotypes to size H may
lead to sub-optimal phasings and inaccurate L(T)
values
False positives caused by nearby errors (due to
the use of multiple short windows)
Our approach
HMM model of haplotype diversity → all haplotypes
are represented, no need for short windows
Alternate likelihood functions → scalable runtime

15
Outline

Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion

16
HMM Model
(Figure from Rastas et al. 07)

Similar to models proposed by [Schwartz 04,
Rastas et al. 05, Kimmel & Shamir 05]
Unlike [Scheet & Stephens 06], recombination ratios
not modeled explicitly
Block-free model, paths with high transition
probability correspond to founder haplotypes
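
A minimal Python sketch of such a left-to-right HMM follows, assuming K founder states per SNP, per-locus transition tables, and per-state emission probabilities. The parameter layout (`init`, `trans`, `emit`) and the toy numbers are assumptions made for illustration. The forward algorithm is used to compute the total probability that the model emits a given haplotype.

```python
from itertools import product


def haplotype_probability(hap, init, trans, emit):
    """Forward algorithm: total probability that the HMM emits haplotype `hap`.

    init[k]        : probability of starting in founder state k
    trans[j][p][k] : probability of moving from state p at locus j to state k at locus j+1
    emit[j][k][a]  : probability that state k at locus j emits allele a (0 or 1)
    """
    K = len(init)
    # f[k] = probability of emitting hap[0..j] and being in state k at locus j
    f = [init[k] * emit[0][k][hap[0]] for k in range(K)]
    for j in range(1, len(hap)):
        f = [
            emit[j][k][hap[j]] * sum(f[p] * trans[j - 1][p][k] for p in range(K))
            for k in range(K)
        ]
    return sum(f)


# Toy model with 2 founders over 3 SNPs: founder 0 is mostly 000, founder 1 mostly 111,
# and staying on the same founder has high transition probability.
init = [0.5, 0.5]
emit = [[[0.9, 0.1], [0.1, 0.9]] for _ in range(3)]
trans = [[[0.95, 0.05], [0.05, 0.95]] for _ in range(2)]

for hap in product([0, 1], repeat=3):
    print(hap, haplotype_probability(hap, init, trans, emit))
```

In this toy model, haplotypes that follow a single founder (000 or 111) receive most of the probability mass, which matches the idea that high-probability paths correspond to founder haplotypes.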

17
HMM Training

Previous works use EM training of HMM based on
unrelated genotype data
Our 2-step algorithm exploits pedigree info
Step 1: infer haplotypes using pedigree-aware
algorithm based on entropy-minimization
Step 2: train HMM based on inferred haplotypes,
using Baum-Welch

18
Complexity of Computing Maximum Phasing
Probability

For unrelated genotypes, computing maximum
phasing probability is hard to approximate within
a factor of $O(f^{1/2-\varepsilon})$ unless $\mathrm{ZPP} = \mathrm{NP}$, where $f$ is the
number of founders
For trios, hard to approximate within $O(f^{1/4-\varepsilon})$
Reductions from the clique problem

19
Alternate Likelihood Functions

Viterbi probability (ViterbiProb): the maximum
probability of a set of 4 HMM paths that emit 4
haplotypes compatible with the trio
Probability of Viterbi Haplotypes (ViterbiHaps):
product of total probabilities of the 4 Viterbi
haplotypes
Total Trio Probability (TotalProb): total
probability P(T) that the HMM emits four
haplotypes that explain trio T along all possible
4-tuples of paths
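
To make "four haplotypes that explain the trio" concrete, here is a brute-force Python sketch. It departs from the slide in one labeled way: a plain haplotype probability table stands in for the HMM, so the maximum is taken over 4-tuples of haplotypes (the maximum phasing probability) rather than over HMM paths. The function names and toy frequencies are hypothetical.

```python
from itertools import product


def genotype(h1, h2):
    """0/1/2 genotype of a pair of 0/1 haplotypes."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def trio_likelihoods(mother, father, child, hap_prob):
    """Return (max phasing probability, total trio probability).

    hap_prob: dict mapping haplotype tuples to probabilities (stand-in for the HMM).
    Each 4-tuple is ordered as (mother transmitted, mother untransmitted,
    father transmitted, father untransmitted).
    """
    best, total = 0.0, 0.0
    for mt, mu, ft, fu in product(hap_prob, repeat=4):
        explains = (
            genotype(mt, mu) == tuple(mother)
            and genotype(ft, fu) == tuple(father)
            and genotype(mt, ft) == tuple(child)
        )
        if explains:
            p = hap_prob[mt] * hap_prob[mu] * hap_prob[ft] * hap_prob[fu]
            best = max(best, p)
            total += p
    return best, total


hap_prob = {(0, 0): 0.5, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.1}
print(trio_likelihoods((0, 2), (2, 2), (0, 2), hap_prob))
```

The enumeration is exponential in the number of SNPs and is only meant to show what the quantities mean. The total is never smaller than the maximum, and a Mendelian-inconsistent trio gets probability zero under both.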

20
Efficient Computation of Viterbi Probability for
Trios

For a fixed trio, Viterbi paths can be found
using a 4-path version of Viterbi's algorithm in
time
$K^3$ speed-up by factoring common terms

Where
21
Overall Runtimes

Viterbi probability
Likelihoods of all 3N modified trios can be
computed within time using
forward-backward algorithm
Overall runtime for M trios
Probability of Viterbi haplotypes
Obtain haplotypes from standard traceback, then
compute haplotype probabilities using forward
algorithms
Overall runtime
Total trio probability
Similar pre-computation speed-up
forward-backward algorithm
Overall runtime
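
The forward-backward idea can be sketched in Python for a simplified setting: a single haplotype instead of a trio, with "modified" taken to mean one locus marked as missing. Both simplifications are assumptions made for illustration, as are the function name and the `init`/`trans`/`emit` parameter layout. One forward pass and one backward pass give the likelihoods of all N modified haplotypes without rerunning the algorithm N times.

```python
def all_missing_likelihoods(hap, init, trans, emit):
    """Return (P(hap), [P(hap with locus j missing) for every j]).

    init[k], trans[j][p][k], emit[j][k][a] describe a left-to-right HMM
    with K states per locus (hypothetical parameter layout).
    """
    N, K = len(hap), len(init)
    # pre[j][k]: probability of emitting hap[0..j-1] and reaching state k at locus j
    # fwd[j][k]: same, and additionally emitting hap[j] from state k
    pre = [[0.0] * K for _ in range(N)]
    fwd = [[0.0] * K for _ in range(N)]
    for j in range(N):
        for k in range(K):
            pre[j][k] = init[k] if j == 0 else sum(
                fwd[j - 1][p] * trans[j - 1][p][k] for p in range(K))
            fwd[j][k] = pre[j][k] * emit[j][k][hap[j]]
    # bwd[j][k]: probability of emitting hap[j+1..N-1] given state k at locus j
    bwd = [[1.0] * K for _ in range(N)]
    for j in range(N - 2, -1, -1):
        for k in range(K):
            bwd[j][k] = sum(
                trans[j][k][q] * emit[j + 1][q][hap[j + 1]] * bwd[j + 1][q]
                for q in range(K))
    full = sum(fwd[N - 1])
    # Skipping the emission at locus j is the same as summing over both alleles there.
    missing = [sum(pre[j][k] * bwd[j][k] for k in range(K)) for j in range(N)]
    return full, missing


init = [0.5, 0.5]
emit = [[[0.9, 0.1], [0.1, 0.9]] for _ in range(5)]
trans = [[[0.95, 0.05], [0.05, 0.95]] for _ in range(4)]
full, missing = all_missing_likelihoods([0, 0, 1, 0, 0], init, trans, emit)
print([m / full for m in missing])  # likelihood ratio per locus
```

In this toy run the largest ratio occurs at the middle locus, where the allele disagrees with both founder patterns.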

22
Outline

Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion

23
Datasets

Real dataset [Becker et al. 2006]
35 SNP loci on chromosome 16 covering a region of
91kb
551 trios
Synthetic datasets
35 SNPs, 30-551 trios
Preserved missing data pattern of real dataset
Haplotypes assigned to trios based on frequencies
inferred from real dataset
1% error rate, four error insertion models:
Random allele
Random genotype
Heterozygous-to-homozygous
Homozygous-to-heterozygous
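
The four error insertion models could be simulated along the following lines. The slide only names the models, so the exact rules below are an assumed interpretation: "random allele" is taken to mean flipping one of the two alleles, and "random genotype" to mean replacing the genotype with a different one chosen uniformly. Function names are hypothetical, and missing genotypes are not handled.

```python
import random


def insert_error(g, model, rng):
    """Return an erroneous version of genotype g (0 = hom. major, 1 = hom. minor, 2 = het).

    If the model does not apply to g, g is returned unchanged.
    """
    if model == "random_allele":
        # Flipping one allele: homozygous becomes heterozygous,
        # heterozygous becomes one of the two homozygous genotypes.
        return 2 if g in (0, 1) else rng.choice([0, 1])
    if model == "random_genotype":
        return rng.choice([x for x in (0, 1, 2) if x != g])
    if model == "het_to_hom":
        return rng.choice([0, 1]) if g == 2 else g
    if model == "hom_to_het":
        return 2 if g in (0, 1) else g
    raise ValueError(f"unknown error model: {model}")


def corrupt(genotypes, model, rate=0.01, seed=0):
    """Apply the error model to each genotype independently with probability `rate`.

    Returns the corrupted list and the indices that actually changed.
    """
    rng = random.Random(seed)
    out, changed = [], []
    for i, g in enumerate(genotypes):
        new = insert_error(g, model, rng) if rng.random() < rate else g
        if new != g:
            changed.append(i)
        out.append(new)
    return out, changed


noisy, changed = corrupt([0, 1, 2] * 100, "random_genotype", rate=0.01, seed=1)
print(len(changed), "genotypes changed")
```

Keeping the list of changed indices gives the ground truth needed to score an error detector on the synthetic data.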

24
Experimental Setup

Two strategies for handling MIs:
Set all three individuals to unknown prior to
error detection, or
Set child only to unknown (preserving parents'
original data)
Two testing strategies:
Test one SNP genotype: ViterbiProb-1,
ViterbiHaps-1, TotalProb-1
Simultaneously test three SNP genotypes at the
same locus: ViterbiProb-3, ViterbiHaps-3,
TotalProb-3

25
Comparison with FAMHAP (Random Allele Errors)
26
Children vs. Parents (Random Allele Errors)
27
Error Model Comparison (TrioProb-1 Parents)
28
Error Model Comparison (TrioProb-1 Children)
29
TrioProb-1 Results on Real Dataset

[Becker et al. 06] resequenced all trio members
at 41 loci flagged by FAMHAP-3
23 SNP genotypes were identified as true errors
$41 \times 3 - 23 = 100$ resequenced SNP genotypes agree with
original calls
Predictive value for $R = 10^4$ is between 18/26 = 69%
and 24/26 = 92%, compared to 23/41 = 56% for FAMHAP-3

30
Pedigree Info vs. Sample Size Effect
31
Unrelated vs. Trio Likelihood Sensitivity
Unrelated ViterbiProb-1 Likelihood ratios
(children)
Trio ViterbiProb-1 Likelihood ratios (children)
32
Combining Likelihood Functions (Children, Random
Allele Model)
33
Combining Likelihood Functions (Parents, Random
Allele Model)
34
Outline

Introduction
Likelihood Sensitivity Approach to Error
Detection
HMM-Based Algorithms
Experimental Results
Conclusion

35
Conclusion

Proposed efficient methods for error detection in
trio genotype data based on an HMM model of
haplotype diversity
Significantly improved detection accuracy
compared to FAMHAP
High sensitivity even for very low FP rates
Runtime linear in the number of SNPs and trios
Ongoing work
Iterative error detection
Fix MIs using likelihood before error detection
Correct errors with high likelihood ratio, then
recompute likelihood ratios (possibly after
re-phasing and HMM re-training)
Integration with genotype calling algorithms
Combine low level intensity data with
haplotype-based likelihoods
Most useful when less pedigree info is available
(unrelated, sibling pairs w/o parent genotypes,
parents in trios)
Locus specific thresholds, p-values
Via simulations similar to [Douglas et al. 00]