1
Genotype Error Detection using Hidden Markov Models of Haplotype Diversity
Ion Mandoiu, CSE Department, University of Connecticut
Joint work with Justin Kennedy and Bogdan Pasaniuc
2
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

3
Single Nucleotide Polymorphisms
  • Main form of variation between individual genomes: single nucleotide polymorphisms (SNPs)
  • High density in the human genome: ≈ 1 × 10^7 SNPs out of a total of 3 × 10^9 base pairs

ataggtccCtatttcgcgcCgtatacacgggActata
ataggtccGtatttcgcgcCgtatacacgggTctata
ataggtccCtatttcgcgcCgtatacacgggTctata
4
Haplotypes and Genotypes
  • Diploids: two homologous copies of each chromosome
      • One inherited from mother and one from father
  • Haplotype: description of SNP alleles on a chromosome
      • 0/1 vector: 0 for major allele, 1 for minor
  • Genotype: description of alleles on both chromosomes
      • 0/1/2 vector: 0 (1) - both chromosomes contain the major (minor) allele; 2 - the chromosomes contain different alleles

(Figure: the two haplotypes of an individual and the resulting genotype)
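
The 0/1/2 encoding above can be sketched in a few lines. This is a minimal illustration, not code from the talk; the function name `genotype_from_haplotypes` is an assumption introduced here.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype vector.

    0 = both chromosomes carry the major allele,
    1 = both chromosomes carry the minor allele,
    2 = the chromosomes carry different alleles (heterozygous).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNPs")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Two haplotypes of one individual over four SNPs.
print(genotype_from_haplotypes([0, 1, 0, 1], [0, 1, 1, 0]))
```

Note that the mapping is many-to-one: a genotype with several 2 entries is compatible with more than one haplotype pair, which is why phasing is needed.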
5
Why SNP Genotypes?
  • Identification and fine mapping of
    disease-related genes
  • Methods: linkage analysis, allele-sharing, association studies
  • Genotype data: large pedigrees, sibling pairs, trios, unrelated individuals

6
Genotyping Errors
  • A real problem despite advances in genotyping
    technology
      • [Zaitlen et al. 2005] found 1.1% inconsistencies among the 20 million dbSNP genotypes typed multiple times
  • Error types
      • Systematic errors (e.g., assay failure) detected by departure from HWE [Hosking et al. 2004]
      • For pedigree data, some errors are detected as Mendelian Inconsistencies (MIs)
  • Undetected errors
      • E.g., if mother/father/child are all heterozygous, any error is Mendelian consistent
      • Only 30% detectable as MIs for trios [Gordon et al. 1999]
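
A small sketch of a single-SNP Mendelian consistency check for a trio, using the 0/1/2 genotype encoding from the earlier slide. The helper names (`ALLELES`, `encode`, `mendelian_consistent`) are assumptions made for illustration. It also confirms the example above: starting from an all-heterozygous trio, changing any one genotype still leaves the trio Mendelian consistent.

```python
# Allele pair carried by each genotype code (0 = major allele, 1 = minor allele).
ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}


def encode(a, b):
    """Genotype code for an unordered pair of alleles."""
    return a if a == b else 2


def mendelian_consistent(mother, father, child):
    """True if the child can inherit one allele from each parent."""
    return any(
        encode(a, b) == child
        for a in ALLELES[mother]
        for b in ALLELES[father]
    )


# An error in any member of an all-heterozygous trio goes undetected.
for g in (0, 1, 2):
    assert mendelian_consistent(g, 2, 2)
    assert mendelian_consistent(2, g, 2)
    assert mendelian_consistent(2, 2, g)

# Two homozygous-major parents cannot have a heterozygous child.
print(mendelian_consistent(0, 0, 2))
```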

7
Effects of Undetected Genotyping Errors
  • Even low error levels can have large effects for some study designs (e.g., rare alleles, haplotype-based)
  • Errors as low as 0.1% can increase Type I error rates in the haplotype sharing transmission disequilibrium test (HS-TDT) [Knapp & Becker 04]
  • 1% errors decrease power by 10-50% for linkage, and by 5-20% for association [Douglas et al. 00, Abecasis et al. 01]

8
Related Work
  • Improved genotype calling algorithms
      • [Di et al. 05, Rabbee & Speed 06, Nicolae et al. 06]
  • Explicit modeling in analysis methods
      • [Sieberts et al. 01, Sobel et al. 02, Abecasis et al. 02, Cheng 06]
      • Computationally complex
  • Separate error detection step
      • [Douglas et al. 00, Abecasis et al. 02, Becker et al. 06]
      • Detected errors can be retyped, imputed, or ignored in downstream analyses

9
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

10
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
11
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
L(T): likelihood of best phasing for original trio T
12
Likelihood Sensitivity Approach to Error Detection [Becker et al. 06]
Mother: 0 1 2 1 0 2
Father: 0 2 2 1 0 2
Child:  0 2 2 1 0 2
  • Large change in likelihood suggests likely error
  • Flag genotype as an error if L(T')/L(T) > R, where T' is the trio with the tested genotype modified and R is the detection threshold (e.g., R = 10^4)
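
The flagging rule can be sketched as a loop over all genotypes of a trio. This is only an illustration under stated assumptions: the modified trio T' is formed here by marking the tested genotype as missing, and `toy_likelihood` is a made-up stand-in (independent per-genotype probabilities), not the phasing likelihood used by FAMHAP or by the HMM methods in this talk.

```python
MISSING = None


def flag_errors(trio, likelihood, R=1e4):
    """Return (member, snp) pairs whose removal raises the likelihood by a factor > R.

    trio: three genotype lists (mother, father, child) in 0/1/2 encoding.
    likelihood: callable mapping a trio to a positive likelihood value.
    """
    base = likelihood(trio)
    if base <= 0:
        raise ValueError("likelihood of the original trio must be positive")
    flagged = []
    for member in range(3):
        for snp, g in enumerate(trio[member]):
            if g is MISSING:
                continue
            modified = [list(genos) for genos in trio]
            modified[member][snp] = MISSING  # assumed form of T'
            if likelihood(modified) / base > R:
                flagged.append((member, snp))
    return flagged


def toy_likelihood(trio):
    """Hypothetical likelihood: genotype 1 is treated as very unlikely."""
    prob = {0: 0.5, 2: 0.5, 1: 1e-5}
    like = 1.0
    for genos in trio:
        for g in genos:
            if g is not MISSING:
                like *= prob[g]
    return like


trio = [[0, 1, 2], [0, 2, 2], [0, 2, 2]]
print(flag_errors(trio, toy_likelihood))
```

With this toy model only the mother's genotype at the second SNP is flagged, since removing it is the only change that raises the likelihood by more than the threshold.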

13
Implementation in FAMHAP [Becker et al. 06]
  • Window-based algorithm
      • For each window including the SNP under test, generate a list of the H most frequent haplotypes (default H = 50)
      • Find most likely trio phasings by pruned search over the H^4 quadruples of frequent haplotypes
      • Flag genotype as an error if L(T')/L(T) > R for at least one window
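
A rough sketch of the window logic, to make the search-space size concrete. The window width and the way windows are enumerated are assumptions made for illustration (the slides do not specify them), and `windows_containing` and `flag_by_windows` are hypothetical helper names, not FAMHAP functions.

```python
def windows_containing(snp, n_snps, width):
    """All half-open windows [start, start + width) that include the given SNP."""
    first = max(0, snp - width + 1)
    last = min(snp, n_snps - width)
    return [(start, start + width) for start in range(first, last + 1)]


def flag_by_windows(ratios, R=1e4):
    """Flag the genotype if the likelihood ratio exceeds R in at least one window."""
    return any(r > R for r in ratios)


# Windows of an assumed width of 5 around SNP 10 in a 35-SNP region.
print(windows_containing(10, 35, 5))

# Unpruned number of haplotype quadruples per window with the default H = 50.
print(50 ** 4)
```

The quadruple count shows why the search has to be pruned: each window would otherwise require examining millions of candidate phasings per trio.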

14
Limitations of FAMHAP Implementation
  • Truncating the list of haplotypes to size H may
    lead to sub-optimal phasings and inaccurate L(T)
    values
  • False positives caused by nearby errors (due to
    the use of multiple short windows)
  • Our approach
      • HMM model of haplotype diversity → all haplotypes are represented, no need for short windows
      • Alternate likelihood functions → scalable runtime

15
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

16
HMM Model
(Figure from Rastas et al. 07)
  • Similar to models proposed by [Schwartz 04, Rastas et al. 05, Kimmel & Shamir 05]
  • Unlike [Scheet & Stephens 06], recombination ratios are not modeled explicitly
  • Block-free model; paths with high transition probability correspond to founder haplotypes

17
HMM Training
  • Previous works use EM training of HMM based on
    unrelated genotype data
  • Our 2-step algorithm exploits pedigree info
      • Step 1: infer haplotypes using a pedigree-aware algorithm based on entropy minimization
      • Step 2: train the HMM on the inferred haplotypes, using Baum-Welch

18
Complexity of Computing Maximum Phasing Probability
  • For unrelated genotypes, computing the maximum phasing probability is hard to approximate within a factor of O(f^(1/2 - ε)) unless ZPP = NP, where f is the number of founders
  • For trios, hard to approximate within O(f^(1/4 - ε))
  • Reductions from the clique problem

19
Alternate Likelihood Functions
  • Viterbi probability (ViterbiProb): the maximum probability of a set of 4 HMM paths that emit 4 haplotypes compatible with the trio
  • Probability of Viterbi Haplotypes (ViterbiHaps): product of total probabilities of the 4 Viterbi haplotypes
  • Total Trio Probability (TotalProb): total probability P(T) that the HMM emits four haplotypes that explain trio T along all possible 4-tuples of paths
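
To illustrate the difference between a Viterbi (best-path) probability and a total probability, here is a single-haplotype analogue on a tiny left-to-right HMM. It is a simplification under stated assumptions: the trio versions above work with 4 paths at once, while this sketch scores one haplotype, and the parameter layout (`init`, `trans`, `emit`) is hypothetical.

```python
def emission(p_one, allele):
    """Probability of emitting `allele` from a state that emits 1 with probability p_one."""
    return p_one if allele == 1 else 1.0 - p_one


def _score(hap, init, trans, emit, combine):
    # init[k]: start probability of state k at the first SNP
    # trans[j][a][b]: probability of moving from state a at SNP j to state b at SNP j + 1
    # emit[j][k]: probability that state k at SNP j emits allele 1
    k_states = range(len(init))
    col = [init[k] * emission(emit[0][k], hap[0]) for k in k_states]
    for j in range(1, len(hap)):
        col = [
            combine(col[a] * trans[j - 1][a][b] for a in k_states)
            * emission(emit[j][b], hap[j])
            for b in k_states
        ]
    return combine(col)


def viterbi_prob(hap, init, trans, emit):
    """Probability of the single most likely path emitting the haplotype."""
    return _score(hap, init, trans, emit, max)


def total_prob(hap, init, trans, emit):
    """Probability of the haplotype summed over all paths (forward algorithm)."""
    return _score(hap, init, trans, emit, sum)


# Two SNPs, two states per SNP.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.1, 0.9]]]
emit = [[0.1, 0.9], [0.1, 0.9]]
print(viterbi_prob([0, 0], init, trans, emit), total_prob([0, 0], init, trans, emit))
```

The best-path probability never exceeds the total probability, and the total probabilities of all possible haplotypes sum to 1.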

20
Efficient Computation of Viterbi Probability for Trios
  • For a fixed trio, Viterbi paths can be found using a 4-path version of Viterbi's algorithm [running-time formula not preserved]
  • K^3 speed-up by factoring common terms

21
Overall Runtimes
  • Viterbi probability
      • Likelihoods of all 3N modified trios can be computed [time bound not preserved] using the forward-backward algorithm
      • Overall runtime for M trios: [formula not preserved]
  • Probability of Viterbi haplotypes
      • Obtain haplotypes from standard traceback, then compute haplotype probabilities using forward algorithms
      • Overall runtime: [formula not preserved]
  • Total trio probability
      • Similar pre-computation speed-up and forward-backward algorithm
      • Overall runtime: [formula not preserved]

22
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

23
Datasets
  • Real dataset [Becker et al. 2006]
      • 35 SNP loci on chromosome 16 covering a region of 91 kb
      • 551 trios
  • Synthetic datasets
      • 35 SNPs, 30-551 trios
      • Preserved missing data pattern of real dataset
      • Haplotypes assigned to trios based on frequencies inferred from real dataset
      • 1% error rate, four error insertion models
          • Random allele
          • Random genotype
          • Heterozygous-to-homozygous
          • Homozygous-to-heterozygous
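
One possible reading of the four error insertion models, sketched on 0/1/2 genotypes. The slides name the models without defining them, so every definition below is an assumption: "random allele" is taken to flip one randomly chosen allele, "random genotype" to substitute a different genotype chosen uniformly, and the last two to apply only to heterozygous or homozygous genotypes respectively. All function names are hypothetical.

```python
import random

ALLELES = {0: (0, 0), 1: (1, 1), 2: (0, 1)}


def encode(a, b):
    """Genotype code for an unordered pair of alleles."""
    return a if a == b else 2


def random_allele(g, rng):
    """Flip one randomly chosen allele of the genotype (assumed definition)."""
    alleles = list(ALLELES[g])
    i = rng.randrange(2)
    alleles[i] = 1 - alleles[i]
    return encode(*alleles)


def random_genotype(g, rng):
    """Replace the genotype by a different one chosen uniformly (assumed definition)."""
    return rng.choice([x for x in (0, 1, 2) if x != g])


def het_to_hom(g, rng):
    """Turn a heterozygous genotype into a random homozygous one; others unchanged."""
    return rng.choice([0, 1]) if g == 2 else g


def hom_to_het(g, rng):
    """Turn a homozygous genotype into a heterozygous one; others unchanged."""
    return 2 if g in (0, 1) else g


def insert_errors(genotypes, model, rate=0.01, seed=0):
    """Apply an error model independently to each genotype with probability `rate`."""
    rng = random.Random(seed)
    return [model(g, rng) if rng.random() < rate else g for g in genotypes]


noisy = insert_errors([0, 2, 1, 2, 0] * 200, random_genotype, rate=0.01)
```

Under this reading, the random allele model always turns a homozygous genotype into a heterozygous one, and a heterozygous genotype into one of the two homozygous genotypes.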

24
Experimental Setup
  • Two strategies for handling MIs
      • Set all three individuals to unknown prior to error detection, or
      • Set child only to unknown (preserving parents' original data)
  • Two testing strategies
      • Test one SNP genotype: ViterbiProb-1, ViterbiHaps-1, TotalProb-1
      • Simultaneously test three SNP genotypes at the same locus: ViterbiProb-3, ViterbiHaps-3, TotalProb-3

25
Comparison with FAMHAP (Random Allele Errors)
26
Children vs. Parents (Random Allele Errors)
27
Error Model Comparison (TrioProb-1, Parents)
28
Error Model Comparison (TrioProb-1, Children)
29
TrioProb-1 Results on Real Dataset
  • [Becker et al. 06] resequenced all trio members at 41 loci flagged by FAMHAP-3
      • 23 SNP genotypes were identified as true errors
      • 41 × 3 − 23 = 100 resequenced SNP genotypes agree with original calls
  • Predictive value for R = 10^4 is between 18/26 = 69% and 24/26 = 92%, compared to 23/41 = 56% for FAMHAP-3

30
Pedigree Info vs. Sample Size Effect
31
Unrelated vs. Trio Likelihood Sensitivity
(Figure panels: Unrelated ViterbiProb-1 likelihood ratios, children; Trio ViterbiProb-1 likelihood ratios, children)
32
Combining Likelihood Functions (Children, Random Allele Model)
33
Combining Likelihood Functions (Parents, Random Allele Model)
34
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

35
Conclusion
  • Proposed efficient methods for error detection in trio genotype data based on an HMM model of haplotype diversity
      • Significantly improved detection accuracy compared to FAMHAP
      • High sensitivity even for very low FP rates
      • Runtime linear in the number of SNPs and trios
  • Ongoing work
      • Iterative error detection
          • Fix MIs using likelihood before error detection
          • Correct errors with high likelihood ratio, then recompute likelihood ratios (possibly after re-phasing and HMM re-training)
      • Integration with genotype calling algorithms
          • Combine low-level intensity data with haplotype-based likelihoods
          • Most useful when less pedigree info is available (unrelated, sibling pairs w/o parent genotypes, parents in trios)
      • Locus-specific thresholds, p-values
          • Via simulations similar to [Douglas et al. 00]

36
Questions?