1
Statistical Approaches to Haplotype Inference
  • Jin Jun
  • 11/10/2004

2
Overview
  • Biological background
  • Problem definition
  • Likelihood function and MLE
  • EM method
  • Bayesian inference
  • Gibbs sampler
  • Pseudo-Gibbs sampler
  • Conclusions

3
SNPs
Biological background
  • SNP (Single Nucleotide Polymorphism)
  • a position on a chromosome where individuals in
    the population carry different values
  • Why do chromosomes differ across a population?
  • mutation and recombination

4
Haplotype, Genotype
Biological background
  • Haplotype
  • each individual carries two copies of each
    chromosome, one from each parent
  • the string of SNP values along one copy (e.g.
    CTG...)
  • Genotype
  • description of both chromosomes together
  • conflated data (e.g. {C,A},{T,G}...)

5
Homozygous Heterozygous
Biological background
  • ataggtccCtatttccaggcgcCgtatacttcgacgggActata
  • ataggtccGtatttccaggcgcCgtatacttcgacgggTctata
  • Homozygous SNP the same nucleotide on both
    haplotypes
  • Heterozygous SNP different nucleotides
  • At a SNP, exactly 2 out of the 4 nucleotides
    occur

[Figure: the two haplotypes above conflate into one
genotype; SNP1 and SNP3 are heterozygous, SNP2 is
homozygous]
6
Haplotype Inference (HI)
Biological background
  • Genetic variability between individuals
  • mapping of complex-disease genes
  • identifying genetic variants that influence drug
    response
  • Haplotyping approaches
  • direct observation
  • expensive and time consuming
  • determine haplotypes from genotype data
  • easy to obtain conflated data

7
Notations
Problem Definition
  • haplotype word over {0,1}
  • genotype word over {0,1,2} (2 heterozygous)
  • Haplotype 1 CCA 011
  • Haplotype 2 GCT 110
  • Genotype {C,G} C {A,T} 212

8
Uncertainty
Problem Definition
  • 2^(k-1) pairs of haplotypes can explain a
    genotype (k = number of 2s in g)
  • Haplotype Inference which pair of haplotypes is
    the correct one?

g = 0212 → 2^(2-1) = 2 pairs:
{0110, 0011} and {0010, 0111}
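As a concrete sketch of this counting argument (the function name and code are illustrative, not part of the original slides), the consistent pairs can be enumerated by assigning each heterozygous site's two alleles to the two haplotypes:

```python
from itertools import product

def consistent_pairs(genotype):
    """Enumerate all unordered haplotype pairs consistent with a genotype.

    genotype: a string over {0,1,2}, where '2' marks a heterozygous site.
    For k >= 1 heterozygous sites there are 2**(k-1) distinct pairs.
    """
    het = [i for i, c in enumerate(genotype) if c == "2"]
    pairs = set()
    # For each heterozygous site, choose which haplotype receives the '1'.
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for pos, b in zip(het, bits):
            h1[pos] = b
            h2[pos] = "0" if b == "1" else "1"
        a, b = "".join(h1), "".join(h2)
        pairs.add((min(a, b), max(a, b)))  # unordered pair
    return sorted(pairs)

print(consistent_pairs("0212"))
# → [('0010', '0111'), ('0011', '0110')]
```

This reproduces the two pairs on the slide for g = 0212, and the pair count doubles with each additional heterozygous site.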
9
Biological Basis for Population HI
Problem Definition
  • In real populations, few recombination and
    mutation events occur
  • Parsimony approach minimize the number of
    distinct haplotypes
  • Statistical approach find the most probable
    haplotypes

10
Problem Definition
Problem Definition
  • Haplotype Inference in a Population
  • Given the conflated data (a set of genotypes) for
    a population, find the haplotypes that gave rise
    to this conflated data.
  • Input
  • given genotypes G = {g1, g2, ..., gn}
  • Output
  • the most probable haplotype pair for each
    genotype

11
Population and Sample
Problem Definition
  • Estimate population frequencies of haplotypes
  • Infer the most probable haplotype pair for each
    genotype
  • Fpop R.V. of haplotype frequencies in the
    population

[Figure: population vs. sample]
12
Population and Sample
Problem Definition
  • Sampled genotypes Gsample = {g1, g2, ..., gn}
    from the population
  • Estimate Fpop by Fsample from Gsample
  • Fsample R.V. of haplotype frequencies in the
    sample
  • Gsample R.V. of genotypes in the sample
13
Population and Sample
Problem Definition
  • (Unknown) population probability
  • Sample (of size n)
  • For convenience,

14
Likelihood Function and Maximum Likelihood
Estimation
15
Likelihood Function
Likelihood and MLE
  • Probability (likelihood) of the observed outputs
    as a function of the (unknown) parameters
  • Defined as the joint distribution of the outputs,
    viewed as a function of the parameters
    L(θ) = Pr(Y1, ..., Yn | θ)
  • θ unknown parameters
  • Yi observed outputs, 1 ≤ i ≤ n

16
Maximum Likelihood Estimator
Likelihood and MLE
  • MLE the estimate of the parameters that
    maximizes the likelihood function
  • Obtained by setting the partial derivatives of
    (the logarithm of) the likelihood to zero

17
Properties of MLE
Likelihood and MLE
  • Biased estimator in some cases, e.g.
    Uniform(0, θ) with density f(x) = 1/θ
  • When the likelihood function is monotonic, the
    MLE is found at a boundary point and is biased
  • Asymptotic invariance property
  • The bias can be small and simply corrected
  • As the sample size grows, the variance of the MLE
    approaches the Cramér-Rao bound → good estimator
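A minimal simulation of the boundary-point bias mentioned above, using Uniform(0, θ) with density f(x) = 1/θ (the simulation and its bias correction are illustrative, not from the slides):

```python
import random

random.seed(0)

def mle_uniform_theta(xs):
    # Under Uniform(0, theta), the likelihood (1/theta)**n is monotonically
    # decreasing in theta (for theta >= max(xs)), so the MLE sits at the
    # boundary theta_hat = max(xs) -- systematically too small.
    return max(xs)

theta, n, trials = 1.0, 10, 20000
estimates = [mle_uniform_theta([random.uniform(0, theta) for _ in range(n)])
             for _ in range(trials)]
mean_mle = sum(estimates) / trials          # E[theta_hat] = n/(n+1) * theta
mean_corrected = mean_mle * (n + 1) / n     # the simple bias correction
print(round(mean_mle, 3), round(mean_corrected, 3))
```

The uncorrected average lands near n/(n+1) = 0.909 of the true θ, and the (n+1)/n correction removes the bias, illustrating "the bias can be small and simply corrected."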

18
Likelihood Function of HI
Likelihood and MLE
  • hi unknown haplotype pair underlying gi
  • Hi R.V. over haplotype pairs consistent with gi
19
MLE and its drawback
Likelihood and MLE
  • Maximum Likelihood Estimator (MLE) of F
  • the number of possible haplotypes is too large
    for direct maximization

20
Expectation Maximization Algorithm
21
Problem and Goal
EM algorithm
  • Estimate the unknown parameters (θ), given the
    outputs (Y), with hidden variables (Z) that are
    consistent with the outputs
  • Find the parameters that maximize the posterior
    probability, i.e. θ* = argmax_θ Pr(θ | Y)

22
Ideas and Steps
EM algorithm
  • Instead of finding the best Zi given an estimate
    of the parameters, compute a distribution over
    the space of Z
  • Construct a local lower bound on the posterior
    probability (Expectation step)
  • Optimize the bound, thereby improving the
    estimate (Maximization step)

23
Expectation Step
EM algorithm
  • Maximize the logarithm of the joint distribution
    (proportional to the posterior)
  • Lower bound LB(θ)
  • Compute the expectation of the hidden variables
    under the current parameters and the outputs

24
Maximization Step
EM algorithm
  • Maximize the bound with respect to the current
    parameters (θ^t) and the distribution of the
    hidden variables Pr(Z | θ^t, Y)

25
Algorithm
EM algorithm
  • Given initial values (θ^0)
  • equal frequencies or randomly selected
  • Expectation step
  • calculate the distribution of the hidden
    variables (Z) under θ^t
  • Maximization step
  • find new parameters (θ^(t+1)) that maximize the
    posterior likelihood
  • Repeat E-M steps until convergence
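The E-M loop above can be sketched on a classic toy problem, a mixture of two biased coins where the coin identity plays the role of the hidden variable Z (a hedged illustration; the coin model itself is not from the slides):

```python
import math
import random

random.seed(1)

# Hidden variable Z: which of two biased coins produced each row.
# Observed output Y: number of heads in m flips of that coin.
# Parameters theta = (pA, pB, mix) estimated by EM.
m = 20
true_pA, true_pB, true_mix = 0.8, 0.3, 0.5

def draw_row():
    p = true_pA if random.random() < true_mix else true_pB
    return sum(random.random() < p for _ in range(m))

data = [draw_row() for _ in range(200)]

def binom_pmf(h, p):
    return math.comb(m, h) * p**h * (1 - p) ** (m - h)

pA, pB, mix = 0.6, 0.4, 0.5  # initial guess (theta^0)
for _ in range(100):
    # E-step: distribution of the hidden variable under current theta^t.
    r = [mix * binom_pmf(h, pA)
         / (mix * binom_pmf(h, pA) + (1 - mix) * binom_pmf(h, pB))
         for h in data]
    # M-step: parameters maximizing the expected complete-data likelihood.
    wA = sum(r)
    pA = sum(ri * h for ri, h in zip(r, data)) / (wA * m)
    pB = sum((1 - ri) * h for ri, h in zip(r, data)) / ((len(data) - wA) * m)
    mix = wA / len(data)

print(round(pA, 2), round(pB, 2), round(mix, 2))  # close to 0.8, 0.3, 0.5
```

Each E-M pass never decreases the likelihood, matching the Dempster et al. convergence result cited on the next slide.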

26
Convergence of the algorithm
EM algorithm
  • Dempster et al. (1977)
  • successive iterations (updated parameters) never
    decrease the likelihood (or posterior
    probability)

27
EM for HI
EM algorithm
  • Compute successive sets of haplotype frequencies,
    considering all possible assignments of haplotype
    pairs to each genotype, weighted by their
    relative frequencies
  • Converges to the MLE of the posterior probability
    Pr(G, H | F)

28
Steps and equations
EM algorithm
  • Expectation step
  • Maximization step

29
Algorithm for HI
EM algorithm
  • Given initial values (F^0)
  • all haplotypes equally frequent
  • Expectation step estimate the distribution of Hi
    under F^t
  • Maximization step maximize F given the Hi
  • count the haplotypes in all Hi over all G,
    weighted by probability
  • Repeat until convergence
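A small sketch of this EM for haplotype frequencies, in the style of Excoffier and Slatkin (1995) (the function names and toy data are hypothetical):

```python
from itertools import product

def pairs_for(g):
    """All unordered haplotype pairs over {0,1} consistent with genotype g."""
    het = [i for i, c in enumerate(g) if c == "2"]
    out = set()
    for bits in product("01", repeat=len(het)):
        h1, h2 = list(g), list(g)
        for pos, b in zip(het, bits):
            h1[pos] = b
            h2[pos] = "0" if b == "1" else "1"
        a, b = "".join(h1), "".join(h2)
        out.add((min(a, b), max(a, b)))
    return sorted(out)

def em_haplotype_freqs(genotypes, iters=100):
    """EM for population haplotype frequencies F.

    E-step: weight each consistent pair (a, b) of genotype g by
            F[a]*F[b] (doubled when a != b), normalized over g's pairs.
    M-step: F[h] = expected count of h over all 2n sampled chromosomes.
    """
    cand = {g: pairs_for(g) for g in set(genotypes)}
    haps = sorted({h for ps in cand.values() for p in ps for h in p})
    F = {h: 1.0 / len(haps) for h in haps}  # F^0: equal frequencies
    for _ in range(iters):
        counts = {h: 0.0 for h in haps}
        for g in genotypes:
            w = [F[a] * F[b] * (2 if a != b else 1) for a, b in cand[g]]
            tot = sum(w)
            for (a, b), wi in zip(cand[g], w):
                counts[a] += wi / tot
                counts[b] += wi / tot
        F = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return F

# Toy data: "22" is ambiguous between pairs (00,11) and (01,10); the
# unambiguous genotypes "11" and "00" tip the EM toward the 00/11 pair.
F = em_haplotype_freqs(["22", "22", "22", "11", "00"])
print({h: round(f, 2) for h, f in F.items()})
# → {'00': 0.5, '01': 0.0, '10': 0.0, '11': 0.5}
```

The toy run shows the feedback loop at the heart of the method: frequencies supported by unambiguous genotypes grow, which in turn shifts the expected resolution of the ambiguous genotypes.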

30
Drawbacks
EM algorithm
  • how to reconstruct the haplotype pairs (H), not
    just the frequencies (F)
  • choose the most probable haplotype assignment →
    max Pr(H | F, G)
  • the number of possible haplotype pairs is huge
    (2^(k-1) with k heterozygous loci)

31
Bayesian Inference
  • Gibbs Sampling
  • Pseudo-Gibbs Sampling

32
Solving Steps
Problem Definition
  • Likelihood approaches
  • Estimate the haplotype frequencies in the
    population
  • MLE
  • EM algorithm
  • Infer the haplotype pair for each genotype
  • Select the most probable haplotype pair
  • Bayesian inference
  • Estimate the probability of the haplotype pair
    for each genotype directly
  • Gibbs sampling
  • Pseudo-Gibbs sampling

33
Bayesian estimation
Bayesian Inference
  • posterior Pr(F | G)
  • prior Pr(F)
  • marginal density of G Pr(G), a sum or integral
    over all possible F
  • Bayes' rule Pr(F | G) = Pr(G | F) Pr(F) / Pr(G)
  • find F maximizing Pr(F | G)
  • with hidden data, e.g. the haplotype pairs in HI,
    more sophisticated techniques are needed

34
Structure of Bayesian Inference
Bayesian Inference
  • Approximate the posterior using
  • prior Pr(F)
  • no assumption → Gibbs Sampler
  • coalescent-based → Pseudo-Gibbs Sampler
  • and the sample data G
  • by iteratively selecting one of the possible
    hidden data and estimating the corresponding
    probability → Gibbs Sampling

35
Gibbs sampling I
Bayesian Inference
  • Let Yi, i = 1, ..., k be discrete finite R.V.
    with joint distribution Pr(Y1, Y2, ..., Yk)
  • Define a Markov chain whose states are the
    possible values of the Yi and whose transition
    probabilities are the full conditionals
    Pr(Yi | Y_-i)
  • its stationary distribution is Pr(Y1, Y2, ..., Yk)
  • iterative selection and estimation of the
    conditional probabilities leads to the TRUE
    distribution
  • one of the MCMC (Markov chain Monte Carlo)
    techniques

36
Gibbs sampling II
Bayesian Inference
  • MCMC Markov chain Monte Carlo
  • to sample from a specific probability
    distribution, design a Markov chain whose
    long-run equilibrium is that distribution
  • Convergence
  • Pr(Yi | Y_-i) defines the transition probability
  • if the Markov chain with this transition matrix
    is irreducible and aperiodic → it has the
    stationary distribution Pr(Y1, Y2, ..., Yk)
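A toy Gibbs sampler over two binary variables illustrates the convergence claim: sampling only from the full conditionals Pr(Yi | Y_-i) reproduces the joint distribution (the target joint here is made up purely for illustration):

```python
import random

random.seed(2)

# Target joint distribution over two binary variables. The sampler only
# ever evaluates the full conditionals Pr(Yi | Y_-i); its long-run sample
# frequencies nonetheless converge to this joint (stationary distribution).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def cond_prob_one(i, other):
    """Pr(Yi = 1 | Y_-i = other), derived from the joint."""
    if i == 0:
        p0, p1 = joint[(0, other)], joint[(1, other)]
    else:
        p0, p1 = joint[(other, 0)], joint[(other, 1)]
    return p1 / (p0 + p1)

y = [0, 0]
counts = {k: 0 for k in joint}
burn, n = 1000, 200000
for t in range(burn + n):
    i = t % 2  # update coordinates in turn
    y[i] = 1 if random.random() < cond_prob_one(i, y[1 - i]) else 0
    if t >= burn:
        counts[tuple(y)] += 1

freqs = {k: counts[k] / n for k in joint}
for k in sorted(joint):
    print(k, round(freqs[k], 2), "target", joint[k])
```

Each coordinate update leaves the joint invariant, so the composition of updates does too; after a short burn-in the empirical frequencies track the target, which is exactly the mechanism the Gibbs Sampler for HI relies on.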

37
Application to HI
Gibbs Sampler
  • The process of
  • picking hi for a randomly selected gi
  • and estimating Pr(Hi = hi | H_-i, G)
  • assuming the inferred haplotype pairs (H_-i) for
    all the other genotypes are true
  • → a Markov chain with stationary distribution
    Pr(Hi | G)
  • Pr(Hi = hi | H_-i, G) ∝ Pr(Hi = hi | H_-i)
    = Pr(F = f1 | H_-i) Pr(F = f2 | F = f1, H_-i)
  • where hi = (f1, f2) is consistent with gi

38
1. Prior
Gibbs Sampler
  • parent-independent mutation
  • fraction of fi in H, adjusted for mutation
  • r total number of haplotypes
  • vi probability that fi is a mutant
  • θ mutation rate, usually assumed constant

39
Algorithm
Gibbs Sampler
  • Initial guess H^(0) = h for all genotypes
  • Pick i at random, form the list H_-i
  • Repeat the following until Pr(Hi | G, H_-i^(t))
    converges
  • estimate Hi^(t+1) from Pr(Hi | G, H_-i^(t))
  • Pr(Hi = hi | G, H_-i) = Pr(F = f1 | H_-i)
    Pr(F = f2 | F = f1, H_-i)
  • with the Pr(F = f | H) calculations
  • set Hj^(t+1) = Hj^(t) for j = 1, ..., n, j ≠ i
  • Select the most probable Hj = hj for each gj

40
2. Coalescent-based Prior
Pseudo-Gibbs Sampler
  • the genetic sequence of a mutant offspring will
    differ only slightly from the progenitor sequence
    (often by a single-base change)

Known haplotypes
0110 0011
1110 0011
g = 0212
0010 0111
41
2. Coalescent-based Prior
Pseudo-Gibbs Sampler
  • E a set of a limited number of haplotypes that
    are similar to f
  • P mutation transition matrix from one haplotype
    to haplotype f
  • log2(n) the number of SNPs

42
Conclusions
  • Experimental results
  • PGS dominates the Gibbs Sampler and the EM
    algorithm on coalescent data sets
  • PGS is competitive with the others on data sets
    with recombination, too

43
Conclusions
  • Statistical approaches for HI
  • Maximum Likelihood Estimator
  • EM algorithm
  • Gibbs Sampler
  • Pseudo-Gibbs Sampler (implemented in the PHASE
    software)
44
References
  • Dempster et al., Maximum likelihood from
    incomplete data via the EM algorithm, J. R.
    Stat. Soc. 391-38, 1977.
  • Excoffier and Slatkin, Maximum-Likelihood
    Estimation of Molecular Haplotype Frequencies in
    a Diploid Population, Mol. Bio. Evol.
    12(5)921-927, 1995.
  • Stephens et al., A New Statistical Method for
    Haplotype Reconstruction from Population Data,
    Am. J. Hum. Genet. 68978-989, 2001.
  • Stephens and Donnelly, A Comparison of Bayesian
    Methods for Haplotype Reconstruction from
    Population Genotype Data, Am. J. Hum. Genet.
    731162-1169, 2003.