Haplotype phasing - PowerPoint PPT Presentation

1
Haplotype phasing
  • Lecture 25 December 1, 2005
  • Algorithms in Biosequence Analysis
  • Nathan Edwards - Fall, 2005

2
Haplotypes
  • 1: ACGACTCAGATCACTACGTACGACT
  • 1: ACGACTCAGATAACTACGGACGACT
  • 2: ACGACTCAGATCACTACGTACGACT
  • 2: ACGACTCAGATCACTACGTACGACT
  • 3: ACGAGTCAGATCACTACGTACGACT
  • 3: ACGAGTCAGATAACTACGGACGACT

4
Genotypes
  • 1: ACGA[C/C]TCAGAT[C/A]ACTACG[T/G]ACGACT
  • 2: ACGA[C/C]TCAGAT[C/C]ACTACG[T/T]ACGACT
  • 3: ACGA[G/G]TCAGAT[C/A]ACTACG[T/G]ACGACT

5
Genotyping Technology
  • Can only tell us about a particular locus, so we
    lose information about an individual's haplotypes
  • 1: {C}, {A,C}, {G,T}
  • 2: {C}, {C}, {T}
  • 3: {G}, {A,C}, {G,T}

6
Genotyping Technology
  • Can only tell us about a particular locus, so we
    lose information about an individual's haplotypes
  • 1: 0, 2, 2
  • 2: 0, 1, 1
  • 3: 1, 2, 2
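
A minimal Python sketch of this encoding, assuming 0 and 1 mark sites where both haplotypes agree and 2 marks a heterozygous site. The function name `genotype_from_haplotypes` is illustrative and does not come from the lecture.

```python
def genotype_from_haplotypes(h1, h2):
    """Collapse two 0/1 haplotypes into one 0/1/2 genotype.

    Assumed encoding: the shared allele (0 or 1) where the haplotypes
    agree, and 2 where they differ (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of loci")
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# Two different haplotype pairs give the same genotype, so the phase is lost.
print(genotype_from_haplotypes((0, 0, 0), (0, 1, 1)))  # (0, 2, 2)
print(genotype_from_haplotypes((0, 0, 1), (0, 1, 0)))  # (0, 2, 2)
```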

7
Haplotype phasing
  • Given n genotypes, resolve each genotype into its
    2 haplotypes.
  • Genotypes:
  • 1: 0, 2, 2
  • 2: 0, 1, 1
  • 3: 1, 2, 2
  • Phasing into haplotypes:
  • 1: 0, 0, 0
  • 1: 0, 1, 1
  • 2: 0, 1, 1
  • 2: 0, 1, 1
  • 3: 1, 0, 0
  • 3: 1, 1, 1

8
Haplotype phasing
  • Given n genotypes, resolve each genotype into its
    2 haplotypes.
  • Genotypes:
  • 1: 0, 2, 2
  • 2: 0, 1, 1
  • 3: 1, 2, 2
  • Another phasing of the same genotypes:
  • 1: 0, 0, 1
  • 1: 0, 1, 0
  • 2: 0, 1, 1
  • 2: 0, 1, 1
  • 3: 1, 0, 1
  • 3: 1, 1, 0

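
The two slides above show that one genotype can be resolved in more than one way. As a rough sketch (the helper name `phasings` is hypothetical), all candidate haplotype pairs of a 0/1/2 genotype can be enumerated like this; a genotype with k heterozygous sites has 2^(k-1) unordered phasings for k ≥ 1.

```python
from itertools import product


def phasings(genotype):
    """List every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to 0 on h1 so each unordered pair
    # is produced exactly once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


print(phasings((0, 2, 2)))
# [((0, 0, 0), (0, 1, 1)), ((0, 0, 1), (0, 1, 0))]
```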
9
Clark's Rule
  • Find unambiguous individuals
    (at most 1 ambiguous locus)
  • Form initial list of known haplotypes
  • Resolve ambiguous individuals
    - If possible, use two known haplotypes
    - Otherwise, use one known haplotype and add the new
      haplotype to the list.
  • If unphased individuals remain
    - Assign phase randomly for one individual
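
The steps above can be sketched in Python as follows. This is a simplified reading of Clark's rule, not a reference implementation: the names `complement` and `clark_phase` are made up for illustration, and where the slide assigns a random phase, the sketch picks a fixed arbitrary phase so the result is reproducible.

```python
def complement(genotype, h):
    """Return h2 such that (h, h2) phases genotype, or None if h does not fit."""
    h2 = []
    for g, a in zip(genotype, h):
        if g == 2:
            h2.append(1 - a)   # heterozygous site: partner carries the other allele
        elif g == a:
            h2.append(a)       # homozygous site: both haplotypes agree
        else:
            return None        # h contradicts the genotype at this site
    return tuple(h2)


def clark_phase(genotypes):
    """Phase 0/1/2 genotypes with a simplified version of Clark's rule."""
    known, phase = [], {}

    def record(i, h1, h2):
        phase[i] = (h1, h2)
        for h in (h1, h2):
            if h not in known:
                known.append(h)

    # Step 1: individuals with at most one heterozygous site are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(x == 2 for x in g) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            record(i, h1, complement(g, h1))

    # Step 2: resolve the rest against the list of known haplotypes.
    while len(phase) < len(genotypes):
        progress = False
        for i, g in enumerate(genotypes):
            if i in phase:
                continue
            fits = [(h, complement(g, h)) for h in known]
            fits = [(h, c) for h, c in fits if c is not None]
            if fits:
                # Prefer a resolution whose second haplotype is already known.
                h, c = min(fits, key=lambda hc: hc[1] not in known)
                record(i, h, c)
                progress = True
        if not progress:
            # Assumption: arbitrary (not random) phase for one stuck individual.
            i = next(j for j in range(len(genotypes)) if j not in phase)
            h1 = tuple(0 if x == 2 else x for x in genotypes[i])
            record(i, h1, complement(genotypes[i], h1))
    return [phase[i] for i in range(len(genotypes))]
```

Called on the genotypes `[(0, 2, 2), (0, 1, 1), (1, 2, 2)]`, the sketch starts from the known haplotype 0, 1, 1 of individual 2, resolves individual 1 as 0, 1, 1 and 0, 0, 0, and then has to pick a phase for individual 3 because no known haplotype fits it.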

10
Clark's Rule
  • Initial list of known haplotypes: 0, 1, 1
  • Genotypes:
  • 1: 0, 2, 2
  • 2: 0, 1, 1
  • 3: 1, 2, 2
  • Phasing into haplotypes:
  • 1: 0, 0, 0
  • 1: 0, 1, 1
  • 2: 0, 1, 1
  • 2: 0, 1, 1
  • 3: 1, 0, 0
  • 3: 1, 1, 1

11
Clark's Rule
  • Heuristic: use unambiguous genotypes
  • Doesn't necessarily minimize the number of
    haplotypes
  • Doesn't pay any attention (as such) to the
    haplotype frequency

12
Haplotype frequencies
  • Hardy-Weinberg equilibrium (HWE) says that the
    probability of phase (h, h') for genotype g is
  • $P(g = (h, h')) = 2\,P(h)\,P(h')$ if $h \neq h'$
  • $P(g = (h, h')) = P(h)^2$ if $h = h'$
  • If we knew the haplotype frequencies, $f_h$, we
    would assign the phase of g based on these
    frequencies: $P(h) = p_h = f_h / 2n$
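
As a small illustration of these two formulas, the sketch below computes $p_h = f_h / 2n$ from a set of already phased individuals and then the HWE probability of a phase. The function names are hypothetical; the data is the first phasing shown earlier for individuals 1 to 3.

```python
from collections import Counter


def haplotype_frequencies(phased):
    """p_h = f_h / 2n, where phased is a list of n pairs (h1, h2)."""
    counts = Counter(h for pair in phased for h in pair)
    return {h: c / (2 * len(phased)) for h, c in counts.items()}


def phase_probability(h1, h2, p):
    """HWE probability of the unordered phase (h1, h2) given frequencies p."""
    if h1 == h2:
        return p.get(h1, 0.0) ** 2
    return 2 * p.get(h1, 0.0) * p.get(h2, 0.0)


phased = [
    ((0, 0, 0), (0, 1, 1)),
    ((0, 1, 1), (0, 1, 1)),
    ((1, 0, 0), (1, 1, 1)),
]
p = haplotype_frequencies(phased)
print(p[(0, 1, 1)])                                # 0.5
print(phase_probability((0, 1, 1), (0, 1, 1), p))  # 0.25
```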

13
Estimating haplotype frequencies
  • We'd like to extract the haplotype frequencies
    from our genotype data
  • We'll use an EM algorithm to find
    maximum-likelihood estimates of the haplotype
    frequencies.
  • Find haplotype frequencies $p_h$ so that
    $P(g_1, \ldots, g_n \mid p_h)$ is maximum.

14
Estimating haplotype frequencies
  • If we had haplotype frequencies, but no genotype
    frequencies
  • $P(g) = \sum_{(h,h')\ \text{consistent with}\ g} P(h,h')$
  • Compute $P(h,h')$ from HW (and $p_h$) as before
  • So $P(\text{genotype } g \text{ with phase } (h,h')) = P(\text{phase } (h,h') \mid g)\,P(\text{genotype } g) = \frac{P(h,h')}{P(g)} \times \frac{1}{n}$
  • Use previous estimate of $p_h$ to compute first term

15
Estimating haplotype frequencies
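  • Given P(genotype g with phase (h', h'')), based on
    our estimated $p_h$, we now re-estimate $f_h$.
  • $E[f_h] = \sum_g \sum_{(h',h'')\ \text{consistent with}\ g} \delta(h,h',h'')\,P(h',h'' \mid g)$,
    where $\delta(h,h',h'') \in \{0,1,2\}$ is the number of times
    h occurs in (h', h'')
  • $p_h = E[f_h] / 2n$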

16
Estimating haplotype frequencies
  • This EM algorithm (Excoffier and Slatkin, 1995)
    will converge to (possibly only locally) maximum-likelihood
    estimates of the haplotype frequencies
  • Initial guess for frequencies is not clear
  • $P(h,h' \mid g) = 1 / (\text{number of phasings of } g)$
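
The EM iteration described over the last few slides can be sketched as below. This is an illustration under stated assumptions rather than the Excoffier and Slatkin code: the helper names are hypothetical, the starting point is the uniform guess over the phasings of each genotype, and a fixed number of iterations stands in for a convergence test.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """List every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, iterations=50):
    """Estimate haplotype frequencies p_h from unphased 0/1/2 genotypes."""
    n = len(genotypes)
    options = [phasings(g) for g in genotypes]
    # Initial guess: every phasing of a genotype is equally likely.
    weights = [[1.0 / len(opts)] * len(opts) for opts in options]
    p = {}
    for _ in range(iterations):
        # Re-estimate expected haplotype counts E[f_h], then p_h = E[f_h] / 2n.
        expected = defaultdict(float)
        for opts, ws in zip(options, weights):
            for (h1, h2), w in zip(opts, ws):
                expected[h1] += w
                expected[h2] += w
        p = {h: c / (2 * n) for h, c in expected.items()}
        # Recompute P(phase | g) from the HWE probabilities under the new p_h.
        weights = []
        for opts in options:
            probs = [p[h1] ** 2 if h1 == h2 else 2 * p[h1] * p[h2]
                     for h1, h2 in opts]
            total = sum(probs)
            weights.append([x / total for x in probs])
    return p


p = em_haplotype_frequencies([(0, 2, 2), (0, 1, 1), (1, 2, 2)])
```

Enumerating every phasing of every genotype is what makes this approach exponential in the number of ambiguous sites.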

17
Estimating haplotype frequencies
  • Unfortunately, this approach is exponential in
    the number of ambiguous sites (or number of
    phasings)
  • Many heuristics have been proposed, lots of open
    problems to consider.
  • Unambiguous or partially phased genotypes
    contribute most.

18
Alternative method
  • For a large enough sample, if we assign a new
    genotype to its most likely phase, the haplotype
    frequencies won't change (much).
  • Choose g at random, re-assign phase based on
    haplotype frequencies of the other genotypes' phases.
  • Iterate until stable.
  • Markov chain Monte Carlo (MCMC) approach
  • Estimates maximum-likelihood haplotype
    frequencies too!
  • Variations allow extra population assumptions to
    be built in.
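
A rough sketch of this resampling loop is shown below. Several choices are assumptions made only for illustration: the function names, the fixed number of sweeps in place of a stability test, the seeded random generator, and the small pseudo-count `pseudo` that keeps unseen haplotypes from getting zero weight.

```python
import random
from collections import Counter
from itertools import product


def phasings(genotype):
    """List every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(het, (0,) + bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def resample_phases(genotypes, sweeps=200, pseudo=0.1, seed=0):
    """Repeatedly re-assign one genotype's phase given all the others."""
    rng = random.Random(seed)
    options = [phasings(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]
    for _ in range(sweeps * len(genotypes)):
        i = rng.randrange(len(genotypes))
        # Haplotype counts from every individual except the chosen one.
        counts = Counter(h for j, pair in enumerate(current)
                         if j != i for h in pair)
        weights = []
        for h1, h2 in options[i]:
            w = (counts[h1] + pseudo) * (counts[h2] + pseudo)
            weights.append(w if h1 == h2 else 2 * w)  # HWE-style weighting
        current[i] = rng.choices(options[i], weights=weights)[0]
    return current


genotypes = [(0, 2, 2), (0, 1, 1), (1, 2, 2)]
result = resample_phases(genotypes)
```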

19
Combinatorial Approach
  • There is good reason to use these statistical
    approaches, especially if n is large, and the
    haplotypes are not too rare.
  • However, with smaller samples, or when we have
    more structure, or rarer haplotypes, we might
    consider each phasing carefully.
  • These approaches make the most sense when there
    is additional structure to consider.

20
Perfect Phylogeny
  • Perfect phylogeny
  • all-0 haplotype at the root of tree
  • remaining haplotypes at leaves of tree
  • minor alleles accumulated on path from root to
    leaf
  • each locus is represented by exactly one edge.
    (infinite sites model)

21
Perfect phylogeny
  • all-0 at root
  • otherwise at leaves
  • minor-alleles accumulated on path from root
  • each locus on exactly one edge
  • Implies no evidence of recombination!

[Figure: perfect phylogeny tree with root 00000, edges labeled with loci 1 to 5, and nodes 10000, 10100, 00010, 01010, 01011]

22
Haplotyping by Perfect Phylogeny
  • PPH: Given a set of genotypes, find phasing
    haplotypes that fit a perfect phylogeny.
  • Genotypes:
  • 1: 2, 2
  • 2: 0, 2
  • 3: 1, 0
  • Phasing into haplotypes:
  • 1: 1, 0 and 0, 1
  • 2: 0, 0 and 0, 1
  • 3: 1, 0 and 1, 0

23
Perfect phylogeny
  • 1: 1, 0
  • 1: 0, 1
  • 2: 0, 0
  • 2: 0, 1
  • 3: 1, 0
  • 3: 1, 0
  • This phasing fits a perfect phylogeny!

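[Figure: perfect phylogeny with root 00; the edge for locus 1 leads to haplotype 10 and the edge for locus 2 leads to haplotype 01]
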
24
Haplotyping by Perfect Phylogeny
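  • PPH: Given a set of genotypes, find phasing
    haplotypes that fit a perfect phylogeny.
  • Genotypes:
  • 1: 2, 2
  • 2: 0, 2
  • 3: 1, 0
  • Phasing into haplotypes:
  • 1: 0, 0 and 1, 1
  • 2: 0, 0 and 0, 1
  • 3: 1, 0 and 1, 0
  • No perfect phylogeny is possible for this phasing:
    all four pairs 00, 01, 10, 11 occur.
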
25
4 Gamete Test
  • Format the haplotypes in a matrix, two rows for
    each individual.
  • The haplotypes fit a perfect phylogeny if and
    only if no two columns contain all four pairs
    00, 01, 10, 11.
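
The test as stated on this slide can be sketched in a few lines of Python. The function name is illustrative; `H` is assumed to be a list of equal-length 0/1 rows, two per individual.

```python
from itertools import combinations


def passes_four_gamete_test(H):
    """True if no pair of columns of H contains all of 00, 01, 10, 11."""
    n_cols = len(H[0]) if H else 0
    for a, b in combinations(range(n_cols), 2):
        if len({(row[a], row[b]) for row in H}) == 4:
            return False
    return True


# The two phasings of the genotypes (2,2), (0,2), (1,0) shown earlier.
good = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
bad = [(0, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_four_gamete_test(good))  # True
print(passes_four_gamete_test(bad))   # False
```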

26
Combinatorial Problem
  • Input: matrix G (n rows) with elements from
    {0,1,2}
  • Output: matrix H (2n rows) with elements from {0,1},
    each 2 of G replaced by 0 and 1, such that H
    passes the 4 gamete test.
  • Independently, from 2002:
  • Gusfield
  • Eskin, Halperin, Karp
  • Bafna, Gusfield, Lancia, Yooseph

27
Initial Observations
  • Forced Expansions
  • G has two columns (loci) with these rows:
    2 0
    0 2
  • H must have:
    0 0
    1 0
    0 0
    0 1
  • That is, we know that 0 1 and 1 0 are present in
    this pair of columns

28
Initial Observations
  • Forced Expansions
  • G has two columns (loci) with these rows:
    2 1
    0 2
  • H must have:
    0 1
    1 1
    0 0
    0 1
  • That is, we know that 0 0 and 1 1 are present in
    this pair of columns

29
Initial Observations
  • A row 2 2 can be phased in two ways:
    0 0, 1 1 or 0 1, 1 0.
  • If two columns contain 0 0 and 1 1 already
    (forced or unambiguous), then 2 2 must be phased
    0 0, 1 1.
    - These columns are forced in-phase
  • If two columns contain 0 1 and 1 0 already,
    then 2 2 must be phased 0 1, 1 0
    - These columns are forced out-of-phase

30
Immediate Failure
  • Forced Expansions
  • G has two columns (loci) with these rows:
    2 1
    2 0
  • H must have:
    0 1
    1 1
    0 0
    1 0
  • 4 Gamete condition violated!
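
The forced-expansion observations on the last four slides can be collected into one small routine. The sketch below is an assumption-laden illustration (the function names and the string labels are invented here): it gathers the pairs that any phasing must place in two columns of G, then reports whether the columns are forced in-phase, forced out-of-phase, unforced, or already in conflict.

```python
def forced_gametes(G, a, b):
    """Pairs that every phasing of G must contain in columns a and b."""
    forced = set()
    for row in G:
        x, y = row[a], row[b]
        if x == 2 and y == 2:
            continue  # a 2 2 row forces nothing by itself
        xs = (0, 1) if x == 2 else (x,)
        ys = (0, 1) if y == 2 else (y,)
        forced.update((i, j) for i in xs for j in ys)
    return forced


def column_relation(G, a, b):
    """Classify a column pair from its forced pairs."""
    forced = forced_gametes(G, a, b)
    in_phase = {(0, 0), (1, 1)} <= forced
    out_of_phase = {(0, 1), (1, 0)} <= forced
    if in_phase and out_of_phase:
        return "conflict"       # all four pairs forced: immediate failure
    if in_phase:
        return "in-phase"
    if out_of_phase:
        return "out-of-phase"
    return "unforced"


print(column_relation([(2, 0), (0, 2)], 0, 1))  # out-of-phase
print(column_relation([(2, 1), (0, 2)], 0, 1))  # in-phase
print(column_relation([(2, 1), (2, 0)], 0, 1))  # conflict
```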

31
Combinatorial Algorithm
  • Consider all columns in turn.
    - Do all forced expansions
    - Find all forced in-phase or out-of-phase
      column pair relationships
    - Use invariant column phase pairs
      to fix unknown phase pairs
    - Find a set of column pairs whose
      phase can be arbitrarily set
    - Force the remaining phase pairs

32
Running Example
33
Companion Graph
  • Node: a column of G
  • Edge: two columns that share a row with 2s in both.

34
Companion Graph
  • Red edge: forced in-phase
  • Blue edge: forced out-of-phase
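
A possible way to build the colored companion graph from G is sketched below. The function name and the dictionary representation are assumptions; edges that are neither forced in-phase nor forced out-of-phase are labeled "dashed", matching the dashed edges used on the following slides.

```python
from itertools import combinations


def companion_graph(G):
    """Map each column pair sharing a 2 2 row to 'red', 'blue' or 'dashed'."""
    edges = {}
    n_cols = len(G[0]) if G else 0
    for a, b in combinations(range(n_cols), 2):
        if not any(row[a] == 2 and row[b] == 2 for row in G):
            continue  # no row of 2s in both columns: no edge
        forced = set()
        for row in G:
            x, y = row[a], row[b]
            if x == 2 and y == 2:
                continue
            xs = (0, 1) if x == 2 else (x,)
            ys = (0, 1) if y == 2 else (y,)
            forced.update((i, j) for i in xs for j in ys)
        in_phase = {(0, 0), (1, 1)} <= forced
        out_of_phase = {(0, 1), (1, 0)} <= forced
        if in_phase and out_of_phase:
            raise ValueError(f"columns {a} and {b} violate the 4 gamete condition")
        edges[(a, b)] = "red" if in_phase else "blue" if out_of_phase else "dashed"
    return edges


G = [(2, 2, 2), (0, 2, 0), (1, 1, 0)]
print(companion_graph(G))
# {(0, 1): 'red', (0, 2): 'dashed', (1, 2): 'dashed'}
```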

35
Forced Graph
  • Drop dashed edges and define connected components
  • 3 connected components
  • Fill in each component in turn by coloring dashed
    edges
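
Dropping the dashed edges and finding connected components is a standard union-find exercise. The sketch below assumes the graph is given as a dictionary from column pairs to the labels "red", "blue" or "dashed"; that representation and the function name are illustrative choices.

```python
def forced_components(n_cols, edges):
    """Connected components of the graph that keeps only forced edges."""
    parent = list(range(n_cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (a, b), color in edges.items():
        if color != "dashed":
            parent[find(a)] = find(b)

    groups = {}
    for c in range(n_cols):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


edges = {(0, 1): "red", (0, 2): "dashed", (1, 2): "dashed"}
print(forced_components(3, edges))  # [[0, 1], [2]]
```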

36
Phase parity lemma
  • There is a PP solution for G if and only if the
    dashed edges can be colored so that
  • For every triangle in the companion graph with
    at least one dashed edge, either 0 or 2 edges
    are colored blue.

37
Weak Triangulation Rule
  • If any dashed edges have both ends in the same
    component of the forced graph, at least one must
    be in a triangle with two colored edges.
  • Color this edge appropriately
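
Since a triangle with a dashed edge must end up with 0 or 2 blue edges, the color of the third edge follows from the other two: blue if they differ, red if they match. The sketch below applies that repeatedly. It is a simplification, with an invented function name and an assumed dictionary representation keyed by sorted column pairs; it colors a dashed edge in any triangle whose other two edges are colored and does not check triangles that are already fully colored.

```python
from itertools import combinations


def apply_triangle_rule(edges):
    """Color dashed edges that sit in a triangle with two colored edges."""
    edges = dict(edges)
    nodes = sorted({v for e in edges for v in e})
    changed = True
    while changed:
        changed = False
        for a, b, c in combinations(nodes, 3):
            tri = [(a, b), (a, c), (b, c)]
            if not all(e in edges for e in tri):
                continue
            dashed = [e for e in tri if edges[e] == "dashed"]
            if len(dashed) == 1:
                c1, c2 = (edges[e] for e in tri if e != dashed[0])
                # 0 or 2 blue edges: the third edge is blue iff the others differ.
                edges[dashed[0]] = "blue" if c1 != c2 else "red"
                changed = True
    return edges


print(apply_triangle_rule({(0, 1): "red", (0, 2): "blue", (1, 2): "dashed"}))
# {(0, 1): 'red', (0, 2): 'blue', (1, 2): 'blue'}
```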

38
Weak Triangulation Rule
39
Corollary
  • For each connected component of the forced graph,
    all the edge relationships are uniquely
    determined.
  • Phase relationships of all columns in a connected
    component of the forced graph are invariant for
    all solutions that satisfy PP

40
Phase Parity Lemma
  • Consider two columns with rows:
    2 x
    y 2
    2 2
  • If x ≠ 2 and y ≠ 2, for example
    2 0
    1 2
    2 2
  • then the columns are forced in-phase or out-of-phase.

41
Phase parity lemma
  • Lemma: If a triangle contains a dashed edge, then
    a PP solution exists only if there are 0 or 2
    blue edges in the coloring.
  • Columns A, B, C with rows:
    A B C
    2 2 y
    x 2 2
    2 z 2
  • Must have x = 2, y = 2 or z = 2, or there is no dashed edge.
  • A row of all 2s must have an even number of blue edges.

42
Weak Triangulation Rule
  • Theorem
  • If any dashed edges have both ends in the same
    component of the forced graph, at least one must
    be in a triangle with two colored edges.
  • In a forced graph component, we can find a path from any
    node to any other, using forced edges.
  • Find a cycle of length > 3 made of one unforced (dashed)
    edge and forced edges

43
Weak Triangulation Rule
  • Let (J, J') be a dashed edge connecting a path of > 2
    forced edges J, K, ..., K', J'
  • Columns and rows:
    ... K  J  J' K' ...
        2  2  x
           y  2  2
           2  2
  • If x = 2, then (K, J') is an edge
  • If y = 2, then (J, K') is an edge
  • If x ≠ 2 and y ≠ 2, then (J, J') is forced.

44
Finishing up
  • We still need to color the dashed edges between
    components of the forced graph.
  • If there are k components, select (k-1) edges
    that form a spanning tree on the component graph
  • Color these edges arbitrarily. Remaining dashed
    edges will be forced by weak triangulation rule.

45
Corollary
  • If the companion graph has r connected components,
    and the forced graph has t connected components,
    there are exactly $2^{t-r}$ perfect phylogeny
    phasings for input matrix G.
  • PP solution is unique iff each component of
    companion graph is connected in forced graph.
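
The count in this corollary is easy to compute once both graphs are known. The sketch below assumes that at least one PP solution exists and that the companion graph is given as a dictionary from column pairs to "red", "blue" or "dashed"; the function names are hypothetical.

```python
def count_components(n_cols, edge_list):
    """Number of connected components on n_cols nodes with the given edges."""
    parent = list(range(n_cols))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edge_list:
        parent[find(a)] = find(b)
    return len({find(c) for c in range(n_cols)})


def count_pp_phasings(n_cols, edges):
    """2^(t - r): r companion-graph components, t forced-graph components."""
    r = count_components(n_cols, list(edges))
    t = count_components(n_cols, [e for e, col in edges.items() if col != "dashed"])
    return 2 ** (t - r)


# Companion graph of G = [(2, 2, 2), (0, 2, 0), (1, 1, 0)]: r = 1, t = 2.
edges = {(0, 1): "red", (0, 2): "dashed", (1, 2): "dashed"}
print(count_pp_phasings(3, edges))  # 2
```

For this G the two solutions differ only in how the third column of the 2 2 2 row is phased relative to the first two.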

46
Summary
  • If there is no (evidence of) recombination under the
    infinite sites model, haplotypes will fit a
    perfect phylogeny
  • $O(ns^2)$ to test for PP, determine if the solution is
    unique, represent all possible solutions, and
    generate one solution
  • Others are pushing towards $O(ns)$
  • Can we relax the no recombination requirement?