Title: Haplotype phasing
1Haplotype phasing
- Lecture 25 December 1, 2005
- Algorithms in Biosequence Analysis
- Nathan Edwards - Fall, 2005
2Haplotypes
- 1: ACGACTCAGATCACTACGTACGACT
- 1: ACGACTCAGATAACTACGGACGACT
- 2: ACGACTCAGATCACTACGTACGACT
- 2: ACGACTCAGATCACTACGTACGACT
- 3: ACGAGTCAGATCACTACGTACGACT
- 3: ACGAGTCAGATAACTACGGACGACT
4Genotypes
- ACGA[CC]TCAGAT[CA]ACTACG[TG]ACGACT
- ACGA[CC]TCAGAT[CC]ACTACG[TT]ACGACT
- ACGA[GG]TCAGAT[CA]ACTACG[TG]ACGACT
5Genotyping Technology
- Can only tell us about a particular locus, so we lose information about an individual's haplotypes
- 1: {C}, {A,C}, {G,T}
- 2: {C}, {C}, {T}
- 3: {G}, {A,C}, {G,T}
6Genotyping Technology
- Can only tell us about a particular locus, so we lose information about an individual's haplotypes
- Encoding: 0 or 1 = homozygous for that allele, 2 = heterozygous
- 1: 0, 2, 2
- 2: 0, 1, 1
- 3: 1, 2, 2
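The 0/1/2 genotype encoding, where 2 marks a heterozygous locus, can be sketched in a few lines. This is a minimal illustration that assumes haplotypes are tuples of 0/1 alleles; the function name `genotype_from_haplotypes` is hypothetical and not from the lecture.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous)."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same loci")
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


# Two different haplotype pairs collapse to the same genotype,
# which is exactly the information that genotyping loses.
g_a = genotype_from_haplotypes((0, 0, 0), (0, 1, 1))
g_b = genotype_from_haplotypes((0, 0, 1), (0, 1, 0))
print(g_a, g_b)  # (0, 2, 2) (0, 2, 2)
```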
7Haplotype phasing
- Given n genotypes, resolve each genotype into its 2 haplotypes.
- Genotypes:
- 1: 0, 2, 2
- 2: 0, 1, 1
- 3: 1, 2, 2
- Phasing:
- 1: 0, 0, 0
- 1: 0, 1, 1
- 2: 0, 1, 1
- 2: 0, 1, 1
- 3: 1, 0, 0
- 3: 1, 1, 1
8Haplotype phasing
- Given n genotypes, resolve each genotype into its 2 haplotypes.
- Genotypes:
- 1: 0, 2, 2
- 2: 0, 1, 1
- 3: 1, 2, 2
- Phasing (an alternative resolution of the same genotypes):
- 1: 0, 0, 1
- 1: 0, 1, 0
- 2: 0, 1, 1
- 2: 0, 1, 1
- 3: 1, 0, 1
- 3: 1, 1, 0
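Since one genotype can admit several phasings, it may help to enumerate them explicitly. The sketch below assumes genotypes are tuples over 0/1/2; the helper name `phasings` is hypothetical. A genotype with k ≥ 1 heterozygous loci has 2^(k-1) unordered phasings.

```python
from itertools import product


def phasings(genotype):
    """Return every unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    first, rest = het[0], het[1:]
    pairs = []
    # Fixing the first heterozygous locus as 0/1 avoids listing
    # both (h, h') and (h', h).
    for bits in product((0, 1), repeat=len(rest)):
        h1, h2 = list(genotype), list(genotype)
        h1[first], h2[first] = 0, 1
        for i, b in zip(rest, bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


for pair in phasings((0, 2, 2)):
    print(pair)
# ((0, 0, 0), (0, 1, 1))
# ((0, 0, 1), (0, 1, 0))
```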
9Clark's Rule
- Find unambiguous individuals
- (at most 1 ambiguous, i.e. heterozygous, locus)
- Form initial list of known haplotypes
- Resolve ambiguous individuals
- If possible, use two known haplotypes
- Otherwise, use one known haplotype and add the new haplotype to the list.
- If unphased individuals remain
- Assign phase randomly for one individual
10Clark's Rule
- Initial list of known haplotypes: 0, 1, 1
- Genotypes:
- 1: 0, 2, 2
- 2: 0, 1, 1
- 3: 1, 2, 2
- Phasing:
- 1: 0, 0, 0
- 1: 0, 1, 1
- 2: 0, 1, 1
- 2: 0, 1, 1
- 3: 1, 0, 0
- 3: 1, 1, 1
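The steps of Clark's rule can be sketched as follows. This is a simplified illustration, not a reference implementation: the names `complement` and `clark` are hypothetical, ties are broken by list order, and the final "assign phase randomly" step is deliberately left out, so individuals that no known haplotype can explain stay unphased.

```python
def complement(genotype, h):
    """Return h2 such that (h, h2) phases the genotype, or None if h does not fit."""
    h2 = []
    for g, a in zip(genotype, h):
        if g == 2:
            h2.append(1 - a)
        elif g == a:
            h2.append(a)
        else:
            return None
    return tuple(h2)


def clark(genotypes):
    known = []   # known haplotypes, in discovery order
    phase = {}   # individual index -> (h1, h2)
    # Step 1: individuals with at most one heterozygous locus are unambiguous.
    for i, g in enumerate(genotypes):
        if sum(x == 2 for x in g) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = complement(g, h1)
            phase[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: resolve ambiguous individuals using known haplotypes.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phase:
                continue
            fits = [(h, complement(g, h)) for h in known]
            fits = [(h, c) for h, c in fits if c is not None]
            if not fits:
                continue
            # Prefer a resolution that uses two known haplotypes.
            both_known = [(h, c) for h, c in fits if c in known]
            h1, h2 = (both_known or fits)[0]
            phase[i] = (h1, h2)
            if h2 not in known:
                known.append(h2)
            progress = True
    return phase, known


phase, known = clark([(0, 2, 2), (0, 1, 1), (1, 2, 2)])
print(phase)
print(known)
```

On the three example genotypes, individual 2 seeds the list with 0, 1, 1, which resolves individual 1; individual 3 matches no known haplotype and would need the random-assignment step.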
11Clarks Rule
- Heuristic: use unambiguous genotypes
- Doesn't necessarily minimize the number of haplotypes
- Doesn't pay any attention (as such) to the haplotype frequency
12Haplotype frequencies
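- HWE (Hardy-Weinberg equilibrium) says that the probability of phase $(h, h')$ for genotype $g$ is
- $P(g = (h, h')) = 2\,P(h)\,P(h')$ if $h \ne h'$
- $P(g = (h, h')) = P(h)^2$ if $h = h'$
- If we knew the haplotype frequencies, $f_h$ (the number of times $h$ occurs among the $2n$ haplotypes), we would assign the phase of $g$ based on these frequencies: $P(h) = p_h = f_h / 2n$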
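A small worked example of the Hardy-Weinberg phase probability, using exact fractions. The function names are hypothetical, and the haplotype list is one phasing of the three example individuals used earlier in the lecture.

```python
from collections import Counter
from fractions import Fraction


def haplotype_frequencies(haplotypes):
    """p_h = f_h / 2n, where the list holds all 2n phased haplotypes."""
    counts = Counter(haplotypes)
    total = len(haplotypes)  # this is 2n
    return {h: Fraction(c, total) for h, c in counts.items()}


def phase_probability(h1, h2, p):
    """Hardy-Weinberg probability of the unordered phase (h1, h2)."""
    if h1 == h2:
        return p[h1] ** 2
    return 2 * p[h1] * p[h2]


H = [(0, 0, 0), (0, 1, 1), (0, 1, 1), (0, 1, 1), (1, 0, 0), (1, 1, 1)]
p = haplotype_frequencies(H)
print(phase_probability((0, 0, 0), (0, 1, 1), p))  # 1/6
print(phase_probability((0, 1, 1), (0, 1, 1), p))  # 1/4
```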
13Estimating haplotype frequencies
- We'd like to extract the haplotype frequencies from our genotype data
- We'll use an EM algorithm to find maximum-likelihood estimates of the haplotype frequencies.
- Find haplotype frequencies $p_h$ so that $P(g_1, \ldots, g_n \mid p_h)$ is maximum.
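14Estimating haplotype frequencies
- If we had haplotype frequencies, but no genotype frequencies
- $P(g) = \sum_{(h,h') \to g} P(h,h')$, summing over the haplotype pairs consistent with $g$
- Compute $P(h,h')$ from HW (and $p_h$) as before
- So $P(\text{genotype } g \text{ with phase } (h,h')) = P(\text{phase } (h,h') \mid g) \times P(\text{genotype } g) = \frac{P(h,h')}{P(g)} \times \frac{1}{n}$
- Use previous estimate of $p_h$ to compute first term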
15Estimating haplotype frequencies
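- Given the probability that genotype $g$ has phase $(h', h'')$, based on our estimated $p_h$, we now re-estimate $f_h$.
- $E[f_h] = \sum_g \sum_{(h',h'') \to g} \delta(h, h', h'')\, P_g(h', h'')$, where $P_g(h', h'')$ is the probability that $g$ has phase $(h', h'')$ and $\delta(h, h', h'')$ is 0, 1, or 2: the number of times $h$ occurs in $(h', h'')$
- $p_h = E[f_h] / 2n$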
16Estimating haplotype frequencies
- This EM algorithm (Excoffier and Slatkin, 1995) will converge to maximum-likelihood estimates of the haplotype frequencies, though possibly only to a local maximum
- Initial guess for frequencies is not clear
- $P_g(h, h') = 1 / (\text{number of phasings of } g)$
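The EM iteration described on the preceding slides can be sketched as below. This is a minimal illustration under stated assumptions, not the Excoffier and Slatkin implementation: the function names are hypothetical, the initial phase weights are uniform over the phasings of each genotype, and a fixed iteration count stands in for a convergence test.

```python
from collections import defaultdict
from itertools import product


def phasings(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        h1[het[0]], h2[het[0]] = 0, 1
        for i, b in zip(het[1:], bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def em_haplotype_frequencies(genotypes, iterations=50):
    n = len(genotypes)
    options = [phasings(g) for g in genotypes]
    # Initial guess: every phasing of a genotype is equally likely.
    weights = [[1.0 / len(opts)] * len(opts) for opts in options]
    p = {}
    for _ in range(iterations):
        # Re-estimate expected haplotype counts E[f_h], then p_h = E[f_h] / 2n.
        counts = defaultdict(float)
        for opts, ws in zip(options, weights):
            for (h1, h2), w in zip(opts, ws):
                counts[h1] += w
                counts[h2] += w
        p = {h: c / (2 * n) for h, c in counts.items()}
        # Recompute phase probabilities for each genotype from Hardy-Weinberg.
        weights = []
        for opts in options:
            raw = [p[a] ** 2 if a == b else 2 * p[a] * p[b] for a, b in opts]
            total = sum(raw)
            weights.append([r / total for r in raw])
    return p


p = em_haplotype_frequencies([(0, 2, 2), (0, 1, 1), (1, 2, 2)])
for h, freq in sorted(p.items()):
    print(h, round(freq, 4))
```

On this tiny input the first genotype is pulled towards the phasing that reuses 0, 1, 1, while the third genotype stays split evenly between its two phasings because nothing breaks the tie, which illustrates why the initial guess matters.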
17Estimating haplotype frequencies
- Unfortunately, this approach is exponential in the number of ambiguous sites (or number of phasings)
- Many heuristics have been proposed, lots of open problems to consider.
- Unambiguous or partially phased genotypes contribute most.
18Alternative method
- For a large enough sample, if we assign a new genotype to its most likely phase, the haplotype frequencies won't change (much).
- Choose g at random, re-assign its phase based on the haplotype frequencies of the other genotypes' phases.
- Iterate until stable.
- Markov chain Monte Carlo (MCMC) approach
- Estimates maximum-likelihood haplotype frequencies too!
- Variations allow extra population assumptions to be built in.
19Combinatorial Approach
- There is good reason to use these statistical approaches, especially if n is large, and the haplotypes are not too rare.
- However, with smaller samples, or when we have more structure, or rarer haplotypes, we might consider each phasing carefully.
- These approaches make the most sense when there is additional structure to consider.
20Perfect Phylogeny
- Perfect phylogeny
- all-0 haplotype at the root of tree
- remaining haplotypes at leaves of tree
- minor alleles accumulated on path from root to leaf
- each locus is represented by exactly one edge (infinite sites model)
21Perfect phylogeny
- all-0 at root
- otherwise at leaves
- minor-alleles accumulated on path from root
- each locus on exactly one edge
- Implies no evidence of recombination!
[Figure: perfect phylogeny tree rooted at 00000, with edges labeled by loci 1 to 5 and nodes 10100, 00010, 10000, 01011, 01010.]
22Haplotyping by Perfect Phylogeny
- PPH: Given a set of genotypes, find a phasing whose haplotypes fit a perfect phylogeny.
- Genotypes:
- 1: 2, 2
- 2: 0, 2
- 3: 1, 0
- Phasing:
- 1: 1, 0
- 1: 0, 1
- 2: 0, 0
- 2: 0, 1
- 3: 1, 0
- 3: 1, 0
23Perfect phylogeny
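- 1: 1, 0
- 1: 0, 1
- 2: 0, 0
- 2: 0, 1
- 3: 1, 0
- 3: 1, 0
- This phasing fits a perfect phylogeny!
[Figure: perfect phylogeny tree rooted at 00, with edges labeled 1 and 2 and the six haplotypes 00, 10, 01, 01, 10, 10 attached.]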
24Haplotyping by Perfect Phylogeny
- PPH: Given a set of genotypes, find a phasing whose haplotypes fit a perfect phylogeny.
- Genotypes:
- 1: 2, 2
- 2: 0, 2
- 3: 1, 0
- Phasing:
- 1: 0, 0
- 1: 1, 1
- 2: 0, 0
- 2: 0, 1
- 3: 1, 0
- 3: 1, 0
- With this phasing, no perfect phylogeny is possible!
25Four-Gamete Test
- Format the haplotypes in a matrix, two rows for each individual.
- The haplotypes fit a perfect phylogeny if and only if no two columns contain all four pairs 00, 01, 10, 11.
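The four-gamete test is straightforward to check directly. The sketch below assumes the haplotype matrix is a list of equal-length 0/1 tuples; the function name is hypothetical. It examines every column pair, so it runs in time quadratic in the number of columns.

```python
from itertools import combinations


def passes_four_gamete_test(H):
    """True if no pair of columns of H shows all of 00, 01, 10, 11."""
    if not H:
        return True
    for i, j in combinations(range(len(H[0])), 2):
        seen = {(row[i], row[j]) for row in H}
        if len(seen) == 4:
            return False
    return True


# The two phasings of the small example with genotypes (2,2), (0,2), (1,0).
good = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
bad = [(0, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 0)]
print(passes_four_gamete_test(good))  # True
print(passes_four_gamete_test(bad))   # False
```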
26Combinatorial Problem
- Input: matrix G (n rows) with elements from {0, 1, 2}
- Output: matrix H (2n rows) with elements from {0, 1}, where each 2 of G is replaced by 0 in one of the individual's two rows and 1 in the other, such that H passes the 4 gamete test.
- Independently, from 2002:
- Gusfield
- Eskin, Halperin, Karp
- Bafna, Gusfield, Lancia, Yooseph
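To make the input/output specification concrete, here is a brute-force sketch that tries every phasing of G and keeps those whose haplotype matrix passes the four-gamete test. It is exponential in the number of ambiguous sites and is only meant to illustrate the problem statement; it is not the algorithm of any of the papers listed above, and all function names are hypothetical.

```python
from itertools import combinations, product


def phasings(genotype):
    """All unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        h1[het[0]], h2[het[0]] = 0, 1
        for i, b in zip(het[1:], bits):
            h1[i], h2[i] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def passes_four_gamete_test(H):
    """True if no pair of columns of H shows all of 00, 01, 10, 11."""
    for i, j in combinations(range(len(H[0])), 2):
        if len({(row[i], row[j]) for row in H}) == 4:
            return False
    return True


def pph_brute_force(G):
    """Return every haplotype matrix H (2n rows) that phases G and passes the test."""
    solutions = []
    for choice in product(*(phasings(g) for g in G)):
        H = [h for pair in choice for h in pair]
        if passes_four_gamete_test(H):
            solutions.append(H)
    return solutions


solutions = pph_brute_force([(2, 2), (0, 2), (1, 0)])
print(len(solutions))  # 1
print(solutions[0])    # [(0, 1), (1, 0), (0, 0), (0, 1), (1, 0), (1, 0)]
```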
27Initial Observations
- Forced Expansions
- G has two columns (loci) with these rows
- 2 0
- 0 2
- H must have
- 0 0
- 1 0
- 0 0
- 0 1
- That is, we know that 0 1 and 1 0 are present in this pair of columns
28Initial Observations
- Forced Expansions
- G has two columns (loci) with these rows
- 2 1
- 0 2
- H must have
- 0 1
- 1 1
- 0 0
- 0 1
- That is, we know that 0 0 and 1 1 are present in this pair of columns
29Initial Observations
- 2 2 can be phased in two ways:
- 0 0, 1 1 and 0 1, 1 0.
- If two columns contain 0 0 and 1 1 already (forced or unambiguous), then 2 2 must be phased 0 0, 1 1.
- These columns are forced in-phase
- If two columns contain 0 1 and 1 0 already, then 2 2 must be phased 0 1, 1 0
- These columns are forced out-of-phase
30Immediate Failure
- Forced Expansions
- G has two columns (loci) with these rows
- 2 1
- 2 0
- H must have
- 0 1
- 1 1
- 0 0
- 1 0
- 4 Gamete condition violated!
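The forced-expansion reasoning for one pair of columns can be sketched as below. The names `forced_gametes` and `classify_pair` are hypothetical; each column is given as a tuple of its entries down the rows of G, and rows that read 2 2 are skipped because their phase is what we are trying to determine.

```python
def forced_gametes(col_a, col_b):
    """Gametes (pairs of alleles) that must appear in H for these two columns."""
    seen = set()
    for a, b in zip(col_a, col_b):
        if a == 2 and b == 2:
            continue  # ambiguous row: contributes no forced gamete
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                seen.add((x, y))
    return seen


def classify_pair(col_a, col_b):
    seen = forced_gametes(col_a, col_b)
    if len(seen) == 4:
        return "fail"          # four-gamete condition already violated
    if {(0, 0), (1, 1)} <= seen:
        return "in-phase"      # any 2 2 row must be phased 0 0, 1 1
    if {(0, 1), (1, 0)} <= seen:
        return "out-of-phase"  # any 2 2 row must be phased 0 1, 1 0
    return "unforced"


print(classify_pair((2, 0), (0, 2)))  # out-of-phase
print(classify_pair((2, 0), (1, 2)))  # in-phase
print(classify_pair((2, 2), (1, 0)))  # fail
```

The three calls correspond to the three forced-expansion examples: rows 2 0 / 0 2, rows 2 1 / 0 2, and rows 2 1 / 2 0.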
31Combinatorial Algorithm
- Consider all columns in turn.
- Do all forced expansions
- Find all forced in-phase or out-of-phase column pair relationships
- Use invariant column phase pairs to fix unknown phase pairs
- Find a set of column pairs whose phase can be arbitrarily set
- Force the remaining phase pairs
32Running Example
33Companion Graph
- Node: a column of G
- Edge: two columns that both have a 2 in some row.
34Companion Graph
- Red Edge
- forced in-phase
- Blue Edge
- forced out-of-phase
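A sketch of building the companion graph with colored edges. It assumes G is a list of 0/1/2 tuples, and it labels an edge "dashed" when neither relationship is forced, which is an interpretation of the dashed edges used on the following slides. The function name `companion_graph` is hypothetical.

```python
from itertools import combinations


def companion_graph(G):
    """Map each column pair sharing a 2 2 row to 'red', 'blue', or 'dashed'."""
    edges = {}
    for a, b in combinations(range(len(G[0])), 2):
        if not any(row[a] == 2 and row[b] == 2 for row in G):
            continue  # no row of 2s: the pair is not an edge
        seen = set()
        for row in G:
            if row[a] == 2 and row[b] == 2:
                continue
            xs = (0, 1) if row[a] == 2 else (row[a],)
            ys = (0, 1) if row[b] == 2 else (row[b],)
            seen.update((x, y) for x in xs for y in ys)
        if len(seen) == 4:
            raise ValueError("four-gamete condition violated; no solution")
        if {(0, 0), (1, 1)} <= seen:
            edges[(a, b)] = "red"     # forced in-phase
        elif {(0, 1), (1, 0)} <= seen:
            edges[(a, b)] = "blue"    # forced out-of-phase
        else:
            edges[(a, b)] = "dashed"  # not forced
    return edges


print(companion_graph([(2, 2), (0, 2), (1, 0)]))  # {(0, 1): 'blue'}
print(companion_graph([(2, 2, 0), (0, 0, 2)]))    # {(0, 1): 'dashed'}
```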
35Forced Graph
- Drop dashed (unforced) edges and define connected components
- 3 connected components
- Fill in each component in turn by coloring dashed edges
36Phase parity lemma
- There is a PP solution for G if and only if the dashed edges can be colored so that
- For every triangle in the companion graph with at least one dashed edge, either 0 or 2 edges are colored blue.
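The triangle condition of the lemma can be checked mechanically once every edge has a color. This sketch assumes edges are stored as `frozenset` pairs mapped to "red" or "blue", with a separate set recording which edges were originally dashed; the function name and data layout are hypothetical.

```python
from itertools import combinations


def parity_ok(nodes, color, dashed):
    """Check that each triangle with a dashed edge has 0 or 2 blue edges."""
    for a, b, c in combinations(nodes, 3):
        triangle = [frozenset(e) for e in ((a, b), (b, c), (a, c))]
        if not all(e in color for e in triangle):
            continue  # not a triangle of the companion graph
        if not any(e in dashed for e in triangle):
            continue  # the lemma only constrains triangles with a dashed edge
        blue = sum(color[e] == "blue" for e in triangle)
        if blue not in (0, 2):
            return False
    return True


color = {
    frozenset((0, 1)): "red",
    frozenset((1, 2)): "blue",
    frozenset((0, 2)): "blue",  # originally dashed, now colored
}
dashed = {frozenset((0, 2))}
print(parity_ok([0, 1, 2], color, dashed))  # True
```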
37Weak Triangulation Rule
- If any dashed edges have both ends in the same component of the forced graph, at least one must be in a triangle with two colored edges.
- Color this edge appropriately
38Weak Triangulation Rule
39Corollary
- For each connected component of the forced graph, all the edge relationships are uniquely determined.
- Phase relationships of all columns in a connected component of the forced graph are invariant for all solutions that satisfy PP
40Phase Parity Lemma
- If x ≠ 2 and y ≠ 2 (here x = 0, y = 1)
- 2 0
- 1 2
- 2 2
- then the columns are forced in-phase or out-of-phase.
41Phase parity lemma
- Lemma: If a triangle contains a dashed edge, then a PP solution exists only if there are 0 or 2 blue edges in the coloring.
- A B C
- 2 2 y
- x 2 2
- 2 z 2
- Must have x = 2, y = 2, or z = 2, or there is no dashed edge.
- A row of all 2s must have an even number of blue edges.
42Weak Triangulation Rule
- Theorem
- If any dashed edges have both ends in the same component of the forced graph, at least one must be in a triangle with two colored edges.
- In a forced graph component, we can find a path from any node to any other, using forced edges.
- Find a cycle of one unforced (dashed) edge and forced edges, of length > 3
43Weak Triangulation Rule
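- Let (J, J') be a dashed edge connecting a path of > 2 forced edges J, K, ..., K', J'
- Columns: ... K J J' K' ...
- Row with 2s in K and J: 2 2 x (x is the entry in column J')
- Row with 2s in J' and K': y 2 2 (y is the entry in column J)
- Row with 2s in J and J': 2 2
- If x = 2, then (K, J') is an edge
- If y = 2, then (J, K') is an edge
- If x ≠ 2 and y ≠ 2, then (J, J') is forced.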
44Finishing up
- We still need to color the dashed edges between components of the forced graph.
- If there are k components, select (k-1) edges that form a spanning tree on the component graph
- Color these edges arbitrarily. Remaining dashed edges will be forced by the weak triangulation rule.
45Corollary
- If the companion graph has r connected components, and the forced graph has t connected components, there are exactly $2^{t-r}$ perfect phylogeny phasings for input matrix G.
- The PP solution is unique iff each component of the companion graph is connected in the forced graph.
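The counting corollary can be illustrated with a small union-find sketch. It assumes that a perfect phylogeny solution exists (the formula says nothing otherwise), that isolated columns count as their own components, and that an edge is forced when the rows other than 2 2 already show both 0 0 and 1 1, or both 0 1 and 1 0. All function names are hypothetical.

```python
from itertools import combinations


def count_components(m, edges):
    """Number of connected components on nodes 0..m-1 (union-find)."""
    parent = list(range(m))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    return len({find(x) for x in range(m)})


def count_pp_phasings(G):
    """2**(t - r), assuming G has at least one perfect phylogeny phasing."""
    m = len(G[0])
    companion, forced = [], []
    for a, b in combinations(range(m), 2):
        if not any(row[a] == 2 and row[b] == 2 for row in G):
            continue
        companion.append((a, b))
        seen = set()
        for row in G:
            if row[a] == 2 and row[b] == 2:
                continue
            xs = (0, 1) if row[a] == 2 else (row[a],)
            ys = (0, 1) if row[b] == 2 else (row[b],)
            seen.update((x, y) for x in xs for y in ys)
        if {(0, 0), (1, 1)} <= seen or {(0, 1), (1, 0)} <= seen:
            forced.append((a, b))
    r = count_components(m, companion)
    t = count_components(m, forced)
    return 2 ** (t - r)


print(count_pp_phasings([(2, 2), (0, 2), (1, 0)]))  # 1
print(count_pp_phasings([(2, 2)]))                  # 2
```

The first input is the small three-individual example, whose solution is unique; the second is a single genotype 2 2, which can be phased either in-phase or out-of-phase.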
46Summary
- If there is no (evidence of) recombination under the infinite sites model, haplotypes will fit a perfect phylogeny
- $O(ns^2)$ to test for PP, determine if the solution is unique, represent all possible solutions, and generate one solution
- Others are pushing towards $O(ns)$
- Can we relax the no-recombination requirement?