Haplotype phasing

Slides: 47

Provided by: nathanjoh

Transcript and Presenter's Notes

1
Haplotype phasing

Lecture 25 December 1, 2005
Algorithms in Biosequence Analysis
Nathan Edwards - Fall, 2005

2
Haplotypes

1 ACGACTCAGATCACTACGTACGACT
1 ACGACTCAGATAACTACGGACGACT
2 ACGACTCAGATCACTACGTACGACT
2 ACGACTCAGATCACTACGTACGACT
3 ACGAGTCAGATCACTACGTACGACT
3 ACGAGTCAGATAACTACGGACGACT

3
Haplotypes

1 ACGACTCAGATCACTACGTACGACT
1 ACGACTCAGATAACTACGGACGACT
2 ACGACTCAGATCACTACGTACGACT
2 ACGACTCAGATCACTACGTACGACT
3 ACGAGTCAGATCACTACGTACGACT
3 ACGAGTCAGATAACTACGGACGACT

4
Genotypes

1 ACGA[C/C]TCAGAT[C/A]ACTACG[T/G]ACGACT
2 ACGA[C/C]TCAGAT[C/C]ACTACG[T/T]ACGACT
3 ACGA[G/G]TCAGAT[C/A]ACTACG[T/G]ACGACT

5
Genotyping Technology

Can only tell us about a particular locus, so we
lose information about an individual's haplotypes
1 {C}, {A,C}, {G,T}
2 {C}, {C}, {T}
3 {G}, {A,C}, {G,T}

6
Genotyping Technology

Can only tell us about a particular locus, so we
lose information about an individual's haplotypes
1 0, 2, 2
2 0, 1, 1
3 1, 2, 2
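
The 0/1/2 coding can be reproduced from the haplotype pairs on the earlier slides. The following is a minimal sketch, assuming that homozygous loci keep their 0/1 allele value and heterozygous loci are coded 2. The mapping of nucleotides to 0 and 1 is not stated on the slides; `ALLELE_CODE` is an assumption read off the example, and the function names are hypothetical.

```python
def genotype_from_haplotypes(h1, h2):
    """Collapse two 0/1 haplotypes into a 0/1/2 genotype (2 = heterozygous)."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

# Allele coding per SNP, inferred from the example (an assumption).
ALLELE_CODE = [{"C": 0, "G": 1}, {"A": 0, "C": 1}, {"G": 0, "T": 1}]

def encode(snp_alleles):
    """Map the alleles of one haplotype at the three SNPs to 0/1."""
    return tuple(ALLELE_CODE[i][a] for i, a in enumerate(snp_alleles))

# Alleles at the three variable sites of each haplotype from slide 2.
individual_1 = genotype_from_haplotypes(encode("CCT"), encode("CAG"))
individual_2 = genotype_from_haplotypes(encode("CCT"), encode("CCT"))
individual_3 = genotype_from_haplotypes(encode("GCT"), encode("GAG"))
print(individual_1, individual_2, individual_3)
```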

7
Haplotype phasing

Given n genotypes, resolve each genotype into its
2 haplotypes.
Genotypes
1 0, 2, 2
2 0, 1, 1
3 1, 2, 2

Haplotypes after phasing
1 0, 0, 0
1 0, 1, 1
2 0, 1, 1
2 0, 1, 1
3 1, 0, 0
3 1, 1, 1

8
Haplotype phasing

Given n genotypes, resolve each genotype into its
2 haplotypes.
Genotypes
1 0, 2, 2
2 0, 1, 1
3 1, 2, 2

Haplotypes after phasing (an alternative)
1 0, 0, 1
1 0, 1, 0
2 0, 1, 1
2 0, 1, 1
3 1, 0, 1
3 1, 1, 0
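
The same genotypes admit more than one phasing. As a rough sketch (the function name `phasings` is hypothetical), the candidate haplotype pairs for one genotype can be enumerated as follows; a genotype with k heterozygous sites has 2^(k-1) unordered pairs when k is at least 1.

```python
from itertools import product

def phasings(genotype):
    """Yield each unordered haplotype pair consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        h = tuple(genotype)
        yield h, h
        return
    # Fix the first heterozygous site of h1 to 0 so each unordered pair appears once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1 = list(genotype)
        h2 = list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        yield tuple(h1), tuple(h2)

for pair in phasings((0, 2, 2)):
    print(pair)
```

For the genotype 0, 2, 2 this yields exactly the two phasings of individual 1 shown on slides 7 and 8.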
9
Clark's Rule

- Find unambiguous individuals
  (at most 1 ambiguous locus)
- Form initial list of known haplotypes
- Resolve ambiguous individuals
  - If possible, use two known haplotypes
  - Otherwise, use one known haplotype and add new
    haplotype to list.
- If unphased individuals remain
  - Assign phase randomly for one individual
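
The steps above can be sketched in code. This is an illustrative sketch only, not Clark's published procedure: the function names are hypothetical, ties are broken by list order, and the final random-assignment step is left out, so genotypes that cannot be explained by any known haplotype are simply returned unresolved.

```python
def complement(genotype, hap):
    """Return the haplotype that pairs with `hap` to give `genotype`, or None."""
    other = []
    for g, a in zip(genotype, hap):
        if g == 2:
            other.append(1 - a)
        elif g == a:
            other.append(a)
        else:
            return None
    return tuple(other)

def clark(genotypes):
    """Greedy Clark-style phasing; returns {index: (h1, h2)} for resolved genotypes."""
    known, phased = [], {}
    # Step 1: genotypes with at most one heterozygous site have a unique phasing.
    for i, g in enumerate(genotypes):
        if sum(x == 2 for x in g) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = complement(g, h1)
            phased[i] = (h1, h2)
            for h in (h1, h2):
                if h not in known:
                    known.append(h)
    # Step 2: repeatedly resolve ambiguous genotypes against the known list.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in phased:
                continue
            candidates = [(h, complement(g, h)) for h in known]
            candidates = [(h, c) for h, c in candidates if c is not None]
            if not candidates:
                continue
            # Prefer a pair in which both haplotypes are already known.
            candidates.sort(key=lambda hc: hc[1] not in known)
            h, c = candidates[0]
            phased[i] = (h, c)
            if c not in known:
                known.append(c)
            progress = True
    return phased

result = clark([(0, 2, 2), (0, 1, 1), (1, 2, 2)])
print(result)
```

On the three example genotypes, the second individual seeds the list with 0, 1, 1, the first is resolved against it, and the third stays unresolved in this sketch because no known haplotype is compatible with it.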

10
Clark's Rule

Initial list of known haplotypes: (0, 1, 1)
Genotypes
1 0, 2, 2
2 0, 1, 1
3 1, 2, 2

Haplotypes after phasing
1 0, 0, 0
1 0, 1, 1
2 0, 1, 1
2 0, 1, 1
3 1, 0, 0
3 1, 1, 1

11
Clark's Rule

Heuristic: use unambiguous genotypes
Doesn't necessarily minimize the number of
haplotypes
Doesn't pay any attention (as such) to the
haplotype frequency

12
Haplotype frequencies

HWE says that the probability of phase $(h, h')$
for genotype $g$ is
$$P(g = (h, h')) = \begin{cases} 2\,P(h)\,P(h') & \text{if } h \neq h' \\ P(h)^2 & \text{if } h = h' \end{cases}$$
If we knew the haplotype frequencies, $f_h$, we
would assign the phase of $g$ based on these
frequencies: $P(h) = p_h = f_h / 2n$
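
As a small worked example of the Hardy-Weinberg rule, the sketch below compares the two phasings of genotype 0, 2, 2. The frequency values in `p` are made up for illustration, and the function name is hypothetical.

```python
def phase_probability(h1, h2, p):
    """P(phase (h1, h2)) under Hardy-Weinberg equilibrium, given frequencies p."""
    if h1 == h2:
        return p[h1] ** 2
    return 2 * p[h1] * p[h2]

# Hypothetical frequencies for the four haplotypes that can explain genotype (0, 2, 2).
p = {(0, 0, 0): 0.4, (0, 1, 1): 0.4, (0, 0, 1): 0.1, (0, 1, 0): 0.1}
in_phase = phase_probability((0, 0, 0), (0, 1, 1), p)
out_phase = phase_probability((0, 0, 1), (0, 1, 0), p)
print(in_phase, out_phase)
```

With these assumed frequencies the first phasing is about sixteen times more probable than the second, so it would be the one assigned.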

13
Estimating haplotype frequencies

We'd like to extract the haplotype frequencies
from our genotype data
We'll use an EM algorithm to find
maximum-likelihood estimates of the haplotype
frequencies.
Find haplotype frequencies $p_h$ so that
$P(g_1, \ldots, g_n \mid p_h)$ is maximum.

14
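Estimating haplotype frequencies

If we had haplotype frequencies, but no genotype
frequencies:
$P(g) = \sum_{(h,h') \to g} P(h,h')$ (sum over the phasings $(h,h')$ of $g$)
Compute $P(h,h')$ from HW (and $p_h$) as before
So $P(\text{genotype } g \text{ with phase } (h,h')) = P(\text{phase } (h,h') \mid g)\, P(\text{genotype } g) = \frac{P(h,h')}{P(g)} \times \frac{1}{n}$
Use previous estimate of $p_h$ to compute first term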

15
Estimating haplotype frequencies

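Given $P(\text{genotype } g \text{ with phase } (h',h''))$, based on
our estimated $p_h$, we now re-estimate $f_h$.
$E[f_h] = \sum_g \sum_{(h',h'') \to g} \delta(h,h',h'')\, P(h',h'' \mid g)$
where $\delta(h,h',h'')$ is 0, 1, or 2, the number of times
$h$ occurs in $(h',h'')$
$p_h = E[f_h] / 2n$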

16
Estimating haplotype frequencies

This EM algorithm (Excoffier and Slatkin, 1995)
will converge to a (local) maximum of the likelihood
of the haplotype frequencies
Initial guess for frequencies is not clear; one choice is
$P(h,h' \mid g) = 1 / (\text{number of phasings of } g)$
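
The EM iteration described on the last few slides can be sketched as below. This is a minimal illustration under stated assumptions, not the Excoffier and Slatkin implementation: it enumerates every phasing explicitly, starts from equal weight on each phasing of a genotype, and runs a fixed number of iterations. All function and parameter names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def phasings(genotype):
    """List the unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def em_haplotype_frequencies(genotypes, iterations=50):
    n = len(genotypes)
    options = [phasings(g) for g in genotypes]
    # Initial guess: every phasing of a genotype is equally likely.
    weights = [[1.0 / len(opts)] * len(opts) for opts in options]
    p = {}
    for _ in range(iterations):
        # Re-estimate p_h as the expected haplotype count divided by 2n.
        counts = defaultdict(float)
        for opts, ws in zip(options, weights):
            for (h1, h2), w in zip(opts, ws):
                counts[h1] += w
                counts[h2] += w
        p = {h: c / (2 * n) for h, c in counts.items()}
        # Recompute P(phase | g), proportional to the HWE probability of the pair.
        weights = []
        for opts in options:
            raw = [p[a] ** 2 if a == b else 2 * p[a] * p[b] for a, b in opts]
            total = sum(raw)
            weights.append([r / total for r in raw])
    return p

freqs = em_haplotype_frequencies([(0, 2, 2), (0, 1, 1), (1, 2, 2)])
for hap, f in sorted(freqs.items()):
    print(hap, round(f, 3))
```

Because each genotype's phasings are listed in full, the work grows exponentially with the number of heterozygous sites per genotype.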

17
Estimating haplotype frequencies

Unfortunately, this approach is exponential in
the number of ambiguous sites (or number of
phasings)
Many heuristics have been proposed, lots of open
problems to consider.
Unambiguous or partially phased genotypes
contribute most.

18
Alternative method

For a large enough sample, if we assign a new
genotype to its most likely phase, the haplotype
frequencies won't change (much).
Choose g at random, re-assign phase based on
haplotype frequencies of other genotypes' phases.
Iterate until stable.
Markov chain Monte Carlo (MCMC) approach
Estimates maximum-likelihood haplotype
frequencies too!
Variations allow extra population assumptions to
be built in.
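
A rough sketch of this resampling loop follows. It is an illustration under several assumptions that the slide does not specify: phases are sampled in proportion to Hardy-Weinberg weights computed from the other individuals' haplotype counts, a small `pseudocount` keeps unseen haplotypes possible, and a fixed number of sweeps stands in for "iterate until stable". The names `gibbs_phase`, `sweeps`, `pseudocount`, and `seed` are hypothetical.

```python
import random
from collections import Counter
from itertools import product

def phasings(genotype):
    """List the unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def gibbs_phase(genotypes, sweeps=200, pseudocount=0.1, seed=0):
    rng = random.Random(seed)
    options = [phasings(g) for g in genotypes]
    current = [rng.choice(opts) for opts in options]
    counts = Counter(h for pair in current for h in pair)
    for _ in range(sweeps * len(genotypes)):
        i = rng.randrange(len(genotypes))
        # Remove individual i, then score its phasings against everyone else.
        counts.subtract(current[i])
        weights = []
        for a, b in options[i]:
            fa = counts[a] + pseudocount
            fb = counts[b] + pseudocount
            weights.append(fa * fa if a == b else 2 * fa * fb)
        current[i] = rng.choices(options[i], weights=weights)[0]
        counts.update(current[i])
    return current

genotypes = [(0, 2, 2), (0, 1, 1), (1, 2, 2)]
sampled = gibbs_phase(genotypes)
print(sampled)
```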

19
Combinatorial Approach

There is good reason to use these statistical
approaches, especially if n is large, and the
haplotypes are not too rare.
However, with smaller samples, or when we have
more structure, or rarer haplotypes, we might
consider each phasing carefully.
These approaches make the most sense when there
is additional structure to consider.

20
Perfect Phylogeny

Perfect phylogeny
all-0 haplotype at the root of tree
remaining haplotypes at leaves of tree
minor alleles accumulated on path from root to
leaf
each locus is represented by exactly one edge.
(infinite sites model)

21
Perfect phylogeny

all-0 at root
otherwise at leaves
minor-alleles accumulated on path from root
each locus on exactly one edge
Implies no evidence of recombination!

[Figure: perfect phylogeny tree rooted at 00000, with edges labeled by loci 1 to 5 and haplotypes 10000, 10100, 00010, 01010, 01011.]

22
Haplotyping by Perfect Phylogeny

PPH: Given a set of genotypes, find phasing
haplotypes that fit a perfect phylogeny.
Genotypes
1 2, 2
2 0, 2
3 1, 0

Haplotypes after phasing
1 1, 0
1 0, 1
2 0, 0
2 0, 1
3 1, 0
3 1, 0

23
Perfect phylogeny

1 1, 0
1 0, 1
2 0, 0
2 0, 1
3 1, 0
3 1, 0
This phasing fits a perfect phylogeny!

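[Figure: perfect phylogeny tree rooted at 00, with edges labeled by loci 1 and 2 and haplotypes 00, 10, 01, 01, 10, 10 at the leaves.]

24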
Haplotyping by Perfect Phylogeny

PPH: Given a set of genotypes, find phasing
haplotypes that fit a perfect phylogeny.
Genotypes
1 2, 2
2 0, 2
3 1, 0

Haplotypes after phasing
1 0, 0
1 1, 1
2 0, 0
2 0, 1
3 1, 0
3 1, 0
No perfect phylogeny is possible!

25
4 Gamete Test

Format the haplotypes in a matrix, two rows for
each individual.
The haplotypes fit a perfect phylogeny if and
only if no two columns contain all four pairs
00, 01, 10, 11.
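
The test is straightforward to write down. The sketch below (function names are hypothetical) checks every pair of columns of a 0/1 haplotype matrix and is applied to the two phasings from slides 23 and 24.

```python
from itertools import combinations

def four_gamete_violations(H):
    """Return the column pairs of 0/1 matrix H that contain 00, 01, 10 and 11."""
    n_cols = len(H[0]) if H else 0
    bad = []
    for i, j in combinations(range(n_cols), 2):
        gametes = {(row[i], row[j]) for row in H}
        if len(gametes) == 4:
            bad.append((i, j))
    return bad

def passes_four_gamete_test(H):
    return not four_gamete_violations(H)

fits = [(1, 0), (0, 1), (0, 0), (0, 1), (1, 0), (1, 0)]      # phasing from slide 23
fails = [(0, 0), (1, 1), (0, 0), (0, 1), (1, 0), (1, 0)]     # phasing from slide 24
print(passes_four_gamete_test(fits), passes_four_gamete_test(fails))
```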

26
Combinatorial Problem

Input: matrix G (n rows) with elements from
{0,1,2}
Output: matrix H (2n rows) with elements from {0,1},
each 2 of G replaced by 0 and 1, such that H
passes the 4 gamete test.
Studied independently, from 2002, by:
- Gusfield
- Eskin, Halperin, Karp
- Bafna, Gusfield, Lancia, Yooseph
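
To make the input/output specification concrete, here is a brute-force sketch that tries every phasing of every row and keeps those that pass the 4 gamete test. It is exponential and is not one of the algorithms from the papers listed above; the function names are hypothetical.

```python
from itertools import combinations, product

def phasings(genotype):
    """List the unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        return [(tuple(genotype), tuple(genotype))]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def passes_four_gamete_test(H):
    """True if no two columns of H contain all of 00, 01, 10, 11."""
    return all(len({(r[i], r[j]) for r in H}) < 4
               for i, j in combinations(range(len(H[0])), 2))

def pph_brute_force(G):
    """Return every phasing of G (one haplotype pair per row) that passes the test."""
    solutions = []
    for choice in product(*(phasings(g) for g in G)):
        H = [h for pair in choice for h in pair]
        if passes_four_gamete_test(H):
            solutions.append(choice)
    return solutions

solutions = pph_brute_force([(2, 2), (0, 2), (1, 0)])
print(solutions)
```

For the genotypes of slides 22 to 24 only one phasing survives, the one in which the first individual is split into 0 1 and 1 0.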

27
Initial Observations

Forced Expansions
G has two columns (loci) with these rows
2 0
0 2
H must have
0 0
1 0
0 0
0 1
That is, we know that 0 1 and 1 0 are present in
this pair of columns

28
Initial Observations

Forced Expansions
G has two columns (loci) with these rows
2 1
0 2
H must have
0 1
1 1
0 0
0 1
That is, we know that 0 0 and 1 1 are present in
this pair of columns

29
Initial Observations

A row 2 2 can be phased in two ways:
0 0, 1 1 and 0 1, 1 0.
If two columns contain 0 0 and 1 1 already
(forced or unambiguous), then 2 2 must be phased
0 0, 1 1 (otherwise all four pairs would appear).
These columns are forced in-phase.
If two columns contain 0 1 and 1 0 already,
then 2 2 must be phased 0 1, 1 0.
These columns are forced out-of-phase.
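
These observations translate into a simple per-column-pair check. The sketch below (hypothetical names and return labels) collects the pairs that forced expansions put into two columns and then classifies the relationship; the three inputs are the examples from slides 27, 28, and 30.

```python
def forced_gametes(G, i, j):
    """Pairs that must appear in columns (i, j) of H, whatever the phasing."""
    seen = set()
    for row in G:
        a, b = row[i], row[j]
        if a == 2 and b == 2:
            continue  # ambiguous row: contributes 00/11 or 01/10
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                seen.add((x, y))
    return seen

def column_relation(G, i, j):
    """Classify columns i, j as 'in-phase', 'out-of-phase', 'fail', or 'free'."""
    seen = forced_gametes(G, i, j)
    in_phase = {(0, 0), (1, 1)} <= seen
    out_phase = {(0, 1), (1, 0)} <= seen
    if in_phase and out_phase:
        return "fail"  # all four pairs are already present
    if in_phase:
        return "in-phase"
    if out_phase:
        return "out-of-phase"
    return "free"

print(column_relation([(2, 0), (0, 2)], 0, 1))
print(column_relation([(2, 1), (0, 2)], 0, 1))
print(column_relation([(2, 1), (2, 0)], 0, 1))
```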

30
Immediate Failure

Forced Expansions
G has two columns (loci) with these rows
2 1
2 0
H must have
0 1
1 1
0 0
1 0
4 Gamete condition violated!

31
Combinatorial Algorithm

Consider all columns in turn.
- Do all forced expansions
- Find all forced in-phase or out-of-phase
  column pair relationships
- Use invariant column phase pairs
  to fix unknown phase pairs
- Find a set of column pairs whose
  phase can be arbitrarily set
- Force the remaining phase pairs

32
Running Example
33
Companion Graph

Node: a column of G
Edge: two columns that share a row with 2 in both.

34
Companion Graph

Red edge: forced in-phase
Blue edge: forced out-of-phase

35
Forced Graph

Drop dashed (unforced) edges and define connected components
(3 connected components in the running example)
Fill in each component in turn by coloring dashed
edges
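
The companion graph, its forced edges, and the components of the forced graph can be built as sketched below. The running example of the slides is a figure that is not in this transcript, so the matrix `G` here is a made-up input, and all function names and labels are hypothetical.

```python
from itertools import combinations

def relation(G, i, j):
    """'in', 'out', 'fail' or None for columns i, j, from rows that are not 2 2."""
    seen = set()
    for row in G:
        a, b = row[i], row[j]
        if a == 2 and b == 2:
            continue
        seen |= {(x, y) for x in ((0, 1) if a == 2 else (a,))
                 for y in ((0, 1) if b == 2 else (b,))}
    in_phase = {(0, 0), (1, 1)} <= seen
    out_phase = {(0, 1), (1, 0)} <= seen
    if in_phase and out_phase:
        return "fail"
    return "in" if in_phase else "out" if out_phase else None

def build_graphs(G):
    """Return (solid, dashed): forced edges with their relation, and unforced edges."""
    solid, dashed = {}, []
    for i, j in combinations(range(len(G[0])), 2):
        if not any(row[i] == 2 and row[j] == 2 for row in G):
            continue  # no row of 2s: not an edge of the companion graph
        rel = relation(G, i, j)
        if rel is None:
            dashed.append((i, j))
        else:
            solid[(i, j)] = rel
    return solid, dashed

def forced_components(n_cols, solid):
    """Connected components of the forced graph (dashed edges dropped)."""
    label = list(range(n_cols))
    for a, b in solid:
        la, lb = label[a], label[b]
        label = [lb if x == la else x for x in label]
    groups = {}
    for col, l in enumerate(label):
        groups.setdefault(l, []).append(col)
    return sorted(groups.values())

G = [(2, 2, 0), (0, 2, 2), (1, 0, 0)]  # hypothetical input, not the slides' example
solid, dashed = build_graphs(G)
print(solid, dashed, forced_components(3, solid))
```

In this made-up input, columns 0 and 1 are joined by a forced out-of-phase edge, columns 1 and 2 by a dashed edge, and the forced graph has two components.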

36
Phase parity lemma

There is a PP solution for G if and only if the
dashed edges can be colored so that
For every triangle in the companion graph with
at least one dashed edge, either 0 or 2 edges
are colored blue.
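
The triangle condition can be checked mechanically once every edge has a color. The sketch below is illustrative only; the representation (a dict from `frozenset` edges to "red" or "blue", plus a set of the edges that were originally dashed) and the function name are assumptions.

```python
from itertools import combinations

def parity_violations(nodes, colour, dashed):
    """Return triangles with a dashed edge whose number of blue edges is odd."""
    bad = []
    for tri in combinations(sorted(nodes), 3):
        edges = [frozenset(e) for e in combinations(tri, 2)]
        if not all(e in colour for e in edges):
            continue  # not a triangle of the companion graph
        if not any(e in dashed for e in edges):
            continue  # the lemma only constrains triangles with a dashed edge
        if sum(colour[e] == "blue" for e in edges) % 2 == 1:
            bad.append(tri)
    return bad

colour = {frozenset((0, 1)): "blue", frozenset((1, 2)): "blue", frozenset((0, 2)): "red"}
dashed = {frozenset((0, 2))}
ok = parity_violations([0, 1, 2], colour, dashed)
colour_bad = dict(colour)
colour_bad[frozenset((0, 2))] = "blue"
broken = parity_violations([0, 1, 2], colour_bad, dashed)
print(ok, broken)
```

Two blue edges and one red edge satisfy the lemma; coloring the dashed edge blue instead gives three blue edges and is reported as a violation.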

37
Weak Triangulation Rule

If any dashed edges have both ends in the same
component of the forced graph, at least one must
be in a triangle with two colored edges.
Color this edge appropriately

38
Weak Triangulation Rule
39
Corollary

For each connected component of the forced graph,
all the edge relationships are uniquely
determined.
Phase relationships of all columns in a connected
component of the forced graph are invariant for
all solutions that satisfy PP

40
Phase Parity Lemma

Consider two columns with rows
2 x
y 2
2 2

If x ≠ 2 and y ≠ 2, for example
2 0
1 2
2 2
then the columns are forced in-phase or out-of-phase.

41
Phase parity lemma

Lemma: If a triangle contains a dashed edge, then
a PP solution exists only if there are 0 or 2
blue edges in the coloring.
A B C
2 2 y
x 2 2
2 z 2

Must have x = 2, y = 2, or z = 2, or else there is no dashed edge.
A row of all 2s must have an even number of blue edges.

42
Weak Triangulation Rule

Theorem
If any dashed edges have both ends in the same
component of the forced graph, at least one must
be in a triangle with two colored edges.

In a forced graph component, we can find a path from any
node to any other, using forced edges.
Find a cycle of one unforced (dashed) edge and
forced edges, of length > 3

43
Weak Triangulation Rule

Let (J, J') be a dashed edge closing a path of > 2
forced edges J, K, ..., K', J'
... K'  J'  J   K ...
    2   2   x
        y   2   2
        2   2

If x = 2, then (K', J) is an edge
If y = 2, then (J', K) is an edge
If x ≠ 2 and y ≠ 2, then (J, J') is forced.

44
Finishing up

We still need to color the dashed edges between
components of the forced graph.
If there are k components, select (k-1) edges
that form a spanning tree on the component graph
Color these edges arbitrarily. Remaining dashed
edges will be forced by weak triangulation rule.

45
Corollary

If companion graph has r connected components,
and forced graph has t connected components,
there are exactly $2^{t-r}$ perfect phylogeny
phasings for input matrix G.
PP solution is unique iff each component of
companion graph is connected in forced graph.
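
The count can be sanity-checked on a tiny input by comparing the formula with exhaustive enumeration. This is a sketch under assumptions: `G` is a made-up matrix that does have a perfect phylogeny solution (the formula is only meaningful in that case), and all function names are hypothetical.

```python
from itertools import combinations, product

def phasings(g):
    """List the unordered haplotype pairs consistent with a 0/1/2 genotype."""
    het = [i for i, x in enumerate(g) if x == 2]
    if not het:
        return [(tuple(g), tuple(g))]
    out = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def passes_four_gamete_test(H):
    return all(len({(r[i], r[j]) for r in H}) < 4
               for i, j in combinations(range(len(H[0])), 2))

def is_forced(G, i, j):
    """True if rows other than 2 2 force columns i, j in-phase or out-of-phase."""
    seen = set()
    for row in G:
        a, b = row[i], row[j]
        if a == 2 and b == 2:
            continue
        seen |= {(x, y) for x in ((0, 1) if a == 2 else (a,))
                 for y in ((0, 1) if b == 2 else (b,))}
    return {(0, 0), (1, 1)} <= seen or {(0, 1), (1, 0)} <= seen

def n_components(n, edges):
    label = list(range(n))
    for a, b in edges:
        la, lb = label[a], label[b]
        label = [lb if x == la else x for x in label]
    return len(set(label))

def count_by_formula(G):
    cols = range(len(G[0]))
    companion = [(i, j) for i, j in combinations(cols, 2)
                 if any(r[i] == 2 and r[j] == 2 for r in G)]
    forced = [e for e in companion if is_forced(G, *e)]
    r = n_components(len(G[0]), companion)
    t = n_components(len(G[0]), forced)
    return 2 ** (t - r)

def count_by_enumeration(G):
    return sum(passes_four_gamete_test([h for pair in choice for h in pair])
               for choice in product(*(phasings(g) for g in G)))

G = [(2, 2, 0), (0, 2, 2), (1, 0, 0)]  # hypothetical input with a PP solution
print(count_by_formula(G), count_by_enumeration(G))
```

Here the companion graph has one component and the forced graph has two, so the formula gives 2, which agrees with the enumeration.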

46
Summary

If there is no (evidence of) recombination under the
infinite sites model, haplotypes will fit a
perfect phylogeny
$O(ns^2)$ to test for PP, determine if solution is
unique, represent all possible solutions, and
generate one solution
Others are pushing towards $O(ns)$
Can we relax the no recombination requirement?
