Title: Haplotyping algorithms and structure of human variation
1Haplotyping algorithms and structure of human
variation
- EECS 458 CWRU Fall 2004
- Readings see papers on the course website
2Roadmap
- Definition haplotype and haplotype inference
- Why infer haplotypes
- Infer haplotypes from pedigree data
- Most probable haplotype configurations
- Haplotype configurations with minimum
recombinations - Infer haplotypes from population data
- Combinatorial Clarks, Perfect Phylogeny
- Statistical methods EM, Bayesian (MCMC)
- Infer haplotypes from pooled samples
- Haplotype block partition
- Tag SNP selection
3Genotype and Haplotype
4Typical Genotype Data
Observation
- Two alleles for each individual
- Chromosome origin for each allele is unknown
- Multiple haplotype pairs can fit observed
genotype - Molecular haplotyping is expensive
A
C
Marker1
Marker2
G
A
T
C
Marker3
Possible haplotypes
5Haplotypes are important!
- Phase may determine phenotype
-
- Phase helps exploit linkage disequilibrium
- Infer state of neighboring alleles
- Phase clarifies identity-by-descent status
6Common Uses of Haplotypes
- Linkage disequilibrium studies
- Summarize genetic variation
- Selecting markers to genotype
- Identify haplotype tag SNPs
- Candidate gene association studies
- Test haplotype associations
- Help interpret single marker associations
- Understanding evolution of human populations
7The problem
- Haplotypes are hard to measure directly
- X-chromosome in males
- Sperm typing
- Other molecular techniques
- Often, statistical or combinatorial methods for
reconstruction required
8Haplotype Inference on population data
m6, m4
9Information on Relatives
- Number of ambiguous individuals increases rapidly
with number of markers - Family information can help, but many ambiguities
remain
10Haplotype Inference on Pedigrees, Mendelian Law
11Haplotype inference on pooled samples
- The input contain n pools
- Each pool contains k individuals, thus 2k
haplotypes and m markers - At each marker, we are given the number of
alleles for the k individuals for each pool - The goal is to find the haplotype frequencies
- Example n3, k2, m5
12Haptotyping pedigree data statistical formulation
- Statistical formulation find the most probable
haplotype configuration - Need to calculate the probability of a pedigree
on every haplotype configuration - Recall for linkage analysis, we need to calculate
the probability of a pedigree, that sums over all
possible haplotype configs
13Haptotyping pedigree data statistical
formulation
- Thus the linkage programs like Genehunter,
Allegro, Merlin could compute the most probable
haplotypes - But, it is time consuming.
- In addition to exact computation, there are some
approximation algorithms, mainly based on
important sampling, e.g. SimWalk. - Still very time consuming, may consider many
configurations with very small probabilities
14Recombination and combinatorial formulation
15MRHC Problem
- Find a minimum recombinant haplotype
configuration from a given pedigree with genotype
data. - Assumptions
- Mendelian law (no mutations)
- Recombination events are rare.
- Well supported from real data.
Input
16MRHC Problem (contd)
- PS parental source of the two alleles at the
locus (i.e. phase) - GS grandparental source of an allele
- Haplotype configuration assignment of PS and GS
values.
PS0
PS1
GS21
Output
17Previous Results
- Genotype elimination (OConnell00).
- For data requiring no recombinant, exhaustive
elimination. - Genetic algorithm (Tapadar et al.00).
- Time consuming.
- MRH (Qian Beckmann02).
- Six step rule-based algorithm.
- Locus by locus at every step, extremely slow for
biallelic (e.g. SNP) markers.
18Thm. MRHC is NP-Hard.
- Idea Reduction from a variant of set cover.
- First complexity result.
- Remains hard for two loci.
- Remains hard when no loops.
Li Jiang03, Doi, Li Jiang03
19Block-Extension Algorithm
- Iterative, heuristic, five steps. Rules are
derived from Mendelian law, MR principle, block
concept and some greedy ideas based on the
following observations - Block structures are common in haplotypes.
- Double recombination events are rare.
- Common haplotype blocks shared in siblings.
-
- Advantages/Disadvantages
- Time complexity (BE O(dmn) / MRH O(2dm3n2))
Li Jiang03
20Block-Extension Algorithm
21Block-Extension Algorithm
11 12 2 3 3 4
23 3 4 1 4 2
12(-1,0) 23(1,-1) 21(-1,-1) 3 4
13(-1,1) 24(1,-1) 24(1,-1) 32(-1,-1)
13 3 2 2 4 3 4
31(1,0) 42(1,-1) 42(1,-1) 23(1,-1)
22Dynamic Programming Algorithms
- Locus-based dynamic programming algorithm
- Linear time in the number of the members
- Applicable to only tree pedigrees
- Member-based dynamic programming algorithm
- Linear time in the number of the loci
- Applicable to general pedigrees with small sizes
Doi, Li Jiang03
23Locus-Based Dynamic Programming
24Constraint-Finding Algorithm
- Assumptions
- No missing alleles, no errors.
- Zero recombinants.
- Idea finding all feasible (i.e. 0-recombinant)
haplotype configurations is equivalent to
reducing the degree of freedom in PS/GS
assignment.
Li Jiang03
25Four Levels of Constraints
- Based on Mendelian law (on single locus)
- Level 1 GS constraint
- Level 2 PS constraint
- Based on 0-recombinant
- (for a pair of loci)
- Level 3 Haplotype constraint
- Level 4 Grouping constraint
26Level 3 and Level 4 Constraints
27Level 3 and Level 4 Constraints
28Analysis of Constraint-Finding Algorithm
- Thm. Every solution consistent with the
constraint equations is a feasible solution and
vice versa. - Steps
- find all constraints, in the form of linear
equations over Z2 - solve the equations by Gaussian elimination
- enumerate all feasible haplotype configurations
- Exact polynomial time (O(n3m3) genotype
elimination exponential)
29Integer Linear Programming
- Combines missing data imputation and haplotype
inference. - Regardless of the pedigree structure, number of
recombinants, number of variables are linear of
problem size. - Implicitly checks the Mendelian consistency for
pedigree genotype data with missing alleles,
which is also an NPC problem. - Could find all possible optimal solutions.
- Solved by a branch-and-bound algorithm.
- Effective for practical size problems in terms of
time efficiency. - Accurate in terms of missing alleles imputation
and haplotype inference.
Li Jiang04a
30ILP for MRHC with Missing Data
- Define variables .
- Define linear constraints.
- Define a linear objective function of the
variables. - Preprocess constraints.
- Apply branch-and-bound strategy to find
solutions. (a partial order relationship and some
other special relationships). - Estimate bounds.
- Apply a maximum likelihood approach to multiple
optimal solutions.
31Formulation
Mjmk set of all possible alleles at marker
locus j and let tj Mj. M1 1, 2 , M2
1,2
32Formulation Variables
- Define 2 g vars for each paternal allele and
maternal allele at locus j for individual i
- Var g1 0 (or 1) iff paternal allele is copied
from fathers paternal (or maternal) allele. Var
g2 defined similarly.
33Formulation Objective Function
Subject to Genotype constraints
34Formulation Constraints
- Mendelian law of inheritance constraints (a child
i and its father f )
35A Partial Order Relationship
Inequalities with 2 variables
36Forced Variables
37Lower and Upper Bounds
- Lower bounds
- Linear relaxation.
- Summation of the number of recombinants in each
nuclear family. - Effective for data with large number of
recombinants. - Upper bound
- Obtained by block-extension algorithm.
- Effective for data with small number of
recombinants.
38Statistical Assessment
- E-M algorithm to estimate haplotype frequencies
for data that consist of multiple pedigrees.
39PedPhase software
- Simulated data were generated to compare our
algorithms, as well as MRH in terms of
efficiency, accuracy. - Three different pedigree structures.
- Multiallelic and biallelic data.
- Numbers of loci 10, 25 and 50.
- Number of recombinants 0-4.
- 100 runs per data set.
40Pedigree Structures
41Accuracy Results of BE Algorithm
42Efficiency Results
43More Results from ILP
44Real Data Analysis
- Data set (Gabriel et al.02)
- 93 members, 12 pedigrees (each with 7-8 members)
- chromosome 3, 4 regions, each region 1-4 blocks.
45Common Haplotypes Frequencies
46Results From ILP on the Whole Dataset
3.82 4.00 0.45 0.034
47What if there are no relatives?
- Rely on linkage disequilibrium
- Assume that population consists of small number
of distinct haplotypes - Haplotypes tend to be similar
48Clarks Haplotyping Algorithm
- Clark (1990) Mol Biol Evol 7111-122
- One of the first haplotyping algorithms
- Computationally efficient
- Very fast
- Today, more accurate alternatives are often
available
49Clarks Haplotyping Algorithm
- Find homozygous individuals
- Initialize a list of known haplotypes
- Resolve ambiguous individuals
- If possible, use two haplotypes from list
- Otherwise, use one known haplotype and augment
list - If unphased individuals remain
- Assign phase randomly to one individual
- Augment haplotype list and continue from previous
step
50Haplotyping via Perfect Phylogeny - Model,
Algorithms, Empirical studies
- Dan Gusfield, Ren Hua Chung
- U.C. Davis
- Cocoon 2003
51The Perfect Phylogeny Model of Haplotype Evolution
sites
12345
00000
Ancestral haplotype
1
4
Site mutations on edges
3
00010
2
10100
5
10000
01010
01011
Extant haplotypes at the leaves
52The Perfect Phylogeny Model
- We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site - changes from 0 to 1 on exactly one edge, and
each extant haplotype is created by accumulating
the changes on a path from the root to a leaf,
where that haplotype is displayed. - In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.
53Justification for Perfect Phylogeny Model
- In the absence of recombination each haplotype of
any individual has a single parent, so tracing
back the history of the haplotypes in a
population gives a tree. - Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project. (See NYT October 30, 2002) - Mutations are rare at selected sites, so are
assumed non-recurrent. - Connection with coalescent models.
54 Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining
set of haplotypes that fits a perfect phylogeny.
sites
A haplotype pair explains a genotype if the merge
of the haplotypes creates the genotype. Example
The merge of 0 1 and 1 0 explains 2 2.
S
Genotype matrix
55The PPH Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
56The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
00
1
2
b
00
a
a
b
c
c
01
01
10
10
10
57The Alternative Explanation
No tree possible for this explanation
58Efficient Solutions to the PPH problem - n
genotypes, m sites
- Reduction to a graph realization problem (GPPH) -
build on Bixby-Wagner or Fushishige solution to
graph realization O(nm alpha(nm)) time. - Reduction to graph realization - build on Tuttes
graph realization method O(nm2) time. - Direct, from scratch combinatorial approach
-O(nm2) Bafna et al. - Berkeley (EHK) approach - specialize the Tutte
solution to the PPH problem - O(nm2) time.
59The DPPH Method
- Bafna et al. O(nm2) time
- Based on deeper combinatorial observations about
the PPH problem. - A matrix-centric approach (rather than
tree-centric), although a graph is used in the
algorithm.
First, we need to understand why some sets of
haplotypes have a perfect phylogeny, and some do
not.
60When does a set of haplotypes fit a perfect
phylogeny?
- Arrange the haplotypes in a matrix, two
haplotypes for each individual. Then (with no
duplicate columns), the haplotypes fit a unique
perfect phylogeny if and only if no two columns
contain all three pairs - 0,1 and 1,0 and 1,1
This is the 3-Gamete Test
61The Alternative Explanation
No tree possible for this explanation
62The Tree Explanation Again
0 0
1
2
b
0 0
a
b
a
c
c
0 1
0 1
63PPH The Combinatorial Problem
Input A ternary matrix (0,1,2) M with 2N
rows partitioned into N pairs of rows, where
the two rows in each pair are identical. Def
If a pair of rows (r,r) in the partition have
entry values of 2 in a column j then positions
(r,j) and (r,j) are called Mates.
64- Output A binary matrix M created from M
- by replacing each 2 in M with either 0 or 1,
- such that
- A position is assigned 0 if and only if its Mate
- is assigned 1.
- b) M passes the 3-Gamete Test, i.e., does
- not contain a 3x2 submatrix (after row and
- column permutations) with all three
- combinations 0,1 1,0 and 1,1
-
65Initial Observations
- If two columns of M contain the following
rows - 2 0
- 2 0 mates
-
- then M will contain a row with 1 0 and a
row with 0 1 in those columns. -
- This is a forced expansion.
66Initial Observations
- Similarly, if two columns of M contain the
mates - 2 1
- 2 1
- then M will contain a row with 1 1 and a row
with 0 1 in those columns. - This is a forced expansion.
67If a forced expansion of two columns creates 0 1
in those columns, then any 2 2 1 0
2 2
in those columns must be set
to be 0 1 1 0 We say that two columns are
forced out-of-phase.
If a forced expansion of two columns creates 1 1
in those columns, then any 2 2
2
2 in those columns must be
set to be 1 1 0 0 We say that two columns are
forced in-phase.
68 1 2 3
a
Example
a
Columns 1 and 2, and 1 and 3 are forced
in-phase. Columns 2 and 3 are forced
out-of-phase.
b
b
c
c
1 3
1 2
2 3
d
a
a
b
d
a
a
b
e
e
b
e
e
e
b
e
69Immediate Failure
It can happen that the forced expansion of
cells creates a 3x2 submatrix that fails the
3-Gamete Test. In that case, there is no PPH
solution for M.
20 20 11 11 02 02
Example
Will fail the 3-Gamete Test
70An O(nm2)-time Algorithm
- Find all the forced phase relationships by
considering columns in pairs. - Find all the inferred, invariant, phase
relationships. - Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred. - Result An implicit representation of all
solutions to the PPH problem.
71 1 2 3 4 5 6 7
a
A running example.
a
b
b
c
c
d
d
e
e
72Overview of Bafna et al. algorithm
First, represent the forced phase relationships,
and the needed decisions, in a graph G.
737
1
Graph G
Each node represents a column in M, and each edge
indicates that the pair of columns has a row with
2s in both columns. The algorithm builds
this graph, and then checks whether any pair of
nodes is forced in or out of phase.
6
3
4
2
5
747
1
Graph Gc
Each Red edge indicates that the columns
are forced in-phase. Each Blue edge
indicates that the columns are forced
out-of-phase.
6
3
4
2
Let Gf be the subgraph of Gc defined by the red
and blue edges.
5
757
1
Graph Gf has three connected components.
6
3
4
2
5
76The Central Theorem
- There is a solution to the PPH problem for M if
- and only if there is a coloring of the dashed
edges of Gc - with the following property
- For any triangle (i,j,k) in Gc, where there
is one row - containing 2s in all three columns i,j and
k - (any triangle containing at least one
- dashed edge will be of this type), the
coloring makes - either 0 or 2 of the edges blue
(out-of-phase). -
- Nice, but how do we find such a coloring?
777
1
Triangle Rule
Graph Gf
Theorem 1 If there are any dashed edges whose
ends are in the same connected component of Gf,
at least one edge is in a triangle where the
other edges are not dashed, and in every
PPH solution, it must be colored so that the
triangle has an even number of Blue (out
of Phase) edges. This is an inferred coloring.
6
3
4
2
5
787
1
6
3
4
2
5
797
1
6
3
4
2
5
807
1
6
3
4
2
5
81Corollary
Inside any connected component of Gf, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pairwise column
comparisons, or by triangle-based inferred
colorings. Hence, the phase relationships of all
the columns in a connected component of Gf are
INVARIANT over all the solutions to the PPH
problem.
82The dashed edges in Gf can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS. See the paper for details,
or assign it as a homework exercise.
83Finishing the Solution
- Problem A connected component C of G may
contain several connected components of Gf, so
any edge crossing two components of Gf will still
be dashed. How should they be colored?
847
1
How should we color the remaining dashed edges in
a connected component C of Gc?
6
3
4
2
5
85Answer
For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 dashed edges in C, so that S together
with the red and blue edges span all the nodes of
C. Arbitrarily, color each edge in S either red
or blue. Infer the color of any remaining dashed
edges by successive use of the triangle rule.
867
1
Pick and color edges (2,5) and (3,7) The
remaining dashed edges are colored by using the
triangle rule.
6
3
4
2
5
877
1
6
3
4
2
5
88Theorem 2
- Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining dashed
edges. - Different colorings of S determine different
colorings of the remaining dashed edges. - Each different coloring of S determines a
different solution to the PPH problem. - All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2(k-1) ways.
89Comparing the programs - R.H. Chung
- All three are fast and practical (under one
second) on problem instances of size 50 x 30. - DPPH is the fastest, followed by HPPH and GPPH.
- HPPH encounters memory problems with large input.
90sites individ GPPH DPPH HPPH
times shown are in seconds on an 800 Mhz machine.
91A Phase-Transition
Problem, as the ratio of sites to genotypes
changes, how does the probability that the PPH
solution is unique change? For greatest utility,
we want genotype data where the PPH solution is
unique. Intuitively, as the ratio of genotypes
to sites increases, the probability of uniqueness
increases.
92Extension
- With recombination
- The papers See wwwcsif.cs.ucdavis.edu/gusfield
93The E-M Haplotyping Algorithm
- Excoffier and Slatkin (1995) Mol Biol Evol
12921-927 - Provide a clear outline of how the algorithm can
be applied to genetic data - Combination of two strategies
- E-M statistical algorithm for missing data
- Counting algorithm for allele frequencies
94E-M Algorithm For Haplotyping
- 1. Guesstimate haplotype frequencies
- 2. Use current frequency estimates to replace
ambiguous genotypes with fractional counts of
phased genotypes - 3. Estimate frequency of each haplotype by
counting - 4. Repeat steps 2 and 3 until frequencies are
stable
95E-M Algorithm for Haplotyping
- Cost grows rapidly with number of markers
- Typically appropriate for lt 25 SNPs
- Fewer microsatellites
- More accurate than Clarks method
- Fully or partially phased individuals contribute
most of the information
96Enhancements to E-M
- List only haplotypes present in sample
- Gradually expand subset of markers under
consideration, eliminating haplotypes with low
estimated frequency from consideration at each
stage - SNPHAP Clayton (2001)
- HAPLOTYPER Qin et al. (2002)
97Divide-And-Conquer Approximation
- No. of potential haplotypes increases
exponentially - Actual no. of haplotypes doesnt
- Approximation
- Successively divide marker set
- Run E-M assuming segments associate randomly
- Proceed, ignoring composites of segments with
zero frequency - Order m log m
- Exact E-M is order 2m
98Other Recent Developments
- Newer methods try to further improve haplotype
estimation by favoring sets of similar haplotypes - Stephens et al. (2001) Am J Hum Genet 68978-89
- Genealogical approach, which implies haplotypes
are similar to each other
99Method based on Gibbs sampler
- MCMC method
- Stochastic, random procedure
- Improves solution gradually
- Given initial set of haplotypes
- Sample haplotypes for one individual at a time,
assuming other haplotypes are true - Repeat a few million times
100Update Procedure I
- Pick individual U to update at random
- Calculate haplotype frequencies F in all other
individuals - Since everyone is phased, this is done by
counting - Sample new haplotypes for U from conditional
distribution of Us haplotypes given F
101Update Procedure I
- This procedure would produce an estimate of
haplotype frequencies that equivalent to the E-M
algorithm - Stephens et al (2001) suggested an alternative
estimate of F
102Update Procedure II
- Estimate F from the other individuals
- Construct F to include haplotypes in F and also
other similar (possibly differing at a few sites,
due to mutations) - Update Us haplotypes conditional on F