Haplotyping algorithms and structure of human variation presentation

About This Presentation

Transcript and Presenter's Notes

Title: Haplotyping algorithms and structure of human variation

1
Haplotyping algorithms and structure of human
variation

EECS 458 CWRU Fall 2004
Readings see papers on the course website

2
Roadmap

Definition haplotype and haplotype inference
Why infer haplotypes
Infer haplotypes from pedigree data
Most probable haplotype configurations
Haplotype configurations with minimum
recombinations
Infer haplotypes from population data
Combinatorial Clarks, Perfect Phylogeny
Statistical methods EM, Bayesian (MCMC)
Infer haplotypes from pooled samples
Haplotype block partition
Tag SNP selection

3
Genotype and Haplotype
4
Typical Genotype Data
Observation

Two alleles for each individual
Chromosome origin for each allele is unknown
Multiple haplotype pairs can fit observed
genotype
Molecular haplotyping is expensive

A
C
Marker1
Marker2
G
A
T
C
Marker3
Possible haplotypes
5
Haplotypes are important!

Phase may determine phenotype
Phase helps exploit linkage disequilibrium
Infer state of neighboring alleles
Phase clarifies identity-by-descent status

6
Common Uses of Haplotypes

Linkage disequilibrium studies
Summarize genetic variation
Selecting markers to genotype
Identify haplotype tag SNPs
Candidate gene association studies
Test haplotype associations
Help interpret single marker associations
Understanding evolution of human populations

7
The problem

Haplotypes are hard to measure directly
X-chromosome in males
Sperm typing
Other molecular techniques
Often, statistical or combinatorial methods for
reconstruction required

8
Haplotype Inference on population data
m6, m4
9
Information on Relatives

Number of ambiguous individuals increases rapidly
with number of markers
Family information can help, but many ambiguities
remain

10
Haplotype Inference on Pedigrees, Mendelian Law
11
Haplotype inference on pooled samples

The input contain n pools
Each pool contains k individuals, thus 2k
haplotypes and m markers
At each marker, we are given the number of
alleles for the k individuals for each pool
The goal is to find the haplotype frequencies
Example n3, k2, m5

12
Haptotyping pedigree data statistical formulation

Statistical formulation find the most probable
haplotype configuration
Need to calculate the probability of a pedigree
on every haplotype configuration
Recall for linkage analysis, we need to calculate
the probability of a pedigree, that sums over all
possible haplotype configs

13
Haptotyping pedigree data statistical
formulation

Thus the linkage programs like Genehunter,
Allegro, Merlin could compute the most probable
haplotypes
But, it is time consuming.
In addition to exact computation, there are some
approximation algorithms, mainly based on
important sampling, e.g. SimWalk.
Still very time consuming, may consider many
configurations with very small probabilities

14
Recombination and combinatorial formulation
15
MRHC Problem

Find a minimum recombinant haplotype
configuration from a given pedigree with genotype
data.
Assumptions
Mendelian law (no mutations)
Recombination events are rare.
Well supported from real data.

Input
16
MRHC Problem (contd)

PS parental source of the two alleles at the
locus (i.e. phase)
GS grandparental source of an allele
Haplotype configuration assignment of PS and GS
values.

PS0
PS1
GS21
Output
17
Previous Results

Genotype elimination (OConnell00).
For data requiring no recombinant, exhaustive
elimination.
Genetic algorithm (Tapadar et al.00).
Time consuming.
MRH (Qian Beckmann02).
Six step rule-based algorithm.
Locus by locus at every step, extremely slow for
biallelic (e.g. SNP) markers.

18
Thm. MRHC is NP-Hard.

Idea Reduction from a variant of set cover.
First complexity result.
Remains hard for two loci.
Remains hard when no loops.

Li Jiang03, Doi, Li Jiang03
19
Block-Extension Algorithm

Iterative, heuristic, five steps. Rules are
derived from Mendelian law, MR principle, block
concept and some greedy ideas based on the
following observations
Block structures are common in haplotypes.
Double recombination events are rare.
Common haplotype blocks shared in siblings.
Advantages/Disadvantages
Time complexity (BE O(dmn) / MRH O(2dm3n2))

Li Jiang03
20
Block-Extension Algorithm
21
Block-Extension Algorithm
11 12 2 3 3 4
23 3 4 1 4 2
12(-1,0) 23(1,-1) 21(-1,-1) 3 4
13(-1,1) 24(1,-1) 24(1,-1) 32(-1,-1)
13 3 2 2 4 3 4
31(1,0) 42(1,-1) 42(1,-1) 23(1,-1)
22
Dynamic Programming Algorithms

Locus-based dynamic programming algorithm
Linear time in the number of the members
Applicable to only tree pedigrees
Member-based dynamic programming algorithm
Linear time in the number of the loci
Applicable to general pedigrees with small sizes

Doi, Li Jiang03
23
Locus-Based Dynamic Programming
24
Constraint-Finding Algorithm

Assumptions
No missing alleles, no errors.
Zero recombinants.
Idea finding all feasible (i.e. 0-recombinant)
haplotype configurations is equivalent to
reducing the degree of freedom in PS/GS
assignment.

Li Jiang03
25
Four Levels of Constraints

Based on Mendelian law (on single locus)
Level 1 GS constraint
Level 2 PS constraint

Based on 0-recombinant
(for a pair of loci)
Level 3 Haplotype constraint
Level 4 Grouping constraint

26
Level 3 and Level 4 Constraints
27
Level 3 and Level 4 Constraints
28
Analysis of Constraint-Finding Algorithm

Thm. Every solution consistent with the
constraint equations is a feasible solution and
vice versa.
Steps
find all constraints, in the form of linear
equations over Z2
solve the equations by Gaussian elimination
enumerate all feasible haplotype configurations
Exact polynomial time (O(n3m3) genotype
elimination exponential)

29
Integer Linear Programming

Combines missing data imputation and haplotype
inference.
Regardless of the pedigree structure, number of
recombinants, number of variables are linear of
problem size.
Implicitly checks the Mendelian consistency for
pedigree genotype data with missing alleles,
which is also an NPC problem.
Could find all possible optimal solutions.
Solved by a branch-and-bound algorithm.
Effective for practical size problems in terms of
time efficiency.
Accurate in terms of missing alleles imputation
and haplotype inference.

Li Jiang04a
30
ILP for MRHC with Missing Data

Define variables .
Define linear constraints.
Define a linear objective function of the
variables.
Preprocess constraints.
Apply branch-and-bound strategy to find
solutions. (a partial order relationship and some
other special relationships).
Estimate bounds.
Apply a maximum likelihood approach to multiple
optimal solutions.

31
Formulation
Mjmk set of all possible alleles at marker
locus j and let tj Mj. M1 1, 2 , M2
1,2
32
Formulation Variables

Define 2 g vars for each paternal allele and
maternal allele at locus j for individual i

Var g1 0 (or 1) iff paternal allele is copied
from fathers paternal (or maternal) allele. Var
g2 defined similarly.

Define r vars

33
Formulation Objective Function

Objective function

Subject to Genotype constraints
34
Formulation Constraints

Mendelian law of inheritance constraints (a child
i and its father f )

Constraints for r vars

35
A Partial Order Relationship

Denote

Inequalities with 2 variables
36
Forced Variables

Rule 1
Rule 2
Rule 3

37
Lower and Upper Bounds

Lower bounds
Linear relaxation.
Summation of the number of recombinants in each
nuclear family.
Effective for data with large number of
recombinants.
Upper bound
Obtained by block-extension algorithm.
Effective for data with small number of
recombinants.

38
Statistical Assessment

E-M algorithm to estimate haplotype frequencies
for data that consist of multiple pedigrees.

39
PedPhase software

Simulated data were generated to compare our
algorithms, as well as MRH in terms of
efficiency, accuracy.
Three different pedigree structures.
Multiallelic and biallelic data.
Numbers of loci 10, 25 and 50.
Number of recombinants 0-4.
100 runs per data set.

40
Pedigree Structures
41
Accuracy Results of BE Algorithm
42
Efficiency Results
43
More Results from ILP
44
Real Data Analysis

Data set (Gabriel et al.02)
93 members, 12 pedigrees (each with 7-8 members)
chromosome 3, 4 regions, each region 1-4 blocks.

45
Common Haplotypes Frequencies
46
Results From ILP on the Whole Dataset
3.82 4.00 0.45 0.034
47
What if there are no relatives?

Rely on linkage disequilibrium
Assume that population consists of small number
of distinct haplotypes
Haplotypes tend to be similar

48
Clarks Haplotyping Algorithm

Clark (1990) Mol Biol Evol 7111-122
One of the first haplotyping algorithms
Computationally efficient
Very fast
Today, more accurate alternatives are often
available

49
Clarks Haplotyping Algorithm

Find homozygous individuals
Initialize a list of known haplotypes
Resolve ambiguous individuals
If possible, use two haplotypes from list
Otherwise, use one known haplotype and augment
list
If unphased individuals remain
Assign phase randomly to one individual
Augment haplotype list and continue from previous
step

50
Haplotyping via Perfect Phylogeny - Model,
Algorithms, Empirical studies

Dan Gusfield, Ren Hua Chung
U.C. Davis
Cocoon 2003

51
The Perfect Phylogeny Model of Haplotype Evolution

sites
12345
00000
Ancestral haplotype
1
4
Site mutations on edges
3
00010
2
10100
5
10000
01010
01011
Extant haplotypes at the leaves
52
The Perfect Phylogeny Model

We assume that the evolution of extant haplotypes
can be displayed on a rooted, directed tree, with
the all-0 haplotype at the root, where each site
changes from 0 to 1 on exactly one edge, and
each extant haplotype is created by accumulating
the changes on a path from the root to a leaf,
where that haplotype is displayed.
In other words, the extant haplotypes evolved
along a perfect phylogeny with all-0 root.

53
Justification for Perfect Phylogeny Model

In the absence of recombination each haplotype of
any individual has a single parent, so tracing
back the history of the haplotypes in a
population gives a tree.
Recent strong evidence for long regions of DNA
with no recombination. Key to the NIH haplotype
mapping project. (See NYT October 30, 2002)
Mutations are rare at selected sites, so are
assumed non-recurrent.
Connection with coalescent models.

54
Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining
set of haplotypes that fits a perfect phylogeny.
sites
A haplotype pair explains a genotype if the merge
of the haplotypes creates the genotype. Example
The merge of 0 1 and 1 0 explains 2 2.
S
Genotype matrix
55
The PPH Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
56
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
00
1
2
b
00
a
a
b
c
c
01
01

10
10
10
57
The Alternative Explanation
No tree possible for this explanation
58
Efficient Solutions to the PPH problem - n
genotypes, m sites

Reduction to a graph realization problem (GPPH) -
build on Bixby-Wagner or Fushishige solution to
graph realization O(nm alpha(nm)) time.
Reduction to graph realization - build on Tuttes
graph realization method O(nm2) time.
Direct, from scratch combinatorial approach
-O(nm2) Bafna et al.
Berkeley (EHK) approach - specialize the Tutte
solution to the PPH problem - O(nm2) time.

59
The DPPH Method

Bafna et al. O(nm2) time
Based on deeper combinatorial observations about
the PPH problem.
A matrix-centric approach (rather than
tree-centric), although a graph is used in the
algorithm.

First, we need to understand why some sets of
haplotypes have a perfect phylogeny, and some do
not.
60
When does a set of haplotypes fit a perfect
phylogeny?

Arrange the haplotypes in a matrix, two
haplotypes for each individual. Then (with no
duplicate columns), the haplotypes fit a unique
perfect phylogeny if and only if no two columns
contain all three pairs
0,1 and 1,0 and 1,1

This is the 3-Gamete Test
61
The Alternative Explanation
No tree possible for this explanation
62
The Tree Explanation Again
0 0
1
2
b
0 0
a
b
a
c
c
0 1
0 1
63
PPH The Combinatorial Problem
Input A ternary matrix (0,1,2) M with 2N
rows partitioned into N pairs of rows, where
the two rows in each pair are identical. Def
If a pair of rows (r,r) in the partition have
entry values of 2 in a column j then positions
(r,j) and (r,j) are called Mates.
64

Output A binary matrix M created from M
by replacing each 2 in M with either 0 or 1,
such that
A position is assigned 0 if and only if its Mate
is assigned 1.
b) M passes the 3-Gamete Test, i.e., does
not contain a 3x2 submatrix (after row and
column permutations) with all three
combinations 0,1 1,0 and 1,1

65
Initial Observations

If two columns of M contain the following
rows
2 0
2 0 mates
then M will contain a row with 1 0 and a
row with 0 1 in those columns.
This is a forced expansion.

66
Initial Observations

Similarly, if two columns of M contain the
mates
2 1
2 1
then M will contain a row with 1 1 and a row
with 0 1 in those columns.
This is a forced expansion.

67
If a forced expansion of two columns creates 0 1
in those columns, then any 2 2 1 0
2 2
in those columns must be set
to be 0 1 1 0 We say that two columns are
forced out-of-phase.
If a forced expansion of two columns creates 1 1
in those columns, then any 2 2
2
2 in those columns must be
set to be 1 1 0 0 We say that two columns are
forced in-phase.
68
1 2 3
a
Example
a
Columns 1 and 2, and 1 and 3 are forced
in-phase. Columns 2 and 3 are forced
out-of-phase.
b
b
c
c
1 3
1 2
2 3
d
a
a
b
d
a
a
b
e
e
b
e
e
e
b
e
69
Immediate Failure
It can happen that the forced expansion of
cells creates a 3x2 submatrix that fails the
3-Gamete Test. In that case, there is no PPH
solution for M.
20 20 11 11 02 02
Example
Will fail the 3-Gamete Test
70
An O(nm2)-time Algorithm

Find all the forced phase relationships by
considering columns in pairs.
Find all the inferred, invariant, phase
relationships.
Find a set of column pairs whose phase
relationship can be arbitrarily set, so that all
the remaining phase relationships can be
inferred.
Result An implicit representation of all
solutions to the PPH problem.

71
1 2 3 4 5 6 7
a
A running example.
a
b
b
c
c
d
d
e
e
72
Overview of Bafna et al. algorithm
First, represent the forced phase relationships,
and the needed decisions, in a graph G.
73
7
1
Graph G
Each node represents a column in M, and each edge
indicates that the pair of columns has a row with
2s in both columns. The algorithm builds
this graph, and then checks whether any pair of
nodes is forced in or out of phase.
6
3
4
2
5
74
7
1
Graph Gc
Each Red edge indicates that the columns
are forced in-phase. Each Blue edge
indicates that the columns are forced
out-of-phase.
6
3
4
2
Let Gf be the subgraph of Gc defined by the red
and blue edges.
5
75
7
1
Graph Gf has three connected components.
6
3
4
2
5
76
The Central Theorem

There is a solution to the PPH problem for M if
and only if there is a coloring of the dashed
edges of Gc
with the following property
For any triangle (i,j,k) in Gc, where there
is one row
containing 2s in all three columns i,j and
k
(any triangle containing at least one
dashed edge will be of this type), the
coloring makes
either 0 or 2 of the edges blue
(out-of-phase).
Nice, but how do we find such a coloring?

77
7
1
Triangle Rule
Graph Gf
Theorem 1 If there are any dashed edges whose
ends are in the same connected component of Gf,
at least one edge is in a triangle where the
other edges are not dashed, and in every
PPH solution, it must be colored so that the
triangle has an even number of Blue (out
of Phase) edges. This is an inferred coloring.
6
3
4
2
5
78
7
1
6
3
4
2
5
79
7
1
6
3
4
2
5
80
7
1
6
3
4
2
5
81
Corollary
Inside any connected component of Gf, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pairwise column
comparisons, or by triangle-based inferred
colorings. Hence, the phase relationships of all
the columns in a connected component of Gf are
INVARIANT over all the solutions to the PPH
problem.
82
The dashed edges in Gf can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS. See the paper for details,
or assign it as a homework exercise.
83
Finishing the Solution

Problem A connected component C of G may
contain several connected components of Gf, so
any edge crossing two components of Gf will still
be dashed. How should they be colored?

84
7
1
How should we color the remaining dashed edges in
a connected component C of Gc?
6
3
4
2
5
85
Answer
For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 dashed edges in C, so that S together
with the red and blue edges span all the nodes of
C. Arbitrarily, color each edge in S either red
or blue. Infer the color of any remaining dashed
edges by successive use of the triangle rule.
86
7
1
Pick and color edges (2,5) and (3,7) The
remaining dashed edges are colored by using the
triangle rule.
6
3
4
2
5
87
7
1
6
3
4
2
5
88
Theorem 2

Any selected S works (allows the triangle rule to
work) and any coloring of the edges in S
determines the colors of any remaining dashed
edges.
Different colorings of S determine different
colorings of the remaining dashed edges.
Each different coloring of S determines a
different solution to the PPH problem.
All PPH solutions can be obtained in this way,
i.e. using just one selected S set, but coloring
it in all 2(k-1) ways.

89
Comparing the programs - R.H. Chung

All three are fast and practical (under one
second) on problem instances of size 50 x 30.
DPPH is the fastest, followed by HPPH and GPPH.
HPPH encounters memory problems with large input.

90
sites individ GPPH DPPH HPPH
times shown are in seconds on an 800 Mhz machine.
91
A Phase-Transition
Problem, as the ratio of sites to genotypes
changes, how does the probability that the PPH
solution is unique change? For greatest utility,
we want genotype data where the PPH solution is
unique. Intuitively, as the ratio of genotypes
to sites increases, the probability of uniqueness
increases.
92
Extension

With recombination
The papers See wwwcsif.cs.ucdavis.edu/gusfield

93
The E-M Haplotyping Algorithm

Excoffier and Slatkin (1995) Mol Biol Evol
12921-927
Provide a clear outline of how the algorithm can
be applied to genetic data
Combination of two strategies
E-M statistical algorithm for missing data
Counting algorithm for allele frequencies

94
E-M Algorithm For Haplotyping

1. Guesstimate haplotype frequencies
2. Use current frequency estimates to replace
ambiguous genotypes with fractional counts of
phased genotypes
3. Estimate frequency of each haplotype by
counting
4. Repeat steps 2 and 3 until frequencies are
stable

95
E-M Algorithm for Haplotyping

Cost grows rapidly with number of markers
Typically appropriate for lt 25 SNPs
Fewer microsatellites
More accurate than Clarks method
Fully or partially phased individuals contribute
most of the information

96
Enhancements to E-M

List only haplotypes present in sample
Gradually expand subset of markers under
consideration, eliminating haplotypes with low
estimated frequency from consideration at each
stage
SNPHAP Clayton (2001)
HAPLOTYPER Qin et al. (2002)

97
Divide-And-Conquer Approximation

No. of potential haplotypes increases
exponentially
Actual no. of haplotypes doesnt
Approximation
Successively divide marker set
Run E-M assuming segments associate randomly
Proceed, ignoring composites of segments with
zero frequency
Order m log m
Exact E-M is order 2m

98
Other Recent Developments

Newer methods try to further improve haplotype
estimation by favoring sets of similar haplotypes
Stephens et al. (2001) Am J Hum Genet 68978-89
Genealogical approach, which implies haplotypes
are similar to each other

99
Method based on Gibbs sampler

MCMC method
Stochastic, random procedure
Improves solution gradually
Given initial set of haplotypes
Sample haplotypes for one individual at a time,
assuming other haplotypes are true
Repeat a few million times

100
Update Procedure I

Pick individual U to update at random
Calculate haplotype frequencies F in all other
individuals
Since everyone is phased, this is done by
counting
Sample new haplotypes for U from conditional
distribution of Us haplotypes given F

101
Update Procedure I

This procedure would produce an estimate of
haplotype frequencies that equivalent to the E-M
algorithm
Stephens et al (2001) suggested an alternative
estimate of F

102
Update Procedure II

Estimate F from the other individuals
Construct F to include haplotypes in F and also
other similar (possibly differing at a few sites,
due to mutations)
Update Us haplotypes conditional on F

Write a Comment

User Comments (0)

About PowerShow.com

Haplotyping algorithms and structure of human variation PowerPoint PPT Presentation