Haplotyping algorithms and structure of human variation PowerPoint PPT Presentation


Title: Haplotyping algorithms and structure of human variation


1
Haplotyping algorithms and structure of human
variation
  • EECS 458 CWRU Fall 2004
  • Readings: see papers on the course website

2
Roadmap
  • Definition: haplotype and haplotype inference
  • Why infer haplotypes
  • Infer haplotypes from pedigree data
  • Most probable haplotype configurations
  • Haplotype configurations with minimum
    recombinations
  • Infer haplotypes from population data
  • Combinatorial: Clark's, Perfect Phylogeny
  • Statistical methods: EM, Bayesian (MCMC)
  • Infer haplotypes from pooled samples
  • Haplotype block partition
  • Tag SNP selection

3
Genotype and Haplotype
4
Typical Genotype Data
Observation
  • Two alleles for each individual
  • Chromosome origin for each allele is unknown
  • Multiple haplotype pairs can fit observed
    genotype
  • Molecular haplotyping is expensive
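To make the ambiguity concrete, here is a minimal sketch that enumerates every haplotype pair consistent with an unphased genotype. The function name and the three-marker genotype are illustrative assumptions, not taken from the slides.

```python
from itertools import product

def compatible_haplotype_pairs(genotype):
    """genotype: list of unordered allele pairs, one pair per marker.

    Returns all distinct unordered haplotype pairs that fit the genotype.
    """
    het = [i for i, (a, b) in enumerate(genotype) if a != b]
    pairs = set()
    # Fix the phase of the first heterozygous marker so that (h1, h2) and
    # (h2, h1) are not counted twice; flip each remaining one either way.
    for flips in product([False, True], repeat=max(len(het) - 1, 0)):
        flip_at = dict(zip(het[1:], flips))
        h1, h2 = [], []
        for i, (a, b) in enumerate(genotype):
            if flip_at.get(i, False):
                a, b = b, a
            h1.append(a)
            h2.append(b)
        pairs.add(("".join(h1), "".join(h2)))
    return sorted(pairs)

# Hypothetical genotype, heterozygous at all three markers.
example = [("A", "C"), ("G", "A"), ("T", "C")]
for h1, h2 in compatible_haplotype_pairs(example):
    print(h1, h2)
```

With k heterozygous markers there are 2^(k-1) candidate pairs, which is why the number of ambiguous individuals grows quickly with the number of markers.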

[Figure: alleles observed at Marker1, Marker2 and Marker3, and the possible haplotypes]
5
Haplotypes are important!
  • Phase may determine phenotype
  • Phase helps exploit linkage disequilibrium
  • Infer state of neighboring alleles
  • Phase clarifies identity-by-descent status

6
Common Uses of Haplotypes
  • Linkage disequilibrium studies
  • Summarize genetic variation
  • Selecting markers to genotype
  • Identify haplotype tag SNPs
  • Candidate gene association studies
  • Test haplotype associations
  • Help interpret single marker associations
  • Understanding evolution of human populations

7
The problem
  • Haplotypes are hard to measure directly
  • X-chromosome in males
  • Sperm typing
  • Other molecular techniques
  • Often, statistical or combinatorial methods for
    reconstruction required

8
Haplotype Inference on population data
m6, m4
9
Information on Relatives
  • Number of ambiguous individuals increases rapidly
    with number of markers
  • Family information can help, but many ambiguities
    remain

10
Haplotype Inference on Pedigrees, Mendelian Law
11
Haplotype inference on pooled samples
  • The input contains n pools
  • Each pool contains k individuals, thus 2k
    haplotypes and m markers
  • At each marker, we are given the number of
    alleles for the k individuals for each pool
  • The goal is to find the haplotype frequencies
  • Example: n = 3, k = 2, m = 5

12
Haplotyping pedigree data: statistical formulation
  • Statistical formulation: find the most probable
    haplotype configuration
  • Need to calculate the probability of a pedigree
    on every haplotype configuration
  • Recall for linkage analysis, we need to calculate
    the probability of a pedigree, that sums over all
    possible haplotype configs

13
Haplotyping pedigree data: statistical
formulation
  • Thus the linkage programs like Genehunter,
    Allegro, Merlin could compute the most probable
    haplotypes
  • But, it is time consuming.
  • In addition to exact computation, there are some
    approximation algorithms, mainly based on
    importance sampling, e.g. SimWalk.
  • Still very time consuming, may consider many
    configurations with very small probabilities

14
Recombination and combinatorial formulation
15
MRHC Problem
  • Find a minimum recombinant haplotype
    configuration from a given pedigree with genotype
    data.
  • Assumptions
  • Mendelian law (no mutations)
  • Recombination events are rare.
  • Well supported from real data.

Input
16
MRHC Problem (contd)
  • PS: parental source of the two alleles at the
    locus (i.e. phase)
  • GS: grandparental source of an allele
  • Haplotype configuration: assignment of PS and GS
    values.

PS0
PS1
GS21
Output
17
Previous Results
  • Genotype elimination (O'Connell '00).
  • For data requiring no recombinant, exhaustive
    elimination.
  • Genetic algorithm (Tapadar et al. '00).
  • Time consuming.
  • MRH (Qian & Beckmann '02).
  • Six step rule-based algorithm.
  • Locus by locus at every step, extremely slow for
    biallelic (e.g. SNP) markers.

18
Thm. MRHC is NP-Hard.
  • Idea: reduction from a variant of set cover.
  • First complexity result.
  • Remains hard for two loci.
  • Remains hard when no loops.

Li & Jiang '03; Doi, Li & Jiang '03
19
Block-Extension Algorithm
  • Iterative, heuristic, five steps. Rules are
    derived from Mendelian law, MR principle, block
    concept and some greedy ideas based on the
    following observations
  • Block structures are common in haplotypes.
  • Double recombination events are rare.
  • Common haplotype blocks shared in siblings.
  • Advantages/Disadvantages
  • Time complexity: BE O(dmn) / MRH O(2^d m^3 n^2)

Li & Jiang '03
20
Block-Extension Algorithm
21
Block-Extension Algorithm
11 12 2 3 3 4
23 3 4 1 4 2
12(-1,0) 23(1,-1) 21(-1,-1) 3 4
13(-1,1) 24(1,-1) 24(1,-1) 32(-1,-1)
13 3 2 2 4 3 4
31(1,0) 42(1,-1) 42(1,-1) 23(1,-1)
22
Dynamic Programming Algorithms
  • Locus-based dynamic programming algorithm
  • Linear time in the number of the members
  • Applicable to only tree pedigrees
  • Member-based dynamic programming algorithm
  • Linear time in the number of the loci
  • Applicable to general pedigrees with small sizes

Doi, Li & Jiang '03
23
Locus-Based Dynamic Programming
24
Constraint-Finding Algorithm
  • Assumptions
  • No missing alleles, no errors.
  • Zero recombinants.
  • Idea: finding all feasible (i.e. 0-recombinant)
    haplotype configurations is equivalent to
    reducing the degrees of freedom in the PS/GS
    assignment.

Li & Jiang '03
25
Four Levels of Constraints
  • Based on Mendelian law (on a single locus)
  • Level 1: GS constraint
  • Level 2: PS constraint
  • Based on 0-recombinant (for a pair of loci)
  • Level 3: Haplotype constraint
  • Level 4: Grouping constraint

26
Level 3 and Level 4 Constraints
27
Level 3 and Level 4 Constraints
28
Analysis of Constraint-Finding Algorithm
  • Thm. Every solution consistent with the
    constraint equations is a feasible solution and
    vice versa.
  • Steps
  • find all constraints, in the form of linear
    equations over Z_2
  • solve the equations by Gaussian elimination
  • enumerate all feasible haplotype configurations
  • Exact, polynomial time: O(n^3 m^3) (genotype
    elimination is exponential)
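The "solve by Gaussian elimination" step can be sketched as follows. This is a generic solver for linear equations over Z_2, not the authors' implementation; treating each unknown as a PS or GS value, and the small system used as an example, are assumptions for illustration.

```python
def solve_gf2(rows, n_vars):
    """Solve a linear system over Z_2.

    rows: list of (coefficients, rhs), all entries 0 or 1.
    Returns (particular_solution, free_variable_indices), or None if the
    system is inconsistent. Free variables are set to 0 in the solution.
    """
    aug = [list(coeffs) + [rhs] for coeffs, rhs in rows]
    pivots = []
    rank = 0
    for col in range(n_vars):
        piv = next((i for i in range(rank, len(aug)) if aug[i][col]), None)
        if piv is None:
            continue  # no pivot in this column: the variable is free
        aug[rank], aug[piv] = aug[piv], aug[rank]
        for i in range(len(aug)):
            if i != rank and aug[i][col]:
                # Row addition over Z_2 is bitwise XOR.
                aug[i] = [x ^ y for x, y in zip(aug[i], aug[rank])]
        pivots.append(col)
        rank += 1
    # Remaining rows have all-zero coefficients; a right-hand side of 1
    # there means 0 = 1, i.e. no feasible assignment.
    if any(row[-1] for row in aug[rank:]):
        return None
    solution = [0] * n_vars
    for r, col in enumerate(pivots):
        solution[col] = aug[r][-1]
    free = [c for c in range(n_vars) if c not in pivots]
    return solution, free

# Hypothetical constraints: x0 + x1 = 1 and x1 + x2 = 0 (mod 2).
result = solve_gf2([([1, 1, 0], 1), ([0, 1, 1], 0)], 3)
print(result)
```

Each free variable doubles the number of solutions, so a system with f free variables has 2^f feasible assignments to enumerate.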

29
Integer Linear Programming
  • Combines missing data imputation and haplotype
    inference.
  • The number of variables is linear in the problem
    size, regardless of the pedigree structure and the
    number of recombinants.
  • Implicitly checks the Mendelian consistency of
    pedigree genotype data with missing alleles,
    which is itself an NP-complete problem.
  • Could find all possible optimal solutions.
  • Solved by a branch-and-bound algorithm.
  • Effective for practical size problems in terms of
    time efficiency.
  • Accurate in terms of missing alleles imputation
    and haplotype inference.

Li & Jiang '04a
30
ILP for MRHC with Missing Data
  • Define variables.
  • Define linear constraints.
  • Define a linear objective function of the
    variables.
  • Preprocess constraints.
  • Apply branch-and-bound strategy to find
    solutions. (a partial order relationship and some
    other special relationships).
  • Estimate bounds.
  • Apply a maximum likelihood approach to multiple
    optimal solutions.

31
Formulation
M_j = {m_k}: set of all possible alleles at marker
locus j, and let t_j = |M_j|. Example: M_1 = {1, 2},
M_2 = {1, 2}
32
Formulation Variables
  • Define 2 g vars for each paternal allele and
    maternal allele at locus j for individual i
  • Var g1 = 0 (or 1) iff the paternal allele is copied
    from the father's paternal (or maternal) allele. Var
    g2 is defined similarly.
  • Define r vars

33
Formulation Objective Function
  • Objective function

Subject to Genotype constraints
34
Formulation Constraints
  • Mendelian law of inheritance constraints (a child
    i and its father f )
  • Constraints for r vars

35
A Partial Order Relationship
  • Denote

Inequalities with 2 variables
36
Forced Variables
  • Rule 1
  • Rule 2
  • Rule 3

37
Lower and Upper Bounds
  • Lower bounds
  • Linear relaxation.
  • Summation of the number of recombinants in each
    nuclear family.
  • Effective for data with large number of
    recombinants.
  • Upper bound
  • Obtained by block-extension algorithm.
  • Effective for data with small number of
    recombinants.

38
Statistical Assessment
  • E-M algorithm to estimate haplotype frequencies
    for data that consist of multiple pedigrees.

39
PedPhase software
  • Simulated data were generated to compare our
    algorithms, as well as MRH in terms of
    efficiency, accuracy.
  • Three different pedigree structures.
  • Multiallelic and biallelic data.
  • Numbers of loci 10, 25 and 50.
  • Number of recombinants 0-4.
  • 100 runs per data set.

40
Pedigree Structures
41
Accuracy Results of BE Algorithm
42
Efficiency Results
43
More Results from ILP
44
Real Data Analysis
  • Data set (Gabriel et al. '02)
  • 93 members, 12 pedigrees (each with 7-8 members)
  • chromosome 3, 4 regions, each region 1-4 blocks.

45
Common Haplotypes Frequencies
46
Results From ILP on the Whole Dataset
3.82 4.00 0.45 0.034
47
What if there are no relatives?
  • Rely on linkage disequilibrium
  • Assume that population consists of small number
    of distinct haplotypes
  • Haplotypes tend to be similar

48
Clark's Haplotyping Algorithm
  • Clark (1990) Mol Biol Evol 7:111-122
  • One of the first haplotyping algorithms
  • Computationally efficient
  • Very fast
  • Today, more accurate alternatives are often
    available

49
Clark's Haplotyping Algorithm
  • Find homozygous individuals
  • Initialize a list of known haplotypes
  • Resolve ambiguous individuals
  • If possible, use two haplotypes from list
  • Otherwise, use one known haplotype and augment
    list
  • If unphased individuals remain
  • Assign phase randomly to one individual
  • Augment haplotype list and continue from previous
    step
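The steps above can be sketched as follows. This is a simplified illustration, not Clark's published code. It assumes two-allele markers coded 0/1 with 2 marking a heterozygous site, it seeds the list with individuals heterozygous at a single site as well as homozygotes (both are unambiguous), and it omits the final random-phase step, so individuals that cannot be resolved are simply left unphased.

```python
def compatible(h, g):
    """True if haplotype h could be one of the two haplotypes of genotype g."""
    return all(x == 2 or x == a for a, x in zip(h, g))

def complement(h, g):
    """The haplotype that, paired with h, yields genotype g."""
    return tuple(1 - a if x == 2 else a for a, x in zip(h, g))

def clark_phase(genotypes):
    known = []    # list of known haplotypes, in order of discovery
    phased = {}   # individual index -> (haplotype, haplotype)

    def add(h):
        if h not in known:
            known.append(h)

    # Initialize the list from unambiguous individuals.
    for idx, g in enumerate(genotypes):
        if g.count(2) <= 1:
            h1 = tuple(0 if x == 2 else x for x in g)
            h2 = tuple(1 if x == 2 else x for x in g)
            phased[idx] = (h1, h2)
            add(h1)
            add(h2)

    # Resolve ambiguous individuals with a known haplotype, augmenting the
    # list with the complementary haplotype, until nothing more changes.
    progress = True
    while progress:
        progress = False
        for idx, g in enumerate(genotypes):
            if idx in phased:
                continue
            for h in known:
                if compatible(h, g):
                    other = complement(h, g)
                    phased[idx] = (h, other)
                    add(other)
                    progress = True
                    break
    return phased, known

# Hypothetical sample of four individuals typed at three sites.
sample = [(0, 1, 0), (1, 1, 1), (2, 1, 2), (2, 2, 2)]
phased, known = clark_phase(sample)
print(phased)
```

The result depends on the order in which individuals and known haplotypes are visited, which is one reason the slides note that more accurate alternatives are often available.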

50
Haplotyping via Perfect Phylogeny - Model,
Algorithms, Empirical studies
  • Dan Gusfield, Ren Hua Chung
  • U.C. Davis
  • Cocoon 2003

51
The Perfect Phylogeny Model of Haplotype Evolution

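[Figure: perfect phylogeny over sites 1 2 3 4 5, rooted at the ancestral haplotype 00000, with site mutations (1, 4, 3, 2, 5) on the edges and the intermediate haplotype 00010; the extant haplotypes at the leaves are 10100, 10000, 01010 and 01011]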
52
The Perfect Phylogeny Model
  • We assume that the evolution of extant haplotypes
    can be displayed on a rooted, directed tree, with
    the all-0 haplotype at the root, where each site
    changes from 0 to 1 on exactly one edge, and
    each extant haplotype is created by accumulating
    the changes on a path from the root to a leaf,
    where that haplotype is displayed.
  • In other words, the extant haplotypes evolved
    along a perfect phylogeny with all-0 root.

53
Justification for Perfect Phylogeny Model
  • In the absence of recombination each haplotype of
    any individual has a single parent, so tracing
    back the history of the haplotypes in a
    population gives a tree.
  • Recent strong evidence for long regions of DNA
    with no recombination. Key to the NIH haplotype
    mapping project. (See NYT October 30, 2002)
  • Mutations are rare at selected sites, so are
    assumed non-recurrent.
  • Connection with coalescent models.

54
Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining
set of haplotypes that fits a perfect phylogeny.
A haplotype pair explains a genotype if the merge
of the haplotypes creates the genotype. Example:
the merge of 0 1 and 1 0 explains 2 2.
[Figure: genotype matrix S, with sites as columns]
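The merge operation can be written in a few lines. This is a minimal sketch using the 0/1/2 coding from the example (2 marks a site where the two haplotypes differ); the function names are illustrative.

```python
def merge(h1, h2):
    """Genotype produced by a pair of 0/1 haplotypes."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))

def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) explains the genotype."""
    return merge(h1, h2) == tuple(genotype)

print(merge((0, 1), (1, 0)))             # the slide's example pair
print(explains((0, 0), (1, 1), (2, 2)))  # a second pair for the same genotype
```

Both 0 1 with 1 0 and 0 0 with 1 1 merge to 2 2, which is exactly the ambiguity the PPH problem has to resolve.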
55
The PPH Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
56
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
[Figure: tree explanation for genotypes a, b, c over sites 1 and 2, with root 00 and haplotypes 00, 01 and 10]
57
The Alternative Explanation
No tree possible for this explanation
58
Efficient Solutions to the PPH problem - n
genotypes, m sites
  • Reduction to a graph realization problem (GPPH):
    build on the Bixby-Wagner or Fujishige solution to
    graph realization, O(nm alpha(nm)) time.
  • Reduction to graph realization: build on Tutte's
    graph realization method, O(nm^2) time.
  • Direct, from-scratch combinatorial approach:
    O(nm^2), Bafna et al.
  • Berkeley (EHK) approach: specialize the Tutte
    solution to the PPH problem, O(nm^2) time.

59
The DPPH Method
  • Bafna et al., O(nm^2) time
  • Based on deeper combinatorial observations about
    the PPH problem.
  • A matrix-centric approach (rather than
    tree-centric), although a graph is used in the
    algorithm.

First, we need to understand why some sets of
haplotypes have a perfect phylogeny, and some do
not.
60
When does a set of haplotypes fit a perfect
phylogeny?
  • Arrange the haplotypes in a matrix, two
    haplotypes for each individual. Then (with no
    duplicate columns), the haplotypes fit a unique
    perfect phylogeny if and only if no two columns
    contain all three pairs 0,1 and 1,0 and 1,1

This is the 3-Gamete Test
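A direct check of this condition looks at every pair of columns. The sketch below is the straightforward quadratic scan, not an optimized routine; the function name is an assumption, and haplotypes are tuples of 0/1 with the all-0 haplotype as the root.

```python
from itertools import combinations

def passes_three_gamete_test(haplotypes):
    """True if no two columns contain all three pairs 0,1 and 1,0 and 1,1."""
    n_sites = len(haplotypes[0])
    forbidden = {(0, 1), (1, 0), (1, 1)}
    for i, j in combinations(range(n_sites), 2):
        seen = {(h[i], h[j]) for h in haplotypes}
        if forbidden <= seen:
            return False
    return True

# The four haplotypes 10100, 10000, 01010, 01011 fit a perfect phylogeny.
tree_like = [(1, 0, 1, 0, 0), (1, 0, 0, 0, 0), (0, 1, 0, 1, 0), (0, 1, 0, 1, 1)]
print(passes_three_gamete_test(tree_like))

# Three haplotypes showing 0,1 and 1,0 and 1,1 in the same two columns do not.
print(passes_three_gamete_test([(0, 1), (1, 0), (1, 1)]))
```

The scan costs O(n m^2) for n haplotypes and m sites, since each of the m(m-1)/2 column pairs is compared across all rows.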
61
The Alternative Explanation
No tree possible for this explanation
62
The Tree Explanation Again
[Figure: tree explanation for genotypes a, b, c over sites 1 and 2, with root 0 0 and haplotypes 0 0 and 0 1]
63
PPH The Combinatorial Problem
Input: A ternary matrix (0,1,2) M with 2N
rows partitioned into N pairs of rows, where
the two rows in each pair are identical. Def:
If a pair of rows (r,r') in the partition have
entry values of 2 in a column j, then positions
(r,j) and (r',j) are called Mates.
64
  • Output: A binary matrix M' created from M
    by replacing each 2 in M with either 0 or 1,
    such that
  • a) A position is assigned 0 if and only if its Mate
    is assigned 1.
  • b) M' passes the 3-Gamete Test, i.e., does
    not contain a 3x2 submatrix (after row and
    column permutations) with all three
    combinations 0,1 1,0 and 1,1

65
Initial Observations
  • If two columns of M contain the following
    rows
  • 2 0
  • 2 0 mates
  • then M' will contain a row with 1 0 and a
    row with 0 1 in those columns.
  • This is a forced expansion.

66
Initial Observations
  • Similarly, if two columns of M contain the
    mates
  • 2 1
  • 2 1
  • then M' will contain a row with 1 1 and a row
    with 0 1 in those columns.
  • This is a forced expansion.

67
If a forced expansion of two columns creates 0 1
and 1 0 in those columns, then any mates 2 2 and
2 2 in those columns must be set to 0 1 and 1 0.
We say that the two columns are forced
out-of-phase.
If a forced expansion of two columns creates 1 1
in those columns, then any mates 2 2 and 2 2 in
those columns must be set to 1 1 and 0 0. We say
that the two columns are forced in-phase.
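The pairwise classification can be sketched as below. Several things here are assumptions: one genotype row is stored per individual instead of a duplicated pair of mate rows, the function names are illustrative, and the out-of-phase rule is taken to fire only when both 0 1 and 1 0 are forced, which is what the 3-Gamete Test requires.

```python
def forced_gametes(genotypes, i, j):
    """Column-pair values that every expansion must contain."""
    forced = set()
    for g in genotypes:
        a, b = g[i], g[j]
        if a == 2 and b == 2:
            continue  # 2 2 is the undecided case, nothing is forced
        for x in ((0, 1) if a == 2 else (a,)):
            for y in ((0, 1) if b == 2 else (b,)):
                forced.add((x, y))
    return forced

def phase_relation(genotypes, i, j):
    forced = forced_gametes(genotypes, i, j)
    in_phase = (1, 1) in forced
    out_of_phase = (0, 1) in forced and (1, 0) in forced
    if in_phase and out_of_phase:
        return "fail"          # 0 1, 1 0 and 1 1 are all unavoidable
    if in_phase:
        return "in-phase"      # any 2 2 must become 1 1 and 0 0
    if out_of_phase:
        return "out-of-phase"  # any 2 2 must become 0 1 and 1 0
    return "free"

print(phase_relation([(2, 1), (2, 2)], 0, 1))
print(phase_relation([(2, 0), (0, 2), (2, 2)], 0, 1))
print(phase_relation([(2, 0), (1, 1), (0, 2)], 0, 1))
```

Running this over all column pairs takes O(n m^2) time and yields the red and blue edges that the graph-based part of the algorithm starts from.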
68
Example
Columns 1 and 2, and 1 and 3 are forced
in-phase. Columns 2 and 3 are forced
out-of-phase.
[Figure: matrix M with columns 1 2 3 and mate rows a, b, c, with panels for column pairs 1 3, 1 2 and 2 3]
69
Immediate Failure
It can happen that the forced expansion of
cells creates a 3x2 submatrix that fails the
3-Gamete Test. In that case, there is no PPH
solution for M.
Example: the rows 2 0, 2 0, 1 1, 1 1, 0 2, 0 2
will fail the 3-Gamete Test.
70
An O(nm^2)-time Algorithm
  • Find all the forced phase relationships by
    considering columns in pairs.
  • Find all the inferred, invariant, phase
    relationships.
  • Find a set of column pairs whose phase
    relationship can be arbitrarily set, so that all
    the remaining phase relationships can be
    inferred.
  • Result: an implicit representation of all
    solutions to the PPH problem.

71
A running example.
[Figure: matrix M with columns 1 2 3 4 5 6 7 and mate rows a, b, c, d, e]
72
Overview of Bafna et al. algorithm
First, represent the forced phase relationships,
and the needed decisions, in a graph G.
73
Graph G
Each node represents a column in M, and each edge
indicates that the pair of columns has a row with
2s in both columns. The algorithm builds
this graph, and then checks whether any pair of
nodes is forced in or out of phase.
[Figure: graph G on nodes 1-7]
74
Graph Gc
Each Red edge indicates that the columns
are forced in-phase. Each Blue edge
indicates that the columns are forced
out-of-phase.
Let Gf be the subgraph of Gc defined by the red
and blue edges.
[Figure: graph Gc on nodes 1-7]
75
Graph Gf has three connected components.
[Figure: graph Gf on nodes 1-7]
76
The Central Theorem
  • There is a solution to the PPH problem for M if
    and only if there is a coloring of the dashed
    edges of Gc with the following property:
  • For any triangle (i,j,k) in Gc, where there
    is one row containing 2s in all three columns
    i, j and k (any triangle containing at least one
    dashed edge will be of this type), the coloring
    makes either 0 or 2 of the edges blue
    (out-of-phase).
  • Nice, but how do we find such a coloring?

77
Triangle Rule
Graph Gf
Theorem 1: If there are any dashed edges whose
ends are in the same connected component of Gf,
at least one edge is in a triangle where the
other edges are not dashed, and in every
PPH solution, it must be colored so that the
triangle has an even number of Blue
(out-of-phase) edges. This is an inferred coloring.
[Figure: graph Gf on nodes 1-7]
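The triangle rule is a parity constraint, so an inferred coloring is an exclusive-or of the two known colors. The sketch below repeatedly applies the rule until no triangle has exactly one uncolored edge. The encoding (red = 0, blue = 1), the function name and the tiny example graph are assumptions, and this naive loop is not the linear-time DFS ordering mentioned in the slides.

```python
RED, BLUE = 0, 1  # in-phase, out-of-phase

def propagate(colors, triangles):
    """Color dashed edges by the triangle rule.

    colors: dict mapping frozenset({u, v}) to RED or BLUE for colored edges.
    triangles: list of (i, j, k) node triples that share a row of 2s.
    Returns the updated dict; raises ValueError on an odd triangle.
    """
    changed = True
    while changed:
        changed = False
        for i, j, k in triangles:
            edges = [frozenset(p) for p in ((i, j), (j, k), (i, k))]
            unknown = [e for e in edges if e not in colors]
            if len(unknown) == 1:
                known = [colors[e] for e in edges if e in colors]
                # An even number of blue edges means the third color is
                # the XOR of the other two.
                colors[unknown[0]] = known[0] ^ known[1]
                changed = True
            elif not unknown and sum(colors[e] for e in edges) % 2:
                raise ValueError("triangle has an odd number of blue edges")
    return colors

# Hypothetical triangle on columns 1, 2, 3 with one dashed edge (1, 3).
colors = {frozenset((1, 2)): RED, frozenset((2, 3)): BLUE}
propagate(colors, [(1, 2, 3)])
print(colors[frozenset((1, 3))] == BLUE)
```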
78
[Figure: graph on nodes 1-7]
79
[Figure: graph on nodes 1-7]
80
[Figure: graph on nodes 1-7]
81
Corollary
Inside any connected component of Gf, ALL the
phase relationships on edges (columns of M) are
uniquely determined, either as forced
relationships based on pairwise column
comparisons, or by triangle-based inferred
colorings. Hence, the phase relationships of all
the columns in a connected component of Gf are
INVARIANT over all the solutions to the PPH
problem.
82
The dashed edges in Gf can be ordered so that the
inferred colorings can be done in linear time.
Modification of DFS. See the paper for details,
or assign it as a homework exercise.
83
Finishing the Solution
  • Problem: a connected component C of G may
    contain several connected components of Gf, so
    any edge crossing two components of Gf will still
    be dashed. How should they be colored?

84
How should we color the remaining dashed edges in
a connected component C of Gc?
[Figure: graph Gc on nodes 1-7]
85
Answer
For a connected component C of G with k
connected components of Gf, select any subset S
of k-1 dashed edges in C, so that S together
with the red and blue edges span all the nodes of
C. Arbitrarily, color each edge in S either red
or blue. Infer the color of any remaining dashed
edges by successive use of the triangle rule.
86
Pick and color edges (2,5) and (3,7). The
remaining dashed edges are colored by using the
triangle rule.
[Figure: graph Gc on nodes 1-7]
87
[Figure: graph on nodes 1-7]
88
Theorem 2
  • Any selected S works (allows the triangle rule to
    work) and any coloring of the edges in S
    determines the colors of any remaining dashed
    edges.
  • Different colorings of S determine different
    colorings of the remaining dashed edges.
  • Each different coloring of S determines a
    different solution to the PPH problem.
  • All PPH solutions can be obtained in this way,
    i.e. using just one selected S set, but coloring
    it in all 2^(k-1) ways.

89
Comparing the programs - R.H. Chung
  • All three are fast and practical (under one
    second) on problem instances of size 50 x 30.
  • DPPH is the fastest, followed by HPPH and GPPH.
  • HPPH encounters memory problems with large input.

90
sites individ GPPH DPPH HPPH
Times shown are in seconds on an 800 MHz machine.
91
A Phase-Transition
Problem: as the ratio of sites to genotypes
changes, how does the probability that the PPH
solution is unique change? For greatest utility,
we want genotype data where the PPH solution is
unique. Intuitively, as the ratio of genotypes
to sites increases, the probability of uniqueness
increases.
92
Extension
  • With recombination
  • The papers: see wwwcsif.cs.ucdavis.edu/gusfield

93
The E-M Haplotyping Algorithm
  • Excoffier and Slatkin (1995) Mol Biol Evol
    12:921-927
  • Provide a clear outline of how the algorithm can
    be applied to genetic data
  • Combination of two strategies
  • E-M statistical algorithm for missing data
  • Counting algorithm for allele frequencies

94
E-M Algorithm For Haplotyping
  • 1. Guesstimate haplotype frequencies
  • 2. Use current frequency estimates to replace
    ambiguous genotypes with fractional counts of
    phased genotypes
  • 3. Estimate frequency of each haplotype by
    counting
  • 4. Repeat steps 2 and 3 until frequencies are
    stable
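The four steps can be sketched as follows for two-allele markers. This is a minimal illustration, not the Excoffier and Slatkin code: it assumes random mating (pair probability equals the product of the two haplotype frequencies), uses the 0/1/2 genotype coding with 2 for a heterozygous site, starts from uniform frequencies, and runs a fixed number of iterations instead of testing for stability.

```python
from collections import defaultdict
from itertools import product

def expansions(g):
    """All ordered haplotype pairs (h1, h2) whose merge is genotype g."""
    het = [i for i, x in enumerate(g) if x == 2]
    out = []
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        out.append((tuple(h1), tuple(h2)))
    return out

def em_haplotype_frequencies(genotypes, n_iter=50):
    phasings = [expansions(g) for g in genotypes]
    haps = {h for ps in phasings for pair in ps for h in pair}
    # Step 1: initial guess, here uniform over all candidate haplotypes.
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for ps in phasings:
            # Step 2: fractional counts of each phased genotype.
            weights = [freq[h1] * freq[h2] for h1, h2 in ps]
            total = sum(weights)
            for (h1, h2), w in zip(ps, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # Step 3: re-estimate frequencies by counting (2 haplotypes each).
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Hypothetical sample: two homozygotes and one double heterozygote.
freq = em_haplotype_frequencies([(0, 0), (1, 1), (2, 2)])
for h, f in sorted(freq.items()):
    print(h, round(f, 4))
```

Because `expansions` lists 2^k phasings for k heterozygous sites, the cost of each iteration grows exponentially with the number of markers.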

95
E-M Algorithm for Haplotyping
  • Cost grows rapidly with number of markers
  • Typically appropriate for < 25 SNPs
  • Fewer microsatellites
  • More accurate than Clark's method
  • Fully or partially phased individuals contribute
    most of the information

96
Enhancements to E-M
  • List only haplotypes present in sample
  • Gradually expand subset of markers under
    consideration, eliminating haplotypes with low
    estimated frequency from consideration at each
    stage
  • SNPHAP: Clayton (2001)
  • HAPLOTYPER: Qin et al. (2002)

97
Divide-And-Conquer Approximation
  • No. of potential haplotypes increases
    exponentially
  • Actual no. of haplotypes doesn't
  • Approximation
  • Successively divide marker set
  • Run E-M assuming segments associate randomly
  • Proceed, ignoring composites of segments with
    zero frequency
  • Order m log m
  • Exact E-M is order 2^m

98
Other Recent Developments
  • Newer methods try to further improve haplotype
    estimation by favoring sets of similar haplotypes
  • Stephens et al. (2001) Am J Hum Genet 68:978-89
  • Genealogical approach, which implies haplotypes
    are similar to each other

99
Method based on Gibbs sampler
  • MCMC method
  • Stochastic, random procedure
  • Improves solution gradually
  • Given initial set of haplotypes
  • Sample haplotypes for one individual at a time,
    assuming other haplotypes are true
  • Repeat a few million times

100
Update Procedure I
  • Pick individual U to update at random
  • Calculate haplotype frequencies F in all other
    individuals
  • Since everyone is phased, this is done by
    counting
  • Sample new haplotypes for U from conditional
    distribution of U's haplotypes given F
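One update of this kind can be sketched as below. The small pseudocount added to every count is an assumption introduced here so that haplotypes absent from the other individuals keep a nonzero probability; the function names, the initial phasing rule and the example data are also illustrative.

```python
import random
from collections import Counter
from itertools import product

def initial_phase(g):
    """Arbitrary starting phase: heterozygous sites get 0 on h1, 1 on h2."""
    h1 = tuple(0 if x == 2 else x for x in g)
    h2 = tuple(1 if x == 2 else x for x in g)
    return h1, h2

def gibbs_update(genotypes, phase, rng, pseudo=0.1):
    """Resample the haplotype pair of one randomly chosen individual U."""
    u = rng.randrange(len(genotypes))
    # F: haplotype counts in all other individuals (everyone is phased).
    counts = Counter(h for i, pair in enumerate(phase) if i != u for h in pair)
    g = genotypes[u]
    het = [i for i, x in enumerate(g) if x == 2]
    candidates, weights = [], []
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        h1, h2 = tuple(h1), tuple(h2)
        candidates.append((h1, h2))
        weights.append((counts[h1] + pseudo) * (counts[h2] + pseudo))
    # Sample from the conditional distribution of U's haplotypes given F.
    phase[u] = rng.choices(candidates, weights=weights, k=1)[0]
    return u

# Hypothetical genotypes, coded 0/1 with 2 for a heterozygous site.
genotypes = [(0, 0, 0), (1, 1, 1), (2, 2, 2), (2, 2, 0)]
phase = [initial_phase(g) for g in genotypes]
rng = random.Random(0)
for _ in range(200):
    gibbs_update(genotypes, phase, rng)
print(phase)
```

A real run would repeat the update far more often and summarize the sampled phasings, rather than reading off the final state.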

101
Update Procedure I
  • This procedure would produce an estimate of
    haplotype frequencies that is equivalent to that of
    the E-M algorithm
  • Stephens et al (2001) suggested an alternative
    estimate of F

102
Update Procedure II
  • Estimate F from the other individuals
  • Construct F' to include haplotypes in F and also
    other similar haplotypes (possibly differing at a
    few sites, due to mutations)
  • Update U's haplotypes conditional on F'