Title: Discovering Patterns in Haplotype Data
1Discovering Patterns in Haplotype Data
- Lipari, July 24, 2003
- Esko Ukkonen
- University of Helsinki
- Esko.Ukkonen_at_Helsinki.Fi
2Only recombinations; mutations not shown
5Haplotype blocks
- Seminal paper by Daly et al. (Nat Gen, 2001)
- Data: 500 kb region, 103 SNPs, 258 chromosomes (haplotypes)
- Finds blocks with striking lack of variation
- Recombination hot spots? (physical explanation)
- or just population history? (by chance)
- Other papers
- Patil et al. (Science 2001), Gabriel et al. (Science 2002), Zhang et al. (PNAS 2002)
6Blocks by Daly et al. (Nat Gen 2001)
7Blocks by Daly et al. (Nat Gen 2001)
12Figure (Patil et al.)
29Analysis results
- Seminal paper by Daly et al. (Nat Gen, 2001), data:
- 500 kb region,
- 103 SNPs,
- 258 chromosomes (haplotypes)
31Figure: Best segmentation; Boundary strength; Segmentations with gaps
33Figure: Best segmentation; Boundary strength; Gaps allowed
34Figure: Gaps allowed
35Generalizations
- microsatellite markers
- generalization to unphased data
- distances between markers
- population mixtures / comparing several block structures
36Global blocks vs mosaic
- blocks are convenient for describing potential cross-over points
- not every haplotype must have a cross-over at every block boundary
- the true underlying fragmentation is more like a mosaic than simple blocks
- How to find the best mosaic?
37Other view: set cover
- no a priori assumption of a block structure
- find all consistent submatrices of D
- a consistent submatrix consists of (almost) identical (sub)haplotypes (hence, it can come from the same founder)
- find an optimal cover of D from the submatrices
38Each fragment of a haplotype induces a submatrix
consisting of all (almost) identical fragments
39- Generalized mosaic model
- Leave some fraction of D uncovered
- Allow some noise inside each submatrix
40Greedy set cover heuristics
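The slides do not spell out the heuristic, so the following is only a minimal sketch of a textbook greedy set cover applied to this setting. Assumptions: candidate submatrices are induced by every fragment (row i, columns [a, b)) as on the previous slides, "almost identical" means Hamming distance at most `max_hamming`, the greedy score is simply the number of newly covered cells, and `target` is the fraction of D that must be covered. The names `induced_submatrix` and `greedy_cover` are hypothetical.

```python
def induced_submatrix(rows, i, a, b, max_hamming=0):
    """Cells of all rows whose fragment on columns [a, b) is within
    max_hamming mismatches of row i's fragment on the same columns."""
    ref = rows[i][a:b]
    members = [r for r, row in enumerate(rows)
               if sum(x != y for x, y in zip(row[a:b], ref)) <= max_hamming]
    return frozenset((r, c) for r in members for c in range(a, b))


def greedy_cover(rows, max_hamming=0, target=1.0):
    """Greedy set cover: repeatedly pick the candidate submatrix that covers
    the most still-uncovered cells until a fraction `target` of D is covered.
    The scoring rule is an assumption, not taken from the slides."""
    n, p = len(rows), len(rows[0])
    candidates = {induced_submatrix(rows, i, a, b, max_hamming)
                  for i in range(n)
                  for a in range(p)
                  for b in range(a + 1, p + 1)}
    uncovered = {(r, c) for r in range(n) for c in range(p)}
    chosen = []
    while n * p - len(uncovered) < target * n * p:
        best = max(candidates, key=lambda s: len(s & uncovered))
        chosen.append(best)
        uncovered -= best
    return chosen, 1 - len(uncovered) / (n * p)
```

Every cell belongs to at least one candidate (the fragment starting in its own row), so each round covers at least one new cell and the loop terminates for any `target` up to 1.0. Enumerating all fragments is only practical for small matrices.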
41Daly et al. data analyzed by the greedy set cover algorithm
42Uncovering founder sequences
- Problem: Given current sequences (haplotypes), construct founders that could have produced the sequences by recombinations
- recombination hot spots?
- visualizations of potential recombination structures: What do the blocks look like?
43Example
Recombinants:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
45Example (same recombinants)
Founders:
0 0 1 0 1 1 1 0 0 0 1 1
1 1 1 1 0 0 0 1 0 1 1 0
6 cross-overs
47Example (same recombinants)
Founders:
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
18 cross-overs
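The cross-over counts in the example can be checked with a small dynamic program. This is a sketch under one assumption about the lost layout: the flattened digits are read as four recombinants of length 12, followed by two founders of length 12. The function name `min_crossovers_parse` is hypothetical.

```python
def min_crossovers_parse(seq, founders):
    """Minimum number of switches between founders needed to spell `seq`
    column by column (position j of seq is copied from position j of a founder).
    Returns float('inf') if no parse exists."""
    inf = float("inf")
    # cost[k] = fewest cross-overs so far if the current column is copied from founder k
    cost = [0 if f[0] == seq[0] else inf for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)
        cost = [min(cost[k], best + 1) if f[j] == seq[j] else inf
                for k, f in enumerate(founders)]
    return min(cost)


recombinants = ["001000010011", "111111100110", "001011110110", "111100100011"]
founders_a = ["001011100011", "111100010110"]
founders_b = ["000000000000", "111111111111"]

total_a = sum(min_crossovers_parse(d, founders_a) for d in recombinants)
total_b = sum(min_crossovers_parse(d, founders_b) for d in recombinants)
```

Under this reading, `total_a` is 6 and `total_b` is 18, which agrees with the counts stated on the slides.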
48Example: founders and recombinants (generated data)
49Founder sequence reconstruction
- Set of recombinants D = {D1,...,Dn}, sequences of length p over some alphabet ({0,1}, {A,C,G,T}, ...)
- Find founder sequences F = {F1,...,Fk} of length p such that D has a parse in terms of F: each Di is a catenation of fragments of the Fj's
51Example: founders and recombinants (generated data)
52Colorings, cross-overs, fragments
- find a consistent coloring: different symbols in the same column of D must have different colors
- colors = founders
- cross-over: two different colors adjacent on the same Di
- fragment: block of Di with the same color
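These definitions can be made concrete in a few lines. The sketch below assumes a coloring is given as one string of color labels per row of D, aligned column by column; the function names and the sample coloring are hypothetical illustrations, not taken from the slides.

```python
def is_consistent(rows, colors):
    """True if, in every column, cells with the same color carry the same symbol
    (equivalently: different symbols in a column get different colors)."""
    for j in range(len(rows[0])):
        symbol_of = {}
        for row, col in zip(rows, colors):
            if symbol_of.setdefault(col[j], row[j]) != row[j]:
                return False
    return True


def count_crossovers(colors):
    """A cross-over is a pair of adjacent cells on the same row with different colors."""
    return sum(a != b for c in colors for a, b in zip(c, c[1:]))


def count_fragments(colors):
    """Each row splits into (its cross-overs + 1) maximal same-color fragments."""
    return count_crossovers(colors) + len(colors)


def founders_from_coloring(rows, colors):
    """Read off one founder per color; '?' marks columns where a color is unused."""
    p = len(rows[0])
    founders = {c: ["?"] * p for col in colors for c in col}
    for row, col in zip(rows, colors):
        for j, (sym, c) in enumerate(zip(row, col)):
            founders[c][j] = sym
    return {c: "".join(f) for c, f in founders.items()}


rows = ["001000010011", "111111100110", "001011110110", "111100100011"]
colors = ["aaaabbbbbaaa", "bbbbaaaaabbb", "aaaaaaabbbbb", "bbbbbbaaaaaa"]
```

For this sample coloring with two colors, `is_consistent` holds, there are 6 cross-overs and 10 fragments, and the colors read off as the founders `001011100011` and `111100010110`.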
53Optimization
- minimize number of colors (= founders)
- minimize number of fragments or number of cross-overs (= maximize the length of fragments)
- given an upper bound M for colors, maximize fragment length
- given a lower bound L for (average) fragment length, minimize colors
54Block driven approach
- find the vertical segments (= haplotype blocks) of D first, then minimize colors or cross-overs
55Color propagation
Figure labels: X, Y
56Color propagation
If we select color(Y) = color(X), then w(X,Y) = |X ∩ Y| crossovers are eliminated
58Figure labels: P', P, W(X,Y), X, Y
w(X,Y) = |X ∩ Y| = number of common rows of X and Y
W(P',P) = maximum weight of a matching
U(P',P) = n - W(P',P) = minimum number of cross-overs
59Color propagation (cont.)
- weighted bipartite graph whose maximum weight matching gives minimum number of crossovers at this boundary
- Hungarian algorithm: O(M^3)
- => optimal coloring for given segmentation if crossovers only on segment boundaries and all M colors non-empty
- coloring time O(p(n + M^3))
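A small sketch of the matching step at one boundary. To stay self-contained it replaces the Hungarian algorithm with a brute-force search over all color matchings, which is only usable for small M. The function name `boundary_crossovers`, the representation of a partition as a list of row-index sets, and the padding with empty classes are assumptions for illustration.

```python
from itertools import permutations


def boundary_crossovers(prev_classes, next_classes, n):
    """Return (U, matching): U = n - W, where W is the maximum total weight
    sum of |X ∩ Y| over matched class pairs (X, Y), and matching[i] is the
    index of the next-segment class that inherits the color of class i."""
    m = max(len(prev_classes), len(next_classes))
    # Pad with empty classes so both sides have the same number of colors.
    a = [set(x) for x in prev_classes] + [set()] * (m - len(prev_classes))
    b = [set(y) for y in next_classes] + [set()] * (m - len(next_classes))
    best_w, best_pi = max(
        (sum(len(a[i] & b[pi[i]]) for i in range(m)), pi)
        for pi in permutations(range(m))
    )
    return n - best_w, best_pi


# Rows 0..3 grouped by symbol on each side of a boundary (hypothetical data).
u1, _ = boundary_crossovers([{0, 2}, {1, 3}], [{0, 3}, {1, 2}], n=4)
u2, pi2 = boundary_crossovers([{0, 3}, {1, 2}], [{0}, {1, 2, 3}], n=4)
```

Here `u1` is 2 (every matching keeps only two rows on their color) and `u2` is 1 with the identity matching. For larger M, a polynomial assignment solver would take the place of the permutation loop.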
60General case
- all segmentations, all consistent colorings
- color propagation as above
- dynamic programming + bipartite matching
61Dynamic programming
P', P: partitions of columns j-1 and j
U(P',P) = minimum possible number of cross-overs when P' and P are given colors
S(j,P) = smallest possible number of cross-overs in M-colorings of D(1,j) that have coloring-induced partition P at column j
62Dynamic programming (cont.)
S(1,P) = 0
S(j,P) = min ( S(j-1,P') + U(P',P) )
where the minimum is over all consistent partitions P' of column j-1
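The recurrence can be tried out on tiny inputs with the sketch below. It makes several simplifying assumptions that are not from the slides: partitions are enumerated exhaustively, "M-coloring" is read as "at most M colors" (missing classes are padded with empty ones), and U is computed by brute force over color matchings rather than by the Hungarian algorithm. All function names are hypothetical.

```python
from itertools import permutations


def set_partitions(n, max_blocks):
    """Yield every partition of range(n) into at most max_blocks blocks."""
    def rec(i, blocks):
        if i == n:
            yield tuple(frozenset(b) for b in blocks)
            return
        for b in blocks:                 # put row i into an existing block
            b.append(i)
            yield from rec(i + 1, blocks)
            b.pop()
        if len(blocks) < max_blocks:     # or open a new block
            blocks.append([i])
            yield from rec(i + 1, blocks)
            blocks.pop()
    yield from rec(0, [])


def is_consistent(partition, column):
    """Rows in the same class must carry the same symbol in this column."""
    return all(len({column[r] for r in block}) == 1 for block in partition)


def boundary_cost(p_prev, p_cur, n, m):
    """U(P',P) = n - W(P',P), with W found by trying all color matchings."""
    a = list(p_prev) + [frozenset()] * (m - len(p_prev))
    b = list(p_cur) + [frozenset()] * (m - len(p_cur))
    w = max(sum(len(a[i] & b[pi[i]]) for i in range(m))
            for pi in permutations(range(m)))
    return n - w


def min_crossovers(rows, m):
    """S(1,P) = 0; S(j,P) = min over P' of S(j-1,P') + U(P',P).
    Returns None if some column has no consistent partition with m colors."""
    n, p = len(rows), len(rows[0])
    all_parts = list(set_partitions(n, m))
    prev = None
    for j in range(p):
        column = [row[j] for row in rows]
        parts = [q for q in all_parts if is_consistent(q, column)]
        if not parts:
            return None
        if prev is None:
            prev = {q: 0 for q in parts}
        else:
            prev = {q: min(s + boundary_cost(q_prev, q, n, m)
                           for q_prev, s in prev.items())
                    for q in parts}
    return min(prev.values())


rows = ["001000010011", "111111100110", "001011110110", "111100100011"]
```

For these four rows, `min_crossovers(rows, 2)` gives 6 and `min_crossovers(rows, 4)` gives 0 (every row gets its own color), while `min_crossovers(rows, 1)` returns None because a column with two symbols needs at least two colors.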
63General case (cont.)
- maximize average fragment length when M colors available
- O(pN) where N = number of different M-partitions of n elements
- Slow, only small M feasible
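To get a feeling for how fast N grows, the sketch below counts partitions of n labeled rows into at most M non-empty classes using the standard recurrence for Stirling numbers of the second kind. Reading "M-partitions" as "at most M classes" is an assumption, and the count ignores the consistency restriction, so it is an upper bound on the partitions actually visited. The function name `count_partitions` is hypothetical.

```python
def count_partitions(n, max_blocks):
    """Number of partitions of n labeled elements into at most max_blocks
    non-empty blocks, via S(i,k) = k*S(i-1,k) + S(i-1,k-1)."""
    s = [1] + [0] * max_blocks           # row i = 0: S(0,0) = 1
    for _ in range(n):
        new = [0] * (max_blocks + 1)
        for k in range(1, max_blocks + 1):
            new[k] = k * s[k] + s[k - 1]
        s = new
    return sum(s)
```

For example, `count_partitions(4, 2)` is 8 and `count_partitions(10, 4)` is 43947, so even modest numbers of rows make exhaustive enumeration expensive.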
64Optimal result for M = 4
65- Data: Daly et al.
- full cover
- Hamming 0
66- Data: Daly et al.
- full cover
- Hamming 1
67- Data: Daly et al.
- 85% cover
- Hamming 0
68- Data: Daly et al.
- 85% cover
- Hamming 1
69Implementations
- MDL block finder (M. Koivisto) www.cs.helsinki.fi/u/mkhkoivi/software.html
- HaploVisual (P. Rastas) www.cs.helsinki.fi/u/prastas/haplovisual/
70On the MDL
- M. Hansen, B. Yu: Model selection and the principle of minimum description length. J Am Stat Assoc 96 (2001), 746-774.
71Papers
- M. Koivisto, M. Perola, T. Varilo, W. Hennah, J. Ekelund, M. Lukk, L. Peltonen, E. Ukkonen, H. Mannila: An MDL method for finding haplotype blocks and for estimating the strength of haplotype block boundaries. Pacific Symposium on Biocomputing 2003.
- H. Mannila et al.: Minimum description length block finder... . AJHG 73 (2003).
- E. Ukkonen: Finding founder sequences from a set of recombinants. Proc. Workshop on Algorithms for Bioinformatics WABI-2002. LNCS, Springer 2002.
- K. Zhang et al.: Dynamic programming algorithms for haplotype block ... . RECOMB 2003.