Title: Discovering Patterns in Haplotype Data


1
Discovering Patterns in Haplotype Data
  • Lipari, July 24, 2003
  • Esko Ukkonen
  • University of Helsinki
  • Esko.Ukkonen_at_Helsinki.Fi

2
Only recombinations; mutations not shown
3
(No Transcript)
4
(No Transcript)
5
Haplotype blocks
  • Seminal paper by Daly et al. (Nat Gen, 2001)
  • Data: 500 kb region, 103 SNPs, 258 chromosomes
    (haplotypes)
  • Finds blocks with striking lack of variation
  • Recombination hot spots? (physical explanation)
  • or just population history? (by chance)
  • Other papers
  • Patil et al. (Science 2001), Gabriel et al.
    (Science 2002), Zhang et al. (PNAS 2002)

6
Blocks by Daly et al. (Nat Gen 2001)
12
Figure (Patil et al.)
29
Analysis results
  • Seminal paper by Daly et al. (Nat Gen, 2001),
    data:
  • 500 kb region,
  • 103 SNPs,
  • 258 chromosomes (haplotypes)

31
Best segmentation
Boundary strength
Segmentations with gaps
33
Best segmentation
Boundary strength
Gaps allowed
34


Gaps allowed
35
Generalizations
  • microsatellite markers
  • generalization to unphased data
  • distances between markers
  • population mixtures / comparing several block
    structures

36
Global blocks vs mosaic
  • blocks convenient for describing potential
    cross-over points
  • not every haplotype must have a cross-over at
    every block boundary
  • the true underlying fragmentation more like a
    mosaic than simple blocks
  • How to find the best mosaic?

37
Other view: set cover
  • no a priori assumption of a block structure
  • find all consistent submatrices of D
  • consistent submatrix consists of (almost)
    identical (sub)haplotypes (hence, it can come
    from the same founder)
  • find an optimal cover of D from the submatrices

38
Each fragment of a haplotype induces a submatrix
consisting of all (almost) identical fragments
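A minimal sketch of how a fragment could induce its submatrix, assuming D is a list of equal-length strings and "almost identical" means at most a given number of mismatches. The names `hamming`, `induced_submatrix`, and `max_mismatches` are illustrative and do not come from the talk.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def induced_submatrix(D, i, start, end, max_mismatches=0):
    """Rows of D whose columns start..end-1 are (almost) identical to the
    fragment D[i][start:end]. Together with the column range, these rows
    form the submatrix induced by that fragment."""
    fragment = D[i][start:end]
    return [r for r, row in enumerate(D)
            if hamming(row[start:end], fragment) <= max_mismatches]


# Hypothetical toy data, not the Daly et al. data set.
D = ["00110", "00111", "10110", "11000"]
print(induced_submatrix(D, 0, 0, 4))                    # exact matches only
print(induced_submatrix(D, 0, 0, 4, max_mismatches=1))  # one mismatch allowed
```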
39
  • Generalized mosaic model
  • Leave some fraction of D uncovered
  • Allow some noise inside each submatrix

40
Greedy set cover heuristics
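The slide gives no details of the heuristic, so the following is only the textbook greedy set cover rule: repeatedly pick the candidate that covers the most still-uncovered cells. The function name, the dictionary of candidates, and the `target_fraction` parameter (modeling "leave some fraction of D uncovered") are assumptions made for illustration.

```python
def greedy_cover(universe, candidates, target_fraction=1.0):
    """Textbook greedy set cover.

    universe        -- iterable of cells to cover
    candidates      -- dict mapping a candidate name to its set of cells
    target_fraction -- stop once at least this fraction of cells is covered
    Returns (names chosen in order, cells left uncovered).
    """
    uncovered = set(universe)
    allowed_uncovered = len(uncovered) * (1.0 - target_fraction)
    chosen = []
    while candidates and len(uncovered) > allowed_uncovered:
        # Candidate with the largest gain on the still-uncovered cells.
        best = max(candidates, key=lambda name: len(candidates[name] & uncovered))
        gain = candidates[best] & uncovered
        if not gain:          # nothing left that any candidate can cover
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Hypothetical candidates; in the haplotype setting each set would hold the
# (row, column) cells of one consistent submatrix.
candidates = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {5, 6}, "D": {1}}
print(greedy_cover(range(1, 7), candidates))
```

The greedy rule is a heuristic: it does not guarantee a cover with the fewest sets.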

41
Daly et al. data analyzed by the greedy set cover
algorithm
42
Uncovering founder sequences
  • Problem: Given current sequences (haplotypes),
    construct their founders that could produce the
    sequences by recombinations
  • recombination hot spots?
  • visualizations of potential recombination
    structures: What do the blocks look like?

43
Example
0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0
0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1
44
Example
0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0
0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1
45
Example
0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0
0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 0
0 1 0 1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 1 0 1 1 0
6 cross-overs
46
Example
0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0
0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1
47
Example
0 0 1 0 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0
0 1 0 1 1 1 1 0 1 1 0 1 1 1 1 0 0 1 0 0 0 1 1 0
0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
18 cross-overs
48
Example: founders and recombinants (generated data)
49
Founder sequence reconstruction
  • Set of recombinants D = {D1,...,Dn}, sequences of
    length p over some alphabet ({0,1},
    {A,C,G,T}, ...)
  • Find founder sequences F = {F1,...,Fk} of length
    p such that D has a parse in terms of F: each Di is
    a catenation of fragments of the Fj's
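To make "parse in terms of F" concrete, here is a small sketch that, for a fixed founder set, computes the fewest cross-overs needed to spell one recombinant from the founders. It assumes a fragment must keep its column position (position i of Di is copied from position i of some Fj). The function name and the toy founders are hypothetical.

```python
def min_crossover_parse(d, founders):
    """Fewest cross-overs needed to write d as a catenation of founder
    fragments, or None if some symbol of d occurs in no founder at that
    position. cost[j] = best cost of a parse ending in founder j."""
    INF = float("inf")
    cost = [0 if f[0] == d[0] else INF for f in founders]
    for i in range(1, len(d)):
        best_prev = min(cost)
        cost = [
            # stay on founder j for free, or switch to it for one cross-over
            min(cost[j], best_prev + 1) if f[i] == d[i] else INF
            for j, f in enumerate(founders)
        ]
    best = min(cost)
    return None if best == INF else best


# Hypothetical founders and recombinants.
founders = ["00000000", "00001111", "11111111"]
print(min_crossover_parse("00100001", founders))
print(min_crossover_parse("00001111", founders))
```

Summing this quantity over all Di gives the total number of cross-overs of the best parse for the given founders; finding good founders is the harder part of the problem.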


51
Example: founders and recombinants (generated data)
52
Colorings, cross-overs, fragments
  • find a consistent coloring: different symbols on
    the same column of D must have different colors
  • colors = founders
  • cross-over: two different colors adjacent on the
    same Di
  • fragment: block of Di with the same color
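The definitions above can be sketched in code, assuming D is a list of equal-length strings and a coloring is a parallel list of strings with one color letter per symbol. The function names are illustrative only.

```python
def is_consistent(D, coloring):
    """True if, in every column, cells with different symbols never share a
    color (so each color spells a single founder symbol per column)."""
    for j in range(len(D[0])):
        symbol_of = {}                      # color -> symbol seen in column j
        for row, colors in zip(D, coloring):
            if symbol_of.setdefault(colors[j], row[j]) != row[j]:
                return False
    return True


def count_crossovers(coloring):
    """Number of places where two different colors are adjacent on a row."""
    return sum(a != b for colors in coloring for a, b in zip(colors, colors[1:]))


def count_fragments(coloring):
    """Each row has one fragment more than it has cross-overs."""
    return count_crossovers(coloring) + len(coloring)


# Hypothetical two-row example with colors A and B.
D = ["0011", "0111"]
coloring = ["AABB", "ABBB"]
print(is_consistent(D, coloring), count_crossovers(coloring), count_fragments(coloring))
```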

53
Optimization
  • minimize number of colors (= founders)
  • minimize number of fragments or number of
    cross-overs (= maximize the length of fragments)
  • given an upper bound M for colors, maximize
    fragment length
  • given a lower bound L for (average) fragment
    length, minimize colors

54
Block driven approach
  • D
  • find the vertical segments (= haplotype blocks)
    first, then minimize colors or cross-overs

55
Color propagation
  • D

Y
X

56
Color propagation
  • D

Y
X
If we select color(Y) = color(X), then w(X,Y) = |X ∩ Y|
crossovers are eliminated
58
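P, P' = partitions of the rows on the two sides of a
boundary, with classes X and Y
w(X,Y) = |X ∩ Y| = number of common rows of X and Y
W(P,P') = maximum weight of a matching
U(P,P') = n − W(P,P') = minimum number of cross-overs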
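A small sketch of U(P,P') = n − W(P,P'), assuming each partition is given as a list of sets of row indices. For brevity the maximum-weight matching is found by trying all permutations, which is only sensible for small M; this brute force is a stand-in chosen here and is not the Hungarian algorithm. Function names are illustrative.

```python
from itertools import permutations


def w(X, Y):
    """Weight of pairing class X with class Y: their number of common rows."""
    return len(X & Y)


def min_crossovers_at_boundary(P_left, P_right, n):
    """U = n - W, where W is the maximum weight of a matching between the
    classes of the two partitions. Brute force over all pairings."""
    M = max(len(P_left), len(P_right))
    # Pad with empty classes so both sides have M classes.
    left = list(P_left) + [frozenset()] * (M - len(P_left))
    right = list(P_right) + [frozenset()] * (M - len(P_right))
    W = max(
        sum(w(left[i], right[perm[i]]) for i in range(M))
        for perm in permutations(range(M))
    )
    return n - W


# Hypothetical partitions of rows 0..3 on the two sides of a boundary.
print(min_crossovers_at_boundary([{0, 1}, {2, 3}], [{0, 2}, {1, 3}], 4))
print(min_crossovers_at_boundary([{0, 1, 2}, {3}], [{0, 1}, {2, 3}], 4))
```

Rows that stay in matched classes keep their color across the boundary; every other row needs a cross-over there.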
59
Color propagation (cont.)
  • weighted bipartite graph whose maximum weight
    matching gives minimum number of crossovers at
    this boundary
  • Hungarian algorithm O(M^3)
  • => optimal coloring for a given segmentation if
    crossovers only on segment boundaries and all M
    colors non-empty
  • coloring time O(p(n + M^3))

60
General case
  • all segmentations, all consistent colorings
  • color propagation as above
  • dynamic programming + bipartite matching

61
Dynamic programming
P', P = partitions of columns j-1 and j
U(P',P) = minimum possible number of cross-overs
when P' and P are given colors
S(j,P) = smallest possible number of cross-overs
in M-colorings of D(1,j) that have
coloring-induced partition P at column j
62
Dynamic programming (cont.)
S(1,P) = 0
S(j,P) = min over P' of [ S(j-1,P') + U(P',P) ]
where the minimum is over all consistent
partitions P' of column j-1
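The recurrence can be illustrated with a brute-force sketch for tiny inputs. As a simplifying assumption, the states here are labeled colorings of a column instead of partitions, so the matching step is implicit: the cost between two adjacent colorings is the number of rows whose color changes, and minimizing over all labelings yields the same optimum. Function names are illustrative, and the running time is exponential in the number of rows.

```python
from itertools import product


def consistent_colorings(column, M):
    """Yield every assignment of colors 0..M-1 to the rows of one column in
    which rows carrying different symbols never share a color."""
    for coloring in product(range(M), repeat=len(column)):
        symbol_of = {}
        if all(symbol_of.setdefault(c, s) == s for c, s in zip(coloring, column)):
            yield coloring


def min_crossovers(D, M):
    """Smallest number of cross-overs over all consistent M-colorings of D,
    or None if some column has more than M distinct symbols."""
    n_cols = len(D[0])
    columns = [[row[j] for row in D] for j in range(n_cols)]
    # S[c] = fewest cross-overs for the columns so far, ending in coloring c.
    S = {c: 0 for c in consistent_colorings(columns[0], M)}
    for j in range(1, n_cols):
        if not S:
            return None
        S = {
            c: min(S[prev] + sum(a != b for a, b in zip(prev, c)) for prev in S)
            for c in consistent_colorings(columns[j], M)
        }
    return min(S.values()) if S else None


# Hypothetical toy input: four rows, two columns.
D = ["00", "01", "10", "11"]
print(min_crossovers(D, 2), min_crossovers(D, 4))
```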
63
General case (cont.)
  • maximize average fragment length when M colors
    available
  • O(pN) where N = number of different M-partitions
    of n elements
  • Slow, only small M feasible

64
Optimal result for M = 4
65
  • Data: Daly et al.
  • full cover
  • Hamming distance 0

66
  • Data: Daly et al.
  • full cover
  • Hamming distance 1

67
  • Data: Daly et al.
  • 85% cover
  • Hamming distance 0

68
  • Data: Daly et al.
  • 85% cover
  • Hamming distance 1

69
Implementations
  • MDL block finder (M. Koivisto)
    www.cs.helsinki.fi/u/mkhkoivi/software.html
  • HaploVisual (P. Rastas)
    www.cs.helsinki.fi/u/prastas/haplovisual/

70
On the MDL
  • M. Hansen, B. Yu: Model selection and the
    principle of minimum description length. J Am
    Stat Assoc 96 (2001), 746-774.

71
Papers
  • M. Koivisto, M. Perola, T. Varilo, W. Hennah, J.
    Ekelund, M. Lukk, L. Peltonen, E. Ukkonen, H.
    Mannila: An MDL method for finding haplotype
    blocks and for estimating the strength of
    haplotype block boundaries. Pacific Symposium on
    Biocomputing 2003.
  • H. Mannila et al.: Minimum description length
    block finder... . AJHG 73 (2003).
  • E. Ukkonen: Finding founder sequences from a set
    of recombinants. Proc. Workshop on Algorithms for
    Bioinformatics WABI-2002. LNCS, Springer 2002.
  • K. Zhang et al.: Dynamic programming algorithms for
    haplotype block ... . RECOMB 2003.