Title: Roadmap
1. Roadmap
- Discovering Patterns
- Structure-preserving patterns
- Strings, Networks
- Permuting patterns
- Combinatorics
- Algorithmics
- Statistics
- Analyzing Patterns
- Genographic Project
- LD Patterns
- Then
- Now (IRIS)
2.
- Who? National Geographic and IBM, in a five-year study launched in April 2005
- What? Although fossil records fix human origins in Africa, little is known about the great journey that took Homo sapiens to the far reaches of the earth. How did we, each of us, end up where we are?
- How? Using genetics as a tool: samples are being collected all around the world, and the mtDNA and the NRY (the non-recombining region of the Y chromosome) are being analyzed
3. www.nationalgeographic.com/genographic
4. www.ibm.com/genographic
5. Public Participation
- Over 250,000 public participants to date (April 2008)
- www.nationalgeographic.com/genographic
- www.ibm.com/genographic
- www.ibm.com/dna
6. Map of Migration
7. How?
- Each of us carries ancestral material marked by signatures due to imperfections in DNA replication
- SNPs (Single Nucleotide Polymorphisms)
- STR numbers (Short Tandem Repeats)
- Inversions
- etc.
- Uni-parental Model (topology: tree)
- Non-recombining segments of genome
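A minimal sketch of how SNP signatures show up in data, assuming the sequences are already aligned to the same length. The function name `snp_positions` and the toy sequences are made up for illustration and are not from the slides.

```python
def snp_positions(aligned):
    """Return the 0-based columns at which the aligned sequences differ."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    # A column is polymorphic when more than one nucleotide is observed in it.
    return [i for i in range(length) if len({seq[i] for seq in aligned}) > 1]


# Toy aligned sequences (hypothetical data).
toy = ["ACGTAC", "ACGTAC", "ACTTAC", "ACGTAA"]
print(snp_positions(toy))
```

Only the polymorphic columns carry information about shared ancestry; the monomorphic ones are identical in every sample.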
8. mtDNA Micro-Phylogeny Tree
22 (coding-region) SNPs
The Genographic Project Public Participation mtDNA Database, Behar et al., PLoS Genetics, 2007
9. mtDNA Haplogroup Distribution
10. Migration Map based on mtDNA
11. Locus
[Figure labels: 16,000 bp; 58 million bp; 0.38]
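12. Missing information in unilinear transmissions
[Figure: genealogy drawn between "past" and "present"]
13. Population over generations (flow of ancestral material)
[Figure: genealogy drawn between "past" and "present", with the MRCA marked]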
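A minimal sketch of locating the MRCA (most recent common ancestor) in a uni-parental genealogy, where every individual has exactly one parent. The `parent` dictionary layout and the names `lineage` and `mrca` are assumptions made for this illustration.

```python
def lineage(parent, node):
    """Walk single-parent links from node back to the root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def mrca(parent, samples):
    """Most recent common ancestor of the samples in a uni-parental tree."""
    paths = [lineage(parent, s) for s in samples]
    common = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking back from the first sample, the first shared node is the most recent one.
    for node in paths[0]:
        if node in common:
            return node
    return None


# Toy genealogy (hypothetical): r is the root; x and y are its children.
toy_parent = {"a": "x", "b": "x", "c": "y", "x": "r", "y": "r"}
print(mrca(toy_parent, ["a", "b"]), mrca(toy_parent, ["a", "c"]))
```

In this sketch a sample counts as its own ancestor, so the MRCA of a single sample is the sample itself.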
14. Bi-parental Model
[Figure: genealogy drawn between "past" and "present", with the GMRCA and the MRCA marked]
15. What is recombination?
- Genetic recombination is the process by which a strand of DNA is broken and then joined to the end of a different DNA molecule.
- It occurs during meiosis, between paired chromosomes. This process leads to offspring having different combinations of genes from their parents.
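A minimal sketch of a single crossover between two parental haplotypes, to make the definition above concrete. The function name `recombine`, the single-breakpoint simplification, and the toy strings are assumptions for illustration; real meiosis can involve several crossovers.

```python
def recombine(first, second, breakpoint):
    """Join the prefix of one haplotype to the suffix of the other."""
    if len(first) != len(second):
        raise ValueError("haplotypes must have the same length")
    if not 0 <= breakpoint <= len(first):
        raise ValueError("breakpoint must lie within the haplotype")
    # Everything left of the breakpoint comes from `first`, the rest from `second`.
    return first[:breakpoint] + second[breakpoint:]


child = recombine("AAAAAA", "GGGGGG", 2)
print(child)
```

The child carries material from two lineages, which is why a single tree can no longer describe its ancestry.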
16. Recombinations Capture: Requirements Specification
- Enumerate the (multiple) recombinations
- Statistical averages are not adequate
- Identify the participating lineages
- Detect ancient recombinations as well as recent ones
17. Then our task is to
- Estimate the phylogenetic network, called the Ancestral Recombinations Graph (ARG)
ARG coined by Griffiths & Marjoram, 1996
Joint work with Marta Melé, Jaume Bertranpetit, Francesc Calafell
18. An Inconvenient Truth
- Theorem: Given data D, the problem of computing the ARG G with the minimum number of recombinations is NP-complete.
Recall other inconvenient truths.
- Theorem: The problem of computing the most parsimonious tree T is NP-complete.
20. Flavors of hardness
- (Uni-parental)
- In a NON infinite-sites model, TREE construction is hard
- No back mutations, no parallel mutations
- But reality is infinite-sites
- Yet, the problem is tractable in practice
- (Bi-parental)
- In a pure recombinations model, the problem is hard
- Generally a statistical average has been pursued through LD
- Combining potentially misleading mutations with recombinations makes the general problem intractable in practice
21. Tractability Model (balance between reality and simplicity)
- Use characteristics of the observed haplotypes
- Use a compatible network model (not a generic phylogenetic model)
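One common notion of compatibility for binary SNPs is the four-gamete test: two sites fit on a single tree without back or parallel mutations only if at most three of the four combinations 00, 01, 10, 11 are observed. The sketch below assumes binary sites; the slides do not state that this is the exact compatibility criterion used, and the name `four_gamete_compatible` is made up for illustration.

```python
def four_gamete_compatible(haplotypes, i, j):
    """True when sites i and j show at most three of the four gametes."""
    gametes = {(hap[i], hap[j]) for hap in haplotypes}
    return len(gametes) < 4


# Toy binary haplotypes (hypothetical data).
tree_like = ["00", "01", "10"]
conflicting = ["00", "01", "10", "11"]
print(four_gamete_compatible(tree_like, 0, 1))
print(four_gamete_compatible(conflicting, 0, 1))
```

When all four gametes appear, either a recombination or a repeated mutation must have occurred between the two sites.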
22. IRIS (Identifying Recombinations In Sequences)
- Stage haplotypes: use SNP block patterns
- Segment along the length: infer trees
- Computational insights
- Infer network (ARG)
23. Input Haplotypes
24. Stage 1: Staging the Input
- 0 1 2 3 4 5 6 7 8 910 1 2 3 4 5 6 7
8 910 1 2 3 4 5 6 7 8 0) 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1) 1 3
4 2 3 2 2 2 2 3 4 3 3 3 3 2 2 3 3 4 4 3 2 2 2 2 2
2 1 2 1 2) 1 3 4 2 3 2 2 2 2 3 4 3 3 3 3 2 2 3
3 4 4 3 2 4 1 1 4 3 2 6 1 3) 1 3 5 2 3 2 2 2 2
4 5 1 1 1 1 1 3 1 4 1 5 4 3 5 3 4 5 3 1 3 1 4)
2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 3
0 3 3 1 2 1 5) 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 115 1 6) 2 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 3 1
7) 2 1 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 1 1 1 1 1
3 2 3 3 3 1 3 1 8) 2 1 2 2 2 2 2 2 2 2 2 1 1 1
2 2 2 2 1 2 2 2 2 2 2 2 2 2 111 1 9) 2 3 4 2 3
2 2 2 2 3 4 3 3 3 3 2 2 3 3 4 4 3 2 4 1 1 4 3 2 8
1 10) 2 3 5 2 3 2 2 2 2 4 5 1 1 1 1 1 3 1 4 1 5
4 3 5 3 4 5 3 1 5 1 11) 3 2 3 2 2 3 2 3 2 2 3 2
2 2 3 2 2 2 2 3 1 1 1 3 2 3 3 3 1 6 1 12) 3 2 3
2 2 3 2 3 2 2 3 2 2 2 3 2 2 2 2 3 1 1 1 0 1 1 4 3
2 2 1 13) 5 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 2 1 14) 5 1 6 2 2 2 2 2 2 5
3 4 2 2 3 3 2 3 1 1 1 1 1 1 1 1 1 1 1 2 1 15) 7
2 3 2 2 3 2 3 2 2 3 2 2 2 3 2 2 2 2 3 1 1 1 3 2 3
3 3 1 2 1 16) 7 2 3 2 2 3 2 3 2 2 3 2 2 2 3 2 2
2 2 3 1 1 1 6 1 1 1 1 1 2 1 17) 3 2 3 2 2 3 2 3
2 2 3 2 2 2 3 2 2 2 2 3 0 2 2 2 2 2 2 2 1 1 1
18) 2 1 1 1 1 1 1 1 1 1 4 3 3 3 3 2 2 3 3 4 4 3
2 4 1 1 4 3 2 1 1 19) 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 5 1 1 1 20) 3 2 3 2
2 3 2 3 2 2 3 2 2 2 3 2 2 3 3 4 4 3 2 4 1 1 4 3 2
1 1 21) 2 3 5 2 3 2 2 2 2 4 5 1 1 1 1 1 3 1 4 1
5 4 3 5 2 2 2 2 1 1 1 22) 5 1 2 2 2 2 2 2 2 2 2
1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1
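One possible reading of "staging" is that each haplotype is cut into SNP blocks and every distinct pattern inside a block is replaced by a small integer id, which would give a matrix of ids like the one above. The sketch below works under that assumption only; fixed-width blocks, ids numbered by first appearance, and the names `stage` and `block_size` are all hypothetical and not taken from the slides.

```python
def stage(haplotypes, block_size):
    """Replace each block of SNPs by the id of the pattern it shows."""
    n_sites = len(haplotypes[0])
    staged = [[] for _ in haplotypes]
    for start in range(0, n_sites, block_size):
        ids = {}  # pattern -> id, local to this block
        for row, hap in zip(staged, haplotypes):
            pattern = hap[start:start + block_size]
            # A new pattern gets the next free id; a known one reuses its id.
            row.append(ids.setdefault(pattern, len(ids) + 1))
    return staged


# Toy haplotypes (hypothetical data), staged in blocks of two SNPs.
toy_haps = ["0011", "0010", "1111"]
print(stage(toy_haps, 2))
```

Haplotypes that share an id in a column carry the same pattern in that block, so later stages can compare short id rows instead of full SNP strings.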
25. Stage 2: Segmentation (Marginal Compatible Trees)
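A minimal sketch of one way to segment sites so that each segment admits a single tree: scan left to right and start a new segment whenever the next site fails the four-gamete test against a site already in the current segment. This greedy scan over binary sites is an assumption for illustration, not the segmentation procedure of the slides; `four_gamete_ok` and `segment` are hypothetical names.

```python
def four_gamete_ok(haps, i, j):
    """True when sites i and j show at most three of the four gametes."""
    return len({(h[i], h[j]) for h in haps}) < 4


def segment(haps):
    """Split sites into maximal left-to-right runs of pairwise compatible sites.

    Returns half-open (start, end) intervals of site indices.
    """
    n_sites = len(haps[0])
    segments, start = [], 0
    for site in range(1, n_sites):
        if any(not four_gamete_ok(haps, prev, site) for prev in range(start, site)):
            segments.append((start, site))
            start = site
    segments.append((start, n_sites))
    return segments


# Toy binary haplotypes (hypothetical): sites 0 and 2 show all four gametes.
toy_haps = ["0000", "0100", "1100", "0010", "1111"]
print(segment(toy_haps))
```

Each interval can then be given its own marginal tree, and the boundaries between intervals are candidate recombination breakpoints.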
26. Ancestral Recombination Graph (ARG) (our characterization)
- ARG is a fortified compatible graph
- Defined on k segments: G(k)
- A node can have at most 2 incoming edges (2 parents)
- A node with 2 parents denotes a recombination of two segments; each incoming edge is labeled by one segment
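A minimal sketch of the structural constraints listed above: at most two parents per node, and at a two-parent node each incoming edge labeled with a different segment out of k. The dictionary layout and the name `recombination_nodes` are assumptions for illustration; the "fortified compatible" property itself is not modeled here.

```python
def recombination_nodes(parents, k):
    """Validate the parent lists and return the nodes that have two parents.

    parents maps a node to a list of (parent, segment) pairs; segment may be
    None on edges into single-parent nodes.
    """
    found = []
    for node, incoming in parents.items():
        if len(incoming) > 2:
            raise ValueError(f"{node} has more than 2 parents")
        if len(incoming) == 2:
            segs = [seg for _, seg in incoming]
            # Both edges need a label, the labels must differ and lie in 0..k-1.
            if None in segs or segs[0] == segs[1] or not all(0 <= s < k for s in segs):
                raise ValueError(f"{node} needs two distinct segment labels")
            found.append(node)
    return found


# Toy graph on k = 2 segments (hypothetical): r recombines a and b.
toy_graph = {
    "root": [],
    "a": [("root", None)],
    "b": [("root", None)],
    "r": [("a", 0), ("b", 1)],
}
print(recombination_nodes(toy_graph, 2))
```

Restricted to the edges carrying one segment label, the graph reduces to a tree for that segment.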
27. ARG
28. Stage 3: Trees to Forest (DSR Algorithm)
Input: two graphs G1 and G2
Output: consensus ARG G
[Figure labels: Optimization, Topology, DSR]
29. DSR Algorithm: Overview
Initialization
- Let G1 and G2 be defined on leaf labels L
- Let universe U ← L
- P1 and P2 are partitions on U at leaf level
Iterative loop
- DO
- Build a network structure with nodes in G and the labels derived from P1 and P2
- Universe U ← these nodes in G
- Increment the layer and update P1 and P2 as sets on U of this layer
- P1 has labels from G1
- P2 has labels from G2
- WHILE (P1 is nonempty) OR (P2 is nonempty)
30. Walk-through of DSR Algorithm
- (0 2 5 9 12-14 16 18-20 23-24 28-30 33 35-36) 2 7241 7251
- (1 6 8 17 21-22 26 31) 1 7252
- (7 11 15 34 37) 1 7253
- (25) 1 7707
- (3 10 27) 1 7254
- (4) 1 7009
- (32) 1 7000
(0 5 13-14 16 23-24 29-30) 1 8271
(1 6 8 17 21-22 26) 1 8272
(19 36) 1 8275
(2 9 12 18 20 31) 1 8282
(3-4 7 10-11 15 25 27-28 32 34-35 37) 1 8806
(33) 1 8000
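A minimal sketch for reading the cluster notation above, where "12-14" stands for leaves 12, 13, 14, and for checking that each of the two lists partitions the 38 leaves 0 to 37. The helper names `parse_cluster` and `is_partition` are made up for illustration, and which list corresponds to P1 or P2 is not stated on the slide.

```python
def parse_cluster(text):
    """Expand a cluster such as '0 2 12-14' into a sorted list of leaf ids."""
    leaves = []
    for token in text.split():
        if "-" in token:
            low, high = map(int, token.split("-"))
            leaves.extend(range(low, high + 1))
        else:
            leaves.append(int(token))
    return sorted(leaves)


def is_partition(blocks, universe):
    """True when the blocks are disjoint and together cover the universe."""
    seen = [leaf for block in blocks for leaf in block]
    return len(seen) == len(set(seen)) and set(seen) == set(universe)


first = ["0 2 5 9 12-14 16 18-20 23-24 28-30 33 35-36", "1 6 8 17 21-22 26 31",
         "7 11 15 34 37", "25", "3 10 27", "4", "32"]
second = ["0 5 13-14 16 23-24 29-30", "1 6 8 17 21-22 26", "19 36",
          "2 9 12 18 20 31", "3-4 7 10-11 15 25 27-28 32 34-35 37", "33"]
blocks_a = [parse_cluster(c) for c in first]
blocks_b = [parse_cluster(c) for c in second]
print(is_partition(blocks_a, range(38)), is_partition(blocks_b, range(38)))
```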
31. DSR: Dominant, Subdominant, Recombinant
- Dominant: labels of G1 AND G2
- Subdominant: label of G1 OR G2
- Recombinant: no labels (NEITHER G1 NOR G2)
- Rules
- 1. Each row and each column
- has at most one dominant
- ELSE has at most one subdominant
- ELSE all recombinants
- 2. A non-recombinant can have non-recombinants either in its row or its column, but NOT both
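A minimal sketch of a checker for part of the rules above, under one reading of them: it enforces "at most one dominant per row and per column" and rule 2, and deliberately leaves the ELSE clauses of rule 1 unencoded because their exact meaning is not clear from the slide. The grid encoding with the letters D, S, R and the name `check_dsr` are assumptions for illustration.

```python
def check_dsr(grid):
    """Check a D/S/R assignment given as a list of equal-length strings."""
    rows = [list(r) for r in grid]
    cols = [list(c) for c in zip(*rows)]
    # Rule 1, first clause only: at most one dominant in any row or column.
    if any(line.count("D") > 1 for line in rows + cols):
        return False
    # Rule 2: a non-recombinant may share its row or its column with other
    # non-recombinants, but not both.
    for i, row in enumerate(rows):
        for j, cell in enumerate(row):
            if cell == "R":
                continue
            row_others = sum(c != "R" for c in row) - 1
            col_others = sum(c != "R" for c in cols[j]) - 1
            if row_others > 0 and col_others > 0:
                return False
    return True


print(check_dsr(["DRR", "RSS", "RRR"]))
print(check_dsr(["DS", "SR"]))
```

In the second toy grid the dominant cell has a subdominant both in its row and in its column, which violates rule 2.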
32. DSR Algorithm: X-matrix
[Figure: matrix with axes P1 and P2, annotated with the P1 labels and the P2 labels]
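The slide does not define the entries of the X-matrix. A minimal sketch, assuming each entry counts the leaves shared by one block of P1 (row) and one block of P2 (column); the name `x_matrix` and the toy partitions are made up for illustration.

```python
def x_matrix(p1, p2):
    """Overlap sizes between every block of p1 (rows) and of p2 (columns)."""
    return [[len(set(a) & set(b)) for b in p2] for a in p1]


# Toy partitions of the leaves 0..4 (hypothetical data).
toy_p1 = [[0, 1, 2], [3, 4]]
toy_p2 = [[0, 1], [2, 3], [4]]
for row in x_matrix(toy_p1, toy_p2):
    print(row)
```

Under this assumption, a row with a single nonzero entry means the P1 block sits inside one P2 block, while several nonzero entries mean the two partitions split those leaves differently.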
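33. DSR Algorithm: Assign DSR colors (optimization)
[Figure: matrix with axes P1 and P2, annotated with the P1 labels and the P2 labels]
34. DSR Algorithm: rows, cols, DSR
[Figure: matrix with axes P1 and P2, annotated with the P1 labels and the P2 labels]
35. DSR Algorithm: rows, cols, DSR
[Figure: matrix with axes P1 and P2, annotated with the P1 labels and the P2 labels]
36. DSR Algorithm: rows, cols, DSR
[Figure: matrix with axes P1 and P2, annotated with the P1 labels and the P2 labels]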
37. DSR → Feasible Topology
[Figure labels: Next layer, Last layer]
38. DSR: Continuity Across Layers (iterations)
39. chr21: 14505500-14602168
- Chinese (2 subpopulations: CBx, HNx) and Japanese (JTx) data
- Around 200 SNPs
- Around 100 haplotypes
40. Network
Median-joining networks for inferring intraspecific phylogenies, Bandelt, Forster & Röhl, Molecular Biology and Evolution, Vol. 16, 37-48, 1999
41. IRIS (Identifying Recombinations In Sequences)
[Figure residue: a position ruler, then a row of per-position ids (1 to 5), both wrapped across lines]
12345678901234567890123456789012345678901234567890
12345678901234567890123456789012345678901234567890
1234567890123456789012345
111111111111111111111111
11111111111111112222222222222222222222222222222222
233333333344444444455555555555555----
42. IRIS: Non-recombining Cluster Ids
11 12 13 14 15 16 0 17 1 18 4 19 6
5 20 8 21 9 10 7 22 23 3 2 24
43. Chr 21 locus: Preliminary Results
- Not distinguishable
- Share recent and ancient recombinations
- No population-specific mutation/recombination
44. Mazumdar et al., Journal of Genetics, 2008.
45. The Big Picture
Ecosystem
Population Genomics
Species
Organism
Physiology
Metabolism
Network
Function
Structure
Sequence
46. Thank You!
"... success stories in bioinformatics will depend on algorithmic and statistical ingenuity."
Pavel Pevzner