Title: Minimal Recombination Histories and Global Pedigrees
1 Minimal Recombination Histories and Global Pedigrees
Finding Minimal Recombination Histories
Global Pedigrees
Acknowledgements: Yun Song, Rune Lyngsø, Mike Steel, Carsten Wiuf
2 Basic Evolutionary Events
Recombination
Gene Conversion
Coalescent/Duplication
Mutation
3 Time slices
All positions have found a common ancestor on one sequence
All positions have found a common ancestor
[Figure: time slices of a population of size N; axes: Population (1 to N), Time]
4 Recombination-Coalescence Illustration
Copied from Hudson 1991
Intensities:
  Coales.   Recomb.
  0         ρ
  1         (1+b)ρ
  3         (2+b)ρ
  6         2ρ
  3         2ρ
  1         2ρ
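5 Encoding, Phylogenies and Incompatibility
Encoding of one site across sequences 1-7:
  sequence:  1 2 3 4 5 6 7
  base:      C C C C A A A
  encoded:   0 0 0 0 1 1 1
1 mutation per site
[Figure: tree with a single mutation (0 to 1) separating sequences 1,2,3,4 from sequences 5,6,7]
Incompatibility: two sites that together show all four combinations
  0 0 0 1 1 0 1
  0 0 0 0 1 1 1
Four combinations: 00, 10, 01, 11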
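The incompatibility criterion on this slide can be checked mechanically. The following is a minimal sketch, assuming the two rows 0 0 0 1 1 0 1 and 0 0 0 0 1 1 1 are two sites read across the same 7 sequences; the names `gametes`, `incompatible`, `site_a` and `site_b` are illustrative and not from the slides.

```python
def gametes(seqs, i, j):
    """Return the set of (site i, site j) allele pairs seen in the sample."""
    return {(s[i], s[j]) for s in seqs}


def incompatible(seqs, i, j):
    """Four-gamete test: True if all of 00, 01, 10 and 11 occur at sites i, j."""
    return len(gametes(seqs, i, j)) == 4


# The two rows from the slide, each read as one site across 7 sequences
# (an assumption about how the slide's figure is laid out).
site_a = "0001101"
site_b = "0000111"
seqs = [a + b for a, b in zip(site_a, site_b)]  # one 2-site string per sequence

print(sorted(gametes(seqs, 0, 1)))
print(incompatible(seqs, 0, 1))
```

Under the infinite site assumption with one mutation per site, a pair of sites showing all four combinations cannot be placed on a single tree, which is why such a pair signals at least one recombination between the two sites.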
6 The 1983 Kreitman Data & the infinite site assumption (M. Kreitman 1983, Nature)
- 11 sequences of the alcohol dehydrogenase gene in Drosophila melanogaster.
- Can be reduced to 9 sequences (3 of 11 are identical).
- 3200 bp long, 43 segregating sites, 28 of which are informative.
Recoded Kreitman data:
i. (0,1): ancestor state known
ii. Multiple copies represented by 1 sequence
iii. Non-informative sites could be removed
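The recoding steps ii and iii can be sketched on toy data as follows. This is only an illustration under stated assumptions: sequences are equal-length 0/1 strings with 0 as the ancestral state, and a site is treated as informative when each state occurs in at least two of the distinct sequences. The function name `recode` and the toy input are hypothetical; the Kreitman data themselves are not reproduced here.

```python
def recode(seqs):
    """Collapse identical sequences, then keep only informative sites.

    Assumption: a site is informative when both 0 and 1 occur in at
    least two of the distinct sequences.
    Returns (reduced sequences, indices of the sites that were kept).
    """
    distinct = list(dict.fromkeys(seqs))  # order-preserving de-duplication
    n_sites = len(distinct[0]) if distinct else 0
    keep = []
    for j in range(n_sites):
        ones = sum(s[j] == "1" for s in distinct)
        if ones >= 2 and len(distinct) - ones >= 2:
            keep.append(j)
    reduced = ["".join(s[j] for j in keep) for s in distinct]
    return reduced, keep


toy = ["0011", "0011", "0101", "1100", "1000"]
reduced, kept_sites = recode(toy)
print(reduced, kept_sites)
```

Whether informativeness should be judged before or after collapsing duplicates is a design choice; here it is judged afterwards, on the distinct sequences.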
7 Hudson & Kaplan's RM
0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 1 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1 1 1 1
1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 1
1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1
If RM is equated with the expected number of recombinations, it could be used as an estimator. Unfortunately, RM is a gross underestimate of the real number of recombinations.
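For concreteness, here is a hedged sketch of how an RM-style lower bound can be computed: collect every pair of sites that fails the four-gamete test, treat each pair as an interval that must contain a recombination, and count the largest set of non-overlapping intervals. It assumes 0/1 string sequences and applies the plain four-gamete test without using the known ancestral state; the function name is illustrative.

```python
from itertools import combinations


def hudson_kaplan_rm(seqs):
    """Lower bound on recombinations from pairwise four-gamete tests.

    Each incompatible site pair (i, j) needs a breakpoint somewhere
    between site i and site j. The bound is the maximum number of such
    intervals that share no gap between sites, found greedily by
    scanning intervals in order of their right end point.
    """
    n_sites = len(seqs[0])
    intervals = [
        (i, j)
        for i, j in combinations(range(n_sites), 2)
        if len({(s[i], s[j]) for s in seqs}) == 4
    ]
    intervals.sort(key=lambda ij: ij[1])
    rm, last_right = 0, 0
    for left, right in intervals:
        if left >= last_right:  # interval starts at or after the last chosen end
            rm += 1
            last_right = right
    return rm


toy = ["000", "010", "101", "111"]
print(hudson_kaplan_rm(toy))
```

In the toy sample, sites 0 and 1 are incompatible and so are sites 1 and 2, and the two intervals only touch at site 1, so two breakpoints are required.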
8 Recombination Parsimony (Hein, 1990, 1993; Song & Hein, 2002)
9 Metrics on Trees based on subtree transfers
Trees including branch lengths
Unrooted tree topologies
Rooted tree topologies
Tree topologies with age ordered internal nodes
Pretending the easy problem (unrooted) is the real problem (age-ordered) causes violations of the triangle inequality
10 Tree Combinatorics and Neighborhoods
Observe that the size of the unit-neighbourhood
of a tree does not grow nearly as fast as the
number of trees
Due to Yun Song
Song (2003)
Allen & Steel (2001)
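A small numerical illustration of this observation, restricted to unrooted binary trees, where both quantities have closed forms: (2n-5)!! topologies on n labelled leaves and, following Allen & Steel (2001), 2(n-3)(2n-7) trees one subtree-prune-regraft move away. The function names are illustrative, and the rooted case treated by Song (2003) is not covered by this sketch.

```python
def num_unrooted_trees(n):
    """(2n-5)!!: number of unrooted binary topologies on n >= 3 labelled leaves."""
    count = 1
    for k in range(3, 2 * n - 4, 2):  # 3 * 5 * ... * (2n-5)
        count *= k
    return count


def spr_unit_neighbourhood(n):
    """Trees exactly one SPR move from an unrooted binary tree on n >= 3 leaves."""
    return 2 * (n - 3) * (2 * n - 7)


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n), spr_unit_neighbourhood(n))
```

The neighbourhood grows quadratically in n while the number of trees grows super-exponentially, which is the contrast the slide points to.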
18 Branch and Bound Algorithm
  0     3
  1     91
  2     1314
  3     8618
  4     30436
  5     62794
  6     78970
  7     63049
  8     32451
  9     10467
  10    1727
[Figure: lower bound, upper bound and exact length; k and the k-recombination neighborhood]
1. The number of ancestral sequences in the
ACs.
2. Number of ancestral sequences in the ACs
for neighbor pairs
3. AC compatible with the minimal ARG.
4. AC compatible with close-to-minimal ARG.
19 The Minimal Recombination History for the Kreitman Data
Method: # of rec. events obtained
Hudson & Kaplan (1985): 5
Myers & Griffiths (2003): 6
Song & Hein (2004). Set theory based approach: 7
Song & Hein (2003). Current program using rooted trees: 7
Lyngsø, Song & Hein (2006). Massive Acceleration using Branch and Bound Algorithm: 7
Lyngsø, Song & Hein (2006). Minimal number of Gene Conversions (in prep.): 5-2
20 Spatial Coalescent-Recombination Algorithm (Wiuf & Hein 1999, TPB)
Spatial Process
Temporal Process
i. The process is non-Markovian
ii. The trees cannot be reduced to Topologies
21 Gene Conversions & Treeness
Gene Conversion
Recombination
Coalescent
Star tree
22 The Bad News: Actual, potentially detectable and detected recombinations
[Figure: recombinations in the Minimal ARG and the True ARG along a sequence from 0 to 4 Mb]
23 The Good News: Quality of the estimated local tree
[Figure: True ARG and Reconstructed ARG on sequences 1-5, with example local trees ((1,2),(1,2,3)) and ((1,3),(1,2,3))]
n = 7, ρ = 10, Θ = 75
24 Simultaneous Inference of Haplotypes & Recombination Events: Combinatorial Optimization Version
Data: Genotypes/SNPs
Gusfield, 2002
[Figure: example genotype (A; C; A,G; C,G; G; G) with candidate haplotype resolutions 1, 2 and 3]
Song et al., 2006
Rahman/Lyngsø (unpubl.): Heuristic Sequence of Phylogenies
25 The Griffiths-Ethier-Tavare Recursions
No recombination; Infinite Site Assumption; Ancestral State Known
History Graph: recursions exist, no cycles
Possible Histories without Recombination for simple data example
[Figure: graph of possible histories for the example]
Without recombination: 27 ACs; with recombination: 3108 ACs
26 Ancestral configurations to 2 sequences with 2 segregating sites
27 Counting Recursion
Summary statistic lumping configurations
k1(k21)1 padded with -
1
k1
k
28 Enumeration of Ancestral States (via counting restricted non-negative integer matrices with given row and column sums)
Due to Yun Song
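To make the matrix-counting formulation concrete, here is a minimal sketch that counts non-negative integer matrices with prescribed row and column sums by filling one row at a time. It counts the unrestricted matrices only; the extra restrictions mentioned in the slide title are not specified in this transcript and are not modelled. The function name and the small inputs are illustrative.

```python
from functools import lru_cache


def count_matrices(row_sums, col_sums):
    """Count non-negative integer matrices with the given row and column sums."""
    if sum(row_sums) != sum(col_sums):
        return 0
    rows = tuple(row_sums)

    def place(total, cols):
        """Yield the remaining column sums after spreading `total` over a row."""
        if not cols:
            if total == 0:
                yield ()
            return
        for x in range(min(total, cols[0]) + 1):
            for tail in place(total - x, cols[1:]):
                yield (cols[0] - x,) + tail

    @lru_cache(maxsize=None)
    def fill(r, cols):
        """Number of ways to fill rows r, r+1, ... given remaining column sums."""
        if r == len(rows):
            return 1 if not any(cols) else 0
        return sum(fill(r + 1, rest) for rest in place(rows[r], cols))

    return fill(0, tuple(col_sums))


print(count_matrices((1, 1), (1, 1)))
print(count_matrices((2, 2), (2, 2)))
print(count_matrices((1, 1, 1), (1, 1, 1)))
```

Memoising on the remaining column sums keeps the recursion manageable for small inputs, because many partial fillings lead to the same remaining sums.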
29 Examples of Likelihood Calculations
010 010 101 101 110
R3
R1
R2
30 Time slices
All positions have found a common ancestor on one sequence
All positions have found a common ancestor
[Figure: time slices of a population of size N; axes: Population (1 to N), Time]
31 Number of genetic ancestors to the Human Genome
Sρ: number of Segments; E(Sρ) = 1 + ρ
[Figure: ancestry of a sequence traced back in time, with events labelled C and R; simulations]
Statements about number of ancestors are much harder to make.
32 Applications to Human Genome (Wiuf and Hein, 97)
Parameters used: 4Ne = 20,000; Chromosome 1: 263 Mb, 263 cM
Chromosome 1: Segments 52,000; Ancestors 6,800
All chromosomes: Ancestors 86,000
Physical Population: 1.3-5.0 million
A randomly picked ancestor (ancestral material comes in batteries!)
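The segment count for chromosome 1 can be reproduced from the stated parameters. This is a small worked calculation, assuming the scaled recombination rate is ρ = 4Ne × (map length in Morgans) and that the expected number of segments is 1 + ρ; the variable names are illustrative.

```python
four_ne = 20_000       # 4Ne, as given on the slide
map_length_cm = 263    # genetic length of chromosome 1 in centiMorgans

# Assumed scaling: rho = 4Ne * r, with r the map length in Morgans (cM / 100).
rho = four_ne * map_length_cm / 100
expected_segments = 1 + rho

print(rho, expected_segments)
```

The result, roughly 52,600, is in line with the 52,000 segments quoted for chromosome 1.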
33 Multiple and Simultaneous Coalescents
1. Simultaneous Events
2. Multifurcations
3. Underestimation of Coalescent Rates
34 Recombination Induced Multiple Coalescent Events
P(X2 > 1) = (2N-1)/2N = 1 - 1/(2N)
High recombination rate will create many ancestors, violating the coalescent assumption that sample size << 2N.
2N = 10,000; sample sizes 10, 200, 3000, 8000
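To see how the assumption sample size << 2N breaks down, the sketch below compares two one-generation quantities for k lineages in a population of 2N genes: the exact probability that at least two lineages share a parent, and the sum of pairwise coalescence probabilities C(k,2)/2N used by the continuous approximation. It assumes a Wright-Fisher model with parents chosen uniformly at random; the function names are illustrative.

```python
from math import comb, prod


def p_any_coalescence(k, two_n):
    """Exact one-generation probability that some of k lineages share a parent."""
    return 1.0 - prod(1.0 - i / two_n for i in range(1, k))


def pairwise_rate(k, two_n):
    """Sum of pairwise coalescence probabilities, C(k, 2) / 2N."""
    return comb(k, 2) / two_n


two_n = 10_000
for k in (2, 10, 200, 3000, 8000):
    print(k, p_any_coalescence(k, two_n), pairwise_rate(k, two_n))
```

For k = 2 the two quantities agree and equal 1/(2N), matching P(X2 > 1) = 1 - 1/(2N). For the larger sample sizes on the slide the pairwise sum exceeds 1 and can no longer be read as a probability.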
35 Recombination Induced Multiple Coalescent Events
Number of our genetic ancestors: Recombination Carriers, Gene Conversion Carriers
Gene Conversion Length 300; GR, 100R
[Figure legend: Recombination, Gene Conversion, Recombination Carriers, Gene Conversion Carriers, Mixed]
36 Recombination Induced Multiple Coalescent Events
Coalescent Rate: Discrete versus Continuous
Consequences for the Recombination-Coalescent Process: Globally Wrong, Locally Correct.
37 Questions based on Large Data Sets
Much, much more sequence data:
1. Comparative Genomics on a Huge Scale
2. Population Genomics. One issue: reconstructing population pedigrees. Extreme data: identifiability of pedigrees
3. Association Mapping on the Tree of Life
4. Somatic Gene Genealogies and the Models of Embryology
38 Global Pedigrees
1999 Chang and Derrida: time to a universal common ancestor
2004 Rohde tries to answer this for a realistic population model
- Combining the Coalescent and Pedigree Process
- Super-pedigree problem
- Bound on how much data is needed to infer a pedigree
- Do embedded phylogenies determine the pedigree?
- Wiuf & Hein (1999) 'A contribution to the discussion of J. Chang's paper "Recent Common Ancestor of All Present Human Individuals"' (Adv. Appl. Prob. vol. 31.4)
- Hein (2004) "Pedigrees for all Humanity" Nature 431.512-13.
- Steel and Hein (2005) Reconstructing Pedigrees: A combinatorial perspective. J. Theor. Biol.
39 Combining Ancestral Individuals and the Coalescent (Wiuf & Hein, 2000)
Let T be the time when somebody was everybody's ancestor.
Chang's result: lim T/log2(N) = 1 with prob. 1
Unify the two processes:
I. Sample more individuals
II. Let each have 2 parents with probability p.
Result: a discontinuity at 1.
For p < 1: change log2 → logp
Comment: Genetic Ancestors is a vanishing set within Genealogical Ancestors.
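Chang's result can be explored with a small simulation. The sketch below assumes a constant population of N individuals in which every individual picks two parents uniformly at random, with replacement, from the previous generation, and reports the number of generations back until some individual is an ancestor of everyone alive today. The function name, the bitmask bookkeeping and the seed are illustrative choices.

```python
import math
import random


def time_to_common_ancestor(n, rng, max_generations=10_000):
    """Generations back until one individual is an ancestor of all n present-day individuals."""
    everyone = (1 << n) - 1
    # descendants[i]: bitmask of present-day individuals descended from individual i
    descendants = [1 << i for i in range(n)]
    for t in range(max_generations + 1):
        if any(d == everyone for d in descendants):
            return t
        parents = [0] * n
        for d in descendants:
            # two parents drawn independently; they may coincide
            for p in (rng.randrange(n), rng.randrange(n)):
                parents[p] |= d
        descendants = parents
    raise RuntimeError("no common ancestor found within max_generations")


rng = random.Random(0)
n = 256
t = time_to_common_ancestor(n, rng)
print(t, t / math.log2(n))
```

For a single run at moderate N the ratio T/log2(N) need not be close to 1, since the statement is a limit as N grows.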
40 Pedigree Ancestors and Human History (Rohde, Olson & Chang, 2004)
More realistic Model of Human History: Geography and Growth
E(T): 2300 years ago; E(U): 4500 years ago
41 Probability of Data given a Pedigree
Elston-Stewart (1971): Temporal Peeling Algorithm
Condition on parental states; Recombination and mutation are Markovian
42 Counting Pedigrees (Tong Chen & Rune Lyngsø)
[Figure: example pedigree]
Ak(i,j): the number of pedigrees k generations back with i females, j males.
  2     4
  3     279
  4     2.8·10^7
  5     2.8·10^20
  6     7.4·10^52
  7     2.8·10^131
  8     2.9·10^317
  9     3.5·10^749
  10    3.9·10^1737
43 Pedigree Counting
- Counting gender un-labelled pedigrees
- Much harder.
- Counting gender labellings on un-labelled
pedigree.
gender un-labelable
44 Inverting Random Functions: a bound on segregating sites needed to reconstruct a global pedigree (Steel & Szekely, 1998; Steel and Hein, 2005)
The population can be partitioned into triples: a couple that gets a pair of children, and an outsider that has a child with one of them. This creates a mapping from a generation to the previous, fundamentally labeling all ancestors.
The number of global pedigrees for k generations with 3n individuals
Number of segregating sites, s, needed to predict the correct global pedigree with probability at least 0.5 for a population of size n over d generations
Ex. 3·10^6, 300 generations (7000 years): this lower bound would give a minimum of 2000 sites (probably a gross underestimate).
45 Reconstructing global pedigrees (Steel and Hein, 2005)
Knowing the gender-labeled pedigrees for all pairs defines the global pedigree (last k generations)
Links and lassos determine the global pedigree (last k generations)
Gender labelling of ancestors is crucial
46 Benevolent Mutation and Recombination Process
Genomes with r and m/r --> infinity (r: recombination rate, m: mutation rate)
- All embedded phylogenies are observable
- Do they determine the pedigree?
Counter example
Embedded phylogenies
47 Pedigree Reconstruction Principles
Distance Based Reconstructions
Gender specific rates
Continuous Birth Time with Perfect Clock
t3
t2
t1
Subtree Transfer Identification of Ancestors
Recursive Definition of Ancestral Genomes
48 The Coalescent with Recombination
Retrospective instead of Prospective formulation of Genetical Processes (Ewens, 1979)
1940s: retrospective arguments used by both Fisher and Wright.
1975 Watterson: full formulation of probability of genealogical relationship of a set of alleles.
1982: Three famous articles by Kingman.
1983 Hudson: includes Recombination in Genealogical Process.
- Number of Ancestors to a DNA Sequence.
- Reformulation of Genealogical Process.
- Inclusion of Gene Conversion in Genealogical Process.
- Wiuf & Hein (1997) On the Number of Ancestors to a DNA Sequence
- Wiuf & Hein (1999) The Ancestry of a Sample of Sequences Subject to Recombination
- Wiuf & Hein (1999) The Coalescent with Recombination as a point process moving along sequences.
- Wiuf & Hein (2000) The Coalescent with Gene Conversion
49 Finding Minimal Recombination Histories
1964 Bodmer & Edwards: Parsimony defined as reconstruction principle
1985 Hudson & Kaplan: uses minimal recombination histories as observed recombinations
- Attempts to find minimal histories of sequences
- Definition of recombination as Subtree Prune Regraft operations
- J.J.Hein Reconstructing the history of sequences subject to Gene Conversion and Recombination. Mathematical Biosciences. (1990) 98.185-200.
- J.J.Hein A Heuristic Method to Reconstruct the History of Sequences Subject to Recombination. J.Mol.Evol. 20.402-411. 1993
- Hein,J.J., T.Jiang, L.Wang & K.Zhang (1996) "On the complexity of comparing evolutionary trees" Discrete Applied Mathematics 71.153-169
- Song, Y.S. (2003) On the combinatorics of rooted binary phylogenetic trees. Annals of Combinatorics, 7: 365-379
- Song, Y.S. & Hein, J. (2005) Constructing Minimal Ancestral Recombination Graphs. J. Comp. Biol., 12: 147-169
- Song, Y.S. & Hein, J. (2004) On the minimum number of recombination events in the evolutionary history of DNA sequences. J. Math. Biol., 48: 160-186.
- Song, Y.S. & Hein, J. (2003) Parsimonious reconstruction of sequence evolution and haplotype blocks: finding the minimum number of recombination events, Lecture Notes in Bioinformatics, Proceedings of WABI'03, 2812: 287-302.
- Lyngsø, Song and Hein (2005) Minimal Recombination Histories by Branch and Bound. WABI
50 Likelihood of Data Set
1972 Ewens: likelihood of allele number observations
1987 Griffiths: recursions for infinite site data
1990 Felsenstein: uses Metropolis-Hastings
1994 Griffiths-Tavare: uses MCMC on coalescent-mutation process
1996 Griffiths-Marjoram: uses MCMC on coalescent-mutation-recombination process
1999 Donnelly-Matthews-Fearnhead: uses IS to accelerate earlier methods
2000 Hudson: introduces pseudolikelihood method
- How hard is the coalescent-mutation-recombination process?
- Song, Y.S., Lyngsø, R.B. & Hein, J. (2005) Counting Ancestral States in Population Genetics. In Press