Title: Coalescent Module- Faro July 26th-28th 04
1Coalescent Module - Faro, July 26th-28th 04, www.coalescent.dk
Monday
H: The Basic Coalescent
W: Forest Fire
W: The Coalescent & History, Geography & Selection
H: The Coalescent with Recombination
Tuesday
H: Recombination cont.
W: The Coalescent & Combinatorics
HW: Computer Session
H: The Coalescent & Human Evolution
Wednesday
H: The Coalescent & Statistics
HW: Linkage Disequilibrium Mapping
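2Zooming in! (from Harding Sanger)
[Figure: successive magnifications from the human genome down to sequence level. Scale labels: 3·10^9 bp (whole genome); 6·10^4 bp (β-globin region, chromosome 11); 3·10^3 bp (gene with exons 1-3 and 5' and 3' flanking regions); 30 bp (ATTGCCATGTCGATAATTGGACTATTTTTTTTTT). Zoom labels: 5.000, 20, 10^3.]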
3Human Migrations
From Cavalli-Sforza, 2001
4Data: β-globin from sampled humans (from Griffiths, 2001)
Assume:
1. At most 1 substitution per position.
2. No recombination.
Reducing nucleotide columns to bipartitions gives a bijection between data and unrooted gene trees.
[Figure: a column with the two states C and G splits the sequences into two groups.]
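The reduction of columns to bipartitions can be sketched as follows. This is a minimal illustration, not the procedure used in Griffiths (2001): the toy alignment and the function names `column_bipartitions` and `compatible` are assumptions made for the example. Under the two assumptions above, every pair of splits should be compatible, meaning that at least one pair of their sides is disjoint.

```python
from itertools import combinations

def column_bipartitions(seqs):
    """Map each two-state column to the split of sequence indices it induces."""
    everyone = frozenset(range(len(seqs)))
    splits = set()
    for column in zip(*seqs):
        if len(set(column)) != 2:
            continue  # constant column, or more than one substitution at the site
        side = frozenset(i for i, base in enumerate(column) if base == column[0])
        splits.add(frozenset({side, everyone - side}))
    return splits

def compatible(split_a, split_b):
    """Two splits fit on one unrooted tree iff some pair of their sides is disjoint."""
    return any(not (a & b) for a in split_a for b in split_b)

toy = ["CATAGT", "CGTAGT", "CGTTGT", "TGTTGT"]  # hypothetical data, not the beta-globin sample
splits = column_bipartitions(toy)
all_compatible = all(compatible(s, t) for s, t in combinations(splits, 2))
print(len(splits), all_compatible)
```

Identical columns collapse to the same split, which is why a set is used: the tree is determined by the distinct bipartitions only.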
5Simplified model of human sequence evolution.
[Figure: two populations, Africa and Non-Africa, traced from Present back to Past; the figure also carries the label 0.2.]
Rate of common ancestry: 1
Wait to common ancestry: 2Ne
Mutation rate: 2.5
6From Griffiths, 2001
7Models and their benefits.
- Models + Data
- probability of data (statistics...)
- probability of individual histories
- hypothesis testing
- parameter estimation
8Coalescent Theory in Biology www.coalescent.dk
[Diagram, from model to inference:]
Fixed Parameters: Population Structure, Mutation, Selection, Recombination, ...
Reproductive Structure
Genealogies of non-sequenced data
Genealogies of sequenced data (e.g. CATAGT, CGTTAT, TGTTGT)
Parameter Estimation, Model Testing
9Wright-Fisher Model of Population Reproduction
Haploid Model
i. Individuals are made by sampling with replacement from the previous generation.
ii. The probability that 2 alleles have the same ancestor in the previous generation is 1/(2N), where 2N is the number of alleles in the population.
- Assumptions
- Constant population size
- No geography
- No selection
- No recombination
Diploid Model
Individuals are made by sampling, with replacement, one chromosome from the females and one from the males of the previous generation.
10Ten Alleles' Ancestry for 15 Generations
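The picture of ten alleles traced back over 15 generations can be mimicked with a short simulation. This is a sketch under the haploid Wright-Fisher assumptions, taking the whole population to consist of 2N = 10 alleles; the function name `trace_ancestry` and the seed are illustrative choices, not part of the original slides.

```python
import random

def trace_ancestry(num_alleles, two_n, generations, rng):
    """Follow alleles backwards in time; each ancestor picks its parent uniformly with replacement."""
    lineages = list(range(num_alleles))       # slot of each allele's ancestor in the current generation
    counts = [len(set(lineages))]
    for _ in range(generations):
        parent_of = {}
        for slot in set(lineages):            # one parent draw per distinct ancestor
            parent_of[slot] = rng.randrange(two_n)
        lineages = [parent_of[slot] for slot in lineages]
        counts.append(len(set(lineages)))
    return counts

rng = random.Random(2004)
counts = trace_ancestry(num_alleles=10, two_n=10, generations=15, rng=rng)
print(counts)  # number of distinct ancestors, from the present backwards
```

The count of distinct ancestors can only stay the same or drop as one moves back in time, which is the coalescence behaviour the following slides quantify.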
11Waiting for most recent common ancestor - MRCA
Distribution of the time until 2 alleles had a common ancestor, $X_2$:
$P(X_2 > 1) = (2N-1)/2N = 1 - 1/(2N)$
$P(X_2 > j) = (1 - 1/(2N))^j$
$P(X_2 = j) = (1 - 1/(2N))^{j-1} \cdot 1/(2N)$
[Figure: two lineages traced back j generations through a population of 2N alleles.]
Mean: $E(X_2) = 2N$. Ex.: 2N = 20,000, generation time 30 years, $E(X_2)$ = 600,000 years.
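The worked example can be checked numerically. The sketch below assumes only the geometric distribution stated above; the helper names are hypothetical, and the mean is approximated by a truncated sum rather than derived.

```python
def prob_wait_exceeds(j, two_n):
    """P(X2 > j): no common ancestor in any of the last j generations."""
    return (1 - 1 / two_n) ** j

def prob_wait_equals(j, two_n):
    """P(X2 = j): first common ancestor exactly j generations back."""
    return (1 - 1 / two_n) ** (j - 1) / two_n

two_n = 20_000
generation_years = 30
mean_years = two_n * generation_years  # E(X2) = 2N generations

# Truncated sum of j * P(X2 = j); the neglected tail is of order exp(-40).
approx_mean = sum(j * prob_wait_equals(j, two_n) for j in range(1, 40 * two_n))
print(mean_years, round(approx_mean, 3))
```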
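12P(k) := P{k alleles had k distinct parents}
Ancestor choices, writing $(2N)_{[j]} = 2N (2N-1) \cdots (2N-(j-1))$:
k -> any: $(2N)^k$
k -> k: $(2N)_{[k]}$
k -> k-1: $\binom{k}{2} (2N)_{[k-1]}$
k -> j: $S_{k,j} (2N)_{[j]}$
$S_{k,j}$ is the number of ways to group k labelled objects into j groups (Stirling numbers of the second kind).
For k << 2N: $P(k) = (2N)_{[k]} / (2N)^k \approx 1 - \binom{k}{2}/(2N)$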
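A small numerical check of these counts, assuming the standard counting argument (choose a grouping of the k alleles into j groups, then distinct parents for the groups). The values 2N = 2000 and k = 5 and the function names are arbitrary choices for illustration.

```python
from math import comb, prod

def falling(x, j):
    """Falling factorial x (x-1) ... (x-(j-1))."""
    return prod(x - i for i in range(j))

def stirling2(k, j):
    """Stirling number of the second kind via S(k,j) = j S(k-1,j) + S(k-1,j-1)."""
    if k == j:
        return 1
    if j == 0 or j > k:
        return 0
    return j * stirling2(k - 1, j) + stirling2(k - 1, j - 1)

def prob_parents(k, j, two_n):
    """Probability that k alleles have exactly j distinct parents."""
    return stirling2(k, j) * falling(two_n, j) / two_n ** k

two_n, k = 2000, 5
p_distinct = prob_parents(k, k, two_n)            # no coalescence in one generation
approx = 1 - comb(k, 2) / two_n                   # approximation for k << 2N
total = sum(prob_parents(k, j, two_n) for j in range(1, k + 1))
print(p_distinct, approx, total)
```

The probabilities over j = 1..k sum to one, and for k much smaller than 2N almost all of the remaining mass sits on the single-pair event k -> k-1.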
13Geometric/Exponential Distributions
The Geometric Distribution on {1, 2, ..}, Geo(p): $P(Z = j) = p^{j-1}(1-p)$, $P(Z > j) = p^j$, $E(Z) = 1/(1-p)$.
The Exponential Distribution on $R_+$, Exp(a): density $f(t) = a e^{-at}$, $P(X > t) = e^{-at}$.
Properties, for X ~ Exp(a) and Y ~ Exp(b) independent:
i. $P(X > t_2 | X > t_1) = P(X > t_2 - t_1)$ for $t_2 > t_1$.
ii. $E(X) = 1/a$.
iii. $P(Z > t) \approx P(X > t)$ for small a ($p = e^{-a}$).
iv. $P(X < Y) = a/(a+b)$.
v. min(X, Y) ~ Exp(a + b).
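Properties iv and v are the ones used repeatedly later (competing coalescence and mutation events), so a quick Monte Carlo check may help. The rates a = 1 and b = 3, the sample size and the seed are arbitrary assumptions of this sketch.

```python
import math
import random

rng = random.Random(1)
a, b, n = 1.0, 3.0, 200_000

xs = [rng.expovariate(a) for _ in range(n)]
ys = [rng.expovariate(b) for _ in range(n)]

frac_x_first = sum(x < y for x, y in zip(xs, ys)) / n   # property iv: a / (a + b)
mean_min = sum(min(x, y) for x, y in zip(xs, ys)) / n   # property v: mean 1 / (a + b)

# Property iii at an integer time: with p = exp(-a) the geometric and exponential tails agree.
p = math.exp(-a)
geo_tail = p ** 3
exp_tail = math.exp(-a * 3)
print(frac_x_first, mean_min, geo_tail, exp_tail)
```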
14Discrete → Continuous Time
1.0 corresponds to 2N generations.
[Figure: a genealogy drawn on a time axis running from 0.0 to 1.0.]
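The rescaling can be made concrete by comparing the discrete tail probability for two alleles with its exponential limit. This is a sketch: the function name `discrete_tail` and the population sizes tried are illustrative assumptions.

```python
import math

def discrete_tail(t, two_n):
    """P(X2 > 2N t) in the Wright-Fisher model, with t measured in units of 2N generations."""
    return (1 - 1 / two_n) ** round(two_n * t)

t = 1.0
errors = [abs(discrete_tail(t, two_n) - math.exp(-t)) for two_n in (10, 100, 1000, 10000)]
print(errors)  # shrinks as 2N grows
```

Already at moderate population sizes the discrete model and the continuous-time approximation are hard to tell apart, which justifies working with exponential waiting times from here on.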
15Adding Mutations
m: mutation rate per nucleotide per generation. L: sequence length. µ = mL: mutation rate per allele per generation. 2Ne: number of alleles. Θ = 4Nµ: mutation intensity in the scaled process.
[Figure: two panels. Discrete time, discrete sequence: time steps of 1/(2Ne), sequence steps of 1/L. Continuous time, continuous sequence. Events marked: mutation, coalescence.]
Probability of two genes being identical: P(Coalescence < Mutation) = 1/(1+Θ), since the pair coalesces at rate 1 and each of the two lineages mutates at rate Θ/2.
Note: Mutation rate and population size usually appear together as a product, making separate estimation difficult.
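The identity probability follows from a race between one coalescence clock (rate 1) and two mutation clocks (rate Θ/2 each). A Monte Carlo sketch under those assumptions, with Θ = 2 and the function name `prob_identical` chosen for illustration:

```python
import random

def prob_identical(theta, trials, rng):
    """Fraction of trials in which coalescence (rate 1) precedes the first mutation."""
    hits = 0
    for _ in range(trials):
        t_coal = rng.expovariate(1.0)
        # One mutation clock per lineage, each with rate theta / 2.
        t_mut = min(rng.expovariate(theta / 2), rng.expovariate(theta / 2))
        hits += t_coal < t_mut
    return hits / trials

theta = 2.0
estimate = prob_identical(theta, 200_000, random.Random(7))
exact = 1 / (1 + theta)
print(estimate, exact)
```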
16The Standard Coalescent
Two independent processes:
Continuous: exponential waiting times.
Discrete: choosing pairs to coalesce.
[Figure, past at the top:]
Waiting            Coalescing
{1,2,3,4,5}        (1,2)--(3,(4,5))
{1,2}{3,4,5}       1--2
{1}{2}{3,4,5}      3--(4,5)
{1}{2}{3}{4,5}     4--5
{1}{2}{3}{4}{5}
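The two ingredients, exponential waiting times and uniformly chosen pairs, translate directly into a simulation. This is a minimal sketch in the scaled time of the earlier slides (k lineages coalesce at rate k(k-1)/2); the function name `simulate_coalescent` and the seed are assumptions of the example.

```python
import random

def simulate_coalescent(n, rng):
    """Return a list of (time, merged_block) events for a sample of size n."""
    blocks = [frozenset({i}) for i in range(1, n + 1)]
    time, events = 0.0, []
    while len(blocks) > 1:
        k = len(blocks)
        time += rng.expovariate(k * (k - 1) / 2)   # waiting: rate k(k-1)/2
        i, j = rng.sample(range(k), 2)             # coalescing: a uniformly chosen pair
        merged = blocks[i] | blocks[j]
        blocks = [b for idx, b in enumerate(blocks) if idx not in (i, j)] + [merged]
        events.append((time, merged))
    return events

events = simulate_coalescent(5, random.Random(3))
for time, block in events:
    print(f"{time:.3f}  {sorted(block)}")
```

Because the waiting times and the pair choices are drawn independently, tree shape and branch lengths can be studied separately, which the next slide exploits.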
17Expected Height and Total Branch Length
Lineages   Time Epoch    Branch Lengths
2          1             2
3          1/3           1
k          2/(k(k-1))    2/(k-1)
Expected total height of tree: $H_k = 2(1 - 1/k)$.
i. Infinitely many alleles find 1 ancestor in finite time.
ii. It takes less than twice as long for k alleles to find 1 ancestor as it does for 2 alleles.
Expected total branch length in tree: $L_k = 2(1 + 1/2 + 1/3 + \dots + 1/(k-1)) \approx 2\ln(k-1)$.
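Both expectations are sums over epochs: while j lineages remain, the expected epoch length is 2/(j(j-1)) and it contributes j times that to the total branch length. A sketch with exact fractions (the function names are illustrative):

```python
from fractions import Fraction

def expected_height(k):
    """Sum of expected epoch lengths 2 / (j (j-1)) for j = 2..k."""
    return sum(Fraction(2, j * (j - 1)) for j in range(2, k + 1))

def expected_total_length(k):
    """Each epoch with j lineages contributes j branches of that length."""
    return sum(j * Fraction(2, j * (j - 1)) for j in range(2, k + 1))

for k in (2, 5, 25, 1000):
    print(k, float(expected_height(k)), float(expected_total_length(k)))
```

The height saturates below 2 while the total length keeps growing logarithmically, so adding sequences mostly adds short branches near the tips.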
18Kingman (Stoch. Proc. Appl. 13: 235-248, plus 2 other articles, 1982)
A. Stochastic processes on equivalence relations.
$\Delta = \{(i,i) : i = 1,..,n\}$, $\Theta = \{(i,j) : i,j = 1,..,n\}$
$q_{s,t} = 1$ if $s < t$, 0 otherwise (here $s < t$ means that t is obtained from s by merging two equivalence classes).
This defines a process, $R_t$, going from $\Delta$ to $\Theta$ through equivalence relations on {1,..,n}.
B. The Paint Box: exchangeable distributions on partitions.
C. All coalescents are restrictions of The Coalescent, a process with entrance boundary infinity.
D. Robustness of The Coalescent: if the offspring distribution is exchangeable, $Var(\nu_1) \to \sigma^2$ and $E(\nu_1^m) < M_m$ for all m, then genealogies follow The Coalescent in distribution.
E. A series of combinatorial results.
19Effective Population Size, Ne.
In an idealised Wright-Fisher model:
i. variation is retained at a rate of 1 - 1/(2N) per generation, i.e. a fraction 1/(2N) is lost.
ii. the waiting time for 2 random alleles to find a common ancestor is 2N.
Factors that influence Ne:
i. Variance in offspring number. In Wright-Fisher it is 1. If the variance is higher, then the effective population size is smaller.
ii. Population size variation. Example, a k-cycle N1, N2, .., Nk: k/Ne = 1/N1 + .. + 1/Nk. N1 = 10, N2 = 1000 => Ne ≈ 19.8.
iii. Two sexes: Ne = 4NfNm/(Nf+Nm), e.g. Nf = 10, Nm = 1000 => Ne ≈ 40.
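Evaluating the two formulas directly shows how strongly the smaller number dominates. A short sketch, with function names chosen for illustration; for N1 = 10 and N2 = 1000 the harmonic-mean formula gives about 19.8, and the two-sex formula gives about 39.6.

```python
def ne_cycle(sizes):
    """Harmonic mean over a cycle of population sizes: k / Ne = 1/N1 + ... + 1/Nk."""
    return len(sizes) / sum(1 / n for n in sizes)

def ne_two_sexes(n_f, n_m):
    """Ne = 4 Nf Nm / (Nf + Nm)."""
    return 4 * n_f * n_m / (n_f + n_m)

cycle = ne_cycle([10, 1000])
sexes = ne_two_sexes(10, 1000)
print(round(cycle, 1), round(sexes, 1))
```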
20Six Realisations with 25 Leaves
Observations: Variation is great close to the root. Trees are unbalanced.
21Sampling more sequences
The probability that the MRCA of a sample of size n is also the MRCA of a sub-sample of size k is $\frac{(k-1)(n+1)}{(k+1)(n-1)}$. Letting n go to infinity gives (k-1)/(k+1), i.e. even for quite small samples it is quite large.
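To see how quickly this probability grows with k, one can tabulate it. The sketch assumes the standard closed form (k-1)(n+1)/((k+1)(n-1)) for a sub-sample of size k within a sample of size n, whose n → ∞ limit is the (k-1)/(k+1) quoted above; the function name is hypothetical.

```python
from fractions import Fraction

def prob_shared_mrca(k, n):
    """P(sub-sample of size k has the same MRCA as the full sample of size n), 2 <= k <= n."""
    return Fraction((k - 1) * (n + 1), (k + 1) * (n - 1))

for k in (2, 5, 10, 20):
    print(k, float(prob_shared_mrca(k, 10**6)), (k - 1) / (k + 1))
```

For k = 2 and n = 3 the formula gives 2/3, which matches the direct argument: the root is missed only if the two sub-sampled alleles are the first pair to coalesce, an event of probability 1/3.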
22Three Models of Alleles and Mutations.
Finite Site
i. Allele is represented by a sequence.
ii. A mutation changes the nucleotide at a chosen position.
Infinite Allele
i. Only identity or non-identity is determinable.
ii. A mutation creates a new type.
Infinite Site
i. Allele is represented by a line.
ii. A mutation always hits a new position.
[Figure: each model is drawn with mutation intensity Θ; the sequence panels read acgtgctt acgtgcgt acctgcat tcctgcat tcctgcat and acgtgctt acgtgcgt acctgcat tcctggct tcctgcat.]
23Infinite Allele Model
[Figure: example with labels 1-5.]
24Infinite Site Model
Final Aligned Data Set
25[Figure: numeric node labels only; no caption recoverable.]
26Number of paths
[Figure: diagram whose nodes are labelled with path counts, the largest being 82.]
27Labelling and unlabelling: positions and sequences
[Figure: sequences 1-5 shown under "Ignoring mutation position" and "Ignoring sequence label".]
The forward-backward argument:
4 classes of mutation events incompatible with data
9 coalescence events incompatible with data
28Infinite Site Model: An example
Θ = 2.12
[Figure: numeric node labels only, ranging from 2 to 33.]
29Impossible Ancestral States
30Finite Site Model
acgtgctt acgtgcgt acctgcat tcctgcat tcctgcat
Final Aligned Data Set
31Simplifying assumptions
1) Only substitutions. Example alignments from the slide: s1 = TCGGTA, s2 = TGGT-T (with a gap) and s1 = TCGGA, s2 = TGGTT (without gaps).
2) Processes in different positions of the molecule are independent.
3) A nucleotide follows a continuous time Markov chain.
4) Time reversibility, i.e. $\pi_i P_{i,j}(t) = \pi_j P_{j,i}(t)$, where $\pi_i$ is the stationary probability of i. This implies that two sequences N1 and N2 at distances l1 and l2 from a common ancestor a can be treated as joined by a single branch of length l1 + l2.
5) The rate matrix, Q, for the continuous time Markov chain is the same at all times.
32Evolutionary Substitution Process
[Figure: nucleotide states along a time axis with time points t1 and t2, showing states A and C.]
$P_{i,j}(t)$ = probability of going from i to j in time t.
33Jukes-Cantor 69: Total Symmetry.
Rate matrix (FROM row, TO column):
      A     C     G     T
A    -3α    α     α     α
C     α    -3α    α     α
G     α     α    -3α    α
T     α     α     α    -3α
- A. Stationary distribution: (.25, .25, .25, .25)
- B. Expected number of substitutions in time t: 3αt
[Figure: two sequences at times 0 and t, ATTGTGTATATAT.CAG and ATTGCGTATCTAT.CCG; tree labels: Chimp, Mouse, E.coli, Higher Cells, Fish.]
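For this fully symmetric rate matrix the transition probabilities have a well-known closed form, $P_{i,i}(t) = 1/4 + 3/4\,e^{-4\alpha t}$ and $P_{i,j}(t) = 1/4 - 1/4\,e^{-4\alpha t}$ for $i \neq j$. The sketch below assumes that closed form (it is not stated on the slide) and uses illustrative values α = 0.1 and t = 2; the function names are hypothetical.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, alpha, t):
    """P_{i,j}(t) under Jukes-Cantor with off-diagonal rate alpha."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def jc69_matrix(alpha, t):
    """4 x 4 transition matrix, rows and columns ordered A, C, G, T."""
    return [[jc69_prob(i, j, alpha, t) for j in BASES] for i in BASES]

P = jc69_matrix(alpha=0.1, t=2.0)
for base, row in zip(BASES, P):
    print(base, [round(p, 4) for p in row])
```

The matrix is symmetric, so with the uniform stationary distribution the time-reversibility condition of the simplifying assumptions holds, and as t grows every entry tends to 1/4.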
34History of Coalescent Approach to Data Analysis
1930-40s: Genealogical arguments well known to Wright & Fisher.
1964: Crow & Kimura: Infinite Allele Model.
1968: Motoo Kimura proposes a neutral explanation of molecular evolution & population variation. So do King & Jukes.
1971: Kimura & Ohta propose the infinite sites model.
1972: Ewens Formula: probability of data under the infinite allele model.
1975: Watterson makes explicit use of The Coalescent.
1982: Kingman introduces The Coalescent.
1983: Hudson introduces The Coalescent with Recombination.
1983: Kreitman publishes first major population sequences.
35History of Coalescent Approach to Data Analysis
1987-95: Griffiths, Ethier & Tavaré calculate site data probability under the infinite site model.
1994-: Griffiths-Tavaré and Kuhner-Yamato-Felsenstein introduce highly computer-intensive simulation techniques to estimate parameters in population models.
1996-: Krone-Neuhauser introduce selection in the Coalescent.
1998-: Donnelly, Stephens, Fearnhead et al.: major accelerations in coalescent-based data analysis.
2000-: Several groups combine Coalescent Theory & Gene Mapping.
2002: HapMap project is started.
36Basic Coalescent Summary
i. Genealogical approach to population genetics.
ii. The Coalescent: generic probability distribution on allele trees.
iii. Combining The Coalescent with Allele/Mutation Models allows calculation of the probability of data.