Title: CSE280b: Population Genetics
1CSE280b Population Genetics
- Vineet Bafna/Pavel Pevzner
www.cse.ucsd.edu/classes/sp06/cse280b
2Population Sub-structure
3Population sub-structure can increase LD
- Consider two populations that were isolated and
evolving independently. - They might have different allele frequencies in
some regions. - Pick two regions that are far apart (LD is very
low, close to 0)
4Recent ad-mixing of population
- If the populations came together recently (Ex
African and European population), artificial LD
might be created. - D 0.15 (instead of 0.01), increases 10-fold
- This spurious LD might lead to false associations
- Other genetic events can cause LD to arise, and
one needs to be careful
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1
Pop. AB
p10.5 q10.5 P110.1 D0.1-0.250.15
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0
5Determining population sub-structure
- Given a mix of people, can you sub-divide them
into ethnic populations. - Turn the problem of spurious LD into a clue.
- Find markers that are too far apart to show LD
- If they do show LD (correlation), that shows the
existence of multiple populations. - Sub-divide them into populations so that LD
disappears.
6Determining Population sub-structure
- Same example as before
- The two markers are too similar to show any LD,
yet they do show LD. - However, if you split them so that all 0..1 are
in one population and all 1..0 are in another, LD
disappears
7Iterative algorithm for population sub-structure
- Define
- N number of individuals (each has a single
chromosome) - k number of sub-populations.
- Z ? 1..kN is a vector giving the
sub-population. - Zik gt individual i is assigned to population
k - Xi,j allelic value for individual i in position
j - Pk,j,l frequency of allele l at position j in
population k
8Example
- Ex consider the following assignment
- P1,1,0 0.9
- P2,1,0 0.1
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
9Goal
- X is known.
- P, Z are unknown.
- The goal is to estimate Pr(P,ZX)
- Various learning techniques can be employed.
- maxP,Z Pr(XP,Z) (Max likelihood estimate)
- maxP,Z Pr(XP,Z) Pr(P,Z) (MAP)
- Sample P,Z from Pr(P,ZX)
- Here a Bayesian (MCMC) scheme is employed to
sample from Pr(P,ZX). We will only consider a
simplified version
10AlgorithmStructure
- Iteratively estimate
- (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
- After convergence, Z(m) is the answer.
- Iteration
- Guess Z(0)
- For m 1,2,..
- Sample P(m) from Pr(P X, Z(m-1))
- Sample Z(m) from Pr(Z X, P(m))
- How is this sampling done?
11Example
- Choose Z at random, so each individual is
assigned to be in one of 2 populations. See
example. - Now, we need to sample P(1) from Pr(P X, Z(0))
- Simply count
- Nk,j,l number of people in pouplation k which
have allele l in position j - pk,j,l Nk,j,l / N
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
12Example
- Nk,j,l number of people in population k which
have allele l in position j - pk,j,l Nk,j,l / Nk,j,
- N1,1,0 4
- N1,1,1 6
- p1,1,0 4/10
- p1,2,0 4/10
- Thus, we can sample P(m)
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
13Sampling Z
- PrZ1 1 Pr01 belongs to population 1?
- We know that each position should be in linkage
equilibrium and independent. - Pr01 Population 1 p1,1,0 p1,2,1
(4/10)(6/10)(0.24) - Pr01 Population 2 p2,1,0 p2,2,1
(6/10)(4/10)0.24 - Pr Z1 1 0.24/(0.240.24) 0.5
Assuming, HWE, and LE
14Sampling
- Suppose, during the iteration, there is a bias.
- Then, in the next step of sampling Z, we will do
the right thing - Pr01 pop. 1 p1,1,0 p1,2,1 0.70.7
0.49 - Pr01 pop. 2 p2,1,0 p2,2,1 0.30.3
0.09 - PrZ1 1 0.49/(0.490.09) 0.85
- PrZ6 1 0.49/(0.490.09) 0.85
- Eventually all 01 will become 1 population, and
all 10 will become a second population
0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
15Allowing for admixture
- Define qi,k as the fraction of individual i that
originated from population k. - Iteration
- Guess Z(0)
- For m 1,2,..
- Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
- Sample Z(m) from Pr(Z X, P(m),Q(m))
16Estimating Z (admixture case)
- Instead of estimating Pr(Z(i)kX,P,Q), (origin
of individual i is k), we estimate
Pr(Z(i,j,l)kX,P,Q)
i,1
i,2
j
17Results on admixture prediction
18Results Thrush data
- For each individual, q(i) is plotted as the
distance to the opposite side of the triangle. - The assignment is reliable, and there is evidence
of admixture.
19Population Structure
- 377 locations (loci) were sampled in 1000 people
from 52 populations. - 6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al. Science 2003)
Oceania
Eurasia
East Asia
America
Africa
20Population sub-structureresearch problem
- Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an
individual - The sampling approach may or may not be
appropriate. Formulate as an optimization/learning
problem - (w/out admixture). Assign individuals to
sub-populations so as to maximize linkage
equilibrium, and hardy weinberg equilibrium in
each of the sub-populations - (w/ admixture) Assign (individuals, loci) to
sub-populations
21Allowing for admixture
- Define qi,k as the fraction of individual i that
originated from population k. - Iteration
- Guess Z(0)
- For m 1,2,..
- Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
- Sample Z(m) from Pr(Z X, P(m),Q(m))
22Estimating Z (admixture case)
- Instead of estimating Pr(Z(i)kX,P,Q), (origin
of individual i is k), we estimate
Pr(Z(i,j,l)kX,P,Q)
i,1
i,2
j
23Results on admixture prediction simulated data
24Results Thrush data
- For each individual, q(i) is plotted as the
distance to the opposite side of the triangle. - The assignment is reliable, and there is evidence
of admixture.
25Population Structure
- 377 locations (loci) were sampled in 1000 people
from 52 populations. - 6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al. Science 2003)
Oceania
Eurasia
East Asia
America
Africa
26NJ versus Structurethrush data
- Objective function is different in standard
clustering algorithms!
27Population sub-structureresearch problem
- Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an
individual - The sampling approach may or may not be
appropriate. Formulate as an optimization/learning
problem - (w/out admixture). Assign individuals to
sub-populations so as to maximize linkage
equilibrium, and hardy weinberg equilibrium in
each of the sub-populations - (w/ admixture) Assign (individuals, loci) to
sub-populations
28Admixture mapping
29Estimating Recombination Rates
30Recombination in human chromosome 22 (Mb scale)
Dawson et al. Nature 2002
Q Can we give a direct count of the number of
the recombination events?
31Recombination hot-spots (fine scale)
32Recombination rates (chimp/human)
- Fine scale recombination rates differ between
chimp and human - The six hot-spots seen in human are not seen in
chimp
33Combinatorial Bounds for estimating recombination
rate
- Recall that expected recombinations ? log n
- Procedure
- Generate N random ARGs that results in the given
sample - Compute mean of the number of recombinations
- Alternatively, generate a summary statistic s
from the population. - For each ?, generate many populations, and
compute the mean and variance of s (This only
needs to be done once). - Use this to select the most likely ?
- What is the correct summary statistic?
- Today, we talk about the min. number of
recombination events as a possible summary
statistic. It is not the most natural, but it is
the most interesting computationally.
34The Infinite Sites Assumption the 4 gamete
condition
0 0 0 0 0 0 0 0
3
0 0 1 0 0 0 0 0
5
8
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1
- Consider a history without recombination. No pair
of sites shows all four gametes 00,01,10,11. - A pair of sites with all 4 gametes implies a
recombination event
35Hudson Kaplan
- Any pair of sites (i,j) containing 4 gametes must
admit a recombination event. - Disjoint (non-overlapping) sites must contain
distinct recombination events, which can be
summed! This gives a lower bound on the number of
recombination events. - Based on simulations, this bound is not tight.
36Myers and Griffiths03 Idea 1
- Let B(i,j) be a lower bound on the number of
recombinations between sites i and j.
1i1 i2 i3 i4 i5 i6
ikn
- Can we compute maxP R(P) efficiently?
37The Rm bound
38Improved lower bounds
- The Rm bound also gives a general technique for
combining local lower bounds into an overall
lower bound. - In the example, Rm2, but we cannot give any ARG
with 2 recombination events. - Can we improve upon Hudson and Kaplan to get
better local lower bounds?
0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
39Myers Griffiths Idea 2
- Consider the history of individuals. Let Ht
denote the number of distinct haplotypes at time
t - One of three things might happen at time t
- Mutation Ht increase by at most 1
- Recombination Ht increase by at most 1
- Coalescence Ht does not increase
40The RH bound
0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
Ex Rgt 8-3-14
41RH bound
- In general, RH can be quite weak
- consider the case when SgtH
- However, it can be improved
- Partitioning idea sum RH over disjoint intervals
- Apply to any subset of columns. Ex Apply RH to
the yellow columns
000000000000000 000000000000001 000000010000000 00
0000010000001 100000000000000 100000000000001 1000
00010000000 111111111111111
(BB05)
42The Rs bound
- Compute the minimum number of recombination
events R in any ARG. Note that, we do not
explicitly construct the ARG. - Consider a matrix with M with H rows and S
columns. - The rows correspond to haplotypes.
- Columns correspond to sites.
43Rs bound Observation I
s
- Non-informative column If a site contains at
most one 1, or one 0, then in any history, it can
be obtained by adding a mutation to a branch. - EX if a is the haplotype containing a 1, It can
simply be added to the branch without increasing
number of recombination events - R(M) R(M-s)
0 0 0 1
a
44Rs bound Observation 2
- Redundant rows If two rows h1 and h2 are
identical, then - R(M) R(M-h1)
c
r1
r2
45Rs bound Observation 3
- Suppose M has no non-informative columns, or
redundant rows. - Then, at least one of the haplotypes is a
recombinant. - There exists h s.t.
- R(M) R(M-h)1
- Which h should you choose?
46Rs bound (Procedural)
- Procedure Compute_Rs(M)
- If ? non-informative column s
- return (Compute_Rs(M-s))
- Else if ? redundant row h
- return (Compute_Rs(M-h))
- Else
- return (1 minh(Compute_Rs(M-h))
47Results
48Additional results/problems
- Using dynamic programming, Rs can be computed in
2n poly(mn) time. - Also, Rs can be augmented to handle
intermediates. - Are there poly. time lower bounds?
- The number of connected components in the
conflict graph is a lower bound (BB04). - Fast algorithms for computing ARGs with minimum
recombination. - Poly. Time to get ARG with 0 recombination
- Poly. Time to get ARGs that are galled trees
(Gusfield03)
49Underperforming lower bounds
- Sometimes, Rs can be quite weak
- An RI lower bound that uses intermediates can
help (BB05)
50LPL data set
- 71 individuals, 9.7Kbp genomic sequence
- Rm22, Rh70
51(No Transcript)