CSE280b: Population Genetics - PowerPoint PPT Presentation

About This Presentation

Title:

CSE280b: Population Genetics

Description:

Consider two populations that were isolated and evolving ... Myers and Griffiths'03: Idea 1 ... Myers & Griffiths: Idea 2. Consider the history of individuals. ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 52

Provided by: vineet50

Learn more at: https://cseweb.ucsd.edu

Category:

more less

Transcript and Presenter's Notes

Title: CSE280b: Population Genetics

1
CSE280b Population Genetics

Vineet Bafna/Pavel Pevzner

www.cse.ucsd.edu/classes/sp06/cse280b
2
Population Sub-structure
3
Population sub-structure can increase LD

Consider two populations that were isolated and
evolving independently.
They might have different allele frequencies in
some regions.
Pick two regions that are far apart (LD is very
low, close to 0)

4
Recent ad-mixing of population

If the populations came together recently (Ex
African and European population), artificial LD
might be created.
D 0.15 (instead of 0.01), increases 10-fold
This spurious LD might lead to false associations
Other genetic events can cause LD to arise, and
one needs to be careful

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1
Pop. AB
p10.5 q10.5 P110.1 D0.1-0.250.15
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0
5
Determining population sub-structure

Given a mix of people, can you sub-divide them
into ethnic populations.
Turn the problem of spurious LD into a clue.
Find markers that are too far apart to show LD
If they do show LD (correlation), that shows the
existence of multiple populations.
Sub-divide them into populations so that LD
disappears.

6
Determining Population sub-structure

Same example as before
The two markers are too similar to show any LD,
yet they do show LD.
However, if you split them so that all 0..1 are
in one population and all 1..0 are in another, LD
disappears

7
Iterative algorithm for population sub-structure

Define
N number of individuals (each has a single
chromosome)
k number of sub-populations.
Z ? 1..kN is a vector giving the
sub-population.
Zik gt individual i is assigned to population
k
Xi,j allelic value for individual i in position
j
Pk,j,l frequency of allele l at position j in
population k

8
Example

Ex consider the following assignment
P1,1,0 0.9
P2,1,0 0.1

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
9
Goal

X is known.
P, Z are unknown.
The goal is to estimate Pr(P,ZX)
Various learning techniques can be employed.
maxP,Z Pr(XP,Z) (Max likelihood estimate)
maxP,Z Pr(XP,Z) Pr(P,Z) (MAP)
Sample P,Z from Pr(P,ZX)
Here a Bayesian (MCMC) scheme is employed to
sample from Pr(P,ZX). We will only consider a
simplified version

10
AlgorithmStructure

Iteratively estimate
(Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
After convergence, Z(m) is the answer.
Iteration
Guess Z(0)
For m 1,2,..
Sample P(m) from Pr(P X, Z(m-1))
Sample Z(m) from Pr(Z X, P(m))
How is this sampling done?

11
Example

Choose Z at random, so each individual is
assigned to be in one of 2 populations. See
example.
Now, we need to sample P(1) from Pr(P X, Z(0))
Simply count
Nk,j,l number of people in pouplation k which
have allele l in position j
pk,j,l Nk,j,l / N

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
12
Example

Nk,j,l number of people in population k which
have allele l in position j
pk,j,l Nk,j,l / Nk,j,
N1,1,0 4
N1,1,1 6
p1,1,0 4/10
p1,2,0 4/10
Thus, we can sample P(m)

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
13
Sampling Z

PrZ1 1 Pr01 belongs to population 1?
We know that each position should be in linkage
equilibrium and independent.
Pr01 Population 1 p1,1,0 p1,2,1
(4/10)(6/10)(0.24)
Pr01 Population 2 p2,1,0 p2,2,1
(6/10)(4/10)0.24
Pr Z1 1 0.24/(0.240.24) 0.5

Assuming, HWE, and LE
14
Sampling

Suppose, during the iteration, there is a bias.
Then, in the next step of sampling Z, we will do
the right thing
Pr01 pop. 1 p1,1,0 p1,2,1 0.70.7
0.49
Pr01 pop. 2 p2,1,0 p2,2,1 0.30.3
0.09
PrZ1 1 0.49/(0.490.09) 0.85
PrZ6 1 0.49/(0.490.09) 0.85
Eventually all 01 will become 1 population, and
all 10 will become a second population

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
15
Allowing for admixture

Define qi,k as the fraction of individual i that
originated from population k.
Iteration
Guess Z(0)
For m 1,2,..
Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
Sample Z(m) from Pr(Z X, P(m),Q(m))

16
Estimating Z (admixture case)

Instead of estimating Pr(Z(i)kX,P,Q), (origin
of individual i is k), we estimate
Pr(Z(i,j,l)kX,P,Q)

i,1
i,2
j
17
Results on admixture prediction
18
Results Thrush data

For each individual, q(i) is plotted as the
distance to the opposite side of the triangle.
The assignment is reliable, and there is evidence
of admixture.

19
Population Structure

377 locations (loci) were sampled in 1000 people
from 52 populations.
6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al. Science 2003)

Oceania
Eurasia
East Asia
America
Africa
20
Population sub-structureresearch problem

Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an
individual
The sampling approach may or may not be
appropriate. Formulate as an optimization/learning
problem
(w/out admixture). Assign individuals to
sub-populations so as to maximize linkage
equilibrium, and hardy weinberg equilibrium in
each of the sub-populations
(w/ admixture) Assign (individuals, loci) to
sub-populations

21
Allowing for admixture

Define qi,k as the fraction of individual i that
originated from population k.
Iteration
Guess Z(0)
For m 1,2,..
Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
Sample Z(m) from Pr(Z X, P(m),Q(m))

22
Estimating Z (admixture case)

Instead of estimating Pr(Z(i)kX,P,Q), (origin
of individual i is k), we estimate
Pr(Z(i,j,l)kX,P,Q)

i,1
i,2
j
23
Results on admixture prediction simulated data
24
Results Thrush data

For each individual, q(i) is plotted as the
distance to the opposite side of the triangle.
The assignment is reliable, and there is evidence
of admixture.

25
Population Structure

377 locations (loci) were sampled in 1000 people
from 52 populations.
6 genetic clusters were obtained, which
corresponded to 5 geographic regions (Rosenberg
et al. Science 2003)

Oceania
Eurasia
East Asia
America
Africa
26
NJ versus Structurethrush data

Objective function is different in standard
clustering algorithms!

27
Population sub-structureresearch problem

Systematically explore the effect of admixture.
Can admixture be predicted for a locus, or for an
individual
The sampling approach may or may not be
appropriate. Formulate as an optimization/learning
problem
(w/out admixture). Assign individuals to
sub-populations so as to maximize linkage
equilibrium, and hardy weinberg equilibrium in
each of the sub-populations
(w/ admixture) Assign (individuals, loci) to
sub-populations

28
Admixture mapping
29
Estimating Recombination Rates
30
Recombination in human chromosome 22 (Mb scale)
Dawson et al. Nature 2002
Q Can we give a direct count of the number of
the recombination events?
31
Recombination hot-spots (fine scale)
32
Recombination rates (chimp/human)

Fine scale recombination rates differ between
chimp and human
The six hot-spots seen in human are not seen in
chimp

33
Combinatorial Bounds for estimating recombination
rate

Recall that expected recombinations ? log n
Procedure
Generate N random ARGs that results in the given
sample
Compute mean of the number of recombinations
Alternatively, generate a summary statistic s
from the population.
For each ?, generate many populations, and
compute the mean and variance of s (This only
needs to be done once).
Use this to select the most likely ?
What is the correct summary statistic?
Today, we talk about the min. number of
recombination events as a possible summary
statistic. It is not the most natural, but it is
the most interesting computationally.

34
The Infinite Sites Assumption the 4 gamete
condition
0 0 0 0 0 0 0 0
3
0 0 1 0 0 0 0 0
5
8
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1

Consider a history without recombination. No pair
of sites shows all four gametes 00,01,10,11.
A pair of sites with all 4 gametes implies a
recombination event

35
Hudson Kaplan

Any pair of sites (i,j) containing 4 gametes must
admit a recombination event.
Disjoint (non-overlapping) sites must contain
distinct recombination events, which can be
summed! This gives a lower bound on the number of
recombination events.
Based on simulations, this bound is not tight.

36
Myers and Griffiths03 Idea 1

Let B(i,j) be a lower bound on the number of
recombinations between sites i and j.

1i1 i2 i3 i4 i5 i6
ikn

Can we compute maxP R(P) efficiently?

37
The Rm bound
38
Improved lower bounds

The Rm bound also gives a general technique for
combining local lower bounds into an overall
lower bound.
In the example, Rm2, but we cannot give any ARG
with 2 recombination events.
Can we improve upon Hudson and Kaplan to get
better local lower bounds?

0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
39
Myers Griffiths Idea 2

Consider the history of individuals. Let Ht
denote the number of distinct haplotypes at time
t
One of three things might happen at time t
Mutation Ht increase by at most 1
Recombination Ht increase by at most 1
Coalescence Ht does not increase

40
The RH bound
0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
Ex Rgt 8-3-14
41
RH bound

In general, RH can be quite weak
consider the case when SgtH
However, it can be improved
Partitioning idea sum RH over disjoint intervals
Apply to any subset of columns. Ex Apply RH to
the yellow columns

000000000000000 000000000000001 000000010000000 00
0000010000001 100000000000000 100000000000001 1000
00010000000 111111111111111
(BB05)
42
The Rs bound

Compute the minimum number of recombination
events R in any ARG. Note that, we do not
explicitly construct the ARG.
Consider a matrix with M with H rows and S
columns.
The rows correspond to haplotypes.
Columns correspond to sites.

43
Rs bound Observation I
s

Non-informative column If a site contains at
most one 1, or one 0, then in any history, it can
be obtained by adding a mutation to a branch.
EX if a is the haplotype containing a 1, It can
simply be added to the branch without increasing
number of recombination events
R(M) R(M-s)

0 0 0 1
a
44
Rs bound Observation 2

Redundant rows If two rows h1 and h2 are
identical, then
R(M) R(M-h1)

c
r1
r2
45
Rs bound Observation 3

Suppose M has no non-informative columns, or
redundant rows.
Then, at least one of the haplotypes is a
recombinant.
There exists h s.t.
R(M) R(M-h)1
Which h should you choose?

46
Rs bound (Procedural)

Procedure Compute_Rs(M)
If ? non-informative column s
return (Compute_Rs(M-s))
Else if ? redundant row h
return (Compute_Rs(M-h))
Else
return (1 minh(Compute_Rs(M-h))

47
Results
48
Additional results/problems

Using dynamic programming, Rs can be computed in
2n poly(mn) time.
Also, Rs can be augmented to handle
intermediates.
Are there poly. time lower bounds?
The number of connected components in the
conflict graph is a lower bound (BB04).
Fast algorithms for computing ARGs with minimum
recombination.
Poly. Time to get ARG with 0 recombination
Poly. Time to get ARGs that are galled trees
(Gusfield03)

49
Underperforming lower bounds