CSE280b: Population Genetics - PowerPoint PPT Presentation

About This Presentation
Title:

CSE280b: Population Genetics

Description:

Consider two populations that were isolated and evolving ... Myers and Griffiths'03: Idea 1 ... Myers & Griffiths: Idea 2. Consider the history of individuals. ... – PowerPoint PPT presentation

Number of Views:22
Avg rating:3.0/5.0
Slides: 52
Provided by: vineet50
Learn more at: https://cseweb.ucsd.edu
Category:

less

Transcript and Presenter's Notes

Title: CSE280b: Population Genetics


1
CSE280b Population Genetics
  • Vineet Bafna/Pavel Pevzner

www.cse.ucsd.edu/classes/sp06/cse280b
2
Population Sub-structure
3
Population sub-structure can increase LD
  • Consider two populations that were isolated and
    evolving independently.
  • They might have different allele frequencies in
    some regions.
  • Pick two regions that are far apart (LD is very
    low, close to 0)

4
Recent ad-mixing of population
  • If the populations came together recently (Ex
    African and European population), artificial LD
    might be created.
  • D 0.15 (instead of 0.01), increases 10-fold
  • This spurious LD might lead to false associations
  • Other genetic events can cause LD to arise, and
    one needs to be careful

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1
Pop. AB
p10.5 q10.5 P110.1 D0.1-0.250.15
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0
5
Determining population sub-structure
  • Given a mix of people, can you sub-divide them
    into ethnic populations.
  • Turn the problem of spurious LD into a clue.
  • Find markers that are too far apart to show LD
  • If they do show LD (correlation), that shows the
    existence of multiple populations.
  • Sub-divide them into populations so that LD
    disappears.

6
Determining Population sub-structure
  • Same example as before
  • The two markers are too similar to show any LD,
    yet they do show LD.
  • However, if you split them so that all 0..1 are
    in one population and all 1..0 are in another, LD
    disappears

7
Iterative algorithm for population sub-structure
  • Define
  • N number of individuals (each has a single
    chromosome)
  • k number of sub-populations.
  • Z ? 1..kN is a vector giving the
    sub-population.
  • Zik gt individual i is assigned to population
    k
  • Xi,j allelic value for individual i in position
    j
  • Pk,j,l frequency of allele l at position j in
    population k

8
Example
  • Ex consider the following assignment
  • P1,1,0 0.9
  • P2,1,0 0.1

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
9
Goal
  • X is known.
  • P, Z are unknown.
  • The goal is to estimate Pr(P,ZX)
  • Various learning techniques can be employed.
  • maxP,Z Pr(XP,Z) (Max likelihood estimate)
  • maxP,Z Pr(XP,Z) Pr(P,Z) (MAP)
  • Sample P,Z from Pr(P,ZX)
  • Here a Bayesian (MCMC) scheme is employed to
    sample from Pr(P,ZX). We will only consider a
    simplified version

10
AlgorithmStructure
  • Iteratively estimate
  • (Z(0),P(0)), (Z(1),P(1)),.., (Z(m),P(m))
  • After convergence, Z(m) is the answer.
  • Iteration
  • Guess Z(0)
  • For m 1,2,..
  • Sample P(m) from Pr(P X, Z(m-1))
  • Sample Z(m) from Pr(Z X, P(m))
  • How is this sampling done?

11
Example
  • Choose Z at random, so each individual is
    assigned to be in one of 2 populations. See
    example.
  • Now, we need to sample P(1) from Pr(P X, Z(0))
  • Simply count
  • Nk,j,l number of people in pouplation k which
    have allele l in position j
  • pk,j,l Nk,j,l / N

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
12
Example
  • Nk,j,l number of people in population k which
    have allele l in position j
  • pk,j,l Nk,j,l / Nk,j,
  • N1,1,0 4
  • N1,1,1 6
  • p1,1,0 4/10
  • p1,2,0 4/10
  • Thus, we can sample P(m)

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 2 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
13
Sampling Z
  • PrZ1 1 Pr01 belongs to population 1?
  • We know that each position should be in linkage
    equilibrium and independent.
  • Pr01 Population 1 p1,1,0 p1,2,1
    (4/10)(6/10)(0.24)
  • Pr01 Population 2 p2,1,0 p2,2,1
    (6/10)(4/10)0.24
  • Pr Z1 1 0.24/(0.240.24) 0.5

Assuming, HWE, and LE
14
Sampling
  • Suppose, during the iteration, there is a bias.
  • Then, in the next step of sampling Z, we will do
    the right thing
  • Pr01 pop. 1 p1,1,0 p1,2,1 0.70.7
    0.49
  • Pr01 pop. 2 p2,1,0 p2,2,1 0.30.3
    0.09
  • PrZ1 1 0.49/(0.490.09) 0.85
  • PrZ6 1 0.49/(0.490.09) 0.85
  • Eventually all 01 will become 1 population, and
    all 10 will become a second population

0 .. 1 0 .. 1 0 .. 0 1 .. 1 0 .. 1 0 .. 1 0 ..
1 0 .. 1 0 .. 1 0 .. 1
1 1 1 2 1 2 1 2 1 1 2 2 2 1 2 2 1 2 2 1
1 .. 0 1 .. 0 0 .. 0 1 .. 1 1 .. 0 1 .. 0 1 ..
0 1 .. 0 1 .. 0 1 .. 0
15
Allowing for admixture
  • Define qi,k as the fraction of individual i that
    originated from population k.
  • Iteration
  • Guess Z(0)
  • For m 1,2,..
  • Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
  • Sample Z(m) from Pr(Z X, P(m),Q(m))

16
Estimating Z (admixture case)
  • Instead of estimating Pr(Z(i)kX,P,Q), (origin
    of individual i is k), we estimate
    Pr(Z(i,j,l)kX,P,Q)

i,1
i,2
j
17
Results on admixture prediction
18
Results Thrush data
  • For each individual, q(i) is plotted as the
    distance to the opposite side of the triangle.
  • The assignment is reliable, and there is evidence
    of admixture.

19
Population Structure
  • 377 locations (loci) were sampled in 1000 people
    from 52 populations.
  • 6 genetic clusters were obtained, which
    corresponded to 5 geographic regions (Rosenberg
    et al. Science 2003)

Oceania
Eurasia
East Asia
America
Africa
20
Population sub-structureresearch problem
  • Systematically explore the effect of admixture.
    Can admixture be predicted for a locus, or for an
    individual
  • The sampling approach may or may not be
    appropriate. Formulate as an optimization/learning
    problem
  • (w/out admixture). Assign individuals to
    sub-populations so as to maximize linkage
    equilibrium, and hardy weinberg equilibrium in
    each of the sub-populations
  • (w/ admixture) Assign (individuals, loci) to
    sub-populations

21
Allowing for admixture
  • Define qi,k as the fraction of individual i that
    originated from population k.
  • Iteration
  • Guess Z(0)
  • For m 1,2,..
  • Sample P(m),Q(m) from Pr(P,Q X, Z(m-1))
  • Sample Z(m) from Pr(Z X, P(m),Q(m))

22
Estimating Z (admixture case)
  • Instead of estimating Pr(Z(i)kX,P,Q), (origin
    of individual i is k), we estimate
    Pr(Z(i,j,l)kX,P,Q)

i,1
i,2
j
23
Results on admixture prediction simulated data
24
Results Thrush data
  • For each individual, q(i) is plotted as the
    distance to the opposite side of the triangle.
  • The assignment is reliable, and there is evidence
    of admixture.

25
Population Structure
  • 377 locations (loci) were sampled in 1000 people
    from 52 populations.
  • 6 genetic clusters were obtained, which
    corresponded to 5 geographic regions (Rosenberg
    et al. Science 2003)

Oceania
Eurasia
East Asia
America
Africa
26
NJ versus Structurethrush data
  • Objective function is different in standard
    clustering algorithms!

27
Population sub-structureresearch problem
  • Systematically explore the effect of admixture.
    Can admixture be predicted for a locus, or for an
    individual
  • The sampling approach may or may not be
    appropriate. Formulate as an optimization/learning
    problem
  • (w/out admixture). Assign individuals to
    sub-populations so as to maximize linkage
    equilibrium, and hardy weinberg equilibrium in
    each of the sub-populations
  • (w/ admixture) Assign (individuals, loci) to
    sub-populations

28
Admixture mapping
29
Estimating Recombination Rates
30
Recombination in human chromosome 22 (Mb scale)
Dawson et al. Nature 2002
Q Can we give a direct count of the number of
the recombination events?
31
Recombination hot-spots (fine scale)
32
Recombination rates (chimp/human)
  • Fine scale recombination rates differ between
    chimp and human
  • The six hot-spots seen in human are not seen in
    chimp

33
Combinatorial Bounds for estimating recombination
rate
  • Recall that expected recombinations ? log n
  • Procedure
  • Generate N random ARGs that results in the given
    sample
  • Compute mean of the number of recombinations
  • Alternatively, generate a summary statistic s
    from the population.
  • For each ?, generate many populations, and
    compute the mean and variance of s (This only
    needs to be done once).
  • Use this to select the most likely ?
  • What is the correct summary statistic?
  • Today, we talk about the min. number of
    recombination events as a possible summary
    statistic. It is not the most natural, but it is
    the most interesting computationally.

34
The Infinite Sites Assumption the 4 gamete
condition
0 0 0 0 0 0 0 0
3
0 0 1 0 0 0 0 0
5
8
0 0 1 0 1 0 0 0
0 0 1 0 0 0 0 1
  • Consider a history without recombination. No pair
    of sites shows all four gametes 00,01,10,11.
  • A pair of sites with all 4 gametes implies a
    recombination event

35
Hudson Kaplan
  • Any pair of sites (i,j) containing 4 gametes must
    admit a recombination event.
  • Disjoint (non-overlapping) sites must contain
    distinct recombination events, which can be
    summed! This gives a lower bound on the number of
    recombination events.
  • Based on simulations, this bound is not tight.

36
Myers and Griffiths03 Idea 1
  • Let B(i,j) be a lower bound on the number of
    recombinations between sites i and j.

1i1 i2 i3 i4 i5 i6
ikn
  • Can we compute maxP R(P) efficiently?

37
The Rm bound
38
Improved lower bounds
  • The Rm bound also gives a general technique for
    combining local lower bounds into an overall
    lower bound.
  • In the example, Rm2, but we cannot give any ARG
    with 2 recombination events.
  • Can we improve upon Hudson and Kaplan to get
    better local lower bounds?

0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
39
Myers Griffiths Idea 2
  • Consider the history of individuals. Let Ht
    denote the number of distinct haplotypes at time
    t
  • One of three things might happen at time t
  • Mutation Ht increase by at most 1
  • Recombination Ht increase by at most 1
  • Coalescence Ht does not increase

40
The RH bound
0 0 0 0 0 1 0 1 0 0 1 1 1 0 0 1 0 1 1 1 0 1 1 1
Ex Rgt 8-3-14
41
RH bound
  • In general, RH can be quite weak
  • consider the case when SgtH
  • However, it can be improved
  • Partitioning idea sum RH over disjoint intervals
  • Apply to any subset of columns. Ex Apply RH to
    the yellow columns

000000000000000 000000000000001 000000010000000 00
0000010000001 100000000000000 100000000000001 1000
00010000000 111111111111111
(BB05)
42
The Rs bound
  • Compute the minimum number of recombination
    events R in any ARG. Note that, we do not
    explicitly construct the ARG.
  • Consider a matrix with M with H rows and S
    columns.
  • The rows correspond to haplotypes.
  • Columns correspond to sites.

43
Rs bound Observation I
s
  • Non-informative column If a site contains at
    most one 1, or one 0, then in any history, it can
    be obtained by adding a mutation to a branch.
  • EX if a is the haplotype containing a 1, It can
    simply be added to the branch without increasing
    number of recombination events
  • R(M) R(M-s)

0 0 0 1
a
44
Rs bound Observation 2
  • Redundant rows If two rows h1 and h2 are
    identical, then
  • R(M) R(M-h1)

c
r1
r2
45
Rs bound Observation 3
  • Suppose M has no non-informative columns, or
    redundant rows.
  • Then, at least one of the haplotypes is a
    recombinant.
  • There exists h s.t.
  • R(M) R(M-h)1
  • Which h should you choose?

46
Rs bound (Procedural)
  • Procedure Compute_Rs(M)
  • If ? non-informative column s
  • return (Compute_Rs(M-s))
  • Else if ? redundant row h
  • return (Compute_Rs(M-h))
  • Else
  • return (1 minh(Compute_Rs(M-h))

47
Results
48
Additional results/problems
  • Using dynamic programming, Rs can be computed in
    2n poly(mn) time.
  • Also, Rs can be augmented to handle
    intermediates.
  • Are there poly. time lower bounds?
  • The number of connected components in the
    conflict graph is a lower bound (BB04).
  • Fast algorithms for computing ARGs with minimum
    recombination.
  • Poly. Time to get ARG with 0 recombination
  • Poly. Time to get ARGs that are galled trees
    (Gusfield03)

49
Underperforming lower bounds
  • Sometimes, Rs can be quite weak
  • An RI lower bound that uses intermediates can
    help (BB05)

50
LPL data set
  • 71 individuals, 9.7Kbp genomic sequence
  • Rm22, Rh70

51
(No Transcript)
Write a Comment
User Comments (0)
About PowerShow.com