SNP Haplotype Estimation From Pooled DNA Samples

1
SNP Haplotype Estimation From Pooled DNA Samples
  • Yaning Yang
  • Lab of Statistical Genetics
  • Rockefeller University
  • (yyang_at_linkage.Rockefeller.edu)

2
I. Background

3
Genotype-phenotype
  • Genotype-phenotype association: the central
    objective of genetic studies.
  • Completion of the human genome sequence is the
    foundation (Collins et al. 2003).

4
Genotype-phenotype
  • Genotype
      • Internally coded, inheritable information
      • Needs to be polymorphic (variation)
      • Interacts with environmental factors
  • Phenotype
      • Outward, physical manifestation of the organism
      • Disease status, survival times, quantitative
        traits (QTL)
      • Complex traits/diseases, simple traits/diseases

5
Simple (Mendelian) Traits
  • Single gene, simple mode of inheritance.
  • Huntington disease, cystic fibrosis.
  • Method: linkage analysis
  • Co-segregation of marker and disease within
    pedigrees, based on recombination events.
  • More than 400 simple diseases have been
    genetically mapped.

6
Complex Traits
  • Polygenic effects, environmental factors and complex
    epistasis (interaction): hard to dissect.
  • Polygenic: multiple genes, each with small to
    moderate effect.
  • Environmental factors: race, gender, diet, etc.
  • Cancer, diabetes, Alzheimer's disease (AD), etc.
  • Methods: association analysis
      • population-based
      • family-based

7
Polymorphism
  • A difference in the nucleotide sequence of DNA
    among individuals or populations.
  • P(minor allele) > 0.1, say.
  • Genetic mutation is polymorphism.
  • Marker: locus-specific polymorphism
  • Like a (chaining) in approximation
  • A significant marker may itself be, or lie close to,
    the causal genetic variant

8
SNP Single Nucleotide Polymorphism
  • The simplest and most common genetic polymorphism
  • Simple: a single base mutation in DNA
  • Common: 90% of all human DNA variation
  • Abundance: 0.1%
  • Biallelic (binary)
  • 2 million SNPs reported

9
Genotyping
  • Diploid individual with two haplotypes spanning
    SNP1 (locus 1) and SNP2 (locus 2):
    haplotype 1: ---G-----A---
    haplotype 2: ---T-----C---
  • Genotype: G/T, A/C.
  • At each locus there are two possible alleles; e.g., at
    locus 1 the genotype can be G/G, T/T or T/G.

10
Association
  • Population-based
      • Epidemiological methods: case-control, cohort
        design
      • Stratification: control over confounding factors
      • Powerful, but easily produces spurious associations
        due to population admixture (heterogeneity)
  • Family-based
      • TDT (McNemar's test for matched pairs)
      • No need for stratification; true associations
      • Sampling is costly

11
Association
  • Identification of causal genetic variants.
  • Understanding their functions and disease
    etiology.
  • Helps with disease prevention, diagnostics and drug
    development (e.g., personalized medicine).

12
Allele Association (marginal)
1. Disease-allele association
   (2 x 2 table; rows: case, control; columns: alleles G, T)
2. Disease-genotype association
   (2 x 3 table; rows: case, control; columns: genotypes G/G, G/T, T/T)

13
Haplotype Association (joint)
  • Most genome screens test one locus at a time.
  • The dependence structure (linkage disequilibrium)
    needs to be considered.
  • Example: haplotypes for case and control
      Case:    ---A-----B---   ---a-----b---
      Control: ---A-----b---   ---a-----B---
  • In this example each allele is equally frequent in
    cases and controls; only the joint haplotypes differ.

14
Haplotype
  • The total number of haplotypes is 2^m for m SNPs.
  • For example, m = 3 (biallelic at each position:
    A/a, B/b, C/c) gives 8 haplotypes: ABC, ABc, AbC,
    Abc, aBC, aBc, abC, abc.
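
The 2^m count can be checked by direct enumeration. The following is a minimal sketch; the function name `enumerate_haplotypes` and the letter coding of alleles are illustrative choices, not part of the original slides.

```python
from itertools import product


def enumerate_haplotypes(loci):
    """Return every haplotype for a list of biallelic loci.

    Each locus is given as a pair of allele labels, e.g. ("A", "a").
    A haplotype picks exactly one allele per locus.
    """
    return ["".join(alleles) for alleles in product(*loci)]


# Three biallelic SNPs, as in the slide's m = 3 example.
haplotypes = enumerate_haplotypes([("A", "a"), ("B", "b"), ("C", "c")])
print(len(haplotypes))  # 8, i.e. 2**3
print(haplotypes)
```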

15
Why Haplotype?
  • LD
  • Alleles in linkage disequilibrium (LD) are
    tightly linked and tend to co-segregate.
  • LD plays a fundamental role in genetic mapping of
    complex diseases.
  • Haplotypes preserve LD information.

16
Why Haplotype?
  • Haplotype
  • A haplotype is a binary sequence along one
    chromosome.
  • Haplotype has a block-wise structure separated by
    hot spots.
  • Within each block, recombination is rare due to
    tight linkage and only very few haplotypes really
    occur.

17
A Brief Summary
  • Genotype-phenotype association analysis for
    complex disease
  • Genetic variant/polymorphism/marker
  • Genetic variation, human variation
  • SNP: simple, abundant genetic variant/marker
  • LD: dependence among markers
  • Haplotype analysis: joint distribution

18
II. Haplotype Estimation From Pooled DNA
19
Introduction
Key words: efficiency, EM algorithm, haplotype
frequency, LD coefficients, pooling, variance
estimates.
20
Estimating haplotype frequencies from individual
genotypes
  • Individual samples are genotyped.
  • No phase information.
  • Likelihood analysis (Excoffier & Slatkin, 1995),
    but no variance estimate.
  • Other methods: Clark's parsimony method (Clark
    1990), Bayesian MCMC

21
Genotyping individual DNA
  • Diploid: ---A-----B--- (haplotype) and
    ---a-----b--- (haplotype)
  • Genotyping: A/a, B/b observed genotypes
    (phase information is lost)
  • Reconstruct: haplotype configurations
    ---A-----B--- / ---a-----b---  or
    ---A-----b--- / ---a-----B---

22
Pooling: Reduce Genotyping Cost
  • Unrelated individual samples are mixed, more
    ambiguities in recovering haplotypes.
  • No individual information and no phase
    information.
  • Efficient in allele frequency estimation, but is
    it efficient in estimating haplotype frequency?
  • Wang et al. (2003), Ito et al. (2003).

23
Genotyping pooled DNA
  • Pooling:
    individual 1 (diploid): ---A-----B--- / ---a-----b---
    individual 2 (diploid): ---A-----b--- / ---a-----B---
  • Genotyping: AAaa, BBbb observed pool-genotypes
  • Haplotype configurations consistent with the
    pool-genotypes:
    ---A-----B---, ---A-----b---, ---a-----B---, ---a-----b---  or
    ---A-----B---, ---A-----B---, ---a-----b---, ---a-----b---  or
    ---A-----b---, ---A-----b---, ---a-----B---, ---a-----B---

24
Pool-genotype of a K-pool
  • Pool-genotype: the number of copies of allele 1
    at each locus. E.g.
  • An individual can be viewed as a pool of two
    independent chromosomes.
  • We will say a chromosome is a ½-pool.

Table: SNPs 1 to 5 (columns) by Individual 1,
Individual 2 and Pool-genotypes (rows)
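
A pool-genotype of this kind can be sketched as a per-locus count of allele 1 over all chromosomes in the pool. In the sketch below the function name `pool_genotype`, the 0/1 allele coding and the five-SNP haplotypes are hypothetical illustrations; they are not the values from the slide's table.

```python
def pool_genotype(chromosomes):
    """Count allele 1 at each locus over all chromosomes in a pool.

    chromosomes: list of equal-length tuples of 0/1 alleles, one tuple
    per chromosome (a K-pool contributes 2K chromosomes).
    """
    m = len(chromosomes[0])
    return tuple(sum(c[j] for c in chromosomes) for j in range(m))


# Hypothetical haplotypes at 5 SNPs (not taken from the slides).
individual_1 = [(1, 0, 1, 1, 0), (0, 0, 1, 0, 1)]
individual_2 = [(1, 1, 0, 1, 0), (1, 0, 0, 0, 0)]

print(pool_genotype(individual_1))                 # individual = pool of 2 chromosomes
print(pool_genotype(individual_2))
print(pool_genotype(individual_1 + individual_2))  # K = 2 pool
```

A single chromosome passed on its own returns itself, which matches the ½-pool view above.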
25
Missing values
  • Pool-genotype at m SNP loci,
  • Completely missing: no information.
  • Partially missing: partial information, e.g.,
    only know

26
Statistical Methods
  • Key words: asymptotic variance, EM, MLE, missing
    data, relative efficiency

27
Notations
  • For m SNPs, each position can take two possible
    alleles. Denote them by 1 and 0.
  • In total, 2^m possible haplotypes.
  • Haplotype frequencies
  • m = 3

SNP 1 2 3
28
Maximum Likelihood Estimate
  • Assumptions: HWE, random mating
  • Likelihood
  • where
  • When K = 1/2, multinomial!

29
An Example
  • m = 2, K = 2, observation X = (2, 1). Consistent
    haplotype configurations are
  • Likelihood
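
The example can be worked through by brute force. The sketch below assumes that X counts copies of allele 1 at each locus, that K is a whole number of individuals, and that the 2K chromosomes in a pool are independent draws from the haplotype frequencies h (the HWE and random-mating assumptions), so that each configuration has a multinomial probability. The function names and the uniform h are illustrative only.

```python
from itertools import combinations_with_replacement, product
from math import factorial, prod


def consistent_configurations(x, K):
    """All unordered sets of 2K haplotypes whose allele-1 counts equal x."""
    m = len(x)
    haplotypes = list(product((1, 0), repeat=m))
    return [
        config
        for config in combinations_with_replacement(haplotypes, 2 * K)
        if all(sum(h[j] for h in config) == x[j] for j in range(m))
    ]


def pool_likelihood(x, K, h):
    """P(X = x) for one pool: sum of multinomial probabilities over
    the configurations consistent with x. h maps haplotype -> frequency."""
    total = 0.0
    for config in consistent_configurations(x, K):
        counts = {hap: config.count(hap) for hap in set(config)}
        coef = factorial(2 * K)
        for c in counts.values():
            coef //= factorial(c)  # multinomial coefficient, exact integer
        total += coef * prod(h[hap] ** c for hap, c in counts.items())
    return total


configs = consistent_configurations((2, 1), K=2)
for config in configs:
    print(config)

# Hypothetical frequencies: all four two-locus haplotypes equally likely.
uniform = {hap: 0.25 for hap in product((1, 0), repeat=2)}
print(pool_likelihood((2, 1), 2, uniform))
```

Under these assumptions the observation X = (2, 1) is compatible with two configurations, {11, 10, 00, 00} and {10, 10, 01, 00}, and the likelihood of the pool is the sum of their two multinomial terms.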

30
EM algorithm
  • Initial value: h_k = h_k^(0), k = 1, 2, ..., 2^m
  • Update the estimates, where c_J(k) = number of
    copies of haplotype k in configuration J
  • Converged? NO: repeat the update. YES: END.

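
The flowchart can be sketched as a standard EM iteration for haplotype counts. This is an illustration under stated assumptions, not the presenter's implementation: all pools share one integer size K, no values are missing, every pool-genotype is feasible (entries between 0 and 2K), and the update is the usual expected-count formula in which each configuration J contributes c_J(k) weighted by its posterior probability. All function names are hypothetical.

```python
from itertools import combinations_with_replacement, product
from math import factorial


def configurations(x, K, haplotypes):
    """Unordered sets of 2K haplotypes whose allele-1 counts equal x."""
    m = len(x)
    return [c for c in combinations_with_replacement(haplotypes, 2 * K)
            if all(sum(h[j] for h in c) == x[j] for j in range(m))]


def multinomial_weight(config, freq):
    """Multinomial probability of a configuration under frequencies freq."""
    w = float(factorial(len(config)))
    for hap in set(config):
        c = config.count(hap)
        w *= freq[hap] ** c / factorial(c)
    return w


def em_haplotype_frequencies(pool_genotypes, K, m, tol=1e-8, max_iter=1000):
    haplotypes = list(product((1, 0), repeat=m))
    # Initial value h^(0): uniform over the 2^m haplotypes.
    freq = {hap: 1.0 / len(haplotypes) for hap in haplotypes}
    configs_per_pool = [configurations(x, K, haplotypes) for x in pool_genotypes]
    total_chromosomes = 2 * K * len(pool_genotypes)
    for _ in range(max_iter):
        # E-step: expected haplotype counts given the current frequencies.
        expected = dict.fromkeys(haplotypes, 0.0)
        for configs in configs_per_pool:
            weights = [multinomial_weight(c, freq) for c in configs]
            norm = sum(weights)
            for config, w in zip(configs, weights):
                for hap in config:  # adds c_J(k) * posterior weight in total
                    expected[hap] += w / norm
        # M-step: new frequencies are expected counts over total chromosomes.
        new_freq = {hap: expected[hap] / total_chromosomes for hap in haplotypes}
        change = max(abs(new_freq[hap] - freq[hap]) for hap in haplotypes)
        freq = new_freq
        if change < tol:  # converged: YES -> END
            break
    return freq


# Hypothetical pool-genotypes at m = 2 SNPs from pools of K = 2 individuals.
estimate = em_haplotype_frequencies([(2, 1), (1, 1), (3, 2), (4, 2)], K=2, m=2)
for hap, f in estimate.items():
    print(hap, round(f, 4))
```

Enumerating every configuration grows quickly with m and K, so a sketch like this is only practical for short haplotypes and small pools.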
31
Variance Estimate
Variance matrix for estimated h
and the (k,l) element of matrix is given by
32
Asymptotic Variances
  • Fisher information matrix

Asymptotic variance of
33
Properties
Fisher Information can be represented as
where diag(1/h), ? of the haplotypes, ?
MultiNomial(2K, h), (a latent r.v.).
where
34
Reformulation of the Problem
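  • Let the latent vector of haplotype counts follow
    MN(2K, h), and let A be a 0-1 matrix.
  • Genotype X can be represented as
  • X = A × (latent haplotype counts) (compressed info.)
  • From the incomplete observations X, make
    inference on the distribution/parameter h of
    the unobservable haplotype counts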
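
The compression can be illustrated with a small 0-1 matrix. In this sketch A has one row per SNP and one column per haplotype, with a 1 where the haplotype carries allele 1 at that SNP; this layout, the haplotype ordering and the function names are assumptions made for illustration.

```python
from itertools import product


def incidence_matrix(m):
    """0-1 matrix A: A[j][k] = 1 if haplotype k carries allele 1 at SNP j."""
    haplotypes = list(product((1, 0), repeat=m))
    A = [[hap[j] for hap in haplotypes] for j in range(m)]
    return A, haplotypes


def compress(A, counts):
    """Matrix-vector product A * counts, giving the pool-genotype X."""
    return [sum(a * c for a, c in zip(row, counts)) for row in A]


A, haplotypes = incidence_matrix(2)
print(haplotypes)  # column order: 11, 10, 01, 00

# Two different latent count vectors for 2K = 4 chromosomes...
counts_1 = [1, 1, 0, 2]
counts_2 = [0, 2, 1, 1]
# ...are compressed to the same observed pool-genotype.
print(compress(A, counts_1), compress(A, counts_2))
```

Several count vectors map to the same X, which is why the haplotype counts are treated as incomplete (missing) data.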

35
Simulated and estimated variances
[Figure: simulated and estimated variances plotted against pool size K and LD coefficient D]
Variances decrease with D and increase with
K (n = 120, pa = 0.4, pb = 0.5).
36
Relative efficiency of pooling
[Figure: two error-versus-cost panels, one labeled "efficient" and one labeled "inefficient"]
37
Relative Efficiency of Pooling
  • Asymptotic relative efficiency (ARE)
  • Relative efficiency (RE) (for a fixed number of
    individuals n; V = variance or MSE)

38
Asymptotic variances
Table: 2 SNPs, pa = 0.4, pb = 0.5 (asymptotic variances by pool size K)
39
Asymptotic Relative Efficiency (ARE)
[Figure: ARE for pa = 0.4, pb = 0.5]
Higher LD, higher efficiency.
40
Simulations (1,000 replicates)
  • 2-locus (a/A and b/B)
  • Different choices of allele frequencies, LD
    coefficients, sample sizes and pool sizes
  • pa = 0.4, pb = 0.5 and pa = 0.2, pb = 0.3
  • D = 0.25, 0.5, 0.75
  • n = 60, 120, 180
  • K = 1, 2, ..., 6.
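
Two-locus haplotype frequencies follow from the allele frequencies and an LD coefficient. The sketch below rests on two assumptions that the slides do not confirm: that pa and pb are the frequencies of alleles A and B, and that the listed D values are the normalized coefficient D' (an unnormalized D cannot exceed 0.25, so 0.5 and 0.75 could not be raw D values). The function name is hypothetical.

```python
def two_locus_frequencies(pa, pb, d_prime):
    """Haplotype frequencies for loci A/a and B/b.

    pa, pb: frequencies of alleles A and B (assumed meaning).
    d_prime: normalized LD coefficient D' in [0, 1] (assumed meaning of D).
    """
    # Largest positive D compatible with the allele frequencies.
    d_max = min(pa * (1 - pb), (1 - pa) * pb)
    d = d_prime * d_max
    return {
        "AB": pa * pb + d,
        "Ab": pa * (1 - pb) - d,
        "aB": (1 - pa) * pb - d,
        "ab": (1 - pa) * (1 - pb) + d,
    }


for d_prime in (0.25, 0.5, 0.75):
    print(d_prime, two_locus_frequencies(0.4, 0.5, d_prime))
```

With D' = 0 the loci are independent and every haplotype frequency is the product of its allele frequencies.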

41
Simulations (1,000 replicates)
  • 3-locus: based on real individual genotype data
  • Infinite population
      • Generate haplotypes according to the known
        haplotype frequencies, then pool 2 haplotypes
        to form individual genotypes, and pool K
        individuals to form pool-genotypes.
  • Finite population (pseudo-pooling)
      • Randomly pool every K individual genotypes
        to generate pool-genotypes.
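
Both simulation schemes can be sketched in a few lines. The haplotype frequencies, function names and seeds below are hypothetical; genotypes are coded as counts of allele 1 per locus, and any individuals left over after forming whole pools of size K are dropped (an assumption, since the slides do not say how remainders were handled).

```python
import random


def pool_every_k(genotypes, K):
    """Sum allele-1 counts over consecutive groups of K individual genotypes."""
    m = len(genotypes[0])
    usable = len(genotypes) - len(genotypes) % K  # drop an incomplete last group
    return [tuple(sum(g[j] for g in genotypes[i:i + K]) for j in range(m))
            for i in range(0, usable, K)]


def simulate_pool_genotypes(freq, n, K, seed=None):
    """Infinite population: draw 2 haplotypes per individual, then pool."""
    rng = random.Random(seed)
    haplotypes = list(freq)
    weights = [freq[h] for h in haplotypes]
    individuals = []
    for _ in range(n):
        h1, h2 = rng.choices(haplotypes, weights=weights, k=2)
        individuals.append(tuple(a + b for a, b in zip(h1, h2)))
    return individuals, pool_every_k(individuals, K)


def pseudo_pool(individual_genotypes, K, seed=None):
    """Finite population: shuffle observed genotypes, then pool every K."""
    rng = random.Random(seed)
    shuffled = list(individual_genotypes)
    rng.shuffle(shuffled)
    return pool_every_k(shuffled, K)


# Hypothetical two-locus haplotype frequencies (0/1 allele coding).
freq = {(1, 1): 0.35, (1, 0): 0.05, (0, 1): 0.15, (0, 0): 0.45}
individuals, pools = simulate_pool_genotypes(freq, n=120, K=4, seed=1)
print(len(individuals), len(pools))
print(pools[:3])
print(pseudo_pool(individuals, K=4, seed=2)[:3])
```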

42
MSE of haplotype estimates
Fig.: two-locus (a/A and b/B), pa = 0.4, pb = 0.5,
n = 180
  • MSE increases as K increases.
  • For SNPs in higher LD, it is easier (less error)
    to estimate haplotype frequencies.

43
Relative efficiencies
Fig.: two-locus (a/A and b/B), pa = 0.4, pb = 0.5, n = 180
  • RE(K) increases with K, but seems to level off
    when K 4.
  • The higher the LD, the higher the efficiency of
    pooling.
  • V6 2V1

44
Individual genotype data: 3-locus (data provided
by Dr. Kumar)
  • 135 unrelated individuals genotyped at 3 SNPs in
    the AGT gene.
  • All the individuals are normal Caucasians.
  • High LD
  • This data set was used for
      • Simulation according to the estimated h
      • Pseudo-pooling simulation.

45
Relative efficiency of pooling (3-locus)
Fig.: haplotypes are generated according to
h = (0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106).
[Figure: RE curves for n = 60, n = 120 and n = 180]
46
Pseudo-pooling (table)
Table: haplotype frequency estimates of the
pseudo-pooling experiment based on the Kumar data
(n = 120)
47
Influence of missing values on MSE and RE
48
Real Data
  • Pool-genotypes at 10 SNPs in the AGT gene,
  • 15 pools, each with K = 2 independent individuals.
  • Individual genotypes are not available.
  • All individuals are unrelated.
  • There are 2 completely missing values.

49
Haplotype frequency estimates (7 SNPs)
50
Case-control study
  • Association of haplotypes and diseases.
  • Test the difference in haplotype frequencies between
    the case and control groups.
  • LRT test
  • LRT = 2 log(L_case) + 2 log(L_control)
    - 2 log(L_case+control).
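
The statistic is easiest to see in the K = 1/2 situation, where haplotypes are observed directly and each likelihood is multinomial. The sketch below covers only that special case with hypothetical counts; for pooled data the three maximized likelihoods would instead come from fitting cases, controls and the combined sample separately. Function names are illustrative.

```python
from math import log


def multinomial_loglik(counts):
    """Maximized multinomial log-likelihood, sum of c * log(c / total).

    The multinomial coefficients are omitted because they cancel
    in the likelihood ratio below.
    """
    total = sum(counts)
    return sum(c * log(c / total) for c in counts if c > 0)


def lrt_statistic(case_counts, control_counts):
    """LRT = 2 log(L_case) + 2 log(L_control) - 2 log(L_case+control)."""
    combined = [a + b for a, b in zip(case_counts, control_counts)]
    return 2 * (multinomial_loglik(case_counts)
                + multinomial_loglik(control_counts)
                - multinomial_loglik(combined))


# Hypothetical haplotype counts for haplotypes AB, Ab, aB, ab.
case = [30, 10, 10, 30]
control = [20, 20, 20, 20]
print(lrt_statistic(case, control))
```

The statistic is zero when cases and controls have identical haplotype proportions and grows as the two groups diverge.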

51
Summary
  • Pooling is efficient for m = 1, 2 but not for m > 2
    when LD is low.
  • For m > 2 with high LD, pooling is good.
  • The variance estimates are good if n/K is large
    (say, ≥ 30); otherwise, use the bootstrap.

52
Summary (contd)
  • Pools may have different pool sizes.
  • The algorithm allows for different types of
    missing values.
  • Can be applied to case-control design for
    disease-haplotype association.
  • Need algorithms for long haplotypes.
  • Genotyping errors need to be considered.

53
References
  • Clark A (1990) Mol. Biol. Evol. 7, 111-122.
  • Collins FS et al. (2003) Nature 422, 835-847.
  • Excoffier L, Slatkin M (1995) Mol. Biol. Evol.
    12, 921-927.
  • Ito et al. (2003) Am. J. Hum. Genet. 72,
    384-398.
  • Wang S, Kidd KK, Zhao H (2003) Genet. Epidemiol.
    24, 74-82.
  • Yang et al. (2003) PNAS 100, 7225-7230.

54
Acknowledgement
  • We thank Dr. A. Kumar at the New York Medical
    College in Valhalla for providing the individual
    SNP data.
  • Thank You!