Title: SNP Haplotype Estimation From Pooled DNA Samples
1SNP Haplotype Estimation From Pooled DNA Samples
- Yaning Yang
- Lab of Statistical Genetics
- Rockefeller University
- (yyang_at_linkage.Rockefeller.edu)
2I. Background
3Genotype-phenotype
- Genotype-phenotype association is the central objective of genetic studies.
- Completion of the human genome sequence is the foundation (Collins et al. 2003).
4Geno-phenotype
- Genotype
- Internally coded, inheritable information
- Need to be polymorphic (variation)
- Interaction with environmental factors
- Phenotype
- Outward, physical manifestation of the organism
- Disease status, survival times, quantitative traits (QTL)
- Complex traits/diseases, simple traits/diseases
5Simple (Mendelian) Traits
- Single gene, simple mode of inheritance.
- Huntington disease, cystic fibrosis.
- Method: linkage analysis
- Co-segregation of marker and disease within pedigrees, based on recombination events.
- More than 400 simple diseases have been genetically mapped.
6Complex Traits
- Polygenic effects, environmental factors and complex epistasis (interaction) are hard to dissect.
- Polygenic: multiple genes, each with a small to moderate effect.
- Environmental factors: race, gender, diet, etc.
- Cancer, diabetes, Alzheimer's disease (AD), etc.
- Methods: association analysis
- population-based
- family-based
7Polymorphism
- A difference in DNA nucleotide sequence among individuals or populations.
- P(minor allele) > 0.1, say.
- Genetic mutation is polymorphism.
- Marker: locus-specific polymorphism
- Like a (chaining) in approximation
- A significant marker may itself be, or lie close to, the causal genetic variant
8SNP Single Nucleotide Polymorphism
- The simplest and most common genetic polymorphism
- Simple: a single-base mutation in DNA
- Common: 90% of all human DNA variation
- Abundance: 0.1%
- Biallelic (binary)
- 2 million SNPs reported
9Genotyping
              SNP1 (locus 1)    SNP2 (locus 2)
haplotype     ------G-----------------A------
haplotype     ------T-----------------C------     (diploid)
Genotype: G/T at SNP1, A/C at SNP2.
At each locus there are two possible alleles; e.g., at locus 1 the genotype can be G/G, T/T or T/G.
10Association
- Population-based
- Epidemiological methods: case-control and cohort designs
- Stratification: control over confounding factors
- Powerful, but easily produces spurious associations due to population admixture (heterogeneity)
- Family-based
- TDT (McNemar's test for matched pairs)
- No need for stratification; true associations
- Sampling is costly
11Association
- Identification of causal genetic variants.
- Understanding their functions and disease etiology.
- Help with disease prevention, diagnostics and drug development (e.g., personalized medicine).
12Allele Association (marginal)
1. Disease-allele association (2 x 2 table)
               G      T
   case
   control
2. Disease-genotype association (2 x 3 table)
               G/G    G/T    T/T
   case
   control
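The two tables above are usually tested for association with a Pearson chi-square statistic. The sketch below is a minimal illustration only: the counts are hypothetical (the slide gives none), and `pearson_chi_square` is a helper name introduced here, not something from the talk.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical genotype counts (rows: case, control; columns: G/G, G/T, T/T).
genotype_table = [[20, 20, 10], [12, 21, 17]]

# Allele counts implied by the genotype counts (columns: G, T):
# each G/G carries two G alleles, each G/T one G and one T, each T/T two T alleles.
allele_table = [[2 * gg + gt, gt + 2 * tt] for gg, gt, tt in genotype_table]

allele_stat, allele_df = pearson_chi_square(allele_table)
genotype_stat, genotype_df = pearson_chi_square(genotype_table)
print(allele_table, round(allele_stat, 3), allele_df)
print(round(genotype_stat, 3), genotype_df)
```

The allele test has 1 degree of freedom and the genotype test has 2. Counting alleles treats the two chromosomes of an individual as independent observations, which is reasonable under Hardy-Weinberg equilibrium.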
13Haplotype Association (joint)
- Most genome screens test one locus at a time.
- The dependence structure (linkage disequilibrium) needs to be considered.
- Example: haplotypes for case and control
        Case               Control
    ---A-----B---      ---A-----b---
    ---a-----b---      ---a-----B---
14Haplotype
- The total number of possible haplotypes is 2^m for m SNPs.
- For example, m = 3 (biallelic at each position: A/a, B/b, C/c)
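For the m = 3 example, the 2^3 = 8 possible haplotypes can be listed mechanically. This is a small illustrative sketch; the function name `all_haplotypes` is introduced here for the example.

```python
from itertools import product


def all_haplotypes(loci):
    """List every haplotype for biallelic loci given as (allele, allele) pairs."""
    return ["".join(alleles) for alleles in product(*loci)]


haplotypes = all_haplotypes([("A", "a"), ("B", "b"), ("C", "c")])
print(len(haplotypes), haplotypes)
```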
15Why Haplotype?
- LD
- Alleles in linkage disequilibrium (LD) are tightly linked and tend to co-segregate.
- LD plays a fundamental role in genetic mapping of complex diseases.
- Haplotypes preserve LD information.
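The slides do not spell out how LD is measured. A common choice, assumed here, is the coefficient D = P(AB) - P(A)P(B) together with Lewontin's normalized D'. The sketch below shows how two-locus haplotype frequencies determine both, and how allele frequencies plus D determine the haplotype frequencies. The function names are illustrative.

```python
def ld_coefficients(h_AB, p_A, p_B):
    """Return (D, D') for two biallelic loci; D' keeps the sign of D."""
    d = h_AB - p_A * p_B
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = d / d_max if d_max > 0 else 0.0
    return d, d_prime


def two_locus_haplotype_freqs(p_A, p_B, d):
    """Haplotype frequencies (AB, Ab, aB, ab) implied by allele frequencies and D."""
    return (
        p_A * p_B + d,
        p_A * (1 - p_B) - d,
        (1 - p_A) * p_B - d,
        (1 - p_A) * (1 - p_B) + d,
    )


# Allele frequencies 0.4 and 0.5 with an assumed D of 0.1.
h = two_locus_haplotype_freqs(0.4, 0.5, 0.1)
d, d_prime = ld_coefficients(h[0], 0.4, 0.5)
print(h, d, d_prime)
```

With these inputs the haplotype frequencies are approximately (0.3, 0.1, 0.2, 0.4), and D is half of its maximum possible value, so D' is about 0.5.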
16Why Haplotype?
- Haplotype
- A haplotype is a binary sequence along one chromosome.
- Haplotypes have a block-wise structure, with blocks separated by recombination hot spots.
- Within each block, recombination is rare due to tight linkage, and only a few haplotypes actually occur.
17A Brief Summary
- Genotype-phenotype association analysis for complex disease
- Genetic variant/polymorphism/marker
- Genetic variation human variation
- SNP: simple, abundant genetic variant/marker
- LD: dependence of markers
- Haplotype analysis: joint distribution
18II. Haplotype Estimation From Pooled DNA
19Introduction
Key words: efficiency, EM algorithm, haplotype frequency, LD coefficients, pooling, variance estimates.
20Estimating haplotype frequencies from individual genotypes
- Individual samples are genotyped.
- No phase information.
- Likelihood analysis (Excoffier and Slatkin 1995), but no variance estimate.
- Other methods: Clark's parsimony method (Clark 1990), Bayesian MCMC
21Genotyping individual DNA
Diploid:       ---A-----B---    haplotype
               ---a-----b---    haplotype
Genotyping:    A/a  B/b         observed genotypes (phase information is lost)
Reconstruct:   ---A-----B---         ---A-----b---
               ---a-----b---    or   ---a-----B---     (haplotype configurations)
22Pooling Reduce Genotyping Cost
- Unrelated individual samples are mixed, so there are more ambiguities in recovering haplotypes.
- No individual information and no phase information.
- Efficient for allele frequency estimation, but is it efficient for estimating haplotype frequencies?
- Wang et al. (2003), Ito et al. (2003).
23Genotyping pooled DNA
Pooling:       ----A------B----
               ----a------b----    diploid for individual 1
               ----A------b----
               ----a------B----    diploid for individual 2
Genotyping:    AAaa  BBbb          observed pool-genotypes
Haplotype configurations:
   ---A-----B---        ---A-----B---        ---A-----b---
   ---A-----b---   or   ---A-----B---   or   ---A-----b---
   ---a-----B---        ---a-----b---        ---a-----B---
   ---a-----b---        ---a-----b---        ---a-----B---
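The haplotype configurations compatible with an observed pool-genotype can be found by brute force: try every unordered set of haplotypes and keep those whose allele content matches. The sketch below is illustrative rather than the speaker's implementation, and `consistent_configurations` is a name introduced here.

```python
from itertools import combinations_with_replacement

HAPLOTYPES = ("AB", "Ab", "aB", "ab")  # two biallelic loci, as on the slide


def consistent_configurations(pool_genotype, n_chromosomes):
    """Unordered haplotype sets whose alleles match the pool-genotype at every locus.

    pool_genotype holds one allele string per locus, e.g. ("AAaa", "BBbb").
    """
    target = tuple("".join(sorted(alleles)) for alleles in pool_genotype)
    configs = []
    for config in combinations_with_replacement(HAPLOTYPES, n_chromosomes):
        observed = tuple(
            "".join(sorted(hap[locus] for hap in config))
            for locus in range(len(target))
        )
        if observed == target:
            configs.append(config)
    return configs


# Pool of two individuals (four chromosomes) with pool-genotype AAaa BBbb.
configs = consistent_configurations(("AAaa", "BBbb"), n_chromosomes=4)
for config in configs:
    print(config)
```

For AAaa BBbb this returns three configurations. A single double-heterozygous individual (A/a B/b, two chromosomes) has only two, which is the sense in which pooling adds ambiguity.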
24Pool-genotype of K-pool
- Pool-genotype: number of copies of allele 1 at each SNP. E.g.
- An individual can be viewed as a pool of two independent chromosomes.
- We will say a chromosome is a ½-pool
SNP              1  2  3  4  5
Individual 1
Individual 2
Pool-genotypes
25Missing values
- Pool-genotype at m SNP loci,
- Completely missing: no information.
- Partially missing: partial information, e.g., only know
26Statistical Methods
- Key words
- Asymptotic variance, EM, MLE, missing data, relative efficiency
27Notations
- For m SNPs, each position can take two possible alleles. Denote them by 1 and 0.
- In total, 2^m possible haplotypes.
- Haplotype frequencies: h = (h_1, ..., h_{2^m})
- Example: m = 3
SNP 1 2 3
28Maximum Likelihood Estimate
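- Assumptions: HWE, random mating
- Likelihood: product over pools of P(X | h), where P(X | h) sums the multinomial probabilities of all haplotype configurations consistent with the pool-genotype X
- When K = 1/2, multinomial!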
29An Example
- m = 2, K = 2, observation X = (2, 1).
- Consistent haplotype configurations: {11, 10, 00, 00} and {10, 10, 01, 00}.
- Likelihood: P(X) = 12 h_11 h_10 h_00^2 + 12 h_10^2 h_01 h_00.
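The probability of a pool-genotype can be checked numerically by summing multinomial probabilities over every haplotype-count vector that reproduces it. The sketch below assumes the coding used in these slides (alleles 1 and 0, pool-genotype = count of allele 1 at each locus) and the haplotype order 11, 10, 01, 00. The function names are introduced for illustration.

```python
from itertools import product
from math import factorial


def count_vectors(total, n_bins):
    """Yield all non-negative integer vectors of length n_bins summing to total."""
    if n_bins == 1:
        yield (total,)
        return
    for first in range(total + 1):
        for rest in count_vectors(total - first, n_bins - 1):
            yield (first,) + rest


def pool_genotype_probability(x, K, h, m):
    """P(X = x | h) for one pool of K individuals (2K chromosomes)."""
    haplotypes = list(product((1, 0), repeat=m))  # 11, 10, 01, 00 for m = 2
    prob = 0.0
    for c in count_vectors(2 * K, len(haplotypes)):
        allele_counts = tuple(
            sum(ck * hap[j] for ck, hap in zip(c, haplotypes)) for j in range(m)
        )
        if allele_counts != tuple(x):
            continue  # this configuration is not consistent with x
        coef = factorial(2 * K)
        term = 1.0
        for ck, hk in zip(c, h):
            coef //= factorial(ck)  # builds the multinomial coefficient
            term *= hk ** ck
        prob += coef * term
    return prob


# m = 2, K = 2, X = (2, 1), with assumed frequencies for 11, 10, 01, 00.
h = (0.4, 0.3, 0.2, 0.1)
p = pool_genotype_probability((2, 1), K=2, h=h, m=2)
print(p)
```

For these assumed frequencies only two configurations contribute, and the result is approximately 12(0.4)(0.3)(0.1)^2 + 12(0.3)^2(0.2)(0.1) = 0.036.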
30EM algorithm
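Flowchart:
- Start: h_k = h_k^(0), k = 1, 2, ..., 2^m (initial value)
- Update h, with c_J(k) = number of copies of haplotype k in configuration J
- Converged? NO: repeat the update. YES: END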
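A minimal sketch of an EM iteration of this kind follows, assuming equal pool sizes, no missing values and the allele coding 1/0. It is a generic EM for multinomial haplotype counts observed only through pool-genotypes, not necessarily the speaker's exact algorithm, and all function names are illustrative. The E-step weights each consistent configuration by its current probability, and the M-step renormalizes the expected haplotype counts.

```python
from itertools import product
from math import factorial


def compositions(total, n_bins):
    """All non-negative integer vectors of length n_bins that sum to total."""
    if n_bins == 1:
        return [(total,)]
    return [(first,) + rest
            for first in range(total + 1)
            for rest in compositions(total - first, n_bins - 1)]


def multinomial_pmf(c, h):
    coef = factorial(sum(c))
    p = 1.0
    for ck, hk in zip(c, h):
        coef //= factorial(ck)
        p *= hk ** ck
    return coef * p


def em_pooled(pool_genotypes, K, m, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies from pool-genotypes (allele-1 counts)."""
    haplotypes = list(product((1, 0), repeat=m))
    n_hap = len(haplotypes)
    # Group every haplotype-count configuration by the pool-genotype it produces.
    by_genotype = {}
    for c in compositions(2 * K, n_hap):
        x = tuple(sum(ck * hap[j] for ck, hap in zip(c, haplotypes)) for j in range(m))
        by_genotype.setdefault(x, []).append(c)
    h = [1.0 / n_hap] * n_hap  # initial value h^(0)
    for _ in range(n_iter):
        expected = [0.0] * n_hap
        for x in pool_genotypes:
            configs = by_genotype[tuple(x)]
            weights = [multinomial_pmf(c, h) for c in configs]
            total = sum(weights)
            for c, w in zip(configs, weights):  # E-step: expected haplotype counts
                for k in range(n_hap):
                    expected[k] += w / total * c[k]
        new_h = [e / (2 * K * len(pool_genotypes)) for e in expected]  # M-step
        converged = max(abs(a - b) for a, b in zip(new_h, h)) < tol
        h = new_h
        if converged:  # YES: END; NO: iterate again
            break
    return haplotypes, h


# Hypothetical data: six individuals (K = 1), three 11/11 and three 00/00.
haps, h_hat = em_pooled([(2, 2)] * 3 + [(0, 0)] * 3, K=1, m=2)
print(dict(zip(haps, h_hat)))
```

Enumerating all configurations is only practical for a few loci and small pools, since the number of configurations grows quickly with both m and K.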
31Variance Estimate
Variance matrix for estimated h
and the (k,l) element of matrix is given by
32Asymptotic Variances
- Fisher information matrix
Asymptotic variance of
33Properties
The Fisher information can be represented in a form that involves diag(1/h) and ξ, where ξ is the vector of counts of the haplotypes, ξ ~ Multinomial(2K, h) (a latent r.v.).
34Reformulation of the Problem
- Let ξ ~ MN(2K, h), and let A be a 0-1 matrix.
- The genotype X can be represented as
- X = A ξ (compressed information)
- From the incomplete observations X, make inference on the distribution/parameter h of the unobservable ξ
35Simulated and estimated variances
[Figure: simulated and estimated variances plotted against pool size K and LD coefficient D.]
Variances decrease with D and increase with K (n = 120, pa = 0.4, pb = 0.5).
36Relative efficiency of pooling
[Figure: two error-versus-cost panels, one labeled "efficient" and one labeled "inefficient".]
37Relative Efficiency of Pooling
- Asymptotic relative efficiency (ARE)
- Relative efficiency (RE), for a fixed number of individuals n (V = variance or MSE)
38Asymptotic variances
Table: 2 SNPs, pa = 0.4, pb = 0.5, indexed by pool size K.
39Asymptotic Relative Efficiency (ARE)
[Figure: ARE for pa = 0.4, pb = 0.5.]
Higher LD, higher efficiency.
40Simulations (1,000 replicates)
- 2-locus (a/A and b/B)
- Different choices of allele frequencies, LD coefficients, sample sizes and pool sizes
- pa = 0.4, pb = 0.5 and pa = 0.2, pb = 0.3
- D = 0.25, 0.5, 0.75
- n = 60, 120, 180
- K = 1, 2, ..., 6.
41Simulations (1,000 replicates)
- 3-locus, based on real individual genotype data
- Infinite population
- Generate haplotypes according to the known haplotype frequencies, then pool 2 haplotypes to form individual genotypes, and pool K individuals to form pool-genotypes.
- Finite population (pseudo-pooling)
- Randomly pool every K individual genotypes to generate pool-genotypes.
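The infinite-population scheme can be sketched as follows. This is an assumption-laden illustration: the function name is introduced here, and the order in which the eight frequencies are assigned to the 3-locus haplotypes (111, 110, ..., 000) is arbitrary, since the slides do not give that mapping. Drawing 2K haplotypes per pool is equivalent to forming K individuals from two haplotypes each and then pooling them.

```python
import random
from itertools import product


def simulate_pool_genotypes(h, m, n_individuals, K, rng):
    """Draw haplotypes from h and return one pool-genotype per pool of K individuals."""
    haplotypes = list(product((1, 0), repeat=m))
    pools = []
    for _ in range(n_individuals // K):
        # 2K chromosomes: two per individual, K individuals per pool.
        chromosomes = rng.choices(haplotypes, weights=h, k=2 * K)
        # Pool-genotype: number of copies of allele 1 at each locus.
        pools.append(tuple(sum(hap[j] for hap in chromosomes) for j in range(m)))
    return pools


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Frequencies quoted later in the talk; their assignment to haplotypes is assumed.
h_three_locus = (0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106)
pools = simulate_pool_genotypes(h_three_locus, m=3, n_individuals=120, K=2, rng=rng)
print(len(pools), pools[:5])
```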
42MSE of haplotype estimates
Fig. Two-locus (a/A and b/B), pa = 0.4, pb = 0.5, n = 180.
- MSE increases as K increases.
- For SNPs in higher LD, it is easier (less error) to estimate haplotype frequencies.
43Relative efficiencies
Fig. Relative efficiency; two-locus (a/A and b/B), pa = 0.4, pb = 0.5, n = 180.
- RE(K) increases with K, but seems to level off when K ≥ 4.
- The higher the LD, the higher the efficiency of pooling.
- V6 ≈ 2V1
44Individual genotype data, 3-locus (data provided by Dr. Kumar)
- 135 unrelated individuals genotyped at 3 SNPs in the AGT gene.
- All the individuals are normal Caucasians.
- High LD
- This data set was used for
- Simulation according to the estimated h
- Pseudo-pooling simulation.
45Relative efficiency of pooling (3-locus)
Fig. RE for n = 60, 120 and 180. Haplotypes are generated according to h = (0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106).
46Pseudo-pooling (table)
Table: Haplotype frequency estimates of the pseudo-pooling experiment based on the Kumar data (n = 120).
47Influence of missing values on MSE and RE
48Real Data
- Pool-genotypes at 10 SNPs in the AGT gene,
- 15 pools, each with K = 2 independent individuals.
- Individual genotypes are not available.
- All individuals are unrelated.
- There are 2 completely missing values.
49Haplotype frequency estimates (7 SNPs)
50Case-control study
- Association of haplotypes and diseases.
- Test the difference in haplotype frequencies between the case and control groups.
- LRT test
- LRT = 2 log(L_case) + 2 log(L_control) - 2 log(L_case+control).
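Given the three maximized log-likelihoods (cases alone, controls alone, and the combined sample), the statistic is simple arithmetic. The sketch below uses hypothetical log-likelihood values, and the degrees of freedom 2^m - 1 rest on the usual large-sample chi-square approximation, which is an assumption here and may be poor when some haplotype frequencies are at or near zero.

```python
def haplotype_lrt(loglik_case, loglik_control, loglik_combined, m):
    """LRT statistic and nominal degrees of freedom for case vs. control haplotypes."""
    statistic = 2 * loglik_case + 2 * loglik_control - 2 * loglik_combined
    # Separate fits use 2 * (2^m - 1) free frequencies, the combined fit 2^m - 1.
    df = 2 ** m - 1
    return statistic, df


# Hypothetical maximized log-likelihoods for a 2-SNP analysis.
stat, df = haplotype_lrt(-150.2, -160.8, -315.5, m=2)
print(round(stat, 3), df)
```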
51Summary
- Pooling is efficient for m = 1, 2, but not for m > 2 when LD is low.
- For m > 2 and high LD, pooling is good.
- The variance estimates are good if n/K is large (say, ≥ 30); otherwise, use the bootstrap.
52Summary (contd)
- Pools may have different pool sizes.
- The algorithm allows for different types of missing values.
- Can be applied to case-control designs for disease-haplotype association.
- Need algorithms for long haplotypes.
- Need to consider genotyping errors.
53References
- Clark A (1990) Mol. Biol. Evol. 7, 111-122.
- Collins F et al. (2003) Nature 422, 835-847.
- Excoffier L, Slatkin M (1995) Mol. Biol. Evol. 12, 921-927.
- Ito et al. (2003) Am. J. Hum. Genet. 72, 384-398.
- Wang S, Kidd KK, Zhao H (2003) Genet. Epidemiol. 24, 74-82.
- Yang et al. (2003) PNAS 100, 7225-7230.
54Acknowledgement
- We thank Dr. A. Kumar at the New York Medical College in Valhalla for providing the individual SNP data.
- Thank You!