SNP Haplotype Estimation From Pooled DNA Samples

1
SNP Haplotype Estimation From Pooled DNA Samples

Yaning Yang
Lab of Statistical Genetics
Rockefeller University
(yyang_at_linkage.Rockefeller.edu)

2
I. Background

3
Genotype-phenotype

Genotype-phenotype association: the central objective of genetic studies.
Completion of the human genome sequence is the foundation (Collins et al. 2003).

4
Geno-phenotype

Genotype:
Internally coded, inheritable information
Needs to be polymorphic (variation)
Interaction with environmental factors
Phenotype:
Outward, physical manifestation of the organism
Disease status, survival times, quantitative traits (QTL)
Complex traits/diseases, simple traits/diseases

5
Simple (Mendelian) Traits

Single gene, simple mode of inheritance.
Huntington disease, cystic fibrosis.
Method: linkage analysis
Co-segregation of marker and disease within
pedigrees based on recombination events.
More than 400 simple diseases have been
genetically mapped.

6
Complex Traits

Polygenic, environmental factors, complex epistasis (interaction): hard to dissect.
Polygenic: multiple genes, each with small to moderate effect.
Environmental factors: race, gender, diet, etc.
Examples: cancer, diabetes, Alzheimer's disease (AD), etc.
Methods: association analysis
population-based
family-based

7
Polymorphism

A difference in the DNA (nucleotide) sequence among individuals or populations.
P(minor allele) > 0.1, say.
A genetic mutation is a polymorphism.
Marker: a locus-specific polymorphism
Like a (chaining) in approximation
A significant marker may itself be, or be close to, the causal genetic variant.

8
SNP: Single Nucleotide Polymorphism

The simplest and most common genetic polymorphism
Simple: a single base mutation in DNA
Common: 90% of all human DNA variations
Abundance: 0.1%
Biallelic (binary)
2 million SNPs reported

9
Genotyping
            SNP1 (locus 1)     SNP2 (locus 2)
Diploid:    ------G------------------A------   haplotype
            ------T------------------C------   haplotype
Genotype:   G/T                A/C
At each locus there are two possible alleles; e.g., at locus 1 the genotype can be G/G, T/T or T/G.
10
Association

Population-based
Epidemiological methods: case-control, cohort design
Stratification: control over confounding factors
Powerful, but easily produces spurious associations due to population admixture (heterogeneity)
Family-based
TDT (transmission/disequilibrium test; McNemar's test for matched pairs)
No need for stratification; detects true associations
Sampling is costly
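The McNemar form of the TDT mentioned above can be sketched in a few lines. This is a minimal illustration, not the presenter's code: the function name and the transmission counts are hypothetical, and only heterozygous parents are assumed to be counted.

```python
def tdt_statistic(b, c):
    """McNemar-type TDT statistic.

    b: heterozygous parents who transmitted allele 1 (and not allele 2)
    c: heterozygous parents who transmitted allele 2 (and not allele 1)
    """
    if b + c == 0:
        raise ValueError("no informative transmissions")
    return (b - c) ** 2 / (b + c)


# Hypothetical counts, for illustration only.
stat = tdt_statistic(b=60, c=40)
print(stat)  # compare with a chi-square distribution on 1 degree of freedom
```

Because only transmissions within families are compared, population stratification does not enter the statistic.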

11
Association

Identification of causal genetic variants.
Understanding their functions and disease
etiology.
Help with disease prevention, diagnostics and drug development (e.g., personalized medicine).

12
Allele Association (marginal)
1. Disease-allele association
            G       T
case
control
2. Disease-genotype association
            G/G     G/T     T/T
case
control
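Both tables above are usually tested with a Pearson chi-square statistic. The sketch below assumes that test (the slide does not name one) and uses made-up counts, since the cell values are not in the transcript.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical counts (rows: case, control), not taken from the slides.
allele_table = [[60, 40], [40, 60]]            # columns: G, T
genotype_table = [[20, 20, 10], [10, 20, 20]]  # columns: G/G, G/T, T/T

allele_stat, allele_df = pearson_chi2(allele_table)
genotype_stat, genotype_df = pearson_chi2(genotype_table)
print(allele_stat, allele_df)
print(genotype_stat, genotype_df)
```

The allele table has 1 degree of freedom and the genotype table has 2, which is why the two tests can disagree on the same data.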
13
Haplotype Association (joint)

Most genome screens test one locus at a time.
The dependence structure (linkage disequilibrium) needs to be considered.
Example: haplotypes for case and control
Case             Control
---A-----B---    ---A-----b---
---a-----b---    ---a-----B---
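The point of the example can be checked directly: cases and controls carry the same alleles at each locus, yet their haplotypes differ. A small sketch, with haplotypes written as two-letter strings (a representation assumed here for convenience):

```python
from collections import Counter

# Haplotypes from the example: one letter per locus.
case = ["AB", "ab"]
control = ["Ab", "aB"]


def allele_counts(haplotypes, locus):
    """Count the alleles seen at one locus across a list of haplotypes."""
    return Counter(h[locus] for h in haplotypes)


for locus in (0, 1):
    same = allele_counts(case, locus) == allele_counts(control, locus)
    print("locus", locus + 1, "same allele counts:", same)

print("same haplotype counts:", Counter(case) == Counter(control))
```

A single-locus (marginal) test sees no difference between the groups here; only the joint haplotype distribution separates them.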
14
Haplotype

The total number of haplotypes is 2^m for m SNPs.
For example, with m = 3 (biallelic at each position: A/a, B/b, C/c) there are 2^3 = 8 possible haplotypes.
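The haplotypes for the three-SNP example can be listed mechanically. The helper name below is illustrative only.

```python
from itertools import product


def all_haplotypes(loci):
    """Enumerate every haplotype given the allele choices at each locus."""
    return ["".join(alleles) for alleles in product(*loci)]


haps = all_haplotypes(["Aa", "Bb", "Cc"])
print(len(haps))
print(haps)
```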

15
Why Haplotype?

LD
Alleles in linkage disequilibrium (LD) are tightly linked and tend to co-segregate.
LD plays a fundamental role in genetic mapping of
complex diseases.
Haplotypes preserve LD information.
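For two biallelic loci, LD is commonly summarized by D = p_AB - p_A p_B and by its normalized form D'. The slides do not define their LD coefficient, so the sketch below uses these standard definitions as an assumption; the frequencies are illustrative.

```python
def ld_coefficients(p_ab, p_a, p_b):
    """Return (D, |D'|) for two biallelic loci.

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: marginal frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    return d, d_prime


# Illustrative values only.
d, d_prime = ld_coefficients(p_ab=0.3, p_a=0.4, p_b=0.5)
print(d, d_prime)
```

D is bounded by the allele frequencies, while the normalized value always lies between 0 and 1, which makes it easier to compare across marker pairs.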

16
Why Haplotype?

Haplotype
A haplotype is a binary sequence along one
chromosome.
Haplotypes have a block-wise structure, with blocks separated by recombination hot spots.
Within each block, recombination is rare due to tight linkage, and only a few haplotypes actually occur.

17
A Brief Summary

Genotype-phenotype association analysis for
complex disease
Genetic variant/polymorphism/marker
Genetic variation human variation
SNP: simple, abundant genetic variant/marker
LD: dependence of markers
Haplotype analysis: joint distribution

18
II. Haplotype Estimation From Pooled DNA
19
Introduction
Key words: efficiency, EM algorithm, haplotype frequency, LD coefficients, pooling, variance estimates.
20
Estimating haplotype frequencies from individual
genotypes

Individual samples are genotyped.
No phase information.
Likelihood analysis (Excoffier & Slatkin, 1995), but no variance estimate.
Other methods: Clark's parsimony method (Clark 1990), Bayesian MCMC

21
Genotyping individual DNA
Diploid:      ---A-----B---   haplotype
              ---a-----b---   haplotype
Genotyping:   A/a  B/b        observed genotypes (phase information is lost)
Reconstruct:  ---A-----B---        ---A-----b---
              ---a-----b---   or   ---a-----B---    haplotype configurations
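The loss of phase can be made concrete by listing every haplotype pair compatible with an unphased genotype. This is a sketch with a hypothetical helper; genotypes are written as one two-letter string per locus.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype: one 2-character string per locus, e.g. ["Aa", "Bb"].
    """
    pairs = set()
    # At each locus, either allele may sit on the first chromosome.
    for choice in product(*[(g, g[::-1]) for g in genotype]):
        h1 = "".join(c[0] for c in choice)
        h2 = "".join(c[1] for c in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(compatible_haplotype_pairs(["Aa", "Bb"]))  # double heterozygote: ambiguous
print(compatible_haplotype_pairs(["AA", "Bb"]))  # one heterozygous locus: unambiguous
```

Only individuals heterozygous at two or more loci are ambiguous, which is where likelihood methods have to sum over configurations.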
22
Pooling: Reduce Genotyping Cost

Unrelated individual samples are mixed, so there is more ambiguity in recovering haplotypes.
No individual information and no phase information.
Efficient for allele frequency estimation, but is it efficient for estimating haplotype frequencies?
Wang et al. (2003), Ito et al. (2003).

23
Genotyping pooled DNA
Pooling:
----A------B----            ----A------b----
----a------b----            ----a------B----
diploid for individual 1    diploid for individual 2
Genotyping: AAaa BBbb       observed pool-genotypes
Hap config:
---A-----B---        ---A-----B---        ---A-----b---
---A-----b---   or   ---A-----B---   or   ---A-----b---
---a-----B---        ---a-----b---        ---a-----B---
---a-----b---        ---a-----b---        ---a-----B---
24
Pool-genotype of a K-pool

Pool-genotype: number of copies of allele 1 at each locus. E.g.
An individual can be viewed as a pool of two independent chromosomes.
We will say a chromosome is a ½-pool.

SNP 1 2 3 4 5
Individual 1

Individual 2
Pool-genotypes
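The table values on this slide are not in the transcript. As a sketch of the idea, assuming the pool-genotype is the count of allele 1 at each locus, the pool-genotype of any set of chromosomes is a column sum. The haplotypes below are made up for illustration.

```python
def pool_genotype(haplotypes):
    """Count copies of allele 1 at each locus across a set of 0/1 haplotypes."""
    return tuple(sum(column) for column in zip(*haplotypes))


# Hypothetical chromosomes at 5 SNPs (1 = allele 1, 0 = allele 0).
individual_1 = [(1, 0, 1, 1, 0), (0, 0, 1, 0, 1)]
individual_2 = [(1, 1, 0, 1, 0), (1, 0, 0, 0, 0)]

print(pool_genotype(individual_1))                 # a 1-pool (one individual)
print(pool_genotype(individual_2))
print(pool_genotype(individual_1 + individual_2))  # a 2-pool
```

A single chromosome passed on its own gives its haplotype back, which matches the ½-pool convention.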
25
Missing values

Pool-genotype at m SNP loci,
Completely missing: no information.
Partially missing: partial information, e.g., only know

26
Statistical Methods

Key words: asymptotic variance, EM, MLE, missing data, relative efficiency

27
Notations

For m SNPs, each position can take two possible alleles. Denote them by 1 and 0.
In total, 2^m possible haplotypes.
Haplotype frequencies: h_k, k = 1, 2, ..., 2^m
m = 3

SNP 1 2 3
28
Maximum Likelihood Estimate

Assumptions: HWE, random mating
Likelihood
where
When K = 1/2, multinomial!
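The likelihood formula on this slide did not survive extraction. One standard form that is consistent with the latent multinomial used on the later slides is given below as an assumption, not as the presenter's exact expression. The symbols n_p (number of pools), X_i (pool-genotype of pool i) and C(X_i) (set of haplotype configurations consistent with X_i) are notation chosen for this sketch; c_J(k) is the number of copies of haplotype k in configuration J.

```latex
L(h) = \prod_{i=1}^{n_p} P(X_i; h),
\qquad
P(X_i; h) = \sum_{J \in \mathcal{C}(X_i)}
  \frac{(2K)!}{\prod_{k=1}^{2^m} c_J(k)!}
  \prod_{k=1}^{2^m} h_k^{\,c_J(k)}
```

With K = 1/2 each pool is a single chromosome, its haplotype is observed directly, and the sum collapses to one multinomial term.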

29
An Example

m = 2, K = 2, observation X = (2, 1). Consistent haplotype configurations are
Likelihood
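The configurations and the likelihood for this example are missing from the transcript. They can be recovered by brute force under the assumption that a pool-genotype counts allele 1 at each locus and that the 2K chromosomes are independent draws from the haplotype distribution. Function names and the haplotype ordering (11, 10, 01, 00) are choices made for this sketch.

```python
from itertools import combinations_with_replacement, product
from math import factorial, prod


def consistent_configurations(x, K):
    """Haplotype-count vectors (one entry per haplotype) that reproduce pool-genotype x."""
    m = len(x)
    haps = list(product((1, 0), repeat=m))  # for m = 2: 11, 10, 01, 00
    configs = []
    for combo in combinations_with_replacement(range(len(haps)), 2 * K):
        sums = tuple(sum(haps[k][j] for k in combo) for j in range(m))
        if sums == tuple(x):
            configs.append(tuple(combo.count(k) for k in range(len(haps))))
    return haps, configs


def pool_probability(x, K, h):
    """P(X = x) as a sum of multinomial terms over the consistent configurations."""
    _, configs = consistent_configurations(x, K)
    total = 0.0
    for c in configs:
        coef = factorial(2 * K)
        for ck in c:
            coef //= factorial(ck)
        total += coef * prod(hk ** ck for hk, ck in zip(h, c))
    return total


haps, configs = consistent_configurations((2, 1), K=2)
print(haps)
print(configs)
print(pool_probability((2, 1), K=2, h=[0.25, 0.25, 0.25, 0.25]))
```

For X = (2, 1) only two count vectors survive, so the likelihood contribution of this pool is a sum of two multinomial terms in h.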

30
EM algorithm
h_k = h_k(0), k = 1, 2, ..., 2^m (initial value)
c_J(k) = number of copies of haplotype k in configuration J
Convergence check: NO, iterate again; YES, END
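The update formulas in the flowchart are not in the transcript. A minimal sketch of a standard EM iteration for this problem is given below, assuming pool-genotypes count allele 1 at each locus and that each pool holds 2K independent chromosomes. It enumerates configurations by brute force, so it is only practical for small m and K; all function names are hypothetical.

```python
from itertools import combinations_with_replacement, product
from math import factorial, prod


def configurations(x, K):
    """Haplotype-count vectors consistent with pool-genotype x."""
    m = len(x)
    haps = list(product((1, 0), repeat=m))
    out = []
    for combo in combinations_with_replacement(range(len(haps)), 2 * K):
        if all(sum(haps[k][j] for k in combo) == x[j] for j in range(m)):
            out.append([combo.count(k) for k in range(len(haps))])
    return out


def multinomial_weight(c, h):
    """Multinomial probability of count vector c under frequencies h."""
    coef = factorial(sum(c))
    for ck in c:
        coef //= factorial(ck)
    return coef * prod(hk ** ck for hk, ck in zip(h, c))


def em_haplotype_frequencies(pools, K, n_iter=200, tol=1e-10):
    """EM estimate of haplotype frequencies from a list of pool-genotype tuples."""
    n_hap = 2 ** len(pools[0])
    h = [1.0 / n_hap] * n_hap  # initial value
    configs = {x: configurations(x, K) for x in set(pools)}
    for _ in range(n_iter):
        expected = [0.0] * n_hap
        for x in pools:
            # E-step: posterior weight of each configuration J given x and h.
            weights = [multinomial_weight(c, h) for c in configs[x]]
            total = sum(weights)
            for w, c in zip(weights, configs[x]):
                for k in range(n_hap):
                    expected[k] += (w / total) * c[k]
        # M-step: expected haplotype counts divided by the number of chromosomes.
        new_h = [e / (2 * K * len(pools)) for e in expected]
        if max(abs(a - b) for a, b in zip(new_h, h)) < tol:
            return new_h
        h = new_h
    return h


# Hypothetical individual genotypes (K = 1) at 2 SNPs.
pools = [(2, 2), (2, 2), (1, 1), (0, 0), (1, 0)]
print(em_haplotype_frequencies(pools, K=1))
```

The returned list follows the order 11, 10, 01, 00 for two SNPs. The number of configurations grows quickly with K and m, which is one reason larger pools are harder to resolve.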
31
Variance Estimate
Variance matrix for estimated h
and the (k,l) element of matrix is given by
32
Asymptotic Variances

Fisher information matrix

Asymptotic variance of
33
Properties
Fisher information can be represented as
where diag(1/h), ξ = counts of the haplotypes, ξ ~ Multinomial(2K, h) (a latent r.v.).
where
34
Reformulation of the Problem

Let ξ ~ MN(2K, h), and let A be a 0-1 matrix.
Genotype X can be represented as
X = A ξ (compressed info.)
From the incomplete observations X, make inference on the distribution/parameter h of the unobservable ξ.
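The 0-1 matrix is not shown in the transcript. A natural choice, assumed here, has one column per haplotype and one row per SNP, so that multiplying by the latent haplotype counts gives the allele-1 counts at each locus.

```python
from itertools import product


def incidence_matrix(m):
    """m x 2^m 0-1 matrix whose columns are the haplotypes (order 1...1 to 0...0)."""
    haps = list(product((1, 0), repeat=m))
    return [[hap[j] for hap in haps] for j in range(m)]


def compress(A, counts):
    """Matrix-vector product A * counts: the observed pool-genotype."""
    return [sum(a * c for a, c in zip(row, counts)) for row in A]


A = incidence_matrix(2)
counts = [1, 1, 0, 2]  # hypothetical latent counts of haplotypes 11, 10, 01, 00 (2K = 4)
X = compress(A, counts)
print(A)
print(X)
```

Several count vectors map to the same X, which is exactly the information lost by pooling.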

35
Simulated and estimated variances
Fig. Variances plotted against K and D (n = 120, pa = 0.4, pb = 0.5).
Variances decrease with D and increase with K.
36
Relative efficiency of pooling
Fig. Two panels of error against cost: one labeled efficient, one labeled inefficient.
37
Relative Efficiency of Pooling

Asymptotic relative efficiency (ARE)
Relative efficiency (RE) (for a fixed number of individuals n; V = variance or MSE)

38
Asymptotic variances
Table: 2 SNPs, pa = 0.4, pb = 0.5, by pool size K
39
Asymptotic Relative Efficiency (ARE)
Fig. ARE (pa = 0.4, pb = 0.5)
Higher LD, higher efficiency.
40
Simulations (1,000 replicates)

2-locus (a/A and b/B)
Different choices of allele frequencies, LD coefficients, sample sizes and pool sizes:
pa = 0.4, pb = 0.5 and pa = 0.2, pb = 0.3
D = 0.25, 0.5, 0.75
n = 60, 120, 180
K = 1, 2, ..., 6

41
Simulations (1,000 replicates)

3-locus based on real individual genotype data
Infinite population
generate haplotypes according to the known
haplotype frequencies, then pool 2 haplotypes
to form individual genotypes, pool K individuals
to form pool-genotypes.
Finite population (pseudo-pooling)
Randomly pool every K individual genotypes to generate pool-genotypes.
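Both sampling schemes can be sketched with the standard library. Function names, the seed handling and the example inputs are assumptions made for illustration; in particular the ordering of haplotypes within the frequency vector is not stated in the transcript.

```python
import random
from itertools import product


def simulate_pools(h, m, n, K, seed=0):
    """Infinite population: draw 2K haplotypes per pool from frequencies h."""
    rng = random.Random(seed)
    haps = list(product((1, 0), repeat=m))
    pools = []
    for _ in range(n // K):
        draws = rng.choices(haps, weights=h, k=2 * K)
        pools.append(tuple(sum(column) for column in zip(*draws)))
    return pools


def pseudo_pool(individual_genotypes, K, seed=0):
    """Finite population: shuffle observed genotypes and sum every K of them."""
    rng = random.Random(seed)
    shuffled = list(individual_genotypes)
    rng.shuffle(shuffled)
    pools = []
    # Individuals left over when n is not a multiple of K are dropped.
    for start in range(0, len(shuffled) - K + 1, K):
        group = shuffled[start:start + K]
        pools.append(tuple(sum(column) for column in zip(*group)))
    return pools


# Hypothetical inputs for illustration.
h = [0.0, 0.082, 0.0, 0.0, 0.524, 0.283, 0.005, 0.106]
pools = simulate_pools(h, m=3, n=120, K=2)
genotypes = [(1, 1, 0), (2, 0, 1), (0, 1, 2), (1, 1, 1)]
pseudo = pseudo_pool(genotypes, K=2)
print(len(pools), pools[:3])
print(pseudo)
```

Pseudo-pooling keeps the real sample fixed and only randomizes the grouping, so it reflects finite-sample behaviour rather than sampling variation in the population.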

42
MSE of haplotype estimates
Fig. Two-locus a/A and b/B (pa = 0.4, pb = 0.5, n = 180)

MSE increases as K increases.
For SNPs in higher LD, it is easier (less error)
to estimate haplotype frequencies.

43
Relative efficiencies
Fig. Two-locus a/A and b/B (pa = 0.4, pb = 0.5, n = 180)

RE(K) increases with K, but seems to level off
when K 4.
The higher the LD, the higher the efficiency of
pooling.
V6 2V1

44
Individual genotype data, 3-locus (data provided by Dr. Kumar)

135 unrelated individuals genotyped at 3 SNPs in the AGT gene.
All the individuals are normal Caucasians.
High LD
This data set was used for:
Simulation according to the estimated h
Pseudo-pooling simulation

45
Relative efficiency of pooling (3-locus)
Fig. Haplotypes are generated according to h = (0, 0.082, 0, 0, 0.524, 0.283, 0.005, 0.106). RE shown for n = 60, n = 120 and n = 180.
46
Pseudo-pooling (table)
Table. Haplotype frequency estimates of the pseudo-pooling experiment based on the Kumar data (n = 120)
47
Influence of missing values on MSE and RE
48
Real Data

Pool-genotypes at 10 SNPs in the AGT gene,
15 pools, each with K = 2 independent individuals.
Individual genotypes are not available.
All individuals are unrelated.
There are 2 completely missing values.

49
Haplotype frequency estimates (7 SNPs)
50
Case-control study

Association of haplotypes and diseases.
Test the difference in haplotype frequencies between the case and control groups.
LRT:
LRT = 2log(L_case) + 2log(L_control) - 2log(L_case+control).

51
Summary

Pooling is efficient for m = 1, 2, but not for m > 2 when LD is low.
For m > 2 with high LD, pooling is good.
The variance estimates are good if n/K is large (say, ≥ 30); otherwise, use the bootstrap.

52
Summary (contd)