The Problem of Detecting Differentially Expressed Genes - PowerPoint PPT Presentation

About This Presentation
Title:

The Problem of Detecting Differentially Expressed Genes

Description:

and consider all genes that differ by more than an arbitrary cutoff ... Genovese and Wasserman (2002) Storey (2002, 2003) Storey and Tibshirani (2003a, 2003b) ... – PowerPoint PPT presentation

Number of Views:50
Avg rating:3.0/5.0
Slides: 102
Provided by: kar143
Category:

less

Transcript and Presenter's Notes

Title: The Problem of Detecting Differentially Expressed Genes


1
The Problem of Detecting Differentially
Expressed Genes
2
(No Transcript)
3
Class 1
Class 2
4
Fold Change is the Simplest Method Calculate the
log ratio between the two classes and consider
all genes that differ by more than an arbitrary
cutoff value to be differentially expressed. A
two-fold difference is often chosen. Fold
change is not a statistical test.
5
Test of a Single Hypothesis
(1) For gene consider the null hypothesis of no
association between its expression level and its
class membership.
(2) Decide on level of significance (commonly 5).
(3) Perform a test (e.g Students t-test) for
each gene.
(4) Obtain P-value corresponding to that test
statistic.
(5) Compare P-value with the significance level.
Then either reject or retain the null hypothesis.
6
Two-Sample t-Statistic
Students t-statistic
7
Two-Sample t-Statistic
Pooled form of the Students t-statistic, assumed
common variance in the two classes
8
Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001)
9
Types of Errors in Hypothesis Testing
PREDICTED
Retain Null Reject Null
Null False positive
Non-null False negative
TRUE
10
Multiplicity Problem
When many hypotheses are tested, the probability
of a false positive increases sharply with the
number of hypotheses.
Further Genes are co-regulated, subsequently
there is correlation between the test statistics.
11
Example
Suppose we measure the expression of 10,000
genes in a microarray experiment.
If all 10,000 genes were not differentially
expressed, then we would expect for P 0.05
for each test, 500 false positives. P
0.05/10,000 for each test, .05 false positives.
12
Controlling the Error Rate
  • Methods for controlling false positives e.g.
    Bonferroni are too strict for microarray analyses
  • Use the False Discovery Rate instead (FDR)
  • (Benjamini and Hochberg 1995)

13
Methods for dealing with the Multiplicity Problem
  • The Bonferroni Method
  • controls the family wise error rate (FWER) i.e.
  • the probability that at least one false positive
    error will be made
  • Too strict for gene expression data, tries to
    make it unlikely that even one false rejection of
    the null is made, may lead to missed findings
  • The False Discovery Rate (FDR)
  • emphasizes the proportion of false positives
    among the identified differentially expressed
    genes.
  • Good for gene expression data says something
    about the chosen genes

14
False Discovery Rate Benjamini and Hochberg
(1995)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
15
  
Possible Outcomes for N Hypothesis Tests
16
where
17
Positive FDR
18
  • Lindsay, Kettenring, and Siegmund (2004).
  • A Report on the Future of Statistics.
  • Statist. Sci. 19.

19
(No Transcript)
20
Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
  • Genovese and Wasserman (2002)
  • Storey (2002, 2003)
  • Storey and Tibshirani (2003a, 2003b)
  • Storey, Taylor and Siegmund (2004)
  • Black (2004)
  • Cox and Wong (2004)

21
Benjamini-Hochberg (BH) Procedure
Controls the FDR at level a when the P-values
following the null distribution are independent
and uniformly distributed.
(1) Let be the
observed P-values.
(2) Calculate
.
(3) If exists then reject null hypotheses
corresponding to
. Otherwise, reject nothing.
22
Example Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are
carried out leading to the following ordered
P-values
0.00017 0.00448 0.00671 0.00907
0.01220 0.33626 0.39341 0.53882 0.58125
0.98617
(a) With a 0.05, the Bonferroni test rejects
any hypothesis whose P-value is less than a / 10
0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such
that P(k) lt ka / N.
Here k 5, thus we reject the first five
hypotheses.
23
q-VALUE
q-value of a gene j is expected proportion of
false positives when calling that gene
significant. P-value is the probability under
the null hypothesis of obtaining a value of the
test statistic as or more extreme than its
observed value. The q-value for an observed
test statistic can be viewed as the expected
proportion of false positives among all genes
with their test statistics as or more extreme
than the observed value.
24
LIST OF SIGNIFICANT GENES
Call all genes significant if pj lt 0.05 or Call
all genes significant if qj lt 0.05 to produce a
set of significant genes so that a proportion of
them (lt0.05) is expected to be false (at least
for a large no. of genes not necessarily
independent)
25
BRCA1 versus BRCA2-mutation positive tumours
(Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive
tumours, p3226 genes P.001 gave 51
genes differentially expressed P0.0001 gave
9-11 genes
26
Using qlt0.05, gives 160 genes are taken to be
significant. It means that approx. 8 of these
160 genes are expected to be false
positives. Also, it is estimated that 33 of the
genes are differentially expressed.
27
(No Transcript)
28
(No Transcript)
29
Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
30
Using just the B permutations of the class labels
for the gene-specific statistic Wj , the P-value
for Wj wj is assessed as
where w(b)0j is the null version of wj after the
bth permutation of the class labels.
31
If we pool over all N genes, then
32
Null Distribution of the Test Statistic Example
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Suppose we have two classes of tissue samples,
with three samples from each class. Consider
the expressions of two genes, Gene 1 and Gene 2.
33
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
To find the null distribution of the test
statistic for Gene 1, we proceed under the
assumption that there is no difference between
the classes (for Gene 1) so that
Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1)
A6(1)
And permute the class labels
Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1)
A6(1) ... There are 10 distinct permutations.
34
Ten Permutations of Gene 1
A1(1) A2(1) A3(1) A4(1) A5(1) A6(1) A1(1)
A2(1) A4(1) A3(1) A5(1) A6(1) A1(1) A2(1)
A5(1) A3(1) A4(1) A6(1) A1(1) A2(1) A6(1)
A3(1) A4(1) A5(1) A1(1) A3(1) A4(1) A2(1)
A5(1) A6(1) A1(1) A3(1) A5(1) A2(1) A4(1)
A6(1) A1(1) A3(1) A6(1) A2(1) A4(1)
A5(1) A1(1) A4(1) A5(1) A2(1) A3(1)
A6(1) A1(1) A4(1) A6(1) A2(1) A3(1)
A5(1) A1(1) A5(1) A6(1) A2(1) A3(1) A4(1)
35
As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this
method in that the null values of the test
statistic for each gene does not necessarily
have the theoretical null distribution that we
are trying to estimate.
36
Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1. Suppose that
Gene 2 is differentially expressed, then the
null values of the test statistic for Gene 2 will
have a mixed distribution.
37
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2)
B6(2)
Permute the class labels
Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2)
B6(2) ...
38
Example of a null case with 7 N(0,1) points
and 8 N(0,1) points histogram of the pooled
two-sample t-statistic under 1000 permutations
of the class labels with t13 density
superimposed.
ty
39
Example of a null case with 7 N(0,1) points
and 8 N(10,9) points histogram of the pooled
two-sample t-statistic under 1000 permutations
of the class labels with t13 density
superimposed.
ty
40
The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher
et al., 2001).
The order statistics t(1), ... , t(N) are
plotted against their null expectations above.
A good test in situations where there are more
genes being over-expressed than under-expressed,
or vice-versa.
41
The FDR and other error rates
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
TRUE
R
FNDR
FDR
42
The FDR and other error rates
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
TRUE
R
FNR
FDR
43
Toy Example with 10,000 Genes
Test Result non-DE DE Total Test Result non-DE DE Total Test Result non-DE DE Total
True non-DE DE Total A 9025 B 475 C 100 D 400 9125 875 9500 500 10000
FDR B / (B D) 475 / 875 54
FNDR C / (A C) 100 / 9125 1
44
Two-component mixture model
is the proportion of genes that are not
differentially expressed, and
45
Use of the P-Value as a Summary
Statistic (Allison et al., 2002)
Instead of using the pooled form of the
t-statistic, we can work with the value pj,
which is the P-value associated with tj in the
test of the null hypothesis of no difference in
expression between the two classes.
The distribution of the P-value is modelled by
the h-component mixture model
,
where a11 a12 1.
46
Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in
expression for the jth gene, pj will have a
uniform distribution on the unit interval ie
the b1,1 distribution.
The ba1,a2 density is given by
where
47
(No Transcript)
48
  • Efron B, Tibshirani R, Storey JD, Tusher V (2001)
    Empirical Bayes analysis of a microarray
    experiment. JASA 96,1151-1160.
  • Efron B (2004) Large-scale simultaenous
    hypothesis testing the choice of a null
    hypothesis. JASA 99, 96-104.
  • Efron B (2004) Selection and Estimation for
    Large-Scale Simultaneous Inference.
  • Efron B (2005) Local False Discovery Rates.
  • Efron B (2006) Correlation and Large-Scale
    Simultaneous Significance Testing.

49
  • McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX.
    Using mixture models to detect differentially
    expressed genes. Australian Journal of
    Experimental Agriculture 45, 859-866.
  • McLachlan GJ, Bean RW, Ben-Tovim Jones L. A
    simple implentation of a normal mixture approach
    to differential gene expression in multiclass
    microarrays. Bioinformatics 26. To appear.

50
Two component mixture model
p0 is the proportion of genes that are not
differentially expressed. The two-component
mixture model is
Using Bayes Theorem, we calculate the posterior
probability that gene j is not differentially
expressed
51
If we conclude that gene j is differentially
expressed if
0(wj) c0,
then this decision minimizes the (estimated)
Bayes risk
52
where e01 is the probability of a false
positive and e10 is the probability of a false
negative.
where
53
Estimated FDR
where
54
Similarly, the false positive rate is given by
and the false non-discovery rate and false
negative rate by
55
Glonek and Solomon (2003)
F0 N(0,1), p00.9 F1 N(1,1), p10.1 Reject
H0 if z2 t0(2) 0.99972 but FDR0.17
F0 N(0,1), p00.6 F1 N(1,1), p10.4 Reject
H0 if z2 t0(2) 0.251 but FDR0.177
56
(No Transcript)
57
(No Transcript)
58
Gene Statistics Two-Sample t-Statistic
Students t-statistic
Pooled form of the Students t-statistic, assumed
common variance in the two classes
Modified t-statistic of Tusher et al. (2001)
59
The Procedure
1. Obtain the z-score for each of the genes

2. Rank the genes on the basis of the z-scores,
starting with the largest ones (the same
ordering as with the P-values, pj).
3. The posterior probability of non-differential
expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed
if 0(zj) lt c0
60
t0(zj) c
If p0/p1 9,
then
e.g. c 0.2,
61
t0(zj) c
Suppose
p0/ p1 0.9
then
t0(zj) 0.2 if
Much stronger level of evidence against the
null than in standard one-at-a-time testing.
62
e.g.
Zj N(µ,1)
H0 µ 0 vs H1 µ 2.8
Rejecting H0 if Zj 1.96 yields a
two-sided test of size 0.05 and power
0.80. f1(1.96)/f0(1.96) 4.8.
63
Z-scores, null case
Z-scores, 1
Z-scores, 2
Z-scores, 3
64
Use of Wilson-Hilferty Transformation as in Broet
et al. (2004)
zj
j
65
Plot of approximate z-scores obtained
by Wilson-Hilferty transformation of simulated
values of F-statistic with 1 and 8 degrees of
freedom.
66
EMMIX-FDR
  • A program has been written in R which interfaces
    with EMMIX to implement the algorithm described
    in McLachlan et al (2006).
  • We fit a mixture of two normal components
  • to test statistics calculated from the gene
    expression data.

67
When we equate the sample mean and variance of
the mixture to their population counterparts, we
obtain
68
When we are working with the theoretical null, we
can easily estimate the mean and variance of the
non-null component with the following formulae.
69
Following the approach of Storey and Tibshirani
(2003) we can obtain an initial estimate for p0
as follows
This estimate of p0 is used as input to EMMIX
as part of the initial parameter estimates. Thus
no random or k-means starts are required, as is
usually the case. There are two different
versions of EMMIX used, the standard version for
the empirical null and a modified version for the
theoretical null which fixes the mean
and variance of the null component to be 0 and 1
respectively.
70
Theoretical and empirical nulls
  • Efron (2004) suggested the use of two kinds of
    null component the theoretical and the empirical
    null. In the theoretical case the null component
    has mean 0 and variance 1 and the empirical null
    has unrestricted mean and variance.

71
Examples
  • We examined the performance of EMMIX-FDR on
    three well-known data sets
  • from the literature.
  • Alon colon cancer data (1999)
  • Hedenfalk breast cancer data (2001)
  • vant Wout HIV data (2003)

72
Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to
obtain gene expression profiles of tumours from
carriers of either the BRCA1 or BRCA2 mutation
(hereditary breast cancers), as well as sporadic
breast cancer. We consider their data set of M
15 patients, comprising two patient groups
BRCA1 (7) versus BRCA2 - mutation positive (8),
with N 3,226 genes.
The problem is to find genes which are
differentially expressed between the BRCA1 and
BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
73
Two component model for the Breast Cancer Data
Fit to the N values of wj (based on pooled
two-sample t-statistic)
j th gene is taken to be differentially expressed
if
74
Fitting two component mixture model to Hedenfalk
data
Null
NonNull (DE genes)
75
Estimates of p0 for Hedenfalk data
  • 0.52 (Broet, 2004)
  • 0.64 (Gottardo, 2006)
  • 0.61 (Ploner et al, 2006)
  • 0.47 (Storey, 2002)
  • Using a theoretical null, we estimated p0 to be
    0.65.

76
  • We used
  • pj 2 F13(-wj)
  • Similar results where pj obtained by permutation
    methods.

77
Ranking and Selecting the Genes
Gene j Pj zj 0 ( zj )
Gene 1 . . . Gene R . . . . Gene RR1 . . . Gene N 0.002 . . . 0.1 0.12 0.18 . . 0.20
FDR Sum/R 0.06
Local FDR
c0 0.1
Proportion of False Negatives 1 Sum1/ R1
78
Estimated FDR for various levels of c0
c0 R
0.5 906 0.26
0.4 694 0.21
0.3 518 0.16
0.2 326 0.11
0.1 143 0.06
79
c0 Nr
0.1 143 0.06 0.32 0.88 0.004
0.2 338 0.11 0.28 0.73 0.02
0.3 539 0.16 0.25 0.60 0.04
0.4 742 0.21 0.22 0.48 0.08
0.5 971 0.27 0.18 0.37 0.12
Estimated FDR and other error rates for various
levels of threshold c0 applied to the posterior
probability of nondifferential expression for the
breast cancer data (Nrnumber of selected genes)
80
Comparison of identified DE genes
Our method (143)
Hedenfalk (175)
6
29
24
101
12
39
8
Storey and Tibshirani (160)
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
81
Uniquely Identified Genes Differentially
Expressed between BRCA1 and BRCA2
Gene GO Term
UBE2B, DDB2 (UBE2V1) RAB9, RHOC ITGB5, ITGA3 PRKCBP1 HDAC3, MIF KIF5B, spindle body pole protein CTCL1 TNAFIP1 HARS, HSD17B7 DNA repair (cell cycle) small GTPase signal transduction integrin mediated signalling pathway regulation of transcription negative regulation of apoptosis cytoskeleton organisation vesicle mediated transport cation transport metabolism
82
The Procedure
1. Obtain the z-score for each of the genes

2. Rank the genes on the basis of the z-scores,
starting with the largest ones (the same
ordering as with the P-values, pj).
3. The posterior probability of non-differential
expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed
if 0(zj) lt c0
83
Breast cancer data plot of fitted
two-component normal mixture model with
theoretical N(0,1) null and non-null components
(weighted respectively by the estimated
proportion of null and non-null genes) imposed on
histogram of z-scores.
84
Hedenfalk breast cancer data plot of fitted
two-component normal mixture model with empirical
null and non-null components (weighted
respectively by the estimated proportion of null
and non-null genes) imposed on histogram of
z-scores.
85
Histogram of N3,226 z-Values from the Breast
Cancer Study. The theoretical N(0,1) null is much
narrower than the central peak, which has
(d0,s0)(-.02,1.58). In this case the central
peak seems to include the entire histogram.
86
Alon Colon Cancer Data
  • Affymetrix arrays, samples from colon tissues
    from 62 patients
  • Dataset of M 62 patients 40 tumor vs 22
    normal, with
  • N 2,000 genes.

Alon et al. (1999) PNAS, 96, 6745-6750
87
Alon data theoretical null
  • We estimated p0 to be 0.39 using the theoretical
    null. With the empirical null, it is 0.53.
  • Six smooth muscle genes are included in the
    2,000 genes and each has an estimated posterior
    probability of non-DE less than 0.015.

88
Colon cancer data plot of fitted
two-component normal mixture model with
theoretical N(0,1) null and non-null components
(weighted respectively by the estimated
proportion of null and non-null genes) imposed on
histogram of z-scores.
89
Colon cancer data plot of fitted
two-component normal mixture model with empirical
null and non-null components (weighted
respectively by the estimated proportion of null
and non-null genes) imposed on histogram of
z-scores.
90
vant Wout HIV Data
  • Four experiments using the same RNA preparation
    on 4 different slides
  • Expression levels of transcripts assessed in
    CD4-T-cell lines 24 hours after infection with
    HIV type 1
  • Two samples one corresponds to HIV infected
    cells and one to non-infected cells
  • Dataset of M 8 samples 4 HIV-infected vs 4
    non-infected, with N 7,680 genes. Analysed by
    Gottardo et al (2006)

vant Wout et al (2003), J Virology 77,
1392-1402 Gottardo et al (2006), Biometrics 62,
10-18
91
HIV data plot of fitted two-component normal
mixture model with empirical null and
non-null components (weighted respectively by the
estimated proportion of null and non-null genes)
imposed on histogram of z-scores.
92
Conclusions
  • Mixture-model based approach to finding DE genes
    can yield new information
  • Gives a measure of the posterior probability
    that a gene is not DE (i.e. a local FDR rather
    than global)
  • Provides estimates of global rates the FDR
    and the false negative rate FNR
    (FNR1-sensitivity)

93
Mixture Model-Based Approach Advantages
  • Provides a local FDR and a measure of the FNR, as
    well as the global FDR for the selected genes.
  • We show that it can yield new information

94
Allison Mice Simulation
Allison generated data for 10 mice over 3000
genes. The data are generated in six groups of
500 with a value ? of 0, 0.4, or 0.8 in the
off-diagonal elements of the 500 x 500 covariance
matrix used to generate each group. For a random
20 of the genes, a value d of 0, 4, or 8 is
added to the gene expression levels of the last
five mice.
95
Theoretical null, ?0.8, d4
96
Empirical null, ?0.8, d4
97
Theoretical null, ?0.8, d8
98
Empirical null, ?0.8, d8
99
Suppose t0(w) is monotonic decreasing in w. Then
for
100
Suppose t0(w) is monotonic decreasing in w. Then
for
where
101
For a desired control level a, say a 0.05,
define
(1)
w
is monotonic in w, then using (1)
If
to control the FDR with and
taken to be the empirical distribution
function is equivalent to using the
Benjamini-Hochberg procedure based on the
P-values corresponding to the statistic wj.
Write a Comment
User Comments (0)
About PowerShow.com