Title: The Problem of Detecting Differentially Expressed Genes
1 The Problem of Detecting Differentially
Expressed Genes
2(No Transcript)
3Class 1
Class 2
4Fold Change is the Simplest Method Calculate the
log ratio between the two classes and consider
all genes that differ by more than an arbitrary
cutoff value to be differentially expressed. A
two-fold difference is often chosen. Fold
change is not a statistical test.
5Test of a Single Hypothesis
(1) For gene consider the null hypothesis of no
association between its expression level and its
class membership.
(2) Decide on level of significance (commonly 5).
(3) Perform a test (e.g Students t-test) for
each gene.
(4) Obtain P-value corresponding to that test
statistic.
(5) Compare P-value with the significance level.
Then either reject or retain the null hypothesis.
6Two-Sample t-Statistic
Students t-statistic
7Two-Sample t-Statistic
Pooled form of the Students t-statistic, assumed
common variance in the two classes
8Two-Sample t-Statistic
Modified t-statistic of Tusher et al. (2001)
9Types of Errors in Hypothesis Testing
PREDICTED
Retain Null Reject Null
Null False positive
Non-null False negative
TRUE
10Multiplicity Problem
When many hypotheses are tested, the probability
of a false positive increases sharply with the
number of hypotheses.
Further Genes are co-regulated, subsequently
there is correlation between the test statistics.
11Example
Suppose we measure the expression of 10,000
genes in a microarray experiment.
If all 10,000 genes were not differentially
expressed, then we would expect for P 0.05
for each test, 500 false positives. P
0.05/10,000 for each test, .05 false positives.
12Controlling the Error Rate
- Methods for controlling false positives e.g.
Bonferroni are too strict for microarray analyses
- Use the False Discovery Rate instead (FDR)
- (Benjamini and Hochberg 1995)
13Methods for dealing with the Multiplicity Problem
- The Bonferroni Method
- controls the family wise error rate (FWER) i.e.
- the probability that at least one false positive
error will be made
- Too strict for gene expression data, tries to
make it unlikely that even one false rejection of
the null is made, may lead to missed findings
- The False Discovery Rate (FDR)
- emphasizes the proportion of false positives
among the identified differentially expressed
genes.
- Good for gene expression data says something
about the chosen genes
14False Discovery Rate Benjamini and Hochberg
(1995)
The FDR is essentially the expectation of the
proportion of false positives among the
identified differentially expressed genes.
15Â Â
Possible Outcomes for N Hypothesis Tests
16where
17 Positive FDR
18- Lindsay, Kettenring, and Siegmund (2004).
- A Report on the Future of Statistics.
- Statist. Sci. 19.
19(No Transcript)
20Controlling FDR
Benjamini and Hochberg (1995)
Key papers on controlling the FDR
- Genovese and Wasserman (2002)
- Storey (2002, 2003)
- Storey and Tibshirani (2003a, 2003b)
- Storey, Taylor and Siegmund (2004)
- Black (2004)
- Cox and Wong (2004)
21Benjamini-Hochberg (BH) Procedure
Controls the FDR at level a when the P-values
following the null distribution are independent
and uniformly distributed.
(1) Let be the
observed P-values.
(2) Calculate
.
(3) If exists then reject null hypotheses
corresponding to
. Otherwise, reject nothing.
22Example Bonferroni and BH Tests
Suppose that 10 independent hypothesis tests are
carried out leading to the following ordered
P-values
0.00017 0.00448 0.00671 0.00907
0.01220 0.33626 0.39341 0.53882 0.58125
0.98617
(a) With a 0.05, the Bonferroni test rejects
any hypothesis whose P-value is less than a / 10
0.005.
Thus only the first two hypotheses are rejected.
(b) For the BH test, we find the largest k such
that P(k) lt ka / N.
Here k 5, thus we reject the first five
hypotheses.
23q-VALUE
q-value of a gene j is expected proportion of
false positives when calling that gene
significant. P-value is the probability under
the null hypothesis of obtaining a value of the
test statistic as or more extreme than its
observed value. The q-value for an observed
test statistic can be viewed as the expected
proportion of false positives among all genes
with their test statistics as or more extreme
than the observed value.
24LIST OF SIGNIFICANT GENES
Call all genes significant if pj lt 0.05 or Call
all genes significant if qj lt 0.05 to produce a
set of significant genes so that a proportion of
them (lt0.05) is expected to be false (at least
for a large no. of genes not necessarily
independent)
25BRCA1 versus BRCA2-mutation positive tumours
(Hedenfalk et al., 2001)
BRCA1 (7) versus BRCA2-mutation (8) positive
tumours, p3226 genes P.001 gave 51
genes differentially expressed P0.0001 gave
9-11 genes
26Using qlt0.05, gives 160 genes are taken to be
significant. It means that approx. 8 of these
160 genes are expected to be false
positives. Also, it is estimated that 33 of the
genes are differentially expressed.
27(No Transcript)
28(No Transcript)
29Null Distribution of the Test Statistic
Permutation Method
The null distribution has a resolution on the
order of the number of permutations. If we
perform B permutations, then the P-value will be
estimated with a resolution of 1/B. If we
assume that each gene has the same null
distribution and combine the permutations, then
the resolution will be 1/(NB) for the pooled
null distribution.
30Using just the B permutations of the class labels
for the gene-specific statistic Wj , the P-value
for Wj wj is assessed as
where w(b)0j is the null version of wj after the
bth permutation of the class labels.
31If we pool over all N genes, then
32Null Distribution of the Test Statistic Example
Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Suppose we have two classes of tissue samples,
with three samples from each class. Consider
the expressions of two genes, Gene 1 and Gene 2.
33Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
To find the null distribution of the test
statistic for Gene 1, we proceed under the
assumption that there is no difference between
the classes (for Gene 1) so that
Gene 1 A1(1) A2(1) A3(1) A4(1) A5(1)
A6(1)
And permute the class labels
Perm. 1 A1(1) A2(1) A4(1) A3(1) A5(1)
A6(1) ... There are 10 distinct permutations.
34Ten Permutations of Gene 1
A1(1) A2(1) A3(1) A4(1) A5(1) A6(1) A1(1)
A2(1) A4(1) A3(1) A5(1) A6(1) A1(1) A2(1)
A5(1) A3(1) A4(1) A6(1) A1(1) A2(1) A6(1)
A3(1) A4(1) A5(1) A1(1) A3(1) A4(1) A2(1)
A5(1) A6(1) A1(1) A3(1) A5(1) A2(1) A4(1)
A6(1) A1(1) A3(1) A6(1) A2(1) A4(1)
A5(1) A1(1) A4(1) A5(1) A2(1) A3(1)
A6(1) A1(1) A4(1) A6(1) A2(1) A3(1)
A5(1) A1(1) A5(1) A6(1) A2(1) A3(1) A4(1)
35As there are only 10 distinct permutations here,
the null distribution based on these permutations
is too granular. Hence consideration is given
to permuting the labels of each of the other
genes and estimating the null distribution of a
gene based on the pooled permutations so
obtained. But there is a problem with this
method in that the null values of the test
statistic for each gene does not necessarily
have the theoretical null distribution that we
are trying to estimate.
36Suppose we were to use Gene 2 also to estimate
the null distribution of Gene 1. Suppose that
Gene 2 is differentially expressed, then the
null values of the test statistic for Gene 2 will
have a mixed distribution.
37Class 1 Class 2
Gene 1 A1(1) A2(1) A3(1) B4(1) B5(1)
B6(1)
Gene 2 A1(2) A2(2) A3(2)
B4(2) B5(2) B6(2)
Gene 2 A1(2) A2(2) A3(2) B4(2) B5(2)
B6(2)
Permute the class labels
Perm. 1 A1(2) A2(2) B4(2) A3(2) B5(2)
B6(2) ...
38Example of a null case with 7 N(0,1) points
and 8 N(0,1) points histogram of the pooled
two-sample t-statistic under 1000 permutations
of the class labels with t13 density
superimposed.
ty
39Example of a null case with 7 N(0,1) points
and 8 N(10,9) points histogram of the pooled
two-sample t-statistic under 1000 permutations
of the class labels with t13 density
superimposed.
ty
40The SAM Method
Use the permutation method to calculate the null
distribution of the modified t-statistic (Tusher
et al., 2001).
The order statistics t(1), ... , t(N) are
plotted against their null expectations above.
A good test in situations where there are more
genes being over-expressed than under-expressed,
or vice-versa.
41The FDR and other error rates
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
TRUE
R
FNDR
FDR
42The FDR and other error rates
PREDICTED
Retain Null Reject Null
Null a b (false positive)
Non-null c (false negative) d
TRUE
R
FNR
FDR
43Toy Example with 10,000 Genes
Test Result non-DE DE Total Test Result non-DE DE Total Test Result non-DE DE Total
True non-DE DE Total A 9025 B 475 C 100 D 400 9125 875 9500 500 10000
FDR B / (B D) 475 / 875 54
FNDR C / (A C) 100 / 9125 1
44Two-component mixture model
is the proportion of genes that are not
differentially expressed, and
45Use of the P-Value as a Summary
Statistic (Allison et al., 2002)
Instead of using the pooled form of the
t-statistic, we can work with the value pj,
which is the P-value associated with tj in the
test of the null hypothesis of no difference in
expression between the two classes.
The distribution of the P-value is modelled by
the h-component mixture model
,
where a11 a12 1.
46Use of the P-Value as a Summary Statistic
Under the null hypothesis of no difference in
expression for the jth gene, pj will have a
uniform distribution on the unit interval ie
the b1,1 distribution.
The ba1,a2 density is given by
where
47(No Transcript)
48- Efron B, Tibshirani R, Storey JD, Tusher V (2001)
Empirical Bayes analysis of a microarray
experiment. JASA 96,1151-1160. - Efron B (2004) Large-scale simultaenous
hypothesis testing the choice of a null
hypothesis. JASA 99, 96-104. - Efron B (2004) Selection and Estimation for
Large-Scale Simultaneous Inference. - Efron B (2005) Local False Discovery Rates.
- Efron B (2006) Correlation and Large-Scale
Simultaneous Significance Testing.
49- McLachlan GJ, Bean RW, Ben-Tovim Jones L, Zhu JX.
Using mixture models to detect differentially
expressed genes. Australian Journal of
Experimental Agriculture 45, 859-866. - McLachlan GJ, Bean RW, Ben-Tovim Jones L. A
simple implentation of a normal mixture approach
to differential gene expression in multiclass
microarrays. Bioinformatics 26. To appear.
50Two component mixture model
p0 is the proportion of genes that are not
differentially expressed. The two-component
mixture model is
Using Bayes Theorem, we calculate the posterior
probability that gene j is not differentially
expressed
51If we conclude that gene j is differentially
expressed if
0(wj) c0,
then this decision minimizes the (estimated)
Bayes risk
52where e01 is the probability of a false
positive and e10 is the probability of a false
negative.
where
53Estimated FDR
where
54Similarly, the false positive rate is given by
and the false non-discovery rate and false
negative rate by
55Glonek and Solomon (2003)
F0 N(0,1), p00.9 F1 N(1,1), p10.1 Reject
H0 if z2 t0(2) 0.99972 but FDR0.17
F0 N(0,1), p00.6 F1 N(1,1), p10.4 Reject
H0 if z2 t0(2) 0.251 but FDR0.177
56(No Transcript)
57(No Transcript)
58Gene Statistics Two-Sample t-Statistic
Students t-statistic
Pooled form of the Students t-statistic, assumed
common variance in the two classes
Modified t-statistic of Tusher et al. (2001)
59The Procedure
1. Obtain the z-score for each of the genes
2. Rank the genes on the basis of the z-scores,
starting with the largest ones (the same
ordering as with the P-values, pj).
3. The posterior probability of non-differential
expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed
if 0(zj) lt c0
60t0(zj) c
If p0/p1 9,
then
e.g. c 0.2,
61t0(zj) c
Suppose
p0/ p1 0.9
then
t0(zj) 0.2 if
Much stronger level of evidence against the
null than in standard one-at-a-time testing.
62e.g.
Zj N(µ,1)
H0 µ 0 vs H1 µ 2.8
Rejecting H0 if Zj 1.96 yields a
two-sided test of size 0.05 and power
0.80. f1(1.96)/f0(1.96) 4.8.
63Z-scores, null case
Z-scores, 1
Z-scores, 2
Z-scores, 3
64Use of Wilson-Hilferty Transformation as in Broet
et al. (2004)
zj
j
65Plot of approximate z-scores obtained
by Wilson-Hilferty transformation of simulated
values of F-statistic with 1 and 8 degrees of
freedom.
66EMMIX-FDR
- A program has been written in R which interfaces
with EMMIX to implement the algorithm described
in McLachlan et al (2006). - We fit a mixture of two normal components
- to test statistics calculated from the gene
expression data.
67When we equate the sample mean and variance of
the mixture to their population counterparts, we
obtain
68When we are working with the theoretical null, we
can easily estimate the mean and variance of the
non-null component with the following formulae.
69Following the approach of Storey and Tibshirani
(2003) we can obtain an initial estimate for p0
as follows
This estimate of p0 is used as input to EMMIX
as part of the initial parameter estimates. Thus
no random or k-means starts are required, as is
usually the case. There are two different
versions of EMMIX used, the standard version for
the empirical null and a modified version for the
theoretical null which fixes the mean
and variance of the null component to be 0 and 1
respectively.
70Theoretical and empirical nulls
- Efron (2004) suggested the use of two kinds of
null component the theoretical and the empirical
null. In the theoretical case the null component
has mean 0 and variance 1 and the empirical null
has unrestricted mean and variance.
71Examples
- We examined the performance of EMMIX-FDR on
three well-known data sets - from the literature.
- Alon colon cancer data (1999)
- Hedenfalk breast cancer data (2001)
- vant Wout HIV data (2003)
72Hedenfalk Breast Cancer Data
Hedenfalk et al. (2001) used cDNA arrays to
obtain gene expression profiles of tumours from
carriers of either the BRCA1 or BRCA2 mutation
(hereditary breast cancers), as well as sporadic
breast cancer. We consider their data set of M
15 patients, comprising two patient groups
BRCA1 (7) versus BRCA2 - mutation positive (8),
with N 3,226 genes.
The problem is to find genes which are
differentially expressed between the BRCA1 and
BRCA2 patients.
Hedenfalk et al. (2001) NEJM, 344, 539-547
73Two component model for the Breast Cancer Data
Fit to the N values of wj (based on pooled
two-sample t-statistic)
j th gene is taken to be differentially expressed
if
74Fitting two component mixture model to Hedenfalk
data
Null
NonNull (DE genes)
75Estimates of p0 for Hedenfalk data
- 0.52 (Broet, 2004)
- 0.64 (Gottardo, 2006)
- 0.61 (Ploner et al, 2006)
- 0.47 (Storey, 2002)
- Using a theoretical null, we estimated p0 to be
0.65.
76- We used
- pj 2 F13(-wj)
- Similar results where pj obtained by permutation
methods.
77Ranking and Selecting the Genes
Gene j Pj zj 0 ( zj )
Gene 1 . . . Gene R . . . . Gene RR1 . . . Gene N 0.002 . . . 0.1 0.12 0.18 . . 0.20
FDR Sum/R 0.06
Local FDR
c0 0.1
Proportion of False Negatives 1 Sum1/ R1
78Estimated FDR for various levels of c0
c0 R
0.5 906 0.26
0.4 694 0.21
0.3 518 0.16
0.2 326 0.11
0.1 143 0.06
79c0 Nr
0.1 143 0.06 0.32 0.88 0.004
0.2 338 0.11 0.28 0.73 0.02
0.3 539 0.16 0.25 0.60 0.04
0.4 742 0.21 0.22 0.48 0.08
0.5 971 0.27 0.18 0.37 0.12
Estimated FDR and other error rates for various
levels of threshold c0 applied to the posterior
probability of nondifferential expression for the
breast cancer data (Nrnumber of selected genes)
80Comparison of identified DE genes
Our method (143)
Hedenfalk (175)
6
29
24
101
12
39
8
Storey and Tibshirani (160)
Storey and Tibshirani (2003) PNAS, 100, 9440-9445
81 Uniquely Identified Genes Differentially
Expressed between BRCA1 and BRCA2
Gene GO Term
UBE2B, DDB2 (UBE2V1) RAB9, RHOC ITGB5, ITGA3 PRKCBP1 HDAC3, MIF KIF5B, spindle body pole protein CTCL1 TNAFIP1 HARS, HSD17B7 DNA repair (cell cycle) small GTPase signal transduction integrin mediated signalling pathway regulation of transcription negative regulation of apoptosis cytoskeleton organisation vesicle mediated transport cation transport metabolism
82The Procedure
1. Obtain the z-score for each of the genes
2. Rank the genes on the basis of the z-scores,
starting with the largest ones (the same
ordering as with the P-values, pj).
3. The posterior probability of non-differential
expression of gene j, is given by 0(zj).
4. Conclude gene j to be differentially expressed
if 0(zj) lt c0
83Breast cancer data plot of fitted
two-component normal mixture model with
theoretical N(0,1) null and non-null components
(weighted respectively by the estimated
proportion of null and non-null genes) imposed on
histogram of z-scores.
84Hedenfalk breast cancer data plot of fitted
two-component normal mixture model with empirical
null and non-null components (weighted
respectively by the estimated proportion of null
and non-null genes) imposed on histogram of
z-scores.
85Histogram of N3,226 z-Values from the Breast
Cancer Study. The theoretical N(0,1) null is much
narrower than the central peak, which has
(d0,s0)(-.02,1.58). In this case the central
peak seems to include the entire histogram.
86Alon Colon Cancer Data
- Affymetrix arrays, samples from colon tissues
from 62 patients - Dataset of M 62 patients 40 tumor vs 22
normal, with - N 2,000 genes.
Alon et al. (1999) PNAS, 96, 6745-6750
87Alon data theoretical null
- We estimated p0 to be 0.39 using the theoretical
null. With the empirical null, it is 0.53. - Six smooth muscle genes are included in the
2,000 genes and each has an estimated posterior
probability of non-DE less than 0.015.
88Colon cancer data plot of fitted
two-component normal mixture model with
theoretical N(0,1) null and non-null components
(weighted respectively by the estimated
proportion of null and non-null genes) imposed on
histogram of z-scores.
89Colon cancer data plot of fitted
two-component normal mixture model with empirical
null and non-null components (weighted
respectively by the estimated proportion of null
and non-null genes) imposed on histogram of
z-scores.
90vant Wout HIV Data
- Four experiments using the same RNA preparation
on 4 different slides - Expression levels of transcripts assessed in
CD4-T-cell lines 24 hours after infection with
HIV type 1 - Two samples one corresponds to HIV infected
cells and one to non-infected cells - Dataset of M 8 samples 4 HIV-infected vs 4
non-infected, with N 7,680 genes. Analysed by
Gottardo et al (2006)
vant Wout et al (2003), J Virology 77,
1392-1402 Gottardo et al (2006), Biometrics 62,
10-18
91HIV data plot of fitted two-component normal
mixture model with empirical null and
non-null components (weighted respectively by the
estimated proportion of null and non-null genes)
imposed on histogram of z-scores.
92Conclusions
- Mixture-model based approach to finding DE genes
can yield new information - Gives a measure of the posterior probability
that a gene is not DE (i.e. a local FDR rather
than global) - Provides estimates of global rates the FDR
and the false negative rate FNR
(FNR1-sensitivity)
93Mixture Model-Based Approach Advantages
- Provides a local FDR and a measure of the FNR, as
well as the global FDR for the selected genes. - We show that it can yield new information
94Allison Mice Simulation
Allison generated data for 10 mice over 3000
genes. The data are generated in six groups of
500 with a value ? of 0, 0.4, or 0.8 in the
off-diagonal elements of the 500 x 500 covariance
matrix used to generate each group. For a random
20 of the genes, a value d of 0, 4, or 8 is
added to the gene expression levels of the last
five mice.
95Theoretical null, ?0.8, d4
96Empirical null, ?0.8, d4
97Theoretical null, ?0.8, d8
98Empirical null, ?0.8, d8
99Suppose t0(w) is monotonic decreasing in w. Then
for
100Suppose t0(w) is monotonic decreasing in w. Then
for
where
101For a desired control level a, say a 0.05,
define
(1)
w
is monotonic in w, then using (1)
If
to control the FDR with and
taken to be the empirical distribution
function is equivalent to using the
Benjamini-Hochberg procedure based on the
P-values corresponding to the statistic wj.