Title: Detecting Differentially Expressed Genes
1Detecting Differentially Expressed Genes
Pengyu Hong 09/13/2005
2Background (Microarray)
Extract RNA
Cells
3Background
Extract RNA
Cells
4Background
Extract RNA
Cells
5Background
Extract RNA
Cells
6Background
Extract RNA
Cells
104 genes
7Background
Extract RNA
Cells
104 genes
8Background
Extract RNA
Cells
104 genes
9Background
Biological sample
- RNA extraction (total RNA or mRNA)
- Amplification (in vitro transcription)
- Label samples
- Hybridization
- Washing and staining
- Microarrays are highly noisy
- Use replicated experiments to make inferences
about differential expression for the population
from which the biological samples originate
Scanning
10Background
Normalization
Calculate Gene Expression Index
11An Example
5 normal sample and 9 myeloma (MM) samples 12558
genes (rows)
12Genes of Interest
- Statistical significance that the observed
differential expression is unlikely to be due to
chance. - Scientific significance that the observed level
of differential expression is of sufficient
magnitude to be of biological relevance.
13Parametric Test t-test
Statistical significance in the two group problem
Group 1 (N samples) X1, X2, XN Group 2 (M
samples) Y1, Y2, YM
Assume
Xi Normal (µ1, s2)
Yj Normal (µ2, s2)
Null hypothesis Group 1 is the same to Group 2
(i.e., µ1 µ2)
14Parametric Test t-test
Statistical significance in the two group problem
Yj Normal (µ2, s2)
Xi Normal (µ1, s2)
Null hypothesis µ1 µ2
Test null hypothesis with test statistics
15Xi Normal (µ1, s12)
s1 ? s2
If variances are unequal
Yj Normal (µ2, s22)
(1) When NM gt 30, this is approximately normal
(2) When ?1 gtgt ?2, this is approximately t(df
N1)
(3) In general, Welch approximation t t(df),
where
16Wilcoxon rank sum test
- Consider row 7 of MM study
- 16 253 633 1008 708 36 72 28 14 33
19 49 58 23 - 13 4 3 1 2 8 5 10
14 9 12 7 6 11 - ---------------------------
- rank sum 23
- This test is more appropriate than the t-tests
when the underlying distribution is far from
normal. (But it requires large group sizes)
17P-value
- p-value P(Tgtt) is calculated based on the
distribution of T under the null hypothesis. - p-value is a function of the test statistics and
can be viewed as a random variable. - e.g. p-value 2(1 - F(t), F cdf of t(NM
2). - A small p-value represents evidence against the
null hypothesis ? differentially expressed in our
case.
18Permutation test
- A non-parametric way of computation p-value for
any test statistics. - In the MM-study, each gene has (14 choose 5)
2002 different test values obtainable from
permuting the group labels. - Under the null hypothesis that the distribution
for the two groups are identical, all these test
values are equally probable. What is the
probability of getting a test value at least as
extreme as the observed one? This is the
permutation p-value.
19Permutation technique
Compute TS0
Compute TS1
Compute TS2
Compute TS3
The set of TSi form the empirical distribution of
the test statistic TS
20Scientific Significance
- Fold change FC
- May not be high when statistical significance is
high. - Not an appropriate measure if the dispersion is
not taken into consideration.
21Conservative fold change Conservative fold
change (CFC) Max (25th percentile of sample 1
/ 75th percentile of sample 2, 25th
percentile of sample 2 / 75th percentile of
sample 1)
22Sample 1 Normal (100, 1) Sample 2 Normal (103,
1) CFC 1.0164
23CFC2.89
CFC3.53
CFC1.45
CFC1.07
24P-values and FC contains different information
25Gene Selection and Ranking
- A high threshold of statistical significance ?
Select genes with p-values smaller than a
threshold - The selected genes are ordered according to their
scientific significance (i.e. ranked by
fold-changes)
26The False Positive Rate (FPR)
- If we select genes with p-value lt 0.01, then the
probability of making a positive call when the
gene is in fact not differential is less than
0.01. Thus selection by p-value controls the FPR. - However, if we have 12,000 genes in a microarray,
then a FPR 0.01 still allows up to 120 false
positives. To make sensible decision, we must
take multiple comparisons into consideration.
27Dealing with Multiple Comparison
- Bonferroni inequality To control the family-wise
error rate for testing m hypotheses at level a,
we need to control the FPR for each individual
test at a/m - Then P(false rejection at least one hypothesis)
lt a - or P(no false rejection) gt 1- a
- This is appropriate for some applications (e.g.
testing a new drug versus several existing ones),
but is too conservative for our task of gene
selection.