Detecting Differentially Expressed Genes

Provided by: csBra


Transcript and Presenter's Notes

Title: Detecting Differentially Expressed Genes

1
Detecting Differentially Expressed Genes
Pengyu Hong 09/13/2005
2
Background (Microarray)
Extract RNA
Cells
6
Background
Extract RNA
Cells
10^4 genes
9
Background
Biological sample

RNA extraction (total RNA or mRNA)
Amplification (in vitro transcription)
Label samples
Hybridization
Washing and staining
Scanning

Microarrays are highly noisy.
Use replicated experiments to make inferences
about differential expression for the population
from which the biological samples originate.
10
Background
Normalization
Calculate Gene Expression Index
11
An Example
5 normal samples and 9 myeloma (MM) samples; 12,558
genes (rows)
12
Genes of Interest

Statistical significance: the observed
differential expression is unlikely to be due to
chance.
Scientific significance: the observed level
of differential expression is of sufficient
magnitude to be of biological relevance.

13
Parametric Test: t-test
Statistical significance in the two-group problem
Group 1 (N samples): $X_1, X_2, \ldots, X_N$
Group 2 (M samples): $Y_1, Y_2, \ldots, Y_M$
Assume
$X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$
Null hypothesis: Group 1 is the same as Group 2
(i.e., $\mu_1 = \mu_2$)
14
Parametric Test: t-test
Statistical significance in the two-group problem
$X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$
Null hypothesis: $\mu_1 = \mu_2$
Test the null hypothesis with a test statistic
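The formula for the test statistic did not survive in this transcript. Assuming the slide showed the standard equal-variance (pooled) two-sample t statistic, which matches the $t(N + M - 2)$ reference distribution quoted on the p-value slide, it would read as follows. Here $\bar{X}$, $\bar{Y}$ are the group means and $s_X^2$, $s_Y^2$ the sample variances (notation introduced here, not taken from the slide).

$$
T = \frac{\bar{X} - \bar{Y}}{s_p \sqrt{\frac{1}{N} + \frac{1}{M}}},
\qquad
s_p^2 = \frac{(N - 1)\, s_X^2 + (M - 1)\, s_Y^2}{N + M - 2},
\qquad
T \sim t(N + M - 2) \ \text{under the null hypothesis.}
$$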
15
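If variances are unequal ($\sigma_1 \neq \sigma_2$)
$X_i \sim \mathrm{Normal}(\mu_1, \sigma_1^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma_2^2)$
(1) When $N + M > 30$, the test statistic is approximately normal
(2) When $\sigma_1 \gg \sigma_2$, it is approximately $t(\mathrm{df} = N - 1)$
(3) In general, by the Welch approximation, $T \sim t(\mathrm{df})$,
where df is estimated from the sample variances and group sizes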
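The unequal-variance statistic and its degrees of freedom appeared as formulas on the original slide and are missing here. The sketch below uses the standard Welch-Satterthwaite form, which is an assumption about what the slide showed. The function name `welch_t` is made up for illustration. The data are the row-7 values quoted on the Wilcoxon slide, and treating the first five as the normal group is also an assumption.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for the unequal-variance (Welch) two-sample t-test."""
    n, m = len(x), len(y)
    # Squared standard errors of the two group means (sample variance / size)
    vx, vy = variance(x) / n, variance(y) / m
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df

# Row 7 of the MM study as listed in this transcript (group split assumed)
normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

t, df = welch_t(normal, myeloma)
print(f"t = {t:.2f}, df = {df:.2f}")
```

For this row the normal group's variance dominates, so df comes out just above N - 1 = 4, which illustrates case (2).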

16
Wilcoxon rank sum test

Consider row 7 of the MM study (rank 1 = largest value)

        Normal (5 samples)        | MM (9 samples)
Value   16  253  633  1008  708   | 36  72  28  14  33  19  49  58  23
Rank    13    4    3     1    2   |  8   5  10  14   9  12   7   6  11
---------------------------
Rank sum (first group) = 23
This test is more appropriate than the t-tests
when the underlying distribution is far from
normal. (But it requires large group sizes.)
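The rank sum for row 7 can be reproduced with a few lines of Python. This is a minimal sketch: `rank_sum` is a hypothetical helper, it ranks in descending order to match the slide, and it assumes there are no ties (true for this row).

```python
def rank_sum(group1, group2):
    """Sum of group1's ranks in the pooled sample (rank 1 = largest value).

    Assumes all values are distinct; ties would need averaged ranks.
    """
    pooled = sorted(group1 + group2, reverse=True)
    rank = {value: i + 1 for i, value in enumerate(pooled)}
    return sum(rank[v] for v in group1)

# Row 7 values as listed above; the first five are taken as the first group
normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

w = rank_sum(normal, myeloma)
print(w)  # 23, the rank sum shown on the slide
```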

17
P-value

The p-value, $P(|T| > |t|)$, is calculated based on the
distribution of $T$ under the null hypothesis.
The p-value is a function of the test statistic and
can be viewed as a random variable.
E.g., p-value $= 2(1 - F(|t|))$, where $F$ is the cdf of
$t(N + M - 2)$.
A small p-value represents evidence against the
null hypothesis → differentially expressed in our
case.

18
Permutation test

A non-parametric way of computing a p-value for
any test statistic.
In the MM study, each gene has (14 choose 5) =
2002 different test values obtainable from
permuting the group labels.
Under the null hypothesis that the distributions
for the two groups are identical, all these test
values are equally probable. What is the
probability of getting a test value at least as
extreme as the observed one? This is the
permutation p-value.

19
Permutation technique
Compute TS0
Compute TS1
Compute TS2
Compute TS3
The set of $TS_i$ forms the empirical distribution of
the test statistic TS
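The permutation procedure can be sketched by enumerating every way of relabelling the 14 samples. The test statistic used here, the absolute difference in group means, is chosen for illustration only, since the slides allow any statistic. `permutation_p_value` is a hypothetical helper name, and the data are the row-7 values quoted on the Wilcoxon slide with the first five taken as the normal group.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for |mean(x) - mean(y)|.

    Returns (p_value, number_of_relabellings).
    """
    pooled = x + y
    n = len(x)
    observed = abs(mean(x) - mean(y))
    count = total = 0
    # Every choice of n indices is one relabelling of the samples
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        g1 = [pooled[i] for i in idx]
        g2 = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance guards against floating-point ties
        if abs(mean(g1) - mean(g2)) >= observed - 1e-9:
            count += 1
    return count / total, total

normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

p, n_perm = permutation_p_value(normal, myeloma)
print(n_perm, round(p, 4))
```

The observed labelling is itself one of the 2002 relabellings, so the p-value can never be smaller than 1/2002.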
20
Scientific Significance

Fold change (FC)
May not be high when statistical significance is
high.
Not an appropriate measure if the dispersion is
not taken into consideration.

21
Conservative fold change
Conservative fold change (CFC) = max(25th percentile of sample 1
/ 75th percentile of sample 2, 25th
percentile of sample 2 / 75th percentile of
sample 1)
22
Sample 1 ~ Normal(100, 1), Sample 2 ~ Normal(103, 1):
CFC = 1.0164
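The CFC value for these two distributions can be checked from their theoretical quartiles. A minimal sketch: `conservative_fold_change` is a hypothetical helper that takes the quartiles directly. With real data the percentiles would be estimated from the samples instead.

```python
from statistics import NormalDist

def conservative_fold_change(q25_1, q75_1, q25_2, q75_2):
    """CFC = max(Q25 of sample 1 / Q75 of sample 2, Q25 of sample 2 / Q75 of sample 1)."""
    return max(q25_1 / q75_2, q25_2 / q75_1)

# Theoretical quartiles of the two normal distributions from the slide
s1 = NormalDist(mu=100, sigma=1)
s2 = NormalDist(mu=103, sigma=1)

cfc = conservative_fold_change(
    s1.inv_cdf(0.25), s1.inv_cdf(0.75),
    s2.inv_cdf(0.25), s2.inv_cdf(0.75),
)
print(round(cfc, 4))  # 1.0164
```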
23
Example plots (figures not preserved), with panel labels:
CFC = 2.89
CFC = 3.53
CFC = 1.45
CFC = 1.07
24
P-values and FC contain different information
25
Gene Selection and Ranking

A high threshold of statistical significance →
select genes with p-values smaller than a
threshold.
The selected genes are ordered according to their
scientific significance (i.e., ranked by
fold change).
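The two-step procedure, filtering by p-value and then ranking by fold change, can be sketched as follows. The gene names, p-values, and fold changes are made up for illustration, and `select_and_rank` is a hypothetical helper name.

```python
def select_and_rank(genes, p_threshold=0.01):
    """Keep genes with p-value below the threshold, then sort by fold change.

    genes: list of (name, p_value, fold_change) tuples.
    """
    selected = [g for g in genes if g[1] < p_threshold]
    return sorted(selected, key=lambda g: g[2], reverse=True)

# Hypothetical values for illustration only
genes = [
    ("geneA", 0.001, 1.2),
    ("geneB", 0.200, 5.0),  # large fold change but not significant
    ("geneC", 0.004, 3.1),
    ("geneD", 0.009, 2.0),
]

ranked_names = [name for name, _, _ in select_and_rank(genes)]
print(ranked_names)  # ['geneC', 'geneD', 'geneA']
```

geneB is dropped despite having the largest fold change, because its p-value does not pass the threshold.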

26
The False Positive Rate (FPR)

If we select genes with p-value < 0.01, then the
probability of making a positive call when the
gene is in fact not differential is less than
0.01. Thus selection by p-value controls the FPR.
However, if we have 12,000 genes in a microarray,
then an FPR of 0.01 still allows up to 120 false
positives on average. To make sensible decisions, we must
take multiple comparisons into consideration.
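The figure of 120 is the expected count when none of the 12,000 genes is truly differential. A small simulation illustrates this, under the assumption that null p-values are uniform on [0, 1]. The seed and helper name are arbitrary choices for the sketch.

```python
import random

def count_false_positives(n_genes, threshold, seed=0):
    """Draw one uniform null p-value per gene and count those below the threshold."""
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(n_genes))

n_genes, threshold = 12_000, 0.01

expected = n_genes * threshold  # 120 false positives on average
observed = count_false_positives(n_genes, threshold)
print(expected, observed)
```

The simulated count varies from run to run around the expected value, with a standard deviation of roughly 11.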

27
Dealing with Multiple Comparison

Bonferroni inequality: To control the family-wise
error rate for testing m hypotheses at level $\alpha$,
we need to control the FPR for each individual
test at $\alpha/m$.
Then $P(\text{at least one false rejection}) \le \alpha$,
or $P(\text{no false rejection}) \ge 1 - \alpha$.
This is appropriate for some applications (e.g.,
testing a new drug versus several existing ones),
but is too conservative for our task of gene
selection.
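To see how strict the Bonferroni threshold becomes at microarray scale, the sketch below computes the per-test level for m = 12,000 genes. The family-wise level of 0.05 is an illustrative assumption, and the independence calculation is only a sanity check, since the Bonferroni bound itself needs no independence.

```python
def bonferroni_threshold(alpha, m):
    """Per-test significance level that keeps the family-wise error rate <= alpha."""
    return alpha / m

alpha, m = 0.05, 12_000  # alpha is an assumed illustrative level

per_test = bonferroni_threshold(alpha, m)
# Union bound: P(at least one false rejection) <= m * (alpha / m) = alpha
union_bound = m * per_test
# If the m tests were independent and all nulls true, the exact rate would be:
fwer_if_independent = 1 - (1 - per_test) ** m

print(f"per-test threshold = {per_test:.2e}")
print(f"FWER under independence = {fwer_if_independent:.4f}")
```

A gene must reach a p-value of about 4e-6 to be selected. With only 2002 possible relabellings in the MM study, the smallest attainable permutation p-value is 1/2002, so no gene could pass, which shows how conservative the correction is here.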