1
Detecting Differentially Expressed Genes
Pengyu Hong 09/13/2005
2
Background (Microarray)
Extract RNA
Cells
6
Background
Extract RNA
Cells
10^4 genes
9
Background
Biological sample
  • RNA extraction (total RNA or mRNA)
  • Amplification (in vitro transcription)
  • Label samples
  • Hybridization
  • Washing and staining
  • Scanning

  • Microarrays are highly noisy
  • Use replicated experiments to make inferences
    about differential expression for the population
    from which the biological samples originate
10
Background
Normalization
Calculate Gene Expression Index
11
An Example
5 normal samples and 9 myeloma (MM) samples; 12,558
genes (rows)
12
Genes of Interest
  • Statistical significance: the observed
    differential expression is unlikely to be due to
    chance.
  • Scientific significance: the observed level
    of differential expression is of sufficient
    magnitude to be of biological relevance.

13
Parametric Test: t-test
Statistical significance in the two-group problem
Group 1 (N samples): $X_1, X_2, \ldots, X_N$
Group 2 (M samples): $Y_1, Y_2, \ldots, Y_M$
Assume
$X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$
Null hypothesis: Group 1 is the same as Group 2
(i.e., $\mu_1 = \mu_2$)
14
Parametric Test: t-test
Statistical significance in the two-group problem
$X_i \sim \mathrm{Normal}(\mu_1, \sigma^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma^2)$
Null hypothesis: $\mu_1 = \mu_2$
Test the null hypothesis with a test statistic
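
The slide's test statistic is not preserved in this transcript. A minimal sketch, assuming it is the standard pooled-variance two-sample t statistic with N + M - 2 degrees of freedom (the same degrees of freedom the later P-value slide uses). The function name is hypothetical, and the data are the row-7 values quoted on the Wilcoxon slide, taking the first 5 values as the normal group.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(x, y):
    """Two-sample t statistic under the equal-variance assumption.

    Returns (t, df) with df = N + M - 2.
    """
    n, m = len(x), len(y)
    # Pooled estimate of the common variance sigma^2.
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / (n + m - 2)
    t = (mean(x) - mean(y)) / math.sqrt(pooled_var * (1 / n + 1 / m))
    return t, n + m - 2

# Row 7 of the MM study: 5 normal samples, then 9 myeloma samples.
normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

t, df = pooled_t_statistic(normal, myeloma)
print(f"t = {t:.3f}, df = {df}")
```

For this gene the two groups have very different spreads, so the equal-variance assumption is doubtful here; the example only illustrates the arithmetic.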
15
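If variances are unequal ($\sigma_1 \neq \sigma_2$)
$X_i \sim \mathrm{Normal}(\mu_1, \sigma_1^2)$
$Y_j \sim \mathrm{Normal}(\mu_2, \sigma_2^2)$
(1) When $N + M > 30$, the test statistic is approximately normal
(2) When $\sigma_1 \gg \sigma_2$, it is approximately $t(df = N - 1)$
(3) In general, use the Welch approximation $T \sim t(df)$,
where df is estimated from the two sample variances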
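
The formulas on this slide are not preserved in the transcript. A minimal sketch, assuming the slide uses the usual unequal-variance statistic and the Welch-Satterthwaite degrees of freedom; the function name is hypothetical and the data are the row-7 values from the Wilcoxon slide.

```python
import math
from statistics import mean, variance

def welch_t_statistic(x, y):
    """Unequal-variance t statistic with Welch-Satterthwaite df."""
    n, m = len(x), len(y)
    vx = variance(x) / n  # estimated variance of the mean of group 1
    vy = variance(y) / m  # estimated variance of the mean of group 2
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (n - 1) + vy ** 2 / (m - 1))
    return t, df

normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

t_w, df_w = welch_t_statistic(normal, myeloma)
print(f"t = {t_w:.3f}, df = {df_w:.2f}")
```

Because the normal group is far more variable than the myeloma group for this gene, the estimated df lands close to N - 1 = 4, which matches case (2).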

16
Wilcoxon rank sum test
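  • Consider row 7 of the MM study (5 normal samples, then 9 MM samples)
  • Expression: 16 253 633 1008 708 | 36 72 28 14 33 19 49 58 23
  • Rank:       13   4   3    1   2 |  8  5 10 14  9 12  7  6 11
  • Ranks run from 1 (largest value) to 14 (smallest value)
  • Rank sum of the normal group = 13 + 4 + 3 + 1 + 2 = 23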
  • This test is more appropriate than the t-tests
    when the underlying distribution is far from
    normal. (But it requires large group sizes)
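
The ranking on this slide can be reproduced with a short sketch. It assumes, as the slide's numbers imply, that rank 1 goes to the largest value and that the first 5 values are the normal samples. The helper name is hypothetical, and ties are not handled because row 7 has none.

```python
def descending_ranks(values):
    """Rank values so that the largest gets rank 1 (no tie handling)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

row7 = [16, 253, 633, 1008, 708, 36, 72, 28, 14, 33, 19, 49, 58, 23]
ranks = descending_ranks(row7)
rank_sum_normal = sum(ranks[:5])  # first 5 samples are the normal group

print(ranks)            # [13, 4, 3, 1, 2, 8, 5, 10, 14, 9, 12, 7, 6, 11]
print(rank_sum_normal)  # 23
```

Many references rank in ascending order instead; the test is equivalent either way as long as the null distribution is built with the same convention.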

17
P-value
  • p-value = $P(|T| > |t|)$ is calculated based on the
    distribution of T under the null hypothesis.
  • The p-value is a function of the test statistic and
    can be viewed as a random variable.
  • e.g., p-value = $2(1 - F(|t|))$, where F is the cdf of
    $t(N + M - 2)$.
  • A small p-value represents evidence against the
    null hypothesis ⇒ differentially expressed in our
    case.
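
As a rough illustration of the two-sided formula 2(1 - F(|t|)), here is a standard-library sketch that evaluates the Student t cdf by numerical integration. The function names and the Simpson's-rule approach are choices made for this example, not something the slides specify; in practice a statistics library would normally supply the cdf.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(t, df, steps=2000):
    """CDF via Simpson's rule on [0, |t|], using symmetry about 0."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    upper = 0.5 + total * h / 3
    return upper if t >= 0 else 1 - upper

def two_sided_p_value(t, df):
    """p-value = 2 * (1 - F(|t|)) for a t-distributed statistic."""
    return 2 * (1 - t_cdf(abs(t), df))

# With N = 5 and M = 9 the pooled test has df = N + M - 2 = 12.
print(two_sided_p_value(2.179, 12))  # close to 0.05
```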

18
Permutation test
  • A non-parametric way of computing a p-value for
    any test statistic.
  • In the MM study, each gene has (14 choose 5) =
    2002 different test values obtainable from
    permuting the group labels.
  • Under the null hypothesis that the distributions
    for the two groups are identical, all these test
    values are equally probable. What is the
    probability of getting a test value at least as
    extreme as the observed one? This is the
    permutation p-value.
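
A minimal sketch of the exact permutation p-value, enumerating every way of assigning 5 of the 14 labels to group 1. The test statistic used here, the absolute difference in group means, is an assumption chosen for illustration; the slides allow any test statistic. The function name is hypothetical and the data are the row-7 values from the Wilcoxon slide.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(x, y):
    """Exact two-sided permutation p-value for the difference in means.

    Returns (p_value, number_of_relabelings).
    """
    pooled = list(x) + list(y)
    n, total = len(x), sum(pooled)
    m = len(pooled) - n
    observed = abs(mean(x) - mean(y))
    n_perm = 0
    n_extreme = 0
    for idx in combinations(range(len(pooled)), n):
        group1_sum = sum(pooled[i] for i in idx)
        stat = abs(group1_sum / n - (total - group1_sum) / m)
        n_perm += 1
        # Small tolerance so the observed labeling always counts as extreme.
        if stat >= observed - 1e-9:
            n_extreme += 1
    return n_extreme / n_perm, n_perm

normal = [16, 253, 633, 1008, 708]
myeloma = [36, 72, 28, 14, 33, 19, 49, 58, 23]

p_value, n_perm = permutation_p_value(normal, myeloma)
print(n_perm)   # 2002 relabelings, i.e. (14 choose 5)
print(p_value)
```

Since the observed labeling is one of the 2002 relabelings, the smallest p-value this test can report for the MM study is 1/2002.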

19
Permutation technique
Compute TS0
Compute TS1
Compute TS2
Compute TS3
The set of TSi forms the empirical distribution of
the test statistic TS
20
Scientific Significance
  • Fold change (FC)
  • May not be high when statistical significance is
    high.
  • Not an appropriate measure if the dispersion is
    not taken into consideration.

21
Conservative fold change
Conservative fold change (CFC) = max(25th percentile of sample 1
/ 75th percentile of sample 2, 25th
percentile of sample 2 / 75th percentile of
sample 1)
22
Sample 1 ~ Normal(100, 1), Sample 2 ~ Normal(103, 1):
CFC = 1.0164
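
The CFC value on this slide can be checked with a short sketch that uses the theoretical quartiles of the two normal distributions. The function name is hypothetical, and the quartiles come from Python's `statistics.NormalDist` rather than from simulated samples.

```python
from statistics import NormalDist

def conservative_fold_change(q25_1, q75_1, q25_2, q75_2):
    """CFC = max(Q25 of sample 1 / Q75 of sample 2,
                 Q25 of sample 2 / Q75 of sample 1)."""
    return max(q25_1 / q75_2, q25_2 / q75_1)

sample1 = NormalDist(mu=100, sigma=1)
sample2 = NormalDist(mu=103, sigma=1)

cfc = conservative_fold_change(
    sample1.inv_cdf(0.25), sample1.inv_cdf(0.75),
    sample2.inv_cdf(0.25), sample2.inv_cdf(0.75),
)
print(round(cfc, 4))  # 1.0164
```

The two groups are clearly separated, about three standard deviations apart, yet the CFC is barely above 1, which is the slide's point that statistical significance does not imply a large fold change.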
23
CFC = 2.89
CFC = 3.53
CFC = 1.45
CFC = 1.07
24
P-values and FC contain different information
25
Gene Selection and Ranking
  • A high threshold of statistical significance ⇒
    select genes with p-values smaller than a
    threshold
  • The selected genes are ordered according to their
    scientific significance (i.e. ranked by
    fold-changes)

26
The False Positive Rate (FPR)
  • If we select genes with p-value < 0.01, then the
    probability of making a positive call when the
    gene is in fact not differential is less than
    0.01. Thus selection by p-value controls the FPR.
  • However, if we have 12,000 genes in a microarray,
    then an FPR of 0.01 still allows roughly 120 expected
    false positives (12,000 × 0.01). To make sensible
    decisions, we must take multiple comparisons into
    consideration.
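
The arithmetic behind the 120 figure, as a small sketch. It assumes the worst case in which none of the 12,000 genes is differentially expressed; the spread estimate additionally assumes the tests are independent, which real microarray genes generally are not. Function names are hypothetical.

```python
import math

def expected_false_positives(n_null_genes, fpr):
    """Expected number of false positive calls among non-differential genes."""
    return n_null_genes * fpr

def false_positive_sd(n_null_genes, fpr):
    """Binomial standard deviation; assumes independent tests."""
    return math.sqrt(n_null_genes * fpr * (1 - fpr))

print(expected_false_positives(12_000, 0.01))     # about 120
print(round(false_positive_sd(12_000, 0.01), 1))  # about 10.9
```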

27
Dealing with Multiple Comparison
  • Bonferroni inequality: To control the family-wise
    error rate for testing m hypotheses at level α,
    we need to control the FPR for each individual
    test at α/m
  • Then P(false rejection of at least one hypothesis)
    ≤ α
  • or P(no false rejection) ≥ 1 − α
  • This is appropriate for some applications (e.g.
    testing a new drug versus several existing ones),
    but is too conservative for our task of gene
    selection.
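
A minimal sketch of Bonferroni selection: each gene is called only if its p-value falls below α/m. The function names and the example p-values are hypothetical and serve only to show the thresholding.

```python
def bonferroni_threshold(alpha, m):
    """Per-test level that keeps the family-wise error rate at most alpha."""
    return alpha / m

def bonferroni_select(p_values, alpha=0.05):
    """Indices of hypotheses rejected at per-test level alpha / m."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]

# Hypothetical p-values for four genes; the threshold is 0.05 / 4 = 0.0125.
print(bonferroni_select([0.001, 0.02, 0.04, 0.0001]))  # [0, 3]

# With 12,000 genes the per-gene threshold becomes very small.
print(bonferroni_threshold(0.05, 12_000))
```

At 12,000 genes the per-gene threshold is about 4.2e-6, which illustrates why the slide calls the procedure too conservative for gene selection.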