Identification of Differential Genes

Slides: 22
Provided by: YossiS7
Transcript and Presenter's Notes

1
Identification of Differential Genes
2
Identification of differential genes
  • The most basic experimental design: a comparison
    between 2 conditions (treatment vs. control)
  • The goal: to identify genes that are
    differentially expressed between the examined
    conditions
  • Number of replicates is usually low (n = 2-4)

3
Approaches for identification of differential
genes
  • Fold Change
  • T-test
  • Cyber-T
  • SAM

4
1. Fold Change
  • Consider genes whose mean expression level
    changed by at least 1.75- to 2-fold as
    differential genes
  • Limits:
  • Usually no estimate of the false positive rate
    is provided
  • Biased toward genes with low expression levels
  • Ignores the variability of gene levels across
    replicates
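The fold-change rule above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the gene names, the expression values and the 2-fold threshold are all hypothetical. Note that geneC passes the filter even though its replicates are very noisy, which is the last limit listed above.

```python
from statistics import mean

def fold_change(control, treatment):
    """Ratio of the mean treatment level to the mean control level."""
    return mean(treatment) / mean(control)

def is_differential(control, treatment, threshold=2.0):
    """True if the mean level changed by at least `threshold`-fold, up or down."""
    fc = fold_change(control, treatment)
    return fc >= threshold or fc <= 1.0 / threshold

# Hypothetical expression levels (linear scale), 3 replicates per condition.
genes = {
    "geneA": ([100.0, 110.0, 90.0], [210.0, 190.0, 230.0]),   # consistent increase
    "geneB": ([500.0, 480.0, 520.0], [510.0, 490.0, 530.0]),  # essentially unchanged
    "geneC": ([20.0, 60.0, 10.0], [5.0, 15.0, 10.0]),         # noisy replicates
}

selected = [name for name, (c, t) in genes.items() if is_differential(c, t)]
print(selected)
```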

5
Fold Change limit: biased to low expression
levels
[Figure: plot with the values 30, 70 and 110 marked]
Determine a floor cut-off according to an estimate
of the background level, and set all expression
levels below it to this floor level
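A small sketch of the flooring step, assuming a hypothetical background estimate of 30 and made-up expression values. It shows how a gene whose levels sit near background stops looking like a 3-fold change once the floor is applied.

```python
from statistics import mean

def apply_floor(levels, floor):
    """Raise every expression level below `floor` up to `floor`."""
    return [max(x, floor) for x in levels]

def floored_fold_change(control, treatment, floor):
    """Fold change computed after flooring both conditions."""
    c = apply_floor(control, floor)
    t = apply_floor(treatment, floor)
    return mean(t) / mean(c)

# Hypothetical low-expression gene; the floor value is an assumption.
control = [10.0, 12.0, 8.0]
treatment = [30.0, 25.0, 35.0]

raw_fc = mean(treatment) / mean(control)
floored_fc = floored_fold_change(control, treatment, floor=30.0)
print(raw_fc, floored_fc)
```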
6
Fold Change limit: ignores variability over
replicates
  • Seek a score that penalizes genes with high
    variability across replicates

7
Approaches for identification of differential
genes
  • Fold Change
  • T-test
  • Cyber-T
  • SAM

8
2. T-test
  • Compute a t-score for each gene

m_c, m_t: mean levels in Control and Treatment
S_c^2, S_t^2: variance estimates in Control and
Treatment
n_c, n_t: number of replicates in Control and
Treatment
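The slide's formula did not survive in this transcript. One common t-score that uses exactly these quantities is the unequal-variance (Welch) form, t = (m_t - m_c) / sqrt(S_c^2/n_c + S_t^2/n_t). The sketch below assumes that form and a treatment-minus-control sign convention, and the input values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def t_score(control, treatment):
    """Welch-style t-score: difference of means over its standard error."""
    m_c, m_t = mean(control), mean(treatment)
    s2_c, s2_t = variance(control), variance(treatment)  # sample variances
    n_c, n_t = len(control), len(treatment)
    return (m_t - m_c) / sqrt(s2_c / n_c + s2_t / n_t)

# Hypothetical log-expression values, 3 replicates per condition.
t = t_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
print(round(t, 3))
```

Unlike the fold change, this score shrinks when the replicates of either condition disagree with each other.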
9
T-test
  • The t-score can be associated with statistical
    significance (p-value) under the assumption that
    expression levels follow a normal distribution
  • Log-transformation
  • Set a cut-off for the p-value (α = 0.01)
  • Consider all genes with p-value < α as
    differential genes

10
Multiple Testing
  • The p-value of gene g, associated with its
    t-score Tg, is the probability of obtaining by
    chance a t-score at least as extreme as Tg
  • Multiplicity problem: thousands of genes are
    tested simultaneously
  • e.g., suppose:
  • 10,000 genes on a chip
  • not a single one is differentially expressed
  • α = 0.01
  • 10,000 × 0.01 = 100 genes are expected to have a
    p-value < 0.01 just by chance
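The arithmetic above can be checked with a short simulation. It assumes that p-values are uniformly distributed on [0, 1] when no gene is differential. The function names and the seed are illustrative only.

```python
import random

def expected_false_positives(n_genes, alpha):
    """Expected number of null genes with p-value below alpha."""
    return n_genes * alpha

def simulate_null_hits(n_genes, alpha, seed=0):
    """Draw one uniform p-value per gene and count those below alpha."""
    rng = random.Random(seed)
    return sum(rng.random() < alpha for _ in range(n_genes))

expected = expected_false_positives(10_000, 0.01)
observed = simulate_null_hits(10_000, 0.01)
print(expected, observed)  # observed varies around the expected count
```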

11
Multiple testing
  • Individual p-values of, e.g., 0.01 no longer
    correspond to significant findings.
  • Need to adjust for multiple testing when
    assessing the statistical significance of findings

12
Multiple Testing: Bonferroni correction
  • Consider as differential genes only those with
    p-value < α/N
  • N: number of tests
  • α = 0.01, N = 10,000 → cut-off = 0.000001
  • Ensures a very low probability (less than α) of
    having any false positive genes
  • Advantage: a very clean list of differential
    genes
  • Limit: the list usually contains very few genes,
    giving an unacceptably high rate of false
    negatives
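A minimal sketch of the Bonferroni rule, using a hypothetical dictionary of per-gene p-values. The helper names are illustrative.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cut-off that keeps the family-wise error below alpha."""
    return alpha / n_tests

def bonferroni_select(p_values, alpha=0.01):
    """Return the genes whose p-value is below alpha / N."""
    cutoff = bonferroni_cutoff(alpha, len(p_values))
    return [gene for gene, p in p_values.items() if p < cutoff]

# Hypothetical p-values for 4 genes: the cut-off is 0.01 / 4 = 0.0025.
p_values = {"g1": 0.0001, "g2": 0.004, "g3": 0.03, "g4": 0.5}
print(bonferroni_select(p_values))
print(bonferroni_cutoff(0.01, 10_000))  # the slide's example
```

Gene g2 has a small p-value (0.004) but is still rejected by the cut-off, which illustrates the false-negative cost noted above.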

13
Multiple Testing: FDR correction (Benjamini-
Hochberg)
  • False Discovery Rate
  • In high-throughput studies a certain proportion
    of false positives is tolerable
  • Control the expected proportion of false
    positives among the genes identified as
    differential (q = 10%)
  • Scheme:
  • Rank genes according to their p-values:
    p(1) < p(2) < ... < p(N)
  • Consider as differential genes the top k that
    satisfy
  • p(i) < i · (q/N), 1 ≤ i ≤ k
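The ranking scheme can be sketched as follows. This uses the usual Benjamini-Hochberg step-up formulation: find the largest rank k whose p-value is at most k · q/N, then report the top k genes. That is an assumption about how the slide's condition is meant to be applied. The p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return the genes selected by the Benjamini-Hochberg step-up rule."""
    n = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    k = 0
    for i, (_, p) in enumerate(ranked, start=1):
        if p <= i * q / n:
            k = i  # remember the largest rank that passes its threshold
    return [gene for gene, _ in ranked[:k]]

# Hypothetical p-values; thresholds are 0.02, 0.04, 0.06, 0.08, 0.10.
p_values = {"g1": 0.001, "g2": 0.02, "g3": 0.04, "g4": 0.3, "g5": 0.8}
print(benjamini_hochberg(p_values, q=0.10))
```

On the same input, a Bonferroni cut-off of 0.10/5 = 0.02 would keep only g1, so the FDR rule yields a longer list at the price of tolerating some false positives.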

14
Approaches for identification of differential
genes
  • Fold Change
  • T-test
  • Cyber-T
  • SAM

15
3. Cyber-T (Baldi & Long)
  • Regularized t-test
  • Problem: low number of replicates → unstable
    estimates of gene variances
  • Found that in microarray datasets, after
    log-transformation, the variance is dependent on
    the expression level
  • Lower expression level → larger variance

16
  • Utilize this rule to improve the estimation of
    gene variances
  • Lower expression level → larger variance

In the t-test, use σ² in place of s²
s²: the gene's variance over the replicates
n: number of replicates
σ0²: expected variance given the expression level
of the gene
ν0: weight of σ0²
σ0² is estimated over a window of size 101 genes
ν0 + n = 10
[Figure: plot with x-axis Log (expression)]
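The slide's formula is missing from this transcript. The sketch below assumes the regularized variance published for Cyber-T by Baldi and Long, σ² = (ν0·σ0² + (n - 1)·s²) / (ν0 + n - 2), and estimates the background variance σ0² as the plain average of the variances of the neighbouring genes in a window ranked by expression level. The averaging rule, the function names and the example values are assumptions.

```python
def regularized_variance(s2, n, sigma0_sq, nu0):
    """Weighted combination of the gene's own variance and the background variance."""
    return (nu0 * sigma0_sq + (n - 1) * s2) / (nu0 + n - 2)

def background_variances(mean_levels, variances, window=101):
    """For each gene, average the variances of genes with similar mean expression."""
    order = sorted(range(len(mean_levels)), key=lambda i: mean_levels[i])
    half = window // 2
    result = [0.0] * len(order)
    for rank, idx in enumerate(order):
        lo = max(0, rank - half)                # window is truncated at the edges
        hi = min(len(order), rank + half + 1)
        neighbours = [variances[order[j]] for j in range(lo, hi)]
        result[idx] = sum(neighbours) / len(neighbours)
    return result

# With n = 3 replicates, nu0 + n = 10 gives nu0 = 7.
n = 3
nu0 = 10 - n
sigma_sq = regularized_variance(s2=0.5, n=n, sigma0_sq=0.2, nu0=nu0)
print(sigma_sq)
```

With few replicates the background term dominates, so a single gene's unstable variance estimate is pulled toward that of genes with similar expression.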
17
Stabilization of the variance estimation
[Figure: two panels plotting the variance estimate
against Log (expression)]
18
Cyber-T
  • Regularized t-test performs better than the
    conventional t-test when the number of replicates
    is low (2-3)
  • With 5 replicates per condition (5 control,
    5 treatment), the two tests performed similarly

19
Approaches for identification of differential
genes
  • Fold Change
  • T-test
  • Cyber-T
  • SAM

20
4. SAM (Tusher, Tibshirani & Chu)
  • Significance Analysis of Microarrays
  • Limit of the analytical FDR approach: it assumes
    that the tests are independent
  • However, in the microarray context, the
    expression levels of some genes are highly
    correlated → unreliable FDR estimate
  • SAM uses permutations to get an estimate for the
    FDR of the reported differential genes

21
SAM
  • Scheme
  • Compute for each gene a statistic that measures
    its relative expression difference in control vs
    treatment (t-score or a variant)
  • Rank the genes according to their difference
    score
  • Set a cut-off (d0) and consider all genes scoring
    above it as differential (Nd)
  • Permute the condition labels, and count how many
    genes get a score above d0 (Np)
  • Repeat over all possible permutations and count
    for each permutation j (Npj)
  • Estimate the FDR as the proportion <Npj>/Nd,
    where <Npj> is the average of Npj over the
    permutations
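The scheme above can be sketched with the standard library. This is an illustration under stated assumptions, not the SAM software. The score is an absolute t-like statistic with a small constant `s0` added to the denominator (the value 0.1 is arbitrary). All distinct label assignments are enumerated, including the original one. The data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def score(control, treatment, s0=0.1):
    """Absolute t-like score; s0 guards against very small standard errors."""
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return abs(mean(treatment) - mean(control)) / (se + s0)

def count_above(data, treat_idx, d0):
    """Number of genes scoring above d0 when samples in treat_idx are 'treatment'."""
    n_samples = len(next(iter(data.values())))
    count = 0
    for values in data.values():
        treatment = [values[i] for i in treat_idx]
        control = [values[i] for i in range(n_samples) if i not in treat_idx]
        if score(control, treatment) > d0:
            count += 1
    return count

def permutation_fdr(data, treat_idx, d0):
    """Return (Nd, estimated FDR = average Npj / Nd) over all label assignments."""
    n_samples = len(next(iter(data.values())))
    n_d = count_above(data, treat_idx, d0)
    perm_counts = [count_above(data, set(idx), d0)
                   for idx in combinations(range(n_samples), len(treat_idx))]
    fdr = mean(perm_counts) / n_d if n_d else float("nan")
    return n_d, fdr

# Hypothetical log-expression data: samples 0-2 are control, 3-5 are treatment.
data = {
    "g1": [1.0, 1.2, 0.9, 5.0, 5.3, 4.8],
    "g2": [2.0, 2.5, 1.8, 2.1, 2.4, 1.9],
    "g3": [3.0, 3.1, 2.9, 3.2, 2.8, 3.05],
}
n_d, fdr = permutation_fdr(data, treat_idx={3, 4, 5}, d0=5.0)
print(n_d, fdr)
```

With 3 replicates per condition there are only 20 label assignments, so the FDR estimate is coarse. Because genes are scored together within each permuted dataset, correlations between genes are preserved in the null counts.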

22
Permutation on condition labels