Identification of Differential Genes presentation

1
Identification of Differential Genes
2
Identification of differential genes

The most basic experimental design: a comparison
between 2 conditions, treatment vs. control
The goal: to identify genes that are
differentially expressed in the examined
conditions
The number of replicates is usually low (n = 2-4)

3
Approaches for identification of differential
genes

Fold Change
T-test
Cyber-T
SAM

4
1. Fold Change

Consider genes whose mean expression level
changed by at least 1.75- to 2-fold as differential
genes
Limits:
Usually no estimate of the false positive rate is
provided
Biased toward genes with low expression levels
Ignores the variability of gene levels over
replicates
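
As a rough illustration of the fold-change rule above, the sketch below flags a gene when its mean level changes by at least a chosen factor in either direction. The function names, the default threshold of 2.0 and the sample values are assumptions made for this example, not taken from the slides.

```python
def fold_change(control, treatment):
    """Ratio of the mean treatment level to the mean control level."""
    mean_c = sum(control) / len(control)
    mean_t = sum(treatment) / len(treatment)
    return mean_t / mean_c


def is_differential(control, treatment, threshold=2.0):
    """True if the mean level changed by at least `threshold`-fold, up or down."""
    fc = fold_change(control, treatment)
    return fc >= threshold or fc <= 1.0 / threshold


# Hypothetical expression levels for two genes, 3 replicates per condition.
print(is_differential([100, 120, 110], [230, 250, 240]))  # mean 110 vs 240
print(is_differential([100, 120, 110], [130, 125, 140]))  # mean 110 vs ~132
```

Note that the rule looks only at the two means, which is why it cannot account for variability over replicates.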

5
Fold Change limit: biased to low expression
levels
(Figure; labels shown: 30, 70, 110)
Determine a floor cut-off according to an estimate
of the background level, and set all expression levels
below it to this floor level
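
A minimal sketch of the flooring step, assuming a hypothetical floor of 50 and made-up expression values: a gene whose levels all sit within the background range shows a large raw fold change, which disappears once the floor is applied.

```python
def apply_floor(levels, floor):
    """Raise every expression level below `floor` up to `floor`."""
    return [max(x, floor) for x in levels]


def mean_ratio(control, treatment):
    """Fold change computed from the two condition means."""
    return (sum(treatment) / len(treatment)) / (sum(control) / len(control))


# Hypothetical low-expression gene; all values lie below the assumed floor.
control = [10, 15, 12]
treatment = [30, 28, 35]
floor = 50  # assumed background estimate, chosen only for this example

raw_ratio = mean_ratio(control, treatment)
floored_ratio = mean_ratio(apply_floor(control, floor),
                           apply_floor(treatment, floor))
print(raw_ratio, floored_ratio)
```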
6
Fold Change limit: ignores variability over
replicates

Seek a score that penalizes genes with high
variability over replicates

7
Approaches for identification of differential
genes

Fold Change
T-test
Cyber-T
SAM

8
2. T-test

Compute a t-score for each gene

mc, mt: mean levels in Control and Treatment
Sc^2, St^2: variance estimates in Control and Treatment
nc, nt: number of replicates in Control and Treatment
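
The t-score formula itself does not appear in the transcript. The sketch below assumes the unequal-variance form, T = (mt - mc) / sqrt(Sc^2/nc + St^2/nt), which is consistent with the quantities defined above; the slide may have used a pooled-variance form instead. The function name and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def t_score(control, treatment):
    """Unequal-variance t-score (an assumed form, see the note above)."""
    mc, mt = mean(control), mean(treatment)
    sc2, st2 = variance(control), variance(treatment)  # sample variances
    nc, nt = len(control), len(treatment)
    return (mt - mc) / sqrt(sc2 / nc + st2 / nt)


# Two hypothetical genes with the same mean difference (3.0) but
# different variability over replicates.
stable = t_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
noisy = t_score([0.0, 2.0, 4.0], [1.0, 5.0, 9.0])
print(stable, noisy)
```

Unlike the fold change, the score shrinks as the replicate variance grows, so the noisier gene receives the smaller t-score.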
9
T-test

The t-score can be associated with statistical
significance (p-value) under the assumption that
expression levels follow a normal distribution
Log-transformation
Set a cut-off for the p-value (α = 0.01)
Consider all genes with p-value < α as
differential genes

10
Multiple Testing

P-val_g, associated with the t-score T_g, is the
probability of obtaining by chance a t-score
that is at least as extreme as T_g.
Multiplicity problem: thousands of genes are
tested simultaneously.
E.g., suppose:
10,000 genes on a chip
not a single one is differentially expressed
α = 0.01
10,000 × 0.01 = 100 genes are expected to have a
p-value < 0.01 just by chance.
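
The arithmetic above can be checked with a small simulation. It relies on the standard fact that p-values are uniformly distributed on [0, 1] when no gene is differentially expressed; the function name and the random seed are arbitrary choices for this sketch.

```python
import random


def count_false_positives(n_genes=10_000, alpha=0.01, seed=0):
    """Draw one null p-value per gene and count those below alpha."""
    rng = random.Random(seed)
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)


expected = 10_000 * 0.01          # the slide's expected count
observed = count_false_positives()
print(expected, observed)         # observed should land near expected
```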

11
Multiple testing

Individual p-values of e.g. 0.01 no longer
correspond to significant findings.
Need to adjust for multiple testing when
assessing the statistical significance of findings

12
Multiple Testing: Bonferroni correction

Consider as differential genes only those with
p-value < (α/N)
N: number of tests
α = 0.01, N = 10,000: cut-off = 0.000001
Ensures a very low probability of having any false
positive genes (less than α)
Advantage: very clean list of differential genes
Limit: the list usually contains very few genes;
unacceptably high rate of false negatives
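
A minimal sketch of the Bonferroni rule, assuming the p-values are already available as a plain list; the function name and the sample p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.01):
    """Return the indices of tests whose p-value is below alpha / N."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]


# Four hypothetical tests: the cut-off is 0.01 / 4 = 0.0025.
print(bonferroni_select([1e-7, 0.0005, 0.02, 0.5]))
```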

13
Multiple Testing: FDR correction (Benjamini-Hochberg)

False Discovery Rate
In high-throughput studies a certain proportion of
false positives is tolerable
Control the expected proportion of false
positives among the genes identified as
differential (e.g. q = 10%).
Scheme:
Rank genes according to their p-values:
p(1) < p(2) < ... < p(N)
Consider as differential genes the top k that
satisfy
p(i) < i(q/N), 1 ≤ i ≤ k
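
The ranking scheme can be sketched as follows. This follows the Benjamini-Hochberg step-up procedure as it is usually stated: find the largest rank k whose p-value is at most k(q/N), then report the k smallest p-values, even if some of them miss their own rank's threshold. That is a slightly more permissive reading than requiring every rank up to k to pass. The function name and the sample p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return indices of the tests selected at FDR level q (step-up rule)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # ranks by p-value
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k = rank  # remember the largest rank that passes
    return sorted(order[:k])


# Ten hypothetical p-values; ranks 3 and 4 miss their own thresholds
# (0.03 and 0.04) but are still selected because rank 5 passes (0.05).
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p, q=0.10))
```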

14
Approaches for identification of differential
genes

Fold Change
T-test
Cyber-T
SAM

15
3. Cyber-T (Baldi & Long)

Regularized t-test
Problem: low number of replicates → unstable
estimates of gene variances
Found that in microarray datasets, after
log-transformation, the variance is dependent on
the expression level
Lower expression level → larger variance

Utilize this rule to improve the estimation of
gene variances

Lower expression level → larger variance

In t-test, use σ^2 in place of s^2
s^2: gene's variance over the replicates
n: number of replicates
σ0^2: expected variance given the expression level of the gene
ν0: weight of σ0
σ0^2 estimated over a window of size 101 genes
ν0 + n = 10
(Figure axis: Log (expression))
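
The combining formula is not preserved in the transcript. The sketch below assumes the form usually given for the Baldi and Long regularized variance, (ν0·σ0^2 + (n - 1)·s^2) / (ν0 + n - 2), where s^2 is the gene's sample variance, σ0^2 is the background variance and ν0 is its weight. Here σ0^2 is taken as the average sample variance of the genes closest in mean expression (a window of 101 genes, with ν0 + n = 10 by default). The function and parameter names are hypothetical, and the handling of window edges is a simplification.

```python
from statistics import mean, variance


def regularized_variances(genes, window=101, total=10):
    """genes: list of per-gene replicate lists (one condition, log scale).

    Returns one regularized variance per gene, under the assumed formula.
    """
    n = len(genes[0])
    nu0 = total - n                      # weight of the background variance
    means = [mean(g) for g in genes]
    s2 = [variance(g) for g in genes]    # per-gene sample variances
    order = sorted(range(len(genes)), key=lambda g: means[g])
    half = window // 2
    result = [0.0] * len(genes)
    for pos, g in enumerate(order):
        # Neighbours in expression rank; the window is clipped at the ends.
        lo, hi = max(0, pos - half), min(len(order), pos + half + 1)
        sigma0_2 = mean(s2[j] for j in order[lo:hi])
        result[g] = (nu0 * sigma0_2 + (n - 1) * s2[g]) / (nu0 + n - 2)
    return result


# Tiny hypothetical data set: 3 genes, 2 replicates, window of 3 genes.
reg = regularized_variances([[1.0, 1.0], [2.0, 4.0], [5.0, 5.0]], window=3)
print(reg)
```

In this toy input the first and third genes have zero sample variance, which would make a conventional t-score undefined; the regularized estimate borrows variance from their neighbours instead.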
17
Stabilization of the variance estimation
(Figure: two panels of variance estimates plotted against Log (expression))
18
Cyber-T

The regularized t-test performs better than the
conventional t-test when the number of replicates
is low (2-3)
With 5 replicates (5 control, 5 treatment) the
performances were similar

19
Approaches for identification of differential
genes

Fold Change
T-test
Cyber-T
SAM

20
4. SAM (Tusher, Tibshirani & Chu)

Significance Analysis of Microarrays
Limit of the analytical FDR approach: it assumes that
the tests are independent
However, in the microarray context, the expression
levels of some genes are highly correlated →
unreliable FDR estimate
SAM uses permutations to get an estimate for the
FDR of the reported differential genes

21
SAM

Scheme:
Compute for each gene a statistic that measures
its relative expression difference in control vs
treatment (t-score or a variant)
Rank the genes according to their difference
score
Set a cut-off (d0) and consider all genes above
it as differential (Nd)
Permute the condition labels, and count how many
genes get a score above d0 (Np)
Repeat on all possible permutations and count
(Npj)
Estimate the FDR as the proportion <Npj>/Nd, where
<Npj> is the average of Npj over the permutations
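
The scheme above can be sketched with the standard library alone. This is not the SAM software: the score is a plain t-like statistic with a small constant `s0` added to the denominator (an arbitrary value here), and every assignment of samples to the control group, including the observed one, is treated as a permutation. All names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def score(control, treatment, s0=0.1):
    """Absolute t-like difference score; s0 avoids division by zero."""
    s = sqrt(variance(control) / len(control)
             + variance(treatment) / len(treatment))
    return abs(mean(treatment) - mean(control)) / (s + s0)


def count_above(data, control_idx, d0):
    """Number of genes scoring above d0 for one labelling of the samples."""
    n_samples = len(data[0])
    chosen = set(control_idx)
    count = 0
    for gene in data:
        c = [gene[i] for i in control_idx]
        t = [gene[i] for i in range(n_samples) if i not in chosen]
        if score(c, t) > d0:
            count += 1
    return count


def permutation_fdr(data, n_control, d0):
    """data: genes x samples, with the first n_control columns as control."""
    n_samples = len(data[0])
    nd = count_above(data, tuple(range(n_control)), d0)        # Nd
    np_counts = [count_above(data, idx, d0)                    # Npj
                 for idx in combinations(range(n_samples), n_control)]
    fdr = mean(np_counts) / nd if nd else 0.0
    return fdr, nd


# Hypothetical data: 3 control and 3 treatment samples per gene.
data = [
    [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],    # clearly shifted between conditions
    [2.0, 2.2, 1.8, 2.1, 1.9, 2.0],    # no shift
    [3.0, 3.1, 2.9, 3.05, 2.95, 3.0],  # no shift
]
fdr, nd = permutation_fdr(data, n_control=3, d0=5.0)
print(fdr, nd)
```

With so few replicates there are only 20 distinct labellings, so the estimate is coarse: the observed labelling and its mirror image always reproduce the observed scores.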

22
Permutation on condition labels