Title: Identification of Differential Genes
Identification of differential genes
- The most basic experimental design: a comparison between 2 conditions (treatment vs control)
- The goal: to identify genes that are differentially expressed between the examined conditions
- The number of replicates is usually low (n = 2-4)
Approaches for identification of differential genes
- Fold Change
- T-test
- Cyber-T
- SAM
1. Fold Change
- Consider genes whose mean expression level changed by at least 1.75- to 2-fold as differential genes
- Limits
  - Usually no estimate of the false positive rate is provided
  - Biased toward genes with low expression levels
  - Ignores the variability of gene levels over replicates
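A minimal sketch of the fold-change criterion, assuming raw (non-log) expression levels and a 2-fold threshold applied in both directions. The gene names, values and the helper names `fold_change` and `is_differential` are hypothetical.

```python
from statistics import mean

def fold_change(control, treatment):
    """Ratio of the mean treatment level to the mean control level."""
    return mean(treatment) / mean(control)

def is_differential(control, treatment, threshold=2.0):
    """Flag a gene whose mean level changed at least `threshold`-fold, up or down."""
    fc = fold_change(control, treatment)
    return fc >= threshold or fc <= 1.0 / threshold

# Hypothetical expression levels: (control replicates, treatment replicates).
genes = {
    "geneA": ([100.0, 120.0, 110.0], [230.0, 250.0, 240.0]),  # up-regulated
    "geneB": ([500.0, 520.0, 480.0], [510.0, 490.0, 530.0]),  # unchanged
    "geneC": ([400.0, 420.0, 380.0], [150.0, 170.0, 160.0]),  # down-regulated
}
calls = {name: is_differential(c, t) for name, (c, t) in genes.items()}
```

Note that only the means enter the decision, so the spread of the replicates has no effect on the call.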
Fold Change limit: biased toward low expression levels
[Figure omitted; the values 30, 70 and 110 appear on the original slide]
- Determine a floor cut-off according to an estimate of the background level, and set all expression levels below it to this floor level
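The effect of the floor can be illustrated with a small sketch. The background estimate of 50 and the expression values are made-up numbers, and `apply_floor` is a hypothetical helper name.

```python
from statistics import mean

def apply_floor(levels, floor):
    """Raise every expression level below `floor` up to `floor`."""
    return [max(x, floor) for x in levels]

# Hypothetical gene expressed near background in both conditions.
control = [10.0, 15.0, 12.0]
treatment = [30.0, 35.0, 28.0]
floor = 50.0  # assumed estimate of the background level

raw_fc = mean(treatment) / mean(control)
floored_fc = mean(apply_floor(treatment, floor)) / mean(apply_floor(control, floor))
```

Before flooring, this gene passes a 2-fold threshold even though both conditions sit below the assumed background; after flooring, its fold change collapses to 1.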
Fold Change limit: ignores variability over replicates
- Seek a score that penalizes genes with high variability over replicates
2. T-test
- Compute a t-score for each gene:
  $$T = \frac{m_t - m_c}{\sqrt{\dfrac{S_c^2}{n_c} + \dfrac{S_t^2}{n_t}}}$$
  - $m_c$, $m_t$: mean levels in Control and Treatment
  - $S_c^2$, $S_t^2$: variance estimates in Control and Treatment
  - $n_c$, $n_t$: number of replicates in Control and Treatment
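A per-gene t-score built from these quantities can be sketched as follows. The sketch assumes the unequal-variance (Welch) form, with the sign taken as treatment minus control; the function name `t_score` and the example values are illustrative only.

```python
from math import sqrt
from statistics import mean, variance

def t_score(control, treatment):
    """t-score from group means, sample variances and replicate counts."""
    m_c, m_t = mean(control), mean(treatment)
    s2_c, s2_t = variance(control), variance(treatment)  # sample variances (n - 1)
    n_c, n_t = len(control), len(treatment)
    return (m_t - m_c) / sqrt(s2_c / n_c + s2_t / n_t)

# Hypothetical log-expression values for one gene, 3 replicates per condition.
t = t_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

Unlike the fold change, the same mean difference yields a smaller score when the replicates disagree more.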
T-test
- The t-score can be associated with statistical significance (p-value) under the assumption that expression levels follow a normal distribution
  - Log-transformation
- Set a cut-off for the p-value (α = 0.01)
- Consider all genes with p-value < α as differential genes
Multiple Testing
- P-val_g, associated with the t-score T_g, is the probability of obtaining by chance a t-score that is at least as extreme as T_g.
- Multiplicity problem: thousands of genes are tested simultaneously.
- e.g. suppose
  - 10,000 genes on a chip
  - not a single one is differentially expressed
  - α = 0.01
  - 10,000 × 0.01 = 100 genes are expected to have a p-value < 0.01 just by chance.
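The arithmetic above can be checked with a short simulation. It assumes that, when no gene is differentially expressed, p-values are uniformly distributed on [0, 1]; the random seed is arbitrary.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility

N = 10_000    # genes on the chip
alpha = 0.01  # p-value cut-off

# Under the null hypothesis every gene's p-value is a uniform draw.
p_values = [random.random() for _ in range(N)]

expected_false_positives = N * alpha
observed_false_positives = sum(p < alpha for p in p_values)
```

The observed count fluctuates around the expected 100 from run to run, although no gene in the simulation is truly differential.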
Multiple testing
- Individual p-values of e.g. 0.01 no longer correspond to significant findings.
- Need to adjust for multiple testing when assessing the statistical significance of findings
Multiple Testing: Bonferroni correction
- Consider as differential genes only those with p-value < α/N
  - N: number of tests
  - α = 0.01, N = 10,000: cut-off = 0.000001
- Ensures a very low probability of having any false positive genes (less than α)
- Advantage: very clean list of differential genes
- Limit: the list usually contains very few genes; unacceptably high rate of false negatives
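A minimal sketch of the Bonferroni selection rule. The function name `bonferroni_select` and the p-values are hypothetical.

```python
def bonferroni_select(p_values, alpha=0.01):
    """Return indices of genes whose p-value is below alpha / N."""
    cutoff = alpha / len(p_values)
    return [i for i, p in enumerate(p_values) if p < cutoff]

# Hypothetical p-values for 5 genes: the cut-off is 0.01 / 5 = 0.002.
p_values = [1e-7, 0.0005, 0.02, 0.3, 0.004]
selected = bonferroni_select(p_values, alpha=0.01)
```

The gene with p-value 0.004 would pass the uncorrected cut-off of 0.01 but is dropped after the correction, which illustrates how the list shrinks as N grows.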
Multiple Testing: FDR correction (Benjamini-Hochberg)
- False Discovery Rate
- In high-throughput studies a certain proportion of false positives is tolerable
- Control the expected proportion of false positives among the genes identified as differential (e.g. q = 10%).
- Scheme
  - Rank genes according to their p-values: p(1) ≤ p(2) ≤ ... ≤ p(N)
  - Consider as differential genes the top k, where k is the largest i that satisfies p(i) ≤ i(q/N)
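The scheme can be sketched as below, using the usual step-up form of the Benjamini-Hochberg procedure: find the largest rank k whose p-value is at most k·q/N and report the k top-ranked genes. The function name and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Return indices of genes selected at FDR level q (step-up procedure)."""
    n = len(p_values)
    # Gene indices ordered from smallest to largest p-value.
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / n:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])

# Hypothetical p-values for 10 genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.065, 0.074, 0.205, 0.212, 0.216]
selected = benjamini_hochberg(p_values, q=0.10)
```

In this example the genes ranked 3 and 4 miss their own thresholds (0.03 and 0.04) but are still reported, because the gene ranked 5 passes its threshold of 0.05.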
3. Cyber-T (Baldi & Long)
- Regularized t-test
- Problem: low number of replicates → unstable estimates of gene variances
- Found that in microarray datasets, after log-transformation, the variance is dependent on the expression level
  - Lower expression level → larger variance
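- Utilize this rule to improve the estimation of gene variances
  - Lower expression level → larger variance
- In the t-test, use $\sigma^2$ in place of $s^2$:
  $$\sigma^2 = \frac{\nu_0 \sigma_0^2 + (n - 1) s^2}{\nu_0 + n - 2}$$
  - $s^2$: the gene's variance over the replicates
  - $n$: number of replicates
  - $\sigma_0^2$: expected variance given the expression level of the gene
  - $\nu_0$: weight of $\sigma_0^2$
  - $\sigma_0^2$ is estimated over a window of size 101 genes; $\nu_0 + n = 10$
[Figure omitted; x-axis: Log (expression)]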
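A rough sketch of the variance regularization, assuming the weighted-average form σ² = (ν0·σ0² + (n − 1)·s²) / (ν0 + n − 2) and assuming that the expected variance is taken as the plain average of the variances of the genes closest in expression rank. Both function names are hypothetical, and the windowing details may differ from the published method.

```python
from statistics import variance

def background_variances(means, variances, window=101):
    """For each gene, average the variances of the `window` genes nearest in expression rank."""
    order = sorted(range(len(means)), key=lambda i: means[i])
    half = window // 2
    result = [0.0] * len(means)
    for pos, idx in enumerate(order):
        lo = max(0, pos - half)               # window is truncated at the edges
        hi = min(len(order), pos + half + 1)
        neighbours = [variances[order[j]] for j in range(lo, hi)]
        result[idx] = sum(neighbours) / len(neighbours)
    return result

def regularized_variance(replicates, background_var, nu0):
    """Blend the gene's own variance with the expected variance at its expression level."""
    n = len(replicates)
    s2 = variance(replicates)
    return (nu0 * background_var + (n - 1) * s2) / (nu0 + n - 2)

# Hypothetical gene with 2 replicates, so nu0 = 10 - n = 8.
sigma2 = regularized_variance([8.0, 8.1], background_var=0.04, nu0=8)

# Tiny illustration of the windowed estimate with a window of 3 genes.
bg = background_variances([1, 2, 3, 4, 5], [0.5, 0.4, 0.3, 0.2, 0.1], window=3)
```

With only 2 replicates the gene's own variance (0.005) carries little weight, so the regularized value stays close to the expected variance of 0.04.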
Stabilization of the variance estimation
[Figure omitted; two panels of variance estimates plotted against Log (expression)]
Cyber-T
- The regularized t-test performs better than the conventional t-test when the number of replicates is low (2-3)
- By 5 replicates (5 control, 5 treatment) the performances were similar
4. SAM (Tusher, Tibshirani & Chu)
- Significance Analysis of Microarrays
- Limit of the analytical FDR approach: it assumes that the tests are independent
- However, in the microarray context, the expression levels of some genes are highly correlated → unreliable FDR estimate
- SAM uses permutations to get an estimate for the FDR of the reported differential genes
SAM
- Scheme
  - Compute for each gene a statistic that measures its relative expression difference in control vs treatment (t-score or a variant)
  - Rank the genes according to their difference score
  - Set a cut-off (d0) and consider all genes above it as differential (Nd)
  - Permute the condition labels, and count how many genes got a score above d0 (Np)
  - Repeat on all possible permutations and count (Npj)
  - Estimate the FDR as the proportion <Npj>/Nd
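The permutation scheme can be sketched as follows. This is a simplified illustration, not the SAM software itself: it assumes a t-like score with a small constant `s0` added to the denominator, compares absolute scores with d0, and enumerates every assignment of samples to the control group (including the original one). All names and data are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def diff_score(control, treatment, s0=0.1):
    """t-like difference score; s0 (assumed value) guards against tiny variances."""
    se = sqrt(variance(control) / len(control) + variance(treatment) / len(treatment))
    return (mean(treatment) - mean(control)) / (se + s0)

def count_above(data, control_idx, d0):
    """Number of genes with |score| > d0 when `control_idx` are labelled control."""
    ctrl = set(control_idx)
    count = 0
    for row in data:
        c = [x for i, x in enumerate(row) if i in ctrl]
        t = [x for i, x in enumerate(row) if i not in ctrl]
        if abs(diff_score(c, t)) > d0:
            count += 1
    return count

def sam_fdr(data, n_control, d0):
    """Estimate the FDR as the mean permuted count divided by the observed count."""
    n_samples = len(data[0])
    n_d = count_above(data, range(n_control), d0)  # true labels: first columns are control
    perm_counts = [count_above(data, idx, d0)
                   for idx in combinations(range(n_samples), n_control)]
    mean_np = mean(perm_counts)
    return n_d, mean_np, (mean_np / n_d if n_d else 0.0)

# Hypothetical data: 4 genes, columns 0-2 control, columns 3-5 treatment.
data = [
    [1.0, 1.2, 0.9, 3.0, 3.1, 2.9],
    [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    [5.0, 5.2, 4.9, 5.1, 5.0, 5.3],
    [4.0, 4.1, 3.9, 1.0, 1.1, 0.9],
]
n_d, mean_np, fdr = sam_fdr(data, n_control=3, d0=3.0)
```

Because whole samples are relabelled, each permutation keeps the gene-to-gene correlations intact, which is the reason for preferring permutations over an analytical FDR here.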
Permutation on condition labels