Title: Some discussion on multiple hypothesis test and false discovery adjustment
1. Some discussion on multiple hypothesis test and false discovery adjustment
2. P-values under the NULL distribution
- 6000 genes
- 100 samples, 50 in each group
- All values drawn from N(0,1)
- P-values follow a uniform distribution on [0, 1]
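A minimal simulation sketch of this null setting, in Python. The slide does not say which test produced the P-values; as an assumption, this sketch uses a two-sample z-test with known unit variance so that only the standard library is needed. The names null_pvalues, n_genes and n_per_group are illustrative.

```python
import math
import random
from statistics import NormalDist


def null_pvalues(n_genes=6000, n_per_group=50, seed=0):
    """Two-sided z-test P-values when both groups are pure N(0,1) noise."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the difference of two group means (variance assumed known, = 1).
    se = math.sqrt(2.0 / n_per_group)
    pvalues = []
    for _ in range(n_genes):
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        z = (sum(group_a) / n_per_group - sum(group_b) / n_per_group) / se
        pvalues.append(2.0 * (1.0 - std_normal.cdf(abs(z))))
    return pvalues


pvalues = null_pvalues()
frac_below = sum(p < 0.05 for p in pvalues) / len(pvalues)
print(f"fraction of null P-values below 0.05: {frac_below:.3f}")
```

Because the P-values are uniform, roughly 5% of them fall below 0.05 even though no gene is truly different between the groups.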
3. P-value and hypothesis test
- The P-value of a statistic is the probability of obtaining a statistic at least as extreme under the NULL hypothesis.
- A P-value cutoff sets an arbitrary threshold for a single hypothesis test.
- With multiple hypothesis tests, the same cutoff risks producing too many false discoveries.
- e.g. with 6000 genes and a P-value cutoff of 0.05, about 6000 x 0.05 = 300 false discoveries are expected when every NULL hypothesis is true.
- These statistics need some adjustment to guarantee that the False Discovery Rate in the final rejected list is lower than a pre-defined level f.
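The arithmetic behind the 300 figure, together with the standard definition of the False Discovery Rate, can be written out as a short derivation. The symbols m (number of tests), \alpha (per-test cutoff), V (number of false rejections) and R (total number of rejections) are notation introduced here, not taken from the slides; the first expression assumes every NULL hypothesis is true.

```latex
\mathbb{E}[V] = m\,\alpha = 6000 \times 0.05 = 300,
\qquad
\mathrm{FDR} = \mathbb{E}\!\left[\frac{V}{\max(R,\,1)}\right] \le f .
```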
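4. FDR-adjusted P-value
- Sort the raw P-values in increasing order
- Find the largest i such that the i-th smallest P-value satisfies P(i) <= f * i / m, where f is a pre-defined FDR level and m is the number of tests; reject the tests with the i smallest P-values (the Benjamini-Hochberg step-up rule).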
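A minimal sketch of a step-up FDR procedure of this kind, in Python. It assumes the Benjamini-Hochberg form of the threshold, f * i / m with m the number of tests; the function name bh_reject and the example P-values are illustrative.

```python
def bh_reject(pvalues, f):
    """Return one boolean per test: True if the test is rejected at FDR level f."""
    m = len(pvalues)
    # Indices of the tests, ordered from smallest to largest P-value.
    order = sorted(range(m), key=lambda j: pvalues[j])
    # Step-up: remember the largest rank whose P-value is under its threshold.
    k = 0
    for rank, j in enumerate(order, start=1):
        if pvalues[j] <= f * rank / m:
            k = rank
    reject = [False] * m
    for j in order[:k]:
        reject[j] = True
    return reject


example = [0.038, 0.001, 0.035, 0.012, 0.9]
print(bh_reject(example, f=0.05))
```

In this example the third-smallest P-value (0.035) misses its own threshold (0.03), but it is still rejected because the fourth-smallest (0.038) passes its threshold (0.04); that is what "step-up" refers to.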
5. Difficulties
- It may be too stringent, especially when the signal is very weak and the number of genes is very large.
- Pre-screening may help to purify the data set and give better results after FDR adjustment.
- What does FDR < 0.05 mean if only 10 genes are rejected?
6. Paired t-test of pre- versus post-exposure
- Raw P-values of the paired t-test
- Scatter plot of raw P-value versus mean difference
- There is some information, but it is too weak and buried under noise.
7. Linear regression of gene expression versus exposure
- Raw P-values of the linear regression
- Scatter plot of R-squared versus raw P-value
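A minimal sketch of the per-gene regression quantities, in Python, assuming an ordinary least-squares fit of one gene's expression on exposure. The function name simple_regression and the toy numbers are illustrative, not data from the study. Turning the slope's t-statistic into a raw P-value needs a t-distribution with n - 2 degrees of freedom, which is left out here because the standard library does not provide it.

```python
import math


def simple_regression(x, y):
    """Return (slope, R-squared, t-statistic of the slope) for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    r_squared = sxy ** 2 / (sxx * syy)
    # Residual sum of squares, then the standard error of the slope.
    residual_ss = syy - slope * sxy
    slope_se = math.sqrt(residual_ss / (n - 2) / sxx)
    return slope, r_squared, slope / slope_se


exposure = [1, 2, 3, 4, 5]      # toy exposure values
expression = [2, 4, 5, 4, 5]    # toy expression values for one gene
slope, r_squared, t_stat = simple_regression(exposure, expression)
print(f"slope={slope:.2f}  R-squared={r_squared:.2f}  t={t_stat:.2f}")
```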
8. Alternative idea: estimate the overall distribution by permutation
9. Randomly flip the pre-post pair of all genes in some samples, and get the maximum t-statistic from the whole permuted data set
10. Distribution of the maximum t-statistic in 1000 permuted data sets
- Find the genes whose true t-statistic is above the 95th percentile of this distribution
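The permutation idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: each row of diff_matrix holds one gene's post-minus-pre differences, flipping a sample's pre-post pair is modelled as changing the sign of that sample's difference for every gene at once, and the absolute t-statistic is used (a two-sided choice the slides do not specify). The names paired_t and max_t_threshold and the simulated data are hypothetical.

```python
import math
import random


def paired_t(diffs):
    """Paired t-statistic computed from within-pair differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)


def max_t_threshold(diff_matrix, n_perm=1000, level=0.95, seed=0):
    """Approximate `level` quantile of the maximum |t| over all genes under sign flips."""
    rng = random.Random(seed)
    n_samples = len(diff_matrix[0])
    max_stats = []
    for _ in range(n_perm):
        # One random flip per sample, applied to all genes simultaneously.
        signs = [rng.choice((-1, 1)) for _ in range(n_samples)]
        max_stats.append(max(
            abs(paired_t([s * d for s, d in zip(signs, row)]))
            for row in diff_matrix
        ))
    max_stats.sort()
    return max_stats[min(n_perm - 1, int(level * n_perm))]


# Simulated example: 100 genes, 10 pairs; only gene 0 has a real shift.
rng = random.Random(1)
diffs = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(100)]
diffs[0] = [d + 5.0 for d in diffs[0]]

threshold = max_t_threshold(diffs)
selected = [g for g, row in enumerate(diffs) if abs(paired_t(row)) > threshold]
print(f"threshold on max |t|: {threshold:.2f}, selected genes: {selected}")
```

Flipping whole samples rather than individual genes keeps the correlation between genes intact in every permuted data set, so the threshold reflects the whole data set rather than a single test.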
11. Consider sets of genes, not individual ones
- Genes are correlated
- We have prior knowledge to group genes together
- GO terms
- Pathways
- functional group
- Clustering
- Design statistics on sets of genes, not individual genes, to test for differential expression between conditions
- Combine the weak information of genes in the sets, and reduce the number of hypothesis tests
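One simple way to combine weak per-gene information into a set-level statistic is sketched below in Python. The slides do not say which statistic is used, so the choice here (sum of per-gene t-statistics divided by the square root of the set size) is only an assumed illustration, and the gene names, set names and values are made up. Because genes are correlated, the null distribution of such a score would in practice have to be estimated, for example by permutation.

```python
import math


def set_scores(t_stats, gene_sets):
    """Combine per-gene t-statistics into one score per gene set."""
    scores = {}
    for name, genes in gene_sets.items():
        # Use only the genes that were actually measured.
        present = [t_stats[g] for g in genes if g in t_stats]
        if present:
            scores[name] = sum(present) / math.sqrt(len(present))
    return scores


# Hypothetical per-gene t-statistics and gene sets.
t_stats = {"g1": 1.5, "g2": 1.5, "g3": 1.5, "g4": 1.5, "g5": -0.5, "g6": 0.5}
gene_sets = {"set_A": ["g1", "g2", "g3", "g4"], "set_B": ["g5", "g6"]}

scores = set_scores(t_stats, gene_sets)
print(scores)
```

No single gene in set_A stands out with a t-statistic of 1.5, but the combined score of the set is 3.0, and only two tests are performed instead of six.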
14. Mootha's data