Title: Test of significance for small samples
1 Test of significance for small samples
Javier Cabrera
2 Outline
3 Differential Expression for small samples
Gene     C1     C2     C3     T1     T2     T3
G1     4.67   4.44   4.42   4.73   4.85   4.69
G2     3.13   2.54   1.96   0.97   2.38   3.36
G3     6.22   6.77   5.32   6.40   6.94   6.87
G4    10.74  10.81  10.69  10.75  10.68  10.68
G5     3.76   4.16   5.27   3.05   3.20   2.85
G6     6.95   6.78   6.33   6.81   6.95   7.01
G7     4.98   4.61   4.56   4.57   4.90   4.44
G8     2.72   3.30   3.24   3.22   3.42   3.22
G9     5.29   4.79   5.13   3.31   4.67   5.27
G10    5.12   4.85   3.79   4.13   3.12   4.79
G11    4.67   3.50   4.77   4.09   3.86   2.88
G12    6.22   6.42   5.02   6.38   6.54   6.80
G13    2.88   3.76   2.78   2.98   4.81   4.15
.......
- Preprocessed data.
- Perform a t-test for each gene.
- Select the most significant subset.
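The three steps above can be sketched in a few lines of Python. This is only an illustration: it uses the first five rows of the table, assumes the equal-variance (pooled) two-sample t statistic, and ranks genes by |t| rather than computing p-values. The names `data` and `pooled_t` are introduced here for illustration.

```python
import math
import statistics

# First rows of the table: (C1, C2, C3) and (T1, T2, T3) for each gene.
data = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
    "G4": ([10.74, 10.81, 10.69], [10.75, 10.68, 10.68]),
    "G5": ([3.76, 4.16, 5.27], [3.05, 3.20, 2.85]),
}

def pooled_t(control, treated):
    """Return (t, sp): the pooled-variance t statistic and the pooled SD."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * statistics.variance(control)
           + (n2 - 1) * statistics.variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    diff = statistics.mean(treated) - statistics.mean(control)
    return diff / (sp * math.sqrt(1 / n1 + 1 / n2)), sp

# One test per gene, then rank by the size of |t|.
results = {gene: pooled_t(c, t) for gene, (c, t) in data.items()}
ranked = sorted(results, key=lambda g: abs(results[g][0]), reverse=True)
for gene in ranked:
    t, sp = results[gene]
    print(f"{gene}: t = {t:.2f}, sp = {sp:.3f}")
```

A p-value for each gene would come from the t distribution with n1 + n2 - 2 = 4 degrees of freedom; that step is left out because it needs a t-distribution routine outside the standard library.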
4 The pooled-variance t-test
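The formula on this slide did not survive extraction. For reference, the standard pooled-variance two-sample t statistic for gene $g$ is written out below; the subscripts $C$ and $T$ (control and treated) and the symbols $n_C$, $n_T$ are notation assumed here, not taken from the slide.

```latex
t_g = \frac{\bar{X}_{g,T} - \bar{X}_{g,C}}
           {s_{p,g}\,\sqrt{\dfrac{1}{n_C} + \dfrac{1}{n_T}}},
\qquad
s_{p,g}^2 = \frac{(n_C - 1)\,s_{g,C}^2 + (n_T - 1)\,s_{g,T}^2}{n_C + n_T - 2}
```

With three arrays per group, $s_{p,g}^2$ has only $n_C + n_T - 2 = 4$ degrees of freedom.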
5 Plot of $t$ vs $s_p$
- Only genes that have small $s_p$ are called differentially expressed.
- Moderately and highly expressed genes are unlikely to have small $s_p$, so they will not be picked up.
- Most genes that are picked up are low expressers.
[Figure: $t$ plotted against $s_p$; plot label: 300]
6 Is this effect statistical or biological?
This graph was generated using IID normal samples.
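The point of this slide can be reproduced with a small simulation. The sketch below is an assumption-laden illustration, not the code behind the original graph: every gene is pure IID N(0, 1) noise with three arrays per group, and the gene count (5000) and number of "selected" genes (100) are arbitrary choices.

```python
import math
import random
import statistics

random.seed(1)
n_genes, n_per_group, n_top = 5000, 3, 100   # arbitrary illustration sizes

def pooled_t(control, treated):
    """Return (t, sp) for one gene."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * statistics.variance(control)
           + (n2 - 1) * statistics.variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    diff = statistics.mean(treated) - statistics.mean(control)
    return diff / (sp * math.sqrt(1 / n1 + 1 / n2)), sp

# No gene is truly differentially expressed: both groups are N(0, 1).
genes = []
for _ in range(n_genes):
    control = [random.gauss(0, 1) for _ in range(n_per_group)]
    treated = [random.gauss(0, 1) for _ in range(n_per_group)]
    genes.append(pooled_t(control, treated))

# "Select" the genes with the largest |t| and compare their sp with all genes.
top = sorted(genes, key=lambda ts: abs(ts[0]), reverse=True)[:n_top]
median_sp_all = statistics.median([sp for _, sp in genes])
median_sp_top = statistics.median([sp for _, sp in top])
print(f"median sp, all genes:      {median_sp_all:.3f}")
print(f"median sp, top |t| genes:  {median_sp_top:.3f}")
```

Because the selected genes have much smaller $s_p$ even though nothing biological is going on, the association between large |t| and small $s_p$ is a statistical effect of the small sample size.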
7 Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes
Differentially expressed genes have small $s_p$.
[Figure: distributions of $s_p$ for the two groups of genes; panel labels: 300 and 21983]
8 The effect of small sample size
- Often the sample size per group is small.
- ⇒ unreliable variances (inferences)
- ⇒ dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
- ⇒ borrow strength across genes (LPE/EB)
- ⇒ regularize the test statistics (SAM)
- ⇒ work with $t_g \mid s_g$ (Conditional t).
9 SAM: Significance Analysis of Microarrays, Tibshirani (2001)
- 1. Determine $c$.
- 2. Obtain significant genes by doing a simulation and use the False Discovery Ratio (FDR) to find $\Delta$.
- 3. Significant genes.
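10 Determining $c$
Start with the pairs $(r_g, s_g)$. Let $s^{\alpha}$ be the $\alpha$th percentile of the $s_g$ values and let $T_g(s^{\alpha}) = r_g / (s_g + s^{\alpha})$.
Compute the percentiles, $q_1 \le q_2 \le \dots \le q_{100}$, of the $s_g$ values.
For $\alpha = 0, 5, 10, \dots, 100$, compute $v_j(\alpha) = \mathrm{mad}\{T_g(s^{\alpha}) : s_g \in [q_j, q_{j+1})\}$, $j = 1, 2, \dots, n$.
Compute $cv(\alpha)$, the coefficient of variation of the $v_j(\alpha)$ values.
Choose $\hat{\alpha}$ as the value of $\alpha$ that minimizes $cv(\alpha)$. Fix $c$ as the value $s^{\hat{\alpha}}$.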
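The search for $c$ described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data are simulated (2000 null genes, three arrays per group, a uniform gene-specific scale), percentiles use a simple nearest-rank rule, and the helper names `percentile`, `mad` and `choose_c` are introduced here rather than taken from any SAM implementation.

```python
import bisect
import random
import statistics

def percentile(sorted_vals, p):
    """Nearest-rank percentile (0 <= p <= 100) of an already sorted list."""
    return sorted_vals[round(p / 100 * (len(sorted_vals) - 1))]

def mad(values):
    """Median absolute deviation from the median."""
    m = statistics.median(values)
    return statistics.median([abs(v - m) for v in values])

def choose_c(r, s, alphas=range(0, 101, 5)):
    """Return (c, alpha_hat, cv) following the percentile search on the slide."""
    s_sorted = sorted(s)
    edges = [percentile(s_sorted, j) for j in range(1, 101)]   # q_1 .. q_100
    cv = {}
    for alpha in alphas:
        s_alpha = percentile(s_sorted, alpha)
        bins = [[] for _ in range(len(edges) - 1)]
        for rg, sg in zip(r, s):
            j = bisect.bisect_right(edges, sg) - 1     # edges[j] <= sg < edges[j+1]
            if 0 <= j < len(bins):
                bins[j].append(rg / (sg + s_alpha))    # T_g(s^alpha)
        v = [mad(b) for b in bins if len(b) >= 2]      # v_j(alpha)
        cv[alpha] = statistics.stdev(v) / statistics.mean(v)
    alpha_hat = min(cv, key=cv.get)
    return percentile(s_sorted, alpha_hat), alpha_hat, cv

# Simulated (r_g, s_g) pairs; the scale distribution is a hypothetical choice.
random.seed(0)
r, s = [], []
for _ in range(2000):
    sigma = random.uniform(0.2, 2.0)
    control = [random.gauss(0, sigma) for _ in range(3)]
    treated = [random.gauss(0, sigma) for _ in range(3)]
    sp2 = (statistics.variance(control) + statistics.variance(treated)) / 2
    r.append(statistics.mean(treated) - statistics.mean(control))
    s.append((sp2 * (1 / 3 + 1 / 3)) ** 0.5)

c, alpha_hat, cv = choose_c(r, s)
print(f"alpha_hat = {alpha_hat}, c = {c:.3f}")
```

The idea behind minimizing $cv(\alpha)$ is to pick the constant that makes the spread of $T_g$ as stable as possible across bins of $s_g$, which weakens the dependence between the statistic and its standard error.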
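11 Determining $c$
[Figure: $T_g$ plotted against $s_g$. For each $\alpha$, the values $v_1(\alpha), \dots, v_7(\alpha)$ are the mad of $T_g$ within successive bins of $s_g$; $cv(\alpha)$ is computed from them and its minimum is selected.]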
12 Simulation and use of the False Discovery Ratio (FDR) to find $\Delta$
- For each gene, B permutations are generated. For each permutation:
- Expected order statistic
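The formulas on this slide were lost, so the sketch below fills them in from the published SAM procedure as an assumption: the regularized statistic is $r_g / (s_g + c)$, the expected order statistics are the averages of the sorted permuted statistics, genes are called once the observed order statistic departs from its expectation by more than $\Delta$, and the FDR is estimated as the median number of false calls over permutations divided by the number of called genes. The data, `c = 0.5` and `delta = 0.5` are hypothetical values chosen only to make the example run.

```python
import itertools
import math
import random
import statistics

def sam_stat(x, treated_idx, c):
    """Regularized statistic r / (s + c) for one gene and one labeling."""
    treated = [x[i] for i in treated_idx]
    control = [x[i] for i in range(len(x)) if i not in treated_idx]
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * statistics.variance(control)
           + (n2 - 1) * statistics.variance(treated)) / (n1 + n2 - 2)
    s = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (statistics.mean(treated) - statistics.mean(control)) / (s + c)

random.seed(3)
n_genes, n_arrays = 200, 6
# Hypothetical data: the first 20 genes are shifted by 3 in arrays 3, 4, 5.
data = [[random.gauss(0, 1) + (3 if g < 20 and i >= 3 else 0)
         for i in range(n_arrays)] for g in range(n_genes)]
c, delta = 0.5, 0.5                                   # hypothetical constants

# With 3 + 3 arrays there are only 20 distinct relabelings, so enumerate them.
splits = list(itertools.combinations(range(n_arrays), 3))
obs = sorted(sam_stat(x, (3, 4, 5), c) for x in data)
perm = [sorted(sam_stat(x, sp, c) for x in data) for sp in splits]
expected = [statistics.mean(col) for col in zip(*perm)]   # expected order statistics

# Cutoffs: first observed order statistics that leave the expected line by > delta.
up = [d for d, e in zip(obs, expected) if d > 0 and d - e > delta]
lo = [d for d, e in zip(obs, expected) if d < 0 and e - d > delta]
cut_up = min(up) if up else math.inf
cut_lo = max(lo) if lo else -math.inf

called = sum(d >= cut_up or d <= cut_lo for d in obs)
false_calls = [sum(d >= cut_up or d <= cut_lo for d in p) for p in perm]
fdr = statistics.median(false_calls) / called if called else float("nan")
print(f"called = {called}, estimated FDR = {fdr:.3f}")
```

Repeating the last block over a grid of `delta` values gives a table of called genes and estimated FDR from which $\Delta$ can be chosen.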
13 SAM: The t statistics
[Figure; label: $\Delta$]
14 SAM output table
15 Interpreting the SAM table
(1) Choose a value of the FDR (say 5% or 1%) and use the corresponding value of $\Delta$. In our example, suppose we choose FDR (90%) = 1%; this corresponds to $\Delta = 1.5$.
(2) Some scientists find the choice of FDR a hard one to make and are more comfortable with a more classical strategy of choosing the $\Delta$ that corresponds to a fixed proportion of false positives, say 0.01. This method would produce $\Delta = 1.1$.
(3) A third strategy would be to start with strategy (2) and then check the FDR. If the FDR is too high, we may increase $\Delta$ as long as (i) there is an important reduction of the FDR and (ii) the number of called genes does not decrease substantially. In our example we may argue that $\Delta = 1.1$ corresponds to an FDR of 4.5%, which may be good enough.
16 Concerns about SAM
- Permutations of 6?
- $c$ is just a first-order correction.
[Figure panels: $\Delta = 0.70$, $\Delta = 1.05$, $\Delta = 1.33$]
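17 Conditional t: Basic Model
- Let $X_{gij}$ denote the preprocessed intensity measurement for gene $g$ in array $i$ of group $j$.
- Model: $X_{gij} = \mu_{gj} + \sigma_g \epsilon_{gij}$
- Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
- Error model: $\epsilon_{gij} \sim F$ (location = 0, scale = 1)
- Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_{\mu}$ and $\sigma_g^2 \sim F_{\sigma}$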
18 Possible approaches
Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric
19 Procedure
20 Procedure (cont.)
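21 Roadblock
Let $X_{ij}$ be a sample from the model with $\sigma^2 \sim F_{\sigma}$ and let the variance obtained from the $X_{ij}$ be $s^2$. Then $\mathrm{Var}(s^2) > \mathrm{Var}(\sigma^2)$.
For example, if we assume that $F_{\sigma} = \chi^2_3$, $n = 4$ and $\epsilon \sim N(0,1)$, then $\mathrm{Var}(\sigma^2) = 6$ and $\mathrm{Var}(s^2) = 15$.
Fix by target estimation.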
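The inequality follows from the law of total variance: if $s^2$ is unbiased for $\sigma^2$ given $\sigma^2$, then $\mathrm{Var}(s^2) = \mathrm{Var}(\sigma^2) + E[\mathrm{Var}(s^2 \mid \sigma^2)] > \mathrm{Var}(\sigma^2)$. The simulation below checks this numerically. It assumes $\sigma^2 \sim \chi^2_3$, normal errors, and $s^2$ computed as the usual sample variance of a single group of $n = 4$ observations; the slide does not spell out how $s^2$ is computed, so the simulated value of $\mathrm{Var}(s^2)$ need not match the slide's figure exactly.

```python
import math
import random
import statistics

random.seed(2)
n, df, reps = 4, 3, 50000          # reps is an arbitrary simulation size

sigma2_draws, s2_draws = [], []
for _ in range(reps):
    # A chi-square with df degrees of freedom is Gamma(shape=df/2, scale=2).
    sigma2 = random.gammavariate(df / 2, 2.0)
    x = [random.gauss(0, math.sqrt(sigma2)) for _ in range(n)]
    mean_x = sum(x) / n
    s2 = sum((v - mean_x) ** 2 for v in x) / (n - 1)   # sample variance
    sigma2_draws.append(sigma2)
    s2_draws.append(s2)

var_sigma2 = statistics.variance(sigma2_draws)
var_s2 = statistics.variance(s2_draws)
print(f"Var(sigma^2) ~ {var_sigma2:.2f}")
print(f"Var(s^2)     ~ {var_s2:.2f}")
```

The practical consequence is that the empirical distribution of the $s_g^2$ is more spread out than $F_{\sigma}$, so it cannot be used directly as an estimate of $F_{\sigma}$.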
22 Example: Checking the distribution of $\sigma_g$
Compare the distribution of $s_g$ (Mice Data) vs simulation with:
1. Df = 0.5
2. Df = 2
3. Df = 6
23 Another Example
Compare the distribution of $s_g$ vs simulation with Df = 0.5, Df = 3 and Df = 6.
24 Fixing the variance distribution
25 Fixing the variance distribution (contd)
Proceed as before.
26 Plot of $t$ vs $s_p$
Differentially expressed genes may have large $s_p$.
[Figure: $t$ plotted against $s_p$; plot label: 130]
27 Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes selected by CT
Differentially expressed genes may have large $s_p$.
28 Generating p-values
29 Extensions
- F test
- - Condition on the sqrt(MSE)
- Multiple comparisons: Tukey, Dunnett, Bump.
- - Condition on the sqrt(MSE)
- Gene Ontology
- - Test for the significance of groups.
- - Use Hypergeometric Statistic, mean t, mean p-value, or other.
- - Condition on log of the number of genes per group
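As an illustration of the hypergeometric option for Gene Ontology groups, the sketch below computes the upper-tail probability of seeing at least `k` selected genes in a group. All numbers in the usage lines are hypothetical: the total of 22283 genes is simply the sum of the two counts shown earlier (300 and 21983), and the group size of 50 with 5 selected genes is made up. The conditioning on log group size mentioned above is not implemented here.

```python
from math import comb

def hypergeom_pvalue(total, group_size, n_selected, k):
    """P(X >= k), where X counts group members among n_selected genes
    drawn without replacement from `total` genes."""
    upper = min(group_size, n_selected)
    tail = sum(comb(group_size, i) * comb(total - group_size, n_selected - i)
               for i in range(k, upper + 1))
    return tail / comb(total, n_selected)

# Hypothetical example: a GO group of 50 genes, 5 of them among 300 selected.
p_group = hypergeom_pvalue(total=22283, group_size=50, n_selected=300, k=5)
print(f"p-value for the group: {p_group:.2e}")
```

A small p-value says the group contains more selected genes than expected by chance; small groups can only reach a limited set of p-values, which is one motivation for taking group size into account.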
30 Conditional F
31 Target Estimation
- Cabrera, Fernholz (1999)
- - Bias Reduction.
- - MSE reduction.
- Recent Applications
- - Ellipse Estimation (Multivariate Target).
- - Logistic Regression
- Cabrera, Fernholz, Devas (2003)
- Patel (2003) Target Conditional MLE (TCMLE)
- Implementation in StatXact (CYTEL) and LogXact Procs in SAS (by CYTEL).
32 Target Estimation
33 Target Estimation
Algorithms
- Stochastic approximation.
- Simulation and iteration.
- Exact algorithm for TCMLE
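A "simulation and iteration" algorithm can be sketched on a toy problem. The sketch assumes the usual formulation of target estimation, in which the target estimate $\tilde{\theta}$ solves $E_{\tilde{\theta}}[T] = T_{obs}$ for a given estimator $T$. Here $T$ is the variance MLE (dividing by $n$) under a normal model, the expectation is approximated by simulation with common random numbers, and the equation is solved by a simple fixed-point iteration. The sample `x`, the number of simulated samples and the iteration count are hypothetical choices.

```python
import random

def variance_mle(x):
    """Biased variance estimator T: divides by n rather than n - 1."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

def simulated_mean(theta, z_sets):
    """Monte Carlo estimate of E_theta[T], reusing the same N(0, 1) draws."""
    scale = theta ** 0.5
    return sum(variance_mle([scale * z for z in zs]) for zs in z_sets) / len(z_sets)

def target_estimate(t_obs, z_sets, iters=30):
    """Iterate theta <- theta + (t_obs - E_theta[T]) until E_theta[T] ~ t_obs."""
    theta = t_obs
    for _ in range(iters):
        theta += t_obs - simulated_mean(theta, z_sets)
    return theta

random.seed(4)
x = [4.67, 4.44, 4.42, 4.73]          # hypothetical sample of size n = 4
n = len(x)
z_sets = [[random.gauss(0, 1) for _ in range(n)] for _ in range(2000)]

t_obs = variance_mle(x)
theta_target = target_estimate(t_obs, z_sets)
print(f"T_obs = {t_obs:.5f}, target estimate = {theta_target:.5f}")
```

In this toy case $E_{\theta}[T] = \theta (n - 1)/n$, so the target estimate should land close to $T_{obs} \cdot n/(n - 1)$, the familiar unbiased variance estimate; the simulation route matters when no such closed form is available.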
34 GO Ontology: Conditioning on log(n)
[Figure: Abs(T) plotted against Log(n)]