Test of significance for small samples

1
Test of significance for small samples
Javier Cabrera
2
Outline
3
Differential Expression for small samples
        C1      C2      C3      T1      T2      T3
G1     4.67    4.44    4.42    4.73    4.85    4.69
G2     3.13    2.54    1.96    0.97    2.38    3.36
G3     6.22    6.77    5.32    6.40    6.94    6.87
G4    10.74   10.81   10.69   10.75   10.68   10.68
G5     3.76    4.16    5.27    3.05    3.20    2.85
G6     6.95    6.78    6.33    6.81    6.95    7.01
G7     4.98    4.61    4.56    4.57    4.90    4.44
G8     2.72    3.30    3.24    3.22    3.42    3.22
G9     5.29    4.79    5.13    3.31    4.67    5.27
G10    5.12    4.85    3.79    4.13    3.12    4.79
G11    4.67    3.50    4.77    4.09    3.86    2.88
G12    6.22    6.42    5.02    6.38    6.54    6.80
G13    2.88    3.76    2.78    2.98    4.81    4.15
.......
  • Preprocessed data.
  • Perform a t-test for each gene.
  • Select the most significant subset.
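The three steps above can be sketched on the thirteen genes listed in the table. This is a minimal illustration, assuming an equal-variance (pooled) two-sample t statistic with the sign taken as treated minus control; the function name `pooled_t` and the ranking by |t| are choices made here, not taken from the slides.

```python
import math
from statistics import mean, variance

# Rows of the example table: (C1, C2, C3) and (T1, T2, T3) for each gene.
data = {
    "G1": ([4.67, 4.44, 4.42], [4.73, 4.85, 4.69]),
    "G2": ([3.13, 2.54, 1.96], [0.97, 2.38, 3.36]),
    "G3": ([6.22, 6.77, 5.32], [6.40, 6.94, 6.87]),
    "G4": ([10.74, 10.81, 10.69], [10.75, 10.68, 10.68]),
    "G5": ([3.76, 4.16, 5.27], [3.05, 3.20, 2.85]),
    "G6": ([6.95, 6.78, 6.33], [6.81, 6.95, 7.01]),
    "G7": ([4.98, 4.61, 4.56], [4.57, 4.90, 4.44]),
    "G8": ([2.72, 3.30, 3.24], [3.22, 3.42, 3.22]),
    "G9": ([5.29, 4.79, 5.13], [3.31, 4.67, 5.27]),
    "G10": ([5.12, 4.85, 3.79], [4.13, 3.12, 4.79]),
    "G11": ([4.67, 3.50, 4.77], [4.09, 3.86, 2.88]),
    "G12": ([6.22, 6.42, 5.02], [6.38, 6.54, 6.80]),
    "G13": ([2.88, 3.76, 2.78], [2.98, 4.81, 4.15]),
}

def pooled_t(control, treated):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    sp = math.sqrt(sp2)
    t = (mean(treated) - mean(control)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp

# One test per gene, then rank genes by the size of |t|.
results = {gene: pooled_t(c, t) for gene, (c, t) in data.items()}
ranked = sorted(results, key=lambda gene: abs(results[gene][0]), reverse=True)

for gene in ranked:
    t, sp = results[gene]
    print(f"{gene:>3}: t = {t:7.2f}   s_p = {sp:.3f}")
```

Printing s_p next to t makes it easy to check by eye whether the top-ranked genes are also the ones with the smallest pooled standard deviation.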

4
The pooled variances T-test
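For reference, the standard two-sample pooled-variance statistic for gene g can be written as follows. The notation (group sizes $n_1$, $n_2$, group means $\bar{X}_{g\cdot 1}$, $\bar{X}_{g\cdot 2}$, and group variances $s_{g1}^2$, $s_{g2}^2$) is assumed here and is not taken from the slide.

```latex
t_g = \frac{\bar{X}_{g\cdot 2} - \bar{X}_{g\cdot 1}}
           {s_{p,g}\,\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_{p,g}^2 = \frac{(n_1 - 1)\,s_{g1}^2 + (n_2 - 1)\,s_{g2}^2}{n_1 + n_2 - 2}
```

With three arrays per group, $s_{p,g}^2$ has only four degrees of freedom, so it is a noisy estimate of the gene's variance.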
5
Plot t vs $s_p$
  • Only genes with a small pooled standard deviation $s_p$ are called differentially expressed.
  • Moderately and highly expressed genes are unlikely to have a small $s_p$, so they will not be picked up.
  • Most genes that are picked up are low expressers.

(Figure: t plotted against $s_p$; label on the plot: 300.)
6
Is this effect statistical or biological?
This graph was generated using IID normal samples.
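A small simulation in the same spirit shows the effect without any biology. It is only a sketch: the number of genes, the three arrays per group, and the number of genes "selected" are assumed values, and every gene is pure IID N(0, 1) noise.

```python
import math
import random
from statistics import mean, median, variance

random.seed(0)  # fixed seed so the sketch is reproducible

N_GENES = 5000    # assumed number of genes
N_PER_GROUP = 3   # three arrays per group, as in the example table
N_TOP = 100       # assumed number of genes selected by |t|

def pooled_t(x, y):
    """Return (t, s_p) for a two-sample pooled-variance t-test."""
    n1, n2 = len(x), len(y)
    sp = math.sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2))
    t = (mean(y) - mean(x)) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, sp

# Every gene is noise: no gene is truly differentially expressed.
stats = []
for _ in range(N_GENES):
    x = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    y = [random.gauss(0, 1) for _ in range(N_PER_GROUP)]
    stats.append(pooled_t(x, y))

# "Select" the genes with the largest |t| and compare their s_p with the rest.
stats.sort(key=lambda ts: abs(ts[0]), reverse=True)
top_sp = [sp for _, sp in stats[:N_TOP]]
rest_sp = [sp for _, sp in stats[N_TOP:]]

print(f"median s_p, top {N_TOP} genes by |t|: {median(top_sp):.3f}")
print(f"median s_p, remaining genes:        {median(rest_sp):.3f}")
```

Because t has $s_p$ in its denominator, the genes with extreme t are mostly those whose $s_p$ happened to be small, which points to a statistical rather than a biological explanation.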
7
Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes
Differentially expressed genes have small $s_p$
(Figure: distributions of $s_p$ for the two sets of genes; labels on the plot: 300 and 21983.)
8
The effect of small sample size
  • Often the sample size per group is small.
  • → unreliable variances (inferences)
  • → dependence between the test statistics ($t_g$) and the standard error estimates ($s_g$)
  • → borrow strength across genes (LPE/EB)
  • → regularize the test statistics (SAM)
  • → work with $t_g \mid s_g$ (Conditional t).

9
SAM: Significance Analysis of Microarrays, Tibshirani (2001)
  • 1. Determine c.
  • 2. Obtain significant genes by doing a simulation and using the False Discovery Rate (FDR) to find $\Delta$.
  • 3. Significant genes.

10
Determining c
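Start with the pairs $(r_g, s_g)$. Let $s^{\alpha}$ be the $\alpha$th percentile of the $s_g$ values and let $T_g(s^{\alpha}) = r_g / (s_g + s^{\alpha})$.
Compute the percentiles, $q_1 \le q_2 \le \dots \le q_{100}$, of the $s_g$ values.
For $\alpha \in \{0, 5, 10, \dots, 100\}$, compute $v_j(\alpha) = \operatorname{mad}\{T_g(s^{\alpha}) \mid s_g \in [q_j, q_{j+1})\}$, $j = 1, 2, \dots, n$.
Compute $cv(\alpha)$, the coefficient of variation of the $v_j(\alpha)$ values.
Choose $\hat{\alpha}$ as the value of $\alpha$ that minimizes $cv(\alpha)$.
Fix $c$ as the value $s^{\hat{\alpha}}$.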
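The search for c can be sketched as below. This is an illustration under stated assumptions, not the SAM implementation: the data are simulated noise, the statistic is taken to be $r_g / (s_g + s^{\alpha})$, equal-count bins of genes ordered by $s_g$ stand in for the percentile intervals, and the helper names (`mad`, `percentile`, `choose_c`) and the default of 20 bins are choices made here.

```python
import math
import random
from statistics import mean, median, pstdev, variance

random.seed(1)  # fixed seed so the sketch is reproducible

def mad(values):
    """Median absolute deviation from the median."""
    values = list(values)
    m = median(values)
    return median(abs(v - m) for v in values)

def percentile(sorted_vals, alpha):
    """alpha-th percentile (0 to 100) of a sorted list, by linear interpolation."""
    pos = (len(sorted_vals) - 1) * alpha / 100
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])

def choose_c(r, s, n_bins=20, alphas=range(0, 101, 5)):
    """Return (cv, alpha, c) for the alpha that minimizes cv(alpha)."""
    order = sorted(range(len(s)), key=lambda g: s[g])
    s_sorted = [s[g] for g in order]
    # Equal-count bins of genes ordered by s_g, standing in for [q_j, q_{j+1}).
    bounds = [round(j * len(order) / n_bins) for j in range(n_bins + 1)]
    bins = [order[bounds[j]:bounds[j + 1]] for j in range(n_bins)]
    best = None
    for alpha in alphas:
        s_alpha = percentile(s_sorted, alpha)
        # v_j(alpha): spread of the regularized statistic within each bin.
        v = [mad(r[g] / (s[g] + s_alpha) for g in b) for b in bins]
        cv = pstdev(v) / mean(v)  # coefficient of variation across bins
        if best is None or cv < best[0]:
            best = (cv, alpha, s_alpha)
    return best

# Hypothetical data: r_g is the difference of group means, s_g its standard error.
r, s = [], []
for _ in range(2000):
    x = [random.gauss(0, 1) for _ in range(3)]
    y = [random.gauss(0, 1) for _ in range(3)]
    sp2 = (variance(x) + variance(y)) / 2  # pooled variance, equal group sizes
    r.append(mean(y) - mean(x))
    s.append(math.sqrt(sp2 * (1 / 3 + 1 / 3)))

cv, alpha, c = choose_c(r, s)
print(f"alpha = {alpha}, c = {c:.3f}, cv = {cv:.3f}")
```

The idea is that a good c makes the spread of the statistic roughly the same in every bin of $s_g$, so that the statistic no longer depends strongly on the standard error.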
11
Determining c
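(Figure: for each $\alpha$, $T_g$ is plotted against $s_g$; $v_1(\alpha), \dots, v_7(\alpha)$ are the mad of $T_g$ within successive bins of $s_g$; $cv(\alpha)$ is computed from these values and its minimum is taken.)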
12
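Simulation and use of the False Discovery Rate (FDR) to find $\Delta$
  • For each gene, B permutations are generated. For each permutation, the statistics are recomputed and ordered.
  • Expected order statistic: the average of each order statistic over the B permutations.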
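The permutation step can be sketched on the thirteen genes of the example table. This is a simplified illustration, not the SAM algorithm: `C_FUDGE` and `DELTA` are hypothetical values, all 20 ways of labelling three of the six arrays as treated are enumerated instead of sampling B permutations, and a gene is flagged whenever its observed order statistic differs from the expected one by more than `DELTA`. The FDR estimate itself is left out.

```python
import math
from itertools import combinations
from statistics import mean, variance

# Rows of the example table, columns C1, C2, C3, T1, T2, T3.
X = [
    [4.67, 4.44, 4.42, 4.73, 4.85, 4.69],
    [3.13, 2.54, 1.96, 0.97, 2.38, 3.36],
    [6.22, 6.77, 5.32, 6.40, 6.94, 6.87],
    [10.74, 10.81, 10.69, 10.75, 10.68, 10.68],
    [3.76, 4.16, 5.27, 3.05, 3.20, 2.85],
    [6.95, 6.78, 6.33, 6.81, 6.95, 7.01],
    [4.98, 4.61, 4.56, 4.57, 4.90, 4.44],
    [2.72, 3.30, 3.24, 3.22, 3.42, 3.22],
    [5.29, 4.79, 5.13, 3.31, 4.67, 5.27],
    [5.12, 4.85, 3.79, 4.13, 3.12, 4.79],
    [4.67, 3.50, 4.77, 4.09, 3.86, 2.88],
    [6.22, 6.42, 5.02, 6.38, 6.54, 6.80],
    [2.88, 3.76, 2.78, 2.98, 4.81, 4.15],
]

C_FUDGE = 0.1  # hypothetical value of c, for illustration only
DELTA = 0.5    # hypothetical threshold, for illustration only

def d_stat(row, treated_idx):
    """Regularized statistic r / (s + c) for one gene and one labelling."""
    treated = [row[i] for i in treated_idx]
    control = [row[i] for i in range(len(row)) if i not in treated_idx]
    n1, n2 = len(control), len(treated)
    sp2 = ((n1 - 1) * variance(control) + (n2 - 1) * variance(treated)) / (n1 + n2 - 2)
    s = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return (mean(treated) - mean(control)) / (s + C_FUDGE)

# Observed ordered statistics, using the true labelling (T1, T2, T3 treated).
observed = sorted(d_stat(row, (3, 4, 5)) for row in X)

# All 20 ways to label three of the six arrays as treated.
perms = list(combinations(range(6), 3))
sorted_perm = [sorted(d_stat(row, idx) for row in X) for idx in perms]

# Expected order statistics: average of each rank over the relabellings.
expected = [mean(col) for col in zip(*sorted_perm)]

# Flag ranks whose observed value departs from the expected one by more than DELTA.
called = [(o, e) for o, e in zip(observed, expected) if abs(o - e) > DELTA]
print(f"{len(perms)} labellings, {len(called)} of {len(X)} genes flagged at DELTA = {DELTA}")
```

With three arrays per group there are only 20 distinct labellings, so the permutation distribution is very coarse.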

13
SAM: The t statistics
(Figure label: $\Delta$)
14
SAM output table
15
Interpreting the SAM table
(1) Choose a value of the FDR (say 5% or 1%) and use the corresponding value of $\Delta$. In our example, suppose we choose FDR (90%) = 1%; this corresponds to $\Delta = 1.5$.
(2) Some scientists find the choice of FDR a hard one to make and are more comfortable with the more classical strategy of choosing the $\Delta$ that corresponds to a fixed proportion of false positives, say 0.01. This method would produce $\Delta = 1.1$.
(3) A third strategy is to start with strategy (2) and then check the FDR. If the FDR is too high, we may increase $\Delta$ as long as (i) there is an important reduction of the FDR and (ii) the number of called genes does not decrease substantially. In our example we may argue that $\Delta = 1.1$ corresponds to an FDR of 4.5%, which may be good enough.
16
Concerns about SAM
  • Permutations of 6?
  • c is just a 1st order correction

(Figure panels: $\Delta = 0.70$, $\Delta = 1.05$, $\Delta = 1.33$)
17
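Conditional t: Basic Model
  • Let $X_{gij}$ denote the preprocessed intensity measurement for gene g in array i of group j.
  • Model: $X_{gij} = \mu_{gj} + \sigma_g \varepsilon_{gij}$
  • Effect of interest: $\tau_g = \mu_{g2} - \mu_{g1}$
  • Error model: $\varepsilon_{gij} \sim F(\text{location} = 0, \text{scale} = 1)$
  • Gene mean-variance model: $(\mu_{g1}, \sigma_g^2) \sim F_{\mu,\sigma}$ with marginals $\mu_{g1} \sim F_\mu$ and $\sigma_g^2 \sim F_\sigma$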
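Data from this model can be simulated once the distributions are specified. The sketch below fills them in with assumed choices that are not from the slides: standard normal errors, a normal distribution for the control mean, a chi-square with 3 degrees of freedom for the variance, and independence between mean and variance. The name `simulate_gene` is hypothetical.

```python
import random

random.seed(2)  # fixed seed so the sketch is reproducible

N_GENES = 1000
N_ARRAYS = 3  # arrays per group (assumed)
DF = 3        # assumed: sigma_g^2 follows a chi-square with DF degrees of freedom

def simulate_gene(effect=0.0):
    """Simulate one gene: X_gij = mu_gj + sigma_g * eps_gij for groups j = 1, 2."""
    mu1 = random.gauss(6.0, 2.0)                # assumed F_mu for the control mean
    sigma2 = random.gammavariate(DF / 2, 2.0)   # chi-square(DF) = Gamma(DF/2, scale 2)
    sigma = sigma2 ** 0.5
    mu = (mu1, mu1 + effect)                    # effect = mu_g2 - mu_g1
    return [
        [mu[j] + sigma * random.gauss(0, 1) for _ in range(N_ARRAYS)]  # assumed normal errors
        for j in range(2)
    ]

# Null data set: every gene has effect 0.
data = [simulate_gene() for _ in range(N_GENES)]
control, treated = data[0]
print("gene 1 control:", [round(v, 2) for v in control])
print("gene 1 treated:", [round(v, 2) for v in treated])
```

Passing a nonzero `effect` for a subset of genes gives a data set with known differentially expressed genes, which is useful for checking any selection rule.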
18
Possible approaches
Parametric: Assume functional forms for $F$ and $F_{\mu,\sigma}$ and apply either a Bayes or Empirical Bayes procedure.
Nonparametric
19
Procedure
20
Procedure (cont.)
21
Roadblock
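Let $X_{ij}$ be a sample from the model with $\sigma^2 \sim F_\sigma$ and let the variance obtained from the $X_{ij}$ be $s^2$. Then $\operatorname{Var}(s^2) > \operatorname{Var}(\sigma^2)$.
For example, if we assume that $F_\sigma = \chi^2_3$, $n = 4$ and $\varepsilon \sim N(0,1)$, then $\operatorname{Var}(\sigma^2) = 6$ and $\operatorname{Var}(s^2) = 15$.
Fix by target estimation.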
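The inequality follows from the law of total variance. The derivation below is a sketch under assumptions that the slide does not state explicitly: normal errors and the usual unbiased sample variance, so that $s^2 \mid \sigma^2 \sim \sigma^2 \chi^2_{n-1} / (n-1)$.

```latex
\operatorname{Var}(s^2)
  = \operatorname{Var}\!\big(E[s^2 \mid \sigma^2]\big)
    + E\big[\operatorname{Var}(s^2 \mid \sigma^2)\big]
  = \operatorname{Var}(\sigma^2) + \frac{2}{n-1}\,E[\sigma^4]
  \;>\; \operatorname{Var}(\sigma^2)
```

For $\sigma^2 \sim \chi^2_3$ we have $\operatorname{Var}(\sigma^2) = 6$ and $E[\sigma^4] = 6 + 3^2 = 15$, so with $n = 4$ this decomposition gives $6 + \tfrac{2}{3}\cdot 15 = 16$. That is close to, but not the same as, the 15 quoted on this slide, which may rest on a different variance convention. Either way, the observed variances are more spread out than the true ones, so their distribution cannot be used directly as an estimate of $F_\sigma$.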
22
Example: Checking the distribution of $\sigma_g$
Compare the distribution of $s_g$ vs simulation with 1. Df = 0.5, 2. Df = 2, 3. Df = 6
(Figure: Mice Data; panels labelled 1. Df = 0.5, 2. Df = 2, 3. Df = 6.)
23
Another Example
Compare the distribution of $s_g$ vs simulation with Df = 0.5, Df = 3, Df = 6
(Figure panels labelled Df = 0.5, Df = 3, Df = 6.)
24
Fixing the variance distribution
25
Fixing the variance distribution (cont.)
Proceed as before
26
Plot t vs $s_p$
Differentially expressed genes may have large $s_p$
(Figure: t plotted against $s_p$; label on the plot: 130.)
27
Comparison of distribution of $s_p$ for differentially and non-differentially expressed genes selected by CT
Differentially expressed genes may have large $s_p$
28
Generating p-values
29
Extensions
  • F test
  • - Condition on the sqrt(MSE)
  • Multiple comparisons
  • - Tukey, Dunnett, Bump.
  • - Condition on the sqrt(MSE)
  • Gene Ontology.
  • - Test for the significance of groups.
  • - Use Hypergeometric Statistic, mean t, mean p-value, or other.
  • - Condition on log of the number of genes per group
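For the Gene Ontology item, the hypergeometric statistic can be computed directly. This is a minimal sketch of the plain (unconditional) test only; the function name and the counts in the usage example are hypothetical, and the conditioning on the log of the group size is not attempted here.

```python
from math import comb

def hypergeom_upper_tail(n_total, n_selected, n_group, k_overlap):
    """P(K >= k_overlap) when n_selected genes are drawn without replacement
    from n_total genes, of which n_group belong to the gene group."""
    max_k = min(n_selected, n_group)
    favourable = sum(
        comb(n_group, k) * comb(n_total - n_group, n_selected - k)
        for k in range(k_overlap, max_k + 1)
    )
    return favourable / comb(n_total, n_selected)

# Hypothetical counts: 1000 genes on the array, 50 selected as significant,
# a GO group of 40 genes, 8 of which are among the selected genes.
p = hypergeom_upper_tail(n_total=1000, n_selected=50, n_group=40, k_overlap=8)
print(f"p-value for the group: {p:.2e}")
```

A small p-value says that the group contains more selected genes than would be expected if the selected genes were a random draw from the array.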
30
Conditional F
31
Target Estimation
  • Target Estimation
  • Cabrera, Fernholz (1999)
  • - Bias Reduction.
  • - MSE reduction.
  • Recent Applications
  • - Ellipse Estimation (Multivariate Target).
  • - Logistic Regression
  • Cabrera, Fernholz, Devas (2003)
  • Patel (2003) Target Conditional MLE (TCMLE)
  • Implementation in StatXact (CYTEL) and LogXact Procs in SAS (by CYTEL).

32
Target Estimation
33
Target Estimation

Algorithms
  • Stochastic approximation.
  • Simulation and iteration.
  • Exact algorithm for TCMLE
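The "simulation and iteration" idea can be sketched on a toy problem. The sketch assumes the usual description of target estimation, in which the corrected estimate is the parameter value whose expected estimator equals the observed estimate. Everything else is an illustrative choice made here: the estimator (the variance with divisor n), normal data, the simple fixed-point update, and the reuse of the same simulated draws at every iteration.

```python
import random

random.seed(3)  # fixed seed so the sketch is reproducible

N = 4        # sample size, as in the roadblock example
N_SIM = 500  # simulated samples per evaluation (assumed)

def var_mle(x):
    """Biased variance estimator with divisor n."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

# Common random numbers: unit-variance draws reused at every iteration.
Z = [[random.gauss(0, 1) for _ in range(N)] for _ in range(N_SIM)]

def expected_estimate(theta):
    """Monte Carlo approximation of E_theta[var_mle] for normal data with variance theta."""
    scale = theta ** 0.5
    return sum(var_mle([scale * z for z in sample]) for sample in Z) / N_SIM

def target_estimate(observed, n_iter=50):
    """Iterate until the simulated mean of the estimator matches the observed value."""
    theta = observed
    for _ in range(n_iter):
        theta += observed - expected_estimate(theta)
    return theta

observed = 0.9  # hypothetical observed value of var_mle
theta = target_estimate(observed)
print(f"observed estimate: {observed:.3f}")
print(f"target estimate:   {theta:.3f}")
print(f"simulated mean of the estimator at the target: {expected_estimate(theta):.3f}")
```

In this toy case the estimator is biased downward by a factor of about (n - 1)/n, so the target estimate comes out larger than the observed value. Reusing the same draws keeps the simulated mean a smooth function of theta, which lets the simple iteration settle instead of wandering with the Monte Carlo noise.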
34
GO Ontology: Conditioning on log(n)
(Figure axes: Abs(T) and Log(n).)