Significance Testing of Microarray Data - PowerPoint PPT Presentation

1
Significance Testing of Microarray Data
  • BIOS 691 Fall 2008
  • Mark Reimers
  • Dept. Biostatistics

2
Outline
  • Multiple Testing
  • Family-wise error rates
  • False discovery rates
  • Application to microarray data
  • Practical issues: correlated errors
  • Computing FDR by permutation procedures
  • Conditioning t-scores

3
Reality Check
  • Goals of Testing
  • To identify genes most likely to be changed or
    affected
  • To prioritize candidates for focused follow-up
    studies
  • To characterize functional changes consequent on
    changes in gene expression
  • So in practice we don't need to be exact
  • but we do need to be principled!

4
Multiple comparisons
  • Suppose no genes really changed
  • (as if random samples from same population)
  • 10,000 genes on a chip
  • Each gene has a 5% chance of exceeding the
    threshold at a p-value of .05
  • Type I error
  • The test statistics for about 500 genes
    (10,000 x .05) should pass the .05 threshold by chance
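
The arithmetic on this slide can be checked with a small simulation. This is a minimal sketch, assuming that p-values under the null hypothesis are uniform on [0, 1]; the function name and the seed are illustrative choices, not part of the original slides.

```python
import random

def count_false_positives(n_genes, alpha, rng):
    """Count null genes whose p-value falls below alpha.

    Under the null hypothesis each p-value is uniform on [0, 1].
    """
    return sum(1 for _ in range(n_genes) if rng.random() < alpha)

n_genes, alpha = 10_000, 0.05
expected = n_genes * alpha      # expected count of false positives
rng = random.Random(0)          # fixed seed, for reproducibility only
observed = count_false_positives(n_genes, alpha, rng)
print(expected, observed)
```

The observed count varies around the expected value from run to run, which is the point of the slide: with no real changes at all, hundreds of genes still pass the threshold.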

5
Distributions of p-values
[Figure: p-value histograms. Panels: Real Microarray Data; Random Data]
6
Characterizing False Positives
  • Family-Wise Error Rate (FWE)
  • probability of at least one false positive
    arising from the selection procedure
  • Strong control of FWE
  • Bound on FWE independent of number changed
  • False Discovery Rate
  • Proportion of false positives arising from
    selection procedure
  • ESTIMATE ONLY!

7
Corrected p-Values for FWE
  • Sidak (exact correction for independent tests)
  • Corrected $\tilde{p}_i = 1 - (1 - p_i)^N$ if all $p_i$ are independent
  • $\tilde{p}_i \approx 1 - (1 - N p_i) = N p_i$ gives Bonferroni
  • Bonferroni correction
  • $\tilde{p}_i = N p_i$ if $N p_i < 1$, otherwise 1
  • Expectation argument
  • Still conservative if genes are co-regulated
    (correlated)
  • Both are too conservative for array use!
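
The two corrections on this slide can be written out directly. This is a minimal sketch; the function names are illustrative, and N = 10,000 tests is taken from the earlier chip example.

```python
def sidak(p, n_tests):
    """Sidak-corrected p-value, exact when the tests are independent."""
    return 1.0 - (1.0 - p) ** n_tests

def bonferroni(p, n_tests):
    """Bonferroni-corrected p-value, capped at 1."""
    return min(1.0, n_tests * p)

n_tests = 10_000
for p in (1e-6, 1e-5, 1e-4):
    print(p, sidak(p, n_tests), bonferroni(p, n_tests))
```

For small p the two values nearly coincide, and Sidak is never larger than Bonferroni.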

8
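Holm's FWER Procedure
  • Order p-values $p_{(1)} \le \dots \le p_{(N)}$
  • If $p_{(1)} < \alpha/N$, reject $H_{(1)}$, then
  • If $p_{(2)} < \alpha/(N-1)$, reject $H_{(2)}$, then ...
  • Let k be the largest index such that $p_{(n)} < \alpha/(N-n+1)$ for
    all $n \le k$
  • Reject $H_{(1)}, \dots, H_{(k)}$
  • Then P(at least one false positive) $\le \alpha$
  • Step-up procedure in this deck's sense: start with the smallest
    p-values and work up (most of the literature calls this step-down)
  • Proof doesn't depend on distributions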
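
The procedure on this slide could be sketched as follows. The function name and the example p-values are made up for illustration.

```python
def holm_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where Holm's procedure rejects."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    rejected = [False] * n
    for rank, i in enumerate(order):      # rank 0 is the smallest p-value
        if pvalues[i] < alpha / (n - rank):
            rejected[i] = True
        else:
            break                         # stop at the first failure
    return rejected

print(holm_reject([0.001, 0.04, 0.03, 0.012]))
```

With four tests the successive thresholds are 0.0125, 0.0167, 0.025 and 0.05, so the walk stops at the third ordered p-value (0.03).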

9
Hochberg's FWER Procedure
  • Find the largest k such that $p_{(k)} < \alpha/(N - k + 1)$
  • Then select genes (1) to (k)
  • Step-down procedure in this deck's sense: start from the largest
    p-values and work down (most of the literature calls this step-up)
  • More powerful than Holm's procedure
  • But requires assumptions: independence or
    positive dependence
  • When there is one type I error, there could be many
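
A minimal sketch of this procedure follows; the function name and example p-values are illustrative only. It uses the same thresholds as Holm's procedure but searches from the largest p-value, so it can reject where Holm's procedure stops early.

```python
def hochberg_reject(pvalues, alpha=0.05):
    """Return a list of booleans: True where Hochberg's procedure rejects."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    k = 0
    for rank in range(n, 0, -1):          # rank n is the largest p-value
        if pvalues[order[rank - 1]] < alpha / (n - rank + 1):
            k = rank                      # largest rank passing its threshold
            break
    rejected = [False] * n
    for i in order[:k]:
        rejected[i] = True
    return rejected

print(hochberg_reject([0.02, 0.03, 0.04, 0.045]))
```

In this example the largest p-value (0.045) is already below alpha, so all four hypotheses are rejected, whereas a walk from the smallest p-value would stop at once because 0.02 is not below 0.05/4.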

10
Simes' Lemma
  • Suppose we order the p-values from N independent
    tests using random data
  • $p_{(1)} \le p_{(2)} \le \dots \le p_{(N)}$
  • Pick a target threshold $\alpha$
  • $P(p_{(1)} < \alpha/N \text{ or } p_{(2)} < 2\alpha/N \text{ or } p_{(3)} < 3\alpha/N \text{ or } \dots) = \alpha$

[Figure: unit square for N = 2, axes marked at $\alpha/2$] $A = \alpha/2 + \alpha/2 - \alpha^2/4 + \alpha^2/4 = \alpha$
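
The lemma can be checked numerically. This is a Monte Carlo sketch, assuming independent uniform p-values (the "random data" case); the function names, number of tests, and number of trials are arbitrary choices for illustration.

```python
import random

def simes_event(pvalues, alpha):
    """True if some ordered p-value p(k) falls below k * alpha / N."""
    n = len(pvalues)
    return any(p < (k * alpha) / n
               for k, p in enumerate(sorted(pvalues), start=1))

def estimate_simes_probability(n_tests, alpha, n_trials, seed=0):
    """Fraction of all-null data sets in which the Simes event occurs."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        pvalues = [rng.random() for _ in range(n_tests)]
        hits += simes_event(pvalues, alpha)
    return hits / n_trials

estimate = estimate_simes_probability(n_tests=20, alpha=0.05, n_trials=20_000)
print(estimate)
```

The estimate should land close to the chosen threshold of 0.05, up to simulation noise.
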
11
Simes' FWER Procedure
  • Pick a target threshold $\alpha$
  • Order the p-values $p_{(1)} \le p_{(2)} \le \dots \le p_{(N)}$
  • If $p_{(1)} < \alpha/N$ then ...
  • If $p_{(2)} < 2\alpha/N$ then ...
  • If $p_{(k)} < k\alpha/N$
  • Select the corresponding genes (1) to (k)
  • Step-up procedure
  • starting with the smallest p-values and working up

12
Truth vs. Decision
[Table: Truth cross-classified against Decision]
13
False Discovery Rate
  • In genomic problems a few false positives are
    often acceptable.
  • Want to trade off power vs. false positives
  • Could control
  • Expected number of false positives
  • Expected proportion of false positives
  • What to do with E(V/R) when R is 0?
  • Actual proportion of false positives

14
Catalog of Type I Error Rates
  • Per-family Error Rate
  • $\mathrm{PFER} = E(V)$
  • Per-comparison Error Rate
  • $\mathrm{PCER} = E(V)/m$
  • Family-wise Error Rate
  • $\mathrm{FWER} = P(V \ge 1)$
  • False Discovery Rate
  • i) $\mathrm{FDR} = E(Q)$, where
  • $Q = V/R$ if $R > 0$; $Q = 0$ if $R = 0$ (B-H)
  • ii) $\mathrm{FDR} = E(V/R \mid R > 0)$ (Storey's pFDR)
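
In a simulation, where the truth is known, the quantities V, R and Q can be computed directly. This is a minimal sketch with made-up truth and decision vectors; the function name is illustrative.

```python
def false_discovery_proportion(is_null, rejected):
    """Return (V, R, Q) for one data set.

    V = number of true nulls rejected (false positives)
    R = total number of rejections
    Q = V / R, with Q = 0 when R = 0 (the B-H convention)
    """
    v = sum(1 for null, rej in zip(is_null, rejected) if null and rej)
    r = sum(1 for rej in rejected if rej)
    q = v / r if r > 0 else 0.0
    return v, r, q

is_null = [True, True, False, False, False]
rejected = [True, False, True, True, False]
print(false_discovery_proportion(is_null, rejected))
```

The error rates in the catalog are expectations or probabilities of these quantities over repeated data sets; with real data V is not observable.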

15
Benjamini-Hochberg
  • Can't know what FDR is for a particular sample
  • B-H suggest a procedure specifying the average FDR
  • Order the p-values $p_{(1)} \le p_{(2)} \le \dots \le p_{(N)}$
  • If any $p_{(k)} < k\alpha/N$, take the largest such k
  • Then select genes (1) to (k)
  • q-value: smallest FDR at which the gene becomes
    significant
  • NB: acceptable FDR may be much larger than
    acceptable p-value (e.g. 0.10)
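
The selection rule and the q-values could be sketched as below. The function name and example p-values are made up. The q-value is computed here as the running minimum of N p(k) / k from the largest rank down, which is the usual B-H adjusted p-value; Storey's q-value additionally scales by an estimate of the proportion of true nulls, which this sketch omits.

```python
def benjamini_hochberg(pvalues, alpha=0.10):
    """Return (selected, qvalues) for the B-H procedure at FDR level alpha."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):          # from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    # A gene is selected when its q-value is below alpha, i.e. when its rank
    # is at most the largest k with p(k) < k * alpha / N.
    selected = [q < alpha for q in qvalues]
    return selected, qvalues

p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
selected_05, q = benjamini_hochberg(p, alpha=0.05)
selected_10, _ = benjamini_hochberg(p, alpha=0.10)
print(selected_05, selected_10, q)
```

In this example the looser FDR level of 0.10 selects several more genes than 0.05 does, which illustrates the NB on the slide.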

16
Argument for B-H Method
  • If no true changes (all null hypotheses hold)
  • $Q = 1$ whenever the condition of Simes' lemma holds
  • so $E(Q) = P(Q = 1) \le \alpha$
  • If all true changes (no null hypotheses hold)
  • $Q = 0 < \alpha$
  • Build argument by induction

17
Storey's pFDR
  • Storey argues that $E(Q \mid R > 0)$ is the quantity
    of real interest
  • Sometimes quite different from B-H

18
A Bayesian Interpretation
  • Suppose nature generates true nulls with
    probability $\pi_0$ and false nulls with probability $\pi_1$
  • Then pFDR = P(H true | rejected by the procedure)
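
Written out with Bayes' rule, the interpretation on this slide reads as follows. This is a sketch: $\pi_0$ and $\pi_1$ denote the prior probabilities of a true and a false null, and "rejected" stands for the event that the procedure selects the gene.

```latex
\mathrm{pFDR}
  = P(H \text{ true} \mid \text{rejected})
  = \frac{\pi_0 \, P(\text{rejected} \mid H \text{ true})}
         {\pi_0 \, P(\text{rejected} \mid H \text{ true})
          + \pi_1 \, P(\text{rejected} \mid H \text{ false})}
```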

19
Storey's Procedure
20
Practical Issues
  • Actual proportion of false positives varies from
    data set to data set
  • Mean FDR could be low, but the actual proportion could
    still be high in your data set

21
The Effect of Correlation
  • If all genes are uncorrelated, Sidak is exact
  • If all genes were perfectly correlated
  • p-values for one are p-values for all
  • No multiple-comparisons correction needed
  • Typical gene data is highly correlated
  • First singular value of the SVD may account for more
    than half the variance
  • Distribution of p-values may differ from uniform
  • True FDR more variable

22
Symptoms of Correlated Tests
[Figure: P-value histograms]
23
Distributions of numbers of p-values below threshold
  • 10,000 genes
  • 10,000 random drawings
  • [Figure: left panel, uncorrelated; right panel, highly correlated]
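
The contrast on this slide can be reproduced in a small simulation. This is a sketch under stated assumptions: null z-scores share a single common factor with pairwise correlation rho (a simple stand-in for real gene correlation), and the sizes are scaled down from the slide's 10,000 x 10,000 to keep the run short. All names are illustrative.

```python
import math
import random
import statistics

def count_significant(n_genes, n_drawings, rho, z_cut=1.96, seed=0):
    """For each drawing, count null genes with |z| > z_cut (two-sided p < .05).

    Each z-score is sqrt(rho) * shared + sqrt(1 - rho) * own noise, so it is
    standard normal marginally and has correlation rho with every other gene.
    """
    rng = random.Random(seed)
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    counts = []
    for _ in range(n_drawings):
        shared = rng.gauss(0.0, 1.0)
        counts.append(sum(1 for _ in range(n_genes)
                          if abs(a * shared + b * rng.gauss(0.0, 1.0)) > z_cut))
    return counts

uncorrelated = count_significant(1000, 200, rho=0.0)
correlated = count_significant(1000, 200, rho=0.5)
print(statistics.mean(uncorrelated), statistics.pstdev(uncorrelated))
print(statistics.mean(correlated), statistics.pstdev(correlated))
```

Both settings have the same expected count, but the correlated counts are far more spread out: most drawings give few false positives and some give very many.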

24
Permutation Tests
  • We don't know the true distribution of gene
    expression measures within groups
  • We simulate the distribution of samples drawn
    from the same group by pooling the two groups
    and randomly selecting two groups of the same
    sizes as the ones we are testing.
  • Need at least 5 in each group to do this!

25
Permutation Tests: How To
  • Suppose samples 1, 2, ..., 10 are in group 1 and
    samples 11, ..., 20 are from group 2
  • Permute 1, 2, ..., 20, say
  • 13,4,7,20,9,11,17,3,8,19,2,5,16,14,6,18,12,15,10
  • Construct t-scores for each gene based on these
    groups
  • Repeat many times to obtain the null distribution of
    t-scores
  • This will be a t-distribution if the original
    distribution has no outliers
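
For a single gene, the recipe above might look like this. It is a minimal sketch: the Welch form of the t-score, the two-sided comparison, the add-one p-value estimate, and the example expression values are all assumptions made for illustration.

```python
import math
import random

def t_score(group1, group2):
    """Welch t-score for the difference in means between two groups."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m1, v1 = mean_var(group1)
    m2, v2 = mean_var(group2)
    return (m1 - m2) / math.sqrt(v1 / len(group1) + v2 / len(group2))

def permutation_p_value(values, n1, n_perm=2000, seed=0):
    """Two-sided permutation p-value; the first n1 values form group 1."""
    rng = random.Random(seed)
    observed = abs(t_score(values[:n1], values[n1:]))
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)             # random relabelling of the samples
        if abs(t_score(shuffled[:n1], shuffled[n1:])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)      # add-one estimate, never exactly 0

group1 = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 5.2]
group2 = [6.0, 6.3, 5.8, 6.1, 6.4, 5.9, 6.2, 6.0, 6.1, 5.7]
p = permutation_p_value(group1 + group2, n1=10)
print(p)
```

The smallest p-value this can report is 1 / (n_perm + 1), so the number of permutations limits the resolution.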

26
Multivariate Permutation Tests
  • Want a null distribution with same correlation
    structure as given data but no real differences
    between groups
  • Permute group labels among samples
  • redo tests with pseudo-groups
  • repeat ad infinitum (10,000 times)

27
Critiques of Permutations
  • Variances of permuted values for truly changed
    genes are inflated
  • artificially low p-values
  • Permuted t-scores for many genes may be lower
    than from random samples from the same population

28
Permutations for FWER
  • Typically tests are correlated
  • Extreme case: all tests highly correlated
  • One test is proxy for all
  • Corrected p-values are the same as
    uncorrected
  • Intermediate case: some correlation
  • Usually probability of obtaining a p-value by
    chance is in between Sidak and uncorrected values

29
Westfall-Young Approach
  • How often is smallest p-value less than a given
    p-value if tests are correlated to the same
    extent and all Nulls are true?
  • Construct permuted samples $n = 1, \dots, N$
  • Determine p-values $p_n$ for each sample
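
One way to sketch this idea in code is the single-step "max T" variant, which asks how often the largest |t| over all genes in a permuted data set reaches a gene's observed |t|. Using |t| instead of the smallest p-value is a simplification swapped in here, not necessarily the form on the original slide. The function names and the simulated data, with one planted change, are made up for illustration.

```python
import math
import random

def t_score(group1, group2):
    """Welch t-score for the difference in means between two groups."""
    def mean_var(xs):
        m = sum(xs) / len(xs)
        return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    m1, v1 = mean_var(group1)
    m2, v2 = mean_var(group2)
    return (m1 - m2) / math.sqrt(v1 / len(group1) + v2 / len(group2))

def max_t_adjusted_p_values(data, n1, n_perm=500, seed=0):
    """data: one list of expression values per gene, samples in the same order;
    the first n1 samples form group 1. Returns one adjusted p-value per gene."""
    rng = random.Random(seed)
    observed = [abs(t_score(g[:n1], g[n1:])) for g in data]
    exceed = [0] * len(data)
    columns = list(range(len(data[0])))
    for _ in range(n_perm):
        rng.shuffle(columns)              # same relabelling for every gene,
        c1, c2 = columns[:n1], columns[n1:]   # which keeps the correlations
        max_t = max(abs(t_score([g[c] for c in c1], [g[c] for c in c2]))
                    for g in data)
        for j, t in enumerate(observed):
            if max_t >= t:
                exceed[j] += 1
    return [(e + 1) / (n_perm + 1) for e in exceed]

rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(12)] for _ in range(20)]
data[0] = [x + (10.0 if s >= 6 else 0.0) for s, x in enumerate(data[0])]
adjusted = max_t_adjusted_p_values(data, n1=6)
print(adjusted[0], min(adjusted[1:]))
```

Because whole samples are relabelled together, each permuted data set has the same gene-to-gene correlation structure as the real data.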

30
Permutations for FDR - B-H Style
  • Estimate p-values in the spirit of W-Y (but
    without multiple testing correction)
  • t.j is the permutation p-value for gene j
  • N is the number of tests
  • I is the number of permutations
  • Apply B-H procedure to these p-values

31
Permutations for FDR - Korn Style
  • B-H procedure only guarantees long-term behavior
    of method
  • can be quite badly wrong
  • Korn addresses issue of correlations

32
Moderated Tests
  • Many false positives with t-test arise because of
    under-estimate of variance
  • Most gene variances are comparable
  • (but not equal)
  • Can we use pooled information about all?

33
Stein's Lemma
  • Whenever you have multiple variables with
    comparable distributions, you can make a more
    efficient joint estimator by shrinking the
    individual estimates toward the common mean
  • Can formalize this using Bayesian analysis
  • Suppose true values come from a prior distribution
  • Mean of all parameter estimates is a good
    estimate of prior mean
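
The shrinkage idea can be illustrated with a positive-part James-Stein-style estimator that pulls each estimate toward the grand mean. This is a sketch under stated assumptions: the noise variance is taken as known and equal for all variables, and the simulated "true values" come from a normal prior. The names are illustrative.

```python
import random

def shrink_toward_mean(estimates, noise_variance):
    """Shrink each estimate toward the grand mean (needs more than 3 values
    that are not all equal)."""
    k = len(estimates)
    grand_mean = sum(estimates) / k
    spread = sum((x - grand_mean) ** 2 for x in estimates)
    # Shrink more when the observed spread is mostly explained by noise.
    factor = max(0.0, 1.0 - (k - 3) * noise_variance / spread)
    return [grand_mean + factor * (x - grand_mean) for x in estimates]

rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(200)]        # simulated true values
observed = [t + rng.gauss(0.0, 1.0) for t in truth]      # one noisy estimate each
shrunk = shrink_toward_mean(observed, noise_variance=1.0)

def total_squared_error(estimates):
    return sum((e - t) ** 2 for e, t in zip(estimates, truth))

print(total_squared_error(observed), total_squared_error(shrunk))
```

The shrunken estimates have a smaller total squared error than the raw ones, even though each individual estimate is now biased toward the mean.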

34
SAM
  • Statistical Analysis of Microarrays
  • Uses a fudge factor to shrink individual SD
    estimates toward a common value
  • $d_i = (\bar{x}_{1,i} - \bar{x}_{2,i}) / (s_i + s_0)$
  • Patented!
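
The effect of the fudge factor can be seen in a small sketch. The function name and data are made up, and $s_0$ defaults to the median of the per-gene standard errors, which is a placeholder assumption and not SAM's own rule for choosing it.

```python
import math
import statistics

def sam_like_scores(data, n1, s0=None):
    """Return (scores, s0) with score_i = (mean1_i - mean2_i) / (s_i + s0).

    data: one list of expression values per gene; first n1 samples are group 1.
    s_i is the pooled standard error of the difference in means.
    """
    diffs, ses = [], []
    for gene in data:
        g1, g2 = gene[:n1], gene[n1:]
        n2 = len(g2)
        pooled_var = ((n1 - 1) * statistics.variance(g1)
                      + (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
        diffs.append(statistics.mean(g1) - statistics.mean(g2))
        ses.append(math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2)))
    if s0 is None:
        s0 = statistics.median(ses)   # ASSUMPTION: placeholder, not SAM's rule
    return [d / (s + s0) for d, s in zip(diffs, ses)], s0

data = [
    [1.00, 1.01, 0.99, 1.05, 1.06, 1.04],   # tiny difference, tiny variance
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],         # large difference, larger variance
    [2.0, 2.5, 3.0, 2.2, 2.6, 3.1],         # no clear difference
]
t, _ = sam_like_scores(data, n1=3, s0=0.0)  # s0 = 0 gives ordinary t-scores
d, s0 = sam_like_scores(data, n1=3)
print(t, d, s0)
```

With ordinary t-scores the first gene ranks highest because its variance estimate is very small; with $s_0$ added to the denominator the second gene ranks highest.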

35
limma
  • Empirical Bayes formalism
  • Depends on prior estimate of number of genes
    changed
  • Bioconductor's approach: free!