1
  • Data Analysis for Gene Chip Data, Part I:
    One-gene-at-a-time methods
  • Min-Te Chao
  • 2002/10/28

2
Outline
  • Simple description of gene chip data
  • Earlier works
  • Multiple t-tests and SAM
  • Lee's ANOVA
  • Wong's factor model
  • Efron's empirical Bayes

3
Remarks
  • Most of this work is statistical analysis, not
    really of the machine-learning type
  • Very small training samples, to say nothing of
    test samples
  • Medical research needs scientific rigor wherever
    we can provide it

4
Arthritis and Rheumatism
  • Guidelines for the submission and review of
    reports involving microarray technology
  • v.46, no.4, 859-861

5
Reproducibility
  • Should document the accuracy and precision of the
    data, including run-to-run variability of each
    gene
  • No arbitrary setting of thresholds (e.g., 2-fold)
  • Careful evaluation of the false discovery rate

6
Statistical Analysis
  • Statistical analysis is absolutely necessary to
    support claims of an increase or decrease of gene
    expression
  • Such rigor requires multiple experiments and
    analysis with standard statistical instruments.

7
Sample Heterogeneity
  • Strongly recommends that investigators focus
    studies on homogeneous cell populations until
    other methodological and data-analysis problems
    can be resolved.

8
Independent Confirmation
  • It is important that the findings be confirmed
    using an independent method, preferably with
    separate samples rather than retesting of the
    original mRNA.

9
Microarray
  • Other terms
  • DNA array
  • DNA chips
  • biochips
  • Gene chips

10
  • The underlying principle is the same for all
    microarrays, no matter how they are made
  • Gene function is the key element researchers want
    to extract from the sequence
  • The DNA array is one of the most important tools
  • (Nature, v.416, April 2002, 885-891)

11
Two types of microarray
  • cDNA
  • Oligonucleotide
  • (plus DIY types)

12
  • A microarray allows researchers to determine
    which genes are being expressed in a given cell
    type at a particular time and under particular
    conditions
  • Gene-expression profiling

13
Basic data form
  • On each array there are p spots (p > 1000,
    sometimes 20000). Each spot has k probes (k = 20
    or so). There are usually 2k measurements
    (expressions) per spot, and the k differences, or
    differences of logs, are used.
  • Sometimes you are only given a summary statistic,
    e.g. the median or mean, per spot
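A minimal sketch of this per-spot reduction (simulated intensities; the array names pm and mm are illustrative, not from the talk):

    import numpy as np

    # Hypothetical probe-level data: p spots, k probe pairs per spot,
    # with paired perfect-match (PM) and mismatch (MM) intensities.
    rng = np.random.default_rng(0)
    p, k = 1000, 20
    pm = rng.lognormal(mean=7.0, sigma=1.0, size=(p, k))  # PM intensities
    mm = rng.lognormal(mean=6.5, sigma=1.0, size=(p, k))  # MM intensities

    # Per-spot summaries: the k differences averaged, or the median
    # of the log differences.
    diff_summary = (pm - mm).mean(axis=1)
    log_ratio_summary = np.median(np.log(pm) - np.log(mm), axis=1)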

14
  • Each spot corresponds to a gene
  • For each study, we can arrange the chips so that
    the i-th spot represents the i-th gene. (Genes
    close in index may not be close physically at
    all.)
  • This means that when we read the i-th spot of
    all chips in one study, we know we get different
    measurements of the same i-th gene

15
  • Data from one chip can be arranged in matrix
    form,
  • (Y; X_1, X_2, ..., X_p)
  • just as in a regression setup. But in practice, n
    (the number of chips used) is small compared with
    p.
  • Y is the response: cell type, experimental
    condition, survival time, ...

16
  • For a spot with 20 probes, see Efron et al.
    (2001, JASA, p.1153).

17
Earlier works
  • Cluster analysis
  • Fold methods
  • Multiple t with Bonferroni correction

18
Multiple t with Bonferroni correction
  • It is too conservative
  • Family wise error rate
  • Among G tests, the probability of at least one
    false reject basically goes to 1 with
    exponential rate in G

19
  • Šidák's single-step adjusted p-value:
  • p~ = 1 - (1 - p)^G
  • Bonferroni's single-step adjusted p-value:
  • p~ = min(Gp, 1)
  • All are very conservative
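A minimal sketch of both adjustments in Python (the function name is mine, not from the slides):

    import numpy as np

    def adjust_pvalues(p):
        """Single-step adjusted p-values for G simultaneous tests."""
        p = np.asarray(p, dtype=float)
        G = len(p)
        sidak = 1.0 - (1.0 - p) ** G          # Sidak: 1 - (1 - p)^G
        bonferroni = np.minimum(G * p, 1.0)   # Bonferroni: min(G*p, 1)
        return sidak, bonferroni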

20
FDR: false discovery rate
  • Roughly: among all rejected cases, how many are
    rejected wrongly?
  • (Benjamini and Hochberg, 1995, JRSSB, 289-300):
    the sequential p-method

21
Sequential p-method
  • Using the observed data, it estimates the
    rejection region so that
  • FDR <= alpha
  • Order all p-values from smallest to largest, and
    find the largest k with p_(k) <= (k/G) alpha; the
    first k hypotheses (those with the k smallest
    p-values) are rejected.
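A sketch of the step-up rule just described (assuming the p_(k) <= (k/G) alpha criterion above; the function name is mine):

    import numpy as np

    def benjamini_hochberg(p, alpha=0.05):
        """Benjamini-Hochberg step-up procedure, FDR <= alpha.
        Returns a boolean rejection mask over the G hypotheses."""
        p = np.asarray(p, dtype=float)
        G = len(p)
        order = np.argsort(p)
        below = p[order] <= (np.arange(1, G + 1) / G) * alpha
        reject = np.zeros(G, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max() + 1   # largest k passing the bound
            reject[order[:k]] = True             # reject the k smallest p-values
        return reject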

22
  • Since we control a different definition of error,
    this increases the power
  • For modifications, see Storey (2002, JRSSB,
    479-498)
  • These criteria are specifically designed to
    handle risk assessment when G is large

23
Role of permutation
  • For tests (multiple or not), it is important to
    use a null distribution
  • It is generated by a well-designed permutation
    (of the columns of the data matrix); "column"
    refers to observations, not genes.

24
One simple example
  • Let us say we look at the first gene, with n_1
    arrays for treatment and n_2 arrays for control
  • We use a t-statistic, t_1, say. What is the
    p-value corresponding to this observed t_1?

25
  • Permute the n = n_1 + n_2 columns of the data
    matrix. Look at the first row (it corresponds to
    the first gene)
  • Treat the first n_1 numbers as a fake
    "treatment" and the last n_2 numbers as a fake
    "control"; compute a t-value, say s_1

26
  • Permute again, do the same thing, and get s_2, ...
  • Do it B times and get s_1, s_2, ..., s_B
  • Treat these s's as a (bootstrap) sample from the
    null distribution of the t_1 statistic
  • The p-value of the earlier t_1 is found from the
    ecdf of the s_j, j = 1, 2, ..., B
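A sketch of this recipe for one gene (the function name and the two-sided convention are my choices):

    import numpy as np
    from scipy import stats

    def permutation_pvalue(row, n1, B=1000, seed=0):
        """Permutation p-value for one gene. `row` holds the n1 treatment
        expressions followed by the n2 control expressions; we permute
        columns (observations), never genes."""
        rng = np.random.default_rng(seed)
        t_obs = stats.ttest_ind(row[:n1], row[n1:]).statistic
        s = np.empty(B)
        for b in range(B):
            fake = rng.permutation(row)          # fake treatment/control split
            s[b] = stats.ttest_ind(fake[:n1], fake[n1:]).statistic
        # Two-sided p-value from the ecdf of s_1, ..., s_B.
        return np.mean(np.abs(s) >= np.abs(t_obs))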

27
  • Permutation plays a major role --- it provides a
    reference measure of variation in various
    situations
  • For a well-designed microarray experiment, DOE
    techniques will play an important role in
    determining how to do proper permutations.

28
SAM: significance analysis of microarrays
  • A standard method of microarray analysis, taught
    many times in Stanford short courses on data
    mining
  • Modified multiple t-tests
  • Uses permutation of certain data columns to
    evaluate the variation of the data for each gene

29
  • The original paper is hard to read
  • (Tusher, Tibshirani and Chu, PNAS 2001, v.98,
    no.9, 5116-5121)
  • But the SAM manual is much easier to read for
    statisticians (the software is free for academic
    use)

30
  • D(i) = (Xbar_treatment(i) - Xbar_control(i)) /
    (s(i) + s_0), i = 1, 2, ..., G
  • Ordered: D(1) <= D(2) <= ...
  • Used in SAM; s_0 is a carefully determined
    constant > 0.

31
  • The D(i) are recomputed under a certain group of
    permutations of the columns; these permuted D(i)
    are also ordered, and their averages give the
    expected order statistics Dbar(i)
  • Plot D vs. Dbar: points off the 45-degree line by
    more than a threshold Delta signal significant
    expression change.
  • Control the value of Delta to get different FDRs.
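A minimal sketch of the SAM statistic and the Delta rule (function names are mine; SAM itself also tunes s_0 and estimates the FDR from the permutations):

    import numpy as np

    def sam_statistic(X, n1, s0):
        """Modified t per gene: D = (mean_trt - mean_ctl) / (s + s0).
        X has genes in rows; the first n1 columns are treatment."""
        trt, ctl = X[:, :n1], X[:, n1:]
        n2 = ctl.shape[1]
        diff = trt.mean(axis=1) - ctl.mean(axis=1)
        pooled = ((n1 - 1) * trt.var(axis=1, ddof=1) +
                  (n2 - 1) * ctl.var(axis=1, ddof=1)) / (n1 + n2 - 2)
        s = np.sqrt(pooled * (1 / n1 + 1 / n2))
        return diff / (s + s0)

    def sam_calls(X, n1, s0, delta, B=200, seed=0):
        """Compare ordered observed D(i) with the permutation-averaged
        ordered D(i); flag positions deviating by more than delta."""
        rng = np.random.default_rng(seed)
        d_obs = np.sort(sam_statistic(X, n1, s0))
        d_perm = np.empty((B, X.shape[0]))
        for b in range(B):
            cols = rng.permutation(X.shape[1])
            d_perm[b] = np.sort(sam_statistic(X[:, cols], n1, s0))
        d_bar = d_perm.mean(axis=0)   # the 45-degree reference line
        return np.abs(d_obs - d_bar) > delta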

32
Other model-based methods
  • Wong's model:
  • PM_ij - MM_ij = theta_i * phi_j + epsilon_ij
    (theta_i: expression index of array i; phi_j:
    sensitivity of probe pair j)
  • Outlier detection
  • Model validation
  • Li and Wong (2001, PNAS v.98, no.1, 31-36)
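A rank-1 fit of this factor model can be sketched via the SVD (a stand-in for Li and Wong's iterative least-squares fit with outlier exclusion; the function name is mine):

    import numpy as np

    def li_wong_fit(y):
        """Fit y_ij ~= theta_i * phi_j for one gene, where y is the
        (arrays x probe pairs) matrix of PM - MM differences."""
        U, S, Vt = np.linalg.svd(y, full_matrices=False)
        theta = U[:, 0] * S[0]             # per-array expression indices
        phi = Vt[0]                        # per-probe-pair sensitivities
        resid = y - np.outer(theta, phi)   # residuals, e.g. for outlier detection
        return theta, phi, resid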

33
Lee's work
  • ANOVA-based
  • Can handle unbalanced data, e.g., 7 microarray
    chips
  • (Lee et al. 2000, PNAS, v.97, 9834-9839)

34
Empirical Bayes
  • (Efron et al. 2001, JASA, v.96, 1151-1160)
  • Use a mixture model
  • f(z) = p_0 f_0(z) + p_1 f_1(z)
  • with f_0, f_1 estimated from the data.
  • p_1 = prior probability that a gene's expression
    is affected (by a treatment)

35
  • A key idea is to use column-permuted data to
    estimate f_0
  • Use a clever logistic-regression method
  • Eventually obtain
  • p_1(Z), the a posteriori probability that a
    gene at expression level Z is affected
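A minimal sketch of the posterior computation, p_1(z) = 1 - p_0 f_0(z)/f(z) (a kernel density estimate stands in here for Efron's logistic-regression density-ratio trick, and p_0 is taken as given):

    import numpy as np
    from scipy.stats import gaussian_kde

    def posterior_affected(z_obs, z_null, p0=0.9):
        """p_1(z) for each observed score, with f_0 estimated from
        permuted-column scores (z_null) and f from observed scores."""
        f0 = gaussian_kde(z_null)
        f = gaussian_kde(z_obs)
        ratio = f0(z_obs) / f(z_obs)
        return np.clip(1.0 - p0 * ratio, 0.0, 1.0)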

36
Part I conclusion
  • Earlier methods are relatively easy to
    understand, but getting familiar with the
    bio-language takes time
  • More powerful data-analytic methods will continue
    to be developed
  • It is important to first understand the basic
    problems of biologists before we jump in with
    fancy statistical methods

37
  • We may be solving the wrong problem
  • But if the problem is relevant, even simple
    methods can get good recognition
  • All methods so far are first-moment only, i.e.,
    not very different from multiple t-tests; they
    are all one-gene-at-a-time methods.

38
  • We did not address issues of data cleaning,
    outlier detection, normalization, etc. Microarray
    data are highly noisy, so these problems are by
    no means trivial.
  • As the cost per chip goes down, the number of
    chips per problem may grow. But well-designed
    experiments, e.g., fractional factorials, still
    have room to play in this game

39
  • Statistical methods, as compared with
    machine-learning-based methods, will play a more
    important role for this type of data since, with
    a model, parametric or not, one can attach a
    measure of confidence to the claimed result. This
    is crucial for scientific development.

40
Quote
  • The statistical literature for microarrays, still
    in its infancy and with much of it unpublished,
    has tended to focus on frequentist
    data-analytical devices, such as cluster
    analysis, bootstrapping and linear models.
    (Efron, B. 2001)