Title: Statistics for Microarrays
1Statistics for Microarrays
Data Preprocessing, Exploratory Data Analysis,
Significance Analysis, Clustering, PCA, and
Prediction
??? 11/18/2003 References 1. http//statwww.epfl
.ch/davison/teaching/Microarrays/
2. Data Analysis Tools for DNA Microarrays, Sorin
Draghici, 2003, CRC.
2Flowchart ref.1
Biological question Differentially expressed
genes Sample class prediction etc.
Experimental design
Microarray experiment
16-bit TIFF files
Image analysis
(Rfg, Rbg), (Gfg, Gbg)
Normalization
R, G
Estimation
Testing
Clustering
Discrimination
Biological verification and interpretation
3cDNA Microarrays ref.1
Compare the genetic expression in two samples of
cells
PRINT cDNA from one gene on each spot
SAMPLES cDNA labelled red/green
e.g. treatment / control / tumor
normal tissue
4(contd) ref.1
SCAN
HYBRIDIZE Add equal amounts of labelled cDNA
samples to microarray.
Laser
Detector
5Images examples ref.1
6Quantification of Expression
For each spot on the slide we calculate
Red intensity Rfg - Rbg and
(fg foreground, bg background)
Green intensity Gfg - Gbg and combine them
in the log (base 2 or 10 or e) ratio Log2( Red
intensity / Green intensity)
7Gene Expression Data ref.1
On p genes for n slides p is O(10,000), n is
O(10-100), but growing,
Slides
slide 1 slide 2 slide 3 slide 4 slide 5 1
0.46 0.30 0.80 1.51 0.90 ... 2 -0.10 0.49
0.24 0.06 0.46 ... 3 0.15 0.74 0.04 0.10
0.20 ... 4 -0.45 -1.03 -0.79 -0.56 -0.32 ... 5 -0.
06 1.06 1.35 1.09 -1.09 ...
Genes
Gene expression level of gene 5 in slide 4
Log2( Red intensity / Green intensity)
These values are conventionally displayed on a
red (gt0) yellow (0) green (lt0) scale.
8Differential Expression Analysis / Significance
Analysis
Goal Identify differentially regulated genes in
two samples. (1) Visualization Tools QQ
plots, boxplots, scatter plot, MA plots,
histograms, PCA, ICA, (2) Differential
Expression Analysis Fold change method,
unusual ratio method, hypothesis testing,
ANOVA, ML,
9(W1) Fold Change Method (X)
(Way) Setting thresholds (/-2 or 3) for log2
ratio data and selecting the genes outside such
thresholds. (Advantage) Simplest and intuitive.
(Drawback) (a) Threshold is chosen arbitrarily
(might improper). (b) A bad signal/noise ratio
for genes with low expression levels. ? a funnel
shape on a MA-plot (c) A constant threshold for
all genes might overestimate the regulation at
low intensity and underestimate it at high
intensity.
10Example Fold-change on a scatter plot
- T cells, stimulated (y axis) vs. control (x
axis) - All genes outside inner R lines are dysregulated
(up or down)
11Example MA-plot
log2R vs log2G
Mlog2R/G vs Alog2vRG
12(W2) Unusual Ratio Method (X)
(Way) Applying z-transform to the log2 ratio and
selecting the genes outside /-2 (S.D.).
(Advantage) Auto.adjust the cut-off threshold by
the of genes and variability of regulation
(Drawback) (a) Report a fixed proportion (e.g.
5) of the genes as differentially regulated no
matter no or more genes are in fact regulated.
i.e., always pick the most-affected genes
(?) (b) Use cut-off boundaries will overestimate
at low and underestimate at high intensity.
13Selecting by the unusual ratio criterion
The frequencies of the z-scores of the genes are
plotted in a histogram. The horizontal axis is
now graded in standard deviations.
14(W3) Univariate Stat. Tests
(Way) Applying hypothesis testing (e.g. t-test
or Wilcoxon) modified w/ correction for multiple
comparisons (Holm step-down, FDR - false
discovery rate, permutations, SAM - significance
analysis of microarray, ). (Drawback) A bit
conservative (due to insufficient data) and
independent genes assumption. (Improvement) Use
re-sampling and take into consideration
dependencies among genes. (Advantage) Classic
statistical tests. Simple. Recommended
(after correction for multiple comparisons) !
15Statements in Hypotheses
Assume that the underlying distribution of
mean intensities of cDNA spots of a normal gene
has meanµ0 and variances02, (i) The gene is
up-regulated in the condition under studyµgtµ0
. (ii) The gene is down-regulatedµltµ0 . (iii)
The gene is unexpressedµµ0 . (iv) The gene is
(differentially) expressedµ?µ0 .
16Hypothesis Testing Procedure
(i) Specify the H0 vs. H1 (mutually exclusive
and all inclusive). (ii) Choose the significance
level a. (iii) Take an appropriate test
statistic, the degree of freedom and calculate
the critical values (cut-off thresholds) and
p-values. (iv) Make a decision of rejecting or
not rejecting H0 by the critical-value method or
p-value method. Hypothesis testing is not
centered on the data it is centered on our a
priori beliefs about it.
17One-sided
Cannot reject H0
Reject H0 (rejection region)
Threshold (critical value)
18Two-sided
Cannot reject H0
Reject H0 (rejection region)
Thresholds
19Errors in Hypothesis Testing
Truth
Decision
a Pr( reject H0 H0) a rejection error ß Pr(
not reject H0 H1) an acceptance error power
1 ß Pr( reject H0 H1) ? the purpose of
testing
20Diagnostic Tests
Truth
Test
Sensitivity Pr(T D) , Specificity Pr(T
Dc) FN Pr(T D) , FP Pr (T
Dc)
21Performance of the Pap smear as a diagnostic test
for cervical cancer
1,000,000 Women
Prevalence0.000083
999,917 No Cervical Cancer
83 Cervical Cancer
Specificity0.8136
Sensitivity0.8375
70 Test
13 Test-
813,532 Test-
186,385 Test
186,455 Test
813,545 Test-
Observed Test Results
22More in Diagnosis - ROC Curve
Diagnosis is an imperfect process. It is
desirable to have a test that is both highly
sensitive and highly specific. However,
increasing the sensitivity would decrease the
specificity because of lowering the cutoff
point. A ROC (receiver operator characteristic)
curve is a line graph that plots the
probabilities of sensitivity (true positive, TN)
against a false positive result (FN) for
different cutoff values of a test. The point
lying closest to the upper left-hand corner of
the line will maximize both sensitivity and
specificity simultaneously. Different tests have
different ROC curves.
23ROC Curve
24a ß
H1
H0
Type II, ß
Type I,a
Threshold
25Decreasinga? increasing threshold andß
H0
H1
Type II, ß
Type I,a
Threshold
26More in Testing Increasing n
The trade-off between a and ß in testing is
similar to sensitivity (1-a) and specificity
(1-ß) in diagnosis. The way to diminish a
and ß simultaneously is to reduce the amount of
overlap in the two distn under H0, H1. (i)
Farther apart the centers of two distributions,
i.e., µ0, µ1 . (ii) Increase the sample size n
(increasing n ? decreasing the standard error
s/vn ? the two sampling distributions becoming
more narrow ? lessening overlap). (iii) Reduce
the underlying population variances2
(difficult). (iv) More powerful test statistic
27(W3) Hypothesis Testing
(a) null hypothesis (H0) vs. alternative/research
hypothesis (H1 or Ha) (b) significance level
(e.g.a0.05, specified before the test is carried
out) (c) 1-sided or 2-sided test (?depends on H1
) (d) parametric or nonparametric analysis (e)
test statistic, rejection region, critical value
/ threshold, p-value (f) type I error, type II
error, sensitivity, specificity, PV(), PV(-),
power
28Example 1 - Building Hypotheses
e.g.1 If one is selecting genes with at least
2-fold change between control and experiment,
H0 µ?2 the null hypothesis, vs.
H1 µgt 2 the research hypothesis (H0 H1
, mutually exclusive and all inclusive)
29Example 2
e.g.2 Assume that the distribution of mean
intensities of cDNA spots of a gene is N(400,
1502) from the literature. We expect the gene to
be up-regulated in the condition under study and
what would we conclude if we observe a spot with
a mean intensity higher than 700? H0 µ?µ0
the null hypothesis, vs. H1
µgtµ0 the research hypothesis Z (X
µ0)/s0 , where µ0 400, s0 150 The
probability of having such a value just by
chance, i.e. p-value P( X gt 700 ) P(
(X µ0)/s0 gt (700-400)/150 )
P( Z gt 2 ) 0.0228
30One-sided
Cannot reject H0
Reject H0 (rejection region)
Threshold (critical value)
31(contd)
The probability that an unexpressed gene has
such a high intensity spot is only about 2.28.
If the pre-chosen significance level a is
0.05, then we can reject the H0 because of 0.0228
lt 0.05 (for 1-sided). It means that if this
gene actually is a normal gene but not
up-regulated gene, then the probability of our
mistake decision (0.0228) is lower than our
significance threshold (0.05), the maximum
acceptable level for this error probability.
Thus, we conclude that
the gene is up-regulated at
5 significance level. or the gene is
up-regulated with a p-value of 0.0228.
32Significance Level
Choosing a significance level means choosing a
maximum acceptable level for this error
probability. The significance level, a, is the
amount of uncertainty we are prepared to accept
in our studies. e.g., A significance level of
10 we accept that 1 in 10 cases our
conclusion can be wrong. A significance
level of 20we can be wrong 1 out of 5 cases.
Usually, significance levels a are 1, 5, or
10.
33p-value
The p-value is the exact probability of making a
mistake for any given threshold. e.g. For
choosing up-regulated genes (1-sided), its the
probability of a value being higher than a
threshold k PH0( Z gt k ) 1 PH0( Z?k ),
where Z (X µ0)/s0 This is exactly the
probability of making a mistake if we used k as
the threshold for choosing up-regulated
genes. The p-value provides information about
the amount of trust we can place in a decision
made using the given threshold.
34Example 3
e.g.3 If we set the log2-ratio threshold for
up-regulated genes 2, and there is a gene with
a measurement of 4, then we will call that gene
up-regulated. However, a gene from the
controls might have such a high expression
value (4), too. This is almost unlikely, but it
may happen with very low possibility. If this
happens and we call such a gene up-regulated,
we will make a mistake even if the gene has a
high expression level, it is from the
distribution of controls. The probability of
this happening is called the p-value.
35(contd)
The p-value is the probability of a measurement
more extreme than a certain threshold occurring
just by chance. In e.g.1, this corresponds to
rejecting the null hypothesis H0 µ?2 if this
null hypothesis is in fact true. The p-value is
the probability of drawing the wrong conclusion
by rejecting a true H0 (under H0).
36(W4) ANOVA (ANalysis Of VAriance)
(Way) Build an explicit model about the sources
of variance and estimate the variance of each
individual variable in the model. e.g. model
log(yijkg) µ Ai Dj Gg (AD)ij
Kerr-Churchill (AG)ig (VG)kg
(DG)jg eijkg where µ- overall mean signal, Ai
- effect of the ith array, Dj effect of the jth
dye, Gg variation of the gth gene, (AD)ij
array-dye int., (AG)ig array-gene int., (DG)jg
dye-gene int.,eijkg error term here, a
variety is a condition such as disease or healthy
(CS/CN), so (VG)kg interaction of kth variety
and gth gene (!! Interested !!)
37(contd)
(Result) The differentially regulated genes
will be the genes for which the (VG)kg factor
representing significant. (int. of CS/CN and gene
G) (Advantage) Clearly distinguish between
interesting variables (gene regulation) and side
effects (differences due to dyes or arrays).
(Drawback) A very careful DOE to ensure a
sufficient number of degree of freedom. Rule
block what you can, randomize what you cannot
38(W5) MLE Methods (Maximum
Likelihood Estimation)
(Way) Build a mixture model and estimate the
parameters of the model by a ML (maximum
likelihood) approach based on observed data.
e.g. model fj(y) p?fEj(y) (1-p)?fUj(y) Lee
et al. where fj(y) the pdf of observed
intensity value y in spot j, p the
prior prob. of a gene being expressed, (1-p) -
unexp. fEj(y) the pdf of observed y
given an expressed gene, fUj(y) the
pdf of observed y given an unexpressed gene,
E - the event of a gene is expressed,
U the event of a gene is unexpressed
39(contd)
(Process) Applying Bayes theorem, for an
intensity value Yjy, we have P(EYjy)
P(E)?P(YjyE) / P(Yjy)
p?fEj(y) / fj(y) p?fEj(y) /
p?fEj(y) (1-p)?fUj(y) The prob. of a
gene is expressed given an observed intensity
y. In Lee et al., by assuming that
fEj(y), fUj(y) are normal distributions, they
calculated the likelihood of the model and
estimated the parameters of p, µEj, sEj, µUj, sUj
by a ML approach, which searches various
combinations of the parameters such as the
likelihood function fits the data as best as
possible (the most likely based on observed data).
40(contd)
(Advantage) (a) In large sample, the MLEs are
UMVUE.
(unbiased minimum variance estimators) (b)
Flexibility and large applicability (test
hypotheses about models and parameters) by
likelihood functions. (Drawback) (a)
Computationally intensive! (solve non-linear
equations) (b) A more complex model if the genes
are divided into up-, down- and no-regulated.
(c) Unreliable as the sample size decreases or
data deviates from assumed distribution.
41Test Statistics (1)
42Test Statistics (2)
- Qualitative covariates
- e.g. two-sample t-statistic, Mann-Whitney
statistic, F-statistic, - Quantitative covariates
- e.g. standardized regression coefficient
- Survival response
- e.g. score statistic for Cox model
43Criteria of Gene Selection
- Let TP be the of true positives, TN be true
negatives, FP be false positives, FN be false
negative, then - Specificity TN / (TN FP)
- Sensitivity TP / (TP FN)
- Accuracy (TP TN) /N
- PPV TP / (TP FP)
- NPV TN / (TN FN)
- Where PPV, NPV are positive predicted value,
negative predicted value, resp., and N TP FP
TN FN.
The proportion (TP FN)/N is called prevalence.
44Performance of Gene Selection Methods
True
Reported
Accuracy (TPTN)/N, Prevalence (TPFN)/N A
perfect method will produce PPVNPVspec.sen.1