1
Crash Course in Practical Microarray Analysis (Steilkurs in praktischer Mikroarray-Analyse)
Mainz, 22.6.2006
  • Andreas Buneß

2
Differential Gene Expression
Setting: Two or more types of samples (e.g. treated cell lines, biopsies). For each sample we have thousands of gene expression levels.
Goal: Which genes are differentially expressed?
3
Trade Off or Everything is a Compromise
Sensitivity: simply take all?
Specificity: simply take none?
Here: statistical testing framework
  • ranking of genes (ordered list w.r.t. up/down regulation)
  • cut-off, i.e. significant genes
4
T-test: difference of the means, but takes variance into account
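As a minimal sketch of this idea, the following Python snippet computes a t-statistic for one gene in two groups. It assumes the Welch variant (unequal variances); the function name `welch_t` and the expression values are hypothetical and chosen only for illustration. Both comparisons have the same difference of means, but the statistic shrinks when the within-group variance grows.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Welch t-statistic: difference of means scaled by its standard error."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

# Hypothetical log-expression values for one gene in two groups.
# Both comparisons have a mean difference of 1.0.
small_spread = welch_t([5.0, 5.1, 4.9, 5.0], [4.0, 4.1, 3.9, 4.0])
large_spread = welch_t([5.0, 7.0, 3.0, 5.0], [4.0, 6.0, 2.0, 4.0])

print(small_spread, large_spread)
```

The same mean difference yields a much larger statistic when the replicates agree closely, which is why ranking by the t-statistic differs from ranking by fold change.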
5
(No Transcript)
6
Multiple Testing
A statistical test for each gene g yields a p-value for each gene.
Under the null hypothesis of no differential expression, the p-value is a uniformly distributed random number between 0 and 1.

7
Multiple Testing: The Problem
Multiplicity problem: thousands of hypotheses are tested simultaneously. Increased chance of false positives.
E.g. suppose you have 10,000 genes on a chip and not a single one is differentially expressed. You would expect 10,000 × 0.01 = 100 of them to have a p-value < 0.01.
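The arithmetic above can be checked with a small simulation. This is only a sketch: it assumes independent tests, draws the null p-values as uniform random numbers, and uses an arbitrary fixed seed so that the run is repeatable.

```python
import random

n_genes = 10_000
alpha = 0.01

# Expected number of false positives when every null hypothesis is true.
expected_fp = n_genes * alpha

# Simulate null p-values as uniform random numbers on [0, 1).
rng = random.Random(0)  # fixed seed, chosen arbitrarily
p_values = [rng.random() for _ in range(n_genes)]
observed_fp = sum(p < alpha for p in p_values)

print(expected_fp, observed_fp)
```

The observed count fluctuates around the expected 100 from run to run, even though no gene is differentially expressed.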
8
Multiple Testing - Error Control
Null hypothesis: no differential expression between the two types/groups
R: all genes which are called significant
V: Type I errors, false positives (FP)
T: Type II errors, misses, false negatives (FN)
power of the test
9
What kind of error do we exactly want to control?
Error concepts, like FWER, FDR
How do we achieve this error control?
Procedures or methods, like Bonferroni, Holm, Benjamini-Hochberg, ...
R packages: multtest, qvalue / limma, samr
10
Family Wise Error Rate (FWER)
$\mathrm{FWER} = \Pr(V > 0)$
The probability of at least one Type I error (false positive) among the genes selected as significant.
11
False Discovery Rate (FDR)
$\mathrm{FDR} = E(Q)$, where $Q = V/R$ if $R > 0$ and $Q = 0$ if $R = 0$
The expected proportion of Type I errors among the rejected hypotheses, i.e. the expected proportion of false positives among all genes called significant.
12
p-values refer to a single gene.
FDR and FWER error control refers to a list of genes (the FDR corresponding to the ordered list up to the particular gene is often referred to as its q-value).
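One common way to attach such a list-based FDR value to each gene is the Benjamini-Hochberg adjustment, sketched below in plain Python. The function name `bh_adjust` and the p-values are hypothetical. Note that this is an assumption about which procedure to illustrate: the q-value of Storey's method additionally estimates the proportion of true null hypotheses, which this sketch does not do.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.6]  # hypothetical raw p-values
q = bh_adjust(p)

# Genes whose adjusted value is at most 0.05 form a list with FDR <= 5%.
significant = [i for i, v in enumerate(q) if v <= 0.05]
print(q, significant)
```

Each adjusted value depends on the rank of the gene in the ordered list, so it describes the list up to that gene rather than the gene alone.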
13
FDR: Horizontal cutoff
14
FWER: The Bonferroni Correction
15
Example
Golub data, 27 ALL vs. 11 AML samples, 3,051 genes.

98 genes with Bonferroni-adjusted p < 0.05, i.e. raw p < 0.000016
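The raw cutoff in this example follows directly from the Bonferroni rule. A minimal sketch in Python, using the gene count from the slide; the helper name `bonferroni_adjust` is hypothetical:

```python
n_genes = 3051
alpha = 0.05

# Bonferroni: a raw p-value is significant if p_raw < alpha / n_genes,
# equivalently if the adjusted p-value min(1, n_genes * p_raw) < alpha.
raw_cutoff = alpha / n_genes

def bonferroni_adjust(p_raw, m=n_genes):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, m * p_raw)

print(f"{raw_cutoff:.6f}")  # prints 0.000016
```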
16
FDR: The Optimal Discovery Procedure
John Storey; software: EDGE
17
FWER or FDR ?
Choose FWER control if high confidence in all
selected genes is desired. Choose FDR control
if a certain proportion of false positives is
tolerable - frequently used in practice.
18
Statistical Tests
  • Standard t-test: assumes normally distributed
    data in each class (almost always questionable,
    but may be a good approximation) and equal variances
    within classes
  • Welch t-test: as above, but allows for unequal
    variances
  • Wilcoxon test: nonparametric, rank-based
  • Permutation test: estimate the distribution of
    the test statistic (e.g., the t-statistic) under
    the null hypothesis by permutations of the sample
    labels. The p-value is given as the fraction of
    permutations yielding a test statistic that is at
    least as extreme as the observed one.
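The permutation test in the last bullet can be sketched as follows. For simplicity this sketch assumes the difference of group means as the test statistic rather than the t-statistic, and it enumerates all relabellings exactly, which is only feasible for small groups; with more samples one would draw random permutations instead. The function name and the data are hypothetical.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(x, y):
    """Two-sided exact permutation p-value for the difference of group means."""
    pooled = x + y
    n_x = len(x)
    observed = abs(mean(x) - mean(y))
    n_extreme = 0
    n_total = 0
    # Every way of assigning n_x of the pooled values to the first group.
    for idx in combinations(range(len(pooled)), n_x):
        chosen = set(idx)
        gx = [pooled[i] for i in chosen]
        gy = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        n_total += 1
        # Count relabellings at least as extreme as the observed labelling.
        if abs(mean(gx) - mean(gy)) >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / n_total

# Hypothetical expression values of one gene: 4 samples per class.
p = permutation_p_value([5.1, 5.3, 5.2, 5.4], [4.1, 4.3, 4.0, 4.2])
print(p)
```

With 4 samples per class there are only 70 relabellings, so the smallest attainable two-sided p-value is 2/70. This is a reminder that permutation p-values are limited by the number of replicates.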

19
SAM (significance analysis of microarrays)
  • permutation test
  • regularized t-statistics
  • multiple testing correction
R package: samr
20
T-test: difference of the means, but takes variance into account
21
Few replicates: regularized/moderated t-statistics
  • With the t-test, we estimate the variance of each
    gene individually. This is fine if we have enough
    replicates, but with few replicates (say 2-5 per
    group), the variance estimates are unstable.
  • In a moderated t-statistic, the estimated
    gene-specific variance $s_g^2$ is augmented with $s_0^2$,
    a global variance estimator obtained from pooling
    all genes. This gives an interpolation between
    the t-statistic and a fold-change criterion.
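A minimal sketch of such a moderated statistic is given below. It assumes the limma-style form, in which the gene-specific variance and the global variance are combined as a weighted average; the global variance `s0_sq` and its weight `d0` are simply passed in as hypothetical values here, whereas limma estimates both from all genes.

```python
import math
from statistics import mean, variance

def moderated_t(x, y, s0_sq, d0):
    """t-statistic whose gene-specific variance is shrunk towards s0_sq.

    s0_sq (global variance) and d0 (its weight, in degrees of freedom) are
    taken as given here; they are assumptions of this sketch.
    """
    n1, n2 = len(x), len(y)
    d = n1 + n2 - 2
    # Pooled gene-specific variance s_g^2.
    sg_sq = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / d
    # Weighted average of the gene-specific and the global variance.
    s_tilde_sq = (d0 * s0_sq + d * sg_sq) / (d0 + d)
    return (mean(x) - mean(y)) / math.sqrt(s_tilde_sq * (1 / n1 + 1 / n2))

# Hypothetical gene with a small difference and, by chance, a tiny variance.
x, y = [5.00, 5.01, 4.99], [4.90, 4.91, 4.89]
t_ordinary = moderated_t(x, y, s0_sq=0.04, d0=0)   # d0 = 0: ordinary t
t_moderated = moderated_t(x, y, s0_sq=0.04, d0=4)  # shrunk towards s0_sq

print(t_ordinary, t_moderated)
```

With `d0 = 0` the result is the ordinary pooled t-statistic; as `d0` grows, the denominator becomes the same for all genes and the ranking approaches a ranking by fold change.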

22
Permutation tests
[Figure: the test statistic computed with the true class labels (2.2) is compared with the null distribution of the test statistic, obtained from (random) permutations of the class labels (e.g. 1.5, -0.4, 2.3, 0.7, 0.2, -1.2).]
23
SAM typical plot
Expected random score vs observed scores
Deviations from the main diagonal are evidence
for differentially expressed genes
24
What you typically observe

No differential gene expression
A lot of differential gene expression
Global changes in gene expression
25
Statistical tests: Different settings
  • comparison of two classes (e.g. tumor vs. normal)
  • paired observations from two classes: e.g. the
    t-test for paired samples is based on the
    within-pair differences.
  • more than two classes and/or more than one factor
    (categorical or continuous): tests may be based
    on linear models

[Figure: paired samples]
26
LIMMA (Linear models for microarray data)
  • moderated t-statistic
  • multiple testing correction
  • analysis of complex designs and factorial experiments
27
Linear models
  • Linear models are a flexible framework for
    assessing the associations of phenotypic
    variables with gene expression.
  • The expression $y_i$ of a given gene in sample $i$ is
    modeled as linearly depending on one or several
    factors (e.g. cell type, treatment, encoded in
    $x_{ij}$) of the sample:
    $y_i = a_1 x_{i1} + \dots + a_m x_{im} + e_i$.
  • Estimated coefficients $a_j$ and their standard
    errors are obtained using least squares, assuming
    normally distributed errors $e_i$ (R function lm),
    or with a robust method (R function rlm).

28
Linear models
  • Contrasts, that is, differences/linear
    combinations of the coefficients, express the
    differences between phenotypes and can be tested
    for significance (t-test).
  • Example: Consider a study of three different
    types of kidney cancer. For each gene set up a
    linear model $y_i = a_1 x_{i1} + a_2 x_{i2} + a_3 x_{i3} + e_i$,
    where $x_{ij} = 1$ if tumor sample $i$ is
    of type $j$, and 0 otherwise.
  • The least squares estimates of the coefficients
    $a_j$ are the mean expression levels in the classes.
  • The contrast $a_1 - a_2$ expresses the mean
    difference between class 1 and 2.
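The kidney cancer example can be reproduced in a few lines. The sketch below is written in plain Python rather than with the R functions named on the slides; the helper functions, the six expression values and the tumor type labels are hypothetical. It solves the normal equations for the indicator design matrix and checks that the coefficients are the class means.

```python
def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def least_squares(x, y):
    """Coefficients minimising the residual sum of squares, via X'X a = X'y."""
    k = len(x[0])
    xtx = [[sum(r[p] * r[q] for r in x) for q in range(k)] for p in range(k)]
    xty = [sum(r[p] * yi for r, yi in zip(x, y)) for p in range(k)]
    return solve(xtx, xty)

# Hypothetical expression values of one gene in six tumor samples.
tumor_type = [1, 1, 2, 2, 3, 3]
y = [2.0, 4.0, 7.0, 9.0, 5.0, 5.0]
# Design matrix: x[i][j] = 1 if sample i is of type j + 1, else 0.
x = [[1.0 if t == j else 0.0 for j in (1, 2, 3)] for t in tumor_type]

a = least_squares(x, y)          # class means of types 1, 2, 3
contrast_1_vs_2 = a[0] - a[1]    # mean difference between class 1 and 2
print(a, contrast_1_vs_2)
```

The same `least_squares` helper works for other design matrices, for example with an intercept column and additional factors, as long as the columns are not linearly dependent.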

29
Linear model analysis with the Bioconductor
package limma
  • The phenotype information for the samples is to
    be entered as a design matrix ($x_{ij}$ from the above
    formula). The rows of the matrix correspond to
    the samples, and the columns to the coefficients
    of the linear model.
  • Contrasts are extracted after fitting the linear
    model.
  • The significance of contrasts is assessed with a
    moderated t-statistic.

30
Experimental Design: Complex Designs
  • Aim of the experiment
  • Robustness
  • Extensibility
  • Efficiency
31
Aim of the experiment
Major focus: differential expression
  • treatment vs. control (multiple treatments)
  • tumor vs. normal (tumor subtypes)
  • time series of multiple treatments
Well-defined goal or competing goals?
  • several comparisons (different subgroups)
  • one factor per comparison (skipping others)
  • statistical modelling (various factors)
  • exact subdivision of all samples
32
Efficiency
  • statistical efficiency (pooling, direct vs. indirect)
  • cost efficiency (microarray, mRNA source)
33
Replicates
  • required for statistical inference
  • independent biological replicates
  • technical replicates may occur on different levels in the experimental hierarchy
Pooling
  • limited number of mRNA sources or limited number of microarrays
  • amplification?
  • independent pools to estimate variance
  • one mRNA source may spoil the whole pool
34
Blocking factors: technical factors (slide batches, hybridisation day, labelling days) should not be (completely) confounded with your comparison of interest
Randomisation: control of unknown covariates
Balancing: control of all covariates (all are known)
Statistical modelling
35
Single channel/one color microarrays (e.g. Affymetrix)
  • experimental design: one independent biological sample per microarray/hybridisation
  • pooling/blocking?
Two color microarrays (e.g. spotted cDNA arrays)
  • experimental design may become more complex
  • pooling/blocking?
36
Dye effect (two color microarrays): different (gene-specific) labelling efficiencies
37
Two colour microarrays
Typical designs:
  • reference design
  • loop design
  • dye-swaps
38
Graphical Representation: two colour microarrays

  • node: mRNA sample
  • edge: hybridisation
  • direction: dye assignment
39
[Figure: graphs of the dye-swap/(mini-)loop design (nodes A and B) and of the reference design (nodes A, R and B)]
Two groups A and B; independent biological replicates of A and B
40
Reference Design
  • one independent biological sample against the reference per hybridization/microarray (possible exception: pooling)
  • extendable (e.g. ongoing study)
  • any unknown/unpredictable comparisons
  • same efficiency for any comparisons
  • analysis as for single channel microarray experiments
  • no dye effect
  • often used with large sample size
  • simple, i.e. minimizes experimental confusion
41
Dye Swap Design
  • often refers to a technical replicate where two slides are used for two samples, each labelled twice (red/green)
  • sometimes used to control the dye effect, i.e. the different labelling efficiencies for each gene, via averaging of the two slides
  • prefer "biological" replicates
42
Dye Swap Design
43
Loop Design
  • well-defined experimental setting/comparisons
  • often relatively small sample size (due to manageability)
  • requires statistical modelling in general
  • dye effect is addressed with the statistical model
44
(Mini-) Loop Design
Hybridisations: A1 vs. B1, B2 vs. A2, A3 vs. B3, B4 vs. A4
Design matrix (columns: A-B, dye):
$\begin{pmatrix} 1 & 1 \\ -1 & 1 \\ 1 & 1 \\ -1 & 1 \end{pmatrix}$
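To illustrate how this design matrix separates the A-B difference from the dye effect, here is a small sketch. It assumes that each array yields one log-ratio per gene, that arrays 1 and 3 carry A in the first dye and arrays 2 and 4 are dye-swapped, and that the log-ratio equals plus or minus (A-B) plus a dye term; the log-ratio values are hypothetical.

```python
# Log-ratios for one gene on four arrays (hypothetical values).
# Assumption: arrays 1 and 3 measure A vs. B, arrays 2 and 4 are dye-swapped.
m = [1.3, -0.7, 1.5, -0.9]

# Design matrix rows: (coefficient of A-B, coefficient of the dye effect).
design = [(1, 1), (-1, 1), (1, 1), (-1, 1)]

# The two columns are orthogonal (X'X = 4 I), so the least-squares
# estimates reduce to simple signed averages of the log-ratios.
n = len(m)
a_minus_b = sum(row[0] * v for row, v in zip(design, m)) / n
dye = sum(row[1] * v for row, v in zip(design, m)) / n

print(a_minus_b, dye)
```

Because the design is balanced over the two dye orientations, the dye term cancels out of the A-B estimate and is estimated separately by the plain average.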
45
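[Figure: loop design with nodes A, B and C; reference design with nodes A, B, C and reference R]
Reference design: A-R, B-R, C-R. Contrast: A-B = (A-R) - (B-R)
Loop design: C-A, A-B (direct), B-C. A-B = -(B-C) - (C-A) (indirect)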
46
Recommendations
  • Large patient sample collective: reference design
  • Unknown/unpredictable comparisons: reference design
  • Small-scale cell-line experiments: direct comparisons
Recall: statistical analysis is limited by the number of independent biological replicates!
47
(No Transcript)
48
  • Thanks to ...
  • Anja von Heydebreck,
  • Rainer Spang
  • Tim Beißbarth
  • for some of the slides.

49
Links
  • www.r-project.org/
  • www.bioconductor.org/
  • bioinf.wehi.edu.au/limma/
  • www-stat.stanford.edu/~tibs/SAM/
  • www.biostat.washington.edu/software/jstorey/edge/
  • NGFN course material: http://compdiag.molgen.mpg.de/ngfn/
50
References
  • Y. Benjamini and Y. Hochberg (1995). Controlling
    the false discovery rate: a practical and
    powerful approach to multiple testing. Journal of
    the Royal Statistical Society B, Vol. 57,
    289-300.
  • S. Dudoit, J.P. Shaffer, J.C. Boldrick (2003).
    Multiple hypothesis testing in microarray
    experiments. Statistical Science, Vol. 18,
    71-103.
  • J.D. Storey and R. Tibshirani (2003). SAM
    thresholding and false discovery rates for
    detecting differential gene expression in DNA
    microarrays. In: The analysis of gene expression
    data: methods and software. Edited by G.
    Parmigiani, E.S. Garrett, R.A. Irizarry, S.L.
    Zeger. Springer, New York.
  • V.G. Tusher et al. (2001). Significance analysis
    of microarrays applied to the ionizing radiation
    response. PNAS, Vol. 98, 5116-5121.
  • M. Pepe et al. (2003). Selecting differentially
    expressed genes from microarray experiments.
    Biometrics, Vol. 59, 133-142.

51
References
  • T. P. Speed and Y. H. Yang (2002). Direct versus
    indirect designs for cDNA microarray experiments.
    Sankhya: The Indian Journal of Statistics, Vol.
    64, Series A, Pt. 3, pp. 706-720.
  • Y. H. Yang and T. P. Speed (2003). Design and
    analysis of comparative microarray experiments. In:
    T. P. Speed (ed.), Statistical analysis of gene
    expression microarray data, Chapman & Hall.
  • R. Simon, M. D. Radmacher and K. Dobbin (2002).
    Design of studies using DNA microarrays. Genetic
    Epidemiology 23:21-36.
  • F. Bretz, J. Landgrebe and E. Brunner (2003).
    Efficient design and analysis of two color
    factorial microarray experiments. Biostatistics.
  • G. Churchill (2003). Fundamentals of experimental
    design for cDNA microarrays. Nature Genetics
    Review 32:490-495.
  • G. Smyth, J. Michaud and H. Scott (2003). Use of
    within-array replicate spots for assessing
    differential expression in microarray
    experiments. Technical Report, WEHI.
  • G. F. V. Glonek and P. J. Solomon (2002).
    Factorial and time course designs for cDNA
    microarray experiments. Technical Report,
    Department of Applied Mathematics, University of
    Adelaide, 10/2002.

52
(No Transcript)
53
Different designs: 4 conditions
[Figure: several alternative design graphs connecting the four conditions A, B, A.B and C]
54
Graphical Representation
  • The structure of the graph determines which
    effects can be estimated and how precise
    the estimates are.
  • Two mRNA samples can only be compared
    if there is a path connecting the corresponding
    nodes.
  • The precision of the estimated contrasts depends
    directly on the number of paths connecting the
    nodes and on the length of these paths.
  • Direct comparisons on the same slide give
    more precise measurements than indirect comparisons.

55
Strong control: any combination of true and false hypotheses
Weak control: complete null hypothesis, i.e. all null hypotheses in the family are true