1
Design of Experiments
Panu Somervuo, March 20, 2007

Problem formulation
Setting up the experiment
Analysis of data

2
Problem formulation

what is the biological question?
how to answer that?
what is already known?
what information is missing?
problem formulation → model of the biological
system

3
Setting up an experiment

what kind of data is needed to answer the
question?
how to collect the data?
how much data is needed?
biological and technical replicates
pooling
how to carry out the experiment (sample
preparation, measurements)?

4
Analysis of data

preprocessing
filtering, outlier removal
normalization
statistical model fitting
hypothesis testing
reporting the results, documentation

5
Everything depends on everything
problem formulation: model of the system
analysis of data: statistical tests
setting up the experiment: number of samples
6
Practical guidelines

blocking unwanted effects (e.g. dye effect)
randomization (avoid systematic bias by
randomizing e.g. the order of sample
preparations)
replication (replicate measurements can be
averaged to reduce the effect of random errors)

(figure: group1 and group2 samples assigned to the cy3 and cy5 channels)
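
The randomization and blocking guidelines above can be sketched in R. This is only an illustration under assumed conditions: the 12 sample ids, the group sizes, and the column names are hypothetical and not taken from the slides.

```r
# Hypothetical example: 6 samples per group; all names are illustrative.
set.seed(1)  # fixed seed only so that the sketch is reproducible
samples <- data.frame(
  id    = paste0("s", 1:12),
  group = rep(c("group1", "group2"), each = 6)
)

# Randomization: prepare the samples in a random order.
samples$prep_order <- sample(nrow(samples))

# Blocking the dye effect: within each group, half of the samples get cy3
# and half get cy5, and which sample gets which dye is itself randomized.
samples$dye <- NA_character_
for (g in unique(samples$group)) {
  idx <- which(samples$group == g)
  samples$dye[idx] <- sample(rep(c("cy3", "cy5"), length.out = length(idx)))
}

dye_table <- table(samples$group, samples$dye)
print(samples[order(samples$prep_order), ])
print(dye_table)
```

Because each group is split evenly between the two dyes, a dye effect cannot be mistaken for a group effect.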
7
log transform, normalization
y = µ + F1 + F2 + ... + error
8
Pairwise sample comparison vs modeling

pairwise sample comparison is easy and
straightforward
instead of comparing samples as such, we can
construct a model for the measurements and then
perform comparisons

9
Mathematical model of data

try to capture the essence of a (biological)
phenomenon in mathematical terms
here we concentrate on linear models: an observation
consists of the effects of one or more factors and
random error
a factor may have several levels (e.g. the factor sex
has two levels, male and female)

10
Examples of models
normalization, log transform

single factor
  y = µ + gene + error
two factors
  y = µ + treatment + gene + error
two factors including interaction term
  y = µ + treatment + gene + treatment.gene + error
four factors
  y = µ + treatment + gene + dye + array + error

11
From model to experimental design

y = µ + drug + sex + drug.sex + error
factor 1, drug: 3 levels
factor 2, sex: 2 levels
→ 3x2 factorial design

               M                        F
no treatment   y111, y112, y113, y114   y121, y122, y123, y124
treatment A    y211, y212, y213, y214   y221, y222, y223, y224
treatment B    y311, y312, y313, y314   y321, y322, y323, y324
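
As a rough sketch, the 3x2 layout with four replicates per cell can be generated in R and turned into the design matrix of the model with interaction. The level names "none", "A", "B" and the column names are assumptions made for illustration.

```r
# All combinations of drug (3 levels) and sex (2 levels), 4 replicates each.
design <- expand.grid(
  replicate = 1:4,
  sex       = c("M", "F"),
  drug      = c("none", "A", "B")
)

# Number of observations in every drug x sex cell.
n_per_cell <- table(design$drug, design$sex)
print(n_per_cell)

# Design matrix for y = mu + drug + sex + drug.sex + error:
# intercept, 2 drug columns, 1 sex column, 2 interaction columns.
X <- model.matrix(~ drug * sex, data = design)
print(colnames(X))
```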
12
Analysis of variance

ANOVA can be used to analyse factorial designs
y = µ + drug + sex + drug.sex + error

> summary(aov(y ~ drug * sex, data = data))
            Df  Sum Sq Mean Sq F value    Pr(>F)
drug         2 2.86750 1.43375 51.3582 3.644e-08 ***
sex          1 1.26042 1.26042 45.1493 2.673e-06 ***
drug:sex     2 0.06583 0.03292  1.1791    0.3302
Residuals   18 0.50250 0.02792
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

               M                    F
no treatment   1.0, 1.1, 0.9, 1.3   0.7, 0.5, 0.6, 0.8
treatment A    1.1, 1.2, 0.8, 1.3   0.7, 0.8, 0.6, 0.9
treatment B    2.1, 1.9, 1.7, 2.0   1.5, 1.3, 1.4, 1.1
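
The ANOVA table on this slide can be reproduced from the data shown. The sketch below assumes the data frame is called data with columns y, drug and sex, and that the no-treatment level is coded "0"; the slide does not show how the data frame was built.

```r
# Measurements from the slide, entered cell by cell (4 replicates per cell).
data <- data.frame(
  y = c(1.0, 1.1, 0.9, 1.3,  0.7, 0.5, 0.6, 0.8,   # no treatment: M, then F
        1.1, 1.2, 0.8, 1.3,  0.7, 0.8, 0.6, 0.9,   # treatment A:  M, then F
        2.1, 1.9, 1.7, 2.0,  1.5, 1.3, 1.4, 1.1),  # treatment B:  M, then F
  drug = factor(rep(c("0", "A", "B"), each = 8)),
  sex  = factor(rep(rep(c("M", "F"), each = 4), times = 3))
)

# Two-way ANOVA with interaction: y = mu + drug + sex + drug.sex + error
fit <- aov(y ~ drug * sex, data = data)
anova_table <- summary(fit)[[1]]
print(anova_table)
```

Both main effects are clearly significant, while the interaction row gives no evidence that the drug effect differs between the sexes.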
13
Multiple pairwise comparisons

ANOVA tells us that at least one drug treatment has
an effect, but to find out which one we perform
all pairwise comparisons

               M                    F
no treatment   1.0, 1.1, 0.9, 1.3   0.7, 0.5, 0.6, 0.8
treatment A    1.1, 1.2, 0.8, 1.3   0.7, 0.8, 0.6, 0.9
treatment B    2.1, 1.9, 1.7, 2.0   1.5, 1.3, 1.4, 1.1

> TukeyHSD(aov(y ~ drug * sex, data = data), "drug", ordered = TRUE)
  Tukey multiple comparisons of means
    95% family-wise confidence level
    factor levels have been ordered

Fit: aov(formula = y ~ drug * sex, data = data)

$drug
      diff        lwr       upr
A-0 0.0625 -0.1507113 0.2757113
B-0 0.7625  0.5492887 0.9757113
B-A 0.7000  0.4867887 0.9132113
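
To see where these intervals come from, the half-width can be recomputed by hand. This sketch only uses numbers printed on the slides (residual sum of squares 0.50250 on 18 degrees of freedom, 8 observations per drug level) together with the standard Tukey formula for a balanced design; the variable names are illustrative.

```r
mse <- 0.50250 / 18   # residual mean square from the ANOVA table
n   <- 8              # observations per drug level (4 replicates x 2 sexes)
k   <- 3              # number of drug levels compared
df  <- 18             # residual degrees of freedom

# Balanced Tukey HSD: half-width = q(0.95; k, df) * sqrt(MSE / n)
half_width <- qtukey(0.95, nmeans = k, df = df) * sqrt(mse / n)

diffs <- c("A-0" = 0.0625, "B-0" = 0.7625, "B-A" = 0.7000)
intervals <- cbind(diff = diffs,
                   lwr  = diffs - half_width,
                   upr  = diffs + half_width)
print(round(intervals, 4))
```

The interval for A-0 contains zero, so only treatment B differs detectably from the other two levels.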

14
Benefits of (good) models

after fitting the model to the data, the model can be
used to answer questions such as:
is there a dye effect?
is the difference in gene expression levels between
two conditions statistically significant?
is there an interaction between gene and another
factor?
simple pairwise sample comparisons cannot give
answers to all of these questions simultaneously

y = µ + F1 + F2 + ... + error
15
What is a good model?

a good model allows us to get more detailed results
the best model and parametrization are application
specific
simple vs complex model
y = µ + F1 + F2 + F3 + ... + error
there should be a balance between model complexity
and the amount of data

              dye1               dye2
control       y111, y112, y113   y121, y122, y123
treatment A   y211, y212, y213   y221, y222, y223
treatment B   y311, y312, y313   y321, y322, y323
16
How does the number of samples affect the confidence
of our results?

measurement error is always present; see, for
example, a self-self hybridization

17
How does the number of samples affect the confidence
of our results?

let's compute the mean (average) expression
level of a gene
how accurate is this value?
variance(mean) = variance(error) / number of
samples
(figure: samples from normal distribution, mean 0, sd 1)
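
A small simulation can illustrate the relation variance(mean) = variance(error) / number of samples. The sample sizes and the number of simulation runs below are arbitrary choices for this sketch.

```r
set.seed(42)                      # fixed seed for reproducibility
sample_sizes <- c(2, 5, 10, 50)   # numbers of samples to compare
n_sim <- 20000                    # simulated experiments per sample size

# For each n: draw n_sim experiments of n values from N(0, 1),
# compute the mean of each experiment, then the variance of those means.
empirical <- sapply(sample_sizes, function(n) {
  draws <- matrix(rnorm(n_sim * n, mean = 0, sd = 1), nrow = n_sim)
  var(rowMeans(draws))
})

theoretical <- 1 / sample_sizes   # variance(error) = 1
result <- data.frame(n = sample_sizes, empirical, theoretical)
print(result)
```

The empirical column should track the theoretical one closely: going from 5 to 50 samples cuts the variance of the mean by a factor of ten.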

18
Theoretical sample size calculations

for each statistical test, there is a
(test-specific) relation between
power of the test = 1 - probability(type II error)
significance level = probability(type I error)
error variance
mean difference needed to be detected
number of samples

19
                                   | actual situation: drug has effect                   | actual situation: drug has no effect
our conclusion: drug has effect    | correct conclusion (true positive), probability 1-β | type I error (false positive), probability α
our conclusion: drug has no effect | type II error (false negative), probability β       | correct conclusion (true negative), probability 1-α
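
The two error probabilities in this table can be estimated by simulation. The sketch assumes a two-sample t test with 10 samples per group, sd 1, and a true drug effect of 1 unit; these settings and the helper name reject_rate are illustrative choices.

```r
set.seed(7)
n_sim <- 5000   # simulated experiments per scenario
n     <- 10     # samples per group
alpha <- 0.05   # significance level

# Fraction of simulated experiments in which the test concludes
# "drug has effect" when the true mean difference is delta.
reject_rate <- function(delta) {
  p <- replicate(n_sim, {
    control <- rnorm(n, mean = 0, sd = 1)
    treated <- rnorm(n, mean = delta, sd = 1)
    t.test(control, treated, var.equal = TRUE)$p.value
  })
  mean(p < alpha)
}

type1 <- reject_rate(0)   # drug has no effect: false positive rate
power <- reject_rate(1)   # drug has effect: true positive rate
cat("estimated type I error rate:", type1, "\n")
cat("estimated power:            ", power, "\n")
cat("estimated type II error rate:", 1 - power, "\n")
```

The false positive rate should come out near the chosen significance level, while the power depends on the effect size, the error variance, and the number of samples.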
20
How many samples are needed to detect a sample mean
difference of 1 unit?
R function power.t.test

> power.t.test(delta = 1, power = 0.95, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 26.98922
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in *each* group
21
What is the power of the test when using 10 samples?
R function power.t.test

> power.t.test(n = 10, delta = 1, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 10
          delta = 1
             sd = 1
      sig.level = 0.05
          power = 0.5619846
    alternative = two.sided

NOTE: n is number in *each* group
22
How small a difference between sample means can we
detect using 10 samples?
R function power.t.test

> power.t.test(n = 10, power = 0.95, sd = 1, sig.level = 0.05)

     Two-sample t test power calculation

              n = 10
          delta = 1.706224
             sd = 1
      sig.level = 0.05
          power = 0.95
    alternative = two.sided

NOTE: n is number in *each* group
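
The three power.t.test calls each fix all but one quantity. As a small extension, the same function can be evaluated over a range of group sizes to show how power grows with the number of samples; the grid of n values is an arbitrary choice.

```r
# Power of the two-sample t test for a difference of 1 unit (sd = 1)
# at several group sizes.
n_values <- c(5, 10, 15, 20, 27, 40)
powers <- sapply(n_values, function(n) {
  power.t.test(n = n, delta = 1, sd = 1, sig.level = 0.05)$power
})

power_table <- data.frame(n = n_values, power = round(powers, 3))
print(power_table)
```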
23
Two kinds of replicates

biological replicates: biological variability
technical replicates: measurement accuracy
most statistical programs assume independent
samples

(figure: samples A1-A3, B1-B3, C1-C3, D1-D3)
24
Pooling
(figure: samples A1, A2, A3 and B1, B2, B3)
25
Pooling

ok when the interest is not on the individual,
but on common patterns across individuals
(population characteristics)
results in averaging → reduces variability →
substantive features are easier to find
recommended when fewer than 3 arrays are used in
each condition
beneficial when many subjects are pooled
one pool vs independent samples in multiple pools
C. Kendziorski, R. A. Irizarry, K.-S. Chen, J. D.
Haag, and M. N. Gould,
"On the utility of pooling biological samples in
microarray experiments",
PNAS, March 2005, 102(12): 4252-4257

inference for most genes was not affected by
pooling
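
The averaging effect of pooling can be illustrated with a simple simulation. The variance model used here is an assumption for this sketch, not something stated on the slides: one array measures the mean of k individuals' expression values plus independent technical noise, so its variance is sigma_bio^2 / k + sigma_tech^2. Both sigma values are made up.

```r
set.seed(3)
sigma_bio  <- 1.0   # assumed biological sd between individuals
sigma_tech <- 0.5   # assumed technical (measurement) sd per array
n_sim      <- 20000 # simulated arrays per pool size
pool_sizes <- c(1, 2, 4, 8)

# One pooled array = average of k individuals + technical noise.
simulated <- sapply(pool_sizes, function(k) {
  individuals <- matrix(rnorm(n_sim * k, sd = sigma_bio), nrow = n_sim)
  pooled <- rowMeans(individuals) + rnorm(n_sim, sd = sigma_tech)
  var(pooled)
})

expected <- sigma_bio^2 / pool_sizes + sigma_tech^2
pool_table <- data.frame(pool_size = pool_sizes, simulated, expected)
print(pool_table)
```

Under this model pooling shrinks only the biological part of the variance; the technical part stays, which is one reason multiple independent pools are preferable to a single pool.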
26
How to allocate the samples to microarrays?

which samples should be hybridized on the same
slide?
different experimental designs
reference design, loop design
what is the optimal design?

27
Example of four-array experiment
(figure: samples A and B hybridized on arrays 1-4 with cy3 and cy5 labels)
array  cy3  cy5  log(cy5/cy3)
1      A    B    log(B) - log(A)
2      A    B    log(B) - log(A)
3      B    A    log(A) - log(B)
4      B    A    log(A) - log(B)
28
Reference design
array  cy3  cy5  log(cy5/cy3)
1      Ref  A    log(A) - log(Ref)
2      Ref  B    log(B) - log(Ref)
3      Ref  C    log(C) - log(Ref)
4      Ref  D    log(D) - log(Ref)
(figure: samples A, B, C, D each connected to Ref by arrays 1-4)
log(C/A) = log(C) - log(A)
         = log(C) - log(Ref) + log(Ref) - log(A)
         = (log(C) - log(Ref)) - (log(A) - log(Ref))
         = logratio(array3) - logratio(array1)
29
Loop design
array  cy3  cy5  log(cy5/cy3)
1      A    B    log(B) - log(A)
2      B    C    log(C) - log(B)
3      C    D    log(D) - log(C)
4      D    A    log(A) - log(D)
(figure: samples A, B, C, D connected in a loop by arrays 1-4)
log(C/A) = log(C) - log(B) + log(B) - log(A)
         = logratio(array2) + logratio(array1)
log(C/A) = log(C) - log(D) + log(D) - log(A)
         = - logratio(array3) - logratio(array4)
averaging the two estimates: log(C/A) = (logratio1 + logratio2) / 2
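
The two routes around the loop can be checked with a short simulation. The true log-expression values and the noise level are invented for this sketch; each array is assumed to measure its log-ratio with independent error of the same variance.

```r
set.seed(11)
true_log <- c(A = 1.0, B = 1.5, C = 2.0, D = 0.5)  # assumed true log levels
sigma <- 0.2     # assumed measurement sd of one log-ratio
n_sim <- 20000   # number of simulated four-array experiments
noise <- function() rnorm(n_sim, sd = sigma)

# Log-ratios log(cy5/cy3) of the four arrays in the loop.
lr1 <- true_log[["B"]] - true_log[["A"]] + noise()   # array 1: A -> B
lr2 <- true_log[["C"]] - true_log[["B"]] + noise()   # array 2: B -> C
lr3 <- true_log[["D"]] - true_log[["C"]] + noise()   # array 3: C -> D
lr4 <- true_log[["A"]] - true_log[["D"]] + noise()   # array 4: D -> A

path1    <- lr1 + lr2            # A -> B -> C
path2    <- -lr3 - lr4           # A <- D <- C, reversed
combined <- (path1 + path2) / 2  # average of the two estimates

cat("true log(C/A):", true_log[["C"]] - true_log[["A"]], "\n")
cat("mean estimate:", mean(combined), "\n")
cat("variance of one path:", var(path1), " of the average:", var(combined), "\n")
```

Each path sums two noisy log-ratios, so its variance is about 2 * sigma^2; averaging the two independent paths halves that.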
30
Comparing the designs
                                    reference design   reference design with replicates   loop design
number of arrays                    3                  6                                  3
amount of RNA required per sample   1 + Ref            2 + Ref                            2
error                               2.0                1.0                                0.67
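
The error row can be reproduced under simple assumptions: three samples A, B, C, every array measuring log(cy5) - log(cy3) with independent error of variance 1, no dye effect, and "error" meaning the variance of the least-squares estimate of log(A/B). The function contrast_variance and its arguments are hypothetical names introduced for this sketch.

```r
# Variance of a contrast between samples, for a given set of arrays.
# arrays:   character matrix with columns "cy3" and "cy5", one row per array
# samples:  character vector of all sample names used on the arrays
# contrast: named numeric vector over samples, summing to zero
contrast_variance <- function(arrays, samples, contrast) {
  X <- matrix(0, nrow = nrow(arrays), ncol = length(samples),
              dimnames = list(NULL, samples))
  for (i in seq_len(nrow(arrays))) {
    X[i, arrays[i, "cy3"]] <- -1
    X[i, arrays[i, "cy5"]] <-  1
  }
  # Use the first sample as baseline so that X'X becomes invertible;
  # this is valid because the contrast sums to zero.
  Xr <- X[, -1, drop = FALSE]
  cr <- contrast[samples][-1]
  drop(t(cr) %*% solve(crossprod(Xr)) %*% cr)
}

make_arrays <- function(...) {
  matrix(c(...), ncol = 2, byrow = TRUE, dimnames = list(NULL, c("cy3", "cy5")))
}

reference <- make_arrays("Ref", "A", "Ref", "B", "Ref", "C")
loop      <- make_arrays("A", "B", "B", "C", "C", "A")

v_ref  <- contrast_variance(reference, c("Ref", "A", "B", "C"),
                            c(Ref = 0, A = 1, B = -1, C = 0))
v_rep  <- contrast_variance(rbind(reference, reference), c("Ref", "A", "B", "C"),
                            c(Ref = 0, A = 1, B = -1, C = 0))
v_loop <- contrast_variance(loop, c("A", "B", "C"), c(A = 1, B = -1, C = 0))

print(round(c(reference = v_ref, replicated = v_rep, loop = v_loop), 2))
```

The loop design reaches a smaller error than the unreplicated reference design with the same number of arrays because no array is spent on the reference sample.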
31
Design with all direct pairwise comparisons
(figure: arrays 1-6, one for each direct pairwise comparison)
32
Example examining genotype, phenotype, and
environment
Parental - stressed
Derived - stressed
Parental - unstressed
Derived - unstressed
33
Optimal design

maximize the accuracy of the parameters of interest
procedure: enumerate all possible designs,
calculate the parameter accuracy for each of them,
and select the best design
optimal design is model specific
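
The enumeration procedure can be sketched for a small case. Everything specific here is an assumption for illustration: four samples, four two-colour arrays, independent unit-variance errors, no dye effect, and "accuracy" defined as the average variance over all pairwise sample comparisons. A different model or criterion can give a different optimum.

```r
samples <- c("A", "B", "C", "D")
pairs   <- t(combn(samples, 2))    # the 6 possible two-sample arrays
designs <- combn(nrow(pairs), 4)   # all ways to choose 4 different arrays

# Average variance of all pairwise comparisons for one design.
avg_contrast_variance <- function(arrays) {
  X <- matrix(0, nrow(arrays), length(samples), dimnames = list(NULL, samples))
  for (i in seq_len(nrow(arrays))) {
    X[i, arrays[i, 1]] <- -1
    X[i, arrays[i, 2]] <-  1
  }
  Xr <- X[, -1, drop = FALSE]               # first sample as baseline
  if (qr(Xr)$rank < ncol(Xr)) return(Inf)   # some comparison not estimable
  V <- solve(crossprod(Xr))
  Vfull <- rbind(0, cbind(0, V))            # baseline effect is fixed at 0
  cmp <- combn(length(samples), 2)
  mean(apply(cmp, 2, function(p) {
    Vfull[p[1], p[1]] + Vfull[p[2], p[2]] - 2 * Vfull[p[1], p[2]]
  }))
}

scores <- apply(designs, 2, function(idx) {
  avg_contrast_variance(pairs[idx, , drop = FALSE])
})

best <- pairs[designs[, which.min(scores)], ]
print(round(sort(unique(scores)), 3))
print(best)
```

Under these assumptions the winning designs are the ones in which every sample appears on exactly two arrays, that is, loop designs.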

35
About the nature of microarray data

microarray data can give hypotheses to be tested
further
results from microarray analysis should be
verified by other means (qPCR, ...)
the quality of microarray data depends on samples,
probes, hybridization, and lab work
data pre-processing, normalization, and outlier
detection are as important as good experimental
design

36
More about statistics

M.J. Crawley, Statistics: An Introduction using
R, John Wiley & Sons, 2005
S.A. Glantz, Primer of Biostatistics, 5th ed.,
McGraw-Hill, 2002
D.C. Montgomery, Design and Analysis of
Experiments, 5th ed., John Wiley & Sons, 2001
Google
