Title: Understanding the design matrix in linear models for microarray experiments Natalie Thorne
- Many thanks to Terry Speed, Gordon Smyth, Jean
(Yee Hwa) Yang, Ingrid Lonnstedt, Matthew Ritchie
for sharing their teaching material with me.
The Hutchison/MRC Research Center
2Linear models from experimental design to model
3Specifying the design
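Sample types: represent the effect measured by each sample type (single-channel representation).
Possible parameters: what differences are important? (log-ratio representation)
[Figure: design graph with sample types A, B and Ref; arrays 1 and 2 compare A with Ref, array 3 compares B with Ref.]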
4Specifying the design
Sample types: represent the effect measured by each sample type.
  A = baseline + a
  B = baseline + b
  Ref = baseline
Possible parameters: what differences are important?
[Figure: design graph with sample types A, B and Ref; arrays 1 and 2 compare A with Ref, array 3 compares B with Ref.]
Write down what you think is measured in each sample type, in relation to the other sample types. We do this to help you understand and interpret results from your log-ratios.
5Specifying the design
Sample types: represent the effect measured by each sample type.
  A = baseline + a
  B = baseline + b
  Ref = baseline
Possible parameters: what differences are important?
  A - Ref = (baseline + a) - (baseline) = a
  B - Ref = (baseline + b) - (baseline) = b
  A - B = (baseline + a) - (baseline + b) = a - b
[Figure: design graph with sample types A, B and Ref; arrays 1 and 2 compare A with Ref, array 3 compares B with Ref.]
Parameters in two-colour experiments need to represent log-ratio comparisons, because log-ratios are the data that you would typically model. Once you've defined the effects in each sample type, it is fairly easy to define possible parameters for your model.
6Specifying the design
Samples: represent the samples by the effects present in each.
  A = baseline + a, B = baseline + b, Ref = baseline
Possible parameters: what differences are important?
  A - Ref = a, B - Ref = b, A - B = a - b
Choose parameters: parameters must be independent!
[Figure: design graph with sample types A, B and Ref; arrays 1 and 2 compare A with Ref, array 3 compares B with Ref.]
7Specifying the design
Samples: represent the samples by the effects present in each.
  A = baseline + a, B = baseline + b, Ref = baseline
Possible parameters: what differences are important?
  A - Ref = a, B - Ref = b, A - B = a - b
Choose parameters: parameters must be independent!
  These two are independent: A - Ref = a and B - Ref = b.
  But this parameter is not independent: A - B, i.e. a - b = (a) - (b).
[Figure: the design graph for A, B and Ref (arrays 1 to 3), with the non-independent parameter crossed out.]
8Specifying the design
Samples: represent the samples by the effects present in each.
  A = baseline + a, B = baseline + b, Ref = baseline
Possible parameters: what differences are important?
  A - Ref = a, B - Ref = b, A - B = a - b
Choose parameters: parameters must be independent!
  These two are independent: A - Ref = a and A - B = a - b.
  But this parameter is not independent: B - Ref, i.e. b = (a) - (a - b).
[Figure: the design graph for A, B and Ref (arrays 1 to 3), with the non-independent parameter crossed out.]
A definition of independent parameters: no combination of the other parameters can equal any one of the parameters.
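The independence rule above can be checked mechanically. The following is a minimal sketch, not part of the original slides: it assumes each candidate parameter is written as a vector of coefficients on the effects (a, b), and the helper names `rank` and `independent` are made up for illustration.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a small matrix, by Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for col in range(ncols):
        # Find a row at or below position r with a non-zero entry in this column.
        pivot = next((i for i in range(r, len(m)) if m[i][col] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Clear this column in every other row.
        for i in range(len(m)):
            if i != r and m[i][col] != 0:
                factor = m[i][col] / m[r][col]
                m[i] = [x - factor * y for x, y in zip(m[i], m[r])]
        r += 1
    return r


def independent(params):
    """True if no parameter is a combination of the others."""
    return rank(params) == len(params)


# Coefficients on (a, b) for each candidate parameter.
A_minus_Ref = [1, 0]   # a
B_minus_Ref = [0, 1]   # b
A_minus_B = [1, -1]    # a - b

print(independent([A_minus_Ref, B_minus_Ref]))             # True
print(independent([A_minus_Ref, A_minus_B]))               # True
print(independent([A_minus_Ref, B_minus_Ref, A_minus_B]))  # False
```

Any two of the three candidate parameters pass the check, but all three together do not, which matches the choice of two parameters for three sample types.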
9Specifying the design
Samples: represent the samples by the effects present in each.
  A = baseline + a, B = baseline + b, Ref = baseline
Possible parameters: what differences are important?
  A - Ref = a, B - Ref = b, A - B = a - b
Choose parameters: parameters must be independent!
  A - Ref = a, B - Ref = b
[Figure: design graph with sample types A, B and Ref; arrays 1 and 2 compare A with Ref, array 3 compares B with Ref.]
Now specify the design and the parameters that you've chosen as a matrix.
10Specifying the design as a matrix
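Parameters (columns): A - Ref, B - Ref
Observations (rows): slide 1, slide 2, slide 3
[Figure: design graph with sample types A, B and Ref; arrays 1 to 3 are labelled slide 1 to slide 3.]
Represent the log-ratio from each slide by a parameter => specify the model for your data.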
14Specifying the design as a matrix
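Parameters (columns) and observations (rows):

             A - Ref   B - Ref
  slide 1       1         0
  slide 2      -1         0
  slide 3       0         1

[Figure: design graph with sample types A, B and Ref; arrays 1 to 3 are labelled slide 1 to slide 3.]
This is called the design matrix.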
15Write the model using design matrix
[Figure: slides 1 to 3 on the design graph for A, B and Ref, with the two chosen parameters A - Ref and B - Ref.]
17Write the model using design matrix
Observed data modelled by these parameters, in matrix notation: (log-ratios from slides 1 to 3) = (design matrix) x (parameters A - Ref and B - Ref).
[Figure: slides 1 to 3 on the design graph for A, B and Ref.]
18Write the model using design matrix
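\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} \]
[Figure: slides 1 to 3 on the design graph for A, B and Ref.]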
19Write the model using design matrix
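Matrix multiplication:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} 1 \times a + 0 \times b \\ -1 \times a + 0 \times b \\ 0 \times a + 1 \times b \end{pmatrix} = \begin{pmatrix} a \\ -a \\ b \end{pmatrix} \]
[Figure: slides 1 to 3 on the design graph for A, B and Ref.]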
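The matrix multiplication on this slide can be reproduced in a few lines. This is only an illustrative sketch: the function name `mat_vec` and the numeric values chosen for a and b are made up, not taken from the slides.

```python
def mat_vec(X, beta):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]


# Design matrix: rows are slides 1 to 3, columns are (A - Ref, B - Ref).
X = [
    [1, 0],    # slide 1 measures A - Ref
    [-1, 0],   # slide 2 measures Ref - A
    [0, 1],    # slide 3 measures B - Ref
]

# Made-up values for the parameters a and b.
beta = [2.0, -0.5]

expected = mat_vec(X, beta)
print(expected)  # [2.0, -2.0, -0.5], i.e. (a, -a, b)
```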
22Modelling data
With two observations the line is DETERMINED!!!
23Modelling data
With many observations the line is NOT DETERMINED
- we must estimate it!!!!
24Simple regression
Minimize the difference between each observation and its prediction according to the line. Method of least squares: find the line which minimizes the sum of squared errors.
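As a minimal sketch of least squares for a straight line, the closed-form solution can be written as below. The function name `fit_line` and the data are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of cross-deviations / sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# With two observations the line is determined: it passes through both points.
print(fit_line([0, 1], [1, 3]))

# With many observations the line is estimated; these made-up points
# happen to lie exactly on y = 1 + 2x, so the fit recovers that line.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))
```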
25Linear model for different groups
[Figure: example data for one gene, plotted as M (log-ratios) for groups A, B and C with n arrays in each group; alongside, an experiment (with reference R) that might result in this data.]
Minimise the errors around the means of each group. Notice that only with replication in each group can we estimate a mean for each group and fit a statistical model to the data.
26Linear model for different groups at two levels
Drugs result in different transcriptional responses (main effects). The effect over time is also different for different drugs (interaction).
[Figure: example data for one gene for drugs A, B and C at two levels; alongside, an experiment (with reference R) that might result in this data, e.g. treatment with drug A measured 1 hr later and treatment with drug A measured 24 hrs later.]
27Linear model and array platforms
- The linear modelling approach applies to both single-channel (Affymetrix) and two-colour spotted arrays.
- Two-colour with a common reference is virtually equivalent to single-channel from an analysis point of view.
- We need to cover some special features of two-colour arrays using direct comparisons.
28Specifying the design
Sample types: represent the effect measured by each sample type.
  Single-channel representation (2 types of samples): A = a, B = b
Possible parameters: what differences are important?
  Log-ratio representation (choose 2 - 1 = 1 parameter): A - B = a - b, or B - A = b - a
[Figure: arrays 1 to 4, each comparing A and B directly.]
Choose one parameter to model your data. Write the design matrix for this experiment.
29Write the model using design matrix
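Samples: 2 (A = a, B = b). Parameters: 1 (B - A = b - a).
[Figure: arrays 1 to 4, each comparing A and B directly.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4)^T, \qquad \beta = (b - a) \]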
30Write the model using design matrix
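[Figure: arrays 1 to 4, each comparing A and B directly; arrays 1 and 3 measure a - b, arrays 2 and 4 measure b - a.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} -1 \\ 1 \\ -1 \\ 1 \end{pmatrix} (b - a) = \begin{pmatrix} -1 \times (b - a) \\ 1 \times (b - a) \\ -1 \times (b - a) \\ 1 \times (b - a) \end{pmatrix} = \begin{pmatrix} a - b \\ b - a \\ a - b \\ b - a \end{pmatrix} \]
Here E(Y) = Xβ with X = (-1, 1, -1, 1)^T and β = (b - a).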
31Write the model using design matrix
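[Figure: arrays 1 to 4, each comparing A and B directly; arrays 1 and 3 measure a - b, arrays 2 and 4 measure b - a.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 \\ -1 \\ 1 \\ -1 \end{pmatrix} (a - b) = \begin{pmatrix} 1 \times (a - b) \\ -1 \times (a - b) \\ 1 \times (a - b) \\ -1 \times (a - b) \end{pmatrix} = \begin{pmatrix} a - b \\ b - a \\ a - b \\ b - a \end{pmatrix} \]
Here E(Y) = Xβ with X = (1, -1, 1, -1)^T and β = (a - b).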
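With a single parameter, the least-squares estimate has a simple closed form: the sum of x_i * y_i divided by the sum of x_i squared. For a design column of +1 and -1 this is just the average of the sign-corrected log-ratios. The sketch below uses made-up log-ratios and a hypothetical helper name, `fit_one_parameter`.

```python
def fit_one_parameter(x, y):
    """Least-squares estimate of beta in E(y_i) = x_i * beta."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)


# Made-up log-ratios from the four dye-swapped arrays.
y = [-1.1, 0.9, -1.0, 1.0]

# Parameterised as b - a, the design column is (-1, 1, -1, 1).
est_b_minus_a = fit_one_parameter([-1, 1, -1, 1], y)

# Parameterised as a - b, the design column changes sign, and so does the estimate.
est_a_minus_b = fit_one_parameter([1, -1, 1, -1], y)

print(est_b_minus_a, est_a_minus_b)
```

Both parameterisations describe the same data; they differ only in the sign of the estimate.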
32Specifying the design
Sample types: represent the effect measured by each sample type.
  Single-channel representation (3 types of samples): A = base + a, B = base + b, R = base
Possible parameters: what differences are important?
  Log-ratio representation (choose 3 - 1 = 2 parameters):
  A - R = a
  B - R = b
  A - B = (base + a) - (base + b) = a - b
  more...
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
33Specifying the design (alternative)
Sample types: represent the effect measured by each sample type.
  Single-channel representation (3 types of samples): A = a, B = b, R = r
Possible parameters: what differences are important?
  Log-ratio representation (choose 3 - 1 = 2 parameters):
  A - R = a - r
  B - R = b - r
  A - B = a - b
  more...
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
Choose two parameters to model your data. Write the design matrix for this experiment.
34Write the model using design matrix
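Samples: 3 (A = a, B = b, R = r). Parameters: 2 (A - R = a - r, A - B = a - b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4)^T, \qquad \beta = (a - r,\; a - b)^T \]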
35Write the model using design matrix
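Samples: 3 (A = a, B = b, R = r). Parameters: 2 (A - R = a - r, A - B = a - b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 1 & -1 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} a - r \\ a - b \end{pmatrix} = \begin{pmatrix} (a - r) + 0 \\ -1(a - r) + 0 \\ (a - r) - (a - b) \\ -1(a - r) + (a - b) \end{pmatrix} = \begin{pmatrix} a - r \\ r - a \\ b - r \\ r - b \end{pmatrix} \]
Here E(Y) = Xβ.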
36design matrix with alternative parameterisation
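Samples: 3 (A = r + a, B = r + b, R = r). Parameters: 2 (A - R = a, B - R = b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4)^T, \qquad \beta = (a,\; b)^T \]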
37design matrix with alternative parameterisation
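Samples: 3 (A = r + a, B = r + b, R = r). Parameters: 2 (A - R = a, B - R = b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a + 0 \\ -a + 0 \\ 0 + b \\ 0 - b \end{pmatrix} = \begin{pmatrix} a \\ -a \\ b \\ -b \end{pmatrix} \]
Here E(Y) = Xβ.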
38Specifying a contrast matrix
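Samples: 3 (A = r + a, B = r + b, R = r). Parameters: 2 (A - R = a, B - R = b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
Linear model estimates of the parameters:
\[ \hat{\beta} = \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} \]
These parameter estimates are called coefficients in limma.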
39Specifying a contrast matrix
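Samples: 3 (A = r + a, B = r + b, R = r). Parameters: 2 (A - R = a, B - R = b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
Model, E(Y) = Xβ:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & 1 \\ 0 & -1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} a \\ -a \\ b \\ -b \end{pmatrix} \]
Contrast matrix applied to the parameter estimates (called coefficients in limma):
\[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \\ 0.5 & 0.5 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} \]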
40Specifying a contrast matrix
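Samples: 3 (A = r + a, B = r + b, R = r). Parameters: 2 (A - R = a, B - R = b).
[Figure: design graph with sample types A, B and R; arrays 1 and 2 compare A with R, arrays 3 and 4 compare B with R.]
Contrast matrix x parameter estimates (called coefficients in limma) = contrasts of interest:
\[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & -1 \\ 0.5 & 0.5 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \end{pmatrix} = \begin{pmatrix} \hat{a} \\ \hat{b} \\ \hat{a} - \hat{b} \\ 0.5(\hat{a} + \hat{b}) \end{pmatrix} \]
The contrasts of interest are A, B, A - B and 1/2(A + B).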
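Applying a contrast matrix is again a matrix multiplication, this time on the estimated coefficients. The sketch below is plain Python, not limma; the coefficient values are made up, and `mat_vec` and `transpose` are illustrative helpers.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    """Swap rows and columns."""
    return [list(col) for col in zip(*M)]


# One row per contrast of interest, one column per coefficient (a-hat, b-hat).
contrasts = [
    [1, 0],      # A
    [0, 1],      # B
    [1, -1],     # A - B
    [0.5, 0.5],  # 1/2 (A + B)
]

# Made-up coefficient estimates for a and b.
coef = [2.0, 1.0]

print(mat_vec(contrasts, coef))  # [2.0, 1.0, 1.0, 1.5]

# The slides note that limma expects the transposed layout:
# one row per coefficient, one column per contrast.
print(transpose(contrasts))      # [[1, 0, 1, 0.5], [0, 1, -1, 0.5]]
```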
41Specifying the design
Sample types: represent the effect measured by each sample type.
  Single-channel representation (3 types of samples): A = a, B = b, C = c
Possible parameters: what differences are important?
  Log-ratio representation (choose 3 - 1 = 2 parameters):
  A - B = a - b
  B - C = b - c
  C - A = c - a
  more...
[Figure: loop design with sample types A, B and C and arrays 1 to 6.]
Choose two parameters to model your data. Write the design matrix for this experiment.
42Write the model using design matrix
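Samples: 3 (A = a, B = b, C = c). Parameters: 2 (A - B = a - b, B - C = b - c).
[Figure: loop design with sample types A, B and C and arrays 1 to 6.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4, y_5, y_6)^T, \qquad \beta = (a - b,\; b - c)^T \]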
43Write the model using design matrix
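Samples: 3 (A = a, B = b, C = c). Parameters: 2 (A - B = a - b, B - C = b - c).
[Figure: loop design with sample types A, B and C and arrays 1 to 6.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & -1 \\ 0 & 1 \\ -1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} a - b \\ b - c \end{pmatrix} = \begin{pmatrix} (a - b) + 0 \\ -1(a - b) + 0 \\ 0 - 1(b - c) \\ 0 + 1(b - c) \\ -1(a - b) - (b - c) \\ (a - b) + (b - c) \end{pmatrix} = \begin{pmatrix} a - b \\ b - a \\ c - b \\ b - c \\ c - a \\ a - c \end{pmatrix} \]
Here E(Y) = Xβ.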
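To see how the two parameters are estimated from six arrays, here is a minimal least-squares sketch that solves the normal equations (X'X)β = X'y by elimination. It is plain Python rather than limma, the log-ratios are made up (and noise-free, so the fit recovers them exactly), and `least_squares` is an illustrative helper name.

```python
def least_squares(X, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    M = [A[i] + [b[i]] for i in range(p)]  # augmented matrix
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("parameters are not independent for this design")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(p):
            if r != col:
                f = M[r][col]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]


# Design matrix from the slide: columns are (a - b, b - c).
X = [[1, 0], [-1, 0], [0, -1], [0, 1], [-1, -1], [1, 1]]

# Made-up, noise-free log-ratios generated with a - b = 1.0 and b - c = 0.5.
y = [1.0, -1.0, -0.5, 0.5, -1.5, 1.5]

estimates = least_squares(X, y)
print(estimates)
```

If a non-independent set of parameters were used as columns, X'X would be singular and the helper would raise an error instead of returning estimates.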
45Specify contrasts of interest
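Samples: 3 (A = a, B = b, C = c). Parameters: 2 (A - B = a - b, B - C = b - c).
[Figure: loop design with sample types A, B and C and arrays 1 to 6.]
Model, E(Y) = Xβ:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ -1 & 0 \\ 0 & -1 \\ 0 & 1 \\ -1 & -1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} a - b \\ b - c \end{pmatrix} \]
Contrast matrix x parameter estimates (called coefficients in limma) = contrasts of interest:
\[ \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \widehat{a - b} \\ \widehat{b - c} \end{pmatrix} = \begin{pmatrix} \widehat{a - b} \\ \widehat{b - c} \\ \widehat{a - b} + \widehat{b - c} \end{pmatrix} \]
The last contrast estimates a - c.
In limma, the contrast matrix is the transpose of the above, i.e. rows become the columns and the columns become the rows.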
46Factorial experiment one sample as a common
reference
Samples: 4 (A = base + a, B = base + b, C = base, AB = base + ab).
Parameters: 3 (A - C = a, B - C = b, AB - C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
47Factorial experiment one sample as a common
reference
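Samples: 4 (A = base + a, B = base + b, C = base, AB = base + ab).
Parameters: 3 (A - C = a, B - C = b, AB - C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4, y_5, y_6)^T, \qquad \beta = (a,\; b,\; ab)^T \]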
48Factorial experiment one sample as a common
reference
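Samples: 4 (A = base + a, B = base + b, C = base, AB = base + ab).
Parameters: 3 (A - C = a, B - C = b, AB - C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ -1 & 0 & 1 \\ 0 & 0 & 1 \\ 1 & -1 & 0 \\ 0 & -1 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ ab \end{pmatrix} = \begin{pmatrix} 0 + b + 0 \\ a + 0 + 0 \\ -a + 0 + ab \\ 0 + 0 + ab \\ a - b + 0 \\ 0 - b + ab \end{pmatrix} = \begin{pmatrix} b \\ a \\ -a + ab \\ ab \\ a - b \\ -b + ab \end{pmatrix} \]
Here E(Y) = Xβ, and ab is a single parameter (AB - C), not a product.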
49Design for factorial experiment with interaction
Samples: 4 (A = c + a, B = c + b, C = c, AB = c + a + b + ab).
Parameters: 3 (A - C = a, B - C = b, AB - A - B + C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
50Design for factorial experiment with interaction
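Samples: 4 (A = c + a, B = c + b, C = c, AB = c + a + b + ab).
Parameters: 3 (A - C = a, B - C = b, AB - A - B + C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4, y_5, y_6)^T, \qquad \beta = (a,\; b,\; ab)^T \]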
51Design for factorial experiment with interaction
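Samples: 4 (A = c + a, B = c + b, C = c, AB = c + a + b + ab).
Parameters: 3 (A - C = a, B - C = b, AB - A - B + C = ab).
[Figure: design graph with sample types A, B, C and AB and arrays 1 to 6.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 0 \\ 0 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & -1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ ab \end{pmatrix} = \begin{pmatrix} 0 + b + 0 \\ a + 0 + 0 \\ 0 + b + ab \\ a + b + ab \\ a - b + 0 \\ a + 0 + ab \end{pmatrix} = \begin{pmatrix} b \\ a \\ b + ab \\ a + b + ab \\ a - b \\ a + ab \end{pmatrix} \]
Here E(Y) = Xβ, and ab is the interaction parameter.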
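Each row of a two-colour design matrix is the difference between the effect vectors of the two samples on that array. The sketch below applies this to the interaction parameterisation. The pairing of samples on each array is an assumption inferred from the expected values on the slide (the design figure itself is not reproduced here), and `design_matrix` is an illustrative helper name.

```python
def design_matrix(effects, arrays):
    """One row per array: effects of the first sample minus effects of the second."""
    return [
        [t - r for t, r in zip(effects[first], effects[second])]
        for first, second in arrays
    ]


# Coefficients on (a, b, ab) for each sample type, relative to C.
effects = {
    "C": (0, 0, 0),
    "A": (1, 0, 0),
    "B": (0, 1, 0),
    "AB": (1, 1, 1),  # a + b + ab
}

# Assumed array layout: (first sample, second sample), log-ratio = first - second.
arrays = [("B", "C"), ("A", "C"), ("AB", "A"), ("AB", "C"), ("A", "B"), ("AB", "B")]

X = design_matrix(effects, arrays)
for row in X:
    print(row)
```

Changing the entries of `effects` is all that is needed to try a different parameterisation of the same experiment.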
52Interaction
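[Figure: two panels, "ab positive" and "ab negative". Each shows the levels c, c + a, c + b and the joint level c + a + b + ab for treatment A, treatment B and the joint treatment, with the interaction ab marked.]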
53Trend analysis
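Samples: 5 (C = base, T1 = base + a, T2 = base + 2a, T3 = base + 3a, T4 = base + 4a).
Parameters: 1 (T2 - T1 = a).
[Figure: time-course design with six arrays.]
Note: the possible number of parameters is 4, but we choose here to use only one parameter.
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, y_3, y_4, y_5, y_6)^T, \qquad \beta = (a) \]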
54Trend analysis
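Samples: 5 (C = base, T1 = base + a, T2 = base + 2a, T3 = base + 3a, T4 = base + 4a).
Parameters: 1 (T2 - T1 = a).
[Figure: design graph with sample types C, T1, T2, T3 and T4 and arrays 1 to 6.]
A straight line model is fitted:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \\ 1 \\ 1 \\ -4 \\ 2 \end{pmatrix} a \]
Here E(Y) = Xβ with β = (a).
[Inset: expression against time for a big a, a small a and a large negative a.]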
55Trend analysis
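Samples: 5 (C = base, T1 = base + a, T2 = base + 4a, T3 = base + 9a, T4 = base + 16a).
Parameters: 1 (T1 - C = a).
[Figure: design graph with sample types C, T1, T2, T3 and T4 and arrays 1 to 6.]
A quadratic model is fitted:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 1 \\ 3 \\ 5 \\ 7 \\ -16 \\ 4 \end{pmatrix} a \]
Here E(Y) = Xβ with β = (a).
[Inset: expression against time for a big a, a small a and a large negative a.]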
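The design columns for the trend models follow directly from the multiple of a carried by each time point. The sketch below rebuilds the straight-line and quadratic columns. The array layout is an assumption inferred from the columns shown on the slides, and `trend_design` is an illustrative helper name.

```python
def trend_design(arrays, coef):
    """Design column: multiple of a in the first sample minus that in the second."""
    return [coef[first] - coef[second] for first, second in arrays]


time_points = ["C", "T1", "T2", "T3", "T4"]

# Multiple of a at each time point: 0, 1, 2, 3, 4 (straight line)
# or 0, 1, 4, 9, 16 (quadratic).
linear = {name: i for i, name in enumerate(time_points)}
quadratic = {name: i * i for i, name in enumerate(time_points)}

# Assumed layout of arrays 1 to 6: (first sample, second sample).
arrays = [("T1", "C"), ("T2", "T1"), ("T3", "T2"), ("T4", "T3"), ("C", "T4"), ("T2", "C")]

print(trend_design(arrays, linear))     # [1, 1, 1, 1, -4, 2]
print(trend_design(arrays, quadratic))  # [1, 3, 5, 7, -16, 4]
```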
56Trend analysis
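Samples: 5 (C = base, T1 = base + 16a, T2 = base + 9a, T3 = base + 4a, T4 = base + a).
Parameters: 1 (T4 - C = a).
[Figure: design graph with sample types C, T1, T2, T3 and T4 and arrays 1 to 6.]
A quadratic model is fitted:
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \end{pmatrix} = \begin{pmatrix} 16 \\ -7 \\ -5 \\ -3 \\ -1 \\ 9 \end{pmatrix} a \]
Here E(Y) = Xβ with β = (a).
[Inset: expression against time for a big a, a small a and a large negative a.]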
57 2 by 3 factorial experiment
- Identify DE genes that have different time profiles between different mutants.
- a = time effect, b = strain effect, ab = interaction effect.
[Figure: M against time (0, 12, 24) for strain A and strain B, for the case a > 0, b = 0, ab = 0.]
58Design matrix for single-colour arrays
Samples and parameters: represent the effect measured by each sample.
  Single-channel representation (4 types of samples): A = a, B = b, C = c, D = d
Replicates: A on arrays 1, 2, 3; B on arrays 4, 5, 6; C on arrays 7, 8; D on arrays 9, 10.
[Figure: squares represent single-colour arrays, numbers represent the array number.]
Observations (data) are log-intensities.
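59
Samples and parameters: A = a, B = b, C = c, D = d.
[Figure: single-colour arrays 1 to 3 for A, 4 to 6 for B, 7 and 8 for C, 9 and 10 for D.]
\[ E(Y) = X\beta, \qquad Y = (y_1, y_2, \ldots, y_{10})^T, \qquad \beta = (a,\; b,\; c,\; d)^T \]
Data are log-intensities, NOT log-ratios.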
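60
Samples and parameters: A = a, B = b, C = c, D = d.
[Figure: single-colour arrays 1 to 3 for A, 4 to 6 for B, 7 and 8 for C, 9 and 10 for D.]
\[ E\begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \\ y_5 \\ y_6 \\ y_7 \\ y_8 \\ y_9 \\ y_{10} \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} = \begin{pmatrix} a \\ a \\ a \\ b \\ b \\ b \\ c \\ c \\ d \\ d \end{pmatrix} \]
Here E(Y) = Xβ.
Data are log-intensities, NOT log-ratios. The design matrix is EASY!!!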
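For single-colour arrays the design matrix is just an indicator of which sample type was hybridised to each array, so it can be generated from a list of labels. A minimal sketch, with `one_hot_design` as an illustrative helper name:

```python
def one_hot_design(labels, levels):
    """One row per array, one column per sample type; 1 marks the array's type."""
    return [[1 if label == level else 0 for level in levels] for label in labels]


# Sample type on arrays 1 to 10, as on the slide.
labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 2 + ["D"] * 2
levels = ["A", "B", "C", "D"]

X = one_hot_design(labels, levels)
for array_number, row in enumerate(X, start=1):
    print(array_number, row)
```

With this design, the least-squares estimate of each parameter is the mean log-intensity of the arrays in that group.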
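61
[Figure: single-colour arrays 1 to 3 for A, 4 to 6 for B, 7 and 8 for C, 9 and 10 for D.]
Contrast matrix applied to the parameter estimates:
\[ \begin{pmatrix} 1 & -1 & 0 & 0 \\ 1 & 0 & -1 & 0 \end{pmatrix} \begin{pmatrix} \hat{a} \\ \hat{b} \\ \hat{c} \\ \hat{d} \end{pmatrix} = \begin{pmatrix} \hat{a} - \hat{b} \\ \hat{a} - \hat{c} \end{pmatrix} \]
In limma, the contrast matrix is the transpose of the above, i.e. rows become the columns and the columns become the rows.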
62Recipe: getting the design matrix right every time
63
- Step 1: draw a picture of your experiment
  - Make sure your arrays are connected
- Step 2: decide on the parameters of interest
  - Make sure your parameters of interest do not form a loop in your experimental design picture
  - Make sure your parameters involve every treatment type at least once
  - Try a few different ways of parameterising your experiments
- Step 3: label your parameters
- Step 4: specify each slide using the parameters you selected
  - Some slides will need a combination of parameters in order to specify them
- Step 5
64Experimental design what is hybridised to what?
[Figure: design graph showing which of the treatments a4, a26, a30, ref, w14, w13 and w8 are hybridised together.]
65A suitable parameterisation
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with a suitable set of parameters marked.]
66Can we add another parameter to our model?
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the chosen parameters marked.]
68Can we add another parameter to our model? No!
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the chosen parameters marked.]
There are 7 treatments, so there can only be 6 parameters: any more would create a loop, and any fewer would leave out one of the treatments.
69Can we add this parameter to our model?
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with one extra candidate parameter marked.]
70Can we add this parameter to our model? No! We created a loop.
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the extra parameter closing a loop.]
71What's wrong with this parameterisation?
There are 6 (n - 1) parameters! There is no loop! Have we left out a treatment?
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with six parameters marked.]
72What's wrong with this parameterisation? No parameter involving w13
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with six parameters marked.]
No combination of parameters can specify treatment w13.
73What's wrong with this parameterisation? No parameter involving w13
[Figure: the same design graph for a4, a30, ref, w13, w14, w8 and a26, rearranged.]
74What's wrong with this parameterisation? No parameter involving w13. There is also a loop! (We can easily see this when we rearrange the picture.)
[Figure: the same design graph for a4, a30, ref, w13, w14, w8 and a26, rearranged so that the loop is visible.]
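The three checks used on these slides (n - 1 parameters, no loop, every treatment involved) can be automated by treating treatments as nodes and parameters as edges of a graph. The sketch below is an illustration only: the edge lists are assumptions (the first is loosely based on parameter names such as a4a30 and a26w13 that appear on later slides, the second is a made-up faulty variant), and `check_parameterisation` is a hypothetical helper.

```python
def check_parameterisation(treatments, parameters):
    """Return a list of problems; an empty list means the choice is suitable."""
    problems = []
    parent = {t: t for t in treatments}

    def find(t):
        # Follow parent links to the representative of t's connected group.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    if len(parameters) != len(treatments) - 1:
        problems.append(f"expected {len(treatments) - 1} parameters, got {len(parameters)}")
    for u, v in parameters:
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            # Both ends are already connected, so this parameter closes a loop.
            problems.append(f"loop closed by {u}-{v}")
        else:
            parent[root_u] = root_v
    used = {t for edge in parameters for t in edge}
    for t in treatments:
        if t not in used:
            problems.append(f"no parameter involves {t}")
    return problems


treatments = ["a4", "a26", "a30", "ref", "w14", "w13", "w8"]

# Assumed suitable parameterisation: six parameters forming a tree.
good = [("a30", "ref"), ("a4", "a30"), ("a26", "a4"),
        ("a4", "w14"), ("a30", "w8"), ("a26", "w13")]

# Made-up faulty variant: still six parameters, but w13 is left out and a loop appears.
bad = [("a30", "ref"), ("a4", "a30"), ("a26", "a4"),
       ("a4", "w14"), ("a30", "w8"), ("a26", "a30")]

print(check_parameterisation(treatments, good))  # []
print(check_parameterisation(treatments, bad))
```

The faulty variant shows why counting parameters is not enough: with exactly n - 1 parameters, leaving out a treatment always forces a loop somewhere else.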
75A suitable parameterisation
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with a suitable set of six parameters marked.]
76Another suitable parameterisation
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with a different suitable set of six parameters marked.]
77Yet another suitable parameterisation!
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with a third suitable set of six parameters marked.]
78Naming the parameters allows you to specify each array according to your parameterisation (or model)
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the six chosen parameters named a4, a26, a30, w13, w8 and w14.]
79
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the six chosen parameters named a4a30, a26a4, a30, a4w14, a30w8 and a26w13.]
80
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the six chosen parameters named a4a30, a26a4, a26w13, a30w8, a4w14 and w13.]
81A statistician might name your parameters differently
[Figure: design graph for a4, a26, a30, ref, w14, w13 and w8, with the six parameters named β1, β2, β3, β4, β5 and β6.]