Title: Estimating parameters in a statistical model

1 Estimating parameters in a statistical model
source("http://eh3.uc.edu/teaching/cfg/2006/R/NickelLinearModels.R")
- Likelihood and Maximum likelihood estimation
- General Linear Model - the general framework that will allow us to perform complicated analyses relatively easily
2 Random Sample
- Random sample: a set of observations generated independently by the statistical model.
- For example, n replicated measurements of the differences in expression levels for a gene under two different treatments: x1, x2, ..., xn iid N(μ, σ²)
- Given the parameters, the statistical model defines the probability of observing any particular combination of the values in this sample.
- Since the observations are independent, the probability distribution function describing the probability of observing a particular combination is the product of the individual probability distribution functions.
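A quick base-R illustration of that product rule (the numbers below are made up for illustration, not the actual Nickel measurements):

```r
# Joint pdf of an iid N(mu, sigma^2) sample = product of the
# individual normal densities, by independence.
x <- c(-0.12, 1.22, -0.48)   # hypothetical replicate measurements
mu <- 0.2; sigma <- 0.8

joint <- prod(dnorm(x, mean = mu, sd = sigma))

# The same product written out from the normal pdf formula
joint.manual <- prod(1/(sqrt(2*pi)*sigma) * exp(-(x - mu)^2 / (2*sigma^2)))

all.equal(joint, joint.manual)   # TRUE
```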
3 Probability distribution function vs probability
- In the case of discrete random variables, which have a countable number of potential values (assume finitely many for now), the probability distribution function is equal to the probability of each value (outcome).
- In the case of a continuous random variable, which can yield any number in a particular interval, the probability distribution function is different from the probability.
- The probability of any particular number for a continuous random variable is equal to zero.
- The probability density function defines the probability of the number falling into any particular sub-interval as the area under the curve defined by the probability density function.
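In R terms, this is a sketch with the standard normal (nothing specific to the course data):

```r
# The density is not a probability: probabilities are areas under it.
mu <- 0; sigma <- 1

# P(0 < x < 1) as the area under the density curve...
area <- integrate(dnorm, lower = 0, upper = 1, mean = mu, sd = sigma)$value
# ...matches the cdf difference
p <- pnorm(1, mu, sigma) - pnorm(0, mu, sigma)
all.equal(area, p)   # TRUE

# A density value itself can exceed 1, so it cannot be a probability
dnorm(0, mean = 0, sd = 0.1)   # about 3.99
```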
4 Probability distribution function vs probability
- Example: the assumption of our Normal model is that the outcome can be pretty much any real number. This is obviously a wrong assumption, but it turns out that this model is a good approximation of reality.
- We could "discretize" this random variable. Define the r.v. y = 1 if x > c and y = 0 otherwise, for some constant c.
- This random variable can assume 2 different values, and its probability distribution function is defined by p(y = 1).
- Although the probability distribution function in the case of a continuous random variable does not give probabilities, it satisfies the key properties of a probability.
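The discretization above can be simulated directly (c = 0.5 and the standard normal are arbitrary choices):

```r
# y = 1 if x > c and 0 otherwise: a 2-valued r.v. whose pdf is
# fully determined by p(y = 1) = 1 - pnorm(c, mu, sigma).
set.seed(1)
mu <- 0; sigma <- 1
cc <- 0.5   # the constant c

x <- rnorm(1e5, mu, sigma)
y <- as.numeric(x > cc)

c(empirical = mean(y), theoretical = 1 - pnorm(cc, mu, sigma))
```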
5 Back to basics: Probability, Conditional Probability and Independence
- For x1, ..., xn iid with pdf f(x)
- For y1, ..., yn iid with pdf p(y)
- From now on, we will talk in terms of just a pdf, and things will hold for both discrete and continuous random variables.
6 Expectation, Expected value and Variance
- Discrete pdf p(y)
- Expectation of any function g of the random variable y (the average value of the function after a large number of experiments): E(g(y)) = sum over y of g(y) p(y)
- Continuous pdf f(x)
- Expectation of any function g of the random variable x: E(g(x)) = integral of g(x) f(x) dx
- Expected value - the average x after a very large number of experiments: E(x)
- Expected value - the average y after a very large number of experiments: E(y)
- Variance - the expected value of (y - E(y))²
- Variance - the expected value of (x - E(x))²
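These definitions can be illustrated in R; the sketch below uses g(x) = x² and an N(1, 2²) variable, both chosen arbitrarily:

```r
# E(g(x)) two ways: as a long-run average over many simulated
# "experiments", and via the integral definition for a continuous pdf.
set.seed(2)
mu <- 1; sigma <- 2
g <- function(x) x^2

x <- rnorm(1e6, mu, sigma)
mc <- mean(g(x))   # average of g after many experiments

exact <- integrate(function(x) g(x) * dnorm(x, mu, sigma),
                   lower = -Inf, upper = Inf)$value

c(monte.carlo = mc, integral = exact)   # both near mu^2 + sigma^2 = 5
```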
7 Expected Value and Variance of a Normal Random Variable
- Expected value - the average x after a very large number of experiments: E(x) = μ
- Variance - the expected value of (x - E(x))²: Var(x) = σ²
8 Maximum Likelihood
- x1, x2, ..., xn iid N(μ, σ²)
- Joint pdf for the whole random sample: the product of the individual pdfs f(xi)
- The likelihood function is basically the joint pdf evaluated for the fixed sample, viewed as a function of the parameters
- Maximum likelihood estimates of the model parameters μ and σ² are the numbers that maximize the joint pdf for the fixed sample, which is called the likelihood function
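A base-R sketch of this idea on simulated data (not the course's Nickel data): maximize the log-likelihood numerically and compare with the closed-form normal MLEs.

```r
# Maximum likelihood for x1,...,xn iid N(mu, sigma^2)
set.seed(3)
x <- rnorm(50, mean = 2, sd = 1.5)   # simulated sample
n <- length(x)

# Numerical MLE: minimize the negative log-likelihood;
# sigma is parameterized as exp(logsigma) to keep it positive.
negloglik <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
}
fit <- optim(c(0, 0), negloglik)
c(mu = fit$par[1], sigma = exp(fit$par[2]))

# Closed-form MLEs: the sample mean, and the variance with divisor n
# (not n-1), which is why the MLE of sigma^2 is slightly biased.
c(mu = mean(x), sigma = sqrt((n - 1)/n * var(x)))
```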
9 Simple Linear Model (Use y for x)
Statistical Model:
Trtj = "Effect due to treatment j" (Ctl for j=1 and Nic for j=2)
yij = log-intensity for replicate i and treatment j, where i=1,2,3 and j=1,2
Maximum Likelihood Estimation
We are interested in d = Trt2 - Trt1
10 Alternative Linear Model
Statistical Model:
Tj = "Effect due to treatment j" (Ctl for j=1 and Nic for j=2)
μ = Overall Mean Expression Level
yij = log-intensity for replicate i and treatment j, where i=1,2,3 and j=1,2
Maximum Likelihood Estimation
We are interested in d = T2 - T1
11 We can specify different linear models and still get the same answer if we know how to ask the right question
M1: Estimating Two Parameters
M2: Estimating Three Parameters - need an additional "constraint": Set T1 = 0
Specifying the model for limma and calculating:
> designMatrix1 <- cbind(Ctl=c(1,0,0,0,1,1), Nic=c(0,1,1,1,0,0))
> designMatrix1
     Ctl Nic
[1,]   1   0
[2,]   0   1
[3,]   0   1
[4,]   0   1
[5,]   1   0
[6,]   1   0
> contrast <- c(-1,1)
> FitLM1 <- lmFit(LinearModelData, designMatrix1)
> FitCLM1 <- contrasts.fit(FitLM1, contrast)
> EFitLM1 <- eBayes(FitCLM1)
> designMatrix2 <- cbind(Int=c(1,1,1,1,1,1), Nic=c(0,1,1,1,0,0))
> designMatrix2
     Int Nic
[1,]   1   0
[2,]   1   1
[3,]   1   1
[4,]   1   1
[5,]   1   0
[6,]   1   0
> FitLM2 <- lmFit(LinearModelData, designMatrix2)
> EFitLM2 <- eBayes(FitLM2)
12 Pretending That Data is Single Channel Again
> LinearModelData <- cbind(NLNBLimmaData$Nickel, NLNBLimmaData$Nickel)
> dim(LinearModelData)
[1] 14112     6
> LinearModelData$M[1,]
Cy5_Ctl-WT00hr_2_VS_Cy3_Nic-WT72hr_2 Cy5_Nic-WT72hr_3_VS_Cy3_Ctl-WT00hr_3
                          -0.1216908                            1.2155011
Cy5_Nic-WT72hr_1_VS_Cy3_Ctl-WT00hr_1 Cy5_Ctl-WT00hr_2_VS_Cy3_Nic-WT72hr_2
                          -0.4833670                           -0.1216908
Cy5_Nic-WT72hr_3_VS_Cy3_Ctl-WT00hr_3 Cy5_Nic-WT72hr_1_VS_Cy3_Ctl-WT00hr_1
                           1.2155011                           -0.4833670
> LinearModelData$M <- cbind((NLNBLimmaData$Nickel$A + NLNBLimmaData$Nickel$M/2),
+                            NLNBLimmaData$Nickel$A - (NLNBLimmaData$Nickel$M/2))
> LinearModelData$M[1,]
Cy5_Ctl-WT00hr_2_VS_Cy3_Nic-WT72hr_2 Cy5_Nic-WT72hr_3_VS_Cy3_Ctl-WT00hr_3
                           11.354734                            11.842679
Cy5_Nic-WT72hr_1_VS_Cy3_Ctl-WT00hr_1 Cy5_Ctl-WT00hr_2_VS_Cy3_Nic-WT72hr_2
                            9.485143                            11.476425
Cy5_Nic-WT72hr_3_VS_Cy3_Ctl-WT00hr_3 Cy5_Nic-WT72hr_1_VS_Cy3_Ctl-WT00hr_1
                           10.627177                             9.968510
- This will "split" the ratios back into two-channel data: since M = log2(R/G) and A = (log2 R + log2 G)/2, A + M/2 recovers log2(R) and A - M/2 recovers log2(G).
- We could use the original R and G data, but we want to make use of the loess normalization.
13 Comparing Results of Two Models
> par(mfrow=c(1,2))
> plot(EFitLM1$coefficients, EFitLM2$coefficients[,"Nic"],
+      main="Estimated Differential Expressions", xlab="Model 1", ylab="Model 2")
> plot(-log10(EFitLM1$p.value), -log10(EFitLM2$p.value[,"Nic"]),
+      main="-log10(P-values)", xlab="Model 1", ylab="Model 2")
14 General Linear Model
y = Xβ + ε
Maximum Likelihood Estimation: BetaHat = (X'X)^-1 X'y
15 General Linear Model
y = Xβ + ε
> XpX <- t(designMatrix2) %*% designMatrix2
> XpX
    Int Nic
Int   6   3
Nic   3   3
> XpXInv <- solve(XpX)
> XpXInv
           Int        Nic
Int  0.3333333 -0.3333333
Nic -0.3333333  0.6666667
> XpXInv %*% XpX
    Int Nic
Int   1   0
Nic   0   1
> BetaHat <- (XpXInv %*% t(designMatrix2)) %*% LinearModelData$M[1,]
> BetaHat
          [,1]
Int 10.6501404
Nic  0.2846083
> EFitLM2$coefficients[1,]
       Int        Nic
10.6501404  0.2846083
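As a self-contained check of the normal-equation arithmetic above, here is a base-R sketch on simulated data (the design matrix mirrors designMatrix2, but y is hypothetical, not the Nickel data):

```r
# BetaHat = (X'X)^-1 X'y, verified against R's built-in lm()
set.seed(4)
X <- cbind(Int = rep(1, 6), Nic = c(0, 1, 1, 1, 0, 0))  # same layout as designMatrix2
y <- 10 + 0.3 * X[, "Nic"] + rnorm(6, sd = 0.2)         # hypothetical log-intensities

BetaHat <- solve(t(X) %*% X) %*% t(X) %*% y
BetaHat

# lm() solves the same least-squares / maximum likelihood problem
fit <- lm(y ~ X[, "Nic"])
coef(fit)   # same two numbers as BetaHat
```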
16 Linear Models for paired t-test (The model changes but the Linear Model estimation "machinery" is the same)
y = Xβ + ε
yi is the log-ratio of the Nic-measurement to the Ctl-measurement on the ith microarray
Set T1 = 0, A1 = 0
y = Xβ + ε
17 General Linear Model
> PairedTTSimpleDesign <- c(-1,1,1)
> PairedTT <- lmFit(NLNBLimmaData$Nickel, PairedTTSimpleDesign)
> EPairedTT <- eBayes(PairedTT)
> PairedTTDesign <- cbind(Int=c(1,1,1,1,1,1), Array2=c(0,1,0,0,1,0),
+                         Array3=c(0,0,1,0,0,1), Nic=c(0,1,1,1,0,0))
> PairedTTDesign
     Int Array2 Array3 Nic
[1,]   1      0      0   0
[2,]   1      1      0   1
[3,]   1      0      1   1
[4,]   1      0      0   1
[5,]   1      1      0   0
[6,]   1      0      1   0
> LMPairedTT <- lmFit(LinearModelData, PairedTTDesign)
> ELMPairedTT <- eBayes(LMPairedTT)
18 General Linear Model
> par(mfrow=c(1,2))
> plot(EPairedTT$coefficients, ELMPairedTT$coefficients[,"Nic"],
+      main="Estimated Differential Expressions", xlab="Simple t-test", ylab="Linear Model")
> plot(-log10(EPairedTT$p.value), -log10(ELMPairedTT$p.value[,"Nic"]),
+      main="-log10(P-values)", xlab="Simple t-test", ylab="Linear Model")
19 Why Bother? Turns out that we can do things in the GLM framework that would be really complicated without it - Factoring Out the Gene-Specific Dye Effect
D1 = Cy3 effect, D2 = Cy5 effect
y = Xβ + ε
Set T1 = 0, A1 = 0, D1 = 0
20 General Linear Model
> designMatrixDye <- cbind(Int=c(1,1,1,1,1,1), Array2=c(0,1,0,0,1,0),
+                          Array3=c(0,0,1,0,0,1), Nic=c(0,1,1,1,0,0),
+                          Cy5=c(1,1,1,0,0,0))
> designMatrixDye
     Int Array2 Array3 Nic Cy5
[1,]   1      0      0   0   1
[2,]   1      1      0   1   1
[3,]   1      0      1   1   1
[4,]   1      0      0   1   0
[5,]   1      1      0   0   0
[6,]   1      0      1   0   0
> FitLMAD <- lmFit(LinearModelData, designMatrixDye)
> EFitLMAD <- eBayes(FitLMAD)
> EBayesFDRD <- p.adjust(EFitLMAD$p.value[,"Nic"], method="fdr")
> DyeFDR <- p.adjust(EFitLMAD$p.value[,"Cy5"], method="fdr")
> ELMPairedTTFDR <- p.adjust(ELMPairedTT$p.value[,"Nic"], method="fdr")
21 General Linear Model
> plot(EFitLMAD$coefficients[,"Nic"], ELMPairedTT$coefficients[,"Nic"],
+      xlab="Linear Model with Dye", ylab="Linear Model without Dye")
22 General Linear Model
> boxplot(list(NoArray=FitLM2$sigma, NoDyeEffect=LMPairedTT$sigma,
+              WithDyeEffect=FitLMAD$sigma, NoDyeEffect=LMPairedTT$sigma),
+         xlab="Linear Model with Dye", ylab="Linear Model without Dye")
> boxplot(list(NoArray=FitLM2$sigma, NoDyeEffect=LMPairedTT$sigma,
+              WithDyeEffect=FitLMAD$sigma),
+         xlab="Linear Model with Dye", ylab="Linear Model without Dye")
> boxplot(list(NoArray=FitLM2$sigma, NoDyeEffect=LMPairedTT$sigma,
+              WithDyeEffect=FitLMAD$sigma),
+         xlab="", ylab="Estimated Sigma")
23 Factoring out Dye effect - what it really means
D1 = Cy3 effect, D2 = Cy5 effect
y = Xβ + ε
Set T1 = 0, A1 = 0, D1 = 0
Set T1 = 0, A1 = 0
y = Xβ + ε
24 What happens if we don't factor out the Dye effect?
D1 = Cy3 effect, D2 = Cy5 effect
25 Factoring out Dye effect - what it really means
y = Xβ + ε
Set T1 = 0, A1 = 0, D1 = 0
26 To Dye or not to Dye
- If there is a significant dye effect, we need to factor it out or the estimates of the effects of interest will be biased.
- The estimates of variance are likely to be smaller after factoring out the dye effect, even if it is not significant.
- The estimates of standard errors (the denominator in the t-statistic) would also likely be smaller.
- P-values could increase due to a lower t-statistic or because of the loss of one degree of freedom.
27 General Linear Model
> par(mfrow=c(1,3))
> plot(ELMPairedTT$coefficients[,"Nic"], -log(ELMPairedTTFDR, base=10),
+      main="Empirical Bayes t-test",
+      xlab="Mean Difference", ylab="-log10(FDRpvalue)", ylim=c(0,3))
> grid(nx=NULL, ny=NULL, col="lightgray", lty="dotted", lwd=NULL, equilogs=TRUE)
> plot(EFitLMAD$coefficients[,"Nic"], -log(EBayesFDRD, base=10),
+      main="Empirical Bayes t-test with Dye",
+      xlab="Mean Difference", ylab="-log10(FDRpvalue)", ylim=c(0,3))
> grid(nx=NULL, ny=NULL, col="lightgray", lty="dotted", lwd=NULL, equilogs=TRUE)
> plot(EFitLMAD$coefficients[,"Nic"], -log(DyeFDR, base=10),
+      main="Empirical Bayes t-test for Dye",
+      xlab="Mean Difference", ylab="-log10(FDRpvalue)", ylim=c(0,3))
> grid(nx=NULL, ny=NULL, col="lightgray", lty="dotted", lwd=NULL, equilogs=TRUE)
28 What happens if we don't factor out the Dye effect?
- Because the design is unbalanced with respect to the dye assignment (i.e., not all treatments have an equal number of Cy5 and Cy3 labelings), the existence of a dye effect (i.e., D2 ≠ 0) would introduce "bias".
- If the design were balanced with respect to dye, there would be no bias, but the existence of a dye effect would increase the variability of the estimates.
- The dye effect is not present in Affymetrix technology, but the same issues arise in multi-factorial experiments where the effect of one treatment needs to be assessed after factoring out the effects of other experimental factors.
- For example: the effect of nickel on two different strains of mice - the overall effect of nickel needs to be adjusted for the effect of the mouse genotype.
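The bias claim can be checked with a small simulation; the dye and treatment columns below copy the unbalanced design used in this lecture, while the effect sizes and noise level are made up for illustration:

```r
# With an unbalanced design, an unmodeled dye effect leaks into the
# estimated treatment effect; adding the dye column removes the bias.
set.seed(5)
nic <- c(0, 1, 1, 1, 0, 0)   # treatment column from the design above
cy5 <- c(1, 1, 1, 0, 0, 0)   # dye column: unbalanced w.r.t. treatment
nic.true <- 0.3              # hypothetical treatment effect
d.true <- 1                  # hypothetical dye effect

est <- replicate(2000, {
  y <- nic.true * nic + d.true * cy5 + rnorm(6, sd = 0.1)
  c(without = coef(lm(y ~ nic))["nic"],        # dye effect ignored
    with    = coef(lm(y ~ nic + cy5))["nic"])  # dye effect factored out
})
rowMeans(est)   # "without" is biased away from 0.3; "with" is not
```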
29 Creating Design Matrices
> arrays <- factor(c(1,2,3,1,2,3))
> dye <- factor(c("Cy5","Cy5","Cy5","Cy3","Cy3","Cy3"))
> trt <- factor(c("Ctl","Nic","Nic","Nic","Ctl","Ctl"))
> DMRDye <- model.matrix(~arrays+dye+trt)
> DMRDye
  (Intercept) arrays2 arrays3 dyeCy5 trtNic
1           1       0       0      1      0
2           1       1       0      1      1
3           1       0       1      1      1
4           1       0       0      0      1
5           1       1       0      0      0
6           1       0       1      0      0
attr(,"assign")
[1] 0 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$arrays
[1] "contr.treatment"
attr(,"contrasts")$dye
[1] "contr.treatment"
attr(,"contrasts")$trt
[1] "contr.treatment"