Title: Multivariate Analysis and Discrimination
1Multivariate Analysis and Discrimination
- EPP 245
- Statistical Analysis of
- Laboratory Data
2Cystic Fibrosis Data Set
- The 'cystfibr' data frame has 25 rows and 10
columns. It contains lung function data for
cystic fibrosis patients (7-23 years old) - We will examine the relationships among the
various measures of lung function
3- age a numeric vector. Age in years.
- sex a numeric vector code. 0 male, 1female.
- height a numeric vector. Height (cm).
- weight a numeric vector. Weight (kg).
- bmp a numeric vector. Body mass ( of normal).
- fev1 a numeric vector. Forced expiratory volume.
- rv a numeric vector. Residual volume.
- frc a numeric vector. Functional residual
capacity. - tlc a numeric vector. Total lung capacity.
- pemax a numeric vector. Maximum expiratory
pressure.
4Scatterplot matrices
- We have five variables and may wish to study the
relationships among them - We could separately plot the (5)(4)/2 10
pairwise scatterplots - The graph matrix function in Stata
- .graph matrix fev1 rv frc tlc pemax
5(No Transcript)
6Principal Components Analysis
- The idea of PCA is to create new variables that
are combinations of the original ones. - If x1, x2, , xp are the original variables, then
a component is a1x1 a2x2 apxp - We pick the first PC as the linear combination
that has the largest variance - The second PC is that linear combination
orthogonal to the first one that has the largest
variance, and so on
7(No Transcript)
8. pca fev1 rv frc tlc pemax Principal
components/correlation Number of
obs 25
Number of comp. 5
Trace 5 Rotation
(unrotated principal) Rho
1.0000 ------------------------------
--------------------------------------------
Component Eigenvalue Difference
Proportion Cumulative ---------------------
--------------------------------------------------
-- Comp1 3.22412 2.33772
0.6448 0.6448 Comp2
.886399 .40756 0.1773
0.8221 Comp3 .478839
.133874 0.0958 0.9179
Comp4 .344966 .279286
0.0690 0.9869 Comp5
.0656798 . 0.0131
1.0000 ---------------------------------------
----------------------------------- Principal
components (eigenvectors)
--------------------------------------------------
---------------------------- Variable
Comp1 Comp2 Comp3 Comp4 Comp5
Unexplained --------------------------------
--------------------------------------------
fev1 -0.4525 0.2140 0.5539
0.6641 -0.0397 0 rv
0.5043 0.1736 -0.2977 0.4993
-0.6145 0 frc
0.5291 0.1324 0.0073 0.3571 0.7582
0 tlc 0.4156 0.4525
0.6474 -0.4134 -0.1806 0
pemax -0.2970 0.8377 -0.4306
-0.1063 0.1152 0
--------------------------------------------------
----------------------------
9(No Transcript)
10(No Transcript)
11Fishers Iris Data
- This famous (Fisher's or Anderson's) iris data
set gives the measurements in centimeters of the
variables sepal length and width and petal length
and width, respectively, for 50 flowers from each
of 3 species of iris. The species are _Iris
setosa_, _versicolor_, and _virginica_.
12. summarize Variable Obs Mean
Std. Dev. Min Max ---------------
--------------------------------------------------
---- index 150 75.5
43.44537 1 150 sepallength
150 5.843333 .8280661 4.3
7.9 sepalwidth 150 3.057333
.4358663 2 4.4 petallength
150 3.758 1.765298 1
6.9 petalwidth 150 1.199333
.7622377 .1 2.5 ------------------
--------------------------------------------------
- species 0 . graph matrix
sepallength sepalwidth petallength petalwidth
13(No Transcript)
14. pca sepallength sepalwidth petallength
petalwidth Principal components/correlation
Number of obs 150
Number
of comp. 4
Trace
4 Rotation (unrotated principal)
Rho 1.0000
--------------------------------------------------
------------------------ Component
Eigenvalue Difference Proportion
Cumulative ----------------------------------
---------------------------------------
Comp1 2.9185 2.00447
0.7296 0.7296 Comp2
.91403 .767274 0.2285
0.9581 Comp3 .146757
.126042 0.0367 0.9948
Comp4 .0207148 .
0.0052 1.0000 --------------------------
------------------------------------------------
Principal components (eigenvectors)
--------------------------------------------------
------------------ Variable Comp1
Comp2 Comp3 Comp4 Unexplained
-------------------------------------------------
----------------- sepallength 0.5211
0.3774 -0.7196 -0.2613 0
sepalwidth -0.2693 0.9233 0.2444
0.1235 0 petallength 0.5804
0.0245 0.1421 0.8014 0
petalwidth 0.5649 0.0669 0.6343
-0.5236 0 -----------------------
---------------------------------------------
15(No Transcript)
16Multinomial Logit
- Logistic regression can only deal with binary
categories - One alternative that can deal with more than two
categories is discriminant analysis. This comes
in two flavors, (Fishers) Linear Discriminant
Analysis or LDA and (Fishers) Quadratic
Discriminant Analysis or QDA - In each case we model the shape of the groups and
provide a dividing line/curve
17- Unfortunately, Stata does not provide an LDA or
QDA program - An alternative is multinomial logit, in which
logistic regression is used but with more than
two categories.
18 . use "C\TD\CLASS\K30-2006\iris1.dta" . encode
species, generate(nspecies) . mlogit nspecies
sepallength sepalwidth petallength
petalwidth Iteration 0 log likelihood
-164.79184 Iteration 1 log likelihood
-67.780459 Iteration 2 log likelihood
-45.612546 Iteration 3 log likelihood
-23.256068 Iteration 4 log likelihood
-12.950564 Iteration 5 log likelihood
-8.8076897 Iteration 6 log likelihood
-7.0436566 Iteration 7 log likelihood
-6.2945111 Iteration 8 log likelihood
-6.0232697 Iteration 22 log likelihood
-5.9492736 Iteration 23 log likelihood
-5.9492736
19 Multinomial logistic regression
Number of obs 150
LR chi2(8)
317.69
Prob gt chi2 0.0000 Log
likelihood -5.9492736
Pseudo R2 0.9639 --------------------
--------------------------------------------------
-------- nspecies Coef. Std. Err.
z Pgtz 95 Conf. Interval ------------
-------------------------------------------------
---------------- versicolor sepallength
-8.596623 975.8807 -0.01 0.993
-1921.288 1904.094 sepalwidth -6.182755
. . . .
. petallength 16.82211 . .
. . . petalwidth
17.8852 9.742607 1.84 0.066 -1.209956
36.98036 _cons 7.261185 25.70765
0.28 0.778 -43.12489 57.64726 ----------
-------------------------------------------------
------------------ virginica sepallength
-11.06184 975.8824 -0.01 0.991
-1923.756 1901.632 sepalwidth -12.86364
4.479563 -2.87 0.004 -21.64342
-4.083859 petallength 26.2515 4.737208
5.54 0.000 16.96674 35.53626
petalwidth 36.17133 . .
. . . _cons
-35.37661 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome) .
20. test sepallength ( 1) versicolorsepallength
0 ( 2) virginicasepallength 0
chi2( 2) 1.06 Prob gt chi2
0.5885 . mlogit nspecies sepalwidth petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(6) 316.32
Prob gt chi2
0.0000 Log likelihood -6.6329197
Pseudo R2 0.9597 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
sepalwidth -11.07503 2941.788 -0.00
0.997 -5776.873 5754.723 petallength
12.326 3.840663 3.21 0.001 4.798441
19.85356 petalwidth 23.4702 .
. . . .
_cons -16.4182 23.99477 -0.68 0.494
-63.44709 30.61069 ---------------------------
--------------------------------------------------
virginica sepalwidth -19.4511
2941.798 -0.01 0.995 -5785.269
5746.367 petallength 20.20055 .
. . . .
petalwidth 44.8998 10.70735 4.19
0.000 23.91378 65.88582 _cons
-66.94504 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome)
21. test sepalwidth ( 1) versicolorsepalwidth
0 ( 2) virginicasepalwidth 0
chi2( 2) 3.09 Prob gt chi2
0.2128 . mlogit nspecies petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(4) 309.02
Prob gt chi2
0.0000 Log likelihood -10.281754
Pseudo R2 0.9376 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
petallength 18.61007 2659.161 0.01
0.994 -5193.25 5230.47 petalwidth
23.61704 . . . .
. _cons -63.27827 13.61167
-4.65 0.000 -89.95664
-36.5999 ----------------------------------------
------------------------------------- virginica
petallength 24.3646 2659.158 0.01
0.993 -5187.489 5236.218 petalwidth
34.06374 3.755651 9.07 0.000 26.7028
41.42468 _cons -108.5506 .
. . .
. ------------------------------------------------
------------------------------ (nspeciessetosa
is the base outcome)
22. test petallength ( 1) versicolorpetallength
0 ( 2) virginicapetallength 0
chi2( 2) 6.23 Prob gt chi2
0.0444 . test petalwidth ( 1)
versicolorpetalwidth 0 ( 2)
virginicapetalwidth 0 Constraint 1
dropped chi2( 1) 82.26
Prob gt chi2 0.0000 . twoway (scatter
petalwidth petallength if nspecies1,
mcolor(red)) (scatter petalwidth petallength
if nspecies2, mcolor(blue)) (scatter
petalwidth petallength if nspecies3,
mcolor(black)), legend(off) . graph export
petals.wmf
23(No Transcript)
24Lymphoma Data Set
- Alizadeh et al. Nature (2000) Distinct types of
diffuse large B-cell lymphoma identified by gene
expression profiling - We will analyze a subset of the data consisting
of 61 arrays on patients with - 45 Diffuse large B-cell lymphoma (DLBCL/DL)
- 10 Chronic lymphocytic leukaemia (CLL/CL)
- 6 Follicular leukaemia (FL)
25Data Available
- The original Nature paper
- The expression data in the form of unbackground
corrected log ratios. A common reference was
always on Cy3 with the sample on Cy5
(array.data.txt and array.data.zip). 9216 by 61 - A file with array codes and disease status for
each of the 61 arrays, ArrayID.txt
26Identify Differentially Expressed Genes
- We will assume that the log ratios are on a
reasonable enough scale that we can use them as
is - For each gene, we can run a one-way ANOVA and
find the p-value, obtaining 9,216 of them. - Account for multiplicity by using a small p-value
threshold - Identify genes with small adjusted p-values
27Develop Classifier
- Reduce dimension with ANOVA gene selection or
with PCA - Use logistic regression of LDA
- Evaluate the four possibilities and their
sub-possibilities with cross validation. With 61
arrays one could reasonable omit 10 or 6 at
random - This is difficult with Array Assist or Stata
28Differential Expression
- We can locate genes that are differentially
expressed that is, genes whose expression
differs systematically by the type of lymphoma. - To do this, we use lymphoma type to predict
expression, and see when this is statistically
significant. - This can be done in Array Assist
29- It is almost equivalent to locate genes whose
expression can be used to predict lymphoma type,
this being the reverse process. - If there is significant association in one
direction there should logically be significant
association in the other - This will not be true exactly, but is true
approximately - We can also easily do the latter analysis using
the expression of more than one gene using
logistic regresssion, LDA, and QDA
30(No Transcript)
31Significant Genes
- There are 3845 out of 9261 genes that have
significant p-values from the ANOVA of less than
0.05, compared to 463 expected by chance - There are 2733 genes with FDR adjusted p-values
less than 0.05 - There are only 184 genes with Bonferroni adjusted
p-values less than 0.05
32(No Transcript)
33Logistic Regression
- We will use logistic regression to distinguish
DLBCL from CLL and DLBCL from FL - We will do this first by choosing the variables
with the smallest overall p-values in the ANOVA - We will then evaluate the results within sample
and by cross validation
34(No Transcript)
35(No Transcript)
36Evaluation of performance
- Within sample evaluation of performance like this
is unreliable - This is especially true if we are selecting
predictors from a very large set - One useful yardstick is the performance of random
classifiers
37Left is CV performance of best k
variables Random 25.4
38Conclusion
- Logistic regression on the variables with the
smallest p-values does not work very well - This cannot be identified by looking at the
within sample statistics - Cross validation is a requirement to assess
performance of classifiers