Multivariate Analysis and Discrimination - PowerPoint PPT Presentation

About This Presentation

Title:

Multivariate Analysis and Discrimination

Description:

It contains lung function data for cystic fibrosis patients (7-23 ... 10 Chronic lymphocytic leukaemia (CLL/CL) 6 Follicular leukaemia (FL) November 30, 2006 ... – PowerPoint PPT presentation

Number of Views:70

Avg rating:3.0/5.0

Slides: 39

Provided by: davidm133

Learn more at: https://dmrocke.ucdavis.edu

Category:

more less

Transcript and Presenter's Notes

Title: Multivariate Analysis and Discrimination

1
Multivariate Analysis and Discrimination

EPP 245
Statistical Analysis of
Laboratory Data

2
Cystic Fibrosis Data Set

The 'cystfibr' data frame has 25 rows and 10
columns. It contains lung function data for
cystic fibrosis patients (7-23 years old)
We will examine the relationships among the
various measures of lung function

age a numeric vector. Age in years.
sex a numeric vector code. 0 male, 1female.
height a numeric vector. Height (cm).
weight a numeric vector. Weight (kg).
bmp a numeric vector. Body mass ( of normal).
fev1 a numeric vector. Forced expiratory volume.
rv a numeric vector. Residual volume.
frc a numeric vector. Functional residual
capacity.
tlc a numeric vector. Total lung capacity.
pemax a numeric vector. Maximum expiratory
pressure.

4
Scatterplot matrices

We have five variables and may wish to study the
relationships among them
We could separately plot the (5)(4)/2 10
pairwise scatterplots
The graph matrix function in Stata
.graph matrix fev1 rv frc tlc pemax

5
(No Transcript)
6
Principal Components Analysis

The idea of PCA is to create new variables that
are combinations of the original ones.
If x1, x2, , xp are the original variables, then
a component is a1x1 a2x2 apxp
We pick the first PC as the linear combination
that has the largest variance
The second PC is that linear combination
orthogonal to the first one that has the largest
variance, and so on

7
(No Transcript)
8
. pca fev1 rv frc tlc pemax Principal
components/correlation Number of
obs 25
Number of comp. 5

Trace 5 Rotation
(unrotated principal) Rho
1.0000 ------------------------------
--------------------------------------------
Component Eigenvalue Difference
Proportion Cumulative ---------------------
--------------------------------------------------
-- Comp1 3.22412 2.33772
0.6448 0.6448 Comp2
.886399 .40756 0.1773
0.8221 Comp3 .478839
.133874 0.0958 0.9179
Comp4 .344966 .279286
0.0690 0.9869 Comp5
.0656798 . 0.0131
1.0000 ---------------------------------------
----------------------------------- Principal
components (eigenvectors)
--------------------------------------------------
---------------------------- Variable
Comp1 Comp2 Comp3 Comp4 Comp5
Unexplained --------------------------------
--------------------------------------------
fev1 -0.4525 0.2140 0.5539
0.6641 -0.0397 0 rv
0.5043 0.1736 -0.2977 0.4993
-0.6145 0 frc
0.5291 0.1324 0.0073 0.3571 0.7582
0 tlc 0.4156 0.4525
0.6474 -0.4134 -0.1806 0
pemax -0.2970 0.8377 -0.4306
-0.1063 0.1152 0
--------------------------------------------------
----------------------------
9
(No Transcript)
10
(No Transcript)
11
Fishers Iris Data

This famous (Fisher's or Anderson's) iris data
set gives the measurements in centimeters of the
variables sepal length and width and petal length
and width, respectively, for 50 flowers from each
of 3 species of iris. The species are _Iris
setosa_, _versicolor_, and _virginica_.

12
. summarize Variable Obs Mean
Std. Dev. Min Max ---------------
--------------------------------------------------
---- index 150 75.5
43.44537 1 150 sepallength
150 5.843333 .8280661 4.3
7.9 sepalwidth 150 3.057333
.4358663 2 4.4 petallength
150 3.758 1.765298 1
6.9 petalwidth 150 1.199333
.7622377 .1 2.5 ------------------
--------------------------------------------------
- species 0 . graph matrix
sepallength sepalwidth petallength petalwidth
13
(No Transcript)
14
. pca sepallength sepalwidth petallength
petalwidth Principal components/correlation
Number of obs 150
Number
of comp. 4
Trace
4 Rotation (unrotated principal)
Rho 1.0000
--------------------------------------------------
------------------------ Component
Eigenvalue Difference Proportion
Cumulative ----------------------------------
---------------------------------------
Comp1 2.9185 2.00447
0.7296 0.7296 Comp2
.91403 .767274 0.2285
0.9581 Comp3 .146757
.126042 0.0367 0.9948
Comp4 .0207148 .
0.0052 1.0000 --------------------------
------------------------------------------------
Principal components (eigenvectors)
--------------------------------------------------
------------------ Variable Comp1
Comp2 Comp3 Comp4 Unexplained
-------------------------------------------------
----------------- sepallength 0.5211
0.3774 -0.7196 -0.2613 0
sepalwidth -0.2693 0.9233 0.2444
0.1235 0 petallength 0.5804
0.0245 0.1421 0.8014 0
petalwidth 0.5649 0.0669 0.6343
-0.5236 0 -----------------------
---------------------------------------------
15
(No Transcript)
16
Multinomial Logit

Logistic regression can only deal with binary
categories
One alternative that can deal with more than two
categories is discriminant analysis. This comes
in two flavors, (Fishers) Linear Discriminant
Analysis or LDA and (Fishers) Quadratic
Discriminant Analysis or QDA
In each case we model the shape of the groups and
provide a dividing line/curve

Unfortunately, Stata does not provide an LDA or
QDA program
An alternative is multinomial logit, in which
logistic regression is used but with more than
two categories.

18
. use "C\TD\CLASS\K30-2006\iris1.dta" . encode
species, generate(nspecies) . mlogit nspecies
sepallength sepalwidth petallength
petalwidth Iteration 0 log likelihood
-164.79184 Iteration 1 log likelihood
-67.780459 Iteration 2 log likelihood
-45.612546 Iteration 3 log likelihood
-23.256068 Iteration 4 log likelihood
-12.950564 Iteration 5 log likelihood
-8.8076897 Iteration 6 log likelihood
-7.0436566 Iteration 7 log likelihood
-6.2945111 Iteration 8 log likelihood
-6.0232697 Iteration 22 log likelihood
-5.9492736 Iteration 23 log likelihood
-5.9492736
19
Multinomial logistic regression
Number of obs 150
LR chi2(8)
317.69
Prob gt chi2 0.0000 Log
likelihood -5.9492736
Pseudo R2 0.9639 --------------------
--------------------------------------------------
-------- nspecies Coef. Std. Err.
z Pgtz 95 Conf. Interval ------------
-------------------------------------------------
---------------- versicolor sepallength
-8.596623 975.8807 -0.01 0.993
-1921.288 1904.094 sepalwidth -6.182755
. . . .
. petallength 16.82211 . .
. . . petalwidth
17.8852 9.742607 1.84 0.066 -1.209956
36.98036 _cons 7.261185 25.70765
0.28 0.778 -43.12489 57.64726 ----------
-------------------------------------------------
------------------ virginica sepallength
-11.06184 975.8824 -0.01 0.991
-1923.756 1901.632 sepalwidth -12.86364
4.479563 -2.87 0.004 -21.64342
-4.083859 petallength 26.2515 4.737208
5.54 0.000 16.96674 35.53626
petalwidth 36.17133 . .
. . . _cons
-35.37661 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome) .
20
. test sepallength ( 1) versicolorsepallength
0 ( 2) virginicasepallength 0
chi2( 2) 1.06 Prob gt chi2
0.5885 . mlogit nspecies sepalwidth petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(6) 316.32
Prob gt chi2
0.0000 Log likelihood -6.6329197
Pseudo R2 0.9597 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
sepalwidth -11.07503 2941.788 -0.00
0.997 -5776.873 5754.723 petallength
12.326 3.840663 3.21 0.001 4.798441
19.85356 petalwidth 23.4702 .
. . . .
_cons -16.4182 23.99477 -0.68 0.494
-63.44709 30.61069 ---------------------------
--------------------------------------------------
virginica sepalwidth -19.4511
2941.798 -0.01 0.995 -5785.269
5746.367 petallength 20.20055 .
. . . .
petalwidth 44.8998 10.70735 4.19
0.000 23.91378 65.88582 _cons
-66.94504 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome)
21
. test sepalwidth ( 1) versicolorsepalwidth
0 ( 2) virginicasepalwidth 0
chi2( 2) 3.09 Prob gt chi2
0.2128 . mlogit nspecies petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(4) 309.02
Prob gt chi2
0.0000 Log likelihood -10.281754
Pseudo R2 0.9376 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
petallength 18.61007 2659.161 0.01
0.994 -5193.25 5230.47 petalwidth
23.61704 . . . .
. _cons -63.27827 13.61167
-4.65 0.000 -89.95664
-36.5999 ----------------------------------------
------------------------------------- virginica
petallength 24.3646 2659.158 0.01
0.993 -5187.489 5236.218 petalwidth
34.06374 3.755651 9.07 0.000 26.7028
41.42468 _cons -108.5506 .
. . .
. ------------------------------------------------
------------------------------ (nspeciessetosa
is the base outcome)
22
. test petallength ( 1) versicolorpetallength
0 ( 2) virginicapetallength 0
chi2( 2) 6.23 Prob gt chi2
0.0444 . test petalwidth ( 1)
versicolorpetalwidth 0 ( 2)
virginicapetalwidth 0 Constraint 1
dropped chi2( 1) 82.26
Prob gt chi2 0.0000 . twoway (scatter
petalwidth petallength if nspecies1,
mcolor(red)) (scatter petalwidth petallength
if nspecies2, mcolor(blue)) (scatter
petalwidth petallength if nspecies3,
mcolor(black)), legend(off) . graph export
petals.wmf
23
(No Transcript)
24
Lymphoma Data Set

Alizadeh et al. Nature (2000) Distinct types of
diffuse large B-cell lymphoma identified by gene
expression profiling
We will analyze a subset of the data consisting
of 61 arrays on patients with
45 Diffuse large B-cell lymphoma (DLBCL/DL)
10 Chronic lymphocytic leukaemia (CLL/CL)
6 Follicular leukaemia (FL)

25
Data Available

The original Nature paper
The expression data in the form of unbackground
corrected log ratios. A common reference was
always on Cy3 with the sample on Cy5
(array.data.txt and array.data.zip). 9216 by 61
A file with array codes and disease status for
each of the 61 arrays, ArrayID.txt

26
Identify Differentially Expressed Genes

We will assume that the log ratios are on a
reasonable enough scale that we can use them as
is
For each gene, we can run a one-way ANOVA and
find the p-value, obtaining 9,216 of them.
Account for multiplicity by using a small p-value
threshold
Identify genes with small adjusted p-values

27
Develop Classifier

Reduce dimension with ANOVA gene selection or
with PCA
Use logistic regression of LDA
Evaluate the four possibilities and their
sub-possibilities with cross validation. With 61
arrays one could reasonable omit 10 or 6 at
random
This is difficult with Array Assist or Stata

28
Differential Expression

We can locate genes that are differentially
expressed that is, genes whose expression
differs systematically by the type of lymphoma.
To do this, we use lymphoma type to predict
expression, and see when this is statistically
significant.
This can be done in Array Assist

It is almost equivalent to locate genes whose
expression can be used to predict lymphoma type,
this being the reverse process.
If there is significant association in one
direction there should logically be significant
association in the other
This will not be true exactly, but is true
approximately
We can also easily do the latter analysis using
the expression of more than one gene using
logistic regresssion, LDA, and QDA

30
(No Transcript)
31
Significant Genes

There are 3845 out of 9261 genes that have
significant p-values from the ANOVA of less than
0.05, compared to 463 expected by chance
There are 2733 genes with FDR adjusted p-values
less than 0.05
There are only 184 genes with Bonferroni adjusted
p-values less than 0.05

32
(No Transcript)
33
Logistic Regression

We will use logistic regression to distinguish
DLBCL from CLL and DLBCL from FL
We will do this first by choosing the variables
with the smallest overall p-values in the ANOVA
We will then evaluate the results within sample
and by cross validation

34
(No Transcript)
35
(No Transcript)
36
Evaluation of performance

Within sample evaluation of performance like this
is unreliable
This is especially true if we are selecting
predictors from a very large set
One useful yardstick is the performance of random
classifiers

37
Left is CV performance of best k
variables Random 25.4
38
Conclusion

Logistic regression on the variables with the
smallest p-values does not work very well
This cannot be identified by looking at the
within sample statistics
Cross validation is a requirement to assess
performance of classifiers

Write a Comment

User Comments (0)