Multivariate Analysis and Discrimination - PowerPoint PPT Presentation

About This Presentation
Title:

Multivariate Analysis and Discrimination

Description:

It contains lung function data for cystic fibrosis patients (7-23 ... 10 Chronic lymphocytic leukaemia (CLL/CL) 6 Follicular leukaemia (FL) November 30, 2006 ... – PowerPoint PPT presentation

Number of Views:69
Avg rating:3.0/5.0
Slides: 39
Provided by: davidm133
Category:

less

Transcript and Presenter's Notes

Title: Multivariate Analysis and Discrimination


1
Multivariate Analysis and Discrimination
  • EPP 245
  • Statistical Analysis of
  • Laboratory Data

2
Cystic Fibrosis Data Set
  • The 'cystfibr' data frame has 25 rows and 10
    columns. It contains lung function data for
    cystic fibrosis patients (7-23 years old)
  • We will examine the relationships among the
    various measures of lung function

3
  • age a numeric vector. Age in years.
  • sex a numeric vector code. 0 male, 1female.
  • height a numeric vector. Height (cm).
  • weight a numeric vector. Weight (kg).
  • bmp a numeric vector. Body mass ( of normal).
  • fev1 a numeric vector. Forced expiratory volume.
  • rv a numeric vector. Residual volume.
  • frc a numeric vector. Functional residual
    capacity.
  • tlc a numeric vector. Total lung capacity.
  • pemax a numeric vector. Maximum expiratory
    pressure.

4
Scatterplot matrices
  • We have five variables and may wish to study the
    relationships among them
  • We could separately plot the (5)(4)/2 10
    pairwise scatterplots
  • The graph matrix function in Stata
  • .graph matrix fev1 rv frc tlc pemax

5
(No Transcript)
6
Principal Components Analysis
  • The idea of PCA is to create new variables that
    are combinations of the original ones.
  • If x1, x2, , xp are the original variables, then
    a component is a1x1 a2x2 apxp
  • We pick the first PC as the linear combination
    that has the largest variance
  • The second PC is that linear combination
    orthogonal to the first one that has the largest
    variance, and so on

7
(No Transcript)
8
. pca fev1 rv frc tlc pemax Principal
components/correlation Number of
obs 25
Number of comp. 5

Trace 5 Rotation
(unrotated principal) Rho
1.0000 ------------------------------
--------------------------------------------
Component Eigenvalue Difference
Proportion Cumulative ---------------------
--------------------------------------------------
-- Comp1 3.22412 2.33772
0.6448 0.6448 Comp2
.886399 .40756 0.1773
0.8221 Comp3 .478839
.133874 0.0958 0.9179
Comp4 .344966 .279286
0.0690 0.9869 Comp5
.0656798 . 0.0131
1.0000 ---------------------------------------
----------------------------------- Principal
components (eigenvectors)
--------------------------------------------------
---------------------------- Variable
Comp1 Comp2 Comp3 Comp4 Comp5
Unexplained --------------------------------
--------------------------------------------
fev1 -0.4525 0.2140 0.5539
0.6641 -0.0397 0 rv
0.5043 0.1736 -0.2977 0.4993
-0.6145 0 frc
0.5291 0.1324 0.0073 0.3571 0.7582
0 tlc 0.4156 0.4525
0.6474 -0.4134 -0.1806 0
pemax -0.2970 0.8377 -0.4306
-0.1063 0.1152 0
--------------------------------------------------
----------------------------
9
(No Transcript)
10
(No Transcript)
11
Fishers Iris Data
  • This famous (Fisher's or Anderson's) iris data
    set gives the measurements in centimeters of the
    variables sepal length and width and petal length
    and width, respectively, for 50 flowers from each
    of 3 species of iris. The species are _Iris
    setosa_, _versicolor_, and _virginica_.

12
. summarize Variable Obs Mean
Std. Dev. Min Max ---------------
--------------------------------------------------
---- index 150 75.5
43.44537 1 150 sepallength
150 5.843333 .8280661 4.3
7.9 sepalwidth 150 3.057333
.4358663 2 4.4 petallength
150 3.758 1.765298 1
6.9 petalwidth 150 1.199333
.7622377 .1 2.5 ------------------
--------------------------------------------------
- species 0 . graph matrix
sepallength sepalwidth petallength petalwidth
13
(No Transcript)
14
. pca sepallength sepalwidth petallength
petalwidth Principal components/correlation
Number of obs 150
Number
of comp. 4
Trace
4 Rotation (unrotated principal)
Rho 1.0000
--------------------------------------------------
------------------------ Component
Eigenvalue Difference Proportion
Cumulative ----------------------------------
---------------------------------------
Comp1 2.9185 2.00447
0.7296 0.7296 Comp2
.91403 .767274 0.2285
0.9581 Comp3 .146757
.126042 0.0367 0.9948
Comp4 .0207148 .
0.0052 1.0000 --------------------------
------------------------------------------------
Principal components (eigenvectors)
--------------------------------------------------
------------------ Variable Comp1
Comp2 Comp3 Comp4 Unexplained
-------------------------------------------------
----------------- sepallength 0.5211
0.3774 -0.7196 -0.2613 0
sepalwidth -0.2693 0.9233 0.2444
0.1235 0 petallength 0.5804
0.0245 0.1421 0.8014 0
petalwidth 0.5649 0.0669 0.6343
-0.5236 0 -----------------------
---------------------------------------------
15
(No Transcript)
16
Multinomial Logit
  • Logistic regression can only deal with binary
    categories
  • One alternative that can deal with more than two
    categories is discriminant analysis. This comes
    in two flavors, (Fishers) Linear Discriminant
    Analysis or LDA and (Fishers) Quadratic
    Discriminant Analysis or QDA
  • In each case we model the shape of the groups and
    provide a dividing line/curve

17
  • Unfortunately, Stata does not provide an LDA or
    QDA program
  • An alternative is multinomial logit, in which
    logistic regression is used but with more than
    two categories.

18
. use "C\TD\CLASS\K30-2006\iris1.dta" . encode
species, generate(nspecies) . mlogit nspecies
sepallength sepalwidth petallength
petalwidth Iteration 0 log likelihood
-164.79184 Iteration 1 log likelihood
-67.780459 Iteration 2 log likelihood
-45.612546 Iteration 3 log likelihood
-23.256068 Iteration 4 log likelihood
-12.950564 Iteration 5 log likelihood
-8.8076897 Iteration 6 log likelihood
-7.0436566 Iteration 7 log likelihood
-6.2945111 Iteration 8 log likelihood
-6.0232697 Iteration 22 log likelihood
-5.9492736 Iteration 23 log likelihood
-5.9492736
19
Multinomial logistic regression
Number of obs 150
LR chi2(8)
317.69
Prob gt chi2 0.0000 Log
likelihood -5.9492736
Pseudo R2 0.9639 --------------------
--------------------------------------------------
-------- nspecies Coef. Std. Err.
z Pgtz 95 Conf. Interval ------------
-------------------------------------------------
---------------- versicolor sepallength
-8.596623 975.8807 -0.01 0.993
-1921.288 1904.094 sepalwidth -6.182755
. . . .
. petallength 16.82211 . .
. . . petalwidth
17.8852 9.742607 1.84 0.066 -1.209956
36.98036 _cons 7.261185 25.70765
0.28 0.778 -43.12489 57.64726 ----------
-------------------------------------------------
------------------ virginica sepallength
-11.06184 975.8824 -0.01 0.991
-1923.756 1901.632 sepalwidth -12.86364
4.479563 -2.87 0.004 -21.64342
-4.083859 petallength 26.2515 4.737208
5.54 0.000 16.96674 35.53626
petalwidth 36.17133 . .
. . . _cons
-35.37661 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome) .
20
. test sepallength ( 1) versicolorsepallength
0 ( 2) virginicasepallength 0
chi2( 2) 1.06 Prob gt chi2
0.5885 . mlogit nspecies sepalwidth petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(6) 316.32
Prob gt chi2
0.0000 Log likelihood -6.6329197
Pseudo R2 0.9597 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
sepalwidth -11.07503 2941.788 -0.00
0.997 -5776.873 5754.723 petallength
12.326 3.840663 3.21 0.001 4.798441
19.85356 petalwidth 23.4702 .
. . . .
_cons -16.4182 23.99477 -0.68 0.494
-63.44709 30.61069 ---------------------------
--------------------------------------------------
virginica sepalwidth -19.4511
2941.798 -0.01 0.995 -5785.269
5746.367 petallength 20.20055 .
. . . .
petalwidth 44.8998 10.70735 4.19
0.000 23.91378 65.88582 _cons
-66.94504 . . .
. . ------------------------------------
------------------------------------------ (nspeci
essetosa is the base outcome)
21
. test sepalwidth ( 1) versicolorsepalwidth
0 ( 2) virginicasepalwidth 0
chi2( 2) 3.09 Prob gt chi2
0.2128 . mlogit nspecies petallength
petalwidth Multinomial logistic regression
Number of obs 150
LR
chi2(4) 309.02
Prob gt chi2
0.0000 Log likelihood -10.281754
Pseudo R2 0.9376 -------------
--------------------------------------------------
--------------- nspecies Coef. Std.
Err. z Pgtz 95 Conf.
Interval ---------------------------------------
-------------------------------------- versicolor
petallength 18.61007 2659.161 0.01
0.994 -5193.25 5230.47 petalwidth
23.61704 . . . .
. _cons -63.27827 13.61167
-4.65 0.000 -89.95664
-36.5999 ----------------------------------------
------------------------------------- virginica
petallength 24.3646 2659.158 0.01
0.993 -5187.489 5236.218 petalwidth
34.06374 3.755651 9.07 0.000 26.7028
41.42468 _cons -108.5506 .
. . .
. ------------------------------------------------
------------------------------ (nspeciessetosa
is the base outcome)
22
. test petallength ( 1) versicolorpetallength
0 ( 2) virginicapetallength 0
chi2( 2) 6.23 Prob gt chi2
0.0444 . test petalwidth ( 1)
versicolorpetalwidth 0 ( 2)
virginicapetalwidth 0 Constraint 1
dropped chi2( 1) 82.26
Prob gt chi2 0.0000 . twoway (scatter
petalwidth petallength if nspecies1,
mcolor(red)) (scatter petalwidth petallength
if nspecies2, mcolor(blue)) (scatter
petalwidth petallength if nspecies3,
mcolor(black)), legend(off) . graph export
petals.wmf
23
(No Transcript)
24
Lymphoma Data Set
  • Alizadeh et al. Nature (2000) Distinct types of
    diffuse large B-cell lymphoma identified by gene
    expression profiling
  • We will analyze a subset of the data consisting
    of 61 arrays on patients with
  • 45 Diffuse large B-cell lymphoma (DLBCL/DL)
  • 10 Chronic lymphocytic leukaemia (CLL/CL)
  • 6 Follicular leukaemia (FL)

25
Data Available
  • The original Nature paper
  • The expression data in the form of unbackground
    corrected log ratios. A common reference was
    always on Cy3 with the sample on Cy5
    (array.data.txt and array.data.zip). 9216 by 61
  • A file with array codes and disease status for
    each of the 61 arrays, ArrayID.txt

26
Identify Differentially Expressed Genes
  • We will assume that the log ratios are on a
    reasonable enough scale that we can use them as
    is
  • For each gene, we can run a one-way ANOVA and
    find the p-value, obtaining 9,216 of them.
  • Account for multiplicity by using a small p-value
    threshold
  • Identify genes with small adjusted p-values

27
Develop Classifier
  • Reduce dimension with ANOVA gene selection or
    with PCA
  • Use logistic regression of LDA
  • Evaluate the four possibilities and their
    sub-possibilities with cross validation. With 61
    arrays one could reasonable omit 10 or 6 at
    random
  • This is difficult with Array Assist or Stata

28
Differential Expression
  • We can locate genes that are differentially
    expressed that is, genes whose expression
    differs systematically by the type of lymphoma.
  • To do this, we use lymphoma type to predict
    expression, and see when this is statistically
    significant.
  • This can be done in Array Assist

29
  • It is almost equivalent to locate genes whose
    expression can be used to predict lymphoma type,
    this being the reverse process.
  • If there is significant association in one
    direction there should logically be significant
    association in the other
  • This will not be true exactly, but is true
    approximately
  • We can also easily do the latter analysis using
    the expression of more than one gene using
    logistic regresssion, LDA, and QDA

30
(No Transcript)
31
Significant Genes
  • There are 3845 out of 9261 genes that have
    significant p-values from the ANOVA of less than
    0.05, compared to 463 expected by chance
  • There are 2733 genes with FDR adjusted p-values
    less than 0.05
  • There are only 184 genes with Bonferroni adjusted
    p-values less than 0.05

32
(No Transcript)
33
Logistic Regression
  • We will use logistic regression to distinguish
    DLBCL from CLL and DLBCL from FL
  • We will do this first by choosing the variables
    with the smallest overall p-values in the ANOVA
  • We will then evaluate the results within sample
    and by cross validation

34
(No Transcript)
35
(No Transcript)
36
Evaluation of performance
  • Within sample evaluation of performance like this
    is unreliable
  • This is especially true if we are selecting
    predictors from a very large set
  • One useful yardstick is the performance of random
    classifiers

37
Left is CV performance of best k
variables Random 25.4
38
Conclusion
  • Logistic regression on the variables with the
    smallest p-values does not work very well
  • This cannot be identified by looking at the
    within sample statistics
  • Cross validation is a requirement to assess
    performance of classifiers
Write a Comment
User Comments (0)
About PowerShow.com