Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection

1
Shrunken Centroid Ordering by Orthogonal
Projections (SCOOP) method of variable selection

Joe Verducci
Ohio State University

2
Outline

Motivation: gene expression
Variable selection for LDA
Large p, moderate n
Advantages in gene selection
Method
Model Justification
Measures of Performance
Modifications

3
LDA Motivation

Non-greedy selection
preserve (augmented) discriminant information
Variables with between group differences
Variables highly correlated with these

4
Fisher's Linear Discriminant Function and a
Stupid Generalization
[Equations not captured in transcript]
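
For reference, a standard textbook form of the two-group rule is sketched below, followed by one plausible reading of the generalization. The second line is an assumption based on the later modeling slide, which describes the generalization as incorporating a generalized inverse of the pooled sample covariance matrix. The symbols $m_1$, $m_2$ (sample group means), $S$ (pooled sample covariance) and $S^{-}$ (a generalized inverse, for example Moore-Penrose) are chosen here for illustration and are not a recovery of the slide's own equations.

```latex
% Population rule: assign x to group 1 when \delta(x) > 0
\delta(x) = \Bigl(x - \tfrac{1}{2}(\mu_1 + \mu_2)\Bigr)^{T} \Sigma^{-1} (\mu_1 - \mu_2)

% Plug-in version when p > n, so that S is singular and S^{-1} does not exist
\hat{\delta}(x) = \Bigl(x - \tfrac{1}{2}(m_1 + m_2)\Bigr)^{T} S^{-} (m_1 - m_2)
```

When p exceeds n, $S$ has rank at most n - 2 in the two-group case, which is why an ordinary inverse is unavailable.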
5
Why It's Stupid
[Figure: labels S, m1, m2]
Results from Bickel and Levina (2004) imply that
the eigenvectors of within and between group
covariance matrices approach orthogonality under
n fixed, $p \to \infty$ asymptotics.
6
Genetic Motivation

Wound Healing
80 National Wound Healing Clinics
1000 patients
Initial 1-week samples
Clinical records of patients
10K genes of potential interest in myocytes
Subsets of genes act in concert
A single gene may be active in several subsystems

7
P53

When the DNA in a cell becomes damaged by agents
such as toxic chemicals or ultraviolet (UV) rays
from sunlight, this protein plays a critical role
in determining whether the DNA will be repaired
or the cell will undergo programmed cell death
(apoptosis).
If the DNA can be repaired, tumor protein p53
activates other genes to fix the damage.
If the DNA cannot be repaired, tumor protein p53
prevents the cell from dividing and signals it to
undergo apoptosis. This process prevents cells
with mutated or damaged DNA from dividing, which
helps prevent the development of tumors.

8
Pathway construction based on GeneChip™
expression data. Genes shown in red ellipses are
candidates identified using the GeneChip™ assay that
were up-regulated in 20% O2 compared with 3% O2.
Green ellipses are genes that were down-regulated
under the conditions mentioned above. The expressions
of candidates shown in red ellipses with blue
outline have been independently verified using
either real-time PCR or ribonuclease protection
assay (6). BAX, Bcl2-associated X protein; Catn,
catenin; CASP, caspase; ccng, cyclin G; Cdc61,
cell division cycle; CDK, cyclin-dependent
kinase; CDKN1A, cyclin-dependent kinase inhibitor
1A (p21); Cx43, gap junction membrane channel
protein; GADD, growth arrest and DNA
damage-inducible; MAPK, mitogen-activated protein
kinase; Mdm2, transformed mouse 3T3 cell double
minute 2; N-Cdh, cadherin 2; PXN, paxillin; Tob,
transducer of ErbB-2.1; TP53,
transformation-related protein 53; Vcl, vinculin;
Wig, wild-type p53-induced gene 1.
9
Motivating Simple Example

Two groups
50 samples in each
p = 4000 normal variables
All have variance 1
First 10 variables:
correlation 0.75 between all pairs
difference of 2 between group means
Second 10 variables:
correlation 0.75 between all pairs
difference of 1 between group means
Last 3980 variables:
independent
same mean in both groups
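
The setup above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code behind the reported simulations: the function names `simulate_example` and `t_statistics` are made up here, the equicorrelated blocks are generated with a single shared normal factor, and variables are ranked by absolute pooled-variance t-statistic (which gives the same order as ranking by p-value when every test has the same degrees of freedom).

```python
import numpy as np

def simulate_example(rng, n_per_group=50, p=4000, rho=0.75):
    """Draw two groups of unit-variance normal data with two correlated blocks of 10."""
    def block(n, shift):
        # sqrt(rho)*shared + sqrt(1-rho)*own has variance 1 and pairwise correlation rho
        shared = rng.standard_normal((n, 1))
        own = rng.standard_normal((n, 10))
        return np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own + shift

    groups = []
    for g in (0, 1):
        x = rng.standard_normal((n_per_group, p))    # independent, same mean in both groups
        x[:, :10] = block(n_per_group, 2.0 * g)      # first 10: mean difference 2
        x[:, 10:20] = block(n_per_group, 1.0 * g)    # second 10: mean difference 1
        groups.append(x)
    return groups

def t_statistics(x1, x2):
    """Two-sample pooled-variance t-statistic for every column."""
    n1, n2 = len(x1), len(x2)
    pooled_var = ((n1 - 1) * x1.var(axis=0, ddof=1)
                  + (n2 - 1) * x2.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return (x1.mean(axis=0) - x2.mean(axis=0)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

rng = np.random.default_rng(0)
x1, x2 = simulate_example(rng)
t = t_statistics(x1, x2)
top20 = np.argsort(-np.abs(t))[:20]        # indices of the 20 largest |t|
n_correct = int(np.sum(top20 < 20))        # how many of them are true discriminators
```

Repeating this over many seeds and averaging `n_correct / 20` gives the kind of "fraction of top 20 correct" summary reported on the next slide.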

10
Results from 100 Simulations

Individual t-test ranking by p-values
73% of top 20 selected are correct
On average need to select 400 variables to ensure
inclusion of all 20
SCOOP
91% of top 20 selected are correct
On average need to select 200 variables to ensure
inclusion of all 20

11
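Shrunken Centroid Method for K groups (Tibshirani,
Hastie, Narasimhan and Chu)

For each gene i,
$\bar{x}_{ik}$ = sample mean in group k,
$\bar{x}_i$ = overall sample mean
$s_{ik}$ = estimated std. error of $\bar{x}_{ik}$
Based on pooled std deviation
$d_{ik} = (\bar{x}_{ik} - \bar{x}_i)/s_{ik}$ is a t-statistic
Shrinking by an amount $\Delta > 0$ gives
Shrunken difference
$d'_{ik} = \mathrm{sign}(d_{ik})\,(|d_{ik}| - \Delta)_+$
Shrunken centroid
$\bar{x}'_{ik} = \bar{x}_i + s_{ik}\, d'_{ik}$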
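
A minimal NumPy sketch of this soft-thresholding step follows. The function name `shrunken_centroids` is hypothetical, and the standard error is taken here as the pooled standard deviation times sqrt(1/n_k - 1/n), which is an assumption about how $s_{ik}$ is formed. The published method also adds a small positive constant to the denominator, which is omitted for brevity.

```python
import numpy as np

def shrunken_centroids(x, labels, delta):
    """Soft-threshold the standardized centroid differences d_ik by delta.

    x is n x p (samples by genes); returns p x K arrays.
    """
    classes = np.unique(labels)
    n, p = x.shape
    overall = x.mean(axis=0)                                   # xbar_i
    centroids = np.stack([x[labels == k].mean(axis=0) for k in classes], axis=1)
    n_k = np.array([(labels == k).sum() for k in classes])

    # Pooled within-group standard deviation for each gene
    resid_ss = sum(((x[labels == k] - centroids[:, j]) ** 2).sum(axis=0)
                   for j, k in enumerate(classes))
    pooled_sd = np.sqrt(resid_ss / (n - len(classes)))

    # Assumed form of s_ik: standard error of (xbar_ik - xbar_i)
    s_ik = pooled_sd[:, None] * np.sqrt(1.0 / n_k - 1.0 / n)[None, :]

    d = (centroids - overall[:, None]) / s_ik                  # t-like statistics d_ik
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0) # truncates at 0
    return overall[:, None] + s_ik * d_shrunk, d, d_shrunk

# Small illustration: 3 groups of 10 samples, 5 genes, gene 0 shifted in group 0
rng = np.random.default_rng(1)
x = rng.standard_normal((30, 5))
labels = np.repeat([0, 1, 2], 10)
x[labels == 0, 0] += 3.0
centroids_shrunk, d, d_shrunk = shrunken_centroids(x, labels, delta=2.0)
```

A gene drops out of the classifier once its row of `d_shrunk` is zero in every group, so the smallest `delta` that zeroes a row is `np.abs(d).max(axis=1)` for that gene.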

12
Properties of Shrunken Centroid

When K = 2, ordering of variables/genes is the same
as for the t-test
Keeps redundant predictors
Can be modified to regularize the estimated std
errors
Shrunken centroids used directly for
classification
Shrinkage by amount $\Delta$ is simultaneous in all
coordinates on standardized scale
Shrinkage parameter $\Delta$ chosen by cross-validation

13
Reformulating the Goals

Genetic studies
Find biomarkers
classification/prediction
Use small number of classifiers/predictors
Understand genetic pathways
Discover which genes work together to make a
difference
possible intervention
Other studies
Improve efficiency in difficult discrimination
problems

14
SCOOP Method (version 1)

Define the Augmented Discriminant Space (ADS)
ADS = span of eigenvectors
of Within and Between Covariance Matrices
Modify shrinkage so as not to distort
configuration of data in the ADS
Shrink variables differentially along directions
orthogonal to the ADS
Note: Unlike the reference, we do not
standardize, but scale only at the shrinkage
stage.
Keep track of the amount of shrinkage $\lambda_i$ needed
to eliminate the ith variable

15
SCOOP Algorithm for K groups

1. Between Group eigenvectors
$D_B = (\bar{x}_{ik} - \bar{x}_i)$, a p x K matrix
Use Singular Value Decomposition (SVD) on $D_B$.
The left singular vectors of $D_B$ are the eigenvectors
of
$D_B D_B^T$
2. Within Group eigenvectors
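
Both sets of eigenvectors can be obtained from thin SVDs of tall p-row matrices, which avoids ever forming a p x p covariance matrix. The sketch below is one way to do this and rests on assumptions: the function name `between_within_eigenvectors` is hypothetical, the within-group eigenvectors are taken from the SVD of the group-centered residual matrix, and directions with numerically zero singular values are dropped.

```python
import numpy as np

def between_within_eigenvectors(x, labels, tol=1e-10):
    """Return (u_b, u_w, d_b) for an n x p data matrix x.

    u_b: eigenvectors of D_B D_B^T with non-zero eigenvalue (p x r_b)
    u_w: eigenvectors of the within-group scatter matrix (p x r_w)
    d_b: the p x K matrix of centroid differences D_B
    """
    classes = np.unique(labels)
    overall = x.mean(axis=0)
    centroids = np.stack([x[labels == k].mean(axis=0) for k in classes], axis=1)
    d_b = centroids - overall[:, None]                      # p x K

    # Within-group residuals stacked as a p x n matrix
    resid = np.concatenate(
        [x[labels == k] - x[labels == k].mean(axis=0) for k in classes]).T

    # Thin SVDs: the left singular vectors are the eigenvectors we want
    u_b, s_b, _ = np.linalg.svd(d_b, full_matrices=False)
    u_w, s_w, _ = np.linalg.svd(resid, full_matrices=False)
    return u_b[:, s_b > tol * s_b.max()], u_w[:, s_w > tol * s_w.max()], d_b

# Illustration with p > n: 12 samples in 3 groups, 50 variables
rng = np.random.default_rng(2)
x = rng.standard_normal((12, 50))
labels = np.repeat([0, 1, 2], 4)
u_b, u_w, d_b = between_within_eigenvectors(x, labels)
```

In this illustration `u_b` has K - 1 = 2 columns, because the group-size-weighted centroid differences sum to zero, and `u_w` has n - K = 9 columns.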

16
Algorithm (part 2)

Orthogonalize the Between group (BG) eigenvectors
to the Within group (WG) eigenvectors
Note: residuals from orthogonalization will no
longer be orthogonal to each other
Renormalize
Compute projection operator onto complement of
the ADS
Note: do not need to use p x p storage
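
A hedged sketch of this step is given below. It assumes the WG and BG eigenvectors are supplied as matrices with orthonormal columns, and it goes slightly beyond plain renormalization by re-orthonormalizing the BG residuals with an SVD so that the combined basis Q is orthonormal. With such a Q, the projection onto the complement of the ADS is I - Q Q^T, applied here only through p x r products. The names `ads_basis` and `project_onto_complement` are made up for illustration.

```python
import numpy as np

def ads_basis(u_w, u_b, tol=1e-10):
    """Orthonormal basis for span(u_w, u_b), keeping the WG vectors as given."""
    # Remove from each BG vector its component in the WG span
    resid = u_b - u_w @ (u_w.T @ u_b)
    # The residuals are not mutually orthogonal, so re-orthonormalize them
    q, s, _ = np.linalg.svd(resid, full_matrices=False)
    return np.hstack([u_w, q[:, s > tol]])

def project_onto_complement(basis, v):
    """Apply I - Q Q^T to v without forming a p x p matrix."""
    return v - basis @ (basis.T @ v)

# Illustration with random orthonormal inputs (p = 40, 5 WG and 2 BG vectors)
rng = np.random.default_rng(3)
u_w, _ = np.linalg.qr(rng.standard_normal((40, 5)))
u_b, _ = np.linalg.qr(rng.standard_normal((40, 2)))
basis = ads_basis(u_w, u_b)
v = rng.standard_normal(40)
v_perp = project_onto_complement(basis, v)
```

Storage is O(p r) for r basis vectors, which matters when p is in the thousands and r is at most n.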

17
Algorithm (part 3)

Order variables by scaled shrinkage distances
$\lambda_i$
For each variable i, compute a scale value:
the (squared) length of its projection onto the
orthogonal complement of the ADS
Then calculate how many such units, $\lambda_i$, are
needed to shrink each of the K mean differences
to 0
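
One possible reading of this step is sketched below, and it should be treated as an interpretation rather than the author's implementation. It assumes Q is an orthonormal basis of the ADS, takes the scale of variable i to be the squared length of the projection of the i-th coordinate vector onto the complement, which equals 1 minus the sum of squares of row i of Q, and sets $\lambda_i$ to the largest absolute mean difference of variable i divided by that scale. The function name `shrinkage_distances` and the small floor on the scale are assumptions.

```python
import numpy as np

def shrinkage_distances(d_b, basis, floor=1e-12):
    """Return (lam, scale, order) for p x K mean differences d_b and ADS basis Q.

    scale[i] = ||(I - Q Q^T) e_i||^2 = 1 - sum_j Q[i, j]^2
    lam[i]   = max_k |d_b[i, k]| / scale[i]
    order    = variable indices sorted from largest to smallest lam
    """
    scale = 1.0 - np.sum(basis ** 2, axis=1)   # diagonal of I - Q Q^T, row by row
    scale = np.clip(scale, floor, None)        # guard against division by zero
    lam = np.max(np.abs(d_b), axis=1) / scale
    return lam, scale, np.argsort(-lam)

# Tiny illustration: 4 variables, 2 groups, a one-dimensional ADS
basis = np.array([[0.6], [0.8], [0.0], [0.0]])
d_b = np.array([[1.0, -1.0], [1.0, -1.0], [2.0, -2.0], [0.5, -0.5]])
lam, scale, order = shrinkage_distances(d_b, basis)
```

Under this reading, a variable lying mostly inside the ADS has a small scale, so it takes many units to shrink away and is ranked early even when its raw mean difference is modest.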

18
Notes

Shrinking is non-linear
it truncates at 0
shrinks each group only as much as it needs to
What to use as a stopping rule?
Some measure of preserved information
Elbow in the distribution of $\lambda_i$
Reference to extreme value distribution

19
Theoretical Concern

Inconsistency of sample eigenvectors
if $p(n)/n \to c > 0$
Johnstone and Lu (2004)
Unless sparse representation
(offset) factor model
Latent factors account for both
Correlation among variables
Group mean differences

20
Modeling considerations

Common offset factor model for gene expression
latent factors represent biological variation
random measurement errors are the uniqueness
components of individual genes.
Normally distributed data
two populations share the same factor structure
differ only by the means of the underlying
factors
the restricted maximum likelihood procedure is
the (stupid) generalization of Fisher's Linear
Discriminant Analysis (SLDA) that incorporates a
generalized inverse of the pooled sample
covariance matrix.
SLDA seldom works well for real data
amend overly restrictive assumptions on both
means and covariances.

21
More model considerations

Factors underlying biological variation
Common factors in 2 groups
Some with different means in 2 groups
Some with same mean
Group specific factors
Some may have non-zero means
Some have 0 means
Unique variation among genes
Most is noise
A few of the genes that do not load on any factor
may have different means in the two groups.

22
Model
23
Simulation

n = 100
p = 4000
G = 2
K = 3
J(g) = 1
s1
skF1
Sj(g)1

Loadings on common factors (final value is the difference in means)
$\lambda_1$ indicates 1st 10 variables: 1
$\lambda_2$ indicates 2nd 10 variables: 0.55
$\lambda_3$ indicates 3rd 10 variables: 0
Loadings on group-specific factors (final value is the difference in means)
$\Lambda_1^{(1)}$ indicates 4th 10 variables: 0.55
$\Lambda_1^{(2)}$ indicates 5th 10 variables: 0

24
Shrinkage Needed to Select Top Predictors
25
Measures of Performance

Individual t-test ranking by p-values
49% of top 30 selected are correct
On average need to select 400 variables to
ensure inclusion of all 30
SCOOP
61% of top 30 selected are correct
On average need to select 200 variables to ensure
inclusion of all 30

26
Modifications

Preserve common and group-distinct within group
sample eigenvectors
Regularize sample eigenvectors using Linear
Perturbation Theory

This is piecewise linear until adjacent
eigenvalues become equal
27
Conclusions

To the extent that something like an offset
factor model holds, incorporating correlations
may substantially improve selection of
discriminating variables (DVs)
Clustering of non-DVs does not seem to have any
serious ill effect
SCOOP is one way to use covariance structure
efficiently

28
References

Bickel PJ and Levina E (2004). Some theory for
Fisher's linear discriminant function, "naive
Bayes", and some alternatives when there are many
more variables than observations. Bernoulli 10,
no. 6, 989-1010.
Tibshirani R, Hastie T, Narasimhan B and Chu G (2002).
Diagnosis of multiple cancer types by shrunken
centroids of gene expression. PNAS 99, no. 10,
6567-6572.
Sen CK, Verducci JS, Melfi VF, Khanna S,
Barbacioru C and Roy S (2005). Post-reperfusion
healing of the heart: Focus on oxygen-sensitive
genes and DNA microarray as a tool. Mathematical
Biosciences Institute Technical Report No. 31
(available at http://mbi.osu.edu/publications/pub2005.html)