Shrunken Centroid Ordering by Orthogonal Projections (SCOOP) method of variable selection


1
Shrunken Centroid Ordering by Orthogonal
Projections (SCOOP) method of variable selection
  • Joe Verducci
  • Ohio State University

2
Outline
  • Motivation: gene expression
  • Variable selection for LDA
  • Large p, moderate n
  • Advantages in gene selection
  • Method
  • Model Justification
  • Measures of Performance
  • Modifications

3
LDA Motivation
  • Non-greedy selection
  • preserve (augmented) discriminant information
  • Variables with between group differences
  • Variables highly correlated with these

4
Fisher's Linear Discriminant Function and a
Stupid Generalization
[Equations not preserved in this transcript]
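The equations from this slide are missing from the transcript. As a rough sketch only, assuming the standard two-group form of Fisher's rule (pooled covariance S, group means m1 and m2) and assuming the generalization swaps the inverse of S for a Moore-Penrose generalized inverse, the two rules could be written as follows. The function name `fisher_ldf` and the toy data are illustrative and not taken from the talk.

```python
import numpy as np

def fisher_ldf(X1, X2, use_pinv=False):
    """Return (w, c): assign x to group 1 when w @ x > c.

    use_pinv=True replaces inv(S) by the Moore-Penrose pseudo-inverse,
    which is an assumed reading of the 'generalization' on the slide.
    """
    n1, n2 = len(X1), len(X2)
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    R = np.vstack([X1 - m1, X2 - m2])          # within-group residuals
    S = R.T @ R / (n1 + n2 - 2)                # pooled sample covariance
    S_inv = np.linalg.pinv(S) if use_pinv else np.linalg.inv(S)
    w = S_inv @ (m1 - m2)                      # discriminant direction
    c = w @ (m1 + m2) / 2                      # midpoint threshold
    return w, c

rng = np.random.default_rng(0)

# Small p: S is invertible, so both versions agree.
X1 = rng.normal(size=(50, 5)) + 1.0
X2 = rng.normal(size=(50, 5))
w_inv, c_inv = fisher_ldf(X1, X2)
w_pinv, c_pinv = fisher_ldf(X1, X2, use_pinv=True)

# Large p, small n: S is singular, only the pseudo-inverse version applies.
Y1 = rng.normal(size=(10, 200)) + 1.0
Y2 = rng.normal(size=(10, 200))
w_big, c_big = fisher_ldf(Y1, Y2, use_pinv=True)
```

With p larger than n the pooled covariance is singular, so only the generalized-inverse version can be computed at all.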
5
Why It's Stupid
[Figure labels: S, m1, m2]
Results from Bickel and Levina (2004) imply that
the eigenvectors of within and between group
covariance matrices approach orthogonality under
n fixed, p → ∞ asymptotics.
6
Genetic Motivation
  • Wound Healing
  • 80 National Wound Healing Clinics
  • 1000 patients
  • Initial 1-week samples
  • Clinical records of patients
  • 10K genes of potential interest in myocytes
  • Subsets of genes act in concert
  • A single gene may be active in several subsystems

7
P53
  • When the DNA in a cell becomes damaged by agents
    such as toxic chemicals or ultraviolet (UV) rays
    from sunlight, this protein plays a critical role
    in determining whether the DNA will be repaired
    or the cell will undergo programmed cell death
    (apoptosis).
  • If the DNA can be repaired, tumor protein p53
    activates other genes to fix the damage.
  • If the DNA cannot be repaired, tumor protein p53
    prevents the cell from dividing and signals it to
    undergo apoptosis. This process prevents cells
    with mutated or damaged DNA from dividing, which
    helps prevent the development of tumors.

8
Pathway construction based on GeneChip™
expression data. Genes shown in red ellipses are
candidates identified using the GeneChip™ assay that
were up-regulated in 20% O2 compared with 3% O2.
Green ellipses are genes that were down-regulated
under the conditions mentioned above. The expressions
of candidates shown in red ellipses with a blue
outline have been independently verified using
either real-time PCR or ribonuclease protection
assay (6). Abbreviations: BAX, Bcl2-associated X protein;
Catn, catenin; CASP, caspase; ccng, cyclin G; Cdc61,
cell division cycle; CDK, cyclin-dependent
kinase; CDKN1A, cyclin-dependent kinase inhibitor
1A (p21); Cx43, gap junction membrane channel
protein; GADD, growth arrest and DNA
damage-inducible; MAPK, mitogen-activated protein
kinase; Mdm2, transformed mouse 3T3 cell double
minute 2; N-Cdh, cadherin 2; PXN, paxillin; Tob,
transducer of ErbB-2.1; TP53, transformation-related
protein 53; Vcl, vinculin; Wig, wild-type
p53-induced gene 1.
9
Motivating Simple Example
  • Two groups
      • 50 samples in each
  • p = 4000 normal variables
      • All have variance 1
  • First 10 variables
      • correlation .75 between all pairs
      • Difference of 2 between group means
  • Second 10 variables
      • correlation .75 between all pairs
      • Difference of 1 between group means
  • Last 3980 variables
      • independent
      • same mean in both groups
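The setup above can be simulated in a few lines. This is a minimal sketch, assuming the equicorrelated blocks are built from one shared normal factor per block (one convenient construction, not necessarily the one used in the talk); the function name `simulate_example` and the seed are illustrative.

```python
import numpy as np

def simulate_example(n_per_group=50, p=4000, rho=0.75, seed=0):
    """Two groups, unit-variance normal variables, two informative blocks of 10."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_group
    X = rng.standard_normal((n, p))            # independent N(0, 1) noise
    # Equicorrelated block: sqrt(rho) * shared + sqrt(1 - rho) * own noise
    # keeps variance 1 and gives correlation rho between all pairs.
    for start in (0, 10):
        shared = rng.standard_normal((n, 1))
        block = X[:, start:start + 10]
        X[:, start:start + 10] = np.sqrt(rho) * shared + np.sqrt(1 - rho) * block
    y = np.repeat([0, 1], n_per_group)         # group labels
    X[y == 1, :10] += 2.0                      # first block: mean difference 2
    X[y == 1, 10:20] += 1.0                    # second block: mean difference 1
    return X, y

X, y = simulate_example()
```

The remaining 3980 columns stay independent with the same mean in both groups, as on the slide.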

10
Results from 100 Simulations
  • Individual t-test ranking by p-values
      • 73% of top 20 selected are correct
      • On average need to select 400 variables to ensure
        inclusion of all 20
  • SCOOP
      • 91% of top 20 selected are correct
      • On average need to select 200 variables to ensure
        inclusion of all 20
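The two performance measures quoted here (share of the top-ranked variables that are truly informative, and how long the list must be to contain all of them) can be computed as sketched below for the t-test baseline. This is an illustration under assumptions: the helper names are hypothetical, and ranking by |t| stands in for ranking by p-value, which gives the same order because every test has the same degrees of freedom.

```python
import numpy as np

def two_sample_t(X, y):
    """Pooled-variance two-sample t-statistic for every column of X."""
    A, B = X[y == 0], X[y == 1]
    n1, n2 = len(A), len(B)
    pooled_var = ((n1 - 1) * A.var(axis=0, ddof=1)
                  + (n2 - 1) * B.var(axis=0, ddof=1)) / (n1 + n2 - 2)
    return (B.mean(axis=0) - A.mean(axis=0)) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))

def ranking_performance(scores, true_idx):
    """Return (fraction of the top-k that is truly informative,
    list length needed to include all k informative variables)."""
    order = np.argsort(-np.abs(scores))        # most significant first
    k = len(true_idx)
    frac_correct = np.isin(order[:k], true_idx).mean()
    rank_of = np.empty(len(scores), dtype=int)
    rank_of[order] = np.arange(1, len(scores) + 1)
    needed = rank_of[true_idx].max()
    return frac_correct, needed

# Hand-made scores: variables 0 and 1 are the informative ones.
toy_scores = np.array([5.0, 0.1, 0.2, 4.0, 3.0])
toy_frac, toy_needed = ranking_performance(toy_scores, np.array([0, 1]))

# Small random data set: the first 5 of 200 variables carry a mean shift.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.standard_normal((40, 200))
X[y == 1, :5] += 1.5
t = two_sample_t(X, y)
frac, needed = ranking_performance(t, np.arange(5))
```

Averaging `frac` and `needed` over repeated simulations gives summaries of the kind reported on this slide.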

11
Shrunken Centroid Method for K groups (Tibshirani,
Hastie, Narasimhan and Chu)
  • For each gene i,
      • x_ik = sample mean in group k
      • x_i = overall sample mean
      • s_ik = estimated std. error of x_ik
        (based on pooled std. deviation)
  • d_ik = (x_ik - x_i)/s_ik is a t-statistic
  • Shrinking by an amount D > 0 gives
      • Shrunken difference
      • Shrunken centroid
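The formulas for the shrunken difference and shrunken centroid did not survive in the transcript. A minimal sketch follows, assuming the soft-thresholding form from the cited shrunken-centroid paper: the shrunken difference is sign(d_ik) * max(|d_ik| - D, 0), and the shrunken centroid is x_i plus s_ik times that shrunken difference. Two simplifications are assumptions of this sketch: s_ik is taken as the pooled standard deviation times sqrt(1/n_k - 1/n), and the optional regularizing constant added to the standard deviation is left out.

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Return (d, d_shrunk, shrunken centroids), each of shape p x K."""
    classes = np.unique(y)
    n, p = X.shape
    overall = X.mean(axis=0)
    centroids = np.stack([X[y == k].mean(axis=0) for k in classes], axis=1)
    n_k = np.array([(y == k).sum() for k in classes])
    # Pooled within-group standard deviation of each variable.
    resid_ss = sum(((X[y == k] - centroids[:, j]) ** 2).sum(axis=0)
                   for j, k in enumerate(classes))
    s = np.sqrt(resid_ss / (n - len(classes)))
    # Standard error of (group mean - overall mean) for each variable and group.
    s_ik = s[:, None] * np.sqrt(1 / n_k - 1 / n)[None, :]
    d = (centroids - overall[:, None]) / s_ik
    # Soft thresholding: move every d_ik toward 0 by delta, truncating at 0.
    d_shrunk = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)
    return d, d_shrunk, overall[:, None] + s_ik * d_shrunk

rng = np.random.default_rng(0)
X = rng.standard_normal((30, 8))
y = np.repeat([0, 1, 2], 10)
d, d_none, c_none = shrunken_centroids(X, y, delta=0.0)   # no shrinkage
_, d_all, c_all = shrunken_centroids(X, y, delta=1e6)     # shrink everything away
```

With `delta=0` the group centroids are returned unchanged; with a very large `delta` every group centroid collapses to the overall mean, so the variable no longer separates the groups.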

12
Properties of Shrunken Centroid
  • When K = 2, ordering of variables/genes is the same
    as for the t-test
  • Keeps redundant predictors
  • Can be modified to regularize the estimated std
    errors
  • Shrunken centroids used directly for
    classification
  • Shrinkage by amount D is simultaneous in all
    coordinates on standardized scale
  • Shrinkage parameter D chosen by cross-validation
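The first property can be checked numerically. The sketch below assumes the unregularized statistic d_ik = (x_ik - x_i)/s_ik with s_ik equal to the pooled standard deviation times sqrt(1/n_k - 1/n); under that assumption d_i1 equals the pooled two-sample t-statistic and d_i2 equals its negative, so ordering variables by the shrinkage needed to zero them out is the same as ordering by |t|. The data are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(1)
n1, n2, p = 12, 20, 6
A = rng.normal(size=(n1, p))            # group 1
B = rng.normal(size=(n2, p)) + 0.5      # group 2
n = n1 + n2

# Pooled within-group standard deviation for each variable.
s = np.sqrt((((A - A.mean(0)) ** 2).sum(0) + ((B - B.mean(0)) ** 2).sum(0)) / (n - 2))

# Shrunken-centroid statistics d_i1 and d_i2 (no regularizing constant).
overall = np.vstack([A, B]).mean(0)
d1 = (A.mean(0) - overall) / (s * np.sqrt(1 / n1 - 1 / n))
d2 = (B.mean(0) - overall) / (s * np.sqrt(1 / n2 - 1 / n))

# Classical pooled two-sample t-statistic.
t = (A.mean(0) - B.mean(0)) / (s * np.sqrt(1 / n1 + 1 / n2))

same_order = np.array_equal(np.argsort(-np.abs(d1)), np.argsort(-np.abs(t)))
```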

13
Reformulating the Goals
  • Genetic studies
  • Find biomarkers
  • classification/prediction
  • Use small number of classifiers/predictors
  • Understand genetic pathways
  • Discover which genes work together to make a
    difference
  • possible intervention
  • Other studies
  • Improve efficiency in difficult discrimination
    problems

14
SCOOP Method (version 1)
  • Define the Augmented Discriminant Space
      • ADS = span of eigenvectors
        of Within and Between Covariance Matrices
  • Modify shrinkage so as not to distort
    configuration of data in the ADS
      • shrink variables differentially along directions
        orthogonal to the ADS
      • Note: Unlike the reference, we do not
        standardize, but scale only at the shrinkage
        stage.
  • Keep track of the amount of shrinkage l_i needed
    to eliminate the ith variable

15
SCOOP Algorithm for K groups
  • 1. Between Group eigenvectors
      • DB = (x_ik - x_i), a p x K matrix
      • Use Singular Value Decomposition (SVD) on DB.
        The left singular vectors of DB are the eigenvectors
        of DB (DB)T
  • 2. Within Group eigenvectors
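Both sets of eigenvectors can be obtained from thin SVDs without forming any p x p covariance matrix. The sketch below follows step 1 as stated; for step 2, whose details are missing from the transcript, it assumes the within-group eigenvectors are the right singular vectors of the matrix of within-group residuals. The function name `between_within_directions` is hypothetical.

```python
import numpy as np

def between_within_directions(X, y):
    """Return DB, between-group eigenvectors and singular values,
    within-group eigenvectors and singular values."""
    classes = np.unique(y)
    overall = X.mean(axis=0)
    centroids = np.stack([X[y == k].mean(axis=0) for k in classes], axis=1)
    DB = centroids - overall[:, None]                     # p x K mean differences
    # Left singular vectors of DB are eigenvectors of DB @ DB.T.
    U_b, sv_b, _ = np.linalg.svd(DB, full_matrices=False)
    # Within-group residuals (n x p): each sample minus its own group mean.
    R = np.concatenate([X[y == k] - X[y == k].mean(axis=0) for k in classes])
    # Right singular vectors of R are eigenvectors of R.T @ R, which is the
    # pooled within-group covariance up to a constant factor.
    _, sv_w, Vt_w = np.linalg.svd(R, full_matrices=False)
    return DB, U_b, sv_b, Vt_w.T, sv_w

rng = np.random.default_rng(0)
X = rng.standard_normal((24, 30))        # n = 24 samples, p = 30 variables
y = np.repeat([0, 1, 2], 8)
DB, U_b, sv_b, V_w, sv_w = between_within_directions(X, y)
```

Directions whose singular value is numerically zero carry no information and would normally be dropped before the next step.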

16
Algorithm (part 2)
  • Orthogonalize the Between group (BG) eigenvectors
    to the Within group (WG) eigenvectors
      • Note: residuals from orthogonalization will no
        longer be orthogonal to each other
      • Renormalize
  • Compute projection operator onto complement of
    the ADS
      • Note: do not need to use p x p storage
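These steps can be sketched as follows, assuming the WG eigenvectors are already orthonormal columns of a matrix W and that a thin SVD is an acceptable way to re-orthonormalize the BG residuals (the slide does not say which method is used). The projection onto the complement of the ADS is applied as V - Q (Q^T V), so only p x r matrices are ever stored. Function names are hypothetical.

```python
import numpy as np

def ads_basis(W, B, tol=1e-10):
    """Orthonormal basis of span(W, B).

    W: p x r orthonormal within-group eigenvectors.
    B: p x q between-group eigenvectors.
    """
    resid = B - W @ (W.T @ B)                  # remove WG components from BG vectors
    # The residuals are no longer mutually orthogonal, so re-orthonormalize.
    U, sv, _ = np.linalg.svd(resid, full_matrices=False)
    keep = sv > tol * sv[0] if sv.size else np.zeros(0, dtype=bool)
    return np.hstack([W, U[:, keep]])

def project_off_ads(Q, V):
    """Apply (I - Q Q^T) to the columns of V without forming a p x p matrix."""
    return V - Q @ (Q.T @ V)

rng = np.random.default_rng(0)
p, r, q = 50, 6, 2
W, _ = np.linalg.qr(rng.standard_normal((p, r)))   # stand-in WG eigenvectors
B = rng.standard_normal((p, q))                    # stand-in BG eigenvectors
Q = ads_basis(W, B)
V = rng.standard_normal((p, 4))
V_perp = project_off_ads(Q, V)
```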

17
Algorithm (part 3)
  • Order variables by scaled shrinkage distances
    l_i
  • For each variable i, compute a scale value: the
    (squared) length of its projection onto the
    orthogonal complement of the ADS
  • Then calculate how many such units, l_i, are
    needed to shrink each of the K mean differences
    to 0
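One possible reading of this step is sketched below; it is an interpretation, not the author's code. It assumes Q is a p x r orthonormal basis of the ADS, that the "projection of variable i" means the projection of the i-th coordinate axis, and that l_i is the largest absolute mean difference of variable i divided by its scale value. The squared length of that projection is 1 minus the squared norm of row i of Q, so no p x p matrix is needed. The name `shrinkage_order` and the `squared` switch are hypothetical.

```python
import numpy as np

def shrinkage_order(DB, Q, squared=True):
    """Return (l, order) where l[i] is the number of scale units needed to
    shrink all K mean differences of variable i to 0.

    DB: p x K matrix of group-mean differences.
    Q:  p x r orthonormal basis of the ADS.
    """
    # Squared length of (I - Q Q^T) e_i equals 1 - ||row i of Q||^2.
    scale = 1.0 - (Q ** 2).sum(axis=1)
    scale = np.clip(scale, 1e-12, None)        # guard against division by zero
    if not squared:
        scale = np.sqrt(scale)
    l = np.abs(DB).max(axis=1) / scale         # units to zero the largest difference
    order = np.argsort(-l)                     # variables eliminated last come first
    return l, order

rng = np.random.default_rng(0)
p, r, K = 8, 3, 2
Q, _ = np.linalg.qr(rng.standard_normal((p, r)))
DB = rng.standard_normal((p, K))
l, order = shrinkage_order(DB, Q)
```

Variables lying mostly inside the ADS have a small scale value, so under this reading they need many units of shrinkage and are retained longest.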

18
Notes
  • Shrinking is non-linear
      • it truncates at 0
      • shrinks each group only as much as it needs to
  • What to use as a stopping rule?
      • Some measure of preserved information
      • Elbow in the distribution of l_i
      • Reference to extreme value distribution

19
Theoretical Concern
  • Inconsistency of sample eigenvectors
  • if p(n)/n → c > 0
  • Johnstone and Lu (2004)
  • Unless sparse representation
  • (offset) factor model
  • Latent factors account for both
  • Correlation among variables
  • Group mean differences
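The inconsistency can be seen in a small simulation. The sketch assumes a single-spike covariance model (identity plus one strong direction), which is a common toy setting for this phenomenon and not necessarily the model behind the slide; the function name, spike size, and sample sizes are illustrative.

```python
import numpy as np

def leading_eigvec_cosine(n, p, spike=4.0, seed=0):
    """|cosine| between the leading sample eigenvector and the true spike direction."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0                                  # true leading eigenvector
    z = rng.standard_normal((n, 1))             # latent factor scores
    # Rows have covariance I + spike * v v^T (known mean 0, so no centering).
    X = np.sqrt(spike) * z * v + rng.standard_normal((n, p))
    # Leading right singular vector of X = leading eigenvector of X^T X.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return abs(Vt[0] @ v)

cos_small_ratio = leading_eigvec_cosine(n=200, p=10)      # p/n = 0.05
cos_large_ratio = leading_eigvec_cosine(n=200, p=2000)    # p/n = 10
```

With p much smaller than n the sample eigenvector should be nearly aligned with the truth, while at p/n = 10 the alignment should be clearly worse, even though the population structure is identical.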

20
Modeling considerations
  • Common offset factor model for gene expression
  • latent factors represent biological variation
  • random measurement errors are the uniqueness
    components of individual genes.
  • Normally distributed data
  • two populations share the same factor structure
  • differ only by the means of the underlying
    factors
  • the restricted maximum likelihood procedure is
    the (stupid) generalization of Fisher's Linear
    Discriminant Analysis (SLDA) that incorporates a
    generalized inverse of the pooled sample
    covariance matrix.
  • SLDA seldom works well for real data
  • amend overly restrictive assumptions on both
    means and covariances.

21
More model considerations
  • Factors underlying biological variation
  • Common factors in 2 groups
  • Some with different means in 2 groups
  • Some with same mean
  • Group specific factors
  • Some may have non-zero means
  • Some have 0 means
  • Unique variation among genes
  • Most is noise
  • A few of the genes that do not load on any factor
    may have different means in the two groups

22
Model
23
Simulation
  • n = 100
  • p = 4000
  • G = 2
  • K = 3
  • J(g) = 1
  • s = 1
  • skF = 1
  • Sj(g) = 1
  • Loadings on common factors
      • l1 indicates 1st 10 variables: 1
      • l2 indicates 2nd 10 variables: .55
      • l3 indicates 3rd 10 variables: 0
  • Loadings on Group-specific factors
      • L1(1) indicates 4th 10 variables: .55
      • L1(2) indicates 5th 10 variables: 0
  • Here is the difference in means

24
Shrinkage Needed to Select Top Predictors
25
Measures of Performance
  • Individual t-test ranking by p-values
      • 49% of top 30 selected are correct
      • On average need to select 400 variables to
        ensure inclusion of all 30
  • SCOOP
      • 61% of top 30 selected are correct
      • On average need to select 200 variables to ensure
        inclusion of all 30

26
Modifications
  • Preserve common and group-distinct within group
    sample eigenvectors
  • Regularize sample eigenvectors using Linear
    Perturbation Theory

This is piecewise linear until adjacent
eigenvalues become equal
27
Conclusions
  • To the extent that something like an offset
    factor model holds, incorporating correlations
    may substantially improve selection of
    discriminating variables (DVs)
  • Clustering of non-DVs does not seem to have any
    serious ill effect
  • SCOOP is one way to use covariance structure
    efficiently

28
References
  • Bickel PJ and Levina E (2004). Some theory for
    Fisher's linear discriminant function, "naive
    Bayes", and some alternatives when there are many
    more variables than observations. Bernoulli 10,
    no. 6, 989-1010.
  • Tibshirani R, Hastie T, Narasimhan B and Chu G (2002).
    Diagnosis of multiple cancer types by shrunken
    centroids of gene expression. PNAS 99, no. 10,
    6567-6572.
  • Sen CK, Verducci JS, Melfi VF, Khanna S,
    Barbacioru C and Roy S (2005). Post-reperfusion
    healing of the heart: Focus on oxygen-sensitive
    genes and DNA microarray as a tool. Mathematical
    Biosciences Institute Technical Report No. 31
    (available at http://mbi.osu.edu/publications/pub2005.html)