Title: Correlation Aware Feature Selection
1. Correlation Aware Feature Selection
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
http://mpa.itc.it
Berlin, 8/10/2005
2. Overview
- On Feature Selection
- Correlation Aware Ranking
- Synthetic Example
3. Feature Selection
- Step-wise variable selection: one feature vs. N features
- n < N effective variables modeling the classification function
(Diagram: starting from N features, the selection proceeds over N steps, Step 1 to Step N.)
4. Feature Selection
Step-wise selection of the features.
(Diagram: ranked features vs. discarded features across the selection steps.)
5. Ranking
- Classifier-independent filters: prefiltering is risky, since you might discard features (ignoring the labelling) that turn out to be important.
- Induced by a classifier
6. Support Vector Machines
Classification function
Optimal Separating Hyperplane
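The two labels above stood for formulas shown only as images in the original slides; a standard reconstruction in LaTeX (notation assumed, not taken from the source):

f(\mathbf{x}) = \mathrm{sign}\left(\mathbf{w}\cdot\mathbf{x} + b\right), \qquad \mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i

\min_{\mathbf{w},b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1,\ i = 1,\dots,n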
7. The classification/ranking machine
- The RFE idea: given N features (genes)
  - Train an SVM
  - Compute a cost function J from the weight coefficients of the SVM
  - Rank features in terms of their contribution to J
  - Discard the feature contributing least to J
  - Reapply the procedure on the remaining N-1 features
- This is called Recursive Feature Elimination (RFE)
- Features are ranked according to their contribution to the classification, given the training data.
- Time and data consuming, and at risk of selection bias
Guyon et al. 2002
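A minimal R sketch of this recursion, assuming the e1071 package and a linear kernel; it illustrates the idea rather than reproducing the authors' implementation (function and variable names are made up here).

library(e1071)

# Recursive Feature Elimination with a linear SVM (illustrative sketch).
svm_rfe <- function(x, y) {
  surviving  <- seq_len(ncol(x))   # indices of the features still in play
  eliminated <- integer(0)         # eliminated features, most recently removed first
  while (length(surviving) > 1) {
    model <- svm(x[, surviving, drop = FALSE], y, kernel = "linear", scale = FALSE)
    w <- as.vector(t(model$coefs) %*% model$SV)  # weight vector of the trained SVM
    J <- w^2                                     # contribution of each feature to the cost
    worst <- which.min(J)                        # feature contributing least to J
    eliminated <- c(surviving[worst], eliminated)
    surviving  <- surviving[-worst]
  }
  c(surviving, eliminated)         # full ranking, best feature first
}

# Usage (assuming a numeric matrix x and a two-level factor y):
# rank_order <- svm_rfe(x, y)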
8. RFE-based Methods
- Considering chunks of features at a time
- Parametric
  - Sqrt(N)-RFE
  - Bisection-RFE
- Non-parametric
  - E-RFE (adapting to the weight distribution): thresholding weights at a value w
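A rough R sketch of one thresholded elimination step; using the mean squared weight as the threshold is an assumption standing in for E-RFE's entropy-based, distribution-adaptive choice.

# Discard, in a single step, every surviving feature whose squared weight
# falls below the threshold w_bar (here simply the mean squared weight).
erfe_step <- function(J, surviving) {
  w_bar <- mean(J)
  drop  <- which(J < w_bar)
  if (length(drop) == 0) drop <- which.min(J)   # always remove at least one feature
  list(discarded = surviving[drop], surviving = surviving[-drop])
}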
9. Variable Elimination
Correlated genes
Given a family F = {x1, x2, ..., xH} whose members are correlated above a given threshold T.
Each single weight is negligible:
w(x1) ≈ w(x2) ≈ ... ≈ w(xH) ≈ ε < w   (the elimination threshold w)
BUT
w(x1) + w(x2) + ... + w(xH) >> w
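A small R illustration of this effect (the setup is an assumption, e1071 package): duplicating one informative feature spreads its SVM weight over the copies, so each copy looks negligible even though the family as a whole is not.

library(e1071)
set.seed(1)
n <- 100; H <- 10
signal <- c(rnorm(n/2, mean = 1), rnorm(n/2, mean = -1))   # one informative variable
y <- factor(rep(c("A", "B"), each = n/2))
x <- cbind(matrix(rep(signal, H), ncol = H),               # H identical correlated copies
           matrix(runif(n * 50, -4, 4), ncol = 50))        # 50 noise features
m <- svm(x, y, kernel = "linear", scale = FALSE)
w <- as.vector(t(m$coefs) %*% m$SV)
w[1:H]        # each copy carries only a small individual weight
sum(w[1:H])   # but the family's summed weight is large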
10. Correlated Genes (1)
11. Correlated Genes (2)
12. Synthetic Data
- Binary problem
- 100 (50 + 50) samples of 1000 genes
- genes 1-50: randomly extracted from N(1,1) and N(-1,1) for the two classes respectively
- genes 51-100: randomly extracted from N(1,1) and N(-1,1) respectively (one gene repeated 50 times)
- genes 101-1000: extracted from UNIF(-4,4)
(Diagram: 100 x 1000 data matrix; class 1 = 50 samples, class 2 = 50 samples; 51 significant features; genes 51-100 form a 1 x 50 repeated block.)
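An R sketch reproducing the structure of this dataset (the seed and minor sampling details are assumptions):

set.seed(42)
n_per_class <- 50
y <- factor(rep(c(1, 2), each = n_per_class))

# genes 1-50: independent informative genes, N(1,1) in class 1 and N(-1,1) in class 2
informative <- rbind(matrix(rnorm(n_per_class * 50, mean =  1), nrow = n_per_class),
                     matrix(rnorm(n_per_class * 50, mean = -1), nrow = n_per_class))

# genes 51-100: one additional informative gene, repeated 50 times
repeated <- c(rnorm(n_per_class, mean = 1), rnorm(n_per_class, mean = -1))
block    <- matrix(rep(repeated, 50), ncol = 50)

# genes 101-1000: uninformative noise from UNIF(-4, 4)
noise <- matrix(runif(2 * n_per_class * 900, min = -4, max = 4), nrow = 2 * n_per_class)

x <- cbind(informative, block, noise)   # 100 x 1000, 51 distinct significant features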
13. Our algorithm
(Diagram: the correction applied at elimination step j.)
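Since the algorithm itself is shown only as a diagram, the following R sketch is one plausible reading of the correction based on the previous slides: among the features flagged for elimination at step j, detect highly correlated families and, when a family's summed weight exceeds the threshold, rescue a representative. The correlation cutoff and the choice of representative are assumptions.

correlation_correction <- function(x, w, to_drop, w_bar, cor_cut = 0.9) {
  rescued <- integer(0)
  if (length(to_drop) > 1) {
    cm <- abs(cor(x[, to_drop, drop = FALSE]))       # correlations among flagged features
    for (j in seq_along(to_drop)) {
      fam <- which(cm[j, ] > cor_cut)                # family of features correlated with feature j
      if (sum(abs(w[to_drop[fam]])) > w_bar) {       # the family weight is not negligible
        keep    <- to_drop[fam][which.max(abs(w[to_drop[fam]]))]
        rescued <- union(rescued, keep)              # rescue one representative per family
      }
    }
  }
  list(rescued = rescued, discarded = setdiff(to_drop, rescued))
}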
14. Methodology
- Implemented within the BioDCV system (50 replicates)
- Realized through R / C code interaction
15. Synthetic Data
(Plot: feature ranks across the elimination steps for genes 1-1000.)
Gene 100 is consistently ranked as 2nd.
16. Work in Progress
- Preservation of highly correlated genes with low initial weights on microarray datasets
- Robust correlation measures
- Different techniques to detect the families F (clustering, gene functions)
17. (Figure-only slide)
18. Synthetic Data
19. Synthetic Data
Features discarded at step 9 by the E-RFE procedure:
51 52 53 54 55 56 57 58 59 60
61 62 63 64 65 66 67 68 69 70
71 72 73 74 75 76 77 78 79 80
81 82 83 84 85 86 87 88 89 90
91 92 93 94 95 96 97 98 99 100
227 559 864 470 363 735
The correlation correction saves feature 100.
20. Challenges
Challenges for predictive profiling
- INFRASTRUCTURE
  - MPA Cluster -> available for batch jobs
  - Connecting with IFOM -> 2005
  - Running at IFOM -> 2005/2006
  - Production on GRID resources (spring 2005)
- ALGORITHMS II
  - Gene list fusion: suite of algebraic/statistical methods
  - Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large scale semi-supervised analysis
  - New SVM kernels for prediction on spectrometry data within complete validation
21. Prefiltering is risky: you might discard features that turn out to be important.
Nevertheless, wrapper methods are quite costly. Moreover, with gene expression data you also have to deal with particular situations, like clones or highly correlated features, that may represent a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select:
- Principal Component Analysis
- Metagenes (a simplified model for pathways, but biological suggestions require caution)
- eigen-craters for unexploded bomb risk maps
But then we are no longer working with the original features.
22. (Figure-only slide)
23. A few issues in feature selection, with a particular interest in the classification of genomic data
WHY?
- To enhance information
- To ease the computational burden
- Discard the (apparently) less significant features and train in a simplified space: alleviate the curse of dimensionality
- Highlight (and rank) the most important features and improve the knowledge of the underlying process
HOW?
- As a pre-processing step
- As a learning step
- Link the feature ranking to the classification task: wrapper methods
- Employ a statistical filter (t-test, S2N)
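As an example of the filter route, a minimal R sketch of signal-to-noise (S2N) ranking in the usual Golub-style form; the exact statistic and its use here are assumptions, not the slides' specification.

# Signal-to-noise ratio per feature: |mean1 - mean2| / (sd1 + sd2).
s2n_rank <- function(x, y) {
  cls <- levels(factor(y))
  s2n <- apply(x, 2, function(g) {
    m <- tapply(g, y, mean)
    s <- tapply(g, y, sd)
    abs(m[cls[1]] - m[cls[2]]) / (s[cls[1]] + s[cls[2]])
  })
  order(s2n, decreasing = TRUE)   # feature indices, most discriminant first
}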
25. Feature Selection within Complete Validation Experimental Setups
- Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation; otherwise selection bias effects arise (see the skeleton below).
- Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA)
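A skeleton of the complete-validation loop in R; the split proportion and the fit_and_select interface are assumptions, since in BioDCV the replicates and resampling scheme are handled by the system.

complete_validation <- function(x, y, B = 50, fit_and_select) {
  acc <- numeric(B)
  for (b in seq_len(B)) {
    tr  <- sample(seq_len(nrow(x)), size = round(0.75 * nrow(x)))   # resampled training set
    fit <- fit_and_select(x[tr, , drop = FALSE], y[tr])             # selection + tuning on train only
    acc[b] <- mean(predict(fit, x[-tr, , drop = FALSE]) == y[-tr])  # honest accuracy on held-out data
  }
  acc   # distribution of accuracy estimates over the B replicates
}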