1
Correlation Aware Feature Selection
Annalisa Barla, Cesare Furlanello, Giuseppe Jurman, Stefano Merler, Silvano Paoli
http://mpa.itc.it
Berlin 8/10/2005
2
Overview
  • On Feature Selection
  • Correlation Aware Ranking
  • Synthetic Example

3
Feature Selection
  • Step-wise variable selection

[Diagram: one feature vs. N features; starting from N features, N selection steps (Step 1 ... Step N) yield n < N effective variables modeling the classification function.]
4
Feature Selection
Step-wise selection of the features.
[Diagram: at each step, features move from the candidate pool into either the ranked or the discarded set.]
5
Ranking
  • Classifier-independent filters
  • Prefiltering is risky: you might discard features that turn out to be important (the labelling is ignored)
  • Ranking induced by a classifier

6
Support Vector Machines
Classification function: f(x) = sign(w · x + b)
Optimal Separating Hyperplane: w · x + b = 0
7
The classification/ranking machine
  • The RFE idea: given N features (genes)
  • Train an SVM
  • Compute a cost function J from the weight coefficients of the SVM
  • Rank the features in terms of their contribution to J
  • Discard the feature contributing least to J
  • Reapply the procedure to the remaining N-1 features
  • This is called Recursive Feature Elimination (RFE); a sketch follows below
  • Features are ranked according to their contribution to the classification, given the training data
  • Time and data consuming, and at risk of selection bias

Guyon et al. 2002
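A minimal sketch of the RFE loop described above, assuming a linear SVM and the standard per-feature criterion J_i = w_i^2 from Guyon et al. 2002. This is an illustration, not the authors' implementation; the function name rfe_ranking is ours.

```python
# Minimal RFE sketch (assumes scikit-learn). The criterion J_i = w_i^2
# follows Guyon et al. 2002; this is NOT the presenters' own code.
import numpy as np
from sklearn.svm import SVC

def rfe_ranking(X, y):
    """Return feature indices in elimination order (discarded first)."""
    remaining = list(range(X.shape[1]))
    eliminated = []
    while remaining:
        clf = SVC(kernel="linear").fit(X[:, remaining], y)
        w = clf.coef_.ravel()          # weight vector of the trained SVM
        j = w ** 2                     # per-feature contribution to J
        worst = int(np.argmin(j))      # feature contributing least to J
        eliminated.append(remaining.pop(worst))
    return eliminated                  # later entries = more important
```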
8
RFE-based Methods
  • Considering chunks of features at a time
  • Parametric
  • Sqrt(N)-RFE
  • Bisection-RFE
  • Non-parametric
  • E-RFE (adapting to the weight distribution): thresholding weights at a value w*; see the sketch below
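A hedged sketch of a single elimination step in the E-RFE spirit: every feature whose weight falls below an adaptive threshold w* is discarded at once. The specific threshold rule (a fraction of the mean squared weight) and the fallback are assumptions for illustration, not the exact E-RFE criterion.

```python
# One E-RFE-style step: discard all features below an adaptive threshold.
# The threshold rule below is an assumed stand-in, not the paper's formula.
import numpy as np
from sklearn.svm import SVC

def erfe_step(X, y, remaining, frac=0.5):
    clf = SVC(kernel="linear").fit(X[:, remaining], y)
    j = clf.coef_.ravel() ** 2                 # squared SVM weights
    w_star = frac * j.mean()                   # adaptive threshold w* (assumed rule)
    keep = [f for f, ji in zip(remaining, j) if ji >= w_star]
    if not keep or len(keep) == len(remaining):
        # degenerate case: fall back to classic one-at-a-time elimination
        keep = [remaining[i] for i in np.argsort(j)[1:]]
    return keep
```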

9
Variable Elimination
Correlated genes
Given a family F = {x_1, x_2, ..., x_H} of correlated genes, with pairwise correlation above a given threshold T, each single weight is negligible:
w(x_1) ≈ w(x_2) ≈ ... ≈ ε < w*
BUT
w(x_1) + w(x_2) + ... + w(x_H) >> w*
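To make the argument concrete, a toy illustration follows. The aggregation rule (summing |w| over the family) is an assumption used for exposition, not necessarily the authors' exact correction formula.

```python
# Toy illustration: a correlated family whose members are individually
# below the threshold w* can still carry weight well above w* in total.
import numpy as np

def family_weight(w, family):
    """Accumulated weight of a correlated family F = {x1, ..., xH}."""
    return float(np.sum(np.abs(w[family])))

def family_survives(w, family, w_star):
    # each member alone is negligible, |w(x_i)| ~ eps < w*, but the
    # family as a whole can exceed the threshold
    return family_weight(w, family) > w_star

# 50 weights of 0.01 each are individually < w* = 0.1,
# yet their sum (0.5) is well above it
w = np.full(50, 0.01)
print(family_survives(w, list(range(50)), w_star=0.1))  # True
```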
10
Correlated Genes (1)
11
Correlated Genes (2)
12
Synthetic Data
  • Binary problem
  • 100 (50 + 50) samples of 1000 genes
  • genes 1→50 randomly extracted from N(1,1) and N(-1,1) respectively
  • genes 51→100: one gene randomly extracted from N(1,1) and N(-1,1) respectively, repeated 50 times
  • genes 101→1000 extracted from Unif(-4,4)

[Diagram: 100 × 1000 data matrix; Class 1 = 50 samples, Class 2 = 50 samples; 51 effective significant features (genes 1→50 plus the repeated 1×50 block in genes 51→100).]
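A sketch of how this synthetic dataset can be generated with numpy, under our reading of the slide (class 1 features drawn from N(1,1) and class 2 from N(-1,1); genes 51-100 taken as 50 exact copies of one informative gene).

```python
# Synthetic problem sketch: 100 samples x 1000 genes, binary labels.
# The exact construction of the repeated block is our assumption.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-4, 4, size=(100, 1000))        # genes 101-1000: uniform noise
y = np.r_[np.ones(50), -np.ones(50)]            # 50 + 50 samples

mu = np.where(y == 1, 1.0, -1.0)                # class means: N(1,1) vs N(-1,1)
X[:, :50] = rng.normal(mu[:, None], 1.0, size=(100, 50))        # genes 1-50
X[:, 50:100] = rng.normal(mu, 1.0)[:, None].repeat(50, axis=1)  # one gene x 50 copies
```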
13
Our algorithm
[Flowchart: the correlation-aware ranking procedure at a generic step j.]
14
Methodology
  • Implemented within the BioDCV system (50
    replicates)
  • Realized through R/C code interaction

15
Synthetic Data
[Plot: feature ranks across elimination steps for genes 1 to 1000.]
Gene 100 is consistently ranked 2nd.
16
Work in Progress
  • Preservation of highly correlated genes with low initial weights in microarray datasets
  • Robust correlation measures
  • Different techniques to detect the F_l families (clustering, gene functions); a clustering sketch follows below
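One plausible way to detect such families, matching the "clustering" option above: hierarchical clustering on the distance 1 - |corr|. The function name and the 0.9 cutoff are our assumptions.

```python
# Group features whose mutual |correlation| exceeds a cutoff, via
# complete-linkage hierarchical clustering on the distance 1 - |corr|.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def correlated_families(X, threshold=0.9):
    corr = np.corrcoef(X, rowvar=False)          # feature-by-feature correlation
    dist = 1.0 - np.abs(corr)                    # correlation-based distance
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method="complete")
    labels = fcluster(Z, t=1.0 - threshold, criterion="distance")
    families = {}
    for feat, lab in enumerate(labels):
        families.setdefault(lab, []).append(feat)
    return [f for f in families.values() if len(f) > 1]
```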

17
(No Transcript)
18
Synthetic Data
19
Synthetic Data
Features discarded at step 9 by the E-RFE procedure:
  • genes 51-100 (the entire correlated family)
  • genes 227, 559, 864, 470, 363, 735

Correlation correction saves feature 100.
20
Challenges
Challenges for predictive profiling
  • INFRASTRUCTURE
  • MPACluster → available for batch jobs
  • Connecting with IFOM → 2005
  • Running at IFOM → 2005/2006
  • Production on GRID resources (spring 2005)
  • ALGORITHMS II
  • Gene list fusion: a suite of algebraic/statistical methods
  • Prediction over multi-platform gene expression datasets (sarcoma, breast cancer): large-scale semi-supervised analysis
  • New SVM kernels for prediction on spectrometry data within complete validation

21
Prefiltering is risky: you might discard features that turn out to be important. Nevertheless, wrapper methods are quite costly. Moreover, in gene expression data you also have to deal with particular situations, such as clones or highly correlated features, that may be a pitfall for several selection methods.
A classic alternative is to map the data into linear combinations of features, and then select (a sketch follows below):
Principal Component Analysis
Metagenes (a simplified model for pathways, but biological suggestions require caution)
Eigen-craters for unexploded bomb risk maps
But we are then no longer working with the original features.
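A brief sketch of this alternative: project onto principal components (metagene-like linear combinations), then learn on the components. The component count is an arbitrary assumption.

```python
# "Map into linear combinations, then select": PCA followed by a linear
# SVM on the components. Note the trade-off stated above: the classifier
# no longer operates on the original features.
from sklearn.decomposition import PCA
from sklearn.svm import SVC

def pca_then_classify(X, y, n_components=10):
    pca = PCA(n_components=n_components)
    Z = pca.fit_transform(X)              # metagene-like combinations
    clf = SVC(kernel="linear").fit(Z, y)
    return pca, clf
```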
22
(No Transcript)
23
A few issues in feature selection, with a particular interest in the classification of genomic data.
WHY?
To enhance information.
To ease the computational burden.
Discard the (apparently) less significant features and train in a simplified space: this alleviates the curse of dimensionality.
Highlight (and rank) the most important features, improving knowledge of the underlying process.
HOW?
As a pre-processing step.
As a learning step.
Link the feature ranking to the classification task: wrapper methods.
Employ a statistical filter (t-test, S2N); see the sketch below.
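A small sketch of the two filters named above: the two-sample t statistic and the signal-to-noise ratio S2N = |mu1 - mu2| / (sigma1 + sigma2), both used to rank features before any classifier is trained.

```python
# Classifier-independent filter ranking: t-test or signal-to-noise (S2N).
import numpy as np
from scipy.stats import ttest_ind

def filter_ranking(X, y, method="s2n"):
    a, b = X[y == 1], X[y == -1]
    if method == "t":
        score = np.abs(ttest_ind(a, b, axis=0).statistic)
    else:  # S2N: |mu1 - mu2| / (sigma1 + sigma2)
        score = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0))
    return np.argsort(score)[::-1]        # best features first
```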
24
(Duplicate of slide 21)
25
Feature Selection within Complete Validation
Experimental Setups
Complete validation is needed to decouple model tuning from (ensemble) model accuracy estimation; otherwise selection bias effects arise.
Accumulating relative importance from Random Forest models for the identification of sensory drivers (with P. Granitto, IASMA); a sketch follows below.
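A sketch of what "accumulating relative importance from Random Forest models" could look like: average normalized importances over replicate fits. The bootstrap replicate scheme is our assumption, not necessarily the setup used with IASMA.

```python
# Accumulate relative feature importance over replicate Random Forest
# fits (assumed scheme: refit on bootstrap resamples, average).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def accumulated_importance(X, y, n_replicates=50, seed=0):
    rng = np.random.default_rng(seed)
    total = np.zeros(X.shape[1])
    for r in range(n_replicates):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap resample
        rf = RandomForestClassifier(n_estimators=200, random_state=r)
        rf.fit(X[idx], y[idx])
        imp = rf.feature_importances_
        total += imp / imp.sum()                     # normalize per replicate
    return total / n_replicates
```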