Title: talk proteomics
1. Lecture 8: Feature Selection
Bioinformatics Data Analysis and Tools
Elena Marchiori (elena_at_few.vu.nl)
2. Why select features?
- Select a subset of relevant input variables
- Advantages:
  - it is cheaper to measure fewer variables
  - the resulting classifier is simpler and potentially faster
  - prediction accuracy may improve by discarding irrelevant variables
  - identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)
3. Why select features?
[Figure: correlation plots for the leukemia data (3 classes), comparing no feature selection, the top 100 selected features, and selection based on variance; correlation scale from -1 to 1]
4. Approaches
- Wrapper: feature selection takes into account the contribution to the performance of a given type of classifier
- Filter: feature selection is based on an evaluation criterion for quantifying how well features (subsets) discriminate the two classes
- Embedded: feature selection is part of the training procedure of a classifier (e.g. decision trees)
5. Embedded methods
- Attempt to jointly or simultaneously train both a classifier and a feature subset
- Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features
- Intuitively appealing
- Example: tree-building algorithms
Adapted from J. Fridlyand
6. Approaches to Feature Selection
[Diagram]
- Filter approach: input features → feature selection by distance metric score → train model → model
- Wrapper approach: input features → feature selection search → train model → model, with the importance of the features given by the model fed back into the search
Adapted from Shin and Jasso
7. Filter methods
[Diagram: feature selection maps R^p to R^s, with s << p, followed by classifier design]
- Features are scored independently and the top s are used by the classifier
- Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
- Easy to interpret. Can provide some insight into the disease markers.
Adapted from J. Fridlyand
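As a minimal sketch of such a filter step (the data and feature counts here are synthetic, not the lecture's microarray data), genes can be scored independently with a two-sample t-statistic and the top s retained:

```python
import numpy as np

def t_scores(X, y):
    """Score each feature independently with a two-sample t-statistic.

    X: (n_samples, n_features) expression matrix; y: binary labels (0/1).
    Returns the absolute t-statistic of every feature.
    """
    a, b = X[y == 0], X[y == 1]
    n1, n2 = len(a), len(b)
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / n1 + b.var(axis=0, ddof=1) / n2)
    return np.abs(num / den)

rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))          # 49 pure-noise features
X[:, 0] += 10 * y                      # feature 0 strongly separates the classes

scores = t_scores(X, y)
top = np.argsort(scores)[::-1][:5]     # keep the top s = 5 features
print(top[0])                          # feature 0 ranks first
```

Note that each feature is scored on its own, which is exactly why the redundancy and interaction problems discussed on the next slide arise.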
8. Problems with filter methods
- Redundancy in the selected features: features are considered independently and not measured on the basis of whether they contribute new information
- Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)
- The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others
Adapted from J. Fridlyand
9. Dimension reduction: a variant on a filter method
- Rather than retain a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA)
- Problem: we are no longer dealing with one feature at a time but rather with a linear, or possibly more complicated, combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (even though we don't want to confuse the tasks)
- These methods tend not to work better than simple filter methods
Adapted from J. Fridlyand
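A minimal sketch of this projection, using an SVD-based PCA on a toy matrix of microarray-like shape (the sizes are illustrative, not the lecture's data):

```python
import numpy as np

def pca_project(X, s):
    """Project samples onto the top-s principal components (via SVD)."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:s].T               # (n_samples, s) "supergene" scores

rng = np.random.default_rng(1)
X = rng.normal(size=(38, 7129))        # microarray-sized toy matrix
Z = pca_project(X, s=5)
print(Z.shape)                         # (38, 5)
```

Each column of `Z` is a weighted combination of all 7129 genes, which illustrates the interpretability problem raised above: no single gene can be pointed to as the marker.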
10. Wrapper methods
[Diagram: feature selection maps R^p to R^s, with s << p, followed by classifier design]
- Iterative approach: many feature subsets are scored based on classification performance and the best one is used
- Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.
Adapted from J. Fridlyand
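As a sketch of one of these strategies, greedy forward selection can score each candidate subset by the LOOCV error of a simple classifier (here a nearest-centroid rule on synthetic data; both are illustrative choices, not the lecture's setup):

```python
import numpy as np

def loocv_error(X, y):
    """LOOCV error of a nearest-centroid classifier on the given features."""
    errors = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        Xt, yt = X[tr], y[tr]
        c0, c1 = Xt[yt == 0].mean(axis=0), Xt[yt == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        errors += pred != y[i]
    return errors / len(y)

def forward_select(X, y, k):
    """Greedily add the feature that most reduces the LOOCV error, k times."""
    selected = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        best = min(rest, key=lambda j: loocv_error(X[:, selected + [j]], y))
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 8))
X[:, 3] += 6 * y                       # only feature 3 is informative
sel = forward_select(X, y, k=2)
print(sel[0])                          # the informative feature is picked first
```

The cost is visible in the nested loops: every candidate subset requires building and evaluating a classifier, which is exactly the expense criticized on the next slide.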
11. Problems with wrapper methods
- Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
- No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only
- Easy to overfit
Adapted from J. Fridlyand
12. Example: Microarray Analysis
- Labeled cases: 38 bone marrow samples (27 ALL, 11 AML), each containing 7129 gene expression values
- Train a model (using neural networks, support vector machines, Bayesian nets, etc.) → key genes
- Apply the model to 34 new unlabeled bone marrow samples → AML/ALL prediction
13. Microarray Data: Challenges to Machine Learning Algorithms
- Few samples for analysis (38 labeled)
- Extremely high-dimensional data (7129 gene expression values per sample)
- Noisy data
- Complex underlying mechanisms, not fully understood
14-15. Some genes are more useful than others for building classification models
[Scatter plot of genes 36569_at and 36495_at: the AML and ALL samples form well-separated clusters, so these genes are useful]
16. Some genes are more useful than others for building classification models
[Scatter plot of genes 37176_at and 36563_at: the classes overlap, so these genes are not useful]
17. Importance of Feature (Gene) Selection
- The majority of genes are not directly related to leukemia
- Having a large number of features enhances the model's flexibility, but makes it prone to overfitting
- Noise and the small number of training samples make this even more likely
- Some types of models, like kNN, do not scale well with many features
18. With 7129 genes, how do we choose the best?
- Use distance metrics to capture class separation
- Rank genes according to their distance metric score
- Choose the top n ranked genes
[Figure: example gene expression distributions with a HIGH separation score vs. a LOW score]
19. Distance Metrics
- Tamayo's relative class separation
- t-test
- Bhattacharyya distance
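For reference, the usual forms of these scores for a gene g with per-class means μ1, μ2 and standard deviations σ1, σ2 (a sketch of the standard definitions; the lecture may use slightly different variants):

```latex
% Tamayo's relative class separation (signal-to-noise ratio)
S(g) = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}

% two-sample t-statistic (n_1, n_2 samples per class)
t(g) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}

% Bhattacharyya distance between two univariate Gaussian class densities
D_B(g) = \frac{1}{4}\,\frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}
       + \frac{1}{2}\ln\frac{\sigma_1^2 + \sigma_2^2}{2\,\sigma_1\sigma_2}
```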
20. SVM-RFE (wrapper)
- Recursive Feature Elimination:
  - Train a linear SVM → linear decision function
  - Use the absolute values of the variable weights to rank the variables
  - Remove the half of the variables with lower rank
  - Repeat the above steps (train, rank, remove) on the data restricted to the variables not removed
  - Output: subset of variables
21. SVM-RFE
- Linear binary classifier decision function: f(x) = sign(w · x + b)
- Recursive Feature Elimination (SVM-RFE), at each iteration:
  - eliminate a fixed fraction of the variables with the lowest scores (w_i)^2
  - recompute the scores of the remaining variables
22. SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002
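The elimination loop of slides 20-21 can be sketched as follows. For self-containment, a ridge-regularized least-squares linear classifier stands in for the linear SVM (an assumption of this sketch, not the method of Guyon et al.); the ranking criterion w_i^2 and the halve-and-repeat schedule match the slides:

```python
import numpy as np

def linear_weights(X, y, lam=1e-2):
    """Ridge-regularized linear classifier as a stand-in for a linear SVM:
    w = (X^T X + lam*I)^(-1) X^T y, with labels y in {-1, +1}."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def rfe(X, y, n_keep=2):
    """Recursive Feature Elimination: train, rank by w_i^2, drop the lower half."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = linear_weights(X[:, remaining], y)
        order = np.argsort(w ** 2)          # ascending importance
        half = max(len(remaining) // 2, 1)
        keep = sorted(order[half:])         # drop the lower-ranked half
        remaining = [remaining[i] for i in keep]
    return remaining

rng = np.random.default_rng(3)
y = np.array([-1] * 15 + [1] * 15)
X = rng.normal(size=(30, 16))
X[:, 5] += 4 * y                            # feature 5 carries the class signal
sel = rfe(X, y)
print(sel)                                  # feature 5 survives the elimination
```

Retraining after every elimination round is what distinguishes RFE from a one-shot ranking: a feature's weight is recomputed in the context of the surviving features.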
23. RELIEF
- Idea: relevant variables make nearest examples of the same class closer, and make nearest examples of opposite classes farther apart
- weights ← zero
- For all examples in the training set:
  - find the nearest example from the same class (hit) and from the opposite class (miss)
  - update the weight of each variable by adding abs(example - miss) - abs(example - hit)
RELIEF: Kira K. and Rendell L., 10th Int. Conf. on AI, 129-134, 1992
24. RELIEF Algorithm
- RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.
- RELIEF
  - input: X (two classes)
  - output: W (weights assigned to the variables)
  - nr_var ← total number of variables
  - weights ← zero vector of size nr_var
  - for all x in X do
    - hit(x) ← nnb of x from the same class
    - miss(x) ← nnb of x from the opposite class
    - weights ← weights + abs(x - miss(x)) - abs(x - hit(x))
  - end
  - nr_ex ← number of examples in X
  - return W ← weights / nr_ex
- Note: the variables have to be normalized (e.g., divide each variable by its (max - min) value)
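The pseudocode above translates directly into numpy (the four-sample toy data set is illustrative; its features are already on a [0, 1] scale, so no extra normalization is needed):

```python
import numpy as np

def relief(X, y):
    """RELIEF weights as in the pseudocode: for each sample, find its nearest
    hit (same class) and nearest miss (opposite class), and accumulate
    abs(x - miss) - abs(x - hit) into the weight vector."""
    n, p = X.shape
    w = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)     # L1 distances to all samples
        d[i] = np.inf                        # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, d, np.inf))
        miss = np.argmin(np.where(~same, d, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

# toy data: feature 0 discriminates the classes, feature 1 is noise
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w[0] > w[1])                           # True: feature 0 gets more weight
```

The discriminative feature accumulates positive weight (misses are far, hits are close along it), while the noise feature drifts toward zero or below.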
25. Example
- What are the weights of s1, s2, s3 and s4 assigned by RELIEF?
26. Classification: CV error
- Training error: the empirical error on the training set
- Error on an independent test set: the test error
- Cross-validation (CV) error:
  - Leave-one-out (LOO)
  - n-fold CV
[Diagram: the N samples are repeatedly split, with 1/n of the samples used for testing and (n-1)/n for training; the errors are counted over the splits and summarized as the CV error rate]
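A minimal sketch of the leave-one-out scheme, using a 1-nearest-neighbour classifier on synthetic, well-separated data (both choices are illustrative):

```python
import numpy as np

def loocv_error_1nn(X, y):
    """Leave-one-out CV: hold out each sample in turn, classify it with a
    1-nearest-neighbour rule trained on the rest, and count the errors."""
    errors = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # the held-out sample is not training data
        errors += y[np.argmin(d)] != y[i]
    return errors / len(y)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-5, 0.5, size=(10, 3)),
               rng.normal(+5, 0.5, size=(10, 3))])
y = np.array([0] * 10 + [1] * 10)
err = loocv_error_1nn(X, y)
print(err)                                   # 0.0 on these well-separated classes
```

LOO is the special case of n-fold CV with n = N: each fold holds out exactly one sample.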
27. Two schemes of cross-validation
[Diagram]
- CV1: N samples → LOO loop with feature selection inside each fold → train and test the feature-selector and the classifier → count errors
- CV2: N samples → feature selection once, on all samples → LOO loop → train and test the classifier → count errors
28. Difference between CV1 and CV2
- CV1: gene selection within LOOCV
- CV2: gene selection before LOOCV
- CV2 can yield an optimistic estimate of the true classification error
- CV2 was used in the paper by Golub et al.:
  - 0 training errors
  - 2 CV errors (5.26%)
  - 5 test errors (14.7%)
  - the CV error differs from the test error!
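The optimism of CV2 shows up even on pure noise. In this sketch (synthetic data; t-like gene scoring and a nearest-centroid classifier are illustrative stand-ins), the labels are random, so the true error is 50%, yet selecting the genes on all samples before LOOCV typically reports a far lower error:

```python
import numpy as np

def t_rank(X, y, k):
    """Indices of the k features with the largest t-like separation score."""
    m = X[y == 0].mean(0) - X[y == 1].mean(0)
    s = X[y == 0].std(0) + X[y == 1].std(0) + 1e-9
    return np.argsort(np.abs(m / s))[::-1][:k]

def nearest_centroid(Xtr, ytr, x):
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def loocv(X, y, k, select_inside):
    """CV1 (select_inside=True): re-select the genes on each training fold.
    CV2 (select_inside=False): select the genes once on ALL samples."""
    if not select_inside:
        feats = t_rank(X, y, k)
    errors = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        if select_inside:
            feats = t_rank(X[tr], y[tr], k)
        errors += nearest_centroid(X[tr][:, feats], y[tr], X[i, feats]) != y[i]
    return errors / len(y)

# pure-noise data with random labels: the true error rate is 50%
rng = np.random.default_rng(5)
X, y = rng.normal(size=(20, 2000)), np.array([0, 1] * 10)
cv1, cv2 = loocv(X, y, 10, True), loocv(X, y, 10, False)
print(cv1, cv2)   # CV2 is typically far lower (optimistic) than CV1
```

CV2 leaks information: the held-out sample already influenced which genes were kept, so the reported error underestimates the true one.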
29. Significance of classification results
- Permutation test:
  - Permute the class labels of the samples
  - Compute the LOOCV error on the data with permuted labels
  - Repeat this process a large number of times
  - Compare with the LOOCV error on the original data
  - p-value = (# of times the LOOCV error on permuted data ≤ the LOOCV error on the original data) / (total # of permutations considered)
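The steps above can be sketched as follows (a 1-NN LOOCV on synthetic data serves as the classifier; the permutation count is an arbitrary choice for the sketch):

```python
import numpy as np

def loocv_err(X, y):
    """LOOCV error of a 1-nearest-neighbour classifier."""
    e = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        e += y[np.argmin(d)] != y[i]
    return e / len(y)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Fraction of label permutations whose LOOCV error is at least as
    good (<=) as the error obtained with the original labels."""
    rng = np.random.default_rng(seed)
    observed = loocv_err(X, y)
    hits = sum(loocv_err(X, rng.permutation(y)) <= observed
               for _ in range(n_perm))
    return hits / n_perm

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, size=(8, 4)), rng.normal(2, 1, size=(8, 4))])
y = np.array([0] * 8 + [1] * 8)
p_val = permutation_p_value(X, y)
print(p_val)   # small p-value: the class separation is significant
```

A small p-value says the achieved LOOCV error is unlikely under random labelings, i.e. the classifier is exploiting real structure rather than noise.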
30. Application: Biomarker detection with mass spectrometric data of mixed quality
E. Marchiori et al., IEEE CIBCB, 385-391, 2005
- MALDI-TOF data
- Samples of mixed quality due to different storage times
- Controlled molecule spiking used to generate two classes
31. Profiles of one spiked sample
32. Comparison of ML algorithms
- Feature selection + classification:
  - RFE + SVM
  - RFE + kNN
  - RELIEF + SVM
  - RELIEF + kNN
33. LOOCV results
- The misclassified samples are of bad quality (higher storage time)
- The selected features do not always correspond to the m/z values of the spiked molecules
34. LOOCV results
- The variables selected by RELIEF correspond to the spiked peptides
- RFE is less robust than RELIEF over the LOOCV runs and also selects irrelevant variables
- RELIEF-based feature selection yields results that are easier to interpret than RFE's
35. BUT...
- RFE + SVM yields higher LOOCV accuracy than RELIEF + SVM
- RFE + kNN yields higher accuracy than RELIEF + kNN (perfect LOOCV classification for RFE + 1NN)
- RFE-based feature selection yields better predictive performance than RELIEF
36. Conclusion
- Better predictive performance does not necessarily correspond to stability and interpretability of the results
- Open issues:
  - (ML/BIO) Ad-hoc measures of relevance for potential biomarkers identified by feature selection algorithms (use of domain knowledge)?
  - (ML) Is the stability of feature selection algorithms more important than predictive accuracy?