Title: talk proteomics
1. Lecture 8: Feature Selection
Bioinformatics Data Analysis and Tools
Elena Marchiori (elena_at_few.vu.nl)
2. Why select features?
- Select a subset of relevant input variables
- Advantages:
  - it is cheaper to measure fewer variables
  - the resulting classifier is simpler and potentially faster
  - prediction accuracy may improve by discarding irrelevant variables
  - identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)
3. Why select features?
[Figure: correlation plots for the leukemia data (3 classes), comparing no feature selection, the top 100 selected features, and selection based on variance; correlation scale from -1 to 1]
4. Approaches
- Wrapper: feature selection takes into account the contribution to the performance of a given type of classifier
- Filter: feature selection is based on an evaluation criterion for quantifying how well features (subsets) discriminate the two classes
- Embedded: feature selection is part of the training procedure of a classifier (e.g. decision trees)
5. Embedded methods
- Attempt to jointly or simultaneously train both a classifier and a feature subset
- Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features
- Intuitively appealing
- Example: tree-building algorithms
Adapted from J. Fridlyand
6. Approaches to Feature Selection
[Diagram]
- Filter approach: input features → feature selection by distance metric score → train model → model
- Wrapper approach: input features → feature selection search → train model → model, with the importance of the features given by the model fed back into the search
Adapted from Shin and Jasso
7. Filter methods
[Diagram: feature selection maps R^p to R^s, with s << p, followed by classifier design]
- Features are scored independently and the top s are used by the classifier
- Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
- Easy to interpret. Can provide some insight into the disease markers.
Adapted from J. Fridlyand
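As a minimal sketch of such a filter step (the data and feature counts here are synthetic, not the lecture's microarray data), genes can be scored independently with a two-sample t-statistic and the top s retained:

```python
import numpy as np

def t_scores(X, y):
    """Score each feature independently with a two-sample t-statistic.

    X: (n_samples, n_features) expression matrix; y: binary labels (0/1).
    Returns the absolute t-statistic of every feature.
    """
    a, b = X[y == 0], X[y == 1]
    n1, n2 = len(a), len(b)
    num = a.mean(axis=0) - b.mean(axis=0)
    den = np.sqrt(a.var(axis=0, ddof=1) / n1 + b.var(axis=0, ddof=1) / n2)
    return np.abs(num / den)

rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 50))          # 49 pure-noise features
X[:, 0] += 10 * y                      # feature 0 strongly separates the classes

scores = t_scores(X, y)
top = np.argsort(scores)[::-1][:5]     # keep the top s = 5 features
print(top[0])                          # feature 0 ranks first
```

Note that each feature is scored on its own, which is exactly why the redundancy and interaction problems discussed on the next slide arise.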
8. Problems with filter methods
- Redundancy in the selected features: features are considered independently and not measured on the basis of whether they contribute new information
- Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)
- The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than others
Adapted from J. Fridlyand
9. Dimension reduction: a variant on a filter method
- Rather than retain a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA)
- Problem: we are no longer dealing with one feature at a time but rather with a linear, or possibly more complicated, combination of all features. It may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (even though we don't want to confuse the tasks)
- These methods tend not to work better than simple filter methods
Adapted from J. Fridlyand
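A minimal sketch of this projection, using an SVD-based PCA on a toy matrix of microarray-like shape (the sizes are illustrative, not the lecture's data):

```python
import numpy as np

def pca_project(X, s):
    """Project samples onto the top-s principal components (via SVD)."""
    Xc = X - X.mean(axis=0)            # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:s].T               # (n_samples, s) "supergene" scores

rng = np.random.default_rng(1)
X = rng.normal(size=(38, 7129))        # microarray-sized toy matrix
Z = pca_project(X, s=5)
print(Z.shape)                         # (38, 5)
```

Each column of `Z` is a weighted combination of all 7129 genes, which illustrates the interpretability problem raised above: no single gene can be pointed to as the marker.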
10. Wrapper methods
[Diagram: feature selection maps R^p to R^s, with s << p, followed by classifier design]
- Iterative approach: many feature subsets are scored based on classification performance and the best one is used
- Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc.
Adapted from J. Fridlyand
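As a sketch of one of these strategies, greedy forward selection can score each candidate subset by the LOOCV error of a simple classifier (here a nearest-centroid rule on synthetic data; both are illustrative choices, not the lecture's setup):

```python
import numpy as np

def loocv_error(X, y):
    """LOOCV error of a nearest-centroid classifier on the given features."""
    errors = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        Xt, yt = X[tr], y[tr]
        c0, c1 = Xt[yt == 0].mean(axis=0), Xt[yt == 1].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        errors += pred != y[i]
    return errors / len(y)

def forward_select(X, y, k):
    """Greedily add the feature that most reduces the LOOCV error, k times."""
    selected = []
    for _ in range(k):
        rest = [j for j in range(X.shape[1]) if j not in selected]
        best = min(rest, key=lambda j: loocv_error(X[:, selected + [j]], y))
        selected.append(best)
    return selected

rng = np.random.default_rng(2)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 8))
X[:, 3] += 6 * y                       # only feature 3 is informative
sel = forward_select(X, y, k=2)
print(sel[0])                          # the informative feature is picked first
```

The cost is visible in the nested loops: every candidate subset requires building and evaluating a classifier, which is exactly the expense criticized on the next slide.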
11. Problems with wrapper methods
- Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
- No exhaustive search is possible (2^p subsets to consider): generally greedy algorithms only
- Easy to overfit
Adapted from J. Fridlyand
12. Example: Microarray Analysis
- Labeled cases: 38 bone marrow samples (27 ALL, 11 AML), each containing 7129 gene expression values
- Train a model (using neural networks, support vector machines, Bayesian nets, etc.) → key genes
- Apply the model to 34 new unlabeled bone marrow samples → AML/ALL prediction
13. Microarray Data: Challenges to Machine Learning Algorithms
- Few samples for analysis (38 labeled)
- Extremely high-dimensional data (7129 gene expression values per sample)
- Noisy data
- Complex underlying mechanisms, not fully understood
14-15. Some genes are more useful than others for building classification models
[Scatter plot of genes 36569_at and 36495_at: the AML and ALL samples form well-separated clusters, so these genes are useful]
16. Some genes are more useful than others for building classification models
[Scatter plot of genes 37176_at and 36563_at: the classes overlap, so these genes are not useful]
17. Importance of Feature (Gene) Selection
- The majority of genes are not directly related to leukemia
- Having a large number of features enhances the model's flexibility, but makes it prone to overfitting
- Noise and the small number of training samples make this even more likely
- Some types of models, like kNN, do not scale well with many features
18. With 7129 genes, how do we choose the best?
- Use distance metrics to capture class separation
- Rank genes according to their distance metric score
- Choose the top n ranked genes
[Figure: example gene expression distributions with a HIGH separation score vs. a LOW score]
19. Distance Metrics
- Tamayo's relative class separation
- t-test
- Bhattacharyya distance
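For reference, the usual forms of these scores for a gene g with per-class means μ1, μ2 and standard deviations σ1, σ2 (a sketch of the standard definitions; the lecture may use slightly different variants):

```latex
% Tamayo's relative class separation (signal-to-noise ratio)
S(g) = \frac{|\mu_1 - \mu_2|}{\sigma_1 + \sigma_2}

% two-sample t-statistic (n_1, n_2 samples per class)
t(g) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}

% Bhattacharyya distance between two univariate Gaussian class densities
D_B(g) = \frac{1}{4}\,\frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}
       + \frac{1}{2}\ln\frac{\sigma_1^2 + \sigma_2^2}{2\,\sigma_1\sigma_2}
```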
20. SVM-RFE (wrapper)
- Recursive Feature Elimination:
  - Train a linear SVM → linear decision function
  - Use the absolute values of the variable weights to rank the variables
  - Remove the half of the variables with lower rank
  - Repeat the above steps (train, rank, remove) on the data restricted to the variables not removed
  - Output: subset of variables
21. SVM-RFE
- Linear binary classifier decision function: f(x) = sign(w · x + b)
- Recursive Feature Elimination (SVM-RFE), at each iteration:
  - eliminate a fixed fraction of the variables with the lowest scores (w_i)^2
  - recompute the scores of the remaining variables
22. SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002
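The elimination loop of slides 20-21 can be sketched as follows. For self-containment, a ridge-regularized least-squares linear classifier stands in for the linear SVM (an assumption of this sketch, not the method of Guyon et al.); the ranking criterion w_i^2 and the halve-and-repeat schedule match the slides:

```python
import numpy as np

def linear_weights(X, y, lam=1e-2):
    """Ridge-regularized linear classifier as a stand-in for a linear SVM:
    w = (X^T X + lam*I)^(-1) X^T y, with labels y in {-1, +1}."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def rfe(X, y, n_keep=2):
    """Recursive Feature Elimination: train, rank by w_i^2, drop the lower half."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w = linear_weights(X[:, remaining], y)
        order = np.argsort(w ** 2)          # ascending importance
        half = max(len(remaining) // 2, 1)
        keep = sorted(order[half:])         # drop the lower-ranked half
        remaining = [remaining[i] for i in keep]
    return remaining

rng = np.random.default_rng(3)
y = np.array([-1] * 15 + [1] * 15)
X = rng.normal(size=(30, 16))
X[:, 5] += 4 * y                            # feature 5 carries the class signal
sel = rfe(X, y)
print(sel)                                  # feature 5 survives the elimination
```

Retraining after every elimination round is what distinguishes RFE from a one-shot ranking: a feature's weight is recomputed in the context of the surviving features.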
23. RELIEF
- Idea: relevant variables make nearest examples of the same class closer, and make nearest examples of opposite classes farther apart
- weights ← zero
- For all examples in the training set:
  - find the nearest example from the same class (hit) and from the opposite class (miss)
  - update the weight of each variable by adding abs(example - miss) - abs(example - hit)
RELIEF: Kira K. and Rendell L., 10th Int. Conf. on AI, 129-134, 1992
24. RELIEF Algorithm
- RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.
- RELIEF
  - input: X (two classes)
  - output: W (weights assigned to the variables)
  - nr_var ← total number of variables
  - weights ← zero vector of size nr_var
  - for all x in X do
    - hit(x) ← nnb of x from the same class
    - miss(x) ← nnb of x from the opposite class
    - weights ← weights + abs(x - miss(x)) - abs(x - hit(x))
  - end
  - nr_ex ← number of examples in X
  - return W ← weights / nr_ex
- Note: the variables have to be normalized (e.g., divide each variable by its (max - min) value)
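The pseudocode above translates directly into numpy (the four-sample toy data set is illustrative; its features are already on a [0, 1] scale, so no extra normalization is needed):

```python
import numpy as np

def relief(X, y):
    """RELIEF weights as in the pseudocode: for each sample, find its nearest
    hit (same class) and nearest miss (opposite class), and accumulate
    abs(x - miss) - abs(x - hit) into the weight vector."""
    n, p = X.shape
    w = np.zeros(p)
    for i in range(n):
        d = np.abs(X - X[i]).sum(axis=1)     # L1 distances to all samples
        d[i] = np.inf                        # exclude the sample itself
        same = (y == y[i])
        hit = np.argmin(np.where(same, d, np.inf))
        miss = np.argmin(np.where(~same, d, np.inf))
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n

# toy data: feature 0 discriminates the classes, feature 1 is noise
X = np.array([[0.0, 0.3], [0.1, 0.9], [1.0, 0.2], [0.9, 0.8]])
y = np.array([0, 0, 1, 1])
w = relief(X, y)
print(w[0] > w[1])                           # True: feature 0 gets more weight
```

The discriminative feature accumulates positive weight (misses are far, hits are close along it), while the noise feature drifts toward zero or below.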
25. Example
- What are the weights of s1, s2, s3 and s4 assigned by RELIEF?
26. Classification: CV error
- Training error: the empirical error on the training set
- Error on an independent test set: the test error
- Cross-validation (CV) error:
  - Leave-one-out (LOO)
  - n-fold CV
[Diagram: the N samples are repeatedly split, with 1/n of the samples used for testing and (n-1)/n for training; the errors are counted over the splits and summarized as the CV error rate]
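A minimal sketch of the leave-one-out scheme, using a 1-nearest-neighbour classifier on synthetic, well-separated data (both choices are illustrative):

```python
import numpy as np

def loocv_error_1nn(X, y):
    """Leave-one-out CV: hold out each sample in turn, classify it with a
    1-nearest-neighbour rule trained on the rest, and count the errors."""
    errors = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                        # the held-out sample is not training data
        errors += y[np.argmin(d)] != y[i]
    return errors / len(y)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-5, 0.5, size=(10, 3)),
               rng.normal(+5, 0.5, size=(10, 3))])
y = np.array([0] * 10 + [1] * 10)
err = loocv_error_1nn(X, y)
print(err)                                   # 0.0 on these well-separated classes
```

LOO is the special case of n-fold CV with n = N: each fold holds out exactly one sample.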
27. Two schemes of cross-validation
[Diagram]
- CV1: N samples → LOO loop with feature selection inside each fold → train and test the feature-selector and the classifier → count errors
- CV2: N samples → feature selection once, on all samples → LOO loop → train and test the classifier → count errors
28. Difference between CV1 and CV2
- CV1: gene selection within LOOCV
- CV2: gene selection before LOOCV
- CV2 can yield an optimistic estimate of the true classification error
- CV2 was used in the paper by Golub et al.:
  - 0 training errors
  - 2 CV errors (5.26%)
  - 5 test errors (14.7%)
  - the CV error differs from the test error!
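The optimism of CV2 shows up even on pure noise. In this sketch (synthetic data; t-like gene scoring and a nearest-centroid classifier are illustrative stand-ins), the labels are random, so the true error is 50%, yet selecting the genes on all samples before LOOCV typically reports a far lower error:

```python
import numpy as np

def t_rank(X, y, k):
    """Indices of the k features with the largest t-like separation score."""
    m = X[y == 0].mean(0) - X[y == 1].mean(0)
    s = X[y == 0].std(0) + X[y == 1].std(0) + 1e-9
    return np.argsort(np.abs(m / s))[::-1][:k]

def nearest_centroid(Xtr, ytr, x):
    c0, c1 = Xtr[ytr == 0].mean(0), Xtr[ytr == 1].mean(0)
    return int(np.linalg.norm(x - c1) < np.linalg.norm(x - c0))

def loocv(X, y, k, select_inside):
    """CV1 (select_inside=True): re-select the genes on each training fold.
    CV2 (select_inside=False): select the genes once on ALL samples."""
    if not select_inside:
        feats = t_rank(X, y, k)
    errors = 0
    for i in range(len(y)):
        tr = np.arange(len(y)) != i
        if select_inside:
            feats = t_rank(X[tr], y[tr], k)
        errors += nearest_centroid(X[tr][:, feats], y[tr], X[i, feats]) != y[i]
    return errors / len(y)

# pure-noise data with random labels: the true error rate is 50%
rng = np.random.default_rng(5)
X, y = rng.normal(size=(20, 2000)), np.array([0, 1] * 10)
cv1, cv2 = loocv(X, y, 10, True), loocv(X, y, 10, False)
print(cv1, cv2)   # CV2 is typically far lower (optimistic) than CV1
```

CV2 leaks information: the held-out sample already influenced which genes were kept, so the reported error underestimates the true one.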
29. Significance of classification results
- Permutation test:
  - Permute the class labels of the samples
  - Compute the LOOCV error on the data with permuted labels
  - Repeat this process a large number of times
  - Compare with the LOOCV error on the original data
  - p-value = (# of times the LOOCV error on permuted data ≤ the LOOCV error on the original data) / (total # of permutations considered)
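The steps above can be sketched as follows (a 1-NN LOOCV on synthetic data serves as the classifier; the permutation count is an arbitrary choice for the sketch):

```python
import numpy as np

def loocv_err(X, y):
    """LOOCV error of a 1-nearest-neighbour classifier."""
    e = 0
    for i in range(len(y)):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        e += y[np.argmin(d)] != y[i]
    return e / len(y)

def permutation_p_value(X, y, n_perm=200, seed=0):
    """Fraction of label permutations whose LOOCV error is at least as
    good (<=) as the error obtained with the original labels."""
    rng = np.random.default_rng(seed)
    observed = loocv_err(X, y)
    hits = sum(loocv_err(X, rng.permutation(y)) <= observed
               for _ in range(n_perm))
    return hits / n_perm

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, size=(8, 4)), rng.normal(2, 1, size=(8, 4))])
y = np.array([0] * 8 + [1] * 8)
p_val = permutation_p_value(X, y)
print(p_val)   # small p-value: the class separation is significant
```

A small p-value says the achieved LOOCV error is unlikely under random labelings, i.e. the classifier is exploiting real structure rather than noise.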
30. Application: Biomarker detection with mass spectrometric data of mixed quality
E. Marchiori et al., IEEE CIBCB, 385-391, 2005
- MALDI-TOF data
- Samples of mixed quality due to different storage times
- Controlled molecule spiking used to generate two classes
31. Profiles of one spiked sample
32. Comparison of ML algorithms
- Feature selection + classification:
  - RFE + SVM
  - RFE + kNN
  - RELIEF + SVM
  - RELIEF + kNN
33. LOOCV results
- The misclassified samples are of bad quality (higher storage time)
- The selected features do not always correspond to the m/z values of the spiked molecules
34. LOOCV results
- The variables selected by RELIEF correspond to the spiked peptides
- RFE is less robust than RELIEF over the LOOCV runs and also selects irrelevant variables
- RELIEF-based feature selection yields results that are easier to interpret than RFE's
35. BUT...
- RFE + SVM yields higher LOOCV accuracy than RELIEF + SVM
- RFE + kNN yields higher accuracy than RELIEF + kNN (perfect LOOCV classification for RFE + 1NN)
- RFE-based feature selection yields better predictive performance than RELIEF
36. Conclusion
- Better predictive performance does not necessarily correspond to stability and interpretability of the results
- Open issues:
  - (ML/BIO) Ad-hoc measures of relevance for potential biomarkers identified by feature selection algorithms (use of domain knowledge)?
  - (ML) Is the stability of feature selection algorithms more important than predictive accuracy?