1
Lecture 8: Feature Selection
Bioinformatics Data Analysis and Tools
Elena Marchiori (elena@few.vu.nl)
2
Why select features?
  • Select a subset of relevant input variables
  • Advantages:
  • it is cheaper to measure fewer variables
  • the resulting classifier is simpler and potentially faster
  • prediction accuracy may improve by discarding irrelevant variables
  • identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection)

3
Why select features?
[Figure: correlation plots for the 3-class leukemia data, comparing no feature selection with the top 100 features selected by variance; color scale from -1 to 1]
4
Approaches
  • Wrapper
  • feature selection takes into account the contribution to the performance of a given type of classifier
  • Filter
  • feature selection is based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate the two classes
  • Embedded
  • feature selection is part of the training procedure of a classifier (e.g. decision trees)

5
Embedded methods
  • Attempt to jointly or simultaneously train both a classifier and a feature subset
  • Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features
  • Intuitively appealing
  • Example: tree-building algorithms

Adapted from J. Fridlyand
6
Approaches to Feature Selection
Filter approach: input features -> feature selection by distance metric score -> train model -> model
Wrapper approach: input features -> feature selection search -> feature set -> train model -> model, with the importance of features given by the model fed back into the search
Adapted from Shin and Jasso
7
Filter methods
[Diagram: feature selection maps the p input variables in R^p to s selected variables in R^s, with s << p, before classifier design]
  • Features are scored independently and the top s are used by the classifier
  • Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc. (a t-statistic sketch follows below)
Easy to interpret. Can provide some insight into the disease markers.
Adapted from J. Fridlyand
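As a concrete illustration, here is a minimal sketch of t-statistic filtering, one of the scores listed above; X, y and s are hypothetical names for the data matrix, the binary labels and the number of features to keep.

    import numpy as np

    def t_statistic_scores(X, y):
        """Score each feature independently by a two-sample (Welch) t-statistic.

        X: (n_samples, n_features) data matrix; y: binary labels (0/1).
        """
        X0, X1 = X[y == 0], X[y == 1]
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        v0, v1 = X0.var(axis=0, ddof=1), X1.var(axis=0, ddof=1)
        # Absolute value: large magnitude = well-separated class means
        return np.abs(m0 - m1) / np.sqrt(v0 / len(X0) + v1 / len(X1))

    def select_top_s(X, y, s):
        """Return the column indices of the s highest-scoring features."""
        return np.argsort(t_statistic_scores(X, y))[::-1][:s]

The classifier is then trained on the reduced matrix X[:, select_top_s(X, y, s)] only. Note that each feature is scored in isolation, which is exactly the weakness discussed on the next slide.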
8
Problems with filter methods
  • Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information
  • Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others)
  • The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than with others

Adapted from J. Fridlyand
9
Dimension reduction: a variant on a filter method
  • Rather than retain a subset of s features, perform dimension reduction by projecting the features onto s principal components of variation (e.g. PCA; a sketch follows below)
  • Problem: we are no longer dealing with one feature at a time but with a linear (or possibly more complicated) combination of all features. That may be good enough for a black box, but how does one build a diagnostic chip on a "supergene"? (even though we don't want to confuse the two tasks)
  • These methods tend not to work better than simple filter methods

Adapted from J. Fridlyand
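A minimal sketch of this variant, assuming scikit-learn is available and using hypothetical X_train / X_test matrices:

    from sklearn.decomposition import PCA

    # Project the p original features onto s principal components
    # (s = 5 here, an arbitrary choice). Each component mixes *all*
    # p features, which is why the result is hard to interpret as a
    # set of marker genes.
    pca = PCA(n_components=5)
    Z_train = pca.fit_transform(X_train)  # fit the projection on training data only
    Z_test = pca.transform(X_test)        # apply the same projection to test data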
10
Wrapper methods
[Diagram: feature selection maps the p input variables in R^p to s selected variables in R^s, with s << p, before classifier design]
  • Iterative approach: many feature subsets are scored based on classification performance and the best one is used
  • Selection of subsets: forward selection, backward selection, forward-backward selection, tree harvesting, etc. (a forward-selection sketch follows below)

Adapted from J. Fridlyand
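A minimal sketch of greedy forward selection, the first subset-search strategy listed above; score is an assumed helper that evaluates a candidate feature subset (e.g. by cross-validated accuracy of the chosen classifier), and X is a NumPy matrix.

    def forward_selection(X, y, score, max_features):
        """Greedy forward selection: repeatedly add the single feature
        that most improves a classifier-based subset score."""
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < max_features:
            # Score every candidate subset obtained by adding one feature.
            best = max(remaining, key=lambda j: score(X[:, selected + [j]], y))
            if selected and score(X[:, selected + [best]], y) <= score(X[:, selected], y):
                break  # no single feature improves the subset: stop early
            selected.append(best)
            remaining.remove(best)
        return selected

Backward selection is the mirror image: start from all features and repeatedly drop the one whose removal hurts the score least.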
11
Problems with wrapper methods
  • Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated
  • No exhaustive search is possible (2^p subsets to consider); generally greedy algorithms only
  • Easy to overfit
Adapted from J. Fridlyand
12
Example: Microarray Analysis
Labeled cases (38 bone marrow samples: 27 AML, 11 ALL; each contains 7129 gene expression values)
-> train model (using neural networks, support vector machines, Bayesian nets, etc.) -> key genes + trained model
34 new unlabeled bone marrow samples -> model -> AML/ALL prediction
13
Microarray Data: Challenges to Machine Learning Algorithms
  • Few samples for analysis (38 labeled)
  • Extremely high-dimensional data (7129 gene
    expression values per sample)
  • Noisy data
  • Complex underlying mechanisms, not fully
    understood

14
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful
15
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful
[Scatter plot of the two genes' expression values, with AML and ALL samples labeled]
16
Some genes are more useful than others for
building classification models
Example: genes 37176_at and 36563_at are not useful
17
Importance of Feature (Gene) Selection
  • The majority of genes are not directly related to leukemia
  • Having a large number of features enhances the model's flexibility, but makes it prone to overfitting
  • Noise and the small number of training samples make this even more likely
  • Some types of models, like kNN, do not scale well with many features

18
With 7129 genes, how do we choose the best?
  • Use distance metrics to capture class separation
  • Rank genes according to their distance metric score
  • Choose the top n ranked genes
[Figure: example expression distributions for a gene with a HIGH separation score and one with a LOW score]
19
Distance Metrics
  • Tamayo's relative class separation
  • t-test
  • Bhattacharyya distance (a sketch follows below)
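As one concrete example from this list, a sketch of the Bhattacharyya distance between the two class-conditional distributions of a single gene, under the common assumption that each class is modeled as a univariate Gaussian:

    import numpy as np

    def bhattacharyya(x0, x1):
        """Bhattacharyya distance between univariate Gaussians fitted to
        the expression values of one gene in class 0 (x0) and class 1 (x1)."""
        m0, m1 = x0.mean(), x1.mean()
        v0, v1 = x0.var(ddof=1), x1.var(ddof=1)
        return (0.25 * (m0 - m1) ** 2 / (v0 + v1)
                + 0.5 * np.log((v0 + v1) / (2.0 * np.sqrt(v0 * v1))))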

20
SVM-RFE wrapper
  • Recursive Feature Elimination (a sketch follows below)
  • Train a linear SVM -> linear decision function
  • Use the absolute values of the variable weights to rank the variables
  • Remove the half of the variables with the lower ranks
  • Repeat the above steps (train, rank, remove) on the data restricted to the variables not yet removed
  • Output: a subset of variables
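A minimal sketch of this recursion using scikit-learn's LinearSVC; halving the variable set at each step follows the slide, and n_keep is a hypothetical stopping size.

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_rfe(X, y, n_keep):
        """SVM-RFE: train a linear SVM, rank variables by |weight|,
        drop the lower-ranked half, and repeat on the survivors."""
        active = np.arange(X.shape[1])
        while len(active) > n_keep:
            w = LinearSVC(dual=False).fit(X[:, active], y).coef_.ravel()
            order = np.argsort(np.abs(w))  # ascending by |w_i|
            n_drop = min(len(active) // 2, len(active) - n_keep)
            active = active[order[n_drop:]]  # keep the higher-ranked variables
        return active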

21
SVM-RFE
  • Linear binary classifier decision function: f(x) = w · x + b; the score of variable i is its squared weight w_i^2
  • Recursive Feature Elimination (SVM-RFE): at each iteration,
  • eliminate a fixed fraction (threshold) of the variables with the lowest scores
  • recompute the scores of the remaining variables

22
SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002
23
RELIEF
  • Idea: relevant variables make nearest examples of the same class closer and nearest examples of opposite classes farther apart
  • weights = zero
  • for all examples in the training set:
  • find the nearest example from the same class (hit) and from the opposite class (miss)
  • update the weight of each variable by adding abs(example - miss) - abs(example - hit)

RELIEF: Kira K. and Rendell L., 10th Int. Conf. on AI, 129-134, 1992
24
RELIEF Algorithm
  • RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class (a Python sketch follows below)
  • RELIEF
  • input: X (two classes)
  • output: W (weights assigned to variables)
  • nr_var = total number of variables
  • weights = zero vector of size nr_var
  • for all x in X do
  • hit(x) = nnb of x from the same class
  • miss(x) = nnb of x from the opposite class
  • weights = weights + abs(x - miss(x)) - abs(x - hit(x))
  • end
  • nr_ex = number of examples in X
  • return W = weights / nr_ex
  • Note: variables have to be normalized (e.g., divide each variable by its (max - min) value)
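The pseudocode translates almost line-for-line into NumPy; this is a sketch for the two-class case, using absolute-difference (L1) distances to find the neighbors.

    import numpy as np

    def relief(X, y):
        """RELIEF weights for a two-class dataset; X is (nr_ex, nr_var)."""
        # Normalize each variable by its (max - min) range, per the note above.
        X = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
        weights = np.zeros(X.shape[1])
        for i, x in enumerate(X):
            dist = np.abs(X - x).sum(axis=1)  # distance from x to every example
            dist[i] = np.inf                  # never pick x as its own neighbor
            same = y == y[i]
            hit = X[same][np.argmin(dist[same])]     # nearest hit (same class)
            miss = X[~same][np.argmin(dist[~same])]  # nearest miss (opposite class)
            weights += np.abs(x - miss) - np.abs(x - hit)
        return weights / len(X)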

25
EXAMPLE
  • What are the weights of s1, s2, s3 and s4
    assigned by RELIEF?

26
Classification: CV error
  • Training error: empirical error on the training set
  • Test error: error on an independent test set
  • Cross-validation (CV) error: leave-one-out (LOO) or n-fold CV
[Diagram: the N samples are repeatedly split, with 1/n of the samples used for testing and (n-1)/n for training; the errors over the splits are counted and summarized as the CV error rate]
27
Two schemes of cross validation
CV1: for each LOO split of the N samples, train and test both the feature selector and the classifier; count the errors.
CV2: select the features once, on all N samples, then for each LOO split train and test only the classifier; count the errors.
28
Difference between CV1 and CV2
  • CV1: gene selection within LOOCV
  • CV2: gene selection before LOOCV
  • CV2 can yield an optimistic estimate of the true classification error (a sketch of both schemes follows below)
  • CV2 was used in the paper by Golub et al.:
  • 0 training errors
  • 2 CV errors (5.26%)
  • 5 test errors (14.7%)
  • CV error differs from test error!
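A sketch of both schemes in one routine; select_features and fit_predict are assumed helpers (the first returns column indices, the second trains a classifier and predicts labels for the test rows).

    import numpy as np
    from sklearn.model_selection import LeaveOneOut

    def loocv_error(X, y, select_features, fit_predict, scheme="CV1"):
        """LOOCV error under scheme CV1 (selection inside the loop)
        or CV2 (selection once, on all data: optimistically biased)."""
        if scheme == "CV2":
            feats = select_features(X, y)  # the held-out sample leaks in
        errors = 0
        for train, test in LeaveOneOut().split(X):
            if scheme == "CV1":
                feats = select_features(X[train], y[train])  # no leakage
            pred = fit_predict(X[train][:, feats], y[train], X[test][:, feats])
            errors += int(pred[0] != y[test][0])
        return errors / len(X)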

29
Significance of classification results
  • Permutation test (a sketch follows below):
  • permute the class labels of the samples
  • compute the LOOCV error on the data with permuted labels
  • repeat the process a large number of times
  • compare with the LOOCV error on the original data
  • p-value = (# of times the LOOCV error on permuted data < the LOOCV error on the original data) / (total # of permutations considered)
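A sketch of this test; loocv is an assumed helper returning the LOOCV error (e.g. the loocv_error routine sketched earlier, with its other arguments fixed), and n_perm is an arbitrary choice.

    import numpy as np

    def permutation_p_value(X, y, loocv, n_perm=1000, seed=0):
        """Fraction of label permutations whose LOOCV error is strictly
        smaller than the error on the original labels (the slide's definition)."""
        rng = np.random.default_rng(seed)
        observed = loocv(X, y)
        hits = sum(loocv(X, rng.permutation(y)) < observed for _ in range(n_perm))
        return hits / n_perm

A small p-value means that almost no random labeling classifies as well as the true labeling, so the low error on the real labels is unlikely to be due to chance.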

30
Application: biomarker detection with mass spectrometric data of mixed quality
E. Marchiori et al., IEEE CIBCB, 385-391, 2005
  • MALDI-TOF data
  • samples of mixed quality due to different storage times
  • controlled molecule spiking used to generate the two classes

31
Profiles of one spiked sample
32
Comparison of ML algorithms
  • Feature selection + classification:
  • RFE + SVM
  • RFE + kNN
  • RELIEF + SVM
  • RELIEF + kNN

33
LOOCV results
  • Misclassified samples are of bad quality (longer storage time)
  • The selected features do not always correspond to the m/z values of the spiked molecules

34
LOOCV results
  • The variables selected by RELIEF correspond to the spiked peptides
  • RFE is less robust than RELIEF over the LOOCV runs and also selects irrelevant variables
  • RELIEF-based feature selection yields results that are more interpretable than RFE's

35
BUT...
  • RFE + SVM yields better LOOCV accuracy than RELIEF + SVM
  • RFE + kNN yields better accuracy than RELIEF + kNN (perfect LOOCV classification for RFE + 1NN)
  • RFE-based feature selection yields better predictive performance than RELIEF

36
Conclusion
  • Better predictive performance does not necessarily correspond to stability and interpretability of results
  • Open issues:
  • (ML/BIO) Ad-hoc measures of relevance for potential biomarkers identified by feature selection algorithms (use of domain knowledge)?
  • (ML) Is stability of feature selection algorithms more important than predictive accuracy?