1
Introduction to variable selection I
  • Qi Yu

2
Problems due to poor variable selection
  • If the input dimension is too large, the curse of
    dimensionality problem may arise
  • A poor model may be built when unrelated inputs
    are included or relevant inputs are missing
  • Complex models that contain too many inputs are
    more difficult to understand

3
Two broad classes of variable selection methods:
filter and wrapper
  • The filter method is a pre-processing step, which
    is independent of the learning algorithm.
  • The input subset is chosen by an evaluation
    criterion, which measures the relation of each
    subset of input variables with the output (see
    the sketch below).
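
As an illustration of the filter idea, here is a
minimal Python sketch (my own, assuming NumPy; not
from the slides) that ranks features by the absolute
correlation of each input column with the output and
keeps the k best ones; any other relevance criterion
could be plugged in instead.

  import numpy as np

  def filter_rank(X, y, k):
      """Rank features by the absolute Pearson correlation
      between each column of X and the output y, then keep
      the k best ones (a simple filter criterion)."""
      Xc = X - X.mean(axis=0)
      yc = y - y.mean()
      corr = (Xc * yc[:, None]).sum(axis=0) / (
          np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12)
      ranking = np.argsort(-np.abs(corr))  # most relevant first
      return ranking[:k]

  # usage: selected = filter_rank(X, y, k=10)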

4
Two broad classes of variable selection methods:
filter and wrapper
  • In the wrapper method, the learning model is used
    as part of the evaluation function and also to
    induce the final learning model.
  • The parameters of the model are optimized by
    measuring some cost function.
  • Finally, the set of inputs can be selected using
    LOO, bootstrap or other re-sampling techniques
    (a sketch follows below).
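
A toy wrapper sketch, assuming scikit-learn is
available and using logistic regression as the
induced learning model (an illustrative choice, not
prescribed by the slides): every candidate subset of
a given size is scored by cross-validation and the
best one is returned. Exhaustive search is only
feasible for a handful of features; greedy or
re-sampling based search is used in practice.

  import numpy as np
  from itertools import combinations
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def wrapper_select(X, y, subset_size, cv=5):
      """Score every feature subset of the given size with
      the learning model itself via cross-validation and
      return the best-scoring subset."""
      best_score, best_subset = -np.inf, None
      for subset in combinations(range(X.shape[1]), subset_size):
          cols = list(subset)
          score = cross_val_score(LogisticRegression(max_iter=1000),
                                  X[:, cols], y, cv=cv).mean()
          if score > best_score:
              best_score, best_subset = score, cols
      return best_subset, best_score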

5
Comparison of filter and wrapper
  • The wrapper method tries to solve the real
    problem, hence the criterion can be truly
    optimized, but it is potentially very time
    consuming since it typically needs to evaluate a
    cross-validation scheme at every iteration.
  • The filter method is much faster, but it does not
    incorporate learning.

6
Embedded methods
  • In contrast to filter and wrapper approaches, in
    embedded methods the feature selection part
    cannot be separated from the learning part.
  • The structure of the class of functions under
    consideration plays a crucial role.
  • Existing embedded methods are reviewed based on a
    unifying mathematical framework.

7
Embedded methods
  • Forward-Backward Methods
  • Optimization of scaling factors
  • Sparsity term

8
Forward-Backward Methods
  • Forward selection methods: these methods start
    with one or a few features selected according to
    method-specific selection criteria. More features
    are iteratively added until a stopping criterion
    is met.
  • Backward elimination methods: methods of this
    type start with all features and iteratively
    remove one feature or bunches of features.
  • Nested methods: during an iteration, features can
    be added as well as removed from the data.

9
Forward selection
  • Forward selection with Least squares
  • Grafting
  • Decision trees

10
Forward selection with Least squares
  • 1. Start with S = ∅ and the residuals equal to Y.
  • 2. Find the component i such that
    ||Y - P_{S ∪ {i}} Y||² is minimal, where P_S
    denotes the projection onto the span of the
    selected columns.
  • 3. Add i to S.
  • 4. Recompute the residuals Y with P_S Y.
  • 5. Stop or go back to 2 (see the sketch below).
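
A Python sketch of the steps above, under the
assumption that P_S is the least-squares projection
onto the span of the selected columns of X; at each
step the candidate feature giving the smallest
residual sum of squares ||Y - P_S Y||² is added.

  import numpy as np

  def forward_least_squares(X, Y, n_features):
      """Greedy forward selection for least squares: add the
      feature whose inclusion yields the smallest residual
      sum of squares ||Y - P_S Y||^2."""
      S = []
      for _ in range(n_features):
          best_i, best_rss = None, np.inf
          for i in range(X.shape[1]):
              if i in S:
                  continue
              Xs = X[:, S + [i]]
              # least-squares fit on the candidate subset
              coef, *_ = np.linalg.lstsq(Xs, Y, rcond=None)
              rss = np.sum((Y - Xs @ coef) ** 2)
              if rss < best_rss:
                  best_i, best_rss = i, rss
          S.append(best_i)
      return S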

11
Grafting
  • For a fixed regularization parameter, Perkins
    suggested minimizing a regularized empirical loss
    over the set of parameters which defines the
    model.
  • This is solved in a forward way: in every
    iteration the working set of parameters is
    extended by one and the newly obtained objective
    function is minimized over the enlarged working
    set.
  • The selection criterion for new parameters is the
    magnitude of the gradient of the objective with
    respect to each candidate parameter (a sketch
    follows below).
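
A sketch of one grafting-style forward step, assuming
a squared data-fit loss for concreteness (the
original formulation covers general differentiable
losses): the inactive parameter with the largest
gradient magnitude is added to the working set when
that magnitude exceeds the regularization strength
lam.

  import numpy as np

  def grafting_step(X, y, w, active, lam):
      """One forward step: test each inactive weight by the
      magnitude of the gradient of the squared loss and add
      the best one if it exceeds lam; the model is then
      re-optimized over the enlarged set (not shown)."""
      grad = X.T @ (X @ w - y)  # gradient of 0.5 * ||Xw - y||^2
      inactive = [i for i in range(X.shape[1]) if i not in active]
      if not inactive:
          return active
      i_best = max(inactive, key=lambda i: abs(grad[i]))
      if abs(grad[i_best]) > lam:
          active = active + [i_best]
      return active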

12
Decision trees
  • Decision trees are iteratively built by splitting
    the data depending on the value of a specific
    feature.
  • A widely used criterion for the importance of a
    feature is the mutual information between feature
    i and the outputs Y,
    MI(i) = H(Y) - H(Y | X_i),
  • where H is the entropy and H(Y | X_i) is the
    conditional entropy of Y given feature i (see the
    sketch below).
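
A small Python sketch estimating
MI(i) = H(Y) - H(Y | X_i) from counts, assuming both
the feature values and the outputs are discrete.

  import numpy as np
  from collections import Counter

  def entropy(labels):
      """Empirical entropy H of a discrete variable, in bits."""
      counts = np.array(list(Counter(labels).values()), dtype=float)
      p = counts / counts.sum()
      return -np.sum(p * np.log2(p))

  def mutual_information(x, y):
      """MI(X_i, Y) = H(Y) - H(Y | X_i) for a discrete
      feature x and discrete outputs y."""
      n = len(x)
      h_y_given_x = 0.0
      for v in set(x):
          y_v = [yv for xv, yv in zip(x, y) if xv == v]
          h_y_given_x += (len(y_v) / n) * entropy(y_v)
      return entropy(y) - h_y_given_x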

13
Backward Elimination
  • Recursive Feature Elimination (RFE), given that
    one wishes to employ only a fixed number of input
    dimensions in the final decision rule, attempts
    to find the best subset of that size by a kind of
    greedy backward selection.
  • Algorithm of RFE in the linear case (a sketch in
    code follows below):
  • 1. repeat
  • 2. Find w and b by training a linear SVM.
  • 3. Remove the feature with the smallest value of
    w_i².
  • 4. until the desired number of features remain.
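
A sketch of RFE in the linear case, assuming
scikit-learn and a binary classification task (so
that the trained linear SVM exposes a single weight
vector); one feature is removed per iteration.

  import numpy as np
  from sklearn.svm import SVC

  def rfe_linear_svm(X, y, n_keep):
      """Recursive Feature Elimination: repeatedly train a
      linear SVM and drop the feature with the smallest
      squared weight w_i^2 until n_keep features remain."""
      remaining = list(range(X.shape[1]))
      while len(remaining) > n_keep:
          clf = SVC(kernel="linear").fit(X[:, remaining], y)
          w = clf.coef_.ravel()           # binary case: one weight vector
          worst = int(np.argmin(w ** 2))  # least useful remaining feature
          del remaining[worst]
      return remaining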

14
Embedded methods
  • Forward-Backward Methods
  • Optimization of scaling factors
  • Sparsity term

15
Optimization of scaling factors
  • Scaling Factors for SVM
  • Automatic Relevance Determination
  • Variable Scaling Extension to Maximum Entropy
    Discrimination

16
Scaling Factors for SVM
  • Feature selection is performed by scaling the
    input variables by a vector σ of scaling factors.
    Larger values of σ_i indicate more useful
    features.
  • Thus the problem is now one of choosing the best
    kernel of the form K_σ(x, x') = K(σ ∗ x, σ ∗ x'),
    where ∗ denotes component-wise multiplication.
  • We wish to find the optimal parameters σ, which
    can be optimized by many criteria, e.g. gradient
    descent on the R2W2 bound, the span bound or a
    validation error (a kernel sketch follows below).
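
One possible realization of such a scaled kernel,
sketched here as an anisotropic RBF of the assumed
form K_σ(x, x') = exp(-Σ_i σ_i² (x_i - x'_i)²); a
scaling factor near zero effectively removes the
corresponding feature from the kernel.

  import numpy as np

  def scaled_rbf_kernel(X1, X2, sigma):
      """Anisotropic RBF kernel with per-feature scaling
      factors sigma; returns the (len(X1), len(X2)) Gram
      matrix K_sigma(x, x')."""
      diff = (X1[:, None, :] - X2[None, :, :]) * sigma
      return np.exp(-np.sum(diff ** 2, axis=-1))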

17
Optimization of scaling factors
  • Scaling Factors for SVM
  • Automatic Relevance Determination
  • Variable Scaling Extension to Maximum Entropy
    Discrimination

18
Automatic Relevance Determination
  • In a probabilistic framework, a model of the
    likelihood of the data P(y | w) is chosen, as
    well as a prior on the weight vector, P(w).
  • To predict the output of a test point x, the
    average of f_w(x) over the posterior distribution
    P(w | y) is computed, that is, one predicts with
    the function f_{w_MAP}, where
    w_MAP = arg max_w P(w | y) is the vector of
    parameters called the Maximum a Posteriori (MAP)
    estimate (see the sketch below).
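
A sketch of the MAP computation under the extra
assumptions of a Gaussian likelihood and an ARD prior
w_i ~ N(0, 1/alpha_i), in which case
w_MAP = arg max_w P(w | y) has a closed form; a large
precision alpha_i drives the corresponding weight,
and hence the feature, towards zero.

  import numpy as np

  def ard_map_weights(X, y, alpha, noise_var=1.0):
      """MAP weights for a linear-Gaussian model with an ARD
      prior: w_MAP = (X^T X / s2 + diag(alpha))^{-1} X^T y / s2,
      where s2 is the noise variance."""
      A = np.diag(alpha)
      return np.linalg.solve(X.T @ X / noise_var + A,
                             X.T @ y / noise_var)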

19
Variable Scaling Extension to Maximum Entropy
Discrimination
  • The Maximum Entropy Discrimination (MED)
    framework is a probabilistic model in which one
    does not learn the parameters of a model, but
    distributions over them.
  • Feature selection can be easily integrated into
    this framework. For this purpose, one has to
    specify a prior probability p0 that a feature is
    active.

20
Variable Scaling Extension to Maximum Entropy
Discrimination
  • If w_i were the weight associated with a given
    feature in a linear model, then the expectation
    of this weight is modified so that small weights
    are shrunk towards zero.
  • This has the effect of discarding the components
    whose expected weight falls below a threshold:
    the algorithm ignores features whose weights are
    smaller than a threshold (a sketch follows
    below).
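
A minimal sketch of the resulting pruning step only;
the MED expressions for the expected weights and for
the threshold itself are not reproduced here.

  import numpy as np

  def threshold_weights(expected_w, threshold):
      """Zero out the components whose expected weight
      magnitude falls below the threshold, i.e. ignore the
      corresponding features."""
      w = np.asarray(expected_w, dtype=float).copy()
      w[np.abs(w) < threshold] = 0.0
      return w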

21
Sparsity term
  • In the case of linear models, indicator
    variables are not necessary as feature selection
    can be enforced on the parameters of the model
    directly.
  • This is generally done by adding a sparsity
    term to the objective function that the model
    minimizes.
  • Feature Selection as an Optimization Problem
  • Concave Minimization

22
Feature Selection as an Optimization Problem
  • Most linear models that we consider can be
    understood as the result of the minimization
    min_w Σ_k L(f_w(x_k), y_k) + Ω(w),
  • where L(f_w(x_k), y_k) measures the loss of the
    function f_w on the training point (x_k, y_k) and
    Ω(w) is the sparsity term added to the objective.

23
Feature Selection as an Optimization Problem
  • Examples of empirical errors L are
  • 1. the l1 hinge loss max(0, 1 - y_k f_w(x_k)),
  • 2. the l2 loss (y_k - f_w(x_k))²,
  • 3. the logistic loss log(1 + exp(-y_k f_w(x_k)))
    (written out in the sketch below).
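
The three losses written out as a Python sketch,
assuming labels y in {-1, +1} for the hinge and
logistic losses and a real-valued prediction f = f(x).

  import numpy as np

  def hinge_loss(y, f):
      """l1 hinge loss max(0, 1 - y f(x))."""
      return np.maximum(0.0, 1.0 - y * f)

  def squared_loss(y, f):
      """l2 loss (y - f(x))^2."""
      return (y - f) ** 2

  def logistic_loss(y, f):
      """Logistic loss log(1 + exp(-y f(x)))."""
      return np.log1p(np.exp(-y * f))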

24
Concave Minimization
  • In the case of linear models, feature selection
    can be understood as the optimization problem of
    minimizing the l0 norm ||w||_0, i.e. the number
    of non-zero components of the weight vector.
  • For example, Bradley proposed to approximate this
    function as Σ_i (1 - exp(-α |w_i|)) for some
    α > 0.
  • Weston et al. use a slightly different function:
    they replace the l0 norm by Σ_i log(ε + |w_i|)
    (both surrogates are sketched below).
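
A sketch of the two surrogates, with alpha and eps as
illustrative hyperparameter names; both are concave
in |w_i| and therefore favour sparse weight vectors.

  import numpy as np

  def l0_exp_approx(w, alpha=5.0):
      """Concave approximation sum_i (1 - exp(-alpha |w_i|))
      of the number of non-zero components of w."""
      return np.sum(1.0 - np.exp(-alpha * np.abs(w)))

  def l0_log_approx(w, eps=1e-3):
      """Alternative surrogate sum_i log(eps + |w_i|)."""
      return np.sum(np.log(eps + np.abs(w)))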

25
Summary of embedded methods
  • Embedded methods are built upon the concept of
    scaling factors. We discussed embedded methods
    according to how they approximate the proposed
    optimization problems:
  • explicit removal or addition of features, where
    the scaling factors are optimized over the
    discrete set {0, 1}^n in a greedy iteration,
  • optimization of scaling factors over the compact
    interval [0, 1]^n, and
  • linear approaches that directly enforce sparsity
    of the model parameters.

26
Thank you !