Title: Introduction to variable selection I
1 Introduction to variable selection I
2 Problems due to poor variable selection
- If the input dimension is too large, the curse of dimensionality problem may arise.
- A poor model may be built when additional unrelated inputs are included or not enough relevant inputs are available.
- Complex models that contain too many inputs are more difficult to understand.
3 Two broad classes of variable selection methods: filter and wrapper
- The filter method is a pre-processing step, which is independent of the learning algorithm.
- The input subset is chosen by an evaluation criterion, which measures the relation of each subset of input variables with the output.
4 Two broad classes of variable selection methods: filter and wrapper
- In the wrapper method, the learning model is used as part of the evaluation function and also to induce the final learning model.
- The parameters of the model are optimized by measuring some cost function.
- Finally, the set of inputs can be selected using LOO, bootstrap, or other re-sampling techniques.
5 Comparison of filter and wrapper
- The wrapper method tries to solve the real problem, so the criterion can be truly optimized, but it is potentially very time consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
- The filter method is much faster, but it does not incorporate learning.
6 Embedded methods
- In contrast to the filter and wrapper approaches, in embedded methods the feature selection part cannot be separated from the learning part.
- The structure of the class of functions under consideration plays a crucial role.
- Existing embedded methods are reviewed based on a unifying mathematical framework.
7 Embedded methods
- Forward-Backward Methods
- Optimization of scaling factors
- Sparsity term
8 Forward-Backward Methods
- Forward selection methods: these methods start with one or a few features selected according to a method-specific selection criterion. More features are iteratively added until a stopping criterion is met.
- Backward elimination methods: methods of this type start with all features and iteratively remove one feature or bunches of features.
- Nested methods: during an iteration, features can be added as well as removed from the data.
9 Forward selection
- Forward selection with Least squares
- Grafting
- Decision trees
10 Forward selection with Least squares
- 1. Start with an empty set of selected features S = ∅ and residuals r = Y.
- 2. Find the component i such that ||Y − P_{S ∪ {i}} Y|| is minimal, where P_S denotes the projection onto the span of the features in S.
- 3. Add i to S.
- 4. Recompute the residuals r = Y − P_S Y.
- 5. Stop, or go back to step 2. (A sketch of this procedure follows.)
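Below is a minimal Python sketch of this procedure, assuming a data matrix X (samples by features) and a target vector Y; the projection P_S is realized by a least-squares fit on the currently selected columns, and a fixed number of features k serves as the stopping criterion.

import numpy as np

def forward_selection_ls(X, Y, k):
    n, d = X.shape
    S = []                        # indices of selected features
    r = Y.copy()                  # current residuals
    for _ in range(k):            # simple stopping criterion: k features
        best_i, best_err = None, np.inf
        for i in range(d):
            if i in S:
                continue
            cols = X[:, S + [i]]
            # residual norm after projecting Y onto the span of S plus feature i
            coef, *_ = np.linalg.lstsq(cols, Y, rcond=None)
            err = np.linalg.norm(Y - cols @ coef)
            if err < best_err:
                best_i, best_err = i, err
        S.append(best_i)          # step 3: add i to S
        cols = X[:, S]
        coef, *_ = np.linalg.lstsq(cols, Y, rcond=None)
        r = Y - cols @ coef       # step 4: recompute residuals r = Y - P_S Y
    return S, r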
11 Grafting
- For fixed λ1 and λ0, Perkins suggested minimizing the function C(w) = (1/m) Σ_{k=1}^{m} L(f_w(x_k), y_k) + λ1 Σ_j |w_j| + λ0 Σ_j [w_j ≠ 0] over the set of parameters w which defines f_w.
- This problem is solved in a forward way (see the sketch below):
- In every iteration the working set of parameters is extended by one, and the newly obtained objective function is minimized over the enlarged working set.
- The selection criterion for a new parameter w_j is that the gradient of the loss exceeds the l1 penalty, i.e. |∂L/∂w_j| > λ1.
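The sketch below illustrates this forward scheme under simplifying assumptions that are mine, not Perkins': a squared loss, only the l1 term (lam1), and scipy's general-purpose minimizer for re-optimizing the working set.

import numpy as np
from scipy.optimize import minimize

def grafting(X, Y, lam1, max_features=10):
    n, d = X.shape
    active = []                        # working set of parameters
    w = np.zeros(d)

    def objective(w_active, idx):
        # squared loss plus l1 penalty, restricted to the working set
        f = X[:, idx] @ w_active
        return 0.5 * np.mean((f - Y) ** 2) + lam1 * np.abs(w_active).sum()

    for _ in range(max_features):
        f = X[:, active] @ w[active] if active else np.zeros(n)
        grad = X.T @ (f - Y) / n       # gradient of the data term w.r.t. all weights
        grad[active] = 0.0             # only inactive parameters are candidates
        j = int(np.argmax(np.abs(grad)))
        if np.abs(grad[j]) <= lam1:    # selection criterion: |dL/dw_j| > lam1
            break
        active.append(j)               # extend the working set by one
        res = minimize(objective, w[active], args=(list(active),), method="BFGS")
        w[active] = res.x              # minimize over the enlarged working set
    return w, active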
12 Decision trees
- Decision trees are built iteratively by splitting the data depending on the value of a specific feature.
- A widely used criterion for the importance of a feature is the mutual information between feature i and the outputs Y:
- MI(i) = H(Y) − H(Y | X_i)
- where H is the entropy, H(Y) = −Σ_y P(y) log P(y). (A sketch of this criterion follows.)
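A minimal Python sketch of this criterion, assuming both the feature values x_i and the labels Y are discrete:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(x_i, Y):
    # MI(i) = H(Y) - H(Y | X_i)
    h_y = entropy(Y)
    h_y_given_x = 0.0
    for v in np.unique(x_i):
        mask = (x_i == v)
        h_y_given_x += mask.mean() * entropy(Y[mask])  # weighted conditional entropy
    return h_y - h_y_given_x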
13 Backward Elimination
- Recursive Feature Elimination (RFE), given that one wishes to employ only s0 input dimensions in the final decision rule, attempts to find the best subset of size s0 by a kind of greedy backward selection.
- Algorithm of RFE in the linear case (a sketch follows this list):
- 1. repeat
- 2. Find w and b by training a linear SVM.
- 3. Remove the feature i with the smallest value of w_i².
- 4. until s0 features remain.
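A minimal Python sketch of this loop, assuming scikit-learn's SVC with a linear kernel and a binary classification task; s0 is the desired number of remaining features:

import numpy as np
from sklearn.svm import SVC

def rfe_linear(X, y, s0):
    remaining = list(range(X.shape[1]))
    while len(remaining) > s0:                              # until s0 features remain
        svm = SVC(kernel="linear").fit(X[:, remaining], y)  # find w and b
        w = svm.coef_.ravel()
        worst = int(np.argmin(w ** 2))                      # feature with smallest w_i^2
        del remaining[worst]                                # remove it
    return remaining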
14 Embedded methods
- Forward-Backward Methods
- Optimization of scaling factors
- Sparsity term
15 Optimization of scaling factors
- Scaling Factors for SVM
- Automatic Relevance Determination
- Variable Scaling Extension to Maximum Entropy
Discrimination
16 Scaling Factors for SVM
- Feature selection is performed by scaling the input parameters by a vector σ ∈ [0, 1]^n. Larger values of σ_i indicate more useful features.
- Thus the problem is now one of choosing the best kernel of the form k_σ(x, x') = k(σ ∗ x, σ ∗ x'), where ∗ denotes element-wise multiplication.
- We wish to find the optimal parameters σ, which can be optimized by many criteria, e.g., gradient descent on the R2W2 bound, the span bound, or a validation error. (A rough sketch follows.)
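As a rough illustration only, the sketch below chooses σ by minimizing the error on a held-out validation set with a derivative-free search; gradient descent on the R2W2 or span bounds mentioned above would replace this criterion in practice.

import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVC

def fit_scaling_factors(X_tr, y_tr, X_val, y_val):
    d = X_tr.shape[1]

    def val_error(sigma):
        sigma = np.clip(sigma, 0.0, 1.0)                  # keep sigma in [0, 1]^n
        svm = SVC(kernel="rbf").fit(X_tr * sigma, y_tr)   # k(sigma * x, sigma * x')
        return 1.0 - svm.score(X_val * sigma, y_val)      # validation error

    res = minimize(val_error, np.ones(d), method="Nelder-Mead")
    return np.clip(res.x, 0.0, 1.0)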
17 Optimization of scaling factors
- Scaling Factors for SVM
- Automatic Relevance Determination
- Variable Scaling Extension to Maximum Entropy
Discrimination
18 Automatic Relevance Determination
- In a probabilistic framework, a model of the likelihood of the data P(y | w) is chosen, as well as a prior on the weight vector, P(w).
- To predict the output of a test point x, the average of f_w(x) over the posterior distribution P(w | y) is computed; in practice the function f_{w_MAP} is used for prediction, where w_MAP is the vector of parameters called the Maximum a Posteriori (MAP) estimate, i.e. w_MAP = argmax_w P(w | y). (An illustration follows.)
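As an illustration, scikit-learn's ARDRegression implements this idea for regression: each weight receives its own prior precision, and the weights of irrelevant inputs are driven toward zero. The data below are synthetic and only for demonstration.

import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.standard_normal(100)  # only features 0 and 3 matter

ard = ARDRegression().fit(X, y)
print(ard.coef_)   # coefficients of irrelevant features shrink toward zero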
19 Variable Scaling Extension to Maximum Entropy Discrimination
- The Maximum Entropy Discrimination (MED) framework is a probabilistic model in which one does not learn the parameters of a model, but distributions over them.
- Feature selection can be easily integrated in this framework. For this purpose, one has to specify a prior probability p0 that a feature is active.
20 Variable Scaling Extension to Maximum Entropy Discrimination
- If w_i is the weight associated with a given feature in a linear model, then the expectation of this weight is modified according to the prior p0 that the feature is active.
- This has the effect of discarding the components whose modified expected weight is small.
- In effect, this algorithm ignores features whose weights are smaller than a threshold.
21 Sparsity term
- In the case of linear models, indicator variables are not necessary, as feature selection can be enforced on the parameters of the model directly.
- This is generally done by adding a sparsity term to the objective function that the model minimizes.
- Feature Selection as an Optimization Problem
- Concave Minimization
22 Feature Selection as an Optimization Problem
- Most linear models that we consider can be understood as the result of the following minimization:
- min_{w,b} Σ_{k=1}^{m} L(⟨w, x_k⟩ + b, y_k) + λ Ω(w)
- where L(f(x_k), y_k) measures the loss of the function f(x) = ⟨w, x⟩ + b on the training point (x_k, y_k), and Ω(w) is a sparsity (regularization) term.
23 Feature Selection as an Optimization Problem
- Examples of empirical errors are (see the sketch after this list):
- 1. l1 hinge loss: L(f(x), y) = max(0, 1 − y f(x))
- 2. l2 loss: L(f(x), y) = (y − f(x))²
- 3. Logistic loss: L(f(x), y) = log(1 + exp(−y f(x)))
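The three losses can be written down directly; the sketch below assumes labels y in {-1, +1} and a real-valued prediction f = f(x).

import numpy as np

def hinge_loss(f, y):        # l1 hinge loss: max(0, 1 - y f(x))
    return max(0.0, 1.0 - y * f)

def squared_loss(f, y):      # l2 loss: (y - f(x))^2
    return (y - f) ** 2

def logistic_loss(f, y):     # logistic loss: log(1 + exp(-y f(x)))
    return np.log1p(np.exp(-y * f))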
24 Concave Minimization
- In the case of linear models, feature selection can be understood as the optimization problem min_w ||w||_0 subject to y_k(⟨w, x_k⟩ + b) ≥ 1 for all training points.
- For example, Bradley proposed to approximate the l0 function as ||w||_0 ≈ Σ_j (1 − exp(−α |w_j|)).
- Weston et al. use a slightly different function: they replace the l0 norm by Σ_j log(ε + |w_j|). (A sketch of both surrogates follows.)
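A minimal sketch of the two smooth surrogates for the l0 norm, with alpha and eps as assumed hyperparameters:

import numpy as np

def l0_exp_approx(w, alpha=5.0):
    # Bradley: sum_j (1 - exp(-alpha * |w_j|)), which approaches ||w||_0 as alpha grows
    return np.sum(1.0 - np.exp(-alpha * np.abs(w)))

def l0_log_approx(w, eps=1e-3):
    # Weston et al.: sum_j log(eps + |w_j|) as a surrogate for the l0 norm
    return np.sum(np.log(eps + np.abs(w)))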
25 Summary of embedded methods
- Embedded methods are built upon the concept of scaling factors. We discussed embedded methods according to how they approach the proposed optimization problems:
- Explicit removal or addition of features: the scaling factors are optimized over the discrete set {0, 1}^n in a greedy iteration,
- Optimization of scaling factors over the compact interval [0, 1]^n, and
- Linear approaches that directly enforce sparsity of the model parameters.
26 Thank you!