Title: Feature Selection
1. Feature Selection
- Ioannis Tsamardinos
- Machine Learning Course, 2006
- Computer Science Dept., University of Crete
- (Some slides borrowed from the Aliferis and Tsamardinos 2004 Medinfo tutorial)
2. Outline
- What is Variable/Feature Selection
- Filters and Wrappers
- What is Relevancy
- Connecting Wrappers, Filters, and Relevancy
- SVM-Based Variable Selection
- Markov-Blanket-Based Variable Selection
3. Back to the Fundamentals: The Feature Selection Problem
- Journal of Machine Learning Research special issue: "Variable selection refers to the problem of selecting input variables that are most predictive of a given outcome"
- Kohavi and John 1997: "variable selection is the problem of selecting the subset of features such that the accuracy of the induced classifier is maximal"
- Problem: according to which classifier is predictive power measured?
  - A specific one?
  - All possible classifiers?
- What about features with different costs?
4. Why Feature Selection?
- To reduce the cost or risk associated with observing the variables
- To increase predictive power
- To reduce the size of the models, so that they are easier to understand and trust
- To understand the domain
5. Definition of Feature Selection
- Let M be a metric, scoring a model and a feature subset according to the predictions made and the features used
- Let A be a learning algorithm used to build the model
- Feature Selection Problem: select a feature subset s that maximizes the score M gives to the model learned by A using features s
- Feature Selection Problem 2: select a feature subset s and a learner A that maximize the score M gives to the model learned by A using features s
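In symbols (a sketch of the two problems above, assuming a fixed training set D; this notation is not in the original slides):

    s* = argmax_s M( A(D, s), s )                 (Feature Selection Problem)
    (s*, A*) = argmax_{s, A} M( A(D, s), s )      (Feature Selection Problem 2)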
6. Examples
- M is accuracy with a preference for smaller models; A is an SVM
- Find the minimal feature subset that maximizes the accuracy of an SVM
- Other possibilities for M: calibrated accuracy, AUC, a trade-off between accuracy and the cost of the features
7. Filters and Wrappers
8. Wrappers
- An algorithm for solving the feature selection problem that is allowed to evaluate (has access to) the learner A on different feature subsets
- Typical wrapper:
  - search (greedily or otherwise) the space of feature subsets
  - evaluate each subset s encountered during the search using M
  - report the subset that maximizes M
9. Wrappers
Say we have predictors A, B, C and classifier M. We want to predict T given the smallest possible subset of {A, B, C}, while achieving maximal performance (accuracy).

FEATURE SET   CLASSIFIER   PERFORMANCE
A, B, C       M            98
A, B          M            98
A, C          M            77
B, C          M            56
A             M            89
B             M            90
C             M            91
(none)        M            85
10. An Example of a Greedy Wrapper
- Since the search space is exponential, we have to use heuristic search
[Figure: greedy search over the subsets of {A, B, C} and their performance ({} 85, A 89, B 90, C 91, A,B 98, A,C 77, B,C 56, A,B,C 98); the optimal solution {A, B} (98) and the subset returned by the greedy search are marked]
11. Wrappers
- A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (forward greedy wrapping; see the sketch after this list)
- Alternatively, we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (backward greedy wrapping)
- A third alternative is to interleave the two phases (adding and removing), either in forward or in backward wrapping (forward-backward wrapping)
- Of course, other forms of search can be used, most notably:
  - Exhaustive search
  - Genetic Algorithms
  - Branch-and-Bound (e.g., when features have costs and the goal is to reach a performance threshold or better)
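A minimal sketch of forward greedy wrapping, assuming NumPy arrays X, y and scikit-learn; here the metric M is cross-validated accuracy and the learner A is a linear SVM, both illustrative choices rather than anything prescribed by the slides:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_greedy_wrapper(X, y, learner=None, cv=5):
    """Hill climbing: add one feature at a time while the
    cross-validated score (the metric M) keeps improving."""
    if learner is None:
        learner = SVC(kernel="linear")          # the learner A
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # score every one-feature extension of the current subset
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:        # no improvement: stop
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

Backward greedy wrapping is the mirror image: start from all features and repeatedly drop the feature whose removal most improves the score, stopping when no removal helps.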
12. Example Feature Selection Methods in Bioinformatics: GA/KNN
- Wrapper approach whereby
  - the heuristic search is a Genetic Algorithm, and
  - the classifier is KNN
13. Filters
- An algorithm for solving the feature selection problem that is not allowed to evaluate (does not have access to) the learner A
- Typical filters select the feature subset according to certain statistical properties
14. Filter Example: Univariate Association Filtering
- Rank the features according to their (univariate) association with the target
- Select the first k features

FEATURE   ASSOCIATION WITH TARGET
C         91
B         90
A         89

No threshold k gives the optimal solution here: the best subset was {A, B}, but the ranking puts C first.
15. Example Feature Selection Methods in Biomedicine: Univariate Association Filtering
- Order all predictors according to the strength of their association with the target
- Choose the first k predictors and feed them to the classifier (see the sketch after this list)
- Various measures of association may be used: X2 (chi-square), G2, Pearson r, Fisher Criterion Scoring, etc.
- How do we choose k?
- What if we have too many variables?
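A small sketch of this filter (not from the slides), assuming continuous features in a NumPy array and using the absolute Pearson r as the association measure; both the measure and the function name are illustrative choices:

import numpy as np

def univariate_filter(X, y, k):
    """Rank features by |Pearson r| with the target and keep the top k."""
    assoc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    ranking = np.argsort(-assoc)      # feature indices, strongest association first
    return ranking[:k]

Swapping in X2, G2, or the Fisher criterion only changes how assoc is computed; the ranking-and-cutoff logic stays the same.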
16. Example Feature Selection Methods in Biomedicine: Recursive Feature Elimination
- Filter algorithm where feature selection is done as follows (see the sketch after this list):
  - build a linear Support Vector Machine classifier using the current V features
  - compute the weights of all features and keep the best V/2
  - repeat until 1 feature is left
  - choose the feature subset that gives the best performance (using cross-validation)
  - give the best feature set to the classifier of choice
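A minimal sketch of the halving procedure above, assuming a binary target and scikit-learn's linear SVM; the function name and the choice of cross-validated accuracy as the performance measure are illustrative:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def rfe_halving(X, y, cv=5):
    """Repeatedly fit a linear SVM, keep the half of the current features with the
    largest absolute weights, then return the candidate subset with the best CV score."""
    current = list(range(X.shape[1]))
    candidates = []
    while current:
        candidates.append(list(current))
        if len(current) == 1:
            break
        svm = LinearSVC(dual=False).fit(X[:, current], y)
        weights = np.abs(svm.coef_).ravel()         # assumes a binary target
        keep = np.argsort(-weights)[: len(current) // 2]
        current = [current[i] for i in keep]
    scores = [cross_val_score(LinearSVC(dual=False), X[:, c], y, cv=cv).mean()
              for c in candidates]
    return candidates[int(np.argmax(scores))]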
17. What is Relevancy?
18. Relevant and Irrelevant Features
- There has been a large effort and debate to define relevant (and irrelevant) features (AI Journal vol. 97)
- Why? The intuition:
  - For classification we (presumably) only need the relevant features
  - We can throw away the irrelevant features
  - The set of relevant features must be the solution to the feature selection problem!
  - What is relevant must be independent of the classifier A used to build the final model!
  - Relevant features teach us something about the domain
19. Relevancy and Filters
- Consider a definition of relevancy
- Construct an algorithm that attempts to identify the relevant features
- The result is a filtering algorithm (independent of the classifier used)
- Relevancy → a family of filtering algorithms
20. The Argument of Kohavi and John 1997
- Take a handicapped perceptron: sgn(w·x) instead of sgn(w·x + w0)
- Add an irrelevant variable to the data whose value is always 1
- For some problems this irrelevant variable is necessary: its weight can play the role of the missing bias term w0
- Filtering (presumably) returns only relevant features
- Thus, filtering is suboptimal, while wrapping is not

[Figure: a perceptron computing sgn(w·x) over the inputs x1, ..., x4 plus a constant input x0 = 1]
21. KJ Definitions of Relevancy
- KJ-Strongly Relevant Variable (for target T)
  - X is KJ-strongly relevant if it is necessary for optimal density estimation
  - Formally: let V be the set of all variables and S = V \ {X, T}; then
    P(T | X, S) ≠ P(T | S)
22. KJ Definitions of Relevancy
- KJ-Weakly Relevant Variable (for target T)
  - X is KJ-weakly relevant if it is not necessary for optimal density estimation but is still informative (i.e., there is some subset of the other variables conditioned on which it becomes informative)
  - Formally: let V be the set of all variables; X is not strongly relevant, and there exists U ⊆ V \ {X, T} such that
    P(T | X, U) ≠ P(T | U)
23. KJ Definition of Irrelevancy
- A variable X is KJ-irrelevant to T if it is neither weakly nor strongly relevant to T
- Intuitively: X provides no information about T conditioned on any subset of the other variables
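To make the definitions concrete, here is a tiny numerical sketch (not from the slides): on an exclusive-OR target, X1 alone looks uninformative, P(T | X1) ≈ P(T), yet conditioning additionally on X2 changes the distribution, so X1 is KJ-strongly relevant:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
t = np.logical_xor(x1, x2).astype(int)        # T = XOR(X1, X2)

# P(T=1 | X1=1) vs P(T=1): X1 on its own tells us nothing about T
print(t[x1 == 1].mean(), t.mean())                              # both ~0.5

# P(T=1 | X1=1, X2=0) vs P(T=1 | X2=0): given X2, X1 is decisive
print(t[(x1 == 1) & (x2 == 0)].mean(), t[x2 == 0].mean())       # ~1.0 vs ~0.5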
24. Connecting Wrappers, Filters, and Relevancy
25. Negative Results on Relevancy and Filters
- Kohavi and John's argument: filtering returns only relevant variables, yet sometimes KJ-irrelevant or KJ-weakly relevant variables may be needed
- True: there is no definition of relevancy, independent of the classifier A used to build the final model or of the metric M that evaluates the model of A, such that the relevant features are the solution to the feature selection problem [Tsamardinos and Aliferis, AIStats 2003]
- We have to assume a (family of) algorithm(s) and metric(s) in order to define what is relevant
26. Negative Results on Wrappers
- Wrappers are subject to the No Free Lunch theorem for black-box optimization if the choice of metric or classifier is unconstrained [Tsamardinos and Aliferis, AIStats 2003]
- => Averaged over all possible problems, every wrapper performs the same as random search
- Provably finding the optimal feature subset requires an exponential search
27. Connecting with Bayesian Networks
[Figure: a faithful Bayesian network over the variables A, B, C, D, E, F, H, I, K and the target T. The Markov Blanket of T contains the KJ-strongly relevant features; any variable with a path to T is KJ-weakly relevant; any variable without a path to T is KJ-irrelevant.]
28. Markov Blanket in Faithful Bayesian Networks
- Markov Blanket of T = the KJ-strongly relevant features
- It is the smallest set of variables, conditioned on which all other variables become independent of T
- In a faithful Bayesian network it is the set of parents, children, and spouses of T (see the sketch below)
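A minimal sketch (not from the slides) of reading MB(T) off a known DAG structure, assuming the network is given as a dict mapping each node to the list of its parents:

def markov_blanket(parents, target):
    """Return MB(target) = parents, children, and spouses of target,
    for a DAG encoded as {node: [its parents]}."""
    children = [n for n, ps in parents.items() if target in ps]
    spouses = {p for c in children for p in parents[c] if p != target}
    return set(parents[target]) | set(children) | spouses

# Toy DAG: A -> T, B -> T, T -> C, D -> C  (D is a spouse of T)
dag = {"A": [], "B": [], "D": [], "T": ["A", "B"], "C": ["T", "D"]}
print(markov_blanket(dag, "T"))   # {'A', 'B', 'C', 'D'}

In practice the structure is unknown; algorithms such as MMMB and HITON (slide 36) infer MB(T) directly from data using conditional-independence tests.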
29. OPTIMAL Solutions to a Class of Feature Selection Problems
- MB(T) is the smallest subset of variables, conditioned on which all other variables become independent of T
- The Markov Blanket of T should be all we need
- True when:
  - the classifier can utilize the information in those variables (e.g., it is a universal approximator)
  - the metric prefers the smallest model with optimal calibrated accuracy (otherwise the Markov Blanket may include unnecessary variables)
30. SVM-Based Variable Selection
31. Linear SVMs Identify Irrelevant Features
- Theorem: both the hard-margin and the soft-margin linear SVM will assign a weight of zero to irrelevant features (in the sample limit)
- Proof outline:
  - set up the sample-limit SVM
  - prove there is a unique w, b in the sample limit
  - prove that in this w, b the weight of the irrelevant features is zero
32. Linear SVMs May Not Identify KJ-Strongly Relevant Variables
- Consider an exclusive OR
- The soft-margin linear SVM has a zero weight vector (a small numerical check follows this slide)
- But both features are KJ-strongly relevant
- A similar result is expected for non-linear SVMs

[Figure: the exclusive-OR configuration of the two features x1, x2, with opposite class labels on the two diagonals]
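A quick numerical check of the claim (a sketch, not from the slides), fitting scikit-learn's soft-margin linear SVM on replicated XOR points; by symmetry the learned weight vector comes out at (or numerically very near) zero:

import numpy as np
from sklearn.svm import LinearSVC

# The four XOR corners, replicated so both classes are well represented
X = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float), (50, 1))
y = np.tile(np.array([-1, 1, 1, -1]), 50)

svm = LinearSVC(C=1.0, dual=False).fit(X, y)
print(svm.coef_, svm.intercept_)   # weight vector expected to be (near) zero

Even though the XOR target is a deterministic function of x1 and x2 (both KJ-strongly relevant), the linear SVM finds no usable margin in either feature and effectively discards both.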
33. Linear SVMs May Retain KJ-Weakly-Relevant Features
[Figure: a two-feature example in the (X1, X2) plane in which the linear SVM assigns nonzero weight to a KJ-weakly-relevant feature]
34. Feature Selection with (Linear) SVMs
- The SVM will correctly remove irrelevant features
- The SVM may incorrectly remove strongly relevant features as well
- The SVM may incorrectly retain weakly relevant features
35. Markov-Blanket-Based Feature Selection
36. Optimal Feature Selection with Markov Blankets
- MMMB and HITON algorithms [KDD 2003, AMIA 2003]
- Can identify MB(T) among thousands of variables
- Provably correct in the sample limit and in faithful distributions
- Provably, MB(T) is the solution under the conditions specified
- Excellent results on real datasets from biomedicine
- The difference between the two methods is conditioning vs. maximizing the margin
37. Causal Discovery
- Recall: one reason for feature selection is to understand the domain
- Markov Blanket of T
  - Causal interpretation: direct causes, direct effects, and direct causes of direct effects
- When:
  - Faithfulness
  - Causal Sufficiency
  - Acyclicity
38. Network: Alarm-1k (999 variables, consisting of 37 tiles of the ALARM network). Classification algorithm: RBF SVM. Training sample size: 1000. Testing sample size: 1000.
39. Target: variable 46
[Figure: the target variable and the members of its Markov Blanket highlighted in the network]
40. Feature Selection Method: HITON_MB
Classification performance: 87.7. 3 true positives, 2 false positives, 0 false negatives.
[Figure: the selected variables marked on the network as true positives, false positives, and false negatives]
41. Feature Selection Method: MMMB
Classification performance: 87.7. 3 true positives, 2 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
42. Feature Selection Method: RFE (Linear)
Classification performance: 74.6. 2 true positives, 189 false positives, 1 false negative.
[Figure: the selected variables marked on the network]
43. Feature Selection Method: RFE (Polynomial)
Classification performance: 85.6. 2 true positives, 33 false positives, 1 false negative.
[Figure: the selected variables marked on the network]
44. Feature Selection Method: BFW
Classification performance: 82.5. 3 true positives, 161 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
45. Network: GENE (801 variables). Classification algorithm: See5.0 decision trees. Training sample size: 1000. Testing sample size: 1000.
46. Target: variable 220
[Figure: the target variable and the members of its Markov Blanket highlighted in the network]
47. Feature Selection Method: MMMB
Classification performance with DT: 96.4. 9 true positives, 4 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
48. Feature Selection Method: HITON_MB
Classification performance with DT: 96.2. 9 true positives, 4 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
49. Feature Selection Method: BFW
Classification performance with DT: 96.3. 4 true positives, 20 false positives, 5 false negatives.
[Figure: the selected variables marked on the network]
50. Is This a General Phenomenon or a Contrived Example?
51. Random Targets in Tiled ALARM
52. Random Targets in GENE
53. Conclusions
- A formal definition of the feature selection problems allows us to draw connections between relevant/irrelevant variables, the Markov Blanket, and solutions to the feature selection problem
- We need to specify the algorithm and the metric in order to design algorithms that provably solve the feature selection problem
54. Conclusions
- Linear SVMs (in their current formulations) correctly identify the irrelevant variables, but do not solve the feature selection problem (under the conditions specified)
- Markov-Blanket-based algorithms exist that are provably correct in the sample limit for faithful distributions
- Open questions:
  - SVM formulations that provably return the solution
  - Extending the results to the non-linear case