Title: Feature Selection
1. Feature Selection
- Ioannis Tsamardinos
- Machine Learning Course, 2006
- Computer Science Dept., University of Crete
- (Some slides borrowed from the Aliferis and Tsamardinos 2004 Medinfo tutorial)
2. Outline
- What is Variable/Feature Selection
- Filters and Wrappers
- What is Relevancy
- Connecting Wrappers, Filters, and Relevancy
- SVM-Based Variable Selection
- Markov-Blanket-Based Variable Selection
3. Back to the Fundamentals: The Feature Selection Problem
- Journal of Machine Learning Research special issue: "Variable selection refers to the problem of selecting input variables that are most predictive of a given outcome"
- Kohavi and John 1997: "variable selection is the problem of selecting the subset of features such that the accuracy of the induced classifier is maximal"
- Problem: according to which classifier is predictive power measured?
  - A specific one?
  - All possible classifiers?
- What about features with different costs?
4. Why Feature Selection?
- To reduce the cost or risk associated with observing the variables
- To increase predictive power
- To reduce the size of the models, so that they are easier to understand and trust
- To understand the domain
5. Definition of Feature Selection
- Let M be a metric, scoring a model and a feature subset according to the predictions made and the features used
- Let A be a learning algorithm used to build the model
- Feature Selection Problem: select a feature subset s that maximizes the score M gives to the model learned by A using features s
- Feature Selection Problem 2: select a feature subset s and a learner A that maximize the score M gives to the model learned by A using features s
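In symbols (a sketch of the two problems above, assuming a fixed training set D; this notation is not in the original slides):

    s* = argmax_s M( A(D, s), s )                 (Feature Selection Problem)
    (s*, A*) = argmax_{s, A} M( A(D, s), s )      (Feature Selection Problem 2)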
6. Examples
- M is accuracy with a preference for smaller models; A is an SVM
- Find the minimal feature subset that maximizes the accuracy of an SVM
- Other possibilities for M: calibrated accuracy, AUC, a trade-off between accuracy and the cost of the features
7. Filters and Wrappers
8. Wrappers
- An algorithm for solving the feature selection problem that is allowed to evaluate (has access to) the learner A on different feature subsets
- Typical wrapper:
  - search (greedily or otherwise) the space of feature subsets
  - evaluate each subset s encountered during the search using M
  - report the subset that maximizes M
9. Wrappers
Say we have predictors A, B, C and classifier M. We want to predict T given the smallest possible subset of {A, B, C}, while achieving maximal performance (accuracy).

FEATURE SET   CLASSIFIER   PERFORMANCE
A, B, C       M            98
A, B          M            98
A, C          M            77
B, C          M            56
A             M            89
B             M            90
C             M            91
(none)        M            85
10. An Example of a Greedy Wrapper
- Since the search space is exponential, we have to use heuristic search
[Figure: greedy search over the subsets of {A, B, C} and their performance ({} 85, A 89, B 90, C 91, A,B 98, A,C 77, B,C 56, A,B,C 98); the optimal solution {A, B} (98) and the subset returned by the greedy search are marked]
11. Wrappers
- A common example of heuristic search is hill climbing: keep adding features one at a time until no further improvement can be achieved (forward greedy wrapping; see the sketch after this list)
- Alternatively, we can start with the full set of predictors and keep removing features one at a time until no further improvement can be achieved (backward greedy wrapping)
- A third alternative is to interleave the two phases (adding and removing), either in forward or in backward wrapping (forward-backward wrapping)
- Of course, other forms of search can be used, most notably:
  - Exhaustive search
  - Genetic Algorithms
  - Branch-and-Bound (e.g., when features have costs and the goal is to reach a performance threshold or better)
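A minimal sketch of forward greedy wrapping, assuming NumPy arrays X, y and scikit-learn; here the metric M is cross-validated accuracy and the learner A is a linear SVM, both illustrative choices rather than anything prescribed by the slides:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_greedy_wrapper(X, y, learner=None, cv=5):
    """Hill climbing: add one feature at a time while the
    cross-validated score (the metric M) keeps improving."""
    if learner is None:
        learner = SVC(kernel="linear")          # the learner A
    selected, best_score = [], -np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        # score every one-feature extension of the current subset
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=cv).mean()
                  for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:        # no improvement: stop
            break
        best_score = scores[f_best]
        selected.append(f_best)
        remaining.remove(f_best)
    return selected, best_score

Backward greedy wrapping is the mirror image: start from all features and repeatedly drop the feature whose removal most improves the score, stopping when no removal helps.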
12. Example Feature Selection Methods in Bioinformatics: GA/KNN
- Wrapper approach whereby
  - the heuristic search is a Genetic Algorithm, and
  - the classifier is KNN
13. Filters
- An algorithm for solving the feature selection problem that is not allowed to evaluate (does not have access to) the learner A
- Typical filters select the feature subset according to certain statistical properties
14. Filter Example: Univariate Association Filtering
- Rank the features according to their (univariate) association with the target
- Select the first k features

FEATURE   ASSOCIATION WITH TARGET
C         91
B         90
A         89

No threshold k gives the optimal solution here: the best subset was {A, B}, but the ranking puts C first.
15. Example Feature Selection Methods in Biomedicine: Univariate Association Filtering
- Order all predictors according to the strength of their association with the target
- Choose the first k predictors and feed them to the classifier (see the sketch after this list)
- Various measures of association may be used: X2 (chi-square), G2, Pearson r, Fisher Criterion Scoring, etc.
- How do we choose k?
- What if we have too many variables?
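A small sketch of this filter (not from the slides), assuming continuous features in a NumPy array and using the absolute Pearson r as the association measure; both the measure and the function name are illustrative choices:

import numpy as np

def univariate_filter(X, y, k):
    """Rank features by |Pearson r| with the target and keep the top k."""
    assoc = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    ranking = np.argsort(-assoc)      # feature indices, strongest association first
    return ranking[:k]

Swapping in X2, G2, or the Fisher criterion only changes how assoc is computed; the ranking-and-cutoff logic stays the same.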
16. Example Feature Selection Methods in Biomedicine: Recursive Feature Elimination
- Filter algorithm where feature selection is done as follows (see the sketch after this list):
  - build a linear Support Vector Machine classifier using the current V features
  - compute the weights of all features and keep the best V/2
  - repeat until 1 feature is left
  - choose the feature subset that gives the best performance (using cross-validation)
  - give the best feature set to the classifier of choice
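A minimal sketch of the halving procedure above, assuming a binary target and scikit-learn's linear SVM; the function name and the choice of cross-validated accuracy as the performance measure are illustrative:

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def rfe_halving(X, y, cv=5):
    """Repeatedly fit a linear SVM, keep the half of the current features with the
    largest absolute weights, then return the candidate subset with the best CV score."""
    current = list(range(X.shape[1]))
    candidates = []
    while current:
        candidates.append(list(current))
        if len(current) == 1:
            break
        svm = LinearSVC(dual=False).fit(X[:, current], y)
        weights = np.abs(svm.coef_).ravel()         # assumes a binary target
        keep = np.argsort(-weights)[: len(current) // 2]
        current = [current[i] for i in keep]
    scores = [cross_val_score(LinearSVC(dual=False), X[:, c], y, cv=cv).mean()
              for c in candidates]
    return candidates[int(np.argmax(scores))]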
17. What is Relevancy?
18. Relevant and Irrelevant Features
- There has been a large effort and debate to define relevant (and irrelevant) features (AI Journal vol. 97)
- Why? The intuition:
  - For classification we (presumably) only need the relevant features
  - We can throw away the irrelevant features
  - The set of relevant features must be the solution to the feature selection problem!
  - What is relevant must be independent of the classifier A used to build the final model!
  - Relevant features teach us something about the domain
19. Relevancy and Filters
- Consider a definition of relevancy
- Construct an algorithm that attempts to identify the relevant features
- The result is a filtering algorithm (independent of the classifier used)
- Relevancy → a family of filtering algorithms
20. The Argument of Kohavi and John 1997
- Take a handicapped perceptron: sgn(w·x) instead of sgn(w·x + w0)
- Add an irrelevant variable to the data whose value is always 1
- For some problems this irrelevant variable is necessary: its weight can play the role of the missing bias term w0
- Filtering (presumably) returns only relevant features
- Thus, filtering is suboptimal, while wrapping is not

[Figure: a perceptron computing sgn(w·x) over the inputs x1, ..., x4 plus a constant input x0 = 1]
21. KJ Definitions of Relevancy
- KJ-Strongly Relevant Variable (for target T)
  - X is KJ-strongly relevant if it is necessary for optimal density estimation
  - Formally: let V be the set of all variables and S = V \ {X, T}; then
    P(T | X, S) ≠ P(T | S)
22. KJ Definitions of Relevancy
- KJ-Weakly Relevant Variable (for target T)
  - X is KJ-weakly relevant if it is not necessary for optimal density estimation but is still informative (i.e., there is some subset of the other variables conditioned on which it becomes informative)
  - Formally: let V be the set of all variables; X is not strongly relevant, and there exists U ⊆ V \ {X, T} such that
    P(T | X, U) ≠ P(T | U)
23. KJ Definition of Irrelevancy
- A variable X is KJ-irrelevant to T if it is neither weakly nor strongly relevant to T
- Intuitively: X provides no information about T conditioned on any subset of the other variables
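To make the definitions concrete, here is a tiny numerical sketch (not from the slides): on an exclusive-OR target, X1 alone looks uninformative, P(T | X1) ≈ P(T), yet conditioning additionally on X2 changes the distribution, so X1 is KJ-strongly relevant:

import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
t = np.logical_xor(x1, x2).astype(int)        # T = XOR(X1, X2)

# P(T=1 | X1=1) vs P(T=1): X1 on its own tells us nothing about T
print(t[x1 == 1].mean(), t.mean())                              # both ~0.5

# P(T=1 | X1=1, X2=0) vs P(T=1 | X2=0): given X2, X1 is decisive
print(t[(x1 == 1) & (x2 == 0)].mean(), t[x2 == 0].mean())       # ~1.0 vs ~0.5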
24. Connecting Wrappers, Filters, and Relevancy
25. Negative Results on Relevancy and Filters
- Kohavi and John's argument: filtering returns only relevant variables, yet sometimes KJ-irrelevant or KJ-weakly relevant variables may be needed
- True: there is no definition of relevancy, independent of the classifier A used to build the final model or of the metric M that evaluates the model of A, such that the relevant features are the solution to the feature selection problem [Tsamardinos and Aliferis, AIStats 2003]
- We have to assume a (family of) algorithm(s) and metric(s) in order to define what is relevant
26. Negative Results on Wrappers
- Wrappers are subject to the No Free Lunch theorem for black-box optimization if the choice of metric or classifier is unconstrained [Tsamardinos and Aliferis, AIStats 2003]
- => Averaged over all possible problems, every wrapper performs the same as random search
- Provably finding the optimal feature subset requires an exponential search
27. Connecting with Bayesian Networks
[Figure: a faithful Bayesian network over the variables A, B, C, D, E, F, H, I, K and the target T. The Markov Blanket of T contains the KJ-strongly relevant features; any variable with a path to T is KJ-weakly relevant; any variable without a path to T is KJ-irrelevant.]
28. Markov Blanket in Faithful Bayesian Networks
- Markov Blanket of T = the KJ-strongly relevant features
- It is the smallest set of variables, conditioned on which all other variables become independent of T
- In a faithful Bayesian network it is the set of parents, children, and spouses of T (see the sketch below)
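A minimal sketch (not from the slides) of reading MB(T) off a known DAG structure, assuming the network is given as a dict mapping each node to the list of its parents:

def markov_blanket(parents, target):
    """Return MB(target) = parents, children, and spouses of target,
    for a DAG encoded as {node: [its parents]}."""
    children = [n for n, ps in parents.items() if target in ps]
    spouses = {p for c in children for p in parents[c] if p != target}
    return set(parents[target]) | set(children) | spouses

# Toy DAG: A -> T, B -> T, T -> C, D -> C  (D is a spouse of T)
dag = {"A": [], "B": [], "D": [], "T": ["A", "B"], "C": ["T", "D"]}
print(markov_blanket(dag, "T"))   # {'A', 'B', 'C', 'D'}

In practice the structure is unknown; algorithms such as MMMB and HITON (slide 36) infer MB(T) directly from data using conditional-independence tests.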
29. OPTIMAL Solutions to a Class of Feature Selection Problems
- MB(T) is the smallest subset of variables, conditioned on which all other variables become independent of T
- The Markov Blanket of T should be all we need
- True when:
  - the classifier can utilize the information in those variables (e.g., it is a universal approximator)
  - the metric prefers the smallest model with optimal calibrated accuracy (otherwise the Markov Blanket may include unnecessary variables)
30. SVM-Based Variable Selection
31. Linear SVMs Identify Irrelevant Features
- Theorem: both the hard-margin and the soft-margin linear SVM will assign a weight of zero to irrelevant features (in the sample limit)
- Proof outline:
  - set up the sample-limit SVM
  - prove there is a unique w, b in the sample limit
  - prove that in this w, b the weight of the irrelevant features is zero
32. Linear SVMs May Not Identify KJ-Strongly Relevant Variables
- Consider an exclusive OR
- The soft-margin linear SVM has a zero weight vector (a small numerical check follows this slide)
- But both features are KJ-strongly relevant
- A similar result is expected for non-linear SVMs

[Figure: the exclusive-OR configuration of the two features x1, x2, with opposite class labels on the two diagonals]
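A quick numerical check of the claim (a sketch, not from the slides), fitting scikit-learn's soft-margin linear SVM on replicated XOR points; by symmetry the learned weight vector comes out at (or numerically very near) zero:

import numpy as np
from sklearn.svm import LinearSVC

# The four XOR corners, replicated so both classes are well represented
X = np.tile(np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float), (50, 1))
y = np.tile(np.array([-1, 1, 1, -1]), 50)

svm = LinearSVC(C=1.0, dual=False).fit(X, y)
print(svm.coef_, svm.intercept_)   # weight vector expected to be (near) zero

Even though the XOR target is a deterministic function of x1 and x2 (both KJ-strongly relevant), the linear SVM finds no usable margin in either feature and effectively discards both.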
33. Linear SVMs May Retain KJ-Weakly-Relevant Features
[Figure: a two-feature example in the (X1, X2) plane in which the linear SVM assigns nonzero weight to a KJ-weakly-relevant feature]
34. Feature Selection with (Linear) SVMs
- The SVM will correctly remove irrelevant features
- The SVM may incorrectly remove strongly relevant features as well
- The SVM may incorrectly retain weakly relevant features
35. Markov-Blanket-Based Feature Selection
36. Optimal Feature Selection with Markov Blankets
- MMMB and HITON algorithms [KDD 2003, AMIA 2003]
- Can identify MB(T) among thousands of variables
- Provably correct in the sample limit and in faithful distributions
- Provably, MB(T) is the solution under the conditions specified
- Excellent results on real datasets from biomedicine
- The difference between the two methods is conditioning vs. maximizing the margin
37. Causal Discovery
- Recall: one reason for feature selection is to understand the domain
- Markov Blanket of T
  - Causal interpretation: direct causes, direct effects, and direct causes of direct effects
- When:
  - Faithfulness
  - Causal Sufficiency
  - Acyclicity
38. Network: Alarm-1k (999 variables, consisting of 37 tiles of the ALARM network). Classification algorithm: RBF SVM. Training sample size: 1000. Testing sample size: 1000.
39. Target: variable 46
[Figure: the target variable and the members of its Markov Blanket highlighted in the network]
40. Feature Selection Method: HITON_MB
Classification performance: 87.7. 3 true positives, 2 false positives, 0 false negatives.
[Figure: the selected variables marked on the network as true positives, false positives, and false negatives]
41. Feature Selection Method: MMMB
Classification performance: 87.7. 3 true positives, 2 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
42. Feature Selection Method: RFE (Linear)
Classification performance: 74.6. 2 true positives, 189 false positives, 1 false negative.
[Figure: the selected variables marked on the network]
43. Feature Selection Method: RFE (Polynomial)
Classification performance: 85.6. 2 true positives, 33 false positives, 1 false negative.
[Figure: the selected variables marked on the network]
44. Feature Selection Method: BFW
Classification performance: 82.5. 3 true positives, 161 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
45. Network: GENE (801 variables). Classification algorithm: See5.0 decision trees. Training sample size: 1000. Testing sample size: 1000.
46. Target: variable 220
[Figure: the target variable and the members of its Markov Blanket highlighted in the network]
47. Feature Selection Method: MMMB
Classification performance with DT: 96.4. 9 true positives, 4 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
48. Feature Selection Method: HITON_MB
Classification performance with DT: 96.2. 9 true positives, 4 false positives, 0 false negatives.
[Figure: the selected variables marked on the network]
49. Feature Selection Method: BFW
Classification performance with DT: 96.3. 4 true positives, 20 false positives, 5 false negatives.
[Figure: the selected variables marked on the network]
50. Is This a General Phenomenon or a Contrived Example?
51. Random Targets in Tiled ALARM
52. Random Targets in GENE
53. Conclusions
- A formal definition of the feature selection problems allows us to draw connections between relevant/irrelevant variables, the Markov Blanket, and solutions to the feature selection problem
- We need to specify the algorithm and the metric in order to design algorithms that provably solve the feature selection problem
54. Conclusions
- Linear SVMs (in their current formulations) correctly identify the irrelevant variables, but do not solve the feature selection problem (under the conditions specified)
- Markov-Blanket-based algorithms exist that are provably correct in the sample limit for faithful distributions
- Open questions:
  - SVM formulations that provably return the solution
  - Extending the results to the non-linear case