1
Feature Selection
  • Ioannis Tsamardinos
  • Machine Learning Course, 2006
  • Computer Science Dept., University of Crete
  • (Some slides borrowed from the Aliferis &
    Tsamardinos 2004 Medinfo tutorial)

2
Outline
  • What is Variable/Feature Selection
  • Filters and Wrappers
  • What is Relevancy
  • Connecting Wrappers, Filters, and Relevancy
  • SVM-Based Variable Selection
  • Markov-Blanket-Based Variable Selection

3
Back to the Fundamentals: The Feature Selection
Problem
  • Journal of Machine Learning Research special
    issue: Variable Selection refers to the problem
    of selecting input variables that are most
    predictive of a given outcome
  • Kohavi and John 1997: variable selection is the
    problem of selecting the subset of features such
    that the accuracy of the induced classifier is
    maximal
  • Problem: against which classifier is predictive
    power measured?
  • A specific one?
  • All possible classifiers?
  • What about features with different costs?

4
Why Feature Selection
  • To reduce cost or risk associated with observing
    the variables
  • To increase predictive power
  • To reduce the size of the models, so they are
    easier to understand and trust
  • To understand the domain

5
Definition of Feature Selection
  • Let M be a metric, scoring a model and a feature
    subset according to predictions and features used
  • Let A be a learning algorithm used to build the
    model
  • Feature Selection Problem: select a feature
    subset s that maximizes the score that M gives
    to the model learned by A using features s
  • Feature Selection Problem 2: select a feature
    subset s and a learner A that maximize the
    score M gives to the model learned by A using
    features s
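  • In symbols (a compact restatement of the two
    problems above; D denotes the training data and
    V the full variable set):
    Problem 1: s* = argmax_{s ⊆ V} M(A(D, s), s)
    Problem 2: (s*, A*) = argmax_{s ⊆ V, A} M(A(D, s), s)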

6
Examples
  • M is accuracy plus a preference for smaller
    models, A is an SVM
  • Find the minimal feature subset that maximizes
    the accuracy of an SVM
  • Other possibilities for M: calibrated accuracy,
    AUC, a trade-off between accuracy and the cost of
    features

7
Filters and Wrappers
8
Wrappers
  • An algorithm for solving the feature selection
    problem that is allowed to evaluate (has access
    to) learner A on different feature subsets
  • Typical wrapper:
  • Search (greedily or otherwise) the feature subset
    space
  • Evaluate each subset s using M during the search
  • Report the one that maximizes M

9
Wrappers
Say we have predictors A, B, C and classifier M.
We want to predict T given the smallest possible
subset of A, B, C, while achieving maximal
performance (accuracy):

FEATURE SET    CLASSIFIER    PERFORMANCE
A,B,C          M             98
A,B            M             98
A,C            M             77
B,C            M             56
A              M             89
B              M             90
C              M             91
(empty set)    M             85
10
An Example of a Greedy Wrapper
  • Since the search space is exponential, we have to
    use heuristic search

[Figure: greedy search over the subset lattice,
starting from the empty set (85) and expanding to
A (89), B (90), C (91), then A,B (98), A,C (77),
B,C (56), and the full set A,B,C (98). The optimal
solution is A,B (98); the subset returned by the
greedy search is marked and differs from it.]
11
Wrappers
  • A common example of heuristic search is hill
    climbing: keep adding features one at a time
    until no further improvement can be achieved
    (forward greedy wrapping; a sketch appears after
    this list)
  • Alternatively, we can start with the full set of
    predictors and keep removing features one at a
    time until no further improvement can be achieved
    (backward greedy wrapping)
  • A third alternative is to interleave the two
    phases (adding and removing), either in forward or
    backward wrapping (forward-backward wrapping)
  • Of course, other forms of search can be used, most
    notably:
  • Exhaustive search
  • Genetic Algorithms
  • Branch-and-Bound (e.g., minimize the cost of
    features, with the goal of reaching a performance
    threshold or better)
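A minimal sketch of forward greedy wrapping, assuming a
scikit-learn-style classifier and cross-validated accuracy as the
metric M; the function names (evaluate, forward_greedy_wrapper) and
the choice of 5-fold cross-validation are illustrative, not from the
slides.

```python
# Forward greedy wrapping (hill climbing): add one feature at a time,
# keeping the addition that most improves the cross-validated score,
# and stop when no addition helps. Illustrative sketch only.
from sklearn.model_selection import cross_val_score


def evaluate(clf, X, y, subset):
    """Metric M: mean 5-fold cross-validated accuracy on the given feature subset."""
    if not subset:
        return 0.0  # score of the empty feature set (baseline)
    return cross_val_score(clf, X[:, sorted(subset)], y, cv=5).mean()


def forward_greedy_wrapper(clf, X, y):
    selected, best_score = set(), evaluate(clf, X, y, set())
    remaining = set(range(X.shape[1]))
    while remaining:
        # Score every one-feature extension of the current subset.
        scores = {f: evaluate(clf, X, y, selected | {f}) for f in remaining}
        f_best = max(scores, key=scores.get)
        if scores[f_best] <= best_score:
            break  # no improvement: greedy search stops (possibly suboptimally)
        selected.add(f_best)
        remaining.remove(f_best)
        best_score = scores[f_best]
    return selected, best_score
```

Backward greedy wrapping is the mirror image: start from the full set
and drop the feature whose removal most improves the score, stopping
when every removal hurts.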

12
Example Feature Selection Methods in
Bioinformatics: GA/KNN
  • Wrapper approach whereby
  • heuristic search = Genetic Algorithm, and
  • classifier = KNN

13
Filters
  • An algorithm for solving the feature selection
    problem that is not allowed to evaluate (does not
    have access to) learner A
  • Typical filters select the feature subset
    according to certain statistical properties

14
Filter Example: Univariate Association Filtering
  • Rank features according to their association with
    the target (univariately)
  • Select the first k features

FEATURE    ASSOCIATION WITH TARGET
C          91
B          90
A          89
No threshold k on this ranking yields the optimal
solution (A,B from the wrapper example)
15
Example Feature Selection Methods in Biomedicine:
Univariate Association Filtering
  • Order all predictors according to strength of
    association with the target
  • Choose the first k predictors and feed them to
    the classifier (a sketch follows this list)
  • Various measures of association may be used: χ²,
    G², Pearson r, Fisher criterion scoring, etc.
  • How do we choose k?
  • What if we have too many variables?
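A minimal sketch of univariate association filtering, assuming
continuous features and using absolute Pearson correlation with the
target as the association measure (one of the options listed above);
the function name is illustrative.

```python
# Univariate association filtering: score each feature by |Pearson r|
# with the target, rank, and keep the top k. Illustrative sketch only.
import numpy as np


def top_k_by_association(X, y, k):
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    ranking = np.argsort(scores)[::-1]        # strongest association first
    return ranking[:k], scores[ranking[:k]]
```

Swapping in χ² or the Fisher criterion only changes the per-feature
scoring line; the ranking-and-cutoff logic stays the same, which is
also why such a filter cannot capture interactions like the A,B
example above.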

16
Example Feature Selection Methods in Biomedicine:
Recursive Feature Elimination
  • Filter algorithm where feature selection is done
    as follows (see the sketch after this list):
  • build a linear Support Vector Machine classifier
    using the current V features
  • compute the weights of all features and keep the
    best V/2
  • repeat until 1 feature is left
  • choose the feature subset that gives the best
    performance (using cross-validation)
  • give the best feature set to the classifier of
    choice
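A minimal sketch of the recursive-elimination scheme above, using a
linear SVM's absolute weights and halving the feature set at each
step; this is an illustrative re-implementation with scikit-learn,
not the original RFE code, and 5-fold cross-validation is an
assumption.

```python
# Recursive feature elimination sketch: train a linear SVM, keep the half of
# the features with the largest absolute weights, repeat down to one feature,
# and return the subset with the best cross-validated score.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC


def rfe_halving(X, y):
    features = np.arange(X.shape[1])
    best_subset, best_score = features, -np.inf
    while True:
        score = cross_val_score(LinearSVC(max_iter=10000),
                                X[:, features], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = features.copy(), score
        if len(features) == 1:
            break
        svm = LinearSVC(max_iter=10000).fit(X[:, features], y)
        weights = np.abs(svm.coef_).sum(axis=0)   # one row per class; sum handles both
        keep = np.argsort(weights)[::-1][: len(features) // 2]
        features = features[keep]
    return best_subset, best_score
```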

17
What is Relevancy
18
Relevant and Irrelevant Features
  • Large effort and debate to define relevant
    (irrelevant) features (AI Journal vol. 97)
  • Why?
  • Intuition
  • For classification (presumably) we only need
    relevant features
  • We can throw away the irrelevant features
  • The set of relevant features must be the solution
    to the feature selection problem!
  • What is Relevant must be independent of the
    classifier A used to build the final model!
  • Relevant Features teach us something about the
    domain

19
Relevancy and Filters
  • Consider a definition of relevancy
  • Construct an algorithm that attempts to identify
    the relevant features
  • It is a filtering algorithm (independent of the
    classifier used)
  • Relevancy → a family of filtering algorithms

20
The Argument of Kohavi and John 1997
  • Take a handicapped perceptron sgn(w·x) instead
    of sgn(w·x + w0)
  • Add an irrelevant variable to the data with value
    always 1
  • For some problems, the irrelevant variable is
    necessary (a small numerical illustration follows
    the figure below)
  • Filtering (presumably) returns only relevant
    features
  • Thus, filtering is suboptimal, wrapping is not

[Figure: a perceptron computing sgn(w·x) over inputs
x1, ..., x4 plus the constant input x0 = 1.]
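A small numerical illustration of the argument (my own sketch; the
data and weights are made up): a bias-free perceptron sgn(w·x) cannot
represent a threshold concept on a single positive feature, but
appending the always-1 variable restores the missing bias term.

```python
# Kohavi-John style example: without a bias, sgn(w1 * x) assigns the same
# label to every positive x, so it cannot fit a threshold at 0.5; appending
# a constant-1 "irrelevant" feature x0 recovers the biased perceptron.
import numpy as np

x = np.linspace(0.0, 1.0, 11).reshape(-1, 1)   # one informative feature
y = np.where(x[:, 0] > 0.5, 1, -1)             # threshold concept at 0.5

x_aug = np.hstack([x, np.ones_like(x)])        # append the always-1 variable x0
w = np.array([2.0, -1.1])                      # sgn(2x - 1.1) encodes the threshold
print(np.array_equal(np.sign(x_aug @ w), y))   # True: the constant feature was needed
```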
21
KJ Definitions of Relevancy
  • KJ-Strongly Relevant Variable (for target T):
  • X is KJ-strongly relevant if it is necessary for
    optimal density estimation
  • V = the set of all variables, S = V \ {X, T}
  • P(T | X, S) ≠ P(T | S)
22
KJ Definitions of Relevancy
  • KJ-Weakly Relevant Variable (for target T):
  • X is weakly relevant if it is not necessary for
    optimal density estimation, but still informative
    (i.e., there is some subset of variables
    conditioned on which it becomes informative)
  • V = the set of all variables
  • X is not strongly relevant, and
  • There exists U ⊆ V \ {X, T} such that
  • P(T | X, U) ≠ P(T | U)

23
KJ Definition of Irrelevancy
  • A variable X is KJ-irrelevant to T if it is
    neither weakly nor strongly relevant to T
  • Intuitively:
  • X provides no information about T conditioned on
    any subset of the other variables (a worked toy
    example follows)
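A tiny constructed example of the three categories (my own, not from
the slides): with fair coin flips X1, X2, X4, a duplicate X3 = X2, and
target T = X1 XOR X2, exhaustive enumeration shows X1 is strongly
relevant, X2 and X3 are each only weakly relevant (each is redundant
given the other), and X4 is KJ-irrelevant.

```python
# Enumerate the joint distribution and print P(T = 1 | chosen variables)
# for a toy problem: X1, X2, X4 ~ fair coins, X3 = X2, T = X1 XOR X2.
from itertools import product


def p_t_given(cond_vars):
    table = {}
    for x1, x2, x4 in product([0, 1], repeat=3):           # uniform joint
        values = {"X1": x1, "X2": x2, "X3": x2, "X4": x4}   # X3 duplicates X2
        t = x1 ^ x2
        key = tuple((v, values[v]) for v in cond_vars)
        n, s = table.get(key, (0, 0))
        table[key] = (n + 1, s + t)
    return {k: s / n for k, (n, s) in table.items()}


print(p_t_given(["X2", "X3", "X4"]))        # 0.5 everywhere: without X1, T is unpredictable
print(p_t_given(["X1", "X2", "X3", "X4"]))  # 0 or 1 everywhere: X1 is strongly relevant
print(p_t_given(["X2", "X3"]))              # same values as P(T | X3): X2 adds nothing given X3
print(p_t_given(["X1", "X2"]))              # ...but is informative given X1: weakly relevant
```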

24
Connecting Wrappers, Filters, and Relevancy
25
Negative Results on Relevancy and Filters
  • Kohavi and John argument: filtering returns only
    relevant variables, yet sometimes KJ-irrelevant or
    KJ-weakly relevant variables may be needed
  • True: there is no definition of relevancy,
    independent of the classifier A used to build the
    final model and of the metric M that evaluates
    the model of A, such that the relevant features
    are the solution to the feature selection problem
    [Tsamardinos, Aliferis, AIStats 2003]
  • One has to assume a family of algorithms and
    metrics to define what is relevant

26
Negative Results on Wrappers
  • Wrappers are subject to the No Free Lunch theorem
    for black-box optimization if the choice of
    metric or classifier is unconstrained
    [Tsamardinos, Aliferis, AIStats 2003]
  • ⇒ Averaged over all possible problems, each
    wrapper performs the same as random search
  • An exponential search is required to provably find
    the optimal feature subset

27
Connecting with Bayesian Networks
[Figure: a faithful Bayesian network over variables
A, B, C, D, E, F, H, I, K and the target T. The
Markov blanket of T corresponds to the KJ-strongly
relevant features; KJ-weakly relevant features are
the remaining variables with a path to T;
KJ-irrelevant features are those without a path to
T.]
28
Markov Blanket in Faithful Bayesian Networks
  • Markov Blanket of T = the KJ-strongly relevant
    features
  • The smallest set of variables, conditioned on
    which all other variables become independent of T
  • The set of parents, children, and spouses of T
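A minimal sketch of reading the Markov blanket off a known DAG
(parents, children, and the children's other parents); the
dict-of-parents encoding and the example graph are illustrative
assumptions, not the slides' network.

```python
# Markov blanket of T in a DAG = parents(T) | children(T) | spouses(T),
# where spouses are the other parents of T's children.
def markov_blanket(parents, t):
    children = {x for x, pa in parents.items() if t in pa}
    spouses = {p for c in children for p in parents[c]} - {t}
    return set(parents.get(t, set())) | children | spouses


# Example DAG: A -> T <- B, T -> C <- D  =>  MB(T) = {A, B, C, D}
dag = {"T": {"A", "B"}, "C": {"T", "D"}, "A": set(), "B": set(), "D": set()}
print(markov_blanket(dag, "T"))   # {'A', 'B', 'C', 'D'}
```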
29
Optimal Solutions to a Class of Feature Selection
Problems
  • MB(T) is the smallest subset of variables,
    conditioned on which all other variables become
    independent of T
  • The Markov Blanket of T should be all we need
  • True when:
  • The classifier can utilize the information in
    those variables (e.g., it is a universal
    approximator)
  • The metric prefers the smallest models with
    optimal calibrated accuracy (otherwise the Markov
    Blanket may include unnecessary variables)

30
SVM Based Variable Selection
31
Linear SVMs Identify Irrelevant Features
  • Theorem: both the hard- and soft-margin linear SVM
    will assign a weight of zero to irrelevant
    features (in the sample limit); an empirical
    illustration follows this list
  • Proof outline:
  • Set up the sample-limit SVM
  • Prove there is a unique solution w, b in the
    sample limit
  • Prove that in this w, b the weight of the
    irrelevant features is zero
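An empirical illustration of the theorem rather than its proof (a
sketch; the data-generating setup and the scikit-learn usage are my
assumptions): with one informative and one irrelevant feature, the
soft-margin linear SVM concentrates its weight on the informative one
as the sample grows.

```python
# With y determined only by the first feature, the learned weight on the
# irrelevant noise feature is close to zero for large samples.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n = 5000
signal = rng.normal(size=n)                 # informative feature
noise = rng.normal(size=n)                  # irrelevant feature
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)                # target depends only on the signal

w = LinearSVC(max_iter=20000).fit(X, y).coef_[0]
print(w)                                    # noise weight ~ 0, signal weight >> 0
```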

32
Linear SVMs may not Identify KJ-Strongly Relevant
Variables
  • Consider an exclusive OR (XOR) problem
  • The soft-margin linear SVM has a zero weight
    vector (a quick empirical check follows the
    figure below)
  • But both features are KJ-strongly relevant
  • A similar result is expected for non-linear SVMs

[Figure: the XOR configuration over x1 and x2, with
labels +1 and -1 on opposite corners of the unit
square.]
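A quick empirical check of the claim (a sketch, not the slides'
derivation): fitting a soft-margin linear SVM on the four XOR points
gives a weight vector that is numerically zero, even though both
features are KJ-strongly relevant.

```python
# On XOR the soft-margin linear SVM learns an (essentially) zero weight vector.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])                  # exclusive OR of the two features
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print(svm.coef_)                              # approximately [[0., 0.]]
```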
33
Linear SVMs may Retain KJ-Weakly-Relevant Features
[Figure: data plotted over X1 and X2 illustrating a
case where a linear SVM retains a KJ-weakly relevant
feature.]
34
Feature Selection with (Linear) SVMs
  • The SVM will correctly remove irrelevant features
  • The SVM may incorrectly also remove strongly
    relevant features
  • The SVM may incorrectly retain weakly relevant
    features

35
Markov Blanket Based Feature Selection
36
Optimal Feature Selection with Markov Blankets
  • MMMB and HITON algorithms [KDD 2003, AMIA 2003]
  • Can identify MB(T) among thousands of variables
  • Provably correct in the sample limit and in
    faithful distributions
  • Provably, MB(T) is the solution under the
    conditions specified
  • Excellent results in real datasets from
    biomedicine
  • The difference between the two approaches is
    conditioning vs. maximizing the margin

37
Causal Discovery
  • Recall: feature selection to understand the
    domain
  • Markov Blanket of T
  • Causal interpretation: direct causes, direct
    effects, and direct causes of direct effects
  • When:
  • Faithfulness
  • Causal Sufficiency
  • Acyclicity

38
Network: Alarm-1k (999 variables, consisting of 37
tiles of the Alarm network)
Classification algorithm: RBF SVM
Training sample size: 1000
Testing sample size: 1000
39
Target: variable 46
[Figure: the target and the members of its Markov
blanket in the tiled Alarm network.]
40
Feature selection method: HITON_MB
Classification performance: 87.7
3 true positives, 2 false positives, 0 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
41
Feature selection method: MMMB
Classification performance: 87.7
3 true positives, 2 false positives, 0 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
42
Feature selection method: RFE (linear kernel)
Classification performance: 74.6
2 true positives, 189 false positives, 1 false negative
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
43
Feature selection method: RFE (polynomial kernel)
Classification performance: 85.6
2 true positives, 33 false positives, 1 false negative
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
44
Feature selection method: BFW
Classification performance: 82.5
3 true positives, 161 false positives, 0 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
45
Network: Gene (801 variables)
Classification algorithm: See5.0 decision trees
Training sample size: 1000
Testing sample size: 1000
46
Target: variable 220
[Figure: the target and the members of its Markov
blanket in the Gene network.]
47
Feature selection method: MMMB
Classification performance with DT: 96.4
9 true positives, 4 false positives, 0 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
48
Feature selection method: HITON_MB
Classification performance with DT: 96.2
9 true positives, 4 false positives, 0 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
49
Feature selection method: BFW
Classification performance with DT: 96.3
4 true positives, 20 false positives, 5 false negatives
[Figure: selected variables around the target, marked
as true positives, false positives, and false
negatives.]
50
Is This A General Phenomenon Or A Contrived
Example?
51
Random Targets in Tiled ALARM
52
Random Targets in GENE
53
Conclusions
  • A formal definition of the feature selection
    problems allows us to draw connections between
    relevant/irrelevant variables, the Markov
    Blanket, and solutions to the feature selection
    problem
  • We need to specify the algorithm and metric to
    design algorithms that provably solve the feature
    selection problem

54
Conclusions
  • Linear SVMs (current formulations) correctly
    identify the irrelevant variables, but do not
    solve the feature selection problem (under the
    conditions specified)
  • Markov-Blanket-based algorithms exist that are
    provably correct in the sample limit for faithful
    distributions
  • Open questions:
  • SVM formulations that provably return the
    solution
  • Extending the results to the non-linear case