Feature Selection in Nonlinear Kernel Classification

1
Feature Selection in Nonlinear Kernel Classification
Workshop on Optimization-Based Data Mining Techniques with Applications,
IEEE International Conference on Data Mining, Omaha, Nebraska, October 28, 2007
  • Olvi Mangasarian and Edward Wild
  • University of Wisconsin-Madison

2
Example
The data is nonlinearly separable; in general, nonlinear kernels use both x1 and x2
The best linear classifier that uses only one feature selects the feature x1
However, the data is nonlinearly separable using only the feature x2
[Figure: the two classes plotted in the (x1, x2) plane]
Feature selection in nonlinear classification is
important
3
Outline
  • Minimize the number of input space features
    selected by a nonlinear kernel classifier
  • Start with a standard 1-norm nonlinear support
    vector machine (SVM)
  • Add 0-1 diagonal matrix to suppress or keep
    features
  • Leads to a nonlinear mixed-integer program
  • Introduce algorithm to obtain a good local
    solution to the resulting mixed-integer program
  • Evaluate algorithm on two public datasets from
    the UCI repository and synthetic NDCC data

4
Support Vector Machines
Linear kernel: (K(A, B))_ij = (AB)_ij = A_i B_j = K(A_i, B_j), the product of row i of A with column j of B
Gaussian kernel with parameter μ: (K(A, B))_ij = exp(-μ ||A_i' - B_j||²)
SVMs
  • x ∈ R^n
  • The SVM is defined by parameters u and the threshold γ of the nonlinear surface
  • A contains all data points; A+ denotes the rows with label +1, A- the rows with label -1
  • e is a vector of ones

Bounding constraints: K(A+, A')u - eγ ≥ e and K(A-, A')u - eγ ≤ -e
Minimize e'y (the hinge loss, or plus function max{·, 0}) to fit the data
Minimize e's (= ||u||_1 at the solution) to reduce overfitting
The slack variable y ≥ 0 allows points to be on the wrong side of the bounding surfaces
Separating surface: K(x', A')u = γ, with bounding surfaces K(x', A')u = γ ± 1
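
As a concrete illustration, the sketch below sets up this 1-norm SVM linear program with NumPy and scipy.optimize.linprog. It is not the authors' MATLAB/CPLEX code; the tradeoff weight nu on the hinge loss and all function names are assumptions introduced here.

import numpy as np
from scipy.optimize import linprog

def gaussian_kernel(A, B, mu):
    # (K(A, B'))_ij = exp(-mu * ||A_i - B_j||^2) for row-data matrices A, B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

def svm_1norm(A, d, mu=1.0, nu=1.0):
    # Solve min nu*e'y + e's  s.t.  D(K(A,A')u - e*gamma) + y >= e,
    # -s <= u <= s, y >= 0, with variable vector z = [u, gamma, y, s].
    m = A.shape[0]
    K = gaussian_kernel(A, A, mu)
    D, e = np.diag(d), np.ones(m)
    c = np.concatenate([np.zeros(m + 1), nu * e, e])
    # D(Ku - e*gamma) + y >= e   rewritten as   -DKu + De*gamma - y <= -e
    A1 = np.hstack([-D @ K, (D @ e)[:, None], -np.eye(m), np.zeros((m, m))])
    # u - s <= 0 and -u - s <= 0 force s >= |u|, so e's = ||u||_1 at the solution
    A2 = np.hstack([np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    A3 = np.hstack([-np.eye(m), np.zeros((m, 1)), np.zeros((m, m)), -np.eye(m)])
    res = linprog(c, A_ub=np.vstack([A1, A2, A3]),
                  b_ub=np.concatenate([-e, np.zeros(2 * m)]),
                  bounds=[(None, None)] * (m + 1) + [(0, None)] * (2 * m),
                  method="highs")
    z = res.x
    return z[:m], z[m]   # u and the threshold gamma

A new point x is then classified by the sign of K(x', A')u - γ.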
5
Reduced Feature SVM
  • To suppress features, add the number of features present (e'Ee) to the objective with weight σ ≥ 0
  • As σ is increased, more features are removed from the classifier

Start with the full SVM.
Replace A with AE, where E is an n × n diagonal matrix with E_ii ∈ {1, 0}, i = 1, ..., n.
All features are present in the kernel matrix K(A, A').
If E_ii is 0, the ith feature is removed.
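
A minimal sketch of the AE substitution, reusing the hypothetical gaussian_kernel helper from the previous sketch: zeroing a column of A before forming the kernel removes that feature from K(AE, (AE)').

def reduced_kernel(A, E_diag, mu):
    # E_diag is the diagonal of the 0-1 matrix E; E_ii = 0 removes feature i
    AE = A * E_diag                 # same as the matrix product A @ diag(E_diag)
    return gaussian_kernel(AE, AE, mu)

The feature-counting term σ e'Ee added to the objective is then simply sigma * E_diag.sum().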
6
Reduced Feature SVM (RFSVM)
  • 1) Initialize the diagonal matrix E randomly
  • 2) For fixed 0-1 values E, solve the SVM linear program to obtain (u, γ, y, s)
  • 3) Fix (u, γ, s) and sweep through E repeatedly as follows: for each component of E, replace 1 by 0 and conversely, provided the change decreases the overall objective function by more than tol
  • 4) Go to (3) if a change was made in the last sweep; otherwise continue to (5)
  • 5) Solve the SVM linear program with the new matrix E; if the objective decrease is less than tol, stop, otherwise go to (3)
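
The five steps above can be sketched as follows, building on the hypothetical svm_1norm and reduced_kernel helpers from the earlier sketches; sigma is the feature selection weight σ and tol the improvement threshold. This is an illustration of the alternating scheme, not the authors' MATLAB implementation.

def rfsvm(A, d, mu, nu, sigma, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    E = rng.integers(0, 2, size=n).astype(float)        # step 1: random 0-1 diagonal

    def objective(E_diag, u, gamma):
        K = reduced_kernel(A, E_diag, mu)
        y = np.maximum(1.0 - d * (K @ u - gamma), 0.0)   # slack y = hinge loss
        return nu * y.sum() + np.abs(u).sum() + sigma * E_diag.sum()

    u, gamma = svm_1norm(A * E, d, mu, nu)               # step 2: LP for fixed E
    obj = objective(E, u, gamma)
    while True:
        changed = True
        while changed:                                   # steps 3-4: sweep through E
            changed = False
            for i in range(n):
                E_try = E.copy()
                E_try[i] = 1.0 - E_try[i]                # flip one 0-1 entry
                new = objective(E_try, u, gamma)
                if new < obj - tol:                      # keep flips that help by > tol
                    E, obj, changed = E_try, new, True
        u, gamma = svm_1norm(A * E, d, mu, nu)           # step 5: re-solve the LP
        new_obj = objective(E, u, gamma)
        if obj - new_obj < tol:                          # small decrease: stop
            return u, gamma, E
        obj = new_obj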

7
RFSVM Convergence (for tol = 0)
  • Objective function value converges
  • Each step decreases the objective
  • Objective is bounded below by 0
  • Limit of the objective function value is attained
    at any accumulation point of the sequence of
    iterates
  • Accumulation point is a local minimum solution
  • Continuous variables are optimal for the fixed
    integer variables
  • Changing any single integer variable will not
    decrease the objective

8
Experimental Results
  • Classification accuracy versus number of features
    used
  • Compare our RFSVM to Relief and RFE
    (Recursive Feature Elimination)
  • Results given on two public datasets from the UCI
    repository
  • Ability of RFSVM to handle problems with up to
    1000 features tested on synthetic NDCC datasets
  • Set feature selection parameter σ = 1

9
Relief and RFE
  • Relief
  • Kira and Rendell, 1992
  • Filter method: feature selection is a preprocessing procedure
  • Features are selected as relevant if they tend to have different feature values for points in different classes (a compact sketch of the Relief update appears after this list)
  • RFE (Recursive Feature Elimination)
  • Guyon, Weston, Barnhill, and Vapnik, 2002
  • Wrapper method: feature selection is based on classification
  • Features are selected as relevant if removing
    them causes a large change in the margin of an SVM
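
For concreteness, here is a compact sketch of the original Relief weight update (Kira and Rendell, 1992) for numeric features and ±1 labels; it is an illustration written for this transcript, not code from the RFSVM experiments, and the sampling and scoring details are simplified.

import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for idx in rng.integers(0, m, size=n_samples):
        x = X[idx]
        dists = np.abs(X - x).sum(axis=1).astype(float)    # distances to all points
        dists[idx] = np.inf                                 # exclude the point itself
        same = (y == y[idx])
        hit = X[np.where(same, dists, np.inf).argmin()]     # nearest same-class point
        miss = X[np.where(~same, dists, np.inf).argmin()]   # nearest other-class point
        w += np.abs(x - miss) - np.abs(x - hit)             # reward class-separating features
    return w   # features with the largest weights are kept as relevant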

10
Ionosphere Dataset: 351 Points in R^34
If the appropriate value of σ is selected, RFSVM can obtain higher accuracy using fewer features than SVM1
Even for feature selection parameter σ = 0, some features may be removed when removing them decreases the hinge loss
Note that accuracy decreases slightly until about 10 features remain, and then decreases more sharply as they are removed
[Figure: cross-validation accuracy versus number of features used, with the nonlinear SVM with no feature selection and the linear 1-norm SVM shown for comparison]
11
Normally Distributed Clusters on Cubes Dataset
(Thompson, 2006)
  • Points are generated from normal distributions
    centered at vertices of 1-norm cubes
  • Dataset is not linearly separable
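
The generator itself is Thompson's (linked on the Questions slide); the toy sketch below only illustrates the idea in the bullets above, sampling normally distributed clusters centered at randomly labeled cube vertices, and is not the actual NDCC generator.

import numpy as np

def toy_cube_clusters(n_points, n_features, side=1.0, std=0.2, seed=0):
    # Toy illustration only: normal clusters centered at random cube vertices,
    # each vertex assigned a +/-1 label.
    rng = np.random.default_rng(seed)
    n_vertices = 8
    vertices = side * rng.integers(0, 2, size=(n_vertices, n_features))
    labels = rng.choice([-1.0, 1.0], size=n_vertices)
    which = rng.integers(0, n_vertices, size=n_points)
    X = vertices[which] + std * rng.standard_normal((n_points, n_features))
    return X, labels[which]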

12
RFSVM vs. SVM without Feature Selection (NKSVM1)
on NDCC Data with 20 True Features and Varying
Numbers of Irrelevant Features
Each point is the average test set correctness
over 10 datasets with 200 training, 200 tuning,
and 1000 testing points
RFSVM vs. SVM without Feature Selection (NKSVM1)
on NDCC Data with 100 True Features and 1000
Irrelevant Features
When 480 irrelevant features are added, the accuracy of RFSVM is 45% higher than that of NKSVM1
13
Conclusion
  • New rigorous formulation with precise objective
    for feature selection in nonlinear SVM
    classifiers
  • Obtain a local solution to the resulting
    mixed-integer program
  • Alternate between a linear program to compute
    continuous variables and successive sweeps to
    update the integer variables
  • Efficiently learns accurate nonlinear classifiers
    with reduced numbers of features
  • Handles problems with 1000 features, 900 of which
    are irrelevant

14
Questions?
  • Websites with links to papers and talks
  • http://www.cs.wisc.edu/olvi
  • http://www.cs.wisc.edu/wildt
  • NDCC generator
  • http://www.cs.wisc.edu/dmi/svm/ndcc/

15
Running Time on the Ionosphere Dataset
  • Averages 5.7 sweeps through the integer variables
  • Averages 3.4 linear programs
  • 75% of the time consumed in objective function evaluations
  • 15% of the time consumed in solving linear programs
  • Complete experiment (1960 runs) took 1 hour
  • 3 GHz Pentium 4
  • Written in MATLAB
  • CPLEX 9.0 used to solve the linear programs
  • Gaussian kernel written in C

16
Sonar Dataset: 208 Points in R^60
[Figure: cross-validation accuracy versus number of features used]
17
Related Work
  • Approaches that use specialized kernels
  • Weston, Mukherjee, Chapelle, Pontil, Poggio, and Vapnik, 2000: structural risk minimization
  • Gold, Holub, and Sollich, 2005: Bayesian interpretation
  • Zhang, 2006: smoothing spline ANOVA kernels
  • Margin-based approach
  • Frölich and Zell, 2004: remove features if there is little change to the margin when they are removed
  • Other approaches which combine feature selection
    with basis reduction
  • Bi, Bennett, Embrechts, Breneman, and Song, 2003
  • Avidan, 2004

18
Future Work
  • Datasets with more features
  • Reduce the number of objective function
    evaluations
  • Limit the number of integer cycles
  • Other ways to update the integer variables
  • Application to regression problems
  • Automatic choice of σ

19
Algorithm
  • A global solution to the nonlinear mixed-integer program cannot be found efficiently
  • It requires solving 2^n linear programs
  • For fixed values of the integer diagonal matrix, the problem reduces to an ordinary SVM linear program
  • Solution strategy: alternate optimization of the continuous and integer variables
  • For fixed values of E, solve a linear program for (u, γ, y, s)
  • For fixed values of (u, γ, s), sweep through the components of E and make updates which decrease the objective function

20
Notation
  • Data points are represented as rows of an m × n matrix A
  • Data labels of +1 or -1 are given as elements of an m × m diagonal matrix D
  • Example: XOR, 4 points in R^2
  • Points (0, 1), (1, 0) have label +1
  • Points (0, 0), (1, 1) have label -1
  • Kernel K(A, B): R^{m×n} × R^{n×k} → R^{m×k}
  • Linear kernel: (K(A, B))_ij = (AB)_ij = A_i B_j = K(A_i, B_j)
  • Gaussian kernel with parameter μ: (K(A, B))_ij = exp(-μ ||A_i' - B_j||²)
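
The XOR example and the two kernels on this slide written out with NumPy, as a small self-contained sketch (variable names are mine); here B is passed in the n × k orientation used above, unlike the row-data helper sketched on slide 4.

import numpy as np

A = np.array([[0., 1.],    # label +1
              [1., 0.],    # label +1
              [0., 0.],    # label -1
              [1., 1.]])   # label -1
D = np.diag([1., 1., -1., -1.])    # m x m diagonal label matrix

def linear_kernel(A, B):
    # (K(A, B))_ij = (AB)_ij = A_i B_j, row i of A times column j of B
    return A @ B

def gaussian_kernel_nk(A, B, mu):
    # (K(A, B))_ij = exp(-mu * ||A_i' - B_j||^2), with B given as n x k
    d2 = ((A[:, None, :] - B.T[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

K_lin = linear_kernel(A, A.T)                 # 4 x 4 linear kernel K(A, A')
K_gau = gaussian_kernel_nk(A, A.T, mu=1.0)    # 4 x 4 Gaussian kernel K(A, A')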

21
Methodology
  • UCI Datasets
  • To reduce running time, 1/11 of each dataset was used as a tuning set to select ν and the kernel parameter μ
  • Remaining 10/11 used for 10-fold cross validation
  • The procedure was repeated 5 times for each dataset with a different random choice of tuning set each time (a sketch of this splitting protocol follows the list)
  • NDCC
  • Generate multiple datasets with 200 training, 200
    tuning, and 1000 testing points
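
A sketch of the UCI splitting protocol described above, with index bookkeeping written in NumPy for illustration (the original experiments were run in MATLAB): each repeat holds out 1/11 of the points for tuning and runs 10-fold cross validation on the rest.

import numpy as np

def uci_splits(X, n_repeats=5, seed=0):
    # Yield (tune, train, test) index arrays following the protocol above:
    # 1/11 of the points for tuning, the remaining 10/11 for 10-fold CV,
    # repeated n_repeats times with a different random tuning set each time.
    m = X.shape[0]
    for r in range(n_repeats):
        rng = np.random.default_rng(seed + r)
        perm = rng.permutation(m)
        tune, rest = perm[: m // 11], perm[m // 11:]
        for fold in np.array_split(rest, 10):
            train = np.setdiff1d(rest, fold)
            yield tune, train, fold

Five repeats of 10 folds give 50 train/test evaluations per parameter setting.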