Title: Max-Margin Classification of Data with Absent Features
by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008
Presented by Chunping Wang, Machine Learning Group, Duke University, July 3, 2008
Outline
- Introduction
- Standard SVM
- Max-Margin Formulation for Missing Features
- Three Algorithms
- Experimental Results
- Conclusions
Introduction (1)
- Patterns of missing features
  - existing but unknown: due to measurement noise or corruption
  - non-existing: due to the inherent properties of the instances

Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts vs. architectural aspects).
Example 2: In a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; however, a particular page may have no such parent sites.
Introduction (2)
- Common methods for handling missing features (assume the features exist but their values are unknown)
  - Single imputation: zeros, mean, kNN
  - Imputation by building probabilistic generative models
- Proposed method (assumes the features are structurally absent)
  - Each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features.
  - We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.
Standard SVM (1)
Binary classification:
- real-valued predictors $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$
- binary response $y_i \in \{-1, +1\}$

A classifier could be defined as $\hat{y} = \mathrm{sign}(f(x))$, based on a linear function $f(x) = w \cdot x + b$.
Parameters: weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.
Standard SVM (2)
Functional margin for each instance: $\hat{\gamma}_i = y_i (w \cdot x_i + b)$
Geometric margin for each instance: $\gamma_i = y_i (w \cdot x_i + b) / \|w\|$
Geometric margin of a hyperplane: $\gamma = \min_i \gamma_i$

The SVM maximizes the geometric margin; by fixing the functional margin to 1, i.e., $\min_i y_i (w \cdot x_i + b) = 1$, this becomes

$$\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

$\xi_i$: slack variables; $C$: cost. This is a Quadratic Program (QP).
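The soft-margin SVM above can be illustrated in code. A minimal sketch, using subgradient descent on the primal objective in place of a QP solver; the data and hyperparameters are illustrative, not from the paper:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on the primal
    objective 0.5*||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # instances inside the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# linearly separable toy data: two Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

In practice the QP is solved exactly (e.g., via its dual), but the subgradient view makes the hinge-loss trade-off controlled by $C$ explicit.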
Max-Margin Formulation for Missing Features (1)
A 2-D case with missing data: for an instance missing its second feature, the margin measured in its own subspace (the remaining coordinate) differs from the margin measured in the full feature space.
The margin of instances with missing features is underestimated when measured in the full feature space (e.g., after filling zeros).
Max-Margin Formulation for Missing Features (2)
Instance margin, measured in the instance's own subspace:
$$\rho_i(w, b) = \frac{y_i (w_{(i)} \cdot x_i + b)}{\|w_{(i)}\|}$$
where $w_{(i)}$ denotes $w$ restricted to the features present in instance $i$.

Optimization problem: $\max_{w,b} \min_i \rho_i(w, b)$

The norm $\|w_{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization, and $\rho_i$ is non-convex in $w$. It is therefore difficult to solve this optimization problem directly.
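The instance-dependent norm, and the underestimation it prevents, can be checked numerically. A minimal numpy sketch; the convention that NaN marks an absent feature is mine, not the paper's:

```python
import numpy as np

def instance_margin(w, b, x, y):
    """Geometric margin of one instance, measured only over its
    present features (NaN marks a structurally absent feature)."""
    present = ~np.isnan(x)
    w_i = w[present]                          # w restricted to the subspace
    return y * (w_i @ x[present] + b) / np.linalg.norm(w_i)

w = np.array([3.0, 4.0])                      # ||w|| = 5
b = 0.0
x_full = np.array([1.0, 1.0])                 # all features present
x_miss = np.array([1.0, np.nan])              # second feature absent

m_full = instance_margin(w, b, x_full, 1)     # (3 + 4) / 5 = 1.4
m_sub  = instance_margin(w, b, x_miss, 1)     # 3 / 3 = 1.0
# Zero-filling instead divides by the full-space norm ||w|| = 5
# rather than ||w_(i)|| = 3, underestimating the margin:
m_zero = 1 * (3.0 + 0.0) / 5.0                # 0.6
```

The gap between `m_sub` and `m_zero` is exactly the effect the subspace margin is designed to remove.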
Three Algorithms (1)
- A convex formulation for the linearly separable case
- Introduce a lower bound $\gamma$ on the worst-case margin: require $y_i (w_{(i)} \cdot x_i + b) \ge \gamma \|w_{(i)}\|$ for all $i$.

For a given $\gamma$, this is a second-order cone program (SOCP), which is convex and can be solved efficiently. To find the optimal $\gamma$, do a bisection search over $\gamma$. Unfortunately, extending this formulation to the non-separable case is difficult.
Three Algorithms (2)
- Averaged norm: a convex approximation for the non-separable case
- To get rid of the instance dependence, replace each instance-specific norm $\|w_{(i)}\|$ with a single norm averaged over instances, so the denominator no longer varies with $i$.
- With this substitution, slack variables can be added as in the standard SVM, giving a convex problem for the non-separable case.
Three Algorithms (3)
- Geometric margin: an exact non-convex approach for the non-separable case
- Define per-instance scale factors $s_i$ relating each subspace norm $\|w_{(i)}\|$ to the full norm $\|w\|$, so each instance margin can be written in terms of $\|w\|$.
- Non-separable case: for a given set of $s_i$'s, the problem is a QP (a standard soft-margin SVM on rescaled instances).
Three Algorithms (4)
- Geometric margin: the exact non-convex approach for the non-separable case
- Pseudo-code: initialize the scale factors $s_i$; solve the QP for the current $s_i$'s; recompute the $s_i$'s from the new $w$; repeat.
- Convergence is not always guaranteed. Cross-validation is used to choose an early stopping point.
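The alternating scheme above can be sketched as follows. This is a sketch under my reading of the slide: the scale factors are taken as $s_i = \|w_{(i)}\| / \|w\|$, subgradient descent stands in for the QP solver, and NaN marks an absent feature (absent entries contribute zero to dot products, which equals using $w_{(i)}$ directly):

```python
import numpy as np

def geom_margin_svm(X, y, C=1.0, outer_iters=5, lr=0.01, epochs=200):
    """Alternate between (1) fixing per-instance scale factors
    s_i = ||w_(i)|| / ||w|| and (2) fitting a standard soft-margin
    SVM on instances rescaled by 1/s_i."""
    Xz = np.nan_to_num(X)                 # absent -> 0 in dot products
    present = ~np.isnan(X)
    n, d = X.shape
    s = np.ones(n)
    w, b = np.zeros(d), 0.0
    for _ in range(outer_iters):
        Xs = Xz / s[:, None]              # "QP" step on rescaled data
        for _ in range(epochs):
            margins = y * (Xs @ w + b)
            viol = margins < 1
            w -= lr * (w - C * (y[viol, None] * Xs[viol]).sum(axis=0))
            b -= lr * (-C * y[viol].sum())
        norm_w = np.linalg.norm(w) + 1e-12
        s = np.array([np.linalg.norm(w[p]) for p in present]) / norm_w
        s = np.clip(s, 1e-6, None)        # recompute scales from new w
    return w, b

# demo: separable clusters with a few structurally absent entries
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
X[0, 1] = np.nan
X[25, 0] = np.nan
w, b = geom_margin_svm(X, y)
acc = np.mean(np.sign(np.nan_to_num(X) @ w + b) == y)
```

As the slide notes, nothing in this loop guarantees convergence, which is why the paper stops it early via cross-validation.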
Experimental Results (1)
Compared methods:
- Zero: missing values were set to zero.
- Mean: missing values were set to the average value of the feature over all data.
- Flag: additional indicator features (flags) were added, explicitly denoting whether a feature is missing for a given instance.
- kNN: missing features were set to the mean value obtained from the K nearest-neighbor instances.
- EM: a Gaussian mixture model is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.
- Averaged norm (avg w): the proposed approximate convex approach.
- Geometric margin (geom): the proposed exact non-convex approach.
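The single-imputation baselines fit in a few lines of numpy. A minimal sketch assuming NaN marks a missing value; the EM baseline is omitted for brevity, and the kNN variant here measures distances on zero-filled data, a simplification:

```python
import numpy as np

def impute(X, method="zero", k=3):
    """Single-imputation baselines; NaN marks a missing value.
    'flag' appends one missingness indicator column per feature."""
    X = X.astype(float)
    miss = np.isnan(X)
    if method == "zero":
        return np.where(miss, 0.0, X)
    if method == "mean":
        return np.where(miss, np.nanmean(X, axis=0), X)
    if method == "flag":
        return np.hstack([np.where(miss, 0.0, X), miss.astype(float)])
    if method == "knn":
        Xz = np.where(miss, 0.0, X)       # simplification: zero-filled distances
        out = Xz.copy()
        for i in range(len(X)):
            if not miss[i].any():
                continue
            d = np.linalg.norm(Xz - Xz[i], axis=1)
            d[i] = np.inf                 # exclude the instance itself
            nbrs = np.argsort(d)[:k]
            out[i, miss[i]] = Xz[nbrs].mean(axis=0)[miss[i]]
        return out
    raise ValueError(method)

X = np.array([[1.0, np.nan], [3.0, 2.0], [5.0, 4.0]])
```

All four share the assumption the paper drops: that a hypothetical "true" value exists to be filled in.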
Experimental Results (2)
- UCI data sets (missing at random): remove 90% of the features of each sample randomly.
- MNIST (digits 5 vs. 6): remove a patch covering 25% of the pixels, with the location of the patch uniformly sampled.
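The two removal protocols might look like this in numpy; taking the patch to be square is my assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def remove_random(x, frac=0.9):
    """UCI protocol: hide a random 90% of each sample's features."""
    x = x.astype(float).copy()
    k = int(round(frac * x.size))
    idx = rng.choice(x.size, size=k, replace=False)
    x[idx] = np.nan
    return x

def remove_patch(img, frac=0.25):
    """MNIST protocol: hide a square patch covering ~25% of the
    pixels, with the patch location sampled uniformly."""
    img = img.astype(float).copy()
    h, w = img.shape
    ph, pw = int(h * np.sqrt(frac)), int(w * np.sqrt(frac))
    r = rng.integers(0, h - ph + 1)
    c = rng.integers(0, w - pw + 1)
    img[r:r + ph, c:c + pw] = np.nan
    return img
```

Note that both protocols are missing-at-random: the mask is independent of the labels, unlike the structurally absent features in the later experiments.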
Experimental Results (3)
- Visual object recognition
- Task: to determine whether an automobile is present in a given image.

Pipeline:
1. A generative model scores the likelihood of image patches matching each of 19 landmarks, using local edge information.
2. Setting a threshold yields up to 10 candidate patches (21-by-21 pixels) per landmark.
3. PCA: keep the first 10 principal components of each patch.
4. Concatenate them into one feature vector per image (up to 1,900 features).

If the number of candidates for a given landmark is less than ten, we consider the rest to be structurally absent.
Experimental Results (4)
An example image: the best 5 candidates matched to the front-windshield landmark.
Experimental Results (5)
Experimental Results (6)
- Metabolic pathway reconstruction
A fragment of the full metabolic pathway network (arrows: chemical reactions; purple boxed names: enzymes).
Experimental Results (7)
- Three types of neighborhood relations between enzyme pairs:
  - Linear chains (ARO7, PHA2)
  - Forks (TRP2, ARO7): same input, different outputs
  - Funnels (ARO9, PHA2): same output, different inputs
- One feature vector represents an enzyme, concatenating:
  - features for its linear-chain neighbor
  - features for its fork neighbor
  - features for its funnel neighbor
- A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors; e.g., PHA2 does not have a neighbor of type fork.
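The construction of such a vector can be sketched directly; the block size and the `enzyme_features` helper are hypothetical, for illustration only:

```python
import numpy as np

BLOCK = 4  # hypothetical number of features per neighbor type

def enzyme_features(neighbors):
    """Concatenate one feature block per neighbor type; a type with
    no neighbor (e.g., PHA2 has no fork neighbor) contributes a
    structurally absent block, marked NaN rather than imputed."""
    blocks = []
    for kind in ("chain", "fork", "funnel"):
        blocks.append(neighbors.get(kind, np.full(BLOCK, np.nan)))
    return np.concatenate(blocks)

# PHA2: chain and funnel neighbors exist, fork does not
v = enzyme_features({"chain": np.ones(BLOCK), "funnel": np.zeros(BLOCK)})
```

The NaN block is exactly the "structurally absent" case the paper's classifier skips rather than fills.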
Experimental Results (8)
Task: to identify whether a candidate enzyme is in the right neighborhood.
- Data creation
  - Positive samples: from the reactions with known enzymes (in the right neighborhood).
  - Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in the resulting wrong neighborhood. The impostor was uniformly chosen from the set of other enzymes.
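The negative-sample construction is simple to sketch; `featurize` is a hypothetical callback standing in for the neighborhood feature computation described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_negative(positive, all_enzymes, featurize):
    """Build one negative sample: swap the true enzyme for an
    impostor drawn uniformly from the other enzymes, then recompute
    the features in that (now wrong) neighborhood."""
    enzyme, neighborhood = positive
    others = [e for e in all_enzymes if e != enzyme]
    impostor = others[rng.integers(len(others))]
    return featurize(impostor, neighborhood)

# usage with a stub featurizer that just records its inputs
enzymes = ["ARO7", "PHA2", "TRP2", "ARO9"]
neg = make_negative(("ARO7", "chorismate-nbhd"), enzymes,
                    lambda e, n: (e, n))
```

Drawing the impostor uniformly keeps the negatives unbiased with respect to enzyme identity.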
Experimental Results (9)
Conclusions
- The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.
- The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values.
- The proposed model was competitive with a range of single-imputation approaches when tested in missing-at-random (MAR) settings.
- One variant (geometric margin) significantly outperformed the other methods in two real problems with non-existing features.