Title: Max-Margin Classification of Data with Absent Features
by Chechik, Heitz, Elidan, Abbeel and Koller, JMLR 2008
Presented by Chunping Wang, Machine Learning Group, Duke University, July 3, 2008
Outline
- Introduction
- Standard SVM
- Max-Margin Formulation for Missing Features
- Three Algorithms
- Experimental Results
- Conclusions
Introduction (1)
- Patterns of missing features
  - existing but unknown: due to measurement noise or corruption
  - non-existing: due to the inherent properties of the instances

Example 1: Two subpopulations of instances (animals and buildings) with few overlapping features (body parts vs. architectural aspects).
Example 2: In a web-page task, one useful feature of a given page may be the most common topic of other sites that point to it; however, a particular page may have no such parent sites.
Introduction (2)
- Common methods for handling missing features (assume the features exist but their values are unknown)
  - Single imputation: zeros, mean, kNN
  - Imputation by building probabilistic generative models
- Proposed method (assumes the features are structurally absent)
  - Each data instance resides in a lower-dimensional subspace of the feature space, determined by its own existing features.
  - We try to maximize the worst-case margin of the separating hyperplane, while measuring the margin of each data instance in its own lower-dimensional subspace.
Standard SVM (1)
Binary classification:
- real-valued predictors $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$
- binary response $y_i \in \{-1, +1\}$

A classifier could be defined as $\hat{y} = \mathrm{sign}(f(x))$, based on a linear function $f(x) = w \cdot x + b$.
Parameters: weight vector $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$.
Standard SVM (2)
Functional margin for each instance: $\hat{\gamma}_i = y_i (w \cdot x_i + b)$
Geometric margin for each instance: $\gamma_i = y_i (w \cdot x_i + b) / \|w\|$
Geometric margin of a hyperplane: $\gamma = \min_i \gamma_i$

The SVM maximizes the geometric margin; by fixing the functional margin to 1, i.e., $\min_i y_i (w \cdot x_i + b) = 1$, this becomes

$$\min_{w,b,\xi} \; \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^n \xi_i \quad \text{s.t.} \quad y_i (w \cdot x_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0$$

$\xi_i$: slack variables; $C$: cost. This is a Quadratic Program (QP).
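The soft-margin SVM above can be illustrated in code. A minimal sketch, using subgradient descent on the primal objective in place of a QP solver; the data and hyperparameters are illustrative, not from the paper:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Soft-margin linear SVM via subgradient descent on the primal
    objective 0.5*||w||^2 + C * sum of hinge losses."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        viol = margins < 1                     # instances inside the margin
        grad_w = w - C * (y[viol, None] * X[viol]).sum(axis=0)
        grad_b = -C * y[viol].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# linearly separable toy data: two Gaussian clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_linear_svm(X, y)
acc = np.mean(np.sign(X @ w + b) == y)
```

In practice the QP is solved exactly (e.g., via its dual), but the subgradient view makes the hinge-loss trade-off controlled by $C$ explicit.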
Max-Margin Formulation for Missing Features (1)
A 2-D case with missing data: for an instance missing its second feature, the margin measured in its own subspace (the remaining coordinate) differs from the margin measured in the full feature space.
The margin of instances with missing features is underestimated when measured in the full feature space (e.g., after filling zeros).
Max-Margin Formulation for Missing Features (2)
Instance margin, measured in the instance's own subspace:
$$\rho_i(w, b) = \frac{y_i (w_{(i)} \cdot x_i + b)}{\|w_{(i)}\|}$$
where $w_{(i)}$ denotes $w$ restricted to the features present in instance $i$.

Optimization problem: $\max_{w,b} \min_i \rho_i(w, b)$

The norm $\|w_{(i)}\|$ is instance dependent and thus cannot be taken out of the minimization, and $\rho_i$ is non-convex in $w$. It is therefore difficult to solve this optimization problem directly.
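The instance-dependent norm, and the underestimation it prevents, can be checked numerically. A minimal numpy sketch; the convention that NaN marks an absent feature is mine, not the paper's:

```python
import numpy as np

def instance_margin(w, b, x, y):
    """Geometric margin of one instance, measured only over its
    present features (NaN marks a structurally absent feature)."""
    present = ~np.isnan(x)
    w_i = w[present]                          # w restricted to the subspace
    return y * (w_i @ x[present] + b) / np.linalg.norm(w_i)

w = np.array([3.0, 4.0])                      # ||w|| = 5
b = 0.0
x_full = np.array([1.0, 1.0])                 # all features present
x_miss = np.array([1.0, np.nan])              # second feature absent

m_full = instance_margin(w, b, x_full, 1)     # (3 + 4) / 5 = 1.4
m_sub  = instance_margin(w, b, x_miss, 1)     # 3 / 3 = 1.0
# Zero-filling instead divides by the full-space norm ||w|| = 5
# rather than ||w_(i)|| = 3, underestimating the margin:
m_zero = 1 * (3.0 + 0.0) / 5.0                # 0.6
```

The gap between `m_sub` and `m_zero` is exactly the effect the subspace margin is designed to remove.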
Three Algorithms (1)
- A convex formulation for the linearly separable case
- Introduce a lower bound $\gamma$ on the worst-case margin: require $y_i (w_{(i)} \cdot x_i + b) \ge \gamma \|w_{(i)}\|$ for all $i$.

For a given $\gamma$, this is a second-order cone program (SOCP), which is convex and can be solved efficiently. To find the optimal $\gamma$, do a bisection search over $\gamma$. Unfortunately, extending this formulation to the non-separable case is difficult.
Three Algorithms (2)
- Averaged norm: a convex approximation for the non-separable case
- To get rid of the instance dependence, replace each instance-specific norm $\|w_{(i)}\|$ with a single norm averaged over instances, so the denominator no longer varies with $i$.
- With this substitution, slack variables can be added as in the standard SVM, giving a convex problem for the non-separable case.
Three Algorithms (3)
- Geometric margin: an exact non-convex approach for the non-separable case
- Define per-instance scale factors $s_i$ relating each subspace norm $\|w_{(i)}\|$ to the full norm $\|w\|$, so each instance margin can be written in terms of $\|w\|$.
- Non-separable case: for a given set of $s_i$'s, the problem is a QP (a standard soft-margin SVM on rescaled instances).
Three Algorithms (4)
- Geometric margin: the exact non-convex approach for the non-separable case
- Pseudo-code: initialize the scale factors $s_i$; solve the QP for the current $s_i$'s; recompute the $s_i$'s from the new $w$; repeat.
- Convergence is not always guaranteed. Cross-validation is used to choose an early stopping point.
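The alternating scheme above can be sketched as follows. This is a sketch under my reading of the slide: the scale factors are taken as $s_i = \|w_{(i)}\| / \|w\|$, subgradient descent stands in for the QP solver, and NaN marks an absent feature (absent entries contribute zero to dot products, which equals using $w_{(i)}$ directly):

```python
import numpy as np

def geom_margin_svm(X, y, C=1.0, outer_iters=5, lr=0.01, epochs=200):
    """Alternate between (1) fixing per-instance scale factors
    s_i = ||w_(i)|| / ||w|| and (2) fitting a standard soft-margin
    SVM on instances rescaled by 1/s_i."""
    Xz = np.nan_to_num(X)                 # absent -> 0 in dot products
    present = ~np.isnan(X)
    n, d = X.shape
    s = np.ones(n)
    w, b = np.zeros(d), 0.0
    for _ in range(outer_iters):
        Xs = Xz / s[:, None]              # "QP" step on rescaled data
        for _ in range(epochs):
            margins = y * (Xs @ w + b)
            viol = margins < 1
            w -= lr * (w - C * (y[viol, None] * Xs[viol]).sum(axis=0))
            b -= lr * (-C * y[viol].sum())
        norm_w = np.linalg.norm(w) + 1e-12
        s = np.array([np.linalg.norm(w[p]) for p in present]) / norm_w
        s = np.clip(s, 1e-6, None)        # recompute scales from new w
    return w, b

# demo: separable clusters with a few structurally absent entries
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
X[0, 1] = np.nan
X[25, 0] = np.nan
w, b = geom_margin_svm(X, y)
acc = np.mean(np.sign(np.nan_to_num(X) @ w + b) == y)
```

As the slide notes, nothing in this loop guarantees convergence, which is why the paper stops it early via cross-validation.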
Experimental Results (1)
Compared methods:
- Zero: missing values were set to zero.
- Mean: missing values were set to the average value of the feature over all data.
- Flag: additional indicator features (flags) were added, explicitly denoting whether a feature is missing for a given instance.
- kNN: missing features were set to the mean value obtained from the K nearest-neighbor instances.
- EM: a Gaussian mixture model is learned by iterating between (1) learning a GMM of the filled data and (2) re-filling missing values using cluster means, weighted by the posterior probability that a cluster generated the sample.
- Averaged norm (avg w): the proposed approximate convex approach.
- Geometric margin (geom): the proposed exact non-convex approach.
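The single-imputation baselines fit in a few lines of numpy. A minimal sketch assuming NaN marks a missing value; the EM baseline is omitted for brevity, and the kNN variant here measures distances on zero-filled data, a simplification:

```python
import numpy as np

def impute(X, method="zero", k=3):
    """Single-imputation baselines; NaN marks a missing value.
    'flag' appends one missingness indicator column per feature."""
    X = X.astype(float)
    miss = np.isnan(X)
    if method == "zero":
        return np.where(miss, 0.0, X)
    if method == "mean":
        return np.where(miss, np.nanmean(X, axis=0), X)
    if method == "flag":
        return np.hstack([np.where(miss, 0.0, X), miss.astype(float)])
    if method == "knn":
        Xz = np.where(miss, 0.0, X)       # simplification: zero-filled distances
        out = Xz.copy()
        for i in range(len(X)):
            if not miss[i].any():
                continue
            d = np.linalg.norm(Xz - Xz[i], axis=1)
            d[i] = np.inf                 # exclude the instance itself
            nbrs = np.argsort(d)[:k]
            out[i, miss[i]] = Xz[nbrs].mean(axis=0)[miss[i]]
        return out
    raise ValueError(method)

X = np.array([[1.0, np.nan], [3.0, 2.0], [5.0, 4.0]])
```

All four share the assumption the paper drops: that a hypothetical "true" value exists to be filled in.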
Experimental Results (2)
- UCI data sets (missing at random): remove 90% of the features of each sample randomly.
- MNIST (digits 5 vs. 6): remove a patch covering 25% of the pixels, with the location of the patch uniformly sampled.
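The two removal protocols might look like this in numpy; taking the patch to be square is my assumption:

```python
import numpy as np

rng = np.random.default_rng(0)

def remove_random(x, frac=0.9):
    """UCI protocol: hide a random 90% of each sample's features."""
    x = x.astype(float).copy()
    k = int(round(frac * x.size))
    idx = rng.choice(x.size, size=k, replace=False)
    x[idx] = np.nan
    return x

def remove_patch(img, frac=0.25):
    """MNIST protocol: hide a square patch covering ~25% of the
    pixels, with the patch location sampled uniformly."""
    img = img.astype(float).copy()
    h, w = img.shape
    ph, pw = int(h * np.sqrt(frac)), int(w * np.sqrt(frac))
    r = rng.integers(0, h - ph + 1)
    c = rng.integers(0, w - pw + 1)
    img[r:r + ph, c:c + pw] = np.nan
    return img
```

Note that both protocols are missing-at-random: the mask is independent of the labels, unlike the structurally absent features in the later experiments.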
Experimental Results (3)
- Visual object recognition
- Task: to determine whether an automobile is present in a given image.

Pipeline:
1. A generative model scores the likelihood of image patches matching each of 19 landmarks, using local edge information.
2. Setting a threshold yields up to 10 candidate patches (21-by-21 pixels) per landmark.
3. PCA: keep the first 10 principal components of each patch.
4. Concatenate them into one feature vector per image (up to 1,900 features).

If the number of candidates for a given landmark is less than ten, we consider the rest to be structurally absent.
Experimental Results (4)
An example image: the best 5 candidates matched to the front-windshield landmark.
Experimental Results (5)
Experimental Results (6)
- Metabolic pathway reconstruction
A fragment of the full metabolic pathway network (arrows: chemical reactions; purple boxed names: enzymes).
Experimental Results (7)
- Three types of neighborhood relations between enzyme pairs:
  - Linear chains (ARO7, PHA2)
  - Forks (TRP2, ARO7): same input, different outputs
  - Funnels (ARO9, PHA2): same output, different inputs
- One feature vector represents an enzyme, concatenating:
  - features for its linear-chain neighbor
  - features for its fork neighbor
  - features for its funnel neighbor
- A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors; e.g., PHA2 does not have a neighbor of type fork.
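The construction of such a vector can be sketched directly; the block size and the `enzyme_features` helper are hypothetical, for illustration only:

```python
import numpy as np

BLOCK = 4  # hypothetical number of features per neighbor type

def enzyme_features(neighbors):
    """Concatenate one feature block per neighbor type; a type with
    no neighbor (e.g., PHA2 has no fork neighbor) contributes a
    structurally absent block, marked NaN rather than imputed."""
    blocks = []
    for kind in ("chain", "fork", "funnel"):
        blocks.append(neighbors.get(kind, np.full(BLOCK, np.nan)))
    return np.concatenate(blocks)

# PHA2: chain and funnel neighbors exist, fork does not
v = enzyme_features({"chain": np.ones(BLOCK), "funnel": np.zeros(BLOCK)})
```

The NaN block is exactly the "structurally absent" case the paper's classifier skips rather than fills.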
Experimental Results (8)
Task: to identify whether a candidate enzyme is in the right neighborhood.
- Data creation
  - Positive samples: from the reactions with known enzymes (in the right neighborhood).
  - Negative samples: for each positive sample, replace the true enzyme with a random impostor, and calculate the features in the resulting wrong neighborhood. The impostor was uniformly chosen from the set of other enzymes.
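The negative-sample construction is simple to sketch; `featurize` is a hypothetical callback standing in for the neighborhood feature computation described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_negative(positive, all_enzymes, featurize):
    """Build one negative sample: swap the true enzyme for an
    impostor drawn uniformly from the other enzymes, then recompute
    the features in that (now wrong) neighborhood."""
    enzyme, neighborhood = positive
    others = [e for e in all_enzymes if e != enzyme]
    impostor = others[rng.integers(len(others))]
    return featurize(impostor, neighborhood)

# usage with a stub featurizer that just records its inputs
enzymes = ["ARO7", "PHA2", "TRP2", "ARO9"]
neg = make_negative(("ARO7", "chorismate-nbhd"), enzymes,
                    lambda e, n: (e, n))
```

Drawing the impostor uniformly keeps the negatives unbiased with respect to enzyme identity.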
Experimental Results (9)
Conclusions
- The authors presented a modified SVM model for max-margin training of classifiers in the presence of missing features, where the pattern of missing features is an inherent part of the domain.
- The authors directly classified instances by skipping the non-existing features, rather than filling them with hypothetical values.
- The proposed model was competitive with a range of single-imputation approaches when tested in missing-at-random (MAR) settings.
- One variant (geometric margin) significantly outperformed the other methods in two real problems with non-existing features.