Title: Max-margin Classification of Data with Absent Features
1 Max-margin Classification of Data with Absent Features
- Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller
- Journal of Machine Learning Research, 2008
2 Outline
- Introduction
- Background knowledge
- Problem Description
- Algorithms
- Experiments
- Conclusions
3 Introduction
- In traditional supervised learning, data instances are viewed as feature vectors in a high-dimensional space.
- Why are features missing?
- noise
- undefined parts of objects
- structural absence
- etc.
4 Introduction
- How can we handle classification when features are missing?
- Classical methods:
- expectation maximization (EM)
- Markov chain Monte Carlo (MCMC)
- However, features are sometimes non-existent rather than merely unknown.
- Goal: classify without filling in the missing values.
5 Background
- Support Vector Machines (SVMs)
- Second Order Cone Programming (SOCP)
6 Support Vector Machines
- Support Vector Machines (SVMs): a supervised learning method used for classification and regression.
- They simultaneously minimize the empirical classification error and maximize the geometric margin, and are therefore also called maximum-margin classifiers.
7Support Vector Machines
- Given a set of n labeled sample x1xn, in a
feature spaces F of size d. Each sample xi has a
binary class label . - We want to give the maximum-margin hyperplane
which divides the data having yi 1 from those
having yi - 1.
8 Support Vector Machines
- Any hyperplane can be written as the set of points x satisfying $w \cdot x + b = 0$, where w is a normal vector and b determines the offset of the hyperplane from the origin along w.
- The hyperplane separates the samples into two classes, so we require $y_i (w \cdot x_i + b) > 0$ for all i.
9 Support Vector Machines
- Geometric margin: we define the margin as $\rho = \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$ and learn a classifier w by maximizing $\rho$.
- This gives the optimization problem $\max_{w,b} \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$.
- Or, equivalently, a quadratic programming (QP) problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all i.
10 Support Vector Machines
The hyperplane H3 doesn't separate the 2 classes.
H1 does, with a small margin and H2 with the
maximum margin.
11 Support Vector Machines
- Soft-margin SVMs: when the training samples are not linearly separable, we introduce slack variables $\xi_i$ into the SVM.
- The problem becomes $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$,
- where C is the trade-off between accuracy and model complexity.
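As an added illustration (not part of the original slides), here is a minimal sketch of the soft-margin primal QP using the cvxpy modeling library; the toy data, variable names, and the value of C are assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

# Toy, fully observed data: n samples, d features, labels in {-1, +1} (assumed example data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=40))

n, d = X.shape
C = 1.0  # trade-off between margin size and slack penalty (assumed value)

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)  # slack variables

# min 1/2 ||w||^2 + C * sum(xi)   s.t.   y_i (w . x_i + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```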
12 Support Vector Machines
- The dual problem of SVMs: we use the Lagrangian to solve the primal SVM, $L = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i$.
- Setting the first derivatives of L to 0 gives $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $C - \alpha_i - \mu_i = 0$.
13 Support Vector Machines
- Substituting these back into the Lagrangian, we derive the dual problem of SVMs:
- $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
14 Second Order Cone Programming
- Second-order cone programming (SOCP): a convex optimization problem of the form $\min_x f^T x$ subject to $\|A_i x + b_i\| \le c_i^T x + d_i$, $i = 1, \ldots, m$, where x is the optimization variable.
- SOCPs can be solved with off-the-shelf solvers such as MOSEK.
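A minimal sketch (mine, not from the slides) of the generic SOCP form in cvxpy; the problem data here are random placeholders, an extra norm bound keeps the toy problem bounded, and MOSEK is only used if it happens to be installed, otherwise cvxpy falls back to its default conic solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, k, p = 3, 5, 4  # number of cone constraints, variable size, rows per cone (assumed sizes)

f = rng.normal(size=k)
A = [rng.normal(size=(p, k)) for _ in range(m)]
b = [0.1 * rng.normal(size=p) for _ in range(m)]
c = [rng.normal(size=k) for _ in range(m)]
d = [2.0 + abs(rng.normal()) for _ in range(m)]

x = cp.Variable(k)
# ||A_i x + b_i||_2 <= c_i^T x + d_i  for each i
constraints = [cp.SOC(c[i] @ x + d[i], A[i] @ x + b[i]) for i in range(m)]
constraints.append(cp.norm(x, 2) <= 10)  # keep the random toy problem bounded

prob = cp.Problem(cp.Minimize(f @ x), constraints)
# prob.solve(solver=cp.MOSEK)  # if a MOSEK license is available
prob.solve()
print("optimal value:", prob.value)
```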
15 Problem Description
- Given a set of n labeled samples x_1, ..., x_n in a feature space F of dimension d, each sample x_i has a binary class label y_i in {+1, -1}.
- Let F_i denote the set of features that are valid for the i-th sample. Each sample x_i can be viewed as embedded in its relevant subspace spanned by F_i.
16 Problem Description
- In traditional SVMs, we maximize a single geometric margin $\rho$ over all instances.
- When features are missing, the margin, which measures the distance to a hyperplane, is no longer well defined.
- We therefore cannot use traditional SVMs directly in this case.
17 Problem Description
- We should treat the margin of each instance in its own relevant subspace.
- We define the instance margin of the i-th instance as $\rho_i(w) = \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$,
- where $w^{(i)}$ is the vector obtained by taking the entries of w that are relevant (valid) for x_i.
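To make the definition concrete, here is a small numpy helper (my own, not from the paper) that computes instance margins; marking missing features with NaN is my convention for the example.

```python
import numpy as np

def instance_margins(X, y, w, b):
    """Per-instance margins y_i (w^(i) . x_i + b) / ||w^(i)||,
    where missing features of x_i are marked as NaN and w^(i) keeps
    only the entries of w at the valid (non-NaN) positions of x_i."""
    margins = np.empty(len(X))
    for i, (xi, yi) in enumerate(zip(X, y)):
        valid = ~np.isnan(xi)              # relevant subspace F_i
        w_i = w[valid]                     # w restricted to F_i
        score = xi[valid] @ w_i + b
        margins[i] = yi * score / np.linalg.norm(w_i)
    return margins

# Tiny example: the second instance is missing its second feature.
X = np.array([[1.0, 2.0], [0.5, np.nan]])
y = np.array([1, -1])
print(instance_margins(X, y, w=np.array([1.0, -1.0]), b=0.0))
```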
18 Problem Description
- We take the new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$,
- and arrive at a new optimization problem for this case: $\max_{w,b} \min_i \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$.
19 Problem Description
- However, since different margin terms are normalized by different norms $\|w^{(i)}\|$, we cannot take $\|w\|$ out of the minimization.
- Moreover, each of the terms $\frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$ is non-convex in w, so the problem is difficult to solve directly.
20 Algorithms
- How can we solve this optimization problem?
- Three approaches:
- Linearly separable case: a convex formulation
- The general case:
- Average Norm
- Instance-specific Margins
21 A Convex Formulation
- In the linearly separable case, we can transform the optimization problem into a series of convex optimization problems.
- First step: by maximizing a lower bound $\rho$, we move the minimization term out of the objective function and into the constraints: $\max_{w,b,\rho} \rho$ subject to $\min_i \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|} \ge \rho$. The resulting problem is equivalent to the original one, since the bound $\rho$ can always be increased until it is perfectly tight.
22 A Convex Formulation
- Second step: replace the single constraint by multiple constraints, one per instance: $y_i (w^{(i)} \cdot x_i + b) \ge \rho \, \|w^{(i)}\|$ for all i.
23 A Convex Formulation
- This replacement is valid because the single minimum constraint holds exactly when $y_i (w^{(i)} \cdot x_i + b) \ge \rho \, \|w^{(i)}\|$ holds for all instances.
24 A Convex Formulation
- Assume first that $\rho$ is given. For any fixed value of $\rho$, each constraint $\rho \, \|w^{(i)}\| \le y_i (w^{(i)} \cdot x_i + b)$ bounds a norm by an affine function, so the problem obeys the general structure of an SOCP.
25 A Convex Formulation
- We can solve this by a bisection search over $\rho$, solving one SOCP feasibility problem per iteration.
- However, one problem is that any scaled version of a solution is also a solution: each constraint is invariant to a rescaling of w, and the null case $w = 0$, $b = 0$ is always a solution.
26 A Convex Formulation
- How can we rule out these degenerate solutions? We can add constraints:
- A non-vanishing norm $\|w\| \ge 1$ makes the problem no longer convex.
- Instead, we constrain a single entry of w at a time: we solve the SOCP twice for each entry, once with $w_k = 1$ and once with $w_k = -1$.
- This gives a total of 2d problems (sketched below).
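A hedged sketch of the separable-case procedure as I read it from these slides: an outer bisection over rho and, for each rho, 2d SOCP feasibility problems with one entry of w pinned to +1 or -1. The cvxpy formulation, data layout, tolerance, and upper bound on rho are my own assumptions.

```python
import numpy as np
import cvxpy as cp

def feasible(X0, masks, y, rho, k, sign):
    """SOCP feasibility: does some (w, b) with w_k = sign satisfy
    y_i (w . x_i + b) >= rho * ||w restricted to F_i|| for all i?
    X0 has missing entries zero-filled; masks[i] marks the valid features of x_i."""
    n, d = X0.shape
    w, b = cp.Variable(d), cp.Variable()
    cons = [w[k] == sign]
    for i in range(n):
        idx = np.flatnonzero(masks[i])     # indices of the relevant subspace F_i
        cons.append(rho * cp.norm(w[idx], 2) <= y[i] * (X0[i] @ w + b))
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status == cp.OPTIMAL

def max_margin_bisection(X0, masks, y, rho_hi=10.0, tol=1e-3):
    """Bisection over rho; rho_hi is assumed to upper-bound the optimal margin."""
    lo, hi = 0.0, rho_hi
    d = X0.shape[1]
    while hi - lo > tol:
        rho = 0.5 * (lo + hi)
        # rho is achievable if any of the 2d pinned problems is feasible
        ok = any(feasible(X0, masks, y, rho, k, s)
                 for k in range(d) for s in (+1.0, -1.0))
        lo, hi = (rho, hi) if ok else (lo, rho)
    return lo

# Toy example: second instance is missing feature 1.
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print("approximate max margin:", max_margin_bisection(X0, masks, y))
```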
27 A Convex Formulation
- The convex formulation is difficult to extend to the non-separable case:
- the slack variables are not normalized by $\|w^{(i)}\|$, so we cannot be sure that the problem remains jointly convex in w and the slacks.
28 A Convex Formulation
- In this case, the vanishing solutions $w \to 0$ are also encountered.
- We can no longer guarantee that the modified approach discussed above will coincide with the original problem.
- So the non-separable formulation is not likely to be of practical use for this case.
29 Average Norm
- We consider an alternative solution based on an approximation of the margin.
- We can approximate the different norms $\|w^{(i)}\|$ by a common term that does not depend on the instance.
30 Average Norm
- Replace each low-dimensional norm $\|w^{(i)}\|$ by the root-mean-square norm over all instances, $\sqrt{\frac{1}{n}\sum_j \|w^{(j)}\|^2}$.
- When all samples have all features, this reduces to the original SVM.
31 Average Norm
- In the case of missing features, the approximation of $\|w^{(i)}\|$ will also be good if all the norms are equal.
- When the norms are nearly equal, we expect to find nearly optimal solutions.
32 Average Norm
33 Average Norm
- The linearly separable case: $\min_{w,b} \frac{1}{n}\sum_i \|w^{(i)}\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all i.
- The non-separable case: add slack variables and the penalty $C \sum_i \xi_i$, as in soft-margin SVMs.
- These are quadratic programming problems, so they can be solved using the same techniques as standard SVMs.
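A minimal cvxpy sketch of how I read the average-norm approximation: replacing the per-instance norms by the RMS norm turns the objective into a feature-count-weighted quadratic. The soft-margin form, the value of C, and the data layout are my assumptions.

```python
import numpy as np
import cvxpy as cp

def average_norm_svm(X0, masks, y, C=1.0):
    """Soft-margin variant of the average-norm approximation (assumed form):
    minimize sum_k (n_k / n) * w_k^2 + C * sum(xi)
    s.t. y_i (w . x_i + b) >= 1 - xi_i, with missing entries zero-filled in X0.
    n_k is the number of instances for which feature k is valid."""
    n, d = X0.shape
    n_k = np.asarray(masks, dtype=float).sum(axis=0)

    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(cp.sum(cp.multiply(n_k / n, cp.square(w))) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X0 @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Toy data: second instance is missing feature 1 (zero-filled, mask False).
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print(average_norm_svm(X0, masks, y))
```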
34 Average Norm
- However, the average norm is not expected to perform well if the norms $\|w^{(i)}\|$ vary considerably.
- How can we solve the problem in that case?
- With the instance-specific margins approach.
35 Instance-specific Margins
- We can represent each of the norms $\|w^{(i)}\|$ as a scaling of the full norm $\|w\|$.
- By defining scaling coefficients $s_i = \frac{\|w^{(i)}\|}{\|w\|}$, we can rewrite the instance margin as $\rho_i(w) = \frac{y_i (w \cdot x_i + b)}{s_i \|w\|}$.
36 Instance-specific Margins
- Following the same steps as before, and treating the $s_i$ as given, we derive:
- Separable case: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge s_i$ for all i.
- Non-separable case: add slack variables $\xi_i$ and the penalty $C \sum_i \xi_i$, as in soft-margin SVMs.
37 Instance-specific Margins
- How can we solve it when the $s_i$ themselves depend on w? The full problem is not a quadratic programming problem; it is not even convex in w.
- One option is a projected gradient approach.
38 Instance-specific Margins
- Projected gradient approach: iterate between steps in the direction of the gradient of the Lagrangian
- and projections onto the constrained space.
- With the right choice of step sizes, it converges to a local minimum.
- Are there other solutions?
39 Instance-specific Margins
- If we are given a set of values $s_i$, the problem is a quadratic programming problem.
- We can use this fact to devise an iterative algorithm: for any fixed value of the $s_i$, the problem is a QP.
40 Instance-specific Margins
- For a given tuple of $s_i$'s, we solve a QP for w, and then use the resulting w to calculate new $s_i$'s.
- To solve the QP, we derive its dual for the given $s_i$'s; it has the same form as the dual problem of standard SVMs.
41 Instance-specific Margins
- The inner product $\langle x_i, x_j \rangle$ is taken only over features that are valid for both x_i and x_j.
- We discuss kernels for the modified SVMs later.
42 Instance-specific Margins
- Iterative optimization/projection algorithm: alternate between solving the dual problem of SVMs for fixed $s_i$ and recomputing the $s_i$ from the solution.
- Convergence is not always guaranteed.
- The dual solution is used to find the optimal classifier by setting $w = \sum_i \alpha_i y_i x_i$ (with missing entries of x_i treated as zero).
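A sketch of the iterative scheme as I understand it from these slides: fix the $s_i$, solve a soft-margin QP (here in the primal, which is equivalent to the dual discussed above), then recompute $s_i = \|w^{(i)}\| / \|w\|$ and repeat. The exact constraint form y_i(w . x_i + b) >= s_i(1 - xi_i), the value of C, and the fixed iteration count are my assumptions.

```python
import numpy as np
import cvxpy as cp

def iterative_absent_feature_svm(X0, masks, y, C=1.0, iters=10):
    """Alternate between a QP for (w, b) with fixed scalings s_i and
    recomputing s_i = ||w^(i)|| / ||w||. X0 has missing entries zero-filled;
    masks[i, k] is True when feature k is valid for instance i."""
    n, d = X0.shape
    s = np.ones(n)  # start from the standard-SVM scaling
    w_val, b_val = np.zeros(d), 0.0

    for _ in range(iters):
        w, b = cp.Variable(d), cp.Variable()
        xi = cp.Variable(n, nonneg=True)
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        # Assumed slack placement: y_i (w . x_i + b) >= s_i * (1 - xi_i)
        constraints = [cp.multiply(y, X0 @ w + b) >= cp.multiply(s, 1 - xi)]
        cp.Problem(objective, constraints).solve()
        w_val, b_val = w.value, b.value

        # Projection step: recompute the instance-specific scalings from w.
        full = max(np.linalg.norm(w_val), 1e-12)
        s = np.array([np.linalg.norm(w_val[m]) for m in masks]) / full
        s = np.clip(s, 1e-6, None)  # guard against empty/zero subspaces
    return w_val, b_val, s

# Toy data: second instance is missing feature 1.
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print(iterative_absent_feature_svm(X0, masks, y))
```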
43 Instance-specific Margins
- Two other approaches for optimizing this problem:
44 Instance-specific Margins
- Updating approach: minimize the same objective jointly over w and the scaling coefficients $s_i$, subject to the corresponding constraints.
- Hybrid approach: combine gradient ascent over s with a QP for w.
- These approaches did not perform as well as the iterative approach above.
45 Kernels for missing features
- Why use kernels with SVMs?
- Some common kernels:
- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j)^p$
- Polynomial (inhomogeneous): $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$
- Radial basis function: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
- Gaussian radial basis function: $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
- Sigmoid: $k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)$
(Figure: RBF kernel for an SVM.)
46 Kernels for missing features
- In the dual formulation above, the dependence on the instances is only through their inner products.
- We focus on kernels with this kind of dependence:
- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$
- Sigmoid: $k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)$
47 Kernels for missing features
- For a polynomial kernel $k(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^p$, define the modified kernel as $k'(x_i, x_j) = (\langle x_i, x_j \rangle_{F_i \cap F_j} + 1)^p$,
- with the inner product calculated only over the features that are valid for both instances.
- We define $x_i^0$ as the vector obtained from x_i by replacing invalid entries (missing features) with zeros.
48 Kernels for missing features
- We then have $\langle x_i, x_j \rangle_{F_i \cap F_j} = \langle x_i^0, x_j^0 \rangle$, simply because multiplying by zero is equivalent to skipping the missing values.
- This gives us kernels for missing features.
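A small numpy illustration (mine, not from the slides) of the zero-filling trick for the inhomogeneous polynomial kernel: zero-filling missing entries makes the plain dot product equal to the dot product restricted to the shared valid features.

```python
import numpy as np

def poly_kernel_missing(X, degree=2):
    """Inhomogeneous polynomial kernel (<x_i, x_j> + 1)^degree where the
    inner product is taken only over features valid for both instances.
    Missing features are marked as NaN (my convention); zero-filling
    implements the restriction, since missing terms contribute 0."""
    X0 = np.where(np.isnan(X), 0.0, X)   # x^0: zero-fill invalid entries
    gram = X0 @ X0.T                     # <x_i^0, x_j^0> = <x_i, x_j> over F_i ∩ F_j
    return (gram + 1.0) ** degree

X = np.array([[1.0, 2.0, np.nan],
              [0.5, np.nan, 3.0],
              [1.0, -1.0, 0.5]])
K = poly_kernel_missing(X)
print(K)
# K can be passed to any SVM solver that accepts precomputed kernels,
# e.g. sklearn.svm.SVC(kernel="precomputed").
```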
49 Experiments
- Three experiments:
- Features are missing at random.
- Visual object recognition: features are missing because they cannot be located in the image.
- Biological network completion (metabolic pathway reconstruction): the missingness pattern of the features is determined by the known structure of the network.
50 Experiments
- Five common approaches for filling in missing features (a rough sketch of the simplest ones follows below):
- Zero
- Mean
- Flag
- kNN
- EM
- These are compared against the Average Norm and Geometric Margins approaches.
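For reference, a rough numpy sketch of the three simplest fill-in baselines (zero, mean, and flag); the kNN and EM baselines and the exact flag encoding used in the paper are not reproduced here, so treat the details as assumptions.

```python
import numpy as np

def fill_zero(X):
    """Replace missing (NaN) entries with zeros."""
    return np.where(np.isnan(X), 0.0, X)

def fill_mean(X):
    """Replace missing entries with the per-feature mean of observed values."""
    col_mean = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_mean, X)

def fill_flag(X):
    """Zero-fill and append one indicator column per feature marking
    which entries were missing (assumed encoding for illustration)."""
    flags = np.isnan(X).astype(float)
    return np.hstack([fill_zero(X), flags])

X = np.array([[1.0, np.nan], [np.nan, 2.0], [0.5, 1.5]])
print(fill_flag(X))
```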
51 Missing at Random
- Features are missing at random.
- Data sets from the UCI repository.
- MNIST images.
Missing features
52 Missing at Random
- The Geometric Margins approach has good performance.
(Figures: results on data sets from UCI and on MNIST images.)
53 Visual Object Recognition
- Visual object recognition: determine whether an object from a certain class is present in a given input image.
- For example, the trunk of a car may not be found in a picture of a hatchback car.
- Such features are structurally missing.
54 Visual Object Recognition
- The object model contains a set of landmarks defining the outline of an object.
- We find several candidate matches for each landmark in a given image.
Five matches for the front windshield landmark.
55 Visual Object Recognition
- In the car model, we locate up to 10 matches (candidates) for each of the 19 landmarks.
- For each candidate, we compute the first 10 principal component (PCA) coefficients of the image patch.
56 Visual Object Recognition
- We concatenate these descriptors to form 1900 (19 x 10 x 10) features per image.
- If the number of candidates for a given landmark is less than 10, we consider the remaining features to be structurally absent.
57 Visual Object Recognition
58 Visual Object Recognition
59 Metabolic Pathway Reconstruction
- Metabolic pathway reconstruction: predicting missing enzymes in metabolic pathways.
- Instances in this task have missing features due to the structure of the biochemical network.
60 Metabolic Pathway Reconstruction
- Cells use a complex network of chemical reactions to produce their building blocks.
(Diagram: enzymes catalyze reactions that transform molecular compounds into other molecular compounds.)
61 Metabolic Pathway Reconstruction
- For many reactions, the enzyme responsible for their catalysis is unknown, making it an important computational task to predict the identity of such missing enzymes.
- How can we predict them?
- Enzymes in local network neighborhoods usually participate in related functions.
62 Metabolic Pathway Reconstruction
- Different types of network neighborhood relations between enzyme pairs lead to different relations between their properties.
- Three types:
- forks (same inputs, different outputs)
- funnels (same outputs, different inputs)
- linear chains
63 Metabolic Pathway Reconstruction
(Diagram: linear chains; forks (same inputs, different outputs); funnels (same outputs, different inputs).)
64 Metabolic Pathway Reconstruction
- Each enzyme is represented by a vector of features that measure its relatedness to each of its different neighbors, across different data types.
- A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors.
65 Metabolic Pathway Reconstruction
- Three types of data for enzyme attributes:
- a compendium of gene expression assays
- the protein-domain content of enzymes
- the cellular localization of proteins
- We use these data to measure the similarity between enzymes.
66 Metabolic Pathway Reconstruction
- Similarity measures for enzyme predictions.
67 Metabolic Pathway Reconstruction
- Positive examples: taken from the reactions with known enzymes.
- Negative examples: obtained by plugging a random impostor gene into each neighborhood.
68 Metabolic Pathway Reconstruction
69 Conclusions
- A novel method for max-margin training of classifiers in the presence of missing features.
- It classifies instances by skipping the non-existent features rather than filling them in with hypothetical values.