Title: Max-margin Classification of Data with Absent Features
1 Max-margin Classification of Data with Absent Features
- Gal Chechik, Geremy Heitz, Gal Elidan, Pieter Abbeel, and Daphne Koller
- Journal of Machine Learning Research, 2008
2 Outline
- Introduction
- Background knowledge
- Problem Description
- Algorithms
- Experiments
- Conclusions
3 Introduction
- In traditional supervised learning, data instances are viewed as feature vectors in a high-dimensional space.
- Why are features missing?
- noise
- undefined parts of objects
- structural absence
- etc.
4 Introduction
- How can we handle classification when features are missing?
- Classical methods:
- expectation maximization (EM)
- Markov chain Monte Carlo (MCMC)
- However, features are sometimes non-existent rather than merely unknown.
- Goal: classify without filling in the missing values.
5 Background
- Support Vector Machines (SVMs)
- Second Order Cone Programming (SOCP)
6 Support Vector Machines
- Support Vector Machines (SVMs): a supervised learning method used for classification and regression.
- They simultaneously minimize the empirical classification error and maximize the geometric margin, and are therefore also called maximum-margin classifiers.
7Support Vector Machines
- Given a set of n labeled sample x1xn, in a
feature spaces F of size d. Each sample xi has a
binary class label . - We want to give the maximum-margin hyperplane
which divides the data having yi 1 from those
having yi - 1.
8 Support Vector Machines
- Any hyperplane can be written as the set of points x satisfying $w \cdot x + b = 0$, where w is a normal vector and b determines the offset of the hyperplane from the origin along w.
- The hyperplane separates the samples into two classes, so we require $y_i (w \cdot x_i + b) > 0$ for all i.
9 Support Vector Machines
- Geometric margin: we define the margin as $\rho = \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$ and learn a classifier w by maximizing $\rho$.
- This gives the optimization problem $\max_{w,b} \min_i \frac{y_i (w \cdot x_i + b)}{\|w\|}$.
- Or, equivalently, a quadratic programming (QP) problem: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all i.
10 Support Vector Machines
The hyperplane H3 doesn't separate the 2 classes.
H1 does, with a small margin and H2 with the
maximum margin.
11 Support Vector Machines
- Soft-margin SVMs: when the training samples are not linearly separable, we introduce slack variables $\xi_i$ into the SVM.
- The problem becomes $\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \sum_i \xi_i$ subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$, $\xi_i \ge 0$,
- where C is the trade-off between accuracy and model complexity.
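As an added illustration (not part of the original slides), here is a minimal sketch of the soft-margin primal QP using the cvxpy modeling library; the toy data, variable names, and the value of C are assumptions made for the example.

```python
import numpy as np
import cvxpy as cp

# Toy, fully observed data: n samples, d features, labels in {-1, +1} (assumed example data).
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=40))

n, d = X.shape
C = 1.0  # trade-off between margin size and slack penalty (assumed value)

w = cp.Variable(d)
b = cp.Variable()
xi = cp.Variable(n, nonneg=True)  # slack variables

# min 1/2 ||w||^2 + C * sum(xi)   s.t.   y_i (w . x_i + b) >= 1 - xi_i
objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```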
12 Support Vector Machines
- The dual problem of SVMs: we use the Lagrangian to solve the primal SVM, $L = \frac{1}{2}\|w\|^2 + C \sum_i \xi_i - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 + \xi_i \right] - \sum_i \mu_i \xi_i$.
- Setting the first derivatives of L to 0 gives $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$, and $C - \alpha_i - \mu_i = 0$.
13 Support Vector Machines
- Substituting these back into the Lagrangian, we derive the dual problem of SVMs:
- $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$ subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y_i = 0$.
14 Second Order Cone Programming
- Second-order cone programming (SOCP): a convex optimization problem of the form $\min_x f^T x$ subject to $\|A_i x + b_i\| \le c_i^T x + d_i$, $i = 1, \ldots, m$, where x is the optimization variable.
- SOCPs can be solved with off-the-shelf solvers such as MOSEK.
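A minimal sketch (mine, not from the slides) of the generic SOCP form in cvxpy; the problem data here are random placeholders, an extra norm bound keeps the toy problem bounded, and MOSEK is only used if it happens to be installed, otherwise cvxpy falls back to its default conic solver.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, k, p = 3, 5, 4  # number of cone constraints, variable size, rows per cone (assumed sizes)

f = rng.normal(size=k)
A = [rng.normal(size=(p, k)) for _ in range(m)]
b = [0.1 * rng.normal(size=p) for _ in range(m)]
c = [rng.normal(size=k) for _ in range(m)]
d = [2.0 + abs(rng.normal()) for _ in range(m)]

x = cp.Variable(k)
# ||A_i x + b_i||_2 <= c_i^T x + d_i  for each i
constraints = [cp.SOC(c[i] @ x + d[i], A[i] @ x + b[i]) for i in range(m)]
constraints.append(cp.norm(x, 2) <= 10)  # keep the random toy problem bounded

prob = cp.Problem(cp.Minimize(f @ x), constraints)
# prob.solve(solver=cp.MOSEK)  # if a MOSEK license is available
prob.solve()
print("optimal value:", prob.value)
```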
15 Problem Description
- Given a set of n labeled samples x_1, ..., x_n in a feature space F of dimension d, each sample x_i has a binary class label y_i in {+1, -1}.
- Let F_i denote the set of features that are valid for the i-th sample. Each sample x_i can be viewed as embedded in its relevant subspace spanned by F_i.
16 Problem Description
- In traditional SVMs, we maximize a single geometric margin $\rho$ over all instances.
- When features are missing, the margin, which measures the distance to a hyperplane, is no longer well defined.
- We therefore cannot use traditional SVMs directly in this case.
17 Problem Description
- We should treat the margin of each instance in its own relevant subspace.
- We define the instance margin of the i-th instance as $\rho_i(w) = \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$,
- where $w^{(i)}$ is the vector obtained by taking the entries of w that are relevant (valid) for x_i.
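To make the definition concrete, here is a small numpy helper (my own, not from the paper) that computes instance margins; marking missing features with NaN is my convention for the example.

```python
import numpy as np

def instance_margins(X, y, w, b):
    """Per-instance margins y_i (w^(i) . x_i + b) / ||w^(i)||,
    where missing features of x_i are marked as NaN and w^(i) keeps
    only the entries of w at the valid (non-NaN) positions of x_i."""
    margins = np.empty(len(X))
    for i, (xi, yi) in enumerate(zip(X, y)):
        valid = ~np.isnan(xi)              # relevant subspace F_i
        w_i = w[valid]                     # w restricted to F_i
        score = xi[valid] @ w_i + b
        margins[i] = yi * score / np.linalg.norm(w_i)
    return margins

# Tiny example: the second instance is missing its second feature.
X = np.array([[1.0, 2.0], [0.5, np.nan]])
y = np.array([1, -1])
print(instance_margins(X, y, w=np.array([1.0, -1.0]), b=0.0))
```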
18 Problem Description
- We take the new geometric margin to be the minimum over all instance margins, $\rho(w) = \min_i \rho_i(w)$,
- and arrive at a new optimization problem for this case: $\max_{w,b} \min_i \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$.
19 Problem Description
- However, since different margin terms are normalized by different norms $\|w^{(i)}\|$, we cannot take $\|w\|$ out of the minimization.
- Moreover, each of the terms $\frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|}$ is non-convex in w, so the problem is difficult to solve directly.
20 Algorithms
- How can we solve this optimization problem?
- Three approaches:
- Linearly separable case: a convex formulation
- The general case:
- Average Norm
- Instance-specific Margins
21 A Convex Formulation
- In the linearly separable case, we can transform the optimization problem into a series of convex optimization problems.
- First step: by maximizing a lower bound $\rho$, we move the minimization term out of the objective function and into the constraints: $\max_{w,b,\rho} \rho$ subject to $\min_i \frac{y_i (w^{(i)} \cdot x_i + b)}{\|w^{(i)}\|} \ge \rho$. The resulting problem is equivalent to the original one, since the bound $\rho$ can always be increased until it is perfectly tight.
22 A Convex Formulation
- Second step: replace the single constraint by multiple constraints, one per instance: $y_i (w^{(i)} \cdot x_i + b) \ge \rho \, \|w^{(i)}\|$ for all i.
23 A Convex Formulation
- This replacement is valid because the single minimum constraint holds exactly when $y_i (w^{(i)} \cdot x_i + b) \ge \rho \, \|w^{(i)}\|$ holds for all instances.
24 A Convex Formulation
- Assume first that $\rho$ is given. For any fixed value of $\rho$, each constraint $\rho \, \|w^{(i)}\| \le y_i (w^{(i)} \cdot x_i + b)$ bounds a norm by an affine function, so the problem obeys the general structure of an SOCP.
25 A Convex Formulation
- We can solve this by a bisection search over $\rho$, solving one SOCP feasibility problem per iteration.
- However, one problem is that any scaled version of a solution is also a solution: each constraint is invariant to a rescaling of w, and the null case $w = 0$, $b = 0$ is always a solution.
26 A Convex Formulation
- How can we rule out these degenerate solutions? We can add constraints:
- A non-vanishing norm $\|w\| \ge 1$ makes the problem no longer convex.
- Instead, we constrain a single entry of w at a time: we solve the SOCP twice for each entry, once with $w_k = 1$ and once with $w_k = -1$.
- This gives a total of 2d problems (sketched below).
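A hedged sketch of the separable-case procedure as I read it from these slides: an outer bisection over rho and, for each rho, 2d SOCP feasibility problems with one entry of w pinned to +1 or -1. The cvxpy formulation, data layout, tolerance, and upper bound on rho are my own assumptions.

```python
import numpy as np
import cvxpy as cp

def feasible(X0, masks, y, rho, k, sign):
    """SOCP feasibility: does some (w, b) with w_k = sign satisfy
    y_i (w . x_i + b) >= rho * ||w restricted to F_i|| for all i?
    X0 has missing entries zero-filled; masks[i] marks the valid features of x_i."""
    n, d = X0.shape
    w, b = cp.Variable(d), cp.Variable()
    cons = [w[k] == sign]
    for i in range(n):
        idx = np.flatnonzero(masks[i])     # indices of the relevant subspace F_i
        cons.append(rho * cp.norm(w[idx], 2) <= y[i] * (X0[i] @ w + b))
    prob = cp.Problem(cp.Minimize(0), cons)
    prob.solve()
    return prob.status == cp.OPTIMAL

def max_margin_bisection(X0, masks, y, rho_hi=10.0, tol=1e-3):
    """Bisection over rho; rho_hi is assumed to upper-bound the optimal margin."""
    lo, hi = 0.0, rho_hi
    d = X0.shape[1]
    while hi - lo > tol:
        rho = 0.5 * (lo + hi)
        # rho is achievable if any of the 2d pinned problems is feasible
        ok = any(feasible(X0, masks, y, rho, k, s)
                 for k in range(d) for s in (+1.0, -1.0))
        lo, hi = (rho, hi) if ok else (lo, rho)
    return lo

# Toy example: second instance is missing feature 1.
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print("approximate max margin:", max_margin_bisection(X0, masks, y))
```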
27 A Convex Formulation
- The convex formulation is difficult to extend to the non-separable case:
- the slack variables are not normalized by $\|w^{(i)}\|$, so we cannot be sure that the problem remains jointly convex in w and the slacks.
28 A Convex Formulation
- In this case, the vanishing solutions $w \to 0$ are also encountered.
- We can no longer guarantee that the modified approach discussed above will coincide with the original problem.
- So the non-separable formulation is not likely to be of practical use for this case.
29 Average Norm
- We consider an alternative solution based on an approximation of the margin.
- We can approximate the different norms $\|w^{(i)}\|$ by a common term that does not depend on the instance.
30 Average Norm
- Replace each low-dimensional norm $\|w^{(i)}\|$ by the root-mean-square norm over all instances, $\sqrt{\frac{1}{n}\sum_j \|w^{(j)}\|^2}$.
- When all samples have all features, this reduces to the original SVM.
31 Average Norm
- In the case of missing features, the approximation of $\|w^{(i)}\|$ will also be good if all the norms are equal.
- When the norms are nearly equal, we expect to find nearly optimal solutions.
32 Average Norm
33 Average Norm
- The linearly separable case: $\min_{w,b} \frac{1}{n}\sum_i \|w^{(i)}\|^2$ subject to $y_i (w \cdot x_i + b) \ge 1$ for all i.
- The non-separable case: add slack variables and the penalty $C \sum_i \xi_i$, as in soft-margin SVMs.
- These are quadratic programming problems, so they can be solved using the same techniques as standard SVMs.
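A minimal cvxpy sketch of how I read the average-norm approximation: replacing the per-instance norms by the RMS norm turns the objective into a feature-count-weighted quadratic. The soft-margin form, the value of C, and the data layout are my assumptions.

```python
import numpy as np
import cvxpy as cp

def average_norm_svm(X0, masks, y, C=1.0):
    """Soft-margin variant of the average-norm approximation (assumed form):
    minimize sum_k (n_k / n) * w_k^2 + C * sum(xi)
    s.t. y_i (w . x_i + b) >= 1 - xi_i, with missing entries zero-filled in X0.
    n_k is the number of instances for which feature k is valid."""
    n, d = X0.shape
    n_k = np.asarray(masks, dtype=float).sum(axis=0)

    w, b = cp.Variable(d), cp.Variable()
    xi = cp.Variable(n, nonneg=True)
    objective = cp.Minimize(cp.sum(cp.multiply(n_k / n, cp.square(w))) + C * cp.sum(xi))
    constraints = [cp.multiply(y, X0 @ w + b) >= 1 - xi]
    cp.Problem(objective, constraints).solve()
    return w.value, b.value

# Toy data: second instance is missing feature 1 (zero-filled, mask False).
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print(average_norm_svm(X0, masks, y))
```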
34 Average Norm
- However, the average norm is not expected to perform well if the norms $\|w^{(i)}\|$ vary considerably.
- How can we solve the problem in that case?
- With the instance-specific margins approach.
35 Instance-specific Margins
- We can represent each of the norms $\|w^{(i)}\|$ as a scaling of the full norm $\|w\|$.
- By defining scaling coefficients $s_i = \frac{\|w^{(i)}\|}{\|w\|}$, we can rewrite the instance margin as $\rho_i(w) = \frac{y_i (w \cdot x_i + b)}{s_i \|w\|}$.
36 Instance-specific Margins
- Following the same steps as before, and treating the $s_i$ as given, we derive:
- Separable case: $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to $y_i (w \cdot x_i + b) \ge s_i$ for all i.
- Non-separable case: add slack variables $\xi_i$ and the penalty $C \sum_i \xi_i$, as in soft-margin SVMs.
37 Instance-specific Margins
- How can we solve it when the $s_i$ themselves depend on w? The full problem is not a quadratic programming problem; it is not even convex in w.
- One option is a projected gradient approach.
38 Instance-specific Margins
- Projected gradient approach: iterate between steps in the direction of the gradient of the Lagrangian
- and projections onto the constrained space.
- With the right choice of step sizes, it converges to a local minimum.
- Are there other solutions?
39 Instance-specific Margins
- If we are given a set of values $s_i$, the problem is a quadratic programming problem.
- We can use this fact to devise an iterative algorithm: for any fixed value of the $s_i$, the problem is a QP.
40 Instance-specific Margins
- For a given tuple of $s_i$'s, we solve a QP for w, and then use the resulting w to calculate new $s_i$'s.
- To solve the QP, we derive its dual for the given $s_i$'s; it has the same form as the dual problem of standard SVMs.
41 Instance-specific Margins
- The inner product $\langle x_i, x_j \rangle$ is taken only over features that are valid for both x_i and x_j.
- We discuss kernels for the modified SVMs later.
42 Instance-specific Margins
- Iterative optimization/projection algorithm: alternate between solving the dual problem of SVMs for fixed $s_i$ and recomputing the $s_i$ from the solution.
- Convergence is not always guaranteed.
- The dual solution is used to find the optimal classifier by setting $w = \sum_i \alpha_i y_i x_i$ (with missing entries of x_i treated as zero).
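A sketch of the iterative scheme as I understand it from these slides: fix the $s_i$, solve a soft-margin QP (here in the primal, which is equivalent to the dual discussed above), then recompute $s_i = \|w^{(i)}\| / \|w\|$ and repeat. The exact constraint form y_i(w . x_i + b) >= s_i(1 - xi_i), the value of C, and the fixed iteration count are my assumptions.

```python
import numpy as np
import cvxpy as cp

def iterative_absent_feature_svm(X0, masks, y, C=1.0, iters=10):
    """Alternate between a QP for (w, b) with fixed scalings s_i and
    recomputing s_i = ||w^(i)|| / ||w||. X0 has missing entries zero-filled;
    masks[i, k] is True when feature k is valid for instance i."""
    n, d = X0.shape
    s = np.ones(n)  # start from the standard-SVM scaling
    w_val, b_val = np.zeros(d), 0.0

    for _ in range(iters):
        w, b = cp.Variable(d), cp.Variable()
        xi = cp.Variable(n, nonneg=True)
        objective = cp.Minimize(0.5 * cp.sum_squares(w) + C * cp.sum(xi))
        # Assumed slack placement: y_i (w . x_i + b) >= s_i * (1 - xi_i)
        constraints = [cp.multiply(y, X0 @ w + b) >= cp.multiply(s, 1 - xi)]
        cp.Problem(objective, constraints).solve()
        w_val, b_val = w.value, b.value

        # Projection step: recompute the instance-specific scalings from w.
        full = max(np.linalg.norm(w_val), 1e-12)
        s = np.array([np.linalg.norm(w_val[m]) for m in masks]) / full
        s = np.clip(s, 1e-6, None)  # guard against empty/zero subspaces
    return w_val, b_val, s

# Toy data: second instance is missing feature 1.
X0 = np.array([[1.0, 1.0], [-1.0, 0.0], [0.5, -1.0]])
masks = np.array([[True, True], [True, False], [True, True]])
y = np.array([1, -1, 1])
print(iterative_absent_feature_svm(X0, masks, y))
```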
43 Instance-specific Margins
- Two other approaches for optimizing this problem:
44 Instance-specific Margins
- Updating approach: minimize the same objective jointly over w and the scaling coefficients $s_i$, subject to the corresponding constraints.
- Hybrid approach: combine gradient ascent over s with a QP for w.
- These approaches did not perform as well as the iterative approach above.
45 Kernels for missing features
- Why use kernels with SVMs?
- Some common kernels:
- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j)^p$
- Polynomial (inhomogeneous): $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$
- Radial basis function: $k(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
- Gaussian radial basis function: $k(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
- Sigmoid: $k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)$
(Figure: RBF kernel for an SVM.)
46 Kernels for missing features
- In the dual formulation above, the dependence on the instances is only through their inner products.
- We focus on kernels with this kind of dependence:
- Polynomial: $k(x_i, x_j) = (x_i \cdot x_j + 1)^p$
- Sigmoid: $k(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c)$
47 Kernels for missing features
- For a polynomial kernel $k(x_i, x_j) = (\langle x_i, x_j \rangle + 1)^p$, define the modified kernel as $k'(x_i, x_j) = (\langle x_i, x_j \rangle_{F_i \cap F_j} + 1)^p$,
- with the inner product calculated only over the features that are valid for both instances.
- We define $x_i^0$ as the vector obtained from x_i by replacing invalid entries (missing features) with zeros.
48 Kernels for missing features
- We then have $\langle x_i, x_j \rangle_{F_i \cap F_j} = \langle x_i^0, x_j^0 \rangle$, simply because multiplying by zero is equivalent to skipping the missing values.
- This gives us kernels for missing features.
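A small numpy illustration (mine, not from the slides) of the zero-filling trick for the inhomogeneous polynomial kernel: zero-filling missing entries makes the plain dot product equal to the dot product restricted to the shared valid features.

```python
import numpy as np

def poly_kernel_missing(X, degree=2):
    """Inhomogeneous polynomial kernel (<x_i, x_j> + 1)^degree where the
    inner product is taken only over features valid for both instances.
    Missing features are marked as NaN (my convention); zero-filling
    implements the restriction, since missing terms contribute 0."""
    X0 = np.where(np.isnan(X), 0.0, X)   # x^0: zero-fill invalid entries
    gram = X0 @ X0.T                     # <x_i^0, x_j^0> = <x_i, x_j> over F_i ∩ F_j
    return (gram + 1.0) ** degree

X = np.array([[1.0, 2.0, np.nan],
              [0.5, np.nan, 3.0],
              [1.0, -1.0, 0.5]])
K = poly_kernel_missing(X)
print(K)
# K can be passed to any SVM solver that accepts precomputed kernels,
# e.g. sklearn.svm.SVC(kernel="precomputed").
```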
49 Experiments
- Three experiments:
- Features are missing at random.
- Visual object recognition: features are missing because they cannot be located in the image.
- Biological network completion (metabolic pathway reconstruction): the missingness pattern of the features is determined by the known structure of the network.
50 Experiments
- Five common approaches for filling in missing features (a rough sketch of the simplest ones follows below):
- Zero
- Mean
- Flag
- kNN
- EM
- These are compared against the Average Norm and Geometric Margins approaches.
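For reference, a rough numpy sketch of the three simplest fill-in baselines (zero, mean, and flag); the kNN and EM baselines and the exact flag encoding used in the paper are not reproduced here, so treat the details as assumptions.

```python
import numpy as np

def fill_zero(X):
    """Replace missing (NaN) entries with zeros."""
    return np.where(np.isnan(X), 0.0, X)

def fill_mean(X):
    """Replace missing entries with the per-feature mean of observed values."""
    col_mean = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_mean, X)

def fill_flag(X):
    """Zero-fill and append one indicator column per feature marking
    which entries were missing (assumed encoding for illustration)."""
    flags = np.isnan(X).astype(float)
    return np.hstack([fill_zero(X), flags])

X = np.array([[1.0, np.nan], [np.nan, 2.0], [0.5, 1.5]])
print(fill_flag(X))
```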
51 Missing at Random
- Features are missing at random.
- Data sets from the UCI repository.
- MNIST images.
Missing features
52 Missing at Random
- The Geometric Margins approach has good performance.
(Figures: results on data sets from UCI and on MNIST images.)
53 Visual Object Recognition
- Visual object recognition: determine whether an object from a certain class is present in a given input image.
- For example, the trunk of a car may not be found in a picture of a hatchback car.
- Such features are structurally missing.
54 Visual Object Recognition
- The object model contains a set of landmarks defining the outline of an object.
- We find several candidate matches for each landmark in a given image.
Five matches for the front windshield landmark.
55 Visual Object Recognition
- In the car model, we locate up to 10 matches (candidates) for each of the 19 landmarks.
- For each candidate, we compute the first 10 principal component (PCA) coefficients of the image patch.
56 Visual Object Recognition
- We concatenate these descriptors to form 1900 (19 x 10 x 10) features per image.
- If the number of candidates for a given landmark is less than 10, we consider the remaining features to be structurally absent.
57 Visual Object Recognition
58 Visual Object Recognition
59 Metabolic Pathway Reconstruction
- Metabolic pathway reconstruction: predicting missing enzymes in metabolic pathways.
- Instances in this task have missing features due to the structure of the biochemical network.
60 Metabolic Pathway Reconstruction
- Cells use a complex network of chemical reactions to produce their building blocks.
(Diagram: enzymes catalyze reactions that transform molecular compounds into other molecular compounds.)
61 Metabolic Pathway Reconstruction
- For many reactions, the enzyme responsible for their catalysis is unknown, making it an important computational task to predict the identity of such missing enzymes.
- How can we predict them?
- Enzymes in local network neighborhoods usually participate in related functions.
62 Metabolic Pathway Reconstruction
- Different types of network neighborhood relations between enzyme pairs lead to different relations between their properties.
- Three types:
- forks (same inputs, different outputs)
- funnels (same outputs, different inputs)
- linear chains
63 Metabolic Pathway Reconstruction
(Diagram: linear chains; forks (same inputs, different outputs); funnels (same outputs, different inputs).)
64 Metabolic Pathway Reconstruction
- Each enzyme is represented by a vector of features that measure its relatedness to each of its different neighbors, across different data types.
- A feature vector will have structurally missing entries if the enzyme does not have all types of neighbors.
65 Metabolic Pathway Reconstruction
- Three types of data for enzyme attributes:
- a compendium of gene expression assays
- the protein-domain content of enzymes
- the cellular localization of proteins
- We use these data to measure the similarity between enzymes.
66 Metabolic Pathway Reconstruction
- Similarity measures for enzyme predictions.
67 Metabolic Pathway Reconstruction
- Positive examples: taken from the reactions with known enzymes.
- Negative examples: obtained by plugging a random impostor gene into each neighborhood.
68 Metabolic Pathway Reconstruction
69 Conclusions
- A novel method for max-margin training of classifiers in the presence of missing features.
- It classifies instances by skipping the non-existent features rather than filling them in with hypothetical values.