Title: Kernel Matching Reduction Algorithms
1  Kernel Matching Reduction Algorithms for Classification
- Jianwu Li and Xiaocheng Deng
- Beijing Institute of Technology
2  Introduction
- Kernel-based pattern classification techniques
- Support vector machines (SVM)
- Kernel linear discriminant analysis (KLDA)
- Kernel Perceptrons
3  Introduction
- Support vector machines (SVM)
- Structural risk minimization (SRM)
- Maximum margin classification
- Quadratic optimization problem
- Kernel trick
4  Introduction
- Support vector machines (SVM)
- Support vectors (SV)
- Sparse solutions
5  Introduction
- Kernel matching pursuit (KMP)
- KMP appends functions sequentially to an initially empty basis, from a redundant dictionary of functions, to approximate a classification function under a certain loss criterion.
- KMP can produce much sparser models than SVMs.
6  Introduction
- Kernel Matching Reduction Algorithms (KMRAs)
- Inspired by KMP and SVMs, we propose kernel matching reduction algorithms (KMRAs).
- Different from KMP, the KMRAs proposed in this paper perform a reverse procedure: instead of appending functions, they remove them.
7  Introduction
- Kernel Matching Reduction Algorithms (KMRAs)
- Firstly, all training examples are selected to construct a function dictionary.
- Then the function dictionary is reduced iteratively by linear support vector machines (SVMs).
- During the reduction process, the parameters of the functions in the dictionary can be adjusted dynamically.
8  Kernel Matching Reduction Algorithms
- Constructing a Kernel-Based Dictionary
- For a binary classification problem, assume there exist l training examples, which form the training set S = {(x1, y1), (x2, y2), . . . , (xl, yl)},
- where xi ∈ R^d, yi ∈ {−1, +1}, and yi represents the class label of the point xi, i = 1, 2, . . . , l.
9  Kernel Matching Reduction Algorithms
- Constructing a Kernel-Based Dictionary
- Given a kernel function K: R^d × R^d → R, similar to KMP, we use kernel functions, centered on the training points, as our dictionary
- D = {K(x, xi) | i = 1, . . . , l}.
10  Kernel Matching Reduction Algorithms
- Constructing a Kernel-Based Dictionary
- Here, the Gaussian kernel function, with a per-point width σi, is selected.
11  Kernel Matching Reduction Algorithms
- Constructing a Kernel-Based Dictionary
- The value of σi should be set to keep the influence of the local domain around xi and to prevent K(·, xi) from having a high activation in regions far from xi.
12  Kernel Matching Reduction Algorithms
- Constructing a Kernel-Based Dictionary
- Therefore, we adopt the following heuristic method, equation (2), which sets σi according to the p nearest neighbors of xi. In this way, the receptive width of each point is determined to cover a certain region in the sample space (see the sketch below).
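The slides do not reproduce equations (1) and (2); the sketch below shows one common reading, assuming the Gaussian form K(x, xi) = exp(−‖x − xi‖²/(2σi²)) and σi taken as the mean distance from xi to its p nearest neighbors. Both are illustrative assumptions, not the paper's exact formulas.

    import numpy as np

    def receptive_widths(C, p=2):
        # C: (m, d) array of dictionary centres. sigma_i is assumed to be the
        # mean distance from centre c_i to its p nearest neighbouring centres
        # (a stand-in for equation (2)).
        d = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)              # exclude the centre itself
        p = min(p, len(C) - 1)                   # guard for very small dictionaries
        return np.sort(d, axis=1)[:, :p].mean(axis=1)

    def gaussian_dictionary(X, C, sigma):
        # Entry [j, i] = K(x_j, c_i), assuming the Gaussian form
        # exp(-||x - c_i||^2 / (2 * sigma_i^2)) with per-centre width sigma_i.
        sq = np.sum((X[:, None, :] - C[None, :, :]) ** 2, axis=2)
        return np.exp(-sq / (2.0 * sigma[None, :] ** 2))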
13  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- Using all the kernel functions from the kernel-based dictionary D = {K(x, xi) | i = 1, . . . , l}, we construct a mapping from the original space to a feature space.
- Any training example xi in S is mapped to a corresponding point zi in the feature space, where zi = (K(xi, x1), K(xi, x2), . . . , K(xi, xl)).
- The training set S = {(x1, y1), (x2, y2), . . . , (xl, yl)} in the original space is mapped to S′ = {(z1, y1), (z2, y2), . . . , (zl, yl)} in the feature space.
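With the helpers sketched above, mapping S to S′ amounts to evaluating every dictionary function on every training point; X_train below is an assumed (l, d) array of the training inputs.

    # Row j of Z_train is z_j = (K(x_j, x_1), ..., K(x_j, x_l)), i.e. the mapped set S'.
    sigma = receptive_widths(X_train, p=2)
    Z_train = gaussian_dictionary(X_train, X_train, sigma)   # shape (l, l)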
14  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- We design a linear decision function gl(zt) = sign(fl(zt)) in the feature space, where
- fl(zt) = w · zt + b,    (3)
- which corresponds to the nonlinear form in the original space
- f(xt) = Σ_{i=1}^{l} wi K(xt, xi) + b,    (4)
- where w = (w1, w2, . . . , wl) represents the weights of every dimension in z.
15  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- We can decide which kernel functions are important for classification, and which are not, according to their weight magnitudes |wi| in (3) or (4), where |wi| denotes the absolute value of wi. Those redundant kernel functions, which have the lowest weight magnitudes, can be deleted from the dictionary to reduce the model.
16  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- Using the usual least-squares error criterion to find this function is not practical, since the number of training examples is, at the beginning, equal or close to the dimensionality of the feature space S′, and we would confront the problem of a non-invertible matrix.
17  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- In fact, support vector machines (SVMs), based on structural risk minimization, are well suited to supervised classification problems in high dimensions. Therefore, we adopt linear SVMs to find the classification function in (3) or (4) on S′.
18  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- The optimization objective of linear SVMs is to minimize
- (1/2) ‖w‖² + C Σ_{i=1}^{l} ξi    (5)
- subject to the constraints
- yi(w · zi + b) ≥ 1 − ξi, and ξi ≥ 0, i = 1, 2, . . . , l.
19  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- |wi| denotes the contribution of zi to the classifier in (3): the higher the value of |wi|, the larger the contribution of zi to the model.
- Consequently, we can rank the zi according to the values of |wi| (i = 1, 2, . . . , l) from large to small. We can also rank the xi by |wi|, because xi is the preimage of zi in the original space.
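A minimal sketch of this ranking step, assuming scikit-learn's LinearSVC as the linear SVM (the experiments in the paper use LIBSVM); Z is the mapped training set and y the labels.

    import numpy as np
    from sklearn.svm import LinearSVC

    def rank_by_weight(Z, y, C=1.0):
        # Train a linear SVM on S' = (Z, y); the i-th weight corresponds to the
        # dictionary function K(., x_i), so |w_i| ranks the centres x_i.
        clf = LinearSVC(C=C).fit(Z, y)
        w = clf.coef_.ravel()
        order = np.argsort(-np.abs(w))   # centre indices, largest |w_i| first
        return order, w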
20  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- The xi with the smallest |wi| can be deleted from the dictionary D, reducing D to D′.
- Then we can continue this procedure on the new dictionary D′. Thus, the process is performed iteratively until a given stopping criterion is satisfied.
- Note that each σi should be recomputed on the new dictionary D′, according to (2), every time D is reduced to D′, so that the receptive widths of the kernel functions in D′ always cover the whole sample space.
21  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- We can set a tolerable minimum accuracy δ on the training examples as the termination criterion of this procedure.
- We expect to obtain the simplest model that still guarantees a satisfactory classification accuracy on the training examples.
- This idea accords with the principles of minimum description length and Occam's Razor.
- Therefore, this algorithm can be expected to have good generalization ability.
22  Kernel Matching Reduction Algorithms
- Reducing the Kernel-Based Dictionary by Linear SVMs
- Different from KMP, which appends kernel functions to the final model gradually, this reduction strategy can be expected to avoid local optima, precisely because it deletes redundant functions from the function dictionary iteratively.
23  Kernel Matching Reduction Algorithms
- The Detailed Procedure of KMRAs
- Step 1. Set the parameter p in (2), the cross-validation fold number v for determining C in (5), and the required classification accuracy δ on the training examples.
- Step 2. Input the training examples S = {(x1, y1), (x2, y2), . . . , (xl, yl)}.
- Step 3. Compute each σi by equation (2), and construct the kernel-based dictionary D = {K(x, xi) | i = 1, . . . , l}.
24  Kernel Matching Reduction Algorithms
- The Detailed Procedure of KMRAs
- Step 4. Transform S to S′ by the dictionary D.
- Step 5. Determine C by v-fold cross validation.
- Step 6. Train the linear SVM with penalty factor C on S′, and obtain the classification model, including wi, i = 1, 2, . . . , l.
- Step 7. Rank the xi by their weight magnitudes |wi|, i = 1, 2, . . . , l.
25  Kernel Matching Reduction Algorithms
- The Detailed Procedure of KMRAs
- Step 8. If the classification accuracy of this model on the training data is higher than δ, delete from D the K(x, xi) with the smallest |wi|, then adjust each σi for the new D by (2), and go to Step 4. Otherwise, go to Step 9.
- Step 9. Output the classification model, which satisfies the accuracy δ with the simplest structure.
26  Kernel Matching Reduction Algorithms
- The Detailed Procedure of KMRAs
- Step 8 of the reduction can be generalized to remove more than one basis function per iteration, to improve training speed. A sketch of the whole procedure is given below.
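A compact sketch of Steps 3–9, reusing the receptive_widths and gaussian_dictionary helpers from the earlier sketch and assuming scikit-learn's LinearSVC with a fixed penalty C (the paper instead re-determines C by v-fold cross validation in Step 5 of every round).

    import numpy as np
    from sklearn.svm import LinearSVC

    def kmra(X, y, p=2, C=1.0, delta=0.9):
        keep = np.arange(len(X))                 # Step 3: every training point is a centre
        model, centres = None, keep.copy()
        while len(keep) > 1:
            sigma = receptive_widths(X[keep], p)             # re-adjust widths, eq. (2)
            Z = gaussian_dictionary(X, X[keep], sigma)       # Step 4: map S to S'
            clf = LinearSVC(C=C).fit(Z, y)                   # Step 6: linear SVM on S'
            if clf.score(Z, y) <= delta:                     # Step 8: training-accuracy check
                break                                        # accuracy no longer above delta
            model, centres = clf, keep.copy()                # last model satisfying delta
            w = np.abs(clf.coef_.ravel())                    # Step 7: rank by |w_i|
            keep = np.delete(keep, np.argmin(w))             # delete centre with smallest |w_i|
        return model, centres                                # Step 9: simplest satisfactory model

Here centres holds the indices of the kernel functions that remain in the final dictionary.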
27  Comparing with Other Machine Learning Algorithms
- Although KMRAs, KMP, SVMs, HSSVMs, and RBFNNs can all generate decision functions of a shape similar to equation (4), KMRAs have distinct characteristics, in essence, compared with these other algorithms.
28  Comparing with Other Machine Learning Algorithms
- Differences with KMP
- Both KMRAs and KMP build kernel-based dictionaries, but they adopt different ways of selecting basis functions for the final solutions. KMP appends kernel functions iteratively to the classification model. By contrast, KMRAs reduce the size of the dictionary step by step, by deleting redundant kernel functions.
- Moreover, different from KMP, KMRAs utilize linear SVMs to find solutions in the feature space.
29  Comparing with Other Machine Learning Algorithms
- KMRA Versus SVM
- The main difference between KMRAs and SVMs lies in how the feature spaces are produced. KMRAs create the feature space through a kernel-based dictionary, whereas SVMs create it implicitly through the kernel function.
- Kernel functions in SVMs must satisfy Mercer's theorem, while KMRAs place no restrictions on the kernel functions in the dictionary. The comparison between KMRAs and SVMs is similar to that between KMP and SVMs. In fact, we select Gaussian kernel functions in this paper, which can have different kernel widths obtained by equation (2), whereas the Gaussian kernel functions for all support vectors of an SVM share the same kernel width.
30  Comparing with Other Machine Learning Algorithms
- Linking with HSSVMs
- Hidden space support vector machines (HSSVMs) also map input patterns into a high-dimensional hidden space by a set of nonlinear functions, and then train linear SVMs in that hidden space. From this viewpoint of constructing feature spaces and applying linear SVMs, KMRAs are similar to HSSVMs. But we adopt an iterative procedure to eliminate redundant kernel functions, until a compact solution is obtained.
- KMRAs can be considered an improved version of HSSVMs.
31  Comparing with Other Machine Learning Algorithms
- Relation with RBFNNs
- Although RBFNNs also build feature spaces, usually with Gaussian kernel functions, they create discrimination functions in the least-squares sense. KMRAs, however, use linear SVMs, i.e. the idea of structural risk minimization, to find solutions.
- In a broad sense, we can think of KMRAs as a special RBFNN model with a new configuration-design strategy.
32  Experiments
- Description of Data Sets and Parameter Settings
- We compare KMRAs with SVMs on four datasets: Wisconsin Breast Cancer, Pima Indians Diabetes, Heart, and Australian; the former two are from the UCI machine learning repository, and the latter two from the Statlog database.
- We directly use the LIBSVM software package to run the standard SVM.
33  Experiments
- Description of Data Sets and Parameter Settings
- Throughout the experiments:
- 1. All training data and test data are normalized to [-1, 1].
- 2. Two-thirds of the examples are randomly selected as training examples, and the remaining one-third as test examples.
- 3. Gaussian kernel functions are chosen for the SVMs, with the kernel width σ and the penalty parameter C decided by ten-fold cross validation on the training set (see the sketch after this list).
- 4. p = 2 is adopted in equation (2).
- 5. v = 5 is set in Step 5 of the KMRA.
- 6. For each dataset, the SVM is trained first, and then, according to its classification accuracy, we determine the stop accuracy δ for the KMRA.
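A sketch of the shared experimental setup (items 1–3 and 6), assuming scikit-learn; X_raw and y are the assumed raw feature matrix and labels, and the grids of candidate C and kernel-width values are purely illustrative.

    import numpy as np
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.preprocessing import MinMaxScaler
    from sklearn.svm import SVC

    # 1. Normalize all data to [-1, 1];  2. random 2/3 train / 1/3 test split.
    X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X_raw)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

    # 3. Gaussian-kernel SVM with C and the width chosen by ten-fold cross validation.
    grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}   # gamma ~ 1 / (2 sigma^2)
    svm = GridSearchCV(SVC(kernel="rbf"), grid, cv=10).fit(X_tr, y_tr)

    # 6. The SVM's test accuracy guides the choice of the stop accuracy delta for the KMRA.
    baseline_acc = svm.score(X_te, y_te)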
34  Experiments
- Experimental Results
- We first present the results of the standard SVMs: their parameters C and σ in Table 1, and their numbers of support vectors (SVs) and prediction accuracies in Table 2.
35  Experiments
- Experimental Results
- We set the termination accuracy δ = 0.97, 0.8, 0.8, and 0.9 in the KMRAs for these four datasets respectively, according to the classification accuracies of the SVMs in Table 2.
- We run the KMRAs on these datasets and record the classification accuracy on the test set at every iteration. The results are shown in Fig. 1.
36  Experiments
- Experimental Results
- In Fig. 1, the accuracies of the SVMs on test examples are shown as thick straight lines, and the thin curves represent the classification performance of the KMRAs. The horizontal axis denotes the KMRA iteration number; that is, the number of kernel functions in the dictionary decreases gradually from left to right.
37  Experiments
- Experimental Results
- For Diabetes and Australian, the prediction accuracies of the KMRAs improve gradually as the kernel functions in the dictionary are reduced. We can conclude that overfitting happens at the beginning of the KMRA runs. Before the KMRAs terminate, their performance approaches, and even exceeds, that of the SVMs.
- For Breast and Heart, from beginning to end, the KMRA curves fluctuate up and down around the accuracy lines of the SVMs.
38  Experiments
- Experimental Results
- We further report, in Table 2, the numbers of kernel functions (i.e. SVs) that appear in the final classification functions, as well as the corresponding prediction accuracies when the KMRAs terminate.
- Moreover, we record the best performance during the iterative process of the KMRAs and also list it in Table 2.
- Table 2 shows that, compared with SVMs, KMRAs use far fewer support vectors while obtaining comparable results.
42  Conclusions
- We propose KMRAs, which iteratively delete redundant kernel functions from a kernel-based dictionary. Therefore, we expect KMRAs to avoid local optima and to have good generalization ability.
- Experimental results demonstrate that, compared with SVMs, KMRAs achieve comparable accuracies, but typically with much sparser representations. This means that KMRAs can classify test examples faster than SVMs.
- In addition, analogous to SVMs, KMRAs can be extended to multi-class classification problems, though we only consider the two-class case in this paper.
43  Conclusions
- We also find that KMRAs gain sparser models at the expense of a longer training time. Consequently, future work should explore how to reduce the training cost.
- In conclusion, KMRAs provide a new problem-solving approach for classification.
44  Thanks!