Title: Combining classifiers based on kernel density estimators
1 Combining classifiers based on kernel density estimators
- Edgar Acuña
- Alex Rojas
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
- This research is partially supported by Grant N00014-00-1-0360 from ONR
2 The Classification Problem
In the classification problem with J classes and M predictors, one has a learning set L = {(y_k, x_k), k = 1, ..., N} consisting of measurements on N cases, where each case consists of a categorical response variable y_k that takes values in {1, ..., J} and a feature vector x_k = (x_1k, ..., x_Mk). The goal of classification is to use L to construct a function C(x, L) that gives accurate classification of future data.
3 The Classification Problem
4 The Misclassification Error
Let C(x, L) be the classifier constructed using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of cases of T misclassified by C. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
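For illustration, here is a minimal sketch of estimating ME by cross-validation, one of the methods listed above; the base classifier (scikit-learn's LinearDiscriminantAnalysis) and the 10-fold setup are assumptions chosen only to make the example concrete.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def cv_misclassification_error(X, y, n_folds=10, seed=0):
    """Estimate ME as the average fraction of misclassified held-out cases."""
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    errors = []
    for train_idx, test_idx in skf.split(X, y):
        # fit on the training fold, evaluate on the held-out fold
        clf = LinearDiscriminantAnalysis().fit(X[train_idx], y[train_idx])
        errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
    return float(np.mean(errors))
```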
5 A classifier may either overfit the data (low bias and large variance) or underfit the data (large bias and small variance). A classifier is unstable (Breiman, 1996) if a small change in the data L can produce large changes in the classification. Unstable classifiers have low bias but high variance. CART and neural networks are unstable classifiers; linear discriminant analysis and k-nearest neighbor classifiers are stable.
6 Combining classifiers
By combining the predictions of several classifiers, the variance and bias of the ME can be reduced. Such a combination is called an ensemble and is, in general, more accurate than the individual classifiers. Methods for creating ensembles are
- Bagging (bootstrap aggregating, Breiman, 1996)
- AdaBoosting (adaptive boosting, Freund and Schapire, 1996)
- Arcing (adaptively resampling and combining, Breiman, 1998)
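As an illustration of the first of these methods, the following is a minimal sketch of bagging by majority vote; the base learner (a CART-style decision tree, one of the unstable classifiers mentioned earlier) and the number of bootstrap replicates are assumptions, not the exact setup of these slides.

```python
import numpy as np
from sklearn.base import clone
from sklearn.tree import DecisionTreeClassifier

def bagging_predict(X_train, y_train, X_test, base=None, n_boot=50, seed=0):
    """Bagging: train `n_boot` copies of `base` on bootstrap samples and vote.

    Class labels are assumed to be coded as non-negative integers.
    """
    rng = np.random.default_rng(seed)
    base = base if base is not None else DecisionTreeClassifier()
    n = len(y_train)
    votes = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        votes.append(clone(base).fit(X_train[idx], y_train[idx]).predict(X_test))
    votes = np.asarray(votes)
    # majority vote across the n_boot predictions for each test case
    return np.array([np.bincount(col).argmax() for col in votes.T])
```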
9 Previous results on combining classifiers
10 Bayesian approach to classification
An object with measurement vector x is assigned to the class j if P(j | x) > P(j' | x) for all j' ≠ j. By Bayes's theorem, P(j | x) = π_j f(x | j) / f(x), where π_j is the prior of the j-th class, f(x | j) is the class-conditional density, and f(x) is the density function of x. Hence x is assigned to the class j for which π_j f(x | j) is maximum.
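A minimal sketch of this decision rule, with the priors and class-conditional densities passed in as plain Python objects; the toy numbers in the usage example are made up purely for illustration.

```python
import numpy as np

def bayes_assign(x, priors, densities):
    """Assign x to the class j maximizing priors[j] * densities[j](x)."""
    classes = list(priors)
    scores = [priors[j] * densities[j](x) for j in classes]
    return classes[int(np.argmax(scores))]

# Toy illustration (made-up numbers): the prior favors class 2, but the
# likelihood at x favors class 1, so the rule picks class 1.
priors = {1: 0.3, 2: 0.7}
densities = {1: lambda x: 2.0, 2: lambda x: 0.5}
print(bayes_assign(0.0, priors, densities))   # -> 1, since 0.3*2.0 > 0.7*0.5
```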
11 Kernel density estimator classifiers
12 If the vector of predictors x can be decomposed as x = (x(1), x(2)), where x(1) contains the p1 categorical predictors and x(2) includes the p2 continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂1(x(1)) · f̂2(x(2)),
where f̂1 is the kernel estimator for the vector x(1) of categorical predictors and f̂2 is the kernel density estimator of the vector x(2) of continuous predictors.
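A minimal sketch of such a mixed estimator, assuming an Aitchison-Aitken-type kernel for the categorical block and a Gaussian product kernel with rule-of-thumb bandwidths for the continuous block; these kernel and bandwidth choices are illustrative assumptions, not necessarily the ones used in the study.

```python
import numpy as np

def f1_categorical(x_cat, T_cat, lam=0.3):
    """Kernel estimator for the categorical block (Aitchison-Aitken-type kernel).

    Categories are assumed to be coded 0, 1, ..., c-1 in each column of T_cat.
    """
    n_levels = T_cat.max(axis=0) + 1                          # c per predictor
    weights = np.where(T_cat == x_cat, 1.0 - lam, lam / (n_levels - 1.0))
    return weights.prod(axis=1).mean()

def f2_continuous(x_con, T_con, h=None):
    """Gaussian product kernel density estimator for the continuous block."""
    N, p2 = T_con.shape
    if h is None:                                             # rule-of-thumb bandwidths
        h = T_con.std(axis=0) * N ** (-1.0 / (p2 + 4))
    u = (T_con - x_con) / h
    return (np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))).prod(axis=1).mean()

def mixed_product_kde(x_cat, x_con, T_cat, T_con, lam=0.3, h=None):
    """Mixed estimate f(x) = f1(x(1)) * f2(x(2))."""
    return f1_categorical(x_cat, T_cat, lam) * f2_continuous(x_con, T_con, h)
```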
13 Problems
- Choice of the bandwidth: fixed or adaptive (see the sketch below).
- Mixed types of predictors: continuous and discrete.
- The curse of dimensionality: feature selection with filters (the Relief) or wrappers.
- Missing values: imputation.
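A minimal sketch of an adaptive (variable-bandwidth) kernel density estimate in one dimension, using the common two-stage recipe of a fixed-bandwidth pilot estimate followed by local bandwidths; the pilot bandwidth rule and the sensitivity parameter alpha are assumptions for illustration, not the adaptive scheme used in the experiments.

```python
import numpy as np

def adaptive_kde(x_eval, data, alpha=0.5):
    """Evaluate an adaptive-bandwidth Gaussian KDE at the points x_eval (1-D data).

    Two stages: a fixed-bandwidth pilot estimate at the data points, then local
    bandwidths h_i = h * (pilot_i / g)**(-alpha), with g their geometric mean.
    """
    data = np.asarray(data, dtype=float)
    n = data.size
    h = 1.06 * data.std() * n ** (-1.0 / 5.0)                 # rule-of-thumb pilot bandwidth
    u0 = (data[:, None] - data[None, :]) / h
    pilot = np.mean(np.exp(-0.5 * u0 ** 2), axis=1) / (h * np.sqrt(2.0 * np.pi))
    g = np.exp(np.mean(np.log(pilot)))                        # geometric mean of pilot values
    h_i = h * (pilot / g) ** (-alpha)                         # local bandwidth per data point
    u = (np.asarray(x_eval)[:, None] - data[None, :]) / h_i[None, :]
    return np.mean(np.exp(-0.5 * u ** 2) / (h_i[None, :] * np.sqrt(2.0 * np.pi)), axis=1)
```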
14 Previous Results
- In the Statlog Project (1994), classifiers based on KDE performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART did so only 6 and 3 times, respectively.
15 Experimental Methodology
- Each dataset is randomly divided into 10 parts. One part is taken as the test sample and the remaining ones as the training sample. Next, 50 bootstrap samples are drawn from the training sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting over the 50 classifiers. The procedure is repeated with each part, and the whole experiment is repeated 10 times.
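A minimal sketch of this procedure for a single repetition, assuming continuous predictors only and a fixed-bandwidth Gaussian KernelDensity from scikit-learn as the base KDE classifier; the bandwidth, the use of stratified folds, and the integer coding of class labels are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KernelDensity

def kde_bayes_predict(X_tr, y_tr, X_te, bandwidth=0.5):
    """Assign each test case to the class maximizing prior * KDE density."""
    classes = np.unique(y_tr)
    log_scores = np.column_stack([
        np.log(np.mean(y_tr == j))
        + KernelDensity(bandwidth=bandwidth).fit(X_tr[y_tr == j]).score_samples(X_te)
        for j in classes])
    return classes[log_scores.argmax(axis=1)]

def bagged_kde_error(X, y, n_boot=50, n_folds=10, seed=0):
    """One repetition of the 10-fold / 50-bootstrap voting experiment.

    Class labels are assumed to be coded as non-negative integers.
    """
    rng = np.random.default_rng(seed)
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    fold_errors = []
    for tr, te in skf.split(X, y):
        votes = []
        for _ in range(n_boot):                               # 50 bootstrapped training sets
            idx = rng.choice(tr, size=tr.size, replace=True)
            votes.append(kde_bayes_predict(X[idx], y[idx], X[te]))
        majority = np.array([np.bincount(col).argmax() for col in np.asarray(votes).T])
        fold_errors.append(np.mean(majority != y[te]))
    return float(np.mean(fold_errors))                        # estimated ME
```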
16 Performance without feature selection
17 Performance after feature selection
18 Conclusions
- Increasing the number of bootstrapped samples seems to improve the misclassification error for both types of classifiers.
- Without feature selection, the adaptive kernel performs better than the standard kernel, but it requires at least three times more computing time.
- After feature selection, the performance of bagging deteriorates for both types of kernels.
- Feature selection does a good job: after it, the KDE classifiers give a lower ME while saving computing time.