Title: Bagging classifiers based on kernel density estimators
1 Bagging classifiers based on kernel density estimators
- Edgar Acuña
- Alex Rojas
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
- This research is partially supported by ONR Grant N00014-00-1-0360 and NSF Grant EIA 99-77071
2 The Classification Problem
In the classification problem with J classes and M predictors, one has a learning set of data L = {(y_k, x_k), k = 1, ..., N} consisting of measurements on N cases, where each case consists of a categorical response variable y_k that takes values in {1, ..., J} and a feature vector x_k = (x_1k, ..., x_Mk). The goal of classification is to use L to construct a function C(x, L) that gives accurate classification of future data.
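To make the notation concrete, here is a minimal sketch (our own NumPy code, not from the slides) of a learning set L and a classifier C(x, L); the nearest-mean rule is only an illustrative stand-in for a real classifier.

```python
import numpy as np

# Learning set L = {(y_k, x_k), k = 1, ..., N}: a feature matrix X (N x M)
# and a label vector y with values in {1, ..., J}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([1] * 20 + [2] * 20)

def C(x, X, y):
    """Stand-in classifier C(x, L): assign x to the class whose sample mean is closest."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

print(C(np.array([0.2, -0.1]), X, y))  # near the first cluster  -> class 1
print(C(np.array([2.8, 3.1]), X, y))   # near the second cluster -> class 2
```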
3 The Classification Problem
4 The Misclassification Error
Let C(x, L) be the classifier constructed by using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of misclassified cases of T using C. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j / X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
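A rough sketch (our own NumPy code, not the authors') of two of these estimates: the resubstitution error and a 10-fold cross-validation estimate. A simple nearest-mean rule stands in for the KDE classifier.

```python
import numpy as np

def misclassification_error(classifier, X, y):
    """Proportion of cases in (X, y) misclassified by the given classifier."""
    preds = np.array([classifier(x) for x in X])
    return float(np.mean(preds != y))

def cv_error(fit, X, y, folds=10, seed=0):
    """K-fold cross-validation estimate of the ME.
    `fit(X_train, y_train)` must return a function x -> predicted class."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), folds)
    errors = []
    for k in range(folds):
        test = parts[k]
        train = np.concatenate([parts[i] for i in range(folds) if i != k])
        clf = fit(X[train], y[train])
        errors.append(misclassification_error(clf, X[test], y[test]))
    return float(np.mean(errors))

def fit_nearest_mean(X, y):
    """Illustrative base classifier: nearest class mean."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return lambda x: classes[np.argmin(np.linalg.norm(means - x, axis=1))]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)
print("resubstitution ME:", misclassification_error(fit_nearest_mean(X, y), X, y))
print("10-fold CV ME:    ", cv_error(fit_nearest_mean, X, y))
```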
5 The classifier may either overfit the data (low bias and large variance) or underfit the data (large bias and small variance). A classifier is said to be unstable (Breiman, 1996) if a small change in the data L can make large changes in the classification. Unstable classifiers have low bias but high variance. CART and neural networks are unstable classifiers; linear discriminant analysis and k-nearest-neighbor classifiers are stable.
6 Combining classifiers
Combining the predictions of several classifiers can lead to a reduction of the variance and bias of the ME. This combination is called an ensemble, and it is in general more accurate than the individual classifiers. Methods for creating ensembles are:
- Bagging (Bootstrap aggregating, Breiman, 1996)
- AdaBoosting (Adaptive Boosting, Freund and Schapire, 1996)
- Arcing (Adaptively resampling and combining, Breiman, 1998)
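A minimal sketch of bagging by majority vote (our own NumPy code; the base classifier passed as `fit` is an arbitrary stand-in, not the KDE classifier of the talk).

```python
import numpy as np

def bagging_predict(fit, X_train, y_train, X_test, B=50, seed=0):
    """Bagging: fit the base classifier on B bootstrap samples of the training
    set and classify each test case by majority vote over the B predictions.
    `fit(X, y)` must return a function mapping one feature vector to a class."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros((B, len(X_test)), dtype=int)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        clf = fit(X_train[idx], y_train[idx])
        votes[b] = [clf(x) for x in X_test]
    # majority vote for each test case
    return np.array([np.bincount(col).argmax() for col in votes.T])

def fit_nearest_mean(X, y):
    """Illustrative base classifier: nearest class mean."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return lambda x: classes[np.argmin(np.linalg.norm(means - x, axis=1))]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([1] * 40 + [2] * 40)
print(bagging_predict(fit_nearest_mean, X, y, X[:5]))
```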
8 Previous results on combining classifiers
9 Bayesian approach to classification
An object with measurement vector x is assigned to the class j if P(j/x) > P(j'/x) for all j' ≠ j. By Bayes' theorem,
P(j/x) = π_j f(x/j) / f(x),
where π_j is the prior of the j-th class, f(x/j) is the class-conditional density, and f(x) is the density function of x. Hence j is the integer such that π_j f(x/j) is maximum.
10 Kernel density estimator classifiers
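A minimal sketch of such a classifier (our own NumPy code; a product Gaussian kernel with a fixed, arbitrary bandwidth h is assumed): each class-conditional density f(x/j) in the Bayes rule of the previous slide is replaced by a kernel density estimate, and x is assigned to the class maximizing π_j f̂(x/j).

```python
import numpy as np

def gaussian_product_kde(x, X, h):
    """Fixed-bandwidth product Gaussian kernel density estimate at point x
    from the sample X (n x d), using a common bandwidth h per coordinate."""
    u = (x - X) / h                                        # (n, d)
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(np.prod(kernels, axis=1)) / h ** X.shape[1]

def kde_classifier(X_train, y_train, h=0.5):
    """Assign x to the class j maximizing prior_j * fhat(x / j)."""
    classes = np.unique(y_train)
    priors = np.array([np.mean(y_train == j) for j in classes])
    groups = [X_train[y_train == j] for j in classes]
    def classify(x):
        scores = [p * gaussian_product_kde(x, G, h) for p, G in zip(priors, groups)]
        return classes[int(np.argmax(scores))]
    return classify

# tiny demo on two synthetic classes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
clf = kde_classifier(X, y)
print(clf(np.array([0.1, 0.0])), clf(np.array([3.2, 2.9])))  # expected: 1 2
```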
11 Problems
- Choice of the bandwidth: fixed or adaptive.
- Fixed.
- Adaptive: kernel density estimation and k-nn are combined (Silverman, p. 101); see the sketch after this list.
- Mixed types of predictors: continuous and categorical (binary, ordinal, nominal).
- Use of a product kernel.
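For the adaptive choice, here is a rough one-dimensional sketch (our own NumPy code, with α = 1/2 and an arbitrary pilot bandwidth h) of the adaptive kernel estimator described in Silverman: a fixed-bandwidth pilot estimate supplies local bandwidth factors λ_i.

```python
import numpy as np

def fixed_kde(x, data, h):
    """Fixed-bandwidth Gaussian KDE of a 1-D sample, evaluated at points x."""
    u = (x[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h

def adaptive_kde(x, data, h, alpha=0.5):
    """Adaptive KDE: pilot estimate -> local bandwidth factors
    lambda_i = (f_pilot(X_i) / g) ** (-alpha), g = geometric mean of the pilot."""
    pilot = fixed_kde(data, data, h)
    g = np.exp(np.mean(np.log(pilot)))
    lam = (pilot / g) ** (-alpha)                          # one factor per data point
    u = (x[:, None] - data[None, :]) / (h * lam[None, :])
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(k / (h * lam[None, :]), axis=1)

data = np.random.default_rng(4).normal(0, 1, 200)
grid = np.linspace(-3, 3, 7)
print(adaptive_kde(grid, data, h=0.4))
```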
12 If the vector of predictors x can be decomposed as x = (x(1), x(2)), where x(1) contains the p1 categorical predictors and x(2) includes the p2 continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂_1(x(1)) f̂_2(x(2)),
where f̂_1(x(1)) is the kernel estimator for the vector x(1) of categorical predictors (Titterington, 1980) and f̂_2(x(2)) is the kernel density estimator of the vector x(2) of continuous predictors.
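A minimal sketch of such a mixed product kernel estimate (our own NumPy code; an Aitchison–Aitken-style kernel is one possible choice for the categorical block, and the smoothing parameters lam, n_levels, and h are illustrative assumptions).

```python
import numpy as np

def categorical_kernel(x1, X1, lam, n_levels):
    """Aitchison-Aitken-type kernel for the categorical block x(1):
    weight (1 - lam) when a coordinate matches, lam / (c - 1) otherwise."""
    match = (X1 == x1)                                     # (n, p1) boolean
    w = np.where(match, 1.0 - lam, lam / (n_levels - 1.0))
    return np.prod(w, axis=1)                              # one weight per training case

def continuous_kernel(x2, X2, h):
    """Product Gaussian kernel for the continuous block x(2)."""
    u = (x2 - X2) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.prod(k, axis=1) / h ** X2.shape[1]

def mixed_kde(x1, x2, X1, X2, lam=0.1, n_levels=3, h=0.5):
    """Mixed product kernel density estimate at the point (x(1), x(2))."""
    return np.mean(categorical_kernel(x1, X1, lam, n_levels) *
                   continuous_kernel(x2, X2, h))

# tiny demo: 2 categorical predictors with 3 levels and 2 continuous predictors
rng = np.random.default_rng(5)
X1 = rng.integers(0, 3, size=(100, 2))
X2 = rng.normal(0, 1, size=(100, 2))
print(mixed_kde(np.array([1, 2]), np.array([0.0, 0.3]), X1, X2))
```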
13 Problems
- The curse of dimensionality: feature selection via filters (the Relief; see the sketch after this list) or wrappers.
- Missing values: imputation.
14 Previous Results
In the Statlog Project (1994), classifiers based on KDE performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times respectively.
15 Experimental Methodology
Each dataset is randomly divided into 10 parts. The first part is taken as the test sample and the remaining ones as the training sample. Next, 50 bootstrapped samples are taken from the training sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting over the 50 classifiers. The same procedure is applied to each part, and then the experiment is repeated 10 times.
16 Datasets
17 Performance without feature selection
18 Feature selection
19 Performance after feature selection
20 Concluding Remarks
- Increasing the number of bootstrapped samples seems to improve the misclassification error for both types of classifiers.
- Without feature selection, the adaptive kernel performs better than the standard kernel, but it requires at least three times more computing time.
- After feature selection, the performance of bagging deteriorates for both types of kernels.
- Feature selection does a good job: after it, KDE classifiers give a lower ME while saving computing time.