Title: Bagging classifiers based on kernel density estimators
1 Bagging classifiers based on kernel density estimators
- Edgar Acuña
- Alex Rojas
- Department of Mathematics
- University of Puerto Rico
- Mayagüez Campus
- (www.math.uprm.edu/edgar)
- This research is partially supported by ONR Grant N00014-00-1-0360 and NSF Grant EIA 99-77071
2 The Classification Problem
In the classification problem with J classes and M predictors, one has a learning set of data L = {(y_k, x_k), k = 1, ..., N} consisting of measurements on N cases, where each case consists of a categorical response variable y_k that takes values in {1, ..., J} and a feature vector x_k = (x_1k, ..., x_Mk). The goal of classification is to use L to construct a function C(x, L) that gives accurate classification of future data.
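To make the notation concrete, here is a minimal sketch (our own NumPy code, not from the slides) of a learning set L and a classifier C(x, L); the nearest-mean rule is only an illustrative stand-in for a real classifier.

```python
import numpy as np

# Learning set L = {(y_k, x_k), k = 1, ..., N}: a feature matrix X (N x M)
# and a label vector y with values in {1, ..., J}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 2)), rng.normal(3.0, 1.0, (20, 2))])
y = np.array([1] * 20 + [2] * 20)

def C(x, X, y):
    """Stand-in classifier C(x, L): assign x to the class whose sample mean is closest."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return classes[np.argmin(np.linalg.norm(means - x, axis=1))]

print(C(np.array([0.2, -0.1]), X, y))  # near the first cluster  -> class 1
print(C(np.array([2.8, 3.1]), X, y))   # near the second cluster -> class 2
```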
3 The Classification Problem
4 The Misclassification Error
Let C(x, L) be the classifier constructed by using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of misclassified cases of T using C. The ME can be decomposed as
ME(C) = ME(C*) + Bias²(C) + Var(C),
where C*(x) = argmax_j P(Y = j / X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
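A rough sketch (our own NumPy code, not the authors') of two of these estimates: the resubstitution error and a 10-fold cross-validation estimate. A simple nearest-mean rule stands in for the KDE classifier.

```python
import numpy as np

def misclassification_error(classifier, X, y):
    """Proportion of cases in (X, y) misclassified by the given classifier."""
    preds = np.array([classifier(x) for x in X])
    return float(np.mean(preds != y))

def cv_error(fit, X, y, folds=10, seed=0):
    """K-fold cross-validation estimate of the ME.
    `fit(X_train, y_train)` must return a function x -> predicted class."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(y)), folds)
    errors = []
    for k in range(folds):
        test = parts[k]
        train = np.concatenate([parts[i] for i in range(folds) if i != k])
        clf = fit(X[train], y[train])
        errors.append(misclassification_error(clf, X[test], y[test]))
    return float(np.mean(errors))

def fit_nearest_mean(X, y):
    """Illustrative base classifier: nearest class mean."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return lambda x: classes[np.argmin(np.linalg.norm(means - x, axis=1))]

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(2, 1, (30, 2))])
y = np.array([1] * 30 + [2] * 30)
print("resubstitution ME:", misclassification_error(fit_nearest_mean(X, y), X, y))
print("10-fold CV ME:    ", cv_error(fit_nearest_mean, X, y))
```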
5 The classifier may either overfit the data (low bias and large variance) or underfit the data (large bias and small variance). A classifier is said to be unstable (Breiman, 1996) if a small change in the data L can make large changes in the classification. Unstable classifiers have low bias but high variance. CART and neural networks are unstable classifiers; linear discriminant analysis and k-nearest-neighbor classifiers are stable.
6 Combining classifiers
Combining the predictions of several classifiers can lead to a reduction of the variance and bias of the ME. This combination is called an ensemble, and it is in general more accurate than the individual classifiers. Methods for creating ensembles are:
- Bagging (Bootstrap aggregating, Breiman, 1996)
- AdaBoosting (Adaptive Boosting, Freund and Schapire, 1996)
- Arcing (Adaptively resampling and combining, Breiman, 1998)
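A minimal sketch of bagging by majority vote (our own NumPy code; the base classifier passed as `fit` is an arbitrary stand-in, not the KDE classifier of the talk).

```python
import numpy as np

def bagging_predict(fit, X_train, y_train, X_test, B=50, seed=0):
    """Bagging: fit the base classifier on B bootstrap samples of the training
    set and classify each test case by majority vote over the B predictions.
    `fit(X, y)` must return a function mapping one feature vector to a class."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros((B, len(X_test)), dtype=int)
    for b in range(B):
        idx = rng.integers(0, n, size=n)          # bootstrap sample (with replacement)
        clf = fit(X_train[idx], y_train[idx])
        votes[b] = [clf(x) for x in X_test]
    # majority vote for each test case
    return np.array([np.bincount(col).argmax() for col in votes.T])

def fit_nearest_mean(X, y):
    """Illustrative base classifier: nearest class mean."""
    classes = np.unique(y)
    means = np.array([X[y == j].mean(axis=0) for j in classes])
    return lambda x: classes[np.argmin(np.linalg.norm(means - x, axis=1))]

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(2, 1, (40, 2))])
y = np.array([1] * 40 + [2] * 40)
print(bagging_predict(fit_nearest_mean, X, y, X[:5]))
```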
8 Previous results on combining classifiers
9 Bayesian approach to classification
An object with measurement vector x is assigned to the class j if P(j/x) > P(j'/x) for all j' ≠ j. By Bayes' theorem,
P(j/x) = π_j f(x/j) / f(x),
where π_j is the prior of the j-th class, f(x/j) is the class-conditional density, and f(x) is the density function of x. Hence j is the integer such that π_j f(x/j) is maximum.
10 Kernel density estimator classifiers
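A minimal sketch of such a classifier (our own NumPy code; a product Gaussian kernel with a fixed, arbitrary bandwidth h is assumed): each class-conditional density f(x/j) in the Bayes rule of the previous slide is replaced by a kernel density estimate, and x is assigned to the class maximizing π_j f̂(x/j).

```python
import numpy as np

def gaussian_product_kde(x, X, h):
    """Fixed-bandwidth product Gaussian kernel density estimate at point x
    from the sample X (n x d), using a common bandwidth h per coordinate."""
    u = (x - X) / h                                        # (n, d)
    kernels = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(np.prod(kernels, axis=1)) / h ** X.shape[1]

def kde_classifier(X_train, y_train, h=0.5):
    """Assign x to the class j maximizing prior_j * fhat(x / j)."""
    classes = np.unique(y_train)
    priors = np.array([np.mean(y_train == j) for j in classes])
    groups = [X_train[y_train == j] for j in classes]
    def classify(x):
        scores = [p * gaussian_product_kde(x, G, h) for p, G in zip(priors, groups)]
        return classes[int(np.argmax(scores))]
    return classify

# tiny demo on two synthetic classes
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([1] * 50 + [2] * 50)
clf = kde_classifier(X, y)
print(clf(np.array([0.1, 0.0])), clf(np.array([3.2, 2.9])))  # expected: 1 2
```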
11 Problems
- Choice of the bandwidth: fixed or adaptive.
- Fixed.
- Adaptive: kernel density estimation and k-nn are combined (Silverman, p. 101); see the sketch after this list.
- Mixed types of predictors: continuous and categorical (binary, ordinal, nominal).
- Use of a product kernel.
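For the adaptive choice, here is a rough one-dimensional sketch (our own NumPy code, with α = 1/2 and an arbitrary pilot bandwidth h) of the adaptive kernel estimator described in Silverman: a fixed-bandwidth pilot estimate supplies local bandwidth factors λ_i.

```python
import numpy as np

def fixed_kde(x, data, h):
    """Fixed-bandwidth Gaussian KDE of a 1-D sample, evaluated at points x."""
    u = (x[:, None] - data[None, :]) / h
    return np.mean(np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi), axis=1) / h

def adaptive_kde(x, data, h, alpha=0.5):
    """Adaptive KDE: pilot estimate -> local bandwidth factors
    lambda_i = (f_pilot(X_i) / g) ** (-alpha), g = geometric mean of the pilot."""
    pilot = fixed_kde(data, data, h)
    g = np.exp(np.mean(np.log(pilot)))
    lam = (pilot / g) ** (-alpha)                          # one factor per data point
    u = (x[:, None] - data[None, :]) / (h * lam[None, :])
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.mean(k / (h * lam[None, :]), axis=1)

data = np.random.default_rng(4).normal(0, 1, 200)
grid = np.linspace(-3, 3, 7)
print(adaptive_kde(grid, data, h=0.4))
```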
12 If the vector of predictors x can be decomposed as x = (x(1), x(2)), where x(1) contains the p1 categorical predictors and x(2) includes the p2 continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂_1(x(1)) f̂_2(x(2)),
where f̂_1(x(1)) is the kernel estimator for the vector x(1) of categorical predictors (Titterington, 1980) and f̂_2(x(2)) is the kernel density estimator of the vector x(2) of continuous predictors.
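A minimal sketch of such a mixed product kernel estimate (our own NumPy code; an Aitchison–Aitken-style kernel is one possible choice for the categorical block, and the smoothing parameters lam, n_levels, and h are illustrative assumptions).

```python
import numpy as np

def categorical_kernel(x1, X1, lam, n_levels):
    """Aitchison-Aitken-type kernel for the categorical block x(1):
    weight (1 - lam) when a coordinate matches, lam / (c - 1) otherwise."""
    match = (X1 == x1)                                     # (n, p1) boolean
    w = np.where(match, 1.0 - lam, lam / (n_levels - 1.0))
    return np.prod(w, axis=1)                              # one weight per training case

def continuous_kernel(x2, X2, h):
    """Product Gaussian kernel for the continuous block x(2)."""
    u = (x2 - X2) / h
    k = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return np.prod(k, axis=1) / h ** X2.shape[1]

def mixed_kde(x1, x2, X1, X2, lam=0.1, n_levels=3, h=0.5):
    """Mixed product kernel density estimate at the point (x(1), x(2))."""
    return np.mean(categorical_kernel(x1, X1, lam, n_levels) *
                   continuous_kernel(x2, X2, h))

# tiny demo: 2 categorical predictors with 3 levels and 2 continuous predictors
rng = np.random.default_rng(5)
X1 = rng.integers(0, 3, size=(100, 2))
X2 = rng.normal(0, 1, size=(100, 2))
print(mixed_kde(np.array([1, 2]), np.array([0.0, 0.3]), X1, X2))
```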
13 Problems
- The curse of dimensionality: feature selection via filters (the Relief; see the sketch after this list) or wrappers.
- Missing values: imputation.
14 Previous Results
In the Statlog Project (1994), classifiers based on KDE performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times respectively.
15 Experimental Methodology
Each dataset is randomly divided into 10 parts. The first part is taken as the test sample and the remaining ones as the training sample. Next, 50 bootstrapped samples are taken from the training sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting over the 50 classifiers. The same procedure is applied to each part, and then the experiment is repeated 10 times.
16 Datasets
17 Performance without feature selection
18 Feature selection
19 Performance after feature selection
20 Concluding Remarks
- Increasing the number of bootstrapped samples seems to improve the misclassification error for both types of classifiers.
- Without feature selection, the adaptive kernel performs better than the standard kernel, but it requires at least three times more computing time.
- After feature selection, the performance of bagging deteriorates for both types of kernels.
- Feature selection does a good job: after it, KDE classifiers give a lower ME while saving computing time.