1
Bagging classifiers based on kernel density
estimators
  • Edgar Acuña
  • Alex Rojas
  • Department of Mathematics
  • University of Puerto Rico
  • Mayagüez Campus
  • (www.math.uprm.edu/edgar)
  • This research is partially supported by ONR
    Grant N00014-00-1-0360 and NSF Grant EIA 99-77071

2
The Classification Problem
In the classification problem with J classes and M predictors, one has a learning set of data L = {(y_k, x_k), k = 1, ..., N} consisting of measurements on N cases, where each case consists of a categorical response variable y_k that takes values in {1, ..., J} and a feature vector x_k = (x_{1k}, ..., x_{Mk}). The goal of classification is to use L to construct a function C(x, L) that gives accurate classification of future data.
3
The Classification Problem
4
The Misclassification Error
Let C(X, L) be the classifier constructed by using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is then the proportion of misclassified cases of T using C. The ME can be decomposed as

ME(C) = ME(C*) + Bias²(C) + Var(C),

where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
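As a minimal illustration of one of these estimators, the following Python sketch computes ME by 10-fold cross-validation; fit_classifier is a hypothetical placeholder (not from the slides) that trains any classifier and returns a prediction function.

```python
import numpy as np

def cv_misclassification_error(X, y, fit_classifier, n_folds=10, seed=0):
    """Estimate ME by n-fold cross-validation: train on n-1 folds, test on the held-out fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        classifier = fit_classifier(X[train], y[train])  # returns a predict(x) -> class function
        predictions = np.array([classifier(x) for x in X[test]])
        errors.append(np.mean(predictions != y[test]))
    return float(np.mean(errors))
```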
5
The classifier may either overfit the data (low bias and large variance) or underfit the data (large bias and small variance). A classifier is said to be unstable (Breiman, 1996) if a small change in the data L can make large changes in the classification. Unstable classifiers have low bias but high variance. CART and neural networks are unstable classifiers. Linear discriminant analysis and k-nearest neighbor classifiers are stable.
6
Combining classifiers
Combining the predictions of several classifiers could lead to a reduction of the variance and bias of the ME. This combination is called an ensemble and is in general more accurate than the individual classifiers. Methods for creating ensembles include:
  • Bagging (bootstrap aggregating; Breiman, 1996)
  • AdaBoost (adaptive boosting; Freund and Schapire, 1996)
  • Arcing (adaptively resampling and combining; Breiman, 1998)
7
(No Transcript)
8
Previous results on combining classifiers
9
Bayesian approach to classification
An object with measurement vector x is assigned to the class j if P(j|x) > P(j'|x) for all j' ≠ j. By Bayes's theorem, P(j|x) = π_j f(x|j) / f(x), where
  • π_j is the prior of the j-th class,
  • f(x|j) is the class-conditional density, and
  • f(x) is the density function of x.
Hence j is the integer such that π_j f(x|j) is maximum.
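A minimal sketch of this rule in Python, assuming the priors π_j and class-conditional densities f(x|j) are available as callables (the names below are illustrative, not from the slides); note that f(x) cancels in the argmax and need not be computed.

```python
import numpy as np

def bayes_classify(x, priors, class_densities):
    """Assign x to the class j that maximizes pi_j * f(x | j).
    (f(x) cancels in the argmax, so it is never computed.)"""
    scores = {j: priors[j] * class_densities[j](x) for j in priors}
    return max(scores, key=scores.get)

# Illustrative example: two univariate Gaussian classes with equal priors.
gauss = lambda x, mu, sigma: np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
densities = {0: lambda x: gauss(x, 0.0, 1.0), 1: lambda x: gauss(x, 2.0, 1.0)}
print(bayes_classify(1.2, priors={0: 0.5, 1: 0.5}, class_densities=densities))  # -> 1
```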
10
Kernel density estimator classifiers
11
Problems
  • Choice of the bandwidth: fixed or adaptive.
  • Fixed: a single bandwidth for all observations.
  • Adaptive: kernel density estimation and k-nn are combined (Silverman, p. 101); a sketch of an adaptive estimator appears after this list.
  • Mixed types of predictors: continuous and categorical (binary, ordinal, nominal).
  • Use of a product kernel.
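As an illustration of the adaptive option, here is a sketch of Silverman-style adaptive kernel density estimation in one dimension with a Gaussian kernel and a fixed-bandwidth pilot; the slides' variant combines the adaptive estimator with k-nn, which is not reproduced here, and all names are illustrative.

```python
import numpy as np

def adaptive_kde(x_grid, data, h, alpha=0.5):
    """Silverman-style adaptive KDE (1-D Gaussian kernel, illustrative sketch).

    1. Pilot fixed-bandwidth estimate at each data point.
    2. Local bandwidth factors lambda_i = (pilot_i / g)^(-alpha), g = geometric mean.
    3. Final estimate uses bandwidth h * lambda_i for the i-th data point.
    """
    data = np.asarray(data, dtype=float)
    gauss = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

    # 1. Pilot estimate (fixed bandwidth h) evaluated at the data points.
    pilot = np.array([np.mean(gauss((xi - data) / h)) / h for xi in data])

    # 2. Local bandwidth factors.
    g = np.exp(np.mean(np.log(pilot)))
    lam = (pilot / g) ** (-alpha)

    # 3. Adaptive estimate on the evaluation grid.
    return np.array([np.mean(gauss((x - data) / (h * lam)) / (h * lam)) for x in x_grid])
```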

12
If the vector of predictors x can be decomposed as x = (x^(1), x^(2)), where x^(1) contains the p1 categorical predictors and x^(2) contains the p2 continuous predictors, then a mixed product kernel density estimator is given by

f̂(x) = f̂_1(x^(1)) · f̂_2(x^(2)),

where f̂_1 is the kernel estimator for the vector x^(1) of categorical predictors (Titterington, 1980) and f̂_2 is the kernel density estimator of the vector x^(2) of continuous predictors.
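A minimal sketch of one such mixed product kernel estimator, assuming an Aitchison-Aitken-type kernel for each categorical variable and a univariate Gaussian kernel for each continuous variable (the slides do not specify the exact kernels, so these choices are illustrative):

```python
import numpy as np

def mixed_product_kde(x_cat, x_cont, data_cat, data_cont, lambdas, n_categories, h):
    """Mixed product kernel density estimate at a point x = (x_cat, x_cont).

    Categorical part: Aitchison-Aitken kernel per variable (one possible choice
    of categorical kernel in the spirit of Titterington, 1980).
    Continuous part : Gaussian product kernel with one bandwidth per variable.
    """
    data_cat = np.asarray(data_cat)           # shape (n, p1), integer-coded categories
    data_cont = np.asarray(data_cont, float)  # shape (n, p2)
    lambdas = np.asarray(lambdas, float)      # smoothing parameters, one per categorical variable
    n_categories = np.asarray(n_categories)   # number of levels per categorical variable
    h = np.asarray(h, float)                  # bandwidths, one per continuous variable

    # Categorical product kernel: 1 - lambda on a match, lambda / (c - 1) on a mismatch.
    match = (data_cat == np.asarray(x_cat))
    k_cat = np.where(match, 1.0 - lambdas, lambdas / (n_categories - 1)).prod(axis=1)

    # Continuous product kernel: product of univariate Gaussian kernels.
    u = (np.asarray(x_cont, float) - data_cont) / h
    k_cont = (np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * h)).prod(axis=1)

    # Average of the per-observation mixed kernels.
    return float(np.mean(k_cat * k_cont))
```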
13
Problems
  • The curse of dimensionality: feature selection via filters (the Relief filter) or wrappers. A sketch of the Relief filter appears after this list.
  • Missing values: imputation.
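For illustration, a minimal sketch of the Relief filter idea (a basic binary-class, continuous-feature version; the slides do not specify which Relief variant was used):

```python
import numpy as np

def relief_weights(X, y, n_samples=100, seed=0):
    """Relief filter sketch (binary classes, continuous features).

    For each sampled instance, find its nearest hit (same class) and nearest miss
    (other class) and reward features that separate the miss but not the hit.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    w = np.zeros(X.shape[1])
    for i in rng.integers(len(y), size=n_samples):
        d = np.abs(X - X[i]).sum(axis=1)   # L1 distances to the sampled instance
        d[i] = np.inf                       # exclude the instance itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, d, np.inf))
        miss = np.argmin(np.where(other, d, np.inf))
        w += (np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])) / (span * n_samples)
    return w  # higher weight = more relevant feature
```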

14
Previous Results
  • In the Statlog Project (1994), classifiers based on KDE performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times, respectively.

15
Experimental Methodology
  • Each dataset is randomly divided into 10 parts. The first part is taken as the test sample and the remaining ones as the training sample. Next, 50 bootstrapped samples are taken from the training sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting among the 50 classifiers.
  • The same procedure is applied to each part, and then the experiment is repeated 10 times. (A sketch of this protocol appears after this list.)
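A minimal sketch of this protocol in Python; fit_kde_classifier is a hypothetical placeholder for the KDE classifier described above, and class labels are assumed to be integer-coded.

```python
import numpy as np

def bagged_vote_error(X, y, fit_kde_classifier, n_parts=10, n_boot=50, n_runs=10, seed=0):
    """Sketch of the experimental protocol: 10-fold splits, 50 bootstrapped KDE
    classifiers per fold, majority-vote prediction, repeated n_runs times.

    fit_kde_classifier(X_train, y_train) -> function mapping a single x to a class
    (a placeholder for the KDE classifier described in the slides).
    Assumes integer-coded class labels so that np.bincount can tally the votes.
    """
    rng = np.random.default_rng(seed)
    run_errors = []
    for _ in range(n_runs):
        folds = np.array_split(rng.permutation(len(y)), n_parts)
        fold_errors = []
        for k in range(n_parts):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(n_parts) if j != k])
            # Build n_boot classifiers, each on a bootstrap sample of the training part.
            classifiers = []
            for _ in range(n_boot):
                boot = rng.choice(train, size=len(train), replace=True)
                classifiers.append(fit_kde_classifier(X[boot], y[boot]))
            # Majority vote over the bootstrapped classifiers for each test instance.
            votes = np.array([[clf(x) for clf in classifiers] for x in X[test]])
            pred = np.array([np.bincount(row).argmax() for row in votes])
            fold_errors.append(np.mean(pred != y[test]))
        run_errors.append(np.mean(fold_errors))
    return float(np.mean(run_errors))
```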

16
Datasets
17
Performance without feature selection
18
Feature selection
19
Performance after feature selection
20
Concluding Remarks
  • Increasing the number of bootstrapped samples seems to improve the misclassification error for both types of classifiers.
  • Without feature selection, the adaptive kernel performs better than the standard kernel, but it requires at least three times more computing time.
  • After feature selection, the performance of bagging deteriorates for both types of kernels.
  • Feature selection does a good job: after it, the KDE classifiers give a lower ME while saving computing time.