1
Combining classifiers based on kernel density
estimators
  • Edgar Acuña
  • Alex Rojas
  • Department of Mathematics
  • University of Puerto Rico
  • Mayagüez Campus
  • (www.math.uprm.edu/edgar)
  • This research is partially supported by Grant
    N00014-00-1-0360 from ONR

2
The Classification Problem
In the classification problem with J classes and M predictors, one has a learning set of data L = {(y_k, x_k), k = 1, ..., N} consisting of measurements on N cases, where each case consists of a categorical response variable y_k that takes values in {1, ..., J} and a feature vector x_k = (x_1k, ..., x_Mk). The goal of classification is to use L to construct a function C(x, L) that gives accurate classification of future data.
3
The Classification problem
4
The Misclassification Error
Let C(X, L) be the classifier constructed by using the training sample L, and let T be another large sample drawn from the same population as L. The misclassification error (ME) of the classifier C is the proportion of misclassified cases of T using C. The ME can be decomposed as
ME(C) = ME(C*) + Bias^2(C) + Var(C),
where C*(x) = argmax_j P(Y = j | X = x) is the Bayes classifier.
Methods to estimate ME: resubstitution, cross-validation, bootstrapping.
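As an illustration (not part of the original slides), the sketch below computes the resubstitution, 10-fold cross-validation, and bootstrap out-of-bag estimates of the ME with scikit-learn; the synthetic data and the k-nearest-neighbor base classifier are assumptions made for the example.

```python
# Sketch: estimating the misclassification error (ME) by resubstitution,
# cross-validation, and bootstrapping. Dataset and classifier are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)

# Resubstitution: error on the same data used for training (optimistic).
resub = 1.0 - clf.fit(X, y).score(X, y)

# 10-fold cross-validation error.
cv = 1.0 - cross_val_score(clf, X, y, cv=10).mean()

# Simple bootstrap: train on a bootstrap sample, test on the left-out cases.
rng = np.random.default_rng(0)
boot_errors = []
for _ in range(50):
    idx = rng.integers(0, len(y), len(y))
    oob = np.setdiff1d(np.arange(len(y)), idx)
    fitted = KNeighborsClassifier(n_neighbors=5).fit(X[idx], y[idx])
    boot_errors.append(1.0 - fitted.score(X[oob], y[oob]))

print(f"resubstitution ME: {resub:.3f}")
print(f"cross-validation ME: {cv:.3f}")
print(f"bootstrap (out-of-bag) ME: {np.mean(boot_errors):.3f}")
```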
5
The classifier may either overfit the data (low bias and large variance) or underfit the data (large bias and small variance). A classifier is unstable (Breiman, 1996) if a small change in the data L can make large changes in the classification. Unstable classifiers have low bias but high variance. CART and neural networks are unstable classifiers. Linear discriminant analysis and k-nearest neighbor classifiers are stable.
6
Combining classifiers
By combining the predictions of several classifiers, the variance and bias of the ME can be reduced. Such a combination is called an ensemble and is in general more accurate than the individual classifiers. Methods for creating ensembles include:
  • Bagging (bootstrap aggregating; Breiman, 1996)
  • AdaBoosting (adaptive boosting; Freund and Schapire, 1996)
  • Arcing (adaptively resampling and combining; Breiman, 1998)
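A minimal bagging sketch, assuming scikit-learn and a decision tree as a stand-in for CART as the unstable base learner; it compares the cross-validated ME of a single classifier with that of a 50-member bagged ensemble.

```python
# Sketch: bagging an unstable base classifier (a decision tree stands in for
# CART). The ensemble votes over trees fit on bootstrap samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, n_classes=3,
                           n_informative=5, random_state=0)

single = DecisionTreeClassifier(random_state=0)
bagged = BaggingClassifier(DecisionTreeClassifier(random_state=0),
                           n_estimators=50, random_state=0)

print("single tree ME:", 1.0 - cross_val_score(single, X, y, cv=10).mean())
print("bagged trees ME:", 1.0 - cross_val_score(bagged, X, y, cv=10).mean())
```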
7
(No Transcript)
8
(No Transcript)
9
Previous results on combining classifiers
10
Bayesian approach to classification
An object with measurement vector x is assigned to the class j if P(j | x) > P(j' | x) for all j' ≠ j. By Bayes's theorem, P(j | x) = π_j f(x | j) / f(x), where π_j is the prior of the j-th class, f(x | j) is the class-conditional density, and f(x) is the density function of x. Hence j is the integer such that π_j f(x | j) is maximum.
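A small sketch of this plug-in Bayes rule, assuming Gaussian kernel density estimates for the class-conditional densities f(x | j); the dataset and bandwidth are illustrative choices, not taken from the slides.

```python
# Sketch: plug-in Bayes rule  j* = argmax_j  pi_j * f(x | j),
# with each class-conditional density f(x | j) estimated by a Gaussian KDE.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KernelDensity

X, y = load_iris(return_X_y=True)
classes = np.unique(y)

priors = {j: np.mean(y == j) for j in classes}                      # pi_j
kdes = {j: KernelDensity(bandwidth=0.5).fit(X[y == j]) for j in classes}

def classify(x):
    """Assign x to the class with the largest pi_j * f(x | j)."""
    x = np.atleast_2d(x)
    # score_samples returns log f(x | j); add the log prior and take the max.
    scores = {j: np.log(priors[j]) + kdes[j].score_samples(x)[0] for j in classes}
    return max(scores, key=scores.get)

print(classify(X[0]), y[0])   # predicted class vs. true class
```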
11
Kernel density estimator classifiers
12
If the vector of predictors x can be decomposed as x = (x^(1), x^(2)), where x^(1) contains the p1 categorical predictors and x^(2) contains the p2 continuous predictors, then a mixed product kernel density estimator is given by
f̂(x) = f̂1(x^(1)) f̂2(x^(2)),
where f̂1 is the kernel estimator for the vector x^(1) of categorical predictors and f̂2 is the kernel density estimator of the vector x^(2) of continuous predictors.
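One way to sketch a mixed kernel estimate is statsmodels' KDEMultivariate, which forms a generalized product kernel (an Aitchison-Aitken kernel for unordered categorical variables, a Gaussian kernel for continuous ones); this is a standard implementation and not necessarily the exact estimator of the slides, and the data below are synthetic.

```python
# Sketch: a mixed product kernel density estimate over categorical and
# continuous predictors, using statsmodels' generalized product kernel.
import numpy as np
from statsmodels.nonparametric.kernel_density import KDEMultivariate

rng = np.random.default_rng(0)
n = 300
x_cat = rng.integers(0, 3, size=n)          # one categorical predictor (p1 = 1)
x_cont = rng.normal(size=(n, 2))            # two continuous predictors (p2 = 2)
data = np.column_stack([x_cat, x_cont])

# var_type: 'u' = unordered categorical, 'c' = continuous.
kde = KDEMultivariate(data=data, var_type='ucc', bw='normal_reference')

# Density estimate at a query point (categorical value, then the continuous ones).
print(kde.pdf(np.array([1, 0.0, 0.0])))
```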
13
Problems
  • Choice of the bandwidth: fixed or adaptive.
  • Mixed types of predictors: continuous and discrete.
  • The curse of dimensionality: feature selection with filters (e.g., the Relief) or wrappers; a sketch of the Relief filter follows this list.
  • Missing values: imputation.
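A from-scratch sketch of the basic Relief filter (Kira and Rendell, 1992) mentioned above, under the assumption of two classes and numeric features; it is illustrative only, not the implementation used in this work.

```python
# Sketch of the basic Relief filter: for sampled instances, reward features
# that differ on the nearest miss and penalize those that differ on the
# nearest hit. Binary-class, numeric data assumed.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler

def relief(X, y, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    X = MinMaxScaler().fit_transform(X)           # make feature diffs comparable
    w = np.zeros(X.shape[1])
    for i in rng.integers(0, len(y), n_iter):
        same, other = (y == y[i]), (y != y[i])
        same[i] = False                           # exclude the instance itself
        dists = np.abs(X - X[i]).sum(axis=1)
        hit = X[same][np.argmin(dists[same])]     # nearest neighbor, same class
        miss = X[other][np.argmin(dists[other])]  # nearest neighbor, other class
        w += (np.abs(X[i] - miss) - np.abs(X[i] - hit)) / n_iter
    return w                                      # larger weight = more relevant

X, y = make_classification(n_samples=300, n_features=6, n_informative=2,
                           n_redundant=0, n_classes=2, random_state=0)
print(np.round(relief(X, y), 3))
```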

14
Previous Results
  • In the Statlog Project (1994), classifiers based on KDE performed better than CART (a 13-8 win) and tied with C4.5 (11-11). They also appeared among the top 5 classifiers for 11 datasets, whereas C4.5 and CART appeared only 6 and 3 times, respectively.

15
Experimental Methodology
  • Each dataset is randomly divided into 10 parts. The first part is taken as the test sample and the remaining parts as the training sample. Next, 50 bootstrapped samples are taken from the training sample and a KDE classifier is constructed with each of them. Finally, each instance of the test sample is assigned to a class by voting among the 50 classifiers.
  • The procedure is repeated with each part, and then the whole experiment is repeated 10 times.
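A compact sketch of this procedure, assuming a Gaussian-KDE Bayes classifier as the base learner and scikit-learn for the 10-fold split; the bandwidth and dataset are illustrative, and only one of the 10 repetitions is shown.

```python
# Sketch of the described experiment: split the data into 10 parts, hold one
# part out as the test sample, draw 50 bootstrap samples from the rest, fit a
# KDE-Bayes classifier on each, and classify test cases by majority vote.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KernelDensity

def fit_kde_bayes(X, y, bandwidth=0.5):
    classes = np.unique(y)
    priors = {j: np.mean(y == j) for j in classes}
    kdes = {j: KernelDensity(bandwidth=bandwidth).fit(X[y == j]) for j in classes}
    def predict(Xtest):
        scores = np.column_stack(
            [np.log(priors[j]) + kdes[j].score_samples(Xtest) for j in classes])
        return classes[np.argmax(scores, axis=1)]
    return predict

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
errors = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    # 50 bootstrap samples from the training part, one KDE classifier each.
    votes = []
    for _ in range(50):
        b = rng.choice(train_idx, size=len(train_idx), replace=True)
        votes.append(fit_kde_bayes(X[b], y[b])(X[test_idx]))
    # Majority vote over the 50 classifiers for each test case.
    majority = np.array([np.bincount(col).argmax() for col in np.array(votes).T])
    errors.append(np.mean(majority != y[test_idx]))

print("bagged KDE misclassification error:", np.mean(errors))
```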

16
Performance without feature selection
17
Performance after feature selection
18
Conclusions
  • Increasing the number of bootstrapped samples seems to improve the misclassification error for both types of classifiers.
  • Without feature selection, the adaptive kernel performs better than the standard kernel, but it requires at least three times more computing time.
  • After feature selection, the performance of bagging deteriorates for both types of kernels.
  • Feature selection does a good job: after it, the KDE classifiers give a lower ME while saving computing time.