1
Bayes classifiers

Edgar Acuna

2
Bayes Classifiers

A formidable and sworn enemy of decision trees

3
How to build a Bayes Classifier

Assume you want to predict an output Y that has arity nY and values v1, v2, ..., vnY.
Assume there are m input attributes called X1, X2, ..., Xm.
Break the dataset into nY smaller datasets called DS1, DS2, ..., DSnY.
Define DSi = records in which Y = vi.
For each DSi, learn a density estimator Mi to model the input distribution among the Y = vi records.

4
How to build a Bayes Classifier

Mi estimates P(X1, X2, ..., Xm | Y = vi).

5
How to build a Bayes Classifier

Idea: When a new set of input values (X1 = u1, X2 = u2, ..., Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, ..., Xm | Y = vi) most likely.

Is this a good idea?
6
How to build a Bayes Classifier


Is this a good idea?
This is a Maximum Likelihood classifier. It can get silly if some values of Y are very unlikely.
7
How to build a Bayes Classifier

Idea: When a new set of input values (X1 = u1, X2 = u2, ..., Xm = um) comes along to be evaluated, predict the value of Y that makes P(Y = vi | X1, X2, ..., Xm) most likely.

Is this a good idea?
Much better idea.
8
Terminology

MLE (Maximum Likelihood Estimator)

MAP (Maximum A-Posteriori Estimator)
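In the notation of the earlier slides, the two terms correspond to the two prediction rules already discussed. A sketch of the usual definitions, assuming the new input is (u1, ..., um) and v ranges over the values of Y:

```latex
\begin{aligned}
Y^{\text{predict}}_{\text{MLE}} &= \arg\max_{v}\; P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v) \\
Y^{\text{predict}}_{\text{MAP}} &= \arg\max_{v}\; P(Y = v \mid X_1 = u_1, \ldots, X_m = u_m)
\end{aligned}
```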

9
Getting what we need
10
Getting a posterior probability
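The posterior follows from Bayes' rule. A sketch in the notation of the earlier slides, where the denominator sums over all nY values of Y:

```latex
P(Y = v \mid X_1 = u_1, \ldots, X_m = u_m)
  = \frac{P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v)\, P(Y = v)}
         {\sum_{j=1}^{n_Y} P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v_j)\, P(Y = v_j)}
```

The denominator is the same for every v, so it can be ignored when only the most probable class is needed.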
11
Bayes Classifiers in a nutshell
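1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, ..., Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction, choose the value vi that maximizes P(Y = vi | X1 = u1, ..., Xm = um), which is proportional to P(X1 = u1, ..., Xm = um | Y = vi) P(Y = vi).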
12
Bayes Classifiers in a nutshell
For step 1 we can use our favorite density estimator. Right now we have two options:
Joint Density Estimator
Naïve Density Estimator

13
Joint Density Bayes Classifier
In the case of the joint Bayes classifier this degenerates to a very simple rule: Ypredict = the class containing the most records in which X1 = u1, X2 = u2, ..., Xm = um.
Note that if no records have the exact set of inputs X1 = u1, X2 = u2, ..., Xm = um, then P(X1, X2, ..., Xm | Y = vi) = 0 for all values of Y. In that case we just have to guess Y's value.
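The counting rule can be sketched in base R. The data frame and the helper name `joint_bc_predict` are made up for illustration; they are not from the slides.

```r
# Toy training data (made up for illustration): two Boolean inputs and a class label.
train <- data.frame(
  X1 = c(0, 0, 0, 1, 1, 1, 0, 1),
  X2 = c(0, 0, 1, 1, 1, 0, 0, 1),
  Y  = c("a", "a", "a", "b", "b", "b", "b", "a")
)

# Hypothetical helper: joint-density Bayes classifier by record counting.
# P(X = u | Y = v) * P(Y = v) equals count(X = u, Y = v) / n, so the most
# probable class is the one with the most training records matching u exactly.
joint_bc_predict <- function(train, u, target = "Y") {
  inputs <- setdiff(names(train), target)
  # Logical vector: which records match every input value in u?
  match_all <- rep(TRUE, nrow(train))
  for (name in inputs) {
    match_all <- match_all & (train[[name]] == u[[name]])
  }
  if (!any(match_all)) {
    return(NA_character_)  # no matching record: the estimate is 0 for every class
  }
  counts <- table(train[[target]][match_all])
  names(counts)[which.max(counts)]
}

joint_bc_predict(train, list(X1 = 0, X2 = 0))  # "a": two matching "a" records, one "b"
joint_bc_predict(train, list(X1 = 1, X2 = 1))  # "b": two matching "b" records, one "a"
joint_bc_predict(train, list(X1 = 2, X2 = 0))  # NA: unseen input combination
```

Returning NA for an unseen combination is a design choice of this sketch; the slide only says that one has to guess in that case.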
14
Example
15
Example, continued
X1 = 0, X2 = 0, X3 = 1 will be assigned to class 1.
Note also that the record (0, 0, 1) appears more often in this class than in class 0.
16
Naïve Bayes Classifier
In the case of the naive Bayes Classifier this
can be simplified
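Under the naive assumption that the inputs are independent given the class, the joint class-conditional probability factorizes. A sketch of the resulting rule in the notation of the earlier slides:

```latex
Y^{\text{predict}} = \arg\max_{v}\; P(Y = v) \prod_{j=1}^{m} P(X_j = u_j \mid Y = v)
```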
17
Naïve Bayes Classifier
Technical hint: If you have 10,000 input attributes, the product of the per-attribute probabilities will underflow in floating-point math. You should use logs.
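A minimal base R sketch of naive Bayes scoring with logs. The toy data and the helper name `naive_bc_log_scores` are assumptions made for illustration, not the slides' example.

```r
# Toy Boolean data (made up): three inputs and one class label.
train <- data.frame(
  X1 = c(0, 0, 1, 1, 0, 1),
  X2 = c(0, 1, 1, 1, 0, 0),
  X3 = c(1, 1, 0, 0, 1, 0),
  Y  = c(1, 1, 0, 0, 1, 0)
)

# Hypothetical helper: score each class v by
#   log P(Y = v) + sum_j log P(Xj = uj | Y = v)
# instead of multiplying the probabilities directly.
naive_bc_log_scores <- function(train, u, target = "Y") {
  inputs <- setdiff(names(train), target)
  classes <- sort(unique(train[[target]]))
  scores <- sapply(classes, function(v) {
    rows <- train[train[[target]] == v, , drop = FALSE]
    log_prior <- log(nrow(rows) / nrow(train))
    log_lik <- sum(sapply(inputs, function(name) {
      log(mean(rows[[name]] == u[[name]]))  # log of the per-attribute frequency
    }))
    log_prior + log_lik
  })
  names(scores) <- classes
  scores
}

scores <- naive_bc_log_scores(train, list(X1 = 0, X2 = 0, X3 = 1))
names(scores)[which.max(scores)]  # "1"; class 0 scores -Inf because X1 = 0 never occurs in it

# Why logs: a product of 10,000 probabilities of 0.5 underflows to 0,
# while the sum of their logs is an ordinary number.
prod(rep(0.5, 10000))      # 0
sum(log(rep(0.5, 10000)))  # about -6931.47
```

Because the logarithm is monotone, the class with the largest log score is also the class with the largest product.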
18
Example, continued
X1 = 0, X2 = 0, X3 = 1 will be assigned to class 1.
19
BC Results: XOR
The XOR dataset consists of 40,000 records and 2 Boolean inputs called a and b, generated 50-50 randomly as 0 or 1. The output is c = a XOR b.
The classifier learned by Joint BC
The classifier learned by Naive BC
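A short base R sketch of why the two classifiers behave differently on XOR. It uses a deterministic, perfectly balanced copy of the truth table (an assumption of this sketch; the slide's data were random) so that the frequencies are exact.

```r
# Balanced XOR truth table replicated to 40,000 records.
xor_data <- expand.grid(a = c(0, 1), b = c(0, 1))
xor_data <- xor_data[rep(1:4, times = 10000), ]
xor_data$c <- as.integer(xor_data$a != xor_data$b)

# Naive view: per-class marginal of each input taken on its own.
p_a1_given_c <- tapply(xor_data$a, xor_data$c, mean)  # P(a = 1 | c)
p_b1_given_c <- tapply(xor_data$b, xor_data$c, mean)  # P(b = 1 | c)
p_a1_given_c  # 0.5 for both classes
p_b1_given_c  # 0.5 for both classes

# Joint view: counts of each (a, b) combination per class.
joint_counts <- table(a = xor_data$a, b = xor_data$b, c = xor_data$c)
joint_counts[, , "1"]  # nonzero only where a != b
```

Each input alone carries no information about c, so the naive estimator scores both classes equally, while the joint estimator sees that every (a, b) combination belongs to exactly one class.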
20
BC Results: MPG (392 records)
The classifier learned by Naive BC
21
More Facts About Bayes Classifiers

Many other density estimators can be slotted in.
Density estimation can be performed with real-valued inputs.
Bayes classifiers can be built with real-valued inputs.
Rather technical complaint: Bayes classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on.
Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet prior") can help.
Naïve Bayes is wonderfully cheap, and survives 10,000 attributes cheerfully!
See future Andrew Lectures
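One common form of the zero-probability hack is additive smoothing, which corresponds to a symmetric Dirichlet prior on the category probabilities. A base R sketch; the helper name `smoothed_probs` and the data are made up, and the slides do not say which variant was intended.

```r
# Hypothetical helper: smoothed estimate of P(X = value | class) from counts.
# alpha = 1 is add-one (Laplace) smoothing; alpha = 0 gives the raw frequencies.
smoothed_probs <- function(values, levels, alpha = 1) {
  counts <- table(factor(values, levels = levels))
  (counts + alpha) / (sum(counts) + alpha * length(levels))
}

# Made-up attribute values seen in one class; "green" never occurs.
seen <- c("red", "red", "blue", "red")
all_levels <- c("red", "blue", "green")

smoothed_probs(seen, all_levels, alpha = 0)  # red 0.75, blue 0.25, green 0
smoothed_probs(seen, all_levels, alpha = 1)  # red 4/7, blue 2/7, green 1/7
```

With alpha > 0 no estimate is exactly zero, so a single unseen attribute value can no longer veto a class.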

22
Naïve Bayes classifier
The naïve Bayes classifier can be applied when there are continuous predictors, but a discretization method has to be applied first, such as equal-width intervals, equal-frequency intervals, ChiMerge, 1R, or entropy-based discretization with various stopping criteria. All of these are available in the dprep library (see disc.mentr, disc.ew, disc.ef, etc.). The R library e1071 contains a function naiveBayes that computes the naïve Bayes classifier. If a variable is left continuous, it assumes that the variable follows a Gaussian distribution.
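The two simplest discretization schemes can be sketched with base R's `cut` and `quantile`. These are stand-ins chosen for this sketch, not the dprep functions, and the data are made up.

```r
# Made-up continuous predictor with one outlier.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

# Equal-width: 3 intervals of the same length over the range of x.
equal_width <- cut(x, breaks = 3, labels = FALSE)

# Equal-frequency: breakpoints at the quantiles, so each bin holds
# roughly the same number of observations.
breaks_ef <- quantile(x, probs = seq(0, 1, length.out = 4))
equal_freq <- cut(x, breaks = breaks_ef, include.lowest = TRUE, labels = FALSE)

table(equal_width)  # the outlier 100 leaves the other nine values in the first bin
table(equal_freq)   # bins are roughly balanced
```

The comparison shows why equal-width bins can be a poor choice for skewed predictors.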
23
The misclassification error rate
The misclassification error rate R(d) is the probability that the classifier d incorrectly classifies an instance coming from a sample (the test sample) obtained at a later stage than the training sample. It is also called the true error or the actual error. It is an unknown value that needs to be estimated.
24
Methods for estimation of the misclassification
error rate

i) Resubstitution or apparent error (Smith, 1947). This is merely the proportion of instances in the training sample that are incorrectly classified by the classification rule. In general it is an overly optimistic estimator, and it can lead to wrong conclusions if the number of instances is not large compared with the number of features. This estimator has a large bias.
ii) Leave-one-out estimation (Lachenbruch, 1965). In this case one instance is omitted from the training sample. The classifier is then built and the prediction for the omitted instance is obtained, recording whether it was classified correctly or incorrectly. The process is repeated for every instance in the training sample, and the estimate of the misclassification error is the proportion of instances incorrectly classified. This estimator has low bias but its variance tends to be large.
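Both estimators can be sketched in a few lines of base R. The data and the 1-nearest-neighbour rule `nn1_predict` are hypothetical choices made for this sketch; any classifier could take its place.

```r
# Made-up one-dimensional data with overlapping classes.
x <- c(1.0, 1.2, 1.9, 2.1, 2.4, 3.0, 3.1, 3.6)
y <- c("A", "A", "A", "B", "A", "B", "B", "B")

# Hypothetical 1-nearest-neighbour rule used as the classifier d.
nn1_predict <- function(train_x, train_y, new_x) {
  sapply(new_x, function(v) train_y[which.min(abs(train_x - v))])
}

# Resubstitution: classify the training sample with a rule built on all of it.
resub_error <- mean(nn1_predict(x, y, x) != y)

# Leave-one-out: omit instance i, rebuild the rule, predict instance i.
loo_wrong <- sapply(seq_along(x), function(i) {
  nn1_predict(x[-i], y[-i], x[i]) != y[i]
})
loo_error <- mean(loo_wrong)

resub_error  # 0: each point is its own nearest neighbour
loo_error    # 0.375: 3 of the 8 instances are misclassified
```

The gap between the two numbers illustrates the optimism of the resubstitution estimate.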

25
Methods for estimation of the misclassification
error rate

iii) Cross-validation (Stone, 1974). In this case the training sample is randomly divided into v parts (v = 10 is the most common choice). The classifier is then built using all the parts but one. The omitted part is treated as the test sample, and predictions are obtained for each instance in it. This is repeated with each part omitted in turn. The CV misclassification error rate is found by adding the misclassifications in each part and dividing by the total number of instances. The CV estimate has low bias but high variance. To reduce the variability, the estimation is usually repeated several times. Estimating the variance is a hard problem (Bengio and Grandvalet, 2004).
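A base R sketch of v-fold cross-validation with v = 10. The simulated data, the seed, and the 1-nearest-neighbour rule `nn1_predict` are assumptions made for illustration.

```r
set.seed(1)  # arbitrary seed so the random split is reproducible

# Made-up data: one predictor, two overlapping classes of 50 instances each.
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 1.5))
y <- rep(c("A", "B"), each = 50)

# Hypothetical 1-nearest-neighbour rule used as the classifier.
nn1_predict <- function(train_x, train_y, new_x) {
  sapply(new_x, function(v) train_y[which.min(abs(train_x - v))])
}

v <- 10
fold <- sample(rep(1:v, length.out = length(x)))  # random assignment to v parts

# For each part: train on the other v - 1 parts, count errors on the omitted one.
errors_per_fold <- sapply(1:v, function(f) {
  test <- fold == f
  sum(nn1_predict(x[!test], y[!test], x[test]) != y[test])
})

# Add the misclassifications over all parts and divide by the number of instances.
cv_error <- sum(errors_per_fold) / length(x)
cv_error
```

Repeating the whole procedure with different random splits and averaging the results is one way to reduce the variability mentioned above.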

26
Methods for estimation of the misclassification
error rate
iv) The holdout method. A percentage (70%) of the dataset is taken as the training sample and the remainder as the test sample. The classifier is evaluated on the test sample. The experiment is repeated several times and the average is taken.
v) Bootstrapping (Efron, 1983). In this method we generate several training samples by sampling with replacement from the original training sample. The idea is to reduce the bias of the resubstitution error. It is almost unbiased, but it has a large variance. Its computational cost is high. Several variants of this method exist.
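A base R sketch of both methods. The simulated data, the seed, the 20 repetitions, and the 1-nearest-neighbour rule are assumptions of this sketch. The bootstrap shown tests each resampled rule on the instances that were not drawn, which is only one of the several variants and not necessarily the one the slide has in mind.

```r
set.seed(2)  # arbitrary seed

# Made-up data: one predictor, two overlapping classes of 50 instances each.
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 1.5))
y <- rep(c("A", "B"), each = 50)
n <- length(x)

# Hypothetical 1-nearest-neighbour rule used as the classifier.
nn1_predict <- function(train_x, train_y, new_x) {
  sapply(new_x, function(v) train_y[which.min(abs(train_x - v))])
}

# Holdout: 70% training, 30% test, repeated and averaged.
holdout_once <- function() {
  train <- sample(n, size = round(0.7 * n))
  mean(nn1_predict(x[train], y[train], x[-train]) != y[-train])
}
holdout_error <- mean(replicate(20, holdout_once()))

# Bootstrap: train on a sample drawn with replacement, test on the
# instances left out of that sample.
bootstrap_once <- function() {
  boot <- sample(n, size = n, replace = TRUE)
  out <- setdiff(seq_len(n), boot)
  mean(nn1_predict(x[boot], y[boot], x[out]) != y[out])
}
bootstrap_error <- mean(replicate(20, bootstrap_once()))

holdout_error
bootstrap_error
```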
27
Naive Bayes for Bupa
Without discretization:
> library(e1071)  # provides naiveBayes
> a <- naiveBayes(V7 ~ ., data = bupa)
> pred <- predict(a, bupa[, -7], type = "raw")
> pred1 <- max.col(pred)
> table(pred1, bupa[, 7])
pred1   1   2
    1 112 119
    2  33  81
> error <- 152/345
> error
[1] 0.4405797
Discretizing with the entropy method:
> library(dprep)  # provides disc.mentr
> dbupa <- disc.mentr(bupa, 1:7)
> b <- naiveBayes(V7 ~ ., data = dbupa)
> pred <- predict(b, dbupa[, -7])
> table(pred, dbupa[, 7])
pred   1   2
   1  79  61
   2  66 139
> error1 <- 127/345
> error1
[1] 0.3681159
28
Naïve Bayes for Diabetes
Without discretization:
> a <- naiveBayes(V9 ~ ., data = diabetes)
> pred <- predict(a, diabetes[, -9], type = "raw")
> pred1 <- max.col(pred)
> table(pred1, diabetes[, 9])
pred1   1   2
    1 421 104
    2  79 164
> error <- (79 + 104)/768
> error
[1] 0.2382813
Discretizing:
> ddiabetes <- disc.mentr(diabetes, 1:9)
> b <- naiveBayes(V9 ~ ., data = ddiabetes)
> pred <- predict(b, ddiabetes[, -9])
> table(pred, ddiabetes[, 9])
pred   1   2
   1 418  84
   2  82 184
> 166/768
[1] 0.2161458
29
Naïve Bayes using ChiMerge discretization
> chibupa <- chiMerge(bupa, 1:6)
> b <- naiveBayes(V7 ~ ., data = chibupa)
> pred <- predict(b, chibupa[, -7])
> table(pred, chibupa[, 7])
pred   1   2
   1 117  21
   2  28 179
> error <- 49/345
> error
[1] 0.1420290
> chidiab <- chiMerge(diabetes, 1:8)
> b <- naiveBayes(V9 ~ ., data = chidiab)
> pred <- predict(b, chidiab[, -9])
> table(pred, chidiab[, 9])
pred   1   2
   1 457  33
   2  43 235
> error <- 76/768
> error
[1] 0.09895833
30
Other Bayesian classifiers
Linear Discriminant Analysis (LDA). Here the class-conditional density P(X1, ..., Xm | Y = vj) is assumed to be multivariate normal for each vj. It is further assumed that the covariance matrix is the same for every class. The decision rule for assigning an object x then reduces to comparing one score per class. Note that the decision rule is linear in the predictor vector x. Strictly speaking, it should be applied only when the predictors are continuous.
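A sketch of the standard form of the LDA rule under those two assumptions. The symbols are assumptions of this sketch rather than the slide's notation: \mu_j is the mean vector of class v_j, \Sigma the common covariance matrix, and \pi_j = P(Y = v_j) the prior.

```latex
\delta_j(x) = x^{\top} \Sigma^{-1} \mu_j
              - \tfrac{1}{2}\, \mu_j^{\top} \Sigma^{-1} \mu_j
              + \log \pi_j ,
\qquad
Y^{\text{predict}} = \arg\max_{j}\; \delta_j(x)
```

The quadratic term x^{\top} \Sigma^{-1} x is the same for every class and cancels, which is why each score is linear in x.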
31
LDA examples: Bupa and Diabetes
> library(MASS)  # provides lda
> bupalda <- lda(V7 ~ ., data = bupa)
> pred <- predict(bupalda, bupa[, -7])$class
> table(pred, bupa[, 7])
pred   1   2
   1  78  35
   2  67 165
> error <- 102/345
> error
[1] 0.2956522
> diabeteslda <- lda(V9 ~ ., data = diabetes)
> pred <- predict(diabeteslda, diabetes[, -9])$class
> table(pred, diabetes[, 9])
pred   1   2
   1 446 112
   2  54 156
> error <- 166/768
> error
[1] 0.2161458
32
Other Bayesian classifiers
The k nearest neighbors (k-nn). Here the class-conditional density P(X1, ..., Xm | Y = vj) is estimated by the k-nearest-neighbor method.
Estimators based on kernel density estimation.
Estimators based on estimating the conditional density with Gaussian mixtures.
33
The k-nn classifier

In the multivariate case, the estimate of the density function has the form f(x) = k / (n vk(x)), where n is the sample size and vk(x) is the volume of an ellipsoid centered at x with radius rk(x), which in turn is the distance from x to its k-th nearest point.

34
The k-nn classifier
From the point of view of supervised classification, the k-nn method is very simple to apply. Suppose the class-conditional densities f(x | Ci) of the classes Ci that appear in the Bayes classification rule are estimated by k-nn. Then, for an object with measurements given by the vector x to be classified into class Ci, it must hold that (ki / (ni vk(x))) πi > (kj / (nj vk(x))) πj for all j ≠ i, where ki and kj are the numbers of the k nearest neighbors of x that fall in classes Ci and Cj respectively, ni and nj are the class sizes, and πi and πj are the prior probabilities.
35

The k-nn classifier
Assuming priors proportional to the class sizes (ni/n and nj/n respectively), the above is equivalent to ki > kj for j ≠ i.
The classification procedure is then as follows:
1) Find the k objects that are closest to the object x; k is usually an odd number such as 1 or 3.
2) If the majority of those k objects belong to class Ci, then the object x is assigned to it. In case of a tie, classify at random.
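The two-step procedure can be sketched in base R. The helper name `knn_predict` and the training data are made up for illustration; this is not the `knn` function from the class library.

```r
# Hypothetical helper: classify one object x by majority vote among its
# k nearest training objects (Euclidean distance), ties broken at random.
knn_predict <- function(train_x, train_y, x, k = 3) {
  # Step 1: distances from x to every training object, then the k closest.
  d <- sqrt(rowSums(sweep(train_x, 2, x)^2))
  nearest <- order(d)[seq_len(k)]
  # Step 2: majority class among those k objects.
  votes <- table(train_y[nearest])
  winners <- names(votes)[votes == max(votes)]
  if (length(winners) > 1) sample(winners, 1) else winners
}

# Made-up training data: two predictors, two classes.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_y <- c("C1", "C1", "C1", "C2", "C2", "C2")

knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)  # "C1"
knn_predict(train_x, train_y, c(6.5, 6.5), k = 3)  # "C2"
```

With two classes, an odd k avoids tied votes, which is one reason odd values are the usual choice.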
36
The k-nn classifier

There are two issues in the k-nn method: the choice of the distance or metric, and the choice of k.
The most elementary metric that can be chosen is the Euclidean one, d(x, y) = sqrt((x - y)'(x - y)). This metric, however, can cause problems if the predictor variables have been measured in very different units. Some prefer to rescale the data before applying the method. Another widely used distance is the Manhattan distance, defined by d(x, y) = sum over i of |xi - yi|. There are special metrics for datasets that contain different types of variables.
Enas and Choi (1996) carried out a simulation study to determine the optimal k when only two classes are present. They found that, if the sample sizes of the two classes are comparable, then k = n^(3/8) when there is little difference between the covariance matrices of the groups, and k = n^(2/8) when there is considerable difference between the covariance matrices.
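A base R sketch of the two distances, of rescaling, and of the rule of thumb for k. The objects and units are made up; n = 345 is the Bupa sample size that appears in the R examples, and rounding to the nearest integer is an assumption of this sketch.

```r
# Two made-up objects measured in very different units
# (say, metres and grams).
x <- c(1.70, 65000)
y <- c(1.80, 64000)

euclidean <- sqrt(sum((x - y)^2))  # sqrt((x - y)'(x - y))
manhattan <- sum(abs(x - y))       # sum of |xi - yi|
euclidean  # about 1000: dominated by the second variable
manhattan  # 1000.1

# Rescaling each column to mean 0 and standard deviation 1 before
# computing distances puts the predictors on a comparable footing.
data <- rbind(x, y, c(1.60, 70000), c(1.75, 58000))
scaled <- scale(data)
sqrt(sum((scaled[1, ] - scaled[2, ])^2))

# Rule of thumb for two classes of comparable size.
n <- 345
round(n^(3/8))  # 9: covariance matrices similar
round(n^(2/8))  # 4: covariance matrices quite different
```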

37
k-nn example: Bupa
> library(class)  # provides knn
> bupak1 <- knn(bupa[, -7], bupa[, -7], as.factor(bupa[, 7]), k = 1)
> table(bupak1, bupa[, 7])
bupak1   1   2
     1 145   0
     2   0 200
error = 0%
> bupak3 <- knn(bupa[, -7], bupa[, -7], as.factor(bupa[, 7]), k = 3)
> table(bupak3, bupa[, 7])
bupak3   1   2
     1 106  29
     2  39 171
error = 19.71%
> bupak5 <- knn(bupa[, -7], bupa[, -7], as.factor(bupa[, 7]), k = 5)
> table(bupak5, bupa[, 7])
bupak5   1   2
     1  94  23
     2  51 177
error = 21.44%
38
k-nn example: Diabetes
> diabk3 <- knn(diabetes[, -9], diabetes[, -9], as.factor(diabetes[, 9]), k = 3)
> table(diabk3, diabetes[, 9])
diabk3   1   2
     1 459  67
     2  41 201
error = 14.06%
> diabk5 <- knn(diabetes[, -9], diabetes[, -9], as.factor(diabetes[, 9]), k = 5)
> table(diabk5, diabetes[, 9])
diabk5   1   2
     1 442  93
     2  58 175
error = 19.66%
39
What you should know