Title: Bayes classifiers
Bayes Classifiers
- A formidable and sworn enemy of decision trees
How to build a Bayes Classifier
- Assume you want to predict output Y, which has arity nY and values v1, v2, ..., vnY.
- Assume there are m input attributes called X1, X2, ..., Xm.
- Break the dataset into nY smaller datasets called DS1, DS2, ..., DSnY.
- Define DSi = records in which Y = vi.
- For each DSi, learn a Density Estimator Mi to model the input distribution among the Y = vi records.
- Mi estimates P(X1, X2, ..., Xm | Y = vi).
- Idea: when a new set of input values (X1 = u1, X2 = u2, ..., Xm = um) comes along to be evaluated, predict the value of Y that makes P(X1, X2, ..., Xm | Y = vi) most likely.

Is this a good idea? This is a Maximum Likelihood classifier. It can get silly if some values of Y are very unlikely.

Much better idea: predict the value of Y that makes P(Y = vi | X1, X2, ..., Xm) most likely.
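The recipe above can be sketched in code. A minimal illustrative sketch in Python (all function and variable names here are assumptions, not from the lecture), using a simple count-based density estimator for each class:

```python
from collections import Counter, defaultdict

def train_bayes_classifier(records):
    """records: list of (input_tuple, y) pairs.
    Learns the class priors P(Y = v) and, per class value v, a
    count-based joint density estimator M_v over the input tuples."""
    prior = Counter()                # how many records have Y = v
    density = defaultdict(Counter)   # per-class counts of input tuples
    for x, y in records:
        prior[y] += 1
        density[y][x] += 1
    return prior, density, len(records)

def predict(model, x):
    """MAP prediction: argmax over v of P(x | Y = v) * P(Y = v)."""
    prior, density, n = model
    best_v, best_score = None, -1.0
    for v in prior:
        p_x_given_v = density[v][x] / prior[v]   # M_v's estimate of P(x | Y = v)
        score = p_x_given_v * prior[v] / n       # times the prior P(Y = v)
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```

Swapping the count-based M_v for any other density estimator changes only the two lines that touch `density`.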
Terminology
- MLE (Maximum Likelihood Estimator): predict the value v that maximizes P(X1, ..., Xm | Y = v).
- MAP (Maximum A-Posteriori Estimator): predict the value v that maximizes P(Y = v | X1, ..., Xm).
Getting what we need
Getting a posterior probability
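Written out, Bayes' rule gives the posterior from the class-conditional densities and priors defined above:

```latex
P(Y = v_i \mid X_1 = u_1, \ldots, X_m = u_m)
  = \frac{P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v_i)\, P(Y = v_i)}
         {\sum_{j=1}^{n_Y} P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v_j)\, P(Y = v_j)}
```

The denominator is the same for every class, so for prediction it suffices to maximize the numerator.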
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, ..., Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction: Ypredict = the value v of Y that maximizes P(Y = v | X1 = u1, ..., Xm = um).
- We can use our favorite Density Estimator here. Right now we have two options:
- Joint Density Estimator
- Naïve Density Estimator
Joint Density Bayes Classifier
In the case of the Joint Bayes Classifier this degenerates to a very simple rule: Ypredict = the class containing the most records in which X1 = u1, X2 = u2, ..., Xm = um. Note that if no records have the exact set of inputs X1 = u1, X2 = u2, ..., Xm = um, then P(X1, X2, ..., Xm | Y = vi) = 0 for all values of Y. In that case we just have to guess Y's value.
Example
X1 X2 X3 Y
0 0 1 0
0 1 0 0
1 1 0 0
0 0 1 1
1 1 1 1
0 0 1 1
1 1 0 1
Example (continued)
X1 = 0, X2 = 0, X3 = 1 will be assigned to class 1. Note also that in this class the record (0,0,1) appears more times than in class 0.
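The assignment claimed above can be checked by counting, as in this quick illustrative Python script (not part of the original slides):

```python
# The 7-record dataset from the example slide: (X1, X2, X3) -> Y
data = [((0, 0, 1), 0), ((0, 1, 0), 0), ((1, 1, 0), 0),
        ((0, 0, 1), 1), ((1, 1, 1), 1), ((0, 0, 1), 1), ((1, 1, 0), 1)]

def joint_bc_score(x, v):
    """Estimate P(X = x | Y = v) * P(Y = v) by raw counts."""
    in_class = [rec for rec, y in data if y == v]
    return (in_class.count(x) / len(in_class)) * (len(in_class) / len(data))

scores = {v: joint_bc_score((0, 0, 1), v) for v in (0, 1)}
predicted = max(scores, key=scores.get)   # class 1, as the slide states
```

Here (0,0,1) appears twice among the 4 class-1 records and once among the 3 class-0 records, giving scores 2/7 versus 1/7.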
Naïve Bayes Classifier
In the case of the Naïve Bayes Classifier this can be simplified: the class-conditional joint probability factors into a product over attributes, P(X1, ..., Xm | Y = vi) = P(X1 | Y = vi) x ... x P(Xm | Y = vi).
Technical Hint: if you have 10,000 input attributes, that product will underflow in floating point math. You should use logs.
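The hint can be made concrete with a small illustrative Python sketch (the function and argument names are assumptions, not from the lecture):

```python
import math

def naive_bayes_log_score(x, cond_probs, log_prior):
    """Score one class in log space:
        log P(Y = v) + sum_j log P(X_j = x_j | Y = v).
    cond_probs[j] maps a value of attribute j to P(X_j = value | Y = v).
    Summing logs avoids the underflow that multiplying thousands of
    small probabilities in floating point would cause."""
    score = log_prior
    for j, xj in enumerate(x):
        score += math.log(cond_probs[j][xj])
    return score
```

For instance, (1e-5)**10000 underflows to exactly 0.0 in double precision, while the corresponding log score, 10000 * log(1e-5), is an ordinary finite number.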
Example (continued)
X1 = 0, X2 = 0, X3 = 1 will be assigned to class 1.
BC Results: XOR
The XOR dataset consists of 40,000 records and 2 Boolean inputs called a and b, generated 50-50 randomly as 0 or 1. c (output) = a XOR b.
The Classifier learned by Joint BC
The Classifier learned by Naive BC
BC Results: MPG (392 records)
The Classifier learned by Naive BC
More Facts About Bayes Classifiers
- Many other density estimators can be slotted in.
- Density estimation can be performed with real-valued inputs.
- Bayes Classifiers can be built with real-valued inputs.
- Rather Technical Complaint: Bayes Classifiers don't try to be maximally discriminative; they merely try to honestly model what's going on.
- Zero probabilities are painful for Joint and Naïve. A hack (justifiable with the magic words "Dirichlet Prior") can help.
- Naïve Bayes is wonderfully cheap. And survives 10,000 attributes cheerfully!
- See future Andrew Lectures.
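The zero-probability hack mentioned above is usually implemented as add-alpha (Laplace) smoothing, which is what a Dirichlet prior yields; a minimal illustrative sketch:

```python
def smoothed_prob(count, class_total, n_values, alpha=1.0):
    """P(X_j = u | Y = v) under an add-alpha (Dirichlet) prior.
    count      : records in class v with X_j = u
    class_total: records in class v
    n_values   : number of distinct values X_j can take
    Unseen values get a small non-zero probability instead of 0."""
    return (count + alpha) / (class_total + alpha * n_values)
```

With alpha = 0 this reduces to the raw frequency estimate; with alpha > 0 no conditional probability is ever exactly zero, so one unseen attribute value can no longer veto an entire class.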
Naïve Bayes classifier
The Naïve Bayes classifier can be applied when there are continuous predictors, but a discretization method must be applied first, such as: equal-width intervals, equal-frequency intervals, ChiMerge, 1R, or entropy-based discretization with different stopping criteria. All of them are available in the dprep library (see disc.mentr, disc.ew, disc.ef, etc.). The e1071 library in R contains a function naiveBayes that computes the naïve Bayes classifier. If a variable is continuous, it assumes that it follows a Gaussian distribution.
The misclassification error rate
The misclassification error rate R(d) is the probability that the classifier d incorrectly classifies an instance coming from a sample (the test sample) obtained at a later stage than the training sample. It is also called the true error or the actual error. It is an unknown value that needs to be estimated.
Methods for estimating the misclassification error rate
- i) Resubstitution or Apparent Error (Smith, 1947). This is merely the proportion of instances in the training sample that are incorrectly classified by the classification rule. In general it is too optimistic an estimator, and it can lead to wrong conclusions if the number of instances is not large compared with the number of features. This estimator has a large bias.
- ii) Leave-one-out estimation (Lachenbruch, 1965). In this case an instance is omitted from the training sample. Then the classifier is built and the prediction for the omitted instance is obtained. One must record whether the instance was correctly or incorrectly classified. The process is repeated for all the instances in the training sample, and the estimate of the ME is given by the proportion of instances incorrectly classified. This estimator has low bias, but its variance tends to be large.
Methods for estimating the misclassification error rate
- iii) Cross-validation (Stone, 1974). In this case the training sample is randomly divided into v parts (v = 10 is the most used). Then the classifier is built using all the parts but one. The omitted part is treated as the test sample, and the predictions for each instance in it are found. The CV misclassification error rate is found by adding the misclassifications in each part and dividing by the total number of instances. The CV estimator has low bias but high variance. In order to reduce the variability, we usually repeat the estimation several times.
- The estimation of the variance is a hard problem (Bengio and Grandvalet, 2004).
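The v-fold procedure described above can be sketched as follows (an illustrative Python outline; `train_fn` and `predict_fn` stand in for any classifier, and are not names from the lecture):

```python
import random

def cv_error(records, train_fn, predict_fn, v=10, seed=0):
    """v-fold cross-validation estimate of the misclassification rate.
    records    : list of (x, y) pairs
    train_fn   : train_fn(train_records) -> model
    predict_fn : predict_fn(model, x) -> predicted y"""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)      # random division into v parts
    folds = [shuffled[i::v] for i in range(v)]
    errors = 0
    for i in range(v):
        train = [r for j in range(v) if j != i for r in folds[j]]
        model = train_fn(train)                # built on all parts but one
        errors += sum(1 for x, y in folds[i] if predict_fn(model, x) != y)
    return errors / len(records)               # total misclassifications / n
```

Repeating the whole procedure with different seeds and averaging is the variance-reduction step the text mentions.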
Methods for estimating the misclassification error rate
- iv) The holdout method. A percentage (e.g., 70%) of the dataset is taken as the training sample and the remainder as the test sample. The classifier is evaluated on the test sample. The experiment is repeated several times, and then the average is taken.
- v) Bootstrapping (Efron, 1983). In this method we generate several training samples by sampling with replacement from the original training sample. The idea is to reduce the bias of the resubstitution error. It is almost unbiased, but it has a large variance. Its computational cost is high. There exist several variants of this method.
Naive Bayes for Bupa
Without discretizing:
> a <- naiveBayes(V7~., data=bupa)
> pred <- predict(a, bupa[,-7], type="raw")
> pred1 <- max.col(pred)
> table(pred1, bupa[,7])
pred1   1   2
    1 112 119
    2  33  81
> error <- 152/345
> error
[1] 0.4405797
Discretizing with the entropy method:
> dbupa <- disc.mentr(bupa, 1:7)
> b <- naiveBayes(V7~., data=dbupa)
> pred <- predict(b, dbupa[,-7])
> table(pred, dbupa[,7])
pred    1   2
   1   79  61
   2   66 139
> error <- 127/345
> error
[1] 0.3681159
Naïve Bayes for Diabetes
Without discretizing:
> a <- naiveBayes(V9~., data=diabetes)
> pred <- predict(a, diabetes[,-9], type="raw")
> pred1 <- max.col(pred)
> table(pred1, diabetes[,9])
pred1   1   2
    1 421 104
    2  79 164
> error <- (79+104)/768
> error
[1] 0.2382813
Discretizing:
> ddiabetes <- disc.mentr(diabetes, 1:9)
> b <- naiveBayes(V9~., data=ddiabetes)
> pred <- predict(b, ddiabetes[,-9])
> table(pred, ddiabetes[,9])
pred    1   2
   1  418  84
   2   82 184
> 166/768
[1] 0.2161458
Naïve Bayes using ChiMerge discretization
> chibupa <- chiMerge(bupa, 1:6)
> b <- naiveBayes(V7~., data=chibupa)
> pred <- predict(b, chibupa[,-7])
> table(pred, chibupa[,7])
pred    1   2
   1  117  21
   2   28 179
> error <- 49/345
> error
[1] 0.1420290
> chidiab <- chiMerge(diabetes, 1:8)
> b <- naiveBayes(V9~., data=chidiab)
> pred <- predict(b, chidiab[,-9])
> table(pred, chidiab[,9])
pred    1   2
   1  457  33
   2   43 235
> error <- 76/768
> error
[1] 0.09895833
Other Bayesian classifiers
Linear Discriminant Analysis (LDA). Here the class-conditional density P(X1, ..., Xm | Y = vj) is assumed to be multivariate normal for each vj. It is further assumed that the covariance matrix is the same for all the classes. The decision rule for assigning the object x then reduces to a linear discriminant. Note that the decision rule is linear in the vector of predictors x. Strictly speaking, it should only be applied when the predictors are continuous.
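Under these assumptions (Gaussian classes sharing one covariance matrix Sigma), the decision rule takes the standard linear-discriminant form: assign x to the class vj with the largest score

```latex
\delta_j(\mathbf{x}) = \mathbf{x}^{\top}\Sigma^{-1}\boldsymbol{\mu}_j
  - \tfrac{1}{2}\,\boldsymbol{\mu}_j^{\top}\Sigma^{-1}\boldsymbol{\mu}_j
  + \log P(Y = v_j)
```

where mu_j is the mean vector of class vj; the quadratic term in x cancels because Sigma is shared, which is why the rule is linear in x.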
Examples of LDA: Bupa and Diabetes
> bupalda <- lda(V7~., data=bupa)
> pred <- predict(bupalda, bupa[,-7])$class
> table(pred, bupa[,7])
pred    1   2
   1   78  35
   2   67 165
> error <- 102/345
> error
[1] 0.2956522
> diabeteslda <- lda(V9~., data=diabetes)
> pred <- predict(diabeteslda, diabetes[,-9])$class
> table(pred, diabetes[,9])
pred    1   2
   1  446 112
   2   54 156
> error <- 166/768
> error
[1] 0.2161458
Other Bayesian classifiers
The k nearest neighbors (k-nn) classifier. Here the class-conditional density P(X1, ..., Xm | Y = vj) is estimated by the k-nearest-neighbors method. There are also estimators based on kernel density estimation, and estimators based on estimating the conditional density using Gaussian mixtures.
The k-nn classifier
- In the multivariate case, the density estimate has the form f(x) = k / (n vk(x)), where vk(x) is the volume of an ellipsoid centered at x with radius rk(x), which in turn is the distance from x to the k-th nearest point.
The k-nn classifier
From the point of view of supervised classification, the k-nn method is very simple to apply. Indeed, suppose the class-conditional density functions f(x | Ci) of class Ci that appear in the posterior are estimated by k-nn. Then, to classify an object with measurements given by the vector x into class Ci, it must hold that f(x | Ci)P(Ci) > f(x | Cj)P(Cj) for j != i, where the estimates use ki and kj, the numbers of the k nearest neighbors of x that fall in classes Ci and Cj respectively.
The k-nn classifier
Assuming priors proportional to the class sizes (ni/n and nj/n respectively), the above is equivalent to ki > kj for j != i. The classification procedure is then as follows: 1) Find the k objects closest in distance to the object x; k is usually an odd number, 1 or 3. 2) If the majority of those k objects belong to class Ci, then the object x is assigned to it. In case of a tie, classify at random.
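Steps 1) and 2) above, as a short illustrative Python sketch using squared Euclidean distance (ties here fall to the first label found rather than being broken at random, as the text prescribes):

```python
from collections import Counter

def knn_classify(train, x, k=3):
    """train: list of (vector, label) pairs.
    Returns the majority label among the k training points
    closest to x under squared Euclidean distance."""
    nearest = sorted(train,
                     key=lambda p: sum((a - b) ** 2 for a, b in zip(p[0], x)))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Replacing the distance expression with a Manhattan or rescaled distance addresses the metric-choice issue discussed on the next slide.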
The k-nn classifier
- There are two problems in the k-nn method: the choice of the distance (metric) and the choice of k.
- The most elementary metric one can choose is the Euclidean one, d(x,y) = (x-y)'(x-y). This metric, however, can cause problems if the predictor variables have been measured in very different units. Some prefer to rescale the data before applying the method. Another widely used distance is the Manhattan distance, defined by d(x,y) = sum_j |xj - yj|. There are special metrics for when there are different types of variables in the dataset.
- Enas and Choi (1996), using simulation, carried out a study to determine the optimal k when only two classes are present, and determined that if the sample sizes of the two classes are comparable, then k = n^(3/8) if there was little difference between the covariance matrices of the groups, and k = n^(2/8) if there was considerable difference between the covariance matrices.
Example of knn: Bupa
> bupak1 <- knn(bupa[,-7], bupa[,-7], as.factor(bupa[,7]), k=1)
> table(bupak1, bupa[,7])
bupak1   1   2
     1 145   0
     2   0 200
error = 0
> bupak3 <- knn(bupa[,-7], bupa[,-7], as.factor(bupa[,7]), k=3)
> table(bupak3, bupa[,7])
bupak3   1   2
     1 106  29
     2  39 171
error = 19.71%
> bupak5 <- knn(bupa[,-7], bupa[,-7], as.factor(bupa[,7]), k=5)
> table(bupak5, bupa[,7])
bupak5   1   2
     1  94  23
     2  51 177
error = 21.44%
Example of knn: Diabetes
> diabk3 <- knn(diabetes[,-9], diabetes[,-9], as.factor(diabetes[,9]), k=3)
> table(diabk3, diabetes[,9])
diabk3   1   2
     1 459  67
     2  41 201
error = 14.06%
> diabk5 <- knn(diabetes[,-9], diabetes[,-9], as.factor(diabetes[,9]), k=5)
> table(diabk5, diabetes[,9])
diabk5   1   2
     1 442  93
     2  58 175
error = 19.66%
What you should know
- Bayes Classifiers
- How to build one
- How to predict with a BC
- How to estimate the misclassification error.