Transcript and Presenter's Notes

Title: Bayes classifiers


1
Bayes classifiers
  • Edgar Acuna

2
Bayes Classifiers
  • A formidable and sworn enemy of decision trees

BC
3
How to build a Bayes Classifier
  • Assume you want to predict output Y, which has
    arity nY and values v1, v2, ..., vnY.
  • Assume there are m input attributes called X1,
    X2, ..., Xm.
  • Break the dataset into nY smaller datasets called
    DS1, DS2, ..., DSnY.
  • Define DSi = records in which Y = vi.
  • For each DSi, learn a Density Estimator Mi to
    model the input distribution among the Y = vi
    records.

4
How to build a Bayes Classifier
  • Assume you want to predict output Y, which has
    arity nY and values v1, v2, ..., vnY.
  • Assume there are m input attributes called X1,
    X2, ..., Xm.
  • Break the dataset into nY smaller datasets called
    DS1, DS2, ..., DSnY.
  • Define DSi = records in which Y = vi.
  • For each DSi, learn a Density Estimator Mi to
    model the input distribution among the Y = vi
    records.
  • Mi estimates P(X1, X2, ..., Xm | Y = vi).

5
How to build a Bayes Classifier
  • Assume you want to predict output Y, which has
    arity nY and values v1, v2, ..., vnY.
  • Assume there are m input attributes called X1,
    X2, ..., Xm.
  • Break the dataset into nY smaller datasets called
    DS1, DS2, ..., DSnY.
  • Define DSi = records in which Y = vi.
  • For each DSi, learn a Density Estimator Mi to
    model the input distribution among the Y = vi
    records.
  • Mi estimates P(X1, X2, ..., Xm | Y = vi).
  • Idea: when a new set of input values (X1 = u1,
    X2 = u2, ..., Xm = um) comes along to be
    evaluated, predict the value of Y that makes
    P(X1, X2, ..., Xm | Y = vi) most likely.

Is this a good idea?
6
How to build a Bayes Classifier
  • Assume you want to predict output Y, which has
    arity nY and values v1, v2, ..., vnY.
  • Assume there are m input attributes called X1,
    X2, ..., Xm.
  • Break the dataset into nY smaller datasets called
    DS1, DS2, ..., DSnY.
  • Define DSi = records in which Y = vi.
  • For each DSi, learn a Density Estimator Mi to
    model the input distribution among the Y = vi
    records.
  • Mi estimates P(X1, X2, ..., Xm | Y = vi).
  • Idea: when a new set of input values (X1 = u1,
    X2 = u2, ..., Xm = um) comes along to be
    evaluated, predict the value of Y that makes
    P(X1, X2, ..., Xm | Y = vi) most likely.

This is a Maximum Likelihood classifier. It can
get silly if some Ys are very unlikely
Is this a good idea?
7
How to build a Bayes Classifier
  • Assume you want to predict output Y, which has
    arity nY and values v1, v2, ..., vnY.
  • Assume there are m input attributes called X1,
    X2, ..., Xm.
  • Break the dataset into nY smaller datasets called
    DS1, DS2, ..., DSnY.
  • Define DSi = records in which Y = vi.
  • For each DSi, learn a Density Estimator Mi to
    model the input distribution among the Y = vi
    records.
  • Mi estimates P(X1, X2, ..., Xm | Y = vi).
  • Idea: when a new set of input values (X1 = u1,
    X2 = u2, ..., Xm = um) comes along to be
    evaluated, predict the value of Y that makes
    P(Y = vi | X1, X2, ..., Xm) most likely.

Much Better Idea
Is this a good idea?
8
Terminology
  • MLE (Maximum Likelihood Estimator)
  • MAP (Maximum A-Posteriori Estimator)

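In symbols, the two estimators are:

$$\hat{Y}_{MLE} = \arg\max_{v} P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v),
\qquad
\hat{Y}_{MAP} = \arg\max_{v} P(Y = v \mid X_1 = u_1, \ldots, X_m = u_m).$$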
9
Getting what we need
10
Getting a posterior probability
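The posterior needed for the MAP rule comes from Bayes' rule:

$$P(Y = v_i \mid X_1 = u_1, \ldots, X_m = u_m) =
\frac{P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v_i)\, P(Y = v_i)}
{\sum_{j=1}^{n_Y} P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v_j)\, P(Y = v_j)}$$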
11
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, ..., Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:
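Since the denominator of Bayes' rule does not depend on vi, the prediction
step reduces to:

$$Y^{predict} = \arg\max_{v} P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v)\, P(Y = v).$$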
12
Bayes Classifiers in a nutshell
1. Learn the distribution over inputs for each value of Y.
2. This gives P(X1, X2, ..., Xm | Y = vi).
3. Estimate P(Y = vi) as the fraction of records with Y = vi.
4. For a new prediction:
  • We can use our favorite Density Estimator here.
  • Right now we have two options:
  • Joint Density Estimator
  • Naïve Density Estimator

13
Joint Density Bayes Classifier
In the case of the Joint Bayes Classifier this
degenerates to a very simple rule: Ypredict =
the class containing the most records in which
X1 = u1, X2 = u2, ..., Xm = um. Note that if no
records have the exact set of inputs X1 = u1,
X2 = u2, ..., Xm = um, then P(X1, X2, ..., Xm | Y = vi) = 0
for all values of Y. In that case we just have to
guess Y's value.
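A minimal R sketch of this rule, assuming a data frame train whose columns
are 0/1 inputs plus a class column Y (the names train, Y, and
joint_bc_predict are hypothetical, not from the slides):

# joint_bc_predict: illustrate the joint-density Bayes rule by counting
joint_bc_predict <- function(train, newx) {
  inputs <- train[, names(newx), drop = FALSE]
  # records whose inputs match the new case exactly
  match <- apply(inputs, 1, function(r) all(r == newx))
  if (!any(match)) {
    # no exact match: P(X1=u1,...,Xm=um | Y=v) = 0 for every class,
    # so just guess the most frequent class overall
    return(names(which.max(table(train$Y))))
  }
  # otherwise return the class with the most matching records
  names(which.max(table(train$Y[match])))
}
# e.g. joint_bc_predict(train, c(X1 = 0, X2 = 0, X3 = 1))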
14
Example
15
Example (continued)
X1 = 0, X2 = 0, X3 = 1 will be assigned to class 1.
Note also that the record (0,0,1) appears more
times in this class than in class 0.
16
Naïve Bayes Classifier
In the case of the naive Bayes Classifier this
can be simplified:
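The simplification is the naive conditional-independence assumption:

$$P(X_1 = u_1, \ldots, X_m = u_m \mid Y = v) = \prod_{j=1}^{m} P(X_j = u_j \mid Y = v).$$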
17
Naïve Bayes Classifier
In the case of the naive Bayes Classifier this
can be simplified:
Technical hint: if you have 10,000 input
attributes, that product will underflow in
floating-point math. You should use logs:
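A small illustration of the underflow and of the log fix (the numbers are
made up):

p <- rep(0.3, 10000)       # hypothetical per-attribute probabilities
prior <- 0.5
prod(p) * prior            # underflows to 0 in double precision
sum(log(p)) + log(prior)   # the same comparison, done safely on the log scale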
18
Ejemplo Continuacion
X10,X20, x31 sera asignado a la clase 1
19
BC Results XOR
The XOR dataset consists of 40,000 records and
2 Boolean inputs called a and b, generated 50-50
randomly as 0 or 1. c (output) = a XOR b.
The Classifier learned by Joint BC
The Classifier learned by Naive BC
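The naive-BC side of this experiment can be reproduced in R with the e1071
package used later in these slides (the data frame and column names below
are hypothetical):

library(e1071)
a <- sample(0:1, 40000, replace = TRUE)
b <- sample(0:1, 40000, replace = TRUE)
xor_data <- data.frame(a = factor(a), b = factor(b),
                       c = factor(as.numeric(xor(a, b))))
m <- naiveBayes(c ~ a + b, data = xor_data)
# every P(a|c) and P(b|c) is close to 0.5, so the naive model cannot
# separate the two classes of the XOR function
table(predicted = predict(m, xor_data[, 1:2]), truth = xor_data$c)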
20
BC Results: MPG, 392 records
The Classifier learned by Naive BC
21
More Facts About Bayes Classifiers
  • Many other density estimators can be slotted in.
  • Density estimation can be performed with
    real-valued inputs.
  • Bayes Classifiers can be built with real-valued
    inputs.
  • Rather technical complaint: Bayes Classifiers
    don't try to be maximally discriminative; they
    merely try to honestly model what's going on.
  • Zero probabilities are painful for Joint and
    Naïve. A hack (justifiable with the magic words
    Dirichlet Prior) can help; see the sketch after
    this list.
  • Naïve Bayes is wonderfully cheap. And survives
    10,000 attributes cheerfully!
  • See future Andrew Lectures.

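A minimal sketch of that hack as Laplace smoothing, using the laplace
argument of e1071's naiveBayes (train and Y are hypothetical names for a
discretized training set and its class column):

library(e1071)
# laplace = 1 adds one pseudo-count per attribute value, so no conditional
# probability P(Xj | Y = v) is estimated as exactly zero
m <- naiveBayes(Y ~ ., data = train, laplace = 1)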
22
Naïve Bayes classifier
The naïve Bayes classifier can be applied when
there are continuous predictors, but a
discretization method must be applied first, such
as: equal-width intervals, equal-frequency
intervals, ChiMerge, 1R, or entropy-based
discretization with different stopping criteria.
All of them are available in the dprep library
(see disc.mentr, disc.ew, disc.ef, etc.). The
e1071 library in R contains a naiveBayes function
that computes the naïve Bayes classifier. If a
variable is continuous, it assumes that it
follows a Gaussian distribution.
23
The misclassification error rate
The misclassification error rate R(d) is the
probability that the classifier d incorrectly
classifies an instance coming from a sample (the
test sample) obtained at a later stage than the
training sample. It is also called the true error
or the actual error. It is an unknown value that
needs to be estimated.
24
Methods for estimation of the misclassification
error rate
  • i) Resubstitution or apparent error (Smith, 1947).
    This is merely the proportion of instances in the
    training sample that are incorrectly classified
    by the classification rule. In general it is too
    optimistic an estimator and it can lead to wrong
    conclusions if the number of instances is not
    large compared with the number of features. This
    estimator has a large bias.
  • ii) Leave-one-out estimation (Lachenbruch,
    1965). In this case an instance is omitted from
    the training sample. Then the classifier is built
    and the prediction for the omitted instance is
    obtained. One must record whether the instance was
    correctly or incorrectly classified. The process
    is repeated for all the instances in the training
    sample and the estimate of the ME is given
    by the proportion of instances incorrectly
    classified. This estimator has low bias but its
    variance tends to be large.

25
Methods for estimation of the misclassification
error rate
  • iii) Cross-validation (Stone, 1974). In this case
    the training sample is randomly divided into v
    parts (v = 10 is the most used). Then the
    classifier is built using all the parts but one.
    The omitted part is treated as the test sample
    and the predictions for each instance in it are
    found. The CV misclassification error rate is
    found by adding the misclassifications in each
    part and dividing by the total number of
    instances. The CV estimate has low bias but
    high variance. In order to reduce the variability
    we usually repeat the estimation several times
    (a small R sketch follows this list).
  • The estimation of the variance is a hard problem
    (Bengio and Grandvalet, 2004).

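A minimal sketch of v-fold cross-validation for the naive Bayes error rate,
assuming e1071 and a data frame d whose last column is the (factor) class
label (the function name cv_error is hypothetical):

library(e1071)
cv_error <- function(d, v = 10) {
  n <- nrow(d); p <- ncol(d)
  fold <- sample(rep(1:v, length.out = n))   # random fold assignment
  wrong <- 0
  for (i in 1:v) {
    train <- d[fold != i, ]
    test  <- d[fold == i, ]
    m <- naiveBayes(train[, -p, drop = FALSE], train[, p])
    wrong <- wrong + sum(predict(m, test[, -p, drop = FALSE]) != test[, p])
  }
  wrong / n   # misclassifications pooled over the v folds
}
# Repeating cv_error(d) several times and averaging reduces its variability.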
26
Methods for estimation of the misclassification
error rate
iv) The holdout method. A percentage (70%) of the
dataset is taken as the training sample and the
remainder as the test sample. The classifier is
evaluated on the test sample. The experiment is
repeated several times and then the average is
taken.
v) Bootstrapping (Efron, 1983). In this method we
generate several training samples by sampling
with replacement from the original training
sample. The idea is to reduce the bias of the
resubstitution error. It is almost unbiased, but
it has a large variance. Its computational cost
is high. There exist several variants of this
method.
27
Naive Bayes for Bupa
Without discretizing:
> library(e1071)
> library(dprep)
> a=naiveBayes(V7~.,data=bupa)
> pred=predict(a,bupa[,-7],type="raw")
> pred1=max.col(pred)
> table(pred1,bupa[,7])
pred1   1   2
    1 112 119
    2  33  81
> error=152/345
> error
[1] 0.4405797
Discretizing with the entropy method:
> dbupa=disc.mentr(bupa,1:7)
> b=naiveBayes(V7~.,data=dbupa)
> pred=predict(b,dbupa[,-7])
> table(pred,dbupa[,7])
pred   1   2
   1  79  61
   2  66 139
> error1=127/345
> error1
[1] 0.3681159
28
Naïve Bayes for Diabetes
Without discretizing:
> a=naiveBayes(V9~.,data=diabetes)
> pred=predict(a,diabetes[,-9],type="raw")
> pred1=max.col(pred)
> table(pred1,diabetes[,9])
pred1   1   2
    1 421 104
    2  79 164
> error=(79+104)/768
> error
[1] 0.2382813
Discretizing:
> ddiabetes=disc.mentr(diabetes,1:9)
> b=naiveBayes(V9~.,data=ddiabetes)
> pred=predict(b,ddiabetes[,-9])
> table(pred,ddiabetes[,9])
pred   1   2
   1 418  84
   2  82 184
> 166/768
[1] 0.2161458
29
Naïve Bayes using ChiMerge discretization
> chibupa=chiMerge(bupa,1:6)
> b=naiveBayes(V7~.,data=chibupa)
> pred=predict(b,chibupa[,-7])
> table(pred,chibupa[,7])
pred   1   2
   1 117  21
   2  28 179
> error=49/345
> error
[1] 0.1420290
> chidiab=chiMerge(diabetes,1:8)
> b=naiveBayes(V9~.,data=chidiab)
> pred=predict(b,chidiab[,-9])
> table(pred,chidiab[,9])
pred   1   2
   1 457  33
   2  43 235
> error=76/768
> error
[1] 0.09895833
30
Other Bayesian classifiers
Linear Discriminant Analysis (LDA). Here the
class-conditional density P(X1,...,Xm | Y = vj)
is assumed to be multivariate normal for each vj.
It is further assumed that the covariance matrix
is the same for all the classes. The decision
rule for assigning the object x reduces to the
expression shown below. Note that the decision
rule is linear in the vector of predictors x.
Strictly speaking it should only be applied when
the predictors are continuous.
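A standard way to write that rule, under the stated assumptions (mu_j is the
mean vector of class vj and Sigma is the common covariance matrix): assign x
to the class vj that maximizes

$$\delta_j(x) = x^{\top}\Sigma^{-1}\mu_j - \tfrac{1}{2}\,\mu_j^{\top}\Sigma^{-1}\mu_j + \log P(Y = v_j).$$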
31
LDA examples: Bupa and Diabetes
> library(MASS)
> bupalda=lda(V7~.,data=bupa)
> pred=predict(bupalda,bupa[,-7])$class
> table(pred,bupa[,7])
pred   1   2
   1  78  35
   2  67 165
> error=102/345
> error
[1] 0.2956522
> diabeteslda=lda(V9~.,data=diabetes)
> pred=predict(diabeteslda,diabetes[,-9])$class
> table(pred,diabetes[,9])
pred   1   2
   1 446 112
   2  54 156
> error=166/768
> error
[1] 0.2161458
32
Other Bayesian classifiers
The k nearest neighbors (k-nn) classifier. Here
the class-conditional density P(X1,...,Xm | Y = vj)
is estimated by the k-nearest-neighbor method.
Estimators based on kernel density estimation.
Estimators based on estimating the conditional
density using Gaussian mixtures.
33
The k-nn classifier
  • In the multivariate case, the estimate of the
    density function has the form shown below,
  • where vk(x) is the volume of an ellipsoid
    centered at x with radius rk(x), which in turn is
    the distance from x to its k-th nearest point.

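With n denoting the sample size, the estimate is:

$$\hat{f}(x) = \frac{k}{n\, v_k(x)}.$$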
34
The k-nn classifier
From the point of view of supervised
classification, the k-nn method is very simple to
apply. Indeed, suppose the class-conditional
densities f(x|Ci) of each class Ci that appear in
the equation are estimated by k-nn. Then, to
classify an object with measurements given by the
vector x into class Ci, the condition shown below
must hold for every j ≠ i, where ki and kj are the
numbers, among the k nearest neighbors of x, that
fall in classes Ci and Cj respectively.
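Plugging the k-nn density estimates into the comparison of f(x|Ci)P(Ci)
against f(x|Cj)P(Cj), with ni and nj the class sample sizes, the condition
is:

$$\frac{k_i}{n_i\, v_k(x)}\, P(C_i) \;>\; \frac{k_j}{n_j\, v_k(x)}\, P(C_j)
\qquad \text{for all } j \neq i.$$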
35

The k-nn classifier
Assuming priors proportional to the class sizes
(ni/n and nj/n respectively), the above is
equivalent to ki > kj for j ≠ i. The
classification procedure is then as follows:
1) Find the k objects that are at the closest
distance to the object x; k is usually an odd
number, 1 or 3. 2) If the majority of those k
objects belong to class Ci, then the object x is
assigned to that class. Ties are broken at
random.
36
The k-nn classifier
  • There are two issues in the k-nn method: the
    choice of the distance (metric) and the choice
    of k.
  • The most elementary metric one can choose is the
    Euclidean one, d(x,y) = (x-y)'(x-y). This metric,
    however, can cause problems if the predictor
    variables have been measured in very different
    units. Some people prefer to rescale the data
    before applying the method. Another widely used
    distance is the Manhattan distance, defined by
    d(x,y) = |x1-y1| + ... + |xm-ym|. There are
    special metrics for datasets that contain
    different types of variables.
  • Enas and Choi (1996), using simulation, carried
    out a study to determine the optimal k when only
    two classes are present. They found that if the
    sample sizes of the two classes are comparable,
    then k = n^(3/8) if there was little difference
    between the covariance matrices of the groups,
    and k = n^(2/8) if there was a substantial
    difference between the covariance matrices.

37
knn example: Bupa
> library(class)
> bupak1=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=1)
> table(bupak1,bupa[,7])
bupak1   1   2
     1 145   0
     2   0 200
error = 0
> bupak3=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=3)
> table(bupak3,bupa[,7])
bupak3   1   2
     1 106  29
     2  39 171
error = 19.71%
> bupak5=knn(bupa[,-7],bupa[,-7],as.factor(bupa[,7]),k=5)
> table(bupak5,bupa[,7])
bupak5   1   2
     1  94  23
     2  51 177
error = 21.44%
38
knn example: Diabetes
> diabk3=knn(diabetes[,-9],diabetes[,-9],as.factor(diabetes[,9]),k=3)
> table(diabk3,diabetes[,9])
diabk3   1   2
     1 459  67
     2  41 201
error = 14.06%
> diabk5=knn(diabetes[,-9],diabetes[,-9],as.factor(diabetes[,9]),k=5)
> table(diabk5,diabetes[,9])
diabk5   1   2
     1 442  93
     2  58 175
error = 19.66%
39
What you should know
  • Bayes Classifiers
  • How to build one
  • How to predict with a BC
  • How to estimate the misclassification error.