Title: 2D1431 Machine Learning
Slide 1: 2D1431 Machine Learning
Slide 2: Outline
- Bayes theorem
- Maximum likelihood (ML) hypothesis
- Maximum a posteriori (MAP) hypothesis
- Naïve Bayes classifier
- Bayes optimal classifier
- Bayesian belief networks
- Expectation maximization (EM) algorithm
Slide 3: Handwritten character classification
Slide 4: Gray-level pictures: object classification
Slide 5: Gray-level pictures: human action classification
Slide 6: Literature & Software
- T. Mitchell, Machine Learning, chapter 6
- S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach, chapters 14-15
- R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., chapters 2-3
- David Heckerman, A Tutorial on Learning with Bayesian Belief Networks, http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
- Bayes Net Toolbox for Matlab (free), Kevin Murphy, http://www.cs.berkeley.edu/murphyk/Bayes/bnt.html
Slide 7: Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D)
- P(D): prior probability of the data D (evidence)
- P(h): prior probability of the hypothesis h (prior)
- P(h|D): posterior probability of the hypothesis h given the data D (posterior)
- P(D|h): probability of the data D given the hypothesis h (likelihood of the data)
Slide 8: Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D)
- posterior = likelihood × prior / evidence
- By observing the data D we can convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D).
- The posterior is the probability that h holds after the data D has been observed.
- The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.
Slide 9: Choosing Hypotheses
- P(h|D) = P(D|h) P(h) / P(D)
- Generally we want the most probable hypothesis given the training data.
- Maximum a posteriori (MAP) hypothesis hMAP:
- hMAP = argmax_{h∈H} P(h|D)
- = argmax_{h∈H} P(D|h) P(h) / P(D)
- = argmax_{h∈H} P(D|h) P(h)
- If the hypothesis priors are equally likely, P(hi) = P(hj), then one can choose the maximum likelihood (ML) hypothesis:
- hML = argmax_{h∈H} P(D|h)
Slide 10: Bayes Theorem Example
- A patient takes a lab test and the result is positive. The test returns a correct positive (+) result in 98% of the cases in which the disease is actually present, and a correct negative (−) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease.
- Hypotheses: disease, ¬disease
- priors: P(disease) = 0.008, P(¬disease) = 0.992
- likelihoods: P(+|disease) = 0.98, P(−|disease) = 0.02, P(+|¬disease) = 0.03, P(−|¬disease) = 0.97
- Maximum posterior: argmax_h P(h|+)
- P(disease|+) ∝ P(+|disease) P(disease) = 0.0078
- P(¬disease|+) ∝ P(+|¬disease) P(¬disease) = 0.0298
- P(disease|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
- P(¬disease|+) = 0.0298 / (0.0078 + 0.0298) = 0.79
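A quick numeric check of this example, sketched in Python with the probabilities stated above:

```python
# Quick check of the lab-test example using Bayes theorem.
p_disease = 0.008
p_pos_given_disease = 0.98        # correct positive rate
p_pos_given_no_disease = 0.03     # false positive rate = 1 - 0.97

num_disease    = p_pos_given_disease * p_disease             # ~0.0078
num_no_disease = p_pos_given_no_disease * (1 - p_disease)    # ~0.0298
evidence = num_disease + num_no_disease                      # P(+)

print(num_disease / evidence)     # P(disease | +)   ~ 0.21
print(num_no_disease / evidence)  # P(~disease | +)  ~ 0.79
```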
Slide 11: Basic Formulas for Probabilities
- Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if A1, A2, ..., An are mutually exclusive events with Σi P(Ai) = 1, then
- P(B) = Σi P(B|Ai) P(Ai)
Slide 12: Bayes Theorem Example
- P(x1,x2|μ1,μ2,σ) = 1/(2πσ²) exp(−Σi (xi−μi)²/(2σ²))
- h = <μ1, μ2, σ>
- D = {x1, ..., xm}
Slide 13: Gaussian Probability Function
- P(D|μ1,μ2,σ) = Π_m P(x_m|μ1,μ2,σ)
- Maximum likelihood hypothesis hML:
- hML = argmax_{μ1,μ2,σ} P(D|μ1,μ2,σ)
- Trick: maximize the log-likelihood
- log P(D|μ1,μ2,σ) = Σ_m log P(x_m|μ1,μ2,σ)
- = Σ_m log [ 1/(2πσ²) exp(−Σ_i (x_m,i−μ_i)²/(2σ²)) ]
- = −M log(2πσ²) − Σ_m Σ_i (x_m,i−μ_i)²/(2σ²)
Slide 14: Gaussian Probability Function
- ∂ log P(D|μ1,μ2,σ) / ∂μ_i = 0
- ⇒ Σ_m (x_m,i − μ_i) = 0 ⇒ μ_i^ML = (1/M) Σ_m x_m,i = E[x_m,i]
- ∂ log P(D|μ1,μ2,σ) / ∂σ = 0
- ⇒ σ²_ML = Σ_m Σ_i (x_m,i − μ_i^ML)² / (2M) = E[ Σ_i (x_m,i − μ_i^ML)² / 2 ]
- Maximum likelihood hypothesis hML = <μ_i^ML, σ_ML>
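A minimal numeric sketch of slide 14's estimators for a 2D isotropic Gaussian; the synthetic NumPy data below is an assumption for illustration, not lecture data:

```python
import numpy as np

# ML estimates for h = <mu1, mu2, sigma> of a 2D isotropic Gaussian.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.2, -0.14], scale=1.4, size=(500, 2))   # rows are examples x_m (assumed data)

mu_ml = X.mean(axis=0)                                       # mu_i = (1/M) sum_m x_mi
sigma2_ml = ((X - mu_ml) ** 2).sum(axis=1).mean() / 2.0      # sigma^2 = (1/2M) sum_m sum_i (x_mi - mu_i)^2
print("mu_ML =", mu_ml, "sigma_ML =", np.sqrt(sigma2_ml))
```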
Slide 15: Maximum Likelihood Hypothesis
- [Figure: data with the fitted Gaussian, μML = (0.20, −0.14), σML = 1.42]
Slide 16: Bayes Decision Rule
- [Figure: training examples of class c1 (marked x, Gaussian with μ1, σ1) and class c2 (marked o, Gaussian with μ2, σ2)]
Slide 17: Bayes Decision Rule
- Assume we have two Gaussian distributions associated with two separate classes c1, c2.
- P(x|ci) = P(x|μi,σi) = 1/(2πσi²) exp(−Σj (xj−μi,j)²/(2σi²))
- Bayes decision rule (maximum posterior probability):
- Decide c1 if P(c1|x) > P(c2|x), otherwise decide c2.
- If P(c1) = P(c2), use the maximum likelihood P(x|ci);
- else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci)
Slide 18: Bayes Decision Rule
- [Figure: decision boundary between the regions assigned to c1 and c2]
Slide 19: Two-Category Case
- Discriminant function:
- if g(x) > 0 then c1, else c2
- g(x) = P(c1|x) − P(c2|x)
- = P(x|c1) P(c1) − P(x|c2) P(c2)
- g(x) = log P(c1|x) − log P(c2|x)
- = log [P(x|c1)/P(x|c2)] + log [P(c1)/P(c2)]
- Gaussian probability functions with identical σi:
- g(x) = (x−μ2)²/(2σ²) − (x−μ1)²/(2σ²) + log P(c1) − log P(c2)
- the decision surface is a line/hyperplane
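A minimal sketch of the two-category discriminant for isotropic Gaussians with identical σ; all parameter values in the usage line are illustrative assumptions:

```python
import numpy as np

# g(x) = ||x - mu2||^2/(2 sigma^2) - ||x - mu1||^2/(2 sigma^2) + log P(c1) - log P(c2)
def g(x, mu1, mu2, sigma, p1, p2):
    x, mu1, mu2 = map(np.asarray, (x, mu1, mu2))
    return (np.sum((x - mu2) ** 2) - np.sum((x - mu1) ** 2)) / (2 * sigma ** 2) \
           + np.log(p1) - np.log(p2)

# decide c1 if g(x) > 0, else c2 (made-up means, sigma, and equal priors)
label = "c1" if g([0.5, 0.0], mu1=[1, 0], mu2=[-1, 0], sigma=1.0, p1=0.5, p2=0.5) > 0 else "c2"
print(label)   # x is closer to mu1, so c1
```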
Slide 20: Learning a Real-Valued Function
- [Figure: target function f, maximum likelihood hypothesis hML, and noise e]
- Consider a real-valued target function f.
- Noisy training examples <xi, di>:
- di = f(xi) + ei
- ei is a random variable drawn from a Gaussian distribution with zero mean.
- The maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
- hML = argmin_{h∈H} Σi (di − h(xi))²
Slide 21: Learning a Real-Valued Function
- hML = argmax_{h∈H} P(D|h)
- = argmax_{h∈H} Πi p(di|h)
- = argmax_{h∈H} Πi (2πσ²)^(−1/2) exp(−(di−h(xi))²/(2σ²))
- maximizing the logarithm log P(D|h):
- hML = argmax_{h∈H} Σi [ −½ log(2πσ²) − (di−h(xi))²/(2σ²) ]
- = argmax_{h∈H} Σi −(di − h(xi))²
- = argmin_{h∈H} Σi (di − h(xi))²
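A small sketch of this result: under zero-mean Gaussian noise the ML hypothesis is the least-squares fit. The linear hypothesis class and the synthetic data are assumptions for illustration:

```python
import numpy as np

# d_i = f(x_i) + e_i with Gaussian noise e_i; hML minimizes sum_i (d_i - h(x_i))^2.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
d = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)   # noisy targets (assumed f)

# least-squares solution for a linear hypothesis h(x) = w1*x + w0
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, d, rcond=None)
print("h_ML(x) = %.2f x + %.2f" % (w[0], w[1]))
```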
Slide 22: Learning to Predict Probabilities
- Predicting the survival probability of a patient
- Training examples <xi, di>, where di is 0 or 1
- Objective: train a neural network to output the probability h(xi) = P(di = 1) given xi
- Maximum likelihood hypothesis:
- hML = argmax_{h∈H} Σi [ di ln h(xi) + (1−di) ln(1−h(xi)) ]
- maximize the cross entropy between di and h(xi)
- Weight update rule for the synapses wk to the output neuron h(xi):
- wk ← wk + η Σi (di − h(xi)) xi,k
- Compare to the standard BP weight update rule:
- wk ← wk + η Σi h(xi)(1 − h(xi)) (di − h(xi)) xi,k
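A minimal sketch of the cross-entropy weight update for a single sigmoid output neuron; the synthetic data, learning rate, and epoch count are assumptions:

```python
import numpy as np

# w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik   (slide 22's update rule)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                            # inputs x_i with 3 attributes (assumed)
d = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # 0/1 targets (illustrative)

w, eta = np.zeros(3), 0.01
for epoch in range(200):
    h = 1.0 / (1.0 + np.exp(-X @ w))    # h(x_i) = sigmoid(w . x_i)
    w += eta * X.T @ (d - h)            # ascends sum_i d_i ln h + (1-d_i) ln(1-h)
print("learned weights:", w)
```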
Slide 23: Most Probable Classification
- So far we sought the most probable hypothesis hMAP.
- What is the most probable classification of a new instance x given the data D?
- hMAP(x) is not the most probable classification, although it is often a sufficiently good approximation of it.
- Consider three possible hypotheses:
- P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
- Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −
- hMAP(x) = h1(x) = +
- most probable classification:
- P(+) = P(h1|D) = 0.4, P(−) = P(h2|D) + P(h3|D) = 0.6
Slide 24: Bayes Optimal Classifier
- cmax = argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D)
- Example:
- P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
- P(+|h1) = 1, P(−|h1) = 0
- P(+|h2) = 0, P(−|h2) = 1
- P(+|h3) = 0, P(−|h3) = 1
- therefore
- Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
- Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
- argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D) = −
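A compact sketch of the Bayes optimal classifier using the example numbers from this slide:

```python
# c_max = argmax_c sum_h P(c|h) P(h|D), with the three hypotheses above.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_c_given_h = {                       # P(c | h_i) for each hypothesis
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes=("+", "-")):
    score = {c: sum(p_c_given_h[h][c] * p_h_given_D[h] for h in p_h_given_D)
             for c in classes}
    return max(score, key=score.get), score

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```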
Slide 25: MAP vs. Bayes Method
- The maximum posterior hypothesis estimates a point hMAP in the hypothesis space H.
- The Bayes method instead estimates and uses a complete distribution P(h|D).
- The difference appears when MAP or the Bayes method is used for inference on unseen instances and one compares the distributions P(x|D):
- MAP: P(x|D) = hMAP(x), with hMAP = argmax_{h∈H} P(h|D)
- Bayes: P(x|D) = Σ_{hi∈H} P(x|hi) P(hi|D)
- For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.
Slide 26: Naïve Bayes Classifier
- popular, simple learning algorithm
- use when a moderate or large training set is available
- assumption: the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated)
- Applications:
- diagnosis
- text classification (newsgroup articles: 20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)
Slide 27: Naïve Bayes Classifier
- Assume a discrete target function F: X → C, where each instance x is described by attributes <a1, a2, ..., an>.
- The most probable value of f(x) is:
- cMAP = argmax_{cj∈C} P(cj|<a1, a2, ..., an>)
- = argmax_{cj∈C} P(<a1, a2, ..., an>|cj) P(cj) / P(<a1, a2, ..., an>)
- = argmax_{cj∈C} P(<a1, a2, ..., an>|cj) P(cj)
- Naïve Bayes assumption: P(<a1, a2, ..., an>|cj) = Πi P(ai|cj)
- cNB = argmax_{cj∈C} P(cj) Πi P(ai|cj)
Slide 28: Naïve Bayes Learning Algorithm
- Naïve_Bayes_Learn(examples):
- for each target value cj, estimate P(cj)
- for each attribute value ai of each attribute a, estimate P(ai|cj)
- Classify_New_Instance(x):
- cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)
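A compact dictionary-based sketch of these two procedures for discrete attributes; the data representation (list of attribute-dict/label pairs) is an assumption:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for attrs, c in examples:
        for a, v in attrs.items():
            attr_counts[(c, a)][v] += 1
    n = len(examples)
    priors = {c: class_counts[c] / n for c in class_counts}      # estimate P(c_j)
    def p_attr(a, v, c):                                         # estimate P(a_i = v | c_j)
        return attr_counts[(c, a)][v] / class_counts[c]
    return priors, p_attr

def classify_new_instance(x, priors, p_attr):
    # c_NB = argmax_c P(c) * prod_i P(a_i | c)
    def score(c):
        s = priors[c]
        for a, v in x.items():
            s *= p_attr(a, v, c)
        return s
    return max(priors, key=score)
```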
Slide 29: Naïve Bayes Example
- Consider PlayTennis and the new instance
- <Outlook=sunny, Temp=cool, Humidity=high, Wind=strong>
- Compute cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)
- PlayTennis: (9+, 5−)
- P(yes) = 9/14, P(no) = 5/14
- Wind=strong: (3+, 3−)
- P(strong|yes) = 3/9, P(strong|no) = 3/5
- ...
- P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
- P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.021
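A quick numeric check of this example. The conditional-probability fractions below are assumed from the standard PlayTennis table in Mitchell (only P(strong|yes) and P(strong|no) are stated on the slide), but they reproduce the slide's 0.005 and 0.021:

```python
from fractions import Fraction as F

p_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)   # P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)   # P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no)
print(float(p_yes), float(p_no))      # ~0.0053 vs ~0.0206  ->  c_NB = no
print(float(p_no / (p_yes + p_no)))   # normalized: P(no|x) ~ 0.79
```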
Slide 30: Estimating Probabilities
- What if none (nc = 0) of the training instances with target value cj have attribute value ai?
- Then P(ai|cj) = nc/n = 0 and P(cj) Π_{ai∈x} P(ai|cj) = 0
- Solution: Bayesian estimate (m-estimate) for P(ai|cj):
- P(ai|cj) = (nc + m·p) / (n + m)
- n: number of training examples for which c = cj
- nc: number of examples for which c = cj and a = ai
- p: prior estimate of P(ai|cj)
- m: weight given to the prior (number of "virtual" examples)
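A minimal sketch of the m-estimate; the example numbers in the usage line are illustrative assumptions:

```python
# P(a_i | c_j) = (n_c + m * p) / (n + m)
def m_estimate(n_c, n, p, m):
    """n_c: examples with c=c_j and a=a_i, n: examples with c=c_j,
    p: prior estimate of P(a_i|c_j), m: weight of the prior (virtual examples)."""
    return (n_c + m * p) / (n + m)

# e.g. no observed cases (n_c = 0) out of n = 9, uniform prior p = 1/3, m = 3
print(m_estimate(0, 9, 1/3, 3))   # 0.0833... instead of a hard zero
```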
Slide 31: Bayesian Belief Networks
- the naïve assumption of conditional independence is too restrictive
- the full joint probability distribution is intractable due to lack of data
- Bayesian belief networks describe conditional independence among subsets of variables
- this allows combining prior knowledge about causal relationships among variables with observed data
Slide 32: Conditional Independence
- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if
- ∀ xi, yj, zk: P(X=xi|Y=yj, Z=zk) = P(X=xi|Z=zk)
- or more compactly: P(X|Y,Z) = P(X|Z)
- Example: Thunder is conditionally independent of Rain given Lightning
- P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
- Notice: P(Thunder|Rain) ≠ P(Thunder)
- Naïve Bayes uses conditional independence to justify:
- P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
Slide 33: Bayesian Belief Network
- [Figure: belief network with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire]
- The network represents a set of conditional independence assertions:
- Each node is conditionally independent of its non-descendants, given its immediate predecessors. (directed acyclic graph)
Slide 34: Bayesian Belief Network
- [Figure: the same network, with the conditional probability table P(C|S,B) attached to the Campfire node]
- The network represents the joint probability distribution over all variables:
- P(Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire)
- P(y1, ..., yn) = Π_{i=1..n} P(yi|Parents(Yi))
- the joint distribution is fully defined by the graph plus the P(yi|Parents(Yi))
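A sketch of this factorization for the network above. The graph structure follows the slide, but every numeric CPT entry below is a made-up placeholder, not a value from the lecture figure:

```python
# P(S,B,L,C,T,F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
p_storm = {True: 0.1, False: 0.9}
p_bus   = {True: 0.2, False: 0.8}
p_lightning_given_storm   = {True: 0.6, False: 0.05}     # P(L=True | S)
p_campfire_given_sb       = {(True, True): 0.4, (True, False): 0.1,
                             (False, True): 0.8, (False, False): 0.2}   # P(C=True | S,B)
p_thunder_given_lightning = {True: 0.95, False: 0.01}    # P(T=True | L)
p_forestfire_given_slc    = {(True, True, True): 0.5}    # only the entry needed below

def joint(s, b, l, c, t, f):
    """Joint probability for the all-True assignment used below."""
    return (p_storm[s] * p_bus[b]
            * p_lightning_given_storm[s]
            * p_campfire_given_sb[(s, b)]
            * p_thunder_given_lightning[l]
            * p_forestfire_given_slc[(s, l, c)])

print(joint(True, True, True, True, True, True))
```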
Slide 35: Expectation Maximization (EM)
- when to use:
- data is only partially observable
- unsupervised clustering: target value unobservable
- supervised learning: some instance attributes unobservable
- applications:
- training Bayesian belief networks
- unsupervised clustering
- learning hidden Markov models
Slide 36: Generating Data from a Mixture of Gaussians
- Each instance x is generated by:
- choosing one of the k Gaussians at random
- generating an instance according to that Gaussian
Slide 37: EM for Estimating k Means
- Given:
- instances from X generated by a mixture of k Gaussians
- unknown means <μ1, ..., μk> of the k Gaussians
- we don't know which instance xi was generated by which Gaussian
- Determine:
- maximum likelihood estimates of <μ1, ..., μk>
- Think of the full description of each instance as yi = <xi, zi1, zi2>
- zij is 1 if xi was generated by the j-th Gaussian
- xi: observable
- zij: unobservable
Slide 38: EM for Estimating k Means
- EM algorithm: pick a random initial h = <μ1, μ2>, then iterate:
- E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds.
- E[zij] = p(x=xi|μ=μj) / Σ_{n=1,2} p(x=xi|μ=μn)
- = exp(−(xi−μj)²/(2σ²)) / Σ_{n=1,2} exp(−(xi−μn)²/(2σ²))
- M step: Calculate a new maximum likelihood hypothesis h' = <μ1', μ2'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step. Replace h = <μ1, μ2> by h' = <μ1', μ2'>.
- μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
Slide 39: EM Algorithm
- Converges to a local maximum of the likelihood and provides estimates of the hidden variables zij.
- In fact, it finds a local maximum of E[ln P(Y|h)]:
- Y is the complete data (observable plus unobservable variables)
- the expected value is taken over the possible values of the unobserved variables in Y
Slide 40: General EM Problem
- Given:
- observed data X = {x1, ..., xm}
- unobserved data Z = {z1, ..., zm}
- a parameterized probability distribution P(Y|h), where
- Y = {y1, ..., ym} is the full data, yi = <xi, zi>
- h are the parameters
- Determine:
- h that (locally) maximizes E[ln P(Y|h)]
- Applications:
- train Bayesian belief networks
- unsupervised clustering
- hidden Markov models
Slide 41: General EM Method
- Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
- Q(h'|h) = E[ ln P(Y|h') | h, X ]
- EM algorithm:
- Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
- Q(h'|h) ← E[ ln P(Y|h') | h, X ]
- Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
- h ← argmax_{h'∈H} Q(h'|h)