Title: 2D1431 Machine Learning
Slide 1: 2D1431 Machine Learning
Slide 2: Outline
- Bayes theorem
- Maximum likelihood (ML) hypothesis
- Maximum a posteriori (MAP) hypothesis
- Naïve Bayes classifier
- Bayes optimal classifier
- Bayesian belief networks
- Expectation maximization (EM) algorithm
Slide 3: Handwritten character classification
Slide 4: Gray-level pictures: object classification
Slide 5: Gray-level pictures: human action classification
Slide 6: Literature & Software
- T. Mitchell, Machine Learning, chapter 6
- S. Russell & P. Norvig, Artificial Intelligence: A Modern Approach, chapters 14-15
- R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., chapters 2-3
- David Heckerman, A Tutorial on Learning with Bayesian Belief Networks, http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
- Bayes Net Toolbox for Matlab (free), Kevin Murphy, http://www.cs.berkeley.edu/murphyk/Bayes/bnt.html
Slide 7: Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D)
- P(D): prior probability of the data D (evidence)
- P(h): prior probability of the hypothesis h (prior)
- P(h|D): posterior probability of the hypothesis h given the data D (posterior)
- P(D|h): probability of the data D given the hypothesis h (likelihood of the data)
Slide 8: Bayes Theorem
- P(h|D) = P(D|h) P(h) / P(D)
- posterior = likelihood × prior / evidence
- By observing the data D we can convert the prior probability P(h) into the a posteriori probability (posterior) P(h|D).
- The posterior is the probability that h holds after the data D has been observed.
- The evidence P(D) can be viewed merely as a scale factor that guarantees that the posterior probabilities sum to one.
Slide 9: Choosing Hypotheses
- P(h|D) = P(D|h) P(h) / P(D)
- Generally we want the most probable hypothesis given the training data.
- Maximum a posteriori (MAP) hypothesis hMAP:
- hMAP = argmax_{h∈H} P(h|D)
- = argmax_{h∈H} P(D|h) P(h) / P(D)
- = argmax_{h∈H} P(D|h) P(h)
- If the hypothesis priors are equally likely, P(hi) = P(hj), then one can choose the maximum likelihood (ML) hypothesis:
- hML = argmax_{h∈H} P(D|h)
Slide 10: Bayes Theorem Example
- A patient takes a lab test and the result is positive. The test returns a correct positive (+) result in 98% of the cases in which the disease is actually present, and a correct negative (−) result in 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have the disease.
- Hypotheses: disease, ¬disease
- priors: P(disease) = 0.008, P(¬disease) = 0.992
- likelihoods: P(+|disease) = 0.98, P(−|disease) = 0.02, P(+|¬disease) = 0.03, P(−|¬disease) = 0.97
- Maximum posterior: argmax_h P(h|+)
- P(disease|+) ∝ P(+|disease) P(disease) = 0.0078
- P(¬disease|+) ∝ P(+|¬disease) P(¬disease) = 0.0298
- P(disease|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
- P(¬disease|+) = 0.0298 / (0.0078 + 0.0298) = 0.79
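A quick numeric check of this example, sketched in Python with the probabilities stated above:

```python
# Quick check of the lab-test example using Bayes theorem.
p_disease = 0.008
p_pos_given_disease = 0.98        # correct positive rate
p_pos_given_no_disease = 0.03     # false positive rate = 1 - 0.97

num_disease    = p_pos_given_disease * p_disease             # ~0.0078
num_no_disease = p_pos_given_no_disease * (1 - p_disease)    # ~0.0298
evidence = num_disease + num_no_disease                      # P(+)

print(num_disease / evidence)     # P(disease | +)   ~ 0.21
print(num_no_disease / evidence)  # P(~disease | +)  ~ 0.79
```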
Slide 11: Basic Formulas for Probabilities
- Product rule: P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if A1, A2, ..., An are mutually exclusive events with Σi P(Ai) = 1, then
- P(B) = Σi P(B|Ai) P(Ai)
Slide 12: Bayes Theorem Example
- P(x1,x2|μ1,μ2,σ) = 1/(2πσ²) exp(−Σi (xi−μi)²/(2σ²))
- h = <μ1, μ2, σ>
- D = {x1, ..., xm}
Slide 13: Gaussian Probability Function
- P(D|μ1,μ2,σ) = Π_m P(x_m|μ1,μ2,σ)
- Maximum likelihood hypothesis hML:
- hML = argmax_{μ1,μ2,σ} P(D|μ1,μ2,σ)
- Trick: maximize the log-likelihood
- log P(D|μ1,μ2,σ) = Σ_m log P(x_m|μ1,μ2,σ)
- = Σ_m log [ 1/(2πσ²) exp(−Σ_i (x_m,i−μ_i)²/(2σ²)) ]
- = −M log(2πσ²) − Σ_m Σ_i (x_m,i−μ_i)²/(2σ²)
Slide 14: Gaussian Probability Function
- ∂ log P(D|μ1,μ2,σ) / ∂μ_i = 0
- ⇒ Σ_m (x_m,i − μ_i) = 0 ⇒ μ_i^ML = (1/M) Σ_m x_m,i = E[x_m,i]
- ∂ log P(D|μ1,μ2,σ) / ∂σ = 0
- ⇒ σ²_ML = Σ_m Σ_i (x_m,i − μ_i^ML)² / (2M) = E[ Σ_i (x_m,i − μ_i^ML)² / 2 ]
- Maximum likelihood hypothesis hML = <μ_i^ML, σ_ML>
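A minimal numeric sketch of slide 14's estimators for a 2D isotropic Gaussian; the synthetic NumPy data below is an assumption for illustration, not lecture data:

```python
import numpy as np

# ML estimates for h = <mu1, mu2, sigma> of a 2D isotropic Gaussian.
rng = np.random.default_rng(0)
X = rng.normal(loc=[0.2, -0.14], scale=1.4, size=(500, 2))   # rows are examples x_m (assumed data)

mu_ml = X.mean(axis=0)                                       # mu_i = (1/M) sum_m x_mi
sigma2_ml = ((X - mu_ml) ** 2).sum(axis=1).mean() / 2.0      # sigma^2 = (1/2M) sum_m sum_i (x_mi - mu_i)^2
print("mu_ML =", mu_ml, "sigma_ML =", np.sqrt(sigma2_ml))
```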
Slide 15: Maximum Likelihood Hypothesis
- [Figure: data with the fitted Gaussian, μML = (0.20, −0.14), σML = 1.42]
Slide 16: Bayes Decision Rule
- [Figure: training examples of class c1 (marked x, Gaussian with μ1, σ1) and class c2 (marked o, Gaussian with μ2, σ2)]
Slide 17: Bayes Decision Rule
- Assume we have two Gaussian distributions associated with two separate classes c1, c2.
- P(x|ci) = P(x|μi,σi) = 1/(2πσi²) exp(−Σj (xj−μi,j)²/(2σi²))
- Bayes decision rule (maximum posterior probability):
- Decide c1 if P(c1|x) > P(c2|x), otherwise decide c2.
- If P(c1) = P(c2), use the maximum likelihood P(x|ci);
- else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci)
Slide 18: Bayes Decision Rule
- [Figure: decision boundary between the regions assigned to c1 and c2]
Slide 19: Two-Category Case
- Discriminant function:
- if g(x) > 0 then c1, else c2
- g(x) = P(c1|x) − P(c2|x)
- = P(x|c1) P(c1) − P(x|c2) P(c2)
- g(x) = log P(c1|x) − log P(c2|x)
- = log [P(x|c1)/P(x|c2)] + log [P(c1)/P(c2)]
- Gaussian probability functions with identical σi:
- g(x) = (x−μ2)²/(2σ²) − (x−μ1)²/(2σ²) + log P(c1) − log P(c2)
- the decision surface is a line/hyperplane
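A minimal sketch of the two-category discriminant for isotropic Gaussians with identical σ; all parameter values in the usage line are illustrative assumptions:

```python
import numpy as np

# g(x) = ||x - mu2||^2/(2 sigma^2) - ||x - mu1||^2/(2 sigma^2) + log P(c1) - log P(c2)
def g(x, mu1, mu2, sigma, p1, p2):
    x, mu1, mu2 = map(np.asarray, (x, mu1, mu2))
    return (np.sum((x - mu2) ** 2) - np.sum((x - mu1) ** 2)) / (2 * sigma ** 2) \
           + np.log(p1) - np.log(p2)

# decide c1 if g(x) > 0, else c2 (made-up means, sigma, and equal priors)
label = "c1" if g([0.5, 0.0], mu1=[1, 0], mu2=[-1, 0], sigma=1.0, p1=0.5, p2=0.5) > 0 else "c2"
print(label)   # x is closer to mu1, so c1
```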
Slide 20: Learning a Real-Valued Function
- [Figure: target function f, maximum likelihood hypothesis hML, and noise e]
- Consider a real-valued target function f.
- Noisy training examples <xi, di>:
- di = f(xi) + ei
- ei is a random variable drawn from a Gaussian distribution with zero mean.
- The maximum likelihood hypothesis hML is the one that minimizes the sum of squared errors:
- hML = argmin_{h∈H} Σi (di − h(xi))²
Slide 21: Learning a Real-Valued Function
- hML = argmax_{h∈H} P(D|h)
- = argmax_{h∈H} Πi p(di|h)
- = argmax_{h∈H} Πi (2πσ²)^(−1/2) exp(−(di−h(xi))²/(2σ²))
- maximizing the logarithm log P(D|h):
- hML = argmax_{h∈H} Σi [ −½ log(2πσ²) − (di−h(xi))²/(2σ²) ]
- = argmax_{h∈H} Σi −(di − h(xi))²
- = argmin_{h∈H} Σi (di − h(xi))²
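A small sketch of this result: under zero-mean Gaussian noise the ML hypothesis is the least-squares fit. The linear hypothesis class and the synthetic data are assumptions for illustration:

```python
import numpy as np

# d_i = f(x_i) + e_i with Gaussian noise e_i; hML minimizes sum_i (d_i - h(x_i))^2.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 50)
d = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.shape)   # noisy targets (assumed f)

# least-squares solution for a linear hypothesis h(x) = w1*x + w0
A = np.column_stack([x, np.ones_like(x)])
w, *_ = np.linalg.lstsq(A, d, rcond=None)
print("h_ML(x) = %.2f x + %.2f" % (w[0], w[1]))
```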
Slide 22: Learning to Predict Probabilities
- Predicting the survival probability of a patient
- Training examples <xi, di>, where di is 0 or 1
- Objective: train a neural network to output the probability h(xi) = P(di = 1) given xi
- Maximum likelihood hypothesis:
- hML = argmax_{h∈H} Σi [ di ln h(xi) + (1−di) ln(1−h(xi)) ]
- maximize the cross entropy between di and h(xi)
- Weight update rule for the synapses wk to the output neuron h(xi):
- wk ← wk + η Σi (di − h(xi)) xi,k
- Compare to the standard BP weight update rule:
- wk ← wk + η Σi h(xi)(1 − h(xi)) (di − h(xi)) xi,k
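A minimal sketch of the cross-entropy weight update for a single sigmoid output neuron; the synthetic data, learning rate, and epoch count are assumptions:

```python
import numpy as np

# w_k <- w_k + eta * sum_i (d_i - h(x_i)) * x_ik   (slide 22's update rule)
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                            # inputs x_i with 3 attributes (assumed)
d = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # 0/1 targets (illustrative)

w, eta = np.zeros(3), 0.01
for epoch in range(200):
    h = 1.0 / (1.0 + np.exp(-X @ w))    # h(x_i) = sigmoid(w . x_i)
    w += eta * X.T @ (d - h)            # ascends sum_i d_i ln h + (1-d_i) ln(1-h)
print("learned weights:", w)
```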
Slide 23: Most Probable Classification
- So far we sought the most probable hypothesis hMAP.
- What is the most probable classification of a new instance x given the data D?
- hMAP(x) is not the most probable classification, although it is often a sufficiently good approximation of it.
- Consider three possible hypotheses:
- P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
- Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −
- hMAP(x) = h1(x) = +
- most probable classification:
- P(+) = P(h1|D) = 0.4, P(−) = P(h2|D) + P(h3|D) = 0.6
Slide 24: Bayes Optimal Classifier
- cmax = argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D)
- Example:
- P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
- P(+|h1) = 1, P(−|h1) = 0
- P(+|h2) = 0, P(−|h2) = 1
- P(+|h3) = 0, P(−|h3) = 1
- therefore
- Σ_{hi∈H} P(+|hi) P(hi|D) = 0.4
- Σ_{hi∈H} P(−|hi) P(hi|D) = 0.6
- argmax_{cj∈C} Σ_{hi∈H} P(cj|hi) P(hi|D) = −
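A compact sketch of the Bayes optimal classifier using the example numbers from this slide:

```python
# c_max = argmax_c sum_h P(c|h) P(h|D), with the three hypotheses above.
p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_c_given_h = {                       # P(c | h_i) for each hypothesis
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

def bayes_optimal(classes=("+", "-")):
    score = {c: sum(p_c_given_h[h][c] * p_h_given_D[h] for h in p_h_given_D)
             for c in classes}
    return max(score, key=score.get), score

print(bayes_optimal())   # ('-', {'+': 0.4, '-': 0.6})
```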
Slide 25: MAP vs. Bayes Method
- The maximum posterior hypothesis estimates a point hMAP in the hypothesis space H.
- The Bayes method instead estimates and uses a complete distribution P(h|D).
- The difference appears when MAP or the Bayes method is used for inference on unseen instances and one compares the distributions P(x|D):
- MAP: P(x|D) = hMAP(x), with hMAP = argmax_{h∈H} P(h|D)
- Bayes: P(x|D) = Σ_{hi∈H} P(x|hi) P(hi|D)
- For reasonable prior distributions P(h), the MAP and Bayes solutions are equivalent in the asymptotic limit of infinite training data D.
Slide 26: Naïve Bayes Classifier
- popular, simple learning algorithm
- use when a moderate or large training set is available
- assumption: the attributes that describe instances are conditionally independent given the classification (in practice it works surprisingly well even if the assumption is violated)
- Applications:
- diagnosis
- text classification (newsgroup articles: 20 newsgroups, 1000 documents per newsgroup, classification accuracy 89%)
Slide 27: Naïve Bayes Classifier
- Assume a discrete target function F: X → C, where each instance x is described by attributes <a1, a2, ..., an>.
- The most probable value of f(x) is:
- cMAP = argmax_{cj∈C} P(cj|<a1, a2, ..., an>)
- = argmax_{cj∈C} P(<a1, a2, ..., an>|cj) P(cj) / P(<a1, a2, ..., an>)
- = argmax_{cj∈C} P(<a1, a2, ..., an>|cj) P(cj)
- Naïve Bayes assumption: P(<a1, a2, ..., an>|cj) = Πi P(ai|cj)
- cNB = argmax_{cj∈C} P(cj) Πi P(ai|cj)
Slide 28: Naïve Bayes Learning Algorithm
- Naïve_Bayes_Learn(examples):
- for each target value cj, estimate P(cj)
- for each attribute value ai of each attribute a, estimate P(ai|cj)
- Classify_New_Instance(x):
- cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)
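A compact dictionary-based sketch of these two procedures for discrete attributes; the data representation (list of attribute-dict/label pairs) is an assumption:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(c for _, c in examples)
    attr_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for attrs, c in examples:
        for a, v in attrs.items():
            attr_counts[(c, a)][v] += 1
    n = len(examples)
    priors = {c: class_counts[c] / n for c in class_counts}      # estimate P(c_j)
    def p_attr(a, v, c):                                         # estimate P(a_i = v | c_j)
        return attr_counts[(c, a)][v] / class_counts[c]
    return priors, p_attr

def classify_new_instance(x, priors, p_attr):
    # c_NB = argmax_c P(c) * prod_i P(a_i | c)
    def score(c):
        s = priors[c]
        for a, v in x.items():
            s *= p_attr(a, v, c)
        return s
    return max(priors, key=score)
```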
Slide 29: Naïve Bayes Example
- Consider PlayTennis and the new instance
- <Outlook=sunny, Temp=cool, Humidity=high, Wind=strong>
- Compute cNB = argmax_{cj∈C} P(cj) Π_{ai∈x} P(ai|cj)
- PlayTennis: (9+, 5−)
- P(yes) = 9/14, P(no) = 5/14
- Wind=strong: (3+, 3−)
- P(strong|yes) = 3/9, P(strong|no) = 3/5
- ...
- P(yes) P(sunny|yes) P(cool|yes) P(high|yes) P(strong|yes) = 0.005
- P(no) P(sunny|no) P(cool|no) P(high|no) P(strong|no) = 0.021
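A quick numeric check of this example. The conditional-probability fractions below are assumed from the standard PlayTennis table in Mitchell (only P(strong|yes) and P(strong|no) are stated on the slide), but they reproduce the slide's 0.005 and 0.021:

```python
from fractions import Fraction as F

p_yes = F(9, 14) * F(2, 9) * F(3, 9) * F(3, 9) * F(3, 9)   # P(yes)*P(sunny|yes)*P(cool|yes)*P(high|yes)*P(strong|yes)
p_no  = F(5, 14) * F(3, 5) * F(1, 5) * F(4, 5) * F(3, 5)   # P(no)*P(sunny|no)*P(cool|no)*P(high|no)*P(strong|no)
print(float(p_yes), float(p_no))      # ~0.0053 vs ~0.0206  ->  c_NB = no
print(float(p_no / (p_yes + p_no)))   # normalized: P(no|x) ~ 0.79
```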
Slide 30: Estimating Probabilities
- What if none (nc = 0) of the training instances with target value cj have attribute value ai?
- Then P(ai|cj) = nc/n = 0 and P(cj) Π_{ai∈x} P(ai|cj) = 0
- Solution: Bayesian estimate (m-estimate) for P(ai|cj):
- P(ai|cj) = (nc + m·p) / (n + m)
- n: number of training examples for which c = cj
- nc: number of examples for which c = cj and a = ai
- p: prior estimate of P(ai|cj)
- m: weight given to the prior (number of "virtual" examples)
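A minimal sketch of the m-estimate; the example numbers in the usage line are illustrative assumptions:

```python
# P(a_i | c_j) = (n_c + m * p) / (n + m)
def m_estimate(n_c, n, p, m):
    """n_c: examples with c=c_j and a=a_i, n: examples with c=c_j,
    p: prior estimate of P(a_i|c_j), m: weight of the prior (virtual examples)."""
    return (n_c + m * p) / (n + m)

# e.g. no observed cases (n_c = 0) out of n = 9, uniform prior p = 1/3, m = 3
print(m_estimate(0, 9, 1/3, 3))   # 0.0833... instead of a hard zero
```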
Slide 31: Bayesian Belief Networks
- the naïve assumption of conditional independence is too restrictive
- the full joint probability distribution is intractable due to lack of data
- Bayesian belief networks describe conditional independence among subsets of variables
- this allows combining prior knowledge about causal relationships among variables with observed data
Slide 32: Conditional Independence
- Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z, that is, if
- ∀ xi, yj, zk: P(X=xi|Y=yj, Z=zk) = P(X=xi|Z=zk)
- or more compactly: P(X|Y,Z) = P(X|Z)
- Example: Thunder is conditionally independent of Rain given Lightning
- P(Thunder|Rain, Lightning) = P(Thunder|Lightning)
- Notice: P(Thunder|Rain) ≠ P(Thunder)
- Naïve Bayes uses conditional independence to justify:
- P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)
Slide 33: Bayesian Belief Network
- [Figure: belief network with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire]
- The network represents a set of conditional independence assertions:
- Each node is conditionally independent of its non-descendants, given its immediate predecessors. (directed acyclic graph)
Slide 34: Bayesian Belief Network
- [Figure: the same network, with the conditional probability table P(C|S,B) attached to the Campfire node]
- The network represents the joint probability distribution over all variables:
- P(Storm, BusTourGroup, Lightning, Campfire, Thunder, Forestfire)
- P(y1, ..., yn) = Π_{i=1..n} P(yi|Parents(Yi))
- the joint distribution is fully defined by the graph plus the P(yi|Parents(Yi))
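A sketch of this factorization for the network above. The graph structure follows the slide, but every numeric CPT entry below is a made-up placeholder, not a value from the lecture figure:

```python
# P(S,B,L,C,T,F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
p_storm = {True: 0.1, False: 0.9}
p_bus   = {True: 0.2, False: 0.8}
p_lightning_given_storm   = {True: 0.6, False: 0.05}     # P(L=True | S)
p_campfire_given_sb       = {(True, True): 0.4, (True, False): 0.1,
                             (False, True): 0.8, (False, False): 0.2}   # P(C=True | S,B)
p_thunder_given_lightning = {True: 0.95, False: 0.01}    # P(T=True | L)
p_forestfire_given_slc    = {(True, True, True): 0.5}    # only the entry needed below

def joint(s, b, l, c, t, f):
    """Joint probability for the all-True assignment used below."""
    return (p_storm[s] * p_bus[b]
            * p_lightning_given_storm[s]
            * p_campfire_given_sb[(s, b)]
            * p_thunder_given_lightning[l]
            * p_forestfire_given_slc[(s, l, c)])

print(joint(True, True, True, True, True, True))
```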
Slide 35: Expectation Maximization (EM)
- when to use:
- data is only partially observable
- unsupervised clustering: target value unobservable
- supervised learning: some instance attributes unobservable
- applications:
- training Bayesian belief networks
- unsupervised clustering
- learning hidden Markov models
Slide 36: Generating Data from a Mixture of Gaussians
- Each instance x is generated by:
- choosing one of the k Gaussians at random
- generating an instance according to that Gaussian
Slide 37: EM for Estimating k Means
- Given:
- instances from X generated by a mixture of k Gaussians
- unknown means <μ1, ..., μk> of the k Gaussians
- we don't know which instance xi was generated by which Gaussian
- Determine:
- maximum likelihood estimates of <μ1, ..., μk>
- Think of the full description of each instance as yi = <xi, zi1, zi2>
- zij is 1 if xi was generated by the j-th Gaussian
- xi: observable
- zij: unobservable
Slide 38: EM for Estimating k Means
- EM algorithm: pick a random initial h = <μ1, μ2>, then iterate:
- E step: Calculate the expected value E[zij] of each hidden variable zij, assuming the current hypothesis h = <μ1, μ2> holds.
- E[zij] = p(x=xi|μ=μj) / Σ_{n=1,2} p(x=xi|μ=μn)
- = exp(−(xi−μj)²/(2σ²)) / Σ_{n=1,2} exp(−(xi−μn)²/(2σ²))
- M step: Calculate a new maximum likelihood hypothesis h' = <μ1', μ2'>, assuming the value taken on by each hidden variable zij is its expected value E[zij] calculated in the E step. Replace h = <μ1, μ2> by h' = <μ1', μ2'>.
- μj ← Σ_{i=1..m} E[zij] xi / Σ_{i=1..m} E[zij]
Slide 39: EM Algorithm
- Converges to a local maximum of the likelihood and provides estimates of the hidden variables zij.
- In fact, it finds a local maximum of E[ln P(Y|h)]:
- Y is the complete data (observable plus unobservable variables)
- the expected value is taken over the possible values of the unobserved variables in Y
Slide 40: General EM Problem
- Given:
- observed data X = {x1, ..., xm}
- unobserved data Z = {z1, ..., zm}
- a parameterized probability distribution P(Y|h), where
- Y = {y1, ..., ym} is the full data, yi = <xi, zi>
- h are the parameters
- Determine:
- h that (locally) maximizes E[ln P(Y|h)]
- Applications:
- train Bayesian belief networks
- unsupervised clustering
- hidden Markov models
Slide 41: General EM Method
- Define a likelihood function Q(h'|h) which calculates Y = X ∪ Z, using the observed X and the current parameters h to estimate Z:
- Q(h'|h) = E[ ln P(Y|h') | h, X ]
- EM algorithm:
- Estimation (E) step: Calculate Q(h'|h) using the current hypothesis h and the observed data X to estimate the probability distribution over Y:
- Q(h'|h) ← E[ ln P(Y|h') | h, X ]
- Maximization (M) step: Replace hypothesis h by the hypothesis h' that maximizes this Q function:
- h ← argmax_{h'∈H} Q(h'|h)