Naïve Bayesian Learning: slides by Francisco Iacobelli, modified by Junsong Yuan

1
Naïve Bayesian Learning
slides by Francisco Iacobelli / modified by Junsong Yuan

Thomas Bayes (c. 1702 – 17 April 1761) was a British mathematician and Presbyterian minister. (Wikipedia)

2
Bayes' Theorem
An Essay towards solving a Problem in the
Doctrine of Chances

Notation
P(h): probability that hypothesis h holds (prior probability).
P(D): probability of observing the training data D.
P(D|h): probability of observing D when h holds (read: probability of D given h).
P(h|D): probability that h holds given the observed data D (posterior probability).
This last probability is what interests us. Why? Because we usually have data and need to find the most probable hypothesis given that data.

3
Bayes' Theorem
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
7
An Example

Weather forecast.
Probability of a hurricane in Chicago: 0.008.
When a hurricane occurs, Tom Skilling predicts it 98% of the time (p-h).
When no hurricane occurs, Skilling correctly predicts no hurricane 97% of the time (p-nh).

P(hurricane) = 0.008
P(¬hurricane) = 0.992
P(p-h | hurricane) = 0.98
P(p-h | ¬hurricane) = 0.03
P(p-nh | hurricane) = 0.02
P(p-nh | ¬hurricane) = 0.97
8
An Example

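On any random day, Skilling says there will be a hurricane. Should we believe him?
What's the probability that he is right?
What's the probability that he is wrong?

Let P(hurricane) = P(h) and P(¬hurricane) = P(¬h).

$$P(h \mid \text{p-h}) = \frac{P(\text{p-h} \mid h)\,P(h)}{P(\text{p-h})} = \frac{0.98 \times 0.008}{P(\text{p-h})} = \frac{0.0078}{P(\text{p-h})}$$

$$P(\neg h \mid \text{p-h}) = \frac{P(\text{p-h} \mid \neg h)\,P(\neg h)}{P(\text{p-h})} = \frac{0.03 \times 0.992}{P(\text{p-h})} = \frac{0.0298}{P(\text{p-h})}$$

$$P(\neg h \mid \text{p-h}) \approx 3.82\,P(h \mid \text{p-h})$$

So even when Skilling predicts a hurricane, no hurricane is the more probable outcome.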
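The arithmetic of this example can be checked with a short Python sketch. The probabilities are the ones given in the forecast example. The variable names are illustrative, and P(p-h) is obtained with the law of total probability, a step the slide leaves implicit.

```python
# Numbers come from the forecast example; variable names are illustrative.
p_h = 0.008                 # P(hurricane)
p_not_h = 1 - p_h           # P(no hurricane) = 0.992
p_ph_given_h = 0.98         # P(p-h | hurricane)
p_ph_given_not_h = 0.03     # P(p-h | no hurricane)

# Law of total probability: chance that a hurricane is predicted at all.
p_ph = p_ph_given_h * p_h + p_ph_given_not_h * p_not_h

# Bayes' theorem for both hypotheses.
p_h_given_ph = p_ph_given_h * p_h / p_ph
p_not_h_given_ph = p_ph_given_not_h * p_not_h / p_ph

print(f"P(p-h)          = {p_ph:.4f}")
print(f"P(h | p-h)      = {p_h_given_ph:.3f}")
print(f"P(not h | p-h)  = {p_not_h_given_ph:.3f}")
```

With these inputs the posterior probability of a hurricane given a hurricane forecast comes out at roughly 0.21, against roughly 0.79 for no hurricane.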
9

How does Tom predict?

P(h) = 0.008
P(¬h) = 0.992
P(p-h | h) = 0.98
P(p-nh | ¬h) = 0.97
P(p-h) = P(p-h | h) P(h) + P(p-h | ¬h) P(¬h) = 0.0376
10
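Maximum A Posteriori hypothesis (MAP) and Maximum Likelihood (ML)

$$h_{MAP} = \arg\max_{h} P(h \mid D)$$
But
$$h_{MAP} = \arg\max_{h} \frac{P(D \mid h)\,P(h)}{P(D)}$$
That's what we did!
$$h_{MAP} = \arg\max_{h} P(D \mid h)\,P(h)$$
because P(D) does not depend on h.
$$h_{ML} = \arg\max_{h} P(D \mid h)$$
when P(h) is constant.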
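To illustrate how the MAP and ML hypotheses can differ, here is a minimal Python sketch that reuses the numbers from the hurricane example, with the observed data D being "Skilling predicts a hurricane". The dictionary names are illustrative and not part of the slides.

```python
# Hypotheses and numbers come from the forecast example; names are illustrative.
prior = {"hurricane": 0.008, "no hurricane": 0.992}       # P(h)
likelihood = {"hurricane": 0.98, "no hurricane": 0.03}    # P(D | h), D = "p-h"

# MAP: maximize P(D | h) * P(h); P(D) is the same for every h and is dropped.
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])

# ML: maximize P(D | h) alone, which equals MAP when P(h) is constant.
h_ml = max(prior, key=lambda h: likelihood[h])

print("h_MAP =", h_map)
print("h_ML  =", h_ml)
```

Here the two criteria disagree: ML picks "hurricane" because the forecast is far more likely under that hypothesis, while MAP picks "no hurricane" because the prior for hurricanes is so small.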
11
Naïve Bayes

One of the most practical learning methods, together with neural networks (NN), decision trees (dTrees), and nearest neighbor (Nnbr).
When to use:
Moderate to large training sets.
Attributes that describe the data are conditionally independent: P(a | c, b) = P(a | b).
Successful applications: authorship of texts, other text classification, diagnosis.

12
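Naïve Bayes: What about more attributes?

V = {watch-TV, Do-Homework}
A = {tired, Heroes-on-TV, I-have-ML-hw}
$$v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$$
But
$$P(v_j \mid a_1, a_2, \ldots, a_n) = \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)}$$
$$v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$$
We have done this!
With conditionally independent attributes, P(a_1, a_2, ..., a_n | v_j) = P(a_1 | v_j) ··· P(a_n | v_j), so
$$v_{NB} = \arg\max_{v_j \in V} P(a_1 \mid v_j) \cdots P(a_n \mid v_j)\,P(v_j)$$
Let's look at an example.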
13
Play Tennis
Tom Mitchell, Machine Learning, p. 59
14
New Observation: <Sunny, Cool, High, Strong>

Do you Play Tennis?
v_NB = argmax {P(yes | <s,c,h,s>), P(no | <s,c,h,s>)}
P(yes | <s,c,h,s>) ∝ P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(strong | yes)
P(yes) = 9/14 = 0.64
P(sunny | yes) = 2/9 = 0.22
P(cool | yes) = 3/9 = 0.33
P(high | yes) = 3/9 = 0.33
P(strong | yes) = 3/9 = 0.33
P(yes | <s,c,h,s>) ∝ 0.0051 (from the rounded factors; 0.0053 with exact fractions)

15
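New Observation: <Sunny, Cool, High, Strong>

Do you Play Tennis?
v_NB = argmax {P(yes | <s,c,h,s>), P(no | <s,c,h,s>)}
P(no | <s,c,h,s>) ∝ P(no) P(sunny | no) P(cool | no) P(high | no) P(strong | no)
P(no) = 5/14 = 0.36
P(sunny | no) = 3/5 = 0.6
P(cool | no) = 1/5 = 0.2
P(high | no) = 4/5 = 0.8
P(strong | no) = 3/5 = 0.6
P(no | <s,c,h,s>) ∝ 0.0207 (from the rounded prior; 0.0206 with exact fractions)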

16
New Observation: <Sunny, Cool, High, Strong>

Do you Play Tennis?
v_NB = argmax {P(yes | <s,c,h,s>), P(no | <s,c,h,s>)}
P(yes | <s,c,h,s>) ∝ 0.0051
P(no | <s,c,h,s>) ∝ 0.0207
Therefore, given <sunny, cool, high, strong>, I will not play tennis.
Moreover, after normalizing, there is an 80% probability that I will not play tennis under the given forecast: 0.0207 / (0.0051 + 0.0207) ≈ 0.80.
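The Play Tennis computation can be reproduced with a small Python sketch. It uses only the fractions quoted on the slides (the training table itself is not reproduced here) and keeps them exact with `fractions.Fraction`, so the results differ slightly from the rounded slide values. The variable names are illustrative.

```python
from fractions import Fraction as F
from math import prod

# Estimates quoted on the slides for <Sunny, Cool, High, Strong>.
prior = {"yes": F(9, 14), "no": F(5, 14)}
cond = {
    "yes": [F(2, 9), F(3, 9), F(3, 9), F(3, 9)],  # sunny, cool, high, strong
    "no": [F(3, 5), F(1, 5), F(4, 5), F(3, 5)],
}

# Unnormalized naive Bayes score: P(v) times the product of P(a_i | v).
score = {v: prior[v] * prod(cond[v]) for v in prior}

# Normalize so that the two scores sum to one.
total = sum(score.values())
posterior = {v: score[v] / total for v in score}

v_nb = max(score, key=score.get)
print("scores:", {v: round(float(s), 4) for v, s in score.items()})
print("posterior:", {v: round(float(p), 3) for v, p in posterior.items()})
print("v_NB =", v_nb)
```

With exact fractions the scores are about 0.0053 for yes and 0.0206 for no, and the normalized probability of not playing is about 0.795, in line with the 80% quoted above.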

17
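Another Observation: <overcast, cool, low, strong>

P(no) P(overcast | no) P(cool | no) P(low | no) P(strong | no)
P(overcast | no) = 0/5 = 0
So, P(no | <o,c,l,s>) = 0 !?!?!
Solution: the m-estimate of probability
$$P(\text{overcast} \mid \text{no}) = \frac{n_c + m\,p}{n + m} = \frac{0 + 3 \cdot \frac{1}{3}}{5 + 3} = 0.125$$
n_c = number of "no" examples with overcast; n = number of "no" examples
p = 1/|values|
m = weight, usually |values|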
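The m-estimate can be written as a one-line function. The sketch below uses the slide's numbers (n_c = 0, n = 5, m = 3, p = 1/3); the function name `m_estimate` is a hypothetical helper chosen for illustration.

```python
def m_estimate(n_c, n, m, p):
    """Return the m-estimate (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# P(overcast | no): none of the 5 "no" examples is overcast,
# and the attribute has 3 possible values, so p = 1/3 and m = 3.
values = 3
p_overcast_given_no = m_estimate(n_c=0, n=5, m=values, p=1 / values)
print(p_overcast_given_no)
```

Because the estimate never reaches zero, a single unseen attribute value no longer wipes out the whole product.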
18
Naïve Bayes and Text

Naïve Bayes is used to classify text
Authorship of documents
Interesting websites
And

19
SPAM!!!!
20
Sample emails

In just as little as 2 weeks you can have a masters degree from a national university. A better job, more income and a better life can all be yours in less than 2 weeks.

Yo Walter_balan!.!A Genuine Univers1ty Degree 1n 4-6 weeks! Have you ever thought that the only thing stopping you from a great job and better pay was a few letters behind you name?Well now you can get them! BA BSc MA MSc MBA PhD

21
Issues

Words are not always in the same positions, e.g., weeks, job, better.
Not all messages have the same length.
Not all words are conditionally independent, e.g., "University Degree".

22
How to do it?

One approach: N-grams (take n words at a time).
Another approach:
Assume conditional independence, and
assume the probability of a word is the same in any given position:
P(a_i = w_k | v_j) = P(a_m = w_k | v_j) for all positions i, m
Probability:
$$P(w_k \mid v_j) = \frac{n_k + 1}{n + |\text{Vocabulary}|}$$
23
Now, Let's Learn The Probabilities

Examples: a set of texts with target values. V is the set of possible target values (spam, no spam).
Collect all words:
Vocabulary = all distinct words and tokens in any text document
Calculate P(v_j) and P(w_k | v_j):
For each target value v_j in V do
  docs_j = subset of Examples for which the target is v_j
  P(v_j) = |docs_j| / |Examples|
  Text_j = concatenation of the members of docs_j
  n = total number of word positions in Text_j
  For each word w_k in Vocabulary
    n_k = number of times word w_k occurs in Text_j
    P(w_k | v_j) = (n_k + 1) / (n + |Vocabulary|)
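The learning procedure above can be sketched in Python as follows. This is a minimal illustration, not the slides' own code. Several parts are assumptions: the lowercased whitespace tokenization, the function name `learn_naive_bayes_text`, and the three toy examples.

```python
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (text, target) pairs.
    Returns (vocabulary, priors, word_probs)."""
    # Assumption: lowercase and split on whitespace to get tokens.
    tokenized = [(text.lower().split(), target) for text, target in examples]
    vocabulary = {w for words, _ in tokenized for w in words}
    targets = {t for _, t in tokenized}

    priors, word_probs = {}, {}
    for v in targets:
        docs_v = [words for words, t in tokenized if t == v]
        priors[v] = len(docs_v) / len(tokenized)         # |docs_j| / |Examples|
        text_v = [w for words in docs_v for w in words]  # Text_j
        counts = Counter(text_v)                         # n_k for every word
        n = len(text_v)                                  # word positions in Text_j
        word_probs[v] = {
            w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary
        }
    return vocabulary, priors, word_probs

# Hypothetical toy data, not taken from the slides.
examples = [
    ("get a degree in two weeks", "spam"),
    ("better job better pay", "spam"),
    ("homework is due in two weeks", "no spam"),
]
vocabulary, priors, word_probs = learn_naive_bayes_text(examples)
print(priors)
print(word_probs["spam"]["better"])
```

The "+ 1" in the numerator and "+ |Vocabulary|" in the denominator keep every word probability above zero, and the probabilities for each target value still sum to one over the vocabulary.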
24
And Classify

Positions = all word positions in Doc that contain tokens found in Vocabulary
Return v_NB, where
$$v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in \text{Positions}} P(a_i \mid v_j)$$
Note: a_i = word at the i-th position.
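A minimal Python sketch of the classification step follows. It departs from the formula in one respect: it sums log-probabilities instead of multiplying probabilities, which gives the same argmax while avoiding numeric underflow on long documents. The function name, the tokenization, and the hand-picked probabilities are all assumptions made for illustration.

```python
import math

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    """Return the target v maximizing P(v) * product of P(a_i | v)."""
    # Positions: keep only tokens that are found in the vocabulary.
    positions = [w for w in doc.lower().split() if w in vocabulary]
    scores = {}
    for v in priors:
        # Sum of logs replaces the product of probabilities (same argmax).
        scores[v] = math.log(priors[v]) + sum(
            math.log(word_probs[v][w]) for w in positions
        )
    return max(scores, key=scores.get)

# Hypothetical, hand-picked probabilities for illustration only.
vocabulary = {"degree", "weeks", "homework"}
priors = {"spam": 0.5, "no spam": 0.5}
word_probs = {
    "spam": {"degree": 0.6, "weeks": 0.3, "homework": 0.1},
    "no spam": {"degree": 0.2, "weeks": 0.3, "homework": 0.5},
}

label_a = classify_naive_bayes_text("A degree in 2 weeks", vocabulary, priors, word_probs)
label_b = classify_naive_bayes_text("My homework", vocabulary, priors, word_probs)
print(label_a, "|", label_b)
```

Tokens outside the vocabulary ("a", "in", "2", "my") are simply skipped, matching the definition of Positions above.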
25
A Note on Performance
Kim, S. B., Seo, H. C., and Rim, H. C. (2003).
26
Final Notes

P(attribute) can follow any distribution you want.
You may need to estimate its variance and the estimation error.
But once that's figured out, you can have a classifier with a different probability distribution and maybe less training data.
To overcome the conditional independence restriction, one can build Bayesian Belief Networks.
Bayes' Theorem provides many possibilities for probabilistic machine learning and underlies many algorithms used in applications including speech recognition, natural language generation, and text classification.
