Title: Naïve Bayesian Learning, slides by Francisco Iacobelli, modified by Junsong Yuan
1. Naïve Bayesian Learning
slides by Francisco Iacobelli / modified by Junsong Yuan
- Thomas Bayes (c. 1702 - 17 April 1761) was a British mathematician and Presbyterian minister. (Wikipedia)
2. Bayes' Theorem
An Essay towards solving a Problem in the Doctrine of Chances
- Notation
  - $P(h)$: probability that hypothesis $h$ holds (prior probability).
  - $P(D)$: probability of observing the training data $D$.
  - $P(D \mid h)$: probability of observing $D$ when $h$ holds (read: probability of $D$ given $h$).
  - $P(h \mid D)$: probability that $h$ holds given the observed data $D$ (posterior probability).
- This last probability is what interests us. Why?
- Because we usually have data and need to find the most probable hypothesis given that data.
3. Bayes' Theorem
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
4. An Example
- Weather forecast.
- Probability of a hurricane in Chicago: 0.008.
- On days with a hurricane, Tom Skilling correctly predicts it (p-h) 98% of the time.
- On days without a hurricane, Skilling correctly predicts no hurricane (p-nh) 97% of the time.
$P(\text{hurricane}) = 0.008$
$P(\neg\text{hurricane}) = 0.992$
$P(\text{p-h} \mid \text{hurricane}) = 0.98$
$P(\text{p-h} \mid \neg\text{hurricane}) = 0.03$
$P(\text{p-nh} \mid \text{hurricane}) = 0.02$
$P(\text{p-nh} \mid \neg\text{hurricane}) = 0.97$
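8. An Example
- On any random day, Skilling says there will be a hurricane. Should we believe him?
- What's the probability that he is right?
- What's the probability that he is wrong?
Let $P(\text{hurricane}) = P(h)$ and $P(\neg\text{hurricane}) = P(\neg h)$.
$$P(h \mid \text{p-h}) = \frac{P(\text{p-h} \mid h)\,P(h)}{P(\text{p-h})} = \frac{0.98 \times 0.008}{P(\text{p-h})} = \frac{0.0078}{P(\text{p-h})}$$
$$P(\neg h \mid \text{p-h}) = \frac{P(\text{p-h} \mid \neg h)\,P(\neg h)}{P(\text{p-h})} = \frac{0.03 \times 0.992}{P(\text{p-h})} = \frac{0.0298}{P(\text{p-h})}$$
$$P(\neg h \mid \text{p-h}) \approx 3.8 \times P(h \mid \text{p-h})$$
Both posteriors share the denominator $P(\text{p-h})$, so their ratio is just the ratio of the numerators: when Skilling predicts a hurricane, he is almost four times more likely to be wrong than right.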
9. $P(\neg h) = 0.992$
$P(\text{p-h} \mid h) = 0.98$
$P(\text{p-h}) = 0.0078 + 0.0298 = 0.0376$
$P(\text{p-nh} \mid \neg h) = 0.97$
$P(h) = 0.008$
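The arithmetic of the hurricane example can be checked with a short script. This is a minimal sketch in Python: the variable names are illustrative and not from the slides, and the only inputs are the probabilities quoted in the example. It normalizes the two Bayes numerators by their sum, P(p-h), to obtain the posteriors.

```python
# Probabilities quoted in the hurricane example; names are illustrative.
p_h = 0.008               # P(h): prior probability of a hurricane
p_not_h = 1 - p_h         # P(not h) = 0.992
p_ph_given_h = 0.98       # P(p-h | h)
p_ph_given_not_h = 0.03   # P(p-h | not h)

# Numerators of Bayes' theorem for each hypothesis.
joint_h = p_ph_given_h * p_h              # about 0.0078
joint_not_h = p_ph_given_not_h * p_not_h  # about 0.0298

# Total probability of the prediction, P(p-h).
p_ph = joint_h + joint_not_h

posterior_h = joint_h / p_ph
posterior_not_h = joint_not_h / p_ph

print(f"P(p-h)         = {p_ph:.4f}")             # 0.0376
print(f"P(h | p-h)     = {posterior_h:.3f}")      # 0.209
print(f"P(not h | p-h) = {posterior_not_h:.3f}")  # 0.791
```

Under these numbers a hurricane forecast is right only about 21% of the time, because hurricanes are so rare to begin with.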
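10. Maximum A Posteriori hypothesis (MAP) and Maximum Likelihood (ML)
- $h_{MAP} = \arg\max_{h} P(h \mid D)$
- But $P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$
- $h_{MAP} = \arg\max_{h} \frac{P(D \mid h)\,P(h)}{P(D)}$. That's what we did!
- $h_{MAP} = \arg\max_{h} P(D \mid h)\,P(h)$, because $P(D)$ does not depend on $h$.
- $h_{ML} = \arg\max_{h} P(D \mid h)$ when $P(h)$ is constant.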
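To make the difference between the two rules concrete, the sketch below applies them to the hurricane numbers, treating "Skilling predicts a hurricane" as the data D. The helper names `h_map` and `h_ml` are hypothetical, chosen only for this illustration.

```python
def h_map(priors, likelihoods):
    """Return the hypothesis h that maximizes P(D | h) * P(h)."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])


def h_ml(likelihoods):
    """Return the hypothesis h that maximizes P(D | h) alone."""
    return max(likelihoods, key=likelihoods.get)


# D = "Skilling predicts a hurricane"; numbers from the hurricane example.
priors = {"hurricane": 0.008, "no hurricane": 0.992}    # P(h)
likelihoods = {"hurricane": 0.98, "no hurricane": 0.03}  # P(D | h)

print(h_map(priors, likelihoods))  # no hurricane
print(h_ml(likelihoods))           # hurricane
```

The two rules disagree here because the prior is far from uniform: ML ignores how rare hurricanes are, while MAP weighs the likelihood by the prior.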
11. Naïve Bayes
- One of the most practical methods, together with neural networks (NN), decision trees (dTrees), and nearest neighbor (Nnbr).
- When to use
  - Moderate to large training sets
  - Attributes that describe the data are conditionally independent: $P(a \mid c, b) = P(a \mid b)$
- Successful applications: authorship of texts, other text classification, diagnosis.
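12. Naïve Bayes: What about more attributes?
- $V = \{\text{watch-TV}, \text{do-homework}\}$
- $A = \{\text{tired}, \text{Heroes-on-TV}, \text{I-have-ML-hw}\}$
- $v_{MAP} = \arg\max_{v_j \in V} P(v_j \mid a_1, a_2, \ldots, a_n)$
- But $P(v_j \mid a_1, a_2, \ldots, a_n) = \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)}$
- $v_{MAP} = \arg\max_{v_j \in V} \frac{P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)}{P(a_1, a_2, \ldots, a_n)}$. We have done this!
- The denominator is the same for every $v_j$, so $v_{MAP} = \arg\max_{v_j \in V} P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j)$
- Assuming conditional independence, $P(a_1, a_2, \ldots, a_n \mid v_j)\,P(v_j) = P(a_1 \mid v_j) \cdots P(a_n \mid v_j)\,P(v_j)$
- $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)$
- Let's look at an example.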
13. Play Tennis
Tom Mitchell, Machine Learning, p. 59
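14. New Observation: <Sunny, Cool, High, Strong>
- Do you play tennis?
- $v_{NB} = \arg\max \{P(\text{yes} \mid \langle s,c,h,s \rangle),\ P(\text{no} \mid \langle s,c,h,s \rangle)\}$
- $P(\text{yes} \mid \langle s,c,h,s \rangle) \propto P(\text{yes})\,P(\text{sunny} \mid \text{yes})\,P(\text{cool} \mid \text{yes})\,P(\text{high} \mid \text{yes})\,P(\text{strong} \mid \text{yes})$
  - $P(\text{yes}) = 9/14 \approx 0.64$
  - $P(\text{sunny} \mid \text{yes}) = 2/9 \approx 0.22$
  - $P(\text{cool} \mid \text{yes}) = 3/9 \approx 0.33$
  - $P(\text{high} \mid \text{yes}) = 3/9 \approx 0.33$
  - $P(\text{strong} \mid \text{yes}) = 3/9 \approx 0.33$
- Product for yes: $\approx 0.0053$
15. New Observation: <Sunny, Cool, High, Strong>
- $P(\text{no} \mid \langle s,c,h,s \rangle) \propto P(\text{no})\,P(\text{sunny} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{high} \mid \text{no})\,P(\text{strong} \mid \text{no})$
  - $P(\text{no}) = 5/14 \approx 0.36$
  - $P(\text{sunny} \mid \text{no}) = 3/5 = 0.6$
  - $P(\text{cool} \mid \text{no}) = 1/5 = 0.2$
  - $P(\text{high} \mid \text{no}) = 4/5 = 0.8$
  - $P(\text{strong} \mid \text{no}) = 3/5 = 0.6$
- Product for no: $\approx 0.0206$
16. New Observation: <Sunny, Cool, High, Strong>
- Score for yes: $\approx 0.0053$
- Score for no: $\approx 0.0206$
- Therefore, given <sunny, cool, high, strong>, I will not play tennis.
- Moreover, there is an 80% probability that I will not play tennis under the given forecast. (Normalize: $0.0206 / (0.0053 + 0.0206) \approx 0.80$.)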
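The Play Tennis calculation can be reproduced from the counts behind the fractions quoted on these slides. The following is a rough sketch, not the author's code: the dictionary layout and variable names are assumptions, and only the attribute values needed for this one observation are included. Working from the exact fractions rather than two-decimal roundings, the scores come out near 0.0053 for yes and 0.0206 for no, which still normalizes to roughly 80% for no.

```python
from math import prod

# Counts behind the fractions on the slides (e.g. 2/9 for sunny given yes).
class_counts = {"yes": 9, "no": 5}
attr_counts = {
    "yes": {"sunny": 2, "cool": 3, "high": 3, "strong": 3},
    "no": {"sunny": 3, "cool": 1, "high": 4, "strong": 3},
}
observation = ["sunny", "cool", "high", "strong"]
total = sum(class_counts.values())  # 14 training examples

# Unnormalized naive Bayes score: P(v) * product of P(a_i | v).
scores = {}
for v, n_v in class_counts.items():
    prior = n_v / total
    likelihood = prod(attr_counts[v][a] / n_v for a in observation)
    scores[v] = prior * likelihood

v_nb = max(scores, key=scores.get)

# Normalize the scores so they sum to 1.
evidence = sum(scores.values())
posteriors = {v: s / evidence for v, s in scores.items()}

for v in scores:
    print(f"{v}: score={scores[v]:.4f}, posterior={posteriors[v]:.2f}")
print("decision:", v_nb)  # decision: no
```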
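17. Another Observation: <overcast, cool, low, strong>
- $P(\text{no})\,P(\text{overcast} \mid \text{no})\,P(\text{cool} \mid \text{no})\,P(\text{low} \mid \text{no})\,P(\text{strong} \mid \text{no})$
- None of the 5 "no" examples is overcast, so $P(\text{overcast} \mid \text{no}) = 0/5 = 0$.
- So, $P(\text{no} \mid \langle o,c,l,s \rangle) = 0$!?!?!
- Solve: m-estimate of probability
$$\frac{n_c + m\,p}{n + m}$$
  - $n_c$ = number of examples of the class with the attribute value, $n$ = number of examples of the class
  - $p = 1/|\text{values}|$
  - $m$ = weight. Usually $|\text{values}|$
- $P(\text{overcast} \mid \text{no}) = \frac{0 + 3 \cdot \frac{1}{3}}{5 + 3} = 0.125$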
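As a quick check of the m-estimate, the sketch below wraps the formula in a small function and reruns the overcast case. The function name `m_estimate` and its parameter names are assumptions that mirror the symbols on the slide.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


# P(overcast | no): 0 of the 5 "no" examples are overcast.
# Outlook takes 3 values, so p = 1/3 and m = 3 as on the slide.
num_values = 3
p_overcast_no = m_estimate(n_c=0, n=5, p=1 / num_values, m=num_values)
print(p_overcast_no)  # 0.125
```

With m = 0 the function falls back to the raw frequency n_c / n, which is the estimate that produced the zero in the first place.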
18. Naïve Bayes and Text
- Naïve Bayes is used to classify text
  - Authorship of documents
  - Interesting websites
  - And...
19. SPAM!!!!
20. Sample emails
- In just as little as 2 weeks you can have a masters degree from a national university. A better job, more income and a better life can all be yours in less than 2 weeks.
- Yo Walter_balan!.!A Genuine Univers1ty Degree 1n 4-6 weeks! Have you ever thought that the only thing stopping you from a great job and better pay was a few letters behind you name?Well now you can get them! BA BSc MA MSc MBA PhD
21. Issues
- Words are not always in the same positions, e.g., weeks, job, better
- Not all messages have the same length
- Not all words are conditionally independent, e.g., University Degree
22. How to do it?
- One approach: N-grams (take n words at a time)
- Another approach:
  - Assume conditional independence, and
  - Assume the probability of a word is the same in any given position: $P(a_i = w_k \mid v_j) = P(a_m = w_k \mid v_j)$
- Probability: $P(w_k \mid v_j) = \frac{n_k + 1}{n + |\text{Vocabulary}|}$
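The N-gram idea from the first bullet of this slide can be illustrated in a few lines. This is only a sketch of the windowing step: the function name `ngrams` and the whitespace tokenization are assumptions, and the slides do not say how N-grams would then be fed to the classifier.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "a genuine university degree".split()
print(ngrams(tokens, 2))
# [('a', 'genuine'), ('genuine', 'university'), ('university', 'degree')]
```

Treating a pair such as ('university', 'degree') as a single feature keeps the two words together, which is one way to soften the independence issue raised on the previous slide.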
23. Now, Let's Learn The Probabilities
- Examples: a set of texts with target values. $V$ is the set of possible target values (spam, no spam).
- Collect all words
  - Vocabulary: all distinct words and tokens in any text document
- Calculate $P(v_j)$ and $P(w_k \mid v_j)$
  - For each target value $v_j$ in $V$ do
    - $docs_j$ = subset of Examples for which the target is $v_j$
    - $P(v_j) = \frac{|docs_j|}{|\text{Examples}|}$
    - $Text_j$ = concatenation of the members of $docs_j$
    - $n$ = total number of word positions in $Text_j$
    - For each word $w_k$ in Vocabulary
      - $n_k$ = number of times word $w_k$ occurs in $Text_j$
      - $P(w_k \mid v_j) = \frac{n_k + 1}{n + |\text{Vocabulary}|}$
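The learning steps on this slide might be sketched as follows. This is an illustrative Python version under stated assumptions: tokens are lowercased and split on whitespace, n is taken to be the number of word positions in the concatenated text, and the function name and toy emails are hypothetical rather than taken from the slides.

```python
from collections import Counter


def learn_naive_bayes_text(examples):
    """examples: list of (text, target) pairs.

    Returns (priors, cond_probs, vocabulary).
    """
    # Vocabulary: all distinct tokens in any training text.
    vocabulary = {w for text, _ in examples for w in text.lower().split()}
    targets = {target for _, target in examples}
    priors, cond_probs = {}, {}
    for v in targets:
        docs_v = [text for text, target in examples if target == v]
        priors[v] = len(docs_v) / len(examples)   # P(v_j)
        text_v = " ".join(docs_v).lower().split()  # Text_j as a token list
        n = len(text_v)                            # word positions in Text_j
        counts = Counter(text_v)                   # n_k for every word
        cond_probs[v] = {
            w: (counts[w] + 1) / (n + len(vocabulary))  # P(w_k | v_j)
            for w in vocabulary
        }
    return priors, cond_probs, vocabulary


# Hypothetical toy data, not from the slides.
examples = [
    ("get a degree in 2 weeks", "spam"),
    ("better job better pay", "spam"),
    ("homework is due in 2 weeks", "no spam"),
]
priors, cond_probs, vocabulary = learn_naive_bayes_text(examples)
print(priors["spam"])                # 2 of the 3 examples are spam
print(cond_probs["spam"]["better"])  # (2 + 1) / (10 + 12)
```

Because of the added 1 in the numerator, a word that never appears in a class still gets a small nonzero probability, so one unseen word cannot zero out the whole product.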
24. And Classify
- Positions: all word positions in Doc that contain tokens found in Vocabulary
- Return $v_{NB}$, where
- $v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i \in \text{positions}} P(a_i \mid v_j)$
- Note: $a_i$ is the word at the $i$th position.
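A classification step along these lines could look like the sketch below. Two things here are assumptions rather than part of the slides: the probabilities are hand-set toy values, and the code sums logarithms instead of multiplying probabilities. The log form is a common substitute that picks the same argmax while avoiding numerical underflow on long documents.

```python
import math


def classify_naive_bayes_text(doc, priors, cond_probs, vocabulary):
    """Return the target v maximizing P(v) * prod_i P(a_i | v)."""
    # Positions: only the tokens of doc that are found in the vocabulary.
    words = [w for w in doc.lower().split() if w in vocabulary]
    best_v, best_score = None, -math.inf
    for v, prior in priors.items():
        # Sum of logs; monotonic in the product, so the argmax is unchanged.
        score = math.log(prior) + sum(math.log(cond_probs[v][w]) for w in words)
        if score > best_score:
            best_v, best_score = v, score
    return best_v


# Hypothetical hand-set probabilities, for illustration only.
vocabulary = {"degree", "weeks", "homework"}
priors = {"spam": 0.5, "no spam": 0.5}
cond_probs = {
    "spam": {"degree": 0.6, "weeks": 0.3, "homework": 0.1},
    "no spam": {"degree": 0.1, "weeks": 0.3, "homework": 0.6},
}
doc = "A genuine degree in 4-6 weeks"
print(classify_naive_bayes_text(doc, priors, cond_probs, vocabulary))  # spam
```

Tokens outside the vocabulary ("a", "genuine", "in", "4-6") are simply skipped, matching the definition of positions on the slide.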
25. A Note on Performance
Kim, S. B., Seo, H. C., and Rim, H. C. (2003).
26. Final Notes
- P(attribute) can have any distribution you want.
- You may need to find the variance and an error estimate.
- But once that's figured out, you can have a classifier with a different probability distribution and maybe less training data.
- To overcome the conditional independence restriction, one can build Bayesian Belief Networks.
- Bayes' Theorem provides many possibilities for probabilistic machine learning and underlies a lot of algorithms used in many applications, including speech recognition, natural language generation, text classification, etc.
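As one illustration of the first note, a continuous attribute can be modeled with a Gaussian per class instead of frequency counts. The sketch below is an assumption-laden example, not something from the slides: the temperature readings, the priors, and the choice of a normal distribution are all hypothetical.

```python
import math
from statistics import mean, pvariance


def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


# Hypothetical temperature readings per class, not from the slides.
temps = {"yes": [68.0, 70.0, 72.0, 75.0], "no": [80.0, 85.0, 90.0]}
priors = {"yes": 4 / 7, "no": 3 / 7}

# Fit one Gaussian per class by estimating its mean and variance.
params = {v: (mean(xs), pvariance(xs)) for v, xs in temps.items()}

# Score a new reading: P(v) * p(x | v), then take the argmax.
x = 73.0
scores = {v: priors[v] * gaussian_pdf(x, *params[v]) for v in priors}
print(max(scores, key=scores.get))  # yes
```

Only two numbers per class and attribute are estimated here, which is the sense in which a parametric distribution may get by with less training data.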