Title: Learning with uncertainty: Probability and Naïve Bayes
1. Learning with uncertainty: Probability and Naïve Bayes
2. What is Uncertain Knowledge?
- Even if a reasonable truth exists,
- ∀x,y, parent(x,y) ∧ male(x) → father(x,y)
- the whole truth is a rare commodity!
- ∀x, ¬smoker(x) ∧ ¬drinker(x) ∧ practice-sport(x) → healthy-life(x) ???
- There is no certainty. The assumption is acceptable only if, in addition, x is a careful driver, does not work in a polluted atmosphere...
- Avoiding smoking and drinking does not prevent diseases, but it is rational to think that it implies a healthy life. We will talk later about degrees of belief.
- The whole truth is not accessible in almost all real-world problems: medicine, weather forecasting, computer networks...
- These domains involve a lot of uncontrolled and unknown factors.
3. Summary
- Representing uncertain knowledge: probabilities
- Representing an entire system by the full joint probability distribution
- Independence to simplify the world
- Example
- Bayesian inference and update
4. Handling Uncertain Knowledge
- Consider the following rules:
- ∀p, drinker(p) → cirrhosis(p)
- ∀p, drinker(p) ∧ hepatitis(p) → cirrhosis(p)
- ∀p, drinker(p) → cirrhosis(p) ∨ car-accident(p) ∨ fight(p)
- ∀p, cirrhosis(p) → drinker(p)
- First Order Predicate Logic fails in complex domains because of:
- Laziness
- It is impossible to list the complete set of antecedents or consequences.
- Theoretical ignorance
- E.g. medicine is not an exact science.
- Practical ignorance
- Even if all the general rules are known, there might be a particular case not taken into account.
5. Probability and degree of belief
- With uncertain knowledge, one can at best provide a degree of belief in a sentence.
- 0 = false, 1 = true, a value strictly between 0 and 1 = somewhere between true and false!
- The best tool for dealing with degrees of belief is probability theory.
- Probability provides a way of summarizing the uncertainty that comes from laziness and ignorance.
6. Evidence
- The statement "the probability that the patient has cirrhosis is 0.8" is based on the evidence received up to now.
- Prior (marginal) probability: before evidence is received.
- Posterior (conditional) probability: after evidence is received.
7. Prior (or Marginal) Probability
- Basic elements are Random Variables (RVs)
- P(A = true): prior probability that A is true.
- Assigned in the absence of any other information (only).
- Examples:
- For a die, P(Dice = 4) = 1/6.
- At the very beginning of a poker game, for the first player, P(Card = 9♠) = 1/52, but this probability evolves as the cards are dealt.
- Domain: the set of values an RV can take
- Discrete: Weather ∈ {sunny, cloudy, rainy}
- Continuous: X ∈ [0, 3]
8. Probability Density Function
- Probability Density Function p.d.f.(x)
- Assigns a probability to every value of the domain of X
- Discrete: P(X ≤ x) = Σ_{t ≤ x} p.d.f.(t)
- Continuous: P(X ≤ x) = ∫_{-∞}^{x} p.d.f.(t) dt
- Example
- Discrete: p.d.f.(W = sunny) = 0.3, p.d.f.(W = cloudy) = 0.5, p.d.f.(W = rainy) = 0.2
- Continuous: e.g. the normal distribution, whose bell-shaped p.d.f.(x) is plotted against x (a short Python sketch follows)
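A minimal Python sketch of the two cases, assuming the discrete weather values above and, for the continuous case, a standard normal distribution (an assumption; the slide only names the normal distribution):

```python
import math

# Discrete case: the p.d.f. assigns a probability to each value of the domain.
pdf_weather = {"sunny": 0.3, "cloudy": 0.5, "rainy": 0.2}
assert abs(sum(pdf_weather.values()) - 1.0) < 1e-9   # probabilities sum to 1

# Continuous case: P(X <= x) is the integral of the p.d.f. up to x.
# For a normal distribution this integral has a closed form via erf.
def normal_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(pdf_weather["sunny"])   # 0.3
print(normal_cdf(0.0))        # 0.5: half of the probability mass lies below the mean
```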
9. Conditional Probability
- Conditional (Posterior) Probability
- P(A|B): the probability of A given that all we know is B.
- P(X|Y): a two-dimensional table of
- P(X = xi | Y = yj) for each i and j
- Product Rule
- P(A ∧ B) = P(A|B) P(B), or equivalently P(A ∧ B) = P(B|A) P(A)
- P(X,Y) = P(X|Y) P(Y) (checked numerically in the sketch below)
- N.B. In probability notation,
- P(A and B) = P(A ∧ B) = P(A ∩ B) = P(A,B)
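A tiny numeric check of the product rule, with assumed illustrative numbers (not taken from the slides):

```python
# Product rule: P(A,B) = P(A|B) P(B), and also P(A,B) = P(B|A) P(A).
p_b = 0.4            # P(B)     (assumed)
p_a_given_b = 0.75   # P(A|B)   (assumed)
p_a = 0.5            # P(A)     (assumed)

p_ab = p_a_given_b * p_b    # P(A,B) = 0.3
p_b_given_a = p_ab / p_a    # so P(B|A) = P(A,B) / P(A) = 0.6
print(p_ab, p_b_given_a)    # ~0.3 ~0.6 (up to floating-point rounding)
```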
10. Kolmogorov's axioms
- All probabilities are between 0 and 1:
- 0 ≤ P(A) ≤ 1
- Valid propositions have probability 1 and unsatisfiable propositions have probability 0:
- P(true) = 1, P(false) = 0
- The probability of a disjunction is given by:
- P(A ∨ B) = P(A ∪ B) = P(A or B) = P(A) + P(B) - P(A ∧ B)
11. Joint Probability Distribution
- Completely specifies the probability assignments to all propositions in the domain
- Atomic Event:
- an assignment of particular values to all the RVs in a domain
- P(A,B,...) assigns probabilities to all possible atomic events.
- With the joint probability distribution we can compute all the probabilities we want. Thus, we can answer all inference questions about the variables of the domain!
12. Marginalization with the Joint Probability Distribution
- Marginal probability
- The marginal probability is obtained by summing (or integrating, more generally) the joint probability over the unwanted events.
- This is called marginalization: P(X) = Σ_z P(X, z)
- In the following we will write P(a) = P(A = T) and P(¬a) = P(A = F)
- P(a) = P(a,b,c) + P(a,b,¬c) + P(a,¬b,c) + P(a,¬b,¬c)
- P(a) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2 → marginal probability
- It is the sum over all combinations of B and C values (see the sketch below).
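A small Python sketch of this marginalization. The four entries are the A = true slice of the joint table used on these slides; their assignment to specific (B, C) combinations is an assumption chosen to be consistent with the queries on the next slide (the A = false entries are not shown):

```python
# Joint entries P(a, B, C) for the four combinations of B and C (A = true slice).
joint_a_true = {
    (True,  True):  0.072,   # P(a, b, c)   -- chosen so that P(c | a,b) = 0.072 / 0.18
    (True,  False): 0.108,   # P(a, b, ¬c)
    (False, True):  0.012,   # P(a, ¬b, c)  -- assumed split of the remaining two entries
    (False, False): 0.008,   # P(a, ¬b, ¬c)
}

p_a = sum(joint_a_true.values())                          # marginalize out B and C
p_ab = sum(v for (b, _), v in joint_a_true.items() if b)  # marginalize out C only
print(p_a, p_ab)                                          # ~0.2 ~0.18
```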
13. Using the Joint Probability Distribution
- Disjunction
- P(a ∨ b) = P(a) + P(b) - P(a,b) = 0.2 + 0.34 - 0.18 = 0.36
- Conditionality
- P(b|a) = P(b,a) / P(a) = 0.18 / 0.2 = 0.9
- What is the probability of c knowing A and B are true?
- P(c|a,b) = P(c,a,b) / P(a,b) = 0.072 / 0.18 = 0.4
- Conjunction
- P(a,b) = P(a,b,c) + P(a,b,¬c) = 0.108 + 0.072 = 0.18
- It is the sum over all combinations of C values (these queries are checked in the sketch below).
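A short Python check of these three queries, reusing the numbers given on the slide (P(b) = 0.34 is taken as given, since the full table is not shown):

```python
# Values taken from the slides.
p_a, p_b, p_ab, p_abc = 0.2, 0.34, 0.18, 0.072

print(p_a + p_b - p_ab)   # P(a or b)   = 0.36
print(p_ab / p_a)         # P(b | a)    = 0.9
print(p_abc / p_ab)       # P(c | a, b) = 0.4   (all up to floating-point rounding)
```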
14. The chain rule
- The full joint distribution can be expressed by the chain rule:
- P(X1,X2,X3,X4) = P(X1) P(X2|X1) P(X3|X1,X2) P(X4|X1,X2,X3)
- since P(A|B) = P(A,B) / P(B).
15. Bayes' Rule
- From the Product Rule we know:
- P(A ∧ B) = P(A|B) P(B)
- = P(B|A) P(A)
- So P(A|B) = P(B|A) P(A) / P(B)
- This is Bayes' rule.
- Why is this rule important?
- It happens regularly that P(B|A), P(B) and P(A) are known but P(A|B) is not.
- A doctor can know the probability of a symptom B, the probability of a disease A, and the probability that A causes B.
- Meaning: the RV A can be seen as a hypothesis and B as evidence (or data).
- P(A) is the prior probability of the hypothesis (in the absence of any evidence).
- P(B) is the probability of the evidence.
- P(B|A) is the likelihood that the evidence B was produced, given that the hypothesis A holds.
- P(A|B) is the posterior probability of the hypothesis A, given that the evidence is B (a numeric sketch follows).
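A minimal numeric sketch of Bayes' rule in Python, with assumed illustrative numbers for a hypothetical disease A and symptom B (not values from the slides):

```python
p_a = 0.01           # prior P(A): the hypothesis (disease) before any evidence
p_b_given_a = 0.9    # likelihood P(B|A): probability of the symptom given the disease
p_b = 0.05           # P(B): overall probability of the symptom

p_a_given_b = p_b_given_a * p_a / p_b   # Bayes' rule: posterior P(A|B)
print(p_a_given_b)                      # 0.18
```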
16. Independence of RVs
- The full joint distribution grows with the number of random variables:
- for n Boolean RVs, the representation of the full joint distribution grows as O(2^n).
- It becomes intractable when the number of RVs grows.
- Independence of the RVs is a convenient way of simplifying the computation of the full joint distribution.
- If A and B are independent then
- P(A,B) = P(A) P(B)
- P(A|B) = P(A)
- So, for n mutually independent RVs, P(X1,...,Xn) = P(X1) P(X2) ... P(Xn),
- and the representation of the full joint distribution grows as O(n).
- However, this assumption is rarely true!
17. Conditional Independence
- Conditional Independence:
- an assumption to simplify the inference procedure.
- A and B are independent knowing C:
- P(A,B|C) = P(A|C) P(B|C)
- P(A|B,C) = P(A|C)
- Example: medical diagnosis
- P(flu | fever ∧ winter) = P(fever ∧ winter | flu) P(flu) / P(fever ∧ winter)
- It would be convenient if fever and winter were independent, but they are not: during the winter one is more likely to catch the flu, which provokes the fever.
- However, these variables are independent given the presence or the absence of the flu.
18. Conditional Independence - case study
- We can check the conditional independence of A and B knowing C:
- P(a,b|c) = (1/16) / (4/16) = 1/4
- P(a|c) · P(b|c) = (2/16)/(4/16) × (2/16)/(4/16) = 2/4 × 2/4 = 1/4
- P(a,b|¬c) = (3/16) / (12/16) = 1/4
- P(a|¬c) · P(b|¬c) = (4/16)/(12/16) × (9/16)/(12/16) = 4/12 × 9/12 = 1/4
- We can verify this for every combination of values of A, B and C.
- Since P(A,B|C) = P(A|C) · P(B|C), we can assume conditional independence.
- N.B. A and B are not independent: P(A,B) ≠ P(A) · P(B)
- P(a,b) = 4/16 = 1/4
- P(a) · P(b) = 6/16 × 11/16 = 66/256 ≈ 0.258 ≠ 1/4 (the arithmetic is checked in the sketch below)
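The case-study arithmetic can be checked exactly with Python's fractions module; the probabilities (in sixteenths) are those read off the slide:

```python
from fractions import Fraction as F

# Joint and marginal probabilities read off the slide (all out of 16).
p_abc, p_ab_notc = F(1, 16), F(3, 16)                        # P(a,b,c), P(a,b,¬c)
p_ac, p_bc, p_c = F(2, 16), F(2, 16), F(4, 16)               # with c
p_a_notc, p_b_notc, p_notc = F(4, 16), F(9, 16), F(12, 16)   # with ¬c

# Conditional independence of A and B given C (and given ¬C) holds:
assert p_abc / p_c == (p_ac / p_c) * (p_bc / p_c)                        # 1/4 = 1/4
assert p_ab_notc / p_notc == (p_a_notc / p_notc) * (p_b_notc / p_notc)   # 1/4 = 1/4

# ...but A and B are not unconditionally independent:
p_a, p_b, p_ab = F(6, 16), F(11, 16), F(4, 16)
assert p_ab != p_a * p_b   # 1/4 vs 66/256
```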
19. Conditional Independence - Consequence
- Generalisation: given a cause C and n effects E1, ..., En that are conditionally independent given C, the chain rule gives
- P(C, E1, ..., En) = P(C) P(E1|C) P(E2|C,E1) ... P(En|C,E1,...,En-1)
- That can be rewritten as
- P(C, E1, ..., En) = P(C) P(E1|C) P(E2|C) ... P(En|C)
- Thus, the full joint distribution table grows with O(n).
- Conditional independence assertions allow probabilistic systems to scale up; moreover, they are much more commonly available than absolute independence assertions.
- Such a distribution is called a naïve Bayes model. It is "naïve" because it is often used even when the effects are NOT conditionally independent given the cause variable.
20. Bayesian Classification (1)
- Let the set of classes be {c1, c2, ..., cn}.
- Let E be the description of an instance.
- Determine the class of E by computing, for each ci,
- P(ci|E) = P(E|ci) P(ci) / P(E)
- P(E) can be determined since the classes are complete and disjoint:
- P(E) = Σi P(E|ci) P(ci)
21. Bayesian Classification (2)
- Need:
- priors P(ci)
- conditionals P(E|ci)
- The P(ci) are easily estimated from data:
- if ni of the examples in D are in ci, then P(ci) = ni / |D|.
- Assume an instance is a conjunction of binary features: E = e1 ∧ e2 ∧ ... ∧ em.
- There are too many possible instances (exponential in m) to estimate all P(E|ci).
- Naïve Bayes:
- if we assume the features of an instance are independent given the class ci (conditionally independent),
- then we only need to know P(ej|ci) for each feature and class (a classifier sketch follows).
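A minimal naïve Bayes classifier sketch in Python for binary features; the function name and data layout are assumptions for illustration, not taken from the slides:

```python
def naive_bayes_classify(instance, priors, conditionals):
    """Return the class ci maximizing P(ci) * prod_j P(ej | ci).

    instance:     {feature_name: True/False}
    priors:       {class: P(class)}
    conditionals: {class: {feature_name: P(feature = True | class)}}
    """
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for feature, value in instance.items():
            p_true = conditionals[c][feature]
            score *= p_true if value else (1.0 - p_true)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```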
22. Naïve Bayes model - example
- Example: medical diagnosis
- P(fever, winter, flu)
- = P(fever, winter | flu) P(flu) (product rule)
- = P(fever | flu) P(winter | flu) P(flu) (conditional independence)
- The original full joint distribution can thus be split into smaller pieces (a numeric illustration follows).
- So, for n Boolean symptoms conditionally independent given the disease, the representation of the full joint distribution grows with O(n) rather than O(2^n).
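A one-line illustration of this factorization in Python; the three factor values are assumptions (the slides do not give them):

```python
p_flu = 0.001               # P(flu)            (assumed)
p_fever_given_flu = 0.9     # P(fever | flu)    (assumed)
p_winter_given_flu = 0.4    # P(winter | flu)   (assumed)

# Naïve Bayes factorization: P(fever, winter, flu) = P(fever|flu) P(winter|flu) P(flu)
p_joint = p_fever_given_flu * p_winter_given_flu * p_flu
print(p_joint)   # ~0.00036
```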
23. Naïve Bayes Classifier - Example
Probability Estimates
- P(yes) = 9/14 ≈ 0.64, P(no) = 5/14 ≈ 0.36
- P(voucher=yes | class=yes) = 3/9 ≈ 0.33, P(voucher=yes | class=no) = 3/5 = 0.60
- Similar values are computed for the other feature/value pairs.
Prediction
- P(yes) · P(google|yes) · P(56k|yes) · P(male|yes) · P(voucher=yes|yes) = 0.0053
- P(no) · P(google|no) · P(56k|no) · P(male|no) · P(voucher=yes|no) = 0.0206
24. Probability Estimates: Smoothing
- Normally, probabilities are estimated based on the observed frequencies in the training data:
- if D contains ni examples in class ci, and nij of these ni examples contain feature ej, then
- P(ej|ci) = nij / ni.
- However, estimating such probabilities from small training sets is error-prone.
- To account for estimation from small samples, probability estimates are adjusted, or smoothed.
- Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a virtual sample of size m:
- P(ej|ci) = (nij + m·p) / (ni + m).
- For binary features, p is simply assumed to be 0.5 (see the sketch below).
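A sketch of the m-estimate in Python, following the slide's notation (ni, nij, p, m); the default m = 1 is an assumption for illustration:

```python
def m_estimate(n_ij, n_i, p=0.5, m=1.0):
    """Smoothed estimate of P(ej | ci) = (n_ij + m*p) / (n_i + m)."""
    return (n_ij + m * p) / (n_i + m)

# A feature never observed in a class no longer gets probability 0:
print(m_estimate(0, 3))    # 0.125 instead of 0/3 = 0
print(m_estimate(3, 9))    # 0.35  instead of 3/9 ≈ 0.33
```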
25. Naïve Bayes - Issues
- Posterior Probabilities
- Classification results of Naïve Bayes (the class with the maximum posterior probability) are usually fairly accurate.
- However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not:
- output probabilities are generally very close to 0 or 1.
- Underflow Prevention
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing the logs of probabilities rather than multiplying the probabilities (see the sketch below).
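A small sketch of log-space scoring in Python (the example probabilities are assumptions):

```python
import math

def log_score(prior, feature_probs):
    """log(P(c) * prod_j P(ej | c)), computed as a sum of logs to avoid underflow."""
    return math.log(prior) + sum(math.log(p) for p in feature_probs)

# Ranking classes by log-score gives the same winner as ranking by raw score,
# because log is monotonically increasing.
print(log_score(0.64, [0.33, 0.60, 0.50]))
print(log_score(0.36, [0.60, 0.40, 0.50]))
```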
26. Hypothesis testing
- What if we have more than one piece of evidence?
- P(z | A,B,...,Y) = P(A,B,...,Y | z) P(z) / P(A,B,...,Y)
- We know that P(A,B,...,Y) = Σi P(A,B,...,Y,zi) = Σi P(A,B,...,Y | zi) P(zi)
- So,
- P(z | A,B,...,Y) = α P(A,B,...,Y | z) P(z), by normalisation,
- where α = 1 / (Σi P(A,B,...,Y | zi) P(zi)).
- Example: a patient with a fever who is shaking.
- The doctor wants to know whether she has the flu or not.
- He knows P(fever,shaking | flu) = 0.8, P(fever,shaking | ¬flu) = 0.01, P(flu) = 0.001, P(¬flu) = 1 - P(flu).
- He doesn't know P(fever,shaking), thus:
- P(flu | fever,shaking) = P(fever,shaking | flu) · P(flu) / [P(fever,shaking | flu) · P(flu) + P(fever,shaking | ¬flu) · P(¬flu)]
- = 0.8 × 0.001 / (0.8 × 0.001 + 0.01 × 0.999) ≈ 0.07 (checked in the sketch below)
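A Python check of the flu computation, using the numbers given on the slide:

```python
p_e_given_flu = 0.8        # P(fever, shaking | flu)
p_e_given_not_flu = 0.01   # P(fever, shaking | ¬flu)
p_flu = 0.001
p_not_flu = 1.0 - p_flu

alpha = 1.0 / (p_e_given_flu * p_flu + p_e_given_not_flu * p_not_flu)   # normalisation
p_flu_given_e = alpha * p_e_given_flu * p_flu
print(round(p_flu_given_e, 3))   # 0.074: the evidence raises P(flu) from 0.001 to about 0.07
```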
27. Where do Probabilities come from?
- Three approaches:
- Frequentist:
- probabilities come from experiments.
- Objectivist:
- probabilities are real aspects of the universe (a known model of the world).
- Subjectivist:
- probabilities characterize human beliefs, without external significance.