Learning with uncertainty: Probability and Naïve Bayes (presentation transcript)
1
Learning with uncertainty: Probability and Naïve
Bayes
2
What is Uncertain Knowledge?
  • If a reasonable truth exists,
  • ∀x,y parent(x,y) ∧ male(x) ⇒ father(x,y)
  • the whole truth is a rare commodity!
  • ∀x ¬smoker(x) ∧ ¬drinker(x) ∧ practice-sport(x)
  • ⇒ healthy-life(x) ???
  • There is no certainty. The assumption is
    acceptable if x is a careful driver, does not
    work in a polluted atmosphere...
  • Avoiding smoking and drinking does not prevent
    diseases, but it is rational to think that it
    implies a healthy life.
  • We will talk later about degrees of belief.
  • The whole truth is not accessible in almost all
    real-world problems: medicine, weather
    forecasting, computer networks...
  • These domains involve many uncontrolled and
    unknown factors.

3
Summary
  • Representing uncertain knowledge: probabilities
  • Representing an entire system by the full joint
    probability distribution
  • Independence to simplify the world
  • Example
  • Bayesian inference and update

4
Handling Uncertain Knowledge
  • Consider the following rules:
  • ∀p drinker(p) ⇒ cirrhosis(p)
  • ∀p drinker(p) ∧ hepatitis(p) ⇒ cirrhosis(p)
  • ∀p drinker(p) ⇒ cirrhosis(p) ∨ car-accident(p)
    ∨ fight(p)
  • ∀p cirrhosis(p) ⇒ drinker(p)
  • First Order Predicate Logic fails in complex
    domains because of:
  • Laziness
  • It is impossible to list the complete set of
    antecedents or consequences.
  • Theoretical ignorance
  • E.g., medicine is not an exact science.
  • Practical ignorance
  • Even if all the general rules are known, a
    particular case may not be taken into account.

5
Probability and degree of belief
  • With uncertain knowledge one can, at best,
    provide a degree of belief in a sentence.
  • 0 = false, 1 = true, a value in (0,1) lies somewhere
    between true and false!
  • The best tool for dealing with degree of belief
    is probability theory.
  • Probability provides a way of summarizing
    uncertainty from laziness and ignorance.

6
Evidence
  • A statement such as "the probability that the
    patient has cirrhosis is 0.8" is based on the
    evidence received up to now.
  • prior (marginal) probability: before evidence is
    received.
  • posterior (conditional) probability: after
    evidence is received.

7
Prior (or marginal) Probability
  • Basic elements are Random Variables (RVs)
  • P(A=true): the prior probability that A is true,
  • assigned in the absence of any other information
    (only).
  • Ex:
  • for a die, P(Dice=4) = 1/6
  • At the very beginning of a poker game, for the
    first player, P(Card = 9♠) = 1/52, but this
    probability evolves as the cards are dealt.
  • Domain: the values an RV can take
  • Discrete: Weather ∈ {sunny, cloudy, rainy}
  • Continuous: X ∈ [0,3]

8
Probability Density Function
  • Probability Density Function p.d.f.(x)
  • Assigns a probability to every value of the domain
    of X
  • Discrete: P(X ≤ x) = Σt≤x p.d.f.(t)
  • Continuous: P(X ≤ x) = ∫t≤x p.d.f.(t) dt
  • Example
  • Discrete: p.d.f.(W=sunny) = 0.3, p.d.f.(W=cloudy) = 0.5,
    p.d.f.(W=rainy) = 0.2
  • Continuous: e.g. the normal distribution
  • [Figure: bell curve of p.d.f.(x) plotted against x]
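As a quick illustration (not part of the original slides), a minimal Python sketch of the discrete case, using the weather distribution above:

    # Discrete p.d.f. for the Weather RV from the slide.
    pdf = {"sunny": 0.3, "cloudy": 0.5, "rainy": 0.2}

    def prob(event):
        # P(Weather in event): sum the p.d.f. over the values in the event.
        return sum(pdf[v] for v in event)

    print(prob({"sunny"}))            # 0.3
    print(prob({"cloudy", "rainy"}))  # 0.7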
9
Conditional Probability
  • Conditional (Posterior) Probability
  • P(A|B) - probability of A given that all we know
    is B.
  • P(X|Y) - a two-dimensional table of
  • P(X=xi|Y=yj) for each i and j
  • Product Rule
  • P(A ∧ B) = P(A|B)P(B), or vice versa P(A ∧ B) =
    P(B|A)P(A)
  • P(X,Y) = P(X|Y)P(Y)
  • N.B. In probability notation
  • P(A and B) = P(A ∧ B) = P(A ∩ B) = P(A,B)

10
Kolmogorov's axioms
  • All probabilities are between 0 and 1:
  • 0 ≤ P(A) ≤ 1
  • Valid propositions have probability 1 and
    unsatisfiable propositions have probability 0:
  • P(true) = 1, P(false) = 0
  • The probability of a disjunction is given by
  • P(A ∨ B) = P(A or B) = P(A) + P(B) - P(A ∧ B)

11
Joint Probability Distribution
  • Completely specifies probability assignments to
    all propositions in the domain
  • Atomic Event
  • an assignment of particular values to all RVs in
    a domain
  • P(A,B,...) assigns probabilities to all possible
    atomic events.
  • With the joint probability distribution we can
    compute all the probabilities we want. Thus, we
    can answer all inference questions about the
    variables of the domain!

12
Marginalization with the Joint Probability
Distribution
  • Marginal probability
  • A marginal probability is obtained by summing (or
    integrating, more generally) the joint
    probability over the unwanted variables.
  • This is called marginalization: P(X) = Σz P(X,z)
  • In the following we will use P(a) = P(A=T) and
    P(¬a) = P(A=F)
  • P(a) = P(a,b,c) + P(a,b,¬c) + P(a,¬b,c) +
    P(a,¬b,¬c)
  • P(a) = 0.072 + 0.108 + 0.012 + 0.008
    = 0.2 → a marginal probability
  • it is the sum over all combinations of B and C
    values
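An illustrative Python sketch of this marginalization (not from the slides; the four entries with ¬a are an assumption, chosen so that P(b) = 0.34 as used on the next slide):

    from itertools import product

    # Full joint distribution P(A,B,C); the entries with A true come
    # from the slide, the entries with A false are assumed.
    joint = {
        (True, True, True): 0.072,   (True, True, False): 0.108,
        (True, False, True): 0.012,  (True, False, False): 0.008,
        (False, True, True): 0.016,  (False, True, False): 0.144,
        (False, False, True): 0.064, (False, False, False): 0.576,
    }

    # P(a): sum the joint over all combinations of B and C values.
    p_a = sum(joint[(True, b, c)] for b, c in product([True, False], repeat=2))
    print(p_a)  # 0.2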

13
Using the Joint Probability Distribution
  • Disjunction
  • P(a ∨ b) = P(a) + P(b) - P(a,b)
    = 0.2 + 0.34 - 0.18 = 0.36
  • Conditionality
  • P(b|a) = P(b,a) / P(a) = 0.18/0.2 = 0.9
  • What is the probability of c knowing A and B are
    true?
  • P(c|a,b) = P(c,a,b) / P(a,b) = 0.072/0.18 = 0.4
  • Conjunction
  • P(a,b) = P(a,b,c) + P(a,b,¬c) = 0.072 + 0.108 = 0.18
  • it is the sum over all combinations of C values
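Continuing the sketch above (it reuses the joint table defined there), the same table answers these queries:

    # Reusing the `joint` table from the previous sketch.
    def p(A=None, B=None, C=None):
        # Marginal probability of a partial assignment, e.g. p(A=True, B=True).
        return sum(pr for (a, b, c), pr in joint.items()
                   if (A is None or a == A) and (B is None or b == B)
                   and (C is None or c == C))

    print(p(A=True, B=True) / p(A=True))                  # P(b|a)    = 0.9
    print(p(A=True, B=True, C=True) / p(A=True, B=True))  # P(c|a,b)  = 0.4
    print(p(A=True) + p(B=True) - p(A=True, B=True))      # P(a or b) = 0.36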

14
The chain rule
  • The full joint distribution can be expressed by
    the chain rule:
  • P(X1,X2,X3,X4) = P(X1) P(X2|X1) P(X3|X1,X2)
    P(X4|X1,X2,X3)
  • since P(A|B) = P(A,B)/P(B)

15
Bayes Rule
  • From the Product Rule we know
  • P(A ∧ B) = P(A|B)P(B)
  • = P(B|A)P(A)
  • So P(A|B) = P(B|A)P(A) / P(B)
  • This is Bayes' rule
  • Why is this rule important?
  • It happens regularly that P(B|A), P(B) and P(A)
    are known but P(A|B) is not.
  • A doctor may know the probabilities of a
    symptom B and of a disease A, and the probability
    that A causes B.
  • Meaning of the RVs: A can be seen as a hypothesis
    and B as evidence (or data)
  • P(A) is the prior probability of the hypothesis
    (in the absence of any evidence)
  • P(B) is the probability of the evidence
  • P(B|A) is the likelihood that the evidence B was
    produced, given that the hypothesis was A
  • P(A|B) is the posterior probability of the
    hypothesis A, given that the evidence is B.
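A minimal sketch of Bayes' rule as a function; the numbers in the example call are made up:

    def posterior(likelihood, prior, evidence):
        # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
        return likelihood * prior / evidence

    # Hypothetical values: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.1.
    print(posterior(0.9, 0.01, 0.1))  # 0.09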

16
Independence of RVs
  • The full joint distribution grows with the number
    of random variables
  • For n Boolean RVs the representation of the full
    joint distribution grows with O(2^n).
  • It becomes intractable when the number of RVs grows
  • Independence of the RVs is a convenient way of
    simplifying the computation of the full joint
    distribution
  • If A and B are independent then
  • P(A,B) = P(A) P(B)
  • P(A|B) = P(A)
  • So P(X1,...,Xn) = P(X1) P(X2) ... P(Xn)
  • and the full joint distribution grows with O(n).
  • However this assumption is rarely true!

17
Conditional Independence
  • Conditional Independence
  • an assumption to simplify the inference procedure
  • A and B are independent knowing C:
  • P(A,B|C) = P(A|C) P(B|C)
  • P(A|B,C) = P(A|C)
  • Example: medical diagnosis
  • P(flu|fever ∧ winter) = P(fever ∧ winter|flu) P(flu)
    / P(fever ∧ winter)
  • It would be convenient if fever and winter were
    independent, but they are not: during the winter
    one might catch the flu, which provokes the fever.
  • However, these variables are independent given the
    presence or the absence of the flu.

18
Conditional Independence - case study
  • We can check the conditional independence of A and B
    knowing C
  • P(a,b|c) = (1/16)/(4/16) = 1/4
  • P(a|c) × P(b|c) = (2/16)/(4/16) × (2/16)/(4/16)
    = 2/4 × 2/4 = 1/4
  • P(a,b|¬c) = (3/16)/(12/16) = 1/4
  • P(a|¬c) × P(b|¬c) = (4/16)/(12/16) × (9/16)/(12/16)
    = 4/12 × 9/12 = 1/4
  • we can verify this for every combination of A, B
    and C values
  • Since P(A,B|C) = P(A|C) × P(B|C), we can assume
    conditional independence
  • N.B. A and B are not independent: P(B,A) ≠ P(A) ×
    P(B)
  • P(b,a) = 4/16 = 1/4
  • P(a) × P(b) = 6/16 × 11/16 = 66/256 ≈ 0.2578 ≠ 1/4
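An illustrative check in Python; the joint counts (out of 16 observations) are reconstructed from the probabilities on the slide:

    from itertools import product

    # Joint counts over (A, B, C), out of 16 observations,
    # reconstructed from the slide's probabilities.
    counts = {
        (True, True, True): 1,   (True, False, True): 1,
        (False, True, True): 1,  (False, False, True): 1,
        (True, True, False): 3,  (True, False, False): 1,
        (False, True, False): 6, (False, False, False): 2,
    }

    def p(A=None, B=None, C=None):
        # Marginal probability of a partial assignment.
        return sum(n for (a, b, c), n in counts.items()
                   if (A is None or a == A) and (B is None or b == B)
                   and (C is None or c == C)) / 16

    # P(A,B|C) = P(A|C) x P(B|C) holds for every combination of values:
    for a, b, c in product([True, False], repeat=3):
        assert abs(p(A=a, B=b, C=c) / p(C=c)
                   - (p(A=a, C=c) / p(C=c)) * (p(B=b, C=c) / p(C=c))) < 1e-12

    # ...but A and B are not unconditionally independent:
    print(p(A=True, B=True), p(A=True) * p(B=True))  # 0.25 vs about 0.2578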

19
Conditional Independence - Consequence
  • Generalisation: given a Cause (C) and N Effects
    (E1,...,EN) conditionally independent given C, the
    chain rule gives
  • P(C,E1,...,EN) = P(C) P(E1|C) P(E2|C,E1) ...
    P(EN|C,E1,...,EN-1)
  • That can be rewritten as
  • P(C,E1,...,EN) = P(C) P(E1|C) P(E2|C) ... P(EN|C)
  • Thus, the full joint distribution table grows
    with O(n)
  • Conditional independence assertions allow
    probabilistic systems to scale up; moreover, they
    are much more commonly available than absolute
    independence assertions.
  • Such a probability model is called the naïve Bayes
    model. It is "naïve" because it is often used when
    the effects are NOT conditionally independent given
    the cause variable.

20
Bayesian Classification (1)
  • Let the set of classes be {c1, c2, ..., cn}
  • Let E be the description of an instance.
  • Determine the class of E by computing, for each
    ci:
  • P(ci|E) = P(E|ci) P(ci) / P(E)
  • P(E) can be determined since the classes are
    complete and disjoint: P(E) = Σi P(E|ci) P(ci)

21
Bayesian Classification (2)
  • Need:
  • Priors P(ci)
  • Conditionals P(E|ci)
  • P(ci) are easily estimated from data:
  • if ni of the examples in D are in ci, then
    P(ci) = ni / |D|
  • Assume an instance is a conjunction of m binary
    features: E = e1 ∧ e2 ∧ ... ∧ em
  • There are too many possible instances (exponential
    in m) to estimate all P(E|ci)
  • Naïve Bayes
  • If we assume the features of an instance are
    independent given the class ci (conditionally
    independent), so that P(E|ci) = Πj P(ej|ci),
  • we then only need to know P(ej|ci) for each
    feature and class, as in the sketch below.
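A small sketch of estimating these priors and conditionals from data; the toy dataset and feature names are invented for illustration:

    from collections import Counter, defaultdict

    # Toy training data: (features, class) pairs with binary features.
    data = [
        ({"voucher": True, "male": True}, "yes"),
        ({"voucher": False, "male": True}, "no"),
        ({"voucher": True, "male": False}, "yes"),
    ]

    # Priors: P(ci) = ni / |D|
    n = Counter(c for _, c in data)
    priors = {c: n_c / len(data) for c, n_c in n.items()}

    # Conditionals: P(ej=true|ci) = nij / ni
    n_ij = defaultdict(Counter)
    for features, c in data:
        for f, v in features.items():
            n_ij[c][f] += v  # True counts as 1, False as 0
    conditionals = {c: {f: n_ij[c][f] / n[c] for f in ("voucher", "male")}
                    for c in n}

    print(priors)        # {'yes': 0.667, 'no': 0.333} (approximately)
    print(conditionals)  # P(feature=true | class) for each class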

22
Naïve Bayes model - example
  • Example: medical diagnosis
  • P(fever,winter,flu)
  • = P(fever,winter|flu) P(flu) (product rule)
  • = P(fever|flu) P(winter|flu) P(flu) (conditional
    independence)
  • The original full joint distribution can be split
    into smaller pieces.
  • So for n Boolean symptoms conditionally
    independent given the disease, the
    representation of the full joint distribution
    grows with O(n) rather than O(2^n)

23
Naïve Bayes Classifier - Example
Probability Estimates
P(yes) = 9/14 = 0.64; P(no) = 5/14 = 0.36
P(voucher=yes|class=yes) = 3/9 = 0.33
P(voucher=yes|class=no) = 3/5 = 0.60
Similar values are computed for the other feature/value pairs.
Prediction
P(yes) × P(google|yes) × P(56k|yes) × P(male|yes) × P(voucher=yes|yes) = 0.0053
P(no) × P(google|no) × P(56k|no) × P(male|no) × P(voucher=yes|no) = 0.0206
Since 0.0206 > 0.0053, the instance is classified as "no".
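A sketch of the prediction step; the per-feature probabilities below are hypothetical values, chosen only to roughly reproduce the slide's totals:

    import math

    def nb_score(prior, conds):
        # Unnormalised naive Bayes score: P(c) x the product of P(ej|c).
        return prior * math.prod(conds)

    # Hypothetical per-class likelihoods for (google, 56k, male, voucher=yes):
    score_yes = nb_score(9 / 14, [0.33, 0.33, 0.23, 0.33])
    score_no = nb_score(5 / 14, [0.60, 0.40, 0.40, 0.60])
    print(round(score_yes, 4), round(score_no, 4))  # about 0.0053 and 0.0206
    print("no" if score_no > score_yes else "yes")  # predicted class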
24
Probability Estimates Smoothing
  • Normally, probabilities are estimated from the
    observed frequencies in the training data:
  • if D contains ni examples in class ci, and nij of
    these ni examples contain feature ej, then
    P(ej|ci) = nij / ni
  • However, estimating such probabilities from small
    training sets is error-prone.
  • To account for estimation from small samples,
    probability estimates are adjusted, or smoothed.
  • Laplace smoothing using an m-estimate assumes
    that each feature is given a prior probability,
    p, that is assumed to have been previously
    observed in a virtual sample of size m:
  • P(ej|ci) = (nij + m × p) / (ni + m)
  • For binary features, p is simply assumed to be
    0.5.
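A one-function sketch of the m-estimate above (the default m = 1 is an arbitrary choice for illustration):

    def m_estimate(n_ij, n_i, p=0.5, m=1.0):
        # Smoothed estimate of P(ej|ci): (nij + m*p) / (ni + m), where p is
        # the feature's prior (0.5 for binary features) and m is the size
        # of the virtual sample.
        return (n_ij + m * p) / (n_i + m)

    print(m_estimate(0, 3))  # 0.125: never observed in the class, yet nonzero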

25
Naïve Bayes - Issues
  • Posterior Probabilities
  • Classification results of Naïve Bayes (the class
    with maximum posterior probability) are usually
    fairly accurate.
  • However, due to the inadequacy of the conditional
    independence assumption, the actual
    posterior-probability numerical estimates are
    not.
  • Output probabilities are generally very close to
    0 or 1.
  • Underflow Prevention
  • Multiplying lots of probabilities, which are
    between 0 and 1 by definition, can result in
    floating-point underflow.
  • Since log(xy) = log(x) + log(y), it is better to
    perform all computations by summing the logs of
    the probabilities rather than multiplying the
    probabilities.
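A short sketch of log-space scoring:

    import math

    def log_score(prior, conds):
        # Sum the logs of the probabilities instead of multiplying them.
        return math.log(prior) + sum(math.log(c) for c in conds)

    # A product of 1000 probabilities of 0.1 underflows to 0.0 as a float,
    # but its log-score is perfectly representable:
    print(log_score(0.5, [0.1] * 1000))  # about -2303.3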

26
Hypothesis testing
  • What if we have more than one piece of evidence?
  • P(z|A,B,...,Y) = P(A,B,...,Y|z) P(z) / P(A,B,...,Y)
  • We know that P(A,B,...,Y) = Σi P(A,B,...,Y,zi) = Σi
    P(A,B,...,Y|zi) P(zi)
  • So,
  • P(z|A,B,...,Y) = α P(A,B,...,Y|z) P(z) by
    normalisation,
  • where α = 1/(Σi P(A,B,...,Y|zi) P(zi))
  • Example: a patient with a fever and who is shaking
  • The doctor wants to know whether she has the flu
    or not.
  • He knows P(fever,shaking|flu) = 0.8,
    P(fever,shaking|¬flu) = 0.01, P(flu) = 0.001,
    P(¬flu) = 1 - P(flu)
  • He doesn't know P(fever,shaking), thus
  • P(flu|fever,shaking) = P(fever,shaking|flu) ×
    P(flu) / (P(fever,shaking|flu) × P(flu) +
    P(fever,shaking|¬flu) × P(¬flu))
  • P(flu|fever,shaking) = 0.8 × 0.001 / (0.8 × 0.001
    + 0.01 × 0.999) ≈ 0.074
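A sketch of this normalisation in Python, with the values from the slide:

    def normalise(unnormalised):
        # P(zi|evidence) = alpha x P(evidence|zi) x P(zi), alpha = 1/sum.
        total = sum(unnormalised.values())
        return {z: v / total for z, v in unnormalised.items()}

    likelihood = {"flu": 0.8, "not-flu": 0.01}  # P(fever,shaking|z)
    prior = {"flu": 0.001, "not-flu": 0.999}    # P(z)
    posterior = normalise({z: likelihood[z] * prior[z] for z in likelihood})
    print(posterior["flu"])  # about 0.074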

27
Where do Probabilities come from?
  • Three approaches:
  • Frequentist:
  • probabilities come from experiments.
  • Objectivist:
  • probabilities are real aspects of the universe (a
    known model of the world).
  • Subjectivist:
  • probabilities characterise human beliefs, without
    external significance.