Handling Uncertainty - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Handling Uncertainty

1
Handling Uncertainty
2
Uncertain knowledge
  • Typical example: diagnosis. Consider:
  • ∀x Symptom(x, Toothache) ⇒ Disease(x, Cavity).
  • The problem is that this rule is wrong. Not all
    patients with toothache have cavities; some of
    them have gum disease, an abscess, etc.:
  • ∀x Symptom(x, Toothache) ⇒ Disease(x, Cavity) ∨
    Disease(x, GumDisease) ∨ Disease(x, Abscess) ∨
    ...
  • Unfortunately, in order to make the rule true, we
    have to add an almost unlimited list of possible
    causes.
  • We could try turning the rule into a causal rule:
  • ∀x Disease(x, Cavity) ⇒ Symptom(x, Toothache).
  • But this rule isn't right either: not all
    cavities cause pain. To make it logically
    exhaustive, we would have to augment the left
    side with all the qualifications required for a
    cavity to cause a toothache.

3
So, using FOL fails
  • In a domain like medical diagnosis, using FOL
    fails because of
  • Laziness
  • Too much work to list the complete set of
    antecedents or consequents to ensure an
    exceptionless rule.
  • Too hard to use such rules.
  • Theoretical Ignorance
  • Medical science has no complete theory for the
    domain.
  • Practical Ignorance
  • Even if we know all the rules, we might be
    uncertain about a particular patient because not
    all the necessary tests have been done.

4
Belief and Probability
  • The connection between toothaches and cavities is
    not a logical consequence in either direction.
  • However, we can provide a degree of belief on the
    sentences. Our main tool for this is probability
    theory.
  • E.g., we might not know for sure what afflicts a
    particular patient, but we believe that there is,
    say, an 80% chance (that is, probability 0.8)
    that the patient has a cavity if he has a
    toothache.
  • We usually get this belief from statistical data.
  • Assigning probability 0 to a sentence corresponds
    to an unequivocal belief that the sentence is
    false.
  • Assigning probability 1 to a sentence corresponds
    to an unequivocal belief that the sentence is
    true.

5
Syntax
  • Basic element: random variable
  • Similar to propositional logic: possible worlds
    defined by assignment of values to random
    variables.
  • Boolean random variables
  • e.g., Cavity (do I have a cavity?)
  • Discrete random variables
  • e.g., Weather is one of ⟨sunny, rainy, cloudy, snow⟩
  • Domain values must be exhaustive and mutually
    exclusive
  • Elementary propositions are constructed by
    assignment of a value to a random variable:
    e.g., Weather = sunny, Cavity = false
    (abbreviated as ¬cavity)
  • Complex propositions are formed from elementary
    propositions and standard logical connectives:
    e.g., Weather = sunny ∨ Cavity = false

6
Atomic events
  • Atomic event: a complete specification of the
    state of the world about which the agent is
    uncertain
  • E.g., if the world consists of only two Boolean
    variables Cavity and Toothache, then there are 4
    distinct atomic events:
  • Cavity = false ∧ Toothache = false
  • Cavity = false ∧ Toothache = true
  • Cavity = true ∧ Toothache = false
  • Cavity = true ∧ Toothache = true
  • Atomic events are mutually exclusive and
    exhaustive

7
Axioms of probability
  • For any propositions A, B:
  • 0 ≤ P(A) ≤ 1
  • Necessarily true (i.e. valid) propositions have
    probability 1, and necessarily false (i.e.
    unsatisfiable) propositions have probability 0.
  • P(true) = 1 and P(false) = 0
  • P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
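
The axioms can be checked numerically. Below is a minimal Python sketch (the four atomic-event probabilities are illustrative, not taken from the slides) verifying the third axiom on a toy world of two Boolean propositions A and B:

  # Toy world over two Boolean propositions A and B; the atomic-event
  # probabilities are illustrative and only need to be non-negative and sum to 1.
  world = {
      (True, True): 0.20,    # A and B
      (True, False): 0.30,   # A and not B
      (False, True): 0.10,   # not A and B
      (False, False): 0.40,  # not A and not B
  }

  p_a = sum(p for (a, b), p in world.items() if a)
  p_b = sum(p for (a, b), p in world.items() if b)
  p_a_and_b = world[(True, True)]
  p_a_or_b = sum(p for (a, b), p in world.items() if a or b)

  assert abs(sum(world.values()) - 1.0) < 1e-9          # P(true) = 1
  assert all(0 <= p <= 1 for p in world.values())        # 0 <= P(A) <= 1
  assert abs(p_a_or_b - (p_a + p_b - p_a_and_b)) < 1e-9  # P(A or B) = P(A) + P(B) - P(A and B)
  print(p_a, p_b, p_a_or_b)                              # approx 0.5 0.3 0.6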

8
Using the axioms of probability
  • We can derive a variety of useful facts from
    basic axioms. E.g.
  • P(a ∨ ¬a) = P(a) + P(¬a) - P(a ∧ ¬a) (by axiom 3)
  • P(true) = P(a) + P(¬a) - P(a ∧ ¬a) (by logical
    equivalence)
  • 1 = P(a) + P(¬a) (by axiom 2)
  • P(¬a) = 1 - P(a) (by algebra)
  • Also, we can prove that for a discrete variable D
    with domain ⟨d1, ..., dn⟩ we have
  • Σ_{i=1}^{n} P(D = di) = 1.

9
Prior probability and distribution
  • Prior or unconditional probability associated
    with a proposition is the degree of belief
    accorded to it in the absence of any other
    information.
  • e.g., P(Cavity = true) = 0.1 (or P(cavity) =
    0.1)
  • P(Weather = sunny) = 0.7 (or P(sunny) =
    0.7)
  • Probability distribution gives values for all
    possible assignments:
  • P(Weather = sunny) = 0.7
  • P(Weather = rain) = 0.2
  • P(Weather = cloudy) = 0.08
  • P(Weather = snow) = 0.02
  • As a shorthand we can use vector notation:
  • P(Weather) = ⟨0.7, 0.2, 0.08, 0.02⟩ (they sum
    up to 1)
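
A quick Python sketch of the distribution above as a plain dictionary; the assertion mirrors the remark that the values sum to 1:

  # The prior distribution P(Weather) from the slide.
  P_weather = {"sunny": 0.7, "rain": 0.2, "cloudy": 0.08, "snow": 0.02}

  # Domain values are exhaustive and mutually exclusive, so the values sum to 1.
  assert abs(sum(P_weather.values()) - 1.0) < 1e-9

  # An unconditional (prior) probability is just a lookup.
  print(P_weather["sunny"])  # 0.7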

10
Joint probability
  • Joint probability distribution for a set of
    random variables gives the probability of every
    atomic event on those random variables.
  • E.g. for the two random variables Weather and
    Cavity we have
  • P(Weather, Cavity), which is a 4 × 2 matrix of
    values:
  •   Weather:        sunny  rainy  cloudy  snow
  •   Cavity = true   0.144  0.02   0.016   0.02
  •   Cavity = false  0.576  0.08   0.064   0.08
  • We can consider the joint probability
    distribution of all the variables we use to
    describe the world. Such a joint distribution
    is called the full joint probability
    distribution.
  • A full joint distribution specifies the
    probability of every atomic event.
  • Any probabilistic question about a domain can be
    answered by the full joint distribution.
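
Below is a small Python sketch of the P(Weather, Cavity) table above, keyed by (weather, cavity); summing the relevant entries answers marginal queries:

  # Full joint distribution P(Weather, Cavity) from the slide,
  # keyed by (weather value, cavity value).
  joint = {
      ("sunny", True): 0.144, ("rainy", True): 0.02,
      ("cloudy", True): 0.016, ("snow", True): 0.02,
      ("sunny", False): 0.576, ("rainy", False): 0.08,
      ("cloudy", False): 0.064, ("snow", False): 0.08,
  }

  # A full joint assigns a probability to every atomic event, so its entries sum to 1.
  assert abs(sum(joint.values()) - 1.0) < 1e-9

  # Any probabilistic question is answered by summing atomic events,
  # e.g. marginalising out Cavity:
  p_sunny = sum(p for (w, c), p in joint.items() if w == "sunny")
  print(p_sunny)  # approx 0.72 with the numbers above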

11
Conditional probability
  • Conditional or posterior probabilities
  • e.g., P(cavity | toothache) = 0.8
  • i.e., given that toothache is all I know
  • Notation for conditional distributions:
  • P(Cavity | Toothache) is a 2-element vector of
    2-element vectors

12
Conditional probability
  • Definition of conditional probability
  • P(a | b) = P(a ∧ b) / P(b) if P(b) > 0
  • Product rule gives an alternative formulation:
  • P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  • A general version holds for whole distributions,
    e.g.,
  • P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)
  • i.e. shorthand for
  • P(sunny ∧ cavity) = P(sunny | cavity) P(cavity)
  • P(rainy ∧ cavity) = P(rainy | cavity) P(cavity)
  • (View it as a set of 4 × 2 equations, not matrix
    multiplication)
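
A minimal sketch of the definition in Python, reusing the P(Weather, Cavity) table from the joint-probability slide (the particular query is my own choice):

  # P(a | b) = P(a ∧ b) / P(b), computed from the P(Weather, Cavity) table.
  joint = {
      ("sunny", True): 0.144, ("rainy", True): 0.02,
      ("cloudy", True): 0.016, ("snow", True): 0.02,
      ("sunny", False): 0.576, ("rainy", False): 0.08,
      ("cloudy", False): 0.064, ("snow", False): 0.08,
  }

  p_cavity_and_sunny = joint[("sunny", True)]
  p_sunny = sum(p for (w, c), p in joint.items() if w == "sunny")

  p_cavity_given_sunny = p_cavity_and_sunny / p_sunny
  print(p_cavity_given_sunny)  # 0.144 / 0.72, approx 0.2

  # Product rule, rearranged: P(a ∧ b) = P(a | b) P(b)
  assert abs(p_cavity_given_sunny * p_sunny - p_cavity_and_sunny) < 1e-12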

13
Chain rule
  • Chain rule is derived by successive application
    of the product rule:
  • P(X1, ..., Xn) = P(X1) P(X2 | X1) P(X3 | X1, X2)
    ... P(Xn | X1, ..., Xn-1)
    = Π_{i=1..n} P(Xi | X1, ..., Xi-1)

14
Inference by enumeration
  • Start with the joint probability distribution
  • For any proposition φ, sum up the probabilities
    of the atomic events ω in which it is true:
  • P(φ) = Σ_{ω ⊨ φ} P(ω)
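
A sketch of this summation in Python. The slides' own full joint table (shown on the following slides as a figure) was not transcribed, so the numbers below are the illustrative Toothache/Catch/Cavity table commonly used with this example:

  # P(phi) = sum of P(omega) over the atomic events omega where phi holds.
  # Keys are (toothache, catch, cavity).
  full_joint = {
      (True,  True,  True):  0.108, (True,  False, True):  0.012,
      (False, True,  True):  0.072, (False, False, True):  0.008,
      (True,  True,  False): 0.016, (True,  False, False): 0.064,
      (False, True,  False): 0.144, (False, False, False): 0.576,
  }

  def prob(phi):
      """Sum the probabilities of the atomic events in which proposition phi is true."""
      return sum(p for event, p in full_joint.items() if phi(*event))

  print(prob(lambda toothache, catch, cavity: toothache))            # P(toothache), approx 0.2
  print(prob(lambda toothache, catch, cavity: cavity or toothache))  # P(cavity or toothache), approx 0.28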

15
Inference by enumeration
16
Inference by enumeration
17
Inference by enumeration
  • Can also compute conditional probabilities

18
Normalization
  • Denominator can be viewed as a normalization
    constant a and we can write in vector notation
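
A short sketch of the normalization trick, using the same illustrative Toothache/Catch/Cavity joint as in the enumeration sketch above; α is obtained by rescaling the unnormalized vector so it sums to 1:

  # Compute P(Cavity | toothache) without computing P(toothache) first:
  # take the unnormalised vector and rescale it so its entries sum to 1.
  full_joint = {  # keys: (toothache, catch, cavity)
      (True,  True,  True):  0.108, (True,  False, True):  0.012,
      (False, True,  True):  0.072, (False, False, True):  0.008,
      (True,  True,  False): 0.016, (True,  False, False): 0.064,
      (False, True,  False): 0.144, (False, False, False): 0.576,
  }

  # Unnormalised vector <P(cavity, toothache), P(not cavity, toothache)>
  unnormalised = [
      sum(p for (t, c, cav), p in full_joint.items() if t and cav),
      sum(p for (t, c, cav), p in full_joint.items() if t and not cav),
  ]
  alpha = 1.0 / sum(unnormalised)           # alpha = 1 / P(toothache)
  print([alpha * x for x in unnormalised])  # approx [0.6, 0.4]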

19
Inference by enumeration, contd
  • Typically, we are interested in
  • the posterior joint distribution of the query
    variables Y
  • given specific values e for the evidence
    variables E
  • Let the hidden variables be H = X - Y - E
  • Then the required summation of joint entries is
    done by summing out the hidden variables:
  • P(Y | E = e) = α P(Y, E = e) = α Σ_h P(Y, E = e, H = h)
  • The terms in the summation are joint entries
    because Y, E and H together exhaust the set of
    random variables
  • Obvious problems:
  • Worst-case time complexity O(d^n), where d is the
    largest arity
  • Space complexity O(d^n) to store the joint
    distribution
  • How to find the numbers for the O(d^n) entries?
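
A sketch of the whole procedure in Python (function and variable names are my own, and the joint is the same illustrative table as above); the query variable is Cavity, the evidence is Toothache = true, and Catch is the hidden variable that gets summed out:

  from itertools import product

  VARS = ("Toothache", "Catch", "Cavity")
  full_joint = {  # keyed by (toothache, catch, cavity)
      (True,  True,  True):  0.108, (True,  False, True):  0.012,
      (False, True,  True):  0.072, (False, False, True):  0.008,
      (True,  True,  False): 0.016, (True,  False, False): 0.064,
      (False, True,  False): 0.144, (False, False, False): 0.576,
  }

  def enumerate_ask(query_var, evidence):
      """P(query_var | evidence) = alpha * sum over the hidden variables of joint entries."""
      dist = {}
      for y in (True, False):                        # each value of the query variable
          total = 0.0
          for event in product((True, False), repeat=len(VARS)):
              assignment = dict(zip(VARS, event))
              if assignment[query_var] == y and all(
                      assignment[v] == val for v, val in evidence.items()):
                  total += full_joint[event]         # hidden variables are summed out here
          dist[y] = total
      alpha = 1.0 / sum(dist.values())               # normalisation constant
      return {y: alpha * p for y, p in dist.items()}

  print(enumerate_ask("Cavity", {"Toothache": True}))  # approx {True: 0.6, False: 0.4}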

20
Independence
  • Let's add a fourth variable, Weather.
  • The full joint probability distribution
  • P(Toothache, Catch, Cavity, Weather)
  • has 32 entries (because Weather has 4 values)!!
  • It contains 4 editions of the previous table, one
    for each kind of weather.
  • Naturally, we ask what relationship these
    editions have to each other and to the original
    table?
  • E.g. how are
  • P(toothache, catch, cavity, cloudy) and
  • P(toothache, catch, cavity) related? Let's use
    the product rule:
  • P(toothache, catch, cavity, cloudy) =
    P(cloudy | toothache, catch, cavity)
    P(toothache, catch, cavity)
  • Of course, one's dental problems don't influence
    the weather, so
  • P(cloudy | toothache, catch, cavity) = P(cloudy)

21
Independence (contd)
  • So, we can write
  • P(toothache, catch, cavity, cloudy) =
    P(cloudy | toothache, catch, cavity)
    P(toothache, catch, cavity)
  • = P(cloudy) P(toothache, catch, cavity)
  • Thus, the 32-element table for four variables can
    be constructed from one 8-element table and one
    4-element table!!
  • This property is called independence.
  • A and B are independent iff
  • P(A | B) = P(A) or P(B | A) = P(B) or P(A, B) =
    P(A) P(B)
  • Absolute independence is powerful but rare.
  • Dentistry is a large field with hundreds of
    variables, none of which are independent.

22
Bayes' Rule
  • Product rule:
  • P(a ∧ b) = P(a | b) P(b) = P(b | a) P(a)
  • Bayes' rule: P(a | b) = P(b | a) P(a) / P(b)
  • or in vector form:
  • P(Y | X) = P(X | Y) P(Y) / P(X) = α P(X | Y) P(Y)
  • Useful for assessing diagnostic probability from
    causal probability:
  • P(Cause | Effect) = P(Effect | Cause) P(Cause) /
    P(Effect)

23
Applying Bayes rule
  • Bayes' rule is useful in practice because there
    are many cases where we do have good probability
    estimates for three of the numbers and need to
    compute the fourth.
  • For example:
  • A doctor knows that meningitis causes the
    patient to have a stiff neck 50% of the time.
  • The doctor also knows some unconditional facts:
  • the prior probability that a patient has
    meningitis is 1/50,000, and
  • the prior probability that any patient has a
    stiff neck is 1/20.

24
Bayes rule (contd)
  • Let s be the proposition that the patient has a
    stiff neck, and
  • m be the proposition that the patient has
    meningitis.
  • P(s | m) = 0.5
  • P(m) = 1/50000
  • P(s) = 1/20
  • P(m | s) = P(s | m) P(m) / P(s) =
    (0.5 × 1/50000) / (1/20) = 0.0002
  • That is, we expect only 1 in 5000 patients with a
    stiff neck to have meningitis.
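
The slide's arithmetic, checked in a few lines of Python:

  # Bayes' rule with the numbers from the slide: P(m | s) = P(s | m) P(m) / P(s).
  p_s_given_m = 0.5      # meningitis causes a stiff neck 50% of the time
  p_m = 1 / 50000        # prior probability of meningitis
  p_s = 1 / 20           # prior probability of a stiff neck

  p_m_given_s = p_s_given_m * p_m / p_s
  print(p_m_given_s)     # approx 0.0002, i.e. 1 in 5000 stiff-neck patients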

25
Bayes rule (contd)
  • Well, we might say that doctors know that a stiff
    neck implies meningitis in 1 out of 5000 cases.
  • That is, the doctor has quantitative information
    in the diagnostic direction, from symptoms
    (effects) to causes.
  • Such a doctor has no need for Bayes' rule?!
  • Unfortunately, diagnostic knowledge is more
    fragile than causal knowledge.
  • Imagine there is a sudden epidemic of meningitis.
    The prior probability, P(m), will go up.
  • The doctor who derives the diagnostic probability
    P(m | s) from his statistical observations of
    patients before the epidemic will have no idea
    how to update the value.
  • The doctor who derives the diagnostic
    probability P(m | s) from the other three values
    will see that P(m | s) goes up proportionally
    with P(m).
  • Clearly, P(s | m) is unaffected by the epidemic;
    it simply reflects the way meningitis works.

26
Difficulty with more than two vars
27
Conditional independence
  • P(Toothache, Cavity, Catch) has 2³ = 8
    entries
  • If I have a cavity, the probability that the
    probe catches in it doesn't depend on whether I
    have a toothache:
  • (1) P(catch | toothache, cavity) = P(catch |
    cavity)
  • The same independence holds if I haven't got a
    cavity:
  • (2) P(catch | toothache, ¬cavity) = P(catch |
    ¬cavity)
  • Catch is conditionally independent of Toothache
    given Cavity:
  • P(Catch | Toothache, Cavity) = P(Catch | Cavity)
  • Equivalent statements:
  • P(Toothache | Catch, Cavity) = P(Toothache |
    Cavity)
  • P(Toothache, Catch | Cavity) = P(Toothache |
    Cavity) P(Catch | Cavity)

28
Conditional independence contd.
  • Write out full joint distribution using chain
    rule
  • P(Toothache, Catch, Cavity)
  • = P(Toothache | Catch, Cavity) P(Catch, Cavity)
  • = P(Toothache | Catch, Cavity) P(Catch | Cavity)
    P(Cavity)
  • = P(Toothache | Cavity) P(Catch | Cavity)
    P(Cavity)
  • In most cases, the use of conditional
    independence reduces the size of the
    representation of the joint distribution from
    exponential in n to linear in n.
  • Conditional independence is our most basic and
    robust form of knowledge about uncertain
    environments.
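
A sketch of the factored representation in Python. The three small tables below are illustrative (chosen to be consistent with the Toothache/Catch/Cavity numbers used in the earlier sketches); together they rebuild any entry of the full joint:

  # P(Toothache, Catch, Cavity) = P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)
  p_cavity = {True: 0.2, False: 0.8}
  p_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(toothache | Cavity = key)
  p_catch_given_cavity = {True: 0.9, False: 0.2}       # P(catch | Cavity = key)

  def joint(toothache, catch, cavity):
      """Rebuild one entry of the full joint from the factored form."""
      pt = p_toothache_given_cavity[cavity] if toothache else 1 - p_toothache_given_cavity[cavity]
      pc = p_catch_given_cavity[cavity] if catch else 1 - p_catch_given_cavity[cavity]
      return pt * pc * p_cavity[cavity]

  print(joint(True, True, True))     # approx 0.108 (= 0.6 * 0.9 * 0.2)
  print(joint(False, False, False))  # approx 0.576 (= 0.9 * 0.8 * 0.8)

Five small numbers stand in for the whole 2³ table, which is the "exponential to linear" saving described on the slide.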

29
In general
30
Bayes' Rule and conditional independence
  • P(Cavity | toothache ∧ catch)
  • = α P(toothache ∧ catch | Cavity) P(Cavity)
  • = α P(toothache | Cavity) P(catch | Cavity)
    P(Cavity)
  • This is an example of a naïve Bayes model:
  • P(Cause, Effect_1, ..., Effect_n) = P(Cause)
    Π_i P(Effect_i | Cause)
  • Total number of parameters is linear in n
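
A minimal numeric sketch of the same combination (the numbers are the illustrative ones from the previous factored-representation sketch):

  # P(Cavity | toothache ∧ catch) = alpha P(toothache | Cavity) P(catch | Cavity) P(Cavity)
  p_cavity = {True: 0.2, False: 0.8}
  p_toothache_given_cavity = {True: 0.6, False: 0.1}
  p_catch_given_cavity = {True: 0.9, False: 0.2}

  unnormalised = {c: p_toothache_given_cavity[c] * p_catch_given_cavity[c] * p_cavity[c]
                  for c in (True, False)}
  alpha = 1.0 / sum(unnormalised.values())
  print({c: alpha * p for c, p in unnormalised.items()})  # approx {True: 0.871, False: 0.129}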

31
Athens Example
  • Suppose you are a witness to a nighttime
    hit-and-run accident involving a taxi in Athens.
  • All taxis in Athens are blue or green.
  • You swear, under oath, that the taxi was blue.
  • Extensive testing shows that, under the dim
    lighting conditions, discrimination between blue
    and green is 75% reliable.
  • 9 out of 10 Athenian taxis are green.
  • What is the most likely color for the taxi?
  • Hint: distinguish carefully between the
    proposition that the taxi is blue and the
    proposition that the taxi appears blue.

32
Athens Example (contd)
  • Two random variables:
  • B: the taxi was blue, with domain {b, ¬b}
  • LB: the taxi looked blue, with domain {lb, ¬lb}
  • The information on the reliability of color
    identification can be written as
  • P(lb | b) = 0.75 and P(¬lb | ¬b) = 0.75
  • We need to know the probability that the taxi was
    blue, given that it looked blue.
  • Then, we need to know the probability that the
    taxi wasn't blue, given that it looked blue.
    Let's use Bayes' rule:
  • P(b | lb) = α P(lb | b) P(b) = α × 0.75 × 0.1 =
    α × 0.075
  • P(¬b | lb) = α P(lb | ¬b) P(¬b)
  • = α (1 - P(¬lb | ¬b)) (1 - P(b))
  • = α (1 - 0.75) (1 - 0.1) = α × 0.25 × 0.9 =
    α × 0.225
  • Hence, P(B | lb) = ⟨α 0.075, α 0.225⟩ =
    ⟨0.25, 0.75⟩. So, even though the taxi looked
    blue to the witness, it is more probable that
    the taxi was green.
  • α = 1/P(lb) = 1/( P(b, lb) + P(¬b, lb) ) =
    1/(0.075 + 0.225)
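
The same calculation in a few lines of Python:

  # P(B | lb) = alpha <P(lb | b) P(b), P(lb | not b) P(not b)>
  p_b = 0.1                     # 9 out of 10 taxis are green, so P(blue) = 0.1
  p_lb_given_b = 0.75           # blue taxis look blue 75% of the time
  p_lb_given_not_b = 1 - 0.75   # green taxis look blue 25% of the time

  unnormalised = [p_lb_given_b * p_b, p_lb_given_not_b * (1 - p_b)]  # [0.075, 0.225]
  alpha = 1.0 / sum(unnormalised)
  print([alpha * x for x in unnormalised])  # approx [0.25, 0.75]: green is more likely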

33
Text Categorization
  • Text categorization is the task of assigning a
    given document to one of a fixed set of
    categories, on the basis of the text it contains.
  • Naïve Bayes models are often used for this task.
  • In these models, the query variable is the
    document category, and the effect variables are
    the presence or absence of each word in the
    language.
  • How can such a model be constructed, given as
    training data a set of documents that have been
    assigned to categories?
  • The model consists of the prior probability
    P(Category) and the conditional probabilities
    P(Word_i | Category).
  • For each category c, P(Category = c) is estimated
    as the fraction of all the training documents
    that are of that category.
  • Similarly, P(Word_i = true | Category = c) is
    estimated as the fraction of documents of
    category c that contain word i.
  • Also, P(Word_i = true | Category = ¬c) is
    estimated as the fraction of documents not of
    category c that contain word i.

34
Text Categorization (contd)
  • Now we can use naïve Bayes for each c:
  • P(Category = c | Word_1 = true, ..., Word_n = true)
  • = α P(Category = c) Π_{i=1..n} P(Word_i = true |
    Category = c)
  • P(Category = ¬c | Word_1 = true, ..., Word_n = true)
  • = α P(Category = ¬c) Π_{i=1..n} P(Word_i = true |
    Category = ¬c)
  • where α is the normalization constant.
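
To make the recipe concrete, here is a toy Python sketch. The two tiny training sets and the add-one smoothing are my own illustrative choices (the slides do not specify training data or smoothing):

  # Toy Bernoulli naive Bayes text categoriser for a single category c vs not-c.
  docs_in_c = [{"cheap", "pills", "buy"}, {"buy", "now", "cheap"}]     # labelled c
  docs_not_c = [{"meeting", "agenda", "notes"}, {"project", "notes"}]  # labelled not-c

  vocab = set().union(*docs_in_c, *docs_not_c)
  n_c, n_not_c = len(docs_in_c), len(docs_not_c)
  p_c = n_c / (n_c + n_not_c)                     # prior P(Category = c)

  def word_prob(word, docs, n):
      # Fraction of documents containing the word, with add-one smoothing so an
      # unseen word does not zero out the whole product (my assumption, not the slides').
      return (sum(word in d for d in docs) + 1) / (n + 2)

  def classify(doc):
      """Return the normalised P(Category = c | words) and P(Category = not-c | words)."""
      s_c, s_not_c = p_c, 1 - p_c
      for w in vocab:
          present = w in doc
          pc = word_prob(w, docs_in_c, n_c)
          pn = word_prob(w, docs_not_c, n_not_c)
          s_c *= pc if present else 1 - pc
          s_not_c *= pn if present else 1 - pn
      alpha = 1.0 / (s_c + s_not_c)               # normalisation constant
      return alpha * s_c, alpha * s_not_c

  print(classify({"buy", "cheap", "pills"}))      # category c is far more probable
  print(classify({"meeting", "notes"}))           # not-c is far more probable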