Title: Uncertainty Handling
1. Uncertainty Handling
- This is a traditional AI topic, but we need to cover it in at least a little detail here, prior to covering machine learning approaches
- There are many different approaches to handling uncertainty
- Formal approaches based on mathematics (probabilities)
- Formal approaches based on logic
- Informal approaches
- Many questions arise
- How do we combine uncertainty values?
- How do we obtain uncertainty values?
- How do we interpret uncertainty values?
- How do we add uncertainty values to our knowledge and inference mechanisms?
2. Why Is Uncertainty Needed?
- We will find none of the approaches to be entirely adequate, so the natural question is: why even bother?
- Input data may be questionable
- to what extent is a patient demonstrating some symptom?
- do we rely on their word?
- Knowledge may be questionable
- is this really a fact?
- Knowledge may not be truth-preserving
- if I apply this piece of knowledge, does the conclusion necessarily hold true? associational knowledge, for instance, is not truth-preserving, but is used all the time in diagnosis
- Input may be ambiguous or unclear
- this is especially true if we are dealing with real-world inputs from sensors, or dealing with situations where ambiguity readily exists (natural languages, for instance)
- Output may be expected in terms of a plausibility/probability, such as: what is the likelihood that it will rain today?
- The world is not just T/F, so our reasoners should be able to model this and reason over the shades of grey we find in the world
3. Methods to Handle Uncertainty
- Fuzzy Logic
- Logic that extends traditional 2-valued logic to be a continuous logic (values from 0 to 1)
- while this was developed early on to handle natural language ambiguities such as "you are very tall", it is instead more successfully applied to device controllers
- Probabilistic Reasoning
- Using probabilities as part of the data and using Bayes' theorem or variants to reason over what is most likely
- Hidden Markov Models
- A variant of probabilistic reasoning where internal states are not observable (so they are called hidden)
- Certainty Factors and Qualitative Fuzzy Logics
- More ad hoc approaches (non-formal) that might be more flexible or at least more human-like
- Neural Networks
- We will skip these in this lecture as we want to talk about NNs more with respect to learning
4. Fuzzy Logic
- Logic and sets are traditionally thought of as crisp
- An item is T or F; an item is in the set or not in the set
- Fuzzy logic, based on fuzzy set theory, says that an item is in a set by f(a) amount
- where a is the item
- and f is the membership function (which returns a real number from 0 to 1)
- Consider the figure on the right that compares the crisp and fuzzy membership functions for Tall
Membership to set A is often written A(x)
5. Fuzzy Logic as Process
- First, inputs are translated into fuzzy values
- This process is sometimes referred to as fuzzification
- For instance, assume Sue is 21, Jim is 42 and Frank is 53; we want to know who is old and who is young
- Young: Sue / .7, Jim / .2, Frank / .1
- Old: Sue / .1, Jim / .4, Frank / .6
- Next, we want to infer over our members
- We might have rules, for instance, that you cannot be a country club member unless you are OLD and WEALTHY
- The rules, when applied, will give us conclusions such as Frank can be a member at .8 and Sue at .5
- Finally, given our conclusions, we need to defuzzify them
- There is no single, accepted method for defuzzification
- how do we convert Frank / .8 and Sue / .5 into actions?
- Often, methods compute centers of gravity or a weighted average of some kind
- The result is then used to determine which conclusions are acceptable (true) and which are not
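The fuzzify / infer / defuzzify steps above can be sketched in a few lines of Python. The Old memberships are the slide's; the Wealthy values and the acceptance threshold are hypothetical, added only to make the OLD-and-WEALTHY rule computable.

```python
# Sketch of the fuzzify -> infer -> defuzzify pipeline.
# The Old memberships come from the slide; the Wealthy values are
# hypothetical, invented just to make the AND rule computable.

old = {'Sue': 0.1, 'Jim': 0.4, 'Frank': 0.6}      # fuzzified "Old"
wealthy = {'Sue': 0.5, 'Jim': 0.3, 'Frank': 0.9}  # hypothetical "Wealthy"

def country_club_member(person):
    # Rule: member(x) if OLD(x) and WEALTHY(x); fuzzy AND is min
    return min(old[person], wealthy[person])

def defuzzify(memberships, threshold=0.5):
    # One simple defuzzification: accept anyone at or above a cutoff
    return [p for p, m in memberships.items() if m >= threshold]

conclusions = {p: country_club_member(p) for p in old}
print(conclusions)             # {'Sue': 0.1, 'Jim': 0.3, 'Frank': 0.6}
print(defuzzify(conclusions))  # ['Frank']
```

A threshold is the crudest defuzzifier; a center-of-gravity calculation over the rule outputs is the more common choice in controllers.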
6. Fuzzy Rules
- Fuzzy logic is often applied in rule-based formats
- If tall(x) and athletic(x) then basketball_player(x)
- If tall(x) and athletic(x) then soccer_player(x)
- If basketball_player(x) and poor_grades(x) then skip_college(x)
- where tall, athletic, etc. are membership functions which return real values
- We will see how to deal with and, or, not, and implies on the next slide
- Fuzzy logic can be used to supplement a rule-based approach to KBS as seen above
- However, because fuzzy membership functions are not necessarily easy to define for ideas like nausea or flu, and because the logic begins to break down when there are lengthy chains of rules, we don't see this approach being used very often in modern KBS
- Instead, fuzzy logic is used in many controller devices where there are few rules
- If fast_speed(x) and approaching_red_light(x) then decelerate(x)
- the amount of deceleration is determined by defuzzifying the value derived by the implication in the rule
7. Fuzzy Operators
Operator              Fuzzy Set                Fuzzy Logic
Member (element of)   A(x)                     N/A
AND / Intersection    Min                      Min, or a * b
OR / Union            Max                      Max, or a + b - a*b
XOR                   N/A                      Max(Min(a, 1-b), Min(1-a, b)), or a + b - 3ab + a^2*b + a*b^2 - a^2*b^2
NOT                   1 - A(x)                 1 - a
Complement            {x / 1 - A(x), x in A}   N/A
-> (Implication)      N/A                      Max(1 - a, b), or 1 - a + a*b
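Read as operations over membership values a and b in [0, 1], the table's first forms are directly implementable; a minimal sketch:

```python
# The table's fuzzy-logic operators as functions over memberships in [0, 1]

def f_and(a, b):      # AND: minimum (the product a*b is an alternative)
    return min(a, b)

def f_or(a, b):       # OR: maximum (a + b - a*b is an alternative)
    return max(a, b)

def f_not(a):         # NOT: complement
    return 1 - a

def f_xor(a, b):      # XOR: max(min(a, 1-b), min(1-a, b))
    return max(min(a, 1 - b), min(1 - a, b))

def f_implies(a, b):  # Implication: max(1 - a, b)
    return max(1 - a, b)

print(f_and(0.5, 0.7))  # 0.5
print(f_xor(0.2, 0.9))  # 0.8
```

Note that on crisp inputs (0 or 1) these all reduce to the usual Boolean operators.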
8. Hedges
- Imagine that we have a function for tall such that tall(a) returns a's membership in the category
- For instance, tall(5'2") = .1, tall(5'11") = .6 and tall(6'7") = .9
- A hedge is a fuzzy term that is used to convert the membership function's output
- What does "very tall" mean? "somewhat tall"? "incredibly tall"?
- Common hedges
- Very -> f(x)^2
- Not very -> 1 - f(x)^2
- Somewhat -> f(x)^(1/2)
- About (or around) -> f(x) +/- delta
- Nearly -> f(x) - delta
- So for example
- if our membership function for Old says that 52 is Old / .6
- then 52 would be very old / .36, not very old / .64, somewhat old / .77
- We would need to define a reasonable delta for things like "around"
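The hedge definitions above, applied to the slide's Old(52) = .6, reproduce the quoted numbers:

```python
# Hedge transformations applied to a raw membership value f(x)

def very(m):
    return m ** 2

def not_very(m):
    return 1 - m ** 2

def somewhat(m):
    return m ** 0.5

old_52 = 0.6                       # Old(52) = .6 from the slide
print(round(very(old_52), 2))      # 0.36
print(round(not_very(old_52), 2))  # 0.64
print(round(somewhat(old_52), 2))  # 0.77
```

Squaring pushes memberships toward 0 (a stricter category), while the square root pushes them toward 1 (a looser one), which is exactly the intuition behind "very" versus "somewhat".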
9. Example
- Consider as an example a controller for an engine
- The job of the controller is to make sure that the engine temperature stays within a reasonable range
- There are three memberships: cold, warm, hot
- In this case, all three membership functions can be denoted in one figure (see to the right)
- The controller has 4 simple rules
- IF temperature IS very cold THEN stop fan
- IF temperature IS cold THEN turn down fan
- IF temperature IS normal THEN maintain level
- IF temperature IS hot THEN speed up fan
"Turn down" and "speed up" need to be defined, possibly through defuzzification; for instance, the extent of "cold" or "hot" determines the degree to turn the fan speed up or down
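A toy version of this controller can be sketched as follows. The ramp-shaped membership curves and the temperature scale below are assumptions (the slide's actual figure is not reproduced here), and the rule outputs are combined by a weighted average, one common defuzzification choice.

```python
# Minimal sketch of the engine-fan controller. The membership functions
# are hypothetical ramps on an arbitrary temperature scale; the fan-speed
# change is defuzzified as a weighted average of the rule outputs.

def cold(t):    # full membership below 40, fading out by 70 (assumed)
    return max(0.0, min(1.0, (70 - t) / 30))

def hot(t):     # fading in from 80, full membership above 110 (assumed)
    return max(0.0, min(1.0, (t - 80) / 30))

def normal(t):  # whatever is neither cold nor hot
    return max(0.0, 1.0 - cold(t) - hot(t))

def fan_adjustment(t):
    # Rule outputs: cold -> slow the fan (-1), normal -> hold (0),
    # hot -> speed it up (+1); combined by weighted average
    weights = [cold(t), normal(t), hot(t)]
    actions = [-1.0, 0.0, +1.0]
    return sum(w * a for w, a in zip(weights, actions)) / sum(weights)

print(fan_adjustment(100))  # positive: speed the fan up
print(fan_adjustment(50))   # negative: turn the fan down
```

The point of the weighted average is exactly the slide's remark: how cold or hot the temperature is determines the degree of adjustment, not just its direction.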
10. Advantages and Drawbacks
- Advantages
- Handles fuzzy terms in language quite easily
- Logic is simple to implement and compute
- Very appropriate for device controllers
- used in anti-lock brakes, auto-focus in cameras, Japan's automated subway, the Space Shuttle, etc.
- Disadvantages
- How do we define the membership functions?
- there are no learning mechanisms in fuzzy logic
- what if we have membership functions provided by two different people?
- for instance, what a 6'11" basketball player defines as tall will differ from what a 4'10" gymnast does
- How do we reconcile the two different fuzzy logics?
- Membership values begin to move away from expectations when chains of logic are lengthy, so this approach is not suitable for many KBS problems (e.g., medical diagnosis)
11. Bayesian Probabilities
- Bayes' theorem is given below
- P(H0 | E) = P(E | H0) * P(H0) / P(E)
- P(H0 | E): probability that H0 is true given evidence E (the conditional probability)
- P(E | H0): probability that E will arise given that H0 has occurred (the evidential probability)
- P(H0): probability that H0 will arise (the prior probability)
- P(E): probability that evidence E will arise
- Usually we normalize our probabilities so that P(E) = 1
- The idea is that you are given some evidence E = e1, e2, ..., en and you have a collection of hypotheses H1, H2, ..., Hm
- Using a collection of evidential and prior probabilities, compute the most likely hypothesis
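Picking the most likely hypothesis this way takes one multiplication per hypothesis plus a normalization; a sketch with invented flu/cold numbers:

```python
# Applying Bayes' theorem over a set of competing hypotheses.
# The flu/cold numbers below are purely hypothetical, for illustration.

def posteriors(priors, likelihoods):
    # posterior(H) is proportional to P(E | H) * P(H); dividing by
    # P(E) (total probability) normalizes the results to sum to 1
    unnorm = {h: likelihoods[h] * priors[h] for h in priors}
    p_e = sum(unnorm.values())
    return {h: p / p_e for h, p in unnorm.items()}

priors = {'flu': 0.1, 'cold': 0.3}       # P(H), hypothetical
likelihoods = {'flu': 0.9, 'cold': 0.4}  # P(fever | H), hypothetical

post = posteriors(priors, likelihoods)
print(max(post, key=post.get))           # cold
```

Note that the winner is decided by the numerators alone; dividing by P(E) only rescales, which is why normalizing "so that P(E) = 1" is harmless.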
12. Independence of Evidence
- Note that since E is a collection of some evidence, but not all possible evidence, you will need a whole lot of probabilities
- P(E1 E2 | H0), P(E1 E3 | H0), P(E1 E2 E3 | H0), ...
- If you have n items that could be evidence, you will need 2^n different evidential probabilities for every hypothesis
- In order to get around the problem of needing an exponential number of probabilities, one might make the assumption that pieces of evidence are independent
- Under such an assumption
- P(E1 E2 | H) = P(E1 | H) * P(E2 | H)
- P(E1 E2) = P(E1) * P(E2)
- Is this a reasonable assumption?
13. Continued
- Example: a patient is suffering from a fever and nausea
- Can we treat these two symptoms as independent?
- one might be causally linked to the other
- the two combined may help identify a cause (disease) that the symptoms separately might not
- A weaker form of independence is conditional independence
- If hypothesis H is known to be true, then whether E1 is true should not impact P(E2 | H) or P(H | E2)
- Again, is this a reasonable assumption?
- Consider as an example
- You want to run the sprinkler system if it is not going to rain, and you base your decision on whether it will rain on whether it is cloudy
- the grass is wet, and we want to know the probability that you ran the sprinkler versus the probability that it rained
- the evidential probabilities P(sprinkler | wet) and P(rain | wet) are not independent of whether it was cloudy or not
14. Bayesian Networks
- We can avoid the assumption of independence by including causality in our knowledge
- For this, we enhance our previous approach by using a network where directed edges denote some form of dependence or causality
- An example of a causal network is shown to the right along with the probabilities (evidential and prior)
- we cannot use Bayes' theorem directly because the evidential probabilities are based on the prior probability of cloudy
- However, a propagation algorithm can be applied where the prior probability for cloudiness will impact the evidential probabilities of sprinkler and rain
- from there, we can finally compute the likelihood of rain versus sprinkler
15. Propagation Algorithm
- The idea behind computing probabilities in a Bayesian network is the chain rule
- Before describing the chain rule: what we now want to compute in a network is the probability of a particular path through the network
- If we have a network as shown to the right, we might want to compute P(A, B, C, D, E), that is, the probability of visiting the nodes A, B, C, D, E in that order
- The chain rule says that P(A, B, C, D, E) = P(A) * P(B | A) * P(C | A, B) * P(D | A, B, C) * P(E | A, B, C, D)
- In our previous example, we want to know the probability that it rained versus the probability that we ran the sprinkler, given that the grass is wet
- P(C, S, W) = P(C) * P(S | C) * P(W | C, S)
- P(C, R, W) = P(C) * P(R | C) * P(W | C, R)
16. Example Continued
- First, we compute P(Wet), which is the sum over all possible paths through the network
- The summation tries c = 0, c = 1, s = 0, s = 1, r = 0 and r = 1 (8 possibilities)
- Now we know the denominator to apply in Bayes' theorem; the numerators will be P(S, W) and P(R, W), giving us P(S | W) and P(R | W)
- that is, the probability that the grass is wet and the sprinkler was on, and the probability that the grass is wet and it rained
So it is more likely that it rained than that we ran the sprinkler. Notice the two probabilities do not add up to 1.
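This computation is easy to carry out directly. The conditional probability values below are hypothetical stand-ins (the slide's actual numbers live in a figure that is not reproduced here); the structure of the calculation is the point.

```python
# Chain-rule computation for the Cloudy -> {Sprinkler, Rain} -> Wet network.
# All probability values here are hypothetical stand-ins for the figure's.

p_c = 0.5                              # P(Cloudy)
p_s = {True: 0.1, False: 0.5}          # P(Sprinkler | Cloudy)
p_r = {True: 0.8, False: 0.2}          # P(Rain | Cloudy)
p_w = {(True, True): 0.99, (True, False): 0.9,
       (False, True): 0.9, (False, False): 0.0}  # P(Wet | S, R)

def p_joint(c, s, r):
    # One path through the network, ending in Wet = True
    pc = p_c if c else 1 - p_c
    ps = p_s[c] if s else 1 - p_s[c]
    pr = p_r[c] if r else 1 - p_r[c]
    return pc * ps * pr * p_w[(s, r)]

bools = (True, False)
p_wet = sum(p_joint(c, s, r) for c in bools for s in bools for r in bools)
p_sprinkler_wet = sum(p_joint(c, True, r) for c in bools for r in bools)
p_rain_wet = sum(p_joint(c, s, True) for c in bools for s in bools)

print(p_rain_wet / p_wet)        # P(R | W)
print(p_sprinkler_wet / p_wet)   # P(S | W)
```

With these numbers rain comes out more likely than the sprinkler, matching the slide's conclusion, and the two conditionals do not sum to 1 (both can be true at once).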
17. A Problem
- The example we considered is of a singly connected graph
- A multiply connected causal graph is one where cycles exist
- The problem here is that the chain rule does not work in a multiply connected graph
- consider a graph of four nodes where A -> B -> D and also A -> C -> D
- the expansion of P(A, B, C, D) contains two different paths to D, and thus two conditionals for D
- we need some mechanism to allow us to compute P(A, B, C, D) with both P(D | A, B) and P(D | A, C)
- By collapsing nodes we can remove the cycles
- Collapse B and C into a single node
- We don't have probabilities for the combined B C node, so we need to come up with a strategy to handle this collapsed node
- The strategy is to instantiate all possible values of B and C and test out the Bayesian network for each of these instantiations
- For B and C, we would have True/True, True/False, False/True and False/False
- We run our propagation algorithm on all four combinations and see which resulting P(A, B, C, D) value is the highest; this tells us not only the probability of P(D) but also what values of B and C lead us to that conclusion
18. More
- Our previous example was fairly easily reduced to a singly connected graph
- We would have to run our propagation algorithm 4 times, but that isn't too bad a price to pay to get around the various problems
- However, how realistic is it for a real-world problem to have such a simple way to reduce from a multiply connected graph to a singly connected one?
- Consider the network on the next slide; it is multiply connected in a very complicated way
- to reduce such a graph, we would have to collapse a great number of nodes and then instantiate the collapsed node(s) to all possible combinations
- for instance, if we can get around our problem by reducing several pathways to three collapsed nodes, one with 6 nodes, one with 5 nodes and one with 8 nodes, we would have to run our propagation algorithm 2^6 * 2^5 * 2^8 = 2^19 times
- this quickly leads to intractability
- Note: a good summary and example of Bayesian nets is provided at http://www4.ncsu.edu/bahler/courses/csc520f02/bayes1.html; you might want to check it out
19. Real World Example
Here is a Bayesian net for classification of intruders in an operating system. Notice that it contains cycles. The probabilities for the edges are learned by sorting through log data.
20. Where do the Probabilities Come From?
- Recall we need P(Hi) for every hypothesis and P(Ei | Hj) for every piece of evidence for each hypothesis
- If we use statistics, can we guarantee that the statistics are unbiased?
- I poll 100 doctors' offices to find out how many of their patients have the flu (to obtain the prior probability P(flu))
- This statistic is biased because I gathered the information in the summer (the probability would probably differ if I gathered the data in the winter)
- A prior probability is supposed to be entirely independent of evidence such as the season, yet gathering the statistics may introduce the bias
- An alternative approach is to rank probabilities with respect to each other and then supply values (the argument being that probabilities don't have to be exact, just relatively correct)
- Bayesian networks can also be trained with data sets to find reasonable probabilities
- this is in fact the approach commonly used today
21. HMMs
- A Markov model is a state transition diagram with probabilities on the edges
- We use a Markov model to compute the probability of a certain sequence of states
- see the figure to the right
- In many problems, we have observations to tell us what states have been reached, but observations may not show us all of the states
- Intermediate states (those that are not identifiable from observations) are hidden
- In the figure on the right, the observations are Y1, Y2, Y3, Y4 and the hidden states are Q1, Q2, Q3, Q4
- The HMM allows us to compute the most probable path that led to a particular observable state
- This allows us to find which hidden states were most likely to have occurred
- This is extremely useful for recognition problems where we know the end result but not how the end result was produced
- we know the patient's symptoms but not the disease that caused the symptoms to appear
- we know the speech signal that the speaker uttered, but not the phonemes that made up the speech signal
22. Forward Algorithm
- There are two central HMM algorithms
- The Forward algorithm computes, given a series of states, the probability of that state sequence
- this is merely the product of each transition
Here is an example (from Wikipedia). The probability of two rainy days is P(Rainy) * P(Rainy | Rainy) = .6 * .7 = .42. The probability of a sunny day followed by a rainy day followed by a sunny day is P(Sunny) * P(Rainy | Sunny) * P(Sunny | Rainy) = .4 * .4 * .3 = .048.
states = ('Rainy', 'Sunny')
observations = ('walk', 'shop', 'clean')
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1},
}
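A small helper, consistent with the tables above, reproduces the two hand computations:

```python
# Probability of a state sequence under the Rainy/Sunny Markov model

def sequence_probability(seq, start_p, trans_p):
    p = start_p[seq[0]]                 # probability of the first state
    for prev, cur in zip(seq, seq[1:]):
        p *= trans_p[prev][cur]         # one factor per transition
    return p

start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6},
}

print(round(sequence_probability(('Rainy', 'Rainy'),
                                 start_probability,
                                 transition_probability), 3))   # 0.42
print(round(sequence_probability(('Sunny', 'Rainy', 'Sunny'),
                                 start_probability,
                                 transition_probability), 3))   # 0.048
```

This is the plain Markov-chain product; the emission probabilities only enter once observations are involved, as on the next slides.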
23. Viterbi Algorithm
- Of more interest is determining what sequence of hidden states probably led to a given result
- This is accomplished by using the Forward algorithm to compute the probabilities of all possible hidden state paths leading to the observation of interest from the starting point
- The best path is the one with the highest probability
- From the previous example, we might want to know what the chance was that it rained given that a friend's activity was to walk
- P(rainy, walk) = P(rainy) * P(walk | rainy) = .6 * .1 = .06
- P(sunny, walk) = P(sunny) * P(walk | sunny) = .4 * .6 = .24
- So it is far more likely that it was sunny, since your friend walked today
- More complex is when you have a sequence of observations, in which case the probability is P(condition1, condition2, action1, action2) = P(condition1) * P(action1 | condition1) * P(condition2 | condition1) * P(action2 | condition2)
- Here, we include the probability of the transition from condition 1 to condition 2 in our calculation
24. Example Continued
- Now consider that we want to know what the weather was like three days in a row, given that your friend walked the first day, shopped the second day and cleaned the third day
- We have 8 probabilities to compute
- P(sunny, sunny, sunny, walk, shop, clean) = P(sunny) * P(walk | sunny) * P(sunny | sunny) * P(shop | sunny) * P(sunny | sunny) * P(clean | sunny) = .00259
- P(sunny, sunny, rainy, walk, shop, clean) = .00864
- P(sunny, rainy, sunny, walk, shop, clean) = P(sunny) * P(walk | sunny) * P(rainy | sunny) * P(shop | rainy) * P(sunny | rainy) * P(clean | sunny) = .00115
- P(sunny, rainy, rainy, walk, shop, clean) = .01344
- P(rainy, sunny, sunny, walk, shop, clean) = .00032
- P(rainy, sunny, rainy, walk, shop, clean) = .00108
- P(rainy, rainy, sunny, walk, shop, clean) = .00050
- P(rainy, rainy, rainy, walk, shop, clean) = .00588
- So we find that the most likely (hidden) sequence is sunny, rainy, rainy
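The eight-path enumeration above can be brute-forced directly from the Rainy/Sunny model given earlier:

```python
# Brute-force version of the slide's computation: enumerate all 2^3 hidden
# weather sequences for the observations walk, shop, clean and keep the
# joint probability of each path with the observations.

from itertools import product

start_p = {'Rainy': 0.6, 'Sunny': 0.4}
trans_p = {'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
           'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emit_p = {'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
          'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

obs = ('walk', 'shop', 'clean')
joint = {}
for path in product(('Rainy', 'Sunny'), repeat=len(obs)):
    p = start_p[path[0]] * emit_p[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= trans_p[path[i - 1]][path[i]] * emit_p[path[i]][obs[i]]
    joint[path] = p

best = max(joint, key=joint.get)
print(best)                   # ('Sunny', 'Rainy', 'Rainy')
print(round(joint[best], 5))  # 0.01344
```

Brute force works here only because there are 2^3 = 8 paths; the Viterbi algorithm gets the same answer with dynamic programming instead of full enumeration, which is what makes longer sequences feasible.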
25. Some Comments
- Aside from the Forward and Viterbi algorithms, there is also a Forward-Backward algorithm for learning (or adjusting) the transition probabilities
- This will be beneficial when it comes to speech recognition, where transition values may not be available
- we will briefly examine this algorithm when we cover learning in the next lecture
- Both the Forward and Viterbi algorithms require a great number of computations
- This is reduced by using dynamic programming and recursion, but the number of computations still grows quickly as the number of states and transitions increases
- HMMs are used extensively in modern speech recognition systems, leading to the best performance (by far) for such automated systems
- But HMMs are not used very often for other reasoning problems
26. Certainty Factors
- This form of uncertainty was introduced in the MYCIN expert system (1971)
- The idea was to annotate a rule with how certain a conclusion might be
- If A then B (.6)
- if A is true, B can be concluded with .6, where .6 is a plausibility (rather than a fuzzy membership or a probability)
- certainty factors are provided by the domain expert
- There are numerous questions that arise with CFs
- How do we combine CFs?
- How does an expert provide a CF for a given rule?
- Will CFs be consistent across all rules?
- CFs are informal and so are not mathematically sound
- On the other hand,
- CFs are more in the language of the expert, unlike probabilities
- CFs can denote belief (when positive) and disbelief (when negative)
- CFs have been used in many other systems since MYCIN
27. Combining CFs
- We need mechanisms to handle AND, OR, NOT and implications
- If A and B Then C (.8)
- CF(A) = .5, CF(B) = .7; what is the CF for C?
- AND: minimum
- OR: maximum
- NOT: 1 - CF
- Implications: multiply CFs
- so the CF for C as a conclusion above is min(.5, .7) * .8 = .4
- What if we have two rules, both of which suggest C?
- If D or E Then C (.7)
- We also need to combine the CFs of the same conclusion (combining evidence)
- if CF(D) = .8 and CF(E) = .3, then CF(C) = max(.8, .3) * .7 = .56
- so now we have CF(C) = .4 and CF(C) = .56
- Combine the conclusions using CF1 + CF2 - CF1 * CF2
- so our new CF(C) = .4 + .56 - .4 * .56 = .736
- CF is now greater than the two individual CFs while remaining less than 1
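These combination rules are easy to express directly; a minimal sketch reproducing the slide's numbers:

```python
# CF combination rules from this slide (positive CFs only; MYCIN uses a
# different evidence-combination formula when disbelief is involved)

def cf_and(*cfs):
    return min(cfs)

def cf_or(*cfs):
    return max(cfs)

def cf_rule(premise_cf, rule_cf):
    return premise_cf * rule_cf          # implication multiplies CFs

def cf_combine(cf1, cf2):
    return cf1 + cf2 - cf1 * cf2         # combining two positive conclusions

c1 = cf_rule(cf_and(0.5, 0.7), 0.8)      # rule 1: If A and B Then C (.8)
c2 = cf_rule(cf_or(0.8, 0.3), 0.7)       # rule 2: If D or E Then C (.7)
print(round(c1, 2), round(c2, 2))        # 0.4 0.56
print(round(cf_combine(c1, c2), 3))      # 0.736
```

The combine formula has the property the slide notes: corroborating rules always push the CF above either individual value while keeping it below 1.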
28. MYCIN Example

if (site culture is blood) and (gram organism is neg) and
   (morphology organism is rod) and (burn patient is serious)
then (identity organism is pseudomonas) (.4)

if (gram organism is neg) and (morphology organism is rod) and
   (compromised-host patient is yes)
then (identity organism is pseudomonas) (.6)

if (gram organism is pos) and (morphology organism is coccus)
then (identity organism is pseudomonas) (-.4)

We know culture-1's site is blood, gram is negative, morphology is most likely (.8) rod, the burn patient is semi-serious (serious at .5) and the patient has been compromised (i.e., a virus).

CF(pseudomonas) = min(1, 1, .8, .5) * .4 = .2 from rule 1
CF(pseudomonas) = min(1, 1, 1) * .6 = .6 from rule 2
CF(pseudomonas) = min(0, 0) * -.4 = 0 from rule 3
CF(pseudomonas) = .2 + .6 - .2 * .6 = .68 (suggestive evidence)

- Translating CFs to English
- 1.0: certain evidence
- > .8: strongly suggestive evidence
- > .5: suggestive evidence
- > 0: weakly suggestive evidence
- 0: no evidence
29. Fuzzier Approaches
- There are several problems with CFs
- Experts may not feel comfortable giving values like .6 for one rule and .7 for another
- If you are getting knowledge from two experts, they may provide inconsistent CFs
- Expert 1 is more confident in his rules than expert 2, so expert 1's CFs are consistently higher than expert 2's
- After a while, the CFs for the expert's rules might become more uniform
- After the 500th rule, the expert starts using .5 for every CF!
- Doctors in particular are likely to use fuzzy vocabulary in place of numeric values
- Terminology includes "likely", "plausible", "highly unlikely", "ruled out", "confirmed", etc.
- Why not use values more consistent with the human?
- Especially since it is very inhuman to compute massive amounts of numeric calculations when reasoning, so the combining rules for fuzzy logic, CFs and probabilities are not very human-like
30. Example
- Simple matching logic to determine if a patient has a common viral infection
- Uses a 9-valued vocabulary
- Confirmed
- Very likely
- Likely
- Somewhat likely
- Neutral (don't know)
- Somewhat unlikely
- Unlikely
- Very unlikely
- Ruled Out
- Other matchers will decide how to combine these fuzzy values
- For instance, if we want to know whether to call the doctor, we might have features that ask "is it a common viral infection", "does the patient have nausea", and "is the fever abnormally high"
- If at least somewhat likely, yes and yes, then return yes; otherwise return no
Features:
1. Achy-eyes This-illness ?
2. Fever This-illness ?
3. Achiness This-illness ?
4. Runny-nose This-illness ?
5. Scratchy-throat This-illness ?
6. Slightly-upset-stomach This-illness ?
7. Tiredness This-illness ?
8. Malaise This-illness ?
9. Headache This-illness ?

Patterns:
Y ? ? ? ? ? ? ? ?     -> Likely
? Y Y Y Y Y Y Y Y (5) -> Very-likely
? Y Y Y Y Y Y Y Y (4) -> Likely
? Y ? Y ? ? ? ? ?     -> Likely
? Y Y Y Y Y Y Y Y (3) -> Somewhat-likely
? Y Y Y Y Y Y Y Y (2) -> Neutral
Otherwise             -> Ruled Out
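The doctor-call check described above can be sketched by ranking the 9-valued vocabulary; the encoding and feature names here are illustrative assumptions, not the slide's actual matcher:

```python
# Sketch of the "call the doctor" matcher: combine the fuzzy answer for
# common-viral-infection with two yes/no features. The ranked vocabulary
# is encoded by position, so "at least somewhat likely" is a comparison.

SCALE = ['ruled-out', 'very-unlikely', 'unlikely', 'somewhat-unlikely',
         'neutral', 'somewhat-likely', 'likely', 'very-likely', 'confirmed']

def at_least(value, floor):
    # True when value ranks at or above floor in the 9-valued vocabulary
    return SCALE.index(value) >= SCALE.index(floor)

def call_doctor(viral_infection, nausea, high_fever):
    # "If at least somewhat likely, yes and yes, then return yes"
    return (at_least(viral_infection, 'somewhat-likely')
            and nausea and high_fever)

print(call_doctor('likely', True, True))             # True
print(call_doctor('somewhat-unlikely', True, True))  # False
```

Note that combining here never produces a number at all; the matcher reasons directly over the ranked vocabulary, which is the whole point of the "fuzzier" approach.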
31. Advantages and Drawbacks
- Advantages of CFs and Fuzzier Values
- No need to use statistics, which might be biased
- No need for massive computations as seen with Bayes and HMMs
- Combining conditions and conclusions is permissible without violating any logic regarding independence of hypotheses or evidence
- The values do not have to be as accurate as in probabilistic methods, where the statistics should be as accurate as possible
- Although there is no formal learning mechanism available, learning algorithms can be constructed
- It's easy!
- Disadvantages
- Not formal; many people dislike informal techniques for AI
- We still need to get the values from somewhere; domain experts are more likely to provide the needed values, but they still may not be highly accurate
32. Conclusions
- Many AI systems require some form of uncertainty handling
- Issues
- Where do the values come from?
- Can they be learned?
- yes for HMMs and Bayesian networks, maybe for CFs and fuzzier values, no for fuzzy logic
- Is the approach computationally expensive?
- yes for HMMs and Bayesian networks; yes for Bayesian probabilities if you do not assume independence of values
- Is the approach formal?
- yes for all but CFs/fuzzier values
- Applications
- Speech recognition: HMMs primarily
- Expert systems: Bayesian networks, fuzzy logic, CFs/fuzzier values
- Device controllers: fuzzy logic
- There is also non-monotonic logic and non-monotonic reasoning (having multiple belief states)