Title: Bayesian approaches to knowledge representation and reasoning Part 1 Chapter 13
1Bayesian approaches to knowledge representation and reasoning, Part 1 (Chapter 13)
2Bayesianism vs. Frequentism
- Classical probability (frequentist view)
- The probability of a particular event is defined relative to its frequency in a sample space of events.
- E.g., the probability that the coin will come up heads on the next trial is defined relative to the frequency of heads in a sample space of coin tosses.
3- Bayesian probability
- Combines the degree of prior belief you have in a proposition with your subsequent observations of events.
- Example: a Bayesian can assign a probability to the statement "The first e-mail message ever written was not spam," but a frequentist cannot.
4Bayesian Knowledge Representation and Reasoning
- Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)
5- Bayesian terminology (example: spam recognition)
- Random variable X: returns one of a set of values x1, x2, ..., xm, or a continuous value in the interval [a, b], with probability distribution D(X).
- Data D = (v1, v2, v3, ...): the set of observed values of the random variables X1, X2, X3, ...
6- Hypothesis h: a function taking an instance j and returning a classification of j (e.g., spam or not spam).
- Space of hypotheses H: the set of all possible hypotheses.
7- Prior probability of h
- P(h): the probability that hypothesis h is true given our prior knowledge.
- If we have no prior knowledge, all h ∈ H are equally probable.
- Posterior probability of h
- P(h | D): the probability that hypothesis h is true, given the data D.
- Likelihood of D
- P(D | h): the probability that we will see data D, given that hypothesis h is true.
8Recall the definition of conditional probability: P(X | Y) = P(X ∧ Y) / P(Y)
Event space: all e-mail messages. X: all spam messages. Y: all messages containing the word v1agra.
9Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)
10Example Using Bayes Rule
- Hypotheses
- h: message m is spam
- ¬h: message m is not spam
- Data
- D: message m contains "viagra"
- ¬D: message m does not contain "viagra"
- Prior probability: P(h) = 0.1, P(¬h) = 0.9
- Likelihood: P(D | h) = 0.6, P(¬D | h) = 0.4, P(D | ¬h) = 0.03, P(¬D | ¬h) = 0.97
11- P(D) = P(D | h) P(h) + P(D | ¬h) P(¬h) = 0.6 × 0.1 + 0.03 × 0.9 ≈ 0.09
- P(¬D) ≈ 0.91
- P(h | D) = P(D | h) P(h) / P(D) = (0.6 × 0.1) / 0.09 ≈ 0.67
- How would we learn these prior probabilities and likelihoods from past examples of spam and not spam? (A small numeric check of the arithmetic follows below.)
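The arithmetic above can be checked mechanically. A minimal Python sketch, using the prior and likelihood values from the slide (note the slide rounds P(D) to 0.09 before dividing):

```python
# Spam example from the slides: h = "message m is spam", D = "message m contains viagra".
p_h, p_not_h = 0.1, 0.9            # priors P(h), P(not h)
p_d_given_h = 0.6                  # likelihood P(D | h)
p_d_given_not_h = 0.03             # likelihood P(D | not h)

# Total probability of the data: P(D) = P(D | h) P(h) + P(D | not h) P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * p_not_h   # 0.087, which the slide rounds to 0.09

# Bayes rule: P(h | D) = P(D | h) P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d                 # about 0.69 (0.67 with the slide's rounding)

print(f"P(D) = {p_d:.3f}, P(h | D) = {p_h_given_d:.2f}")
```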
12Full joint probability distribution (CORRECTED)
Notation: P(h, D) ≡ P(h ∧ D)
P(h ∧ D) = P(h | D) P(D), P(h ∧ ¬D) = P(h | ¬D) P(¬D), etc.
13- Now suppose there is a second feature examined: does the message contain the word "offer"?
P(m = spam, viagra, offer)
The size of the full joint distribution scales exponentially with the number of features.
14- Bayes optimal classifier for spam: classify m as argmax over h ∈ {spam, ¬spam} of P(h | f1, ..., fn) ∝ P(f1, ..., fn | h) P(h)
- where fi is a feature (here, could be a keyword)
- In general, intractable.
15- Classification using naive Bayes
- Assumes that all features are conditionally independent of one another given the class, so that P(f1, ..., fn | h) = ∏i P(fi | h).
- How do we learn the naive Bayes model from data?
- How do we apply the naive Bayes model to a new instance?
16Example Training and Using Naive Bayes for
Classification
- Features
- CAPS: Boolean (1 if the longest contiguous string of capitalized letters in the message is longer than 3, 0 otherwise)
- URL: Boolean (0 if no URL in the message, 1 if at least one URL in the message)
- $$$: Boolean (0 if "$$$" does not appear at least once in the message, 1 otherwise)
17- Training data
- M1: "DONT MISS THIS AMAZING OFFER $$$!" (spam)
- M2: "Dear mm, for more $$$, check this out: http://www.spam.com" (spam)
- M3: "I plan to offer two sections of CS 250 next year" (not spam)
- M4: "Hi Mom, I am a bit short on $$$ right now, can you send some ASAP? Love, me" (not spam)
18Training a Naive Bayes Classifier
- Two hypotheses: spam or not spam
- Estimate:
- P(spam) = .5, P(¬spam) = .5
- P(CAPS | spam) = .5, P(¬CAPS | spam) = .5
- P(URL | spam) = .5, P(¬URL | spam) = .5
- P($$$ | spam) = .75, P(¬$$$ | spam) = .25
- P(CAPS | ¬spam) = .5, P(¬CAPS | ¬spam) = .5
- P(URL | ¬spam) = .25, P(¬URL | ¬spam) = .75
- P($$$ | ¬spam) = .5, P(¬$$$ | ¬spam) = .5
19- m-estimate of probability (to fix cases where one of the terms in the product is 0); a common form is written out below.
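The slide itself does not show the formula, so take this form as an assumption: with n_c positive examples of the feature among n training examples of the class, a prior estimate p, and an equivalent sample size m,

```latex
\hat{P}(f \mid h) = \frac{n_c + m\,p}{n + m}
```

With m = 2 and p = 1/2 this reduces to Laplace smoothing, which appears consistent with the table on slide 18 (e.g., P($$$ | spam) = (2 + 1)/(2 + 2) = .75).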
20M4: "This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on http://www.spammers.com" (a new message to classify; see the sketch below)
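A runnable sketch of slides 16-20. The feature values assigned to M1-M4, and the use of the m-estimate with m = 2 and p = 1/2, are my reading of the slides rather than something the slides state explicitly:

```python
# Naive Bayes spam classifier over three Boolean features: CAPS, URL, $$$.
# Feature vectors for the four training messages (M1-M4) and their labels.
train = [
    ({"CAPS": 1, "URL": 0, "$$$": 1}, "spam"),      # M1
    ({"CAPS": 0, "URL": 1, "$$$": 1}, "spam"),      # M2
    ({"CAPS": 0, "URL": 0, "$$$": 0}, "not spam"),  # M3
    ({"CAPS": 1, "URL": 0, "$$$": 1}, "not spam"),  # M4 ("ASAP" makes CAPS true)
]
features = ["CAPS", "URL", "$$$"]
classes = ["spam", "not spam"]
m, p = 2, 0.5  # m-estimate parameters (assumed)

# Class priors and per-class feature probabilities P(f = 1 | class), m-estimate smoothed.
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
cond = {}
for c in classes:
    n = sum(1 for _, y in train if y == c)
    for f in features:
        n_c = sum(x[f] for x, y in train if y == c)
        cond[(f, c)] = (n_c + m * p) / (n + m)

def classify(x):
    """Return the class maximizing P(class) * prod_f P(f = x[f] | class)."""
    scores = {}
    for c in classes:
        score = prior[c]
        for f in features:
            p1 = cond[(f, c)]
            score *= p1 if x[f] else (1 - p1)
        scores[c] = score
    return max(scores, key=scores.get), scores

# New message from slide 20: it has long capitalized strings, a URL, and $$$.
print(classify({"CAPS": 1, "URL": 1, "$$$": 1}))
```

Running this reproduces the table on slide 18 and classifies the new message as spam (score roughly 0.094 vs. 0.031).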
21Information Retrieval
- Most important concepts
- Defining features of a document
- Indexing documents according to features
- Retrieving documents in response to a query
- Ordering retrieved documents by relevance
- Early search engines
- Features: the list of all terms (keywords) in the document (minus stop words such as "a", "the", etc.)
- Indexing by keyword
- Retrieval by keyword match with the query
- Ordering by number of keywords matched
- Problems with this approach
22Naive Bayesian Document retrieval
- Let D be a document (a bag of words), Q be a query (a bag of words), and r be the event that D is relevant to Q.
- In document retrieval, we want to compute P(r | D, Q).
- Or, the odds ratio P(r | D, Q) / P(¬r | D, Q).
- In the book, they show (via a lot of algebra) that this odds ratio can be rewritten in terms of P(Q | D, r) and P(Q | D, ¬r) (see the next slide).
- Chain rule: P(A, B) = P(A | B) P(B)
24Naive Bayesian Document retrieval
- P(Q | D, r) = ∏j P(Qj | D, r), where Qj is the jth keyword in the query.
- The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document.
- How to learn these probabilities? (A scoring sketch follows below.)
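A minimal scoring sketch for this slide. The add-one smoothing over document term counts is an assumed estimation scheme, not the book's:

```python
from collections import Counter

def p_keyword_given_doc(word, doc_words, vocab_size):
    """Estimate P(Q_j | D, r) with add-one smoothing over the document's term counts (assumed scheme)."""
    counts = Counter(doc_words)
    return (counts[word] + 1) / (len(doc_words) + vocab_size)

def query_likelihood(query_words, doc_words, vocab_size):
    """P(Q | D, r) approximated as the product over the query's keywords."""
    score = 1.0
    for w in query_words:
        score *= p_keyword_given_doc(w, doc_words, vocab_size)
    return score

doc = "bayesian networks for document retrieval and ranking".split()
print(query_likelihood("bayesian retrieval".split(), doc, vocab_size=1000))
```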
25Evaluating Information Retrieval Systems
- Precision and Recall
- Example: out of a corpus of 100 documents, a query returns a results set of 40 documents, 30 of which are relevant; the corpus contains 50 relevant documents in all.
- Precision: the fraction of the results set that is relevant = 30/40 = .75. (How precise is the results set?)
- Recall: the fraction of the relevant documents in the whole corpus that are in the results set = 30/50 = .60. (How many of the relevant documents were recalled? See the short computation below.)
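The two numbers from the example, computed directly:

```python
def precision_recall(relevant_retrieved, retrieved, relevant_total):
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / all relevant."""
    return relevant_retrieved / retrieved, relevant_retrieved / relevant_total

# Results set of 40 documents, 30 of them relevant; 50 relevant documents in the whole corpus.
precision, recall = precision_recall(30, 40, 50)
print(precision, recall)  # 0.75 0.6
```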
26- Tradeoff between recall and precision
- If we want to ensure that recall is high, just return a lot of documents. Then precision may be low. If we return 100% of the documents but only 50% of them are relevant, then recall is 1 but precision is 0.5.
- If we want a high chance that precision is high, just return the single document judged most relevant ("I'm feeling lucky" in Google). Then precision will (likely) be 1.0, but recall will be low.
- When do you want high precision? When do you want high recall?
27Bayesian approaches to knowledge representation and reasoning, Part 2 (Chapter 14, sections 1-4)
28- Recall the naive Bayes method.
- This can also be written in terms of a cause and its effects.
29(Figures) Naive Bayes model: a single cause node (Spam) with effect nodes v1agra, offer, and stock. Bayesian network: a network over Spam, v1agra, and stock.
30Each node has a conditional probability table that gives its dependencies on its parents. (Figure: the Spam / v1agra / stock network with its conditional probability tables.)
31Semantics of Bayesian networks
- If the network is correct, we can calculate the full joint probability distribution from the network: P(x1, ..., xn) = ∏i P(xi | parents(Xi))
- where parents(Xi) denotes the specific values of the parents of Xi.
- The sum over all entries (boxes) is 1. (A small sketch of this factorization follows below.)
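A small sketch of this factorization using the toy Spam network from slide 29, assuming edges Spam → v1agra and Spam → stock; the probability values are invented for illustration:

```python
from itertools import product

# Toy network: Spam -> v1agra, Spam -> stock (structure from slide 29; probabilities invented).
p_spam = 0.1
p_v1agra_given = {True: 0.5, False: 0.001}   # P(v1agra = true | Spam)
p_stock_given = {True: 0.4, False: 0.01}     # P(stock = true | Spam)

def joint(spam, v1agra, stock):
    """P(spam, v1agra, stock) = P(spam) * P(v1agra | spam) * P(stock | spam)."""
    p = p_spam if spam else 1 - p_spam
    p *= p_v1agra_given[spam] if v1agra else 1 - p_v1agra_given[spam]
    p *= p_stock_given[spam] if stock else 1 - p_stock_given[spam]
    return p

# The eight joint entries sum to 1, as the slide notes.
print(sum(joint(s, v, k) for s, v, k in product([True, False], repeat=3)))  # 1.0
```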
32Example from textbook
- I'm at work, neighbor John calls to say my alarm
is ringing, but neighbor Mary doesn't call.
Sometimes it's set off by minor earthquakes. Is
there a burglar? - Variables Burglary, Earthquake, Alarm,
JohnCalls, MaryCalls - Network topology reflects "causal" knowledge
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
33Example continued
34Complexity of Bayesian Networks
- For n random Boolean variables:
- Full joint probability distribution: 2^n entries
- Bayesian network with at most k parents per node:
- Each conditional probability table: at most 2^k entries
- Entire network: at most n × 2^k entries (a quick comparison follows below)
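For concreteness, with n = 30 Boolean variables and at most k = 5 parents per node (illustrative numbers only):

```python
n, k = 30, 5
full_joint_entries = 2 ** n      # 1,073,741,824 entries for the full joint distribution
bn_entries = n * 2 ** k          # at most 960 entries for the Bayesian network's tables
print(full_joint_entries, bn_entries)
```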
35Exact inference in Bayesian networks
- Query:
- What is P(Burglary | JohnCalls = true, MaryCalls = true)?
- Notation: capital letters denote distributions; lower-case letters are values or variables, depending on context.
- We have P(B | j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)
36Let's calculate this for b = (Burglary = true). Worst-case complexity is O(n 2^n), where n is the number of Boolean variables. We can simplify:
P(b | j, m) = α Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
(An enumeration sketch follows below.)
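A sketch of inference by enumeration for this query. The CPT values below are the ones commonly given for the textbook's burglary network and should be treated as assumptions; the structure of the computation is the point:

```python
from itertools import product

# Burglary network CPTs (commonly quoted textbook values; treat as assumptions).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(b, e, a, j, m):
    """Full joint via the network factorization P(b)P(e)P(a|b,e)P(j|a)P(m|a)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def posterior_burglary(j=True, m=True):
    """P(Burglary | j, m): sum out the hidden variables E and A, then normalize."""
    scores = {}
    for b in (True, False):
        scores[b] = sum(joint(b, e, a, j, m) for e, a in product((True, False), repeat=2))
    z = sum(scores.values())
    return {b: s / z for b, s in scores.items()}

print(posterior_burglary())  # roughly {True: 0.284, False: 0.716}
```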
37A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
38- Can speed up further via variable elimination.
- However, the bottom line on exact inference:
- In general, it's intractable (exponential in n).
- Solution:
- Approximate inference, by sampling.
39Bayesian approaches to knowledge representation and reasoning, Part 3 (Chapter 14, section 5)
40What are the advantages of Bayesian networks?
- Intuitive, concise representation of joint
probability distribution (i.e., conditional
dependencies) of a set of random variables. - Represents beliefs and knowledge about a
particular class of situations. - Efficient (?) (approximate) inference algorithms
- Efficient, effective learning algorithms
41Review of exact inference in Bayesian networks
General question: What is P(x | e)? Example question: What is P(c | r, w)?
42General question: What is P(x | e)?
43-49Event space (figures): the event space, with regions for the events Cloudy, Rain, Sprinkler, and Wet Grass built up one variable per slide.
50- Draw the expression tree for the enumeration of P(c | r, w).
- Worst-case complexity is exponential in n (the number of nodes).
- The problem is having to enumerate all possible values for many variables.
51Issues in Bayesian Networks
- Building / learning network topology
- Assigning / learning conditional probability
tables - Approximate inference via sampling
52Real-World Example 1: The Lumière Project at Microsoft Research
- Bayesian network approach to answering user queries about Microsoft Office.
- "At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently."
- "As an example, users working with the Excel spreadsheet might have required assistance with formatting a graph. Unfortunately, Excel has no knowledge about the common term, graph, and only considered in its keyword indexing the term chart."
54- Networks were developed by experts from user
modeling studies.
56- An offspring of the project was the Office Assistant in Office 97.
57Real-World Example 2: Diagnosing liver disorders with Bayesian networks
- Variables: disorder class (16 possibilities) plus 93 features from an existing database of patient records.
- Data: 600 patient records that used those features.
- Network structure designed by domain experts (30 hours).
58A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
59- Prior and conditional probability distributions were learned from the data in the liver-disorders database.
- Problem: the data doesn't give enough samples for good conditional probability estimates.
- For combinations of parent values that are not adequately sampled, assume a uniform distribution over those values.
60Results (table). Number of observations = number of evidence variables in the query; "window n" means that a classification is counted as correct if it is in the n most probable diagnoses given by the network for the given evidence values.
61Approximate inference in Bayesian networks
- Instead of enumerating all possibilities, sample
to estimate probabilities.
62Direct Sampling
- Suppose we have no evidence, but we want to determine P(c, s, r, w) for all c, s, r, w.
- Direct sampling:
- Sample each variable in topological order, conditioned on the values of its parents.
- I.e., always sample from P(Xi | parents(Xi)).
63Example
- Sample from P(Cloudy). Suppose it returns true.
- Sample from P(Sprinkler | Cloudy = true). Suppose it returns false.
- Sample from P(Rain | Cloudy = true). Suppose it returns true.
- Sample from P(WetGrass | Sprinkler = false, Rain = true). Suppose it returns true.
- The sampled event is ⟨true, false, true, true⟩. (A sampling sketch follows below.)
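A direct-sampling sketch for this network. The CPT values are textbook-style numbers and should be treated as assumptions:

```python
import random

# Sprinkler network CPTs (textbook-style values; treat as assumptions).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}                       # P(Sprinkler = true | Cloudy)
P_R = {True: 0.80, False: 0.20}                       # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,       # P(WetGrass = true | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.00}

def prior_sample(rng=random):
    """Sample each variable in topological order, conditioned on its parents' sampled values."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return c, s, r, w

# Estimate P(c, s, r, w) for one specific event by counting sample frequencies.
N = 100_000
target = (True, False, True, True)
count = sum(prior_sample() == target for _ in range(N))
print(count / N)  # exact value under these CPTs is 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```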
64- Suppose there are N total samples, and let NS(x1, ..., xn) be the observed frequency of the specific event x1, ..., xn; then P(x1, ..., xn) ≈ NS(x1, ..., xn) / N.
- With N samples and n nodes, the complexity is O(Nn).
- Problem 1: we need lots of samples to get good probability estimates.
- Problem 2: many samples are not realistic, i.e., have low likelihood.
65Likelihood weighting
- Now suppose we have evidence e, so the values of the evidence variables E are fixed.
- We want to estimate P(X | e).
- We need to sample X and Y, where Y is the set of non-evidence variables.
- Each sampled event is weighted by the likelihood that the event accords to the evidence.
- I.e., events in which the actual evidence appears unlikely are given less weight.
66- Example:
- Estimate P(Rain | Sprinkler = true, WetGrass = true).
- WeightedSample algorithm:
- Set the weight w = 1.0.
- Sample from P(Cloudy). Suppose it returns true.
- Sprinkler is an evidence variable with value true. Update the likelihood weight: w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
- The likelihood of the sprinkler being on is low when Cloudy is true, so this sample gets a lower weight.
- Sample from P(Rain | Cloudy = true). Suppose this returns true.
- WetGrass is an evidence variable with value true. Update the likelihood weight: w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.1 × 0.99 = 0.099.
- Return the event ⟨true, true, true, true⟩ with weight 0.099.
- The weight is low because Cloudy = true, so the sprinkler is unlikely to be on. (A sketch of the algorithm follows below.)
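A sketch of the WeightedSample loop for this query, using the same assumed sprinkler CPTs as the direct-sampling sketch above:

```python
import random
from collections import defaultdict

# Assumed sprinkler CPTs (same values as in the direct-sampling sketch above).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.00}

def weighted_sample(rng=random):
    """One weighted sample with evidence Sprinkler = true, WetGrass = true held fixed."""
    w = 1.0
    c = rng.random() < P_C          # sample Cloudy from its prior
    s = True                        # evidence variable: don't sample, multiply weight by P(s | c)
    w *= P_S[c]
    r = rng.random() < P_R[c]       # sample Rain given Cloudy
    w *= P_W[(s, r)]                # evidence WetGrass = true: multiply weight by P(w | s, r)
    return r, w

def likelihood_weighting(n=100_000):
    """Estimate P(Rain | Sprinkler = true, WetGrass = true) from the weighted counts."""
    totals = defaultdict(float)
    for _ in range(n):
        r, w = weighted_sample()
        totals[r] += w
    return totals[True] / (totals[True] + totals[False])

print(likelihood_weighting())   # roughly 0.32 under these CPTs
```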
69Problem with likelihood weighting
- As the number of evidence variables increases, performance degrades. This is because most samples will have very low weights, so the weighted estimate will be dominated by the small fraction of samples that accord more than an infinitesimal likelihood to the evidence.
70Markov Chain Monte Carlo Sampling
- One of the most common methods used in real applications.
- Uses the idea of the Markov blanket of a variable Xi:
- its parents, children, and children's parents.
- Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.
71- Proposition: a node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.
- Example.
- Need to show that Xi is conditionally independent of nodes outside its Markov blanket.
- Need to show why Xi can be conditionally dependent on its children's parents.
72Example: the proposition says B is conditionally independent of F given A, C, E. This can only be true if P(B | A, C, E, F) = P(B | A, C, E).
73Prove this. We know, by definition of conditional probability, that P(B | A, C, E, F) = P(A, B, C, E, F) / P(A, C, E, F). From the tree we have the network factorization of the joint.
74Thus the ratio simplifies.
75Now compute P(B | A, C, E) in the same way; the two expressions are equal. Q.E.D.
76Markov Chain Monte Carlo Sampling
- Start with a random sample of values for the variables (x1, ..., xn). This is the current state of the algorithm.
- Next state: randomly sample a value for one non-evidence variable Xi, conditioned on the current values of the variables in the Markov blanket of Xi.
77Example
- Query: What is P(Rain | Sprinkler = true, WetGrass = true)?
- MCMC:
- Random sample, with the evidence variables fixed: ⟨true, true, false, true⟩
- Repeat:
- Sample Cloudy, given the current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: ⟨false, true, false, true⟩
- Sample Rain, given the current values of its Markov blanket: Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the result is true. New state: ⟨false, true, true, true⟩.
78- Each sample contributes to the estimate for the query P(Rain | Sprinkler = true, WetGrass = true).
- Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false.
- Then the answer to the query is obtained by normalizing: ⟨20, 80⟩ → ⟨.20, .80⟩.
- Claim: the sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.
- A proof of the claim is on pp. 517-518. (A Gibbs-sampling sketch follows below.)
80Claim (again)
- Claim: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence.
- That is, for all variables Xi, the probability that the value xi of Xi appears in a sample is equal to P(xi | e).
81Proof of Claim (outline)
- First, give an example of a Markov chain.
- Now:
- Let x be a state, with x = (x1, ..., xn).
- Let q(x → x′) be the transition probability from state x to state x′.
- Let πt(x) be the probability that the system will be in state x after t time steps, starting from state x0.
- Let πt+1(x) be the probability that the system will be in state x after t+1 time steps, starting from state x0.
83- Definition:
- π is called the Markov process's stationary distribution if πt = πt+1 for all x.
- Defining equation for the stationary distribution: π(x′) = Σx π(x) q(x → x′) for all x′   (1)
- Result from Markov chain theory: given q, there is exactly one such stationary distribution π (assuming q is ergodic).
84- One way to satisfy equation (1) is π(x) q(x → x′) = π(x′) q(x′ → x) for all x, x′.
- This is called the property of detailed balance.
- Detailed balance implies stationarity (the one-line check is written out below).
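The one-line check: sum the detailed-balance equation over x and use the fact that the transition probabilities out of x′ sum to 1,

```latex
\sum_x \pi(x)\,q(x \to x') \;=\; \sum_x \pi(x')\,q(x' \to x) \;=\; \pi(x')\sum_x q(x' \to x) \;=\; \pi(x')
```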
85- Proof of claim:
- Show that the transition probability q(x → x′) defined by MCMC sampling satisfies the detailed balance equation, with a stationary distribution equal to P(x | e).
- Let Xi be the variable to be sampled, let e be the values of the evidence variables, and let Y be the other non-evidence variables.
- The current sample is x = (xi, y), with the evidence variable values e fixed.
- We have, by definition of the MCMC sampling algorithm: q(x → x′) = q((xi, y) → (xi′, y)) = P(xi′ | y, e)
86- Now, show that this transition probability produces detailed balance.
- We want to show: P(x | e) q(x → x′) = P(x′ | e) q(x′ → x).
88Speech Recognition (Section 15.6)
- Task: identify the sequence of words uttered by a speaker, given the acoustic signal.
- Uncertainty is introduced by noise, speaker error, variation in pronunciation, homonyms, etc.
- Thus speech recognition is viewed as a problem of probabilistic inference.
89Speech Recognition
- So far, we've looked at probabilistic reasoning in static environments.
- Speech: a time sequence of "static environments."
- Let X be the state variables (i.e., the set of non-evidence variables) describing the environment (e.g., the words said during time step t).
- Let E be the set of evidence variables (e.g., S = features of the acoustic signal).
90- The values of E and X, and their joint probability distribution, change over time:
- t = 1: X1, e1
- t = 2: X2, e2
- etc.
91- At each time step t, we want to compute P(Words | S).
- We know from Bayes rule: P(Words | S) = α P(S | Words) P(Words).
- P(S | Words), for all words, is a previously learned acoustic model.
- E.g., for each word, a probability distribution over phones, and for each phone, a probability distribution over acoustic signals (which can vary in pitch, speed, and volume).
- P(Words), for all words, is the language model, which specifies the prior probability of each utterance.
- E.g., a bigram model: the probability of each word following each other word. (A small bigram sketch follows below.)
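A tiny bigram language-model sketch; the corpus and the add-one smoothing are illustrative assumptions:

```python
from collections import Counter

corpus = [
    "can i have something to drink".split(),
    "i am thirsty".split(),
    "can i have a drink".split(),
]

# Count bigrams (w_{t-1}, w_t) and context occurrences, including a start-of-sentence marker.
bigrams, contexts = Counter(), Counter()
for sent in corpus:
    prev = "<s>"
    for w in sent:
        bigrams[(prev, w)] += 1
        contexts[prev] += 1
        prev = w

vocab = {w for sent in corpus for w in sent} | {"<s>"}

def p_bigram(prev, w):
    """P(w | prev) with add-one smoothing (assumed)."""
    return (bigrams[(prev, w)] + 1) / (contexts[prev] + len(vocab))

def p_words(words):
    """Prior probability of a word sequence under the bigram model."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= p_bigram(prev, w)
        prev = w
    return p

print(p_words("can i have a drink".split()))
```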
92- Speech recognition typically makes three assumptions:
- The process underlying change is itself stationary, i.e., the state transition probabilities don't change.
- The current state X depends on only a finite history of previous states (the Markov assumption).
- A Markov process of order n: the current state depends only on the n previous states.
- The values et of the evidence variables depend only on the current state Xt (the sensor model).
97Example: "I'm firsty, um, can I have something to dwink?"