Title: Bayesian approaches to knowledge representation and reasoning Part 1 Chapter 13
1Bayesian approaches to knowledge representation and reasoning, Part 1 (Chapter 13)
2Bayesianism vs. Frequentism
- Classical probability (frequentist view)
- The probability of a particular event is defined relative to its frequency in a sample space of events.
- E.g., the probability that the coin will come up heads on the next trial is defined relative to the frequency of heads in a sample space of coin tosses.
3- Bayesian probability
- Combines the degree of prior belief you have in a proposition with your subsequent observations of events.
- Example: a Bayesian can assign a probability to the statement "The first e-mail message ever written was not spam," but a frequentist cannot.
4Bayesian Knowledge Representation and Reasoning
- Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)
5- Bayesian terminology (example: spam recognition)
- Random variable X: returns one of a set of values x1, x2, ..., xm, or a continuous value in the interval [a, b], with probability distribution D(X).
- Data D = (v1, v2, v3, ...): the set of observed values of the random variables X1, X2, X3, ...
6- Hypothesis h: a function taking an instance j and returning a classification of j (e.g., spam or not spam).
- Space of hypotheses H: the set of all possible hypotheses.
7- Prior probability of h
- P(h): the probability that hypothesis h is true given our prior knowledge.
- If we have no prior knowledge, all h ∈ H are equally probable.
- Posterior probability of h
- P(h | D): the probability that hypothesis h is true, given the data D.
- Likelihood of D
- P(D | h): the probability that we will see data D, given that hypothesis h is true.
8Recall the definition of conditional probability: P(X | Y) = P(X ∧ Y) / P(Y)
Event space: all e-mail messages. X: all spam messages. Y: all messages containing the word v1agra.
9Bayes Rule: P(h | D) = P(D | h) P(h) / P(D)
10Example Using Bayes Rule
- Hypotheses
- h: message m is spam
- ¬h: message m is not spam
- Data
- D: message m contains "viagra"
- ¬D: message m does not contain "viagra"
- Prior probability: P(h) = 0.1, P(¬h) = 0.9
- Likelihood: P(D | h) = 0.6, P(¬D | h) = 0.4, P(D | ¬h) = 0.03, P(¬D | ¬h) = 0.97
11- P(D) = P(D | h) P(h) + P(D | ¬h) P(¬h) = 0.6 × 0.1 + 0.03 × 0.9 ≈ 0.09
- P(¬D) ≈ 0.91
- P(h | D) = P(D | h) P(h) / P(D) = (0.6 × 0.1) / 0.09 ≈ 0.67
- How would we learn these prior probabilities and likelihoods from past examples of spam and not spam? (A small numeric check of the arithmetic follows below.)
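The arithmetic above can be checked mechanically. A minimal Python sketch, using the prior and likelihood values from the slide (note the slide rounds P(D) to 0.09 before dividing):

```python
# Spam example from the slides: h = "message m is spam", D = "message m contains viagra".
p_h, p_not_h = 0.1, 0.9            # priors P(h), P(not h)
p_d_given_h = 0.6                  # likelihood P(D | h)
p_d_given_not_h = 0.03             # likelihood P(D | not h)

# Total probability of the data: P(D) = P(D | h) P(h) + P(D | not h) P(not h)
p_d = p_d_given_h * p_h + p_d_given_not_h * p_not_h   # 0.087, which the slide rounds to 0.09

# Bayes rule: P(h | D) = P(D | h) P(h) / P(D)
p_h_given_d = p_d_given_h * p_h / p_d                 # about 0.69 (0.67 with the slide's rounding)

print(f"P(D) = {p_d:.3f}, P(h | D) = {p_h_given_d:.2f}")
```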
12Full joint probability distribution (CORRECTED)
Notation: P(h, D) ≡ P(h ∧ D)
P(h ∧ D) = P(h | D) P(D), P(h ∧ ¬D) = P(h | ¬D) P(¬D), etc.
13- Now suppose there is a second feature examined: does the message contain the word "offer"?
P(m = spam, viagra, offer)
The size of the full joint distribution scales exponentially with the number of features.
14- Bayes optimal classifier for spam: classify m as argmax over h ∈ {spam, ¬spam} of P(h | f1, ..., fn) ∝ P(f1, ..., fn | h) P(h)
- where fi is a feature (here, could be a keyword)
- In general, intractable.
15- Classification using naive Bayes
- Assumes that all features are conditionally independent of one another given the class, so that P(f1, ..., fn | h) = ∏i P(fi | h).
- How do we learn the naive Bayes model from data?
- How do we apply the naive Bayes model to a new instance?
16Example Training and Using Naive Bayes for
Classification
- Features
- CAPS: Boolean (1 if the longest contiguous string of capitalized letters in the message is longer than 3, 0 otherwise)
- URL: Boolean (0 if no URL in the message, 1 if at least one URL in the message)
- $$$: Boolean (0 if "$$$" does not appear at least once in the message, 1 otherwise)
17- Training data
- M1: "DONT MISS THIS AMAZING OFFER $$$!" (spam)
- M2: "Dear mm, for more $$$, check this out: http://www.spam.com" (spam)
- M3: "I plan to offer two sections of CS 250 next year" (not spam)
- M4: "Hi Mom, I am a bit short on $$$ right now, can you send some ASAP? Love, me" (not spam)
18Training a Naive Bayes Classifier
- Two hypotheses: spam or not spam
- Estimate:
- P(spam) = .5, P(¬spam) = .5
- P(CAPS | spam) = .5, P(¬CAPS | spam) = .5
- P(URL | spam) = .5, P(¬URL | spam) = .5
- P($$$ | spam) = .75, P(¬$$$ | spam) = .25
- P(CAPS | ¬spam) = .5, P(¬CAPS | ¬spam) = .5
- P(URL | ¬spam) = .25, P(¬URL | ¬spam) = .75
- P($$$ | ¬spam) = .5, P(¬$$$ | ¬spam) = .5
19- m-estimate of probability (to fix cases where one of the terms in the product is 0); a common form is written out below.
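The slide itself does not show the formula, so take this form as an assumption: with n_c positive examples of the feature among n training examples of the class, a prior estimate p, and an equivalent sample size m,

```latex
\hat{P}(f \mid h) = \frac{n_c + m\,p}{n + m}
```

With m = 2 and p = 1/2 this reduces to Laplace smoothing, which appears consistent with the table on slide 18 (e.g., P($$$ | spam) = (2 + 1)/(2 + 2) = .75).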
20M4: "This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on http://www.spammers.com" (a new message to classify; see the sketch below)
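A runnable sketch of slides 16-20. The feature values assigned to M1-M4, and the use of the m-estimate with m = 2 and p = 1/2, are my reading of the slides rather than something the slides state explicitly:

```python
# Naive Bayes spam classifier over three Boolean features: CAPS, URL, $$$.
# Feature vectors for the four training messages (M1-M4) and their labels.
train = [
    ({"CAPS": 1, "URL": 0, "$$$": 1}, "spam"),      # M1
    ({"CAPS": 0, "URL": 1, "$$$": 1}, "spam"),      # M2
    ({"CAPS": 0, "URL": 0, "$$$": 0}, "not spam"),  # M3
    ({"CAPS": 1, "URL": 0, "$$$": 1}, "not spam"),  # M4 ("ASAP" makes CAPS true)
]
features = ["CAPS", "URL", "$$$"]
classes = ["spam", "not spam"]
m, p = 2, 0.5  # m-estimate parameters (assumed)

# Class priors and per-class feature probabilities P(f = 1 | class), m-estimate smoothed.
prior = {c: sum(1 for _, y in train if y == c) / len(train) for c in classes}
cond = {}
for c in classes:
    n = sum(1 for _, y in train if y == c)
    for f in features:
        n_c = sum(x[f] for x, y in train if y == c)
        cond[(f, c)] = (n_c + m * p) / (n + m)

def classify(x):
    """Return the class maximizing P(class) * prod_f P(f = x[f] | class)."""
    scores = {}
    for c in classes:
        score = prior[c]
        for f in features:
            p1 = cond[(f, c)]
            score *= p1 if x[f] else (1 - p1)
        scores[c] = score
    return max(scores, key=scores.get), scores

# New message from slide 20: it has long capitalized strings, a URL, and $$$.
print(classify({"CAPS": 1, "URL": 1, "$$$": 1}))
```

Running this reproduces the table on slide 18 and classifies the new message as spam (score roughly 0.094 vs. 0.031).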
21Information Retrieval
- Most important concepts
- Defining features of a document
- Indexing documents according to features
- Retrieving documents in response to a query
- Ordering retrieved documents by relevance
- Early search engines
- Features: the list of all terms (keywords) in the document (minus stop words such as "a", "the", etc.)
- Indexing by keyword
- Retrieval by keyword match with the query
- Ordering by number of keywords matched
- Problems with this approach
22Naive Bayesian Document retrieval
- Let D be a document (a bag of words), Q be a query (a bag of words), and r be the event that D is relevant to Q.
- In document retrieval, we want to compute P(r | D, Q).
- Or, the odds ratio P(r | D, Q) / P(¬r | D, Q).
- In the book, they show (via a lot of algebra) that this odds ratio can be rewritten in terms of P(Q | D, r) and P(Q | D, ¬r) (see the next slide).
- Chain rule: P(A, B) = P(A | B) P(B)
24Naive Bayesian Document retrieval
- P(Q | D, r) = ∏j P(Qj | D, r), where Qj is the jth keyword in the query.
- The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document.
- How to learn these probabilities? (A scoring sketch follows below.)
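A minimal scoring sketch for this slide. The add-one smoothing over document term counts is an assumed estimation scheme, not the book's:

```python
from collections import Counter

def p_keyword_given_doc(word, doc_words, vocab_size):
    """Estimate P(Q_j | D, r) with add-one smoothing over the document's term counts (assumed scheme)."""
    counts = Counter(doc_words)
    return (counts[word] + 1) / (len(doc_words) + vocab_size)

def query_likelihood(query_words, doc_words, vocab_size):
    """P(Q | D, r) approximated as the product over the query's keywords."""
    score = 1.0
    for w in query_words:
        score *= p_keyword_given_doc(w, doc_words, vocab_size)
    return score

doc = "bayesian networks for document retrieval and ranking".split()
print(query_likelihood("bayesian retrieval".split(), doc, vocab_size=1000))
```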
25Evaluating Information Retrieval Systems
- Precision and Recall
- Example: out of a corpus of 100 documents, a query returns a results set of 40 documents, 30 of which are relevant; the corpus contains 50 relevant documents in all.
- Precision: the fraction of the results set that is relevant = 30/40 = .75. (How precise is the results set?)
- Recall: the fraction of the relevant documents in the whole corpus that are in the results set = 30/50 = .60. (How many of the relevant documents were recalled? See the short computation below.)
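The two numbers from the example, computed directly:

```python
def precision_recall(relevant_retrieved, retrieved, relevant_total):
    """Precision = relevant retrieved / retrieved; recall = relevant retrieved / all relevant."""
    return relevant_retrieved / retrieved, relevant_retrieved / relevant_total

# Results set of 40 documents, 30 of them relevant; 50 relevant documents in the whole corpus.
precision, recall = precision_recall(30, 40, 50)
print(precision, recall)  # 0.75 0.6
```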
26- Tradeoff between recall and precision
- If we want to ensure that recall is high, just return a lot of documents. Then precision may be low. If we return 100% of the documents but only 50% of them are relevant, then recall is 1 but precision is 0.5.
- If we want a high chance that precision is high, just return the single document judged most relevant ("I'm feeling lucky" in Google). Then precision will (likely) be 1.0, but recall will be low.
- When do you want high precision? When do you want high recall?
27Bayesian approaches to knowledge representation and reasoning, Part 2 (Chapter 14, sections 1-4)
28- Recall the naive Bayes method.
- This can also be written in terms of a cause and its effects.
29(Figures) Naive Bayes model: a single cause node (Spam) with effect nodes v1agra, offer, and stock. Bayesian network: a network over Spam, v1agra, and stock.
30Each node has a conditional probability table that gives its dependencies on its parents. (Figure: the Spam / v1agra / stock network with its conditional probability tables.)
31Semantics of Bayesian networks
- If the network is correct, we can calculate the full joint probability distribution from the network: P(x1, ..., xn) = ∏i P(xi | parents(Xi))
- where parents(Xi) denotes the specific values of the parents of Xi.
- The sum over all entries (boxes) is 1. (A small sketch of this factorization follows below.)
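A small sketch of this factorization using the toy Spam network from slide 29, assuming edges Spam → v1agra and Spam → stock; the probability values are invented for illustration:

```python
from itertools import product

# Toy network: Spam -> v1agra, Spam -> stock (structure from slide 29; probabilities invented).
p_spam = 0.1
p_v1agra_given = {True: 0.5, False: 0.001}   # P(v1agra = true | Spam)
p_stock_given = {True: 0.4, False: 0.01}     # P(stock = true | Spam)

def joint(spam, v1agra, stock):
    """P(spam, v1agra, stock) = P(spam) * P(v1agra | spam) * P(stock | spam)."""
    p = p_spam if spam else 1 - p_spam
    p *= p_v1agra_given[spam] if v1agra else 1 - p_v1agra_given[spam]
    p *= p_stock_given[spam] if stock else 1 - p_stock_given[spam]
    return p

# The eight joint entries sum to 1, as the slide notes.
print(sum(joint(s, v, k) for s, v, k in product([True, False], repeat=3)))  # 1.0
```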
32Example from textbook
- I'm at work, neighbor John calls to say my alarm
is ringing, but neighbor Mary doesn't call.
Sometimes it's set off by minor earthquakes. Is
there a burglar? - Variables Burglary, Earthquake, Alarm,
JohnCalls, MaryCalls - Network topology reflects "causal" knowledge
- A burglar can set the alarm off
- An earthquake can set the alarm off
- The alarm can cause Mary to call
- The alarm can cause John to call
33Example continued
34Complexity of Bayesian Networks
- For n random Boolean variables:
- Full joint probability distribution: 2^n entries
- Bayesian network with at most k parents per node:
- Each conditional probability table: at most 2^k entries
- Entire network: at most n × 2^k entries (a quick comparison follows below)
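For concreteness, with n = 30 Boolean variables and at most k = 5 parents per node (illustrative numbers only):

```python
n, k = 30, 5
full_joint_entries = 2 ** n      # 1,073,741,824 entries for the full joint distribution
bn_entries = n * 2 ** k          # at most 960 entries for the Bayesian network's tables
print(full_joint_entries, bn_entries)
```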
35Exact inference in Bayesian networks
- Query:
- What is P(Burglary | JohnCalls = true, MaryCalls = true)?
- Notation: capital letters denote distributions; lower-case letters are values or variables, depending on context.
- We have P(B | j, m) = α P(B, j, m) = α Σe Σa P(B, e, a, j, m)
36Let's calculate this for b = (Burglary = true). Worst-case complexity is O(n 2^n), where n is the number of Boolean variables. We can simplify:
P(b | j, m) = α Σe Σa P(b) P(e) P(a | b, e) P(j | a) P(m | a) = α P(b) Σe P(e) Σa P(a | b, e) P(j | a) P(m | a)
(An enumeration sketch follows below.)
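A sketch of inference by enumeration for this query. The CPT values below are the ones commonly given for the textbook's burglary network and should be treated as assumptions; the structure of the computation is the point:

```python
from itertools import product

# Burglary network CPTs (commonly quoted textbook values; treat as assumptions).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94, (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}   # P(JohnCalls = true | Alarm)
P_M = {True: 0.70, False: 0.01}   # P(MaryCalls = true | Alarm)

def joint(b, e, a, j, m):
    """Full joint via the network factorization P(b)P(e)P(a|b,e)P(j|a)P(m|a)."""
    p = (P_B if b else 1 - P_B) * (P_E if e else 1 - P_E)
    p *= P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p *= P_J[a] if j else 1 - P_J[a]
    p *= P_M[a] if m else 1 - P_M[a]
    return p

def posterior_burglary(j=True, m=True):
    """P(Burglary | j, m): sum out the hidden variables E and A, then normalize."""
    scores = {}
    for b in (True, False):
        scores[b] = sum(joint(b, e, a, j, m) for e, a in product((True, False), repeat=2))
    z = sum(scores.values())
    return {b: s / z for b, s in scores.items()}

print(posterior_burglary())  # roughly {True: 0.284, False: 0.716}
```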
37A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
38- Can speed up further via variable elimination.
- However, the bottom line on exact inference:
- In general, it's intractable (exponential in n).
- Solution:
- Approximate inference, by sampling.
39Bayesian approaches to knowledge representation and reasoning, Part 3 (Chapter 14, section 5)
40What are the advantages of Bayesian networks?
- Intuitive, concise representation of joint
probability distribution (i.e., conditional
dependencies) of a set of random variables. - Represents beliefs and knowledge about a
particular class of situations. - Efficient (?) (approximate) inference algorithms
- Efficient, effective learning algorithms
41Review of exact inference in Bayesian networks
General question: What is P(x | e)? Example question: What is P(c | r, w)?
42General question: What is P(x | e)?
43-49Event space (figures): the event space, with regions for the events Cloudy, Rain, Sprinkler, and Wet Grass built up one variable per slide.
50- Draw the expression tree for the enumeration of P(c | r, w).
- Worst-case complexity is exponential in n (the number of nodes).
- The problem is having to enumerate all possible values for many variables.
51Issues in Bayesian Networks
- Building / learning network topology
- Assigning / learning conditional probability
tables - Approximate inference via sampling
52Real-World Example 1: The Lumière Project at Microsoft Research
- Bayesian network approach to answering user queries about Microsoft Office.
- "At the time we initiated our project in Bayesian information retrieval, managers in the Office division were finding that users were having difficulty finding assistance efficiently."
- "As an example, users working with the Excel spreadsheet might have required assistance with formatting a graph. Unfortunately, Excel has no knowledge about the common term, graph, and only considered in its keyword indexing the term chart."
54- Networks were developed by experts from user
modeling studies.
56- An offspring of the project was the Office Assistant in Office 97.
57Real-World Example 2: Diagnosing liver disorders with Bayesian networks
- Variables: disorder class (16 possibilities) plus 93 features from an existing database of patient records.
- Data: 600 patient records that used those features.
- Network structure designed by domain experts (30 hours).
58A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
59- Prior and conditional probability distributions were learned from the data in the liver-disorders database.
- Problem: the data doesn't give enough samples for good conditional probability estimates.
- For combinations of parent values that are not adequately sampled, assume a uniform distribution over those values.
60Results (table). Number of observations = number of evidence variables in the query; "window n" means that a classification is counted as correct if it is in the n most probable diagnoses given by the network for the given evidence values.
61Approximate inference in Bayesian networks
- Instead of enumerating all possibilities, sample
to estimate probabilities.
62Direct Sampling
- Suppose we have no evidence, but we want to determine P(c, s, r, w) for all c, s, r, w.
- Direct sampling:
- Sample each variable in topological order, conditioned on the values of its parents.
- I.e., always sample from P(Xi | parents(Xi)).
63Example
- Sample from P(Cloudy). Suppose it returns true.
- Sample from P(Sprinkler | Cloudy = true). Suppose it returns false.
- Sample from P(Rain | Cloudy = true). Suppose it returns true.
- Sample from P(WetGrass | Sprinkler = false, Rain = true). Suppose it returns true.
- The sampled event is ⟨true, false, true, true⟩. (A sampling sketch follows below.)
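A direct-sampling sketch for this network. The CPT values are textbook-style numbers and should be treated as assumptions:

```python
import random

# Sprinkler network CPTs (textbook-style values; treat as assumptions).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}                       # P(Sprinkler = true | Cloudy)
P_R = {True: 0.80, False: 0.20}                       # P(Rain = true | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.90,       # P(WetGrass = true | Sprinkler, Rain)
       (False, True): 0.90, (False, False): 0.00}

def prior_sample(rng=random):
    """Sample each variable in topological order, conditioned on its parents' sampled values."""
    c = rng.random() < P_C
    s = rng.random() < P_S[c]
    r = rng.random() < P_R[c]
    w = rng.random() < P_W[(s, r)]
    return c, s, r, w

# Estimate P(c, s, r, w) for one specific event by counting sample frequencies.
N = 100_000
target = (True, False, True, True)
count = sum(prior_sample() == target for _ in range(N))
print(count / N)  # exact value under these CPTs is 0.5 * 0.9 * 0.8 * 0.9 = 0.324
```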
64- Suppose there are N total samples, and let NS(x1, ..., xn) be the observed frequency of the specific event x1, ..., xn; then P(x1, ..., xn) ≈ NS(x1, ..., xn) / N.
- With N samples and n nodes, the complexity is O(Nn).
- Problem 1: we need lots of samples to get good probability estimates.
- Problem 2: many samples are not realistic, i.e., have low likelihood.
65Likelihood weighting
- Now suppose we have evidence e, so the values of the evidence variables E are fixed.
- We want to estimate P(X | e).
- We need to sample X and Y, where Y is the set of non-evidence variables.
- Each sampled event is weighted by the likelihood that the event accords to the evidence.
- I.e., events in which the actual evidence appears unlikely are given less weight.
66- Example:
- Estimate P(Rain | Sprinkler = true, WetGrass = true).
- WeightedSample algorithm:
- Set the weight w = 1.0.
- Sample from P(Cloudy). Suppose it returns true.
- Sprinkler is an evidence variable with value true. Update the likelihood weight: w ← w × P(Sprinkler = true | Cloudy = true) = 0.1.
- The likelihood of the sprinkler being on is low when Cloudy is true, so this sample gets a lower weight.
- Sample from P(Rain | Cloudy = true). Suppose this returns true.
- WetGrass is an evidence variable with value true. Update the likelihood weight: w ← w × P(WetGrass = true | Sprinkler = true, Rain = true) = 0.1 × 0.99 = 0.099.
- Return the event ⟨true, true, true, true⟩ with weight 0.099.
- The weight is low because Cloudy = true, so the sprinkler is unlikely to be on. (A sketch of the algorithm follows below.)
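A sketch of the WeightedSample loop for this query, using the same assumed sprinkler CPTs as the direct-sampling sketch above:

```python
import random
from collections import defaultdict

# Assumed sprinkler CPTs (same values as in the direct-sampling sketch above).
P_C = 0.5
P_S = {True: 0.10, False: 0.50}
P_R = {True: 0.80, False: 0.20}
P_W = {(True, True): 0.99, (True, False): 0.90, (False, True): 0.90, (False, False): 0.00}

def weighted_sample(rng=random):
    """One weighted sample with evidence Sprinkler = true, WetGrass = true held fixed."""
    w = 1.0
    c = rng.random() < P_C          # sample Cloudy from its prior
    s = True                        # evidence variable: don't sample, multiply weight by P(s | c)
    w *= P_S[c]
    r = rng.random() < P_R[c]       # sample Rain given Cloudy
    w *= P_W[(s, r)]                # evidence WetGrass = true: multiply weight by P(w | s, r)
    return r, w

def likelihood_weighting(n=100_000):
    """Estimate P(Rain | Sprinkler = true, WetGrass = true) from the weighted counts."""
    totals = defaultdict(float)
    for _ in range(n):
        r, w = weighted_sample()
        totals[r] += w
    return totals[True] / (totals[True] + totals[False])

print(likelihood_weighting())   # roughly 0.32 under these CPTs
```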
69Problem with likelihood weighting
- As the number of evidence variables increases, performance degrades. This is because most samples will have very low weights, so the weighted estimate will be dominated by the small fraction of samples that accord more than an infinitesimal likelihood to the evidence.
70Markov Chain Monte Carlo Sampling
- One of the most common methods used in real applications.
- Uses the idea of the Markov blanket of a variable Xi:
- its parents, children, and children's parents.
- Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.
71- Proposition: a node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.
- Example.
- Need to show that Xi is conditionally independent of nodes outside its Markov blanket.
- Need to show why Xi can be conditionally dependent on its children's parents.
72Example: the proposition says B is conditionally independent of F given A, C, E. This can only be true if P(B | A, C, E, F) = P(B | A, C, E).
73Prove this. We know, by definition of conditional probability, that P(B | A, C, E, F) = P(A, B, C, E, F) / P(A, C, E, F). From the tree we have the network factorization of the joint.
74Thus the ratio simplifies.
75Now compute P(B | A, C, E) in the same way; the two expressions are equal. Q.E.D.
76Markov Chain Monte Carlo Sampling
- Start with a random sample of values for the variables (x1, ..., xn). This is the current state of the algorithm.
- Next state: randomly sample a value for one non-evidence variable Xi, conditioned on the current values of the variables in the Markov blanket of Xi.
77Example
- Query: What is P(Rain | Sprinkler = true, WetGrass = true)?
- MCMC:
- Random sample, with the evidence variables fixed: ⟨true, true, false, true⟩
- Repeat:
- Sample Cloudy, given the current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: ⟨false, true, false, true⟩
- Sample Rain, given the current values of its Markov blanket: Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the result is true. New state: ⟨false, true, true, true⟩.
78- Each sample contributes to the estimate for the query P(Rain | Sprinkler = true, WetGrass = true).
- Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false.
- Then the answer to the query is obtained by normalizing: ⟨20, 80⟩ → ⟨.20, .80⟩.
- Claim: the sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.
- A proof of the claim is on pp. 517-518. (A Gibbs-sampling sketch follows below.)
80Claim (again)
- Claim: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence.
- That is, for all variables Xi, the probability that the value xi of Xi appears in a sample is equal to P(xi | e).
81Proof of Claim (outline)
- First, give an example of a Markov chain.
- Now:
- Let x be a state, with x = (x1, ..., xn).
- Let q(x → x′) be the transition probability from state x to state x′.
- Let πt(x) be the probability that the system will be in state x after t time steps, starting from state x0.
- Let πt+1(x) be the probability that the system will be in state x after t+1 time steps, starting from state x0.
83- Definition:
- π is called the Markov process's stationary distribution if πt = πt+1 for all x.
- Defining equation for the stationary distribution: π(x′) = Σx π(x) q(x → x′) for all x′   (1)
- Result from Markov chain theory: given q, there is exactly one such stationary distribution π (assuming q is ergodic).
84- One way to satisfy equation (1) is π(x) q(x → x′) = π(x′) q(x′ → x) for all x, x′.
- This is called the property of detailed balance.
- Detailed balance implies stationarity (the one-line check is written out below).
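The one-line check: sum the detailed-balance equation over x and use the fact that the transition probabilities out of x′ sum to 1,

```latex
\sum_x \pi(x)\,q(x \to x') \;=\; \sum_x \pi(x')\,q(x' \to x) \;=\; \pi(x')\sum_x q(x' \to x) \;=\; \pi(x')
```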
85- Proof of claim:
- Show that the transition probability q(x → x′) defined by MCMC sampling satisfies the detailed balance equation, with a stationary distribution equal to P(x | e).
- Let Xi be the variable to be sampled, let e be the values of the evidence variables, and let Y be the other non-evidence variables.
- The current sample is x = (xi, y), with the evidence variable values e fixed.
- We have, by definition of the MCMC sampling algorithm: q(x → x′) = q((xi, y) → (xi′, y)) = P(xi′ | y, e)
86- Now, show that this transition probability produces detailed balance.
- We want to show: P(x | e) q(x → x′) = P(x′ | e) q(x′ → x).
88Speech Recognition (Section 15.6)
- Task: identify the sequence of words uttered by a speaker, given the acoustic signal.
- Uncertainty is introduced by noise, speaker error, variation in pronunciation, homonyms, etc.
- Thus speech recognition is viewed as a problem of probabilistic inference.
89Speech Recognition
- So far, we've looked at probabilistic reasoning in static environments.
- Speech: a time sequence of "static environments."
- Let X be the state variables (i.e., the set of non-evidence variables) describing the environment (e.g., the words said during time step t).
- Let E be the set of evidence variables (e.g., S = features of the acoustic signal).
90- The values of E and X, and their joint probability distribution, change over time:
- t = 1: X1, e1
- t = 2: X2, e2
- etc.
91- At each time step t, we want to compute P(Words | S).
- We know from Bayes rule: P(Words | S) = α P(S | Words) P(Words).
- P(S | Words), for all words, is a previously learned acoustic model.
- E.g., for each word, a probability distribution over phones, and for each phone, a probability distribution over acoustic signals (which can vary in pitch, speed, and volume).
- P(Words), for all words, is the language model, which specifies the prior probability of each utterance.
- E.g., a bigram model: the probability of each word following each other word. (A small bigram sketch follows below.)
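A tiny bigram language-model sketch; the corpus and the add-one smoothing are illustrative assumptions:

```python
from collections import Counter

corpus = [
    "can i have something to drink".split(),
    "i am thirsty".split(),
    "can i have a drink".split(),
]

# Count bigrams (w_{t-1}, w_t) and context occurrences, including a start-of-sentence marker.
bigrams, contexts = Counter(), Counter()
for sent in corpus:
    prev = "<s>"
    for w in sent:
        bigrams[(prev, w)] += 1
        contexts[prev] += 1
        prev = w

vocab = {w for sent in corpus for w in sent} | {"<s>"}

def p_bigram(prev, w):
    """P(w | prev) with add-one smoothing (assumed)."""
    return (bigrams[(prev, w)] + 1) / (contexts[prev] + len(vocab))

def p_words(words):
    """Prior probability of a word sequence under the bigram model."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= p_bigram(prev, w)
        prev = w
    return p

print(p_words("can i have a drink".split()))
```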
92- Speech recognition typically makes three assumptions:
- The process underlying change is itself stationary, i.e., the state transition probabilities don't change.
- The current state X depends on only a finite history of previous states (the Markov assumption).
- A Markov process of order n: the current state depends only on the n previous states.
- The values et of the evidence variables depend only on the current state Xt (the sensor model).
97Example: "I'm firsty, um, can I have something to dwink?"