Bayesian approaches to knowledge representation and reasoning, Part 1 (Chapter 13)

1
Bayesian approaches to knowledge representation and reasoning
Part 1 (Chapter 13)
2
Bayesianism vs. Frequentism
  • Classical probability (frequentism)
  • The probability of a particular event is defined relative to its frequency in a sample space of events.
  • E.g., the probability that the coin will come up heads on the next trial is defined relative to the frequency of heads in a sample space of coin tosses.

3
  • Bayesian probability
  • Combines a measure of prior belief you have in a proposition with your subsequent observations of events.
  • Example: a Bayesian can assign a probability to the statement "The first e-mail message ever written was not spam," but a frequentist cannot.

4
Bayesian Knowledge Representation and Reasoning
  • Question: Given the data D and our prior beliefs, what is the probability that h is the correct hypothesis? (spam example)

5
  • Bayesian terminology (example: spam recognition)
  • Random variable X: returns one of a set of values x1, x2, ..., xm, or a continuous value in an interval [a, b], with probability distribution D(X).
  • Data D = {v1, v2, v3, ...}: a set of observed values of random variables X1, X2, X3, ...

6
  • Hypothesis h: a function taking an instance j and returning a classification of j (e.g., spam or not spam).
  • Space of hypotheses H: the set of all possible hypotheses.

7
  • Prior probability of h
  • P(h): probability that hypothesis h is true, given our prior knowledge.
  • If there is no prior knowledge, all h ∈ H are equally probable.
  • Posterior probability of h
  • P(h | D): probability that hypothesis h is true, given the data D.
  • Likelihood of D
  • P(D | h): probability that we will see data D, given that hypothesis h is true.

8
Recall the definition of conditional probability (stated below).
Event space: all e-mail messages. X: all spam messages. Y: all messages containing the word "v1agra".
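For reference, the standard definition of conditional probability used here (e.g., the probability that a message is spam given that it contains "v1agra"):

  P(X | Y) = P(X ∧ Y) / P(Y)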
9
Bayes Rule
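In the hypothesis/data notation used above, Bayes rule is:

  P(h | D) = P(D | h) P(h) / P(D)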
10
Example: Using Bayes Rule
  • Hypotheses
  • h: message m is spam
  • ¬h: message m is not spam
  • Data
  • +: message m contains "viagra"
  • −: message m does not contain "viagra"

Prior probabilities: P(h) = 0.1, P(¬h) = 0.9
Likelihoods: P(+ | h) = 0.6, P(− | h) = 0.4, P(+ | ¬h) = 0.03, P(− | ¬h) = 0.97
11
  • P(+) = P(+ | h) P(h) + P(+ | ¬h) P(¬h)
  •      = 0.6 × 0.1 + 0.03 × 0.9 = 0.087 ≈ 0.09
  • P(−) ≈ 0.91
  • P(h | +) = P(+ | h) P(h) / P(+)
  •          = 0.6 × 0.1 / 0.09 ≈ 0.67 (checked in the sketch below)
  • How would we learn these prior probabilities and likelihoods from past examples of spam and not spam?

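A minimal Python check of this arithmetic, using the priors and likelihoods given above. Note that the unrounded posterior is about 0.69; the 0.67 above comes from first rounding P(+) to 0.09.

```python
# Priors and likelihoods from the slides above.
p_h = 0.1                  # P(h): message is spam
p_not_h = 0.9              # P(not h)
p_plus_given_h = 0.6       # P(+ | h): spam contains "viagra"
p_plus_given_not_h = 0.03  # P(+ | not h)

# Total probability of seeing "viagra" in a message.
p_plus = p_plus_given_h * p_h + p_plus_given_not_h * p_not_h    # 0.087

# Bayes rule: posterior probability that the message is spam given "+".
p_h_given_plus = p_plus_given_h * p_h / p_plus                  # ~0.69

print(f"P(+) = {p_plus:.3f}, P(h | +) = {p_h_given_plus:.2f}")
```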
12
Full joint probability distribution (CORRECTED)
Notation: P(h, D) ≡ P(h ∧ D)
P(h ∧ +) = P(h | +) P(+),  P(h ∧ −) = P(h | −) P(−),  etc.
13
  • Now suppose a second feature is examined: does the message contain the word "offer"?

P(m = spam, viagra, offer)
The full joint distribution scales exponentially with the number of features.
14
  • Bayes optimal classifier for spam (a standard form is given below),
  • where fi is a feature (here, it could be a keyword).
  • In general, this is intractable.

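A standard way to write the classifier referred to above (a reconstruction; the slide's own formula does not appear in the transcript):

  classify(f1, ..., fn) = argmax over h in H of  P(h | f1, ..., fn) = argmax over h of  P(f1, ..., fn | h) P(h)

Evaluating P(f1, ..., fn | h) requires the full joint distribution over the features, which is why this is intractable in general.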
15
  • Classification using naive Bayes
  • Naive Bayes assumes that all features are conditionally independent of one another, given the class (see the formula below).
  • How do we learn the naive Bayes model from data?
  • How do we apply the naive Bayes model to a new instance?

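Under that conditional-independence assumption, the classifier factorizes into a product of per-feature terms (the standard naive Bayes form):

  classify(f1, ..., fn) = argmax over h of  P(h) P(f1 | h) P(f2 | h) ... P(fn | h)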
16
Example: Training and Using Naive Bayes for Classification
  • Features
  • CAPS: Boolean (1 if the longest contiguous string of capitalized letters in the message is longer than 3, 0 otherwise)
  • URL: Boolean (0 if no URL in the message, 1 if at least one URL in the message)
  • $: Boolean (0 if "$" does not appear at least once in the message, 1 otherwise)

17
  • Training data
  • M1: DONT MISS THIS AMAZING OFFER $$$ !  (spam)
  • M2: Dear mm, for more $$, check this out: http://www.spam.com  (spam)
  • M3: I plan to offer two sections of CS 250 next year  (not spam)
  • M4: Hi Mom, I am a bit short on $$ right now, can you send some $$ ASAP? Love, me  (not spam)

18
Training a Naive Bayes Classifier
  • Two hypotheses: spam or not spam
  • Estimates:
  • P(spam) = 0.5              P(¬spam) = 0.5
  • P(CAPS | spam) = 0.5       P(¬CAPS | spam) = 0.5
  • P(URL | spam) = 0.5        P(¬URL | spam) = 0.5
  • P($ | spam) = 0.75         P(¬$ | spam) = 0.25
  • P(CAPS | ¬spam) = 0.5      P(¬CAPS | ¬spam) = 0.5
  • P(URL | ¬spam) = 0.25      P(¬URL | ¬spam) = 0.75
  • P($ | ¬spam) = 0.5         P(¬$ | ¬spam) = 0.5

19
  • m-estimate of probability (to fix cases where one of the terms in the product is 0); the standard form is given below.

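The standard m-estimate, where n is the number of training examples of the class, nc the number of those with the given feature value, p a prior estimate of the probability (often uniform), and m the equivalent sample size:

  P(feature | class) ≈ (nc + m·p) / (n + m)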
20
  • Now classify a new message (a worked calculation follows below):

M5: This is a ONE-TIME-ONLY offer that will get you BIG $$$, just click on http://www.spammers.com

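A minimal Python sketch of the classification step, using the estimates from slide 18. The feature values for the new message (CAPS = 1, URL = 1, $ = 1) are my reading of it, not something stated in the transcript.

```python
# Conditional probability estimates from the training slide (slide 18).
# Each entry is P(feature = 1 | class); P(feature = 0 | class) = 1 - value.
likelihoods = {
    "spam":     {"CAPS": 0.5, "URL": 0.5,  "$": 0.75},
    "not spam": {"CAPS": 0.5, "URL": 0.25, "$": 0.5},
}
priors = {"spam": 0.5, "not spam": 0.5}

# Assumed feature vector for the new message: long caps run, a URL, and "$".
message = {"CAPS": 1, "URL": 1, "$": 1}

scores = {}
for cls in priors:
    score = priors[cls]
    for feature, value in message.items():
        p = likelihoods[cls][feature]
        score *= p if value == 1 else (1 - p)
    scores[cls] = score

# Normalize the scores to posteriors: spam ~ 0.75, not spam ~ 0.25.
total = sum(scores.values())
for cls, score in scores.items():
    print(f"P({cls} | message) = {score / total:.2f}")
```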
21
Information Retrieval
  • Most important concepts
  • Defining features of a document
  • Indexing documents according to features
  • Retrieving documents in response to a query
  • Ordering retrieved documents by relevance
  • Early search engines
  • Features: list of all terms (keywords) in the document (minus "a", "the", etc.)
  • Indexing by keyword
  • Retrieval by keyword match with query
  • Ordering by number of keywords matched
  • Problems with this approach

22
Naive Bayesian Document Retrieval
  • Let D be a document (bag of words), Q be a query (bag of words), and r be the event that D is relevant to Q.
  • In document retrieval, we want to compute P(r | D, Q),
  • or the odds ratio P(r | D, Q) / P(¬r | D, Q).
  • In the book, it is shown (via a lot of algebra) that this reduces to a product over the query's keywords (next slide).
  • Chain rule: P(A, B) = P(A | B) P(B)

24
Naive Bayesian Document Retrieval
  • where Qj is the jth keyword in the query.
  • The probability of a query given a relevant document D is estimated as the product of the probabilities of each keyword in the query, given the relevant document (see the formula below).
  • How do we learn these probabilities?

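Written out, the estimate described above is:

  P(Q | D, r) = ∏j P(Qj | D, r)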
25
Evaluating Information Retrieval Systems
  • Precision and Recall
  • Example: Out of a corpus of 100 documents, a query returns a result set of 40 documents, 30 of which are relevant; the corpus contains 50 relevant documents in total.
  • Precision: fraction of the result set that is relevant = 30/40 = 0.75. How precise is the result set? (General definitions are given below.)
  • Recall: fraction of the relevant documents in the whole corpus that are in the result set = 30/50 = 0.60. How many of the relevant documents were recalled?

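The standard definitions, for reference:

  precision = |relevant ∩ retrieved| / |retrieved|
  recall = |relevant ∩ retrieved| / |relevant|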
26
  • Tradeoff between recall and precision
  • If we want to ensure that recall is high, just return a lot of documents; then precision may be low. If we return 100% of the documents but only 50% are relevant, then recall is 1 but precision is 0.5.
  • If we want a high chance that precision is high, just return the single document judged most relevant ("I'm feeling lucky" in Google); then precision will (likely) be 1.0, but recall will be low.
  • When do you want high precision? When do you want high recall?

27
Bayesian approaches to knowledge representation and reasoning
Part 2 (Chapter 14, sections 1-4)
28
  • Recall Naive Bayes method
  • This can also be written in terms of cause and
    effect

29
Naive Bayes
[Diagram: naive Bayes as a graphical model, with a single cause node (Spam) and effect nodes for the features (v1agra, offer, stock).]
Bayesian network
[Diagram: a more general Bayesian network over Spam, v1agra, and stock.]
30
Each node has a conditional probability table that gives its dependencies on its parents.
[Diagram: the Spam / v1agra / stock network with a conditional probability table at each node.]
31
Semantics of Bayesian networks
  • If the network is correct, we can calculate the full joint probability distribution from the network (see the formula below),
  • where parents(Xi) denotes specific values of the parents of Xi.

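The factorization referred to here, in its standard form:

  P(x1, ..., xn) = ∏i P(xi | parents(Xi))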
The entries of the full joint distribution table sum to 1.
32
Example from textbook
  • I'm at work, neighbor John calls to say my alarm
    is ringing, but neighbor Mary doesn't call.
    Sometimes it's set off by minor earthquakes. Is
    there a burglar?
  • Variables Burglary, Earthquake, Alarm,
    JohnCalls, MaryCalls
  • Network topology reflects "causal" knowledge
  • A burglar can set the alarm off
  • An earthquake can set the alarm off
  • The alarm can cause Mary to call
  • The alarm can cause John to call

33
Example continued
34
Complexity of Bayesian Networks
  • For n random Boolean variables:
  • Full joint probability distribution: 2^n entries
  • Bayesian network with at most k parents per node:
  • Each conditional probability table: at most 2^k entries
  • Entire network: O(n · 2^k) entries

35
Exact inference in Bayesian networks
  • Query
  • What is P(Burglary | JohnCalls = true, MaryCalls = true)?
  • Notation: capital letters denote distributions; lower-case letters are values or variables, depending on context.
  • We have (see the expression below):

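For this network, the standard expansion of the query sums the full joint over the hidden variables Earthquake (e) and Alarm (a):

  P(B | j, m) = α Σe Σa P(B) P(e) P(a | B, e) P(j | a) P(m | a)

where α is a normalizing constant.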
36
Let's calculate this for b = (Burglary = true).
Worst-case complexity is O(n · 2^n), where n is the number of Boolean variables. We can simplify.
37
A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
38
  • Can speed up further via variable elimination.
  • However, the bottom line on exact inference:
  • In general, it's intractable (exponential in n).
  • Solution
  • Approximate inference, by sampling.

39
Bayesian approaches to knowledge representation and reasoning
Part 3 (Chapter 14, section 5)
40
What are the advantages of Bayesian networks?
  • Intuitive, concise representation of joint
    probability distribution (i.e., conditional
    dependencies) of a set of random variables.
  • Represents beliefs and knowledge about a
    particular class of situations.
  • Efficient (?) (approximate) inference algorithms
  • Efficient, effective learning algorithms

41
Review of exact inference in Bayesian networks
General question: What is P(x | e)?
Example question: What is P(c | r, w)?
42
General question: What is P(x | e)?
43
Event space
44
Event space
Cloudy
45
Event space
Cloudy
Rain
46
Event space
Sprinkler
Cloudy
Rain
47
Event space
Sprinkler
Cloudy
Wet Grass
Rain
48
(No Transcript)
49
Event space
Sprinkler
Cloudy
Wet Grass
Rain
50
  • Draw the expression tree for this query.
  • Worst-case complexity is exponential in n (the number of nodes).
  • The problem is having to enumerate all possibilities for many variables.

51
Issues in Bayesian Networks
  • Building / learning network topology
  • Assigning / learning conditional probability
    tables
  • Approximate inference via sampling

52
Real-World Example 1: The Lumière Project at Microsoft Research
  • Bayesian network approach to answering user
    queries about Microsoft Office.
  • At the time we initiated our project in Bayesian
    information retrieval, managers in the Office
    division were finding that users were having
    difficulty finding assistance efficiently.
  • As an example, users working with the Excel spreadsheet might have required assistance with formatting a graph. Unfortunately, Excel has no knowledge about the common term "graph," and only considered in its keyword indexing the term "chart."

53
(No Transcript)
54
  • Networks were developed by experts from user
    modeling studies.

55
(No Transcript)
56
  • An offspring of the project was the Office Assistant in Office 97.

57
Real-World Example 2: Diagnosing liver disorders with Bayesian networks
  • Variables: the disorder class (16 possibilities) plus 93 features from an existing database of patient records.
  • Data: 600 patient records, which used those features.
  • Network structure: designed by domain experts (30 hours).

58
A. Onisko et al., A Bayesian network model for
diagnosis of liver disorders
59
  • Prior and conditional probability distributions were learned from the data in the liver-disorders database.
  • Problem: the data doesn't give enough samples for good conditional probability estimates.
  • For combinations of parent values that are not adequately sampled, assume a uniform distribution over those values.

60
Results
"Number of observations" = number of evidence variables in the query. "Window n" means that a classification is counted as correct if it is among the n most probable diagnoses given by the network for the given evidence values.
61
Approximate inference in Bayesian networks
  • Instead of enumerating all possibilities, sample
    to estimate probabilities.

62
Direct Sampling
  • Suppose we have no evidence, but we want to determine P(c, s, r, w) for all c, s, r, w.
  • Direct sampling:
  • Sample each variable in topological order, conditioned on the values of its parents.
  • I.e., always sample from P(Xi | parents(Xi)).

63
Example
  • Sample from P(Cloudy). Suppose it returns true.
  • Sample from P(Sprinkler | Cloudy = true). Suppose it returns false.
  • Sample from P(Rain | Cloudy = true). Suppose it returns true.
  • Sample from P(WetGrass | Sprinkler = false, Rain = true). Suppose it returns true.
  • The sampled event is [true, false, true, true] (see the code sketch below).

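A minimal Python sketch of direct (prior) sampling on this network. The CPT numbers are the standard textbook values for the sprinkler example; they are an assumption, since the transcript itself does not list them.

```python
import random
from collections import Counter

# Assumed CPTs (textbook sprinkler network); each is P(var = True | parents).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}            # keyed by Cloudy
P_RAIN = {True: 0.8, False: 0.2}                 # keyed by Cloudy
P_WETGRASS = {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)

def bernoulli(p):
    """Return True with probability p."""
    return random.random() < p

def prior_sample():
    """Sample every variable in topological order, each given its parents."""
    c = bernoulli(P_CLOUDY)
    s = bernoulli(P_SPRINKLER[c])
    r = bernoulli(P_RAIN[c])
    w = bernoulli(P_WETGRASS[(s, r)])
    return c, s, r, w

# Estimate the full joint distribution by counting sampled events.
N = 100_000
counts = Counter(prior_sample() for _ in range(N))
event = (True, False, True, True)   # the event sampled by hand on this slide
print(f"Estimated P{event} = {counts[event] / N:.3f}")
```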
64
  • Suppose there are N total samples, and let NS(x1, ..., xn) be the observed frequency of the specific event x1, ..., xn; the resulting estimate is given below.
  • With N samples and n nodes, the complexity is O(N n).
  • Problem 1: We need lots of samples to get good probability estimates.
  • Problem 2: Many samples are not realistic (they have low likelihood).

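The direct-sampling estimate, which converges to the true joint probability as N grows:

  P(x1, ..., xn) ≈ NS(x1, ..., xn) / N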
65
Likelihood weighting
  • Now suppose we have evidence e, so the values of the evidence variables E are fixed.
  • We want to estimate P(X | e).
  • We need to sample X and Y, where Y is the set of non-evidence variables.
  • Each sampled event is weighted by the likelihood that the event accords to the evidence.
  • I.e., events in which the actual evidence appears unlikely should be given less weight.

66
  • Example
  • Estimate P(Rain | Sprinkler = true, WetGrass = true).
  • WeightedSample algorithm:
  • Set the weight w = 1.0.
  • Sample from Cloudy. Suppose it returns true.
  • Sprinkler is an evidence variable with value true, so update the likelihood weight (see below).
  • The likelihood of Sprinkler = true is low when Cloudy = true, so this sample gets a lower weight.

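The update multiplies the weight by the probability of the observed evidence value given its sampled parents. Assuming the textbook's CPT value P(Sprinkler = true | Cloudy = true) = 0.1 (the transcript does not list the CPTs):

  w ← w × P(Sprinkler = true | Cloudy = true) = 1.0 × 0.1 = 0.1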
67
  • Sample from P(Rain | Cloudy = true). Suppose this returns true.
  • WetGrass is an evidence variable with value true, so update the likelihood weight again, multiplying by P(WetGrass = true | Sprinkler = true, Rain = true).
  • Return the event [true, true, true, true] with weight 0.099 (a code sketch follows below).
  • The weight is low because Cloudy = true, so Sprinkler is unlikely to be true.

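A compact Python sketch of likelihood weighting for this query. As above, the CPT values are the standard textbook numbers and are an assumption on my part; with them, the hand-worked sample above gets weight 0.1 × 0.99 = 0.099.

```python
import random
from collections import defaultdict

# Assumed CPTs (textbook sprinkler network); each is P(var = True | parents).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}            # keyed by Cloudy
P_RAIN = {True: 0.8, False: 0.2}                 # keyed by Cloudy
P_WETGRASS = {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)

def weighted_sample():
    """One sample with evidence Sprinkler = true, WetGrass = true fixed."""
    w = 1.0
    c = random.random() < P_CLOUDY      # sample non-evidence variable Cloudy
    s = True                            # evidence: weight by its likelihood
    w *= P_SPRINKLER[c]
    r = random.random() < P_RAIN[c]     # sample non-evidence variable Rain
    w *= P_WETGRASS[(s, r)]             # evidence: WetGrass = true
    return r, w

# Weighted estimate of P(Rain | Sprinkler = true, WetGrass = true).
totals = defaultdict(float)
for _ in range(100_000):
    r, w = weighted_sample()
    totals[r] += w
norm = totals[True] + totals[False]
print(f"P(Rain = true | s, w) ~= {totals[True] / norm:.3f}")
```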
68
(No Transcript)
69
Problem with likelihood weighting
  • As the number of evidence variables increases, performance degrades. This is because most samples will have very low weights, so the weighted estimate will be dominated by the small fraction of samples that accord more than an infinitesimal likelihood to the evidence.

70
Markov Chain Monte Carlo Sampling
  • One of the most common methods used in real applications.
  • Uses the idea of the Markov blanket of a variable Xi:
  • its parents, children, and children's parents.
  • Recall that, by construction of a Bayesian network, a node is conditionally independent of its non-descendants, given its parents.

71
  • Proposition: A node Xi is conditionally independent of all other nodes in the network, given its Markov blanket.
  • Example:
  • We need to show that Xi is conditionally independent of nodes outside its Markov blanket.
  • We also need to show that Xi can be conditionally dependent on its children's parents.

72
Example: The proposition says that B is conditionally independent of F given A, C, E. This can only be true if P(B | A, C, E, F) = P(B | A, C, E).
73
Proof: By the definition of conditional probability, we know... From the tree, we have...
74
Thus
75
Now compute P(B | A, C, E). Thus, Q.E.D.
76
Markov Chain Monte Carlo Sampling
  • Start with a random sample of values for the variables, (x1, ..., xn). This is the current state of the algorithm.
  • Next state: randomly sample a value for one non-evidence variable Xi, conditioned on the current values of the variables in the Markov blanket of Xi.

77
Example
  • Query: What is P(Rain | Sprinkler = true, WetGrass = true)?
  • MCMC:
  • Start with a random sample, with the evidence variables fixed: [true, true, false, true].
  • Repeat:
  • Sample Cloudy, given the current values of its Markov blanket: Sprinkler = true, Rain = false. Suppose the result is false. New state: [false, true, false, true].
  • Sample Rain, given the current values of its Markov blanket: Cloudy = false, Sprinkler = true, WetGrass = true. Suppose the result is true. New state: [false, true, true, true]. (A code sketch of this loop follows below.)

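A small Python sketch of this Gibbs-style MCMC loop for the same query. The Markov-blanket sampling distributions are computed from assumed textbook CPT values (not given in the transcript); each non-evidence variable is resampled in turn given its blanket, with no burn-in for simplicity.

```python
import random

# Assumed CPTs (textbook sprinkler network); each is P(var = True | parents).
P_CLOUDY = 0.5
P_SPRINKLER = {True: 0.1, False: 0.5}            # keyed by Cloudy
P_RAIN = {True: 0.8, False: 0.2}                 # keyed by Cloudy
P_WETGRASS = {(True, True): 0.99, (True, False): 0.9,
              (False, True): 0.9, (False, False): 0.0}  # keyed by (Sprinkler, Rain)

def p(prob_true, value):
    """P(variable = value) given P(variable = True)."""
    return prob_true if value else 1.0 - prob_true

def sample_given_blanket(w_true, w_false):
    """Sample True/False from unnormalized weights."""
    return random.random() < w_true / (w_true + w_false)

# Evidence: Sprinkler = true, WetGrass = true. Initial state as on the slide.
s, w = True, True
c, r = True, False
rain_true_count = 0
N = 100_000
for _ in range(N):
    # Resample Cloudy given its Markov blanket {Sprinkler, Rain}:
    #   P(c | s, r) proportional to P(c) * P(s | c) * P(r | c)
    c = sample_given_blanket(
        P_CLOUDY * p(P_SPRINKLER[True], s) * p(P_RAIN[True], r),
        (1 - P_CLOUDY) * p(P_SPRINKLER[False], s) * p(P_RAIN[False], r))
    # Resample Rain given its Markov blanket {Cloudy, Sprinkler, WetGrass}:
    #   P(r | c, s, w) proportional to P(r | c) * P(w | s, r)
    r = sample_given_blanket(
        P_RAIN[c] * p(P_WETGRASS[(s, True)], w),
        (1 - P_RAIN[c]) * p(P_WETGRASS[(s, False)], w))
    rain_true_count += r

print(f"P(Rain = true | s, w) ~= {rain_true_count / N:.3f}")
```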
78
  • Each sample contributes to the estimate for the query P(Rain | Sprinkler = true, WetGrass = true).
  • Suppose we perform 100 such samples, 20 with Rain = true and 80 with Rain = false.
  • Then the answer to the query is: normalize(⟨20, 80⟩) = ⟨0.20, 0.80⟩.
  • Claim: The sampling process settles into a dynamic equilibrium in which the long-run fraction of time spent in each state is exactly proportional to its posterior probability, given the evidence.
  • A proof of the claim is on pp. 517-518.

79
(No Transcript)
80
Claim (again)
  • Claim: MCMC settles into behavior in which each state is sampled exactly according to its posterior probability, given the evidence.
  • That is, for all variables Xi, the probability of the value xi of Xi appearing in a sample is equal to P(xi | e).

81
Proof of Claim (outline)
  • First, give an example of a Markov chain.
  • Now:
  • Let x be a state, with x = (x1, ..., xn).
  • Let q(x → x′) be the transition probability from state x to state x′.
  • Let πt(x) be the probability that the system will be in state x after t time steps, starting from state x0.
  • Let πt+1(x) be the probability that the system will be in state x after t+1 time steps, starting from state x0.

82
  • We have

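The one-step update being referred to, summing over all predecessor states x:

  πt+1(x′) = Σx πt(x) q(x → x′)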
83
  • Definition: π is called the Markov process's stationary distribution if πt = πt+1 for all x. The defining equation for the stationary distribution is given below (equation 1).
  • Result from Markov chain theory: Given q, there is exactly one such stationary distribution π (assuming q is ergodic).
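Equation (1): the distribution is unchanged by one transition step,

  π(x′) = Σx π(x) q(x → x′)   for all x′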
84
  • One way to satisfy equation (1) is the property of detailed balance (see below).
  • Detailed balance implies stationarity.

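The detailed balance condition, and why it implies stationarity (sum both sides over x and use Σx q(x′ → x) = 1):

  π(x) q(x → x′) = π(x′) q(x′ → x)   for all x, x′

  Σx π(x) q(x → x′) = Σx π(x′) q(x′ → x) = π(x′) Σx q(x′ → x) = π(x′)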
85
  • Proof of claim:
  • Show that the transition probability q(x → x′) defined by MCMC sampling satisfies the detailed balance equation, with a stationary distribution equal to P(x | e).
  • Let Xi be the variable to be sampled, let e be the values of the evidence variables, and let Y be the other non-evidence variables.
  • The current sample is x = (xi, y), with the evidence variable values e fixed.
  • We have, by definition of the MCMC algorithm: q(x → x′) = q((xi, y) → (xi′, y)) = P(xi′ | y, e).

86
  • Now, show that this transition probability produces detailed balance.
  • We want to show: P(x | e) q(x → x′) = P(x′ | e) q(x′ → x).

87

88
Speech Recognition (Section 15.6)
  • Task Identify sequence of words uttered by
    speaker, given acoustic signal.
  • Uncertainty introduced by noise, speaker error,
    variation in pronunciation, homonyms, etc.
  • Thus speech recognition is viewed as problem of
    probabilistic inference.

89
Speech Recognition
  • So far, we've looked at probabilistic reasoning in static environments.
  • Speech: a time sequence of static environments.
  • Let X be the state variables (i.e., the set of non-evidence variables) describing the environment (e.g., the words said during time step t).
  • Let E be the set of evidence variables (e.g., S = features of the acoustic signal).

90
  • The E values and the joint probability distribution over X change over time:
  • t1: X1, e1
  • t2: X2, e2
  • etc.

91
  • At each t, we want to compute P(Words | S).
  • We know from Bayes rule (see below):
  • P(S | Words), for all words, is a previously learned acoustic model.
  • E.g., for each word, a probability distribution over phones, and for each phone, a probability distribution over acoustic signals (which can vary in pitch, speed, and volume).
  • P(Words), for all words, is the language model, which specifies the prior probability of each utterance.
  • E.g., a bigram model: the probability of each word following each other word.

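The Bayes-rule decomposition referred to above, with α a normalizing constant:

  P(Words | S) = α P(S | Words) P(Words)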
92
  • Speech recognition typically makes three assumptions:
  • The process underlying change is itself stationary,
  • i.e., the state transition probabilities don't change.
  • The current state X depends on only a finite history of previous states (the Markov assumption).
  • A Markov process of order n: the current state depends only on the n previous states.
  • The values et of the evidence variables depend only on the current state Xt (the sensor model).

93
(No Transcript)
94
(No Transcript)
95
(No Transcript)
96
(No Transcript)
97
Example: "I'm firsty, um, can I have something to dwink?"
98
(No Transcript)