Learning with uncertainty: Probability and Naïve Bayes (presentation transcript)
1
Learning with uncertainty: Probability and Naïve
Bayes
2
What is Uncertain Knowledge?
  • If a reasonable truth exists,
  • ∀x,y parent(x,y) ∧ male(x) ⇒ father(x,y)
  • the whole truth is a rare commodity!
  • ∀x ¬smoker(x) ∧ ¬drinker(x) ∧ practice-sport(x)
  • ⇒ healthy-life(x) ???
  • There is no certainty. The assumption is
    acceptable if x is a careful driver, does not
    work in a polluted atmosphere...
  • Avoiding smoking and drinking does not prevent
    diseases, but it is rational to think that it
    implies a healthy life.
  • We will talk later about degrees of belief.
  • The whole truth is not accessible in almost all
    real-world problems: medicine, weather
    forecasting, computer networks...
  • These domains involve many uncontrolled and
    unknown factors.

3
Summary
  • Representing uncertain knowledge: probabilities
  • Representing an entire system by the full joint
    probability distribution
  • Independence to simplify the world
  • Example
  • Bayesian inference and update

4
Handling Uncertain Knowledge
  • Consider the following rules:
  • ∀p drinker(p) ⇒ cirrhosis(p)
  • ∀p drinker(p) ∧ hepatitis(p) ⇒ cirrhosis(p)
  • ∀p drinker(p) ⇒ cirrhosis(p) ∨ car-accident(p)
    ∨ fight(p)
  • ∀p cirrhosis(p) ⇒ drinker(p)
  • First Order Predicate Logic fails in complex
    domains because of:
  • Laziness
  • It is impossible to list the complete set of
    antecedents or consequences.
  • Theoretical ignorance
  • E.g., medicine is not an exact science.
  • Practical ignorance
  • Even if all the general rules are known, a
    particular case may not be taken into account.

5
Probability and degree of belief
  • With uncertain knowledge one can, at best,
    provide a degree of belief in a sentence.
  • 0 = false, 1 = true, a value in (0,1) lies somewhere
    between true and false!
  • The best tool for dealing with degree of belief
    is probability theory.
  • Probability provides a way of summarizing
    uncertainty from laziness and ignorance.

6
Evidence
  • A statement such as "the probability that the
    patient has cirrhosis is 0.8" is based on the
    evidence received up to now.
  • prior (marginal) probability: before evidence is
    received.
  • posterior (conditional) probability: after
    evidence is received.

7
Prior (or marginal) Probability
  • Basic elements are Random Variables (RVs)
  • P(A=true): the prior probability that A is true,
  • assigned in the absence of any other information
    (only).
  • Ex:
  • for a die, P(Dice=4) = 1/6
  • At the very beginning of a poker game, for the
    first player, P(Card = 9♠) = 1/52, but this
    probability evolves as the cards are dealt.
  • Domain: the values an RV can take
  • Discrete: Weather ∈ {sunny, cloudy, rainy}
  • Continuous: X ∈ [0,3]

8
Probability Density Function
  • Probability Density Function p.d.f.(x)
  • Assigns a probability to every value of the domain
    of X
  • Discrete: P(X ≤ x) = Σt≤x p.d.f.(t)
  • Continuous: P(X ≤ x) = ∫t≤x p.d.f.(t) dt
  • Example
  • Discrete: p.d.f.(W=sunny) = 0.3, p.d.f.(W=cloudy) = 0.5,
    p.d.f.(W=rainy) = 0.2
  • Continuous: e.g. the normal distribution
  • [Figure: bell curve of p.d.f.(x) plotted against x]
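As a quick illustration (not part of the original slides), a minimal Python sketch of the discrete case, using the weather distribution above:

    # Discrete p.d.f. for the Weather RV from the slide.
    pdf = {"sunny": 0.3, "cloudy": 0.5, "rainy": 0.2}

    def prob(event):
        # P(Weather in event): sum the p.d.f. over the values in the event.
        return sum(pdf[v] for v in event)

    print(prob({"sunny"}))            # 0.3
    print(prob({"cloudy", "rainy"}))  # 0.7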
9
Conditional Probability
  • Conditional (Posterior) Probability
  • P(A|B) - probability of A given that all we know
    is B.
  • P(X|Y) - a two-dimensional table of
  • P(X=xi|Y=yj) for each i and j
  • Product Rule
  • P(A ∧ B) = P(A|B)P(B), or vice versa P(A ∧ B) =
    P(B|A)P(A)
  • P(X,Y) = P(X|Y)P(Y)
  • N.B. In probability notation
  • P(A and B) = P(A ∧ B) = P(A ∩ B) = P(A,B)

10
Kolmogorov's axioms
  • All probabilities are between 0 and 1:
  • 0 ≤ P(A) ≤ 1
  • Valid propositions have probability 1 and
    unsatisfiable propositions have probability 0:
  • P(true) = 1, P(false) = 0
  • The probability of a disjunction is given by
  • P(A ∨ B) = P(A or B) = P(A) + P(B) - P(A ∧ B)

11
Joint Probability Distribution
  • Completely specifies probability assignments to
    all propositions in the domain
  • Atomic Event
  • an assignment of particular values to all RVs in
    a domain
  • P(A,B,...) assigns probabilities to all possible
    atomic events.
  • With the joint probability distribution we can
    compute all the probabilities we want. Thus, we
    can answer all inference questions about the
    variables of the domain!

12
Marginalization with the Joint Probability
Distribution
  • Marginal probability
  • A marginal probability is obtained by summing (or
    integrating, more generally) the joint
    probability over the unwanted variables.
  • This is called marginalization: P(X) = Σz P(X,z)
  • In the following we will use P(a) = P(A=T) and
    P(¬a) = P(A=F)
  • P(a) = P(a,b,c) + P(a,b,¬c) + P(a,¬b,c) +
    P(a,¬b,¬c)
  • P(a) = 0.072 + 0.108 + 0.012 + 0.008
    = 0.2 → a marginal probability
  • it is the sum over all combinations of B and C
    values
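An illustrative Python sketch of this marginalization (not from the slides; the four entries with ¬a are an assumption, chosen so that P(b) = 0.34 as used on the next slide):

    from itertools import product

    # Full joint distribution P(A,B,C); the entries with A true come
    # from the slide, the entries with A false are assumed.
    joint = {
        (True, True, True): 0.072,   (True, True, False): 0.108,
        (True, False, True): 0.012,  (True, False, False): 0.008,
        (False, True, True): 0.016,  (False, True, False): 0.144,
        (False, False, True): 0.064, (False, False, False): 0.576,
    }

    # P(a): sum the joint over all combinations of B and C values.
    p_a = sum(joint[(True, b, c)] for b, c in product([True, False], repeat=2))
    print(p_a)  # 0.2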

13
Using the Joint Probability Distribution
  • Disjunction
  • P(a ∨ b) = P(a) + P(b) - P(a,b)
    = 0.2 + 0.34 - 0.18 = 0.36
  • Conditionality
  • P(b|a) = P(b,a) / P(a) = 0.18/0.2 = 0.9
  • What is the probability of c knowing A and B are
    true?
  • P(c|a,b) = P(c,a,b) / P(a,b) = 0.072/0.18 = 0.4
  • Conjunction
  • P(a,b) = P(a,b,c) + P(a,b,¬c) = 0.072 + 0.108 = 0.18
  • it is the sum over all combinations of C values
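Continuing the sketch above (it reuses the joint table defined there), the same table answers these queries:

    # Reusing the `joint` table from the previous sketch.
    def p(A=None, B=None, C=None):
        # Marginal probability of a partial assignment, e.g. p(A=True, B=True).
        return sum(pr for (a, b, c), pr in joint.items()
                   if (A is None or a == A) and (B is None or b == B)
                   and (C is None or c == C))

    print(p(A=True, B=True) / p(A=True))                  # P(b|a)    = 0.9
    print(p(A=True, B=True, C=True) / p(A=True, B=True))  # P(c|a,b)  = 0.4
    print(p(A=True) + p(B=True) - p(A=True, B=True))      # P(a or b) = 0.36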

14
The chain rule
  • The full joint distribution can be expressed by
    the chain rule:
  • P(X1,X2,X3,X4) = P(X1) P(X2|X1) P(X3|X1,X2)
    P(X4|X1,X2,X3)
  • since P(A|B) = P(A,B)/P(B)

15
Bayes Rule
  • From the Product Rule we know
  • P(A ∧ B) = P(A|B)P(B)
  • = P(B|A)P(A)
  • So P(A|B) = P(B|A)P(A) / P(B)
  • This is Bayes' rule
  • Why is this rule important?
  • It happens regularly that P(B|A), P(B) and P(A)
    are known but P(A|B) is not.
  • A doctor may know the probabilities of a
    symptom B and of a disease A, and the probability
    that A causes B.
  • Meaning of the RVs: A can be seen as a hypothesis
    and B as evidence (or data)
  • P(A) is the prior probability of the hypothesis
    (in the absence of any evidence)
  • P(B) is the probability of the evidence
  • P(B|A) is the likelihood that the evidence B was
    produced, given that the hypothesis was A
  • P(A|B) is the posterior probability of the
    hypothesis A, given that the evidence is B.
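A minimal sketch of Bayes' rule as a function; the numbers in the example call are made up:

    def posterior(likelihood, prior, evidence):
        # Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B).
        return likelihood * prior / evidence

    # Hypothetical values: P(B|A) = 0.9, P(A) = 0.01, P(B) = 0.1.
    print(posterior(0.9, 0.01, 0.1))  # 0.09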

16
Independence of RVs
  • The full joint distribution grows with the number
    of random variables
  • For n Boolean RVs the representation of the full
    joint distribution grows with O(2^n).
  • It becomes intractable when the number of RVs grows
  • Independence of the RVs is a convenient way of
    simplifying the computation of the full joint
    distribution
  • If A and B are independent then
  • P(A,B) = P(A) P(B)
  • P(A|B) = P(A)
  • So P(X1,...,Xn) = P(X1) P(X2) ... P(Xn)
  • and the full joint distribution grows with O(n).
  • However this assumption is rarely true!

17
Conditional Independence
  • Conditional Independence
  • an assumption to simplify the inference procedure
  • A and B are independent knowing C:
  • P(A,B|C) = P(A|C) P(B|C)
  • P(A|B,C) = P(A|C)
  • Example: medical diagnosis
  • P(flu|fever ∧ winter) = P(fever ∧ winter|flu) P(flu)
    / P(fever ∧ winter)
  • It would be convenient if fever and winter were
    independent, but they are not: during the winter
    one might catch the flu, which provokes the fever.
  • However, these variables are independent given the
    presence or the absence of the flu.

18
Conditional Independence - case study
  • We can check the conditional independence of A and B
    knowing C
  • P(a,b|c) = (1/16)/(4/16) = 1/4
  • P(a|c) × P(b|c) = (2/16)/(4/16) × (2/16)/(4/16)
    = 2/4 × 2/4 = 1/4
  • P(a,b|¬c) = (3/16)/(12/16) = 1/4
  • P(a|¬c) × P(b|¬c) = (4/16)/(12/16) × (9/16)/(12/16)
    = 4/12 × 9/12 = 1/4
  • we can verify this for every combination of A, B
    and C values
  • Since P(A,B|C) = P(A|C) × P(B|C), we can assume
    conditional independence
  • N.B. A and B are not independent: P(B,A) ≠ P(A) ×
    P(B)
  • P(b,a) = 4/16 = 1/4
  • P(a) × P(b) = 6/16 × 11/16 = 66/256 ≈ 0.2578 ≠ 1/4
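An illustrative check in Python; the joint counts (out of 16 observations) are reconstructed from the probabilities on the slide:

    from itertools import product

    # Joint counts over (A, B, C), out of 16 observations,
    # reconstructed from the slide's probabilities.
    counts = {
        (True, True, True): 1,   (True, False, True): 1,
        (False, True, True): 1,  (False, False, True): 1,
        (True, True, False): 3,  (True, False, False): 1,
        (False, True, False): 6, (False, False, False): 2,
    }

    def p(A=None, B=None, C=None):
        # Marginal probability of a partial assignment.
        return sum(n for (a, b, c), n in counts.items()
                   if (A is None or a == A) and (B is None or b == B)
                   and (C is None or c == C)) / 16

    # P(A,B|C) = P(A|C) x P(B|C) holds for every combination of values:
    for a, b, c in product([True, False], repeat=3):
        assert abs(p(A=a, B=b, C=c) / p(C=c)
                   - (p(A=a, C=c) / p(C=c)) * (p(B=b, C=c) / p(C=c))) < 1e-12

    # ...but A and B are not unconditionally independent:
    print(p(A=True, B=True), p(A=True) * p(B=True))  # 0.25 vs about 0.2578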

19
Conditional Independence - Consequence
  • Generalisation: given a Cause (C) and N Effects
    (E1,...,EN) conditionally independent given C, the
    chain rule gives
  • P(C,E1,...,EN) = P(C) P(E1|C) P(E2|C,E1) ...
    P(EN|C,E1,...,EN-1)
  • That can be rewritten as
  • P(C,E1,...,EN) = P(C) P(E1|C) P(E2|C) ... P(EN|C)
  • Thus, the full joint distribution table grows
    with O(n)
  • Conditional independence assertions allow
    probabilistic systems to scale up; moreover, they
    are much more commonly available than absolute
    independence assertions.
  • Such a probability model is called the naïve Bayes
    model. It is "naïve" because it is often used when
    the effects are NOT conditionally independent given
    the cause variable.

20
Bayesian Classification (1)
  • Let the set of classes be {c1, c2, ..., cn}
  • Let E be the description of an instance.
  • Determine the class of E by computing, for each
    ci:
  • P(ci|E) = P(E|ci) P(ci) / P(E)
  • P(E) can be determined since the classes are
    complete and disjoint: P(E) = Σi P(E|ci) P(ci)

21
Bayesian Classification (2)
  • Need:
  • Priors P(ci)
  • Conditionals P(E|ci)
  • P(ci) are easily estimated from data:
  • if ni of the examples in D are in ci, then
    P(ci) = ni / |D|
  • Assume an instance is a conjunction of m binary
    features: E = e1 ∧ e2 ∧ ... ∧ em
  • There are too many possible instances (exponential
    in m) to estimate all P(E|ci)
  • Naïve Bayes
  • If we assume the features of an instance are
    independent given the class ci (conditionally
    independent), so that P(E|ci) = Πj P(ej|ci),
  • we then only need to know P(ej|ci) for each
    feature and class, as in the sketch below.
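A small sketch of estimating these priors and conditionals from data; the toy dataset and feature names are invented for illustration:

    from collections import Counter, defaultdict

    # Toy training data: (features, class) pairs with binary features.
    data = [
        ({"voucher": True, "male": True}, "yes"),
        ({"voucher": False, "male": True}, "no"),
        ({"voucher": True, "male": False}, "yes"),
    ]

    # Priors: P(ci) = ni / |D|
    n = Counter(c for _, c in data)
    priors = {c: n_c / len(data) for c, n_c in n.items()}

    # Conditionals: P(ej=true|ci) = nij / ni
    n_ij = defaultdict(Counter)
    for features, c in data:
        for f, v in features.items():
            n_ij[c][f] += v  # True counts as 1, False as 0
    conditionals = {c: {f: n_ij[c][f] / n[c] for f in ("voucher", "male")}
                    for c in n}

    print(priors)        # {'yes': 0.667, 'no': 0.333} (approximately)
    print(conditionals)  # P(feature=true | class) for each class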

22
Naïve Bayes model - example
  • Example: medical diagnosis
  • P(fever,winter,flu)
  • = P(fever,winter|flu) P(flu) (product rule)
  • = P(fever|flu) P(winter|flu) P(flu) (conditional
    independence)
  • The original full joint distribution can be split
    into smaller pieces.
  • So for n Boolean symptoms conditionally
    independent given the disease, the
    representation of the full joint distribution
    grows with O(n) rather than O(2^n)

23
Naïve Bayes Classifier - Example
Probability Estimates
P(yes) = 9/14 = 0.64; P(no) = 5/14 = 0.36
P(voucher=yes|class=yes) = 3/9 = 0.33
P(voucher=yes|class=no) = 3/5 = 0.60
Similar values are computed for the other feature/value pairs.
Prediction
P(yes) × P(google|yes) × P(56k|yes) × P(male|yes) × P(voucher=yes|yes) = 0.0053
P(no) × P(google|no) × P(56k|no) × P(male|no) × P(voucher=yes|no) = 0.0206
Since 0.0206 > 0.0053, the instance is classified as "no".
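A sketch of the prediction step; the per-feature probabilities below are hypothetical values, chosen only to roughly reproduce the slide's totals:

    import math

    def nb_score(prior, conds):
        # Unnormalised naive Bayes score: P(c) x the product of P(ej|c).
        return prior * math.prod(conds)

    # Hypothetical per-class likelihoods for (google, 56k, male, voucher=yes):
    score_yes = nb_score(9 / 14, [0.33, 0.33, 0.23, 0.33])
    score_no = nb_score(5 / 14, [0.60, 0.40, 0.40, 0.60])
    print(round(score_yes, 4), round(score_no, 4))  # about 0.0053 and 0.0206
    print("no" if score_no > score_yes else "yes")  # predicted class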
24
Probability Estimates Smoothing
  • Normally, probabilities are estimated from the
    observed frequencies in the training data:
  • if D contains ni examples in class ci, and nij of
    these ni examples contain feature ej, then
    P(ej|ci) = nij / ni
  • However, estimating such probabilities from small
    training sets is error-prone.
  • To account for estimation from small samples,
    probability estimates are adjusted, or smoothed.
  • Laplace smoothing using an m-estimate assumes
    that each feature is given a prior probability,
    p, that is assumed to have been previously
    observed in a virtual sample of size m:
  • P(ej|ci) = (nij + m × p) / (ni + m)
  • For binary features, p is simply assumed to be
    0.5.
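A one-function sketch of the m-estimate above (the default m = 1 is an arbitrary choice for illustration):

    def m_estimate(n_ij, n_i, p=0.5, m=1.0):
        # Smoothed estimate of P(ej|ci): (nij + m*p) / (ni + m), where p is
        # the feature's prior (0.5 for binary features) and m is the size
        # of the virtual sample.
        return (n_ij + m * p) / (n_i + m)

    print(m_estimate(0, 3))  # 0.125: never observed in the class, yet nonzero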

25
Naïve Bayes - Issues
  • Posterior Probabilities
  • Classification results of Naïve Bayes (the class
    with maximum posterior probability) are usually
    fairly accurate.
  • However, due to the inadequacy of the conditional
    independence assumption, the actual
    posterior-probability numerical estimates are
    not.
  • Output probabilities are generally very close to
    0 or 1.
  • Underflow Prevention
  • Multiplying lots of probabilities, which are
    between 0 and 1 by definition, can result in
    floating-point underflow.
  • Since log(xy) = log(x) + log(y), it is better to
    perform all computations by summing the logs of
    the probabilities rather than multiplying the
    probabilities.
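A short sketch of log-space scoring:

    import math

    def log_score(prior, conds):
        # Sum the logs of the probabilities instead of multiplying them.
        return math.log(prior) + sum(math.log(c) for c in conds)

    # A product of 1000 probabilities of 0.1 underflows to 0.0 as a float,
    # but its log-score is perfectly representable:
    print(log_score(0.5, [0.1] * 1000))  # about -2303.3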

26
Hypothesis testing
  • What if we have more than one piece of evidence?
  • P(z|A,B,...,Y) = P(A,B,...,Y|z) P(z) / P(A,B,...,Y)
  • We know that P(A,B,...,Y) = Σi P(A,B,...,Y,zi) = Σi
    P(A,B,...,Y|zi) P(zi)
  • So,
  • P(z|A,B,...,Y) = α P(A,B,...,Y|z) P(z) by
    normalisation,
  • where α = 1/(Σi P(A,B,...,Y|zi) P(zi))
  • Example: a patient with a fever and who is shaking
  • The doctor wants to know whether she has the flu
    or not.
  • He knows P(fever,shaking|flu) = 0.8,
    P(fever,shaking|¬flu) = 0.01, P(flu) = 0.001,
    P(¬flu) = 1 - P(flu)
  • He doesn't know P(fever,shaking), thus
  • P(flu|fever,shaking) = P(fever,shaking|flu) ×
    P(flu) / (P(fever,shaking|flu) × P(flu) +
    P(fever,shaking|¬flu) × P(¬flu))
  • P(flu|fever,shaking) = 0.8 × 0.001 / (0.8 × 0.001
    + 0.01 × 0.999) ≈ 0.074
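A sketch of this normalisation in Python, with the values from the slide:

    def normalise(unnormalised):
        # P(zi|evidence) = alpha x P(evidence|zi) x P(zi), alpha = 1/sum.
        total = sum(unnormalised.values())
        return {z: v / total for z, v in unnormalised.items()}

    likelihood = {"flu": 0.8, "not-flu": 0.01}  # P(fever,shaking|z)
    prior = {"flu": 0.001, "not-flu": 0.999}    # P(z)
    posterior = normalise({z: likelihood[z] * prior[z] for z in likelihood})
    print(posterior["flu"])  # about 0.074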

27
Where do Probabilities come from?
  • Three approaches:
  • Frequentist:
  • probabilities come from experiments.
  • Objectivist:
  • probabilities are real aspects of the universe (a
    known model of the world).
  • Subjectivist:
  • probabilities characterise human beliefs, without
    external significance.