1
Bayesian Methods with Monte Carlo Markov Chains I
  • Henry Horng-Shing Lu
  • Institute of Statistics
  • National Chiao Tung University
  • hslu@stat.nctu.edu.tw
  • http://tigpbp.iis.sinica.edu.tw/courses.htm

2
Part 1: Introduction to Bayesian Methods
3
Bayes' Theorem
  • Conditional Probability
  • One Derivation
  • Alternative Derivation
  • http://en.wikipedia.org/wiki/Bayes'_theorem
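  Note: the formulas for this slide did not survive the transcript. A standard
  statement of Bayes' theorem, matching the Wikipedia derivation cited above, is

      P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
      \qquad
      P(B) = P(B \mid A)\,P(A) + P(B \mid A^{c})\,P(A^{c}).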

4
False Positive and Negative
  • Medical diagnosis
  • Type I and II Errors: hypothesis testing in
    statistical inference
  • http://en.wikipedia.org/wiki/False_positive

                                              Actual Status:                       Actual Status:
                                              Disease (H1)                         Normal (H0)
Diagnosis Test Result: Positive (Reject H0)   True Positive (Power, 1-β)           False Positive (Type I Error, α)
Diagnosis Test Result: Negative (Accept H0)   False Negative (Type II Error, β)    True Negative (Confidence Level, 1-α)
5
Bayesian Inference (1)
  • False positives in a medical test
  • Test accuracy by conditional probabilities
  • Prior probabilities

6
Bayesian Inference (2)
  • Posterior probabilities by Bayes' theorem

7
Bayesian Inference (3)
  • Equal Prior probabilities
  • Posterior probabilities by Bayes' theorem
  • http://en.wikipedia.org/wiki/Bayesian_inference
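  Note: the numeric values used on slides 5-7 were not preserved. As a minimal
  sketch in R, with purely hypothetical sensitivity, specificity and prevalence,
  the posterior probability of disease given a positive test follows directly
  from Bayes' theorem:

      # Hypothetical test characteristics (the slide's actual numbers were lost)
      sensitivity <- 0.99   # P(positive | disease)
      specificity <- 0.95   # P(negative | normal)
      prevalence  <- 0.001  # prior P(disease)

      # Total probability of a positive result, then Bayes' theorem
      p_positive <- sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
      posterior  <- sensitivity * prevalence / p_positive
      posterior   # about 0.019: the small prior dominates even an accurate test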

8
Bayesian Inference (4)
  • In the courtroom
  • and
  • Based on the evidence other than the DNA match,
    and
  • By Bayes' theorem,

9
Naive Bayes Classifier
  • Naive Bayes Classifier is a simple probabilistic
    classifier based on applying Bayes' theorem with
    strong (naive) independence assumptions.
  • http://en.wikipedia.org/wiki/Naive_Bayes_classifier

10
Naive Bayes Probabilistic Model (1)
  • The probability model for a classifier is a
    conditional model in which C is a dependent class
    variable and F1, ..., Fn are several feature variables.
  • By Bayes' theorem,
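  Note: the formula was lost in the transcript. In the notation of the cited
  Wikipedia article, it is

      p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}.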

11
Naive Bayes Probabilistic Model (2)
  • Use repeated applications of the definition of
    conditional probability,
  • and so forth.
  • Assume that each feature Fi is conditionally
    independent of every other feature Fj for j ≠ i;
    this means that

12
Naive Bayes Probabilistic Model (3)
  • So the joint model can be expressed as shown below,
  • and the conditional distribution over the class
    variable C can be expressed as shown below,
  • where Z is a constant if the values of the feature
    variables are known.
  • Constructing a classifier from the probability
    model:
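  Note: the expressions referred to above were lost. Under the
  conditional-independence assumption of the previous slide they take the
  standard form

      p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C),
      \qquad
      \mathrm{classify}(f_1, \dots, f_n) = \arg\max_{c}\; p(C = c) \prod_{i=1}^{n} p(F_i = f_i \mid C = c),

  where Z = p(F_1, ..., F_n) plays the role of the constant mentioned above.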

13
Bayesian Spam Filtering (1)
  • Bayesian spam filtering, a form of e-mail
    filtering, is the process of using a Naive Bayes
    classifier to identify spam email.
  • References:
  • http://en.wikipedia.org/wiki/Spam_%28e-mail%29
    http://en.wikipedia.org/wiki/Bayesian_spam_filtering
    http://www.gfi.com/whitepapers/why-bayesian-filtering.pdf

14
Bayesian Spam Filtering (2)
  • Probabilistic model
  • where "words" means certain words occurring in spam
    emails.
  • Particular words have particular probabilities of
    occurring in spam emails and in legitimate
    emails. For instance, most email users will
    frequently encounter the word "Viagra" in spam
    emails, but will seldom see it in other emails.

15
Bayesian Spam Filtering (3)
  • Before mails can be filtered using this method,
    the user needs to generate a database of words
    and tokens (such as the "$" sign, IP addresses and
    domains, and so on), collected from a sample of
    spam mails and valid mails.
  • After the database is generated, each word in the
    email contributes to the email's spam probability.
    This contribution is called the posterior
    probability and is computed using Bayes' theorem.

16
Bayesian Spam Filtering (4)
  • Then, the email's spam probability is computed
    over all words in the email, and if the total
    exceeds a certain threshold (say 95%), the filter
    will mark the email as spam.
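  A minimal R sketch of this scoring step, with hypothetical word probabilities
  and prior (a real filter's database would supply these values):

      # Hypothetical per-word probabilities learned from a training corpus
      p_word_spam <- c(viagra = 0.80, offer = 0.60, meeting = 0.05)  # P(word | spam)
      p_word_ham  <- c(viagra = 0.01, offer = 0.20, meeting = 0.55)  # P(word | legitimate)
      p_spam <- 0.5                                                  # assumed prior P(spam)

      words_in_email <- c("viagra", "offer")

      # Naive Bayes combination over the words in the email
      num <- p_spam * prod(p_word_spam[words_in_email])
      den <- num + (1 - p_spam) * prod(p_word_ham[words_in_email])
      posterior <- num / den
      posterior > 0.95   # mark as spam if the probability exceeds the threshold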

17
Bayesian Network (1)
  • A Bayesian network is a compact representation of
    probability distributions via conditional
    independence.
  • For example, a Bayesian network could represent
    the probabilistic relationships between diseases
    and symptoms.
  • http://en.wikipedia.org/wiki/Bayesian_network
    http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html
    http://www.cs.huji.ac.il/~nirf/Nips01-Tutorial/index.html

18
Bayesian Network (2)
  • Conditional independencies and a graphical language
    capture the structure of many real-world
    distributions
  • Graph structure provides much insight into the domain
  • Allows knowledge discovery

19
Bayesian Network (3)
  • Qualitative part
  • Directed acyclic graph (DAG)
  • Nodes - random variables
  • Edges - direct influence

Quantitative part: Set of conditional
probability distributions
Together: Define a unique distribution in a
factored form
20
Inference
  • Posterior probabilities
      Probability of any event given any evidence
  • Most likely explanation
      Scenario that explains evidence
  • Rational decision making
      Maximize expected utility
  • Value of Information
      Effect of intervention

21
Example 1 (1)
22
Example 1 (2)
  • By the chain rule of probability, the joint
    probability of all the nodes in the graph above
    is
  • By using conditional independence relationships,
    we can rewrite this as
  • where we were allowed to simplify the third term
    because R is independent of S given its parent C,
    and the last term because W is independent of C
    given its parents S and R.
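  Note: the two joint-probability expressions were lost in the transcript.
  Written out for the four nodes named in the text (C, S, R, W), they are

      P(C, S, R, W) = P(C)\,P(S \mid C)\,P(R \mid C, S)\,P(W \mid C, S, R)
                    = P(C)\,P(S \mid C)\,P(R \mid C)\,P(W \mid S, R).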

23
Example 1 (3)
  • Bayes' theorem:
  • The denominator is a normalizing constant, equal to
    the probability (likelihood) of the data.

24
Example 1 (4)
  • The posterior probability of each explanation:
  • So we see that it is more likely that the grass
    is wet because it is raining; the likelihood
    ratio of the two explanations favors rain (see the
    sketch below).
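  The conditional probability tables and the numeric likelihood ratio were not
  preserved in the transcript. The R sketch below performs the inference by
  enumeration; the CPT values are assumptions (they match the commonly used
  textbook version of this sprinkler network, e.g. in the Murphy tutorial
  linked earlier), so with different tables the numbers change but the
  computation is the same.

      # Assumed CPTs; C, S, R, W read as cloudy, sprinkler, rain, wet grass
      grid <- expand.grid(C = c(TRUE, FALSE), S = c(TRUE, FALSE),
                          R = c(TRUE, FALSE), W = c(TRUE, FALSE))

      p_C          <- function(c) ifelse(c, 0.5, 0.5)
      p_S_given_C  <- function(s, c) { p <- ifelse(c, 0.1, 0.5); ifelse(s, p, 1 - p) }
      p_R_given_C  <- function(r, c) { p <- ifelse(c, 0.8, 0.2); ifelse(r, p, 1 - p) }
      p_W_given_SR <- function(w, s, r) {
        p <- ifelse(s & r, 0.99, ifelse(s | r, 0.9, 0.0))
        ifelse(w, p, 1 - p)
      }

      # Joint distribution in the factored form P(C)P(S|C)P(R|C)P(W|S,R)
      grid$p <- with(grid, p_C(C) * p_S_given_C(S, C) * p_R_given_C(R, C) *
                           p_W_given_SR(W, S, R))

      p_W         <- sum(grid$p[grid$W])                 # P(W = TRUE)
      p_S_given_W <- sum(grid$p[grid$S & grid$W]) / p_W  # P(S = TRUE | W = TRUE)
      p_R_given_W <- sum(grid$p[grid$R & grid$W]) / p_W  # P(R = TRUE | W = TRUE)
      c(sprinkler = p_S_given_W, rain = p_R_given_W)     # rain explains the wet grass better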

25
Part 2: MLE vs. Bayesian Methods
26
Maximum Likelihood Estimates (MLEs) vs. Bayesian
Methods
  • Binomial Experiments: http://www.math.tau.ac.il/~nin/Courses/ML04/ml2.ppt
  • More Explanations and Examples:
  • http://www.dina.dk/phd/s/s6/learning2.pdf

27
MLE (1)
  • Binomial Experiments: suppose we toss a coin N
    times and record the outcome (heads or tails) of
    each toss.
  • We denote by θ the (unknown) probability of heads,
    P(H).
  • Estimation task:
  • Given a sequence of toss samples, we want to
    estimate the probabilities P(H) = θ and
    P(T) = 1 - θ.

28
MLE (2)
  • The number of heads we see has a binomial
    distribution,
  • and thus
  • Clearly, the MLE of θ is the observed proportion of
    heads, N_H / N, and is also equal to the MME
    (method of moments estimate) of θ.
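  Note: the formulas here were lost. With N_H heads and N_T tails out of
  N = N_H + N_T tosses, the standard binomial likelihood and its maximizer are

      L(\theta) = \binom{N}{N_H}\, \theta^{N_H} (1 - \theta)^{N_T},
      \qquad
      \hat{\theta}_{\mathrm{MLE}} = \frac{N_H}{N}.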

29
MLE (3)
  • Suppose we observe the sequence
  • H, H.
  • The MLE estimate is P(H) = 1, P(T) = 0.
  • Should we really believe that tails are
    impossible at this stage?
  • Such an estimate can have a disastrous effect.
  • If we assume that P(T) = 0, then we are willing to
    act as though this outcome is impossible.

30
Bayesian Reasoning
  • In Bayesian reasoning we represent our
    uncertainty about the unknown parameter θ by a
    probability distribution.
  • This probability distribution can be viewed as a
    subjective probability:
  • it is a personal judgment of uncertainty.

31
Bayesian Inference
  • Prior distribution: our beliefs about the values of
    θ before seeing data.
  • Likelihood: the probability of the binomial
    experiment's outcomes given a known value θ.
  • Given the data, we can compute the posterior
    distribution on θ.
  • The marginal likelihood is the normalizing constant
    (see below).
  • http://www.dina.dk/phd/s/s6/learning2.pdf
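  Note: writing D for the observed tosses, the formulas that were lost from
  this slide are usually written as

      P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)},
      \qquad
      P(D) = \int_0^1 P(D \mid \theta)\, P(\theta)\, d\theta .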

32
Binomial Example (1)
  • In the binomial experiment, the unknown parameter
    is θ.
  • The simplest prior for θ is the uniform prior on
    [0, 1].
  • Likelihood:
  • where k is the number of heads in the sequence.
  • Marginal Likelihood:
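  Note: spelled out (the slide's expressions were not preserved), with a
  uniform prior P(θ) = 1 on [0, 1] and k heads in N tosses,

      P(D \mid \theta) = \theta^{k} (1 - \theta)^{N - k},
      \qquad
      P(D) = \int_0^1 \theta^{k} (1 - \theta)^{N - k}\, d\theta
           = \frac{k!\,(N - k)!}{(N + 1)!},

  which is exactly the integral evaluated by the integration-by-parts
  recursion on the next two slides.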

33
Binomial Example (2)
  • Using integration by parts, we have
  • Multiplying both sides by the binomial coefficient
    (N choose k), we have

34
Binomial Example (3)
  • The recursion terminates when k = N.
  • Thus,
  • We conclude that the posterior is

35
Binomial Example (4)
  • How do we predict (estimate the probability of the
    next toss) using the posterior?
  • We can think of this as computing the probability
    of the next element in the sequence.
  • Assumption: if we know θ, the probability of the
    next toss is independent of the previous tosses.

36
Binomial Example (5)
  • Thus, we conclude that
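  Note: the concluding formula was lost. Under the uniform prior the standard
  answer is Laplace's rule of succession, P(next toss = H | D) = (k + 1)/(N + 2).
  A quick numerical check in R, using the hypothetical sequence H, H from the
  MLE slides (N = 2, k = 2):

      N <- 2; k <- 2   # hypothetical data: the sequence H, H

      kernel     <- function(theta) theta^k * (1 - theta)^(N - k)  # posterior kernel under a uniform prior
      marginal   <- integrate(kernel, 0, 1)$value
      predictive <- integrate(function(theta) theta * kernel(theta), 0, 1)$value / marginal

      c(numeric = predictive, laplace = (k + 1) / (N + 2))  # both 0.75; tails are no longer "impossible"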

37
Beta Prior (1)
  • The uniform prior distribution is a particular
    case of the Beta distribution. Its general form,
    written Beta(a, b) with a, b > 0, is given below.
  • The expected value of the parameter is a / (a + b).
  • The uniform prior is Beta(1, 1).
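  Note: the density itself (the formula was lost in the transcript) is

      P(\theta) = \frac{\Gamma(a + b)}{\Gamma(a)\,\Gamma(b)}\;
                  \theta^{a - 1} (1 - \theta)^{b - 1},
      \qquad 0 \le \theta \le 1 .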

38
Beta Prior (2)
  • There are important theoretical reasons for using
    the Beta prior distribution.
  • One of them also has important practical
    consequences: it is the conjugate distribution for
    binomial sampling.
  • If the prior is Beta(a, b) and we have observed
    some data with k and N - k cases for the two
    possible values of the variable, then the posterior
    is also Beta, with parameters (a + k, b + N - k).
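  A minimal R sketch of this conjugate update, with hypothetical prior
  parameters and counts:

      # Hypothetical prior Beta(a, b) and observed counts (k heads, N - k tails)
      a <- 2; b <- 2
      k <- 7; N <- 10

      post_a <- a + k          # posterior is Beta(a + k, b + (N - k))
      post_b <- b + (N - k)
      c(posterior_mean = post_a / (post_a + post_b),
        mle            = k / N)  # the prior pulls the estimate toward a / (a + b) = 0.5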

39
Beta Prior (3)
  • The expected value for the posterior
    distribution is (a + k) / (a + b + N).
  • The values a / (a + b) and b / (a + b) represent the
    prior probabilities for the values of the variable,
    based on our past experience.
  • The value a + b is called the equivalent sample
    size; it measures the importance of our past
    experience.
  • Larger values make the prior probabilities more
    important.

40
Beta Prior (4)
  • When the equivalent sample size is negligible, then
    we have maximum likelihood estimation.

41
Multinomial Experiments
  • Now, assume that we have a variable taking values
    on a finite set {x_1, ..., x_k} and we have a series
    of independent observations of this distribution,
    and we want to estimate the values p_i = P(x_i),
    i = 1, ..., k.
  • Let n_i be the number of cases in the sample in
    which we have obtained the value x_i.
  • The MLE of p_i is n_i / N, where N is the sample size.
  • The problems with small samples are completely
    analogous.

42
Dirichlet Prior (1)
  • We can also follow the Bayesian approach, but now
    the prior distribution is the Dirichlet
    distribution, a generalization of the Beta
    distribution to more than 2 cases.
  • The expression of the density is given below,
  • where s is the equivalent sample size.
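  Note: the expression was lost. Assuming the parameterization with parameters
  s·α_1, ..., s·α_k (where α_1 + ... + α_k = 1 and s is the equivalent sample
  size), the Dirichlet density over probabilities p_1, ..., p_k is

      D(p_1, \dots, p_k \mid s\alpha_1, \dots, s\alpha_k)
        = \frac{\Gamma(s)}{\prod_{i=1}^{k} \Gamma(s\alpha_i)}\;
          \prod_{i=1}^{k} p_i^{\,s\alpha_i - 1}.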

43
Dirichlet Prior (2)
  • The expected vector is (α_1, ..., α_k).
  • A greater value of s makes this distribution more
    concentrated around the mean vector.

44
Dirichlet Posterior
  • If we have a set of data with counts n_1, ..., n_k,
    then the posterior distribution is also
  • Dirichlet, with parameters given below.
  • The Bayesian estimates of the probabilities are
    also given below,
  • where N = n_1 + ... + n_k denotes the sample size.
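  Note: the lost expressions, assuming the same parameterization as above
  (prior parameters s·α_i with Σ α_i = 1 and equivalent sample size s), are

      \text{posterior} = \mathrm{Dirichlet}(s\alpha_1 + n_1, \dots, s\alpha_k + n_k),
      \qquad
      \hat{p}_i = \frac{n_i + s\,\alpha_i}{N + s}, \quad N = \sum_{i=1}^{k} n_i .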

45
Multinomial Example (1)
  • Imagine that we have an urn with balls of
    different colors: red (R), blue (B) and green (G),
    but in unknown quantities.
  • Assume that we pick up balls with replacement,
    obtaining a sequence of observed colors.

46
Multinomial Example (2)
  • If we assume a Dirichlet prior distribution with
    appropriate parameters, then we obtain estimated
    frequencies for red, blue and green.
  • Observe that green has a positive probability,
    even though it never appears in the sequence
    (see the sketch below).
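  A short R illustration with hypothetical prior parameters and hypothetical
  counts (the sequence on the slide was not preserved):

      # Hypothetical Dirichlet prior over (red, blue, green) and observed counts
      prior  <- c(R = 1, B = 1, G = 1)   # assumed uniform Dirichlet(1, 1, 1)
      counts <- c(R = 7, B = 3, G = 0)   # e.g. 10 draws in which green never appears

      estimate <- (prior + counts) / sum(prior + counts)
      round(estimate, 3)   # green still gets positive probability (1/13 here)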

47
Part 3: An Example in Genetics
48
Example 1 in Genetics (1)
  • Two linked loci with alleles A and a, and B and b
  • A, B: dominant
  • a, b: recessive
  • A double heterozygote AaBb will produce gametes
    of four types: AB, Ab, aB, ab

49
Example 1 in Genetics (2)
  • Probabilities for genotypes in gametes

            No Recombination   Recombination
  Male      1-r                r
  Female    1-r                r

            AB         ab         aB     Ab
  Male      (1-r)/2    (1-r)/2    r/2    r/2
  Female    (1-r)/2    (1-r)/2    r/2    r/2
50
Example 1 in Genetics (3)
  • Fisher, R. A. and Balmukand, B. (1928). The
    estimation of linkage from the offspring of
    selfed heterozygotes. Journal of Genetics, 20,
    79-92.
  • More:
  • http://en.wikipedia.org/wiki/Genetics
    http://www2.isye.gatech.edu/~brani/isyebayes/bank/handout12.pdf

51
Example 1 in Genetics (4)
                         MALE
                         AB (1-r)/2          ab (1-r)/2          aB r/2            Ab r/2
FEMALE   AB (1-r)/2      AABB (1-r)^2/4      aABb (1-r)^2/4      aABB r(1-r)/4     AABb r(1-r)/4
         ab (1-r)/2      AaBb (1-r)^2/4      aabb (1-r)^2/4      aaBb r(1-r)/4     Aabb r(1-r)/4
         aB r/2          AaBB r(1-r)/4       aabB r(1-r)/4       aaBB r^2/4        AabB r^2/4
         Ab r/2          AABb r(1-r)/4       aAbb r(1-r)/4       aABb r^2/4        AAbb r^2/4
52
Example 1 in Genetics (5)
  • Four distinct phenotypes:
  • AB, Ab, aB and ab.
  • A: the dominant phenotype from (Aa, AA, aA).
  • a: the recessive phenotype from aa.
  • B: the dominant phenotype from (Bb, BB, bB).
  • b: the recessive phenotype from bb.
  • AB: 9 gametic combinations.
  • Ab: 3 gametic combinations.
  • aB: 3 gametic combinations.
  • ab: 1 gametic combination.
  • Total: 16 combinations.

53
Example 1 in Genetics (6)
  • Let φ = (1 - r)²; then the phenotype probabilities
    can be read off the 16-cell table above.
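  Note: the slide's original notation was not preserved; with φ = (1 - r)² as
  above, the table gives

      P(AB) = \frac{2 + \varphi}{4}, \qquad
      P(Ab) = P(aB) = \frac{1 - \varphi}{4}, \qquad
      P(ab) = \frac{\varphi}{4}.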

54
Example 1 in Genetics (7)
  • Hence, a random sample of n offspring of selfed
    heterozygotes will follow a multinomial
    distribution over the four phenotype classes.
  • We know that
  • and
  • So

55
Bayesian for Example 1 in Genetics (1)
  • To simplify computation, we let
  • A random sample of n offspring of selfed
    heterozygotes will follow a multinomial
    distribution.

56
Bayesian for Example 1 in Genetics (2)
  • We assume a Dirichlet prior distribution with
    suitable parameters to estimate the
    probabilities for AB, Ab, aB and ab.
  • Recall that:
  • AB: 9 gametic combinations.
  • Ab: 3 gametic combinations.
  • aB: 3 gametic combinations.
  • ab: 1 gametic combination.
  • We consider

57
Bayesian for Example 1 in Genetics (3)
  • Suppose that we observe phenotype counts for the
    four classes.
  • Then the posterior distribution is also Dirichlet,
    with parameters obtained by adding the counts to
    the prior parameters.
  • The Bayesian estimates of the probabilities follow
    as in the sketch below.
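  A sketch of the update in R, with hypothetical observed counts and a prior
  proportional to the 9 : 3 : 3 : 1 gametic combinations recalled above (the
  slide's actual prior parameters and data were not preserved):

      prior  <- c(AB = 9, Ab = 3, aB = 3, ab = 1)       # assumed Dirichlet prior parameters
      counts <- c(AB = 125, Ab = 18, aB = 20, ab = 34)  # hypothetical phenotype counts

      posterior <- prior + counts              # Dirichlet posterior parameters
      estimate  <- posterior / sum(posterior)  # Bayesian (posterior mean) estimates
      round(estimate, 3)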

58
Bayesian for Example 1 in Genetics (4)
  • Consider the original model.
  • The random sample of n also follows a multinomial
    distribution.
  • We will assume a Beta prior distribution.

59
Bayesian for Example 1 in Genetics (5)
  • The posterior distribution becomes
  • The integration in the above denominator
  • does not have a closed form.

60
Bayesian for Example 1 in Genetics (6)
  • How do we solve this problem? The Monte Carlo
    Markov Chains (MCMC) method!
  • What value is appropriate for ?

61
Part 4: Monte Carlo Methods
62
Monte Carlo Methods (1)
  • Consider the game of solitaire: what's the chance
    of winning with a properly shuffled deck?
  • http://en.wikipedia.org/wiki/Monte_Carlo_method
  • http://nlp.stanford.edu/local/talks/mcmc_2004_07_01.ppt

Chance of winning is 1 in 4!
63
Monte Carlo Methods (2)
  • Hard to compute analytically because winning or
    losing depends on a complex procedure of
    reorganizing cards.
  • Insight: why not just play a few hands, and see
    empirically how many do in fact win?
  • More generally, can approximate a probability
    density function using only samples from that
    density.

64
Monte Carlo Methods (3)
  • Given a very large set and a distribution
  • over it.
  • We draw a set of i.i.d. random samples.
  • We can then approximate the distribution using
    these samples.
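  A minimal R illustration (the set and the distribution here are hypothetical):

      # Approximate a discrete distribution by the empirical frequencies of i.i.d. samples
      probs   <- c(a = 0.2, b = 0.5, c = 0.3)
      samples <- sample(names(probs), size = 100000, replace = TRUE, prob = probs)
      table(samples) / length(samples)   # close to the true probabilities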

65
Monte Carlo Methods (4)
  • We can also use these samples to compute
    expectations
  • And even use them to find a maximum

66
Monte Carlo Example
  • Let X1, ..., Xn be i.i.d. N(0,1); find E(X^4).
  • Solution:
  • Use the Monte Carlo method to approximate it:
  • > x <- rnorm(100000)  # 100000 samples from N(0,1)
  • > x <- x^4
  • > mean(x)
  • [1] 3.034175

67
Exercises
  • Write your own programs similar to those examples
    presented in this talk.
  • Write programs for those examples mentioned at
    the reference web pages.
  • Write programs for the other examples that you
    know.