Title: Introduction to Bayesian Methods
1. Introduction to Bayesian Methods
2. What is probability?
- Frequentist
  - the long-run frequency of an event when a random experiment is repeated infinitely many times
  - Examples
    1. the probability of heads in a coin-toss experiment: defined
    2. the probability that it will snow on 4/20/2009: not defined
    3. the probability of me getting cancer by 70 if smoking a pack of cigarettes a day: not defined
- Bayesian
  - the quantification of uncertainty (de Finetti)
  - this uncertainty can be equated with personal belief about a quantity
  - Examples
    1. belief (can be updated by observation)
    2. belief (can be refined by modeling)
    3. belief (as above)
3. What do Bayesians do?
- Bayesians study the uncertainty of a parameter θ, conditional on data X
- Bayesians use priors (non-experimental knowledge, knowledge from other sources, the state of ignorance, etc.)
  - A music expert distinguishes Mozart from Haydn.
  - A lady tastes tea: was the tea poured into the milk, or vice versa?
  - A man predicts coin tosses.
  - Each of them got it right all 10 times.
  - Their inferences may differ due to different prior beliefs.
4. Bayesian Machinery
- data X
- prior belief about θ  →  posterior belief about θ
- likelihood: L(θ) = P(X|θ)
- prior distribution p(θ)  →  posterior distribution p(θ|X)
5. Bayes Rule
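The body of this slide was not transcribed; in the notation of the surrounding slides, the rule it shows is:

  p(θ|X) = P(X|θ) p(θ) / P(X),   where P(X) = ∫ P(X|θ) p(θ) dθ

i.e. posterior ∝ likelihood × prior, with P(X) as the normalizing constant.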
6. Question: what is the proportion of A's in chromosome 2?
- Data: pieces of sequence from shotgun sequencing
- Parameter θ: proportion of A's in chr 2
7. Prior
- What is your prior belief about θ?
  - θ ~ U(0,1)
  - θ ~ Beta(4,12)
  - θ ~ Beta(20,60)
8. Beta Distribution for Prior of a Probability
- Beta(θ | α, β)
- mean = α/(α+β)
- var = mean(1−mean)/(1+α+β)
- Uniform = Beta(1,1)
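The moment formulas above can be checked in a few lines; the helper names here are just for illustration:

```python
# Mean and variance of a Beta(a, b) distribution, using the slide's formulas.
def beta_mean(a, b):
    return a / (a + b)

def beta_var(a, b):
    m = beta_mean(a, b)
    return m * (1 - m) / (1 + a + b)

# The Beta(4, 12) prior suggested on the previous slide:
print(beta_mean(4, 12))           # 0.25
print(round(beta_var(4, 12), 4))  # 0.011
```

Note that Beta(1,1) recovers the uniform prior, with mean 0.5.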
9. Likelihood
- Data: 13 A's out of a sequence of 50
- The model (for the data) determines the form of the likelihood. It is also a subjective choice.
- Binomial
10. Posterior
- prior (beta)
- likelihood (binomial)
- posterior (beta)
- posterior mean
- posterior 95% credible interval: (0.16, 0.37)
11. Posterior is a compromise between prior and data
- Prior mean: α/(α+β)
- Sample mean: y/n
- Posterior mean: (y+α)/(n+α+β)
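With the running example's numbers (Beta(4,12) prior, 13 A's in 50 bases), a short sketch shows the posterior mean sitting between the prior mean and the sample mean:

```python
# Posterior mean of the Beta-Binomial model: (y + a) / (n + a + b).
a, b = 4, 12    # Beta(4, 12) prior
y, n = 13, 50   # 13 A's out of 50 bases

prior_mean = a / (a + b)                # 0.25
sample_mean = y / n                     # 0.26
posterior_mean = (y + a) / (n + a + b)  # 17/66

print(round(posterior_mean, 3))         # 0.258, between prior and sample mean
```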
12. Conjugate Priors (pseudo-counts)
- Prior: θ ~ Beta(α, β)
- Likelihood: y|θ ∝ θ^y (1−θ)^(n−y)
- Posterior: θ|y ~ Beta(α+y, β+n−y)
- Conjugacy: the posterior distribution has the same parametric form as the prior.
- The Beta distribution is conjugate to the binomial likelihood.
13. Sequential update with conjugate prior
- θ ~ Beta(α, β)
- data: y1, n1−y1
- θ|y1 ~ Beta(α+y1, β+n1−y1)
- data: y2, n2−y2
- θ|y1,y2 ~ Beta(α+y1+y2, β+n1+n2−y1−y2)
- Informative conjugate priors act like pseudo-counts
14. Java applet demonstrating the updating of θ with data
- http://www.bolderstats.com/gallery/bayes/binomialRevision.html
15. Sequential update with conjugate prior
- θ ~ Beta(4,12)
- data: 13, 37
- θ|y1 ~ Beta(17, 49)
- data: 12, 38
- θ|y1,y2 ~ Beta(29, 87)
- θ|entire chr2 ~ Beta(71e6, 167e6)
- E(θ|chr2) = 0.30, sd(θ|chr2) ≈ 2.6e-5
- A large amount of data washes out the initial prior
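Sequential updating can be sketched directly from the update rule on slide 13: feeding the data in batches gives the same Beta posterior as pooling all the data at once.

```python
# Sequential vs. batch conjugate updating for the Beta-Binomial model.
def update(a, b, successes, failures):
    """One conjugate Beta update: add the counts to the pseudo-counts."""
    return a + successes, b + failures

# Beta(4, 12) prior, then the two batches of data above.
seq = update(*update(4, 12, 13, 37), 12, 38)
pooled = update(4, 12, 13 + 12, 37 + 38)

print(seq)            # (29, 87)
print(seq == pooled)  # True
```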
16. Ex. Multinomial Distribution
- Example: proportions of all 4 nucleotides in human chromosome 2.
- Prior belief about θ = (pA, pT, pG, pC)
- with the constraint pA + pT + pG + pC = 1
- The Dirichlet distribution is a generalization of the beta distribution, e.g. θ ~ Dir(4,4,4,4)
17. Likelihood
- Data: 13 A's, 24 T's, 6 G's, 7 C's
- The model (for the data) determines the form of the likelihood. It is also a subjective choice.
- The multinomial is a multi-class version of the binomial.
18. Posterior
- E(pA|X) = 0.26
- E(pT|X) = 0.42
- E(pG|X) = 0.15
- E(pC|X) = 0.17
19. Multinomial with conjugate prior
- θ ~ Dirichlet(4,4,4,4)
- data: 13, 24, 6, 7
- θ|y1 ~ Dirichlet(17, 28, 10, 11)
- data: y21, y22, y23,
- θ|chr2 ~ Dirichlet(71e6, 71e6, 48e6, 48e6)
- Informative conjugate priors have a pseudo-count interpretation
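The Dirichlet-multinomial update mirrors the Beta-binomial one: add the counts to the prior pseudo-counts, and each posterior mean is the updated pseudo-count over the total. A few lines reproduce the posterior means from slide 18:

```python
# Dirichlet(4,4,4,4) prior + counts (13, 24, 6, 7)
# -> Dirichlet(17, 28, 10, 11) posterior; means are alpha_k / sum(alpha).
prior = [4, 4, 4, 4]
counts = [13, 24, 6, 7]        # A, T, G, C

posterior = [a + c for a, c in zip(prior, counts)]
total = sum(posterior)
means = [a / total for a in posterior]

print(posterior)                     # [17, 28, 10, 11]
print([round(m, 2) for m in means])  # [0.26, 0.42, 0.15, 0.17]
```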
20. Several parameters in one problem
- Example 1: proportions in a multinomial model
  - pA, pT, pG, pC
- Example 2: what is the proportion of CpG islands in chr2 (of a Martian)? (λ)
  - The amount of CpG in the non-island area (p1)
  - The amount of CpG in CpG islands (p2)
21. Prior
- λ ~ Uniform(0,1)
- p1 and p2 are uniform in [0,1) × [0,1), with p1 < p2
22. Data
- Again shotgun sequence data. Instead of counting nucleotides, count the number of transitions and how many of them are CG.
23. Likelihood
24. Posterior ∝ likelihood × prior
25. Joint, Conditional, Marginal Distribution
- Two random variables: θ1, θ2
- Joint distribution: P(θ1, θ2)
- Conditional: P(θ1|θ2), P(θ2|θ1)
- Marginal distribution: p(θ1), p(θ2)
26. Joint, Marginal, Conditional
[Figure: the joint P(θ1, θ2), the marginal P(θ1), and the conditional P(θ1|θ2 = t) at a fixed value t]
27. Joint → Marginal by integration
- P(θ1) = ∫ p(θ1, θ2) dθ2
- P(θ2) = ∫ p(θ1, θ2) dθ1
- Integration: http://homepage2.nifty.com/hashimoto-t/misc/margin-e.html
28. Joint → Marginal by Simulation
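Simulation replaces the integral: draw (θ1, θ2) from the joint and simply keep the θ1 coordinate. A toy joint (an assumption for illustration only): θ2 ~ Uniform(0,1) and θ1 | θ2 ~ Uniform(0, θ2), so the marginal mean of θ1 is E[θ2]/2 = 0.25.

```python
# Marginalizing by simulation: sample from the joint, keep one coordinate.
import random

random.seed(0)
theta1_draws = []
for _ in range(100_000):
    theta2 = random.random()              # theta2 ~ Uniform(0, 1)
    theta1 = random.uniform(0, theta2)    # theta1 | theta2 ~ Uniform(0, theta2)
    theta1_draws.append(theta1)           # keep theta1: a marginal draw

marginal_mean = sum(theta1_draws) / len(theta1_draws)
print(marginal_mean)                      # close to 0.25
```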
29. Posterior Inference by Computation
30. Goal: obtain a random sample from the posterior
- Two main approaches
- Independence sampling
  - draw θ(1), θ(2), …, θ(M) ~ p(θ|y)
  - θ(1) ⊥ θ(2) ⊥ … ⊥ θ(M)
- Iterative sampling
  - draw θ(i+1) ~ p(θ(i+1) | θ(i))
  - such that p(θ(i)) → p(θ|y)
31. Posterior inference using Simulation
- Posterior mean for λ
  - E(λ) = ∫∫∫ λ p(λ, p1, p2) dλ dp1 dp2
- Posterior variance
  - var(λ) = ∫∫∫ (λ − E(λ))² p(λ, p1, p2) dλ dp1 dp2
- Credible interval
- E(log(p1/(1−p1)) − log(p2/(1−p2)))
- From simulated draws, in R:
  - mean(lambda.simu)
  - var(lambda.simu)
  - quantile(lambda.simu, c(0.025, 0.975))
  - mean(log(p1.simu/(1-p1.simu)) - log(p2.simu/(1-p2.simu)))
- No need for integration, ever!
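The point of the slide is that each integral above collapses to a sample statistic once posterior draws are available. A self-contained sketch (the Beta(2,3) draws here are a stand-in for posterior draws of λ, not the lecture's actual posterior):

```python
# Posterior summaries as sample statistics over simulated draws.
import random

random.seed(1)
draws = sorted(random.betavariate(2, 3) for _ in range(50_000))

post_mean = sum(draws) / len(draws)
post_var = sum((d - post_mean) ** 2 for d in draws) / (len(draws) - 1)
# Approximate 95% credible interval from the empirical 2.5% / 97.5% quantiles:
ci = (draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))])

print(post_mean)   # close to the exact mean 2/5 = 0.4
print(ci)          # an approximate 95% credible interval
```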
32. Simulation vs Exact
- Difficulties with exact inference
  - Some posteriors have no closed form
  - Some likelihoods are too complex, high-dimensional, etc.
  - It needs strong math skills
- Limitations of simulation
  - Simulation has stochastic error
  - Iterative algorithms may not converge
  - Sometimes it is too easy: beware!
  - It requires programming
33. Markov chain Monte Carlo (MCMC)
- Markov chain
  - The (n+1)-th step only depends on the n-th step.
  - Transition probability
- Monte Carlo = simulation
34. Transition Distribution & Stationary Distribution of a Markov chain
- A Markov chain is constructed by specifying a transition probability
  - → the dance steps
- A Markov chain (aperiodic, irreducible) has a stationary distribution
  - → the map of footprints
- Idea: construct a Markov chain for θ whose stationary distribution is p(θ|y)
35. (No Transcript)
36. Metropolis Algorithm
- proposal step
- acceptance/rejection step
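The two steps above can be sketched with a minimal random-walk Metropolis sampler (an illustration in Python, not the lecture's R code). The target is the Beta(17, 49) posterior from the A-proportion example, whose log density is 16·log(θ) + 48·log(1−θ) up to a constant.

```python
# Random-walk Metropolis: propose a move, accept with prob min(1, ratio).
import math
import random

def log_post(t):
    # Log density of Beta(17, 49), up to an additive constant.
    return 16 * math.log(t) + 48 * math.log(1 - t)

random.seed(2)
theta, draws = 0.5, []
for _ in range(20_000):
    cand = theta + random.gauss(0, 0.05)             # proposal step
    if 0 < cand < 1:                                 # stay inside the support
        log_r = log_post(cand) - log_post(theta)
        if random.random() < math.exp(min(0.0, log_r)):  # accept/reject step
            theta = cand
    draws.append(theta)

kept = draws[2_000:]             # discard burn-in
print(sum(kept) / len(kept))     # close to the exact mean 17/66 ~ 0.258
```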
37. Relationship to Optimization
- Metropolis is a stochastic version of a stepwise mode-finding algorithm: it always accepts upward steps, and only sometimes accepts downward steps.
38. An illustration
- http://www.probability.ca/jeff/java/rwm.html
39. Back to the CpG island problem

    # y, n: data vectors (loaded elsewhere); theta = c(lambda, p1, p2)
    logposterior <- function(theta, y, n)
      sum(log(theta[1] * theta[2]^y * (1 - theta[2])^(n - y) +
              (1 - theta[1]) * theta[3]^y * (1 - theta[3])^(n - y)))

    # jumping distribution
    proposal <- function(theta)
      theta + rnorm(3, 0, 1) * c(0.05, 0.05, 0.05)

    mc <- matrix(NA, 100000, 3)
    colnames(mc) <- c("lambda", "p1", "p2")
    # starting values
    mc[1, ] <- c(0.7, 0.1, 0.2)
    for (i in 2:100000) {
      for (j in 1:100) {
        Candidate <- proposal(mc[i - 1, ])
        if (j == 100) { print("too many proposals outside of boundary!\n"); break }
        if (prod(Candidate < 1) * prod(Candidate > 0) * (Candidate[3] > Candidate[2]) > 0) { cat(j); break }
      }
      r <- logposterior(Candidate, y, n) - logposterior(mc[i - 1, ], y, n)
      if (runif(1) < exp(r)) mc[i, ] <- Candidate else mc[i, ] <- mc[i - 1, ]
    }
40. Results (lambda): Metropolis has not converged in 1000 iterations
41. p1, p2: first 1000 iterations
42. Results (lambda): converged with 100,000 iterations
- mean(lambda) = 0.63, sd = 0.25
43. p1, p2 for 100,000 iterations
- mean(p1) = 0.10, sd = 0.024; mean(p2) = 0.24, sd = 0.16
44. Gibbs Sampler
45. Gibbs
46. The Gibbs Sampler
47. (No Transcript)
48. Back to the mixture problem
- Augment the data with group memberships zi, i = 1, …, 10
49. Now the full conditional distributions are all known distributions
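Written out explicitly (as implied by the R code on slide 50, with flat Beta(1,1)/Uniform priors, 10 fragments, and zi = 1 marking membership in the p1 group), the full conditionals are:

  λ | z ~ Beta(Σi zi + 1, 10 − Σi zi + 1)
  p1 | y, z ~ Beta(Σi yi zi + 1, Σi (ni − yi) zi + 1)
  p2 | y, z ~ Beta(Σi yi (1 − zi) + 1, Σi (ni − yi)(1 − zi) + 1)
  zi | λ, p1, p2, y ~ Bernoulli(f1i / (f1i + f2i)),
    where f1i = λ p1^yi (1 − p1)^(ni − yi) and f2i = (1 − λ) p2^yi (1 − p2)^(ni − yi)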
50. Gibbs code in R

    dat <- read.table("D:/guestlecture/CpG.dat", header = T)
    y <- dat$y; n <- dat$n
    lambda <- 0.3; p1 <- 0.1; p2 <- 0.2; z <- rep(0, 10)
    mc <- matrix(NA, 10000, 13)
    mc[1, ] <- c(lambda, p1, p2, z)
    for (i in 2:10000) {
      lambda <- rbeta(1, sum(z) + 1, 10 - sum(z) + 1)
      for (j in 1:100) {
        p1 <- rbeta(1, sum(y * z) + 1, sum((n - y) * z) + 1)
        p2 <- rbeta(1, sum(y * (1 - z)) + 1, sum((n - y) * (1 - z)) + 1)
        if (p2 > p1) break
      }
      ff1 <- lambda * p1^y * (1 - p1)^(n - y)
      ff2 <- (1 - lambda) * p2^y * (1 - p2)^(n - y)
      z <- rbinom(10, rep(1, 10), ff1 / (ff1 + ff2))
      mc[i, ] <- c(lambda, p1, p2, z)
    }
51. lambda: Gibbs at 1000 iterations
- mean(lambda) = 0.66, sd = 0.25
52. p1, p2 at 1000 iterations
- mean(p1) = 0.1, sd(p1) = 0.025; mean(p2) = 0.27, sd(p2) = 0.20
53. Result (lambda) at 10k iterations
- mean(lambda) = 0.64, sd = 0.25
54. p1, p2 at 10k iterations
- mean(p1) = 0.1, sd(p1) = 0.026; mean(p2) = 0.26, sd(p2) = 0.19
55. Compared with Metropolis
- Gibbs converged within the first 1000 iterations, comparable to Metropolis at around 20,000.
- The convergence of Gibbs at 10k looks better than that of Metropolis at 100k.
56. Explaining the Gibbs Sampler
- http://www.jstor.org/pss/2685208
- "Explaining the Gibbs Sampler," Casella & George, The American Statistician 46(3)
57. Difficulties with Iterative Sampling
- Burn-in: the earlier iterations are not representative of the target distribution; they are influenced by the starting values.
- Dependence: the effective number of independent draws is < N. → Thinning
58. Methods to assess convergence
- CODA
- Eye-balling
- Galin Jones's work, Department of Statistics, University of Minnesota
59. Summary
- The Bayesian approach
- Prior + likelihood → posterior
- Beta + binomial → Beta
- Dirichlet + multinomial → Dirichlet
- Multi-parameter problems
- Simulation from the posterior
- MCMC
  - Gibbs
60. Additional material
61. Why does Metropolis work?
- Outline of proof
  - show that the simulated sequence θ1, θ2, … is a Markov chain
  - show that the Markov chain has a unique stationary distribution
  - show that this distribution is the target distribution p(θ|y)
- "Understanding the Metropolis-Hastings Algorithm," Chib and Greenberg, The American Statistician, 49(4)
62. A generalization: Metropolis-Hastings
- In Metropolis, the jumping distribution has to be symmetric
- Hastings got rid of that constraint
63. Properties of a good jumping rule
- Theoretically you can use any jumping distribution, as long as it can reach all areas of the parameter space (irreducibility)
64. Gibbs as a special case of M-H
- The acceptance ratio is always r = 1