Title: Nonparametric Bayesian Classification
1. Nonparametric Bayesian Classification
Marc A. Coram, University of Chicago, http://galton.uchicago.edu/coram
Persi Diaconis, Steve Lalley
2. Related Approaches
- Chipman, George, McCulloch
  - Bayesian CART (1998a,b)
  - Nested
  - CART-like
  - Coordinate-aligned splits
  - Good search ability
- Denison, Mallick, Smith
  - Bayesian CART
  - Bayesian splines and MARS
3. Outline
- Medical example
- Theoretical framework
- Bayesian proposal
- Implementation
- Simulation experiments
- Theoretical results
- Extensions to a general setting
4. Example: AIDS Data (1-dimensional)
- AIDS patients
- Covariate of interest: viral resistance level in a blood sample
- Goal: estimate the conditional probability of response
5. Idealized Setting
- (X, Y) iid pairs
- X (covariate): X ∈ [0, 1]
- Y (response): Y ∈ {0, 1}
- f0 (true parameter): f0(x) = P(Y = 1 | X = x)
What, then, is a straightforward way to proceed, thinking like a Bayesian?
6. Prior on f (1-dimension)
- Pick a non-negative integer M at random. Say, choose M = 0 with prob 1/2, M = 1 with prob 1/4, M = 2 with prob 1/8, ...
- Conditional on M = m, randomly choose a step function from [0,1] into [0,1] with m jumps
- (i.e., locate the m jumps and the (m+1) values independently and uniformly; a sketch follows below)
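A minimal sketch of drawing one curve from this prior; the function names and the use of NumPy are illustrative, not from the talk.

```python
import numpy as np

def sample_prior_curve(rng):
    """Draw one step function f: [0,1] -> [0,1] from the prior.

    M is geometric: P(M = m) = (1/2)^(m+1), m = 0, 1, 2, ...
    Given M = m, the m jump locations and the m+1 values are drawn
    independently and uniformly on [0, 1].
    """
    m = rng.geometric(0.5) - 1               # P(M = m) = (1/2)^(m+1)
    jumps = np.sort(rng.uniform(size=m))     # jump locations (the vector u)
    values = rng.uniform(size=m + 1)         # value on each piece (the vector v)

    def f(x):
        # piece index of each x, then look up that piece's value
        return values[np.searchsorted(jumps, x)]

    return jumps, values, f

rng = np.random.default_rng(0)
u, v, f = sample_prior_curve(rng)
print(f(np.linspace(0, 1, 5)))               # evaluate the sampled curve
```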
7. Perspective
- Simple prior on stepwise functions
- Functions are parameterized by
  - jump locations (which determine the regions)
  - function values
- Goal: Get samples from the posterior; average them to estimate the posterior mean curve
- Idea: Use MCMC, but prefer analytical calculations whenever possible
8. Observations
- The joint distribution of U, V, and the data has density proportional to
  P(M = m) ∏_{j=1..m+1} v_j^{n1j} (1 − v_j)^{n0j}
- where n1j and n0j count the observations with Y_i = 1 and Y_i = 0, respectively, among those whose X_i falls in the j-th piece of the partition determined by u
- Conditional on u, the counts are sufficient for v.
9. Observations II
The marginal of the posterior on U has density proportional to
  P(M = m) ∏_{j=1..m+1} B(n1j + 1, n0j + 1),
where B(·, ·) is the Beta function.
Conditional on U = u and the data, the V_j are independent Beta(n1j + 1, n0j + 1) random variables, and
  E[V_j | u, data] = (n1j + 1) / (n1j + n0j + 2).
10. Consequently
- In principle
  - We put a prior on piecewise constant curves
  - The curves are specified by
    - u, a vector in [0,1]^m
    - v, a vector in [0,1]^(m+1)
    - for some m
  - We sample curves from the posterior using MCMC
  - We take the posterior mean (pointwise) of the sampled curves
- In practice
  - We need only sample from the posterior on u
  - We can then compute the conditional mean of all the curves with this u (a sketch of these calculations follows)
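The conjugacy on the previous slides makes the needed quantities easy to compute for a given u. A minimal sketch with illustrative names, assuming x and y are NumPy arrays of covariates and 0/1 responses:

```python
import numpy as np
from scipy.special import betaln

def piece_counts(u, x, y):
    """Counts of y = 1 and y = 0 in each piece of the partition defined by the jumps u."""
    idx = np.searchsorted(np.sort(u), x)              # piece index of each x
    k = len(u) + 1
    n1 = np.bincount(idx, weights=y, minlength=k)
    n0 = np.bincount(idx, weights=1 - y, minlength=k)
    return n1, n0

def log_posterior_u(u, x, y):
    """log posterior density of u, up to an additive constant:
    log P(M = m) + sum_j log B(n1j + 1, n0j + 1), with the v's integrated out."""
    n1, n0 = piece_counts(u, x, y)
    return -(len(u) + 1) * np.log(2) + betaln(n1 + 1, n0 + 1).sum()

def conditional_mean_curve(u, x, y, grid):
    """E[f(t) | u, data] on a grid of t values: (n1j + 1) / (n1j + n0j + 2) on piece j."""
    n1, n0 = piece_counts(u, x, y)
    return ((n1 + 1) / (n1 + n0 + 2))[np.searchsorted(np.sort(u), grid)]
```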
11. Implementation
- Build a reversible base chain to sample U from the prior
- E.g., start with an empty vector and add, delete, and move coordinates randomly
- Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U
- Compute the pointwise average of the conditional-mean curves over the sampled u's (sketched below)
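A minimal sketch of the Metropolis-Hastings loop, assuming a user-supplied `propose(u, rng)` move (add / delete / move a jump) that is reversible with respect to the prior on U; under that assumption the acceptance ratio reduces to the ratio of marginal likelihoods. Constructing a correctly balanced base chain is the delicate part and is not shown here.

```python
import numpy as np
from scipy.special import betaln

def mh_posterior_mean(x, y, propose, n_iter, grid, rng):
    """Metropolis-Hastings over the jump vector u.

    Assumes `propose(u, rng)` is a base-chain move that is reversible with
    respect to the prior on U, so the acceptance probability involves only
    the marginal likelihood L(u) = prod_j B(n1j + 1, n0j + 1).
    """
    def counts(u):
        idx = np.searchsorted(np.sort(u), x)
        k = len(u) + 1
        return (np.bincount(idx, weights=y, minlength=k),
                np.bincount(idx, weights=1 - y, minlength=k))

    def log_lik(u):
        n1, n0 = counts(u)
        return betaln(n1 + 1, n0 + 1).sum()

    u = np.array([])                                   # start from the empty jump vector
    curve_sum = np.zeros_like(grid, dtype=float)
    for _ in range(n_iter):
        u_new = propose(u, rng)
        if np.log(rng.uniform()) < log_lik(u_new) - log_lik(u):
            u = u_new                                  # accept the proposed jump vector
        n1, n0 = counts(u)
        curve_sum += ((n1 + 1) / (n1 + n0 + 2))[np.searchsorted(np.sort(u), grid)]
    return curve_sum / n_iter                          # pointwise posterior-mean estimate
```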
12. Simulation Experiment (a)
n = 1024
13. n = 1024
14. n = 1024
15. n = 1024
16. Predictive Probability Surface
17. Posterior on # Jumps
18. Stable w.r.t. Prior
19. Decomposition
20. Classification and Regression Trees (CART)
- Consider splitting the data into the set with X < x and the set with X > x
- Choose x to maximize the fit
- Recurse on each subset
- Prune away splits according to a complexity criterion whose parameter is determined by cross-validation
- Splits that do not explain enough variability get pruned off (an illustration follows below)
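For comparison, a minimal illustration of this recipe using scikit-learn's DecisionTreeClassifier, with the cost-complexity parameter ccp_alpha chosen by cross-validation; this stands in for, and need not match, the CART implementation used in the talk, and the data are simulated purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Simulated 1-d data with a step-function truth (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(size=(1024, 1))
y = rng.binomial(1, np.where(x[:, 0] > 0.5, 0.8, 0.2))

# Grow trees and let 5-fold cross-validation pick the cost-complexity
# pruning level ccp_alpha, mirroring the prune-by-complexity step above.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x, y).ccp_alphas
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5)
search.fit(x, y)
print(search.best_estimator_.predict_proba([[0.3], [0.7]])[:, 1])  # estimated P(Y=1 | x)
```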
21. Simulation Experiment (b)
22. Bagging
- To bag an estimator, you treat the estimator as a black box
- Repeatedly generate bootstrap resamples from the data set and run the estimator on these new data sets
- Average the resulting estimates (a sketch follows below)
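A minimal sketch of bagging an arbitrary estimator treated as a black box; the `fit_predict` interface and names are assumptions made for the sketch.

```python
import numpy as np

def bag(fit_predict, x, y, x_test, n_boot=100, rng=None):
    """Bag a black-box estimator.

    `fit_predict(x_train, y_train, x_test)` is any routine that fits the
    estimator on (x_train, y_train) and returns its predictions at x_test.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)      # bootstrap resample, with replacement
        preds.append(fit_predict(x[idx], y[idx], x_test))
    return np.mean(preds, axis=0)          # average the resulting estimates
```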
23. Simulation Experiment (c)
- True
- Posterior Mean
- CART
- Bagged CART (full trees)
24. Simulation Experiment (d)
- True
- Posterior Mean
- CART
- Bagged CART (cp = 0.005)
25. Simulation Experiment (e)
- True
- Posterior Mean
- CART
- Bagged CART (cp = 0.01)
26. Simulations 2-10
27. CART
Bagged CART (cp = 0.01)
28. Bagged Bayes??
29. Smoothers?
30. Boosting? (Lasso Stumps)
31. Dyadic Bayes (Diaconis, Freedman)
32. Monotone Invariance?
33. Bayesian Consistency
- The posterior is consistent at f0 if the posterior probability of N_ε, an ε-neighborhood of f0, tends to 1 a.s. for any ε > 0
- Since all f are bounded in L1, consistency implies a fortiori that the posterior mean converges to f0
34. Sample Size 8192
35. Related Work: Diaconis and Freedman (1995)
- Prior: draw K from λ; given K = k, split [0,1] into 2^k equal pieces (the picture shows k = 3)
- Similar hierarchical prior, but
  - Aggressive splitting
  - Fixed split points
- Strong results
  - If λ dies off at a specific geometric rate: consistency for all f0
  - If λ dies off just slower than this: the posterior will be inconsistent at f0 ≡ 1/2
- Consistency results cannot be taken for granted
36. Consistency Theorem (Thesis)
- If (Xi, Yi), i = 1..n, are drawn iid via
  - X ~ U(0,1)
  - Y | X = x ~ Bernoulli(f0(x))
- And if Π is the specified prior on f, chosen so that the tails of the prior on the hierarchy level M decay like exp(−n log(n))
- Then Π_n, the posterior, is a consistent estimate of f0, for any measurable f0.
37. Method of Proof
- Barron, Schervish, Wasserman (1999)
- Need to show
  - Lemma 1: The prior puts positive mass on all Kullback-Leibler information neighborhoods of f0
  - Choose sieves F_n = {f : f has no more than n/log(n) splits}
  - Lemma 2: The ε-upper metric entropy of F_n is o(n)
  - Lemma 3: Π(F_n^c) decays exponentially
38. New Result
- Coram and Lalley, 2004/5 (hopefully)
- Consistency holds for any prior with infinite support, if the true function is not identically 1/2
- Consistency for the 1/2 case depends on the tail decay
- Proof revolves around a large-deviation question: how does the predictive probability behave as n → ∞ for a model with an splits (0 < a < ∞)?
- Proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem
39. A Guessing Game
40. n = 64
41. n = 128
42. n = 256
43. n = 512
44. n = 1024
45. n = 2048
46. n = 4096
47. n = 8192
48. A Voronoi Prior for [0,1]^d
(Figure: a Voronoi partition of [0,1]^d into five cells V1-V5, each with its center and an associated value)
49. A Modified Voronoi Prior for General Spaces
- Choose M, as before
- Draw V = (V1, V2, ..., Vk)
- With each Vj drawn without replacement from an a-priori fixed set A
- In practice, I take A = {X1, ..., Xn}
- This approximates drawing the V's from the marginal distribution of X (a sketch follows below)
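A minimal sketch of drawing one partition from this modified Voronoi prior using only the pairwise distance matrix D. The geometric prior on M and the convention of using M + 1 cells mirror the one-dimensional prior and are assumptions of the sketch, not details confirmed by the talk.

```python
import numpy as np

def sample_modified_voronoi(D, rng):
    """Draw one partition from the modified Voronoi prior.

    D is the n-by-n matrix of pairwise distances among the observed
    covariates, so the candidate center set is A = {X_1, ..., X_n}.
    M is drawn from a geometric-type prior, M + 1 centers are drawn from A
    without replacement, and each point is assigned to its nearest center.
    """
    n = D.shape[0]
    m = rng.geometric(0.5) - 1                       # P(M = m) = (1/2)^(m+1)
    k = min(m + 1, n)                                # number of Voronoi cells
    centers = rng.choice(n, size=k, replace=False)   # indices into A
    cell = np.argmin(D[:, centers], axis=1)          # nearest-center assignment
    return centers, cell

# Example: pairwise distances among n points in the unit square.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
centers, cell = sample_modified_voronoi(D, rng)
```

Conditional on the resulting partition, the cell values play the same role as the v's in one dimension, so the same Beta calculations sketched earlier can be reused.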
50. Discussion
- CON
  - Not quite Bayesian: A depends on the data
- PRO
  - Only partitions the relevant subspace
  - Applies in general metric spaces
  - Only depends on D, the pairwise distance matrix
  - Intuitive content
51. Intuition
(Figure: samples from the prior with k parts, for several values of k)
52. 2-dimensional Simulated Data
53. Posterior Samples
54. Posterior Mean
55. Bagged CART
56. Weighted Voronoi
57. Acknowledgements
- Steve Lalley
- Persi Diaconis
- National Science Foundation
- Lieberman Fellowship
- AIDS Data: Andrew Zolopa, Howard Rice
59. Future Directions
- Theoretical
  - Extend theoretical results to a more general setting
  - Tighten results to determine where inconsistency first appears
  - Determine rate of convergence
- Practical
  - Refine MCMC mixing using better simulated tempering
  - Improve computational speed
  - Explore weighted Voronoi and smoothed Voronoi priors
  - Compare with SVMs and Boosting
  - Use the posterior to produce confidence statements
60. Smoothing
61. Highlights
- Straightforward Bayesian motivation
- Implementation actually works
- Prior can be adjusted to utilize domain knowledge
- Provides a framework for inference
- Compares favorably with CART/Bagged CART
- Theoretically tractable
- Targets high dimensional problems
62. Background
- Enormous literature
- Theoretical results starting from the consistency of nearest neighbors
- Methodologies
- CART
- Logistic Regression
- Wavelets
- SVMs
- Neural Nets
- Bayesian Literature
- Bayesian CART
- Image Segmentation
- Bayesian Theory
- Diaconis and Freedman
- Barron, Schervish, Wasserman
63. Posterior Calculation (2-dimensional example)
64. Spatial Adaptation
65. Nonparametric Prior (1-dimension)
- Pick K = k from λ
- Partition [0,1] into k intervals
- Assign each interval a value Sj, iid U(0,1)
(Figure: a partition of [0,1] into four intervals with their assigned values)
66. Consistency Results (1-dimensional)
- Setup
  - X's iid U(0,1)
  - Y | X = x ~ Bernoulli(f0(x))
  - λ is the prior on k
- Result
  - If the tails of λ decay geometrically, then for any measurable f0, the posterior Π_n is consistent at f0.
- Key tools
  - Kullback-Leibler inequalities, Weierstrass approximation (prior is dense)
  - Sieves (prior is almost finite dimensional)
  - Upper brackets (prior is almost finite)
  - Large deviations (each likelihood ratio test is asymptotically powerful)