Title: Nonparametric Bayesian Classification
1. Nonparametric Bayesian Classification
Marc A. Coram, University of Chicago, http://galton.uchicago.edu/coram
Persi Diaconis, Steve Lalley
2. Related Approaches
- Chipman, George, McCulloch
  - Bayesian CART (1998a,b)
  - Nested
  - CART-like
  - Coordinate-aligned splits
  - Good search ability
- Denison, Mallick, Smith
  - Bayesian CART
  - Bayesian splines and MARS
3. Outline
- Medical example
- Theoretical framework
- Bayesian proposal
- Implementation
- Simulation experiments
- Theoretical results
- Extensions to a general setting
4. Example: AIDS Data (1-dimensional)
- AIDS patients
- Covariate of interest: viral resistance level in a blood sample
- Goal: estimate the conditional probability of response
5. Idealized Setting
- (X, Y) iid pairs
- X (covariate): X ∈ [0, 1]
- Y (response): Y ∈ {0, 1}
- f0 (true parameter): f0(x) = P(Y = 1 | X = x)
What, then, is a straightforward way to proceed, thinking like a Bayesian?
6. Prior on f (1-dimension)
- Pick a non-negative integer M at random. Say, choose M = 0 with prob 1/2, M = 1 with prob 1/4, M = 2 with prob 1/8, ...
- Conditional on M = m, randomly choose a step function from [0,1] into [0,1] with m jumps
- (i.e., locate the m jumps and the (m+1) values independently and uniformly; a sketch follows below)
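A minimal sketch of drawing one curve from this prior; the function names and the use of NumPy are illustrative, not from the talk.

```python
import numpy as np

def sample_prior_curve(rng):
    """Draw one step function f: [0,1] -> [0,1] from the prior.

    M is geometric: P(M = m) = (1/2)^(m+1), m = 0, 1, 2, ...
    Given M = m, the m jump locations and the m+1 values are drawn
    independently and uniformly on [0, 1].
    """
    m = rng.geometric(0.5) - 1               # P(M = m) = (1/2)^(m+1)
    jumps = np.sort(rng.uniform(size=m))     # jump locations (the vector u)
    values = rng.uniform(size=m + 1)         # value on each piece (the vector v)

    def f(x):
        # piece index of each x, then look up that piece's value
        return values[np.searchsorted(jumps, x)]

    return jumps, values, f

rng = np.random.default_rng(0)
u, v, f = sample_prior_curve(rng)
print(f(np.linspace(0, 1, 5)))               # evaluate the sampled curve
```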
7. Perspective
- Simple prior on stepwise functions
- Functions are parameterized by
  - jump locations (which determine the regions)
  - function values
- Goal: Get samples from the posterior; average them to estimate the posterior mean curve
- Idea: Use MCMC, but prefer analytical calculations whenever possible
8. Observations
- The joint distribution of U, V, and the data has density proportional to
  P(M = m) ∏_{j=1..m+1} v_j^{n1j} (1 − v_j)^{n0j}
- where n1j and n0j count the observations with Y_i = 1 and Y_i = 0, respectively, among those whose X_i falls in the j-th piece of the partition determined by u
- Conditional on u, the counts are sufficient for v.
9. Observations II
The marginal of the posterior on U has density proportional to
  P(M = m) ∏_{j=1..m+1} B(n1j + 1, n0j + 1),
where B(·, ·) is the Beta function.
Conditional on U = u and the data, the V_j are independent Beta(n1j + 1, n0j + 1) random variables, and
  E[V_j | u, data] = (n1j + 1) / (n1j + n0j + 2).
10. Consequently
- In principle
  - We put a prior on piecewise constant curves
  - The curves are specified by
    - u, a vector in [0,1]^m
    - v, a vector in [0,1]^(m+1)
    - for some m
  - We sample curves from the posterior using MCMC
  - We take the posterior mean (pointwise) of the sampled curves
- In practice
  - We need only sample from the posterior on u
  - We can then compute the conditional mean of all the curves with this u (a sketch of these calculations follows)
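The conjugacy on the previous slides makes the needed quantities easy to compute for a given u. A minimal sketch with illustrative names, assuming x and y are NumPy arrays of covariates and 0/1 responses:

```python
import numpy as np
from scipy.special import betaln

def piece_counts(u, x, y):
    """Counts of y = 1 and y = 0 in each piece of the partition defined by the jumps u."""
    idx = np.searchsorted(np.sort(u), x)              # piece index of each x
    k = len(u) + 1
    n1 = np.bincount(idx, weights=y, minlength=k)
    n0 = np.bincount(idx, weights=1 - y, minlength=k)
    return n1, n0

def log_posterior_u(u, x, y):
    """log posterior density of u, up to an additive constant:
    log P(M = m) + sum_j log B(n1j + 1, n0j + 1), with the v's integrated out."""
    n1, n0 = piece_counts(u, x, y)
    return -(len(u) + 1) * np.log(2) + betaln(n1 + 1, n0 + 1).sum()

def conditional_mean_curve(u, x, y, grid):
    """E[f(t) | u, data] on a grid of t values: (n1j + 1) / (n1j + n0j + 2) on piece j."""
    n1, n0 = piece_counts(u, x, y)
    return ((n1 + 1) / (n1 + n0 + 2))[np.searchsorted(np.sort(u), grid)]
```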
11. Implementation
- Build a reversible base chain to sample U from the prior
- E.g., start with an empty vector and add, delete, and move coordinates randomly
- Apply Metropolis-Hastings to construct a new chain which samples from the posterior on U
- Compute the pointwise average of the conditional-mean curves over the sampled u's (sketched below)
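A minimal sketch of the Metropolis-Hastings loop, assuming a user-supplied `propose(u, rng)` move (add / delete / move a jump) that is reversible with respect to the prior on U; under that assumption the acceptance ratio reduces to the ratio of marginal likelihoods. Constructing a correctly balanced base chain is the delicate part and is not shown here.

```python
import numpy as np
from scipy.special import betaln

def mh_posterior_mean(x, y, propose, n_iter, grid, rng):
    """Metropolis-Hastings over the jump vector u.

    Assumes `propose(u, rng)` is a base-chain move that is reversible with
    respect to the prior on U, so the acceptance probability involves only
    the marginal likelihood L(u) = prod_j B(n1j + 1, n0j + 1).
    """
    def counts(u):
        idx = np.searchsorted(np.sort(u), x)
        k = len(u) + 1
        return (np.bincount(idx, weights=y, minlength=k),
                np.bincount(idx, weights=1 - y, minlength=k))

    def log_lik(u):
        n1, n0 = counts(u)
        return betaln(n1 + 1, n0 + 1).sum()

    u = np.array([])                                   # start from the empty jump vector
    curve_sum = np.zeros_like(grid, dtype=float)
    for _ in range(n_iter):
        u_new = propose(u, rng)
        if np.log(rng.uniform()) < log_lik(u_new) - log_lik(u):
            u = u_new                                  # accept the proposed jump vector
        n1, n0 = counts(u)
        curve_sum += ((n1 + 1) / (n1 + n0 + 2))[np.searchsorted(np.sort(u), grid)]
    return curve_sum / n_iter                          # pointwise posterior-mean estimate
```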
12. Simulation Experiment (a)
n = 1024
13. n = 1024
14. n = 1024
15. n = 1024
16. Predictive Probability Surface
17. Posterior on # Jumps
18. Stable w.r.t. Prior
19. Decomposition
20. Classification and Regression Trees (CART)
- Consider splitting the data into the set with X < x and the set with X > x
- Choose x to maximize the fit
- Recurse on each subset
- Prune away splits according to a complexity criterion whose parameter is determined by cross-validation
- Splits that do not explain enough variability get pruned off (an illustration follows below)
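For comparison, a minimal illustration of this recipe using scikit-learn's DecisionTreeClassifier, with the cost-complexity parameter ccp_alpha chosen by cross-validation; this stands in for, and need not match, the CART implementation used in the talk, and the data are simulated purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Simulated 1-d data with a step-function truth (illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(size=(1024, 1))
y = rng.binomial(1, np.where(x[:, 0] > 0.5, 0.8, 0.2))

# Grow trees and let 5-fold cross-validation pick the cost-complexity
# pruning level ccp_alpha, mirroring the prune-by-complexity step above.
alphas = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(x, y).ccp_alphas
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {"ccp_alpha": alphas}, cv=5)
search.fit(x, y)
print(search.best_estimator_.predict_proba([[0.3], [0.7]])[:, 1])  # estimated P(Y=1 | x)
```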
21. Simulation Experiment (b)
22. Bagging
- To bag an estimator, you treat the estimator as a black box
- Repeatedly generate bootstrap resamples from the data set and run the estimator on these new data sets
- Average the resulting estimates (a sketch follows below)
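A minimal sketch of bagging an arbitrary estimator treated as a black box; the `fit_predict` interface and names are assumptions made for the sketch.

```python
import numpy as np

def bag(fit_predict, x, y, x_test, n_boot=100, rng=None):
    """Bag a black-box estimator.

    `fit_predict(x_train, y_train, x_test)` is any routine that fits the
    estimator on (x_train, y_train) and returns its predictions at x_test.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(y)
    preds = []
    for _ in range(n_boot):
        idx = rng.integers(n, size=n)      # bootstrap resample, with replacement
        preds.append(fit_predict(x[idx], y[idx], x_test))
    return np.mean(preds, axis=0)          # average the resulting estimates
```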
23. Simulation Experiment (c)
- True
- Posterior Mean
- CART
- Bagged CART (full trees)
24. Simulation Experiment (d)
- True
- Posterior Mean
- CART
- Bagged CART (cp = 0.005)
25. Simulation Experiment (e)
- True
- Posterior Mean
- CART
- Bagged CART (cp = 0.01)
26. Simulations 2-10
27. CART
Bagged CART (cp = 0.01)
28. Bagged Bayes??
29. Smoothers?
30. Boosting? (Lasso Stumps)
31. Dyadic Bayes (Diaconis, Freedman)
32. Monotone Invariance?
33. Bayesian Consistency
- The posterior is consistent at f0 if the posterior probability of N_ε, an ε-neighborhood of f0, tends to 1 a.s. for any ε > 0
- Since all f are bounded in L1, consistency implies a fortiori that the posterior mean converges to f0
34. Sample Size 8192
35. Related Work: Diaconis and Freedman (1995)
- Prior: draw K from λ; given K = k, split [0,1] into 2^k equal pieces (the picture shows k = 3)
- Similar hierarchical prior, but
  - Aggressive splitting
  - Fixed split points
- Strong results
  - If λ dies off at a specific geometric rate: consistency for all f0
  - If λ dies off just slower than this: the posterior will be inconsistent at f0 ≡ 1/2
- Consistency results cannot be taken for granted
36. Consistency Theorem (Thesis)
- If (Xi, Yi), i = 1..n, are drawn iid via
  - X ~ U(0,1)
  - Y | X = x ~ Bernoulli(f0(x))
- And if Π is the specified prior on f, chosen so that the tails of the prior on the hierarchy level M decay like exp(−n log(n))
- Then Π_n, the posterior, is a consistent estimate of f0, for any measurable f0.
37. Method of Proof
- Barron, Schervish, Wasserman (1999)
- Need to show
  - Lemma 1: The prior puts positive mass on all Kullback-Leibler information neighborhoods of f0
  - Choose sieves F_n = {f : f has no more than n/log(n) splits}
  - Lemma 2: The ε-upper metric entropy of F_n is o(n)
  - Lemma 3: Π(F_n^c) decays exponentially
38. New Result
- Coram and Lalley, 2004/5 (hopefully)
- Consistency holds for any prior with infinite support, if the true function is not identically 1/2
- Consistency for the 1/2 case depends on the tail decay
- Proof revolves around a large-deviation question: how does the predictive probability behave as n → ∞ for a model with an splits (0 < a < ∞)?
- Proof uses the subadditive ergodic theorem to take advantage of self-similarity in the problem
39. A Guessing Game
40. n = 64
41. n = 128
42. n = 256
43. n = 512
44. n = 1024
45. n = 2048
46. n = 4096
47. n = 8192
48. A Voronoi Prior for [0,1]^d
(Figure: a Voronoi partition of [0,1]^d into five cells V1-V5, each with its center and an associated value)
49. A Modified Voronoi Prior for General Spaces
- Choose M, as before
- Draw V = (V1, V2, ..., Vk)
- With each Vj drawn without replacement from an a-priori fixed set A
- In practice, I take A = {X1, ..., Xn}
- This approximates drawing the V's from the marginal distribution of X (a sketch follows below)
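A minimal sketch of drawing one partition from this modified Voronoi prior using only the pairwise distance matrix D. The geometric prior on M and the convention of using M + 1 cells mirror the one-dimensional prior and are assumptions of the sketch, not details confirmed by the talk.

```python
import numpy as np

def sample_modified_voronoi(D, rng):
    """Draw one partition from the modified Voronoi prior.

    D is the n-by-n matrix of pairwise distances among the observed
    covariates, so the candidate center set is A = {X_1, ..., X_n}.
    M is drawn from a geometric-type prior, M + 1 centers are drawn from A
    without replacement, and each point is assigned to its nearest center.
    """
    n = D.shape[0]
    m = rng.geometric(0.5) - 1                       # P(M = m) = (1/2)^(m+1)
    k = min(m + 1, n)                                # number of Voronoi cells
    centers = rng.choice(n, size=k, replace=False)   # indices into A
    cell = np.argmin(D[:, centers], axis=1)          # nearest-center assignment
    return centers, cell

# Example: pairwise distances among n points in the unit square.
rng = np.random.default_rng(0)
pts = rng.uniform(size=(200, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
centers, cell = sample_modified_voronoi(D, rng)
```

Conditional on the resulting partition, the cell values play the same role as the v's in one dimension, so the same Beta calculations sketched earlier can be reused.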
50. Discussion
- CON
  - Not quite Bayesian: A depends on the data
- PRO
  - Only partitions the relevant subspace
  - Applies in general metric spaces
  - Only depends on D, the pairwise distance matrix
  - Intuitive content
51. Intuition
(Figure: samples from the prior with k parts, for several values of k)
52. 2-dimensional Simulated Data
53. Posterior Samples
54. Posterior Mean
55. Bagged CART
56. Weighted Voronoi
57. Acknowledgements
- Steve Lalley
- Persi Diaconis
- National Science Foundation
- Lieberman Fellowship
- AIDS Data: Andrew Zolopa, Howard Rice
59. Future Directions
- Theoretical
  - Extend theoretical results to a more general setting
  - Tighten results to determine where inconsistency first appears
  - Determine rate of convergence
- Practical
  - Refine MCMC mixing using better simulated tempering
  - Improve computational speed
  - Explore weighted Voronoi and smoothed Voronoi priors
  - Compare with SVMs and Boosting
  - Use the posterior to produce confidence statements
60. Smoothing
61. Highlights
- Straightforward Bayesian motivation
- Implementation actually works
- Prior can be adjusted to utilize domain knowledge
- Provides a framework for inference
- Compares favorably with CART/Bagged CART
- Theoretically tractable
- Targets high dimensional problems
62. Background
- Enormous literature
- Theoretical results starting from the consistency of nearest neighbors
- Methodologies
- CART
- Logistic Regression
- Wavelets
- SVMs
- Neural Nets
- Bayesian Literature
- Bayesian CART
- Image Segmentation
- Bayesian Theory
- Diaconis and Freedman
- Barron, Schervish, Wasserman
63. Posterior Calculation (2-dimensional example)
64. Spatial Adaptation
65. Nonparametric Prior (1-dimension)
- Pick K = k from λ
- Partition [0,1] into k intervals
- Assign each interval a value Sj, iid U(0,1)
(Figure: a partition of [0,1] into four intervals with their assigned values)
66. Consistency Results (1-dimensional)
- Setup
  - X's iid U(0,1)
  - Y | X = x ~ Bernoulli(f0(x))
  - λ is the prior on k
- Result
  - If the tails of λ decay geometrically, then for any measurable f0, the posterior Π_n is consistent at f0.
- Key tools
  - Kullback-Leibler inequalities, Weierstrass approximation (prior is dense)
  - Sieves (prior is almost finite dimensional)
  - Upper brackets (prior is almost finite)
  - Large deviations (each likelihood ratio test is asymptotically powerful)