Nonparametric Bayesian Classification - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Nonparametric Bayesian Classification
Marc A. Coram, University of Chicago, http://galton.uchicago.edu/coram
Persi Diaconis Steve Lalley
2
Related Approaches
  • Chipman, George, McCulloch
  • Bayesian CART (1998 a,b)
  • Nested
  • CART-like
  • Coordinate aligned splits
  • Good search ability
  • Denison, Mallick, Smith
  • Bayesian CART
  • Bayesian splines and MARS

3
Outline
  • Medical example
  • Theoretical framework
  • Bayesian proposal
  • Implementation
  • Simulation experiments
  • Theoretical results
  • Extensions to a general setting

4
Example: AIDS Data (1-dimensional)
  • AIDS patients
  • Covariate of interest: viral resistance level in
    blood sample
  • Goal: estimate the conditional probability of response

5
Idealized Setting
(X,Y) iid pairs
X (covariate): X ∈ [0,1]
Y (response): Y ∈ {0,1}
f0 (true parameter): f0(x) = P(Y = 1 | X = x)
What, then, is a straightforward way to proceed,
thinking like a Bayesian?
6
Prior on f (1-dimension)
  • Pick a non-negative integer M at random. Say,
    choose M = 0 with prob 1/2, M = 1 with prob 1/4,
    M = 2 with prob 1/8, ...
  • Conditional on M = m, randomly choose a step
    function from [0,1] into [0,1] with m jumps
  • (i.e., locate the m jumps and the (m+1) values
    independently and uniformly; a sketch of this
    sampler follows below)
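A minimal sketch of one draw from this prior, assuming the geometric weights stated above (the function name and the numpy-only interface are mine, not from the talk):

```python
import numpy as np

def sample_prior_step_function(seed=0):
    """One draw of a step function f : [0,1] -> [0,1] from the prior above:
    P(M = m) = (1/2)^(m+1); given M = m, the m jump locations and the
    (m+1) values are independent and uniform on [0,1]."""
    rng = np.random.default_rng(seed)
    m = rng.geometric(0.5) - 1                 # m = 0, 1, 2, ... with prob 1/2, 1/4, ...
    jumps = np.sort(rng.uniform(0, 1, m))      # jump locations
    values = rng.uniform(0, 1, m + 1)          # value on each of the m+1 intervals

    def f(x):
        return values[np.searchsorted(jumps, x)]   # value of the interval containing x

    return jumps, values, f
```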

7
Perspective
  • Simple prior on stepwise functions
  • Functions are parameterized by jump locations
    (which determine the regions) and function values
  • Goal: Get samples from the posterior and average
    them to estimate the posterior mean curve
  • Idea: Use MCMC, but prefer analytical
    calculations whenever possible
8
Observations
  • The joint distribution of U, V, and the data has
    density proportional to the product reconstructed
    below
  • Conditional on u, the counts are sufficient for
    v.
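The displayed formula did not survive the transcript; under the uniform priors on u and v from slide 6, the standard form would be the following reconstruction, where n_j(u) is the number of observations falling in the j-th interval determined by u and s_j(u) is the number of those with Y_i = 1:

```latex
p(u, v, \mathrm{data}) \;\propto\; \rho(m) \prod_{j=0}^{m} v_j^{\,s_j(u)} \, (1 - v_j)^{\,n_j(u) - s_j(u)}
```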
9
Observations II
The marginal of the posterior on U has density
proportional to a product of Beta functions of the
interval counts (reconstructed below).
Conditional on U = u and the data, the Vj's are
independent Beta random variables.
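Again the displayed formulas are missing from the transcript; integrating the v_j's out of the density above gives the standard expressions (a reconstruction):

```latex
p(u \mid \mathrm{data}) \;\propto\; \rho(m) \prod_{j=0}^{m}
  \frac{s_j(u)!\,\bigl(n_j(u)-s_j(u)\bigr)!}{\bigl(n_j(u)+1\bigr)!},
\qquad
V_j \mid u, \mathrm{data} \;\sim\; \mathrm{Beta}\bigl(s_j(u)+1,\; n_j(u)-s_j(u)+1\bigr).
```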
10
Consequently
  • In principle
  • We put a prior on piecewise constant curves
  • The curves are specified by
  • u, a vector in [0,1]^m
  • v, a vector in [0,1]^(m+1)
  • for some m
  • We sample curves from the posterior using MCMC
  • We take the posterior mean (pointwise) of the
    sampled curves
  • In practice
  • We need only sample from the posterior on u
  • We can then compute the conditional mean of all
    the curves with this u.

11
Implementation
  • Build a reversible base chain to sample U from
    the prior
  • E.g., start with an empty vector and add,
    delete, and move coordinates randomly
  • Apply Metropolis-Hastings to construct a new
    chain which samples from the posterior on U
  • Compute the posterior mean curve by averaging
    the conditional means E[f | u, data] over the
    sampled u's (a sketch follows below)
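A minimal sketch of this implementation under the setup above. The function names, the numpy/scipy interface, and the simplified trans-dimensional acceptance ratio are my assumptions, not the talk's code:

```python
import numpy as np
from scipy.special import betaln

def log_post(u, x, y, log_half=np.log(0.5)):
    """Log marginal posterior of the jump vector u (values v integrated out),
    using rho(m) = (1/2)^(m+1) and the Beta integrals from the previous slides."""
    edges = np.sort(u)
    idx = np.searchsorted(edges, x)            # interval index of each observation
    m = len(u)
    lp = (m + 1) * log_half                    # log prior weight on m jumps
    for j in range(m + 1):
        nj = np.sum(idx == j)                  # count in interval j
        sj = np.sum(y[idx == j])               # successes in interval j
        lp += betaln(sj + 1, nj - sj + 1)      # integral of v^sj (1-v)^(nj-sj) dv
    return lp

def cond_mean(u, x, y, grid):
    """E[f(grid) | u, data]: the Beta posterior mean on each interval of u."""
    edges = np.sort(u)
    idx, gidx = np.searchsorted(edges, x), np.searchsorted(edges, grid)
    out = np.empty_like(grid, dtype=float)
    for j in range(len(u) + 1):
        nj, sj = np.sum(idx == j), np.sum(y[idx == j])
        out[gidx == j] = (sj + 1) / (nj + 2)   # mean of Beta(sj+1, nj-sj+1)
    return out

def sample_posterior_mean(x, y, grid, n_iter=20000, seed=0):
    """Metropolis-Hastings over jump vectors u. The accept/reject step below
    is simplified: a careful implementation also corrects for the proposal
    probabilities of the trans-dimensional add/delete/move moves."""
    rng = np.random.default_rng(seed)
    u, lp = np.array([]), log_post(np.array([]), x, y)
    fbar = np.zeros_like(grid, dtype=float)
    for _ in range(n_iter):
        move = rng.choice(["add", "delete", "move"])
        if move == "add":
            u_new = np.append(u, rng.uniform())
        elif len(u) > 0 and move == "delete":
            u_new = np.delete(u, rng.integers(len(u)))
        elif len(u) > 0 and move == "move":
            u_new = u.copy()
            u_new[rng.integers(len(u))] = rng.uniform()
        else:
            u_new = u
        lp_new = log_post(u_new, x, y)
        if np.log(rng.uniform()) < lp_new - lp:
            u, lp = u_new, lp_new
        fbar += cond_mean(u, x, y, grid) / n_iter
    return fbar                                 # pointwise posterior mean estimate
```

For data x in [0,1] and y in {0,1}, something like `sample_posterior_mean(x, y, np.linspace(0, 1, 200))` would return the estimated curve on a grid.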

12
Simulation Experiment (a)
n = 1024
  • True
  • Posterior Mean

13
n = 1024
  • True
  • Posterior Mean

14
n = 1024
  • True
  • Posterior Mean

15
n = 1024
  • True
  • Posterior Mean

16
Predictive Probability Surface
17
Posterior on the Number of Jumps
18
Stable w.r.t. Prior
19
Decomposition
20
Classification and Regression Trees (CART)
  • Consider splitting the data into the set with X < x
    and the set with X > x
  • Choose x to maximize the fit (e.g., as in the
    sketch below)
  • Recurse on each subset
  • Prune away splits according to a complexity
    criterion whose parameter is determined by
    cross-validation
  • Splits that do not explain enough variability
    get pruned off
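A minimal sketch of the single-split search, assuming Gini impurity as the fit criterion (the slide does not name one; the helper names are mine):

```python
import numpy as np

def gini(y):
    """Gini impurity of a 0/1 response vector."""
    p = y.mean() if len(y) else 0.0
    return 2 * p * (1 - p)

def best_split(x, y):
    """Return the split point that most reduces the size-weighted Gini
    impurity of the two resulting subsets (one coordinate, binary y)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    n = len(ys)
    best_x, best_gain = None, 0.0
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                          # cannot split between tied x values
        gain = gini(ys) - (i * gini(ys[:i]) + (n - i) * gini(ys[i:])) / n
        if gain > best_gain:
            best_x, best_gain = (xs[i - 1] + xs[i]) / 2, gain
    return best_x, best_gain                  # CART recurses on each side, then prunes
```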

21
Simulation Experiment (b)
  • True
  • Posterior Mean
  • CART

22
Bagging
  • To bag an estimator, treat the estimator as
    a black box
  • Repeatedly generate bootstrap resamples from the
    data set and run the estimator on these new data
    sets
  • Average the resulting estimates (see the sketch
    below)
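A minimal sketch of this recipe; the `estimator(x, y, grid) -> fitted values` signature is my assumption:

```python
import numpy as np

def bag(estimator, x, y, grid, n_boot=100, seed=0):
    """Bag a black-box estimator: refit it on bootstrap resamples of
    (x, y) and average the fitted curves evaluated on `grid`."""
    rng = np.random.default_rng(seed)
    n = len(x)
    fits = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)       # resample with replacement
        fits.append(estimator(x[idx], y[idx], grid))
    return np.mean(fits, axis=0)               # average the resulting estimates
```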

23
Simulation Experiment (c)
  • True
  • Posterior Mean
  • CART
  • Bagged CART, Full Trees

24
Simulation Experiment (d)
  • True
  • Posterior Mean
  • CART
  • Bagged CART, cp = 0.005

25
Simulation Experiment (e)
  • True
  • Posterior Mean
  • CART
  • Bagged CART, cp = 0.01

26
Simulations 2-10
27
CART
Bagged CART, cp = 0.01
28
Bagged Bayes??
29
Smoothers?
30
Boosting? (Lasso Stumps)
31
Dyadic Bayes (Diaconis, Freedman)
32
Monotone Invariance?
33
Bayesian Consistency
  • Consistent at f0 if
  • The posterior probability of Nε
  • tends to 1 a.s. for any ε > 0
  • Since all f are bounded in L1, consistency
    implies a fortiori convergence of the posterior
    mean (a reconstruction of the displays follows below)
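The displayed formulas are lost in the transcript; a reconstruction, assuming Nε denotes an ε-neighborhood of f0 (the transcript does not preserve the metric; L1 is used here):

```latex
\Pi_n\bigl(N_\epsilon(f_0) \mid (X_1,Y_1),\dots,(X_n,Y_n)\bigr) \;\longrightarrow\; 1
\quad \text{a.s., for every } \epsilon > 0,
\qquad\text{and hence}\qquad
\Bigl\| \int f \, d\Pi_n \;-\; f_0 \Bigr\|_{L^1} \;\longrightarrow\; 0 \ \text{a.s.}
```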

34
Sample Size 8192
35
Related Work: Diaconis and Freedman (1995)
Prior ρDF on K
Given K = k, split [0,1] into 2^k equal pieces.
(k = 3)
  • Similar hierarchical prior, but
  • Aggressive splitting
  • Fixed split points
  • Strong Results
  • If ρ dies off at a specific geometric rate
  • Consistency for all f0
  • If ρ dies off just slower than this
  • Posterior will be inconsistent at f0 ≡ 1/2
  • Consistency results cannot be taken for granted

36
Consistency Theorem (Thesis)
  • If (Xi,Yi), i = 1..n, are drawn iid via
  • X ~ U(0,1)
  • Y | X = x ~ Bernoulli(f0(x))
  • And if the prior on f is the one specified above,
    chosen so that the tails of the prior on the
    hierarchy level M
  • decay like exp(-n log(n))
  • Then the posterior
  • is a consistent estimate of f0,
  • for any measurable f0.

37
Method of Proof
  • Barron, Schervish, Wasserman (1999)
  • Need to show
  • Lemma 1: Prior puts positive mass on all
    Kullback-Leibler information neighborhoods of f0
  • Choose sieves
  • Fn = { f : f has no more than n/log(n) splits }
  • Lemma 2: The ε-upper metric entropy of Fn is
    o(n)
  • Lemma 3: The prior mass of the complement Fn^c
    decays exponentially

38
New Result
  • Coram and Lalley 2004/5 ( hopefully ? )
  • Consistency holds for any prior with infinite
    support, if the true function is not identically
    ½.
  • Consistency for the ½ case depends on the tail
    decay
  • Proof revolves around a large-deviation question
  • How does the predictive probability behave as
    n → ∞ for a model with m = an splits?
    (0 < a < ∞)
  • Proof uses subadditive ergodic theorem to take
    advantage of self-similarity in the problem

39
A Guessing Game
40
64
41
128
42
256
43
512
44
1024
45
2048
46
4096
47
8192
48
A Voronoi Prior for [0,1]^d
[Figure: a Voronoi partition of the square into cells V1, ..., V5 around five centers]
49
A Modified Voronoi Prior for General Spaces
  • Choose M, as before
  • Draw V = (V1, V2, ..., Vk)
  • With each Vj drawn without replacement from an
    a-priori fixed set A
  • In practice, I take A = {X1, ..., Xn}
  • This approximates drawing the Vj's from the
    marginal distribution of X (a sketch follows below)
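A minimal sketch of one draw from this modified prior, assuming Euclidean distance and a geometric-style law for the number of cells (the exact law and the names are my assumptions):

```python
import numpy as np

def sample_modified_voronoi(X, seed=0):
    """One draw from the modified Voronoi prior above. X is an (n, d) array
    of observed covariates; the centers are drawn without replacement from
    A = {X1, ..., Xn} and each cell gets an independent U(0,1) value."""
    rng = np.random.default_rng(seed)
    n = len(X)
    k = min(rng.geometric(0.5), n)                     # number of cells (law assumed)
    centers = X[rng.choice(n, size=k, replace=False)]  # V drawn from A
    values = rng.uniform(0, 1, k)                      # cell probabilities

    def f(x):
        # nearest-center rule; only pairwise distances are needed
        x = np.atleast_2d(x)
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=-1)
        return values[np.argmin(d, axis=1)]

    return centers, values, f
```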

50
Discussion
  • CON
  • Not quite Bayesian
  • A depends on the data
  • PRO
  • Only partitions the relevant subspace
  • Applies in general metric spaces
  • Only depends on D, the pairwise distance matrix
  • Intuitive Content

51
Intuition
Samples from the prior with k parts
52
2-dimensional Simulated Data
53
Posterior Samples
54
Posterior Mean
55
Bagged CART
56
Weighted Voronoi
57
Acknowledgements
  • Steve Lalley
  • Persi Diaconis
  • National Science Foundation
  • Lieberman Fellowship
  • AIDS Data: Andrew Zolopa, Howard Rice

58
(No Transcript)
59
Future Directions
  • Theoretical
  • Extend theoretical results to more general
    setting
  • Tighten results to determine where inconsistency
    first appears
  • Determine rate of convergence
  • Practical
  • Refine MCMC mixing using better simulated
    tempering
  • Improve computational speed
  • Explore weighted Voronoi and smoothed Voronoi
    priors
  • Compare with SVMs and Boosting
  • Use the posterior to produce confidence statements

60
Smoothing
61
Highlights
  • Straightforward Bayesian motivation
  • Implementation actually works
  • Prior can be adjusted to utilize domain knowledge
  • Provides a framework for inference
  • Compares favorably with CART/Bagged CART
  • Theoretically tractable
  • Targets high dimensional problems

62
Background
  • Enormous Literature
  • Theoretical results starting from the consistency
    of nearest neighbors
  • Methodologies
  • CART
  • Logistic Regression
  • Wavelets
  • SVMs
  • Neural Nets
  • Bayesian Literature
  • Bayesian CART
  • Image Segmentation
  • Bayesian Theory
  • Diaconis and Freedman
  • Barron, Schervish, Wasserman

63
Posterior Calculation (2-dimensional example)
64
Spatial Adaptation
65
Nonparametric Prior (1-dimension)
  • Pick K = k from ρ
  • Partition [0,1] into k intervals
  • Assign each interval Ij a value Sj, iid U(0,1)



[Figure: [0,1] partitioned into intervals I1, ..., I4 with values S1, ..., S4]
66
Consistency Results (1-dimensional)
  • Setup
  • X's iid U(0,1)
  • Y | X = x ~ Bernoulli(f0(x))
  • ρ is the prior on k
  • Result
  • If the tails of ρ decay geometrically, then for
    any measurable f0, the posterior is consistent at f0.
  • Key tools
  • Kullback-Leibler inequalities, Weierstrass
    approximation
  • (Prior is Dense)
  • Sieves
  • (Prior is almost finite dimensional)
  • Upper brackets
  • (Prior is almost finite)
  • Large deviations
  • (Each likelihood ratio test is asymptotically
    powerful)