Title: Double Dirichlet Process Mixtures
1Double Dirichlet Process Mixtures
Sanjib Basu
Northern Illinois University and Rush University
Medical Center
Siddhartha Chib
Washington University, St. Louis
2
- Dirichlet process mixtures are an active research
area: Dirichlet mixtures are it!
- The flexibility of DPM models has supported their
huge popularity in a wide variety of application
areas.
- DPM models are general and can be argued to have
less structure.
- Double Dirichlet process mixtures add a degree of
structure, possibly at the expense of some degree
of flexibility, but possibly with better
interpretability in some cases.
- We discuss applications (and limitations) of
these semiparametric double mixtures.
- We compare fit-prediction duality with competing
models.
3Other DP extensions
- Double Dirichlet process mixtures are a subclass
of dependent Dirichlet process mixtures
(MacEachern 1999).
- Double DP mixtures are different from hierarchical
Dirichlet processes (Teh et al. 2006).
- A double DPM is simply a pair of independent DPMs.
4Motivating Example 1
- Luminex measurements on two biomarker proteins
from n = 156 patients:
- IL-1β protein
- C-reactive protein
- The biological effects of these two proteins are
thought to be not (totally) overlapping.
5 Two Biomarkers (y1 and y2)
- Usual DP mixture of normals (Ferguson 1983,..)
- Questions
- Should we model the two biomarkers jointly?
- Should we cluster the patients based on both
biomarkers jointly?
- The biomarkers may operate somewhat independently.
6Double DP mixtures
- Equicorrelation: corr(y1i, y2i) is assumed to
be the same for all i = 1, ..., n.
- Clustering based on biomarker 1 and clustering
based on biomarker 2 can be different.
7Motivating Example 2: Interrater Agreement
- Agreement between 2 raters (Melia and Diener-West
1994).
- Each rater provides an ordinal rating on a scale
of 1-5 (lowest to highest invasion) of the extent
to which the tumor has invaded the eye; n = 885.
8Interrater agreement
- Kottas, Muller and Quintana (2005) analyzed these
data using a flexible DP mixture of bivariate
ordinal probit models, which modeled the
unstructured joint probabilities prob(Rater 1 = i
and Rater 2 = j), i = 1, ..., 5, j = 1, ..., 5.
- One way to quantify interrater agreement is to
measure departure from the structured model of
independence.
- We consider a (mixture of) double DP mixtures
model here, which provides separate DP structures
for the two raters. We then measure agreement
from this model.
9Motivating Example 3
- Mixed model for longitudinal data
- It is common to assume (Bush and MacEachern 1996)
- Modeling the error covariance Σi or the error
variance (if Σi = diag(σi²)) extends the normal
distribution assumption to normal scale mixtures
(t, logistic, ...).
10Putting the two together
- One way to combine these two structures is
- Do we expect the random effects bi appearing in
the model for the mean and in the error variances
to cluster similarly?
- The error variance model is often used to extend
the distributional assumption.
11Double DPM
- I will discuss the fitting, applicability,
flexibility and limitations of such double
semiparametric mixtures.
- I will also compare these models with the usual DP
models via predictive model comparison criteria.
12Dirichlet process
- The Dirichlet process is a probability measure on
the space of distributions (probability measures) G.
- G ~ DP(αG0), where G0 is a probability measure.
- The Dirichlet process assigns positive mass to
every open set of probabilities on support(G0).
- Conjugacy: if Y1, ..., Yn are i.i.d. G and
π(G) = DP(αG0), then the posterior is
π(G | Y) = DP(αG0 + nFn), where Fn is the empirical
distribution.
- Polya urn scheme
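The Polya urn scheme can be sketched in a few lines. The snippet below is a minimal illustration, assuming the standard predictive rule: given the first i values, the next value is a fresh draw from G0 with probability α/(α + i) and otherwise a copy of one of the earlier values chosen uniformly. The function name polya_urn, the base measure G0 = N(0, 1) and α = 1 are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def polya_urn(n, alpha, base_draw, rng):
    """Draw theta_1..theta_n from the Polya urn implied by G ~ DP(alpha * G0)."""
    theta = np.empty(n)
    for i in range(n):
        # With probability alpha / (alpha + i) draw a new value from G0,
        # otherwise copy one of the i earlier values chosen uniformly.
        if rng.random() < alpha / (alpha + i):
            theta[i] = base_draw(rng)
        else:
            theta[i] = theta[rng.integers(i)]
    return theta

rng = np.random.default_rng(0)
# Assumption for illustration: G0 = N(0, 1), alpha = 1, n = 50.
theta = polya_urn(50, alpha=1.0, base_draw=lambda r: r.normal(0.0, 1.0), rng=rng)
n_clusters = np.unique(theta).size
print(n_clusters, "distinct values among", theta.size)
```

Because earlier values are copied exactly, the draws contain ties, which is what later produces clusters in a DP mixture.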
13Stick breaking and discreteness
- G ~ DP(αG0) implies that G is almost surely discrete.
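The discreteness can be seen from the stick-breaking construction. The sketch below assumes the usual representation, with weights qj = Vj (1 - V1) ... (1 - V(j-1)), Vj i.i.d. Beta(1, α), and atoms drawn i.i.d. from G0. It truncates at M components by setting VM = 1, which is an approximation. The names stick_breaking, M = 100, α = 2 and G0 = N(0, 1) are illustrative assumptions.

```python
import numpy as np

def stick_breaking(alpha, M, rng):
    """Weights q_1..q_M of a DP truncated at M components (V_M = 1)."""
    v = rng.beta(1.0, alpha, size=M)
    v[-1] = 1.0  # truncation: the last piece takes whatever stick is left
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining

rng = np.random.default_rng(1)
q = stick_breaking(alpha=2.0, M=100, rng=rng)
atoms = rng.normal(0.0, 1.0, size=q.size)   # assumption: G0 = N(0, 1)
draws = rng.choice(atoms, size=200, p=q)    # 200 i.i.d. draws from the discrete G
print(np.isclose(q.sum(), 1.0), np.unique(draws).size)
```

The 200 draws fall on far fewer than 200 distinct atoms, since most of the mass sits on the first few sticks.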
14Bayes estimate from DP
- The discrete nature of a random G from a DP leads
to some disturbing features, such as this result
from Diaconis and Freedman (1986).
- Location model:
- yi = θ + εi, i = 1, ..., n
- θ has prior π(θ), such as a normal prior
- ε1, ..., εn i.i.d. G, G ~ DP(G0), symmetrized,
with G0 = Cauchy or t-distribution
- Then the posterior mean is an inconsistent
estimate of θ.
15Dirichlet process mixtures (DPM)
- If we marginalize over θi, we obtain a
semiparametric mixture, where the mixing
distribution G is random and follows DP(αG0).
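In symbols, the hierarchy being marginalized can be sketched as follows. The slide's own formula did not survive extraction, so the generic sampling density f and the notation here are assumptions.

```latex
\begin{align*}
y_i \mid \theta_i &\sim f(y_i \mid \theta_i), \quad i = 1, \dots, n, \\
\theta_i \mid G &\overset{\text{i.i.d.}}{\sim} G, \qquad G \sim \mathrm{DP}(\alpha G_0), \\
y_i \mid G &\overset{\text{i.i.d.}}{\sim} \int f(y \mid \theta)\, dG(\theta).
\end{align*}
```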
16DPM - clusters
- Since G is almost surely discrete, θ1, ..., θn form
clusters:
- θ1 = θ5 = θ8 → θ1(unique)
- θ2 = θ3 = θ4 = θ6 = θ7 → θ2(unique), etc.
- The number of clusters, and the clusters
themselves, are random.
17DPM MCMC
- The Polya urn/marginalized sampler (Escobar 1994,
Escobar and West 1995) samples θi one at a time
from π(θi | θ(-i), data).
- Improvements, known as collapsed samplers, are
proposed in MacEachern (1994, 1998) where,
instead of sampling θi, only the cluster
membership of θi is sampled.
- For non-conjugate DPM (the sampling density
f(yi | θi) and the base measure G0 are not
conjugate), various algorithms have been proposed.
18Finite truncation and Blocked Gibbs
- With this finite truncation, it is now a finite
mixture model with stick-breaking structure on qj.
- (θ1, ..., θn) and (q1, ..., qM) can be updated in
blocks (instead of one at a time as in the Polya
urn sampler), which may provide better mixing.
19 Comments
- In each iteration, the Polya urn/marginal sampler
cycles through each observation and, for each,
assigns its membership among a new cluster and
the existing clusters.
- The Polya urn sampler is also not straightforward
to implement in non-linear (non-conjugate)
problems or when the sample size n may not be
fixed.
- For the blocked sampler, on the other hand, the
choice of the truncation M is not well understood.
20Model comparison in DPM models
- Basu and Chib (2003) developed a Bayes factor/
marginal likelihood computation method for DPM.
- This provided a framework for quantitative
comparison of DPM with competing parametric and
semi/nonparametric models.
21Marginal likelihood of DPM
- Based on the basic marginal identity (Chib 1995):
log-posterior(θ) = log-likelihood(θ)
+ log-prior(θ) - log-marginal
log-marginal = log-likelihood(θ)
+ log-prior(θ) - log-posterior(θ)
- The posterior ordinate of DPM is evaluated via
prequential conditioning as in Chib (1995).
- The likelihood ordinate of DPM is evaluated from
a (collapsed) sequential importance sampler.
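The basic marginal identity can be checked numerically in a toy conjugate model where every ordinate has a closed form. The sketch assumes yi ~ N(θ, σ²) with known σ² and prior θ ~ N(m0, τ²). The model, the values and the function names are illustrative assumptions. In a DPM the posterior and likelihood ordinates are not available in closed form and have to be estimated, as the slide describes.

```python
import numpy as np

def log_norm(x, mean, var):
    return -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_marginal_chib(y, theta_star, sigma2, m0, tau2):
    """log m(y) = log-likelihood + log-prior - log-posterior, all at theta_star."""
    n = y.size
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * (m0 / tau2 + y.sum() / sigma2)
    log_lik = log_norm(y, theta_star, sigma2).sum()
    log_prior = log_norm(theta_star, m0, tau2)
    log_post = log_norm(theta_star, post_mean, post_var)
    return log_lik + log_prior - log_post

rng = np.random.default_rng(3)
y = rng.normal(1.0, 2.0, size=20)   # simulated data, for illustration only
a = log_marginal_chib(y, theta_star=0.0, sigma2=4.0, m0=0.0, tau2=10.0)
b = log_marginal_chib(y, theta_star=y.mean(), sigma2=4.0, m0=0.0, tau2=10.0)
print(a, b)
```

The two values agree because the identity holds at every θ. In practice a high-density point is chosen so that the estimated posterior ordinate is stable.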
25Double Dirichlet process mixtures (DDPM)
- Marginalization yields a double semiparametric
mixture, where the two mixing distributions are
random.
26Two Biomarkers case y1 and y2
27A simpler model: normal means only
- We generate n = 50 pairs of means and then the
observations (yi1, yi2) from this double DPM model.
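A simulation of this kind can be sketched with two independent Polya urns, one per mean component. The settings used in the talk are not in the transcript, so α = 1, the base measure N(0, 3²) for each component and the unit-variance independent errors below are all assumptions.

```python
import numpy as np

def polya_urn(n, alpha, rng, sd0):
    """n draws from the Polya urn of DP(alpha * N(0, sd0^2))."""
    out = np.empty(n)
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            out[i] = rng.normal(0.0, sd0)
        else:
            out[i] = out[rng.integers(i)]
    return out

rng = np.random.default_rng(4)
n = 50
mu1 = polya_urn(n, alpha=1.0, rng=rng, sd0=3.0)   # means of component 1, from the first DP
mu2 = polya_urn(n, alpha=1.0, rng=rng, sd0=3.0)   # means of component 2, from an independent DP
# Assumption: independent N(0, 1) errors around each pair of means.
y = np.column_stack([mu1, mu2]) + rng.normal(0.0, 1.0, size=(n, 2))

k1, k2 = np.unique(mu1).size, np.unique(mu2).size
k_joint = np.unique(np.column_stack([mu1, mu2]), axis=0).shape[0]
print(k1, k2, k_joint)
```

The number of distinct mean pairs lies between max(k1, k2) and k1 × k2, which is the product-clustering structure that distinguishes the double DPM from a single bivariate DPM.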
28Double DPM
29Single DPM in the bivariate mean vector
Double DPM in mean components
30Model fitting
- We fitted the double DPM and the bivariate DPM
models to these data.
- The double DPM model can be fit by a two-stage
Polya urn sampler or a two-stage blocked Gibbs
sampler.
- Collapsing can become more difficult.
32Wallace (asymmetric) criterion for comparing two
clusters/partitions
- Let S be the number of mean pairs which are in
the same cluster in an MCMC posterior draw and
also in the true clustering.
- Let nk, k = 1, ..., K, be the number of means in
cluster Ck in the MCMC draw.
- Then the Wallace asymmetric criterion for
comparing these two clusterings is
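The formula itself did not survive extraction. A sketch under the usual form of the Wallace coefficient, W = S / Σk nk(nk - 1)/2, that is, the fraction of pairs co-clustered in the MCMC draw that are also co-clustered in the truth, might look as follows. The function name and the toy labels are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def wallace(draw_labels, true_labels):
    """Fraction of pairs co-clustered in the draw that are also co-clustered in the truth."""
    draw = np.asarray(draw_labels)
    true = np.asarray(true_labels)
    same_both = 0   # S: pairs together in both partitions
    same_draw = 0   # sum_k n_k (n_k - 1) / 2: pairs together in the draw
    for i, j in combinations(range(draw.size), 2):
        if draw[i] == draw[j]:
            same_draw += 1
            if true[i] == true[j]:
                same_both += 1
    return same_both / same_draw if same_draw > 0 else float("nan")

# Toy partitions of five items (hypothetical labels).
w = wallace([0, 0, 0, 1, 1], [0, 0, 1, 1, 1])
w_same = wallace([0, 0, 1, 1, 2], [0, 0, 1, 1, 2])
print(w, w_same)
```

The criterion is asymmetric because the denominator counts pairs in the MCMC draw only; swapping the two arguments generally gives a different value.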
33Measurements on two biomarker proteins by Luminex
panels
- Frozen paraffin-embedded tissues, pre- and
post-surgery
- Luminex panel
- Nodal involvement
34Two biomarker proteins
35µpred
36ypred
37ypred
38log CPO = log f(yi | y(-i))
LPML = log ∏ f(yi | y(-i))
- Double DP: -1498.67
- Bivariate DP: -1533.01
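CPO and LPML are commonly estimated from MCMC output by a harmonic mean of the pointwise likelihoods. The sketch below assumes that standard estimator, CPOi ≈ 1 / mean over draws of 1 / f(yi | parameters at draw t), which is not necessarily the computation used in the talk. The stand-in posterior draws and the unit-variance normal likelihood are illustrative assumptions.

```python
import numpy as np

def log_cpo(loglik):
    """loglik[t, i] = log f(y_i | parameters at MCMC draw t). Returns log CPO_i."""
    T = loglik.shape[0]
    neg = -loglik
    shift = neg.max(axis=0)   # stabilise the log-sum-exp
    log_mean_inv = shift + np.log(np.exp(neg - shift).sum(axis=0)) - np.log(T)
    return -log_mean_inv

rng = np.random.default_rng(5)
y = rng.normal(0.0, 1.0, size=30)
# Stand-in posterior draws of a normal mean (assumption, not a fitted DPM).
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(y.size), size=2000)
loglik = -0.5 * np.log(2.0 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2
lpml = log_cpo(loglik).sum()   # LPML = sum of log CPO_i
print(lpml)
```

A larger LPML indicates better leave-one-out predictive performance, which is how the two values on the slide are compared.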
39Model comparison
- I prefer to use marginal likelihood/Bayes factor
for model comparison.
- The DIC (Deviance Information Criterion), as
proposed in Spiegelhalter et al. (2002), can be
problematic for missing data/random-effects/mixture
models.
- Celeux et al. (2006) proposed many different DICs
for missing data models.
40DIC3
- I have earlier considered DIC3 (Celeux et al.
2006, Richardson 2002) in missing data and random
effects models, which is based on the observed
likelihood.
- The integration over the latent parameters often
has to be obtained numerically.
- This is difficult in the present problem.
41DIC9
- I am proposing to use DIC9 which is similar to
DIC3 but is based on the conditional likelihood
42Convergence rate results: Ghosal and van der Vaart (2001)
- Normal location mixtures.
Model: Yi i.i.d. p(y) = ∫ φ_σ(y - µ) dG(µ), i = 1, ..., n,
G ~ DP(G0), G0 is normal.
Truth: p0(y) = ∫ φ_σ(y - µ) dF(µ).
- Ghosal and van der Vaart (2001): under some
regularity conditions, Hellinger distance(p, p0) → 0
almost surely at the rate of (log n)^(3/2)/√n.
43Ghosal and van der Vaart (2001) results, contd.
- Bivariate DP location-scale mixture of normals:
Yi i.i.d. p(y) = ∫∫ φ_σ(y - µ) dH(µ, σ), i = 1, ..., n,
H ~ DP(H0).
- Ghosal and van der Vaart (2001): if H0 is normal ×
a compactly supported distribution, then the
convergence rate is (log n)^(7/2)/√n.
- Double DP location-scale mixture of normals:
Yi i.i.d. p(y) = ∫∫ φ_σ(y - µ) dG_µ(µ) dG_σ(σ), i = 1, ..., n,
G_µ ~ DP(G_µ0), G_σ ~ DP(G_σ0).
- Ghosal and van der Vaart (2001): if G_µ0 is
normal, G_σ0 is compactly supported and the true
density p0(y) = ∫∫ φ_σ(y - µ) dF1(µ) dF2(σ) is also
a double mixture, then Hellinger distance(p, p0) → 0
at the rate of (log n)^(3/2)/√n.
44Interrater data
- Agreement between 2 raters (Melia and Diener-West
1994).
- Each rater provides an ordinal rating on a scale
of 1-5 (lowest to highest invasion) of the extent
to which the tumor has invaded the eye; n = 885.
45DPM multivariate ordinal model
- Kottas, Muller and Quintana (2005)
46Interrater agreement
- The objective is to measure agreement between
raters beyond what is possible by chance.
- This is often measured by departure from
independence, often specifically in the
diagonals.
- Polychoric correlation of the latent bivariate
normal Z has been used as a measure of
association.
- Polychoric correlation of the latent bivariate
normal mixtures???
47Latent class model (Agresti and Lang 1993)
- Ratings of the two raters within a class are
independent
48Mixtures of Double DPMs
- For each latent class, we model pc1j and pc2k by
two separate univariate ordinal probit DPM models
49Computational issue
- The sample size nc in latent group c is not
fixed. This causes a problem for the
Polya urn/marginal sampler, which works with a
fixed sample size.
- Do, Muller and Tang (2005) suggested a solution to
this problem by jointly sampling the latent
parameters (µil, σil²) and the latent rating class
membership.
50Estimated cell probabilities
51Marginal probability estimates
52Marginal probability estimates
53Joint mean covariance modeling
- Trial with n = 200 patients who had acute MI within
28 days of baseline and are depressed/have low
social support.
- Underwent 6 months of usual care (control) or
individual and/or group-based cognitive
behavioral counseling (treatment).
- Response y: depression (Beck Depression
Inventory) measured at 0, 182, 365, 548, 913, 1278
days (but actually at irregular intervals).
- Covariates: treatment, family history, age, sex,
BMI, ...
- Intermittent missing responses, missing covariates.
55Model
- Pourahmadi (1999) and Pourahmadi and Daniels (2002)
use a Cholesky decomposition of the covariance,
which allows one to use a log-linear model for the
variances and linear regression for the
off-diagonal terms.
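The decomposition can be illustrated numerically. The sketch assumes the modified Cholesky form T Σ T' = D, with T unit lower-triangular and D diagonal. The below-diagonal entries of -T act as regression coefficients of each response on its predecessors, and the diagonal of D holds the innovation variances, which is where a log-linear model would apply. The AR(1)-type covariance and the function name are illustrative assumptions, not the model fitted in the talk.

```python
import numpy as np

def modified_cholesky(sigma):
    """Return (T, D) with T unit lower-triangular, D diagonal, and T @ sigma @ T.T = D."""
    L = np.linalg.cholesky(sigma)   # sigma = L @ L.T
    d_sqrt = np.diag(L)
    C = L / d_sqrt                  # scale column j by 1 / L[j, j]: sigma = C @ D @ C.T
    T = np.linalg.inv(C)
    D = np.diag(d_sqrt ** 2)
    return T, D

# Illustrative covariance for 4 repeated measurements: 2.0 * 0.6^|s - t|.
idx = np.arange(4)
sigma = 2.0 * 0.6 ** np.abs(idx[:, None] - idx[None, :])
T, D = modified_cholesky(sigma)
print(np.round(-T, 3))            # below the diagonal: coefficients on earlier responses
print(np.round(np.diag(D), 3))    # innovation variances
```

Because the entries of T are unconstrained and the diagonal of D only needs to be positive, both parts can be modeled by regression without having to enforce positive definiteness separately.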
56Modeling the covariance
57Mean and variance level random effects
- A joint DPM for the two random effects together,
which allows clustering at the patient level
- Or a double DPM, that is, independent DPMs
separately for each of the two random effects,
which allows separate clustering at the mean and
variance levels
- Most frequentist and parametric Bayesian analyses
use the latter: independence between the mean and
variance level random effects.
59Fixed effects estimates
60Pseudo marginal likelihood
61Summary
- Double DP mixtures may add a level of structure
to mixture modeling with DP.
- They produce interesting product-clustering.
- They are applicable to specific problems that may
benefit from this structure.
62basu_at_niu.edu