Title: Statistical Models for Partial Membership
1Statistical Models for Partial Membership
- Katherine Heller
- Gatsby Computational Neuroscience Unit, UCL
- Sinead Williamson and Zoubin Ghahramani
- University of Cambridge
2Partial Membership
- Example: a person with a mixed ethnic background.
- Someone who is 50% Asian and 50% European partly belongs to 2 different groups (ethnicities).
- This partial membership may be relevant for predicting this person's phenotype or food preferences.
- Conceptually not the same as uncertain membership.
- Being certain that someone is half Asian and half European is very different from being unsure of their ethnicity.
- More evidence (like DNA tests) can help resolve uncertainty but will not change their ethnicity memberships.
- Prior work on modeling partial membership comes from the fuzzy logic community.
3Outline
- Goal: Describe a fully probabilistic approach to data modeling with partial memberships.
- Introduction
- Bayesian Partial Membership Model (BPM)
- BPM Learning
- Experiments
- Synthetic
- Senate Roll Call data
- Related Work
- Conclusions
- Nonparametric Extension?
4Finite Mixture Models
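Consider modeling a data set, $X = \{x_n\}_{n=1}^{N}$, using a finite mixture of K components.
Generative Process
1) Choose a cluster
2) Generate a data point from that cluster
$p(x_n \mid \Theta) = \sum_{\pi_n} p(\pi_n) \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}}$
where $\pi_{nk} \in \{0, 1\}$ and $\sum_k \pi_{nk} = 1$.
The $\pi_{nk}$ denote memberships of data points to clusters!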
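The two-step generative process above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the 1-D Gaussian clusters, the mixing proportions, and the names `sample_mixture`, `rho`, `means`, and `stds` are all assumptions made for the example.

```python
import random

def sample_mixture(rho, means, stds, rng):
    """Draw one point from a 1-D Gaussian mixture (illustrative clusters).

    rho   : mixing proportions, assumed to sum to 1
    means : per-cluster means (hypothetical values)
    stds  : per-cluster standard deviations (hypothetical values)
    """
    # 1) Choose a cluster: the membership vector pi is one-hot.
    k = rng.choices(range(len(rho)), weights=rho)[0]
    pi = [1 if j == k else 0 for j in range(len(rho))]
    # 2) Generate a data point from that cluster.
    x = rng.gauss(means[k], stds[k])
    return pi, x

rng = random.Random(0)
samples = [sample_mixture([0.3, 0.7], [-2.0, 2.0], [1.0, 1.0], rng)
           for _ in range(1000)]
```

Every membership vector produced here has exactly one entry equal to 1, which is the all-or-nothing assignment that the partial membership model relaxes.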
5Finite Mixture Models
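Continuous Relaxation
$p(x_n \mid \Theta) = \int_{\pi_n} p(\pi_n) \, \frac{1}{c} \prod_{k=1}^{K} p_k(x_n \mid \theta_k)^{\pi_{nk}} \, d\pi_n$
where $\pi_{nk} \in [0, 1]$ and $\sum_k \pi_{nk} = 1$.
The $\pi_{nk}$ now denote partial memberships of data points to clusters!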
6Why does this make sense?
[Figure: Mixture Model versus Partial Membership model, with membership vectors (0,1), (.5,.5), and (1,0) marked]
- If there is an Asian cluster and a European
cluster, the partial membership model will better
capture people with mixed ethnicity, whose
features lie in between.
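A small numerical sketch of the "in between" behaviour. It assumes that, for a fixed membership vector, the data density is proportional to the product of cluster densities raised to the membership weights, and it uses two hypothetical 1-D Gaussian clusters. The function name and the parameter values are illustrative only.

```python
def partial_membership_gaussian(pi, means, variances):
    """Mean and variance of the normalized product prod_k N(x | m_k, v_k)^{pi_k}.

    Raising a Gaussian to the power pi_k scales its precision by pi_k,
    and multiplying Gaussians adds precisions, so the result is Gaussian.
    """
    precision = sum(p / v for p, v in zip(pi, variances))
    mean = sum(p * m / v for p, m, v in zip(pi, means, variances)) / precision
    return mean, 1.0 / precision

means = [-2.0, 2.0]      # hypothetical cluster means
variances = [1.0, 4.0]   # hypothetical cluster variances
for pi in [(1.0, 0.0), (0.5, 0.5), (0.0, 1.0)]:
    print(pi, partial_membership_gaussian(pi, means, variances))
```

With memberships (1,0) or (0,1) the original clusters are recovered; with (.5,.5) the result is a single Gaussian lying between them, rather than a bimodal mixture.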
7Exponential Family Distributions
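Let's consider the case where each cluster density is in the exponential family:
$p_k(x \mid \theta_k) = \exp\{ s(x)^\top \theta_k + h(x) + g(\theta_k) \}$
Sufficient Statistics: $s(x)$
Natural Parameters: $\theta_k$
It follows that
$x_n \mid \pi_n, \Theta \sim \mathrm{Expon}\big(\sum_k \pi_{nk} \theta_k\big)$
Conjugate prior can be written as
$p(\theta_k) \propto \exp\{ \lambda^\top \theta_k + \nu \, g(\theta_k) \}$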
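A short sketch of the "it follows that" step. It assumes each cluster density has the form $\exp\{s(x)^\top\theta_k + h(x) + g(\theta_k)\}$, with $s$ the sufficient statistics, $\theta_k$ the natural parameters, and memberships $\pi_{nk}$ that sum to one over $k$. The symbols are assumptions where the slide does not show them.

```latex
\begin{align*}
\prod_{k} p_k(x_n \mid \theta_k)^{\pi_{nk}}
  &= \exp\Big\{ \sum_k \pi_{nk} \big( s(x_n)^\top \theta_k + h(x_n) + g(\theta_k) \big) \Big\} \\
  &= \exp\Big\{ s(x_n)^\top \Big( \sum_k \pi_{nk} \theta_k \Big) + h(x_n)
       + \sum_k \pi_{nk} \, g(\theta_k) \Big\}
     \qquad \text{using } \sum_k \pi_{nk} = 1 .
\end{align*}
```

The last term does not depend on $x_n$, so after normalization the product is again a member of the same exponential family, with natural parameters given by the convex combination $\sum_k \pi_{nk}\theta_k$.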
8Bayesian Partial Membership Model
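Generative Process (with the Ethnicity Example interpretation)
For each k:
$\theta_k \sim \mathrm{Conj}(\lambda, \nu)$: defines a distribution over features for each of the K ethnic groups
$\rho \sim \mathrm{Dir}(\alpha)$: defines the ethnic composition of the population
$a \sim \mathrm{Exp}(b)$: controls how similar to the population an individual is expected to be
For each n:
$\pi_n \sim \mathrm{Dir}(a\rho)$: ethnic composition of individual n
$x_n \sim \mathrm{Expon}\big(\sum_k \pi_{nk} \theta_k\big)$: feature values of individual n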
9Bayesian Partial Membership Model
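Generative Process
$\rho \sim \mathrm{Dir}(\alpha)$, $a \sim \mathrm{Exp}(b)$
For each k: $\theta_k \sim \mathrm{Conj}(\lambda, \nu)$
For each n: $\pi_n \sim \mathrm{Dir}(a\rho)$, $x_n \sim \mathrm{Expon}\big(\sum_k \pi_{nk} \theta_k\big)$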
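A minimal sampling sketch of a generative process of this kind, under stated assumptions: population proportions `rho` drawn from a Dirichlet, a scalar `a` from an exponential, per-item memberships from Dirichlet(a * rho), and binary features from a Bernoulli whose log-odds are the membership-weighted combination of cluster log-odds. The Bernoulli clusters, the uniform stand-in for the conjugate prior, the rate parameterization of the exponential, and all names are illustrative choices, not the authors' implementation.

```python
import math
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    if total == 0.0:
        # Numerical underflow for tiny concentrations: fall back to a
        # one-hot vector chosen in proportion to alpha.
        k = rng.choices(range(len(alpha)), weights=alpha)[0]
        return [1.0 if j == k else 0.0 for j in range(len(alpha))]
    return [v / total for v in g]

def sample_bpm(N, K, D, alpha=1.0, b=1.0, seed=0):
    rng = random.Random(seed)
    # For each k: natural parameters (log-odds) of a D-dim Bernoulli cluster.
    # Assumption: a uniform prior on each Bernoulli mean stands in for the
    # conjugate prior; means are clipped to keep the log-odds finite.
    theta = []
    for _ in range(K):
        probs = [min(max(rng.random(), 1e-6), 1 - 1e-6) for _ in range(D)]
        theta.append([math.log(p / (1 - p)) for p in probs])
    rho = dirichlet([alpha] * K, rng)   # population-level composition
    a = rng.expovariate(b)              # exponential with rate b (assumed)
    pis, xs = [], []
    for _ in range(N):
        pi = dirichlet([a * r for r in rho], rng)   # memberships of item n
        # Convex combination of natural parameters, one per feature.
        eta = [sum(pi[k] * theta[k][d] for k in range(K)) for d in range(D)]
        x = [1 if rng.random() < 1.0 / (1.0 + math.exp(-e)) else 0 for e in eta]
        pis.append(pi)
        xs.append(x)
    return theta, rho, a, pis, xs

theta, rho, a, pis, xs = sample_bpm(N=20, K=3, D=5, seed=1)
```

Each row of `pis` sums to one, so every item's features come from a single distribution whose parameters interpolate between the clusters.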
10BPM Sampled Data
- Each of the four plots shows 3000 data points
drawn from the BPM with the same 3
full-covariance Gaussian clusters.
11BPM Theory
Lemma 1: In the limit as $a \to 0$, the exponential family BPM is a mixture of K components with mixing proportions $\rho$.
Lemma 2: In the limit as $a \to \infty$, the exponential family BPM has only one component with natural parameters $\sum_k \rho_k \theta_k$.
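The two limits can be illustrated numerically, assuming memberships are drawn from Dirichlet(a * rho), where `rho` holds the population proportions. The values of `rho` and `a` and the helper names are hypothetical. This is a simulation of the limiting behaviour, not a proof of the lemmas.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    if total == 0.0:
        # Underflow for tiny concentrations: use a one-hot vector instead.
        k = rng.choices(range(len(alpha)), weights=alpha)[0]
        return [1.0 if j == k else 0.0 for j in range(len(alpha))]
    return [v / total for v in g]

def membership_summary(a, rho, n_samples, rng):
    """Return (average largest membership, average membership vector)."""
    max_sum = 0.0
    mean = [0.0] * len(rho)
    for _ in range(n_samples):
        pi = dirichlet([a * r for r in rho], rng)
        max_sum += max(pi)
        mean = [m + p for m, p in zip(mean, pi)]
    return max_sum / n_samples, [m / n_samples for m in mean]

rng = random.Random(0)
rho = [0.2, 0.3, 0.5]                                # hypothetical proportions
small = membership_summary(0.01, rho, 2000, rng)     # a close to 0
large = membership_summary(1e4, rho, 2000, rng)      # a very large
```

For small `a` the largest membership is close to 1 (nearly one-hot vectors, chosen with frequencies close to `rho`, as in a mixture model). For large `a` every membership vector is close to `rho` itself, so all items share essentially one combined component.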
12BPM Learning
- Want to infer all unknowns given X
- We treat $\alpha$, $\lambda$, $\nu$, and $b$ as fixed hyperparameters
- Goal: Infer $p(\Theta, \Pi, \rho, a \mid X)$ using MCMC
- All parameters in the BPM are continuous, so we can use Hybrid Monte Carlo.
- Hybrid Monte Carlo is an efficient MCMC method that uses gradient information to find high probability regions.
13Synthetic Data
- Generated synthetic binary data set of 50 data
points, 32 dimensions, and 3 clusters. Ran HMC
sampler for 4000 iterations. Computed
14Senate Roll Call Data (2001-2002)
- (99 senators + 1 outcome) x 633 votes
- K = 2 multivariate Bernoulli clusters
- Model adapted to handle missing data
15Senate Roll Call Comparisons
Blue: Senator Schumer
Black: Outcome
Red: Senator Ensign
- Partial membership values are very sensitive to the exponent $\gamma$.
- For no value of $\gamma$ do the membership values make sense.
16Senate Roll Call Comparisons
- Dirichlet Process Mixtures (DPM)
- DPM confidently infers 4 clusters
- Uncertainty is not a good substitute for partial membership
[Table: negative log predictive probability (in bits) across senators. Rows: BPM, DPM. Columns: Mean, Median, Min, Max, Outcome.]
17Image Data
- 329 Tower and Sunset images with 240 simple binary texture and color features and K = 2 clusters.
18Related Work
- Latent Dirichlet Allocation (LDA)
- Mixed Membership Models
- Fuzzy Clustering
- Exponential Family PCA
19Future Work
- Would be nice to have a nonparametric version.
- Obvious thing to try: Hierarchical Dirichlet Processes. But this would require summing over all infinitely many elements of $\pi_n$, which isn't computationally feasible. Also semantically not very nice.
- Indian Buffet Processes might work. Sample an IBP matrix with the interpretation that a 1 means having some non-zero amount of membership in that cluster, then draw the exact continuous amount separately.
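A speculative sketch of the Indian Buffet Process idea, for illustration only. The binary matrix is drawn with the standard sequential IBP construction. The second step, a symmetric Dirichlet over the switched-on clusters, is one assumed way to "draw the continuous amount separately" and is not specified in the slides. All function names and parameter values are hypothetical.

```python
import math
import random

def poisson(lam, rng):
    """Poisson draw via Knuth's method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def sample_ibp(n_rows, alpha, rng):
    """Binary matrix Z from the Indian Buffet Process, as a list of rows."""
    counts = []   # counts[j] = number of earlier rows that use cluster j
    rows = []
    for i in range(1, n_rows + 1):
        # Reuse an existing cluster j with probability counts[j] / i.
        row = [1 if rng.random() < m / i else 0 for m in counts]
        # Open Poisson(alpha / i) brand-new clusters for this row.
        new = poisson(alpha / i, rng)
        row += [1] * new
        counts = [m + z for m, z in zip(counts, row)] + [1] * new
        rows.append(row)
    K = len(counts)
    return [r + [0] * (K - len(r)) for r in rows]   # pad rows to equal width

def memberships_from_ibp(Z, rng, concentration=1.0):
    """Draw continuous memberships only on the clusters switched on in Z.

    Assumption: a symmetric Dirichlet over the active clusters. Rows with
    no active cluster are left as all zeros in this sketch.
    """
    out = []
    for row in Z:
        g = [rng.gammavariate(concentration, 1.0) if z else 0.0 for z in row]
        total = sum(g)
        out.append([v / total for v in g] if total > 0 else g)
    return out

rng = random.Random(0)
Z = sample_ibp(n_rows=10, alpha=2.0, rng=rng)
Pi = memberships_from_ibp(Z, rng)
```

The number of columns in `Z` is not fixed in advance, which is what makes the construction nonparametric: each row only ever touches finitely many clusters.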
20Conclusions
- Developed a fully probabilistic approach to data modeling with partial membership.
- Uses continuous latent variables and can be seen as a relaxation of clustering with standard mixture models.
- Used Hybrid Monte Carlo for inference, which was extremely fast (finding sensible partial membership structure after very few samples).
21Thank You
22Partial Membership
- Cornerstone of fuzzy set theory
- Traditional set theory: items belong to a set or they don't, {0,1}.
- Fuzzy set theory: membership function $\mu_A(x)$, where $0 \le \mu_A(x) \le 1$ denotes the degree to which $x$ belongs to set $A$.
- Fuzzy logic versus probabilistic models
- Misguided arguments that fuzzy logic is different from or supersedes probability theory.
- While it might be easy to dismiss fuzzy logic, its framework for representing partial membership has inspired many researchers.
- Google Scholar: over 45,000 fuzzy clustering papers. The most cited papers are cited as frequently as the most cited NIPS area papers.
23Related Work - Latent Dirichlet Allocation (LDA)
and Mixed Membership Models
- BPM generates data points at the document level of LDA (no word plate).
- Whereas LDA (or Mixed Membership models) assume words (or attributes) are drawn using $\pi_n$ as mixing proportions in a mixture model, and are factorized, the BPM uses $\pi_n$ to form a convex combination of natural parameters. Attributes are not drawn from a mixture model and need not be factorized.
- BPM: potentially faster MCMC sampling, since the BPM has all continuous parameters and LDA must infer a discrete topic assignment for each word.
24Mixed Membership Model Generation
25Related Work - Fuzzy Clustering
- Fuzzy k-means iteratively minimizes the following objective:
$J = \sum_{n} \sum_{k} \pi_{nk}^{\gamma} \, d^2(x_n, c_k)$
- where d is the distance between a data point and a cluster center, $\pi_{nk}$ is the degree of membership of a data point in a cluster, and $\gamma$ controls the amount of partial membership ($\gamma = 1$ is normal k-means).
- None of these variables have probabilistic interpretations.
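A compact sketch of fuzzy k-means on 1-D data, using the standard alternating updates for an objective of the form sum over n and k of membership^gamma times squared distance. The data, the initial centers, the choice gamma = 2, and the function name are illustrative assumptions. The membership update is the usual closed form for gamma > 1.

```python
def fuzzy_kmeans(xs, centers, gamma=2.0, n_iter=20):
    """Alternating minimization of J = sum_n sum_k pi_nk^gamma * (x_n - c_k)^2."""
    history = []
    for _ in range(n_iter):
        # Membership update: pi_nk proportional to d_nk^(-2 / (gamma - 1)).
        pi = []
        for x in xs:
            d2 = [(x - c) ** 2 for c in centers]
            if min(d2) == 0.0:
                # A point sitting exactly on a center belongs fully to it.
                row = [1.0 if v == 0.0 else 0.0 for v in d2]
            else:
                row = [v ** (-1.0 / (gamma - 1.0)) for v in d2]
            total = sum(row)
            pi.append([r / total for r in row])
        # Center update: mean of the data weighted by pi_nk^gamma.
        centers = [
            sum(pi[n][k] ** gamma * xs[n] for n in range(len(xs)))
            / sum(pi[n][k] ** gamma for n in range(len(xs)))
            for k in range(len(centers))
        ]
        # Record the objective after both updates.
        history.append(sum(pi[n][k] ** gamma * (xs[n] - centers[k]) ** 2
                           for n in range(len(xs))
                           for k in range(len(centers))))
    return centers, pi, history

xs = [-2.1, -2.0, -1.9, 0.0, 1.9, 2.0, 2.1]   # hypothetical data
centers, pi, history = fuzzy_kmeans(xs, centers=[-1.0, 1.0])
```

The point at 0.0 ends up with memberships near (.5, .5), but the values depend directly on the chosen exponent, and nothing in the procedure gives them a probabilistic meaning.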
26Related Work - Exponential Family PCA
- Originally formulated in terms of Bregman divergences, it can be seen as a non-Bayesian version of the BPM where the $\pi_n$s are not constrained (to normalize to 1 or be positive).
- Not a convex combination of natural parameters with the same sort of partial membership interpretation.
- If we wanted, we could relax these same constraints to get a Bayesian version of Exponential Family PCA, but we'd have to tweak the model, e.g. a Gaussian prior on $\pi_n$.
27BPM Learning
- Hybrid Monte Carlo is an MCMC method that uses gradient information.
- Hybrid Monte Carlo simulates the dynamics of a system with continuous state variable $\xi$ on an energy function $E(\xi) = -\log p(X, \xi)$.
- The gradients of the energy function provide forces on the state variables which encourage the system to find high probability regions, while maintaining detailed balance.
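A minimal sketch of one Hybrid (Hamiltonian) Monte Carlo update with leapfrog integration and a Metropolis accept/reject step. The target here is a toy 2-D standard Gaussian, chosen only so the example runs on its own. It is not the BPM posterior, and the step size, trajectory length, and function names are assumptions.

```python
import math
import random

def hmc_step(x, energy, grad, step_size, n_leapfrog, rng):
    """One HMC update for a state x (list of floats) on the given energy."""
    p0 = [rng.gauss(0.0, 1.0) for _ in x]          # fresh Gaussian momentum
    x_new = list(x)
    # Leapfrog: half step in momentum, then alternating full steps.
    g = grad(x_new)
    p = [pi - 0.5 * step_size * gi for pi, gi in zip(p0, g)]
    for step in range(n_leapfrog):
        x_new = [xi + step_size * pi for xi, pi in zip(x_new, p)]
        g = grad(x_new)
        # The final momentum update is only a half step.
        scale = step_size if step < n_leapfrog - 1 else 0.5 * step_size
        p = [pi - scale * gi for pi, gi in zip(p, g)]
    # Metropolis correction on the total energy keeps detailed balance.
    h_old = energy(x) + 0.5 * sum(v * v for v in p0)
    h_new = energy(x_new) + 0.5 * sum(v * v for v in p)
    if rng.random() < math.exp(min(0.0, h_old - h_new)):
        return x_new
    return x

def energy(x):
    """Toy energy: negative log density of a standard Gaussian (up to a constant)."""
    return 0.5 * sum(v * v for v in x)

def grad(x):
    """Gradient of the toy energy."""
    return list(x)

rng = random.Random(0)
x = [3.0, -3.0]
draws = []
for _ in range(3000):
    x = hmc_step(x, energy, grad, step_size=0.2, n_leapfrog=8, rng=rng)
    draws.append(x)
```

Applying the same update to a model with simplex-constrained or positive variables would additionally need a reparameterization to unconstrained coordinates, which this sketch does not attempt.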
28Bregman Divergence
- $D_F(p, q) = F(p) - F(q) - \langle \nabla F(q), p - q \rangle$
- F is a strictly convex function; p and q are points.
- Intuitively, the difference between the value of F at p and the value of the first order Taylor expansion of F around q, evaluated at p.
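The definition can be checked on two well-known special cases. This sketch assumes F is differentiable and takes its gradient as an explicit function. The helper names are illustrative.

```python
import math

def bregman(F, grad_F, p, q):
    """F(p) minus the first order Taylor expansion of F around q, evaluated at p."""
    linear = sum(g * (pi - qi) for g, pi, qi in zip(grad_F(q), p, q))
    return F(p) - F(q) - linear

def sq_norm(v):
    return sum(x * x for x in v)

def sq_norm_grad(v):
    return [2.0 * x for x in v]

def neg_entropy(v):
    return sum(x * math.log(x) for x in v)

def neg_entropy_grad(v):
    return [math.log(x) + 1.0 for x in v]

# F = squared norm gives the squared Euclidean distance.
d_euclid = bregman(sq_norm, sq_norm_grad, [1.0, 2.0], [3.0, 5.0])

# F = negative entropy gives the KL divergence when p and q each sum to 1.
p, q = [0.5, 0.5], [0.9, 0.1]
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)
```

Different choices of F therefore recover different familiar divergences.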
29LDA Review
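- 1. For z = 1, ..., K:
- Draw topic-word distribution $\beta_z \sim \mathrm{Dirichlet}(\eta)$
- 2. For d = 1, ..., D:
- a) Draw topic proportions $\theta_d \sim \mathrm{Dirichlet}(\alpha)$
- b) For n = 1, ..., N_d:
- i. Draw topic assignment $z_{dn} \sim \mathrm{Multinomial}(\theta_d)$
- ii. Draw word $w_{dn} \sim \mathrm{Multinomial}(\beta_{z_{dn}})$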
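The nested loops above can be sketched as a sampler, assuming the standard smoothed LDA draws: Dirichlet topic-word distributions, Dirichlet per-document topic proportions, then a discrete topic and a discrete word for each position. The hyperparameter names `eta` and `alpha`, the vocabulary size `V`, and the function names are assumptions, since the slide's symbols did not survive.

```python
import random

def dirichlet(alpha, rng):
    """Sample from Dirichlet(alpha) by normalizing Gamma draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [v / total for v in g]

def sample_lda(K, D, V, doc_lengths, eta=0.5, alpha=0.5, seed=0):
    rng = random.Random(seed)
    # 1. For each topic z: draw a distribution over the V vocabulary words.
    beta = [dirichlet([eta] * V, rng) for _ in range(K)]
    docs = []
    # 2. For each document d:
    for d in range(D):
        theta = dirichlet([alpha] * K, rng)          # a) topic proportions
        words = []
        for _ in range(doc_lengths[d]):              # b) each word position
            z = rng.choices(range(K), weights=theta)[0]     # i. topic
            w = rng.choices(range(V), weights=beta[z])[0]   # ii. word
            words.append((z, w))
        docs.append(words)
    return beta, docs

beta, docs = sample_lda(K=3, D=4, V=10, doc_lengths=[5, 8, 3, 6])
```

Note the discrete topic `z` drawn for every word: this is the per-word latent variable that a model with only continuous parameters does not need.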