Efficient Bayesian Algorithms for Clustering - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Efficient Bayesian Algorithms for Clustering
  • Katherine Heller
  • Gatsby Computational Neuroscience Unit
  • Women In Machine Learning Workshop
  • San Diego, CA

2
What Is Clustering?
  • Imagine we have data in some feature space
  • One of the most important goals of unsupervised
    learning is to discover meaningful clusters in
    data.
  • There are many clustering methods: spectral,
    hierarchical, k-means, mixture modeling, etc.
  • We take a model-based Bayesian approach to
    defining a cluster and evaluate cluster
    membership in this paradigm.

3
Marginal Likelihoods
  • We use marginal likelihoods to evaluate cluster
    membership
  • The marginal likelihood is defined as
    p(D) = ∫ p(D | θ) p(θ) dθ
  • It can be interpreted as the probability that
    all data points in D were generated from the
    same model, with unknown parameters θ
  • Used to compare cluster models
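For intuition, this integral has a closed form for simple conjugate models. A minimal sketch for one-dimensional binary data under a Bernoulli likelihood with a Beta(α, β) prior (the choice of model here is illustrative, not the one used in the talk's experiments):

```python
import math

def log_marginal_likelihood(xs, alpha=1.0, beta=1.0):
    """Log marginal likelihood of binary data under a Bernoulli model
    with a Beta(alpha, beta) prior on the unknown parameter theta:
    p(D) = integral p(D | theta) p(theta) dtheta
         = B(alpha + k, beta + n - k) / B(alpha, beta),
    where k = number of ones and n = number of data points."""
    n, k = len(xs), sum(xs)
    return (math.lgamma(alpha + k) + math.lgamma(beta + n - k)
            - math.lgamma(alpha + beta + n)
            + math.lgamma(alpha + beta)
            - math.lgamma(alpha) - math.lgamma(beta))

# Data that looks like one coherent cluster scores higher than mixed data:
print(log_marginal_likelihood([1, 1, 1, 1]))  # ≈ -1.61
print(log_marginal_likelihood([1, 0, 1, 0]))  # ≈ -3.40
```

Comparing these scores for different groupings of the data is exactly how marginal likelihoods are used to compare cluster models.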

4
Outline
  • Introduction
  • Clustering
  • Marginal Likelihoods
  • Bayesian Hierarchical Clustering
  • Clustering on Demand
  • Bayesian Sets
  • Content Based Image Retrieval
  • Conclusions

5
Traditional Hierarchical Clustering
  • As in Duda and Hart (1973)
  • Many distance metrics are possible

6
Limitations of Traditional Hierarchical
Clustering Algorithms
  • How many clusters should there be?
  • It is hard to choose a distance metric
  • They do not define a probabilistic model of the
    data, so they cannot
  • Predict the probability or cluster assignment of
    new data points
  • Be compared to or combined with other
    probabilistic models
  • Our Goal: To overcome these limitations by
    defining a novel statistical approach to
    hierarchical clustering

7
Bayesian Hierarchical Clustering: Building the
Tree
  • The algorithm is virtually identical to
    traditional hierarchical clustering, except that
    instead of distance it uses marginal likelihood
    to decide on merges.
  • For each potential merge D_k = D_i ∪ D_j
    it compares two hypotheses
  • H1: all data in D_k came from one cluster
  • H2: data in D_k came from some other clustering
    consistent with the subtrees T_i and T_j
  • Prior: π_k = p(H1)
  • Posterior probability of the merged hypothesis:
    r_k = π_k p(D_k | H1) / p(D_k | T_k)
  • Probability of the data given the tree:
    p(D_k | T_k) = π_k p(D_k | H1)
                 + (1 − π_k) p(D_i | T_i) p(D_j | T_j)

Heller and Ghahramani ICML 2005
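The merge test above can be sketched numerically. A minimal sketch in log space, assuming the prior π_k is supplied externally (the full BHC algorithm derives it from a Dirichlet process prior with concentration parameter α):

```python
import math

def merge_posterior(log_ph1, log_pi, log_p_left, log_p_right):
    """Posterior probability r_k that all data in a merged node came from
    a single cluster (H1), versus any other clustering consistent with
    the two subtrees (H2).

    log_ph1     : log p(D_k | H1), marginal likelihood of the pooled data
    log_pi      : log pi_k, prior probability of the merged hypothesis
    log_p_left  : log p(D_i | T_i), evidence of the left subtree
    log_p_right : log p(D_j | T_j), evidence of the right subtree
    Returns (r_k, log p(D_k | T_k))."""
    log_h1 = log_pi + log_ph1
    log_h2 = math.log1p(-math.exp(log_pi)) + log_p_left + log_p_right
    # log-sum-exp for p(D_k|T_k) = pi_k p(D_k|H1)
    #                            + (1 - pi_k) p(D_i|T_i) p(D_j|T_j)
    m = max(log_h1, log_h2)
    log_tree = m + math.log(math.exp(log_h1 - m) + math.exp(log_h2 - m))
    return math.exp(log_h1 - log_tree), log_tree
```

A merge with r_k > 0.5 favors the single-cluster hypothesis; this is also what suggests where to cut the tree.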
8
Comparison
Bayesian Hierarchical Clustering
Traditional Hierarchical Clustering
9
Summary of BHC
  • We developed a Bayesian Hierarchical Clustering
    algorithm which
  • Is simple, deterministic and fast (no MCMC,
    one-pass, etc.)
  • Can take as input any simple probabilistic model
    p(x | θ) and gives as output a mixture of these
    models
  • Suggests where to cut the tree and how many
    clusters there are in the data
  • Gives more reasonable results than traditional
    hierarchical clustering algorithms
  • This algorithm also
  • Recursively computes an approximation to the
    marginal likelihood of a Dirichlet Process
    Mixture
  • which can be easily turned into a new lower
    bound

10
Results: a Toy Example
11
Results: a Toy Example
12
Predicting New Data Points
13
4 Newsgroups Results
800 examples, 50 attributes: rec.sport.baseball,
rec.sport.hockey, rec.autos, sci.space
14
Newsgroups Average Linkage HC
15
Newsgroups Bayesian HC
16
Clustering On Demand
Assume a universe of objects D.
17
Clustering On Demand
18
Bayesian Sets Approach
  • Rank each object in D by how well it would
    fit into a set which includes the query items
    (i.e. how relevant it is to the query)
  • Use a Bayesian (model-based probabilistic)
    relevance criterion
  • Limit output to the top few items

Ghahramani and Heller NIPS 2005
19
Bayesian Sets Criterion
We can write this score as
score(x) = p(x, Q) / ( p(x) p(Q) )
where Q is the query set
20
Bayesian Sets Criterion
This has a nice intuitive interpretation: it
compares the probability that x and the query set
were generated by the same model to the
probability that they were generated independently
21
Bayesian Sets Criterion
22
Sparse Binary Data
E.g. binary feature vectors x = (x_1, …, x_J)
If we use a multivariate Bernoulli model
  p(x | θ) = ∏_j θ_j^{x_j} (1 − θ_j)^{1 − x_j}
with conjugate Beta prior
  p(θ | α, β) = ∏_j Beta(θ_j; α_j, β_j)
the log score is linear in x:
  log score(x) = c + ∑_j q_j x_j
where
  q_j = log(α̃_j / α_j) − log(β̃_j / β_j)
and
  α̃_j = α_j + ∑_{i∈Q} x_{ij},  β̃_j = β_j + N − ∑_{i∈Q} x_{ij},  N = |Q|
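A minimal sketch of this computation, vectorized over items; the default pseudo-counts proportional to the feature means are an assumption (any positive α, β work):

```python
import numpy as np

def bayesian_sets_scores(X, query_idx, alpha=None, beta=None):
    """Score each binary item (row of X) against a query set, following
    the Bernoulli-Beta criterion: log score(x) = c + sum_j q_j x_j."""
    X = np.asarray(X, dtype=float)
    N = len(query_idx)
    if alpha is None or beta is None:
        # assumed default: pseudo-counts proportional to feature means
        m = X.mean(axis=0)
        alpha, beta = 2.0 * m + 1e-6, 2.0 * (1.0 - m) + 1e-6
    s = X[query_idx].sum(axis=0)      # per-feature counts in the query set
    alpha_t = alpha + s               # updated Beta parameters
    beta_t = beta + N - s
    q = np.log(alpha_t) - np.log(alpha) - np.log(beta_t) + np.log(beta)
    # the additive constant c is the same for every item, so it is
    # dropped: it does not change the ranking
    return X @ q

# Items sharing the query's active features rank highest:
X = np.array([[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [1, 0, 0, 0]])
scores = bayesian_sets_scores(X, query_idx=[0, 1])
print(scores)
```

Because the score reduces to a single sparse matrix-vector product, ranking even large item sets is fast, which is what makes the interactive queries below practical.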
23
Results: EachMovie
1813 people by 1532 movies
24
A Bayesian Content-Based Image Retrieval System
  • We can use the Bayesian Sets method as the basis
    of a content-based image retrieval system

25
The Image Retrieval Prototype System
  • The Algorithm
  • Input: query word w, e.g. "penguins"
  • Find all training images with label w
  • Take the binary feature vectors for these
    training images as query set and use Bayesian
    Sets algorithm
  • For each image, x, in the unlabelled test set,
    we compute score(x) which measures the
    probability that x belongs in the set of images
    with the label w.
  • Return the images with the highest score
  • The algorithm is very fast
  • about 0.2 sec on this laptop to query 22,000
    test images

Heller and Ghahramani CVPR 2006
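The query loop above can be sketched as follows; `retrieve`, `score_fn`, and the label sets are illustrative names, not the prototype system's actual interface:

```python
import numpy as np

def retrieve(word, train_feats, train_labels, test_feats, score_fn, k=9):
    """Sketch of the retrieval loop: gather labelled training images as
    the query set, score every test image against it, return the top k.
    score_fn(query_feats, test_feats) is any set-relevance scorer,
    e.g. the Bayesian Sets criterion."""
    # 1. find all training images whose label set contains `word`
    idx = [i for i, labels in enumerate(train_labels) if word in labels]
    query = train_feats[idx]
    # 2. score every unlabelled test image against that query set
    scores = score_fn(query, test_feats)
    # 3. return the indices of the k highest-scoring test images
    return np.argsort(-scores)[:k]
```

For example, with a toy scorer that ranks by similarity to the query mean, `retrieve("penguins", ...)` returns the test images most like the penguin-labelled training images.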
26
Example Queries
Query: Desert
Query: Pet
Query: Sign
Query: Building
Query: Penguins
Query: Eiffel
27
Example Training Images for Desert
28
Results for Image Retrieval

NNall - nearest neighbors to any member of the query set
NNmean - nearest neighbors to the mean of the query set
BO - Behold Search online, www.beholdsearch.com;
     A. Yavlinsky, E. Schofield and S. Rüger (CIVR, 2005)
29
Future Work
  • Exploring bounds on the marginal likelihood of a
    DPM
  • How tight is the bound?
  • Improved structures for combinatorial
    approximation
  • Since the score is probabilistic, it should be
    possible to find a principled threshold for the
    number of items in the returned set
  • Automated Analogical Reasoning with Relational
    Data
  • Image Annotation
  • Incorporate relevance feedback

30
Conclusions
  • Presented work on Bayesian hierarchical
    clustering, information retrieval from sets of
    items, and image retrieval, all based on
    computing marginal likelihoods.
  • There are many interesting directions in which
    to take this work.

31
Acknowledgements
  • Collaborators
  • Zoubin Ghahramani
  • Ricardo Silva
  • Venkat Ramesh
  • Thanks to
  • David MacKay, Avrim Blum, Sam Roweis, Alexei
    Yavlinsky, Simon Tong