Title: Generative Models
Slide 1: Generative Models
- Tamara L Berg
- Stony Brook University
Slide 2: Generative vs Discriminative
- Discriminative version - build a classifier to
discriminate between monkeys and non-monkeys.
P(monkey | image)
Slide 3: Generative vs Discriminative
- Generative version - build a model that generates
images containing monkeys
P(image | monkey)
P(image | not monkey)
Slide 4: Generative vs Discriminative
- Can use Bayes rule to compute p(monkey | image) if
we know p(image | monkey)
Slide 5: Generative vs Discriminative
- Can use Bayes rule to compute p(monkey | image) if
we know p(image | monkey)
Discriminative
Generative
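The Bayes rule step referred to on slides 4-5 can be written out explicitly (a standard identity; the class prior P(monkey) and the normalizer P(image) are supplied by the generative model):

```latex
P(\text{monkey} \mid \text{image})
  = \frac{P(\text{image} \mid \text{monkey})\, P(\text{monkey})}{P(\text{image})},
\qquad
P(\text{image}) = \sum_{c \in \{\text{monkey},\,\text{not monkey}\}} P(\text{image} \mid c)\, P(c).
```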
Slide 6: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 7: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 8: Slide from Dan Klein
Slide 9: Random Variables
Random variables
Slide 10: Random Variables
Random variables
A random variable is some aspect of the world
about which we (may) have uncertainty. Random
variables can be binary (e.g. true/false,
spam/ham), take on a discrete set of values
(e.g. Spring, Summer, Fall, Winter), or be
continuous (e.g. in [0, 1]).
Slide 11: Joint Probability Distribution
Random variables
Also written
Gives a real value for all possible assignments.
Slide 12: Queries
Also written
Given a joint distribution, we can reason about
unobserved variables given observations
(evidence)
Stuff you care about
Stuff you already know
Slide 13: Representation
Also written
Slide 14: Representation
Also written
Graphical Models!
Slide 15: Representation
Also written
Graphical models represent joint probability
distributions more economically, using a set of
local relationships among variables.
Slide 16: Graphical Models
- Graphical models offer several useful properties:
- 1. They provide a simple way to visualize the
structure of a probabilistic model and can be
used to design and motivate new models.
- 2. Insights into the properties of the model,
including conditional independence properties,
can be obtained by inspection of the graph.
- 3. Complex computations, required to perform
inference and learning in sophisticated models,
can be expressed in terms of graphical
manipulations, in which underlying mathematical
expressions are carried along implicitly.
from Chris Bishop
Slide 17: Main kinds of models
- Undirected (also called Markov Random Fields) -
links express constraints between variables.
- Directed (also called Bayesian Networks) - have
a notion of causality -- one can regard an arc
from A to B as indicating that A "causes" B.
Slide 18: Main kinds of models
- Undirected (also called Markov Random Fields) -
links express constraints between variables.
- Directed (also called Bayesian Networks) - have
a notion of causality -- one can regard an arc
from A to B as indicating that A "causes" B.
Slide 19: Directed Graphical Models
Nodes
Edges
- Each node is associated with a random variable
Slide 20: Directed Graphical Models
Nodes
Edges
- Each node is associated with a random variable
- Definition of joint probability in a graphical
model
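The equation defining this joint probability did not survive extraction; the standard factorization it refers to, for nodes x_1, ..., x_n with parent sets pa(x_i), is:

```latex
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\big(x_i \mid \mathrm{pa}(x_i)\big),
```

where pa(x_i) denotes the parents of node x_i in the directed graph.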
Slide 21: Example
Slide 22: Example
Joint Probability
Slide 23: Example
Slide 24: Conditional Independence
- Independence
- Conditional Independence
Or,
Slide 25: Conditional Independence
Slide 26: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Slide 27: Example
Joint Probability
Slide 28: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Joint distribution from the example graph
Slide 29: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Joint distribution from the example graph
Missing variables in the local conditional
probability functions correspond to missing edges
in the underlying graph. Removing an edge into
node i eliminates an argument from the
conditional probability factor.
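As a small worked illustration (a hypothetical three-node graph, not necessarily the one pictured on the slide): the chain rule always gives

```latex
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2),
```

while a graph containing only the edges x_1 → x_2 → x_3 drops x_1 from the last factor, reflecting the missing edge x_1 → x_3:

```latex
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2).
```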
Slide 30: Observations
- Graphs can have observed (shaded) and unobserved
nodes. If nodes are always unobserved they are
called hidden or latent variables.
- Probabilistic inference in graphical models is
the problem of computing a conditional
probability distribution over the values of some
of the nodes (the hidden or unobserved nodes),
given the values of other nodes (the evidence or
observed nodes).
Slide 31: Inference: computing conditional probabilities
Conditional Probabilities
Marginalization
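With x_E denoting the evidence (observed) nodes and x_H the hidden nodes, the two operations named on this slide are:

```latex
p(x_H \mid x_E) = \frac{p(x_H, x_E)}{p(x_E)},
\qquad
p(x_E) = \sum_{x_H} p(x_H, x_E).
```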
Slide 32: Inference Algorithms
- Exact algorithms
- Elimination algorithm
- Sum-product algorithm
- Junction tree algorithm
- Sampling algorithms
- Importance sampling
- Markov chain Monte Carlo
- Variational algorithms
- Mean field methods
- Sum-product algorithm and variations
- Semidefinite relaxations
Slide 33: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 34: Exchangeability
- De Finetti theorem of exchangeability (bag of
words theorem): the joint probability
distribution underlying the data is invariant to
permutation.
Slide 35: Plates
- Plates - "macro" that allows subgraphs to be
replicated (graphical representation of the De
Finetti theorem).
Slide 36: Bag of words for text
- Represent documents as bags of words
Slide 37: Example
- Doc1: the quick brown fox jumped
- Doc2: brown quick jumped fox the
- Would a bag of words model represent these two
documents differently? (See the sketch below.)
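A minimal sketch of the point in plain Python; the two documents are the ones from the slide:

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words keeps only word counts and discards order,
# so the two documents get identical representations.
bag1 = Counter(doc1.split())
bag2 = Counter(doc2.split())
print(bag1 == bag2)  # True
```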
Slide 38: Bag of words for images
- Represent images as a bag of words
Slide 39: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 40: A Simple Example: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 41: A Simple Example: Naïve Bayes
C - Class, F - Features (n features, shown as a plate)
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 42: Slide from Dan Klein
Slide 43: Slide from Dan Klein
Slide 44: Slide from Dan Klein
Slide 45: Percentage of documents in the training set
labeled as spam/ham
Slide from Dan Klein
Slide 46: In the documents labeled as spam, the
occurrence percentage of each word (i.e. times the
word occurred / total words).
Slide from Dan Klein
Slide 47: In the documents labeled as ham, the
occurrence percentage of each word (i.e. times the
word occurred / total words).
Slide from Dan Klein
Slide 48: Classification
The class that maximizes
Slide 49: Classification
- In practice
- Multiplying lots of small probabilities can
result in floating point underflow
Slide 50: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
Slide 51: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
- Since log is a monotonic function, the class with
the highest score does not change.
Slide 52: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
- Since log is a monotonic function, the class with
the highest score does not change.
- So, what we usually compute in practice is:
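A minimal sketch of this log-space computation; the class priors and word probabilities below are made up for illustration, not learned parameters:

```python
import math

def classify(word_counts, class_priors, word_probs):
    """Return the class c maximizing log P(c) + sum_w n(w) * log P(w | c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        score = math.log(prior)
        for word, count in word_counts.items():
            # Unseen words get a tiny smoothed probability to avoid log(0).
            score += count * math.log(word_probs[c].get(word, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy spam/ham example with invented parameters.
priors = {"spam": 0.3, "ham": 0.7}
probs = {"spam": {"free": 0.05, "meeting": 0.001},
         "ham":  {"free": 0.002, "meeting": 0.03}}
print(classify({"free": 2, "meeting": 1}, priors, probs))  # "spam"
```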
Slide 53: Naïve Bayes for modeling text/metadata topics
Slide 54: Harvesting Image Databases from the Web
(Schroff, F., Criminisi, A. and Zisserman, A.)
- Download images from the web via a search query
(e.g. penguin).
- Re-rank images using a naïve Bayes model trained
on text surrounding the images and meta-data
features (image alt tag, image title tag, image
filename).
- Top ranked images used to train an SVM classifier
to further improve ranking.
Slide 55: Results
Slide 56: Results
Slide 57: Naive Bayes is Not So Naive
- Naïve Bayes: first and second place in the KDD-CUP 97
competition, among 16 (then) state of the art
algorithms.
  - Goal: financial services industry direct mail
response prediction model. Predict if the
recipient of mail will actually respond to the
advertisement. 750,000 records.
- Robust to irrelevant features: irrelevant features
cancel each other without affecting results.
- Very good in domains with many equally important
features.
- A good dependable baseline for text
classification (but not the best)!
- Optimal if the independence assumptions hold: if
the assumed independence is correct, then it is the
Bayes optimal classifier for the problem.
- Very fast: learning with one pass over the data;
testing linear in the number of attributes and
document collection size.
- Low storage requirements.
Slide from Mitch Marcus
Slide 58: Naïve Bayes on images
Slide 59: Visual Categorization with Bags of Keypoints
(Gabriella Csurka, Christopher R. Dance, Lixin Fan,
Jutta Willamowski, Cédric Bray)
Slide 60: Method
- Steps:
- Detect and describe image patches.
- Assign patch descriptors to a set of
predetermined clusters (a vocabulary) with a
vector quantization algorithm (see the sketch
after this list).
- Construct a bag of keypoints, which counts the
number of patches assigned to each cluster.
- Apply a multi-class classifier (naïve Bayes),
treating the bag of keypoints as the feature
vector, and thus determine which category or
categories to assign to the image.
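A rough sketch of steps 2 and 3 (vector quantization of patch descriptors into a bag-of-keypoints histogram). The function name, array shapes, and random stand-in data are assumptions for illustration, not the authors' code:

```python
import numpy as np

def bag_of_keypoints(descriptors, vocabulary):
    """Assign each patch descriptor to its nearest visual word and
    return a histogram of word counts.

    descriptors : (n_patches, d) array of local descriptors (e.g. SIFT-like)
    vocabulary  : (n_words, d) array of cluster centers
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)  # nearest word per patch
    return np.bincount(assignments, minlength=len(vocabulary))

# Toy usage with random data standing in for real descriptors.
rng = np.random.default_rng(0)
hist = bag_of_keypoints(rng.normal(size=(200, 128)), rng.normal(size=(50, 128)))
print(hist.sum())  # 200: every patch is assigned to exactly one word
```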
Slide 61: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 62: Naive Bayes Parameters
- Problem: categorize images as one of 7 object
classes using a Naïve Bayes classifier.
- Classes: object categories (face, car, bicycle,
etc.)
- Features: images represented as a histogram whose
bins are the cluster centers, or visual word
vocabulary. Features are vocabulary counts.
- The prior over class labels is treated as uniform.
- How each feature depends on the class is learned
from training data: images labeled with category.
Slide 63: Results
Slide 64: Salient Object Localization (Berg & Berg)
Class-independent model to predict the saliency of a
given fg/bg division: a Naïve Bayes model.
Slide 65: Perceptual contrast cues
texture
focus
saturation
hue
value
Slide 66: Spatial Cues
Object size and location
Slide 67: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 68: Naïve Bayes features
Classes: salient / not salient.
Features: cues computed on foreground regions, cues
computed on background regions, and the chi-square
distance (contrast) between foreground and
background cues.
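One common form of the chi-square distance between a foreground cue histogram h^{fg} and a background cue histogram h^{bg} (the paper's exact normalization may differ):

```latex
\chi^2\big(h^{fg}, h^{bg}\big) \;=\; \sum_{i} \frac{\big(h^{fg}_i - h^{bg}_i\big)^2}{h^{fg}_i + h^{bg}_i}.
```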
Slide 69: Naïve Bayes features
Parameters: the prior over classes is treated as
uniform. The feature distributions are computed from
labeled training data (no overlap with test
categories).
Slide 70: Classification
- For test images (of any category):
compute the likelihood of a salient object over all
possible rectangular windows in the image, and
select the best region for each image.
Example image
Slide 71: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 72: pLSA
Slide 73: pLSA
Slide 74: Joint Probability
Marginalizing over topics determines the
conditional probability
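In the usual pLSA notation (documents d, words w, latent topics z), the marginalization referred to here is:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad
P(w, d) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d).
```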
Slide 75: Fitting the model
Need to: determine the topic vectors common to all
documents, and determine the mixture components
specific to each document.
Goal: a model that gives high probability to the
words that appear in the corpus.
Maximum likelihood estimation of the parameters is
obtained by maximizing the objective function:
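The objective function (whose rendered form did not survive extraction) is the corpus log-likelihood, with n(d, w) the number of times word w occurs in document d:

```latex
\mathcal{L} \;=\; \sum_{d} \sum_{w} n(d, w)\, \log P(w, d)
  \;=\; \sum_{d} \sum_{w} n(d, w)\, \log \Big( P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \Big).
```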
Slide 76: pLSA on images
Slide 77: Discovering objects and their location in
images (Josef Sivic, Bryan C. Russell, Alexei A.
Efros, Andrew Zisserman, William T. Freeman)
Documents: images. Words: visual words (vector
quantized SIFT descriptors). Topics: object
categories. Images are modeled as a mixture of
topics (objects).
Slide 78: Goals
- They investigate three areas:
- (i) topic discovery, where categories are
discovered by pLSA clustering on all available
images.
- (ii) classification of unseen images, where
topics corresponding to object categories are
learnt on one set of images, and then used to
determine the object categories present in
another set.
- (iii) object detection, where you want to
determine the location and approximate
segmentation of object(s) in each image.
Slide 79: (i) Topic Discovery
Most likely words for 4 learnt topics (face,
motorbike, airplane, car)
Slide 80: (ii) Image Classification
Confusion table for unseen test images against
pLSA trained on images containing four object
categories, but no background images.
Slide 81: (ii) Image Classification
Confusion table for unseen test images against
pLSA trained on images containing four object
categories, and background images. Performance is
not quite as good.
Slide 82: (iii) Topic Segmentation
Slide 83: (iii) Topic Segmentation
Slide 84: (iii) Topic Segmentation
Slide 85: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 86: LDA (David M Blei, Andrew Y Ng, Michael Jordan)
Slide 87: LDA
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 88: LDA
pLSA
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 89: LDA
Per-document topic proportions
Per-word topic assignment
Observed word
Dirichlet parameter
Slide 90: LDA
topics
Dirichlet parameter
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 91: Generating Documents
Slide 92: Joint Distribution
Joint distribution
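Written out as in Blei, Ng and Jordan, the per-document joint distribution over topic proportions θ, topic assignments z, and observed words w is:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  \;=\; p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).
```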
Slide 94: LDA on text
Topic discovery from a text corpus. Highly ranked
words for 4 topics.
Slide 95: LDA in Animals on the Web (Tamara L Berg,
David Forsyth)
Slide 96: Animals on the Web Outline
Harvest pictures of animals from the web using
Google Text Search. Select visual exemplars
using text based information (LDA).
Use vision and text cues to extend to similar
images.
Slide 97: Text Model
Run Latent Dirichlet Allocation (LDA) on the words
in collected web pages to discover 10 latent
topics for each category. Each topic defines a
distribution over words. Select the 50 most
likely words for each topic.
Example: Frog Topics
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess
river ball range eyes game species legs golden
bullfrog session head spring book deep spotted de
am free mouse information round poison yellow
upon collection nature paper pond re lived center
talk buy arrow common prince
2.) frog information january links common red
transparent music king water hop tree pictures
pond green people available book call press toad
funny pottery toads section eggs bullet photo
nature march movies commercial november re clear
eyed survey link news boston list frogs bull
sites butterfly court legs type dot blue
Slide 98: Select Exemplars
Rank images according to whether they have these
likely words near the image in the associated
page. Select up to 30 images per topic as
exemplars.
2.) frog information january links common red
transparent music king water hop tree pictures
pond green people available book call press ...
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess
river ball range eyes game species legs golden
bullfrog session head ...
Slide 99: Extensions to LDA for pictures
Slide 100: A Bayesian Hierarchical Model for Learning
Natural Scene Categories (Fei-Fei Li, Pietro Perona)
An unsupervised approach to learn and recognize
natural scene categories. A scene is represented
by a collection of local regions. Each region is
represented as part of a theme (e.g. rock,
grass etc) learned from data.
Slide 101: Generating Scenes
- 1.) Choose a category label (e.g. mountain
scene).
- 2.) Given the mountain class, draw a probability
vector that will determine what intermediate
theme(s) (grass, rock, etc.) to select while
generating each patch of the scene.
- 3.) For creating each patch in the image, first
determine a particular theme out of the mixture
of possible themes, and then draw a codeword
given this theme. For example, if a rock theme
is selected, this will in turn privilege some
codewords that occur more frequently in rocks
(e.g. slanted lines).
- 4.) Repeat the process of drawing both the theme
and codeword many times, eventually forming an
entire bag of patches that would construct a
scene of mountains (a sketch of this sampling
process follows below).
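A loose sketch of this sampling process in Python. All sizes and parameters below are invented for illustration (the paper learns them from data), and the draw of theme proportions given the category is only an approximation of the paper's exact hierarchical form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: they only shape the arrays in this sketch.
n_categories, n_themes, n_codewords, n_patches = 13, 40, 174, 100

# Made-up model parameters for illustration only.
class_prior = np.full(n_categories, 1.0 / n_categories)
theta = rng.dirichlet(np.ones(n_themes), size=n_categories)   # theme mix per category
beta = rng.dirichlet(np.ones(n_codewords), size=n_themes)     # codeword mix per theme

def generate_scene():
    c = rng.choice(n_categories, p=class_prior)   # 1.) pick a category
    pi = rng.dirichlet(10 * theta[c])             # 2.) draw theme proportions for this image
    patches = []
    for _ in range(n_patches):                    # 3.)-4.) draw a theme, then a codeword
        z = rng.choice(n_themes, p=pi)
        w = rng.choice(n_codewords, p=beta[z])
        patches.append(w)
    return c, patches

category, bag_of_patches = generate_scene()
print(category, len(bag_of_patches))
```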
Slide 102: Results: modeling themes
Left - distribution of the 40 intermediate
themes. Right - distribution of codewords as
well as the appearance of 10 codewords selected
from the top 20 most likely codewords for this
category model.
Slide 103: Results: modeling themes
Left - distribution of the 40 intermediate
themes. Right - distribution of codewords as
well as the appearance of 10 codewords selected
from the top 20 most likely codewords for this
category model.
Slide 104: Results: Scene Classification
correct
incorrect
Slide 105: Results: Scene Classification
correct
incorrect
Slide 106: LDA for words and pictures
Slide 107: Matching Words and Pictures (Kobus Barnard,
Pinar Duygulu, David Forsyth, Nando de Freitas, David
Blei and Michael Jordan)
- Present a multi-modal extension to mixture of
latent Dirichlet allocation (MoM-LDA).
- Apply the model to predicting words associated
with whole images (auto-annotation) and
corresponding to particular image regions (region
naming).
Slide 108: MoM-LDA
Slide 109: Results
Slide 110: Results
Slide 111: Results
Slide 115: Generative vs Discriminative
- Generative version - build a model that generates
images containing monkeys and images not
containing monkeys
P(image | monkey)
P(image | not monkey)