Title: Generative Models
Slide 1: Generative Models
- Tamara L Berg
- Stony Brook University
Slide 2: Generative vs Discriminative
- Discriminative version - build a classifier to
discriminate between monkeys and non-monkeys.
P(monkey | image)
Slide 3: Generative vs Discriminative
- Generative version - build a model that generates
images containing monkeys
P(image | monkey)
P(image | not monkey)
Slide 4: Generative vs Discriminative
- Can use Bayes rule to compute p(monkey | image) if
we know p(image | monkey)
Slide 5: Generative vs Discriminative
- Can use Bayes rule to compute p(monkey | image) if
we know p(image | monkey)
Discriminative
Generative
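The Bayes rule step referred to on slides 4-5 can be written out explicitly (a standard identity; the class prior P(monkey) and the normalizer P(image) are supplied by the generative model):

```latex
P(\text{monkey} \mid \text{image})
  = \frac{P(\text{image} \mid \text{monkey})\, P(\text{monkey})}{P(\text{image})},
\qquad
P(\text{image}) = \sum_{c \in \{\text{monkey},\,\text{not monkey}\}} P(\text{image} \mid c)\, P(c).
```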
Slide 6: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 7: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 8: Slide from Dan Klein
Slide 9: Random Variables
Random variables
Slide 10: Random Variables
Random variables
A random variable is some aspect of the world
about which we (may) have uncertainty. Random
variables can be binary (e.g. true/false,
spam/ham), take on a discrete set of values
(e.g. Spring, Summer, Fall, Winter), or be
continuous (e.g. in [0, 1]).
Slide 11: Joint Probability Distribution
Random variables
Also written
Gives a real value for all possible assignments.
Slide 12: Queries
Also written
Given a joint distribution, we can reason about
unobserved variables given observations
(evidence)
Stuff you care about
Stuff you already know
Slide 13: Representation
Also written
Slide 14: Representation
Also written
Graphical Models!
Slide 15: Representation
Also written
Graphical models represent joint probability
distributions more economically, using a set of
local relationships among variables.
Slide 16: Graphical Models
- Graphical models offer several useful properties:
- 1. They provide a simple way to visualize the
structure of a probabilistic model and can be
used to design and motivate new models.
- 2. Insights into the properties of the model,
including conditional independence properties,
can be obtained by inspection of the graph.
- 3. Complex computations, required to perform
inference and learning in sophisticated models,
can be expressed in terms of graphical
manipulations, in which underlying mathematical
expressions are carried along implicitly.
from Chris Bishop
Slide 17: Main kinds of models
- Undirected (also called Markov Random Fields) -
links express constraints between variables.
- Directed (also called Bayesian Networks) - have
a notion of causality -- one can regard an arc
from A to B as indicating that A "causes" B.
Slide 18: Main kinds of models
- Undirected (also called Markov Random Fields) -
links express constraints between variables.
- Directed (also called Bayesian Networks) - have
a notion of causality -- one can regard an arc
from A to B as indicating that A "causes" B.
Slide 19: Directed Graphical Models
Nodes
Edges
- Each node is associated with a random variable
Slide 20: Directed Graphical Models
Nodes
Edges
- Each node is associated with a random variable
- Definition of joint probability in a graphical
model
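The equation defining this joint probability did not survive extraction; the standard factorization it refers to, for nodes x_1, ..., x_n with parent sets pa(x_i), is:

```latex
p(x_1, \dots, x_n) \;=\; \prod_{i=1}^{n} p\big(x_i \mid \mathrm{pa}(x_i)\big),
```

where pa(x_i) denotes the parents of node x_i in the directed graph.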
Slide 21: Example
Slide 22: Example
Joint Probability
Slide 23: Example
Slide 24: Conditional Independence
- Independence
- Conditional Independence
Or,
Slide 25: Conditional Independence
Slide 26: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Slide 27: Example
Joint Probability
Slide 28: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Joint distribution from the example graph
Slide 29: Conditional Independence
By Chain Rule (using the usual arithmetic
ordering)
Joint distribution from the example graph
Missing variables in the local conditional
probability functions correspond to missing edges
in the underlying graph. Removing an edge into
node i eliminates an argument from the
conditional probability factor.
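As a small worked illustration (a hypothetical three-node graph, not necessarily the one pictured on the slide): the chain rule always gives

```latex
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2),
```

while a graph containing only the edges x_1 → x_2 → x_3 drops x_1 from the last factor, reflecting the missing edge x_1 → x_3:

```latex
p(x_1, x_2, x_3) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_2).
```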
Slide 30: Observations
- Graphs can have observed (shaded) and unobserved
nodes. If nodes are always unobserved they are
called hidden or latent variables.
- Probabilistic inference in graphical models is
the problem of computing a conditional
probability distribution over the values of some
of the nodes (the hidden or unobserved nodes),
given the values of other nodes (the evidence or
observed nodes).
Slide 31: Inference: computing conditional probabilities
Conditional Probabilities
Marginalization
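With x_E denoting the evidence (observed) nodes and x_H the hidden nodes, the two operations named on this slide are:

```latex
p(x_H \mid x_E) = \frac{p(x_H, x_E)}{p(x_E)},
\qquad
p(x_E) = \sum_{x_H} p(x_H, x_E).
```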
Slide 32: Inference Algorithms
- Exact algorithms
- Elimination algorithm
- Sum-product algorithm
- Junction tree algorithm
- Sampling algorithms
- Importance sampling
- Markov chain Monte Carlo
- Variational algorithms
- Mean field methods
- Sum-product algorithm and variations
- Semidefinite relaxations
Slide 33: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 34: Exchangeability
- De Finetti theorem of exchangeability (bag of
words theorem): the joint probability
distribution underlying the data is invariant to
permutation.
Slide 35: Plates
- Plates - "macro" that allows subgraphs to be
replicated (graphical representation of the De
Finetti theorem).
Slide 36: Bag of words for text
- Represent documents as bags of words
Slide 37: Example
- Doc1: the quick brown fox jumped
- Doc2: brown quick jumped fox the
- Would a bag of words model represent these two
documents differently? (See the sketch below.)
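A minimal sketch of the point in plain Python; the two documents are the ones from the slide:

```python
from collections import Counter

doc1 = "the quick brown fox jumped"
doc2 = "brown quick jumped fox the"

# A bag of words keeps only word counts and discards order,
# so the two documents get identical representations.
bag1 = Counter(doc1.split())
bag2 = Counter(doc2.split())
print(bag1 == bag2)  # True
```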
Slide 38: Bag of words for images
- Represent images as a bag of words
Slide 39: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 40: A Simple Example: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 41: A Simple Example: Naïve Bayes
C - Class, F - Features (n features, shown as a plate)
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 42: Slide from Dan Klein
Slide 43: Slide from Dan Klein
Slide 44: Slide from Dan Klein
Slide 45: Percentage of documents in the training set
labeled as spam/ham
Slide from Dan Klein
Slide 46: In the documents labeled as spam, the
occurrence percentage of each word (i.e. times the
word occurred / total words).
Slide from Dan Klein
Slide 47: In the documents labeled as ham, the
occurrence percentage of each word (i.e. times the
word occurred / total words).
Slide from Dan Klein
Slide 48: Classification
The class that maximizes
Slide 49: Classification
- In practice
- Multiplying lots of small probabilities can
result in floating point underflow
Slide 50: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
Slide 51: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
- Since log is a monotonic function, the class with
the highest score does not change.
Slide 52: Classification
- In practice:
- Multiplying lots of small probabilities can
result in floating point underflow.
- Since log(xy) = log(x) + log(y), we can sum log
probabilities instead of multiplying
probabilities.
- Since log is a monotonic function, the class with
the highest score does not change.
- So, what we usually compute in practice is:
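A minimal sketch of this log-space computation; the class priors and word probabilities below are made up for illustration, not learned parameters:

```python
import math

def classify(word_counts, class_priors, word_probs):
    """Return the class c maximizing log P(c) + sum_w n(w) * log P(w | c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in class_priors.items():
        score = math.log(prior)
        for word, count in word_counts.items():
            # Unseen words get a tiny smoothed probability to avoid log(0).
            score += count * math.log(word_probs[c].get(word, 1e-9))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Toy spam/ham example with invented parameters.
priors = {"spam": 0.3, "ham": 0.7}
probs = {"spam": {"free": 0.05, "meeting": 0.001},
         "ham":  {"free": 0.002, "meeting": 0.03}}
print(classify({"free": 2, "meeting": 1}, priors, probs))  # "spam"
```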
Slide 53: Naïve Bayes for modeling text/metadata topics
Slide 54: Harvesting Image Databases from the Web
(Schroff, F., Criminisi, A. and Zisserman, A.)
- Download images from the web via a search query
(e.g. penguin).
- Re-rank images using a naïve Bayes model trained
on text surrounding the images and meta-data
features (image alt tag, image title tag, image
filename).
- Top ranked images used to train an SVM classifier
to further improve ranking.
Slide 55: Results
Slide 56: Results
Slide 57: Naive Bayes is Not So Naive
- Naïve Bayes: first and second place in the KDD-CUP 97
competition, among 16 (then) state of the art
algorithms.
  - Goal: financial services industry direct mail
response prediction model. Predict if the
recipient of mail will actually respond to the
advertisement. 750,000 records.
- Robust to irrelevant features: irrelevant features
cancel each other without affecting results.
- Very good in domains with many equally important
features.
- A good dependable baseline for text
classification (but not the best)!
- Optimal if the independence assumptions hold: if
the assumed independence is correct, then it is the
Bayes optimal classifier for the problem.
- Very fast: learning with one pass over the data;
testing linear in the number of attributes and
document collection size.
- Low storage requirements.
Slide from Mitch Marcus
Slide 58: Naïve Bayes on images
Slide 59: Visual Categorization with Bags of Keypoints
(Gabriella Csurka, Christopher R. Dance, Lixin Fan,
Jutta Willamowski, Cédric Bray)
Slide 60: Method
- Steps:
- Detect and describe image patches.
- Assign patch descriptors to a set of
predetermined clusters (a vocabulary) with a
vector quantization algorithm (see the sketch
after this list).
- Construct a bag of keypoints, which counts the
number of patches assigned to each cluster.
- Apply a multi-class classifier (naïve Bayes),
treating the bag of keypoints as the feature
vector, and thus determine which category or
categories to assign to the image.
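A rough sketch of steps 2 and 3 (vector quantization of patch descriptors into a bag-of-keypoints histogram). The function name, array shapes, and random stand-in data are assumptions for illustration, not the authors' code:

```python
import numpy as np

def bag_of_keypoints(descriptors, vocabulary):
    """Assign each patch descriptor to its nearest visual word and
    return a histogram of word counts.

    descriptors : (n_patches, d) array of local descriptors (e.g. SIFT-like)
    vocabulary  : (n_words, d) array of cluster centers
    """
    # Squared Euclidean distance from every descriptor to every word.
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(-1)
    assignments = d2.argmin(axis=1)  # nearest word per patch
    return np.bincount(assignments, minlength=len(vocabulary))

# Toy usage with random data standing in for real descriptors.
rng = np.random.default_rng(0)
hist = bag_of_keypoints(rng.normal(size=(200, 128)), rng.normal(size=(50, 128)))
print(hist.sum())  # 200: every patch is assigned to exactly one word
```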
Slide 61: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 62: Naive Bayes Parameters
- Problem: categorize images as one of 7 object
classes using a Naïve Bayes classifier.
- Classes: object categories (face, car, bicycle,
etc.)
- Features: images represented as a histogram whose
bins are the cluster centers, or visual word
vocabulary. Features are vocabulary counts.
- The prior over class labels is treated as uniform.
- How each feature depends on the class is learned
from training data: images labeled with category.
Slide 63: Results
Slide 64: Salient Object Localization (Berg & Berg)
Class-independent model to predict the saliency of a
given fg/bg division: a Naïve Bayes model.
Slide 65: Perceptual contrast cues
texture
focus
saturation
hue
value
Slide 66: Spatial Cues
Object size and location
Slide 67: Naïve Bayes
C - Class, F - Features
We only specify (parameters): a prior over class
labels, and how each feature depends on the class.
Slide 68: Naïve Bayes features
Classes: salient / not salient.
Features: cues computed on foreground regions, cues
computed on background regions, and the chi-square
distance (contrast) between foreground and
background cues.
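One common form of the chi-square distance between a foreground cue histogram h^{fg} and a background cue histogram h^{bg} (the paper's exact normalization may differ):

```latex
\chi^2\big(h^{fg}, h^{bg}\big) \;=\; \sum_{i} \frac{\big(h^{fg}_i - h^{bg}_i\big)^2}{h^{fg}_i + h^{bg}_i}.
```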
Slide 69: Naïve Bayes features
Parameters: the prior over classes is treated as
uniform. The feature distributions are computed from
labeled training data (no overlap with test
categories).
Slide 70: Classification
- For test images (of any category):
compute the likelihood of a salient object over all
possible rectangular windows in the image, and
select the best region for each image.
Example image
Slide 71: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 72: pLSA
Slide 73: pLSA
Slide 74: Joint Probability
Marginalizing over topics determines the
conditional probability
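In the usual pLSA notation (documents d, words w, latent topics z), the marginalization referred to here is:

```latex
P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad
P(w, d) = P(d) \sum_{z} P(w \mid z)\, P(z \mid d).
```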
Slide 75: Fitting the model
Need to: determine the topic vectors common to all
documents, and determine the mixture components
specific to each document.
Goal: a model that gives high probability to the
words that appear in the corpus.
Maximum likelihood estimation of the parameters is
obtained by maximizing the objective function:
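The objective function (whose rendered form did not survive extraction) is the corpus log-likelihood, with n(d, w) the number of times word w occurs in document d:

```latex
\mathcal{L} \;=\; \sum_{d} \sum_{w} n(d, w)\, \log P(w, d)
  \;=\; \sum_{d} \sum_{w} n(d, w)\, \log \Big( P(d) \sum_{z} P(w \mid z)\, P(z \mid d) \Big).
```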
Slide 76: pLSA on images
Slide 77: Discovering objects and their location in
images (Josef Sivic, Bryan C. Russell, Alexei A.
Efros, Andrew Zisserman, William T. Freeman)
Documents: images. Words: visual words (vector
quantized SIFT descriptors). Topics: object
categories. Images are modeled as a mixture of
topics (objects).
Slide 78: Goals
- They investigate three areas:
- (i) topic discovery, where categories are
discovered by pLSA clustering on all available
images.
- (ii) classification of unseen images, where
topics corresponding to object categories are
learnt on one set of images, and then used to
determine the object categories present in
another set.
- (iii) object detection, where you want to
determine the location and approximate
segmentation of object(s) in each image.
Slide 79: (i) Topic Discovery
Most likely words for 4 learnt topics (face,
motorbike, airplane, car)
Slide 80: (ii) Image Classification
Confusion table for unseen test images against
pLSA trained on images containing four object
categories, but no background images.
Slide 81: (ii) Image Classification
Confusion table for unseen test images against
pLSA trained on images containing four object
categories, and background images. Performance is
not quite as good.
Slide 82: (iii) Topic Segmentation
Slide 83: (iii) Topic Segmentation
Slide 84: (iii) Topic Segmentation
Slide 85: Talk Outline
- 1. Quick introduction to graphical models
- 2. Bag of words models
  - What are they?
  - Examples: Naïve Bayes, pLSA, LDA
Slide 86: LDA (David M Blei, Andrew Y Ng, Michael Jordan)
Slide 87: LDA
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 88: LDA
pLSA
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 89: LDA
Per-document topic proportions
Per-word topic assignment
Observed word
Dirichlet parameter
Slide 90: LDA
topics
Dirichlet parameter
Per-document topic proportions
Per-word topic assignment
Observed word
Slide 91: Generating Documents
Slide 92: Joint Distribution
Joint distribution
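Written out as in Blei, Ng and Jordan, the per-document joint distribution over topic proportions θ, topic assignments z, and observed words w is:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  \;=\; p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta).
```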
Slide 94: LDA on text
Topic discovery from a text corpus. Highly ranked
words for 4 topics.
Slide 95: LDA in Animals on the Web (Tamara L Berg,
David Forsyth)
Slide 96: Animals on the Web Outline
Harvest pictures of animals from the web using
Google Text Search. Select visual exemplars
using text based information (LDA).
Use vision and text cues to extend to similar
images.
Slide 97: Text Model
Run Latent Dirichlet Allocation (LDA) on the words
in collected web pages to discover 10 latent
topics for each category. Each topic defines a
distribution over words. Select the 50 most
likely words for each topic.
Example: Frog Topics
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess
river ball range eyes game species legs golden
bullfrog session head spring book deep spotted de
am free mouse information round poison yellow
upon collection nature paper pond re lived center
talk buy arrow common prince
2.) frog information january links common red
transparent music king water hop tree pictures
pond green people available book call press toad
funny pottery toads section eggs bullet photo
nature march movies commercial november re clear
eyed survey link news boston list frogs bull
sites butterfly court legs type dot blue
Slide 98: Select Exemplars
Rank images according to whether they have these
likely words near the image in the associated
page. Select up to 30 images per topic as
exemplars.
2.) frog information january links common red
transparent music king water hop tree pictures
pond green people available book call press ...
1.) frog frogs water tree toad leopard green
southern music king irish eggs folk princess
river ball range eyes game species legs golden
bullfrog session head ...
Slide 99: Extensions to LDA for pictures
Slide 100: A Bayesian Hierarchical Model for Learning
Natural Scene Categories (Fei-Fei Li, Pietro Perona)
An unsupervised approach to learn and recognize
natural scene categories. A scene is represented
by a collection of local regions. Each region is
represented as part of a theme (e.g. rock,
grass etc) learned from data.
Slide 101: Generating Scenes
- 1.) Choose a category label (e.g. mountain
scene).
- 2.) Given the mountain class, draw a probability
vector that will determine what intermediate
theme(s) (grass, rock, etc.) to select while
generating each patch of the scene.
- 3.) For creating each patch in the image, first
determine a particular theme out of the mixture
of possible themes, and then draw a codeword
given this theme. For example, if a rock theme
is selected, this will in turn privilege some
codewords that occur more frequently in rocks
(e.g. slanted lines).
- 4.) Repeat the process of drawing both the theme
and codeword many times, eventually forming an
entire bag of patches that would construct a
scene of mountains (a sketch of this sampling
process follows below).
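A loose sketch of this sampling process in Python. All sizes and parameters below are invented for illustration (the paper learns them from data), and the draw of theme proportions given the category is only an approximation of the paper's exact hierarchical form:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: they only shape the arrays in this sketch.
n_categories, n_themes, n_codewords, n_patches = 13, 40, 174, 100

# Made-up model parameters for illustration only.
class_prior = np.full(n_categories, 1.0 / n_categories)
theta = rng.dirichlet(np.ones(n_themes), size=n_categories)   # theme mix per category
beta = rng.dirichlet(np.ones(n_codewords), size=n_themes)     # codeword mix per theme

def generate_scene():
    c = rng.choice(n_categories, p=class_prior)   # 1.) pick a category
    pi = rng.dirichlet(10 * theta[c])             # 2.) draw theme proportions for this image
    patches = []
    for _ in range(n_patches):                    # 3.)-4.) draw a theme, then a codeword
        z = rng.choice(n_themes, p=pi)
        w = rng.choice(n_codewords, p=beta[z])
        patches.append(w)
    return c, patches

category, bag_of_patches = generate_scene()
print(category, len(bag_of_patches))
```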
Slide 102: Results: modeling themes
Left - distribution of the 40 intermediate
themes. Right - distribution of codewords as
well as the appearance of 10 codewords selected
from the top 20 most likely codewords for this
category model.
Slide 103: Results: modeling themes
Left - distribution of the 40 intermediate
themes. Right - distribution of codewords as
well as the appearance of 10 codewords selected
from the top 20 most likely codewords for this
category model.
Slide 104: Results: Scene Classification
correct
incorrect
Slide 105: Results: Scene Classification
correct
incorrect
Slide 106: LDA for words and pictures
Slide 107: Matching Words and Pictures (Kobus Barnard,
Pinar Duygulu, David Forsyth, Nando de Freitas, David
Blei and Michael Jordan)
- Present a multi-modal extension to mixture of
latent Dirichlet allocation (MoM-LDA).
- Apply the model to predicting words associated
with whole images (auto-annotation) and
corresponding to particular image regions (region
naming).
Slide 108: MoM-LDA
Slide 109: Results
Slide 110: Results
Slide 111: Results
Slide 115: Generative vs Discriminative
- Generative version - build a model that generates
images containing monkeys and images not
containing monkeys
P(image | monkey)
P(image | not monkey)