Title: Part IV: Inference algorithms
1. Part IV: Inference algorithms
2. Estimation and inference
- Actually working with probabilistic models requires solving some difficult computational problems
- Two key problems:
  - estimating parameters in models with latent variables
  - computing posterior distributions involving large numbers of variables
3. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
4. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
5. SUPERVISED
[Figure: a set of observations, each labeled "dog" or "cat"]
6. Supervised learning
[Figure: observations grouped into Category A and Category B]
What characterizes the categories? How should we categorize a new observation?
7. Parametric density estimation
- Assume that p(x|c) has a simple form, characterized by parameters θ
- Given stimuli X = {x1, x2, ..., xn} from category c, find θ by maximum-likelihood estimation
  - or some form of Bayesian estimation
8. Spatial representations
- Assume a simple parametric form for p(x|c): a Gaussian
- For each category, estimate parameters
  - mean
  - variance
[Figure: graphical model with c drawn from P(c) and x drawn from p(x|c), with parameters θ]
9. The Gaussian distribution
[Figure: Gaussian probability density p(x) plotted against (x - μ)/σ, with variance σ²]
10. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a Gaussian, so p(X|μ, σ) = Π_i p(x_i|μ, σ)
11. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a Gaussian
maximum likelihood parameter estimates: μ = (1/n) Σ_i x_i, σ² = (1/n) Σ_i (x_i - μ)²
12. Multivariate Gaussians
13. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a (multivariate) Gaussian
maximum likelihood parameter estimates: the mean is the average of the x_i, and the covariance is the average of (x_i - μ)(x_i - μ)^T
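To make these estimates concrete, here is a minimal Python/numpy sketch (the code and the toy data are my additions, not part of the slides): the maximum-likelihood mean is the sample average, and the maximum-likelihood covariance uses a 1/n normalizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations of a 2-dimensional stimulus (values are made up)
X = rng.normal(loc=[1.0, -3.0], scale=2.0, size=(100, 2))

# Maximum-likelihood estimates for a Gaussian
mu_hat = X.mean(axis=0)                      # sample mean
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)   # note 1/n, not 1/(n-1)

print(mu_hat)
print(sigma_hat)
```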
14. Bayesian inference
[Figure: probability distribution over x]
15. UNSUPERVISED
16. Unsupervised learning
What latent structure is present? What are the properties of a new observation?
17. An example: Clustering
Assume each observed xi is from a cluster ci, where ci is unknown. What characterizes the clusters? What cluster does a new x come from?
18. Density estimation
- We need to estimate some probability distributions
  - what is P(c)?
  - what is p(x|c)?
- But c is unknown, so we only know the value of x
[Figure: graphical model with c drawn from P(c) and x drawn from p(x|c)]
19. Supervised and unsupervised
- Supervised learning: categorization
  - Given x = x1, ..., xn and c = c1, ..., cn
  - Estimate parameters θ of p(x|c) and P(c)
- Unsupervised learning: clustering
  - Given x = x1, ..., xn
  - Estimate parameters θ of p(x|c) and P(c)
20. Mixture distributions
[Figure: a mixture distribution over x and its mixture components]
21. More generally
- Unsupervised learning is density estimation using distributions with latent variables
[Figure: latent (unobserved) z drawn from P(z), observed x drawn from P(x|z)]
- Marginalize out (i.e. sum over) latent structure: P(x) = Σ_z P(x|z) P(z)
22. A chicken-and-egg problem
- If we knew which cluster the observations were from, we could find the distributions
  - this is just density estimation
- If we knew the distributions, we could infer which cluster each observation came from
  - this is just categorization
23. Alternating optimization algorithm
- 0. Guess initial parameter values
- 1. Given parameter estimates, solve for maximum a posteriori assignments ci
- 2. Given assignments ci, solve for maximum likelihood parameter estimates
- 3. Go to step 1
24. Alternating optimization algorithm
- c: assignments of each x to a cluster
- μ, σ, P(c): parameters θ
- For simplicity, assume σ and P(c) are fixed: this gives the k-means algorithm (a code sketch follows below)
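A minimal sketch of this procedure in Python/numpy (my illustration, not code from the slides): with the variances and P(c) held fixed and shared, the alternation reduces to the familiar k-means updates.

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Alternate between hard assignments (step 1) and mean updates (step 2)."""
    rng = np.random.default_rng(seed)
    # Step 0: guess initial parameter values (k data points chosen at random)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest cluster mean (the MAP
        # assignment when variances and P(c) are equal across clusters)
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Step 2: maximum-likelihood estimate of each mean from its points
        for j in range(k):
            if np.any(c == j):
                means[j] = X[c == j].mean(axis=0)
    return means, c
```

For example, calling kmeans(X, 2) on the two-dimensional data from the earlier Gaussian sketch would return two cluster means and a hard assignment for every point.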
25. Alternating optimization algorithm
Step 0: initial parameter values
26. Alternating optimization algorithm
Step 1: update assignments
27. Alternating optimization algorithm
Step 2: update parameters
28. Alternating optimization algorithm
Step 1: update assignments
29. Alternating optimization algorithm
Step 2: update parameters
30. Alternating optimization algorithm
- 0. Guess initial parameter values
- 1. Given parameter estimates, solve for maximum a posteriori assignments ci
- 2. Given assignments ci, solve for maximum likelihood parameter estimates
- 3. Go to step 1
Why hard assignments?
31. Estimating a Gaussian (with hard assignments)
X = {x1, x2, ..., xn} independently sampled from a Gaussian
maximum likelihood parameter estimates: the mean and variance of the points assigned to each cluster
32. Estimating a Gaussian (with soft assignments)
the weight of each point is the probability of it being in the cluster
maximum likelihood parameter estimates: weighted means and variances, e.g. the mean becomes Σ_i w_i x_i / Σ_i w_i with weights w_i = P(cluster | x_i)
33. The Expectation-Maximization algorithm (clustering version)
- 0. Guess initial parameter values
- 1. Given parameter estimates, compute posterior distribution over assignments ci
- 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster
- 3. Go to step 1
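Below is a small Python/numpy sketch of this clustering version of EM for a one-dimensional mixture of Gaussians. The function and variable names are mine, and unlike the k-means simplification it also re-estimates the mixing weights; treat it as an illustration of the two steps rather than the exact algorithm on the slides.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_mixture_of_gaussians(x, k=2, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # 0. Guess initial parameter values
    mu = rng.choice(x, size=k, replace=False)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # 1. E-step: posterior probability that each point came from each cluster
        resp = np.stack([pi[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(k)],
                        axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # 2. M-step: maximum-likelihood estimates, weighting each point by resp
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma
```

Each iteration of this loop cannot decrease the likelihood, which is the convergence property noted on a later slide.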
34. The Expectation-Maximization algorithm (more general version)
- 0. Guess initial parameter values
- 1. Given parameter estimates, compute posterior distribution over latent variables z
- 2. Find parameter estimates that maximize the expected log-likelihood, with the expectation taken over the posterior from step 1
- 3. Go to step 1
35. A note on expectations
- For a function f(x) and distribution P(x), the expectation of f with respect to P is E_P[f(x)] = Σ_x f(x) P(x)
- The expectation is the average of f, when x is drawn from the probability distribution P
36. Good features of EM
- Convergence
  - guaranteed to converge to at least a local maximum of the likelihood (or other extremum)
  - likelihood is non-decreasing across iterations
- Efficiency
  - big steps initially (other algorithms are better later)
- Generality
  - can be defined for many probabilistic models
  - can be combined with a prior for MAP estimation
37. Limitations of EM
- Local maxima (rather than the global maximum)
  - e.g., one component poorly fits two clusters, while two components split up a single cluster
- Degeneracies
  - e.g., two components may merge, or a component may lock onto one data point, with its variance going to zero
- May be intractable for complex models
  - dealing with this is an active research topic
38. EM and cognitive science
- The EM algorithm seems like it might be a good way to describe some bootstrapping
  - anywhere there's a chicken-and-egg problem
  - a prime example: language learning
39. Probabilistic context-free grammars
[Figure: parse tree rooted at S]
S → NP VP   1.0
NP → T N    0.7
NP → N      0.3
VP → V NP   1.0
T → the     0.8
T → a       0.2
N → man     0.5
N → ball    0.5
V → hit     0.6
V → took    0.4
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
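The product above is just the probabilities of the rules used in the tree multiplied together. A small Python sketch of that computation (my own illustration; I am assuming a parse of "the man hit the ball", which is consistent with the factors shown, though the transcript does not preserve the tree itself):

```python
# Rule probabilities from the grammar above
rules = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("T", "N")): 0.7, ("NP", ("N",)): 0.3,
    ("VP", ("V", "NP")): 1.0, ("T", ("the",)): 0.8, ("T", ("a",)): 0.2,
    ("N", ("man",)): 0.5, ("N", ("ball",)): 0.5,
    ("V", ("hit",)): 0.6, ("V", ("took",)): 0.4,
}

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used to expand it."""
    if isinstance(tree, str):        # a terminal word
        return 1.0
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, child_labels)]
    for c in children:
        p *= tree_prob(c)
    return p

# One parse consistent with the factors multiplied on the slide
tree = ("S", [("NP", [("T", ["the"]), ("N", ["man"])]),
              ("VP", [("V", ["hit"]),
                      ("NP", [("T", ["the"]), ("N", ["ball"])])])])
print(tree_prob(tree))   # 1.0 * 0.7 * 1.0 * 0.8 * 0.5 * 0.6 * 0.7 * 0.8 * 0.5
```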
40. EM and cognitive science
- The EM algorithm seems like it might be a good way to describe some bootstrapping
  - anywhere there's a chicken-and-egg problem
  - a prime example: language learning
- Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians
41. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
42. The Monte Carlo principle
- The expectation of f with respect to P can be approximated by E_P[f(x)] ≈ (1/n) Σ_i f(x_i)
  - where the x_i are sampled from P(x)
- Example: the average number of spots on a die roll
43. The Monte Carlo principle
The law of large numbers
[Figure: average number of spots vs. number of rolls; the running average converges to 3.5]
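A minimal Python/numpy illustration of the Monte Carlo principle for the die example (the code is my addition): the average of sampled rolls approaches the true expectation of 3.5 as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(x)] with f(x) = x and x a fair die roll:
# E[f(x)] is approximated by (1/n) * sum_i f(x_i), with x_i sampled from P(x)
for n in (10, 100, 1000, 10000):
    rolls = rng.integers(1, 7, size=n)   # uniform on {1, ..., 6}
    print(n, rolls.mean())               # approaches 3.5 as n grows
```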
44. Markov chain Monte Carlo
- Sometimes it isn't possible to sample directly from a distribution
- Sometimes, you can only compute something proportional to the distribution
- Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain
  - just uses something proportional to the target distribution
45. Markov chains
[Figure: a chain of states x(1) → x(2) → ... → x(t) → x(t+1) → ...]
Transition matrix T = P(x(t+1)|x(t))
- Variables x(t+1) are independent of all previous variables given the immediate predecessor x(t)
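The convergence behavior discussed on the next slides is easy to see with a few lines of Python/numpy (the three-state transition matrix here is made up; any ergodic chain behaves the same way): repeatedly applying T to an initial distribution drives it toward the chain's stationary distribution.

```python
import numpy as np

# T[i, j] = P(x(t+1) = j | x(t) = i); rows sum to 1 (values are illustrative)
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

p = np.array([1.0, 0.0, 0.0])   # start with all probability on state 0
for t in range(200):
    p = p @ T                    # one step of the chain
print(p)                         # approximately the stationary distribution
```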
46. An example: card shuffling
- Each state x(t) is a permutation of a deck of cards (there are 52! permutations)
- Transition matrix T indicates how likely one permutation is to become another
- The transition probabilities are determined by the shuffling procedure
  - riffle shuffle
  - overhand
  - one card
47. Convergence of Markov chains
- Why do we shuffle cards?
  - Convergence to a uniform distribution takes only 7 riffle shuffles
- Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called ergodicity)
  - e.g., every state can be reached in some number of steps from every other state
48. Markov chain Monte Carlo
[Figure: a chain of states x(1) → x(2) → ... → x(t) → x(t+1) → ...]
Transition matrix T = P(x(t+1)|x(t))
- States of the chain are the variables of interest
- Transition matrix chosen to give the target distribution as its stationary distribution
49. Metropolis-Hastings algorithm
- Transitions have two parts
  - proposal distribution: Q(x(t+1)|x(t))
  - acceptance: take proposals with probability
    A(x(t), x(t+1)) = min( 1, [P(x(t+1)) Q(x(t)|x(t+1))] / [P(x(t)) Q(x(t+1)|x(t))] )
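A sketch of the algorithm in Python/numpy (my illustration, not code from the slides). The target here is an arbitrary unnormalized density, since, as noted above, only something proportional to P(x) is needed; the proposal is a symmetric Gaussian random walk, so the Q terms cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_p(x):
    # anything proportional to the target p(x); this bimodal bump is made up
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples=10000, proposal_sd=1.0):
    x = 0.0
    samples = []
    for _ in range(n_samples):
        # proposal: Q(x(t+1)|x(t)) is a Gaussian centered on the current state
        x_new = x + proposal_sd * rng.standard_normal()
        # acceptance probability; Q is symmetric, so its ratio equals 1
        a = min(1.0, unnormalized_p(x_new) / unnormalized_p(x))
        if rng.random() < a:
            x = x_new
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings()
print(samples.mean(), samples.std())
```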
50-55. Metropolis-Hastings algorithm
[Figures: a sequence of proposed moves under the target density p(x); one proposal is accepted with probability A(x(t), x(t+1)) = 0.5, another with probability A(x(t), x(t+1)) = 1]
56. Gibbs sampling
- A particular choice of proposal distribution
- For variables x = x1, x2, ..., xn
  - Draw xi(t+1) from P(xi|x-i)
  - where x-i = x1(t+1), x2(t+1), ..., xi-1(t+1), xi+1(t), ..., xn(t)
  - (this is called the full conditional distribution)
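As a concrete illustration, here is a Gibbs sampler in Python/numpy for a correlated bivariate Gaussian, a standard two-variable example in which both full conditionals are themselves Gaussian. The correlation value and the names are my choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho=0.8, n_samples=5000):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian with
    correlation rho: each step draws one variable from its full conditional
    P(xi|x-i), a Gaussian with mean rho * (other) and variance 1 - rho**2."""
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples.T))   # off-diagonal entries should be close to 0.8
```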
57. Gibbs sampling
[Figure illustrating Gibbs sampling (MacKay, 2002)]
58. MCMC vs. EM
EM converges to a single solution
MCMC converges to a distribution of solutions
59. MCMC and cognitive science
- The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development
- Some forms of cultural evolution can be shown to be equivalent to Gibbs sampling
  - (Griffiths & Kalish, 2005)
- For experiments based on MCMC, see the talk by Adam Sanborn at MathPsych!
- The main use of MCMC is for probabilistic inference in complex models
60. A selection of topics
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
61. Semantic gist of document
Semantic classes:
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING
Syntactic classes:
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
62. Summary
- Probabilistic models can pose significant computational challenges
  - parameter estimation with latent variables, computing posteriors with many variables
- Clever algorithms exist for solving these problems, easing the use of probabilistic models
- These algorithms also provide a source of new models and methods in cognitive science
64. Generative models for language
[Figure: latent structure → observed data]
65. Generative models for language
[Figure: meaning → words]
66. Topic models
- Each document (or conversation, or segment of either) is a mixture of topics
- Each word is chosen from a single topic: P(wi) = Σ_{j=1..T} P(wi|zi = j) P(zi = j)
  - where wi is the ith word
  - zi is the topic of the ith word
  - T is the number of topics
67. Generating a document
[Figure: g (distribution over topics) → topic assignments z1, z2, z3 → observed words w1, w2, w3]
68. Two example topics, P(w|z = 1) and P(w|z = 2)
topic 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
topic 2: HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
69. Choose mixture weights for each document, generate bag of words
g = {P(z = 1), P(z = 2)}:
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
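The generative procedure behind these examples is easy to write down; the Python/numpy sketch below (my code, using the two topics from the earlier slide) draws a topic for each word from the mixture weights g and then a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
# P(w|z = 1) and P(w|z = 2), as on the earlier two-topic slide
phi = np.array([[0.2] * 5 + [0.0] * 5,
                [0.0] * 5 + [0.2] * 5])

def generate_document(g, n_words=10):
    """g = [P(z = 1), P(z = 2)]: draw a topic per word, then a word per topic."""
    z = rng.choice(2, size=n_words, p=g)
    words = [rng.choice(vocab, p=phi[zi]) for zi in z]
    return " ".join(words)

for g in ([0.0, 1.0], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]):
    print(g, generate_document(np.array(g)))
```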
70. Inferring topics from text
- The topic model is a generative model for a set of documents (assuming a set of topics)
  - a simple procedure for generating documents
- Given the documents, we can try to find the topics and their proportions in each document
- This is an unsupervised learning problem
  - we can use the EM algorithm, but it's not great
  - instead, we use Markov chain Monte Carlo
71. A selection from 500 topics: P(w|z = j)
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
72. A selection from 500 topics: P(w|z = j)
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
73. A selection from 500 topics: P(w|z = j)
[Same six topics repeated from the previous slide]
74. Gibbs sampling for topics
- Need full conditional distributions for the variables
- Since we only sample z, we need P(zi = j|z-i, w), which is proportional to the product of two terms:
  - one involving the number of times word w is assigned to topic j
  - one involving the number of times topic j is used in document d
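A sketch of one sweep of this sampler in Python/numpy. It follows the standard collapsed Gibbs update for topic models, in which the full conditional for a token's topic is proportional to the two count-based terms above; the symmetric Dirichlet hyperparameters alpha and beta, the variable names, and the tiny example corpus are my additions and are not in the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(words, docs, z, n_wt, n_dt, alpha=1.0, beta=0.01):
    """One sweep over all tokens. words[i], docs[i], z[i] give token i's word,
    document, and current topic; n_wt[w, t] counts word w in topic t, and
    n_dt[d, t] counts topic t in document d."""
    n_topics = n_wt.shape[1]
    for i in range(len(words)):
        w, d, t_old = words[i], docs[i], z[i]
        n_wt[w, t_old] -= 1          # remove token i from the counts
        n_dt[d, t_old] -= 1
        # full conditional: (word w in topic) term x (topic in document d) term
        p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + beta * n_wt.shape[0]) \
            * (n_dt[d] + alpha)
        p /= p.sum()
        t_new = rng.choice(n_topics, p=p)
        n_wt[w, t_new] += 1          # add the token back with its new topic
        n_dt[d, t_new] += 1
        z[i] = t_new
    return z

# Tiny made-up corpus: 2 documents, 4-word vocabulary, 2 topics
words = np.array([0, 1, 0, 2, 3, 2, 3, 1])
docs = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = rng.integers(0, 2, size=len(words))
n_wt = np.zeros((4, 2), dtype=float)
n_dt = np.zeros((2, 2), dtype=float)
for i in range(len(words)):
    n_wt[words[i], z[i]] += 1
    n_dt[docs[i], z[i]] += 1
for _ in range(100):
    z = gibbs_sweep(words, docs, z, n_wt, n_dt)
print(z)
```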
75-83. Gibbs sampling
[Figures: topic assignments of the words in a corpus across iterations 1, 2, ..., 1000]
84. A visual example: Bars
- sample each pixel from a mixture of topics
- pixel = word, image = document
87. Summary
- Probabilistic models can pose significant computational challenges
  - parameter estimation with latent variables, computing posteriors with many variables
- Clever algorithms exist for solving these problems, easing the use of probabilistic models
- These algorithms also provide a source of new models and methods in cognitive science
89. When Bayes is useful
- Clarifying computational problems in cognition
- Providing rational explanations for behavior
- Characterizing the knowledge informing induction
- Capturing inferences at multiple levels