Title: Part IV: Inference algorithms
1. Part IV: Inference algorithms
2. Estimation and inference
- Actually working with probabilistic models requires solving some difficult computational problems
- Two key problems:
  - estimating parameters in models with latent variables
  - computing posterior distributions involving large numbers of variables
3. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
4. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
5. SUPERVISED
[Figure: a set of observations, each labeled "dog" or "cat"]
6. Supervised learning
[Figure: observations grouped into Category A and Category B]
What characterizes the categories? How should we categorize a new observation?
7. Parametric density estimation
- Assume that p(x|c) has a simple form, characterized by parameters θ
- Given stimuli X = {x1, x2, ..., xn} from category c, find θ by maximum-likelihood estimation
  - or some form of Bayesian estimation
8. Spatial representations
- Assume a simple parametric form for p(x|c): a Gaussian
- For each category, estimate parameters
  - mean
  - variance
[Figure: graphical model with c drawn from P(c) and x drawn from p(x|c), with parameters θ]
9. The Gaussian distribution
[Figure: Gaussian probability density p(x) plotted against (x - μ)/σ, with variance σ²]
10. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a Gaussian, so p(X|μ, σ) = Π_i p(x_i|μ, σ)
11. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a Gaussian
maximum likelihood parameter estimates: μ = (1/n) Σ_i x_i, σ² = (1/n) Σ_i (x_i - μ)²
12. Multivariate Gaussians
13. Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a (multivariate) Gaussian
maximum likelihood parameter estimates: the mean is the average of the x_i, and the covariance is the average of (x_i - μ)(x_i - μ)^T
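To make these estimates concrete, here is a minimal Python/numpy sketch (the code and the toy data are my additions, not part of the slides): the maximum-likelihood mean is the sample average, and the maximum-likelihood covariance uses a 1/n normalizer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 100 observations of a 2-dimensional stimulus (values are made up)
X = rng.normal(loc=[1.0, -3.0], scale=2.0, size=(100, 2))

# Maximum-likelihood estimates for a Gaussian
mu_hat = X.mean(axis=0)                      # sample mean
centered = X - mu_hat
sigma_hat = centered.T @ centered / len(X)   # note 1/n, not 1/(n-1)

print(mu_hat)
print(sigma_hat)
```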
14. Bayesian inference
[Figure: probability distribution over x]
15. UNSUPERVISED
16. Unsupervised learning
What latent structure is present? What are the properties of a new observation?
17. An example: Clustering
Assume each observed xi is from a cluster ci, where ci is unknown. What characterizes the clusters? What cluster does a new x come from?
18. Density estimation
- We need to estimate some probability distributions
  - what is P(c)?
  - what is p(x|c)?
- But c is unknown, so we only know the value of x
[Figure: graphical model with c drawn from P(c) and x drawn from p(x|c)]
19. Supervised and unsupervised
- Supervised learning: categorization
  - Given x = x1, ..., xn and c = c1, ..., cn
  - Estimate parameters θ of p(x|c) and P(c)
- Unsupervised learning: clustering
  - Given x = x1, ..., xn
  - Estimate parameters θ of p(x|c) and P(c)
20. Mixture distributions
[Figure: a mixture distribution over x and its mixture components]
21. More generally
- Unsupervised learning is density estimation using distributions with latent variables
[Figure: latent (unobserved) z drawn from P(z), observed x drawn from P(x|z)]
- Marginalize out (i.e. sum over) latent structure: P(x) = Σ_z P(x|z) P(z)
22. A chicken-and-egg problem
- If we knew which cluster the observations were from, we could find the distributions
  - this is just density estimation
- If we knew the distributions, we could infer which cluster each observation came from
  - this is just categorization
23. Alternating optimization algorithm
- 0. Guess initial parameter values
- 1. Given parameter estimates, solve for maximum a posteriori assignments ci
- 2. Given assignments ci, solve for maximum likelihood parameter estimates
- 3. Go to step 1
24. Alternating optimization algorithm
- c: assignments of each x to a cluster
- μ, σ, P(c): parameters θ
- For simplicity, assume σ and P(c) are fixed: this gives the k-means algorithm (a code sketch follows below)
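A minimal sketch of this procedure in Python/numpy (my illustration, not code from the slides): with the variances and P(c) held fixed and shared, the alternation reduces to the familiar k-means updates.

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Alternate between hard assignments (step 1) and mean updates (step 2)."""
    rng = np.random.default_rng(seed)
    # Step 0: guess initial parameter values (k data points chosen at random)
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iters):
        # Step 1: assign each point to the nearest cluster mean (the MAP
        # assignment when variances and P(c) are equal across clusters)
        dists = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        c = dists.argmin(axis=1)
        # Step 2: maximum-likelihood estimate of each mean from its points
        for j in range(k):
            if np.any(c == j):
                means[j] = X[c == j].mean(axis=0)
    return means, c
```

For example, calling kmeans(X, 2) on the two-dimensional data from the earlier Gaussian sketch would return two cluster means and a hard assignment for every point.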
25. Alternating optimization algorithm
Step 0: initial parameter values
26. Alternating optimization algorithm
Step 1: update assignments
27. Alternating optimization algorithm
Step 2: update parameters
28. Alternating optimization algorithm
Step 1: update assignments
29. Alternating optimization algorithm
Step 2: update parameters
30. Alternating optimization algorithm
- 0. Guess initial parameter values
- 1. Given parameter estimates, solve for maximum a posteriori assignments ci
- 2. Given assignments ci, solve for maximum likelihood parameter estimates
- 3. Go to step 1
Why hard assignments?
31. Estimating a Gaussian (with hard assignments)
X = {x1, x2, ..., xn} independently sampled from a Gaussian
maximum likelihood parameter estimates: the mean and variance of the points assigned to each cluster
32. Estimating a Gaussian (with soft assignments)
the weight of each point is the probability of it being in the cluster
maximum likelihood parameter estimates: weighted means and variances, e.g. the mean becomes Σ_i w_i x_i / Σ_i w_i with weights w_i = P(cluster | x_i)
33. The Expectation-Maximization algorithm (clustering version)
- 0. Guess initial parameter values
- 1. Given parameter estimates, compute posterior distribution over assignments ci
- 2. Solve for maximum likelihood parameter estimates, weighting each observation by the probability it came from that cluster
- 3. Go to step 1
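Below is a small Python/numpy sketch of this clustering version of EM for a one-dimensional mixture of Gaussians. The function and variable names are mine, and unlike the k-means simplification it also re-estimates the mixing weights; treat it as an illustration of the two steps rather than the exact algorithm on the slides.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def em_mixture_of_gaussians(x, k=2, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    # 0. Guess initial parameter values
    mu = rng.choice(x, size=k, replace=False)
    sigma = np.full(k, x.std())
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # 1. E-step: posterior probability that each point came from each cluster
        resp = np.stack([pi[j] * gauss_pdf(x, mu[j], sigma[j]) for j in range(k)],
                        axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # 2. M-step: maximum-likelihood estimates, weighting each point by resp
        nk = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
        pi = nk / len(x)
    return pi, mu, sigma
```

Each iteration of this loop cannot decrease the likelihood, which is the convergence property noted on a later slide.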
34. The Expectation-Maximization algorithm (more general version)
- 0. Guess initial parameter values
- 1. Given parameter estimates, compute posterior distribution over latent variables z
- 2. Find parameter estimates that maximize the expected log-likelihood, with the expectation taken over the posterior from step 1
- 3. Go to step 1
35. A note on expectations
- For a function f(x) and distribution P(x), the expectation of f with respect to P is E_P[f(x)] = Σ_x f(x) P(x)
- The expectation is the average of f, when x is drawn from the probability distribution P
36. Good features of EM
- Convergence
  - guaranteed to converge to at least a local maximum of the likelihood (or other extremum)
  - likelihood is non-decreasing across iterations
- Efficiency
  - big steps initially (other algorithms are better later)
- Generality
  - can be defined for many probabilistic models
  - can be combined with a prior for MAP estimation
37. Limitations of EM
- Local maxima (rather than the global maximum)
  - e.g., one component poorly fits two clusters, while two components split up a single cluster
- Degeneracies
  - e.g., two components may merge, or a component may lock onto one data point, with its variance going to zero
- May be intractable for complex models
  - dealing with this is an active research topic
38. EM and cognitive science
- The EM algorithm seems like it might be a good way to describe some bootstrapping
  - anywhere there's a chicken-and-egg problem
  - a prime example: language learning
39. Probabilistic context-free grammars
[Figure: parse tree rooted at S]
S → NP VP   1.0
NP → T N    0.7
NP → N      0.3
VP → V NP   1.0
T → the     0.8
T → a       0.2
N → man     0.5
N → ball    0.5
V → hit     0.6
V → took    0.4
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
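The product above is just the probabilities of the rules used in the tree multiplied together. A small Python sketch of that computation (my own illustration; I am assuming a parse of "the man hit the ball", which is consistent with the factors shown, though the transcript does not preserve the tree itself):

```python
# Rule probabilities from the grammar above
rules = {
    ("S", ("NP", "VP")): 1.0, ("NP", ("T", "N")): 0.7, ("NP", ("N",)): 0.3,
    ("VP", ("V", "NP")): 1.0, ("T", ("the",)): 0.8, ("T", ("a",)): 0.2,
    ("N", ("man",)): 0.5, ("N", ("ball",)): 0.5,
    ("V", ("hit",)): 0.6, ("V", ("took",)): 0.4,
}

def tree_prob(tree):
    """P(tree) = product of the probabilities of the rules used to expand it."""
    if isinstance(tree, str):        # a terminal word
        return 1.0
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = rules[(label, child_labels)]
    for c in children:
        p *= tree_prob(c)
    return p

# One parse consistent with the factors multiplied on the slide
tree = ("S", [("NP", [("T", ["the"]), ("N", ["man"])]),
              ("VP", [("V", ["hit"]),
                      ("NP", [("T", ["the"]), ("N", ["ball"])])])])
print(tree_prob(tree))   # 1.0 * 0.7 * 1.0 * 0.8 * 0.5 * 0.6 * 0.7 * 0.8 * 0.5
```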
40. EM and cognitive science
- The EM algorithm seems like it might be a good way to describe some bootstrapping
  - anywhere there's a chicken-and-egg problem
  - a prime example: language learning
- Fried and Holyoak (1984) explicitly tested a model of human categorization that was almost exactly a version of the EM algorithm for a mixture of Gaussians
41. Part IV: Inference algorithms
- The EM algorithm
  - for estimation in models with latent variables
- Markov chain Monte Carlo
  - for sampling from posterior distributions involving large numbers of variables
42. The Monte Carlo principle
- The expectation of f with respect to P can be approximated by E_P[f(x)] ≈ (1/n) Σ_i f(x_i)
  - where the x_i are sampled from P(x)
- Example: the average number of spots on a die roll
43. The Monte Carlo principle
The law of large numbers
[Figure: average number of spots vs. number of rolls; the running average converges to 3.5]
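A minimal Python/numpy illustration of the Monte Carlo principle for the die example (the code is my addition): the average of sampled rolls approaches the true expectation of 3.5 as the number of samples grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E[f(x)] with f(x) = x and x a fair die roll:
# E[f(x)] is approximated by (1/n) * sum_i f(x_i), with x_i sampled from P(x)
for n in (10, 100, 1000, 10000):
    rolls = rng.integers(1, 7, size=n)   # uniform on {1, ..., 6}
    print(n, rolls.mean())               # approaches 3.5 as n grows
```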
44. Markov chain Monte Carlo
- Sometimes it isn't possible to sample directly from a distribution
- Sometimes, you can only compute something proportional to the distribution
- Markov chain Monte Carlo: construct a Markov chain that will converge to the target distribution, and draw samples from that chain
  - just uses something proportional to the target distribution
45. Markov chains
[Figure: a chain of states x(1) → x(2) → ... → x(t) → x(t+1) → ...]
Transition matrix T = P(x(t+1)|x(t))
- Variables x(t+1) are independent of all previous variables given the immediate predecessor x(t)
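The convergence behavior discussed on the next slides is easy to see with a few lines of Python/numpy (the three-state transition matrix here is made up; any ergodic chain behaves the same way): repeatedly applying T to an initial distribution drives it toward the chain's stationary distribution.

```python
import numpy as np

# T[i, j] = P(x(t+1) = j | x(t) = i); rows sum to 1 (values are illustrative)
T = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

p = np.array([1.0, 0.0, 0.0])   # start with all probability on state 0
for t in range(200):
    p = p @ T                    # one step of the chain
print(p)                         # approximately the stationary distribution
```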
46. An example: card shuffling
- Each state x(t) is a permutation of a deck of cards (there are 52! permutations)
- Transition matrix T indicates how likely one permutation is to become another
- The transition probabilities are determined by the shuffling procedure
  - riffle shuffle
  - overhand
  - one card
47. Convergence of Markov chains
- Why do we shuffle cards?
  - Convergence to a uniform distribution takes only 7 riffle shuffles
- Other Markov chains will also converge to a stationary distribution, if certain simple conditions are satisfied (called ergodicity)
  - e.g., every state can be reached in some number of steps from every other state
48. Markov chain Monte Carlo
[Figure: a chain of states x(1) → x(2) → ... → x(t) → x(t+1) → ...]
Transition matrix T = P(x(t+1)|x(t))
- States of the chain are the variables of interest
- Transition matrix chosen to give the target distribution as its stationary distribution
49. Metropolis-Hastings algorithm
- Transitions have two parts
  - proposal distribution: Q(x(t+1)|x(t))
  - acceptance: take proposals with probability
    A(x(t), x(t+1)) = min( 1, [P(x(t+1)) Q(x(t)|x(t+1))] / [P(x(t)) Q(x(t+1)|x(t))] )
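A sketch of the algorithm in Python/numpy (my illustration, not code from the slides). The target here is an arbitrary unnormalized density, since, as noted above, only something proportional to P(x) is needed; the proposal is a symmetric Gaussian random walk, so the Q terms cancel in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(0)

def unnormalized_p(x):
    # anything proportional to the target p(x); this bimodal bump is made up
    return np.exp(-0.5 * (x - 2.0) ** 2) + 0.5 * np.exp(-0.5 * (x + 2.0) ** 2)

def metropolis_hastings(n_samples=10000, proposal_sd=1.0):
    x = 0.0
    samples = []
    for _ in range(n_samples):
        # proposal: Q(x(t+1)|x(t)) is a Gaussian centered on the current state
        x_new = x + proposal_sd * rng.standard_normal()
        # acceptance probability; Q is symmetric, so its ratio equals 1
        a = min(1.0, unnormalized_p(x_new) / unnormalized_p(x))
        if rng.random() < a:
            x = x_new
        samples.append(x)
    return np.array(samples)

samples = metropolis_hastings()
print(samples.mean(), samples.std())
```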
50-55. Metropolis-Hastings algorithm
[Figures: a sequence of proposed moves under the target density p(x); one proposal is accepted with probability A(x(t), x(t+1)) = 0.5, another with probability A(x(t), x(t+1)) = 1]
56. Gibbs sampling
- A particular choice of proposal distribution
- For variables x = x1, x2, ..., xn
  - Draw xi(t+1) from P(xi|x-i)
  - where x-i = x1(t+1), x2(t+1), ..., xi-1(t+1), xi+1(t), ..., xn(t)
  - (this is called the full conditional distribution)
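As a concrete illustration, here is a Gibbs sampler in Python/numpy for a correlated bivariate Gaussian, a standard two-variable example in which both full conditionals are themselves Gaussian. The correlation value and the names are my choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_bivariate_gaussian(rho=0.8, n_samples=5000):
    """Gibbs sampling for a zero-mean, unit-variance bivariate Gaussian with
    correlation rho: each step draws one variable from its full conditional
    P(xi|x-i), a Gaussian with mean rho * (other) and variance 1 - rho**2."""
    x1, x2 = 0.0, 0.0
    samples = []
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, np.sqrt(1.0 - rho ** 2))
        x2 = rng.normal(rho * x1, np.sqrt(1.0 - rho ** 2))
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_gaussian()
print(np.corrcoef(samples.T))   # off-diagonal entries should be close to 0.8
```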
57. Gibbs sampling
[Figure illustrating Gibbs sampling (MacKay, 2002)]
58. MCMC vs. EM
EM converges to a single solution
MCMC converges to a distribution of solutions
59. MCMC and cognitive science
- The Metropolis-Hastings algorithm seems like a good metaphor for aspects of development
- Some forms of cultural evolution can be shown to be equivalent to Gibbs sampling
  - (Griffiths & Kalish, 2005)
- For experiments based on MCMC, see the talk by Adam Sanborn at MathPsych!
- The main use of MCMC is for probabilistic inference in complex models
60. A selection of topics
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
61. Semantic gist of document
Semantic classes:
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING
Syntactic classes:
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
62. Summary
- Probabilistic models can pose significant computational challenges
  - parameter estimation with latent variables, computing posteriors with many variables
- Clever algorithms exist for solving these problems, easing the use of probabilistic models
- These algorithms also provide a source of new models and methods in cognitive science
64. Generative models for language
[Figure: latent structure → observed data]
65. Generative models for language
[Figure: meaning → words]
66. Topic models
- Each document (or conversation, or segment of either) is a mixture of topics
- Each word is chosen from a single topic: P(wi) = Σ_{j=1..T} P(wi|zi = j) P(zi = j)
  - where wi is the ith word
  - zi is the topic of the ith word
  - T is the number of topics
67. Generating a document
[Figure: g (distribution over topics) → topic assignments z1, z2, z3 → observed words w1, w2, w3]
68. Two example topics, P(w|z = 1) and P(w|z = 2)
topic 1: HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
topic 2: HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
69. Choose mixture weights for each document, generate bag of words
g = {P(z = 1), P(z = 2)}:
{0, 1}: MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
{0.25, 0.75}: SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC HEART LOVE TEARS KNOWLEDGE HEART
{0.5, 0.5}: MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK TEARS SOUL KNOWLEDGE HEART
{0.75, 0.25}: WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE LOVE SOUL
{1, 0}: TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
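The generative procedure behind these examples is easy to write down; the Python/numpy sketch below (my code, using the two topics from the earlier slide) draws a topic for each word from the mixture weights g and then a word from that topic.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["HEART", "LOVE", "SOUL", "TEARS", "JOY",
         "SCIENTIFIC", "KNOWLEDGE", "WORK", "RESEARCH", "MATHEMATICS"]
# P(w|z = 1) and P(w|z = 2), as on the earlier two-topic slide
phi = np.array([[0.2] * 5 + [0.0] * 5,
                [0.0] * 5 + [0.2] * 5])

def generate_document(g, n_words=10):
    """g = [P(z = 1), P(z = 2)]: draw a topic per word, then a word per topic."""
    z = rng.choice(2, size=n_words, p=g)
    words = [rng.choice(vocab, p=phi[zi]) for zi in z]
    return " ".join(words)

for g in ([0.0, 1.0], [0.25, 0.75], [0.5, 0.5], [0.75, 0.25], [1.0, 0.0]):
    print(g, generate_document(np.array(g)))
```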
70. Inferring topics from text
- The topic model is a generative model for a set of documents (assuming a set of topics)
  - a simple procedure for generating documents
- Given the documents, we can try to find the topics and their proportions in each document
- This is an unsupervised learning problem
  - we can use the EM algorithm, but it's not great
  - instead, we use Markov chain Monte Carlo
71. A selection from 500 topics: P(w|z = j)
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
72. A selection from 500 topics: P(w|z = j)
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
73. A selection from 500 topics: P(w|z = j)
[Same six topics repeated from the previous slide]
74. Gibbs sampling for topics
- Need full conditional distributions for the variables
- Since we only sample z, we need P(zi = j|z-i, w), which is proportional to the product of two terms:
  - one involving the number of times word w is assigned to topic j
  - one involving the number of times topic j is used in document d
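A sketch of one sweep of this sampler in Python/numpy. It follows the standard collapsed Gibbs update for topic models, in which the full conditional for a token's topic is proportional to the two count-based terms above; the symmetric Dirichlet hyperparameters alpha and beta, the variable names, and the tiny example corpus are my additions and are not in the transcript.

```python
import numpy as np

rng = np.random.default_rng(0)

def gibbs_sweep(words, docs, z, n_wt, n_dt, alpha=1.0, beta=0.01):
    """One sweep over all tokens. words[i], docs[i], z[i] give token i's word,
    document, and current topic; n_wt[w, t] counts word w in topic t, and
    n_dt[d, t] counts topic t in document d."""
    n_topics = n_wt.shape[1]
    for i in range(len(words)):
        w, d, t_old = words[i], docs[i], z[i]
        n_wt[w, t_old] -= 1          # remove token i from the counts
        n_dt[d, t_old] -= 1
        # full conditional: (word w in topic) term x (topic in document d) term
        p = (n_wt[w] + beta) / (n_wt.sum(axis=0) + beta * n_wt.shape[0]) \
            * (n_dt[d] + alpha)
        p /= p.sum()
        t_new = rng.choice(n_topics, p=p)
        n_wt[w, t_new] += 1          # add the token back with its new topic
        n_dt[d, t_new] += 1
        z[i] = t_new
    return z

# Tiny made-up corpus: 2 documents, 4-word vocabulary, 2 topics
words = np.array([0, 1, 0, 2, 3, 2, 3, 1])
docs = np.array([0, 0, 0, 0, 1, 1, 1, 1])
z = rng.integers(0, 2, size=len(words))
n_wt = np.zeros((4, 2), dtype=float)
n_dt = np.zeros((2, 2), dtype=float)
for i in range(len(words)):
    n_wt[words[i], z[i]] += 1
    n_dt[docs[i], z[i]] += 1
for _ in range(100):
    z = gibbs_sweep(words, docs, z, n_wt, n_dt)
print(z)
```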
75-83. Gibbs sampling
[Figures: topic assignments of the words in a corpus across iterations 1, 2, ..., 1000]
84. A visual example: Bars
- sample each pixel from a mixture of topics
- pixel = word, image = document
87. Summary
- Probabilistic models can pose significant computational challenges
  - parameter estimation with latent variables, computing posteriors with many variables
- Clever algorithms exist for solving these problems, easing the use of probabilistic models
- These algorithms also provide a source of new models and methods in cognitive science
89. When Bayes is useful
- Clarifying computational problems in cognition
- Providing rational explanations for behavior
- Characterizing the knowledge informing induction
- Capturing inferences at multiple levels