Part IV: Inference algorithms (presentation transcript)
1
Part IV: Inference algorithms
2
Estimation and inference
  • Actually working with probabilistic models
    requires solving some difficult computational
    problems
  • Two key problems
  • estimating parameters in models with latent
    variables
  • computing posterior distributions involving large
    numbers of variables

3
Part IV: Inference algorithms
  • The EM algorithm
  • for estimation in models with latent variables
  • Markov chain Monte Carlo
  • for sampling from posterior distributions
    involving large numbers of variables

4
Part IV: Inference algorithms
  • The EM algorithm
  • for estimation in models with latent variables
  • Markov chain Monte Carlo
  • for sampling from posterior distributions
    involving large numbers of variables

5
SUPERVISED
(Figure: a set of example images, each labeled "dog" or "cat")
6
Supervised learning
Category A
Category B
What characterizes the categories? How should we
categorize a new observation?
7
Parametric density estimation
  • Assume that p(x|c) has a simple form,
    characterized by parameters θ
  • Given stimuli X = {x1, x2, ..., xn} from category c,
    find θ by maximum-likelihood estimation (sketched
    below)
  • or some form of Bayesian estimation
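The formula on this slide appears only as an image in the original; a standard way to write the maximum-likelihood estimate (an assumed reconstruction, not copied from the slide) is

  \hat{\theta} = \arg\max_{\theta} \prod_{i=1}^{n} p(x_i \mid c, \theta)
               = \arg\max_{\theta} \sum_{i=1}^{n} \log p(x_i \mid c, \theta)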

8
Spatial representations
  • Assume a simple parametric form for p(x|c): a
    Gaussian
  • For each category, estimate parameters
  • mean
  • variance

(Graphical model: category c drawn from P(c); observation x drawn from p(x|c); parameters θ of the Gaussian)
9
The Gaussian distribution
(Figure: the Gaussian probability density p(x), plotted against (x-μ)/σ, with variance σ²)
10
Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a
Gaussian
11
Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a
Gaussian
maximum likelihood parameter estimates
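The estimates themselves appear only as an image; the standard maximum-likelihood formulas for a univariate Gaussian are

  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad
  \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \hat{\mu})^2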
12
Multivariate Gaussians
13
Estimating a Gaussian
X = {x1, x2, ..., xn} independently sampled from a
Gaussian
maximum likelihood parameter estimates
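Given the preceding slide, this is presumably the multivariate case; the standard maximum-likelihood formulas are

  \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i, \qquad
  \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\mu})(\mathbf{x}_i - \hat{\mu})^{\top}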
14
Bayesian inference
(Figure: probability plotted against x)
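The figure is not reproduced in the transcript; for the categorization setting above, Bayesian inference over the category given an observation takes the standard form

  P(c \mid x) = \frac{p(x \mid c)\, P(c)}{\sum_{c'} p(x \mid c')\, P(c')}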
15
UNSUPERVISED
16
Unsupervised learning
What latent structure is present? What are the
properties of a new observation?
17
An example: Clustering
Assume each observed xi is from a cluster ci,
where ci is unknown. What characterizes the
clusters? What cluster does a new x come from?
18
Density estimation
  • We need to estimate some probability
    distributions
  • what is P(c)?
  • what is p(x|c)?
  • But c is unknown, so we only know the value of x

(Graphical model: category c drawn from P(c); observation x drawn from p(x|c))
19
Supervised and unsupervised
  • Supervised learning: categorization
  • Given x = {x1, ..., xn} and c = {c1, ..., cn}
  • Estimate parameters θ of p(x|c) and P(c)
  • Unsupervised learning: clustering
  • Given x = {x1, ..., xn}
  • Estimate parameters θ of p(x|c) and P(c)

20
Mixture distributions
(Figure: a mixture distribution and its mixture components, probability plotted against x)
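In symbols (the formula is an image in the original), the mixture distribution is the weighted sum of its components:

  p(x) = \sum_{c} P(c)\, p(x \mid c)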
21
More generally
  • Unsupervised learning is density estimation
    using distributions with latent variables

(Graphical model: latent (unobserved) variable z drawn from P(z); observed variable x drawn from P(x|z); marginalize out, i.e. sum over, the latent structure)
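Writing the marginalization out explicitly (not shown in the transcript):

  P(x) = \sum_{z} P(z)\, P(x \mid z)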
22
A chicken and egg problem
  • If we knew which cluster the observations were
    from, we could find the distributions
  • this is just density estimation
  • If we knew the distributions, we could infer
    which cluster each observation came from
  • this is just categorization

23
Alternating optimization algorithm
  • 0. Guess initial parameter values
  • 1. Given parameter estimates, solve for maximum a
    posteriori assignments ci
  • 2. Given assignments ci, solve for maximum
    likelihood parameter estimates
  • 3. Go to step 1

24
Alternating optimization algorithm
(Diagram: observations x; assignments to clusters c; parameters θ = μ, σ, P(c))
For simplicity, assume σ and P(c) fixed: the k-means
algorithm
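A minimal Python sketch of this alternating scheme with hard assignments and only the means updated, i.e. a k-means-style algorithm (the function and variable names are illustrative assumptions, not from the slides):

import numpy as np

def kmeans(x, k, n_iters=20, seed=0):
    """Alternate between hard assignments and mean updates (k-means, 1-D)."""
    rng = np.random.default_rng(seed)
    means = rng.choice(x, size=k, replace=False)  # step 0: initial parameter values
    for _ in range(n_iters):
        # step 1: assign each point to the nearest mean (MAP assignment
        # under equal-variance Gaussians with uniform P(c))
        assignments = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        # step 2: maximum-likelihood update of each cluster mean
        for c in range(k):
            if np.any(assignments == c):
                means[c] = x[assignments == c].mean()
    return means, assignments

# usage: two well-separated 1-D clusters
data = np.concatenate([np.random.normal(-2, 0.5, 100), np.random.normal(3, 0.5, 100)])
means, labels = kmeans(data, k=2)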
25
Alternating optimization algorithm
Step 0: initial parameter values
26
Alternating optimization algorithm
Step 1: update assignments
27
Alternating optimization algorithm
Step 2: update parameters
28
Alternating optimization algorithm
Step 1: update assignments
29
Alternating optimization algorithm
Step 2: update parameters
30
Alternating optimization algorithm
  • 0. Guess initial parameter values
  • 1. Given parameter estimates, solve for maximum a
    posteriori assignments ci
  • 2. Given assignments ci, solve for maximum
    likelihood parameter estimates
  • 3. Go to step 1

why hard assignments?
31
Estimating a Gaussian (with hard assignments)
X = {x1, x2, ..., xn} independently sampled from a
Gaussian
maximum likelihood parameter estimates
32
Estimating a Gaussian (with soft assignments)
the weight of each point is the probability of
being in the cluster
maximum likelihood parameter estimates
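The weighted estimates appear only as an image; with weights w_{ic} = P(c_i = c | x_i), the standard responsibility-weighted formulas are

  \hat{\mu}_c = \frac{\sum_i w_{ic}\, x_i}{\sum_i w_{ic}}, \qquad
  \hat{\sigma}_c^2 = \frac{\sum_i w_{ic}\, (x_i - \hat{\mu}_c)^2}{\sum_i w_{ic}}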
33
The Expectation-Maximization algorithm
(clustering version)
  • 0. Guess initial parameter values
  • 1. Given parameter estimates, compute posterior
    distribution over assignments ci
  • 2. Solve for maximum likelihood parameter
    estimates, weighting each observation by the
    probability it came from that cluster
  • 3. Go to step 1
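A minimal Python sketch of this clustering version of EM, for a 1-D mixture of Gaussians (the names and the fixed, uniform mixing weights are illustrative assumptions, not from the slides):

import numpy as np
from scipy.stats import norm

def em_mixture(x, k, n_iters=50, seed=0):
    """EM for a 1-D mixture of Gaussians with uniform mixing weights."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False)   # step 0: initial parameters
    sigma = np.full(k, x.std())
    for _ in range(n_iters):
        # E step: posterior probability of each cluster for each point
        resp = np.stack([norm.pdf(x, mu[c], sigma[c]) for c in range(k)], axis=1)
        resp /= resp.sum(axis=1, keepdims=True)
        # M step: weighted maximum-likelihood updates
        for c in range(k):
            w = resp[:, c]
            mu[c] = np.sum(w * x) / w.sum()
            sigma[c] = np.sqrt(np.sum(w * (x - mu[c]) ** 2) / w.sum())
    return mu, sigma, resp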

34
The Expectation-Maximization algorithm
(more general version)
  • 0. Guess initial parameter values
  • 1. Given parameter estimates, compute posterior
    distribution over latent variables z
  • 2. Find parameter estimates (the update is written
    out below)
  • 3. Go to step 1
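The update on this slide is shown only as an image; in the usual notation, the M step chooses parameters that maximize the expected complete-data log-likelihood:

  \theta^{(t+1)} = \arg\max_{\theta}\; \mathbb{E}_{z \sim P(z \mid x, \theta^{(t)})}\!\left[\log P(x, z \mid \theta)\right]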

35
A note on expectations
  • For a function f(x) and distribution P(x), the
    expectation of f with respect to P is defined as
    shown below
  • The expectation is the average of f, when x is
    drawn from the probability distribution P
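The definition itself is an image in the original; in symbols,

  \mathbb{E}_P[f(x)] = \sum_x f(x)\, P(x)
  \quad \text{(or } \int f(x)\, p(x)\, dx \text{ for a density)}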

36
Good features of EM
  • Convergence
  • guaranteed to converge to at least a local
    maximum of the likelihood (or other extremum)
  • likelihood is non-decreasing across iterations
  • Efficiency
  • big steps initially (other algorithms better
    later)
  • Generality
  • can be defined for many probabilistic models
  • can be combined with a prior for MAP estimation

37
Limitations of EM
  • Local optima
  • e.g., one component poorly fits two clusters,
    while two components split up a single cluster
  • Degeneracies
  • e.g., two components may merge, a component may
    lock onto one data point, with variance going to
    zero
  • May be intractable for complex models
  • dealing with this is an active research topic

38
EM and cognitive science
  • The EM algorithm seems like it might be a good
    way to describe some kinds of bootstrapping
  • anywhere there's a chicken-and-egg problem
  • a prime example: language learning

39
Probabilistic context-free grammars
(Figure: a parse tree rooted at S)
S → NP VP   1.0     NP → T N   0.7     NP → N   0.3
VP → V NP   1.0     T → the    0.8     T → a    0.2
N → man     0.5     N → ball   0.5
V → hit     0.6     V → took   0.4
P(tree) = 1.0 × 0.7 × 1.0 × 0.8 × 0.5 × 0.6 × 0.7 × 0.8 × 0.5
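Assuming the parse shown is for the sentence "the man hit the ball", the product evaluates to

  P(tree) = 1.0 \times 0.7 \times 1.0 \times 0.8 \times 0.5 \times 0.6 \times 0.7 \times 0.8 \times 0.5 \approx 0.047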
40
EM and cognitive science
  • The EM algorithm seems like it might be a good
    way to describe some kinds of bootstrapping
  • anywhere there's a chicken-and-egg problem
  • a prime example: language learning
  • Fried and Holyoak (1984) explicitly tested a
    model of human categorization that was almost
    exactly a version of the EM algorithm for a
    mixture of Gaussians

41
Part IV: Inference algorithms
  • The EM algorithm
  • for estimation in models with latent variables
  • Markov chain Monte Carlo
  • for sampling from posterior distributions
    involving large numbers of variables

42
The Monte Carlo principle
  • The expectation of f with respect to P can be
    approximated by the sample average shown below
  • where the xi are sampled from P(x)
  • Example: the average number of spots on a die roll
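The estimator is an image in the original; it is the familiar Monte Carlo average, E_P[f] ≈ (1/n) Σ_i f(x_i) with x_i ~ P(x). A small Python illustration of the die example (illustrative code, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=10_000)  # samples from a fair die, P(x) = 1/6
estimate = rolls.mean()                  # Monte Carlo estimate of the expected number of spots
print(estimate)                          # close to the true expectation, 3.5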

43
The Monte Carlo principle
The law of large numbers
(Figure: average number of spots plotted against number of rolls)
44
Markov chain Monte Carlo
  • Sometimes it isn't possible to sample directly
    from a distribution
  • Sometimes, you can only compute something
    proportional to the distribution
  • Markov chain Monte Carlo: construct a Markov
    chain that will converge to the target
    distribution, and draw samples from that chain
  • just uses something proportional to the target

45
Markov chains
(Figure: a sequence of states x(1) → x(2) → ... → x(t) → ...)
Transition matrix T = P(x(t+1) | x(t))
  • Variable x(t+1) is independent of all previous
    variables given its immediate predecessor x(t)

46
An example card shuffling
  • Each state x(t) is a permutation of a deck of
    cards (there are 52! permutations)
  • Transition matrix T indicates how likely it is that
    one permutation will become another
  • The transition probabilities are determined by
    the shuffling procedure
  • riffle shuffle
  • overhand
  • one card

47
Convergence of Markov chains
  • Why do we shuffle cards?
  • Convergence to a uniform distribution takes only
    7 riffle shuffles
  • Other Markov chains will also converge to a
    stationary distribution, if certain simple
    conditions are satisfied (called ergodicity)
  • e.g. every state can be reached in some number of
    steps from every other state

48
Markov chain Monte Carlo
(Figure: a sequence of states x(1) → x(2) → ... → x(t) → ...)
Transition matrix T = P(x(t+1) | x(t))
  • States of chain are variables of interest
  • Transition matrix chosen to give target
    distribution as stationary distribution

49
Metropolis-Hastings algorithm
  • Transitions have two parts
  • proposal distribution Q(x(t+1) | x(t))
  • acceptance: take proposals with probability
    A(x(t), x(t+1)) = min( 1, [P(x(t+1)) Q(x(t) | x(t+1))] /
    [P(x(t)) Q(x(t+1) | x(t))] )
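A minimal Python sketch of Metropolis-Hastings with a symmetric Gaussian random-walk proposal, for which the Q terms cancel; the target only needs to be known up to a constant (the names p_unnorm, step, etc. are illustrative assumptions, not from the slides):

import numpy as np

def metropolis_hastings(p_unnorm, x0=0.0, n_samples=5000, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized 1-D density."""
    rng = np.random.default_rng(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()    # Q(x' | x): symmetric
        accept_prob = min(1.0, p_unnorm(proposal) / p_unnorm(x))
        if rng.random() < accept_prob:                 # accept with probability A
            x = proposal
        samples.append(x)
    return np.array(samples)

# usage: sample from a density proportional to exp(-x^2 / 2)
samples = metropolis_hastings(lambda x: np.exp(-x**2 / 2))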
50
Metropolis-Hastings algorithm
p(x)
51
Metropolis-Hastings algorithm
p(x)
52
Metropolis-Hastings algorithm
p(x)
53
Metropolis-Hastings algorithm
p(x)
A(x(t), x(t+1)) = 0.5
54
Metropolis-Hastings algorithm
p(x)
55
Metropolis-Hastings algorithm
p(x)
A(x(t), x(t+1)) = 1
56
Gibbs sampling
  • Particular choice of proposal distribution
  • For variables x = {x1, x2, ..., xn}
  • Draw xi(t+1) from P(xi | x-i)
  • x-i = {x1(t+1), x2(t+1), ..., xi-1(t+1), xi+1(t), ...,
    xn(t)}
  • (this is called the full conditional distribution)
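A minimal Python sketch of Gibbs sampling for a standard bivariate Gaussian with correlation rho, where both full conditionals are themselves Gaussian (this example and its names are illustrative assumptions, not from the slides):

import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_samples=5000, seed=0):
    """Gibbs sampler for a standard bivariate Gaussian with correlation rho."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0
    cond_sd = np.sqrt(1 - rho**2)   # standard deviation of each full conditional
    samples = []
    for _ in range(n_samples):
        # draw x1 from P(x1 | x2), then x2 from P(x2 | new x1)
        x1 = rng.normal(rho * x2, cond_sd)
        x2 = rng.normal(rho * x1, cond_sd)
        samples.append((x1, x2))
    return np.array(samples)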

57
Gibbs sampling
(MacKay, 2002)
58
MCMC vs. EM
EM converges to a single solution
MCMC converges to a distribution of solutions
59
MCMC and cognitive science
  • The Metropolis-Hastings algorithm seems like a
    good metaphor for aspects of development
  • Some forms of cultural evolution can be shown to
    be equivalent to Gibbs sampling
  • (Griffiths & Kalish, 2005)
  • For experiments based on MCMC, see talk by Adam
    Sanborn at MathPsych!
  • The main use of MCMC is for probabilistic
    inference in complex models

60
A selection of topics
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
DISEASE BACTERIA DISEASES GERMS FEVER CAUSE CAUSED SPREAD VIRUSES INFECTION VIRUS MICROORGANISMS PERSON INFECTIOUS COMMON CAUSING SMALLPOX BODY INFECTIONS CERTAIN
WATER FISH SEA SWIM SWIMMING POOL LIKE SHELL SHARK TANK SHELLS SHARKS DIVING DOLPHINS SWAM LONG SEAL DIVE DOLPHIN UNDERWATER
61
Semantic gist of document
Semantic classes:
GOLD IRON SILVER COPPER METAL METALS STEEL CLAY LEAD ADAM
BEHAVIOR SELF INDIVIDUAL PERSONALITY RESPONSE SOCIAL EMOTIONAL LEARNING FEELINGS PSYCHOLOGISTS
CELLS CELL ORGANISMS ALGAE BACTERIA MICROSCOPE MEMBRANE ORGANISM FOOD LIVING
DOCTOR PATIENT HEALTH HOSPITAL MEDICAL CARE PATIENTS NURSE DOCTORS MEDICINE
BOOK BOOKS READING INFORMATION LIBRARY REPORT PAGE TITLE SUBJECT PAGES
MAP NORTH EARTH SOUTH POLE MAPS EQUATOR WEST LINES EAST
FOOD FOODS BODY NUTRIENTS DIET FAT SUGAR ENERGY MILK EATING
Syntactic classes:
MORE SUCH LESS MUCH KNOWN JUST BETTER RATHER GREATER HIGHER LARGER
ON AT INTO FROM WITH THROUGH OVER AROUND AGAINST ACROSS UPON
GOOD SMALL NEW IMPORTANT GREAT LITTLE LARGE BIG LONG HIGH DIFFERENT
THE HIS THEIR YOUR HER ITS MY OUR THIS THESE A
ONE SOME MANY TWO EACH ALL MOST ANY THREE THIS EVERY
HE YOU THEY I SHE WE IT PEOPLE EVERYONE OTHERS SCIENTISTS
BE MAKE GET HAVE GO TAKE DO FIND USE SEE HELP
62
Summary
  • Probabilistic models can pose significant
    computational challenges
  • parameter estimation with latent variables,
    computing posteriors with many variables
  • Clever algorithms exist for solving these
    problems, easing use of probabilistic models
  • These algorithms also provide a source of new
    models and methods in cognitive science

63
(No Transcript)
64
Generative models for language
latent structure
observed data
65
Generative models for language
meaning
words
66
Topic models
  • Each document (or conversation, or segment of
    either) is a mixture of topics
  • Each word is chosen from a single topic, as in the
    formula below
  • where wi is the ith word
  • zi is the topic of the ith word
  • T is the number of topics
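The formula referred to by the "where" clause is an image in the original; the standard topic-model mixture it presumably shows is

  P(w_i) = \sum_{j=1}^{T} P(w_i \mid z_i = j)\, P(z_i = j)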

67
Generating a document
(Graphical model: g, a distribution over topics for the document; topic assignments z, one per word, drawn from g; observed words w drawn from the corresponding topics)
68
topic 1, P(w | z = 1): HEART 0.2, LOVE 0.2, SOUL 0.2, TEARS 0.2, JOY 0.2, SCIENTIFIC 0.0, KNOWLEDGE 0.0, WORK 0.0, RESEARCH 0.0, MATHEMATICS 0.0
topic 2, P(w | z = 2): HEART 0.0, LOVE 0.0, SOUL 0.0, TEARS 0.0, JOY 0.0, SCIENTIFIC 0.2, KNOWLEDGE 0.2, WORK 0.2, RESEARCH 0.2, MATHEMATICS 0.2
69
Choose mixture weights for each document,
generate bag of words
g = {P(z = 1), P(z = 2)}, taking the values {0, 1},
{0.25, 0.75}, {0.5, 0.5}, {0.75, 0.25}, {1, 0}
(Generated documents, one per setting of g:)
MATHEMATICS KNOWLEDGE RESEARCH WORK MATHEMATICS
RESEARCH WORK SCIENTIFIC MATHEMATICS WORK
SCIENTIFIC KNOWLEDGE MATHEMATICS SCIENTIFIC
HEART LOVE TEARS KNOWLEDGE HEART
MATHEMATICS HEART RESEARCH LOVE MATHEMATICS WORK
TEARS SOUL KNOWLEDGE HEART
WORK JOY SOUL TEARS MATHEMATICS TEARS LOVE LOVE
LOVE SOUL
TEARS LOVE JOY SOUL LOVE TEARS SOUL SOUL TEARS JOY
70
Inferring topics from text
  • The topic model is a generative model for a set
    of documents (assuming a set of topics)
  • a simple procedure for generating documents
  • Given the documents, we can try to find the
    topics and their proportions in each document
  • This is an unsupervised learning problem
  • we can use the EM algorithm, but it's not great
  • instead, we use Markov chain Monte Carlo

71
A selection from 500 topics: P(w | z = j)
BRAIN NERVE SENSE SENSES ARE NERVOUS NERVES BODY SMELL TASTE TOUCH MESSAGES IMPULSES CORD ORGANS SPINAL FIBERS SENSORY PAIN IS
CURRENT ELECTRICITY ELECTRIC CIRCUIT IS ELECTRICAL VOLTAGE FLOW BATTERY WIRE WIRES SWITCH CONNECTED ELECTRONS RESISTANCE POWER CONDUCTORS CIRCUITS TUBE NEGATIVE
ART PAINT ARTIST PAINTING PAINTED ARTISTS MUSEUM WORK PAINTINGS STYLE PICTURES WORKS OWN SCULPTURE PAINTER ARTS BEAUTIFUL DESIGNS PORTRAIT PAINTERS
STUDENTS TEACHER STUDENT TEACHERS TEACHING CLASS CLASSROOM SCHOOL LEARNING PUPILS CONTENT INSTRUCTION TAUGHT GROUP GRADE SHOULD GRADES CLASSES PUPIL GIVEN
SPACE EARTH MOON PLANET ROCKET MARS ORBIT ASTRONAUTS FIRST SPACECRAFT JUPITER SATELLITE SATELLITES ATMOSPHERE SPACESHIP SURFACE SCIENTISTS ASTRONAUT SATURN MILES
THEORY SCIENTISTS EXPERIMENT OBSERVATIONS SCIENTIFIC EXPERIMENTS HYPOTHESIS EXPLAIN SCIENTIST OBSERVED EXPLANATION BASED OBSERVATION IDEA EVIDENCE THEORIES BELIEVED DISCOVERED OBSERVE FACTS
72
A selection from 500 topics: P(w | z = j)
FIELD MAGNETIC MAGNET WIRE NEEDLE CURRENT COIL POLES IRON COMPASS LINES CORE ELECTRIC DIRECTION FORCE MAGNETS BE MAGNETISM POLE INDUCED
STORY STORIES TELL CHARACTER CHARACTERS AUTHOR READ TOLD SETTING TALES PLOT TELLING SHORT FICTION ACTION TRUE EVENTS TELLS TALE NOVEL
JOB WORK JOBS CAREER EXPERIENCE EMPLOYMENT OPPORTUNITIES WORKING TRAINING SKILLS CAREERS POSITIONS FIND POSITION FIELD OCCUPATIONS REQUIRE OPPORTUNITY EARN ABLE
MIND WORLD DREAM DREAMS THOUGHT IMAGINATION MOMENT THOUGHTS OWN REAL LIFE IMAGINE SENSE CONSCIOUSNESS STRANGE FEELING WHOLE BEING MIGHT HOPE
SCIENCE STUDY SCIENTISTS SCIENTIFIC KNOWLEDGE WORK RESEARCH CHEMISTRY TECHNOLOGY MANY MATHEMATICS BIOLOGY FIELD PHYSICS LABORATORY STUDIES WORLD SCIENTIST STUDYING SCIENCES
BALL GAME TEAM FOOTBALL BASEBALL PLAYERS PLAY FIELD PLAYER BASKETBALL COACH PLAYED PLAYING HIT TENNIS TEAMS GAMES SPORTS BAT TERRY
73
A selection from 500 topics: P(w | z = j)
(Same selection of topics as on the previous slide)
74
Gibbs sampling for topics
  • Need full conditional distributions for the
    variables
  • Since we only sample z, we need P(zi | z-i, w),
    which is written out below
  • it depends on the number of times word w is
    assigned to topic j and the number of times
    topic j is used in document d
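The conditional itself appears only as an image; assuming the standard collapsed Gibbs sampler for this topic model, with symmetric Dirichlet priors β on the topic-word distributions and α on the per-document topic proportions (Griffiths & Steyvers, 2004), it has the form

  P(z_i = j \mid z_{-i}, w) \propto
    \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot
    \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}

where the counts are those described above (excluding the current word), W is the vocabulary size, and T is the number of topics.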
75
Gibbs sampling
iteration 1
76
Gibbs sampling
iteration 1 2
77
Gibbs sampling
iteration 1 2
78
Gibbs sampling
iteration 1 2
79
Gibbs sampling
iteration 1 2
80
Gibbs sampling
iteration 1 2
81
Gibbs sampling
iteration 1 2
82
Gibbs sampling
iteration 1 2
83
Gibbs sampling
iteration 1, 2, ..., 1000
84
A visual example: Bars
sample each pixel from a mixture of topics
(pixel = word, image = document)
85
(No Transcript)
86
(No Transcript)
87
Summary
  • Probabilistic models can pose significant
    computational challenges
  • parameter estimation with latent variables,
    computing posteriors with many variables
  • Clever algorithms exist for solving these
    problems, easing use of probabilistic models
  • These algorithms also provide a source of new
    models and methods in cognitive science

88
(No Transcript)
89
When Bayes is useful
  • Clarifying computational problems in cognition
  • Providing rational explanations for behavior
  • Characterizing knowledge informing induction
  • Capturing inferences at multiple levels