1
A Bayesian approach to word segmentation: Theoretical and experimental results
  • Sharon Goldwater
  • Department of Linguistics
  • Stanford University

2
Word segmentation
  • One of the first problems infants must solve when
    learning language.
  • Infants make use of many different cues.
  • Phonotactics, allophonic variation, metrical
    (stress) patterns, effects of coarticulation, and
    statistical regularities in syllable sequences.
  • Statistics may provide initial bootstrapping.
  • Used very early (Thiessen & Saffran, 2003).
  • Language-independent.

3
Modeling statistical segmentation
  • Previous work often focuses on how statistical
    information (e.g., transitional probabilities)
    can be used to segment speech.
  • Bayesian approach asks what information should be
    used by a successful learner.
  • What statistics should be collected?
  • What assumptions (by the learner) constrain
    possible generalizations?

4
Outline
  • 1. Computational model and theoretical results
  • What are the consequences of using different
    sorts of information for optimal word
    segmentation?
  • (joint work with Tom Griffiths and Mark Johnson)
  • 2. Modeling experimental data
  • Do humans behave optimally?
  • (joint work with Mike Frank, Vikash Mansinghka,
    Tom Griffiths, and Josh Tenenbaum)

5
Statistical segmentation
  • Work on statistical segmentation often discusses
    transitional probabilities (Saffran et al., 1996;
    Aslin et al., 1998; Johnson & Jusczyk, 2001).
  • P(syl_i | syl_{i-1}) is often lower at word
    boundaries.
  • What do TPs have to say about words?
  • A word is a unit whose beginning predicts its
    end, but it does not predict other words.
  • Or: a word is a unit whose beginning predicts its
    end, and it also predicts future words.

6
Interpretation of TPs
  • Most previous work assumes words are
    statistically independent.
  • Experimental work: Saffran et al. (1996), many
    others.
  • Computational work: Brent (1999).
  • What about words predicting other words?

tupiro golabu bidaku padoti
golabubidakugolabutupiropadotibidakupadotitupi
7
Questions
  • If a learner assumes that words are independent
    units, what is learned (from more realistic
    input)?
  • Unigram model: generate each word independently.
  • What if the learner assumes that words are units
    that help predict other units?
  • Bigram model: generate each word conditioned on
    the previous word.
  • Approach: use a Bayesian ideal observer model to
    examine the consequences of each assumption.

8
Bayesian learning
  • The Bayesian learner seeks to identify an
    explanatory linguistic hypothesis that
  • accounts for the observed data.
  • conforms to prior expectations.
  • Focus is on the goal of computation, not the
    procedure (algorithm) used to achieve the goal.

9
Bayesian segmentation
  • In the domain of segmentation, we have
  • Data: unsegmented corpus (transcriptions).
  • Hypotheses: sequences of word tokens.
  • The likelihood is 1 if concatenating the hypothesized
    words forms the corpus, and 0 otherwise, so the optimal
    solution is the consistent segmentation with the highest
    prior probability.
  • The prior encodes the unigram or bigram assumption
    (also others).
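
Written out, the setup above is ordinary Bayesian inference over segmentations:

  P(h \mid d) \propto P(d \mid h)\, P(h), \qquad
  P(d \mid h) = \begin{cases} 1 & \text{if concatenating the words of } h \text{ yields } d \\ 0 & \text{otherwise} \end{cases}

so maximizing the posterior reduces to maximizing the prior P(h) over hypotheses consistent with the corpus.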
10
Brent (1999)
  • Describes a Bayesian unigram model for
    segmentation.
  • Prior favors solutions with fewer words, shorter
    words.
  • Problems with Brent's system
  • Learning algorithm is approximate (non-optimal).
  • Difficult to extend to incorporate bigram info.

11
A new unigram model (Dirichlet process)
  • Assume word w_i is generated as follows:
  • 1. Is w_i a novel lexical item?

Fewer word types → higher probability
12
A new unigram model (Dirichlet process)
  • Assume word w_i is generated as follows:
  • 2. If novel, generate phonemic form x_1...x_m.
  • If not, choose the lexical identity of w_i from
    previously occurring words.

Shorter words → higher probability
Power law → higher probability
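
A sketch of these two steps in the usual Dirichlet-process (Chinese restaurant process) form; the concentration parameter \alpha, the stop probability p_\#, and the phoneme distribution P(x) are standard ingredients of such models rather than values taken from these slides:

  P(w_i \text{ is novel}) = \frac{\alpha}{i - 1 + \alpha}, \qquad
  P(w_i = \ell \mid \text{not novel}) = \frac{n_\ell}{i - 1}

  P(w_i = x_1 \dots x_m \mid \text{novel}) = p_\# \, (1 - p_\#)^{m-1} \prod_{j=1}^{m} P(x_j)

The \alpha term penalizes new word types (fewer types → higher probability), the geometric factor in m favors shorter words, and the count n_\ell of previous uses of lexical item \ell produces the rich-get-richer (power-law) behavior.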
13
Unigram model simulations
  • Same corpus as Brent (Bernstein-Ratner, 1987)
  • 9790 utterances of phonemically transcribed
    child-directed speech (19-23 months).
  • Average utterance length: 3.4 words.
  • Average word length: 2.9 phonemes.
  • Example input:

yuwanttusiD6bUk lUkDz6b7wIThIzht nd6dOgi yuwant
tulUktDIs ...
14
Example results

15
Comparison to previous results
  • Proposed boundaries are more accurate than
    Brent's, but fewer proposals are made.
  • Result: word tokens are less accurate.

           Boundary Precision   Boundary Recall
  Brent          .80                 .85
  GGJ            .92                 .62

Precision = # correct / # found = hits / (hits + false alarms)
Recall    = # found / # true    = hits / (hits + misses)

           Token F-score
  Brent        .68
  GGJ          .54

F-score: the harmonic mean of precision and recall.
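
For concreteness, a minimal sketch of how these boundary scores can be computed from sets of boundary positions; the helper function and example positions are hypothetical, not the evaluation code used here:

def boundary_scores(true_bounds, found_bounds):
    # Boundary precision, recall, and F-score from sets of true and
    # proposed word-boundary positions.
    hits = len(true_bounds & found_bounds)    # correctly proposed boundaries
    precision = hits / len(found_bounds)      # correct / found
    recall = hits / len(true_bounds)          # found / true
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Hypothetical example: five true boundaries, three proposed, two correct.
print(boundary_scores({2, 6, 8, 10, 12}, {2, 8, 11}))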
16
What happened?
  • Model assumes (falsely) that words have the same
    probability regardless of context.
  • Positing amalgams allows the model to capture
    word-to-word dependencies.

P(Dt) = .024    P(Dt | WAts) = .46    P(Dt | tu) = .0019
17
What about other unigram models?
  • Brent's learning algorithm is insufficient to
    identify the optimal segmentation.
  • Our solution has higher probability under his
    model than his own solution does.
  • On a randomly permuted corpus, our system achieves
    96% accuracy; Brent gets 81%.
  • Formal analysis shows undersegmentation is the
    optimal solution for any (reasonable) unigram
    model.

18
Bigram model (hierarchical Dirichlet process)
  • Assume word w_i is generated as follows:
  • Is (w_{i-1}, w_i) a novel bigram?
  • If novel, generate w_i using the unigram model
    (almost).
  • If not, choose the lexical identity of w_i from words
    previously occurring after w_{i-1}.
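
In the same CRP-style notation as the unigram sketch above (the concentration parameter \beta and the bigram counts are my notation, not taken from the slides):

  P(w_i \text{ is novel after } w_{i-1}) = \frac{\beta}{n_{w_{i-1}} + \beta}, \qquad
  P(w_i = w \mid \text{not novel}) = \frac{n_{(w_{i-1},\, w)}}{n_{w_{i-1}}}

where n_{w_{i-1}} counts bigrams seen so far that start with w_{i-1}, n_{(w_{i-1}, w)} counts occurrences of the bigram (w_{i-1}, w), and the novel case backs off to (almost) the unigram model.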

19
Example results

20
Quantitative evaluation
  • Compared to unigram model, more boundaries are
    proposed, with no loss in accuracy
  • Accuracy is higher than previous models

                  Boundary Precision   Boundary Recall
  GGJ (unigram)         .92                 .62
  GGJ (bigram)          .92                 .84

                  Token F-score   Type F-score
  Brent (unigram)       .68            .52
  GGJ (bigram)          .77            .63
21
Summary
  • Two different assumptions about what defines a
    word are consistent with behavioral evidence.
  • Different assumptions lead to different results.
  • Beginning of word predicts end of word:
  • Optimal solution undersegments, finding common
    multi-word units.
  • Word also predicts next word:
  • Segmentation is more accurate and adult-like.

22
Remaining questions
  • Is unigram segmentation sufficient to start
    bootstrapping other cues (e.g., stress)?
  • How prevalent are multi-word chunks in infant
    vocabulary?
  • Are humans able to segment based on bigram
    statistics?
  • Is there any evidence that human performance is
    consistent with Bayesian predictions?

23
Testing model predictions
  • Goal: compare our model (and others) to human
    performance in a Saffran-style experiment.
  • Problem: all models have near-perfect accuracy on
    the experimental stimuli.
  • Solution: compare changes in model performance
    relative to humans as task difficulty is varied.

tupiro golabu bidaku padoti
golabubidakugolabutupiropadotibidakupadotitupiro
24
Experimental method
  • Examine segmentation performance under different
    utterance length conditions.
  • Example lexicon: lagi dazu tigupi bavulu kabitudu
    kipavazi
  • Conditions:

  words/utt   # utterances   total words
      1           1200          1200
      2            600          1200
      4            300          1200
      6            200          1200
      8            150          1200
     12            100          1200
     24             50          1200
25
Procedure
  • Training: adult subjects listened to synthesized
    utterances in one length condition.
  • No pauses between syllables within utterances.
  • 500 ms pauses between utterances.
  • Testing: 2AFC between words and part-word
    distractors.

Lexicon
lagi dazu tigupi bavulu kabitudu kipavazi
lagitigupibavulukabitudulagikipavazi
dazukipavazibavululagitigupikabitudu
kipavazitigupidazukabitudulagitigupi
26
Human performance
27
Model comparison
  • Evaluated six different models.
  • Each model trained and tested on same stimuli as
    humans.
  • To simulate 2AFC, produce a score s(w) for each
    word in the choice pair and use the Luce choice rule.
  • Compute best linear fit of each model to human
    data, then calculate correlation.
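
The Luce choice rule referred to above, written out for the two-alternative case with the slides' score s(w):

  P(\text{choose } w_1 \mid \{w_1, w_2\}) = \frac{s(w_1)}{s(w_1) + s(w_2)}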

28
Models used
  • Three local statistic models, all similar to
    transitional probabilities (TP):
  • Segment at minima of P(syl_i | syl_{i-1}).
  • s(w) = minimum TP in w.
  • Swingley (2005):
  • Builds lexicon using local statistic and
    frequency thresholds.
  • s(w) = max threshold at which w appears in the
    lexicon.
  • PARSER (Perruchet and Vinter, 1998):
  • Incorporates principles of lexical competition
    and memory decay.
  • s(w) = P(w) as defined by the model.
  • GGJ (our Bayesian model):
  • s(w) = P(w) as defined by the model.

29
Results: linear fit
30
Results: words vs. part-words
31
Summary
  • Statistical segmentation is more difficult when
    utterances contain more words.
  • Gradual decay in performance is predicted by
    Bayesian model, but not by others tested.
  • Bayes predicts difficulty is primarily due to
    effects of competition.
  • In longer utterances, correct words are less
    probable because more other possibilities exist.
  • Local statistic approaches don't model
    competition.

32
Continuing work
  • Experiments with other task modifications will
    further test our model's predictions.
  • Vary the length of exposure to the training stimulus:
  • Bayes: longer exposure → better performance.
  • TPs: no effect of exposure.
  • Vary the number of lexical items:
  • Bayes: larger lexicon → worse performance.
  • TPs: larger lexicon → better performance.

33
Conclusions
  • Computer simulations and experimental work
    suggest that
  • Unigram assumption causes ideal learners to
    undersegment fluent speech.
  • Human word segmentation may approximate Bayesian
    ideal learning.

34
(No Transcript)
35
Bayesian segmentation
Input data:
whatsthat thedoggie wheresthedoggie ...

Some hypotheses:
whatsthat thedoggie wheresthedoggie
whats that the doggie wheres the doggie
wh at sth at thedo ggie wh eres thedo ggie
w h a t s t h a t t h e d o g g i e w h e r e s t h e d o g g i e
36
Search algorithm
  • Model defines a distribution over hypotheses. We
    use Gibbs sampling to find a good hypothesis.
  • Iterative procedure produces samples from the
    posterior distribution of hypotheses.
  • A batch algorithm (but online algorithms are
    possible, e.g., particle filtering).

[Figure: posterior probability P(h | d) plotted over hypotheses h]
37
Gibbs sampler
  • 1. Consider two hypotheses differing by a single
    word boundary
  • 2. Calculate probabilities of words that differ,
    given current analysis of all other words.
  • Model is exchangeable: the probability of a set of
    outcomes does not depend on their ordering.
  • 3. Sample one of the two hypotheses according to
    the ratio of probabilities (a schematic sketch of
    this step follows the example below).

whats.that the.doggie wheres.the.doggie
whats.that the.dog.gie wheres.the.doggie
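
A schematic Python sketch of this boundary-sampling step under a simplified unigram (Dirichlet process) model; the function names, the uniform phoneme distribution, and the parameter values are illustrative assumptions, not the authors' implementation:

import random

def word_prob(w, counts, total, alpha, n_phonemes=50, p_stop=0.5):
    # Predictive probability of word w: reuse an existing lexical item in
    # proportion to its count, or generate a novel phoneme string
    # (geometric length, uniform phonemes; characters stand in for phonemes).
    p0 = p_stop * (1 - p_stop) ** (len(w) - 1) * (1.0 / n_phonemes) ** len(w)
    return (counts.get(w, 0) + alpha * p0) / (total + alpha)

def sample_boundary(left, right, counts, total, alpha=1.0):
    # Hypothesis 1: keep the boundary (two words); hypothesis 2: merge them.
    # counts/total describe the current analysis of all *other* words.
    p_split = (word_prob(left, counts, total, alpha)
               * word_prob(right, counts, total + 1, alpha))
    p_merge = word_prob(left + right, counts, total, alpha)
    return random.random() < p_split / (p_split + p_merge)  # True = keep boundary

# Example: resample the boundary inside "dog.gie" given the rest of the corpus.
counts = {"whats": 1, "that": 1, "the": 2, "wheres": 1, "doggie": 1}
print(sample_boundary("dog", "gie", counts, total=6))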
38
Models used
  • Transitional probabilities (TP)
  • Segment at minima of P(syl_i | syl_{i-1}).
  • s(w) = minimum TP in w (equivalently, use the
    product); see the sketch after this list.
  • Smoothed transitional probabilities
  • Avoid zero counts by using additive smoothing.
  • Mutual information (MI)
  • Segment where MI between syllables is lowest.
  • s(w) = minimum MI in w (equivalently, use the sum).
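
As referenced in the TP item above, a sketch (not the authors' code) of how the TP score can be computed: estimate syllable-to-syllable transitional probabilities from the training utterances and score a candidate word by its minimum internal TP.

from collections import Counter

def train_tp(utterances):
    # Estimate P(syl2 | syl1) from training utterances given as syllable lists.
    bigrams, unigrams = Counter(), Counter()
    for utt in utterances:
        for s1, s2 in zip(utt, utt[1:]):
            bigrams[(s1, s2)] += 1
            unigrams[s1] += 1
    return {pair: c / unigrams[pair[0]] for pair, c in bigrams.items()}

def score(word, tp):
    # s(w) = minimum transitional probability inside the word.
    return min(tp.get((s1, s2), 0.0) for s1, s2 in zip(word, word[1:]))

# Hypothetical usage with one syllabified training utterance:
utts = [["la", "gi", "ti", "gu", "pi", "ba", "vu", "lu"]]
tp = train_tp(utts)
print(score(["la", "gi"], tp))   # 1.0 in this toy example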

39
Models used
  • Swingley (2005)
  • Builds a lexicon, including syllable sequences
    above some threshold for both MI and n-gram
    frequency.
  • s(w) = max threshold at which w appears in the
    lexicon.
  • PARSER (Perruchet and Vinter, 1998)
  • Lexicon-based model incorporating principles of
    lexical competition and memory decay.
  • s(w) = P(w) as defined by the model.
  • GGJ (our Bayesian model)
  • s(w) = P(w) as defined by the model.

40
Results: linear fit
41
Continuing work
  • Comparisons to human data and other models
  • Which words/categories are most robust?
  • Compare to Frequent Frames predictions (Mintz,
    2003).
  • Compare to corpus data from children's
    production.
  • Modeling cue combination
  • Integrate morphology into syntactic model.
  • Model experimental work on cue combination in
    category learning.

42
(No Transcript)