Title: Unsupervised Morphological Segmentation With Log-Linear Models
1 Unsupervised Morphological Segmentation With Log-Linear Models
- Hoifung Poon
- University of Washington
- Joint work with Colin Cherry and Kristina Toutanova
2-7 Machine Learning in NLP
- Unsupervised Learning + Log-Linear Models: little work except for a couple of cases
- Unsupervised Learning + Log-Linear Models + Global Features: essentially no prior work
8-9 Machine Learning in NLP
- We developed a method for Unsupervised Learning of Log-Linear Models with Global Features
- We applied it to morphological segmentation and reduced F1 error by 10% to 50% compared to the state of the art
10 Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
11-16 Morphological Segmentation
- Breaks words into morphemes
- governments → govern + ment + s
- lmpxtm → l + mpx + t + m ("according to their families")
- Key component in many NLP applications
- Particularly important for morphologically rich languages (e.g., Arabic, Hebrew)
17 Why Unsupervised Learning?
- Text: unlimited supply in any language
- Segmentation labels?
- Only for a few languages
- Expensive to acquire
18 Why Log-Linear Models?
- Can incorporate arbitrary overlapping features
- E.g., Al + rb (the lord)
- Morpheme features
- Substrings Al, rb are likely morphemes
- Substrings Alr, lrb are not likely morphemes
- Etc.
- Context features
- Strings between Al and the word boundary are likely morphemes
- Strings between lr and the word boundary are not likely morphemes
- Etc.
19 Why Global Features?
- Words can inform each other's segmentation
- E.g., Al + rb (the lord), l + Al + rb (to the lord)
20 State of the Art in Unsupervised Morphological Segmentation
- Use directed graphical models
- Morfessor (Creutz & Lagus 2007)
- Hidden Markov Model (HMM)
- Goldwater et al. (2006)
- Based on Pitman-Yor processes
- Snyder & Barzilay (2008a, 2008b)
- Based on Dirichlet processes
- Uses bilingual information to help segmentation
- Phrasal alignment
- Prior knowledge of phonetic correspondences
- E.g., Hebrew w corresponds to Arabic w, f, ...
21 Unsupervised Learning with Log-Linear Models
- Few approaches exist to date
- Contrastive estimation (Smith & Eisner 2005)
- Sampling (Poon & Domingos 2008)
22 This Talk
- First log-linear model for unsupervised morphological segmentation
- Combines contrastive estimation with sampling
- Achieves state-of-the-art results
- Can apply to semi-supervised learning
23 Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
24-25 Log-Linear Model
- State variable x ∈ X
- Features f_i : X → R
- Weights λ_i
- Defines a probability distribution over the states
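For reference, the standard log-linear form these bullets describe is sketched below; the partition function Z and the exact symbols are standard notation, not quoted from the slides.

    P(x) \;=\; \frac{1}{Z}\exp\Big(\sum_i \lambda_i f_i(x)\Big),
    \qquad
    Z \;=\; \sum_{x' \in X} \exp\Big(\sum_i \lambda_i f_i(x')\Big)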
26 States for Unsupervised Morphological Segmentation
- Words
- wvlAvwn, Alrb, ...
- Segmentation
- w + vlAv + wn, Al + rb, ...
- Induced lexicon (unique morphemes)
- w, vlAv, wn, Al, rb
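A minimal sketch of how such a state could be represented in code; the class and field names are illustrative, not from the talk.

    from dataclasses import dataclass

    @dataclass
    class State:
        """A corpus-level state: every word plus its current segmentation."""
        words: list          # observed corpus, e.g. ["wvlAvwn", "Alrb"]
        segmentations: list  # one morpheme list per word

        @property
        def lexicon(self):
            """Induced lexicon: the set of unique morpheme types."""
            return {m for seg in self.segmentations for m in seg}

    state = State(words=["wvlAvwn", "Alrb"],
                  segmentations=[["w", "vlAv", "wn"], ["Al", "rb"]])
    print(state.lexicon)  # {'w', 'vlAv', 'wn', 'Al', 'rb'}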
27 Features for Unsupervised Morphological Segmentation
- Morphemes and contexts
- Exponential priors on model complexity
28 Morphemes and Contexts
- Count number of occurrences
- Inspired by CCM (Klein & Manning 2001)
- E.g., w + vlAv + wn
wvlAvwn (#_#)
vlAv (w_wn)
wn (Av_#)
w (#_vl)
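A small sketch of how morpheme and context counts of this kind can be collected for one segmented word; the two-character context width and the '#' boundary marker are assumptions for illustration.

    from collections import Counter

    def morpheme_and_context_counts(segments, width=2):
        """Count morphemes and their character contexts for one segmented word.

        segments: one word's segmentation, e.g. ["w", "vlAv", "wn"].
        A morpheme's context is up to `width` characters on each side;
        '#' stands in for the word boundary when no characters remain.
        """
        word = "".join(segments)
        morphemes, contexts = Counter(), Counter()
        pos = 0
        for seg in segments:
            left = word[max(0, pos - width):pos] or "#"
            right = word[pos + len(seg):pos + len(seg) + width] or "#"
            morphemes[seg] += 1
            contexts[left + "_" + right] += 1
            pos += len(seg)
        return morphemes, contexts

    m, c = morpheme_and_context_counts(["w", "vlAv", "wn"])
    print(c)  # Counter({'#_vl': 1, 'w_wn': 1, 'Av_#': 1})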
29 Complexity-Based Priors
- Lexicon prior
- On lexicon length (total number of characters)
- Favors fewer and shorter morpheme types
- Corpus prior
- On number of morphemes (normalized by word length)
- Favors fewer morpheme tokens
- E.g., l + Al + rb, Al + rb
- Lexicon {l, Al, rb}: length 5
- l + Al + rb: 3/5 (3 morphemes, word length 5)
- Al + rb: 2/4
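A sketch of how the two prior features could be computed for a candidate segmentation; the exponential weights applied to these values are separate model parameters and are not shown.

    def lexicon_length(segmentations):
        """Lexicon prior feature: total characters over unique morpheme types."""
        lexicon = {m for seg in segmentations for m in seg}
        return sum(len(m) for m in lexicon)

    def normalized_morpheme_count(segments):
        """Corpus prior feature for one word: morpheme tokens / word length."""
        return len(segments) / len("".join(segments))

    segs = [["l", "Al", "rb"], ["Al", "rb"]]
    print(lexicon_length(segs))                # 5   (l, Al, rb)
    print(normalized_morpheme_count(segs[0]))  # 0.6 (3/5)
    print(normalized_morpheme_count(segs[1]))  # 0.5 (2/4)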
30-34 Lexicon Prior Is Global Feature
- Renders words interdependent in segmentation
- E.g., lAlrb, Alrb
- One option: lAlrb → l + Al + rb and Alrb → Al + rb
- Another option: lAlrb → l + Alrb and Alrb → Alrb
- Because the lexicon is shared, the choice for one word affects the cost of segmenting the other
35 Probability Distribution
- For corpus W and segmentation S, the probability combines four components: morpheme features, context features, the lexicon prior, and the corpus prior
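A hedged reconstruction, in the standard log-linear form above, of a distribution with these four components; the symbols (λ for feature weights, negative α and β for the exponential priors) are my notation, not quoted from the talk.

    P_\theta(W, S) \;=\; \frac{1}{Z}\,\exp\Big(
        \sum_{\sigma \in \mathrm{morphemes}} \lambda_\sigma f_\sigma(S)
      + \sum_{c \in \mathrm{contexts}} \lambda_c f_c(S)
      + \alpha \cdot \mathrm{lexiconLength}(S)
      + \beta \sum_{i} \frac{\#\mathrm{morphemes}(S_i)}{\mathrm{length}(W_i)}
    \Big)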
36 Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
37 Learning with Log-Linear Models
- Maximizes likelihood of the observed data
- → Moves probability mass to the observed data
- From where? The set X that Z sums over
- Normally, X = all possible states
- Major challenge
- Efficient computation (approximation) of the sum
- Particularly difficult in unsupervised learning
38 Contrastive Estimation
- Smith & Eisner 2005
- X = a neighborhood of the observed data
- Neighborhood → pseudo-negative examples
- Discriminate them from observed instances
39 Problem with Contrastive Estimation
- Objects are assumed independent of each other
- Using global features leads to intractable inference
- In our case, we could not use the lexicon prior
40 Sampling to the Rescue
- Similar to Poon & Domingos 2008
- Markov chain Monte Carlo
- Estimates sufficient statistics based on samples
- Straightforward to handle global features
41 Our Learning Algorithm
- Combines both ideas
- Contrastive estimation
- → Creates an informative neighborhood
- Sampling
- → Enables the global feature (the lexicon prior)
42 Learning Objective
- Observed W (words)
- Hidden S (segmentation)
- Maximizes log-likelihood of observing the words
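A hedged sketch of this objective in the standard contrastive-estimation form, with the hidden segmentations summed out and the normalizer restricted to the neighborhood N(W); the notation is mine.

    L(\theta) \;=\; \log P_\theta(W)
            \;=\; \log \frac{\sum_{S} u_\theta(W, S)}
                            {\sum_{W' \in N(W)} \sum_{S'} u_\theta(W', S')}

Here u_θ denotes the unnormalized score from the probability-distribution slide.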
43 Neighborhood
- TRANS1: transpose any pair of adjacent characters
- Intuition: transposition usually leads to a non-word
- E.g.,
- lAlrb → Allrb, llArb, ...
- Alrb → lArb, Arlb, ...
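A minimal sketch of generating the TRANS1 neighborhood of a word; whether duplicates (from swapping identical adjacent characters) are kept is left open here.

    def trans1_neighborhood(word):
        """All strings obtained by transposing one pair of adjacent characters."""
        neighbors = []
        for i in range(len(word) - 1):
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            neighbors.append("".join(chars))
        return neighbors

    print(trans1_neighborhood("Alrb"))  # ['lArb', 'Arlb', 'Albr']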
44 Optimization
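A hedged sketch of what gradient-based optimization of the objective above typically looks like; it is consistent with the two expectations on the inference slide below, but the exact update rule used in the talk is not reproduced here.

    \frac{\partial L(\theta)}{\partial \theta_i}
      \;=\; \mathbb{E}_{S \mid W}\big[f_i\big]
          \;-\; \mathbb{E}_{W', S'}\big[f_i\big]

The first expectation conditions on the observed words; the second ranges over the neighborhood. The weights can then be updated by gradient ascent.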
45 Supervised Learning and Semi-Supervised Learning
- Readily applicable if there are labeled segmentations (S)
- Supervised: labels for all words
- Semi-supervised: labels for some words
46 Inference: Expectation
- Gibbs sampling
- E_{S|W}[f_i]
- For each observed word in turn, sample the next segmentation, conditioning on the rest
- E_{W,S}[f_i]
- For each observed word in turn, sample a word from the neighborhood and the next segmentation, conditioning on the rest
47 Inference: MAP Segmentation
- Deterministic annealing
- Gibbs sampling with temperature
- Gradually lower the temperature from 10 to 0.1
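A rough sketch of annealed Gibbs sampling for the MAP segmentation, assuming an illustrative score(state) function returning the unnormalized log-score of a corpus state and a candidate_segmentations(word) enumerator; both are stand-ins, not from the talk.

    import math, random

    def annealed_gibbs(words, score, candidate_segmentations,
                       temperatures=(10, 5, 2, 1, 0.5, 0.2, 0.1),
                       sweeps_per_temp=10):
        """Deterministic annealing: Gibbs sampling while lowering the temperature.

        At high temperatures the sampler moves freely; as T drops toward 0.1 it
        concentrates on high-scoring (near-MAP) segmentations.
        """
        state = {w: [w] for w in words}  # start with every word unsegmented
        for T in temperatures:
            for _ in range(sweeps_per_temp):
                for w in words:
                    candidates = candidate_segmentations(w)
                    logps = []
                    for seg in candidates:      # score each option with the
                        state[w] = seg          # rest of the corpus held fixed
                        logps.append(score(state) / T)
                    m = max(logps)              # tempered softmax over options
                    weights = [math.exp(lp - m) for lp in logps]
                    state[w] = random.choices(candidates, weights=weights)[0]
        return state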
48 Outline
- Morphological segmentation
- Our model
- Learning and inference algorithms
- Experimental results
- Conclusion
49 Datasets
- SB (Snyder & Barzilay 2008a, 2008b)
- About 7,000 parallel short phrases
- Arabic and Hebrew with gold segmentation
- Arabic Penn Treebank (ATB): 120,000 words
50 Methodology
- Development set: 500 words from SB
- Use trigram context in our full model
- Evaluation: precision, recall, and F1 on segmentation points
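A small sketch of precision, recall, and F1 over segmentation points (boundary positions inside each word); a minimal illustration, not the evaluation script from the talk.

    def boundary_set(segments):
        """Positions of segmentation points in a word, e.g. ['Al','rb'] -> {2}."""
        points, pos = set(), 0
        for seg in segments[:-1]:
            pos += len(seg)
            points.add(pos)
        return points

    def segmentation_f1(gold, predicted):
        """Precision, recall, F1 over segmentation points, summed over all words."""
        tp = fp = fn = 0
        for g, p in zip(gold, predicted):
            g_pts, p_pts = boundary_set(g), boundary_set(p)
            tp += len(g_pts & p_pts)
            fp += len(p_pts - g_pts)
            fn += len(g_pts - p_pts)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        return prec, rec, f1

    print(segmentation_f1([["l", "Al", "rb"]], [["l", "Alrb"]]))  # (1.0, 0.5, 0.666...)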
51 Experiment Objectives
- Comparison with state-of-the-art systems
- Unsupervised
- Supervised or semi-supervised
- Relative contributions of feature components
52 Experiment: SB (Unsupervised)
- Snyder & Barzilay 2008b
- SB-MONO: uses monolingual features only
- SB-BEST: uses bilingual information
- Our system: uses monolingual features only
53-57 Results: SB (Unsupervised)
[F1 bar charts comparing SB-MONO, SB-BEST, and our system]
- Reduces F1 error by 40%
- Reduces F1 error by 21%
58 Experiment: ATB (Unsupervised)
- Morfessor Categories-MAP (Creutz & Lagus 2007)
- Our system
59 Results: ATB
[F1 chart comparing Morfessor and our system]
- Reduces F1 error by 11%
60 Experiment: Ablation Tests
- Conducted on the SB dataset
- Change one feature component in each test
- Priors
- Context features
61-62 Results: Ablation Tests
[F1 chart over systems FULL, NO-PR (no priors), COR (corpus prior only), LEX (lexicon prior only), NO-CTXT (no context features)]
- Both priors are crucial
- Overlapping context features are important
63 Experiment: SB (Supervised and Semi-Supervised)
- Snyder & Barzilay 2008a
- SB-MONO-S: monolingual features and labels
- SB-BEST-S: bilingual information and labels
- Our system: monolingual features and labels
- Partial or all labels (25%, 50%, 75%, 100%)
64-69 Results: SB (Supervised and Semi-Supervised)
[F1 chart over SB-MONO-S, SB-BEST-S, and our system trained with 25%, 50%, 75%, and 100% of the labels]
- Reduces F1 error by 46% compared to SB-MONO-S
- Reduces F1 error by 36% compared to SB-BEST-S
70 Conclusion
- We developed a method for Unsupervised Learning of Log-Linear Models with Global Features
- Applied it to morphological segmentation
- Substantially outperforms state-of-the-art systems
- Effective for semi-supervised learning as well
- Easy to extend with additional features
71 Future Work
- Apply to other NLP tasks
- Interplay between neighborhood and features
- Morphology
- Apply to other languages
- Modeling internal variations of morphemes
- Leverage multi-lingual information
- Combine with other NLP tasks (e.g., MT)
72 Thanks, Ben Snyder
- For his most generous help with the SB dataset