Title: Morphology 3: Unsupervised Morphology Induction
- Sudeshna Sarkar
- IIT Kharagpur
Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL
- John Goldsmith
- Department of Linguistics
- The University of Chicago
Unsupervised learning
- Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words, but no tagging or other text preparation.
- Output:
  - A list of stems, suffixes, and prefixes
  - A list of signatures
- A signature is the list of all suffixes (or prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature.
- A signature has a unique set of stems associated with it.
(Example of a signature in English)
- NULL.ed.ing.s
- Stems: ask, call, point

  ask   asked   asking   asks
  call  called  calling  calls
  point pointed pointing points
Output
- Roots (stems of stems) and the inner structure of stems
- Regular allomorphy of stems
  - e.g., learning to delete stem-final e in English before -ing and -ed
Essence of Minimum Description Length (MDL)
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry (1989)
- Work by Michael Brent and Carl de Marcken on word-discovery using MDL
- We are given
  - a corpus, and
  - a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
- The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.
- Better said: -log probability(corpus) is a measure of how well the morphology models the data; the smaller that number is, the better the morphology models the data.
- This is known as the optimal compressed length of the data, given the model.
- Using base-2 logs, this number is measured in information-theoretic bits.
Essence of MDL
- The goodness of the morphology is also measured by how compact the morphology itself is.
- We can measure the compactness of a morphology in information-theoretic bits.
How can we measure the compactness of a morphology?
- Let's consider a naïve version of description length: count the number of letters.
- This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length

Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 61 letters.

Analysis: Stems: jump, laugh, sing, sang, dog (20 letters). Suffixes: s, ing, ed (6 letters). Unanalyzed: the (3 letters). Total: 29 letters.

Notice that the description length goes UP if we analyze sing into s-ing.
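To make the letter-count arithmetic concrete, here is a minimal Python sketch; the corpus and segmentation are the ones above, and `naive_dl` is a hypothetical helper name, not part of Linguistica.

```python
# Naive description length: just count the letters needed to write a list out.
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

def naive_dl(strings):
    """Total letters needed to spell out every string in the list."""
    return sum(len(s) for s in strings)

stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]

print(naive_dl(corpus))                         # 61 letters: the raw word list
print(naive_dl(stems + suffixes + unanalyzed))  # 29 letters: the analyzed grammar
```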
Essence of MDL
- The best overall theory of a corpus is the one for which the sum of
  - -log prob(corpus), plus
  - the length of the morphology
  (that's the description length) is the smallest.
Overall logic
- Search through morphology space for the morphology which provides the smallest description length.
- Application of MDL to iterative search of
morphology-space, with successively finer-grained
descriptions
The search loop (shown as a flowchart in the original slides):
- Pick a large corpus from a language: 5,000 to 1,000,000 words.
- Feed it into the bootstrapping heuristic, out of which comes a preliminary morphology, which need not be superb.
- Feed that morphology to the incremental heuristics; out comes a modified morphology.
- Is the modification an improvement? Ask MDL!
- If it is an improvement, replace the morphology with the modified morphology (the old one is discarded).
- Send the result back to the incremental heuristics again, and continue until there are no improvements to try.
1. Bootstrap heuristic
- A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.
- In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least the product of |w| over all words in the corpus as its number of members.
Better bootstrap heuristics
- Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955).
- Current variant: cut words at certain peaks of successor frequency.
- Problems: it can over-cut, it can under-cut, and it can put cuts too far to the right (the aborti- problem). Not a problem!
Successor frequency
- Empirically, only one letter follows "gover": n.
- Empirically, 6 letters follow "govern" (e, i, m, o, s, ...).
- Empirically, only 1 letter follows "governm": e.
- So the successor frequencies run gover: 1, govern: 6, governm: 1, a peak of successor frequency at "govern", which is where we cut (govern-ment).
Lots of errors
- For "conservatives", the successor frequencies along c-o-n-s-e-r-v-a-t-i-v-e-s run 9 18 11 6 4 1 2 1 1 2 1 1; some of the resulting peaks give right cuts, others wrong ones.
Even so
- We set conditions:
  - Accept cuts with stems at least 5 letters in length.
  - Demand that the successor frequency be a clear peak: 1 ... N ... 1 (e.g., govern-ment).
  - Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
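A minimal sketch of this cut heuristic, assuming a plain word list as input; `successor_freq`, `cut_points`, and the tiny example vocabulary are illustrative, not Linguistica's actual code.

```python
from collections import defaultdict

def successor_freq(words):
    """Map each prefix to the set of distinct letters that follow it."""
    followers = defaultdict(set)
    for w in words:
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])
    return followers

def cut_points(word, followers, min_stem=5):
    """Cut after a prefix whose successor frequency is a clear 1-N-1 peak."""
    cuts = []
    for i in range(min_stem, len(word) - 1):   # cuts strictly inside the word
        left  = len(followers[word[:i - 1]])
        here  = len(followers[word[:i]])
        right = len(followers[word[:i + 1]])
        if here > 1 and left == 1 and right == 1:   # clear peak
            cuts.append(i)
    return cuts

words = ["government", "governments", "governed", "governing", "governor"]
f = successor_freq(words)
print(cut_points("government", f))   # [6]: a cut after "govern" -> govern-ment
```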
2. Incremental heuristics
- Coarse-grained to fine-grained:
- 1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix.
- 2. Loose fit: suffixes and signatures to split: collect any string that precedes a known suffix; find all of its apparent suffixes, and use MDL to decide whether it's worth it to do the analysis. We'll return to this in a moment.
Incremental heuristics
- 3. Slide the stem-suffix boundary to the left; again, use MDL to decide.
- How do we use MDL to decide?
Using MDL to judge a potential stem
- act, acted, action, acts.
- We have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s.
- Let's compute the cost versus the savings of the signature NULL.ed.ion.s.
- Savings:
  - Stem savings: 3 copies of the stem act; that's 3 x 4 = 12 letters, almost 60 bits.
Cost of NULL.ed.ion.s
- To give a feel for this: the total cost of the suffix list is about 30 bits.
- The cost of the pointer to the signature comes on top of that; all the stems using the signature chip in to pay for its cost, though.
- Cost of the signature: about 45 bits.
- Savings: about 60 bits.
- So MDL says: do it! Analyze the words as stem + suffix.
- Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already existed.
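The comparison above can be checked in a few lines; a minimal sketch, where the 45-bit signature cost and the 12 saved letters are taken from the figures above and log2(26) is the naive per-letter cost.

```python
from math import log2

LETTER_BITS = log2(26)          # about 4.70 bits: the naive cost of one letter

# Savings, following the figures above: 12 letters of stem material saved.
stem_savings = 12 * LETTER_BITS
print(round(stem_savings, 1))   # about 56.4 bits ("almost 60 bits")

# Cost of creating the signature, taken directly from the text above.
signature_cost = 45
print(stem_savings > signature_cost)   # True, so MDL says: do the analysis
```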
Today's presentation
- The task: unsupervised learning
- Overview of program and output
- Overview of the Minimum Description Length framework
- Application of MDL to iterative search of morphology-space, with successively finer-grained descriptions
- Mathematical model
- Current capabilities
- Current challenges
Model
- A model to give us a probability of each word in the corpus (hence, its optimal compressed length), and
- a morphology whose length we can measure.
Frequency of an analyzed word
- [x] means the count of x's in the corpus (token count), and [W] is the total number of words.
- Word W is analyzed as belonging to signature σ, stem T, and suffix F:

  prob(W) = ([σ] / [W]) · ([T] / [σ]) · ([F in σ] / [σ])

- Actually, what we care about is the log of this:

  -log prob(W) = log([W]/[σ]) + log([σ]/[T]) + log([σ]/[F in σ])
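A sketch of this token-count model in Python, under the factorization above; the counts in the example are invented for illustration.

```python
from math import log2

def word_cost_bits(sig_count, stem_count, suffix_in_sig_count, total_tokens):
    """-log2 prob(W) under prob(W) = p(sigma) * p(T | sigma) * p(F | sigma)."""
    prob = (sig_count / total_tokens) \
         * (stem_count / sig_count) \
         * (suffix_in_sig_count / sig_count)
    return -log2(prob)

# Invented counts: a signature seen 600 times in a 100,000-token corpus,
# the stem 90 times, and the suffix 150 times within that signature.
print(round(word_cost_bits(600, 90, 150, 100_000), 2))   # about 12.12 bits
```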
Next, let's see how to measure the length of a morphology
- A morphology is a set of 3 things:
  - A list of stems
  - A list of suffixes
  - A list of signatures, with the associated stems
- We'll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.
Length of a list
- A header telling us how long the list is, of length (roughly) log2 N, where N is the length of the list.
- N entries. What's in an entry?
  - Raw lists: a list of strings of letters, where the cost of each letter is log2(26), the information content of a letter (we could use a more accurate conditional probability).
  - Pointer lists.
Lists
- Raw suffix list: ed, s, ing, ion, able
- Signature 1: suffixes = pointer to ing, pointer to ed
- Signature 2: suffixes = pointer to ing, pointer to ion
- The length of a pointer to a suffix f is log([W]/[f]) bits, usually cheaper than the letters themselves.
- The fact that a pointer to a symbol has a length that varies inversely with the symbol's frequency is the key:
- we want the shortest overall grammar, so
- that means maximizing the re-use of units (stems, affixes, signatures, etc.).
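A sketch of the pointer-cost idea, under the convention used above that a pointer to an item x costs log2([W]/[x]) bits; the counts are invented.

```python
from math import log2

def pointer_bits(item_count, total_count):
    """Cost in bits of a pointer to an item: log2(total / count)."""
    return log2(total_count / item_count)

# A frequent suffix is cheap to point to; a rare one is expensive.
total = 100_000
print(round(pointer_bits(9_000, total), 2))   # frequent "ing": about 3.47 bits
print(round(pointer_bits(12, total), 2))      # rare suffix: about 13.02 bits
print(round(3 * log2(26), 2))                 # vs. spelling out "ing": 14.10 bits
```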
Structure of the length of the morphology (formula): the number of letters in the stem and suffix lists, plus the signatures, which we'll get to shortly.
Information contained in the signature component
- A list of pointers to signatures.
- <X> indicates the number of distinct elements in X.
Repair heuristics using MDL
- We could compute the entire MDL in one state of the morphology, make a change, compute the whole MDL in the proposed (modified) state, and compare the two lengths:

  original morphology + compressed data  vs.  revised morphology + compressed data
- But it's better to have a more thoughtful approach.
- Let's define the relevant quantities once; then the size of the punctuation for the 3 lists follows, as does the change of that size under a proposed modification. (The defining formulas are not reproduced in this transcript.)
Size of the suffix component (recall the formula). Its change in size when we consider a modification to the morphology decomposes into:
1. Global effects of the change in the number of suffixes
2. Effects of the change in size of suffixes present in both states
3. Suffixes present only in state 1
4. Suffixes present only in state 2
Suffix component change
- Suffixes whose counts change
- Global effect of the change on all suffixes
- Contribution of suffixes that appear only in state 1
- Contribution of suffixes that appear only in state 2
Current research projects
- Allomorphy: automatic discovery of relationships between stems (lov/love, win/winn)
- Use of syntax (automatic learning of syntactic categories)
- Rich morphology: other languages (e.g., Swahili) and other sub-languages (e.g., the biochemistry sub-language) where the mean number of morphemes per word is much higher
- Ordering of morphemes
Allomorphy: automatic discovery of relationships between stems
- Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures.
- E.g., delete stem-final e in English before the suffixes -ing, -ed, -ion (etc.).
Automatic learning of syntactic categories
- Work in progress with Mikhail Belkin (U of Chicago)
- Pursuing Shi and Malik's 1997 application of spectral graph theory (vision)
- Finding the eigenvector decomposition of a graph that represents bigrams and trigrams
Rich morphologies
- A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo-nucle-ic, etc.)
- Swahili, Hungarian, Turkish, etc.
Unsupervised Knowledge-Free Morpheme Boundary Detection
- Stefan Bordag
- University of Leipzig

- Example
- Related work
- Part One: Generating training data
- Part Two: Training and applying a classifier
- Preliminary results
- Further research
Example: clearly, early
- The examples used throughout this presentation are "clearly" and "early".
- In one case the stem is "clear"; in the other, "early".
- Other word forms of the same lemmas:
  - clearly: clearest, clear, clearer, clearing
  - early: earlier, earliest
- Semantically related words:
  - clearly: logically, really, totally, weakly, ...
  - early: morning, noon, day, month, time, ...
- Correct morpheme boundary analyses:
  - clearly → clear-ly, but not clearl-y or clea-rly
  - early → early or earl-y, but not ear-ly
Three approaches to morpheme boundary detection
- Genetic algorithms and the Minimum Description Length model
  - (Kazakov 97, 01), (Goldsmith 01), (Creutz 03, 05)
  - This approach uses only a word list, not the context information for each word from the corpus.
  - This possibly results in an upper limit on achievable performance (especially with regard to irregularities).
  - One advantage is that smaller corpora are sufficient.
- Semantics-based
  - (Schone & Jurafsky 01), (Baroni 03)
  - A general problem of this approach is examples like "deeply" and "deepness", where semantic similarity is unlikely.
- Letter Successor Variety (LSV) based
  - (Harris 55); (Hafer & Weiss 74): first application, but low performance
  - Also applied only to a word list
  - Further hampered by noise in the data
2. New solution in two parts
(Flowchart:) sentences (e.g., "The talk was very informative") → co-occurrences (e.g., "the talk": 1, "talk was": 1) → similar words (e.g., talk ~ speech: 20, was ~ is: 15) → compute LSV, scored as s = LSV · freq · multiletter · bigram → training data (clear-ly, lately, early) → train classifier → apply classifier → clear-ly, late-ly, early.
2.1. First part: generating training data with LSV and distributional semantics
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
- Frequency of words A and B: n_A and n_B
- Frequency of co-occurrence of A with B: n_AB
- Corpus size: n
- Significance computation is the Poisson approximation of the log-likelihood measure (Dunning 93), (Quasthoff & Wolff 02).
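For concreteness, here is one common form of the Poisson collocation measure in Python; the exact normalization used by Quasthoff & Wolff may differ, and the counts below are invented.

```python
from math import lgamma, log

def poisson_sig(na, nb, nab, n):
    """Poisson collocation significance of A and B co-occurring nab times.

    lam is the co-occurrence count expected by chance; the score is the
    negative log of the Poisson probability of the observed count,
    normalized by log n (one common form of this measure).
    """
    lam = na * nb / n
    return (lam - nab * log(lam) + lgamma(nab + 1)) / log(n)

# Invented counts: "very" and "clearly" in a corpus of 1,000,000 units
print(round(poisson_sig(na=20_000, nb=3_000, nab=250, n=1_000_000), 1))  # ~12.3
```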
Neighbors of "clearly"
- Most significant left neighbors: very, quite, so, Its, most, its, shows, results, thats, stated, Quite
- Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
- Example contexts: "Its clearly labeled", "very clearly shows"
2.2. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
Similar words to "clearly"
- (Left and right neighbor lists as on the previous slide.)
- Similar words found: weakly, legally, closely, clearly, greatly, linearly, really
2.3. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
  - Sort those words by edit distance and keep the 150 most similar → words further away only add random noise.
Similar words to "clearly", sorted by edit distance
- Sorted list: clearly, closely, greatly, legally, linearly, really, weakly
- (Left and right neighbor lists as above.)
2.4. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
  - Sort those words by edit distance and keep the 150 most similar → words further away only add random noise.
  - Compute the letter successor variety for each transition between two characters of the input word.
  - Report boundaries where the LSV is above a threshold.
2.5. Letter successor variety
- Letter successor variety (Harris 55): word-splitting occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.
- Input: the 150 most similar words.
- Observe how many different letters occur after a part of the string:
  - c-: in the given list, 5 letters follow c-
  - cl-: only 3 letters
  - cle-: only 1 letter
  - ...
  - -ly, but reversed: before -ly, 16 different letters (16 different stems precede the suffix -ly)
- For "clearly":
    c  l  e  a  r  l  y
    28 5  3  1  1  1  1  1    from the left (e.g., after c-, 5 different letters)
    1  1  2  1  3  16 10 14   from the right (e.g., before -y, 10 different letters)
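A minimal sketch of the LSV computation over a set of similar words, computed from both directions (reversing the words gives the variety before a suffix); the seven-word list below stands in for the real 150.

```python
from collections import defaultdict

def lsv(words):
    """Distinct-letter variety after each prefix of each word."""
    variety = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            variety[w[:i]].add(w[i])
    return variety

def boundary_scores(word, similar):
    """Left and right LSV for each transition inside `word`."""
    left = lsv(similar)
    right = lsv([w[::-1] for w in similar])          # predecessor variety
    rev = word[::-1]
    return [(i,
             len(left[word[:i]]),                    # letters after the prefix
             len(right[rev[:len(word) - i]]))        # letters before the suffix
            for i in range(1, len(word))]

similar = ["clearly", "closely", "greatly", "legally",
           "linearly", "really", "weakly"]
# With the full 150-word list the clear-ly boundary peaks much higher.
for i, l, r in boundary_scores("clearly", similar):
    print(f"{'clearly'[:i]}-{'clearly'[i:]}: left={l}, right={r}")
```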
2.5.1. Balancing factors
- The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
  - freq: frequency differences between the beginning and the middle of the word
  - multiletter: representation of single phonemes by several letters
  - bigram: certain fixed combinations of letters
- The final score s for each possible boundary is then: s = LSV · freq · multiletter · bigram
2.5.2. Balancing factors: frequency
- LSV is not normalized against frequency:
  - 28 different first letters within 150 words
  - 5 different second letters within the 11 words beginning with c
  - 3 different third letters within the 4 words beginning with cl
- Computing the frequency weight freq: 4 out of 11 words begin with cl-, so the weight is 4/11.
- For "clearly":
    c    l    e    a  r  l  y
    150  11   4    1  1  1  1  1   word counts (of the 11 beginning with c, 4 begin with cl)
    0.1  0.4  0.3  1  1  1  1  1   freq weights from the left
2.5.3. Balancing factors: multiletter phonemes
- Problem: two or more letters which together represent one phoneme carry away the numerator of the overlap-factor quotient.
- Letter split variety for German "schlimme":
    s  c  h  l  i  m  m  e
    7  1  7  2  1  1  2        from the left
    2  1  1  1  2  4  15       from the right
- Computing the overlap factor:
    150 27 18 18 6 5 5 5
    2   2  2  2  3 7 105 150
- Thus at this point the LSV of 7 is weighted 1 (18/18), but since sch is one phoneme, it should have been 18/150!
- Solution: rank bi- and trigrams; the highest-ranked receives a weight of 1.0.
- The overlap factor is then recomputed as a weighted average; in this case that means 1.0 · 27/150, since sch is the highest trigram and has a weight of 1.0.
2.5.4. Balancing factors: bigrams
- It is obvious that "th" in English should almost never be divided.
- Compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest-ranked and 1.0 to the lowest-ranked bigram.
- The LSV score is then multiplied by the resulting weight.
- Thus the German "ch", the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.
2.5.5. Sample computation
- For "clearly" and "early", the slide tabulates per letter transition: the left and right letter successor varieties, the balancing frequencies, the multiletter (bi- and trigram) weights, and the bigram weights, and then combines them into left and right boundary scores.
- The right score for the clear-ly boundary comes out at 12.4 (an LSV of 16 weighted by the balancing factors), by far the strongest peak; no boundary inside "early" scores comparably.
Second part: training and applying the classifier
- Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).
- Example words: clearly, early, lately, clear, late.
- (Figure: the example words stored letter by letter in a trie, with a marker for the end or beginning of a word.)
3.1. PCT as a classifier
- (Figure: a PCT built from the training analyses clear-ly, late-ly, early, clear, late; compact nodes such as "cl", "ear", "late", with "ly" nodes carrying counts of the boundary evidence observed there, e.g. ly: 2.)
- To apply: find the deepest matching node and retrieve the known information stored there.
- amazing?ly → amazing-ly (known boundary information added)
- dear?ly → dearly (the deepest match here comes via "early", which has no boundary before -ly)
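A toy version of the trie-based classifier: each training analysis is stored under the reversed word, nodes accumulate cut/no-cut votes, and the deepest matching node decides. A real PCT compresses node chains and stores richer information; this sketch only reproduces the behavior of the example.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.cut = 0     # words whose last boundary lies exactly this deep
        self.nocut = 0   # words with no boundary at this depth

def train(root, analyzed):
    """Insert one analysis, e.g. 'clear-ly' or 'early' (last boundary only)."""
    word = analyzed.replace("-", "")
    suffix_len = len(analyzed) - 1 - analyzed.rindex("-") if "-" in analyzed else 0
    node = root
    for depth, ch in enumerate(reversed(word), 1):
        node = node.children.setdefault(ch, Node())
        if depth == suffix_len:
            node.cut += 1
        else:
            node.nocut += 1

def classify(root, word):
    """Walk the reversed word; the deepest matching node decides."""
    node, deepest, cut_depth = root, None, 0
    for depth, ch in enumerate(reversed(word), 1):
        if ch not in node.children:
            break
        node = node.children[ch]
        deepest, cut_depth = node, depth
    if deepest and deepest.cut > deepest.nocut:
        return word[:-cut_depth] + "-" + word[-cut_depth:]
    return word

root = Node()
for a in ["clear-ly", "late-ly", "early", "clear", "late"]:
    train(root, a)
print(classify(root, "amazingly"))   # amazing-ly (deepest match: the 'ly' node)
print(classify(root, "dearly"))      # dearly (deepest match runs along 'early')
```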
4. Evaluation
- Boundary measuring: each detected boundary can be correct or wrong (precision), and true boundaries can go undetected (recall).
- The first evaluation is global LSV with the proposed improvements.
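Boundary precision, recall, and F-measure can be computed as follows; the dash-separated format and the helper names are assumptions, not Bordag's evaluation code.

```python
def boundary_prf(predicted, gold):
    """Precision, recall, F-measure over morpheme boundary positions."""
    def boundaries(analysis):
        pos, cuts = 0, set()
        for part in analysis.split("-")[:-1]:
            pos += len(part)
            cuts.add(pos)          # boundary position within the bare word
        return cuts

    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)         # boundaries found and correct
        fp += len(pb - gb)         # boundaries found but wrong
        fn += len(gb - pb)         # boundaries missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

print(boundary_prf(["clear-ly", "ear-ly"], ["clear-ly", "early"]))
# (0.5, 1.0, 0.666...): one correct boundary, one spurious one
```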
Evaluating LSV: precision vs. recall (chart)
Evaluating LSV: F-measure (chart)
Evaluating the combination: precision vs. recall (chart)
Evaluating the combination: F-measure (chart)
Comparing the combination with global LSV (chart)
4.1. Results
- German newspaper corpus with 35 million sentences
- English newspaper corpus with 13 million sentences

  (threshold t = 5)       German   English
  LSV precision           80.20    70.35
  LSV recall              34.52    10.86
  LSV F-measure           48.27    18.82
  Combined precision      68.77    52.87
  Combined recall         72.11    52.56
  Combined F-measure      70.40    55.09
4.2. Statistics

                               en lsv      en comb     tr lsv     tr comb    fi lsv     fi comb
  Corpus size                  13 million  13 million  1 million  1 million  4 million  4 million
  Number of word forms         167,377     167,377     582,923    582,923    1,636,336  1,636,336
  Analysed words               49,159      94,237      26,307     460,791    68,840     1,380,841
  Boundaries                   70,106      131,465     31,569     812,454    84,193     3,138,039
  Morpheme length              2.60        2.56        2.29       3.03       2.32       3.73
  Length of analysed words     8.97        8.91        9.75       10.62      11.94      13.34
  Length of unanalysed words   7.56        6.77        10.12      8.15       12.91      10.47
  Morphemes per word           2.43        2.40        2.20       2.76       2.22       3.27
Assessing true error rate
- Typical sample list of words considered wrong due to CELEX (system output vs. CELEX analysis):
  - Tau-sende vs. Tausend-e
  - senegales-isch-e vs. senegalesisch-e
  - sensibelst-en vs. sens-ibel-sten
  - separat-ist-isch-e vs. separ-at-istisch-e
  - tris-t vs. trist
  - triump-hal vs. triumph-al
  - trock-en vs. trocken
  - unueber-troff-en vs. un-uebertroffen
  - trop-f-en vs. tropf-en
  - trotz-t-en vs. trotz-ten
  - ver-traeum-t-e vs. vertraeumt-e
- Reasons:
  - Gender -e (in (Creutz & Lagus 05), for example, counted as correct)
  - Compounds (sometimes separated, sometimes not)
  - -t-en error
  - With proper names, -isch is often not analyzed
  - Connecting elements
4.4. Real example
- Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann, Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum, Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en, Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände, Ver-triebs-beleg-e
- Orien-tal, Orien-tal-ische, Orien-tal-ist, Orien-tal-ist-en, Orien-tal-ist-ik, Orien-tal-ist-in, Orient-ier-ung, Orient-ier-ungen, Orient-ier-ungs-hilf-e, Orient-ier-ungs-hilf-en, Orient-ier-ungs-los-igkeit, Orient-ier-ungs-punkt, Orient-ier-ungs-punkt-e, Orient-ier-ungs-stuf-e
5. Further research
- Examine quality on various language types
- Improve the trie-based classifier
- Possibly combine with other existing algorithms
- Find out how to acquire the morphology of non-concatenative languages
- Deeper analysis:
  - find deletions
  - alternations
  - insertions
  - morpheme classes, etc.
References
- (Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of COLING 2004, Geneva, Switzerland, 2004.
- (Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.
- (Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.
- (Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
- (Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
References II
- (Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
- (Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
- (Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
- (Kazakov 97) Dimitar Kazakov. Unsupervised learning of naive morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
- (Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.
- (Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001 (Machine Learning: Beyond Supervision), Seattle, WA, August 2001.
E. Gender -e vs. frequency -e
- vs. other -e: andere 8.4, keine 6.8, rote 11.6, stolze 8.0, drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6
- vs. gender -e: Schule 8.4, Devise 7.8, Sonne 4.5, Abendsonne 5.3, Abende 5.5, Liste 6.5
- Frequency -e: Affe 2.7, Junge 5.3, Knabe 4.6, Bursche 2.4, Backstage 3.0