Title: Morphology 3: Unsupervised Morphology Induction
- Sudeshna Sarkar
- IIT Kharagpur
Linguistica: Unsupervised Learning of Natural Language Morphology Using MDL
- John Goldsmith
- Department of Linguistics
- The University of Chicago
Unsupervised learning
- Input: untagged text in orthographic or phonetic form, with spaces (or punctuation) separating words, but no tagging or other text preparation.
- Output:
  - A list of stems, suffixes, and prefixes
  - A list of signatures
- A signature is the list of all suffixes (or prefixes) appearing in a given corpus with a given stem. Hence, a stem in a corpus has a unique signature.
- A signature has a unique set of stems associated with it.
(Example of a signature in English)
- NULL.ed.ing.s
- Stems: ask, call, point

  ask   asked   asking   asks
  call  called  calling  calls
  point pointed pointing points
Output
- Roots (stems of stems) and the inner structure of stems
- Regular allomorphy of stems
  - e.g., learning to delete stem-final e in English before -ing and -ed
Essence of Minimum Description Length (MDL)
- Jorma Rissanen, Stochastic Complexity in Statistical Inquiry (1989)
- Work by Michael Brent and Carl de Marcken on word-discovery using MDL
- We are given
  - a corpus, and
  - a probabilistic morphology, which technically means that we are given a distribution over certain strings of stems and affixes.
- The higher the probability that the morphology assigns to the (observed) corpus, the better that morphology is as a model of that data.
- Better said: -log probability(corpus) is a measure of how well the morphology models the data; the smaller that number is, the better the morphology models the data.
- This is known as the optimal compressed length of the data, given the model.
- Using base-2 logs, this number is measured in information-theoretic bits.
Essence of MDL
- The goodness of the morphology is also measured by how compact the morphology itself is.
- We can measure the compactness of a morphology in information-theoretic bits.
How can we measure the compactness of a morphology?
- Let's consider a naïve version of description length: count the number of letters.
- This naïve version is nonetheless helpful in seeing the intuition involved.
Naive Minimum Description Length

Corpus: jump, jumps, jumping; laugh, laughed, laughing; sing, sang, singing; the, dog, dogs. Total: 61 letters.

Analysis: Stems: jump, laugh, sing, sang, dog (20 letters). Suffixes: s, ing, ed (6 letters). Unanalyzed: the (3 letters). Total: 29 letters.

Notice that the description length goes UP if we analyze sing into s-ing.
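To make the letter-count arithmetic concrete, here is a minimal Python sketch; the corpus and segmentation are the ones above, and `naive_dl` is a hypothetical helper name, not part of Linguistica.

```python
# Naive description length: just count the letters needed to write a list out.
corpus = ["jump", "jumps", "jumping",
          "laugh", "laughed", "laughing",
          "sing", "sang", "singing",
          "the", "dog", "dogs"]

def naive_dl(strings):
    """Total letters needed to spell out every string in the list."""
    return sum(len(s) for s in strings)

stems      = ["jump", "laugh", "sing", "sang", "dog"]
suffixes   = ["s", "ing", "ed"]
unanalyzed = ["the"]

print(naive_dl(corpus))                         # 61 letters: the raw word list
print(naive_dl(stems + suffixes + unanalyzed))  # 29 letters: the analyzed grammar
```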
Essence of MDL
- The best overall theory of a corpus is the one for which the sum of
  - -log prob(corpus), plus
  - the length of the morphology
  (that's the description length) is the smallest.
Overall logic
- Search through morphology space for the morphology which provides the smallest description length.
- Application of MDL to iterative search of
morphology-space, with successively finer-grained
descriptions
The search loop (shown as a flowchart in the original slides):
- Pick a large corpus from a language: 5,000 to 1,000,000 words.
- Feed it into the bootstrapping heuristic, out of which comes a preliminary morphology, which need not be superb.
- Feed that morphology to the incremental heuristics; out comes a modified morphology.
- Is the modification an improvement? Ask MDL!
- If it is an improvement, replace the morphology with the modified morphology (the old one is discarded).
- Send the result back to the incremental heuristics again, and continue until there are no improvements to try.
1. Bootstrap heuristic
- A function that takes words as inputs and gives an initial hypothesis regarding what are stems and what are affixes.
- In theory, the search space is enormous: each word w of length |w| has at least |w| analyses, so the search space has at least the product of |w| over all words in the corpus as its number of members.
Better bootstrap heuristics
- Heuristic, not perfection! There are several good heuristics; the best is a modification of a good idea of Zellig Harris (1955).
- Current variant: cut words at certain peaks of successor frequency.
- Problems: it can over-cut, it can under-cut, and it can put cuts too far to the right (the aborti- problem). Not a problem!
Successor frequency
- Empirically, only one letter follows "gover": n.
- Empirically, 6 letters follow "govern" (e, i, m, o, s, ...).
- Empirically, only 1 letter follows "governm": e.
- So the successor frequencies run gover: 1, govern: 6, governm: 1, a peak of successor frequency at "govern", which is where we cut (govern-ment).
Lots of errors
- For "conservatives", the successor frequencies along c-o-n-s-e-r-v-a-t-i-v-e-s run 9 18 11 6 4 1 2 1 1 2 1 1; some of the resulting peaks give right cuts, others wrong ones.
Even so
- We set conditions:
  - Accept cuts with stems at least 5 letters in length.
  - Demand that the successor frequency be a clear peak: 1 ... N ... 1 (e.g., govern-ment).
  - Then, for each stem, collect all of its suffixes into a signature, and accept only signatures with at least 5 stems.
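A minimal sketch of this cut heuristic, assuming a plain word list as input; `successor_freq`, `cut_points`, and the tiny example vocabulary are illustrative, not Linguistica's actual code.

```python
from collections import defaultdict

def successor_freq(words):
    """Map each prefix to the set of distinct letters that follow it."""
    followers = defaultdict(set)
    for w in words:
        for i in range(1, len(w)):
            followers[w[:i]].add(w[i])
    return followers

def cut_points(word, followers, min_stem=5):
    """Cut after a prefix whose successor frequency is a clear 1-N-1 peak."""
    cuts = []
    for i in range(min_stem, len(word) - 1):   # cuts strictly inside the word
        left  = len(followers[word[:i - 1]])
        here  = len(followers[word[:i]])
        right = len(followers[word[:i + 1]])
        if here > 1 and left == 1 and right == 1:   # clear peak
            cuts.append(i)
    return cuts

words = ["government", "governments", "governed", "governing", "governor"]
f = successor_freq(words)
print(cut_points("government", f))   # [6]: a cut after "govern" -> govern-ment
```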
2. Incremental heuristics
- Coarse-grained to fine-grained:
- 1. Stems and suffixes to split: accept any analysis of a word if it consists of a known stem and a known suffix.
- 2. Loose fit: suffixes and signatures to split: collect any string that precedes a known suffix; find all of its apparent suffixes, and use MDL to decide whether it's worth it to do the analysis. We'll return to this in a moment.
Incremental heuristics
- 3. Slide the stem-suffix boundary to the left; again, use MDL to decide.
- How do we use MDL to decide?
Using MDL to judge a potential stem
- act, acted, action, acts.
- We have the suffixes NULL, ed, ion, and s, but no signature NULL.ed.ion.s.
- Let's compute the cost versus the savings of the signature NULL.ed.ion.s.
- Savings:
  - Stem savings: 3 copies of the stem act; that's 3 x 4 = 12 letters, almost 60 bits.
Cost of NULL.ed.ion.s
- To give a feel for this: the total cost of the suffix list is about 30 bits.
- The cost of the pointer to the signature comes on top of that; all the stems using the signature chip in to pay for its cost, though.
- Cost of the signature: about 45 bits.
- Savings: about 60 bits.
- So MDL says: do it! Analyze the words as stem + suffix.
- Notice that the cost of the analysis would have been higher if one or more of the suffixes had not already existed.
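The comparison above can be checked in a few lines; a minimal sketch, where the 45-bit signature cost and the 12 saved letters are taken from the figures above and log2(26) is the naive per-letter cost.

```python
from math import log2

LETTER_BITS = log2(26)          # about 4.70 bits: the naive cost of one letter

# Savings, following the figures above: 12 letters of stem material saved.
stem_savings = 12 * LETTER_BITS
print(round(stem_savings, 1))   # about 56.4 bits ("almost 60 bits")

# Cost of creating the signature, taken directly from the text above.
signature_cost = 45
print(stem_savings > signature_cost)   # True, so MDL says: do the analysis
```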
Today's presentation
- The task: unsupervised learning
- Overview of program and output
- Overview of the Minimum Description Length framework
- Application of MDL to iterative search of morphology-space, with successively finer-grained descriptions
- Mathematical model
- Current capabilities
- Current challenges
Model
- A model to give us a probability of each word in the corpus (hence, its optimal compressed length), and
- a morphology whose length we can measure.
Frequency of an analyzed word
- [x] means the count of x's in the corpus (token count), and [W] is the total number of words.
- Word W is analyzed as belonging to signature σ, stem T, and suffix F:

  prob(W) = ([σ] / [W]) · ([T] / [σ]) · ([F in σ] / [σ])

- Actually, what we care about is the log of this:

  -log prob(W) = log([W]/[σ]) + log([σ]/[T]) + log([σ]/[F in σ])
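A sketch of this token-count model in Python, under the factorization above; the counts in the example are invented for illustration.

```python
from math import log2

def word_cost_bits(sig_count, stem_count, suffix_in_sig_count, total_tokens):
    """-log2 prob(W) under prob(W) = p(sigma) * p(T | sigma) * p(F | sigma)."""
    prob = (sig_count / total_tokens) \
         * (stem_count / sig_count) \
         * (suffix_in_sig_count / sig_count)
    return -log2(prob)

# Invented counts: a signature seen 600 times in a 100,000-token corpus,
# the stem 90 times, and the suffix 150 times within that signature.
print(round(word_cost_bits(600, 90, 150, 100_000), 2))   # about 12.12 bits
```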
Next, let's see how to measure the length of a morphology
- A morphology is a set of 3 things:
  - A list of stems
  - A list of suffixes
  - A list of signatures, with the associated stems
- We'll make an effort to make our grammars consist primarily of lists, whose length is conceptually simple.
Length of a list
- A header telling us how long the list is, of length (roughly) log2 N, where N is the length of the list.
- N entries. What's in an entry?
  - Raw lists: a list of strings of letters, where the cost of each letter is log2(26), the information content of a letter (we could use a more accurate conditional probability).
  - Pointer lists.
Lists
- Raw suffix list: ed, s, ing, ion, able
- Signature 1: suffixes = pointer to ing, pointer to ed
- Signature 2: suffixes = pointer to ing, pointer to ion
- The length of a pointer to a suffix f is log([W]/[f]) bits, usually cheaper than the letters themselves.
- The fact that a pointer to a symbol has a length that varies inversely with the symbol's frequency is the key:
- we want the shortest overall grammar, so
- that means maximizing the re-use of units (stems, affixes, signatures, etc.).
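A sketch of the pointer-cost idea, under the convention used above that a pointer to an item x costs log2([W]/[x]) bits; the counts are invented.

```python
from math import log2

def pointer_bits(item_count, total_count):
    """Cost in bits of a pointer to an item: log2(total / count)."""
    return log2(total_count / item_count)

# A frequent suffix is cheap to point to; a rare one is expensive.
total = 100_000
print(round(pointer_bits(9_000, total), 2))   # frequent "ing": about 3.47 bits
print(round(pointer_bits(12, total), 2))      # rare suffix: about 13.02 bits
print(round(3 * log2(26), 2))                 # vs. spelling out "ing": 14.10 bits
```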
Structure of the length of the morphology (formula): the number of letters in the stem and suffix lists, plus the signatures, which we'll get to shortly.
Information contained in the signature component
- A list of pointers to signatures.
- <X> indicates the number of distinct elements in X.
Repair heuristics using MDL
- We could compute the entire MDL in one state of the morphology, make a change, compute the whole MDL in the proposed (modified) state, and compare the two lengths:

  original morphology + compressed data  vs.  revised morphology + compressed data
- But it's better to have a more thoughtful approach.
- Let's define the relevant quantities once; then the size of the punctuation for the 3 lists follows, as does the change of that size under a proposed modification. (The defining formulas are not reproduced in this transcript.)
Size of the suffix component (recall the formula). Its change in size when we consider a modification to the morphology decomposes into:
1. Global effects of the change in the number of suffixes
2. Effects of the change in size of suffixes present in both states
3. Suffixes present only in state 1
4. Suffixes present only in state 2
Suffix component change
- Suffixes whose counts change
- Global effect of the change on all suffixes
- Contribution of suffixes that appear only in state 1
- Contribution of suffixes that appear only in state 2
Current research projects
- Allomorphy: automatic discovery of relationships between stems (lov/love, win/winn)
- Use of syntax (automatic learning of syntactic categories)
- Rich morphology: other languages (e.g., Swahili) and other sub-languages (e.g., the biochemistry sub-language) where the mean number of morphemes per word is much higher
- Ordering of morphemes
Allomorphy: automatic discovery of relationships between stems
- Currently learns (unfortunately, over-learns) how to delete stem-final letters in order to simplify signatures.
- E.g., delete stem-final e in English before the suffixes -ing, -ed, -ion (etc.).
Automatic learning of syntactic categories
- Work in progress with Mikhail Belkin (U of Chicago)
- Pursuing Shi and Malik's 1997 application of spectral graph theory (vision)
- Finding the eigenvector decomposition of a graph that represents bigrams and trigrams
Rich morphologies
- A practical challenge for use in data-mining and information retrieval in patent applications (de-oxy-ribo-nucle-ic, etc.)
- Swahili, Hungarian, Turkish, etc.
Unsupervised Knowledge-Free Morpheme Boundary Detection
- Stefan Bordag
- University of Leipzig

- Example
- Related work
- Part One: Generating training data
- Part Two: Training and applying a classifier
- Preliminary results
- Further research
Example: clearly, early
- The examples used throughout this presentation are "clearly" and "early".
- In one case the stem is "clear"; in the other, "early".
- Other word forms of the same lemmas:
  - clearly: clearest, clear, clearer, clearing
  - early: earlier, earliest
- Semantically related words:
  - clearly: logically, really, totally, weakly, ...
  - early: morning, noon, day, month, time, ...
- Correct morpheme boundary analyses:
  - clearly → clear-ly, but not clearl-y or clea-rly
  - early → early or earl-y, but not ear-ly
Three approaches to morpheme boundary detection
- Genetic algorithms and the Minimum Description Length model
  - (Kazakov 97, 01), (Goldsmith 01), (Creutz 03, 05)
  - This approach uses only a word list, not the context information for each word from the corpus.
  - This possibly results in an upper limit on achievable performance (especially with regard to irregularities).
  - One advantage is that smaller corpora are sufficient.
- Semantics-based
  - (Schone & Jurafsky 01), (Baroni 03)
  - A general problem of this approach is examples like "deeply" and "deepness", where semantic similarity is unlikely.
- Letter Successor Variety (LSV) based
  - (Harris 55); (Hafer & Weiss 74): first application, but low performance
  - Also applied only to a word list
  - Further hampered by noise in the data
2. New solution in two parts
(Flowchart:) sentences (e.g., "The talk was very informative") → co-occurrences (e.g., "the talk": 1, "talk was": 1) → similar words (e.g., talk ~ speech: 20, was ~ is: 15) → compute LSV, scored as s = LSV · freq · multiletter · bigram → training data (clear-ly, lately, early) → train classifier → apply classifier → clear-ly, late-ly, early.
2.1. First part: generating training data with LSV and distributional semantics
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
- Frequency of words A and B: n_A and n_B
- Frequency of co-occurrence of A with B: n_AB
- Corpus size: n
- Significance computation is the Poisson approximation of the log-likelihood measure (Dunning 93), (Quasthoff & Wolff 02).
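For concreteness, here is one common form of the Poisson collocation measure in Python; the exact normalization used by Quasthoff & Wolff may differ, and the counts below are invented.

```python
from math import lgamma, log

def poisson_sig(na, nb, nab, n):
    """Poisson collocation significance of A and B co-occurring nab times.

    lam is the co-occurrence count expected by chance; the score is the
    negative log of the Poisson probability of the observed count,
    normalized by log n (one common form of this measure).
    """
    lam = na * nb / n
    return (lam - nab * log(lam) + lgamma(nab + 1)) / log(n)

# Invented counts: "very" and "clearly" in a corpus of 1,000,000 units
print(round(poisson_sig(na=20_000, nb=3_000, nab=250, n=1_000_000), 1))  # ~12.3
```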
Neighbors of "clearly"
- Most significant left neighbors: very, quite, so, Its, most, its, shows, results, thats, stated, Quite
- Most significant right neighbors: defined, written, labeled, marked, visible, demonstrated, superior, stated, shows, demonstrates, understood
- Example contexts: "Its clearly labeled", "very clearly shows"
2.2. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
Similar words to "clearly"
- (Left and right neighbor lists as on the previous slide.)
- Similar words found: weakly, legally, closely, clearly, greatly, linearly, really
2.3. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
  - Sort those words by edit distance and keep the 150 most similar → words further away only add random noise.
Similar words to "clearly", sorted by edit distance
- Sorted list: clearly, closely, greatly, legally, linearly, really, weakly
- (Left and right neighbor lists as above.)
2.4. New solution as combination of two existing approaches
- Overview:
  - Use context information to gather common direct neighbors of the input word → they are most probably marked by the same grammatical information.
  - Use these neighbor co-occurrences to find words that have similar co-occurrence profiles → those that are surrounded by the same co-occurrences mostly bear the same grammatical marker.
  - Sort those words by edit distance and keep the 150 most similar → words further away only add random noise.
  - Compute the letter successor variety for each transition between two characters of the input word.
  - Report boundaries where the LSV is above a threshold.
2.5. Letter successor variety
- Letter successor variety (Harris 55): word-splitting occurs where the number of distinct letters that follow a given sequence of characters surpasses a threshold.
- Input: the 150 most similar words.
- Observe how many different letters occur after a part of the string:
  - c-: in the given list, 5 letters follow c-
  - cl-: only 3 letters
  - cle-: only 1 letter
  - ...
  - -ly, but reversed: before -ly, 16 different letters (16 different stems precede the suffix -ly)
- For "clearly":
    c  l  e  a  r  l  y
    28 5  3  1  1  1  1  1    from the left (e.g., after c-, 5 different letters)
    1  1  2  1  3  16 10 14   from the right (e.g., before -y, 10 different letters)
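A minimal sketch of the LSV computation over a set of similar words, computed from both directions (reversing the words gives the variety before a suffix); the seven-word list below stands in for the real 150.

```python
from collections import defaultdict

def lsv(words):
    """Distinct-letter variety after each prefix of each word."""
    variety = defaultdict(set)
    for w in words:
        for i in range(len(w)):
            variety[w[:i]].add(w[i])
    return variety

def boundary_scores(word, similar):
    """Left and right LSV for each transition inside `word`."""
    left = lsv(similar)
    right = lsv([w[::-1] for w in similar])          # predecessor variety
    rev = word[::-1]
    return [(i,
             len(left[word[:i]]),                    # letters after the prefix
             len(right[rev[:len(word) - i]]))        # letters before the suffix
            for i in range(1, len(word))]

similar = ["clearly", "closely", "greatly", "legally",
           "linearly", "really", "weakly"]
# With the full 150-word list the clear-ly boundary peaks much higher.
for i, l, r in boundary_scores("clearly", similar):
    print(f"{'clearly'[:i]}-{'clearly'[i:]}: left={l}, right={r}")
```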
2.5.1. Balancing factors
- The LSV score for each possible boundary is not normalized and needs to be weighted against several factors that otherwise add noise:
  - freq: frequency differences between the beginning and the middle of the word
  - multiletter: representation of single phonemes by several letters
  - bigram: certain fixed combinations of letters
- The final score s for each possible boundary is then: s = LSV · freq · multiletter · bigram
2.5.2. Balancing factors: frequency
- LSV is not normalized against frequency:
  - 28 different first letters within 150 words
  - 5 different second letters within the 11 words beginning with c
  - 3 different third letters within the 4 words beginning with cl
- Computing the frequency weight freq: 4 out of 11 words begin with cl-, so the weight is 4/11.
- For "clearly":
    c    l    e    a  r  l  y
    150  11   4    1  1  1  1  1   word counts (of the 11 beginning with c, 4 begin with cl)
    0.1  0.4  0.3  1  1  1  1  1   freq weights from the left
2.5.3. Balancing factors: multiletter phonemes
- Problem: two or more letters which together represent one phoneme carry away the numerator of the overlap-factor quotient.
- Letter split variety for German "schlimme":
    s  c  h  l  i  m  m  e
    7  1  7  2  1  1  2        from the left
    2  1  1  1  2  4  15       from the right
- Computing the overlap factor:
    150 27 18 18 6 5 5 5
    2   2  2  2  3 7 105 150
- Thus at this point the LSV of 7 is weighted 1 (18/18), but since sch is one phoneme, it should have been 18/150!
- Solution: rank bi- and trigrams; the highest-ranked receives a weight of 1.0.
- The overlap factor is then recomputed as a weighted average; in this case that means 1.0 · 27/150, since sch is the highest trigram and has a weight of 1.0.
2.5.4. Balancing factors: bigrams
- It is obvious that "th" in English should almost never be divided.
- Compute a bigram ranking over all words in the word list and give a weight of 0.1 to the highest-ranked and 1.0 to the lowest-ranked bigram.
- The LSV score is then multiplied by the resulting weight.
- Thus the German "ch", the highest-ranked bigram, receives a penalty of 0.1, making it nearly impossible for it to become a morpheme boundary.
2.5.5. Sample computation
- For "clearly" and "early", the slide tabulates per letter transition: the left and right letter successor varieties, the balancing frequencies, the multiletter (bi- and trigram) weights, and the bigram weights, and then combines them into left and right boundary scores.
- The right score for the clear-ly boundary comes out at 12.4 (an LSV of 16 weighted by the balancing factors), by far the strongest peak; no boundary inside "early" scores comparably.
Second part: training and applying the classifier
- Any word list can be stored in a trie (Fredkin 60) or in a more efficient version of a trie, a PATRICIA compact tree (PCT) (Morrison 68).
- Example words: clearly, early, lately, clear, late.
- (Figure: the example words stored letter by letter in a trie, with a marker for the end or beginning of a word.)
3.1. PCT as a classifier
- (Figure: a PCT built from the training analyses clear-ly, late-ly, early, clear, late; compact nodes such as "cl", "ear", "late", with "ly" nodes carrying counts of the boundary evidence observed there, e.g. ly: 2.)
- To apply: find the deepest matching node and retrieve the known information stored there.
- amazing?ly → amazing-ly (known boundary information added)
- dear?ly → dearly (the deepest match here comes via "early", which has no boundary before -ly)
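A toy version of the trie-based classifier: each training analysis is stored under the reversed word, nodes accumulate cut/no-cut votes, and the deepest matching node decides. A real PCT compresses node chains and stores richer information; this sketch only reproduces the behavior of the example.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.cut = 0     # words whose last boundary lies exactly this deep
        self.nocut = 0   # words with no boundary at this depth

def train(root, analyzed):
    """Insert one analysis, e.g. 'clear-ly' or 'early' (last boundary only)."""
    word = analyzed.replace("-", "")
    suffix_len = len(analyzed) - 1 - analyzed.rindex("-") if "-" in analyzed else 0
    node = root
    for depth, ch in enumerate(reversed(word), 1):
        node = node.children.setdefault(ch, Node())
        if depth == suffix_len:
            node.cut += 1
        else:
            node.nocut += 1

def classify(root, word):
    """Walk the reversed word; the deepest matching node decides."""
    node, deepest, cut_depth = root, None, 0
    for depth, ch in enumerate(reversed(word), 1):
        if ch not in node.children:
            break
        node = node.children[ch]
        deepest, cut_depth = node, depth
    if deepest and deepest.cut > deepest.nocut:
        return word[:-cut_depth] + "-" + word[-cut_depth:]
    return word

root = Node()
for a in ["clear-ly", "late-ly", "early", "clear", "late"]:
    train(root, a)
print(classify(root, "amazingly"))   # amazing-ly (deepest match: the 'ly' node)
print(classify(root, "dearly"))      # dearly (deepest match runs along 'early')
```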
4. Evaluation
- Boundary measuring: each detected boundary can be correct or wrong (precision), and true boundaries can go undetected (recall).
- The first evaluation is global LSV with the proposed improvements.
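Boundary precision, recall, and F-measure can be computed as follows; the dash-separated format and the helper names are assumptions, not Bordag's evaluation code.

```python
def boundary_prf(predicted, gold):
    """Precision, recall, F-measure over morpheme boundary positions."""
    def boundaries(analysis):
        pos, cuts = 0, set()
        for part in analysis.split("-")[:-1]:
            pos += len(part)
            cuts.add(pos)          # boundary position within the bare word
        return cuts

    tp = fp = fn = 0
    for p, g in zip(predicted, gold):
        pb, gb = boundaries(p), boundaries(g)
        tp += len(pb & gb)         # boundaries found and correct
        fp += len(pb - gb)         # boundaries found but wrong
        fn += len(gb - pb)         # boundaries missed
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f

print(boundary_prf(["clear-ly", "ear-ly"], ["clear-ly", "early"]))
# (0.5, 1.0, 0.666...): one correct boundary, one spurious one
```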
Evaluating LSV: precision vs. recall (chart)
Evaluating LSV: F-measure (chart)
Evaluating the combination: precision vs. recall (chart)
Evaluating the combination: F-measure (chart)
Comparing the combination with global LSV (chart)
4.1. Results
- German newspaper corpus with 35 million sentences
- English newspaper corpus with 13 million sentences

  (threshold t = 5)       German   English
  LSV precision           80.20    70.35
  LSV recall              34.52    10.86
  LSV F-measure           48.27    18.82
  Combined precision      68.77    52.87
  Combined recall         72.11    52.56
  Combined F-measure      70.40    55.09
4.2. Statistics

                               en lsv      en comb     tr lsv     tr comb    fi lsv     fi comb
  Corpus size                  13 million  13 million  1 million  1 million  4 million  4 million
  Number of word forms         167,377     167,377     582,923    582,923    1,636,336  1,636,336
  Analysed words               49,159      94,237      26,307     460,791    68,840     1,380,841
  Boundaries                   70,106      131,465     31,569     812,454    84,193     3,138,039
  Morpheme length              2.60        2.56        2.29       3.03       2.32       3.73
  Length of analysed words     8.97        8.91        9.75       10.62      11.94      13.34
  Length of unanalysed words   7.56        6.77        10.12      8.15       12.91      10.47
  Morphemes per word           2.43        2.40        2.20       2.76       2.22       3.27
Assessing true error rate
- Typical sample list of words considered wrong due to CELEX (system output vs. CELEX analysis):
  - Tau-sende vs. Tausend-e
  - senegales-isch-e vs. senegalesisch-e
  - sensibelst-en vs. sens-ibel-sten
  - separat-ist-isch-e vs. separ-at-istisch-e
  - tris-t vs. trist
  - triump-hal vs. triumph-al
  - trock-en vs. trocken
  - unueber-troff-en vs. un-uebertroffen
  - trop-f-en vs. tropf-en
  - trotz-t-en vs. trotz-ten
  - ver-traeum-t-e vs. vertraeumt-e
- Reasons:
  - Gender -e (in (Creutz & Lagus 05), for example, counted as correct)
  - Compounds (sometimes separated, sometimes not)
  - -t-en error
  - With proper names, -isch is often not analyzed
  - Connecting elements
4.4. Real example
- Ver-trau-enskrise, Ver-trau-ensleute, Ver-trau-ens-mann, Ver-trau-ens-sache, Ver-trau-ensvorschuß, Ver-trau-ensvo-tum, Ver-trau-ens-würd-igkeit, Ver-traut-es, Ver-trieb-en, Ver-trieb-spartn-er, Ver-triebene, Ver-triebenenverbände, Ver-triebs-beleg-e
- Orien-tal, Orien-tal-ische, Orien-tal-ist, Orien-tal-ist-en, Orien-tal-ist-ik, Orien-tal-ist-in, Orient-ier-ung, Orient-ier-ungen, Orient-ier-ungs-hilf-e, Orient-ier-ungs-hilf-en, Orient-ier-ungs-los-igkeit, Orient-ier-ungs-punkt, Orient-ier-ungs-punkt-e, Orient-ier-ungs-stuf-e
5. Further research
- Examine quality on various language types
- Improve the trie-based classifier
- Possibly combine with other existing algorithms
- Find out how to acquire the morphology of non-concatenative languages
- Deeper analysis:
  - find deletions
  - alternations
  - insertions
  - morpheme classes, etc.
References
- (Argamon et al. 04) Shlomo Argamon, Navot Akiva, Amihood Amir, and Oren Kapah. Efficient unsupervised recursive word segmentation using minimum description length. In Proceedings of COLING 2004, Geneva, Switzerland, 2004.
- (Baroni 03) Marco Baroni. Distribution-driven morpheme discovery: A computational/experimental study. Yearbook of Morphology, pages 213-248, 2003.
- (Creutz & Lagus 05) Mathias Creutz and Krista Lagus. Unsupervised morpheme segmentation and morphology induction from text corpora using Morfessor 1.0. In Publications in Computer and Information Science, Report A81. Helsinki University of Technology, March 2005.
- (Déjean 98) Hervé Déjean. Morphemes as necessary concept for structures discovery from untagged corpora. In D.M.W. Powers, editor, NeMLaP3/CoNLL98 Workshop on Paradigms and Grounding in Natural Language Learning, ACL, pages 295-299, Adelaide, January 1998.
- (Dunning 93) T. E. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61-74, 1993.
References II
- (Goldsmith 01) John Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153-198, 2001.
- (Hafer & Weiss 74) Margaret A. Hafer and Stephen F. Weiss. Word segmentation by letter successor varieties. Information Storage and Retrieval, 10:371-385, 1974.
- (Harris 55) Zellig S. Harris. From phonemes to morphemes. Language, 31(2):190-222, 1955.
- (Kazakov 97) Dimitar Kazakov. Unsupervised learning of naive morphology with genetic algorithms. In A. van den Bosch, W. Daelemans, and A. Weijters, editors, Workshop Notes of the ECML/MLnet Workshop on Empirical Learning of Natural Language Processing Tasks, pages 105-112, Prague, Czech Republic, April 1997.
- (Quasthoff & Wolff 02) Uwe Quasthoff and Christian Wolff. The Poisson collocation measure and its applications. In Second International Workshop on Computational Approaches to Collocations, 2002.
- (Schone & Jurafsky 01) Patrick Schone and Daniel Jurafsky. Language-independent induction of part of speech class labels using only language universals. In Workshop at IJCAI-2001 (Machine Learning: Beyond Supervision), Seattle, WA, August 2001.
E. Gender -e vs. frequency -e
- vs. other -e: andere 8.4, keine 6.8, rote 11.6, stolze 8.0, drehte 10.8, winzige 9.7, lustige 13.2, rufe 4.4, Dumme 12.6
- vs. gender -e: Schule 8.4, Devise 7.8, Sonne 4.5, Abendsonne 5.3, Abende 5.5, Liste 6.5
- Frequency -e: Affe 2.7, Junge 5.3, Knabe 4.6, Bursche 2.4, Backstage 3.0