Title: Computational Morphology
1 Computational Morphology
2 Computational morphology
- The big questions
- Efficient generation and recognition
- Common data format
- Common "runtime" algorithm for all languages
- Established results
- Lexical representations are regular languages
- Morphological alternations are regular relations
- Regular relations can be compiled into finite-state transducers
- Burning issues
- Nonconcatenative phenomena: reduplication (Malay), interdigitation (Arabic)
- Nonlocal dependencies
- Syntax/morphology interface
3 Overview
- Computational morphology
- A success story
- Realizational Morphology (is finite-state)
- Lexical representations
- Realization rules
- Morphophonological rules
- Rules of referral
- Elsewhere principle (Panini's principle)
- Challenges
4 Computational morphology
5 Two challenges
- Morphotactics
- Words are composed of smaller elements that must be combined in a certain order
- piti-less-ness is English
- piti-ness-less is not English
- Phonological alternations
- The shape of an element may vary depending on the context
- pity is realized as piti in pitilessness
- die becomes dy in dying
6 Morphology is regular (rational)
- The relation between the surface forms of a language and the corresponding lexical forms can be described as a regular relation.
- A regular relation consists of ordered pairs of strings.
- leaf+N+Pl : leaves, hang+V+Past : hung
- Any finite collection of such pairs is a regular relation.
- Regular relations are closed under operations such as concatenation, iteration, union, and composition.
- Complex regular relations can be derived from simple relations.
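The closure properties above can be illustrated with finite relations represented as Python sets of (lexical, surface) pairs. This is only a sketch: the helper names `union`, `concat`, and `compose` and the tag spellings (`+N`, `+Pl`, `+V`, `+Past`) are assumptions made for illustration, and a real system works with transducers rather than enumerated sets.

```python
from itertools import product

def union(r, s):
    """Union of two relations (sets of string pairs)."""
    return r | s

def concat(r, s):
    """Pairwise concatenation: upper sides and lower sides are joined separately."""
    return {(u1 + u2, l1 + l2) for (u1, l1), (u2, l2) in product(r, s)}

def compose(r, s):
    """Composition: feed the lower side of r into the upper side of s."""
    return {(u, l2) for (u, l1) in r for (m, l2) in s if l1 == m}

stem = {("leaf+N", "leaf")}
plural = {("+Pl", "s")}
regular = concat(stem, plural)            # pairs leaf+N+Pl with the raw form "leafs"
repair = {("leafs", "leaves")}            # a toy alternation, listed explicitly
irregular = {("hang+V+Past", "hung")}
lexicon = union(compose(regular, repair), irregular)
```

Each operation returns another set of pairs, so more complex relations are built from simple ones without leaving the representation.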
7 Morphology is finite-state
- A regular relation can be defined using the metalanguage of regular expressions.
- A regular expression can be compiled into a finite-state transducer that implements the relation computationally.
8 Regular morphotactics
- Principles of word-formation in most languages can be defined as a regular language or relation using operators such as concatenation and union.
- Toy example: | union, () optionality
- Noun = ear | father
- Adj = clear | clever | fat
- Adv = ever
- NPref = anti
- AdjSuff = er | est
- NSuf = s
- English = (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
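The toy grammar can be tried out with Python's `re` module, which covers the concatenation, union, and optionality used here. This is a sketch of the morphotactics only, not of the xfst compiler; the function name `is_word` is an assumption for illustration.

```python
import re

NOUN = "ear|father"
ADJ = "clear|clever|fat"
ADV = "ever"
NPREF = "anti"
ADJSUFF = "er|est"
NSUF = "s"

# English = (NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
ENGLISH = re.compile(
    rf"(?:{NPREF})?(?:{NOUN})(?:{NSUF})?"
    rf"|(?:{ADJ})(?:{ADJSUFF})?"
    rf"|(?:{ADV})"
)

def is_word(s):
    """True if s is generated by the toy morphotactic grammar."""
    return ENGLISH.fullmatch(s) is not None
```

Under this grammar `is_word("antifathers")` and `is_word("fater")` hold, while `is_word("fathersanti")` and `is_word("everest")` do not. The form "fater" is accepted because spelling alternations are a separate matter from morphotactics.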
9 Simple lexicon
[Figure: finite-state network for the toy lexicon]
(NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
10 Regular alternations
- Phonological alternations can be represented as regular relations using special regular expression operators
- Ordered rewrite systems (Panini 500 BC, Chomsky & Halle 1968)
- Parallel two-level systems (Koskenniemi 1983)
- Simple gemination rule for English
- t -> t t || .#. C* V _ e [r | s t]
- Geminate t at the end of a monosyllabic stem with a single vowel that is followed by er or est (fater -> fatter vs. greater -> greater).
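The effect of the gemination rule can be approximated with a single regular-expression substitution. The sketch below reads the rule as: word start, any consonants, one vowel, then t, followed by er or est. The function name `geminate` is an assumption, and `re.sub` stands in for a compiled rewrite transducer, so it only works in the generation direction.

```python
import re

# word start, consonants, a single vowel, then "t" followed by "er" or "est"
GEMINATION = re.compile(r"^([^aeiou]*[aeiou])t(?=e(?:r|st))")

def geminate(word):
    """Double the stem-final t; leave every other word unchanged."""
    return GEMINATION.sub(r"\1tt", word)
```

With this reading, `geminate("fater")` gives "fatter", while "greater" is returned unchanged because its t does not directly follow the first vowel.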
11 Transducer lexicon
[Figure: transducer network for the toy lexicon after composition with the gemination rule, including an arc labeled 0:t]
(NPref) Noun (NSuf) | Adj (AdjSuff) | Adv
.o. t -> t t || _ e [r | s t]
12 Lexical transducer
- Bidirectional: generation or analysis
- Compact and fast
- Comprehensive systems have been built for over 20 languages
- English, German, Dutch, French, Italian, Spanish, Portuguese, Finnish, Russian, Turkish, Japanese, Basque, Greek, Arabic, Bulgarian, ...
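Bidirectionality can be seen in a very small transducer written directly in Python. This is a minimal sketch and not the Parc/XRCE implementation: the class `Transducer`, its methods, and the tag spellings are assumptions for illustration. Pairs are aligned symbol by symbol and padded with epsilon (the empty string) on the shorter side.

```python
class Transducer:
    """A tiny acyclic transducer built from (lexical, surface) string pairs."""

    def __init__(self):
        self.arcs = {0: []}      # state -> list of (upper, lower, target)
        self.finals = set()

    def add_pair(self, upper, lower):
        state = 0
        for i in range(max(len(upper), len(lower))):
            u = upper[i] if i < len(upper) else ""   # "" plays the role of epsilon
            l = lower[i] if i < len(lower) else ""
            for au, al, target in self.arcs[state]:
                if (au, al) == (u, l):               # reuse an existing arc
                    state = target
                    break
            else:
                new = len(self.arcs)
                self.arcs[new] = []
                self.arcs[state].append((u, l, new))
                state = new
        self.finals.add(state)

    def _apply(self, text, side):
        results, stack = set(), [(0, 0, "")]
        while stack:
            state, pos, out = stack.pop()
            if pos == len(text) and state in self.finals:
                results.add(out)
            for arc in self.arcs[state]:
                inp, outp = arc[side], arc[1 - side]
                if inp == "":                         # epsilon on the input side
                    stack.append((arc[2], pos, out + outp))
                elif pos < len(text) and text[pos] == inp:
                    stack.append((arc[2], pos + 1, out + outp))
        return results

    def apply_down(self, lexical):
        """Generation: lexical form to surface forms."""
        return self._apply(lexical, 0)

    def apply_up(self, surface):
        """Analysis: surface form to lexical forms."""
        return self._apply(surface, 1)


lexicon = Transducer()
lexicon.add_pair("hang+Past", "hung")
lexicon.add_pair("leaf+Pl", "leaves")
```

The same network answers both `lexicon.apply_down("leaf+Pl")` and `lexicon.apply_up("leaves")`. Only the side of each arc that is read as input changes.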
13 Morphology is a solved problem
14 Who cares?
- The success of computational morphology has not made any impact within linguistics.
- Computational concerns
- completeness of coverage, physical size, speed of application, formal power, ...
- Academic concerns
- explanation, universal principles, generalizations, theoretical predictions, elegant formalism, ...
- Let's try to build a bridge
15 Realizational Morphology
- Gregory Stump, Inflectional Morphology: A Theory of Paradigm Structure. Cambridge University Press, 2001.
- A rich set of notational conventions designed to capture important linguistic generalizations.
- Interpretable, precise formalism.
- Computational implementation in DATR (Finkel & Stump 2002).
- The good news: Realizational Morphology is a finite-state model.
16 Finite-state advantage
- Casting Stump's system into a regular expression formalism that has a compiler has a fundamental advantage over implementation in systems such as DATR.
- DATR can be used to generate an inflected surface form from its lexical representation, but it is not directly usable for recognition. In contrast, finite-state transducers are bidirectional generator/recognizers.
- Issues to be addressed
- Lexical representations
- Realization rules (= rules of exponence)
- Morphophonological rules
- Rules of referral
- Rule ordering by general principles
17 Lexical representation
A phonological representation
A set of morphological properties
18 Realization rule
- $RR_{n,\tau,C}(\langle X,\sigma\rangle) =_{def} \langle Y',\sigma\rangle$
- $X$: phonological input
- $Y'$: phonological output
- $\sigma$: features
- $n$: rule block
- $\tau$: features realized by the rule
- $C$: category
19 Rule application
- Realization rules are ordered into blocks by the linguist.
- Within blocks, the ordering is determined by specificity (Elsewhere rule, Panini's principle).
- The final output of a realization rule may depend on morphophonological rules.
- X -> Y -> Y'
20 Cascade of rule applications
<bet, SubPer1, NumSg, ObjPer2, NumSg, TnsPastRec>
21 Observations
- The lexical representations of Realizational Morphology constitute a regular language.
- They can be described by a regular expression.
- All examples of realization rules given in Stump's book represent regular relations.
- They can be compiled into finite-state transducers.
- Because regular relations are closed under composition, the cascade of rule applications yields a single transducer.
- We can eliminate the features from the surface side once the composition has been done.
22 Literal example
In a real application, one would prefer a more
parsimonious encoding of the feature structure.
23 Realization rules
- Stump's realization rules can easily be expressed in the Parc/XRCE regular expression formalism.
- Example
- $RR_{3,\{\mathrm{ObjPer2},\,\mathrm{NumSg}\},V}(\langle X,\sigma\rangle) =_{def} \langle koX,\sigma\rangle$
- define R301 [. .] -> ko || "<" _ ObjAgr 2 Sg
- "Rule R301: Insert (= rewrite the empty string as) "ko" at the beginning of a phonological form whose object agreement features contain the values 2 and Sg."
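The behaviour described for R301 can be mimicked with a plain function over a (form, features) pair. This is a hedged sketch: the nested-dictionary encoding of the features and the name `r301` are assumptions for illustration, not Stump's or the Parc/XRCE encoding.

```python
def r301(form, features):
    """Prefix "ko" when the object agreement features are 2nd person singular."""
    obj = features.get("Obj", {})
    if obj.get("Per") == 2 and obj.get("Num") == "Sg":
        return "ko" + form, features
    return form, features          # rule not applicable: identity


# Hypothetical encoding of the lexical representation of "bet"
bet = ("bet", {"Sub": {"Per": 1, "Num": "Sg"},
               "Obj": {"Per": 2, "Num": "Sg"},
               "Tns": "PastRec"})

form, feats = r301(*bet)
```

The features pass through unchanged, as in Stump's schema, so that later rule blocks can still consult them.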
24 Morphophonological rules
- The output of a realization rule may be subject to a morphophonological rule.
- Stump's morphophonemic rules are simple rewrite rules, easily expressed in the Parc/XRCE regular expression formalism.
- If X = W vowel1 and Y = X vowel2 Z, then the indicated vowel2 is absent from Y'.
- Vowel -> 0 || Vowel "" _
- where "" marks the place where the suffix is inserted.
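A vowel-deletion rule of this kind can be sketched as a regular-expression substitution. The boundary symbol is not legible in the slide, so the sketch assumes a hypothetical marker `+` at the suffix insertion point. The example forms are invented for illustration and are not from Stump's data.

```python
import re

# Delete a vowel that immediately follows "vowel + boundary".
# "+" is an assumed boundary marker, not necessarily the one used on the slide.
VOWEL_DELETION = re.compile(r"(?<=[aeiou]\+)[aeiou]")

def delete_vowel(form):
    """Apply the deletion rule, keeping the boundary marker in place."""
    return VOWEL_DELETION.sub("", form)

def surface(form):
    """Apply the rule, then erase the boundary marker."""
    return delete_vowel(form).replace("+", "")
```

For the invented input "sala+aki" the suffix-initial vowel is dropped, giving "salaki", while "bet+aki" keeps its vowel because the stem ends in a consonant.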
25 Rules of referral
- Realization rules may be defined in terms of other realization rules.
- The same affix can express more than one bundle of morphological features (syncretism).
- In Lingala, mo expresses class 4 singular 3rd person agreement for subjects and objects.
- In the Parc/XRCE regular expression formalism, a rule of referral corresponds to a substitution operation.
- If R305 is the object agreement rule, the corresponding subject agreement rule is
- `[R305, Obj, Sub]
- It yields a transducer identical to R305 except that the insertion of mo is controlled by subject agreement features.
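The idea that a rule of referral is a substitution can be shown by deriving one rule object from another. This sketch makes several assumptions for illustration: the `AgreementRule` class, the simplified feature set (person and number only, without noun class), and the placement of mo as a prefix.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgreementRule:
    affix: str
    controller: str        # which agreement features trigger the rule: "Obj" or "Sub"
    per: int
    num: str

    def apply(self, form, features):
        agr = features.get(self.controller, {})
        if agr.get("Per") == self.per and agr.get("Num") == self.num:
            return self.affix + form
        return form


# Object agreement rule (simplified stand-in for R305)
R305 = AgreementRule(affix="mo", controller="Obj", per=3, num="Sg")

# Rule of referral: the same rule with Sub substituted for Obj
R305_SUB = replace(R305, controller="Sub")
```

Nothing about the affix is restated in the derived rule. Only the controlling feature is substituted, which is the point of the referral.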
26 Elsewhere principle
- While the rule blocks are ordered by the linguist, the realization rules within each block and the morphophonological rules are ordered by specificity.
- A specific rule takes precedence over a more general rule in cases where both are applicable.
- This principle is very important for Stump, but he gives no precise definition for it within his formalism.
- The Elsewhere Principle is an extremely simple notion for realization rules and for symbol-to-symbol morphophonological rules in a finite-state model.
27 Specific vs. General
28 Input/Output languages
- Rule A and Rule B have the same input language: the universal ("sigma star") language.
- Both rules can be applied without failure to any string. If the context is not met, the output is the same as the input.
- The output languages are not the same.
- A "successful" application of an obligatory rule removes from the output language the strings to which it has applied.
- Every string missing from the output language of Rule B is missing from the output language of Rule A, but not vice versa.
- The output language of Rule A is a proper subset of the output language of Rule B.
29 Output language of Rule A
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
30 Output language of Rule B
Rule A
k -> 0 || Vowel _ Vowel
Rule B
k -> v || u _ u
31 Principled rule ordering
- The relationship of any two rules A and B that insert a string or replace a particular symbol can be determined by the following method:
- Extract the output languages (a finite-state operation).
- Check whether one is a proper subset of the other (a finite-state operation).
- This determination can be done efficiently and without any knowledge of how the rules were expressed.
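The subset test can be imitated on a bounded sample of strings. The sketch reads Rule A as "delete k between vowels" and Rule B as "rewrite k as v between two u's". It enumerates all strings up to a fixed length over a small alphabet instead of computing the full output languages as a finite-state toolkit would. The alphabet, the length bound, and the function names are assumptions for illustration.

```python
import re
from itertools import product

def rule_a(s):
    """General rule: k -> 0 between vowels."""
    return re.sub(r"(?<=[aeiou])k(?=[aeiou])", "", s)

def rule_b(s):
    """Specific rule: k -> v between two u's."""
    return re.sub(r"(?<=u)k(?=u)", "v", s)

def output_language(rule, alphabet="akuv", max_len=4):
    """Outputs of the rule over every string up to max_len (a bounded sample)."""
    return {rule("".join(symbols))
            for n in range(max_len + 1)
            for symbols in product(alphabet, repeat=n)}

out_a = output_language(rule_a)
out_b = output_language(rule_b)

# The rule whose output language is the proper subset is the more general one.
a_is_more_general = out_a < out_b
```

On this sample the output language of Rule A is a proper subset of that of Rule B, so Rule B counts as the more specific rule and takes precedence where both apply. The comparison never inspects how the rules were written, only what they produce.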
32 Discussion
- It is evident that Realizational Morphology is yet another variant of finite-state morphology.
- Stump could say: "Your theory is a notational variant of mine, but mine is better."
- There are many examples where notation matters
- B => A _ C "B must occur between A and C."
- ~[~[?* A] B ?*] & ~[?* B ~[C ?*]]
- Stump's convoluted and cumbersome notation takes no advantage of the nice formal and computational properties that it in fact has. It is a finite-state model that does not know its name.
33 Morphotactic challenges
- Most languages build words by concatenation
- unthinkingly
- parismutnngauniraqlauqsimanngitjunga (Inuktitut)
- (parimunngauniralauqsimanngittunga: "I never said I wanted to go to Paris")
- Some languages also have nonconcatenative processes of word formation
- Arabic interdigitation
- Malay reduplication
34 Interdigitation in Arabic
Concatenative: kuutib + a (stem + suffix)
The root, template, and vocalization morphemes interdigitate into a stem.
35 Full-stem reduplication in Malay
- In Malay, the overt plural of bagi ("suitcase") is bagibagi (orthographically bagi-bagi); the plural of peraturan ("rule") is peraturanperaturan, etc.
- To model such pluralization, you need to copy the stem, no matter what it is and no matter how long it is.
- Such full-stem reduplication appears to be far beyond finite-state power
- The copy language, {ww | w ∈ L}, is context-sensitive.
36 Compile-replace: a new algorithm
- Define networks using concatenation, as before, but in such a way that the paths in the network may themselves contain regular expressions.
- Reapply the compiler to its own output, compiling the regular expression substrings and replacing them with the result of the compilation.
37 A non-linguistic example before compile-replace
Network containing a regular expression, a*, delimited with ^[ and ^].
38 Non-linguistic example after compile-replace
Maps every string in the infinite a* language to the regular expression from which the language was compiled.
39 Iteration operator
- ^n
- A^2 denotes two concatenations of the language A with itself, equivalent to A A.
- A = {bagi, pelabuhan}
- A^2 = {bagibagi, bagipelabuhan, pelabuhanbagi, pelabuhanpelabuhan}
- Finite-state languages and relations are closed under n-ary concatenation.
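For a finite language, n-ary concatenation is a Cartesian power, which makes it easy to see why applying it to a whole lexicon overgenerates. The helper name `power` is an assumption for illustration.

```python
from itertools import product

def power(language, n):
    """All concatenations of n strings drawn from the language."""
    return {"".join(parts) for parts in product(sorted(language), repeat=n)}

A = {"bagi", "pelabuhan"}
A2 = power(A, 2)

# Genuine reduplicated forms are only the "diagonal" of A^2.
reduplicated = {w + w for w in A}
```

`A2` has four members, but only two of them, bagibagi and pelabuhanpelabuhan, are well-formed reduplications. The iteration has to be applied to each stem separately rather than to the language as a whole.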
40 Compile-replace in Malay
- Before
- Lemma: b a g i Noun Plural
- Underlying form: b a g i ^2
- After
- Lemma: b a g i Noun Plural
- Surface string: b a g i b a g i
- The compile-replace operation does not create any ill-formed reduplicates such as pelabuhanbagi.
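The before/after step can be sketched by treating the lower side of each path as a small expression and evaluating it path by path. This simplifies the actual algorithm considerably: a dictionary stands in for the network, only the `stem^n` pattern is handled, and the lemma spellings and the name `compile_replace` are assumptions for illustration.

```python
import re

POWER = re.compile(r"(?P<stem>[a-z]+)\^(?P<n>\d+)")

def compile_replace(paths):
    """Evaluate the expression on the lower side of each path separately."""
    result = {}
    for lemma, lower in paths.items():
        m = POWER.fullmatch(lower)
        # A path with stem^n is replaced by n copies of that same stem.
        result[lemma] = m["stem"] * int(m["n"]) if m else lower
    return result

paths = {
    "bagi+Noun+Plural": "bagi^2",
    "pelabuhan+Noun+Plural": "pelabuhan^2",
    "bagi+Noun": "bagi",
}
surface = compile_replace(paths)
```

Because each expression is compiled on its own path, the stem can only be copied from itself, so mixed forms such as pelabuhanbagi never arise.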
41 Merge operators for Arabic
- Merge a Filler into a Template
- .m>. is the merge-to-the-right operator and
- .<m. is the merge-to-the-left operator.
- k t b .m>. C V V C V C
- k V V t V b
- k V V t V b .<m. u i
- k u u t i b
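The two merge examples can be reproduced with a short function over strings. This is a sketch based on the examples shown, not the operator's definition. It assumes that C and V slots accept consonants and vowels respectively, that the filler is consumed in the direction of the merge, and that the last filler symbol spreads over any remaining matching slots. The names `merge_right`, `merge_left`, and `CLASSES` are hypothetical.

```python
CLASSES = {"C": set("bcdfghjklmnpqrstvwxyz"), "V": set("aeiou")}

def _merge(filler, template, from_right):
    fill = list(reversed(filler)) if from_right else list(filler)
    slots = list(reversed(template)) if from_right else list(template)
    out, i = [], 0
    for sym in slots:
        if sym in CLASSES and fill and fill[i] in CLASSES[sym]:
            out.append(fill[i])
            if i < len(fill) - 1:      # otherwise the last symbol spreads
                i += 1
        else:
            out.append(sym)            # slot of another class, or a literal symbol
    if from_right:
        out.reverse()
    return "".join(out)

def merge_right(filler, template):
    """Fill the template left to right (the merge-to-the-right direction)."""
    return _merge(filler, template, from_right=False)

def merge_left(template, filler):
    """Fill the template right to left (the merge-to-the-left direction)."""
    return _merge(filler, template, from_right=True)

stem = merge_left(merge_right("ktb", "CVVCVC"), "ui")
```

Here `merge_right("ktb", "CVVCVC")` gives "kVVtVb", and merging "ui" from the right then gives "kuutib", with u spreading over the two leftmost vowel slots.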
42 Compile-replace in Arabic: before and after
- Before
- Lemma: k t b Root C V C V C Template a Voc
- Underlying: k t b .m>. C V C V C .<m. a
- After
- Lemma: k t b Root C V C V C Template a Voc
- Surface: k a t a b
- Alternation rules apply to the interdigitated stems to produce the real surface strings.
43 XRCE Arabic
- Lexicon
- 4930 roots
- 400 phonologically distinct patterns
- 90,000 stems
- 72 million words
- Rules
- 66 alternation rules for deletion, assimilation, etc.
- Construction
- compile-replace algorithm merges roots and patterns to form stems
- composition with alternation rules creates the final transducer with optional vowels
- time required: a few minutes
44 Conclusion
- Computationally, morphology is a solved problem.
- Syntax-morphology interface
45 References
- Lauri Karttunen, "Computing with Realizational Morphology", in A. Gelbukh (ed.), CICLing-2003, Lecture Notes in Computer Science 2588, pages 205-216. Springer Verlag, 2003.
- For a copy, write to karttune_at_parc.com
- This PowerPoint presentation will be available at a local web site.
- Kenneth R. Beesley and Lauri Karttunen, Finite-State Morphology. CSLI Publications, February 2003. (Software included.)