Title: CSA4050 Advanced Topics in NLP
1 CSA4050 Advanced Topics in NLP
- Non-Concatenative Morphology
- Reduplication
- Interdigitation
2 Reference
- Ken Beesley and Lauri Karttunen, Finite-State
Non-Concatenative Morphotactics, Proceedings of
SIGPHON-2000
3 Koskenniemi 1983
- "Only restricted infixation and reduplication
can be handled adequately with the present
system. Some extensions or revisions will be
necessary for an adequate description of
languages possessing extensive infixation or
reduplication"
4 Non-Concatenative Languages
- Most languages build words by stringing together
morphemes like beads on a string.
- The word-building processes of prefixation and
suffixation can be straightforwardly modeled in
finite state terms by concatenation.
- But some languages also exhibit non-concatenative
morphotactics.
5 Non-Concatenative Phenomena: 1. Reduplication
- In Malay: bagi (bag), bagi-bagi (bags)
- Although this may appear concatenative, it does
not involve concatenating a predictable morpheme
like "s". Instead the entire stem is copied, no
matter what its length.
- In general the language {ww | w ∈ L} is context
sensitive, but if L is finite, we can construct
an FS network that encodes it.
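A minimal Python sketch of the last point, assuming a small hypothetical stem list (only "bagi" comes from the slides) and writing the reduplicated form without the hyphen. When L is finite, {ww | w ∈ L} is itself a finite list of strings, so it can be encoded as a plain union; a regular-expression alternation stands in for the finite-state network here.

```python
import re

# Hypothetical finite stem lexicon L (illustrative only).
STEMS = ["bagi", "buku", "lembu"]

def copy_language(stems):
    """Return the finite language {ww | w in stems}."""
    return {w + w for w in stems}

def as_regex(language):
    """Encode a finite language as a union of literal strings."""
    alternatives = sorted(re.escape(s) for s in language)
    return re.compile("(?:" + "|".join(alternatives) + ")")

COPIES = copy_language(STEMS)
RECOGNISER = as_regex(COPIES)

def is_reduplicated(word):
    """True if the word is an exact copy ww of some stem in the lexicon."""
    return RECOGNISER.fullmatch(word) is not None
```

The union grows with the lexicon, which is why the construction only works when the set of reduplicating stems is finite.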
6 General Solution for Reduplication
- Therefore, assuming the number of words subject
to reduplication is finite, it is possible to
construct a lexical transducer for languages like
Malay.
- To handle reduplication, a new operator ^n is
introduced.
- A^n denotes n concatenations of A.
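As a rough illustration of n-fold concatenation, the sketch below treats a language as a finite Python set of strings; the function name `power` and the second stem "buku" are assumptions made for the example. It also shows why the operator has to be applied to one stem at a time: raising a whole lexicon to the power 2 pairs every stem with every other stem, not just with itself.

```python
from itertools import product

def power(language, n):
    """All concatenations of n strings drawn from the finite language."""
    return {"".join(parts) for parts in product(sorted(language), repeat=n)}

# Applied to a single stem, the result is exactly the reduplicated form.
single = power({"bagi"}, 2)

# Applied to a two-word lexicon, mixed pairs such as "bagibuku" appear too.
whole = power({"bagi", "buku"}, 2)
```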
7 Remarks from Beesley on Context Sensitivity
- finite-state grammars (cannot handle unlimited
nesting or non-nested terminal dependencies)
- context-free (can handle unlimited nesting,
such as matched parentheses in arithmetic
expressions, but cannot handle non-nested
dependencies between terminals)
- context-sensitive (can also handle
non-nested dependencies between terminals, as
in dogdog, where terminal elements 1 and 4 have
to be the same, 2 and 5 have to be the same,
and 3 and 6 have to be the same. These
dependencies cross, so they're not nested.)
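The crossing dependencies in dogdog can be made concrete with a short Python sketch. The function names are hypothetical; the code only lists the position pairs that must agree and checks them, and says nothing about how a grammar formalism would enforce them.

```python
def crossing_dependencies(word):
    """Pair position i with position i + n for a string of length 2n (1-based)."""
    n, remainder = divmod(len(word), 2)
    if remainder:
        return None  # odd length: cannot be of the form ww
    return [(i + 1, i + 1 + n) for i in range(n)]

def is_copy(word):
    """True if every dependent pair of positions carries the same symbol."""
    pairs = crossing_dependencies(word)
    return pairs is not None and all(word[i - 1] == word[j - 1] for i, j in pairs)
```

For "dogdog" the pairs are (1, 4), (2, 5) and (3, 6). Each pair starts before the previous one ends, which is what makes the dependencies cross. A nested pattern such as "doggod" would instead pair 1 with 6, 2 with 5 and 3 with 4.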
8 Non-Concatenation: 2. Interdigitation
- In Arabic and Maltese, prefixes and suffixes
attach to stems in the usual concatenative way,
but stems themselves are formed by a process
known as interdigitation.
- An example occurs with the Arabic stem "katab"
(wrote).
- This stem is composed of three elements:
- the all-consonant root ktb
- an abstract consonant-vowel template CVCVC
- a vocalisation aa (in this case signifying
perfect tense and active voice)
9 Interdigitation
- The same root ktb can combine with the same
template CVCVC and a different vocalism ui
(signifying perfect aspect and passive voice)
to produce "kutib" (was written).
- The same root ktb can combine with a different
template CVVCVC and the vocalism ui to produce
"kuutib", another form of the verb.
10 Intermediate Result = Template merged with Root
d V V r V s
11 Final Result = Intermediate Result merged with Vocalism
d u u r i s
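The interdigitation examples (katab, kutib, duuris) can be sketched in Python as a simple slot-filling loop. This is an illustration only, not the finite-state construction: the function name `interdigitate` is hypothetical, and the vocalism has to be spelled out with one vowel per V slot (so "uui" is passed explicitly for the long-vowel template).

```python
def interdigitate(root, template, vocalism):
    """Fill C slots from the root and V slots from the vocalism, left to right."""
    consonants, vowels = iter(root), iter(vocalism)
    out = []
    for slot in template:
        if slot == "C":
            out.append(next(consonants))
        elif slot == "V":
            out.append(next(vowels))
        else:
            raise ValueError(f"unexpected template symbol: {slot!r}")
    return "".join(out)

katab = interdigitate("ktb", "CVCVC", "aa")
kutib = interdigitate("ktb", "CVCVC", "ui")
duuris = interdigitate("drs", "CVVCVC", "uui")
```

The sketch does not check for leftover root or vocalism symbols, and it raises an error if either runs out before the template is filled.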
12 Merge
- In this case the filler language contains an
infinite set of strings (i, ui, uui, ...) but only
one path can be constructed because all strings
end in i. Hence the earlier vowels must be "u".
- This need not always be the case (e.g. if the
filler language were ui).
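The sketch below illustrates how an infinite filler language can still yield a single merged path. It rests on several assumptions: the filler language is written as a Python regular expression (u*i for the strings i, ui, uui, ...), candidates are enumerated by brute force over a small alphabet, and the function name `merge` is hypothetical. The pattern u*i* used at the end is an invented example of a less constrained filler language and is not taken from the slides.

```python
import re
from itertools import product

def merge(template, slot, filler_regex, alphabet):
    """Fill every `slot` symbol in `template` with a string from the filler language.

    Only fillers whose length equals the number of slots can be merged.
    """
    k = template.count(slot)
    pattern = re.compile(filler_regex)
    results = []
    for candidate in product(alphabet, repeat=k):
        filler = "".join(candidate)
        if pattern.fullmatch(filler):
            symbols = iter(filler)
            results.append(
                "".join(next(symbols) if t == slot else t for t in template)
            )
    return results

# Root merged into the template, then the vocalism merged into the result.
intermediate = merge("CVVCVC", "C", "drs", "drs")
final = merge(intermediate[0], "V", "u*i", "ui")

# A looser, hypothetical filler language gives several paths instead of one.
ambiguous = merge(intermediate[0], "V", "u*i*", "ui")
```

With three V slots, the only string of length three in u*i is "uui", so exactly one path survives.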
13 Merge Operators
- To introduce the merge operation into the Xerox
calculus, new operators .<m. and .m>. have been
introduced.
- These differ only in the order of arguments.
- T .<m. F and F .m>. T represent the same
merge operation with F and T as filler and
template respectively.
14 The Composite Transducer
- With these operators the network above can be
compiled by using the following expression:
d r s .m>. C V V C V C .<m. u* i
15 Merge
[Figure: the template C V V C V C merged with the root d r s and the vocalism]
16 Compile-Replace
- Regular expressions are compiled into networks as
usual, but in addition,
- the compiler is then applied to its own output.
- Central idea
- transduce to a language that has the format of
regular expressions.
- The compile-replace algorithm then replaces the
regular expression with the result of its own
compilation.
17 Compile-Replace: Simple Example
This network maps the string a* to ^[ a * ^]
(i.e. the same RE but with special delimiters).
Application of CR to the lower side of
the network eliminates the markers, compiles
the RE a* and maps the upper side to the
language resulting from the compilation.
18 The result of compiling a*
- To answer the question "what does this network
do?"
- Figure out what it does in the upward and downward
directions
19 The result of compiling a*
When applied in the upward direction, this
transducer maps any string of the infinite a*
language into the regular expression from which
it was compiled.
When applied in the downward direction, it maps
from a* to all the strings in the language of a*:
0 (the empty string), a, aa, ...
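The two directions can be mimicked with a small Python sketch. It assumes the upper side is the literal string "a*" and the lower side is the language that expression denotes. The names `apply_up` and `apply_down` are chosen for this example, and because the language is infinite the downward direction is truncated by a hypothetical `limit` parameter.

```python
import re
from itertools import count, islice

REGEX = "a*"  # upper-side string: the regular expression itself

def apply_up(surface):
    """Map a lower-side string to the regular expression, if it is in the language."""
    return [REGEX] if re.fullmatch(REGEX, surface) else []

def apply_down(upper, limit=4):
    """Map the regular expression to the first `limit` strings of its language."""
    if upper != REGEX:
        return []
    return list(islice(("a" * n for n in count()), limit))
```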
20 Compile-Replace 1
- Copy input path to output path until ^[ is
encountered on the indicated (in our case lower) side
of the network.
- Extract path until closing delimiter ^].
21 Compile-Replace 2
- Symbols along the indicated side are concatenated
into a string and eliminated from the path,
leaving just the symbols on the opposite side.
[Figure: the remaining network]
- The extracted string is compiled into a second
network using the standard network compiler.
22 Compile-Replace 3
- The two networks are combined using the
cross product operator.
- The result is spliced between the origin and
destination states of the regular expression path.
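The three steps can be traced on a single path with a toy Python sketch. Everything here is an assumption made for illustration: a path is a list of (upper, lower) symbol pairs with "" standing for epsilon, the delimiters are taken to be ^[ and ^], only one delimited region per path is handled, and `toy_compile` understands just the pattern {word}^n instead of full regular expressions.

```python
import re

OPEN, CLOSE = "^[", "^]"

def toy_compile(expression):
    """Toy stand-in for the regex compiler: handles only {word}^n."""
    match = re.fullmatch(r"\{(\w+)\}\^(\d+)", expression)
    if match is None:
        raise ValueError(f"unsupported expression: {expression!r}")
    word, n = match.group(1), int(match.group(2))
    return {word * n}

def compile_replace(path):
    """Apply compile-replace to the lower side of one path."""
    lowers = [lower for _, lower in path]
    start, end = lowers.index(OPEN), lowers.index(CLOSE)
    # Step 1: symbols outside the delimiters are copied unchanged.
    before = "".join(lowers[:start])
    after = "".join(lowers[end + 1:])
    # Step 2: lower-side symbols between the delimiters form the expression;
    # the upper-side symbols of the path are kept as they are.
    expression = "".join(lowers[start + 1:end])
    upper = tuple(u for u, _ in path if u)
    # Step 3: compile, then pair the upper side with every compiled string.
    return {(upper, before + w + after) for w in toy_compile(expression)}

# Hypothetical alignment of the Malay path; the pairing of tags with
# operator symbols is arbitrary and chosen only for the example.
PATH = [("", "^["), ("", "{"), ("b", "b"), ("a", "a"), ("g", "g"), ("i", "i"),
        ("", "}"), ("Noun", "^"), ("Plural", "2"), ("", "^]")]
RESULT = compile_replace(PATH)
```

The returned pairs are a flat stand-in for the network that would be spliced between the origin and destination states.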
23 Reduplication Revisited
- Applying compile-replace to this transducer
- Lexical: b a g i Noun Plural
- Surface: ^[ { b a g i } ^ 2 ^]
- yields this one
- Lexical: b a g i Noun Plural
- Surface: b a g i b a g i
24 Interdigitation Revisited
- Applying compile-replace to this transducer
- Up: k i t e b Verb Past 3Sg
- Down: ^[ k t b .m>. C V C V C .<m. i e ^]
- yields this one
- Up: k i t e b Verb Past 3Sg
- Down: k i t e b
25 Remember Two Central Problems
- Morphotactics: constraints on combinations of
morphemes governing the formation of valid words,
e.g. unbelievable vs. believeunable
- Phonological/Orthographical Alternation (spelling
rules): how morphemes are realised in particular
environments, e.g. fly + s gives flies
26 Xerox Perspective
- Morphotactics: handle with lexc
- Phonological/Orthographical Alternation (spelling
rules): handle with xfst
[Figure: lexc compiles the morphotactics into a Lexicon FST and xfst compiles the alternations into a Rules FST; composing the two with .o. gives the Lexical Transducer]
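The division of labour can be sketched in Python with a finite relation standing in for each transducer. This is not lexc or xfst syntax: the lexicon entries, the "+" morpheme boundary, the single spelling rule and the function names are all assumptions made for the illustration.

```python
import re

# Hypothetical lexicon FST: lexical string -> intermediate string.
LEXICON = {
    "fly Noun Plural": "fly+s",
    "bag Noun Plural": "bag+s",
    "boy Noun Plural": "boy+s",
}

def rules(intermediate):
    """Hypothetical alternation: y becomes ie before +s after a consonant; then drop '+'."""
    spelled = re.sub(r"(?<=[^aeiou])y(?=\+s$)", "ie", intermediate)
    return spelled.replace("+", "")

def compose(lexicon, rule_fn):
    """Composition on a finite relation: feed each lower string through the rules."""
    return {lexical: rule_fn(lower) for lexical, lower in lexicon.items()}

LEXICAL_TRANSDUCER = compose(LEXICON, rules)
```

The lexicon decides which morpheme sequences are valid, and the rules decide only how a valid sequence is spelled.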