Title: Morphology and Finite-State Transducers
1Morphology and Finite-State Transducers
- by Mathias Creutz
- 31 October 2001
- Chapter 3, Jurafsky & Martin
2Contents
- Morphology
- morphemes, inflection and derivation, allomorphs
- Morphological Parsing
- finite-state automata, two-level morphology
- Finite-State Transducers
- rules, combination of FSTs, lexicon-free FSTs
- Human Morphological Processing
- Exercise
3Morphology
- Morphology is the study of the way words are
built up from smaller meaning-bearing units,
morphemes.
- e.g. talo + ssa + ni + kin
- Two broad classes of morphemes: stems and
affixes.
- The stem is the main morpheme of the word,
supplying the main meaning, e.g. talo in
talossanikin.
4Affixes
- Affixes add additional meanings.
- Concatenative morphology uses the following types
of affixes:
- prefixes, e.g. epä- in epäolennainen
- suffixes, e.g. -ssa in talossa
- circumfixes, e.g. German ge- -t in gesagt
(said)
5Non-concatenative Morphology
- In non-concatenative morphology the stem morpheme
is split up. The following types of affixes are
used:
- infixes, e.g. Yurok (California), sepolah (field),
segepolah (fields)
- transfixes, e.g. Hebrew, lamad (he studied),
limed (he taught), lumad (he was taught)
- This type of non-concatenative morphology is
called templatic or root-and-pattern morphology.
6Inflection and Derivation
- There are two broad classes of ways to form words
from morphemes: inflection and derivation.
7Inflection
- Inflection is the combination of a word stem with
a grammatical morpheme, usually resulting in a
word of the same class as the original stem, and
usually filling some syntactic function, e.g.
plural of nouns.
- talo (singular), talot (plural)
- Inflection is productive.
- talo, talot vs. auto, autot vs. metsä, metsät
- The meaning of the resulting word is easily
predictable.
8Derivation
- Derivation is the combination of a word stem with
a grammatical morpheme, usually resulting in a
word of a different class, often with a meaning
hard to predict exactly.
- e.g. järki, järjestää, järjestö,
järjestellä, järjestelmä,
järjestelmällinen, järjestelmällisyys
- Not always productive.
- järki, järjestää vs. metsä, metsästää vs.
talo, talostaa?
9Allomorphs
- A group of allomorphs makes up one morpheme class.
An allomorph is a special variant of a morpheme.
- e.g. Finnish illative ending: <vowel_lengthening> + n,
h<vowel>n, seen, siin → taloon, metsään,
taloihin, huoneeseen, huoneisiin
- e.g. Finnish stem variation: käsi, käden,
kättä, käteen
10Why Allomorphs?
- Phonological constraints
- e.g. vowel harmony, talossa vs. metsässä
- Morphological paradigms
- e.g. käsi, käden vs. kasi, kasin,
Swedish leta, letade vs. heta, hette
- Irregularities
- e.g. cat, cats vs. goose, geese
- Orthographic constraints, i.e. spelling rules
- e.g. cat, cats vs. city, cities
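The vowel harmony constraint can be sketched as a small function that picks the right allomorph of the inessive ending. This is a simplified illustration, not a full model of Finnish: the function name `inessive` and the rule "any back vowel (a, o, u) in the stem selects -ssa, otherwise -ssä" are assumptions that ignore compounds and loanwords.

```python
BACK_VOWELS = "aou"  # assumed back-vowel set for this sketch

def inessive(stem):
    """Attach the inessive allomorph that agrees with the stem's vowels."""
    has_back_vowel = any(ch in BACK_VOWELS for ch in stem)
    return stem + ("ssa" if has_back_vowel else "ssä")

print(inessive("talo"))   # talossa
print(inessive("metsä"))  # metsässä
```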
11Morphological Parsing
- Parsing means taking an input and producing some
sort of structure for it.
- Morphological parsing means breaking down a word
form into its constituent morphemes.
- e.g. talossa → talo + ssa
- Mapping a word form to its base form is called
stemming.
- e.g. talossa → talo
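The difference between parsing and stemming can be illustrated with a toy segmenter. The stem and suffix lists below are hypothetical and tiny, and the functions `parse` and `stem` are names chosen for this sketch. The segmenter simply strips known suffixes greedily and does not check morpheme order.

```python
STEMS = ["talo", "metsä"]               # hypothetical toy lexicon
SUFFIXES = ["ssa", "ssä", "ni", "kin"]  # a few endings, unordered

def segment_suffixes(rest):
    """Split rest into known suffixes, or return None if impossible."""
    if rest == "":
        return []
    for suffix in SUFFIXES:
        if rest.startswith(suffix):
            tail = segment_suffixes(rest[len(suffix):])
            if tail is not None:
                return [suffix] + tail
    return None

def parse(word):
    """Morphological parsing: word form -> list of morphemes."""
    for candidate in STEMS:
        if word.startswith(candidate):
            suffixes = segment_suffixes(word[len(candidate):])
            if suffixes is not None:
                return [candidate] + suffixes
    return None

def stem(word):
    """Stemming: word form -> base form only."""
    morphemes = parse(word)
    return morphemes[0] if morphemes else None

print(parse("talossanikin"))  # ['talo', 'ssa', 'ni', 'kin']
print(stem("talossa"))        # talo
```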
12Finite-State Morphological Parsing
- In order to build a parser we need the following:
- a lexicon containing the stems and affixes,
- morphotactics, i.e. the model of morpheme
ordering, e.g. talossani instead of
talonissa,
- a set of rules (orthographic, etc.), i.e. the
model of changes that occur in a word, usually
when two morphemes combine, e.g. city + s →
cities.
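The three components can be sketched separately. Everything here is an illustrative assumption: the lexicon entries, the class names (STEM, CASE, POSS, CLITIC), the slot order, and the function names. The rule function covers only the city + s case mentioned above.

```python
import re

# Lexicon: morphemes with their (assumed) classes.
LEXICON = {"talo": "STEM", "ssa": "CASE", "ni": "POSS", "kin": "CLITIC"}

# Morphotactics: classes must appear in this relative order.
ORDER = ["STEM", "CASE", "POSS", "CLITIC"]

def well_ordered(morphemes):
    """True if the word starts with a stem and classes follow ORDER."""
    slots = [ORDER.index(LEXICON[m]) for m in morphemes]
    increasing = all(a < b for a, b in zip(slots, slots[1:]))
    return bool(slots) and slots[0] == 0 and increasing

def join_with_rule(stem, suffix):
    """Orthographic rule: consonant + y becomes ie before -s."""
    if suffix == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    return stem + suffix

print(well_ordered(["talo", "ssa", "ni"]))  # True  (talossani)
print(well_ordered(["talo", "ni", "ssa"]))  # False (talonissa)
print(join_with_rule("city", "s"))          # cities
```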
13Finite-State Automaton for Inflection of English
Verbs
[Figure: automaton starting at q0 with arcs labelled
reg-verb-stem, irreg-verb-stem, irreg-past-verb-form,
preterite (-ed), past-participle (-ed),
progressive (-ing) and 3-singular (-s).]
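An automaton over morpheme classes like the one on this slide can be written as a transition table. The arc labels come from the slide, but the state names q1 to q3, the arc layout and the set of final states are assumptions made for this sketch: regular stems may take any of the four suffixes, irregular stems take only -ing and -s, and irregular past forms stand alone.

```python
# (state, morpheme class) -> next state; layout is an assumption.
TRANSITIONS = {
    ("q0", "reg-verb-stem"): "q1",
    ("q0", "irreg-verb-stem"): "q2",
    ("q0", "irreg-past-verb-form"): "q3",
    ("q1", "preterite"): "q3",
    ("q1", "past-participle"): "q3",
    ("q1", "progressive"): "q3",
    ("q1", "3-singular"): "q3",
    ("q2", "progressive"): "q3",
    ("q2", "3-singular"): "q3",
}
FINAL_STATES = {"q1", "q2", "q3"}

def accepts(classes):
    """Run the automaton over a sequence of morpheme classes."""
    state = "q0"
    for morpheme_class in classes:
        state = TRANSITIONS.get((state, morpheme_class))
        if state is None:
            return False
    return state in FINAL_STATES

print(accepts(["reg-verb-stem", "preterite"]))    # True
print(accepts(["irreg-verb-stem", "preterite"]))  # False
```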
14Finite-State Automaton for Inflection of the
Verbs talk, test and sing
[Figure: letter-level automaton starting at q0 whose
arcs spell out the inflected forms of talk, test
and sing.]
15Two-Level Morphology
- Two-level morphology represents a word as a
correspondence between a lexical level, which
represents a simple concatenation of morphemes
making up a word, and the surface level, which
represents the actual spelling of the final word.
Lexical:  s i n g V PROG
Surface:  s i n g i n g
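The correspondence between the two levels can be written down as a sequence of symbol pairs. The alignment below is one plausible choice for singing, with the empty string standing for an empty symbol. The variable names are assumptions of this sketch.

```python
EPS = ""  # empty symbol on one of the levels

# (lexical symbol, surface symbol) pairs for "sing V PROG" / "singing"
PAIRS = [
    ("s", "s"), ("i", "i"), ("n", "n"), ("g", "g"),
    ("V", EPS), ("PROG", "i"), (EPS, "n"), (EPS, "g"),
]

# Reading off each level separately recovers the two representations.
lexical = " ".join(lex for lex, _ in PAIRS if lex != EPS)
surface = "".join(surf for _, surf in PAIRS)

print(lexical)  # s i n g V PROG
print(surface)  # singing
```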
16Finite-State Transducer
- A transducer maps between one set of symbols and
another; a finite-state transducer does this via
a finite automaton.
- Where an FSA accepts a language stated over a
finite alphabet of single symbols, e.g. Σ = {a, b,
c, ...}, an FST accepts a language stated over
pairs of symbols, e.g. Σ = {a:a, b:b, a:c, a:ε,
ε:ε, ...}.
- In two-level morphology, we call pairs like a:a
default pairs, and refer to them by a single
symbol a.
- An FST can be seen as a recognizer, generator,
translator or a set relator.
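A minimal sketch of a transducer over the pairs a:a, b:b and a:c shows the translator and recognizer views. The two-state machine, its arcs and the function names are hypothetical, and the sketch does not handle pairs with an empty symbol.

```python
# (state, input symbol) -> list of (output symbol, next state)
ARCS = {
    ("q0", "a"): [("a", "q0"), ("c", "q1")],  # pairs a:a and a:c
    ("q0", "b"): [("b", "q0")],               # pair b:b
    ("q1", "b"): [("b", "q1")],               # pair b:b
}
FINAL_STATES = {"q0", "q1"}

def translate(inp, state="q0"):
    """Translator view: all output strings paired with inp."""
    if inp == "":
        return {""} if state in FINAL_STATES else set()
    outputs = set()
    for out_symbol, next_state in ARCS.get((state, inp[0]), []):
        for rest in translate(inp[1:], next_state):
            outputs.add(out_symbol + rest)
    return outputs

def recognizes(inp, out):
    """Recognizer view: is the pair of strings in the relation?"""
    return out in translate(inp)

print(sorted(translate("ab")))  # ['ab', 'cb']
print(recognizes("ab", "cb"))   # True
```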
17Finite-State Transducer for Inflection of the
Verbs talk, test and sing
[Figure: transducer starting at q0 for talk, test
and sing. Letter arcs are default pairs; the
remaining arc labels are i:u, i:a, V:ε, PRET:ε,
PRET:e, PSTPCP:ε, PSTPCP:e, PROG:i, 3SG:s, ε:d,
ε:n and ε:g.]
18Examples
Lexical form      Surface form
talk V            talk
sing V 3SG        sings
test V PROG       testing
talk V PRET       talked
sing V PRET       sang
talk V PSTPCP     talked
sing V PSTPCP     sung
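The relation in this table can be reproduced with a plain lookup function. This is not a finite-state implementation, only a sketch of the mapping a transducer for these three verbs would encode. The dictionaries and the function name `generate` are assumptions.

```python
SUFFIXES = {"3SG": "s", "PROG": "ing", "PRET": "ed", "PSTPCP": "ed"}
IRREGULAR = {("sing", "PRET"): "sang", ("sing", "PSTPCP"): "sung"}

def generate(lexical):
    """Map a lexical form such as 'sing V PRET' to its surface form."""
    stem, *tags = lexical.split()
    if not tags or tags[0] != "V":
        raise ValueError("expected a verb tag after the stem")
    if len(tags) == 1:
        return stem
    inflection = tags[1]
    default = stem + SUFFIXES[inflection]
    return IRREGULAR.get((stem, inflection), default)

print(generate("test V PROG"))  # testing
print(generate("sing V PRET"))  # sang
```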
19Useful FST Operations
- Inversion: switch input and output labels.
- e.g. Σ(T) = {a:b, c:d} → Σ(inv(T)) = {b:a, d:c}
- Intersection: only sequences of pairs accepted by
both transducer T1 and transducer T2 are accepted
by transducer T1 ∩ T2.
- Composition: the output of transducer T1 serves
as input to T2. This is marked as T1 ∘ T2 or
T2(T1).
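The three operations can be illustrated on the relation a transducer denotes, here simplified to a finite set of (input, output) string pairs. Treating a transducer as such a finite set is an assumption of this sketch, and the function names are chosen for illustration.

```python
def invert(relation):
    """Switch input and output of every pair."""
    return {(out, inp) for inp, out in relation}

def intersect(rel1, rel2):
    """Keep only the pairs accepted by both relations."""
    return rel1 & rel2

def compose(rel1, rel2):
    """Feed the output of rel1 into rel2."""
    return {(a, c) for a, b in rel1 for b2, c in rel2 if b == b2}

T1 = {("a", "b"), ("c", "d")}
T2 = {("b", "x"), ("c", "d")}

print(sorted(invert(T1)))         # [('b', 'a'), ('d', 'c')]
print(sorted(intersect(T1, T2)))  # [('c', 'd')]
print(sorted(compose(T1, T2)))    # [('a', 'x')]
```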
20Spelling Rules and FSTs
Name                Description of Rule                                 Example
Consonant doubling  1-letter consonant doubled before -ing/-ed          beg/begging
E deletion          Silent e dropped before -ing and -ed                make/making
E insertion         e added after -s, -z, -x, -ch, -sh before -s        watch/watches
Y replacement       -y changes to -ie before -s, and to -i before -ed   try/tries
K insertion         verbs ending with vowel + -c add -k                 panic/panicked
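The rules in the table can be approximated with regular expressions. This is a rough sketch, not a finite-state implementation: the function name `attach` is an assumption, the consonant-doubling test only covers one-syllable consonant-vowel-consonant stems, and irregular verbs are not handled.

```python
import re

def attach(stem, suffix):
    """Join a verb stem with s, ed or ing, applying the spelling rules."""
    if suffix in ("ing", "ed"):
        if re.search(r"[aeiou]c$", stem):       # K insertion
            return stem + "k" + suffix
        if re.search(r"[^aeiou]e$", stem):      # E deletion
            return stem[:-1] + suffix
        # Consonant doubling (rough one-syllable approximation)
        if re.fullmatch(r"[^aeiou]*[aeiou][^aeiouwxy]", stem):
            return stem + stem[-1] + suffix
    if suffix == "ed" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ied"                # Y replacement
    if suffix == "s":
        if re.search(r"[^aeiou]y$", stem):      # Y replacement
            return stem[:-1] + "ies"
        if re.search(r"(s|z|x|ch|sh)$", stem):  # E insertion
            return stem + "es"
    return stem + suffix

print(attach("beg", "ing"))   # begging
print(attach("watch", "s"))   # watches
print(attach("panic", "ed"))  # panicked
```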
21Three levels
- Add an intermediate level between the lexical and
surface levels
Lexical:       k i s s V 3SG
Intermediate:  k i s s ^ s #
Surface:       k i s s e s
22FST for the E-insertion Rule
[Figure: transducer for the E-insertion rule with
states q0 to q5; arc labels are z, s, x; s; ^:ε;
ε:e; # and other.]
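The effect of the E-insertion rule on an intermediate form can be sketched with a regular-expression rewrite instead of the state machine in the figure. The notation is an assumption of this sketch: ^ marks a morpheme boundary and # marks the end of the word, as in forms like kiss^s#.

```python
import re

def e_insertion(intermediate):
    """Map an intermediate form such as 'kiss^s#' to its surface form."""
    # Insert e where a boundary follows s, z or x and precedes a final s.
    with_e = re.sub(r"([szx])\^(?=s#)", r"\1e", intermediate)
    # All remaining boundary and end-of-word markers are deleted.
    return with_e.replace("^", "").replace("#", "")

print(e_insertion("kiss^s#"))  # kisses
print(e_insertion("cat^s#"))   # cats
```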
23Combination of FSTs (1)
[Figure: the Lexicon-FST above the rule transducers
Rule1-FST ... RuleN-FST.]
24Combination of FSTs (2)
[Figure: the Lexicon-FST produces the intermediate
level (k i s s ^ s #); the rule transducers
Rule1-FST ... RuleN-FST are intersected into a
single transducer.]
25Combination of FSTs (3)
[Figure: the Lexicon-FST is composed with the
intersection of Rule1-FST ... RuleN-FST, with the
intermediate level (k i s s ^ s #) between them.]
26Intersection and Composition
- For each state qi in transducer T1 and state qj
in transducer T2, create a new state qij.
- Intersection: for any pair a:b, if T1 transitions
from qi to qn, and T2 transitions from qj to qm,
T1 ∩ T2 transitions from qij to qnm.
- Composition: if T1 transitions from qi to qn with
the pair a:b, and T2 transitions from qj to qm
with the pair b:c, then T1 ∘ T2 transitions from
qij to qnm with the pair a:c.
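The state-pair construction can be sketched directly on arc sets. Each arc is represented here as a tuple (source, input, output, target), which is a representation assumed for this sketch. Start and final states are left out, and pairs with an empty symbol are not treated specially.

```python
def intersect(arcs1, arcs2):
    """Arcs of T1 ∩ T2: both machines must read the same pair a:b."""
    result = set()
    for qi, a, b, qn in arcs1:
        for qj, a2, b2, qm in arcs2:
            if (a, b) == (a2, b2):
                result.add(((qi, qj), a, b, (qn, qm)))
    return result

def compose(arcs1, arcs2):
    """Arcs of T1 ∘ T2: the output of T1 must match the input of T2."""
    result = set()
    for qi, a, b, qn in arcs1:
        for qj, b2, c, qm in arcs2:
            if b == b2:
                result.add(((qi, qj), a, c, (qn, qm)))
    return result

T1 = {("q0", "a", "b", "q1")}
T2 = {("p0", "b", "c", "p1")}
print(compose(T1, T2))  # {(('q0', 'p0'), 'a', 'c', ('q1', 'p1'))}
```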
27Lexicon-Free FSTs
- Used in information retrieval
- E.g. the Porter algorithm, which is based on a
series of simple cascaded rewrite rules:
- ATIONAL → ATE (relational → relate)
- ING → ε if stem contains vowel (motoring → motor)
- Errors occur
- organization → organ, doing → doe, university →
universe
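The two rewrite rules quoted above can be sketched as follows. This covers only those two rules and omits the other steps and stem conditions of the full Porter algorithm, so it will not reproduce the stemmer's output in general. The function names are assumptions.

```python
import re

def contains_vowel(stem):
    """Condition used by the ING rule."""
    return re.search(r"[aeiou]", stem) is not None

def apply_rules(word):
    """Apply ATIONAL -> ATE, then ING -> (empty) if the stem has a vowel."""
    if word.endswith("ational"):
        return word[:-len("ational")] + "ate"
    if word.endswith("ing") and contains_vowel(word[:-len("ing")]):
        return word[:-len("ing")]
    return word

print(apply_rules("relational"))  # relate
print(apply_rules("motoring"))    # motor
print(apply_rules("sing"))        # sing
```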
28Human Morphological Processing (1)
- How are multi-morphemic words represented in the
minds of human speakers?
- full-listing hypothesis vs. minimum redundancy
hypothesis
- Experiments
- Stanners et al. 1979: a word is recognized faster
if it has been seen before (priming): lifting →
lift, burned → burn, but selective ↛ select, i.e.
different representations for inflection and
derivation.
- Marslen-Wilson et al. 1994: spoken derived words
can prime their stems, but only if their meaning
is close: government → govern, but department ↛
depart
29Human Morphological Processing (2)
- Speech errors: speakers mix up the order of
words...
- e.g. if you break it, it'll drop
- ... and also attach affixes to the wrong stems
- e.g. it's not only we who have screw looses (for
screws loose)
- e.g. easy enoughly (for easily enough)
30Exercise (1/3)
- Your task is to create a finite-state transducer
that can analyze the following Finnish word
forms:
Surface form    Lexical form
talo            talo NOM
taloon          talo ILL
talomme         talo NOM POS1PL
taloomme        talo ILL POS1PL
metsä           metsä NOM
metsään         metsä ILL
metsämme        metsä NOM POS1PL
metsäämme       metsä ILL POS1PL
31Exercise (2/3)
- The morphological tags have the following
meaning: NOM = nominative, ILL = illative,
POS1PL = possessive, 1st person plural.
- Take a look at Fig. 3.16, 3.17 and 3.18 in
Jurafsky & Martin. Create three separate
finite-state transducers that you finally combine
into one:
- a) Create a transducer that operates between the
intermediate and surface level. This transducer
handles the vowel lengthening that is necessary
for the illative form: talo ILL → taloon vs.
metsä ILL → metsään.
32Exercise (3/3)
- b) Create a transducer that operates between the
intermediate and surface level. This transducer
handles the deletion of n in front of a
possessive ending: talo + mme → talomme
vs. taloon + mme → taloomme.
- c) Create a transducer that operates between the
lexical and the intermediate level. This
transducer maps morphological tags onto endings.
- d) Combine all the transducers into one.
- Present your transducers as graphs or tables (cf.
Fig. 3.15 in Jurafsky & Martin)