Title: Morphology
1 Morphology
- See:
- Harald Trost, Morphology. Chapter 2 of R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford: OUP (2004)
- D. Jurafsky and J. H. Martin, Speech and Language Processing. Upper Saddle River, NJ: Prentice Hall (2000), Chapter 3 (quite technical)
2 Morphology - reminder
- Internal analysis of word forms
- morpheme; allomorphic variation
- Words usually consist of a root plus affix(es), though some words can have multiple roots, and some can be single morphemes
- lexeme: abstract notion of a group of word forms that belong together
- lexeme vs. root, stem, base form, dictionary (citation) form
3 Role of morphology
- Commonly made distinction: inflectional vs derivational
- Inflectional morphology is grammatical
- number, tense, case, gender
- Derivational morphology concerns word building
- part-of-speech derivation
- words with related meaning
4 Inflectional morphology
- Grammatical in nature
- Does not carry meaning, other than grammatical meaning
- Highly systematic, though there may be irregularities and exceptions
- Simplifies the lexicon: only exceptions need to be listed
- Unknown words may be guessable
- Language-specific and sometimes idiosyncratic
- (Mostly) helpful in parsing
5 Derivational morphology
- Lexical in nature
- Can carry meaning
- Fairly systematic, and predictable up to a point
- Simplifies description of the lexicon: regularly derived words need not be listed
- Unknown words may be guessable
- But:
- Apparent derivations have specialised meaning
- Some derivations are missing
- Languages often have parallel derivations which may be translatable
6 Morphological processes
- Affixes: prefix, suffix, infix, circumfix
- Vowel change (umlaut, ablaut)
- Gemination, (partial) reduplication
- Root and pattern
- Stress (or tone) change
- Sandhi
7 Morphophonemics
- Morphemes and allomorphs
- e.g. plural: (e)s, vowel change, y→ies, f→ves, um→a, ∅, ...
- Morphophonemic variation
- Affixes and stems may have variants which are conditioned by context
- e.g. -ing in lifting, swimming, boxing, raining, hoping, hopping
- Rules may be generalisable across morphemes
- e.g. (e)s in cats, boxes, tomatoes, matches, dishes, buses
- Applies to both plural (nouns) and 3rd singular present (verbs); see the sketch below
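A minimal Python sketch of how such a generalisation might be coded; the sibilant list and the y→ies branch are illustrative assumptions, and irregular cases such as tomatoes are not handled:

    # Choose the (e)s allomorph for noun plurals and 3rd singular present
    # verbs. The suffix conditions are assumptions made for this sketch.
    def add_es(stem):
        if stem.endswith(('s', 'z', 'x', 'ch', 'sh')):          # boxes, matches, dishes, buses
            return stem + 'es'
        if stem.endswith('y') and stem[-2:-1] not in 'aeiou':    # spy -> spies
            return stem[:-1] + 'ies'
        return stem + 's'                                        # cats

    for w in ('cat', 'box', 'match', 'dish', 'bus', 'spy', 'toy'):
        print(w, '->', add_es(w))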
8 Morphology in NLP
- Analysis vs synthesis
- what does dogs mean? vs what is the plural of dog?
- Analysis
- Need to identify the lexeme
- Tokenization
- To access lexical information
- Inflections (etc.) carry information that will be needed by other processes (e.g. agreement is useful in parsing; inflections can carry meaning, e.g. tense, number)
- Morphology can be ambiguous
- May need another process to disambiguate (e.g. German -en)
- Synthesis
- Need to generate appropriate inflections from an underlying representation
9 Morphology in NLP
- String-handling programs can be written
- More general approach:
- a formalism to write rules which express the correspondence between surface and underlying form (e.g. dogs ↔ dog + plur)
- a computational algorithm (program) which can apply those rules to actual instances
- Especially of interest if the rules (though not the program) are independent of direction, analysis or synthesis; see the sketch below
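A toy Python sketch of direction-independent rules: one table pairs underlying endings with surface endings and is read in either direction. The rule notation (+PLUR, the tuple format) is invented for the example, not a real formalism:

    # One rule table, applied in either direction.
    RULES = [('y+PLUR', 'ies'),   # spy+PLUR <-> spies
             ('x+PLUR', 'xes'),   # fox+PLUR <-> foxes
             ('+PLUR',  's')]     # dog+PLUR <-> dogs

    def apply_rules(form, source, target):
        for rule in RULES:
            old, new = rule[source], rule[target]
            if form.endswith(old):
                return form[:-len(old)] + new
        return form

    def analyse(surface):          # dogs -> dog+PLUR
        return apply_rules(surface, 1, 0)

    def synthesise(underlying):    # dog+PLUR -> dogs
        return apply_rules(underlying, 0, 1)

    print(analyse('foxes'), synthesise('spy+PLUR'))   # fox+PLUR spies

The program differs for the two directions, but the rule table itself does not, which is the point being made above.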
10 Role of the lexicon in morphology
- Rules interact with the lexicon
- Obviously, category information
- e.g. rules that apply to nouns
- Note also morphology-related subcategories
- e.g. -er verbs in French, rules for gender agreement
- Other lexical information can impact on morphology
- e.g. all fish have two forms of the plural (-s and ∅)
- in Slavic languages, case inflections differ for inanimate and animate nouns
11 Problems with rules
- Exceptions have to be covered
- Including systematic irregularities
- There may be a trade-off between treating something as a small group of irregularities or as a list of unrelated exceptions (e.g. French irregular verbs, English f→ves)
- Rules must not over- or under-generate
- Must cover all and only the correct cases
- May depend on what order the rules are applied in
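A small Python sketch of the division of labour: exceptions are listed in the lexicon and checked first, so the regular rule stays simple (the word lists are illustrative assumptions):

    # Exceptions listed in the lexicon take priority over the regular rule.
    IRREGULAR_PLURALS = {'goose': 'geese', 'sheep': 'sheep', 'mouse': 'mice'}

    def pluralise(noun):
        if noun in IRREGULAR_PLURALS:                  # exception: look it up
            return IRREGULAR_PLURALS[noun]
        if noun.endswith(('s', 'x', 'z', 'ch', 'sh')):
            return noun + 'es'                         # regular sub-rule: boxes, buses
        return noun + 's'                              # default rule: dogs

    print([pluralise(n) for n in ('dog', 'box', 'goose', 'sheep')])
    # ['dogs', 'boxes', 'geese', 'sheep']

Note that the order matters: if the default rule were applied first, the exceptions would never be reached.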
12 Tokenization
- The simplest form of analysis is to reduce different word forms to tokens
- Also called normalization
- For example, if you want to count how many times a given word occurs in a text (see the sketch below)
- Or you want to search for texts containing certain words (e.g. Google)
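A minimal Python sketch of this kind of normalization for counting; the lower-casing and the regex tokenizer are assumptions made for the example:

    # Count word frequencies after simple normalization (lower-casing,
    # regex tokenization).
    import re
    from collections import Counter

    text = "The dog barked. The dogs bark, and a dog barks."
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    print(counts['dog'], counts['dogs'])   # 2 1

Without further analysis, dog and dogs remain separate counts, which is what stemming (below) addresses.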
13 Morphological processing
- Stemming
- String-handling approaches
- Regular expressions
- Mapping onto finite-state automata
- 2-level morphology
- Mapping between surface form and lexical representation
14 Stemming
- Stemming is the particular case of tokenization which reduces inflected forms to a single base form or stem
- (Recall our discussion of stem, base form, dictionary form, citation form)
- Stemming algorithms are basic string-handling algorithms, which depend on rules that identify affixes that can be stripped; see the sketch below
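A crude suffix-stripping sketch in Python; the suffix list and length check are illustrative assumptions (a real system would use something like the Porter stemmer, available for instance as NLTK's PorterStemmer):

    # Strip a small set of suffixes, keeping a minimum stem length.
    SUFFIXES = ['ies', 'ing', 'es', 'ed', 's']

    def stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    print([stem(w) for w in ['dogs', 'boxes', 'lifting', 'hoped']])
    # ['dog', 'box', 'lift', 'hop']

As the last example shows (hoped → hop), naive stripping can over- or under-stem, which is why rule order and exceptions matter.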
15 Finite state automata
- A finite state automaton is a simple and intuitive formalism with straightforward computational properties (so easy to implement)
- A bit like a flow chart, but can be used for both recognition (analysis) and generation
- FSAs have a close relationship with regular expressions, a formalism for expressing strings, mainly used for searching texts or stipulating patterns of strings
16 Finite state automata
- A bit like a flow chart, but can be used for both recognition and generation
- Transition network
- Unique start point
- Series of states linked by transitions
- Transitions represent input to be accounted for, or output to be generated
- Legal exit-point(s) explicitly identified
17 Example (Jurafsky & Martin, Figure 2.10)
- The loop on q3 means that it can account for strings of unbounded length
- Deterministic: in any state, its behaviour is fully predictable
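A Python sketch of a deterministic FSA as a transition table. The language used here, baa+! ("sheeptalk" as in Jurafsky & Martin), is an assumption based on the loop on q3; if the figure shows a different automaton, only the table would change:

    # Deterministic FSA for the (assumed) language baa+!
    TRANSITIONS = {('q0', 'b'): 'q1',
                   ('q1', 'a'): 'q2',
                   ('q2', 'a'): 'q3',
                   ('q3', 'a'): 'q3',   # the loop: any further number of a's
                   ('q3', '!'): 'q4'}
    FINAL = {'q4'}

    def accepts(string):
        state = 'q0'
        for symbol in string:
            state = TRANSITIONS.get((state, symbol))
            if state is None:           # no transition defined: reject
                return False
        return state in FINAL

    print(accepts('baa!'), accepts('baaaa!'), accepts('ba!'))   # True True False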
18 Non-deterministic FSA (Jurafsky & Martin, Figure 2.18)
- At state q2, with input a, there is a choice of transitions
- We can also have jump arcs (or empty transitions), which also introduce non-determinism
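One standard way to handle such choices computationally is to track the set of states reachable so far rather than backtracking; a sketch, with a made-up transition table (not the figure's automaton):

    # Non-deterministic FSA simulated by keeping a set of current states.
    NFA = {('q0', 'b'): {'q1'},
           ('q1', 'a'): {'q2'},
           ('q2', 'a'): {'q2', 'q3'},   # the non-deterministic choice on a
           ('q3', '!'): {'q4'}}
    FINAL_STATES = {'q4'}

    def nfa_accepts(string):
        states = {'q0'}
        for symbol in string:
            states = set().union(*(NFA.get((s, symbol), set()) for s in states))
        return bool(states & FINAL_STATES)

    print(nfa_accepts('baaa!'), nfa_accepts('ba!'))   # True False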
19 An FSA to handle morphology
Spot the deliberate mistake: overgeneration
20 Finite State Transducers
- A transducer defines a relationship (a mapping) between two things
- Typically used for two-level morphology, but can be used for other things
- Like an FSA, but each state transition stipulates a pair of symbols, and thus a mapping
21 Finite State Transducers
- Three functions:
- Recognizer (verification): takes a pair of strings and verifies whether the FST is able to map them onto each other
- Generator (synthesis): can generate a legal pair of strings
- Translator (transduction): given one string, can generate the corresponding string
- Mapping is usually between levels of representation
- spy+s ↔ spies
- Lexical ↔ intermediate: fox N P s ↔ foxs
- Intermediate ↔ surface: foxs ↔ foxes (a minimal sketch follows below)
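A minimal Python sketch of an FST as a list of arcs carrying input:output pairs; the network below is invented for the example and only covers spy+s → spies and toy+s → toys:

    # Each arc is (state, input symbol, output symbol, next state);
    # '' is the empty string on the output side.
    ARCS = [(0, 's', 's', 1), (1, 'p', 'p', 2), (2, 'y', 'i', 3),
            (3, '+', 'e', 4), (4, 's', 's', 5),
            (0, 't', 't', 6), (6, 'o', 'o', 7), (7, 'y', 'y', 8),
            (8, '+', '',  9), (9, 's', 's', 5)]
    FINAL = {5}

    def translate(lexical):
        """Transduction: all surface strings the FST maps the input onto."""
        results = []
        def step(state, i, output):
            if i == len(lexical) and state in FINAL:
                results.append(output)
            for src, inp, out, dst in ARCS:
                if src == state and i < len(lexical) and lexical[i] == inp:
                    step(dst, i + 1, output + out)
        step(0, 0, '')
        return results

    def recognise(lexical, surface):
        """Verification: does the FST map this pair onto each other?"""
        return surface in translate(lexical)

    print(translate('spy+s'), recognise('toy+s', 'toys'))   # ['spies'] True

Generation would simply enumerate accepting paths instead of following a given input string.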
22 Some conventions
- Transitions are marked by a colon, x:y
- A non-changing transition x:x can be shown simply as x
- Wild-cards are shown as @
- The empty string is shown as e
23 An example (based on Trost, p. 42)
[FST transition diagrams mapping lexical to surface forms: spy+s → spies, toy+s → toys, and forms showing the f:v alternation (e.g. wives)]
24 Using wild cards and loops
[The spy+s → spies and toy+s → toys transducers rewritten with wild-card and loop transitions]
Can be collapsed into a single FST
25 Another example (J&M Fig. 3.9, p. 74)
[FST for English nominal inflection at the lexical → intermediate level: regular nouns (fox, cat, dog), irregular singulars (goose, sheep, mouse), irregular plural stems (g o:e o:e s e, sheep, m o:i u:e s:c e), with arcs for N and the P(lural)/S(ingular) features, states q0-q7]
26 [Expansion of the regular-noun arc into letter-by-letter transitions for fox, cat and dog]
27 Paths through the FST (lexical ↔ intermediate)
- 0 f:f o:o x:x 1 N:e 4 P s:s 7
- 0 f:f o:o x:x 1 N:e 4 S 7
- 0 c:c a:a t:t 1 N:e 4 P s:s 7
- 0 s:s h:h e:e e:e p:p 2 N:e 5 S 7
- 0 g:g o:e o:e s:s e:e 3 N:e 5 P 7
Example mappings:
- fox N P s ↔ foxs
- fox N S ↔ fox
- cat N P s ↔ cats
- sheep N S ↔ sheep
- goose N P ↔ geese
28 Lexical → surface mapping (J&M Fig. 3.14, p. 78)
- fox N P s ↔ foxs, cat N P s ↔ cats
- e-insertion rule: ε → e / {x, s, z} __ s
29 Tracing the e-insertion transducer
- foxs → foxes: 0 f:f 0 o:o 0 x:x 1 e 2 e:e 3 s:s 4 0
- cats → cats: 0 c:c 0 a:a 0 t:t 0 e 0 s:s 0 0
[transducer diagram with states q0-q5 and arcs labelled z/s/x, e, and 'other']
(a regular-expression sketch of the rule follows below)
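The same rule can be sketched with an ordinary regular expression in Python, applied to intermediate forms (the morpheme-boundary symbol is omitted here for simplicity):

    # e-insertion: insert e after x, s or z and before a final s.
    import re

    def e_insertion(intermediate):
        return re.sub(r'([xsz])s$', r'\1es', intermediate)

    for form in ('foxs', 'cats', 'buss', 'dogs'):
        print(form, '->', e_insertion(form))
    # foxs -> foxes, cats -> cats, buss -> buses, dogs -> dogs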
30 FST
- But you don't have to draw all these FSTs
- They map neatly onto rule formalisms
- What is more, these can be generated automatically
- Therefore, a slightly different formalism
31 FST compiler
http://www.xrce.xerox.com/competencies/content-analysis/fsCompiler/fsinput.html
Input:
d o g N P .x. d o g s
c a t N P .x. c a t s
f o x N P .x. f o x e s
g o o s e N P .x. g e e s e
Compiled network:
s0: c -> s1, d -> s2, f -> s3, g -> s4.
s1: a -> s5.
s2: o -> s6.
s3: o -> s7.
s4: <o:e> -> s8.
s5: t -> s9.
s6: g -> s9.
s7: x -> s10.
s8: <o:e> -> s11.
s9: <N:s> -> s12.
s10: <N:e> -> s13.
s11: s -> s14.
s12: <P:0> -> fs15.
s13: <P:s> -> fs15.
s14: e -> s16.
fs15: (no arcs)
s16: <N:0> -> s12.
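The compiled network can be read as an ordinary transition table; a Python sketch of applying it (the table transcribes the listing above, with '' for 0, the empty string; the transduce function is our own illustration, not part of the Xerox tools):

    # Arcs are (input, output, next state); a single symbol x in the
    # listing stands for x:x, and '' stands for 0 (the empty string).
    NET = {
        's0':  [('c', 'c', 's1'), ('d', 'd', 's2'), ('f', 'f', 's3'), ('g', 'g', 's4')],
        's1':  [('a', 'a', 's5')],
        's2':  [('o', 'o', 's6')],
        's3':  [('o', 'o', 's7')],
        's4':  [('o', 'e', 's8')],
        's5':  [('t', 't', 's9')],
        's6':  [('g', 'g', 's9')],
        's7':  [('x', 'x', 's10')],
        's8':  [('o', 'e', 's11')],
        's9':  [('N', 's', 's12')],
        's10': [('N', 'e', 's13')],
        's11': [('s', 's', 's14')],
        's12': [('P', '', 'fs15')],
        's13': [('P', 's', 'fs15')],
        's14': [('e', 'e', 's16')],
        's16': [('N', '', 's12')],
    }
    FINAL = {'fs15'}

    def transduce(symbols):
        state, output = 's0', ''
        for sym in symbols:
            for inp, out, dst in NET.get(state, []):
                if inp == sym:
                    state, output = dst, output + out
                    break
            else:
                return None                 # no matching arc: reject
        return output if state in FINAL else None

    print(transduce(list('dog') + ['N', 'P']))     # dogs
    print(transduce(list('goose') + ['N', 'P']))   # geese
    print(transduce(list('fox') + ['N', 'P']))     # foxes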