Title: Morphology: Parsing Words
Lecture 3
What is morphology?
- The study of how words are composed from smaller, meaning-bearing units (morphemes)
- Stems (e.g. in children, undoubtedly)
- Affixes (prefixes, suffixes, circumfixes, infixes)
- Prefix: immaterial
- Suffix: trying
- Circumfix: gesagt
- Infix: absobloodylutely
- Concatenative vs. non-concatenative (e.g. Arabic root-and-pattern) morphological systems
Morphology Helps Define Word Classes
- AKA morphological classes, parts-of-speech
- Closed vs. open (function vs. content) class words
- Closed class: pronoun, preposition, conjunction, determiner, ...
- Open class: noun, verb, adverb, adjective, ...
(English) Inflectional Morphology
- Word stem + grammatical morpheme
- Usually produces word of same class
- Usually serves a syntactic function (e.g. agreement)
- like --> likes or liked
- bird --> birds
- Nominal morphology
- Plural forms
- -s or -es
- Irregular forms (goose/geese)
- Mass vs. count nouns (fish/fish, email or emails?)
- Possessives (cat's, cats')
- Verbal inflection
- Main verbs (sleep, like, fear) are relatively regular
- -s, -ing, -ed
- And productive: emailed, instant-messaged, faxed, homered
- But some are not regular: eat/ate/eaten, catch/caught/caught
- Primary (be, have, do) and modal verbs (can, will, must) are often irregular and not productive
- Be: am/is/are/were/was/been/being
- Irregular verbs are few (about 250) but frequently occurring
- So... English inflectional morphology is fairly easy to model... with some special cases...
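The "regular rule plus special cases" idea above can be sketched for noun plurals as follows. This is a minimal illustration, not a full model: the exception table and the function name `pluralize` are hypothetical, and spelling changes such as y --> ies are deliberately left out.

```python
# Hypothetical exception table; a real lexicon would list many more irregular nouns.
IRREGULAR_PLURALS = {"goose": "geese", "fish": "fish", "child": "children"}

# Stem endings that take -es instead of -s (sibilant-final stems).
ES_ENDINGS = ("s", "x", "z", "ch", "sh")


def pluralize(noun: str) -> str:
    """Return the plural of `noun`: irregular lookup first, then the regular rule."""
    if noun in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[noun]
    if noun.endswith(ES_ENDINGS):
        return noun + "es"
    return noun + "s"


if __name__ == "__main__":
    for word in ("bird", "watch", "goose", "fish"):
        print(word, "-->", pluralize(word))
```

The special cases live in data (the table) while the productive pattern lives in code, which is why new regular words such as email need no extra entries.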
(English) Derivational Morphology
- Word stem + grammatical morpheme
- Usually produces word of different class
- More complicated than inflectional
- E.g. verbs --> nouns
- -ize verbs --> -ation nouns
- generalize, realize --> generalization, realization
- E.g. verbs, nouns --> adjectives
- embrace, pity --> embraceable, pitiable
- care, wit --> careless, witless
- E.g. adjective --> adverb
- happy --> happily
- But rules have many exceptions
- Less productive: evidence-less, concern-less, go-able, sleep-able
- Meanings of derived terms harder to predict by rule
- clueless, careless, nerveless
Parsing
- Taking a surface input and identifying its components and underlying structure
- Morphological parsing: parsing a word into stem and affixes, identifying its parts and their relationships
- Stem and features
- goose --> goose +N +SG or goose +V
- geese --> goose +N +PL
- gooses --> goose +V +3SG
- Bracketing: indecipherable --> [in [[de [cipher]] able]]
Why parse words?
- For spell-checking
- Is muncheble a legal word?
- To identify a word's part-of-speech (pos)
- For sentence parsing, for machine translation, ...
- To identify a word's stem
- For information retrieval
- Why not just list all word forms in a lexicon?
How do people represent words?
- Hypotheses
- Full listing hypothesis: words listed
- Minimum redundancy hypothesis: morphemes listed
- Experimental evidence
- Priming experiments (Does seeing/hearing one word facilitate recognition of another?) suggest neither
- Regularly inflected forms prime their stem, but derived forms do not
- But spoken derived words can prime stems if they are semantically close (e.g. government/govern but not department/depart)
- Speech errors suggest affixes must be represented separately in the mental lexicon
- easy enoughly
What do we need to build a morphological parser?
- Lexicon: list of stems and affixes (w/ corresponding pos)
- Morphotactics of the language: model of how and which morphemes can be affixed to a stem
- Orthographic rules: spelling modifications that may occur when affixation occurs
- in --> il in context of l (in- + legal --> illegal)
Using FSAs to Represent English Plural Nouns
- English nominal inflection
[Figure: FSA with states q0, q1, q2 and arcs labeled reg-n, plural (-s), irreg-pl-n, irreg-sg-n]
- Inputs: cats, geese, goose
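A morpheme-level recognizer for this FSA might look like the sketch below. The arc topology is an assumption, since the figure's layout is not recoverable here: reg-n leads from q0 to q1, plural (-s) from q1 to q2, and both irregular classes go straight from q0 to q2, with q1 and q2 final. The sub-lexicon entries are toy examples and the names `LEXICON`, `TRANSITIONS`, and `accepts` are hypothetical.

```python
# Toy sub-lexicons built from the slide's example words (not a real lexicon).
LEXICON = {
    "reg-n": {"cat", "bird"},
    "irreg-sg-n": {"goose"},
    "irreg-pl-n": {"geese"},
    "plural": {"s"},
}

# Assumed topology: (state, arc label) -> next state.
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q1", "plural"): "q2",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
}
FINAL = {"q1", "q2"}


def classes_of(morpheme: str) -> list:
    """Return every arc label whose sub-lexicon contains `morpheme`."""
    return [label for label, forms in LEXICON.items() if morpheme in forms]


def accepts(word: str) -> bool:
    """True if some segmentation of `word` into morphemes reaches a final state."""
    def walk(state: str, rest: str) -> bool:
        if not rest:
            return state in FINAL
        # Try every prefix of the remaining input as the next morpheme.
        for end in range(1, len(rest) + 1):
            for label in classes_of(rest[:end]):
                nxt = TRANSITIONS.get((state, label))
                if nxt is not None and walk(nxt, rest[end:]):
                    return True
        return False

    return walk("q0", word)


if __name__ == "__main__":
    for w in ("cats", "geese", "goose", "gooses"):
        print(w, accepts(w))
```

Note that this only answers yes or no: it accepts cats, geese, and goose and rejects gooses, but it does not report which morphemes it found.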
- Derivational morphology: adjective fragment
[Figure: FSA with states q3, q4, q5 and arcs labeled un-, ε, adj-root1 (twice), adj-root2, "-er, -ly, -est", and "-er, -est"]
- Adj-root1: clear, happy, real (clearly)
- Adj-root2: big, red (bigly)
FSAs can also represent the Lexicon
- Expand each non-terminal arc in the previous FSA into a sub-lexicon FSA (e.g. adj_root2: big, red) and then expand each of these stems into its letters (e.g. red --> r e d) to get a recognizer for adjectives
[Figure: letter-level FSA with states q0-q7 and arcs labeled un-, b, i, g, r, e, d, and "-er, -est"]
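The expansion step can be sketched as a small compiler from morpheme-level arcs to letter-level transitions. The topology below is an assumption rather than a copy of the figure: un- is optional before adj-root1 (via an ε arc), adj-root1 may take -er, -ly, or -est, and adj-root2 may take only -er or -est. State names and the helper names `compile_letters`, `closure`, and `accepts` are hypothetical.

```python
from collections import defaultdict
from itertools import count

SUBLEXICON = {
    "adj-root1": ["clear", "happy", "real"],
    "adj-root2": ["big", "red"],
}

# Morpheme-level arcs: (source, strings on the arc, destination). "" stands for ε.
ARCS = [
    ("q0", ["un"], "q1"),
    ("q0", [""], "q1"),
    ("q1", SUBLEXICON["adj-root1"], "q2"),
    ("q2", ["er", "ly", "est"], "q4"),
    ("q0", SUBLEXICON["adj-root2"], "q3"),
    ("q3", ["er", "est"], "q4"),
]
FINAL = {"q2", "q3", "q4"}


def compile_letters(arcs):
    """Expand each string on an arc into a chain of single-letter transitions."""
    delta = defaultdict(set)  # (state, letter) -> set of next states
    eps = defaultdict(set)    # state -> states reachable without reading input
    fresh = count()
    for src, strings, dst in arcs:
        for s in strings:
            if not s:
                eps[src].add(dst)
                continue
            state = src
            for i, ch in enumerate(s):
                # The last letter lands on the arc's destination state.
                nxt = dst if i == len(s) - 1 else f"m{next(fresh)}"
                delta[(state, ch)].add(nxt)
                state = nxt
    return delta, eps


def closure(states, eps):
    """Add every state reachable from `states` through ε arcs."""
    seen, stack = set(states), list(states)
    while stack:
        for nxt in eps.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def accepts(word, delta, eps, start="q0", final=FINAL):
    """Simulate the letter-level automaton on `word`, tracking a set of states."""
    current = closure({start}, eps)
    for ch in word:
        moved = set()
        for st in current:
            moved |= delta.get((st, ch), set())
        current = closure(moved, eps)
        if not current:
            return False
    return bool(current & final)


if __name__ == "__main__":
    delta, eps = compile_letters(ARCS)
    for w in ("unclear", "clearly", "red", "bigly", "unbig"):
        print(w, accepts(w, delta, eps))
```

Because stems that share a first letter (red, real) are not merged, the compiled machine is nondeterministic, so the simulation carries a set of states. With no orthographic rules in place it also rejects bigger, since it only knows big + er.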
But...
- Covering the whole lexicon this way will require very large FSAs with consequent search and maintenance problems
- Adding new items to the lexicon means recomputing the whole FSA
- Non-determinism
- FSAs tell us whether a word is in the language or not, but usually we want to know more
- What is the stem?
- What are the affixes and what sort are they?
- We used this information to recognize the word: can we get it back?
Parsing with Finite State Transducers
- cats --> cat +N +PL (a plural NP)
- Koskenniemi's two-level morphology
- Idea: word is a relationship between lexical level (its morphemes) and surface level (its orthography)
- Morphological parsing: find the mapping (transduction) between lexical and surface levels
Lexical c a t +N +PL
Surface c a t s
Finite State Transducers can represent this mapping
- FSTs map between one set of symbols and another using an FSA whose alphabet Σ is composed of pairs of symbols from input and output alphabets
- In general, FSTs can be used for
- Translators (Hello:Ciao)
- Parser/generators (Hello:How may I help you?)
- As well as Kimmo-style morphological parsing
- FST is a 5-tuple consisting of
- Q: set of states {q0,q1,q2,q3,q4}
- Σ: an alphabet of complex symbols, each an i/o pair s.t. i ∈ I (an input alphabet) and o ∈ O (an output alphabet), and Σ ⊆ I x O
- q0: a start state
- F: a set of final states in Q {q4}
- δ(q,i:o): a transition function mapping Q x Σ to Q
- Emphatic Sheep --> Quizzical Cow
[Figure: FST with states q0-q4 and arcs labeled b:m, a:o, a:o, a:o, !:?]
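The 5-tuple can be written down directly as data. The sketch below assumes the usual sheep-language layout for the arcs (b:m from q0, two a:o arcs in sequence, an a:o self-loop on q3, then !:? into q4); that layout and the name `transduce` are assumptions, and the table is indexed by input symbol for convenience, which works because this machine is deterministic on its input side.

```python
# (state, input symbol) -> (next state, output symbol); each entry is one i:o arc.
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),  # self-loop: any number of further a's
    ("q3", "!"): ("q4", "?"),
}
START = "q0"
FINAL = {"q4"}


def transduce(text: str):
    """Map sheep talk to cow talk; return None if the input is not accepted."""
    state, out = START, []
    for sym in text:
        step = DELTA.get((state, sym))
        if step is None:
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in FINAL else None


if __name__ == "__main__":
    print(transduce("baaa!"))  # mooo?
    print(transduce("ba!"))    # None: at least two a's are required
```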
FST for a 2-level Lexicon
[Figure: FST with states q0-q7 and arcs labeled c:c, a:a, t:t, g, e:o, e:o, s, e]
Reg-n: c a t
Irreg-pl-n: g o:e o:e s e
Irreg-sg-n: g o o s e
FST for English Nominal Inflection
[Figure: FST with states q0-q7 and arcs labeled reg-n, irreg-n-sg, irreg-n-pl, +N:ε (three times), +PL:s, +SG:- (twice), +PL:-s]
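A parser built on this kind of transducer searches for paths whose surface side spells the input and collects the lexical side along the way. The sketch below is a simplification under stated assumptions: whole stems sit on single arcs, the empty string plays the role of ε, the tags are written +N, +SG, +PL, and the arc layout and the name `parse` are hypothetical rather than read off the figure.

```python
# state -> list of (lexical side, surface side, next state); "" means ε.
ARCS = {
    "q0": [
        ("cat", "cat", "q1"),      # reg-n
        ("goose", "goose", "q2"),  # irreg-n-sg
        ("goose", "geese", "q3"),  # irreg-n-pl: lexical goose, surface geese
    ],
    "q1": [(" +N", "", "q4")],
    "q2": [(" +N", "", "q5")],
    "q3": [(" +N", "", "q6")],
    "q4": [(" +SG", "", "q7"), (" +PL", "s", "q7")],
    "q5": [(" +SG", "", "q7")],
    "q6": [(" +PL", "", "q7")],
}
FINAL = {"q7"}


def parse(surface: str, state: str = "q0") -> list:
    """Return every lexical string whose surface realization is `surface`."""
    results = []
    if state in FINAL and not surface:
        results.append("")
    for lexical, surf, nxt in ARCS.get(state, []):
        # Follow an arc only if its surface side matches the next input chunk.
        if surface.startswith(surf):
            for tail in parse(surface[len(surf):], nxt):
                results.append(lexical + tail)
    return results


if __name__ == "__main__":
    for w in ("cats", "cat", "geese", "goose", "gooses"):
        print(w, "-->", parse(w))
```

Unlike a plain recognizer, the result carries the stem and its features; an empty list means the word is not in this toy language.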
Useful Operations on Transducers
- Cascade: running 2 FSTs in sequence
- Intersection: represent the common transitions in FST1 and FST2 (ASR: finding pronunciations)
- Composition: apply FST2 transition function to result of FST1 transition function
- Inversion: exchanging the input and output alphabets (recognize and generate with same FST)
- cf. AT&T FSM Toolkit and papers by Mohri, Pereira, and Riley
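Inversion and composition can be sketched on the sheep-to-cow transducer from earlier in the lecture. This is a toy version that assumes ε-free machines that are deterministic on both sides; it is not how a toolkit implements these operations, and the function names `run`, `invert`, and `compose` are hypothetical.

```python
# (state, input) -> (next state, output); assumed layout of the sheep-to-cow FST.
SHEEP_TO_COW = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),
    ("q3", "!"): ("q4", "?"),
}


def run(delta, text, start, finals):
    """Apply a deterministic transducer; return None if the input is rejected."""
    state, out = start, []
    for sym in text:
        step = delta.get((state, sym))
        if step is None:
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in finals else None


def invert(delta):
    """Swap input and output on every arc.

    Only safe here because no state has two arcs with the same output symbol.
    """
    return {(q, o): (nxt, i) for (q, i), (nxt, o) in delta.items()}


def compose(t1, t2):
    """Product construction: pair an a:b arc of t1 with a b:c arc of t2 to get a:c."""
    return {
        ((q, p), a): ((q_next, p_next), c)
        for (q, a), (q_next, b) in t1.items()
        for (p, b2), (p_next, c) in t2.items()
        if b == b2
    }


if __name__ == "__main__":
    cow_to_sheep = invert(SHEEP_TO_COW)
    print(run(cow_to_sheep, "moo?", "q0", {"q4"}))

    # Cascade (run one, then the other) and composition give the same answer.
    round_trip = compose(SHEEP_TO_COW, cow_to_sheep)
    print(run(round_trip, "baaa!", ("q0", "q0"), {("q4", "q4")}))
```

The composed machine's states are pairs of states, one from each component, which is why its start and final states are written as tuples.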
Orthographic Rules and FSTs
- Define additional FSTs to implement rules such as consonant doubling (beg --> begging), e deletion (make --> making), e insertion (watch --> watches), etc.
Lexical f o x +N +PL
Intermediate f o x s
Surface f o x e s
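The e insertion rule can be approximated with a regular-expression rewrite standing in for the rule FST. The sketch assumes an intermediate form that marks the morpheme boundary with `^` and the word end with `#` (e.g. `fox^s#`); those markers, the set of triggering endings, and the name `e_insertion` are assumptions made for illustration.

```python
import re

# Insert e between a sibilant-final stem and a word-final -s suffix.
E_INSERTION = re.compile(r"(x|s|z|ch|sh)\^s#")


def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to its surface spelling."""
    with_e = E_INSERTION.sub(r"\1es#", intermediate)
    # Boundary symbols never appear on the surface.
    return with_e.replace("^", "").replace("#", "")


if __name__ == "__main__":
    for form in ("fox^s#", "watch^s#", "cat^s#", "fox#"):
        print(form, "-->", e_insertion(form))
```

A rule like this is applied after the lexicon transducer has produced the intermediate form, so the two steps behave like a cascade.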
Porter Stemmer
- Used for tasks in which you only care about the stem
- IR, modeling given/new distinction, topic detection, document similarity
- Rewrite rules (e.g. misunderstanding --> misunderstand --> understand --> ...)
- Not perfect... But sometimes it doesn't matter too much
- Fast and easy
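The flavor of rewrite-rule stemming can be shown with a toy suffix stripper. This is not Porter's algorithm: the rule list is a small hypothetical sample, only suffixes are handled, and a crude minimum-length check stands in for Porter's conditions on the remaining stem.

```python
# Ordered (suffix, replacement) rules; the first rule that applies wins.
RULES = [
    ("ization", "ize"),
    ("ational", "ate"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]
MIN_STEM = 3  # crude guard so short words such as "sing" are left alone


def stem(word: str) -> str:
    """Apply the first matching suffix rule that leaves a long enough stem."""
    for suffix, replacement in RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)] + replacement
    return word


if __name__ == "__main__":
    for w in ("misunderstanding", "generalization", "cats", "sing"):
        print(w, "-->", stem(w))
```

No lexicon is consulted, which is what makes this style of stemmer fast and also why it makes mistakes.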
Summing Up
- FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo's two-level morphology
- But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
- Next time
- Read Ch 4
- Read over HW1 and ask questions now