Title: Morphology: Words and their Parts
1 Morphology: Words and their Parts
Slides adapted from Jurafsky, Martin, Hirschberg, and Dorr.
2 English Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
3 Nouns and Verbs (English)
- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
4 Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't.
5 Regular and Irregular Nouns and Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese
6 Why care about morphology?
- Spelling correction: referece (a misspelling of reference)
- Morphology in machine translation
- Spanish words quiero and quieres are both related to querer 'want'
- Hyphenation algorithms: refer-ence
- Part-of-speech analysis: google, googler
- Text-to-speech: grapheme-to-phoneme conversion
- hothouse (/T/ or /D/)
- Allows us to guess at meaning
- 'Twas brillig and the slithy toves
- Muggles moogled migwiches
7 Concatenative Morphology
- Morpheme + Morpheme + Morpheme
- Stems: often called lemma, base form, root, lexeme
- hope + -ing → hoping; hop + -ing → hopping
- Affixes
- Prefixes: Antidisestablishmentarianism
- Suffixes: Antidisestablishmentarianism
- Infixes: hingi (borrow) → humingi (borrower) in Tagalog
- Circumfixes: sagen (say) → gesagt (said) in German
8 What useful information does morphology give us?
- Different things in different languages
- Spanish hablo, hablaré / English I speak, I will speak
- English book, books / Japanese hon, hon
- Languages differ in how they encode morphological information
- Isolating languages (e.g. Cantonese) have no affixes: each word usually has one morpheme
- Agglutinative languages (e.g. Finnish, Turkish) are composed of prefixes and suffixes added to a stem (like beads on a string), each feature realized by a single affix, e.g. Finnish:
9 (cont.)
- epäjärjestelmällistyttämättömyydellänsäkäänköhän
- Wonder if he can also ... with his capability of not causing things to be unsystematic
- Inflectional languages (e.g. English) merge different features into a single affix (e.g. the s in likes indicates both person and tense), and the same feature can be realized by different affixes
- Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb's arguments into the verb, e.g. Western Greenlandic:
- Aliikusersuillammassuaanerartassagaluarpaalli.
- aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
- entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
- 'However, they will say that he is a great entertainer, but ...'
- So, different languages may require very different morphological analyzers
10 What we want
- Something to automatically do the following kinds of mappings:
- Cats → cat N PL
- Cat → cat N SG
- Cities → city N PL
- Merging → merge V Present-participle
- Caught → catch V Past-participle
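A minimal sketch, in Python, of what a function doing this kind of mapping might look like. The word lists, the tag strings, and the analyze helper below are illustrative assumptions made for this sketch, not part of the slides.

```python
# Toy morphological analyzer sketch (hypothetical word lists and tag format).
IRREGULAR = {
    "caught": "catch V Past-participle",
    "geese": "goose N PL",
}
NOUN_STEMS = {"cat", "city", "fox"}

def analyze(word):
    """Map a surface form to a 'stem POS features' string, if we can."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and w[:-3] + "y" in NOUN_STEMS:
        return f"{w[:-3]}y N PL"          # cities -> city N PL
    if w.endswith("s") and w[:-1] in NOUN_STEMS:
        return f"{w[:-1]} N PL"           # cats -> cat N PL
    if w in NOUN_STEMS:
        return f"{w} N SG"                # cat -> cat N SG
    return None

print(analyze("Cats"))     # cat N PL
print(analyze("Cities"))   # city N PL
print(analyze("Caught"))   # catch V Past-participle
```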
11 Morphology Can Help Define Word Classes
- AKA morphological classes, parts-of-speech
- Closed vs. open (function vs. content) class words
- Pronoun, preposition, conjunction, determiner, ...
- Noun, verb, adverb, adjective, ...
- Identifying word classes is useful for almost any task in NLP, from translation to speech recognition to topic detection: very basic semantics
12 (English) Inflectional Morphology
- Word stem + grammatical morpheme → different forms of the same word
- Usually produces a word of the same class
- Usually serves a syntactic or grammatical function (e.g. agreement)
- like → likes or liked
- bird → birds
- Nominal morphology
- Plural forms
- -s or -es
- Irregular forms (goose/geese)
13 (cont.)
- Mass vs. count nouns (fish/fish(es), email or emails?)
- Possessives (cat's, cats')
- Verbal inflection
- Main verbs (sleep, like, fear) relatively regular
- -s, -ing, -ed
- And productive: emailed, instant-messaged, faxed, homered
- But some are not
- eat/ate/eaten, catch/caught/caught
- Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive
- Be: am/is/are/were/was/been/being
- Irregular verbs: few (about 250) but frequently occurring
14 Derivational Morphology
- Word stem + syntactic/grammatical morpheme → new words
- Usually produces a word of a different class
- Incomplete process: derivational morphs cannot be applied to just any member of a class
- Verbs → nouns
- -ize verbs → -ation nouns
- generalize, realize → generalization, realization
- synthesize, but not synthesization
15 (cont.)
- Verbs, nouns → adjectives
- embrace, pity → embraceable, pitiable
- care, wit → careless, witless
- Adjective → adverb
- happy → happily
- Process selective in unpredictable ways
- Less productive: nerveless/evidence-less, malleable/sleep-able, rar-ity/rareness
- Meanings of derived terms harder to predict by rule
- clueless, careless, nerveless, sleepless
16 Compounding
- Two base forms join to form a new word
- Bedtime, Wienerschnitzel, Rotwein
- Careful: compound or derivation?
17 Morphotactics
- What are the rules for constructing a word in a given language?
- Pseudo-intellectual vs. intellectual-pseudo
- Rational-ize vs. ize-rational
- Cretin-ous vs. cretin-ly vs. cretin-acious
18 (cont.)
- Semantics: in English, un- cannot attach to adjectives that already have a negative connotation
- Unhappy vs. unsad
- Unhealthy vs. unsick
- Unclean vs. undirty
- Phonology: in English, -er cannot attach to words of more than two syllables
- great, greater
- Happy, happier
- Competent, competenter
- Elegant, eleganter
- Unruly, ?unrulier
19 Morphological Parsing
- These regularities enable us to create software
to parse words into their component parts
20 Morphology and FSAs
- We'd like to use the machinery provided by FSAs to capture facts about morphology
- I.e., accept strings that are in the language
- And reject strings that are not
- And do it in a way that doesn't require us to, in effect, list all the words in the language
21 What do we need to build a morphological parser?
- Lexicon: a list of stems and affixes (with corresponding p.o.s.)
- Morphotactics of the language: a model of how and which morphemes can be affixed to a stem
- Orthographic rules: spelling modifications that may occur when affixation occurs
- in- → il- in the context of l (il-legal)
- Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes
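As a rough illustration of these three ingredients working together, here is a tiny generation-direction sketch in Python. The lexicon entries, the MORPHOTACTICS table, and the attach_prefix helper are made-up examples for this sketch, not the slides' actual machinery.

```python
# A toy generator combining lexicon, morphotactics, and one orthographic rule.
LEXICON = {"legal": "ADJ", "happy": "ADJ", "cat": "N"}          # stems + p.o.s.

MORPHOTACTICS = {("in-", "ADJ"), ("un-", "ADJ"), ("-s", "N")}   # what attaches to what

def attach_prefix(prefix, stem):
    pos = LEXICON.get(stem)
    if (prefix, pos) not in MORPHOTACTICS:
        raise ValueError(f"{prefix} cannot attach to {stem} ({pos})")
    base = prefix.rstrip("-")
    # Orthographic rule from the slide: in- becomes il- before l.
    if base == "in" and stem.startswith("l"):
        base = "il"
    return base + stem

print(attach_prefix("in-", "legal"))   # illegal
print(attach_prefix("un-", "happy"))   # unhappy
```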
22 Start Simple
- Regular singular nouns are ok
- Regular plural nouns have an -s on the end
- Irregulars are ok as is
23 Simple Rules
24 Now Add in the Words
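The figures for these two slides are not reproduced in this text; the sketch below simulates the sort of simple noun FSA they depict (accept regular singulars, regular plurals in -s, and irregulars as-is). The word lists are small illustrative assumptions.

```python
# Word classes standing in for the FSA's arc labels (illustrative only).
REG_NOUNS = {"cat", "dog", "fox", "table"}
IRREG_SG_NOUNS = {"goose", "mouse", "ox"}
IRREG_PL_NOUNS = {"geese", "mice", "oxen"}

def accepts(word: str) -> bool:
    """Simulate the noun FSA: a stem arc, optionally followed by a plural -s arc."""
    word = word.lower()
    if word in REG_NOUNS or word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True                               # stem arc reaches an accept state
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True                               # reg-noun arc, then the -s arc
    return False

print(accepts("cats"))    # True
print(accepts("geese"))   # True
print(accepts("gooses"))  # False: no reg-noun arc for "goose"
```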
25 (cont.)
- Derivational morphology: adjective fragment
- [FSA fragment figure: states q3, q4, q5, with arcs labeled un-, adj-root1, adj-root2, -er, -ly, -est]
- Adj-root1: clear, happi, real (clearly)
- Adj-root2: big, red (bigly)
26 Parsing/Generation vs. Recognition
- We can now run strings through these machines to recognize strings in the language
- Accept words that are ok
- Reject words that are not
- But recognition is usually not quite what we need
- Often if we find some string in the language we might like to find the structure in it (parsing)
- Or we have some structure and we want to produce a surface form (production/generation)
- Example
- From cats to cat N PL
27 Finite State Transducers
- The simple story
- Add another tape
- Add extra symbols to the transitions
- On one tape we read cats, on the other we write
cat N PL
28 Applications
- The kind of parsing we're talking about is normally called morphological analysis
- It can either be
- An important stand-alone component of an application (spelling correction, information retrieval)
- Or simply a link in a chain of processing
29 FSTs
Kimmo Koskenniemi's two-level morphology. Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography).
30 Transitions
- c:c means read a c on one tape and write a c on the other
- N:ε means read an N symbol on one tape and write nothing on the other
- PL:s means read PL and write an s
31 Typical Uses
- Typically, we'll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
- And we'll write to the second tape using the other symbols on the transitions.
- In general, FSTs can be used for
- Translators (Hello → Ciao)
- Parser/generators (Hello → How may I help you?)
- As well as Kimmo-style morphological parsing
32 Ambiguity
- Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
- It didn't matter which path was actually traversed
- In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result
33 Ambiguity
- What's the right parse (segmentation) for
- Unionizable
- Union-ize-able
- Un-ion-ize-able
- Each represents a valid path through the derivational morphology machine.
34 Ambiguity
- There are a number of ways to deal with this problem
- Simply take the first output found
- Find all the possible outputs (all paths) and return them all (without choosing)
- Bias the search so that only one or a few likely paths are explored
35 The Gory Details
- Of course, it's not as easy as
- cat N PL ↔ cats
- As we saw earlier there are geese, mice and oxen
- But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
- Cats vs Dogs
- Fox and Foxes
36 Multi-Tape Machines
- To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
- So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols
37 Generativity
- Nothing really privileged about the directions.
- We can write from one and read from the other or vice-versa.
- One way is generation, the other way is analysis
38 Multi-Level Tape Machines
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
39 Lexical to Intermediate Level
40 Intermediate to Surface
- The "add an e" rule, as in fox + -s ↔ foxes
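A rough regex rendering of this spelling rule, assuming intermediate-level markers ^ (morpheme boundary) and # (word end) in the style of the textbook's treatment; the function name and exact pattern are just a sketch.

```python
import re

def e_insertion(intermediate: str) -> str:
    """Insert an e between a stem ending in s/x/z/ch/sh and a following -s."""
    surface = re.sub(r"(s|x|z|ch|sh)\^s#", r"\1es", intermediate)
    # Everywhere else the rule does nothing: just erase the boundary markers.
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # foxes
print(e_insertion("cat^s#"))   # cats
```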
41 Foxes
42 Note
- A key feature of this machine is that it doesn't do anything to inputs to which it doesn't apply.
- Meaning that they are written out unchanged to the output tape.
43 Overall Scheme
- We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
- Lexical level to intermediate forms
- We have a larger set of machines that capture orthographic/spelling rules.
- Intermediate forms to surface forms
44 Overall Scheme
45 Cascades
- This is a scheme that we'll see again and again.
- Overall processing is divided up into distinct rewrite steps
- The output of one layer serves as the input to the next
- The intermediate tapes may or may not wind up being useful in their own right
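A minimal sketch of the cascade idea in Python: each stage is a string-to-string rewrite, and stages are simply composed so that one layer's output feeds the next. The stage names in the comment are hypothetical.

```python
def cascade(*stages):
    """Compose rewrite stages so each one's output feeds the next."""
    def run(form):
        for stage in stages:
            form = stage(form)
        return form
    return run

# e.g. generate = cascade(lexical_to_intermediate, e_insertion)
#      generate("fox+N+PL")  ->  "fox^s#"  ->  "foxes"
```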
46 Porter Stemmer (1980)
- Used for tasks in which you only care about the stem
- IR, modeling the given/new distinction, topic detection, document similarity
- Lexicon-free morphological analysis
- Cascades rewrite rules (e.g. misunderstanding → misunderstand → understand → ...)
- Easily implemented as an FST with rules, e.g.
- ATIONAL → ATE
- ING → ε
- Not perfect:
- Doing → doe
47 (cont.)
- Policy → police
- Does stemming help?
- IR: a little
- Topic detection: more
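If you want to try these rules out, NLTK ships a standard implementation of the Porter stemmer; the snippet below just runs a few of the words mentioned above through it (exact outputs depend on that implementation's variant of the rules).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["misunderstanding", "relational", "doing", "policy"]:
    print(word, "->", stemmer.stem(word))
```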
48 Summing Up
- FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo's two-level morphology
- But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
- Next time
- Read Ch. 4
- HW1 assigned; see web page http://www.cs.columbia.edu/kathy/NLP