Title: Natural Language Processing
1Natural Language Processing
2Morphology (Ch 3)
- Finite-state methods are particularly useful in
dealing with a lexicon.
- So we'll switch to talking about some facts about
words and then come back to computational methods
3English Morphology
- Morphology is the study of the ways that words
are built up from smaller meaningful units called
morphemes
- Two classes of morphemes
- Stems: the core meaning-bearing units
- Affixes: adhere to stems to change their meanings
and grammatical functions
4Examples
- Insubstantial, trying, unreadable
5English Morphology
- We can also divide morphology up into two broad
classes
- Inflectional
- Derivational
6Word Classes
- Things like nouns and verbs
- We'll go into the gory details when we cover POS
tagging
- Relevant now: how stems and affixes combine
depends on the word class of the stem
7Inflectional Morphology
- Inflectional morphology concerns the combination
of stems and affixes where the resulting word
- Has the same word class as the original
- Serves a grammatical purpose different from the
original (agreement, tense)
- bird → birds
- like → likes or liked
8Nouns and Verbs (English)
- Nouns are simple
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
9Regulars and Irregulars
- Some words misbehave
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to
refer to words that follow the rules and those
that don't.
- (Different meaning than regular languages!)
10Regular and Irregular Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
11Regular Verbs
- If you know a regular verb stem, you can predict
the other forms by adding a predictable ending
and making regular spelling changes (details in
the chapter)
- The regular class is productive: it includes new
verbs.
- Emailed, instant-messaged, faxed, googled, ...
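The predictability of regular verbs can be illustrated with a minimal sketch. The function name `regular_verb_forms` and the particular spelling rules chosen (e-insertion after sibilants, y to ie after a consonant, dropping a final e before -ing) are assumptions for illustration, not the chapter's full rule set; consonant doubling and stems ending in -ee, among other cases, are deliberately left out.

```python
VOWELS = "aeiou"

def ends_in_consonant_y(stem):
    """True for stems like 'try' (consonant + y), False for 'play'."""
    return len(stem) > 1 and stem.endswith("y") and stem[-2] not in VOWELS

def regular_verb_forms(stem):
    """Return (base, -s form, -ing form, past, past participle) for a regular stem."""
    # -s form: insert e after sibilants, change y to ie after a consonant
    if stem.endswith(("s", "x", "z", "ch", "sh")):
        third = stem + "es"
    elif ends_in_consonant_y(stem):
        third = stem[:-1] + "ies"
    else:
        third = stem + "s"

    # -ing and -ed forms: drop a final e, change y to i before -ed
    if stem.endswith("e"):
        ing, past = stem[:-1] + "ing", stem + "d"
    elif ends_in_consonant_y(stem):
        ing, past = stem + "ing", stem[:-1] + "ied"
    else:
        ing, past = stem + "ing", stem + "ed"

    # For regular verbs the past and the past participle coincide.
    return (stem, third, ing, past, past)

print(regular_verb_forms("walk"))
print(regular_verb_forms("google"))
```

Because the rules refer only to the shape of the stem, a newly coined verb such as "google" gets its forms without any new lexical entry, which is what productivity means here.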
12Derivational Morphology
- Quasi-systematicity
- Irregular meaning changes
- Healthful vs. Healthy
- Clue → clueless (lacking understanding)
- Art → artless (without guile, not artificial)
- Changes of word class
- Examples
- Computerize (V) → Computerization (N)
- Appoint (V) → Appointee (N)
- Computation (N) → Computational (Adj)
- But not: *eatation, *spellation, *sleepable, *scienceless
13Morphological Processing Requirements
- Lexicon
- Word repository
- Stems and affixes (with corresponding parts of
speech)
- Morphotactics
- Morpheme ordering
- Orthographic Rules
- Spelling changes due to affixation
- City + -s → cities (not citys)
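The three requirements can be sketched together in a few lines. Everything named here (`LEXICON`, `ALLOWED_SUFFIXES`, `combine`) is hypothetical and the entries are a toy sample; the only orthographic rule implemented is the y to ie change from the city example.

```python
import re

# Lexicon: stems with their parts of speech (toy sample).
LEXICON = {"city": "N", "cat": "N", "walk": "V"}

# Morphotactics: which suffixes may follow a stem of each class.
ALLOWED_SUFFIXES = {"N": {"-s"}, "V": {"-s", "-ed", "-ing"}}

def apply_orthography(stem, suffix):
    """Orthographic rule: y becomes ie before -s when a consonant precedes the y."""
    ending = suffix.lstrip("-")
    if ending == "s" and re.search(r"[^aeiou]y$", stem):
        return stem[:-1] + "ies"
    return stem + ending

def combine(stem, suffix):
    """Look the stem up, check the morphotactics, then apply the spelling rule."""
    pos = LEXICON.get(stem)
    if pos is None:
        raise KeyError(f"unknown stem: {stem}")
    if suffix not in ALLOWED_SUFFIXES[pos]:
        raise ValueError(f"{suffix} cannot attach to a {pos} stem")
    return apply_orthography(stem, suffix)

print(combine("city", "-s"))
print(combine("walk", "-ed"))
```

Keeping the three parts separate means a request such as `combine("cat", "-ing")` is rejected by the morphotactics before any spelling rule is consulted.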
14Morphotactics using FSAs
- English nominal inflection
- Inputs: cats, goose, geese
15Adding the Words
- Expand each non-terminal into each stem in its
class (reg-noun: cat, dog, ...)
- Then expand each stem to the letters it includes
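The expansion into letters can be sketched as a small non-deterministic FSA. The state names (`q0`, `q1`, `q2`), the class names other than reg-noun, and the word lists are assumptions for illustration; the layout is: regular stems lead from `q0` to `q1`, a plural s leads from `q1` to `q2`, and irregular singular and plural forms lead straight from `q0` to `q2`.

```python
from collections import defaultdict
from itertools import count

REG_NOUNS = ["cat", "dog"]
IRREG_SG_NOUNS = ["goose", "mouse"]
IRREG_PL_NOUNS = ["geese", "mice"]

def build_fsa():
    """Return (arcs, final states); arcs map (state, letter) to a set of states."""
    arcs = defaultdict(set)
    fresh = count(1)  # supplies names for the intermediate letter states

    def add_word(start, word, end):
        # Spell the word out letter by letter between `start` and `end`.
        state = start
        for ch in word[:-1]:
            nxt = f"n{next(fresh)}"
            arcs[(state, ch)].add(nxt)
            state = nxt
        arcs[(state, word[-1])].add(end)

    for word in REG_NOUNS:
        add_word("q0", word, "q1")
    for word in IRREG_SG_NOUNS + IRREG_PL_NOUNS:
        add_word("q0", word, "q2")
    arcs[("q1", "s")].add("q2")  # plural -s attaches to regular nouns only
    return arcs, {"q1", "q2"}

ARCS, FINALS = build_fsa()

def accepts(word):
    """Track all reachable states; accept if any is final at the end of the word."""
    states = {"q0"}
    for ch in word:
        states = {nxt for s in states for nxt in ARCS.get((s, ch), ())}
        if not states:
            return False
    return bool(states & FINALS)

for w in ["cats", "goose", "geese", "gooses"]:
    print(w, accepts(w))
```

The machine answers only yes or no: it accepts cats, goose and geese and rejects gooses, but says nothing about which stem or affix it found.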
16Derivational Rules
17Limitations
- FSAs can only tell us whether a word is in the
language or not, but what if we want to know
more?
- What is the stem?
- What are the affixes, and of what sort?
18Parsing/Generation vs. Recognition
- Recognition is usually not quite what we need.
- Usually if we find some string in the language we
need to find the structure in it (parsing)
- Or we have some structure and we want to produce
a surface form (production/generation)
- Examples
- From cats to cat +N +PL
- From cat +N +PL to cats
19Applications
- The kind of parsing we're talking about is
normally called morphological analysis
- It can either be
- An important stand-alone component of an
application (spelling correction, information
retrieval)
- Or simply a step in a processing sequence
20Finite State Transducers
- Basic idea
- Add another tape
- Add extra symbols to the transitions
- On one tape we read cats, on the other we write
cat +N +PL, or the other way around.
21FSTs
- Two-level morphology represents a word as a
correspondence between lexical (the morphemes)
and surface (the orthographic word) levels
- Parsing maps surface to lexical level
- Visualize an FST as a 2-tape FSA which
recognizes/generates pairs of strings
22Transitions
- c:c means read a c on one tape and write a c on
the other
- +N:ε means read a +N symbol on one tape and write
nothing on the other
- +PL:s means read +PL and write an s
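These transition labels can be tried out in a minimal sketch of a two-tape machine for the single stem cat. The arc layout, the state numbers, and the `transduce` function are assumptions for illustration; the symbols follow the pairs on this slide, with epsilon written as the empty string.

```python
EPS = ""  # epsilon: no symbol on that tape

# Each arc is (source state, lexical symbol, surface symbol, target state).
ARCS = [
    (0, "c", "c", 1),
    (1, "a", "a", 2),
    (2, "t", "t", 3),
    (3, "+N", EPS, 4),
    (4, "+PL", "s", 5),
]
FINALS = {4, 5}

def transduce(symbols, read_lexical=True):
    """Read `symbols` from one tape and return every output written on the other."""
    results = []

    def walk(state, pos, out):
        if pos == len(symbols) and state in FINALS:
            results.append(out)
        for src, lex, surf, dst in ARCS:
            if src != state:
                continue
            read, write = (lex, surf) if read_lexical else (surf, lex)
            emitted = out + [write] if write != EPS else out
            if read == EPS:
                # Nothing to read on this arc: move without consuming input.
                walk(dst, pos, emitted)
            elif pos < len(symbols) and symbols[pos] == read:
                walk(dst, pos + 1, emitted)

    walk(0, 0, [])
    return results

print(transduce(["c", "a", "t", "+N", "+PL"]))      # generation
print(transduce(list("cats"), read_lexical=False))  # parsing
```

The same arc table serves both directions; only the choice of which symbol of each pair is read and which is written changes.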
23Typical Uses
- Typically, we'll read from one tape using the
first symbol
- And write to the second tape using the other
symbol
- Closure properties of FSTs: inversion and
composition
- So, they may be used in reverse and they may be
cascaded
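Inversion and composition are easiest to see on the relation a transducer denotes. The sketch below stands in a finite set of string pairs for each machine, which is a simplification: real FSTs can denote infinite relations and are inverted and composed on their arcs. The names `invert` and `compose`, the sample pairs, and the intermediate notation (`^` for a morpheme boundary, `#` for the end of the word) are assumptions for illustration.

```python
def invert(relation):
    """Swap the two tapes: every pair (x, y) becomes (y, x)."""
    return {(y, x) for x, y in relation}

def compose(first, second):
    """Cascade: (x, z) is in the result when first maps x to y and second maps y to z."""
    return {(x, z) for x, y in first for y2, z in second if y == y2}

# Two toy relations standing in for two transducers.
LEXICAL_TO_INTERMEDIATE = {("fox +N +PL", "fox^s#"), ("cat +N +PL", "cat^s#")}
INTERMEDIATE_TO_SURFACE = {("fox^s#", "foxes"), ("cat^s#", "cats")}

lexical_to_surface = compose(LEXICAL_TO_INTERMEDIATE, INTERMEDIATE_TO_SURFACE)
surface_to_lexical = invert(lexical_to_surface)

print(sorted(lexical_to_surface))
print(sorted(surface_to_lexical))
```

Inverting the composed relation turns a generator into a parser, which is the sense in which one machine can be used in reverse.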
24Ambiguity
- Recall that in non-deterministic recognition
multiple paths through a machine may lead to an
accept state.
- Didn't matter which path was actually traversed
- In FSTs the path to an accept state does matter
since different paths represent different parses
and different outputs will result
25Ambiguity
- What's the right parse for
- Unionizable
- Union-ize-able
- Un-ion-ize-able
- Each represents a valid path through the
derivational morphology machine.
26Ambiguity
- There are a number of ways to deal with this
problem
- Simply take the first output found
- Find all the possible outputs (all paths) and
return them all (without choosing)
- Bias the search so that only one or a few likely
paths are explored
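The first two strategies can be sketched with a brute-force segmenter rather than a real derivational machine. The table `SURFACE_TO_MORPHEME` and the function `all_parses` are hypothetical; the entry "iz" is an assumed surface spelling of -ize before -able, since the written word is unionizable, not unionizeable.

```python
# Surface spellings mapped to the morphemes they realize (toy sample).
SURFACE_TO_MORPHEME = {
    "un": "un",
    "ion": "ion",
    "union": "union",
    "iz": "ize",
    "able": "able",
}

def all_parses(word):
    """Return every way to split `word` into known morphemes (all paths)."""
    if not word:
        return [[]]
    parses = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in SURFACE_TO_MORPHEME:
            for rest in all_parses(word[i:]):
                parses.append([SURFACE_TO_MORPHEME[head]] + rest)
    return parses

parses = all_parses("unionizable")
print(parses)     # strategy 2: return all outputs
print(parses[0])  # strategy 1: take the first output found
```

Taking the first output here returns un-ion-ize-able simply because shorter prefixes are tried first, which shows why an unbiased search order is a poor substitute for the third strategy.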
27Details
- Of course, it's not as easy as
- cat +N +PL <-> cats
- As we saw earlier there are geese, mice and oxen
- But there are also spelling/pronunciation changes
that go along with inflectional changes (e.g.,
plural of fox is foxes, not foxs)
28Multi-Tape Machines
- Add another tape, and use the output of one tape
machine as the input to the next
- To handle irregular spelling changes we'll add
intermediate tapes with intermediate symbols
29Multi-Level Tape Machines
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
30Lexical to Intermediate Level
31Intermediate to Surface
- The "add an e" rule, as in fox^s <-> foxes
- Figure note: epsilon (error in the text)
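The two levels can be sketched as a pair of plain functions run in sequence. This is a stand-in for the two machines, not a transducer implementation: the function names, the irregular table, the regular expression, and the boundary symbols (`^` between morphemes, `#` at the end of the word) are assumptions for illustration.

```python
import re

IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "ox": "oxen"}

def lexical_to_intermediate(lexical):
    """Map e.g. 'fox +N +PL' to 'fox^s#' and 'goose +N +PL' to 'geese#'."""
    stem, *features = lexical.split()
    if features == ["+N"]:
        return stem + "#"
    if features == ["+N", "+PL"]:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    raise ValueError(f"unsupported features: {features}")

def intermediate_to_surface(intermediate):
    """Apply the 'add an e' rule, then erase the boundary symbols."""
    # Insert e between a stem ending in s, x, z, ch or sh and the suffix s.
    with_e = re.sub(r"(s|x|z|ch|sh)\^s#", r"\1^es#", intermediate)
    return with_e.replace("^", "").replace("#", "")

def generate(lexical):
    """Cascade the two levels: lexical to intermediate to surface."""
    return intermediate_to_surface(lexical_to_intermediate(lexical))

for form in ["fox +N +PL", "cat +N +PL", "goose +N +PL"]:
    print(form, "->", lexical_to_intermediate(form), "->", generate(form))
```

Splitting the work this way keeps the irregular forms in the first stage and lets the spelling rule in the second stage stay ignorant of the lexicon.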
32Succeeds iff rule applies
- Removes all ^s from output string
- Useful if < 2 spelling rules apply
- (^ separates all affixes and stems; another NLP
process may care about mid-word morphemes)
- Suppose q0, q1, q2 were not final states
- Figure note: epsilon (error in the text)
33Overall Plan
34Summing Up
- FSTs allow us to take an input and deliver a
structure based on it
- Or take a structure and create a surface form
- Or take a structure and create another structure
35Summing Up
- In many applications it's convenient to decompose
the problem into a set of cascaded transducers
where
- The output of one feeds into the input of the
next.
- We'll see this scheme again for deeper semantic
processing.
36Summing Up
- FSTs provide a useful tool for implementing a
standard model of morphological analysis
(two-level morphology)
- Toolkits such as the AT&T FSM Toolkit are available
- Other approaches are also used, e.g., the
rule-based Porter Stemmer, and memory-based
learning