Title: Morphology: Words and their Parts
1 Morphology: Words and their Parts
Slides adapted from Jurafsky, Martin, Hirschberg, and Dorr.
2 English Morphology
- Morphology is the study of the ways that words are built up from smaller meaningful units called morphemes
- We can usefully divide morphemes into two classes:
- Stems: the core meaning-bearing units
- Affixes: bits and pieces that adhere to stems to change their meanings and grammatical functions
3 Nouns and Verbs (English)
- Nouns are simple (not really)
- Markers for plural and possessive
- Verbs are only slightly more complex
- Markers appropriate to the tense of the verb
4 Regulars and Irregulars
- OK, so it gets a little complicated by the fact that some words misbehave (refuse to follow the rules)
- Mouse/mice, goose/geese, ox/oxen
- Go/went, fly/flew
- The terms regular and irregular will be used to refer to words that follow the rules and those that don't.
5 Regular and Irregular Nouns and Verbs
- Regulars
- Walk, walks, walking, walked, walked
- Table, tables
- Irregulars
- Eat, eats, eating, ate, eaten
- Catch, catches, catching, caught, caught
- Cut, cuts, cutting, cut, cut
- Goose, geese
6 Why care about morphology?
- Spelling correction: referece (a misspelling of reference)
- Morphology in machine translation
- Spanish words quiero and quieres are both related to querer 'want'
- Hyphenation algorithms: refer-ence
- Part-of-speech analysis: google, googler
- Text-to-speech: grapheme-to-phoneme conversion
- hothouse (/T/ or /D/)
- Allows us to guess at meaning
- 'Twas brillig and the slithy toves
- Muggles moogled migwiches
7 Concatenative Morphology
- Morpheme + Morpheme + Morpheme
- Stems: often called lemma, base form, root, lexeme
- hope + -ing → hoping; hop + -ing → hopping
- Affixes
- Prefixes: Antidisestablishmentarianism
- Suffixes: Antidisestablishmentarianism
- Infixes: hingi (borrow) → humingi (borrower) in Tagalog
- Circumfixes: sagen (say) → gesagt (said) in German
8 What useful information does morphology give us?
- Different things in different languages
- Spanish hablo, hablaré / English I speak, I will speak
- English book, books / Japanese hon, hon
- Languages differ in how they encode morphological information
- Isolating languages (e.g. Cantonese) have no affixes: each word usually has one morpheme
- Agglutinative languages (e.g. Finnish, Turkish) are composed of prefixes and suffixes added to a stem (like beads on a string), each feature realized by a single affix, e.g. Finnish:
9 (cont.)
- epäjärjestelmällistyttämättömyydellänsäkäänköhän
- Wonder if he can also ... with his capability of not causing things to be unsystematic
- Inflectional languages (e.g. English) merge different features into a single affix (e.g. the s in likes indicates both person and tense), and the same feature can be realized by different affixes
- Polysynthetic languages (e.g. Inuit languages) express much of their syntax in their morphology, incorporating a verb's arguments into the verb, e.g. Western Greenlandic:
- Aliikusersuillammassuaanerartassagaluarpaalli.
- aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
- entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
- 'However, they will say that he is a great entertainer, but ...'
- So, different languages may require very different morphological analyzers
10 What we want
- Something to automatically do the following kinds of mappings:
- Cats → cat N PL
- Cat → cat N SG
- Cities → city N PL
- Merging → merge V Present-participle
- Caught → catch V Past-participle
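A minimal sketch, in Python, of what a function doing this kind of mapping might look like. The word lists, the tag strings, and the analyze helper below are illustrative assumptions made for this sketch, not part of the slides.

```python
# Toy morphological analyzer sketch (hypothetical word lists and tag format).
IRREGULAR = {
    "caught": "catch V Past-participle",
    "geese": "goose N PL",
}
NOUN_STEMS = {"cat", "city", "fox"}

def analyze(word):
    """Map a surface form to a 'stem POS features' string, if we can."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith("ies") and w[:-3] + "y" in NOUN_STEMS:
        return f"{w[:-3]}y N PL"          # cities -> city N PL
    if w.endswith("s") and w[:-1] in NOUN_STEMS:
        return f"{w[:-1]} N PL"           # cats -> cat N PL
    if w in NOUN_STEMS:
        return f"{w} N SG"                # cat -> cat N SG
    return None

print(analyze("Cats"))     # cat N PL
print(analyze("Cities"))   # city N PL
print(analyze("Caught"))   # catch V Past-participle
```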
11 Morphology Can Help Define Word Classes
- AKA morphological classes, parts-of-speech
- Closed vs. open (function vs. content) class words
- Pronoun, preposition, conjunction, determiner, ...
- Noun, verb, adverb, adjective, ...
- Identifying word classes is useful for almost any task in NLP, from translation to speech recognition to topic detection: very basic semantics
12 (English) Inflectional Morphology
- Word stem + grammatical morpheme → different forms of the same word
- Usually produces a word of the same class
- Usually serves a syntactic or grammatical function (e.g. agreement)
- like → likes or liked
- bird → birds
- Nominal morphology
- Plural forms
- -s or -es
- Irregular forms (goose/geese)
13 (cont.)
- Mass vs. count nouns (fish/fish(es), email or emails?)
- Possessives (cat's, cats')
- Verbal inflection
- Main verbs (sleep, like, fear) relatively regular
- -s, -ing, -ed
- And productive: emailed, instant-messaged, faxed, homered
- But some are not
- eat/ate/eaten, catch/caught/caught
- Primary (be, have, do) and modal verbs (can, will, must) often irregular and not productive
- Be: am/is/are/were/was/been/being
- Irregular verbs: few (about 250) but frequently occurring
14 Derivational Morphology
- Word stem + syntactic/grammatical morpheme → new words
- Usually produces a word of a different class
- Incomplete process: derivational morphs cannot be applied to just any member of a class
- Verbs → nouns
- -ize verbs → -ation nouns
- generalize, realize → generalization, realization
- synthesize, but not synthesization
15 (cont.)
- Verbs, nouns → adjectives
- embrace, pity → embraceable, pitiable
- care, wit → careless, witless
- Adjective → adverb
- happy → happily
- Process selective in unpredictable ways
- Less productive: nerveless/evidence-less, malleable/sleep-able, rar-ity/rareness
- Meanings of derived terms harder to predict by rule
- clueless, careless, nerveless, sleepless
16 Compounding
- Two base forms join to form a new word
- Bedtime, Wienerschnitzel, Rotwein
- Careful: compound or derivation?
17 Morphotactics
- What are the rules for constructing a word in a given language?
- Pseudo-intellectual vs. intellectual-pseudo
- Rational-ize vs. ize-rational
- Cretin-ous vs. cretin-ly vs. cretin-acious
18 (cont.)
- Semantics: in English, un- cannot attach to adjectives that already have a negative connotation
- Unhappy vs. unsad
- Unhealthy vs. unsick
- Unclean vs. undirty
- Phonology: in English, -er cannot attach to words of more than two syllables
- great, greater
- Happy, happier
- Competent, competenter
- Elegant, eleganter
- Unruly, ?unrulier
19 Morphological Parsing
- These regularities enable us to create software
to parse words into their component parts
20 Morphology and FSAs
- We'd like to use the machinery provided by FSAs to capture facts about morphology
- I.e., accept strings that are in the language
- And reject strings that are not
- And do it in a way that doesn't require us to, in effect, list all the words in the language
21 What do we need to build a morphological parser?
- Lexicon: a list of stems and affixes (with corresponding p.o.s.)
- Morphotactics of the language: a model of how and which morphemes can be affixed to a stem
- Orthographic rules: spelling modifications that may occur when affixation occurs
- in- → il- in the context of l (il-legal)
- Most morphological phenomena can be described with regular expressions, so finite-state techniques are often used to represent morphological processes
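As a rough illustration of these three ingredients working together, here is a tiny generation-direction sketch in Python. The lexicon entries, the MORPHOTACTICS table, and the attach_prefix helper are made-up examples for this sketch, not the slides' actual machinery.

```python
# A toy generator combining lexicon, morphotactics, and one orthographic rule.
LEXICON = {"legal": "ADJ", "happy": "ADJ", "cat": "N"}          # stems + p.o.s.

MORPHOTACTICS = {("in-", "ADJ"), ("un-", "ADJ"), ("-s", "N")}   # what attaches to what

def attach_prefix(prefix, stem):
    pos = LEXICON.get(stem)
    if (prefix, pos) not in MORPHOTACTICS:
        raise ValueError(f"{prefix} cannot attach to {stem} ({pos})")
    base = prefix.rstrip("-")
    # Orthographic rule from the slide: in- becomes il- before l.
    if base == "in" and stem.startswith("l"):
        base = "il"
    return base + stem

print(attach_prefix("in-", "legal"))   # illegal
print(attach_prefix("un-", "happy"))   # unhappy
```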
22 Start Simple
- Regular singular nouns are ok
- Regular plural nouns have an -s on the end
- Irregulars are ok as is
23 Simple Rules
24 Now Add in the Words
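The figures for these two slides are not reproduced in this text; the sketch below simulates the sort of simple noun FSA they depict (accept regular singulars, regular plurals in -s, and irregulars as-is). The word lists are small illustrative assumptions.

```python
# Word classes standing in for the FSA's arc labels (illustrative only).
REG_NOUNS = {"cat", "dog", "fox", "table"}
IRREG_SG_NOUNS = {"goose", "mouse", "ox"}
IRREG_PL_NOUNS = {"geese", "mice", "oxen"}

def accepts(word: str) -> bool:
    """Simulate the noun FSA: a stem arc, optionally followed by a plural -s arc."""
    word = word.lower()
    if word in REG_NOUNS or word in IRREG_SG_NOUNS or word in IRREG_PL_NOUNS:
        return True                               # stem arc reaches an accept state
    if word.endswith("s") and word[:-1] in REG_NOUNS:
        return True                               # reg-noun arc, then the -s arc
    return False

print(accepts("cats"))    # True
print(accepts("geese"))   # True
print(accepts("gooses"))  # False: no reg-noun arc for "goose"
```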
25 (cont.)
- Derivational morphology: adjective fragment
- [FSA fragment figure: states q3, q4, q5, with arcs labeled un-, adj-root1, adj-root2, -er, -ly, -est]
- Adj-root1: clear, happi, real (clearly)
- Adj-root2: big, red (bigly)
26 Parsing/Generation vs. Recognition
- We can now run strings through these machines to recognize strings in the language
- Accept words that are ok
- Reject words that are not
- But recognition is usually not quite what we need
- Often if we find some string in the language we might like to find the structure in it (parsing)
- Or we have some structure and we want to produce a surface form (production/generation)
- Example
- From cats to cat N PL
27 Finite State Transducers
- The simple story
- Add another tape
- Add extra symbols to the transitions
- On one tape we read cats, on the other we write
cat N PL
28 Applications
- The kind of parsing we're talking about is normally called morphological analysis
- It can either be
- An important stand-alone component of an application (spelling correction, information retrieval)
- Or simply a link in a chain of processing
29 FSTs
Kimmo Koskenniemi's two-level morphology. Idea: a word is a relationship between the lexical level (its morphemes) and the surface level (its orthography).
30 Transitions
- c:c means read a c on one tape and write a c on the other
- N:ε means read an N symbol on one tape and write nothing on the other
- PL:s means read PL and write an s
31 Typical Uses
- Typically, we'll read from one tape using the first symbol on the machine transitions (just as in a simple FSA).
- And we'll write to the second tape using the other symbols on the transitions.
- In general, FSTs can be used for
- Translators (Hello → Ciao)
- Parser/generators (Hello → How may I help you?)
- As well as Kimmo-style morphological parsing
32 Ambiguity
- Recall that in non-deterministic recognition multiple paths through a machine may lead to an accept state.
- It didn't matter which path was actually traversed
- In FSTs the path to an accept state does matter, since different paths represent different parses and different outputs will result
33 Ambiguity
- What's the right parse (segmentation) for
- Unionizable
- Union-ize-able
- Un-ion-ize-able
- Each represents a valid path through the derivational morphology machine.
34 Ambiguity
- There are a number of ways to deal with this problem
- Simply take the first output found
- Find all the possible outputs (all paths) and return them all (without choosing)
- Bias the search so that only one or a few likely paths are explored
35 The Gory Details
- Of course, it's not as easy as
- cat N PL ↔ cats
- As we saw earlier there are geese, mice and oxen
- But there are also a whole host of spelling/pronunciation changes that go along with inflectional changes
- Cats vs Dogs
- Fox and Foxes
36 Multi-Tape Machines
- To deal with this we can simply add more tapes and use the output of one tape machine as the input to the next
- So to handle irregular spelling changes we'll add intermediate tapes with intermediate symbols
37 Generativity
- Nothing really privileged about the directions.
- We can write from one and read from the other or vice-versa.
- One way is generation, the other way is analysis
38 Multi-Level Tape Machines
- We use one machine to transduce between the
lexical and the intermediate level, and another
to handle the spelling changes to the surface
tape
39 Lexical to Intermediate Level
40 Intermediate to Surface
- The "add an e" rule, as in fox + -s ↔ foxes
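A rough regex rendering of this spelling rule, assuming intermediate-level markers ^ (morpheme boundary) and # (word end) in the style of the textbook's treatment; the function name and exact pattern are just a sketch.

```python
import re

def e_insertion(intermediate: str) -> str:
    """Insert an e between a stem ending in s/x/z/ch/sh and a following -s."""
    surface = re.sub(r"(s|x|z|ch|sh)\^s#", r"\1es", intermediate)
    # Everywhere else the rule does nothing: just erase the boundary markers.
    return surface.replace("^", "").replace("#", "")

print(e_insertion("fox^s#"))   # foxes
print(e_insertion("cat^s#"))   # cats
```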
41 Foxes
42 Note
- A key feature of this machine is that it doesn't do anything to inputs to which it doesn't apply.
- Meaning that they are written out unchanged to the output tape.
43 Overall Scheme
- We now have one FST that has explicit information about the lexicon (actual words, their spelling, facts about word classes and regularity).
- Lexical level to intermediate forms
- We have a larger set of machines that capture orthographic/spelling rules.
- Intermediate forms to surface forms
44 Overall Scheme
45 Cascades
- This is a scheme that we'll see again and again.
- Overall processing is divided up into distinct rewrite steps
- The output of one layer serves as the input to the next
- The intermediate tapes may or may not wind up being useful in their own right
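A minimal sketch of the cascade idea in Python: each stage is a string-to-string rewrite, and stages are simply composed so that one layer's output feeds the next. The stage names in the comment are hypothetical.

```python
def cascade(*stages):
    """Compose rewrite stages so each one's output feeds the next."""
    def run(form):
        for stage in stages:
            form = stage(form)
        return form
    return run

# e.g. generate = cascade(lexical_to_intermediate, e_insertion)
#      generate("fox+N+PL")  ->  "fox^s#"  ->  "foxes"
```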
46 Porter Stemmer (1980)
- Used for tasks in which you only care about the stem
- IR, modeling the given/new distinction, topic detection, document similarity
- Lexicon-free morphological analysis
- Cascades rewrite rules (e.g. misunderstanding → misunderstand → understand → ...)
- Easily implemented as an FST with rules, e.g.
- ATIONAL → ATE
- ING → ε
- Not perfect:
- Doing → doe
47 (cont.)
- Policy → police
- Does stemming help?
- IR: a little
- Topic detection: more
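If you want to try these rules out, NLTK ships a standard implementation of the Porter stemmer; the snippet below just runs a few of the words mentioned above through it (exact outputs depend on that implementation's variant of the rules).

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["misunderstanding", "relational", "doing", "policy"]:
    print(word, "->", stemmer.stem(word))
```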
48 Summing Up
- FSTs provide a useful tool for implementing a standard model of morphological analysis, Kimmo's two-level morphology
- But for many tasks (e.g. IR) much simpler approaches are still widely used, e.g. the rule-based Porter Stemmer
- Next time
- Read Ch. 4
- HW1 assigned; see web page http://www.cs.columbia.edu/kathy/NLP