Morphology: Words and their Parts - PowerPoint PPT Presentation

Provided by: juliahir
1
Morphology: Words and their Parts
  • CS 4705

Slides adapted from Jurafsky & Martin, Hirschberg,
and Dorr.
2
English Morphology
  • Morphology is the study of the ways that words
    are built up from smaller meaningful units called
    morphemes
  • We can usefully divide morphemes into two classes
  • Stems: the core meaning-bearing units
  • Affixes: bits and pieces that adhere to stems to
    change their meanings and grammatical functions

3
Nouns and Verbs (English)
  • Nouns are simple (not really)
  • Markers for plural and possessive
  • Verbs are only slightly more complex
  • Markers appropriate to the tense of the verb

4
Regulars and Irregulars
  • OK, so it gets a little complicated by the fact
    that some words misbehave (refuse to follow the
    rules)
  • Mouse/mice, goose/geese, ox/oxen
  • Go/went, fly/flew
  • The terms regular and irregular will be used to
    refer to words that follow the rules and those
    that don't.

5
Regular and Irregular Nouns and Verbs
  • Regulars
  • Walk, walks, walking, walked, walked
  • Table, tables
  • Irregulars
  • Eat, eats, eating, ate, eaten
  • Catch, catches, catching, caught, caught
  • Cut, cuts, cutting, cut, cut
  • Goose, geese

6
Why care about morphology?
  • Spelling correction: referece
  • Morphology in machine translation
  • Spanish words quiero and quieres are both related
    to querer 'want'
  • Hyphenation algorithms: refer-ence
  • Part-of-speech analysis: google, googler
  • Text-to-speech: grapheme-to-phoneme conversion
  • hothouse (/T/ or /D/)
  • Allows us to guess at meaning
  • 'Twas brillig and the slithy toves
  • Muggles moogled migwiches

7
Concatenative Morphology
  • Morpheme + Morpheme + Morpheme
  • Stems: often called lemma, base form, root,
    lexeme
  • hope + ing → hoping; hop + ing → hopping
  • Affixes
  • Prefixes: Antidisestablishmentarianism (anti-, dis-)
  • Suffixes: Antidisestablishmentarianism (-ment,
    -arian, -ism)
  • Infixes: hingi (borrow) → humingi (borrower) in
    Tagalog
  • Circumfixes: sagen (say) → gesagt (said) in
    German

8
What useful information does morphology give us?
  • Different things in different languages
  • Spanish: hablo, hablaré / English: I speak, I will
    speak
  • English: book, books / Japanese: hon, hon
  • Languages differ in how they encode morphological
    information
  • Isolating languages (e.g. Cantonese) have no
    affixes; each word usually has one morpheme
  • Agglutinative languages (e.g. Finnish, Turkish)
    build words from prefixes and suffixes added to a
    stem (like beads on a string), with each feature
    realized by a single affix, e.g. Finnish:

9
  • epäjärjestelmällistyttämättömyydellänsäkäänköhän
  • 'Wonder if he can also ... with his capability of
    not causing things to be unsystematic'
  • Inflectional languages (e.g. English) merge
    different features into a single affix (e.g. -s
    in likes indicates both person and tense), and
    the same feature can be realized by different
    affixes
  • Polysynthetic languages (e.g. Inuit languages)
    express much of their syntax in their morphology,
    incorporating a verb's arguments into the verb,
    e.g. Western Greenlandic:
  • Aliikusersuillammassuaanerartassagaluarpaalli.
    aliiku-sersu-i-llammas-sua-a-nerar-ta-ssa-galuar-paal-li
    entertainment-provide-SEMITRANS-one.good.at-COP-say.that-REP-FUT-sure.but-3.PL.SUBJ/3SG.OBJ-but
    'However, they will say that he is a great
    entertainer, but ...'
  • So different languages may require very
    different morphological analyzers

10
What we want
  • Something to automatically do the following kinds
    of mappings
  • Cats → cat +N +PL
  • Cat → cat +N +SG
  • Cities → city +N +PL
  • Merging → merge +V +Present-participle
  • Caught → catch +V +Past-participle

11
Morphology Can Help Define Word Classes
  • AKA morphological classes, parts-of-speech
  • Closed vs. open (function vs. content) class
    words
  • Closed: pronoun, preposition, conjunction,
    determiner, ...
  • Open: noun, verb, adverb, adjective, ...
  • Identifying word classes is useful for almost any
    task in NLP, from translation to speech
    recognition to topic detection (very basic
    semantics)

12
(English) Inflectional Morphology
  • Word stem + grammatical morpheme → different
    forms of same word
  • Usually produces word of same class
  • Usually serves a syntactic or grammatical
    function (e.g. agreement)
  • like → likes or liked
  • bird → birds
  • Nominal morphology
  • Plural forms
  • -s or -es
  • Irregular forms (goose/geese)

13
  • Mass vs. count nouns (fish/fish(es), email or
    emails?)
  • Possessives (cat's, cats')
  • Verbal inflection
  • Main verbs (sleep, like, fear) relatively regular
  • -s, -ing, -ed
  • And productive: emailed, instant-messaged, faxed,
    homered
  • But some are not
  • eat/ate/eaten, catch/caught/caught
  • Primary (be, have, do) and modal verbs (can,
    will, must) often irregular and not productive
  • Be: am/is/are/were/was/been/being
  • Irregular verbs: few (about 250) but frequently
    occurring

14
Derivational Morphology
  • Word stem + syntactic/grammatical morpheme → new
    words
  • Usually produces word of different class
  • Incomplete process: derivational morphs cannot be
    applied to just any member of a class
  • Verbs → nouns
  • -ize verbs → -ation nouns
  • generalize, realize → generalization, realization
  • synthesize but not synthesization

15
  • Verbs, nouns → adjectives
  • embrace, pity → embraceable, pitiable
  • care, wit → careless, witless
  • Adjective → adverb
  • happy → happily
  • Process selective in unpredictable ways
  • Less productive: nerveless/evidence-less,
    malleable/sleep-able, rar-ity/rareness
  • Meanings of derived terms harder to predict by
    rule
  • clueless, careless, nerveless, sleepless

16
Compounding
  • Two base forms join to form a new word
  • Bedtime, Wienerschnitzel, Rotwein
  • Careful? Compound or derivation?

17
Morphotactics
  • What are the rules for constructing a word in a
    given language?
  • Pseudo-intellectual vs. intellectual-pseudo
  • Rational-ize vs ize-rational
  • Cretin-ous vs. cretin-ly vs. cretin-acious

18
  • Semantics: In English, un- cannot attach to
    adjectives that already have a negative
    connotation
  • Unhappy vs. unsad
  • Unhealthy vs. unsick
  • Unclean vs. undirty
  • Phonology: In English, -er cannot attach to words
    of more than two syllables
  • great, greater
  • Happy, happier
  • Competent, competenter
  • Elegant, eleganter
  • Unruly, ?unrulier

19
Morphological Parsing
  • These regularities enable us to create software
    to parse words into their component parts

20
Morphology and FSAs
  • We'd like to use the machinery provided by FSAs
    to capture facts about morphology
  • I.e., accept strings that are in the language
  • And reject strings that are not
  • And do it in a way that doesn't require us to in
    effect list all the words in the language

21
What do we need to build a morphological parser?
  • Lexicon: list of stems and affixes (w/
    corresponding p.o.s.)
  • Morphotactics of the language: model of how and
    which morphemes can be affixed to a stem
  • Orthographic rules: spelling modifications that
    may occur when affixation occurs
  • in- → il- in the context of l (in- + legal →
    illegal)
  • Most morphological phenomena can be described
    with regular expressions, so finite-state
    techniques are often used to represent
    morphological processes
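
As a rough illustration of an orthographic rule, the sketch below attaches the prefix in- and applies the in- → il- change before a stem that begins with l. The helper name attach_in is made up for this example, and the rule covers only the one context named on the slide.

```python
def attach_in(stem: str) -> str:
    """Attach the negative prefix in- to a stem (hypothetical helper)."""
    # Orthographic rule from the slide: in- becomes il- before l.
    prefix = "il" if stem.startswith("l") else "in"
    return prefix + stem


print(attach_in("legal"))   # illegal
print(attach_in("active"))  # inactive
```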

22
Start Simple
  • Regular singular nouns are ok
  • Regular plural nouns have an -s on the end
  • Irregulars are ok as is

23
Simple Rules
24
Now Add in the Words
25
  • Derivational morphology: adjective fragment
  • [Figure: FSA for an adjective fragment. Arc
    labels: un- or ε, adj-root1, adj-root2,
    {-er, -ly, -est}, {-er, -est}]
  • Adj-root1: clear, happi, real (clearly)
  • Adj-root2: big, red (but not bigly)
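
A minimal sketch of the adjective fragment follows. It assumes, as in the textbook figure this slide appears to be based on, that un- and -ly combine only with adj-root1, while adj-root2 takes only -er and -est. The function name is hypothetical, and roots are matched exactly as spelled on the slide (so happi, not happy).

```python
ADJ_ROOT1 = {"clear", "happi", "real"}   # may take un-, -er, -est, -ly
ADJ_ROOT2 = {"big", "red"}               # may take only -er, -est
SUFFIXES1 = {"", "er", "est", "ly"}
SUFFIXES2 = {"", "er", "est"}


def accepts_adjective(word: str) -> bool:
    """Check a word against the toy morphotactics root (+ suffix), optionally with un-."""
    # Each option is (remaining string, allowed roots, allowed suffixes).
    options = [(word, ADJ_ROOT1, SUFFIXES1), (word, ADJ_ROOT2, SUFFIXES2)]
    if word.startswith("un"):
        # The un- arc is assumed to lead only to adj-root1.
        options.append((word[2:], ADJ_ROOT1, SUFFIXES1))
    return any(
        rest == root + suffix
        for rest, roots, suffixes in options
        for root in roots
        for suffix in suffixes
    )


for w in ["clearly", "unhappiest", "unreal", "bigly", "unbig"]:
    print(w, accepts_adjective(w))
```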

26
Parsing/Generation vs. Recognition
  • We can now run strings through these machines to
    recognize strings in the language
  • Accept words that are ok
  • Reject words that are not
  • But recognition is usually not quite what we need
  • Often if we find some string in the language we
    might like to find the structure in it (parsing)
  • Or we have some structure and we want to produce
    a surface form (production/generation)
  • Example
  • From cats to cat +N +PL

27
Finite State Transducers
  • The simple story
  • Add another tape
  • Add extra symbols to the transitions
  • On one tape we read cats, on the other we write
    cat +N +PL

28
Applications
  • The kind of parsing we're talking about is
    normally called morphological analysis
  • It can either be
  • An important stand-alone component of an
    application (spelling correction, information
    retrieval)
  • Or simply a link in a chain of processing

29
FSTs
Kimmo Koskenniemi's two-level morphology. Idea: a
word is a relationship between the lexical level (its
morphemes) and the surface level (its orthography)
30
Transitions
  • c:c means read a c on one tape and write a c on
    the other
  • +N:ε means read a +N symbol on one tape and write
    nothing on the other
  • +PL:s means read +PL and write an s
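
The transition labels above can be written down as a small table. The sketch below is a toy transducer for the single stem cat; the state numbers and the +SG:ε arc are assumptions added for the example, and the empty string stands in for ε.

```python
EPS = ""  # stands in for epsilon (write nothing)

# (state, lexical symbol) -> (surface symbol, next state)
ARCS = {
    (0, "c"): ("c", 1),      # c:c
    (1, "a"): ("a", 2),      # a:a
    (2, "t"): ("t", 3),      # t:t
    (3, "+N"): (EPS, 4),     # +N:ε
    (4, "+SG"): (EPS, 5),    # +SG:ε (assumed)
    (4, "+PL"): ("s", 5),    # +PL:s
}
FINAL = {5}


def generate(lexical_symbols):
    """Read lexical symbols on one tape and write the surface string on the other."""
    state, output = 0, []
    for symbol in lexical_symbols:
        if (state, symbol) not in ARCS:
            return None  # no transition: the input is rejected
        surface, state = ARCS[(state, symbol)]
        output.append(surface)
    return "".join(output) if state in FINAL else None


print(generate(["c", "a", "t", "+N", "+PL"]))  # cats
print(generate(["c", "a", "t", "+N", "+SG"]))  # cat
```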

31
Typical Uses
  • Typically, we'll read from one tape using the
    first symbol on the machine transitions (just as
    in a simple FSA).
  • And we'll write to the second tape using the
    other symbols on the transitions.
  • In general, FSTs can be used for
  • Translators (Hello:Ciao)
  • Parser/generators (Hello:How may I help you?)
  • As well as Kimmo-style morphological parsing

32
Ambiguity
  • Recall that in non-deterministic recognition
    multiple paths through a machine may lead to an
    accept state.
  • Didn't matter which path was actually traversed
  • In FSTs the path to an accept state does matter,
    since different paths represent different parses
    and different outputs will result

33
Ambiguity
  • What's the right parse (segmentation) for
  • Unionizable
  • Union-ize-able
  • Un-ion-ize-able
  • Each represents a valid path through the
    derivational morphology machine.

34
Ambiguity
  • There are a number of ways to deal with this
    problem
  • Simply take the first output found
  • Find all the possible outputs (all paths) and
    return them all (without choosing)
  • Bias the search so that only one or a few likely
    paths are explored
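
To make the three strategies concrete, the sketch below enumerates every segmentation of unionizable over a small, made-up morpheme inventory. The inventory (including the surface variant iz for -ize before -able) and the "fewest morphemes" bias are assumptions chosen for illustration only.

```python
# Hypothetical morpheme inventory, given as surface strings.
MORPHEMES = {"un", "ion", "union", "iz", "able"}


def segmentations(word: str):
    """Return every way to split `word` into morphemes from the inventory."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in MORPHEMES:
            for tail in segmentations(word[i:]):
                results.append([head] + tail)
    return results


all_parses = segmentations("unionizable")
print(all_parses[0])              # strategy 1: take the first output found
print(all_parses)                 # strategy 2: return all paths
print(min(all_parses, key=len))   # strategy 3: bias, here toward fewer morphemes
```

With this inventory the first parse found is un-ion-iz-able, while the bias toward fewer morphemes picks union-iz-able.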

35
The Gory Details
  • Of course, it's not as easy as
  • cat +N +PL <-> cats
  • As we saw earlier there are geese, mice and oxen
  • But there are also a whole host of
    spelling/pronunciation changes that go along with
    inflectional changes
  • Cats vs Dogs
  • Fox and Foxes

36
Multi-Tape Machines
  • To deal with this we can simply add more tapes
    and use the output of one tape machine as the
    input to the next
  • So to handle irregular spelling changes we'll add
    intermediate tapes with intermediate symbols

37
Generativity
  • Nothing really privileged about the directions.
  • We can write from one and read from the other or
    vice-versa.
  • One way is generation, the other way is analysis

38
Multi-Level Tape Machines
  • We use one machine to transduce between the
    lexical and the intermediate level, and another
    to handle the spelling changes to the surface
    tape

39
Lexical to Intermediate Level
40
Intermediate to Surface
  • The "add an e" rule, as in fox + s <-> foxes

41
Foxes
42
Note
  • A key feature of this machine is that it doesn't
    do anything to inputs to which it doesn't apply.
  • Meaning that they are written out unchanged to
    the output tape.
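
A hedged sketch of the e-insertion step, written as a regular-expression rewrite rather than a full transducer. The boundary symbols are assumptions borrowed from common textbook notation, not from these slides: ^ marks a morpheme boundary and # marks the end of the word. The rule is also simplified to stems ending in x, s, or z.

```python
import re


def e_insertion(intermediate: str) -> str:
    """Map an intermediate form such as 'fox^s#' to its surface form."""
    # Insert e between a stem-final x, s, or z and the suffix s.
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1e", intermediate)
    # Inputs the rule does not apply to pass through unchanged,
    # apart from erasing the boundary symbols.
    return with_e.replace("^", "").replace("#", "")


print(e_insertion("fox^s#"))  # foxes
print(e_insertion("cat^s#"))  # cats
print(e_insertion("dog#"))    # dog
```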

43
Overall Scheme
  • We now have one FST that has explicit information
    about the lexicon (actual words, their spelling,
    facts about word classes and regularity).
  • Lexical level to intermediate forms
  • We have a larger set of machines that capture
    orthographic/spelling rules.
  • Intermediate forms to surface forms

44
Overall Scheme
45
Cascades
  • This is a scheme that we'll see again and again.
  • Overall processing is divided up into distinct
    rewrite steps
  • The output of one layer serves as the input to
    the next
  • The intermediate tapes may or may not wind up
    being useful in their own right
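
The cascade idea can be sketched as a list of rewrite steps where each step reads the previous tape. This is an illustration under assumptions: the step functions, the small irregular-plural table, and the ^ and # boundary symbols are all made up for the example. The function returns every tape so the intermediate forms can be inspected.

```python
import re

IRREGULAR_PLURALS = {"goose": "geese", "mouse": "mice", "ox": "oxen"}  # toy lexicon


def lexical_to_intermediate(lexical: str) -> str:
    """'fox +N +PL' -> 'fox^s#' (^ = morpheme boundary, # = word end)."""
    stem, *tags = lexical.split()
    if "+PL" in tags:
        if stem in IRREGULAR_PLURALS:
            return IRREGULAR_PLURALS[stem] + "#"
        return stem + "^s#"
    return stem + "#"


def intermediate_to_surface(intermediate: str) -> str:
    """Apply a simplified e-insertion rule, then erase boundary symbols."""
    with_e = re.sub(r"([xsz])\^(?=s#)", r"\1e", intermediate)
    return with_e.replace("^", "").replace("#", "")


def run_cascade(lexical, steps=(lexical_to_intermediate, intermediate_to_surface)):
    """Feed the output of each step to the next; return all tapes."""
    tapes = [lexical]
    for step in steps:
        tapes.append(step(tapes[-1]))
    return tapes


print(run_cascade("fox +N +PL"))    # ['fox +N +PL', 'fox^s#', 'foxes']
print(run_cascade("goose +N +PL"))  # ['goose +N +PL', 'geese#', 'geese']
```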

46
Porter Stemmer (1980)
  • Used for tasks in which you only care about the
    stem
  • IR, modeling given/new distinction, topic
    detection, document similarity
  • Lexicon-free morphological analysis
  • Cascades rewrite rules (e.g. misunderstanding →
    misunderstand → understand → ...)
  • Easily implemented as an FST with rules, e.g.
  • ATIONAL → ATE
  • ING → ε
  • Not perfect:
  • Doing → doe

47
  • Policy → police
  • Does stemming help?
  • IR: little
  • Topic detection: more
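
The flavor of lexicon-free suffix rewriting can be sketched with just the two rules quoted above. This is not the Porter algorithm: it runs a single pass instead of a cascade of steps, and the "stem must contain a vowel" check is a simplified stand-in for the conditions the real stemmer attaches to its rules. The function name is hypothetical.

```python
VOWELS = set("aeiou")

# (suffix, replacement) pairs tried in order; only the two rules from the slide.
RULES = [("ational", "ate"), ("ing", "")]


def simple_stem(word: str) -> str:
    """Apply the first matching suffix rule; no lexicon is consulted."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            # Stand-in condition: rewrite only if what remains has a vowel.
            if any(ch in VOWELS for ch in stem):
                return stem + replacement
    return word


for w in ["relational", "walking", "sing", "rational"]:
    print(w, "->", simple_stem(w))
```

Because no lexicon is involved, rules like these will sometimes strip too much or too little, which is the kind of error the slide points to.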

48
Summing Up
  • FSTs provide a useful tool for implementing a
    standard model of morphological analysis, Kimmo's
    two-level morphology
  • But for many tasks (e.g. IR) much simpler
    approaches are still widely used, e.g. the
    rule-based Porter Stemmer
  • Next time
  • Read Ch 4
  • HW1 assigned: see web page
    http://www.cs.columbia.edu/kathy/NLP