Morphology: Parsing Words - PowerPoint PPT Presentation

Slides: 26
Provided by: juliah165

Transcript and Presenter's Notes


1
Lecture 3
  • Morphology: Parsing Words

2
What is morphology?
  • The study of how words are composed from
    smaller, meaning-bearing units (morphemes)
  • Stems: child (in children), doubt (in
    undoubtedly), ...
  • Affixes (prefixes, suffixes, circumfixes,
    infixes)
  • Prefix: immaterial
  • Suffix: trying
  • Circumfix: gesagt (German)
  • Infix: absobloodylutely
  • Concatenative vs. non-concatenative (e.g. Arabic
    root-and-pattern) morphological systems

3
Morphology Helps Define Word Classes
  • AKA morphological classes, parts-of-speech
  • Closed vs. open (function vs. content) class
    words
  • Closed: pronoun, preposition, conjunction,
    determiner, ...
  • Open: noun, verb, adverb, adjective, ...

4
(English) Inflectional Morphology
  • Word stem + grammatical morpheme
  • Usually produces word of same class
  • Usually serves a syntactic function (e.g.
    agreement)
  • like → likes or liked
  • bird → birds
  • Nominal morphology
  • Plural forms
  • -s or -es
  • Irregular forms (goose/geese)
  • Mass vs. count nouns (fish/fish, email or
    emails?)
  • Possessives (cat's, cats')

5
  • Verbal inflection
  • Main verbs (sleep, like, fear) are relatively
    regular
  • -s, -ing, -ed
  • And productive: emailed, instant-messaged, faxed,
    homered
  • But some are not regular: eat/ate/eaten,
    catch/caught/caught
  • Primary (be, have, do) and modal verbs (can,
    will, must) are often irregular and not productive
  • be: am/is/are/were/was/been/being
  • Irregular verbs are few (about 250) but
    frequently occurring
  • So... English inflectional morphology is fairly
    easy to model... with some special cases
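
The "regular rule plus special cases" idea on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `inflect`, the feature labels, the two-entry `IRREGULAR` table, and the simplified spelling adjustments are all invented for this example and are not taken from the lecture.

```python
# Hypothetical sketch: regular suffixation plus a small exception table.
IRREGULAR = {
    "eat": {"past": "ate", "past_part": "eaten"},
    "catch": {"past": "caught", "past_part": "caught"},
}

SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")


def inflect(stem, feature):
    """Inflect a verb stem for '3sg', 'prog', 'past' or 'past_part'."""
    # Special cases are looked up first.
    if feature in IRREGULAR.get(stem, {}):
        return IRREGULAR[stem][feature]
    if feature == "3sg":
        return stem + ("es" if stem.endswith(SIBILANT_ENDINGS) else "s")
    # Simplified spelling rule: drop a final 'e' before -ing / -ed.
    base = stem[:-1] if stem.endswith("e") else stem
    if feature == "prog":
        return base + "ing"
    if feature in ("past", "past_part"):
        return base + "ed"
    raise ValueError(f"unknown feature: {feature}")


print(inflect("like", "past"), inflect("fax", "3sg"), inflect("eat", "past"))
```

New verbs such as email or homer need no lexicon entry here, which is the sense in which the regular pattern is productive. The spelling handling is deliberately crude and will mis-spell many real verbs.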

6
(English) Derivational Morphology
  • Word stem + grammatical morpheme
  • Usually produces word of different class
  • More complicated than inflectional
  • E.g. verbs → nouns
  • -ize verbs → -ation nouns
  • generalize, realize → generalization, realization
  • E.g. verbs, nouns → adjectives
  • embrace, pity → embraceable, pitiable
  • care, wit → careless, witless

7
  • E.g. adjective → adverb
  • happy → happily
  • But rules have many exceptions
  • Less productive: evidence-less, concern-less,
    go-able, sleep-able
  • Meanings of derived terms harder to predict by
    rule
  • clueless, careless, nerveless

8
Parsing
  • Taking a surface input and identifying its
    components and underlying structure
  • Morphological parsing: parsing a word into stem
    and affixes, identifying its parts and their
    relationships
  • Stem and features
  • goose → goose +N +SG or goose +V
  • geese → goose +N +PL
  • gooses → goose +V +3SG
  • Bracketing: indecipherable →
    [in [[de [cipher]] able]]

9
Why parse words?
  • For spell-checking
  • Is muncheble a legal word?
  • To identify a word's part-of-speech (POS)
  • For sentence parsing, for machine translation,
    ...
  • To identify a word's stem
  • For information retrieval
  • Why not just list all word forms in a lexicon?

10
How do people represent words?
  • Hypotheses
  • Full listing hypothesis: words listed
  • Minimum redundancy hypothesis: morphemes listed
  • Experimental evidence
  • Priming experiments (Does seeing/hearing one word
    facilitate recognition of another?) suggest
    neither
  • Regularly inflected forms prime stem but not
    derived forms
  • But spoken derived words can prime stems if they
    are semantically close (e.g. government/govern
    but not department/depart)

11
  • Speech errors suggest affixes must be represented
    separately in the mental lexicon
  • easy enoughly

12
What do we need to build a morphological parser?
  • Lexicon: list of stems and affixes (with
    corresponding POS)
  • Morphotactics of the language: a model of how and
    which morphemes can be affixed to a stem
  • Orthographic rules: spelling modifications that
    may occur when affixation occurs
  • in- → il- in the context of l (in- + legal →
    illegal)
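
As a small illustration of an orthographic rule, the in- → il- change can be written as a one-line spelling adjustment. The function name `attach_in` is hypothetical, and only the l-context mentioned on the slide is handled; other contexts are left out.

```python
def attach_in(stem):
    """Attach the prefix in-, changing n to l before a stem-initial l.

    Hypothetical helper: only the 'l' context from the slide is covered.
    """
    prefix = "il" if stem.startswith("l") else "in"
    return prefix + stem


print(attach_in("legal"))   # the stem starts with l, so the prefix surfaces as il-
print(attach_in("direct"))  # no l context, so the prefix stays in-
```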

13
Using FSAs to Represent English Plural Nouns
  • English nominal inflection

[Figure: FSA with states q0, q1, q2; arcs labeled reg-n, plural (-s), irreg-sg-n, irreg-pl-n]
  • Inputs: cats, geese, goose
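
A recognizer along these lines can be sketched with a transition table. The figure itself is not recoverable here, so the arc layout below is an assumption (reg-n leads from q0 to q1, the plural suffix from q1 to q2, and both irregular classes lead straight from q0 to q2, with q1 and q2 final). The toy sub-lexicons and the helper names `segmentations` and `accepts` are likewise invented for illustration.

```python
# Assumed arc layout (the slide's figure did not survive):
#   q0 --reg-n--> q1 --plural--> q2
#   q0 --irreg-sg-n--> q2,  q0 --irreg-pl-n--> q2
TRANSITIONS = {
    ("q0", "reg-n"): "q1",
    ("q1", "plural"): "q2",
    ("q0", "irreg-sg-n"): "q2",
    ("q0", "irreg-pl-n"): "q2",
}
FINAL_STATES = {"q1", "q2"}

# Toy sub-lexicons; the entries are illustrative only.
LEXICON = {
    "reg-n": {"cat", "dog", "bird"},
    "irreg-sg-n": {"goose", "mouse"},
    "irreg-pl-n": {"geese", "mice"},
}


def segmentations(word):
    """Yield candidate sequences of morpheme classes for a word."""
    for cls, stems in LEXICON.items():
        if word in stems:
            yield [cls]
        if word.endswith("s") and word[:-1] in stems:
            yield [cls, "plural"]


def accepts(word):
    """True if some segmentation drives the FSA from q0 to a final state."""
    for labels in segmentations(word):
        state = "q0"
        for label in labels:
            state = TRANSITIONS.get((state, label))
            if state is None:
                break  # no arc for this label: try the next segmentation
        else:
            if state in FINAL_STATES:
                return True
    return False


for w in ["cats", "geese", "goose", "gooses"]:
    print(w, accepts(w))
```

Under these assumptions the three inputs on the slide are accepted, while gooses is rejected as a noun because no plural arc leaves the state reached by an irregular singular.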

14
  • Derivational morphology: adjective fragment

[Figure: FSA whose states include q3, q4, q5; arcs labeled un-, ε, adj-root1 (twice), adj-root2, "-er, -ly, -est", and "-er, -est"]
  • Adj-root1: clear, happy, real (clearly)
  • Adj-root2: big, red (but not bigly)

15
FSAs can also represent the Lexicon
  • Expand each non-terminal arc in the previous FSA
    into a sub-lexicon FSA (e.g. adj-root2: big,
    red) and then expand each of these stems into
    its letters (e.g. red → r e d) to get a
    recognizer for adjectives

[Figure: letter-level FSA with states q0 through q7; arcs labeled un-, b, i, g, r, e, d, and "-er, -est"]
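
The expansion of stems into letters can be sketched by compiling a sub-lexicon into a letter-level transition table. This is a minimal sketch, not the construction used in the lecture: `build_letter_fsa` and `recognize` are hypothetical names, states are plain integers, and the stems clear and real are used instead of big and red so that plain concatenation gives correct spellings (consonant doubling is not handled here).

```python
def build_letter_fsa(stems, suffixes):
    """Compile stems and optional suffixes into a letter-level FSA.

    Returns (transitions, finals): transitions maps (state, letter) to a
    state, states are integers, and 0 is the start state.
    """
    transitions = {}
    finals = set()
    next_state = 1

    def add_path(start, letters):
        nonlocal next_state
        state = start
        for ch in letters:
            key = (state, ch)
            if key not in transitions:
                transitions[key] = next_state
                next_state += 1
            state = transitions[key]
        return state

    stem_ends = [add_path(0, stem) for stem in stems]
    finals.update(stem_ends)  # a bare stem is a word
    for end in stem_ends:
        for suffix in suffixes:
            finals.add(add_path(end, suffix))  # stem + suffix is a word
    return transitions, finals


def recognize(word, fsa):
    """Follow one letter arc per character; accept only in a final state."""
    transitions, finals = fsa
    state = 0
    for ch in word:
        state = transitions.get((state, ch))
        if state is None:
            return False
    return state in finals


fsa = build_letter_fsa(["clear", "real"], ["er", "est", "ly"])
for w in ["clearly", "realest", "clear", "clea", "clearerly"]:
    print(w, recognize(w, fsa))
```

Every new stem adds states and arcs to the table, which hints at why a whole lexicon compiled this way becomes large.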

16
But...
  • Covering the whole lexicon this way will require
    very large FSAs with consequent search and
    maintenance problems
  • Adding new items to the lexicon means recomputing
    the whole FSA
  • Non-determinism
  • FSAs tell us whether a word is in the language or
    not but usually we want to know more
  • What is the stem?
  • What are the affixes and what sort are they?
  • We used this information to recognize the word;
    can we get it back?

17
Parsing with Finite State Transducers
  • cats → cat +N +PL (a plural NP)
  • Koskenniemi's two-level morphology
  • Idea: a word is a relationship between the lexical
    level (its morphemes) and the surface level (its
    orthography)
  • Morphological parsing: find the mapping
    (transduction) between lexical and surface levels

Lexical: c a t +N +PL
Surface: c a t s

18
Finite State Transducers can represent this mapping
  • FSTs map between one set of symbols and another
    using an FSA whose alphabet Σ is composed of
    pairs of symbols from input and output alphabets
  • In general, FSTs can be used for
  • Translators (Hello:Ciao)
  • Parsers/generators (Hello:How may I help you?)
  • As well as Kimmo-style morphological parsing

19
  • An FST is a 5-tuple consisting of
  • Q: a set of states {q0, q1, q2, q3, q4}
  • Σ: an alphabet of complex symbols, each an i:o
    pair s.t. i ∈ I (an input alphabet) and o ∈ O (an
    output alphabet), with Σ ⊆ I × O
  • q0: a start state
  • F: a set of final states in Q, {q4}
  • δ(q, i:o): a transition function mapping Q × Σ to
    Q
  • Emphatic Sheep → Quizzical Cow

[Figure: FST with states q0 through q4; arcs labeled b:m, a:o (three times), !:?]
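
The sheep-to-cow transducer can be sketched as a table keyed on state and input symbol. Which of the three a:o arcs is the self-loop is an assumption here (it is placed on q3, following the usual sheep-language automaton for "baa...!"), and the function name `transduce` is invented for this example.

```python
# (state, input symbol) -> (next state, output symbol)
DELTA = {
    ("q0", "b"): ("q1", "m"),
    ("q1", "a"): ("q2", "o"),
    ("q2", "a"): ("q3", "o"),
    ("q3", "a"): ("q3", "o"),  # assumed self-loop: any number of extra a's
    ("q3", "!"): ("q4", "?"),
}
START = "q0"
FINAL = {"q4"}


def transduce(text):
    """Return the output string, or None if the input is rejected."""
    state, out = START, []
    for symbol in text:
        step = DELTA.get((state, symbol))
        if step is None:
            return None
        state, emitted = step
        out.append(emitted)
    return "".join(out) if state in FINAL else None


print(transduce("baa!"))
print(transduce("baaaa!"))
print(transduce("ba!"))
```

Because each arc carries an input:output pair, the same walk that recognizes the input also writes the output, which is the step from FSA to FST.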

20
FST for a 2-level Lexicon
  • E.g.

[Figure: FST with states q0 through q7; arcs labeled c:c, a:a, t:t, g, e:o (twice), s, e]

Reg-n    Irreg-pl-n       Irreg-sg-n
c a t    g o:e o:e s e    g o o s e

21
FST for English Nominal Inflection
[Figure: FST with states q0 through q7; stem arcs reg-n, irreg-n-sg, irreg-n-pl; feature arcs N:ε (three times), PL:s, SG:- (twice), PL:-s]
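
The lexical/surface mapping that such a transducer encodes can be sketched as a plain function rather than an explicit state machine. Everything below is an assumption for illustration: the toy lexicon, the name `parse_noun`, and the choice to return analyses as strings. Spelling changes such as e insertion are not handled.

```python
# Toy lexicon; entries are illustrative only.
REG_N = {"cat", "dog", "bird"}
IRREG_SG_N = {"goose", "mouse"}
IRREG_PL_N = {"geese": "goose", "mice": "mouse"}  # surface plural -> stem


def parse_noun(surface):
    """Return all lexical-level analyses of a surface noun form."""
    analyses = []
    if surface in REG_N or surface in IRREG_SG_N:
        analyses.append(f"{surface} +N +SG")
    if surface.endswith("s") and surface[:-1] in REG_N:
        analyses.append(f"{surface[:-1]} +N +PL")
    if surface in IRREG_PL_N:
        analyses.append(f"{IRREG_PL_N[surface]} +N +PL")
    return analyses


for w in ["cats", "geese", "goose", "gooses"]:
    print(w, parse_noun(w))
```

Unlike a recognizer, this returns the stem and features, which answers the earlier question of how to get that information back.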

22
Useful Operations on Transducers
  • Cascade: running 2 FSTs in sequence
  • Intersection: represent the common transitions in
    FST1 and FST2 (ASR: finding pronunciations)
  • Composition: apply FST2 transition function to
    result of FST1 transition function
  • Inversion: exchanging the input and output
    alphabets (recognize and generate with same FST)
  • cf. AT&T FSM Toolkit and papers by Mohri, Pereira,
    and Riley
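
Composition and inversion can be illustrated on the relations that transducers compute, leaving states aside. This is a sketch under stated assumptions: each transducer is modelled as a finite set of (input, output) string pairs, the `^` morpheme-boundary symbol in the intermediate forms is a notation chosen for this example, and the function names are hypothetical. Real toolkits operate on the automata themselves.

```python
def invert(relation):
    """Swap the input and output sides of every pair."""
    return {(o, i) for i, o in relation}


def compose(r1, r2):
    """Pairs (x, z) such that r1 maps x to some y and r2 maps that y to z."""
    return {(x, z) for x, y in r1 for y2, z in r2 if y == y2}


# Toy relations: lexical -> intermediate, then intermediate -> surface.
LEX_TO_INTERMEDIATE = {("cat +N +PL", "cat^s"), ("fox +N +PL", "fox^s")}
INTERMEDIATE_TO_SURFACE = {("cat^s", "cats"), ("fox^s", "foxes")}

GENERATE = compose(LEX_TO_INTERMEDIATE, INTERMEDIATE_TO_SURFACE)
PARSE = invert(GENERATE)

print(sorted(GENERATE))
print(sorted(PARSE))
```

Composing the two stages gives one relation that does the work of the cascade, and inverting it turns a generator into a parser.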

23
Orthographic Rules and FSTs
  • Define additional FSTs to implement rules such as
    consonant doubling (beg → begging), e deletion
    (make → making), e insertion (watch → watches),
    etc.

Lexical:      f o x +N +PL
Intermediate: f o x s
Surface:      f o x e s
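
The e-insertion step from the intermediate to the surface level can be sketched with a regular expression. Note the assumptions: the intermediate form is written here with a `^` morpheme-boundary marker, which does not appear in the slide text as extracted, and the name `apply_e_insertion` is hypothetical. Only e insertion before a final -s is covered, not consonant doubling or e deletion.

```python
import re


def apply_e_insertion(intermediate):
    """Map an intermediate form such as 'fox^s' to a surface form.

    Inserts 'e' at a '^' boundary that follows x, s, z, ch or sh and
    precedes a word-final 's', then removes any remaining boundaries.
    """
    with_e = re.sub(r"(x|s|z|ch|sh)\^(?=s$)", r"\1e", intermediate)
    return with_e.replace("^", "")


for form in ["fox^s", "watch^s", "cat^s"]:
    print(form, "->", apply_e_insertion(form))
```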

24
Porter Stemmer
  • Used for tasks in which you only care about the
    stem
  • IR, modeling given/new distinction, topic
    detection, document similarity
  • Rewrite rules (e.g. misunderstanding →
    misunderstand → understand → ...)
  • Not perfect... but sometimes it doesn't matter
    too much
  • Fast and easy
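
The flavor of suffix-stripping rewrite rules can be sketched as an ordered rule list. This is a toy in the spirit of such stemmers, not the Porter algorithm itself: the rule subset, the "stem must contain a vowel" condition, and the name `strip_suffix` are simplifications chosen for this example, and only suffixes are handled.

```python
# (suffix, replacement) pairs tried in order; a toy subset of rules.
RULES = [
    ("sses", "ss"),
    ("ies", "i"),
    ("ss", "ss"),
    ("s", ""),
    ("ing", ""),
    ("ed", ""),
]


def strip_suffix(word):
    """Apply the first rule whose suffix matches, if the stem has a vowel."""
    for suffix, replacement in RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if any(ch in "aeiou" for ch in stem):
                return stem + replacement
            return word  # stem too short to be plausible: leave unchanged
    return word


for w in ["misunderstanding", "caresses", "ponies", "sing"]:
    print(w, "->", strip_suffix(w))
```

No lexicon is consulted, which is why this style of stemmer is fast, and also why outputs such as poni need not be real words.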

25
Summing Up
  • FSTs provide a useful tool for implementing a
    standard model of morphological analysis, Kimmo's
    two-level morphology
  • But for many tasks (e.g. IR) much simpler
    approaches are still widely used, e.g. the
    rule-based Porter Stemmer
  • Next time:
  • Read Ch 4
  • Read over HW1 and ask questions now