Morphology and Sentences


Title: Morphology and Sentences


1
Morphology and Sentences
  • Vasileios Hatzivassiloglou
  • University of Texas at Dallas

2
Matching words
  • The same word can have multiple forms
  • The most common case is sentence-initial
    capitalization; we need to preserve this
    information while matching the forms
  • Important to recognize proper nouns
  • Another cause is alternative spellings
  • labour / labor, labelled / labeled
  • multiple formats for dates, phone numbers, etc.
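
The matching idea on this slide can be sketched as follows. This is only an illustration: the tables SPELLING_VARIANTS and PROPER_NOUNS and the function names are hypothetical, and a real system would need far larger resources. The original token is kept alongside the matching key, so the capitalization information is preserved.

```python
# Hypothetical lookup tables, for illustration only.
SPELLING_VARIANTS = {"labour": "labor", "labelled": "labeled"}
PROPER_NOUNS = {"Dallas", "Texas"}


def match_key(token, sentence_initial):
    """Return a key used for matching; the original token is not modified."""
    key = token
    # Undo sentence-initial capitalization unless the word is a known proper noun.
    if sentence_initial and token and token not in PROPER_NOUNS:
        key = token[0].lower() + token[1:]
    # Map alternative spellings to one chosen variant.
    return SPELLING_VARIANTS.get(key, key)


def keyed_tokens(sentence_tokens):
    """Pair each original token with its matching key."""
    return [(tok, match_key(tok, i == 0)) for i, tok in enumerate(sentence_tokens)]
```

For example, keyed_tokens(["Labelled", "boxes"]) keeps "Labelled" as the surface form but matches it under "labeled".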

3
Morphology
  • Many languages have multiple word forms related
    to a single base form (root form)
  • A lexeme is the representative base form from
    which other closely related forms are produced
  • Three productive processes
  • Inflection
  • Derivation
  • Compounding

4
Inflection
  • Addition of prefixes and suffixes that change
    meaning only slightly
  • Lexeme and syntactic category stay the same
  • Features such as number and tense are modified
  • Examples
  • -s to form plural nouns
  • -ed to form past tense

5
Derivation
  • More radical change in meaning
  • Possible change in syntactic category
  • Examples
  • -ly transforms adjectives to adverbs (wide-ly)
  • -en transforms adjectives to verbs (weak-en)
  • -able transforms verbs into adjectives
    (accept-able)
  • -er transforms verbs into nouns (teach-er)

6
Compounding
  • Combination of two words into a new word
  • Result can be written as one word, as one
    hyphenated word, or as two words, but is
    pronounced as a single word and has distinct,
    recognizable meaning
  • Most unrestricted process (noun-noun most common
    in English)
  • Examples
  • disk drive, mad cow disease

7
Irregularities in morphology
  • Compounding unpredictable
  • Exceptions in both inflectional and derivational
    morphology
  • irregular plurals and verb tenses
  • word-formation affixes do not always apply, e.g.,
    -ly does not give difficultly or oldly
  • sometimes other processes compete, e.g., hardly
    is not the adverb of hard

8
Reverse morphology engineering
  • Recall our goal to conflate similar word forms
  • Approach: recognize morphological inflection
    and/or derivation and combine the word types
  • Is it advisable to do so?

9
When to remove morphology
  • Depends on the task
  • Removing morphology loses some syntactic
    information
  • Process is inexact, so it will sometimes lose
    semantic information as well
  • Usually applied for retrieval purposes (e.g., in
    information retrieval) but not in most text
    analysis tasks
  • Regardless, it is important to recognize
    morphology

10
Stemming
  • Rule-based approach that removes common suffixes
    such as -er and -ing
  • Most commonly used algorithm is Porter's (1980)
  • Result sometimes conflates unrelated word forms
    and will not work well in morphologically rich
    languages
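
A minimal sketch of rule-based suffix stripping is given below. It is not Porter's algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix list, the min_stem threshold, and the function name are assumptions made for illustration.

```python
# Illustrative suffix list, ordered longest first. This is NOT Porter's rule set.
SUFFIXES = ("ing", "er", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix if at least min_stem letters remain."""
    lowered = word.lower()
    for suffix in SUFFIXES:
        if lowered.endswith(suffix) and len(lowered) - len(suffix) >= min_stem:
            return lowered[: -len(suffix)]
    return lowered
```

Even this toy version shows the conflation problem: naive_stem("teacher") gives "teach" as intended, but naive_stem("number") gives "numb", which merges two unrelated words.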

11
Handling compounds
  • By default, we will recognize members of many
    compounds as separate words
  • It is often beneficial to add commonly used
    compounds into the lexicon
  • This increases the lexicon size substantially,
    but also increases the accuracy of our
    representation and further processing
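
One way to use compounds stored in the lexicon is a greedy longest-match pass over the token sequence, sketched below. The COMPOUNDS set and the function name are hypothetical; the entries are the examples from the compounding slide.

```python
# Hypothetical compound lexicon: each entry is a tuple of lowercase words.
COMPOUNDS = {("disk", "drive"), ("mad", "cow", "disease")}
MAX_LEN = max(len(c) for c in COMPOUNDS)


def merge_compounds(tokens):
    """Join runs of tokens that match a lexicon compound, longest match first."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest possible span first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in COMPOUNDS:
                out.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts here: keep the token as a separate word.
            out.append(tokens[i])
            i += 1
    return out
```

With this lexicon, merge_compounds(["the", "mad", "cow", "disease", "scare"]) treats "mad cow disease" as a single unit and leaves the other tokens alone.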

12
Sentences
  • Definition
  • A span of text that is syntactically complete
  • Observable through the use of punctuation that
    separates sentences (period but also question
    mark, exclamation mark, semicolon, and maybe
    colon and long dash)
  • Irregularities include quoted text and
    typographic conventions in the US
  • "You remind me," she said, "of your mother."

13
Length of sentences
  • Usually, we study short examples
  • In real life, sentences in prepared text are
    longer
  • News text
  • Mode of 23 words
  • 75% of sentences are longer than 15 words
  • 58% of sentences are longer than 20 words
  • 26% of sentences are longer than 30 words
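
Statistics of this kind can be computed for any collection of sentences, as in the sketch below. It assumes a non-empty list of sentences and counts whitespace-separated tokens as words; the function name and default thresholds are chosen for illustration and the slide's figures are not reproduced by it.

```python
from statistics import mode


def length_profile(sentences, thresholds=(15, 20, 30)):
    """Return the modal sentence length and the share of sentences longer than each threshold."""
    # Word count is approximated by whitespace splitting.
    lengths = [len(s.split()) for s in sentences]
    shares = {t: sum(n > t for n in lengths) / len(lengths) for t in thresholds}
    return mode(lengths), shares
```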

14
Recognizing sentences
  • Most significant problem is the ambiguous use of
    punctuation: the period also appears in
    abbreviations and numbers
  • Basic algorithm: treat all periods and other
    punctuation marks above as sentence separators
  • Accuracy
  • approximately 90%
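
The basic algorithm can be sketched in a few lines. As a simplifying assumption, only marks followed by whitespace are treated as separators, and the mark set is limited to the period, question mark, exclamation mark, and semicolon; the function name is illustrative.

```python
import re


def split_basic(text):
    """Split after every '.', '?', '!' or ';' that is followed by whitespace."""
    parts = re.split(r"(?<=[.?!;])\s+", text.strip())
    return [p for p in parts if p]
```

On "Dr. Smith arrived. He paid 3.5 dollars!" this yields "Dr." as a sentence of its own, which is the abbreviation error the slide describes; the number 3.5 survives only because no whitespace follows its period.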

15
Improving sentence recognition
  • Use information from
  • capitalization (observable)
  • knowledge of proper names
  • knowledge of abbreviations
  • knowledge of possible abbreviation positions
    (e.g., vs. and Dr. cannot be sentence final)

16
A heuristic algorithm
  • Label every punctuation mark (not inside a word)
    as a sentence separator
  • Adjust for quote position in American English
  • Revise this if the next word is not capitalized
  • For periods, also revise this if the last word is
    an abbreviation that cannot occur at the end of
    the sentence
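
The steps on this slide can be sketched as below. The abbreviation list is a small hypothetical sample, "not inside a word" is approximated by requiring whitespace after the mark, and the quote adjustment simply keeps closing quotes with the sentence they end. The abbreviation check is applied to the last word before any mark, which has the same effect as restricting it to periods because the listed abbreviations all end in a period.

```python
import re

# Hypothetical sample of abbreviations that cannot end a sentence.
NON_FINAL_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "vs.", "prof."}

# Candidate boundary: a mark, optional closing quotes, then whitespace.
CANDIDATE = re.compile(r"""[.?!;]["']*\s+""")


def split_heuristic(text):
    """Split text into sentences using capitalization and abbreviation checks."""
    sentences, start = [], 0
    for m in CANDIDATE.finditer(text):
        end = m.end()
        words = text[start:end].split()
        last_word = words[-1] if words else ""
        # Revise: the next word (after any opening quote) must be capitalized.
        next_char = text[end:].lstrip("\"'")[:1]
        if not next_char.isupper():
            continue
        # Revise: an abbreviation that cannot be sentence final is not a boundary.
        if last_word.lower() in NON_FINAL_ABBREVIATIONS:
            continue
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A known limitation of the capitalization rule in this sketch is that a sentence starting with a digit is never split off from the previous one.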

17
Sentence recognition is classification
  • Decide the role of each punctuation mark
  • Using observed data from labeled training cases
    to predict new occurrences
  • Possible features
  • length and case of words before/after
  • probability of each word to be sentence
    final/initial
  • probability of parts of speech to be sentence
    final/initial (class-based smoothing)
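
A sketch of how some of these features might be computed from labeled data follows. The function names and the feature dictionary are illustrative, unseen words are given probability 0.0 as a simplifying assumption, and the part-of-speech features with class-based smoothing are left out. The resulting feature dictionaries could be passed to any standard classifier.

```python
from collections import Counter


def final_word_probabilities(labeled_sentences):
    """Estimate P(sentence final | word) from tokenized training sentences."""
    total, final = Counter(), Counter()
    for tokens in labeled_sentences:
        for i, tok in enumerate(tokens):
            word = tok.lower()
            total[word] += 1
            if i == len(tokens) - 1:
                final[word] += 1
    return {w: final[w] / total[w] for w in total}


def features(prev_word, next_word, p_final):
    """Features for one punctuation mark, given the words before and after it."""
    return {
        "prev_len": len(prev_word),
        "next_len": len(next_word),
        "prev_capitalized": prev_word[:1].isupper(),
        "next_capitalized": next_word[:1].isupper(),
        # Unseen words default to 0.0 (an assumption; smoothing would be better).
        "prev_p_final": p_final.get(prev_word.lower(), 0.0),
    }
```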

18
Reading
  • Section 3.1 (introductory part, before 3.1.1)
  • Review material on different word classes:
    Sections 3.1.1–3.1.4
  • Sections 4.2.3–4.2.4