Persian Morphological Parser using POS Tagger

Slides: 26
Provided by: arvinfa

1
Persian Morphological Parser using POS Tagger
  • By
  • Ali Azimizadeh (aazimizadeh_at_yahoo.com)
  • Mohammad Mehdi Arab (medi.arab_at_gmail.com)
  • Center of Speech Technology, SimAva Co.
  • Mashhad, Iran
  • Presented By
  • Arvin Farahmand
  • Ryerson University, Dept. of Electrical and
    Computer Engineering.
  • Toronto, Canada

2
Outline
  • Why a Morphological Parser?
  • Persian Morphology
  • Problems in Parsing
  • Existing Approaches
  • Shortcomings
  • Our Approach
  • Example
  • Experimental Results
  • Advantages
  • Disadvantages

3
Why a morphological parser?
  • Morphological parsing is an important part of
    Natural Language Processing (NLP). This parser
    has been designed as part of the Parsgooyan NLP.
    It is designed to:
  • Reduce the need for a large vocabulary
  • Find the stressed syllable in pronunciation
    based on the attached morpheme
  • Adjust tags generated by the tagger
  • Improve the quality of the token-to-word converter

4
Persian Morphology
  • There are five distinct types of morphemes:
  • Lexical: nouns, adjectives, verbs (mrd, man)
  • Grammatical: pronouns, prepositions (v, and)
  • Derivative: particles (napak, unclean)
  • Inflective: plural suffixes, superlatives
    (ktabha, books)
  • Clitic: possessives (ktabam, my book)

5
Problems in Parsing
  • 1. Homographs
  • Words that are written the same but are
    pronounced differently. Mainly due to lack of
    written vowels in Persian.
  • e.g. krdy -> kardy (you did), or kordy (Kurdish)

6
Problems in Parsing
  • 2. Out Of Vocabulary (OOV) Words
  • Most existing NLP systems for Persian are
    lexicon-based.
  • What to do for words that are not in the
    vocabulary?

7
Problems in Parsing
  • 3. Lack of a tokenization standard
  • Morphemes in Persian can be attached or detached.
    They can follow or precede the word. This makes
    it difficult to correctly tokenize a word as a
    whole.
  • e.g. my_rvm -> my can be may (wine), or mee
    (imperfective marker)

8
Existing Approaches
  • The two major approaches for Persian thus far
    have been:
  • Perslex (Riazati, 1997), which is based on Englex.
    It is a two-step morphological parser
  • Shiraz NLP (Megerdoomian, 2000), based on feature
    structures and unification

9
Shortcomings
  • Although existing approaches give highly accurate
    results, they have two weaknesses which our
    approach tries to address:
  • Inability to deal flexibly with words that don't
    exist in their vocabulary
  • Inability to deal with homographs

10
Our Approach
  • The presented parser combines a rule-based system
    with statistical part-of-speech (POS) features.
  • This addresses some of the shortcomings of
    existing approaches.

11
Lexical Morphemes
  • In our approach we divide morphemes in Persian
    into two categories:
  • - First Order Morphemes: morphemes that do not
    have homographs, e.g. tan (yours)
  • - Second Order Morphemes: morphemes that do have
    homographs, e.g. my -> may (wine), or mee
    (imperfect tense marker)
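
The split into first and second order morphemes can be sketched as a lookup that counts how many readings a written form has. This is only an illustration: the `READINGS` table and the `morpheme_order` function are hypothetical names, and the entries are limited to the two examples given on the slide.

```python
# Hypothetical table: written form -> possible readings.
# Only the two examples from the slide are included.
READINGS = {
    "tan": ["tan (yours)"],
    "my": ["may (wine)", "mee (imperfect tense marker)"],
}


def morpheme_order(morpheme: str) -> int:
    """Return 1 if the morpheme has a single reading (no homographs), else 2."""
    readings = READINGS[morpheme]
    return 1 if len(readings) == 1 else 2


print(morpheme_order("tan"))  # first order
print(morpheme_order("my"))   # second order
```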

12
Steps
  • Mapping: Persian letters are converted to English
    (Latin) letters for use in the Festival software
  • Tokenization: words are separated based on spaces
    and punctuation
  • POS Tagger
  • First order morphemes are attached to the roots
    without any special rules
  • Second order morphemes are attached to the
    roots using CART
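
The first two steps (mapping and tokenization) might look roughly like the sketch below. The letter table is an assumption: it covers only the letters needed for the example sentence used later in the slides, and the actual table used with Festival is not given in the presentation. The set of punctuation marks is likewise assumed.

```python
import re

# Assumed letter table, just large enough for "bh mdrsh myrvm".
# The real mapping used by the authors is not shown in the slides.
LETTER_MAP = {
    "ب": "b", "ه": "h", "م": "m", "د": "d",
    "ر": "r", "س": "s", "ی": "y", "و": "v",
}


def map_letters(text: str) -> str:
    """Step 1: replace each Persian letter with its Latin counterpart."""
    return "".join(LETTER_MAP.get(ch, ch) for ch in text)


def tokenize(text: str) -> list[str]:
    """Step 2: split on whitespace and an assumed set of punctuation marks."""
    return [tok for tok in re.split(r"[\s.,;:!?،؛؟]+", text) if tok]


mapped = map_letters("به مدرسه میروم")
print(mapped)
print(tokenize(mapped))
```

Letters missing from the table are passed through unchanged, which keeps the sketch from silently dropping input.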

13
CART
  • CART (Classification and Regression Trees) is a
    statistical approach that is used here to predict
    both attached and free morphemes.
  • At each branch a yes/no question is answered and
    a final decision is available at the leaf.
  • Ours was constructed from 30,000 three-word
    groups of the form
  • <previous word, morpheme, next word>
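
To make the yes/no structure concrete, here is a minimal hand-written tree over a <previous word, morpheme, next word> context. It is a sketch, not the authors' tree: a real CART is learned from training data, whereas the single question, the `VERB_FORMS` list, and the leaf labels below are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union

# A context is <previous word, morpheme, next word>.
Context = tuple[str, str, str]


@dataclass
class Leaf:
    decision: str


@dataclass
class Node:
    question: Callable[[Context], bool]  # yes/no question asked at this branch
    yes: Union["Node", Leaf]
    no: Union["Node", Leaf]


def classify(tree: Union[Node, Leaf], context: Context) -> str:
    """Answer the question at each branch until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.yes if tree.question(context) else tree.no
    return tree.decision


# Hypothetical word list and hand-written question (not learned from data).
VERB_FORMS = {"rvm"}
TOY_TREE = Node(
    question=lambda ctx: ctx[2] in VERB_FORMS,  # is the next word a verb form?
    yes=Leaf("attached morpheme"),
    no=Leaf("free morpheme"),
)

print(classify(TOY_TREE, ("mdrsh", "my", "rvm")))
```

A learned tree would typically have many more branches, but the traversal stays the same: one question per branch, one decision per leaf.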

14
Example
  • bh mdrsh myrvm (I go to school).
  • Step 1 has already been performed and letters are
    mapped to English.
  • Step 2 - Tokenizer Output
  • bh
  • mdrsh
  • myrvm

15
Example
  • bh mdrsh myrvm (I go to school).
  • Step 3 - POS Tagger
  • bh <PREP>
  • mdrsh <N_SING>
  • myrvm ???

16
Example
  • bh mdrsh myrvm (I go to school).
  • my (mee) is the problematic part, but by feeding
    it to CART and making use of the previous word
    mdrsh and the next word rvm, it is found to be a
    morpheme.
  • myrvm is tagged as <V_IMP>

17
Algorithm

18
Training Data
  • The training data set consisted of 2,500,000
    words, labeled with 38 tags.
  • Almost 99% of all verbs in usage were included in
    the training data.
  • Most OOV words encountered were nouns and
    adjectives.

19
Experimental Results

20
Sources of Error
  • A large portion of the errors was due to proper
    names, specifically those that end in y.
  • These were often mistaken for adjectives, which
    also often end in y.
  • e.g. Tanavoly

21
Advantages - Homographs
  • POS Tags are used in resolving the homograph
    problem.
  • e.g. mn -> man (me), or man (three kilos). One is
    a pronoun while the other is a noun.
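
One way to picture the use of POS tags for homographs is a lookup keyed on both the written form and its tag. The sketch below uses the mn example from this slide. The tag name `PRON` is an assumption (the slides only show `PREP`, `N_SING` and `V_IMP`), and `READINGS_BY_TAG` and `resolve` are hypothetical names.

```python
from typing import Optional

# (written form, POS tag) -> reading.
# "PRON" is an assumed tag name; N_SING appears in the slides.
READINGS_BY_TAG = {
    ("mn", "PRON"): "man (me)",
    ("mn", "N_SING"): "man (three kilos)",
}


def resolve(word: str, tag: str) -> Optional[str]:
    """Pick the reading that matches the POS tag, or None if unknown."""
    return READINGS_BY_TAG.get((word, tag))


print(resolve("mn", "PRON"))
print(resolve("mn", "N_SING"))
```

Because the key includes the tag, two readings that share a tag would collide in such a table, so the tag alone could not separate them.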

22
Advantages - OOV Words
  • Making use of complete POS strengthens the system
    against Out Of Vocabulary (OOV) words.
  • e.g. imantan ra hfz knid (protect your faith)
  • imantan is tagged N_SING. If tan is separated and
    iman is searched for in the vocabulary but not
    found, the whole word is still decided to be a
    noun, and tan is still separated.
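
The OOV behaviour described above can be sketched as follows. This is a simplified illustration under stated assumptions: `VOCABULARY` is a hypothetical lexicon that deliberately lacks iman, `split_clitic` is a hypothetical helper, and stripping tan from any word that ends in it is far more naive than a real rule set would be.

```python
from typing import Optional

# Hypothetical lexicon that does not contain "iman".
VOCABULARY = {"ktab"}
# Clitic taken from the slides' example; a real list would be longer.
FIRST_ORDER_SUFFIXES = ("tan",)


def split_clitic(word: str, tag: str) -> tuple[str, Optional[str], str, bool]:
    """Return (stem, suffix, tag, stem_in_vocabulary).

    The tagger's tag is kept even when the stem is out of vocabulary,
    and the suffix is separated either way.
    """
    for suffix in FIRST_ORDER_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            return stem, suffix, tag, stem in VOCABULARY
    return word, None, tag, word in VOCABULARY


print(split_clitic("imantan", "N_SING"))
```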

23
Disadvantages - Homographs
  • Homograph separation is still not perfect.
    Homographs that have the same POS cannot be
    distinguished.
  • e.g. bvd -> bood (was), or bovad (will be)

24
Disadvantages - Error Propagation
  • If the POS tagger makes a tagging error in one of
    the first words in a sentence, the error will
    propagate through the rest of the words in the
    sentence.
  • Attached morphemes can be incorrectly tagged as
    words and be made separate from their roots.
  • CART needs previous words to be accurately tagged
    for it to work correctly.

25
The End