Persian Morphological Parser using POS Tagger

1
Persian Morphological Parser using POS Tagger

By
Ali Azimizadeh (aazimizadeh_at_yahoo.com)
Mohammad Mehdi Arab (medi.arab_at_gmail.com)
Center of Speech Technology, SimAva Co.
Mashhad, Iran
Presented By
Arvin Farahmand
Ryerson University, Dept. of Electrical and
Computer Engineering.
Toronto, Canada

2
Outline

Why a Morphological Parser?
Persian Morphology
Problems in Parsing
Existing Approaches
Shortcomings
Our Approach
Example
Experimental Results
Advantages
Disadvantages

3
Why a morphological parser?

Morphological parsing is an important part of
Natural Language Processing (NLP). This parser
has been designed as part of the Parsgooyan NLP.
It is designed to:
- Reduce the need for a large vocabulary
- Find the stressed syllable in pronunciation
  based on the attached morpheme
- Adjust tags generated by the tagger
- Improve the quality of the token-to-word converter

4
Persian Morphology

There are five distinct types of morphemes:
- Lexical: nouns, adjectives, verbs (e.g. mrd, man)
- Grammatical: pronouns, prepositions (e.g. v, and)
- Derivative: particles (e.g. napak, unclean)
- Inflective: plural suffixes, superlatives
  (e.g. ktabha, books)
- Clitic: possessives (e.g. ktabam, my book)

5
Problems in Parsing

1. Homographs
Words that are written the same but are
pronounced differently, mainly due to the lack of
written vowels in Persian.
e.g. krdy -> kardy (you did) or
kordy (Kurdish)

6
Problems in Parsing

2. Out Of Vocabulary (OOV) Words
Most existing NLP systems for Persian are
lexicon-based.
What should be done with words that are not in the
vocabulary?

7
Problems in Parsing

3. Lack of a tokenization standard
Morphemes in Persian can be attached or detached,
and they can precede or follow the word. This makes
it difficult to correctly tokenize a word as a
whole.
e.g. my_rvm -> my can be may (wine) or mee
(imperfective marker)

8
Existing Approaches

The two major approaches for Persian thus far
have been:
- Perslex (Riazati, 1997), which is based on Englex.
  It is a two-step morphological parser.
- Shiraz NLP (Megerdoomian, 2000), based on feature
  structures and unification.

9
Shortcomings

Although existing approaches give highly accurate
results, they have two weaknesses that our
approach tries to address:
- Inability to deal flexibly with words that are
  not in their vocabulary
- Inability to deal with homographs

10
Our Approach

The parser presented combines a rule-based system
with statistical part-of-speech (POS) features.
This addresses some of the shortcomings of
existing approaches.

11
Lexical Morphemes

In our approach we divide morphemes in Persian
into two categories:
- First Order Morphemes: morphemes that do not
  have homographs, e.g. tan (yours)
- Second Order Morphemes: morphemes that do have
  homographs, e.g. my -> may (wine) or mee
  (imperfect tense marker)
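
The distinction between the two categories can be sketched as a lookup on how many readings a written morpheme has. This is only an illustration: the table `READINGS` and the function `morpheme_order` are hypothetical names, and the two entries are taken from the examples on the slides rather than from the authors' lexicon.

```python
# Hypothetical table: written morpheme -> its possible readings.
# Only the two examples from the slides are included.
READINGS = {
    "tan": ["tan (yours)"],
    "my": ["may (wine)", "mee (imperfect tense marker)"],
}

def morpheme_order(morpheme):
    """Return 1 for a first order morpheme (no homographs), 2 otherwise."""
    readings = READINGS.get(morpheme)
    if readings is None:
        raise KeyError(f"unknown morpheme: {morpheme}")
    return 1 if len(readings) == 1 else 2

print(morpheme_order("tan"))  # one reading, so first order
print(morpheme_order("my"))   # two readings, so second order
```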

12
Steps

1. Mapping: Persian letters are converted to English
   letters for use in the Festival software.
2. Tokenization: words are separated based on spaces
   and punctuation.
3. POS Tagger:
   - First order morphemes are attached to the roots
     without any special rules.
   - Second order morphemes are attached to the
     roots using CART.
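
The first two steps can be sketched as follows. The letter table is an assumption: it covers only the eight letters needed for the sentence used as the example in this presentation (bh mdrsh myrvm) and is inferred from that example, since the authors' full mapping is not given. The names `LETTER_MAP`, `map_letters` and `tokenize` are hypothetical.

```python
import re

# Assumed transliteration table, inferred from the example "bh mdrsh myrvm".
LETTER_MAP = {
    "\u0628": "b",  # beh
    "\u0647": "h",  # heh
    "\u0645": "m",  # meem
    "\u062f": "d",  # dal
    "\u0631": "r",  # reh
    "\u0633": "s",  # seen
    "\u06cc": "y",  # Persian yeh
    "\u0648": "v",  # vav
}

def map_letters(text):
    """Replace each known Persian letter; leave other characters unchanged."""
    return "".join(LETTER_MAP.get(ch, ch) for ch in text)

def tokenize(text):
    """Split on spaces and punctuation, dropping empty pieces."""
    return [tok for tok in re.split(r"[^\w]+", text) if tok]

# The example sentence, written with Unicode escapes, followed by a full stop.
persian = "\u0628\u0647 \u0645\u062f\u0631\u0633\u0647 \u0645\u06cc\u0631\u0648\u0645."
tokens = tokenize(map_letters(persian))
print(tokens)  # ['bh', 'mdrsh', 'myrvm']
```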

13
CART

CART (Classification and Regression Trees) is a
statistical approach that is used here to predict
both attached and free morphemes.
At each branch a yes/no question is answered, and
a final decision is available at the leaf.
Ours was constructed from 30,000 three-word
groups of the form
<previous word, morpheme, next word>
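
To make the yes/no structure concrete, here is a minimal classification-tree sketch in plain Python. It is not the authors' tree: the four training triples and their labels are invented purely to exercise the code and carry no linguistic claim, the question type ("is this position equal to this word?") and the Gini criterion are assumptions, and all function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_question(rows, labels):
    """Pick the yes/no question 'row[i] == v?' with the lowest weighted Gini."""
    best = None  # (score, position, value)
    for i in range(len(rows[0])):
        for v in {r[i] for r in rows}:
            yes = [l for r, l in zip(rows, labels) if r[i] == v]
            no = [l for r, l in zip(rows, labels) if r[i] != v]
            if not yes or not no:
                continue  # the question does not split the data
            score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
            if best is None or score < best[0]:
                best = (score, i, v)
    return best

def build(rows, labels):
    """Grow the tree; a leaf is a label, a node is (position, value, yes, no)."""
    if len(set(labels)) == 1:
        return labels[0]
    question = best_question(rows, labels)
    if question is None:  # identical rows with mixed labels: majority vote
        return Counter(labels).most_common(1)[0][0]
    _, i, v = question
    yes = [(r, l) for r, l in zip(rows, labels) if r[i] == v]
    no = [(r, l) for r, l in zip(rows, labels) if r[i] != v]
    return (i, v, build(*zip(*yes)), build(*zip(*no)))

def classify(tree, row):
    """Answer the yes/no questions from the root down to a leaf."""
    while isinstance(tree, tuple):
        i, v, yes, no = tree
        tree = yes if row[i] == v else no
    return tree

# Invented toy triples: (previous word, morpheme, next word).
rows = [("mdrsh", "my", "rvm"), ("bh", "my", "krdy"),
        ("mrd", "my", "v"), ("mrd", "my", "bvd")]
labels = ["attached", "attached", "free", "free"]

tree = build(rows, labels)
print(classify(tree, ("mdrsh", "my", "rvm")))
```

With only four rows the tree is tiny, but the loop in `classify` is the same walk described above: one yes/no answer per branch, and the decision at the leaf.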

14
Example

bh mdrsh myrvm (I go to school).
Step 1 has already been performed and letters are
mapped to English.
Step 2 - Tokenizer Output
bh
mdrsh
myrvm

15
Example

bh mdrsh myrvm (I go to school).
Step 3 - POS Tagger
bh <PREP>
mdrsh <N_SING>
myrvm ???

16
Example

bh mdrsh myrvm (I go to school).
my (mee) is the problematic part, but by feeding
it to CART and making use of the previous word mdrsh
and the next word rvm, it is found to be a morpheme.
myrvm is tagged as <V_IMP>

17
Algorithm

18
Training Data

The training data set consisted of 2,500,000
words, labelled with 38 tags.
Almost 99% of all verbs in use were included in
the training data.
Most OOV words encountered were nouns and
adjectives.

19
Experimental Results

20
Sources of Error

A large portion of the errors was due to proper
names, specifically those that end in y.
These were often mistaken for adjectives, which
also often end in y.
e.g. Tanavoly

21
Advantages - Homographs

POS tags are used to resolve the homograph
problem.
e.g. mn -> man (me) or man (three kilos). One is
a pronoun while the other is a noun.
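
One way to picture this is a table keyed by the written form together with its POS tag. The sketch below is an assumption about the data layout, not the authors' implementation. The readings come from the slides, but the tag names `PRON`, `V` and `ADJ` are hypothetical (only `N_SING` appears in the slides), as are `READINGS_BY_TAG` and `resolve`.

```python
# Hypothetical table: (written form, POS tag) -> reading.
READINGS_BY_TAG = {
    ("mn", "PRON"): "man (me)",
    ("mn", "N_SING"): "man (three kilos)",
    ("krdy", "V"): "kardy (you did)",
    ("krdy", "ADJ"): "kordy (Kurdish)",
}

def resolve(word, tag):
    """Return the reading selected by the POS tag, or None if it is unknown."""
    return READINGS_BY_TAG.get((word, tag))

print(resolve("mn", "PRON"))
print(resolve("krdy", "ADJ"))
```

A table of this shape can only separate readings whose tags differ, because the tag is part of the key.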

22
Advantages - OOV Words

Making use of the POS tag of the complete word
strengthens the system against Out Of Vocabulary
(OOV) words.
e.g. imantan ra hfz knid (protect your faith)
imantan is tagged N_SING. If tan is separated and
iman is looked up in the vocabulary but not found,
the whole word is still decided to be a noun, and
tan is still separated.
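
The fallback described here can be sketched in a few lines. This is a simplified illustration under stated assumptions: the suffix list contains only tan, the toy vocabulary deliberately omits iman, and the names `SUFFIXES`, `VOCABULARY` and `parse` are hypothetical.

```python
SUFFIXES = ("tan",)     # only "tan" is taken from the slides
VOCABULARY = {"ktab"}   # toy lexicon; "iman" is intentionally missing

def parse(word, tag):
    """Split off a known suffix and keep the tagger's tag even if the root is OOV."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            root = word[: -len(suffix)]
            return {"root": root, "suffix": suffix, "tag": tag,
                    "in_vocabulary": root in VOCABULARY}
    # No known suffix: treat the whole word as the root.
    return {"root": word, "suffix": None, "tag": tag,
            "in_vocabulary": word in VOCABULARY}

result = parse("imantan", "N_SING")
print(result)
```

For imantan the root iman is not in the toy vocabulary, yet the result still carries the N_SING tag and the separated suffix tan.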

23
Disadvantages - Homographs

Homographic separation is still not perfect.
Homographs that have the same POS cannot be
distinguished.
e.g. bvd -> bood (was) or bovad (will be)

24
Disadvantages - Error Propagation

If the POS tagger makes a tagging error in one of
the first words in a sentence, the error will
propagate through the rest of the words in the
sentence.
Attached morphemes can be incorrectly tagged as
words and separated from their roots.
CART needs the previous words to be accurately
tagged in order to work correctly.

25
The End
