Morphology Based Natural Language Processing tools for Indian Languages


1
Morphology Based Natural Language
Processing tools for Indian Languages
  • By
  • Manish Shrivastava
  • Nitin Agrawal
  • Bibhuti Mohapatra
  • Smriti Singh

2
Outline
  • Introduction
  • Morphological Analysis
  • Stemmer
  • Phrase Level Morphological Analyser
  • Word Level Morphological Analyser
  • Part of Speech Tagger
  • Future Work

3
Introduction
  • POS tagging is the process of identifying the
    lexical category of a word in a sentence on the
    basis of its context
  • Input: rama Kola rha hO
  • Output: rama_PN Kola_VB rha_VDM hO_VC
  • (PN: Proper noun, VB: Verb base, VDM: Verb
    durative male, VC: Verb copula)
  • Wide applications
  • Information retrieval, machine translation, word
    sense disambiguation, question answering systems,
    etc.

4
Motivation
  • Accurate POS taggers are not available for Hindi
  • Indian languages are morphologically rich. Hence,
  • The first step is to analyze the language
  • Tools for harnessing morphological information
    are needed

5
Our Approach
  • Morphological Analysis of Hindi
  • Noun analysis
  • 20 paradigms
  • Suffix analysis based on GNPC values
  • Verb analysis
  • 600 verb group paradigms
  • Morpheme analysis based on TAMGNP values

6
Our Approach (contd.)
  • Intermediate tools for initial processing
  • Stemmer
  • Identifying suffixes and root
  • Morphological Analyser
  • Analysing suffixes provided by stemmer
  • Providing category and feature information

7
Hindi Morphological Analysis
  • The analyses done so far can be categorized
    broadly into
  • Noun Analysis
  • Verb Analysis

8
Noun Analysis
  • Categorized into 20 paradigms based on
  • Vowel ending
  • Valid suffixes
  • Gender, Number, Person and Case information
  • Suffix-Replacement rules
  • Different rules for same suffixes depending on
    paradigms

9
Noun Analysis Table

10
Verb Analysis
  • Verb Group analyzed on the basis of
  • Tense
  • Aspect
  • Modality
  • Gender
  • Number
  • Person

11
Verb Group Analysis
  • Verb Groups can be listed using TAMGNP values
  • Matrix generated based on above values.
  • Presently there are 622 unique paradigms in the
    Matrix and more are being added

12
TAM GNP Matrix

13
Linguistic resources for Verbs
  • Suffix List for verbs
  • Morpheme analysis for VG
  • Disambiguation rules for Morphological Analyser
    and Part of Speech tagger

14
Applications of VG Analysis
  • VG analysis helps in
  • Identifying Verb Group in a sentence
  • Identifying Main Verb and Auxiliary Verb
  • Identifying Suffixes
  • Generation of Verb Group

15
Stemmer
  • Stemming is the process of removing the suffix
    and producing the root or stem after appropriate
    replacement
  • If the input word is laD,ikyaaoM, then the root
    is laD,kI, where the suffix is i yaaoM and the
    replacement is I
  • Traditional stemmers produce only the stem
  • For laD,ikyaaoM the output will be laD,k

16
Hindi Stemmer
  • Given a word, our stemmer can provide
  • root
  • stem
  • suffix and
  • grammatical category
  • Example input: laD,kaoM
  • Root: laD,ka
  • Stem: laD,k
  • Suffix: aoM
  • Grammatical category: Noun

17
Resources
  • The resources used for Hindi Stemmer are
  • Morphological Analysis
  • Suffix replacement rules
  • Contain suffix and category information
  • Wordlists from Wordnet
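
The interplay of suffix-replacement rules and a wordlist can be sketched as below. This is a minimal illustration, not the presenters' implementation: the names Rule, RULES, WORDLIST and stem_word are hypothetical, and the rule table holds only the two cases shown on the slides (laD,kaoM and iksaanaaoM). The wordlist lookup is what selects between two rules that share the same suffix.

```python
from typing import NamedTuple, Optional


class Rule(NamedTuple):
    suffix: str       # suffix stripped from the surface form
    replacement: str  # string appended to the stem to recover the root
    category: str     # grammatical category licensed by the rule


class Analysis(NamedTuple):
    root: str
    stem: str
    suffix: str
    category: str


# Hypothetical rule table; the two entries mirror the slide examples.
RULES = [
    Rule("aoM", "a", "Noun"),  # laD,kaoM   -> laD,k + a = laD,ka
    Rule("aoM", "", "Noun"),   # iksaanaaoM -> iksaana
]

# Tiny stand-in for the WordNet-derived wordlist of valid roots.
WORDLIST = {"laD,ka", "iksaana"}


def stem_word(word: str, rules=RULES, wordlist=WORDLIST) -> Optional[Analysis]:
    """Return the first rule application whose root is in the wordlist."""
    for rule in rules:
        if rule.suffix and word.endswith(rule.suffix):
            stem = word[: -len(rule.suffix)]
            root = stem + rule.replacement
            if root in wordlist:
                return Analysis(root, stem, rule.suffix, rule.category)
    return None  # no rule yields a known root


print(stem_word("laD,kaoM"))
print(stem_word("iksaanaaoM"))
```

Returning the first match keeps the sketch short; a fuller stemmer would presumably collect every valid analysis and leave the choice to a later disambiguation step.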

18
Stemmer Architecture
  • Input: iksaanaaoM
  • Suffix rule: aoM/- , Gar (noun)
  • Output root: iksaana

Results

20
Applications
  • Part-of-Speech Tagger
  • Provides grammatical category information
  • Search Engines
  • Helps in indexing the data
  • WordNet querying
  • Provides root for querying Wordnet
  • Morphological analysis
  • Provides the suffix of the given word

21
Morphological Analyser
  • Morphological analysis is the process of
    providing grammatical information about a word
    given its suffix.
  • Example:
  • Word: rhogaa
  • Category: Verb, Root: rh, Suffix: ogaa
  • Person: 3rd person, due to the presence of o
  • Tense: Future, due to the presence of ga
  • Gender: Male, due to the presence of a
22
Phrase Level Morphological Analyser (PLMA)
  • Using Verb Group Paradigms
  • Each paradigm represents a unique Verb Group
  • Paradigms give the morpheme sequence forming the
    Verb Group
  • Example:
  • (Verb-root)(fct)(gnr)(SPACE)(cop)(pnr)
  • Verb Group: Kolata hO
  • Analysis: Verb, Aspect: Stative, Gender: Male

23
Word Level Morphological Analyser (WLMA)
  • Identifies the structural components of the word
  • Provides Person, Number, Case, Aspect, Gender,
    Modality and Tense information
  • Input: Kolata
  • Root: Kola, Suffix: ta, Category: Verb
  • Morpheme: t, Analysis: stative aspect
  • Morpheme: a, Analysis: male gender

24
Resources Required
  • The first module in WLMA is the stemmer which
    requires
  • Suffix-Replacement Rules
  • Wordlists
  • Second module is Suffix Analyser which needs
  • Morpheme Analysis

25
WLMA Architecture
  • Input: iksaanaaoM
  • Stemmer, using Suffix Rules (aoM/- , ghar) and
    Wordlist (iksaana)
  • Stemmer output: iksaana, aoM, ghar
  • Suffix Analyser, using Morpheme Analysis
    (aoM: Plural, oblique)
  • Morphological Analyser output: Input iksaanaaoM,
    Root iksaana, Suffix aoM, Category noun,
    Analysis Pl, obl

Results
  • Input file: a news item from www.bbc.co.uk/hindi
  • Output

27
POS tagger using WLMA
  • Tagging is done in four stages
  • Stemmer: provides category and suffix
  • Morphological Analyser: provides grammatical
    information
  • Disambiguation: modifies the analysis of the
    suffix based on context
  • Tag generation: the final tag is provided based
    on the final analysis

28
Disambiguation
  • Tag disambiguation is necessary as morphological
    information depends on context
  • Input: laD,ko
  • Analysis: Singular Oblique (if followed by a
    case marker)
  • Or
  • Analysis: Plural Direct (if alone)
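
The context rule above can be sketched as a single look-ahead. This is an assumption-laden illustration: CASE_MARKERS and disambiguate_noun are hypothetical names, and the marker list is invented for the example (only maoM occurs elsewhere in these slides, and it is not labelled there as a case marker).

```python
# Hypothetical list of case markers in the slides' transliteration.
CASE_MARKERS = {"maoM", "kao", "sao"}


def disambiguate_noun(tokens, index, case_markers=CASE_MARKERS):
    """Choose between the two analyses of an ambiguous noun form.

    The form is read as singular oblique when the next token is a
    case marker, and as plural direct otherwise.
    """
    nxt = tokens[index + 1] if index + 1 < len(tokens) else None
    if nxt in case_markers:
        return {"Number": "Singular", "Case": "Oblique"}
    return {"Number": "Plural", "Case": "Direct"}


print(disambiguate_noun(["laD,ko", "kao"], 0))
print(disambiguate_noun(["laD,ko"], 0))
```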

29
Disambiguation
  • Similarly, Modality and Aspect in a Verb Group
    depend on the word's position in the Group
  • Example:
  • rama Kata rhta hO
  • Here, rhta gives Durative aspect, Male gender
  • rama Gar maoM rhta hO
  • Here, rhta gives Verb Main, Stative aspect, Male
    gender
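
One way to read the two examples is that rhta is durative when another verb precedes it in the group, and a stative main verb otherwise. The sketch below encodes only that reading. It assumes the tokens already carry a coarse category from an earlier stage; the function name analyse_rhta and the labels "Noun", "Verb" and "Postposition" are hypothetical.

```python
def analyse_rhta(tagged, index):
    """Analyse rhta from its position in the verb group.

    tagged is a list of (word, coarse_category) pairs. If the token
    just before rhta is a verb, rhta contributes durative aspect;
    otherwise it is taken as the main verb with stative aspect.
    """
    prev_is_verb = index > 0 and tagged[index - 1][1] == "Verb"
    if prev_is_verb:
        return {"Aspect": "Durative", "Gender": "Male"}
    return {"Category": "Verb Main", "Aspect": "Stative", "Gender": "Male"}


s1 = [("rama", "Noun"), ("Kata", "Verb"), ("rhta", "Verb"), ("hO", "Verb")]
s2 = [("rama", "Noun"), ("Gar", "Noun"), ("maoM", "Postposition"),
      ("rhta", "Verb"), ("hO", "Verb")]
print(analyse_rhta(s1, 2))
print(analyse_rhta(s2, 3))
```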

30
Block Diagram
  • POS Tagger: Input → WLMA (Stemmer → Suffix
    Analyser) → Disambiguation → Tag Generation →
    Tagged Output

Tag Generation
  • Tags for each word are provided in the following
    format
  • Noun tag pattern: Cat_GN_C
  • Verb Group tag pattern: Cat_GNP_TAM
  • Adjective tag pattern: Cat_GN_C
  • Pronoun tag pattern: Cat_GNP_C

32
Tag Generation
  • Example
  • For
  • Input: baura laD,ka Kolata hO
  • Output: baura_Adj_MS_D
  • laD,ka_N_MS_D
  • Kolata_VM_MS3_PSX
  • hO_VC_MS3_PSX
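
Assembling a tag from the tag patterns can be sketched as a lookup followed by a join. PATTERNS and make_tag are hypothetical names. The category codes Adj, N, VM and VC come from the example output, "Pron" is an assumed abbreviation for pronouns, and the feature codes (MS, D, MS3, PSX) are treated as opaque strings supplied by the earlier stages.

```python
# Feature slots per category, following the tag patterns listed above.
PATTERNS = {
    "N": ("GN", "C"),        # Noun:       Cat_GN_C
    "Adj": ("GN", "C"),      # Adjective:  Cat_GN_C
    "Pron": ("GNP", "C"),    # Pronoun:    Cat_GNP_C ("Pron" is assumed)
    "VM": ("GNP", "TAM"),    # Verb group: Cat_GNP_TAM (main verb)
    "VC": ("GNP", "TAM"),    # Verb group: Cat_GNP_TAM (copula)
}


def make_tag(word, cat, features):
    """Join the word, its category and its feature codes with underscores."""
    fields = [features[slot] for slot in PATTERNS[cat]]
    return "_".join([word, cat, *fields])


print(make_tag("baura", "Adj", {"GN": "MS", "C": "D"}))
print(make_tag("Kolata", "VM", {"GNP": "MS3", "TAM": "PSX"}))
```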

33
Future Work
  • The analyses will be extended to cover paradigms
    not yet seen
  • A phrase-level approach will be tried to tag
    complete groups
  • A hybrid approach using stochastic methods might
    be used to improve accuracy on unknown words.

34
References
  • C. D. Manning and H. Schütze, Foundations of
    Statistical Natural Language Processing. MIT
    Press, 2002.
  • L. Van Guilder, Automated Part of Speech Tagging:
    A Brief Overview, Handout for LING361,
    Georgetown University, Fall 1995.
  • D. Jurafsky and J. H. Martin, Speech and Language
    Processing.
  • M. Porter, An algorithm for suffix stripping,
    Proceedings of SIGIR,
  • Akshar Bharati, V. Chaitanya and R. Sangal,
    Natural Language Processing: A Paninian
    Perspective. Prentice-Hall India, 1995.
  • R.-A. G. Saudagar, An Automated Generation Rule
    for Hindi. MCA Dissertation, 1998.
  • E. Brill, A simple rule-based part of speech
    tagger, Proceedings of the DARPA Speech and
    Natural Language Workshop, 1992.

35
Thank You