1
Shallow Parsing for South Asian Languages
  • -Himanshu Agrawal

2
Shallow Parsing
  • Parts-of-Speech Tagging
  • Assigning grammatical classes to words in a
    natural language sentence.
  • Text Chunking
  • Dividing the text into syntactically correlated
    parts of words.
  • Example: [NP He] [VP reckons] [NP the current
    account deficit] [VP will narrow] [PP to]
    [NP only 1.8 billion] [PP in] [NP September].
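The bracketed example above is usually fed to a learner as one tag per token. A minimal sketch, assuming the common BIO encoding (the slides do not name the scheme):

```python
# Convert bracketed chunks into per-token BIO tags, the usual encoding
# for text chunking (illustrative; the encoding scheme is an assumption).
chunks = [
    ("NP", ["He"]),
    ("VP", ["reckons"]),
    ("NP", ["the", "current", "account", "deficit"]),
    ("VP", ["will", "narrow"]),
    ("PP", ["to"]),
    ("NP", ["only", "1.8", "billion"]),
    ("PP", ["in"]),
    ("NP", ["September"]),
]

def to_bio(chunks):
    """First token of a chunk gets B-<label>, the rest get I-<label>."""
    tags = []
    for label, words in chunks:
        for i, word in enumerate(words):
            prefix = "B" if i == 0 else "I"
            tags.append((word, f"{prefix}-{label}"))
    return tags

bio = to_bio(chunks)
```

Each token now carries exactly one class, which turns chunking into the same token-classification problem as POS tagging.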

3
Applications
  • Direct Applications
  • Automatic Spell Checking Software
  • Grammar Suggestions (e.g. MS Word pop-ups)
  • Full Parsing
  • Indirect Applications
  • Machine Translation Systems
  • Web Search

4
Nature of the problem of Shallow Parsing
  • A classic problem of classifying input tokens
    into given classes.
  • The sequence aspect
  • The sequence of best classes (each token
    classified independently).
  • The best sequence of classes (the whole
    sequence scored jointly).
  • Typically, the classifying information is the
    language context of the word under consideration.
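The two readings of "best" above can genuinely disagree. A toy sketch with made-up scores (both the scores and the brute-force path search are illustrative; real taggers use Viterbi for the joint case):

```python
from itertools import product

# Made-up per-token class scores and transition scores between classes.
emissions = [{"A": 0.6, "B": 0.4}, {"A": 0.55, "B": 0.45}]
transition = {("A", "A"): 0.1, ("A", "B"): 0.9,
              ("B", "A"): 0.9, ("B", "B"): 0.1}

# "Sequence of best classes": pick the top class at each position
# independently, ignoring transitions.
greedy = [max(e, key=e.get) for e in emissions]

def path_score(path):
    """Score a whole class sequence: emissions times transitions."""
    score = emissions[0][path[0]]
    for i in range(1, len(path)):
        score *= transition[(path[i - 1], path[i])] * emissions[i][path[i]]
    return score

# "Best sequence of classes": score whole paths jointly
# (what the Viterbi algorithm computes efficiently).
best = max(product("AB", repeat=2), key=path_score)
```

Here the greedy choice is A, A, but the jointly best path is A, B, because the transition scores penalize repeating a class.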

5
Shallow Parsing for English
  • The problem has been well worked on for
    English.
  • Very efficient systems exist.
  • Examples
  • Brill's tagger (1995), Transformation-Based
    Learning.
  • Ratnaparkhi (1999), parsing with Maximum
    Entropy.
  • Significant effect on the development of MT
    systems for European languages.

6
Shallow Parsing for South Asian Languages
  • Portability of shallow-parsing systems across
    languages??
  • NOT GOOD!!
  • Inflectional richness of the languages.
  • Training on 22,000 words and testing on 5,000
    words.

POS tagging accuracy (%)                  English   Hindi
Brill's Transformation-Based Learning        87       79
Ratnaparkhi's Maximum-Entropy Learning       89       81
7
Challenges with Indian Languages.
  • Poor disambiguation between certain POS
    categories, for example
  • NNP vs. NNC !! (Error Type 1)
  • JJ vs. NN !! (Error Type 2)
  • Inflectional richness of the language.
  • Absence of markers such as the capitalization
    of proper nouns.
  • Is that Raj ? (without capitalization, "raj"
    could equally be a common noun)

8
On Improving the performance for Hindi and other
South Asian Languages.
  • There are two ways
  • Improving the classifying information, by
    using better features, language-specific
    information, or both.
  • Improving the learning, through better
    training and better inference.

9
A. POS Tagging
  • For better training and inference.
  • Approach 1: Training on a hierarchical
    structure of tags.
  • Approach 2: Building a knowledge database from
    raw / un-annotated text to use as a look-up.

10
Approach 1: Training on a Hierarchical Tagset
  • Training in steps, on a hierarchical structure
    of classes.
  • Training proceeds level by level: Level 1,
    then Level 2.

11
Approach 1: Training on a Hierarchical Tagset
  • The approach was devised to minimize the number
    of errors made within a family of classes.
  • Result
  • 73.33% accuracy.
  • Reasons
  • No mechanism to correct errors from the first
    step of training.
  • Errors from step 1 distort the language
    constructs seen while training in step 2.

12
Approach 2: Building a Knowledge Database for
Look-up
  • The knowledge database consists of words and
    the POS tags they are known to have occurred
    with.
  • Why is it important?
  • Inflectional richness vs. per-class ambiguity.

13
Building the knowledge database
  • Adding words and their POS tags from the
    training data.
  • Training on 22,000 words with gold-standard
    POS tags, creating training model A.
  • Using model A to annotate raw text
    consisting of 2 lakh (200,000) words.
  • Extracting the word / POS-tag pairs tagged
    with a very high confidence measure, and adding
    them to the database.
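The collection step above can be sketched as follows. The triples stand in for model A's output on the raw text, and the 0.95 cutoff is an assumed value (the slides only say "very high confidence"):

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.95  # assumed; the slides give no exact cutoff

def build_database(tagged_corpus, database=None):
    """Collect word -> {tag: count} from (word, tag, confidence) triples,
    keeping only tokens tagged with high confidence."""
    database = database if database is not None else defaultdict(dict)
    for word, tag, conf in tagged_corpus:
        if conf >= CONFIDENCE_THRESHOLD:
            database[word][tag] = database[word].get(tag, 0) + 1
    return database

# Stand-in for model A's annotations on the raw 2-lakh-word text.
raw_output = [("raj", "NN", 0.97), ("raj", "NNP", 0.98),
              ("ki", "PSP", 0.99),
              ("raj", "NN", 0.60)]  # low confidence: not added
db = build_database(raw_output)
```

The same function can first be run over the gold-standard training data (with confidence 1.0) so the database contains both sources, as the slide describes.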

14
Using the knowledge database
  • For the final tagging
  • We use model A to get the probability of each
    tag being associated with a word,
  • i.e. P(tag_i | word) for every tag,
  • for every word in the test data.
  • If a word is found in the database, we choose
    the tag in its entry which has the highest
    probability.
  • If not found, we let the tag predicted in the
    first run remain unchanged.
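The look-up rule above can be sketched as a small function. The database contents and the P(tag | word) values are stand-ins for model A's output, and the function name is my own:

```python
def final_tag(word, first_run_tag, tag_probs, database):
    """Pick the final tag for one token.

    tag_probs: model A's P(tag | word) for every tag (stubbed here).
    If the word is in the database, choose the stored tag with the
    highest model probability; otherwise keep the first-run tag."""
    if word in database:
        return max(database[word], key=lambda t: tag_probs.get(t, 0.0))
    return first_run_tag

database = {"raj": {"NN": 3, "NNP": 5}}      # tags seen for "raj"
probs = {"NN": 0.30, "NNP": 0.55, "JJ": 0.15}  # illustrative P(tag | word)

tag = final_tag("raj", "JJ", probs, database)     # in database: overridden
unseen = final_tag("ghar", "NN", probs, database)  # not in database: kept
```

Note the database restricts the choice to tags the word has actually been seen with, which is how it counteracts per-class ambiguity.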

15
Approach 2
  • Result
  • 84.90% accuracy.

16
Training for Model A
  • We use a linear-chain implementation of
    Conditional Random Fields (Taku Kudo et al., 2005).
  • We use simple, language-independent features
  • Word window [-2, 2].
  • Suffix information, i.e. the last 2, 3, and 4
    characters.
  • Presence of special characters.
  • Word length.
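The feature list above can be sketched as an extraction function. The feature names and the padding token are my own choices, not from the slides:

```python
import re

def token_features(words, i):
    """Language-independent features for position i, per the slide:
    word window [-2, +2], suffixes of the last 2/3/4 characters,
    presence of special characters, and word length."""
    w = words[i]
    feats = {"word": w, "suffix2": w[-2:], "suffix3": w[-3:],
             "suffix4": w[-4:], "length": len(w),
             "has_special": bool(re.search(r"[^\w]", w, re.UNICODE))}
    for offset in (-2, -1, 1, 2):  # surrounding word window
        j = i + offset
        feats[f"word{offset:+d}"] = words[j] if 0 <= j < len(words) else "<PAD>"
    return feats

f = token_features(["raam", "ghar", "jaataa", "hai", "."], 2)
```

Nothing here is specific to Hindi, which is what makes the feature set portable across the South Asian languages in the study.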

17
B. Chunking
  • We follow the approach used by Anirudh and
    Himanshu (2006, NWAI).
  • Two-step training
  • Training on a boundary-label scheme for
    extracting chunk labels.
  • Training on boundaries with the added
    information of chunk labels.

18
Chunking cont.
  • Training for identifying chunk tags is also
    done using a linear-chain implementation of CRF.
  • Features
  • Word window of [-2, 2].
  • POS-tag window of [-2, 2].
  • Chunk labels in a window of [-2, 0], for
    chunk-boundary identification.
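The chunking feature template above can be sketched the same way as the POS features. The chunk labels in the [-2, 0] window come from the first training step; names and padding are again my own:

```python
def chunk_features(words, pos_tags, chunk_labels, i):
    """Features for chunk-boundary prediction at position i (sketch):
    word and POS-tag windows of [-2, +2], plus chunk labels from the
    first step in a window of [-2, 0], as listed on the slide."""
    feats = {}
    n = len(words)
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < n
        feats[f"word{offset:+d}"] = words[j] if inside else "<PAD>"
        feats[f"pos{offset:+d}"] = pos_tags[j] if inside else "<PAD>"
        if offset <= 0:  # chunk-label window is [-2, 0]
            feats[f"chunk{offset:+d}"] = chunk_labels[j] if inside else "<PAD>"
    return feats

f = chunk_features(["raam", "ghar", "jaataa"], ["NNP", "NN", "VM"],
                   ["NP", "NP", "VG"], 1)
```

Because the current chunk label (offset 0) is available from step one, the boundary model can condition on it directly.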

19
Chunking
  • Result
  • 92.69% accuracy.

20
Consolidated Results
  • The results below are calculated on the
    development data.

Accuracy (%)    Hindi   Telugu   Bengali
POS Tagging     84.90   71.22    81.09
Chunking        92.69   91.77    94.90
21
Conclusions
  • Train on a tag-set that is optimal for
    capturing the language's patterns.
  • If training is done in more than one step,
    especially such that tags in the subsequent step
    depend directly on the tags in the present step,
    then it is important that there exist a way to
    re-tag the mis-tagged tokens.

22
References
  • Charles Sutton. An Introduction to Conditional
    Random Fields for Relational Learning.
  • Adwait Ratnaparkhi. 1998. Maximum Entropy Models
    for Natural Language Ambiguity Resolution. PhD
    dissertation, Computer and Information Science,
    University of Pennsylvania.
  • Akshay Singh, Sushma Bendre, Rajeev Sangal.
    2005. HMM Based Chunker for Hindi. IIIT Hyderabad.
  • Thorsten Brants. 2000. TnT - A Statistical
    Part-of-Speech Tagger. Proceedings of the Sixth
    Conference on Applied Natural Language
    Processing, 224-231.
  • Himanshu Agrawal, Anirudh Mani. 2006. Part Of
    Speech Tagging and Chunking Using Conditional
    Random Fields. Proceedings of the NLPAI ML
    Contest Workshop, National Workshop on
    Artificial Intelligence.