Title: Shallow Parsing for South Asian Languages
Slide 1: Shallow Parsing for South Asian Languages
Slide 2: Shallow Parsing
- Part-of-Speech (POS) Tagging
  - Assigning grammatical classes to words in a natural language sentence.
- Text Chunking
  - Dividing the text into syntactically correlated groups of words.
- Example
  - [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] .
Slide 3: Applications
- Direct Applications
  - Automatic spell-checking software
  - Grammar suggestions (e.g., MS Word pop-ups)
  - Full parsing
- Indirect Applications
  - Machine translation systems
  - Web search
Slide 4: Nature of the Problem of Shallow Parsing
- A classic problem of classifying input tokens into given classes.
- The sequence aspect (see the sketch below):
  - The sequence of best classes, versus
  - The best sequence of classes.
- Typically, the classifying information is the language context of the word under consideration.
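To make the distinction concrete, here is a minimal toy sketch in Python; all tags and scores are invented for illustration and do not come from the slides. A greedy decoder picks the best class at each token independently ("sequence of best classes"), while Viterbi decoding finds the jointly best sequence ("best sequence of classes").

```python
import numpy as np

tags = ["DT", "NN", "VB"]
# emission[i, t]: score of tag t at token i; transition[s, t]: score of s -> t.
# These numbers are made up purely to show that the two decoders can disagree.
emission = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.5, 0.4],
                     [0.2, 0.3, 0.5]])
transition = np.array([[0.1, 0.8, 0.1],
                       [0.3, 0.2, 0.5],
                       [0.4, 0.5, 0.1]])

# Greedy: choose the best tag at each position independently.
greedy = [tags[i] for i in emission.argmax(axis=1)]

# Viterbi: choose the jointly best tag sequence.
n, k = emission.shape
score = np.zeros((n, k))
back = np.zeros((n, k), dtype=int)
score[0] = emission[0]
for i in range(1, n):
    cand = score[i - 1][:, None] * transition * emission[i][None, :]
    back[i] = cand.argmax(axis=0)
    score[i] = cand.max(axis=0)
best = [int(score[-1].argmax())]
for i in range(n - 1, 0, -1):
    best.append(int(back[i][best[-1]]))
viterbi = [tags[t] for t in reversed(best)]

print("greedy :", greedy)
print("viterbi:", viterbi)
```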
Slide 5: Shallow Parsing for English
- The problem has been studied extensively for English.
- Very efficient systems exist. Examples:
  - Brill's tagger (1995), Transformation-Based Learning.
  - Adwait Ratnaparkhi (1999), parsing with Maximum Entropy.
- Significant effect on the development of MT systems for European languages.
Slide 6: Shallow Parsing for South Asian Languages
- Portability of shallow parsing systems across languages? NOT GOOD!
- Reason: inflectional richness of the languages.
- Setup: training on 22,000 words and testing on 5,000 words.

  POS tagging only                          English   Hindi
  Brill's Transformation-Based Learning       87        79
  Ratnaparkhi's Maximum Entropy Learning      89        81
Slide 7: Challenges with Indian Languages
- Poor disambiguation between certain POS class categories, for example:
  - NNP vs. NNC (Error Type 1)
  - JJ vs. NN (Error Type 2)
- Inflectional richness of the language.
- Absence of markers such as the capitalization of proper nouns.
  - "Is that Raj?" (Indian scripts have no upper/lower case, so nothing signals that "raj" is a proper noun rather than a common noun.)
Slide 8: On Improving the Performance for Hindi and Other South Asian Languages
- There are two ways:
  - Improving the classifying information, by using better features, language-specific information, or both.
  - Improving the learning, through better training and better inference.
Slide 9: A. POS Tagging
- For better training and inference:
  - Approach 1: Training on a hierarchical structure of tags.
  - Approach 2: Building a knowledge database from raw/unannotated text to use as a lookup.
Slide 10: Approach 1: Training on a Hierarchical Tagset
- Training in steps, on a hierarchical structure of classes.
- [Diagram: the tag hierarchy, with training Level 1 over the coarse family classes and Level 2 over the fine-grained tags within each family.]
Slide 11: Approach 1: Training on a Hierarchical Tagset
- The approach was devised to minimize the number of errors made within a family class.
- Result: 73.33
- Reasons for the low score:
  - No mechanism to correct errors made in part 1 of the training.
  - Errors carried over from part 1 jitter the language constructs seen while training in part 2.
Slide 12: Approach 2: Building a Knowledge Database for Lookup
- The knowledge database consists of words and the POS tags each word is known to have occurred with.
- Why is it important?
  - Inflectional richness vs. per-class ambiguity: the languages have many word forms, but each individual form tends to occur with only a few tags, so a lookup can resolve much of the ambiguity.
Slide 13: Building the Knowledge Database
- Adding words and their POS tags from the training data:
  - Training on 22,000 words with gold-standard POS tags, creating a training model A.
  - Using model A to annotate raw text consisting of 200,000 (2 lakh) words.
  - Extracting the word/POS-tag pairs of words tagged with a very high confidence measure, and adding them to the database (see the sketch below).
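A minimal sketch of the database construction, under an assumed model interface: model_a.tag(sentence) is taken to return a (tag, confidence) pair per token, and the 0.95 threshold is a stand-in for the slides' "very high confidence measure", whose actual value is not given.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.95  # assumed value; the slides only say "very high"

def build_knowledge_db(gold_corpus, raw_sentences, model_a):
    """gold_corpus: iterable of (word, gold_tag) pairs from the 22,000-word data.
    raw_sentences: iterable of token lists from the 200,000-word raw text."""
    db = defaultdict(set)  # word -> set of POS tags it is known to occur with

    # 1. Seed with gold-standard word/tag pairs from the training data.
    for word, tag in gold_corpus:
        db[word].add(tag)

    # 2. Tag the raw text with model A and keep only confident predictions.
    for sentence in raw_sentences:
        for word, (tag, conf) in zip(sentence, model_a.tag(sentence)):
            if conf >= CONFIDENCE_THRESHOLD:
                db[word].add(tag)
    return db
```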
Slide 14: Using the Knowledge Database
- For the final tagging (see the sketch below):
  - We use model A to get the probability of each tag being associated with a word, i.e., P(tag_i | word), for every tag and every word in the test data.
  - If a word is found in the database, we choose the tag in its entry which has the highest probability.
  - If it is not found, we let the tag predicted in the first run remain unchanged.
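A sketch of the final tagging step under the same assumed interface as before; tag_distribution(), returning a dict of P(tag | word) per token, is a hypothetical method used only to illustrate the lookup logic described above.

```python
def final_tags(sentence, model_a, db):
    first_run = model_a.tag(sentence)            # (tag, confidence) per token
    dists = model_a.tag_distribution(sentence)   # dict tag -> prob, per token
    result = []
    for word, (pred, _), dist in zip(sentence, first_run, dists):
        if word in db:
            # Restrict to tags the word is known to occur with,
            # then pick the one model A scores highest.
            result.append(max(db[word], key=lambda t: dist.get(t, 0.0)))
        else:
            # Unknown word: keep the first-run prediction unchanged.
            result.append(pred)
    return result
```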
Slide 15: Approach 2
- [Diagram: overview of the Approach 2 pipeline.]
Slide 16: Training for Model A
- We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005).
- We use simple, language-independent features (sketched below):
  - Word window [-2, 2].
  - Suffix information: the last 2, 3, and 4 characters.
  - Presence of special characters.
  - Word length.
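The feature set above could be realized roughly as follows. This is a sketch only: the feature names, padding token, and dictionary format are illustrative, not the actual CRF++ template syntax.

```python
def token_features(words, i):
    """Extract the slide's language-independent features for position i."""
    w = words[i]
    feats = {
        "word.len": len(w),                                  # word length
        "word.has_special": any(not c.isalnum() for c in w), # special chars
    }
    # Word window [-2, 2] around position i, padded at sentence edges.
    for off in range(-2, 3):
        j = i + off
        feats[f"word[{off}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    # Suffixes of the current word: last 2, 3 and 4 characters.
    for n in (2, 3, 4):
        feats[f"suffix{n}"] = w[-n:]
    return feats
```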
Slide 17: B. Chunking
- We follow the approach of Agrawal and Mani (2006, NWAI).
- Two-step training (an illustrative encoding follows this list):
  - Training on a Boundary-Label scheme for extracting chunk labels.
  - Training on boundaries, with the added information of chunk labels.
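A toy illustration of how the two steps could relate under a BIO-style Boundary-Label encoding; the exact tag inventory used in the paper is not shown on the slides, so B-NP, I-NP, etc. are assumptions.

```python
# Step 1 predicts combined Boundary-Label tags; chunk labels are read off
# from them. Step 2 then predicts bare boundaries, with the step-1 chunk
# labels available as features.
words      = ["He", "reckons", "the", "current", "account", "deficit"]
step1_tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]

chunk_labels = [t.split("-")[1] for t in step1_tags]  # ['NP', 'VP', 'NP', ...]
boundaries   = [t.split("-")[0] for t in step1_tags]  # ['B', 'B', 'B', 'I', ...]
print(list(zip(words, boundaries, chunk_labels)))
```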
Slide 18: Chunking (cont.)
- Training for identifying chunk tags is also done using a linear-chain implementation of CRF.
- Features (sketched below):
  - Word window [-2, 2].
  - POS tag window [-2, 2].
  - Chunk labels, for chunk boundary identification: window [-2, 0].
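Again as a hedged sketch, the feature windows on this slide might look as follows for the boundary-identification step; the feature names and padding token are illustrative.

```python
def chunk_features(words, pos_tags, chunk_labels, i):
    """Features for chunk boundary identification at position i."""
    feats = {}
    n = len(words)
    for off in range(-2, 3):           # word and POS windows [-2, 2]
        j = i + off
        feats[f"word[{off}]"] = words[j] if 0 <= j < n else "<PAD>"
        feats[f"pos[{off}]"] = pos_tags[j] if 0 <= j < n else "<PAD>"
    for off in range(-2, 1):           # chunk-label window [-2, 0]
        j = i + off
        feats[f"chunk[{off}]"] = chunk_labels[j] if 0 <= j < n else "<PAD>"
    return feats
```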
Slide 19: Chunking
- [Diagram: overview of the chunking pipeline.]
Slide 20: Consolidated Results
- The results below are calculated on the development data.

                 Hindi   Telugu   Bengali
  POS Tagging    84.90   71.22    81.09
  Chunking       92.69   91.77    94.90
Slide 21: Conclusions
- Train on a tag-set that is optimal for capturing the language patterns.
- If training is done in more than one step, especially such that the tags in a subsequent step depend directly on the tags in the present step, then it is important that there be a way to re-tag the mis-tagged tokens.
Slide 22: References
- Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning.
- Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD dissertation, Computer and Information Science, University of Pennsylvania.
- Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.
- Thorsten Brants. 2000. TnT: A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224-231.
- Himanshu Agrawal and Anirudh Mani. 2006. Part Of Speech Tagging and Chunking Using Conditional Random Fields. In Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.