Title: Shallow Parsing for South Asian Languages
Slide 1: Shallow Parsing for South Asian Languages
Slide 2: Shallow Parsing
- Part-of-Speech (POS) Tagging
  - Assigning grammatical classes to words in a natural language sentence.
- Text Chunking
  - Dividing the text into syntactically correlated groups of words.
- Example
  - [NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only 1.8 billion] [PP in] [NP September] .
Slide 3: Applications
- Direct Applications
  - Automatic spell-checking software
  - Grammar suggestions (e.g., MS Word pop-ups)
  - Full parsing
- Indirect Applications
  - Machine translation systems
  - Web search
Slide 4: Nature of the Problem of Shallow Parsing
- A classic problem of classifying input tokens into given classes.
- The sequence aspect (see the sketch below):
  - The sequence of best classes, versus
  - The best sequence of classes.
- Typically, the classifying information is the language context of the word under consideration.
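To make the distinction concrete, here is a minimal toy sketch in Python; all tags and scores are invented for illustration and do not come from the slides. A greedy decoder picks the best class at each token independently ("sequence of best classes"), while Viterbi decoding finds the jointly best sequence ("best sequence of classes").

```python
import numpy as np

tags = ["DT", "NN", "VB"]
# emission[i, t]: score of tag t at token i; transition[s, t]: score of s -> t.
# These numbers are made up purely to show that the two decoders can disagree.
emission = np.array([[0.6, 0.3, 0.1],
                     [0.1, 0.5, 0.4],
                     [0.2, 0.3, 0.5]])
transition = np.array([[0.1, 0.8, 0.1],
                       [0.3, 0.2, 0.5],
                       [0.4, 0.5, 0.1]])

# Greedy: choose the best tag at each position independently.
greedy = [tags[i] for i in emission.argmax(axis=1)]

# Viterbi: choose the jointly best tag sequence.
n, k = emission.shape
score = np.zeros((n, k))
back = np.zeros((n, k), dtype=int)
score[0] = emission[0]
for i in range(1, n):
    cand = score[i - 1][:, None] * transition * emission[i][None, :]
    back[i] = cand.argmax(axis=0)
    score[i] = cand.max(axis=0)
best = [int(score[-1].argmax())]
for i in range(n - 1, 0, -1):
    best.append(int(back[i][best[-1]]))
viterbi = [tags[t] for t in reversed(best)]

print("greedy :", greedy)
print("viterbi:", viterbi)
```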
Slide 5: Shallow Parsing for English
- The problem has been studied extensively for English.
- Very efficient systems exist. Examples:
  - Brill's tagger (1995), Transformation-Based Learning.
  - Adwait Ratnaparkhi (1999), parsing with Maximum Entropy.
- Significant effect on the development of MT systems for European languages.
Slide 6: Shallow Parsing for South Asian Languages
- Portability of shallow parsing systems across languages? NOT GOOD!
- Reason: inflectional richness of the languages.
- Setup: training on 22,000 words and testing on 5,000 words.

  POS tagging only                          English   Hindi
  Brill's Transformation-Based Learning       87        79
  Ratnaparkhi's Maximum Entropy Learning      89        81
Slide 7: Challenges with Indian Languages
- Poor disambiguation between certain POS class categories, for example:
  - NNP vs. NNC (Error Type 1)
  - JJ vs. NN (Error Type 2)
- Inflectional richness of the language.
- Absence of markers such as the capitalization of proper nouns.
  - "Is that Raj?" (Indian scripts have no upper/lower case, so nothing signals that "raj" is a proper noun rather than a common noun.)
Slide 8: On Improving the Performance for Hindi and Other South Asian Languages
- There are two ways:
  - Improving the classifying information, by using better features, language-specific information, or both.
  - Improving the learning, through better training and better inference.
Slide 9: A. POS Tagging
- For better training and inference:
  - Approach 1: Training on a hierarchical structure of tags.
  - Approach 2: Building a knowledge database from raw/unannotated text to use as a lookup.
Slide 10: Approach 1: Training on a Hierarchical Tagset
- Training in steps, on a hierarchical structure of classes.
- [Diagram: the tag hierarchy, with training Level 1 over the coarse family classes and Level 2 over the fine-grained tags within each family.]
Slide 11: Approach 1: Training on a Hierarchical Tagset
- The approach was devised to minimize the number of errors made within a family class.
- Result: 73.33
- Reasons for the low score:
  - No mechanism to correct errors made in part 1 of the training.
  - Errors carried over from part 1 jitter the language constructs seen while training in part 2.
Slide 12: Approach 2: Building a Knowledge Database for Lookup
- The knowledge database consists of words and the POS tags each word is known to have occurred with.
- Why is it important?
  - Inflectional richness vs. per-class ambiguity: the languages have many word forms, but each individual form tends to occur with only a few tags, so a lookup can resolve much of the ambiguity.
Slide 13: Building the Knowledge Database
- Adding words and their POS tags from the training data:
  - Training on 22,000 words with gold-standard POS tags, creating a training model A.
  - Using model A to annotate raw text consisting of 200,000 (2 lakh) words.
  - Extracting the word/POS-tag pairs of words tagged with a very high confidence measure, and adding them to the database (see the sketch below).
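A minimal sketch of the database construction, under an assumed model interface: model_a.tag(sentence) is taken to return a (tag, confidence) pair per token, and the 0.95 threshold is a stand-in for the slides' "very high confidence measure", whose actual value is not given.

```python
from collections import defaultdict

CONFIDENCE_THRESHOLD = 0.95  # assumed value; the slides only say "very high"

def build_knowledge_db(gold_corpus, raw_sentences, model_a):
    """gold_corpus: iterable of (word, gold_tag) pairs from the 22,000-word data.
    raw_sentences: iterable of token lists from the 200,000-word raw text."""
    db = defaultdict(set)  # word -> set of POS tags it is known to occur with

    # 1. Seed with gold-standard word/tag pairs from the training data.
    for word, tag in gold_corpus:
        db[word].add(tag)

    # 2. Tag the raw text with model A and keep only confident predictions.
    for sentence in raw_sentences:
        for word, (tag, conf) in zip(sentence, model_a.tag(sentence)):
            if conf >= CONFIDENCE_THRESHOLD:
                db[word].add(tag)
    return db
```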
Slide 14: Using the Knowledge Database
- For the final tagging (see the sketch below):
  - We use model A to get the probability of each tag being associated with a word, i.e., P(tag_i | word), for every tag and every word in the test data.
  - If a word is found in the database, we choose the tag in its entry which has the highest probability.
  - If it is not found, we let the tag predicted in the first run remain unchanged.
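A sketch of the final tagging step under the same assumed interface as before; tag_distribution(), returning a dict of P(tag | word) per token, is a hypothetical method used only to illustrate the lookup logic described above.

```python
def final_tags(sentence, model_a, db):
    first_run = model_a.tag(sentence)            # (tag, confidence) per token
    dists = model_a.tag_distribution(sentence)   # dict tag -> prob, per token
    result = []
    for word, (pred, _), dist in zip(sentence, first_run, dists):
        if word in db:
            # Restrict to tags the word is known to occur with,
            # then pick the one model A scores highest.
            result.append(max(db[word], key=lambda t: dist.get(t, 0.0)))
        else:
            # Unknown word: keep the first-run prediction unchanged.
            result.append(pred)
    return result
```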
Slide 15: Approach 2
- [Diagram: overview of the Approach 2 pipeline.]
Slide 16: Training for Model A
- We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005).
- We use simple, language-independent features (sketched below):
  - Word window [-2, 2].
  - Suffix information: the last 2, 3, and 4 characters.
  - Presence of special characters.
  - Word length.
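The feature set above could be realized roughly as follows. This is a sketch only: the feature names, padding token, and dictionary format are illustrative, not the actual CRF++ template syntax.

```python
def token_features(words, i):
    """Extract the slide's language-independent features for position i."""
    w = words[i]
    feats = {
        "word.len": len(w),                                  # word length
        "word.has_special": any(not c.isalnum() for c in w), # special chars
    }
    # Word window [-2, 2] around position i, padded at sentence edges.
    for off in range(-2, 3):
        j = i + off
        feats[f"word[{off}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    # Suffixes of the current word: last 2, 3 and 4 characters.
    for n in (2, 3, 4):
        feats[f"suffix{n}"] = w[-n:]
    return feats
```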
Slide 17: B. Chunking
- We follow the approach of Agrawal and Mani (2006, NWAI).
- Two-step training (an illustrative encoding follows this list):
  - Training on a Boundary-Label scheme for extracting chunk labels.
  - Training on boundaries, with the added information of chunk labels.
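A toy illustration of how the two steps could relate under a BIO-style Boundary-Label encoding; the exact tag inventory used in the paper is not shown on the slides, so B-NP, I-NP, etc. are assumptions.

```python
# Step 1 predicts combined Boundary-Label tags; chunk labels are read off
# from them. Step 2 then predicts bare boundaries, with the step-1 chunk
# labels available as features.
words      = ["He", "reckons", "the", "current", "account", "deficit"]
step1_tags = ["B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP"]

chunk_labels = [t.split("-")[1] for t in step1_tags]  # ['NP', 'VP', 'NP', ...]
boundaries   = [t.split("-")[0] for t in step1_tags]  # ['B', 'B', 'B', 'I', ...]
print(list(zip(words, boundaries, chunk_labels)))
```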
Slide 18: Chunking (cont.)
- Training for identifying chunk tags is also done using a linear-chain implementation of CRF.
- Features (sketched below):
  - Word window [-2, 2].
  - POS tag window [-2, 2].
  - Chunk labels, for chunk boundary identification: window [-2, 0].
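Again as a hedged sketch, the feature windows on this slide might look as follows for the boundary-identification step; the feature names and padding token are illustrative.

```python
def chunk_features(words, pos_tags, chunk_labels, i):
    """Features for chunk boundary identification at position i."""
    feats = {}
    n = len(words)
    for off in range(-2, 3):           # word and POS windows [-2, 2]
        j = i + off
        feats[f"word[{off}]"] = words[j] if 0 <= j < n else "<PAD>"
        feats[f"pos[{off}]"] = pos_tags[j] if 0 <= j < n else "<PAD>"
    for off in range(-2, 1):           # chunk-label window [-2, 0]
        j = i + off
        feats[f"chunk[{off}]"] = chunk_labels[j] if 0 <= j < n else "<PAD>"
    return feats
```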
Slide 19: Chunking
- [Diagram: overview of the chunking pipeline.]
Slide 20: Consolidated Results
- The results below are calculated on the development data.

                 Hindi   Telugu   Bengali
  POS Tagging    84.90   71.22    81.09
  Chunking       92.69   91.77    94.90
Slide 21: Conclusions
- Train on a tag-set that is optimal for capturing the language patterns.
- If training is done in more than one step, especially such that the tags in a subsequent step depend directly on the tags in the present step, then it is important that there be a way to re-tag the mis-tagged tokens.
Slide 22: References
- Charles Sutton and Andrew McCallum. An Introduction to Conditional Random Fields for Relational Learning.
- Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. PhD dissertation, Computer and Information Science, University of Pennsylvania.
- Akshay Singh, Sushma Bendre, and Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.
- Thorsten Brants. 2000. TnT: A Statistical Part-of-Speech Tagger. In Proceedings of the Sixth Conference on Applied Natural Language Processing, pages 224-231.
- Himanshu Agrawal and Anirudh Mani. 2006. Part Of Speech Tagging and Chunking Using Conditional Random Fields. In Proceedings of the NLPAI ML Contest Workshop, National Workshop on Artificial Intelligence.