Title: Work at TACOLA Lab
1. Work at TACOLA Lab
- Team Members
- T.V.Geetha, Ranjani Parthasarathi, Madhan Karky
- E.UmaMaheswari, J.Balaji, Subalalitha
- Elanchezhiyan.K, Karthika, Thenmalar, Radhakrishnan, Kandasamy, Padmavathi, Aruna, Vijayavani
2. Tamil Language Processing
- Tamil Language Processing
- Morphological analyser
- Normal Words, Compound Words, Colloquial Words
- Parser
- Simple, Complex and Compound Sentences
- Semantic analysis based on UNL
- Language Technology
- Blog Mining
- Ontology Based Information Extraction
- Personalized Search
- Parallelization for NLP Processing
- Emotion detection from text
- Carnatic Music Processing
- Raga Modelling
- Singer, Genre Identification
- Music Emotion Recognition
- Tamil Language Oriented Tools
- Dictionary
- Text Compaction
- UNL Based Work
- UNL for semantic representation
- Nested UNL
- Concept based Search
- Bi-lingual Search
- Event Processing
- Discourse Analysis
- Summarization
- Question answering
- Thirukural Search
- Lyric Oriented Processing
- Lyric Mining
- Lyrics for Tunes
- Pleasantness
3. Papers for TIC 2011
- Tamil Language Oriented Tools
- Agaraadhi: A Novel Online Dictionary Framework
- An Efficient Tamil Text Compaction System (Surukkupai)
- Kuralagam: A Concept Relation Based Search Framework for Thirukural
- Popularity Based Scoring Model for Tamil Word Games
- Tamil Language Processing
- Template based Multilingual Summary Generation
- On Emotion Detection from Tamil Text
- Tamil Summary Generation for Cricket Match
- Lyric Oriented Processing
- Lyric Mining: Word, Rhyme and Concept Co-occurrence Analysis
- Special Indices for LaaLaLaa Lyric Analysis and Generation Framework
4. AGARAADHI: A NOVEL ONLINE DICTIONARY FRAMEWORK
- Elanchezhiyan.K
- Karthikeyan.S
- T.V.Geetha
- Ranjani Parthasarathi
- Madhan Karky
5. OBJECTIVES
- Agaraadhi is a dictionary framework for indexing and retrieving Tamil words, their meanings, analysis and related information.
- The framework incorporates various unique features designed to give users additional information about the word they query.
6. INTRODUCTION
- The Agaraadhi dictionary has more than 3 lakh (300,000) words in various domains, such as
- General,
- Literature,
- Medical,
- Engineering,
- Computer Science,
- Bird Names, and more.
- Agaraadhi is a Tamil-English bilingual dictionary.
7. INTRODUCTION CONT.
- Agaraadhi is a Tamil-English bilingual dictionary with 20 features, such as
- morphological analysis,
- morphological generation,
- word usage statistics,
- word pleasantness analysis,
- spell checking,
- similar word finder,
- word usage in literature,
- picture dictionary,
- number to text conversion,
- phonetic transliteration,
- live usage analysis from microblogs, and more.
8. AGARAADHI FRAMEWORK CONT.
9. AGARAADHI FEATURES
- Morphological Analyser
- Gives the morphological features of the query word, such as root word, part of speech, gender, tense and count.
- For the query word padithaan, the Morphological Analyser returns padi as the root and reports that the word is masculine, in the past tense, and so on.
- Morphological Generator
- The Tamil morphological generator handles different syntactic categories such as nouns, verbs, postpositions, adjectives and adverbs.
- The generator is used to generate the possible morphological variations of the query word.
- Spell Checker
- Checks the spelling of Tamil words and provides alternative suggestions for wrongly spelt words.
- If the root word is not in the dictionary, it generates all possible suggestions with minimum variation from the given word.
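The suggestion step of the spell checker can be sketched as follows. The slides do not say how "minimum variation" is measured; this sketch assumes Levenshtein edit distance over romanized words. The lexicon, the function names and the thresholds are hypothetical and are not taken from Agaraadhi.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_distance=2, limit=5):
    """Return the closest lexicon entries; empty if the word is already known."""
    if word in lexicon:
        return []
    scored = sorted((edit_distance(word, w), w) for w in lexicon)
    return [w for d, w in scored if d <= max_distance][:limit]


# Hypothetical romanized mini-lexicon, for illustration only.
LEXICON = {"padi", "padam", "mazhai", "malai", "pookkal"}
print(suggest("mazai", LEXICON))
```

For Tamil script, a real system would probably compare grapheme clusters rather than raw code points, since one Tamil letter can span several code points.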
10. AGARAADHI FEATURES
- Word Suggestions
- Gives a list of equivalent or related words for the given query word.
- Word Pleasantness
- The score generator indicates how easy it is to pronounce the word.
- Word Popularity Score
- Shows the usage of the word on the web, based on the frequency distribution of the word across popular blogs, news articles, social networks, etc.
- Word Usage Statistics
- Shows the usage of the word in social networks over the past week.
- Word Usage in Literature
- Finds the usage of words in popular literature such as Thirukural, Bharathiyar Padalgal and Avvai songs, and also in the lyrics of Tamil movie songs.
11. AGARAADHI FEATURES
- Word of the Day
- A rare word is chosen at random and displayed on the opening page so that users can learn a new word every day.
- Number to Text Converter
- Converts a number to its Tamil word equivalent as well as to English text. For example, in Tamil we use oru Arpputham for 100 million, Kumbam for 10 billion, and so on up to Anniyan for one zillion.
- Picture Dictionary
- Pictures, photos or line drawings depicting popular words have been included in the dictionary so that children can learn efficiently with this tool.
12. RESULTS
- Query word: pookkal (பூக்கள்)
- http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AA%E0%AF%82%E0%AE%95%E0%AF%8D%E0%AE%95%E0%AE%B3%E0%AF%8D&ln=ta&Submit.x=8&Submit.y=7
- Query word: mazhai (மழை)
- http://www.agaraadhi.com/dict/OD.jsp?w=%E0%AE%AE%E0%AE%B4%E0%AF%88&ln=ta&Submit.x=21&Submit.y=4
- Query word: fruit
- http://www.agaraadhi.com/dict/OD.jsp?w=fruit&ln=en
13. FUTURE WORK
- Providing APIs for programmers and developing mobile apps for the Agaraadhi framework will open up a good platform for the many researchers and developers working in Tamil computing.
14. REFERENCE
- Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
- Anandan, R. Parthasarathi, and Geetha, Morphological Generator for Tamil. Tamil Inayam, Malaysia, 2001.
- J. Jai Hari Raju, P. IndhuReka, Dr. Madhan Karky, Statistical Analysis and Visualization of Tamil Usage in Live Text Streams, Tamil Internet Conference, Coimbatore, 2010.
15. An Efficient Tamil Text Compaction System
- N.M.Revathi
- G.P.Shanthi
- Elanchezhiyan.K
- T V Geetha
- Ranjani Parthasarathi
- Madhan Karky
16. OBJECTIVES
- Why compact?
- Message length is limited on blog sites, and mobile phones have tiny user interfaces.
- Compaction saves online storage space and hence reduces cost.
- The paper proposes
- a text compaction system for Tamil, the first of its kind for the language.
- Idea of compaction
- There is no specific rule for obtaining the shortest form of a word; the main aim is that the result stays understandable.
- A compact form can be obtained by omitting letters and by replacing prefixes and suffixes with suitable symbols and numbers.
17. FRAMEWORK ARCHITECTURE
18. FRAMEWORK CONT.
- Input Processing
- The morphological analyzer removes the suffix (if
present) added to the word and delivers the root
word (RW).
19. FRAMEWORK CONT.
- Identification of the category and extraction of the compact word
- Words fall into three categories: common Tamil words, abbreviations/acronyms, and numbers.
- Abbreviations/acronyms are identified by comparing the word with the keys of the hashmap.
- With the help of the hash key and a mapping algorithm, the compact word is retrieved.
- Otherwise the word is either a common Tamil word or a number.
- If it is a number, the numerical analyser performs text-to-number conversion.
- Output Processing
- The Tamil Morphological Generator adds the suitable suffix so that the output follows the rules of the language.
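The category lookup described on this slide can be sketched as below. This is an illustrative sketch only: the three dictionaries, their entries and the function name are hypothetical, the root and suffix are assumed to come from the morphological analyser, and plain string concatenation stands in for the morphological generator.

```python
# Hypothetical lookup tables (romanized); not the data used by the system.
ABBREVIATIONS = {"tamil nadu": "TN", "chief minister": "CM"}
NUMBER_WORDS = {"ondru": "1", "irandu": "2", "moondru": "3"}
COMMON_WORDS = {"vanakkam": "vnkm"}


def compact(root: str, suffix: str = ""):
    """Classify a root word and return (category, compact form + suffix)."""
    key = root.lower()
    if key in ABBREVIATIONS:            # hashmap key match: abbreviation/acronym
        category, short = "abbreviation", ABBREVIATIONS[key]
    elif key in NUMBER_WORDS:           # text-to-number conversion
        category, short = "number", NUMBER_WORDS[key]
    else:                               # common word; fall back to the root itself
        category, short = "common", COMMON_WORDS.get(key, root)
    # Stand-in for the morphological generator: re-attach the suffix.
    return category, short + suffix


print(compact("Tamil Nadu"))
print(compact("irandu"))
print(compact("vanakkam", "um"))
```

Keeping each category in its own table mirrors the order of checks on the slide: abbreviations first, then numbers, then common words.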
20. RESULT AND ANALYSIS
- Tested with over 10,000 words.
- The final result is reduced to 40% of the original text.
21. REFERENCES
- Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
- Fung, L. M. (2005). SMS short form identification and codec. Unpublished master's thesis, National University of Singapore, Singapore.
- Acrophile (L. S. Larkey, P. Ogilvie, M. A. Price, B. Tamilio, 2000): a system that automatically searches acronym expansion pairs.
- Short Message Service (SMS) Texting Symbols: A Functional Analysis of 10,000 Cellular Phone Text Messages, by Robert E. Beasley, Franklin College.
22. Kuralagam - Concept Relation based Search Engine for Thirukkural
- Elanchezhiyan.K
- T.V.Geetha
- Ranjani Parthasarathi
- Madhan Karky
23. Objectives
- Kuralagam is a conceptual search framework for Thirukkural based on the UNL framework.
- Searching with keywords in kurals and interpretations
- Concept based search using CoReX conceptual indexing based on UNL
- Bilingual search: English and Tamil
- Showing relationships between the concepts.
24. Kuralagam Framework
25. Offline Processing
- Web Crawler
- A Thirukkural statistics crawler
- crawls news and blog documents to find the usage of each individual Thirukkural.
- The usage is recorded to compute the popularity score of each Thirukkural.
- Enconversion based on UNL
- Indexing based on the CoReX framework
26. UNL Enconversion
- UNL is an intermediate language
- that processes knowledge across language barriers.
- It captures semantics by converting the natural language terms present in the document to concepts.
- Concepts are connected to other concepts through UNL relations.
- There are 46 UNL relations,
- e.g. plf (place from), plt (place to), tmf (time from), tmt (time to), etc.
- The process of converting a natural language text to a UNL graph is known as enconversion;
- the reverse process is known as deconversion.
27. An Example speaks more...
- Ex: John was playing in the garden
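The graph shown on the original slide did not survive extraction. As an illustrative sketch, one plausible encoding of the example sentence uses the UNL relations agt (agent) and plc (place); the exact universal words and attributes below are assumptions, and the Python names are hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    label: str    # UNL relation label, e.g. "agt" or "plc"
    source: str   # key of the source concept
    target: str   # key of the target concept


# Concepts written as universal words with restrictions and attributes (assumed).
NODES = {
    "play": "play(icl>do).@entry.@past.@progress",
    "John": "John(iof>person)",
    "garden": "garden(icl>place).@def",
}

# "John was playing in the garden" as a two-edge graph.
GRAPH = [
    Relation("agt", "play", "John"),     # John is the agent of play
    Relation("plc", "play", "garden"),   # the garden is the place of play
]


def to_unl(graph, nodes):
    """Render each edge in the usual relation(source, target) notation."""
    return [f"{r.label}({nodes[r.source]}, {nodes[r.target]})" for r in graph]


for line in to_unl(GRAPH, NODES):
    print(line)
```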
28. Indexer
- The Kuralagam Indexer is designed based on CoReX techniques.
- The Indexer stores and manages the UNL graphs in two different indices:
- Concept only index (C index), and
- Concept-Relation-Concept index (CRC index)
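A minimal sketch of the two indices, assuming each UNL graph is available as a list of (concept, relation, concept) triples. The data layout, the function name and the toy triples are hypothetical; they are not CoReX's actual storage format or real Thirukkural content.

```python
from collections import defaultdict


def build_indices(graphs):
    """graphs maps a document id to its list of (concept, relation, concept) triples."""
    c_index = defaultdict(set)     # concept -> ids of documents containing it
    crc_index = defaultdict(set)   # (concept, relation, concept) -> document ids
    for doc_id, triples in graphs.items():
        for src, rel, dst in triples:
            c_index[src].add(doc_id)
            c_index[dst].add(doc_id)
            crc_index[(src, rel, dst)].add(doc_id)
    return c_index, crc_index


# Toy graphs with made-up ids and triples, for illustration only.
GRAPHS = {
    "k1": [("learn", "obj", "knowledge")],
    "k2": [("learn", "obj", "knowledge"), ("knowledge", "mod", "wealth")],
}
C_INDEX, CRC_INDEX = build_indices(GRAPHS)
print(sorted(C_INDEX["knowledge"]))
print(sorted(CRC_INDEX[("knowledge", "mod", "wealth")]))
```

A CRC entry matches only when both concepts and the relation between them agree, so it is the more specific of the two indices.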
29. Online Processing
- Query Translation and Expansion
- Converts the user query to a UNL graph.
- Uses the CRC (Concept-Relation-Concept) CoReX indices to fetch the similarity thesaurus and co-occurrence list that populate the multi-list data structure.
- Search and Ranking
- Fetches the Thirukkural number and its details.
- Thirukkurals for a given query are fetched using the two types of concept relation indices, namely CRC and C.
- The query concept is expanded using related CRC indices pointing to the query concept.
- This helps retrieve many Thirukkurals that are conceptually related to the query, which is not possible with keyword-based Thirukkural search engines.
- The ranking is based on
- priority of the indices, in the order CRC > C
- usage score
- frequency of occurrence of the query concept
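The three ranking criteria can be sketched as a sort key. The slides list the criteria but not how they are combined; this sketch assumes a strict lexicographic order (index type first, then usage score, then concept frequency). The class, field names and sample values are hypothetical.

```python
from dataclasses import dataclass

# Assumed priority: a CRC match outranks a concept-only (C) match.
INDEX_PRIORITY = {"CRC": 2, "C": 1}


@dataclass
class Hit:
    kural_id: str
    index: str               # "CRC" or "C": which index produced the hit
    usage_score: float       # popularity from the crawler statistics
    concept_frequency: int   # occurrences of the query concept


def rank(hits):
    """Order hits by index priority, then usage score, then concept frequency."""
    return sorted(
        hits,
        key=lambda h: (INDEX_PRIORITY[h.index], h.usage_score, h.concept_frequency),
        reverse=True,
    )


HITS = [
    Hit("a", "C", 0.9, 3),
    Hit("b", "CRC", 0.2, 1),
    Hit("c", "CRC", 0.2, 4),
]
print([h.kural_id for h in rank(HITS)])
```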
30. Tab Layout
31. Performance Evaluation
- The accuracy of the Thirukkural search engine was measured using average precision and mean average precision.
- Concept based search and keyword based search were compared using the average precision methodology.
32. Average Precision
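The figure for this slide did not survive extraction. A minimal sketch of the standard definitions follows, assuming binary relevance judgements over a ranked result list and normalizing by the number of relevant results retrieved; the function names are hypothetical, and the authors' exact variant is not stated.

```python
def average_precision(relevance):
    """relevance: ranked list of booleans, True where the result is relevant."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            total += hits / k   # precision at rank k
    return total / hits if hits else 0.0


def mean_average_precision(runs):
    """Mean of the average precision over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


print(average_precision([True, False, True]))
print(mean_average_precision([[True], [False, True]]))
```

With relevant results at ranks 1 and 3, the precision values are 1/1 and 2/3, so the average precision is 5/6.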
33. Reference
- 1. Subalalitha, T V Geetha, Ranjani Parthasarathy and Madhan Karky Vairamuthu. CoReX: A Concept Based Semantic Indexing Technique. In SWM-08. 2008. India.
- 2. UNDL Foundation. The Universal Networking Language (UNL) Specifications, Version 3, 3rd ed. UNL Center, UNDL Foundation, December 2004.
- 3. Anandan, R. Parthasarathi, and Geetha, Morphological Analyser for Tamil. ICON 2002, 2002.
- 4. T.Dhanabalan, K.Saravanan, and T.V.Geetha. 2002. Tamil to UNL Enconverter, ICUKL, Goa, India.
- 5. Turpin, A. and F. Scholer. User performance versus precision measures for simple search tasks. In 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006. Seattle, Washington, USA.
34. Template Based MultiLingual Summary Generation
- Subalalitha C.N
- E.Umamaheswari
- T V Geetha
- Ranjani Parthasarathi
- Madhan Karky
35. Aim
- To generate a multilingual summary based on the Universal Networking Language (UNL) framework
36. The Architecture
37. Multi Lingual Summary Generation using UNL
- Template based Information Extraction
- Seven tourism-specific templates have been designed and used.
- Templates are filled using the semantic information inherent in the UNL input graphs.
- Template information is language independent and can be used with any desired language.
38. Example Templates for Tourism Domain
Template | Semantics inherited from UNL
God | iof>god, iof>goddess, icl>god
Food | icl>food, icl>fruit
Flora and Fauna | icl>animal, icl>reptile, icl>mammal, icl>plant
Boarding facility | icl>facility
Transport facility | icl>transport
Place | icl>place, iof>place, iof>city, iof>country
Distance | icl>unit, icl>number
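Template filling from UNL semantics can be sketched as a lookup from a concept's restriction to a template slot. The rule table restates the slide, with the restrictions written as icl>... and iof>...; the input format, the function name and the sample concepts are hypothetical.

```python
# Template -> UNL restrictions, restated from the table on this slide.
TEMPLATE_RULES = {
    "God": {"iof>god", "iof>goddess", "icl>god"},
    "Food": {"icl>food", "icl>fruit"},
    "Flora and Fauna": {"icl>animal", "icl>reptile", "icl>mammal", "icl>plant"},
    "Boarding facility": {"icl>facility"},
    "Transport facility": {"icl>transport"},
    "Place": {"icl>place", "iof>place", "iof>city", "iof>country"},
    "Distance": {"icl>unit", "icl>number"},
}


def fill_templates(concepts):
    """concepts: iterable of (headword, restriction) pairs taken from a UNL graph."""
    filled = {}
    for headword, restriction in concepts:
        for template, restrictions in TEMPLATE_RULES.items():
            if restriction in restrictions:
                filled.setdefault(template, []).append(headword)
    return filled


# Hypothetical concepts, as they might come from an enconverted tourism document.
CONCEPTS = [
    ("Madurai", "iof>city"),
    ("Meenakshi", "iof>goddess"),
    ("mango", "icl>fruit"),
    ("bus", "icl>transport"),
]
print(fill_templates(CONCEPTS))
```

Because the slots hold headwords from the UNL graph rather than surface words, the filled templates stay language independent until a target-language dictionary is applied.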
39. Summary Generation
- The template information is converted to the target language using the respective UNL-target language dictionaries.
- UNL-target language dictionaries contain root words.
- The natural language term is obtained from the root word using target language information such as case suffixes and language technology tools such as the morphological generator.
- When this converted template information is fitted into target-language-specific dynamic sentence patterns, a summary is generated.
40. Performance Evaluation
- Tested with 33,000 Tamil and English text documents enconverted to UNL graphs.
- The performance of the proposed methodology has been evaluated using human judgement.
- The generated summaries achieved an accuracy of 90%.
- Further Enhancements
- Query specific summary
- Comparing the performance with human generated summaries.
41. References
- 1. Elanchezhiyan K, T V Geetha, Ranjani Parthasarathi and Madhan Karky, CoRe: Concept Based Query Expansion, Tamil Internet Conference, Coimbatore, 2010.
- 2. Alkesh Patel, Tanveer Siddiqui, U. S. Tiwary, A language independent approach to multilingual text summarization, Conference RIAO2007, Pittsburgh PA, U.S.A., May 30-June 1, 2007.
- 3. David Kirk Evans, Identifying Similarity in Text: Multi-Lingual Analysis for Summarization, Doctor of Philosophy thesis, Graduate School of Arts and Sciences, Columbia University, 2005.
- 4. Radev, Allison, Blair-Goldensohn et al. (2004), MEAD: a platform for multidocument multilingual text summarization.
- 5. The Universal Networking Language (UNL) Specifications, Version 3, Edition 3, UNL Center, UNDL Foundation, December 2004.
- 6. Jagadeesh J, Prasad Pingali, Vasudeva Varma, Sentence Extraction Based Single Document Summarization, Workshop on Document Summarization, March 2005, IIIT Allahabad.
- 7. Naresh Kumar Nagwani, Dr. Shrish Verma, A Frequent Term and Semantic Similarity based Single Document Text Summarization Algorithm, International Journal of Computer Applications (0975-8887), Volume 17, No. 2, March 2011.
- 8. Prof. R. Nedunchelian, Centroid Based Summarization of Multiple Documents Implemented using Timestamps, First International Conference on Emerging Trends in Engineering and Technology, IEEE, 2008.