Title: CRBLP
1. CRBLP's (Center for Research on Bangla Language
Processing) Activities and Achievements on Bangla
Language Processing, January 2007
- Naushad UzZaman
- CRBLP, BRAC U, Bangladesh
- http://www.naushadzaman.com
2. CRBLP's Activities
- The Center for Research on Bangla Language
Processing (CRBLP) has been working on Bangla language
processing since 2004
- 11 research staff (9 with a Computer Science background,
2 with a linguistics background)
- Students working part-time or doing internships
- 13 Summer 2006 interns and 7 former members
- Motivation of open source
- Academic
- Offered a course on language processing (CSE 431
Natural Language Processing, offered in Spring
2006 and Spring 2007 at BRAC U)
- Thesis on NLP
- Summer internship
3. CRBLP Members (Full-Time Staff Members)
- Dr. Mumit Khan
- Head, CRBLP and Associate Professor, CSE
Department
- Matin Saad Abdullah
- Program Manager, CRBLP and Senior Lecturer, CSE
Department
- Naira Khan
- Linguist, CRBLP and Lecturer, English and
Humanities (On Leave)
- Zahurul Islam
- Research Programmer, CRBLP and Part-time Faculty
Member, CSE
- Naushad UzZaman
- Research Programmer, CRBLP and Part-time Faculty
Member, CSE
- Md. Abul Hasnat, Research Programmer
- S. M. Murtoza Habib, Research Programmer
- Firoj Alam, Research Programmer
4. CRBLP Members (Part-time and Interns)
- Part Time Staff Members
- Kamrul Hayder, Language Consultant
- M. Abdur Rahman, Research Assistant
- Maruf Muqtadir, Research Assistant
- Summer 2006 Research Interns
- Fahim Muhammad Hasan
- M. Hammad Ali
- Ayesha Binte Mosaddeque
- Nafid Haque
- Yeasir Arafat
- Nizam Uddin
- M. Abdur Rahman
- Fahim Tawfique Chowdhury
- Munirul Mansur
- Md. Jahangir Alam
- Annajiat Alim Rasel
- Munshi Asadullah
- Salman Zaman
5. Areas of Research
- Document Authoring
- Information Retrieval
- Optical Character Recognition
- Pronunciation Generator
- Speech Processing
- Morphology
- Parts of Speech Tagging
- Syntax
- A few other small research projects
6. Document Authoring, BanglaPad
- The current version of BanglaPad includes the
following features
- 1. Platform independent. (Current version tested
on Windows and Linux.)
- 2. Edit Bangla and English text in the same
document.
- 3. Rich text editing with pictures and tables.
- 4. Export document as HTML. (You can develop web
content in Bangla using this feature!)
- 5. Supports character encodings including UTF-8 and
UTF-16.
- 6. Bangla and English spell checking. (The Bangla
spelling checker uses the Puspa speller.)
- 7. Bangla and English search and replace.
- 8. Printing formatted documents.
- 9. Three different skins for the editor.
- 10. Built-in keyboard driver for easy Bangla
typing. (No need to install a keyboard driver.)
- 11. Customizable key maps for Bangla.
- 12. Easy-to-use installer for Windows.
7. Spelling Checker in BanglaPad
8. Rich Text Editing in BanglaPad and exporting to
HTML
9. BanglaPad Download and Team Members
- Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180246
- Developers
- Zahurul Islam (August 2005 - December 2005)
- Naushad UzZaman (August 2005 - December 2005)
- Abdur Rahman (January 2006 - present)
- Maruf Muqtadir (January 2006 - present)
- Advisors
- Matin Saad Abdullah (August 2005 - present)
- Mumit Khan (August 2005 - present)
10. English to Bangla Transliteration
- Type phonetically in English and you will get a
similar-sounding dictionary word. Can be used for
Bangla text input with an English keyboard.
- Developed by Naushad UzZaman
- Relevant Publications
- 1. Naushad UzZaman, Arnab Zaheen and Mumit Khan,
A Comprehensive Roman (English) to Bangla
Transliteration Scheme, Proc. International
Conference on Computer Processing on Bangla
(ICCPB-2006), Dhaka, Bangladesh, 17 February,
2006.
- 2. Naushad UzZaman, Phonetic Encoding for Bangla
and its Application to Spelling Checker, Name
Searching, Transliteration and Cross Language
Information Retrieval, Undergraduate Thesis
(Computer Science), BRAC University, May 2005.
11. Pata, English to Bangla Transliteration
12. Spelling Checker
- Bangla Speller Sandbox: Bangla Phonetic Speller
(Puspa). Suggests corrections for misspelled words
based on similarity in pronunciation.
Implemented using a Double Metaphone phonetic
encoding.
- Developed by Naushad UzZaman
- Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180247
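The idea of suggesting words by pronunciation can be sketched roughly as follows. This is a toy illustration only, not Puspa's actual encoding: the `phonetic_key` function and its rules are hypothetical, they operate on romanized words rather than Bangla script, and they are far simpler than Double Metaphone.

```python
from collections import defaultdict

# Hypothetical, highly simplified phonetic rules (NOT Double Metaphone):
# spellings that sound alike are collapsed to one code.
REPLACEMENTS = [("ph", "f"), ("sh", "s"), ("kh", "k"), ("z", "j"), ("v", "b")]
VOWELS = set("aeiou")


def phonetic_key(word):
    """Map a romanized word to a rough pronunciation key."""
    w = word.lower()
    for old, new in REPLACEMENTS:
        w = w.replace(old, new)
    key = []
    for i, ch in enumerate(w):
        if ch in VOWELS and i > 0:
            continue  # drop non-initial vowels
        if key and key[-1] == ch:
            continue  # collapse repeated letters
        key.append(ch)
    return "".join(key)


def build_index(dictionary):
    """Group dictionary words by their phonetic key."""
    index = defaultdict(list)
    for word in dictionary:
        index[phonetic_key(word)].append(word)
    return index


def suggest(word, index):
    """Return dictionary words that share the misspelling's key."""
    return index.get(phonetic_key(word), [])


index = build_index(["bangla", "bhasha", "kolom", "fol"])
print(suggest("phol", index))
```

A misspelling is looked up by its key rather than by its letters, so words that are spelled differently but sound alike land in the same bucket.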
13. Publications on Spelling Checker
- 1. Naushad UzZaman and Mumit Khan, A Bangla
Phonetic Encoding for Better Spelling
Suggestions, Proc. 7th International Conference
on Computer and Information Technology (ICCIT
2004), Dhaka, Bangladesh, December 2004.
- 2. Naushad UzZaman and Mumit Khan, A Double
Metaphone Encoding for Bangla and its Application
in Spelling Checker, Proc. 2005 IEEE
International Conference on Natural Language
Processing and Knowledge Engineering, pp.
705-710, Wuhan, China, October 30 - November 1,
2005.
- 3. Naushad UzZaman and Mumit Khan, A
Comprehensive Bangla Spelling Checker, Proc.
International Conference on Computer Processing
on Bangla (ICCPB-2006), Dhaka, Bangladesh, 17
February, 2006.
- 4. Naushad UzZaman, Phonetic Encoding for Bangla
and its Application to Spelling Checker, Name
Searching, Transliteration and Cross Language
Information Retrieval, Undergraduate Thesis
(Computer Science), BRAC University, May 2005.
- 5. Munshi Asadullah, Md. Zahurul Islam, and Mumit
Khan, Error-tolerant Finite-state Recognizer and
String Pattern Similarity Based Spell-Checker for
Bengali, to appear in the Proc. of International
Conference on Natural Language Processing, ICON
2007, January 2007.
14. Puspa Spelling Checker
15. Search Engine
- Bangla search engine based on the open-source search
engine Nutch.
- Developed by M. Hammad Ali and Nafid Haque
- Relevant Publications
- 1. Nafid Haque, Hammad Ali, Mumit Khan, and Matin
Saad Abdullah, Infrastructure for Bangla
Information retrieval in the context of ICT for
Development, to appear in the Proc. of
International Conference on Systems, Computing
Sciences and Software Engineering (SCSS 06) of
International Joint Conferences on Computer,
Information, and Systems Sciences, and
Engineering (CISSE 06), December 4 - 14, 2006.
- 2. M. Hammad Ali, Nafid Haque, A Decentralised
Approach to Information Retrieval for a
developing country like Bangladesh, Education
Without Borders 2007, Abu Dhabi, February 25 -
27, 2007.
16. Search Engine example
17. Optical Character Recognition
- BanglaOCR is an Optical Character Recognizer for
Bangla script. It takes scanned images of a
printed page or document as input and converts
them into editable Unicode text. BanglaOCR allows
users to train the data set from any document and
observe the recognition performance.
- BanglaOCR was developed by Md. Abul Hasnat and S. M.
Murtoza Habib.
- Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=215908
- Another OCR, implemented using a Kohonen network,
was developed by Shoeb Shatil.
- Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180249
18. OCR Status
- OCR Application
- Status: Version 0.1, Release candidate 1
- Status of Different Segments of OCR
- Document skew correction
- Bangla document skew corrector based on the Radon
transform. Status: Complete.
- Segmentation
- Bangla line segmentation. Status: Complete.
- Bangla word segmentation. Status: Complete.
- Bangla character segmentation. Status: Work in
progress. The large number of combinations
(consonant clusters and the non-spacing marks)
complicates this task. The OCR is omnifont, so it must
work with any typeface.
- Character/Symbol recognition
- Neural net based recognizer. Status: Fairly complete for
the basic alphabet and a subset of the consonant
clusters. The non-spacing marks pose a
significant challenge.
- Hidden Markov Model (HMM) based recognizer.
Status: First demo available.
- Post Processing for OCR
- Post-processing spelling checker for OCR
corrects spelling mistakes due to unsuccessful
recognition. Status: First demo available.
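As an illustration of the line segmentation step, a common approach is a horizontal projection profile: count the ink pixels in each row and cut the page at empty rows. The slides do not say which method BanglaOCR uses, so the sketch below is an assumption; `segment_lines` and the binary-image representation are hypothetical.

```python
def segment_lines(image):
    """Return (start_row, end_row) pairs, one per text line.

    `image` is a list of rows; each row is a list of 0 (background)
    and 1 (ink) pixels. A text line is a maximal run of rows that
    contain at least one ink pixel.
    """
    profile = [sum(row) for row in image]  # ink count per row
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink > 0 and start is None:
            start = y                      # a text line begins
        elif ink == 0 and start is not None:
            lines.append((start, y - 1))   # it ended on the previous row
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 1],
]
print(segment_lines(page))
```

The same profile taken over columns within one line would give a first cut at word segmentation; skewed pages need to be deskewed first, since tilted lines leave no empty rows between them.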
19. BanglaOCR
20. OCR Related Publications
- 1. Md. Abul Hasnat, S. M. Murtoza Habib and Mumit
Khan, Segmentation free Bangla OCR using HMM
Training and Recognition, to appear in the Proc.
of 1st International Conference on Digital
Communications and Computer Applications
(DCCA2007), Irbid, Jordan, 2007.
- 2. A. M. Shoeb Shatil and Mumit Khan, Minimally
Segmenting High Performance Bangla OCR using
Kohonen Network, to appear in the Proc. of 9th
International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
- 3. S. M. Murtoza Habib, Nawsher Ahmed Noor and
Mumit Khan, Skew correction of Bangla script
using Radon Transform, to appear in the Proc. of
9th International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
21. Automated Pronunciation Generator
- Pronunciation Generator: input any Bangla word and
this application gives the pronunciation of
that word in IPA (International Phonetic
Alphabet).
- Demo available online at http://student.bu.ac.bd/%7Eu02201011/g2pweb/g2p1.htm
- Source code available online at
http://student.bu.ac.bd/%7Eu02201011/g2pweb/
- Developed by Ayesha Binte Mosaddeque
- Relevant Publication
- Ayesha Binte Mosaddeque, Naushad UzZaman and
Mumit Khan, Rule based Automated Pronunciation
Generator, to appear in the Proc. of 9th
International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
22. Bangla Pronunciation Generator
23. Speech Processing
- Text-to-speech
- Voice for Festival.
- Status: First demo available. Developed by Firoj
Alam.
- Automatic Speech Recognition
- Isolated Speech Recognition. Developed by A K M
Mahmudul Hoque.
- Continuous Speech Recognition. Status: First demo
available. Developed by Md. Abul Hasnat.
24. Speech Related Publications
- 1. Firoj Alam and Promila Kanti Nath, Bangla Text
to Speech using Festival, Undergraduate Thesis
(Computer Science), BRAC University, May 2006.
Supervisor: Mumit Khan.
- 2. A K M Mahmudul Hoque, Bangla Speech
Recognition, Undergraduate Thesis (Computer
Science), BRAC University, May 2006. Supervisor:
Mumit Khan.
- 3. Firoj Alam, Promila Kanti Nath and Mumit Khan,
Text To Speech for Bangla Language using
Festival, to appear in the Proc. of 1st
International Conference on Digital
Communications and Computer Applications
(DCCA2007), Irbid, Jordan, 2007.
25. Morphology
- Morphology: the branch of grammar which studies
the structure or forms of words.
- Work done on Bangla Morphology
- Generative verb morphology using two-level rules
- Basic concatenative noun morphology with features
- Software developed: Jkimmo, A Multilingual
Computational Morphology Framework for PC-KIMMO.
Developed by Md. Zahurul Islam.
- Download Jkimmo: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180248
26. Morphological Analyzer Jkimmo
27. Morphology Related Publications
- 1. Sajib Dasgupta and Mumit Khan, Morphological
Parsing of Bangla Words using PC-KIMMO, Proc. 7th
International Conference on Computer and
Information Technology, Dhaka, Bangladesh,
December, 2004.
- 2. Sajib Dasgupta and Mumit Khan, Feature
Unification for Morphological Parsing in Bangla,
Proc. 7th International Conference on Computer
and Information Technology, Dhaka, Bangladesh,
December, 2004.
- 3. Sajib Dasgupta, Dewan Shahriar Hossain Pavel,
Asif Iqbal Sarkar, Naira Khan and Mumit Khan,
Morphological Analysis of Inflecting Compound
Words in Bangla, Proc. 8th International
Conference on Computer and Information Technology
(ICCIT), Islamic University of Technology (IUT),
Dhaka, Bangladesh, 2005.
- 4. Md. Zahurul Islam and Mumit Khan, JKimmo: A
Multilingual Computational Morphology Framework
for PC-KIMMO, to appear in the Proc. of 9th
International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
28. Bangla Parts of Speech (POS) Tagging
- This application tags each word in a sentence with
its part of speech. HMM, n-gram and
transformation-based (Brill's) POS tagging were
implemented and compared for Bangla, Hindi and Telugu
on corpora of different sizes. For Bangla, tagsets
of different sizes were compared too.
- Developed by Fahim Muhammad Hasan.
- Relevant Publications
- Fahim Muhammad Hasan, Naushad UzZaman and Mumit
Khan, Comparison of different POS Tagging
Techniques (n-gram, HMM and Brill's tagger) for
Bangla, to appear in the Proc. of International
Conference on Systems, Computing Sciences and
Software Engineering (SCSS 06) of International
Joint Conferences on Computer, Information, and
Systems Sciences, and Engineering (CISSE 06),
December 4 - 14, 2006.
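The simplest member of the n-gram tagger family is a unigram tagger, which assigns each word its most frequent tag from the training data. The sketch below shows that baseline only; it is not the implementation compared in the publication, and the romanized words, tag names and `default_tag` fallback are made up for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for every word in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default_tag="NOUN"):
    """Tag each word; unseen words fall back to `default_tag`."""
    return [(w, model.get(w, default_tag)) for w in words]


# Tiny hypothetical training set (romanized Bangla, made-up tag names).
training = [
    [("ami", "PRON"), ("bhat", "NOUN"), ("khai", "VERB")],
    [("tumi", "PRON"), ("boi", "NOUN"), ("poro", "VERB")],
]
model = train_unigram_tagger(training)
print(tag(["ami", "boi", "poro"], model))
```

Higher-order n-gram and HMM taggers extend this by also conditioning on the tags of neighbouring words, which is where corpus size and tagset size start to matter.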
29. POS Tagging example
30. Syntax
- Syntax: the grammatical arrangement of words in
sentences
- Bangla syntactic analysis using
- Lexical Functional Grammar (LFG) formalism
- Head-driven Phrase Structure Grammar (HPSG)
formalism
- Work done by Naira Khan, Ayesha Binte Mosaddeque,
M. Hammad Ali and Nafid Haque.
- Relevant Publications
- 1. Md. Nasimul Haque and M. Khan, Parsing Bangla
using LFG: An Introduction, BRAC University
Journal, Vol 2, No. 2, 2005.
- 2. Naira Khan and Mumit Khan, Developing a
Computational Grammar for Bengali using the HPSG
Formalism, to appear in the Proc. of 9th
International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
- 3. Ayesha Binte Mosaddeque, M. Hammad Ali and
Nafid Haque, Design of Head-Driven Phrase
Structure Grammar for Bangla, Undergraduate
Thesis (Computer Science), BRAC University,
December 2006. Supervisor: Mumit Khan.
33. Bangla Grammar Checker
- Implemented a statistical Bangla grammar checker
based on n-gram analysis.
- Developed by Md. Jahangir Alam.
- Relevant Publications
- Md. Jahangir Alam, Naushad UzZaman and Mumit
Khan, N-gram based Statistical Grammar Checker
for Bangla and English, to appear in the Proc. of
9th International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
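One way an n-gram based checker can work is to accept a sentence only if every bigram in it was seen in a training corpus. The sketch below makes that assumption and uses word bigrams; the slides do not say whether the actual checker works on words or on POS tags, nor what threshold it uses, so all names and the zero-count rule here are illustrative.

```python
from collections import Counter


def bigrams(tokens):
    """Return adjacent token pairs, padded with sentence boundary markers."""
    padded = ["<s>"] + list(tokens) + ["</s>"]
    return list(zip(padded, padded[1:]))


def train(corpus):
    """Count every bigram seen in a corpus of tokenized sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(bigrams(sentence))
    return counts


def is_grammatical(tokens, counts):
    """Accept a sentence only if all of its bigrams were seen in training."""
    return all(counts[bg] > 0 for bg in bigrams(tokens))


corpus = [["the", "cat", "sleeps"], ["the", "dog", "sleeps"]]
counts = train(corpus)
print(is_grammatical(["the", "dog", "sleeps"], counts))
print(is_grammatical(["cat", "the", "sleeps"], counts))
```

With a small corpus this rule rejects many correct sentences simply because their bigrams are unseen, which is one reason to run such a check over POS tags or to smooth the counts.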
34. Bangla Text Categorization
- Implemented Bangla text categorization based on
n-gram analysis. Trained on the Prothom Alo newspaper
corpus on 6 different categories.
- Developed by Munirul Mansur.
- Relevant Publications
- 1. Munirul Mansur, Naushad UzZaman and Mumit
Khan, Analysis of N-gram based text
categorization for Bangla in a newspaper corpus,
to appear in the Proc. of 9th International
Conference on Computer and Information Technology
(ICCIT 2006), Dhaka, Bangladesh, December 2006.
- 2. Munirul Mansur, Analysis of n-gram based text
categorization for Bangla in a newspaper corpus,
Undergraduate Thesis (Computer Science), BRAC
University, August 2006. Supervisor: Mumit Khan.
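A widely used form of n-gram text categorization ranks the most frequent character n-grams of each category and assigns a document to the category whose ranking is closest (the "out-of-place" measure). The sketch below assumes that scheme, which the slides do not confirm; the function names, profile size and English sample texts are illustrative only.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return all character n-grams of a text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def profile(texts, n=3, size=300):
    """Map the most frequent n-grams of some texts to their frequency rank."""
    counts = Counter()
    for text in texts:
        counts.update(char_ngrams(text, n))
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, cat_profile):
    """Sum rank differences; n-grams missing from the category get a fixed penalty."""
    penalty = len(cat_profile)
    return sum(
        abs(rank - cat_profile[gram]) if gram in cat_profile else penalty
        for gram, rank in doc_profile.items()
    )


def classify(text, category_profiles, n=3):
    """Pick the category whose profile is closest to the document's profile."""
    doc = profile([text], n)
    return min(category_profiles,
               key=lambda c: out_of_place(doc, category_profiles[c]))


profiles = {
    "sports": profile(["cricket match score run wicket"]),
    "business": profile(["bank market share price stock"]),
}
print(classify("cricket score", profiles))
print(classify("stock price", profiles))
```

Character n-grams need no word segmentation or stemming, which makes the approach easy to apply to a new language given a categorized corpus.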
35. Analysis of Prothom-Alo newspaper Corpus
- Frequency analysis of a 1-year Prothom-Alo
newspaper corpus.
- Relevant Publications
- 1. Yeasir Arafat, Analysis and Observations From
a Bangla news corpus, Undergraduate Thesis
(Computer Science), BRAC University, August 2006.
Supervisor: Mumit Khan.
- 2. Yeasir Arafat, Md. Zahurul Islam and Mumit
Khan, Analysis and Observations From a Bangla
news corpus, to appear in the Proc. of 9th
International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
36. Language Modeling, forward and backward n-gram
- Investigating the prospects of backward n-grams
compared to forward n-grams for Bangla.
- Relevant Publication
- Naira Khan, Md. Tarek Habib, Md. Jahangir Alam,
Rajib Rahman, Naushad UzZaman and Mumit Khan,
History (forward n-gram) or Future (backward
n-gram)? Which model to consider for n-gram
analysis in Bangla?, to appear in the Proc. of
9th International Conference on Computer and
Information Technology (ICCIT 2006), Dhaka,
Bangladesh, December 2006.
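The difference between the two models can be sketched with bigrams: a forward model conditions a word on the word before it (history), a backward model on the word after it (future). The code below is a minimal maximum-likelihood sketch under that reading; `train_bigram` and the toy corpus are hypothetical and no smoothing is applied.

```python
from collections import Counter


def train_bigram(sentences, backward=False):
    """Estimate P(word | context) by maximum likelihood.

    Forward model: the context is the previous word (history).
    Backward model: the context is the next word (future), obtained
    by reversing each sentence before counting.
    """
    pair_counts, context_counts = Counter(), Counter()
    for tokens in sentences:
        seq = list(reversed(tokens)) if backward else list(tokens)
        for context, word in zip(seq, seq[1:]):
            pair_counts[(context, word)] += 1
            context_counts[context] += 1

    def prob(word, context):
        if context_counts[context] == 0:
            return 0.0
        return pair_counts[(context, word)] / context_counts[context]

    return prob


corpus = [["a", "b", "c"], ["a", "b", "d"], ["x", "b", "c"]]
forward = train_bigram(corpus)
backward = train_bigram(corpus, backward=True)
print(forward("b", "a"))    # P(b | previous word is a)
print(backward("a", "b"))   # P(a | next word is b)
```

Because the backward model is just the forward model trained on reversed sentences, the two can be compared on the same corpus with the same code.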
37. Font Converter
- Converts different TTF fonts to Unicode encoding.
Status: Completed for Ullash, Prothoma, Bangsi
Alpona fonts.
- Developed by Md. Zahurul Islam.
- Download: http://sourceforge.net/project/showfiles.php?group_id=158301&package_id=180250
38. Stemming
- Stemming: an algorithm that reduces a search query
to its stem or root form. In other words, variations
of a word, such as past tense and plural and singular
forms, are taken into account when performing a search.
For example, applies, applying and applied all match
apply.
- Relevant Publications
- Md. Zahurul Islam, Md. Nizam Uddin and Mumit
Khan, A Light Weight Stemmer for Bengali and Its
Use in Spelling Checker, to appear in the Proc.
of 1st International Conference on Digital
Communications and Computer Applications
(DCCA2007), Irbid, Jordan, 2007.
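The English example above can be reproduced with a small suffix-stripping sketch. The rules below are hypothetical and cover only that example; a Bangla stemmer such as the one in the publication would use Bangla suffixes and is not reproduced here.

```python
# Hypothetical suffix rules for the English example in the text;
# a Bangla stemmer would use Bangla suffixes instead.
SUFFIX_RULES = [("ies", "y"), ("ied", "y"), ("ing", ""), ("s", "")]


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` letters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


def matches(query, document_word):
    """Two words match if they reduce to the same stem."""
    return stem(query) == stem(document_word)


for w in ["applies", "applying", "applied", "apply"]:
    print(w, "->", stem(w))
```

In a search setting both the query and the indexed words are stemmed, so `matches("applying", "applied")` holds even though the surface forms differ.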
39. Text Summarization
- Text summarization is a technique that
automatically creates an abstract or summary of a
text. In this study we investigate what work
has been done in this area and implement an
extraction-based text summarizer for the Bangla
language.
- Relevant work
- Md. Nizam Uddin, "A Study on Text Summarization
Techniques and an Approach for Bangla Text
Summarization", Independent Study, Computer
Science, BRAC University, December 2006.
Supervisors: Md. Zahurul Islam, Mumit Khan
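An extraction-based summarizer selects existing sentences rather than writing new ones. One simple scoring scheme, assumed here and not taken from the study, rates each sentence by the average corpus frequency of its words and keeps the top-scoring sentences in their original order. The function names are hypothetical and no stop-word filtering is done.

```python
import re
from collections import Counter

PUNCT = ".,!?;:\"'\u0964"  # includes the Bangla danda


def tokenize(text):
    """Split on whitespace and strip surrounding punctuation."""
    tokens = (t.strip(PUNCT).lower() for t in text.split())
    return [t for t in tokens if t]


def summarize(text, num_sentences=1):
    """Return the highest-scoring sentences, in their original order."""
    parts = re.split(r"(?<=[.!?\u0964])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]


text = ("Bangla is spoken in Bangladesh. The weather is nice. "
        "Bangla language processing needs Bangla resources.")
print(summarize(text, 1))
```

Whitespace tokenization is used instead of a `\w+` pattern because Bangla vowel signs are combining marks, which a word-character pattern would split apart.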
40. Language Resources
- Lexicon
- Word list of 160 thousand words with first-step
part-of-speech tags.
- Corpus
- 1-year Prothom Alo newspaper corpus
- Charjapad and Boru Chandi Dash er kabbo corpus
(Edited by Md. Abdul Hai and Anwar Pasha)
41. CRBLP Publications
- 2004
- ICCIT 2004 (Bangladesh): 3 (Morphology, Spelling
Checker)
- Total: 3
- 2005
- IASTED CI 2005 (Canada): 1 (Name Searching)
- IEEE NLP KE 2005 (China): 1 (Spelling Checker)
- IEE Mobility 2005 (China): 1 (Text Input System
for Mobile)
- ICCIT 2005: 2 (Morphology, Compiler)
- BU Journal: 1 (Morphological Parsing)
- Undergraduate Thesis: 1 (Phonetic Encoding)
- Total: 7
42. CRBLP Publications (cont.)
- 2006
- ICCPB 2006 (Bangladesh): 4 (Corpus, Lexicon,
Spelling Checker, Transliteration)
- ICCIT 2006 (Bangladesh): 11 (HPSG, Corpus
Analysis, Text Categorization, Pronunciation
Generator, Backward n-gram, Grammar Checker, Skew
Correction, Traveler Information System, OCR
using Kohonen Network, Mobile Messaging,
Morphology)
- CISSE 2006 (Online): 2 (Comparison of POS
Tagging, Bangla Information Retrieval)
- Undergraduate Thesis: 9 (Skew Correction, Mobile
Messaging, Speech Recognition, OCR using Kohonen
Network, Text to Speech, Corpus Analysis, Text
Categorization, POS Tagging, HPSG)
- Total: 24
- 2007
- ICON 2007 (India): 1 (Spelling Checker)
- DCCA 2007 (Jordan): 5 (Stemming, OCR, Text to
Speech, Semantics, Wireless LAN)
- EWB 2007 (Abu Dhabi): 2 (Information Retrieval,
Localization)
- Total: 8
- Till January 2007
43. CRBLP website
- http://www.bracu.ac.bd/research/crblp/