Title: Status and Challenges of Local Language Computing and BRAC University's Initiative
1Status and Challenges of Local Language Computing
and BRAC University's Initiative
- Naushad UzZaman
- Research Programmer
- Center for Research on Bangla Language Processing
- BRAC University
D.Net's 5th Anniversary Seminar Series: Youth and ICTs
ICT and Localization, 29th January, 2006
2Outline
- Statistics of Bangla language speakers
- Localization and local language computing
- BRAC University's Initiative
- Local and Regional Initiatives
3Statistics of Bangla language speakers
- Spoken by 245 million people
- 7th most widely spoken language
- Spoken mainly in Bangladesh and the Indian state of West Bengal
- More than 144 million people from Bangladesh
4Why localization?
- The masses can harness the power of information
- National interest: digital divide, governance, language preservation
5Localization
- Internationalized software in local languages
- Few groups are working actively
- Ankur, Ekushey, D.Net (content development)
- Active projects
- Linux, Mozilla, Open Office
8Larger picture
- Good start, but a long way to go!
- Local language computing: advanced applications
- Optical character recognition
- Machine translation
- Speech synthesis
- Speech recognition
- Dialog systems
9Challenges
- Language Resources
- Fonts
- Lexicon (word list)
- Corpus (collection of texts)
- Tag the lexicon and corpus
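The relationship between the two resources named above, a corpus (collection of texts) and a lexicon (word list), can be illustrated with a minimal sketch. It is not the lab's tooling: the function name `build_lexicon`, the toy two-sentence corpus, and the choice to treat any run of characters from the Unicode Bengali block as a word are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a "word" is any run of characters from the Unicode Bengali
# block (U+0980-U+09FF). The danda (U+0964) lies outside this block, so it
# acts as a separator. Text containing zero-width joiners would need
# extra handling that this sketch leaves out.
BANGLA_WORD = re.compile(r"[\u0980-\u09FF]+")


def build_lexicon(corpus_texts):
    """Return a Counter mapping each Bangla word form to its frequency."""
    counts = Counter()
    for text in corpus_texts:
        counts.update(BANGLA_WORD.findall(text))
    return counts


# Toy corpus, for illustration only.
corpus = ["আমি ভাত খাই।", "আমি গান গাই।"]
lexicon = build_lexicon(corpus)

# The sorted keys form the word list; the counts give corpus frequencies.
word_list = sorted(lexicon)
```

Tagging would then attach further information (for example part of speech) to each entry of `word_list`, which is the step the last bullet refers to.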
10Challenges for next few years!
- Language processing research
- Document authoring (desktop, web (blogs, forums, emails), etc.)
- Morphological analyzer
- Speech processing
- Information retrieval (web searching, name searching, spelling checker)
- OCR (Optical Character Recognition)
- Syntactic analysis (can be used in MT)
- Machine Translation
- And many more
11Status of Bangla Computing
- Scattered work done, very little unification
- Scarcity of free and open-source software
- Little or no attention paid to computational linguistics, the backbone
- Many individuals are working; results: a few good publications in ICCIT, IUB's ICCPB and other conferences
12BRAC University's Initiative
- Research lab (Center for Research on Bangla Language Processing)
- 9 full-time research staff (6 with CS background, 3 with linguistics background)
- Seed funding from the PAN Localization project of IDRC
- Students working part-time and doing internships
- Software/documents all OPEN SOURCE
- Academics
- Course on Natural Language Processing
- Student projects and theses on NLP
13Status of BU Research Lab's work
- Publications
- ICCIT 2004: 3 (2 on morphology, 1 on spelling checker)
- BU Journal: 1 (morphological parsing)
- IASTED CI: 1 (name searching)
- IEEE NLP-KE 05: 1 (spelling checker)
- ICCIT 2005: 1 (morphology)
- Undergraduate theses: 3 (phonetic encoding, OCR, Bangla text input on mobile)
- Total: 10
- 4 more research papers submitted
- Ongoing theses: 4
14Status of BU Research Lab's work
- Invited talks
- University of Toronto CS Seminar
- Stanford University NLP group (May 2005)
- IDRC Partners Conference in Cambodia (June 2005)
- IJCNLP 2005, Jeju Island, Korea (October 2005)
15Language Resources
- Fonts: Good open-source fonts available
- Lexicon
- Word list of 80 thousand words, expected to be 110 thousand in the next release
- Tagging and annotation are underway; a significant and large project
- Corpus
- Yet to begin
16Language processing research
- Document authoring
- Editor: Banglapad
- Open source, platform-independent rich text editor (supports Bangla spell checking, export to HTML)
- Status: Version 1, Release candidate 1
- http://sourceforge.net/projects/banglapad
- Transliteration: pata
- Type phonetically in English and you will get a similar-sounding dictionary word
- Desktop application: http://sourceforge.net/projects/pata Status: Complete
- Web-based transliteration. Status: Expected by June 2006
- Community network tools
- Set of tools for community networking (blogs, forums, etc.) in Bangla
- Not only content authoring but also web services such as a spelling checker
- Status: Expected by early 2007
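The idea behind phonetic transliteration (type in English letters, get a similar-sounding dictionary word) can be sketched as a lookup on a normalized phonetic key. This is only an illustration under stated assumptions, not how pata is implemented: the three-word `LEXICON`, the `RULES` table of spelling equivalences, and the function names are all hypothetical.

```python
# Hypothetical toy lexicon: Bangla word -> one romanized pronunciation.
LEXICON = {
    "আমি": "ami",
    "বাংলা": "bangla",
    "ভাষা": "bhasha",
}

# Assumed equivalences between romanized spellings (illustrative, not exhaustive).
RULES = [("ee", "i"), ("oo", "u"), ("sh", "s"), ("bh", "b"), ("v", "b")]


def phonetic_key(roman):
    """Normalize a romanized string so similar-sounding spellings collide."""
    key = roman.lower()
    for src, dst in RULES:
        key = key.replace(src, dst)
    # Collapse runs of the same letter, so "baangla" and "bangla" match.
    collapsed = []
    for ch in key:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    return "".join(collapsed)


# Index the dictionary once by phonetic key.
INDEX = {}
for word, roman in LEXICON.items():
    INDEX.setdefault(phonetic_key(roman), []).append(word)


def transliterate(roman):
    """Return dictionary words whose phonetic key equals that of the input."""
    return list(INDEX.get(phonetic_key(roman), []))
```

With this toy data, `transliterate("baangla")` and `transliterate("vasha")` each resolve to a single dictionary word even though neither matches the stored romanization exactly. A real tool would need a far richer rule set and a way to rank several candidates.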
17Language processing research
- Morphology
- Verb morphology is reasonably complete
- Noun morphology is somewhat usable, but much more needs to be done
- Statistical methods for dealing with Bangla compound words and blends are being worked on
- Grapheme to Phoneme (G2P)
- Digital pronunciation dictionary
- Useful step for speech processing
- Status: Expected by June 2006
18Language processing research
- Speech Processing
- Text-to-speech
- Voice for Festival
- Status: First demo expected by May 2006
- Automatic Speech Recognition
- Limited-vocabulary segmented speech recognition
- Status: First demo expected by August 2006
19Language processing research
- Information Retrieval
- Spelling checker
- Gives phonetic suggestions and ranks them phonetically
- http://sourceforge.net/projects/puspaspeller/
- Integrated with text editors such as Banglapad
- Status: Complete
- Searching
- Phonetic web searching for Bangla
- Input can be English or Bangla
- Status: Expected by June 2006
- Name searching
- Can be used in hospitals, institutes, census, etc.
- Status: Expected by October 2006
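Ranking spelling suggestions phonetically, as described for the spelling checker above, can be sketched by comparing phonetic codes rather than raw spellings. This is a simplified illustration and not the puspaspeller algorithm: the `SOUND_ALIKE` table is an assumed, deliberately small set of letters treated as sounding the same, and the three-word dictionary is hypothetical.

```python
# Assumed, simplified homophone classes: each character maps to one
# representative of its sound class. A real encoder would be much larger.
SOUND_ALIKE = {
    "ণ": "ন",            # two letters commonly pronounced as the same n
    "ষ": "শ", "স": "শ",  # three sibilant letters treated as one sound
    "\u09C0": "\u09BF",  # long i vowel sign -> short i vowel sign
    "\u09C2": "\u09C1",  # long u vowel sign -> short u vowel sign
}


def phonetic_code(word):
    """Replace every character by the representative of its sound class."""
    return "".join(SOUND_ALIKE.get(ch, ch) for ch in word)


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def suggest(word, dictionary, limit=5):
    """Rank dictionary words by phonetic distance, then by spelling distance."""
    code = phonetic_code(word)
    ranked = sorted(
        dictionary,
        key=lambda cand: (
            edit_distance(code, phonetic_code(cand)),
            edit_distance(word, cand),
            cand,
        ),
    )
    return ranked[:limit]


# Hypothetical toy dictionary and a misspelling that sounds like the first entry.
dictionary = ["বাণী", "বাকি", "বাংলা"]
suggestions = suggest("বানি", dictionary)
```

Because the misspelled input and the first dictionary entry share the same phonetic code, that entry is ranked first even though two of its characters differ in spelling.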
20Language processing research
- Pattern recognition/image processing/document processing
- Document skew correction: Bangla document skew corrector based on the Radon transform. Complete.
- Segmentation
- Bangla line segmentation: Complete
- Bangla word segmentation: Complete
- Bangla character segmentation: Work in progress. The large number of combinations (consonant clusters and the non-spacing marks) complicates this task. This is omnifont, so it must work with any typeface.
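A Radon-transform approach to skew estimation can be sketched as follows: project the dark pixels along a set of candidate angles and keep the angle whose projection profile is most sharply peaked, since text lines collapse into a few bins only when projected along their own direction. This is a minimal sketch under stated assumptions, not the lab's skew corrector: the function name `estimate_skew`, the search range, the step size, and the synthetic three-line "page" are all hypothetical.

```python
import math
from collections import Counter


def estimate_skew(pixels, max_angle=10.0, step=0.5):
    """Return the candidate angle (degrees) with the most peaked projection.

    pixels: iterable of (x, y) coordinates of foreground (ink) pixels.
    """
    pixels = list(pixels)
    best_angle, best_score = 0.0, -1
    steps = int(round(2 * max_angle / step))
    for k in range(steps + 1):
        angle = -max_angle + k * step
        theta = math.radians(angle)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        # Discrete Radon projection at this angle: count the pixels that
        # share the same offset rho measured along the line normal.
        profile = Counter(math.floor(y * cos_t - x * sin_t) for x, y in pixels)
        # Sum of squared counts is largest when the ink is concentrated
        # in few bins, i.e. when the projection runs along the text lines.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle


# Synthetic test page (assumption): three one-pixel-thick "text lines"
# drawn with a 5 degree slope in the array's own coordinate system.
true_skew = math.radians(5.0)
pixels = [
    (x, round(base + x * math.tan(true_skew)))
    for base in (20, 50, 80)
    for x in range(400)
]
estimated = estimate_skew(pixels)
```

Once the angle is estimated, the page would be rotated by its negative before line, word, and character segmentation. Real scans would also need binarization and noise handling, which this sketch leaves out.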
21Language processing research
- Pattern recognition
- Neural-net-based recognizer: Fairly complete for the basic alphabet and a subset of the consonant clusters. The non-spacing marks pose a significant challenge.
- Hidden Markov Model (HMM) based recognizer: Just started; first implementation expected in May 2006.
- Syntax
- Very preliminary work on Bangla syntax using the Lexical Functional Grammar (LFG) formalism
- Also a parallel effort using the Head-driven Phrase Structure Grammar (HPSG) formalism
22Local and Regional Initiatives
- IDRC Pan Localization Network (PanL10n)
- Phase I (2004-2006): 7-country collaboration
- BRAC University, Bangladesh
- Department of IT, Bhutan
- National ICT Development Agency, Cambodia
- Science Tech and Environment Agency, Laos
- Madan Puraskar Pustakalaya, Nepal
- University of Colombo School of Computing, Sri Lanka
- Afghanistan
- Phase II proposed for 2007-2010
23Local and Regional Initiatives
- IDRC Pan Localization Network Phase II (2007-2010)
- Further development of user-end local language technology
- Development of user-end training for using the local language technology
- Conducting this training
- Local language content development
- Measuring effects of using local language technology
24D.Net's Initiative
25Summary
- Local language computing
- Significant challenges, from language resources to human resources
- 30 years of work for English and Western languages; just beginning for Bangla
- Include students from CS and linguistics
- OPEN SOURCE a must for knowledge sharing!
- Other universities should also come forward