Status and Challenges of Local Language Computing and BRAC University's Initiative

1
Status and Challenges of Local Language Computing
and BRAC University's Initiative
  • Naushad UzZaman
  • Research Programmer
  • Center for Research on Bangla Language Processing
  • BRAC University

D.Net's 5th Anniversary Seminar Series: Youth and ICTs
ICT and Localization, 29th January, 2006
2
Outline
  • Statistics of Bangla language speakers
  • Localization and local language computing
  • BRAC University's Initiative
  • Local and Regional Initiatives

3
Statistics of Bangla language speakers
  • Spoken by 245 million people
  • 7th most widely spoken language
  • Spoken mainly in Bangladesh and the Indian state of
    West Bengal
  • More than 144 million speakers are in Bangladesh

4
Why localization?
  • The masses can harness the power of information
  • National interest: digital divide, governance,
    language preservation

5
Localization
  • Internationalized software in local languages
  • Few groups are working actively: Ankur, Ekushey,
    D.Net (content development)
  • Active projects: Linux, Mozilla, OpenOffice

8
Larger picture
  • Good start, but a long way to go!
  • Local language computing: advanced applications
  • Optical character recognition
  • Machine translation
  • Speech synthesis
  • Speech recognition
  • Dialog systems

9
Challenges
  • Language Resources
  • Fonts
  • Lexicon (word list)
  • Corpus (collection of texts)
  • Tag the lexicon and corpus

10
Challenges for next few years!
  • Language processing research
  • Document authoring (desktop; web: blogs, forums,
    email; etc.)
  • Morphological analyzer
  • Speech processing
  • Information Retrieval (web searching, name
    searching, spelling checker)
  • OCR (Optical Character Recognition)
  • Syntactic analysis (can be used in MT)
  • Machine Translation
  • And many more

11
Status of Bangla Computing
  • Scattered work done, very little unification
  • Scarcity of free and open-source software
  • Little or no attention paid to computational
    linguistics - the backbone
  • Many individuals are working; results: a few good
    publications in ICCIT, IUB's ICCPB, and other
    conferences

12
BRAC University's Initiative
  • Research Lab (Center for Research on Bangla
    Language Processing)
  • 9 full-time Research staff (6 CS background, 3
    linguistics background)
  • Seed funding from the PAN Localization project of
    IDRC
  • Students working part-time and doing internships
  • Software/documents all OPEN SOURCE
  • Academics
  • Course on Natural Language Processing
  • Student projects and theses on NLP

13
Status of BU Research Lab's work
  • Publications
  • ICCIT 2004: 3 (morphology: 2, spelling checker: 1)
  • BU Journal: 1 (morphological parsing)
  • IASTED CI: 1 (name searching)
  • IEEE NLP-KE '05: 1 (spelling checker)
  • ICCIT 2005: 1 (morphology)
  • Undergraduate theses: 3 (phonetic encoding, OCR,
    Bangla text input on mobile)
  • Total: 10
  • 4 more research papers submitted
  • Ongoing theses: 4

14
Status of BU Research Lab's work
  • Invited talks
  • University of Toronto CS Seminar
  • Stanford University NLP group (May 2005)
  • IDRC Partners Conference in Cambodia (June 2005)
  • IJCNLP 2005, Jeju Island, Korea (October 2005)

15
Language Resources
  • Fonts: good open-source fonts available
  • Lexicon
  • List of 80 thousand words, expected to reach 110
    thousand in the next release
  • Tagging and annotation are underway; a significant
    and large project
  • Corpus: yet to begin

16
Language processing research
  • Document authoring
  • Editor, Banglapad
  • open source, platform independent, rich text
    editor (supports Bangla spell checking, export to
    html)
  • Status: Version 1, Release candidate 1
  • http://sourceforge.net/projects/banglapad
  • Transliteration, pata
  • Type phonetically in English and get a
    similar-sounding dictionary word
  • Desktop application: http://sourceforge.net/projects/pata
    (Status: Complete)
  • Web-based transliteration (Status: Expected by
    June 2006)
  • Community network tools
  • Set of tools for community networking (blogs,
    forums, etc.) in Bangla.
  • Not only content authoring but also web services
    such as a spelling checker.
  • Status: Expected by early 2007
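
The phonetic transliteration idea listed above (type in English letters, get a similar-sounding dictionary word) can be sketched as a lookup on a rough sound key. This is a minimal illustration under stated assumptions, not how pata is implemented: the four-word lexicon, the rewrite rules, and the function names `phonetic_key`, `build_index`, and `transliterate` are all hypothetical.

```python
# Hypothetical mini-lexicon: Bangla word -> one romanization (illustrative only).
LEXICON = {
    "আমি": "ami",
    "বাংলা": "bangla",
    "ভাষা": "bhasha",
    "নাম": "nam",
}

# Assumed sound-alike rewrites for Latin-letter input; not taken from pata.
REWRITES = [("bh", "b"), ("ph", "f"), ("sh", "s"), ("kh", "k"), ("v", "b"), ("z", "j")]


def phonetic_key(text):
    """Reduce Latin-letter input to a rough sound key."""
    s = "".join(ch for ch in text.lower() if ch.isalpha())
    for src, dst in REWRITES:
        s = s.replace(src, dst)
    # Collapse runs of the same letter ("naam" -> "nam").
    collapsed = []
    for ch in s:
        if not collapsed or collapsed[-1] != ch:
            collapsed.append(ch)
    if not collapsed:
        return ""
    # Keep the first letter, then drop vowels from the rest.
    head, tail = collapsed[0], collapsed[1:]
    return head + "".join(ch for ch in tail if ch not in "aeiou")


def build_index(lexicon):
    """Group dictionary words by the sound key of their romanization."""
    index = {}
    for word, roman in lexicon.items():
        index.setdefault(phonetic_key(roman), []).append(word)
    return index


def transliterate(text, index):
    """Return dictionary words whose sound key matches the typed text."""
    return index.get(phonetic_key(text), [])


INDEX = build_index(LEXICON)
print(transliterate("basha", INDEX))
print(transliterate("naam", INDEX))
```

Keying the index by sound rather than by exact spelling lets several plausible romanizations of one word ("bhasha", "basha", "bhasa") reach the same dictionary entry. A real system would need a much richer key and a way to rank multiple matches.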

17
Language processing research
  • Morphology
  • verb morphology is reasonably complete
  • noun morphology is somewhat usable, but much more
    needs to be done
  • statistical methods for dealing with Bangla
    compound words and blends are being worked on
  • Grapheme-to-phoneme (G2P)
  • Digital pronunciation dictionary
  • Useful step for speech processing
  • Status: Expected by June 2006

18
Language processing research
  • Speech Processing
  • Text-to-speech
  • Voice for Festival.
  • Status: First demo expected by May 2006.
  • Automatic Speech Recognition
  • Limited-vocabulary, segmented speech recognition.
  • Status: First demo expected by August 2006.

19
Language processing research
  • Information Retrieval
  • Spelling checker
  • Gives phonetic suggestions and ranks them phonetically
  • http://sourceforge.net/projects/puspaspeller/
  • Integrated with text editors such as Banglapad
  • Status: Complete
  • Searching
  • Phonetic web searching for Bangla
  • Input can be English or Bangla
  • Status: Expected by June 2006
  • Name searching
  • Can be used in hospitals, institutes, the census, etc.
  • Status: Expected by October 2006
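
One way to rank spelling suggestions phonetically, as the spelling checker item above describes, is to map sound-alike letters to a shared code and sort dictionary words by edit distance between codes. The sketch below is an assumption-laden illustration, not the actual checker's algorithm: the `SOUND_CLASSES` table, the tiny word list, and the function names are hypothetical.

```python
# Assumed classes of Bangla letters that sound alike; illustrative, not
# the table used by the actual spelling checker.
SOUND_CLASSES = {
    "ণ": "ন",
    "ষ": "শ",
    "স": "শ",
    "ঈ": "ই",
    "ঊ": "উ",
    "ী": "ি",
    "ূ": "ু",
    "য": "জ",
}


def phonetic_code(word):
    """Replace each letter with the representative of its sound class."""
    return "".join(SOUND_CLASSES.get(ch, ch) for ch in word)


def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def suggest(word, lexicon, limit=3):
    """Return up to `limit` lexicon words, closest-sounding first."""
    code = phonetic_code(word)
    ranked = sorted(lexicon, key=lambda w: edit_distance(code, phonetic_code(w)))
    return ranked[:limit]


WORDS = ["নাম", "পানি", "বই", "বাণী"]
print(suggest("বানি", WORDS, limit=2))
```

Here the misspelling "বানি" and the dictionary word "বাণী" share the same phonetic code, so that word ranks first even though two of its four letters differ in spelling.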

20
Language processing research
  • Pattern recognition/image processing/document
    processing
  • Document skew correction: Bangla document skew
    corrector based on the Radon transform. Complete.
  • Segmentation
  • Bangla line segmentation: Complete
  • Bangla word segmentation: Complete
  • Bangla character segmentation: Work in progress.
    The large number of combinations (consonant
    clusters and the non-spacing marks) complicates
    this task. The system is omnifont, so it must work
    with any typeface.
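
The slides do not say how the Radon-transform skew corrector works internally. A common approach, sketched here as an assumption, is to project the ink pixels along a set of candidate angles (a discrete Radon transform) and pick the angle whose projection profile is most sharply peaked, since text lines pile up into a few bins only when viewed along their own direction. The function name `estimate_skew` and the sum-of-squares sharpness score are illustrative choices.

```python
import math


def estimate_skew(points, candidate_degrees):
    """Return the candidate angle whose projection profile is sharpest.

    points: iterable of (x, y) foreground (ink) coordinates.
    candidate_degrees: angles to try, in degrees, measured from the
    x-axis in the coordinate system of the input points.
    """
    points = list(points)
    best_angle, best_score = None, -1
    for deg in candidate_degrees:
        theta = math.radians(deg)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        profile = {}
        for x, y in points:
            # Signed distance of the point from a line through the origin
            # at angle theta; points on one text line share this value
            # when theta matches the skew.
            rho = round(y * cos_t - x * sin_t)
            profile[rho] = profile.get(rho, 0) + 1
        # Sum of squared bin counts is largest when ink concentrates
        # in few bins, i.e. when the projection runs along the text lines.
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = deg, score
    return best_angle
```

Once the skew angle is known, the page can be rotated back by that angle. The same projection profile at the chosen angle also shows text lines as peaks separated by empty bins, which is one simple basis for line segmentation.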

21
Language processing research
  • Pattern recognition
  • Neural-net-based recognizer: Fairly complete for
    the basic alphabet and a subset of the consonant
    clusters. The non-spacing marks pose a
    significant challenge.
  • Hidden Markov Model (HMM) based recognizer: Just
    started; first implementation expected in May
    2006.
  • Syntax
  • Very preliminary work on Bangla syntax using the
    Lexical Functional Grammar (LFG) formalism
  • Also a parallel effort using the Head-driven
    Phrase Structure Grammar (HPSG) formalism

22
Local and Regional Initiatives
  • IDRC Pan Localization Network (PanL10n)
  • Phase I (2004-2006): 7-country collaboration
  • BRAC University, Bangladesh
  • Department of IT, Bhutan
  • National ICT Development Agency, Cambodia
  • Science, Technology and Environment Agency, Laos
  • Madan Puraskar Pustakalaya, Nepal
  • University of Colombo School of Computing, Sri
    Lanka
  • Afghanistan
  • Phase II proposed for 2007-2010

23
Local and Regional Initiatives
  • IDRC Pan Localization Network Phase II
    (2007-2010)
  • Further development of user-end local language
    technology
  • Development of user-end training for using the
    local language technology
  • Conducting this training
  • Local language content development
  • Measuring effects of using local language
    technology

24
D.Net's Initiative
25
Summary
  • Local language computing
  • Significant challenges, from language resources
    to human resources
  • 30 years of work for English and Western languages;
    just beginning for Bangla
  • Include students from CS, linguistics
  • OPEN SOURCE: a must for knowledge sharing!
  • Other universities should also come forward