Building digital libraries in Indian languages: case studies with Hindi and Kannada


1
Building digital libraries in Indian languages
case studies with Hindi and Kannada
B.S. Shivaram, Trainee (2001-2002),
National Center for Science Information
2
Table of Contents
  • Introduction to Multilingual Digital Libraries
  • Different Character Sets and Encodings
  • Statement of the problem
  • Objectives
  • Need for the project
  • Methodology
  • Implementation
  • System description
  • Observations
  • Limitations
  • Conclusion
  • Future developments

3
Multilingual Digital Library
  • Library
  • Digital library
  • Monolingual digital library
  • Multilingual digital library

4
Definition of MDL
  • According to Ana M. B. Pavani
  • A multilingual digital library is a digital
    library that has all functions implemented
    simultaneously in as many languages as desired
    and whose search retrieve functions are
    language independent.

5
Terms related to multilingualism
  • i18n (internationalization)
  • Localization
  • Multilingual digital library
  • Multilingual documents
  • Cross-language Retrieval

6
Issues of MDL
  • Multiple language recognition, manipulation and
    display.
  • Multilingual or cross-language search and
    retrieval

7
Character set and Encodings
  • Character set (charset): a set of characters, as
    a human would understand them.
  • Ex.: ಅ, ಆ, ಇ, ಈ and so on belong to the charset of Kannada
  • अ, आ, इ, ई and so on belong to the charset of Hindi
  • A, B, C, D and so on belong to the charset of Latin (English)
  • Character encoding: a way of storing characters
    on a computer as bits.
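The difference between a character and its stored bits can be sketched in a few lines of Python. The sample letters and the helper name `to_bits` are chosen here for illustration only, and UTF-8 is just one of several possible encodings.

```python
# A character is an abstract unit; an encoding maps it to bytes (bits).
samples = {
    "Latin": "A",
    "Devanagari": "\u0905",  # DEVANAGARI LETTER A
    "Kannada": "\u0c85",     # KANNADA LETTER A
}


def to_bits(ch: str, encoding: str = "utf-8") -> str:
    """Return the encoded form of one character as a string of bits."""
    return " ".join(f"{byte:08b}" for byte in ch.encode(encoding))


for script, ch in samples.items():
    print(script, ch, to_bits(ch))
```

The same character yields a different bit pattern under a different encoding, which is why the charset and the encoding have to be named separately.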

8
Different character sets
  • ASCII
  • ISO-8859 series
  • Windows series
  • User defined
  • ISO 10646
  • UTF-8
  • UTF-16
  • UTF-32
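As a rough illustration of how these encodings differ in practice, the sketch below encodes the same text several ways and compares the byte counts. The helper name `encoded_sizes` and the sample word are assumptions made for this example; the little-endian codec variants are used so that no byte-order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of `text` under several encodings (None if unrepresentable)."""
    sizes = {}
    for name in ("ascii", "utf-8", "utf-16-le", "utf-32-le"):
        try:
            sizes[name] = len(text.encode(name))
        except UnicodeEncodeError:
            sizes[name] = None  # this charset cannot represent the text
    return sizes


# The word "Kannada" written in Kannada script (five code points).
kannada = "\u0c95\u0ca8\u0ccd\u0ca8\u0ca1"

print(encoded_sizes("ABCD"))
print(encoded_sizes(kannada))
```

ASCII cannot represent the Kannada text at all, while the three Unicode encodings store it in 15, 10 and 20 bytes respectively.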

9
Unicode
  • Unicode provides a unique number for every
    character, no matter what the platform, no
    matter what the program, no matter what the
    language.
  • Developed by the Unicode Consortium
  • There have been many versions; 3.2.0 is the
    current one
  • Accommodates more than 65,000 characters.
  • Synchronized with the corresponding versions of
    ISO 10646.

10
Unicode
  • Standards incorporated under Unicode
      • ISO 6937, ISO 8859 series
      • ISCII, KS C 5601, JIS X 0208, JIS X 0212, GB
        2312, CNS 11643, etc.
  • Scripts and Characters
      • European alphabetic scripts
      • Middle Eastern right-to-left scripts
      • Scripts of Asia
      • Scripts of Indian languages: Devanagari, Bengali,
        Gurmukhi, Oriya, Tamil, Telugu, Kannada, Malayalam.
      • Punctuation marks, diacritics, mathematical
        symbols, technical symbols, arrows, dingbats,
        etc.

11
Assigning Character Codes
  • A unique number, called a code point, is assigned
    to each code element.
  • Code points are written as hexadecimal numbers
    with the prefix U+. Ex.: U+0041 is the code point
    of "A".
  • Unicode groups characters together by script in
    code blocks.
  • Code blocks vary in size, depending on the size
    of the script.
  • Code elements are grouped logically throughout
    the range of code points, called the codespace.
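Code points and code blocks can be inspected directly, as the following sketch shows. The block ranges are those of the Basic Latin, Devanagari and Kannada blocks in the Unicode code charts; the helper names `code_point` and `block_of` are invented for this illustration.

```python
import unicodedata

# Three code blocks and their code point ranges (inclusive).
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Devanagari": (0x0900, 0x097F),
    "Kannada": (0x0C80, 0x0CFF),
}


def code_point(ch: str) -> str:
    """Format a character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


def block_of(ch: str) -> str:
    """Name the block (from BLOCKS) that contains the character."""
    cp = ord(ch)
    for name, (low, high) in BLOCKS.items():
        if low <= cp <= high:
            return name
    return "other"


for ch in ("A", "\u0915", "\u0c95"):
    print(code_point(ch), block_of(ch), unicodedata.name(ch))
```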

12
Text handling
  • Computer text handling involves processing and
    encoding.
  • The Unicode Standard directly addresses only
    encoding; processing is carried out by software.
  • It does not define glyph images (the visual
    shapes of characters); display software
    retrieves the glyphs.
  • The Unicode Standard does not specify the size,
    shape, or orientation of on-screen characters.

13
Objectives
  • To assess the suitability of GSDL for developing
    digital library collections in Indian languages
    (Hindi and Kannada)
  • To create search and browse interfaces for the
    GSDL software in Hindi and Kannada

14
Need
  • Immeasurable amount of literature in many
    languages
  • E-publishing in Indian languages
  • E-governance in India
  • E-learning
  • Digital libraries for Rural population

15
Greenstone Digital Library Software
  • Open source
  • Developed by the CS Department, University of
    Waikato, New Zealand
  • http://greenstone.org
  • Can handle different file formats
  • Works on different platforms
  • Supports many languages through Unicode

16
Multilingual support
  • Interface part
  • Content part

17
Methodology
  • Software
      • Windows XP operating system
      • GSDL
      • Macromedia Fireworks
      • Nudi
      • Baraha
      • Internet Explorer 6.0
  • Hardware
      • Pentium III with 128 MB RAM

18
Hindi and Kannada Interface
  • Separate .dm files were created for both languages
  • _textimagehome_ {Home Page}
  • _textimagehome_ [l=kn] {&#2325;&#2367;&#2360;&#2369;&#2330;}
  • Creating tabs for Hindi and Kannada
  • Hindi tabs
      • Macromedia Fireworks
      • Baraha transliteration software
  • Kannada tabs
      • Macromedia Fireworks
      • Nudi transliteration software
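The four-digit runs in the macro entry above (2325, 2367, and so on) match the decimal values of Devanagari code points; 2325 is U+0915 and 2367 is U+093F. Assuming the macro text was written as decimal numeric character references, which the slide does not state outright, a string can be converted to and from that form roughly as follows. The function names `to_ncr` and `from_ncr` are hypothetical helpers, not part of GSDL.

```python
import html


def to_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal reference (&#NNNN;)."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)


def from_ncr(text: str) -> str:
    """Turn numeric character references back into characters."""
    return html.unescape(text)


label = "\u0915\u093f"  # two Devanagari code points, U+0915 and U+093F
encoded = to_ncr(label)
print(encoded)
print(from_ncr(encoded) == label)
```

Writing the label this way keeps the interface file itself in plain ASCII, so it can be edited without a Unicode-aware editor.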

19
Collection building
  • The Hindi non-Unicode collection was downloaded from
    http://manaskriti.com/kaavyaalaya/
  • The Kannada non-Unicode collection was downloaded from
    http://udayavani.com
  • Hindi Unicode collection
  • Kannada Unicode collection

20
System description
  • Hindi/Kannada non-Unicode collection
      • Susha/Shree-Kan-0850 → Font folder
      • Interface language → Hindi/Kannada
      • Preference encoding → Latin-based
      • Browser encoding → Latin-based or user-defined
  • Hindi/Kannada Unicode collection
      • Mangal/Tunga for Hindi/Kannada → Font folder
      • Interface language → Hindi/Kannada
      • Preference encoding → UTF-8
      • Browser encoding → UTF-8
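Why the preference encoding and the browser encoding have to agree can be seen from a small experiment: the same bytes read under two different encodings give two different texts. The sample word below is an assumption made for this sketch and is not taken from the collections.

```python
# The word "Hindi" in Devanagari script (six code points).
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

raw = hindi.encode("utf-8")        # bytes as stored for a Unicode collection

as_utf8 = raw.decode("utf-8")      # viewer set to UTF-8: text is recovered
as_latin = raw.decode("latin-1")   # viewer set to a Latin encoding: mojibake

print(as_utf8 == hindi)
print(len(hindi), len(as_latin))
```

With the Latin setting every byte becomes one accented Latin character, so six Devanagari code points turn into eighteen unrelated symbols.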

21
Observations
  • Can have interfaces in many languages.
  • Can build collections in many languages with
    encodings other than Unicode.
  • Non-Unicode collections have only the browse
    feature.
  • Titles of the non-Unicode collections were in
    English.
  • Unicode collections have both search and browse
    features.
  • All collections can be accessed over a network.

  • (cont.)

22
Observations
  • Uses the MG (Managing Gigabytes) compression
    technique.
  • Can browse lists of authors, lists of titles,
    lists of dates, and so on.
  • Can handle very large collections.
  • New data can be added to an existing collection
    at any time.
  • Being open-source software, it can be developed
    further by anyone and is amenable to local
    requirements.

23
Limitations
  • Fails to display Unicode HTML files in Hindi/
    Kannada.
  • It does not support truncated searching for
    Indian scripts.
  • The case-differences option cannot be disabled in
    the preferences page.
  • At present, the search feature works only on
    Windows XP.
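To make two of these limitations concrete, the sketch below shows what truncated (prefix) searching means for Devanagari text and why a case-differences option has no effect on it. This is a plain Python illustration under assumed sample data, not GSDL's search code; `truncated_search` is a hypothetical helper.

```python
def truncated_search(stem: str, documents: list) -> list:
    """Return the documents containing a word that starts with `stem`."""
    return [
        doc for doc in documents
        if any(word.startswith(stem) for word in doc.split())
    ]


# "kitaab" (book) in Devanagari, followed by two common plural endings.
base = "\u0915\u093f\u0924\u093e\u092c"
docs = [base, base + "\u0947\u0902", base + "\u094b\u0902", "\u0915\u0932\u092e"]

# A query for "kitaab*" matches the singular and both plural forms.
print(truncated_search(base, docs))

# Devanagari has no upper or lower case, so case folding changes nothing.
print(base.lower() == base and base.upper() == base)
```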

24
Conclusion
  • Multilingual digital libraries will be
    ubiquitous in the future and will provide the
    basis for a very broad set of distributed living
    activities, including computer-supported
    co-operative work, distance learning, etc.
    Developing countries like India, where many
    languages are in use, could utilize
    comprehensive software such as Greenstone, since
    Greenstone, being open-source software, is readily
    extensible to meet the needs of multilingualism.

25
Future developments
  • It can be extended to other Indian languages
    that Unicode supports.
  • The display problem with HTML files can be solved
    for Indian languages by creating model mappings in
    the UTF-8 charset.
  • Collections can be tested with different file
    formats such as PDF, RTF, e-mail, etc. for other
    Indian languages.
  • It can be tested with other operating systems
    such as UNIX and Linux, and with browsers such as
    Netscape and Opera, to assess their compatibility.
  • Stemming algorithms can be developed for Indian
    languages and incorporated into GSDL.
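As an indication of what such a stemming algorithm might look like, here is a minimal suffix-stripping sketch for Hindi. The two-entry suffix list, the minimum stem length and the function name `light_stem` are all assumptions for illustration; a usable stemmer would need a much fuller, linguistically checked suffix list, and nothing here reflects GSDL's own code.

```python
# Two common Hindi plural endings, written as code point sequences.
SUFFIXES = (
    "\u094b\u0902",  # vowel sign O + anusvara (oblique plural)
    "\u0947\u0902",  # vowel sign E + anusvara (direct plural)
)


def light_stem(word: str, min_stem: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least `min_stem` code points."""
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


base = "\u0915\u093f\u0924\u093e\u092c"  # "kitaab" (book)
for form in (base, base + "\u0947\u0902", base + "\u094b\u0902"):
    print(form, "->", light_stem(form))
```

Indexing the stems rather than the surface forms would let a query for the singular also retrieve the plural forms.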

26
Demo
27
  • Any questions?

28
  • Thank you