Unicode for Under Resourced Languages

1
Unicode for Under Resourced Languages
  • Daniel Yacob
  • The Geez Frontier Foundation

SALTMIL 5 Genoa, Italy 2006
2
Overview
  • What is Unicode?
  • More than Just Encoded Letters!
  • Working with Unicode
  • How Unicode can help you.
  • Resources and how to apply them.
  • Working for Unicode
  • How you can help Unicode.
  • How Unicode can help your U-RL.

3
My Background
  • Started Ethiopic software work in 1993
  • transliterator, keyboard, fonts
  • Amharic Computational Linguistics in 1994
  • Extended Ethiopic Unicode Standardization
    1995-2004
  • Corpus Collection 1997-Present
  • Began Using Unicode in 1995 for Ethiopic
  • but no Unicode standard existed until 2000!

4
My Background
  • Little or no Unicode based resources in 1993-1997
  • Today there is almost always an OpenSource
    project that you can start with and extend.
  • Minimize the time and labour you put into
    developing basic resources.
  • Avoid the maintenance trap.
  • We will assume the worst case scenario
  • You work on a language, using a script, with no
    pre-existing software resources at all.

5
What Unicode is
  • Unicode
  • is a consortium
  • is a process
  • is a community
  • is a conference
  • is a database
  • is a standard
  • is a collection of standards

6
What Unicode is not
  • Unicode
  • is not a font
  • is not a keyboard system
  • is not a transliteration system
  • is not the ISO
  • is not perfect
  • is not complete

7
Over 80 Scripts not Encoded!
Courtesy of Michael Everson http://evertype.com
8
Over 80 Scripts not Encoded!
Courtesy of Michael Everson http://evertype.com
9
Current State of the Unicode Standard: New Script
Additions
  • For Unicode 5.0 (2006)
  • NKo (West Africa)
  • Balinese (Indonesia)
  • Phags-pa (historical)
  • Phoenician (historical)
  • Cuneiform (historical)
  • For Unicode 5.1 (2008)
  • Lepcha (India)
  • Ol Chiki (India)
  • Vai (Liberia)
  • Saurashtra (India)
  • Myanmar minorities (Myanmar)
  • Kayah Li (Myanmar)
  • Rejang (Indonesia)
  • Sundanese (Indonesia)
  • Carian, Lycian, Lydian (historical)

Courtesy of Michael Everson http://evertype.com
10
Working with Unicode
  • Unicode is all About Text
  • Most applicable to problems where language is
    represented by text.
  • Unicode addresses some vocabulary but under the
    scope of localization (CLDR).
  • May not be the solution if you are not working
    with text represented in written form
  • Although, Unicode can be used for symbol
    processing

11
Working with Unicode
  • Operating Systems
  • Almost anything from this millennium.
  • Apple MacOS Version 9.2
  • Microsoft Windows CE, NT, XP, 2000
  • Solaris 2.8
  • Any GNU/Linux (for console use)
  • GNOME 2.0 or KDE 2.0 and Later

12
Working with Unicode
  • The International Phonetic Alphabet (IPA)

14
Working with Unicode
  • The International Phonetic Alphabet (IPA)
  • SIL Charis, Doulos, Gentium
  • free and most complete
  • matches Times New Roman style
  • http://scripts.sil.org/IPAhome

15
Working with Unicode
  • If you need more letters
  • Create Your own Fonts!
  • Use the Unicode Private Use Area (PUA)
  • this is Unicode's extension mechanism.
  • does not break compatibility with Unicode
    software.
  • you must send your fonts with your work.
  • encode non-letter symbols (tokens, tags), no need
    for fonts.

16
Working with Unicode
  • The PUA
  • 6,400 code points in the range E000-F8FF
  • 131,068 additional code points available in planes 15 and 16
  • Work in Plane 0 first (0000-FFFF)
  • Intended for company logos, ligatures used by
    typesetting software, etc.
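
The ranges above can be checked with a few lines of code. This is a minimal sketch in Python (not a tool named in the talk); the names PUA_RANGES, is_private_use and pua_sizes are made up for illustration, and the plane 15 and 16 ranges are taken from the Unicode Standard rather than from the slide.

```python
import unicodedata

# The three Private Use ranges defined by the Unicode Standard.
PUA_RANGES = [
    (0xE000, 0xF8FF),      # Plane 0 (the Basic Multilingual Plane)
    (0xF0000, 0xFFFFD),    # Plane 15
    (0x100000, 0x10FFFD),  # Plane 16
]


def is_private_use(ch):
    """Return True if the single character ch is a private-use code point."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PUA_RANGES)


def pua_sizes():
    """Return the number of code points in each private-use range."""
    return [hi - lo + 1 for lo, hi in PUA_RANGES]


def general_category(ch):
    """Return the general category the Unicode Character Database assigns to ch."""
    return unicodedata.category(ch)
```

Calling pua_sizes() gives 6,400 for Plane 0 and 65,534 for each of the two supplementary planes, and general_category reports "Co" (private use) for any character in these ranges.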

17
Working with Unicode
  • Creating Your Own Fonts
  • Bitmap (BDF)
  • Faster to create
  • One size per font, not so scalable
  • Works best with X-Windows (Unix)
  • Outline (TrueType, PostScript, OpenType)
  • Takes more time
  • Scalable
  • MS Windows, Mac, Modern Unixes

18
Working with Unicode
  • Bitmap Editors
  • Each letter is a matrix of pixels, like tiles
  • You toggle them on or off to shape your letters
  • GBDFED for recent GNOME/Linux
  • XBDFED for general Unix
  • Or search for BDF Editor

19
Working with Unicode
20
Working with Unicode
  • Bitmap Editors

Zoom View Within Edit Window
21
Working with Unicode
  • Outline Editors
  • Create Bezier curves to outline scalable shapes
  • Here traced around a scanned image
  • FontForge http://fontforge.sf.net

22
Working with Unicode
  • Creating Your Own Keyboards
  • No standard formats
  • Different on every operating system
  • May require some painful programming
  • transliteration may be a better alternative.
  • For small amounts of typing try
    Ctrl+Shift+X1X2X3X4
  • e.g. Ctrl+Shift+1234

23
Working with Unicode
  • Creating Your Own Keyboards
  • Linux
  • Migration Toward Smart Common Input Method (SCIM)
  • simple table based
  • more complex as needed
  • http://scim.sf.net
  • or Yudit or Emacs for older Unixes, but you can
    only type within these applications.
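
A "simple table based" keyboard is essentially a lookup from key sequences to characters. The Python sketch below illustrates the idea only; it is not SCIM's table format. KEY_TABLE is a hypothetical four-entry table whose key names simply follow the Unicode character names of the Ethiopic syllables.

```python
# Hypothetical key table: ASCII key sequences -> Ethiopic syllables.
KEY_TABLE = {
    "ha": "\u1200",  # ETHIOPIC SYLLABLE HA
    "hu": "\u1201",  # ETHIOPIC SYLLABLE HU
    "la": "\u1208",  # ETHIOPIC SYLLABLE LA
    "lu": "\u1209",  # ETHIOPIC SYLLABLE LU
}
MAX_KEY_LEN = max(len(key) for key in KEY_TABLE)


def type_keys(keys):
    """Convert typed keys to text using greedy longest-match table lookup."""
    out = []
    i = 0
    while i < len(keys):
        # Try the longest possible key sequence first, then shorter ones.
        for length in range(min(MAX_KEY_LEN, len(keys) - i), 0, -1):
            chunk = keys[i:i + length]
            if chunk in KEY_TABLE:
                out.append(KEY_TABLE[chunk])
                i += length
                break
        else:
            out.append(keys[i])  # no table entry: pass the key through
            i += 1
    return "".join(out)
```

Typing "halu" produces the two syllables U+1200 and U+1209, while spaces and punctuation pass through unchanged. A real input method adds state handling, backspace behaviour and a much larger table.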

24
Working with Unicode
  • Creating Your Own Keyboards
  • Windows
  • Keyman, most mature and robust
  • Keyboards created with Keyman Developer
  • 59 academic and developing world license
  • worth every cent
  • compiled keyboards also run under Linux with a
    SCIM module
  • http://tavultesoft.com

25
Working with Unicode
  • Text Processing
  • International Components for Unicode (ICU)
  • http://icu.sf.net
  • Java, C/C++
  • Bindings in Python, Ruby, C#, Perl 6 (some Perl
    5)
  • started by IBM, is OpenSource
  • managed by the Unicode president
  • check with ICU before
  • 700 Encoding Conversions
  • convert legacy systems to and from Unicode
  • migrate corpora to Unicode
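
Migrating a corpus usually comes down to decoding the legacy bytes and re-encoding them as UTF-8. The sketch below uses Python's built-in codecs rather than the ICU converters, purely to show the shape of the operation; the function names are made up, and a legacy font-based encoding with no standard codec would need its own mapping table.

```python
def legacy_to_utf8(data, source_encoding):
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    # Raises UnicodeDecodeError if the bytes are not valid in source_encoding.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


def utf8_to_legacy(data, target_encoding):
    """Convert UTF-8 bytes back to a legacy encoding."""
    text = data.decode("utf-8")
    # Raises UnicodeEncodeError if a character has no legacy equivalent.
    return text.encode(target_encoding)
```

For example, legacy_to_utf8(b"caf\xe9", "latin-1") returns the UTF-8 bytes for "café". Converting back with utf8_to_legacy fails loudly for characters the legacy encoding cannot represent, which is a useful check when migrating in both directions.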

26
Working with Unicode
  • Text Processing
  • ICU Normalization
  • Equate letters and diacritical symbols

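  • e.g. a letter with a dot below is equivalent to the base letter followed by U+0323 (combining dot below)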
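
Normalization can be tried out without ICU. The sketch below uses Python's unicodedata module as a stand-in; the choice of U+1EA1 (a with dot below) as the example letter and the helper name same_text are assumptions made for illustration.

```python
import unicodedata

PRECOMPOSED = "\u1EA1"   # LATIN SMALL LETTER A WITH DOT BELOW
DECOMPOSED = "a\u0323"   # 'a' followed by COMBINING DOT BELOW


def same_text(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)
```

The two strings differ code point by code point, yet same_text(PRECOMPOSED, DECOMPOSED) is true. Normalizing a corpus once, before searching or counting, avoids treating the two spellings as different words.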
27
Working with Unicode
  • Text Processing
  • ICU Regular Expressions
  • Applies the Unicode Character Database
  • Categorize every character as one of
  • Letter
  • Number
  • Separator
  • Punctuation
  • Marks
  • Symbols
  • Others
  • Subcategories within each. Examples:
  • Letter: uppercase, lowercase, other, ...
  • Symbol: math, currency, modifier, ...
  • Mark: spacing, non-spacing, enclosing
  • Defines 80 character property types
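
The general category of any character can be looked up directly. This sketch uses Python's unicodedata module instead of ICU; the MAJOR_CLASSES table and the describe function are illustrative names, and the two-letter codes are the standard Unicode general category abbreviations.

```python
import unicodedata

# First letter of the two-letter general category code -> major class.
MAJOR_CLASSES = {
    "L": "Letter",
    "N": "Number",
    "Z": "Separator",
    "P": "Punctuation",
    "M": "Mark",
    "S": "Symbol",
    "C": "Other",
}


def describe(ch):
    """Return (major class, two-letter category code) for one character."""
    code = unicodedata.category(ch)
    return MAJOR_CLASSES[code[0]], code
```

For instance, describe("A") gives ("Letter", "Lu"), describe("$") gives ("Symbol", "Sc"), and the combining dot below U+0323 is reported as ("Mark", "Mn").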

28
Working with Unicode
  • Text Processing
  • ICU Regular Expressions
  • Set Operations
  • [^\p{Letter}] Negation
  • [\p{Letter}\p{Number}] Union
  • [\p{Letter}&&\p{script=Cyrillic}] Intersection
  • [\p{Letter}--\p{Latin}] Difference
  • Important for a character set the size of Unicode.
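
The four set operations can be imitated with ordinary sets of characters. Python's standard re module does not support \p properties, so this sketch builds the sets by hand from the character database. It only scans code points below U+0530 to stay fast, and it guesses the script from the character name, which is an approximation and not how ICU determines scripts.

```python
import unicodedata

LIMIT = 0x0530  # scan only the first blocks (Latin, Greek, Cyrillic, ...)


def chars_where(predicate):
    """Collect every character below LIMIT for which predicate is true."""
    return {chr(cp) for cp in range(LIMIT) if predicate(chr(cp))}


def name_starts_with(prefix):
    """Approximate a script test by looking at the character name."""
    return lambda ch: unicodedata.name(ch, "").startswith(prefix)


everything = chars_where(lambda ch: True)
letters = chars_where(lambda ch: unicodedata.category(ch).startswith("L"))
numbers = chars_where(lambda ch: unicodedata.category(ch).startswith("N"))
cyrillic = chars_where(name_starts_with("CYRILLIC"))
latin = chars_where(name_starts_with("LATIN"))

non_letters = everything - letters        # negation
letters_or_numbers = letters | numbers    # union
cyrillic_letters = letters & cyrillic     # intersection
non_latin_letters = letters - latin       # difference
```

With a regular expression engine that understands Unicode properties, each of these sets is a single bracket expression instead of a loop over the code space.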

29
Working with Unicode
  • Text Processing
  • ICU Regular Expressions
  • Enhanced Word Boundaries
  • Hello There. G'day 123.456 (classic RE: \b also breaks
    inside G'day and 123.456)
  • Hello There. G'day 123.456 (Unicode word boundaries:
    G'day and 123.456 each stay whole)
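
The difference shows up as soon as the example sentence is split into words. The sketch below uses Python's re module: the first pattern is the classic \w+ behaviour, and the second is only a rough approximation of Unicode word boundaries (it allows one apostrophe or period between word characters), not the full segmentation algorithm that ICU implements.

```python
import re

TEXT = "Hello There. G'day 123.456"

# Classic regular expression words: runs of word characters only.
classic_words = re.findall(r"\w+", TEXT)

# Rough approximation of Unicode word boundaries: an apostrophe or a
# period may join two runs of word characters into a single word.
joined_words = re.findall(r"\w+(?:['.]\w+)*", TEXT)
```

Here classic_words contains six pieces (Hello, There, G, day, 123, 456), while joined_words contains four (Hello, There, G'day, 123.456). The period after "There" is not absorbed because no word character follows it.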

30
Working with Unicode
  • Text Processing
  • ICU Regular Expressions
  • Equivalence Classes
  • e matches all forms of e: e, è, é, ê, ë, etc.
  • not yet implemented
  • use Perl instead
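
Where equivalence classes are not available, a common workaround is to strip the diacritical marks before comparing. This is a sketch of that workaround in Python, not of how ICU or Perl implement equivalence classes; strip_marks and base_matches are illustrative names.

```python
import unicodedata


def strip_marks(text):
    """Decompose text and drop every combining mark (general category M*)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")
    )


def base_matches(letter, candidate):
    """True if candidate is letter once its diacritical marks are removed."""
    return strip_marks(candidate) == letter
```

With this, base_matches("e", "ê") is true and strip_marks turns "eèéêë" into "eeeee". Note that this discards distinctions that matter in many orthographies, so it suits searching better than storage.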

31
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic
  • Simple Plurals
  • 7?
  • vs
  • ?????????????????????????????????

32
Working with Unicode
Overloading Perl Regex with Regexp::Ethiopic
  • /3?/
  • ??????
  • ????
  • ????????
  • /3,6?/
  • ?????? ??????
  • ???? ????
  • ???????? ????????

33
Working with Unicode
  • Text Processing
  • ICU Transliteration
  • Defined by transform rules
  • One to one mappings
  • α <> a
  • β <> b
  • Context Rules
  • β aeiou > b
  • β aeiou > v
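
The idea of ordered rules with context can be sketched in a few lines. The Python below is not ICU's rule syntax or engine: it expresses context as a regular expression lookahead and tries rules in order so that the more specific rule wins. The three rules themselves are hypothetical and are not a real Greek romanization.

```python
import re

# Hypothetical rules, tried in order; earlier (more specific) rules win.
RULES = [
    # beta followed by a Greek vowel (context only, the vowel is not consumed)
    (re.compile("\u03B2(?=[\u03B1\u03B5\u03B9\u03BF\u03C5])"), "v"),
    (re.compile("\u03B2"), "b"),  # beta anywhere else
    (re.compile("\u03B1"), "a"),  # alpha, a one-to-one mapping
]


def transliterate(text):
    """Apply the first matching rule at each position, left to right."""
    out = []
    i = 0
    while i < len(text):
        for pattern, replacement in RULES:
            match = pattern.match(text, i)
            if match:
                out.append(replacement)
                i = match.end()
                break
        else:
            out.append(text[i])  # no rule applies: copy the character
            i += 1
    return "".join(out)
```

With these rules "βα" becomes "va" while "αβ" becomes "ab". ICU's transform rules add reverse mappings, character properties and chaining on top of the same basic mechanism.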

34
Working with Unicode
  • Text Processing
  • ICU Transliteration
  • Defined by transform rules
  • Applying UCD Properties
  • T LowercaseLetter <> Th
  • T <> TH
  • Reverse Transliteration Context Rules
  • s < Letter s Letter
  • ? < s Letter
  • s < s

35
Working with Unicode
  • Text Processing
  • ICU Transliteration
  • Gets much more sophisticated
  • See also Perl's Text::Transliterate

36
Working for Unicode
  • Taking Your Work a Step Further
  • You've helped create an orthography; now make it
    official.
  • You've worked with a pre-existing un-encoded
    script using the PUA; now formalize it.
  • You've created a transliteration system: make it
    an ISO standard.
  • You've identified a dialect: encode it in ISO
    639.
  • You've developed a keyboard: make it a national
    standard.
  • etc.

37
Working for Unicode
  • Why go the extra mile (kilometer)?
  • Ethnic pride and identity is promoted.
  • Literacy efforts can be encouraged.
  • The study of historic scripts is kept alive.
  • Communication between and amongst members of the
    community is promoted.
  • Government communication in times of emergency
    (disease, war, natural disaster).
  • Leads to localization, greater access to ICT.
  • and you become the expert!

38
Working for Unicode
  • What to Consider
  • The work will be more social than technical.
  • The work will take years (at least two).
  • Review Encoding History
  • Has this been attempted before and failed? Why?
  • Are there any non-Unicode encodings?
  • Determine the Stakeholders
  • The Government: will they support you, oppose
    you, jail you?
  • Political Parties, Religious, Education, Cultural
    Groups
  • does anyone have something to lose by the
    encoding?
  • Communicate, Communicate, Communicate
  • and be transparent.
  • the perception of being closed breeds suspicion
    and opposition.
  • even 11 years after the fact, trust me on this.

39
Working for Unicode
  • New Keyboard?
  • No international standardization working groups
  • Contribute Keyboard back to main project
  • Contact Local ICT Professionals Organization
  • Contact Local University CS Department
  • Contact Local Standards Body

40
Working for Unicode
  • New Language or Dialect?
  • Contact the ISO/DIS 639-3 Registration Authority
  • http://sil.org/iso639-3/
  • iso639-3@sil.org
  • Contact Language or Cultural Authority
  • Contact Local University Linguistics Department

41
Working for Unicode
  • New Orthography? Or Un-encoded?
  • Contact the ISO 15924 Registration Authority
  • http://unicode.org/iso15924/
  • Contact Language or Cultural Authority
  • Contact Local ICT Professionals Organization
  • Contact Local University CS Department
  • Contact Local University Linguistics Department
  • Contact Local Standards Body
  • Contact the Script Encoding Initiative

42
Working for Unicode
  • The Script Encoding Initiative
  • http://linguistics.berkeley.edu/sei
  • Works with users on script proposals.
  • Helps raise money for script proposals to be
    written and free fonts to be created.
  • Works collaboratively with other groups (e.g.
    SIL) to avoid duplication of effort.
  • Helps seek experts to review proposals.
  • Participates at standards meetings on behalf of
    minority groups and scholars.

43
fini
  • Conclusion
  • Use Unicode Now!
  • You can do it!
  • Yes you can do it!
  • There are no excuses anymore
  • it's 2006 already, I'm telling you, you can do this!
  • and when you do (remember I have faith in you!)
    consider feeding back into the system via
    standardization.
  • Be a good citizen of earth, always.
  • Thank You for Listening.
  • Are There Any Questions?
  • This presentation: http://yacob.org/papers/