Title: Character Encoding, Fonts
1. Character Encoding, Fonts
2. Overview
- Why do character encoding and fonts matter to linguists?
- How can you identify problems?
- Why do these problems arise?
- How might they be solved?
3. Why does this matter to linguists?
- We work in a global market
- You might receive a file or email which you cannot read.
- You might be asked to translate a file which has been encoded in a way that is not supported by the translation tools available to you.
- Even if you're not directly responsible for a particular language, you might be responsible for a translation project overall.
4. You might visit a web page that looks like this
In fact, you may be asked to translate a web page that looks like this!
5. You might open a file to edit it, and find it looks like this
6. Or this
7. Why does this happen?
- No standard has been universally implemented to store and transfer characters.
- Typically, different standards are used by
  - Different applications
  - Different types of computer
  - Different languages
  - Different countries (or locales)
- Even when a character has been correctly interpreted, it may not be covered by the font in use
8. Alright, alright: how does it work?
9. Bits and bytes
- Computers deal in binary digits.
- These are known as bits.
- Each bit has a value of 0 or 1.
- These are stored and transferred as groups, known as bytes.
- Bytes have 8 bits.
- The value of a byte is a number between 0
(00000000) and 255 (11111111)
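The relationship between a byte's value and its eight bits can be checked with a short Python sketch. The helper names `to_bits` and `from_bits` are made up for this illustration and are not part of the slides.

```python
def to_bits(value: int) -> str:
    """Return the 8-bit binary string for a byte value (0 to 255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")


def from_bits(bits: str) -> int:
    """Return the numeric value of a string of binary digits."""
    return int(bits, 2)


print(to_bits(0))             # 00000000
print(to_bits(255))           # 11111111
print(from_bits("01000001"))  # 65
```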
10. Characters and numbers
- Computers must be able to associate the characters used in human writing systems with binary values.
- Every character is associated with a number. The number is called a code point.
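As a small illustration, Python's built-in `ord` and `chr` convert between a character and its code point. Note the assumption: Python reports Unicode code points, which is only one of the coded character sets discussed in these slides.

```python
sample = "Aé中"

# ord() gives the code point of a character; chr() goes the other way.
code_points = [ord(ch) for ch in sample]

for ch, cp in zip(sample, code_points):
    # Unicode code points are conventionally written as U+ plus hex digits.
    print(ch, cp, f"U+{cp:04X}")

restored = "".join(chr(cp) for cp in code_points)
```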
11. A coded character set relates each character in a character repertoire to a code point using a table.
(Example from Roman Czyborra, http://czyborra.com/charsets/iso646.html)
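A coded character set can be pictured as a lookup table. The sketch below uses a three-entry excerpt rather than a full table, and the names `US_TABLE`, `UK_TABLE` and `lookup` are invented for the example. It relies on one well-known difference between ISO 646 national variants: code point 0x23 is "#" in the US version and "£" in the UK version.

```python
# Tiny excerpts of two ISO 646 variants (illustrative, not complete tables).
US_TABLE = {0x23: "#", 0x41: "A", 0x61: "a"}
UK_TABLE = {**US_TABLE, 0x23: "£"}  # same table, one position reassigned


def lookup(table: dict, code_point: int) -> str:
    """Return the character a coded character set assigns to a code point."""
    return table[code_point]


for code_point in (0x23, 0x41):
    print(hex(code_point), lookup(US_TABLE, code_point), lookup(UK_TABLE, code_point))
```

The same number therefore yields a different character depending on which table the reader assumes.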
12. The range of characters encoded
- Limited by the size of the table used in the coded character set.
- In turn, this is limited by the number of bits used to encode characters
  - 8 bits offer 256 combinations
- The number of bits available is limited by the number of bytes used to store a character
  - 1 byte per character: 256 possible combinations
  - 2 bytes per character: 65,536 possible combinations
- Which brings us to...
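The figures above follow from the fact that n bits give 2 to the power n distinct values. A minimal check, with a hypothetical helper named `combinations`:

```python
def combinations(num_bytes: int) -> int:
    """Number of distinct values that fit in the given number of 8-bit bytes."""
    return 2 ** (8 * num_bytes)


print(combinations(1))  # 256
print(combinations(2))  # 65536
```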
13. Character encoding schemes
- A character encoding scheme relates each code point in a coded character set to a byte or series of bytes.
- This is how character codes are packaged for storage and transfer.
- Problems occur when applications misinterpret these sequences of bytes.
- One character can be mistaken for another.
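The kind of misinterpretation described above can be reproduced in a few lines of Python. This is a sketch under one assumption: the text is stored as UTF-8 and then read by an application that assumes ISO 8859-1 (Latin 1). Other pairings of encodings produce different damage.

```python
text = "café"

# Package the characters as bytes using UTF-8.
stored = text.encode("utf-8")

# An application that wrongly assumes Latin 1 reads each byte as one character.
misread = stored.decode("latin-1")

# Reading the bytes with the scheme that wrote them recovers the text.
correct = stored.decode("utf-8")

print(stored)   # b'caf\xc3\xa9'
print(misread)  # cafÃ©
print(correct)  # café
```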
14. Coded character sets for alphabets
Region or language group | ISO standard | Other name | Other available character encodings
Western Europe | ISO 8859-1 | Latin 1 | CP1252 (WinLatin1), Unicode
Languages which use the Cyrillic alphabet | ISO 8859-5 | Cyrillic | CP1251 (WinCyrillic), KOI8-R, KOI8-U, Unicode
Arabic (missing the 4 additional letters required for Farsi and the 8 required for Urdu) | ISO 8859-6 | Arabic | CP1256 (WinArabic), Unicode
Modern Greek | ISO 8859-7 | Greek | CP1253 (WinGreek), Unicode
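To see why the choice among the encodings in the Cyrillic row matters, the sketch below encodes one Russian word with three of them using Python's standard codecs. The variable names are illustrative only.

```python
word = "Да"  # Russian for "yes"

CODECS = ("iso8859_5", "cp1251", "koi8_r")

# The same two characters become different byte values in each encoding.
encoded = {name: word.encode(name) for name in CODECS}

for name, data in encoded.items():
    print(name, data.hex())

# Reading CP1251 bytes with the KOI8-R table succeeds, but gives other characters.
misread = encoded["cp1251"].decode("koi8_r")
print(misread)
```

Because every byte value is assigned in KOI8-R, the wrong decoding raises no error, which is why such mix-ups go unnoticed until someone reads the text.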
15. Coded character set families for other writing systems
- EUC (Extended Unix Code)
  - Used for Chinese, Japanese and Korean (CJK) languages.
- Big5 and GB (Guojia Biaozhun, 国家标准)
  - Big5 used in Taiwan and Hong Kong, also known as Chinese Traditional
  - GB used in PRC and Singapore, also known as Chinese Simplified
- JIS (Japanese Industrial Standard)
  - Used for Japanese
- Unicode
- All use multibyte character encoding schemes
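A quick way to see multibyte encoding at work is to encode one Chinese character with several of these families. The sketch uses codecs that ship with Python; `shift_jis` is included as one JIS-based encoding, which is a choice made for this example and not something the slides specify.

```python
char = "中"  # a character shared by Chinese and Japanese

CODECS = ("big5", "gb2312", "euc_jp", "shift_jis")

# Each codec needs more than one byte for this character,
# and each assigns it different byte values.
encoded = {name: char.encode(name) for name in CODECS}

for name, data in encoded.items():
    print(name, len(data), data.hex())
```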
16. Unicode
- Divided into 16-bit planes, each holding up to 65,536 characters
- Characters from nearly all contemporary writing systems, including both alphabetic and ideographic scripts, plus additional characters
17. Unicode Transformation Formats
- The Unicode standard is a coded character set
- Different encoding schemes, or Unicode Transformation Formats, are used
- UTF-8 uses a 1, 2, 3, or 4 byte sequence for each character
  - Expected to be the most common encoding on the WWW
  - Less efficient than legacy encodings for CJK characters
- UTF-16 uses 2 bytes for most characters (4 bytes for characters outside the first plane)
  - Known as Unicode in Microsoft applications
  - Files are double the size of their ASCII/ISO 8859 equivalents
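The size trade-offs listed above can be measured directly. In this sketch the helper `sizes` is a made-up name, and `utf-16-le` is used so that the byte-order mark Python would otherwise prepend does not inflate the counts.

```python
def sizes(text: str) -> tuple:
    """Return (bytes in UTF-8, bytes in UTF-16) for the given text."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(sizes("a"))           # (1, 2)
print(sizes("é"))           # (2, 2)
print(sizes("中"))          # (3, 2)
print(sizes("\U0001F600"))  # (4, 4), a character outside the first plane
print(sizes("hello"))       # (5, 10), plain ASCII text doubles in UTF-16
```

The CJK character takes 3 bytes in UTF-8 against 2 in UTF-16 (and 2 in the legacy CJK encodings), which is the inefficiency the slide refers to.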
18. Fonts
- Character encoding schemes deal with abstract characters
  - a is the same character as a, whatever typeface it appears in
- By contrast, fonts are collections of specific forms or images.
- Different fonts can be applied to a single encoded character.
19. Some solutions
- Problems should be reduced by implementation of Unicode.
- Web pages: experiment with settings in your browser
  - In Internet Explorer, use Encoding on the View menu
- Text files
  - Open file in Microsoft Word
  - Word should identify the character encoding scheme used
  - Save as Plain Text
  - Select required character encoding
- Check font settings
  - Use Unicode fonts for multilingual documents
- Send text as email attachment (rather than in body)
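The text-file steps above (identify the encoding, then save in the required one) can also be sketched in Python for cases where Word is not available. Both function names are hypothetical, and the candidate list is an assumption that you would adapt to the languages involved. A clean decode is only a hint: single-byte encodings accept almost any bytes, so the result still needs to be checked by someone who can read the text.

```python
from typing import Optional, Sequence, Tuple


def first_clean_decoding(data: bytes, candidates: Sequence[str]) -> Optional[Tuple[str, str]]:
    """Try each candidate encoding in order; return (name, text) for the first that decodes."""
    for name in candidates:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding; try the next
    return None


def reencode(data: bytes, source: str, target: str = "utf-8") -> bytes:
    """Convert bytes from a known source encoding to the target encoding."""
    return data.decode(source).encode(target)


# Example: bytes written by an application that used CP1251 (WinCyrillic).
received = "Да".encode("cp1251")

guess = first_clean_decoding(received, ["utf-8", "cp1251"])
print(guess)

converted = reencode(received, "cp1251")
print(converted.decode("utf-8"))
```

Putting UTF-8 first in the candidate list is deliberate: it rejects most byte sequences that were written in another encoding, so a successful UTF-8 decode is a stronger signal than a successful single-byte decode.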