Character Encoding, Fonts - PowerPoint PPT Presentation

About This Presentation

Slides: 20

Provided by: Martin488

Transcript and Presenter's Notes

1
Character Encoding, Fonts
2
Overview

Why do character encoding and fonts matter to
linguists?
How can you identify problems?
Why do these problems arise?
How might they be solved?

3
Why does this matter to linguists?

We work in a global market.
You might receive a file or email which you cannot
read.
You might be asked to translate a file which has been
encoded in a way that is not supported by the
translation tools available to you.
Even if you're not directly responsible for a
particular language, you might be responsible for
a translation project overall.

4
You might visit a web page that looks like this.
In fact, you may be asked to translate a web
page that looks like this!
5
You might open a file to edit it, and find it
looks like this
6
Or this
7
Why does this happen?

No standard has been universally implemented to
store and transfer characters.
Typically, different standards are used by:
Different applications
Different types of computer
Different languages
Different countries (or locales)
Alternatively, the character has been correctly
interpreted but is not covered by the font in use.

8
Alright, alright,
how does it work?

9
Bits and bytes

Computers deal in binary digits.
These are known as bits.
Each bit has a value of 0 or 1.
These are stored and transferred as groups, known
as bytes.
Bytes have 8 bits.
The value of a byte is a number between 0
(00000000) and 255 (11111111)
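
As a small illustration of the range above, here is a Python sketch (the variable names are ours, not from the slides) that converts between a byte's bit pattern and its numeric value:

```python
# A byte is 8 bits, so it can hold 2**8 = 256 distinct values (0..255).
lowest = int("00000000", 2)    # parse a string of bits as a base-2 number
highest = int("11111111", 2)

# format(..., "08b") goes the other way: number -> 8-digit bit pattern.
bits_of_65 = format(65, "08b")

print(lowest, highest, bits_of_65)   # 0 255 01000001
```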

10
Characters and numbers

Computers must be able to associate the
characters used in human writing systems with
binary values.
Every character is associated with a number. The
number is called a code point.
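
A minimal sketch of the character-to-number link, using Python's built-in ord() and chr(). Note the assumption: Python reports Unicode code points, which for a letter such as "A" happen to coincide with the older ASCII value.

```python
# ord() returns the code point of a character; chr() is the inverse.
code_point = ord("A")
character = chr(code_point)

print(code_point, hex(code_point), character)   # 65 0x41 A
```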

11
A coded character set relates each character in a
character repertoire to a code point using a
table.
(Example from Roman Czyborra, http://czyborra.com/charsets/iso646.html)
12
The range of characters encoded

Limited by size of table used in coded character
set.
In turn, this is limited by the number of bits used to
encode characters
8 bits offer 256 combinations
The number of bits available is limited by the number of
bytes used to store a character
1 byte per character: 256 possible combinations
2 bytes per character: 65,536 possible
combinations
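
The arithmetic behind these figures can be sketched in a few lines of Python. The function name combinations is hypothetical, chosen only for this illustration:

```python
def combinations(num_bytes: int) -> int:
    """Number of distinct values that fit in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes)

print(combinations(1))   # 256
print(combinations(2))   # 65536
```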
Which brings us to

13
Character encoding schemes

A character encoding scheme relates each code
point in a coded character set to a byte or
series of bytes.
This is how character codes are packaged for
storage and transfer.
Problems occur when applications misinterpret
these sequences of bytes.
One character can be mistaken for another.
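
A hedged sketch of how such a misinterpretation arises: the text is stored with one encoding scheme (UTF-8 here, as an assumed example) and read back by an application that assumes another (Latin 1). The sample word is ours.

```python
text = "café"

# The writer stores the text as UTF-8: "é" becomes the two bytes C3 A9.
stored = text.encode("utf-8")

# A reader that wrongly assumes Latin 1 turns each byte into its own character.
misread = stored.decode("latin-1")

# A reader that uses the right scheme recovers the original text.
correct = stored.decode("utf-8")

print(stored)    # b'caf\xc3\xa9'
print(misread)   # cafÃ©
print(correct)   # café
```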

14
Coded character sets for alphabets
Region or language group covered | ISO standard | Other name | Other available character encodings
Western Europe | ISO 8859-1 | Latin 1 | CP1252 (WinLatin1), Unicode
Languages which use the Cyrillic alphabet | ISO 8859-5 | Cyrillic | CP1251 (WinCyrillic), KOI8-R, KOI8-U, Unicode
Arabic (missing the 4 additional letters required for Farsi and the 8 required for Urdu) | ISO 8859-6 | Arabic | CP1256 (WinArabic), Unicode
Modern Greek | ISO 8859-7 | Greek | CP1253 (WinGreek), Unicode
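
To see why the choice of coded character set matters, the following Python sketch decodes one and the same byte (0xE1, picked arbitrarily for this illustration) with the four ISO 8859 sets listed above. The codec names are those of Python's standard library.

```python
raw = b"\xe1"   # a single byte with the value 225

# Decode the same byte under each of the four ISO 8859 character sets.
decoded = {
    name: raw.decode(name)
    for name in ("iso-8859-1", "iso-8859-5", "iso-8859-6", "iso-8859-7")
}

for name, ch in decoded.items():
    # Show the resulting character and its Unicode code point.
    print(name, ch, f"U+{ord(ch):04X}")
```

Each set yields a different letter: a Latin a with acute accent, a Cyrillic es, an Arabic feh and a Greek alpha respectively.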
15
Coded character set families for other writing
systems

EUC (Extended Unix Code)
Used for Chinese, Japanese and Korean (CJK)
languages.
Big5 and GB (Guojia Biaozhun - 国家标准)
Big5 used in Taiwan and Hong Kong - also known as
Chinese Traditional
GB used in PRC and Singapore - also known as
Chinese Simplified
JIS (Japanese Industrial Standards)
Used for Japanese
Unicode
All use multibyte character encoding schemes
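
A rough sketch of what "multibyte" means in practice, assuming Python's standard codecs as stand-ins for these families (gb2312 for GB, euc_jp for EUC, shift_jis as one common encoding of JIS). The sample character is ours.

```python
han = "中"      # a character shared by Chinese and Japanese
latin = "A"

encodings = ("big5", "gb2312", "euc_jp", "shift_jis", "utf-8")

# Number of bytes each encoding scheme needs for the two characters.
han_sizes = {name: len(han.encode(name)) for name in encodings}
latin_sizes = {name: len(latin.encode(name)) for name in encodings}

print(han_sizes)     # two bytes in the CJK encodings, three in UTF-8
print(latin_sizes)   # one byte everywhere
```

The byte values themselves differ between encodings, so the bytes written by one scheme are not readable under another.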

16
Unicode

16-bit planes: each holds up to 65,536
characters
Characters from nearly all contemporary writing
systems, including both alphabetic and
ideographic scripts, plus additional characters
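
As a sketch of the plane structure, a code point can be split into a plane number and a 16-bit position within that plane. The helper name plane_and_offset and the sample characters are our own choices for illustration.

```python
def plane_and_offset(ch: str) -> tuple[int, int]:
    """Split a character's code point into (plane, position within the plane)."""
    cp = ord(ch)
    return cp >> 16, cp & 0xFFFF   # upper bits select the plane, low 16 bits the slot

samples = ["A", "α", "中", "\U0001F600"]   # Latin, Greek, Han, an emoji
for ch in samples:
    plane, offset = plane_and_offset(ch)
    print(f"U+{ord(ch):04X} plane={plane} offset=0x{offset:04X}")
```

The first three samples fall in plane 0, while the emoji lies in plane 1.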

17
Unicode Transformation Formats