Character Encoding, Fonts

Slides: 20
Provided by: Martin488

Transcript and Presenter's Notes

1
Character Encoding, Fonts
2
Overview
  • Why do character encoding and fonts matter to
    linguists?
  • How can you identify problems?
  • Why do these problems arise?
  • How might they be solved?

3
Why does this matter to linguists?
  • We work in a global market
  • Might receive a file or email which you cannot
    read.
  • Might be asked to translate a file which has been
    encoded in a way that is not supported by the
    translation tools available to you.
  • Even if you're not directly responsible for a
    particular language, you might be responsible for
    a translation project overall.

4
You might visit a web page that looks like this.
In fact, you may be asked to translate a web
page that looks like this!
5
You might open a file to edit it, and find it
looks like this
6
Or this
7
Why does this happen?
  • No standard has been universally implemented to
    store and transfer characters.
  • Typically different standards are used by
      • Different applications
      • Different types of computer
      • Different languages
      • Different countries (or locales)
  • Alternatively, the character has been correctly
    interpreted, but it is not covered by the font
    in use

8
  • Alright, alright... how does it work?

9
Bits and bytes
  • Computers deal in binary digits.
  • These are known as bits.
  • Each bit has a value of 0 or 1.
  • These are stored and transferred as groups, known
    as bytes.
  • Bytes have 8 bits.
  • The value of a byte is a number between 0
    (00000000) and 255 (11111111)
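
As a rough illustration of the points above, the following Python sketch prints the bit pattern behind a byte value. The helper name `byte_to_bits` is made up for this example and is not part of the original slides.

```python
# Each byte is 8 bits, so it can hold 2**8 = 256 distinct values (0..255).
BITS_PER_BYTE = 8


def byte_to_bits(value: int) -> str:
    """Return the 8-digit binary form of a byte value (0-255)."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return format(value, "08b")


print(2 ** BITS_PER_BYTE)   # how many different values one byte can take
print(byte_to_bits(0))      # lowest byte value
print(byte_to_bits(255))    # highest byte value
print(int("01000001", 2))   # binary digits converted back to a number
```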

10
Characters and numbers
  • Computers must be able to associate the
    characters used in human writing systems with
    binary values.
  • Every character is associated with a number. The
    number is called a code point.
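
Python exposes this character-to-number association directly through the built-in functions `ord` and `chr`, so a small sketch can make it concrete. Python strings use Unicode code points; the helper `code_point_label` is an illustrative name only.

```python
def code_point_label(ch: str) -> str:
    """Format a single character's code point in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"


# ord() maps a character to its code point; chr() maps the number back.
for ch in ["A", "é", "я", "中"]:
    number = ord(ch)
    print(ch, number, code_point_label(ch), chr(number) == ch)
```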

11
A coded character set relates each character in a
character repertoire to a code point using a
table.
(Example from Roman Czyborra,
http://czyborra.com/charsets/iso646.html)
12
The range of characters encoded
  • Limited by the size of the table used in the coded
    character set.
  • In turn, this is limited by the number of bits
    used to encode characters
      • 8 bits offer 256 combinations
  • Number of bits available is limited by the number
    of bytes used to store a character
      • 1 byte per character: 256 possible combinations
      • 2 bytes per character: 65,536 possible
        combinations
  • Which brings us to...

13
Character encoding schemes
  • A character encoding scheme relates each code
    point in a coded character set to a byte or
    series of bytes.
  • This is how character codes are packaged for
    storage and transfer.
  • Problems occur when applications misinterpret
    these sequences of bytes.
  • One character can be mistaken for another.
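
The misinterpretation described above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `misread` and the sample words are chosen for illustration, and the codec names are the ones Python's standard library uses.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text with one scheme, then decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


# UTF-8 bytes read as CP1252: the two-byte "é" shows up as two characters.
print(misread("café", "utf-8", "cp1252"))

# CP1251 (Cyrillic) bytes read as Latin 1: every letter becomes a different one.
print(misread("Привет", "cp1251", "latin-1"))

# If no bytes were lost, reversing the two steps recovers the original text.
garbled = misread("café", "utf-8", "cp1252")
print(garbled.encode("cp1252").decode("utf-8"))
```

The bytes themselves are never changed in these examples; only the table used to interpret them differs.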

14
Coded character sets for alphabets
Region or language group covered | ISO Standard | Other Name | Other available character encodings
Western Europe | ISO 8859-1 | Latin 1 | CP1252 (WinLatin1), Unicode
Languages which use the Cyrillic alphabet | ISO 8859-5 | Cyrillic | CP1251 (WinCyrillic), KOI8-R, KOI8-U, Unicode
Arabic (missing the 4 additional letters required for Farsi and 8 required for Urdu) | ISO 8859-6 | Arabic | CP1256 (WinArabic), Unicode
Modern Greek | ISO 8859-7 | Greek | CP1253 (WinGreek), Unicode
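
To see why the choice of table matters, the sketch below decodes one and the same byte value under four of the ISO 8859 parts listed above. The byte 0xE4 and the helper name `decode_byte_as` are arbitrary choices for this illustration.

```python
import unicodedata

# The four ISO 8859 parts from the table, using Python's codec names.
CODECS = ["iso-8859-1", "iso-8859-5", "iso-8859-6", "iso-8859-7"]


def decode_byte_as(byte_value: int, codecs: list[str]) -> dict[str, str]:
    """Decode a single byte under each named coded character set."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in codecs}


# One byte, four different characters depending on the table in use.
for codec, ch in decode_byte_as(0xE4, CODECS).items():
    print(codec, ch, f"U+{ord(ch):04X}", unicodedata.name(ch))
```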
15
Coded character set families for other writing
systems
  • EUC (Extended Unix Code)
      • Used for Chinese, Japanese and Korean (CJK)
        languages.
  • Big5 and GB (Guojia Biaozhun - 国家标准)
      • Big5 used in Taiwan and Hong Kong - also known
        as Chinese Traditional
      • GB used in PRC and Singapore - also known as
        Chinese Simplified
  • JIS (Japanese Industrial Standards)
      • Used for Japanese
  • Unicode
  • All use multibyte character encoding schemes
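
A short Python sketch can show the multibyte behaviour mentioned above. The sample characters are arbitrary, the helper `encoded_length` is a name invented here, and the codec names are those shipped with Python's standard library.

```python
# (character, codec) pairs: one sample per encoding family.
SAMPLES = [
    ("中", "big5"),     # Traditional Chinese
    ("中", "gb2312"),   # Simplified Chinese
    ("日", "euc_jp"),   # Japanese (EUC packaging of JIS)
    ("한", "euc_kr"),   # Korean
    ("中", "utf-8"),    # Unicode, UTF-8 encoding scheme
]


def encoded_length(ch: str, codec: str) -> int:
    """Return how many bytes one character occupies in the given encoding."""
    return len(ch.encode(codec))


for ch, codec in SAMPLES:
    raw = ch.encode(codec)
    print(codec, ch, raw.hex(" "), encoded_length(ch, codec), "bytes")
```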

16
Unicode
  • 16-bit planes: each holds up to 65,536
    characters
  • Characters from nearly all contemporary writing
    systems, including both alphabetic and
    ideographic scripts, plus additional characters
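
Because each plane spans 16 bits, the plane a character belongs to is simply its code point with the low 16 bits dropped. The sketch below assumes this arithmetic; `plane_of` is an illustrative helper name.

```python
import sys


def plane_of(ch: str) -> int:
    """Return the index of the 16-bit plane that holds this character."""
    return ord(ch) >> 16


print(plane_of("A"))            # Latin letter
print(plane_of("中"))           # common CJK ideograph
print(plane_of("\U0001F600"))   # emoji, code point U+1F600
print(plane_of("\U00020000"))   # rare CJK ideograph, code point U+20000

# Highest code point Python accepts, and the number of planes that implies.
print(hex(sys.maxunicode), (sys.maxunicode >> 16) + 1)
```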

17
Unicode Transformation Formats
  • Unicode standard is a coded character set
  • Different encoding schemes, or Unicode
    Transformation Formats, are used
  • UTF-8 uses a sequence of 1, 2, 3, or 4 bytes for
    each character
      • Will be most common on WWW
      • Less efficient than legacy systems for CJK
        characters
  • UTF-16 uses 2 bytes for most characters (4 bytes
    for characters outside the first plane)
      • Known as Unicode in Microsoft applications
      • Doubles the size of ASCII/ISO 8859 files
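
The size trade-offs between the transformation formats can be checked directly. This sketch uses Python's `utf-8` and `utf-16-le` codecs (the latter chosen so that no byte order mark is added to the count); `utf_sizes` is a name invented for the example.

```python
def utf_sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


# ASCII letter, accented Latin letter, CJK ideograph, emoji beyond plane 0.
for ch in ["A", "é", "中", "\U0001F600"]:
    utf8_len, utf16_len = utf_sizes(ch)
    print(f"U+{ord(ch):04X}", "UTF-8:", utf8_len, "UTF-16:", utf16_len)
```

ASCII text doubles in size under UTF-16, while a CJK character costs three bytes in UTF-8 against two in UTF-16 or in a legacy two-byte encoding.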

18
Fonts
  • Character encoding schemes deal with abstract
    characters
  • a is the same character as a, whatever typeface
    or style it is displayed in
  • By contrast, fonts are collections of specific
    forms or images.
  • Different fonts can be applied to a single
    encoded character.

19
Some solutions
  • Problems should be reduced by implementation of
    Unicode.
  • Web pages - experiment with settings in your
    browser
      • In IE use Encoding on the View menu
  • Text files
      • Open file in Microsoft Word
      • Word should identify the character encoding
        scheme used
      • Save as Plain Text
      • Select required character encoding
  • Check font settings
      • Use Unicode fonts for multilingual documents
  • Send text as email attachment (rather than in
    body)