Character Set/Encoding

1
Character Set/Encoding Background Tutorial
Lance Vowell
2
History
  • Fundamentally, computers just deal with numbers.
    They store letters and other characters by
    assigning a number to each one. These numbers are
    stored as bits and bytes.

3
History
  • UNIX and the C Programming Language
  • The only characters that mattered were unaccented
    English
  • ASCII
  • 0-31: unprintable control characters
  • 32-127: printable characters (except 127, DEL)
  • Computers could use 8 bits, but ASCII only used 7
    bits.
  • Some people thought, "We can use 128-255 for
    whatever we want!"
  • IBM-PC
  • OEM Character Set provided accented characters
    for European Languages
  • More and more users were using the top 128
    characters for their own purposes
  • Example
  • On some PCs the character code 130 would display
    é
  • On computers sold in Israel it was the Hebrew
    letter gimel (ג)
  • So when Americans sent their résumés to Israel,
    they would arrive as rגsumגs
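
The résumé example can be reproduced in a few lines. This is a minimal sketch that assumes the two machines used code page 437 (the original IBM PC set) and code page 862 (Hebrew DOS); the slides do not name the code pages, so treat those two choices as illustrative.

```python
# The same byte value, interpreted under two legacy OEM code pages.
# "cp437" and "cp862" are codec names from Python's standard library;
# picking these two code pages is an assumption made for illustration.
raw = bytes([130])

as_us_pc = raw.decode("cp437")      # the accented letter e-acute
as_hebrew_pc = raw.decode("cp862")  # the Hebrew letter gimel

# A resume written on a cp437 machine and read on a cp862 machine.
sent = "r\u00e9sum\u00e9".encode("cp437")
received = sent.decode("cp862")

print(as_us_pc, as_hebrew_pc, received)
```

The bytes never change in transit; only the table used to interpret them does.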

4
The Issue
  • Everybody agreed on 0-127, but everybody wanted
    to do something different with 128-255.
  • The different systems for handling these
    characters were called Code Pages, and the one
    you used depended on where you lived.
  • Very difficult to display two languages on the
    same computer.
  • These code pages also conflict with one another.
    That is, two encodings can use the same number
    for two different characters, or use different
    numbers for the same character.
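
Both kinds of conflict can be shown with one character. The sketch below compares code page 437 with Latin-1; the pairing is an assumption chosen only because both codecs ship with Python.

```python
E_ACUTE = "\u00e9"  # the letter e-acute

# Different numbers for the same character.
in_cp437 = E_ACUTE.encode("cp437")     # a single byte, value 130
in_latin1 = E_ACUTE.encode("latin-1")  # a single byte, value 233

# The same number for two different characters.
byte_233_as_latin1 = b"\xe9".decode("latin-1")
byte_233_as_cp437 = b"\xe9".decode("cp437")

print(in_cp437, in_latin1)
print(byte_233_as_latin1, byte_233_as_cp437)
```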

5
  • Internet

6
Resolution?
  • The Question
  • How do we begin to solve this issue?
  • The Answer
  • Unicode

7
Unicode
  • Unicode is an industry standard, a single
    character set, designed to allow text and symbols
    from all languages to be consistently represented
    and manipulated by computers.
  • The Unicode Standard has been adopted by such
    industry leaders as Apple, HP, IBM, Microsoft,
    Oracle, Sun and many others. Unicode is required
    by modern standards such as XML, Java etc.
  • Unicode provides a unique number for every
    character, no matter what the platform, no matter
    what the program, no matter what the language.
  • STIP
  • U+0053 U+0054 U+0049 U+0050
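
The code points for any string can be listed the same way. In this small sketch, `code_points` is a hypothetical helper name, not part of any library; it relies only on Python's built-in `ord`.

```python
def code_points(text: str) -> list[str]:
    """Return the U+XXXX label for each character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


labels = code_points("STIP")
print(" ".join(labels))
```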

8
Encoding
  • Encoding is how we store the strings of Unicode
    code points, U numbers, in memory using bits and
    bytes.
  • Hundreds of these encodings which can only store
    some of the code points correctly. They change
    the rest to the infamous ? or ?
  • Examples
  • Windows 1252
  • ISO-8859-1 or Latin-1
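
A short sketch of where those substitute characters come from, using Latin-1 as the limited encoding. The sample text is made up for illustration, and the substitution shown depends on passing `errors="replace"`; Python's default strict mode raises an exception instead.

```python
text = "r\u00e9sum\u00e9 \u05d2"  # "resume" with accents, then Hebrew gimel

# Latin-1 can store e-acute but not gimel. With errors="replace" the
# encoder substitutes "?" for anything it cannot represent.
lossy = text.encode("latin-1", errors="replace")

# The default (strict) mode refuses instead of silently losing data.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Going the other way: Latin-1 bytes read as UTF-8 are invalid, and the
# decoder substitutes U+FFFD, the replacement character.
decoded = b"r\xe9sum\xe9".decode("utf-8", errors="replace")

print(lossy, strict_failed, decoded)
```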

9
Encoding
  • There are several different encodings that can
    store any Unicode code point correctly
  • UTF-7
  • UTF-8
  • UTF-16
  • UTF-32
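
As a rough check that each of these can store any code point, the sketch below round-trips one mixed string through all four. The sample string is an assumption chosen to include an accented letter, a Hebrew letter, and a character outside the 16-bit range.

```python
# Accented Latin, Hebrew, and an emoji above U+FFFF.
text = "r\u00e9sum\u00e9 \u05d2 \U0001F600"

encodings = ["utf-7", "utf-8", "utf-16", "utf-32"]

sizes = {}
round_trips = {}
for name in encodings:
    data = text.encode(name)
    sizes[name] = len(data)                      # bytes used by this encoding
    round_trips[name] = data.decode(name) == text  # nothing lost?

for name in encodings:
    print(f"{name:7} {sizes[name]:3} bytes  round trip ok: {round_trips[name]}")
```

The byte counts differ because each encoding trades size against simplicity in its own way; Python's `utf-16` and `utf-32` codecs also prepend a byte order mark.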

10
UTF-8
  • UTF-8 (8-bit Unicode Transformation Format) is a
    variable-length character encoding for Unicode.
  • It is able to represent any universal character
    in the Unicode standard, yet is backwards
    compatible with ASCII. For this reason, it is
    steadily becoming the preferred encoding for
    email, web pages, and other places where
    characters are stored or streamed.
  • UTF-8 uses one to four bytes per character,
    depending on the code point.
  • For example
  • 01010011 S
  • 01010100 T
  • 01001001 I
  • 01010000 P
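
The bit patterns above can be regenerated, and the variable length becomes visible once non-ASCII characters are included. In this sketch `utf8_bits` is a hypothetical helper name, and the non-ASCII sample characters are chosen for illustration.

```python
def utf8_bits(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in ch.encode("utf-8"))


# ASCII letters take exactly one byte, identical to their ASCII value.
for ch in "STIP":
    print(utf8_bits(ch), ch)

# Characters further up the Unicode range take two, three, or four bytes.
samples = ["S", "\u00e9", "\u20ac", "\U0001F600"]
widths = [len(ch.encode("utf-8")) for ch in samples]
print(widths)
```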

11
Real World Issues
  • The single most important fact about encoding
  • There Is No Such Thing As Plain Text!
  • If you have a string in memory, in a file or in
    an email message the application HAS to know what
    encoding it is in or you cannot interpret it or
    display it to users correctly.
  • Some applications, such as Internet Explorer,
    will try to guess the encoding.
  • How do we preserve this information?
  • Header Files
  • Examples
  • Content-Type: text/plain; charset=UTF-8
  • <meta http-equiv="Content-Type"
    content="text/html; charset=utf-8">
  • <?xml version='1.0' encoding='UTF-8'?>
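
A minimal sketch of a receiver honoring the declared encoding rather than guessing. `charset_from_content_type` is a hypothetical helper; it leans on the standard library's `email.message.Message` to parse the header value, and the `ascii` fallback is an assumption, not something the slides prescribe.

```python
from email.message import Message


def charset_from_content_type(header_value: str, default: str = "ascii") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset is declared.
    return msg.get_content_charset() or default


body = "r\u00e9sum\u00e9".encode("utf-8")  # bytes as they travel on the wire

declared = charset_from_content_type("text/plain; charset=UTF-8")
correct = body.decode(declared)

# Ignoring the header and assuming Latin-1 produces garbage instead.
wrong_guess = body.decode("latin-1")

print(declared, correct, wrong_guess)
```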

12
For a REAL Special Character