ENCODING AND DECODING - PowerPoint PPT Presentation

Provided by: tompe9

Transcript and Presenter's Notes


1
ENCODING AND DECODING
  • Experiencing one (or more) bytes out of your A's

2
Overview
  • It's not your father's character set
  • 8-bit characters
  • ASCII
  • The rest of the world wakes up to computers
  • Unicode
  • Character codes
  • Different flavors
  • Encoding and Decoding classes
  • Example

3
The Good Old Days
  • Focus on unaccented, English letters
  • Every letter, number, capital, etc.
  • Represented by codes 0-127
  • Space = 32, A = 65, a = 97
  • Used 7 bits, one bit free on most computers
  • WordStar and the 8th bit
  • Below 32: control characters (7 = beep, 12 = form feed)
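
The code values on this slide can be checked directly. A minimal sketch in Python (used here only for illustration; the presentation's own examples are .NET), where `fits_in_7_bits` is a hypothetical helper name, not something from the slides:

```python
# ASCII assigns each character a number in the range 0-127.
codes = {ch: ord(ch) for ch in (" ", "A", "a")}
print(codes)  # {' ': 32, 'A': 65, 'a': 97}

# Values below 32 are control characters, e.g. 7 = bell, 12 = form feed.
BELL = chr(7)        # same as "\a"
FORM_FEED = chr(12)  # same as "\f"


def fits_in_7_bits(text: str) -> bool:
    """Return True if every character has a code in 0-127 (hypothetical helper)."""
    return all(ord(ch) < 128 for ch in text)


print(fits_in_7_bits("Hello"))      # True
print(fits_in_7_bits("caf\u00e9"))  # False: e-acute is code 233
```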

7
8th bit, values 128-255
  • Everybody had their own ideas
  • OEM Character sets
  • IBM PC -> graphics (horizontal bars, vertical
    bars, bars with dangles, etc.)
  • Outside U.S. -> different languages
  • Code 130: é in the U.S., gimel (ג) in Israel
  • Difficult to exchange documents
  • Code pages: regional definitions of byte values
    128-255
  • Israel: code page 862
  • Greek: code page 737
  • ISO/ANSI code pages
  • Asia: alphabets had thousands of characters
  • No way to store in one byte (8 bits)
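
The "code 130" problem can be reproduced with the code page codecs that ship with Python. This is only an illustrative sketch; it assumes that code page 437 stands in for the "U.S." IBM PC character set the slide refers to:

```python
import unicodedata

# The same byte value decodes to different characters under different code pages.
raw = bytes([130])

us = raw.decode("cp437")      # original IBM PC (OEM) code page, assumed for "U.S."
hebrew = raw.decode("cp862")  # Hebrew code page
greek = raw.decode("cp737")   # Greek code page

for label, ch in (("cp437", us), ("cp862", hebrew), ("cp737", greek)):
    print(label, f"U+{ord(ch):04X}", unicodedata.name(ch))

# A character outside a code page's repertoire cannot be stored in its one byte.
try:
    "\u4e2d".encode("cp437")  # a CJK ideograph
except UnicodeEncodeError as err:
    print("cannot encode:", err.reason)
```

A document written under one code page and opened under another keeps the same bytes but shows different letters, which is why exchanging documents was difficult.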

8
Unicode
  • Not a 16-bit code
  • A new way of thinking about characters
  • Old way
  • Character A maps to memory or disk bits
  • A -> 0100 0001
  • Unicode way
  • Each letter in every alphabet maps to a code
    point
  • Abstract concept
  • A is a Platonic form that just floats out there
  • A -> U+0041 (code point)
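
To make the idea of a code point concrete, here is a small Python sketch (illustrative only). `code_point_label` is a hypothetical helper that formats a character in the usual U+XXXX notation:

```python
import unicodedata


def code_point_label(ch: str) -> str:
    """Format a single character as U+XXXX (hypothetical helper)."""
    return f"U+{ord(ch):04X}"


# A code point is an abstract number; it says nothing yet about bytes in memory.
for ch in ("A", "\u0639"):
    print(code_point_label(ch), unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+0639 ARABIC LETTER AIN
```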

9
Unicode
  • Hello -> U+0048 U+0065 U+006C U+006C U+006F
  • Storing in 2 bytes each
  • 00 48 00 65 00 6C 00 6C 00 6F (big endian)
  • Or 48 00 65 00 6C 00 6C 00 6F 00 (little endian)
  • Need a byte order mark (BOM) at the beginning
    of the stream
  • UTF-8 encoding system
  • Stores Unicode code points (magic numbers) as
    8-bit bytes
  • Code points 0-127 are stored in a single byte
  • Code points 128 and above are stored in 2, 3,
    or more bytes
  • For characters up to 127, UTF-8 looks just like
    ASCII
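
The byte layouts on this slide can be reproduced with Python's built-in codecs. A minimal sketch (illustrative only; the sample characters are chosen for this example):

```python
import codecs

text = "Hello"

big = text.encode("utf-16-be")     # most significant byte first
little = text.encode("utf-16-le")  # least significant byte first
print(big.hex())     # 00480065006c006c006f
print(little.hex())  # 480065006c006c006f00

# A byte order mark (U+FEFF) at the start tells the reader which order was used.
with_bom = codecs.BOM_UTF16_LE + little
print(with_bom.decode("utf-16"))  # Hello

# UTF-8: code points 0-127 take one byte, higher ones take two or more.
for ch in ("A", "\u00e9", "\u20ac"):
    print(f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s)")

# Pure ASCII text is byte-for-byte identical in UTF-8.
print(text.encode("utf-8") == text.encode("ascii"))  # True
```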

10
Unicode Encodings
  • UTF-8
  • UTF-16: characters stored in 2-byte, 16-bit
    (halfword) units, with pairs of units for code
    points above U+FFFF; its fixed-width
    predecessor is called UCS-2
  • UTF-32: each character stored in a 4-byte,
    32-bit unit
  • UTF-7: forces a zero in the high-order bit (for
    7-bit channels such as some firewalls)
  • ASCII encoding: everything that does not fit in
    7 bits is dropped
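
A short Python comparison of how many bytes each encoding needs for the same character (a sketch for illustration; the sample characters are assumptions of this example, not taken from the slides):

```python
samples = ["A", "\u00e9", "\U0001F600"]  # A, e-acute, an emoji beyond U+FFFF
for ch in samples:
    sizes = {name: len(ch.encode(name)) for name in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"U+{ord(ch):04X}", sizes)

# UTF-7 output never sets the high-order bit of any byte.
utf7 = "caf\u00e9".encode("utf-7")
print(all(b < 128 for b in utf7))  # True

# ASCII cannot represent anything above 127; here it is replaced with "?".
ascii_lossy = "caf\u00e9".encode("ascii", errors="replace")
print(ascii_lossy)  # b'caf?'
```

Note that the emoji needs four bytes even in UTF-16, which is why UTF-16 is not a strictly two-bytes-per-character encoding.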

11
Definitions
  • .NET uses UTF-16 encoding internally to store
    text
  • Encoding
  • transforms a sequence of Unicode characters
    into a sequence of bytes
  • Send a string to a file or a network stream
  • Decoding
  • transforms a sequence of bytes into a sequence
    of Unicode characters
  • Read a string from a file or a network stream
  • StreamReader, StreamWriter default to UTF-8
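
The two definitions form a round trip. A minimal sketch in Python (illustrative only; the slides describe the equivalent .NET operations):

```python
message = "na\u00efve caf\u00e9"

# Encoding: characters -> bytes (what gets written to a file or socket).
payload = message.encode("utf-8")
print(len(message), len(payload))  # 10 12

# Decoding: bytes -> characters (what happens when the data is read back).
restored = payload.decode("utf-8")
print(restored == message)  # True

# Decoding with the wrong encoding silently produces the wrong characters.
garbled = payload.decode("latin-1")
print(garbled == message)  # False
```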

12
Encoding/Decoding Classes
  • UTF32Encoding class
  • Convert characters to and from UTF-32 encoding
  • UnicodeEncoding class
  • Convert characters to and from UTF-16 encoding
  • UTF8Encoding class
  • Convert characters to and from UTF-8 encoding
    (1, 2, 3, or 4 bytes per character)
  • ASCIIEncoding class
  • Convert characters to and from ASCII encoding;
    values > 127 are lost (replaced with "?" by
    default)
  • System.Text.Encoding supports a wide range of
    ANSI/ISO encodings

13
Convert a string into a stream of encoded bytes
  • 1. Get an encoding object
  • Encoding e = Encoding.GetEncoding("Korean");
  • 2. Use the encoding object's GetBytes() method
    to convert a string into its byte representation
  • byte[] encoded;
  • encoded = e.GetBytes("I'm gonna be Korean!");
  • Demo: D:\_Framework 2.0 Training
    Kits\70-536\Chapter 03\EncodingDemo
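
The same two steps can be sketched in Python for comparison. This is not the demo from the training kit; it assumes that Python's `cp949` codec is a reasonable counterpart of the "Korean" encoding named on the slide:

```python
# Step 1: choose an encoding ("cp949" is an assumption of this sketch).
ENCODING = "cp949"

# Step 2: convert the string to its byte representation.
encoded = "I'm gonna be Korean!".encode(ENCODING)
print(len(encoded))  # 20: ASCII characters stay one byte each

# Hangul syllables take two bytes each in this encoding.
hangul = "\ud55c\uad6d\uc5b4"  # the Korean word for "Korean language"
hangul_bytes = hangul.encode(ENCODING)
print(len(hangul), len(hangul_bytes))  # 3 6
```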

14
Write a file in encoded form
// requires: using System.IO; using System.Text;
FileStream fs = new FileStream("text.txt", FileMode.OpenOrCreate);
...
StreamWriter t = new StreamWriter(fs, Encoding.UTF8);
t.Write("This is in UTF8");
t.Close();  // flush the buffered text to disk

Read an encoded file
FileStream fs = new FileStream("text.txt", FileMode.Open);
...
StreamReader t = new StreamReader(fs, Encoding.UTF8);
String s = t.ReadLine();
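
For comparison, the same write-then-read pattern in Python, passing the encoding explicitly on both sides. This is an illustrative sketch; the temporary directory is an assumption so the example does not overwrite an existing text.txt:

```python
import os
import tempfile

# Use a throwaway directory so the sketch does not touch real files.
path = os.path.join(tempfile.mkdtemp(), "text.txt")

# Write: the encoding decides which bytes end up on disk.
with open(path, "w", encoding="utf-8") as writer:
    writer.write("This is in UTF8: \u00e9\n")

# Read: the same encoding turns those bytes back into characters.
with open(path, "r", encoding="utf-8") as reader:
    line = reader.readline()

print(line == "This is in UTF8: \u00e9\n")  # True
```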

15
Summary
  • ASCII is one of the oldest encoding standards.
  • Unicode provides multilingual support.
  • System.Text.Encoding has static methods for
    encoding and decoding text.
  • Use an overloaded Stream constructor that accepts
    an encoding object when writing a file.
  • Not necessary to specify an Encoding object when
    reading; the reader will default to UTF-8.

16
References
  • www.unicode.org
  • Unicode and .NET: What does .NET Provide?
    http://www.developerfusion.co.uk/show/4710/3/
  • Hello Unicode, Goodbye ASCII
    http://www.nicecleanexample.com/ViewArticle.aspx?TID=unicode_encoding
  • The Absolute Minimum Every Software Developer
    Absolutely, Positively Must Know About Unicode
    and Character Sets (No Excuses!)
    http://www.joelonsoftware.com/articles/Unicode.html