lis508 lecture 1: bits, bytes and characters - PowerPoint PPT Presentation

Provided by: kric2
Learn more at: http://openlib.org
Transcript and Presenter's Notes


1
lis508 lecture 1 bits, bytes and characters
  • Thomas Krichel
  • 2002-09-23

2
Structure
  • Bits
  • Bytes
  • Character sets
  • Coded character set
  • Character encoding

3
Literature
  • Norton, New Inside the PC, chapter 4
  • http://www.danbbs.dk/erikoest/bb_terms.htm
  • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
  • http://www.cl.cam.ac.uk/mgk25/unicode.html

4
Information
  • Information is best understood as what it takes
    to answer a question.
  • The simplest question has a yes or no answer.
    Therefore a bit is the natural measure of
    information.
  • Term first used by John Tukey in 1946.
  • It is a contraction of "binary digit".
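
A minimal sketch of the idea that a bit answers one yes/no question: the number of bits needed to single out one of several possible answers. The function name `bits_needed` is made up for this illustration and is not part of the lecture.

```python
def bits_needed(num_answers: int) -> int:
    """Number of yes/no questions (bits) needed to tell num_answers apart."""
    if num_answers < 1:
        raise ValueError("need at least one possible answer")
    # n bits distinguish 2**n answers, so we need the smallest n
    # with 2**n >= num_answers; (x - 1).bit_length() computes exactly that.
    return (num_answers - 1).bit_length()


print(bits_needed(2))    # one yes/no question
print(bits_needed(26))   # one letter of the English alphabet
print(bits_needed(256))  # one byte's worth of values
```

With only one possible answer no question has to be asked, so the function returns 0 in that case.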

5
Usage of bits
  • Computers are sometimes classified by:
  • The number of bits they can process at one time,
    i.e. the register size. Larger registers tend to
    make a computer run faster.
  • The number of bits they use to represent
    addresses, i.e. the address size. A larger address
    size allows larger programs to run.
  • Graphics are also often described by the number
    of bits used to represent each dot.
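
As a rough illustration of why the number of bits matters, the sketch below counts how many distinct values a given number of bits can take. It assumes that each address refers to one byte of memory, which the slide does not state; the function name `distinct_values` is hypothetical.

```python
def distinct_values(num_bits: int) -> int:
    """How many different values num_bits bits can represent."""
    if num_bits < 0:
        raise ValueError("number of bits cannot be negative")
    return 2 ** num_bits


# Address size: assuming one address per byte, this is the
# largest amount of memory that can be addressed.
for address_bits in (16, 32):
    print(address_bits, "address bits:", distinct_values(address_bits), "bytes")

# Graphics: bits per dot determine how many colours a dot can have.
for bits_per_dot in (1, 8, 24):
    print(bits_per_dot, "bits per dot:", distinct_values(bits_per_dot), "colours")
```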

6
Many bits
  • Early chips processed 8 bits at a time. It became
    customary to refer to a group of 8 bits as a byte.
  • Larger units are:
  • Kilobyte is 2 power 10 bytes
  • Megabyte is 2 power 20 bytes
  • Gigabyte is 2 power 30 bytes
  • Terabyte is 2 power 40 bytes
  • From ancient Greek words for "thousand", "large",
    "giant", and "monster", respectively. The prefix
    kilo- dates back to the French revolution; the
    others were added to the metric system later.

7
More than a monster
  • In 1975, the General Conference of Weights and
    Measures (CGPM), based at Sèvres near Paris,
    agreed to add peta- (P) and exa- (E)
  • Petabyte is 2 power 50 bytes
  • Exabyte is 2 power 60 bytes
  • Nowadays they are followed by zettabyte (2 power
    70) and yottabyte (2 power 80)
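
The powers of two behind the prefixes from kilo- to exa- can be sketched as a small lookup table. The helper `largest_unit` and the table name are invented for this example; it picks the largest unit that does not exceed a given byte count.

```python
# Prefix name and the power of two it stands for, as used in the lecture.
PREFIXES = [
    ("kilo", 10),
    ("mega", 20),
    ("giga", 30),
    ("tera", 40),
    ("peta", 50),
    ("exa", 60),
]


def largest_unit(num_bytes: int):
    """Return (value, unit name) using the largest prefix that fits."""
    name, power = "", 0
    for prefix, exponent in PREFIXES:
        if num_bytes >= 2 ** exponent:
            name, power = prefix, exponent
    return num_bytes / 2 ** power, name + "byte"


print(largest_unit(1536))       # (1.5, 'kilobyte')
print(largest_unit(3 * 2**30))  # (3.0, 'gigabyte')
```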

8
Hex numbers
  • A byte is often represented by two hex digits.
  • Each hex digit can encode 16 values
  • Written 0 to 9, then A B C D E F. F is 15.
  • Here, prefixed with 0x
  • Use the Microsoft calculator in scientific mode
    to convert.
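
Instead of a calculator, the conversion can also be sketched in a few lines of Python. The function names below are made up for this illustration; the formatting and parsing calls are standard Python.

```python
def byte_to_hex(value: int) -> str:
    """Write a byte (0 to 255) as two hex characters, prefixed with 0x."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    high, low = divmod(value, 16)  # each half of the byte is 0..15
    digits = "0123456789ABCDEF"
    return "0x" + digits[high] + digits[low]


def hex_to_byte(text: str) -> int:
    """Read a hex string such as '0x1F' or '1F' back into a number."""
    return int(text, 16)


print(byte_to_hex(255))     # 0xFF
print(byte_to_hex(10))      # 0x0A
print(hex_to_byte("0x41"))  # 65
```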

9
decimal/binary numbers
  decimal  binary
  0        0
  1        1
  2        10
  3        11
  4        100
  5        101
  6        110
  7        111
  8        1000
  9        1001
  10       1010
  11       1011
  12       1100
  13       1101
  14       1110
  15       1111
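
The table above can be reproduced by repeatedly dividing by two and collecting the remainders. This is one common way to do the conversion, shown here as a sketch; the function name `to_binary` is hypothetical.

```python
def to_binary(number: int) -> str:
    """Convert a non-negative decimal number to its binary digits."""
    if number < 0:
        raise ValueError("only non-negative numbers are handled here")
    if number == 0:
        return "0"
    digits = []
    while number > 0:
        digits.append(str(number % 2))  # remainder is the next binary digit
        number //= 2                    # continue with the quotient
    # Remainders come out lowest digit first, so reverse them.
    return "".join(reversed(digits))


for n in range(16):
    print(n, to_binary(n))
```

Python's built-in `bin()` gives the same digits with a `0b` prefix.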

10
Characters
  • Much of the information processed by computers is
    in the form of characters.
  • A character only makes sense to a human user with
    a minimum of cultural background.
  • A character is not a glyph.
  • Example: ligatures, where a single glyph
    represents more than one character.

11
Representing characters
  • Computers don't understand text, they only
    understand numbers. For computers to be able to
    treat text, there must be a correspondence
    between numbers and text characters. Such a
    correspondence is called a coded character set.
  • Important examples are
  • ASCII
  • ISO 8859-1
  • cp1252
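
A short sketch of such a correspondence, using Python's built-in `ord` and `chr`. Note the assumption: Python maps characters to Unicode numbers, which coincide with ASCII for the characters used here. The helper names are invented for this example.

```python
def text_to_numbers(text: str) -> list:
    """Replace every character by its number in the coded character set."""
    return [ord(character) for character in text]


def numbers_to_text(numbers: list) -> str:
    """Go back from numbers to characters."""
    return "".join(chr(number) for number in numbers)


codes = text_to_numbers("Hi!")
print(codes)                   # [72, 105, 33]
print(numbers_to_text(codes))  # Hi!
```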

12
ASCII
  • American Standard Code for Information
    Interchange
  • 7-bit character set. There is no such thing as
    8-bit ASCII
  • 95 printable symbols
  • 33 control characters (0-31, 127)
  • http://www.ccmr.cornell.edu/helpful_data/ascii2.html
    has a list.
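
The two counts can be checked with a small sketch that walks over all 7-bit values. The variable names are made up for this example; the space character (32) is counted among the printable symbols here.

```python
# 7 bits give 2**7 = 128 code positions, numbered 0 to 127.
all_codes = range(2 ** 7)

# Control characters occupy positions 0-31 and position 127.
control_codes = [code for code in all_codes if code < 32 or code == 127]

# Everything in between (32 to 126) is a printable symbol.
printable_codes = [code for code in all_codes if 32 <= code <= 126]

print(len(control_codes))    # 33
print(len(printable_codes))  # 95
print("".join(chr(code) for code in printable_codes))
```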

13
ASCII control codes
  • Each code is given as (decimal value, key typed
    together with Control).
  • ACK (6, F) is used to acknowledge receipt of a
    message, NAK (21, U) is used to signal non-receipt
  • CR (13, M) is the carriage return
  • LF (10, J) is the linefeed
  • FF (12, L) is the form feed (new page)
  • BS (8, H) is the backspace
  • DEL (ALT-127) is delete
  • ESC (27, [) is escape
  • Different programs use them in different ways,
    which is a big pain.
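
The pairs of a number and a letter in the list above follow a simple rule: the control code is the letter's ASCII code minus 64, which is what holding the Control key traditionally does on a terminal. The sketch below checks this for the codes named on the slide; the function name `control_code` is hypothetical.

```python
def control_code(letter: str) -> int:
    """ASCII control code produced by Control plus the given letter."""
    code = ord(letter.upper())
    if not 64 <= code <= 95:
        raise ValueError("expected a character between '@' and '_'")
    return code - 64  # same as keeping only the lowest five bits


for name, letter in [("ACK", "F"), ("NAK", "U"), ("CR", "M"),
                     ("LF", "J"), ("FF", "L"), ("BS", "H")]:
    print(name, control_code(letter))
```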

14
ISO-8859-1
  • PCs work with bytes, so manufacturers were free to
    fill the other 128 code positions.
  • ISO-8859-1, aka ISO-Latin-1, extends ASCII
    with characters that are used by the western
    European languages.
  • It is the default character set of HTML.
  • Positions 128 to 159 are not used for printable
    characters.
  • Cp1252 fills these with graphic chars.
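
The difference between the two sets can be seen by decoding the same byte both ways. This is a sketch using Python's standard `latin-1` and `cp1252` codecs; the function name `decode_both` is made up for the example.

```python
def decode_both(byte_value: int):
    """Decode one byte as ISO-8859-1 and as cp1252."""
    raw = bytes([byte_value])
    # A few cp1252 positions (for example 0x81) are undefined in
    # Python's codec and would raise UnicodeDecodeError here.
    return raw.decode("latin-1"), raw.decode("cp1252")


# Above 159 the two sets agree: 0xE9 is e with acute accent in both.
print(decode_both(0xE9))

# In the range 128 to 159 they differ: ISO-8859-1 has no graphic
# character at 0x80, while cp1252 puts the euro sign there.
latin1_char, cp1252_char = decode_both(0x80)
print(repr(latin1_char), repr(cp1252_char))
```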

15
Three concepts for characters
  • Abstract Character Repertoire: the set of
    characters to be encoded, e.g., some alphabet or
    symbol set
  • Coded Character Set: a mapping from an abstract
    character repertoire to a set of non-negative
    integers
  • Character Encoding Scheme: a mapping from a
    coded character set to a serialized sequence of
    bytes
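
The three concepts can be told apart in a short sketch: one character from the repertoire, its integer in a coded character set, and several byte sequences for that same integer. Python's `ord` gives the Unicode number, and the encoding names are standard Python codecs; the choice of example character is arbitrary.

```python
# 1. A character from the abstract repertoire.
character = "\u00e9"  # e with acute accent

# 2. Coded character set: the character is mapped to an integer.
code = ord(character)
print(code)  # 233

# 3. Character encoding scheme: the integer is serialized as bytes.
#    Different schemes give different bytes for the same integer.
for scheme in ("latin-1", "utf-8", "utf-16-be"):
    print(scheme, character.encode(scheme).hex())
```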

16
ISO 10646-1
  • Defines the Universal Character Set (UCS)
  • UCS contains the characters required to write
    practically all known languages, even the likes
    of Gurmukhi, Oriya, Telugu, Bopomofo, Runic.
  • There are proposals for more, like Hieroglyphs
    and Tengwar.
  • Note that there are about 6800 known languages.

17
UCS organization
  • ISO 10646 formally defines a 31-bit character
    set. Code positions are represented as 32 bits,
    i.e. 4 bytes, or 8 hex digits.
  • The canonical form of ISO 10646 uses a
    four-dimensional coding space consisting of 128
    groups. Each group consists of 256 planes, with
    each plane containing 256 rows, each having 256
    cells.
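
The four dimensions correspond to the four bytes of a code position. As a sketch, the function below splits a number into group, plane, row and cell; the name `split_code` is invented for this illustration.

```python
def split_code(code: int):
    """Split a 31-bit code position into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("code positions must fit into 31 bits")
    group = code >> 24           # most significant byte
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF           # least significant byte
    return group, plane, row, cell


# The euro sign has the number 0x20AC: group 0, plane 0,
# row 0x20, cell 0xAC.
print(split_code(0x20AC))
print(f"{0x20AC:08X}")  # written as 8 hex characters
```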

18
UCS organization
  • The first plane (Plane 0x00) of Group (0x00) is
    called the Basic Multilingual Plane (BMP). It has
    been fixed since first publication.
  • The subsequent 223 planes (0x01 to 0xDF) of Group
    0x00, as well as planes 0x00 to 0xFF in Groups
    0x01 to 0x5F are reserved for further
    standardization.
  • The last 32 planes (0xE0 to 0xFF) of Group 0x00,
    as well as all code positions of 32 groups (0x60
    to 0x7F) are reserved for private use.
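
The ranges on this slide can be written down as a small lookup. This is only a sketch of the layout as the slide describes it, not of any later revision of the standard; the function name `region` and the labels it returns are hypothetical.

```python
def region(code: int) -> str:
    """Name the region of the coding space, following the slide."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("code positions must fit into 31 bits")
    group = code >> 24
    plane = (code >> 16) & 0xFF
    if group == 0x00:
        if plane == 0x00:
            return "Basic Multilingual Plane"
        if plane <= 0xDF:
            return "reserved for further standardization"
        return "private use"      # planes 0xE0 to 0xFF of group 0x00
    if group <= 0x5F:
        return "reserved for further standardization"
    return "private use"          # groups 0x60 to 0x7F


print(region(0x000020AC))
print(region(0x00010000))
print(region(0x00E00000))
print(region(0x60000000))
```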

19
Relationship with legacy sets
  • Let U+(four hex digits) denote characters in the
    BMP.
  • The UCS characters U+0000 to U+007F are identical
    to those in ASCII.
  • The range U+0000 to U+00FF is identical to ISO
    8859-1 (Latin-1).
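
Both statements can be checked with a short sketch, assuming that Python's `ascii` and `latin-1` codecs implement the two legacy sets and that `chr` returns the UCS character for a number. The variable names are made up for this example.

```python
# Every ASCII byte decodes to the UCS character with the same number.
ascii_matches = all(
    bytes([number]).decode("ascii") == chr(number)
    for number in range(0x80)
)

# The same holds for all 256 byte values of ISO 8859-1.
latin1_matches = all(
    bytes([number]).decode("latin-1") == chr(number)
    for number in range(0x100)
)

print(ascii_matches, latin1_matches)  # True True
```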

20
Types of characters in UCS
  • Letters
  • Base characters
  • Ideographic characters
  • Combining characters
  • Digits
  • Extenders

21
http://openlib.org/home/krichel
  • Thank you for your attention!