lis508 lecture 1: bits, bytes and characters

1
lis508 lecture 1 bits, bytes and characters

Thomas Krichel
2002-09-23

2
Structure

Bits
Bytes
Character sets
Coded character set
Character encoding

3
Literature

Norton, New Inside the PC, chapter 4
http://www.danbbs.dk/~erikoest/bb_terms.htm
http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html

4
Information

Information is best understood as what it takes
to answer a question.
The simplest question has a yes or no answer.
Therefore a bit is the natural measure of
information.
Term first used by John Tukey in 1946.
It is a contraction of "binary digit".

5
Usage of bits

Computers are sometimes classified by
the number of bits they can process at one time,
i.e. the register size. Larger registers generally
make a computer run faster.
the number of bits they use to represent
addresses, i.e. the address size. A larger address
size allows larger programs to run.
Graphics are also often described by the number
of bits used to represent each dot.
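
As a rough illustration of why the number of bits matters: n bits can distinguish 2 to the power n values, whether those values are addresses or colours of a dot. The sketch below is only an example; the function name `distinct_values` is made up for this note.

```python
def distinct_values(bits: int) -> int:
    """Number of different values that can be represented with `bits` bits."""
    if bits < 0:
        raise ValueError("bits must be non-negative")
    return 2 ** bits

# A few sizes as examples: 1 bit (yes/no), 8-bit and 24-bit dots,
# and a 32-bit address.
for bits in (1, 8, 24, 32):
    print(f"{bits} bits -> {distinct_values(bits):,} values")
```

With 32-bit addresses, for example, a program can refer to 4,294,967,296 different byte positions.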

6
Many bits

The first chips processed 8 bits at a time.
It became customary to refer to 8 bits as a byte.
Larger units are
A kilobyte is 2 to the power 10 bytes
A megabyte is 2 to the power 20 bytes
A gigabyte is 2 to the power 30 bytes
A terabyte is 2 to the power 40 bytes
The prefixes come from ancient Greek words for
"thousand", "large", "giant", and "monster",
respectively. The prefix kilo- dates back to the
French Revolution; the others were adopted later.

7
More than a monster

In 1975, the General Conference on Weights and
Measures (CGPM), based at Sèvres near Paris,
agreed to add peta- (P) and exa- (E)
A petabyte is 2 to the power 50 bytes
An exabyte is 2 to the power 60 bytes
Nowadays they are followed by zettabyte (2 to the
power 70) and yottabyte (2 to the power 80)
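
The unit sizes can be checked with a few lines of Python. This is a minimal sketch using the power-of-two definitions from these slides, with zetta- at 2 to the power 70 and yotta- at 2 to the power 80; the names `UNITS` and `bytes_in` are invented for the example.

```python
# Exponent of 2 for each unit, following the power-of-two convention
# used in this lecture.
UNITS = {
    "kilobyte": 10,
    "megabyte": 20,
    "gigabyte": 30,
    "terabyte": 40,
    "petabyte": 50,
    "exabyte": 60,
    "zettabyte": 70,
    "yottabyte": 80,
}

def bytes_in(unit: str) -> int:
    """Return the number of bytes in one `unit`."""
    return 2 ** UNITS[unit]

for name, exponent in UNITS.items():
    print(f"1 {name} = 2^{exponent} = {bytes_in(name):,} bytes")
```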

8
Hex numbers

A byte is often represented by two hex digits.
Each hex digit can encode 16 values
Written 0 to 9, then A B C D E F. F is 15.
Here, hex numbers are prefixed with 0x
Use the Microsoft calculator in scientific mode
to convert.
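
The conversion can also be done in a few lines of Python instead of a calculator. This is a small sketch; `byte_to_hex` and `hex_to_byte` are names chosen for the example.

```python
def byte_to_hex(value: int) -> str:
    """Write a byte value (0 to 255) as two hex digits with a 0x prefix."""
    if not 0 <= value <= 255:
        raise ValueError("a byte holds values from 0 to 255")
    return f"0x{value:02X}"

def hex_to_byte(text: str) -> int:
    """Read a hex number such as '0x41' or '41' back into an integer."""
    return int(text, 16)

print(byte_to_hex(255))     # two hex digits F and F
print(byte_to_hex(10))      # padded to two digits
print(hex_to_byte("0x41"))  # decimal value of hex 41
```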

9
decimal/binary numbers

0 0
1 1
2 10
3 11
4 100
5 101
6 110
7 111

8 1000
9 1001
10 1010
11 1011
12 1100
13 1101
14 1110
15 1111
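
The table above can be regenerated by repeatedly dividing by 2 and collecting the remainders. The sketch below shows one way to do it; the function name `to_binary` is made up for this example, and Python's built-in `format(n, "b")` gives the same result.

```python
def to_binary(n: int) -> str:
    """Convert a non-negative integer to binary by repeated division by 2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next binary digit
        n //= 2
    return "".join(reversed(digits))  # remainders come out last digit first

for n in range(16):
    print(n, to_binary(n))
```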

10
Characters

Much of the information processed by computers is
in the form of characters.
A character only makes sense to a human user with
a minimum of cultural background.
A character is not a glyph. Ligatures, for
example, draw several characters as one glyph.

11
Representing characters

Computers don't understand text, they only
understand numbers. For computers to be able to
treat text, there must be a correspondence
between numbers and text characters. Such a
correspondence is called a coded character set.
Important examples are
ASCII
ISO 8859-1
cp1252
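
To see such a correspondence at work, Python's `ord` and `chr` map between characters and numbers. Python uses Unicode numbers, which agree with ISO 8859-1 for values 0 to 255, so the small sketch below is only an illustration; the helper name `code_numbers` and the sample word are assumptions of this note.

```python
def code_numbers(text: str) -> list:
    """Return the number assigned to each character of `text`."""
    return [ord(ch) for ch in text]

word = "Zo\u00eb"  # "Zoe" with a diaeresis on the e

print(code_numbers(word))          # one number per character
print(chr(65))                     # the character with number 65
print(word.encode("iso-8859-1"))   # one byte per character
print(word.encode("cp1252"))       # same bytes for these characters

# ASCII has no number for the accented letter, so this fails.
try:
    word.encode("ascii")
except UnicodeEncodeError as err:
    print("not in ASCII:", err.reason)
```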

12
ASCII

American Standard Code for Information
Interchange
7-bit character set. There is no such thing as
8-bit ASCII
95 printable symbols
33 control characters (0-31, 127)
http://www.ccmr.cornell.edu/helpful_data/ascii2.html
has a list.
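
The counts on this slide can be verified directly. The sketch below treats the space (32) as a printable symbol, which is how the figure of 95 comes about; the names `CONTROL`, `PRINTABLE` and `is_ascii` are chosen for this example.

```python
# The 128 positions of a 7-bit set, split as on the slide.
CONTROL = [c for c in range(128) if c < 32 or c == 127]
PRINTABLE = [c for c in range(128) if 32 <= c <= 126]

print(len(CONTROL), "control characters")
print(len(PRINTABLE), "printable symbols")
print("".join(chr(c) for c in PRINTABLE))

def is_ascii(data: bytes) -> bool:
    """True if every byte fits in 7 bits, i.e. the top bit is never set."""
    return all(b < 128 for b in data)

print(is_ascii(b"plain text"))
print(is_ascii(bytes([0x5A, 0x6F, 0xEB])))  # 0xEB needs the eighth bit
```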

13
ASCII control codes

Each code is given with its decimal value and the
key that produces it when pressed with Ctrl.
ACK (6, Ctrl-F) is used to acknowledge receipt of
a message, NAK (21, Ctrl-U) to signal non-receipt
CR (13, Ctrl-M) is the carriage return
LF (10, Ctrl-J) is the linefeed
FF (12, Ctrl-L) is the form feed (new page)
BS (8, Ctrl-H) is the backspace
DEL (127, ALT-127) is delete
ESC (27, Ctrl-[) is escape
Different programs use them in different ways,
which is a big pain.
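
The letter listed next to each code is the key pressed together with Ctrl. The control code is the code of that key with one bit (value 64) flipped, which the following sketch reproduces; the function name `ctrl_code` is invented for this example.

```python
def ctrl_code(key: str) -> int:
    """Code of the control character produced by Ctrl plus `key`.

    Flipping the bit worth 64 in the key's ASCII code gives the
    control code, e.g. M (77) becomes CR (13).
    """
    return ord(key.upper()) ^ 0x40

for name, key in [("ACK", "F"), ("NAK", "U"), ("CR", "M"), ("LF", "J"),
                  ("FF", "L"), ("BS", "H"), ("ESC", "[")]:
    print(name, ctrl_code(key))

# CR and LF are the characters Python writes as \r and \n.
print(chr(ctrl_code("M")) == "\r", chr(ctrl_code("J")) == "\n")
```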

14
ISO-8859-1

PCs work with 8-bit bytes, so manufacturers were
free to fill the other 128 positions.
ISO-8859-1, also known as ISO Latin-1, extends
ASCII with characters that are used by the western
European languages.
It is the default character set of HTML.
Positions 128 to 159 are not used for printable
characters.
Cp1252 fills most of these with graphic characters.
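
The difference between the two sets shows up when the same bytes are decoded both ways. This is a small sketch using Python's built-in `iso-8859-1` and `cp1252` codecs; the three sample bytes are chosen only for illustration.

```python
# Two bytes from the 128 to 159 range and one from above it.
raw = bytes([0x80, 0x93, 0xE9])

as_latin1 = raw.decode("iso-8859-1")
as_cp1252 = raw.decode("cp1252")

# In ISO-8859-1 the first two become invisible control positions,
# in cp1252 they become a euro sign and a curly quotation mark.
print([hex(ord(ch)) for ch in as_latin1])
print([hex(ord(ch)) for ch in as_cp1252])

# From position 160 upwards the two sets agree.
same_above_159 = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("iso-8859-1")
    for b in range(160, 256)
)
print(same_above_159)
```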

15
Three concepts for characters

Abstract Character Repertoire: the set of
characters to be encoded, e.g., some alphabet or
symbol set
Coded Character Set: a mapping from an abstract
character repertoire to a set of non-negative
integers
Character Encoding Scheme: a mapping from a
coded character set to a serialized sequence of
bytes
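
The three concepts can be kept apart in a short example. The sketch assumes Unicode as the coded character set and UTF-8 and UTF-16 as two encoding schemes, none of which this slide names; the three-character repertoire is made up for the illustration.

```python
# 1. Abstract character repertoire: just a set of characters.
repertoire = ["A", "\u00e9", "\u20ac"]  # A, e with acute accent, euro sign

# 2. Coded character set: each character gets a non-negative integer.
code_points = {ch: ord(ch) for ch in repertoire}

# 3. Character encoding scheme: each integer becomes a byte sequence.
#    The same integer gives different bytes under different schemes.
utf8 = {ch: ch.encode("utf-8").hex() for ch in repertoire}
utf16 = {ch: ch.encode("utf-16-be").hex() for ch in repertoire}

for ch in repertoire:
    print(f"number 0x{code_points[ch]:04X}  "
          f"UTF-8 bytes {utf8[ch]}  UTF-16 bytes {utf16[ch]}")
```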

16
ISO 10646-1

Defines the Universal Character Set (UCS)
UCS contains the characters required to represent
the text of practically all known languages, even
the likes of Gurmukhi, Oriya, Telugu, Bopomofo,
Runic.
There are proposals for more, like Hieroglyphs
and Tengwar.
Note that there are about 6800 known languages.

17
UCS organization

ISO 10646 formally defines a 31-bit character
set. Characters are represented as 32 bits, i.e. 4
bytes, or 8 hex digits.
The canonical form of ISO 10646 uses a
four-dimensional coding space consisting of 128
groups. Each group consists of 256 planes with
each plane containing 256 rows, each having 256
cells.
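
The four dimensions are simply the four bytes of the 32-bit form. The sketch below splits a code value into group, plane, row and cell, assuming the 31-bit space described above (so group values run from 0x00 to 0x7F); the function name `split_ucs4` is made up for this example.

```python
def split_ucs4(code: int) -> tuple:
    """Split a 31-bit code value into (group, plane, row, cell)."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("code must fit in 31 bits")
    group = (code >> 24) & 0x7F  # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF
    row = (code >> 8) & 0xFF
    cell = code & 0xFF
    return group, plane, row, cell

code = 0x20AC  # the euro sign
group, plane, row, cell = split_ucs4(code)
print(f"{code:08X}: group {group:02X}, plane {plane:02X}, "
      f"row {row:02X}, cell {cell:02X}")

# Total size of the coding space: groups x planes x rows x cells.
print(128 * 256 * 256 * 256 == 2 ** 31)
```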

18
UCS organization

The first plane (Plane 0x00) of Group 0x00 is
called the Basic Multilingual Plane (BMP). It has
been fixed since first publication.
The subsequent 223 planes (0x01 to 0xDF) of Group
0x00, as well as planes 0x00 to 0xFF in Groups
0x01 to 0x5F are reserved for further
standardization.
The last 32 planes (0xE0 to 0xFF) of Group 0x00,
as well as all code positions of 32 groups (0x60
to 0x7F) are reserved for private use.
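
The layout described on this slide can be written down as a small classifier. This is only a sketch of the ranges as stated here, for the 31-bit space; the function name `region` and its return strings are invented for the example.

```python
def region(code: int) -> str:
    """Name the part of the coding space a 31-bit code value falls in."""
    if not 0 <= code < 2 ** 31:
        raise ValueError("code must fit in 31 bits")
    group = code >> 24
    plane = (code >> 16) & 0xFF
    if group == 0x00 and plane == 0x00:
        return "BMP"
    # Planes 0xE0 to 0xFF of group 0x00, and groups 0x60 to 0x7F.
    if (group == 0x00 and plane >= 0xE0) or group >= 0x60:
        return "private use"
    # Everything else: planes 0x01 to 0xDF of group 0x00,
    # and groups 0x01 to 0x5F.
    return "reserved for further standardization"

for code in (0x00000041, 0x00010000, 0x00E00000, 0x5F000000, 0x60000000):
    print(f"{code:08X}", region(code))
```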

19
Relationship with legacy sets