lis508 lecture 1: bits, bytes and characters

Slides: 28
Provided by: kric2
Learn more at: https://openlib.org

1
lis508 lecture 1 bits, bytes and characters
  • Thomas Krichel
  • 2003-09-30

2
Structure
  • Numbers
  • Bits
  • Bytes
  • Character sets
  • Coded character set
  • Character encoding

3
Literature, no need to read
  • Peter Norton's New Inside the PC, chapter 4
  • http://www.danbbs.dk/erikoest/bb_terms.htm
  • http://wwwinfo.cern.ch/asdoc/WWW/publications/ictp99/ictp99N2705.html
  • http://www.cl.cam.ac.uk/~mgk25/unicode.html

4
Information
  • Information is best understood as what it takes
    to answer a question.
  • The simplest question has a yes or no answer.
    Therefore a bit is the natural measure of
    information.
  • Term first used by John Tukey in 1946.
  • Contraction of "binary digit".

5
Usage of bits
  • Computers are sometimes classified by the number
    of bits they can process at one time. "32 bit
    processor"
  • Graphics are also often described by the number
    of bits used to represent each dot.

6
bits and bytes
  • a bit can take the values 0 or 1, thus it can
    describe 2 possibilities
  • two bits can take the values 00, 01, 10, 11, thus
    they can describe four (2 power 2) possibilities
  • n bits can encode 2 power n possibilities.
  • Many early chips processed 8 bits at a time.
    It became customary to refer to a group of 8 bits
    as a byte. A byte can encode 2 power 8, i.e. 256,
    possibilities.
  • We can use binary numbers just like decimal
    numbers.
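
The rule that n bits encode 2 power n possibilities can be checked by listing every pattern. This is a minimal sketch in Python; the function name bit_patterns is made up for this illustration and is not part of the lecture.

```python
from itertools import product

def bit_patterns(n):
    """Return every string of n bits, in counting order."""
    return ["".join(bits) for bits in product("01", repeat=n)]

# Two bits give the four patterns listed on the slide.
two_bits = bit_patterns(2)

# A byte (8 bits) gives 2 power 8 patterns.
byte_count = len(bit_patterns(8))

print(two_bits)
print(byte_count, 2 ** 8)
```

The list grows by a factor of two with every extra bit, which is why the count is a power of 2.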

7
application of bytes
  • IP (Internet Protocol) numbers are used as the
    addresses of computers on the Internet.
  • In IP version 4 (the one that is most commonly
    used), each IP number has 4 bytes.
  • It is represented as x.x.x.x where x is a number
    between 0 and 255 (why?)
  • how many computers can there be on the Internet
    at any one time?
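
One way to explore the two questions above is with Python's standard ipaddress module. The address 192.0.2.1 is an assumption chosen for illustration (it comes from a range reserved for documentation), and the total shown is only an upper bound on distinct addresses, not a count of real computers.

```python
import ipaddress

# Example address only; any x.x.x.x value would do.
addr = ipaddress.IPv4Address("192.0.2.1")

raw = addr.packed          # the address as 4 raw bytes
octets = list(raw)         # each byte as a number between 0 and 255
as_number = int(addr)      # the same 4 bytes read as one 32-bit number

# One byte has 2 power 8 values, so x ranges from 0 to 255.
largest_x = 2 ** 8 - 1

# Four bytes are 32 bits, so there are at most 2 power 32 addresses.
total_addresses = 2 ** 32

print(octets, as_number, largest_x, total_addresses)
```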

8
decimal/binary numbers
  • 0 0
  • 1 1
  • 2 10
  • 3 11
  • 4 100
  • 5 101
  • 6 110
  • 7 111
  • 8 1000
  • 9 1001
  • 10 1010
  • 11 1011
  • 12 1100
  • 13 1101
  • 14 1110
  • 15 1111
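
The table above can be reproduced by repeatedly dividing by 2 and collecting the remainders. A minimal sketch in Python; the function name to_binary is hypothetical, and Python's built-in format(n, "b") would give the same result.

```python
def to_binary(n):
    """Convert a non-negative integer to a string of binary digits."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        digits.append(str(n % 2))  # remainder is the next digit, right to left
        n //= 2
    return "".join(reversed(digits))

# Rebuild the decimal/binary table for 0 to 15.
table = [(n, to_binary(n)) for n in range(16)]

for decimal, binary in table:
    print(decimal, binary)
```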

9
Many bytes
  • Larger units are
  • Kilobyte is 2 power 10 bytes (1024 bytes)
  • Megabyte is 2 power 20 bytes
  • Gigabyte is 2 power 30 bytes
  • Terabyte is 2 power 40 bytes
  • The prefixes come from ancient Greek words for
    "thousand", "large", "giant", and "monster",
    respectively. The prefix kilo dates back to the
    French Revolution; the others were adopted later.

10
Hex numbers
  • A byte is often represented by two hex digits.
  • Each hex digit can encode 16 values
  • Written 0 to 9, then A B C D E F. F is 15.
  • Conventionally prefixed with 0x
  • Use the Microsoft calculator in scientific mode
    to convert.
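
Conversions between decimal and hex can also be done in a few lines of Python. A small sketch; the value 0xA9 is an arbitrary example, not taken from the slide.

```python
value = 0xA9                      # hex literal, written with the 0x prefix

as_hex = format(value, "02X")     # one byte as two upper-case hex digits
back = int("A9", 16)              # parse hex text back into a number

# The two hex digits are the byte split into a high and a low half.
high, low = divmod(value, 16)

print(value, as_hex, back, high, low)
```

Each hex digit covers 4 bits, so two of them cover exactly one byte.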

11
application of hex numbers
  • Media Access Control (MAC) addresses identify the
    hardware that allows access to computer networks.
    They are 6-byte numbers, each byte written as 2
    hex digits, e.g. 006008F520A9
  • character numbers that you see when you are
    inserting a special symbol in Microsoft software,
    e.g. PowerPoint.
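
The MAC address example can be taken apart to confirm that 12 hex digits make 6 bytes. A minimal sketch in Python; the colon-separated display form is a common convention assumed here, not something the slide prescribes.

```python
mac_text = "006008F520A9"               # the example address from the slide

mac_bytes = bytes.fromhex(mac_text)     # every pair of hex digits is one byte
byte_values = list(mac_bytes)           # the same bytes as decimal numbers

# A commonly seen display form: bytes separated by colons.
pretty = ":".join(f"{b:02X}" for b in mac_bytes)

print(len(mac_bytes), byte_values, pretty)
```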

12
Characters
  • Much of the information processed by computers is
    in the form of characters.
  • A character only makes sense to a human reader
    with a minimum of cultural background.
  • A character is not a glyph.
  • Example: ligatures, where one glyph stands for
    several characters.

13
Information in a computer file
  • A file is a piece of data stored on a
    computer.
  • Any file contains a sequence of 0s and 1s, like
    1010100101010011110101010101
  • For a computer to make sense of a file, it has
    to know what type of file it is.

14
executable files
  • Executable files are files that make the
    computer do something. For example, the file
    starts a program, say PowerPoint. An executable
    for one computer may not run on another.
  • Non-executable files hold data that is used by an
    executable file. We will call them data files.
    Example: a PowerPoint slides file.

15
text files
  • Many data files contain textual data.
  • Textual data is a sequence of characters.
  • A character is an elementary symbol that has some
    meaning, e.g. a letter of an alphabet or a
    hieroglyph.
  • Example: email file.
  • Text files can be read by many computer programs.

16
non-text files
  • Examples of non-text files are
  • graphics files
  • movie files
  • sound files
  • non-text files are not very important in library
    settings
  • there is no way to organize information
    retrieval for non-text files. They have to be
    retrieved using a textual surrogate.
  • traditional library materials are textual
  • will talk about this later.

17
Representing characters
  • Computers don't understand text, they only
    understand numbers. For computers to be able to
    treat text, there must be a correspondence
    between numbers and text characters. Such a
    correspondence is called a character set.
  • Examples of characters are
  • a
  • c
  • ë
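
The correspondence between characters and numbers can be inspected directly. A small sketch in Python, whose built-in ord and chr use Unicode numbers; that choice of character set is an assumption of this example, since the slide does not name one.

```python
# The three example characters; "\u00eb" is the letter e with diaeresis.
examples = ["a", "c", "\u00eb"]

# ord maps a character to its number, chr maps a number back.
codes = {ch: ord(ch) for ch in examples}
round_trip = [chr(number) for number in codes.values()]

print(codes)
print(round_trip)
```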

18
Legacy character sets
  • In early days, computers were a lot less powerful
    than they are today.
  • Could only deal with the characters that are most
    commonly used.
  • Such sets are
  • ASCII
  • ISO-8859-1
  • cp1252

19
ASCII
  • American Standard Code for Information
    Interchange
  • 7-bit character set. There is no such thing as
    8-bit ASCII
  • 95 printable symbols
  • 33 control characters (0-31, 127)
  • http://www.ccmr.cornell.edu/helpful_data/ascii2.html
    has a list up to 127
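
The counts above can be verified by walking through all 2 power 7 codes. A minimal sketch in Python; it assumes, as the counts on the slide imply, that the space (32) is counted among the printable symbols.

```python
all_codes = range(2 ** 7)   # 7 bits give the codes 0 to 127

# Control characters are 0-31 and 127; everything else is printable.
control = [c for c in all_codes if c < 32 or c == 127]
printable = [c for c in all_codes if 32 <= c <= 126]

print(len(printable), len(control))

# A character outside the 7-bit range cannot be encoded as ASCII.
try:
    "\u00eb".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False

print(fits_in_ascii)
```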

20
some ASCII control characters
  • CR (13, ^M) is the carriage return
  • LF (10, ^J) is the linefeed
  • FF (12, ^L) is the form feed (new page)
  • BS (8, ^H) is the backspace
  • DEL (127, ALT-127) is delete
  • ESC (27, ^[) is escape
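
The letter listed next to each code is the key that, pressed together with Ctrl, traditionally produces that control character; this is usually written with a caret, as in ^M. A sketch of the relationship in Python; the function name caret_notation is made up for this illustration.

```python
def caret_notation(code):
    """Return the caret form (e.g. '^M') of an ASCII control code."""
    if not (0 <= code < 32 or code == 127):
        raise ValueError("not an ASCII control character")
    # Flipping bit 6 (value 64) turns a control code into its key character.
    return "^" + chr(code ^ 0x40)

names = {"CR": 13, "LF": 10, "FF": 12, "BS": 8, "ESC": 27, "DEL": 127}
carets = {name: caret_notation(code) for name, code in names.items()}

print(carets)
```

For the codes 0 to 31 flipping that bit simply adds 64, so 13 becomes 77, the code of M.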

21
ISO-8859-1
  • ISO-8859-1, aka ISO-latin-1 extends ASCII with
    characters that are commonly used by the western
    European languages.
  • It is the default character set of HTML.
  • Positions 128 to 159 are not used for printable
    characters.
  • Cp1252 fills these with graphic characters. It is
    a Microsoft character set.
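
The difference between the two sets shows up when the same byte is decoded both ways. A minimal sketch using Python's built-in codecs, which are named "latin-1" and "cp1252"; the bytes 0x80 and 0xE9 are examples picked for this illustration.

```python
in_gap = bytes([0x80])      # position 128, inside the 128-159 range
shared = bytes([0xE9])      # position 233, outside that range

# cp1252 puts a graphic character (the euro sign) at position 128.
gap_cp1252 = in_gap.decode("cp1252")

# Python's latin-1 codec yields a non-printing control code there.
gap_latin1 = in_gap.decode("latin-1")

# Outside the gap the two sets agree.
same_elsewhere = shared.decode("cp1252") == shared.decode("latin-1")

print(repr(gap_cp1252), repr(gap_latin1), same_elsewhere)
```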

22
This is not enough
  • There are around 6800 different languages in the
    world.
  • Some of these languages use character sets that
    are not finite, i.e. folks can make up new
    characters out of existing ones!
  • Setting up a character set for all languages is
    almost impossible.

23
ISO 10646-1
  • Defines the Universal Character Set (UCS)
  • UCS contains the characters needed to write many
    known languages, including scripts such as Oriya,
    Telugu, Bopomofo and Runic.
  • ISO 10646 formally defines a 31-bit character
    set. Each character is represented in 32 bits,
    i.e. 4 bytes, or 8 hex digits.
  • Not finished.
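
The 4-byte form can be seen by encoding single characters. This sketch uses Python's "utf-32-be" codec as a stand-in for the 4-byte UCS representation, which is an assumption of this example; the function name ucs4_hex and the sample characters are likewise chosen only for illustration.

```python
def ucs4_hex(ch):
    """Return one character as 8 hex digits (4 bytes, big-endian)."""
    return ch.encode("utf-32-be").hex().upper()

latin_a = ucs4_hex("A")           # Latin capital letter A
oriya_a = ucs4_hex("\u0b05")      # Oriya letter A
runic_f = ucs4_hex("\u16a0")      # Runic letter fehu

print(latin_a, oriya_a, runic_f)
```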

24
Unicode
  • ISO is an international standards body. Slow and
    bureaucratic.
  • Industry has come together to work on Unicode,
    originally designed as a 2-byte character set.
  • With some minor exceptions, the Unicode
    characters are the same as the first 65536
    characters in UCS.
  • Much better documented standard.

25
Unicode and legacy sets
  • The first 128 characters are identical to those
    in ASCII
  • The next 128 characters are identical to ISO
    8859-1 (Latin-1).
  • Unicode is well documented and the Unicode book
    can be downloaded from the Internet. A must-have
    for the serious digital librarian.
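
The overlap with the legacy sets can be checked position by position. A minimal sketch in Python, relying on the fact that chr works with Unicode numbers and on the built-in "ascii" and "latin-1" codecs.

```python
# Each of the first 128 Unicode characters encodes to the ASCII byte
# with the same number.
ascii_match = all(
    chr(i).encode("ascii") == bytes([i]) for i in range(128)
)

# Each of the first 256 Unicode characters encodes to the ISO 8859-1
# byte with the same number.
latin1_match = all(
    chr(i).encode("latin-1") == bytes([i]) for i in range(256)
)

print(ascii_match, latin1_match)
```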

26
Politics
  • Does it make sense to use Unicode rather than,
    say, ISO-latin-1?
  • Many commercial pieces of software have data
    files that contain character data interspersed
    with non-character data. Is that good?

27
http://openlib.org/home/krichel
  • Thank you for your attention!