Title: CSE 599 Lecture 7: Information Theory, Thermodynamics and Reversible Computing
- What have we done so far?
  - Theoretical computer science: Abstract models of computing - Turing machines, computability, time and space complexity
  - Physical instantiations
    - Digital computing: Silicon switches manipulate binary variables with near-zero error
    - DNA computing: Massive parallelism and the biochemical properties of organic molecules allow fast solutions to hard search problems
    - Neural computing: Distributed networks of neurons compute fast, parallel, adaptive, and fault-tolerant solutions to hard pattern recognition and motor control problems
Overview of Today's Lecture
- Information theory and Kolmogorov complexity
  - What is information?
  - Definition based on probability theory
  - Error-correcting codes and compression
  - An algorithmic definition of information (Kolmogorov complexity)
- Thermodynamics
  - The physics of computation
  - Relation to information theory
  - Energy requirements for computing
- Reversible computing
  - Computing without energy consumption?
  - Biological example
  - Reversible logic gates → Quantum computing (next week!)
Information and Algorithmic Complexity
- Three principal results
  - Shannon's source-coding theorem
    - The main theorem of information content
    - A measure of the number of bits needed to specify the expected outcome of an experiment
  - Shannon's noisy-channel coding theorem
    - Describes how much information we can transmit over a channel
    - A strict bound on information transfer
  - Kolmogorov complexity
    - Measures the algorithmic information content of a string
    - An uncomputable function
What is information?
- First try at a definition
  - Suppose you have stored n different bookmarks on your web browser.
  - What is the minimum number of bits you need to store these as binary numbers?
  - Let I be the minimum number of bits needed. Then 2^I ≥ n, so I ≥ log2 n
  - So, the information contained in your collection of n bookmarks is I0 = log2 n
Deterministic information I0
- Consider a set of alternatives X = {a1, a2, a3, ..., aK}
- When the outcome is a3, we say x = a3
- I0(X) is the amount of information needed to specify the outcome of X
  - I0(X) = log2 |X|
  - We will assume base 2 from now on (unless stated otherwise)
  - Units are bits (binary digits)
- Relationship between bits and binary digits
  - B = {0, 1}
  - X = B^M = set of all binary strings of length M
  - I0(X) = log |B^M| = log 2^M = M bits
Is this definition satisfactory?
- Appeal to your intuition
- Which of these two messages contains more information?
  - "Dog bites man"
  - or
  - "Man bites dog"
- Same number of bits to represent each message!
- But, it seems like the second message contains a lot more information than the first. Why?
Enter probability theory
- Surprising events (unexpected messages) contain more information than ordinary or expected events
  - "Dog bites man" occurs much more frequently than "Man bites dog"
  - Messages about less frequent events carry more information
  - So, information about an event varies inversely with the probability of that event
- But, we also want information to be additive
  - If message xy contains sub-parts x and y, we want I(xy) = I(x) + I(y)
  - Use the logarithm function: log(xy) = log(x) + log(y)
New Definition of Information
- Define the information contained in a message x in terms of the log of the inverse probability of that message
  - I(x) = log(1/P(x)) = -log P(x)
- First defined rigorously and studied by Shannon (1948)
  - "A mathematical theory of communication" - electronic handout (PDF file) on class website
- Our previous definition is a special case
  - Suppose you had n equally likely items (e.g. bookmarks)
  - For any item x, P(x) = 1/n
  - I(x) = log(1/P(x)) = log n
  - Same as before (minimum number of bits needed to store n items)
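A quick numeric illustration of this definition (a minimal Python sketch; the probabilities below are made-up values for illustration, not from the lecture):

```python
import math

def information_bits(p):
    """Information content of an event with probability p, in bits: I = log2(1/p)."""
    return math.log2(1.0 / p)

# n equally likely bookmarks: each one carries log2(n) bits, matching the earlier definition.
n = 1024
print(information_bits(1.0 / n))   # 10.0 bits

# A rarer message ("man bites dog") carries more information than a common one.
print(information_bits(0.1))       # ~3.32 bits
print(information_bits(0.0001))    # ~13.29 bits
```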
Review: Axioms of probability theory
- Kolmogorov, 1933
  - P(a) ≥ 0, where a is an event
  - P(Ω) = 1, where Ω is the certain event
  - P(a ∪ b) = P(a) + P(b), where a and b are mutually exclusive
- Kolmogorov (axiomatic) definition is computable
- Probability theory forms the basis for information theory
- Classical definition based on event frequencies (Bernoulli) is uncomputable

Review: Results from probability theory
- Joint probability of two events a and b: P(ab)
- Independence
  - Events a and b are independent iff P(ab) = P(a)P(b)
- Conditional probability: P(a|b) = probability that event a happens given that b has happened
  - P(a|b) = P(ab)/P(b)
  - P(b|a) = P(ab)/P(a) = P(a|b)P(b)/P(a)
  - We just proved Bayes' Theorem
- P(a) is called the a priori probability of a
- P(a|b) is called the a posteriori probability of a
Summary: Postulates of information theory
- 1. Information is defined in the context of a set of alternatives. The amount of information quantifies the number of bits needed to specify an outcome from the alternatives
- 2. The amount of information is independent of the semantics (it depends only on probability)
- 3. Information is always positive
- 4. Information is measured on a logarithmic scale
  - Probabilities are multiplicative, but information is additive
In-Class Example
- Message y contains duplicates: y = xx
- Message x has probability P(x)
- What is the information content of y?
- Is I(y) = 2 I(x)?
  - I(y) = log(1/P(xx)) = log[1/(P(x|x)P(x))]
        = log(1/P(x|x)) + log(1/P(x))
        = 0 + log(1/P(x))
        = I(x)
- Duplicates convey no additional information!
Definition: Entropy
- The average self-information or entropy of an ensemble X = {a1, a2, a3, ..., aK}:
  - H(X) = E[log(1/P(x))] = Σk P(ak) log(1/P(ak))
  - E = expected (or average) value
Properties of Entropy
- 0 ≤ H(X) ≤ I0(X)
  - Equals I0(X) = log |X| if all the ak are equally probable
  - Equals 0 if only one ak is possible
- Consider the case where K = 2
  - X = {a1, a2}
  - P(a1) = p, P(a2) = 1 - p
  - H(X) = p log(1/p) + (1 - p) log(1/(1 - p)), the binary entropy function, maximized (1 bit) at p = 1/2
Examples
- Entropy is a measure of the randomness of the source producing the events
- Example 1: Coin toss, heads or tails with equal probability
  - H = -(1/2 log 1/2 + 1/2 log 1/2) = -(1/2 (-1) + 1/2 (-1)) = 1 bit per coin toss
- Example 2: P(heads) = 3/4 and P(tails) = 1/4
  - H = -(3/4 log 3/4 + 1/4 log 1/4) ≈ 0.811 bits per coin toss
- As things get less random, entropy decreases
  - Redundancy and regularity increase
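The two coin-toss examples can be checked with a few lines of Python (a minimal sketch of the entropy formula above):

```python
import math

def entropy(probs):
    """Shannon entropy H = sum_k p_k * log2(1/p_k), in bits (terms with p = 0 contribute 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit per toss (fair coin)
print(entropy([0.75, 0.25]))  # ~0.811 bits per toss (biased coin)
print(entropy([1.0, 0.0]))    # 0 bits: a certain outcome carries no information
```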
Question
- If we have N different symbols, we can encode them in log(N) bits. Example: English - 26 letters → 5 bits
- So, over many, many messages, the average cost/symbol is still 5 bits.
- But, letters occur with very different probabilities! "A" and "E" are much more common than "X" and "Q". The log(N) estimate assumes equal probabilities.
- Question: Can we encode symbols based on probabilities so that the average cost/symbol is minimized?
Shannon's noiseless source-coding theorem
- Also called "the fundamental theorem". In words:
  - You can compress N independent, identically distributed (i.i.d.) random variables, each with entropy H, down to NH bits with negligible loss of information (as N → ∞)
  - If you compress them into fewer than NH bits, you will dramatically lose information
- The theorem
  - Let X be an ensemble with H(X) = H bits. Let Hδ(X) be the entropy of an encoding of X with allowable probability of error δ
  - Given any ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that, for N > N0,
    H - ε < (1/N) Hδ(X^N) < H + ε
Comments on the theorem
- What do the two inequalities tell us?
- (1/N) Hδ(X^N) < H + ε: The number of bits that we need to specify outcomes x with vanishingly small error probability δ does not exceed N(H + ε)
  - If we accept a vanishingly small error, the number of bits we need to specify x drops to N(H + ε)
- (1/N) Hδ(X^N) > H - ε: The number of bits that we need to specify outcomes x, even with a large allowable error probability δ, is at least N(H - ε)
Source coding (data compression)
- Question: How do we compress the outcomes X^N?
  - With vanishingly small probability of error
  - How do we assign codewords to the elements of X such that the number of bits we need to encode X^N drops to N(H + ε)?
- Symbol coding: Given x = a3 a2 a7 a5 ...
  - Generate the codeword σ(x) = 01 1010 00 ...
  - Want: I0(σ(x)) ≈ H(X)
- Well-known coding examples
  - Zip, gzip, compress, etc.
  - The performance of these algorithms is, in general, poor when compared to the Shannon limit
Source-coding definitions
- A code is a function σ: X → B+
  - B = {0, 1}
  - B+ = the set of finite strings over B
    - B+ = {0, 1, 00, 01, 10, 11, 000, 001, ...}
  - σ(x) = σ(x1) σ(x2) σ(x3) ... σ(xN)
- A code is uniquely decodable (UD) iff
  - its extension to strings, σ: X+ → B+, is one-to-one
- A code is instantaneous iff
  - No codeword is the prefix of another
  - σ(x1) is not a prefix of σ(x2)
Huffman coding
- Given X = {a1, a2, ..., aK} with associated probabilities P(ak)
- Given a code with codeword lengths n1, n2, ..., nK
- The expected code length: ⟨L⟩ = Σk P(ak) nk
- No instantaneous, UD code can achieve a smaller ⟨L⟩ than a Huffman code
Constructing a Huffman code
- Feynman example: Encoding an alphabet
- Code is instantaneous and UD: 00100001101010 decodes to "ANOTHER"
- Code achieves close to the Shannon limit
  - H(X) = 2.06 bits ≤ ⟨L⟩ = 2.13 bits
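For reference, here is a minimal Python sketch of the standard greedy Huffman construction. The symbol probabilities below are made up for illustration and are not the ones from Feynman's example:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability}; returns {symbol: bitstring}.
    Standard greedy construction: repeatedly merge the two least probable subtrees,
    prepending '0' to the codewords of one side and '1' to the other."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                       # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Illustrative probabilities over the letters of "ANOTHER" (not the lecture's values).
probs = {"A": 0.30, "N": 0.20, "O": 0.15, "T": 0.15, "H": 0.10, "E": 0.06, "R": 0.04}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(avg_len)   # expected length is close to (and never below) the entropy H(X)
```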
Information channels
- [Figure: a channel with input ensemble X (symbols x) and output ensemble Y (symbols y)]
- H(X) = entropy of the input ensemble X
- I(X;Y) = what we know about X given Y
  - I(X;Y) is the average mutual information between X and Y
- Definition: Channel capacity
  - The information capacity of a channel is C = max I(X;Y)
- The channel may add noise
  - Corrupting our symbols
Example: Channel capacity
- Problem: A binary source sends equiprobable messages in a time T, using the alphabet {0, 1} with a symbol rate R. As a result of noise, a "0" may be mistaken for a "1", and a "1" for a "0", both with probability q. What is the channel capacity C?
- [Figure: binary symmetric channel from X to Y with crossover probability q]
- The channel is discrete and memoryless
Example: Channel capacity (cont.)
- Assume no noise (no errors)
  - T is the time to send the string, R is the rate
  - The number of possible message strings is 2^(RT)
  - The maximum entropy of the source is H0 = log(2^(RT)) = RT bits
  - The source rate is (1/T) H0 = R bits per second
- The entropy of the noise (per transmitted bit) is
  - Hn = q log(1/q) + (1 - q) log(1/(1 - q))
- The channel capacity C (bits/sec) = R - R Hn = R(1 - Hn)
- C is always less than R (a fixed fraction of R)!
- We must add code bits to correct the received message
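A small Python sketch of the binary-symmetric-channel capacity formula above (the rates and flip probabilities below are illustrative values, not from the lecture):

```python
import math

def binary_entropy(q):
    """Hn(q) = q*log2(1/q) + (1-q)*log2(1/(1-q)): the per-bit entropy of the noise."""
    if q in (0.0, 1.0):
        return 0.0
    return q * math.log2(1.0 / q) + (1 - q) * math.log2(1.0 / (1 - q))

def bsc_capacity(R, q):
    """Capacity (bits/sec) of a binary symmetric channel: C = R(1 - Hn(q))."""
    return R * (1.0 - binary_entropy(q))

print(bsc_capacity(1e6, 0.0))    # 1,000,000 bits/sec: a noiseless channel
print(bsc_capacity(1e6, 0.01))   # ~919,000 bits/sec: 1% flips cost ~8% of the rate
print(bsc_capacity(1e6, 0.5))    # 0: the output is independent of the input
```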
How many code bits must we add?
- We want to send a message string of length M
- We add code bits to M, thereby increasing its length to Mc
- How are M, Mc, and q related?
  - M = Mc(1 - Hn)
  - Intuitively, from our example
  - Also see pgs. 106-110 of Feynman
- Note: this is an asymptotic limit
  - May require a huge Mc
Shannon's Channel-Coding Theorem
- The theorem
  - There is a nonnegative channel capacity C associated with each discrete memoryless channel, with the following property: For any symbol rate R < C and any error rate ε > 0, there is a protocol that achieves a rate > R and a probability of error < ε
- In words
  - If the entropy of our symbol stream is equal to or less than the channel capacity, then there exists a coding technique that enables transmission over the channel with arbitrarily small error
  - We can transmit information at a rate H(X) < C
- Shannon's theorem tells us the asymptotically maximum rate
  - It does not tell us the code that we must use to obtain this rate
  - Achieving a high rate may require a prohibitively long code
Error-correction codes
- Error-correcting codes allow us to detect and correct errors in symbol streams
- Used in all signal communications (digital phones, etc.)
- Used in quantum computing to ameliorate the effects of decoherence
- Many techniques and algorithms
  - Block codes
  - Hamming codes
  - BCH codes
  - Reed-Solomon codes
  - Turbo codes
Hamming codes
- An example: Construct a code that corrects a single error
- We add m check bits to our message
  - Can encode at most (2^m - 1) error positions
  - Errors can occur in the message bits and/or in the check bits
  - If n is the length of the original message, then 2^m - 1 ≥ (n + m)
- Examples
  - If n = 11, m = 4: 2^4 - 1 = 15 ≥ (n + m) = 15
  - If n = 1013, m = 10: 2^10 - 1 = 1023 ≥ (n + m) = 1023
Hamming codes (cont.)
- Example: An 11/15 SEC Hamming code
- Idea: Calculate parity over subsets of input bits
  - Four subsets → four parity bits
  - Check bit x stores the parity of the bit positions whose binary representation holds a 1 in position x
    - Check bit c1: bits 1,3,5,7,9,11,13,15
    - Check bit c2: bits 2,3,6,7,10,11,14,15
    - Check bit c3: bits 4,5,6,7,12,13,14,15
    - Check bit c4: bits 8,9,10,11,12,13,14,15
- The parity-check bits are called a syndrome
  - The syndrome tells us the location of the error

Position in message:
  binary   decimal
  0001     1
  0010     2
  0011     3
  0100     4
  0101     5
  0110     6
  0111     7
  1000     8
  1001     9
  1010     10
  1011     11
  1100     12
  1101     13
  1110     14
  1111     15
Hamming codes (cont.)
- The check bits specify the error location
- Suppose the check bits turn out to be as follows
  - Check c1 = 1 (bits 1,3,5,7,9,11,13,15)
    - Error is in one of bits 1,3,5,7,9,11,13,15
  - Check c2 = 1 (bits 2,3,6,7,10,11,14,15)
    - Error is in one of bits 3,7,11,15
  - Check c3 = 0 (bits 4,5,6,7,12,13,14,15)
    - Error is in one of bits 3,11
  - Check c4 = 0 (bits 8,9,10,11,12,13,14,15)
    - So the error is in bit 3!!
Hamming codes (cont.)
- Example: Encode 10111011011
  - Code position: 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
  - Code symbol:    1  0  1  1  1  0  1 c4  1  0  1 c3  1 c2 c1
  - Codeword:       1  0  1  1  1  0  1  1  1  0  1  1  1  0  1
- Notice that we can generate the code bits on the fly!
- What if we receive 101100111011101?
  - c4 = 1 (the parity check over bits 8-15 fails)
  - c3 = 0 (the parity check over bits 4,5,6,7,12,13,14,15 passes)
  - c2 = 1 (the parity check over bits 2,3,6,7,10,11,14,15 fails)
  - c1 = 1 (the parity check over bits 1,3,5,7,9,11,13,15 fails)
  - The error is in location 1011 (binary) = bit 11
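The following Python sketch implements the 11/15 SEC Hamming code described above (check bits at positions 1, 2, 4, 8; position 15 written leftmost, as on the slide) and reproduces the worked example, including the flipped bit 11:

```python
def hamming15_encode(msg_bits):
    """Encode 11 message bits into a 15-bit single-error-correcting Hamming codeword.
    Positions 1..15; positions 1, 2, 4, 8 hold check bits, the rest hold data."""
    assert len(msg_bits) == 11
    code = [0] * 16                                            # index 1..15 used; index 0 unused
    data_positions = [p for p in range(15, 0, -1) if p not in (1, 2, 4, 8)]
    for pos, bit in zip(data_positions, msg_bits):             # message given with bit 15 first
        code[pos] = bit
    for c in (1, 2, 4, 8):                                     # each check covers positions p with p & c
        code[c] = sum(code[p] for p in range(1, 16) if p & c) % 2
    return code[1:]                                            # positions 1..15, ascending

def hamming15_syndrome(codeword):
    """Recompute the four parity checks; the syndrome is the error position (0 if none)."""
    code = [0] + list(codeword)
    syndrome = 0
    for c in (1, 2, 4, 8):
        if sum(code[p] for p in range(1, 16) if p & c) % 2:
            syndrome += c
    return syndrome

msg = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]            # the slide's message, written bit 15 first
cw = hamming15_encode(msg)
print("".join(str(b) for b in reversed(cw)))        # 101110111011101 (position 15 leftmost)

received = list(cw)
received[11 - 1] ^= 1                               # flip bit 11, giving the slide's received word
print(hamming15_syndrome(received))                 # 11: the syndrome points at the flipped bit
```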
Kolmogorov Complexity (Algorithmic Information)
- Computers represent information as stored symbols
  - Not probabilistic (in the Shannon sense)
  - Can we quantify information from an algorithmic standpoint?
- The Kolmogorov complexity K(s) of a finite binary string s is the single natural number representing the minimum length (in bits) of a program p that generates s when run on a universal Turing machine U
- K(s) is the algorithmic information content of s
  - Quantifies the algorithmic randomness of the string
- K(s) is an uncomputable function
  - Similar argument to the halting problem
  - How do we know when we have the shortest program?
Kolmogorov Complexity: Example
- The randomness of a string is defined by the shortest algorithm that can print it out.
- Suppose you were given the binary string x
  - 111111...1111 (1000 1s)
- Instead of 1000 bits, you can compress this string to a few tens of bits, representing the length P of the program:
  - For i = 1 to 1000
      Print "1"
- So, K(x) ≤ P
- Possible project topic: Quantum Kolmogorov complexity?
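K(s) itself is uncomputable, but any off-the-shelf compressor gives an upper bound in the same spirit: the compressed length (plus the fixed length of the decompressor) is the length of one program that prints the string. A rough Python illustration only; zlib lengths are a crude stand-in for K, not its value:

```python
import os
import zlib

regular = b"1" * 1000          # highly regular: has a very short description
random_ish = os.urandom(1000)  # incompressible with overwhelming probability

print(len(zlib.compress(regular, 9)))     # a handful of bytes
print(len(zlib.compress(random_ish, 9)))  # slightly more than 1000 bytes: no short description found
```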
5-minute break
- Next: Thermodynamics and Reversible Computing
Thermodynamics and the Physics of Computation
- Physics imposes fundamental limitations on computing
  - Computers are physical machines
  - Computers manipulate physical quantities
  - Physical quantities represent information
- The limitations are both technological and theoretical
  - Physical limitations on what we can build
    - Example: Silicon-technology scaling
    - Major limiting factor in the future: power consumption
  - Theoretical limitations on the energy consumed during computation
    - Thermodynamics and computation
Principal Questions of Interest
- How much energy must we use to carry out a computation?
  - The theoretical minimum energy
- Is there a minimum energy for a certain rate of computation?
  - A relationship between computing speed and energy consumption
- What is the link between energy and information?
  - Between information-theoretic entropy and thermodynamic entropy
- Is there a physical definition of information content?
  - The information content of a message in physical units
Main Results
- Computation has no inherent thermodynamic cost
  - A reversible computation that proceeds at an infinitesimal rate consumes no energy
- Destroying information requires kT ln 2 joules per bit
  - Information-theoretic bits (not binary digits)
- Driving a computation forward requires kT ln(r) joules per step
  - r is the rate of going forward rather than backward
Basic thermodynamics
- First law: Conservation of energy
  - (heat put into system) + (work done on system) = increase in energy of the system
  - ΔQ + ΔW = ΔU
  - The total energy of the universe is constant
- Second law: It is not possible to have heat flow, by itself, from a colder region to a hotter region
  - Change in entropy: ΔS ≥ ΔQ/T
  - Equality holds only for reversible processes
  - The entropy of the universe is always increasing
Heat engines
- A basic heat engine takes in heat Q1 at temperature T1, does work W, and delivers heat Q2 = Q1 - W at temperature T2
  - T1 and T2 are temperatures, with T1 > T2
- Reversible heat engines are those that have
  - No friction
  - Infinitesimal heat gradients
- The Carnot cycle (motivation: the steam engine)
  - Reversible
  - Pumps heat ΔQ from T1 to T2
  - Does work W = ΔQ (T1 - T2)/T1
Heat engines (cont.)
The Second Law
- No engine that takes heat Q1 at T1 and delivers heat Q2 at T2 can do more work than a reversible engine
  - W = Q1 - Q2 ≤ Q1 (T1 - T2)/T1
- Heat will not, by itself, flow from a cold object to a hot object
Thermodynamic entropy
- If we add heat ΔQ reversibly to a system at fixed temperature T, the increase in entropy of the system is ΔS = ΔQ/T
- S is a measure of degrees of freedom
  - The probability of a configuration
  - The probability of a point in phase space
- In a reversible system, the total entropy is constant
- In an irreversible system, the total entropy always increases
Thermodynamic versus Information Entropy
- Assume a gas containing N atoms
  - Occupies a volume V1
  - Ideal gas: no attraction or repulsion between particles
- Now shrink the volume
  - Isothermally (at constant temperature; immerse in a bath)
  - Reversibly, with no friction
- How much work does this require?
Compressing the gas
- From mechanics
  - work = force × distance
  - force = pressure × (area of piston)
  - volume change = (area of piston) × distance
  - Solving: W = ∫ P dV (from V1 to V2)
- From gas theory
  - The ideal gas law: PV = NkT
    - N = number of molecules
    - k = Boltzmann's constant (in joules/kelvin)
  - Solving: W = ∫ (NkT/V) dV = NkT ln(V2/V1)
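A small numeric check of W = NkT ln(V2/V1) (a Python sketch; the volumes and temperature below are arbitrary illustrative values):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, joules per kelvin

def isothermal_work(N, T, V1, V2):
    """Work done BY N molecules of ideal gas compressed or expanded isothermally and
    reversibly from V1 to V2: W = NkT ln(V2/V1). Negative when V2 < V1 (we do work on it)."""
    return N * k_B * T * math.log(V2 / V1)

# Halving the volume of a single-molecule "gas" at room temperature:
print(isothermal_work(N=1, T=300.0, V1=2.0, V2=1.0))   # ≈ -2.87e-21 J = -kT ln 2
```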
A few notes
- W is negative because we are doing work on the gas (V2 < V1)
  - W would be positive if the gas did work for us
- Where did the work go?
  - Isothermal compression
  - The temperature is constant (same before and after)
  - First law: The work went into heating the bath
  - Second law: We decreased the entropy of the gas and increased the entropy of the bath
Free energy and entropy
- The total energy of the gas, U, remains unchanged
  - Same number of particles
  - Same temperature
- The free energy Fe and the entropy S both change
  - Both are related to the number of states (degrees of freedom)
  - Fe = U - TS
- For our experiment, the change in free energy equals the work done on the gas, and U remains unchanged
  - ΔFe is the (negative) heat siphoned off into the bath
Special Case: N = 1
- Imagine that our gas contains only one molecule
  - Take statistical averages of the same molecule over time, rather than over a population of particles
- Halve the volume
  - Fe increases by kT ln 2
  - S decreases by k ln 2
  - But U is constant
- What's going on?
  - Our knowledge of the possible locations of the particle has changed!
  - Fewer places that the molecule can be in, now that the volume has been halved
  - The entropy, a measure of the uncertainty of a configuration, has decreased
Thermodynamic entropy revisited
- Take the probability of a gas configuration to be P
  - Then S = k ln P
- Random configurations (molecules moving haphazardly) have large P and large S
- Ordered configurations (all molecules moving in one direction) have small P and small S
- The less we know about a gas
  - the more states it could be in
  - and the greater the entropy
- A clear analogy with information theory
The fuel value of knowledge
- Analysis is from Bennett: tape cells with particles coding "0" (left side) or "1" (right side)
- If we know the message on a tape
  - Then randomizing the tape can do useful work
  - Increasing the tape's entropy
- What is the fuel value of the tape (i.e., what is the fuel value of our knowledge)?
Bennett's idea
- The procedure
  - A tape cell comes in with a known particle location
  - Orient a piston depending on whether the cell is a "0" or a "1"
  - The particle pushes the piston outward
    - Increasing the entropy by k ln 2
    - Providing free energy of kT ln 2 joules per bit
  - The tape cell goes out with a randomized particle location
The energy value of knowledge
- Define the fuel value of a tape = (N - I) kT ln 2
  - N is the number of tape cells
  - I is the information (Shannon)
- Examples
  - A random tape (I = N) has no fuel value
  - A known tape (I = 0) has maximum fuel value
Feynman's tape-erasing machine
- Define the information in the tape to be the amount of free energy required to reset the tape
  - The energy required to compress each bit to a known state
- Only the "surprise" bits cost us energy
  - It doesn't take any energy to reset known bits
  - For known bits, just move the partition (without changing the volume)
- Cost to erase the tape: I kT ln 2 joules
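A quick calculation of the erasure cost I kT ln 2 at room temperature (a Python sketch; the bit counts are illustrative):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K

def erasure_cost(I_bits, T=300.0):
    """Minimum free energy needed to erase I information-theoretic (surprise) bits
    at temperature T: I * kT ln 2. Known, predictable bits cost nothing to reset."""
    return I_bits * k_B * T * math.log(2)

print(erasure_cost(1))     # ≈ 2.87e-21 J to erase one surprise bit at room temperature
print(erasure_cost(1e9))   # ≈ 2.87e-12 J to erase a gigabit of incompressible data
```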
Reversible Computing
- A reversible computation that proceeds at an infinitesimal rate and destroys no information consumes no energy
  - Regardless of the complexity of the computation
- The only cost is in resetting the machine at the end
  - Erasing information costs energy
- Reversible computers are like heat engines
  - If we run a reversible heat engine at an infinitesimal pace, it consumes no energy other than the work that it does
Energy cost versus speed
- We want our computations to run in finite time
- We need to drive the computation forward
  - This dissipates energy (kinetic, thermal, etc.)
- Assume we are driving the computation forward at a rate r
  - The computation is r times as likely to go forward as to go backward
- What is the minimum energy per computational step?
Energy-driven computation
- Computation is a transition between states
- State transitions have an associated energy diagram
  - Assume the forward state E2 has a lower energy than the backward state E1
  - A is the activation energy for a state transition
- Thermal fluctuations cause the computer to move between states
  - Whenever the energy exceeds A
- We also used this model in neural networks (e.g. Hopfield networks)
State transitions
- The probability of a transition between states differing in positive energy ΔE is proportional to exp(-ΔE/kT)
- Our state transitions have unequal probabilities
  - The energy required for a forward step is (A - E1)
  - The energy required for a backward step is (A - E2)
Driving computation by energy differences
- The (reaction) rate r depends only on the energy difference between successive states
- The bigger (E1 - E2), the more likely the state transitions, and the faster the computation
- Energy expended per step: E1 - E2 = kT ln r
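A small Python sketch of the kT ln r cost per step (the drive ratios r below are illustrative):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K

def drive_energy_per_step(r, T=300.0):
    """Energy dissipated per computational step to make forward steps r times
    more likely than backward steps: E1 - E2 = kT ln r."""
    return k_B * T * math.log(r)

for r in (2, 10, 1e6):
    print(r, drive_energy_per_step(r))   # grows only logarithmically with the drive ratio r
```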
Driving computation by state availability
- We can drive a computation even if the forward and backward states have the same energy
  - As long as there are more forward states than backward states
- The computation proceeds by diffusion
  - More likely to move into a state with greater availability
- Thermodynamic entropy drives the computation
Rate-Driven Reversible Computing: A Biological Example
- Protein synthesis is an example...
  - of (nearly) reversible computation
  - of the "copy" computation
  - of a computation driven forward by thermodynamic entropy
- Protein synthesis is a 2-stage process
  - 1. DNA forms mRNA
  - 2. mRNA forms a protein
- We will consider step 1
DNA
- DNA comprises a double-stranded helix
- Each strand comprises alternating phosphate and sugar groups
- One of four bases attaches to each sugar
  - Adenine (A)
  - Thymine (T)
  - Cytosine (C)
  - Guanine (G)
- A (base + sugar + phosphate) group is called a nucleotide
- DNA provides a template for protein synthesis
  - The sequence of nucleotides forms a code
RNA polymerase
- RNA polymerase attaches itself to a DNA strand
  - Moves along, building an mRNA strand one base at a time
  - RNA polymerase catalyzes the copying reaction
- Within the nucleus there is DNA, RNA polymerase, and triphosphates (nucleotides with 2 extra phosphates), plus other stuff
- The triphosphates are
  - adenosine triphosphate (ATP)
  - cytosine triphosphate (CTP)
  - guanine triphosphate (GTP)
  - uracil triphosphate (UTP)
mRNA
- The mRNA strand is complementary to the DNA
- The matching pairs are
  - DNA → RNA
  - A → U
  - T → A
  - C → G
  - G → C
- As each nucleotide is added, two phosphates are released
  - Bound as a pyrophosphate
The process
RNA polymerase is a catalyst
- Catalysts influence the rate of a biochemical reaction
  - But not the direction
- Chemical reactions are reversible
  - RNA polymerase can unmake an mRNA strand
  - Just as easily as it can make one
  - Grab a pyrophosphate, attach it to a base, and release
- The direction of the reaction depends on the relative concentrations of the pyrophosphates and triphosphates
  - More triphosphates than pyrophosphates → make RNA
  - More pyrophosphates than triphosphates → unmake RNA
DNA, entropy, and states
- The relative concentrations of pyrophosphate and triphosphate define the number of available states
- Cells hydrolyze pyrophosphate to keep the reactions going forward
- How much energy does a cell use to drive this reaction?
  - Energy = kT ln r = (S2 - S1) T ≈ 100 kT per bit
Efficiency of a representation
- Cells create protein engines (mRNA) for 100 kT/bit
- 0.03 µm transistors consume 100 kT per switching event
- Think of representational efficiency
  - What does each system get for 100 kT?
- Digital logic uses an impoverished representation
  - ~10^4 switching events to perform an 8-bit multiply
  - Semiconductor scaling doesn't improve the representation
  - We pay a huge thermodynamic cost to use discrete math
Example 2: Computing using Reversible Logic Gates
- Two reversible gates: the controlled-NOT (CN) and the controlled-controlled-NOT (CCN)

CCN gate: (A, B, C) → (A', B', C') with A' = A, B' = B, C' = C ⊕ (A·B)

  A B C | A' B' C'
  0 0 0 | 0  0  0
  0 0 1 | 0  0  1
  0 1 0 | 0  1  0
  0 1 1 | 0  1  1
  1 0 0 | 1  0  0
  1 0 1 | 1  0  1
  1 1 0 | 1  1  1
  1 1 1 | 1  1  0

CN gate: (A, B) → (A', B') with A' = A, B' = B ⊕ A

  A B | A' B'
  0 0 | 0  0
  0 1 | 0  1
  1 0 | 1  1
  1 1 | 1  0

- CCN is complete: we can form any Boolean function using only CCN gates, e.g., AND if C = 0 (then C' = A·B)
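A minimal Python sketch of the two gates, checking that each is its own inverse (so no information is destroyed) and that CCN with C = 0 computes AND:

```python
def cn(a, b):
    """Controlled-NOT: (A, B) -> (A, B XOR A)."""
    return a, b ^ a

def ccn(a, b, c):
    """Controlled-controlled-NOT (Toffoli): (A, B, C) -> (A, B, C XOR (A AND B))."""
    return a, b, c ^ (a & b)

# Both gates are their own inverses, so the inputs can always be recovered from the outputs.
for a in (0, 1):
    for b in (0, 1):
        assert cn(*cn(a, b)) == (a, b)
        for c in (0, 1):
            assert ccn(*ccn(a, b, c)) == (a, b, c)

# With C = 0, the third output of CCN is A AND B, which is why CCN is complete.
print([(a, b, ccn(a, b, 0)[2]) for a in (0, 1) for b in (0, 1)])
```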
Next Week: Quantum Computing
- Reversible logic gates and quantum computing
  - Quantum versions of the CN and CCN gates
- Quantum superposition of states allows exponential speedup
  - Shor's fast algorithm for factoring and breaking the RSA cryptosystem
  - Grover's database search algorithm
- Physical substrates for quantum computing
Next Week
- Guest lecturer: Dan Simon, Microsoft Research
  - Introductory lecture on quantum computing and Shor's algorithm
  - Discussion and review afterwards
- Homework 4 due: submit code and results electronically by Thursday (let us know if you have problems meeting the deadline)
- Sign up for project and presentation times
  - Feel free to contact the instructor and TA if you want to discuss your project
- Have a great weekend!