Title: CSE 599 Lecture 7: Information Theory, Thermodynamics and Reversible Computing
- What have we done so far?
  - Theoretical computer science: Abstract models of computing - Turing machines, computability, time and space complexity
  - Physical instantiations
    - Digital computing: Silicon switches manipulate binary variables with near-zero error
    - DNA computing: Massive parallelism and the biochemical properties of organic molecules allow fast solutions to hard search problems
    - Neural computing: Distributed networks of neurons compute fast, parallel, adaptive, and fault-tolerant solutions to hard pattern recognition and motor control problems
Overview of Today's Lecture
- Information theory and Kolmogorov complexity
  - What is information?
  - Definition based on probability theory
  - Error-correcting codes and compression
  - An algorithmic definition of information (Kolmogorov complexity)
- Thermodynamics
  - The physics of computation
  - Relation to information theory
  - Energy requirements for computing
- Reversible computing
  - Computing without energy consumption?
  - Biological example
  - Reversible logic gates → Quantum computing (next week!)
Information and Algorithmic Complexity
- Three principal results
  - Shannon's source-coding theorem
    - The main theorem of information content
    - A measure of the number of bits needed to specify the expected outcome of an experiment
  - Shannon's noisy-channel coding theorem
    - Describes how much information we can transmit over a channel
    - A strict bound on information transfer
  - Kolmogorov complexity
    - Measures the algorithmic information content of a string
    - An uncomputable function
What is information?
- First try at a definition
  - Suppose you have stored n different bookmarks on your web browser.
  - What is the minimum number of bits you need to store these as binary numbers?
  - Let I be the minimum number of bits needed. Then 2^I ≥ n, so I ≥ log2 n
  - So, the information contained in your collection of n bookmarks is I0 = log2 n
Deterministic information I0
- Consider a set of alternatives X = {a1, a2, a3, ..., aK}
- When the outcome is a3, we say x = a3
- I0(X) is the amount of information needed to specify the outcome of X
  - I0(X) = log2 |X|
  - We will assume base 2 from now on (unless stated otherwise)
  - Units are bits (binary digits)
- Relationship between bits and binary digits
  - B = {0, 1}
  - X = B^M = set of all binary strings of length M
  - I0(X) = log |B^M| = log 2^M = M bits
Is this definition satisfactory?
- Appeal to your intuition
- Which of these two messages contains more information?
  - "Dog bites man"
  - or
  - "Man bites dog"
- Same number of bits to represent each message!
- But, it seems like the second message contains a lot more information than the first. Why?
Enter probability theory
- Surprising events (unexpected messages) contain more information than ordinary or expected events
  - "Dog bites man" occurs much more frequently than "Man bites dog"
  - Messages about less frequent events carry more information
  - So, information about an event varies inversely with the probability of that event
- But, we also want information to be additive
  - If message xy contains sub-parts x and y, we want I(xy) = I(x) + I(y)
  - Use the logarithm function: log(xy) = log(x) + log(y)
New Definition of Information
- Define the information contained in a message x in terms of the log of the inverse probability of that message
  - I(x) = log(1/P(x)) = -log P(x)
- First defined rigorously and studied by Shannon (1948)
  - "A mathematical theory of communication" - electronic handout (PDF file) on class website
- Our previous definition is a special case
  - Suppose you had n equally likely items (e.g. bookmarks)
  - For any item x, P(x) = 1/n
  - I(x) = log(1/P(x)) = log n
  - Same as before (minimum number of bits needed to store n items)
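A quick numeric illustration of this definition (a minimal Python sketch; the probabilities below are made-up values for illustration, not from the lecture):

```python
import math

def information_bits(p):
    """Information content of an event with probability p, in bits: I = log2(1/p)."""
    return math.log2(1.0 / p)

# n equally likely bookmarks: each one carries log2(n) bits, matching the earlier definition.
n = 1024
print(information_bits(1.0 / n))   # 10.0 bits

# A rarer message ("man bites dog") carries more information than a common one.
print(information_bits(0.1))       # ~3.32 bits
print(information_bits(0.0001))    # ~13.29 bits
```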
Review: Axioms of probability theory
- Kolmogorov, 1933
  - P(a) ≥ 0, where a is an event
  - P(Ω) = 1, where Ω is the certain event
  - P(a ∪ b) = P(a) + P(b), where a and b are mutually exclusive
- Kolmogorov (axiomatic) definition is computable
- Probability theory forms the basis for information theory
- Classical definition based on event frequencies (Bernoulli) is uncomputable

Review: Results from probability theory
- Joint probability of two events a and b: P(ab)
- Independence
  - Events a and b are independent iff P(ab) = P(a)P(b)
- Conditional probability: P(a|b) = probability that event a happens given that b has happened
  - P(a|b) = P(ab)/P(b)
  - P(b|a) = P(ab)/P(a) = P(a|b)P(b)/P(a)
  - We just proved Bayes' Theorem
- P(a) is called the a priori probability of a
- P(a|b) is called the a posteriori probability of a
Summary: Postulates of information theory
- 1. Information is defined in the context of a set of alternatives. The amount of information quantifies the number of bits needed to specify an outcome from the alternatives
- 2. The amount of information is independent of the semantics (it depends only on probability)
- 3. Information is always positive
- 4. Information is measured on a logarithmic scale
  - Probabilities are multiplicative, but information is additive
In-Class Example
- Message y contains duplicates: y = xx
- Message x has probability P(x)
- What is the information content of y?
- Is I(y) = 2 I(x)?
  - I(y) = log(1/P(xx)) = log[1/(P(x|x)P(x))]
        = log(1/P(x|x)) + log(1/P(x))
        = 0 + log(1/P(x))
        = I(x)
- Duplicates convey no additional information!
Definition: Entropy
- The average self-information or entropy of an ensemble X = {a1, a2, a3, ..., aK}:
  - H(X) = E[log(1/P(x))] = Σk P(ak) log(1/P(ak))
  - E = expected (or average) value
Properties of Entropy
- 0 ≤ H(X) ≤ I0(X)
  - Equals I0(X) = log |X| if all the ak are equally probable
  - Equals 0 if only one ak is possible
- Consider the case where K = 2
  - X = {a1, a2}
  - P(a1) = p, P(a2) = 1 - p
  - H(X) = p log(1/p) + (1 - p) log(1/(1 - p)), the binary entropy function, maximized (1 bit) at p = 1/2
Examples
- Entropy is a measure of the randomness of the source producing the events
- Example 1: Coin toss, heads or tails with equal probability
  - H = -(1/2 log 1/2 + 1/2 log 1/2) = -(1/2 (-1) + 1/2 (-1)) = 1 bit per coin toss
- Example 2: P(heads) = 3/4 and P(tails) = 1/4
  - H = -(3/4 log 3/4 + 1/4 log 1/4) ≈ 0.811 bits per coin toss
- As things get less random, entropy decreases
  - Redundancy and regularity increase
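The two coin-toss examples can be checked with a few lines of Python (a minimal sketch of the entropy formula above):

```python
import math

def entropy(probs):
    """Shannon entropy H = sum_k p_k * log2(1/p_k), in bits (terms with p = 0 contribute 0)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit per toss (fair coin)
print(entropy([0.75, 0.25]))  # ~0.811 bits per toss (biased coin)
print(entropy([1.0, 0.0]))    # 0 bits: a certain outcome carries no information
```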
Question
- If we have N different symbols, we can encode them in log(N) bits. Example: English - 26 letters → 5 bits
- So, over many, many messages, the average cost/symbol is still 5 bits.
- But, letters occur with very different probabilities! "A" and "E" are much more common than "X" and "Q". The log(N) estimate assumes equal probabilities.
- Question: Can we encode symbols based on probabilities so that the average cost/symbol is minimized?
Shannon's noiseless source-coding theorem
- Also called "the fundamental theorem". In words:
  - You can compress N independent, identically distributed (i.i.d.) random variables, each with entropy H, down to NH bits with negligible loss of information (as N → ∞)
  - If you compress them into fewer than NH bits, you will dramatically lose information
- The theorem
  - Let X be an ensemble with H(X) = H bits. Let Hδ(X) be the entropy of an encoding of X with allowable probability of error δ
  - Given any ε > 0 and 0 < δ < 1, there exists a positive integer N0 such that, for N > N0,
    H - ε < (1/N) Hδ(X^N) < H + ε
Comments on the theorem
- What do the two inequalities tell us?
- (1/N) Hδ(X^N) < H + ε: The number of bits that we need to specify outcomes x with vanishingly small error probability δ does not exceed N(H + ε)
  - If we accept a vanishingly small error, the number of bits we need to specify x drops to N(H + ε)
- (1/N) Hδ(X^N) > H - ε: The number of bits that we need to specify outcomes x, even with a large allowable error probability δ, is at least N(H - ε)
Source coding (data compression)
- Question: How do we compress the outcomes X^N?
  - With vanishingly small probability of error
  - How do we assign codewords to the elements of X such that the number of bits we need to encode X^N drops to N(H + ε)?
- Symbol coding: Given x = a3 a2 a7 a5 ...
  - Generate the codeword σ(x) = 01 1010 00 ...
  - Want: I0(σ(x)) ≈ H(X)
- Well-known coding examples
  - Zip, gzip, compress, etc.
  - The performance of these algorithms is, in general, poor when compared to the Shannon limit
Source-coding definitions
- A code is a function σ: X → B+
  - B = {0, 1}
  - B+ = the set of finite strings over B
    - B+ = {0, 1, 00, 01, 10, 11, 000, 001, ...}
  - σ(x) = σ(x1) σ(x2) σ(x3) ... σ(xN)
- A code is uniquely decodable (UD) iff
  - its extension to strings, σ: X+ → B+, is one-to-one
- A code is instantaneous iff
  - No codeword is the prefix of another
  - σ(x1) is not a prefix of σ(x2)
Huffman coding
- Given X = {a1, a2, ..., aK} with associated probabilities P(ak)
- Given a code with codeword lengths n1, n2, ..., nK
- The expected code length: ⟨L⟩ = Σk P(ak) nk
- No instantaneous, UD code can achieve a smaller ⟨L⟩ than a Huffman code
Constructing a Huffman code
- Feynman example: Encoding an alphabet
- Code is instantaneous and UD: 00100001101010 decodes to "ANOTHER"
- Code achieves close to the Shannon limit
  - H(X) = 2.06 bits ≤ ⟨L⟩ = 2.13 bits
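For reference, here is a minimal Python sketch of the standard greedy Huffman construction. The symbol probabilities below are made up for illustration and are not the ones from Feynman's example:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code for {symbol: probability}; returns {symbol: bitstring}.
    Standard greedy construction: repeatedly merge the two least probable subtrees,
    prepending '0' to the codewords of one side and '1' to the other."""
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)                       # tie-breaker so dicts are never compared
    while len(heap) > 1:
        p1, _, code1 = heapq.heappop(heap)
        p2, _, code2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code1.items()}
        merged.update({s: "1" + c for s, c in code2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

# Illustrative probabilities over the letters of "ANOTHER" (not the lecture's values).
probs = {"A": 0.30, "N": 0.20, "O": 0.15, "T": 0.15, "H": 0.10, "E": 0.06, "R": 0.04}
code = huffman_code(probs)
avg_len = sum(probs[s] * len(code[s]) for s in probs)
print(code)
print(avg_len)   # expected length is close to (and never below) the entropy H(X)
```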
Information channels
- [Figure: a channel with input ensemble X (symbols x) and output ensemble Y (symbols y)]
- H(X) = entropy of the input ensemble X
- I(X;Y) = what we know about X given Y
  - I(X;Y) is the average mutual information between X and Y
- Definition: Channel capacity
  - The information capacity of a channel is C = max I(X;Y)
- The channel may add noise
  - Corrupting our symbols
Example: Channel capacity
- Problem: A binary source sends equiprobable messages in a time T, using the alphabet {0, 1} with a symbol rate R. As a result of noise, a "0" may be mistaken for a "1", and a "1" for a "0", both with probability q. What is the channel capacity C?
- [Figure: binary symmetric channel from X to Y with crossover probability q]
- The channel is discrete and memoryless
Example: Channel capacity (cont.)
- Assume no noise (no errors)
  - T is the time to send the string, R is the rate
  - The number of possible message strings is 2^(RT)
  - The maximum entropy of the source is H0 = log(2^(RT)) = RT bits
  - The source rate is (1/T) H0 = R bits per second
- The entropy of the noise (per transmitted bit) is
  - Hn = q log(1/q) + (1 - q) log(1/(1 - q))
- The channel capacity C (bits/sec) = R - R Hn = R(1 - Hn)
- C is always less than R (a fixed fraction of R)!
- We must add code bits to correct the received message
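A small Python sketch of the binary-symmetric-channel capacity formula above (the rates and flip probabilities below are illustrative values, not from the lecture):

```python
import math

def binary_entropy(q):
    """Hn(q) = q*log2(1/q) + (1-q)*log2(1/(1-q)): the per-bit entropy of the noise."""
    if q in (0.0, 1.0):
        return 0.0
    return q * math.log2(1.0 / q) + (1 - q) * math.log2(1.0 / (1 - q))

def bsc_capacity(R, q):
    """Capacity (bits/sec) of a binary symmetric channel: C = R(1 - Hn(q))."""
    return R * (1.0 - binary_entropy(q))

print(bsc_capacity(1e6, 0.0))    # 1,000,000 bits/sec: a noiseless channel
print(bsc_capacity(1e6, 0.01))   # ~919,000 bits/sec: 1% flips cost ~8% of the rate
print(bsc_capacity(1e6, 0.5))    # 0: the output is independent of the input
```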
How many code bits must we add?
- We want to send a message string of length M
- We add code bits to M, thereby increasing its length to Mc
- How are M, Mc, and q related?
  - M = Mc(1 - Hn)
  - Intuitively, from our example
  - Also see pgs. 106-110 of Feynman
- Note: this is an asymptotic limit
  - May require a huge Mc
Shannon's Channel-Coding Theorem
- The theorem
  - There is a nonnegative channel capacity C associated with each discrete memoryless channel, with the following property: For any symbol rate R < C and any error rate ε > 0, there is a protocol that achieves a rate > R and a probability of error < ε
- In words
  - If the entropy of our symbol stream is equal to or less than the channel capacity, then there exists a coding technique that enables transmission over the channel with arbitrarily small error
  - We can transmit information at a rate H(X) < C
- Shannon's theorem tells us the asymptotically maximum rate
  - It does not tell us the code that we must use to obtain this rate
  - Achieving a high rate may require a prohibitively long code
Error-correction codes
- Error-correcting codes allow us to detect and correct errors in symbol streams
- Used in all signal communications (digital phones, etc.)
- Used in quantum computing to ameliorate the effects of decoherence
- Many techniques and algorithms
  - Block codes
  - Hamming codes
  - BCH codes
  - Reed-Solomon codes
  - Turbo codes
Hamming codes
- An example: Construct a code that corrects a single error
- We add m check bits to our message
  - Can encode at most (2^m - 1) error positions
  - Errors can occur in the message bits and/or in the check bits
  - If n is the length of the original message, then 2^m - 1 ≥ (n + m)
- Examples
  - If n = 11, m = 4: 2^4 - 1 = 15 ≥ (n + m) = 15
  - If n = 1013, m = 10: 2^10 - 1 = 1023 ≥ (n + m) = 1023
Hamming codes (cont.)
- Example: An 11/15 SEC Hamming code
- Idea: Calculate parity over subsets of input bits
  - Four subsets → four parity bits
  - Check bit x stores the parity of the bit positions whose binary representation holds a 1 in position x
    - Check bit c1: bits 1,3,5,7,9,11,13,15
    - Check bit c2: bits 2,3,6,7,10,11,14,15
    - Check bit c3: bits 4,5,6,7,12,13,14,15
    - Check bit c4: bits 8,9,10,11,12,13,14,15
- The parity-check bits are called a syndrome
  - The syndrome tells us the location of the error

Position in message:
  binary   decimal
  0001     1
  0010     2
  0011     3
  0100     4
  0101     5
  0110     6
  0111     7
  1000     8
  1001     9
  1010     10
  1011     11
  1100     12
  1101     13
  1110     14
  1111     15
Hamming codes (cont.)
- The check bits specify the error location
- Suppose the check bits turn out to be as follows
  - Check c1 = 1 (bits 1,3,5,7,9,11,13,15)
    - Error is in one of bits 1,3,5,7,9,11,13,15
  - Check c2 = 1 (bits 2,3,6,7,10,11,14,15)
    - Error is in one of bits 3,7,11,15
  - Check c3 = 0 (bits 4,5,6,7,12,13,14,15)
    - Error is in one of bits 3,11
  - Check c4 = 0 (bits 8,9,10,11,12,13,14,15)
    - So the error is in bit 3!!
Hamming codes (cont.)
- Example: Encode 10111011011
  - Code position: 15 14 13 12 11 10  9  8  7  6  5  4  3  2  1
  - Code symbol:    1  0  1  1  1  0  1 c4  1  0  1 c3  1 c2 c1
  - Codeword:       1  0  1  1  1  0  1  1  1  0  1  1  1  0  1
- Notice that we can generate the code bits on the fly!
- What if we receive 101100111011101?
  - c4 = 1 (the parity check over bits 8-15 fails)
  - c3 = 0 (the parity check over bits 4,5,6,7,12,13,14,15 passes)
  - c2 = 1 (the parity check over bits 2,3,6,7,10,11,14,15 fails)
  - c1 = 1 (the parity check over bits 1,3,5,7,9,11,13,15 fails)
  - The error is in location 1011 (binary) = bit 11
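The following Python sketch implements the 11/15 SEC Hamming code described above (check bits at positions 1, 2, 4, 8; position 15 written leftmost, as on the slide) and reproduces the worked example, including the flipped bit 11:

```python
def hamming15_encode(msg_bits):
    """Encode 11 message bits into a 15-bit single-error-correcting Hamming codeword.
    Positions 1..15; positions 1, 2, 4, 8 hold check bits, the rest hold data."""
    assert len(msg_bits) == 11
    code = [0] * 16                                            # index 1..15 used; index 0 unused
    data_positions = [p for p in range(15, 0, -1) if p not in (1, 2, 4, 8)]
    for pos, bit in zip(data_positions, msg_bits):             # message given with bit 15 first
        code[pos] = bit
    for c in (1, 2, 4, 8):                                     # each check covers positions p with p & c
        code[c] = sum(code[p] for p in range(1, 16) if p & c) % 2
    return code[1:]                                            # positions 1..15, ascending

def hamming15_syndrome(codeword):
    """Recompute the four parity checks; the syndrome is the error position (0 if none)."""
    code = [0] + list(codeword)
    syndrome = 0
    for c in (1, 2, 4, 8):
        if sum(code[p] for p in range(1, 16) if p & c) % 2:
            syndrome += c
    return syndrome

msg = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1]            # the slide's message, written bit 15 first
cw = hamming15_encode(msg)
print("".join(str(b) for b in reversed(cw)))        # 101110111011101 (position 15 leftmost)

received = list(cw)
received[11 - 1] ^= 1                               # flip bit 11, giving the slide's received word
print(hamming15_syndrome(received))                 # 11: the syndrome points at the flipped bit
```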
Kolmogorov Complexity (Algorithmic Information)
- Computers represent information as stored symbols
  - Not probabilistic (in the Shannon sense)
  - Can we quantify information from an algorithmic standpoint?
- The Kolmogorov complexity K(s) of a finite binary string s is the single natural number representing the minimum length (in bits) of a program p that generates s when run on a universal Turing machine U
- K(s) is the algorithmic information content of s
  - Quantifies the algorithmic randomness of the string
- K(s) is an uncomputable function
  - Similar argument to the halting problem
  - How do we know when we have the shortest program?
Kolmogorov Complexity: Example
- The randomness of a string is defined by the shortest algorithm that can print it out.
- Suppose you were given the binary string x
  - 111111...1111 (1000 1s)
- Instead of 1000 bits, you can compress this string to a few tens of bits, representing the length P of the program:
  - For i = 1 to 1000
      Print "1"
- So, K(x) ≤ P
- Possible project topic: Quantum Kolmogorov complexity?
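K(s) itself is uncomputable, but any off-the-shelf compressor gives an upper bound in the same spirit: the compressed length (plus the fixed length of the decompressor) is the length of one program that prints the string. A rough Python illustration only; zlib lengths are a crude stand-in for K, not its value:

```python
import os
import zlib

regular = b"1" * 1000          # highly regular: has a very short description
random_ish = os.urandom(1000)  # incompressible with overwhelming probability

print(len(zlib.compress(regular, 9)))     # a handful of bytes
print(len(zlib.compress(random_ish, 9)))  # slightly more than 1000 bytes: no short description found
```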
5-minute break
- Next: Thermodynamics and Reversible Computing
Thermodynamics and the Physics of Computation
- Physics imposes fundamental limitations on computing
  - Computers are physical machines
  - Computers manipulate physical quantities
  - Physical quantities represent information
- The limitations are both technological and theoretical
  - Physical limitations on what we can build
    - Example: Silicon-technology scaling
    - Major limiting factor in the future: power consumption
  - Theoretical limitations on the energy consumed during computation
    - Thermodynamics and computation
Principal Questions of Interest
- How much energy must we use to carry out a computation?
  - The theoretical minimum energy
- Is there a minimum energy for a certain rate of computation?
  - A relationship between computing speed and energy consumption
- What is the link between energy and information?
  - Between information-theoretic entropy and thermodynamic entropy
- Is there a physical definition of information content?
  - The information content of a message in physical units
Main Results
- Computation has no inherent thermodynamic cost
  - A reversible computation that proceeds at an infinitesimal rate consumes no energy
- Destroying information requires kT ln 2 joules per bit
  - Information-theoretic bits (not binary digits)
- Driving a computation forward requires kT ln(r) joules per step
  - r is the rate of going forward rather than backward
Basic thermodynamics
- First law: Conservation of energy
  - (heat put into system) + (work done on system) = increase in energy of the system
  - ΔQ + ΔW = ΔU
  - The total energy of the universe is constant
- Second law: It is not possible to have heat flow, by itself, from a colder region to a hotter region
  - Change in entropy: ΔS ≥ ΔQ/T
  - Equality holds only for reversible processes
  - The entropy of the universe is always increasing
Heat engines
- A basic heat engine takes in heat Q1 at temperature T1, does work W, and delivers heat Q2 = Q1 - W at temperature T2
  - T1 and T2 are temperatures, with T1 > T2
- Reversible heat engines are those that have
  - No friction
  - Infinitesimal heat gradients
- The Carnot cycle (motivation: the steam engine)
  - Reversible
  - Pumps heat ΔQ from T1 to T2
  - Does work W = ΔQ (T1 - T2)/T1
Heat engines (cont.)
The Second Law
- No engine that takes heat Q1 at T1 and delivers heat Q2 at T2 can do more work than a reversible engine
  - W = Q1 - Q2 ≤ Q1 (T1 - T2)/T1
- Heat will not, by itself, flow from a cold object to a hot object
Thermodynamic entropy
- If we add heat ΔQ reversibly to a system at fixed temperature T, the increase in entropy of the system is ΔS = ΔQ/T
- S is a measure of degrees of freedom
  - The probability of a configuration
  - The probability of a point in phase space
- In a reversible system, the total entropy is constant
- In an irreversible system, the total entropy always increases
Thermodynamic versus Information Entropy
- Assume a gas containing N atoms
  - Occupies a volume V1
  - Ideal gas: no attraction or repulsion between particles
- Now shrink the volume
  - Isothermally (at constant temperature; immerse in a bath)
  - Reversibly, with no friction
- How much work does this require?
Compressing the gas
- From mechanics
  - work = force × distance
  - force = pressure × (area of piston)
  - volume change = (area of piston) × distance
  - Solving: W = ∫ P dV (from V1 to V2)
- From gas theory
  - The ideal gas law: PV = NkT
    - N = number of molecules
    - k = Boltzmann's constant (in joules/kelvin)
  - Solving: W = ∫ (NkT/V) dV = NkT ln(V2/V1)
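A small numeric check of W = NkT ln(V2/V1) (a Python sketch; the volumes and temperature below are arbitrary illustrative values):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, joules per kelvin

def isothermal_work(N, T, V1, V2):
    """Work done BY N molecules of ideal gas compressed or expanded isothermally and
    reversibly from V1 to V2: W = NkT ln(V2/V1). Negative when V2 < V1 (we do work on it)."""
    return N * k_B * T * math.log(V2 / V1)

# Halving the volume of a single-molecule "gas" at room temperature:
print(isothermal_work(N=1, T=300.0, V1=2.0, V2=1.0))   # ≈ -2.87e-21 J = -kT ln 2
```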
A few notes
- W is negative because we are doing work on the gas (V2 < V1)
  - W would be positive if the gas did work for us
- Where did the work go?
  - Isothermal compression
  - The temperature is constant (same before and after)
  - First law: The work went into heating the bath
  - Second law: We decreased the entropy of the gas and increased the entropy of the bath
Free energy and entropy
- The total energy of the gas, U, remains unchanged
  - Same number of particles
  - Same temperature
- The free energy Fe and the entropy S both change
  - Both are related to the number of states (degrees of freedom)
  - Fe = U - TS
- For our experiment, the change in free energy equals the work done on the gas, and U remains unchanged
  - ΔFe is the (negative) heat siphoned off into the bath
Special Case: N = 1
- Imagine that our gas contains only one molecule
  - Take statistical averages of the same molecule over time, rather than over a population of particles
- Halve the volume
  - Fe increases by kT ln 2
  - S decreases by k ln 2
  - But U is constant
- What's going on?
  - Our knowledge of the possible locations of the particle has changed!
  - Fewer places that the molecule can be in, now that the volume has been halved
  - The entropy, a measure of the uncertainty of a configuration, has decreased
Thermodynamic entropy revisited
- Take the probability of a gas configuration to be P
  - Then S = k ln P
- Random configurations (molecules moving haphazardly) have large P and large S
- Ordered configurations (all molecules moving in one direction) have small P and small S
- The less we know about a gas
  - the more states it could be in
  - and the greater the entropy
- A clear analogy with information theory
The fuel value of knowledge
- Analysis is from Bennett: tape cells with particles coding "0" (left side) or "1" (right side)
- If we know the message on a tape
  - Then randomizing the tape can do useful work
  - Increasing the tape's entropy
- What is the fuel value of the tape (i.e., what is the fuel value of our knowledge)?
Bennett's idea
- The procedure
  - A tape cell comes in with a known particle location
  - Orient a piston depending on whether the cell is a "0" or a "1"
  - The particle pushes the piston outward
    - Increasing the entropy by k ln 2
    - Providing free energy of kT ln 2 joules per bit
  - The tape cell goes out with a randomized particle location
The energy value of knowledge
- Define the fuel value of a tape = (N - I) kT ln 2
  - N is the number of tape cells
  - I is the information (Shannon)
- Examples
  - A random tape (I = N) has no fuel value
  - A known tape (I = 0) has maximum fuel value
Feynman's tape-erasing machine
- Define the information in the tape to be the amount of free energy required to reset the tape
  - The energy required to compress each bit to a known state
- Only the "surprise" bits cost us energy
  - It doesn't take any energy to reset known bits
  - For known bits, just move the partition (without changing the volume)
- Cost to erase the tape: I kT ln 2 joules
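A quick calculation of the erasure cost I kT ln 2 at room temperature (a Python sketch; the bit counts are illustrative):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K

def erasure_cost(I_bits, T=300.0):
    """Minimum free energy needed to erase I information-theoretic (surprise) bits
    at temperature T: I * kT ln 2. Known, predictable bits cost nothing to reset."""
    return I_bits * k_B * T * math.log(2)

print(erasure_cost(1))     # ≈ 2.87e-21 J to erase one surprise bit at room temperature
print(erasure_cost(1e9))   # ≈ 2.87e-12 J to erase a gigabit of incompressible data
```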
Reversible Computing
- A reversible computation that proceeds at an infinitesimal rate and destroys no information consumes no energy
  - Regardless of the complexity of the computation
- The only cost is in resetting the machine at the end
  - Erasing information costs energy
- Reversible computers are like heat engines
  - If we run a reversible heat engine at an infinitesimal pace, it consumes no energy other than the work that it does
Energy cost versus speed
- We want our computations to run in finite time
- We need to drive the computation forward
  - This dissipates energy (kinetic, thermal, etc.)
- Assume we are driving the computation forward at a rate r
  - The computation is r times as likely to go forward as to go backward
- What is the minimum energy per computational step?
Energy-driven computation
- Computation is a transition between states
- State transitions have an associated energy diagram
  - Assume the forward state E2 has a lower energy than the backward state E1
  - A is the activation energy for a state transition
- Thermal fluctuations cause the computer to move between states
  - Whenever the energy exceeds A
- We also used this model in neural networks (e.g. Hopfield networks)
State transitions
- The probability of a transition between states differing in positive energy ΔE is proportional to exp(-ΔE/kT)
- Our state transitions have unequal probabilities
  - The energy required for a forward step is (A - E1)
  - The energy required for a backward step is (A - E2)
Driving computation by energy differences
- The (reaction) rate r depends only on the energy difference between successive states
- The bigger (E1 - E2), the more likely the state transitions, and the faster the computation
- Energy expended per step: E1 - E2 = kT ln r
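A small Python sketch of the kT ln r cost per step (the drive ratios r below are illustrative):

```python
import math

k_B = 1.380649e-23   # Boltzmann's constant, J/K

def drive_energy_per_step(r, T=300.0):
    """Energy dissipated per computational step to make forward steps r times
    more likely than backward steps: E1 - E2 = kT ln r."""
    return k_B * T * math.log(r)

for r in (2, 10, 1e6):
    print(r, drive_energy_per_step(r))   # grows only logarithmically with the drive ratio r
```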
Driving computation by state availability
- We can drive a computation even if the forward and backward states have the same energy
  - As long as there are more forward states than backward states
- The computation proceeds by diffusion
  - More likely to move into a state with greater availability
- Thermodynamic entropy drives the computation
Rate-Driven Reversible Computing: A Biological Example
- Protein synthesis is an example...
  - of (nearly) reversible computation
  - of the "copy" computation
  - of a computation driven forward by thermodynamic entropy
- Protein synthesis is a 2-stage process
  - 1. DNA forms mRNA
  - 2. mRNA forms a protein
- We will consider step 1
DNA
- DNA comprises a double-stranded helix
- Each strand comprises alternating phosphate and sugar groups
- One of four bases attaches to each sugar
  - Adenine (A)
  - Thymine (T)
  - Cytosine (C)
  - Guanine (G)
- A (base + sugar + phosphate) group is called a nucleotide
- DNA provides a template for protein synthesis
  - The sequence of nucleotides forms a code
RNA polymerase
- RNA polymerase attaches itself to a DNA strand
  - Moves along, building an mRNA strand one base at a time
  - RNA polymerase catalyzes the copying reaction
- Within the nucleus there is DNA, RNA polymerase, and triphosphates (nucleotides with 2 extra phosphates), plus other stuff
- The triphosphates are
  - adenosine triphosphate (ATP)
  - cytosine triphosphate (CTP)
  - guanine triphosphate (GTP)
  - uracil triphosphate (UTP)
mRNA
- The mRNA strand is complementary to the DNA
- The matching pairs are
  - DNA → RNA
  - A → U
  - T → A
  - C → G
  - G → C
- As each nucleotide is added, two phosphates are released
  - Bound as a pyrophosphate
The process
RNA polymerase is a catalyst
- Catalysts influence the rate of a biochemical reaction
  - But not the direction
- Chemical reactions are reversible
  - RNA polymerase can unmake an mRNA strand
  - Just as easily as it can make one
  - Grab a pyrophosphate, attach it to a base, and release
- The direction of the reaction depends on the relative concentrations of the pyrophosphates and triphosphates
  - More triphosphates than pyrophosphates → make RNA
  - More pyrophosphates than triphosphates → unmake RNA
DNA, entropy, and states
- The relative concentrations of pyrophosphate and triphosphate define the number of available states
- Cells hydrolyze pyrophosphate to keep the reactions going forward
- How much energy does a cell use to drive this reaction?
  - Energy = kT ln r = (S2 - S1) T ≈ 100 kT per bit
Efficiency of a representation
- Cells create protein engines (mRNA) for 100 kT/bit
- 0.03 µm transistors consume 100 kT per switching event
- Think of representational efficiency
  - What does each system get for 100 kT?
- Digital logic uses an impoverished representation
  - ~10^4 switching events to perform an 8-bit multiply
  - Semiconductor scaling doesn't improve the representation
  - We pay a huge thermodynamic cost to use discrete math
Example 2: Computing using Reversible Logic Gates
- Two reversible gates: the controlled-NOT (CN) and the controlled-controlled-NOT (CCN)

CCN gate: (A, B, C) → (A', B', C') with A' = A, B' = B, C' = C ⊕ (A·B)

  A B C | A' B' C'
  0 0 0 | 0  0  0
  0 0 1 | 0  0  1
  0 1 0 | 0  1  0
  0 1 1 | 0  1  1
  1 0 0 | 1  0  0
  1 0 1 | 1  0  1
  1 1 0 | 1  1  1
  1 1 1 | 1  1  0

CN gate: (A, B) → (A', B') with A' = A, B' = B ⊕ A

  A B | A' B'
  0 0 | 0  0
  0 1 | 0  1
  1 0 | 1  1
  1 1 | 1  0

- CCN is complete: we can form any Boolean function using only CCN gates, e.g., AND if C = 0 (then C' = A·B)
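A minimal Python sketch of the two gates, checking that each is its own inverse (so no information is destroyed) and that CCN with C = 0 computes AND:

```python
def cn(a, b):
    """Controlled-NOT: (A, B) -> (A, B XOR A)."""
    return a, b ^ a

def ccn(a, b, c):
    """Controlled-controlled-NOT (Toffoli): (A, B, C) -> (A, B, C XOR (A AND B))."""
    return a, b, c ^ (a & b)

# Both gates are their own inverses, so the inputs can always be recovered from the outputs.
for a in (0, 1):
    for b in (0, 1):
        assert cn(*cn(a, b)) == (a, b)
        for c in (0, 1):
            assert ccn(*ccn(a, b, c)) == (a, b, c)

# With C = 0, the third output of CCN is A AND B, which is why CCN is complete.
print([(a, b, ccn(a, b, 0)[2]) for a in (0, 1) for b in (0, 1)])
```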
Next Week: Quantum Computing
- Reversible logic gates and quantum computing
  - Quantum versions of the CN and CCN gates
- Quantum superposition of states allows exponential speedup
  - Shor's fast algorithm for factoring and breaking the RSA cryptosystem
  - Grover's database search algorithm
- Physical substrates for quantum computing
Next Week
- Guest lecturer: Dan Simon, Microsoft Research
  - Introductory lecture on quantum computing and Shor's algorithm
  - Discussion and review afterwards
- Homework 4 due: submit code and results electronically by Thursday (let us know if you have problems meeting the deadline)
- Sign up for project and presentation times
  - Feel free to contact the instructor and TA if you want to discuss your project
- Have a great weekend!