Title: Coding for DNA Computing: Combinatorial and Biophysical Aspects
1 Coding for DNA Computing: Combinatorial and Biophysical Aspects
- Olgica Milenkovic
- University of Colorado, Boulder
- A Joint Work with Navin Kashyap
- Queen's University, Kingston
2 LDPC ITERATIVE DECODING
3 Outline
- The DNA Computing Paradigm
- Applications
- Error-Control Coding for DNA Computing
- Constrained Coding: DNA Secondary and Tertiary Structure
- Statistical Mechanics of DNA/RNA Folding
- Results and Open Problems
4 Molecular Biology Terminology
- DNA: Double Helix
- Watson-Crick Complements: A → T, G → C, T → A, C → G
- RNA: Single-Stranded, T Replaced by U
- Helix Denaturation (Ambient Temperature Governed)
- DNA Oligonucleotide Sequences
- DNA Hybridization
- DNA Enzymes: Functional Proteins Operating on DNA
5 DNA Computing: Adleman's Experiment (1994)
The Problem: An Unremarkable Instance of the Directed Traveling Salesman Problem on a Graph with Seven Nodes (Figures from Adleman, Scientific American, 1998)
The Method: A Remarkable Oligonucleotide DNA Hybridization Technique
City Codewords: Miami (CTACGG), NY (ATGCCG)
Route (Edge): Second Half of the Codeword for Miami (CGG) Followed by the First Half of the Codeword for NY (ATG), Giving CGGATG
Take the Complement of this Word: GCCTAC
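The route-encoding step above can be sketched in a few lines. This is only an illustration of the construction described on the slide; the function names `edge_word` and `complement` are hypothetical and not taken from Adleman's work.

```python
# Watson-Crick complement of each base.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def edge_word(src: str, dst: str) -> str:
    """Second half of the source codeword followed by the first half of the destination codeword."""
    return src[len(src) // 2:] + dst[:len(dst) // 2]

def complement(word: str) -> str:
    """Base-by-base Watson-Crick complement (no reversal), as on the slide."""
    return "".join(COMPLEMENT[base] for base in word)

miami, ny = "CTACGG", "ATGCCG"
route = edge_word(miami, ny)
print(route, complement(route))  # CGGATG GCCTAC
```

The printed pair matches the route word and its complement given on the slide.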
6 DNA Computing: The Benefits
- Not a von Neumann Architecture: Stochastic Mechanism with Massive Parallelism (1/50th of a Teaspoon, $10^{14}$ Paths in 1 s)
- Extremely Low Power Consumption: 1 Joule for $2 \times 10^{19}$ Operations
- Storage Capacity: Vol(1 g of DNA) = 1 cm$^3$, Information = 1 Trillion CDs
- 18 Mb/inch of Length (0.35 nm Between Base Pairs)
- Versatility of Applications, Only Plausible Option in Many Cases
- Drawbacks: First Implementations not Interactive
- 3-Day Processing Delay
- VERY LOW RELIABILITY OF COMPUTATION
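The linear storage density quoted above can be checked with a short calculation. This is a rough sketch that assumes 2 bits per base pair (four possible bases) and reads "Mb" as megabytes; neither assumption is stated on the slide.

```python
NM_PER_INCH = 2.54e7        # 1 inch = 25.4 mm = 2.54e7 nm
SPACING_NM = 0.35           # distance between base pairs, from the slide
BITS_PER_BASE_PAIR = 2      # assumption: log2(4) bits for the alphabet {A, C, G, T}

base_pairs_per_inch = NM_PER_INCH / SPACING_NM
megabytes_per_inch = base_pairs_per_inch * BITS_PER_BASE_PAIR / 8 / 1e6

print(f"{base_pairs_per_inch:.3g} base pairs per inch")
print(f"{megabytes_per_inch:.1f} megabytes per inch")
```

Under these assumptions the result rounds to the 18 Mb/inch figure.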
7 Applications of DNA Computers
- Combinatorial Problems
- Directed Traveling Salesman (Adleman, 1994)
- 3-SAT (Braich et al., 2002)
- Input: a 20-Variable, 24-Clause Boolean Function in 3-Conjunctive Normal Form (3-CNF)
- For each Variable, two Length-15 DNA Sequences are Assigned, one Representing the Variable, the other Representing its Complement
- Operon Technology, Alameda, CA
- Integrated DNA Technologies, Skokie, IL
- Non-Attacking Knights (Faulhammer, 2000)
- Configurations of Knights that can be Placed on an n × n Chess Board so that no Knight is Attacking any other Knight on the Board
8 Novel Designs of DNA Computers
- DNA Logic and Automata: Interactive Systems
- DNA Transistors (Stojanovic, Stefanovic, 2003)
- DNA Game-Playing Machines (Stojanovic, Stefanovic, 2003)
- MAYA Consists of Nine Wells (Tubes) Representing the 3x3 Tic-Tac-Toe Board
- Tubes Contain Mixtures of Enzymes: a Network of 23 Molecular Logic Gates
- The Human Player has Nine Different DNA Strands, each Specific to one Square on the Board
- The Player Selects one Square to Play; the DNA Strand Representing that Square is Added to all Nine Wells
- MAYA Analyzes the Play Through Biochemical Reactions Occurring in the Wells
9 Applications of DNA Computers
- Meet MAYA (Stojanovic, Stefanovic, 2003)
Figure: http://www.cs.unm.edu/bandrews/ttt-applet/
10 Applications of DNA Computers
- The Killer Application: SMART DRUGS
- E. Shapiro et al. (Weizmann Institute, Israel), Nature, Science, 2003
- Quintana et al., 2002
- In Vitro DNA-Based Computer Programmed to Diagnose Cancer and Order Self-Destruction of Cells
- Identifies RNA Cancer Fingerprint Molecules
- Cancer Leaves its own Chemical Fingerprint in the Body, Including Over-Producing or Under-Producing Specific RNA Sequences
- (Analysis Based on Regulatory Networks of Gene Interactions, Shmulevich et al., 2002)
- (Milenkovic and Vasic, DIMACS 2004, ITW 2004)
- Software: DNA; Hardware: DNA Enzymes
- Responds Appropriately by Releasing a Short, Active DNA Strand
- Interferes with Tumors by Suppressing Key Cancer Genes, Making Diseased Cells Self-Destruct
- Experiments: Prostate and Lung Cancer Cells
11 Applications of DNA Computers
- Sensing, Storing, Nano-Scale Mechanics
- Biosensing: DNA Fingerprinting of Bacteria/Viruses, Roco et al., 2004
- DNA-Based Storage Systems: Mansuripur et al., DIMACS 2004
- Nucleic Acid Nanostructures and Topology, DNA Self-Assembly, DNA Nanoscale Mechanical Devices: Seeman et al., 1998-2002
RELIABILITY ISSUES FOR ALL DESCRIBED SYSTEMS UNRESOLVED
Error-Control Coding, Constrained Coding, Graph Theory/Combinatorics/Pseudo-Knot Theory, Statistical Mechanics
12 The Biggest Obstacles
- DNA Oligonucleotide Secondary and Tertiary Structure Formation
- Unwanted Hybridization
DNA Oligonucleotide Sequences are Chemically Active and Tend to Assume the Thermodynamically Most Stable Form! DNA Sequences can Bind to Partially Complementary Sequences as Well!
13 DNA/RNA Secondary and Tertiary Structure
Secondary Structure
Pseudoknots (Tertiary Structure)
Figures: Mneimneh, 2003 (Web Lecture Notes)
14 DNA Hairpins
- DNA/RNA Hairpin Structures Participate in Important Biological Functions
- Regulation of Gene Expression (Zazopoulos et al., 1997)
- DNA Recombination (Froelich-Ammon et al., 1994)
- Facilitation of Mutagenic Events (Trinh and Sinden, 1993): in a Living Cell, after Breaking of Intermolecular Pairing in Double-Helix DNA, Loose Strands Form a DNA Hairpin
- Potential Antisense Drug (Tang et al., 1993): Injecting into a Living Cell a Hairpin with Nucleic Acid Bases Complementary to an mRNA of a Disease Gene Blocks its Expression
15 DNA/RNA Knots
RNA Secondary Structure Influences the Function of RNA. Knots are Special Regulators.
Figures: Haslinger, 2001; Craven, 2001
16 Mathematical Formulation
Definition 1 (Haslinger, 2001): A Secondary Structure S is a Vertex-Labeled Graph on n Vertices, for which the Adjacency Matrix A has the following properties
An Edge (i,j) with $|i-j| > 1$ is Called a Base-Pairing. A Secondary Structure Can Consist of the Following Structural Elements:
- A Stack Consists of Subsequent Base Pairs $(p-k, q+k), (p-k+1, q+k-1), \ldots, (p, q)$; k is the Length of the Stack
- A Loop Consists of all Unpaired Vertices which are Immediately Interior to some Terminal Base Pair
- An External Vertex is an Unpaired Vertex which does not Belong to a Loop
17 Mathematical Formulation
- If Definition 1, Part 3 is Violated for a Base Pairing, then the Resulting Formation is Referred to as a Pseudoknot
- With Information about the Energy of Pairings and Additional Measurements Regarding the DNA Backbone, Determining Stable Secondary Structures Becomes a Purely Combinatorial Problem
- Secondary Structure Prediction: Dynamic Programming Approach, Polynomial Time (Nussinov's and Zuker's Algorithms)
- Pseudoknots: NP-Complete, Except for a Special Class of H-Knots (Rivas, Eddy, 2003)
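A minimal sketch of the pseudoknot test follows. It assumes that Part 3 of Definition 1 is the usual nesting condition: for base pairs (i, j) and (k, l) with i < k < j, the index l must also satisfy i < l < j. The function name `has_pseudoknot` is hypothetical.

```python
from itertools import combinations

def has_pseudoknot(pairs):
    """Return True if two base pairs (i, j) and (k, l) cross, i.e. i < k < j < l.

    Assumption: each position takes part in at most one base pair.
    """
    normalized = sorted(tuple(sorted(pair)) for pair in pairs)
    for (i, j), (k, l) in combinations(normalized, 2):
        if i < k < j < l:
            return True
    return False

nested = [(1, 9), (2, 8), (4, 7)]    # a simple stem with a hairpin loop
crossing = [(1, 5), (3, 8)]          # pair (3, 8) starts inside (1, 5) and ends outside
print(has_pseudoknot(nested), has_pseudoknot(crossing))  # False True
```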
18 Nussinov's Folding Algorithm
Free Energy of Secondary Structure S
Free Energy of Secondary Structure Limited to Positions $i, i+1, \ldots, j$
Figures: Mneimneh, 2003; Bundschuh, 2004
Feynman Diagrams for RNA Structure Prediction (Eddy, Rivas, 2001)
Free Energy Table: Sequence CCCAAATGG
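A Nussinov-style dynamic program can be sketched as follows. The energy model is a deliberately simplified assumption (not the model used for the slides): every Watson-Crick pair contributes -1, and two positions may pair only if |i - j| > 1. The function name `nussinov_table` and its parameters are hypothetical.

```python
PAIRS = {("A", "T"), ("T", "A"), ("C", "G"), ("G", "C")}

def nussinov_table(seq, pair_energy=-1, min_sep=2):
    """E[i][j] = minimum free energy of the substring seq[i..j] (0-based, inclusive)."""
    n = len(seq)
    E = [[0] * n for _ in range(n)]
    for span in range(min_sep, n):
        for i in range(n - span):
            j = i + span
            best = E[i][j - 1]                      # case 1: position j stays unpaired
            for k in range(i, j - min_sep + 1):     # case 2: position j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = E[i][k - 1] if k > i else 0
                    best = min(best, left + E[k + 1][j - 1] + pair_energy)
            E[i][j] = best
    return E

for word in ("CCCAAATGG", "GACAAAGGT"):
    print(word, nussinov_table(word)[0][-1])
```

The triple loop takes $O(n^3)$ time for a sequence of length n. Under this toy model the full-sequence energies are -3 and -2, which agree with the top-right entries of the two example tables shown later in the deck; other entries depend on details of the energy model and need not match.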
19 Statistical Physics: DNA Ensemble Analysis
Bundschuh, Hwa, 2004: Statistics of Secondary Structures in an Ensemble of Long Random DNA Sequences
Why? Detection of Important Structural Components in mRNAs and Functional RNAs; Characterization of the Response of a Long Oligonucleotide DNA Molecule to Pulling Forces
Random DNA: a Problem of Disordered Systems (Bundschuh, Hwa, 2004)
20 Statistical Physics: DNA Ensemble Analysis
- Molten Phase: Absence of Disorder
Thermodynamic Ensemble: Large Number of Different Secondary Structures with Equal Energy
Stability of the Molten Phase: Use the N-Replica Method
21 Statistical Physics: DNA Ensemble Analysis
- Glassy Phase: Few Low-Energy Configurations in the Thermodynamic Limit
- Droplet Theory (Huse and Fisher): Large-Scale Low-Energy Excitations About the Ground State
- Impose a Deformation over a Length Scale $L \gg 1$, Monitor the Minimal Free Energy Cost of the Deformation
- Cost Expected to Scale as $L^w$ for Large L. Positive w Indicates that the Deformation Cost Grows with Increasing Size. Negative w Indicates that the Deformation Cost Decays: there is a Large Number of Configurations with Low Overlap with the Ground State, whose Energies are Similar to the Ground State Energy in the Thermodynamic Limit (Zero-Temperature Behavior not Stable to Thermal Fluctuations; No Thermodynamic Glass Phase can Exist at any Finite Temperature)
- Related Analysis: A. Pagnani, G. Parisi, and F. Ricci-Tersenghi, 2000/2001
22 The Stability of a Particular Secondary Structure is a Function of Several Constraints:
1) Number of GC versus AT/GT Base Pairs (a Larger Number of Hydrogen Bonds Forms more Stable Structures)
2) Number of Base Pairs Forming a Stem Region (Presence of a Long Subsequence and its Reverse Complement Leads to Stabilization)
3) Number of Base Pairs in a Hairpin (More than 15 or Fewer than 4-7 Bases put Stress on the Loop)
4) Number of Unpaired Bases (More Unpaired Bases Lead to a Less Stable Structure)
23 Hybridization Constraints
- Individual Sequence Constraints (Wood, Tsaftaris, etc.)
IP1) The consecutive-bases constraint. Long Runs of the Same Base are Forbidden.
IP2) The constant GC-content constraint. Introduced to Achieve Parallelized Operations on DNA Sequences; Assures Similar Thermodynamic (Melting Temperature) Characteristics of all Codewords. GC-Content is Usually in the Range of 30-50% of the Code Length.
- Joint Sequence Constraints
JP1) The Hamming distance constraint. Limits Unwanted Hybridizations between Codewords. The Requirement is that all Distinct Pairs of Codewords p, q in C be at Hamming Distance at Least $d_{\min}$. To Limit Undesired Hybridization between a Codeword and the Reverse-Complement of any other Codeword (including itself), the Reverse-Complement Hamming Distance has to be at Least $d^{RC}_{\min}$.
JP2) The frame-shift constraint. Applies Only to a Limited Number of Problems. Refers to the Requirement that a Concatenation of Two or More Codewords should not Properly Contain Another Codeword.
JP3) The forbidden subsequence constraint. Specifies that a Class of Substrings Must not Occur in any Codeword or Concatenation of Codewords.
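Constraints IP1, IP2 and JP1 are easy to test by brute force on a small code. The sketch below is illustrative only; the function names, the parameter names and the two-word toy code are assumptions made for this example.

```python
from itertools import combinations

COMP = str.maketrans("ACGT", "TGCA")

def reverse_complement(word):
    return word.translate(COMP)[::-1]

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def gc_content(word):
    return sum(base in "GC" for base in word)

def longest_run(word):
    """Length of the longest run of identical consecutive bases (word must be non-empty)."""
    best = run = 1
    for prev, cur in zip(word, word[1:]):
        run = run + 1 if cur == prev else 1
        best = max(best, run)
    return best

def check_code(code, d_min, d_rc_min, max_run, gc):
    ip1 = all(longest_run(w) <= max_run for w in code)               # consecutive bases
    ip2 = all(gc_content(w) == gc for w in code)                     # constant GC content
    jp1 = all(hamming(p, q) >= d_min for p, q in combinations(code, 2))
    jp1_rc = all(hamming(p, reverse_complement(q)) >= d_rc_min       # includes p == q
                 for p in code for q in code)
    return ip1 and ip2 and jp1 and jp1_rc

toy_code = ["AACG", "CATC"]
print(check_code(toy_code, d_min=3, d_rc_min=2, max_run=2, gc=2))  # True
print(check_code(toy_code, d_min=3, d_rc_min=3, max_run=2, gc=2))  # False
```

The exhaustive pairwise check costs $O(M^2 n)$ for M codewords of length n, which is fine for verification but says nothing about how to construct large codes.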
24 Code Construction
PRIOR WORK: Addressed 1/2/3 Requirements. No Families of Codes Given (Length Limited to 20). No Attempt Whatsoever to Consider Secondary Structure Constraints.
References: Condon et al., 2000-2004; King, 2003; Ryakov, 2003; Gaborit and King, 2004; Ghrayeb et al., 2004
- Approach I: Binary Mapping
- Approach II: Extended, Cyclic Goppa Codes over GF(4)
- Approach III: Hadamard Matrices with Cyclic Core
- WHY Cyclic? Will Show that the Computational Complexity of Nussinov's Algorithm is Significantly Reduced in this Case
25 Terminology
DNA Code C: Set of Codewords over Alphabet Q
Minimum Hamming, Reverse and Reverse-Complement Hamming Distance
Constant GC-Content Code
26 Binary Mapping Approach
Example: q = ACGTCC, b(q) = 001011011010, e(q) = 011011, o(q) = 001100
Code D [n,k,d], Contains the All-Ones Word
Construction: DNA Code; Number of Codewords; Length 2n; Hamming and Reverse-Complement Hamming Distance at Least d
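The example above is reproduced by the following sketch. The base-to-bits table is an assumption inferred from the example (A → 00, T → 01, C → 10, G → 11); with it, e(q) and o(q) are the even-indexed and odd-indexed subsequences of b(q).

```python
# Assumed mapping, chosen because it reproduces b(q), e(q) and o(q) from the slide.
BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}

def b(q):
    """Binary image of a DNA word: two bits per base."""
    return "".join(BITS[base] for base in q)

def e(q):
    """Even subsequence b0 b2 b4 ... of the binary image."""
    return b(q)[0::2]

def o(q):
    """Odd subsequence b1 b3 b5 ... of the binary image."""
    return b(q)[1::2]

q = "ACGTCC"
print(b(q), e(q), o(q))  # 001011011010 011011 001100
```

Under this mapping the first bit of every base is 1 exactly for G and C, so the weight of e(q) equals the GC content of q (4 for ACGTCC).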
27 Longest Length Codes
Bounds on (Based on Bounds by Ashikhmin et al., 2005)
Binary Mapping: Subcodes of Simplex Codes (All-Zero Word Not Allowed) -- EVEN
Special Subset of Codewords from Melas/Zetterberg Codes -- ODD
28 Extended Cyclic Goppa Codes
- Approach
- Take a Family of Reversible ( ) Cyclic Codes
- Eliminate all Self-Reversible Codewords
- From Each Remaining Pair Retain Exactly One Codeword
- Complement the Second Half of Each Codeword
Let for q a
Power of a Prime and Let g(z) be a
Polynomial of Degree over
such that g(z) has no Root in . The
Goppa Code, , consists of all words
such that
is a code of length n, dimension
and minimum distance .
Zhang et. al., 1988
29 DNA Codes and Goppa Codes
A Reversible Cyclic Code of Dimension k over
GF(q) contains self-reversible
Codewords.
For arbitrary positive integers a,m, there exist
DNA Codes D such that
having the following properties
Choose a Constant GC-Content Subset of Codewords
Example: CGTTC, CAAAT, CTCCA, GCCTT, GGAGA, ACTAA
30 Complex (Generalized) Hadamard Matrices
Matrix of Dimension n × n over the Set of m-th Roots of Unity
With property
Exponent Matrix over
Theorem (Heng et al., 2002): Let $N = p^k - 1$ for p Prime and a Positive Integer k. Let $g(x) = c_0 + c_1 x + c_2 x^2 + \cdots + c_{N-k} x^{N-k}$ be a Monic Polynomial over $Z_p$, of Degree $N-k$, such that $g(x) h(x) = x^N - 1$ over $Z_p$, for some monic irreducible polynomial $h(x)$ in $Z_p[x]$. Suppose that the vector $(0, c_0, c_1, c_2, \ldots, c_{N-k})$, with $c_i = 0$ for $N-k < i < N$, has the property that it contains each element of $Z_p$ the same number of times. Then the N cyclic shifts of the vector $(c_0, c_1, c_2, \ldots, c_{N-k})$ form the core of the exponent matrix of some Hadamard matrix $H(p^k, C_p)$.
Choose p = 3, and Use only One of G/C
For any , there exist DNA codes D with codewords of length , with constant GC-content equal to and
Each Codeword of such a Code is a Cyclic Shift of a Fixed Generator Codeword g.
31 Hadamard and Vienna
Vienna Package, T = 37°C: http://www.tbi.univie.ac.at/ivo/RNA/
Based on Nussinov's Algorithm; Gives one Minimum Free Energy Secondary Structure
MFOLD (Zuker et al., 2000)
32 Why Cyclic Codes?
Let a DNA Code Consist of the Cyclic Shifts of a Codeword. Provided that the free-energy table of this codeword is known, the free-energy tables of all other codewords can be computed with a total of $O(n^3)$ operations only. More precisely, the free-energy table of a cyclically shifted codeword can be obtained from the table of the previous shift in $O(n^2)$ steps.
33
  C C C A A A T G G
C 0 0 0 0 0 0 -1 -2 -3
C 0 0 0 0 0 0 -1 -2 -2
C 0 0 0 0 0 0 -1 -2 -2
A 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 -1 -1 -1
T 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
G A C A A A G G T
G 0 0 -1 -1 -1 -1 -1 -1 -2
A 0 0 0 0 0 0 -1 -1 -2
C 0 0 0 0 0 0 -1 -1 -1
A 0 0 0 0 0 0 0 0 -1
A 0 0 0 0 0 0 0 0 -1
A 0 0 0 0 0 0 0 0 -1
G 0 0 0 0 0 0 0 0 0
G 0 0 0 0 0 0 0 0 0
T 0 0 0 0 0 0 0 0 0
d_WC(CCCAAATGG, GCCCAAATG) = 7
d_WC(GACAAAGGT, TGACAAAGG) = 9
d_WC(CCCAAATGG, GGCCCAAAT) = 6
d_WC(GACAAAGGT, GTGACAAAG) = 7
T1 Free Energy: -0.24 kcal/mol
T2 Free Energy: -0.19 kcal/mol
Energies Obtained from the Vienna RNA Folding Package (I. Hofacker)
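The four Watson-Crick distances listed above can be reproduced with a short sketch. The definition used here is an assumption inferred from the numbers, not stated in the slides: d_WC(p, q) is taken to be the Hamming distance between p and the base-by-base complement of q, and in each listed pair q is a cyclic shift of p.

```python
COMP = str.maketrans("ACGT", "TGCA")

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

def d_wc(p, q):
    """Assumed definition: Hamming distance between p and the complement of q (no reversal)."""
    return hamming(p, q.translate(COMP))

def cyclic_shift(word, k):
    """Cyclic right shift by k positions (0 < k < len(word))."""
    return word[-k:] + word[:-k]

for word in ("CCCAAATGG", "GACAAAGGT"):
    for k in (1, 2):
        shifted = cyclic_shift(word, k)
        print(word, shifted, d_wc(word, shifted))
```

A small d_WC between a word and one of its shifts means many positions i for which base i is complementary to base i + k, which is exactly the situation that favors folding.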
34 Why Binary Mapping?
1 1 1 0 0 0 0 1 1
1 0 -1 -1 -1 -2 -2 -3 -4 -4
1 0 0 -1 -1 -2 -2 -3 -3 -4
1 0 0 0 0 -1 -1 -2 -3 -3
0 0 0 0 0 -1 -1 -2 -2 -3
0 0 0 0 0 0 -1 -1 -1 -2
0 0 0 0 0 0 0 -1 -1 -2
0 0 0 0 0 0 0 0 0 -1
1 0 0 0 0 0 0 0 0 -1
1 0 0 0 0 0 0 0 0 0
C C G G C T A A A
35
  1 0 1 0 1 0 1 1 0
1 0 0 -1 -1 -2 -2 -3 -3 -4
0 0 0 0 -1 -1 -2 -2 -3 -3
1 0 0 0 0 -1 -1 -2 -2 -3
0 0 0 0 0 0 -1 -1 -2 -2
1 0 0 0 0 0 0 -1 -1 -1
0 0 0 0 0 0 0 0 -1 -2
1 0 0 0 0 0 0 0 -1 -1
1 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0
1 1 0 0 1 0 1 0 0
1 0 -1 -1 -2 -2 -2 -3 -3 -4
1 0 0 0 -1 -2 -2 -2 -3 -3
0 0 0 0 -1 -1 -1 -2 -2 -3
0 0 0 0 0 0 -1 -1 -2 -2
1 0 0 0 0 0 0 -1 -1 -2
0 0 0 0 0 0 0 0 -1 -1
1 0 0 0 0 0 0 0 0 -1
0 0 0 0 0 0 0 0 0 -1
0 0 0 0 0 0 0 0 0 0
What Type of Sequences Minimize the Entry $E_{1,n}$? Cyclic Shifts with a Minimized Set $\{ i : WC(C_i) = C_{i+k} \}$, $k = 1, 2, \ldots, m$
36 The Cyclic Distance (Binary Case)
Sequence Weight w: n/2 for n even, (n-1)/2 for n odd
Achieved by Maximum Length Shift Register (MLSR) Sequences (Pseudo-Random Sequences in General)
What are the Reversal Distance Properties of MLSR Sequences?
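A small MLSR example illustrates the behavior referred to above. The sketch assumes that "cyclic distance" means the smallest Hamming distance between a sequence and any of its nontrivial cyclic shifts, and it uses the recurrence s[t+3] = s[t+1] + s[t] (mod 2), which corresponds to the primitive polynomial x^3 + x + 1; both the definition and the function names are assumptions made for this illustration.

```python
def m_sequence(taps, state):
    """One period (2^m - 1 bits) of s[t+m] = sum of s[t+i] for i in taps, mod 2."""
    m = len(state)
    seq = list(state)
    for t in range(2 ** m - 1 - m):
        seq.append(sum(seq[t + i] for i in taps) % 2)
    return seq

def cyclic_distances(seq):
    """Hamming distance between seq and each of its nontrivial cyclic shifts."""
    n = len(seq)
    return [sum(a != b for a, b in zip(seq, seq[k:] + seq[:k])) for k in range(1, n)]

seq = m_sequence(taps=(0, 1), state=[0, 0, 1])
print(seq, sum(seq), cyclic_distances(seq))
```

For this length-7 sequence every nontrivial shift is at distance 4 = (n+1)/2, a consequence of the shift-and-add property of MLSR sequences. The sequence itself has weight (n+1)/2; its bitwise complement has weight (n-1)/2 and the same shift distances.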
37 The Watson-Crick Distance
- Watson-Crick Distance: Plotkin-Type Bound
38 The Free Energy of a DNA Strand $(c_1, c_2, \ldots, c_n)$ can be Approximated According to Breslauer's Formula
Much more Accurate
39 Other Coding Problems
- Generalized de Bruijn Sequences
- Association Schemes for Hamming/RC Hamming/Constant GC Content
- Binary Mapping Approach with Runlength Constraints
- Forbidden Pattern Constraints (Enumeration Techniques by Goulden and Jackson)
- Catalan Numbers
- $b = 1$, $CN(1) = 1$: ( )
- $b = 2$, $CN(2) = 2$: ( ) ( ), ( ( ) )
- $b = 3$, $CN(3) = 5$: ( ) ( ) ( ), ( ( ) ( ) ), ( ( ) ) ( ), ( ) ( ( ) ), ( ( ( ) ) )
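The Catalan counts above can be checked by enumerating balanced parenthesis strings, each of which can be read as a pseudoknot-free pattern of b base pairs. This is a minimal sketch; the function names are chosen for illustration.

```python
from math import comb

def catalan(b):
    """b-th Catalan number, C(2b, b) / (b + 1)."""
    return comb(2 * b, b) // (b + 1)

def balanced(b):
    """All well-formed strings with b pairs of parentheses."""
    if b == 0:
        return [""]
    out = []
    for k in range(b):                      # k pairs nested inside the first pair
        for inner in balanced(k):
            for rest in balanced(b - 1 - k):
                out.append("(" + inner + ")" + rest)
    return out

for b in (1, 2, 3):
    print(b, catalan(b), balanced(b))
```

The enumeration yields 1, 2 and 5 strings for b = 1, 2, 3, matching CN(1), CN(2) and CN(3).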