Title: Vladimir V. Ufimtsev
1 GENERATION OF DNA CODES
Vladimir V. Ufimtsev
Adviser: Dr. V. Rykov
2 Historical Background
1948
A Mathematical Theory of Communication
C.E. Shannon
Main result: Entropy function, the average value of
information obtained from a channel.
1950
Error Detecting and Error Correcting Codes
R.W. Hamming
Main result: Matrices that can be used to encode
messages and provide more reliable transmission
across a channel.
1953
A Structure for Deoxyribose Nucleic Acid
J. D. Watson, F. H. C. Crick, M. H. F. Wilkins, R. E. Franklin
Main result: Structure found for the building
block of life.
1959
There's Plenty of Room at the Bottom
R.P. Feynman
Main result: Anticipated science at the nanoscale
($10^{-9}$ meters).
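3 Basic Coding Theory
Let $A_q^n$ denote the set consisting of all vectors
(codewords) of length n built over the alphabet
$A_q = \{0, 1, \dots, q-1\}$, i.e.
$A_q^n = \{x = (x_1, \dots, x_n) : x_i \in A_q\}$.
Let $d : A_q^n \times A_q^n \to \mathbb{R}$ be such that, for all x, y, z:
1) $d(x, y) \ge 0$, with equality if and only if $x = y$
2) $d(x, y) = d(y, x)$
3) $d(x, z) \le d(x, y) + d(y, z)$
Let $C \subseteq A_q^n$ be such that $|C| = M$ and
$d(x, y) \ge d$ for all distinct $x, y \in C$.
C is referred to as a code of length n, size M, and
minimum distance d.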
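The three parameters can be read off a small code directly. The following is a minimal sketch, assuming the Hamming distance as the metric and a made-up four-word code over {A, C}; the names `hamming`, `code_parameters` and `example_code` are illustrative and do not come from the slides.

```python
from itertools import combinations


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def code_parameters(code, dist=hamming):
    """Return (n, M, d): length, size and minimum pairwise distance of a code.

    The code must contain at least two words, all of the same length.
    """
    words = list(code)
    n = len(words[0])
    M = len(words)
    d = min(dist(x, y) for x, y in combinations(words, 2))
    return n, M, d


# Hypothetical example code, chosen only for illustration.
example_code = ["AAAA", "ACCA", "CACC", "CCAC"]
print(code_parameters(example_code))  # (4, 4, 2)
```

Any other metric can be passed as `dist`, as long as it satisfies the three conditions listed above.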
4 Spheres
A sphere in the space of all vectors of length n, centered at x having radius d:
$S_d(x) = \{y : d(x, y) \le d\}$
Volume of the sphere around x, of radius d: $|S_d(x)|$
Spaces
A space is HOMOGENEOUS when the volume of a
sphere does not depend on where it is centered,
i.e. $|S_d(x)| = |S_d(y)|$ for all x and y.
A space is NON-HOMOGENEOUS when the volume of a
sphere does depend on where it is centered.
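A brute-force count over a tiny space makes the difference concrete. This is a sketch under two assumptions: a sphere contains every word at distance at most the radius, and the non-homogeneous example uses n minus the length of the longest common subsequence as an insertion-deletion type distance. All function names are illustrative.

```python
from itertools import product


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def lcs_length(x, y):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel(x, y):
    """Assumed insertion-deletion type distance for equal-length words."""
    return len(x) - lcs_length(x, y)


def sphere_volumes(alphabet, n, radius, dist):
    """Map every center x to the number of words within `radius` of x."""
    words = ["".join(w) for w in product(alphabet, repeat=n)]
    return {x: sum(dist(x, y) <= radius for y in words) for x in words}


ham = sphere_volumes("01", 3, 1, hamming)
ind = sphere_volumes("01", 3, 1, indel)
print(set(ham.values()))       # {4}: the same volume at every center
print(ind["000"], ind["010"])  # 4 7: the volume depends on the center
```

Binary words of length 3 are used only to keep the enumeration small; the same functions accept "ACGT" as the alphabet.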
5 The Main Coding Theory Problem
For any code there are 3 conflicting parameters:
Length n
Size M
Minimum distance d
The aim of coding theory is: given any 2
parameters, find the optimal value for the 3rd.
We need small n for fast transmission, large M
so that as much information as possible can be encoded,
and large d so that we can detect and correct
many errors.
6 Bounds in Coding Theory
Exact formulas for sphere volumes and code sizes
are often extremely difficult to obtain. In
most cases only upper and lower bounds can be
obtained for these parameters. We will be working
in a NON-HOMOGENEOUS space, which makes exact
formulas for sphere volumes and code
sizes VERY HARD to obtain.
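Hamming Upper Bound on Code Size in the space of
q-ary words of length n, with any metric:
$M \le \dfrac{q^n}{\min_x |S_t(x)|}$, where $t = \lfloor (d-1)/2 \rfloor$ and $S_t(x)$ is the sphere of radius t centered at x.
Varshamov-Gilbert Lower Bound on Code Size in
the same space, with any metric: there exists a code of size
$M \ge \dfrac{q^n}{\bar V_{d-1}}$, where $\bar V_{d-1}$ is the average volume of a sphere of radius $d - 1$.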
7 Turan's Theorem
- Let G be a simple graph on N vertices and e
edges. G contains an M-clique if
$e > \left(1 - \dfrac{1}{M-1}\right) \dfrac{N^2}{2}$
[Figure: cliques]
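8 From Turan to Varshamov-Gilbert
Take the graph whose vertices are the $N = q^n$ words of length n over a q-letter alphabet, with an edge between x and y whenever $d(x, y) \ge d$. An M-clique in this graph is a code of size M and minimum distance d. Writing $\bar V_{d-1}$ for the average number of words within distance $d - 1$ of a word (the word itself included), the graph has $e = N (N - \bar V_{d-1}) / 2$ edges.
If
$\dfrac{N (N - \bar V_{d-1})}{2} > \left(1 - \dfrac{1}{M-1}\right) \dfrac{N^2}{2}$, i.e. $M - 1 < \dfrac{N}{\bar V_{d-1}}$,
Then there exists a code of size M.
9
Let
$M = \lceil N / \bar V_{d-1} \rceil$
Then
$M - 1 < N / \bar V_{d-1}$
Hence there exists a code of size M and so
the optimal code size is at least $N / \bar V_{d-1}$.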
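The argument can be checked numerically on a space small enough to enumerate. The sketch below assumes the graph has an edge between two words whenever their distance is at least d, and that a sphere of radius d - 1 includes its center; it uses the Hamming distance on binary words of length 6 only because the numbers are easy to verify by hand. The function names are illustrative.

```python
from itertools import product
from math import ceil


def hamming(x, y):
    """Number of positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(x, y))


def average_sphere_volume(words, dist, radius):
    """Average over all centers of the number of words within `radius`."""
    total = sum(dist(x, y) <= radius for x in words for y in words)
    return total / len(words)


def gv_lower_bound(words, dist, d):
    """Return (N / average volume, M) with M = ceil(N / average volume)."""
    N = len(words)
    v_bar = average_sphere_volume(words, dist, d - 1)
    return N / v_bar, ceil(N / v_bar)


def turan_condition_holds(words, dist, d, M):
    """Check e > (1 - 1/(M-1)) * N^2 / 2 for the distance-at-least-d graph."""
    N = len(words)
    if M <= 1:
        return True  # a single word is always a code of size 1
    edges = sum(
        dist(x, y) >= d
        for i, x in enumerate(words)
        for y in words[i + 1:]
    )
    return edges > (1 - 1 / (M - 1)) * N * N / 2


words = ["".join(w) for w in product("01", repeat=6)]
bound, M = gv_lower_bound(words, hamming, 3)
print(bound, M)                                     # 64/22 = 2.909..., M = 3
print(turan_condition_holds(words, hamming, 3, M))  # True
```

Here every sphere of radius 2 has volume 1 + 6 + 15 = 22, the graph has 64 * 42 / 2 = 1344 edges, and the Turan threshold for M = 3 is 1024, so a code of size 3 with minimum distance 3 is guaranteed.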
10 DNA Structure
- The rules of base pairing (nucleotide pairing):
- A - T: adenine (A) always pairs with thymine (T)
- C - G: cytosine (C) always pairs with guanine (G)
11 Watson-Crick complements
- Each base has a bonding surface
- Bonding surface of A is complementary to that of
T (2 bonds)
- Bonding surface of G is complementary to that of
C (3 bonds)
- Hybridization is a process that joins two
complementary, opposite-polarity single strands
into a double strand through hydrogen bonds.
12 Orientation of single DNA strands is important
for hybridization.
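The pairing rules and the role of orientation can be illustrated with a reverse-complement function. This is a minimal sketch, assuming both strands are written in the same reading direction (conventionally 5' to 3'), so that the partner of a strand is its complement read backwards. The names `COMPLEMENT`, `reverse_complement` and `is_watson_crick_pair` are illustrative.

```python
# Watson-Crick pairing: A with T, C with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(strand):
    """Complement every base and reverse the reading direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def is_watson_crick_pair(x, y):
    """True if y is the perfect opposite-polarity complement of x."""
    return y == reverse_complement(x)


print(reverse_complement("AACG"))            # CGTT
print(is_watson_crick_pair("AACG", "CGTT"))  # True
print(is_watson_crick_pair("AACG", "TTGC"))  # False: complement without reversal
```

The last line shows why orientation matters: complementing the bases without reversing the strand does not give a partner in this convention.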
13 Types of Hybridization
Direct
Shifted
Folded
Loop
14 DNA Computing
- Interest in DNA computing was sparked in
1994 by Len Adleman.
Adleman showed how DNA molecules can be used to
solve a mathematical problem (the Hamiltonian path
problem).
DNA computing relies on the fact that DNA strands
can be represented as sequences of bases (4-ary
sequences) and on the property of hybridization.
In hybridization, errors can occur. Thus,
error-correcting codes are required for efficient
synthesis of DNA strands to be used in computing.
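15 Similarity
$z = (z_1, \dots, z_k)$ is a subsequence of $x = (x_1, \dots, x_n)$
if and only if there exists a strictly increasing
sequence of indices $1 \le i_1 < i_2 < \dots < i_k \le n$
such that $z_j = x_{i_j}$ for $j = 1, \dots, k$.
$L(x, y)$ is defined to be the set of longest common
subsequences of x and y.
$LCS(x, y)$ is defined to be the length of the longest common
subsequence of x and y.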
16 Example of LCS
- X = (A T C T G A T)
- Z = (T C G T) is a subsequence of X
- X = (A T C T G A T)
- Y = (T G C A T A)
- (T C A T) ∈ L(X, Y)
- LCS(X, Y) = 4
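The example can be reproduced with the standard dynamic-programming table for the longest common subsequence. This is a sketch with illustrative function names; `one_lcs` returns a single longest common subsequence, which need not be the one shown on the slide because several of length 4 exist.

```python
def is_subsequence(z, x):
    """True if z can be obtained from x by deleting symbols."""
    remaining = iter(x)
    return all(symbol in remaining for symbol in z)


def lcs_table(x, y):
    """t[i][j] = length of the longest common subsequence of x[:i] and y[:j]."""
    t = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, a in enumerate(x, start=1):
        for j, b in enumerate(y, start=1):
            if a == b:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t


def lcs_length(x, y):
    return lcs_table(x, y)[-1][-1]


def one_lcs(x, y):
    """Trace back through the table to recover one longest common subsequence."""
    t = lcs_table(x, y)
    i, j = len(x), len(y)
    out = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif t[i - 1][j] >= t[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


X, Y = "ATCTGAT", "TGCATA"
print(is_subsequence("TCGT", X))  # True
print(lcs_length(X, Y))           # 4
print(one_lcs(X, Y))              # one common subsequence of length 4
```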
17 Insertion-Deletion Metric
Original Insertion-Deletion metric (Levenshtein
1966)
This metric is based on the number of deletions
and insertions that need to be made to obtain y
from x.
For vectors x and y that have the same length n, the number
of deletions that will be made is $n - LCS(x, y)$;
likewise, the number of insertions that will be
made is $n - LCS(x, y)$.
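The counts above translate directly into code. The sketch below assumes the distance between two words of length n is taken as n minus the LCS length, which agrees with the Length, LCS and Min dist. columns of the insertion-deletion table in this deck (for example 10, 8 and 2); the original slide formula is not recoverable, so treat the definition and the function names as assumptions.

```python
def lcs_length(x, y):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """Assumed distance: number of deletions needed to turn x into y."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return len(x) - lcs_length(x, y)


def edit_counts(x, y):
    """Return (deletions, insertions); the two are equal for equal lengths."""
    d = indel_distance(x, y)
    return d, d


print(indel_distance("ACGT", "AGCT"))  # 1: delete C, then re-insert it after G
print(edit_counts("ACGT", "AGCT"))     # (1, 1)
```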
18 Longest Common Stacked Pair Subsequence
A common subsequence $z = (z_1, \dots, z_k)$ is called a common
stacked pair subsequence of length k
between x and y if any two elements
$z_i$, $z_{i+1}$ are consecutive in x and consecutive in y,
or, if they are non-consecutive in x and/or
non-consecutive in y, then $z_{i+1}$ and $z_{i+2}$
are consecutive in x and y.
Let $S(x, y)$ denote the length of the longest
sequence occurring as a common stacked pair
subsequence z between sequences x
and y. The number $S(x, y)$ is called a
similarity of blocks between x and y. The metric
is defined to be $d(x, y) = n - S(x, y)$ for x and y of length n.
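One way to compute a similarity of blocks is sketched below. It rests on a particular reading of the definition that the slide does not confirm: the common subsequence is split into blocks of at least two symbols, and each block is contiguous in both x and y. The function names and the distance (length minus similarity) are assumptions made for illustration.

```python
def block_similarity(x, y):
    """Longest common subsequence built from blocks of length >= 2.

    Each block must appear as a contiguous run in both x and y.
    """
    n, m = len(x), len(y)
    # run[i][j]: length of the longest common run ending at x[i-1] and y[j-1]
    run = [[0] * (m + 1) for _ in range(n + 1)]
    # best[i][j]: best total block length using only x[:i] and y[:j]
    best = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                run[i][j] = run[i - 1][j - 1] + 1
            candidate = max(best[i - 1][j], best[i][j - 1])
            # Try ending a block of every admissible length k at (i, j).
            for k in range(2, run[i][j] + 1):
                candidate = max(candidate, best[i - k][j - k] + k)
            best[i][j] = candidate
    return best[n][m]


def block_distance(x, y):
    """Assumed distance for equal-length words: n minus the block similarity."""
    return len(x) - block_similarity(x, y)


print(block_similarity("ACGT", "ACGT"))  # 4
print(block_similarity("ACTG", "ACGT"))  # 2: only the block AC is shared
print(block_similarity("ACAC", "CACA"))  # 3: the block ACA (or CAC)
print(block_distance("ACTG", "ACGT"))    # 2
```

Isolated matching symbols contribute nothing under this reading, which is what separates it from the ordinary longest common subsequence.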
19 Stacked Pair Metric Bounds
The upper bound for the average sphere volume in
this metric will be
The Varshamov-Gilbert bound becomes
20 Insertion-deletion stacked pair thermodynamic metric
Thermodynamic weight of virtual stacked pairs.
     A     C     G     T
A  1.00  1.44  1.28  0.88
C  1.45  1.84  2.17  1.28
G  1.30  2.24  1.84  1.44
T  0.58  1.30  1.45  1.00
- Can use statistical estimation of sphere volume.
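A statistical estimate of the average sphere volume can be sketched as a Monte Carlo experiment. For two uniformly random words X and Y, the average sphere volume of radius r equals the size of the space times the probability that the distance between X and Y is at most r, so sampling random pairs estimates it. The sketch below uses n minus the LCS length as a stand-in distance, not the thermodynamic metric from the table; the function names, sample size and seed are arbitrary assumptions.

```python
import random


def lcs_length(x, y):
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, start=1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(x, y):
    """Stand-in distance for equal-length words (an assumption of this sketch)."""
    return len(x) - lcs_length(x, y)


def estimate_average_sphere_volume(n, radius, dist, alphabet="ACGT",
                                   samples=2000, seed=0):
    """Estimate the average number of words within `radius` of a random center."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(samples):
        x = "".join(rng.choice(alphabet) for _ in range(n))
        y = "".join(rng.choice(alphabet) for _ in range(n))
        hits += dist(x, y) <= radius
    fraction = hits / samples
    return fraction * len(alphabet) ** n


n, d = 10, 4
v_bar = estimate_average_sphere_volume(n, d - 1, indel_distance)
print(v_bar)
if v_bar > 0:
    # Estimated lower bound on code size: space size / average sphere volume.
    print(4 ** n / v_bar)
```

When close pairs are rare the estimate is noisy or even zero, so larger sample sizes are needed for longer words and small radii.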
21 Sense of Direction
- There are many possibilities for metrics on the
space of DNA sequences.
- All discussed metrics are non-homogeneous, i.e.
the sizes of the spheres in the metric spaces
depend on the location of their centers.
- A universal method that allows us to
calculate lower bounds for optimal code sizes was
given.
22 Bounds for Stacked Pair Metric
Minimum distance d = 6
Length (n)   Min. size
15 8
16 15
17 28
18 53
19 107
20 223
21 479
22 1055
23 2386
24 5524
25 13068
26 31545
27 77600
28 1943016
29 494758
30 1279652
23 Bounds for Stacked Pair Metric
Minimum distance d = 7
Length (n)   Min. size
15 2
16 3
17 5
18 8
19 13
20 24
21 46
22 90
23 183
24 381
25 815
26 1783
27 3988
28 9102
29 21174
30 50155
24 Bounds for Stacked Pair Metric
Minimum distance d = 8
Length (n)   Min. size
20 4
21 7
22 12
23 21
24 39
25 75
26 149
27 304
28 635
29 1354
30 2946
25 Bounds for Stacked Pair Metric
Minimum distance d = 9
Length (n)   Min. size
20 1
21 2
22 2
23 4
24 6
25 10
26 18
27 33
28 62
29 121
30 243
26 Bounds for Stacked Pair Metric
Minimum distance d = 10
Length (n)   Min. size
25 2
26 3
27 5
28 8
29 15
30 27
27 Levenshtein's Bounds for Insertion-Deletion Metric
Length  LCS  Min dist.  Size  V-G bound
10      8    2          -     4365
14      12   2          -     580715
12      8    4          482   25
14      10   4          2683  151
16      12   4          -     1042
18      14   4          -     7989
20      16   4          -     66413
22      18   4          -     588872
24      20   4          -     5504930
14      8    6          66    1
16      10   6          204   3
18      12   6          767   13
20      14   6          2843  65
22      16   6          -     364
24      18   6          -     2279
16      8    8          28    1
18      10   8          50    1
20      12   8          122   1
22      14   8          345   2
24      16   8          1084  7
22      12   10         45    1
24      14   10         86    1