Title: The Information Processing Mechanism of DNA and Efficient DNA Storage
1 The Information Processing Mechanism of DNA
and Efficient DNA Storage
- Olgica Milenkovic
- University of Colorado, Boulder
- Joint work with B. Vasic
2 Outline
- PART I: HOW DOES DNA ENSURE ITS DATA INTEGRITY?
- Information Theory of Genetics: an emerging discipline
- Error-Correction and Proofreading in genetic processes
- What type of codes operate at the level of bio-chemical processes of the Central Dogma?
- Spin Glasses, Kauffman's NK Model, Regulatory Network of Gene Interactions and Low-Density Parity-Check (LDPC) Codes
- Cancer, dysfunctional proofreading and chaos theory
- PART II: HOW DOES ONE STORE DNA? (DNA COMPRESSION)
- Structure of DNA: Statistics and Modeling
- DNA Compression
- Genome Compression
- New Distance Measures and One-Way Communication
- PART III: NEW CODING PROBLEMS IN GENETICS
3 Information theory of genetics
- 2003: 50th anniversary of the discovery that DNA has a double-helix structure!
- (Crick, Watson, Franklin, Wilkins, 1953)
- 2003: Completion of the Human Genome Project (98% of human DNA sequenced)
- Every day an average of 15 new sequences added to Swiss-Prot/GenBank
- Vast amount of genetic data just starting to be analyzed!
- DNA is a CODE, but very little is known about its
- exact information content
- nature of redundancy
- statistical properties
- secondary structure
- influence on disease development and control
- underlying error-correcting mechanism
4 Information Theory of DNA
Helps in understanding the EVOLUTION of DNA, the FUNCTIONALITY of DNA, and DISEASE DEVELOPMENT
IT community still not involved in this area!
Signal Processing community is just getting involved: Special Issue of Signal Processing Journal devoted to Genetics, 2003.
5The League of Extraordinary Gentlemen
6 I. How is information stored in a genetic sequence? What are the atoms of information?
7 The DNA Polymer
[Figure: the SUGAR-PHOSPHATE BACKBONE. Deoxyribose sugars (carbons numbered 1 to 5, with OH and CH2OH groups) alternate with phosphate groups (PO4).]
8 The Bases
DOUBLE-HELIX
Purine Bases: Adenine (A), Guanine (G)
Pyrimidine Bases: Thymine (T), Cytosine (C)
9The Pairing Rule
A and T paired through TWO hydrogen bonds
G and C paired through THREE hydrogen bonds
10 instead of DNA's thymine, i.e. U replaces T. It is the RNA sequence of codes which biologists usually refer to as the genetic code (see Table 4 below).
The Genetic Code
(rows: first letter; columns: second letter; last column: third letter)

First   U         C         A          G          Third
U       UUU phe   UCU ser   UAU tyr    UGU cys    U
U       UUC phe   UCC ser   UAC tyr    UGC cys    C
U       UUA leu   UCA ser   UAA stop   UGA stop   A
U       UUG leu   UCG ser   UAG stop   UGG trp    G
C       CUU leu   CCU pro   CAU his    CGU arg    U
C       CUC leu   CCC pro   CAC his    CGC arg    C
C       CUA leu   CCA pro   CAA gln    CGA arg    A
C       CUG leu   CCG pro   CAG gln    CGG arg    G
A       AUU ile   ACU thr   AAU asn    AGU ser    U
A       AUC ile   ACC thr   AAC asn    AGC ser    C
A       AUA ile   ACA thr   AAA lys    AGA arg    A
A       AUG met   ACG thr   AAG lys    AGG arg    G
G       GUU val   GCU ala   GAU asp    GGU gly    U
G       GUC val   GCC ala   GAC asp    GGC gly    C
G       GUA val   GCA ala   GAA glu    GGA gly    A
G       GUG val   GCG ala   GAG glu    GGG gly    G

Abbreviations
ala alanine, arg arginine, asn asparagine, asp aspartic acid, cys cysteine
gln glutamine, glu glutamic acid, gly glycine, his histidine, ile isoleucine
leu leucine, lys lysine, met methionine, phe phenylalanine, pro proline
ser serine, thr threonine, trp tryptophan, tyr tyrosine, val valine
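As a rough illustration of how the codon table is used, the sketch below builds the standard genetic code as a Python dictionary and translates a short mRNA string. The one-letter amino acid symbols, the names `CODON_TABLE` and `translate`, and the example sequence are assumptions made for this sketch; the slide itself uses three-letter abbreviations.

```python
BASES = "UCAG"
# One-letter amino acid codes for the standard genetic code, listed in the
# order first letter, second letter, third letter (each running over U, C, A, G).
# "*" marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(mrna):
    """Translate an mRNA string codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[pos:pos + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)

# Hypothetical example: AUG (met), UUU (phe), GGC (gly), UAA (stop)
example = translate("AUGUUUGGCUAA")
```

The mapping is many-to-one (64 codons, 20 amino acids plus stop), which is one concrete place where the redundancy discussed in these slides shows up.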
In summary all life as we know it contains DNA
and its close relative RNA. These polymers
11 Genes, Exons, Introns (Junk DNA)
- Genes: sequences of base pairs coding for chains of amino acids
- Consist of exons (coding) and introns (non-coding) regions
- Length: anything between several tens up to several million base pairs
- EXAMPLE: Among the most complex identified genes is DYSTROPHIN
- (2 million bps, more than 60 exons, codes for 4000 amino acids)
- Escherichia coli: around 4000 genes; Humans: 35000-40000 genes
- Junk DNA: disrespectful name for introns
- Significant fraction of DNA
- Shown (last year) to be somewhat responsible for RNA coding
- (Far from being junk, but function still not well understood)
12The Central Dogma
DNA
mRNA
Proteins
Replication Transcription
Translation
A Communication Theory Perspective
Genetic Channel
DNA sequence mRNA Proteins
DNA sequence
What kind of errors are introduced by the Genetic
Channel?
13 Processing in the Genetic Channel: DNA REPLICATION
- DNA within Chromosomes (tight packing)
- DNA wrapped around HISTONES (proteins)
- HISTONES are organized in NUCLEOSOMES
- NUCLEOSOMES form CHROMATIN, folded in CHROMOSOMES
Untying the knots: Topoisomerases
Unwinding the helix: Helicases
Getting it all started: Primers
Doing the hard work: Polymerases
Sealing the segments: Ligases
Helping to keep two sides apart: SSB
14 Replication: more details
Timing for replication: E. coli 40 min; Humans (parallel) less than 2 hours. Can be prolonged for proofreading purposes.
Rules:
Replication always proceeds in the 5' to 3' direction
Replication is semi-conservative
Replication is a parallel process for eukaryotes
Facts: Polymerases can stitch together any combination of bases (polymerases are a little bit sloppy)
15 Errors
- Combination of substitution, deletion, insertion (replication fork), shift, reversal, etc. errors
- (Complete exon or intron deleted, or simple base pair deletions)
- 1. Tautomeric shifts (transition/transversion): T-G, G-T, C-A, A-C
- 2. Recombination between non-identical molecules (HETERODUPLEX mismatches)
- 3. Spontaneous DEAMINATION (C to U, C to T, C-G to T-A), METHYLATION (CpG), rare
- 4. APURINIC/APYRIMIDINIC SITES (due to HYDROLYSIS)
- 5. CROSS-LINKS
- 6. STRAND BREAKAGE, OXIDATIVE DAMAGE ERRORS
- 7. LOSS OF 5000-10000 PURINE and 200-500 PYRIMIDINE bases (20 hours) due to radiation
- Replication Errors: polymerase misinsertion probability between 10^-3 and 10^-5
Miscoding A-G-A-T-G C-T-G-C-T-A-C
Slippage A-A-T-G
C-G-T-T A-C T
Slippage-Dislocation G-A-A-T-G
C-G -T -T-T-A-C
Miscoding - Realignment
A-G-A-T-G C-T C-T-A-C
G
16 Bio-chemical mechanism responsible for error correction?
- Proofreading (Maroni, Molecular and Genetic Analysis of Human Traits)
- Replication polymerases error rate human DNA with bps, total of 10^6 errors
- Example: C to U conversion causes presence of deoxyuridine, detected by uracil-DNA GLYCOSYLASE
- Glycosylase process acts like an erasure channel
- 1. Proofreading based on semi-conservative nature of replication
- 2. Excision Repair Mechanisms: Arrays of Exonucleases
- Show large degree of pre-correction binding activity; correction performed by EXCISION
- Jumping occurs between different genes!!! (Linn, Lloyd, Roberts, Nucleases)
- Reduce error levels by an additional several orders of magnitude
- Mismatch-specific post-replication enzymes
- Total number of errors per human DNA replication: on average JUST ONE
- Replication and Repair have been optimized for balancing spontaneous mutational load
- Permitting evolution without threatening fitness or survival
17 Characteristics of DNA ECC
- Error correction performed on different levels
- Error correction performed in very short time
- Extremely large number of very diverse errors corrected
- Error correction tied to global structure of DNA (not to consecutive base pairs)
- Error correction also depends on DNA topology
18 Identify ECCs of DNA
- Error-Correcting Codes in DNA: Forsdyke (1981), Wolny (1983), Eigen (1993), Liebovitch et al. (1996), Battail (1997), Rosen and Moore (2003), Mac Dónaill (2003)
- Theories:
- Non-coding regions are in-series error detecting sequences!
- Ordering of coding/non-coding regions responsible for error-correction!
- Complementary base pairing corresponds to error-detecting code!
- Acceptor/donor hydrogen atom/lone electrons: 1 represents donor, 0 acceptor. Additionally, add 0 or 1 for purine and pyrimidine.
Code: A 1010, G 0110, T 0101, C 1001
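The error-detecting reading of this four-bit code can be checked directly. The sketch below is an illustration only (the helper names are hypothetical): it verifies that all four codewords listed on the slide have even parity, that Watson-Crick partners are bitwise complements of each other, and that the minimum Hamming distance is 2, so any single-bit error is detectable.

```python
from itertools import combinations

# Codewords exactly as listed on the slide.
CODE = {"A": "1010", "G": "0110", "T": "0101", "C": "1001"}
# Watson-Crick pairing rule.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def parity(word):
    """Return 0 if the word has an even number of ones, 1 otherwise."""
    return word.count("1") % 2

def complement_bits(word):
    """Flip every bit of a binary string."""
    return "".join("1" if b == "0" else "0" for b in word)

def hamming(a, b):
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

all_even = all(parity(w) == 0 for w in CODE.values())
pairs_complementary = all(
    CODE[PAIR[base]] == complement_bits(CODE[base]) for base in CODE
)
min_distance = min(hamming(CODE[a], CODE[b]) for a, b in combinations(CODE, 2))
```

Flipping any single bit of a codeword produces an odd-parity word, which is not in the code; that is the sense in which the alphabet behaves like a parity-check code.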
19 BEST ERROR CORRECTING MECHANISM: Deinococcus radiodurans
- Microbe with extreme radiation resistance
- Able to survive radiation doses thousands of times higher than would kill most organisms, including humans
- Surpasses the cockroach by orders of magnitude!
- Why? Because of its remarkable DNA-repair mechanism!!!
- D. radiodurans flawlessly regenerates its radiation-shattered genome in about 24 hours
Conan the Bacterium (to conquer the Red Planet!)
20 Something seemingly unrelated
Spin Glasses, the Ising Model, Hopfield Networks or Boltzmann Machines
State x of a spin glass with N spins that may take values in {-1, 1}
Energy of the state x: E; external field: h
The Hamiltonian; Hamiltonian for the Ising model
Example: Water exists as a gas, liquid or solid, but all microscopic elements are H2O molecules. This is due to intermolecular interactions depending on temperature, pressure etc.
Frustration
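The Hamiltonian on this slide did not survive extraction. As a hedged illustration, the sketch below assumes the standard Ising/spin-glass form E(x) = -sum_{m<n} J_mn x_m x_n - sum_n h_n x_n, with couplings J and external field h; the coupling values are hypothetical. It enumerates all states of three spins with antiferromagnetic couplings to show frustration: no state satisfies all three bonds at once.

```python
from itertools import product

def energy(x, J, h):
    """Energy of spin state x under the assumed Ising form
    E(x) = -sum_{m<n} J[m][n] x_m x_n - sum_n h[n] x_n."""
    n = len(x)
    pair = sum(J[m][k] * x[m] * x[k] for m in range(n) for k in range(m + 1, n))
    field = sum(h[m] * x[m] for m in range(n))
    return -(pair + field)

def ground_states(J, h):
    """Brute-force search for the minimum energy and all states attaining it."""
    states = list(product((-1, 1), repeat=len(h)))
    e_min = min(energy(x, J, h) for x in states)
    return e_min, [x for x in states if energy(x, J, h) == e_min]

# Hypothetical frustrated triangle: every pair of spins prefers to disagree (J = -1).
J_FRUSTRATED = [[0, -1, -1], [-1, 0, -1], [-1, -1, 0]]
e_min, states = ground_states(J_FRUSTRATED, [0, 0, 0])
```

If all three bonds could be satisfied the minimum energy would be -3; here it is -1 and is shared by six states, which is the degeneracy typical of frustrated systems.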
21 Something seemingly unrelated
Codes on graphs: the most powerful class of error-correcting codes in information theory, including Turbo, Low-Density Parity-Check (LDPC), and Repeat-Accumulate (RA) Codes
Most important consequence of graphical description: efficient iterative decoding
Variable nodes communicate their reliability to check nodes
Check nodes decide which variables are unreliable and suppress their inputs
Number of edges in graph = density of H; sparse = small complexity
[Figure: bipartite code graph with Variables and Checks]
Detrimental for convergence of decoder: presence of short cycles in code graph
Applications of LDPC codes: cryptography, compression, distributed source coding for sensor networks, error control coding in optical and wireless communication and in magnetic and optical storage
22 Gallager's Decoding Algorithm A
Works for the Binary Symmetric Channel (BSC)
Each variable sends its channel value unless all incoming messages from checks say change
Each check sends an estimate of the bit based on the modulo-two sum of the other bits participating in the check
Alternative view: Variables = Atoms, Binary Values = Spins
Variables align or misalign according to interaction patterns
LDPC equivalent to diluted spin glasses
Ground state search for above Hamiltonian = maximum a posteriori decoding of codeword
Average magnetization at a site = MAP decision for individual variable
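The message-passing rules above can be sketched in a few lines. This is a minimal illustration, not the decoder used in the talk: the parity-check matrix is a hypothetical 7x7 circulant built from the Fano plane (three ones per row and column, no cycles of length four), extrinsic messages are used (each node ignores the message coming from the node it is writing to), and the final decision rule (flip the received bit only if all incoming checks disagree with it) is an assumption.

```python
def gallager_a(H, y, max_iters=10):
    """Gallager's algorithm A on a binary symmetric channel.
    H: parity-check matrix (list of rows), y: received hard bits."""
    m, n = len(H), len(H[0])
    checks_of = [[c for c in range(m) if H[c][v]] for v in range(n)]
    vars_of = [[v for v in range(n) if H[c][v]] for c in range(m)]
    # Variable-to-check messages start out as the received channel values.
    v2c = {(v, c): y[v] for v in range(n) for c in checks_of[v]}
    x = list(y)
    for _ in range(max_iters):
        # Check update: modulo-two sum of the other bits in the check.
        c2v = {
            (c, v): sum(v2c[(u, c)] for u in vars_of[c] if u != v) % 2
            for c in range(m) for v in vars_of[c]
        }
        # Tentative decision: keep y[v] unless every check says "change".
        x = []
        for v in range(n):
            incoming = [c2v[(c, v)] for c in checks_of[v]]
            flip = bool(incoming) and all(b != y[v] for b in incoming)
            x.append(1 - y[v] if flip else y[v])
        if all(sum(x[v] for v in vars_of[c]) % 2 == 0 for c in range(m)):
            break  # all parity checks satisfied
        # Variable update: send y[v] unless all *other* checks say "change".
        for v in range(n):
            for c in checks_of[v]:
                others = [c2v[(d, v)] for d in checks_of[v] if d != c]
                flip = bool(others) and all(b != y[v] for b in others)
                v2c[(v, c)] = 1 - y[v] if flip else y[v]
    return x

# Hypothetical test matrix: check c involves variables c, c+1, c+3 (mod 7).
H_FANO = [[1 if (v - c) % 7 in (0, 1, 3) else 0 for v in range(7)] for c in range(7)]
decoded = gallager_a(H_FANO, [1, 0, 0, 0, 0, 0, 0])
```

With the all-zero codeword sent and one bit flipped, the three checks on the erroneous bit all vote for a change, while every other bit sees at most one complaining check, so the single error is corrected in the first iteration.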
23 Something seemingly unrelated
- The Regulatory Network of Gene Interactions (RNGI)
- Kauffman (1960s), NK: Evolution through Changing Interactions between Genes
- Life exists at the Edge of Chaos!
- BASED ON SPIN GLASSES!
- RANDOM BOOLEAN FUNCTION MODEL
- Evolution carried by genes, not base pairs, and the way genes interact!
T T+1
G1 G2 G3 G1 G2 G3
0 0 0 0 0 1
0 0 1 0 0 1
0 1 0 1 0 1
0 1 1 0 0 0
1 0 0 1 0 1
1 0 1 0 1 0
1 1 0 0 0 1
1 1 1 0 1 1
G1G2 G1
0 0 0
0 1 1
1 0 0
1 1 0
G1G3 G2
0 0 0
0 1 0
1 0 0
1 1 1
G1G2 G3
0 0 1
0 1 1
1 0 0
1 1 1
G1
G3
G2
24 Chaos, Attractors, Connectivity
- Boolean networks: dynamical systems, characterized by network topology and choices of Boolean node functions
- Attractors: point and periodic; number and period lengths
MOST IMPORTANT topological factor: CONNECTIVITY (KEY)
Sparse connectivity allows enough variability for evolutionary processes, produces self-organizing structures, but doesn't allow the system to get trapped in chaotic behavior
MOST IMPORTANT Boolean function factors: BIAS (number of 1 outputs), CANALIZATION (depends on number of inputs determining output)
- 111
- 000 011
- 001 101
- 110 010
- Kimatograph of the network
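To make the notion of point and periodic attractors concrete, the sketch below follows every state of a three-gene Boolean network until it repeats. The transition table is copied from the T to T+1 truth table on slide 23 (bits ordered G1 G2 G3); the function names are hypothetical.

```python
# State transitions T -> T+1 taken from the truth table on slide 23.
NEXT = {
    "000": "001", "001": "001", "010": "101", "011": "000",
    "100": "101", "101": "010", "110": "001", "111": "011",
}

def attractor_of(state):
    """Iterate the network from `state` and return the cycle it falls into."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = NEXT[state]
    cycle = seen[seen.index(state):]
    return tuple(sorted(cycle))

def basins():
    """Group all states by the attractor they eventually reach."""
    result = {}
    for s in NEXT:
        result.setdefault(attractor_of(s), []).append(s)
    return result

BASINS = basins()
```

For this table the network has one point attractor (001) with a basin of five states and one periodic attractor of period two (010, 101) with a basin of three states.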
25 The NK model and RNGI
N = number of genes; K = number of genes co-interacting with one given gene
K = 2: critical value (mainly frozen states with islands of changing interaction)
Interaction between genes in regulatory network very limited in scope: K ranges anywhere from 2-3 to 10-15
If we check carefully, K is logarithmic in N, i.e. in the number of genes
Between 2 and 3 for Escherichia coli (around a thousand genes); between 4 and 8 for higher metazoa (several thousand genes)
Can explain the process of cell differentiation: genetic material of each cell is the same, yet cells are functionally and morphologically very different
Each cell type CORRESPONDS TO ONE GIVEN ATTRACTOR of the RNGI
Counting attractors for networks with N = 40000 genes, K = 2 gives Cell types (correct number 258).
26 KEY IDEA: LDPC Code with Given Decoding Algorithm is a BOOLEAN NETWORK, SPIN GLASS,
- Example: LDPC Code under Gallager's A Algorithm
In the Control Graph, edge (i,j) exists if the i-th bit controls the j-th bit (i.e. if i and j are at distance exactly two)
G1 G2 G3 G4
Boolean function determined by decoding algorithm
For Gallager's A algorithm, takes form of truncated/periodically repeated THUE-MORSE sequence
LDPC Code: Variables and Checks
LDPC Code: The Control Graph
Thue-Morse
n:      0 1 2  3  4   5   6   7
binary: 0 1 10 11 100 101 110 111
t(n):   0 1 1  0  1   0   0   1
Properties: Self-similar (fractal); results in unbiased Boolean functions
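The link between the modulo-two sum and the Thue-Morse sequence can be seen in a short sketch. It assumes only the usual definition: t(n) is the parity of the number of ones in the binary expansion of n. The truth table of a k-input parity function, with inputs listed in binary counting order, is then the first 2^k terms of the sequence. Function names are hypothetical.

```python
from itertools import product

def thue_morse(n):
    """t(n): parity of the number of ones in the binary expansion of n."""
    return bin(n).count("1") % 2

def parity_truth_table(k):
    """Outputs of the k-input modulo-two sum, inputs in binary counting order."""
    return [sum(bits) % 2 for bits in product((0, 1), repeat=k)]

first_eight = [thue_morse(n) for n in range(8)]
table3 = parity_truth_table(3)
```

Both lists equal 0 1 1 0 1 0 0 1, matching the row on the slide, and exactly half of the outputs are 1, which is what "unbiased" means here.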
27 Use Boolean Network Analysis for LDPC Codes
- No cycles of length four, code regular: uniform choice for Boolean function
- Cycles of length four: Boolean functions vary, many more attractors
- In no case are the functions canalizing: modulo two sums of variable nodes connected to controls
Can use mean-field theorems to see when initial perturbations in the codewords disappear in the limit: use the Boolean derivative, sensitivity analysis, iterative Jacobian and Lyapunov exponent (as in Shmulevich et al.)
Jacobian F is a matrix with the Boolean partial derivative of the i-th node function with respect to the j-th variable in entry (i,j).
28 Use Boolean Network Analysis for LDPC Codes
Iterated Jacobian
Lyapunov exponent
The influence of a variable x_j on the Boolean function f is defined as the expectation of the partial derivative of f with respect to x_j, taken over the distribution of the variables.
Influence carries important information about frozen states, error susceptibility etc.: iterative change of size of stable core
Control of the chaotic phase in a Boolean network by means of periodic pulses (with period T) that freeze a fraction of nodes
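As a small worked example of influence, the sketch below assumes uniformly distributed inputs and uses the Boolean partial derivative f(x) XOR f(x with bit j flipped). The function and variable names are hypothetical.

```python
from itertools import product

def influence(f, n, j):
    """Influence of variable j on the n-input Boolean function f:
    the fraction of inputs (uniform distribution assumed) on which
    flipping bit j changes the output."""
    changes = 0
    for x in product((0, 1), repeat=n):
        flipped = list(x)
        flipped[j] ^= 1
        changes += f(x) ^ f(tuple(flipped))
    return changes / 2 ** n

def parity3(x):
    """Modulo-two sum of three inputs (the check-node function)."""
    return sum(x) % 2

def and2(x):
    """Two-input AND, a canalizing function, for comparison."""
    return x[0] & x[1]

parity_influence = influence(parity3, 3, 0)
and_influence = influence(and2, 2, 0)
```

Every input of a parity function has influence 1 (any flip always changes the output), whereas each input of AND has influence 0.5, consistent with the slide's remark that modulo-two sums are never canalizing.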
29 LDPC Codes and Gallager's A Decoding Algorithm
A (B)C1 (C)C2 (D)C3 F3(A) A (B C) D C1 C2 F1(A) A (B C D) C1 F2(A)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 1 1
0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1
0 0 1 1 0 0 0 1 1 1 1 1 0 0 1 1 0 0
0 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 1 1
0 1 0 1 0 0 1 0 1 1 1 1 0 1 0 1 0 0
0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0
0 1 1 1 1 0 1 1 1 0 1 0 0 1 1 1 1 1
1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0
1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1 1
1 0 1 0 1 1 0 1 0 1 0 1 1 0 1 0 1 1
1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0
1 1 0 0 1 1 1 0 0 1 0 1 1 1 0 0 1 1
1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 0
1 1 1 0 1 1 1 1 0 0 0 0 1 1 1 0 0 0
1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1
30 New decoding methods for LDPC and other Block Codes
- Work in Progress
- Decoders that don't operate on the frozen core
- Decoders that periodically freeze some variables to avoid chaotic behavior
- Iterative decoders that work for asymmetric channels and channels with insertion/deletion errors
32 Bold Conjecture
The ECC of DNA Replication operates on multiple levels
Carrier of information is the gene, not the base pair
The Global level involves Genes; Local levels may involve exons or base pairs in general
The Global Code is an LDPC Code!
Wigner observed that the same mathematical concepts turn up in entirely unexpected connections in the whole of science (no explanation as of yet)
LDPC related to statistical physics (spin glasses) to neural networks to self-organizing systems to
R. Sole and B. Goodwin, Signs of Life: How Complexity Pervades Biology
33 The Corresponding LDPC Code
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
3 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0
4 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0
5 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0
6 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
7 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0
8 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
9 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1
10 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0
11 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
12 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1
Table 1: Example of 15-node regulatory network in terms of gene controls
Gene Controls Controls (after addition)
1 2,3,13 2,3,13
2 1,3 1,3
3 4,5,6 4,5,6,1,2
4 5,6 5,6,3
5 4,9,6 4,9,6
6 5,9,3,4,7,8 3,4,5,7,8,9
7 15,8,6 15,8,6
8 7,9 7,9
9 4,5,6 4,5,6
10 9,13,15 9,13,15
11 8,12,13 8,12,13
12 11,13,15 8,11,13,15
13 8,14,15 8,14,15,11,12
14 X X
15 11,12 11,12,13,14,10,7
15-gene interaction example by Hashimoto (Shmulevich, Anderson Cancer Center)
Need q-ary LDPC code corresponding to different levels of interaction
34 Cancer: genetic disorder of somatic cells
- Human cancer: INDUCED and SPONTANEOUS
- Accumulation of mutant (erroneous) genes that control cell cycle, maintain genomic stability, and mediate apoptosis
- Causes of mutation: depurination and depyrimidination of DNA; proofreading and mismatch errors during DNA replication
- Deamination of 5-methylcytosine to produce C to T base pair substitutions, and damage to DNA and its replication imposed by products of metabolism (notably oxidative damage caused by oxygen free radicals)
- Defective DNA excision-repair; low levels of antioxidants, antioxidant enzymes, and nucleophiles that trap DNA-reactive electrophiles, and of enzymes that conjugate nucleophiles with DNA-damaging electrophiles
35 Cancer Research
- To summarize: Various forms of cancer tightly linked to malfunctioning of proofreading (ECC) mechanism
- Cancer cells correspond to a special type of attractor of the RNGI
- (A cancer cell is just another configuration of RNGI)
- (Shmulevich et al., Anderson Cancer Research Center)
- This attractor has genes interacting in a way that results in uncontrolled cell division
- Key observation: Change in RNGI results in further weakening of the proofreading system, and vice versa
36 Example 1: Cancer cells cheat the proofreading mechanism regulating reduction in length of telomeres
Aging: during each cell division, telomeres get shorter and shorter
When they become too short, errors in replication happen, leading to cancer (a time bomb in our body)
Cancer cells cheat the proofreading mechanism and allow telomeres to maintain constant length
Finding the error-control mechanism: classifying diseases accurately, curing diseases (including cancer) by gene therapy, making telomere lengths constant over long time
37 Example 2: Breast cancer susceptibility gene BRCA1 tightly linked to error-control of DNA and cell division regulation
38 How to obtain results practically? DNA Microarrays!
Figure taken from Shmulevich et al.
39 II. How can one efficiently store DNA sequences?
40 DNA Storage: Compression
- GenBank/Swiss-Prot: storage of large number of DNA and protein sequences (17471 million sequences in GenBank, 2002)
- Every day, an average of 15 new sequences added to database
- DNA compression absolutely necessary to maintain banks
- Fractal DNA structure to be exploited
- Possible use of Tsallis entropy
- Need novel compression algorithms
- DNA sequences of related species differ in a very small percentage of base pairs: need cross-reference compression
- Need meaningful definition of DNA distance: major paradigm shift from base-pair distance to chromosomal distance
41 Statistical properties of DNA sequences
Bases within the human mitochondrion (length approximately 17000) appear with the following frequencies
A T G C
0.31 0.13 0.25 0.31
while within different regions of human fetal
globin gene
Introns A T G C
0.27 0.29 0.27 0.17
Exons A T G C
0.24 0.22 0.28 0.25
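One way to read these frequency tables is through their zeroth-order Shannon entropy, which bounds what a memoryless compressor could achieve per base. The sketch below computes it for the three distributions listed on the slide; the dictionary layout and the renormalization step (the exon frequencies as listed sum to 0.99) are choices made for this illustration.

```python
from math import log2

# Base frequencies exactly as listed on the slide.
FREQS = {
    "mitochondrion": {"A": 0.31, "T": 0.13, "G": 0.25, "C": 0.31},
    "introns": {"A": 0.27, "T": 0.29, "G": 0.27, "C": 0.17},
    "exons": {"A": 0.24, "T": 0.22, "G": 0.28, "C": 0.25},
}

def entropy(p):
    """Shannon entropy in bits per base; frequencies are renormalized
    because rounded values may not sum exactly to one."""
    total = sum(p.values())
    return -sum((v / total) * log2(v / total) for v in p.values())

ENTROPIES = {name: entropy(p) for name, p in FREQS.items()}
```

All three values come out only slightly below the 2 bits per base of a uniform source, which is why first-order statistics alone give little compression and higher-order structure (repeats, long-range correlation) has to be exploited.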
Parts of genetic sequences can be modeled by Markov chains of given order and transition probabilities (order 2-7)
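A minimal sketch of how such a Markov model might be estimated: count how often each base follows each context of `order` bases, then normalize the counts. The function name and the toy sequence are hypothetical; a real model would be trained on much longer sequences and would typically smooth the counts.

```python
from collections import Counter, defaultdict

def markov_transitions(seq, order=2):
    """Maximum-likelihood transition probabilities P(next base | previous `order` bases)."""
    counts = defaultdict(Counter)
    for i in range(len(seq) - order):
        context = seq[i:i + order]
        counts[context][seq[i + order]] += 1
    return {
        context: {base: c / sum(nxt.values()) for base, c in nxt.items()}
        for context, nxt in counts.items()
    }

# Toy sequence chosen for illustration only.
probs = markov_transitions("ACGACGACTACG", order=2)
```

In the toy sequence the context AC is followed by G three times and by T once, so the estimate is P(G | AC) = 0.75 and P(T | AC) = 0.25.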
Regions of uniform distribution (isochores) can stretch in length up to hundreds of Kbps
Repetitive patterns: tandem repeats (TR), random repeats (RR), short interspersed repeat sequences (SINEs, 9% of DNA), long interspersed repeat sequences (LINEs)
Base pairs like CG have very small probability; the most notorious triplet repeats, related to Huntington's disease and Fragile-X mental retardation, consist of these very unlikely CG pairs: (CGG)m, (CCG)m, m = number of repetitions
Junk DNA seems to have long-range (fractal) characteristics.
42 A fractal pattern arises from the so-called DNA walk: a graphical representation of the DNA sequence in which one moves up for C or T and down for A or G.
Can have two- or three-dimensional random walks with further differentiation between A, G, C, T
[Figure: walk directions for C, A, T, G]
Fractal dimension of the DNA molecule: 0.85 for higher species, 1 for lower
Use linguistic analysis of human languages for exploring the DNA "language" (Zipf method)
http://library.thinkquest.org/26242/full/ap/ap13.html
DNAWalker: http://athena.bioc.uvic.ca/pbr/walk/
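The one-dimensional DNA walk described on this slide (up for C or T, down for A or G) is simple to compute. The sketch below is an illustration with a hypothetical function name and a toy input.

```python
def dna_walk(seq):
    """Cumulative one-dimensional DNA walk: +1 for a pyrimidine (C or T),
    -1 for a purine (A or G). Returns the positions, starting at 0."""
    steps = {"C": 1, "T": 1, "A": -1, "G": -1}
    position, walk = 0, [0]
    for base in seq.upper():
        position += steps[base]
        walk.append(position)
    return walk

# Toy input for illustration.
walk = dna_walk("CATG")
```

Plotting the returned positions against the index gives the walk; long-range correlation shows up as excursions that grow faster than those of an uncorrelated random walk.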
43 DNA and Cantor Sets
Provata and Almirantis, 2003: Fractal Cantor pattern in DNA
Exons: filled regions; Introns: empty regions
Random, fractal, Cantor-like set
Implication: atom (carrier of information) = exon/intron pairs
History-based random walk and DNA description in terms of urn models
Only introns in higher species have higher complexity than in lower species
Both coding and non-coding regions exhibit long range correlation, with spectral density of introns
44 Known algorithms
GenCompress (Chen, 97), Biocompress (Grumbach/Tahi, 94), Fact (Rivals, 00), GenomeSequenceCompress (Sato et al., 00)
Use characteristics of DNA like repeats, reverse complements
Compression rate is about 1.74 bits per base (78% in compression ratio)
Two classes: statistical and grammar based compression algorithms
Huffman, Lempel-Ziv, Arithmetic Coding, Burrows-Wheeler, Kieffer's Grammar Based Schemes (with DNA specific modifications)
No known algorithm specially suited for fractal nature of DNA, although 90% fractal!
FILE COMPRESSION RATE (ACHIEVABLE) GZIP ARITHM. VPS2A UNIX COMPRESS BIO- COMPRESS BWT GTAC
Human Growth Hormone (HUMGHCSA) 2.00 2.065 2.052 1.607 2.19 1.31 1.608 1.1
45 Different Entropy Measures
- Tsallis entropy (TE) is non-additive in the way that for two independent PS A, B: S_q(A+B) = S_q(A) + S_q(B) + (1-q) S_q(A) S_q(B)
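The non-additivity can be checked numerically. The sketch below assumes the standard definition S_q(p) = (1 - sum_i p_i^q) / (q - 1) and the standard pseudo-additivity rule S_q(A+B) = S_q(A) + S_q(B) + (1 - q) S_q(A) S_q(B) for independent A and B; the distributions and the value q = 2 are arbitrary choices for illustration.

```python
def tsallis(p, q):
    """Tsallis entropy S_q(p) = (1 - sum_i p_i^q) / (q - 1), for q != 1."""
    return (1.0 - sum(x ** q for x in p)) / (q - 1.0)

def joint(pa, pb):
    """Joint distribution of two independent systems."""
    return [a * b for a in pa for b in pb]

q = 2.0
pa, pb = [0.5, 0.5], [0.25, 0.75]

lhs = tsallis(joint(pa, pb), q)
rhs = tsallis(pa, q) + tsallis(pb, q) + (1 - q) * tsallis(pa, q) * tsallis(pb, q)
plain_sum = tsallis(pa, q) + tsallis(pb, q)
```

Here the joint entropy is 0.6875 while the plain sum of the two entropies is 0.875; the cross term (1 - q) S_q(A) S_q(B) accounts exactly for the difference.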
46 Approach: Use Fractal Grammars
Inference of context-free grammars from fractal data sets
Syntactic generation of fractals
Theory of formal languages can be used to state the problem of "syntactic fractal pattern recognition"
Explore connections with wavelets (ideas by Jacques Blanc-Talon)
Example: Heighway dragon and Koch curve
- Barthel, Brandau, Hermesmeier, Heising: Fractal Prediction, 1997
- Zerotree wavelet coding using fractal prediction
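Syntactic generation of fractals can be illustrated with a string-rewriting (L-system) sketch. The production rules below are the commonly used ones for the Koch curve and the Heighway dragon and are an assumption here, since the slide does not list its grammars; F means "draw forward", + and - are turns, X and Y are non-drawing symbols.

```python
def expand(axiom, rules, iterations):
    """Apply the production rules to every symbol in parallel, `iterations` times."""
    s = axiom
    for _ in range(iterations):
        s = "".join(rules.get(symbol, symbol) for symbol in s)
    return s

# Commonly used grammars (assumed, not taken from the slide).
KOCH = {"F": "F+F--F+F"}                   # turns of 60 degrees
DRAGON = {"X": "X+YF+", "Y": "-FX-Y"}      # turns of 90 degrees

koch2 = expand("F", KOCH, 2)
dragon1 = expand("FX", DRAGON, 1)
dragon2 = expand("FX", DRAGON, 2)
```

The inverse problem named on the slide, inferring such a grammar from observed data, is the hard part; a short grammar that regenerates a long sequence is itself a compressed description of that sequence.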
47 How does one compress sets of related DNA Sequences?
- Distributed Source Coding Problem: Peculiar Correlation Patterns
- Could explore Wavelet Based Compression
- Distributed Source Coding with LDPC Codes
48 Genomic Distance and One-Way Communication
- Major paradigm shift in genetic distance measure
- From base-pair distance (involving deletion, insertion and substitution; Sankoff, Kruskal, Time Warps, String Edits, and Macromolecules) to Chromosomal Distance based on global arrangements of genes
- Inversions are primary mechanism of genome rearrangement!
- REVERSAL DISTANCE
- The smallest number of inversions necessary to transform one genome into another
- Finding the minimum number of reversals needed to sort a permutation
- Permutations are signed, indicating direction of transcription
- Example: (1 3 2) → (1 -2 -3) → (1 2 -3) → (1 2 3)
- How does one perform one-way communication (SENDING INFORMATION TO A RECEIVER WHO POSSESSES CORRELATED INFORMATION) under the reversal distance measure?
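For very small genomes the reversal distance can be found by exhaustive search. The sketch below is a brute-force breadth-first search over signed permutations, meant only to illustrate the definition (it is exponential in the number of genes and is not how reversal distance is computed in practice). A reversal reverses the order of a segment and flips the sign of every element in it. Function names are hypothetical.

```python
from collections import deque

def reversals(p):
    """Yield every signed permutation reachable from p by one reversal."""
    n = len(p)
    for i in range(n):
        for j in range(i, n):
            segment = tuple(-g for g in reversed(p[i:j + 1]))
            yield p[:i] + segment + p[j + 1:]

def reversal_distance(start, target):
    """Minimum number of signed reversals turning start into target (BFS)."""
    start, target = tuple(start), tuple(target)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        p = queue.popleft()
        if p == target:
            return dist[p]
        for nxt in reversals(p):
            if nxt not in dist:
                dist[nxt] = dist[p] + 1
                queue.append(nxt)
    return None

d = reversal_distance((1, 3, 2), (1, 2, 3))
```

For the slide's example the search returns 3, so the three-step path (1 3 2) → (1 -2 -3) → (1 2 -3) → (1 2 3) is optimal.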
49 The other way around: DNA compression methods increase network efficiency by up to 10 times
Peribit's SR-50 compressor
- Uses molecular sequence reduction (MSR) algorithms similar to those used to match patterns in the study of DNA.
- The algorithms identify and eliminate repetitions previously undetected in network traffic in wide area networks (WANs) to give compression ratios of between 1.2:1 for voice and video and 5:1 for SQL traffic.
50 III. Additional Coding Problems in Genetics
51 DNA Computing: Codes with Constant GC Content and invariant under Watson-Crick Inversion
Microarray Error Control Coding: Using design theory to reduce error rate of DNA array data
Use novel clustering algorithms for DNA Array Data
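The two constraints named for DNA computing codes can be checked mechanically. The sketch below interprets "invariant under Watson-Crick inversion" as "the set of codewords is closed under reverse complementation", which is an assumption; the codewords and function names are hypothetical examples.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(word):
    """Watson-Crick reverse complement of a DNA word."""
    return "".join(COMPLEMENT[base] for base in reversed(word))

def gc_content(word):
    """Number of G and C bases in the word."""
    return sum(base in "GC" for base in word)

def is_valid_code(words):
    """True if all words share the same GC content and the set is
    closed under reverse complementation."""
    word_set = set(words)
    same_gc = len({gc_content(w) for w in words}) == 1
    closed = all(reverse_complement(w) in word_set for w in words)
    return same_gc and closed

good = is_valid_code(["ACGT", "AGCA", "TGCT"])
bad = is_valid_code(["AAGC", "GGGC"])
```

Constant GC content keeps the melting temperatures of the words similar, since G-C pairs are held by three hydrogen bonds and A-T pairs by two, as noted on the pairing-rule slide.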
52 Conclusion
- Genetics is the most exciting source of new ideas for coding theory
- The atom of information is a gene, not a base pair or a triple of base pairs
- The error control code of the genome is to be found operating on the level of genes
- Compression, phylogenetic tree construction, and comparison of species have to be performed on the level of genes first
- Once the genes are compared, can move to local base pair comparisons