Title: Application of Algorithm Research to Molecular Biology
1Application of Algorithm Research to Molecular
Biology
- R. C. T. Lee
- Dept. of Computer Science
- National Chi Nan University
2- There is one peculiar characteristic of all living organisms: we can reproduce ourselves.
- Yet it is important that what we reproduce is the same as we are.
- That is, wild flowers produce the same kind of wild flowers, and birds reproduce the same kind of birds.
3- Information about ourselves must be passed to our descendants.
- Question: How is this done?
- Answer: Through DNA.
4- DNA (deoxyribonucleic acid) can be viewed as two strands of nucleotides wound into a double helix.
6- There are only four types of nucleotide bases in every DNA molecule:
- A: Adenine
- G: Guanine
- C: Cytosine
- T: Thymine
7- Each strand of DNA is a sequence of A, G, C and T.
- Each A in one strand is paired with a T in the other strand.
- Similarly, each G is paired with a C.
8Human Mitochondrial DNA Control Region
- TTCTTTCATGGGGAAGCAAA
- AAGAAAGTACCCCTTCGTTT
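The pairing rule (A with T, G with C) means one strand determines the other. A minimal Python sketch of that rule follows; the names `COMPLEMENT` and `complement_strand` are illustrative and not from the slides.

```python
# Base-pairing rule from the slides: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand):
    """Return the strand paired position by position with the given one."""
    return "".join(COMPLEMENT[base] for base in strand)

# The two strands shown for the human mitochondrial control region.
top = "TTCTTTCATGGGGAAGCAAA"
print(complement_strand(top))  # AAGAAAGTACCCCTTCGTTT
```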
9- DNA exists in cells.
- Each living organism has many different kinds of cells. For instance, human beings have muscle cells, blood cells, neural cells, etc.
- How can different cells perform different functions?
10Genes
- In each DNA sequence, there are subsequences called genes.
- Each gene corresponds to a distinct protein, and it is the protein that determines the function of the cell.
- For instance, red blood cells must contain the oxygen-carrying protein haemoglobin, and the production of this protein is controlled by a certain gene.
11Proteins
- Each protein consists of amino acids.
- There are 20 different amino acids.
13The Relationship between a Gene and its
Corresponding Protein
14- As shown above, each amino acid is coded by a triplet. For instance, TTC denotes Phe (phenylalanine).
- Each triplet is called a codon.
- Three codons, namely TAA, TGA and TAG, mark the end of a gene.
15- Protein RNase A: KETAAAKFER
- Its corresponding DNA sequence is AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT
16How Is a Protein Produced?
- RNA (Ribonucleic Acid)
- Each cell is able to recognize all of the
starting points of genes relevant to the proteins
important to the functions of the cell.
17- The RNA system scans a gene. For each codon scanned, it produces the corresponding amino acid.
- After all codons have been scanned, the corresponding protein is produced.
19- AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT
- KETAAAKFER
- Note that codon AAA corresponds to amino acid K and CGT corresponds to R.
- Remember: TAA, TGA and TAG signify the end of a gene.
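The codon-by-codon scan can be sketched in a few lines of Python. This is only an illustration: `CODON_TABLE` is deliberately partial and holds just the codons that appear in these slides (the full genetic code has 64), and the function name `translate` is an assumption.

```python
# Partial codon table: only the codons used in these slides.
CODON_TABLE = {
    "AAA": "K", "GAA": "E", "ACT": "T", "GCT": "A",
    "TTT": "F", "TTC": "F", "CGT": "R",
}
STOP_CODONS = {"TAA", "TGA", "TAG"}  # end-of-gene markers

def translate(dna):
    """Scan the gene three letters at a time and emit one amino acid per codon."""
    dna = dna.replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break  # end of gene reached
        protein.append(CODON_TABLE[codon])  # KeyError for codons outside the partial table
    return "".join(protein)

print(translate("AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT"))  # KETAAAKFER
```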
20Problems
- 1. String Matching Problem
- 2. Sequence Alignment Problem
- 3. Evolution Tree Problem
- 4. RNA Secondary Structure Prediction Problem
- 5. Protein Structure Problem
- 6. Physical Mapping Problem
21Exact String Matching Problems
- Instance: A text T of length n and a pattern P of length m, where n > m.
- Question: Find all occurrences of P in T.
- Example: If T = ttaptaap and P = ap, then P occurs in T starting at positions 3 and 7.
- Linear time (O(n + m) time) algorithms:
- Knuth-Morris-Pratt (KMP) algorithm
- Boyer-Moore algorithm
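As an illustration of the first algorithm named above, here is a compact Python sketch of Knuth-Morris-Pratt matching. The function names are illustrative, and positions are reported 1-indexed to match the example on this slide.

```python
def failure_table(pattern):
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text, pattern):
    """Return the 1-indexed start positions of every occurrence of pattern in text."""
    if not pattern:
        return []
    fail = failure_table(pattern)
    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # convert 0-indexed end to 1-indexed start
            k = fail[k - 1]
    return hits

print(kmp_search("ttaptaap", "ap"))  # [3, 7]
```

Each text character is read once and the fallback steps are bounded by the number of matches made, which is where the O(n + m) bound comes from.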
22Approximate String Matching Problems
- Instance: A text T of length n, a pattern P of length m and a maximal number of errors allowed k.
- Question: Find all text positions where the pattern matches the text with up to k errors, where an error is the substitution, deletion, or insertion of a character.
- Example:
- Let T = pttapa, P = patt and k = 2.
- The substrings T[1..2], T[1..3], T[1..4] and T[5..6] each match P with at most 2 errors.
- Algorithms:
- Dynamic Programming approach
- NFA approach
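The dynamic programming approach can be sketched as follows. This is one common formulation (an edit-distance table whose top row is all zeros, so a match may start anywhere in the text), not necessarily the exact variant the slides have in mind; the function name is an assumption. It reports 1-indexed end positions of substrings that match P with at most k errors.

```python
def approx_match_ends(text, pattern, k):
    """Return 1-indexed text positions where some substring ending there
    is within k edits (substitution, insertion, deletion) of pattern."""
    m = len(pattern)
    # col[i] = fewest edits turning pattern[:i] into some substring ending at the current text position
    col = list(range(m + 1))  # before any text is read, pattern[:i] costs i deletions
    ends = []
    for j, ch in enumerate(text, start=1):
        new = [0] * (m + 1)  # new[0] = 0: a match may start at any text position
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new[i] = min(col[i - 1] + cost,  # match or substitute
                         col[i] + 1,         # extra character in the text
                         new[i - 1] + 1)     # pattern character missing from the text
        col = new
        if col[m] <= k:
            ends.append(j)
    return ends

print(approx_match_ends("pttapa", "patt", 2))  # [2, 3, 4, 6]
```

The end positions 2, 3, 4 and 6 agree with the substrings T[1..2], T[1..3], T[1..4] and T[5..6] listed in the example.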
23Sequence Alignment Problem
- ATTCATTACAACCGCTATG
- ACCCATCAACAACCGCTATG
- It appears that these two sequences are quite different.
- An alignment will produce the following:
ATTCATTA-CAACCGCTATG
ACCCATCAACAACCGCTATG
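24- Given two sequences, any alignment will have a corresponding score.
- For each exact match, the score is equal to 2.
- For each mismatch, including a position aligned with a gap, the score is equal to -1.
- Example: two alignments of AGC with AAAC
AGC-
AAAC
Score: 2 - 3 = -1
AG-C
AAAC
Score: 2 x 2 + 2 x (-1) = 2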
25- The sequence alignment problem: Given two sequences, find an alignment which produces the highest score.
- Approach: Dynamic Programming
- The multiple sequence alignment problem is NP-hard
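A minimal Python sketch of the dynamic programming approach for two sequences, using the scoring from the previous slide (match = 2, mismatch = -1) and assuming, as the arithmetic in the AGC/AAAC example suggests, that a gap also scores -1. The function name and parameter names are illustrative.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Highest score over all global alignments of a and b."""
    # prev[j] = best score for aligning the prefix of a read so far with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag,               # align a[i-1] with b[j-1]
                         prev[j] + gap,      # a[i-1] against a gap
                         cur[j - 1] + gap)   # b[j-1] against a gap
        prev = cur
    return prev[-1]

print(alignment_score("AGC", "AAAC"))  # 2, the score of AG-C over AAAC
```

The table has (len(a) + 1) x (len(b) + 1) cells, so the running time is proportional to the product of the two lengths; only two rows are kept in memory here because the sketch returns the score and not the alignment itself.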
26The Evolution Tree Problem
28- The evolution tree problem: Given a distance matrix of n species, find an evolution tree under some criterion.
- Usually, the criteria require that the tree distances reflect the original distances.
- That is, when two species are close to each other in the distance matrix, they should be close in the evolution tree.
29- Each criterion corresponds to a distinct evolution tree problem.
- Most of them are NP-complete.
- Algorithms which produce optimal evolution trees in polynomial time are mostly based upon the minimum spanning tree approach.
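To show the building block mentioned above, here is a short Python sketch of Prim's minimum spanning tree algorithm run on a distance matrix. It is not a complete evolution tree algorithm, and the 4-species matrix is made up purely for illustration.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.
    Returns a list of (parent, child, distance) edges."""
    n = len(dist)
    in_tree = [False] * n
    best = [float("inf")] * n  # cheapest known connection of each species to the tree
    parent = [-1] * n
    best[0] = 0                # start growing the tree from species 0
    edges = []
    for _ in range(n):
        u = min((v for v in range(n) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        if parent[u] != -1:
            edges.append((parent[u], u, dist[parent[u]][u]))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u
    return edges

# Hypothetical distances between four species (not data from the slides).
D = [[0, 2, 6, 7],
     [2, 0, 5, 8],
     [6, 5, 0, 3],
     [7, 8, 3, 0]]
print(minimum_spanning_tree(D))  # [(0, 1, 2), (1, 2, 5), (2, 3, 3)]
```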
30A Partial Evolution Tree of Homo sapiens (Intelligent Human Beings, also Modern Men)
- Our ancestors are from Africa.
31Secondary Structure of RNA
- Due to hydrogen bonds, the primary structure of an RNA can fold back on itself to form its secondary structure.
- Base pairs (formed by hydrogen bonds):
- 1. A-U (Watson-Crick base pair)
- 2. C-G (Watson-Crick base pair)
- 3. G-U (Wobble base pair)
32AGGCCUUCCU
33 2D and 3D Structures of Yeast Phenylalanyl-Transfer RNA
(Figure panels: 3D Structure, 2D Structure)
34Secondary Structure Prediction Problem
- Given an RNA sequence, determine the secondary structure with the minimum free energy for this sequence.
- Approach: Dynamic Programming
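A full free-energy model is beyond a short example, so the sketch below swaps in a simpler, well-known stand-in: a Nussinov-style dynamic program that maximizes the number of allowed base pairs (A-U, C-G, G-U) without crossings. The `min_loop` parameter (at least 3 unpaired bases inside a hairpin) and the function name are assumptions for illustration.

```python
# Allowed base pairs from the earlier slide, in both orientations.
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}

def max_base_pairs(rna, min_loop=3):
    """Maximum number of non-crossing base pairs (a simplified proxy for minimum free energy)."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within rna[i..j]
    best = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best[0][n - 1]

print(max_base_pairs("GGGAAACCC"))  # 3: a stem of three G-C pairs around the AAA loop
```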
35Protein Structure Problem
- Each amino acid of a protein can be classified into one of the following two types:
- H (hydrophobic, non-polar) (hating water)
- P (hydrophilic, polar) (loving water)
- Then the amino acid sequence of a protein can be viewed as a binary sequence of Hs (1s) and Ps (0s).
36Example
(Figure: two lattice embeddings of 0/1 sequences, one with Score 5 and one with Score 3.)
37H-P Model
- Instance: A sequence of 1s (Hs) and 0s (Ps).
- Question: Find a self-avoiding path embedded in either a 2D or 3D lattice which maximizes the score, where the score is the number of pairs of 1s that are adjacent in the lattice without being adjacent in the sequence.
- NP-complete even for the 2D lattice.
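Finding the best fold is the hard part, but scoring a given fold is straightforward. The Python sketch below computes the score defined above for a path on the 2D square lattice; the function name and the coordinate representation are assumptions for illustration.

```python
def hp_score(sequence, path):
    """Score a 2D lattice fold: count pairs of 1s that are lattice neighbours
    but not consecutive in the sequence. path[i] is the (x, y) cell of residue i."""
    if len(sequence) != len(path):
        raise ValueError("one lattice cell is needed per residue")
    if len(set(path)) != len(path):
        raise ValueError("path is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must occupy adjacent cells")
    position = {cell: i for i, cell in enumerate(path)}
    score = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "1":
            continue
        # Look only right and up so that every neighbouring pair is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                score += 1
    return score

# Folding 1001 into a unit square brings the two 1s next to each other.
print(hp_score("1001", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # 1
```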
38 Physical Mapping Problem
39Shortest Common Superstring
- Input: A collection F of strings.
- Output: A shortest possible string S such that for every f ∈ F, S is a superstring of f.
- For example: F = {ACT, CTA, AGT}, S = ACTAGT
- NP-complete
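For a handful of short strings the problem can be explored by brute force. The Python sketch below tries every ordering of the strings and merges them with the largest possible overlap; it is only practical for tiny inputs, and the function names are illustrative.

```python
from itertools import permutations

def merge(left, right):
    """Append right to left, reusing the longest overlap between left's end and right's start."""
    if right in left:
        return left  # already covered, nothing to add
    for k in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return left + right

def shortest_common_superstring(strings):
    """Try every ordering of the strings and keep the shortest merged result."""
    best = None
    for order in permutations(strings):
        candidate = ""
        for s in order:
            candidate = merge(candidate, s)
        if best is None or len(candidate) < len(best):
            best = candidate
    return best

print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```

The number of orderings grows factorially with the number of strings, which is consistent with the problem being hard in general.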
40- Suppose the target is too long and its contents are unknown.
- What can we do?
- Enzyme A → 6, 8, 3, 10
- Enzyme B → 7, 11, 4, 5
- Enzymes A and B → 1, 5, 2, 6, 7, 3, 3
41This problem is called the double digest problem, which is NP-complete.
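For the small instance on the previous slide, an exhaustive search is feasible. The Python sketch below assumes the three lists are unordered fragment lengths and looks for an ordering of the A fragments and of the B fragments whose combined cut sites reproduce the A-and-B fragment lengths. Function names are illustrative.

```python
from itertools import accumulate, permutations

def merged_fragments(order_a, order_b):
    """Fragment lengths obtained when the cut sites of both orderings are applied together."""
    cuts = sorted(set(accumulate(order_a)) | set(accumulate(order_b)))
    return [hi - lo for lo, hi in zip([0] + cuts[:-1], cuts)]

def double_digest(frags_a, frags_b, frags_ab):
    """Return one (order_a, order_b) consistent with the double digest, or None."""
    if sum(frags_a) != sum(frags_b):
        return None  # both enzymes must cut the same target
    target = sorted(frags_ab)
    for order_a in permutations(frags_a):
        for order_b in permutations(frags_b):
            if sorted(merged_fragments(order_a, order_b)) == target:
                return order_a, order_b
    return None

# One consistent arrangement of the fragments from the slide.
print(merged_fragments((10, 3, 6, 8), (7, 4, 11, 5)))  # [7, 3, 1, 2, 6, 3, 5]
```

The search examines up to 4! x 4! orderings here; the factorial growth is why this brute-force idea does not scale.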
42- TAA, TGA, or TAG.
- Do you know what they mean?
- End of Gene.
- Thank you for your patience. Have a good
conference.