Application of Algorithm Research to Molecular Biology

Provided by: algCsie
1
Application of Algorithm Research to Molecular
Biology
  • R. C. T. Lee
  • Dept. of Computer Science
  • National Chi Nan University

2
  • There is one peculiar characteristic of all
    living organisms: we can reproduce ourselves.
  • Yet it is important that what we reproduce is
    the same as we are.
  • That is, wild flowers produce the same kind of
    wild flowers, and birds reproduce the same kind of
    birds.

3
  • Information about ourselves must be passed to our
    descendants.
  • Question: How is this done?
  • Answer: Through DNA.

4
First of all, we need a language to pass on the
information about heredity. This language has
existed for 3 billion years and is the oldest language
in the world. Its alphabet consists of 4
letters: A, G, C and T.
5
We need a mechanism to represent the letters.
This is done by using chemical compounds:
A = adenine, G = guanine, C = cytosine, T = thymine.
6
Nature has used DNA to pass hereditary
information to our descendants. A DNA strand is
a sequence of chemical compounds. From our
point of view, a DNA strand is a sequence of A,
G, C and T.
7
  • DNA (deoxyribonucleic acid) can be viewed as two
    strands of nucleic acids formed as a double helix.

8
(No Transcript)
9
  • Each strand of DNA is a sequence of A, G, C and
    T.
  • Each A in one strand is paired with a T in the
    other strand.
  • Similarly, each G is paired with a C.

10
Human Mitochondrial DNA Control Region
  • TTCTTTCATGGGGAAGCAAA
  • AAGAAAGTACCCCTTCGTTT
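The pairing rule can be checked against this example with a few lines of code. This is only a sketch; the names `PAIR` and `complement` are chosen here for illustration and do not come from the slides.

```python
# Base-pairing rule from the slides: A pairs with T, G pairs with C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the strand that pairs with `strand`, base by base."""
    return "".join(PAIR[base] for base in strand)


# First strand of the example; the result is the second strand shown above.
print(complement("TTCTTTCATGGGGAAGCAAA"))  # AAGAAAGTACCCCTTCGTTT
```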

11
  • DNA exists in cells.
  • For each living organism, there are many
    different kinds of cells. For instance, human
    beings have muscle cells, blood cells, neural
    cells, etc.
  • How can different cells perform different
    functions?

12
Genes
  • In each DNA sequence, there are subsequences
    which are called genes.
  • Each gene corresponds to a distinct protein and
    it is the protein which determines the function
    of the cell.
  • For instance, red blood cells must contain the
    oxygen-carrying protein haemoglobin, and the
    production of this protein is controlled by a
    certain gene.

13
Proteins
  • Each protein consists of amino acids.
  • There are 20 different amino acids

14
(No Transcript)
15
The Relationship between a Gene and its
Corresponding Protein
16
  • As shown above, each amino acid is coded by a
    triplet. For instance, TTC denotes
    Phe (phenylalanine).
  • Each triplet is called a codon.
  • There are three codons, namely TAA, TGA and TAG,
    which represent the end of a gene.

17
  • Protein RNase A: KETAAAKFER
  • Its corresponding DNA sequence is: AAA GAA ACT
    GCT GCT GCT AAA TTT GAA CGT

18
How Is a Protein Produced?
  • RNA (Ribonucleic Acid)
  • Each cell is able to recognize all of the
    starting points of genes relevant to the proteins
    important to the functions of the cell.

19
  • The RNA system scans a gene. For each codon being
    scanned, it produces a corresponding amino acid.
  • After all codons have been scanned, the
    corresponding protein is produced.

20
(No Transcript)
21
  • DNA: AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT
  • Protein: KETAAAKFER
  • Note that codon AAA corresponds to amino acid K
    and CGT corresponds to R.
  • Remember that TAA, TGA and TAG signify the end of
    a gene.
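The codon-by-codon scan can be sketched in code. The table below is the standard genetic code written in the DNA alphabet, with codons enumerated in T, C, A, G order; the names `CODON_TABLE` and `translate` are assumptions made for this sketch, and the real cellular machinery (transcription to RNA, then translation) is far more involved.

```python
BASES = "TCAG"
# Standard genetic code, one letter per codon, codons enumerated in TCAG order.
# "*" marks the three stop codons TAA, TAG and TGA.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Scan the gene codon by codon and stop at the first stop codon."""
    dna = dna.replace(" ", "")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino = CODON_TABLE[dna[i:i + 3]]
        if amino == "*":  # TAA, TGA or TAG: end of gene
            break
        protein.append(amino)
    return "".join(protein)


print(translate("AAA GAA ACT GCT GCT GCT AAA TTT GAA CGT"))  # KETAAAKFER
```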

22
Problems
  • 1. String Matching Problem
  • 2. Sequence Alignment Problem
  • 3. Evolution Tree Problem
  • 4. RNA Secondary Structure Prediction Problem
  • 5. Protein Structure Problem
  • 6. Physical Mapping Problem
  • 7. Genome Rearrangement Problem

23
Exact String Matching Problems
  • Instance: A text T of length n and a pattern P
    of length m, where n > m.
  • Question: Find all occurrences of P in T.
  • Example: If T = ttaptaap and P = ap, then P
    occurs in T starting at positions 3 and 7.
  • Linear-time (O(n + m)) algorithms:
  • Knuth-Morris-Pratt (KMP) algorithm
  • Boyer-Moore algorithm
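A minimal sketch of the Knuth-Morris-Pratt idea, reporting 1-based start positions so that the output matches the example above. The function name `kmp_search` is chosen for this sketch.

```python
def kmp_search(text, pattern):
    """Return the 1-based start positions of all occurrences of pattern in text."""
    if not pattern:
        return []
    # failure[i]: length of the longest proper prefix of pattern[:i + 1]
    # that is also a suffix of it.
    failure = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k

    hits = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 2)  # convert 0-based end index to 1-based start
            k = failure[k - 1]
    return hits


print(kmp_search("ttaptaap", "ap"))  # [3, 7]
```

The text pointer never moves backwards, which is what gives the O(n + m) bound: O(m) to build the failure table and O(n) for the scan.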

24
Approximate String Matching Problems
  • Instance: A text T of length n, a pattern P of
    length m and a maximum number of allowed errors k.
  • Question: Find all text positions where the
    pattern matches the text with up to k errors, where
    an error is the substitution, deletion, or
    insertion of a character.
  • Example:
  • Let T = pttapa, P = patt and k = 2.
  • The substrings T[1..2], T[1..3], T[1..4] and
    T[5..6] each match P with at most 2 errors.
  • Algorithms:
  • Dynamic programming approach
  • NFA approach
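One way to realise the dynamic programming approach is the classic column-by-column edit-distance table in which a match may start anywhere in the text. The sketch below reports 1-based end positions; the function name `approximate_matches` and the choice to report end positions are assumptions of this sketch.

```python
def approximate_matches(text, pattern, k):
    """Return 1-based text positions where a match with <= k errors ends."""
    m = len(pattern)
    # column[i]: fewest errors needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for j, ch in enumerate(text, start=1):
        prev_diag = column[0]  # row 0 is always 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            new = min(prev_diag + cost,   # substitution or exact match
                      column[i] + 1,      # extra character in the text
                      column[i - 1] + 1)  # pattern character missing from text
            prev_diag = column[i]
            column[i] = new
        if column[m] <= k:
            ends.append(j)
    return ends


print(approximate_matches("pttapa", "patt", 2))  # [2, 3, 4, 6]
```

The end positions 2, 3, 4 and 6 correspond to the substrings T[1..2], T[1..3], T[1..4] and T[5..6] in the example.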

25
Sequence Alignment Problem
  • ATTCATTACAACCGCTATG
    ACCCATCAACAACCGCTATG
  • It appears that these two sequences are quite
    different.
  • An alignment will produce the following:
    ATTCATTA-CAACCGCTATG
    ACCCATCAACAACCGCTATG

26
  • Given two sequences, any alignment will have a
    corresponding score.
  • For each exact match, the score is equal to 2.
  • For each mismatch, the score is equal to -1.
  • Example: two alignments of AGC and AAAC
    (a gap counts as a mismatch)
      AGC-    score: 2 - 3 = -1
      AAAC
      AG-C    score: 2x2 + 2x(-1) = 2
      AAAC

27
  • The sequence alignment problem: Given two
    sequences, find an alignment which produces the
    highest score.
  • Approach: Dynamic Programming
  • The multiple sequence alignment problem is NP-hard.
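A sketch of the dynamic programming approach for two sequences, using the scores from the earlier slide (match 2, mismatch -1). The gap score of -1 is an assumption taken from the worked AGC / AAAC example, and the name `alignment_score` is chosen for this sketch.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Highest score over all global alignments of a and b."""
    # prev[j]: best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                           prev[j] + gap,       # a[i-1] against a gap
                           cur[j - 1] + gap))   # b[j-1] against a gap
        prev = cur
    return prev[-1]


print(alignment_score("AGC", "AAAC"))  # 2
```

The table has (len(a) + 1) x (len(b) + 1) entries, so the running time is proportional to the product of the two lengths; only the previous row is kept here because the sketch returns the score and not the alignment itself.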

28
Before alignment
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATTTTATAT
TTAAAAATAA GAAATTTTTT TTTTTAAAAA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATGTTATATA TATATTTTAT
TTAAAAATAA GAAATTATTT TTTAAAATAA TTTCTATAAA TTTTATATAT ATATTTTATA
TTAAAAATAA GAAATTATTT TTTAAAAATA ATTTCTATAA ATTTTATATA TATATTTTAT
TTAAAAATAA GAAATTTTTT TTTTTAAATT AAATTTCTAT CAATTTTATA TATTTTTTAT
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATTTTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TATATTATAA
TTAAAAATTA GAAATTTTAT TTTTAAAATT TCTATTAAAA TTTATATATA TTTTTTATAA
TTAAAAATTA GAAATTTTAT TTTTTAAAAT TTCTATTAAA ATTTATATAT ATATTTTTTT
TTAAAAATGA GAAATTTTTA TAAAAAAATT TCTTTAAATT TTATATATTT TATAAATATA
TTAATAATAA GAAATTTTTT TATTTTTTAA ATAAAAAATT CTTTAAATTT TATATATATA
29
After alignment
TTAAAAATAA GAAATTATTT TTTAAA AATAATT TCTATAAAT GTTATATATA
TTAAAAATAA GAAATTATTT TTTAAA AATAATT TCTATAAAT GTTATATATA
TTAAAAATAA GAAATTATTT TTTAAA AATAATT TCTATAAAT GTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA AAAATT TCTATAAAT TTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA AAAATT TCTATAAAT TTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA AAAATT TCTATAAAT TTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA AAAATT TCTATAAAT TTTATATATA
TTAAAAATAA GAAATTATTT TTTAAA ATAATT TCTATAAAT TTTATATATA
TTAAAAATAA GAAATTATTT TTTAAA AATAATT TCTATAAAT TTTATATATA
TTAAAAATTA GAAATTTTAT TTTTAA AATT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT TTTTAA AATT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA AAAATT TCTATAAAT TTTATATATA
TTAAAAATTA GAAATTTTAT TTTTTAA AATT TCTATTAAAA TTTATATATA
TTAAAAATTA GAAATTTTAT TTTTAA AATT TCTATTAAAA TTTATATATA
TTAAAAATAA GAAATTTTTT TTTTTAA ATTAAATT TCTATCAAT TTTATATATT
TTAAAAATGA GAAATTTTTA TAA AAAAATT TCTTTAAAT TTTATATATT
TTAATAATAA GAAATTTTTT TATTTTTTAA ATAAAAAAT TCTTTAAAT TTTATATATA
30
The Evolution Tree Problem
31
(No Transcript)
32
  • The evolution tree problem: Given a distance
    matrix of n species, find an evolution tree under
    some criterion.
  • Usually, the criteria are such that all of the
    tree distances reflect the original distances.
  • That is, when two species are close to each other
    in the distance matrix, they should be close in
    the evolution tree.

33
  • Each criterion corresponds to a distinct
    evolution tree problem.
  • Most of them are NP-complete.
  • Algorithms which produce optimal evolution trees
    in polynomial time are mostly based upon the
    minimum spanning tree approach.
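As an illustration of the building block only, here is a sketch of Prim's algorithm computing a minimum spanning tree from a distance matrix. The 4-species matrix is hypothetical data made up for this example, and turning a spanning tree into an evolution tree under a particular criterion is not shown.

```python
def minimum_spanning_tree(dist):
    """Prim's algorithm on a symmetric distance matrix.

    Returns a list of edges (u, v, distance) in the order they are added.
    """
    n = len(dist)
    in_tree = {0}  # grow the tree from species 0
    edges = []
    while len(in_tree) < n:
        # Cheapest edge leaving the current tree.
        d, u, v = min(
            (dist[u][v], u, v)
            for u in in_tree
            for v in range(n)
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Hypothetical distances between four species (not from the slides).
example = [
    [0, 2, 7, 8],
    [2, 0, 6, 9],
    [7, 6, 0, 3],
    [8, 9, 3, 0],
]
print(minimum_spanning_tree(example))  # [(0, 1, 2), (1, 2, 6), (2, 3, 3)]
```

Species that are close in the matrix (0 and 1, 2 and 3) end up joined directly, which is the behaviour the distance criterion asks for.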

34
A Partial Evolution Tree of Homo sapiens
(Intelligent Human Beings, also Modern Men). Our
ancestors are from Africa.
35
Secondary Structure of RNA
  • Due to hydrogen bonds, the primary structure of an
    RNA can fold back on itself to form its secondary
    structure.
  • Base pairs (formed by hydrogen bonds):
    1. A-U (Watson-Crick base pair)
    2. C-G (Watson-Crick base pair)
    3. G-U (wobble base pair)

36
RNA Secondary Structure without Pseudoknots
37
Given an RNA sequence, there may be several
secondary structures without pseudoknots, as
shown below
38
An optimal RNA secondary structure is one with
the maximum number of base pairs.
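Under this criterion, and without pseudoknots, the maximum number of base pairs can be computed by dynamic programming over subsequences (a Nussinov-style recurrence). The sketch below uses the three pair types listed earlier; the function name `max_base_pairs` and the optional `min_loop` parameter (minimum number of unpaired bases between two paired bases, default 0) are assumptions of this sketch.

```python
# Allowed base pairs: A-U and C-G (Watson-Crick), G-U (wobble).
PAIRS = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def max_base_pairs(rna, min_loop=0):
    """Maximum number of base pairs in a pseudoknot-free secondary structure."""
    n = len(rna)
    if n == 0:
        return 0
    # best[i][j]: maximum number of pairs within rna[i..j] (inclusive).
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: base j stays unpaired
            for k in range(i, j - min_loop):  # case 2: base j pairs with base k
                if (rna[k], rna[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + 1 + inner)
            best[i][j] = score
    return best[0][n - 1]


print(max_base_pairs("GGGAAACCC"))  # 3
```

Splitting at the partner of base j keeps every pair either entirely inside or entirely outside the pair (k, j), which is exactly the no-pseudoknot condition.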
39
RNA Secondary Structure with Simple Pseudoknots
40
2D and 3D Structures of Yeast Phenylalanyl-Transfer
RNA
(Figure panels: 3D structure, 2D structure)
41
Secondary Structure Prediction Problem
  • Given an RNA sequence, determine the secondary
    structure with the minimum free energy for this
    sequence.
  • Approach: Dynamic Programming

42
Protein Structure Problem
  • Each amino acid of a protein can be classified
    as one of the following two types:
  • H (hydrophobic, non-polar) (hating water)
  • P (hydrophilic, polar) (loving water)
  • Then the amino acid sequence of a protein can be
    viewed as a binary sequence of Hs (1s) and Ps
    (0s).

43
Example
  • Instance: 011001001110010
  • (Figure: two lattice embeddings of the instance,
    one with score 5 and one with score 3.)
44
H-P Model
  • Instance: A sequence of 1s (Hs) and 0s (Ps).
  • Question: Find a self-avoiding path embedded
    in either a 2D or 3D lattice which maximizes the
    score, where the score is the number of pairs of
    1s that are adjacent in the lattice without
    being adjacent in the sequence.
  • NP-complete even for the 2D lattice.
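Finding the best fold is the hard part, but scoring a given fold is easy. The sketch below scores a fold on the 2D lattice according to the definition above; the function name `hp_score` and the 4-residue example are hypothetical and are not the instance from the slides.

```python
def hp_score(sequence, path):
    """Score of a 2D lattice fold in the H-P model.

    sequence: string of '1' (H) and '0' (P).
    path: list of (x, y) lattice points, one per residue, in sequence order.
    """
    if len(path) != len(sequence):
        raise ValueError("path and sequence must have the same length")
    if len(set(path)) != len(path):
        raise ValueError("path is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be lattice neighbours")

    index_at = {point: i for i, point in enumerate(path)}
    score = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "1":
            continue
        # Look only right and up so that every lattice contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbour)
            if j is not None and sequence[j] == "1" and abs(i - j) > 1:
                score += 1
    return score


# Hypothetical fold: the sequence 1001 wrapped around a unit square.
print(hp_score("1001", [(0, 0), (1, 0), (1, 1), (0, 1)]))  # 1
```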

45
Physical Mapping Problem
46
Shortest Common Superstring
  • Input: A collection F of strings.
  • Output: A shortest possible string S such that
    for every f ∈ F, S is a superstring of f.
  • For example: F = {ACT, CTA, AGT}, S = ACTAGT
  • NP-complete
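For a collection as small as the one in the example, the problem can be solved by brute force: try every ordering of the strings and merge neighbours with their maximum overlap. The sketch below does exactly that; the function names are chosen for this sketch, and the running time grows factorially with the number of strings, in line with the NP-completeness noted above.

```python
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_common_superstring(strings):
    """Exhaustive search over all orderings; only practical for tiny inputs."""
    unique = sorted(set(strings))
    # A string contained in another one never affects the answer.
    items = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not items:
        return ""
    best = None
    for order in permutations(items):
        merged = order[0]
        for s in order[1:]:
            merged += s[overlap(merged, s):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


print(shortest_common_superstring(["ACT", "CTA", "AGT"]))  # ACTAGT
```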
47
  • Suppose the target is too long and its contents
    are unknown.
  • What can we do?
  • Enzyme A → 6, 8, 3, 10
    Enzyme B → 7, 11, 4, 5
    Enzymes A and B → 1, 5, 2, 6, 7, 3, 3

48
This problem is called the two digest problem
(also known as the double digest problem), which is
NP-complete.
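For fragment lists as short as those in the enzyme example, an exhaustive search is feasible: try every ordering of the A fragments and every ordering of the B fragments, and check whether the combined cut sites reproduce the fragments seen when both enzymes are applied. The function name `double_digest` is chosen for this sketch, and the search is exponential in general.

```python
from itertools import accumulate, permutations


def double_digest(a, b, ab):
    """Find orderings of the A and B fragments consistent with the A+B digest.

    Returns (order_a, order_b) for the first consistent pair found, or None.
    """
    if not (sum(a) == sum(b) == sum(ab)):
        return None
    target = sorted(ab)
    for order_a in permutations(a):
        sites_a = set(accumulate(order_a))  # cut positions, plus the end point
        for order_b in permutations(b):
            sites = sorted(sites_a | set(accumulate(order_b)))
            # Fragment lengths are the gaps between consecutive cut positions.
            fragments = [s - p for p, s in zip([0] + sites, sites)]
            if sorted(fragments) == target:
                return order_a, order_b
    return None


result = double_digest([6, 8, 3, 10], [7, 11, 4, 5], [1, 5, 2, 6, 7, 3, 3])
print(result)
```

One consistent arrangement is A = 8, 10, 6, 3 with B = 7, 4, 5, 11, which yields the combined fragments 7, 1, 3, 5, 2, 6, 3; the search may return a different but equally valid pair, such as its mirror image.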
49
A genome is a sequence of genes.
Chloroplast genome of alfalfa: -8, -7, -6, -5, -4, -3, -2, -1, -11, -10, -9
Chloroplast genome of garden pea: -4, 3, -2, 8, 7, -1, -5, -6, -11, 10, 9
50
Suppose that we can only reverse a substring of
genes, for example -4, 5, -8, -9. After reversal, we
have 9, 8, -5, 4.
51
The sorting by reversal problem: transform one
sequence into another using only reversals, in the
minimum number of steps.
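A reversal of signed genes flips both the order and the sign of every gene in the chosen substring. The sketch below implements that operation and, for very short genomes only, finds the minimum number of reversals by breadth-first search. The names `reverse_segment` and `min_reversals` are chosen for this sketch; the search space grows far too quickly for this to work on real genomes.

```python
from collections import deque


def reverse_segment(genome, start, end):
    """Reverse genome[start:end], flipping the sign of every gene in it."""
    genome = list(genome)
    flipped = [-gene for gene in reversed(genome[start:end])]
    return genome[:start] + flipped + genome[end:]


def min_reversals(source, target):
    """Minimum number of reversals turning source into target (tiny inputs only)."""
    start, goal = tuple(source), tuple(target)
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        genome, steps = queue.popleft()
        if genome == goal:
            return steps
        for i in range(len(genome)):
            for j in range(i + 1, len(genome) + 1):
                nxt = tuple(reverse_segment(genome, i, j))
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + 1))
    return None  # target uses different genes, so it cannot be reached


print(reverse_segment([-4, 5, -8, -9], 0, 4))  # [9, 8, -5, 4]
print(min_reversals([2, 1], [1, 2]))           # 3
```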
52
The transformation of worm Ascaris suum
mitochondrial DNA into human mitochondrial DNA
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 32 33 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 34 28 26 17 29 4 9 36 18 35 19 1 16 14 33 32 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 31 32 33 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 14 16 1 19 35 18 36 9 4 29 17 26 28 34 22 15 11 27 5 20 13 30 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 13 20 5 27 11 15 22 34 28 26 17 29 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 17 26 28 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 26 17 34 22 15 11 27 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 11 15 22 34 17 26 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 17 34 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 23 10 6 3 24 21 8 25 2 7
12 33 32 31 30 29 28 27 26 25 8 21 24 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 21 8 3 6 10 23 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 10 6 3 8 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 22 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 15 11 5 20 13 4 9 36 18 35 19 1 16 14 21 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 14 16 1 19 35 18 36 9 4 13 20 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 13 4 9 36 18 35 19 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 35 18 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
53
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 35 36 9 4 13 1 16 14 5 11 15 8 3 6 10 34 17 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 34 10 6 3 8 15 11 5 14 16 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 14 5 11 15 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 11 5 14 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 5 11 8 3 6 10 34 1 13 4 9 36 35 2 7
12 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 8 11 5 4 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 10 6 3 4 5 11 8 9 36 35 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 11 5 4 3 6 10 2 7
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 2 10 6 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 10 2 3 4 5 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 34 35 36 9 8 7 6 5 4 3 2 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 1 2 3 4 5 6 7 8 9 36 35 34 10 11
12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 9 8 7 6 5 4 3 2 1 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 10 11
1 2 3 4 5 6 7 8 9 36 35 34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
54
  • TAA, TGA, or TAG.
  • Do you know what they mean?
  • End of Gene.
  • Thank you for your patience. Have a good
    conference.