What does mathematics contribute to bioinformatics? - PowerPoint PPT Presentation

Slides: 21
Provided by: winfri
Learn more at: https://people.ohio.edu

1
What does mathematics contribute to
bioinformatics?
  • Winfried Just
  • Department of Mathematics
  • Ohio University

2
A new microscope and a new physics
  • In 2004 PLoS Biology published a paper by Joel E.
    Cohen:
  • "Mathematics Is Biology's Next Microscope, Only
    Better; Biology Is Mathematics' Next Physics, Only
    Better."
  • Really?
  • How does this new microscope differ from the
    traditional ones?
  • How to use it?
  • Why did mathematicians become seriously interested in
    biology?
  • And how is all this related to bioinformatics?

3
More empirical observations
  • NSF and NIH recently started to invest heavily in
    biomathematics.
  • In 2002 the Mathematical Biosciences Institute
    (MBI, located at OSU) was founded; this is the
    first and so far only NSF institute dedicated
    exclusively to applications of mathematics in one
    other area.
  • Several other new research institutes in
    biomathematics are supported from public or
    private sources.
  • A number of new journals specializing in
    biomathematics got started.
  • The job market for biomathematicians is currently
    rather favorable, both in academia and industry,
    especially in the pharmaceutical industry.

4
What is behind this trend?
  • And why do we observe this trend now, instead of
    30 years ago or 30 years from now? There are two
    main reasons:
  • Contemporary biology generates huge mountains of
    data. Drawing biologically meaningful inferences
    from these data requires analysis in the
    framework of good mathematical models. Hence
    mathematics has become a necessary tool for
    biology.
  • Currently available computer power allows us to
    investigate sufficiently detailed mathematical
    models to draw biologically realistic inferences.
    Thus mathematics has become a useful tool for
    biology.

5
Biomathematics vs. bioinformatics
  • Everything that has been said so far about
    biomathematics could also be said about
    bioinformatics.
  • What is the difference between the two areas?
  • Biomathematics: Applications of mathematics to
    biology.
  • Bioinformatics: The design, implementation, and
    use of computer algorithms to draw inferences from
    massive sets of biomolecular data. It is an
    interdisciplinary field that draws on knowledge
    from biology, biochemistry, statistics,
    mathematics, and computer science.

6
Example of a huge data set: GenBank
  • The first viral genome was published in the 1980s;
    the first bacterial genome, H. influenzae,
    1.83 × 10^6 bp, in 1995.
  • The first genome of a multicellular organism,
    C. elegans, 10^8 bp, in 1998. The sketch of our own
    genome, H. sapiens, about 3 × 10^9 bp, was announced
    in June 2000.
  • As of February 2008, GenBank contained
    85 759 586 764 bp of information.
  • How to draw concrete inferences from such huge
    mountains of information?

7
Where are the genes?
  • Let us look, for example, at our own genome. The
    information about it is written in GenBank as a
    sequence of about 3 × 10^9 letters that would fill a
    million tightly typed pages, the equivalent of
    several thousand novels:
  • ...actggtacctgtatatggacgctccatatttaatgcgcgatgcagga
    tctaaa...
  • Less than 1.5% of this sequence codes for proteins.
    How to find these genes?
  • No human can read the whole sequence. A computer
    can read it easily, in a few seconds. So, maybe the
    computer will tell us where the genes are, where
    they start, and where they end.
  • But what is the computer supposed to compute?

9
Honest Craig's Casino
  • This is a casino in Nevada where one plays 64-number
    roulette. In each round, a player bets chips on
    three among those 64 numbers. If one of these three
    chosen numbers comes up, honest Craig will pay a
    suitable premium. If not, the player loses the
    chips.
  • QUESTION: How long does it take, on average, for a
    winning number to come up?
  • ANSWER: 64/3 ≈ 21.33 rounds.
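
The answer follows from the geometric distribution: if each round is won with probability p = 3/64, the mean number of rounds up to and including the first win is 1/p. The following is a minimal Python sketch of that calculation. The function names are illustrative and not part of the original slides. It computes the mean exactly and cross-checks it by summing the series numerically.

```python
from fractions import Fraction


def expected_rounds(winning_numbers: int = 3, wheel_size: int = 64) -> Fraction:
    """Exact mean number of rounds until the first win (geometric distribution)."""
    p_win = Fraction(winning_numbers, wheel_size)
    return 1 / p_win


def expected_rounds_by_summation(p_win: float, max_rounds: int = 10_000) -> float:
    """Approximate the same mean as sum over k of k * P(first win in round k)."""
    return sum(
        k * (1 - p_win) ** (k - 1) * p_win for k in range(1, max_rounds + 1)
    )


mean_exact = expected_rounds()                      # exact fraction 64/3
mean_numeric = expected_rounds_by_summation(3 / 64)  # truncated series
print(mean_exact, float(mean_exact), mean_numeric)
```

Truncating the series at 10 000 rounds is an arbitrary choice; the remaining tail is negligibly small.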

10
Probability of long waiting times
  • Let us assume that Craig is as honest as he claims.
    Then the probability P(k) that our player will keep
    losing throughout the first k rounds is (61/64)^k.
    In particular, starting from k = 50 we obtain the
    following probabilities:
  • P(50) = 0.0907  P(51) = 0.0864  P(52) = 0.0824
    P(53) = 0.0785  P(54) = 0.0748
  • P(55) = 0.0713  P(56) = 0.0680  P(57) = 0.0648
    P(58) = 0.0618  P(59) = 0.0589
  • P(60) = 0.0561  P(61) = 0.0535  P(62) = 0.0510
    P(63) = 0.0486  P(64) = 0.0463
  • P(65) = 0.0441  P(66) = 0.0421  P(67) = 0.0401
    P(68) = 0.0382  P(69) = 0.0364
  • P(100) = 0.0082  P(200) = 0.000068
    P(300) = 0.00000056
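
These probabilities can be recomputed directly from the formula (61/64)^k. The sketch below, with hypothetical helper names, also searches for the shortest losing streak whose probability drops below a chosen threshold. The default threshold of 0.05 is an assumption, picked because it is a common significance level.

```python
def prob_losing_streak(k: int, losing: int = 61, total: int = 64) -> float:
    """P(k): probability of losing all of the first k rounds on an honest wheel."""
    return (losing / total) ** k


def shortest_significant_streak(alpha: float = 0.05) -> int:
    """Smallest k for which P(k) falls below alpha."""
    k = 0
    while prob_losing_streak(k) >= alpha:
        k += 1
    return k


for k in (50, 62, 63, 100):
    print(f"P({k}) = {prob_losing_streak(k):.4f}")
print("shortest streak below 0.05:", shortest_significant_streak())
```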

11
Some statistical terminology
  • The assumption that Craig is as honest as he claims
    will be our null hypothesis. The suspicion that he
    is cheating after all is our alternative hypothesis.
  • The number of losses that precede the first winning
    round will be our test statistic.
  • The p-value is the probability that the test
    statistic takes the observed or a more extreme value
    under the assumption of the null hypothesis.
  • If the p-value falls below our agreed-upon
    significance level, we are justified in rejecting
    the null hypothesis. In science, the most commonly
    used significance level is 0.05.
  • Falsely accusing honest Craig of cheating would be a
    Type I error; trusting him when he is in fact
    cheating would be a Type II error.
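
As a rough illustration of this terminology, the casino test can be written as a short Python sketch. The names `p_value` and `reject_null` are made up for this example. It assumes the null hypothesis of an honest wheel, where each round is lost with probability 61/64, and treats "at least as many losses as observed" as the more extreme outcomes.

```python
def p_value(observed_losses: int, p_lose: float = 61 / 64) -> float:
    """Probability, under the null hypothesis, that at least
    `observed_losses` losing rounds precede the first win."""
    return p_lose ** observed_losses


def reject_null(observed_losses: int, alpha: float = 0.05) -> bool:
    """True if the p-value falls below the significance level alpha."""
    return p_value(observed_losses) < alpha


for losses in (10, 62, 63, 100):
    verdict = "reject" if reject_null(losses) else "do not reject"
    print(f"{losses} losses: p = {p_value(losses):.4f} -> {verdict} the null")
```

Rejecting the null here when Craig is in fact honest would be the Type I error described above; the significance level bounds how often that happens.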

14
Craig Venter's Lab
  • In 1995 Craig Venter's team sequenced the genome of
    the bacterium H. influenzae. If we want to detect
    the positions of its 1740 protein-coding genes in
    its sequence of 1 830 140 base pairs, we can reason
    as follows: in bacteria almost all of the genome
    codes for proteins.
  • Let us start from position n and read triplets
    (n, n+1, n+2), (n+3, n+4, n+5), ...
  • If we read in the correct reading frame, we will
    read a sequence of codons that ends with a STOP
    codon, that is, TAA, TGA, or TAG. Such a STOP codon
    will appear on average once in about 300 triplets.
  • If we read in one of the other five reading frames,
    we will read garbage, that is, a more or less random
    sequence of triplets, and one of the triplets TAA,
    TGA, TAG will be encountered on average once every
    64/3 ≈ 21.33 triplets.
  • Rings a bell?
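
To make the six reading frames concrete, here is a small Python sketch; the helper names and the toy sequence are illustrative assumptions, not taken from the slides. It splits a DNA string into triplets for the three offsets on the given strand and the three offsets on its reverse complement, then reports where STOP codons occur in each frame.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def codons(dna: str, offset: int) -> list:
    """Split dna into complete triplets, starting at the given offset."""
    return [dna[i:i + 3] for i in range(offset, len(dna) - 2, 3)]


def six_frames(dna: str) -> dict:
    """Map (strand, offset) to the list of triplets read in that frame."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[(strand, offset)] = codons(seq, offset)
    return frames


def stop_positions(codon_list: list) -> list:
    """Indices (counted in triplets) at which a STOP codon appears."""
    return [i for i, c in enumerate(codon_list) if c in STOP_CODONS]


toy = "ATGAAATAGCC"  # toy sequence, far too short to be a real gene
for frame, triplets in six_frames(toy).items():
    print(frame, triplets, stop_positions(triplets))
```

In a random sequence, 3 of the 64 possible triplets are STOP codons, which is where the 64/3 figure comes from.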

15
This is the same problem!
  • With minor modifications: now our null hypothesis
    will be that we read in the wrong reading frame; the
    alternative hypothesis will be that we read a coding
    sequence in the correct reading frame.
  • If we don't encounter a STOP codon while reading 63
    successive triplets, we can reject the null
    hypothesis at significance level 0.05 and conclude
    that we found a sequence that codes for a protein
    whose end is easy to find.
  • So we can design an easy gene-finding algorithm
    based on finding these so-called ORFs (open reading
    frames).
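
A bare-bones version of such an ORF scan might look like the Python sketch below. It is only an illustration under simplifying assumptions: it reports every stop-free run of at least 63 triplets that is closed by a STOP codon, in all six frames, and makes no attempt to locate the START codon. The function name `find_orfs` and the coordinate convention are choices made for this example.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
MIN_CODONS = 63  # threshold from the significance argument above


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna: str, min_codons: int = MIN_CODONS) -> list:
    """Return (strand, start, end) for each stop-free run of at least
    min_codons triplets that is terminated by a STOP codon.

    start and end are 0-based indices into the strand being read;
    end is the index of the first base of the terminating STOP codon.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            run_start = offset
            for i in range(offset, len(seq) - 2, 3):
                if seq[i:i + 3] in STOP_CODONS:
                    if (i - run_start) // 3 >= min_codons:
                        results.append((strand, run_start, i))
                    run_start = i + 3  # next run begins after the STOP
    return results


long_candidate = "ATG" + "GCT" * 70 + "TAA"   # 71 triplets before the STOP
short_candidate = "ATG" + "GCT" * 10 + "TAA"  # too short to be significant
print(find_orfs(long_candidate))
print(find_orfs(short_candidate))
```

Runs that reach the end of the sequence without a STOP codon are ignored here, since the idea relies on the end of the gene being marked by a STOP.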

16
Some caveats
  • The beginning of the gene is somewhat more
    difficult to determine, since ATG is both the
    START codon and the codon for methionine, and the
    promoter is also part of the gene.
  • The garbage in the other five reading frames is
    not completely random.
  • This approach will miss all genes that code
    proteins shorter than 63 amino acids (type ?
    error) and will sometimes discover spurious genes
    (type ? error).
  • This approach is unsuitable for discovering
    RNA-coding genes.
  • However, the above problems can be solved, and there
    exist good gene-finding algorithms based on this
    idea.

17
Craig Venter's lab in 2000
  • But now let us look at the genome of H. sapiens.
  • Protein-coding regions constitute only a small
    fraction of our genome.
  • All by itself, this would lead to a lot more Type I
    errors than in prokaryotes.

18
Craig Venter's lab in 2000
  • But now let us look at the genome of H. sapiens.
  • Protein-coding regions constitute only a small
    fraction of our genome.
  • The coding sequences, exons, are interspersed
    with introns.
  • A given codon may be split by an intron.
  • Consecutive exons don't have to sit in the same
    reading frame.
  • Introns look similar to random sequences.
  • So we are faced with a much more difficult
    problem.
  • Nowadays there exist pretty good algorithms for
    finding genes in eukaryotes. But:
  • No algorithm for finding genes in prokaryotes
    will work here.

19
Mathematics and mathematicians
  • Mathematics is a great language for elucidating
    the common structure in apparently unrelated
    problems.
  • Mathematicians have a tendency to talk about
    complicated theories in their jargon instead of
    giving simple and concrete answers.
  • Mathematical microscopes often don't come with
    a simple user's manual. In order to successfully
    use them, one needs to understand to some extent
    how they work. The choice of the most
    appropriate mathematical microscope for a given
    biological problem often requires active
    cooperation between mathematicians and
    biologists.
  • The key to success in this type of cooperation is
    finding a common language and mutual
    understanding of and respect for the two
    different intellectual approaches.
  • Mathematical models form the basis for
    formulating hypotheses, often in the form of
    probabilities.
  • The final interpretation of these hypotheses and
    their experimental verification belongs to the
    biologists. Thus mathematical microscopes will
    not make the more traditional ones redundant.
  • In points 3-6, feel free to substitute
    bioinformatics for mathematics.

20
Biomathematics vs. bioinformatics
  • Biomathematics: Applications of mathematics to
    biology.
  • Bioinformatics: The design, implementation, and
    use of computer algorithms to draw inferences from
    massive sets of biomolecular data. It is an
    interdisciplinary field that draws on knowledge
    from biology, biochemistry, statistics,
    mathematics, and computer science.
  • The design of all bioinformatics tools is based on
    mathematical models. In order to choose the most
    appropriate among the available tools and draw
    proper inferences, one needs to understand these
    models.