Title: What does mathematics contribute to bioinformatics?
1What does mathematics contribute to
bioinformatics?
- Winfried Just
- Department of Mathematics
- Ohio University
2A new microscope and a new physics
- In 2004 PLoS Biology published a paper by Joel E. Cohen: "Mathematics Is Biology's Next Microscope, Only Better; Biology Is Mathematics' Next Physics, Only Better."
- Really?
- How does this new microscope differ from the traditional ones?
- How to use it?
- Why did mathematicians become seriously interested in biology?
- And how is all this related to bioinformatics?
3More empirical observations
- NSF and NIH recently started to invest heavily in biomathematics.
- In 2002 the Mathematical Biosciences Institute (MBI, located at OSU) was founded; this is the first and so far only NSF institute dedicated exclusively to applications of mathematics in one other area.
- Several other new research institutes in biomathematics are supported from public or private sources.
- A number of new journals specializing in biomathematics have been started.
- The job market for biomathematicians is currently rather favorable, both in academia and in industry, especially in the pharmaceutical industry.
4What is behind this trend?
- And why do we observe this trend now, instead of 30 years ago or 30 years from now? There are two main reasons:
- Contemporary biology generates huge mountains of data. Drawing biologically meaningful inferences from these data requires analysis in the framework of good mathematical models. Hence mathematics has become a necessary tool for biology.
- Currently available computer power allows us to investigate sufficiently detailed mathematical models to draw biologically realistic inferences. Thus mathematics has become a useful tool for biology.
5Biomathematics vs. bioinformatics
- Everything that has been said so far about biomathematics could also be said about bioinformatics.
- What is the difference between the two areas?
- Biomathematics: Applications of mathematics to biology.
- Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.
6Example of a huge data set: GenBank
- The first viral genome was published in the 1980s; the first bacterial genome, H. influenzae, 1.83 × 10^6 bp, in 1995.
- The first genome of a multicellular organism, C. elegans, 10^8 bp, followed in 1998. The sketch of our own genome, H. sapiens, about 3 × 10^9 bp, was announced in June 2000.
- As of February 2008, GenBank contained 85 759 586 764 bp of information.
- How to draw concrete inferences from such huge mountains of information?
7Where are the genes?
- Let us look, for example, at our own genome. The information about it is written in GenBank as a sequence of about 3 × 10^9 letters that would fill a million tightly typed pages, the equivalent of several thousand novels:
- ...actggtacctgtatatggacgctccatatttaatgcgcgatgcaggatctaaa...
- Less than 1.5% of this sequence codes for proteins. How to find these genes?
- No human can read the whole sequence. A computer can read it easily, in a few seconds. So maybe the computer will tell us where the genes are, where they start, and where they end.
- But what is the computer supposed to compute?
8Honest Craig's Casino
- This is a casino in Nevada where one plays 64-number roulette. In each round, a player bets chips on three among those 64 numbers. If one of these three chosen numbers comes up, honest Craig will pay a suitable premium. If not, the player loses the chips.
- QUESTION: How long does it take, on average, for a winning number to come up?
- ANSWER: 64/3 ≈ 21.33 rounds.
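The answer is the mean of a geometric distribution: if each round is won with probability p = 3/64, the expected number of rounds up to and including the first win is 1/p. The following is a minimal Python sketch that checks this exactly and by summing the series numerically. It assumes independent rounds on a fair wheel; the function names are illustrative and not part of the talk.

```python
from fractions import Fraction


def expected_rounds_until_win(winning_numbers: int, total_numbers: int) -> Fraction:
    """Mean of a geometric distribution: 1/p, with p the chance of winning one round."""
    p = Fraction(winning_numbers, total_numbers)
    return 1 / p


def truncated_mean(winning_numbers: int, total_numbers: int, max_rounds: int) -> float:
    """Sum k * P(first win happens in round k) for k = 1..max_rounds."""
    p = winning_numbers / total_numbers
    return sum(k * (1 - p) ** (k - 1) * p for k in range(1, max_rounds + 1))


exact = expected_rounds_until_win(3, 64)
print(exact, float(exact))          # exact fraction and its decimal value
print(truncated_mean(3, 64, 2000))  # the series converges to the same value
```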
10Probability of long waiting times
- Let us assume that Craig is as honest as he claims. Then the probability P(k) that our player will keep losing throughout the first k rounds is (61/64)^k. In particular, starting from k = 50 we obtain the following probabilities:
  P(50) = 0.0907   P(51) = 0.0864   P(52) = 0.0824   P(53) = 0.0785   P(54) = 0.0748
  P(55) = 0.0713   P(56) = 0.0680   P(57) = 0.0648   P(58) = 0.0618   P(59) = 0.0589
  P(60) = 0.0561   P(61) = 0.0535   P(62) = 0.0510   P(63) = 0.0486   P(64) = 0.0463
  P(65) = 0.0441   P(66) = 0.0421   P(67) = 0.0401   P(68) = 0.0382   P(69) = 0.0364
  P(100) = 0.0082   P(200) = 0.000068   P(300) = 0.00000056
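These probabilities can be recomputed directly from the formula. The sketch below is only an illustration, under the same assumption of an honest wheel with 61 losing numbers out of 64; the names `prob_losing_streak` and `waiting_table` are made up for this example.

```python
def prob_losing_streak(k: int, losing: int = 61, total: int = 64) -> float:
    """P(k): probability of losing every one of the first k rounds."""
    return (losing / total) ** k


def waiting_table(ks):
    """Map each k to P(k), rounded to four decimal places as on the slide."""
    return {k: round(prob_losing_streak(k), 4) for k in ks}


table = waiting_table(range(50, 70))
for k, p in table.items():
    print(f"P({k}) = {p:.4f}")

# Much longer losing streaks are printed in scientific notation.
for k in (100, 200, 300):
    print(f"P({k}) = {prob_losing_streak(k):.2e}")
```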
11Some statistical terminology
- The assumption that Craig is as honest as he claims will be our null hypothesis. The suspicion that he is cheating after all is our alternative hypothesis.
- The number of losses that precede the first winning round will be our test statistic. The p-value is the probability that the test statistic takes the observed or a more extreme value under the assumption of the null hypothesis.
- If the p-value falls below our agreed-upon significance level, we are justified in rejecting the null hypothesis. In science, the most commonly used significance level is 0.05.
- Falsely accusing honest Craig of cheating would be a Type I error; trusting him when he is in fact cheating would be a Type II error.
12Craig Venter's Lab
- In 1995 Craig Venter's team sequenced the genome of the bacterium H. influenzae. If we want to detect the positions of its 1740 genes that code for proteins in its sequence of 1 830 140 base pairs, we can reason as follows: In bacteria almost all of the genome codes for proteins. Let us start from position n and read triplets (n, n+1, n+2), (n+3, n+4, n+5), ...
- If we read in the correct reading frame, we will read a sequence of codons that ends with a STOP codon, that is, TAA, TGA, or TAG. Such a STOP codon will appear on average once in about 300 triplets.
- If we read in one of the other five reading frames, we will read garbage, that is, a more or less random sequence of triplets, and one of the triplets TAA, TGA, TAG will be encountered on average once every 64/3 ≈ 21.33 triplets.
- Rings a bell?
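The figure 64/3 for garbage frames follows from counting: three of the 64 possible triplets are STOP codons. The short sketch below enumerates them. It assumes that all four bases are equally likely and independent, which is an idealization and not a property of real genomes.

```python
from fractions import Fraction
from itertools import product

STOP_CODONS = {"TAA", "TGA", "TAG"}

# All 4^3 = 64 triplets over the DNA alphabet.
all_triplets = ["".join(bases) for bases in product("ACGT", repeat=3)]

# Fraction of triplets that are STOP codons, and the mean spacing between them.
stop_fraction = Fraction(sum(t in STOP_CODONS for t in all_triplets), len(all_triplets))
mean_spacing = 1 / stop_fraction

print(len(all_triplets), stop_fraction, float(mean_spacing))
```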
15This is the same problem!
- With minor modifications: Now our null hypothesis will be that we read in the wrong reading frame; the alternative hypothesis will be that we read a coding sequence in the correct reading frame. If we don't encounter a STOP codon while reading 63 successive triplets, we can reject the null hypothesis at significance level 0.05 and conclude that we found a sequence that codes for a protein, whose end is easy to find.
- So we can design an easy gene-finding algorithm based on finding these so-called ORFs (open reading frames).
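A minimal sketch of such an ORF scan is shown below. It is an illustration of the idea, not a production gene finder: it assumes an input string over the letters A, C, G, T, reports only runs that are terminated by a STOP codon, does not look for a START codon, and gives hits on the opposite strand in the coordinates of the reverse-complemented string. The function names and the toy sequence are made up for this example.

```python
STOP_CODONS = {"TAA", "TGA", "TAG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return dna.translate(COMPLEMENT)[::-1]


def stop_free_runs(dna: str, min_codons: int = 63):
    """Yield (frame, start, stop_pos) for every run of at least min_codons
    STOP-free triplets that is terminated by a STOP codon.

    frame is 0, 1, or 2; start is the 0-based index of the first base of the
    run; stop_pos is the 0-based index of the first base of the STOP codon.
    """
    for frame in range(3):
        run_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3  # the next run begins right after the STOP


def candidate_orfs(dna: str, min_codons: int = 63):
    """Scan all six reading frames; '-' hits use reverse-complement coordinates."""
    dna = dna.upper()
    hits = [("+", *run) for run in stop_free_runs(dna, min_codons)]
    hits += [("-", *run) for run in stop_free_runs(reverse_complement(dna), min_codons)]
    return hits


# Toy input (made up for illustration): ATG, 70 alanine codons, then a STOP.
toy = "ATG" + "GCT" * 70 + "TAA"
print(candidate_orfs(toy))
```

The default threshold of 63 triplets is the one derived from the 0.05 significance level; raising `min_codons` trades fewer spurious hits for more missed short genes.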
16Some caveats
- The beginning of the gene is somewhat more difficult to determine, since ATG is both the START codon and the codon for methionine, and the promoter is also part of the gene.
- The garbage in the other five reading frames is not completely random.
- This approach will miss all genes that code for proteins shorter than 63 amino acids (Type ? error) and will sometimes discover spurious genes (Type ? error).
- This approach is unsuitable for discovering RNA-coding genes.
- However, the above problems can be solved, and there exist good gene-finding algorithms based on this idea.
17Craig Venter's lab in 2000
- But now let us look at the genome of H. sapiens.
- Protein-coding regions constitute only a small fraction of our genome.
- All by itself, this would lead to a lot more Type I errors than in prokaryotes.
18Craig Venter's lab in 2000
- The coding sequences, exons, are interspersed with introns.
- A given codon may be split by an intron.
- Consecutive exons don't have to sit in the same reading frame.
- Introns look similar to random sequences.
- So we are faced with a much more difficult problem.
- Nowadays there exist pretty good algorithms for finding genes in eukaryotes. But:
- No algorithm for finding genes in prokaryotes will work here.
19Mathematics and mathematicians
1. Mathematics is a great language for elucidating the common structure in apparently unrelated problems.
2. Mathematicians have a tendency to talk about complicated theories in their jargon instead of giving simple and concrete answers.
3. Mathematical microscopes often don't come with a simple user's manual. In order to use them successfully, one needs to understand to some extent how they work. The choice of the most appropriate mathematical microscope for a given biological problem often requires active cooperation between mathematicians and biologists.
4. The key to success in this type of cooperation is finding a common language and mutual understanding of and respect for the two different intellectual approaches.
5. Mathematical models form the basis for formulating hypotheses, often in the form of probabilities.
6. The final interpretation of these hypotheses and their experimental verification belongs to the biologists. Thus mathematical microscopes will not make the more traditional ones redundant.
- In points 3-6, feel free to substitute bioinformatics for mathematics.
20Biomathematics vs. bioinformatics
- Biomathematics: Applications of mathematics to biology.
- Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.
- The design of all bioinformatics tools is based on mathematical models. In order to choose the most appropriate among the available tools and draw proper inferences, one needs to understand these models.