Provided by: winfri

Learn more at: https://people.ohio.edu

Title: What does mathematics contribute to bioinformatics?

1
What does mathematics contribute to
bioinformatics?

Winfried Just
Department of Mathematics
Ohio University

2
A new microscope and a new physics

In 2004 PLoS Biology published a paper by Joel E. Cohen:
"Mathematics Is Biology's Next Microscope, Only Better;
Biology Is Mathematics' Next Physics, Only Better."
Really?
How does this new microscope differ from the traditional ones?
How to use it?
Why did mathematicians become seriously interested in biology?
And how is all this related to bioinformatics?

3
More empirical observations

NSF and NIH recently started to invest heavily in biomathematics.
In 2002 the Mathematical Biosciences Institute (MBI, located at OSU) was founded; it is the first and so far only NSF institute dedicated exclusively to applications of mathematics in one other area.
Several other new research institutes in biomathematics are supported from public or private sources.
A number of new journals specializing in biomathematics have been started.
The job market for biomathematicians is currently rather favorable, both in academia and industry, especially in the pharmaceutical industry.

4
What is behind this trend?

And why do we observe this trend now, instead of 30 years ago or 30 years from now? There are two main reasons:
Contemporary biology generates huge mountains of data. Drawing biologically meaningful inferences from these data requires analysis in the framework of good mathematical models. Hence mathematics has become a necessary tool for biology.
Currently available computer power allows us to investigate sufficiently detailed mathematical models to draw biologically realistic inferences. Thus mathematics has become a useful tool for biology.

5
Biomathematics vs. bioinformatics

Everything that has been said so far about biomathematics could also be said about bioinformatics.
What is the difference between the two areas?
Biomathematics: Applications of mathematics to biology.
Bioinformatics: The design, implementation, and use of computer algorithms to draw inferences from massive sets of biomolecular data. It is an interdisciplinary field that draws on knowledge from biology, biochemistry, statistics, mathematics, and computer science.

6
Example of a huge data set: GenBank

The first viral genome was published in the 1980s; the first bacterial genome, H. influenzae, 1.83 × 10^6 bp, in 1995.
The first genome of a multicellular organism, C. elegans, 10^8 bp, in 1998. The sketch of our own genome, H. sapiens, π × 10^9 bp, was announced in June 2000.
As of February 2008, GenBank contained 85,759,586,764 bp of information.
How to draw concrete inferences from such huge mountains of information?

7
Where are the genes?

Let us look, for example, at our own genome. The information about it is written in GenBank as a sequence of π × 10^9 letters that would fill a million tightly typed pages, the equivalent of several thousand novels:
...actggtacctgtatatggacgctccatatttaatgcgcgatgcaggatctaaa...
Less than 1.5% of this sequence codes proteins. How to find these genes?
No human can read the whole sequence. A computer can read it easily, in a few seconds. So, maybe the computer will tell us where the genes are, where they start, and where they end.
But what is the computer supposed to compute?

8
Honest Craig's Casino

This is a casino in Nevada where one plays 64-number roulette. In each round, a player bets chips on three among those 64 numbers. If one of these three chosen numbers comes up, honest Craig will pay a suitable premium. If not, the player loses the chips.
QUESTION: How long does it take, on average, for a winning number to come up?

9
Honest Craig's Casino

ANSWER: 64/3 ≈ 21.33 rounds.
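The number of rounds until the first win follows a geometric distribution with success probability 3/64, so its mean is 1/(3/64) = 64/3. A minimal sketch that checks this both exactly and by simulation; the function names, the seed, and the number of trials are illustrative assumptions, not part of the original slides:

```python
import random
from fractions import Fraction


def expected_rounds(winning: int = 3, total: int = 64) -> Fraction:
    """Mean of a geometric distribution with success probability winning/total."""
    p = Fraction(winning, total)
    return 1 / p


def simulate_mean_wait(trials: int = 20_000, winning: int = 3,
                       total: int = 64, seed: int = 0) -> float:
    """Estimate the mean number of rounds up to and including the first win."""
    rng = random.Random(seed)
    rounds_played = 0
    for _ in range(trials):
        rounds = 1
        # Outcomes 0..winning-1 stand for the three numbers the player bet on.
        while rng.randrange(total) >= winning:
            rounds += 1
        rounds_played += rounds
    return rounds_played / trials


if __name__ == "__main__":
    print("exact:    ", expected_rounds(), "=", float(expected_rounds()))
    print("simulated:", simulate_mean_wait())
```

The simulated value fluctuates around 21.33 from run to run; only the exact fraction is deterministic.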

10
Probability of long waiting times

Let us assume that Craig is as honest as he claims.
Then the probability P(k) that our player will keep losing throughout the first k rounds is (61/64)^k. In particular, starting from k = 50 we obtain the following probabilities:
P(50) = 0.0907   P(51) = 0.0864   P(52) = 0.0824   P(53) = 0.0785   P(54) = 0.0748
P(55) = 0.0713   P(56) = 0.0680   P(57) = 0.0648   P(58) = 0.0618   P(59) = 0.0589
P(60) = 0.0561   P(61) = 0.0535   P(62) = 0.0510   P(63) = 0.0486   P(64) = 0.0463
P(65) = 0.0441   P(66) = 0.0421   P(67) = 0.0401   P(68) = 0.0382   P(69) = 0.0364
P(100) = 0.0082   P(200) = 0.000068   P(300) = 0.00000056
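The probabilities above follow directly from the formula for P(k). A minimal sketch that recomputes them and finds the shortest losing streak whose probability drops below a chosen threshold; the function names and the default threshold of 0.05 are assumptions made for illustration:

```python
def p_losing_streak(k: int, winning: int = 3, total: int = 64) -> float:
    """Probability of losing each of the first k rounds at an honest table."""
    return ((total - winning) / total) ** k


def shortest_rare_streak(threshold: float = 0.05) -> int:
    """Smallest k for which P(k) falls strictly below the threshold."""
    k = 0
    while p_losing_streak(k) >= threshold:
        k += 1
    return k


if __name__ == "__main__":
    for k in range(50, 70):
        print(f"P({k}) = {p_losing_streak(k):.4f}")
    print("shortest streak with P(k) < 0.05:", shortest_rare_streak())
```

With the default threshold the search stops at k = 63, the first entry in the list that lies below 0.05.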

11
Some statistical terminology

The assumption that Craig is as honest as he claims will be our null hypothesis. The suspicion that he is cheating after all is our alternative hypothesis. The number of losses that precede the first winning round will be our test statistic. The p-value is the probability that the test statistic takes the observed or a more extreme value under the assumption of the null hypothesis. If the p-value falls below our agreed-upon significance level, we are justified in rejecting the null hypothesis. In science, the most commonly used significance level is 0.05. Falsely accusing honest Craig of cheating would be a Type I error; trusting him when he is in fact cheating would be a Type II error.
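These terms can be made concrete for the roulette example. The sketch below is only an illustration: the function names are invented here, and the "cheating" win probability of 1/64 used in the demo is a purely hypothetical value, not something claimed in the slides.

```python
def p_value(observed_losses: int, p_win: float = 3 / 64) -> float:
    """P(at least `observed_losses` losses before the first win) under the null."""
    return (1 - p_win) ** observed_losses


def reject_null(observed_losses: int, alpha: float = 0.05) -> bool:
    """Reject the honesty hypothesis when the p-value falls below alpha."""
    return p_value(observed_losses) < alpha


def type_ii_error_rate(true_p_win: float, alpha: float = 0.05,
                       null_p_win: float = 3 / 64) -> float:
    """Chance of NOT rejecting the null when the true win probability differs."""
    # Shortest losing streak that leads to rejection under the null.
    threshold = 0
    while p_value(threshold, null_p_win) >= alpha:
        threshold += 1
    # We reject only if the player really loses `threshold` rounds in a row.
    power = (1 - true_p_win) ** threshold
    return 1 - power


if __name__ == "__main__":
    print("reject after 62 losses?", reject_null(62))
    print("reject after 63 losses?", reject_null(63))
    print("Type I error rate of the 63-loss rule:", round(p_value(63), 4))
    # Hypothetical cheat: only one of the three chosen numbers can ever win.
    print("Type II error rate if p_win = 1/64:",
          round(type_ii_error_rate(1 / 64), 3))
```

Under these assumptions a single observed streak is a weak detector: even a table that pays out only a third as often as advertised would usually escape rejection.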

12
Craig Venter's Lab

In 1995 Craig Venter's team sequenced the genome of the bacterium H. influenzae. If we want to detect the positions of its 1740 genes that code proteins in its sequence of 1,830,140 base pairs, we can reason as follows: In bacteria almost all the genome codes proteins. Let us start from position n and read triplets (n, n+1, n+2), (n+3, n+4, n+5), ...

13
Craig Venter's Lab

If we read in the correct reading frame, we will read a sequence of codons that ends with a STOP codon, that is, TAA, TGA, or TAG.

14
Craig Venter's Lab

Such a STOP codon will appear on average once in about 300 triplets.
If we read in one of the other five reading frames, we will read garbage, that is, a more or less random sequence of triplets, and one of the triplets TAA, TGA, TAG will be encountered on average once every 64/3 ≈ 21.33 positions.
Rings a bell?
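The six reading frames are the three offsets on the given strand plus the three offsets on its reverse complement. A minimal sketch of how a sequence can be cut into triplets in each frame; the frame labels "+1" to "-3" and the function names are conventions chosen for this illustration:

```python
STOP_CODONS = frozenset({"TAA", "TGA", "TAG"})
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(_COMPLEMENT)[::-1]


def codons(strand: str, offset: int) -> list[str]:
    """Complete triplets of `strand` starting at position `offset`."""
    return [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]


def six_frames(seq: str) -> dict[str, list[str]]:
    """Map frame labels '+1'..'+3', '-1'..'-3' to their lists of triplets."""
    seq = seq.upper()
    frames = {}
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{sign}{offset + 1}"] = codons(strand, offset)
    return frames


if __name__ == "__main__":
    for label, triplets in six_frames("ATGAAATAGC").items():
        stops = sum(t in STOP_CODONS for t in triplets)
        print(label, triplets, "STOP codons:", stops)
```

On a real genome one would count the spacing between STOP codons in each frame; the toy sequence here is only long enough to show how the frames differ.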

15
This is the same problem!

With minor modifications: Now our null hypothesis will be that we read in the wrong reading frame; the alternative hypothesis will be that we read a coding sequence in the correct reading frame. If we don't encounter a STOP codon while reading 63 successive triplets, we can reject the null hypothesis at significance level 0.05 and conclude that we found a sequence that codes a protein whose end is easy to find.
So we can design an easy gene-finding algorithm based on finding these so-called ORFs (open reading frames).
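A minimal sketch of such an ORF scan, under simplifying assumptions: it reports every run of at least 63 STOP-free triplets in any of the six frames, ignores START codons entirely, and gives coordinates on the minus strand relative to the reverse complement. The function names and the tuple layout are invented for this illustration; this is not the algorithm of any particular gene finder.

```python
STOP_CODONS = frozenset({"TAA", "TGA", "TAG"})
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def stop_free_runs(strand: str, offset: int, min_codons: int):
    """Yield (start, end) for runs of >= min_codons triplets without a STOP.

    `start` is the index of the first base of the run and `end` is the index
    just past its last triplet, both on `strand`.
    """
    run_start = offset
    for i in range(offset, len(strand) - 2, 3):
        if strand[i:i + 3] in STOP_CODONS:
            if (i - run_start) // 3 >= min_codons:
                yield run_start, i
            run_start = i + 3
    # A run may also reach the end of the sequence without hitting a STOP.
    end = offset + ((len(strand) - offset) // 3) * 3
    if (end - run_start) // 3 >= min_codons:
        yield run_start, end


def find_orfs(seq: str, min_codons: int = 63):
    """Return (frame label, start, end) for every long STOP-free run."""
    seq = seq.upper()
    hits = []
    for sign, strand in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in stop_free_runs(strand, offset, min_codons):
                hits.append((f"{sign}{offset + 1}", start, end))
    return hits


if __name__ == "__main__":
    # Toy input with a lowered threshold so that the short example has a hit.
    print(find_orfs("GCTGCTGCTTAAGCT", min_codons=3))
```

The default of 63 triplets matches the significance level 0.05 used above; lowering `min_codons` trades fewer missed short genes for more spurious hits.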

16
Some caveats

The beginning of the gene is somewhat more
difficult to determine, since ATG is both the
START codon and the codon for methionine, and the
promoter is also part of the gene.
The garbage in the other five reading frames is
not completely random.
This approach will miss all genes that code
proteins shorter than 63 amino acids (type ?
error) and will sometimes discover spurious genes
(type ? error).
This approach is unsuitable for discovering
RNA-coding genes.
However, the above problems can be solved, and
there
exist good gene-finding algorithms based on this
idea.
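One simple way in which the other reading frames fail to be "completely random" is uneven base composition. The sketch below assumes bases are drawn independently with given frequencies, which is itself only a rough model, and the AT-rich frequencies in the demo are hypothetical values chosen for illustration, not measurements from the slides.

```python
STOP_CODONS = ("TAA", "TGA", "TAG")


def stop_probability(freq: dict[str, float]) -> float:
    """Chance that a triplet of independently drawn bases is a STOP codon."""
    return sum(freq[c[0]] * freq[c[1]] * freq[c[2]] for c in STOP_CODONS)


def streak_for_significance(p_stop: float, alpha: float = 0.05) -> int:
    """Smallest number k of STOP-free triplets with (1 - p_stop)^k < alpha."""
    k = 0
    while (1 - p_stop) ** k >= alpha:
        k += 1
    return k


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    at_rich = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}  # hypothetical
    for name, freq in (("uniform", uniform), ("AT-rich", at_rich)):
        p = stop_probability(freq)
        print(name, "P(STOP) =", round(p, 4),
              "threshold =", streak_for_significance(p))
```

Because all three STOP codons are rich in A and T, an AT-rich genome produces STOP codons more often in random frames, so a shorter STOP-free run already becomes significant; a GC-rich genome would push the threshold the other way.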

17
Craig Venter's lab in 2000

But now let us look at the genome of H. sapiens:
Protein-coding regions constitute only a small fraction of our genome.
All by itself, this would lead to a lot more Type I errors than in prokaryotes.

18
Craig Venter's lab in 2000

The coding sequences, exons, are interspersed with introns.
A given codon may be split by an intron.
Consecutive exons don't have to sit in the same reading frame.
Introns look similar to random sequences.
So we are faced with a much more difficult problem.
Nowadays there exist pretty good algorithms for finding genes in eukaryotes. But:
No algorithm for finding genes in prokaryotes will work here.
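A toy illustration of why introns break the simple ORF scan. The sequence, the exon coordinates, and the function names below are all made up for this sketch; real exon boundaries have to be predicted, which is the hard part.

```python
def splice(genomic: str, exons: list[tuple[int, int]]) -> str:
    """Join the exons, given as half-open (start, end) index pairs."""
    return "".join(genomic[start:end] for start, end in exons)


def exon_phases(exons: list[tuple[int, int]]) -> list[int]:
    """Coding bases consumed before each exon, modulo 3.

    A nonzero phase means the exon begins in the middle of a codon,
    i.e. the preceding intron splits that codon.
    """
    phases, consumed = [], 0
    for start, end in exons:
        phases.append(consumed % 3)
        consumed += end - start
    return phases


def genomic_frames(exons: list[tuple[int, int]]) -> list[int]:
    """Reading frame (0, 1 or 2) of each exon's codons in genome coordinates."""
    return [(start - phase) % 3
            for (start, _), phase in zip(exons, exon_phases(exons))]


if __name__ == "__main__":
    # Hypothetical gene: exon "ATGGC", intron "GTTAGAG", exon "TAAATAG".
    genomic = "ATGGC" + "GTTAGAG" + "TAAATAG"
    exons = [(0, 5), (12, 19)]
    print("spliced coding sequence:", splice(genomic, exons))
    print("exon phases:            ", exon_phases(exons))
    print("genomic reading frames: ", genomic_frames(exons))
```

In this toy gene the codon GCT is split by the intron (GC in the first exon, T in the second), and the two exons sit in different genomic reading frames, so reading straight through the genome in a single frame would not recover the coding sequence.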

19
Mathematics and mathematicians

Mathematics is a great language for elucidating
the common structure in apparently unrelated
problems.
Mathematicians have a tendency to talk about
complicated theories in their jargon instead of
giving simple and concrete answers.
Mathematical microscopes often don't come with
a simple user's manual. In order to successfully
use them, one needs to understand to some extent
how they work. The choice of the most
appropriate mathematical microscope for a given
biological problem often requires active
cooperation between mathematicians and
biologists.
The key to success in this type of cooperation is
finding a common language and mutual
understanding of and respect for the two
different intellectual approaches.
Mathematical models form the basis for
formulating hypotheses, often in the form of
probabilities.
The final interpretation of these hypotheses and
their experimental verification belongs to the
biologists. Thus mathematical microscopes will
not make the more traditional ones redundant.
In points 3-6, feel free to substitute
"bioinformatics" for "mathematics".