Title: Tentative definition of bioinformatics
1 Tentative definition of bioinformatics
- Bioinformatics, often also called genomics,
computational genomics, or computational biology,
is a new interdisciplinary field at the
intersection of biology, computer science,
statistics, and mathematics. Its subject matter
is the extraction of biologically useful
information from large sets of molecular data,
such as DNA or protein sequence data or gene
expression data. The term bioinformatics is
currently used mainly to refer to the extraction
of information from sequence data, while the
creation and analysis of gene expression data is
called functional genomics.
2 Biology's dilemma: There is too much to know about living things
- Roughly 1.5 million species of organisms have
been described and given scientific names to date.
Some biologists estimate that the total number of all
living species may be several times higher. It is
impossible to learn everything about all these organisms.
Biologists solve the dilemma by focusing on some species,
so-called model organisms, and trying to find out as much
as they can about these model organisms.
3 Some important model organisms
- Mammals: Human, chimpanzee, mouse, rat
- Fish: Zebrafish, pufferfish
- Insects: Fruit fly (Drosophila melanogaster)
- Roundworms: Caenorhabditis elegans
- Protista: Malaria parasite (Plasmodium falciparum)
- Fungi: Baker's yeast (Saccharomyces cerevisiae)
- Plants: Thale cress (Arabidopsis thaliana), corn, rice
- Bacteria: Escherichia coli, Mycoplasma genitalium
- Archaea: Methanococcus jannaschii
4 Let's find out everything about some species
- What would it mean to learn everything about a
given species? All available evidence indicates that
the complete blueprint for making an organism is encoded in
the organism's genome. Chemically, the genome
consists of one or several DNA molecules. These are long
strings composed of pairs of nucleotides. There are only
four different nucleotides, denoted by A, C, G, T.
The information about how to make the organism is
encoded by the order in which the nucleotides appear.
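To make the "string over a four-letter alphabet" picture concrete, here is a minimal Python sketch. It assumes the standard base pairing (A with T, C with G), which the slide alludes to with "pairs of nucleotides" but does not spell out; the function names are made up for this illustration.

```python
# Standard Watson-Crick pairing (assumption: not stated on the slide itself).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def base_counts(strand: str) -> dict:
    """Count how often each of the four nucleotides occurs in a strand."""
    return {base: strand.count(base) for base in "ACGT"}


def reverse_complement(strand: str) -> str:
    """Return the partner strand, read in its own 5'-to-3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))


if __name__ == "__main__":
    toy = "ATGCCGTA"  # hypothetical toy sequence, not real data
    print(base_counts(toy))
    print(reverse_complement(toy))
```

Because one strand determines the other, it is enough to store a single string per DNA molecule.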
5 Some genome sizes
- HIV2 virus: 9671 bp
- Mycoplasma genitalium: 5.8 × 10^5 bp
- Haemophilus influenzae: 1.83 × 10^6 bp
- Saccharomyces cerevisiae: 1.21 × 10^7 bp
- Caenorhabditis elegans: 10^8 bp
- Drosophila melanogaster: 1.65 × 10^8 bp
- Homo sapiens: 3.14 × 10^9 bp
- Some amphibians: 8 × 10^10 bp
- Amoeba dubia: 6.7 × 10^11 bp
6 Sequencing genomes
- Contemporary technology makes it possible to
completely sequence entire genomes, that is, determine the
sequence of As, Cs, Gs, and Ts in the organism's
genome. The first virus was sequenced in the 1970s, the
first bacterium (Haemophilus influenzae) in 1995, the
first multicellular organism (Caenorhabditis elegans)
in 1998. A draft of the human genome was announced in
2000.
7 Where to store all these data?
- In databases, of course. Some of the sequence
data are stored in proprietary databases, but most of
them are stored in the public database GenBank and can be
accessed via the World Wide Web. In fact, most
relevant journals require proof of submission to GenBank
before an article discussing sequence data will be
published. The URL for GenBank is
http://www.ncbi.nlm.nih.gov/Genbank/
8 What's in the databases?
- In 1981, GenBank contained less than 500,000 bp of info.
- In 1986, GenBank contained 9,615,371 bp of info.
- In 1991, GenBank contained 71,947,426 bp of info.
- In 1996, GenBank contained 651,972,984 bp of info.
- In 2001, GenBank contained 15,849,921,438 bp of info.
- In 2004, GenBank contained 37,893,844,733 bp of info.
- In 2009, GenBank contained 106,533,156,756 bp of info.
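One way to read these figures is as a doubling time. The sketch below uses only the numbers listed above and assumes roughly exponential growth between the two chosen years, which is a simplification; the helper name doubling_time is made up for this illustration. Over 1986 to 2009 it works out to somewhat under two years per doubling.

```python
from math import log2

# Base-pair totals copied from the list above (1981 is omitted because
# only an upper bound is given for that year).
GENBANK_BP = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}


def doubling_time(year_a: int, year_b: int, sizes: dict = GENBANK_BP) -> float:
    """Average number of years per doubling between two listed years."""
    doublings = log2(sizes[year_b] / sizes[year_a])
    return (year_b - year_a) / doublings


if __name__ == "__main__":
    print(round(doubling_time(1986, 2009), 2), "years per doubling")
```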
9 What's in the databases?
- On March 18, 2005 there were 1791 completely
sequenced viruses, 204 completely sequenced bacteria,
21 completely sequenced archaea, and 9 complete
genomes of eukaryotes, among them two yeasts, the
roundworm C. elegans, the fruit fly Drosophila
melanogaster, the mosquito A. gambiae, the
malaria parasite P. falciparum, and the plant Arabidopsis
thaliana (thale cress). There are also drafts of 11 other
genomes of eukaryotes, most notably of the human genome.
10 What's in the databases?
- On December 17, 2010 there were
- 3518 completely sequenced viruses,
- 952 completely sequenced bacteria,
- 68 completely sequenced archaea,
- and 73 complete genomes of eukaryotes,
among them cow, wolf, horse, human, a
monkey, pig, chimpanzee.
11 First challenge: Sequencing large genomes
- Currently, much of the sequencing process is
automated. However, contemporary sequencing machines can
only sequence stretches of DNA that are a few hundred
base pairs long at a time. The process of assembling
these stretches of sequence into a whole genome poses
some interesting mathematical problems.
12 First challenge: Sequencing large genomes
- For example, the publicly financed Human Genome
Project uses an approach called genome mapping to
facilitate sequence assembly. Celera Genomics, a private
enterprise, announced that it would be able to
complete the sequencing of the entire human genome much
faster by using an approach called shotgun sequencing.
There was much debate over the feasibility of the
latter approach, but it apparently worked. At its core,
this was a debate over the mathematics of sequence assembly.
13 You have sequenced your genome - what do you do with it?
- This is known as genome analysis or sequence
analysis. At present, most of bioinformatics is concerned
with sequence analysis. Here are some of the
questions studied in sequence analysis:
- gene finding
- protein 3D structure prediction
- gene function prediction
- prediction of important sites in proteins
- reconstruction of phylogenies
14 Genes and proteins
- The genome controls the making and workings of an
organism by telling the cell which proteins to
manufacture under which conditions. Proteins are the
workhorses of biochemistry and play a variety of roles.
- A gene is a stretch of DNA that codes for a given
protein.
15 Where are the genes?
- The objective of gene finding is to identify the
regions of DNA that are genes. Ideally, we want to make
statements like "Positions 28,354 through 29,536 of this
genome code for a protein."
- The mathematical challenge here is to identify
patterns in DNA that reliably indicate where a gene starts
and ends, especially in eukaryotes.
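The simplest such pattern is an open reading frame (ORF): a start codon followed, in the same reading frame, by a stop codon. The Python sketch below scans only the forward strand and assumes the standard start codon ATG and stop codons TAA, TAG, TGA. It ignores introns and the reverse strand, so it illustrates the idea rather than a usable gene finder, especially for eukaryotes; find_orfs and min_codons are names made up for this illustration.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (first, last) 1-based positions of ORFs on the forward strand.

    Each ORF runs from a start codon to the next in-frame stop codon,
    inclusive, and must contain at least min_codons codons before the stop.
    """
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == START:
                start = pos
            elif start is not None and codon in STOPS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start + 1, pos + 3))
                start = None
    return sorted(orfs)


if __name__ == "__main__":
    toy = "CCATGAAATTTTAGGG"  # hypothetical toy sequence
    for first, last in find_orfs(toy):
        print(f"Positions {first} through {last} may code for a protein.")
```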
16 Protein structure prediction
- When a protein is manufactured in the cell, it
assumes a characteristic 3D structure or fold. It is very
costly to determine the 3D structure of a protein
experimentally (by NMR or X-ray crystallography). It would be much
cheaper if we could predict the 3D structure of a protein
directly from its primary structure, i.e., from the
sequence of its amino acids. This is known as the protein
folding problem. Many approaches have been proposed to develop
algorithms for solving this problem; so far, results are
mixed.
17 Prediction of protein function
- Suppose you have identified a gene. What is its
role in the biochemistry of its organism? Sequence databases
can help us in formulating reasonable hypotheses.
- Search the database for proteins with similar
amino acid sequences in other organisms.
- If the functions of the most similar proteins are
known and if they tend to be the same function
(e.g., enzyme involved in glucose metabolism),
then it is reasonable to conjecture that your
gene also codes for an enzyme involved in glucose
metabolism.
18 Prediction of protein function: homology searches
- Given a DNA or protein sequence, searching the
database(s) for similar sequences is known as a
homology search. The most popular software tool for
performing these searches is called BLAST; therefore
biologists often speak of BLAST searches. There are two
interesting problems here:
- How should the similarity of two sequences be measured?
- How much similarity constitutes evidence of
biologically meaningful homology as opposed to
random chance?
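One classical answer to the first question is a global alignment score computed by dynamic programming (the Needleman-Wunsch recurrence). The Python sketch below illustrates that idea only; it is not how BLAST works internally, since BLAST relies on faster heuristics and substitution matrices. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for this illustration.

```python
def global_alignment_score(s: str, t: str,
                           match: int = 1,
                           mismatch: int = -1,
                           gap: int = -2) -> int:
    """Best score of any end-to-end alignment of s and t."""
    # prev[j] holds the best score for aligning the processed prefix of s
    # with the first j letters of t.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            pair = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align s[i-1] with t[j-1]
                            prev[j] + gap,        # gap in t
                            curr[j - 1] + gap))   # gap in s
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

The second question is statistical: a score only becomes meaningful when compared with the scores expected for unrelated sequences of the same lengths.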
19 Prediction of important sites in proteins
- Not all parts of a protein are equally important:
the function of most of its amino acids is often just
to maintain an appropriate 3D structure, and mutations of
those less crucial amino acids often don't have much effect.
- However, most proteins have crucial parts such as
binding sites. Mutations occurring at binding
sites tend to be lethal and will be weeded out by evolution.
20 How to predict binding sites from sequence data
- Get a collection of proteins of similar amino
acid sequences and analogous biochemical function
from your database.
- Align these sequences amino acid by amino acid.
- Check which regions of the protein are highly
conserved in the course of evolution.
- The binding site should be in one of the highly
conserved regions.
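The last two steps can be sketched in Python as follows, assuming the sequences have already been aligned. The conservation measure (fraction of sequences sharing the most common residue in a column), the threshold, and the toy sequences are all assumptions made for this illustration, not a standard taken from the slides.

```python
from collections import Counter


def column_conservation(alignment):
    """Fraction of sequences sharing the most common residue, per column."""
    n = len(alignment)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]  # gaps never count
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / n)
    return scores


def conserved_regions(scores, threshold: float = 1.0, min_len: int = 3):
    """Return (first, last) 1-based column ranges scoring >= threshold."""
    regions, start = [], None
    for i, score in enumerate(scores + [0.0]):  # sentinel closes a final run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start + 1, i))
            start = None
    return regions


if __name__ == "__main__":
    # Hypothetical aligned protein fragments, one string per organism.
    toy_alignment = ["MKTAHGDSW",
                     "MRTAHGDTW",
                     "MKSAHGDSY",
                     "LKTAHGD-W"]
    scores = column_conservation(toy_alignment)
    print(conserved_regions(scores))
```

A fully conserved stretch is only a candidate: conservation may also reflect structural constraints rather than binding.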
21 The importance of being aligned
- DNA and protein molecules evolve mostly by three
processes: point mutations (exchange of a single
letter for another), insertions, and deletions. If a group
of homologous proteins from different organisms has
been identified, it is assumed that these proteins
have evolved from a common ancestor. The process of multiple
sequence alignment aims at identifying loci in
the individual molecules that are derived from a
common ancestral locus. These form the columns of the
alignment.
22 Example of a multiple alignment
    A T G - - T T C G G A C T
    A C G A A T C C A G - C T
    - C G A A T C C T A A C C
    - T G A G C A C T A A C C
23 Reconstruction of phylogenetic trees
- A phylogenetic tree depicts the evolutionary
history of a group of species. By observing similarities and
differences between species, we may be able to reconstruct
their phylogeny. Classically, the degree of similarity
between two species has been assessed from morphological
characters. By comparing genomic sequence data,
we actually can quantify the degree of similarity
between any two species, and use these degrees of similarity
as a basis for reconstructing phylogenetic trees.
24 Reconstruction of phylogenetic trees
- The most common approach to using genomic data
for reconstruction of phylogenetic trees is to look
at genes with analogous function and thus supposedly
common ancestry and see how far the genes taken from the
extant organisms have diverged.
- The observed differences in the amino acid or
nucleotide sequences are then used to reconstruct the
phylogeny. The current partition of organisms into
eubacteria, archaea, and eukarya was discovered in this
way by analyzing rRNA.
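A first step toward a tree is a table of pairwise distances. The Python sketch below computes the fraction of differing positions for each pair of aligned sequences, skipping columns where either sequence has a gap. It uses the four rows of the example multiple alignment shown earlier, read with "-" as the gap symbol; the labels seq1 to seq4 and the choice of distance measure are assumptions made for this illustration. Tree-building methods such as UPGMA or neighbor joining would start from a matrix like this one.

```python
def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions among columns without a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(seqs):
    """Symmetric matrix of pairwise p-distances."""
    n = len(seqs)
    return [[0.0 if i == j else p_distance(seqs[i], seqs[j])
             for j in range(n)] for i in range(n)]


def closest_pair(matrix):
    """Indices (i, j), i < j, of the two most similar sequences."""
    n = len(matrix)
    return min(((i, j) for i in range(n) for j in range(i + 1, n)),
               key=lambda ij: matrix[ij[0]][ij[1]])


if __name__ == "__main__":
    labels = ["seq1", "seq2", "seq3", "seq4"]  # hypothetical labels
    aligned = ["ATG--TTCGGACT",
               "ACGAATCCAG-CT",
               "-CGAATCCTAACC",
               "-TGAGCACTAACC"]
    d = distance_matrix(aligned)
    i, j = closest_pair(d)
    print(labels[i], labels[j], round(d[i][j], 3))
```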
25 The new frontier: Functional genomics
- It is fashionable nowadays to talk about
functional genomics. Many people use this term as if it
were a new discipline separate from bioinformatics, but I
think it is more appropriate to consider it a new subfield of
bioinformatics.
- The ultimate aim of functional genomics is to
understand what genes do, when they do it, and how they do
it. Ideally, we would like to understand the cell, or
organism, as a giant network of chemical pathways that
regulate each other.
26 Microarrays (gene chips)
- Microarrays, or gene chips, make it possible to monitor the
level of activity of all the genes represented on the chip
simultaneously under a variety of environmental
conditions, in various organs, and at various
stages of development.
- There are two types of challenges here: to
determine when a change in activity level detected by the
chip is statistically significant, and to use the data so
obtained to make inferences about gene regulation.
27 What do we do with all these data?
- The bread-and-butter method of microarray data
analysis is clustering. This makes it possible to identify,
for a sequence of experiments on the same set of
genes under various conditions, groups of genes that
are up- or down-regulated simultaneously. It is
believed that genes acting in the same chemical pathway
would normally belong to the same cluster. Some
algorithms for clustering will be discussed in
this course.
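As a rough illustration of what clustering means here, the Python sketch below groups genes whose expression profiles are strongly correlated across experiments. It is a simple threshold-based (single-linkage style) grouping chosen for brevity, not necessarily one of the algorithms covered in the course; the gene names, expression values, and the 0.9 threshold are all hypothetical.

```python
from math import sqrt


def pearson(x, y) -> float:
    """Pearson correlation of two profiles (assumes neither is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_by_correlation(profiles: dict, threshold: float = 0.9):
    """Put two genes in one cluster if a chain of correlated pairs links them."""
    genes = list(profiles)
    cluster_of = {g: i for i, g in enumerate(genes)}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            if pearson(profiles[g], profiles[h]) >= threshold:
                old, new = cluster_of[h], cluster_of[g]
                for k in genes:  # merge the two clusters
                    if cluster_of[k] == old:
                        cluster_of[k] = new
    clusters = {}
    for g in genes:
        clusters.setdefault(cluster_of[g], []).append(g)
    return sorted(clusters.values())


if __name__ == "__main__":
    # Hypothetical activity levels in four experiments.
    toy_profiles = {
        "geneA": [0.1, 1.0, 2.1, 3.0],
        "geneB": [0.0, 1.1, 1.9, 3.2],
        "geneC": [3.0, 2.0, 1.1, 0.2],
        "geneD": [2.8, 2.1, 0.9, 0.0],
    }
    print(cluster_by_correlation(toy_profiles))
```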