Tentative definition of bioinformatics - PowerPoint PPT Presentation

Provided by: Winf2

1
Tentative definition of bioinformatics

Bioinformatics, often also called genomics,
computational genomics, or computational biology,
is a new interdisciplinary field at the
intersection of biology, computer science,
statistics, and mathematics. Its subject matter
is the extraction of biologically useful
information from large sets of molecular data,
such as DNA or protein sequence data or gene
expression data. The term bioinformatics is
currently used mainly to refer to the extraction
of information from sequence data, while the
creation and analysis of gene expression data is
called functional genomics.

2
Biology's dilemma: There is too much to know
about living things

Roughly 1.5 million species of organisms have
been described and given scientific names to
date. Some biologists estimate that the total
number of all living species may be several times
higher. It is impossible to learn everything
about all these organisms. Biologists solve the
dilemma by focusing on some species, so-called
model organisms, and trying to find out as much
as they can about these model organisms.

3
Some important model organisms

Mammals: Human, chimpanzee, mouse, rat
Fish: Zebrafish, pufferfish
Insects: Fruit fly (Drosophila melanogaster)
Roundworms: Caenorhabditis elegans
Protista: Malaria parasite (Plasmodium falciparum)
Fungi: Baker's yeast (Saccharomyces cerevisiae)
Plants: Thale cress (Arabidopsis thaliana), corn, rice
Bacteria: Escherichia coli, Mycoplasma genitalium
Archaea: Methanococcus jannaschii

4
Let's find out everything about some species

What would it mean to learn everything about a
given species? All available evidence indicates
that the complete blueprint for making an
organism is encoded in the organism's genome.
Chemically, the genome consists of one or several
DNA molecules. These are long strings composed of
pairs of nucleotides. There are only four
different nucleotides, denoted by A, C, G, T. The
information about how to make the organism is
encoded by the order in which the nucleotides
appear.

5
Some genome sizes

HIV2 virus                   9671 bp
Mycoplasma genitalium        5.8 × 10^5 bp
Haemophilus influenzae       1.83 × 10^6 bp
Saccharomyces cerevisiae     1.21 × 10^7 bp
Caenorhabditis elegans       10^8 bp
Drosophila melanogaster      1.65 × 10^8 bp
Homo sapiens                 3.14 × 10^9 bp
Some amphibians              8 × 10^10 bp
Amoeba dubia                 6.7 × 10^11 bp
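
To get a feeling for these sizes, here is a rough sketch that converts base pairs into storage. It assumes a plain encoding of 2 bits per base (four nucleotides need two bits) and ignores compression, quality scores, and annotation. The function name is made up for illustration.

```python
# Genome sizes in base pairs, as listed on this slide.
genome_sizes_bp = {
    "HIV2 virus": 9671,
    "Mycoplasma genitalium": 5.8e5,
    "Haemophilus influenzae": 1.83e6,
    "Saccharomyces cerevisiae": 1.21e7,
    "Caenorhabditis elegans": 1e8,
    "Drosophila melanogaster": 1.65e8,
    "Homo sapiens": 3.14e9,
    "Some amphibians": 8e10,
    "Amoeba dubia": 6.7e11,
}


def megabytes_at_2_bits_per_base(bp):
    """Storage in megabytes (10^6 bytes), assuming 2 bits per base."""
    bits = bp * 2
    return bits / 8 / 1e6


for name, bp in genome_sizes_bp.items():
    print(f"{name:26s} {megabytes_at_2_bits_per_base(bp):12.3f} MB")
```

Under this assumption the human genome fits in well under one gigabyte, so the raw sequence is small by modern storage standards.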

6
Sequencing Genomes

Contemporary technology makes it possible to
completely sequence entire genomes, that is, to
determine the sequence of A's, C's, G's, and T's
in the organism's genome. The first viral genome
was sequenced in the 1970s, the first bacterium
(Haemophilus influenzae) in 1995, and the first
multicellular organism (Caenorhabditis elegans)
in 1998. A draft of the human genome was
announced in 2000.

7
Where to store all these data?

In databases, of course. Some of the sequence
data are stored in proprietary databases, but
most of them are stored in the public database
GenBank and can be accessed via the World Wide
Web. In fact, most relevant journals require
proof of submission to GenBank before an article
discussing sequence data will be published.
The URL for GenBank is
http://www.ncbi.nlm.nih.gov/Genbank/

8
What's in the databases?

In 1981, GenBank contained fewer than 500,000 bp.
In 1986, GenBank contained 9,615,371 bp.
In 1991, GenBank contained 71,947,426 bp.
In 1996, GenBank contained 651,972,984 bp.
In 2001, GenBank contained 15,849,921,438 bp.
In 2004, GenBank contained 37,893,844,733 bp.
In 2009, GenBank contained 106,533,156,756 bp.
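
These figures suggest roughly exponential growth. The sketch below estimates a doubling time for each interval, assuming growth is exponential between consecutive data points. That assumption is a simplification for illustration, not a claim from the slides. The 1981 entry is left out because only an upper bound is given.

```python
import math

# Base-pair counts quoted on this slide (year -> bp in GenBank).
genbank_bp = {
    1986: 9_615_371,
    1991: 71_947_426,
    1996: 651_972_984,
    2001: 15_849_921_438,
    2004: 37_893_844_733,
    2009: 106_533_156_756,
}


def doubling_time_years(year_a, bp_a, year_b, bp_b):
    """Years per doubling, assuming exponential growth between two points."""
    doublings = math.log2(bp_b / bp_a)
    return (year_b - year_a) / doublings


years = sorted(genbank_bp)
for a, b in zip(years, years[1:]):
    t = doubling_time_years(a, genbank_bp[a], b, genbank_bp[b])
    print(f"{a}-{b}: doubling time of about {t:.2f} years")
```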

9
What's in the databases?

On March 18, 2005 there were 1791 completely
sequenced viruses, 204 completely sequenced
bacteria, 21 completely sequenced archaea, and 9
complete genomes of eukaryotes, among them two
yeasts, the roundworm C. elegans, the fruit fly
Drosophila melanogaster, the mosquito A. gambiae,
the malaria parasite P. falciparum, and the plant
Arabidopsis thaliana (thale cress). There are
also drafts of 11 other genomes of eukaryotes,
most notably of the human genome.

10
What's in the databases?

On December 17, 2010 there were
3518 completely sequenced viruses,
952 completely sequenced bacteria,
68 completely sequenced archaea,
and 73 complete genomes of Eukaryotes,
among them cow, wolf, horse, human, a
monkey, pig, chimpanzee.

11
First challenge: Sequencing large genomes

Currently, much of the sequencing process is
automated. However, contemporary sequencing
machines can only sequence stretches of DNA that
are a few hundred base pairs long at a time. The
process of assembling these stretches of sequence
into a whole genome poses some interesting
mathematical problems.
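
To make the assembly problem concrete, here is a toy sketch of one simple idea: repeatedly merge the two reads whose ends overlap the most. This greedy strategy is an illustrative assumption, not the method used by any particular sequencing project. It assumes error-free reads from a single strand, and it can be fooled by repeated regions in a genome. The function names and the `min_len` parameter are made up for this example.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no overlaps left; the remaining pieces stay separate
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three overlapping reads taken from the made-up sequence ATGGCGTACCTGA.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Each step compares every pair of reads, so this version is far too slow for real data. It only shows why overlaps between short stretches carry the information needed to rebuild the whole.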

12
First challenge: Sequencing large genomes

For example, the publicly financed Human Genome
Project used an approach called genome mapping to
facilitate sequence assembly. Celera Genomics, a
private enterprise, announced that it would be
able to complete the sequencing of the entire
human genome much faster by using an approach
called shotgun sequencing. There was much debate
over the feasibility of the latter approach, but
it apparently worked. At its core, this was a
debate over the mathematics of sequence assembly.

13
You have sequenced your genome - what do you do
with it?

This is known as genome analysis or sequence
analysis. At present, most of bioinformatics is
concerned with sequence analysis. Here are some
of the questions studied in sequence analysis:
gene finding
protein 3D structure prediction
gene function prediction
prediction of important sites in proteins
reconstruction of phylogenies

14
Genes and proteins

The genome controls the making and workings of an
organism by telling the cell which proteins to
manufacture under which conditions. Proteins are
the workhorses of biochemistry and play a variety
of roles.
A gene is a stretch of DNA that codes for a given
protein.

15
Where are the genes?

The objective of gene finding is to identify the
regions of DNA that are genes. Ideally, we want
to make statements like "Positions 28,354 through
29,536 of this genome code for a protein."
The mathematical challenge here is to identify
patterns in DNA that reliably indicate where a
gene starts and ends, especially in eukaryotes.
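
One very simple pattern of this kind is an open reading frame (ORF): a start codon ATG followed, in the same reading frame, by a stop codon. The sketch below scans the forward strand only and reports 1-based positions, in the style of the statement above. It is a deliberately naive illustration: it ignores the reverse strand and the introns that make eukaryotic gene finding hard. The `min_codons` threshold is an arbitrary assumption.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) 1-based inclusive positions of forward-strand ORFs.

    An ORF here runs from an ATG to the next in-frame stop codon and must
    span at least min_codons codons, counting the start and stop codons.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return sorted(orfs)


for start, end in find_orfs("CCATGAAATTTTAGGG"):
    print(f"Positions {start} through {end} may code for a protein.")
```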

16
Protein structure prediction

When a protein is manufactured in the cell, it
assumes a characteristic 3D structure or fold. It
is very costly to determine the 3D structure of a
protein experimentally (by NMR or X-ray
crystallography). It would be much cheaper if we
could predict the 3D structure of a protein
directly from its primary structure, i.e., from
the sequence of its amino acids. This is known as
the protein folding problem.
Many approaches have been proposed to develop
algorithms for solving this problem; so far,
results are mixed.

17
Prediction of protein function

Suppose you have identified a gene. What is its
role in the biochemistry of its organism?
Sequence databases can help us in formulating
reasonable hypotheses.
Search the database for proteins with similar
amino acid sequences in other organisms.
If the functions of the most similar proteins are
known and if they tend to be the same function
(e.g., enzyme involved in glucose metabolism),
then it is reasonable to conjecture that your
gene also codes for an enzyme involved in glucose
metabolism.
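
The reasoning on this slide can be sketched as a small voting rule. The input is assumed to be a list of database hits, each with a similarity score and an annotated function (or None if the function is unknown). The function name, the `top_n` cutoff, and the `min_agreement` threshold are all hypothetical choices for illustration.

```python
from collections import Counter


def conjecture_function(hits, top_n=5, min_agreement=0.8):
    """Guess a function from the most similar annotated hits.

    hits: list of (similarity_score, annotated_function_or_None).
    Returns the function shared by at least min_agreement of the top_n
    annotated hits, or None if there is no such agreement.
    """
    annotated = [(score, func) for score, func in hits if func is not None]
    best = sorted(annotated, key=lambda h: h[0], reverse=True)[:top_n]
    if not best:
        return None
    function, count = Counter(func for _, func in best).most_common(1)[0]
    return function if count / len(best) >= min_agreement else None


# Made-up hits, purely for illustration.
hits = [
    (0.92, "enzyme involved in glucose metabolism"),
    (0.90, "enzyme involved in glucose metabolism"),
    (0.88, None),
    (0.85, "enzyme involved in glucose metabolism"),
    (0.40, "membrane transporter"),
]
print(conjecture_function(hits, top_n=3))
```

The result is only a conjecture to be tested experimentally, as the slide says.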

18
Prediction of protein function: homology searches

Given a protein or DNA sequence, searching the
database(s) for similar sequences is known as a
homology search. The most popular software tool
for performing these searches is called BLAST;
therefore biologists often speak of BLAST
searches. There are two interesting problems
here:
How to measure similarity of two sequences.
How much similarity constitutes evidence of
biologically meaningful homology as opposed to
random chance?
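
Both questions can be illustrated with a small sketch. For the first, a classic similarity measure is the score of the best global alignment, computed by dynamic programming (the Needleman-Wunsch algorithm). For the second, one simple approach is to compare the observed score with scores obtained after shuffling one sequence. This is not how BLAST works: BLAST uses a fast heuristic for local alignments with its own statistics. The scoring values (match +1, mismatch -1, gap -2) and the shuffling null model are assumptions chosen for illustration.

```python
import random


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def empirical_p_value(a, b, trials=200, seed=0):
    """Fraction of shuffled versions of b that score at least as well as b."""
    rng = random.Random(seed)
    observed = global_alignment_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)
        if global_alignment_score(a, "".join(letters)) >= observed:
            hits += 1
    return (hits + 1) / (trials + 1)  # add-one keeps the estimate above zero


print(global_alignment_score("GATTACA", "GATTACA"))
print(global_alignment_score("ACGT", "AGT"))
print(empirical_p_value("GATTACAGATTACA", "GATTACAGATTACA"))
```

A small estimated p-value means that random rearrangements of the same letters rarely align as well, which is one way to argue that the similarity is not due to chance.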

19
Prediction of important sites in proteins

Not all parts of a protein are equally important;
the function of most of its amino acids is often
just to maintain an appropriate 3D structure, and
mutations of those less crucial amino acids often
don't have much effect.
However, most proteins have crucial parts such as
binding sites. Mutations occurring at binding
sites tend to be lethal and will be weeded out by
evolution.

20
How to predict binding sites from sequence data

Get a collection of proteins of similar amino
acid sequences and analogous biochemical function
from your database.
Align these sequences amino acid by amino acid.
Check which regions of the protein are highly
conserved in the course of evolution.
The binding site should be in one of the highly
conserved regions.
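
The last two steps can be sketched in code, assuming the sequences have already been aligned (gaps written as "-"). Conservation is measured here as the fraction of sequences that share the most common residue in each column. This is one simple choice among many; the threshold and minimum run length are arbitrary assumptions. The example sequences are made up.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue, per column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps do not count
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(aligned))
    return scores


def conserved_regions(scores, threshold=1.0, min_len=3):
    """Return (start, end) 0-based half-open runs with score >= threshold."""
    regions, start = [], None
    for i, score in enumerate(scores + [-1.0]):  # sentinel closes a final run
        if score >= threshold and start is None:
            start = i
        elif score < threshold and start is not None:
            if i - start >= min_len:
                regions.append((start, i))
            start = None
    return regions


aligned = ["MKHGDSLA", "MRHGDSIA", "MQHGDS-A"]
scores = column_conservation(aligned)
print(conserved_regions(scores))
```

Columns 2 through 5 (HGDS) are identical in all three sequences, so under this sketch they would be the candidate region for a binding site.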

21
The importance of being aligned