Title: Basic Bioinformatics
1Basic Bioinformatics
- As it is applied to the Bacillus megaterium
genome
2What we are going to talk about
- Why we are doing all this DNA sequencing
- What genes look like and where they are found
- How we can compare sequences between different
species
- How genes move between species
3DNA Sequencing
- Bioinformatics is based on the fact that DNA
sequencing is cheap, and becoming easier and
cheaper very quickly.
- The Human Genome Project cost roughly $3 billion
and took 13 years (1990-2003).
- Sequencing James Watson's genome in 2007 cost
$2 million and took 2 months.
- Today, you could get your genome sequenced for
about $100,000 and it would take a month.
- The Archon X Prize: you win $10 million if you
can sequence 100 human genomes in 10 days, at a
cost of $10,000 per genome.
- It is realistic to envision $100 per genome
within 10 years: everyone's genome could be
sequenced if they wanted or needed it.
4Why its useful
- All of the information needed to build an
organism is contained in its DNA. If we could
understand it, we would know how life works.
- Preventing and curing diseases like cancer (which
is caused by mutations in DNA) and inherited
diseases.
- Curing infectious diseases (everything from AIDS
and malaria to the common cold). If we
understand how a microorganism works, we can
figure out how to block it.
- Understanding genetic and evolutionary
relationships between species.
- Understanding genetic relationships between
humans. Projects exist to understand human
genetic diversity. Also, sequencing the
Neanderthal genome.
- Ancient DNA: currently it is thought that under
ideal conditions (continuously kept frozen),
there is a limit of about 1 million years for DNA
survival. So, Jurassic Park will probably remain
fiction.
5From DNA to Gene
- But extracting that information is difficult.
Converting a string of ACGTs into knowledge of
how the organism works is hard.
- Most of the work is on the computer, with key
confirming experiments done in the wet lab.
- The sequence below contains a gene critical for
life: the gene that initiates replication of the
DNA. Can you spot it?
- We are now going to spend some time on what genes
look like and how we can find them.
TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAA
AAAATTAAGCAAACCTAGTTTTGAAACCTG GCTCAAATCGACAAAAGCT
CATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGC
ACGGGACTGGT TAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTT
TATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATT CCTCCTA
ACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAAT
AAAGACGAAGGAGCAGAATTTCC TCAAAGCATGCTAAATTCGAAGTATA
CCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCT
TCTT TAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTA
TTTACGGGGGAGTAGGATTAGGCAAAACACACTTA ATGCACGCCATAGG
CCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCAT
CTGAAAAATTCACAAA CGAGTTTATTAACTCTATTCGTGACAATAAAGC
AGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTG AT
GATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCA
TACGTTTAATACGCTTCACGAAGAAAGC AAGCAGATTGTCATCTCAAGT
GATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCG
CTTTGAATG GGGCCTTATTACAGACATCACACCACCAGATTTGGAAACA
CGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCT TAGTTATTC
CAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGA
GAATTAGAAGGCGCACTTATT
6DNA
- DNA is just a long string of 4 letters
(nucleotides, or bases): Adenine, Guanine,
Cytosine, and Thymine.
- Which we will just refer to as A, C, G, and T
- and we are skipping lots of details
- Each DNA molecule has 2 strands, with the bases
paired in the center.
- A on one strand always pairs with T on the other
strand
- G pairs with C.
- The strands run in opposite directions (like
the two lanes of a road)
- Since the two DNA strands are complementary,
there is no need to write down both strands
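The pairing rules above are enough to rebuild the second strand from the first. A minimal Python sketch (the function names are just illustrative choices, and it assumes the sequence contains only A, C, G, and T):

```python
# Watson-Crick pairing: A <-> T and G <-> C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(seq: str) -> str:
    """Base on the opposite strand at each position (same left-to-right order)."""
    return seq.upper().translate(COMPLEMENT)

def reverse_complement(seq: str) -> str:
    """The opposite strand, written in its own reading direction."""
    return complement(seq)[::-1]

top = "ATGCCATC"
print(complement(top))          # TACGGTAG
print(reverse_complement(top))  # GATGGCAT
```

Because the strands run in opposite directions, the reverse complement is the form you would actually read on the other strand.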
7Chromosomes and Genes
- Each chromosome is a long piece of DNA.
- The B. megaterium genome is a circle (like most
bacteria) of about 5 million bases.
- Human chromosomes are 100-200 million bases long.
We have 46 chromosomes (2 sets of 23, one set
from each parent).
- Genes are just regions on that DNA. It is not
obvious where genes are if you look at a DNA
sequence.
- There is a lot of DNA that is not part of genes:
in humans only 2% at most of the DNA is part of
any gene.
- Bacteria use more of their DNA: 80% of the B. meg
chromosome is genes.
- B. meg has about 1 gene per 1000 base pairs (bp)
of DNA. About 5000 genes.
- Humans have about 25,000 genes.
- We are far more complicated than bacteria:
regulation of the genes is very complicated in
humans.
- We use the same gene in different ways in
different tissues.
8Genes and Proteins
- Most genes code for proteins: each gene contains
the information necessary to make one protein.
- Proteins are the most important type of
macromolecule.
- Structure: collagen in skin, keratin in hair,
crystallin in eye.
- Enzymes: all metabolic transformations (building
up, rearranging, and breaking down of organic
compounds) are done by enzymes, which are
proteins.
- Transport: oxygen in the blood is carried by
hemoglobin; everything that goes in or out of a
cell (except water and a few gases) is carried
by proteins.
- Also: nutrition (egg yolk), hormones, defense,
movement
9The Genetic Code
- Proteins are long chains of amino acids.
- There are 20 different amino acids coded in DNA.
- There are only 4 DNA bases, so you need 3 DNA
bases to code for the 20 amino acids (2 bases
would give only 4 x 4 = 16 combinations).
- 4 x 4 x 4 = 64 possible 3 base combinations
(codons)
- Each codon codes for one amino acid.
- Most amino acids have more than one possible
codon.
- Genes start at a start codon and end at a stop
codon.
- 3 codons (TAA, TAG, and TGA) are stop codons: all
genes end at a stop codon.
- Start codons are a bit trickier, since they are
used in the middle of genes as well as at the
beginning.
- In eukaryotes, ATG is always the start codon,
making Methionine (Met) the first amino acid in
all proteins (but in many proteins it is
immediately removed).
- In prokaryotes, ATG, GTG, or TTG can be used as a
start codon. B. meg prefers ATG, but about 30%
of the genes start with GTG or TTG.
In bioinformatics, we generally ignore the fact
that RNA uses the base uracil (U) in place of T.
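The counting on this slide can be checked in a few lines of Python. This is only a sketch: it builds the standard genetic code from a compact 64-letter string (one-letter amino acid codes, * for stop, codons ordered with the bases T, C, A, G), and the variable names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_aa = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))                        # 64
print(stops)                                   # ['TAA', 'TAG', 'TGA']
print(len(codons_per_aa))                      # 20
print(codons_per_aa["L"], codons_per_aa["M"])  # 6 1
```

Leucine (L) has six codons while methionine (M) has only one, which is what "most amino acids have more than one possible codon" means in practice.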
10Gene Expression
- How do you get a protein from a gene?
- A two-step process (called the Central Dogma of
Molecular Biology).
- First, the gene has to be copied (transcribed)
into an RNA form.
- The RNA copy (messenger RNA) is exactly like the
gene itself, except RNA replaces T with U.
- Most gene regulation (whether the gene is on or
off) happens here.
- Second, the RNA is translated into protein by
ribosomes, which are complex RNA/protein hybrid
machines.
- With the help of transfer RNA molecules, which
have one end that matches the 3 base codon and
the other end that is attached to the proper
amino acid.
- The ribosome starts at the start codon and moves
down the messenger RNA, adding one amino acid at
a time to the growing chain. When the ribosome
reaches a stop codon, it falls off, releasing the
new protein.
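The two steps can be imitated on the computer. The sketch below is a simplification, not a model of the real machinery: it assumes the gene is given as the coding strand starting exactly at the start codon, and the toy gene is made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def transcribe(gene: str) -> str:
    """Step 1: the mRNA copy is the gene with U in place of T."""
    return gene.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Step 2: read codons from the first base until a stop codon (or the end)."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # the table is keyed by DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":                    # stop codon: release the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

gene = "ATGAAAGCGAAATTAATTTAA"  # hypothetical toy gene, not a real B. meg sequence
mrna = transcribe(gene)
print(mrna)             # AUGAAAGCGAAAUUAAUUUAA
print(translate(mrna))  # MKAKLI
```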
11Reading Frames
- Here we get a bit subtle.
- Since codons consist of 3 bases, there are 3
reading frames possible on an RNA (or DNA),
depending on whether you start reading from the
first base, the second base, or the third base.
- The different reading frames give entirely
different proteins.
- Consider ATGCCATC, and refer to the genetic code.
(X is junk)
- Reading frame 1 divides this into ATG-CCA-TC,
which translates to Met-Pro-X
- Reading frame 2 divides this into A-TGC-CAT-C,
which translates to X-Cys-His-X
- Reading frame 3 divides this into AT-GCC-ATC,
which translates to X-Ala-Ile
- Each gene uses a single reading frame, so once
the ribosome gets started, it just has to count
off groups of 3 bases to produce the proper
protein.
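The ATGCCATC example can be reproduced with a short Python sketch. It uses one-letter amino acid codes (M = Met, P = Pro, C = Cys, H = His, A = Ala, I = Ile) and simply drops the incomplete "junk" pieces marked X above; the function name is illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_frame(dna: str, frame: int) -> str:
    """Translate every complete codon of reading frame 1, 2, or 3."""
    start = frame - 1
    return "-".join(
        CODON_TABLE[dna[i:i + 3]] for i in range(start, len(dna) - 2, 3)
    )

for frame in (1, 2, 3):
    print(frame, translate_frame("ATGCCATC", frame))
# 1 M-P
# 2 C-H
# 3 A-I
```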
12Open Reading Frames
- Ribosomes are very obedient to stop codons: when
a stop codon is reached, the protein is finished.
Thus, all genes end at the first stop codon in
their reading frame.
- Since 3 out of the 64 codons are stop codons,
random DNA has stop codons very frequently.
- However, genes do something necessary for
survival, so natural selection keeps stop codons
out of the middle of genes.
- That is, if a mutation arises that creates a stop
codon in the middle of a gene, the organism dies
and leaves no descendants.
- Open reading frames (ORFs) are regions with no
stop codons. All genes reside in long open
reading frames.
- Note that stop codons in other reading frames
have no effect on the gene.
- The start codon must occur upstream in the same
reading frame as the stop codon. It is usually
near the beginning of the ORF, but not
necessarily the first possible start codon.
- Determining the exact start codon is not easy or
obvious.
- But, the first start codon in an open reading
frame is always a reasonable guess.
This is a map of the stop codons in all 3
reading frames in a stretch of DNA. The long ORF
in reading frame 1 is highlighted in black.
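A stop codon map like this one is easy to compute. The Python sketch below is one way to do it, under stated assumptions: positions are 0-based, the toy DNA string is made up, and an ORF here means only "a stretch with no stop codon" (start codons are ignored).

```python
STOPS = {"TAA", "TAG", "TGA"}

def stop_positions(dna: str, frame: int) -> list:
    """0-based positions of the stop codons in reading frame 1, 2, or 3."""
    return [
        i for i in range(frame - 1, len(dna) - 2, 3) if dna[i:i + 3] in STOPS
    ]

def longest_orf(dna: str, frame: int) -> tuple:
    """(start, end) of the longest stop-free stretch in a frame; end is exclusive."""
    first = frame - 1
    end_of_frame = first + 3 * ((len(dna) - first) // 3)
    best, start = (first, first), first
    for stop in stop_positions(dna, frame) + [end_of_frame]:
        if stop - start > best[1] - best[0]:
            best = (start, stop)
        start = stop + 3  # the next open stretch begins after this stop codon
    return best

dna = "ATGCCCTAAGGGAAATTTCCCTGA"  # hypothetical toy sequence
for frame in (1, 2, 3):
    print(frame, stop_positions(dna, frame), longest_orf(dna, frame))
# 1 [6, 21] (9, 21)
# 2 [] (1, 22)
# 3 [] (2, 23)

# With equal base frequencies, random DNA hits a stop about every 64/3 codons.
print(round(64 / 3, 1))  # 21.3
```

That last number is why a long ORF stands out: hundreds of codons without a stop is very unlikely to happen by chance.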
13Gene Placement
- Genes can occur on either DNA strand.
- If they are on the reverse strand, the DNA
sequence needs to be reversed and complemented.
- In bacteria, most of the DNA is part of a gene.
Most long open reading frames (say 100 bp or
longer) that don't overlap other long ORFs
contain genes.
- Most genes do not overlap each other.
- Sometimes there are very short overlaps (50 bp or
less), especially if the two genes are
functionally related.
- In bacteria, genes that affect the same
biochemical pathway or function are sometimes
adjacent to each other on the same DNA strand
(not necessarily the same reading frame),
allowing them to be co-regulated.
- This group of genes is called an operon.
- Operons are characteristic of bacteria (and
archaea); they are rare in eukaryotes.
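Since genes can sit on either strand, a gene search has to look at six reading frames: three on the sequence as written and three on its reverse complement. A minimal Python sketch, with illustrative function names and a made-up toy sequence:

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna: str):
    """Yield (strand, frame, sequence) for the 3 forward and 3 reverse frames."""
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in (1, 2, 3):
            yield strand, frame, seq[frame - 1:]

def to_forward_coords(start: int, end: int, length: int) -> tuple:
    """Map a 0-based, end-exclusive interval on the reverse strand
    back to coordinates on the sequence as written."""
    return length - end, length - start

dna = "TTATTTCAT"  # hypothetical: hides ATG-AAA-TAA on the reverse strand
for strand, frame, seq in six_frames(dna):
    print(strand, frame, seq)

# The ATG at reverse-strand positions 0..3 is the CAT at forward positions 6..9.
print(to_forward_coords(0, 3, len(dna)))  # (6, 9)
```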
14Finding Genes
- First job is to find long ORFs, examining the
longest ORFs first and putting together a set
with minimal overlaps.
- It is also necessary to identify potential start
codons, with the furthest upstream start codon as
the easiest choice.
- Then, how do we know that the ORF contains a real
gene? The most definitive way is to match it
with a gene known from other species.
- Conservation of a sequence between species
strongly suggests that the sequence has a
function that is being conserved by natural
selection.
- We compare protein sequences, not DNA, because
protein is more conserved in evolution than DNA.
- The organism's survival depends on the protein
being functional, which means having the proper
amino acid sequence.
- Since the genetic code is degenerate, many
different DNA sequences will give identical
proteins.
- The protein 3-dimensional structure is even more
conserved, because it is more closely related to
enzyme activity than the amino acid sequence is.
- However, we don't have good ways of determining
3-D structure from a DNA sequence.
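A small Python sketch of why protein is compared rather than DNA: the two toy DNA sequences below (both made up for illustration) differ at 5 of 18 bases, yet translate to exactly the same protein because the genetic code is degenerate.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna: str) -> str:
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))

def identical_positions(a: str, b: str) -> int:
    """Number of positions where two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

dna_a = "ATGAAAGCGAAATTAATT"  # hypothetical sequence from "species A"
dna_b = "ATGAAGGCTAAACTGATC"  # hypothetical sequence from "species B"

print(translate(dna_a), translate(dna_b))            # MKAKLI MKAKLI
print(identical_positions(dna_a, dna_b), len(dna_a))  # 13 18
```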
15Sequence Comparison
- So, we compare our ORF sequence to a database of
known protein sequences from many species.
- BLAST is the standard sequence alignment tool
(BLAST = Basic Local Alignment Search Tool).
- BLAST is based on the concept that if you compare
the same (that is, homologous) protein from many
different species, you can see that some amino
acids readily substitute for each other and
others almost never do.
- This is captured in a substitution matrix, which
gives a score for each pair of amino acids
aligned at a position in the proteins being
compared.
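To make the idea of a substitution score concrete, here is a toy Python sketch. The groups and the scores (4, 1, -2) are invented for illustration only; they are not the values of BLOSUM62 or any other real matrix, and real BLAST also handles gaps and local alignment, which this ignores.

```python
# Hypothetical groups of amino acids that "readily substitute" for each other.
SIMILAR_GROUPS = [set("ILVM"), set("KR"), set("DE"), set("FYW"), set("ST")]

def pair_score(a: str, b: str) -> int:
    """Toy score for one aligned pair of amino acids."""
    if a == b:
        return 4    # identical
    if any(a in group and b in group for group in SIMILAR_GROUPS):
        return 1    # similar: a common, tolerated substitution
    return -2       # dissimilar: a substitution that is rarely seen

def alignment_score(query: str, subject: str) -> int:
    """Sum of pair scores over an ungapped, equal-length alignment."""
    if len(query) != len(subject):
        raise ValueError("this sketch only scores equal-length sequences")
    return sum(pair_score(q, s) for q, s in zip(query, subject))

print(alignment_score("MKAKLI", "MKAKLI"))  # 24
print(alignment_score("MKAKLI", "MRAKVI"))  # 18
print(alignment_score("MKAKLI", "MDAKWI"))  # 12
```

Two conservative substitutions cost less than two unlikely ones, even though the number of identical positions is the same.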
16Practical BLAST
- BLAST itself is a bit of software that can be run
on almost any computer, but the database needed
for a good cross-species comparison is quite
large.
- The database is called nr, for non-redundant,
and it contains at least 20 Gb of sequence data.
- We are going to use the BLAST service at UniProt,
a European consortium that maintains a
comprehensive collection of protein sequences.
- http://www.uniprot.org/
- Nearly all derived from DNA sequences: direct
sequencing of proteins is difficult.
- Terminology: your sequence, which you paste into
the box on the web site, is the query sequence.
Sequences in the database that match yours are
called subject sequences.
17A Sequence to BLAST
- This is a more-or-less randomly chosen gene from
B. meg.
- It is 174 amino acids long.
- It is written in fasta format: the first line
starts with > and is immediately followed by an
identifier (ORF00135), and then some
miscellaneous comments.
- After that the sequence is written without spaces
or other marks.
>ORF00135 chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIAC
LLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENP
LKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHY
PTDVMVGLILGIFSGAVALNIFLT
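Fasta format is simple enough to read with a few lines of Python. The parser below is a minimal sketch (the function name and the dictionary keys are illustrative, and it assumes a well-formed file); the record is the one shown on this slide.

```python
def parse_fasta(text: str) -> list:
    """Parse fasta text into a list of {'id', 'description', 'sequence'} dicts."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: identifier right after '>', then free-form comments.
            identifier, _, description = line[1:].partition(" ")
            records.append(
                {"id": identifier, "description": description, "sequence": ""}
            )
        elif records:
            records[-1]["sequence"] += line  # sequence lines are concatenated
    return records

FASTA_TEXT = """>ORF00135 chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIAC
LLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENP
LKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHY
PTDVMVGLILGIFSGAVALNIFLT
"""

record = parse_fasta(FASTA_TEXT)[0]
print(record["id"], len(record["sequence"]))  # ORF00135 174

# The chromosome coordinates agree: 525 bases = 175 codons = 174 amino acids + stop.
print((538721 - 538197 + 1) // 3 - 1)         # 174
```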
18Results
19BLAST Scores
- Results are arranged with the best ones on top.
- The most important score is the Expect value, or
E-value, which can be defined as the number of
hits that a random sequence (with the same
length as yours) would be expected to have in
the database.
- E-values for good hits are usually written
something like 3e-42, which is the same as 3 x
10^-42, a very small number.
- Bad hits are very common, and they have E-values
in a more familiar form: for example, 0.004 or
1.2.
- A really good E-value is less than 1e-180, which
underflows the computer's processing
capabilities, so it is written as 0.0.
- E-values are affected by the length of the query
sequence as well as the size of the database, so
even perfect matches with short sequences give
poor E-values.
- In this case we see many hits with good E-values,
and the top E-values all are quite similar.
- Before we can conclude that our protein is a
homologue of the proteins BLAST matches it with,
we would like them to have roughly the same
length and have a high percentage of identical
amino acids.
- The lengths of the query and subject sequences
should be within 20% of each other.
- There should be at least 30% identical amino
acids.
- In this case we can be quite sure we have a good
match.
- BLAST also returns a fourth value, the bit score,
which we are going to ignore.
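The rules of thumb on this slide can be written down as a small Python check. This is a sketch under assumptions: the E-value cutoff of 1e-10 is a hypothetical choice (the slide gives no number), "within 20%" is measured against the longer sequence, and the example hits are invented.

```python
def looks_homologous(query_len: int, subject_len: int,
                     percent_identity: float, evalue: float,
                     max_evalue: float = 1e-10) -> bool:
    """Apply the slide's rules of thumb to one BLAST hit.

    max_evalue is a hypothetical cutoff; pick one that suits your database.
    """
    longer = max(query_len, subject_len)
    similar_length = abs(query_len - subject_len) <= 0.20 * longer
    return similar_length and percent_identity >= 30.0 and evalue <= max_evalue

# "3e-42" is ordinary scientific notation and parses directly.
evalue = float("3e-42")
print(evalue < 1e-40)                             # True

print(looks_homologous(174, 180, 55.0, evalue))   # True
print(looks_homologous(174, 400, 55.0, evalue))   # False (lengths too different)
print(looks_homologous(174, 170, 25.0, evalue))   # False (identity below 30%)
print(looks_homologous(174, 170, 60.0, 1.2))      # False (poor E-value)
```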
20Gene Names
- Mostly genes are named with the function of their
protein.
- At some point, some related genes had their
function determined through lab work: by
examining the effects of mutations in the gene,
by isolating and studying the protein produced by
the gene, etc.
- Enzymes (end in -ase), transport across the cell
membrane, genetic information processing
(DNA->RNA->protein), structural proteins,
sporulation and germination, and more!
- Many genes (maybe 1/4 of them in a typical
genome) have no known function, although they are
found in several different species: conserved
hypothetical genes.
- Every new genome has some genes that are unique:
no matching BLAST hits in the database.
- Are they real genes? Sometimes there is evidence
in the form of messenger RNA, but usually we
don't know.
- Call them hypothetical genes.
- Putative means that we think we know the gene's
function but we aren't sure. Putative should be
followed by the function name.
21More Gene Names
- One question of interest: do the names of the top
BLAST hits agree with each other? They should,
but there are always annotation errors, and our
knowledge of gene function increases over time.
- With some sloppiness due to different naming
conventions practiced by different scientists.
- Here we have a classic case of mis-naming. Why
is the top hit ribosomal protein S2, with no
other hit having this name?
- Ribosomal proteins are highly conserved in
evolution.
- Some checking on my part showed that no homology
exists between this gene and the ribosomal
protein S2 found in any other Bacillus species.
- The other names are similar, although not
identical.
- What is PAP2? A quick Google search shows that
it stands for phosphatidic acid phosphatase,
which fits the other names well.
- There is probably some uncertainty about its
exact function, given the variety of names and
the "family protein" designation in several of
them.
22Horizontal and Vertical Gene Transfer
- We are accustomed to thinking of genes being
passed from parent to offspring, always staying
within the species, with very occasional
splitting of one species into two.
- This is called vertical gene transfer.
- But, we know that some genes are transferred
across species lines, not by the standard genetic
mechanisms.
- This is called horizontal gene transfer.
- It is rare in humans and other higher organisms.
- In bacteria, 10% or more of genes have been
transferred in horizontally.
- B. meg genes that come from vertical descent have
other Bacillus species (or another closely
related species) as the closest BLAST hit.
- Horizontally transferred genes can come from
almost anywhere: other bacteria, Archaea,
eukaryotes (plants, animals, fungi).
- The general mechanisms are well known, including
conjugation (direct transfer of DNA between two
bacteria), transduction (transfer of DNA using a
virus as a carrier), and transformation (the
bacteria pick up DNA molecules from their
environment).
23Bacillus Phylogeny
- Kings Play Chess On Fine Ground Sand (Kingdom,
Phylum, Class, Order, Family, Genus, Species)
- Bacteria is the domain
- Firmicutes is the phylum
- Bacilli is the class
- Bacillales is the order
- Bacillaceae is the family
- Bacillus is the genus.
24Our Example
- Most of the top hits are from various Bacillus
species: there is little doubt that this gene is
the result of normal, vertical gene flow.
- What about Anoxybacillus flavithermus?
- Click on the accession number to get more
information, including its phylogeny.
- Taxonomic lineage: Bacteria > Firmicutes >
Bacillales > Bacillaceae > Anoxybacillus.
- Same family as B. meg.
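Comparing lineages like this can be automated. The Python sketch below finds the deepest rank two lineages share. The B. meg lineage comes from the phylogeny slide; for Anoxybacillus the class (Bacilli) is filled in, since the lineage shown by the web site skips it. The E. coli lineage is added only as an illustration of a distant relative, not as an actual hit for this gene.

```python
RANKS = ["domain", "phylum", "class", "order", "family", "genus"]

B_MEG = ["Bacteria", "Firmicutes", "Bacilli",
         "Bacillales", "Bacillaceae", "Bacillus"]
A_FLAVITHERMUS = ["Bacteria", "Firmicutes", "Bacilli",
                  "Bacillales", "Bacillaceae", "Anoxybacillus"]
# Illustration only: a distantly related bacterium.
E_COLI = ["Bacteria", "Proteobacteria", "Gammaproteobacteria",
          "Enterobacterales", "Enterobacteriaceae", "Escherichia"]

def deepest_shared_rank(lineage_a: list, lineage_b: list):
    """Return (rank, name) of the most specific rank two lineages share."""
    shared = None
    for rank, a, b in zip(RANKS, lineage_a, lineage_b):
        if a != b:
            break
        shared = (rank, a)
    return shared

print(deepest_shared_rank(B_MEG, A_FLAVITHERMUS))  # ('family', 'Bacillaceae')
print(deepest_shared_rank(B_MEG, E_COLI))          # ('domain', 'Bacteria')
```

A best hit that shares only the domain with B. meg would be a candidate for horizontal transfer; one that shares the family looks like ordinary vertical descent.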
25Aligned Sequences
- You can see the aligned sequences by clicking on
the Local alignment diagrams.
- Query sequence on top, subject below.
- Identical amino acids are in the middle of the
alignment, and similar ones have a + sign.
- Gaps, regions where one sequence has amino acids
not found in the other sequence, are indicated
with ---.
- This protein is very typical in that the best
matches are in the middle of the protein, with
fewer identical amino acids near the ends.
- Also, the match doesn't quite make it to the very
beginning of the proteins, although they are
almost identical in length.
- The active site of most enzymes is in the middle.
- The ends of proteins are often not well conserved.
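The middle line of an alignment display can be rebuilt from the two aligned sequences. In the Python sketch below, the groups of "similar" amino acids are a made-up simplification (real BLAST decides similarity from its substitution matrix), and the aligned pair is a hypothetical example, not taken from the actual result.

```python
# Hypothetical groups of similar amino acids, for illustration only.
SIMILAR_GROUPS = [set("ILVM"), set("KR"), set("DE"), set("FYW"), set("ST")]

def midline(query: str, subject: str) -> str:
    """Letter for identical, '+' for similar, blank for a mismatch or gap."""
    out = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            out.append(" ")
        elif q == s:
            out.append(q)
        elif any(q in group and s in group for group in SIMILAR_GROUPS):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)

def percent_identity(query: str, subject: str) -> float:
    """Identical columns as a percentage of all alignment columns."""
    identical = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * identical / len(query)

query   = "MKAKLIQYV"
subject = "MRAK-IEYL"
print(query)
print(midline(query, subject))  # M+AK I Y+
print(subject)
print(round(percent_identity(query, subject), 1))  # 55.6
```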
26Local Alignment Result
27Graphical Overview
- Click on Graphical Overview (just under the BLAST
box on the left) to get an overview of all the
aligned sequences.
- The extent of the matching region is shown with
the colored boxes, with non-matching regions
drawn as a line.
- Color indicates percent of identical amino acids.
- You can see that mostly our query and the various
subjects (matches) line up along almost all of
their lengths.
- This is a good way to check whether our start
site is reasonable.
- A few odd ones lower down.
- Genes, and pieces of genes, can move to new
locations in the genome, fuse with other genes,
break apart, etc. Always subject to natural
selection: if the altered gene doesn't work, the
organism will die and we won't see it.
- And of course, sequencing and annotation errors
occur.
28The Basic Points
- DNA can be read in 3 different reading frames, a
consequence of the genetic code (3 bases = 1
amino acid).
- Genes are found in long open reading frames,
areas where there are no stop codons.
- BLAST is the tool we use to compare sequences
between species.
- BLAST scores (E-values) describe how many hits
a random sequence would be expected to have in
the database.
- Gene sequences are conserved between species by
natural selection.
- DNA sequences outside of genes are much less
conserved.
- Most genes are transferred vertically, from
parent to offspring, but a significant number are
transferred horizontally, from unrelated species.
29End
30Other Stuff
- Within-species BLAST: are there duplicate genes?
Do their names match? What is the most closely
related species? Present in both strains?
- Are nearby genes related by subsystem?