Basic Bioinformatics - PowerPoint PPT Presentation

About This Presentation
Title:

Basic Bioinformatics


Slides: 31
Provided by: rickj4

Transcript and Presenter's Notes

Title: Basic Bioinformatics


1
Basic Bioinformatics
  • As it is applied to the Bacillus megaterium
    genome

2
What we are going to talk about
  • Why we are doing all this DNA sequencing
  • What genes look like and where they are found
  • How we can compare sequences between different
    species
  • How genes move between species

3
DNA Sequencing
  • Bioinformatics is based on the fact that DNA
    sequencing is cheap, and becoming easier and
    cheaper very quickly.
  • The Human Genome Project cost roughly $3 billion
    and took 13 years (1990-2003).
  • Sequencing James Watson's genome in 2007 cost $2
    million and took 2 months.
  • Today, you could get your genome sequenced for
    about $100,000 and it would take a month.
  • The Archon X Prize: you win $10 million if you
    can sequence 100 human genomes in 10 days, at a
    cost of $10,000 per genome.
  • It is realistic to envision $100 per genome
    within 10 years: everyone's genome could be
    sequenced if they wanted or needed it.

4
Why it's useful
  • All of the information needed to build an
    organism is contained in its DNA. If we could
    understand it, we would know how life works.
  • Preventing and curing diseases like cancer (which
    is caused by mutations in DNA) and inherited
    diseases.
  • Curing infectious diseases (everything from AIDS
    and malaria to the common cold). If we
    understand how a microorganism works, we can
    figure out how to block it.
  • Understanding genetic and evolutionary
    relationships between species
  • Understanding genetic relationships between
    humans. Projects exist to understand human
    genetic diversity. Also, sequencing the
    Neanderthal genome.
  • Ancient DNA: currently it is thought that under
    ideal conditions (continuously kept frozen),
    there is a limit of about 1 million years for DNA
    survival. So, Jurassic Park will probably remain
    fiction.

5
From DNA to Gene
  • But extracting that information is difficult.
    Converting a string of ACGTs into knowledge of
    how the organism works is hard.
  • Most of the work is on the computer, with key
    confirming experiments done in the wet lab.
  • The sequence below contains a gene critical for
    life: the gene that initiates replication of the
    DNA. Can you spot it?
  • We are now going to spend some time on what genes
    look like and how we can find them.

TTGGAAAACATTCATGATTTATGGGATAGAGCTTTAGATCAAATTGAAAA
AAAATTAAGCAAACCTAGTTTTGAAACCTGGCTCAAATCGACAAAAGCT
CATGCTTTACAAGGAGACACGCTCATTATTACTGCACCTAATGATTTTGC
ACGGGACTGGTTAGAATCTAGGTATTCTAATTTAATTGCTGAAACACTT
TATGATCTTACGGGGGAAGAGTTAGATGTAAAATTTATTATTCCTCCTA
ACCAGGCCGAGGAAGAATTCGATATTCAAACTCCTAAAAAGAAAGTCAAT
AAAGACGAAGGAGCAGAATTTCCTCAAAGCATGCTAAATTCGAAGTATA
CCTTTGATACATTTGTTATCGGATCTGGAAATCGGTTTGCGCATGCAGCT
TCTTTAGCAGTAGCAGAAGCGCCGGCTAAAGCGTATAATCCGCTTTTTA
TTTACGGGGGAGTAGGATTAGGCAAAACACACTTAATGCACGCCATAGG
CCACTATGTGTTAGATCATAATCCTGCCGCGAAAGTCGTGTACTTATCAT
CTGAAAAATTCACAAACGAGTTTATTAACTCTATTCGTGACAATAAAGC
AGTAGAATTCCGCAACAAATACCGTAATGTAGATGTTTTACTGATTGAT
GATATTCAATTCTTAGCAGGTAAAGAGCAGACACAAGAAGAATTTTTCCA
TACGTTTAATACGCTTCACGAAGAAAGCAAGCAGATTGTCATCTCAAGT
GATCGACCGCCGAAAGAAATTCCTACACTTGAAGATCGACTTCGCTCTCG
CTTTGAATGGGGCCTTATTACAGACATCACACCACCAGATTTGGAAACA
CGAATTGCTATTTTGCGTAAAAAAGCCAAAGCGGACGGCTTAGTTATTC
CAAATGAAGTTATGCTTTATATCGCCAATCAGATTGATTCAAATATTAGA
GAATTAGAAGGCGCACTTATT
6
DNA
  • DNA is just a long string of 4 letters
    (nucleotides, or bases): Adenine, Guanine,
    Cytosine, and Thymine.
  • Which we will just refer to as A, C, G, and T
  • And we are skipping lots of details
  • Each DNA molecule has 2 strands, with the bases
    paired in the center
  • A on one strand always pairs with T on the other
    strand
  • G pairs with C.
  • the strands run in opposite directions (like
    roads)
  • Since the two DNA strands are complementary,
    there is no need to write down both strands
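
The pairing rules above are enough to rebuild the second strand from the first. A minimal Python sketch, assuming plain upper- or lower-case A/C/G/T input (the function names are ours, not from any particular library):

```python
# Watson-Crick pairing from the slide: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the partner base for every base, in the same left-to-right order."""
    return strand.upper().translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the other strand written in its own reading direction.

    The strands run in opposite directions, so the complement is reversed.
    """
    return complement(strand)[::-1]


top = "ATGCCATC"
print(complement(top))          # TACGGTAG
print(reverse_complement(top))  # GATGGCAT
```

Applying reverse_complement twice gives back the original strand, which is why only one strand needs to be written down.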

7
Chromosomes and Genes
  • each chromosome is a long piece of DNA
  • B. megaterium genome is a circle (like most
    bacteria) of about 5 million bases.
  • Human chromosomes are 100-200 million bases long.
    We have 46 chromosomes (2 sets of 23, one set
    from each parent).
  • Genes are just regions on that DNA. It is not
    obvious where genes are if you look at a DNA
    sequence.
  • There is a lot of DNA that is not part of genes:
    in humans only 2% at most of the DNA is part of
    any gene.
  • Bacteria use more of their DNA: 80% of the B. meg
    chromosome is genes.
  • B. meg has about 1 gene per 1000 base pairs (bp)
    of DNA. About 5000 genes
  • Humans have about 25,000 genes.
  • We are far more complicated than bacteria:
    regulation of the genes is very complicated in
    humans
  • We use the same gene in different ways in
    different tissues

8
Genes and Proteins
  • Most genes code for proteins: each gene contains
    the information necessary to make one protein.
  • Proteins are the most important type of
    macromolecule.
  • Structure: collagen in skin, keratin in hair,
    crystallin in eye.
  • Enzymes: all metabolic transformations (building
    up, rearranging, and breaking down of organic
    compounds) are done by enzymes, which are
    proteins.
  • Transport: oxygen in the blood is carried by
    hemoglobin; everything that goes in or out of a
    cell (except water and a few gases) is carried
    by proteins.
  • Also: nutrition (egg yolk), hormones, defense,
    movement

9
The Genetic Code
  • Proteins are long chains of amino acids.
  • There are 20 different amino acids coded in DNA
  • There are only 4 DNA bases, so you need 3 DNA
    bases to code for the 20 amino acids (2 bases
    would give only 4 x 4 = 16 combinations)
  • 4 x 4 x 4 = 64 possible 3 base combinations
    (codons)
  • Each codon codes for one amino acid (or for
    stop)
  • Most amino acids have more than one possible
    codon
  • Genes start at a start codon and end at a stop
    codon.
  • 3 codons (TAA, TAG, and TGA) are stop codons:
    all genes end at a stop codon.
  • Start codons are a bit trickier, since they are
    used in the middle of genes as well as at the
    beginning
  • In eukaryotes, ATG is almost always the start
    codon, making Methionine (Met) the first amino
    acid in nearly all proteins (but in many proteins
    it is immediately removed).
  • In prokaryotes, ATG, GTG, or TTG can be used as a
    start codon. B. meg prefers ATG, but about 30%
    of the genes start with GTG or TTG.

In bioinformatics, we generally ignore the fact
that RNA uses the base uracil (U) in place of T.
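
The codon arithmetic on this slide can be checked directly. A short Python sketch, assuming the standard genetic code written as a 64-letter string in T, C, A, G order (the variable names are ours):

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# All 4 x 4 x 4 three-base combinations, written with T rather than U.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
CODON_TABLE = dict(zip(codons, AMINO_ACIDS))

stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
codons_per_amino_acid = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(codons))                 # 64
print(stop_codons)                 # ['TAA', 'TAG', 'TGA']
print(len(codons_per_amino_acid))  # 20
print(codons_per_amino_acid["L"], codons_per_amino_acid["M"])  # 6 1
```

Leucine (L) has six codons while methionine (M) has only ATG, which is what "most amino acids have more than one possible codon" means in practice.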
10
Gene Expression
  • How do you get a protein from a gene?
  • A two-step process (called the Central Dogma of
    Molecular Biology).
  • First, the gene has to be copied (transcribed)
    into an RNA form.
  • The RNA copy (messenger RNA) is exactly like the
    gene itself, except RNA replaces T with U.
  • Most gene regulation (whether the gene is on or
    off) happens here
  • Second, the RNA is translated into protein by
    ribosomes, which are complex RNA/protein hybrid
    machines.
  • With the help of transfer RNA molecules, which
    have one end that matches the 3 base codon and
    the other end that is attached to the proper
    amino acid.
  • The ribosome starts at the start codon and moves
    down the messenger RNA, adding one amino acid at
    a time to the growing chain. When the ribosome
    reaches a stop codon, it falls off, releasing the
    new protein.

11
Reading Frames
  • Here we get a bit subtle.
  • Since codons consist of 3 bases, there are 3
    reading frames possible on an RNA (or DNA),
    depending on whether you start reading from the
    first base, the second base, or the third base.
  • The different reading frames give entirely
    different proteins.
  • Consider ATGCCATC, and refer to the genetic code.
    (X is junk)
  • Reading frame 1 divides this into ATG-CCA-TC,
    which translates to Met-Pro-X
  • Reading frame 2 divides this into A-TGC-CAT-C,
    which translates to X-Cys-His-X
  • Reading frame 3 divides this into AT-GCC-ATC,
    which translates to X-Ala-Ile
  • Each gene uses a single reading frame, so once
    the ribosome gets started, it just has to count
    off groups of 3 bases to produce the proper
    protein.
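
The ATGCCATC example can be reproduced in a few lines. This is a sketch only, assuming the standard genetic code and one-letter amino acid codes (M = Met, P = Pro, C = Cys, H = His, A = Ala, I = Ile); incomplete codons, the "X" junk on the slide, are simply dropped:

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a stop codon.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate_frame(dna: str, frame: int) -> str:
    """Translate the complete codons of reading frame 1, 2 or 3."""
    if frame not in (1, 2, 3):
        raise ValueError("frame must be 1, 2 or 3")
    dna = dna.upper()
    start = frame - 1
    # Stop before any trailing bases that do not fill a whole codon.
    usable_end = len(dna) - (len(dna) - start) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(start, usable_end, 3))


for frame in (1, 2, 3):
    print(frame, translate_frame("ATGCCATC", frame))
# 1 MP
# 2 CH
# 3 AI
```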

12
Open Reading Frames
  • Ribosomes are very obedient to stop codons: when
    a stop codon is reached, the protein is finished.
    Thus, all genes end at the first stop codon in
    their reading frame.
  • Since 3 out of the 64 codons are stop codons,
    random DNA has a stop codon about once every 21
    codons on average in any given reading frame.
  • However, genes do something necessary for
    survival, so natural selection keeps stop codons
    out of the middle of genes.
  • That is, if a mutation arises that creates a stop
    codon in the middle of a gene, the organism dies
    and leaves no descendants.
  • Open reading frames (ORFs) are regions with no
    stop codons. All genes reside in long open
    reading frames.
  • Note that stop codons in other reading frames
    have no effect on the gene.
  • The start codon must occur upstream in the same
    reading frame as the stop codon. It is usually
    near the beginning of the ORF, but not
    necessarily the first possible start codon.
  • Determining the exact start codon is not easy or
    obvious.
  • But, the first start codon in an open reading
    frame is always a reasonable guess

This is a map of the stop codons in all 3
reading frames in a stretch of DNA. The long ORF
in reading frame 1 is highlighted in black.
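
A stop-codon map like this one can be computed by walking each reading frame codon by codon. The following Python sketch does that for the three frames of one strand; the function name, the min_codons cutoff, and the 0-based coordinates are our own choices for illustration, not part of any gene-finding package:

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_reading_frames(dna: str, min_codons: int = 30):
    """Yield (frame, start, end) for each stop-free stretch of >= min_codons codons.

    frame is 1, 2 or 3; start and end are 0-based, end-exclusive positions,
    and the stretch does not include the stop codon that closes it.
    """
    dna = dna.upper()
    for frame in range(3):
        orf_start = frame
        for pos in range(frame, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - orf_start) // 3 >= min_codons:
                    yield frame + 1, orf_start, pos
                orf_start = pos + 3  # next stretch begins after the stop
        # A stretch may also run off the end of the sequence without a stop.
        end = frame + ((len(dna) - frame) // 3) * 3
        if (end - orf_start) // 3 >= min_codons:
            yield frame + 1, orf_start, end


toy = "ATGAAACCCTAGGGG"
print(list(open_reading_frames(toy, min_codons=3)))
# [(1, 0, 9), (2, 4, 13), (3, 2, 14)]
```

On real bacterial DNA a much larger cutoff than the toy value of 3 is needed, because short stop-free stretches occur by chance all the time.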
13
Gene Placement
  • Genes can occur on either DNA strand.
  • If they are on the reverse strand, the DNA
    sequence needs to be reversed and complemented
  • In bacteria, most of the DNA is part of a gene.
    Most long open reading frames (say 100 bp or
    longer) that don't overlap other long ORFs
    contain genes
  • Most genes do not overlap each other.
  • Sometimes there are very short overlaps (50 bp or
    less), especially if the two genes are
    functionally related.
  • In bacteria, genes that affect the same
    biochemical pathway or function are sometimes
    adjacent to each other on the same DNA strand
    (not necessarily the same reading frame),
    allowing them to be co-regulated
  • This group of genes is called an operon
  • Operons are characteristic of prokaryotes: they
    are almost never found in eukaryotes.

14
Finding Genes
  • First job is to find long ORFs, examining the
    longest ORFs first and putting together a set
    with minimal overlaps.
  • It is also necessary to identify potential start
    codons, with the furthest upstream start codon as
    the easiest choice.
  • Then, how do we know that the ORF contains a real
    gene? The most definitive way is to match it
    with a gene known from other species
  • Conservation of a sequence between species
    strongly suggests that the sequence has a
    function that is being conserved by natural
    selection
  • We compare protein sequences, not DNA, because
    protein is more conserved in evolution than DNA
  • The organism's survival depends on the protein
    being functional, which means having the proper
    amino acid sequence
  • Since the genetic code is degenerate, many
    different DNA sequences will give identical
    proteins.
  • The protein 3-dimensional structure is even more
    conserved, because it is more closely related to
    enzyme activity than the amino acid sequence is.
  • However, we don't have good ways of determining
    3-D structure from a DNA sequence
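
Listing the possible start codons inside an ORF is straightforward once the reading frame is fixed. A minimal Python sketch, assuming the ORF string already begins in frame and contains no stop codons (the function name and the toy sequence are made up for illustration):

```python
START_CODONS = ("ATG", "GTG", "TTG")  # prokaryotic start codons named in the talk


def candidate_starts(orf: str) -> list[tuple[int, str]]:
    """Return (offset, codon) for every in-frame start codon, upstream first."""
    orf = orf.upper()
    return [(i, orf[i:i + 3])
            for i in range(0, len(orf) - 2, 3)
            if orf[i:i + 3] in START_CODONS]


toy_orf = "CCCTTGAAAATGGCAGTGCAT"  # CCC TTG AAA ATG GCA GTG CAT
starts = candidate_starts(toy_orf)
print(starts)     # [(3, 'TTG'), (9, 'ATG'), (15, 'GTG')]
print(starts[0])  # (3, 'TTG'), the furthest upstream candidate
```

Taking starts[0] is the "easiest choice" from the slide; comparing against homologous proteins is what tells us whether that choice is right.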

15
Sequence Comparison
  • So, we compare our ORF sequence to a database of
    known protein sequences from many species.
  • BLAST is the standard sequence alignment tool
    (BLAST = Basic Local Alignment Search Tool)
  • BLAST is based on the concept that if you compare
    the same (that is, homologous) protein from many
    different species, you can see that some amino
    acids readily substitute for each other and
    others almost never do.
  • This is captured in a substitution matrix, which
    gives a score for each pair of aligned amino
    acids in the proteins being compared.
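
To make the substitution-matrix idea concrete, here is a toy Python sketch that scores an ungapped alignment. The scores are invented for illustration and cover only four amino acids; real searches use full 20 x 20 matrices such as BLOSUM62, and BLAST itself does far more than this (seeding, extension, gaps):

```python
# Toy scores, for illustration only: identical or chemically similar pairs
# score high, unlikely substitutions score low.
TOY_SCORES = {
    frozenset("I"): 4, frozenset("L"): 4, frozenset("V"): 4, frozenset("W"): 11,
    frozenset("IL"): 2, frozenset("IV"): 3, frozenset("LV"): 1,
}
UNLISTED_PAIR_SCORE = -3  # assumed penalty for any pair not in the toy table


def pair_score(a: str, b: str) -> int:
    """Score one aligned pair of amino acids; the order of the pair is irrelevant."""
    return TOY_SCORES.get(frozenset((a, b)), UNLISTED_PAIR_SCORE)


def alignment_score(query: str, subject: str) -> int:
    """Sum the pair scores over an ungapped alignment of equal-length sequences."""
    if len(query) != len(subject):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(a, b) for a, b in zip(query, subject))


print(alignment_score("ILVW", "ILVW"))  # 23: perfect match
print(alignment_score("ILVW", "LIVW"))  # 19: I and L substitute readily
print(alignment_score("ILVW", "WLVI"))  # 2: W for I almost never happens
```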

16
Practical BLAST
  • BLAST itself is a bit of software that can be run
    on almost any computer, but the database needed
    for a good cross-species comparison is quite
    large
  • The database is called "nr" for non-redundant,
    and it contains at least 20 Gb of sequence data
  • We are going to use the BLAST service at UniProt,
    a European consortium that maintains a
    comprehensive collection of protein sequences
  • http://www.uniprot.org/
  • Nearly all derived from DNA sequences: direct
    sequencing of proteins is difficult
  • Terminology: your sequence, which you paste into
    the box on the web site, is the query sequence.
    Sequences in the database that match yours are
    called subject sequences.

17
A Sequence to BLAST
  • This is a more-or-less randomly chosen gene from
    B. meg.
  • It is 174 amino acids long
  • It is written in FASTA format: the first line
    starts with > and is immediately followed by an
    identifier (ORF00135), and then some
    miscellaneous comments.
  • After that the sequence is written without spaces
    or other marks.
  • >ORF00135 chromosome 538197-538721 revcomp
    MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIAC
    LLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENP
    LKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHY
    PTDVMVGLILGIFSGAVALNIFLT
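
The FASTA layout described above is simple enough to parse by hand. The Python sketch below reads this one record and checks it against the slide. Reading the comment as "replicon, start-end coordinates, strand" with inclusive coordinates is our assumption about this particular header, not a FASTA rule:

```python
FASTA_RECORD = """>ORF00135 chromosome 538197-538721 revcomp
MKAKLIQYVYDAECRLFKSVNQHFDRKHLNRFLRLLTHAGGATFTIVIAC
LLLFLYPSSVAYACAFSLAVSHIPVAIAKKLYPRKRPYIQLKHTKVLENP
LKDHSFPSGHTTAIFSLVTPLMIVYPAFAAVLLPLAVMVGISRIYLGLHY
PTDVMVGLILGIFSGAVALNIFLT
"""


def parse_fasta(text: str) -> tuple[str, str, str]:
    """Parse a single FASTA record into (identifier, comment, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("a FASTA record must start with '>'")
    identifier, _, comment = lines[0][1:].partition(" ")
    return identifier, comment, "".join(lines[1:])


identifier, comment, protein = parse_fasta(FASTA_RECORD)

# Assumed header layout: "<replicon> <start>-<end> <strand>", coordinates inclusive.
start, end = (int(n) for n in comment.split()[1].split("-"))
gene_length_bp = end - start + 1

print(identifier, len(protein))             # ORF00135 174
print(gene_length_bp, gene_length_bp // 3)  # 525 175
```

The 525 bases make 175 codons, which fits 174 amino acids plus one stop codon, so the header and the sequence agree with each other.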

18
Results
19
BLAST Scores
  • Results are arranged with the best ones on top
  • The most important score is the Expect value, or
    E-value, which can be defined as the number of
    hits this good that a random sequence (with the
    same length as yours) would be expected to have
    in the database by chance.
  • E-values for good hits are usually written
    something like 3e-42, which is the same as 3 x
    10^-42, a very small number
  • Bad hits are very common, and they have e-values
    in a more familiar form: for example, 0.004 or
    1.2
  • A really good e-value is less than 1e-180, which
    underflows the computer's processing
    capabilities, so it is written as 0.0
  • E-values are affected by the length of the query
    sequence as well as the size of the database, so
    even perfect matches with short sequences give
    poor e-values
  • In this case we see many hits with good e-values,
    and the top e-values all are quite similar.
  • Before we can conclude that our protein is a
    homologue of the proteins BLAST matches it with,
    we would like them to have roughly the same
    length and have a high percentage of identical
    amino acids.
  • The lengths of the query and subject sequences
    should be within 20% of each other
  • There should be at least 30% identical amino
    acids
  • In this case we can be quite sure we have a good
    match
  • BLAST also returns a fourth value, the bit score,
    which we are going to ignore.
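
The rules of thumb on this slide can be written as a small filter. This Python sketch is one possible reading of them: the E-value cutoff of 1e-10 is our assumption (the slide gives no number), and "within 20%" is measured against the longer of the two sequences:

```python
def looks_homologous(query_len: int, subject_len: int,
                     percent_identity: float, evalue: float,
                     max_evalue: float = 1e-10,       # assumed cutoff, not from the slide
                     max_length_diff: float = 0.20,   # lengths within 20% of each other
                     min_identity: float = 30.0) -> bool:  # at least 30% identical
    """Apply the E-value, length, and identity rules of thumb to one BLAST hit."""
    length_diff = abs(query_len - subject_len) / max(query_len, subject_len)
    return (evalue <= max_evalue
            and length_diff <= max_length_diff
            and percent_identity >= min_identity)


# E-values in "e" notation parse as ordinary floating-point numbers.
print(float("3e-42") < float("0.004"))           # True

# Hypothetical hits against our 174 amino acid query.
print(looks_homologous(174, 180, 55.0, 3e-42))   # True
print(looks_homologous(174, 400, 55.0, 3e-42))   # False: lengths differ too much
print(looks_homologous(174, 170, 25.0, 1e-20))   # False: too few identical residues
print(looks_homologous(174, 170, 60.0, 1.2))     # False: E-value is poor
```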

20
Gene Names
  • Mostly genes are named with the function of their
    protein.
  • At some point, some related genes had their
    function determined through lab work: by
    examining the effects of mutations in the gene,
    by isolating and studying the protein produced by
    the gene, etc.
  • Enzymes (names end in -ase), transport across the
    cell membrane, genetic information processing
    (DNA->RNA->protein), structural proteins,
    sporulation and germination, and more!
  • Many genes (maybe 1/4 of them in a typical
    genome) have no known function, although they are
    found in several different species: "conserved
    hypothetical" genes
  • Every new genome has some genes that are unique:
    no matching BLAST hits in the database.
  • Are they real genes? Sometimes there is evidence
    in the form of messenger RNA, but usually we
    don't know
  • Call them "hypothetical" genes
  • "Putative" means that we think we know the gene's
    function but we aren't sure. "Putative" should be
    followed by the function name.

21
More Gene Names
  • One question of interest: do the names of the top
    BLAST hits agree with each other? They should,
    but there are always annotation errors, and our
    knowledge of gene function increases over time.
  • With some sloppiness due to different naming
    conventions practiced by different scientists
  • Here we have a classic case of mis-naming. Why
    is the top hit "ribosomal protein S2", with no
    other hit having this name?
  • Ribosomal proteins are highly conserved in
    evolution
  • Some checking on my part showed that no homology
    exists between this gene and the ribosomal
    protein S2 found in any other Bacillus species
  • The other names are similar, although not
    identical.
  • What is PAP2? A quick Google search shows that
    it stands for phosphatidic acid phosphatase,
    which fits the other names well.
  • There is probably some uncertainty about its
    exact function, given the variety of names and
    the "family protein" designation in several of
    them.

22
Horizontal and Vertical Gene Transfer
  • We are accustomed to thinking of genes being
    passed from parent to offspring, always staying
    within the species, with very occasional
    splitting of one species into two.
  • This is called vertical gene transfer.
  • But, we know that some genes are transferred
    across species lines, not by the standard genetic
    mechanisms.
  • This is called horizontal gene transfer
  • It is rare in humans and other higher organisms
  • In bacteria, 10% or more of genes have been
    transferred in horizontally.
  • B. meg genes that come from vertical descent have
    other Bacillus species (or another closely
    related species) as the closest BLAST hit
  • Horizontally transferred genes can come from
    almost anywhere: other bacteria, Archaea,
    eukaryotes (plants, animals, fungi)
  • The general mechanisms are well known, including
    conjugation (direct transfer of DNA between two
    bacteria), transduction (transfer of DNA using a
    virus as a carrier), and transformation (the
    bacteria pick up DNA molecules from their
    environment).

23
Bacillus Phylogeny
  • Kings Play Chess On Fine Ground Sand (Kingdom,
    Phylum, Class, Order, Family, Genus, Species)
  • Bacteria is the domain
  • Firmicutes is the phylum
  • Bacilli is the class
  • Bacillales is the order
  • Bacillaceae is the family
  • Bacillus is the genus.

24
Our Example
  • Most of the top hits are from various Bacillus
    species: there is little doubt that this gene is
    the result of normal, vertical gene flow.
  • What about Anoxybacillus flavithermus?
  • Click on the accession number to get more
    information, including its phylogeny.
  • Taxonomic lineage: Bacteria > Firmicutes >
    Bacillales > Bacillaceae > Anoxybacillus.
  • Same family as B. meg.
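
Comparing lineages rank by rank is how we judged "same family" above, and it is easy to automate. In this Python sketch the B. meg and Anoxybacillus lineages come from the slides (the class is missing from the Anoxybacillus lineage as shown, so it is left out), while the distant hit is a made-up example of what a horizontally transferred gene might return:

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus")

B_MEGATERIUM = dict(zip(RANKS, ("Bacteria", "Firmicutes", "Bacilli",
                                "Bacillales", "Bacillaceae", "Bacillus")))

# Lineage as listed on the slide; the class rank is not given there.
ANOXYBACILLUS = {"domain": "Bacteria", "phylum": "Firmicutes",
                 "order": "Bacillales", "family": "Bacillaceae",
                 "genus": "Anoxybacillus"}

# Hypothetical distant hit, for illustration only.
DISTANT_HIT = {"domain": "Bacteria", "phylum": "Proteobacteria"}


def lowest_shared_rank(a: dict, b: dict):
    """Return (rank, name) for the most specific rank where both lineages agree."""
    for rank in reversed(RANKS):
        if rank in a and a.get(rank) == b.get(rank):
            return rank, a[rank]
    return None


print(lowest_shared_rank(B_MEGATERIUM, ANOXYBACILLUS))  # ('family', 'Bacillaceae')
print(lowest_shared_rank(B_MEGATERIUM, DISTANT_HIT))    # ('domain', 'Bacteria')
```

A top hit that only shares the domain or phylum with B. meg would be a candidate for horizontal transfer; one that shares the family or genus points to ordinary vertical descent.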

25
Aligned Sequences
  • You can see the aligned sequences by clicking on
    the Local alignment diagrams
  • Query sequence on top, subject below
  • Identical amino acids are in the middle of the
    alignment, and similar ones have a + sign.
  • Gaps, regions where one sequence has amino acids
    not found in the other sequence, are indicated
    with ---.
  • This protein is very typical in that the best
    matches are in the middle of the protein, with
    fewer identical amino acids near the ends.
  • Also, the match doesn't quite make it to the very
    beginning of the proteins, although they are
    almost identical in length.
  • The active site of most enzymes is in the middle
  • The ends of proteins are often not well conserved
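
The middle line of an alignment and the percent identity can both be derived from the two aligned rows. A minimal Python sketch on a made-up pair of aligned fragments; it marks identical residues only (the + for similar residues would need a substitution matrix), and it counts identity over all alignment columns including gaps, which is one common convention:

```python
def midline(query: str, subject: str) -> str:
    """Repeat identical residues between the two rows, leave other columns blank."""
    return "".join(q if q == s and q != "-" else " "
                   for q, s in zip(query, subject))


def percent_identity(query: str, subject: str) -> float:
    """Identical columns as a percentage of all alignment columns (gaps included)."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    identical = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * identical / len(query)


# Toy aligned fragments; '-' marks a gap.
query_row = "MKAKL-QYVY"
subject_row = "MRAKLIQY-Y"

print(query_row)
print(midline(query_row, subject_row))           # M AKL QY Y
print(subject_row)
print(percent_identity(query_row, subject_row))  # 70.0
```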

26
Local Alignment Result
27
Graphical Overview
  • Click on Graphical Overview (just under the BLAST
    box on the left) to get an overview of all the
    aligned sequences
  • The extent of the matching region is shown with
    the colored boxes, with non-matching regions
    drawn as a line.
  • Color indicates percent of identical amino acids
  • You can see that mostly our query and the various
    subjects (matches) line up along almost all of
    their lengths.
  • This is a good way to check whether our start
    site is reasonable.
  • A few odd ones lower down.
  • Genes, and pieces of genes, can move to new
    locations in the genome, fuse with other genes,
    break apart, etc. Always subject to natural
    selection: if the altered gene doesn't work, the
    organism will die and we won't see it.
  • And of course, sequencing and annotation errors
    occur.

28
The Basic Points
  • DNA can be read in 3 different reading frames, a
    consequence of the genetic code (3 bases = 1
    amino acid)
  • Genes are found in long open reading frames,
    areas where there are no stop codons.
  • BLAST is the tool we use to compare sequences
    between species
  • BLAST scores (e-values) describe how many equally
    good hits a random sequence would be expected to
    have in the database by chance
  • Gene sequences are conserved between species by
    natural selection
  • DNA sequences outside of genes are much less
    conserved
  • Most genes are transferred vertically, from
    parent to offspring, but a significant number are
    transferred horizontally, from unrelated species.

29
End
30
Other Stuff
  • Within-species BLAST: are there duplicate genes?
    Do their names match? What is the most closely
    related species? Present in both strains?
  • Are nearby genes related by subsystem?