Computational Analysis of Genome Sequences - PowerPoint PPT Presentation

1
Computational Analysis of Genome Sequences
Steven Salzberg
The Institute for Genomic Research (TIGR) and The Johns Hopkins University
2
The Genomics Revolution
  • 1995: 1st genome (H. influenzae, TIGR)
  • 1996: 1st eukaryote (S. cerevisiae)
  • 2000: 29 complete microbial genomes
  • 22 in progress at TIGR
  • 50 in progress worldwide
  • 3 complete eukaryotes: yeast, nematode, fruit fly
  • 2 major projects in 2000: Human (3.3 billion bp) and
    Arabidopsis thaliana (125 million bp)

3
Genomes Completed at TIGR
Organism (genome size)               Reference
Haemophilus influenzae (1.83 Mb)     Fleischmann et al., Science 269, 496-512 (1995)
Mycoplasma genitalium (0.58 Mb)      Fraser et al., Science 270, 397-403 (1995)
Methanococcus jannaschii (1.7 Mb)    Bult et al., Science 273, 1058-73 (1996)
Helicobacter pylori (1.6 Mb)         Tomb et al., Nature 388, 539-47 (1997)
Archaeoglobus fulgidus (2.1 Mb)      Klenk et al., Nature 390, 364-70 (1997)
Borrelia burgdorferi (1.5 Mb)        Fraser et al., Nature 390, 580-6 (1997)
Treponema pallidum (1.1 Mb)          Fraser et al., Science 281, 375-88 (1998)
Plasmodium falciparum chr2 (1 Mb)    Gardner et al., Science 282, 1126-32 (1998)
Thermotoga maritima (1.8 Mb)         Nelson et al., Nature 399, 323-9 (1999)
Deinococcus radiodurans (3.3 Mb)     White et al., Science 286, 1571-7 (1999)
Arabidopsis thaliana chr2 (19 Mb)    Lin et al., Nature 402, 761-8 (1999)
Neisseria meningitidis (2.3 Mb)      Tettelin et al., Science 287, 1809-15 (2000)
Chlamydia pneumoniae (1.2 Mb)        Read et al., Nucleic Acids Res 28, 1397-406 (2000)
Chlamydia trachomatis (1.0 Mb)       Read et al., Nucleic Acids Res 28, 1397-406 (2000)
Vibrio cholerae (4.0 Mb)             Heidelberg et al., Nature, in press
Mycobacterium tuberculosis (4.4 Mb)  Fleischmann et al., manuscript in preparation
Streptococcus pneumoniae (2.2 Mb)    Tettelin et al., manuscript in preparation
Caulobacter crescentus (4.0 Mb)      Nierman et al., manuscript in preparation
Chlorobium tepidum (2.1 Mb)          Eisen et al., manuscript in preparation
Porphyromonas gingivalis (2.2 Mb)    Fleischmann et al., manuscript in preparation
4
Genomes in progress at TIGR
Organism (genome size)                    Funding source
Plasmodium falciparum chr 14 (3.4 Mb)     BWF/DoD
Plasmodium falciparum chr 10, 11 (4 Mb)   NIAID/DoD
Trypanosoma brucei chr 2 (1 Mb)           NIAID
Enterococcus faecalis (3.0 Mb)            NIAID
Mycobacterium avium (4.4 Mb)              NIAID
Pseudomonas putida (6.2 Mb)               DOE
Shewanella putrefaciens (4.5 Mb)          DOE
Staphylococcus aureus (2.8 Mb)            NIAID, MGRI
Dehalococcoides ethenogenes (1.5 Mb)      DOE
Desulfovibrio vulgaris (3.2 Mb)           DOE
Thiobacillus ferrooxidans (2.9 Mb)        DOE
Chlamydia psittaci GPIC (1.2 Mb)          NIAID
Bacillus anthracis (5.0 Mb)               ONR/DOE/NIAID
Treponema denticola (3.0 Mb)              NIDR
C. hydrogenoformans (2.0 Mb)              DOE
Methylococcus capsulatus (4.6 Mb)         DOE
Geobacter sulfurreducens (4.0 Mb)         DOE
Wolbachia sp. (Drosophila) (1.4 Mb)       NIH
Colwellia sp. (1.0 Mb)                    DOE
Mycobacterium smegmatis (4.0 Mb)          NIAID
Staphylococcus epidermidis (2.5 Mb)       NIAID
Theileria parva (10 Mb)                   ILRI/TIGR
5
A Microbial Genome Sequencing Project
Random sequencing
Genome Assembly
Annotation
Data Release
Publication www.tigr.org
Sample tracking
6
Gene Finding
  • Gene finding plays an ever-larger role in
    high-speed DNA sequencing projects
  • There's no time for much else!
  • 1000s of genes generated each month at a
    high-throughput sequencing facility
  • Separate gene finders are needed for every
    organism
  • Training on organism X, then finding genes on Y,
    generates inferior results
  • Bootstrapping problem: training data is hard to
    find

7
Open Reading Frames: 6 possibilities
TCG TAC GTA GCT AGC TAG CTA AGC ATG CAT CGA TCG ATC GAT
T CGT ACG TAG CTA GCT AGC TAA GCA TGC ATC GAT CGA TCG AT
identical sequence
TC GTA CGT AGC TAG CTA GCT AAG CAT GCA TCG ATC GAT CGA T
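The slide shows one sequence split into codons at three offsets; the other three "possibilities" come from the reverse-complement strand. A minimal Python sketch of that enumeration follows. The function names (`reverse_complement`, `codons`, `six_frames`) are illustrative and not taken from the talk.

```python
from typing import Dict, List

# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse-complement strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq: str, offset: int) -> List[str]:
    """Split seq into complete triplets starting at the given offset."""
    return [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]


def six_frames(seq: str) -> Dict[str, List[str]]:
    """Return the codons of all six reading frames, keyed +1..+3 and -1..-3."""
    rc = reverse_complement(seq)
    frames: Dict[str, List[str]] = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = codons(seq, offset)
        frames[f"-{offset + 1}"] = codons(rc, offset)
    return frames


# The sequence from the slide, with the spaces removed.
seq = "TCGTACGTAGCTAGCTAGCTAAGCATGCATCGATCGATCGAT"
for name, frame in six_frames(seq).items():
    print(name, " ".join(frame))
```

Unlike the slide, this sketch drops the one or two leftover bases at the ends of frames 2 and 3, since they do not form complete codons.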
8
GLIMMER: A Microbial Gene Finder
  • GLIMMER 2.0 released late 1999
  • > 200 site licenses worldwide
  • Works on bacteria, archaea, viruses too
  • Malaria (eukaryotic) version: GLIMMERM
  • Refs: Salzberg et al., NAR, 1998; Genomics, 1999;
    Delcher et al., NAR, 1999
  • Web site and code
  • http://www.tigr.org/

9
Uniform Markov Models
  • Use conditional probability of a sequence
    position given previous k positions in the
    sequence.
  • Fixed, kth-order model: bigger k's yield better
    models (as long as data is sufficient).
  • Probability (score) of sequence s1 s2 s3 ... sn is
    the product over positions i of P(si | si-k ... si-1)

10
Uniform Markov Models
  • Advantages
  • Easy to train. Count frequencies of (k+1)-mers in
    training data.
  • Easy to assign a score to a sequence.
  • Disadvantages
  • (k+1)-mers can be undersampled, i.e., occur too
    infrequently in training data.
  • Models sequence as fixed-length chunks, which may
    not be the best model of biology.
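Training by counting (k+1)-mers and scoring by multiplying conditional probabilities can be sketched in a few lines of Python. This is an illustration of a generic fixed-order Markov model, not GLIMMER code. The add-one pseudocount and the choice to skip the first k bases of each sequence are assumptions made here to keep the sketch short.

```python
import math
from collections import defaultdict
from typing import Dict, List

ALPHABET = "ACGT"


def train_fixed_order(seqs: List[str], k: int) -> Dict[str, Dict[str, int]]:
    """Count, for every length-k context, how often each base follows it.

    This is equivalent to counting (k+1)-mers in the training data.
    """
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts


def log_score(seq: str, counts: Dict[str, Dict[str, int]], k: int,
              pseudocount: float = 1.0) -> float:
    """Sum of log P(s_i | previous k bases) over the sequence.

    The pseudocount (an assumption of this sketch) keeps unseen
    (k+1)-mers from producing a probability of zero.
    """
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        n = sum(ctx.values())
        p = (ctx.get(seq[i], 0) + pseudocount) / (n + pseudocount * len(ALPHABET))
        total += math.log(p)
    return total


model = train_fixed_order(["ACGACGACGACG"], k=2)
print(log_score("ACGACG", model, k=2))  # resembles the training data
print(log_score("ACTTCA", model, k=2))  # does not, so it scores lower
```

Working in log space avoids numeric underflow when the product runs over thousands of bases.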

11
Interpolated Markov Models
  • Use a linear combination of 8 different Markov
    chains, for example
  • c8 P(g | atcagtta) + c7 P(g | tcagtta) + ...
  • ... + c1 P(g | a) + c0 P(g)
  • where c0 + c1 + c2 + ... + c8 = 1
  • Equivalent to interpolating the results of
    multiple Markov chains
  • Score of a sequence is the product of
    interpolated probabilities of bases in the
    sequence
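The linear combination above can be illustrated with a small Python sketch. It assumes one fixed weight per order, which is a simplification: the later slides describe weights that depend on the specific context. All names here (`train_all_orders`, `chain_prob`, `interpolated_prob`) and the uniform 0.25 fallback for unseen contexts are hypothetical choices for the example.

```python
from collections import Counter
from typing import Dict, List, Sequence


def train_all_orders(seq: str, max_k: int) -> List[Dict[str, Counter]]:
    """models[k][context] counts which base follows each length-k context."""
    models: List[Dict[str, Counter]] = [dict() for _ in range(max_k + 1)]
    for k in range(max_k + 1):
        for i in range(k, len(seq)):
            models[k].setdefault(seq[i - k:i], Counter())[seq[i]] += 1
    return models


def chain_prob(model: Dict[str, Counter], context: str, base: str) -> float:
    """P(base | context) under one fixed-order chain (uniform if unseen)."""
    counts = model.get(context)
    if not counts:
        return 0.25
    return counts[base] / sum(counts.values())


def interpolated_prob(models: List[Dict[str, Counter]], weights: Sequence[float],
                      context: str, base: str) -> float:
    """Weighted sum c_k * P_k(base | last k bases of context), weights summing to 1.

    The context must contain at least len(weights) - 1 bases.
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    total = 0.0
    for k, c_k in enumerate(weights):
        ctx = context[len(context) - k:]  # empty string when k == 0
        total += c_k * chain_prob(models[k], ctx, base)
    return total


models = train_all_orders("ACGACGACGACG", max_k=2)
weights = [0.2, 0.3, 0.5]  # c0, c1, c2 (illustrative values)
print(interpolated_prob(models, weights, "AC", "G"))
```

The score of a whole sequence would then be the product (or the sum of logs) of these interpolated probabilities, one per base.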

12
IMMs vs. Fixed-Order Models
  • Performance
  • IMM should always do at least as well as
    fixed-order.
  • E.g., even if kth-order model is correct, it can
    be simulated by (k+1)st-order
  • Our results support this.
  • IMM result can be used as fixed-order model.
  • IMM slightly harder to train and uses more memory.

13
IMM Training
  • Problem: how to determine the weights of all the
    thousands of k-mers?
  • Traditionally done with E-M algorithm using
    cross-validation (deleted estimation).
  • Slow.
  • Overtraining can be a problem.

14
GLIMMER IMM Training
  • Our approach assumes
  • Longer context is always better
  • Only reason not to use it is undersampling in
    training data.
  • If sequence occurs frequently enough in training
    data, use it, i.e., λ = 1
  • Otherwise, use frequency and χ² significance to
    set λ.
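One way to turn the rule above into code is sketched below in Python. The slide does not give the count threshold, the confidence cutoff, or the exact formula, so `threshold=400`, `min_confidence=0.5`, and the product `confidence * n / threshold` are assumptions made for illustration. The χ² test compares the base counts seen after a context with the probabilities predicted by the next-shorter context.

```python
import math
from typing import Sequence


def chi2_cdf_3df(x: float) -> float:
    """CDF of the chi-square distribution with 3 degrees of freedom."""
    if x <= 0:
        return 0.0
    return math.erf(math.sqrt(x / 2)) - math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


def context_weight(observed: Sequence[int], shorter_probs: Sequence[float],
                   threshold: int = 400, min_confidence: float = 0.5) -> float:
    """Weight (lambda) for one context, from its four base counts.

    observed      -- counts of A, C, G, T following this context
    shorter_probs -- probabilities of A, C, G, T from the next-shorter context
    threshold and min_confidence are hypothetical defaults.
    """
    n = sum(observed)
    if n >= threshold:
        return 1.0  # seen often enough: trust this context fully
    if n == 0:
        return 0.0  # never seen: fall back to shorter contexts
    chi2 = sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, shorter_probs) if p > 0)
    confidence = chi2_cdf_3df(chi2)
    if confidence < min_confidence:
        return 0.0  # indistinguishable from the shorter context's prediction
    return confidence * n / threshold


print(context_weight([100, 100, 100, 100], [0.25] * 4))  # frequent context
print(context_weight([40, 0, 0, 0], [0.25] * 4))         # rare but very skewed
```

The design intent is that a rare context only earns weight when its predictions differ significantly from what the shorter context already says.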

15
How GLIMMER Works
  • Three separate programs
  • long-orfs: automatically extracts long open
    reading frames that do not overlap other long
    ORFs.
  • IMM model builder. Takes any kind of sequence
    data.
  • Gene predictor. Takes genome sequence and finds
    all the genes.

16
Gene Predictor
  • Finds and scores entire ORFs.
  • Uses 7 competing models: 6 reading frames plus
    random model.
  • Score for an ORF is the probability that the
    right model generated it.
  • 3-periodic Markov model
  • High-scoring ORFs are then checked for overlaps.
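The idea of scoring an ORF as the probability that the right model generated it can be sketched as a normalization over the seven competing models. The Python example below assumes equal prior probability for each model, and the log-likelihood values are made up for illustration; neither detail comes from the slides.

```python
import math
from typing import Dict


def model_posteriors(log_likelihoods: Dict[str, float]) -> Dict[str, float]:
    """Convert per-model log-likelihoods into probabilities that sum to 1.

    Assumes equal priors for all models. Subtracting the maximum before
    exponentiating avoids underflow for long ORFs.
    """
    top = max(log_likelihoods.values())
    weights = {name: math.exp(ll - top) for name, ll in log_likelihoods.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical log-likelihoods of one ORF under the 7 competing models.
scores = {
    "frame+1": -1200.0, "frame+2": -1260.0, "frame+3": -1258.0,
    "frame-1": -1262.0, "frame-2": -1259.0, "frame-3": -1261.0,
    "random": -1255.0,
}
posteriors = model_posteriors(scores)
print(max(posteriors, key=posteriors.get), posteriors["frame+1"])
```

An ORF would be reported as a candidate gene when the model for its own reading frame receives most of the probability mass.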

17
Glimmer 2.0 IMM design
[Figure: context tree for the example sequence ATGCATGATCGAG. Node labels: Pos -1, Pos -2, Pos -3, Pos -4; branch labels: a, t, c, g; annotations: Context, 12bp, 8 levels deep.]
18
Better Overlap Resolution
19
Better Overlap Resolution
20
GLIMMER 2.0's Performance
Organism         Genes Annotated   Genes Found (%)   Additional Genes (%)
H. influenzae    1738              1720 (99.0)       250 (14)
M. genitalium    483               480 (99.4)        81 (17)
M. jannaschii    1727              1721 (99.7)       221 (13)
H. pylori        1590              1550 (97.5)       293 (18)
E. coli          4269              4158 (97.4)       824 (19)
B. subtilis      4100              4030 (98.3)       586 (14)
A. fulgidus      2437              2404 (98.6)       274 (11)
B. burgdorferi   853               843 (99.3)        62 (7)
T. pallidum      1039              1014 (97.6)       180 (17)
T. maritima      1877              1854 (98.8)       190 (10)
21
GLIMMER 2.0 on known genes
Organism         Genes Annotated   Known Genes   Correct Predictions (%)
H. influenzae    1738              1501          1496 (99.7)
M. genitalium    483               478           476 (99.6)
M. jannaschii    1727              1259          1256 (99.8)
H. pylori        1590              1092          1084 (99.3)
E. coli          4269              2656          2632 (99.1)
B. subtilis      4100              1249          1231 (98.6)
A. fulgidus      2437              1799          1786 (99.3)
B. burgdorferi   853               601           600 (99.8)
T. pallidum      1039              755           747 (98.9)
T. maritima      1877              1504          1493 (99.3)
Average                                          (99.3)
22
  • Speed
  • Training for 2 Megabase genome: < 1 minute
    (on a Pentium-450)
  • Find all genes in 2 Mb genome: < 1 minute
  • Impact: GLIMMER was used for
  • B. burgdorferi (Lyme disease), T. pallidum
    (syphilis) (TIGR)
  • C. trachomatis (blindness, STD) (Berkeley/Stanford)
  • C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
  • T. maritima, D. radiodurans, M. tuberculosis, V.
    cholerae, S. pneumoniae, C. trachomatis, C.
    pneumoniae, N. meningitidis (TIGR)
  • X. fastidiosa (Brazilian consortium)
  • Plasmodium falciparum (malaria): GlimmerM
  • Arabidopsis thaliana (model plant): GlimmerM
  • Others: viruses, simple eukaryotes, more bacteria

23
Self-Similarity Scans
  • Idea: analyze a whole genome by counting 3-mers
    in all 6 frames
  • Analyze small windows (2000 bp, 10,000 bp) using
    the same statistic
  • Algorithm
  • Build model of entire sequence
  • Apply the χ² statistic to compare windows to the
    genome itself
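The scan described above can be sketched in Python as follows. The sketch treats "3-mers in all 6 frames" as every overlapping trinucleotide on both strands, which is an interpretation rather than something the slides spell out, and the window and step sizes in the example are arbitrary. Names such as `trimer_counts` and `chi2_scan` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def trimer_counts(seq: str) -> Counter:
    """Count overlapping 3-mers on both strands (covers all 6 frames)."""
    rc = seq.translate(_COMPLEMENT)[::-1]
    counts: Counter = Counter()
    for strand in (seq, rc):
        for i in range(len(strand) - 2):
            counts[strand[i:i + 3]] += 1
    return counts


def chi2_scan(genome: str, window: int, step: int) -> List[Tuple[int, float]]:
    """Return (window start, chi-square vs. whole-genome 3-mer model)."""
    genome_counts = trimer_counts(genome)
    genome_total = sum(genome_counts.values())
    results = []
    for start in range(0, len(genome) - window + 1, step):
        win_counts = trimer_counts(genome[start:start + window])
        win_total = sum(win_counts.values())
        chi2 = 0.0
        for trimer, g in genome_counts.items():
            expected = win_total * g / genome_total
            chi2 += (win_counts[trimer] - expected) ** 2 / expected
        results.append((start, chi2))
    return results


# Toy genome with an atypical GC-rich island in the middle.
genome = "ACGT" * 50 + "GGGGCCCC" * 10 + "ACGT" * 50
for start, chi2 in chi2_scan(genome, window=40, step=40):
    print(start, round(chi2, 1))
```

Windows whose trinucleotide composition differs from the genome as a whole produce peaks, which is the signal the following slides plot for real genomes.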

24
Haemophilus influenzae (meningitis)
[Plot: χ² and GC]
25
Thermotoga maritima (hyperthermophile)
26
Vibrio cholerae (cholera)
27
On the other side of CTXφ prophage is a region
encoding an RTX toxin (rtxA) and its activator
(rtxC) and transporters (rtxBD). A third
transporter gene has been identified that is a
paralog of rtxB, and is transcribed in the same
direction as rtxBD. Downstream of this gene are
two genes encoding a sensor histidine kinase and
response regulator. Trinucleotide composition
analysis suggests that the RTX region was
horizontally acquired along with the sensor
histidine kinase/response regulator, suggesting
these regulators effect expression of the closely
linked RTX transcriptional units. --Heidelberg et
al., Nature, in press.
28
MUMmer
  • Aligns 2 complete genomes
  • Maximal Unique Matches
  • Suffix trees
  • Very fast alignment of very long DNA sequences
  • Ref: Delcher et al., Nucl. Acids Res., 1999
  • Software at
  • http://www.tigr.org/softlab

29
The Problem
  • Efficiently compute alignments between long
    sequences to identify biologically interesting
    features.
  • E.g., two strains of M. tuberculosis, each
    4.4 Mb
  • E.g., two versions of a genome at different
    stages of closure
  • Compute alignment in less than 2 minutes

30
Maximal Unique Sequences
Sequences in genomes A and B that
  • Occur exactly once in A and in B
  • Are not contained in any larger such sequence
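The definition above can be checked directly with a brute-force Python sketch. This is quadratic or worse and only suitable for toy inputs; MUMmer itself uses suffix trees, as later slides explain. The `min_len` cutoff and the function names are assumptions of the example.

```python
from typing import List, Tuple


def count_occurrences(text: str, pattern: str) -> int:
    """Count possibly overlapping occurrences of pattern in text."""
    count, pos = 0, text.find(pattern)
    while pos != -1:
        count += 1
        pos = text.find(pattern, pos + 1)
    return count


def find_mums(a: str, b: str, min_len: int = 3) -> List[Tuple[int, int, int]]:
    """Return (start in a, start in b, length) for each maximal unique match."""
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue  # could be extended to the left, so not maximal
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1  # extend to the right as far as possible
            if length < min_len:
                continue
            match = a[i:i + length]
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return mums


print(find_mums("TTTACGTGGG", "CCCACGTAAA"))  # the shared ACGT at offset 3
```

A match that is unique in both sequences but still extendable is skipped, because its extension would also be unique and would contain it.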
31
Select the longest consistent set of MUMs
  • Occur in the same order in A and B
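Selecting MUMs that occur in the same order in both genomes is a variant of the longest increasing subsequence problem. The Python sketch below uses a simple quadratic dynamic program and maximizes the total matched length. It ignores overlaps between neighboring MUMs, and the scoring choice is an assumption of the sketch rather than a statement of how MUMmer weighs matches.

```python
from typing import List, Tuple

Mum = Tuple[int, int, int]  # (start in A, start in B, length)


def longest_consistent_chain(mums: List[Mum]) -> List[Mum]:
    """Pick MUMs increasing in both A and B with maximum total length."""
    ordered = sorted(mums)  # sort by position in A
    n = len(ordered)
    if n == 0:
        return []
    best = [m[2] for m in ordered]  # best chain length ending at each MUM
    prev = [-1] * n                 # back-pointers for reconstruction
    for i in range(n):
        for j in range(i):
            if ordered[j][1] < ordered[i][1] and best[j] + ordered[i][2] > best[i]:
                best[i] = best[j] + ordered[i][2]
                prev[i] = j
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = prev[end]
    return chain[::-1]


# Hypothetical MUM coordinates; two of them are out of order in B.
mums = [(0, 50, 10), (20, 0, 8), (40, 20, 12), (70, 60, 9), (90, 30, 5)]
print(longest_consistent_chain(mums))
```

MUMs left out of the chain are the "inconsistent" ones, which can point to rearrangements or inserted sequence.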
32
Suffix Trees
  • A tree with edges labelled by strings
  • Labels of child edges of a node begin with
    distinct letters
  • Each leaf L represents a sequence: the labels on
    the path to L from the root
  • Holds all suffixes of a set of sequences
  • A suffix is a subsequence that extends to the
    end of its sequence
  • The suffix tree for sequences A and B
  • Contains fewer than 2(|A| + |B|) nodes.
  • Can be constructed in O(|A| + |B|) time!
  • Still need lots of RAM
  • All the analyses here were run on a desktop PC
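To make the idea of a structure that holds all suffixes of a set of sequences concrete, here is a deliberately naive Python sketch. It builds an uncompressed suffix trie (one character per edge) in quadratic time and space, so it does not have the linear size or linear construction time of a true suffix tree. The class and function names are hypothetical.

```python
from typing import Dict, Set


class TrieNode:
    """One node of an uncompressed suffix trie."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.sources: Set[str] = set()  # names of sequences passing through


def build_suffix_trie(sequences: Dict[str, str]) -> TrieNode:
    """Insert every suffix of every named sequence into one trie."""
    root = TrieNode()
    for name, seq in sequences.items():
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node.children.setdefault(ch, TrieNode())
                node.sources.add(name)
    return root


def sources_containing(root: TrieNode, pattern: str) -> Set[str]:
    """Names of the sequences that contain pattern as a substring."""
    node = root
    for ch in pattern:
        child = node.children.get(ch)
        if child is None:
            return set()
        node = child
    return set(node.sources)


trie = build_suffix_trie({"A": "GATTACA", "B": "TTACCA"})
print(sources_containing(trie, "TTAC"))  # found in both sequences
print(sources_containing(trie, "GATT"))  # found only in A
```

Because every substring is a prefix of some suffix, a single walk from the root answers which sequences share a given substring. A real suffix tree gets the same answers while collapsing single-child paths into labelled edges.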

33
  • Analyze the gaps between adjacent MUMs
  • Small gaps can be aligned with Smith-Waterman
    algorithm
  • Large gaps can be aligned recursively
  • Large inserts can be searched for separately.
    Many will be inconsistent MUMs
  • Overlapping MUMs indicate variation in copy
    number of small repeats
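For the small gaps between adjacent MUMs, a standard Smith-Waterman local alignment is enough. The Python sketch below computes only the best local alignment score with a linear gap penalty. The scoring values (match 2, mismatch -1, gap -2) are illustrative defaults, not parameters from the talk.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTT", "ACTT"))  # one gap, four matches
```

The table has |a| x |b| cells, which is why this step is reserved for short gaps and the long stretches are anchored by MUMs instead.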

34
M. tuberculosis CSU93 vs. H37Rv
     A     C     G     T
A    -     66    164   9
C    48    -     81    169
G    164   89    -     44
T    11    159   61    -
35
M. genitalium vs. M. pneumoniae
36
H. pylori 26695 vs. J99
37
V. cholerae (forward) vs. E. coli
Origin
38
V. cholerae (reverse) vs. E. coli
39
V. cholerae (both strands) vs. E. coli: a puzzle?
40
V. cholerae vs. itself
41
S. pyogenes vs. S. pneumoniae
42
S. pyogenes vs. itself
43
M. leprae vs. M. tuberculosis
[Dot plot; axes: M. tuberculosis and M. leprae]
44
X-alignments: how?
[Diagram: genome segments numbered 1 to 6, shown in several arrangements around the origin (Ori)]
45
Chr 2 vs. Chr 4 of Arabidopsis thaliana:
discovery of a 4 Mb duplication
1100 genes, 430 (39%) duplicated
46
Acknowledgements
  • GLIMMER, GLIMMERM
  • Arthur Delcher, Simon Kasif, Owen White, Mihaela
    Pertea
  • MUMmer
  • Arthur Delcher, Simon Kasif, Jeremy Peterson, Rob
    Fleischmann, Owen White
  • Analyses
  • Numerous TIGR faculty and staff, including
    Jonathan Eisen, Owen White, Rob Fleischmann,
    Hervé Tettelin, Tim Read, Maria Ermolaeva, John
    Heidelberg, Ian Paulsen, Malcolm Gardner, Claire
    Fraser, Clyde Hutchison, ...
  • Supported by
  • National Institutes of Health (NHGRI, NLM)
  • National Science Foundation (CISE, BIO)