Title: Part 12 Genome Analysis
1Part 12 Genome Analysis
2Outline
- Overview
- Why do comparative genomic analysis?
- Assumptions/Limitations
- Genome Analysis and Annotation Standard Procedure
- General Purpose Databases for Comparative Genomics
- Organism Specific Databases
- Genome Analysis Environments
- Genome Sequence Alignment Programs
- Genomic Comparison Visualization Tools
3Some of the prokaryotic genomes
4Some of the eukaryotic genomes
Organism | Disease or description | Status
Aspergillus fumigatus | Farmer's lung | In progress
Dictyostelium discoideum | Soil amoeba | In progress
Entamoeba histolytica | Amoebic dysentery | In progress
Leishmania major | Leishmaniasis | In progress
Plasmodium falciparum | Malaria | In progress
Schistosoma mansoni | Bilharzia | In progress
Schizosaccharomyces pombe | Fission yeast | Complete
Theileria annulata | Veterinary | In progress
Toxoplasma gondii | Toxoplasmosis | In progress
Trypanosoma brucei | Sleeping sickness | In progress
5Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid sequence
2. Analysis of protein sequence
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene and protein expression data
7. Drug screening
Ab initio drug design OR drug compound screening in a database of molecules
8. Genetic variability
6Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
7Genome Sequencing - Review
Strategy
Libraries
Sequencing
Assembly
Closure
Annotation
Release
8Annotation of eukaryotic genomes
Genomic DNA
ab initio gene prediction
transcription
Unprocessed RNA
RNA processing
Mature mRNA
AAAAAAA
Gm3
Comparative gene prediction
translation
Nascent polypeptide
folding
Active enzyme
Functional identification
Reactant A
Product B
Function
9Why do comparative genomics?
- Many of the genes encoded in each genome from the genome projects had no known or predictable function
- Analysis of the protein sets from completely sequenced genomes
- Uniform evolutionary conservation of proteins in microbial genomes: 70% of gene products from sequenced genomes have homologs in distant genomes (Koonin et al., 1997)
- The function of many of these genes can be predicted by comparing different genomes of known functional annotation and transferring the functional annotation of proteins from better studied organisms to their orthologs in lesser studied organisms
- Cross-species comparison helps reveal conserved coding regions
- No prior knowledge of the sequence motif is necessary
- Complement to algorithmic analysis
10Assumptions/Limitations
- Homologous genes are relatively well preserved, while noncoding regions tend to show varying degrees of conservation. Conserved noncoding regions are believed to be important in regulating gene expression, maintaining the structural organization of the genome, and most likely other functions.
- Cross-species comparative genomics is influenced by the evolutionary distance between the compared species.
11Genome Analysis and Annotation General Procedure
- Basic procedure to determine the functional and structural annotation of uncharacterized proteins
- Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, Smith-Waterman based programs are preferred, with the trade-off of greater analysis time.
- Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
- Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity
- Generate a secondary and (if possible) tertiary structure prediction
- Annotation
- Transfer function information from a well-characterized organism to a lesser studied organism and/or
- Use phylogenetic patterns (or profiles) and/or
- Use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
12Genome Analysis and Annotation: One Possible Procedure
- Basic procedure to determine the functional and structural annotation of uncharacterized proteins
- Use a sequence similarity search program such as BLAST or FASTA to identify all the functional regions in the sequence. If greater sensitivity is required, Smith-Waterman based programs are preferred, with the trade-off of greater analysis time.
- Identify functional motifs and structural domains by comparing the protein sequence against PROSITE, BLOCKS, SMART, CDD, or Pfam.
- Predict structural features of the protein such as signal peptides, transmembrane segments, coiled-coil regions, and other regions of low sequence complexity
- Generate a secondary and (if possible) tertiary structure prediction
- Transfer function information from a well-characterized organism to a lesser studied organism and/or use phylogenetic patterns (or profiles) and/or use the phylogenetic pattern search tools (e.g. through COGs) to perform systematic formal logical operations (AND, OR, NOT) on gene sets: differential genome display (Huynen et al., 1997).
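The logical operations (AND, OR, NOT) on gene sets named in the last step can be illustrated with ordinary set algebra. This is only a minimal sketch, assuming each genome is reduced to a set of ortholog-group identifiers. The genome names, the group IDs, and the function `differential_display` are invented for illustration and are not part of the COG tools.

```python
# Hypothetical ortholog-group content of three genomes (all IDs are made up).
genomes = {
    "pathogen_A": {"COG0001", "COG0002", "COG0003", "COG0007"},
    "pathogen_B": {"COG0001", "COG0003", "COG0007", "COG0009"},
    "commensal_C": {"COG0001", "COG0002", "COG0009"},
}

def differential_display(genomes, present_in, absent_from):
    """Groups found in ALL genomes of `present_in` (AND)
    and in NONE of the genomes of `absent_from` (NOT)."""
    shared = set.intersection(*(genomes[name] for name in present_in))
    excluded = set().union(*(genomes[name] for name in absent_from))
    return shared - excluded

# AND + NOT: groups in both pathogens but missing from the commensal.
pattern = differential_display(
    genomes, ["pathogen_A", "pathogen_B"], ["commensal_C"]
)
print(sorted(pattern))  # ['COG0003', 'COG0007']

# OR: groups found in at least one of the two pathogens.
either = genomes["pathogen_A"] | genomes["pathogen_B"]
print(len(either))  # 5
```

Groups that survive such a filter are candidates for functions specific to the selected organisms; whether they are biologically meaningful still has to be checked case by case.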
13Automated Genome Annotation
- GeneQuiz: limited number of searches/day
- MAGPIE: outside users cannot submit their own sequences
- PEDANT: commercial version allows full capacity
- SEALS: semi-automated
14General Databases Useful for Comparative Genomics
- LocusLink/RefSeq http://www.ncbi.nih.gov/LocusLink/
- PEDANT - Protein Extraction Description ANalysis Tool http://pedant.gsf.de/
- MIPS http://mips.gsf.de/
- COGs - Clusters of Orthologous Groups (of proteins) http://www.ncbi.nih.gov/COG/
- KEGG - Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp/kegg/
- MBGD - Microbial Genome Database http://mbgd.genome.ad.jp/
- GOLD - Genomes OnLine Database http://wit.integratedgenomics.com/GOLD/
- TOGA http://www.tigr.org/xxxxx
15Problems with existing sequence alignment algorithms for genomic analysis
- Most algorithms were developed for comparing single protein sequences or DNA sequences containing a single gene.
- Most algorithms are based on assigning a score to all possible alignments (usually the sum of the similarity/identity values for each aligned residue minus a penalty for the introduction of gaps) and then finding the optimal or near-optimal alignment under the chosen scoring scheme.
- Unfortunately, most of these programs cannot accurately handle long alignments.
- Linear-space Smith-Waterman variants are too computationally intensive: they either require specialized hardware (memory-limited) or are very time-consuming. The trade-off is higher speed vs. increased sensitivity.
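The scoring scheme described above (similarity values summed over aligned residues, minus gap penalties) and the cost of a linear-space Smith-Waterman search can be made concrete with a short sketch. It assumes a simple match/mismatch score and a linear gap penalty; the parameter values and the function name `smith_waterman_score` are illustrative choices, not taken from any particular program.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b.

    Only two rows of the dynamic-programming matrix are kept, so memory
    grows with len(b), but every one of the len(a) * len(b) cells is
    still computed.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("TTACGTTT", "GGACGTGG"))  # 8 (the shared ACGT)
```

Because the running time is proportional to the product of the two sequence lengths, comparing two sequences of a few million bases each already means on the order of 10^12 to 10^13 cell updates, which is why the faster heuristic tools are used at genome scale.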
16Genome-size comparative alignment tools
- ASSIRC - Accelerated Search for SImilarity Regions in Chromosomes
- ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et al. 1998)
- BLAT
- http://genome.ucsc.edu/cgi-bin/hgBlat?command=start (Kent xxx)
- DIALIGN - DIagonal ALIGNment
- http://www.gsf.de/biodv/dialign.html (Morgenstern et al. 1998; Morgenstern 1999)
- DBA - DNA Block Aligner
- http://www.sanger.ac.uk/Software/Wise2/dba.shtml (Jareborg et al. 1999)
- GLASS - GLobal Alignment SyStem
- http://plover.lcs.mit.edu/ (Batzoglou et al. 2000)
- LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL PAIRS
- Email: jbuhler_at_cs.washington.edu (Buhler 2001)
- MegaBlast
- http://www.ncbi.nih.gov/blast/ (Zhang 2000)
- MUMmer - Maximal Unique Match (mer)
- http://www.tigr.org/softlab/ (Delcher et al. 1999)
- PIPMaker - Percent Identity Plot MAKER
- http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
- SSAHA - Sequence Search and Alignment by Hashing Algorithm
17SSAHA
- Sequence Search and Alignment by Hashing Algorithm
- Software tool for very fast matching and alignment of DNA sequences
- Achieves fast search speed by converting sequence information into a hash table data structure, which can then be searched very rapidly for matches
- http://www.sanger.ac.uk/Software/analysis/SSAHA/
- Run from the Unix command line
- Needs > 1 GB RAM (needs a lot of memory)
- The SSAHA algorithm is best for applications requiring exact or almost exact matches between two sequences, e.g. SNP detection, fast sequence assembly, ordering and orientation of contigs
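The hash-table idea behind this kind of search can be sketched in a few lines. This is not the SSAHA implementation: it is a toy illustration that assumes every overlapping k-mer of the subject sequence is stored, uses a small k for readability, and only reports exact matches. The function names are hypothetical.

```python
from collections import defaultdict

def build_kmer_index(subject, k):
    """Map every k-mer of `subject` to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        index[subject[pos:pos + k]].append(pos)
    return index

def find_exact_matches(index, subject, query, k):
    """Start positions in `subject` where `query` occurs exactly."""
    if len(query) < k:
        raise ValueError("query must be at least k bases long")
    hits = []
    # One hash lookup on the first k-mer replaces a scan of the subject.
    for pos in index.get(query[:k], []):
        # Verify the remainder of the query at each candidate position.
        if subject[pos:pos + len(query)] == query:
            hits.append(pos)
    return hits

subject = "ACGTACGTTAGCACGTTA"
index = build_kmer_index(subject, k=4)
print(find_exact_matches(index, subject, "ACGTTA", k=4))  # [4, 12]
```

The index is built once and reused for every query, which is where the speed comes from; the price is that the whole table has to fit in memory, consistent with the RAM requirement noted above.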
18Genome Analysis Environment
- MAGPIE - Automated Genome Project Investigation Environment
- PEDANT
- SEALS
19Problems with Visualizing Genomes
- Alignment program output was often visualized as text files, which can be difficult to interpret intuitively when comparing genomes.
- Visualization tools need to handle the complexity and volume of data and present the information in a comprehensive and comprehensible manner to a biologist for interpretation.
- Genome alignment visualization tools need to provide:
- Interpretable alignments
- Gene prediction and database homologies from different sources
- Interactive features: real-time capabilities, zooming, searching specific regions of homology
- Representation of breaks in synteny
- Multiple alignment display
- Display of contigs of unfinished genomes with finished genomes
- Handling of various data formats
- Software availability (no black box)
20Genome Comparison Visualization Tool
- ACT - Artemis Comparison Tool (displays parsed BLAST alignments; based on Artemis, an annotation tool)
- http://www.sanger.ac.uk/Software/ACT/
- Alfresco (displays DBA alignments and ...)
- http://www.sanger.ac.uk/Software/Alfresco/ (Jareborg & Durbin 2000)
- PipMaker (displays BlastZ alignments)
- http://bio.cse.psu.edu/pipmaker/ (Schwartz et al. 2000)
- Enteric/Menteric/Maj (displays BlastZ alignments)
- http://glovin.cse.psu.edu/enterix/ (Florea et al. 2000; McClelland et al. 2000)
- Intronerator (displays WABA alignments and ...)
- http://www.cse.ucsc.edu/kent/intronerator/ (Kent & Zahler 2000b)
- VISTA (Visualization Tool for Alignment) (displays GLASS alignments)
- http://www-gsd.lbl.gov/vista/
- SynPlot (displays DIALIGN and GLASS alignments)
- http://www.sanger.ac.uk/Users/igrg/SynPlot/
21Artemis Comparison Tool (ACT)
- ACT is a DNA sequence comparison viewer based on Artemis
- Can read complete EMBL and GenBank entries or sequences in FASTA or raw format
- Additional sequence features can be in EMBL, GenBank, or GFF format
- ACT is free software and is distributed under the GNU General Public License
- Java-based software
- Latest release 2.0: better support for eukaryotic genome comparison
- http://www.sanger.ac.uk/Software/ACT/
22Salmonella typhi vs. E. coli SPI-2
GC tRNA phage/IS genes Pseudogenes
S.typhi
Blast hits
E.coli
23Salmonella typhi and Yersinia pestis type III
secretion systems
24Salmonella typhi vs. E. coli - ACT
SPI-10
SPI-1
SPI-2
SPI-9
SPI-7 Vi
S. typhi
DNA matches
E. coli
25Neisseria meningitidis - A vs. B comparison - ACT
26Extra Slides 1
27ASSIRC
- Accelerated Search for SImilarity Regions in Chromosomes
- ASSIRC finds regions of similarity in pair-wise genomic sequence alignments.
- The method involves three steps:
- (i) identification of short exact chains of fixed size, called 'seeds', common to both sequences, using hashing functions
- (ii) extension of these seeds into putative regions of similarity by a 'random walk' procedure (i.e. the four bases are associated ...)
- (iii) final selection of regions of similarity by assessing alignments of the putative sequences.
- The authors used simulations to estimate the proportion of regions of similarity not detected for particular region sizes, base identity proportions and seed sizes.
- This approach can be tailored to the user's specifications.
- The authors looked for regions of similarity between two yeast chromosomes (V and IX). The efficiency of the approach was compared with that of the conventional programs BLAST and FASTA by assessing the CPU time required and the regions of similarity found for the same data set.
- http://www.biologie.ens.fr/perso/vincens/assirc.html
- ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz
28BLAT
- Only DNA sequences of 25,000 or fewer bases and protein or translated sequences of 5,000 or fewer letters will be processed. If multiple sequences are submitted at the same time, the total limit is 50,000 bases or 12,500 letters.
- BLAT on DNA is designed to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. It will find perfect sequence matches of 33 bases, and sometimes find them down to 22 bases. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. In practice DNA BLAT works well on primates, and protein BLAT on land vertebrates.
- BLAT is not BLAST. DNA BLAT works by keeping an index of the entire genome in memory. The index consists of all non-overlapping 11-mers except for those heavily involved in repeats. The index takes up a bit less than a gigabyte of RAM. The genome itself is not kept in memory, allowing BLAT to deliver high performance on a reasonably priced Linux box. The index is used to find areas of probable homology, which are then loaded into memory for a detailed alignment. Protein BLAT works in a similar manner, except with 4-mers rather than 11-mers. The protein index takes a little more than 2 gigabytes.
- BLAT was written by Jim Kent. Like most of Jim's software, interactive use on this web server is free to all. Sources and executables to run batch jobs on your own server are available free for academic, personal, and non-profit purposes. Non-exclusive commercial licenses are also available. Contact Jim for details.
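The indexing scheme described above (non-overlapping k-mers of the genome, used to locate areas of probable homology) can be sketched as follows. This is a simplified illustration, not BLAT's code: it assumes a tiny k of 3 instead of 11, looks up every overlapping k-mer of the query, and skips both the repeat filtering and the detailed alignment stage. The function names are hypothetical.

```python
from collections import defaultdict

def index_nonoverlapping(genome, k):
    """Index the k-mers starting at positions 0, k, 2k, ... of `genome`."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index

def seed_hits(index, query, k):
    """Look up every overlapping k-mer of `query`.

    Returns (query_pos, genome_pos) pairs for each k-mer found in the index.
    """
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos))
    return hits

genome = "TTGACCATGGCAATCGGA"
index = index_nonoverlapping(genome, k=3)
hits = seed_hits(index, "CATGGCAAT", k=3)
print(hits)  # [(1, 6), (4, 9)]

# Hits sharing the same offset (genome_pos - query_pos) lie on one diagonal,
# which marks a candidate region for detailed alignment.
print({gpos - qpos for qpos, gpos in hits})  # {5}
```

Storing only every k-th genome k-mer shrinks the index by roughly a factor of k, while scanning the query with overlapping k-mers still lets a sufficiently long match hit the index in any reading frame.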