Part 12 Genome Analysis

Provided by: chu8163
1
Part 12 Genome Analysis
2
Outline
  • Overview
  • Why do comparative genomic analysis?
  • Assumptions/Limitations
  • Genome Analysis and Annotation Standard Procedure
  • General-Purpose Databases for Comparative
    Genomics
  • Organism Specific Databases
  • Genome Analysis Environments
  • Genome Sequence Alignment Programs
  • Genomic Comparison Visualization Tools

3
Some of the prokaryotic genomes
4
Some of the eukaryotic genomes


Organism                     Disease / description   Status
Aspergillus fumigatus        Farmer's lung           In progress
Dictyostelium discoideum     Soil amoeba             In progress
Entamoeba histolytica        Amoebic dysentery       In progress
Leishmania major             Leishmaniasis           In progress
Plasmodium falciparum        Malaria                 In progress
Schistosoma mansoni          Bilharzia               In progress
Schizosaccharomyces pombe    Fission yeast           Complete
Theileria annulata           Veterinary              In progress
Toxoplasma gondii            Toxoplasmosis           In progress
Trypanosoma brucei           Sleeping sickness       In progress



5
Bioinformatics Flow Chart
1a. Sequencing
1b. Analysis of nucleic acid sequence
2. Analysis of protein sequence
3. Molecular structure prediction
4. Molecular interaction
5. Metabolic and regulatory networks
6. Gene and protein expression data
7. Drug screening: ab initio drug design OR drug compound
   screening in a database of molecules
8. Genetic variability
6
Genomic DNA
Shearing/Sonication
Subclone and Sequence
Shotgun reads
Assembly
Contigs
Finishing read
Finishing
Complete sequence
7
Genome Sequencing - Review
Strategy
Libraries
Sequencing
Assembly
Closure
Annotation
Release
8
Annotation of eukaryotic genomes
Genomic DNA
ab initio gene prediction
transcription
Unprocessed RNA
RNA processing
Mature mRNA
AAAAAAA
Gm3
Comparative gene prediction
translation
Nascent polypeptide
folding
Active enzyme
Functional identification
Reactant A
Product B
Function
9
Why do comparative genomics?
  • Many of the genes encoded in each genome from the
    genome projects had no known or predictable
    function
  • Analysis of protein sets from completely sequenced
    genomes
  • Uniform evolutionary conservation of proteins in
    microbial genomes: 70% of gene products from
    sequenced genomes have homologs in distant
    genomes (Koonin et al., 1997)
  • The function of many of these genes can be
    predicted by comparing different genomes of known
    functional annotation and transferring functional
    annotation of proteins from better-studied
    organisms to their orthologs in lesser-studied
    organisms.
  • Cross-species comparison helps reveal conserved
    coding regions
  • No prior knowledge of the sequence motif is
    necessary
  • Complement to algorithmic analysis

10
Assumptions/Limitations
  • Homologous genes are relatively well preserved,
    while noncoding regions tend to show varying
    degrees of conservation. Conserved noncoding
    regions are believed to be important in
    regulating gene expression, maintaining the
    structural organization of the genome, and most
    likely other functions.
  • Cross-species comparative genomics is influenced
    by the evolutionary distance between the compared
    species.

11
Genome Analysis and Annotation General Procedure
  • Basic procedure to determine the functional and
    structural annotation of uncharacterized
    proteins
  • Use a sequence similarity search program such as
    BLAST or FASTA to identify all the functional
    regions in the sequence. If greater sensitivity
    is required, Smith-Waterman-based programs are
    preferred, at the cost of greater analysis time.
  • Identify functional motifs and structural domains
    by comparing the protein sequence against
    PROSITE, BLOCKS, SMART, CDD, or Pfam.
  • Predict structural features of the protein such
    as signal peptides, transmembrane segments,
    coiled-coil regions, and other regions of low
    sequence complexity
  • Generate a secondary and (if possible) tertiary
    structure prediction
  • Annotation
  • Transfer function information from a
    well-characterized organism to a lesser-studied
    organism and/or
  • Use phylogenetic patterns (or profiles) and/or
  • Use the phylogenetic pattern search tools (e.g.
    through COGs) to perform systematic formal
    logical operations (AND, OR, NOT) on gene sets --
    differential genome display (Huynen et al., 1997).
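
The logical operations on gene sets mentioned in the last bullet can be illustrated with ordinary set algebra. This is only a minimal sketch: the COG identifiers and the presence/absence assignments are hypothetical placeholders, not real COG data, and the helper name genes_in is an assumption made for this example.

```python
# Hypothetical presence/absence table: orthologous group -> genomes with a member.
# Identifiers and assignments are illustrative only, not real COG data.
cog_to_genomes = {
    "COG_A": {"E. coli", "S. typhi", "Y. pestis"},
    "COG_B": {"S. typhi", "Y. pestis"},
    "COG_C": {"E. coli"},
    "COG_D": {"S. typhi"},
}


def genes_in(genome):
    """Return the set of orthologous groups represented in one genome."""
    return {cog for cog, genomes in cog_to_genomes.items() if genome in genomes}


ecoli = genes_in("E. coli")
styphi = genes_in("S. typhi")
ypestis = genes_in("Y. pestis")

shared = styphi & ypestis                    # AND: present in both genomes
either = ecoli | styphi                      # OR: present in at least one genome
pathogen_only = (styphi & ypestis) - ecoli   # AND ... NOT: the differential set

print(sorted(pathogen_only))
```

A phylogenetic pattern query of this kind reduces to intersections, unions, and differences over per-genome sets, which is why Python's built-in set type is enough for the sketch.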

12
Genome Analysis and Annotation: One Possible
Procedure
  • Basic procedure to determine the functional and
    structural annotation of uncharacterized
    proteins
  • Use a sequence similarity search program such as
    BLAST or FASTA to identify all the functional
    regions in the sequence. If greater sensitivity
    is required, Smith-Waterman-based programs are
    preferred, at the cost of greater analysis time.
  • Identify functional motifs and structural domains
    by comparing the protein sequence against
    PROSITE, BLOCKS, SMART, CDD, or Pfam.
  • Predict structural features of the protein such
    as signal peptides, transmembrane segments,
    coiled-coil regions, and other regions of low
    sequence complexity
  • Generate a secondary and (if possible) tertiary
    structure prediction
  • Transfer function information from a
    well-characterized organism to a lesser-studied
    organism and/or use phylogenetic patterns (or
    profiles) and/or use the phylogenetic pattern
    search tools (e.g. through COGs) to perform
    systematic formal logical operations (AND, OR,
    NOT) on gene sets -- differential genome display
    (Huynen et al., 1997).

13
Automated Genome Annotation
  • GeneQuiz: limited number of searches/day
  • MAGPIE: outside users cannot submit their own
    sequences
  • PEDANT: commercial version allows full capacity
  • SEALS: semi-automated

14
General Databases Useful for Comparative Genomics
  • LocusLink/RefSeq
    http://www.ncbi.nih.gov/LocusLink/
  • PEDANT - Protein Extraction Description ANalysis
    Tool http://pedant.gsf.de/
  • MIPS http://mips.gsf.de/
  • COGs - Clusters of Orthologous Groups (of
    proteins) http://www.ncbi.nih.gov/COG/
  • KEGG - Kyoto Encyclopedia of Genes and Genomes
    http://www.genome.ad.jp/kegg/
  • MBGD - Microbial Genome Database
    http://mbgd.genome.ad.jp/
  • GOLD - Genomes OnLine Database
    http://wit.integratedgenomics.com/GOLD/
  • TOGA http://www.tigr.org/xxxxx

15
Problems with existing sequence alignment
algorithms for genomic analysis
  • Most algorithms were developed for comparing
    single protein sequences or DNA sequences
    containing a single gene
  • Most algorithms assign a score to every possible
    alignment (usually the sum of the
    similarity/identity values for each aligned
    residue minus a penalty for the introduction of
    gaps) and then find the optimal or near-optimal
    alignment under the chosen scoring scheme.
  • Unfortunately, most of these programs cannot
    accurately handle long alignments.
  • Linear-space Smith-Waterman variants are too
    computationally intensive: they either require
    specialized hardware (being memory-limited) or
    are very time-consuming. The trade-off is higher
    speed vs. increased sensitivity.
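
The scoring scheme described above (per-residue similarity minus gap penalties, then picking the best alignment) can be sketched as a score-only Smith-Waterman recurrence. This is a minimal illustration, not any particular program's implementation: the function name and the match/mismatch/gap values are assumptions, and a simple linear gap penalty is used. It keeps only two rows of the matrix, so memory grows with one sequence length while time still grows with the product of both lengths, which is the cost that becomes prohibitive at genome scale.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between strings a and b.

    Uses a linear gap penalty and keeps two rows of the dynamic
    programming matrix: O(len(a) * len(b)) time, O(len(b)) memory.
    Scoring values are illustrative defaults, not a standard.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("ACGTACGT", "ACGTTACGT"))
```

Recovering the alignment itself, rather than just its score, needs either the full matrix or a divide-and-conquer traceback, which is where the linear-space variants pay their extra running time.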

16
Genome-size comparative alignment tools
  • ASSIRC - Accelerated Search for SImilarity
    Regions in Chromosomes
  • ftp://ftp.biologie.ens.fr/pub/molbio/ (Vincens et
    al. 1998)
  • BLAT
  • http://genome.ucsc.edu/cgi-bin/hgBlat?command=start
    (Kent xxx)
  • DIALIGN - DIagonal ALIGNment
  • http://www.gsf.de/biodv/dialign.html (Morgenstern
    et al. 1998; Morgenstern 1999)
  • DBA - DNA Block Aligner
  • http://www.sanger.ac.uk/Software/Wise2/dba.shtml
    (Jareborg et al. 1999)
  • GLASS - GLobal Alignment SyStem
  • http://plover.lcs.mit.edu/ (Batzoglou et al.
    2000)
  • LSH-ALL-PAIRS - Locality-Sensitive Hashing in ALL
    PAIRS
  • Email: jbuhler_at_cs.washington.edu (Buhler 2001)
  • MegaBlast
  • http://www.ncbi.nih.gov/blast/ (Zhang 2000)
  • MUMmer - Maximal Unique Match (mer)
  • http://www.tigr.org/softlab/ (Delcher et al.
    1999)
  • PIPMaker - Percent Identity Plot MAKER
  • http://bio.cse.psu.edu/pipmaker/ (Schwartz et al.
    2000)
  • SSAHA - Sequence Search and Alignment by Hashing
    Algorithm

17
SSAHA
  • Sequence Search and Alignment by Hashing
    Algorithm
  • Software tool for very fast matching and
    alignment of DNA sequences.
  • Achieves fast search speed by converting sequence
    information into a hash table data structure,
    which can then be searched very rapidly for
    matches
  • http://www.sanger.ac.uk/Software/analysis/SSAHA/
  • Run from the Unix command line
  • Needs > 1 GB RAM (needs a lot of memory)
  • The SSAHA algorithm is best for applications
    requiring exact or almost exact matches between
    two sequences, e.g. SNP detection, fast sequence
    assembly, ordering and orientation of contigs
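
The hash-table idea can be sketched in a few lines. This is a simplified illustration of k-mer hashing in general, not the SSAHA implementation: the function names, the use of every overlapping k-mer in the subject, and the "most supported offset" rule are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_kmer_table(subject, k):
    """Map each k-mer in the subject to the list of positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(subject) - k + 1):
        table[subject[pos:pos + k]].append(pos)
    return table


def best_offset(query, table, k):
    """Return (offset, hit_count) for the subject offset supported by the
    most k-mer hits, or None when no k-mer of the query is in the table.

    A hit at subject position s for query position q votes for offset s - q;
    many votes for one offset indicate an exact or near-exact match there.
    """
    votes = Counter()
    for qpos in range(len(query) - k + 1):
        for spos in table.get(query[qpos:qpos + k], ()):
            votes[spos - qpos] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


subject = "TTTTACGTACGGATCCTTTT"
table = build_kmer_table(subject, k=4)
print(best_offset("ACGGATCC", table, k=4))
```

Each lookup is a constant-time dictionary access, so search time depends on the query length and the number of hits rather than on the subject length; the price is the memory needed to hold the table, which matches the RAM requirement noted above.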

18
Genome Analysis Environment
  • MAGPIE - Multipurpose Automated Genome Project
    Investigation Environment
  • PEDANT
  • SEALS

19
Problems with Visualizing Genomes
  • Alignment program output was often viewed as
    text files, which can be difficult to interpret
    intuitively when comparing genomes.
  • Visualization tools are needed to handle the
    complexity and volume of data and to present the
    information in a comprehensive and comprehensible
    manner to a biologist for interpretation.
  • Genome alignment visualization tools need to
    provide:
  • Interpretable alignments
  • Gene predictions and database homologies from
    different sources
  • Interactive features: real-time capabilities,
    zooming, searching specific regions of homology
  • Representation of breaks in synteny
  • Display of multiple alignments
  • Display of contigs of unfinished genomes
    alongside finished genomes
  • Handling of various data formats
  • Software availability (no black box)

20
Genome Comparison Visualization Tool
  • ACT - Artemis Comparison Tool (displays parsed
    BLAST alignments; based on Artemis, an
    annotation tool)
  • http://www.sanger.ac.uk/Software/ACT/
  • Alfresco (displays DBA alignments and ...)
  • http://www.sanger.ac.uk/Software/Alfresco/
    (Jareborg & Durbin 2000)
  • PipMaker (displays BlastZ alignments)
  • http://bio.cse.psu.edu/pipmaker/ (Schwartz et al.
    2000)
  • Enteric/Menteric/Maj (displays BlastZ alignments)
  • http://glovin.cse.psu.edu/enterix/ (Florea et al.
    2000; McClelland et al. 2000)
  • Intronerator (displays WABA alignments and ...)
  • http://www.cse.ucsc.edu/kent/intronerator/ (Kent &
    Zahler 2000b)
  • VISTA (Visualization Tool for Alignment)
    (displays GLASS alignments)
  • http://www-gsd.lbl.gov/vista/
  • SynPlot (displays DIALIGN and GLASS alignments)
  • http://www.sanger.ac.uk/Users/igrg/SynPlot/

21
Artemis Comparison Tool (ACT)
  • ACT is a DNA sequence comparison viewer based on
    Artemis
  • Can read complete EMBL and GenBank entries or
    sequences in FASTA or raw format
  • Additional sequence features can be in EMBL,
    GenBank, or GFF format
  • ACT is free software and is distributed under the
    GNU General Public License
  • Java-based software
  • Latest release, 2.0, better supports eukaryotic
    genome comparison
  • http://www.sanger.ac.uk/Software/ACT/

22
Salmonella typhi vs. E. coli SPI-2
GC tRNA phage/IS genes Pseudogenes
S.typhi
Blast hits
E.coli
23
Salmonella typhi and Yersinia pestis type III
secretion systems
24
Salmonella typhi vs. E. coli - ACT
SPI-10
SPI-1
SPI-2
SPI-9
SPI-7 Vi
S. typhi
DNA matches
E. coli
25
Neisseria meningitidis - A vs. B comparison - ACT
26
Extra Slides 1
27
ASSIRC
  • Accelerated Search for SImilarity Regions in
    Chromosomes
  • ASSIRC finds regions of similarity in pair-wise
    genomic sequence alignments.
  • The method involves three steps:
  • (i) identification of short exact chains of fixed
    size, called 'seeds', common to both sequences,
    using hashing functions
  • (ii) extension of these seeds into putative
    regions of similarity by a 'random walk'
    procedure (i.e. the four bases are associated
  • (iii) final selection of regions of similarity by
    assessing alignments of the putative sequences.
  • The authors used simulations to estimate the
    proportion of regions of similarity not detected
    for particular region sizes, base identity
    proportions and seed sizes.
  • This approach can be tailored to the user's
    specifications.
  • The authors looked for regions of similarity
    between two yeast chromosomes (V and IX). The
    efficiency of the approach was compared with
    that of the conventional programs BLAST and
    FASTA by assessing the CPU time required and the
    regions of similarity found for the same data
    set.
  • http://www.biologie.ens.fr/perso/vincens/assirc.html
  • ftp://ftp.biologie.ens.fr/pub/molbio/assirc.tar.gz

28
BLAT
  • Only DNA sequences of 25,000 or fewer bases and
    protein or translated sequences of 5000 or fewer
    letters will be processed. If multiple sequences
    are submitted at the same time, the total limit
    is 50,000 bases or 12,500 letters.
  • BLAT on DNA is designed to quickly find sequences
    of 95% and greater similarity of length 40 bases
    or more. It may miss more divergent or shorter
    sequence alignments. It will find perfect
    sequence matches of 33 bases, and sometimes find
    them down to 22 bases. BLAT on proteins finds
    sequences of 80% and greater similarity of length
    20 amino acids or more. In practice DNA BLAT
    works well on primates, and protein BLAT on land
    vertebrates.
  • BLAT is not BLAST. DNA BLAT works by keeping an
    index of the entire genome in memory. The index
    consists of all non-overlapping 11-mers except
    for those heavily involved in repeats. The index
    takes up a bit less than a gigabyte of RAM. The
    genome itself is not kept in memory, allowing
    BLAT to deliver high performance on a reasonably
    priced Linux box. The index is used to find areas
    of probable homology, which are then loaded into
    memory for a detailed alignment. Protein BLAT
    works in a similar manner, except with 4-mers
    rather than 11-mers. The protein index takes a
    little more than 2 gigabytes.
  • BLAT was written by Jim Kent. Like most of Jim's
    software, interactive use on this web server is
    free to all. Sources and executables to run batch
    jobs on your own server are available free for
    academic, personal, and non-profit purposes.
    Non-exclusive commercial licenses are also
    available. Contact Jim for details.
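
The indexing scheme described above (non-overlapping k-mers of the genome, with heavily repeated k-mers left out) can be sketched as follows. This is a toy illustration under stated assumptions, not BLAT's actual code: the function names, the max_occurrences cutoff used to stand in for "heavily involved in repeats", and the single-hit lookup are all hypothetical choices for the example.

```python
from collections import defaultdict


def build_tile_index(genome, k=11, max_occurrences=None):
    """Index the non-overlapping k-mers ("tiles") of the genome.

    Tiles start at positions 0, k, 2k, ... so the index holds roughly
    len(genome) / k entries. Tiles seen more than max_occurrences times
    are dropped, as a stand-in for repeat filtering.
    """
    index = defaultdict(list)
    for start in range(0, len(genome) - k + 1, k):
        index[genome[start:start + k]].append(start)
    if max_occurrences is not None:
        return {kmer: pos for kmer, pos in index.items()
                if len(pos) <= max_occurrences}
    return dict(index)


def candidate_hits(query, index, k=11):
    """Scan every overlapping k-mer of the query against the tile index and
    return (query_position, genome_position) pairs worth aligning in detail."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos))
    return hits


# Small demonstration with k=4 so the tiles are easy to read.
genome = "ACGT" + "TTTT" + "GGCA" + "TTTT" + "CCAG"
index = build_tile_index(genome, k=4, max_occurrences=1)
print(candidate_hits("TGGCAT", index, k=4))
```

Indexing only non-overlapping tiles keeps the index about k times smaller than one holding every k-mer, while scanning the query with overlapping k-mers means an exact match of at least 2k-1 bases always covers one complete tile and so still produces a hit in this single-hit sketch.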