Title: Year One Milestones
1Year One Milestones for the VIMSS Comparative Genomics Pipeline
Eric Alm EJAlm_at_lbl.gov
2Outline
- Comparative Genomics Database
- Over 125 genomes
- Perl-based API
- Over 600 genomes expected soon!
- Protein Annotation Web Tools
- Comparative Genomics Browser
- Genome Annotation
- Operon prediction
- Regulon prediction
- Motif detection
- Data Analysis
- Prokaryotic gene expression database
- Integration with FGC
3Protein Annotation Pages
Release Date 8/15/2003
4Comparative Genomics Browser
5Genome Annotation
- Operon prediction
- Parameter-free prediction tool for uncharacterized genomes
- Survey of operon structure in prokaryotes
- Evolution of operons
- Operons Early scenario
- Regulon prediction
- DNA motif detection
- Structure-based protein-DNA interface modeling/design
6Operon Prediction
- Why predict operon structure?
- Functional annotation
- Evolutionary questions
- Few genomes with experimental data
- Computational Approaches
- Microarray coexpression (Liao et al.)
- Metabolic pathways (Kasif et al.)
- Intergenic length (Collado-Vides et al.)
- Gene clusters (Salzberg et al.)
7Outline of Approach
- Calculate number of operons (Nop) in genome
- Based on the number of direction changes
- Verify with intergenic length distributions
- For each pair of genes, calculate the log-likelihood that the genes are in the same operon
- Find the optimal map (based on pairwise score) that partitions the genome into Nop operons
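The partitioning step can be sketched as follows. This is a minimal illustration, not the pipeline's actual algorithm: it assumes the map score is simply the sum of the adjacent-pair log-likelihoods, in which case keeping the highest-scoring same-strand neighbors joined gives the best partition into Nop operons. The function name, the strand list, and the scores are hypothetical.

```python
def partition_into_operons(strands, pair_scores, n_operons):
    """Split an ordered gene list into n_operons runs of adjacent genes.

    strands: '+' or '-' for each gene, in chromosomal order.
    pair_scores: pair_scores[i] is a (hypothetical) log-likelihood that
        genes i and i+1 are in the same operon.
    Assumption: the map score is the sum of scores of joined pairs.
    """
    n = len(strands)
    if len(pair_scores) != n - 1:
        raise ValueError("need one score per adjacent gene pair")
    # A direction change always forces an operon boundary.
    candidates = [i for i in range(n - 1) if strands[i] == strands[i + 1]]
    n_joins = n - n_operons  # each join merges two operons into one
    if n_joins < 0 or n_joins > len(candidates):
        raise ValueError("n_operons is not reachable for this genome")
    # Keep the n_joins highest-scoring candidate pairs joined.
    best = sorted(candidates, key=lambda i: pair_scores[i], reverse=True)
    joined = set(best[:n_joins])
    operons, current = [], [0]
    for i in range(n - 1):
        if i in joined:
            current.append(i + 1)
        else:
            operons.append(current)
            current = [i + 1]
    operons.append(current)
    return operons


strands = ['+', '+', '+', '-', '-', '+']
scores = [2.1, -0.5, 0.0, 1.3, 0.0]
print(partition_into_operons(strands, scores, 4))  # [[0, 1], [2], [3, 4], [5]]
```

With a more elaborate map score (for example, one that is not additive over adjacent pairs), a dynamic-programming search would be needed instead of this greedy selection.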
8Scoring Function
- Intergenic length
- The most common separations are a 1 bp or a 4 bp overlap
- TGATG (stop TGA and start ATG share 1 bp)
- ATGA (start ATG and stop TGA overlap by 4 bp)
- Large intergenic lengths rarely observed
- mRNA instability
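As a small worked example of the intergenic length feature, the sketch below computes the separation between two adjacent same-strand genes. It assumes 1-based inclusive coordinates; the function name and the coordinates are hypothetical. Negative values indicate overlapping genes.

```python
def intergenic_distance(upstream_end, downstream_start):
    """Number of bases between two adjacent same-strand genes.

    Coordinates are assumed to be 1-based and inclusive.
    A negative result means the two genes overlap by that many bases.
    """
    return downstream_start - upstream_end - 1


# ATGA: start ATG occupies 1-3 and the upstream stop TGA occupies 2-4.
print(intergenic_distance(upstream_end=4, downstream_start=1))  # -4
# TGATG: upstream stop TGA occupies 1-3 and start ATG occupies 3-5.
print(intergenic_distance(upstream_end=3, downstream_start=3))  # -1
```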
9Scoring Function
- Gene Neighbors
- Gene pairs that tend to occur physically nearby
on the chromosome in many divergent genomes
10Gene Neighbors
- Calculating GNM score
- Orthologs are defined as bidirectional best BLAST hits
- The probability of two genes being neighbors by chance is estimated from the fraction of adjacent genes on convergent transcripts that are neighbors
- Related genomes are clustered such that (Pchance < eps)
- Only one genome per cluster can contribute to the total score to avoid overestimating significance
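The ortholog definition above (bidirectional best BLAST hits) can be illustrated with a short sketch. It assumes the BLAST results have already been parsed into dictionaries mapping each query gene to a list of (subject, bit score) tuples; the function name, gene identifiers, and scores are hypothetical.

```python
def bidirectional_best_hits(hits_a_to_b, hits_b_to_a):
    """Return sorted (a, b) pairs where each gene is the other's best hit.

    Each argument maps a query gene to a list of (subject, bit_score)
    tuples, assumed to be parsed from BLAST output beforehand.
    """
    def best(hits):
        # Best subject per query, chosen by highest bit score.
        return {q: max(subjects, key=lambda s: s[1])[0]
                for q, subjects in hits.items() if subjects}

    best_ab = best(hits_a_to_b)
    best_ba = best(hits_b_to_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


hits_a_to_b = {'a1': [('b1', 250.0), ('b2', 90.0)], 'a2': [('b1', 120.0)]}
hits_b_to_a = {'b1': [('a1', 245.0), ('a2', 118.0)], 'b2': [('a1', 88.0)]}
print(bidirectional_best_hits(hits_a_to_b, hits_b_to_a))  # [('a1', 'b1')]
```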
Intergenic length distributions for true
positives (highly conserved gene neighbors) and
combined (same direction) sets
Log-likelihood scores depend on true positive
distribution, true negative distribution,
combined distribution, and prior knowledge of Nop
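One way to turn such distributions into a pairwise score is a binned log-odds ratio. The sketch below is an assumption-laden illustration, not the scoring function used in the pipeline: it takes example intergenic lengths for same-operon and different-operon pairs directly, whereas the slides derive the negative distribution from the combined one. The bin edges, pseudocount, prior, and function name are hypothetical.

```python
import math
from bisect import bisect_right


def make_log_odds(pos_lengths, neg_lengths, bin_edges, prior_same,
                  pseudocount=1.0):
    """Build a scorer: log P(len | same operon) / P(len | different operon)
    plus the log prior odds of a pair being in the same operon.

    prior_same is a hypothetical prior (for example, derived from Nop).
    """
    def hist(values):
        # One bin below the first edge, one above the last; pseudocounts
        # keep every bin probability non-zero.
        counts = [pseudocount] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect_right(bin_edges, v)] += 1
        total = sum(counts)
        return [c / total for c in counts]

    p_pos, p_neg = hist(pos_lengths), hist(neg_lengths)
    prior_term = math.log(prior_same / (1.0 - prior_same))

    def score(length):
        b = bisect_right(bin_edges, length)
        return math.log(p_pos[b] / p_neg[b]) + prior_term

    return score


score = make_log_odds(pos_lengths=[-4, -1, 5, 10, 12],
                      neg_lengths=[80, 150, 200, 300, 20],
                      bin_edges=[0, 50, 100], prior_same=0.5)
print(score(-4) > 0, score(200) < 0)  # True True
```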
11Scoring Function
- Phylogenetic profiles
- Genes that co-occur in the same genomes tend to
be functionally related
Taken from Marcotte et al., 1999
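A phylogenetic profile can be represented as the set of genomes in which a gene has an ortholog. The sketch below compares two profiles with Jaccard similarity, which is chosen here only for illustration and is not necessarily the measure used in the cited work; the function name and genome identifiers are hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles.

    Each profile is the set of genome identifiers in which the gene occurs.
    """
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


gene_x = {'genome1', 'genome2', 'genome3'}
gene_y = {'genome1', 'genome2', 'genome3', 'genome4'}
gene_z = {'genome5'}
print(profile_similarity(gene_x, gene_y))  # 0.75
print(profile_similarity(gene_x, gene_z))  # 0.0
```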
12Scoring Function
- Intrinsic termination signals
- Rho-independent terminators can be detected using
RNA folding algorithms (RNAfold) in some organisms
13Operon Prediction
- Limitations
- Alternative transcripts
- Limited by accurate prediction of start codon
- Parameter-free version requires accurate Codon Adaptation Index
- Preliminary benchmarks in E. coli
Scoring function
Accuracy
Based on experimentally verified E. coli operons
Accuracy = (TP + TN) / (TP + TN + FP + FN)
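The benchmark's accuracy measure is the fraction of correct calls, (TP + TN) / (TP + TN + FP + FN). A minimal sketch follows; the counts in the example are invented for illustration and are not benchmark results.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of gene pairs classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("no predictions to score")
    return (tp + tn) / total


# Hypothetical counts, for illustration only.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85
```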
14Regulon (?) Prediction
Phylogenetic Profiles
Gene Neighbors
15cis-element Prediction
- Whole genome methods
- Moby Dick
- DimerFinder (Results in Biofiles for Dv, Gm, So)
- Predicted regulon methods
- AlignACE
- GIBBS-sampler
- MEME
- Phylogenetic footprinting
- What genomes to use?
16Choosing Genomes for Phylogenetic Footprinting
Fully sequenced gamma-proteobacteria
17Integration with the Functional Genomics Core
- Prokaryotic Gene Expression DB
- Currently no central repository
- Growing amounts of data in many species
- Comparative analysis of gene expression
- Evaluating predicted Regulons
- Correlating gene expression with genome structure
18Prokaryotic Gene Expression DB
>20 organisms
>35 different treatments
Expecting 90 publications this year
Currently >820 experiments in our DB
19Comparative Analysis of Gene Expression
Heat shock response in two species
20Comparative Analysis of Heat Shock Response
5 most up-regulated genes in both species
Name    Description
b1664   possible enzyme
clpB    heat shock protein
dnaJ    chaperone with DnaK; heat shock protein
dnaK    chaperone Hsp70; DNA biosynthesis; autoregulated heat shock proteins
ftsJ    cell division protein
fucP    fucose permease
grpE    phage lambda replication; host DNA synthesis; heat shock protein; protein repair
hflB    degrades sigma32, integral membrane peptidase, cell division protein
hslV    heat shock protein hslVU, proteasome-related peptidase subunit
htpG    chaperone Hsp90, heat shock protein C 62.5
hybC    probable large subunit, hydrogenase-2
ibpA    heat shock protein
lon     DNA-binding, ATP-dependent protease La; heat shock K-protein
miaA    delta(2)-isopentenylpyrophosphate tRNA-adenosine transferase
mopA    GroEL, chaperone Hsp60, peptide-dependent ATPase, heat shock protein
mopB    GroES, 10 Kd chaperone binds to Hsp60 in pres. Mg-ATP, suppressing its ATPase activity
rpoD    RNA polymerase, sigma(70) factor; regulation of proteins induced at high temperatures
rpoH    RNA polymerase, sigma(32) factor; regulation of proteins induced at high temperatures
rseA    sigma-E factor, negative regulatory protein
yaiU    putative flagellin structural protein
ybbN    putative thioredoxin-like protein
ybeD    orf, hypothetical protein
ycdQ    orf, hypothetical protein
yfjI    orf, hypothetical protein
21Detecting cis-regulatory motifs
22Evaluating Predicted Regulons
[Two plots; axis label: Correlation]
- E. coli data for 14 conditions
- Blattner Lab
- Regulons are nearly as tightly correlated as operons
- Operon structure in distantly related species can be used to infer coregulation
- Shewanella data for 4 conditions
- ORNL
- Regulons are significantly more correlated than random, but not as tightly correlated as for E. coli
- Fewer conditions
- No mutant regulator data
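The comparison described above can be sketched by averaging the pairwise Pearson correlation of expression profiles within a predicted regulon and comparing it with the same statistic for operons or random gene sets. This is a minimal illustration under assumed inputs: the expression dictionary, gene names, and values are hypothetical, and the slides do not state which correlation measure was used.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equal-length expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation signal
    return cov / (sx * sy)


def mean_pairwise_correlation(expression, genes):
    """Average Pearson correlation over all gene pairs in a gene set."""
    pairs = list(combinations(genes, 2))
    if not pairs:
        raise ValueError("need at least two genes")
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression values across four conditions.
expression = {
    'geneA': [1.0, 2.0, 3.0, 4.0],
    'geneB': [2.0, 4.0, 6.0, 8.0],
    'geneC': [4.0, 3.0, 2.0, 1.0],
}
print(mean_pairwise_correlation(expression, ['geneA', 'geneB']))  # close to 1.0
print(mean_pairwise_correlation(expression, ['geneA', 'geneC']))  # close to -1.0
```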
23Gene Expression and Genome Structure
- Test Case - Cyanobacteria
- Circadian Rhythms
- Expression of nearly all genes is tied to 24-hour cycle
- Heterologous genes/promoters adapt to host cycle
- Clock gene has homology to Helicase/Recombinase
- Genome Structure
- Little known about DNA replication
- Very little GC-skew - no obvious peak
- Little conservation of gene order/operons
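GC skew is conventionally computed as (G - C) / (G + C) in windows along the chromosome. The sketch below uses non-overlapping windows; the function name, window size, and toy sequence are hypothetical.

```python
def gc_skew(sequence, window):
    """GC skew, (G - C) / (G + C), for consecutive non-overlapping windows.

    Trailing bases that do not fill a whole window are ignored.
    """
    seq = sequence.upper()
    skews = []
    for start in range(0, len(seq) - window + 1, window):
        chunk = seq[start:start + window]
        g, c = chunk.count('G'), chunk.count('C')
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


print(gc_skew("GGGCATCCCG", 5))  # [0.5, -0.5]
```

In genomes with a strong skew, the sign of this value typically switches near the replication origin and terminus; a flat profile corresponds to the "no obvious peak" observation above.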
24Gene Expression and Genome Structure
Cyanobacterial Clocks
25Summary
- Web-based comparative genomics tools
- Operon predictions
- Gene interaction predictions
- Gene neighbors (Regulons?)
- Phylogenetic profiles
- Correlated microarray expression
- Release Date 8/15/2003
- Operon prediction tool
- Insights into the evolution of operons
- Regulon predictions
- Gene neighbors
- Phylogenetic profiles
- Coexpressed genes
- cis-regulatory motif detection
- Dimer-based method
- Comparative Gene Expression DB
- Preliminary comparative analysis of microarray
data
26Future Directions
- Automated input pipeline for new genomes
- Complete parameter-free operon prediction tool
- cis-regulatory motif predictions
- Regulon-based methods
- Phylogenetic profiling
- Integrate experimental data into protein annotation pages
- Microarray
- Proteomics
- Gene deletion
- Construct DB queries over experimental results via website
- Other ideas?
- Release Comparative Gene Expression DB
- Work with Functional Genomics Core to add genomic
context to high-throughput data
27Acknowledgements
- Director
- Adam Arkin
- VIMSS - Arkin Lab
- Katherine Huang
- Richard Koche
- Sarah Wang
- Dubchak Lab
- Simon Minovitsky
- Volunteer
- Vladmir Ulyashin
- ORNL
- Jizhong Zhou
- Matthew Fields
- Dorothea Thompson
- Yongqing Liu
- Adam Leaphart
- Haichun Gao
- and others