Title: The Genome Properties System
1The Genome Properties System
- Applications in genome annotation and
comparative genomics
Jeremy D. Selengut, Daniel H. Haft, Owen White
The Institute for Genomic Research, Rockville, MD
3What is a Genome Property?
- An ATTRIBUTE of biological organisms that is rigorously defined such that assertion of its absence, presence, or quantitative extent can be made (either automatically or manually) in a self-consistent manner.
- Properties cover biological processes (including metabolic pathways), observable phenotypes, and quantitative measures of genomic content.
www.tigr.org/Genome_Properties
4What is a Genome Property?
- A standardized bioinformatic analysis applied
over all sequenced genomes yielding discrete
assertions with controlled vocabularies and
linked to traceable evidence.
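A minimal sketch of what one such assertion might look like as a data record. The class and field names are hypothetical and are not taken from the Genome Properties database schema; the state values mirror the YES / PARTIAL / NO vocabulary used in the examples in this talk, and the evidence entries are illustrative.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    """Hypothetical controlled vocabulary for a property assertion."""
    YES = "YES"
    PARTIAL = "PARTIAL"
    NO = "NO"


@dataclass
class Assertion:
    """One discrete assertion about one genome (field names are assumptions)."""
    genome: str
    prop: str
    state: State
    # Traceable evidence, e.g. HMM accessions or literature identifiers.
    evidence: list = field(default_factory=list)


# Illustrative record only; not copied from the real database.
a = Assertion(
    genome="Escherichia coli K12",
    prop="fucose catabolism",
    state=State.YES,
    evidence=["TIGR01086", "PMID:3325779"],
)
```

Restricting the state to an enumeration means a free-text value such as "maybe" is rejected at construction time rather than stored.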
6Quantitative measures
Quantitative data is calculated directly from the genome sequence or the set of predicted genomic features.
Other examples: number of predicted proteins, average intergenic length, amino acid abundances
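As a rough illustration of how measures of this kind can be derived from sequence alone, the sketch below computes GC content, amino acid abundances, and average protein length. The function names are hypothetical, and this is not the code used by the Genome Properties system.

```python
from collections import Counter


def gc_content(dna: str) -> float:
    """Percent G+C among unambiguous bases (A, C, G, T)."""
    seq = dna.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def amino_acid_abundance(proteins):
    """Fraction of each residue across all predicted proteins."""
    counts = Counter()
    for protein in proteins:
        counts.update(protein.upper())
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def average_length(proteins):
    """Mean length of the predicted proteins, in residues."""
    return sum(len(p) for p in proteins) / len(proteins) if proteins else 0.0
```

For example, `gc_content("ATGC")` gives 50.0, since two of the four bases are G or C.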
7Phenotypic Data
WHITE Natural transformation YES
Phenotypic data must be manually curated from literature citations and expert knowledge.
Phenotypic data is therefore SPARSE compared to data calculated from genomic content.
8Taxonomic data
Taxonomic information is included in the system and is queryable to separate data by phylum, class, order, etc.
Examples: Actinobacteria (High-GC Gram-positive bacteria), Gamma proteobacteria
9Component-based Properties
- Metabolic pathways and other biological processes are carried out by combinations of genetically encoded COMPONENTS (usually proteins).
- Many components are both biologically required and broadly (and accurately) detectable by homology methods.
- Detection of the complete set of these components implies the presence of the process.
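The rule in the last bullet can be sketched as a set comparison. This is a simplified illustration: the function name and component labels are hypothetical, and treating "some but not all required components found" as PARTIAL is an assumption about how intermediate cases are reported.

```python
def evaluate_property(required, detected):
    """Return YES, PARTIAL, or NO for one property in one genome.

    required: components the process needs (hypothetical labels)
    detected: components found in the genome by homology methods
    """
    required = set(required)
    if not required:
        raise ValueError("a property needs at least one required component")
    found = required & set(detected)
    if found == required:
        return "YES"      # complete set detected: process inferred present
    if found:
        return "PARTIAL"  # assumption: incomplete sets are reported as PARTIAL
    return "NO"


# Hypothetical component labels for a three-step pathway.
pathway = ["isomerase", "kinase", "aldolase"]
```

With these labels, `evaluate_property(pathway, ["isomerase", "kinase", "aldolase", "other"])` returns "YES", while detecting only the kinase returns "PARTIAL".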
10Example Fucose catabolism
Chen YM, Zhu Y, Lin EC. The organization of the fuc regulon specifying L-fucose dissimilation in Escherichia coli K12 as determined by gene cloning. Mol Gen Genet 1987 Dec;210(2):331-7. PMID: 3325779
11Example Fucose catabolism
Transporter to import L-fucose
12Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
13Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
14Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
- then to L-lactaldehyde and glycerone-P by L-fuculose-P aldolase (4.1.2.17)
15Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
- then to L-lactaldehyde and glycerone-P by L-fuculose-P aldolase (4.1.2.17)
- Plus alcohol and/or aldehyde dehydrogenases
16Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
- then to L-lactaldehyde and glycerone-P by L-fuculose-P aldolase (4.1.2.17)
- Plus alcohol and/or aldehyde dehydrogenases
- Plus transcription factors
18Detection of components via Equivalog HMMs
- Hidden Markov Models (HMMs) allow automated assignment of sequences to homology families (TIGRFAMs, Pfam).
- Equivalog-type HMMs are designed to detect only members of families having the same function (1219 equivalogs in TIGRFAMs, 439 built into Genome Properties thus far).
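A small sketch of how HMM search results might be turned into family assignments. It assumes each model carries a score cutoff above which a hit is accepted; the cutoff values, gene identifiers, and scores below are invented for illustration, and the function name is hypothetical.

```python
# Invented per-model score cutoffs, for illustration only.
TRUSTED_CUTOFF = {"TIGR01086": 250.0, "PF00596": 25.0}


def assign_families(hits, cutoffs):
    """Map each gene to the models it scores above cutoff for.

    hits: iterable of (gene_id, model_accession, bit_score)
    cutoffs: dict of model_accession -> minimum accepted score
    """
    assigned = {}
    for gene, model, score in hits:
        cutoff = cutoffs.get(model)
        # Hits to models without a known cutoff are ignored.
        if cutoff is not None and score >= cutoff:
            assigned.setdefault(gene, []).append(model)
    return assigned


# Invented search results.
hits = [
    ("gene_A", "TIGR01086", 310.2),
    ("gene_A", "PF00596", 88.0),
    ("gene_B", "PF00596", 41.5),
    ("gene_C", "TIGR01086", 120.0),
]
result = assign_families(hits, TRUSTED_CUTOFF)
```

Here gene_A passes both the equivalog and the broader family model, gene_B passes only the family model, and gene_C falls below the equivalog cutoff and receives no assignment.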
19Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
- then to L-lactaldehyde and glycerone-P by L-fuculose-P aldolase (4.1.2.17)
  Specific equivalog: TIGR01086, L-fuculose phosphate aldolase
- Plus alcohol and/or aldehyde dehydrogenases
- Plus transcription factors
20Detection of components via less specific
family-level HMMs
- Family-level HMMs describe homology families which include gene products with a range of substrate specificities.
- These can detect proteins which may be involved in a particular process.
- We can require proximity to known genes.
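The proximity requirement can be sketched as follows: a family-level hit is accepted only when it lies within a few genes of a component that has already been confidently identified. The gene-order representation, the function name, and the default window of five genes are all assumptions made for this illustration.

```python
def confirm_by_proximity(candidates, anchors, max_distance=5):
    """Keep family-level candidates that sit near a known gene.

    candidates, anchors: dict of gene_id -> (molecule, ordinal position),
    where the position is the gene's index along that DNA molecule.
    max_distance: hypothetical window size, in genes.
    """
    confirmed = []
    for gene, (molecule, position) in candidates.items():
        for anchor_molecule, anchor_position in anchors.values():
            same_molecule = molecule == anchor_molecule
            if same_molecule and abs(position - anchor_position) <= max_distance:
                confirmed.append(gene)
                break  # one nearby anchor is enough
    return confirmed


# Invented coordinates: one known gene and three family-level candidates.
anchors = {"known_gene": ("chromosome", 100)}
candidates = {
    "near": ("chromosome", 103),
    "far": ("chromosome", 400),
    "other_molecule": ("plasmid", 101),
}
```

With these values only the candidate named "near" is confirmed; the other two are too distant or on a different molecule.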
21Example Fucose catabolism
- Transporter to import L-fucose
- L-fucose to fuculose by L-fucose ketol-isomerase (5.3.1.25)
- then to L-fuculose-1P by L-fuculokinase (2.7.1.51)
- then to L-lactaldehyde and glycerone-P by L-fuculose-P aldolase (4.1.2.17)
  Specific equivalog: TIGR01086, L-fuculose phosphate aldolase
  Generic family: PF00596, Class II Aldolase
- Plus alcohol and/or aldehyde dehydrogenases
- Plus transcription factors
22User Interface Query Page
26Detection of components via regular expression
text searches combined with operon prediction
- Primary search terms (and exclusionary terms)
27Detection of components via regular expression
text searches combined with operon prediction
- Primary search terms (and exclusionary terms)
  - YES: osidase, anase, arabinase, dextrinase, osaminidase
  - NO: capsul, export, cyanase, biosynthe, nucleosidase, tryptophanase
28Detection of components via regular expression
text searches combined with operon prediction
- Primary search terms (and exclusionary terms)
- Secondary (optional) search terms
29Detection of components via regular expression
text searches combined with operon prediction
- Primary search terms (and exclusionary terms)
- Secondary (optional) search terms
- Selection rules
  - 2 or more primary matches
  - 5 or more primary + secondary matches
  - Threshold percentage of genes in operon matching search terms
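A minimal sketch of how these rules might be applied to the gene names of one predicted operon. The primary and exclusionary terms are the ones listed in this talk; the secondary terms, the 50% threshold, and the way the three rules are combined (either count rule, plus the threshold) are assumptions made for illustration.

```python
import re

PRIMARY = re.compile(r"osidase|anase|arabinase|dextrinase|osaminidase", re.I)
EXCLUDE = re.compile(
    r"capsul|export|cyanase|biosynthe|nucleosidase|tryptophanase", re.I
)
# Hypothetical secondary terms; the talk does not list them.
SECONDARY = re.compile(r"sugar|permease", re.I)


def classify(name):
    """Label one gene name as 'primary', 'secondary', or None."""
    if EXCLUDE.search(name):
        return None  # exclusionary terms override any other match
    if PRIMARY.search(name):
        return "primary"
    if SECONDARY.search(name):
        return "secondary"
    return None


def operon_selected(gene_names, min_fraction=0.5):
    """Apply the selection rules to one operon (rule combination assumed)."""
    if not gene_names:
        return False
    labels = [classify(name) for name in gene_names]
    primary = labels.count("primary")
    matched = primary + labels.count("secondary")
    enough_matches = primary >= 2 or matched >= 5
    return enough_matches and matched / len(gene_names) >= min_fraction
```

The exclusionary terms matter because short fragments over-match: "purine nucleosidase" contains "osidase" and "tryptophanase" contains "anase", so both would count as primary hits if the NO list were not checked first.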
30Genome Properties - Content
- biological niche
  - animal pathogen
  - human pathogen
  - optimum salinity
  - optimal growth temperature
  - optimal pH
  - oxygen requirement
  - plant pathogen
  - temperature environment
- cell surface component
  - capsule
  - flagella
  - outer membrane
  - peptidoglycan (murein) biosynthesis
  - S-layer
  - type IV pilus
- cellular growth, organization and division
  - cell shape
  - minCDE system
  - mreBCD system
- protein transport
  - Sec-system protein translocase
  - Tat (Sec-independent) protein export
  - type I secretion
  - type II secretion
  - type III secretion
  - type IV secretion
- quantitative content
  - amino acid abundance
  - count of DNA molecules
  - count of predicted proteins
  - count of tRNAs
  - DNA dinucleotide thermophily metric RRYY-RY-YR
  - DNA GC content
  - DNA size (megabases)
  - functional gene clustering - property level
  - protein average length
- selfish genetic elements
  - CRISPR region
  - group I intron
  - group II intron
  - inteins
- metabolism (subcategories)
  - biosynthesis
  - catabolism
  - central intermediary metabolism
  - energy metabolism
  - nucleic acid metabolism
  - protein modification and cofactors
  - storage polymer systems
34Data mining -- Clustering
35Applications
- Catalyst for the generation of new TIGRFAMs HMMs
- Improvements in HMM quality
- Greater richness of annotation
- Quality control on gene calling
- Gene-name independent annotations
- Genome process-level summarization
- Phylogenetic profiling
36Acknowledgements
Lauren Brinkac
Tanja Davidsen
Dan Haft
Owen White
Nikhat Zafar
Funding: U.S. Dept. of Energy, U.S. National Science Foundation
37www.tigr.org/Genome_Properties
Species: Helicobacter pylori (2 strains sequenced)
Phylum: Proteobacteria
Cell shape: Rod
GC content (average): 39.0
Oxygen requirement: Microaerophilic
Human pathogen: YES
TCA cycle: PARTIAL
Flagella: YES

Species: Porphyromonas gingivalis (1 strain sequenced)
Phylum: Bacteroidetes
Cell shape: Rod
GC content: 48.3
Oxygen requirement: Anaerobic
Human pathogen: YES
Selenocysteine: NO
Outer membrane: YES
Flagella: NO

Species: Yersinia pestis (2 strains sequenced)
Phylum: Proteobacteria
Cell shape: Coccus
GC content: 47.6
Oxygen requirement: Aerobic
Human pathogen: YES
Capsule: proteinaceous
Polyketide Natural Products: YES (1)
Glycine betaine biosynthesis: YES
Flagella: CRYPTIC

Species: Shigella flexneri (2 strains sequenced)
Phylum: Proteobacteria
Cell shape: Rod
GC content: 50.9
Oxygen requirement: Facultative anaerobic
Human pathogen: YES
Selenocysteine: YES
Polyketide or NRPS Natural Products: NO