Title: The EcoCyc and MetaCyc Pathway/Genome Databases
1 The EcoCyc and MetaCyc Pathway/Genome Databases
- Peter D. Karp, Ph.D.
- Bioinformatics Research Group
- SRI International
- pkarp_at_ai.sri.com
- http://www.ai.sri.com/pkarp/
- http://EcoCyc.org/
2 Overview
- Motivations and terminology
- Pathway/genome databases
- BioCyc collection
- EcoCyc, MetaCyc
- Pathway Tools software
- Bioinformatics Database Warehouse project
5 What to Do When Theories Become Larger than Minds Can Grasp?
- Example: E. coli metabolic network
- 160 pathways involving 744 reactions and 791 substrates
- Example: E. coli genetic network
- Control by 97 transcription factors of 1174 genes in 630 transcription units
- Past solutions
- Partition theories across multiple minds
- Encode theories in natural-language text
- We cannot compute with theories in those forms
- Evaluate theories for consistency with new data: microarrays
- Refine theories with respect to new data
- Compare theories describing different organisms
6 Solution: Biological Knowledge Bases
- Store biological knowledge and theories in computers in a declarative form
- Amenable to computational analysis and generative user interfaces
- Establish ongoing efforts to curate (maintain, refine, embellish) these knowledge bases
- It is accepted practice to store data in computers, but not knowledge
- Such knowledge bases are an integral part of the scientific enterprise
7 Pathway Definition
- Chemical reactions interconvert chemical compounds
- An enzyme is a protein that accelerates chemical reactions
- A pathway is a linked set of reactions
- Often regulated as a unit
- A conceptual unit of the cell's biochemical machine
[Figure: example reaction diagrams over compounds A, B, C, D and A, C, E]
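The definition above can be made concrete with a small data-structure sketch. This is an illustration only, not the Pathway Tools schema: the `Reaction` class, the `is_linked` helper, and the compound letters are hypothetical names chosen for the example. It treats a pathway as an ordered list of reactions in which each step consumes at least one product of the step before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A reaction interconverts a set of substrates into a set of products."""
    name: str
    substrates: frozenset
    products: frozenset


def is_linked(reactions):
    """Return True if every reaction uses a product of the preceding reaction."""
    return all(
        bool(prev.products & nxt.substrates)
        for prev, nxt in zip(reactions, reactions[1:])
    )


# Placeholder compounds; the letters carry no biochemical meaning.
r1 = Reaction("r1", frozenset({"A", "B"}), frozenset({"C", "D"}))
r2 = Reaction("r2", frozenset({"C"}), frozenset({"E"}))

pathway = [r1, r2]            # r2 consumes C, which r1 produces
print(is_linked(pathway))     # linked in this order
print(is_linked([r2, r1]))    # not linked in the reverse order
```

Real pathways can branch and form cycles, so a graph would be a more faithful representation than a list; the list is used here only to keep the idea of "linked" easy to see.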
8 Terminology
- Model Organism Database (MOD): DB describing genome and other information about an organism
- Pathway/Genome Database (PGDB): MOD that combines information about
- Pathways, reactions, substrates
- Enzymes, transporters
- Genes, replicons
- Transcription factors, promoters, operons, DNA binding sites
- BioCyc: Collection of 15 PGDBs at BioCyc.org
- EcoCyc, AgroCyc, YeastCyc
9 BioCyc Collection of Pathway/Genome DBs
- Computationally Derived Datasets
- Agrobacterium tumefaciens
- Caulobacter crescentus
- Chlamydia trachomatis
- Bacillus subtilis
- Helicobacter pylori
- Haemophilus influenzae
- Mycobacterium tuberculosis H37Rv
- Mycobacterium tuberculosis CDC1551
- Mycoplasma pneumoniae
- Pseudomonas aeruginosa
- Saccharomyces cerevisiae
- Treponema pallidum
- Vibrio cholerae
- Yellow = Open Database
- Literature-based Datasets
- MetaCyc
- Escherichia coli (EcoCyc)
http://BioCyc.org/
10 Terminology: Pathway Tools Software
- PathoLogic
- Prediction of metabolic network from genome
- Computational creation of new Pathway/Genome Databases
- Pathway/Genome Editors
- Distributed curation of PGDBs
- Distributed object database system, interactive editing tools
- Pathway/Genome Navigator
- WWW publishing of PGDBs
- Querying, visualization of pathways, chromosomes, operons
- Analysis operations
- Pathway visualization of gene-expression data
- Global comparisons of metabolic networks
- Bioinformatics 18:S225, 2002
11 Pathway Tools Algorithms
- Query, visualization and editing tools for these datatypes
- Full Metabolic Map
- Paint gene expression data on metabolic network; compare metabolic networks
- Pathways
- Pathway prediction
- Reactions
- Balance checker
- Compounds
- Chemical substructure comparison
- Enzymes, Transporters, Transcription Factors
- Genes: BLAST search
- Chromosomes
- Operons
- Operon prediction
12 Model Organism Databases
- DBs that describe the genome and other information about an organism
- Every sequenced organism with an active experimental community requires a MOD
- Integrate genome data with information about the biochemical and genetic network of the organism
- MODs are platforms for global analyses of an organism
- Interpret gene expression data in a pathway context
- Characterize systems properties of metabolic and genetic networks
- Determine consistency of metabolic and transport networks
- In silico prediction of essential genes
13 EcoCyc Project: EcoCyc.org
- E. coli Encyclopedia
- Model-Organism Database for E. coli
- Computational symbolic theory of E. coli
- Electronic review article for E. coli: over 3500 literature citations
- Tracks the evolving annotation of the E. coli genome
- Collaborative development via Internet
- Karp (SRI) -- Bioinformatics architect
- John Ingraham -- Advisor
- (SRI) -- Metabolic pathways
- Saier (UCSD) and Paulsen (TIGR) -- Transport
- Collado (UNAM) -- Regulation of gene expression
- Database content: 18,000 objects
14 EcoCyc E. coli Dataset
Pathway/Genome Navigator
- Pathways: 165
- Reactions: 2,760
- Compounds: 774
- Enzymes: 914
- Transporters: 162
- Promoters: 812
- TransFac Sites: 956
- Citations: 3,508
- Proteins: 4,273
- Transcription Units: 724
- Transcription Factors: 110
- Genes: 4,393
http://EcoCyc.org/
15 EcoCyc Procedures
- All DB updates by 5 staff curators
- Information gathered from biomedical literature
- Corrections solicited from E. coli researchers
- Review-level database
- Four releases per year
- Available through WWW site, as data files, as downloadable application
- Quality assurance of data and software
- Evaluate database consistency constraints
- Perform element balancing of reactions
- Run other checking programs
- Display every DB object
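The element-balancing check listed above can be sketched as follows. This is a minimal illustration and not the EcoCyc checker itself: the function names are hypothetical, and the formula parser is an assumption that handles only flat formulas such as `C6H12O6` (no parentheses, charges, or hydrates). A reaction is treated as balanced when each element has the same total count on both sides.

```python
import re
from collections import Counter


def parse_formula(formula):
    """Count atoms in a flat chemical formula such as 'C6H12O6'."""
    counts = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(digits) if digits else 1
    return counts


def element_totals(side):
    """Sum atom counts over (formula, coefficient) pairs on one side."""
    totals = Counter()
    for formula, coefficient in side:
        for element, n in parse_formula(formula).items():
            totals[element] += coefficient * n
    return totals


def is_balanced(left, right):
    """True if every element occurs equally often on both sides."""
    return element_totals(left) == element_totals(right)


# Combustion of glucose, used only as a familiar test case.
left = [("C6H12O6", 1), ("O2", 6)]
right = [("CO2", 6), ("H2O", 6)]
print(is_balanced(left, right))

# Same reaction with the coefficients dropped: no longer balanced.
print(is_balanced([("C6H12O6", 1), ("O2", 1)], [("CO2", 1), ("H2O", 1)]))
```

A production checker would also need to compare net charge and to cope with compounds whose formulas are unknown or generic, which this sketch ignores.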
16 MetaCyc: Metabolic Encyclopedia
- Nonredundant metabolic pathway database
- Describe a representative sample of every experimentally determined metabolic pathway
- Literature-based DB with extensive references and commentary
- Pathways, reactions, enzymes, substrates
- 460 pathways, 1267 enzymes, 4294 reactions
- 172 E. coli pathways, 2735 citations
- Nucleic Acids Research 30:59-61, 2002.
- Jointly developed by SRI and Carnegie Institution
- New focus on plant pathways
17 Family of Pathway/Genome Databases
18 Pathway Tools Implementation Details
- Allegro Common Lisp
- Sun and PC platforms
- Ocelot object database
- 250,000 lines of code
- Lisp-based WWW server at BioCyc.org
- Manages 15 PGDBs
19 Pathway Tools Architecture
Pathway/Genome Navigator
Object DBMS
20 Ocelot Knowledge Server Architecture
- Frame data model
- Classes, instances, inheritance
- Frames have slots that define their properties, attributes, relationships
- A slot has one or more values
- Each value can be any Lisp datatype
- Slotunits define metadata about slots
- Domain, range, inverse
- Collection type, number of values, value constraints
- Transaction logging facility
- Schema evolution
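The frame data model described above can be illustrated with a short sketch. Ocelot itself is written in Common Lisp; the Python below is only an analogy, and every name in it (`Frame`, `SLOT_UNITS`, `link`, the slot names) is hypothetical rather than part of the Ocelot API. It shows multi-valued slots, inheritance of slot values from a parent class frame, and a slotunit whose `inverse` entry keeps a relationship consistent in both directions.

```python
class Frame:
    """A frame with named, multi-valued slots and optional parent frames."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)
        self.slots = {}  # slot name -> list of values

    def put(self, slot, *values):
        self.slots.setdefault(slot, []).extend(values)

    def get(self, slot):
        """Return local values, else the values of the first parent that has any."""
        if slot in self.slots:
            return list(self.slots[slot])
        for parent in self.parents:
            inherited = parent.get(slot)
            if inherited:
                return inherited
        return []


# Slot metadata: here only the inverse slot is recorded.
SLOT_UNITS = {"catalyzes": {"inverse": "catalyzed-by"}}


def link(source, slot, target):
    """Store a relationship and, if the slotunit names an inverse, its mirror."""
    source.put(slot, target.name)
    inverse = SLOT_UNITS.get(slot, {}).get("inverse")
    if inverse:
        target.put(inverse, source.name)


proteins = Frame("Proteins")                      # a class frame
proteins.put("comment", "inherited by every protein instance")
enzyme = Frame("enzyme-1", parents=[proteins])    # an instance frame
reaction = Frame("reaction-1")

link(enzyme, "catalyzes", reaction)
print(enzyme.get("comment"))
print(reaction.get("catalyzed-by"))
```

Domain, range, and value constraints could be added as further keys in the slotunit dictionary and checked inside `put`; they are left out to keep the sketch short.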
21 Ocelot Storage System Architecture
- Persistent storage via disk files, Oracle DBMS
- Concurrent development: Oracle
- Single-user development: disk files
- Read-only delivery: bundle data into binary program
- Oracle storage
- DBMS is submerged within Ocelot, invisible to users
- Relational schema is domain independent, supports multiple KBs simultaneously
- Frames transferred from DBMS to Ocelot
- On demand
- By background prefetcher
- Memory cache
- Persistent disk cache to speed performance via Internet
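The on-demand transfer and memory cache mentioned above follow a common pattern that can be sketched briefly. This is not Ocelot code: the `FrameCache` class and its methods are hypothetical, and a plain dictionary stands in for the DBMS. A frame is fetched from the backing store the first time it is requested and served from memory afterwards; a prefetch call warms the cache ahead of use (done synchronously here, whereas the slide describes a background prefetcher).

```python
class FrameCache:
    """Fetch frames from a backing store on first use and keep them in memory."""

    def __init__(self, fetch):
        self._fetch = fetch      # callable: frame id -> frame data
        self._cache = {}
        self.fetches = 0         # number of trips to the backing store

    def get(self, frame_id):
        if frame_id not in self._cache:
            self.fetches += 1
            self._cache[frame_id] = self._fetch(frame_id)
        return self._cache[frame_id]

    def prefetch(self, frame_ids):
        """Load frames ahead of time so later get() calls hit the cache."""
        for frame_id in frame_ids:
            self.get(frame_id)


# A dictionary standing in for the DBMS; the frame ids are placeholders.
store = {
    "frame-1": {"name": "frame-1"},
    "frame-2": {"name": "frame-2"},
    "frame-3": {"name": "frame-3"},
}
cache = FrameCache(store.__getitem__)

cache.get("frame-1")
cache.get("frame-1")                      # served from memory
cache.prefetch(["frame-2", "frame-3"])
cache.get("frame-3")                      # served from memory
print(cache.fetches)
```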
22 The Common Lisp Programming Environment
- Gat studied Lisp and Java implementations of 16 programs by 14 programmers (Intelligence 11:21, 2000)
23 EcoCyc WWW Server
24 Pathway/Genome DBs Created by External Users
- Plasmodium falciparum, Stanford University
- plasmocyc.stanford.edu
- Mycobacterium tuberculosis, Stanford University
- BioCyc.org
- Arabidopsis thaliana and Synechocystis, Carnegie Institution of Washington
- Arabidopsis.org:1555
- Methanococcus jannaschii, EBI
- Maine.ebi.ac.uk:1555
- Other PGDBs in progress by 24 other users
- Software freely available
- Each PGDB owned by its creator
25 Global Consistency Checking of Biochemical Network
- Given
- A PGDB for an organism
- A set of initial metabolites
- Infer
- What set of products can be synthesized by the small-molecule metabolism of the organism
- Can known growth medium yield known essential compounds?
- Pacific Symposium on Biocomputing, p. 471, 2001
26 Algorithm: Forward Propagation
[Figure: forward-propagation diagram. Labels: Nutrient set, Transport, Metabolite set, Reactants, PGDB reaction pool, Fire reactions, Products]
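Forward propagation as outlined on this slide can be sketched in a few lines. This is a simplified reading of the diagram labels, not the published implementation: the function name, the dictionary layout of the reaction pool, and the toy compounds are all assumptions, and the transport step is omitted. Starting from the nutrient set, the loop fires every reaction whose reactants are all present in the metabolite set, adds its products, and repeats until no new reaction can fire.

```python
def forward_propagate(nutrients, reactions):
    """Fire reactions until the metabolite set stops growing.

    `reactions` maps a reaction name to a (reactants, products) pair of sets.
    Returns the reachable metabolite set and the names of the fired reactions.
    """
    metabolites = set(nutrients)
    fired = set()
    changed = True
    while changed:
        changed = False
        for name, (reactants, products) in reactions.items():
            if name not in fired and reactants <= metabolites:
                fired.add(name)
                metabolites |= products
                changed = True
    return metabolites, fired


# Toy reaction pool; compound and reaction names are placeholders.
reactions = {
    "r1": ({"A"}, {"B"}),
    "r2": ({"A", "B"}, {"C"}),
    "r3": ({"X"}, {"Y"}),   # cannot fire: X is never produced
}
essential = {"B", "C", "Y"}

reachable, fired = forward_propagate({"A"}, reactions)
missing = essential - reachable
print(sorted(reachable), sorted(fired), sorted(missing))
```

Comparing `reachable` with a list of essential compounds gives the kind of "produced / not produced" tally that such an analysis reports, and the reactions left unfired point to where the reaction pool or the nutrient set may be incomplete.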
27 Results
- Phase I: Forward propagation
- 21 initial compounds yielded only half of 38 essential compounds for E. coli
- Phase II: Manually identify
- Bugs in EcoCyc (e.g., two objects for tryptophan)
- Missing initial protein substrates (e.g., ACP)
- Missing pathways in EcoCyc
- Phase III: Forward propagation with 11 more initial metabolites
- Yielded all 38 essential compounds
28 Nutrient-Related Analysis: Validation of the EcoCyc Database
- Phase I
- Essential compounds
- produced 19
- not produced 19
- Total compounds
- produced (28%)
- Reactions
- Fired (31%)
29 Missing Essential Compounds Due To
- Bugs in EcoCyc
- Narrow conceptualization of the problem
- Protein substrates
- Incomplete biochemical knowledge
30 Nutrient-Related Analysis: Validation of the EcoCyc Database
- Phase II (After adding 11 extra metabolites)
- Essential compounds
- produced 38
- not produced 0
- Total compounds
- produced (49%)
- not produced (51%)
- Reactions
- Fired (58%)
- Not fired (42%)
31 Pathway Tools Misconceptions
- PathoLogic
- Does not re-annotate genomes
- Pathway Tools does not handle quantitative information
- Pathway/Genome Editors do not work through the web
32 HumanCyc: Human Metabolic Pathway Database Consortium
- Construct DB of human metabolic pathways using PathoLogic
- Link to human genome web sites
- Hire one curator to refine and curate with respect to literature over a 2 year period
- Remove false-positive predictions
- Insert known pathways missed by PathoLogic
- Add comments and citations from pathways and enzymes to the literature
- Add enzyme activators, inhibitors, cofactors, tissue information
- Available as flatfiles and with Pathway/Genome Navigator
- New versions to be released every 6 months
33 Summary
- Pathway/Genome Databases
- MetaCyc: non-redundant DB of literature-derived pathways
- 14 organism-specific PGDBs available through SRI at BioCyc.org
- Computational theories of biochemical machinery
- Pathway Tools software
- Extract pathways from genomes
- Morph annotated genome into structured ontology
- Distributed curation tools for MODs
- Query, visualization, WWW publishing
34 BioCyc and Pathway Tools Availability
- WWW BioCyc freely available to all
- BioCyc.org
- Six BioCyc DBs openly available to all
- BioCyc DBs freely available to non-profits
- Flatfiles downloadable from BioCyc.org
- Binary executable
- Sun UltraSparc-170 w/ 64MB memory
- PC, 400MHz CPU, 64MB memory, Windows-98 or newer
- PerlCyc API
- Pathway Tools freely available to non-profits
35 Acknowledgements
- SRI
- Suzanne Paley, Pedro Romero, John Pick, Cindy Krieger, Martha Arnaud
- EcoCyc Project
- Julio Collado-Vides, Ian Paulsen, Monica Riley, Milton Saier
- MetaCyc Project
- Sue Rhee, Lukas Mueller, Peifen Zhang, Chris Somerville
- Stanford
- Gary Schoolnik, Harley McAdams, Lucy Shapiro, Russ Altman, Iwei Yeh
- Funding sources
- NIH National Center for Research Resources
- NIH National Institute of General Medical Sciences
- NIH National Human Genome Research Institute
- Department of Energy Microbial Cell Project
- DARPA BioSpice, UPC
BioCyc.org