Title: Pathways Databases: A Case Study in Computational Symbolic Theories
1Pathways Databases: A Case Study in Computational
Symbolic Theories
Presented by Gunjan Gupta, April 13, 2004
2Overview
- What is gene co-expression, and why discover gene co-expression?
- Using DNA microarray data for discovering functional gene co-expression.
- Discussion of how to use phylogenetic information to filter out false positives in the microarray data.
- Detailed discussion of the proposed solution.
3A Complete Scientific Theory of a simple
organism is very Complex
- A simple organism such as E. coli is very complex.
- Example, for E. coli:
  - 791 chemical compounds and 744 enzyme-catalyzed biochemical reactions for the metabolic pathways.
  - 4290 protein-coding genes.
  - A perhaps larger number of mostly unknown RNA-encoding genes.
- A complete theory explains all the observed processes in an organism.
4Too complex (large) for a single scientist to grasp
- More complex than a factory producing cars.
Imagine a car factory that not only produces
cars, but also produces other factories of its own
kind, and everything else needed to produce,
reproduce, repair, fend off competition, fight
foreign attacks, adapt to a changing environment,
and evolve!
5Traditional methods of representing theoretical
knowledge
- Technical papers in journals
- Diagrams, flowcharts, tables
- Mostly natural-language descriptions, more suitable for humans.
- Raw and partially processed data in large biological databases, for example indexed by search criteria, annotated in English, etc.
6Traditional methods not enough, why ?
- Cannot represent scientific theories for biological systems: too complex.
- Automatic deductions are not possible, as the theory is not stored in a structured, computer-readable representation.
- Natural language processing is not yet developed to the level needed to allow a computer to search human-readable data.
- Only bits and pieces are understandable by one person at any given time.
- Large-scale patterns and deductions might never be found in the absence of any one individual who can see the big picture.
- Adding details from experimentation does not necessarily improve our understanding of the whole system.
7Pathway Databases for Biology
- A database of scientific theory, storing qualitative information (the semantics of the theory) in a well-defined form.
- Useful for representing theories that are mostly non-numerical in nature; a biological system is well suited.
- Definition: a pathway is a linked set of biochemical reactions, linked as follows:
[Figure: example pathway diagram with numbered elements, including Reaction 1, Reaction 2, Reaction 6, Raw Material 5 and Chemical 8. Theory: relationships between 1, 2, 3, 4, 5, 6, 7 defined.]
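The definition above can be sketched as a small data structure. This is only an illustration, assuming that two reactions count as "linked" when a product of one is a substrate of the next; the class `Reaction`, the helper names, and all compound names are hypothetical and not taken from any pathway database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """A biochemical reaction described only by what it consumes and produces."""
    name: str
    substrates: frozenset
    products: frozenset


def linked(first, second):
    """Assumed linkage rule: some product of `first` is a substrate of `second`."""
    return bool(first.products & second.substrates)


def is_pathway(reactions):
    """True if every consecutive pair of reactions in the list is linked."""
    return all(linked(a, b) for a, b in zip(reactions, reactions[1:]))


# Hypothetical toy reactions (names are placeholders, not real chemistry).
r1 = Reaction("reaction_1", frozenset({"raw_material"}), frozenset({"intermediate_a"}))
r2 = Reaction("reaction_2", frozenset({"intermediate_a"}), frozenset({"chemical_b"}))
r3 = Reaction("reaction_3", frozenset({"unrelated"}), frozenset({"other"}))

print(is_pathway([r1, r2]))  # r1 feeds r2
print(is_pathway([r1, r3]))  # r1 does not feed r3
```

Storing reactions as explicit objects, rather than as prose, is what makes questions such as "does this chain of reactions connect?" answerable by a program.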
8Different Kinds of Biological Pathways Database
(PDB)
- Only metabolic pathways (majority) (time scale generally fast)
  - e.g. production of ATP in the mitochondria
- Signaling pathways (time scale generally medium)
  - Intra-cellular, e.g. cell membrane to nucleus
  - Extra-cellular, e.g. growth factors for nerve, skin, say for injury
- Genetic-regulatory pathways (time scale generally medium/slow)
  - Gene expression control: make more as needed, for example synthesize amino acids deficient in food by increasing expression of related genes.
- A combination of one or more of the above (e.g. PAD, EcoCyc).
9Pathways Databases an intersection of 4 fields
Genomics
Biochemistry
Databases
Artificial Intelligence
10Examples of current PDBs
- Metabolic Pathway Database
  - EcoCyc (author): http://www.ecocyc.org/
- Signaling Pathways
  - SPAD: http://www.grt.kyushu-u.ac.jp/spad/
- Genetic-regulatory pathways
  - BIOBASE: http://www.gene-regulation.com/
11EcoCyc Project
- EcoCyc: Encyclopedia of Escherichia coli K12 Genes and Metabolism
- At http://www.ecocyc.org/
- Links to a bigger site, http://biocyc.org/server.html, containing PDBs for other organisms, including human.
12What Metabolic Genetic Pathway Info does EcoCyc
Contain ?
- For each enzyme in the PDB:
  - Detailed description of the reaction catalyzed by the enzyme.
  - Genes to which the enzyme maps, if available.
  - The range of substrates the enzyme will accept.
  - Chemicals that inhibit or activate the enzyme.
  - Its subunit structure.
- Each small-molecule enzyme substrate
- Pathway types included for E. coli:
  - Biosynthesis of cellular building blocks.
  - Extraction of carbon from food.
  - Extraction of chemical energy from food.
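As a rough illustration of the per-enzyme information listed above, the record below mirrors those items as fields. It is a sketch only: the class `EnzymeRecord`, its field names, and the sample values are hypothetical and do not reflect EcoCyc's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EnzymeRecord:
    """Hypothetical record holding the kinds of facts listed on the slide."""
    name: str
    reaction: str                      # description of the reaction catalyzed
    gene: Optional[str] = None         # gene the enzyme maps to, if available
    substrates: List[str] = field(default_factory=list)
    activators: List[str] = field(default_factory=list)
    inhibitors: List[str] = field(default_factory=list)
    subunits: List[str] = field(default_factory=list)


def enzymes_without_gene(records):
    """Names of enzymes for which no gene mapping is recorded."""
    return [r.name for r in records if r.gene is None]


# Placeholder entries; none of these names refer to real E. coli enzymes.
records = [
    EnzymeRecord("enzyme_a", "substrate_1 -> product_1", gene="gene_a",
                 substrates=["substrate_1"], inhibitors=["compound_x"]),
    EnzymeRecord("enzyme_b", "substrate_2 -> product_2",
                 substrates=["substrate_2"]),
]

print(enzymes_without_gene(records))
```

Making the gene mapping an optional field reflects the "if available" qualifier in the list: a missing mapping is then a queryable fact rather than a gap in a text description.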
13Tools Visualization in EcoCyc
- SRI International (http://www.sri.com/) developed the visualization and search tool called Pathway Tools as an intelligent user interface.
- Allows the user to exploit the semantic information in the PDB and write complex queries.
- Visualizes results as a pathway graph called the Overview Diagram.
- A variety of criteria for the queries:
  - Name matching
  - Classification hierarchy (taxonomy, metabolic pathways)
- Example:
  - Find all reactions that are activated or inhibited by a given metabolite.
- Superimposing genetics data on the visualization (see demo)
- User can create a PGDB for a new organism and share it.
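The example query above ("find all reactions that are activated or inhibited by a given metabolite") can be sketched over a toy in-memory table. This is not the Pathway Tools query interface; the function `reactions_regulated_by`, the dictionary layout, and the reaction and metabolite names are all assumptions made for illustration.

```python
def reactions_regulated_by(metabolite, reactions):
    """Return (reaction name, effect) pairs for reactions the metabolite regulates."""
    hits = []
    for rxn in reactions:
        if metabolite in rxn["activators"]:
            hits.append((rxn["name"], "activated"))
        if metabolite in rxn["inhibitors"]:
            hits.append((rxn["name"], "inhibited"))
    return hits


# Hypothetical toy data: each reaction lists its activators and inhibitors.
toy_reactions = [
    {"name": "rxn_a", "activators": {"met_x"}, "inhibitors": set()},
    {"name": "rxn_b", "activators": set(), "inhibitors": {"met_x", "met_y"}},
    {"name": "rxn_c", "activators": {"met_y"}, "inhibitors": set()},
]

for name, effect in reactions_regulated_by("met_x", toy_reactions):
    print(name, effect)
```

The query is only this short because activation and inhibition are stored as explicit relationships; the same question asked of journal text would require reading every paper.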
14Visualization
Metabolite categories
Individual Reactions
Glycolysis region
15An application automated inter-species
comparison of reactions
- Yellow shows reactions in E. coli that match those of another species, S. cerevisiae, found as a result of database lookup.
- Mostly automated layout using Pathway Tools (some manual fitting was done by the author in this paper for this diagram).
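The database lookup behind such a comparison can be pictured as a set intersection. The sketch below assumes both organisms' reactions are stored under a shared identifier scheme, which is a simplification; the function name and the identifiers are hypothetical.

```python
def shared_reactions(org_a, org_b):
    """Return reaction identifiers present in both organisms' reaction sets."""
    return set(org_a) & set(org_b)


# Placeholder identifiers, not real database entries.
ecoli_rxns = {"rxn_1", "rxn_2", "rxn_3", "rxn_4"}
yeast_rxns = {"rxn_2", "rxn_4", "rxn_9"}

common = shared_reactions(ecoli_rxns, yeast_rxns)
print(sorted(common))  # the reactions that would be highlighted in the diagram
```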
16Exploiting Knowledge Representation AI tools in a
PDB
- Building an ontology: a DB schema defining the precise relationships between entities (use UML?).
- Encode a theory using the ontology.
- The EcoCyc ontology consists of 1000 object-oriented classes encoding key concepts of biology and biochemistry.
- Extend the ontology when new concepts are found that cannot be derived using the existing ontology.
- Use KR techniques that exploit symbolic AI reasoning to, say:
  - build an inference engine (see Bruce Porter's Knowledge Machine, for example) on the PDB, or
  - perform specific global inferences using specific relationship searches.
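A class hierarchy of the kind described above supports simple symbolic inference such as "is every X also a Y?". The sketch below is a minimal illustration with a handful of invented class names; it is not the EcoCyc ontology and the parent links are assumptions.

```python
# Hypothetical mini-ontology: each class maps to its parent (None marks the root).
PARENT = {
    "Enzyme": "Protein",
    "Protein": "Macromolecule",
    "Macromolecule": "Chemical",
    "SmallMolecule": "Chemical",
    "Chemical": None,
}


def is_a(cls, ancestor, parent=PARENT):
    """Walk up the parent links; True if `ancestor` is reached from `cls`."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = parent.get(cls)
    return False


print(is_a("Enzyme", "Chemical"))
print(is_a("SmallMolecule", "Protein"))
```

Extending the ontology then amounts to adding a new entry to the parent table, after which every existing query automatically covers the new concept.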
17Example of Symbolic Inferences on EcoCyc
- Results of a search changed the simplistic notion of what a gene is.
- In E. coli, found that 1 out of 7 enzymes catalyzes more than one reaction, and almost 1 out of 7 cases where a reaction is used in more than one pathway (overlapping sub-graphs or clusters); this can be treated as a discovered theory.
- Characterized the transcription factors' relationships (see next slide).
- Other interesting theories found using PDBs:
  - Scale-free network topology that follows a power law for both metabolic and genetic networks.
  - Deletion of proteins with high connectivity is more likely to kill the organism.
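Both kinds of finding above reduce to counting over stored relationships. The sketch below shows the two counts on invented toy data: the fraction of enzymes that catalyze more than one reaction, and a degree histogram of the kind one would inspect for a power-law shape. Function names and data are hypothetical, and the toy numbers say nothing about E. coli.

```python
from collections import Counter


def multifunctional_fraction(enzyme_to_reactions):
    """Fraction of enzymes that catalyze more than one reaction."""
    multi = sum(1 for rxns in enzyme_to_reactions.values() if len(rxns) > 1)
    return multi / len(enzyme_to_reactions)


def degree_histogram(edges):
    """Map each degree to the number of nodes having that degree."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return Counter(degree.values())


# Toy data only.
enzyme_to_reactions = {
    "enz_1": {"r1", "r2"},
    "enz_2": {"r3"},
    "enz_3": {"r4"},
    "enz_4": {"r5"},
}
edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]

print(multifunctional_fraction(enzyme_to_reactions))
print(dict(degree_histogram(edges)))
```

In a scale-free network the histogram would show many low-degree nodes and a few highly connected hubs; the hubs are the proteins whose deletion the slide describes as most likely to be lethal.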
18Example 2 Characterizing Transcription Factors
inter-relationship in a genetic pathway
Just two dominate the relationships making the
tree very shallow
Most do not regulate themselves or other
transcription factors
19EcoCyc Demo
- Demo 1: Search demo
  - Go to the site http://www.ecocyc.org/
  - Click on DB search.
  - Search for Glycolysis and show to class if time permits.
- Demo 2: Combining pathways with gene expression data
  - Go to the site http://www.ecocyc.org:1555/expression.html
  - Specify the data file as http://biocyc.org/coli.dat (might have to save it locally to work).
  - Select absolute display level.
  - Enter ratio for numerator as 1.
20Issues/Comments
- The idea itself is quite powerful (combining sequence data with DNA array data), and the results seem to be quite good, but:
- Too many steps: hard to quantify the results except empirically, because of the complexity. At least 5 levels of successive transformations (1. genes to meta-genes via BLAST; 2. Pearson correlation; 3. order-based probability; 4. network thresholding; 5. clustering).
- Some heuristics are not explained much, for example the transformation from probability into 2-d space for clustering, where the original problem was a graph. Why not directly partition the graph?
- Limited quality of clusters because of the 2-d translation. Not clear why and how the data fits into a 2-d space. Not clear if the translation using P-value is a metric.
- The paper was obviously not written by plain computer scientists: a lot of interesting discoveries and analysis after the method was used.
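Of the five transformations listed above, the Pearson correlation step is the easiest to make concrete. The sketch below computes the standard Pearson coefficient between two expression profiles in plain Python; it is not the paper's implementation, and the profile values are invented.

```python
import math


def pearson(x, y):
    """Standard Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (norm_x * norm_y)


# Invented expression levels for two genes across four experiments.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
print(pearson(gene_a, gene_b))
```

A value near 1 means the two genes rise and fall together across experiments, which is the raw co-expression signal that the later steps rank, threshold, and cluster.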
21One line summary of the paper ..
Using phylogenetics to filter out gene
co-expressions in micro-array data that are not
functionally relevant.