Title: Pathways Databases: A Case Study in Computational Symbolic Theories


1
Pathways Databases: A Case Study in Computational
Symbolic Theories
  • Author: Peter D. Karp

Presented by Gunjan Gupta, April 13, 2004
2
Overview
  • What is gene co-expression, and why discover gene
    co-expression?
  • Using DNA microarray data for discovering
    functional gene co-expression.
  • Discussion on how to use phylogenetic information
    to filter out false positives in the microarray
    data.
  • Detailed discussion of the proposed solution.

3
A Complete Scientific Theory of a Simple
Organism is Very Complex
  • Even a simple organism such as E. coli is very
    complex

Example, for E. coli:
- 791 chemical compounds and 744 enzyme-catalyzed
  biochemical reactions for the metabolic pathways
- 4290 protein-coding genes
- A perhaps larger number of mostly unknown
  RNA-encoding genes
A complete theory explains all the observed
processes in an organism.
4
Too complex (large) for a single scientist to grasp
  • More complex than a factory producing cars

Imagine a car factory that not only produces
cars, but also produces other factories of its own
kind, and everything else needed to produce,
reproduce, repair, fend off competition, fight
foreign attacks, adapt to a changing environment,
and evolve!
5
Traditional methods of representing theoretical
knowledge
  • Technical papers in journals
  • Diagrams, flowcharts, tables
  • Mostly natural-language descriptions, more
    suitable for humans.
  • Raw and partially processed data in large
    biological databases, for example indexed by
    search criteria, annotated in English, etc.

6
Traditional methods are not enough. Why?
  • They cannot represent scientific theories for
    biological systems: these are too complex.
  • Automatic deductions are not possible, as the
    theory is not stored in a structured,
    computer-readable representation.
  • Natural language processing is not yet developed
    to the level that would allow a computer to
    search human-readable data.
  • Only bits and pieces understandable by one person
    at any given time.
  • Large-scale patterns and deductions might never
    be found in the absence of any one individual who
    can see the big picture.
  • Addition of details from experimentation does not
    necessarily improve our understanding of the
    whole system.

7
Pathway Databases for Biology
  • A database of scientific theory, storing
    qualitative information and the semantics of the
    theory in a well-defined form.
  • Useful for representing theories that are mostly
    non-numerical in nature; a biological system is
    well suited.
  • Definition: a pathway is a linked set of
    biochemical reactions, linked as follows:

[Figure: example pathway diagram. Labels: Reaction 1,
Reaction 2, Reaction 6, Chemical 8, Raw Material 5.
Caption: theory of relationships between 1, 2, 3, 4,
5, 6 and 7 defined.]
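
The definition above, a pathway as a linked set of biochemical reactions, can be sketched in a few lines of Python. This is only an illustration: the `Reaction` class, the `link_reactions` helper, and the simplified glycolysis-like data (cofactors omitted) are assumptions made for this sketch, not the schema used by EcoCyc or any other pathway database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reaction:
    """Hypothetical minimal reaction record: a name plus its chemicals."""
    name: str
    substrates: frozenset
    products: frozenset


def link_reactions(reactions):
    """Return edges (a, b) where some product of a is a substrate of b."""
    edges = []
    for a in reactions:
        for b in reactions:
            if a is not b and a.products & b.substrates:
                edges.append((a.name, b.name))
    return edges


# Toy data, loosely modelled on the first steps of glycolysis (assumption:
# cofactors such as ATP are left out to keep the example small).
reactions = [
    Reaction("R1", frozenset({"glucose"}),
             frozenset({"glucose-6-phosphate"})),
    Reaction("R2", frozenset({"glucose-6-phosphate"}),
             frozenset({"fructose-6-phosphate"})),
    Reaction("R3", frozenset({"fructose-6-phosphate"}),
             frozenset({"fructose-1,6-bisphosphate"})),
]

pathway_edges = link_reactions(reactions)
print(pathway_edges)
```

Two reactions are linked here whenever one produces a chemical that the other consumes, so the pathway is a directed graph over reactions.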
8
Different Kinds of Biological Pathways Databases
(PDBs)
  • Metabolic pathways only (the majority; time
    scale generally fast)
      • e.g. production of ATP in the mitochondria
  • Signaling pathways (time scale generally medium)
      • Intra-cellular, e.g. cell membrane to nucleus
      • Extra-cellular, e.g. growth factors for nerve
        or skin, say after injury
  • Genetic-regulatory pathways (time scale
    generally medium/slow)
      • Gene expression control: make more as needed,
        for example synthesizing amino acids that are
        deficient in food by increasing expression of
        the related genes
  • A combination of one or more of the above (e.g.
    PAD, EcoCyc).

9
Pathways Databases: an intersection of 4 fields
Genomics
Biochemistry
Databases
Artificial Intelligence
10
Examples of current PDBs
  • Metabolic Pathway Database
      • EcoCyc (author): http://www.ecocyc.org/
  • Signaling Pathways
      • SPAD: http://www.grt.kyushu-u.ac.jp/spad/
  • Genetic-regulatory pathways
      • BIOBASE: http://www.gene-regulation.com/

11
EcoCyc Project
  • EcoCyc: Encyclopedia of Escherichia coli K-12
    Genes and Metabolism
  • At http://www.ecocyc.org/
  • Links to a bigger site,
    http://biocyc.org/server.html, containing PDBs
    for other organisms, including human.

12
What Metabolic and Genetic Pathway Info does EcoCyc
Contain?
  • For each enzyme in the PDB
      • Detailed description of the reaction
        catalyzed by the enzyme.
      • Genes to which the enzyme maps, if available.
      • The range of substrates the enzyme accepts.
      • Chemicals that inhibit or activate the enzyme.
      • Its subunit structure.
  • Each small-molecule enzyme substrate
  • Pathway types included for E. coli
      • Biosynthesis of cellular building blocks.
      • Extraction of carbon from food.
      • Extraction of chemical energy from food.

13
Tools and Visualization in EcoCyc
  • SRI International (http://www.sri.com/) developed
    the visualization and search tool called Pathway
    Tools as an intelligent user interface.
  • Allows the user to exploit the semantic
    information in the PDB and write complex queries.
  • Visualizes results as a pathway graph called the
    Overview Diagram.
  • A variety of criteria for the queries
      • Name matching
      • Classification hierarchy (taxonomy, metabolic
        pathways)
  • Example
      • Find all reactions that are activated or
        inhibited by a given metabolite.
  • Superimposing genetics data on the visualization
    (see demo)
  • A user can create a PGDB for a new organism and
    share it.
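
The example query above (find all reactions activated or inhibited by a given metabolite) can be illustrated with a small Python sketch. This is not the Pathway Tools query interface: the `regulation` fact list, the identifiers in it, and the `reactions_regulated_by` function are hypothetical stand-ins for structured facts a PDB would store.

```python
# Hypothetical in-memory stand-in for regulation facts in a pathway database.
# Each entry is (reaction, metabolite, effect); effect is "activates" or
# "inhibits". None of these identifiers come from EcoCyc.
regulation = [
    ("reaction-A", "metabolite-X", "inhibits"),
    ("reaction-A", "metabolite-Y", "activates"),
    ("reaction-B", "metabolite-X", "activates"),
    ("reaction-C", "metabolite-Z", "inhibits"),
]


def reactions_regulated_by(metabolite, facts):
    """Map each reaction regulated by `metabolite` to its set of effects."""
    result = {}
    for reaction, regulator, effect in facts:
        if regulator == metabolite:
            result.setdefault(reaction, set()).add(effect)
    return result


hits = reactions_regulated_by("metabolite-X", regulation)
print(hits)
```

The query is only this simple because the facts are stored as structured relationships rather than as free text.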

14
Visualization
[Figure: visualization screenshot. Labels: Metabolite
categories, Individual Reactions, Glycolysis region.]
15
An application: automated inter-species
comparison of reactions
  • Yellow shows reactions in E. coli that match
    those of another species, S. cerevisiae, found as
    a result of database lookup.
  • Mostly automated layout using Pathway Tools (some
    manual fitting was done by the author of the
    paper for this diagram).
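
At its core, the comparison described above is a set intersection over reaction identifiers. The sketch below assumes that both organisms' reactions use the same identifiers; the identifiers, the `shared_reactions` helper, and the colour mapping are made up for illustration and do not reproduce how Pathway Tools performs the lookup.

```python
def shared_reactions(organism_a, organism_b):
    """Return the sorted reaction identifiers present in both organisms."""
    return sorted(set(organism_a) & set(organism_b))


# Hypothetical reaction identifiers for two organisms.
ecoli = {"rxn-1", "rxn-2", "rxn-3", "rxn-4"}
yeast = {"rxn-2", "rxn-4", "rxn-5"}

matches = shared_reactions(ecoli, yeast)

# Colour each E. coli reaction: yellow if it also occurs in the other species.
highlight = {
    rxn: ("yellow" if rxn in matches else "default")
    for rxn in sorted(ecoli)
}
print(matches)
print(highlight)
```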

16
Exploiting Knowledge Representation and AI tools in
a PDB
  • Build an ontology: a DB schema defining the
    precise relationships between entities (use
    UML?)
  • Encode a theory using the ontology
      • The EcoCyc ontology consists of 1000
        object-oriented classes encoding key concepts
        of biology and biochemistry.
  • Extend the ontology when new concepts are found
    that cannot be derived using the existing
    ontology.
  • Use KR techniques that exploit symbolic AI
    reasoning to, say,
      • build an inference engine (see Bruce Porter's
        Knowledge Machine, for example) on the PDB, or
      • perform specific global inferences using
        specific relationship searches.
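
To make the idea of an ontology with inference concrete, here is a minimal Python sketch of a class hierarchy with one derived relationship (transitive subclass membership). The class names and the `is_a` table are invented for illustration; they are not taken from the EcoCyc ontology, and this is far simpler than a real inference engine.

```python
# Hypothetical mini-ontology: each class maps to its direct parent class.
is_a = {
    "Enzyme": "Protein",
    "Protein": "Macromolecule",
    "Macromolecule": "Chemical",
    "Metabolite": "Chemical",
}


def ancestors(cls, hierarchy):
    """Return all classes above `cls`, nearest parent first."""
    chain = []
    while cls in hierarchy:
        cls = hierarchy[cls]
        chain.append(cls)
    return chain


def is_subclass(cls, candidate, hierarchy):
    """Infer whether `cls` is `candidate` or one of its descendants."""
    return cls == candidate or candidate in ancestors(cls, hierarchy)


print(ancestors("Enzyme", is_a))
print(is_subclass("Enzyme", "Chemical", is_a))
```

The fact that an enzyme is a chemical is never stored directly. It is derived from the stored parent links, which is the kind of deduction a structured representation makes possible.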

17
Example of Symbolic Inferences on EcoCyc
  • Results of a search changed the simplistic notion
    of what a gene is
      • In E. coli, 1 out of 7 enzymes was found to
        catalyze more than one reaction, and in almost
        1 out of 7 cases a reaction is used in more
        than one pathway (overlapping sub-graphs or
        clusters); this can be treated as a discovered
        theory.
  • Characterized the relationships among
    transcription factors (see next slide).
  • Other interesting theories found using PDBs
      • Scale-free network topology that follows a
        power law, for both metabolic and genetic
        networks.
      • Deletion of proteins with high connectivity is
        more likely to kill the organism.
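
Global counts such as "enzymes that catalyze more than one reaction" or "reactions used in more than one pathway" reduce to simple aggregate queries once the relationships are stored explicitly. The sketch below uses invented toy data (the `catalyzes` and `pathways` tables are assumptions), so its numbers bear no relation to the E. coli figures quoted above.

```python
from collections import Counter

# Hypothetical toy data: enzyme -> reactions it catalyzes.
catalyzes = {
    "enz-1": {"rxn-1"},
    "enz-2": {"rxn-2", "rxn-3"},
    "enz-3": {"rxn-4"},
    "enz-4": {"rxn-4"},
}

# Hypothetical toy data: pathway -> reactions it contains.
pathways = {
    "pwy-A": {"rxn-1", "rxn-2"},
    "pwy-B": {"rxn-2", "rxn-3", "rxn-4"},
}

# Enzymes that catalyze more than one reaction.
multi_enzymes = sorted(e for e, rxns in catalyzes.items() if len(rxns) > 1)
fraction_multi = len(multi_enzymes) / len(catalyzes)

# Reactions that appear in more than one pathway.
usage = Counter(rxn for rxns in pathways.values() for rxn in rxns)
shared = sorted(rxn for rxn, count in usage.items() if count > 1)

print(multi_enzymes, fraction_multi)
print(shared)
```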

18
Example 2: Characterizing the inter-relationships
of transcription factors in a genetic pathway
Just two dominate the relationships, making the
tree very shallow.
Most do not regulate themselves or other
transcription factors.
19
EcoCyc Demo
  • Demo 1: Search demo
      • Go to site http://www.ecocyc.org/
      • Click on DB search
      • Search for Glycolysis and show to class if
        time permits.
  • Demo 2: Combining pathways with gene expression
    data
      • Go to site
        http://www.ecocyc.org:1555/expression.html
      • Specify data file as http://biocyc.org/coli.dat
        (might have to save it locally to work).
      • Select absolute display level.
      • Enter ratio for numerator as 1.

20
Issues/Comments
  • The idea itself is quite powerful: combining
    sequence data with DNA array data. The results
    seem to be quite good, but ...
  • Too many steps: the results are hard to quantify
    except empirically, because of the complexity.
    There are at least 5 levels of successive
    transformations (1. genes to meta-genes via
    BLAST, 2. Pearson correlation, 3. order-based
    probability, 4. network thresholding,
    5. clustering).
  • Some heuristics are not explained much. For
    clustering, for example, the probabilities are
    transformed into a 2-d space although the
    original problem was a graph: why not partition
    the graph directly?
  • Limited quality of clusters because of the 2-d
    translation. It is not clear why and how the data
    fits into a 2-d space, or whether the translation
    using the P-value is a metric.
  • The paper was obviously not written by plain
    computer scientists: a lot of interesting
    discoveries and analysis followed once the method
    was used.

21
One-line summary of the paper:
Using phylogenetics to filter out gene
co-expressions in microarray data that are not
functionally relevant.