PathoLogic Pathway Predictor - PowerPoint PPT Presentation

Provided by: pangeasy


1
PathoLogic Pathway Predictor
2
Inference of Metabolic Pathways
  • Inputs: Annotated Genomic Sequence and Multi-organism Pathway
    Database (MetaCyc)
  • PathoLogic Software: Integrates genome and pathway data to
    identify putative metabolic networks
  • Output: Pathway/Genome Database containing Pathways, Reactions,
    Compounds, Gene Products, Genes, and the Genomic Map
3
PathoLogic Functionality
  • Initialize schema for new PGDB
  • Transform existing genome to PGDB form
  • Infer metabolic pathways and store in PGDB
  • Infer operons and store in PGDB
  • Assemble Overview diagram
  • Assist user with manual tasks
  • Assign enzymes to reactions they catalyze
  • Identify false-positive pathway predictions
  • Build protein complexes from monomers
  • Infer transport reactions

4
PathoLogic Input/Output
  • Inputs
  • File listing genetic elements
  • http://bioinformatics.ai.sri.com/ptools/genetic-elements.dat
  • Files containing DNA sequence for each genetic
    element
  • Files containing annotation for each genetic
    element
  • MetaCyc database
  • Output
  • Pathway/genome database for the subject organism
  • Reports that summarize
  • Evidence contained in the input genome for the
    presence of reference pathways
  • Reactions missing from inferred pathways

5
PathoLogic Analysis Phases
  • Trial parsing of input data files: few days
  • Initialize schema of new PGDB: 3 min
  • Create DB objects for replicons, genes, proteins: 5 min
  • Assign enzymes to reactions they catalyze: 10 min / 1 week
  • Example enzymes from the slide figure: ferrochelatase,
    glutamate 1-semialdehyde 2,1-aminomutase, porphobilinogen
    deaminase

6
PathoLogic Analysis Phases
  • From assigned reactions, infer what pathways are
    present: 5 min / few days
  • Define metabolic overview diagram: 30 min
  • Define protein complexes: few days

7
genetic-elements.dat
  • ID TEST-CHROM-1
  • NAME Chromosome 1
  • TYPE CHRSM
  • CIRCULAR? N
  • ANNOT-FILE chrom1.pf
  • SEQ-FILE chrom1.fsa
  • //
  • ID TEST-CHROM-2
  • NAME Chromosome 2
  • CIRCULAR? N
  • ANNOT-FILE /mydata/chrom2.gbk
  • SEQ-FILE /mydata/chrom2.fna
  • //
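A minimal Python sketch of reading records like the two above. It assumes that the real file separates each attribute from its value with a tab and ends each record with a line containing //; the function names parse_genetic_elements and missing_files are hypothetical helpers, not part of Pathway Tools.

```python
from typing import Dict, List

# Sample text modelled on the two records above. The tab separator is an
# assumption; the parser falls back to whitespace if no tab is present.
SAMPLE = (
    "ID\tTEST-CHROM-1\n"
    "NAME\tChromosome 1\n"
    "TYPE\tCHRSM\n"
    "CIRCULAR?\tN\n"
    "ANNOT-FILE\tchrom1.pf\n"
    "SEQ-FILE\tchrom1.fsa\n"
    "//\n"
    "ID\tTEST-CHROM-2\n"
    "NAME\tChromosome 2\n"
    "CIRCULAR?\tN\n"
    "ANNOT-FILE\t/mydata/chrom2.gbk\n"
    "SEQ-FILE\t/mydata/chrom2.fna\n"
    "//\n"
)


def parse_genetic_elements(text: str) -> List[Dict[str, str]]:
    """Return one attribute -> value dict per record (records end with //)."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":  # record terminator
            if current:
                records.append(current)
            current = {}
            continue
        # Prefer a tab separator; otherwise split on the first whitespace run.
        parts = line.split("\t", 1) if "\t" in line else line.split(None, 1)
        key = parts[0].strip()
        value = parts[1].strip() if len(parts) > 1 else ""
        current[key] = value
    if current:  # tolerate a missing final //
        records.append(current)
    return records


def missing_files(record: Dict[str, str]) -> List[str]:
    """List which of the annotation/sequence file attributes are absent."""
    return [key for key in ("ANNOT-FILE", "SEQ-FILE") if not record.get(key)]


if __name__ == "__main__":
    for rec in parse_genetic_elements(SAMPLE):
        print(rec["ID"], rec.get("ANNOT-FILE"), rec.get("SEQ-FILE"))
```

Running a check like this before the trial parse only catches structural slips such as a forgotten file attribute; it does not replace the trial parse itself.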

8
File Naming Conventions
  • One pair of sequence and annotation files for
    each genetic element
  • Sequence files: FASTA format
  • Suffix .fsa or .fna
  • Annotation file
  • GenBank format: suffix .gbk
  • PathoLogic format: suffix .pf

9
Typical Problems Using GenBank Files With
PathoLogic
  • Wrong qualifier names used: read the PathoLogic
    documentation!
  • Extraneous information in a given qualifier
  • Check results of trial parse carefully

10
GenBank File Format
  • Accepted feature types
  • CDS, tRNA, rRNA, misc_RNA
  • Accepted qualifiers
  • /locus_tag: unique ID (recommended)
  • /gene: gene name (required)
  • /product (required)
  • /EC_number (recommended)
  • /product_comment (optional)
  • /gene_comment (optional)
  • /alt_name: synonyms (optional)
  • /pseudo: gene is a pseudogene (optional)
  • For multifunctional proteins, put each function
    in a separate /product line
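The qualifier rules on this slide can be checked mechanically before a trial parse. The sketch below is plain Python over already-parsed features (a feature type plus a dict of qualifier names to values); it is not the PathoLogic parser, and check_feature is a hypothetical name. Only the feature types, qualifier names, and required/recommended/optional levels come from the slide.

```python
from typing import Dict, List

ACCEPTED_TYPES = {"CDS", "tRNA", "rRNA", "misc_RNA"}
REQUIRED = {"gene", "product"}
RECOMMENDED = {"locus_tag", "EC_number"}
OPTIONAL = {"product_comment", "gene_comment", "alt_name", "pseudo"}


def check_feature(feature_type: str, qualifiers: Dict[str, str]) -> List[str]:
    """Return human-readable problems for one feature (empty list if none)."""
    if feature_type not in ACCEPTED_TYPES:
        return [f"feature type {feature_type!r} is not an accepted type"]
    present = set(qualifiers)
    problems = []
    for name in sorted(REQUIRED - present):
        problems.append(f"missing required qualifier /{name}")
    for name in sorted(RECOMMENDED - present):
        problems.append(f"missing recommended qualifier /{name}")
    for name in sorted(present - REQUIRED - RECOMMENDED - OPTIONAL):
        problems.append(f"qualifier /{name} is not in the accepted list")
    return problems


if __name__ == "__main__":
    # Gene name, product, and locus tag taken from the deoD record in the
    # PathoLogic-format example; the missing /EC_number is reported.
    report = check_feature(
        "CDS",
        {"gene": "deoD", "product": "purine nucleoside phosphorylase",
         "locus_tag": "TP0734"},
    )
    for line in report:
        print(line)
```

How the real parser treats a missing recommended qualifier is not stated on the slide, so the sketch only reports it rather than rejecting the feature.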

11
PathoLogic File Format
  • Each record starts with line containing an ID
    attribute
  • Tab delimited
  • Each record ends with a line containing //
  • One attribute-value pair is allowed per line
  • Use multiple FUNCTION lines for multifunctional
    proteins
  • Lines starting with ; are comment lines
  • Valid attributes are
  • ID, NAME, SYNONYM
  • STARTBASE, ENDBASE, GENE-COMMENT
  • FUNCTION, PRODUCT-TYPE, EC, FUNCTION-COMMENT
  • DBLINK
  • INTRON

12
PathoLogic File Format
  • ID TP0734
  • NAME deoD
  • STARTBASE 799084
  • ENDBASE 799785
  • FUNCTION purine nucleoside phosphorylase
  • DBLINK PID:g3323039
  • PRODUCT-TYPE P
  • GENE-COMMENT similar to GP:1638807 percent
    identity: 57.51; identified by sequence
    similarity; putative
  • //
  • ID TP0735
  • NAME gltA
  • STARTBASE 799867
  • ENDBASE 801423
  • FUNCTION glutamate synthase
  • DBLINK PID:g3323040
  • PRODUCT-TYPE P
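As a rough illustration of the format rules (ID line first, tab-delimited, one attribute-value pair per line, // terminator, repeated FUNCTION lines), here is a small Python reader. The function name parse_pf is hypothetical, comment lines are not handled, and accepting a last record without a closing // is a leniency of this sketch, not a documented behavior.

```python
from typing import Dict, List, Optional

# Attribute names listed on the format slide.
VALID_ATTRIBUTES = {
    "ID", "NAME", "SYNONYM", "STARTBASE", "ENDBASE", "GENE-COMMENT",
    "FUNCTION", "PRODUCT-TYPE", "EC", "FUNCTION-COMMENT", "DBLINK", "INTRON",
}

SAMPLE_PF = (
    "ID\tTP0734\n"
    "NAME\tdeoD\n"
    "STARTBASE\t799084\n"
    "ENDBASE\t799785\n"
    "FUNCTION\tpurine nucleoside phosphorylase\n"
    "PRODUCT-TYPE\tP\n"
    "//\n"
    "ID\tTP0735\n"
    "NAME\tgltA\n"
    "STARTBASE\t799867\n"
    "ENDBASE\t801423\n"
    "FUNCTION\tglutamate synthase\n"
    "PRODUCT-TYPE\tP\n"
)


def parse_pf(text: str) -> List[Dict[str, List[str]]]:
    """Parse records; each attribute maps to a list so FUNCTION can repeat."""
    records: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip():
            continue
        if raw.strip() == "//":  # end of record
            if current is not None:
                records.append(current)
            current = None
            continue
        attr, sep, value = raw.partition("\t")
        attr = attr.strip()
        if not sep:
            raise ValueError(f"line {lineno}: no tab between attribute and value")
        if attr not in VALID_ATTRIBUTES:
            raise ValueError(f"line {lineno}: unknown attribute {attr!r}")
        if current is None:
            if attr != "ID":
                raise ValueError(f"line {lineno}: record must start with ID")
            current = {}
        current.setdefault(attr, []).append(value.strip())
    if current is not None:  # sketch-only leniency: missing final //
        records.append(current)
    return records


if __name__ == "__main__":
    for rec in parse_pf(SAMPLE_PF):
        print(rec["ID"][0], rec["NAME"][0], rec["FUNCTION"])
```

Keeping every value in a list means a multifunctional protein with several FUNCTION lines needs no special case.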

13
Before you start: What to do when an error occurs
  • Most Navigator errors are automatically trapped;
    debugging information is saved to the error.tmp file.
  • All other errors (including most PathoLogic
    errors) will cause the software to drop into the Lisp
    debugger.
  • Unix: the error message will show up in the original
    terminal window from which you started Pathway
    Tools.
  • Windows: the error message will show up in the Lisp
    console. The Lisp console usually starts out
    iconified; its icon is a blue bust of Franz
    Liszt.
  • Two goals when an error occurs
  • Try to continue working
  • Obtain enough information for a bug report to
    send to the Pathway Tools support team.

14
The Lisp Debugger
  • Sample error (details and number of restart
    actions differ for each case)
  • Error: Received signal number 2 (Keyboard
    interrupt)
  • Restart actions (select using :continue):
  • 0: continue computation
  • 1: Return to command level
  • 2: Pathway Tools version 10.0 top level
  • 3: Exit Pathway Tools version 10.0
  • [1c] EC(2):
  • To generate debugging information (stack
    backtrace)
  • :zoom :count :all
  • To continue from error, find a restart that takes
    you to the top level; in this case, number 2
  • :cont 2
  • To exit Pathway Tools
  • :exit

15
How to report an error
  • Determine if problem is reproducible, and how to
    reproduce it (make sure you have all the latest
    patches installed)
  • Send email to ptools-support@ai.sri.com
    containing
  • Pathway Tools version number and platform
  • Description of exactly what you were doing (which
    command you invoked, what you typed, etc.) or
    instructions for how to reproduce the problem
  • error.tmp file, if one was generated
  • If the software breaks into the Lisp debugger, the
    complete error message and stack backtrace
    (obtained using the command :zoom :count :all, as
    described on the previous slide)

16
Using the PPP GUI to Create a Pathway/Genome
Database
  • Input Project Information
  • Organism -> Create New

17
Input Project Information
18
PathoLogic Command Menus
  • Organism
    • Select
    • Create New
    • Save KB
    • Revert KB
    • Reinitialize KB
    • Specify Reference PGDB(s)
    • Exit
  • Build
    • Trial Parse
    • Automated Build
  • Refine
    • Assign Probable Enzymes
    • Assign Modified Proteins
    • Create Protein Complexes
    • Re-run Name Matcher
    • Rescore Pathways
    • Predict transcription units
    • Transport Identification Parser
    • Update Overview
    • Pathway Hole Filler

19
Next Steps
  • Trial Parse
  • Build -> Trial Parse
  • Fix any errors in input files
  • Build pathway/genome database
  • Build -> Automated Build

20
PathoLogic Parser Output
21
Assign Enzymes to Reactions
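  • Example gene product: UDP-glucose-4-epimerase, matched
    against MetaCyc to reaction 5.1.3.2
    (UDP-D-glucose <=> UDP-galactose)
  • Does the gene product name match an enzyme name in MetaCyc?
  • Yes: Assign
  • No: Is it a probable enzyme (-ase)?
  • No: Not a metabolic enzyme
  • Yes: Manually search; if a reaction is found, Assign;
    otherwise, Can't Assign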
22
Enzyme Name Matcher
  • Matches on full enzyme name
  • Match is case-insensitive and removes the
    punctuation characters -_()',
  • Also matches after removal of prefixes and
    suffixes such as
  • Putative, Hypothetical, etc.
  • alpha, beta, catalytic, inducible,
    chain, subunit, component
  • Parenthetical gene name

23
Enzyme Name Matcher
  • For names that do not match, software identifies
    probable metabolic enzymes as those
  • Containing "ase"
  • Not containing keywords such as
  • sensor kinase
  • topoisomerase
  • protein kinase
  • peptidase
  • Etc
  • Research unknown enzymes
  • MetaCyc, Swiss-Prot, PubMed
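The matching rules on the two Enzyme Name Matcher slides can be sketched in Python. This is an illustration only, not the PathoLogic implementation: the word lists contain just the examples named on the slides, collapsing whitespace after removing punctuation is an assumption, and the reference table is a one-entry stand-in for MetaCyc built from the UDP-glucose-4-epimerase example (the gene name galE in the usage line is illustrative).

```python
import re
from typing import Dict, Optional

PUNCTUATION = "-_()',"
PREFIXES = ("putative", "hypothetical")
SUFFIXES = ("alpha", "beta", "catalytic", "inducible",
            "chain", "subunit", "component")
NON_METABOLIC_KEYWORDS = ("sensor kinase", "topoisomerase",
                          "protein kinase", "peptidase")


def canonical(name: str) -> str:
    """Lower-case, drop the listed punctuation, collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch not in PUNCTUATION)
    return " ".join(kept.split())


def strip_affixes(name: str) -> str:
    """Remove a trailing parenthetical gene name and listed prefix/suffix words."""
    name = re.sub(r"\s*\([^()]*\)\s*$", "", name)
    words = name.split()
    while words and words[0].lower().strip(",") in PREFIXES:
        words.pop(0)
    while words and words[-1].lower().strip(",") in SUFFIXES:
        words.pop()
    return " ".join(words)


def match_name(name: str, reference: Dict[str, str]) -> Optional[str]:
    """Try the full name first, then the name with affixes removed."""
    for candidate in (canonical(name), canonical(strip_affixes(name))):
        if candidate in reference:
            return reference[candidate]
    return None


def is_probable_enzyme(name: str) -> bool:
    """Unmatched names containing 'ase' but none of the excluded keywords."""
    lowered = name.lower()
    return "ase" in lowered and not any(k in lowered for k in NON_METABOLIC_KEYWORDS)


if __name__ == "__main__":
    reference = {canonical("UDP-glucose-4-epimerase"): "5.1.3.2"}
    print(match_name("Putative UDP-glucose-4-epimerase (galE)", reference))
    print(is_probable_enzyme("sensor kinase"))
```

Names that fail match_name but pass is_probable_enzyme are the ones left for manual research in MetaCyc, Swiss-Prot, or PubMed.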

24
Enzyme Name to Reaction Mapping
See also file PTools Tutorial/PathoLogic
Reports/name-matching-report.txt
25
Manual Polishing
  • Refine -> Assign Probable Enzymes: do this first
  • Refine -> Rescore Pathways: redo after assigning enzymes
  • Refine -> Create Protein Complexes: can be
    done at any time
  • Refine -> Assign Modified Proteins: can
    be done at any time
  • Refine -> Transport Identification Parser: can
    be done at any time
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Update Overview: do this last, and
    repeat after any material changes to PGDB

26
Assign Probable Enzymes
27
How to find reactions for probable enzymes
  • First, verify that enzyme name describes a
    specific, metabolic function
  • Search for a fragment of the name in MetaCyc; you may
    be able to find a match that PathoLogic missed
  • Look up protein in SwissProt or other DBs
  • Search for gene name in PGDB for related organism
    (bear in mind that gene names are not reliable
    indicators of function, so check carefully)
  • Search for function name in PubMed
  • Other

28
Manual Polishing
  • Refine -> Assign Probable Enzymes
  • Refine -> Rescore Pathways
  • Refine -> Create Protein Complexes
  • Refine -> Assign Modified Proteins
  • Refine -> Transport Identification Parser
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Run Consistency Checker
  • Refine -> Update Overview

29
Automated Pathway Inference
  • All pathways in MetaCyc for which there is at
    least one enzyme identified in the target
    organism are considered for possible inclusion.
  • Algorithm errs on the side of inclusivity: it is easier to
    manually delete a pathway from an organism than
    to find a pathway that should have been predicted
    but wasn't.

30
Considerations taken into account when deciding
whether or not a pathway should be inferred
  • Is there a unique enzyme, that is, an enzyme not involved
    in any other pathway?
  • Does the organism fall in the expected taxonomic
    domain of the pathway?
  • Is this pathway part of a variant set, and, if
    so, is there more evidence for some other
    variant?
  • If there is no unique enzyme:
  • Is there evidence for more than one enzyme?
  • If a biosynthetic pathway, is there evidence for
    final reaction(s)?
  • If a degradation pathway, is there evidence for
    initial reaction(s)?
  • If an energy metabolism pathway, is there
    evidence for more than half the reactions?
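One way to read these considerations is as a decision function. The slide does not say how PathoLogic weighs or combines them, so the strict ordering below, the Pathway record, and the function name should_infer are all assumptions made for illustration, in Python.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Pathway:
    name: str
    kind: str                      # "biosynthesis", "degradation", "energy", or other
    reactions: List[str]           # ordered: first = initial, last = final reaction
    expected_taxa: Set[str] = field(default_factory=set)


def should_infer(pathway: Pathway,
                 evidenced: Set[str],      # reactions with an enzyme in the organism
                 unique: Set[str],         # reactions whose enzyme is in no other pathway
                 organism_taxa: Set[str],
                 better_variant_exists: bool = False) -> bool:
    present = [r for r in pathway.reactions if r in evidenced]
    if not present:
        return False               # only pathways with at least one enzyme are considered
    if pathway.expected_taxa and not (pathway.expected_taxa & organism_taxa):
        return False               # outside the expected taxonomic domain
    if better_variant_exists:
        return False               # another variant has more evidence
    if any(r in unique for r in present):
        return True                # a unique enzyme is present
    # No unique enzyme: apply the additional questions.
    if len(present) <= 1:
        return False               # need evidence for more than one enzyme
    if pathway.kind == "biosynthesis":
        return pathway.reactions[-1] in evidenced
    if pathway.kind == "degradation":
        return pathway.reactions[0] in evidenced
    if pathway.kind == "energy":
        return len(present) > len(pathway.reactions) / 2
    return True


if __name__ == "__main__":
    demo = Pathway("demo", "biosynthesis", ["r1", "r2", "r3"])
    print(should_infer(demo, evidenced={"r1", "r3"}, unique=set(), organism_taxa=set()))
```

Treating each question as a hard yes/no filter is the simplest reading; a scoring scheme that trades the questions off against each other would be equally consistent with the slide.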

31
Assigning Evidence Scores to Predicted Pathways
  • X|Y|Z denotes the score for pathway P in organism O
  • where
  • X = total number of reactions in P
  • Y = number of reactions in P for which there is
    evidence of a catalyzing enzyme in O
  • Z = number of Y reactions that are used in other
    pathways in O
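The three score components can be computed directly from reaction sets. This is a small Python sketch; the function name pathway_score and the reaction IDs are made up for the example, and "evidence" is reduced to simple set membership.

```python
from typing import Iterable, Set, Tuple


def pathway_score(pathway_reactions: Iterable[str],
                  evidenced_reactions: Set[str],
                  other_pathways: Iterable[Iterable[str]]) -> Tuple[int, int, int]:
    """Return (X, Y, Z) for one pathway P in one organism O.

    X: total number of reactions in P
    Y: reactions in P with enzyme evidence in O
    Z: how many of those Y reactions also occur in other pathways in O
    """
    reactions = list(pathway_reactions)
    with_evidence = [r for r in reactions if r in evidenced_reactions]
    used_elsewhere = set().union(*other_pathways)
    x = len(reactions)
    y = len(with_evidence)
    z = sum(1 for r in with_evidence if r in used_elsewhere)
    return x, y, z


if __name__ == "__main__":
    score = pathway_score(
        ["r1", "r2", "r3", "r4"],          # hypothetical pathway P
        {"r1", "r2", "r5"},                # reactions with evidence in O
        [["r2", "r5"], ["r6"]],            # other pathways in O
    )
    print(score)  # (4, 2, 1)
```

A pathway where Y is close to X and Z is small has most of its steps supported by enzymes that are not explained by some other pathway.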

32
Manual Pruning of Pathways
  • Use pathway evidence report
  • Coloring scheme aids in assessing pathway
    evidence
  • Phase I: Prune extra variant pathways
  • Rescore pathways, re-generate pathway evidence
    report
  • Phase II: Prune pathways unlikely to be present
  • No/few unique enzymes
  • Most pathway steps present because they are used
    in another pathway
  • Pathway very unlikely to be present in this
    organism
  • Nonspecific enzyme name assigned to a pathway
    step

33
Caveats
  • Cannot predict pathways not present in MetaCyc
  • Evidence for short pathways is hard to interpret
  • Since many reactions occur in multiple pathways,
    some false positives occur

34
Output from PPP
  • Pathway/genome database
  • Summary pages
  • Pathway evidence page
  • Click Summary of Organisms, then click organism
    name, then click Pathway Evidence, then click
    Save Pathway Report
  • Missing enzymes report
  • Directory tree containing sequence files,
    reports, etc.

35
Resulting Directory Structure
  • ROOT/ptools-local/pgdbs/user/ORGIDcyc/VERSION/
    • input
      • organism.dat
      • organism-init.dat
      • genetic-elements.dat
      • annotation files
      • sequence files
    • reports
      • name-matching-report.txt
      • trial-parse-report.txt
    • kb
      • ORGIDbase.ocelot
    • data
      • overview.graph
  • released -> VERSION

36
Manual Polishing
  • Refine -> Assign Probable Enzymes
  • Refine -> Rescore Pathways
  • Refine -> Create Protein Complexes
  • Refine -> Assign Modified Proteins
  • Refine -> Transport Identification Parser
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Run Consistency Checker
  • Refine -> Update Overview

37
Creating Protein Complexes
38
Complex Subunits Stoichiometries
39
Manual Polishing
  • Refine -> Assign Probable Enzymes
  • Refine -> Re-run Name Matcher
  • Refine -> Create Protein Complexes
  • Refine -> Assign Modified Proteins
  • Refine -> Transport Identification Parser
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Run Consistency Checker
  • Refine -> Update Overview

40
Proteins as Reaction Substrates
41
Manual polishing
  • Refine -> Assign Probable Enzymes
  • Refine -> Re-run Name Matcher
  • Refine -> Create Protein Complexes
  • Refine -> Assign Modified Proteins
  • Refine -> Transport Identification Parser
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Run Consistency Checker
  • Refine -> Update Overview

42
What are pathway holes?
At least one reaction in the pathway has an
enzyme assigned. The reactions in the pathway
without enzymes assigned are holes.
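(Figure: example pathway from L-aspartate through iminoaspartate,
quinolinate, nicotinate nucleotide, and deamido-NAD to NAD, with
reaction labels 1.4.3.-, No EC, 2.7.7.18, 6.3.1.5, and 6.3.5.1. One
step is assigned to n.n. pyrophosphorylase (nadC, RV1596); reactions
without assigned enzymes are marked as holes.)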
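The definition of a pathway hole translates into a few lines of Python. This is a sketch with made-up reaction and enzyme identifiers; pathway_holes is a hypothetical helper, not a Pathway Tools function.

```python
from typing import Dict, List


def pathway_holes(reactions: List[str],
                  enzymes_by_reaction: Dict[str, List[str]]) -> List[str]:
    """Return reactions without enzymes, provided at least one reaction has one."""
    assigned = [r for r in reactions if enzymes_by_reaction.get(r)]
    if not assigned:
        # With no enzyme assigned anywhere, the definition of a hole does not apply.
        return []
    return [r for r in reactions if not enzymes_by_reaction.get(r)]


if __name__ == "__main__":
    # Hypothetical four-step pathway with enzymes on two steps.
    steps = ["rxn-1", "rxn-2", "rxn-3", "rxn-4"]
    enzymes = {"rxn-2": ["enzyme-a"], "rxn-4": ["enzyme-b", "enzyme-c"]}
    print(pathway_holes(steps, enzymes))  # ['rxn-1', 'rxn-3']
```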
43
Algorithm for identifying candidates and
consolidating data
  • Step I: Collect query isozymes of function A (in the figure,
    enzyme A from organisms 1 through 8)
  • Step II: BLAST against target genome
  • Steps III & IV: Consolidate hits and evaluate evidence using
    a Bayes classifier
  • Gene X: 3 queries have low-scoring hits to sequence X;
    resulting P(has-function) is low
  • Gene Y: 8 queries have high-scoring hits to sequence Y;
    resulting P(has-function) is high
  • Gene Z: 5 queries have low-scoring hits to sequence Z;
    resulting P(has-function) is low
44
Reference for the Pathway Hole Filler
  • Green, ML and Karp, PD. A Bayesian method for
    identifying missing enzymes in predicted
    metabolic pathway databases. BMC Bioinformatics
    2004, 5:76.

45
Features used to calculate the probability that a
protein has the desired function
  • Best E-value
  • Avg. rank
  • Avg aligned
  • Number of query sequences aligned
  • Potential operon? Candidate is in a contiguous set of genes
    transcribed in one direction with another gene in
    the pathway
  • Adjacent reactions? Candidate is adjacent to the gene
    assigned to an adjacent reaction in the pathway
46
Navigating to the Pathway Hole Filler
47
Steps that must be completed before running the
Pathway Hole Filler
  • Install BLAST executable (should already be
    installed on training room machines)
  • Prepare BLAST protein db
  • Need FASTA format genome nucleotide sequence (see
    instructor if you have something different, like
    ESTs, or have no sequence data file)
  • In general, the more pathways in your PGDB, the
    more the pathway hole filler will have to search
    for

48
Steps for operating the pathway hole filler
  • Step 1: Prepare training data for Bayes classifier
    • Collect feature data for known rxns in PGDB
    • Calculate probability distributions for classifier
  • Step 2: Identify and evaluate candidates
    • Collect feature data for each candidate
    • Use classifier to determine P(has-function)
  • Step 3: Choose holes to fill in KB
    • Either select all above a cutoff or manually
      review candidates

49
Step 1: Prepare Training Data
  • Calculate training data from your organism or use
    existing training data
  • Once Step 1 has been completed, the training data
    are saved and can be reused (even in another
    Pathway Tools session).
  • If using existing data from E. coli, the training
    data are based on data from the literature.

50
Step 2: Identify & Evaluate Candidates
51
Step 2: Identify & Evaluate Candidates
A list of all pathway holes in the PGDB
A list of all pathways in the PGDB with holes
52
Modes of operation
  • Fully automatic
  • No interaction required from user
  • All default values used
  • Prepare training data: all known rxns in KB
  • Identify and evaluate candidates: all pathways
    with pathway holes
  • Choose holes to fill in KB: all holes with P > 0.9
    filled
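The last default (fill every hole whose candidate scores above 0.9) amounts to a threshold filter. The Python sketch below assumes candidates arrive as (hole, gene, probability) triples and that only the best candidate per hole is considered; both are illustrative choices, and choose_fills is a hypothetical name.

```python
from typing import Dict, Iterable, List, Tuple

Candidate = Tuple[str, str, float]   # (hole reaction, candidate gene, P(has-function))


def choose_fills(candidates: Iterable[Candidate],
                 cutoff: float = 0.9) -> Tuple[Dict[str, Tuple[str, float]], List[str]]:
    """Split holes into those filled automatically and those left for review."""
    best: Dict[str, Tuple[str, float]] = {}
    for hole, gene, p in candidates:
        if hole not in best or p > best[hole][1]:
            best[hole] = (gene, p)          # keep the top-scoring candidate per hole
    auto_fill = {hole: gp for hole, gp in best.items() if gp[1] > cutoff}
    review = sorted(hole for hole in best if hole not in auto_fill)
    return auto_fill, review


if __name__ == "__main__":
    # Made-up candidates for two holes.
    auto, review = choose_fills([
        ("rxn-A", "gene1", 0.97),
        ("rxn-A", "gene2", 0.40),
        ("rxn-B", "gene3", 0.55),
    ])
    print(auto)    # {'rxn-A': ('gene1', 0.97)}
    print(review)  # ['rxn-B']
```

Holes that land in the review list correspond to the manual-review path offered in the other modes of operation.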

53
Modes of operation
  • Wizard
  • Wizard prompts user for training data source and
    for which holes to make predictions. Wizard runs
    Steps 1 & 2, then prompts user to complete Step 3.
  • Power-user mode
  • User must proceed through each step in order. Program
    still prompts user for required parameters, but each
    step must be completed before advancing to the next step.
54
Step 3: Choose Holes to Fill in KB
55
Step 3: Choose Holes to Fill in KB
59
Output from Pathway Hole Filler: Prepare
Training Data step
  • ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/data/
  • (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/data/)
  • rxn-list: data retrieved from ORGID for
    calculating training data
  • priors/: directory containing training data that
    is loaded when using existing data from ORGID
  • These files contain the training data computed in
    Step 1. If either file is available, the user may
    use existing training data in Step 1.
  • Each file is overwritten each time you run this
    step.

60
Output from Pathway Hole Filler: Identify
and Evaluate Candidates step
  • ROOT/aic-export/ecocyc/ORGIDcyc/VERSION/reports/
  • (e.g., ROOT/aic-export/ecocyc/caulocyc/1.0/reports/)
  • ORGIDholesX-Y.html (e.g., CAULOholes0-10.html)
  • ORGID_filled-holes.html: the list of holes that
    user selected to fill in the KB in Step 3.
  • blasterrors.log: log of each rxn describing
    whether or not any candidates were found
  • hole-data: file containing data (in a Lisp
    structure) found for each rxn, used to generate
    list in Choose holes to fill in KB dialogue. If
    this file is available, Step 3 can be initiated
    without repeating Step 2.
  • Each file is overwritten each time you run this
    step.

61
Manual polishing
  • Refine -> Assign Probable Enzymes
  • Refine -> Rescore Pathways
  • Refine -> Create Protein Complexes
  • Refine -> Assign Modified Proteins
  • Refine -> Transport Identification Parser
  • Refine -> Pathway Hole Filler
  • Refine -> Predict Transcription Units
  • Refine -> Run Consistency Checker
  • Refine -> Update Overview

62
Nomenclature
  • WO pair: pair of genes within an operon
  • TUB pair: pair of genes at a transcription unit
    boundary (delineates operons)

63
Operation of the operon predictor
  • For each contiguous gene pair, predict whether
    gene pairs are within the same operon or at a
    transcription unit boundary
  • Use pairwise predictions to identify potential
    operons
  • A-B: TUB pair
  • B-C: WO pair
  • C-D: WO pair
  • D-E: TUB pair
  • Resulting operon: BCD
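Turning pairwise predictions into operons is a single pass over the gene order. The Python sketch below reproduces the five-gene example from this slide; it assumes the pair labels are already computed and ignores strand and directon boundaries. The name assemble_operons is hypothetical.

```python
from typing import List


def assemble_operons(genes: List[str], pair_labels: List[str]) -> List[List[str]]:
    """Group contiguous genes; pair_labels[i] describes (genes[i], genes[i+1]).

    'WO' keeps the two genes in the same group, 'TUB' starts a new group.
    """
    if not genes:
        return []
    if len(pair_labels) != len(genes) - 1:
        raise ValueError("need exactly one label per contiguous gene pair")
    groups: List[List[str]] = []
    current = [genes[0]]
    for gene, label in zip(genes[1:], pair_labels):
        if label == "WO":
            current.append(gene)
        elif label == "TUB":
            groups.append(current)
            current = [gene]
        else:
            raise ValueError(f"unknown pair label {label!r}")
    groups.append(current)
    return groups


if __name__ == "__main__":
    units = assemble_operons(list("ABCDE"), ["TUB", "WO", "WO", "TUB"])
    print(units)                               # [['A'], ['B', 'C', 'D'], ['E']]
    print([u for u in units if len(u) > 1])    # [['B', 'C', 'D']]
```

Single-gene groups such as A and E are transcription units on their own; only the multi-gene group BCD is reported as a potential operon.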

64
Operon predictor
  • Predicts operon gene pairs based on
  • intergenic distance between genes
  • genes in the same functional class
  • Typically used for operon prediction
  • We use the method from Salgado et al., PNAS (2000) as
    a starting point.
  • Uses E. coli experimentally verified data as a
    training set.
  • Compute log likelihood of two genes being WO or
    TUB pair based on intergenic distance.
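A distance-based log likelihood can be sketched as a ratio of binned frequencies: how often a given intergenic distance is seen among known WO pairs versus known TUB pairs. The Python sketch below is a generic illustration of that idea, not the Pathway Tools implementation; the 10 bp bin width, the pseudocounts, and the training distances are assumptions made up for the example (they are not E. coli data).

```python
import math
from collections import Counter
from typing import Callable, Sequence


def train_distance_model(wo_distances: Sequence[int],
                         tub_distances: Sequence[int],
                         bin_width: int = 10,
                         pseudocount: float = 1.0) -> Callable[[int], float]:
    """Return a function giving log(P(d | WO) / P(d | TUB)) for a distance d."""

    def bin_of(distance: int) -> int:
        # Floor division also handles negative distances (overlapping genes).
        return distance // bin_width

    wo_counts = Counter(bin_of(d) for d in wo_distances)
    tub_counts = Counter(bin_of(d) for d in tub_distances)
    n_bins = len(set(wo_counts) | set(tub_counts)) + 1   # +1 for unseen bins
    wo_total = len(wo_distances) + pseudocount * n_bins
    tub_total = len(tub_distances) + pseudocount * n_bins

    def log_likelihood(distance: int) -> float:
        b = bin_of(distance)
        p_wo = (wo_counts[b] + pseudocount) / wo_total
        p_tub = (tub_counts[b] + pseudocount) / tub_total
        return math.log(p_wo / p_tub)

    return log_likelihood


if __name__ == "__main__":
    # Made-up training distances in base pairs.
    score = train_distance_model(
        wo_distances=[-4, 3, 8, 12, 15, 20],
        tub_distances=[95, 150, 180, 210, 220, 300],
    )
    for d in (5, 215):
        label = "WO" if score(d) > 0 else "TUB"
        print(d, round(score(d), 2), label)
```

A positive value favors a WO pair and a negative value favors a TUB pair; the other features (shared functional class, pathway, or complex) would add further evidence on top of this score.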

65
Operon predictor
  • Additional features easily computed from a PGDB
  • (1) Both gene products are enzymes in the same metabolic
    pathway
  • (2) Both gene products are monomers in the same protein
    complex
  • (3) One gene product transports a substrate for a
    metabolic pathway in which the other gene product
    is involved as an enzyme
  • (4) A gene upstream or downstream from the gene pair
    (and within the same directon) is related to
    either one of the genes in the pair as per
    features 1, 2, and 3 above.