Challenges in Creating and Curating Plant PGDBs: Lessons Learned from AraCyc and PoplarCyc

1
Challenges in Creating and Curating Plant PGDBs
Lessons Learned from AraCyc and PoplarCyc
  • Peifen Zhang
  • Carnegie Institution For Science
  • Department of Plant Biology
  • Stanford, CA

2
Who We Are
PMN
  • Sue Rhee (PI)
  • Kate Dreher (curator)
  • A. Karthik (curator, previous)
  • Lee Chae (postdoc)
  • Anjo Chi (programmer)
  • Cynthia Lee (TAIR tech team)
  • Larry Ploetz (TAIR tech team)
  • Shanker Singh (TAIR tech team)
  • Bob Muller (TAIR tech team)
Key Collaborators
  • Peter Karp (MetaCyc, SRI)
  • Ron Caspi (MetaCyc, SRI)
  • Lukas Mueller (SGN)
  • Anuradha Pujar (SGN)
http://plantcyc.org
3
Introduction
  • Background and rationale
  • Plants (food, feed, forest, medicine, biofuel)
  • An ocean of sequences
  • More than 60 species in genome sequencing
    projects, hundreds in EST projects
  • Putting individual genes onto a network of
    metabolic reactions and pathways
  • Annotating, visualizing and analyzing at system
    level
  • AraCyc (Arabidopsis thaliana, TAIR/PMN)
  • predicted by using the Pathway Tools software,
    followed by manual curation

4
Introduction (cont)
  • Other plant pathway databases predicted by using
    the Pathway Tools
  • RiceCyc (Oryza sativa, Gramene)
  • MedicCyc (Medicago truncatula, Noble Foundation)
  • LycoCyc (Solanum lycopersicum, SGN)

5
Limitations
  • Creating pathway databases includes three major
    components, and is resource-intensive
  • Sequence annotation
  • Reference pathway database
  • Pathway prediction, validation, refinement
  • Heterogeneous sequence annotation protocols and
    varying levels of pathway validation impact
    quality and hinder meaningful cross-species
    comparison
  • Using a non-plant reference database causes many
    false-positive and false-negative pathway
    predictions

6
Introducing the PMN
  • Scope
  • A platform for plant metabolic pathway database
    creation
  • A community for data curation
  • Curators, editorial board, allies in other
    databases, researchers
  • Major goals
  • Create a plant-specific reference pathway
    database (PlantCyc)
  • Create an enzyme sequence annotation pipeline
  • Enhance pathway prediction by using PlantCyc,
    and including an automated initial validation
    step
  • Create metabolic pathway databases for plant
    species
  • e.g. PoplarCyc (Populus trichocarpa), SoyCyc
    (soybean)

7
PlantCyc Creation
  • Nature of PlantCyc
  • Multiple-species, plants-only
  • Curator-reviewed pathways (predicted,
    hypothetical, empirical)
  • primary and secondary metabolism
  • Major sources
  • All AraCyc pathways and enzymes
  • Plant pathways and enzymes from MetaCyc
  • Additional pathways and enzymes manually curated
    and added
  • Enzymes from RiceCyc, LycoCyc and MedicCyc

8
PMN Database Content Statistics
            PlantCyc 4.0   AraCyc 7.0   PoplarCyc 2.0
Pathways    685            369          288
Enzymes     11058          5506         3420
Reactions   2929           2418         1707
Compounds   2966           2719         1397
Organisms   343            1            1
  • Valuable plant natural products; many are
    specialized metabolites limited to a few
    species or genera
  • Medicinal, e.g. artemisinin and quinine
    (treatment of malaria),
  • codeine and morphine (painkillers),
  • ginsenosides (cardioprotectants),
  • lupeol (anti-inflammatory),
  • taxol and vinblastine (anti-cancer)
  • Industrial materials, e.g. resin and rubber
  • Food flavors and scents, e.g. capsaicin and
    piperine (chili and pepper flavor), geranyl
    acetate (aroma of rose) and menthol (mint)

9
(No Transcript)
10
Enzyme Sequence Annotation (version 1.0)
  • Reference sequences: enzymes with known functions
  • 14,187 enzyme sequences compiled from
    GOA-UniProt, BRENDA, MetaCyc, and TAIR
  • 3805 functional identifiers (full EC number,
    MetaCyc reaction id, GO id)
  • Annotation methods
  • BLASTP
  • Cut-off
  • unique e-value threshold for each functional
    identifier
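
The cut-off step above can be sketched as follows. This is a minimal illustration, assuming BLASTP hits have already been parsed into (query, functional identifier, e-value) tuples; the identifiers FUNC_A and FUNC_B, the threshold values, and the function name are hypothetical and are not taken from the PMN pipeline.

```python
# Hypothetical per-identifier e-value cut-offs. The actual PMN values
# are not given in the slides.
EVALUE_CUTOFFS = {
    "FUNC_A": 1e-40,
    "FUNC_B": 1e-25,
}


def annotate(hits, cutoffs):
    """Keep a hit only if it passes the cut-off of its own identifier.

    hits: iterable of (query_id, functional_identifier, evalue) tuples.
    Returns {query_id: [functional_identifier, ...]}.
    """
    annotations = {}
    for query, identifier, evalue in hits:
        cutoff = cutoffs.get(identifier)
        if cutoff is None:
            # No threshold known for this identifier: do not annotate.
            continue
        if evalue <= cutoff:
            ids = annotations.setdefault(query, [])
            if identifier not in ids:
                ids.append(identifier)
    return annotations


hits = [
    ("gene_1", "FUNC_A", 1e-60),  # below the FUNC_A cut-off: kept
    ("gene_1", "FUNC_B", 1e-10),  # above the FUNC_B cut-off: dropped
    ("gene_2", "FUNC_B", 1e-30),  # below the FUNC_B cut-off: kept
]
result = annotate(hits, EVALUE_CUTOFFS)
print(result)
```

Using one threshold per identifier, rather than a single global e-value, lets well-conserved enzyme families be held to a stricter standard than divergent ones.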

11
Category                      Number of enzrxn   TAIR annotation accuracy, %   PMN annotation accuracy, %
Genes common to both: 3900
  enzrxn common to both       2493               n/a                           n/a
  TAIR-only enzrxn (EXP)      567                80 (12/15)                    n/a
  TAIR-only enzrxn (IEA)      171                48 (11/23)                    n/a
  TAIR-only enzrxn (ISS)      671                45 (10/22)                    n/a
  PMN-only enzrxn (IEA)       3421               n/a                           69 (11/16)
Genes unique to TAIR: 2225
  EXP                         397                77 (10/13)                    n/a
  IEA                         420                12 (2/17)                     n/a
  ISS                         378                45 (5/11)                     n/a
Genes unique to PMN: 1681
  enzrxn                      1503               n/a                           35 (12/34)

Accurate: the annotation came from a top hit
that has good homology to a known enzyme
12
Conclusion
  • Increased performance with potentially true
    enzymes
  • Over-prediction for non-enzyme proteins

13
Enzyme Sequence Annotation (version 2.0, in progress)
  • Reference sequences: proteins with known
    functions (ERL)
  • SwissProt
  • 117,000 proteins, 26,000 enzymes, 2,400 full EC
    numbers
  • Additional enzymes from BRENDA, MetaCyc, and TAIR
  • Functional identifiers: full EC number, MetaCyc
    reaction ID, GO ID
  • Annotation methods
  • BLASTP
  • PRIAM (enzyme-specific, motif-based)
  • CatFam (enzyme-specific, motif-based)
  • Function calling
  • Ensemble and voting
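
One way to picture the "ensemble and voting" step is a simple majority vote across the three methods. The slides do not describe the actual voting scheme, so the rule below (accept an identifier predicted by at least two methods) is an assumption, as are the function name and the identifiers FUNC_A, FUNC_B and FUNC_C.

```python
from collections import Counter


def call_function(predictions, min_votes=2):
    """Return the identifiers supported by at least `min_votes` methods.

    predictions: {method_name: set of functional identifiers predicted
    for one protein by that method}.
    """
    votes = Counter()
    for identifiers in predictions.values():
        # Each method contributes at most one vote per identifier.
        votes.update(set(identifiers))
    return sorted(i for i, n in votes.items() if n >= min_votes)


# Hypothetical predictions for a single protein.
predictions = {
    "BLASTP": {"FUNC_A", "FUNC_B"},
    "PRIAM": {"FUNC_A"},
    "CatFam": {"FUNC_A", "FUNC_C"},
}
called = call_function(predictions)
print(called)
```

Raising `min_votes` to 3 would require unanimous agreement, which trades recall for precision.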

14
Enzyme Sequence Annotation (version 2.0, in progress)
Lee Chae (unpublished)
15
Application to the Poplar Genome
  • Sequence annotation version 1.0
  • Pathway Tools version 12.5
  • PGDB creation using PlantCyc vs MetaCyc

16
Comparison of the PoplarCyc Initial Builds with Either PlantCyc or MetaCyc as the Reference Database

Reference database used                              PlantCyc (2.0)   MetaCyc (12.5)
Total pathways in the reference database             646              1395
Total predicted pathways                             285              346
False-positive predictions (rate, FP/(FP+TN), %)     25 (7.5)         92 (8.5)
Database-specific false-positive predictions         2                69
False-negative predictions (rate, FN/(TP+FN), %)     51 (16.4)        56 (18.1)
Database-specific false-negative predictions         9                13
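
The rates in the comparison above can be recomputed from the raw counts. The sketch below assumes that true positives are the predicted pathways minus the false positives, and that true negatives are the reference pathways that were not predicted minus the false negatives. The slides do not spell this out, but these assumptions reproduce the reported percentages; the function name is illustrative.

```python
def rates(reference_total, predicted, false_pos, false_neg):
    """Return (false-positive rate, false-negative rate) in percent."""
    true_pos = predicted - false_pos
    not_predicted = reference_total - predicted
    true_neg = not_predicted - false_neg
    fp_rate = 100 * false_pos / (false_pos + true_neg)  # FP / (FP + TN)
    fn_rate = 100 * false_neg / (true_pos + false_neg)  # FN / (TP + FN)
    return round(fp_rate, 1), round(fn_rate, 1)


plantcyc = rates(reference_total=646, predicted=285, false_pos=25, false_neg=51)
metacyc = rates(reference_total=1395, predicted=346, false_pos=92, false_neg=56)
print("PlantCyc:", plantcyc)  # (7.5, 16.4)
print("MetaCyc:", metacyc)    # (8.5, 18.1)
```

The false-positive rates look similar (7.5 vs. 8.5) because MetaCyc contains many more pathways that were correctly left unpredicted; the absolute counts (25 vs. 92) show the difference more clearly.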
17
Conclusion
  • The absolute number of false-positive pathways
    was reduced substantially (25 vs. 92) by using
    PlantCyc as the reference
  • The number of false-negative pathways was
    comparable (51 vs. 56) using either PlantCyc or
    MetaCyc as the reference, indicating the
    usefulness of both databases as references

18
Automated Initial Pathway Validation
  • Remove non-plant pathways, identified from manual
    validation of AraCyc and PoplarCyc
  • A list of 132 MetaCyc pathways (an up-to-date
    file is posted online)
  • Add universal plant pathways
  • A list of 115 pathways (an up-to-date file is
    posted online)
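
The two validation rules above amount to a set difference followed by a set union. A minimal sketch, assuming pathways are represented by their identifiers; the identifiers shown are hypothetical placeholders, and the real lists are the files posted online.

```python
def validate(predicted, non_plant, universal):
    """Apply the automated initial validation to a predicted pathway set.

    predicted: pathway IDs predicted for the species
    non_plant: pathway IDs known not to occur in plants (removed)
    universal: pathway IDs expected in all plants (added)
    """
    return (set(predicted) - set(non_plant)) | set(universal)


# Hypothetical identifiers for illustration only.
predicted = {"PWY-A", "PWY-B", "PWY-C"}
non_plant = {"PWY-B", "PWY-X"}
universal = {"PWY-C", "PWY-D"}

validated = validate(predicted, non_plant, universal)
print(sorted(validated))
```

Removal is applied before addition here, so a pathway that appeared on both lists would be kept; whether the PMN pipeline resolves such a conflict the same way is not stated in the slides.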

19
A Recap of the PMN Workflow
  • Enzyme sequence annotation
  • Pathway prediction (PlantCyc)
  • Pathway prediction (MetaCyc)
  • Automated pathway validation
  • Manual validation
20
An Example of Practical Issues
21
Updating AraCyc with TAIR Functional Annotations
  • Source and quality
  • Literature-based GO annotations
  • Catalytic activities
  • Experimental evidence (IDA, IMP, IGI, IPI, IEP)
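
Selecting the annotations that qualify under these criteria can be sketched as a filter on GO evidence codes. The record layout and function name below are assumptions for illustration; the first record uses the AT5G13700 example from this talk, and the second is a hypothetical computationally inferred annotation. The output is a candidate list for curator review, not a set of annotations to load directly.

```python
# Experimental GO evidence codes listed on the slide.
EXPERIMENTAL_CODES = {"IDA", "IMP", "IGI", "IPI", "IEP"}


def experimental_candidates(annotations):
    """Return the annotations backed by experimental evidence."""
    return [a for a in annotations if a["evidence"] in EXPERIMENTAL_CODES]


annotations = [
    {"gene": "AT5G13700", "term": "polyamine oxidase",
     "evidence": "IDA", "pubmed": "16778015"},
    # Hypothetical record: electronically inferred, so it is filtered out.
    {"gene": "AT0G00000", "term": "hypothetical activity",
     "evidence": "IEA", "pubmed": None},
]
candidates = experimental_candidates(annotations)
for a in candidates:
    print(a["gene"], a["term"], a["evidence"], a["pubmed"])
```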

22
Problem
  • TAIR: AT5G13700 (polyamine oxidase, IDA, PubMed
    16778015)
  • Several polyamine oxidase reactions exist in
    MetaCyc/PlantCyc
  • Which of the reactions catalyzed by AT5G13700
    was supported in the paper?

23
Conclusion
  • Do not automatically propagate GO experimental
    (GO-exp) annotations to enzrxns
  • Enter them manually, along with the appropriate
    evidence

24
Future Work
  • Enhance pathway prediction and validation
  • Using additional evidence, such as presence of
    compounds, weighted confidence of enzyme
    annotations
  • Refine pathways, hole-filling
  • Include information not based on sequence
    homology in enzyme function prediction, such as
    phylogenetic profiles, co-expression, protein
    structure
  • Create new pathway databases
  • moss (P. patens), Selaginella, maize, cassava,
    wine grape
  • Add new data types, critical for strategic
    planning of metabolic engineering
  • Rate-limiting step
  • Transcriptional regulator

25
Thank you for your attention!