Use Cases for a Proteomics Data Repository

Slides: 19
Provided by: capco

Transcript and Presenter's Notes

1
Use Cases for a Proteomics Data Repository
Our Experiences with PRIDE: The PRoteomics
IDEntifications database
PRIDE - A Data Repository and Data Transfer
Format for Protein and Peptide Identifications and
Supporting Evidence
Phil Jones, EBI, Hinxton, Cambridgeshire, UK.
pjones_at_ebi.ac.uk
2
Requirements Overview: What needs to be
considered?
  • Nature of likely queries: how will the
    repository be interrogated?
  • What is the nature of the response that the user
    querying the repository will require?
  • Which kinds of proteomics data should be
    included in the repository?
  • How will submission of data to the repository be
    promoted / encouraged?
  • How will the repository meet common standards
    for the exchange of data and what are the
    advantages of doing so?
  • What level of detail should the repository
    include? This raises major data storage and
    efficiency concerns.

3
What kinds of Questions will be Asked? - Search
Criteria
  • Likely queries may be a combination of any of
    the following:
      • Literature reference (by author / title /
        keywords etc.)
      • Protein ID
      • Protein family / domain / other classification
      • Peptide sequence
      • Species
      • Developmental stage / age
      • Tissue / organ / cell type
      • Sub-cellular component
      • Disease / pathological state
      • Genotype / phenotype
      • Environmental conditions (of organism under
        analysis)
      • Sample processing method
      • Instrument type / parameters
      • Search engine / parameters

4
What kinds of Questions will be Asked? - Search
Space
  • Require common controlled vocabularies /
    ontologies to define search space:
      • Species: NCBI Taxonomy ID, ITIS
      • Tissue / organ / cell type: MeSH, Plant
        Ontology, cell.obo
      • Sub-cellular component: GO
      • Disease: MeSH
      • Genotype: GO
      • Phenotype: MGI's Mammalian Phenotype Ontology
      • Sample processing: PSI Ontology
      • MS instrument: PSI Ontology
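
The idea of combining search criteria over a search space defined by controlled vocabularies can be sketched as follows. This is only an illustrative sketch, not PRIDE's implementation: the names CvTerm, Experiment, matches and search are hypothetical, and the tissue accessions are placeholders. The two NCBI Taxonomy IDs (9606 for human, 10090 for mouse) are real.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CvTerm:
    """A controlled-vocabulary term: the vocabulary name plus an accession."""
    source: str
    accession: str


@dataclass
class Experiment:
    """A repository entry annotated with one CV term per search field."""
    title: str
    annotations: dict = field(default_factory=dict)  # field name -> CvTerm


def matches(experiment, criteria):
    """True only if every criterion (field name -> CvTerm) is met exactly."""
    return all(
        experiment.annotations.get(name) == term
        for name, term in criteria.items()
    )


def search(experiments, **criteria):
    """Return the experiments satisfying all criteria (a logical AND)."""
    return [e for e in experiments if matches(e, criteria)]


HUMAN = CvTerm("NCBI Taxonomy", "9606")
MOUSE = CvTerm("NCBI Taxonomy", "10090")
# Placeholder accessions, not real MeSH IDs.
LIVER = CvTerm("MeSH", "EXAMPLE-LIVER")
PLASMA = CvTerm("MeSH", "EXAMPLE-PLASMA")

experiments = [
    Experiment("Human liver", {"species": HUMAN, "tissue": LIVER}),
    Experiment("Human plasma", {"species": HUMAN, "tissue": PLASMA}),
    Experiment("Mouse liver", {"species": MOUSE, "tissue": LIVER}),
]

human_liver = search(experiments, species=HUMAN, tissue=LIVER)
```

Because each annotation is a (vocabulary, accession) pair rather than free text, two submitters who both annotate with the same term are guaranteed to be found by the same query.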

5
What kinds of response will the typical user
expect?
  • Need to define what will be returned to the user
    querying the database and the format of such a
    return.
  • Data format:
      • Machine readable data formats (e.g. XML)
      • Human readable data formats
      • Graphical display, e.g. visualisation of
        spectra, gel images
      • Display of statistics or compact summary of
        data
  • Data content:
      • Details of tissue, sample prep, other
        experimental parameters
      • Predicted protein identifications and
        appropriate scores
      • Predicted peptide identifications and
        appropriate scores
      • Predicted post-translational modifications
      • Links to references in the literature

6
Controlling Data Volume: How detailed do you
want to go?
  • Raw MS data would quickly swell to terabytes
    (TB).
  • Peak lists will certainly involve gigabytes (GB)
    of data initially. They can be expected to swell
    to TB, but perhaps at a more controllable and
    sustainable rate.
  • Massive data sets create problems for both
    storage and efficiency of data retrieval.
  • Could raw data optionally be stored by the
    submitter, e.g. on an FTP server, and linked to
    from the repository?

7
Data formats for submission and inter-repository
exchange
  • As well as allowing submission of data, the
    flexibility to exchange data with external
    proteomics repositories would also be desirable.
  • A successful model of collaborative effort to
    achieve this is the PSI initiative for the
    exchange of protein interaction data using the
    PSI MI XML format, with major protein-protein
    interaction databases being involved:
      • BIND
      • DIP
      • Hybrigenics
      • IntAct
      • MINT
      • MIPS interaction tables
  • The ability to exchange MI data is now being
    extended by the IMEx consortium with the
    following aims:
      • creation of a consistent body of public data
      • avoidance of redundant curation

8
Data formats for submission and inter-repository
exchange
  • The PSI General Proteomics Standards Workgroup
    is developing:
      • MIAPE: Minimum Information about a Proteomics
        Experiment
      • PSI Object Model
      • The PSI / GPS Ontology (working name PSI-ont),
        based upon the MGED ontology
      • Data exchange formats:
          • mzData (MS instrument output / peak lists)
          • mzIdent (peptide and protein
            identifications)

9
How has PRIDE tackled these problems?
PRIDE is a multi-faceted project offering:
  • An XML schema for transfer of proteomics protein
    identification data.
  • A relational database implementation for the data
    repository using OJB, allowing the use of most
    currently available RDBMSs. A central data
    repository is being set up at the EBI. However,
    the intention is to implement a network of
    federated databases across the community (not
    necessarily all running PRIDE) that can exchange
    data.
  • Secure upload of proteomics data in the PRIDE XML
    schema format. (Future developments: upload and
    download using the mzData XML schema and the
    mzIdent XML schema.)
  • The ability to search the repository and download
    results in PRIDE XML or HTML format. (Future
    developments: download in alternative XML
    formats.)
  • Following release, PRIDE will become open source
    and freely available.

10
The PRIDE Data Model
11
PRIDE: Security and Data Availability
Security measures for data privacy, traceability
and group access
  • Before data can be uploaded, the user needs to
    register. All personal data is encrypted in the
    database.
  • All experimental data is linked to the person who
    uploaded it.
  • Data can be marked as private. The person
    uploading the data may give a date upon which the
    data becomes public.
  • Group access to private data can be granted by
    creating a 'collaboration'. Other users can then
    apply to join the collaboration. Each application
    is approved by the person who created the
    collaboration. The collaboration concept will
    allow geographically separated laboratories to
    share data via PRIDE before it is publicly
    available.
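
The access rules described on this slide can be sketched as a small visibility check. This is a hedged illustration only: Collaboration, Dataset and can_view are hypothetical names, and the exact rules (for example, that the release date itself counts as public) are assumptions rather than a description of PRIDE's code.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class Collaboration:
    """A group whose members may see each other's private data."""
    owner: str
    members: set = field(default_factory=set)
    pending: set = field(default_factory=set)

    def apply(self, user):
        """A user asks to join; nothing is granted yet."""
        self.pending.add(user)

    def approve(self, approver, user):
        """Only the creator of the collaboration may approve applicants."""
        if approver != self.owner:
            raise PermissionError("only the collaboration owner can approve")
        if user not in self.pending:
            raise ValueError("user has not applied to join")
        self.pending.discard(user)
        self.members.add(user)


@dataclass
class Dataset:
    """Experimental data, always linked to the person who uploaded it."""
    submitter: str
    private: bool = False
    public_from: Optional[date] = None  # optional release date
    collaboration: Optional[Collaboration] = None


def can_view(dataset, user, today):
    """Decide whether `user` may see `dataset` on the day `today`."""
    if not dataset.private:
        return True
    # Assumption: data becomes public on the release date itself.
    if dataset.public_from is not None and today >= dataset.public_from:
        return True
    if user == dataset.submitter:
        return True
    collab = dataset.collaboration
    return collab is not None and (
        user == collab.owner or user in collab.members
    )
```

For example, a user who has applied to a collaboration but has not yet been approved still cannot see its private data; access starts only once the owner calls approve.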

12
PRIDE: Data Curation and Support
  • Consistent use of protein identifiers:
    encouraging (but not mandating) use of IPI.
  • Removing data uploaded in error.
  • Ensuring consistent use of ontologies /
    controlled vocabularies for annotation of data:
      • Species (NCBI Taxonomy IDs)
      • Tissue (MeSH IDs)
      • Sub-cellular location (GO)
      • MS related ontologies (currently being
        developed as part of PSI)
  • Providing a point of contact between
    investigators (personal information is not given
    out without permission).

13
PRIDE: Post-Translational Modifications
  • Provides capacity to store details of
    post-translational modifications (PTMs).
  • Will use RESID PTM IDs for naturally occurring
    PTMs and the UNIMOD database for PTMs arising
    as artefacts of MS.
  • (Negotiations have taken place to allow the
    RESID curators to annotate UNIMOD, possibly
    allowing a single comprehensive PTM database ID
    to be used.)
  • Data that can be stored includes:
      • a link to the peptide that the PTM was found in
      • reference database name and PTM ID
      • mono-isotopic mass delta value
      • mean mass delta value
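
The four stored items listed above map naturally onto a small record type. The sketch below is an assumption-laden illustration, not the PRIDE data model: Modification and total_mass_delta are hypothetical names, and the example uses oxidation (UNIMOD accession 35, with mass deltas of roughly 15.994915 mono-isotopic and 15.9994 average) purely as sample data.

```python
from dataclasses import dataclass

# Assumption: only the two reference databases named on the slide are allowed.
ALLOWED_DATABASES = {"RESID", "UNIMOD"}


@dataclass(frozen=True)
class Modification:
    """One PTM observed on one peptide."""
    peptide_id: int          # link to the peptide the PTM was found in
    database: str            # reference database name
    accession: str           # PTM ID within that database
    mono_mass_delta: float   # mono-isotopic mass delta, in Da
    avg_mass_delta: float    # mean (average) mass delta, in Da

    def __post_init__(self):
        if self.database not in ALLOWED_DATABASES:
            raise ValueError(f"unknown PTM database: {self.database}")


def total_mass_delta(modifications, monoisotopic=True):
    """Sum the mass shifts of several modifications on the same peptide."""
    if monoisotopic:
        return sum(m.mono_mass_delta for m in modifications)
    return sum(m.avg_mass_delta for m in modifications)


# Sample data: two oxidations found on the (hypothetical) peptide 42.
oxidation = Modification(
    peptide_id=42,
    database="UNIMOD",
    accession="35",
    mono_mass_delta=15.994915,
    avg_mass_delta=15.9994,
)
doubly_oxidised = [oxidation, oxidation]
```

Storing both delta values lets a client pick whichever matches the mass type used by the search engine, instead of recomputing it from the modification's composition.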

14
PRIDE Web Application: Data Upload
15
PRIDE Web Application: Data Search
16
PRIDE Web Application: Search Results (HTML)
17
PRIDE: Future Directions
  • mzData compatibility for both import and export:
    early 2005.
  • mzIdent compatibility for import and export:
    shortly after the first release version of
    mzIdent becomes available.
  • Improved search facilities, including boolean
    search on multiple fields.
  • Development of a PRIDE curation tool.
  • A peak list re-analysis pipeline to provide
    up-to-date protein identifications using the
    latest version of IPI for all data sets.
  • Negotiations are under way to include protein
    identifications from the HBPP, HPPP and HLPP
    projects in PRIDE.

18
PRIDE: Acknowledgements
  • Lennart Martens
  • Samuel Kerrien
  • Antony Quinn
  • Mark Rijnbeek
  • Kai Runte
  • Chris Taylor
  • Henning Hermjakob
  • Weimin Zhu
  • Rolf Apweiler