Title: RProteomics Interoperability Review
1RProteomics Interoperability Review
- Patrick McConnell (1), Salvatore Mungal (1), Richard Haney (1), Mark Peedin (1)
- (1) Duke Comprehensive Cancer Center
2RProteomics Team
- Duke, ICR Developer
- Patrick McConnell, Project lead
- Richard Haney, Architect and developer of statistical systems
- Salvatore Mungal, Middle-tier Java developer
- Mark Peedin, Database developer
- Northwestern University, Collaborator
- Simon Lin, Proteomics domain expert
- Oregon Health Sciences University, ICR Adopter
- Shannon McWeeney
- Veena Rajaraman
- University of Pennsylvania, ICR Adopter
- David Fenstermacher
- Craig Street
- University of North Carolina, Collaborator
- Christoph Borchers
3RProteomics Overview
- Statistical routines to analyze proteomics data
- MS and LC-MS data
- Integrative Cancer Research Workspace
- Proteomics Special Interest Group
- Adopters
- University of Pennsylvania
- Oregon Health Sciences University
- Architecture Workspace
- Reference implementation for analysis services
- Project ends October 1, 2005
4RProteomics Focus
- We are NOT concerned with
- LIMS and database management
- Identification of proteins
- Database searching and pattern matching
- Statistical Modeling of Spectra (SMOS)
- Data standards
- Generic statistical data and spectral data
- Analysis
- Spectra processing and analysis audit trail
5RProteomics Details
- Statistical routines to analyze MS and LC-MS data
- Background removal, denoising, alignment/calibration, normalization, peak finding, isotope deconvolution, peptide quantitation, high-level modeling
- Open Statistical Services (OSS)
- Bridging Java and R with web services and XML
- Proteomics database
- Data model (XML Schema), object model (Java), XML database (Mako)
- Grid services
- Data access, data transformation, and analysis
- Graphical user interface
- Load and query data, run analytics, view plots
- Future plans
- Multilevel and hierarchical statistics
- Dynamic GUI
- Peptide/protein identification services
- Integrate with PIR, Q5, caMassClass, proteomics
repository
6Statistics Denoising
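No text survives for this slide beyond its title. As a rough illustration of what a denoising step does to a spectrum, the sketch below assumes a centered moving-average filter. The filter choice and the function name `moving_average` are assumptions made for brevity, not the routine RProteomics implements.

```python
def moving_average(intensities, window=3):
    """Smooth a list of intensities with a centered moving average.

    Assumption: a moving average stands in for the real denoising routine.
    Near the edges the window is truncated to the points that exist.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    smoothed = []
    for i in range(len(intensities)):
        lo = max(0, i - half)
        hi = min(len(intensities), i + half + 1)
        segment = intensities[lo:hi]
        smoothed.append(sum(segment) / len(segment))
    return smoothed


# Illustrative data only: one peak at index 4 with alternating noise around it.
noisy = [0.0, 3.0, 0.0, 3.0, 9.0, 3.0, 0.0, 3.0, 0.0]
smoothed = moving_average(noisy, window=3)
```

The peak stays at the same position while the alternating noise around it is flattened, which is the property a later peak-finding step relies on.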
7Modeling Process
- XMI: class, attribute, association, definition, association multiplicity
- XML Schema: complex types, elements with in-lined complex types, elements of complex types, elements of simple types, attributes, annotations, minOccurs/maxOccurs
- Mapping Rules
- Remove Type from end
- Map schema types
- Add id attribute
- Move single-attribute classes to an attribute
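The first three mapping rules can be sketched as a small function. This is only an illustration under stated assumptions: it assumes the mapping runs from XML Schema complex types to class descriptions, the type table `SCHEMA_TO_JAVA` is hypothetical (the slides do not list the actual type mapping), and the type of the added `id` attribute is a guess. The fourth rule (moving single-attribute classes to an attribute) is left out.

```python
# Hypothetical lookup table; the real schema-type mapping is not given in the slides.
SCHEMA_TO_JAVA = {
    "xs:string": "String",
    "xs:boolean": "boolean",
    "xs:int": "int",
    "xs:long": "long",
    "xs:float": "float",
    "xs:double": "double",
}


def map_complex_type(name, elements):
    """Describe a class derived from one XML Schema complex type.

    name:     complex type name, e.g. "ScanType"
    elements: list of (element_name, schema_type) pairs
    """
    # Rule: remove "Type" from the end of the name.
    if name.endswith("Type") and len(name) > len("Type"):
        class_name = name[: -len("Type")]
    else:
        class_name = name

    # Rule: add an id attribute (its type is an assumption).
    attributes = [("id", "String")]

    # Rule: map schema types; unknown types pass through unchanged.
    for element_name, schema_type in elements:
        attributes.append((element_name, SCHEMA_TO_JAVA.get(schema_type, schema_type)))

    return {"class": class_name, "attributes": attributes}


example = map_complex_type(
    "ScanType", [("intensity", "xs:double"), ("label", "xs:string")]
)
```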
8Data Model Overview
- mzXML
- De facto standard for encoding raw proteomics data
- ScanFeatures
- Generically encodes proteomics data and analytical results
- AML-routine
- Describes analysis routines in great detail
- Metadata to help researchers understand grid services
- AML-run
- Keeps track of analysis routines (provenance)
- Hooks input data and output data together
- StatML
- Generic encoding for statistical data (lists, arrays, and scalars)
- Service parameters
- Parameters to the operations of the grid service
9mzXML
- Encodes raw spectra data (mz-intensity pairs)
- De facto standard by Sashimi
- Instrumentation
- Data processing
- Separation technique
- Spot description
- m/z scan values
- MALDI acquisition
- Data integrity
- Other candidates: mzData and mqData
http://sashimi.sourceforge.net/software_glossolalia.html
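To make the m/z-intensity pairs concrete, here is a minimal sketch of decoding and encoding a peak list. It assumes the common mzXML layout: Base64 text wrapping big-endian (network order) floats, 32-bit by default, interleaved as m/z, intensity, m/z, intensity. The function names are hypothetical, and a real reader should take the precision from the file rather than from a default argument.

```python
import base64
import struct


def _format(precision, count):
    """Build a struct format for `count` big-endian floats of the given precision."""
    if precision == 32:
        return ">%df" % count
    if precision == 64:
        return ">%dd" % count
    raise ValueError("precision must be 32 or 64")


def decode_peaks(encoded, precision=32):
    """Decode Base64 peak text into a list of (m/z, intensity) pairs."""
    raw = base64.b64decode(encoded)
    width = precision // 8
    count = len(raw) // width
    values = struct.unpack(_format(precision, count), raw[: count * width])
    # Values alternate m/z, intensity; pair them up.
    return list(zip(values[0::2], values[1::2]))


def encode_peaks(pairs, precision=32):
    """Encode (m/z, intensity) pairs as Base64 text (inverse of decode_peaks)."""
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack(_format(precision, len(flat)), *flat)
    return base64.b64encode(raw).decode("ascii")


# Illustrative values chosen to be exactly representable as 32-bit floats.
pairs = [(100.5, 2000.0), (101.25, 1500.0)]
round_trip = decode_peaks(encode_peaks(pairs))
```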
10ScanFeatures
- Support for statistical data
- Metadata
- project, patient, fraction, replicate, date, outcome, scanNumber, scanStartPos, scanStepSize
- Features
- Name
- m/z, intensity, peakWidth, peakHeight, etc.
- Controlled vocabulary
- Value
- Scalar or array
- Support for a hierarchy of features
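The structure described above (metadata plus named features whose values are scalars or arrays, arranged in a hierarchy) could be modeled roughly as follows. The class and field names are assumptions for illustration; the actual ScanFeatures schema is not shown in the slides, and the feature name `peak` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Feature:
    """A named feature holding a scalar or an array, with optional sub-features."""
    name: str
    value: Union[float, List[float], None] = None
    children: List["Feature"] = field(default_factory=list)


@dataclass
class ScanFeatures:
    """Metadata (project, patient, replicate, ...) plus top-level features."""
    metadata: Dict[str, str]
    features: List[Feature] = field(default_factory=list)


def feature_paths(feature, prefix=""):
    """List slash-separated paths for a feature and everything nested below it."""
    path = prefix + feature.name
    paths = [path]
    for child in feature.children:
        paths.extend(feature_paths(child, path + "/"))
    return paths


# Illustrative record: one parent feature with two nested scalar features.
peak = Feature(
    "peak",
    children=[Feature("peakWidth", 0.5), Feature("peakHeight", 120.0)],
)
scan = ScanFeatures(
    metadata={"project": "demo", "replicate": "1"},
    features=[Feature("m/z", [1000.2, 1500.7]), peak],
)
paths = feature_paths(peak)
```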
11AML-routine
- Authors
- Writers of the routine and contact information
- Routine name
- One-word description
- Title
- Formal one-sentence description
- Aliases
- Alternate names
- Ontological description
- Controlled vocabulary (CDEs)
- Textual description
- Human-readable description (1-3 sentences)
- External references
- Journals, websites
- Pseudo-code
- Approximation of actual code
- Source code
- Actual source code
- Routine Signature
- Input/output parameters
- Contract
- Pre and post conditions
- Usage
- Textual description and examples of how the routine is to be used
- Implementation
- OS/hardware/compiler on which the routine is implemented
- Caveats
- Any user comments not previously covered
- Benchmarks
- Theoretical performance and links to performance
runs
12AML-run
(Diagram: an AML-run with an lsid, an AML-routine, inputs, and outputs; each input and output is an lsid paired with data; the run also records user, submit time, and complete time)
13AML-run Cont.
(Diagram: data items linked through aml-run nodes)
- Input can be used for more than one run
- Different levels combined together
- Output can be input for another run
- Track back the final output to find what analysis was performed and what data was used
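Tracing a final output back through the runs that produced it is a walk over a small graph. The sketch below assumes each run is a record with a run identifier, a routine name, and lists of input and output identifiers; the dictionary keys and the identifiers such as `data:raw` are placeholders, not real LSIDs or the AML-run schema.

```python
def trace_back(output_id, runs):
    """Return (routines applied, original input data) behind one output.

    runs: list of dicts with keys "run", "routine", "inputs", "outputs".
    Data that no run produced is treated as original input.
    """
    # Index each data identifier by the run that produced it.
    produced_by = {}
    for run in runs:
        for out in run["outputs"]:
            produced_by[out] = run

    routines, original_inputs = [], []
    seen_runs, seen_data = set(), set()
    stack = [output_id]
    while stack:
        data_id = stack.pop()
        if data_id in seen_data:
            continue
        seen_data.add(data_id)
        run = produced_by.get(data_id)
        if run is None:
            original_inputs.append(data_id)
            continue
        if run["run"] not in seen_runs:
            seen_runs.add(run["run"])
            routines.append(run["routine"])
        # The inputs of this run may themselves be outputs of earlier runs.
        stack.extend(run["inputs"])
    return routines, sorted(original_inputs)


# Illustrative chain: raw data is denoised, normalized, then peak-picked.
runs = [
    {"run": "run:1", "routine": "denoise",
     "inputs": ["data:raw"], "outputs": ["data:denoised"]},
    {"run": "run:2", "routine": "normalize",
     "inputs": ["data:denoised"], "outputs": ["data:normalized"]},
    {"run": "run:3", "routine": "peak-find",
     "inputs": ["data:normalized", "data:params"], "outputs": ["data:peaks"]},
]
routines, inputs = trace_back("data:peaks", runs)
```

Because the walk follows identifiers rather than copies of the data, the same input can appear in several runs without being duplicated in the trace.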
14StatML
- Scalars
- String, boolean, integer, long, float, double
- Arrays
- Base64 encoded (turn 3 bytes into 4)
- Square arrays (multi-dimensional)
- Integer, long, float, double
- Lists
- Can contain lists, arrays, scalars
- Null
- Lack of value
Efficient, textual encoding
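The "3 bytes into 4" rule is easy to check: Base64 emits four text characters for every three input bytes, so an array of three doubles (24 bytes) becomes 32 characters. The sketch below assumes big-endian 64-bit doubles; the byte order and the function names are assumptions, since the slides do not specify StatML's binary layout.

```python
import base64
import struct


def encode_doubles(values):
    """Pack doubles as big-endian bytes (assumed order) and Base64-encode them."""
    raw = struct.pack(">%dd" % len(values), *values)
    return base64.b64encode(raw).decode("ascii")


def decode_doubles(text):
    """Inverse of encode_doubles."""
    raw = base64.b64decode(text)
    return list(struct.unpack(">%dd" % (len(raw) // 8), raw))


values = [1.0, 2.5, -3.0]
text = encode_doubles(values)          # 24 bytes of doubles become 32 characters
restored = decode_doubles(text)

# The smallest case of the rule: three bytes become four characters.
three_bytes = base64.b64encode(b"abc").decode("ascii")
```

The encoded form is about one third larger than the raw bytes, which is still far more compact than writing each number out as decimal text in its own XML element.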