Title: caGrid Version 0.5 Reference Implementation RProteomics caBIG Architecture Workspace Face to Face Georgetown University August 16th -18th, 2005
1. caGrid Version 0.5 Reference Implementation: RProteomics
caBIG Architecture Workspace Face to Face
Georgetown University, August 16th-18th, 2005
- Patrick McConnell
- Duke Comprehensive Cancer Center
- patrick.mcconnell_at_duke.edu
- Shannon Hastings
- Ohio State University
- hastings_at_bmi.osu.edu
2. Outline
- High Level Overview of Proteomics
- Data Model
- Project Architecture
- Process of getting to Silver level compliance
- Functionality Exposed to Grid
- Process of Grid Enablement
- Demo/Screenshots
- Lessons Learned / Technical Difficulties / Wish List
- Acknowledgements
3. Proteomics Overview
- Goal
- Find biomarker
- Build predictive model
- Proteins are split into peptide fragments
- Mass is measured by time-of-flight (TOF)
- Mass of peptides can be used to identify proteins
- Peptides can undergo a second MS to help identification
http://www.appliedbiosystems.com/catalog/myab/StoreCatalog/products/CategoryDetails.jsp?hierarchyID=101&category3rd=112051&trail=no
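The time-of-flight measurement mentioned above can be illustrated with a short sketch. It assumes an idealized linear TOF analyzer (ions start at rest, no reflectron, no delayed extraction), where an ion of charge z accelerated through a potential U and drifting over a length L satisfies z·e·U = ½·m·(L/t)². The function names and the instrument values in the comments are illustrative assumptions, not taken from the slides.

```python
# Idealized linear time-of-flight relation:
#   z*e*U = 1/2 * m * (L/t)^2   =>   m/z = 2*e*U*t^2 / L^2

E_CHARGE = 1.602176634e-19   # elementary charge in coulombs
DALTON = 1.66053906660e-27   # unified atomic mass unit in kilograms


def tof_to_mz(flight_time_s, drift_length_m, accel_voltage_v):
    """Return m/z (daltons per elementary charge) for a measured flight time."""
    kg_per_charge = (
        2.0 * E_CHARGE * accel_voltage_v * flight_time_s ** 2 / drift_length_m ** 2
    )
    return kg_per_charge / DALTON


def mz_to_tof(mz, drift_length_m, accel_voltage_v):
    """Inverse relation: flight time in seconds for a given m/z."""
    kg_per_charge = mz * DALTON
    return drift_length_m * (kg_per_charge / (2.0 * E_CHARGE * accel_voltage_v)) ** 0.5


# Hypothetical instrument: 1 m drift tube, 20 kV acceleration voltage.
flight_time = mz_to_tof(1000.0, 1.0, 20000.0)
print(f"m/z 1000 arrives after {flight_time * 1e6:.2f} microseconds")
```

Under this model, flight time grows with the square root of m/z, so an ion four times heavier takes twice as long to reach the detector.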
4. Proteomics Data
- A modest study can be on the order of 10 GB of data
5. Project Overview
- RProteomics is a development project in the Proteomics SIG of the ICR Workspace
- Developing analytical routines for proteomics data
- Denoising, background removal, peak identification, spectral alignment, normalization, peptide quantitation
- Focus is on analytics
- NOT databases, LIMS, protein identification
- RProteomics is a critical step in the proteomics pipeline
- LIMS -> repository -> RProteomics -> classification -> protein identification
- RProteomics provides integration
- Q5 classification has been integrated
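The slides name the analytical routines but do not describe their algorithms. As a rough illustration of what three of them do, the sketch below uses deliberately simple stand-ins: a sliding-window minimum for background removal, a moving average for denoising, and a noise-scaled threshold for peak identification. These are common textbook approaches chosen for brevity, not the RProteomics methods; all function names, parameters, and the toy spectrum are assumptions.

```python
from statistics import median


def remove_background(intensities, window_size):
    """Subtract a baseline estimated as the minimum in a centered sliding window."""
    half = window_size // 2
    n = len(intensities)
    return [
        intensities[i] - min(intensities[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


def denoise(intensities, window_size):
    """Smooth with a centered moving average (windows shrink at the edges)."""
    half = window_size // 2
    n = len(intensities)
    smoothed = []
    for i in range(n):
        window = intensities[max(0, i - half):min(n, i + half + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def find_peaks(intensities, threshold_multiplier):
    """Return indices of local maxima above median + multiplier * noise.

    Noise is estimated as the median absolute deviation of the signal.
    """
    center = median(intensities)
    noise = median(abs(x - center) for x in intensities)
    cutoff = center + threshold_multiplier * noise
    return [
        i
        for i in range(1, len(intensities) - 1)
        if intensities[i] > cutoff
        and intensities[i] >= intensities[i - 1]
        and intensities[i] > intensities[i + 1]
    ]


# Toy spectrum: a slowly rising baseline with two sharp peaks.
spectrum = [10, 11, 12, 13, 40, 15, 16, 17, 18, 60, 20, 21, 22]
flattened = remove_background(spectrum, window_size=5)
print(find_peaks(flattened, threshold_multiplier=3.0))
```

Each step takes a list of intensities and returns a list or a set of indices, which is the kind of narrow interface that makes a routine easy to chain in a pipeline.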
6. Statistics: Background Removal
7. Statistics: Denoising
8. Statistics: Spectral Alignment
9. Statistics: Protein Quantitation
10. Data Model
- mzXML
- Encodes raw spectra data (m/z-intensity pairs)
- Some metadata about instrumentation
- Utilizes base64 encoding for binary data
- scanFeatures
- Encodes analysis results as a set of features
- Some metadata about the experiment
- Utilizes base64 encoding for binary data
- Service parameters
- JpegImage
- Lsid
- WindowSize
- ThreshholdMultiplier
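To make the base64 encoding of binary data concrete, here is a minimal sketch of how m/z-intensity pairs can be packed and unpacked. It assumes the layout commonly used in mzXML peak lists: interleaved m/z and intensity values stored as 32-bit floats in network (big-endian) byte order. The function names are hypothetical, and a real reader should take precision and byte order from the file's own attributes rather than hard-coding them.

```python
import base64
import struct


def encode_peaks(pairs):
    """Pack (m/z, intensity) pairs as big-endian 32-bit floats, then base64."""
    flat = [value for pair in pairs for value in pair]
    raw = struct.pack(f">{len(flat)}f", *flat)
    return base64.b64encode(raw).decode("ascii")


def decode_peaks(text):
    """Reverse of encode_peaks: base64 text back to (m/z, intensity) tuples."""
    raw = base64.b64decode(text)
    flat = struct.unpack(f">{len(raw) // 4}f", raw)
    return list(zip(flat[0::2], flat[1::2]))


# Values chosen to be exactly representable as 32-bit floats.
peaks = [(1000.5, 250.0), (2000.25, 125.0)]
encoded = encode_peaks(peaks)
print(encoded)
print(decode_peaks(encoded))
```

Base64 lets the binary peak list travel inside an XML text node, at the cost of roughly a one-third increase in size.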
11. Project Architecture
12. Project Architecture
13. Process of getting to Silver level compliance
- Programming and messaging interfaces
- Apache Axis for web services
- Wrapped functionality with Java interfaces that made sense
- Vocabularies, terminologies, and ontologies
- Data elements
- Wrote tool for XML Schema to XMI conversion
- Manually curated UML
- Went through semantic connecting process
- Information models
- XML Schema to begin with, so information models were easy
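The conversion tool itself is not shown in the slides. As an illustration of why starting from XML Schema makes the information model easy to derive, the sketch below reads named complex types from a schema and lists their elements as class attributes, which is the kind of intermediate structure a schema-to-UML conversion would need. The schema fragment and the function name are hypothetical, and emitting actual XMI is out of scope here.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment; element names are illustrative only.
SCHEMA = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="WindowSize">
    <xs:sequence>
      <xs:element name="value" type="xs:int"/>
    </xs:sequence>
  </xs:complexType>
  <xs:complexType name="JpegImage">
    <xs:sequence>
      <xs:element name="width" type="xs:int"/>
      <xs:element name="data" type="xs:base64Binary"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>"""


def schema_to_classes(xsd_text):
    """Map each named complexType to a list of (attribute name, type) pairs."""
    root = ET.fromstring(xsd_text)
    classes = {}
    for ctype in root.iter(XS + "complexType"):
        name = ctype.get("name")
        if name is None:
            continue  # anonymous types are skipped in this sketch
        classes[name] = [
            (element.get("name"), element.get("type"))
            for element in ctype.iter(XS + "element")
        ]
    return classes


for class_name, attributes in schema_to_classes(SCHEMA).items():
    print(class_name, attributes)
```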
14. Functionality Exposed to the Grid
- Analytical service: no security requirements
- Discuss its input and output and what it does scientifically
- Functionality to be exposed
- 20 more statistical methods
- Data access methods, translation methods
15. Process of Grid Enablement
- Process
- Creation/extraction of data types using XML Schema
- Upload data types into caGrid GME
- Use the Analytical Toolkit Portal to create and modify the grid service interface
- Implement the generated server stub by making the appropriate calls into the original non-grid-enabled RProteomics application
- Compile and deploy
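The stub-implementation step can be sketched as follows. This is not caGrid or Axis code: the class stands in for a generated server stub, and the only part a developer would write is the method body, which unmarshals the request, delegates to the existing application routine unchanged, and marshals the result. All names, the wire format, and the stand-in legacy routine are assumptions made for illustration.

```python
import base64
import struct


def legacy_remove_background(intensities, window_size):
    """Stand-in for an existing, tested, non-grid-enabled analysis routine."""
    half = window_size // 2
    n = len(intensities)
    return [
        intensities[i] - min(intensities[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


class BackgroundRemovalServiceStub:
    """Hypothetical generated stub; the method body is the hand-written part."""

    def remove_background(self, encoded_intensities, window_size):
        # 1. Unmarshal the wire format (base64, big-endian 32-bit floats).
        raw = base64.b64decode(encoded_intensities)
        values = list(struct.unpack(f">{len(raw) // 4}f", raw))
        # 2. Delegate to the original application code, unchanged.
        result = legacy_remove_background(values, window_size)
        # 3. Marshal the result back into the wire format.
        packed = struct.pack(f">{len(result)}f", *result)
        return base64.b64encode(packed).decode("ascii")


request = base64.b64encode(struct.pack(">4f", 5.0, 3.0, 9.0, 4.0)).decode("ascii")
response = BackgroundRemovalServiceStub().remove_background(request, window_size=3)
print(struct.unpack(">4f", base64.b64decode(response)))
```

Keeping the stub this thin means the analysis code can still be tested without any grid infrastructure in place.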
16. Demo and/or Screenshots
- Demonstration of RProteomics GUI with grid functionality
17. Lessons Learned / Technical Difficulties / Wish List
- Think grid from the beginning
- Have an idea what the service interface will be ahead of time
- Wrap parameters with objects
- Technology is complex
- XML, Schema, CDEs, Globus, Web Services, etc.
- Installation is complex
- Have to have working knowledge of Tomcat, Axis, Ant, environment variables, etc.
- Need to have compatible versions of each component, esp. Java 1.4.2_04
- Wish list
- Wizard for grid-enabling existing code
- Documentation of every aspect of installation and functionality
- Clone Shannon for each development project
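The advice to wrap parameters with objects can be illustrated with a small sketch. Instead of passing bare numbers, each service parameter gets its own named type, which can carry validation and maps naturally onto a schema-defined data type. The class names echo the service parameters listed on the Data Model slide (including its spelling of ThreshholdMultiplier); the validation rules and the function are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowSize:
    value: int

    def __post_init__(self):
        if self.value < 1:
            raise ValueError("window size must be at least 1")


@dataclass(frozen=True)
class ThreshholdMultiplier:  # spelling follows the slide's parameter name
    value: float

    def __post_init__(self):
        if self.value <= 0:
            raise ValueError("threshold multiplier must be positive")


def describe_request(window: WindowSize, multiplier: ThreshholdMultiplier) -> str:
    """A service operation that takes typed parameter objects, not bare numbers."""
    return f"window={window.value}, multiplier={multiplier.value}"


print(describe_request(WindowSize(5), ThreshholdMultiplier(2.5)))
```

A wrapped parameter can later gain fields such as units or defaults without changing the signature of the service operation.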
18. Lessons Learned / Technical Difficulties / Wish List
- Starting with a non-grid-enabled application that has been tested and is stable made wrapping it as a grid service easier to debug.
- Need a standard mechanism for dealing with large data objects.
- Some sort of lazy-loaded object/pointer would be sufficient.
- Integration of the toolkit portal into some standard IDEs might make development even easier.
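One way to picture the lazy-loaded object/pointer idea is a lightweight reference that carries only an identifier and fetches the bulky data on first access. The sketch below is an assumption about what such a mechanism could look like, not a caGrid feature; the class name, the in-memory store, and the example identifier are all hypothetical.

```python
class LazySpectrum:
    """Holds an identifier and fetches the bulky peak data only on first access."""

    def __init__(self, lsid, loader):
        self.lsid = lsid
        self._loader = loader  # callable: lsid -> list of (m/z, intensity)
        self._peaks = None

    @property
    def peaks(self):
        if self._peaks is None:
            self._peaks = self._loader(self.lsid)
        return self._peaks


# Hypothetical in-memory store standing in for a remote data repository.
STORE = {"urn:lsid:example.org:spectrum:1": [(1000.5, 250.0), (2000.25, 125.0)]}
calls = []  # records every fetch so the laziness is observable


def load_from_store(lsid):
    calls.append(lsid)
    return STORE[lsid]
```

A service could then pass a `LazySpectrum` around cheaply and pay the transfer cost only when `peaks` is actually read, and only once.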
19. Acknowledgements
- Duke, ICR Developer
- Patrick McConnell, Project lead
- Richard Haney, Architect and developer of statistical systems
- Salvatore Mungal, Middle-tier Java developer
- Mark Peedin, Database developer
- Northwestern University, Collaborator
- Simon Lin, Proteomics domain expert
- Oregon Health Sciences University, ICR Adopter
- Shannon McWeeney
- Veena Rajaraman
- University of Pennsylvania, ICR Adopter
- David Fenstermacher
- Craig Street
- University of North Carolina, Collaborator
- Christoph Borchers, Proteomics scientist
- OSU, caGRID Team
- Shannon Hastings
- Scott Oster
- Stephen Langella
- Tahsin Kurc
- Joel Saltz
- Architecture
- Arumani Manisundaram
- Avinash Krishnakant
- VCDE
- Brian Davis, Workspace Lead
- George Komatsoulis, VCDE lead
- Claire Wolfe, VCDE curator
- Salvatore Mungal, VCDE mentor