Analytical and Data Services Guidelines


  • Architecture/VCDE Workspaces Joint Face to
    Face, February 1st-2nd, 2006

Scott Oster, Ohio State University, oster_at_bmi.osu.e
Patrick McConnell, Duke Comprehensive Cancer
  • Overview of Data and Analytical Services
  • Distinction between Analytical Tool and
    Analytical Service
  • Metadata definition and usage
  • Current UML model for service metadata
  • Need for harmonization
  • Plan to reach consensus
  • Leveraging existing data standards in caBIG
  • De facto standards into UML
  • Bridging caDSR and GME
  • Namespace issues (existing standards)
  • Connecting CDEs and Schema types

caBIG Services
[Figure: caBIG services diagram. Labels: Analytical Service, Grid Data Service, Grid-Enabled Client, Grid Portal, two Research Centers, Tools 1-4]

caBIG Services
  • Data Services
  • Data services present an object view of data
  • Objects exposed as data services comply with
    common data elements registered in the caDSR/EVS,
    and are transported as XML using schema types
    registered in GME
  • Currently query only (no update, insert, or
    delete)
  • Analytical Services
  • Analytical Services are base Globus services
  • Required to be strongly-typed with respect to
    input and output
  • Analytical service inputs and outputs are objects
    conforming to classes registered in caDSR and to
    schema types registered in GME
  • Graphical tool to automatically create source
    code, configuration files, and build process for
    new analytical services
  • Input and output parameters can be discovered
    from GME

Analytical Tool vs. Analytical Service
  • Analytical services provide data back to the grid
  • Analytical tools only consume data from the grid
  • Examples
  • caWorkbench
  • RProteomics

Analytical Service Guidelines
  • Inputs and outputs (parameters) defined by
  • Objects with metadata registered in caDSR and
  • Objects with XML Schema defined
  • Parameters defined as objects, not simple data
    types
  • a.k.a. no Java primitives
  • Provide service level metadata, the structure of
    which is defined in the caDSR
  • Internal (non API) classes do not need to be
    registered in the caDSR
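
The "objects, not simple data types" guideline can be sketched as follows. This is only an illustration, written in Python rather than Java for brevity. The classes `ExpressionMatrix` and `ClusterAssignment`, the functions `cluster` and `threshold`, and the helper `uses_only_object_parameters` are all hypothetical and do not correspond to registered caDSR classes or to any caGrid API.

```python
from dataclasses import dataclass
from typing import List, get_type_hints


# Hypothetical domain objects. In caBIG these would correspond to classes
# registered in the caDSR, with matching XML Schema types.
@dataclass
class ExpressionMatrix:
    gene_ids: List[str]
    values: List[List[float]]


@dataclass
class ClusterAssignment:
    gene_ids: List[str]
    cluster_ids: List[int]


# Simple types that the guideline rules out for service parameters.
SIMPLE_TYPES = (int, float, str, bool, bytes)


def uses_only_object_parameters(func) -> bool:
    """Return True if every annotated parameter and the return type of
    `func` is something other than a simple built-in type."""
    hints = get_type_hints(func)
    return bool(hints) and all(h not in SIMPLE_TYPES for h in hints.values())


def cluster(matrix: ExpressionMatrix) -> ClusterAssignment:
    """Toy analysis: cluster 0 for a negative row mean, cluster 1 otherwise."""
    ids = [0 if sum(row) / len(row) < 0 else 1 for row in matrix.values]
    return ClusterAssignment(matrix.gene_ids, ids)


def threshold(cutoff: float) -> bool:
    """Counter-example: a signature built only from simple types."""
    return cutoff > 0
```

Here `cluster` follows the guideline, because both its input and its output are objects, while `threshold` does not.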

Analytical Tool Guidelines
  • Inputs defined by
  • Objects with metadata registered in caDSR and
  • Objects with XML Schema defined
  • No output types need be defined in the caDSR
  • Service-level metadata need not be provided
  • Internal (non API) classes do not need to be
    registered in the caDSR

Analytical service and tool open questions
  • Tools that are provided as an API in a
    programming language
  • Example: Q5
  • Should tools be a dead end for data?
  • Many tools can output well-defined,
    standards-based objects
  • Example: caWorkbench
  • Many tools can abstract analyses into services
  • Example: VISDA
  • Should analytical service method signatures be
    reviewed and harmonized?
  • Issue raised in interoperability review of
  • Promote interoperability, plug-and-play analytics
  • Provides context by which to evaluate parameter

caBIG Service Description
  • Client and service APIs are object oriented, and
    operate over well-defined and curated data types
  • Objects are defined in UML and converted into
    Administered Components, which are in turn
    registered in the Cancer Data Standards
    Repository (caDSR)
  • Object definitions draw from vocabulary
    registered in the Enterprise Vocabulary Services
    (EVS), and their relationships are thus
    semantically described
  • XML serialization of objects adheres to XML
    schemas registered in the Global Model Exchange
    (GME)
  • All data in caGrid travel between services and
    between client and services as XML documents that
    conform to well-defined schemas stored in GME
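
As a rough illustration of objects travelling as namespaced XML, the sketch below serializes a small object and reads it back using only the Python standard library. The `Gene` class, its fields, and the namespace string are assumptions made up for this example. No schema validation is performed here, whereas a real service would validate against the schema registered in GME.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical namespace, used only for illustration.
NAMESPACE = "gme://example.caBIG/1.0/org.example.domain"


@dataclass
class Gene:
    symbol: str
    name: str


def to_xml(gene: Gene) -> str:
    """Serialize a Gene as a single namespace-qualified XML element."""
    root = ET.Element(
        f"{{{NAMESPACE}}}Gene",
        {"symbol": gene.symbol, "name": gene.name},
    )
    return ET.tostring(root, encoding="unicode")


def from_xml(text: str) -> Gene:
    """Rebuild a Gene, rejecting elements from any other namespace or type."""
    root = ET.fromstring(text)
    if root.tag != f"{{{NAMESPACE}}}Gene":
        raise ValueError(f"unexpected element {root.tag}")
    return Gene(symbol=root.attrib["symbol"], name=root.attrib["name"])
```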

Current Metadata
  • Metadata and Registry Services
  • Support for Advertisement and Discovery processes
  • Metadata and registry services maintain metadata
    associated with data and analytical services
  • All services register information to an Index
  • Services can be discovered using semantics of
    their data types
  • Three types of Service Metadata
  • Common Metadata describes generic information
    about the Cancer Center providing the service
  • Data Service Metadata describes the data exposed
    using terminology and objects from caDSR/EVS
  • Analytical Service Metadata describes the
    supported operations and their inputs and outputs
    using terminology and objects from caDSR/EVS

The need for more service-level metadata
  • Why?
  • Find the service you want (discovery)
  • Help understand what a service does (extension of
  • Types of fields
  • Name
  • Description with concept
  • Keywords
  • For high-precision calculations: operating
    system, hardware
  • Contact information
  • Method signatures

VCDE proposed model for service level metadata
Service level metadata next steps
  • Form a cross-cutting working group
  • Evaluate two models, use cases
  • Get input from caGrid team
  • Propose model to VCDE, Architecture, caGrid

Bringing existing biomedical standards to caBIG
  • There is a wealth of existing standards in the
    biomedical field
  • The great thing about standards is that there
    are so many to choose from
  • The problem with standards is that there are so
    many to choose from
  • MAGE-OM/MAGE-ML, BioPax, mzXML, etc.
  • Most standards based on XML Schema
  • Or alternate non-UML encodings: RDF, OWL,
    Protégé, etc.
  • Translating XML Schema to well-defined object
    models in UML is not trivial
  • Passing standards-based XML across the grid using
    the caGrid infrastructure has not been explored

Converting from XML Schema to caBIG UML
  • Names of classes and attributes fixed by schema
    (if you actually want to follow the schema)
  • Plurals, poor semantics, contain parent name,
  • caGrid requires specific namespace to enter GME
  • The namespace is probably already defined in the
  • Extension of simple types (e.g. extending String)
  • XML Schema allows such extension, caDSR does not
  • Elements can contain both values (text) and
  • Examples: XHTML, PubMed abstracts
  • caCORE SDK compatibility
  • id attributes, Collection
  • Elements can contain text and have attributes
  • Basically an extension of String, but also with
  • XML Schema intentionally very hierarchical
  • End up with a bunch of empty classes
  • XML Schema constructs not supported by UML and/or
  • Example: choice
  • Many simple types do not exist in the caDSR
  • Duration, int versus integer, etc.
  • Collections of primitives
  • Cannot model in caDSR with primitive type

Potential solutions: XSD->XMI
  • Preface: XMI->XSD is much easier
  • You can even do this with Enterprise Architect
  • HyperModel: XSD->UML, UML->XSD
  • De facto standard for XSD->UML conversion
  • Plugin to Eclipse
  • Freely available, but not open source
  • XMIGenerator: XSD->UML
  • Developed at Duke to address some deficiencies
    in HyperModel
  • Standalone, command-line based application
  • Open source, freely available
  • XSD->Java->UML
  • Many tools to do this, but you will get many
    artifacts in the UML

XSD->JAXB->Java->EA->UML (mzXML)
XSD->HyperModel->XMI (mzXML)
XSD->HyperModel->XMI (pepXML)
XSD->XMIGenerator->XMI (mzXML)
Discussion from breakout yesterday
  • Point 1
  • Point 2
  • Point 3

Existing Mapping from caDSR to GME
  • In caDSR, each project (application) will have
    its own Classification Scheme (e.g. caCORE). A
    Classification Scheme may define a subproject,
    which is represented as a Classification Scheme
    Item (CSI) (e.g. caBIO). In caGrid 0.5, each CSI
    had its own schema.
  • Each XML schema will be published into the caGrid
    GME service. As the caDSR ensures semantic
    interoperability, the GME ensures programmatic
    data exchange (syntactic) interoperability.

From caDSR to GME (cont)
  • The caGrid 0.5 recommendation for assigning
    schema namespaces for caBIG objects is shown
    below
  • For example:
  • gme://caTIES.caBIG/3.0/edu.upmc.opi.cabig.caties.d
  • This provides a coarse-grain, rule-based mapping
    from caDSR to GME

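Namespace pattern:
  gme://<Classification Scheme>.<Context>/<Classification Scheme Version>/<Classification Scheme Item>
where <Classification Scheme>.<Context> is the <domain> and
<Classification Scheme Version> is the <version>.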
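
The rule-based mapping can be sketched as a pair of helper functions. The `gme://` prefix and the `scheme.context/version/item` layout are inferred from the example on this slide. The function names are hypothetical, and the sketch assumes that the context contains no dot and that no part contains a slash.

```python
from typing import Tuple


def gme_namespace(scheme: str, context: str, version: str, item: str) -> str:
    """Build a namespace from caDSR classification information."""
    for part in (scheme, context, version, item):
        if not part or "/" in part:
            raise ValueError(f"invalid namespace part: {part!r}")
    return f"gme://{scheme}.{context}/{version}/{item}"


def parse_gme_namespace(namespace: str) -> Tuple[str, str, str, str]:
    """Recover (scheme, context, version, item) from a namespace."""
    prefix = "gme://"
    if not namespace.startswith(prefix):
        raise ValueError("not a gme:// namespace")
    parts = namespace[len(prefix):].split("/")
    if len(parts) != 3:
        raise ValueError("expected <domain>/<version>/<item>")
    domain, version, item = parts
    # Assumption: the context is the text after the last dot of the domain.
    scheme, _, context = domain.rpartition(".")
    return scheme, context, version, item
```

For instance, `gme_namespace("caTIES", "caBIG", "3.0", "org.example.item")` uses the scheme, context, and version from the slide's example together with a made-up item name.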
Connecting the caDSR and GME
  • Some applications will need to work at both the
    CDE level and the XML level
  • Examples: workflow engine, translational query
    system, etc.
  • There is no defined link between
  • A CDE and an XML element
  • A CDE and an XML element or attribute

[Figure callouts: names differ; attribute or element?; what about Collection associations?]
Potential solutions
  • Change the caDSR
  • Provide a link from each CDE and attribute to the
    location in the XSD
  • Change the GME
  • Provide a link from each element/attribute in the
    XSD to the caDSR
  • Provide a mapping service
  • Given a context and CDE, give me the XSD
  • Given an element/attribute and context, give me
    the CDE
  • Likely we should start a cross-cutting working
    group to address the problem
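
The third option, a mapping service, could look roughly like the following two-way lookup. This is only a sketch of the idea. The class, its methods, and the identifiers in the usage are hypothetical, and nothing here reflects an existing caDSR or GME interface.

```python
from typing import Dict, Optional, Tuple


class CdeXsdMapping:
    """Toy two-way lookup between CDEs and XSD elements or attributes."""

    def __init__(self) -> None:
        self._to_xsd: Dict[Tuple[str, str], str] = {}
        self._to_cde: Dict[Tuple[str, str], str] = {}

    def link(self, context: str, cde: str, xsd_path: str) -> None:
        """Record that `cde` and `xsd_path` describe the same thing in `context`."""
        self._to_xsd[(context, cde)] = xsd_path
        self._to_cde[(context, xsd_path)] = cde

    def xsd_for(self, context: str, cde: str) -> Optional[str]:
        """Given a context and CDE, return the XSD element or attribute."""
        return self._to_xsd.get((context, cde))

    def cde_for(self, context: str, xsd_path: str) -> Optional[str]:
        """Given an element/attribute and context, return the CDE."""
        return self._to_cde.get((context, xsd_path))
```

A mapping such as `link("caBIG", "Gene.symbol", "Gene/@symbol")` would answer both questions from the slide without changing either the caDSR or the GME.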
