Title: caBIG Pilot Project Selection Process
Cancer Biomedical Informatics Grid (caBIG): An Approach towards Data Access and Integration
Avinash Shanbhag, Director, Core Infrastructure Engineering
National Cancer Institute Center for Bioinformatics
2 National Cancer Institute 2015 Goal
- Relieve suffering and death due to cancer by the year 2015
3 Origins of caBIG
- Need: Enable investigators and research teams nationwide to combine and leverage their findings and expertise in order to meet the NCI 2015 Goal.
- Strategy: Create a scalable, actively managed organization that will connect members of the NCI-supported cancer enterprise by building a biomedical informatics network in which data can be seamlessly shared.
4 caBIG Challenges
- Handle diversity of data types
- Precise meaning of data
- Provide local hosting of data
- Local access control
- Provide tools to publish and access data easily
- High-performance computing will be needed in the future
5 Interoperability
- Ability of a system to access and use the parts or equipment of another system
Semantic interoperability
Syntactic interoperability
6 How to Achieve Interoperability for Data Systems?
- Well-documented public API access to data
- Based on object-oriented abstraction of underlying data
- No particular technology or tool specified
- Abstraction layer must be derived using widely accepted standards
- Model Driven Architecture
- Information Model is the metadata of the data and needs to be persisted and accessible via API
- Need to be able to unambiguously and programmatically determine the meaning of data
7 OMG Model Driven Architecture (MDA) Approach
- Analyze the problem space and develop the artifacts for each scenario
  - Use Cases
- Use Unified Modeling Language (UML) to standardize model representations and artifacts. Design the system by developing artifacts based on the use cases
  - Class Diagram: Information Model
  - Sequence Diagram: Temporal Behavior
- Use meta-model tools to generate the code
8 Limitations of MDA
- Limited expressivity for semantics
- No facility for runtime semantic metadata management
9 caCORE
- Syntactic and Semantic Integration
- MDA Plus a whole lot more!
10 caCORE
11 Use Cases
- Description
- Actors
- Basic Course
- Alternative Course
12 Bioinformatics Objects
13 Common Data Elements
- What do all those data classes and attributes actually mean, anyway?
- Data descriptors or semantic metadata required
- Computable, commonly structured, reusable units of metadata are Common Data Elements, or CDEs.
- NCI uses the ISO/IEC 11179 standard for metadata structure and registration
- Semantics all drawn from Enterprise Vocabulary Service resources
14 Description Logic
[Figure labels: Enterprise Vocabulary; Concept Code; Relationships; Preferred Name; Definition; Synonyms]
15 Semantic metadata example: Agent
<Agent>
  <name>Taxol</name>
  <nSCNumber>007</nSCNumber>
</Agent>
16 Why do you need metadata?
| Class/Attribute | Example Object Data | CIA Metadata | NCI Metadata |
|---|---|---|---|
| Agent | | A sworn intelligence agent; a spy | Chemical compound administered to a human being to treat a disease or condition, or prevent the onset of a disease or condition |
| Agent nSCNumber | 007 | Identifier given to an intelligence agent by the National Security Council | Identifier given to chemical compound by the US Food and Drug Administration Nomenclature Standards Committee |
| Agent name | Taxol | CIA code name given to intelligence agents | Common name of chemical compound used as an agent |
17 Computable Interoperability
My model: Agent (name, nSCNumber, CTEPName, FDAIndID, IUPACName)
Your model: Drug (id, NDCCode, approvalDate, approver, fdaCode)
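One way to picture computable interoperability between the Agent and Drug models is to annotate each attribute with a vocabulary concept code and pair the attributes whose codes match. The sketch below is illustrative only: the class `ConceptMatcher`, the `C-HYP-*` codes, and the resulting pairing are assumptions made up for this example, not codes or mappings taken from EVS or caDSR.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: pairs attributes from two models by shared concept code.
class ConceptMatcher {

    // Returns "left = right" for every pair of attributes annotated with the same code.
    static List<String> match(Map<String, String> left, Map<String, String> right) {
        List<String> pairs = new ArrayList<>();
        for (Map.Entry<String, String> l : left.entrySet()) {
            for (Map.Entry<String, String> r : right.entrySet()) {
                if (l.getValue().equals(r.getValue())) {
                    pairs.add(l.getKey() + " = " + r.getKey());
                }
            }
        }
        return pairs;
    }

    // "My model": attribute name -> concept code (codes are invented placeholders).
    static Map<String, String> myModel() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Agent.name", "C-HYP-1");
        m.put("Agent.nSCNumber", "C-HYP-2");
        m.put("Agent.IUPACName", "C-HYP-3");
        return m;
    }

    // "Your model": the shared code C-HYP-2 is an assumption chosen for the demo.
    static Map<String, String> yourModel() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("Drug.id", "C-HYP-2");
        m.put("Drug.approvalDate", "C-HYP-9");
        m.put("Drug.fdaCode", "C-HYP-7");
        return m;
    }

    public static void main(String[] args) {
        System.out.println(match(myModel(), yourModel()));
    }
}
```

The attribute names differ between the two models, so a name comparison finds nothing; only the shared annotation lets a program decide that two attributes mean the same thing.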
18 Cancer Data Standards Repository (caDSR)
- ISO/IEC 11179 Registry for Common Data Elements: units of semantic metadata
- Client for Enterprise Vocabulary: metadata constructed from controlled terminology and annotated with concept codes
- Precise specification of Classes, Attributes, Data Types, Permissible Values: strong typing of data objects.
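To make the idea of permissible values concrete, the following sketch models a data element as a class attribute, a concept code, and a closed set of allowed values. It is a simplified stand-in and not the caDSR API: the class `CommonDataElementSketch`, the `status` attribute, its values, and the code `C-HYP-10` are all hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a Common Data Element; not the caDSR API.
class CommonDataElementSketch {
    private final String className;
    private final String attributeName;
    private final String conceptCode;            // placeholder, not a real EVS code
    private final Set<String> permissibleValues; // the only values the attribute may take

    CommonDataElementSketch(String className, String attributeName,
                            String conceptCode, String... permissibleValues) {
        this.className = className;
        this.attributeName = attributeName;
        this.conceptCode = conceptCode;
        this.permissibleValues = new LinkedHashSet<>(Arrays.asList(permissibleValues));
    }

    String qualifiedName() {
        return className + "." + attributeName;
    }

    String conceptCode() {
        return conceptCode;
    }

    // A value is accepted only if it is one of the registered permissible values.
    boolean accepts(String value) {
        return permissibleValues.contains(value);
    }

    public static void main(String[] args) {
        // The attribute and its values are invented for illustration.
        CommonDataElementSketch status = new CommonDataElementSketch(
                "Agent", "status", "C-HYP-10", "APPROVED", "INVESTIGATIONAL");
        System.out.println(status.qualifiedName() + " accepts APPROVED: "
                + status.accepts("APPROVED"));
        System.out.println(status.qualifiedName() + " accepts unknown: "
                + status.accepts("unknown"));
    }
}
```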
19 caCORE Tools
- UML Loader: automatically registers UML models as metadata components
- CDE Curation: fine-tune metadata and constrain permissible values with data standards
- Form Builder: create standards-based data collection forms
- CDE Browser: search and export metadata components
- Common Security Module: provides role-based security
20 caCORE Software Development Kit
- UML Modeling Tool (any with XMI export)
- Semantic Connector (concept binding utility)
- UML Loader (model registration in caDSR)
- Codegen (middleware code generator)
- Security Adaptor (Common Security Module)
The caCORE SDK generates a syntactically and semantically interoperable data service system
21 caGrid
caCORE meets grid technology!
22 Use cases not satisfied by caCORE alone
- Advertisement
  - Service Provider composes service metadata describing the service and publishes it to the grid.
- Discovery
  - Researcher (or application developer) specifies search criteria describing a service of interest
  - The researcher submits the discovery request to a discovery service, which identifies a list of services matching the criteria and returns the list.
- Invocation
  - Researcher (or application developer) instantiates the grid service and accesses its resources
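The three use cases can be sketched with a small in-memory registry. This is only a toy model of the flow under stated assumptions: `ServiceRegistrySketch`, its `Advertisement` fields, and the `example.org` URLs are hypothetical, and a real grid would use a networked index service rather than a local map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical in-memory stand-in for a grid discovery service.
class ServiceRegistrySketch {

    // Service metadata published by a provider; the fields are illustrative.
    static class Advertisement {
        final String url;
        final String exposedClass;

        Advertisement(String url, String exposedClass) {
            this.url = url;
            this.exposedClass = exposedClass;
        }
    }

    private final List<Advertisement> advertisements = new ArrayList<>();
    private final Map<String, Function<String, String>> endpoints = new HashMap<>();

    // Advertisement: the provider publishes metadata describing its service.
    void advertise(Advertisement ad, Function<String, String> endpoint) {
        advertisements.add(ad);
        endpoints.put(ad.url, endpoint);
    }

    // Discovery: return the URLs of services whose metadata matches the criterion.
    List<String> discover(String exposedClass) {
        List<String> urls = new ArrayList<>();
        for (Advertisement ad : advertisements) {
            if (ad.exposedClass.equals(exposedClass)) {
                urls.add(ad.url);
            }
        }
        return urls;
    }

    // Invocation: call the service registered under the given URL.
    String invoke(String url, String request) {
        Function<String, String> endpoint = endpoints.get(url);
        if (endpoint == null) {
            throw new IllegalArgumentException("Unknown service: " + url);
        }
        return endpoint.apply(request);
    }

    public static void main(String[] args) {
        ServiceRegistrySketch registry = new ServiceRegistrySketch();
        registry.advertise(new Advertisement("http://a.example.org/svc", "Agent"),
                request -> "A:" + request);
        for (String url : registry.discover("Agent")) {
            System.out.println(url + " -> " + registry.invoke(url, "ping"));
        }
    }
}
```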
23 [Figure labels: NCI; Cancer Centers; other caBIG service providers; other toolkits]
24 caGrid Components
- Leverage existing technologies
  - caDSR, EVS, Mobius GME: common data elements, controlled vocabularies, schema management
  - Globus Toolkit (currently version 4.0.1)
    - Core grid services infrastructure
    - Service deployment, service registry, invocation, base security infrastructure
- Additional Core Infrastructure
  - Higher-level security services (Dorian)
  - Grid service access to metadata components (caDSR, GME, etc.)
  - Workflow, Identifier services
- Service Provider Tooling (Introduce)
  - Graphical service development and configuration environment
  - Abstractions from service infrastructure for Data and Analytical services
  - Deployment wizards
- Client Tooling
  - High-level APIs for interacting with core components and services
  - Graphical Tools
25 caGrid 0.5 Architecture (may be updated for 1.0)
[Architecture diagram; labels: Functions, Quality of Service, Business Process, Semantic service, ID Resolution, GUMS, Analytical, UI, Security, Resource Management, caDSR, Service Registry, Service, GSI, OGSA-DAI, GT3, GME, Index, Service Description, Grid Communication Protocol, GLOBUS Toolkit, CAMS, Transport, EVS]
26 Data Object Semantics, Metadata, and Schemas
- Object oriented, APIs, well-defined data types
- Classes defined in UML and converted into ISO/IEC 11179, registered in the caDSR
- Definitions drawn from Enterprise Vocabulary Services (EVS), relationships semantically described
- XML serialization of objects adheres to XML schemas registered in the Global Model Exchange (GME)
27 Introduce Toolkit
- A framework which enables fast and easy creation of caGrid-compatible services, whether they are data, analytical, custom, or core services.
- Provide easy-to-use graphical service authoring tools.
- Hide all grid-ness from the developer so that they can concentrate on the domain expert implementation.
- Utilize best-practice layered grid service architectures.
- Handle all service architecture requirements of the caGrid.
  - Strong service interface data typing
  - Metadata and service registration
  - Grid security integration
28 Data Service Access on caGrid
- Specialization of caGrid grid services to expose data through a common query interface
- Present an object view of data sources
- Exposed objects are registered in caDSR and their XML representation in GME
- Queries made with caBIG Query Language (CQL) Query objects
- Results returned as objects (or identifiers) nested in a CQL Query Result Set
29 Data Service Query Language
30 Data Service Interface
public CQLQueryResultsType processQuery(CQLQueryType query)
- A Data Provider's only responsibility is to implement CQL over their local data resource
- A default implementation will be provided for caCORE SDK-created systems
- caGrid provides a grid service implementation to invoke the provider's CQL implementation
- Service provides all features necessary for compliance, such as advertisement of data service metadata and security integration
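A provider-side implementation of `processQuery` might look roughly like the sketch below. Only the method signature comes from the slide. The two type classes here are simplified stand-ins that reuse the slide's names, and their fields, the `InMemoryDataService` class, and the `ExampleAgent` record are assumptions for illustration, not the real caGrid types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DataServiceSketch {

    // Simplified stand-in for the query type named on the slide; fields are assumptions.
    static class CQLQueryType {
        final String targetClass;
        final String attribute;
        final String value;

        CQLQueryType(String targetClass, String attribute, String value) {
            this.targetClass = targetClass;
            this.attribute = attribute;
            this.value = value;
        }
    }

    // Simplified stand-in for the result type; matching objects are attribute maps.
    static class CQLQueryResultsType {
        final List<Map<String, String>> objects;

        CQLQueryResultsType(List<Map<String, String>> objects) {
            this.objects = objects;
        }
    }

    // The single operation a data provider implements, as on the slide.
    interface DataService {
        CQLQueryResultsType processQuery(CQLQueryType query);
    }

    // Hypothetical provider answering queries from an in-memory list of records.
    static class InMemoryDataService implements DataService {
        private final String exposedClass;
        private final List<Map<String, String>> records;

        InMemoryDataService(String exposedClass, List<Map<String, String>> records) {
            this.exposedClass = exposedClass;
            this.records = records;
        }

        @Override
        public CQLQueryResultsType processQuery(CQLQueryType query) {
            List<Map<String, String>> matches = new ArrayList<>();
            if (exposedClass.equals(query.targetClass)) {
                for (Map<String, String> record : records) {
                    if (query.value.equals(record.get(query.attribute))) {
                        matches.add(record);
                    }
                }
            }
            return new CQLQueryResultsType(matches);
        }
    }

    // Builds one record with the two attributes of the deck's Agent example.
    static Map<String, String> agent(String name, String nscNumber) {
        Map<String, String> m = new HashMap<>();
        m.put("name", name);
        m.put("nSCNumber", nscNumber);
        return m;
    }

    public static void main(String[] args) {
        DataService service = new InMemoryDataService("Agent",
                Arrays.asList(agent("Taxol", "007"), agent("ExampleAgent", "008")));
        CQLQueryResultsType results =
                service.processQuery(new CQLQueryType("Agent", "name", "Taxol"));
        System.out.println(results.objects);
    }
}
```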
31 Data Service Query Scenario
- Client builds a CQL Query
- CQL Query is serialized and submitted to the Grid Data Service
- Grid Data Service deserializes the CQL Query Object and processes it
- Data Source is queried by the Grid Data Service
- Grid Data Service builds a CQL Result Set
- Result Set is serialized and returned to the client
- Client deserializes result set
- Result set is iterated with client tools to retrieve objects
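The round trip in this scenario can be traced in a few lines. The sketch below is a toy under stated assumptions: a semicolon-delimited string stands in for the serialized query and a comma-delimited string for the serialized result set, whereas the deck describes XML serialization. The class `QueryScenarioSketch` and all method names are hypothetical, and values containing the delimiters are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class QueryScenarioSketch {

    // Client side: build and serialize the query (hypothetical wire format).
    static String serializeQuery(String targetClass, String attribute, String value) {
        return String.join(";", targetClass, attribute, value);
    }

    // Service side: deserialize the query, consult the data source, serialize the result set.
    static String handle(String wireQuery, List<String> agentNames) {
        String[] parts = wireQuery.split(";", -1);       // deserialize the query
        List<String> matches = new ArrayList<>();
        if (parts.length == 3 && parts[0].equals("Agent") && parts[1].equals("name")) {
            for (String name : agentNames) {             // query the data source
                if (name.equals(parts[2])) {
                    matches.add(name);
                }
            }
        }
        return String.join(",", matches);                // build and serialize the result set
    }

    // Client side: deserialize the result set into a list that can be iterated.
    static List<String> deserializeResults(String wireResults) {
        List<String> results = new ArrayList<>();
        if (!wireResults.isEmpty()) {
            results.addAll(Arrays.asList(wireResults.split(",")));
        }
        return results;
    }

    // Runs the whole scenario against an in-memory data source.
    static List<String> run(List<String> agentNames, String wantedName) {
        String wireQuery = serializeQuery("Agent", "name", wantedName);
        String wireResults = handle(wireQuery, agentNames);
        return deserializeResults(wireResults);
    }

    public static void main(String[] args) {
        for (String name : run(Arrays.asList("Taxol", "ExampleAgent"), "Taxol")) {
            System.out.println(name);
        }
    }
}
```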
32 Federated and Aggregated Queries
- Componentized library being developed to facilitate limited federation and aggregation of queries
- An extension language used to describe distributed queries
- Library creates and executes a Query Plan for the distributed query, using multiple CQL queries to targeted data services
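As a rough picture of what a query plan does, the sketch below sends one query to several data services and merges the answers. It does not model the extension language. The class `FederatedQuerySketch`, the `PlanStep` shape, the use of plain functions as services, and the choice to drop duplicate results are all assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

class FederatedQuerySketch {

    // One step of a plan: a query aimed at one data service (hypothetical shape).
    static class PlanStep {
        final Function<String, List<String>> service;
        final String query;

        PlanStep(Function<String, List<String>> service, String query) {
            this.service = service;
            this.query = query;
        }
    }

    // Planning: target every listed service with the same query.
    static List<PlanStep> plan(String query, List<Function<String, List<String>>> services) {
        List<PlanStep> steps = new ArrayList<>();
        for (Function<String, List<String>> service : services) {
            steps.add(new PlanStep(service, query));
        }
        return steps;
    }

    // Execution: run each step and aggregate, keeping first-seen order without duplicates.
    static List<String> execute(List<PlanStep> steps) {
        Set<String> merged = new LinkedHashSet<>();
        for (PlanStep step : steps) {
            merged.addAll(step.service.apply(step.query));
        }
        return new ArrayList<>(merged);
    }

    public static void main(String[] args) {
        List<Function<String, List<String>>> services = new ArrayList<>();
        services.add(query -> Arrays.asList("a1", "shared"));   // stand-in for one data service
        services.add(query -> Arrays.asList("shared", "b1"));   // stand-in for another
        System.out.println(execute(plan("any query", services)));
    }
}
```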
33 Data Service Client Tooling
- APIs provided to discover available data services on the grid based on client-defined criteria (such as exposed data models and concepts)
- Object-oriented API for building queries, querying a given data service, and processing the results
- Client tools available to iterate query result sets
  - Object iterator deserializes XML into registered objects
  - XML iterator simply returns XML documents
34 Acknowledgements (caGrid Team)
- Ohio State University - Department of BioMedical Informatics
  - Dave Ervin
  - Shannon Hastings
  - Tahsin Kurc
  - Stephen Langella
  - Scott Oster
  - Joel Saltz
- Argonne National Lab / University of Chicago
  - William Allcock
  - Jarek Gawor
  - Ravi Madduri
  - Frank Siebenlist
  - Michael Wilde
- Duke University
  - A. Jamie Cuticchia
  - Patrick McConnell
- Georgetown University
  - Colin Freas
  - Paul A. Kennedy
  - Chad La Joie
- SAIC (http://www.saic.com)
  - Manav Kher
- ScenPro/Semantic Bits
  - Vinay Kumar
  - David Wellborn
  - Valerie Bragg
- Booz Allen Hamilton (http://www.bah.com)
  - Arumani Manisundaram
  - Michael Keller
  - Reechik Chatterjee
35 Acknowledgements
- NCI: Andrew von Eschenbach, Anna Barker, Wendy Patterson, OC, DCTD, DCB, DCP, DCEG, DCCPS, CCR
- Industry Partners: SAIC, BAH, Oracle, ScenPro, Ekagra, Apelon, Terrapin Systems, Panther Informatics
- NCICB: Ken Buetow, Peter Covitz, George Komatsoulis, Denise Warzel, Frank Hartel, Sherri De Coronado, Dianne Reeves, Gilberto Fragoso, Jill Hadfield, Leslie Derr