Title: i2b2 Clinical Research Chart and Hive Architecture
1 i2b2 Clinical Research Chart and Hive Architecture
- Henry Chueh
- Shawn Murphy
- Isaac Kohane, PI
2 Summary
- Background
- Intro to the Clinical Research Chart (CRC)
- Hive / Cell Software Architecture
- More details on establishing and using the CRC
3 Background
- Clinical documentation is clinical
- Lack of systematic approach for organizing clinical data for research
- Ownership issues are unique
- Consent issues are a challenge
4 Driving Biological Projects
- Asthma
- Hypertension
- Huntington's Disease
- Diabetes
5 Clinical Research Chart (CRC)
- Organize and transform clinical data to maximize its utility for research
- Develop an application and database framework to serve this goal
- Establish an architecture that allows data from different studies done on this platform to be integrated
6 Design of Clinical Research Chart
(Diagram labels: CRC DB; HL7 (MSH/736401.. PID1023231285..); Text files; XML (<Patient1> <image>..); database)
7 Design of Clinical Research Chart
(Diagram labels: Data pipeline/workflow application; Pheno/Genotype Database; CRC DB; HL7 (MSH/736401.. PID1023231285..); Text files; XML (<Patient1> <image>..); database; Visualization and Analysis of database contents)
8 i2b2 Skeletal Data Flow
(Diagram labels: EDC Service; EDC applications; Shared data; Enterprise data source (RPDR); Clinical Research Chart; i2b2 ETL workflow; Annotation Service; Study specific data; Annotation UI; Analytic workflow; Enterprise Systems: Registration, ADT, Labs, Reports, Clinical Notes, etc.; Local Systems: systems not gathered into enterprise data warehouses)
9 Overall Themes
- Framework to allow development of application services in a maximally decoupled fashion
- Linux and Windows OS support
- Java and C programming languages
- Use cases for construction of CRC come from Driving Biology Projects and experience with clients of Partners Research Patient Data Registry
10 Focus on Workflow
- Necessary for both pre-CRC and post-CRC processes
- Needed for scientific flexibility
- Implies a consistent environment for data
pipelining and flow control
11 i2b2 Hive
- Formed as a collection of interoperable Cells, or services
- Loosely coupled
- Makes no assumptions about proximity
- Connected by Web services
- Activated by a workflow engine that forms the basis of choreography among Cells for complex interactions
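
The Hive idea can be sketched in a few lines. This is a minimal in-process illustration, not i2b2 code: the `Hive` class, the two example Cells, and the plan format are all hypothetical, and a plain function call stands in for what would be a Web service request between Cells.

```python
from typing import Callable, Dict, List

# A Cell is modeled as a function from message to message. In a real Hive
# each call would be a Web service request; a direct call stands in here.
Cell = Callable[[dict], dict]


class Hive:
    """Hypothetical registry of Cells plus a trivial workflow engine."""

    def __init__(self) -> None:
        self._cells: Dict[str, Cell] = {}

    def register(self, name: str, cell: Cell) -> None:
        self._cells[name] = cell

    def run(self, plan: List[str], message: dict) -> dict:
        # Choreography: pass the message through each named Cell in order.
        for name in plan:
            message = self._cells[name](message)
        return message


def extract_concepts(message: dict) -> dict:
    # Toy stand-in for concept extraction: naive keyword spotting.
    text = message["text"].lower()
    found = [term for term in ("asthma", "smoker") if term in text]
    return {**message, "concepts": found}


def count_concepts(message: dict) -> dict:
    return {**message, "concept_count": len(message["concepts"])}


hive = Hive()
hive.register("extract_concepts", extract_concepts)
hive.register("count_concepts", count_concepts)
result = hive.run(
    ["extract_concepts", "count_concepts"],
    {"text": "Patient with asthma, former smoker."},
)
print(result["concepts"], result["concept_count"])
```

Because the plan refers to Cells only by name, a Cell could be swapped for a remote implementation without changing the plan, which is the sense of "loosely coupled" assumed in this sketch.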
12 Complex choreography
13 i2b2 Cell
- Behaves as a functional service
- Separates interactions conceptually into transactions and semantics
- Focuses on facilitating transactions with simple semantics (e.g., datatype)
- Leaves deep semantics to be defined by the services provided by a Cell
- Does not restrict language implementation
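
One way to picture the split between transactions and semantics is an envelope whose only platform-visible property is a simple datatype. The sketch below is an assumption for illustration: `Envelope`, `deliver`, the datatype list, and the normalization Cell are hypothetical names, not i2b2 protocol definitions.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical set of simple datatypes the transaction layer understands.
SIMPLE_DATATYPES = {"text", "json", "xml"}


@dataclass(frozen=True)
class Envelope:
    cell: str       # name of the destination Cell
    datatype: str   # simple semantics, visible to the platform
    payload: str    # deep semantics, opaque to the platform


def deliver(
    envelope: Envelope, cells: Dict[str, Callable[[Envelope], Envelope]]
) -> Envelope:
    """Transaction layer: check the datatype and route; never parse the payload."""
    if envelope.datatype not in SIMPLE_DATATYPES:
        raise ValueError(f"unsupported datatype: {envelope.datatype}")
    return cells[envelope.cell](envelope)


def normalization_cell(envelope: Envelope) -> Envelope:
    """Cell layer: only the Cell knows what the payload means."""
    values = json.loads(envelope.payload)["intensities"]
    mean = sum(values) / len(values)
    body = {"normalized": [v / mean for v in values]}
    return Envelope(cell="caller", datatype="json", payload=json.dumps(body))


cells = {"normalize": normalization_cell}
request = Envelope("normalize", "json", json.dumps({"intensities": [1.0, 2.0, 3.0]}))
reply = deliver(request, cells)
print(reply.payload)
```

Since `deliver` only looks at `datatype`, a Cell written in any language could sit behind the same envelope, as long as it accepts and returns the agreed datatypes.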
14 Target layer for i2b2
Semantic Objects
i2b2 platform
Web Services
TCP/IP
15 Cell examples
- Concept extraction from clinical narratives
- Simple transformations, e.g., basic text format conversion
- Complex encoding, e.g., encoding MIAME in MAGE
- Microarray data normalization
16 Exposing Cells
- Protocols layered on top of SOAP
- At the WSDL level for integrators, i.e., bioinformaticians and software engineers
- At a functional level for investigators
- i2b2 toolkits to allow integrators to expose controlled functionality to investigators (Automator)
17 Automator Approach
(Diagram labels: Extend Kepler workflow engine; informaticians; i2b2 Automator; investigators)
18 Bird's-eye view
(Diagram labels: Investigator Portal; Workflow engine; CRC Repository)
19 Current Implementation
- Extending Kepler workflow engine for i2b2
- Data model for CRC repository
- Defining protocols necessary for interaction (in addition to SOAP)
- Created Cell for concept extraction from narratives
- Early designs for Automator toolkit
20 i2b2 Architecture Key Points
- Leverage existing workflow standards and software
- Use Web services as basic form of interaction
- Assume unlimited choreography, but provide tools to distill complexity into basic automation for clinical investigators
21 SW Licensing and Distribution
- Commit to Open Source software
- Use GNU Lesser General Public License
- Establish local i2b2 repository exposed through i2b2 website
- Contribute to a more global NCBC SourceForge-style repository if it emerges (NIH Forge?)
- Keep i2b2 protocols fully open
22 Interoperability across NCBC
- Strongly consider Web services as basic protocol for generic shared interactions
- Consider sharing datasets
- Promote diversity of approach and use of shared software (don't impose uniformity)
- Facilitate/promote NCBC Open Source project teams
23 Pre-CRC Data Pipeline/Workflow
- Populating the Clinical Research Chart (CRC)
24 Pre-CRC Data Pipeline/Workflow
- Use workflow framework to choreograph application services in specific sequences
- Used to extract, transform, conform, and load data and metadata into the CRC
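
The four stages named on this slide can be sketched as plain functions applied in sequence. Everything here is an assumption for illustration: the pipe-delimited input format, the local codes `SMK` and `NSK`, the mapping table, and the in-memory list standing in for the CRC are hypothetical; only the concept codes `CT-A-SMK` and `CT-A-NSK` appear elsewhere in this deck.

```python
from datetime import date, datetime
from typing import Dict, List

# Hypothetical raw export: patient id | local code | date (month/day/year).
RAW_ROWS = ["Z234|SMK|1/1/1997", "Z234|NSK|1/1/2002"]

# Hypothetical mapping from local codes to the common vocabulary.
LOCAL_TO_COMMON: Dict[str, str] = {"SMK": "CT-A-SMK", "NSK": "CT-A-NSK"}


def extract(rows: List[str]) -> List[List[str]]:
    # Extract: split each raw line into its fields.
    return [row.split("|") for row in rows]


def transform(records: List[List[str]]) -> List[dict]:
    # Transform: name the fields and parse dates into date objects.
    return [
        {
            "patient_id": pid,
            "local_cd": code,
            "start_date": datetime.strptime(day, "%m/%d/%Y").date(),
        }
        for pid, code, day in records
    ]


def conform(records: List[dict]) -> List[dict]:
    # Conform: replace local codes with common-vocabulary concept codes.
    conformed = []
    for rec in records:
        rec = dict(rec)
        rec["concept_cd"] = LOCAL_TO_COMMON[rec.pop("local_cd")]
        conformed.append(rec)
    return conformed


def load(records: List[dict], crc: List[dict]) -> int:
    # Load: append to the target store and report how many rows were added.
    crc.extend(records)
    return len(records)


crc_store: List[dict] = []
loaded = load(conform(transform(extract(RAW_ROWS))), crc_store)
first_date: date = crc_store[0]["start_date"]
print(loaded, crc_store[0]["concept_cd"], first_date.isoformat())
```

In a workflow framework each function would be a separate application service, and the nesting in the last lines would be replaced by the workflow's execution plan.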
25 Pre-CRC Data Pipeline/Workflow
(Diagram labels: Services: Ontology, Consent/Tracking, Application Pool, Management; Soap/Http interfaces; Output; Input; Data flowing; Local or through SOAP service; Custom Interfaces; A program; increasingly useful)
26 Ontology Service
- Manages mappings of terms to common vocabularies
- Provides lists of acceptable (enumerated) values for various attribute and value slots
- Allows for management of hierarchies, groupings, and relationships between terms
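
The three responsibilities listed for the Ontology Service can be illustrated with a small in-memory class. This is a sketch under stated assumptions: the class and method names, the synonym table, and the allowed values for `Sex_cd` are hypothetical; the concept codes and backslash-delimited paths follow the smoking use case shown later in this deck.

```python
from typing import Dict, List, Optional, Set


class OntologyService:
    """Hypothetical in-memory stand-in for an ontology service."""

    def __init__(self) -> None:
        # Term mappings to a common vocabulary (assumed synonyms).
        self._term_to_code: Dict[str, str] = {
            "smoker": "CT-A-SMK",
            "never smoked": "CT-A-NSK",
        }
        # Hierarchy encoded as backslash-delimited concept paths.
        self._code_to_path: Dict[str, str] = {
            "CT-A-SMK": r"AsthV1\DRptNLP\Tobacco Use\Smoker",
            "CT-A-NSK": r"AsthV1\DRptNLP\Tobacco Use\Non smoker",
        }
        # Enumerated values per attribute slot (assumed list).
        self._allowed: Dict[str, Set[str]] = {"Sex_cd": {"Female", "Male"}}

    def map_term(self, term: str) -> Optional[str]:
        return self._term_to_code.get(term.strip().lower())

    def is_allowed(self, slot: str, value: str) -> bool:
        return value in self._allowed.get(slot, set())

    def codes_under(self, path_prefix: str) -> List[str]:
        # All concepts whose path lies below the given node.
        prefix = path_prefix.rstrip("\\") + "\\"
        return sorted(
            code
            for code, path in self._code_to_path.items()
            if path.startswith(prefix)
        )


ontology = OntologyService()
print(ontology.map_term("Smoker"))
print(ontology.is_allowed("Sex_cd", "Female"))
print(ontology.codes_under(r"AsthV1\DRptNLP\Tobacco Use"))
```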
27 Person Consent/Tracking Service
- Provides mappings between patient/subject identifiers
- Tracks patient/subject consent information
- Allows identification of the patient/subject based upon fuzzy demographic matches
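
The slide does not say how fuzzy demographic matching is done. As one possible illustration, the sketch below requires an exact birth date and scores name similarity with Python's standard `difflib.SequenceMatcher`. The registry contents, subject identifiers, field names, and the 0.85 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical registry of known subjects (invented names and dates).
REGISTRY = [
    {"subject_id": "S001", "name": "John Smith", "birth_date": "1950-01-01"},
    {"subject_id": "S002", "name": "Mary Jones", "birth_date": "1962-07-15"},
]


def name_similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_subject(
    name: str, birth_date: str, threshold: float = 0.85
) -> Optional[str]:
    """Return the best-matching subject id, or None if nothing is close enough."""
    best_id, best_score = None, 0.0
    for record in REGISTRY:
        if record["birth_date"] != birth_date:
            continue  # birth date must match exactly in this sketch
        score = name_similarity(name, record["name"])
        if score > best_score:
            best_id, best_score = record["subject_id"], score
    return best_id if best_score >= threshold else None


print(find_subject("Jon Smith", "1950-01-01"))
print(find_subject("Jon Smith", "1951-01-01"))
```

A production service would likely weigh several demographic fields and handle ties; the fixed threshold here only shows where such a policy would sit.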
28 Application Pool (CVS) Service
- Stores programs/scripts used in pipeline
- Provides applications to be downloaded when needed
- Manages versioning of software
- Provides documentation
29 Management Service
- Stores workflow execution plan
- Starts and controls workflow execution
- Schedules workflow execution
- Monitors workflow execution and data locations
- Controls permissions associated with workflow
execution
30 Data Pipeline/Workflow Application: Use Case for Asthma Data
(Diagram labels: RPDR; CRC DB; AsthmaMart; Data retrieval; Language processing; Load Data into Mart; Data de-identification; Vocabulary matching)
31 Data Pipeline/Workflow: Implementation
- Define standard XML representation for workflow: MoML
- Define standards for SOAP services and resource discovery
- Adopt and extend open source workflow package (Kepler)
- Prototypes by July timeframe
- BIRN -> NAMIC and LONI collaboration
- Can follow construction details at http://diagon/i2b2
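
To give a feel for an XML workflow representation, the sketch below parses a small MoML-style document with Python's standard `xml.etree.ElementTree`. The `entity`, `relation`, and `link` elements are MoML constructs; the actor names, the `org.example.*` classes, and the convention that ports are called `input` and `output` are assumptions made for this example, and the usual DOCTYPE declaration is omitted.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical three-step workflow written in MoML-style XML.
MOML = """<entity name="AsthmaPipeline" class="ptolemy.actor.TypedCompositeActor">
  <entity name="DataRetrieval" class="org.example.DataRetrieval"/>
  <entity name="LanguageProcessing" class="org.example.LanguageProcessing"/>
  <entity name="VocabularyMatching" class="org.example.VocabularyMatching"/>
  <relation name="r1" class="ptolemy.actor.TypedIORelation"/>
  <relation name="r2" class="ptolemy.actor.TypedIORelation"/>
  <link port="DataRetrieval.output" relation="r1"/>
  <link port="LanguageProcessing.input" relation="r1"/>
  <link port="LanguageProcessing.output" relation="r2"/>
  <link port="VocabularyMatching.input" relation="r2"/>
</entity>"""

root = ET.fromstring(MOML)

# Actors are the entities nested directly inside the composite.
actors = [entity.get("name") for entity in root.findall("entity")]

# Group linked ports by the relation that connects them.
ports_by_relation = defaultdict(list)
for link in root.findall("link"):
    ports_by_relation[link.get("relation")].append(link.get("port"))

# Derive actor-to-actor edges from output ports to input ports.
edges = []
for ports in ports_by_relation.values():
    sources = [p.rsplit(".", 1)[0] for p in ports if p.endswith(".output")]
    sinks = [p.rsplit(".", 1)[0] for p in ports if p.endswith(".input")]
    edges.extend((src, dst) for src in sources for dst in sinks)

print(actors)
print(edges)
```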
32 Phenotype/Genotype Database
33 Phenotype/Genotype Database: Principles
- Analytical database schema that does not need to change with new data types and concepts
- Defined fundamental unit of data (atomic fact): the observation
- Defined metadata strategy
- Various levels of de-identification (reviewed and approved by IRB)
34 Phenotype/Genotype Database: Architecture
(see preprint)
35 Phenotype/Genotype Database: Use Case
- Smoking observations represented in database

Provider_id | Provider_path | Name_char
M0022303 | MGH\Neurology\M0022303 | M0022303

Concept_cd | Concept_path | Name_char
CT-A-SMK | AsthV1\DRptNLP\Tobacco Use\Smoker | Smoking
IC9-3051 | V2\Diagnosis\Mental Disorders (290-319)\Non-psychotic disorders (300-316)\(305) Nondependent abuse of drugs\(305-1) Tobacco use disorder\(305-11) Tobacco use disorder, co | Tobacco Use Disorder, continuous use
CT-A-NSK | AsthV1\DRptNLP\Tobacco Use\Non smoker | Never smoked

Patient_id_e | Concept_cd | Start_date | Provider_id | Confidence_num
Z234 | CT-A-SMK | 1/1/1997 | M0022303 | 3
Z234 | CT-A-SMK | 1/1/1998 | M0034125 | 9
Z234 | IC9-3051 | 1/1/2001 | M0022303 | 3
Z234 | CT-A-NSK | 1/1/2002 | M0034125 | 9

Patient_id_e | Birth_date | Sex_cd | Race_cd | Death_date
Z234 | 3/4/1924 | Female | Black | 4/5/2003
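
The smoking example can be loaded into an in-memory SQLite database to show how observations are stored as rows and queried through a join. This is a sketch, not the CRC schema: the table names `concept` and `observation` are assumptions, column names follow the tables on this slide, and dates are rewritten in ISO format so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE concept (
        concept_cd TEXT PRIMARY KEY,
        name_char  TEXT
    );
    CREATE TABLE observation (
        patient_id_e   TEXT,
        concept_cd     TEXT,
        start_date     TEXT,
        provider_id    TEXT,
        confidence_num INTEGER
    );
    """
)

# Rows taken from the slide; dates converted to ISO (YYYY-MM-DD).
conn.executemany(
    "INSERT INTO concept VALUES (?, ?)",
    [
        ("CT-A-SMK", "Smoking"),
        ("IC9-3051", "Tobacco Use Disorder, continuous use"),
        ("CT-A-NSK", "Never smoked"),
    ],
)
conn.executemany(
    "INSERT INTO observation VALUES (?, ?, ?, ?, ?)",
    [
        ("Z234", "CT-A-SMK", "1997-01-01", "M0022303", 3),
        ("Z234", "CT-A-SMK", "1998-01-01", "M0034125", 9),
        ("Z234", "IC9-3051", "2001-01-01", "M0022303", 3),
        ("Z234", "CT-A-NSK", "2002-01-01", "M0034125", 9),
    ],
)

# High-confidence observations for one patient, in date order.
rows = conn.execute(
    """
    SELECT o.start_date, c.name_char
    FROM observation AS o
    JOIN concept AS c ON c.concept_cd = o.concept_cd
    WHERE o.patient_id_e = ? AND o.confidence_num >= ?
    ORDER BY o.start_date
    """,
    ("Z234", 9),
).fetchall()
print(rows)
```

In this layout a new kind of finding is a new row in `concept` plus rows in `observation`; no column is added, which is one reading of a schema that does not need to change with new data types.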
36 Phenotype/Genotype Database: Implementation
- Asthma CRC DB primed with data from 90,000 patients from Research Patient Data Registry
- Serves as fundamental data structure for i2b2-supported data Querying and Visualization Application Suite
- CRC DBs able to fuse seamlessly together
- Various levels of de-identification to be supported for data sharing and publication
37 Visualization and Analysis of CRC database
38 Visualization and Analysis: Principles
- Supported application suite to query and view CRC database contents
- Outside applications for analysis and viewing able to plug in to application suite
- Pipeline/Workflow framework may be used for analysis and re-entry of derived data into CRC database
39 Visualization and Analysis: Architecture
- Supported applications: querying and visualization
- Standard querying
- Data exploration
40 Visualization and Analysis: Architecture
- Supported applications: ontology management
- Ontology Management
- Integrate (outside?) population analysis applications
41 Visualization and Analysis: Architecture
- Supported applications have plug-in architecture for outside analytic tools
- Standard web-link support with GET- and POST-oriented data transfer
- Support transfer of specifically transformed data to outside applications
- Complex analysis supported with workflow application
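
The GET- and POST-oriented hand-off to an outside tool can be sketched with Python's standard `urllib`. The base URL and the parameter names `patient_set` and `concept_cd` are hypothetical; the code only builds the link and the request object and does not contact any server.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical outside analysis tool and parameters.
BASE_URL = "http://example.org/analyze"
PARAMS = {"patient_set": 42, "concept_cd": "CT-A-SMK"}


def build_get_link(base_url: str, params: dict) -> str:
    # GET: parameters travel in the query string of a plain web link.
    return f"{base_url}?{urlencode(params)}"


def build_post_request(base_url: str, params: dict) -> Request:
    # POST: the same parameters travel form-encoded in the request body.
    body = urlencode(params).encode("ascii")
    return Request(base_url, data=body, method="POST")


link = build_get_link(BASE_URL, PARAMS)
post = build_post_request(BASE_URL, PARAMS)
print(link)
print(post.get_method(), post.data)
```

A GET link suits small selections that can be bookmarked or embedded in a page, while POST is the usual choice when the transferred data is too large or too sensitive for a URL.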
42 Visualization and Analysis: Architecture - Query
Launch
43 Visualization and Analysis: Architecture - Exploration
Launch
44 Visualization and Analysis: Architecture - Ontology management
45 Visualization and Analysis: Use Case
46 Visualization and Analysis: Implementation of analysis tools
- Workflow framework to accommodate external
analytic applications
(Diagram labels: CRC DB; patient id 0000004; subject id 4; account 347; ProgID AA3.3, CA2.3, CN2.3, XN0.9, SN5.4, CX2.3, PN5.1, TH3.0)
47 Final Assembly
(Figure: "Gene expression in APOE e4 Allele". A table with columns person, concept, date, and raw value lists entries for persons Z5937X and Z5956X on dates 3/4, 4/6, 5/2, and 3/9. Concepts shown include Surgery, Alzheimer's, ER visit, Seizures, Trauma, Gene-Chips, Clinic visits, Multiple sclerosis, Diabetes, CT Scan, Hemorrhage, and Thalamus; some raw values read "microarray (encrypted)". Additional label: Outcomes calculated every week.)