Title: i2b2: Harnessing The Healthcare System For Research In The Genomic Era
1i2b2: Harnessing The Healthcare System For Research In The Genomic Era
- Isaac Zak Kohane, MD, PhD
2Prismatic Example: PPARg Pro12Ala and diabetes
[Figure: forest plot of estimated risk (Ala allele) per study, on an axis running from 0.1 to 2.0, with a sample-size label. Studies shown: Oh et al., Deeb et al., Mancini et al., Clement et al., Hegele et al., Hasstedt et al., Lei et al., Ringel et al., Hara et al., Meirhaeghe et al., Douglas et al., Altshuler et al., Mori et al., and all studies combined.]
- Overall P value 2 x 10^-7, odds ratio 0.79 (0.72-0.86)
- Ala is protective
Courtesy J. Hirschhorn
3Global challenges
- Multiplicity of studies driven by the Human Genome Project
- Each study requires a cohort
- Genomic studies are failing because of insufficient sample size
- Genomics are commoditized
- Phenomics are irreducibly expensive
- Assembling new cohorts for each study is unaffordable
- Imposed time delays
- Redundant phenotyping
- Millions of dollars per study
4Global Challenge
- Some questions will not yield even with large populations
- Much hard-won genomic data lies fallow post-publication
- Integration (the Integrome) across data modalities provides leverage
- Identifying the true underlying pathobiologies
- THREE Vignettes
5Courtesy Atul Butte
[Figure labels: Concepts (CUI); Synonyms and abbreviations; Source vocabulary and term; Defined relations (not all); Statistical relations (not all, 4K for diabetes mellitus)]
6Gene-Driven Nosology
UMLS
GEO
Butte et al, Nat Biotech. 06
7Integrome (contd)
- Combining expression snapshots of human disease
and time course of model organism development
8Lung Cancer
9In-class as well as between class
Liu, H et al. PLoS Med, 2006
[Figure: survival curves, survival probability against survival time (months). Panels: all ADs (125); stage I ADs (64), split into higher PC1 (N = 51) and lower PC1 (N = 13), p = 0.0169. Panel labels reference genomic PC1 and genomic PC3.]
10Integrome (contd)
- Expression and copy number
11Tumor gene expression profile
- Method: Functional Aneuploidy
- Map mRNA measurements to genomic loci.
- For each cytogenetic sub-band, calculate a statistic summarizing net expression of genes encoded in that region.
- Calculate the sum of the magnitudes of these statistics to derive total functional aneuploidy (tFA)
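The three steps above can be sketched in a few lines. This is a minimal illustration, not the published method: the slide does not say which per-band statistic is used, so a one-sample t-like statistic (mean divided by standard error) is assumed here, and the gene names, band names and expression values are hypothetical.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def band_statistic(values):
    """Assumed summary statistic: mean / standard error of the band's values."""
    if len(values) < 2:
        return 0.0  # too few genes to estimate a spread
    sd = stdev(values)
    if sd == 0:
        return 0.0  # identical values: treated as no signal (a simplification)
    return mean(values) / (sd / sqrt(len(values)))


def total_functional_aneuploidy(expression, gene_to_band):
    """Return (tFA, per-band statistics) for one tumor profile.

    expression:   gene -> standardized expression value
    gene_to_band: gene -> cytogenetic sub-band label
    """
    # Step 1: map each mRNA measurement to its genomic locus (sub-band).
    by_band = defaultdict(list)
    for gene, value in expression.items():
        band = gene_to_band.get(gene)
        if band is not None:
            by_band[band].append(value)
    # Step 2: one statistic per sub-band summarizing net expression.
    band_stats = {band: band_statistic(vals) for band, vals in by_band.items()}
    # Step 3: sum of magnitudes gives tFA.
    tfa = sum(abs(s) for s in band_stats.values())
    return tfa, band_stats


# Hypothetical toy input: two genes in each of two sub-bands.
expression = {"g1": 1.0, "g2": 3.0, "g3": -2.0, "g4": -4.0}
gene_to_band = {"g1": "bandA", "g2": "bandA", "g3": "bandB", "g4": "bandB"}
tfa, band_stats = total_functional_aneuploidy(expression, gene_to_band)
```

Taking magnitudes before summing means that regions of net over-expression and net under-expression both add to the total instead of cancelling.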
12Result CIN25 signature is a significant predictor
of prognosis in 10 of 16 cancer datasets tested,
representing 6 types of human cancer.
13Challenge
- Tissue pathology is revealed in the tissues affected
- Healthy individuals do not readily give up most body tissues
- This reduces apparent utility of molecular markers
- Can we measure molecular signatures in expression and protein in major diseases?
14Diabetes Mellitus
- Mitochondrial phenotype of DM well established in fat and muscle and liver
- Current mainstay of DM, glycohemoglobin, does not distinguish several physiological states
- Can we measure the mitochondrial phenotype and response to therapy using only WBC?
15Diabetes-Peripheral Signature
16Huntington's Disease
- Dominant neurodegenerative disorder caused by expansion of a CAG trinucleotide repeat in the HD gene
- Correlation but wide variance in clinical course with CAG repeat length
- Goal: Define characteristics of the trigger mechanisms and target pathways correlated with the increasing polyglutamine tract length
- Structural biology approach
- WBC expression approach
17Pathways with decreasing activity as CAG length
increases
28,372 genes hgu133 Plus 2.0
18Challenge
- Early predecessors to PubMed, such as PaperChase, showed that doctors could and would perform searches and boolean operations to get the right publications.
- We do not have the equivalent facility for homology search, pathway intersection with expression, conservation against TFBS etc.
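As a rough illustration of what boolean operations over such searches could look like, the sketch below treats each search result as a set of gene identifiers and combines them with AND, OR and NOT. The function name, the gene identifiers and the three result sets are hypothetical; no real search service is called.

```python
def combine(op, left, right):
    """Apply a boolean operation to two result sets of gene identifiers."""
    if op == "AND":
        return left & right
    if op == "OR":
        return left | right
    if op == "NOT":
        return left - right  # in left but not in right
    raise ValueError(f"unknown operation: {op}")


# Hypothetical results of three independent searches.
homologs = {"GENE_A", "GENE_B", "GENE_C"}
pathway_members = {"GENE_B", "GENE_C", "GENE_D"}
differentially_expressed = {"GENE_C", "GENE_D", "GENE_E"}

# Homologs that sit in the pathway AND are differentially expressed.
hits = combine(
    "AND",
    combine("AND", homologs, pathway_members),
    differentially_expressed,
)
```

Because every step returns a plain set, results can be chained in the same way literature searches are narrowed step by step.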
19Pathway
20Gene-centric
21Combination Operations
22Challenge
- The 72 names of glucose and other synonymy, lack of synchrony
- Where do we get formal rigor to delineate relationships?
- How do we use this formal rigor?
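One simple way to picture the synonymy problem is a lookup that collapses many local names onto a single concept and refuses names that point to two concepts. The sketch below assumes that approach. The concept identifiers and local names are hypothetical and are not drawn from any real vocabulary.

```python
def normalize(term):
    """Lower-case, drop commas, and collapse whitespace."""
    return " ".join(term.lower().replace(",", " ").split())


def build_index(concepts):
    """Build name -> concept id, rejecting names claimed by two concepts."""
    index = {}
    for concept_id, names in concepts.items():
        for name in names:
            key = normalize(name)
            if key in index and index[key] != concept_id:
                raise ValueError(f"ambiguous term: {name!r}")
            index[key] = concept_id
    return index


def map_term(index, term):
    """Return the concept id for a local term, or None if unmapped."""
    return index.get(normalize(term))


# Hypothetical concept table with a few local spellings each.
concepts = {
    "LOCAL:GLUCOSE": ["Glucose", "Glucose, serum", "GLU", "Blood sugar"],
    "LOCAL:HBA1C": ["Glycohemoglobin", "HbA1c"],
}
index = build_index(concepts)
```

Unmapped terms come back as `None` rather than a guess, so they can be queued for review by whoever curates the mapping.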
23Protégé
24Mapping program
25Exploring an Integrated Multitude
- Integrating various data sources
- Diagnosis: ICD9, LMR, NLP
- Medications: Inpatient cost codes, LMR, NLP
- Smoking: ICD9, health maintenance, LMR, NLP
- Across millions (2.5M) of patients
- Where data may be sparse
- PREVIOUSLY: Researchers did not know what they COULD ask, let alone HOW
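A minimal sketch of the integration idea, assuming each source contributes (patient, concept, source) observations: the records are merged per patient, and a concept can then be required to have support from more than one source. The patient identifiers, concept labels and the two-source threshold are hypothetical.

```python
from collections import defaultdict


def integrate(observations):
    """Group (patient, concept, source) tuples as patient -> concept -> sources."""
    merged = defaultdict(lambda: defaultdict(set))
    for patient_id, concept, source in observations:
        merged[patient_id][concept].add(source)
    return merged


def supported_by(merged, concept, min_sources=2):
    """Patients whose record supports `concept` from at least `min_sources` sources."""
    return sorted(
        pid
        for pid, concepts in merged.items()
        if len(concepts.get(concept, ())) >= min_sources
    )


# Hypothetical observations drawn from different source types.
observations = [
    ("p1", "smoker", "ICD9"),
    ("p1", "smoker", "NLP"),
    ("p2", "smoker", "LMR"),
    ("p3", "diabetes", "ICD9"),
]
merged = integrate(observations)
```

Where data are sparse, lowering `min_sources` to 1 trades precision for coverage.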
29Challenge
- How to perform unsupervised learning in systems that contain heterogeneous data
- Discrete and continuous
- Asynchronous
- Sparse
- Codified and codified through parsing
- Biased by utilization
30Initial approaches
- Low hanging fruit first
- Remove spiking behavior through entropic filters
- Aggregate over related categories
- Use sampling frequencies based on the highest sampling frequency for a data type
- Renormalize laboratory tests that have shifting baselines.
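The slide does not define the entropic filter. One plausible reading, sketched here as an assumption, is to bin each variable's values, compute the Shannon entropy of the histogram, and drop variables whose entropy falls below a cutoff, since a series dominated by a single spike concentrates in very few bins. The bin count, cutoff and series names are hypothetical.

```python
from collections import Counter
from math import log2


def binned_entropy(values, bins=10):
    """Shannon entropy (bits) of an equal-width histogram of a non-empty series."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # constant series carries no information
    width = (hi - lo) / bins
    # Clamp the maximum value into the last bin.
    counts = Counter(min(int((v - lo) / width), bins - 1) for v in values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def entropy_filter(series_by_name, min_entropy, bins=10):
    """Keep only the series whose binned entropy reaches `min_entropy`."""
    return {
        name: vals
        for name, vals in series_by_name.items()
        if binned_entropy(vals, bins) >= min_entropy
    }


# Hypothetical data: one spiky series and one evenly spread series.
series = {
    "spiky": [0.0] * 19 + [100.0],
    "varied": [float(v) for v in range(20)],
}
kept = entropy_filter(series, min_entropy=1.0)
```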
31One Approach: Relevance Networks
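In broad terms, a relevance network scores every pair of variables for association and keeps only the pairs above a threshold as edges. The sketch below assumes squared Pearson correlation as the score; the slide names neither the measure nor a threshold, and the feature names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def relevance_network(features, threshold):
    """Return edges (a, b, r) for every pair with r squared at or above `threshold`."""
    edges = []
    for a, b in combinations(sorted(features), 2):
        r = pearson_r(features[a], features[b])
        if r * r >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical measurements over four patients.
features = {
    "x": [1, 2, 3, 4],
    "y": [2, 4, 6, 8],
    "z": [1, -1, 1, -1],
}
edges = relevance_network(features, threshold=0.8)
```

Squaring the correlation keeps strong negative associations as well as strong positive ones; the sign is preserved in the returned `r`.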
32Challenge
- Most phenotypic data in modern electronic healthcare systems is in the form of narrative text
- Texts are idiosyncratic, jargon filled, and vary in style by institution
33Approach
- Both statistical and hand-crafted pattern matching
- Implemented as modules in the widely used GATE framework.
- GATE itself is being wrapped in the Hive
- Scrubbing to maintain anonymity
- Validation of codification by sampling strategies with experts.
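To give a flavor of the hand-crafted side of this approach, the sketch below classifies smoking status from a note using ordered regular-expression rules. It is plain Python rather than a GATE module, and the patterns and labels are hypothetical examples, not the project's actual rules.

```python
import re

# Ordered rules: the first matching pattern decides the label.
# Negated and past-tense phrases are checked before the generic "smokes".
RULES = [
    ("non-smoker", re.compile(
        r"\b(never smoked|non-?smoker|denies (smoking|tobacco))\b", re.I)),
    ("past smoker", re.compile(
        r"\b(quit smoking|former smoker|ex-?smoker)\b", re.I)),
    ("current smoker", re.compile(
        r"\b(current smoker|smokes|continues to smoke)\b", re.I)),
]


def classify_smoking(note):
    """Return a smoking-status label for a free-text note, or 'unknown'."""
    for label, pattern in RULES:
        if pattern.search(note):
            return label
    return "unknown"


# Hypothetical note fragments.
examples = [
    "Patient is a nonsmoker.",
    "She quit smoking in 1998.",
    "Smokes 1 pack per day.",
    "No social history recorded.",
]
labels = [classify_smoking(text) for text in examples]
```

Rules of this kind are brittle across institutions, which is one reason expert review of sampled output matters.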
34NLP (and comedy) is not pretty
35But it works
- 96,000 asthma patients identified out of 2.5M PHS patients
- Stratified by severity, pharmaco-responsiveness and exposures
- Now with cases and controls (from extrema) reconsented and biomaterials obtained for genome-wide scans
36And we want to recruit more colleagues to the
challenge
- 16 groups have already responded to a
de-identification and smoking challenge.
37Challenge
- Multiplicity of tools
- Some local favorites, some universal favorites
- Little guidance on how or when to use them for clinical research
- Even less on how to use them together as an ensemble, process to perform clinical research
38The Hive Workflow framework
- Two APIs
- Interprocess
- Managed UI
- Sessions
- Security
39Workflow1
40Workflow2 (package data)
41Workflow Method dependent on data
42Workflow Cells with services and managed UI
GenopiaIframe
43Challenge
- Order n^2 IRB process with n institutions
- See it propagate during updates
- Prospective vs. just-in-time consent
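A small worked count shows why the pairwise process scales badly. The counting model is an assumption: if every pair of institutions needs its own agreement, n institutions need n(n-1)/2 of them, whereas a hub arrangement with one network-wide approval plus one approval per institution needs n + 1. The function names are hypothetical.

```python
def pairwise_agreements(n):
    """Agreements needed if every pair of n institutions must approve each other."""
    return n * (n - 1) // 2


def hub_approvals(n):
    """Approvals needed with one network-wide approval plus one per institution."""
    return n + 1


# Compare the two models as the network grows.
comparison = {n: (pairwise_agreements(n), hub_approvals(n)) for n in (4, 10, 20)}
```

With 4 institutions the two counts are close (6 versus 5), but at 20 institutions the pairwise model needs 190 agreements against 21 for the hub.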
44Aggregate-able and delegated approval
[Diagram: the researcher enters through a web browser, is credentialed by HMS eCommons, and composes a web query. HMS IRB approval covers the entire network through the HMS admin supernode. Behind institutional firewalls sit the nodes BWH, MGH, CHMC and BIDMC, which are different HIPAA-covered entities. Institutional IRB approval is required to become a node: Partners IRB approval, Caregroup IRB approval, CHMC IRB approval.]
45Challenge
- Multidisciplinary teams are subject to persistent centrifugal forces
- Utilities and incentives are not well aligned in the current funding and reward ecosystem
- And yet, cohesion is a necessary precondition for success of i2b2
46Solution to the Sociological Challenge I
47Solution II
- Promotion paths
- (even within current system)
- Clarity on intellectual contributions and authorship
- Funding that structurally incents
- E.g., i2b2 DBPs
- Transparency in IP, data sharing.
48Solution III
49Thank you
- And thanks to Chuck Friedman?, Valerie Florance,
Valentina Di Francesco