i2b2: Harnessing The Healthcare System For Research In The Genomic Era


1
i2b2: Harnessing The Healthcare System For
Research In The Genomic Era
  • Isaac Zak Kohane, MD, PhD

2
Prismatic Example: PPARγ Pro12Ala and diabetes
(Figure: forest plot by study, with sample size. Studies shown:
Oh et al., Deeb et al., Mancini et al., Clement et al.,
Hegele et al., Hasstedt et al., Lei et al., Ringel et al.,
Hara et al., Meirhaeghe et al., Douglas et al., Altshuler et al.,
Mori et al., and all studies combined.
Horizontal axis: estimated risk (Ala allele), 0.1 to 2.0.)
Overall P value 2 x 10^-7; odds ratio 0.79 (0.72-0.86)
Ala is protective
Courtesy J. Hirschhorn
3
Global challenges
  • Multiplicity of studies driven by the Human
    Genome Project
  • Each study requires a cohort
  • Genomic studies are failing because of
    insufficient sample size
  • Genomics are commoditized
  • Phenomics are irreducibly expensive
  • Assembling new cohorts for each study is
    unaffordable
  • Imposed time delays
  • Redundant phenotyping
  • Millions of dollars per study

4
Global Challenge
  • Some questions will not yield even with large
    populations
  • Much hard-won genomic data lies fallow
    post-publication
  • Integration (the Integrome) across data modalities
    provides leverage.
  • Identifying the true underlying pathobiologies
  • THREE Vignettes

5
Courtesy Atul Butte
(Figure labels: Concepts (CUI); synonyms and abbreviations;
source vocabulary and term; defined relations, not all;
statistical relations, not all, 4K for diabetes mellitus)
6
Gene-Driven Nosology
UMLS
GEO
Butte et al., Nat Biotech, 2006
7
Integrome (cont'd)
  • Combining expression snapshots of human disease
    and time course of model organism development

8
Lung Cancer
9
In-class as well as between class
Liu, H et al. PLoS Med, 2006
(Figure, panels A-C: survival probability vs. survival time in
months; months of survival for A and B; axes genomic PC1 and
genomic PC3. All ADs (125). Stage I ADs (64): higher PC1, N = 51,
vs. lower PC1, N = 13; p = 0.0169.)
10
Integrome (cont'd)
  • Expression and copy number

11
Tumor gene expression profile
  • Method: Functional Aneuploidy
  • Map mRNA measurements to genomic loci.
  • For each cytogenetic sub-band, calculate a
    statistic summarizing net expression of genes
    encoded in that region.
  • Calculate the sum of the magnitudes of these
    statistics to derive total functional
    aneuploidy (tFA)
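
The three steps above can be sketched in Python. This is only an illustration under stated assumptions, not the method as implemented by the authors: the per-band statistic is taken to be a simple mean, and the `band_of` gene-to-band mapping, the gene names, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def total_functional_aneuploidy(expression, band_of):
    """Return (tFA, per-band statistics).

    expression: gene -> normalized expression (e.g. log-ratio vs. reference)
    band_of:    gene -> cytogenetic sub-band (hypothetical mapping)
    """
    # Step 1: map mRNA measurements to genomic loci (sub-bands).
    by_band = defaultdict(list)
    for gene, value in expression.items():
        band = band_of.get(gene)
        if band is not None:  # skip genes with no known locus
            by_band[band].append(value)

    # Step 2: one summary statistic per sub-band (the mean is an assumption).
    band_stat = {band: mean(values) for band, values in by_band.items()}

    # Step 3: tFA is the sum of the magnitudes of those statistics.
    return sum(abs(s) for s in band_stat.values()), band_stat

# Made-up values, for illustration only.
expression = {"G1": 1.0, "G2": 3.0, "G3": -1.5, "G4": -0.5, "G5": 0.25}
band_of = {"G1": "8q24.1", "G2": "8q24.1", "G3": "17p13.1", "G4": "17p13.1"}

tfa, per_band = total_functional_aneuploidy(expression, band_of)
```

Taking magnitudes before summing means that regional gains and losses both add to the total instead of cancelling.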

12
Result: CIN25 signature is a significant predictor
of prognosis in 10 of 16 cancer datasets tested,
representing 6 types of human cancer.
13
Challenge
  • Tissue pathology is revealed in the tissues
    affected
  • Healthy individuals do not readily give up most
    body tissues
  • This reduces apparent utility of molecular
    markers
  • Can we measure molecular signatures in expression
    and protein in major diseases?

14
Diabetes Mellitus
  • Mitochondrial phenotype of DM well established in
    fat and muscle and liver
  • Current mainstay of DM, glycohemoglobin, does not
    distinguish several physiological states
  • Can we measure the mitochondrial phenotype and
    response to therapy using only WBC?

15
Diabetes-Peripheral Signature
16
Huntington's Disease
  • Dominant neurodegenerative disorder caused by
    expansion of a CAG trinucleotide repeat in the HD
    gene
  • Correlation but wide variance in clinical course
    with CAG repeat length
  • Goal: Define characteristics of the trigger
    mechanisms and target pathways correlated with
    the increasing polyglutamine tract length
  • Structural biology approach
  • WBC expression approach

17
Pathways with decreasing activity as CAG length
increases
28,372 genes hgu133 Plus 2.0
18
Challenge
  • Early predecessors to PubMed, such as PaperChase,
    showed that doctors could and would perform
    searches and boolean operations to get the right
    publications.
  • We do not have the equivalent facility for
    homology search, pathway intersection with
    expression, conservation against TFBS etc.
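
One way to picture the missing facility is as boolean operations over gene sets, in the same spirit as boolean literature search. The sketch below uses Python's built-in sets; the set names and gene identifiers are placeholders invented for illustration, not results from any dataset.

```python
# Hypothetical gene sets, each produced by a separate analysis.
pathway_genes = {"GENE_A", "GENE_B", "GENE_C"}
differentially_expressed = {"GENE_B", "GENE_C", "GENE_D"}
conserved_tfbs = {"GENE_C", "GENE_D", "GENE_E"}

# AND: pathway members that are also differentially expressed.
in_pathway_and_expressed = pathway_genes & differentially_expressed

# AND again: restrict to genes with a conserved binding site.
candidates = in_pathway_and_expressed & conserved_tfbs

# NOT: expressed genes that fall outside the pathway.
expressed_not_in_pathway = differentially_expressed - pathway_genes

# OR: everything flagged by any of the three analyses.
any_evidence = pathway_genes | differentially_expressed | conserved_tfbs
```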

19
Pathway
20
Gene-centric
21
Combination Operations
22
Challenge
  • The 72 names of glucose and other synonymy, lack
    of synchrony
  • Where do we get formal rigor to delineate
    relationships?
  • How do we use this formal rigor?

23
Protégé
24
Mapping program
25
Exploring an Integrated Multitude
  • Integrating various data sources
  • Diagnosis: ICD9, LMR, NLP
  • Medications: Inpatient cost codes, LMR, NLP
  • Smoking: ICD9, health maintenance, LMR, NLP
  • Across millions (2.5M) of patients
  • Where data may be sparse
  • PREVIOUSLY: Researchers did not know what they
    COULD ask, let alone HOW

26
(No Transcript)
27
(No Transcript)
28
(No Transcript)
29
Challenge
  • How to perform unsupervised learning in systems
    that contain heterogeneous data
  • Discrete and continuous
  • Asynchronous
  • Sparse
  • Codified and codified through parsing
  • Biased by utilization

30
Initial approaches
  • Low hanging fruit first
  • Remove spiking behavior through entropic
    filters
  • Aggregate over related categories
  • Use sampling frequencies based on the highest
    sampling frequency for a data type
  • Renormalize laboratory tests that have shifting
    baselines.
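
As an illustration of the last item, one possible way to renormalize a laboratory test whose baseline shifts (for example when an assay changes) is to standardize each result against the other results from the same period. This is a sketch under that assumption only; the slides do not specify the method, and the period labels and values below are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev

def renormalize_by_period(results):
    """results: list of (period, value). Returns z-scores in the same order,
    each computed against the mean and spread of its own period."""
    by_period = defaultdict(list)
    for period, value in results:
        by_period[period].append(value)

    # Per-period baseline: mean and population standard deviation.
    stats = {p: (mean(v), pstdev(v)) for p, v in by_period.items()}

    z_scores = []
    for period, value in results:
        mu, sigma = stats[period]
        # A period with no spread carries no information; report 0.0.
        z_scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return z_scores

# Made-up example: the second assay reads about 10 units higher.
results = [("assay_v1", 4.0), ("assay_v1", 5.0), ("assay_v1", 6.0),
           ("assay_v2", 14.0), ("assay_v2", 15.0), ("assay_v2", 16.0)]
z_scores = renormalize_by_period(results)
```

After this step the two periods are on a common scale, so a value that is high for its own assay is treated as high regardless of the baseline shift.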

31
One Approach: Relevance Networks
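
A relevance network links every pair of features whose pairwise association exceeds a threshold. The sketch below is a minimal illustration: Pearson correlation stands in for the association measure (mutual information is another common choice), and the threshold, feature names, and values are assumptions made up for the example.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def relevance_network(features, threshold=0.9):
    """Return edges (name_a, name_b, r) for pairs with |r| >= threshold."""
    edges = []
    for (name_a, a), (name_b, b) in combinations(features.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:  # keep strong positive or negative links
            edges.append((name_a, name_b, r))
    return edges

# Made-up measurements for five patients.
features = {
    "glucose": [90, 110, 150, 200, 260],
    "hba1c": [5.0, 5.6, 6.8, 8.3, 10.1],
    "gene_x": [9.1, 8.4, 7.0, 5.2, 3.1],
    "noise": [3, 1, 4, 1, 5],
}
edges = relevance_network(features)
```

Because every pair is tested, the number of comparisons grows quadratically with the number of features, which is one reason the threshold matters in practice.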
32
Challenge
  • Most phenotypic data in modern electronic
    healthcare systems is in the form of narrative
    text
  • Texts are idiosyncratic, jargon filled, and vary
    in style by institution

33
Approach
  • Both statistical and hand-crafted pattern
    matching
  • Implemented as modules in the widely used GATE
    framework.
  • GATE itself is being wrapped in the Hive
  • Scrubbing to maintain anonymity
  • Validation of codification by sampling strategies
    with experts.
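
The hand-crafted side of this approach can be illustrated with a few regular expressions. The sketch uses only Python's standard `re` module rather than GATE, and the patterns, labels, and sample note are assumptions invented for the example; real clinical text needs far broader patterns plus handling of negation and context.

```python
import re

# Hand-crafted scrubbing patterns (illustrative, far from complete).
PATTERNS = [
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("PHONE", re.compile(r"\b\d{3}-\d{3}-\d{4}\b")),
    ("MRN", re.compile(r"\bMRN[:\s]*\d+\b", re.IGNORECASE)),
]

# A crude smoking-mention pattern; it ignores negation ("denies smoking").
SMOKING = re.compile(r"\b(?:smokes|smoker|ppd|pack[- ]years?)\b", re.IGNORECASE)

def scrub(text):
    """Replace each matched identifier with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text

# Made-up note, for illustration only.
note = "Seen 3/14/2006, MRN: 1234567. Pt smokes 1 ppd. Call 617-555-0100."
scrubbed = scrub(note)
mentions_smoking = bool(SMOKING.search(scrubbed))
```

Scrubbing first and codifying second means the pattern matcher never needs access to the identifying text.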

34
NLP (and comedy) is not pretty
35
But it works
  • 96,000 asthma patients identified out of 2.5M PHS
    patients
  • Stratified by severity, pharmaco-responsiveness
    and exposures
  • Now with cases and controls (from extrema)
    reconsented and biomaterials obtained for
    genome-wide scans

36
And we want to recruit more colleagues to the
challenge
  • 16 groups have already responded to a
    de-identification and smoking challenge.

37
Challenge
  • Multiplicity of tools
  • Some local favorites, some universal favorites
  • Little guidance on how or when to use them for
    clinical research
  • Even less on how to use them together as an
    ensemble, process to perform clinical research

38
The Hive Workflow framework
  • Two APIs
  • Interprocess
  • Managed UI
  • Sessions
  • Security

39
Workflow1
40
Workflow2 (package data)
41
Workflow: Method dependent on data
42
Workflow: Cells with services and managed UI
GenopiaIframe
43
Challenge
  • Order n^2 IRB process with n institutions
  • See it propagate during updates
  • Prospective vs. just-in-time consent
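
The quadratic growth can be made concrete with a small calculation. The sketch assumes that, without coordination, every pair of institutions needs its own review, and compares that with a hub arrangement in which each institution approves once to join and one further approval covers the whole network. Both counting rules are simplifying assumptions.

```python
def pairwise_reviews(n):
    """Reviews needed if every pair of n institutions negotiates separately."""
    return n * (n - 1) // 2

def hub_reviews(n):
    """Reviews needed if each institution approves once to join, plus one
    approval covering the whole network (assumed hub arrangement)."""
    return n + 1

for n in (4, 10, 50):
    print(n, pairwise_reviews(n), hub_reviews(n))
```

The pairwise count grows on the order of n^2 while the hub count grows linearly, and the gap widens quickly as institutions are added.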

44
Aggregate-able and delegated approval
Researcher enters through web browser
HMS eCommons credentialing of researcher
Compose Web Query
HMS IRB approval for entire network
HMS Admin Supernode
Institutional Firewall
BWH MGH
CHMC
BIDMC
Institutional IRB approval to become a node
Different HIPAA-covered entities
Partners IRB Approval
Caregroup IRB Approval
CHMC IRB Approval
45
Challenge
  • Multidisciplinary teams are subject to persistent
    centrifugal forces
  • Utilities and incentives are not well aligned in
    current funding and reward ecosystem
  • And yet, cohesion is a necessary precondition for
    success of i2b2

46
Solution to the Sociological Challenge I
47
Solution II
  • Promotion paths
  • (even within current system)
  • Clarity on intellectual contributions and
    authorship
  • Funding that structurally incents
  • E.g., i2b2 DBPs
  • Transparency in IP, data sharing.

48
Solution III
49
Thank you
  • And thanks to Chuck Friedman?, Valerie Florance,
    Valentina Di Francesco