Title: Case Study for Clinical Relevancy: Asthma
1Case Study for Clinical Relevancy: Asthma
Scott T. Weiss, M.D., M.S.
Professor of Medicine, Harvard Medical School
Director, Center for Genomic Medicine
Director, Program in Bioinformatics
Associate Director, Channing Laboratory
Brigham and Women's Hospital, Boston, MA
2Outline
- Context: focus on process and data
- Overview of Asthma DBP
- Smoking as an example of the data issues
- Predicting COPD in those with asthma
- Predicting asthma exacerbations
- Genetic prediction of asthma exacerbations: current status
- DNA collection
- Lessons Learned
- Conclusions
3Context
- Channing Lab: extensive genetics and pharmacogenetics resources focused on airways diseases
- Faculty with clinical, epidemiology, genetic, and bioinformatics training and experience
- Multidisciplinary research collaborative track record
- Good i2b2 driver from bench to clinic
- Strong focus and direction for Cores
4Broad Goals of Channing Program in Predictive
Medicine
- Genetic variation → clinical practice
- Disease risk (asthma diagnosis)
- Natural history (exacerbations)
- Individual response to medication (pharmacogenetics)
- Develop predictive tests (genetic and nongenetic) in Channing populations
- Validate these tests in Partners asthma cohort (PAC), at least as proof of concept
5I2B2 Airways DBP Overview
6Before we start
- Numerous important covariates
- e.g., age, tobacco, comorbidities, medications
- Adjust outcomes for covariates
- Some (e.g., age, gender, Dx, encounter) readily available
- Obtained through Core 4
- Others require substantial effort, e.g., medications, tobacco use, comorbid conditions
- Collaboration: NLP experts in Core 1
7Phenotypes from text
- Extract specific data items
- Medication
- Smoking status
- Diagnoses (Co-morbidity)
- Extract findings to assist with case selection
- Extract findings to assist with clinical
predictions
8Smoking Status: Examples
[Figure: example report excerpts by smoking status: Smoker, Non-Smoker, Past Smoker, and ??? (two examples marked "Hard to pick")]
9Smoking: Text Processing
Manually classified
10Smoking Status
Preliminary results
- Raw sample: 20,000 reports
- Feature extraction: >3000
- Feature selection: 25 - 1000
- Gold standard sample cases: 2,800
- Correct classification rate: 46 - 81% (compared to Gold Standard)
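The scoring step on this slide, comparing automatic labels with a manually classified gold standard, can be sketched as follows. This is a minimal illustration only: the toy reports, the keyword rules, and the function names are hypothetical stand-ins and are not the Core 1 NLP system or its features.

```python
import re

# Hypothetical toy reports with gold-standard labels (not real patient data).
GOLD_STANDARD = [
    ("Patient smokes one pack per day.", "smoker"),
    ("Denies tobacco use. Never smoked.", "non-smoker"),
    ("Quit smoking in 1995.", "past smoker"),
    ("Current smoker, counseled on cessation.", "smoker"),
    ("No history of smoking.", "non-smoker"),
    ("Former smoker, quit 10 years ago.", "past smoker"),
    ("Husband smokes at home.", "non-smoker"),  # a "hard to pick" report
]


def extract_features(text):
    """Feature extraction: the set of lower-cased word tokens in a report."""
    return set(re.findall(r"[a-z]+", text.lower()))


def classify(text):
    """Assumed keyword rules standing in for a trained classifier."""
    tokens = extract_features(text)
    if tokens & {"quit", "former"}:
        return "past smoker"
    if tokens & {"denies", "never", "no"}:
        return "non-smoker"
    if tokens & {"smokes", "smoker", "smoking", "tobacco"}:
        return "smoker"
    return "unknown"


def correct_classification_rate(predicted, gold):
    """Fraction of reports whose predicted label matches the gold standard."""
    hits = sum(p == g for p, g in zip(predicted, gold))
    return hits / len(gold)


predicted = [classify(text) for text, _ in GOLD_STANDARD]
gold = [label for _, label in GOLD_STANDARD]
rate = correct_classification_rate(predicted, gold)
print(f"Correct classification rate: {rate:.0%}")
```

On this toy set the rules get 6 of 7 reports right; the miss is the report that mentions someone else's smoking, which is the kind of case that makes the task hard.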
11Smoking Status
Preliminary results
Baseline performance
Increasing and combining features should improve performance
12Data Extraction
Data Mining Pipeline
13Asthma Preceding COPD
- Significant overlap of asthma and COPD Dx
- Common denominator: smoking
- Asthma is known to precede and predict the development of COPD independent of smoking
- Could we develop a multivariate clinical predictor that would predict which asthmatics would get COPD?
14Study Design
- Source: Partners Healthcare Research Patient Data Repository (RPDR).
- RPDR: MGH, BWH, etc. clinical repository for researchers.
- Training: 9349 asthmatics (843 COPD, 8506 controls), first encounter 1988-1998.
- Test: a future set of 992 asthmatics (46 COPD, 946 controls), first encounter 1999-2002.
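The temporal split and the case fractions implied by these counts can be checked with a short sketch. The function names are hypothetical; the year ranges and counts are taken from the slide.

```python
def assign_cohort(first_encounter_year):
    """Assign a patient to the training or test set by first-encounter year."""
    if 1988 <= first_encounter_year <= 1998:
        return "training"
    if 1999 <= first_encounter_year <= 2002:
        return "test"
    return None  # outside both study windows


def case_fraction(cases, controls):
    """Fraction of the cohort that has a COPD diagnosis."""
    return cases / (cases + controls)


training_fraction = case_fraction(843, 8506)  # 843 of 9349 asthmatics
test_fraction = case_fraction(46, 946)        # 46 of 992 asthmatics
print(f"training: {training_fraction:.1%}, test: {test_fraction:.1%}")
```

The COPD fraction is about 9.0% in the training set and about 4.6% in the test set, so the two cohorts differ in case mix as well as in calendar period.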
15Data Collection
- Criteria: patients observed for at least 5 years, at least 18 years old at the first encounter, and race, sex, height, weight, and smoking available.
- Comorbidities: International Classification of Diseases, 9th Revision (ICD-9) codes as admission diagnosis or ER primary diagnosis (104).
- COPD: ICD-9 code for Chronic Bronchitis, Emphysema, or Chronic Airways Obstruction, not otherwise specified.
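The inclusion criteria above amount to a simple record filter. The sketch below is one way to express it; the `Patient` record and its field names are hypothetical and do not reflect the RPDR schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    # Hypothetical field names, for illustration only.
    age_at_first_encounter: int
    years_observed: float
    race: Optional[str] = None
    sex: Optional[str] = None
    height_cm: Optional[float] = None
    weight_kg: Optional[float] = None
    smoking: Optional[str] = None


def is_eligible(p: Patient) -> bool:
    """Apply the slide's criteria: >= 5 years observed, >= 18 years old at
    first encounter, and race, sex, height, weight, and smoking all present."""
    required = (p.race, p.sex, p.height_cm, p.weight_kg, p.smoking)
    return (
        p.years_observed >= 5
        and p.age_at_first_encounter >= 18
        and all(value is not None for value in required)
    )


complete = Patient(34, 7.5, "white", "F", 165.0, 70.0, "never")
missing_smoking = Patient(34, 7.5, "white", "F", 165.0, 70.0, None)
print(is_eligible(complete), is_eligible(missing_smoking))
```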
16Analysis
- Model: a Bayesian network was generated from the training set of 9349 asthmatics (843 COPD, 8506 controls) encountered between 1988 and 1998, using 104 comorbidities and race, gender, age, and smoking.
- Results: the risk of COPD is modulated by gender, race, and smoking history, and 14 comorbidities: viral and chlamydial infections, diabetes mellitus, volume depletion, acute myocardial infarction, intermediate coronary syndrome, cardiac dysrhythmias, heart failure, acute upper respiratory infections, acute bronchitis and bronchiolitis, pneumonia, early or threatened labor, normal delivery, shortness of breath, respiratory distress.
17Network Model
18Validation
- Propagation: a Bayesian network can compute the probability distribution of any variable given an instance of some or all the other variables.
- Test data: a future set of 992 asthmatics (46 COPD, 946 controls), first encounter from 1999-2002.
- Prediction: for each patient, predict the probability of COPD given the other elements in the network (co-morbidities and demographics).
- Validation: compare the predicted with the observed COPD status.
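Propagation with partial evidence can be illustrated with the simplest possible Bayesian network, a naive Bayes structure in which COPD is the only parent of each finding. This is an assumption made for the sketch: the slides do not give the learned network structure, and the conditional probabilities below are invented for illustration. Only the prior uses the training counts from the slides.

```python
# Prior from the training counts on the slides: 843 COPD among 9349 asthmatics.
PRIOR_COPD = 843 / 9349

# Hypothetical P(finding present | COPD status); values are illustrative only.
CPT = {
    "smoker": {True: 0.60, False: 0.30},
    "pneumonia": {True: 0.20, False: 0.05},
}


def posterior_copd(evidence):
    """Return P(COPD | evidence) for a naive Bayes network.

    `evidence` maps a finding name to True/False. Findings left out are
    unobserved and drop out of the product (they are marginalized).
    """
    joint = {True: PRIOR_COPD, False: 1.0 - PRIOR_COPD}
    for finding, present in evidence.items():
        for copd in (True, False):
            p = CPT[finding][copd]
            joint[copd] *= p if present else 1.0 - p
    return joint[True] / (joint[True] + joint[False])


print(posterior_copd({}))                                    # prior only
print(posterior_copd({"smoker": True}))                      # one finding
print(posterior_copd({"smoker": True, "pneumonia": True}))   # two findings
```

With no evidence the function returns the prior; each observed finding then shifts the probability up or down, which is the per-patient prediction that gets compared with the observed COPD status.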
19Predictive Validation
20One variable at a time
21Asthma Exacerbations
- Asthma attacks involve worsening of asthma symptoms, including bronchoconstriction and inflammatory response
- Major cause of morbidity and mortality in asthma
- 11.7 million Americans have an exacerbation every year (3.9 million children)
- In US children, exacerbations are the third leading cause of hospitalizations (198,000 occurrences per year)
- Cost of asthma exacerbations: US, 4 billion dollars; Partners, 20 million dollars
22 23 24 25RPDR Exacerbation Prediction
26Genetic Prediction of Asthma Exacerbation
- Objective
- Predict asthma exacerbation from genetic data
- Subjects
- 290 CAMP participants
- Not on steroids
- Followed for 10 years
- Have genetic data available
- Phenotype
- Case: reported overnight hospitalization(s) (n = 83)
- Control: no overnight hospitalizations or ER visits (n = 207)
- Genotype
- 2443 SNPs from 349 candidate genes
- In Hardy-Weinberg equilibrium among controls
- Minor allele frequency > 0.05
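The two genotype filters on this slide can be sketched for a single biallelic SNP as below. Several details are assumptions not stated on the slide: the Hardy-Weinberg check is done here with a 1-degree-of-freedom chi-square test at the 0.05 level, the minor allele frequency is computed from whichever genotype counts are passed in, and the genotype counts are made up.

```python
CHI2_CRITICAL_1DF_05 = 3.841  # chi-square critical value, 1 df, alpha = 0.05


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts AA, AB, BB."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1.0 - freq_a)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing observed genotype counts with the
    counts expected under Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


def passes_qc(control_counts, all_counts, maf_min=0.05):
    """Keep a SNP if controls look consistent with HWE and MAF > maf_min."""
    in_hwe = hwe_chi_square(*control_counts) < CHI2_CRITICAL_1DF_05
    return in_hwe and minor_allele_frequency(*all_counts) > maf_min


# Made-up genotype counts (AA, AB, BB) for two SNPs.
print(passes_qc((98, 84, 18), (140, 120, 30)))   # close to HWE proportions
print(passes_qc((120, 40, 40), (170, 60, 60)))   # too few heterozygotes
```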
27Exacerbation Model
132 of 2443 SNPs in 55 of 349 genes predict
exacerbation
28Validation
- Method: prediction on fitted values
- Result: area under the ROC curve (AUROC) is 0.97
- AUROC measures accuracy as trade-off between sensitivity and specificity
AUROC       Rating
0.5 - 0.6   Fail
0.6 - 0.7   Poor
0.7 - 0.8   Fair
0.8 - 0.9   Good
0.9 - 1.0   Excellent
AUROC = 0.97
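AUROC can be computed directly as the probability that a randomly chosen case receives a higher predicted score than a randomly chosen control, counting ties as one half. The sketch below uses that definition together with the rating table from this slide. The scores are made up, and treating each band's lower bound as inclusive (and anything below 0.5 as "Fail") is an assumption, since the slide's bands share their endpoints.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (case, control) pairs ranked correctly.

    `labels` uses 1 for cases and 0 for controls; ties count as 0.5.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


def rating(value):
    """Map an AUROC value to the slide's rating bands (lower bound inclusive)."""
    bands = [(0.9, "Excellent"), (0.8, "Good"), (0.7, "Fair"), (0.6, "Poor")]
    for lower, label in bands:
        if value >= lower:
            return label
    return "Fail"  # assumption: values below 0.6, including below 0.5


# Made-up predicted probabilities for two cases and two controls.
example = auroc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(example, rating(example), rating(0.97))
```

In the toy example three of the four case-control pairs are ordered correctly, giving 0.75 ("Fair"), while `rating(0.97)` returns "Excellent".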
29Cross-Validation
- Method: 20-fold cross-validation to test robustness
- (1) Data is split into 20 groups
- (2) One group is held out as the independent set and the remaining 19 are used to quantify the model
- (3) Step (2) is repeated until each group has been the independent set
- Result: AUROC is 0.84 (good)
AUROC = 0.84
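The splitting procedure described on this slide can be sketched as follows, using the 290 subjects from the genetic prediction study. The shuffling seed and function names are hypothetical, and the model fitting itself is left out.

```python
import random


def k_fold_indices(n_subjects, k, seed=0):
    """Shuffle subject indices and deal them into k roughly equal groups."""
    indices = list(range(n_subjects))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validation_splits(n_subjects, k, seed=0):
    """Yield (training, held_out) index lists, one pair per group."""
    folds = k_fold_indices(n_subjects, k, seed)
    for i, held_out in enumerate(folds):
        training = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield training, held_out


splits = list(cross_validation_splits(290, 20))
for training, held_out in splits[:3]:
    # A model would be fitted on `training` and scored on `held_out` here.
    print(len(training), len(held_out))
```

With 290 subjects and 20 groups, each held-out group has 14 or 15 subjects, and every subject is held out exactly once.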
30Partners Asthma DNA collection 1
- Recruit Partners asthma patients
- Partners Asthma Center, NWH, MGH
- High quality spirometric phenotyping
- Blood for DNA extraction and storage
- Children and adults
- High cost (>$1000/subject)
- Low intensity: 6 months, only 100 subjects recruited
- Doctors and patients need education
31Partners Asthma DNA collection 2
- Recruit Partners asthma cohort patients
- Leverage CRIMSON blood samples
- Leverage data mart for phenotype data
- Blood for DNA extraction and storage
- Children and adults cases and controls
- Low cost (<$30/subject)
- High intensity: 9 months, >3000 subjects recruited
32Figure 1: Data Flow for Asthma DBP
- Channing: sends ADMPN to RPDR
- RPDR: converts ADMPN to MRN, sends to pathology
- Pathology (Crimson): MRN → Crimson ID; sends ADMPN back to Channing with sample for DNA extraction
Figure 1 Legend: Deidentified data file analyzed by Channing; subjects for DNA collection selected. File sent to RPDR, converted back to MRN, and sent to Crimson. Samples identified and given Crimson ID; ADMPN and sample sent back to Channing.
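The identifier hand-offs in Figure 1 can be sketched with two lookup steps. All identifiers, mappings, and function names below are hypothetical; the point is only that the MRN is used between RPDR and pathology and is not part of what comes back to Channing.

```python
def rpdr_reidentify(selected_admpns, admpn_to_mrn):
    """RPDR step: convert the selected study IDs (ADMPN) back to MRNs."""
    return {admpn_to_mrn[admpn]: admpn for admpn in selected_admpns}


def crimson_match(mrn_to_admpn, mrn_to_crimson_id):
    """Pathology step: look up samples by MRN and return only the ADMPN
    with its Crimson sample ID, leaving the MRN behind."""
    return [
        (admpn, mrn_to_crimson_id[mrn])
        for mrn, admpn in mrn_to_admpn.items()
        if mrn in mrn_to_crimson_id  # a sample exists for this patient
    ]


# Hypothetical identifiers for illustration.
admpn_to_mrn = {"A001": "MRN-17", "A002": "MRN-42", "A003": "MRN-99"}
mrn_to_crimson_id = {"MRN-17": "CR-5", "MRN-99": "CR-8"}

selected = ["A001", "A002"]  # subjects chosen from the deidentified file
returned = crimson_match(rpdr_reidentify(selected, admpn_to_mrn), mrn_to_crimson_id)
print(returned)
```

Here only the first selected subject has a sample available, so a single (ADMPN, Crimson ID) pair is returned.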
33Recruitment for DBP from Crimson at BWH Asthma
Cases by Utilization and Race
34Recruitment for DBP from Crimson at BWH Asthma
Cases and Controls by Race
35Summary of Samples to 04/07/08
36Lessons learned 1
- Get what you ask for
- Regular meetings, regular meetings
- Negotiate your demands
- Tools are not enough
- Leverage your peers
- Recruiting patients is hard work
- IRB is hard work
37Lessons learned 2
- You can never have enough statistics or bioinformatics
- Genotyping and its technologies are secondary
- The RPDR data are dirty!
- Listen to Shawn
- Be flexible
38Summary: Airways disease as a driver for i2b2
- Typical complex disease challenge
- Big impact on health care system
- Potential for large clinical impact
- Core 1: extracting phenotypes from free text, statistical models
- Core 2: viewer for CRC
- Core 4: data provisioning
39Conclusions
- The stronger the existing program, the more successful the I2B2 collaboration
- Communication is key
- Fit the question to the data, not the other way around
- Data access will be an issue for the future
40Collaborators (and what they did)
- Scott, Zak, John, and Susanne: money, project management, IRB, and big picture
- Ross: Channing bioinformatics, file structures, geek-to-geek translation with the cores, beta testing, 850 collection, IRB, links to other genetic bioinformatics tools and projects
- Shawn and Vivian: asthma and control data mart
- Anne, LJ, James: nongenetic predictors in CAMP
- Marco and Blanca: nongenetic predictors in PAC
- Marco and Blanca: genetic predictors in CAMP
- Marco and Blanca: genetic predictors in PAC
- Lynn: Crimson
41Acknowledgments
- Ross Lazarus
- Susanne Churchill
- Blanca E. Himes
- Anne Fuhlbrigge
- Marco F. Ramoni
- LJ Wei
- Isaac Kohane
- James Sigornivitch
- Shawn Murphy
- Lynn Bry