Title: For Nautilus: To OLAP or not to OLAP?
1REMBRANDT: Empowering Translational Research
REpository of Molecular BRAin Neoplasia DaTa
HL7 Clinical Genomics SIG, Atlanta, September 2004
2Agenda
- Translational Research: Why do we care?
- GMDI: How we got here
- Conceptual Model
- Gene Expression Use Case analysis
- Gene Expression Data analysis
- Wire Frames
- System Architecture
- Object Model
- Data warehouse design
3Translational Research: Why do we care?
- Iressa Drug Case Study (at Harvard Medical School)
- Targeted towards lung cancer
- Phase II trial: A minority of patients showed dramatic tumor shrinkage
- Phase III randomized trial: No survival improvement
- Patients with mutations in Iressa's target, EGFR, showed response to the drug
- The future of pharmacogenomics is based on translational research
- Reference: "Clinical Pharmacogenomics: Almost a Reality", Modern Drug Discovery, August 2004
4Scientific goals of GMDI
- Develop a molecular classification schema that is both clinically and biologically meaningful, based on gene expression and genomic data from tumors (gliomas) of patients who will be prospectively followed through the natural history and treatment phases of their illness
5Rembrandt Knowledgebase
Better understanding Better treatments
6REMBRANDT Project Goals
- Produce a national molecular/genetic/clinical database of several thousand primary brain tumors that is fully open and accessible to all investigators (both intramural and extramural)
- Provide informatics support to molecularly characterize a large number of adult and pediatric primary brain tumors and to correlate those data with extensive retrospective and prospective clinical data
7Functional genomics data in the knowledge-base
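[Figure: functional genomics data types in the knowledge base, organized around DNA, RNA, and Protein. Labels shown: 100K SNP array; ArrayCGH; Copy No.; LOH; Gene Expression Analysis; Affy/Oligo Arrays; cDNA/GenePix Arrays; Real-time RT-PCR; Tissue Arrays for ISH; Tissue Arrays (IHC); Proteomics (Mass Spec)]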
8Conceptual Model
[Diagram: conceptual model. Entities shown: Prior_Therapy, Demographics, Survival, Outcome, Time course, Patient, Trial, Pathology, User Input, Sample, Expr_Expt, CGH_Expt, SNP_Expt, Change_Status, Map_Location, Abnorm_Status (shown twice), Gene, BAC_ID, SNP, E-value, Call. Infrastructure labels: C3D, caCORE, caArray]
9REMBRANDT will Leverage NCICB and caBIG Infrastructure Components
- Aligns with caBIG principles
- Open source
- Open access
- Syntactic and Semantic interoperability
- Federated data
- NCICB Infrastructure
- caArray: gene expression data repositories and analysis tools
- Cancer Genome Anatomy Project (CGAP): genomic tools
- C3D: Clinical Informatics System
- caCORE Infrastructure (caBIO, EVS, caDSR)
- caBIG Infrastructure being delivered by caBIG workspaces
10Typical Rembrandt Search
- Show me the tumors (tumor samples) that have amplification and over-expression of the genes EGFR and Cyclin D1.
- Restrict the search to cases with
- amplification confirmed by SNP Chip and CGH,
- and over-expression confirmed by Oligo and cDNA Arrays
- Presentation of Results
- Which genes are under-expressed with respect to normal?
- Does this subset of tumors have better survival?
- Do they segregate to a certain age group, geographical area or ethnicity?
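The search above amounts to requiring every listed confirmation for every listed gene. A minimal sketch in Python, assuming a hypothetical in-memory list of (sample, gene, platform, call) observations; the record layout, platform labels, and sample IDs are illustrative and are not the REMBRANDT schema:

```python
# Hypothetical observations: (sample_id, gene, platform, call).
# Platform labels and calls are illustrative assumptions, not REMBRANDT's schema.
OBSERVATIONS = [
    ("S1", "EGFR", "SNP", "amplified"),
    ("S1", "EGFR", "CGH", "amplified"),
    ("S1", "EGFR", "OLIGO", "over-expressed"),
    ("S1", "EGFR", "CDNA", "over-expressed"),
    ("S2", "EGFR", "SNP", "amplified"),  # S2 has no CGH confirmation
    ("S2", "EGFR", "OLIGO", "over-expressed"),
    ("S2", "EGFR", "CDNA", "over-expressed"),
]

# Every (platform, call) pair that must be present for a gene in a sample.
REQUIRED = {
    ("SNP", "amplified"),
    ("CGH", "amplified"),
    ("OLIGO", "over-expressed"),
    ("CDNA", "over-expressed"),
}


def confirmed_samples(observations, genes):
    """Return sorted sample IDs in which every gene has all REQUIRED confirmations."""
    seen = {}
    for sample, gene, platform, call in observations:
        seen.setdefault((sample, gene), set()).add((platform, call))
    samples = {sample for sample, _gene in seen}
    return sorted(
        s for s in samples
        if all(REQUIRED <= seen.get((s, g), set()) for g in genes)
    )
```

Calling `confirmed_samples(OBSERVATIONS, ["EGFR"])` keeps only samples confirmed on all four platforms; the follow-up questions (survival, age group, ethnicity) would then be asked of that subset.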
11True Measure of Translational Research
- To present all the down-regulated genes within each sample in the result set, we have to pivot the result set on its gene expression axis.
- All translational queries should allow the ability to easily pivot between
- Disease View
- Patient / Sample View
- Experiment / Annotations View
- Time Course View
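Pivoting here means regrouping the same result rows under a different key. A minimal sketch in Python, assuming hypothetical long-format rows of (sample, gene, fold change versus normal) and an assumed cutoff of 0.5 for "down-regulated"; neither the row layout nor the cutoff comes from the slides:

```python
from collections import defaultdict

# Hypothetical result rows: (sample_id, gene, fold_change versus normal).
ROWS = [
    ("S1", "EGFR", 4.0),
    ("S1", "PTEN", 0.3),
    ("S2", "EGFR", 0.4),
    ("S2", "PTEN", 0.2),
]


def is_down(row):
    # Assumed cutoff: a fold change below 0.5 counts as down-regulated.
    return row[2] < 0.5


def pivot(rows, key_index, value_index, keep):
    """Group rows by one column, collecting another column for rows that pass keep."""
    view = defaultdict(list)
    for row in rows:
        if keep(row):
            view[row[key_index]].append(row[value_index])
    return dict(view)


by_sample = pivot(ROWS, 0, 1, is_down)  # sample view: down-regulated genes per sample
by_gene = pivot(ROWS, 1, 0, is_down)    # gene view: samples in which each gene is down
```

The same `pivot` call serves both views; only the key and value columns change, which is the property the slide asks of every translational query.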
12High-level Search Use cases
13Gene Expression Search Use cases
14Gene Expression data analysis
Binary chp files from GCOS
15cDNA data handling
- Technical replicates: compute the Pearson correlation between one spot across all arrays and another spot for the same clone across all arrays
- Is the correlation > 0.7?
- Yes: for each array, calculate the average of the expression measurements
- No: an inconsistent call is made and no e-value is computed for that clone
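The replicate check above can be sketched in Python as follows. This is one reading of the flowchart, not the project's code: the function names are hypothetical, the per-array average is taken over the two replicate spots, and a spot with constant values is treated as uncorrelated (an assumption the slide does not address).

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant spot is treated as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def combine_replicates(spot_a, spot_b, threshold=0.7):
    """Average two replicate spots per array if they correlate, else return None.

    spot_a and spot_b hold one measurement per array for the same clone.
    None stands for the "inconsistent" call (no e-value computed).
    """
    if pearson(spot_a, spot_b) > threshold:
        return [(a + b) / 2 for a, b in zip(spot_a, spot_b)]
    return None
```

For example, `combine_replicates([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])` returns `None` because the two spots are anti-correlated across arrays.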
16UI Wire Frames
17UI Wire Frames
18Architecture
19Object Model
- DomainElement
- Represents the basic elements involved in the translational research space.
- All queries, views and presentation objects are composed of domain elements
- Provides strong type checking and validations
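To illustrate the idea of composing queries from validated domain elements, here is a minimal Python sketch. Only the name `DomainElement` comes from the slide; `GeneSymbol`, `FoldChange`, `GeneExpressionCriterion`, and the validation rules are hypothetical, and the actual object model may look quite different.

```python
from dataclasses import dataclass


class DomainElement:
    """Base type for values used to build queries and views (hypothetical API)."""

    def validate(self):
        raise NotImplementedError


@dataclass(frozen=True)
class GeneSymbol(DomainElement):
    value: str

    def __post_init__(self):
        self.validate()

    def validate(self):
        if not isinstance(self.value, str) or not self.value.strip():
            raise ValueError("gene symbol must be a non-empty string")


@dataclass(frozen=True)
class FoldChange(DomainElement):
    value: float

    def __post_init__(self):
        self.validate()

    def validate(self):
        if isinstance(self.value, bool) or not isinstance(self.value, (int, float)):
            raise TypeError("fold change must be a number")
        if self.value <= 0:
            raise ValueError("fold change must be positive")


@dataclass(frozen=True)
class GeneExpressionCriterion:
    """A query criterion composed only of domain elements."""

    gene: GeneSymbol
    min_fold_change: FoldChange

    def __post_init__(self):
        if not isinstance(self.gene, GeneSymbol):
            raise TypeError("gene must be a GeneSymbol")
        if not isinstance(self.min_fold_change, FoldChange):
            raise TypeError("min_fold_change must be a FoldChange")
```

Because each element validates itself on construction, a criterion such as `GeneExpressionCriterion(GeneSymbol("EGFR"), FoldChange(2.0))` cannot be built from a raw string or an invalid number.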
20Database Schema
- Star schema
- Is a generic, query-optimized schema
- A star schema consists of fact tables and dimensions
- Provides a highly de-normalized view of the data
- Provides a data-neutral framework from which queries can be executed with very fast results
- Prototype usage will help us validate our approach
21Database Schema
- Fact Table
- Contains key performance indicators
- Helps eliminate expensive joins from queries
- In the future, if multi-dimensional measures are required, then our schema is extensible to allow us to perform OLAP queries
- Dimension
- Dimensions are the categories of data analysis
- When a report is requested "by" something, that something is usually a dimension.
- For example, in a gene expression query, the two dimensions needed are genes (GENE_DM) and samples (BIOSPECIMEN_DM)
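A small star schema along these lines can be sketched with Python's built-in `sqlite3` module. The dimension names GENE_DM and BIOSPECIMEN_DM come from the slide; the fact table name `GENE_EXPR_FACT`, all column names, and the sample rows are assumptions made for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (names from the slide; columns are assumed).
CREATE TABLE GENE_DM (gene_id INTEGER PRIMARY KEY, symbol TEXT);
CREATE TABLE BIOSPECIMEN_DM (
    biospecimen_id INTEGER PRIMARY KEY,
    sample_name TEXT,
    disease TEXT
);
-- Hypothetical fact table: one measure per (gene, sample) pair.
CREATE TABLE GENE_EXPR_FACT (
    gene_id INTEGER REFERENCES GENE_DM(gene_id),
    biospecimen_id INTEGER REFERENCES BIOSPECIMEN_DM(biospecimen_id),
    fold_change REAL
);
-- Made-up rows for illustration.
INSERT INTO GENE_DM VALUES (1, 'EGFR'), (2, 'PTEN');
INSERT INTO BIOSPECIMEN_DM VALUES (10, 'S1', 'glioma'), (11, 'S2', 'glioma');
INSERT INTO GENE_EXPR_FACT VALUES
    (1, 10, 4.0), (1, 11, 2.0), (2, 10, 0.5), (2, 11, 0.25);
""")


def mean_fold_change_by_gene(connection):
    """Report the fact measure "by" the gene dimension."""
    return connection.execute("""
        SELECT g.symbol, AVG(f.fold_change)
        FROM GENE_EXPR_FACT AS f
        JOIN GENE_DM AS g ON g.gene_id = f.gene_id
        JOIN BIOSPECIMEN_DM AS b ON b.biospecimen_id = f.biospecimen_id
        GROUP BY g.symbol
        ORDER BY g.symbol
    """).fetchall()
```

Each query joins the fact table to its dimensions by key only, so reporting "by sample" or "by disease" instead means changing the `GROUP BY` column rather than the schema.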
22Database Schema
23Problem we are trying to solve
- A typical Rembrandt data portal search
- Show me all tumor samples that have amplification of 13q11.3, deletion of 10p21, D7S522 and the FHIT region confirmed by SNP chips and CGH analysis.
- Display regions with LOH for these samples.
- Which genes are under-expressed in these tumor samples with respect to normal?
- Does this subset of tumors have better survival?
- Do they segregate to a certain age group, geographical area or ethnicity?
24To solve this problem
- Fact: Cancer develops as a result of chromosomal aberrations
- Duplications
- Deletions
- Somatic mutations
- We need to measure chromosomal aberrations
[Diagram: Chrom N, Copy 1 and Chrom N, Copy 2, illustrating Complete Loss, Duplication, and LOH]
25How to measure aberrations?
- CGH
- SNP Arrays
- Have higher resolution than CGH
- Analyze chromosomal copy number and genotype in one experiment
- SNP arrays help determine the following between a normal blood sample and a tumor sample
- Heterozygous to Homozygous: Loss of one allele
- Heterozygous to No Call: Partial loss of one allele/No Call
- Homozygous to Homozygous: Unchanged/Loss of one allele
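The three normal-to-tumor transitions listed above can be written as a lookup table. A minimal sketch in Python; the constant names and the `interpret` function are hypothetical, and any transition not listed on the slide is reported as not covered rather than guessed:

```python
HET = "heterozygous"
HOM = "homozygous"
NO_CALL = "no call"

# (normal blood call, tumor call) -> interpretation, exactly as listed on the slide.
TRANSITIONS = {
    (HET, HOM): "Loss of one allele",
    (HET, NO_CALL): "Partial loss of one allele/No Call",
    (HOM, HOM): "Unchanged/Loss of one allele",
}


def interpret(normal_call, tumor_call):
    """Interpret a paired SNP genotype call for one SNP in one patient."""
    return TRANSITIONS.get((normal_call, tumor_call), "not covered by the slide")
```

Note that a homozygous-to-homozygous pair is ambiguous on its own (unchanged, or loss of one allele), which is one reason the copy number measured in the same experiment matters.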
26Genotype model for Rembrandt
- Model basic science
- Model SNPs in relation to chromosomal aberrations and as markers on the genome
- Model to include annotations and external cross-references
- Model experimental observations
- Capture observations such as LOH in relation to SNPs and chromosomal aberrations (CGH data)
- Capture expression value for SNP elements on arrays to correlate with DNA copy number
27Translational Research use case
- The Clinical Genomics model should serve the translational research use case
- Model should allow for associations between
- Basic science / molecular observations (gene expression, SNP, pathway, etc.)
- Clinical science data (prior therapy, outcome, demographics, etc.)
28Translational Research Space
29Next Steps
- Reviewing the HL7 re-usable genotype R-MIM as a starting point to build a clinical genomics object model
- Translating the genotype R-MIM into UML to establish relationships and cardinalities between various scientific observations
- For REMBRANDT, extending the caBIO Object Model
- Developing a data warehouse infrastructure for REMBRANDT to define relevant translational spaces and relationships between them
- Future: We plan to merge our clinical objects with the HL7 Clinical model
30The Rembrandt Team!
- Internal Advisors
- Ken Buetow
- Peter Covitz
- Sue Dubman
- Mervi Heiskanen
- Carl Schaefer
- Christo Andonyadis
- Scott Gustafson
- Sharon Settnek
- External Advisors
- Jean-Claude Zenklusen
- Yuri Kotliarov
- Howard Fine
- Tracy Lugo
- Bob Finkelstein
- Ram Bhattaru
- James Luo
- Alex Jiang
- Prashant Shah
- Ryan Landy
- Kevin Rosso
- Jyotsna Chilukuri
- Dana Zhang
- Nick Xiao
- Smita Hastak
- Himanso Sahni
- Subha Madhavan