Title: Using visualization to find relationships in chemogenomic data
1. Using visualization to find relationships in
chemogenomic data
- Brian Prather, Ph.D., Application Specialist,
  Spotfire
- Third Virtual Conference on Genomics and
  Bioinformatics
- 18 September 2003
2. Challenges in Chemogenomics
- Conceptual challenges
  - Multidisciplinary project teams
  - Data from chemistry, biology, genomics, ...
- Data challenges
  - Disparate data sources, formats and identifiers
  - Multiple data types: gene expression, chemical
    structures, clinical chemistry, histopathology,
    receptor binding assays, ...
  - Incomplete and missing data
- Analysis challenges
  - Should the data be normalized?
  - Does it make sense to cluster the data?
  - How to relate chemical structures and gene
    expression data?
  - Identify correlations between chemical properties
    and biological function
3. Conceptual challenges
[Diagram: Multidisciplinary Drug Compound Candidate Project
Team. Disciplines shown: Chemistry, Biology, Genomics,
Toxicology, Information Technology. Other labels: Shared
Analysis, Affymetrix, CodeLink, CAS, ABase, ISIS.]
4. Conceptual challenges
- Different types of scientist ask different
  questions
  - Chemistry
    - Which chemical compound should I make next?
    - What structural motifs are associated with
      desired activity?
    - Are certain compound properties associated with
      drug-like behavior?
    - Are specific classes of compounds affecting our
      disease target?
    - How do structural R-groups influence activity?
  - Biology/Genomics
    - Which genes are significantly changed by drug
      treatment?
    - What pathways are involved in drug response?
    - Are we identifying agonists/antagonists of our
      target?
    - What gene changes are associated with toxicity?
    - Are genes involved in drug metabolism induced or
      repressed?
  - Information Technology
    - How can the data be stored, retrieved, merged,
      analyzed?
5. Common goals
- Speed the drug development process
- Focus research efforts on only the most promising
  drug compounds
- Increase the number of successful candidate drugs
  in the pipeline
- Discover and create drugs to help people who are
  suffering from diseases, conditions and ailments.
6. Data challenges: multiple sources
- There are multiple kinds of data and multiple
  data sources
  - Chemistry
    - Databases, Data Marts
    - Specialized databases for primary and secondary
      screening
    - Chemical properties databases
    - Chemical structure databases
    - Instrumentation
  - Biology/Genomics
    - Databases, Data Marts
    - Internal and external web-based search engines
    - Specialized gene expression databases
    - Gene annotation databases
    - Instrumentation
7. Data challenges: incomplete data
- Data sources are not always complete
  - Some compounds have not been subjected to
    toxicity studies
  - Some compounds have not been examined in tissue
    studies
  - Compounds might have been tested in different
    animals
  - Historical information might not exist for some
    compounds
- Our analysis must take into account that the data
  might be very sparse.
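
One way to make this sparsity explicit before any analysis is to merge the sources on a shared compound identifier and count how many compounds each measurement actually covers. The sketch below is illustrative only: the compound IDs, field names and values are hypothetical, and `None` stands for a study that was never run.

```python
# Hypothetical per-source records keyed by a shared compound identifier.
tox = {"CPD-1": {"tox_score": 0.8}, "CPD-2": {"tox_score": 0.1}}
tissue = {"CPD-2": {"liver_weight": 1.2}, "CPD-3": {"liver_weight": 0.9}}

def outer_merge(*sources):
    """Merge sources on compound ID, keeping compounds absent from some sources."""
    fields = sorted({f for src in sources for rec in src.values() for f in rec})
    merged = {}
    for cid in sorted(set().union(*sources)):
        row = dict.fromkeys(fields)      # every field starts as None (missing)
        for src in sources:
            row.update(src.get(cid, {}))
        merged[cid] = row
    return merged

def coverage(merged):
    """Fraction of compounds with a recorded value, per field."""
    n = len(merged)
    fields = next(iter(merged.values())).keys()
    return {f: sum(row[f] is not None for row in merged.values()) / n
            for f in fields}

table = outer_merge(tox, tissue)
print(coverage(table))
```

Keeping missing values as `None` rather than zero means later steps have to decide deliberately how to treat them, instead of silently reading "not tested" as "no effect".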
8. Analysis challenges: examples
- Gene expression data normalization
  - Should we make experiments (entire arrays)
    comparable?
  - Should we make genes (within arrays) comparable?
  - How should missing values be handled?
- Clustering
  - Does it make sense to cluster the data?
  - Which distance and similarity metrics should be
    used?
  - How should missing values be handled?
- Chemical Structure analysis
  - Are the structures similar enough to allow a
    meaningful R-group analysis?
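
The two normalization questions correspond to scaling along different axes of the expression matrix. As a minimal sketch, assuming z-scoring as the normalization (the slides do not name a method) and a hypothetical matrix with genes as rows and arrays as columns, the difference looks like this; missing values are skipped when computing the statistics and left missing in the output.

```python
from statistics import mean, pstdev

# Hypothetical log-ratio matrix: rows are genes, columns are arrays; None = missing.
data = [
    [1.0, 2.0, 3.0],
    [2.0, None, 6.0],
    [3.0, 6.0, 9.0],
]

def zscore(values):
    """Z-score the observed values; missing values stay missing.
    Assumes at least one observed value."""
    seen = [v for v in values if v is not None]
    mu, sd = mean(seen), pstdev(seen)
    if sd == 0:
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - mu) / sd for v in values]

def normalize_genes(matrix):
    """Make genes comparable: scale each row (gene) across arrays."""
    return [zscore(row) for row in matrix]

def normalize_arrays(matrix):
    """Make arrays comparable: scale each column (array) across genes."""
    cols = [zscore(list(col)) for col in zip(*matrix)]
    return [list(row) for row in zip(*cols)]

by_gene = normalize_genes(data)
by_array = normalize_arrays(data)
```

The two results generally differ, so the choice should follow from the question being asked: comparing compounds across arrays, or comparing the behavior of genes.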
9. Finding relationships between data
- Example dataset
  - Derived from Iconix DrugMatrix database
  - Summarizes 30 million data points
  - Data gathered during liver toxicity study in rats
  - Each row of data represents a single compound,
    with each column containing a different measured
    variable
  - Includes gene expression in liver, clinical
    chemistry tests, histopathology, receptor binding
    assays
  - Structure database provides chemical structures
- Example dataset
  - Results of ADME/Tox studies from a panel of
    assays
  - Each row is a compound, each column is an assay
10. Simple view: x vs. y, labeled markers
Label and color by single word group
Non-steroidal anti-inflammatory drugs (NSAIDs)
cluster below 0.15 for albumin assay (indicative
of kidney damage typically associated with
NSAIDs)
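
An observation made by eye in a scatter plot can be checked numerically by grouping compounds on the same label used for coloring. The sketch below uses the 0.15 cutoff mentioned on the slide; the compound names and assay values are hypothetical placeholders, not DrugMatrix data.

```python
from collections import defaultdict

# Hypothetical (compound, group, albumin assay value) records.
records = [
    ("cpd_a", "NSAID", 0.10),
    ("cpd_b", "NSAID", 0.12),
    ("cpd_c", "Statin", 0.40),
    ("cpd_d", "Statin", 0.14),
]

def fraction_below(rows, cutoff):
    """Per group, the fraction of compounds whose value falls below the cutoff."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _name, group, value in rows:
        totals[group] += 1
        hits[group] += value < cutoff   # True counts as 1
    return {g: hits[g] / totals[g] for g in totals}

print(fraction_below(records, cutoff=0.15))
```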
11. Linking structures to biology
Do those statins with more upregulated genes have
common structural motifs?
12. Exploring groups/therapeutic roles
Number of entries per single word group
PPARa and Statins are in the same Therapeutic
class
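
The overlap shown by a colored histogram is a cross-tabulation of two categorical columns. A minimal sketch, assuming each compound carries a single-word group and a therapeutic class; the class names and counts below are hypothetical.

```python
from collections import Counter

# Hypothetical (single-word group, therapeutic class) annotations per compound.
compounds = [
    ("Statin", "Lipid lowering"),
    ("Statin", "Lipid lowering"),
    ("PPARa", "Lipid lowering"),
    ("NSAID", "Anti-inflammatory"),
]

def crosstab(pairs):
    """Count compounds for every (group, class) combination."""
    return Counter(pairs)

def groups_sharing_class(pairs):
    """Map each class to the sorted list of groups that occur in it."""
    shared = {}
    for group, cls in set(pairs):
        shared.setdefault(cls, []).append(group)
    return {cls: sorted(groups) for cls, groups in shared.items()}

counts = crosstab(compounds)
shared = groups_sharing_class(compounds)
```

Here `counts` holds the bar heights of the histogram, and `shared` lists which groups fall into the same class.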
13. Three simple views
- Investigate compound groupings based on two
  spatial variables, plus labels, sizing, and
  coloring
- Interactively link chemical structures to 2D
  visualizations
- Examine overlap between categorical variables
  with colored histograms
14. Explore categories of activity
Determine which studies have been performed on a
given compound, and the qualitative (or binned)
result
15. Explore categories of activity
Link several visualizations together to relate
numerical data, qualitative data, structures, ...
16. Hierarchical clustering on gene subset
HC used to group compounds according to gene
expression patterns
Genes involved in cholesterol synthesis
downregulated by steroid receptor (orange category)
Single word categorical column
Genes involved in cholesterol synthesis
upregulated by Statins (light blue category)
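
To show what grouping compounds by expression pattern involves, here is a small sketch of agglomerative clustering with average linkage. Both the linkage and the distance are assumptions, since the slides do not say which were used: the distance is Euclidean over the genes measured in both profiles, rescaled to the full profile length, which is one common way of handling missing values. The compound names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt

def distance(a, b):
    """Euclidean distance over genes measured in both profiles, rescaled to
    the full profile length so that sparse pairs are not favoured."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")              # nothing in common: treat as far apart
    total = sum((x - y) ** 2 for x, y in shared)
    return sqrt(total * len(a) / len(shared))

def average_linkage(profiles, n_clusters):
    """Repeatedly merge the two clusters with the smallest mean pairwise
    distance until n_clusters remain."""
    clusters = [[name] for name in profiles]

    def linkage(c1, c2):
        dists = [distance(profiles[p], profiles[q]) for p in c1 for q in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # i < j, so index i stays valid
    return [sorted(c) for c in clusters]

# Hypothetical log ratios for four cholesterol-synthesis genes; None = missing.
profiles = {
    "statin_1": [2.0, 1.8, None, 2.1],
    "statin_2": [1.9, None, 2.2, 2.0],
    "steroid_1": [-2.0, -1.7, -2.1, None],
    "steroid_2": [-1.8, -2.0, None, -1.9],
}
groups = average_linkage(profiles, n_clusters=2)
```

This brute-force version recomputes every linkage at each step, so it is only suitable for small examples; a real analysis would use a dedicated clustering library.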
17. Explore structural motifs vs. expression
What are the compounds with structural
commonalities?
Modify structure and search for other compounds
with this substructure
18. Explore structural motifs vs. expression
Are common structural motifs related to activity?
Note that each compound might have been tested at
multiple time points and/or multiple doses.
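
Because a compound can appear under several doses and time points, it helps to collapse the measurements to one record per compound before relating structure to activity. The sketch below shows one possible summary (count, median and largest-magnitude value); the choice of summary statistics, the field layout and the values are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical (compound, dose, time point in days, log ratio) measurements.
measurements = [
    ("cpd_a", "low", 1, 0.2),
    ("cpd_a", "high", 1, 1.4),
    ("cpd_a", "high", 5, 1.8),
    ("cpd_b", "high", 1, -0.3),
]

def summarize(rows):
    """Collapse all dose/time conditions to one record per compound."""
    by_compound = defaultdict(list)
    for compound, _dose, _day, value in rows:
        by_compound[compound].append(value)
    return {
        c: {"n_conditions": len(v),
            "median": median(v),
            "max_abs": max(v, key=abs)}   # value with the largest magnitude
        for c, v in by_compound.items()
    }

summary = summarize(measurements)
```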
19. View structures of outliers
Select two statins that did not cluster together
by gene expression and look for structural
differences
View structures associated with compounds that
formed their own cluster in gene expression
20. Integrate pathway and GO information
21. Link clustering and PCA results
Compare results of clustering and PCA
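
A simple numerical counterpart to this comparison is to project the compounds onto the first principal component and check whether existing cluster labels separate along it. The sketch below computes the first component by power iteration on the covariance matrix. It assumes complete data (no missing values), exactly two clusters, and a starting vector that is not orthogonal to the first component; the data and labels are hypothetical.

```python
from math import sqrt

def first_pc_scores(rows, iterations=200):
    """Score of each row on the first principal component (power iteration)."""
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    centered = [[r[j] - means[j] for j in range(m)] for r in rows]
    # Sample covariance matrix of the columns (genes).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(m)]
           for i in range(m)]
    vec = [1.0 / sqrt(m)] * m
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = sqrt(sum(x * x for x in nxt))
        if norm == 0:
            break
        vec = [x / norm for x in nxt]
    return [sum(c[j] * vec[j] for j in range(m)) for c in centered]

def pc1_separates(scores, labels):
    """True if the two clusters occupy non-overlapping ranges on PC1."""
    groups = {}
    for score, label in zip(scores, labels):
        groups.setdefault(label, []).append(score)
    a, b = groups.values()               # assumes exactly two cluster labels
    return max(a) < min(b) or max(b) < min(a)

# Hypothetical expression profiles (rows = compounds) and cluster labels.
rows = [[2.0, 1.8, 2.1], [1.9, 2.2, 2.0], [-2.0, -1.7, -2.1], [-1.8, -2.0, -1.9]]
labels = ["A", "A", "B", "B"]
scores = first_pc_scores(rows)
print(pc1_separates(scores, labels))
```

The sign of a principal component is arbitrary, which is why the check compares ranges rather than testing for positive or negative scores.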
22. Statistical/structural analysis visualization
- Overview biological results, categorical data,
  missing values, ...
- Use clustering to group compounds according to
  gene expression levels. Perform substructure
  searches to identify structure-expression
  relationships.
- Explore the relationship between structural
  differences and gene expression within subgroups
  of similar compounds
- Link structures, gene expression data,
  hierarchical ontologies and pathways.
- Compare results from different clustering
  methods, PCA analyses, profile matching metrics, ...
23. Summary
- There are multiple challenges in chemogenomics
  - Conceptual challenges
  - Data challenges
  - Analysis challenges
- Visualization is a useful and effective tool for
  understanding chemogenomic data
  - It can link multiple data types (structures, gene
    expression data, biological assay results,
    pathways, ...).
  - It can help scientists interpret and understand
    the results of statistical analyses and is a
    complement (not a replacement!) to rigorous
    statistical analysis.
  - It lets the human eye do something that it does
    very well: notice patterns, outliers, unexpected
    trends and relationships, and assimilate huge
    amounts of information at a glance.