Title: Molecular Signaling
1Molecular Signaling Drug Development Course: Development of Molecular Signatures from
High-Throughput Assay Data
- Alexander Statnikov, Ph.D.
- Director, Computational Causal Discovery Laboratory
- Benchmarking Director, Best Practices Integrative Informatics Consultation Service
- Assistant Professor, Department of Medicine, Division of Clinical Pharmacology
- Center for Health Informatics and Bioinformatics, NYU School of Medicine
- 5/16/2011
2Outline
- Part 1: Introduction to molecular signatures
- Part 2: Key principles for developing accurate molecular signatures
- Part 3: Comprehensive evaluation of algorithms to develop molecular signatures for cancer classification
- Part 4: Analysis and computational dissection of molecular signature multiplicity
- Conclusion
- Homework assignment
3Part 1: Introduction to molecular signatures
4Definition of a molecular signature
- A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or other response variable of interest.
5FDA view on molecular signatures
- The FDA calls them "in vitro diagnostic multivariate index assays" (IVDMIAs)
- 1. Class II Special Controls Guidance Document: Gene Expression Profiling Test System for Breast Cancer Prognosis
  - Addresses device classification.
- 2. The Critical Path to New Medical Products
  - Identifies pharmacogenomics as crucial to advancing medical product development and personalized medicine.
- 3. Draft Guidance on Pharmacogenetic Tests and Genetic Tests for Heritable Markers, and Guidance for Industry: Pharmacogenomic Data Submissions
  - Identifies 3 main goals (dose, ADEs, responders),
  - Defines IVDMIA,
  - Encourages fault-free sharing of pharmacogenomic data,
  - Separates probable from valid biomarkers,
  - Focuses on genomics (and not other omics).
6Main uses of molecular signatures
- Direct benefits: Models of disease phenotype/clinical outcome
  - Diagnosis
  - Prognosis, long-term disease management
  - Personalized treatment (drug selection, titration)
- Ancillary benefits 1: Biomarkers for diagnosis or outcome prediction
  - Make the above tasks resource-efficient and easy to use in clinical practice
  - Help next-generation molecular imaging
  - Provide leads for potential new drug candidates
- Ancillary benefits 2: Discovery of structure and mechanisms (regulatory/interaction networks, pathways, sub-types)
  - Provide leads for potential new drug candidates
7Less conventional uses of molecular signatures
- Increase clinical trial sample efficiency, decrease costs, or both, using placebo responder signatures
- In silico signature-based candidate drug screening
- Drug resurrection
- Establishing existence of biological signal in very small sample situations where univariate signals are too weak
- Assess importance of markers and of mechanisms involving those
- Choosing the right animal model
- ?
8Recent molecular signatures available for
patient care
Agendia
Clarient
Prediction Sciences
LabCorp
Veridex
University Genomics
Genomic Health
BioTheranostics
Applied Genomics
Power3
Correlogic Systems
9Molecular signatures in the market
Company | Product | Disease | Purpose
Agendia | MammaPrint | Breast cancer | Risk assessment for the recurrence of distant metastasis in a breast cancer patient.
Agendia | TargetPrint | Breast cancer | Quantitative determination of the expression level of estrogen receptor, progesterone receptor and HER2 genes. This product is supplemental to MammaPrint.
Agendia | CupPrint | Cancer | Determination of the origin of the primary tumor.
University Genomics | Breast Bioclassifier | Breast cancer | Classification of ER-positive and ER-negative breast cancers into expression-based subtypes that more accurately predict patient outcome.
Clarient | Insight Dx Breast Cancer Profile | Breast cancer | Prediction of disease recurrence risk.
Clarient | Prostate Gene Expression Profile | Prostate cancer | Diagnosis of grade 3 or higher prostate cancer.
Prediction Sciences | RapidResponse c-Fn Test | Stroke | Identification of the patients that are safe to receive tPA and those at high risk for HT, to help guide the physician's treatment decision.
Genomic Health | Oncotype DX | Breast cancer | Individualized prediction of chemotherapy benefit and 10-year distant recurrence to inform adjuvant treatment decisions in certain women with early-stage breast cancer.
bioTheranostics | CancerTYPE ID | Cancer | Classification of 39 types of cancer.
bioTheranostics | Breast Cancer Index | Breast cancer | Risk assessment and identification of patients likely to benefit from endocrine therapy, and whose tumors are likely to be sensitive or resistant to chemotherapy.
Applied Genomics | MammaStrat | Breast cancer | Risk assessment of cancer recurrence.
Applied Genomics | PulmoType | Lung cancer | Classification of non-small cell lung cancer into adenocarcinoma versus squamous cell carcinoma subtypes.
Applied Genomics | PulmoStrat | Lung cancer | Assessment of an individual's risk of lung cancer recurrence following surgery, to help with adjuvant therapy decisions.
Correlogic | OvaCheck | Ovarian cancer | Early detection of epithelial ovarian cancer.
LabCorp | OvaSure | Ovarian cancer | Assessment of the presence of early stage ovarian cancer in high-risk women.
Veridex | GeneSearch BLN Assay | Breast cancer | Determination of whether breast cancer has spread to the lymph nodes.
Power3 | BC-SeraPro | Breast cancer | Differentiation between breast cancer patients and control subjects.
10MammaPrint
- Developed by Agendia (www.agendia.com)
- 70-gene signature to stratify women with breast cancer that hasn't spread into "low risk" and "high risk" for recurrence of the disease
- Independently validated in >1,000 patients
- So far performed >10,000 tests
- Cost of the test is $3,000
- In February 2007, the FDA cleared the MammaPrint test for marketing in the U.S. for node-negative women under 61 years of age with tumors of less than 5 cm.
- TIME Magazine's 2007 medical invention of the year.
11Oncotype DX
- Developed by Genomic Health (www.genomichealth.com)
- 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
- Independently validated in >1,000 patients
- So far performed >50,000 tests
- Cost of the test is $3,000
- The following paper shows the health benefits and cost-effectiveness benefits of using Oncotype DX: http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT
12Part 2: Key principles for developing accurate
molecular signatures
13Main ingredients for developing a molecular
signature
[Diagram: a well-defined clinical problem and access to patients/samples, high-throughput assays, and computational and biostatistical analysis combine to produce a molecular signature]
14Challenges in computational analysis of omics
data
- Relatively easy to develop a predictive model and even easier to believe that a model is good when it is not → false sense of security
- Several problems exist: some theoretical and some practical
- Omics data has many special characteristics and is tricky to analyze!
15Example: OvaCheck (1/2)
- Developed by Correlogic (www.correlogic.com)
- Blood test for the early detection of epithelial ovarian cancer
- Failed to obtain FDA approval
- Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
- Signature developed by genetic algorithm
- Significant artifacts in data collection and analysis called the validity of the signature into question
  - Results are not reproducible
  - Data collected differently for different groups of patients
- http://www.nature.com/nature/journal/v429/n6991/full/429496a.html
16Example: OvaCheck (2/2)
[Figure with panels A-F, from Baggerly et al. (Bioinformatics, 2004)]
17An early kind of analysis: Learning disease sub-types by clustering patient profiles
[Figure: patient profiles plotted on p53 and Rb axes]
18Clustering: Seeking "natural" groupings, hoping that they will be useful
[Figure: patient profiles plotted on p53 and Rb axes]
19E.g., for classification (predict response to treatment)
[Figure: p53 and Rb axes; groups labeled "Respond to treatment Tx1" and "Do not respond to treatment Tx1"]
20Another use of clustering
- Cluster genes (instead of patients)
  - Genes that cluster together may belong to the same pathways
  - Genes that cluster apart may be unrelated
21Unfortunately, clustering is a non-specific method and falls into the "one-solution-fits-all" trap when used for classification
[Figure: p53 and Rb axes; groups labeled "Squamous carcinoma" and "Adenocarcinoma"]
22Clustering is also non-specific when used to discover pathways or other mechanistic relationships
It is entirely possible in this simple illustrative counter-example for G3 (a gene causally unrelated to the phenotype) to be more strongly associated, and thus cluster, with the phenotype (or its surrogate genes) than the true causal oncogenes G1 and G2.
[Diagram: network over genes G1, G2, G3 and the phenotype Ph]
23Two improved classes of methods
- Supervised learning → classification/molecular signatures and markers
- Regulatory network reverse engineering → pathways
24Supervised learning: Use the known phenotypes (a.k.a. class labels) in training data to build signatures or find markers highly specific for that phenotype
[Diagram: training samples (A1, B1, C1, D1, T1; A2, B2, C2, D2, T2; ...; An, Bn, Cn, Dn, Tn) over variables A, B, C, D and response T are fed to a classifier/regression algorithm, which outputs a molecular signature; the signature is applied to testing/validation samples to estimate classification performance]
25Input data for supervised learning methods
[Table: one row per sample, giving its class label (Primary or Metastatic) followed by its variables/features]
26Principles and geometric representation for
supervised learning (1/7)
- Want to classify objects as boats and houses.
27Principles and geometric representation for
supervised learning (2/7)
- All objects before the coast line are boats and all objects after the coast line are houses.
- Coast line serves as a decision surface that separates two classes.
28Principles and geometric representation for
supervised learning (3/7)
These boats will be misclassified as houses
This house will be misclassified as boat
29Principles and geometric representation for
supervised learning (4/7)
[Figure: boats and houses plotted by longitude and latitude]
- The methods that build classification models (i.e., classification algorithms) operate very similarly to the previous example.
- First all objects are represented geometrically.
30Principles and geometric representation for
supervised learning (5/7)
[Figure: boats and houses plotted by longitude and latitude, with a decision surface between them]
Then the algorithm seeks to find a decision surface that separates classes of objects.
31Principles and geometric representation for
supervised learning (6/7)
[Figure: longitude vs. latitude with the decision surface; unlabeled objects (?) above the surface are classified as houses, those below it as boats]
Unseen (new) objects are classified as boats if they fall below the decision surface and as houses if they fall above it.
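The boats-and-houses idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates are made up, and the decision surface is taken to be the perpendicular bisector of the two class centroids (a nearest-centroid rule), which is one simple way to obtain a straight-line surface and not necessarily the method the slides have in mind. The names `fit_linear_surface` and `classify` are hypothetical.

```python
# Toy "boats vs. houses" data as (longitude, latitude) pairs. All numbers are made up.
boats = [(1.0, 1.0), (2.0, 0.5), (3.0, 1.5), (4.0, 1.0)]
houses = [(1.0, 4.0), (2.0, 3.5), (3.0, 4.5), (4.0, 4.0)]


def centroid(points):
    """Mean position of a list of 2-D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def fit_linear_surface(class_a, class_b):
    """Return (w, b) such that w.x + b > 0 on the class_b side.

    The surface is the perpendicular bisector of the two class centroids,
    which is a straight line in 2-D.
    """
    ca, cb = centroid(class_a), centroid(class_b)
    w = (cb[0] - ca[0], cb[1] - ca[1])
    mid = ((ca[0] + cb[0]) / 2, (ca[1] + cb[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def classify(point, w, b):
    """Label an unseen object by the side of the surface it falls on."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return "house" if score > 0 else "boat"


w, b = fit_linear_surface(boats, houses)
print(classify((2.5, 0.8), w, b))  # a new object below the surface
print(classify((2.5, 4.2), w, b))  # a new object above the surface
```

Objects that lie on the wrong side of the fitted line would be misclassified, exactly as in the coast-line picture.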
32Principles and geometric representation for
supervised learning (7/7)
[Figure: longitude vs. latitude, with Object 1, Object 2 and Object 3 marked]
33In 2-D this looks simple, but what happens in higher-dimensional data?
- 10,000-50,000 (gene expression microarrays, aCGH, and early SNP arrays)
- >500,000 (tiled microarrays, SNP arrays)
- 10,000-300,000 (regular MS proteomics)
- >10,000,000 (LC-MS proteomics)
- >100,000,000 (next-generation sequencing)
- This is the curse of dimensionality problem
34High-dimensionality (especially with small samples) causes:
- Some methods do not run at all (classical regression)
- Some methods give bad results (KNN, decision trees)
- Very slow analysis
- Very expensive/cumbersome clinical application
- A tendency to overfit
35Two problems: Over-fitting and Under-fitting
- Over-fitting (a model to your data): building a model that is good in original data but fails to generalize well to new/unseen data
- Under-fitting (a model to your data): building a model that is poor in both original data and new/unseen data
36Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
37Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: outcome of interest Y vs. predictor X, showing training data and future data; one line is labeled "This line is good!" and another "This line overfits!"]
38Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: outcome of interest Y vs. predictor X, showing training data and future data; one line is labeled "This line is good!" and another "This line underfits!"]
39Very important concept
- Successful data analysis methods balance training data fit with complexity.
  - Too complex signature (to fit training data well) → overfitting (i.e., signature does not generalize)
  - Too simplistic signature (to avoid overfitting) → underfitting (will generalize, but the fit to both the training and future data will be low and predictive performance small).
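The trade-off between fit and complexity can be made concrete with a small simulation. The sketch below is an illustration only: the ground truth (y = 2x plus Gaussian noise), the sample sizes, and the three toy models are assumptions chosen for clarity. A constant predictor stands in for a too-simplistic model, a least-squares line for a balanced one, and a 1-nearest-neighbor lookup (which memorizes the training set) for a too-complex one.

```python
import random

random.seed(0)


def make_data(n):
    """Draw n points from y = 2x + Gaussian noise (a made-up ground truth)."""
    xs = [random.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + random.gauss(0, 1) for x in xs]
    return xs, ys


def mse(model, xs, ys):
    """Mean squared error of a model on the given data."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


train_x, train_y = make_data(50)   # "training data"
test_x, test_y = make_data(200)    # "future data"

mean_x = sum(train_x) / len(train_x)
mean_y = sum(train_y) / len(train_y)


def underfit(x):
    """Too simple: always predict the training mean, ignoring the predictor."""
    return mean_y


# Balanced: ordinary least-squares straight line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x


def balanced(x):
    return slope * x + intercept


def overfit(x):
    """Too complex: return the outcome of the nearest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


for name, model in [("underfit", underfit), ("balanced", balanced), ("overfit", overfit)]:
    print(f"{name:9s} train MSE = {mse(model, train_x, train_y):6.2f}  "
          f"test MSE = {mse(model, test_x, test_y):6.2f}")
```

The memorizing model reproduces the training data exactly yet does worse than the straight line on future data, while the constant model is poor on both.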
40The Support Vector Machine (SVM) approach for
building molecular signatures
- Support vector machines (SVMs) are a binary classification algorithm.
- SVMs are important because of (a) theoretical reasons:
  - Robust to very large number of variables and small samples
  - Can learn both simple and highly complex classification models
  - Employ sophisticated mathematical principles to avoid overfitting
- and (b) superior empirical results.
41Main ideas of SVMs (1/3)
[Figure: cancer patients and normal patients plotted by Gene X and Gene Y]
- Consider example dataset described by 2 genes, gene X and gene Y
- Represent patients geometrically (by vectors)
42Main ideas of SVMs (2/3)
[Figure: cancer patients and normal patients plotted by Gene X and Gene Y, with a separating line]
- Find a linear decision surface ("hyperplane") that can separate patient classes and has the largest distance (i.e., largest "gap" or "margin") between border-line patients (i.e., "support vectors")
43Main ideas of SVMs (3/3)
- If such linear decision surface does not exist, the data is mapped into a much higher dimensional space ("feature space") where the separating decision surface is found
- The feature space is constructed via very clever mathematical projection ("kernel trick").
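A small numeric check may help make the kernel trick less mysterious. The sketch below uses the degree-2 polynomial kernel as an example (the slides do not say which kernel is meant, so this choice is an assumption) and shows that evaluating the kernel in the original 2-D space gives the same number as an ordinary dot product in an explicit 3-D feature space. The function names are hypothetical.

```python
import math


def poly_kernel(x, z):
    """Degree-2 polynomial kernel, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def feature_map(x):
    """Explicit 3-D feature space that the kernel above corresponds to."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(u, v):
    """Ordinary dot product."""
    return sum(a * b for a, b in zip(u, v))


# Two made-up patients described by (gene X, gene Y).
x, z = (1.0, 2.0), (3.0, 0.5)
print(poly_kernel(x, z))                    # computed without leaving 2-D
print(dot(feature_map(x), feature_map(z)))  # same value via the 3-D mapping
```

Because an SVM only needs dot products between samples, it can work in the higher-dimensional space through the kernel alone, without ever constructing the mapped vectors.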
44On estimation of signature accuracy
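Large sample case: use hold-out validation
[Diagram: the data is split once into a train part and a test part]
Small sample case: use N-fold cross-validation
[Diagram: the data is split into N parts; each part serves as the test set once while the remaining parts are used for training]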
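The N-fold scheme can be written down as an index-splitting routine. This is a minimal sketch: `k_fold_splits` and `cross_validate` are hypothetical names, `fit` and `score` are placeholders for whatever classifier and performance metric are used, and samples are assumed to be already in random order (a real analysis would normally shuffle and often stratify by class).

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs for N-fold cross-validation."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by n_folds.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def cross_validate(n_samples, n_folds, fit, score):
    """Average the test score over folds; `fit` and `score` are user-supplied."""
    scores = []
    for train, test in k_fold_splits(n_samples, n_folds):
        model = fit(train)                 # train only on the training part
        scores.append(score(model, test))  # evaluate only on the held-out part
    return sum(scores) / len(scores)


for train, test in k_fold_splits(10, 5):
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once and is never part of the training set of the fold in which it is tested.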
45Nested N-fold cross-validation
Recall the main idea of cross-validation
data
46Overview of challenges in computational analysis
of omics data for development of molecular
signatures
[Diagram of challenges in data analytics of molecular signatures:]
- Rashomon effect / marker multiplicity
- Assay validity / reproducibility
- Efficiency: statistical / computational
- Research designs
- Is there predictive signal?
- Causality vs. predictiveness / biological significance
- Methods development: re-inventing the wheel and specialization
- Epistasis
- Many variables, small sample, noise, artifacts
- Instability
- Performance: predictivity, compactness
- Protocols / guidelines
- Editorializing / over-simplifying / sensationalism
47Part 3: Comprehensive evaluation of algorithms to
develop molecular signatures for cancer
classification
48Comprehensive evaluation of algorithms for
classification of cancer microarray data
- Main goals:
  - Find the best performing algorithms for building molecular signatures for cancer diagnosis from microarray gene expression data
  - Investigate benefits of using gene selection and ensemble classification methods.
49Classification algorithms
- K-Nearest Neighbors (KNN) [instance-based]
- Backpropagation Neural Networks (NN) [neural networks]
- Probabilistic Neural Networks (PNN) [neural networks]
- Multi-Class SVM: One-Versus-Rest (OVR) [kernel-based]
- Multi-Class SVM: One-Versus-One (OVO) [kernel-based]
- Multi-Class SVM: DAGSVM [kernel-based]
- Multi-Class SVM by Weston and Watkins (WW) [kernel-based]
- Multi-Class SVM by Crammer and Singer (CS) [kernel-based]
- Weighted Voting: One-Versus-Rest [voting]
- Weighted Voting: One-Versus-One [voting]
- Decision Trees: CART [decision trees]
50Ensemble classification methods
51Gene selection methods
- Signal-to-noise (S2N) ratio in one-versus-rest (OVR) fashion
- Signal-to-noise (S2N) ratio in one-versus-one (OVO) fashion
- Kruskal-Wallis nonparametric one-way ANOVA (KW)
- Ratio of genes' between-categories to within-category sum of squares (BW).
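As an illustration of the first method in this list, the sketch below ranks genes by a signal-to-noise ratio in one-versus-rest fashion. The slides do not spell out the formula, so the widely used definition (difference of class means divided by the sum of class standard deviations) is assumed here. The gene names, expression values, and function names are made up.

```python
from statistics import mean, stdev


def s2n(values_a, values_b):
    """Signal-to-noise ratio of one gene between two groups of samples.

    Assumed definition: (mean_a - mean_b) / (sd_a + sd_b).
    """
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))


def rank_genes_ovr(expression, labels, target_class):
    """Rank genes by |S2N| for target_class versus all remaining classes (OVR).

    expression: dict mapping gene name -> list of values, one per sample.
    labels: list of class labels, one per sample.
    """
    scores = {}
    for gene, values in expression.items():
        in_class = [v for v, y in zip(values, labels) if y == target_class]
        rest = [v for v, y in zip(values, labels) if y != target_class]
        scores[gene] = abs(s2n(in_class, rest))
    return sorted(scores, key=scores.get, reverse=True)


# Made-up expression values for three hypothetical genes and six samples.
labels = ["tumor", "tumor", "tumor", "normal", "normal", "normal"]
expression = {
    "geneA": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],  # strongly differential
    "geneB": [5.0, 6.0, 4.0, 5.5, 4.5, 5.0],    # uninformative
    "geneC": [7.0, 8.0, 6.0, 4.0, 5.0, 6.0],    # moderately differential
}
print(rank_genes_ovr(expression, labels, "tumor"))
```

With more than two classes, the one-versus-one variant would apply the same score to every pair of classes instead of one class against the rest.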
52Performance metrics and statistical comparison
- 1. Accuracy
  - can compare to previous studies
  - easy to interpret and simplifies statistical comparison
- 2. Relative classifier information (RCI)
  - easy to interpret and simplifies statistical comparison
  - not sensitive to distribution of classes
  - accounts for difficulty of a decision problem
- Randomized permutation testing to compare accuracies of the classifiers (α = 0.05)
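One way to compare the accuracies of two classifiers by randomized permutation testing is sketched below. The exact procedure used in the study is not given on the slide, so this paired version (randomly swapping the two classifiers' outcomes on each test sample) is an assumption, as are the function names and the made-up outcome vectors.

```python
import random


def accuracy(correct):
    """Fraction of correctly classified samples."""
    return sum(correct) / len(correct)


def permutation_test(correct_a, correct_b, n_permutations=10000, seed=0):
    """Two-sided paired permutation test for a difference in accuracy.

    correct_a[i] and correct_b[i] are 1 if classifier A / B classified test
    sample i correctly and 0 otherwise. Under the null hypothesis the two
    classifiers are interchangeable, so their outcomes on each sample can be
    swapped at random.
    """
    rng = random.Random(seed)
    observed = abs(accuracy(correct_a) - accuracy(correct_b))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        perm_a, perm_b = [], []
        for a, b in zip(correct_a, correct_b):
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(accuracy(perm_a) - accuracy(perm_b)) >= observed:
            at_least_as_extreme += 1
    # The add-one correction keeps the estimated p-value strictly positive.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Made-up per-sample outcomes of two hypothetical classifiers on 30 test samples.
svm_correct = [1] * 28 + [0] * 2
knn_correct = [1] * 15 + [0] * 15
p_value = permutation_test(svm_correct, knn_correct)
print("difference is significant at alpha = 0.05:", p_value < 0.05)
```

The p-value estimates how often a difference at least as large as the observed one arises when the labels "classifier A" and "classifier B" carry no information.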
53Microarray datasets
- Total
- 1300 samples
- 74 diagnostic categories
- 41 cancer types and
- 12 normal tissue types
54Summary of methods and datasets
55Results without gene selection
56Results with gene selection
Improvement of diagnostic performance by gene
selection (averages for the four datasets)
Diagnostic performance before and after gene
selection
Average reduction of genes is 10-30 times
57Comparison with previously published results
58Summary of results
- Multi-class SVMs are the best family among the tested algorithms, outperforming KNN, NN, PNN, DT, and WV.
- Gene selection in some cases improves classification performance of all classifiers, especially of non-SVM algorithms
- Ensemble classification does not improve performance of SVM and other classifiers
- Results obtained by SVMs compare favorably with the literature.
59Random Forest (RF) classifiers
- Appealing properties:
  - Work when # of predictors > # of samples
  - Embedded gene selection
  - Incorporate interactions
  - Based on theory of ensemble learning
  - Can work with binary and multiclass tasks
  - Do not require much fine-tuning of parameters
- Strong theoretical claims
- Empirical evidence: (Diaz-Uriarte and Alvarez de Andres, BMC Bioinformatics, 2006) reported superior classification performance of RFs compared to SVMs and other methods
60Key principles of RF classifiers
[Diagram of the procedure:]
1) Generate bootstrap samples from the training data
2) Random gene selection
3) Fit unpruned decision trees
4) Apply to testing data and combine predictions
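The four steps can be sketched in plain Python. This is a deliberately simplified illustration, not a faithful Random Forest: each "tree" is a single-split stump instead of an unpruned decision tree, the random gene subset is drawn once per tree, and the data, gene count, and function names are made up.

```python
import random
from collections import Counter


def fit_stump(rows, labels, features):
    """Best single-threshold split over the allowed genes (a depth-1 tree)."""
    majority = Counter(labels).most_common(1)[0][0]
    # Fallback: a constant predictor, kept if no split improves on it.
    best = (sum(y != majority for y in labels), features[0], float("inf"),
            majority, majority)
    for f in features:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    rng = random.Random(seed)
    n, forest = len(rows), []
    for _ in range(n_trees):
        # 1) bootstrap sample: draw n training samples with replacement
        picks = [rng.randrange(n) for _ in range(n)]
        # 2) random gene selection: each tree sees a random subset of genes
        features = rng.sample(range(len(rows[0])), n_features)
        # 3) fit a tree (here only a stump) on the bootstrap sample
        forest.append(fit_stump([rows[i] for i in picks],
                                [labels[i] for i in picks], features))
    return forest


def predict(forest, row):
    # 4) apply every tree to the new sample and combine by majority vote
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]


# Made-up data: 4 hypothetical genes; "tumor" samples have higher values.
rows = [(1.0, 1.2, 0.8, 1.1), (1.5, 0.9, 1.3, 0.7), (0.8, 1.4, 1.0, 1.2), (1.2, 1.0, 0.9, 1.4),
        (3.1, 2.9, 3.3, 3.0), (2.8, 3.2, 3.0, 3.4), (3.4, 3.0, 2.9, 3.1), (3.0, 3.3, 3.2, 2.8)]
labels = ["normal"] * 4 + ["tumor"] * 4
forest = fit_forest(rows, labels)
print(predict(forest, (0.9, 1.1, 1.0, 1.0)))
print(predict(forest, (3.2, 3.1, 3.0, 3.3)))
```

A full implementation would grow each tree to full depth and re-draw the random gene subset at every split; the bootstrap-then-vote structure stays the same.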
61Results without gene selection
- SVMs nominally outperform RFs in 15 datasets, RFs outperform SVMs in 4 datasets, algorithms are exactly the same in 3 datasets.
- In 7 datasets SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.033 AUC and 0.057 RCI.
62Results with gene selection
- SVMs nominally outperform RFs in 17 datasets, RFs outperform SVMs in 3 datasets, algorithms are exactly the same in 2 datasets.
- In 1 dataset SVMs outperform RFs statistically significantly.
- On average, the performance advantage of SVMs is 0.028 AUC and 0.047 RCI.
63Part 4: Analysis and computational dissection of
molecular signature multiplicity
64Molecular signature multiplicity
- Different methods or samples from the same population lead to different but apparently maximally predictive signatures
- Far-reaching implications for biological discovery and development of next generation patient diagnostics and personalized treatments:
  - Generation of biological hypotheses is very hard even when signatures are maximally predictive of the phenotype, since thousands of completely different signatures are equally consistent with the data
  - Produced signatures are not statistically generalizable to new cases, and thus not reliable enough for translation to clinical practice.
65Molecular signature multiplicity
- Causes of this phenomenon are unknown; several contradictory conjectures exist in the field:
  - Signature multiplicity is due to small samples [Michiels et al., 2005]
  - Signature multiplicity leads to predictively non-reproducible signatures [Ein-Dor et al., 2006]; building reproducible signatures requires thousands of samples [Ioannidis, 2005]
  - Signature multiplicity is a by-product of the complex regulatory connectivity of genome [Dougherty and Brun, 2006]
  - Artifacts of data pre-processing, e.g. normalization [Gold et al., 2005; Qiu et al., 2005; Ploner et al., 2005]
66Major goals
- Develop a Markov boundary characterization of molecular signature multiplicity phenomenon
- Design and study algorithms that can correctly identify the set of maximally predictive and non-redundant molecular signatures
- Conduct an empirical evaluation of the novel algorithms and compare to the existing state-of-the-art methods
- Test and refine previously stated hypotheses about the causes of signature multiplicity phenomenon.
67Optimality criteria of signatures
- Signatures that are the focus of this research satisfy the following two optimality criteria:
  - maximally predictive of the phenotype (they achieve best predictivity of the phenotype in the given dataset over all signatures based on different gene sets)
  - do not contain predictively redundant genes (i.e., genes that can be removed from the signature without adversely affecting its predictivity).
68Why do we need algorithms to extract as many
optimal signatures as possible?
- A deeper understanding of the signature multiplicity phenomenon and how it affects reproducibility of signatures
- Improving discovery of the underlying biological mechanisms by not missing genes that are implicated biologically in disease processes
- Catalyzing regulatory approval by establishing in-silico equivalence to previously validated signatures
69Existing algorithms for multiple signature extraction: Resampling-based methods
[Diagram of the procedure:]
1) Generate resampled datasets from the training data (e.g., by bootstrapping)
2) Apply a standard signature extraction algorithm (e.g., SVM-RFE) to each resampled dataset, yielding signatures X1, X2, X3, ..., XN
- Based on assumption that multiplicity is strictly a small-sample phenomenon
- An infinite number of resamplings is required to extract all optimal signatures
- May stop producing multiple signatures in large sample sizes.
70Existing algorithms for multiple signature extraction: Iterative removal
[Diagram: extract signature X1 from the original data (all genes) and remove the corresponding genes from the dataset; extract X2 from the reduced data (excluding X1 genes) and remove the corresponding genes; extract X3 from the reduced data (excluding X1 and X2 genes); and so on, until a signature has statistically significantly reduced predictivity]
- Agnostic to what causes molecular signature multiplicity
- Cannot discover signatures that have genes in common.
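The iterative removal loop is short enough to sketch directly. In this illustration `extract_signature` and `predictivity` are placeholders for a real signature extraction algorithm and a real performance estimate, and the stopping rule is a fixed tolerance standing in for the statistical significance test mentioned on the slide; all names and numbers are hypothetical.

```python
def iterative_removal(genes, extract_signature, predictivity, tolerance=0.05):
    """Repeatedly extract a signature and remove its genes from the pool.

    genes: iterable of gene names available at the start.
    extract_signature(pool) -> set of genes chosen from `pool` (user-supplied).
    predictivity(signature) -> score such as AUC (user-supplied).
    Stops once a signature scores more than `tolerance` below the first one;
    this fixed threshold is a stand-in for a proper statistical test.
    """
    pool = set(genes)
    signatures = []
    baseline = None
    while pool:
        signature = extract_signature(pool)
        if not signature:
            break
        score = predictivity(signature)
        if baseline is None:
            baseline = score
        elif score < baseline - tolerance:
            break
        signatures.append(signature)
        pool -= signature  # later signatures can never reuse these genes
    return signatures


# Toy stand-ins: each gene has a made-up individual score, a "signature" is the
# two best remaining genes, and its predictivity is their mean score.
gene_score = {"g1": 0.95, "g2": 0.94, "g3": 0.93, "g4": 0.92, "g5": 0.60, "g6": 0.55}


def extract_top2(pool):
    return set(sorted(pool, key=gene_score.get, reverse=True)[:2])


def mean_score(signature):
    return sum(gene_score[g] for g in signature) / len(signature)


print(iterative_removal(gene_score, extract_top2, mean_score))
```

Because genes are deleted from the pool after each round, the returned signatures are disjoint by construction, which is the limitation noted in the last bullet.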
71Existing algorithms for multiple signature extraction: Stochastic gene selection
Genetic Algorithms (e.g., GA/KNN or GA/SVM)
- Can output all signatures that are discoverable by a genetic algorithm when it is allowed to evolve an infinite number of generations.
KIAMB
- Stochastic Markov boundary method based on IAMB algorithm
- In a specific class of distributions, every optimal signature will be output by this method with nonzero probability
- Requires an infinite number of iterations to discover all optimal signatures; will discover the same signature over and over again
- Sample requirements are of exponential order in the number of genes in a signature.
72Existing algorithms for multiple signature extraction: Brute-force exhaustive search
LIKNON
- Examines predictivity of all individual genes in the dataset, all pairs of genes, all triples of genes, and so on
- It is infeasible when a signature has more than 2-3 genes
- Agnostic to what causes signature multiplicity.
In summary, no current algorithm provides a systematic and efficient approach for identification of the set of maximally predictive and non-redundant molecular signatures that exist in the underlying distribution.
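A quick count shows why exhaustive search breaks down beyond 2-3 genes. The snippet below assumes a dataset with 10,000 genes, the lower end of the microarray range quoted earlier in the slides, and counts the gene sets of each size that would have to be evaluated.

```python
from math import comb

n_genes = 10_000  # assumed dataset size (lower end of the microarray range)

# Number of distinct gene sets of each size that a brute-force search must score.
counts = {size: comb(n_genes, size) for size in (1, 2, 3, 4)}
for size, count in counts.items():
    print(f"gene sets of size {size}: {count:,}")
```

Each of these sets would require training and evaluating a classifier, so the cost grows far faster than the count of genes itself.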
73I. Markov boundary characterization of molecular
signature multiplicity
74Key definitions (1/2)
- Definition of maximally predictive molecular signature: A maximally predictive molecular signature is a molecular signature that maximizes predictivity of the phenotype relative to all other signatures that can be constructed from the same dataset.
- Definition of maximally predictive and non-redundant molecular signature: A maximally predictive and non-redundant molecular signature based on variables X is a maximally predictive signature such that any signature based on a proper subset of variables in X is not maximally predictive.
75Key definitions (2/2)
- Definition of Markov blanket: A Markov blanket M of the response variable $T \in V$ in the joint probability distribution P over variables V is a set of variables conditioned on which all other variables are independent of T, i.e., for every $Y \in V \setminus (M \cup \{T\})$, $T \perp Y \mid M$.
- Definition of Markov boundary (or non-redundant Markov blanket): If M is a Markov blanket of T and no proper subset of M satisfies the definition of Markov blanket of T, then M is called a Markov boundary (or non-redundant Markov blanket) of T.
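The two definitions can be checked numerically on a tiny example. The sketch below assumes a hypothetical chain A → B → T over binary variables with made-up probabilities, builds the joint distribution, and tests the conditional independence statements by direct enumeration. The function names are illustrative only.

```python
from itertools import product

# Hypothetical chain A -> B -> T over binary variables; all numbers are made up.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}
p_t_given_b = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.1, 1: 0.9}}

NAMES = ("A", "B", "T")
# Joint distribution P(A, B, T) from the chain factorization.
joint = {(a, b, t): p_a[a] * p_b_given_a[a][b] * p_t_given_b[b][t]
         for a, b, t in product((0, 1), repeat=3)}


def marginal(assignment):
    """P(assignment), where assignment maps variable names to values."""
    return sum(p for values, p in joint.items()
               if all(values[NAMES.index(k)] == v for k, v in assignment.items()))


def independent_of_t(y, given, tol=1e-12):
    """True if T is independent of variable y conditioned on the set `given`."""
    for values in product((0, 1), repeat=3):
        full = dict(zip(NAMES, values))
        cond = {k: full[k] for k in given}
        with_y = {**cond, y: full[y]}
        lhs = marginal({**with_y, "T": full["T"]}) / marginal(with_y)  # P(T | y, given)
        rhs = marginal({**cond, "T": full["T"]}) / marginal(cond)      # P(T | given)
        if abs(lhs - rhs) > tol:
            return False
    return True


print(independent_of_t("A", given=("B",)))  # is {B} a Markov blanket of T?
print(independent_of_t("B", given=("A",)))  # is {A} a Markov blanket of T?
print(independent_of_t("B", given=()))      # is the empty set a Markov blanket of T?
```

In this toy chain {B} is a Markov blanket of T while {A} and the empty set are not; since the only proper subset of {B} is the empty set, {B} is also the Markov boundary.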
76Theoretical results
- Variable sets that participate in the maximally predictive signatures of T are precisely the Markov blankets of T and vice-versa
- Similarly, variable sets that participate in the maximally predictive and non-redundant signatures of T are precisely the Markov boundaries of T and vice-versa
- If a joint probability distribution P over variables V satisfies the intersection property, then there exists a unique Markov boundary of T [Pearl, 1988].
77A fundamental reduction used in this research for
the analysis of signatures
[Figure: cases and controls plotted by Gene X and Gene Y, with classifiers S1-S5; some are signatures that have maximal predictivity of the phenotype relative to their genes, the others are signatures with worse predictivity]
- Since there is an infinite number of signatures with maximal predictivity, when I refer to a signature, I mean one of the predictively equivalent classifiers (e.g., S3 or S4 or S5)
- Can study signature classes by reference only to their genes
- This reduction is justified whenever the classifiers used can learn the minimum error decision function given sufficient sample.
78Example of Markov boundary multiplicity
[Figure: network structure and distributional constraints]
- Many optimal signatures exist, e.g., {A, C} and {B, C} are maximally predictive and non-redundant signatures of T. Furthermore, {A, C} and {B, C} remain maximally predictive even in infinite samples
- The network has very low connectivity
- Genes in optimal signatures do not have to be deterministically related, e.g., A and B are not deterministically related, yet individually convey the same information about T
- If an algorithm selects only one optimal signature, then there is a danger of missing biologically important causative genes
- The union of all optimal signatures includes all genes located in the local pathway around T
- In this example the intersection of all optimal signatures contains only genes in the local pathway around T.
79II. A Novel algorithm to correctly identify the
set of maximally predictive and non-redundant
signatures
80TIE generative algorithm
81TIE algorithm for gene expression data analysis
82Trace of the TIE algorithm
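[Figure: trace of the TIE algorithm, starting from the Markov boundary M = {A, B, F}]
- G = {F}: Mnew = {A, B}. Not a Markov boundary. Do not consider any G that is a superset of {F}.
- G = {A}: Mnew = {C, B, F}. Markov boundary.
- G = {B}: Mnew = {A, D, E, F}. Markov boundary.
- G = {A, B}: Mnew = {C, D, E, F}. Markov boundary.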
83Theoretical results (1/2)
- TIE returns all and only Markov boundaries of T (i.e., maximally predictive and non-redundant signatures) if its input components X, Y, Z are admissible
- IAMB is an admissible Markov boundary algorithm (input component X) under assumptions
  - IAMB correctly outputs a Markov boundary if only the composition property holds
- HITON-PC is an admissible Markov boundary algorithm (input component X) under assumptions
  - HITON-PC correctly outputs a Markov boundary if the adjacency faithfulness assumption holds except for violations of the intersection axiom, global Markov condition holds, and there are no spouses in the Markov boundary
84Theoretical results (2/2)
- Stated three strategies (IncLex, IncMinAssoc, and IncMaxAssoc) to generate subsets of variables that have to be removed from V to identify new Markov boundaries of T and proved their admissibility (input component Y)
- Stated two criteria (Independence and Predictivity) to verify Markov boundaries and proved their admissibility (input component Z).
85III. Empirical evaluation of the novel algorithms
and comparison with existing state-of-the-art
methods
86A. Experiments with artificial simulated data
- Generative model is available, and the set of Markov boundaries (and thus the set of maximally predictive and non-redundant signatures) is known.
- Generate samples of systematically varied sizes
- Compare to the gold standard
- Test whether the TIE algorithm behaves according to theoretical expectations and study its empirical properties
- Obtain clues about behavior of TIE and baseline comparison algorithms in experiments with real gene expression data.
87Experiments with discrete networks TIED1 and
TIED2
- Two artificial discrete networks were created
- TIED1 consists of 30 variables (including a response variable T) and contains 72 Markov boundaries of T
- TIED2 consists of 1,000 variables (including a response variable T) and contains the same 72 Markov boundaries of T as TIED1.
88Experiments
- Goal: Compare TIE to state-of-the-art algorithms (Resampling-based methods, KIAMB, and Iterative Removal) and examine sensitivity of the tested methods to high dimensionality.
- Findings:
  - TIE correctly identifies the set of true Markov boundaries (maximally predictive and non-redundant signatures) in the datasets with 30 or 1,000 variables
  - Iterative Removal identifies only 1 signature
  - KIAMB fails to identify any true signature, and its output signatures have poor predictivity
  - Resampling-based methods miss true signatures and/or output many redundant variables in the signatures.
89Experiments with linear continuous network LIND
LIND consists of 41 variables (including a
response variable T) and contains 12 Markov
boundaries of T.
90Experiments
- Goals:
  - Analyze behavior of TIE as a function of sample size using data generated from a continuous network
  - Compare criteria Independence and Predictivity for verification of Markov boundaries in the TIE algorithm.
- Findings:
  - As sample size increases, the performance of both instantiations of TIE generally improves and the algorithms discover the set of true Markov boundaries
  - The α-level in the criterion Predictivity significantly affects the number of Markov boundaries output by the TIE algorithm
  - TIE with criterion Predictivity typically leads to a larger number of output Markov boundaries and on average superior performance compared to criterion Independence.
91Experiments with discrete network XORD
XORD consists of 41 variables (including a
response variable T) and contains 25 Markov
boundaries of T.
92Experiments
- Goal: Evaluate TIE when the popular Markov boundary algorithms such as IAMB and HITON-PC are not applicable due to violations of their fundamental assumptions.
- Findings:
  - TIE discovers the set of true Markov boundaries when the sample is 2,000
  - There is 1 false positive variable in each discovered Markov boundary for large sample sizes.
93B. Experiments with resimulated microarray gene
expression data
- Resimulated data by design closely resembles real human lung cancer microarray gene expression data
- Knowledge of the generative model makes it possible to generate arbitrarily large samples and study the behavior of TIE as a function of sample size
- Unlike prior experiments with artificial simulated datasets, the set of maximally predictive and non-redundant signatures is not known a priori.
94Experiment
Goal: Examine whether the signature multiplicity phenomenon vanishes as the sample size grows.
Results
95Findings of other experiments
- TIE is not sensitive to the choice of the initial signature discovered by the algorithm
- Post-processing TIE signatures with wrapping results in more signatures with smaller number of genes
- Signatures output by tested non-TIE methods are either redundant or have inferior predictivity compared to signatures output by TIE techniques.
96C. Experiments with real human microarray gene
expression data
- Independent-dataset experiments: using pairs of microarray datasets either from different laboratories or from different platforms.
- Single-dataset experiments: additional experiments with relatively large sample size microarray datasets.
- The primary goal of both experiments is to compare TIE and baseline algorithms for multiple signature extraction in terms of maximal predictivity of induced signatures and reproducibility in independent data.
- Operational definition of maximal predictivity: the empirically best classification performance (AUC) achievable in each dataset over all tested methods.
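The operational definition above can be sketched in a few lines. The method names and AUC values below are hypothetical, and the `tolerance` parameter is only a stand-in for whatever statistical comparison the study used to decide whether two AUCs are distinguishable.

```python
def methods_reaching_max(auc_by_method, tolerance=0.0):
    """Return the empirically best AUC in one dataset and the methods
    whose AUC lies within `tolerance` of it.

    `tolerance` is a hypothetical simplification of a statistical test.
    """
    best = max(auc_by_method.values())
    winners = sorted(
        method for method, auc in auc_by_method.items()
        if best - auc <= tolerance
    )
    return best, winners

# Hypothetical AUC values for one dataset; these are not results from the slides.
example = {"method_A": 0.91, "method_B": 0.88, "method_C": 0.905}
best, winners = methods_reaching_max(example, tolerance=0.01)
print(best, winners)
```

Because the reference value is the best result among the tested methods only, it is an empirical ceiling rather than a theoretical optimum.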
97Independent-dataset experiments Datasets
| Task | Discovery: sample size | Discovery: samples per class | Discovery: number of genes | Discovery: microarray platform | Validation: sample size | Validation: samples per class | Validation: number of genes | Validation: microarray platform | Number of common genes |
|---|---|---|---|---|---|---|---|---|---|
| Lung Cancer Diagnosis: lung tumors vs. normals (non-tumor lung samples) | 203 | lung tumors (186), normals (17) | 12600 | Affymetrix U95A | 96 | lung tumors (86), normals (10) | 7129 | Affymetrix HuGeneFL | 7094 |
| Lung Cancer Subtype Classification: adenocarcinoma vs. squamous cell carcinoma lung tumors | 160 | adenocarcinoma (139), squamous (21) | 12600 | Affymetrix U95A | 28 | adenocarcinoma (14), squamous (14) | 12533 | Affymetrix U95A | 12533 |
| Breast Cancer Subtype Classification: estrogen receptor positive (ER+) vs. ER- breast tumors, untreated patients | 286 | ER+ (209), ER- (77) | 22283 | Affymetrix U133A | 119 | ER+ (85), ER- (34) | 22283 | Affymetrix U133A | 22283 |
| Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis) | 204 | poor prognosis (66), good prognosis (138) | 22283 | Affymetrix U133A | 72 | poor prognosis (13), good prognosis (59) | 22283 | Affymetrix U133A | 22283 |
| Glioma Subtype Classification: grade III vs. grade IV glioma tumors | 100 | grade III (24), grade IV (76) | 22283 | Affymetrix U133A | 85 | grade III (26), grade IV (59) | 22283 | Affymetrix U133A | 22283 |
| Leukemia 5 Yr. Prognosis: patients with disease-free survival < 5 years (ones who had relapse or competing events within 5 years) vs. > 5 years | 164 | survival < 5 yr. (29), survival > 5 yr. (135) | 12625 | Affymetrix U95A | 79 | survival < 5 yr. (18), survival > 5 yr. (61) | 22283 | Affymetrix U133A | 10507 |
98Detailed results (1/3)
99Detailed results (2/3)
100Detailed results (3/3)
101TIE signatures have maximal predictivity
- TIE achieves maximal predictivity in 5 out of 6 validation datasets.
- Non-TIE methods achieve maximal predictivity in 0 to 2 datasets, depending on the method.
- In the dataset where the predictivity of TIE is statistically distinguishable from the empirically maximal one (Lung Cancer Subtype Classification), the magnitude of this difference is only 0.009 AUC on average over all discovered signatures.
102TIE signatures are reproducible, other
signatures may be overfitted
- TIE shows no overfitting on average over all signatures and datasets.
- Other methods achieve predictivity in the validation data that is lower than that in the discovery data (by 0.02-0.03 AUC), in addition to having inferior predictivity.
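One way to read the overfitting comparison above is as the average gap between discovery and validation AUC over a method's signatures. The sketch below assumes that reading. The function name and the AUC pairs are hypothetical, not values from the study.

```python
from statistics import mean

def mean_overfitting(signatures):
    """Average of (discovery AUC - validation AUC) over signatures.

    A positive value means performance dropped in the validation data.
    """
    return mean(discovery - validation for discovery, validation in signatures)

# Hypothetical (discovery AUC, validation AUC) pairs for three signatures.
pairs = [(0.90, 0.88), (0.85, 0.86), (0.80, 0.78)]
gap = mean_overfitting(pairs)
print(round(gap, 3))
```

A value near zero corresponds to the "no overfitting on average" case, while a gap of 0.02-0.03 matches the size of drop reported for the other methods.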
103TIE signatures in comparison with other signatures
Predictivity results for Leukemia 5 Yr. Prognosis task
[Figure: scatter plot of signatures. Axes: classification performance (AUC) in discovery dataset; classification performance (AUC) in validation dataset.]
Each dot in the plot corresponds to a signature (computational model) of the outcome, e.g., $\mathrm{Outcome}(x) = \mathrm{sign}(w \cdot x + b)$, where $x, w \in \mathbb{R}^m$, $b \in \mathbb{R}$, and $m$ is the number of genes in the signature.
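The linear signature form on this slide, Outcome(x) = sign(w · x + b), and the AUC used on both plot axes can be sketched as follows. The weights, bias, and expression vectors are hypothetical. Mapping a score of exactly zero to +1 is a convention chosen here, and the AUC is computed with the standard pairwise (Mann-Whitney) definition, which may differ from the estimator used in the study.

```python
def score(w, b, x):
    """Decision value w . x + b of a linear signature."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def outcome(w, b, x):
    """Predicted class in {+1, -1}; a score of exactly 0 maps to +1 (a convention)."""
    return 1 if score(w, b, x) >= 0 else -1

def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical two-gene signature (m = 2) and expression vectors.
w, b = [1.0, -1.0], 0.0
positives = [[2.0, 1.0], [1.0, 1.0], [3.0, 0.0]]
negatives = [[0.0, 1.0], [1.0, 1.0]]

pos_scores = [score(w, b, x) for x in positives]
neg_scores = [score(w, b, x) for x in negatives]
signature_auc = auc(pos_scores, neg_scores)
print(outcome(w, b, positives[0]), outcome(w, b, negatives[0]), signature_auc)
```

Computing this AUC once on discovery samples and once on validation samples gives the two coordinates of one dot in the plot.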
104Single-dataset experiments Datasets
| Task | Sample size | Samples per class | Number of genes | Microarray platform |
|---|---|---|---|---|
| Lymphoma Subtype Classification I: diffuse large-B-cell lymphoma (DLBCL) vs. Burkitt's lymphoma (BL) patients | 303 | DLBCL (258), BL (45) | 2745 | Human LymphDx 2.7k GeneChip |
| Lymphoma Subtype Classification II: diffuse large-B-cell lymphoma (DLBCL) vs. mediastinal large B-cell lymphoma (MLBCL) patients | 210 | DLBCL (176), MLBCL (34) | 32403 (44928) | Affymetrix U133A and U133B |
| Breast Cancer Subtype Classification I: p53 mutant vs. wild-type breast tumors | 251 | p53 mutant (58), p53 wild-type (193) | 22283 | Affymetrix U133A |
| Breast Cancer Subtype Classification II: estrogen receptor positive (ER+) vs. ER- breast tumors | 247 | ER+ (213), ER- (34) | 22283 | Affymetrix U133A |
| Breast Cancer Subtype Classification III: progesterone receptor positive (PgR+) vs. PgR- breast tumors | 251 | PgR+ (190), PgR- (61) | 22283 | Affymetrix U133A |
| Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis) | 215 | poor prognosis (51), good prognosis (164) | 24496 | Agilent Hu25K |
| Bladder Cancer Stage Classification: stage Ta vs. other stages (T1, T2, T3, T4) of bladder tumors | 404 | stage Ta (189), other stages (215) | 1381 (3072) | MDL Human 3k |
- Validation dataset: a subset of 100 samples/patients.
- Discovery dataset: all remaining samples/patients.
- Repeat the split into discovery and validation datasets 10 times to minimize variance.
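The splitting protocol above can be sketched as follows. The slides do not say whether the splits were stratified by class, so plain random sampling is an assumption here, and the function name and seed are hypothetical.

```python
import random

def repeated_splits(n_samples, n_validation=100, n_repeats=10, seed=0):
    """Return n_repeats (discovery, validation) index splits.

    Each validation set is a random subset of n_validation samples and the
    discovery set is everything else. Sampling is NOT stratified by class,
    which is an assumption; the original protocol may differ.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_repeats):
        validation = set(rng.sample(indices, n_validation))
        discovery = [i for i in indices if i not in validation]
        splits.append((discovery, sorted(validation)))
    return splits

# Example with the sample size of Lymphoma Subtype Classification I (303).
splits = repeated_splits(303)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging a performance estimate over the 10 splits reduces its dependence on any single random partition.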
105Single-dataset experiments Summary results
- Results are similar to those from the independent-dataset experiments.
- TIE achieves maximal predictivity in 6 out of 7 validation datasets.
- Non-TIE methods achieve maximal predictivity in 0 to 1 datasets, depending on the method.
- In the dataset where TIE has predictivity that is statistically distinguishable from the empirically maximal one (Breast Cancer Subtype Classification II), the magnitude of this difference is less than 0.01 AUC on average over all discovered signatures.
106IV. Discussion and interpretation of results
107Revisiting previously published hypotheses about
signature multiplicity
- Signature reproducibility neither precludes multiplicity nor requires sample sizes with thousands of subjects.
- Multiplicity of signatures does not require dense connectivity.
- Noisy measurements or normalization are not necessary conditions for signature multiplicity.
- Multiplicity can be produced by a combination of small sample size-related variance and intrinsic multiplicity in the underlying network.
- Multiple signatures output by TIE are reproducible even though they are derived from small-sample, noisy, and heavily processed data.
108A more complete picture is emerging regarding
causes of multiplicity...
- Intrinsic information redundancy in the underlying biological system
- Variability in the output of gene selection and classifier algorithms, especially in small sample sizes
- Small-sample statistical indistinguishability of signatures with different large-sample predictivity and/or redundancy characteristics
- Presence of hidden variables
- Correlated measurement noise
- RNA amplification techniques that systematically distort measurements of transcript ratios
- Cellular aggregation and sampling from mixtures of distributions that affect inference of conditional independence relations
- Normalization and other data pre-processing methods that artificially increase correlations among genes
- Engineered redundancy in the assay technology platforms.
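The first cause in the list, intrinsic information redundancy, can be illustrated with a toy example. It assumes two hypothetical variables X1 and X2 where X2 is a one-to-one recoding of X1, and a response T that depends noisily on X1. The counts are invented for illustration. Either variable alone leaves the same uncertainty about T, and adding the other changes nothing, so {X1} and {X2} are two equally predictive, non-redundant signatures.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, target, given):
    """H(target | given) in bits, estimated from a list of dict rows."""
    n = len(rows)
    joint = Counter((tuple(r[g] for g in given), r[target]) for r in rows)
    cond = Counter(tuple(r[g] for g in given) for r in rows)
    return -sum((c / n) * log2(c / cond[s]) for (s, _), c in joint.items())

# Toy data (hypothetical counts): T agrees with X1 in 3 of every 4 cases,
# and X2 = 1 - X1 is a one-to-one recoding of X1.
rows = []
for x1, t, count in [(0, 0, 3), (0, 1, 1), (1, 0, 1), (1, 1, 3)]:
    rows += [{"X1": x1, "X2": 1 - x1, "T": t}] * count

h_none = conditional_entropy(rows, "T", [])           # uncertainty with no predictors
h_x1 = conditional_entropy(rows, "T", ["X1"])         # signature {X1}
h_x2 = conditional_entropy(rows, "T", ["X2"])         # signature {X2}
h_both = conditional_entropy(rows, "T", ["X1", "X2"])  # both together

print(round(h_none, 3), round(h_x1, 3), round(h_x2, 3), round(h_both, 3))
```

Real expression data rarely contain exact recodings, but near-equivalent gene sets produce the same effect at finite sample sizes.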
109Summary of results
- Developed a Markov boundary characterization of molecular signature multiplicity.
- Designed a generative algorithm that can correctly identify the set of maximally predictive and non-redundant molecular signatures, in principle independently of the data distribution.
- Conducted an empirical evaluation of the novel algorithm and compared it to existing state-of-the-art methods using artificial simulated data, resimulated microarray gene expression data, and real human microarray gene expression data.
- Tested and refined several hypotheses about the causes of the molecular signature multiplicity phenomenon.
110General conclusions
- Molecular signatures play a crucial role in personalized medicine and translational bioinformatics.
- Molecular signatures are being used to treat patients today, not in the future.
- Development of accurate molecular signatures should rely on the use of supervised methods.
- In general, there are many challenges for computational analysis of omics data for development of molecular signatures.
- One of these challenges is molecular signature multiplicity.
- There exists an algorithm that can extract the set of maximally predictive and non-redundant molecular signatures from high-throughput data.
111Homework (Due next Monday)
- Read the paper "Analysis and Computational Dissection of Molecular Signature Multiplicity".
- Describe a novel and interesting application area for the TIE algorithm. Feel free to use an example from your research where there exist many molecular signatures of some response variable (1/2 page max).
- Come up with another cause of molecular signature multiplicity that was not mentioned in the paper (1/2 page max).
- Email your work to Alexander.Statnikov_at_med.nyu.edu
112Computational Causal Discovery Laboratory at NYU
Center for Health Informatics and Bioinformatics
(CHIBI)
- The purpose of our lab is to develop, test, and apply computational causal discovery methods suitable for molecular, clinical, imaging, and multi-modal data of high dimensionality.
- We are interested in methods to address the following questions:
- What is causing the disease/phenotype?
- What are the effects of the disease/phenotype?
- What biological pathways are involved?
- How to design drugs/treatments?
- How does genotype cause differences in response to treatment?
- How does the environment modify or even supersede the normal causal function of genes and other molecular variables?
- How are genes and proteins organized in complex causal regulatory networks?
- Questions? Email Alexander.Statnikov_at_med.nyu.edu