Molecular Signaling - PowerPoint PPT Presentation

About This Presentation
Title:

Molecular Signaling

Description:

Title: Computational advances in reverse-engineering regulatory networks and pathways Author: Alexander Statnikov Last modified by: Alexander Statnikov – PowerPoint PPT presentation

Slides: 113
Provided by: AlexanderS153

Transcript and Presenter's Notes

Title: Molecular Signaling


1
Molecular Signaling Drug Development Course:
Development of Molecular Signatures from
High-Throughput Assay Data
  • Alexander Statnikov, Ph.D.
  • Director, Computational Causal Discovery
    Laboratory
  • Benchmarking Director, Best Practices Integrative
    Informatics Consultation Service
  • Assistant Professor, Department of Medicine,
    Division of Clinical Pharmacology
  • Center for Health Informatics and Bioinformatics,
    NYU School of Medicine
  • 5/16/2011

2
Outline
  • Part 1: Introduction to molecular signatures
  • Part 2: Key principles for developing accurate
    molecular signatures
  • Part 3: Comprehensive evaluation of algorithms to
    develop molecular signatures for cancer
    classification
  • Part 4: Analysis and computational dissection of
    molecular signature multiplicity
  • Conclusion
  • Homework assignment

3
Part 1: Introduction to molecular signatures
4
Definition of a molecular signature
  • A molecular signature is a computational or
    mathematical model that links high-dimensional
    molecular information to phenotype or other
    response variable of interest.

5
FDA view on molecular signatures
  • The FDA calls them "in vitro diagnostic
    multivariate index assays" (IVDMIAs)
  • 1. Class II Special Controls Guidance Document:
    Gene Expression Profiling Test System for Breast
    Cancer Prognosis
  • - Addresses device classification.
  • 2. The Critical Path to New Medical Products
  • - Identifies pharmacogenomics as crucial to
    advancing medical product development and
    personalized medicine.
  • 3. Draft Guidance on Pharmacogenetic Tests and
    Genetic Tests for Heritable Markers, and Guidance
    for Industry: Pharmacogenomic Data Submissions
  • - Identifies 3 main goals (dose, ADEs, responders),
  • - Defines IVDMIA,
  • - Encourages fault-free sharing of
    pharmacogenomic data,
  • - Separates probable from valid biomarkers,
  • - Focuses on genomics (and not other omics).

6
Main uses of molecular signatures
  • Direct benefits: Models of disease
    phenotype/clinical outcome
  • - Diagnosis
  • - Prognosis, long-term disease management
  • - Personalized treatment (drug selection,
    titration)
  • Ancillary benefits 1: Biomarkers for diagnosis
    or outcome prediction
  • - Make the above tasks resource efficient and easy
    to use in clinical practice
  • - Help next-generation molecular imaging
  • - Leads for potential new drug candidates
  • Ancillary benefits 2: Discovery of structure and
    mechanisms (regulatory/interaction networks,
    pathways, sub-types)
  • - Leads for potential new drug candidates

7
Less conventional uses of molecular signatures
  • Increase clinical trial sample efficiency,
    decrease costs, or both, using placebo responder
    signatures
  • In silico signature-based candidate drug
    screening
  • Drug resurrection
  • Establishing existence of biological signal in
    very small sample situations where univariate
    signals are too weak
  • Assess importance of markers and of mechanisms
    involving those
  • Choosing the right animal model
  • ?

8
Recent molecular signatures available for
patient care
Agendia
Clarient
Prediction Sciences
LabCorp
Veridex
University Genomics
Genomic Health
BioTheranostics
Applied Genomics
Power3
Correlogic Systems
9
Molecular signatures in the market
Company | Product | Disease | Purpose
Agendia | MammaPrint | Breast cancer | Risk assessment for the recurrence of distant metastasis in a breast cancer patient.
Agendia | TargetPrint | Breast cancer | Quantitative determination of the expression level of estrogen receptor, progesterone receptor and HER2 genes. This product is supplemental to MammaPrint.
Agendia | CupPrint | Cancer | Determination of the origin of the primary tumor.
University Genomics | Breast Bioclassifier | Breast cancer | Classification of ER-positive and ER-negative breast cancers into expression-based subtypes that more accurately predict patient outcome.
Clarient | Insight Dx Breast Cancer Profile | Breast cancer | Prediction of disease recurrence risk.
Clarient | Prostate Gene Expression Profile | Prostate cancer | Diagnosis of grade 3 or higher prostate cancer.
Prediction Sciences | RapidResponse c-Fn Test | Stroke | Identification of the patients that are safe to receive tPA and those at high risk for HT, to help guide the physician's treatment decision.
Genomic Health | OncotypeDx | Breast cancer | Individualized prediction of chemotherapy benefit and 10-year distant recurrence to inform adjuvant treatment decisions in certain women with early-stage breast cancer.
bioTheranostics | CancerTYPE ID | Cancer | Classification of 39 types of cancer.
bioTheranostics | Breast Cancer Index | Breast cancer | Risk assessment and identification of patients likely to benefit from endocrine therapy, and whose tumors are likely to be sensitive or resistant to chemotherapy.
Applied Genomics | MammaStrat | Breast cancer | Risk assessment of cancer recurrence.
Applied Genomics | PulmoType | Lung cancer | Classification of non-small cell lung cancer into adenocarcinoma versus squamous cell carcinoma subtypes.
Applied Genomics | PulmoStrat | Lung cancer | Assessment of an individual's risk of lung cancer recurrence following surgery for helping with adjuvant therapy decisions.
Correlogic | OvaCheck | Ovarian cancer | Early detection of epithelial ovarian cancer.
LabCorp | OvaSure | Ovarian cancer | Assessment of the presence of early stage ovarian cancer in high-risk women.
Veridex | GeneSearch BLN Assay | Breast cancer | Determination of whether breast cancer has spread to the lymph nodes.
Power3 | BC-SeraPro | Breast cancer | Differentiation between breast cancer patients and control subjects.
10
MammaPrint
  • Developed by Agendia (www.agendia.com)
  • 70-gene signature to stratify women with breast
    cancer that hasn't spread into low risk and
    high risk for recurrence of the disease
  • Independently validated in >1,000 patients
  • So far performed >10,000 tests
  • Cost of the test is $3,000
  • In February 2007, the FDA cleared the
    MammaPrint test for marketing in the U.S. for
    node negative women under 61 years of age with
    tumors of less than 5 cm.
  • TIME Magazine's 2007 medical invention of the
    year.

11
Oncotype DX
  • Developed by Genomic Health (www.genomichealth.com)
  • 21-gene signature to predict whether a woman
    with localized, ER+ breast cancer is at risk of
    relapse
  • Independently validated in >1,000 patients
  • So far performed >50,000 tests
  • Cost of the test is $3,000
  • The following paper shows the health benefits
    and cost-effectiveness of using Oncotype DX:
    http://www3.interscience.wiley.com/cgi-bin/abstract/114124513/ABSTRACT

12
Part 2: Key principles for developing accurate
molecular signatures
13
Main ingredients for developing a molecular
signature
  • Well-defined clinical problem and access to
    patients/samples
  • High-throughput assays
  • Computational and biostatistical analysis
[Figure: the three ingredients above feed into the molecular signature]
14
Challenges in computational analysis of omics
data
  • Relatively easy to develop a predictive model and
    even easier to believe that a model is good when
    it is not → false sense of security
  • Several problems exist, some theoretical and some
    practical
  • Omics data has many special characteristics and
    is tricky to analyze!

15
Example: OvaCheck (1/2)
  • Developed by Correlogic (www.correlogic.com)
  • Blood test for the early detection of epithelial
    ovarian cancer
  • Failed to obtain FDA approval
  • Looks for subtle changes in patterns among the
    tens of thousands of proteins, protein fragments
    and metabolites in the blood
  • Signature developed by genetic algorithm
  • Significant artifacts in data collection and
    analysis called the validity of the signature
    into question:
  • - Results are not reproducible
  • - Data collected differently for different groups
    of patients
  • http://www.nature.com/nature/journal/v429/n6991/full/429496a.html

16
Example: OvaCheck (2/2)
[Figure, panels A-F, from Baggerly et al. (Bioinformatics, 2004)]
17
An early kind of analysis: learning disease
sub-types by clustering patient profiles
[Figure: patient profiles plotted against p53 and Rb]
18
Clustering: seeking natural groupings, hoping
that they will be useful
[Figure: patient profiles plotted against p53 and Rb]
19
E.g., for classification (predict response to
treatment)
[Figure: patients plotted against p53 and Rb, grouped into those who respond to treatment Tx1 and those who do not]
20
Another use of clustering
  • Cluster genes (instead of patients)
  • Genes that cluster together may belong to the
    same pathways
  • Genes that cluster apart may be unrelated

21
Unfortunately, clustering is a non-specific method
and falls into the "one-solution-fits-all" trap
when used for classification
[Figure: squamous carcinoma and adenocarcinoma samples plotted against p53 and Rb]
22
Clustering is also non-specific when used to
discover pathways, or other mechanistic
relationships
It is entirely possible in this simple
illustrative counter-example for G3 (a gene causally
unrelated to the phenotype) to be more
strongly associated, and thus cluster, with the
phenotype (or its surrogate genes) than the true
causal oncogenes G1 and G2.
[Figure: network over genes G1, G2, G3 and phenotype Ph]
23
Two improved classes of methods
  • Supervised learning → classification/molecular
    signatures and markers
  • Regulatory network reverse engineering → pathways

24
Supervised learning: use the known phenotypes
(a.k.a. class labels) in training data to build
signatures or find markers highly specific for
that phenotype
[Figure: training samples (variables A, B, C, D and phenotype T, measured for samples 1 to n: A1, B1, C1, D1, T1; A2, B2, C2, D2, T2; ...; An, Bn, Cn, Dn, Tn) → classifier/regression algorithm → molecular signature; the signature is then applied to testing/validation samples → classification performance]
25
Input data for supervised learning methods
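  • Columns: class label, then variables/features

[Table: one row per sample. The class label column reads Primary, Metastatic, Primary, Metastatic, Metastatic, Primary, Metastatic, Metastatic, Metastatic, Primary, Metastatic, Primary; the feature values are not recoverable.]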
26
Principles and geometric representation for
supervised learning (1/7)
  • Want to classify objects as boats and houses.

27
Principles and geometric representation for
supervised learning (2/7)
  • All objects before the coast line are boats and
    all objects after the coast line are houses.
  • Coast line serves as a decision surface that
    separates two classes.

28
Principles and geometric representation for
supervised learning (3/7)
These boats will be misclassified as houses
This house will be misclassified as boat
29
Principles and geometric representation for
supervised learning (4/7)
[Figure: boats and houses plotted by longitude and latitude]
  • The methods that build classification models
    (i.e., classification algorithms) operate very
    similarly to the previous example.
  • First all objects are represented geometrically.

30
Principles and geometric representation for
supervised learning (5/7)
[Figure: boats and houses plotted by longitude and latitude]
Then the algorithm seeks to find a decision
surface that separates classes of objects
31
Principles and geometric representation for
supervised learning (6/7)
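[Figure: unseen objects, marked "?", plotted by longitude and latitude; those on one side of the decision surface are labeled "These objects are classified as houses", those on the other "These objects are classified as boats"]
Unseen (new) objects are classified as boats if
they fall below the decision surface and as
houses if they fall above it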
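The rule on this slide can be sketched in a few lines of Python. This is only an illustration: the straight-line decision surface, its slope and intercept, and the three object coordinates are made-up values, not taken from the slides.

```python
def classify(longitude, latitude, slope=1.0, intercept=0.0):
    """Label an object by the side of the line latitude = slope * longitude + intercept
    on which it falls (the line plays the role of the coast line)."""
    boundary_latitude = slope * longitude + intercept
    # Above the decision surface -> house, on or below it -> boat.
    return "house" if latitude > boundary_latitude else "boat"

# Hypothetical unseen objects given as (longitude, latitude).
objects = {"object 1": (2.0, 5.0), "object 2": (4.0, 1.0), "object 3": (3.0, 3.5)}
labels = {name: classify(lon, lat) for name, (lon, lat) in objects.items()}
print(labels)
```

A real classification algorithm would learn the slope and intercept from labeled training objects instead of fixing them by hand.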
32
Principles and geometric representation for
supervised learning (7/7)
[Figure: Object 1, Object 2 and Object 3 plotted by longitude and latitude]
33
In 2-D this looks simple, but what happens in
higher dimensional data?
  • 10,000-50,000 variables (gene expression
    microarrays, aCGH, and early SNP arrays)
  • >500,000 (tiled microarrays, SNP arrays)
  • 10,000-300,000 (regular MS proteomics)
  • >10,000,000 (LC-MS proteomics)
  • >100,000,000 (next-generation sequencing)
  • This is the curse of dimensionality problem

34
High dimensionality (especially with small
samples) causes:
  • Some methods do not run at all (classical
    regression)
  • Some methods give bad results (KNN, decision
    trees)
  • Very slow analysis
  • Very expensive/cumbersome clinical application
  • A tendency to overfit

35
Two problems: over-fitting and under-fitting
  • Over-fitting (a model to your data): building a
    model that is good in original data but fails to
    generalize well to new/unseen data
  • Under-fitting (a model to your data): building a
    model that is poor in both original data and
    new/unseen data

36
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
37
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: Outcome of Interest Y plotted against Predictor X for training data and future data; one fitted line is labeled "This line is good!" and the other "This line overfits!"]
38
Over/under-fitting are related to complexity of
the decision surface and how well the training
data is fit
[Figure: Outcome of Interest Y plotted against Predictor X for training data and future data; one fitted line is labeled "This line is good!" and the other "This line underfits!"]
39
Very important concept
  • Successful data analysis methods balance training
    data fit with complexity.
  • Too complex a signature (to fit training data well)
    → overfitting (i.e., the signature does not
    generalize)
  • Too simplistic a signature (to avoid overfitting) →
    underfitting (it will generalize, but the fit to both
    the training and future data will be low and
    predictive performance poor).
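A small, hand-made example can make the trade-off concrete. Everything below is an assumption for illustration: a single predictor, a true rule "class 1 when x > 5", two deliberately mislabeled training points, and three toy models of different complexity.

```python
# Training data (x, label). The points x=3 and x=8 carry "noisy" labels.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (5, 0), (6, 1), (7, 1), (8, 0), (9, 1), (10, 1)]
# Future data, labeled with the true rule (class 1 when x > 5).
test = [(1.4, 0), (2.9, 0), (3.1, 0), (7.9, 1), (8.1, 1), (9.6, 1)]

def memorizer(x):
    """Too complex: copies the label of the nearest training point (fits noise too)."""
    nearest = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest[1]

def threshold_rule(x):
    """Balanced: a single cut point, tolerating a few training errors."""
    return 1 if x > 5 else 0

def constant_rule(x):
    """Too simple: ignores the predictor altogether."""
    return 0

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

models = {"memorizer": memorizer, "threshold": threshold_rule, "constant": constant_rule}
results = {name: (accuracy(m, train), accuracy(m, test)) for name, m in models.items()}
for name, (train_acc, test_acc) in results.items():
    print(f"{name:10s} train={train_acc:.2f} future={test_acc:.2f}")
```

The memorizer fits the training data perfectly but does worst on future data (overfitting), the constant rule is poor on both (underfitting), and the threshold rule gives up some training fit in exchange for better generalization.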

40
The Support Vector Machine (SVM) approach for
building molecular signatures
  • The support vector machine (SVM) is a binary
    classification algorithm.
  • SVMs are important because of (a) theoretical
    reasons:
  • - Robust to very large number of variables and
    small samples
  • - Can learn both simple and highly complex
    classification models
  • - Employ sophisticated mathematical principles to
    avoid overfitting
  • and (b) superior empirical results.

41
Main ideas of SVMs (1/3)
[Figure: cancer patients and normal patients plotted by Gene X and Gene Y]
  • Consider example dataset described by 2 genes,
    gene X and gene Y
  • Represent patients geometrically (by vectors)

42
Main ideas of SVMs (2/3)
[Figure: cancer patients and normal patients plotted by Gene X and Gene Y]
  • Find a linear decision surface (hyperplane)
    that can separate patient classes and has the
    largest distance (i.e., largest gap or
    margin) between border-line patients (i.e.,
    support vectors)
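To illustrate what "largest margin" means, the sketch below measures the margin of two candidate hyperplanes on a made-up two-gene dataset. It does not train an SVM; the patient coordinates and both hyperplanes are assumptions chosen by hand.

```python
import math

# Hypothetical patients as (gene X, gene Y); labels: -1 = cancer, +1 = normal.
points = [(1, 1), (2, 1), (1, 2), (4, 4), (5, 4), (4, 5)]
labels = [-1, -1, -1, 1, 1, 1]

def signed_distances(w, b):
    """Distance of each patient to the hyperplane w.x + b = 0, positive when
    the patient lies on the side that matches its label."""
    norm = math.hypot(*w)
    return [y * (w[0] * x[0] + w[1] * x[1] + b) / norm for x, y in zip(points, labels)]

def margin(w, b):
    """Distance from the hyperplane to the closest (border-line) patient."""
    return min(signed_distances(w, b))

def support_vectors(w, b, tol=1e-9):
    """Patients that sit exactly on the margin."""
    m = margin(w, b)
    return [x for x, d in zip(points, signed_distances(w, b)) if abs(d - m) < tol]

margin_a = margin((1, 1), -5.5)   # candidate A: gene X + gene Y = 5.5
margin_b = margin((1, 0), -3.0)   # candidate B: gene X = 3
print(margin_a, margin_b, support_vectors((1, 1), -5.5))
```

Both candidates separate the two classes, but candidate A leaves a wider gap, so a maximum-margin method would prefer it over candidate B.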

43
Main ideas of SVMs (3/3)
  • If such linear decision surface does not exist,
    the data is mapped into a much higher dimensional
    space (feature space) where the separating
    decision surface is found
  • The feature space is constructed via very clever
    mathematical projection (kernel trick).

44
On estimation of signature accuracy
  • Large sample case: use hold-out validation (the
    data is split once into a train part and a test
    part)
  • Small sample case: use N-fold cross-validation
    (the data is split into N parts; each part serves
    once as the test set while the remaining parts
    form the training set)
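The N-fold splitting scheme can be written as a short generator. This is a minimal sketch; the function name and the way samples are dealt into folds are illustrative choices, not taken from the slides.

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) so that every sample is used for
    testing exactly once and for training in all other rounds."""
    indices = list(range(n_samples))
    # Deal the samples into folds like cards: fold i gets i, i + n_folds, ...
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

splits = list(n_fold_splits(n_samples=10, n_folds=5))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

In practice the samples are usually shuffled (and often stratified by class) before being dealt into folds.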
45
Nested N-fold cross-validation
Recall the main idea of cross-validation
data
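A hedged sketch of nested cross-validation follows. The inner loop picks a tuning parameter using only the outer training part, and the outer loop estimates the accuracy of the whole procedure. The one-dimensional k-nearest-neighbor classifier, the candidate values of k and the toy data are all assumptions made for illustration.

```python
def folds(indices, n_folds):
    """Yield (train, test) index lists over the given sample indices."""
    parts = [indices[i::n_folds] for i in range(n_folds)]
    for i, test in enumerate(parts):
        train = [j for k, part in enumerate(parts) if k != i for j in part]
        yield train, test

def knn_predict(train_idx, x_new, k, xs, ys):
    """Majority vote among the k training samples closest to x_new."""
    nearest = sorted(train_idx, key=lambda j: abs(xs[j] - x_new))[:k]
    votes_for_one = sum(ys[j] for j in nearest)
    return 1 if 2 * votes_for_one > len(nearest) else 0

def accuracy(train_idx, test_idx, k, xs, ys):
    hits = sum(knn_predict(train_idx, xs[j], k, xs, ys) == ys[j] for j in test_idx)
    return hits / len(test_idx)

def nested_cv(xs, ys, candidate_ks, outer_folds=3, inner_folds=3):
    outer_scores, chosen_ks = [], []
    for outer_train, outer_test in folds(list(range(len(xs))), outer_folds):
        # Inner loop: choose k using the outer training samples only.
        def inner_score(k):
            scores = [accuracy(tr, va, k, xs, ys) for tr, va in folds(outer_train, inner_folds)]
            return sum(scores) / len(scores)
        best_k = max(candidate_ks, key=inner_score)
        chosen_ks.append(best_k)
        # Outer loop: score the chosen setting on samples it has never seen.
        outer_scores.append(accuracy(outer_train, outer_test, best_k, xs, ys))
    return sum(outer_scores) / len(outer_scores), chosen_ks

xs = [float(v) for v in range(1, 13)]
ys = [0] * 6 + [1] * 6
estimate, chosen_ks = nested_cv(xs, ys, candidate_ks=[1, 3, 5])
print(estimate, chosen_ks)
```

The key design point is that the outer test samples take no part in choosing k, so the outer estimate is not biased by parameter tuning.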
46
Overview of challenges in computational analysis
of omics data for development of molecular
signatures
[Diagram: "Data Analytics of Molecular Signatures" surrounded by the following challenges]
  • Rashomon effect/marker multiplicity
  • Assay validity/reproducibility
  • Efficiency: statistical/computational
  • Research designs
  • Is there predictive signal?
  • Causality vs. predictiveness/biological
    significance
  • Methods development: re-inventing the wheel and
    specialization
  • Epistasis
  • Many variables, small sample, noise, artifacts
  • Instability
  • Performance: predictivity, compactness
  • Protocols/guidelines
  • Editorializing/over-simplifying/sensationalism
47
Part 3: Comprehensive evaluation of algorithms to
develop molecular signatures for cancer
classification
48
Comprehensive evaluation of algorithms for
classification of cancer microarray data
  • Main goals
  • Find the best performing algorithms for building
    molecular signatures for cancer diagnosis from
    microarray gene expression data
  • Investigate benefits of using gene selection and
    ensemble classification methods.

49
Classification algorithms
  • K-Nearest Neighbors (KNN) [instance-based]
  • Backpropagation Neural Networks (NN) [neural
    networks]
  • Probabilistic Neural Networks (PNN) [neural
    networks]
  • Multi-Class SVM: One-Versus-Rest (OVR)
    [kernel-based]
  • Multi-Class SVM: One-Versus-One (OVO)
    [kernel-based]
  • Multi-Class SVM: DAGSVM [kernel-based]
  • Multi-Class SVM by Weston and Watkins (WW)
    [kernel-based]
  • Multi-Class SVM by Crammer and Singer (CS)
    [kernel-based]
  • Weighted Voting: One-Versus-Rest [voting]
  • Weighted Voting: One-Versus-One [voting]
  • Decision Trees: CART [decision trees]
50
Ensemble classification methods
51
Gene selection methods
  1. Signal-to-noise (S2N) ratio in one-versus-rest
    (OVR) fashion
  2. Signal-to-noise (S2N) ratio in one-versus-one
    (OVO) fashion
  3. Kruskal-Wallis nonparametric one-way ANOVA (KW)
  4. Ratio of genes between-categories to
    within-category sum of squares (BW).
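As an illustration of the first selection method, the sketch below ranks genes by signal-to-noise ratio for two classes. The slides do not state the formula, so the commonly used definition (difference of class means divided by the sum of class standard deviations) is an assumption, and the expression values are invented.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """S2N = (mean_a - mean_b) / (sd_a + sd_b). Assumes at least two samples
    per class and that the two standard deviations are not both zero."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

# Hypothetical expression values: gene -> (samples of class A, samples of class B).
expression = {
    "gene_1": ([5.0, 6.0, 7.0], [1.0, 2.0, 3.0]),
    "gene_2": ([3.0, 4.0, 5.0], [3.0, 5.0, 4.0]),
    "gene_3": ([1.0, 2.0, 3.0], [4.0, 6.0, 8.0]),
}

scores = {gene: signal_to_noise(a, b) for gene, (a, b) in expression.items()}
# Rank by absolute value so that both up- and down-regulated genes are kept.
ranked = sorted(scores, key=lambda gene: abs(scores[gene]), reverse=True)
print(ranked, scores)
```

In a one-versus-rest setting, class B would hold the pooled samples of all other categories; in a one-versus-one setting, the ratio would be computed for each pair of categories.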

52
Performance metrics and statistical comparison
  • 1. Accuracy
  • - can compare to previous studies
  • - easy to interpret and simplifies statistical
    comparison
  • 2. Relative classifier information (RCI)
  • - easy to interpret and simplifies statistical
    comparison
  • - not sensitive to distribution of classes
  • - accounts for difficulty of a decision problem
  • Randomized permutation testing to compare
    accuracies of the classifiers (α = 0.05)
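One way to compare the accuracies of two classifiers with a randomized permutation test is sketched below. The slides do not describe the exact scheme, so the paired design (randomly swapping the two classifiers' results sample by sample), the number of permutations and the function names are assumptions.

```python
import random

def accuracy_difference(correct_a, correct_b):
    """Difference in accuracy, given per-sample 1/0 correctness indicators."""
    return (sum(correct_a) - sum(correct_b)) / len(correct_a)

def permutation_p_value(correct_a, correct_b, n_permutations=2000, seed=0):
    """Two-sided p-value for the hypothesis that both classifiers are equally accurate."""
    rng = random.Random(seed)
    observed = abs(accuracy_difference(correct_a, correct_b))
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        perm_a, perm_b = [], []
        for a, b in zip(correct_a, correct_b):
            # Under the null hypothesis the two labels "A" and "B" are exchangeable.
            if rng.random() < 0.5:
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        if abs(accuracy_difference(perm_a, perm_b)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

same = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]
print(permutation_p_value(same, same))            # identical classifiers
print(permutation_p_value([1] * 20, [0] * 20))    # one always right, one always wrong
```

A difference would be declared statistically significant when the p-value falls below the chosen α level (0.05 on this slide).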

53
Microarray datasets
  • Total
  • 1300 samples
  • 74 diagnostic categories
  • 41 cancer types and
  • 12 normal tissue types

54
Summary of methods and datasets
55
Results without gene selection
56
Results with gene selection
Improvement of diagnostic performance by gene
selection (averages for the four datasets)
Diagnostic performance before and after gene
selection
Average reduction of genes is 10-30 times
57
Comparison with previously published results
58
Summary of results
  • Multi-class SVMs are the best family among the
    tested algorithms outperforming KNN, NN, PNN, DT,
    and WV.
  • Gene selection in some cases improves
    classification performance of all classifiers,
    especially of non-SVM algorithms
  • Ensemble classification does not improve
    performance of SVM and other classifiers
  • Results obtained by SVMs favorably compare with
    the literature.

59
Random Forest (RF) classifiers
  • Appealing properties
  • Work when # of predictors > # of samples
  • Embedded gene selection
  • Incorporate interactions
  • Based on theory of ensemble learning
  • Can work with binary and multiclass tasks
  • Does not require much fine-tuning of parameters
  • Strong theoretical claims
  • Empirical evidence (Diaz-Uriarte and Alvarez de
    Andres, BMC Bioinformatics, 2006) reported
    superior classification performance of RFs
    compared to SVMs and other methods

60
Key principles of RF classifiers
[Figure: random forest workflow, from training data to predictions on testing data]
1) Generate bootstrap samples
2) Random gene selection
3) Fit unpruned decision trees
4) Apply to testing data and combine predictions
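The four steps can be sketched with a deliberately simplified forest. Two simplifications are assumptions of this sketch and not how Random Forests are normally built: each "tree" is a single-split stump standing in for an unpruned decision tree, and one random gene is drawn per tree instead of a random subset at every split. The data is invented.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature):
    """Best single threshold on one feature. Assumes the feature takes at
    least two distinct values in the rows it is given."""
    values = sorted(set(row[feature] for row in rows))
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        for left, right in ((0, 1), (1, 0)):
            errors = sum((right if row[feature] > threshold else left) != y
                         for row, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, left, right)
    _, threshold, left, right = best
    return feature, threshold, left, right

def fit_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    n_samples, n_features = len(rows), len(rows[0])
    forest = []
    while len(forest) < n_trees:
        sample = [rng.randrange(n_samples) for _ in range(n_samples)]  # 1) bootstrap sample
        if len({labels[i] for i in sample}) < 2:
            continue  # redraw if a class is missing (a simplification of this sketch)
        feature = rng.randrange(n_features)                            # 2) random gene selection
        forest.append(fit_stump([rows[i] for i in sample],             # 3) fit a tree (here: a stump)
                                [labels[i] for i in sample], feature))
    return forest

def predict(forest, row):
    votes = Counter(right if row[feature] > threshold else left
                    for feature, threshold, left, right in forest)
    return votes.most_common(1)[0][0]                                  # 4) combine predictions

rows = [[1, 2, 1], [2, 1, 3], [3, 4, 2], [4, 3, 4], [7, 8, 9], [8, 7, 7], [9, 10, 8], [10, 9, 10]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(rows, labels)
print(predict(forest, [0, 0, 0]), predict(forest, [20, 20, 20]))
```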
61
Results without gene selection
  • SVMs nominally outperform RFs in 15 datasets, RFs
    outperform SVMs in 4 datasets, algorithms are
    exactly the same in 3 datasets.
  • In 7 datasets SVMs outperform RFs statistically
    significantly.
  • On average, the performance advantage of SVMs is
    0.033 AUC and 0.057 RCI.

62
Results with gene selection
  • SVMs nominally outperform RFs in 17 datasets, RFs
    outperform SVMs in 3 datasets, algorithms are
    exactly the same in 2 datasets.
  • In 1 dataset SVMs outperform RFs statistically
    significantly.
  • On average, the performance advantage of SVMs is
    0.028 AUC and 0.047 RCI.

63
Part 4: Analysis and computational dissection of
molecular signature multiplicity
64
Molecular signature multiplicity
  • Different methods or samples from the same
    population lead to different but apparently
    maximally predictive signatures
  • Far-reaching implications for biological
    discovery and development of next generation
    patient diagnostics and personalized treatments
  • Generation of biological hypotheses is very hard
    even when signatures are maximally predictive of
    the phenotype since thousands of completely
    different signatures are equally consistent with
    the data
  • Produced signatures are not statistically
    generalizable to new cases, and thus not reliable
    enough for translation to clinical practice.

65
Molecular signature multiplicity
  • Causes of this phenomenon are unknown; several
    contradictory conjectures exist in the field:
  • - Signature multiplicity is due to small samples
    [Michiels et al., 2005]
  • - Signature multiplicity leads to predictively
    non-reproducible signatures [Ein-Dor et al.,
    2006]; building reproducible signatures requires
    thousands of samples [Ioannidis, 2005]
  • - Signature multiplicity is a by-product of the
    complex regulatory connectivity of the genome
    [Dougherty and Brun, 2006]
  • - Artifacts of data pre-processing, e.g.
    normalization [Gold et al., 2005; Qiu et al.,
    2005; Ploner et al., 2005]

66
Major goals
  1. Develop a Markov boundary characterization of
    molecular signature multiplicity phenomenon
  2. Design and study algorithms that can correctly
    identify the set of maximally predictive and
    non-redundant molecular signatures
  3. Conduct an empirical evaluation of the novel
    algorithms and compare to the existing
    state-of-the-art methods
  4. Test and refine previously stated hypotheses
    about the causes of signature multiplicity
    phenomenon.

67
Optimality criteria of signatures
  • Signatures that are the focus of this research
    satisfy the following two optimality criteria:
  • maximally predictive of the phenotype (they
    achieve best predictivity of the phenotype in the
    given dataset over all signatures based on
    different gene sets)
  • do not contain predictively redundant genes
    (i.e., genes that can be removed from the
    signature without adversely affecting its
    predictivity).

68
Why do we need algorithms to extract as many
optimal signatures as possible?
  1. A deeper understanding of the signature
    multiplicity phenomenon and how it affects
    reproducibility of signatures
  2. Improving discovery of the underlying biological
    mechanisms by not missing genes that are
    implicated biologically in disease processes
  3. Catalyzing regulatory approval by establishing
    in-silico equivalence to previously validated
    signatures

69
Existing algorithms for multiple signature
extraction: Resampling-based methods
1) Generate resampled datasets from the training
data (e.g., by bootstrapping)
2) Apply a standard signature extraction
algorithm (e.g., SVM-RFE) to each resampled
dataset, yielding signatures X1, X2, X3, ..., XN
  • Based on assumption that multiplicity is strictly
    a small-sample phenomenon
  • An infinite number of resamplings is required to
    extract all optimal signatures
  • May stop producing multiple signatures in large
    sample sizes.

70
Existing algorithms for multiple signature
extraction: Iterative removal
Original data (for all genes) → signature X1
Remove corresponding genes from the dataset
Reduced data (excluding X1 genes) → signature X2
Remove corresponding genes from the dataset
Reduced data (excluding X1 and X2 genes) → signature X3
Remove corresponding genes from the dataset
... and so on, until a signature has statistically
significantly reduced predictivity
  • Agnostic to what causes molecular signature
    multiplicity
  • Cannot discover signatures that have genes in
    common.
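The limitation in the last bullet can be reproduced on a toy problem. Everything in this sketch is hypothetical: the phenotype is assumed to depend on three "information sources", each obtainable from alternative gene sets, predictivity is simply the number of sources a gene set covers, and a plain comparison replaces the statistical test for reduced predictivity.

```python
# Hypothetical toy world: each source of information about the phenotype
# can be supplied by any one of the listed alternative gene sets.
SOURCES = [
    [{"A"}, {"C"}],
    [{"B"}, {"D", "E"}],
    [{"F"}],
]

def predictivity(genes):
    """Number of information sources covered by the gene set (3 is maximal)."""
    return sum(any(alt <= genes for alt in alternatives) for alternatives in SOURCES)

def extract_signature(available):
    """Stand-in for a signature extraction algorithm: for each source, take
    the first alternative that is still available."""
    signature = set()
    for alternatives in SOURCES:
        for alt in alternatives:
            if alt <= available:
                signature |= alt
                break
    return signature

def iterative_removal(all_genes):
    remaining, signatures, best_score = set(all_genes), [], None
    while remaining:
        signature = extract_signature(remaining)
        score = predictivity(signature)
        if best_score is None:
            best_score = score
        if not signature or score < best_score:
            break  # predictivity dropped: stop
        signatures.append(signature)
        remaining -= signature  # remove the signature's genes and repeat
    return signatures

found = iterative_removal({"A", "B", "C", "D", "E", "F", "N1", "N2"})
print(found)
```

In this toy world four gene sets are maximally predictive and non-redundant ({A, B, F}, {C, B, F}, {A, D, E, F} and {C, D, E, F}), but all of them contain F, so once the first one is removed no other can be found.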

71
Existing algorithms for multiple signature
extraction: Stochastic gene selection
Genetic Algorithms (e.g., GA/KNN or GA/SVM)
  • Can output all signatures that are discoverable
    by a genetic algorithm when it is allowed to
    evolve an infinite number of generations.

KIAMB
  • Stochastic Markov boundary method based on IAMB
    algorithm
  • In a specific class of distributions, every
    optimal signature will be output by this method
    with nonzero probability
  • Requires an infinite number of iterations to
    discover all optimal signatures, and will discover
    the same signature over and over again
  • Sample requirements are of exponential order in
    the number of genes in a signature.

72
Existing algorithms for multiple signature
extraction: Brute-force exhaustive search
LIKNON
  • Examines predictivity of all individual genes in
    the dataset, all pairs of genes, all triples of
    genes, and so on
  • It is infeasible when a signature has more than
    2-3 genes
  • Agnostic to what causes signature multiplicity.

In summary, no current algorithm provides a
systematic and efficient approach for
identification of the set of maximally predictive
and non-redundant molecular signatures that exist
in the underlying distribution.
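A quick count shows why exhaustive search becomes infeasible beyond a few genes. The figure of 10,000 genes is an assumed dataset size, taken from the low end of the microarray range quoted earlier in the deck.

```python
from math import comb

def subsets_to_examine(n_genes, max_size):
    """Number of gene subsets of size 1..max_size that a brute-force
    search would have to evaluate."""
    return sum(comb(n_genes, size) for size in range(1, max_size + 1))

for max_size in (1, 2, 3, 4):
    print(max_size, subsets_to_examine(10_000, max_size))
```

Each of these subsets would also need its own classifier to be trained and validated, which multiplies the cost further.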
73
I. Markov boundary characterization of molecular
signature multiplicity
74
Key definitions (1/2)
  • Definition of maximally predictive molecular
    signature: A maximally predictive molecular
    signature is a molecular signature that maximizes
    predictivity of the phenotype relative to all
    other signatures that can be constructed from the
    same dataset.
  • Definition of maximally predictive and
    non-redundant molecular signature: A maximally
    predictive and non-redundant molecular signature
    based on variables X is a maximally predictive
    signature such that any signature based on a
    proper subset of variables in X is not maximally
    predictive.

75
Key definitions (2/2)
  • Definition of Markov blanket: A Markov blanket M
    of the response variable T ∈ V in the joint
    probability distribution P over variables V is a
    set of variables conditioned on which all other
    variables are independent of T, i.e. for every
    Y ∈ V \ (M ∪ {T}), T ⊥ Y | M.
  • Definition of Markov boundary (or non-redundant
    Markov blanket): If M is a Markov blanket of T
    and no proper subset of M satisfies the
    definition of Markov blanket of T, then M is
    called a Markov boundary (or non-redundant Markov
    blanket) of T.
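The definitions can be checked by brute force on a very small distribution. The three binary variables and their probabilities below are an invented example: A is a fair coin, and B and T are independent noisy copies of A. The check tests, for each variable outside the candidate set, whether it is independent of T given that set.

```python
from itertools import product

NAMES = ("A", "B", "T")

def build_joint():
    """P(A, B, T): A ~ Bernoulli(0.5); B copies A with prob. 0.8, T with prob. 0.9."""
    joint = {}
    for a, b, t in product((0, 1), repeat=3):
        p_b = 0.8 if b == a else 0.2
        p_t = 0.9 if t == a else 0.1
        joint[(a, b, t)] = 0.5 * p_b * p_t
    return joint

def prob(joint, **fixed):
    """Marginal probability of the event that the named variables take the given values."""
    return sum(p for values, p in joint.items()
               if all(values[NAMES.index(name)] == v for name, v in fixed.items()))

def is_markov_blanket(joint, blanket, target="T", tol=1e-9):
    others = [name for name in NAMES if name != target and name not in blanket]
    for blanket_values in product((0, 1), repeat=len(blanket)):
        given = dict(zip(blanket, blanket_values))
        p_given = prob(joint, **given)
        if p_given == 0:
            continue
        for other in others:
            for t_val, o_val in product((0, 1), repeat=2):
                p_both = prob(joint, **given, **{target: t_val, other: o_val}) / p_given
                p_t = prob(joint, **given, **{target: t_val}) / p_given
                p_o = prob(joint, **given, **{other: o_val}) / p_given
                if abs(p_both - p_t * p_o) > tol:
                    return False  # T depends on `other` even after conditioning
    return True

joint = build_joint()
for candidate in [(), ("A",), ("B",), ("A", "B")]:
    print(candidate, is_markov_blanket(joint, candidate))
```

Here {A} and {A, B} are Markov blankets of T, while {B} and the empty set are not, so {A} is the Markov boundary of T in this example.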

76
Theoretical results
  • Variable sets that participate in the maximally
    predictive signatures of T are precisely the
    Markov blankets of T and vice-versa
  • Similarly, variable sets that participate in the
    maximally predictive and non-redundant signatures
    of T are precisely the Markov boundaries of T and
    vice-versa
  • If a joint probability distribution P over
    variables V satisfies the intersection property,
    then there exists a unique Markov boundary of T
    [Pearl, 1988].

77
A fundamental reduction used in this research for
the analysis of signatures
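[Figure: cases and controls plotted by Gene X and Gene Y, with classifiers S1 to S5 drawn as decision surfaces; some are labeled "Signatures that have maximal predictivity of the phenotype relative to their genes" and the others "Signatures with worse predictivity"]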
  • Since there is an infinite number of signatures
    with maximal predictivity, when I refer to a
    signature, I mean one of the predictively
    equivalent classifiers (e.g., S3 or S4 or S5)
  • Can study signature classes by reference only to
    their genes
  • This reduction is justified whenever the
    classifiers used can learn the minimum error
    decision function given sufficient sample.

78
Example of Markov boundary multiplicity
Network structure
Distributional constraints
  1. Many optimal signatures exist, e.g., {A, C} and
    {B, C} are maximally predictive and non-redundant
    signatures of T. Furthermore, {A, C} and {B, C}
    remain maximally predictive even in infinite
    samples
  2. The network has very low connectivity
  3. Genes in optimal signatures do not have to be
    deterministically related, e.g., A and B are not
    deterministically related, yet individually
    convey the same information about T
  4. If an algorithm selects only one optimal
    signature, then there is a danger of missing
    biologically important causative genes
  5. The union of all optimal signatures includes all
    genes located in the local pathway around T
  6. In this example the intersection of all optimal
    signatures contains only genes in the local
    pathway around T.

79
II. A novel algorithm to correctly identify the
set of maximally predictive and non-redundant
signatures
80
TIE generative algorithm
81
TIE algorithm for gene expression data analysis
82
Trace of the TIE algorithm
  • M = {A, B, F}
  • G = {F}: Mnew = {A, B}. Not a Markov boundary; do
    not consider any G that is a superset of {F}
  • G = {A}: Mnew = {C, B, F}. Markov boundary
  • G = {B}: Mnew = {A, D, E, F}. Markov boundary
  • G = {A, B}: Mnew = {C, D, E, F}. Markov boundary
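The trace can be imitated with a toy sketch. This is not the TIE algorithm as published: the transcript does not include its pseudocode, so the loop below is only a guess at its overall shape (find one boundary, remove subsets G of genes already found, re-run the boundary finder, verify the result). The toy world of three information sources, the stand-in boundary finder and the stand-in verification criterion are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy world: T depends on three sources of information, each
# available from alternative gene sets.
SOURCES = [[{"A"}, {"C"}], [{"B"}, {"D", "E"}], [{"F"}]]

def induce_boundary(available):
    """Stand-in for the Markov boundary algorithm (input component X)."""
    boundary = set()
    for alternatives in SOURCES:
        for alt in alternatives:
            if alt <= available:
                boundary |= alt
                break
    return frozenset(boundary)

def is_markov_boundary(candidate):
    """Stand-in for the verification criterion (input component Z): the
    candidate must still carry all the information about T."""
    return all(any(alt <= candidate for alt in alternatives) for alternatives in SOURCES)

def tie_sketch(all_genes):
    all_genes = set(all_genes)
    boundaries = [induce_boundary(all_genes)]
    tried, rejected = set(), []
    changed = True
    while changed:
        changed = False
        pool = sorted(set().union(*boundaries))
        # Stand-in for the subset generation strategy (input component Y).
        for size in range(1, len(pool) + 1):
            for combo in combinations(pool, size):
                g = frozenset(combo)
                if g in tried or any(r <= g for r in rejected):
                    continue  # skip supersets of subsets that already failed
                tried.add(g)
                candidate = induce_boundary(all_genes - g)
                if not is_markov_boundary(candidate):
                    rejected.append(g)
                elif candidate not in boundaries:
                    boundaries.append(candidate)
                    changed = True
    return boundaries

found = tie_sketch("ABCDEF")
print([sorted(boundary) for boundary in found])
```

With these assumptions the sketch visits the same subsets as the trace: removing F gives {A, B}, which is rejected, while removing A, B, or both yields the three further boundaries.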

83
Theoretical results (1/2)
  • TIE returns all and only Markov boundaries of T
    (i.e., maximally predictive and non-redundant
    signatures) if its input components X, Y, Z are
    admissible
  • IAMB is an admissible Markov boundary algorithm
    (input component X) under assumptions
  • IAMB correctly outputs a Markov boundary if only
    the composition property holds
  • HITON-PC is an admissible Markov boundary
    algorithm (input component X) under assumptions
  • HITON-PC correctly outputs a Markov boundary if
    the adjacency faithfulness assumption holds
    except for violations of the intersection axiom,
    global Markov condition holds, and there are no
    spouses in the Markov boundary

84
Theoretical results (2/2)
  • Stated three strategies (IncLex, IncMinAssoc, and
    IncMaxAssoc) to generate subsets of variables
    that have to be removed from V to identify new
    Markov boundaries of T and proved their
    admissibility (input component Y)
  • Stated two criteria (Independence and
    Predictivity) to verify Markov boundaries and
    proved their admissibility (input component Z).

85
III. Empirical evaluation of the novel algorithms
and comparison with existing state-of-the-art
methods
86
A. Experiments with artificial simulated data
  • Generative model is available, and the set of
    Markov boundaries (and thus the set of maximally
    predictive and non-redundant signatures) is
    known.
  • Generate samples of systematically varied sizes
  • Compare to the gold standard
  • Test whether the TIE algorithm behaves according
    to theoretical expectations and study its
    empirical properties
  • Obtain clues about behavior of TIE and baseline
    comparison algorithms in experiments with real
    gene expression data.

87
Experiments with discrete networks TIED1 and
TIED2
  • Two artificial discrete networks were created
  • TIED1 consists of 30 variables (including a
    response variable T) and contains 72 Markov
    boundaries of T
  • TIED2 consists of 1,000 variables (including a
    response variable T) and contains the same 72
    Markov boundaries of T as TIED1.

88
Experiments
  • Goal: Compare TIE to state-of-the-art algorithms
    (Resampling-based methods, KIAMB, and Iterative
    Removal) and examine sensitivity of the tested
    methods to high dimensionality.
  • Findings
  • TIE correctly identifies the set of true Markov
    boundaries (maximally predictive and
    non-redundant signatures) in the datasets with 30
    or 1,000 variables
  • Iterative Removal identifies only 1 signature
  • KIAMB fails to identify any true signature, and
    its output signatures have poor predictivity
  • Resampling-based methods either miss true
    signatures and/or output many redundant variables
    in the signatures.

89
Experiments with linear continuous network LIND
LIND consists of 41 variables (including a
response variable T) and contains 12 Markov
boundaries of T.
90
Experiments
  • Goals
  • Analyze behavior of TIE as a function of sample
    size using data generated from a continuous
    network
  • Compare criteria Independence and Predictivity
    for verification of Markov boundaries in the TIE
    algorithm.
  • Findings
  • As sample size increases, the performance of both
    instantiations of TIE generally improves and the
    algorithms discover the set of true Markov
    boundaries
  • The α-level in the criterion Predictivity
    significantly affects the number of Markov
    boundaries output by the TIE algorithm
  • TIE with criterion Predictivity typically leads
    to a larger number of output Markov boundaries
    and on average superior performance compared to
    criterion Independence.

91
Experiments with discrete network XORD
XORD consists of 41 variables (including a
response variable T) and contains 25 Markov
boundaries of T.
92
Experiments
  • Goal: Evaluate TIE when popular Markov
    boundary algorithms such as IAMB and HITON-PC are
    not applicable due to violations of their
    fundamental assumptions.
  • Findings:
  • TIE discovers the set of true Markov boundaries
    when the sample size is 2,000
  • There is 1 false positive variable in each
    discovered Markov boundary for large sample sizes.

93
B. Experiments with resimulated microarray gene
expression data
  • Resimulated data by design closely resemble real
    human lung cancer microarray gene expression
    data
  • Knowledge of the generative model allows us to
    generate arbitrarily large samples and study the
    behavior of TIE as a function of sample size
  • Unlike prior experiments with artificial
    simulated datasets, the set of maximally
    predictive and non-redundant signatures is not
    known a priori.

94
Experiment
Goal: Examine whether the signature multiplicity
phenomenon vanishes as the sample size grows.
Results
95
Findings of other experiments
  • TIE is not sensitive to the choice of the
    initial signature discovered by the algorithm
  • Post-processing TIE signatures with wrapping
    results in more signatures with a smaller number
    of genes
  • Signatures output by tested non-TIE methods are
    either redundant or have inferior predictivity
    compared to signatures output by TIE techniques.

96
C. Experiments with real human microarray gene
expression data
  • Independent-dataset experiments: pairs of
    microarray datasets either from different
    laboratories or different platforms
  • Single-dataset experiments: additional
    experiments with relatively large sample size
    microarray datasets
  • The primary goal of both experiments is to
    compare TIE and baseline algorithms for multiple
    signature extraction in terms of the "maximal
    predictivity" of induced signatures and their
    reproducibility in independent data.
  • Operational definition of maximal predictivity:
    the empirically best classification performance
    (AUC) achievable in each dataset over all tested
    methods.
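
The operational definition above can be sketched in a few lines. This is an illustrative sketch only: the function names `auc` and `empirically_maximal` and the method names in the usage lines are hypothetical, and the statistical comparison that decides whether a method is distinguishable from the best one is not shown.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    Tied scores count as half a correctly ranked pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def empirically_maximal(auc_by_method):
    """Return the best AUC in a dataset and the methods that reach it."""
    best = max(auc_by_method.values())
    winners = sorted(m for m, a in auc_by_method.items() if a == best)
    return best, winners


# Hypothetical AUC values for one dataset; not results from the slides.
example = {"method_A": 0.90, "method_B": 0.85, "method_C": 0.90}
print(empirically_maximal(example))
```

The pairwise formulation is quadratic in the number of samples, which is acceptable at the sample sizes discussed here (a few hundred); a rank-based formula would be the usual choice for larger data.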

97
Independent-dataset experiments: Datasets
Lung Cancer Diagnosis: lung tumors vs. normals (non-tumor lung samples)
  • Discovery dataset: 203 samples, lung tumors (186) and normals (17), 12600 genes, Affymetrix U95A
  • Validation dataset: 96 samples, lung tumors (86) and normals (10), 7129 genes, Affymetrix HuGeneFL
  • Number of common genes: 7094
Lung Cancer Subtype Classification: adenocarcinoma vs. squamous cell carcinoma lung tumors
  • Discovery dataset: 160 samples, adenocarcinoma (139) and squamous (21), 12600 genes, Affymetrix U95A
  • Validation dataset: 28 samples, adenocarcinoma (14) and squamous (14), 12533 genes, Affymetrix U95A
  • Number of common genes: 12533
Breast Cancer Subtype Classification: estrogen receptor positive (ER+) vs. ER- breast tumors, untreated patients
  • Discovery dataset: 286 samples, ER+ (209) and ER- (77), 22283 genes, Affymetrix U133A
  • Validation dataset: 119 samples, ER+ (85) and ER- (34), 22283 genes, Affymetrix U133A
  • Number of common genes: 22283
Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis)
  • Discovery dataset: 204 samples, poor prognosis (66) and good prognosis (138), 22283 genes, Affymetrix U133A
  • Validation dataset: 72 samples, poor prognosis (13) and good prognosis (59), 22283 genes, Affymetrix U133A
  • Number of common genes: 22283
Glioma Subtype Classification: grade III vs. grade IV glioma tumors
  • Discovery dataset: 100 samples, grade III (24) and grade IV (76), 22283 genes, Affymetrix U133A
  • Validation dataset: 85 samples, grade III (26) and grade IV (59), 22283 genes, Affymetrix U133A
  • Number of common genes: 22283
Leukemia 5 Yr. Prognosis: patients with disease-free survival < 5 years (ones who had relapse or competing events within 5 years) vs. > 5 years
  • Discovery dataset: 164 samples, survival < 5 yr. (29) and survival > 5 yr. (135), 12625 genes, Affymetrix U95A
  • Validation dataset: 79 samples, survival < 5 yr. (18) and survival > 5 yr. (61), 22283 genes, Affymetrix U133A
  • Number of common genes: 10507
98
Detailed results (1/3)
99
Detailed results (2/3)
100
Detailed results (3/3)
101
TIE signatures have maximal predictivity
  • TIE achieves maximal predictivity in 5 out of 6
    validation datasets
  • Non-TIE methods achieve maximal predictivity in
    0 to 2 datasets depending on the method
  • In the dataset where the predictivity of TIE is
    statistically distinguishable from the
    empirically maximal one (Lung Cancer Subtype
    Classification), the magnitude of this difference
    is only 0.009 AUC on average over all discovered
    signatures.

102
TIE signatures are reproducible, other
signatures may be overfitted
  • TIE has no overfitting on average over all
    signatures and datasets
  • Other methods achieve predictivity in the
    validation data that is lower than that in the
    discovery data (by 0.02-0.03 AUC), in addition to
    having inferior predictivity.
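
One simple way to quantify the overfitting described above is the average drop in AUC from discovery to validation data. The sketch below assumes that reading of the slide; the function name `mean_overfitting` and the numbers in the usage line are hypothetical.

```python
def mean_overfitting(auc_pairs):
    """Average (discovery AUC - validation AUC) over a set of signatures.

    auc_pairs: list of (discovery_auc, validation_auc), one per signature.
    A value near zero means no overfitting on average; a positive value
    means the signatures look better in discovery than in validation data.
    """
    if not auc_pairs:
        raise ValueError("at least one signature is required")
    return sum(d - v for d, v in auc_pairs) / len(auc_pairs)


# Hypothetical AUC pairs for two signatures; not results from the slides.
print(round(mean_overfitting([(0.90, 0.88), (0.80, 0.77)]), 3))
```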

103
TIE signatures in comparison with other
signatures
Figure: Predictivity results for the Leukemia 5 Yr.
Prognosis task. Axes: classification performance
(AUC) in the discovery dataset and classification
performance (AUC) in the validation dataset.
Each dot in the plot corresponds to a signature
(computational model) of the outcome, e.g.,
$\mathrm{Outcome}(x) = \mathrm{sign}(w \cdot x + b)$,
where $x, w \in \mathbb{R}^m$, $b \in \mathbb{R}$, and
$m$ is the number of genes in the signature.
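
As an illustration of a signature viewed as a computational model, the linear form sign(w · x + b) can be written out directly. This is a minimal sketch: the factory name `linear_signature`, the weights, and the expression values are hypothetical, and mapping a score of exactly zero to +1 is an arbitrary convention chosen here.

```python
def linear_signature(w, b):
    """Build a classifier Outcome(x) = sign(w . x + b) over m genes.

    w: list of m weights, one per gene in the signature.
    b: scalar offset.
    """
    def predict(x):
        if len(x) != len(w):
            raise ValueError("x must have one value per gene in the signature")
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        # Convention assumed here: a score of exactly zero maps to +1.
        return 1 if score >= 0 else -1
    return predict


# Hypothetical two-gene signature.
predict = linear_signature([1.0, -2.0], 0.5)
print(predict([1.0, 0.0]), predict([0.0, 1.0]))
```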
104
Single-dataset experiments: Datasets
Lymphoma Subtype Classification I: diffuse large B-cell lymphoma (DLBCL) vs. Burkitt's lymphoma (BL) patients
  • 303 samples, DLBCL (258) and BL (45), 2745 genes, Human LymphDx 2.7k GeneChip
Lymphoma Subtype Classification II: diffuse large B-cell lymphoma (DLBCL) vs. mediastinal large B-cell lymphoma (MLBCL) patients
  • 210 samples, DLBCL (176) and MLBCL (34), 32403 (44928) genes, Affymetrix U133A and U133B
Breast Cancer Subtype Classification I: p53 mutant vs. wild-type breast tumors
  • 251 samples, p53 mutant (58) and p53 wild-type (193), 22283 genes, Affymetrix U133A
Breast Cancer Subtype Classification II: estrogen receptor positive (ER+) vs. ER- breast tumors
  • 247 samples, ER+ (213) and ER- (34), 22283 genes, Affymetrix U133A
Breast Cancer Subtype Classification III: progesterone receptor positive (PgR+) vs. PgR- breast tumors
  • 251 samples, PgR+ (190) and PgR- (61), 22283 genes, Affymetrix U133A
Breast Cancer 5 Yr. Prognosis: ER+ patients who developed distant metastases within 5 years (poor prognosis) vs. ones who did not (good prognosis)
  • 215 samples, poor prognosis (51) and good prognosis (164), 24496 genes, Agilent Hu25K
Bladder Cancer Stage Classification: stage Ta vs. other stages (T1, T2, T3, T4) of bladder tumors
  • 404 samples, stage Ta (189) and other stages (215), 1381 (3072) genes, MDL Human 3k
Splitting protocol:
  • Validation dataset: a subset of 100
    samples/patients
  • Discovery dataset: all remaining
    samples/patients
  • Repeat the split into discovery and validation
    datasets 10 times to minimize variance
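
The splitting protocol can be sketched as follows. The slide does not say how the 100 validation samples are drawn, so plain random sampling without stratification is an assumption here, as are the function name `repeated_splits` and the seed.

```python
import random


def repeated_splits(n_samples, n_validation=100, n_repeats=10, seed=0):
    """Return n_repeats (discovery, validation) index splits.

    Each split holds out n_validation randomly chosen samples for
    validation and uses all remaining samples for discovery.
    Assumption: simple random sampling, no stratification by class.
    """
    if n_validation >= n_samples:
        raise ValueError("validation set must be smaller than the dataset")
    rng = random.Random(seed)
    splits = []
    for _ in range(n_repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        validation = sorted(idx[:n_validation])
        discovery = sorted(idx[n_validation:])
        splits.append((discovery, validation))
    return splits


# Example: a dataset with 303 samples gives 203 discovery samples per split.
splits = repeated_splits(303)
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

With class ratios as skewed as some of those listed (for example 213 vs. 34), a stratified draw may be preferable in practice so that every validation subset contains both classes.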

105
Single-dataset experiments: Summary results
  • Results are similar to the ones from
    independent-dataset experiments
  • TIE achieves maximal predictivity in 6 out of 7
    validation datasets
  • Non-TIE methods achieve maximal predictivity in
    0 to 1 datasets depending on the method
  • In the dataset where TIE has predictivity that
    is statistically distinguishable from the
    empirically maximal one (Breast Cancer Subtype
    Classification II), the magnitude of this
    difference is less than 0.01 AUC on average over
    all discovered signatures.

106
IV. Discussion and interpretation of results
107
Revisiting previously published hypotheses about
signature multiplicity
  • Signature reproducibility neither precludes
    multiplicity nor requires sample sizes with
    thousands of subjects
  • Multiplicity of signatures does not require dense
    connectivity
  • Noisy measurements or normalization are not
    necessary conditions for signature multiplicity
  • Multiplicity can be produced by a combination of
    small sample size-related variance and intrinsic
    multiplicity in the underlying network
  • Multiple signatures output by TIE are
    reproducible even though they are derived from
    small sample, noisy, and heavily-processed data.

108
A more complete picture is emerging regarding
causes of multiplicity...
  1. Intrinsic information redundancy in the
    underlying biological system
  2. Variability in the output of gene selection and
    classifier algorithms, especially with small
    sample sizes
  3. Small sample statistical indistinguishability of
    signatures with different large sample
    predictivity and/or redundancy characteristics
  4. Presence of hidden variables
  5. Correlated measurement noise
  6. RNA amplification techniques that systematically
    distort measurements of transcript ratios
  7. Cellular aggregation and sampling from mixtures
    of distributions that affect inference of
    conditional independence relations
  8. Normalization and other data pre-processing
    methods that artificially increase correlations
    among genes
  9. Engineered redundancy in the assay technology
    platforms.

109
Summary of results
  1. Developed a Markov boundary characterization of
    molecular signature multiplicity
  2. Designed a generative algorithm that can
    correctly identify the set of maximally
    predictive and non-redundant molecular signatures
    in principle independently of data distribution
  3. Conducted an empirical evaluation of the novel
    algorithm and compared it to existing
    state-of-the-art methods using artificial
    simulated, resimulated microarray gene
    expression, and real human microarray gene
    expression data
  4. Tested and refined several hypotheses about the
    causes of molecular signature multiplicity
    phenomenon.

110
General conclusions
  1. Molecular signatures play a crucial role in
    personalized medicine and translational
    bioinformatics.
  2. Molecular signatures are being used to treat
    patients today, not in the future.
  3. Development of accurate molecular signatures
    should rely on the use of supervised methods.
  4. In general, there are many challenges for
    computational analysis of omics data for
    development of molecular signatures.
  5. One of these challenges is molecular signature
    multiplicity.
  6. There exists an algorithm that can extract the
    set of maximally predictive and non-redundant
    molecular signatures from high-throughput data.

111
Homework (Due next Monday)
  • Read the paper Analysis and Computational
    Dissection of Molecular Signature Multiplicity.
  • Describe a novel and interesting application area
    for the TIE algorithm. Feel free to use an example
    from your research where there exist many
    molecular signatures of some response variable
    (1/2 page max).
  • Come up with another cause of molecular signature
    multiplicity that was not mentioned in the paper
    (1/2 page max).
  • Email your work to Alexander.Statnikov_at_med.nyu.edu

112
Computational Causal Discovery Laboratory at NYU
Center for Health Informatics and Bioinformatics
(CHIBI)
  • The purpose of our lab is to develop, test and
    apply computational causal discovery methods
    suitable for molecular, clinical, imaging and
    multi-modal data of high-dimensionality.
  • We are interested in methods to address the
    following questions:
  • What is causing disease/phenotype?
  • What are the effects of disease/phenotype?
  • Which biological pathways are involved?
  • How to design drugs/treatments?
  • How does genotype cause differences in response
    to treatment?
  • How does the environment modify or even supersede
    the normal causal function of genes and other
    molecular variables?
  • How are genes and proteins organized in complex
    causal regulatory networks?
  • Questions? Email Alexander.Statnikov_at_med.nyu.edu