Class prediction for experiments with microarrays

1
Class prediction for experiments with microarrays
  • Lara Lusa
  • Inštitut za biomedicinsko informatiko Medicinska
    fakulteta
  • Lara.Lusa at mf.uni-lj.si

2
Outline
  • Objectives of microarray experiments
  • Class prediction
  • What is a predictor?
  • How to develop a predictor?
  • Which are the available methods?
  • Which features should be used in the predictor?
  • How to evaluate a predictor?
  • Internal vs. external validation
  • Some examples of what can go wrong
  • The molecular classification of breast cancer

3
Scheme of an experiment
  • Study design
  • Performance of the experiment
  • Sample preparation
  • Hybridization
  • Image analysis
  • Quality control and normalization
  • Data analysis
  • Class comparison
  • Class prediction
  • Class discovery
  • Interpretation of the results

4
Aims of high-throughput experiments
  • Class comparison - supervised
  • establish differences in gene expression between
    predetermined classes (phenotypes)
  • Tumor vs. Normal tissue
  • Recurrent vs. Non-recurrent patients treated with
    a drug (Ma, 2004)
  • ER+ vs. ER- patients (West, 2001)
  • BRCA1, BRCA2 and sporadics in breast cancer
    (Hedenfalk, 2001)
  • Class prediction - supervised
  • prediction of phenotype using gene expression
    data
  • morphology of a leukemia patient based on his
    gene expression (ALL vs. AML, Golub 1999)
  • which patients with breast cancer will develop a
    distant metastasis within 5 years (van't Veer,
    2002)
  • Class discovery - unsupervised
  • discover groups of samples or genes with similar
    expression
  • Luminal A, B, C(?), Basal, ERBB2, Normal in
    Breast Cancer (Perou 2001, Sørlie, 2003)

5
Data from microarray experiments
6
How to develop a predictor?
  • On a training set of samples
  • Select a subset of genes (feature selection)
  • Use gene expression measurements (X)
  • Predict class
  • membership (Y) of new samples
  • (test set)

Obtain a RULE (g(X)) based on gene-expression for
the classification of new samples
7
An example from Duda et al.
8
Rule: Nearest-neighbor classifier
  • For each sample of the independent data set
    (testing set) calculate Pearson's (centered)
    correlation of its gene expression with each
    sample from the training set
  • Classification rule: assign the new sample to the
    class of the training-set sample that has the
    highest correlation with the new sample

[Figure: correlation of the new sample with the
samples from the training set. Bishop, 2006]
9
Rule: K-Nearest-neighbor classifier
  • For each sample of the independent data set
    (testing set) calculate Pearson's (centered)
    correlation of its gene expression with each
    sample from the training set
  • Classification rule: assign the new sample to the
    class to which the majority of the K training-set
    samples with the highest correlation with the
    new sample belong

[Figure: correlation of the new sample with the
samples from the training set, K = 3. Bishop, 2006]
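The nearest-neighbor and K-nearest-neighbor rules can be sketched as follows. This is a minimal illustration, not the presenter's code: the function names and the toy profiles (4 genes, classes "A" and "B") are hypothetical, and ties in the vote are broken arbitrarily.

```python
from collections import Counter
from math import sqrt

def pearson(a, b):
    """Pearson (centered) correlation between two equal-length profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    num = sum(x * y for x, y in zip(da, db))
    den = sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    return num / den  # undefined (division by zero) for a constant profile

def knn_predict(x_new, train_profiles, train_labels, k=3):
    """Majority vote among the k training samples most correlated with x_new."""
    ranked = sorted(zip(train_profiles, train_labels),
                    key=lambda pair: pearson(x_new, pair[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set: rows are samples, columns are genes.
train_profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [1.0, 3.0, 2.0, 5.0],
                  [4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 0.0], [4.0, 2.0, 3.0, 1.0]]
train_labels = ["A", "A", "A", "B", "B", "B"]
x_new = [0.5, 1.0, 2.0, 3.0]

print(knn_predict(x_new, train_profiles, train_labels, k=1))  # nearest neighbor
print(knn_predict(x_new, train_profiles, train_labels, k=3))  # K = 3
```

With k=1 the function reduces to the plain nearest-neighbor rule of the previous slide.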
10
Rule: Method of centroids (Sørlie et al. 2003)
  • Method of centroids: class prediction rule
  • Define a centroid for each class on the original
    data set (training set)
  • For each gene, average its expression over the
    samples assigned to that class
  • For each sample of the independent data set
    (testing set) calculate Pearson's (centered)
    correlation of its gene expression with each
    centroid
  • Classification rule: assign the sample to the
    class whose centroid has the highest
    correlation with the sample (if the highest
    correlation is below .1, do not assign)

[Figure: correlation of the new sample with the
class centroids; the sample is assigned to the class
whose centroid has the highest correlation with it]
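A minimal sketch of the method of centroids as described on this slide, including the .1 cut-off for non-classifiable samples. The function names and toy data are hypothetical and are not taken from Sørlie et al.

```python
from math import sqrt

def pearson(a, b):
    """Pearson (centered) correlation between two equal-length profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    num = sum(x * y for x, y in zip(da, db))
    return num / sqrt(sum(x * x for x in da) * sum(y * y for y in db))

def class_centroids(profiles, labels):
    """Gene-wise mean expression of the training samples in each class."""
    centroids = {}
    for cls in sorted(set(labels)):
        members = [p for p, l in zip(profiles, labels) if l == cls]
        centroids[cls] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def centroid_predict(x_new, centroids, min_cor=0.1):
    """Return (class, correlation); class is None if the best correlation < min_cor."""
    cors = {cls: pearson(x_new, c) for cls, c in centroids.items()}
    best = max(cors, key=cors.get)
    return (best if cors[best] >= min_cor else None), cors[best]

# Hypothetical training set: rows are samples, columns are genes.
train_profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [1.0, 3.0, 2.0, 5.0],
                  [4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 0.0], [4.0, 2.0, 3.0, 1.0]]
train_labels = ["A", "A", "A", "B", "B", "B"]
centroids = class_centroids(train_profiles, train_labels)

print(centroid_predict([0.5, 1.0, 2.0, 3.0], centroids))  # assigned to "A"
print(centroid_predict([2.0, 3.0, 1.0, 2.0], centroids))  # best correlation below .1
```

The second sample is left unassigned because its highest correlation (with the "B" centroid) is about 0.09.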
11
Rule: Diagonal Linear Discriminant Analysis (DLDA)
  • Calculate the mean expression of the samples from
    Class 1 and Class 2 in the training set for each
    of the G genes, \bar{x}_{1j} and \bar{x}_{2j},
  • and the pooled within-class variance s_j^2
  • For each sample x of the test set evaluate if
    \[ \sum_{j=1}^{G} \frac{(x_j - \bar{x}_{1j})^2}{s_j^2} \le \sum_{j=1}^{G} \frac{(x_j - \bar{x}_{2j})^2}{s_j^2} \]
  • where x_j is the expression of the j-th gene for
    the new sample
  • Classification rule: if the above inequality is
    satisfied, classify the sample in Class 1,
    otherwise in Class 2.
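A minimal sketch of DLDA in its usual form: a sample is assigned to the class whose mean profile is closest when each gene's squared difference is divided by that gene's pooled within-class variance. The function names and toy data are hypothetical, and the sketch assumes every gene has non-zero pooled variance.

```python
def dlda_fit(profiles, labels):
    """Class means per gene and pooled within-class variance per gene."""
    classes = sorted(set(labels))
    n_samples, n_genes = len(profiles), len(profiles[0])
    means = {}
    sum_sq = [0.0] * n_genes
    for cls in classes:
        members = [p for p, l in zip(profiles, labels) if l == cls]
        mu = [sum(col) / len(members) for col in zip(*members)]
        means[cls] = mu
        for p in members:
            for j in range(n_genes):
                sum_sq[j] += (p[j] - mu[j]) ** 2
    pooled_var = [s / (n_samples - len(classes)) for s in sum_sq]
    return means, pooled_var

def dlda_predict(x, means, pooled_var):
    """Assign x to the class with the smallest variance-scaled squared distance."""
    def score(cls):
        return sum((xj - mj) ** 2 / vj
                   for xj, mj, vj in zip(x, means[cls], pooled_var))
    return min(means, key=score)

# Hypothetical training set: rows are samples, columns are genes.
train_profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [1.0, 3.0, 2.0, 5.0],
                  [4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 0.0], [4.0, 2.0, 3.0, 1.0]]
train_labels = ["A", "A", "A", "B", "B", "B"]
means, pooled_var = dlda_fit(train_profiles, train_labels)

print(dlda_predict([0.5, 1.0, 2.0, 3.0], means, pooled_var))
print(dlda_predict([4.5, 3.0, 2.0, 0.5], means, pooled_var))
```

With two classes, comparing the two scores is the same as checking a single inequality between the two variance-scaled distances.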

12
Rule: Diagonal Linear Discriminant Analysis (DLDA)
  • Particular case of discriminant analysis with the
    hypotheses that
  • the features are not correlated
  • the variances of the two classes are the same
  • Other methods used in microarray studies are
    variants of discriminant analysis
  • Compound covariate predictor
  • Weighted vote method

Bishop, 2006
13
Other popular classification methods
  • Classification and Regression Trees (CART)
  • Prediction Analysis of Microarrays (PAM)
  • Support Vector Machines (SVM)
  • Logistic regression
  • Neural networks

Bishop, 2006
14
How to choose a classification method?
  • No single method is optimal in every situation
  • No Free Lunch Theorem: in the absence of
    assumptions we should not prefer any
    classification algorithm over another
  • Ugly Duckling Theorem: in the absence of
    assumptions there is no best set of features

15
The bias-variance tradeoff
Hastie et al, 2001
\[ \mathrm{MSE} = E_D\big[(g(x; D) - F(x))^2\big] = \big(E_D[g(x; D)] - F(x)\big)^2 + E_D\big[\big(g(x; D) - E_D[g(x; D)]\big)^2\big] = \mathrm{Bias}^2 + \mathrm{Variance} \]
Duda et al, 2001
16
Feature selection
  • Can ALL the gene expression variables be included
    in the classifier?
  • Which variables should be used to build the
    classifier?
  • Filter methods
  • Prior to building the classifier
  • One feature at a time or joint distribution
    approaches
  • Wrapper methods
  • Performed implicitly by the classifier
  • CART, PAM

From Fridlyand, CBMB Workshop
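As an illustration of a one-feature-at-a-time filter, the sketch below ranks genes by the absolute value of a two-sample t-statistic computed on the training set and keeps the top ones. The choice of the t-statistic, the function names and the toy data are assumptions made for this example; the slide does not name a specific filter.

```python
from math import sqrt

def t_statistic(a, b):
    """Two-sample t-statistic with pooled variance (assumes non-zero variance)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    pooled_var = ss / (na + nb - 2)
    return (ma - mb) / sqrt(pooled_var * (1.0 / na + 1.0 / nb))

def select_genes(profiles, labels, n_keep):
    """Indices of the n_keep genes with the largest |t| between the two classes."""
    first, second = sorted(set(labels))  # two-class problem assumed
    scored = []
    for j, column in enumerate(zip(*profiles)):
        a = [v for v, l in zip(column, labels) if l == first]
        b = [v for v, l in zip(column, labels) if l == second]
        scored.append((abs(t_statistic(a, b)), j))
    scored.sort(reverse=True)
    return sorted(j for _, j in scored[:n_keep])

# Hypothetical training set: rows are samples, columns are genes.
train_profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [1.0, 3.0, 2.0, 5.0],
                  [4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 0.0], [4.0, 2.0, 3.0, 1.0]]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(select_genes(train_profiles, train_labels, n_keep=2))  # → [0, 3]
```

Because the filter uses the class labels, it has to be treated as part of the predictor whenever predictive accuracy is estimated.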
17
A comparison of classifiers' performance for
microarray data
  • Dudoit, Fridlyand and Speed (2002, JASA) on 3 data
    sets
  • DA, DLDA, k-NN, SVM, CART
  • Good performance of simple classifiers such as
    DLDA and NN
  • Feature selection: small number of features
    included in the classifier

18
How to evaluate the performance of a classifier
  • Classification error
  • A sample is classified in a class to which it
    does not belong
  • g(X) ≠ Y
  • Predictive accuracy: % of correctly classified
    samples
  • In a two-class problem, using the terminology
    from diagnostic tests (+ diseased, - healthy)
  • Sensitivity = P(classified + | true +)
  • Specificity = P(classified - | true -)
  • Positive predictive value = P(true + |
    classified +)
  • Negative predictive value = P(true - |
    classified -)
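These quantities can be computed directly from the true and predicted labels. The sketch below uses a hypothetical function name and made-up labels for 10 samples; it assumes that both classes occur among the true and the predicted labels, otherwise some ratios are undefined.

```python
def diagnostic_measures(true_labels, predicted_labels, positive="+"):
    """Sensitivity, specificity, PPV, NPV and accuracy for a two-class problem."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "sensitivity": tp / (tp + fn),  # P(classified + | true +)
        "specificity": tn / (tn + fp),  # P(classified - | true -)
        "ppv": tp / (tp + fp),          # P(true + | classified +)
        "npv": tn / (tn + fn),          # P(true - | classified -)
        "accuracy": (tp + tn) / len(pairs),
    }

# Hypothetical labels: 4 diseased (+) and 6 healthy (-) samples.
true_labels = ["+", "+", "+", "+", "-", "-", "-", "-", "-", "-"]
predicted = ["+", "+", "+", "-", "-", "-", "-", "-", "+", "-"]
measures = diagnostic_measures(true_labels, predicted)
print(measures)
```

Here 3 of the 4 diseased and 5 of the 6 healthy samples are correctly classified, so sensitivity is 0.75 and specificity is 5/6.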

19
Class prediction: how to assess the predictive
accuracy?
  • Use an independent data set
  • If it is not available?
  • ABSOLUTELY WRONG
  • Apply your predictor to the data you used to
    develop it and see how well it predicts
  • OK
  • cross validation
  • bootstrap

[Figure: splitting of the data into train and test
portions]
20
How to develop a cross-validated class predictor
  • Training set
  • Test set
  • Predict the class of the test set samples using
    the class predictor developed on the training set
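A minimal sketch of leave-one-out cross-validation in which every step of building the predictor, including gene selection, is repeated on the training portion of each split. The filter (largest difference in class means), the nearest-mean classifier, the function names and the toy data are assumptions chosen to keep the example short.

```python
def class_means(profiles, labels):
    """Gene-wise mean expression of the samples in each class."""
    means = {}
    for cls in sorted(set(labels)):
        members = [p for p, l in zip(profiles, labels) if l == cls]
        means[cls] = [sum(col) / len(members) for col in zip(*members)]
    return means

def top_genes(profiles, labels, n_keep):
    """Filter: genes with the largest absolute difference in class means."""
    means = class_means(profiles, labels)
    a, b = (means[c] for c in sorted(means))  # two-class problem assumed
    diffs = [abs(x - y) for x, y in zip(a, b)]
    return sorted(range(len(diffs)), key=lambda j: diffs[j], reverse=True)[:n_keep]

def nearest_mean_predict(x, means, genes):
    """Assign x to the class with the closest mean on the selected genes."""
    def dist(cls):
        return sum((x[j] - means[cls][j]) ** 2 for j in genes)
    return min(means, key=dist)

def loo_cv_error(profiles, labels, n_keep):
    """Leave-one-out error; gene selection is redone inside every fold."""
    errors = 0
    for i in range(len(profiles)):
        train_x = profiles[:i] + profiles[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        genes = top_genes(train_x, train_y, n_keep)  # uses training samples only
        means = class_means(train_x, train_y)
        if nearest_mean_predict(profiles[i], means, genes) != labels[i]:
            errors += 1
    return errors / len(profiles)

# Hypothetical data set: rows are samples, columns are genes.
profiles = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 6.0], [1.0, 3.0, 2.0, 5.0],
            [4.0, 3.0, 2.0, 1.0], [5.0, 3.0, 2.0, 0.0], [4.0, 2.0, 3.0, 1.0]]
labels = ["A", "A", "A", "B", "B", "B"]

print(loo_cv_error(profiles, labels, n_keep=2))  # → 0.0
```

Selecting the genes once on all samples and cross-validating only the classifier would let the left-out sample influence the predictor, which is one way to obtain an optimistic error estimate.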

21
Dupuy and Simon, JNCI 2007
Supervised prediction: 12/28 studies reported a
misleading estimate of prediction accuracy. 50%
of studies contained one or more major flaws
22
(No Transcript)
23
Class prediction: a famous example
van't Veer et al. report results obtained with
the wrong analysis in the paper, and the correct
analysis (with less striking results) only in the
supplementary material
24
What went wrong?
Produces highly biased estimates of predictive
accuracy
Going beyond the quantification of predictive
accuracy and attempting to make inference with
a cross-validated class predictor: INFERENCE MADE
IS NOT VALID
25
Hypothesis: there is no difference between
classes (Lusa, McShane, Radmacher, Shih, Wright,
Simon, Statistics in Medicine, 2007)

Prop. of rejected H0    0.01    0.05    0.10
LOO CV                  0.268   0.414   0.483   (n = 100)

Microarray predictor    Observed
                        <5 yrs   >5 yrs
Good prognosis          31       18
Bad prognosis           2        26
Odds ratio = 15.0, p-value = 4 × 10^(-6)

Parameter       Logistic coeff.   Std. error   Odds ratio   95% CI
Grade           -0.08             0.79         1.1          0.2-5.1
ER              0.5               0.94         1.7          0.3-10.4
PR              -0.75             0.93         2.1          0.3-13.1
Size (mm)       -1.26             0.66         3.5          1.0-12.8
Age             1.4               0.79         4            0.9-19.1
Angioinvasion   -1.55             0.74         4.7          1.1-20.1
Microarray      2.87              0.851        7.6          3.3-93.7


26
Michiels et al, 2005 Lancet
27
Final remarks
  • Simple classification methods such as DLDA have
    proved to work well for microarray studies and
    to outperform fancier methods
  • A lot of classification methods which have been
    proposed in the field with new names are just
    slight modifications of already known techniques

28
Final remarks
  • Report all the necessary information about your
    classifier so that others can apply it to their
    data
  • Evaluate correctly the predictive accuracy of the
    classifier
  • in the early days of microarrays, many papers
    presented analyses that were not correct, or drew
    wrong conclusions from their work
  • even now, journals with middle and low impact
    factors keep publishing obviously wrong analyses
  • Don't apply methods without understanding exactly
  • what they are doing
  • on which assumptions they rely

29
Other issues in classification
  • Missing data
  • Class representation
  • Choice of distance function
  • Standardization of observations and variables
  • An example where all this matters

30
Class discovery
  • Mostly performed through hierarchical clustering
    of genes and samples
  • Often abused method in microarray analysis, used
    instead of supervised methods
  • Only in very few examples
  • the stability and reproducibility of the
    clustering are assessed
  • results are validated or further used after
    discovery
  • a rule for classification of new samples is given
  • Projection of the clustering to new data sets
    still seems problematic

It becomes a class prediction problem
31
Molecular taxonomy of breast cancer
  • Perou/Sørlie (Stanford/Norway)
  • Class sub-type discovery (Perou, Nature 2001,
    Sørlie, PNAS 2001, Sørlie, PNAS 2003)
  • Association of discovered classes with survival
    and other clinical variables (Sørlie, PNAS 2001,
    Sørlie, PNAS 2003)
  • Validation of findings assigning class labels
    defined from class discovery to independent data
    sets (Sørlie, PNAS 2003)

32
Sørlie et al, PNAS 2003
[Figure: hierarchical clustering of the 122 samples
from the paper using the intrinsic gene set (500
genes); average linkage, distance = 1 - Pearson's
(centered) correlation. For each class: number of
samples (node correlation for the core samples
included for each subtype) and percentage of ER+
samples]
  10 (>.31)   2/3
  28 (>.32)   89
  11 (>.28)   82
  11 (>.34)   64
  19 (>.41)   22
  n = 79 (64) (?)
33
Can we assign subtype membership to samples from
independent data sets?
Sørlie et al. 2003 centroids
  • Method of centroids: class prediction rule
  • Define a centroid for each class on the original
    data set (training set)
  • For each gene, average its expression over the
    samples assigned to that class
  • For each sample of the independent data set
    (testing set) calculate Pearson's (centered)
    correlation of its gene expression with each
    centroid
  • Classification rule: assign the sample to the
    class whose centroid has the highest
    correlation with the sample (if the highest
    correlation is below .1, do not assign)

[Figure: correlation of the new sample with the
centroids; the sample is assigned to the class
whose centroid has the highest correlation with it]
  • Cited thousands of times
  • Widely used in research papers and praised in
    editorials
  • Recent concerns raised about their
    reproducibility and robustness

West data set
34
Predicted class membership (Sørlie's subtypes, our
data)
  • Loris: I obtained the subtypes on our data! All
    the samples from Tam113 are Lum A, a bit
    strange... there are no Lum B in our data set
  • Lara: Have you also tried on BRCA60?
  • Loris: No... Those are mostly Lum A, too. Some
    are Normal, very strange... there are no Basal
    among the ER-!
  • Lara: ... Have you mean-centered the genes?
  • Loris: No... Looks better on BRCA60. Now the
    ER- are mostly Basal... On Tam113 I get many
    Lum B... But 50 of the samples from Tam113 are
    NOT luminal anymore!
  • Something is wrong!

BRCA60: hereditary breast cancer (42 ER+/16 ER-)
Tam113: tamoxifen-treated breast cancer (113
ER+/0 ER-)
35
How are the systematic differences between
microarray platforms/batches taken into account?
  • Sørlie et al. 2003 data set
  • Genes were mean (and eventually median) centered
  • "..., the data file was adjusted for array batch
    differences as follows: on a gene-by-gene basis,
    we computed the mean of the nonmissing expression
    values separately in each batch. Then for each
    sample and each gene, we subtracted its batch
    mean for that gene. Hence, the adjusted array
    would have zero row-means within each batch. This
    ensures that any variance in a gene is not a
    result of a batch effect."
  • "Rows (genes) were median-centered and both genes
    and experiments were clustered by using an
    average hierarchical clustering algorithm."
  • West et al. data set (Affymetrix, single-channel
    data)
  • Genes were centered
  • "Data were transformed to a compatible format by
    normalizing to the median experiment. Each
    absolute expression value in a given sample was
    converted to a ratio by dividing by its average
    expression value across all samples."
  • van't Veer et al. data set
  • Genes do not seem to have been mean-centered
  • Other data sets where the method was applied
  • Genes were always centered
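
The batch adjustment quoted for the Sørlie et al. data (subtracting, gene by gene, the mean of the nonmissing values within each batch) can be sketched as follows. The function name, the use of None for missing values and the toy data are assumptions; samples are stored as rows here, whereas the quoted text has genes as rows.

```python
def center_genes_within_batches(profiles, batches):
    """Subtract from each value the mean of its gene within its batch.

    profiles: list of samples, each a list of gene values (None = missing).
    batches:  batch label of each sample.
    """
    n_genes = len(profiles[0])
    adjusted = [list(p) for p in profiles]
    for batch in set(batches):
        rows = [i for i, b in enumerate(batches) if b == batch]
        for j in range(n_genes):
            values = [profiles[i][j] for i in rows if profiles[i][j] is not None]
            if not values:
                continue  # gene missing in the whole batch: nothing to adjust
            batch_mean = sum(values) / len(values)
            for i in rows:
                if profiles[i][j] is not None:
                    adjusted[i][j] = profiles[i][j] - batch_mean
    return adjusted

# Hypothetical data: 4 samples, 2 genes, 2 batches, one missing value.
profiles = [[1.0, 4.0], [3.0, None], [10.0, 6.0], [14.0, 8.0]]
batches = ["a", "a", "b", "b"]
print(center_genes_within_batches(profiles, batches))
# → [[-1.0, 0.0], [1.0, None], [-2.0, -1.0], [2.0, 1.0]]
```

After the adjustment every gene has mean zero within each batch, so the result depends on which samples happen to be in the batch.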

[Figure: mean-centering, ER- and ER+ samples]
36
Possible concerns on the application of the
method of centroids
  • How are the classification results influenced
    by...
  • normalization of the data (mean-centering of the
    genes)?
  • differences in subtype prevalence across data
    sets?
  • presence of study (or batch) effects?
  • choice of the method of centroids as a
    classification method?
  • the use of the arbitrary cut-off for non
    classifiable samples?

Lusa et al, Challenges in projecting clustering
results across gene expression-profiling datasets
JNCI 2007
37
ER status (ligand-binding assay): 34 ER-/65 ER+;
7650 clones (6878 unique)
38
1. Effects of mean-centering the genes
Method of centroids applied to Sotiriou's data set
using Sørlie's centroids (derived from a centered
data set); 336/552 common and unique clones
Data centered (C) or non centered (N): full data
set (99 samples), ER+ subset (65 samples), ER-
subset (34 samples)

No. = number classified (in parentheses: number with
correlation < .1); ER+ = number of ER+ samples

            Full data,       Full data,       ER+ subset,   ER+ subset,
            centered         not centered     centered      not centered
Class       No.       ER+    No.       ER+    No.           No.
Luminal A   43 (5)    41     59 (1)    55     19 (6)        55 (1)
Luminal B   13 (2)    11     1 (1)     1      13 (3)        1 (0)
ERBB2       13 (2)    6      10 (0)    2      11 (1)        2 (0)
Basal       21 (0)    0      5 (0)     0      11 (5)        0 (0)
Normal      9 (0)     7      24 (2)    7      11 (1)        7 (0)
39
2. Effects of prevalence of subgroups in
(training and) testing set?
10 ER+/10 ER-

Test set          Predictive accuracy, ER+ / ER-
55 ER+/24 ER-     95 / 79
55 ER+/24 ER-     78 / 88
24 ER+/24 ER-     88 / 83
12 ER+/24 ER-     92 / 79
55 ER+/0 ER-      53 / ND
0 ER+/24 ER-      ND / 62
40
2b. What is the role played by prevalence of
subgroups in training and testing set?
ER status prediction on Sotiriou's data set
  • Multiple (100) random splits into training and
    testing sets
  • Method of centroids, 751 variance-filtered unique
    clones
  • Training and testing sets centered (C) or non
    centered (N)
  • Training set (n_tr = 20): proportion of ER+
    samples 1/2, i.e. 10 ER+/10 ER-
  • Testing set (n_test = 24): proportion of ER+
    samples from 0 to 1, i.e. 0 ER+/24 ER-, 1 ER+/23
    ER-, ..., 24 ER+/0 ER-

[Figure: % correctly classified in the ER+ class,
in the ER- class and overall, against the proportion
of ER+ samples in the testing set]
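The mechanism behind this slide can be illustrated with a toy example: when the genes of the testing set are mean-centered, the centered values depend on which samples are in the testing set. All numbers, names and centroids below are made up for illustration; they are not taken from Sotiriou's or Sørlie's data.

```python
from math import sqrt

def pearson(a, b):
    """Pearson (centered) correlation between two equal-length profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    da = [v - ma for v in a]
    db = [v - mb for v in b]
    num = sum(x * y for x, y in zip(da, db))
    return num / sqrt(sum(x * x for x in da) * sum(y * y for y in db))

def center_genes(profiles):
    """Subtract from each gene its mean over the samples in this data set."""
    gene_means = [sum(col) / len(col) for col in zip(*profiles)]
    return [[v - m for v, m in zip(p, gene_means)] for p in profiles]

def assign(profile, centroids):
    """Class whose centroid has the highest correlation with the profile."""
    return max(centroids, key=lambda cls: pearson(profile, centroids[cls]))

# Hypothetical centroids, as if derived from a centered, balanced training set.
centroids = {"ER+": [1.0, 1.0, -1.0, -1.0], "ER-": [-1.0, -1.0, 1.0, 1.0]}

# Hypothetical uncentered testing samples (4 genes each).
er_pos = [[7.2, 6.9, 5.1, 4.8], [6.8, 7.1, 4.9, 5.2],
          [7.1, 7.2, 5.0, 5.1], [6.9, 6.8, 5.0, 4.9]]
er_neg = [[5.1, 4.8, 7.2, 6.9], [4.9, 5.2, 6.8, 7.1],
          [5.0, 5.1, 7.1, 7.2], [5.0, 4.9, 6.9, 6.8]]

def er_pos_accuracy(test_set):
    """Fraction of the ER+ samples (listed first) assigned to ER+ after centering."""
    centered = center_genes(test_set)
    calls = [assign(p, centroids) for p in centered[:len(er_pos)]]
    return calls.count("ER+") / len(er_pos)

balanced = er_pos_accuracy(er_pos + er_neg)  # 4 ER+/4 ER- testing set
only_pos = er_pos_accuracy(er_pos)           # 4 ER+/0 ER- testing set
print(balanced, only_pos)  # → 1.0 0.5
```

In the ER+-only testing set, centering removes the ER signal shared by all samples, so the assignment is driven by the remaining sample-to-sample variation.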
41
3. (Possible) study effect on real data: Sotiriou,
van't Veer
Predicted class membership (in parentheses: number
of samples with correlation < .1)

van't Veer (Centered)
Class           True ER+   True ER-   Cor (min-max)
Predicted ER+   39 (1)     4 (2)      .42 (.03 to .62)
Predicted ER-   7 (4)      67 (4)     .26 (.01 to .55)

van't Veer (Non centered)
Class           True ER+   True ER-   Cor (min-max)
Predicted ER+   43 (43)    8 (7)      .02 (-.24 to .13)
Predicted ER-   3 (3)      63 (53)    -.03 (-.23 to .16)

  • The predictive accuracy is the same
  • Most of the samples in the non-centered analysis
    would not be classifiable using the threshold
42
Conclusions I
  • Musts for a clinically useful classifier
  • It classifies unambiguously a new sample,
    independently of any other samples being
    considered for classification at the same time
  • The clinical meaning of the subtype assignment
    (survival probability, probability of response to
    treatment) must be stable across populations to
    which the classifier might be applied
  • The technology used to assay the samples must be
    stable and reproducible: a sample assayed on
    different occasions must be assigned to the same
    subtype
  • BUT we showed that subgroup assignments of new
    samples can be substantially influenced by
  • Normalization of data
  • Appropriateness of gene-centering depends on the
    situation
  • Proportion of samples from each subtype in the
    test set
  • Presence of systematic differences across data
    sets
  • Use of arbitrary rules for identifying
    non-classifiable samples
  • Most of our conclusions also apply to other
    classification methods

43
Conclusions II
  • Most of the studies claiming to have validated
    the subtypes have focused only on comparing
    clinical outcome differences
  • Shows consistency of results between studies
  • BUT does not provide a direct measure of the
    robustness of the classification, which is
    essential before using the subtypes in clinical
    practice
  • Careful thought must be given to comparability of
    patient populations and datasets
  • Many difficulties remain in validating and
    extending class discovery results to new samples
    and a robust classification rule remains elusive
  • The subtyping of breast cancer seems promising
  • BUT
  • a standardized definition of the subtypes based
    on a robust measurement method is needed

44
Some useful resources and readings
  • Books
  • Simon et al. Design and Analysis of DNA
    Microarray Investigations Ch.8
  • Speed (Ed.) Statistical Analysis of Gene
    Expression Microarray Data Ch.3
  • Bishop- Pattern Recognition and Machine Learning
  • Hastie, Tibshirani and Friedman The Elements of
    Statistical Learning
  • Duda, Hart and Stork Pattern Classification
  • Software for data analysis
  • R and Bioconductor (www.r-project.org,
    www.bioconductor.org)
  • BRB Array Tools (http://linus.nci.nih.gov)
  • Web sites
  • BRB/NCI web site (NIH)
  • Tibshirani's web site (Stanford)
  • Terry Speed's web site (Berkeley)