Title: Missing values: impact on classification
1Missing values impact on classification
Influence of microarrays experiments missing
values on the stability of gene groups by
hierarchical clustering
Alexandre G. de Brevern
Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM U726 / Université Paris VII, 75251 Paris Cedex 05, France
April 2005
2Missing values impact on classification
Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering
Alexandre G. de Brevern (1), Serge Hazout (1) and Alain Malpertuy (2)
(1) EBGM, (2) Atragene Bioinformatics
BMC Bioinformatics (2004) Aug 23; 5(1):114.
3Missing values impact on classification
Missing values in microarray experiments
available data (www)
4Missing values impact on classification
The question.
Hierarchical clustering
Microarray results
Microarray results with missing values (MVs)
Analysis
Consequence?
5Missing values impact on classification
1. Classical approach for MVs
2. kNN method: principle and analysis
3. Principle of the evaluation
4. Results
5. Last and future works
6Missing values impact on classification
1. Classic approach
Hierarchical clustering
Fill-in
kNN
Microarray results with MVs
Microarray results without MVs
Analysis
7Missing values impact on classification
2. kNN k-Nearest Neighbors
Two kNN variants:
Simple -> mean of the values of the k neighbours
Weighted -> mean weighted by the distance
The question: which k?
Troyanskaya OG, Cantor M, Sherlock G, Brown PO, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays. Bioinformatics 2001, 17:520-525.
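The two kNN variants named on this slide can be sketched as follows. This is a minimal illustration, not the implementation used in the study. The function name `knn_impute`, the use of `None` for a missing value, the Euclidean distance on observed conditions, the restriction of neighbours to genes without MVs, and the weight `1 / (distance + eps)` are all assumptions made for the example.

```python
import math

def knn_impute(matrix, k, weighted=False, eps=1e-6):
    """Return a copy of `matrix` (genes x conditions) with None entries filled.

    Assumption: neighbours are searched only among genes without any MV.
    """
    complete = [row for row in matrix if None not in row]
    filled = []
    for row in matrix:
        if None not in row:
            filled.append(list(row))
            continue
        observed = [j for j, v in enumerate(row) if v is not None]
        # Euclidean distance restricted to the conditions observed for this gene.
        scored = sorted(
            ((math.sqrt(sum((row[j] - other[j]) ** 2 for j in observed)), other)
             for other in complete),
            key=lambda pair: pair[0],
        )[:k]
        new_row = list(row)
        for j, v in enumerate(row):
            if v is not None:
                continue
            if weighted:
                # Closer neighbours count more: weight = 1 / (distance + eps).
                weights = [1.0 / (d + eps) for d, _ in scored]
                new_row[j] = sum(w * n[j] for w, (_, n) in zip(weights, scored)) / sum(weights)
            else:
                # Simple variant: plain mean over the k neighbours.
                new_row[j] = sum(n[j] for _, n in scored) / len(scored)
        filled.append(new_row)
    return filled

genes = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
    [9.0, 9.0, 9.0],
    [1.0, 2.0, None],  # gene with one MV
]
simple = knn_impute(genes, k=2)[3][2]
pondered = knn_impute(genes, k=2, weighted=True)[3][2]
print(simple, pondered)
```

In this toy case the simple variant averages the two nearest genes, while the weighted variant is pulled almost entirely towards the gene at distance zero. This shows why the choice of k and of the weighting both matter.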
8Missing values impact on classification
2. kNN k-Nearest Neighbors
Sum 5
2
9Missing values impact on classification
2. kNN k-Nearest Neighbors
10Missing values impact on classification
2. kNN k-Nearest Neighbors
11Missing values impact on classification
2. kNN k-Nearest Neighbors
The data sets were chosen because they contain few MVs and because, after filtering, the number of genes remains large (ca. 6000). The original Ogawa set (OS) contained 6013 genes, 230 of which had MVs. Eliminating the genes with MVs (i.e. 3.8% of the genes) leads to a set of 5783 genes. For the Gasch set (GS), the number of MVs is larger and some experimental conditions have more than 50% of MVs. We therefore reduced the number of selected experimental conditions from 178 to 42 (see section Methods), which keeps 5843 genes; only 310 genes are not analyzed, representing 5.0% of all the genes.
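The filtering step and the percentages quoted above can be checked with a few lines. This is only a sketch: the helper names `drop_genes_with_mvs` and `percent` and the toy matrix are hypothetical, and `None` stands in for a missing value.

```python
def drop_genes_with_mvs(matrix):
    """Keep only the genes (rows) that have no missing value (None)."""
    return [row for row in matrix if None not in row]

def percent(part, whole):
    """Percentage of `part` in `whole`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)

# Toy matrix: 3 genes x 2 conditions, the second gene has one MV.
toy = [[0.1, 0.2], [None, 0.4], [0.5, 0.6]]
filtered = drop_genes_with_mvs(toy)
print(len(filtered))             # genes kept in the toy example

# Figures quoted on the slide.
print(6013 - 230)                # OS genes left after filtering
print(percent(230, 6013))        # share of OS genes removed
print(percent(310, 5843 + 310))  # share of GS genes not analyzed
```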
12Missing values impact on classification
2. kNN k-Nearest Neighbors
The data (2): We analyzed different subsets corresponding to 1/7, 1/6, 1/5, 1/4, 1/3 and 1/2 of the complete sets (GS and OS). Moreover, we defined two smaller sets, GSH2O2 and GSHEAT, from GS, corresponding respectively to the H2O2 and heat shock experimental conditions.
13Missing values impact on classification
2. kNN k-Nearest Neighbors
The results: evaluation of the optimal k
14Missing values impact on classification
2. kNN k-Nearest Neighbors
The results: evaluation of the optimal k. Sometimes not so close to 15.
15Missing values impact on classification
2. kNN k-Nearest Neighbors
The results: difference with the real values.
16Missing values impact on classification
2. kNN k-Nearest Neighbors
The results: the extreme values
Values > 2.5
Values < 0.5
17Missing values impact on classification
3. Principle of the method.
18Missing values impact on classification
3. Principle of the method.
19Missing values impact on classification
3. Principle of the method.
Microarray data without MVs
20Missing values impact on classification
3. Principle of the method.
21Missing values impact on classification
3. Principle of the method.
22Missing values impact on classification
3. Principle of the method.
[Equation not recovered.] It involves the Kronecker symbol, which is equal to 1 when the genes i and i' in the two gene lists are identical, and 0 otherwise. G denotes the total number of genes. The index takes its maximal value, 1, when the clusterings RC and GR are identical.
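The formula itself did not survive on this slide, so the following is only one reading consistent with the description above. For each cluster of one clustering, it counts the genes shared with its best-matching cluster in the other clustering (the role played by the Kronecker symbol), then divides the total by G. The function name `conserved_proportion` and the representation of clusters as lists of gene identifiers are assumptions.

```python
def conserved_proportion(reference, generated):
    """Share of genes conserved between two clusterings (assumed definition).

    `reference` and `generated` are lists of clusters, each a list of gene ids.
    """
    total_genes = sum(len(cluster) for cluster in reference)  # G
    generated_sets = [set(cluster) for cluster in generated]
    conserved = 0
    for cluster in reference:
        ref_set = set(cluster)
        # Size of the overlap with the best-matching generated cluster.
        conserved += max(len(ref_set & other) for other in generated_sets)
    return conserved / total_genes

rc = [["g1", "g2", "g3"], ["g4", "g5"]]
same = conserved_proportion(rc, rc)
moved = conserved_proportion(rc, [["g1", "g2"], ["g3", "g4", "g5"]])
print(same, moved)
```

With identical clusterings the value is 1, as stated above. Moving a single gene to another cluster lowers it.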
23Missing values impact on classification
3. Principle of the method.
Hierarchical clustering: highly distinct topologies.
Different aggregation methods can be used to construct the dendrogram, generally leading to different tree topologies and a fortiori to different cluster definitions.
The single-linkage algorithm is based on joining the two closest objects (i.e. genes) of two clusters to create a new cluster. Single-linkage clusters therefore contain numerous members and are branched in high-dimensional space. The resulting clusters are affected by the chaining phenomenon (i.e. the observations are added to the tail of the biggest cluster).
In the complete-linkage algorithm, the distance between clusters is defined as the distance between the most distant pair of objects (i.e. genes). This method gives compact clusters.
The average-linkage algorithm is based on the mean similarity of the observations to all the members of the cluster.
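The three aggregation rules described above can be compared on a toy example. This is a naive sketch for illustration, not the software used in the study. The function names, the Euclidean distance and the one-dimensional toy profiles are assumptions, and average linkage is taken here as the mean pairwise distance between clusters.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2, points, linkage):
    """Distance between two clusters (lists of point indices)."""
    dists = [euclidean(points[i], points[j]) for i in c1 for j in c2]
    if linkage == "single":      # closest pair of objects
        return min(dists)
    if linkage == "complete":    # most distant pair of objects
        return max(dists)
    if linkage == "average":     # mean over all pairs
        return sum(dists) / len(dists)
    raise ValueError("unknown linkage: " + linkage)

def agglomerate(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until `n_clusters` remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: cluster_distance(clusters[p[0]], clusters[p[1]], points, linkage),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected
    return sorted(sorted(c) for c in clusters)

# Five "genes" with slowly growing gaps between neighbours.
profiles = [[0.0], [1.0], [2.1], [3.3], [5.0]]
single = agglomerate(profiles, 2, "single")
complete = agglomerate(profiles, 2, "complete")
average = agglomerate(profiles, 2, "average")
print(single, complete, average)
```

On these profiles, single linkage keeps adding points to one growing cluster (chaining) and leaves the last point alone. Complete linkage produces two more compact groups from the same data.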
24Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
single
25Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
centroid
26Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
average
27Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
median
28Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
McQuitty
29Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
Ward
30Missing values impact on classification
3. Principle of the method.
Hierarchical clustering Highly distinct
topologies. Examples.
complete
31Missing values impact on classification
4. Results.
Visualization of 1% of missing values, complete-linkage algorithm.
OS 1/6 827 genes MVs rate 1 MV per
gene i.e., only 8 values !!!!
OS 1/6 827 genes MVs rate 1 MV per gene
32Missing values impact on classification
4. Results.
Hierarchical clustering Highly distinct
topologies. Results.
33Missing values impact on classification
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPP (classic)
34Missing values impact on classification
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPPf (classic), with f = 5. CPPf makes it possible to find the genes associated with close clusters, i.e. here at most 5.
35Missing values impact on classification
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPP (extreme
values)
36Missing values impact on classification
4. Results.
Hierarchical clustering Highly distinct
topologies. Results. CPPf (extreme values), with f = 5.
37Missing values impact on classification
4. Results.
Conclusion:
It is not good to have missing values. However, replacing the MVs is a necessity.
It is better to use kNN than zero replacement, and zero replacement than nothing (kNN > zero > nothing).
The choice of k is critical (and not trivial).
The kind of clustering algorithm is also important.
Future
38Missing values impact on classification
5. Last and future works.
New methods: new approaches that perform better than kNN are now available. A recent work proposes Bayesian principal component analysis to deal with MVs (Oba et al., 2003). In the same way, Zhou and co-workers (Zhou et al., 2003) have used Bayesian gene selection to estimate the MVs with linear and non-linear regression.
Oba S, Sato M-A, Takemasa I, Monden M, Matsubara K-I, Ishii S: A Bayesian missing value estimation method for gene expression profile data. Bioinformatics 2003, 19:2088-2096. Java bytecode available.
Zhou X, Wang X, Dougherty ER: Missing-value estimation using linear and non-linear regression with Bayesian gene selection. Bioinformatics 2003, 19:2302-2307. Not available.
39Missing values impact on classification
5. Last and future works.
BPCA
Distribution of true, BPCA-predicted and kNN-predicted values for OS (1/7) with t = 6.125 (1 MV per gene). The error rate equals 0.000679946 for BPCA and 0.001583531 for kNN (k = 13).
40Missing values impact on classification
5. Last and future works.
BPCA
41Missing values impact on classification
5. Last and future works.
Evaluation of LSimpute (Bo et al., NAR 2004, Java), LLSimpute (Kim, Golub & Park, Bioinformatics 2005, Matlab) and CMVE (Sehgal, Gondal & Dooley, Bioinformatics 2005).
Evaluation of their interest for:
the different HC algorithms
extreme values
Self-Organizing Maps
K-means
If I've got time
Thank you for your attention.
42EBGM INSERM U726: A.G. de Brevern, PhD (INSERM); S. Hazout, Pr. (Univ. P7). http://www.ebgm.jussieu.fr/
Atragene Bioinformatics: A. Malpertuy, Founder, CEO, Head of R&D. http://www.atragene.com/