Incidence of Missing Values in Hierarchical Clustering of Microarrays Data
Alexandre G. de Brevern1, Serge Hazout1 and Alain Malpertuy2
1 Equipe de Bioinformatique Génomique et Moléculaire (EBGM), INSERM E0346, Université PARIS VII, 2 place Jussieu, case 7113, 75251 Paris, France - debrevern,hazout_at_urbb.jussieu.fr
2 Atragene Bioinformatics, 4 Rue Pierre Fontaine, 91000 Evry, France - alain.malpertuy_at_atragene.com
- 1 - Introduction
- Microarray technologies make it possible to determine
which genes are expressed in various cell types, at
various times and under various experimental conditions.
These experiments are performed on a large scale and
produce large amounts of data. Clustering methods such as
Hierarchical Clustering (HC) [1] or Self Organizing Maps
(SOM) [2] are frequently used to analyse microarray data
and lead to the identification of co-expressed genes.
However, microarray datasets often contain missing values
(MV), which arise during the multistep experimental
process (spotting, hybridization, image scanning). These
missing data are a major drawback for the use of HC,
K-means [3] and SOM, and they prevent the use of Principal
Component Analysis [4].
- Since the presence of MV may significantly modify
the clustering results, users usually either remove
the genes with MV or replace the MV by a constant
(zero, or the average expression level of the
entire experiment). Recently, Troyanskaya et al.
showed that MV can be estimated with the weighted
K-Nearest Neighbours method (KNN). The KNN approach
computes the estimated value from the k closest
expression profiles in the dataset. The authors
considered KNN, with k = 15, to be the most accurate
method for estimating MV in microarray data [5].
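The weighted KNN idea described above can be sketched as follows. This is a minimal illustration, not the implementation of Troyanskaya et al.: the function name `knn_impute`, the use of `None` to mark an MV, the root-mean-square Euclidean distance over shared conditions and the inverse-distance weights are all assumptions made for the example.

```python
import math

def knn_impute(profiles, k=15):
    """Fill None entries with a distance-weighted mean of the k nearest profiles.

    profiles: list of gene expression profiles (lists of floats or None).
    Assumption: neighbours are ranked by Euclidean distance computed on the
    conditions observed in both profiles, and weighted by 1 / distance.
    """
    filled = [row[:] for row in profiles]
    for i, row in enumerate(profiles):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(profiles):
                # A neighbour must be another gene with a value in column j.
                if m == i or other[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(row, other)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = math.sqrt(sum((a - b) ** 2 for a, b in shared) / len(shared))
                candidates.append((dist, other[j]))
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if not nearest:
                continue  # no usable neighbour: the value stays missing
            # Small constant avoids division by zero for identical profiles.
            weights = [1.0 / (d + 1e-9) for d, _ in nearest]
            total = sum(w * v for w, (_, v) in zip(weights, nearest))
            filled[i][j] = total / sum(weights)
    return filled
```

With a small k the estimate follows the closest profile almost exactly; with a larger k it is smoothed over more genes, which is the trade-off that the choice of k controls.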
- To evaluate the influence of MV on clustering
results, we analysed the distribution of missing
values in different public datasets of
Saccharomyces cerevisiae and studied the effects
of MV on HC. Finally, we compared the KNN
replacement method with the most frequently used
approach, replacement by zero.
- 2 - Materials and Methods
- We used the following microarray datasets from
Saccharomyces cerevisiae: Ogawa et al. [6],
Spellman et al. [7], Ferrea et al. [8] and Gasch
et al. [9]. To study the incidence of MV in
microarray experiments, the genes containing MV
were removed from the original datasets and
different subsets were selected from the resulting
set. For example, the data published by Ogawa et al.
gave a set of 5783 genes, from which six subsets of
827, 968, 1159, 1448, 1929 and 2892 genes were
drawn. For each subset, we randomly introduced MV
at a rate e (1.0% to 50.0%). Hierarchical
Clustering was then performed with seven
agglomeration methods, hereafter called metrics
(average, centroid, complete, mcquitty, median,
single and ward, using hclust from the R software),
and compared with the HC results obtained on the
corresponding original subsets containing no MV.
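The random introduction of MV can be sketched as below. This is an illustration under stated assumptions, not the authors' code: the function name `introduce_missing` is hypothetical, the rate e is read as a percentage of all matrix cells, and MV are marked with `None`.

```python
import random

def introduce_missing(matrix, rate_percent, seed=None):
    """Return a copy of matrix with rate_percent % of its cells set to None.

    Assumption: cells are drawn uniformly at random, without replacement,
    over the whole genes x conditions matrix.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    n_missing = round(n_rows * n_cols * rate_percent / 100.0)
    cells = [(i, j) for i in range(n_rows) for j in range(n_cols)]
    masked = [row[:] for row in matrix]  # the complete matrix is left intact
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked
```

Calling the function repeatedly with different seeds gives independent simulated datasets for one value of e. Note that at high rates a gene may lose most or all of its values, which a clustering step has to handle explicitly.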
- For the 5783-gene set and the subsets containing no
MV, we defined C clusters from the HC trees
(figure 1). We then compared these C clusters with
those of the corresponding trees computed with MV
(100 distinct simulations were run for each value
of e). The number of clusters C depends on the
metric used: C was chosen so as to give at least
ten clusters representing 80% of all the genes.
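One simple way to quantify how well the reference clusters are preserved is sketched below. The poster does not spell out its consistency measure, so this pair-based index and the name `conserved_pair_fraction` are assumptions chosen for illustration: the function returns the fraction of gene pairs grouped together in the reference partition (no MV) that are still grouped together in the partition obtained with MV.

```python
from itertools import combinations

def conserved_pair_fraction(reference_labels, test_labels):
    """Fraction of co-clustered gene pairs of the reference kept in the test.

    Both arguments give one cluster label per gene, in the same gene order.
    Only label equality matters, so cluster numbering may differ between
    the two partitions.
    """
    together = 0
    conserved = 0
    for a, b in combinations(range(len(reference_labels)), 2):
        if reference_labels[a] == reference_labels[b]:
            together += 1
            if test_labels[a] == test_labels[b]:
                conserved += 1
    # A reference made only of singletons has no pair to lose.
    return conserved / together if together else 1.0
```

Averaging this value over the simulations run for a given e gives one stability figure per rate and per metric. The pairwise loop is quadratic in the number of genes, which is acceptable for a sketch but slow for the full 5783-gene set.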
- In a second step, we replaced the MV using either
the KNN method or the zero method and performed HC
on the resulting sets.
- Evaluation of the KNN replacement method
- Table 2: Optimal k values obtained from the Ogawa sets.
- We observed that the optimal k values were lower
in small subsets than in sets with a higher number
of genes. The error rate decreased as the number of
genes increased.
- Figure 3: Comparison of the distributions of real and KNN-predicted values.