Title: A Comparative Study of Classification Methods for Microarray Data Analysis
1. A Comparative Study of Classification Methods for
Microarray Data Analysis
- Hong Hu, Jiuyong Li, Ashley Plank, Hua Wang, and Grant Daggard
- Department of Mathematics and Computing
- University of Southern Queensland, Australia
2. Microarray Data Classification
- The task of classification is to build a model
(classifier) from categorized historical Microarray data
(training data), and then use the model to categorize
future incoming data (test data) automatically.
- It involves two stages: learning and classification.
[Diagram. Learning: training data -> algorithm -> model. Classification: test data -> model -> prediction.]
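The two stages on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the nearest-centroid rule is a simple stand-in and is not one of the algorithms compared in this talk, the function names `learn` and `classify` are hypothetical, and the two-gene toy samples are made up.

```python
from collections import defaultdict

def learn(training_data):
    """Learning stage: build a model from labelled training samples."""
    grouped = defaultdict(list)
    for features, label in training_data:
        grouped[label].append(features)
    # The model maps each class label to the mean expression profile
    # (centroid) of the training samples that carry that label.
    return {
        label: [sum(column) / len(rows) for column in zip(*rows)]
        for label, rows in grouped.items()
    }

def classify(model, features):
    """Classification stage: predict the class with the nearest centroid."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(model, key=lambda label: squared_distance(model[label]))

# Hypothetical toy data: two expression values per sample.
training_data = [
    ([5.0, 1.0], "tumor"),
    ([6.0, 1.5], "tumor"),
    ([1.0, 4.0], "normal"),
    ([1.5, 5.0], "normal"),
]
model = learn(training_data)              # learning
prediction = classify(model, [5.5, 1.2])  # classification of a test sample
```

Any of the compared methods fits the same two-step shape: a learning function that returns a model, and a classification function that applies it to unseen samples.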
3. Gene Expression Microarray Data
4. Motivations
- We are very interested in classifying Microarray data
using single-tree classifiers and ensemble tree methods.
- Reading through the literature on Microarray data
classification, it is difficult to find consensus
conclusions on their relative performance.
5. Ensemble method
- An ensemble method combines multiple classifiers
(models) built on a set of re-sampled training
datasets, or generated from various classification
methods on a training dataset. This set of
classifiers forms a decision committee, which
classifies future incoming samples.
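As a rough illustration of the re-sampling variant described above, the sketch below builds a small committee by bagging and lets it vote. It is an assumption-laden toy: a one-feature threshold rule ("stump") stands in for C4.5, the helper names are hypothetical, and the data are made up.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(samples):
    """Fit a one-feature threshold rule, a simple stand-in for a decision tree."""
    best = None
    for feature in range(len(samples[0][0])):
        for candidate, _ in samples:
            threshold = candidate[feature]
            left = [label for x, label in samples if x[feature] <= threshold]
            right = [label for x, label in samples if x[feature] > threshold]
            left_label = majority(left)  # never empty: the candidate is on the left
            right_label = majority(right) if right else left_label
            errors = sum(
                (left_label if x[feature] <= threshold else right_label) != label
                for x, label in samples
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_label, right_label)
    _, best_feature, best_threshold, low, high = best
    return lambda x: low if x[best_feature] <= best_threshold else high

def build_committee(samples, n_models, rng):
    """Train each member on a bootstrap re-sample of the training data."""
    committee = []
    for _ in range(n_models):
        resample = [rng.choice(samples) for _ in samples]
        committee.append(train_stump(resample))
    return committee

def committee_vote(committee, x):
    """The decision committee classifies a sample by majority vote."""
    return majority(member(x) for member in committee)

# Hypothetical toy data: the first value separates the classes, the second is noise.
training_data = [
    ([1.0, 0.3], "normal"), ([2.0, 0.9], "normal"),
    ([3.0, 0.1], "normal"), ([4.0, 0.7], "normal"),
    ([5.0, 0.5], "tumor"), ([6.0, 0.2], "tumor"),
    ([7.0, 0.8], "tumor"), ([8.0, 0.4], "tumor"),
]
committee = build_committee(training_data, n_models=11, rng=random.Random(0))
prediction = committee_vote(committee, [9.0, 0.5])
```

An odd committee size avoids tied votes in a two-class problem. Boosting and random forests differ in how the members are generated, but they combine them through the same kind of committee decision.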
6. Algorithms selected for comparison
- SVMs (Support Vector Machines)
- C4.5 (decision tree)
- BaggingC4.5
- AdaBoostingC4.5
- Random Forest
7. Experimental design methodology
- Test data sets
- Seven data sets from the Kent Ridge Biological Data
Set Repository are selected for our experiments.
They were collected from very well researched
journal papers.
- Breast cancer
- Lung cancer
- Lymphoma
- ALL-AML Leukemia
- Colon
- Ovarian
- Prostate
- Software used for comparison
- Weka-3-5-2 package
8. Experimental design methodology (cont.)
- Two sets of experiments on Microarray data, with and
without pre-processing
- Ten-fold cross-validation
- Sign test and Wilcoxon signed rank test
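Ten-fold cross-validation can be sketched as follows. This is a plain, unstratified split written for illustration only; the slides do not say how folds were formed, and the function names and the midpoint-threshold learner used in the usage example are hypothetical.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

def cross_validated_accuracy(samples, learn, classify, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, and pool the results."""
    correct = 0
    for train_idx, test_idx in k_fold_indices(len(samples), k, seed):
        model = learn([samples[i] for i in train_idx])
        correct += sum(
            classify(model, samples[i][0]) == samples[i][1] for i in test_idx
        )
    return correct / len(samples)

# Hypothetical learner: threshold halfway between the two class means.
def learn_midpoint(train):
    a = [x[0] for x, label in train if label == "a"]
    b = [x[0] for x, label in train if label == "b"]
    return (sum(a) / len(a) + sum(b) / len(b)) / 2

def classify_midpoint(model, x):
    return "b" if x[0] > model else "a"

# Made-up, well separated one-dimensional data: 10 samples per class.
samples = [([float(i)], "a") for i in range(10)]
samples += [([100.0 + i], "b") for i in range(10)]
accuracy = cross_validated_accuracy(samples, learn_midpoint, classify_midpoint)
```

Because every sample is held out exactly once, the pooled accuracy uses all the data for testing without ever testing a model on a sample it was trained on.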
9. Experimental design methodology (cont.)
- Sign test
- The sign test is used to test whether one random
variable in a pair tends to be larger than the
other. Given n pairs of observations, each pair is
assigned a plus, a tie, or a minus. A plus means
that the first value is greater than the second, a
minus means that it is less, and a tie means that
the two values are equal. The null hypothesis is
that pluses and minuses are equally likely. If the
null hypothesis is rejected, then one random
variable tends to be greater than the other.
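The sign test described above can be computed exactly from the binomial distribution. The sketch below is a minimal two-sided version that drops ties; the function name `sign_test` and the accuracy figures are hypothetical and are not results from this study.

```python
from math import comb

def sign_test(first, second):
    """Two-sided exact sign test on paired observations; ties are dropped."""
    plus = sum(a > b for a, b in zip(first, second))
    minus = sum(a < b for a, b in zip(first, second))
    n = plus + minus
    if n == 0:
        return plus, minus, 1.0
    k = min(plus, minus)
    # Under the null hypothesis each non-tied pair is a plus with probability 1/2,
    # so the number of pluses follows a Binomial(n, 1/2) distribution.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return plus, minus, min(1.0, 2 * tail)

# Hypothetical accuracies (in percent) of two methods on seven data sets.
method_a = [90, 85, 78, 92, 88, 75, 81]
method_b = [86, 84, 80, 87, 84, 76, 80]
plus, minus, p_value = sign_test(method_a, method_b)
```

With seven data sets, the two-sided p-value drops below 0.05 only when all seven non-tied pairs have the same sign (p = 2/128), which shows how little power the test has on so few pairs.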
10. Experimental design methodology (cont.)
- Wilcoxon signed rank test
- The sign test only uses whether a value is greater
than, less than, or equal to the other in a pair.
The Wilcoxon signed rank test also uses the size of
the paired differences. Pairs with a difference of
zero are discarded, and the absolute differences
are ranked in ascending order. When several pairs
have equal absolute differences, each of them is
assigned the average of the ranks they would
otherwise have received. The null hypothesis is
that the differences are distributed symmetrically
around 0, so their mean is 0.
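The ranking procedure above can be sketched as follows. This is an illustrative version that obtains an exact two-sided p-value by enumerating all sign assignments, which is only practical for a small number of pairs; the function names and the integer accuracy figures are hypothetical, and integers are used so that tied absolute differences compare exactly.

```python
from itertools import product

def signed_ranks(first, second):
    """Drop zero differences, then rank absolute differences with average ranks for ties."""
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(ordered):
        end = start
        value = abs(diffs[ordered[start]])
        while end + 1 < len(ordered) and abs(diffs[ordered[end + 1]]) == value:
            end += 1
        average = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[ordered[position]] = average
        start = end + 1
    return diffs, ranks

def wilcoxon_signed_rank(first, second):
    """Return (W+, W-, exact two-sided p-value) for paired observations."""
    diffs, ranks = signed_ranks(first, second)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    if not ranks:
        return w_plus, w_minus, 1.0
    observed = min(w_plus, w_minus)
    total = sum(ranks)
    # Under the null hypothesis every rank is equally likely to be plus or minus.
    hits = 0
    for signs in product((0, 1), repeat=len(ranks)):
        plus_sum = sum(r for s, r in zip(signs, ranks) if s)
        if min(plus_sum, total - plus_sum) <= observed:
            hits += 1
    return w_plus, w_minus, hits / 2 ** len(ranks)

# Hypothetical accuracies (in percent) of two methods on seven data sets.
method_a = [90, 85, 78, 92, 88, 75, 81]
method_b = [86, 85, 80, 87, 84, 73, 80]
w_plus, w_minus, p_value = wilcoxon_signed_rank(method_a, method_b)
```

For these made-up figures the one zero difference is discarded, the remaining ranks are 1, 2.5, 2.5, 4.5, 4.5 and 6, and the test gives W+ = 18.5, W- = 2.5 and p = 0.125.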
11. Experimental results based on preprocessed data
Table 1: Average accuracy on seven preprocessed
data sets
- With preprocessed data sets, all ensemble methods
on average perform better than C4.5 and LibSVMs.
- C4.5 and LibSVMs perform similarly to each other.
12. The results of sign test
Table 2: Summary of sign test at 95% confidence
between compared algorithms
13. The results of sign test (cont.)
- The limitation of the sign test
- The sign test records the direction of a difference
but not its magnitude. A difference of 0.01 and a
difference of 10.0 are therefore treated the same,
since only a plus or a minus is recorded.
14. The results of Wilcoxon signed rank test
Table 3: Summary of Wilcoxon signed rank test at 95%
confidence between compared algorithms
15. Experimental results based on original data sets
Table 4: Average accuracy on seven original data
sets. The last row shows, for every compared
classification method, the difference between its
average accuracy on preprocessed data and on
original data.
16. Sign test and Wilcoxon signed rank test
Table 5: Summary of sign test at 95% confidence
between the differences of compared algorithms
Table 6: Summary of Wilcoxon signed rank test at
95% confidence between the differences of
compared algorithms
17. Conclusion
- All ensemble methods are significantly more
accurate than C4.5.
- Data pre-processing significantly improves the
accuracy of all five compared methods.
- There is not sufficient evidence of a performance
difference between SVMs and an ensemble method,
although the average accuracy of SVMs is much
lower than that of an ensemble method.
- The Wilcoxon signed rank test is better than the
sign test for the evaluation of Microarray
classification.
18. Questions?
- Hong Hu
- PhD student
- Email: huhong_at_usq.edu.au
- Department of Mathematics and Computing, Faculty of
Sciences, USQ
19. Thank you