Title: A Study on Feature Selection for Toxicity Prediction
1. A Study on Feature Selection for Toxicity Prediction
- Gongde Guo (1), Daniel Neagu (1) and Mark Cronin (2)
- (1) Department of Computing, University of Bradford
- (2) School of Pharmacy and Chemistry, Liverpool John Moores University
- EPSRC Project PYTHIA: Predictive Toxicology Knowledge Representation and Processing Tool Based on a Hybrid Intelligent Systems Approach, Grant Reference GR/T02508/01
2. Outline of Presentation
- Predictive Toxicology
- Feature Selection Methods
- Relief Family: Relief, ReliefF
- kNNMFS Feature Selection
- Evaluation Criteria
- Toxicity Dataset: Phenols
- Evaluation I: Toxicity
- Evaluation II: Mechanism of Action
- Conclusions
3. Predictive Toxicology
- The goal of predictive toxicology is to describe the relations between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationships, SAR), and to use these relations to predict the behaviour of new, unknown chemical compounds.
- Predictive toxicology data mining comprises the steps of data preparation, data reduction (including feature selection), data modelling, prediction (classification, regression), evaluation of results, and further knowledge discovery tasks; a minimal sketch of such a pipeline follows.
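A minimal sketch of this pipeline, assuming scikit-learn and synthetic data (the study itself used Weka-style evaluators; all names below are illustrative):

    # Illustrative pipeline: preparation -> feature selection -> modelling
    # -> evaluation. Data are synthetic; scikit-learn is a stand-in for
    # the tools actually used in the study.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(250, 173))                      # 250 compounds x 173 descriptors
    y = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=250)  # synthetic "toxicity" endpoint

    pipe = Pipeline([
        ("prepare", StandardScaler()),                # data preparation
        ("reduce", SelectKBest(f_regression, k=20)),  # feature selection
        ("model", LinearRegression()),                # modelling / prediction
    ])
    scores = cross_val_score(pipe, X, y, cv=10, scoring="r2")  # evaluation
    print(f"10-fold CV R^2: {scores.mean():.3f}")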
4. Feature Selection Methods
- Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible.
- Seven feature selection methods (Witten and Frank, 2000) are involved in our study:
- GR: Gain Ratio feature evaluator
- IG: Information Gain ranking filter
- Chi: Chi-squared ranking filter
- ReliefF: ReliefF feature selection
- SVM: SVM feature evaluator
- CS: Consistency Subset evaluator
- CFS: Correlation-based Feature Selection
- In this work we focus on the drawbacks of the ReliefF feature selection method and propose the kNNMFS feature selection method; a sketch of a ranking filter follows.
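A hedged sketch of how a ranking filter orders descriptors, using scikit-learn's mutual information (an information-gain analogue) as a stand-in for the Weka evaluators named above; data and labels are synthetic:

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(1)
    X = rng.normal(size=(250, 173))      # synthetic descriptor matrix
    y = rng.integers(0, 4, size=250)     # synthetic mechanism-of-action labels

    scores = mutual_info_classif(X, y, random_state=1)  # score every feature
    top20 = np.argsort(scores)[::-1][:20]                # keep the 20 best
    print("Top-20 descriptor indices:", top20)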
5. Relief Feature Selection Method
- The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same and the opposite class. The feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature.
- Open issues: only K = 1 neighbour is used, so Relief is sensitive to noise; how should M, and the individual M sampled instances, be chosen?
6. Relief Feature Selection Method
- Algorithm Relief
- Input: for each training instance, a vector of attribute values and the class value
- Output: the vector W of estimations of the qualities of attributes
- Set all weights W[A_i] := 0.0, i = 1, 2, ..., p
- for j := 1 to m do begin
-   randomly select an instance X_j
-   find its nearest hit H_j and nearest miss M_j
-   for k := 1 to p do begin
-     W[A_k] := W[A_k] - diff(A_k, X_j, H_j)/m + diff(A_k, X_j, M_j)/m
-   end
- end
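A minimal runnable Python translation of this pseudocode, assuming two classes and numeric attributes scaled to [0, 1] (our sketch, not the original implementation):

    import numpy as np

    def relief(X, y, m, seed=None):
        # X: (n, p) attribute matrix; y: binary class labels; m: sample count.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        W = np.zeros(p)                        # set all weights W[A_i] := 0.0
        for _ in range(m):
            j = rng.integers(n)                # randomly select an instance X_j
            d = np.abs(X - X[j]).sum(axis=1)   # distance from X_j to every instance
            d[j] = np.inf                      # exclude X_j itself
            same = (y == y[j])
            hit = np.argmin(np.where(same, d, np.inf))    # nearest hit H_j
            miss = np.argmin(np.where(~same, d, np.inf))  # nearest miss M_j
            # diff(A_k, X_j, .) = |X_j[k] - .[k]| for numeric attributes
            W += (np.abs(X[j] - X[miss]) - np.abs(X[j] - X[hit])) / m
        return W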
7. ReliefF Feature Selection Method
- ReliefF extends Relief by averaging over the K nearest hits and misses, which reduces its sensitivity to noise. Open issues remain: how should K, M, and the M sampled instances be chosen?
8. ReliefF Feature Selection Method
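The formula on this slide did not survive extraction; for reference, the standard ReliefF weight update (Kononenko, 1994), which averages over the k nearest hits H_j and the k nearest misses M_j(C) of each sampled instance R_i and weights misses by class priors, is likely what was shown:

    W[A] := W[A] - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, H_j)}{m \cdot k}
                 + \sum_{C \neq \mathrm{class}(R_i)} \frac{P(C)}{1 - P(\mathrm{class}(R_i))}
                   \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, M_j(C))}{m \cdot k}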
9. kNN Model-based Classification Method (Guo et al., 2003)
The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. kNNModel generates a set of optimal representatives by inductive learning from the dataset.
10. An Example of kNNModel
Each representative d_i is represented as a tuple <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)>, whose components are, respectively: the class label of d_i; the similarity of d_i to the furthest instance among the instances covered by its neighbourhood N_i; the number of instances covered by N_i; and a representation of instance d_i. A sketch of this structure follows.
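A sketch of this tuple as a data structure (the field names and the example values are ours, for illustration only):

    from dataclasses import dataclass

    @dataclass
    class Representative:
        cls: str      # Cls(d_i): class label of d_i
        sim: float    # Sim(d_i): similarity to the furthest instance covered by N_i
        num: int      # Num(d_i): number of instances covered by N_i
        rep: list     # Rep(d_i): a representation of d_i (descriptor vector)

    r = Representative(cls="polar narcotic", sim=0.82, num=17,
                       rep=[0.31, 0.66, 0.05])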
11. kNNMFS: kNN Model-based Feature Selection
- kNNMFS takes the output of kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative for each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. The k of ReliefF is varied in our algorithm: its value depends on the number of instances covered by each nearest representative used in the feature weight calculation. The M of kNNMFS is the number of representatives output by kNNModel.
12. kNNMFS Feature Selection Method
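A hedged sketch of kNNMFS as we read the previous slide: kNNModel representatives replace random samples as seeds, and k varies per representative as Num(d_i). This is our reconstruction, not the authors' code:

    import numpy as np

    def knnmfs(X, y, reps):
        # reps: list of (centre_index, covered_indices) pairs from kNNModel.
        # Attributes are assumed numeric and scaled to [0, 1].
        p = X.shape[1]
        W = np.zeros(p)
        M = len(reps)                          # M = number of representatives
        centres = np.array([c for c, _ in reps])
        classes = y[centres]
        for i, (ci, _) in enumerate(reps):     # every representative is a seed
            dists = np.abs(X[centres] - X[ci]).sum(axis=1)
            dists[i] = np.inf                  # exclude the seed itself
            for cls in np.unique(classes):     # nearest representative per class
                cand = np.where(classes == cls, dists, np.inf)
                j = int(np.argmin(cand))
                if not np.isfinite(cand[j]):
                    continue                   # no other representative of this class
                _, covered = reps[j]
                # average diff over the k = Num(d_j) instances it covers
                d = np.abs(X[covered] - X[ci]).mean(axis=0)
                W += (d if cls != y[ci] else -d) / M   # misses add, hits subtract
        return W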
13. Toxicity Dataset: Phenols
The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contains 250 compounds. A total of 173 descriptors were calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label.
(Scatter plots: X = CX-EMP20 vs. Y = Toxicity; X = TS_QuadXX vs. Y = Toxicity.)
14. Evaluation Measures for Continuous Class Value Prediction
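The formulas on this slide did not survive extraction; the measures reported in Table 1 are the standard ones for continuous-class prediction. For predicted values p_i and actual values a_i (i = 1, ..., n, with means \bar{p}, \bar{a}):

    \mathrm{CC}   = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}
                         {\sqrt{\sum_i (p_i - \bar{p})^2 \,\sum_i (a_i - \bar{a})^2}}
    \mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|
    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}
    \mathrm{RAE}  = \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|} \times 100\%
    \mathrm{RRSE} = \sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2}} \times 100\%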
15. Endpoint I: Toxicity
Table 1. Performance of the linear regression algorithm on different phenols subsets. FSM: feature selection method; NSF: number of selected features; CC: correlation coefficient; MAE: mean absolute error; RMSE: root mean squared error; RAE/RRSE: relative absolute / root relative squared error.

FSM      NSF  CC      MAE     RMSE    RAE (%)  RRSE (%)
Phenols  173  0.8039  0.3993  0.5427  59.4360  65.3601
MostU     12  0.7543  0.4088  0.5454  60.8533  65.6853
GR        20  0.7722  0.4083  0.5291  60.7675  63.7304
IG        20  0.7662  0.3942  0.5325  58.6724  63.1352
Chi       20  0.7570  0.4065  0.5439  60.5101  65.5146
ReliefF   20  0.8353  0.3455  0.4568  51.4319  55.0232
SVM       20  0.8239  0.3564  0.4697  53.0501  56.5722
CS        13  0.7702  0.3982  0.5292  59.2748  63.7334
CFS        7  0.8049  0.3681  0.4908  54.7891  59.1181
kNNMFS    35  0.8627  0.3150  0.4226  46.8855  50.8992
16. Endpoint II: Mechanism of Action
Table 2. Performance of the wkNN algorithm (k = 5, 10-fold cross-validation) on different phenols subsets.

FSM      NSF  Average Accuracy (%)  Variance  Std Deviation
GR        20  89.32                 1.70      1.31
IG        20  89.08                 1.21      1.10
Chi       20  88.68                 0.50      0.71
ReliefF   20  91.40                 1.32      1.15
SVM       20  91.80                 0.40      0.63
CS        13  89.40                 0.76      0.87
CFS        7  80.76                 1.26      1.12
kNNMFS    35  93.24                 0.44      0.67
Phenols  173  86.24                 0.43      0.66
17. Conclusions and Future Research Directions
- Using a kNN model as the starter, the selection can choose a set of more meaningful representatives to replace the original data for feature selection.
- A more reasonable difference function is calculated, based on the inductive information in each representative obtained by kNNModel.
- Better performance is obtained by kNNMFS on the subsets of the phenols dataset with different endpoints.
- Future work: investigating the effectiveness of boundary data or cluster-centre data chosen as seeds for kNNMFS.
- More comprehensive experiments on benchmark data will be carried out.
18. References
- Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
- Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In: Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)
- Schultz, T.W.: TETRATOX: The Tetrahymena pyriformis Population Growth Impairment Endpoint, a Surrogate for Fish Lethality. Toxicol. Methods 7, 289-309 (1997)
19. Thank you very much!