Title: A Study on Feature Selection for Toxicity Prediction
1. A Study on Feature Selection for Toxicity Prediction
- Gongde Guo (1), Daniel Neagu (1) and Mark Cronin (2)
- (1) Department of Computing, University of Bradford
- (2) School of Pharmacy and Chemistry, Liverpool John Moores University
- EPSRC Project PYTHIA: Predictive Toxicology Knowledge Representation and Processing Tool Based on a Hybrid Intelligent Systems Approach, Grant Reference GR/T02508/01
2. Outline of Presentation
- Predictive Toxicology
- Feature Selection Methods
- Relief Family: Relief, ReliefF
- kNNMFS Feature Selection
- Evaluation Criteria
- Toxicity Dataset: Phenols
- Evaluation I: Toxicity
- Evaluation II: Mechanism of Action
- Conclusions
3. Predictive Toxicology
- The goal of predictive toxicology is to describe the relations between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationships, SAR), and to use these relations to predict the behaviour of new, unknown chemical compounds.
- Predictive toxicology data mining comprises the steps of data preparation, data reduction (including feature selection), data modelling, prediction (classification, regression), evaluation of results, and further knowledge discovery tasks; a minimal sketch of such a pipeline follows.
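A minimal sketch of this pipeline, assuming scikit-learn and synthetic data (the study itself used Weka-style evaluators; all names below are illustrative):

    # Illustrative pipeline: preparation -> feature selection -> modelling
    # -> evaluation. Data are synthetic; scikit-learn is a stand-in for
    # the tools actually used in the study.
    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(250, 173))                      # 250 compounds x 173 descriptors
    y = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=250)  # synthetic "toxicity" endpoint

    pipe = Pipeline([
        ("prepare", StandardScaler()),                # data preparation
        ("reduce", SelectKBest(f_regression, k=20)),  # feature selection
        ("model", LinearRegression()),                # modelling / prediction
    ])
    scores = cross_val_score(pipe, X, y, cv=10, scoring="r2")  # evaluation
    print(f"10-fold CV R^2: {scores.mean():.3f}")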
4. Feature Selection Methods
- Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible.
- Seven feature selection methods (Witten and Frank, 2000) are involved in our study:
- GR: Gain Ratio feature evaluator
- IG: Information Gain ranking filter
- Chi: Chi-squared ranking filter
- ReliefF: ReliefF feature selection
- SVM: SVM feature evaluator
- CS: Consistency Subset evaluator
- CFS: Correlation-based Feature Selection
- In this work we focus on the drawbacks of the ReliefF feature selection method and propose the kNNMFS feature selection method; a sketch of a ranking filter follows.
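A hedged sketch of how a ranking filter orders descriptors, using scikit-learn's mutual information (an information-gain analogue) as a stand-in for the Weka evaluators named above; data and labels are synthetic:

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(1)
    X = rng.normal(size=(250, 173))      # synthetic descriptor matrix
    y = rng.integers(0, 4, size=250)     # synthetic mechanism-of-action labels

    scores = mutual_info_classif(X, y, random_state=1)  # score every feature
    top20 = np.argsort(scores)[::-1][:20]                # keep the 20 best
    print("Top-20 descriptor indices:", top20)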
5. Relief Feature Selection Method
- The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same and the opposite class. The feature values of these nearest neighbours are compared to those of the sampled instance and used to update the relevance score of each feature.
- Open issues: only K = 1 neighbour is used, so Relief is sensitive to noise; how should M, and the individual M sampled instances, be chosen?
6. Relief Feature Selection Method
- Algorithm Relief
- Input: for each training instance, a vector of attribute values and the class value
- Output: the vector W of estimations of the qualities of attributes
- Set all weights W[A_i] := 0.0, i = 1, 2, ..., p
- for j := 1 to m do begin
-   randomly select an instance X_j
-   find its nearest hit H_j and nearest miss M_j
-   for k := 1 to p do begin
-     W[A_k] := W[A_k] - diff(A_k, X_j, H_j)/m + diff(A_k, X_j, M_j)/m
-   end
- end
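A minimal runnable Python translation of this pseudocode, assuming two classes and numeric attributes scaled to [0, 1] (our sketch, not the original implementation):

    import numpy as np

    def relief(X, y, m, seed=None):
        # X: (n, p) attribute matrix; y: binary class labels; m: sample count.
        rng = np.random.default_rng(seed)
        n, p = X.shape
        W = np.zeros(p)                        # set all weights W[A_i] := 0.0
        for _ in range(m):
            j = rng.integers(n)                # randomly select an instance X_j
            d = np.abs(X - X[j]).sum(axis=1)   # distance from X_j to every instance
            d[j] = np.inf                      # exclude X_j itself
            same = (y == y[j])
            hit = np.argmin(np.where(same, d, np.inf))    # nearest hit H_j
            miss = np.argmin(np.where(~same, d, np.inf))  # nearest miss M_j
            # diff(A_k, X_j, .) = |X_j[k] - .[k]| for numeric attributes
            W += (np.abs(X[j] - X[miss]) - np.abs(X[j] - X[hit])) / m
        return W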
7. ReliefF Feature Selection Method
- ReliefF extends Relief by averaging over the K nearest hits and misses, which reduces its sensitivity to noise. Open issues remain: how should K, M, and the M sampled instances be chosen?
8. ReliefF Feature Selection Method
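The formula on this slide did not survive extraction; for reference, the standard ReliefF weight update (Kononenko, 1994), which averages over the k nearest hits H_j and the k nearest misses M_j(C) of each sampled instance R_i and weights misses by class priors, is likely what was shown:

    W[A] := W[A] - \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, H_j)}{m \cdot k}
                 + \sum_{C \neq \mathrm{class}(R_i)} \frac{P(C)}{1 - P(\mathrm{class}(R_i))}
                   \sum_{j=1}^{k} \frac{\mathrm{diff}(A, R_i, M_j(C))}{m \cdot k}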
9. kNN Model-based Classification Method (Guo et al., 2003)
The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification. kNNModel generates a set of optimal representatives by inductive learning from the dataset.
10. An Example of kNNModel
Each representative d_i is represented as a tuple <Cls(d_i), Sim(d_i), Num(d_i), Rep(d_i)>, whose components are, respectively: the class label of d_i; the similarity of d_i to the furthest instance among the instances covered by its neighbourhood N_i; the number of instances covered by N_i; and a representation of instance d_i. A sketch of this structure follows.
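A sketch of this tuple as a data structure (the field names and the example values are ours, for illustration only):

    from dataclasses import dataclass

    @dataclass
    class Representative:
        cls: str      # Cls(d_i): class label of d_i
        sim: float    # Sim(d_i): similarity to the furthest instance covered by N_i
        num: int      # Num(d_i): number of instances covered by N_i
        rep: list     # Rep(d_i): a representation of d_i (descriptor vector)

    r = Representative(cls="polar narcotic", sim=0.82, num=17,
                       rep=[0.31, 0.66, 0.05])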
11. kNNMFS: kNN Model-based Feature Selection
- kNNMFS takes the output of kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative for each class and then directly uses the inductive information of each representative generated by kNNModel for the feature weight calculation. The k of ReliefF is varied in our algorithm: its value depends on the number of instances covered by each nearest representative used in the feature weight calculation. The M of kNNMFS is the number of representatives output by kNNModel.
12. kNNMFS Feature Selection Method
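A hedged sketch of kNNMFS as we read the previous slide: kNNModel representatives replace random samples as seeds, and k varies per representative as Num(d_i). This is our reconstruction, not the authors' code:

    import numpy as np

    def knnmfs(X, y, reps):
        # reps: list of (centre_index, covered_indices) pairs from kNNModel.
        # Attributes are assumed numeric and scaled to [0, 1].
        p = X.shape[1]
        W = np.zeros(p)
        M = len(reps)                          # M = number of representatives
        centres = np.array([c for c, _ in reps])
        classes = y[centres]
        for i, (ci, _) in enumerate(reps):     # every representative is a seed
            dists = np.abs(X[centres] - X[ci]).sum(axis=1)
            dists[i] = np.inf                  # exclude the seed itself
            for cls in np.unique(classes):     # nearest representative per class
                cand = np.where(classes == cls, dists, np.inf)
                j = int(np.argmin(cand))
                if not np.isfinite(cand[j]):
                    continue                   # no other representative of this class
                _, covered = reps[j]
                # average diff over the k = Num(d_j) instances it covers
                d = np.abs(X[covered] - X[ci]).mean(axis=0)
                W += (d if cls != y[ci] else -d) / M   # misses add, hits subtract
        return W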
13. Toxicity Dataset: Phenols
The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contains 250 compounds. A total of 173 descriptors were calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label.
(Scatter plots: X = CX-EMP20 vs. Y = Toxicity; X = TS_QuadXX vs. Y = Toxicity.)
14. Evaluation Measures for Continuous Class Value Prediction
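The formulas on this slide did not survive extraction; the measures reported in Table 1 are the standard ones for continuous-class prediction. For predicted values p_i and actual values a_i (i = 1, ..., n, with means \bar{p}, \bar{a}):

    \mathrm{CC}   = \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}
                         {\sqrt{\sum_i (p_i - \bar{p})^2 \,\sum_i (a_i - \bar{a})^2}}
    \mathrm{MAE}  = \frac{1}{n}\sum_{i=1}^{n} |p_i - a_i|
    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (p_i - a_i)^2}
    \mathrm{RAE}  = \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|} \times 100\%
    \mathrm{RRSE} = \sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2}} \times 100\%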
15. Endpoint I: Toxicity
Table 1. Performance of the linear regression algorithm on different phenols subsets. FSM: feature selection method; NSF: number of selected features; CC: correlation coefficient; MAE: mean absolute error; RMSE: root mean squared error; RAE/RRSE: relative absolute / root relative squared error.

FSM      NSF  CC      MAE     RMSE    RAE (%)  RRSE (%)
Phenols  173  0.8039  0.3993  0.5427  59.4360  65.3601
MostU     12  0.7543  0.4088  0.5454  60.8533  65.6853
GR        20  0.7722  0.4083  0.5291  60.7675  63.7304
IG        20  0.7662  0.3942  0.5325  58.6724  63.1352
Chi       20  0.7570  0.4065  0.5439  60.5101  65.5146
ReliefF   20  0.8353  0.3455  0.4568  51.4319  55.0232
SVM       20  0.8239  0.3564  0.4697  53.0501  56.5722
CS        13  0.7702  0.3982  0.5292  59.2748  63.7334
CFS        7  0.8049  0.3681  0.4908  54.7891  59.1181
kNNMFS    35  0.8627  0.3150  0.4226  46.8855  50.8992
16. Endpoint II: Mechanism of Action
Table 2. Performance of the wkNN algorithm (k = 5, 10-fold cross-validation) on different phenols subsets.

FSM      NSF  Average Accuracy (%)  Variance  Std Deviation
GR        20  89.32                 1.70      1.31
IG        20  89.08                 1.21      1.10
Chi       20  88.68                 0.50      0.71
ReliefF   20  91.40                 1.32      1.15
SVM       20  91.80                 0.40      0.63
CS        13  89.40                 0.76      0.87
CFS        7  80.76                 1.26      1.12
kNNMFS    35  93.24                 0.44      0.67
Phenols  173  86.24                 0.43      0.66
17. Conclusions and Future Research Directions
- Using a kNN model as the starter, the selection can choose a set of more meaningful representatives to replace the original data for feature selection.
- A more reasonable difference function is calculated, based on the inductive information in each representative obtained by kNNModel.
- Better performance is obtained by kNNMFS on the subsets of the phenols dataset with different endpoints.
- Future work: investigating the effectiveness of boundary data or cluster-centre data chosen as seeds for kNNMFS.
- More comprehensive experiments on benchmark data will be carried out.
18. References
- Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)
- Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In: Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)
- Schultz, T.W.: TETRATOX: The Tetrahymena pyriformis Population Growth Impairment Endpoint, a Surrogate for Fish Lethality. Toxicol. Methods 7, 289-309 (1997)
19. Thank you very much!