1
A Study on Feature Selection for Toxicity
Prediction
  • Gongde Guo1, Daniel Neagu1 and Mark Cronin2
  • 1Department of Computing, University of Bradford
  • 2School of Pharmacy and Chemistry, Liverpool John
    Moores University
  • EPSRC Project PYTHIA: Predictive Toxicology
    Knowledge Representation and Processing Tool
    based on a Hybrid Intelligent Systems Approach,
    Grant Reference: GR/T02508/01

2
Outline of Presentation
  1. Predictive Toxicology
  2. Feature Selection Methods
  3. The Relief Family: Relief, ReliefF
  4. kNNMFS Feature Selection
  5. Evaluation Criteria
  6. Toxicity Dataset: Phenols
  7. Evaluation I: Toxicity
  8. Evaluation II: Mechanism of Action
  9. Conclusions

3
Predictive Toxicology
  • The goal of predictive toxicology is to describe
    the relations between the chemical structure of a
    molecule and biological and toxicological
    processes (Structure-Activity Relationships, SAR)
    and to use these relations to predict the
    behaviour of new, unknown chemical compounds.
  • Predictive toxicology data mining comprises the
    steps of data preparation, data reduction
    (including feature selection), data modelling,
    prediction (classification, regression),
    evaluation of results, and further knowledge
    discovery tasks, as sketched below.
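To make these steps concrete, below is a minimal, illustrative sketch (not taken from the study) of such a workflow in scikit-learn; the arrays X and y are hypothetical placeholders for the descriptor matrix and a continuous toxicity endpoint.

```python
# Illustrative workflow only: placeholder data standing in for 250 compounds
# with 173 descriptors and a continuous toxicity endpoint.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 173))   # descriptor matrix (data preparation assumed done)
y = rng.normal(size=250)          # continuous toxicity values

workflow = Pipeline([
    ("reduce", SelectKBest(score_func=f_regression, k=20)),  # data reduction / feature selection
    ("model", LinearRegression()),                           # data modelling / prediction
])

scores = cross_val_score(workflow, X, y, cv=10, scoring="r2")  # evaluation of results
print("mean cross-validated R^2:", scores.mean())
```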

4
Feature Selection Methods
  • Feature selection is the process of identifying
    and removing as much of the irrelevant and
    redundant information as possible.
  • Seven feature selection methods (Witten et al.,
    2000) are involved in our study:
  • GR: Gain Ratio feature evaluator
  • IG: Information Gain ranking filter
  • Chi: Chi-squared ranking filter
  • ReliefF: ReliefF feature selection
  • SVM: SVM feature evaluator
  • CS: Consistency Subset evaluator
  • CFS: Correlation-based Feature Selection
  • In this work, we focus on the drawbacks of the
    ReliefF feature selection method and propose the
    kNNMFS feature selection method (a rough analogue
    of two of these ranking filters is sketched
    below).
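As a rough analogue of two of these ranking filters (not the Weka evaluators actually used in the study), the sketch below ranks descriptors by mutual information and by a chi-squared statistic using scikit-learn; X, y_cont and y_cls are hypothetical placeholders.

```python
# Rough scikit-learn analogue of two ranking filters (not the Weka evaluators
# used in the study); X, y_cont and y_cls are hypothetical placeholders.
import numpy as np
from sklearn.feature_selection import mutual_info_regression, chi2

rng = np.random.default_rng(1)
X = rng.random((250, 173))          # descriptors scaled to [0, 1] (chi2 needs non-negative input)
y_cont = rng.normal(size=250)       # continuous endpoint (toxicity)
y_cls = rng.integers(0, 4, 250)     # discrete endpoint (mechanism of action)

ig_like = mutual_info_regression(X, y_cont)   # information-gain-style ranking
chi_scores, _ = chi2(X, y_cls)                # chi-squared ranking

top20_ig = np.argsort(ig_like)[::-1][:20]     # indices of the 20 highest-ranked descriptors
top20_chi = np.argsort(chi_scores)[::-1][:20]
print(top20_ig[:5], top20_chi[:5])
```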

5
Relief Feature Selection Method
  • The Relief algorithm works by randomly sampling
    an instance and locating its nearest neighbours
    from the same and the opposite class (the nearest
    hit and nearest miss). The feature values of these
    nearest neighbours are compared to those of the
    sampled instance and used to update the relevance
    score of each feature.

[Slide figure: Relief with K = 1; annotations ask how to handle noise, how to
choose M (the number of sampled instances), and how to choose the individual M
instances.]
6
Relief Feature Selection Method
  • Algorithm Relief
  • Input: for each training instance, a vector of
    attribute values and the class value
  • Output: the vector W of estimations of the
    qualities of attributes
  • set all weights W[A_i] := 0.0, i = 1, 2, ..., p
  • for j := 1 to m do begin
  •   randomly select an instance X_j
  •   find its nearest hit H_j and nearest miss M_j
  •   for k := 1 to p do begin
  •     W[A_k] := W[A_k] - diff(A_k, X_j, H_j)/m + diff(A_k, X_j, M_j)/m
  •   end
  • end
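A minimal Python sketch of the Relief procedure above, assuming numeric attributes normalised to [0, 1], Manhattan distance for the nearest-neighbour search, and diff(A_k, X, Y) = |x_k - y_k|:

```python
import numpy as np

def relief(X, y, m, seed=0):
    """X: (n, p) attribute matrix scaled to [0, 1]; y: (n,) class labels; m: number of samples."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)                                   # W[A_i] := 0.0 for all attributes
    for _ in range(m):
        j = rng.integers(n)                           # randomly select an instance X_j
        d = np.abs(X - X[j]).sum(axis=1)              # Manhattan distance to every instance
        d[j] = np.inf                                 # exclude X_j itself
        hit = np.argmin(np.where(y == y[j], d, np.inf))    # nearest hit H_j
        miss = np.argmin(np.where(y != y[j], d, np.inf))   # nearest miss M_j
        # diff(A_k, X_j, .) for numeric attributes is the absolute difference
        w += (-np.abs(X[j] - X[hit]) + np.abs(X[j] - X[miss])) / m
    return w
```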

7
ReliefF Feature Selection Method
[Slide figure: ReliefF. Noise is addressed; open questions remain: how to choose
k and M, and how to choose the M sampled instances?]
8
ReliefF Feature Selection Method
9
kNN Model-based Classification Method (Guo et al.,
2003)
The basic idea of the kNN model-based classification
method (kNNModel) is to find a set of more
meaningful representatives of the complete dataset
to serve as the basis for further classification.
kNNModel can generate a set of optimal
representatives by inductively learning from the
dataset.
10
An Example of kNNModel
Each representative di is represented in the form
<Cls(di), Sim(di), Num(di), Rep(di)>, whose elements
respectively denote: the class label of di; the
similarity of di to the furthest instance among the
instances covered by Ni; the number of instances
covered by Ni; and a representation of instance di.
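As a purely speculative sketch (assumptions only; see Guo et al., 2003 for the exact kNNModel procedure), one way to build such representatives is a greedy covering loop: repeatedly pick the instance whose largest same-class neighbourhood covers the most uncovered instances, and record its <Cls, Sim, Num, Rep> tuple. Distance is used here in place of similarity.

```python
# Speculative sketch only (see Guo et al., 2003 for the exact kNNModel
# procedure): greedily build representatives <Cls, Sim, Num, Rep>.
import numpy as np

def build_representatives(X, y):
    n = len(X)
    uncovered = np.ones(n, dtype=bool)
    reps = []
    while uncovered.any():
        best = None
        for i in np.flatnonzero(uncovered):
            d = np.linalg.norm(X - X[i], axis=1)
            order = np.argsort(d)                 # neighbours of X[i], nearest first
            same = y[order] == y[i]
            # largest neighbourhood of X[i] containing only same-class instances
            k = np.argmax(~same) if (~same).any() else n
            covered = order[:k]
            score = uncovered[covered].sum()      # how many new instances it would cover
            if best is None or score > best[0]:
                best = (score, i, covered, d[covered].max())
        _, i, covered, radius = best
        # <Cls, Sim, Num, Rep>: class, distance to the furthest covered instance
        # (used here in place of similarity), coverage count, the instance itself
        reps.append((y[i], radius, len(covered), X[i]))
        uncovered[covered] = False
    return reps
```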
11
kNNMFS: kNN Model-based Feature Selection
  • kNNMFS takes the output of kNNModel as seeds for
    further feature selection. Given a new instance,
    kNNMFS finds the nearest representative for each
    class and then directly uses the inductive
    information of each representative generated by
    kNNModel for feature weight calculation. The k of
    ReliefF is varied in our algorithm: its value
    depends on the number of instances covered by each
    nearest representative used for feature weight
    calculation. The M in kNNMFS is the number of
    representatives output by kNNModel.
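The exact weight update used by kNNMFS is not reproduced on these slides; the following is a speculative sketch consistent with the description above: a ReliefF-style update that iterates over the M representatives as seeds instead of m random instances, finds the nearest representative of each class, and scales each contribution by that representative's coverage Num(di).

```python
# Speculative sketch only: a ReliefF-style weight update seeded from the M
# kNNModel representatives. The exact difference-function weighting used by
# kNNMFS is not given in the slides; here each nearest representative simply
# contributes in proportion to its coverage Num(d_i).
import numpy as np

def knnmfs_weights(reps, p):
    """reps: list of (cls, sim, num, vector) tuples from kNNModel; p: number of features."""
    w = np.zeros(p)
    M = len(reps)
    classes = {cls for cls, _, _, _ in reps}
    for cls_j, _, _, x_j in reps:                      # each representative acts as a seed
        for c in classes:
            candidates = [(num, x) for cls_i, _, num, x in reps
                          if cls_i == c and x is not x_j]
            if not candidates:
                continue
            num, x_near = min(candidates, key=lambda t: np.linalg.norm(t[1] - x_j))
            contrib = num * np.abs(x_j - x_near) / M   # scale by the representative's coverage
            w += contrib if c != cls_j else -contrib   # miss adds, hit subtracts (Relief-style)
    return w
```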

12
kNNMFS Feature Selection Method
13
Toxicity Dataset: Phenols
The phenols dataset was collected from the TETRATOX
database (Schultz, 1997) and contains 250
compounds. A total of 173 descriptors were
calculated for each compound using different
software tools, e.g., ACD/Labs, Chem-X and TSAR.
These descriptors were calculated to represent the
physico-chemical, structural and topological
properties that are relevant to toxicity. Some
features are irrelevant to, or correlate poorly
with, the class label (e.g., the slide's scatter
plots of X = CX-EMP20 vs. Y = Toxicity and
X = TS_QuadXX vs. Y = Toxicity).
14
Evaluation Measures for Continuous Class Value
Prediction
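The formulas for these measures appeared as a slide figure; assuming the standard definitions (a_i actual and p_i predicted toxicity values over n test compounds, bars denoting means, and RSE in Table 1 taken to be the root mean squared error), they are:

```latex
% Standard definitions assumed for the measures reported in Table 1.
\begin{align*}
\mathrm{CC}   &= \frac{\sum_i (p_i - \bar{p})(a_i - \bar{a})}
                      {\sqrt{\sum_i (p_i - \bar{p})^2}\,\sqrt{\sum_i (a_i - \bar{a})^2}} \\
\mathrm{MAE}  &= \frac{1}{n}\sum_i |p_i - a_i| \\
\mathrm{RMSE} &= \sqrt{\tfrac{1}{n}\sum_i (p_i - a_i)^2} \\
\mathrm{RAE}  &= \frac{\sum_i |p_i - a_i|}{\sum_i |a_i - \bar{a}|} \times 100\% \\
\mathrm{RRSE} &= \sqrt{\frac{\sum_i (p_i - a_i)^2}{\sum_i (a_i - \bar{a})^2}} \times 100\%
\end{align*}
```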
15
Endpoint I Toxicity
Table 1. Performance of the linear regression
algorithm on different phenols subsets
(FSM = feature selection method; NSF = number of selected features)
FSM       NSF   CC       MAE      RSE      RAE (%)    RRSE (%)
Phenols   173   0.8039   0.3993   0.5427   59.4360    65.3601
MostU      12   0.7543   0.4088   0.5454   60.8533    65.6853
GR         20   0.7722   0.4083   0.5291   60.7675    63.7304
IG         20   0.7662   0.3942   0.5325   58.6724    63.1352
Chi        20   0.7570   0.4065   0.5439   60.5101    65.5146
ReliefF    20   0.8353   0.3455   0.4568   51.4319    55.0232
SVM        20   0.8239   0.3564   0.4697   53.0501    56.5722
CS         13   0.7702   0.3982   0.5292   59.2748    63.7334
CFS         7   0.8049   0.3681   0.4908   54.7891    59.1181
kNNMFS     35   0.8627   0.3150   0.4226   46.8855    50.8992
16
Endpoint II Mechanism of Action
Table 2. Performance of the wkNN algorithm on
different phenols subsets
(10-fold cross validation using wkNN, k = 5)
FSM       NSF   Average Accuracy (%)   Variance   Deviation
GR         20   89.32                  1.70       1.31
IG         20   89.08                  1.21       1.10
Chi        20   88.68                  0.50       0.71
ReliefF    20   91.40                  1.32       1.15
SVM        20   91.80                  0.40       0.63
CS         13   89.40                  0.76       0.87
CFS         7   80.76                  1.26       1.12
kNNMFS     35   93.24                  0.44       0.67
Phenols   173   86.24                  0.43       0.66
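For reference, a minimal sketch of a distance-weighted kNN (wkNN) classifier of the kind reported in Table 2, assuming inverse-distance weighting of the k = 5 neighbour votes (the exact weighting scheme is not given on the slides):

```python
# Minimal wkNN sketch: neighbours vote with inverse-distance weights.
import numpy as np

def wknn_predict(X_train, y_train, x, k=5, eps=1e-12):
    d = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(d)[:k]                      # the k nearest neighbours
    weights = 1.0 / (d[nn] + eps)               # closer neighbours get larger votes
    votes = {}
    for label, w in zip(y_train[nn], weights):
        votes[label] = votes.get(label, 0.0) + w
    return max(votes, key=votes.get)
```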
17
Conclusion and Future Research Directions
  • Using kNNModel as the starting point selects a set
    of more meaningful representatives to replace the
    original data for feature selection.
  • We present a more reasonable difference function
    calculation based on the inductive information in
    each representative obtained by kNNModel.
  • kNNMFS obtains better performance on the subsets
    of the phenols dataset for both endpoints.
  • Future work: investigating the effectiveness of
    boundary data or cluster-centre data chosen as
    seeds for kNNMFS.
  • More comprehensive experiments on benchmark data
    will be carried out.

18
References
  • Witten, I.H. and Frank, E.: Data Mining:
    Practical Machine Learning Tools with Java
    Implementations. Morgan Kaufmann, San Francisco
    (2000)
  • Guo, G., Wang, H., Bell, D. et al.: kNN
    Model-based Approach in Classification. In: Proc.
    of CoopIS/DOA/ODBASE 2003, LNCS 2888, pp. 986-996.
    Springer-Verlag (2003)
  • Schultz, T.W.: TETRATOX: The Tetrahymena
    pyriformis Population Growth Impairment Endpoint,
    a Surrogate for Fish Lethality. Toxicol. Methods,
    7, 289-309 (1997)

19
Thank you very much!