Title: Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models
1. Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models
2. The Background
- 26 million distinct organic and inorganic chemicals are known; > 80,000 are in commercial production.
- Combinatorial chemistry adds more than 1 million new compounds to the library every year.
- In the UK, > 10,000 chemicals are evaluated for possible production every year; toxicity evaluation is the biggest cost factor.
3. What is toxicity?
- "The dose makes the poison." (Paracelsus, 1493-1541)
- Toxicity endpoints: EC50, LC50, ...
4. Toxicity tests are
- expensive,
- time consuming, and
- disliked by many people.
5. Descriptors
- Descriptors: physicochemical, biological, structural (obtained via molecular modelling)
- SARs / QSARs relate descriptors to toxicity endpoints, e.g. via neural networks, PLS, expert systems
- HOMO: highest occupied molecular orbital; LUMO: lowest unoccupied molecular orbital
- The number of descriptors drives cost and time
6. Aims of Research
- An integrated data mining environment (IDME) for in silico toxicity prediction
- A decision tree induction technique for eco-toxicity modelling
- In silico techniques for mixture toxicity prediction
7. Why a Data Mining System for In Silico Toxicity Prediction?
Limitations of existing systems:
- Unknown confidence level of predictions
- Extrapolation beyond the training domain
- Models built from small datasets
- Fixed descriptors
- May not cover the endpoint required
8. Data Mining: Discovering Useful Information and Knowledge from Data
- Data: records of numerical data, symbols, images, documents
- Knowledge: rules (IF .. THEN ..), cause-effect relationships, decision trees, patterns of abnormal/normal operation, predictive equations
[Pyramid figure: volume decreases and value increases moving from data, through information and knowledge, to decisions.]
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data" (Fayyad et al.'s definition of knowledge discovery in databases)
9. Data mining tasks: clustering, classification, conceptual clustering, inductive learning, dependency modelling, summarisation, regression, case-based learning
10. e.g. Dependency Modelling or Link Analysis

x1 x2 x3
 1  0  0
 1  1  1
 0  0  1
 1  1  1
 0  0  0
 0  1  1
 1  1  1
 0  0  0
 1  1  1
 0  0  0

[Two candidate dependency graphs relating x1, x2 and x3.]
11.
- Data pre-processing: wavelets for on-line signal feature extraction and dimension reduction; fuzzy approach for dynamic trend interpretation
- Clustering: supervised classification (BPNN, fuzzy set covering approach); unsupervised classification (ART2 adaptive resonance theory, AutoClass, PCA)
- Dependency modelling: Bayesian networks, fuzzy logic, SDG (signed directed graph), decision trees
- Others: automatic rule extraction from data using fuzzy-NN and fuzzy SDG; visualisation
14. Process Operational Safety Envelopes (Loss Prevention in the Process Industries, 2002)
15. Integrated Data Mining Environment - Toxicity
16. User Interface
17. Quantitative Structure-Activity Relationship
- 75 organic compounds with 1094 descriptors and the endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al., QSAR 17(2), 1998, pp. 131-138)
- Log(1/EC50) = -0.3766 + 0.0444 Vx (r² = 0.7078, MSE = 0.2548)
- Vx: McGowan's characteristic volume; r²: Pearson's correlation coefficient; q²: leave-one-out cross-validated correlation coefficient
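The one-descriptor model above can be applied directly. A minimal sketch, assuming the +/- placement of the two coefficients (the signs were stripped from the slide text):

```python
def predict_log_inv_ec50(vx):
    """Predict Log(1/EC50) toward Vibrio fischeri from McGowan's
    characteristic volume Vx, per the linear QSAR on the slide
    (Zhao et al. 1998): Log(1/EC50) = -0.3766 + 0.0444 * Vx.
    The sign placement is reconstructed, not taken verbatim."""
    return -0.3766 + 0.0444 * vx
```

With r² = 0.7078 this single descriptor explains roughly 70% of the variance, which is why the talk moves on to richer models.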
18. Principal Component Analysis
19. Clustering in IDME
20. Multidimensional Visualisation
21. Feedforward Neural Networks
[Network figure: input layer of principal components PC1, PC2, PC3, ..., PCm; one hidden layer; output layer predicting Log(1/EC50).]
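The slide's architecture (principal components in, one hidden layer, a single Log(1/EC50) output) can be sketched as a forward pass. The weights below are random stand-ins for illustration; in the talk they would be fitted by training on the compound data:

```python
import numpy as np

rng = np.random.default_rng(0)

def ffnn_forward(pcs, w_hidden, b_hidden, w_out, b_out):
    """One forward pass of the slide's feedforward net: principal
    components in, one tanh hidden layer, one linear output unit."""
    h = np.tanh(pcs @ w_hidden + b_hidden)   # hidden layer activations
    return h @ w_out + b_out                 # scalar Log(1/EC50) estimate

# Toy dimensions: m = 3 principal components, 4 hidden units.
m, n_hidden = 3, 4
w_h = rng.normal(size=(m, n_hidden))
b_h = np.zeros(n_hidden)
w_o = rng.normal(size=(n_hidden,))
pred = ffnn_forward(rng.normal(size=m), w_h, b_h, w_o, 0.0)
```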
22. FFNN Results (graph)
23. QSAR Model for Mixture Toxicity Prediction
[Diagram: training on mixtures with similar constituents and with dissimilar constituents.]
24. Why Inductive Data Mining for In Silico Toxicity Prediction?
- Lack of knowledge about which descriptors are important to toxicity endpoints (feature selection)
- Expert systems rely on subjective knowledge obtained from human experts
- Linear vs nonlinear relationships
- Black-box models
25. What is inductive learning?
It aims at developing a qualitative causal language for grouping data patterns into clusters: decision trees or production rules, which are explicit and transparent.
26.
Expert systems
- Pros: human expert knowledge; knowledge is transparent and causal
- Cons: knowledge is subjective; data not used; often qualitative
Statistical methods
- Cons: black-box; human knowledge not used
Neural networks
- Pros: data driven; quantitative; nonlinear; easy setup
- Cons: black-box; human knowledge not used
27. Discretization Techniques

Methods tested -> resulting pipeline:
- C5.0 binary discretization, information entropy (Quinlan 1986, 1993) -> C5.0
- LERS, Learning from Examples using Rough Sets (Grzymala-Busse 1997) -> LERS_C5.0
- Probability distribution histogram -> Histogram_C5.0
- Equal width interval -> EQI_C5.0
- KEX, Knowledge EXplorer (Berka & Bruha 1998) -> KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0
- CN4 (Berka & Bruha 1998) -> CN4_C5.0
- Chi2 (Liu & Setiono 1995; Kerber 1992) -> Chi2_C5.0
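Of the schemes listed, equal width interval (EQI) discretization is the simplest to illustrate. A sketch with NumPy; the function name and the 0-based bin-labelling convention are mine, not from the talk:

```python
import numpy as np

def equal_width_bins(values, k):
    """Equal width interval discretization (the 'EQI' pre-processing step):
    split the range of a continuous endpoint into k bins of equal width
    and return an integer class label per value."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), k + 1)
    # np.digitize against the interior edges yields labels 0..k-1
    return np.digitize(values, edges[1:-1])

labels = equal_width_bins([0.0, 1.0, 2.0, 3.0, 4.0], 2)  # → [0, 0, 1, 1, 1]
```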
28. Decision Tree Generation Based on Genetic Programming
- Traditional tree generation methods use greedy search and can miss potential models.
- A genetic algorithm (GA) optimisation approach can effectively avoid local minima and evaluate many solutions simultaneously; GAs have been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree.
- Genetic (evolutionary) programming not only evaluates many solutions simultaneously and avoids local minima, but also does not require parameter encoding into fixed-length vectors called chromosomes: it is based on direct application of the GA to tree structures.
29. Genetic Computation
1. Generate a population of solutions.
2. Repeat steps (i) and (ii) until the stop criteria are satisfied:
   (i) calculate the fitness function value for each solution candidate;
   (ii) perform crossover and mutation to generate the next generation.
3. The best solution across all generations is regarded as the solution.
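The loop above can be sketched generically. The four callables and the top-half parent selection are illustrative choices of mine, not details from the talk:

```python
import random

def evolve(initial_population, fitness, crossover, mutate,
           n_generations=50, mutation_rate=0.5):
    """Skeleton of the genetic computation loop: score the population,
    breed the next generation by crossover, mutate a fraction, and keep
    the best individual seen in any generation."""
    population = list(initial_population)
    best = max(population, key=fitness)
    for _ in range(n_generations):
        scored = sorted(population, key=fitness, reverse=True)
        parents = scored[: len(scored) // 2] or scored  # fitter half breeds
        children = [crossover(random.choice(parents), random.choice(parents))
                    for _ in range(len(population))]
        population = [mutate(c) if random.random() < mutation_rate else c
                      for c in children]
        best = max(population + [best], key=fitness)   # elitist bookkeeping
    return best
```

A toy usage: `evolve([0.0, 10.0, 20.0], lambda x: -abs(x - 10), lambda a, b: (a + b) / 2, lambda x: x)` converges on 10.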
30. Crossover
[Diagram contrasting crossover in genetic algorithms with crossover in genetic (evolutionary) programming / EPTree.]
31.
1. Divide the data into training and test sets.
2. Generate the first population of trees by randomly choosing a row (i.e. a compound) and a column (i.e. a descriptor). Using the value in that slot, s, to split: the left child takes the data points whose selected attribute value is < s, whilst the right child takes those > s.
[Diagram: molecules × descriptors matrix partitioned into < s and > s subsets.]
DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870 (2004); Buontempo, Wang et al., J Chem Inf Comput Sci 45, 904-912 (2005)
32.
- If a child would not cover enough rows (e.g. 10% of the training rows), another combination is tried.
- A child node becomes a leaf node if it is pure (i.e. all the rows it covers are in the same class) or nearly pure, whilst the other nodes grow children.
- When all nodes either have two children or are leaf nodes, the tree is fully grown and added to the first generation.
- A leaf node is assigned the class label corresponding to the majority class of the points partitioned there.
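The random-growing recipe and leaf rules above can be sketched together. `min_frac`, `purity` and the `max_depth` guard are assumed parameter names of mine, the near-pure threshold is illustrative, and ties at the split value go to the right child:

```python
import random

def grow_tree(rows, labels, min_frac=0.1, purity=0.9, depth=0, max_depth=5):
    """Randomly grow one tree for the initial GP population: pick a random
    compound (row) and descriptor (column), split at that value, retry if a
    child covers too few rows, and stop at (near) pure leaves that take the
    majority class. Trees are nested tuples (col, split, left, right)."""
    n = len(rows)
    majority = max(set(labels), key=labels.count)
    if labels.count(majority) / n >= purity or depth >= max_depth:
        return majority                      # leaf: (near) pure node
    min_rows = max(1, int(min_frac * n))
    for _ in range(50):                      # retry random (row, column) picks
        col = random.randrange(len(rows[0]))
        s = random.choice(rows)[col]
        left = [i for i in range(n) if rows[i][col] < s]
        right = [i for i in range(n) if rows[i][col] >= s]
        if len(left) >= min_rows and len(right) >= min_rows:
            return (col, s,
                    grow_tree([rows[i] for i in left],
                              [labels[i] for i in left],
                              min_frac, purity, depth + 1, max_depth),
                    grow_tree([rows[i] for i in right],
                              [labels[i] for i in right],
                              min_frac, purity, depth + 1, max_depth))
    return majority                          # no acceptable split found

def classify(tree, row):
    """Route one row down a grown tree to its leaf class."""
    while isinstance(tree, tuple):
        col, s, left, right = tree
        tree = left if row[col] < s else right
    return tree
```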
33. Crossover and Mutation
- Tournament: randomly select a group of trees, e.g. 16
- Calculate fitness values
- Generate the first parent
- Similarly generate the second parent
- Crossover to generate a child
- Generate other children
- Select a percentage for mutation
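Tournament selection and subtree crossover on nested-tuple trees (col, split, left, right) can be sketched as follows. The 0.3 stop probability and the exact crossover variant (graft a random subtree of one parent onto a random node of the other) are my assumptions; the talk does not spell them out:

```python
import random, copy

def tournament_select(population, fitness, size=16):
    """Tournament selection: sample `size` trees at random and
    return the fittest as a parent."""
    group = random.sample(population, min(size, len(population)))
    return max(group, key=fitness)

def crossover(parent_a, parent_b):
    """Subtree crossover: copy parent_a and replace one of its randomly
    chosen nodes with a randomly chosen subtree of parent_b."""
    def random_path(tree, path=()):
        # Stop at a leaf, or early with some probability at an inner node.
        if not isinstance(tree, tuple) or random.random() < 0.3:
            return path
        branch = random.choice([2, 3])       # descend left (2) or right (3)
        return random_path(tree[branch], path + (branch,))
    def subtree_at(tree, path):
        for branch in path:
            tree = tree[branch]
        return tree
    def replace(tree, path, subtree):
        if not path:
            return subtree
        t = list(tree)
        t[path[0]] = replace(t[path[0]], path[1:], subtree)
        return tuple(t)
    donor = subtree_at(parent_b, random_path(parent_b))
    return replace(copy.deepcopy(parent_a), random_path(parent_a), donor)
```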
34. Mutation Methods
- Random change of split point (i.e. choosing a different row's value for the current attribute)
- Choosing a new attribute whilst keeping the same row
- Choosing a new attribute and a new row
- Re-growing part of the tree
- If there is no improvement in accuracy for k generations, the trees generated are mutated.
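The first mutation moves can be sketched on the same nested-tuple representation (col, split, left, right); re-growing part of the tree and the keep-the-same-row variant are omitted for brevity, and the move names are mine:

```python
import random

def mutate(tree, rows):
    """One mutation step on a tree node: change the split point, pick a
    new descriptor with a fresh split value, or recurse into a child.
    `rows` is the training matrix that split values are drawn from."""
    if not isinstance(tree, tuple):
        return tree                      # leaves hold class labels; unchanged
    col, s, left, right = tree
    move = random.choice(['new_split', 'new_attribute', 'descend'])
    if move == 'new_split':
        # a different row's value for the current attribute
        s = random.choice(rows)[col]
    elif move == 'new_attribute':
        # a new descriptor and a new row's value for it
        col = random.randrange(len(rows[0]))
        s = random.choice(rows)[col]
    else:
        # mutate one of the children instead of this node
        if random.random() < 0.5:
            left = mutate(left, rows)
        else:
            right = mutate(right, rows)
    return (col, s, left, right)
```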
35. Two Data Sets
- Data set 1: concentration lethal to 50% of the population (LC50, modelled as 1/Log(LC50)) of Vibrio fischeri, a bioluminescent bacterium; 75 compounds, 1069 molecular descriptors.
- Data set 2: concentration affecting 50% of the population (EC50) of the alga Chlorella vulgaris, measured by the disappearance of fluorescein diacetate; 80 compounds, 1150 descriptors.
36. 600 trees were grown in each generation, with 16 trees competing in each tournament to select trees for crossover; 66.7% were mutated for the bacterial dataset and 50% for the algae dataset.
37. Evolutionary Programming Results, Dataset 1
For data set 1 (bacteria data), in generation 37: 91.7% training accuracy (60 cases); 73.3% test accuracy (15 cases).
38. Decision Tree Using C5.0 for the Same Data
For data set 1 (bacteria data): 88.3% training accuracy (60 cases); 60.0% test accuracy (15 cases).
[Tree figure: splits on gravitational index (7.776), valence connectivity index (3.346), Cl attached to C1 (sp2), H autocorrelation lag 5 weighted by atomic mass (0.007) and summed atomic weights of the angular scattering function (-0.082); leaves: Class 1 (13/14), Class 2 (11/12), Class 3 (7/7, 5/6), Class 4 (14/15, 3/6).]
39. Evolutionary Programming Results, Dataset 2
Second dataset (algae data), GP tree, generation 9: training 92.2%, test 81.3%.
[Tree figure: splits on solvation connectivity index (2.949), self-returning walk count of order 8 (3.798), molecular multiple path count of order 3 (92.813), H autocorrelation of lag 2 weighted by Sanderson electronegativities (0.401) and 2nd-component symmetry directional WHIM index weighted by van der Waals volume (0.367); leaves: Class 1 (16/16), Class 2 (14/15), Class 3 (6/8, 9/10), Class 4 (6/7, 8/8).]
40. Decision Tree Using See5.0 for the Same Data
Second dataset (algae data), See5: training 90.6%, test 75.0%.
[Tree figure: splits on max eigenvalue of the Burden matrix weighted by van der Waals volume (3.769), Broto-Moreau autocorrelation of topological structure lag 4 weighted by atomic mass (9.861) and total accessibility index weighted by van der Waals volume (0.281); leaves: Class 1 (16/16), Class 2 (15/16), Class 3 (15/20), Class 4 (12/12).]
41. Summary of Results
Data set 1 (bacteria data):

Method  Tree size  Training accuracy  Test accuracy
GP      8          91.7%              73.3%
C5.0    6          88.3%              60.0%
42. Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy
Data set 1 (bacteria data):

Method              Tree size  Training accuracy  Test accuracy
GP (generation 31)  8          88.3%              73.3%
C5.0                6          88.3%              60.0%
43. Application to Wastewater Treatment Plant Data
44. Data Corresponding to 527 Days of Operation, 38 Variables
[Plant schematic: input → pre-treatment (screws) → primary treatment (primary settler) → secondary treatment (aeration tanks, secondary settler) → output, with a sludge line.]
45. Decision Tree for Prediction of Suspended Solids in Effluents (Training Data)
Total no. of observations: 470; training accuracy 99.8%; test accuracy 93.0%; 20 leaf nodes.
Class labels: L = low, N = normal, H = high.
Variables: SS-P = input SS to primary settler; DQO-D = input COD to secondary settler; DBO-D = input BOD to secondary settler; PH-D = input pH to secondary settler; SSV-P = input volatile SS to primary settler.
[Tree figure: splits on SS-P, DQO-D, DBO-D, PH-D, SSV-P, RD-DBO-G, RD-DQO-S, DBO-SS and ZN-E at the thresholds shown on the slide; leaf counts range from 2 to 320 observations.]
46. Decision Tree Using All the Data of 527 Days
No. of observations: 527; accuracy 99.25%; 18 leaf nodes. Class labels: L = low, N = normal, H = high.
[Tree figure: splits on DBO-E, SS-P, RD-DQO-S, RD-SS-P, RD-SS-G, DBO-D, SED-P, PH-P, PH-D and COND-S at the thresholds shown on the slide.]
47. Final Remarks
- An integrated data mining prototype system for toxicity prediction of chemicals and mixtures has been developed.
- An evaluation of current inductive data mining approaches to toxicity prediction has been conducted.
- A new methodology for inductive data mining, based on a novel use of genetic programming, is proposed, giving promising results in three case studies.
48. On-going Work
- Adaptive discretization of end-point values through simultaneous mutation of the output
- SSRD: sum of squared differences in rank
[Graph: best training accuracy in each generation for the trees grown on the algae data using SSRD; the 2-class trees no longer dominate and very accurate 3-class trees have been found.]
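SSRD can be computed by ranking the predicted and observed end-point values and summing the squared rank differences (smaller is better). The tie handling below (plain sort order) is an assumption of mine:

```python
def ssrd(predicted, observed):
    """Sum of squared differences in rank (SSRD): rank both sequences
    and return the sum of squared rank differences."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank                  # position in ascending order
        return r
    return sum((a - b) ** 2
               for a, b in zip(ranks(predicted), ranks(observed)))
```

Identical orderings score 0; a fully reversed ordering of n = 3 values scores 8.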
49. Future Work
2) Extend the Method to Model Trees: Fuzzy Model Tree Generation
Rule 1: If antecedent one applies, with degree µ1 = µ1,1 µ1,2 ... µ1,9, then
y1 = 0.1910 PC1 + 0.6271 PC2 + 0.2839 PC3 + 1.2102 PC4 + 0.2594 PC5 + 0.3810 PC6 - 0.3695 PC7 + 0.8396 PC8 + 1.0986 PC9 - 0.5162
Rule 2: If antecedent two applies, with degree µ2 = µ2,1 µ2,2 ... µ2,9, then
y2 = 0.7403 PC1 + 0.5453 PC2 - 0.0662 PC3 - 0.8266 PC4 + 0.1699 PC5 - 0.0245 PC6 + 0.9714 PC7 - 0.3646 PC8 - 0.3977 PC9 - 0.0511
Final output: crisp value = (µ1 y1 + µ2 y2) / (µ1 + µ2), where µi = µi,1 µi,2 ... µi,10
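The crisp-output formula generalises to any number of rules. A sketch, reading the trailing constants in y1 and y2 as intercepts (an assumption, since the slide's + signs were stripped):

```python
def fuzzy_rule_output(memberships, coefficients, intercepts, pcs):
    """Weighted-average defuzzification: rule i fires with degree mu_i and
    proposes a linear consequent y_i = intercept_i + sum_j c_ij * PC_j;
    the crisp prediction is sum(mu_i * y_i) / sum(mu_i)."""
    ys = [b + sum(c * p for c, p in zip(cs, pcs))
          for b, cs in zip(intercepts, coefficients)]
    num = sum(mu * y for mu, y in zip(memberships, ys))
    return num / sum(memberships)
```

For example, two rules with equal firing degrees and consequents 1 and 3 give the crisp value 2.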
50. Fuzzy Membership Functions Used in Rules
51. Future Work
3) Extend the Method to Mixture Toxicity Prediction
[Diagram: training and testing on mixtures with similar constituents and with dissimilar constituents.]
52. Acknowledgements
Crystal Faraday Partnership on Green Technology
F V Buontempo, M Mwense, A Young, D Osborn
AstraZeneca Brixham Environmental Laboratory
NERC Centre for Ecology and Hydrology