QSAR/QSPR Model development and Validation for successful prediction and interpretation

1
QSAR/QSPR Model development and
Validation for successful prediction and
interpretation
In the name of GOD
8th Iranian Workshop on Chemometrics, IASBS,
7-9 Feb 2009
Mohsen Kompany-Zareh
2
Contents
  • Introduction
  • Selwood data set (all descriptors)
  • Model development
  • Model validation
  • Statistical diagnostics (R2, q2, RMSEC,
    RMSEP, RMSECV)
  • Internal validation
  • QUIK
  • Selwood data (a number of descriptors)
  • Descriptor selection
  • LMO and Jackknife
  • Cross model validation
  • Bootstrapping
  • Training and test set selection
  • Leverage

3
Introduction
QSPR/QSAR (quantitative structure-property/activity
relationship): a mathematical relation between
structural attribute(s) and a property (an
activity) of a set of chemicals.
Application: prediction of a property for a
variety of chemicals,
  • prior to expensive synthesis and experimental
    measurement.
  • To determine environmental risk of thousands of
    untested industrial chemicals.

Application: description of a mechanism of action
for a variety of chemicals.
4
Introduction
[Figure: descriptor matrix X (rows: molec. 1 to molec. 6; columns: Surf. Area, MW, Lipoph., LUMO) linked by the QSAR model to the vector y of activities]
5
Introduction
Data preparation
1. Collection and cleaning of target property
data: selection of accurate, precise and
consistent experimental data.
2. Calculation of molecular descriptors for
chemicals with acceptable target properties
(after optimization of conformation):
more than 3000 descriptors.
Software: DRAGON (Todeschini et al., 2001), ADAPT
(Jurs, 2002; Stuper and Jurs, 1976), OASIS
(Mekenyan and Bonchev, 1986), CODESSA (Katritzky
et al., 1994), Gaussian.
6
Introduction
Unique numerical representation of molecular
structure in terms of a few molecular descriptors
that capture salient compositional, electronic
and steric attributes.
Selected from a very large number of descriptors
produced by different software packages.
As few explanatory descriptors as possible, for
simple interpretation of the model (sometimes by
variable selection).
Descriptors:
  • Topologic (edges and vertices)
  • Geometric (surface, volume)
  • Electronic (electron density, local charges)
  • Constitutional (C, OH, ...)
7
Data set
31 molecules, 53 descriptors
Selwood data: D (31x53), y (31x1)
Selwood, et al., J Med Chem (1990) 33, 136.
31 antifilarial antimycin analogues characterized
by 53 physicochemical descriptors
>> load selwood.txt
>> D = selwood(:,1:end-1);
>> y = selwood(:,end);
8
Model development
Model generation
Independent variables: descriptors
Dependent variables: properties (activities)
  • Model development methods:
  • Multiple linear regression (MLR),
  • Partial least squares (PLS),
  • Artificial neural networks (ANNs),
  • k-nearest neighbor

samples < descr.s !!
9
Model development
Multiple Linear Regression
Simplest model:
$D b = y$
$b = D^{+} y$
>> b = D\y
>> yEST = D*b
Model is developed? Application of model
? Validation?
22 of 53 coeff.s are zero!!
R2 = 1
b = 0
10
Model development
Other statistical diagnostics
Coefficient of determination, R2
Fraction of dependent variable variance explained
by a model (e.g. MLR model). Closer to unity is
better. It is a measure of the quality of fit
between model-predicted and experimental values,
and does not reflect the predictive power, at
all.
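For reference, the usual definition behind this diagnostic can be written as follows. The symbols are assumptions of this note: $y_i$ is the experimental value, $\hat{y}_i$ the model-fitted value and $\bar{y}$ the mean of the experimental values, all over the $n$ training compounds.

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    = 1 - \frac{\text{residual SS}}{\text{total variance SS}}
```

Because both sums run over the same compounds that were used to fit the model, R2 can only describe the quality of fit.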
11
Model development
Many QSPR/QSAR practitioners find data
preparation and model generation steps sufficient
to arrive at an acceptable model !! They do not
include model validation in model development.
Ex: Schultz, et al., Toxicity of Tetrahymena
Pyriformis, QSAR 2002 meeting, May 25-29, Ottawa,
Canada.
log(1/IGC50) = 0.54 logKw 8.90 LUMO 0.99
n = 11, r2 = 0.82, s = 0.28, r2cv = 0.64
n/descr = 11/2 > 5, but r2cv < r2 of fit: unstable
model
12
Model development
Ex: Akers et al., Struc.-tox. Relat. for selected
halogenated aliphatic chemicals, Environ.
Toxicol. Pharm. (1999) 7, 33-39.
Claim: "The goodness of fit is satisfactory for
predictive purposes."
Ex: Benigni et al., QSAR of mutagenic and carcinogenic
aromatic amines, Chem. Rev. (2000) 100, 3697-3714.
"... use of a limited set of individual parameters
with clear mechanistic significance is still the
best approach that ensure the optimal
comprehension of the results and gives the
possibility of performing non-formal validations
much superior to those provided by statistics" !!
13
Problem
  • Sometimes a model that fits the training set
    very accurately predicts validation sets
    poorly !!

... so, the model is not reliable !!
14
Model validation
Model validation
Real utility of a QSAR/QSPR model is its ability
to accurately predict the modeled
property/activity for new chemicals.
Quantitative assessment of model robustness and
its predictive power.
Definition of the application domain of the model
in the space of applied chemical descriptors
15
Model validation
External validation
Division into calibration and test sets
>> calD = [D(1:3:end,:); D(2:3:end,:)];
>> valD = D(3:3:end,:);
>> caly = [y(1:3:end,:); y(2:3:end,:)];
>> valy = y(3:3:end,:);
There are many different methods for selection of
members of the training and test sets.
>> b = calD\caly;   % model development
16
Model validation
>> calyEST = calD*b;
R2 = 1
>> valyEST = valD*b;   % model validation
Not good prediction
17
Model validation
>> calyEST = calD*b;   % root mean square error of calibration
>> rmsec = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)
RMSEC = 2.9396e-014
>> valyEST = valD*b;   % root mean square error of validation
>> rmsep = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)
RMSEP = 2.2940
Not good prediction
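The contrast between RMSEC and RMSEP can be reproduced without the Selwood file. The following is a minimal sketch on synthetic data, not the presenter's script: the matrix D, the response y and the sizes n and p are assumptions chosen only so that, as on the slides, there are more descriptors than calibration samples.

```matlab
% Synthetic stand-in for a descriptor matrix (NOT the Selwood data).
n = 30; p = 25;
D = sin((1:n)' * (1:p));                     % deterministic pseudo-descriptors
y = D(:,1) - 2*D(:,2) + 0.1*cos(7*(1:n)');   % property depends on 2 descriptors

% Every third molecule goes to the validation set, as on the slide.
calD = [D(1:3:end,:); D(2:3:end,:)];  caly = [y(1:3:end); y(2:3:end)];
valD = D(3:3:end,:);                  valy = y(3:3:end);

b = calD \ caly;                             % MLR with more descriptors than samples

% Root mean square errors of calibration and of prediction.
rmsec = sqrt(sum((caly - calD*b).^2) / size(calD,1));
rmsep = sqrt(sum((valy - valD*b).^2) / size(valD,1));
fprintf('RMSEC = %.3g, RMSEP = %.3g\n', rmsec, rmsep);
```

With 20 calibration samples and 25 descriptors the fit to the calibration set is expected to be essentially exact, so RMSEC says nothing about how the model behaves on the held-out molecules.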
18
Model validation
  • A model with high R2 could be a poor predictor because of:
  • Variable multicollinearity,
  • Statistically insignificant model descriptors,
  • High leverage points in the training set.

A regression model with k descriptors and n
training set compounds may be acceptable for
validation only if n > 4k and, for each of the k
descriptors, the pair-wise correlation
coefficient is < 0.9 and the tolerance is > 0.1.
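These three acceptance checks are easy to script. The sketch below uses a hypothetical 13-compound, 3-descriptor matrix X with made-up values; it assumes the usual definition of tolerance as 1 - R2_j, where R2_j is obtained by regressing descriptor j on the other descriptors, which equals the reciprocal of the j-th diagonal element of the inverse correlation matrix.

```matlab
% Hypothetical training set: 13 compounds, 3 descriptors (illustrative values).
t = (1:13)';
X = [t, t.^2, sin(t)];            % descriptors 1 and 2 are strongly correlated
[n, k] = size(X);

R = corrcoef(X);                  % pair-wise correlation matrix of the descriptors
offdiag = abs(R - eye(k));        % remove the unit diagonal
maxPairCorr = max(offdiag(:));

% Tolerance of descriptor j = 1 - R2_j = 1 / (j-th diagonal element of inv(R)).
tol = 1 ./ diag(inv(R));

okSize = n > 4*k;                 % enough compounds per descriptor
okCorr = maxPairCorr < 0.9;       % no highly correlated descriptor pair
okTol  = all(tol > 0.1);          % no descriptor explained by the others
fprintf('n > 4k: %d, max |r| = %.3f, min tolerance = %.3f\n', ...
        okSize, maxPairCorr, min(tol));
```

In this made-up example the size rule passes, while the pair-wise correlation rule fails because the second descriptor is the square of the first.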
19
Model validation
  • Validation strategies:
  • Randomization of the modeled property
    (Y-scrambling).
  • Internal validation: only the training set.
  • External validation: division into
    training and test sets.
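Y-scrambling, the first strategy in the list, can be sketched as follows. The data are synthetic, and the number of permutations (nPerm) and the helper name r2 are assumptions of this example: the model is refitted many times to randomly permuted activities, and a trustworthy model should give a much higher R2 on the true activities than on any scrambled copy.

```matlab
% Synthetic training set: 30 compounds, intercept + 2 descriptors (illustrative).
n = 30;
X = [ones(n,1), sin((1:n)'), cos(2*(1:n)')];
y = 1 + 2*X(:,2) - X(:,3) + 0.1*sin(5*(1:n)');

% R2 of an MLR fit, as an anonymous function of the response vector.
r2 = @(yy) 1 - sum((yy - X*(X\yy)).^2) / sum((yy - mean(yy)).^2);

r2true = r2(y);                       % fit to the real activities
nPerm = 100;
r2scr = zeros(nPerm,1);
for i = 1:nPerm
    r2scr(i) = r2(y(randperm(n)));    % refit with shuffled activities
end
fprintf('R2 = %.3f, scrambled R2: mean %.3f, max %.3f\n', ...
        r2true, mean(r2scr), max(r2scr));
```

If the scrambled fits reach R2 values close to the true one, the original correlation is likely to be a chance result.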

20
Model validation
Predictive power of QSAR models
Should be estimated from a sufficiently large
external test set of compounds that were not used
in the model development.
Golbraikh, et al., Beware of q2!, J Mol Graph
Model (2002) 20, 269-276.
Zefirov, et al., QSAR for boiling points of "small"
sulfides. Are the "high-quality"
structure-property-activity regressions the
real high quality QSAR models?, J Chem Inf
Comput Sci (2001) 41, 1022-1027.
21
Model validation
[Equations: R2 for the training set (Train) and q2 for the test set (Test), with the residual SS term highlighted]
22
Model validation
[Equations: R2 for the training set (Train) and q2 for the test set (Test), with the total variance SS term highlighted]
23
Model validation
Train: R2 = 1.0000
Test: q2 = -8.5220
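A negative q2 follows directly from how the statistic is built. One common form is shown here as an assumption, since the slide's own equation did not survive extraction, and some authors take the reference mean $\bar{y}$ from the training set rather than from the test set:

```latex
q^2 = 1 - \frac{\sum_{i \in \text{test}} (y_i - \hat{y}_i)^2}{\sum_{i \in \text{test}} (y_i - \bar{y})^2}
```

Under this definition q2 = -8.5220 means that the residual SS of the test-set predictions is about 9.5 times the total variance SS, so the model predicts worse than simply using the mean activity.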
24
Internal validation
Internal validation
Cross validation (CV) (applied to the training set)
Leave-one-out (LOO) (common)
Leave-many-out (LMO) (sometimes)
CV correlation coefficient, q2:
similar to R2 !
25
Internal validation
Training set, only
Internal validation
Cross validation
Leave-one-out
Useful when only a small number of molecules is
available.
26
Internal validation
Subsamples (copies from the training set)
# subsamples = # molec.s
27
Internal validation
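[Figure: training set copied into subsamples, each split into a sub-training and a sub-validation part (SubTrain1/SubValid1, SubTrain2/SubValid2, SubTrain3/SubValid3, SubTrain5/SubValid5); the prediction errors are accumulated into cumPRESS]
# subsamples = # molec.s in training set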
28
Internal validation
LOO CV
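for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)];   % leave molecule i out
    valX = X(i,:);
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly;                      % model without molecule i
    valyEST(i) = valX*b;                % predict the left-out molecule
    press(i) = (valyEST(i)-valy).^2;
end
cumpress = sum(press)
rmsecv = sqrt(cumpress/Dr)
q2LOO = 1-((y-valyEST')'*(y-valyEST'))/((y-mean(y))'*(y-mean(y)))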
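The leave-many-out (LMO) variant named earlier differs from the LOO loop only in that a whole group of molecules is left out per round. A minimal sketch on synthetic data, where the number of groups (nGroups) and the interleaved group assignment are assumptions of this example:

```matlab
% Synthetic training set: 20 compounds, intercept + 2 descriptors (illustrative).
n = 20;
X = [ones(n,1), sin((1:n)'), cos(2*(1:n)')];
y = 1 + 2*X(:,2) - X(:,3) + 0.1*sin(5*(1:n)');

nGroups = 5;                          % each group of n/nGroups compounds is left out once
group = mod((0:n-1)', nGroups) + 1;   % interleaved group assignment
yCV = zeros(n,1);
for g = 1:nGroups
    out = (group == g);               % compounds left out in this round
    b = X(~out,:) \ y(~out);          % model from the remaining compounds
    yCV(out) = X(out,:) * b;          % predict the left-out compounds
end
press = sum((y - yCV).^2);            % cumulative PRESS
rmsecv = sqrt(press / n);
q2LMO = 1 - press / sum((y - mean(y)).^2);
fprintf('RMSECV = %.3f, q2LMO = %.3f\n', rmsecv, q2LMO);
```

Setting nGroups equal to n reduces this to leave-one-out.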
29
Internal validation
q2LOO = -4.8574
RMSECV = 2.0397
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
q2ASYMPTOT = 1.0000
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
q2LOO and R2 should not be considerably different.
REJECT
30
Internal validation
Many authors consider q2LOO > 0.5 as an indicator
of the high predictive power of a model and do not
evaluate the model on an external test set, or use
a test set of only one or two compounds.
Ex: Cronin, et al., The importance of hydrophobicity and
... in mechanistically based QSARs for
toxicological endpoints, SAR QSAR Environ. Res.
(2002) 13, 167-176.
Ex: Moss, et al., Q. S. Permeability Relationships for
percutaneous absorption, Toxicol. In Vitro (2002)
16, 299-317.
Ex: Suzuki, et al., Classification of environ.
estrogens by physicochem. properties using PCA
and hierarchical cluster analysis, J Chem Inf
Comput Sci (2001) 41, 718-726.
31
Internal validation
A small value of q2LOO or q2LMO indicates low
prediction ability, but the opposite is not
necessarily true
(a high q2LOO is necessary but not sufficient).
It indicates robustness, but not the prediction
ability of the model.
32
Internal validation
It has been shown that there exists no correlation
between the LOO cross-validation q2LOO and the
correlation coefficient R2 between the predicted
and observed activities for an external test set.
Kubinyi, et al., Three dimensional quant.
similarity-activ. relationships (QSiAR) from SEAL
similarity matrices, J Med Chem (1998) 41,
2553-2564.
Golbraikh, et al., Beware of q2!, J Mol Graph
Model (2002) 20, 269-276.
High q2LOO is the necessary condition for a model
to have a high predictive power, but not a
sufficient condition.
33
QUIK
QUIK
R. Todeschini, et al., Detecting "bad" regression
models: multicriteria fitness functions in
regression analysis, Anal. Chim. Acta (2004) 515,
199-208.
Detects correlation (collinearity)
among the independent variables. Based on the
multivariate correlation index K.
34
QUIK
4 correlated descriptors
[Figure: descriptor matrix M and response y]
>> corr(M)
>> p = size(M,2)
>> CorrEV = svds(corr(M),p)
It seems possible to use svd(M)
35
QUIK
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p)
As a function:
>> KM = QUIK(M)
KM = 1.0000
Maximum correlation between descriptors
>> KMY = QUIK([M Y])   % in the presence of the dependent variable
KMY = 1.0000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
REJECT
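The source of the QUIK function is not shown, so the following is only a sketch of what it might compute, built from the K expression above: $K = \sum_j \left| \lambda_j / \sum_j \lambda_j - 1/p \right| \, / \, \left( 2(p-1)/p \right)$, where $\lambda_j$ are the eigenvalues of the correlation matrix. The names normEV and kindex, the test matrices and the use of the base functions svd and corrcoef in place of svds and corr are assumptions of this example.

```matlab
% Normalized eigenvalues of the correlation matrix (svd of a symmetric
% positive semi-definite matrix returns its eigenvalues).
normEV = @(M) svd(corrcoef(M)) / sum(svd(corrcoef(M)));
% Multivariate correlation index K for the columns of M.
kindex = @(M) sum(abs(normEV(M) - 1/size(M,2))) / (2*(size(M,2)-1)/size(M,2));

t = (1:10)';
Mcorr = [t, 2*t+1, -3*t, 0.5*t];               % four perfectly correlated descriptors
Morth = [1 1 1; 1 -1 -1; -1 1 -1; -1 -1 1];   % three mutually uncorrelated descriptors
y = Morth(:,1);                                % response equal to the first descriptor

Kcorr = kindex(Mcorr);      % all variance in one eigenvalue
KX  = kindex(Morth);        % all eigenvalues equal
KXY = kindex([Morth y]);    % descriptors plus response

% QUIK rule: reject when adding the response barely raises K.
if KXY - KX < 0.05, disp('reject'), else, disp('NOT reject'), end
```

For these constructed matrices K is 1 for the perfectly correlated set and 0 for the uncorrelated set; adding the response raises K to 1/3, so this toy model is not rejected.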
36
QUIK
>> M = rand(4,5)
[Figure: descriptor matrix M and response y]
>> corr(M)
37
QUIK
>> KM = QUIK(M)
KM = 0.5000
>> KMY = QUIK([M Y])
KMY = 0.6000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
NOT REJECTED
38
QUIK
>> KM = QUIK(calD)   % Selwood data, all descriptors
KM = 0.7919
>> KMY = QUIK([calD Y])
KMY = 0.7923
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
REJECTED
39
Development of an MLR model using all descriptors
is not acceptable.
The model can be improved by using a factor-based
method
and by descriptor selection.
40
A number of descriptors
Development of an MLR model using a number of
descriptors.
>> D = Dini(:,[51 37 35 38 39 36 15])
RMSEP = 0.4993
RMSEC = 0.4989
Comparable: improved
41
A number of descriptors
R2 = 0.6495
q2 = 0.5490
Comparable: improved
q2LOO = 0.2816
NOT REJECTED
42
A number of descriptors
>> D = Dini(:,[51 37 35 38 39 36 15])
QUIK
KX = 0.6384
KXY = 0.5996
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
REJECTED
43
A number of descriptors
>> D = Dini(:,[51 1 38])
QUIK
KX = 0.3159
KXY = 0.3953
>> if KMY-KM<0.03, disp('reject'), else, disp('NOT reject'), end
NOT REJECTED
44
Using a proper set of descriptors, improved results
from MLR can be obtained.
But how can the proper set of descriptors be
selected?
45
Descriptor Selection
Descriptor selection
  • Forward selection,
  • Backward elimination,
  • Genetic algorithm
  • Kohonen map
  • SPA
  • CWSPA

46
Descriptor Selection
Rows (descriptors) as input for the Kohonen map
Selwood data matrix (53x31) as input to the Kohonen map
1. Sampling from all regions in descriptor space
2. Sampling from regions in which descriptors
have high correlation with Y (activity)
By Mehdi Vasighi
47
Descriptor Selection
Successive projections algorithm (SPA)
SPA is a forward selection method that starts
with one variable and incorporates a new one at
each iteration, until a specified number N of
variables is reached.
In SPA, to minimize the collinearity between
the selected descriptors, the criterion for the
stepwise selection of variables is their
orthogonality to the previously selected
variables.
Y. Akhlaghi and M. Kompany-Zareh, Application of
RBFNN and successive projections algorithm in a
QSAR study of anti-HIV activity of HEPT
derivatives, Journal of Chemometrics (2006) 20,
1-12.
48
Descriptor Selection
Important parameters:
1. Starting vector
2. N, maximum number of descriptors
Araujo, et al., The successive projections
algorithm for variable selection in spectroscopic
multicomponent analysis, Chemom. Intell. Lab.
Syst. (2001) 57, 65-73.
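The projection step of SPA can be sketched in a few lines. This is an illustration under stated assumptions, not the implementation used in the cited papers: the descriptor matrix X is synthetic, the starting vector k0 and the number of descriptors N are arbitrary choices, and the candidate with the largest projected norm is taken at each step.

```matlab
% Synthetic descriptor matrix (illustrative): column 2 nearly duplicates column 1.
t = (1:12)';
X = [sin(t), sin(t) + 0.01*cos(3*t), cos(2*t), t/12];
X = X - repmat(mean(X), size(X,1), 1);   % mean-center each descriptor

k0 = 1;  N = 3;                          % starting vector and number of descriptors
P = X;                                   % working copy that is projected step by step
selected = k0;
for it = 2:N
    x = P(:, selected(end));             % most recently selected (projected) column
    P = P - x * (x' * P) / (x' * x);     % remove its direction from every column
    norms = sqrt(sum(P.^2, 1));          % what is left of each candidate
    norms(selected) = 0;                 % never reselect a descriptor
    [~, next] = max(norms);
    selected(end+1) = next;              % least collinear remaining descriptor
end
disp(selected)
```

The near-duplicate second column loses almost all of its norm after the first projection, so it is passed over in favour of descriptors that carry new information.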
49
Descriptor Selection
Correlation weighted SPA
A limitation of SPA is that the only criterion
for the stepwise selection of variables is their
orthogonality to the previously selected
variables; the relation of the entered vector, as
an independent variable, to the response is not
considered.
Incorporating a form of correlation ranking
procedure, in which the variables are weighted by
their correlation coefficient with the dependent
variable, within the SPA procedure overcomes this
limitation of SPA.
M. Kompany-Zareh and Y. Akhlaghi, Correlation
weighted successive projections algorithm: a QSAR
study of anti-HIV activity of HEPT derivatives, J
of Chemom (2007) 21, 239-250.
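One way to picture the correlation weighting is to multiply each candidate's projected norm by the absolute value of its correlation with the response before picking the maximum. That particular weighting is an assumption of this sketch; the exact scheme of the cited paper is not given on the slide. The data, k0 and N are also made up.

```matlab
% Synthetic data (illustrative): the activity depends mainly on descriptor 4.
t = (1:12)';
X = [sin(t), cos(2*t), t/12, cos(5*t)];
X = X - repmat(mean(X), size(X,1), 1);   % mean-center each descriptor
y = X(:,1) + 3*X(:,4);

R = corrcoef([X y]);
w = abs(R(1:end-1, end))';               % |correlation| of each descriptor with y

k0 = 1;  N = 2;                          % starting vector and number of descriptors
P = X;  selected = k0;
for it = 2:N
    x = P(:, selected(end));
    P = P - x * (x' * P) / (x' * x);     % SPA projection step
    score = w .* sqrt(sum(P.^2, 1));     % ASSUMED weighting: projected norm times |r|
    score(selected) = 0;                 % never reselect a descriptor
    [~, next] = max(score);
    selected(end+1) = next;
end
disp(selected)
```

Here the weighting steers the selection towards the descriptor that actually drives the response, even though other candidates have comparable projected norms.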