Title: QSAR/QSPR Model Development and Validation: Essential for Successful Application and Interpretation
8th Iranian Workshop on Chemometrics, IASBS,
7-9 Feb 2009
Mohsen Kompany-Zareh
Selwood data: 31 molecules, 53 descriptors
D (31x53), y (31x1)
>> load selwood.txt
>> D = selwood(:,1:end-1);
>> y = selwood(:,end);
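Simplest model
Multiple Linear Regression
D b = y
b = D^+ y   (D^+: pseudo-inverse of D)
>> b0 = D\y;
>> yEST = D*b0;
Is the model developed? Has it been validated?
22 of the 53 coefficients are zero! D has only 31 rows, so the system is underdetermined and the MATLAB backslash operator returns a basic solution with at most 31 nonzero coefficients.
[Figure: plot of the coefficients b0]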
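The "perfect" fit above is a property of the matrix shape rather than of the chemistry. A minimal sketch, assuming random stand-in numbers in place of the Selwood data (the variable `maxAbsResidual` is introduced here for illustration only), shows that even a pure-noise response is fitted exactly when there are more descriptors than molecules:

```matlab
% Stand-in data with the same shape as the Selwood set (random numbers,
% NOT the real descriptors): 31 molecules, 53 descriptors.
n = 31; p = 53;
D = randn(n, p);
y = randn(n, 1);            % the response is pure noise

b0   = D \ y;               % solve the underdetermined system D*b = y
yEST = D * b0;

maxAbsResidual = max(abs(y - yEST));   % essentially zero
fprintf('largest residual: %g\n', maxAbsResidual);
```

Since noise is fitted just as well as a real response, a small training error alone says nothing about predictive ability.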
Problem
- A model that fits the training set almost perfectly is sometimes a poor predictor for validation sets.
- Such a model is not reliable!
External validation
Many different methods exist for selecting the members of the training and test sets.
Division into calibration and test sets:
>> calD = [D(1:3:end,:); D(2:3:end,:)];
>> valD = D(3:3:end,:);
>> caly = [y(1:3:end,:); y(2:3:end,:)];
>> valy = y(3:3:end,:);
[Diagram: calD and caly are used for model development; valD and valy are used for validation]
>> b1 = calD\caly;   % model development
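The interleaved split sends every third molecule to the validation set and the rest to calibration. A small sketch on a hypothetical 9x4 toy matrix (not the Selwood data; `calIdx` and `valIdx` are names introduced here) makes the row assignment visible:

```matlab
% Hypothetical toy matrix: 9 "molecules", 4 "descriptors".
D = reshape(1:36, 9, 4);    % first column is 1..9, so it doubles as a row label
y = (1:9)';

calIdx = [1:3:9, 2:3:9];    % rows 1,4,7 followed by rows 2,5,8
valIdx = 3:3:9;             % rows 3,6,9

calD = D(calIdx, :);  caly = y(calIdx);
valD = D(valIdx, :);  valy = y(valIdx);

disp(calD(:, 1)');          % row labels that went to calibration
disp(valD(:, 1)');          % row labels that went to validation
```

Applied to a 31-row matrix, the same indexing gives 21 calibration rows and 10 validation rows.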
>> calyEST = calD*b1;
>> valyEST = valD*b1;   % external model validation
The prediction for the validation set is not good.
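>> calyEST = calD*b1;
>> calDr = size(calD,1);   % number of calibration molecules
>> rmsec1 = sqrt(((caly-calyEST)'*(caly-calyEST))/calDr)   % root mean square error of calibration
RMSEC = 2.9396e-014
>> valyEST = valD*b1;   % external model validation
>> valDr = size(valD,1);   % number of validation molecules
>> rmsep1 = sqrt(((valy-valyEST)'*(valy-valyEST))/valDr)   % root mean square error of prediction
RMSEP = 2.2940
The prediction for the validation (test) set is not good.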
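The gap between a near-zero RMSEC and a large RMSEP can be reproduced without the real data. The following is a sketch under the assumption that random numbers stand in for the Selwood descriptors and response; only the shapes and the split follow the slides:

```matlab
% Random stand-in data (NOT the Selwood set), same shape: 31 x 53.
D = randn(31, 53);
y = randn(31, 1);

calIdx = [1:3:31, 2:3:31];  valIdx = 3:3:31;
calD = D(calIdx, :);  caly = y(calIdx);
valD = D(valIdx, :);  valy = y(valIdx);

b1 = calD \ caly;                       % 21 equations, 53 unknowns

calRes = caly - calD * b1;              % calibration residuals
valRes = valy - valD * b1;              % validation residuals
rmsec = sqrt((calRes' * calRes) / numel(caly));
rmsep = sqrt((valRes' * valRes) / numel(valy));
fprintf('RMSEC = %g, RMSEP = %g\n', rmsec, rmsep);
```

The calibration error sits at machine precision while the prediction error does not, which is the same pattern as the RMSEC and RMSEP values reported above.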
[Figure: residual SS, training set vs. test set]
[Figure: total variance SS, training set vs. test set]
Train: R2 = 1.0000
Test: q2 = -8.5220
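A negative q2 simply means that the residual sum of squares on the test set exceeds the total sum of squares, so the model predicts worse than the mean would. A tiny worked example, assuming q2 = 1 - (residual SS)/(total SS) with the total SS taken around the mean of the test responses (the numbers are made up for illustration):

```matlab
valy    = [1; 2; 3; 4];     % made-up observed values
valyEST = [4; 3; 2; 1];     % made-up predictions (reversed order)

press = (valy - valyEST)' * (valy - valyEST);            % residual SS = 20
sstot = (valy - mean(valy))' * (valy - mean(valy));      % total SS    = 5
q2 = 1 - press / sstot;                                  % 1 - 20/5 = -3
```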
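Training set
Internal validation
Cross validation
Leave-one-out
[Diagram: the training set is repeatedly split into a development part and a validation part; the squared prediction errors are accumulated into cumPRESS. Number of subsamples = number of molecules in the training set]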
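LOO CV
Dr = size(X,1);   % number of molecules in X
for i = 1:Dr
    calX = [X(1:i-1,:); X(i+1:Dr,:)];   % all molecules except the i-th
    valX = X(i,:);                      % the left-out molecule
    caly = [y(1:i-1,:); y(i+1:Dr,:)];
    valy = y(i,:);
    b = calX\caly;
    valyEST(i) = valX*b;
    press(i) = (valyEST(i)-valy).^2;
end
cumpress = sum(press)
rmsecv = sqrt(cumpress/Dr)
q2LOO = 1-((y-valyEST')'*(y-valyEST'))/((y-mean(y))'*(y-mean(y)))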
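For contrast, the same leave-one-out loop can be run on a well-posed problem. This is a sketch on made-up data (20 samples, 2 descriptors, a small amount of noise); the index vector `keep` is a name introduced here, and the loop is otherwise the procedure described above:

```matlab
% Made-up, well-posed data: 20 samples, 2 descriptors, small noise.
n = 20;
X = randn(n, 2);
y = X * [1; 2] + 0.01 * randn(n, 1);

valyEST = zeros(n, 1);
for i = 1:n
    keep = [1:i-1, i+1:n];              % all rows except row i
    b = X(keep, :) \ y(keep);           % develop the model without sample i
    valyEST(i) = X(i, :) * b;           % predict the left-out sample
end

press    = (y - valyEST) .^ 2;          % squared prediction errors
cumpress = sum(press);
rmsecv   = sqrt(cumpress / n);
q2LOO    = 1 - cumpress / sum((y - mean(y)) .^ 2);
fprintf('RMSECV = %g, q2LOO = %g\n', rmsecv, q2LOO);
```

With many more samples than descriptors and a response that really depends on them, q2LOO comes out close to 1, unlike the overfitted 53-descriptor model.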
q2LOO = -4.8574
RMSECV = 2.0397
>> q2ASYMPTOT = 1-(1-R2)*(calDr/(calDr-calDc))^2
q2ASYMPTOT = 1.0000
>> if q2LOO-q2ASYMPTOT<0.005, disp('reject'), end
REJECT
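The rejection can be checked by hand with the values reported on the slides. In this sketch, calDr = 21 is an assumption derived from the interleaved split of 31 molecules; because 1 - R2 is zero here, the result does not depend on it:

```matlab
R2    = 1.0000;     % training-set R2 reported on the slides
q2LOO = -4.8574;    % leave-one-out q2 reported on the slides
calDr = 21;         % assumed: rows 1:3:31 and 2:3:31 of a 31-row matrix
calDc = 53;         % number of descriptors

q2ASYMPTOT = 1 - (1 - R2) * (calDr / (calDr - calDc))^2;
reject = (q2LOO - q2ASYMPTOT) < 0.005;  % same threshold as on the slide
```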
QUIK
4 correlated descriptors
M =
   2  1  1  1
   4  2  2  2
   6  3  3  3
   8  4  4  4
  10  5  5  5
y =
  10
  20
  30
  40
  50
>> corr(M)
   1  1  1  1
   1  1  1  1
   1  1  1  1
   1  1  1  1
>> p = size(M,2);
>> CorrEV = svds(corr(M),p);
It seems possible to use svd(M)
>> K = sum(abs((CorrEV/sum(CorrEV))-(1/p)))/(2*(p-1)/p)
All in a function:
>> KM = QUIK(M)
KM = 1.0000
Maximum correlation between descriptors
>> KMY = QUIK([M y])
KMY = 1.0000
>> if KMY-KM<0.05, disp('reject'), else, disp('NOT reject'), end
REJECT
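The body of the QUIK function is not shown on the slides. The sketch below is one possible reading of it, built only from the K formula given above; `kindex` and `quikK` are hypothetical names, and `corrcoef` with `svd` is swapped in for `corr` with `svds` so that no toolbox is assumed:

```matlab
M = [2 1 1 1; 4 2 2 2; 6 3 3 3; 8 4 4 4; 10 5 5 5];
y = [10; 20; 30; 40; 50];

% K index from the eigenvalues of a correlation matrix (p = number of columns).
kindex = @(ev, p) sum(abs(ev / sum(ev) - 1 / p)) / (2 * (p - 1) / p);
% Hypothetical stand-in for the QUIK function named on the slide.
quikK  = @(A) kindex(svd(corrcoef(A)), size(A, 2));

KM  = quikK(M);         % descriptors only
KMY = quikK([M y]);     % descriptors plus response
if KMY - KM < 0.05, verdict = 'reject'; else, verdict = 'NOT reject'; end
disp(verdict);
```

For a correlation matrix of all ones the only nonzero eigenvalue equals p, so both K values are 1 up to rounding; adding the response brings no increase in K, and the model is rejected.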