2 Analyze/StripMiner
- Analyze/StripMiner Overview
- To obtain an idiot's guide, type: analyze > readme.txt
- Standard Analyze Scripts
- Predicting on Blind Data
- PLS (Please Listen to Svante Wold)
- LOO, BOO and n-Fold Cross-Validation Error Measures
- Albumin Data Set and Feature Selection
- Bio-Informatics
3 Analyze/StripMiner
- Data Processing
  - Interface with RECON
  - Different Scaling Modes
  - Outlier detection/data cleansing
- Modeling
  - ANN (Neural Networks)
  - SVM (Support Vector Machines)
  - PLS (Partial-Least Squares)
  - GA-based regression clustering
  - PCA regression
  - Local Learning
  - Outlier Detection (GAMOL)
- Visualization
  - Correlation Plots
  - 2-D Sensitivity Plots
  - Outlier Visualization Plots
  - Different Scaling Options
  - Cluster Ranking Plots
  - Standard ROC curves
  - Continuous ROC curves
- Learning Modes
  - Bootstrapping
  - Bagging
  - Boosting
  - Leave-one-out cross-validation
- Code Specifics
  - Tight Classic C-code (< 15000 lines)
  - Script-Based Shell Program
  - Runs on all Platforms
  - Ultra Fast
- Use TransScan GE - KODAK
- Doppler broadening
- Macro-Economics Analysis
- Feature Selection
  - Sensitivity Analysis
  - Genetic Algorithms
  - Correlation GA (GAFEAT)
  - Method specific
DDASSL
4 Analyze/StripMiner Coding Philosophy
- Standard C code that compiles on all platforms
- WINDOWS and Linux platforms
- Supporting visualizations use Java and/or gnuplot
- Flexible GUI with sample problems and demos
- Fastest code possible with efficient memory requirements
- Long history of code use with variety of users for troubleshooting
- Flexible code based on scripts and operators
- Operates on a numeric standard data mining format file
5 Practical Tips for PCA
- NIPALS algorithm assumes the features are zero centered
- It is standard practice to do a Mahalanobis scaling of the data
- PCA regression does not consider the response data
- The t's are called the scores
- It is common practice to drop 4-sigma outlier features (if there are many features)
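The tips above can be illustrated with a minimal NIPALS sketch in plain Python. It is not the StripMiner C code. It assumes that "Mahalanobis scaling" here means centering each feature to zero mean and scaling it to unit variance, and the names standardize and nipals_pca are hypothetical. The returned t vectors are the scores; the response data are never used.

```python
import math


def standardize(rows):
    """Center each feature to zero mean and scale it to unit variance.

    Assumption: this is what the slides call Mahalanobis scaling.
    """
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    stds = []
    for j in range(m):
        var = sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1)
        stds.append(math.sqrt(var) or 1.0)  # constant features stay unscaled
    scaled = [[(r[j] - means[j]) / stds[j] for j in range(m)] for r in rows]
    return scaled, means, stds


def nipals_pca(rows, n_components, tol=1e-10, max_iter=500):
    """Return (scores, loadings) for zero-centered data via NIPALS."""
    x = [list(r) for r in rows]  # working copy, deflated after each component
    n, m = len(x), len(x[0])
    scores, loadings = [], []
    for _ in range(n_components):
        # Start from the column with the largest sum of squares.
        col = max(range(m), key=lambda j: sum(r[j] ** 2 for r in x))
        t = [r[col] for r in x]
        for _ in range(max_iter):
            tt = sum(v * v for v in t)
            # Loading p = X't / (t't), then normalized to unit length.
            p = [sum(x[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]
            # Score t = Xp.
            t_new = [sum(x[i][j] * p[j] for j in range(m)) for i in range(n)]
            change = sum((a - b) ** 2 for a, b in zip(t_new, t))
            t = t_new
            if change < tol * sum(v * v for v in t):
                break
        # Deflate: remove the component t p' from the data.
        for i in range(n):
            for j in range(m):
                x[i][j] -= t[i] * p[j]
        scores.append(t)
        loadings.append(p)
    return scores, loadings
```

A typical call would be scaled, means, stds = standardize(data) followed by scores, loadings = nipals_pca(scaled, 2), which matches the 2-10 components the scripts take from num_eg.txt.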
6 StripMiner Script Examples
- PCA visualization (pca.bat)
- Pharma-plot (pharma.bat)
- Prediction for iris with PCA (iris.bat)
- Bootstrap prediction for iris (iris_boo.bat)
- Predicting with an external test set example (iris_ext.bat)
- PLS and ROC curve for iris problem (roc.bat)
- Leave-One-Out PLS for HIV (loo_hiv.bat)
- Feature selection for HIV (prune.bat)
- Starplots (star.bat)
7 File Flow for PCA.bat Script
num_eg.txt
stats.txt la_sscala.txt iris.txt.txt.txt.txt
- num_eg.txt contains the number of principal components (2-10)
- Usually data are first Mahalanobis scaled (option -3 PLS scaling, data only)
8 File Flow for pharma.bat Script
num_eg.txt
stats.txt la_sscala.txt dmatrix.txt a.txt
pharmaplot
- num_eg.txt has to contain a 4 for a pharmaplot
- Use pharmaplot.m for visualization in MATLAB
- Adjust color setting threshold in pharmaplot.m
9 File Flow for iris.bat Script: Predicting Class
stats.txt la_sscala.txt a.txt cmatrix.txt dmatrix.txt resultss.xxx resultss.ttt results.xxx results.ttt
num_eg.txt
- For the random seed in the splitting routine, don't use 0 (preserves order)
- The test set is really only for validation purposes (answer is known)
- Note descaling from PLS uses the la_sscala.txt file
- Notice q2, Q2, and RMSE error measures
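The seed remark can be illustrated with a small sketch of a training/test split. It only mimics the behavior described on the slide (seed 0 leaves the rows in their original order) and is not the StripMiner splitting routine; the function name split_rows and its arguments are hypothetical.

```python
import random


def split_rows(rows, n_test, seed):
    """Split rows into (train, test), holding out n_test rows.

    Assumption: seed 0 means "do not shuffle", so the last n_test rows
    become the test set and the original order is preserved.
    """
    order = list(range(len(rows)))
    if seed != 0:
        random.Random(seed).shuffle(order)  # reproducible shuffle
    cut = len(order) - n_test
    train = [rows[i] for i in order[:cut]]
    test = [rows[i] for i in order[cut:]]
    return train, test
```

With a nonzero seed the same split is reproduced on every run, which is convenient when comparing models on identical training and test sets.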
10 File Flow for iris_boo.bat Script: Bootstrap Validation for Estimating Prediction Confidence
stats.txt la_sscala.txt a.txt resultss.xxx resultss.ttt results.ttt
num_eg.txt
- We use bootstrap cross-validation (e.g., leave 7 out 100 times)
- Use MATLAB script dos_mbotw results.ttt to display results for the test set
- Use MATLAB script dos_mbotw resultss.xxx to display results for the training set
- Notice q2, Q2, and RMSE error measures
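The leave-7-out, 100-times scheme can be sketched as follows. This is an illustration only: a one-feature least-squares line stands in for the PLS model, and the names fit_line and leave_k_out are hypothetical. Each repeat holds out k random samples, fits on the rest, and pools the held-out predictions.

```python
import random


def fit_line(xs, ys):
    """Fit y = a + b*x by least squares (a stand-in for the real model)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return lambda x: my + slope * (x - mx)


def leave_k_out(xs, ys, k=7, repeats=100, seed=1):
    """Return pooled (observed, predicted) pairs for the held-out samples."""
    rng = random.Random(seed)
    pooled = []
    for _ in range(repeats):
        held = set(rng.sample(range(len(xs)), k))
        train = [i for i in range(len(xs)) if i not in held]
        model = fit_line([xs[i] for i in train], [ys[i] for i in train])
        pooled.extend((ys[i], model(xs[i])) for i in sorted(held))
    return pooled
```

The spread of the pooled predictions for a given sample gives a rough picture of prediction confidence, which is what the error-bar plots are meant to show.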
11 Error Measure Criteria
For training set we use
- RMSE: root mean square error for training set
- r2: squared correlation coefficient for training set
- R2: PRESS R2
For validation/test set we use
- RMSE: root mean square error for validation set
- q2 = 1 - r2 on the test set
- Q2 = PRESS/SD
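A minimal sketch of these measures in plain Python follows. Two readings are assumptions rather than statements from the slides: SD is taken as the sum of squared deviations of the observed responses from their mean, and "PRESS R2" is taken as 1 - PRESS/SD. All function names are hypothetical.

```python
import math


def press(y, yhat):
    """Sum of squared prediction errors."""
    return sum((a - b) ** 2 for a, b in zip(y, yhat))


def sd(y):
    """Assumed meaning of SD: sum of squared deviations from the mean."""
    mean = sum(y) / len(y)
    return sum((a - mean) ** 2 for a in y)


def rmse(y, yhat):
    """Root mean square error."""
    return math.sqrt(press(y, yhat) / len(y))


def r2(y, yhat):
    """Squared correlation coefficient between observed and predicted."""
    my, mh = sum(y) / len(y), sum(yhat) / len(yhat)
    sxy = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mh) ** 2 for b in yhat)
    return sxy * sxy / (sxx * syy)


def press_r2(y, yhat):
    """Assumed meaning of PRESS R2: 1 - PRESS/SD."""
    return 1.0 - press(y, yhat) / sd(y)


def q2(y_test, yhat_test):
    """q2 = 1 - r2 on the test set (0 is perfect)."""
    return 1.0 - r2(y_test, yhat_test)


def big_q2(y_test, yhat_test):
    """Q2 = PRESS/SD on the test set (0 is perfect)."""
    return press(y_test, yhat_test) / sd(y_test)
```

Under these definitions both q2 and Q2 are "smaller is better" measures, and Q2 also penalizes predictions that correlate well with the observed values but are offset or wrongly scaled.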
12 Script for Scaling with an External Test Set
- 3305 scatterplot (Java)
- -3305 scatterplot (gnuplot)
- 3313 errorplot (Java)
- -3313 errorplot (gnuplot)
14 Docking Ligands is a Nonlinear Problem
DDASSL: Drug Design and Semi-Supervised Learning
15 Feature Selection (data strip mining)
PLS, K-PLS, SVM, ANN
Fuzzy Expert System Rules
GA or Sensitivity Analysis to select descriptors
16
- Binding affinities to human serum albumin (HSA), log Khsa
- Gonzalo Colmenarejo, GlaxoSmithKline
- J. Med. Chem. 2001, 44, 4370-4378
- 95 molecules, 250-1500 descriptors
- 84 training, 10 testing (1 left out)
- 551 Wavelet PEST MOE descriptors
- Widely different compounds
- Acknowledgements: Sean Ekins (Concurrent), N. Sukumar (Rensselaer)
17 Script for ALBUMIN_LOO.BAT: PLS-LOO Validation for Albumin Data
cmatrix.ori dmatrix.ori num_eg.txt
stats.txt la_sscala.txt a.txt results.xxx results.ttt sel_lbls.txt bbmatrixx.txt bbmatrixxx.txt
- PLS-LOO stands for leave-one-out PLS cross-validation
- Training set is in cmatrix.ori and external validation set in dmatrix.ori
- External validation set has 999 or 0 in the activity field
- Note that we create generic labels and that there is a test set
- Notice the dropping of non-changing features and 4-sigma outliers
- Notice the acrobatics for displaying metrics (visualize with dos_mbotw)
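The feature clean-up step mentioned above can be sketched as below. The slides do not define a "4-sigma outlier" precisely; this sketch assumes a feature is dropped when any sample lies more than four standard deviations from that feature's mean. The function name keep_feature_indices is hypothetical and this is not the StripMiner implementation.

```python
import math


def keep_feature_indices(rows, n_sigma=4.0):
    """Return indices of features that change and have no n-sigma outlier."""
    n, m = len(rows), len(rows[0])
    keep = []
    for j in range(m):
        col = [r[j] for r in rows]
        if max(col) == min(col):
            continue  # non-changing feature: carries no information
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        # Assumed rule: drop the feature if any value is beyond n_sigma.
        if max(abs(v - mean) for v in col) > n_sigma * std:
            continue
        keep.append(j)
    return keep
```

Note that with n samples the largest possible deviation is (n - 1)/sqrt(n) sample standard deviations, so a 4-sigma rule of this kind can only trigger when there are at least 18 samples.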
20 PLS Feature Selection Script for Albumin Data
aa.pat bbmatrixx.txt sel_lbls.txt
select.txt
sel_lbls.txt aa.pat aa.tes bbmatrixx.txt bbmatrixxx.txt
- Do several iterative prunings, typically leave 7 out 100 x
- Use different seeds
- Number of selected features, for example: 400, 300, 200, 150, 120, 100, 80, 60, 50, 45,
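The iterative pruning loop can be sketched as follows. This is a simplified illustration, not the StripMiner script: the absolute correlation of each feature with the response is used as a stand-in for PLS sensitivities, and the names abs_corr and prune are hypothetical. Each pass repeats a leave-k-out resampling with its own seed, averages the sensitivities, and keeps the top-ranked features for the next pass.

```python
import math
import random


def abs_corr(col, y):
    """Absolute correlation with the response (stand-in for a sensitivity)."""
    n = len(y)
    mc, my = sum(col) / n, sum(y) / n
    sxy = sum((c - mc) * (v - my) for c, v in zip(col, y))
    sxx = sum((c - mc) ** 2 for c in col)
    syy = sum((v - my) ** 2 for v in y)
    return abs(sxy) / math.sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def prune(rows, y, schedule, k=7, repeats=100, seed=1):
    """Shrink the feature set step by step following the given schedule."""
    selected = list(range(len(rows[0])))
    for stage, target in enumerate(schedule):
        rng = random.Random(seed + stage)  # a different seed for each pass
        totals = {j: 0.0 for j in selected}
        for _ in range(repeats):
            held = set(rng.sample(range(len(rows)), k))
            train = [i for i in range(len(rows)) if i not in held]
            y_train = [y[i] for i in train]
            for j in selected:
                totals[j] += abs_corr([rows[i][j] for i in train], y_train)
        ranked = sorted(selected, key=lambda j: totals[j], reverse=True)
        selected = sorted(ranked[:target])  # keep the most sensitive features
    return selected
```

A call such as prune(rows, y, [400, 300, 200, 150, 120, 100]) would follow the first steps of the schedule listed above; pruning gradually rather than in one cut lets the ranking be re-estimated on each smaller feature set.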
21 DDASSL
22 STARPLOT.BAT: Starplot for Selected Features for Albumin
sel_lbls.txt aa.pat
bbmatrixxx.txt
sel_lbls.txt starplot.txt
starplot
- First generate bbmatrixxx.txt which contains all sensitivities for (e.g.) 30 bootstraps
- using PLS bootstrap option 33
- Generate starplot.txt from bbmatrixxx.txt using option 3320
- Use the MATLAB routine starplot.m (operates on starplot.txt and sel_lbls.txt)