Provided by: markemb
2
Analyze/StripMiner
  • Analyze/StripMiner Overview
  • To obtain an idiot's guide, type: analyze >
    readme.txt
  • Standard Analyze Scripts
  • Predicting on Blind Data
  • PLS (Please Listen to Svante Wold)
  • LOO, BOO and n-Fold Cross-Validation Error
    Measures
  • Albumin Data Set and Feature Selection
  • Bio-Informatics

3
Analyze/StripMiner
  • Data Processing
      • Interface with RECON
      • Different Scaling Modes
      • Outlier detection/data cleansing
  • Modeling
      • ANN (Neural Networks)
      • SVM (Support Vector Machines)
      • PLS (Partial Least Squares)
      • GA-based regression clustering
      • PCA regression
      • Local Learning
      • Outlier Detection (GAMOL)
  • Visualization
      • Correlation Plots
      • 2-D Sensitivity Plots
      • Outlier Visualization Plots
      • Different Scaling Options
      • Cluster Ranking Plots
      • Standard ROC curves
      • Continuous ROC curves
  • Learning Modes
      • Bootstrapping
      • Bagging
      • Boosting
      • Leave-one-out cross-validation
  • Code Specifics
      • Tight Classic C-code (< 15000 lines)
      • Script-Based Shell Program
      • Runs on all Platforms
      • Ultra Fast
  • Use
      • TransScan
      • GE - KODAK
      • Doppler broadening
      • Macro-Economics Analysis
  • Feature Selection
      • Sensitivity Analysis
      • Genetic Algorithms
      • Correlation GA (GAFEAT)
      • Method specific

DDASSL
4
Analyze/StripMiner Coding Philosophy
  • Standard C code that compiles on all platforms
  • Windows and Linux platforms
  • Supporting visualizations use Java and/or
    gnuplot
  • Flexible GUI with sample problems and demos
  • Fastest code possible with efficient memory
    requirements
  • Long history of code use with a variety of users
    for troubleshooting
  • Flexible code based on scripts and operators
  • Operates on a numeric standard data mining
    format file

5
Practical Tips for PCA
  • NIPALS algorithm assumes the features are zero
    centered
  • It is standard practice to do a Mahalanobis
    scaling of the data
  • PCA regression does not consider the response
    data
  • The t's are called the scores
  • It is common practice to drop 4-sigma outlier
    features (if there are many features)
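
To make the first two tips concrete, here is a minimal pure-Python sketch of scaling followed by one NIPALS component. It assumes that "Mahalanobis scaling" means centering each feature and dividing by its standard deviation, and that no feature is constant; the function names autoscale and nipals_component are illustrative and are not part of Analyze/StripMiner.

```python
import math

def autoscale(X):
    """Center each column to zero mean and scale it to unit standard deviation.

    Assumes every column varies (a constant column would divide by zero).
    """
    n = len(X)
    cols = list(zip(*X))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (n - 1))
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def nipals_component(X, tol=1e-12, max_iter=500):
    """Return (scores t, loadings p) for the first component of centered X."""
    n_rows, n_cols = len(X), len(X[0])
    t = [row[0] for row in X]  # start from the first column
    p = [0.0] * n_cols
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        # Loadings: project the columns of X onto the current scores.
        p = [sum(t[i] * X[i][j] for i in range(n_rows)) / tt
             for j in range(n_cols)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project the rows of X onto the normalized loadings.
        t_new = [sum(x * q for x, q in zip(row, p)) for row in X]
        change = sum((a - b) ** 2 for a, b in zip(t, t_new))
        t = t_new
        if change < tol:
            break
    return t, p
```

Further components would be obtained by subtracting the outer product of t and p from X and repeating; that deflation step is left out to keep the sketch short.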

6
StripMiner Script Examples
  • PCA visualization (pca.bat)
  • Pharma-plot (pharma.bat)
  • Prediction for iris with PCA (iris.bat)
  • Bootstrap prediction for iris (iris_boo.bat)
  • Predicting with an external test set example
    (iris_ext.bat)
  • PLS and ROC curve for iris problem (roc.bat)
  • Leave-One-Out PLS for HIV (loo_hiv.bat)
  • Feature selection for HIV (prune.bat)
  • Starplots (star.bat)

7
File Flow for PCA.bat Script
num_eg.txt
stats.txt la_sscala.txt iris.txt.txt.txt.txt
  • num_eg.txt contains the number of principal
    components (2-10)
  • Usually the data are first Mahalanobis scaled
    (option -3: PLS scaling, data only)

8
File Flow for pharma.bat script
num_eg.txt
stats.txt la_sscala.txt dmatrix.txt a.txt
pharmaplot
  • num_eg.txt has to contain a 4 for a pharmaplot
  • Use pharmaplot.m for visualization in MATLAB
  • Adjust the color setting threshold in
    pharmaplot.m

9
File Flow For iris.bat Script Predicting Class
stats.txt la_sscala.txt a.txt cmatrix.txt dmatrix.txt
resultss.xxx resultss.ttt results.xxx results.ttt
num_eg.txt
  • For the random seed in the splitting routine,
    don't use 0 (it preserves the order)
  • The test set is really only for validation
    purposes (the answer is known)
  • Note that descaling from PLS uses the
    la_sscala.txt file
  • Notice the q², Q², and RMSE error measures

10
File Flow for iris_boo.bat Script Bootstrap
Validation for Estimating Prediction Confidence
stats.txt la_sscala.txt a.txt resultss.xxx resultss.ttt
results.ttt
num_eg.txt
  • We use bootstrap cross-validation (e.g., leave 7
    out 100 times)
  • Use the MATLAB script dos_mbotw results.ttt to
    display results for the test set
  • Use the MATLAB script dos_mbotw resultss.xxx to
    display results for the training set
  • Notice the q², Q², and RMSE error measures
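
The "leave 7 out 100 times" scheme can be sketched as repeated random hold-out splits. This is only an illustration of the idea in Python, not the StripMiner implementation; the function name leave_k_out_splits and its parameters are hypothetical, and the held-out samples are assumed to be drawn without replacement within each repetition.

```python
import random

def leave_k_out_splits(n_samples, k=7, n_repeats=100, seed=1):
    """Return n_repeats (train_indices, held_out_indices) pairs.

    Each repetition holds out k randomly chosen samples and trains on the rest.
    """
    rng = random.Random(seed)  # fixed seed makes the splits reproducible
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_repeats):
        held_out = sorted(rng.sample(indices, k))
        held = set(held_out)
        train = [i for i in indices if i not in held]
        splits.append((train, held_out))
    return splits

# Example: 150 samples (the size of the classic iris data set).
splits = leave_k_out_splits(150, k=7, n_repeats=100, seed=1)
```

Pooling the held-out predictions over all repetitions gives a spread of predictions per sample, which is what makes a confidence estimate possible.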

11
Error Measure Criteria
For the training set we use:
  • RMSE: root mean square error for the training
    set
  • r²: squared correlation coefficient for the
    training set
  • R²: PRESS R²
For the validation/test set we use:
  • RMSE: root mean square error for the validation
    set
  • q² = 1 - r² (r² computed on the test set)
  • Q² = PRESS/SD
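
A small Python sketch of these measures may help. It assumes that PRESS is the sum of squared prediction errors and that SD is the sum of squared deviations of the observed responses from their mean; the slide does not spell out either definition, so treat both as assumptions. The function names are illustrative.

```python
import math

def rmse(y, y_hat):
    """Root mean square error between observed and predicted values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r_squared(y, y_hat):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(y)
    mean_y, mean_p = sum(y) / n, sum(y_hat) / n
    sxy = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    sxx = sum((a - mean_y) ** 2 for a in y)
    syy = sum((b - mean_p) ** 2 for b in y_hat)
    return sxy * sxy / (sxx * syy)

def q2(y, y_hat):
    """q2 = 1 - r2 on the test set (smaller is better)."""
    return 1.0 - r_squared(y, y_hat)

def big_q2(y, y_hat):
    """Q2 = PRESS / SD (assumed definitions, see the note above)."""
    mean_y = sum(y) / len(y)
    press = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    sd = sum((a - mean_y) ** 2 for a in y)
    return press / sd

# Predictions that are perfectly correlated but shifted by a constant:
y = [1.0, 2.0, 3.0, 4.0]
y_hat = [2.0, 3.0, 4.0, 5.0]
```

For the shifted predictions in the example, q2 is 0 while big_q2 is 0.8: under these definitions q² ignores a constant bias and Q² penalizes it, which is one reason to look at both.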
12
Script for Scaling with an External Test Set
  • 3305 scatterplot (Java)
  • -3305 scatterplot (gnuplot)
  • 3313 errorplot (Java)
  • -3313 errorplot (gnuplot)

14
Docking Ligands is a Nonlinear Problem
DDASSL
Drug Design and Semi-Supervised Learning
15
Feature Selection (data strip mining)
PLS, K-PLS, SVM, ANN
Fuzzy Expert System Rules
GA or Sensitivity Analysis to select descriptors
16
  • Binding affinities to human serum albumin
    (HSA): log Khsa
  • Gonzalo Colmenarejo, GlaxoSmithKline
  • J. Med. Chem. 2001, 44, 4370-4378
  • 95 molecules, 250-1500 descriptors
  • 84 training, 10 testing (1 left out)
  • 551 Wavelet PEST MOE descriptors
  • Widely different compounds
  • Acknowledgements: Sean Ekins (Concurrent),
    N. Sukumar (Rensselaer)

17
Script for ALBUMIN_LOO.BAT: PLS-LOO Validation
for Albumin Data
cmatrix.ori dmatrix.ori num_eg.txt
stats.txt la_sscala.txt a.txt results.xxx results.ttt
sel_lbls.txt bbmatrixx.txt bbmatrixxx.txt
  • PLS-LOO stands for leave-one-out PLS
    cross-validation
  • Training set is in cmatrix.ori and external
    validation set in dmatrix.ori
  • External validation set has 999 or 0 in the
    activity field
  • Note that we create generic labels and that
    there is a test set
  • Notice the dropping of non-changing features and
    4-sigma outliers
  • Notice the acrobatics for displaying metrics
    (visualize with dos_mbotw)
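
The feature clean-up step mentioned above can be sketched in Python as follows. The sketch assumes that a "4-sigma outlier" feature is one containing at least one value more than four sample standard deviations from that feature's mean; the slides do not define the rule precisely, and the function name is hypothetical.

```python
import math

def drop_flat_and_outlier_features(X, n_sigma=4.0):
    """Return (kept column indices, reduced matrix).

    Drops columns that never change and columns holding a value further than
    n_sigma sample standard deviations from the column mean (assumed rule).
    """
    n = len(X)
    keep = []
    for j, col in enumerate(zip(*X)):
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / (n - 1))
        if std == 0.0:
            continue  # non-changing feature
        if any(abs(v - mean) > n_sigma * std for v in col):
            continue  # feature contains an n_sigma outlier
        keep.append(j)
    return keep, [[row[j] for j in keep] for row in X]

# Toy data: column 0 is constant, column 1 is well behaved,
# column 2 is zero everywhere except for one extreme value.
X = [[5.0, float(i % 3), 0.0] for i in range(30)]
X[-1][2] = 100.0
keep, X_reduced = drop_flat_and_outlier_features(X)
```

With few samples a single value cannot exceed four sample standard deviations, so the rule only bites on reasonably large data sets.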

20
PLS Feature Selection Script For Albumin Data
aa.pat bbmatrixx.txt sel_lbls.txt
select.txt
sel_lbls.txt aa.pat aa.tes bbmatrixx.txt bbmatrixxx.txt
  • Do several iterative prunings, typically leave 7
    out 100 times
  • Use different seeds
  • Number of selected features, for example: 400,
    300, 200, 150, 120, 100, 80, 60, 50, 45,
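
The pruning loop can be sketched in Python as below. The sensitivity_fn callback stands in for whatever produces per-feature sensitivities (for example, a bootstrapped PLS run); it and the function name iterative_prune are hypothetical and not part of the StripMiner scripts.

```python
def iterative_prune(feature_names, sensitivity_fn, schedule):
    """Shrink the feature set step by step, following a schedule of target sizes.

    sensitivity_fn(selected, seed) must return a dict mapping each selected
    feature name to a sensitivity score (higher means more important).
    """
    selected = list(feature_names)
    history = []
    for round_no, target in enumerate(schedule):
        # Use a different seed in every pruning round.
        scores = sensitivity_fn(selected, seed=round_no + 1)
        ranked = sorted(selected, key=lambda name: scores[name], reverse=True)
        selected = ranked[:target]
        history.append(list(selected))
    return selected, history

# Toy usage: the "sensitivity" of feature fN is simply N.
names = ["f%d" % i for i in range(10)]

def toy_sensitivity(selected, seed):
    return {name: int(name[1:]) for name in selected}

final, history = iterative_prune(names, toy_sensitivity, [8, 5, 3])
```

Pruning gradually rather than in one cut lets the sensitivities be re-estimated after each reduction, since they can shift once weak or redundant descriptors are gone.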

21
DDASSL
22
STARPLOT.BAT Starplot for Selected Features for
Albumin
sel_lbls.txt aa.pat
bbmatrixxx.txt
sel_lbls.txt starplot.txt
starplot
  • First generate bbmatrixxx.txt, which contains all
    sensitivities for (e.g.) 30 bootstraps, using
    PLS bootstrap option 33
  • Generate starplot.txt from bbmatrixxx.txt using
    option 3320
  • Use the MATLAB routine starplot.m (operates on
    starplot.txt and sel_lbls.txt)
