1
Lab 1
Getting started with CLOP and the Spider package
2
CLOP Tutorial
  • CLOP = Challenge Learning Object Package.
  • Based on the Spider package developed at the
    Max Planck Institute.
  • Two basic abstractions:
  • Data object
  • Model object

http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
3
CLOP Data Objects
At the Matlab prompt:
  • cd <code_dir>
  • use_spider_clop
  • X = rand(10,8);                  % 10 patterns, 8 features
  • Y = [1 1 1 1 1 -1 -1 -1 -1 -1]'; % labels
  • D = data(X, Y);                  % constructor
  • [p, n] = get_dim(D)
  • get_x(D)
  • get_y(D)

4
CLOP Model Objects
D is a data object previously defined.
  • model = kridge;                  % constructor
  • [resu, model] = train(model, D);
  • resu, model.W, model.b0          % inspect the outputs
  • Yhat = D.X * model.W' + model.b0;
  • testD = data(rand(3,8), [-1 -1 1]');
  • tresu = test(model, testD);
  • balanced_errate(tresu.X, tresu.Y)

5
Hyperparameters and Chains
A model often has hyperparameters:
  • default(kridge)
  • hyper = {'degree=3', 'shrinkage=0.1'};
  • model = kridge(hyper);
  • model = chain({standardize, kridge(hyper)});
  • [resu, model] = train(model, D);
  • tresu = test(model, testD);
  • balanced_errate(tresu.X, tresu.Y)

Models can be chained
6
Hyper-parameters
  • http://clopinet.com/isabelle/Projects/modelselect/MFAQ.html
  • Kernel methods: kridge and svc
  • k(x, y) = (coef0 + x . y)^degree * exp(-gamma ||x - y||^2)
    (a plain-Matlab sketch follows below)
  • k_ij = k(x_i, x_j)
  • k_ii <- k_ii + shrinkage
  • Naive Bayes: naive (no hyperparameters)
  • Neural network: neural
  • units, shrinkage, maxiter
  • Random Forest: rf (Windows only)
  • mtry

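As a rough illustration of these hyperparameters, here is a
plain-Matlab sketch of the kernel matrix computation (an
interpretation of the formula above, not the CLOP internals;
the hyperparameter values are examples):

  % Sketch: kernel matrix for kridge/svc as defined above.
  coef0 = 1; degree = 3; gamma = 0; shrinkage = 0.1;
  X = rand(10, 8);            % 10 patterns (rows), 8 features
  n = size(X, 1);
  K = zeros(n);
  for i = 1:n
    for j = 1:n
      d = X(i,:) - X(j,:);
      K(i,j) = (coef0 + X(i,:)*X(j,:)')^degree * exp(-gamma*(d*d'));
    end
  end
  K = K + shrinkage*eye(n);   % k_ii <- k_ii + shrinkage
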
7
Lab 2
Getting started with the NIPS 2003 FS challenge
8
The Datasets
  • Arcene: cancer vs. normal, with mass-spectrometry
    analysis of blood serum.
  • Dexter: filter texts about corporate acquisitions,
    from the Reuters collection.
  • Dorothea: predict which compounds bind to
    Thrombin, from the KDD Cup 2001.
  • Gisette: OCR, digit 4 vs. digit 9, from NIST.
  • Madelon: artificial data.

http://clopinet.com/isabelle/Projects/NIPS2003/Slides/NIPS2003-Datasets.pdf
9
Data Preparation
  • Preprocessing and scaling to the numerical range
    0 to 999 for continuous data and 0/1 for binary
    data.
  • Probes: Addition of random features distributed
    similarly to the real features.
  • Shuffling: Randomization of the order of the
    patterns and the features.
  • Baseline error rates (errate): Training and
    testing on various data splits with simple
    methods.
  • Test set size: Number of test examples needed,
    using the rule of thumb n_test = 100/errate.

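For example, an expected error rate around 12.5% calls for
n_test = 100/0.125 = 800 test examples; compare Dorothea,
with a baseline BER near 12% and an 800-example test set.
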
10
Data Statistics
Dataset    Size     Type            Features  Training Ex.  Validation Ex.  Test Ex.
Arcene     8.7 MB   Dense           10000     100           100             700
Gisette    22.5 MB  Dense           5000      6000          1000            6500
Dexter     0.9 MB   Sparse integer  20000     300           300             2000
Dorothea   4.7 MB   Sparse binary   100000    800           350             800
Madelon    2.9 MB   Dense           500       2000          600             1800
11
ARCENE
ARCENE is the cancer dataset.
  • Sources: National Cancer Institute (NCI) and
    Eastern Virginia Medical School (EVMS).
  • Three datasets: one on ovarian cancer, two on
    prostate cancer, all preprocessed similarly.
  • Task: Separate cancer vs. normal.

12
DEXTER
DEXTER filters texts:
NEW YORK, October 2, 2001 - Instinet Group
Incorporated (Nasdaq: INET), the world's largest
electronic agency securities broker, today
announced that it has completed the acquisition
of ProTrader Group, LP, a provider of advanced
trading technologies and electronic brokerage
services primarily for retail active traders and
hedge funds. The acquisition excludes
ProTrader's proprietary trading business.
ProTrader's 2000 annual revenues exceeded $83
million.
  • Sources: Carnegie Group, Inc. and Reuters, Ltd.
  • Preprocessing: Thorsten Joachims.
  • Task: Filter corporate acquisition texts.

13
DOROTHEA
DOROTHEA is the Thrombin dataset.
  • Sources: DuPont Pharmaceuticals Research
    Laboratories and KDD Cup 2001.
  • Task: Predict which compounds bind to Thrombin.

14
GISETTE
GISETTE contains handwritten digits.
  • Source: National Institute of Standards and
    Technology (NIST).
  • Preprocessing: Yann LeCun and collaborators.
  • Task: Separate digits 4 and 9.

15
MADELON
MADELON is random data
  • Source: Isabelle Guyon, inspired by Simon Perkins
    et al.
  • Type of data: Clusters on the summits of a
    hypercube.

16
Performance Measures
Confusion matrix (with a = true positives, b = false
negatives, c = false positives, d = true negatives):
  • Balanced Error Rate (BER): the average of the
    error rates for each class: BER = 0.5 (b/(a+b) +
    c/(c+d)). (See the worked example after this list.)
  • Area Under Curve (AUC): the area under the ROC
    curve obtained by plotting a/(a+b) against
    d/(c+d) for each confidence value, starting at
    (0,1) and ending at (1,0).
  • Fraction of Features (FF): the ratio of the number
    of features selected to the total number of
    features in the dataset.
  • Fraction of Probes (FP): the ratio of the number
    of garbage features (probes) selected to the
    total number of features selected.

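As a worked example, BER can be computed from hard predictions
in a few lines of plain Matlab (a sketch with made-up labels,
not the CLOP scoring code):

  Y    = [ 1  1  1  1 -1 -1 -1 -1]';  % true labels
  Yhat = [ 1  1 -1 -1 -1 -1  1 -1]';  % predicted labels
  a = sum(Y ==  1 & Yhat ==  1);      % true positives
  b = sum(Y ==  1 & Yhat == -1);      % false negatives
  c = sum(Y == -1 & Yhat ==  1);      % false positives
  d = sum(Y == -1 & Yhat == -1);      % true negatives
  BER = 0.5*(b/(a+b) + c/(c+d))       % = 0.5*(2/4 + 1/4) = 0.375
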
17
BER distribution
18
Power of Feature Selection
19
Visualization
  • 1) Create a heatmap of the data matrix:
  • show(D.train)
  • 2) Look at individual patterns:
  • browse(D.train)
  • 3) Make a scatter plot of the first 2 features:
  • show(D.train)
  • 4) Visualize the results:
  • [Dat, Model] = train(model, D.train)
  • Dat = test(Model, D.valid)
  • roc(Dat)

20
BER = f(threshold)
DOROTHEA

[Figure: BER as a function of the decision threshold theta,
on the training set (theta = -37.40) and the test set
(theta = -38.14).]

Without bias adjustment, test BER = 22.54%; with bias
adjustment, test BER = 12.37%.
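The bias adjustment above amounts to picking the threshold
theta that minimizes the training-set BER. A minimal sketch
of that idea in plain Matlab, with stand-in discriminant
values (the real ones would come from the trained model):

  f = randn(100,1);  Y = sign(randn(100,1));  % stand-in outputs/labels
  Y(Y==0) = 1;
  best = inf;
  for t = sort(f)'                % candidate thresholds
    Yhat = sign(f - t);  Yhat(Yhat==0) = 1;
    b = sum(Y==1  & Yhat==-1);  a = sum(Y==1)  - b;
    c = sum(Y==-1 & Yhat==1);   d = sum(Y==-1) - c;
    ber = 0.5*(b/(a+b) + c/(c+d));
    if ber < best, best = ber; theta = t; end
  end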
21
ROC curve
DOROTHEA
[Figure: ROC curve, sensitivity vs. specificity; AUC = 0.91.]
22
Feature Selection
MADELON (pval_max = 0.1)

[Figure: ranking criterion W (left panel) and FDR (right
panel) as a function of feature rank.]
23
Heat map
ARCENE
24
Scatter plots
ARCENE
chain({standardize, s2n('f_max=2'), normalize,
my_svc})   Test BER = 49%
chain({standardize, s2n('f_max=1100'), normalize,
gs('f_max=2'), my_svc})   Test BER = 29.37%
25
Lab 3
Playing with FS filters and classifiers on
Madelon and Dexter
26
Lab 3 software
  • Try the examples in the Lab 3 README.m.
  • Taking inspiration from the examples, write a new
    feature ranking filter object. Choose one from
    Chapter 3 or invent your own.
  • Provide the p-value and FDR (using a tabulated
    distribution or the probe method).

27
Filters (see Chapter 3)
28
Filters Implemented
  • @s2n (sketched after this list)
  • @relief
  • @Ttest
  • @Pearson (uses Matlab's corrcoef; gives the same
    results as Ttest when the classes are balanced)
  • @Ftest (gives the same results as Ttest.
    Important for the p-values: the Fisher criterion
    needs to be multiplied by num_patt_per_class, or
    use anovan.)
  • @aucfs (ranksum test)

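For reference when writing your own filter object: the
signal-to-noise criterion behind @s2n ranks feature j by
|mu+(j) - mu-(j)| / (sigma+(j) + sigma-(j)). A plain-Matlab
sketch of the criterion (not the CLOP object itself):

  X = rand(20, 8);  Y = [ones(10,1); -ones(10,1)];
  Xp = X(Y==1,:);  Xn = X(Y==-1,:);         % the two classes
  s2n = abs(mean(Xp) - mean(Xn)) ./ (std(Xp) + std(Xn) + eps);
  [crit, rank] = sort(s2n, 'descend');      % best features first
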
29
Evaluation of pval and FDR
(n = total number of features, nsc = number of features
selected, np = number of probes, nsp = number of probes
selected)
  • Ttest object:
  • computes pval analytically
  • FDR = pval * n/nsc
  • probe object:
  • takes any feature ranking object as an argument
    (e.g. s2n, relief, Ttest)
  • pval = nsp/np
  • FDR = pval * n/nsc

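A minimal sketch of the probe method in plain Matlab (not the
CLOP @probe object), using the s2n criterion and the pval/FDR
estimates above:

  X = rand(50, 20);  Y = [ones(25,1); -ones(25,1)];
  n = size(X, 2);  np = 1000;
  P = rand(size(X,1), np);           % probes: random "garbage" features
  Xa = [X P];                        % real features plus probes
  Xp = Xa(Y==1,:);  Xn = Xa(Y==-1,:);
  crit = abs(mean(Xp) - mean(Xn)) ./ (std(Xp) + std(Xn) + eps);
  [srt, order] = sort(crit, 'descend');
  isprobe = (order > n);             % which ranked items are probes
  pval = zeros(1,n);  FDR = zeros(1,n);  nsc = 0;
  for r = 1:(n + np)
    if ~isprobe(r)
      nsc = nsc + 1;                 % real features selected so far
      nsp = sum(isprobe(1:r));       % probes ranked above this point
      pval(nsc) = nsp/np;
      FDR(nsc)  = pval(nsc)*n/nsc;
    end
  end
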
30
Analytic vs. probe
Red: analytic; blue: probe.
31
Relief vs. Ttest (Madelon)
[Figure: p-values (left) and FDR (right) as a function of
feature rank, for Ttest and Relief.]
32
Lab 4
Playing with feature construction on Gisette
33
Gisette
  • Handwritten digits.
  • Goal: get familiar with the data and result
    formats. Make a first submission.
  • Easiest learning machines for Gisette: naive and
    svc.
  • Best preprocessing: normalize.
  • Easiest feature selection method: s2n.
  • Many training examples (6000): unsuitable for
    kridge unless subsampling is used.
  • Many features (5000): select features before
    running neural or rf.

34
Baseline Model
  • baseline-Gisette (BER = 1.8%, feat = 20%)
  • my_classif = svc({'coef0=1', 'degree=3',
    'gamma=0', 'shrinkage=1'});
  • my_model = chain({normalize, s2n('f_max=1000'),
    my_classif});

35
Baseline methods
baseline-Gisette (CV = 1.91%, test = 1.80%, feat = 20%):
my_classif = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model = chain({normalize, s2n('f_max=1000'), my_classif});

baseline-Gisette2 (CV = 1.34%, test = 1.17%, feat = 20%):
my_model = chain({s2n('f_max=1000'), normalize, my_classif});

pixel-Gisette (CV = 1.31%, test = 0.91%):
my_classif = svc({'coef0=1', 'degree=4', 'gamma=0', 'shrinkage=0.1'});
my_model = chain({normalize, my_classif});
36
Convolutions
GISETTE (pixel-Gisette_exp_conv)
my_model = chain({convolve(exp_ker({'dim1=9', 'dim2=9'})),
normalize, my_classif});
prepro = my_model{1};  show(prepro.child);  % inspect the kernel
DD = test(prepro, D.train);
browse_digit(DD.X, D.train.Y);
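A rough plain-Matlab analogue of this preprocessing for a
single digit image (the 28x28 image size and the kernel width
are assumptions, not taken from the slide):

  img = rand(28, 28);                 % stand-in for one digit image
  [u, v] = meshgrid(-4:4, -4:4);      % 9x9 support (dim1 = dim2 = 9)
  w = 2;                              % assumed kernel width
  ker = exp(-(u.^2 + v.^2)/(2*w^2));  % exponential kernel
  ker = ker/sum(ker(:));              % normalize the kernel
  smoothed = conv2(img, ker, 'same'); % convolved image, same size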
37
Principal Components
  • @pca_bank: Filter bank object retaining the first
    f_max principal components of the data matrix.
  • @kmeans_bank: Filter bank containing templates
    corresponding to f_max cluster centers.

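What @pca_bank computes can be sketched in plain Matlab (not
the CLOP code): project the centered data matrix onto its
first f_max principal directions.

  X = rand(100, 50);  f_max = 10;
  Xc = X - repmat(mean(X), size(X,1), 1);  % center the columns
  [U, S, V] = svd(Xc, 'econ');        % principal directions in V
  features = Xc * V(:, 1:f_max);      % one row of features per pattern
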
38
Hadamard bank
  • @hadamard_bank: Filter bank object performing a
    Hadamard transform.

39
Fourier Transform
@fourier: Two-dimensional Fourier transform.
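For pixel data, such a transform can be sketched in plain
Matlab (the 28x28 image size is an assumption):

  img = rand(28, 28);        % stand-in for one digit image
  F = abs(fft2(img));        % 2-D magnitude spectrum
  features = F(:)';          % flatten into one feature row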
40
Epilogue
Becoming a pro and playing with other datasets
41
Baseline Methods for the Feature Extraction Class
(Isabelle Guyon)
BACKGROUND
DATASETS
METHODS
Challenge: good performance with few features.
Tasks: two-class classification.
Data split: training/validation/test.
Valid entry: results on all 5 datasets.
We present supplementary course material
complementing the book Feature Extraction,
Fundamentals and Applications, I. Guyon et al.,
Eds., to appear in Springer. Classical algorithms
of feature extraction were reviewed in class.
More attention was given to feature selection
than to feature construction because of the recent
success of methods involving a large number of
"low-level" features. The book includes the
results of a NIPS 2003 feature selection
challenge. The students learned techniques
employed by the best challengers and tried to
match the best performances. A Matlab toolbox
was provided with sample code. The students could
make post-challenge entries to
http://www.nipsfsc.ecs.soton.ac.uk/.
  • Scoring
  • Ranking according to test set balanced error
    rate (BER), i.e. the average of the positive
    class and negative class error rates.
  • Ties broken by the feature set size.
  • Learning objects
  • CLOP learning objects implemented in Matlab.
  • Two simple abstractions: data and algorithm.
  • Download:
    http://www.modelselect.inf.ethz.ch/models.php.
  • Task of the students
  • Baseline method provided, with BER0 performance
    and n0 features.
  • Get BER < BER0, or BER = BER0 but n < n0.
  • Extra credit for beating the best challenge
    entry.
  • OK to use the validation set labels for training.

RESULTS
ARCENE: Best BER = 11.9 ± 1.2% - n0 = 1100 (11% of
features), BER0 = 14.7%
my_svc = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=0.1'});
my_model = chain({standardize, s2n('f_max=1100'), normalize, my_svc});
  • DEXTER: Best BER = 3.30 ± 0.40% - n0 = 300 (1.5%
    of features), BER0 = 5%
  • my_classif = svc({'coef0=1', 'degree=1',
    'gamma=0', 'shrinkage=0.5'});
  • my_model = chain({s2n('f_max=300'), normalize,
    my_classif});

DEXTER: text categorization
NEW YORK, October 2, 2001 - Instinet Group
Incorporated (Nasdaq: INET), the world's largest
electronic agency securities broker, today
announced tha...
DOROTHEA: Best BER = 8.54 ± 0.99% - n0 = 1000 (1% of
features), BER0 = 12.37%
my_model = chain({TP('f_max=1000'), naive, bias});
DOROTHEA: drug discovery
GISETTE: Best BER = 1.26 ± 0.14% - n0 = 1000 (20% of
features), BER0 = 1.80%
my_classif = svc({'coef0=1', 'degree=3', 'gamma=0', 'shrinkage=1'});
my_model = chain({normalize, s2n('f_max=1000'), my_classif});
GISETTE: digit recognition
MADELON: Best BER = 6.22 ± 0.57% - n0 = 20 (4% of
features), BER0 = 7.33%
my_classif = svc({'coef0=1', 'degree=0', 'gamma=1', 'shrinkage=1'});
my_model = chain({probe(relief, {'p_num=2000', 'pval_max=0'}),
standardize, my_classif});
MADELON: artificial data
42
Best student results
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html
43
Open until August 1st, 2007
  • Agnostic Learning vs. Prior Knowledge challenge
  • Isabelle Guyon, Amir Saffari, Gideon Dror,
    Gavin Cawley, Olivier Guyon,
    and many other volunteers; see
    http://www.agnostic.inf.ethz.ch/credits.php

44
Datasets
Dataset  Type           Domain          Features  Training Ex.  Validation Ex.  Test Ex.
ADA      Dense          Marketing       48        4147          415             41471
GINA     Dense          Digits          970       3153          315             31532
HIVA     Dense          Drug discovery  1617      3845          384             38449
NOVA     Sparse binary  Text classif.   16969     1754          175             17537
SYLVA    Dense          Ecology         216       13086         1308            130858

http://www.agnostic.inf.ethz.ch
45
ADA
  • ADA is the marketing database.
  • Task: Discover high-revenue people from census
    data. Two-class problem.
  • Source: Census Bureau, "Adult" database from the
    UCI machine-learning repository.
  • Features: 14 original attributes including age,
    workclass, education, marital status,
    occupation, native country. Continuous, binary
    and categorical features.

46
GINA
GINA is the digit database.
  • Task: Handwritten digit recognition. Separate the
    odd from the even digits. Two-class problem with
    heterogeneous classes.
  • Source: MNIST database formatted by LeCun and
    Cortes.
  • Features: 28x28 pixel maps.

47
HIVA
  • HIVA is the HIV database.
  • Task: Find compounds active against HIV. We
    brought it back to a two-class problem (active
    vs. inactive), but provide the original labels
    (active, moderately active, and inactive).
  • Data source: National Cancer Institute.
  • Data representation: The compounds are
    represented by their 3-D molecular structure.

48
NOVA
Subject: Re: Goalie masks
Lines: 21
Tom Barrasso wore a great mask, one time, last
season. He unveiled it at a game in Boston.
It was all black, with Pgh city scenes on it.
The "Golden Triangle" graced the top, along with
a steel mill on one side and the Civic Arena on
the other. On the back of the helmet was the
old Pens' logo, the current (at the time)
Pens logo, and a space for the "new" logo. A
great mask done in by a goalie's superstition.
Lori
  • NOVA is the text classification database.
  • Task: Classify newsgroup emails into politics or
    religion vs. other topics.
  • Source: The 20-Newsgroups dataset from the UCI
    machine-learning repository.
  • Data representation: The raw text with an
    estimated 17000-word vocabulary.

49
SYLVA
  • SYLVA is the ecology database.
  • Task: Classify forest cover types into Ponderosa
    pine vs. everything else.
  • Source: US Forest Service (USFS).
  • Data representation: Forest cover type for 30 x
    30 meter cells, encoded with 108 features
    (elevation, hill shade, wilderness type, soil
    type, etc.)

50
BER distribution (March 1st)
[Figure: BER distributions for the agnostic learning and
prior knowledge tracks.]
The black vertical line indicates the best ranked
entry (only the last 5 entries of each participant
were ranked). Beware of overfitting!
51
CLOP models
52
Preprocessing and FS
53
Model grouping
for k = 1:10
  base_model{k} = chain({standardize, naive});
end
my_model = ensemble(base_model);
54
CLOP models (best entrant)
 
Juha Reunanen, cross-indexing-7
 
sns = shift_n_scale, std = standardize, norm =
normalize (some details of hyperparameters not
shown)
55
CLOP models (2nd best entrant)
 
Hugo Jair Escalante Balderas, BRun2311062
 
sns = shift_n_scale, std = standardize, norm =
normalize (some details of hyperparameters not
shown). Note: the entry Boosting_1_001_x900 gave
better results, but was older.