Title: Active Learning Challenge
1. Active Learning Challenge
- Isabelle Guyon (Clopinet, California)
- Gavin Cawley (University of East Anglia, UK)
- Olivier Chapelle (Yahoo!, California)
- Gideon Dror (Academic College of Tel-Aviv-Yaffo, Israel)
- Vincent Lemaire (Orange, France)
- Amir Reza Saffari Azar (Graz University of Technology, Austria)
- Alexander Statnikov (New York University, USA)
2. What is the problem?
3. Labeling data is expensive
(Figure: unlabeled data vs. the cost of labeling data)
4. Examples of domains
- Chemo-informatics
- Handwriting and speech recognition
- Image processing
- Text processing
- Marketing
- Ecology
- Embryology
5. What is active learning?
6. What is out there?
7. Scenarios
Burr Settles. Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.
8. De novo queries
De novo queries implicitly assume interventions on the system under study; they are not part of this challenge.
9. Focus on pool-based AL
- Simplest scenario for a challenge.
- Training data: labels can be queried
- Test data: labels unknown
- Methods developed for pool-based AL should also
be useful for stream-based AL.
10. Example
- (a) Toy 2-class problem, 400 instances, Gaussian distributed.
- (b) Linear logistic regression model trained with 30 random instances (accuracy 0.7).
- (c) Linear logistic regression model trained with 30 actively queried instances, using uncertainty sampling (accuracy 0.9); a sketch of this strategy follows below.
Burr Settles, 2009
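As an illustration only (not the challenge's or Settles' code), here is a minimal Python sketch of pool-based uncertainty sampling on a toy 2-class Gaussian problem; the dataset, model, and budget of 30 labels are assumptions chosen to mirror the example above, using scikit-learn for convenience.

# Minimal sketch of pool-based uncertainty sampling (illustrative only).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy 2-class problem: 400 Gaussian-distributed instances.
X, y = make_blobs(n_samples=400, centers=2, random_state=0)

# Start with one labeled example per class, the rest in the unlabeled pool.
labeled = [int(np.flatnonzero(y == 0)[0]), int(np.flatnonzero(y == 1)[0])]
pool = [i for i in range(len(X)) if i not in labeled]

model = LogisticRegression(max_iter=1000)
while len(labeled) < 30:
    model.fit(X[labeled], y[labeled])
    proba = model.predict_proba(X[pool])[:, 1]
    # Uncertainty sampling: query the pool instance whose predicted
    # probability is closest to 0.5 (closest to the decision boundary).
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)          # the oracle reveals y[query]
    pool.remove(query)

model.fit(X[labeled], y[labeled])
print("labels used:", len(labeled))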
11. Learning curve
Burr Settles, 2009
12. Other methods
- Expected model change (query the sample that would produce the greatest gradient if used for training)
- Query by committee (query the sample on which a committee of models disagrees most; see the sketch after this list)
- Bayesian active learning (maximize the change in the revised posterior distribution)
- Expected error reduction (maximize the improvement in generalization performance)
- Information density (ask for examples that are both informative and representative)
Burr Settles, 2009
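To make one of these strategies concrete, below is a small Python sketch of a single query-by-committee step using vote entropy as the disagreement measure; the committee members, dataset, and variable names are illustrative assumptions, not methods prescribed by the challenge.

# Sketch of one query-by-committee step with vote entropy (illustrative only).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)
labeled = list(range(20))                    # pretend the first 20 are labeled
pool = list(range(20, len(X)))

committee = [LogisticRegression(max_iter=1000),
             DecisionTreeClassifier(random_state=1),
             GaussianNB()]
for member in committee:
    member.fit(X[labeled], y[labeled])

votes = np.array([m.predict(X[pool]) for m in committee])   # shape (3, |pool|)

def vote_entropy(column):
    # Entropy of the committee's vote distribution for one pool instance.
    _, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())

disagreement = np.apply_along_axis(vote_entropy, 0, votes)
query = pool[int(np.argmax(disagreement))]   # largest committee disagreement
print("query instance:", query)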
13. Datasets
14. Data donors
- This project would not have been possible without generous donations of data:
  - Chemoinformatics -- Charles Bergeron, Kristin Bennett, and Curt Breneman (Rensselaer Polytechnic Institute, New York) contributed a dataset, which will be used for final testing.
  - Embryology -- Emmanuel Faure, Thierry Savy, Louise Duloquin, Miguel Luengo Oroz, Benoit Lombardot, Camilo Melani, Paul Bourgine, and Nadine Peyriéras (Institut des systèmes complexes, France) contributed the ZEBRA dataset.
  - Handwriting recognition -- Reza Farrahi Moghaddam, Mathias Adankon, Kostyantyn Filonenko, Robert Wisnovsky, and Mohamed Chériet (Ecole de technologie supérieure de Montréal, Quebec) contributed the IBN_SINA dataset.
  - Marketing -- Vincent Lemaire, Marc Boullé, Fabrice Clérot, Raphael Féraud, Aurélie Le Cam, and Pascal Gouzien (Orange, France) contributed the ORANGE dataset, previously used in the KDD Cup 2009.
- We also reused data made publicly available on the Internet:
  - Chemoinformatics -- The National Cancer Institute (USA) for the HIVA dataset.
  - Ecology -- Jock A. Blackard, Denis J. Dean, and Charles W. Anderson (US Forest Service, USA) for the SYLVA dataset (forest cover type).
  - Text processing -- Tom Mitchell (USA) and Ron Bekkerman (Israel) for the NOVA dataset (derived from the Twenty Newsgroups).
15. Development datasets
16. Difficulties
- Sparse data
- Missing values
- Unbalanced classes
- Categorical variables
- Noisy data
- Large datasets
17. Final test datasets
- Will be used for the final ranking
- Will be from the same domains
- May have different data representations and distributions
- No feedback: the results will not be revealed until the end of the challenge
18. Protocol
19. Virtual Lab
Virtual cash
- Joint work with
- Constantin Aliferis, New York University
- Gregory F. Cooper, University of Pittsburgh
- André Elisseeff, Nhumi, Zürich
- Jean-Philippe Pellet, IBM Zürich
- Alexander Statnikov, New York University
- Peter Spirtes, Carnegie Mellon
20. Step-by-step instructions
- Download the data. You get 1 labeled example.
- Predict
- Sample
- Submit a query
- Retrieve the labels
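To picture the loop these steps form, here is a small self-contained Python simulation; the local oracle dictionary stands in for the challenge server (which answered queries through its own submission interface and Matlab sample code), and all names and numbers are illustrative.

# Simulated development-phase loop: predict, sample, submit query, retrieve labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
oracle = dict(enumerate(y))                  # hidden labels, held by the "server"
known = {0: oracle[0]}                       # you get 1 labeled example

for round_ in range(5):
    ids = list(known)
    if len(set(known.values())) > 1:
        model = LogisticRegression(max_iter=1000)
        model.fit(X[ids], [known[i] for i in ids])
        scores = model.decision_function(X)          # 1) predict on all instances
        order = np.argsort(np.abs(scores))           # most uncertain first
    else:
        order = np.random.RandomState(round_).permutation(len(X))
    # 2) sample: pick unlabeled instances to ask about
    query = [int(i) for i in order if int(i) not in known][:10]
    # 3) submit the query; 4) retrieve the labels from the oracle
    known.update({i: oracle[i] for i in query})
    print(f"round {round_}: {len(known)} labels known")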
21. Two phases
- Development phase
- 6 datasets available
- Can try as many times as you want
- Matlab users can run queries on their computers
- Others can use the provided labels
- Final test phase
- 6 new datasets available
- A single try
- No feedback
22. Evaluation
23. AUC score
For each set of samples queried, we assess the
predictions of the learning machine with the Area
under the ROC curve.
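As a reference point, the AUC for one set of predictions can be computed with scikit-learn's roc_auc_score; the labels and scores below are made up purely to show the call.

# AUC of one prediction submission (toy numbers for illustration).
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]                  # true binary labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]    # classifier scores (higher = class 1)
print("AUC =", roc_auc_score(y_true, y_score))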
24. Area under the Learning Curve (ALC)
- Linear interpolation between the measured points; horizontal extrapolation beyond the last point.
- (Figure: learning curves after one query, five queries, thirteen queries, and the lazy strategy of asking for all labels at once.)
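The following Python sketch shows one plausible way to compute such an area under stated assumptions: linear interpolation between the measured (number of labels, AUC) points, horizontal extrapolation from the last point, and the trapezoidal rule; the official scoring program may scale and normalize the axes differently.

# Sketch of an Area under the Learning Curve computation (assumptions noted above).
import numpy as np

def alc(n_labels, auc_values, n_max):
    """Linear interpolation between measured points, horizontal extrapolation
    from the last point out to n_max labels, trapezoidal area, normalized by
    the covered x-range (the normalization is an assumption, not the official rule)."""
    x = np.asarray(list(n_labels) + [n_max], dtype=float)
    y = np.asarray(list(auc_values) + [auc_values[-1]], dtype=float)
    return float(np.trapz(y, x) / (x[-1] - x[0]))

# Example: AUC measured after 1, 5, and 13 queries, out of 100 labels in total.
print(alc([1, 5, 13], [0.60, 0.75, 0.85], n_max=100))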
25. Prizes
If you win on:
- 1 dataset: 100
- 2 datasets: 200
- 3 datasets: 400
- 4 datasets: 800
- 5 datasets: 1600
- 6 datasets: 3200!
- Plus travel awards for top ranking students.
26. Schedule
27. Conclusion
- Try our new challenge, learn, and win!!!!
- Workshops
- AISTATS 2010, Sardinia, May, 2010
- WCCI 2010 Workshop, Barcelona, July, 2010
- Travel awards for top ranking students.
- Proceedings published by JMLR and IEEE.
- Prizes: P(n) = 100 x 2^(n-1), where n is the number of datasets won
- Your problem solved by dozens of research groups
- Help us organize the next challenge!