1
Active Learning Challenge
  • Isabelle Guyon (Clopinet, California)
  • Gavin Cawley (University of East Anglia, UK)
  • Olivier Chapelle (Yahoo!, California)
  • Gideon Dror (Academic College of Tel-Aviv-Yaffo, Israel)
  • Vincent Lemaire (Orange, France)
  • Amir Reza Saffari Azar (Graz University of Technology, Austria)
  • Alexander Statnikov (New York University, USA)

2
What is the problem?
3
Labeling data is expensive


[Figure: abundant unlabeled data versus the costly labeling process]
4
Examples of domains
  • Chemo-informatics
  • Handwriting and speech recognition
  • Image processing
  • Text processing
  • Marketing
  • Ecology
  • Embryology

5
What is active learning?
6
What is out there?
7
Scenarios
Burr Settles. Active Learning Literature Survey.
Computer Sciences Technical Report 1648, University of Wisconsin-Madison, 2009.
8
De novo queries
De novo queries implicitly assume interventions
on the system under study; they are not considered in this challenge.
9
Focus on pool-based AL
  • Simplest scenario for a challenge: training data whose labels can be queried, test data whose labels remain unknown.
  • Methods developed for pool-based AL should also be useful for stream-based AL; a sketch contrasting the two follows.
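
To make the distinction concrete, here is a minimal Python sketch, assuming a scikit-learn-style binary classifier `model` with `predict_proba` (names and the threshold are illustrative, not part of the challenge kit): pool-based AL ranks the entire unlabeled pool before querying, while stream-based AL must decide on one instance at a time.

```python
import numpy as np

def pool_based_query(model, X_pool, n_queries=1):
    """Rank the whole unlabeled pool and query the least certain points."""
    p = model.predict_proba(X_pool)[:, 1]
    uncertainty = -np.abs(p - 0.5)               # p = 0.5 is maximally uncertain
    return np.argsort(uncertainty)[-n_queries:]  # indices to send as a query

def stream_based_query(model, x, threshold=0.1):
    """Decide on a single streaming instance: query its label or let it pass."""
    p = model.predict_proba(x.reshape(1, -1))[0, 1]
    return abs(p - 0.5) < threshold              # True = ask for this label
```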

10
Example
  • Toy 2-class problem: 400 instances, Gaussian distributed.
  • (b) Linear logistic regression model trained with 30 random instances: accuracy = 0.7.
  • (c) Linear logistic regression model trained with 30 actively queried instances using uncertainty sampling: accuracy = 0.9.
  • A code sketch reproducing this comparison follows the citation below.
Burr Settles, 2009
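
A sketch of the comparison with scikit-learn; the blob centers, seed points, and random seed are assumptions, so the accuracies will not exactly match the figure:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 400 instances from two Gaussian blobs, one per class
X = np.vstack([rng.normal(-1.5, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
y = np.repeat([0, 1], 200)

# (b) train on 30 instances drawn at random
idx = rng.choice(400, 30, replace=False)
random_model = LogisticRegression().fit(X[idx], y[idx])

# (c) train on 30 instances chosen by uncertainty sampling
labeled = [0, 200]                        # one seed point per class
while len(labeled) < 30:
    model = LogisticRegression().fit(X[labeled], y[labeled])
    p = model.predict_proba(X)[:, 1]
    p[labeled] = 1.0                      # never re-query a labeled point
    labeled.append(int(np.argmin(np.abs(p - 0.5))))
active_model = LogisticRegression().fit(X[labeled], y[labeled])

print(random_model.score(X, y), active_model.score(X, y))
```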
11
Learning curve
Burr Settles, 2009
12
Other methods
  • Expected model change (greatest gradient if
    sample were used for training)
  • Query by committee (query the sample with the largest committee disagreement; a sketch follows this list)
  • Bayesian active learning (maximize change in
    revised posterior distribution)
  • Expected error reduction (maximize generalization
    performance improvement)
  • Information density (ask for examples both
    informative and representative)

Burr Settles, 2009
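
As an illustration of the committee idea, here is a minimal query-by-committee sketch (an assumption-level example, not the challenge's reference code): a small committee is trained on stratified bootstrap resamples of the labeled data, and the pool point with the highest vote entropy is queried.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

def qbc_query(X_lab, y_lab, X_pool, n_members=5, seed=0):
    """Return the pool index the committee disagrees on most."""
    votes = np.zeros((n_members, len(X_pool)))
    for m in range(n_members):
        Xb, yb = resample(X_lab, y_lab, stratify=y_lab, random_state=seed + m)
        votes[m] = LogisticRegression().fit(Xb, yb).predict(X_pool)
    p1 = votes.mean(axis=0)                  # fraction voting for class 1
    eps = 1e-12                              # avoid log(0)
    entropy = -(p1 * np.log(p1 + eps) + (1 - p1) * np.log(1 - p1 + eps))
    return int(np.argmax(entropy))           # the most contested point
```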
13
Datasets
14
Data donors
  • This project would not have been possible without
    generous donations of data
  • Chemoinformatics -- Charles Bergeron, Kristin
    Bennett and Curt Breneman (Rensselaer Polytechnic
    Institute, New York) contributed a dataset, which
    will be used for final testing.
  • Embryology -- Emmanuel Faure, Thierry Savy,
    Louise Duloquin, Miguel Luengo Oroz, Benoit
    Lombardot, Camilo Melani, Paul Bourgine, and
    Nadine Peyriéras (Institut des systèmes
    complexes, France) contributed the ZEBRA dataset.
  • Handwriting recognition -- Reza Farrahi
    Moghaddam, Mathias Adankon, Kostyantyn Filonenko,
    Robert Wisnovsky, and Mohamed Chériet (Ecole de
    technologie supérieure de Montréal, Quebec)
    contributed the IBN_SINA dataset.
  • Marketing -- Vincent Lemaire, Marc Boullé,
    Fabrice Clérot, Raphael Féraud, Aurélie Le Cam,
    and Pascal Gouzien (Orange, France) contributed
    the ORANGE dataset, previously used in the KDD
    cup 2009.
  • We also reused data made publicly available on
    the Internet
  • Chemoinformatics -- The National Cancer Institute
    (USA) for the HIVA dataset.
  • Ecology -- Jock A. Blackard, Denis J. Dean, and
    Charles W. Anderson (US Forest Service, USA) for
    the SYLVA dataset (Forest cover type).
  • Text processing -- Tom Mitchell (USA) and Ron Bekkerman (Israel) for the NOVA dataset (derived from the Twenty Newsgroups).

15
Development datasets
16
Difficulties
  • Sparse data
  • Missing values
  • Unbalanced classes
  • Categorical variables
  • Noisy data
  • Large datasets

17
Final test datasets
  • Will determine the final ranking
  • Will be from the same domains
  • May have different data representations and distributions
  • No feedback: results will not be revealed until the end of the challenge

18
Protocol
19
Virtual Lab
Queries are paid for with virtual cash.
  • Joint work with
  • Constantin Aliferis, New York University
  • Gregory F. Cooper, University of Pittsburgh
  • André Elisseeff, Nhumi, Zürich
  • Jean-Philippe Pellet, IBM Zürich
  • Alexander Statnikov, New York University
  • Peter Spirtes, Carnegie Mellon

20
Step by step instructions
Download the data; you start with one labeled example. Then iterate (a sketch of the loop follows the list):
  1. Predict labels for all examples
  2. Sample: choose the examples to query
  3. Submit the query
  4. Retrieve the labels
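
A sketch of what this loop might look like in Python; `submit_query` and `retrieve_labels` are hypothetical stand-ins for the challenge's web/Matlab interface, mocked here so the snippet runs end to end:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(500, 5))
_hidden_y = (X_pool[:, 0] > 0).astype(int)    # mock oracle, server-side only

def submit_query(indices):                    # hypothetical server call
    return indices                            # returns a query receipt

def retrieve_labels(receipt):                 # hypothetical server call
    return _hidden_y[receipt]

labeled = {0: int(_hidden_y[0])}              # you start with 1 labeled example
for _ in range(5):                            # five query rounds of 10 labels
    idx = np.array(sorted(labeled))
    y_lab = np.array([labeled[i] for i in idx])
    if len(set(y_lab)) < 2:                   # one class only: model undefined,
        rest = np.setdiff1d(np.arange(500), idx)
        query = rng.choice(rest, 10, replace=False)    # so sample at random
    else:
        model = LogisticRegression().fit(X_pool[idx], y_lab)
        p = model.predict_proba(X_pool)[:, 1]          # 1. predict on the pool
        p[idx] = 1.0                                   #    mask labeled points
        query = np.argsort(np.abs(p - 0.5))[:10]       # 2. sample least certain
    for i, lab in zip(query, retrieve_labels(submit_query(query))):  # 3.-4.
        labeled[int(i)] = int(lab)
```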

21
Two phases
  • Development phase
    • 6 datasets available
    • Try as many times as you want
    • Matlab users can run queries on their own computers
    • Others can use the provided labels
  • Final test phase
    • 6 new datasets available
    • A single try
    • No feedback

22
Evaluation
23
AUC score
For each set of samples queried, we assess the
predictions of the learning machine with the Area
Under the ROC Curve (AUC); a minimal example follows.
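
Concretely, with scikit-learn (toy numbers, not challenge data):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 1]              # held-out test labels
scores = [0.1, 0.4, 0.35, 0.8, 0.9]   # the learning machine's predictions
print(roc_auc_score(y_true, scores))  # 0.8333...: 5 of 6 pos/neg pairs ranked correctly
```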
24
Area under the Learning Curve (ALC)
Linear interpolation between recorded points; horizontal extrapolation beyond the last one. The learning curve plots the AUC against the number of labels queried, and the ALC is the area under it. A computation sketch follows.
[Figure: example learning curves after one query, five queries, and thirteen queries; the "lazy" strategy asks for all labels at once.]
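
A minimal sketch of the ALC computation; the numbers are made up, and the challenge's official normalization and x-axis scaling may differ from the plain linear axis used here:

```python
import numpy as np

n_labels = np.array([1, 10, 100, 1000])     # labels bought after each query
auc      = np.array([0.5, 0.7, 0.85, 0.9])  # AUC of the predictions at that point
n_max = 2000                                # total number of labels in the pool

# horizontal extrapolation: carry the last AUC out to the full pool size
x = np.append(n_labels, n_max).astype(float)
y = np.append(auc, auc[-1])

# linear interpolation between points = trapezoid rule for the area
area = np.sum((y[1:] + y[:-1]) / 2 * np.diff(x))
alc = area / (n_max - n_labels[0])          # normalize by the x-range
print(round(alc, 3))                        # 0.882
```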
25
Prizes
If you win on
  • 1 dataset: 100
  • 2 datasets: 200
  • 3 datasets: 400
  • 4 datasets: 800
  • 5 datasets: 1600
  • 6 datasets: 3200!
  • Plus travel awards for top-ranking students.

26
Schedule
27
Conclusion
  • Try our new challenge, learn, and win!
  • Workshops:
    • AISTATS 2010, Sardinia, May 2010
    • WCCI 2010 workshop, Barcelona, July 2010
  • Travel awards for top-ranking students.
  • Proceedings published by JMLR and IEEE.
  • Prizes: P(n) = 100 x 2^(n-1) for a win on n datasets.
  • Your problem solved by dozens of research groups.
  • Help us organize the next challenge!