Title: Proactive Learning: Cost-Sensitive Active Learning with Multiple Imperfect Oracles


1
Proactive Learning: Cost-Sensitive Active
Learning with Multiple Imperfect Oracles
  • Pinar Donmez and Jaime Carbonell
  • Language Technologies Institute,
  • School of Computer Science,
  • Carnegie Mellon University
  • CIKM '08, Napa Valley, October 2008

2
Active Learning: Assumptions and Real World
Active learning assumes:
  • unique oracle
  • perfect oracle
  • always right
  • never tired
  • works for free or charges uniformly
Real world:
  • multiple sources of information
  • imperfect oracles
  • unreliable
  • reluctant
  • expensive or charges non-uniformly

3
Solution: Proactive Learning
  • Proactive learning is a generalization of active
    learning that relaxes these assumptions
  • A decision-theoretic framework that jointly
    optimizes the choice of instance-oracle pair
  • Posed as a utility optimization problem under a
    fixed budget constraint

4
Outline
  • Methodology
    • 3 Scenarios
      • Reluctance
      • Fallibility
      • Variable and Fixed Cost
  • Evaluation
    • Problem Setup
    • Datasets
    • Results
  • Conclusion

5
Scenario 1: Reluctance
  • 2 oracles
  • Reliable oracle: expensive but always answers
    with a correct label
  • Reluctant oracle: cheap but may not respond to
    some queries
  • Define a utility score as the expected value of
    information per unit cost
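
The utility score on this slide can be illustrated with a minimal sketch. It assumes the score is the estimated answer probability times the value of information, divided by the oracle's fee; the function names, the fees, and the numbers below are hypothetical and are not taken from the slides.

```python
def utility(value, answer_prob, cost):
    """Expected value of information per unit cost (assumed form)."""
    return answer_prob * value / cost


def best_pair(values, oracles):
    """Pick the instance-oracle pair with the highest utility.

    values  : value of information for each unlabeled instance
    oracles : maps oracle name -> (fee, per-instance answer probabilities)
    """
    best = None
    for name, (cost, answer_probs) in oracles.items():
        for i, value in enumerate(values):
            score = utility(value, answer_probs[i], cost)
            if best is None or score > best[2]:
                best = (i, name, score)
    return best


# Hypothetical numbers: instance 1 is the most informative, but the
# reluctant oracle is unlikely to answer it.
values = [0.2, 0.9, 0.4]
oracles = {
    "reliable": (3.0, [1.0, 1.0, 1.0]),   # expensive, always answers
    "reluctant": (1.0, [0.8, 0.2, 0.8]),  # cheap, may stay silent
}
idx, name, score = best_pair(values, oracles)
```

With these made-up numbers the search selects instance 2 with the reluctant oracle: the most informative instance loses because the cheap oracle rarely answers it and the reliable oracle is three times as expensive.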

6
How to simulate oracle unreliability?
  • Unreliability may depend on factors such as query
    difficulty (hard to classify), complexity of the
    data (requires long and time-consuming analysis),
    etc. In this work, we model it based on query
    difficulty
  • Assumptions
  • Perfect oracle: a classifier with zero training
    error on the entire data
  • Imperfect oracle: a weak classifier trained on a
    subset of the entire data
  • Train a logistic regression classifier on the
    subset to obtain posterior class probabilities
  • Identify the instances that this classifier finds
    hard to classify
  • These are the unreliable instances
  • Challenge: tradeoff between the information value
    of an instance and the reliability of the oracle
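
A rough sketch of how such simulated oracles might behave is given below. It takes the weak classifier's posteriors as given rather than training a model, and it assumes that "unreliable" means the positive-class posterior falls inside an uncertainty band; the band limits (0.4 to 0.6) and all function names are hypothetical, since the transcript does not preserve the actual criterion.

```python
import random


def unreliable_indices(posteriors, low=0.4, high=0.6):
    """Indices whose posterior lies in the (assumed) uncertain band."""
    return {i for i, p in enumerate(posteriors) if low <= p <= high}


def reluctant_answer(i, true_labels, unreliable):
    """Reluctant oracle: stays silent (None) on unreliable instances."""
    return None if i in unreliable else true_labels[i]


def fallible_answer(i, true_labels, unreliable, rng):
    """Fallible oracle: always answers, but guesses on unreliable instances."""
    return rng.choice([0, 1]) if i in unreliable else true_labels[i]


# Hypothetical posteriors from the weak classifier and the true labels.
posteriors = [0.05, 0.48, 0.93, 0.55]
true_labels = [0, 1, 1, 0]
unreliable = unreliable_indices(posteriors)
rng = random.Random(0)  # seeded so the simulation is repeatable
```

Here instances 1 and 3 are flagged as unreliable, so the reluctant oracle returns nothing for them while the fallible oracle returns a random label.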

7
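How to estimate the reluctant oracle's answer probability?
  • Cluster unlabeled data using k-means
  • Ask the reluctant oracle for the label of each
    cluster centroid
  • If a label is received, increase the estimated
    answer probability of nearby points
  • If no label is received, decrease the estimated
    answer probability of nearby points
  • The update sign equals 1 when a label is received,
    -1 otherwise
  • The number of clusters depends on the clustering
    budget and oracle fee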
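
The centroid-based update can be sketched as follows. The slide's actual update formula is not preserved, so this uses an assumed linear rule: every point within a radius of the queried centroid moves its estimate up or down in proportion to its closeness. The function name, `radius`, `step`, and the starting value of 0.5 are all hypothetical, and the k-means step is skipped by passing a centroid directly.

```python
import math


def update_answer_probs(points, probs, centroid, got_label, radius=1.0, step=0.3):
    """Shift the answer-probability estimates of points near a centroid.

    The sign is +1 when the oracle answered the centroid query and -1
    otherwise; closer points are shifted more (assumed rule).
    """
    sign = 1 if got_label else -1
    updated = []
    for point, p in zip(points, probs):
        d = math.dist(point, centroid)
        if d < radius:
            closeness = 1.0 - d / radius
            # Clip so the estimate stays a valid probability.
            p = min(1.0, max(0.0, p + sign * step * closeness))
        updated.append(p)
    return updated


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0)]
probs = [0.5, 0.5, 0.5]  # uninformed starting estimate
# The reluctant oracle did not answer the query for the centroid at the origin.
probs = update_answer_probs(points, probs, centroid=(0.0, 0.0), got_label=False)
```

After the unanswered query the estimate drops most for the point at the centroid, less for the point halfway out, and not at all for the distant point.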

8
  • The algorithm works in rounds until the budget is
    exhausted
  • At each round, sampling continues until a label
    is obtained
  • Be careful: you may spend the entire budget on a
    single attempt
  • If no label is received, decrease the utility of
    the remaining instances
  • This is adaptive penalization of the reluctant
    oracle
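
The budgeted loop with adaptive penalization might look like the sketch below. It is an illustration under stated assumptions, not the authors' algorithm: the fee is assumed to be charged even when no label comes back, the penalty is an assumed multiplicative factor of 0.5, and the function and variable names are hypothetical.

```python
def run_rounds(budget, utilities, costs, query, penalty=0.5):
    """Spend a fixed budget on instance-oracle queries.

    utilities : {(instance, oracle): score}
    costs     : {oracle: fee per query}
    query     : callable(instance, oracle) -> label, or None if no answer
    Returns the collected labels and the unspent budget.
    """
    utilities = dict(utilities)
    labeled = {}
    while utilities:
        # Only pairs that the remaining budget can pay for are eligible.
        affordable = {k: u for k, u in utilities.items() if costs[k[1]] <= budget}
        if not affordable:
            break
        instance, oracle = max(affordable, key=affordable.get)
        budget -= costs[oracle]  # assumption: charged even without an answer
        label = query(instance, oracle)
        del utilities[(instance, oracle)]
        if label is None:
            # Adaptive penalization: the silent oracle becomes less attractive.
            for key in utilities:
                if key[1] == oracle:
                    utilities[key] *= penalty
        else:
            # A label arrived, so this round ends; drop the instance's other pairs.
            labeled[instance] = label
            for key in [k for k in utilities if k[0] == instance]:
                del utilities[key]
    return labeled, budget


# Hypothetical setup: the reluctant oracle never answers, the reliable one does.
utilities = {
    ("a", "reluctant"): 0.9, ("a", "reliable"): 0.5,
    ("b", "reluctant"): 0.8, ("b", "reliable"): 0.45,
}
costs = {"reluctant": 1, "reliable": 3}
answer = lambda instance, oracle: 1 if oracle == "reliable" else None
labeled, remaining = run_rounds(8, utilities, costs, answer)
```

In this run the first query to the reluctant oracle fails, its remaining pair is penalized, and both labels are then bought from the reliable oracle, leaving one unit of budget.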

9
Algorithm for Scenario 1
10
Scenario 2: Fallibility
  • 2 oracles
  • One perfect but expensive oracle
  • One fallible but cheap oracle that always answers
  • Algorithm: similar to Scenario 1 with slight
    modifications
  • During exploration
  • Fallible oracle provides the label with its
    confidence
  • Confidence of fallible oracle
  • If the confidence is too low, then we don't
    use the label
  • but we still update the reliability estimate
11
Outline of Scenario 2
12
Scenario 3: Non-uniform Cost
  • Uniform cost: fraud detection, face recognition,
    etc.
  • Non-uniform cost: text categorization, medical
    diagnosis, protein structure prediction, etc.
  • 2 oracles
  • Fixed-cost oracle
  • Variable-cost oracle

13
Outline of Scenario 3
14
Evaluation
  • Datasets: face detection, UCI Letter (V-vs-Y),
    Spambase, and UCI Adult

15
Oracle Properties and Costs
  • The cost is inversely proportional to reliability
  • Higher costs for the fallible oracle since a
    noisy label should be penalized more than no
    label at all
  • Cost ratio creates an incentive to choose between
    oracles

16
Underlying Sampling Strategy
  • Conditional entropy based sampling, weighted by a
    density measure
  • Captures the information content of a close
    neighborhood
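
One way to read "conditional entropy weighted by a density measure" is sketched below. The slide's formula is not preserved, so this assumes a score that sums the label entropy of an instance and its close neighbors, each weighted by a Gaussian of the distance; `radius` and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a binary label with positive-class posterior p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def density_weighted_uncertainty(i, points, posteriors, radius=1.0):
    """Sum of neighbor entropies, weighted by closeness to instance i."""
    score = 0.0
    for point, p in zip(points, posteriors):
        d = math.dist(points[i], point)
        if d <= radius:  # the instance itself counts, with weight 1
            score += math.exp(-d * d) * binary_entropy(p)
    return score


# Two uncertain points close together and one isolated uncertain point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
posteriors = [0.5, 0.5, 0.5]
scores = [density_weighted_uncertainty(i, points, posteriors) for i in range(3)]
```

All three points are equally uncertain on their own, yet the two in the dense region score higher because labeling either one is informative about its neighbor as well.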

17
Results: Overall and Reluctance on Spambase Data

18
Results: Reluctance

19
Cost varies non-uniformly
statistically significant results (p < 0.01)

20
More light on the clustering step
  • Run each baseline without the clustering step
  • Entire budget is spent in rounds for data
    elicitation
  • No separate clustering budget
  • Results on Spambase under Scenario 1, cost ratio 1:3

21
Conclusion
  • Address issues with the assumptions of active
    learning
  • Introduction to a Proactive Learning framework
  • Analysis of imperfect oracles with differing
    properties and costs
  • Expected utility maximization across
    oracle-instance pairs
  • Effective against exploitation of a single oracle