Active Learning COMS 6998-4: Learning and Empirical Inference - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Active Learning
COMS 6998-4: Learning and Empirical Inference
Irina Rish, IBM T.J. Watson Research Center
2
Outline
  • Motivation
  • Active learning approaches
  • Membership queries
  • Uncertainty Sampling
  • Information-based loss functions
  • Uncertainty-Region Sampling
  • Query by committee
  • Applications
  • Active Collaborative Prediction
  • Active Bayes net learning

3
Standard supervised learning model
  • Given m labeled points, want to learn a
    classifier with misclassification rate < ε, chosen
    from a hypothesis class H with VC dimension d < ∞.

VC theory: need m to be roughly d/ε in the
realizable case.
4
Active learning
  • In many situations, like speech recognition and
    document retrieval, unlabeled data is easy to
    come by, but there is a charge for each label.

What is the minimum number of labels needed to
achieve the target error rate?
5
(No Transcript)
6
What is Active Learning?
  • Unlabeled data are readily available; labels are
    expensive
  • Want to use adaptive decisions to choose which
    labels to acquire for a given dataset
  • Goal is accurate classifier with minimal cost

7
Active learning warning
  • Choice of data is only as good as the model
    itself
  • Assume a linear model, then two data points are
    sufficient
  • What happens when data are not linear?

8
Active Learning Flavors
  • Membership queries vs. selective sampling
  • Pool-based vs. sequential
  • Myopic vs. batch
9
Active Learning Approaches
  • Membership queries
  • Uncertainty Sampling
  • Information-based loss functions
  • Uncertainty-Region Sampling
  • Query by committee

10
(No Transcript)
11
(No Transcript)
12
Problem
Many results in this framework, even for
complicated hypothesis classes. [Baum and Lang,
1991] tried fitting a neural net to handwritten
characters: the synthetic instances created were
incomprehensible to humans! [Lewis and Gale,
1992] tried training text classifiers: "an
artificial text created by a learning algorithm
is unlikely to be a legitimate natural language
expression, and probably would be uninterpretable
by a human teacher."
13
(No Transcript)
14
(No Transcript)
15
(No Transcript)
16
Uncertainty Sampling
Lewis & Gale, 1994
  • Query the event that the current classifier is
    most uncertain about
  • Used trivially in SVMs, graphical models, etc.

If uncertainty is measured as Euclidean distance to the
current decision boundary:
[Figure: unlabeled points scattered around a linear decision
boundary; the points closest to it are the most uncertain]
17
Information-based Loss Function
MacKay, 1992
  • Maximize KL-divergence between posterior and
    prior
  • Maximize reduction in model entropy between
    posterior and prior
  • Minimize cross-entropy between posterior and
    prior
  • All of these are notions of information gain

18
Query by Committee
Seung et al. 1992, Freund et al. 1997
  • Prior distribution over hypotheses
  • Sample a set of classifiers from the distribution
  • Query an example based on the degree of
    disagreement among the committee of classifiers

[Figure: three committee members A, B, and C drawn from the
hypothesis distribution, disagreeing on some unlabeled points]
19
Infogain-based Active Learning
20
Notation
  • We Have
  • Dataset, D
  • Model parameter space, W
  • Query algorithm, q

21
Dataset (D) Example
t Sex Age Test A Test B Test C Disease
0 M 40-50 0 1 1 ?
1 F 50-60 0 1 0 ?
2 F 30-40 0 0 0 ?
3 F 60 1 1 1 ?
4 M 10-20 0 1 0 ?
5 M 40-50 0 0 1 ?
6 F 0-10 0 0 0 ?
7 M 30-40 1 1 0 ?
8 M 20-30 0 0 1 ?
22
Notation
  • We Have
  • Dataset, D
  • Model parameter space, W
  • Query algorithm, q

23
Model Example
[Figure: probabilistic classifier with hidden class St and observed features Ot]
Notation:
  T = number of examples
  Ot = vector of features of example t
  St = class of example t
24
Model Example
Patient state (St): St = DiseaseState
Patient observations (Ot): Ot,1 = Gender, Ot,2 = Age,
Ot,3 = TestA, Ot,4 = TestB, Ot,5 = TestC
25
Possible Model Structures
26
Model Space
[Figure: model with class prior P(St) and observation model P(Ot | St)]
Model parameters w: the entries of P(St) and P(Ot | St)
Generative model: must be able to compute P(St = i, Ot = ot | w)
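To make the generative setup concrete, here is a minimal sketch assuming a naive-Bayes-style factorization P(St, Ot | w) = P(St) Πj P(Ot,j | St) with binary features; the CPT values and variable names are illustrative, not taken from the slides.

```python
import numpy as np

# Hypothetical CPTs for a binary disease state S and three binary tests O_1..O_3.
p_s = np.array([0.9, 0.1])                  # P(S = 0), P(S = 1)
p_o_given_s = np.array([                    # p_o_given_s[j, s, o] = P(O_j = o | S = s)
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.6, 0.4], [0.1, 0.9]],
    [[0.7, 0.3], [0.4, 0.6]],
])

def joint(s, o):
    """P(S = s, O = o | w) under the naive Bayes factorization."""
    p = p_s[s]
    for j, oj in enumerate(o):
        p *= p_o_given_s[j, s, oj]
    return p

def posterior(o):
    """P(S | O = o), by normalizing the joint over s."""
    unnorm = np.array([joint(s, o) for s in (0, 1)])
    return unnorm / unnorm.sum()

print(posterior((1, 1, 0)))   # belief over the disease state for one observation vector
```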
27
Model Parameter Space (W)
  • W = the space of possible parameter values
  • Prior on parameters
  • Posterior over models

28
Notation
  • We Have
  • Dataset, D
  • Model parameter space, W
  • Query algorithm, q

q(W,D) returns t, the next sample to label
29
Game
  • while NotDone:
  • Learn P(W | D)
  • q chooses the next example to label
  • The expert adds the label to D
    (a minimal sketch of this loop follows)
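A minimal pool-based sketch of the loop above; `fit_posterior`, `query`, and `oracle` are placeholders standing in for the learner, the query algorithm q, and the expert.

```python
def active_learning_loop(pool, label_budget, fit_posterior, query, oracle):
    """Generic pool-based active learning loop.
    fit_posterior(D) learns P(W | D), query(posterior, pool) is the algorithm q,
    and oracle(x) is the expert who supplies the label."""
    D = []                                    # labeled data, grows one example per round
    pool = list(pool)
    for _ in range(label_budget):             # "while NotDone"
        posterior = fit_posterior(D)          # learn P(W | D)
        t = query(posterior, pool)            # q chooses the next example to label
        x = pool.pop(t)
        D.append((x, oracle(x)))              # the expert adds the label to D
    return fit_posterior(D), D
```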

30
Simulation
[Figure: a simulated stream of patients O1..O7 with hidden
states S1..S7; the query algorithm q picks which St to ask
the expert about]
31
Active Learning Flavors
  • Pool
  • (random access to patients)
  • Sequential
  • (must decide as patients walk in the door)

32
q?
  • Recall q(W,D) returns the most interesting
    unlabelled example.
  • Well, what makes a doctor curious about a patient?

33
1994
34
Score Function
35
Uncertainty Sampling Example
t | Sex | Age   | Test A | Test B | Test C | P(St) | H(St)
1 | M   | 20-30 | 0      | 1      | 1      | 0.02  | 0.043
2 | F   | 20-30 | 0      | 1      | 0      | 0.01  | 0.024
3 | F   | 30-40 | 1      | 0      | 0      | 0.05  | 0.086
4 | F   | 60    | 1      | 1      | 0      | 0.12  | 0.159
5 | M   | 10-20 | 0      | 1      | 0      | 0.01  | 0.024
6 | M   | 20-30 | 1      | 1      | 1      | 0.96  | 0.073
Example 4 has the highest H(St), so it is queried; the returned label is FALSE.
36
Uncertainty Sampling Example
t | Sex | Age   | Test A | Test B | Test C | P(St) | H(St)
1 | M   | 20-30 | 0      | 1      | 1      | 0.01  | 0.024
2 | F   | 20-30 | 0      | 1      | 0      | 0.02  | 0.043
3 | F   | 30-40 | 1      | 0      | 0      | 0.04  | 0.073
4 | F   | 60    | 1      | 1      | 0      | 0.00  | 0.00
5 | M   | 10-20 | 0      | 1      | 0      | 0.06  | 0.112
6 | M   | 20-30 | 1      | 1      | 1      | 0.97  | 0.059
Example 4 is now labeled FALSE; example 5 has the highest H(St) and is queried next, returning TRUE.
37
Uncertainty Sampling
  • GOOD: couldn't be easier
  • GOOD: often performs pretty well
  • BAD: H(St) measures information gain about the
    samples, not the model
  • BAD: sensitive to noisy samples
    (a minimal sketch of the selection rule follows)
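A minimal sketch of the selection rule, assuming binary labels and using binary entropy as the uncertainty measure; the slides' H(St) values use their own scaling, but the same example is picked.

```python
import numpy as np

def binary_entropy(p):
    """H(S_t) in bits for a binary prediction P(S_t = 1) = p."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def uncertainty_sampling(p_positive):
    """Query the unlabeled example whose predicted label is most uncertain."""
    return int(np.argmax(binary_entropy(p_positive)))

# With the P(St) column from the slide, the most uncertain row is example 4 (index 3).
print(uncertainty_sampling([0.02, 0.01, 0.05, 0.12, 0.01, 0.96]))
```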

38
Can we do better than uncertainty sampling?
39
1992
40
Model Entropy
[Figure: three posteriors P(W | D) over the model space W,
ranging from high entropy H(W) to H(W) ≈ 0; lower model
entropy is better]
41
Information-Gain
  • Choose the example that is expected to most
    reduce H(W)
  • I.e., maximize H(W) - H(W | St)

42
Score Function
43
  • We usually can't just sum over all models to get
    H(St | W),
  • but we can sample from P(W | D)
    (a Monte Carlo sketch follows)
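A Monte Carlo sketch of the score H(St) − H(St | W): average the sampled models' predictive distributions to get the marginal entropy, then subtract the mean per-model entropy. Drawing models from P(W | D) (e.g. by MCMC or a bootstrap) is assumed possible and left abstract.

```python
import numpy as np

def entropy(p):
    """Entropy (nats) of discrete distributions stored along the last axis."""
    p = np.clip(p, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def expected_info_gain(sampled_models, X):
    """Monte Carlo estimate of H(S_t) - H(S_t | W) for every unlabeled example.
    sampled_models: predictors drawn from P(W | D); each maps X to class probabilities."""
    preds = np.stack([m(X) for m in sampled_models])   # (n_models, n_examples, n_classes)
    marginal = preds.mean(axis=0)                       # model-averaged P(S_t)
    return entropy(marginal) - entropy(preds).mean(axis=0)
```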

44
Conditional Model Entropy
45
Score Function
46
t | Sex | Age   | Test A | Test B | Test C | P(St) | Score = H(C) - H(C | St)
1 | M   | 20-30 | 0      | 1      | 1      | 0.02  | 0.53
2 | F   | 20-30 | 0      | 1      | 0      | 0.01  | 0.58
3 | F   | 30-40 | 1      | 0      | 0      | 0.05  | 0.40
4 | F   | 60    | 1      | 1      | 1      | 0.12  | 0.49
5 | M   | 10-20 | 0      | 1      | 0      | 0.01  | 0.57
6 | M   | 20-30 | 0      | 0      | 1      | 0.02  | 0.52
47
Score Function
Familiar?
48
Uncertainty Sampling Information Gain
49
But there is a problem
50
If our objective is to reduce the prediction
error, then
the expected information gain of an unlabeled
sample is NOT a sufficient criterion for
constructing good queries
51
Strategy 2: Query by Committee
  • Temporary assumptions:
  • Pool → Sequential
  • P(W | D) → Version space
  • Probabilistic → Noiseless
  • QBC attacks the size of the version space

52
[Figure: both sampled models predict FALSE for the incoming
example; no query is made]
53
[Figure: both sampled models predict TRUE for the incoming
example; again no query is made]
54
[Figure: on the incoming example, Model 1 predicts FALSE and
Model 2 predicts TRUE]
Ooh, now we're going to learn something for sure!
One of them is definitely wrong.
55
The Original QBC Algorithm
  • As each example arrives:
  • Choose a committee, C, (usually of size 2)
    randomly from the version space
  • Have each member of C classify it
  • If the committee disagrees, select it
    (a sketch of this loop follows)
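A sketch of this sequential loop; `sample_from_version_space` stands in for drawing a hypothesis consistent with the labels collected so far (the hard part, discussed below), and `oracle` is the labeler.

```python
def qbc_stream(stream, sample_from_version_space, oracle, committee_size=2):
    """Sequential QBC sketch. sample_from_version_space(labeled) must return a
    hypothesis consistent with the labels collected so far; oracle(x) is the labeler."""
    labeled = []
    for x in stream:                                    # examples arrive one at a time
        committee = [sample_from_version_space(labeled) for _ in range(committee_size)]
        if len({h(x) for h in committee}) > 1:          # committee disagrees -> query
            labeled.append((x, oracle(x)))
    return labeled
```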

56
1992
57
Infogain vs Query by Committee
Seung, Opper, Sompolinsky, 1992; Freund, Seung,
Shamir, Tishby, 1997
First idea: try to rapidly reduce the volume of the
version space? Problem: doesn't take the data
distribution into account.
[Figure: hypothesis space H]
Which pair of hypotheses is closest? Depends on the
data distribution P. Distance measure on H:
d(h, h') = P(h(x) ≠ h'(x))
58
Query-by-committee
First idea: try to rapidly reduce the volume of the
version space? Problem: doesn't take the data
distribution into account.
To keep things simple, say d(h, h') = Euclidean
distance.
[Figure: version space within H]
Error is likely to remain large!
59
Query-by-committee
Elegant scheme which decreases volume in a manner
that is sensitive to the data distribution.
Bayesian setting: given a prior π on H, set H1 = H.
For t = 1, 2, ...: receive an unlabeled point xt drawn
from P (informally: is there a lot of disagreement
about xt in Ht?); choose two hypotheses h, h'
randomly from (π, Ht); if h(xt) ≠ h'(xt), ask for
xt's label and set Ht+1.
Problem: how to implement it efficiently?
60
Query-by-committee
For t = 1, 2, ...: receive an unlabeled point xt
drawn from P; choose two hypotheses h, h' randomly
from (π, Ht); if h(xt) ≠ h'(xt), ask for xt's
label and set Ht+1.
Observation: the probability of getting the pair
(h, h') in the inner loop (when a query is made)
is proportional to π(h) π(h') d(h, h').
[Figure: two version spaces Ht compared]
61
(No Transcript)
62
Query-by-committee
Label bound: for H = linear separators in Rd and
P = the uniform distribution, just d log 1/ε labels
are needed to reach a hypothesis with error < ε.
Implementation: need to randomly pick h
according to (π, Ht), e.g. H = linear
separators in Rd, π = uniform distribution.
How do you pick a random point from a convex body?
[Figure: the version space Ht as a convex body]
63
Sampling from convex bodies
  • By random walk!
  • Ball walk
  • Hit-and-run

[Gilad-Bachrach, Navot, Tishby 2005] studies
random walks and also ways to kernelize QBC.
(a rough hit-and-run sketch follows)
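A rough hit-and-run sketch over the version space of homogeneous linear separators, {w : yi⟨w, xi⟩ > 0 for all labeled (xi, yi), ‖w‖ ≤ 1}; this only illustrates the walk itself, not the analyzed algorithms from the papers (no burn-in schedule, no kernelization).

```python
import numpy as np

def hit_and_run(w, X, y, n_steps=200, seed=0):
    """Hit-and-run over {w : y_i <w, x_i> > 0 for all labeled (x_i, y_i), ||w|| <= 1}.
    w must start strictly inside this body; X is (n, d), y contains +/-1 labels."""
    rng = np.random.default_rng(seed)
    for _ in range(n_steps):
        d = rng.standard_normal(w.shape)
        d /= np.linalg.norm(d)
        # Feasible chord from the unit-ball constraint ||w + t d||^2 <= 1.
        b = w @ d
        half = np.sqrt(b * b - (w @ w - 1.0))
        lo, hi = -b - half, -b + half
        # Tighten with each labeling constraint y_i <w + t d, x_i> > 0.
        for xi, yi in zip(X, y):
            a, c = yi * (d @ xi), yi * (w @ xi)   # constraint reads a*t + c > 0
            if a > 0:
                lo = max(lo, -c / a)
            elif a < 0:
                hi = min(hi, -c / a)
        w = w + rng.uniform(lo, hi) * d           # uniform step along the feasible chord
    return w
```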
64
(No Transcript)
65
Some challenges
1. For linear separators, analyze the label
complexity for some distribution other than
uniform! 2. How to handle nonseparable data?
Need a robust base learner.

[Figure: nonseparable data around the true boundary]
66
Active Collaborative Prediction
67
Approach: Collaborative Prediction (CP)
Given previously observed ratings R(x,y), where x
is a user and y is a product, predict
unobserved ratings:
  • will Alina like The Matrix? (unlikely)
  • will Client 86 have fast download from Server 39?
  • will member X of the funding committee approve our
    project Y?
68
Collaborative Prediction as Matrix Approximation
  • Important assumption:
  • matrix entries are NOT independent,
  • e.g. similar users have similar tastes
  • Approaches:
  • mainly factorized models assuming
    hidden factors that affect ratings
  • (pLSA, MCVQ, SVD, NMF, MMMF, ...)

[Figure: a 100 clients × 100 servers matrix]
69
[Figure: a user's row of movie ratings, e.g. 2 4 5 1 4 2]
Assumptions:
  - there are a number of (hidden) factors behind the user
    preferences that relate to (hidden) movie properties
  - movies have intrinsic values associated with such factors
  - users have intrinsic weights on such factors
  - user ratings are weighted (linear) combinations of the
    movies' values
70
[Figure: the ratings matrix example, continued]
71
[Figure: the ratings matrix example, continued]
72
[Figure: the sparsely observed ratings matrix Y and its
rank-k approximation X = U V']
Objective: find a factorizable X = U V' that
approximates Y and satisfies some regularization
constraints (e.g. rank(X) < k).
The loss function depends on the nature of your
problem.
73
Matrix Factorization Approaches
  • Singular value decomposition (SVD): low-rank
    approximation
  • Assumes a fully observed Y and sum-squared loss
  • In collaborative prediction, Y is only partially
    observed
  • Low-rank approximation becomes a non-convex
    problem with many local minima
  • Furthermore, we may not want sum-squared loss,
    but instead
  • accurate predictions (0/1 loss, approximated by
    hinge loss)
  • cost-sensitive predictions (missing a good server
    vs. suggesting a bad one)
  • ranking cost (e.g., suggest the k best movies for a
    user)
  • NON-CONVEX PROBLEMS!
  • Use instead the state-of-the-art Max-Margin Matrix
    Factorization [Srebro '05]
  • replaces the bounded-rank constraint by a bounded
    norm of the U, V vectors
  • convex optimization problem! can be solved
    exactly by semi-definite programming
  • strongly relates to learning max-margin
    classifiers (SVMs)
  • Exploit MMMF's properties to augment it with
    active sampling!

74
Key Idea of MMMF
Rows = feature vectors, columns = linear
classifiers (linear classifiers → weight vectors;
the margin here = dist(sample, line))
[Figure: feature vectors (e.g. f1) and classifier weight
vectors (e.g. v2) with ±1 labels]
X_ij = sign_ij × margin_ij
Predictor_ij = sign_ij
If sign_ij > 0, classify as +1; otherwise
classify as -1
75
MMMF: Simultaneous Search for Low-norm
Feature Vectors and Max-margin Classifiers
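For intuition only, here is a gradient-based stand-in that factorizes a partially observed ±1 matrix with hinge loss and Frobenius-norm penalties on U and V; the actual MMMF of [Srebro '05] is a convex semi-definite program, which this sketch does not reproduce.

```python
import numpy as np

def soft_mmmf(Y, observed, k=5, lam=0.1, lr=0.05, epochs=300, seed=0):
    """Hinge-loss low-rank factorization of a +/-1 matrix Y, observed where `observed` is True.
    Penalizes ||U||_F^2 + ||V||_F^2, echoing MMMF's bounded-norm idea, but uses plain
    gradient descent rather than the SDP formulation."""
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    for _ in range(epochs):
        X = U @ V.T
        G = np.where(observed & (Y * X < 1), -Y, 0.0)   # hinge-loss subgradient per entry
        gU = G @ V + lam * U
        gV = G.T @ U + lam * V
        U -= lr * gU
        V -= lr * gV
    return U, V
```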
76
Active Learning with MMMF
  - We extend MMMF to Active-MMMF using
    margin-based active sampling
  - We investigate exploration vs. exploitation
    trade-offs imposed by different heuristics
77
Active Max-Margin Matrix Factorization
  • A-MMMF(M, s)
  • 1. Given a sparse matrix Y, learn the approximation
    X = MMMF(Y)
  • 2. Using the current predictions, actively
    select the best s samples and request their labels
    (e.g., test a client/server pair via an enforced
    download); a sketch of this selection step follows
  • 3. Add the new samples to Y
  • 4. Repeat 1-3
  • Issues:
  • Beyond simple greedy margin-based heuristics?
  • Theoretical guarantees? not so easy with
    non-trivial learning methods and non-trivial data
    distributions
  • (any suggestions?)
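A sketch of the margin-based selection in step 2, assuming the factorization X = U V' from step 1 is available; "most uncertain" here means smallest |X_ij| among unobserved entries, and the other heuristics mentioned in the talk (most-uncertain-positive, least-uncertain) just change the sort key.

```python
import numpy as np

def select_queries(U, V, observed, s=10):
    """Return the s unobserved (row, col) entries with the smallest margin |X_ij|."""
    X = U @ V.T
    margin = np.abs(X).astype(float)
    margin[observed] = np.inf                    # never re-query observed entries
    flat = np.argsort(margin, axis=None)[:s]     # indices of the s smallest margins
    return [np.unravel_index(i, margin.shape) for i in flat]
```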

78
Empirical Results
  • Network latency prediction
  • Bandwidth prediction (peer-to-peer)
  • Movie Ranking Prediction
  • Sensor net connectivity prediction

79
Empirical Results: Latency Prediction
[Figures: results on P2Psim data and on NLANR-AMP data]
Active sampling with the most-uncertain (and
most-uncertain-positive) heuristics provides
consistent improvement over random and
least-uncertain-next sampling.
80
Movie Rating Prediction (MovieLens)
81
Sensor Network Connectivity
82
Introducing Cost: Exploration vs. Exploitation
[Figures: DownloadGrid bandwidth prediction; PlanetLab
latency prediction]
Active sampling → lower prediction errors at
lower cost (saves 100s of samples); better
prediction → better server-assignment decisions →
faster downloads.
Active sampling achieves a good exploration
vs. exploitation trade-off: reduced
decision cost AND information gain.
83
Conclusions
  • Common challenge in many applications:
    the need for cost-efficient sampling
  • This talk: linear hidden-factor models with
    active sampling
  • Active sampling improves predictive accuracy
    while keeping sampling complexity low in a wide
    variety of applications
  • Future work
  • Better active sampling heuristics?
  • Theoretical analysis of active sampling
    performance?
  • Dynamic matrix factorizations: tracking
    time-varying matrices
  • Incremental MMMF? (solving from scratch every
    time is too costly)

84
References
  • Some of the most influential papers
  • Simon Tong, Daphne Koller. Support Vector Machine
    Active Learning with Applications to Text
    Classification. Journal of Machine Learning
    Research. Volume 2, pages 45-66. 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby.
    Selective sampling using the query by
    committee algorithm. Machine Learning, 28:133-168, 1997.
  • David Cohn, Zoubin Ghahramani, and Michael
    Jordan. Active learning with statistical models.
    Journal of Artificial Intelligence Research,
    4:129-145, 1996.
  • David Cohn, Les Atlas and Richard Ladner.
    Improving generalization with active learning.
    Machine Learning, 15(2):201-221, 1994.
  • D. J. C. Mackay. Information-Based Objective
    Functions for Active Data Selection. Neural
    Comput., vol. 4, no. 4, pp. 590--604, 1992.

85
NIPS papers
  • Francis Bach. Active learning for misspecified
    generalized linear models. NIPS-06
  • Ran Gilad-Bachrach, Amir Navot, Naftali Tishby.
    Query by Committee Made Real. NIPS-05
  • Brent Bryan, Jeff Schneider, Robert Nichol,
    Christopher Miller, Christopher Genovese, Larry
    Wasserman. Active Learning For Identifying
    Function Threshold Boundaries . NIPS-05
  • Rui Castro, Rebecca Willett, Robert Nowak. Faster
    Rates in Regression via Active Learning. NIPS-05
  • Sanjoy Dasgupta. Coarse sample complexity bounds
    for active learning. NIPS-05
  • Masashi Sugiyama. Active Learning for
    Misspecified Models. NIPS-05
  • Brigham Anderson, Andrew Moore. Fast Information
    Value for Graphical Models. NIPS-05
  • Dan Pelleg, Andrew W. Moore. Active Learning for
    Anomaly and Rare-Category Detection. NIPS-04
  • Sanjoy Dasgupta. Analysis of a greedy active
    learning strategy. NIPS-04
  • T. Jaakkola and H. Siegelmann. Active Information
    Retrieval. NIPS-01
  • M. K. Warmuth et al. Active Learning in the Drug
    Discovery Process. NIPS-01
  • Jonathan D. Nelson, Javier R. Movellan. Active
    Inference in Concept Learning. NIPS-00
  • Simon Tong, Daphne Koller. Active Learning for
    Parameter Estimation in Bayesian Networks.
    NIPS-00
  • Thomas Hofmann and Joachim M. Buhmann. Active
    Data Clustering. NIPS-97
  • K. Fukumizu. Active Learning in Multilayer
    Perceptrons. NIPS-95
  • Anders Krogh, Jesper Vedelsby. Neural Network
    Ensembles, Cross Validation, and Active Learning.
    NIPS-94
  • Kah Kay Sung, Partha Niyogi. Active Learning for
    Function Approximation. NIPS-94
  • David Cohn, Zoubin Ghahramani, Michael I. Jordan.
    Active Learning with Statistical Models. NIPS-94
  • Sebastian B. Thrun and Knut Moller. Active
    Exploration in Dynamic Environments. NIPS-91

86
ICML papers
  • Maria-Florina Balcan, Alina Beygelzimer, John
    Langford. Agnostic Active Learning. ICML-06
  • Steven C. H. Hoi, Rong Jin, Jianke Zhu, Michael
    R. Lyu. Batch Mode Active Learning and Its
    Application to Medical Image Classification.
    ICML-06
  • Sriharsha Veeramachaneni, Emanuele Olivetti,
    Paolo Avesani. Active Sampling for Detecting
    Irrelevant Features. ICML-06
  • Kai Yu, Jinbo Bi, Volker Tresp. Active Learning
    via Transductive Experimental Design. ICML-06
  • Rohit Singh, Nathan Palmer, David Gifford, Bonnie
    Berger, Ziv Bar-Joseph. Active Learning for
    Sampling in Time-Series Experiments With
    Application to Gene Expression Analysis. ICML-05
  • Prem Melville, Raymond Mooney. Diverse Ensembles
    for Active Learning. ICML-04
  • Klaus Brinker. Active Learning of Label Ranking
    Functions. ICML-04
  • Hieu Nguyen, Arnold Smeulders. Active Learning
    Using Pre-clustering. ICML-04
  • Greg Schohn and David Cohn. Less is More Active
    Learning with Support Vector Machines, ICML-00
  • Simon Tong, Daphne Koller. Support Vector Machine
    Active Learning with Applications to Text
    Classification. ICML-00.
  • COLT papers
  • S. Dasgupta, A. Kalai, and C. Monteleoni.
    Analysis of perceptron-based active learning.
    COLT-05.
  • H. S. Seung, M. Opper, and H. Sompolinsky.
    Query by committee. COLT-92, pages 287-294, 1992.

87
Journal Papers
  • Antoine Bordes, Seyda Ertekin, Jason Weston, Leon
    Bottou. Fast Kernel Classifiers with Online and
    Active Learning. Journal of Machine Learning
    Research (JMLR), vol. 6, pp. 1579-1619, 2005.
  • Simon Tong, Daphne Koller. Support Vector Machine
    Active Learning with Applications to Text
    Classification. Journal of Machine Learning
    Research. Volume 2, pages 45-66. 2001.
  • Y. Freund, H. S. Seung, E. Shamir, N. Tishby.
    Selective sampling using the query by
    committee algorithm. Machine Learning,
    28:133-168, 1997.
  • David Cohn, Zoubin Ghahramani, and Michael
    Jordan. Active learning with statistical models.
    Journal of Artificial Intelligence Research,
    4:129-145, 1996.
  • David Cohn, Les Atlas and Richard Ladner.
    Improving generalization with active learning.
    Machine Learning, 15(2):201-221, 1994.
  • D. J. C. Mackay. Information-Based Objective
    Functions for Active Data Selection. Neural
    Comput., vol. 4, no. 4, pp. 590--604, 1992.
  • Haussler, D., Kearns, M., and Schapire, R. E.
    (1994). Bounds on the sample complexity of
    Bayesian learning using information theory and
    the VC dimension. Machine Learning, 14, 83--113
  • Fedorov, V. V. 1972. Theory of optimal
    experiment. Academic Press.
  • Saar-Tsechansky, M. and F. Provost. Active
    Sampling for Class Probability Estimation and
    Ranking. Machine Learning, 54(2):153-178, 2004.

88
Workshops
  • http://domino.research.ibm.com/comm/research_projects.nsf/pages/nips05workshop.index.html

89
Appendix
90
Active Learning of Bayesian Networks
91
Entropy Function
  • A measure of information in a random event X with
    possible outcomes x1, ..., xn
  • Comments on the entropy function:
  • Entropy of an event is zero when the outcome is
    known
  • Entropy is maximal when all outcomes are equally
    likely
  • The average minimum number of yes/no questions
    needed to identify the outcome (connection to
    binary search)

H(X) = -Σᵢ p(xᵢ) log₂ p(xᵢ)
Shannon, 1948
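A direct transcription of the formula (a sketch; log base 2, so the result is in bits):

```python
import numpy as np

def entropy_bits(p):
    """H(X) = -sum_i p(x_i) log2 p(x_i); zero-probability outcomes contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

print(entropy_bits([1.0, 0.0]))     # 0.0 -- the outcome is known
print(entropy_bits([0.25] * 4))     # 2.0 -- four equally likely outcomes, two yes/no questions
```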
92
Kullback-Leibler divergence
  • P is the true distribution; Q is the distribution
    used to encode data instead of P
  • KL divergence is the expected extra message
    length per datum that must be transmitted using Q
  • Measure of how wrong Q is with respect to the true
    distribution P

D_KL(P || Q) = Σᵢ P(xᵢ) log (P(xᵢ) / Q(xᵢ))
            = -Σᵢ P(xᵢ) log Q(xᵢ) + Σᵢ P(xᵢ) log P(xᵢ)
            = H(P, Q) - H(P)
            = cross-entropy - entropy
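A direct transcription of the definition (a sketch; assumes Q > 0 wherever P > 0):

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log2(P(x_i) / Q(x_i)) = H(P, Q) - H(P), in bits."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float((p[mask] * np.log2(p[mask] / q[mask])).sum())

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # expected extra bits per datum for encoding P with Q
```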
93
Learning Bayesian Networks
[Figure: Data + Prior Knowledge → Learner → Model]
  • Model building:
  • Parameter estimation
  • Causal structure discovery
  • Passive learning vs. active learning

94
Active Learning
  • Selective Active Learning
  • Interventional Active Learning
  • Obtain measure of quality of current model
  • Choose query that most improves quality
  • Update model

95
Active Learning for Parameter Estimation
[Tong & Koller, NIPS-2000]
  • Given a BN structure G
  • A prior distribution p(θ)
  • The learner requests a particular instantiation q
    (query)

[Figure: the active learner sends a query q and receives a
response x, which is added to the training data]
Questions: how to update the parameter density, and how to
select the next query based on p(θ)?
96
Updating parameter density
  • Do not update A, since we are fixing it
  • If we select A, then do not update B:
  • samples come from P(B | A = a) ≠ P(B)
  • If we force A, then we can update B:
  • samples come from P(B | do(A = a)) = P(B)
  • Update all other nodes as usual
  • Obtain the new density

[Figure: Bayesian network over nodes B, A, J, M]
[Pearl 2000]
97
Bayesian point estimation
  • Goal: a single estimate θ̂
  • instead of a distribution p over θ
  • If we choose θ̂ and the true model is θ̃, then we
    incur some loss L(θ̃ ‖ θ̂)

98
Bayesian point estimation
  • We do not know the true θ̃
  • The density p represents our optimal beliefs over θ̃
  • Choose the θ̂ that minimizes the expected loss:
  • θ̂ = argmin_θ ∫ p(θ̃) L(θ̃ ‖ θ) dθ̃
  • Call θ̂ the Bayesian point estimate
  • Use the expected loss of the Bayesian point
    estimate as a measure of quality of p(θ):
  • Risk(p) = ∫ p(θ̃) L(θ̃ ‖ θ̂) dθ̃
    (a small numerical sketch follows)
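A small numerical sketch on a discretized posterior over a Bernoulli parameter, with KL loss; the grid and the posterior shape are made up purely for illustration.

```python
import numpy as np

def bayes_point_estimate(grid, posterior, loss):
    """Return argmin over candidates theta of E_{theta_tilde ~ posterior}[loss(theta_tilde, theta)],
    together with that expected loss (the Risk of the density, per the slide)."""
    expected = np.array([
        sum(w * loss(true, cand) for true, w in zip(grid, posterior))
        for cand in grid
    ])
    i = int(np.argmin(expected))
    return grid[i], expected[i]

# Illustrative posterior over a Bernoulli parameter, discretized on a grid.
grid = np.linspace(0.01, 0.99, 99)
post = np.exp(-(grid - 0.7) ** 2 / 0.02)
post /= post.sum()
kl = lambda p, q: p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))
theta_hat, risk = bayes_point_estimate(grid, post, kl)   # with KL loss this is (close to) the posterior mean
```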



99
The Querying component
  • Set the controllable variables so as to minimize
    the expected posterior risk
  • KL divergence will be used as the loss:
  • KL(θ̃ ‖ θ̂) = Σᵢ KL(Pθ̃(Xᵢ | Uᵢ) ‖ Pθ̂(Xᵢ | Uᵢ))

(a sum of conditional KL divergences, one per node Xᵢ with parents Uᵢ)
100
Algorithm Summary
  • For each potential query q:
  • compute ΔRisk(X | q)
  • Choose the q for which ΔRisk(X | q) is greatest
  • Cost of computing ΔRisk(X | q):
  • the cost of Bayesian network inference
  • Complexity: O(|Q| · cost of inference)
    (a minimal sketch of the selection step follows)
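A minimal sketch of the selection step; `expected_posterior_risk` is a placeholder for the expensive part (Bayesian network inference plus the parameter-density update, averaged over the predicted responses to q).

```python
def choose_query(candidate_queries, expected_posterior_risk, current_risk):
    """Select the query with the largest expected risk reduction DeltaRisk(X | q).
    expected_posterior_risk(q) stands in for one round of BN inference plus the
    parameter-density update, averaged over the predicted responses to q."""
    return max(candidate_queries, key=lambda q: current_risk - expected_posterior_risk(q))
```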

101
Uncertainty sampling
Maintain a single hypothesis, based on the labels
seen so far. Query the point about which this
hypothesis is most uncertain. Problem: the
confidence of a single hypothesis may not
accurately represent the true diversity of
opinion in the hypothesis class.
[Figure: positively and negatively labeled points with a
single hypothesis; X marks its most uncertain point]
102
(No Transcript)
103
Region of uncertainty
Current version space: the portion of H consistent
with the labels so far. Region of uncertainty:
the part of data space about which there is still
some uncertainty (i.e. disagreement within the
version space).
[Figure: data on a circle in R2, hypotheses are linear
separators; the spaces X and H are superimposed, showing the
current version space and the region of uncertainty in data
space]
104
Region of uncertainty
Algorithm [CAL '92]: of the unlabeled points which
lie in the region of uncertainty, pick one at
random to query.
[Figure: data and hypothesis spaces superimposed (both are
the surface of the unit sphere in Rd), showing the current
version space and the region of uncertainty in data space]
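A sketch of the CAL rule with a finite sample of hypotheses standing in for the version space; maintaining the exact version space (or region of uncertainty) is left abstract.

```python
import random

def cal_query(pool, hypotheses):
    """CAL [Cohn, Atlas, Ladner '92] sketch: `hypotheses` is a finite sample standing in
    for the version space; query a random unlabeled point on which they disagree."""
    uncertain = [x for x in pool if len({h(x) for h in hypotheses}) > 1]
    return random.choice(uncertain) if uncertain else None
```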
105
Region of uncertainty
The number of labels needed depends on H and also on
P. Special case: H = linear separators in Rd,
P = uniform distribution over the unit sphere.
Then just d log 1/ε labels [1, 2] are needed to
reach a hypothesis with error rate < ε.
[1] Supervised learning needs d/ε labels. [2] The best we
can hope for.
106
Region of uncertainty
Algorithm [CAL '92]: of the unlabeled points which
lie in the region of uncertainty, pick one at
random to query. For more general distributions:
suboptimal.
Need to measure the quality of a query, or
alternatively, the size of the version space.
107
Expected Infogain of sample
Uncertainty sampling!
108
(No Transcript)