Title: Models of active learning for classification
1Models of active learning for classification
- Sanjoy Dasgupta
- UC San Diego
2Supervised learning
- Given access to labeled data (drawn iid from an
unknown underlying distribution P), want to learn
a good classifier chosen from hypothesis class H.
3Active learning
- In many situations, like speech recognition and
document retrieval, unlabeled data is easy to
come by, but there is a charge for each label.
Goal: pick a good classifier, at low cost.
4Membership queries
Earliest model of active learning in theory work [Angluin 1992].
$X$ = space of possible inputs, like $\{0,1\}^n$
$H$ = class of hypotheses
Target concept $h \in H$ to be identified exactly.
You can ask for the label of any point in $X$: no unlabeled data.
$H_0 = H$
For $t = 1, 2, \ldots$
  pick a point $x \in X$ and query its label $h(x)$
  let $H_t$ = all hypotheses in $H_{t-1}$ consistent with $(x, h(x))$
What is the minimum number of membership queries needed to reduce $H$ to just $\{h\}$?
5Membership queries example
$X = \{0,1\}^n$
$H$ = AND-of-positive-literals, like $x_1 \wedge x_3 \wedge x_{10}$
$S = \{\}$ (set of AND positions)
For $i = 1$ to $n$:
  ask for the label of $(1, \ldots, 1, 0, 1, \ldots, 1)$ [0 at position $i$]
  if negative: $S = S \cup \{i\}$
Total: $n$ queries
General idea: synthesize highly informative points. Each query cuts the
version space in half.
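The loop above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the names `learn_conjunction`, `oracle`, and `target` are hypothetical, positions are 0-based, and the target conjunction is made up for the example.

```python
from typing import Callable, Set, Tuple


def learn_conjunction(n: int, query: Callable[[Tuple[int, ...]], bool]) -> Set[int]:
    """Identify an AND-of-positive-literals target using exactly n membership queries."""
    relevant = set()
    for i in range(n):
        # Synthesize the all-ones point with a single 0 at position i.
        x = tuple(0 if j == i else 1 for j in range(n))
        if not query(x):
            # A negative label means turning off variable i broke the AND,
            # so variable i must be one of the literals.
            relevant.add(i)
    return relevant


# Hypothetical target: x1 AND x3 AND x10 (1-based), i.e. positions {0, 2, 9}.
target = {0, 2, 9}
oracle = lambda x: all(x[i] == 1 for i in target)
learned = learn_conjunction(12, oracle)
print(sorted(learned))
```

Each query settles one of the n yes/no questions "is variable i in the conjunction?", which is the sense in which a query halves the version space of $2^n$ candidate conjunctions.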
6Problem
Many results in this framework, even for
complicated hypothesis classes.
[Baum and Lang, 1991] tried fitting a neural net to handwritten
characters. Synthetic instances created were
incomprehensible to humans!
[Lewis and Gale, 1992] tried training text classifiers:
"an artificial text created by a learning algorithm
is unlikely to be a legitimate natural language
expression, and probably would be uninterpretable
by a human teacher."
7A better, PAC-like model
[Cohn, Atlas, and Ladner, 1992]
Underlying distribution $P$ on the $(x,y)$ data.
Learner has two abilities:
-- draw an unlabeled sample from the distribution
-- ask for a label of one of these samples
The error of any classifier $h$ is measured on distribution $P$:
$\mathrm{err}(h) = P(h(x) \neq y)$
Special case to simplify matters: assume the
data is separable, i.e. some concept $h \in H$ labels
all points perfectly.
81 Uncertainty sampling
Maintain a single hypothesis, based on labels
seen so far. Query the point about which this
hypothesis is most uncertain.
Problem: confidence of a single hypothesis may not
accurately represent the true diversity of
opinion in the hypothesis class.
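As a rough illustration of the query step, here is a sketch that assumes the single hypothesis is a linear separator `w` and that "most uncertain" means smallest absolute margin $|w \cdot x|$. Both choices, and the function name `most_uncertain`, are assumptions made for this example; the slide does not fix a hypothesis type or an uncertainty measure.

```python
import numpy as np


def most_uncertain(w: np.ndarray, pool: np.ndarray) -> int:
    """Return the index of the pool point closest to the decision boundary of w."""
    margins = np.abs(pool @ w)  # small margin = low confidence
    return int(np.argmin(margins))


w = np.array([1.0, 0.0])                      # current hypothesis (illustrative)
pool = np.array([[1.0, 0.0],                  # margin 1.0
                 [0.1, 1.0],                  # margin 0.1, closest to the boundary
                 [-1.0, 0.5]])                # margin 1.0
idx = most_uncertain(w, pool)
print(idx)
```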
92 Region of uncertainty
Current version space: portion of $H$ consistent
with labels so far.
Region of uncertainty: part of data space about which there is still
some uncertainty (i.e. disagreement within version
space).
Example: suppose data lies on a circle in $\mathbb{R}^2$ and hypotheses are
linear separators (spaces $X$, $H$ superimposed).
[Figure: the current version space and the corresponding region of uncertainty in data space]
102 Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which
lie in the region of uncertainty, pick one at
random to query.
[Figure: data and hypothesis spaces, superimposed (both are the surface of the unit sphere in $\mathbb{R}^d$), showing the current version space and the region of uncertainty in data space]
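To make the region of uncertainty concrete, here is a sketch of the CAL idea for a much simpler class than the one pictured: thresholds on $[0,1]$, where $h_w(x) = 1$ iff $x \geq w$. This class, the streaming formulation (walking through i.i.d. points and querying those that fall in the region, which plays the role of picking a random point from the region), and all names below are assumptions chosen for illustration.

```python
import random


def cal_threshold(points, oracle):
    """CAL-style learner for thresholds on [0, 1]; returns the version space and query count."""
    lo, hi = 0.0, 1.0  # version space: all thresholds w with lo < w <= hi
    queries = 0
    for x in points:
        # Thresholds in (lo, x] label x positive, those in (x, hi] label it negative,
        # so the version space disagrees about x exactly when lo < x < hi.
        if lo < x < hi:
            queries += 1
            if oracle(x):   # positive label: the target threshold is at most x
                hi = x
            else:           # negative label: the target threshold exceeds x
                lo = x
        # Points outside (lo, hi) have a label everyone agrees on; no query needed.
    return lo, hi, queries


rng = random.Random(0)
w_star = 0.37                                   # hypothetical target threshold
pool = [rng.random() for _ in range(10_000)]
lo, hi, queries = cal_threshold(pool, lambda x: x >= w_star)
print(f"version space: ({lo:.5f}, {hi:.5f}], labels used: {queries} of {len(pool)}")
```

In this toy setting the number of queried labels grows far more slowly than the number of points seen, because the region of uncertainty shrinks after every query.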
112 Region of uncertainty
Number of labels needed depends on $H$ and also on $P$.
Special case: $H$ = {linear separators in $\mathbb{R}^d$},
$P$ = uniform distribution over the unit sphere.
Then just $d \log 1/\epsilon$ labels are needed to
reach a hypothesis with error rate $< \epsilon$.
(1) Supervised learning: $d/\epsilon$ labels.
(2) Best we can hope for.
122 Region of uncertainty
Algorithm [CAL92]: of the unlabeled points which
lie in the region of uncertainty, pick one at
random to query.
For more general distributions: suboptimal.
Need to measure quality of a query, or
alternatively, size of version space.
13Query-by-committee
[Seung, Opper, Sompolinsky, 1992] [Freund, Seung,
Shamir, Tishby 1997]
First idea: try to rapidly reduce volume of
version space?
Problem: doesn't take data distribution into account.
[Figure: hypothesis space $H$]
Which pair of hypotheses is closest? Depends on
data distribution $P$.
Distance measure on $H$: $d(h,h') = P(h(x) \neq h'(x))$
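The distance $d(h,h')$ can be estimated from unlabeled data alone. The sketch below assumes homogeneous linear separators and a uniform distribution on the unit sphere in $\mathbb{R}^3$, where the disagreement probability is known to equal the angle between the two normal vectors divided by $\pi$. The function name `disagreement` and the specific vectors are illustrative.

```python
import numpy as np


def disagreement(w1: np.ndarray, w2: np.ndarray, X: np.ndarray) -> float:
    """Monte Carlo estimate of d(h, h') = P(h(x) != h'(x)) from unlabeled points X."""
    return float(np.mean(np.sign(X @ w1) != np.sign(X @ w2)))


rng = np.random.default_rng(0)
X = rng.standard_normal((200_000, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalized Gaussians are uniform on the sphere

angle = 0.5                                      # radians between the two separators
w1 = np.array([1.0, 0.0, 0.0])
w2 = np.array([np.cos(angle), np.sin(angle), 0.0])

est = disagreement(w1, w2, X)
exact = angle / np.pi                            # closed form for this symmetric case
print(f"estimated d = {est:.4f}, angle/pi = {exact:.4f}")
```

Under a different $P$ the same two hypotheses can be much closer or much farther apart, which is why volume in parameter space alone is a poor guide.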
14Query-by-committee
First idea: try to rapidly reduce volume of
version space?
Problem: doesn't take data distribution into account.
To keep things simple, say $d(h,h') \approx$ Euclidean
distance in this picture.
[Figure: hypothesis space $H$]
Error is likely to remain large!
153 Query-by-committee
Elegant scheme which decreases volume in a manner
which is sensitive to the data distribution.
Bayesian setting: given a prior $\pi$ on $H$
$H_1 = H$
For $t = 1, 2, \ldots$
  receive an unlabeled point $x_t$ drawn from $P$
  [informally: is there a lot of disagreement about $x_t$ in $H_t$?]
  choose two hypotheses $h, h'$ randomly from $(\pi, H_t)$
  if $h(x_t) \neq h'(x_t)$: ask for $x_t$'s label
  set $H_{t+1}$
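A toy version of this loop, for illustration only: it assumes thresholds on $[0,1]$ with $h_w(x) = 1$ iff $x \geq w$ and a uniform prior, so that sampling from $(\pi, H_t)$ is just a uniform draw from an interval. The class, the prior, and the names `qbc_threshold` and `w_star` are assumptions of this sketch.

```python
import random


def qbc_threshold(stream, oracle, rng):
    """Query-by-committee for thresholds on [0, 1] under a uniform prior."""
    lo, hi = 0.0, 1.0  # version space H_t: thresholds w with lo < w <= hi
    queries = 0
    for x in stream:
        # Committee of two hypotheses drawn from the prior restricted to H_t.
        h1 = rng.uniform(lo, hi)
        h2 = rng.uniform(lo, hi)
        if (x >= h1) != (x >= h2):      # committee disagrees about x: ask for its label
            queries += 1
            if oracle(x):               # positive: target threshold is at most x
                hi = x
            else:                       # negative: target threshold exceeds x
                lo = x
    return lo, hi, queries


rng = random.Random(1)
w_star = 0.37                           # hypothetical target threshold
stream = [rng.random() for _ in range(10_000)]
lo, hi, queries = qbc_threshold(stream, lambda x: x >= w_star, rng)
print(f"version space: ({lo:.5f}, {hi:.5f}], labels used: {queries}")
```

Unlike querying every point in the region of uncertainty, a point here is queried with probability equal to the chance that two random members of the version space disagree on it.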
16Query-by-committee
For $t = 1, 2, \ldots$
  receive an unlabeled point $x_t$ drawn from $P$
  choose two hypotheses $h, h'$ randomly from $(\pi, H_t)$
  if $h(x_t) \neq h'(x_t)$: ask for $x_t$'s label
  set $H_{t+1}$
Observation: the probability of
getting pair $(h,h')$ in the inner loop (when a
query is made) is proportional to $\pi(h)\,\pi(h')\,d(h,h')$.
[Figure: version space $H_t$]
17Query-by-committee
Label bound: for $H$ = {linear separators in $\mathbb{R}^d$},
$P$ = uniform distribution, just $d \log 1/\epsilon$ labels
to reach a hypothesis with error $< \epsilon$.
Implementation: need to randomly pick $h$
according to $(\pi, H_t)$.
e.g. $H$ = {linear separators in $\mathbb{R}^d$}, $\pi$ = uniform distribution
How do you pick a random point from a convex body?
[Figure: version space $H_t$]
18Sampling from convex bodies
- By random walk!
- Ball walk
- Hit-and-run
[Gilad-Bachrach, Navot, Tishby 2005] Studies
random walks and also ways to kernelize QBC.
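For a feel of how hit-and-run works, here is a minimal sketch for a bounded polytope $\{x : Ax \leq b\}$. The polytope representation, the function name `hit_and_run`, and the box used in the example are assumptions of this sketch; a version space of linear separators would also need the unit-ball constraint handled, which is omitted here.

```python
import numpy as np


def hit_and_run(A, b, x0, n_steps, rng):
    """Hit-and-run random walk inside the bounded polytope {x : A x <= b}."""
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_steps):
        # Pick a uniformly random direction.
        d = rng.standard_normal(x.shape[0])
        d /= np.linalg.norm(d)
        # x + t*d is feasible while t * (A d) <= b - A x holds in every row.
        slack = b - A @ x
        rate = A @ d
        t_hi = np.min(slack[rate > 0] / rate[rate > 0])   # furthest step forward
        t_lo = np.max(slack[rate < 0] / rate[rate < 0])   # furthest step backward
        # Jump to a uniformly random point on the chord through x.
        x = x + rng.uniform(t_lo, t_hi) * d
        samples.append(x.copy())
    return np.array(samples)


# Example body: the square [-1, 1]^2 written as A x <= b.
A = np.vstack([np.eye(2), -np.eye(2)])
b = np.ones(4)
rng = np.random.default_rng(0)
samples = hit_and_run(A, b, x0=np.zeros(2), n_steps=2000, rng=rng)
print(samples[:3])
```

The walk only needs a membership-style computation along a line, which is what makes it usable when the body is given implicitly by constraints.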
19Online active learning
- Online algorithms:
  - see unlabeled data streaming by, one point at a time
  - can query current point's label, at a cost
  - can only maintain current hypothesis (memory bound)
- [Dasgupta, Kalai, Monteleoni 2005] An active version of the perceptron algorithm.
- Guarantee: for linear separators under the uniform distribution, label complexity is $d \log 1/\epsilon$.
20Active perceptron?
No matter what selective sampling rule is used,
the perceptron algorithm needs $1/\epsilon^2$ labels to
reach error $\epsilon$.
$v_t$ = current hypothesis, $u$ = target, $\theta_t$ = angle between $v_t$ and $u$. Then:
(1) This angle increases unless $\|v_t\| \geq 1/\sin\theta_t$
(2) $\|v_t\| \leq t^{1/2}$
Therefore need $\sin\theta_t \geq 1/t^{1/2}$.
(For uniform distribution) error rate is approximately $\sin\theta_t$.
[Figure: vectors $v_t$, $v_{t+1}$ and $u$, the angle $\theta_t$, and the point $x_t$. Graphic taken from C. Monteleoni]
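The two facts on this slide chain into the $1/\epsilon^2$ bound. What follows is a sketch of that reasoning, assuming unit-length points $x_t$, counting $t$ in perceptron updates (each of which costs a label), and ignoring constant factors. It is a reconstruction and not necessarily the exact argument behind the slide.

```latex
\begin{align*}
\|v_{t+1}\|^2 &= \|v_t + y_t x_t\|^2
   = \|v_t\|^2 + 2\, y_t (v_t \cdot x_t) + \|x_t\|^2
   \le \|v_t\|^2 + 1
   && \text{(update only on a mistake, so } y_t (v_t \cdot x_t) \le 0\text{)} \\
\Rightarrow\quad \|v_t\| &\lesssim \sqrt{t} \\
\frac{1}{\sin\theta_t} &\lesssim \|v_t\| \lesssim \sqrt{t}
   \quad\Rightarrow\quad \sin\theta_t \gtrsim \frac{1}{\sqrt{t}}
   && \text{(needed for the angle to keep decreasing)} \\
\text{error} \approx \sin\theta_t \le \epsilon
   \quad&\Rightarrow\quad t \gtrsim \frac{1}{\epsilon^2}
\end{align*}
```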
21Conservative update
Standard Perceptron update: $v_{t+1} = v_t + y_t x_t$
Instead, weight the update by confidence
w.r.t. current hypothesis $v_t$:
$v_{t+1} = v_t + 2\, y_t\, |v_t \cdot x_t|\, x_t$   ($v_1 = y_0 x_0$)
(smaller update for points close to boundary)
Unlike Perceptron:
(1) Length remains constant: $\|v_t\| = 1$
(2) Angle $\theta_t$ decreases monotonically
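Both properties can be checked numerically on a single update. The sketch below assumes unit-length points and uses a hand-picked 2-D example; the function name `conservative_update` and the particular vectors are illustrative.

```python
import numpy as np


def conservative_update(v: np.ndarray, x: np.ndarray, y: float) -> np.ndarray:
    """Modified update v + 2 y |v.x| x, applied when (x, y) is misclassified by v."""
    return v + 2.0 * y * abs(v @ x) * x


u = np.array([1.0, 0.0])                        # target direction
v = np.array([np.cos(1.0), np.sin(1.0)])        # current hypothesis, 1 radian from u
x = np.array([np.cos(-0.8), np.sin(-0.8)])      # unit-length point
y = 1.0 if u @ x >= 0 else -1.0                 # label given by the target

mistake = y * (v @ x) < 0                       # v gets this point wrong
v_new = conservative_update(v, x, y)

print("misclassified:", mistake)
print("norm before/after:", np.linalg.norm(v), np.linalg.norm(v_new))
print("cos(angle to u) before/after:", v @ u, v_new @ u)
```

On a mistake the update equals $v_t - 2 (v_t \cdot x_t) x_t$, a reflection of $v_t$ across the hyperplane orthogonal to $x_t$, which is why the length cannot change.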
22A more conservative update
Standard Perceptron update: $v_{t+1} = v_t + y_t x_t$
Modified update: $v_{t+1} = v_t + 2\, y_t\, |v_t \cdot x_t|\, x_t$
[Figure: $v_t$, $u$ and $x_t$, with the resulting $v_{t+1}$ under each update. Graphic taken from C. Monteleoni]
234 Active perceptron
Input: a stream of data points $x_0, x_1, x_2, \ldots$
Set initial hypothesis $v_0 = y_0 x_0$
For $t = 1, 2, 3, \ldots$
  receive unlabeled point $x_t$
  Filtering step: decide whether to ask for $x_t$'s label $y_t$
  if label is requested:
    if $(x_t, y_t)$ is misclassified: $v_{t+1} = v_t + 2\, y_t\, |v_t \cdot x_t|\, x_t$
    adjust filtering rule
  otherwise: $v_{t+1} = v_t$
What filtering rule should be used?
24Selective sampling rule
Ideally: select exactly the points which lie in
the error region. But we don't know what this
region is.
- So choose points within a certain margin of $v_t$
- Labeling region: $L = \{x : |x \cdot v_t| \leq s_t\}$ (threshold $s_t$)
- Tradeoff in choosing $L$:
  - If too large: wait forever for a misclassified point
  - If too small: update is minuscule
- Solution: set $s_t$ adaptively
[Figure: $v_t$, $u$, the labeling region $L$ and the error region. Graphic taken from C. Monteleoni]
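Putting the modified update and the margin-based filter together gives the following sketch. The adaptive rule used here (halve the threshold after `patience` consecutive queried points were predicted correctly), the starting threshold `s0`, and all names and constants are assumptions made for illustration, not the exact rule from the talk.

```python
import numpy as np


def active_perceptron(X, oracle, s0, patience):
    """Stream through X, querying labels only inside the margin band |v.x| <= s."""
    v = oracle(X[0]) * X[0]              # initial hypothesis v_0 = y_0 x_0
    s, streak, labels = s0, 0, 1         # threshold, correct-in-a-row count, labels used
    for x in X[1:]:
        if abs(v @ x) > s:               # outside the labeling region L: skip, no label
            continue
        y = oracle(x)                    # pay for this label
        labels += 1
        if y * (v @ x) < 0:              # misclassified: apply the modified update
            v = v + 2.0 * y * abs(v @ x) * x
            streak = 0
        else:
            streak += 1
            if streak >= patience:       # no recent mistakes in the band: shrink it
                s /= 2.0
                streak = 0
    return v, labels


rng = np.random.default_rng(0)
d = 5
X = rng.standard_normal((5000, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)     # points on the unit sphere
u = np.eye(d)[0]                                  # hypothetical target separator
oracle = lambda x: 1.0 if u @ x >= 0 else -1.0

v, labels = active_perceptron(X, oracle, s0=1.0 / np.sqrt(d), patience=8)
print(f"labels used: {labels} of {len(X)}, cos(angle to target): {v @ u:.3f}")
```

Shrinking the band only after a run of correct predictions is one way to balance the tradeoff above: the band stays wide while mistakes are still easy to find and narrows once they become rare.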
25Some challenges
(1) For linear separators, analyze the label
complexity for some distribution other than
uniform!
(2) How to handle nonseparable data?
Need a robust base learner.
[Figure: data around the true boundary]
26Thanks
For many helpful discussions:
Peter Bartlett, Yoav Freund, Adam Kalai, John Langford, Claire Monteleoni