Title: The Informational Complexity of Interactive Machine Learning
Slide 1: The Informational Complexity of Interactive Machine Learning
Slide 2: Passive Learning
[Diagram: the Data Source provides raw unlabeled data; the Expert/Oracle turns it into labeled examples for the Learning Algorithm; the algorithm outputs a classifier.]
Slide 3: Learning by Interaction: The Big Picture
[Diagram: the Data Source provides raw unlabeled data to the Learning Algorithm. The learner asks the Expert/Oracle a question about the data, the Expert answers the question, and this repeats. Finally, the algorithm outputs a classifier.]
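The loop in the diagram can be sketched as an abstract protocol. Everything below (the callable names, the stopping rule, the toy threshold example) is illustrative scaffolding for the picture, not an algorithm from the talk:

```python
def interactive_learning(unlabeled, make_query, expert, done, fit):
    """Abstract interaction loop: the learner poses a question about the
    raw unlabeled data, the Expert answers, and this repeats until the
    learner can output a classifier. The five callables are placeholders
    for whatever a concrete scheme (label requests, class-conditional
    queries, ...) plugs in."""
    transcript = []
    while not done(transcript):
        question = make_query(unlabeled, transcript)  # learner asks
        answer = expert(question)                     # Expert answers
        transcript.append((question, answer))
    return fit(transcript)                            # output a classifier
```

As a toy instantiation, the "questions" can simply be label requests on a pool of points, with the final classifier read off from the transcript.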
Slide 4: Interactive Learning: A Manifesto
- Machine learning is a collaborative effort between human and machine.
- In passive learning, there is often a bottleneck on the human side (data annotation).
- Conclusion:
  - Passive algorithms are lazy collaborators.
  - Interactive algorithms may only require the human to expend effort providing relevant details, minimizing unnecessary redundancy.
Slide 5: The Value of Interaction
- But how much improvement can we expect for any particular learning problem?
- How much interaction is necessary and sufficient for learning?
Slide 6: Outline
- Active learning with label requests
  - Disagreement Coefficient (Hanneke, ICML 2007)
  - Teaching Dimension (Hanneke, COLT 2007)
- Class-conditional queries
- Arbitrary sample-based queries
Slide 7: Active Learning with Label Requests
Slide 8: Active Learning with Label Requests
- This is clearly an upper bound on the label complexity of active learning.
- Other than the noise rate, the VC dimension summarizes sample complexity.
- The algorithm achieving this is ERM, which often must be approximated.
Slide 9: Outline
- Active learning with label requests
  - Disagreement Coefficient (Hanneke, ICML 2007)
  - Teaching Dimension (Hanneke, COLT 2007)
- Class-conditional queries
- Arbitrary sample-based queries
Slide 10: Reducing Uncertainty
- "Real knowledge is to know the extent of one's ignorance." -- Confucius
- "As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns, the ones we don't know we don't know." -- Donald Rumsfeld, Feb. 12, 2002, Department of Defense news briefing
Slide 11: Reducing Uncertainty
[Figure: a concept h, the ball B(h,r) of concepts around it, and the region of disagreement DIS(B(h,r)); the figure illustrates what concepts in B(h,r) look like.]
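For intuition, DIS of a set of concepts is just the set of points on which some pair of them disagrees. A minimal sketch for threshold classifiers, where the ball is given as an explicit finite list (the particular thresholds and grid below are made-up illustration, not from the slide):

```python
def disagreement_region(ball, points):
    """DIS(B) = {x : some pair h1, h2 in B has h1(x) != h2(x)}.
    `ball` is a finite list of classifiers standing in for B(h,r),
    and `points` is a grid standing in for the support of D."""
    return [x for x in points if len({h(x) for h in ball}) > 1]
```

For thresholds within distance r of h, DIS(B(h,r)) is an interval around the decision boundary of h.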
Slide 12: Reducing Uncertainty: A2 Algorithm
Version space-based passive learning:
Repeat:
  1. Sample an example x from the distribution D.
  2. Request its label y from the Expert.
  3. Add the labeled example (x, y) to the data set.
  4. Discard concepts we are statistically confident are suboptimal.
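The loop above can be sketched as follows. This is a realizable-case simplification: where the slide discards concepts only when statistically confident they are suboptimal, this sketch simply discards concepts inconsistent with an observed label:

```python
def version_space_passive(concepts, sample, label, n_rounds):
    """Version-space passive learning sketch (realizable case).
    `sample()` draws an example x from the distribution D, and
    `label(x)` is the Expert's answer. Every sampled example is
    labeled, regardless of how informative it is."""
    version_space = list(concepts)
    for _ in range(n_rounds):
        x = sample()   # sample an example from D
        y = label(x)   # request its label from the Expert
        # (x, y) joins the data set; prune the version space with it.
        version_space = [h for h in version_space if h(x) == y]
    return version_space
```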
Slide 13: Reducing Uncertainty: A2 Algorithm
- A2 (Balcan, Beygelzimer & Langford, 2006)
Slide 14: Reducing Uncertainty: A2 Algorithm
- A2 (BBL06): slightly oversimplified explanation
Version space-based agnostic active learning:
Repeat:
  1. Sample an example x from the distribution D.
  2. If x is not in the region of disagreement, ignore it (move on to the next sample).
  3. If x is in the region of disagreement, request its label y from the Expert.
  4. Add the labeled example (x, y) to the data set.
  5. Discard concepts we are statistically confident are suboptimal (with respect to the filtered distribution).
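The filtered loop can be sketched in the same simplified realizable-case style as the passive loop; the one difference is that a label is requested only when x falls in the region of disagreement of the surviving version space. (The actual A2 algorithm uses statistical confidence bounds rather than exact consistency.)

```python
def a2_sketch(concepts, sample, label, n_rounds):
    """Simplified A2-style loop: labels are requested only for points
    where concepts in the version space disagree; on such points, the
    Expert's label prunes the version space."""
    version_space = list(concepts)
    labels_requested = 0
    for _ in range(n_rounds):
        x = sample()
        predictions = {h(x) for h in version_space}
        if len(predictions) == 1:
            continue  # all surviving concepts agree on x: no label needed
        labels_requested += 1  # x is in the region of disagreement
        y = label(x)
        version_space = [h for h in version_space if h(x) == y]
    return version_space, labels_requested
```

Each requested label eliminates at least one concept, so the number of label requests can be far smaller than the number of examples drawn.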
Slide 15: Reducing Uncertainty
Slide 16: Outline
- Active learning with label requests
  - Disagreement Coefficient (Hanneke, ICML 2007)
  - Teaching Dimension (Hanneke, COLT 2007)
- Class-conditional queries
- Arbitrary sample-based queries
Slide 17: Exact Learning: Halving Algorithm
- Suppose we can hand the teacher a concept and ask for an example that contradicts it, if one exists (equivalence queries).
- The Halving algorithm (Littlestone, 1988):
  - Let hmaj be the majority-vote concept of C.
  - Ask for an example (X, Y) where hmaj is wrong.
  - If no such example exists, return hmaj.
  - Else remove from C any h with h(X) ≠ Y.
- The Halving algorithm needs at most log2|C| queries to identify any target function in C.
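The steps above can be sketched directly for a finite concept class; the equivalence-query oracle here is a stand-in for the teacher:

```python
def halving(concepts, equivalence_query):
    """Halving algorithm sketch. `concepts` is a finite list of
    hypotheses X -> {0, 1}; `equivalence_query(h)` returns None if h
    matches the target everywhere, else a counterexample (x, y)."""
    version_space = list(concepts)
    while True:
        # Majority-vote concept over the current version space.
        def h_maj(x, vs=tuple(version_space)):
            votes = sum(h(x) for h in vs)
            return 1 if 2 * votes >= len(vs) else 0
        counterexample = equivalence_query(h_maj)
        if counterexample is None:
            return h_maj
        x, y = counterexample
        # The majority was wrong on x, so at least half the version
        # space is eliminated: hence the log2|C| bound.
        version_space = [h for h in version_space if h(x) == y]
```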
Slide 18: Exact Learning: Membership Queries
- Suppose, instead of equivalence queries, we can request the label of any example in X.
- We still want to run the Halving algorithm.
- How many label requests does it take to simulate an equivalence query?
Slide 19: Teaching Dimension (Hegedüs, 1995)
Slide 20: Teaching Dimension for PAC
- Say V is the class of linear separators.
- Sample a set U from D.
- A specifying set uniquely identifies (at most) one labeling in V|U, the projection of V onto U.
- [Figure: an example target f, shown as a colored region.]
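A brute-force sketch of finding a specifying set for a small projected class. The threshold class and sample U below are illustrative stand-ins, not the linear-separator example pictured on the slide:

```python
from itertools import combinations

def specifying_set(f, concepts, U):
    """Smallest S subset of U such that at most one labeling in V|U
    agrees with f on S (exhaustive search, viable only for tiny U)."""
    labelings = {tuple(h(x) for x in U) for h in concepts}  # V|U
    target = tuple(f(x) for x in U)
    for k in range(len(U) + 1):
        for S in combinations(range(len(U)), k):
            consistent = [L for L in labelings
                          if all(L[i] == target[i] for i in S)]
            if len(consistent) <= 1:
                return [U[i] for i in S]
```

For thresholds, the specifying set is just the pair of points straddling the decision boundary, so its size stays constant no matter how large U is.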
Slide 21: XTD and Label Complexity
Slide 22: XTD and Label Complexity
- Conjecture: a bound of this form is valid even with no knowledge of the noise rate (i.e., for agnostic learning).
Slide 23: Outline
- Active learning with label requests
  - Disagreement Coefficient (Hanneke, ICML 2007)
  - Teaching Dimension (Hanneke, COLT 2007)
- Class-conditional queries
- Arbitrary sample-based queries
Slide 24: What About Other Types of Queries?
- Ask the question you want answered.
- For example, consider multiclass image classification. Perhaps learning would be easier if only the algorithm had an image of a car.
[Figure: "What's this a picture of?" posed over example images: Horse, Planet, Person, Car.]
Slide 25: Class-Conditional Queries
- Ask the question you want answered.
- For example, consider multiclass image classification. Perhaps learning would be easier if only the algorithm had an image of a car.
- "Click on a picture of a car, if there is one."
- This can be done for each class individually (except perhaps the "other" class).
Slide 26: Class-Conditional Queries
- A concrete example: conjunctions (without noise).
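One way to make the conjunction example concrete (this particular protocol is an illustrative sketch, not necessarily the one analyzed in the talk): repeatedly ask the Expert for a positive example that the current hypothesis misclassifies. Each answer removes at least one variable from a monotone conjunction, so at most n+1 queries suffice in the noise-free setting.

```python
def learn_monotone_conjunction(pool, positive_query, n_vars):
    """Learn a monotone conjunction over a pool of unlabeled examples
    using class-conditional queries, without noise. `positive_query`
    is the hypothetical oracle from the slide: handed a list of
    examples, it returns one that is truly positive, or None."""
    relevant = set(range(n_vars))  # start with the most specific conjunction

    def h(x):
        return int(all(x[i] for i in relevant))

    while True:
        # Ask for a positive example the current hypothesis labels negative.
        mistakes = [x for x in pool if h(x) == 0]
        witness = positive_query(mistakes)
        if witness is None:
            return relevant  # consistent with every positive in the pool
        # Variables that are 0 in a true positive cannot be in the target.
        relevant = {i for i in relevant if witness[i] == 1}
```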
Slide 27: Outline
- Active learning with label requests
  - Disagreement Coefficient (Hanneke, ICML 2007)
  - Teaching Dimension (Hanneke, COLT 2007)
- Class-conditional queries
- Arbitrary sample-based queries
Slide 28: Arbitrary Example-based Queries
- Suppose we let the algorithm ask any question it wants about the data labels.
Slide 29: Cost Complexity
Slide 30: Questions? (cost-free?)
Slide 31: Open Problems for Label Queries
- The value of having more unlabeled data? (Especially for agnostic learning.)
- An optimal agnostic active learning algorithm?
Slide 32: Open Problems
- Unknown cost functions
  - E.g., maybe examples near the separator are more expensive to label.
- Other types of queries
  - E.g., "give me a rule/explanation you used to decide the label of this example."
Slide 33: Definition of GIC
- Say the teacher gets drunk and doesn't necessarily answer accurately. But she manages to scribble her answers to every question on a piece of paper.
- We have a spy who steals the paper and photocopies it.
- The spy tells us exactly which questions to ask so that, using minimum cost, there is at most one concept in C consistent with the answers.
- Define GIC(C, c) as the worst-case cost of this game.