Title: Computational Learning Theory
1Machine Learning
- Chen Yu
- Institute of Computer Science and Technology, Peking University
2Contact Information
- Instructor: Chen Yu
- chenyu_at_icst.pku.edu.cn
- Tel: 82529680
- TA: Wang Hongyan
- wanghongyan2003_at_163.com
- Course page: http://www.icst.pku.edu.cn/course/mlearning/index.htm
3Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
4Introduction
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- How many mistakes will the learner make before succeeding?
5Introduction (2)
- Desirable quantitative bounds depending on
- Complexity of the hypo space
- Accuracy of approximation
- Probability of outputting a successful hypo
- How the training examples are presented
- Learner proposes instances
- Teacher presents instances
- Some random process produces instances
- Specifically, study sample complexity,
computational complexity, and mistake bound.
6Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
7Problem Setting
- Space of possible instances X (e.g. the set of all people) over which target functions may be defined.
- Assume that different instances in X may be encountered with different frequencies.
- Model this assumption as an unknown (stationary) probability distribution D that defines the probability of encountering each instance in X.
- Training examples are provided by drawing instances independently from X according to D, and they are noise-free.
- Each element c in the target concept set C corresponds to a certain subset of X, i.e. c is a Boolean function (just for the sake of simplicity).
8Error of a Hypothesis
- Training error of hypo h w.r.t. target function c and a training data set S of n samples: errorS(h) = (1/n)·|{x ∈ S : h(x) ≠ c(x)}|
- True error of hypo h w.r.t. target function c and distribution D: errorD(h) = Pr_{x~D}[h(x) ≠ c(x)]
- errorD(h) is not observable, so how probable is it that errorS(h) gives a misleading estimate of errorD(h)?
- Different from the problem setting in Ch3, where samples are drawn independently of h; here h depends on the training samples.
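A minimal sketch of the two quantities (the target c, hypothesis h, instance length 5, and the uniform choice of D are all assumptions made only for this illustration): it computes the observable errorS(h) on a small sample and Monte Carlo estimates the unobservable errorD(h) with a much larger one.

```python
import random

def c(x):  # assumed target concept: x[0] AND x[1]
    return x[0] and x[1]

def h(x):  # assumed learned hypothesis: x[0] only
    return x[0]

def draw(n):  # n instances drawn i.i.d. from the (uniform) distribution D
    return [tuple(random.randint(0, 1) for _ in range(5)) for _ in range(n)]

S = draw(20)                                            # small training sample
train_err = sum(h(x) != c(x) for x in S) / len(S)       # errorS(h)
big = draw(100_000)                                     # large sample approximates D
true_err = sum(h(x) != c(x) for x in big) / len(big)    # ~ errorD(h)
print(train_err, true_err)
```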
9An Illustration of True Error
10PAC Learnability
- PAC refers to Probably Approximately Correct
- It is desirable for errorD(h) to be zero; however, to be realistic, we weaken our demand in two ways
- errorD(h) is only required to be bounded by a small number ε
- The learner is not required to succeed on every training sample; rather, its probability of failure is bounded by a constant δ
- Hence we come up with the idea of Probably Approximately Correct learning
11PAC-Learnable
- Def. Consider a concept class C defined over an instance space X whose instances have size (encoding length) n, and a learner L using hypo space H. C is PAC-learnable by L using H if for every c in C, every distribution D over X, and every ε, δ in (0, 1/2), L will with probability at least 1-δ output a hypo h s.t. errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c), the encoding length of c in C.
12Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
13Sample Complexity for Finite Hypothesis Spaces
- Start from a good class of learners: consistent learners, defined as ones that output a hypo which perfectly fits the training data set, whenever possible.
- Recall: the version space VSH,D is defined to be the set of all hypo h ∈ H that correctly classify all training examples in D.
- Property: every consistent learner outputs a hypo belonging to the version space.
14ε-exhausted
- Def. VSH,D is said to be ε-exhausted w.r.t. c and D if every h in VSH,D satisfies errorD(h) < ε.
15ε-exhausting the Version Space
- Theorem 5.1: If the hypo space H is finite, and D is a sequence of m independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VSH,D is not ε-exhausted w.r.t. c is no more than |H|e^(-εm).
- Basic idea behind the proof: since H is finite, we can enumerate the hypotheses in VSH,D as h1, h2, ..., hk. VSH,D is not ε-exhausted iff at least one hypothesis hi with errorD(hi) ≥ ε perfectly fits all m training examples; a single such hi does so with probability at most (1-ε)^m, and a union bound over the at most |H| such hypotheses gives the stated bound.
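Filling in the calculation the proof sketch relies on (a worked step, consistent with the argument above):

```latex
% A single h with error_D(h) >= epsilon is consistent with one random example
% with probability at most (1 - epsilon), hence with m independent examples
% with probability at most (1 - epsilon)^m; a union bound over the at most |H|
% such hypotheses, together with 1 - epsilon <= e^{-epsilon}, gives Theorem 5.1:
\[
  \Pr\bigl[\mathrm{VS}_{H,D}\ \text{not}\ \epsilon\text{-exhausted}\bigr]
  \;\le\; |H|\,(1-\epsilon)^m
  \;\le\; |H|\,e^{-\epsilon m}.
\]
```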
16Contd
- The theorem bounds the probability that m training examples fail to eliminate all bad hypotheses.
- If we want this upper bound to be no more than δ and solve the resulting inequality for m, it follows that m ≥ (1/ε)(ln|H| + ln(1/δ)).
- This many training examples are sufficient to guarantee that any consistent hypo will be probably (with probability 1-δ) approximately correct (with error at most ε).
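To make the bound concrete, a minimal sketch (the values |H| = 1000, ε = 0.1, and δ = 0.05 are illustrative choices, not from the slides) that just evaluates m ≥ (1/ε)(ln|H| + ln(1/δ)):

```python
import math

# Sample-complexity bound for a finite hypothesis space:
# m >= (1/eps) * (ln|H| + ln(1/delta)).
def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Smallest m guaranteeing epsilon-exhaustion with probability >= 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

print(sample_complexity(h_size=1000, eps=0.1, delta=0.05))  # 100 examples suffice
```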
17A PAC-Learnable Example
- Consider the class C of conjunctions of boolean literals.
- A boolean literal is any boolean variable or its negation.
- Q: Is such a C PAC-learnable?
- A: Yes, by going through the following two steps
- Show that any consistent learner requires only a polynomial number of training examples to learn any element of C
- Exhibit a specific algorithm that uses polynomial time per training example
18Contd
- Step 1
- Let H consist of conjunctions of literals based on n boolean variables.
- Now look at m ≥ (1/ε)(ln|H| + ln(1/δ)); observing that |H| = 3^n, the inequality becomes m ≥ (1/ε)(n·ln3 + ln(1/δ)).
- Step 2
- The FIND-S algorithm satisfies the requirement
- For each new positive training example, the algorithm computes the intersection of the literals shared by the current hypothesis and the example, using time linear in n
19Contd
- Conclusion: conjunctions of boolean literals are PAC-learnable.
20Agnostic Learning: Inconsistent Hypo
- In the proof of Theorem 5.1, we assume that VSH,D is not empty; a simple way to guarantee this condition is to assume that c belongs to H.
- Agnostic learning setting: don't assume c ∈ H; the learner simply finds the hypo with minimum training error instead.
21Contd
- The question in Theorem 5.1 becomes
- Let errorS(h) denote the training error of hypo h, and hbest be the hypo in H with smallest training error. How many training examples suffice to ensure (with high probability) that errorD(hbest) ≤ errorS(hbest) + ε?
- Borrow the setting under which we estimate errorD(h) via errorS(h) in Ch3, and apply the Hoeffding bound: Pr[errorD(h) > errorS(h) + ε] ≤ exp(-2mε²) for any single h.
- Bounding this failure probability over all of H by some constant δ, it follows that m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
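Spelling out the algebra behind that last step (a sketch; the union bound over H is the same device as in Theorem 5.1):

```latex
% Require the failure probability, union-bounded over all of H, to be at most delta,
% then solve for m:
\[
  |H|\,e^{-2m\epsilon^2} \;\le\; \delta
  \;\Longleftrightarrow\;
  m \;\ge\; \frac{1}{2\epsilon^2}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr).
\]
```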
22Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
23Limitation of Theorem 5.1
- Quite a weak bound: the probability bound |H|e^(-εm) can easily exceed 1 if the cardinality of H is large enough!
- H must be finite
- Introduce a new measure: the Vapnik-Chervonenkis dimension of H, or VC dimension.
- Rough idea of VC: it measures the complexity of H by the number of distinct instances from X that can be completely discriminated using H
24Shattering a Set of Instances
- Def. A dichotomy of a set S is a partition of S into two disjoint subsets.
- Def. A set of instances S is shattered by a hypo space H iff for every dichotomy of S there exists some hypo in H consistent with this dichotomy.
- (Figure: 3 instances shattered)
25VC Dimension
- Motivation: what if H can't shatter X? Try finite subsets of X.
- Def. The VC dimension of a hypo space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
- Roughly speaking, the VC dimension measures how many (training) points can be separated under all possible labelings using functions of the given class.
26An Example: Linear Decision Surface
- Line case: X = the set of real numbers and H = the set of all open intervals; then VC(H) = 2.
- Plane case: X = the xy-plane and H = the set of all linear decision surfaces of the plane; then VC(H) = 3.
- General case: for n-dimensional real space, let H be the set of its linear decision surfaces; then VC(H) = n+1.
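A small brute-force check of the line case (a sketch; the helper names and the candidate-endpoint trick are mine, not from the slides): it enumerates all dichotomies of a point set and asks whether some open interval realizes each one, confirming that two points can be shattered while three cannot.

```python
from itertools import product

# Brute-force shattering check for H = open intervals (a, b) on the real line:
# a hypothesis labels x positive iff a < x < b. Candidate endpoints between and
# around the points suffice to realize every achievable dichotomy.
def interval_realizes(points, labels, a, b):
    return all((a < x < b) == bool(y) for x, y in zip(points, labels))

def shattered_by_intervals(points):
    xs = sorted(points)
    cuts = ([xs[0] - 2, xs[0] - 1]
            + [(p + q) / 2 for p, q in zip(xs, xs[1:])]
            + [xs[-1] + 1, xs[-1] + 2])
    return all(
        any(interval_realizes(points, labels, a, b)
            for a in cuts for b in cuts if a < b)
        for labels in product([0, 1], repeat=len(points))
    )

print(shattered_by_intervals([1.0, 2.0]))       # True: two points can be shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: labeling (+,-,+) is impossible
```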
27Sample Complexity from VC Dimension
- How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least 1-δ?
- m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)) (Blumer et al. 1989)
- Furthermore, it is possible to obtain a lower bound on sample complexity (i.e. the minimum number of required training samples)
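Plugging illustrative numbers into the bound above (a sketch; VC(H) = 3 is the plane case from the earlier example, and ε = 0.1, δ = 0.05 are chosen only for illustration):

```python
import math

# VC-based sample-complexity upper bound (Blumer et al. 1989):
# m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
def vc_sample_complexity(vc: int, eps: float, delta: float) -> int:
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

print(vc_sample_complexity(vc=3, eps=0.1, delta=0.05))  # roughly 1900 examples
```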
28Lower Bound on Sample Complexity
- Theorem 5.2 (Ehrenfeucht et al. 1989): Consider any concept class C s.t. VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C s.t. if L observes fewer examples than max[(1/ε)·log(1/δ), (VC(C)-1)/(32ε)], then with probability at least δ, L outputs a hypo h having errorD(h) > ε.
29Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
30Recall Introduction of this Chapter
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- How many mistakes will the learner make before succeeding?
31Introduction (2)
- Desirable quantitative bounds depending on
- Complexity of the hypo space
- Accuracy of approximation
- Probability of outputting a successful hypo
- How the training examples are presented
- Learner proposes instances
- Teacher presents instances
- Some random process produces instances
- Specifically, study sample complexity,
computational complexity, and mistake bound.
32Introduction to Mistake Bound
- Mistake bound: the total number of mistakes a learner makes before it converges to the correct hypothesis
- Assume the learner receives a sequence of training examples; however, for each instance x, the learner must first predict c(x) before it receives the correct answer from the teacher.
- Application scenario: learning must be done on the fly, rather than during an off-line training stage.
33FIND-S Algorithm
- FIND-S: find a maximally specific hypothesis
- Initialize h to the most specific hypothesis in H
- For each positive training example x
- For each attribute constraint ai in h: if it is satisfied by x, then do nothing; otherwise replace ai by the next more general constraint that is satisfied by x.
- Output hypo h
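A runnable sketch of FIND-S for conjunctions of boolean literals (the set-of-literals representation and the example data are my own, not from the slides): each positive example simply deletes the literals it violates, which is the linear-time intersection step mentioned on an earlier slide.

```python
# FIND-S for conjunctions of boolean literals, sketched with a set-of-literals
# representation. A literal is (index, value); the conjunction is satisfied by x
# iff x[index] == value for every literal it contains.
def find_s(examples, n):
    # most specific hypothesis: all 2n literals (satisfied by no instance)
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label:  # only positive examples can force generalization
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# toy examples labeled by some unknown target over n = 4 boolean variables
data = [((1, 1, 0, 0), True), ((1, 1, 0, 1), True), ((0, 1, 0, 1), False)]
print(sorted(find_s(data, n=4)))  # literals consistent with all positive examples
```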
34Mistake Bound for FIND-S
- Assume the training data is noise-free and the target concept c is in the hypo space H, which consists of conjunctions of up to n boolean literals
- Then in the worst case the learner makes n+1 mistakes before it learns c
- Note that a misclassification occurs only when the currently learned hypo misclassifies a positive example as negative, and each such mistake removes at least one constraint from the hypo
- In the above worst case, c is the concept that labels every instance as positive
35Mistake Bound for Halving Algorithm
- Halving algorithm: incrementally maintain the version space as each new instance arrives; predict on a new instance by a majority vote (of the hypo in the VS)
- Q: What is the maximum number of mistakes that can be made by the Halving algorithm, for an arbitrary finite H, before it exactly learns the target concept c (assume c is in H)?
- Answer: the largest integer no more than log2|H|, i.e. ⌊log2|H|⌋
- How about the minimum number of mistakes?
- Answer: zero mistakes!
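A runnable sketch of the Halving algorithm over an explicitly enumerated finite H (the tiny pool of single-attribute hypotheses and the example stream are illustrative assumptions): predict by a majority vote of the current version space, then discard the hypotheses that disagreed with the revealed label.

```python
# Halving algorithm over an explicitly enumerated finite hypothesis space.
# Each hypothesis is a function x -> 0/1; the version space VS starts as all of H
# and shrinks whenever the true label arrives.
def halving_run(H, stream):
    vs = list(H)
    mistakes = 0
    for x, y in stream:                                  # y = c(x), revealed after predicting
        votes = sum(h(x) for h in vs)
        prediction = 1 if 2 * votes > len(vs) else 0     # majority vote (ties -> 0)
        mistakes += (prediction != y)
        vs = [h for h in vs if h(x) == y]                # keep only consistent hypotheses
    return mistakes, len(vs)                             # (mistakes, remaining |VS|)

# tiny illustrative H: "output the value of attribute i" for i = 0..3
H = [lambda x, i=i: x[i] for i in range(4)]
stream = [((1, 0, 1, 0), 1), ((0, 0, 1, 1), 1), ((1, 1, 0, 0), 0)]
print(halving_run(H, stream))   # at most floor(log2 |H|) = 2 mistakes
```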
36Optimal Mistake Bounds
- For an arbitrary concept class C, assuming H ⊇ C, we are interested in the lowest worst-case mistake bound over all possible learning algorithms
- Let MA(c) denote the maximum number of mistakes, over all possible training sequences, that a learner A makes to exactly learn c.
- Def. MA(C) = max over c ∈ C of MA(c)
- Ex: MFIND-S(C) = n+1, MHalving(C) ≤ log2|C|
37Optimal Mistake Bounds (2)
- The optimal mistake bound for C, denoted Opt(C), is defined as the minimum of MA(C) over all possible learning algorithms A
- Notice that Opt(C) ≤ MHalving(C) ≤ log2|C|
- Furthermore, Littlestone (1987) shows that VC(C) ≤ Opt(C)!
- When C equals the power set CP of a finite instance space X, the above four quantities all become equal, i.e. equal to |X|
38Weighted-Majority Algorithm
- It is a generalization of the Halving algorithm: it makes a prediction by taking a weighted vote among a pool of prediction algorithms (or hypotheses) and learns by altering the weights
- It starts by assigning equal weight (1) to every prediction algorithm. Whenever an algorithm misclassifies a training example, its weight is reduced
- The Halving algorithm is the special case that reduces the weight to zero
39Procedure for Adjusting Weights
- ai denotes the ith prediction algorithm in the pool; wi denotes the weight of ai and is initialized to 1
- For each training example <x, c(x)>
- Initialize q0 and q1 to 0
- For each ai: if ai(x) = 0 then q0 ← q0 + wi, else q1 ← q1 + wi
- If q1 > q0, predict c(x) to be 1, else
- if q1 < q0, predict c(x) to be 0, else
- predict c(x) at random to be 1 or 0.
- For each ai, do
- If ai(x) ≠ c(x) (where c(x) is given by the teacher), wi ← β·wi
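A runnable sketch of the procedure above with β = 0.5 (the pool of single-attribute predictors and the example stream are illustrative assumptions, not from the slides):

```python
import random

# Weighted-Majority: weighted vote over a pool of prediction algorithms,
# then multiply the weight of every algorithm that was wrong by beta.
def weighted_majority(pool, stream, beta=0.5):
    w = [1.0] * len(pool)                       # all weights start at 1
    mistakes = 0
    for x, y in stream:
        q0 = sum(wi for a, wi in zip(pool, w) if a(x) == 0)
        q1 = sum(wi for a, wi in zip(pool, w) if a(x) == 1)
        pred = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        mistakes += (pred != y)
        # unlike Halving, wrong predictors are down-weighted, not discarded
        w = [wi * beta if a(x) != y else wi for a, wi in zip(pool, w)]
    return mistakes, w

pool = [lambda x, i=i: x[i] for i in range(4)]  # illustrative predictors
stream = [((1, 0, 1, 0), 1), ((0, 1, 1, 1), 1), ((1, 1, 0, 0), 0)]
print(weighted_majority(pool, stream))
```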
40Comments on Adjusting Weights Idea
- The idea can be found in various problems such as pattern matching, where we might reduce the weights of less frequently used patterns in the learned library
- The textbook claims that one benefit of the algorithm is that it can accommodate inconsistent training data; but in the case of learning by query, we presume that the answer given by the teacher is always correct.
41Relative Mistake Bound for the Algorithm
- Theorem 5.3: Let D be the training sequence, A be any set of n prediction algorithms, and k be the minimum number of mistakes made by any algorithm in A on the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 0.5 is at most 2.4(k + log2 n).
- Proof: The basic idea is to compare the final weight of the best prediction algorithm to the sum of the weights of all predictors. Let aj be an algorithm with k mistakes; then its final weight is wj = (1/2)^k. Now consider the sum W of the weights of all predictors, and observe that for every mistake made by Weighted-Majority, W is reduced to at most (3/4)W.
42Proof of Theorem 5.3 (contd)
- Let M be the total number of mistakes made by the Weighted-Majority algorithm; then the final total weight W is at most n(3/4)^M, and furthermore (1/2)^k ≤ W ≤ n(3/4)^M. Solve this inequality for M, and we are done.
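Spelling out that final algebraic step (a worked computation consistent with the argument above):

```latex
% From (1/2)^k <= W <= n (3/4)^M, take base-2 logarithms and solve for M:
\[
  -k \;\le\; \log_2 n + M \log_2\tfrac{3}{4}
  \;\Longrightarrow\;
  M \;\le\; \frac{k + \log_2 n}{\log_2(4/3)} \;\approx\; 2.4\,(k + \log_2 n).
\]
```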
43Summary
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- PAC-learning model (probably approximately correct), VC dimension for infinite hypo spaces
- How many mistakes will the learner make before succeeding?
- Mistake bound, optimal mistake bound
44HW
- 7.2, 7.5, 7.8 (10pt each, Due Tuesday, 11-3)