Title: Computational Learning Theory
1Machine Learning
- Chen Yu
- Institute of Computer Science and Technology, Peking University
2Contact Information
- Instructor: Chen Yu
- chenyu_at_icst.pku.edu.cn
- Tel: 82529680
- TA: Wang Hongyan
- wanghongyan2003_at_163.com
- Course page: http://www.icst.pku.edu.cn/course/mlearning/index.htm
3Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
4Introduction
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- How many mistakes will the learner make before succeeding?
5Introduction (2)
- Desirable quantitative bounds depending on
- Complexity of the hypo space
- Accuracy of approximation
- Probability of outputting a successful hypo
- How the training examples are presented
- Learner proposes instances
- Teacher presents instances
- Some random process produces instances
- Specifically, study sample complexity,
computational complexity, and mistake bound.
6Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
7Problem Setting
- Space of possible instances X (e.g. the set of all people) over which target functions may be defined.
- Assume that different instances in X may be encountered with different frequencies.
- Model this assumption as an unknown (stationary) probability distribution D that defines the probability of encountering each instance in X.
- Training examples are provided by drawing instances independently from X according to D, and they are noise-free.
- Each element c in the target concept set C corresponds to a certain subset of X, i.e. c is a Boolean function (just for the sake of simplicity).
8Error of a Hypothesis
- Training error of hypo h w.r.t. target function c and a training data set S of n samples: errorS(h) = (1/n)·|{x ∈ S : h(x) ≠ c(x)}|
- True error of hypo h w.r.t. target function c and distribution D: errorD(h) = Pr_{x~D}[h(x) ≠ c(x)]
- errorD(h) is not observable, so how probable is it that errorS(h) gives a misleading estimate of errorD(h)?
- Different from the problem setting in Ch3, where samples are drawn independently of h; here h depends on the training samples.
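A minimal sketch of the two quantities (the target c, hypothesis h, instance length 5, and the uniform choice of D are all assumptions made only for this illustration): it computes the observable errorS(h) on a small sample and Monte Carlo estimates the unobservable errorD(h) with a much larger one.

```python
import random

def c(x):  # assumed target concept: x[0] AND x[1]
    return x[0] and x[1]

def h(x):  # assumed learned hypothesis: x[0] only
    return x[0]

def draw(n):  # n instances drawn i.i.d. from the (uniform) distribution D
    return [tuple(random.randint(0, 1) for _ in range(5)) for _ in range(n)]

S = draw(20)                                            # small training sample
train_err = sum(h(x) != c(x) for x in S) / len(S)       # errorS(h)
big = draw(100_000)                                     # large sample approximates D
true_err = sum(h(x) != c(x) for x in big) / len(big)    # ~ errorD(h)
print(train_err, true_err)
```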
9An Illustration of True Error
10PAC Learnability
- PAC refers to Probably Approximately Correct
- It is desirable for errorD(h) to be zero; however, to be realistic, we weaken our demand in two ways
- errorD(h) is only required to be bounded by a small number ε
- The learner is not required to succeed on every training sample; rather, its probability of failure is bounded by a constant δ
- Hence we come up with the idea of Probably Approximately Correct learning
11PAC-Learnable
- Def. Consider a concept class C defined over an instance space X whose instances have size (encoding length) n, and a learner L using hypo space H. C is PAC-learnable by L using H if for every c in C, every distribution D over X, and every ε, δ in (0, 1/2), L will with probability at least 1-δ output a hypo h s.t. errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c), the encoding length of c in C.
12Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
13Sample Complexity for Finite Hypothesis Spaces
- Start from a good class of learners: consistent learners, defined as ones that output a hypo which perfectly fits the training data set, whenever possible.
- Recall: the version space VSH,D is defined to be the set of all hypo h ∈ H that correctly classify all training examples in D.
- Property: every consistent learner outputs a hypo belonging to the version space.
14ε-exhausted
- Def. VSH,D is said to be ε-exhausted w.r.t. c and D if every h in VSH,D satisfies errorD(h) < ε.
15ε-exhausting the Version Space
- Theorem 5.1: If the hypo space H is finite, and D is a sequence of m independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that VSH,D is not ε-exhausted w.r.t. c is no more than |H|e^(-εm).
- Basic idea behind the proof: since H is finite, we can enumerate the hypotheses in VSH,D as h1, h2, ..., hk. VSH,D is not ε-exhausted iff at least one hypothesis hi with errorD(hi) ≥ ε perfectly fits all m training examples; a single such hi does so with probability at most (1-ε)^m, and a union bound over the at most |H| such hypotheses gives the stated bound.
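Filling in the calculation the proof sketch relies on (a worked step, consistent with the argument above):

```latex
% A single h with error_D(h) >= epsilon is consistent with one random example
% with probability at most (1 - epsilon), hence with m independent examples
% with probability at most (1 - epsilon)^m; a union bound over the at most |H|
% such hypotheses, together with 1 - epsilon <= e^{-epsilon}, gives Theorem 5.1:
\[
  \Pr\bigl[\mathrm{VS}_{H,D}\ \text{not}\ \epsilon\text{-exhausted}\bigr]
  \;\le\; |H|\,(1-\epsilon)^m
  \;\le\; |H|\,e^{-\epsilon m}.
\]
```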
16Contd
- The theorem bounds the probability that m training examples fail to eliminate all bad hypotheses.
- If we want this upper bound to be no more than δ and solve the resulting inequality for m, it follows that m ≥ (1/ε)(ln|H| + ln(1/δ)).
- This many training examples are sufficient to guarantee that any consistent hypo will be probably (with probability 1-δ) approximately correct (with error at most ε).
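To make the bound concrete, a minimal sketch (the values |H| = 1000, ε = 0.1, and δ = 0.05 are illustrative choices, not from the slides) that just evaluates m ≥ (1/ε)(ln|H| + ln(1/δ)):

```python
import math

# Sample-complexity bound for a finite hypothesis space:
# m >= (1/eps) * (ln|H| + ln(1/delta)).
def sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Smallest m guaranteeing epsilon-exhaustion with probability >= 1 - delta."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

print(sample_complexity(h_size=1000, eps=0.1, delta=0.05))  # 100 examples suffice
```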
17A PAC-Learnable Example
- Consider the class C of conjunctions of boolean literals.
- A boolean literal is any boolean variable or its negation.
- Q: Is such a C PAC-learnable?
- A: Yes, by going through the following two steps
- Show that any consistent learner requires only a polynomial number of training examples to learn any element of C
- Exhibit a specific algorithm that uses polynomial time per training example
18Contd
- Step 1
- Let H consist of conjunctions of literals based on n boolean variables.
- Now look at m ≥ (1/ε)(ln|H| + ln(1/δ)); observing that |H| = 3^n, the inequality becomes m ≥ (1/ε)(n·ln3 + ln(1/δ)).
- Step 2
- The FIND-S algorithm satisfies the requirement
- For each new positive training example, the algorithm computes the intersection of the literals shared by the current hypothesis and the example, using time linear in n
19Contd
- Conclusion: conjunctions of boolean literals are PAC-learnable.
20Agnostic Learning: Inconsistent Hypo
- In the proof of Theorem 5.1, we assume that VSH,D is not empty; a simple way to guarantee this condition is to assume that c belongs to H.
- Agnostic learning setting: don't assume c ∈ H; the learner simply finds the hypo with minimum training error instead.
21Contd
- The question in Theorem 5.1 becomes
- Let errorS(h) denote the training error of hypo h, and hbest be the hypo in H with smallest training error. How many training examples suffice to ensure (with high probability) that errorD(hbest) ≤ errorS(hbest) + ε?
- Borrow the setting under which we estimate errorD(h) via errorS(h) in Ch3, and apply the Hoeffding bound: Pr[errorD(h) > errorS(h) + ε] ≤ exp(-2mε²) for any single h.
- Bounding this failure probability over all of H by some constant δ, it follows that m ≥ (1/(2ε²))(ln|H| + ln(1/δ)).
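Spelling out the algebra behind that last step (a sketch; the union bound over H is the same device as in Theorem 5.1):

```latex
% Require the failure probability, union-bounded over all of H, to be at most delta,
% then solve for m:
\[
  |H|\,e^{-2m\epsilon^2} \;\le\; \delta
  \;\Longleftrightarrow\;
  m \;\ge\; \frac{1}{2\epsilon^2}\Bigl(\ln|H| + \ln\tfrac{1}{\delta}\Bigr).
\]
```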
22Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
23Limitation of Theorem 5.1
- Quite a weak bound: the probability bound |H|e^(-εm) can easily exceed 1 if the cardinality of H is large enough!
- H must be finite
- Introduce a new measure: the Vapnik-Chervonenkis dimension of H, or VC dimension.
- Rough idea of VC: it measures the complexity of H by the number of distinct instances from X that can be completely discriminated using H
24Shattering a Set of Instances
- Def. A dichotomy of a set S is a partition of S into two disjoint subsets.
- Def. A set of instances S is shattered by a hypo space H iff for every dichotomy of S there exists some hypo in H consistent with this dichotomy.
- (Figure: 3 instances shattered)
25VC Dimension
- Motivation: what if H can't shatter X? Try finite subsets of X.
- Def. The VC dimension of a hypo space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
- Roughly speaking, the VC dimension measures how many (training) points can be separated under all possible labelings using functions of the given class.
26An Example: Linear Decision Surface
- Line case: X = the set of real numbers and H = the set of all open intervals; then VC(H) = 2.
- Plane case: X = the xy-plane and H = the set of all linear decision surfaces of the plane; then VC(H) = 3.
- General case: for n-dimensional real space, let H be the set of its linear decision surfaces; then VC(H) = n+1.
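A small brute-force check of the line case (a sketch; the helper names and the candidate-endpoint trick are mine, not from the slides): it enumerates all dichotomies of a point set and asks whether some open interval realizes each one, confirming that two points can be shattered while three cannot.

```python
from itertools import product

# Brute-force shattering check for H = open intervals (a, b) on the real line:
# a hypothesis labels x positive iff a < x < b. Candidate endpoints between and
# around the points suffice to realize every achievable dichotomy.
def interval_realizes(points, labels, a, b):
    return all((a < x < b) == bool(y) for x, y in zip(points, labels))

def shattered_by_intervals(points):
    xs = sorted(points)
    cuts = ([xs[0] - 2, xs[0] - 1]
            + [(p + q) / 2 for p, q in zip(xs, xs[1:])]
            + [xs[-1] + 1, xs[-1] + 2])
    return all(
        any(interval_realizes(points, labels, a, b)
            for a in cuts for b in cuts if a < b)
        for labels in product([0, 1], repeat=len(points))
    )

print(shattered_by_intervals([1.0, 2.0]))       # True: two points can be shattered
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False: labeling (+,-,+) is impossible
```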
27Sample Complexity from VC Dimension
- How many randomly drawn examples suffice to ε-exhaust VSH,D with probability at least 1-δ?
- m ≥ (1/ε)(4·log2(2/δ) + 8·VC(H)·log2(13/ε)) (Blumer et al. 1989)
- Furthermore, it is possible to obtain a lower bound on sample complexity (i.e. the minimum number of required training samples)
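Plugging illustrative numbers into the bound above (a sketch; VC(H) = 3 is the plane case from the earlier example, and ε = 0.1, δ = 0.05 are chosen only for illustration):

```python
import math

# VC-based sample-complexity upper bound (Blumer et al. 1989):
# m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
def vc_sample_complexity(vc: int, eps: float, delta: float) -> int:
    return math.ceil((4 * math.log2(2 / delta)
                      + 8 * vc * math.log2(13 / eps)) / eps)

print(vc_sample_complexity(vc=3, eps=0.1, delta=0.05))  # roughly 1900 examples
```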
28Lower Bound on Sample Complexity
- Theorem 5.2 (Ehrenfeucht et al. 1989): Consider any concept class C s.t. VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C s.t. if L observes fewer examples than max[(1/ε)·log(1/δ), (VC(C)-1)/(32ε)], then with probability at least δ, L outputs a hypo h having errorD(h) > ε.
29Ch5 Computational Learning Theory
- Introduction
- PAC Learning model
- Sample Complexity for Finite Hypothesis Spaces
- Sample Complexity for Infinite Hypothesis Spaces
- Mistake Bound Model
30Recall Introduction of this Chapter
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- How many mistakes will the learner make before succeeding?
31Introduction (2)
- Desirable quantitative bounds depending on
- Complexity of the hypo space
- Accuracy of approximation
- Probability of outputting a successful hypo
- How the training examples are presented
- Learner proposes instances
- Teacher presents instances
- Some random process produces instances
- Specifically, study sample complexity,
computational complexity, and mistake bound.
32Introduction to Mistake Bound
- Mistake bound: the total number of mistakes a learner makes before it converges to the correct hypothesis
- Assume the learner receives a sequence of training examples; however, for each instance x, the learner must first predict c(x) before it receives the correct answer from the teacher.
- Application scenario: learning must be done on the fly, rather than during an off-line training stage.
33FIND-S Algorithm
- FIND-S: find a maximally specific hypothesis
- Initialize h to the most specific hypothesis in H
- For each positive training example x
- For each attribute constraint ai in h: if it is satisfied by x, then do nothing; otherwise replace ai by the next more general constraint that is satisfied by x.
- Output hypo h
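A runnable sketch of FIND-S for conjunctions of boolean literals (the set-of-literals representation and the example data are my own, not from the slides): each positive example simply deletes the literals it violates, which is the linear-time intersection step mentioned on an earlier slide.

```python
# FIND-S for conjunctions of boolean literals, sketched with a set-of-literals
# representation. A literal is (index, value); the conjunction is satisfied by x
# iff x[index] == value for every literal it contains.
def find_s(examples, n):
    # most specific hypothesis: all 2n literals (satisfied by no instance)
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label:  # only positive examples can force generalization
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# toy examples labeled by some unknown target over n = 4 boolean variables
data = [((1, 1, 0, 0), True), ((1, 1, 0, 1), True), ((0, 1, 0, 1), False)]
print(sorted(find_s(data, n=4)))  # literals consistent with all positive examples
```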
34Mistake Bound for FIND-S
- Assume the training data is noise-free and the target concept c is in the hypo space H, which consists of conjunctions of up to n boolean literals
- Then in the worst case the learner makes n+1 mistakes before it learns c
- Note that a misclassification occurs only when the currently learned hypo misclassifies a positive example as negative, and each such mistake removes at least one constraint from the hypo
- In the above worst case, c is the concept that labels every instance as positive
35Mistake Bound for Halving Algorithm
- Halving algorithm: incrementally maintain the version space as each new instance arrives; predict on a new instance by a majority vote (of the hypo in the VS)
- Q: What is the maximum number of mistakes that can be made by the Halving algorithm, for an arbitrary finite H, before it exactly learns the target concept c (assume c is in H)?
- Answer: the largest integer no more than log2|H|, i.e. ⌊log2|H|⌋
- How about the minimum number of mistakes?
- Answer: zero mistakes!
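A runnable sketch of the Halving algorithm over an explicitly enumerated finite H (the tiny pool of single-attribute hypotheses and the example stream are illustrative assumptions): predict by a majority vote of the current version space, then discard the hypotheses that disagreed with the revealed label.

```python
# Halving algorithm over an explicitly enumerated finite hypothesis space.
# Each hypothesis is a function x -> 0/1; the version space VS starts as all of H
# and shrinks whenever the true label arrives.
def halving_run(H, stream):
    vs = list(H)
    mistakes = 0
    for x, y in stream:                                  # y = c(x), revealed after predicting
        votes = sum(h(x) for h in vs)
        prediction = 1 if 2 * votes > len(vs) else 0     # majority vote (ties -> 0)
        mistakes += (prediction != y)
        vs = [h for h in vs if h(x) == y]                # keep only consistent hypotheses
    return mistakes, len(vs)                             # (mistakes, remaining |VS|)

# tiny illustrative H: "output the value of attribute i" for i = 0..3
H = [lambda x, i=i: x[i] for i in range(4)]
stream = [((1, 0, 1, 0), 1), ((0, 0, 1, 1), 1), ((1, 1, 0, 0), 0)]
print(halving_run(H, stream))   # at most floor(log2 |H|) = 2 mistakes
```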
36Optimal Mistake Bounds
- For an arbitrary concept class C, assuming H ⊇ C, we are interested in the lowest worst-case mistake bound over all possible learning algorithms
- Let MA(c) denote the maximum number of mistakes, over all possible training sequences, that a learner A makes to exactly learn c.
- Def. MA(C) = max over c ∈ C of MA(c)
- Ex: MFIND-S(C) = n+1, MHalving(C) ≤ log2|C|
37Optimal Mistake Bounds (2)
- The optimal mistake bound for C, denoted Opt(C), is defined as the minimum of MA(C) over all possible learning algorithms A
- Notice that Opt(C) ≤ MHalving(C) ≤ log2|C|
- Furthermore, Littlestone (1987) shows that VC(C) ≤ Opt(C)!
- When C equals the power set CP of a finite instance space X, the above four quantities all become equal, i.e. equal to |X|
38Weighted-Majority Algorithm
- It is a generalization of the Halving algorithm: it makes a prediction by taking a weighted vote among a pool of prediction algorithms (or hypotheses) and learns by altering the weights
- It starts by assigning equal weight (1) to every prediction algorithm. Whenever an algorithm misclassifies a training example, its weight is reduced
- The Halving algorithm is the special case that reduces the weight to zero
39Procedure for Adjusting Weights
- ai denotes the ith prediction algorithm in the pool; wi denotes the weight of ai and is initialized to 1
- For each training example <x, c(x)>
- Initialize q0 and q1 to 0
- For each ai: if ai(x) = 0 then q0 ← q0 + wi, else q1 ← q1 + wi
- If q1 > q0, predict c(x) to be 1, else
- if q1 < q0, predict c(x) to be 0, else
- predict c(x) at random to be 1 or 0.
- For each ai, do
- If ai(x) ≠ c(x) (where c(x) is given by the teacher), wi ← β·wi
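A runnable sketch of the procedure above with β = 0.5 (the pool of single-attribute predictors and the example stream are illustrative assumptions, not from the slides):

```python
import random

# Weighted-Majority: weighted vote over a pool of prediction algorithms,
# then multiply the weight of every algorithm that was wrong by beta.
def weighted_majority(pool, stream, beta=0.5):
    w = [1.0] * len(pool)                       # all weights start at 1
    mistakes = 0
    for x, y in stream:
        q0 = sum(wi for a, wi in zip(pool, w) if a(x) == 0)
        q1 = sum(wi for a, wi in zip(pool, w) if a(x) == 1)
        pred = 1 if q1 > q0 else 0 if q0 > q1 else random.randint(0, 1)
        mistakes += (pred != y)
        # unlike Halving, wrong predictors are down-weighted, not discarded
        w = [wi * beta if a(x) != y else wi for a, wi in zip(pool, w)]
    return mistakes, w

pool = [lambda x, i=i: x[i] for i in range(4)]  # illustrative predictors
stream = [((1, 0, 1, 0), 1), ((0, 1, 1, 1), 1), ((1, 1, 0, 0), 0)]
print(weighted_majority(pool, stream))
```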
40Comments on Adjusting Weights Idea
- The idea can be found in various problems such as pattern matching, where we might reduce the weights of less frequently used patterns in the learned library
- The textbook claims that one benefit of the algorithm is that it can accommodate inconsistent training data; but in the case of learning by query, we presume that the answer given by the teacher is always correct.
41Relative Mistake Bound for the Algorithm
- Theorem 5.3: Let D be the training sequence, A be any set of n prediction algorithms, and k be the minimum number of mistakes made by any algorithm in A on the training sequence D. Then the number of mistakes over D made by the Weighted-Majority algorithm using β = 0.5 is at most 2.4(k + log2 n).
- Proof: The basic idea is to compare the final weight of the best prediction algorithm to the sum of the weights of all predictors. Let aj be an algorithm with k mistakes; then its final weight is wj = (1/2)^k. Now consider the sum W of the weights of all predictors, and observe that for every mistake made by Weighted-Majority, W is reduced to at most (3/4)W.
42Proof of Theorem 5.3 (contd)
- Let M be the total number of mistakes made by the Weighted-Majority algorithm; then the final total weight W is at most n(3/4)^M, and furthermore (1/2)^k ≤ W ≤ n(3/4)^M. Solve this inequality for M, and we are done.
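Spelling out that final algebraic step (a worked computation consistent with the argument above):

```latex
% From (1/2)^k <= W <= n (3/4)^M, take base-2 logarithms and solve for M:
\[
  -k \;\le\; \log_2 n + M \log_2\tfrac{3}{4}
  \;\Longrightarrow\;
  M \;\le\; \frac{k + \log_2 n}{\log_2(4/3)} \;\approx\; 2.4\,(k + \log_2 n).
\]
```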
43Summary
- Problem setting in Ch5
- Inductively learning an unknown target function, given training examples and a hypothesis space
- Focus on
- How many training examples are sufficient?
- PAC-learning model (probably approximately correct), VC dimension for infinite hypo spaces
- How many mistakes will the learner make before succeeding?
- Mistake bound, optimal mistake bound
44HW
- 7.2, 7.5, 7.8 (10pt each, Due Tuesday, 11-3)