Title: CS 9633 Machine Learning
1 CS 9633 Machine Learning - Computational Learning Theory
Adapted from notes by Tom Mitchell: http://www-2.cs.cmu.edu/~tom/mlbook-chapter-slides.html
2 Theoretical Characterization of Learning Problems
- Under what conditions is successful learning possible and impossible?
- Under what conditions is a particular learning algorithm assured of learning successfully?
3 Two Frameworks
- PAC (Probably Approximately Correct) Learning Framework
  - Identify classes of hypotheses that can and cannot be learned from a polynomial number of training examples
  - Define a natural measure of complexity for hypothesis spaces that allows bounding the number of training examples needed
- Mistake Bound Framework
4 Theoretical Questions of Interest
- Is it possible to identify classes of learning problems that are inherently difficult or easy, independent of the learning algorithm?
- Can one characterize the number of training examples necessary or sufficient to assure successful learning?
- How is the number of examples affected
  - if observing a random sample of training data?
  - if the learner is allowed to pose queries to the trainer?
- Can one characterize the number of mistakes that a learner will make before learning the target function?
- Can one characterize the inherent computational complexity of a class of learning algorithms?
5 Computational Learning Theory
- Relatively recent field
- Area of intense research
- For some of the questions on the previous page, the partial answer is yes.
- Will generally focus on certain types of learning problems.
6 Inductive Learning of Target Function
- What we are given
  - Hypothesis space
  - Training examples
- What we want to know
  - How many training examples are sufficient to successfully learn the target function?
  - How many mistakes will the learner make before succeeding?
7 Questions for Broad Classes of Learning Algorithms
- Sample complexity
  - How many training examples do we need to converge to a successful hypothesis with high probability?
- Computational complexity
  - How much computational effort is needed to converge to a successful hypothesis with high probability?
- Mistake bound
  - How many training examples will the learner misclassify before converging to a successful hypothesis?
8 PAC Learning
- Probably Approximately Correct Learning Model
- Will restrict discussion to learning boolean-valued concepts in noise-free data.
9 Problem Setting: Instances and Concepts
- X is the set of all possible instances over which the target function may be defined
- C is the set of target concepts the learner is to learn
  - Each target concept c in C is a subset of X
  - Each target concept c in C is a boolean function c: X → {0,1}
    - c(x) = 1 if x is a positive example of the concept
    - c(x) = 0 otherwise
10 Problem Setting: Distribution
- Instances are generated at random using some probability distribution D
  - D may be any distribution
  - D is generally not known to the learner
  - D is required to be stationary (does not change over time)
- Training examples x are drawn at random from X according to D and presented with the target value c(x) to the learner.
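A concrete sketch of this setting (the target concept and the uniform distribution below are illustrative stand-ins, not part of the slides): instances are drawn from a fixed D and presented to the learner together with their label c(x).

```python
import random

def target_concept(x):
    """Hypothetical target concept c: positive iff the first two features hold."""
    return 1 if (x[0] == 1 and x[1] == 1) else 0

def draw_training_examples(m, seed=0):
    """Draw m instances at random from a fixed (stationary) distribution D --
    here uniform over {0,1}^4 -- each paired with its target value c(x)."""
    rng = random.Random(seed)
    sample = []
    for _ in range(m):
        x = tuple(rng.randint(0, 1) for _ in range(4))  # one draw from D
        sample.append((x, target_concept(x)))
    return sample

examples = draw_training_examples(5)
```

Because D is stationary, drawing again with the same seed reproduces the same labeled sample; the learner never sees D itself, only the labeled draws.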
11 Problem Setting: Hypotheses
- Learner L considers a set of hypotheses H
- After observing a sequence of training examples of the target concept c, L must output some hypothesis h from H which is its estimate of c
12 Example Problem (Classifying Executables)
- Three classes (Malicious, Boring, Funny)
- Features
  - a1: GUI present (yes/no)
  - a2: Deletes files (yes/no)
  - a3: Allocates memory (yes/no)
  - a4: Creates new thread (yes/no)
- Distribution?
- Hypotheses?
13 Training Examples

Instance  a1   a2   a3   a4   Class
1         Yes  No   No   Yes  B
2         Yes  No   No   No   B
3         No   Yes  Yes  No   F
4         No   No   Yes  Yes  M
5         Yes  No   No   Yes  B
6         Yes  No   No   No   F
7         Yes  Yes  Yes  No   M
8         Yes  Yes  No   Yes  M
9         No   No   No   Yes  B
10        No   No   Yes  No   M
14 True Error
- Definition: The true error (denoted errorD(h)) of hypothesis h with respect to target concept c and distribution D is the probability that h will misclassify an instance drawn at random according to D.
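Since errorD(h) is a probability over draws from D, it can be approximated by sampling. A minimal sketch (the concept, hypothesis, and uniform distribution below are illustrative assumptions, not from the slides):

```python
import random

def estimate_true_error(h, c, draw_instance, n_samples=100_000, seed=0):
    """Monte Carlo estimate of error_D(h) = Pr_{x ~ D}[ h(x) != c(x) ].
    `draw_instance` samples one instance from the distribution D."""
    rng = random.Random(seed)
    mistakes = 0
    for _ in range(n_samples):
        x = draw_instance(rng)
        if h(x) != c(x):
            mistakes += 1
    return mistakes / n_samples

# Toy check on X = {0,1}^2 under the uniform distribution:
c = lambda x: x[0] and x[1]   # target concept: both features must hold
h = lambda x: x[0]            # hypothesis that ignores the second feature
draw = lambda rng: (rng.randint(0, 1), rng.randint(0, 1))
# h and c disagree only on x = (1, 0), so error_D(h) should be near 0.25
err = estimate_true_error(h, c, draw)
```

This also previews the key point of the next slide: the learner cannot compute this quantity, because it cannot sample fresh instances from the unknown D at will.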
15 Error of h with respect to c
[Figure: instance space X with the positive regions of target concept c and hypothesis h overlapping; the true error is the probability, under D, of the region where c and h disagree]
16 Key Points
- True error is defined over the entire instance space, not just the training data
- Error depends strongly on the unknown probability distribution D
- The error of h with respect to c is not directly observable to the learner; L can only observe performance with respect to the training data (training error)
- Question: How probable is it that the observed training error for h gives a misleading estimate of the true error?
17 PAC Learnability
- Goal: characterize classes of target concepts that can be reliably learned
  - from a reasonable number of randomly drawn training examples and
  - using a reasonable amount of computation
- It is unreasonable to expect perfect learning, where errorD(h) = 0
  - We would need to provide training examples corresponding to every possible instance
  - With a random sample of training examples, there is always a non-zero probability that the training examples will be misleading
18 Weaken Demand on Learner
- Hypothesis error (Approximately)
  - Will not require a zero-error hypothesis
  - Require only that the error is bounded by some constant ε that can be made arbitrarily small
  - ε is the error parameter
- Error on training data (Probably)
  - Will not require that the learner succeed on every sequence of randomly drawn training examples
  - Require only that its probability of failure is bounded by a constant δ that can be made arbitrarily small
  - δ is the confidence parameter
19 Definition of PAC-Learnability
- Definition: Consider a concept class C defined over a set of instances X of length n and a learner L using hypothesis space H. C is PAC-learnable by L using H if for all c ∈ C, distributions D over X, ε such that 0 < ε < 1/2, and δ such that 0 < δ < 1/2, learner L will with probability at least (1 - δ) output a hypothesis h ∈ H such that errorD(h) ≤ ε, in time that is polynomial in 1/ε, 1/δ, n, and size(c).
20 Requirements of Definition
- L must, with arbitrarily high probability (1-δ), output a hypothesis having arbitrarily low error (ε).
- L's learning must be efficient: its cost grows polynomially in terms of
  - the strength of the output hypothesis (1/ε, 1/δ)
  - the inherent complexity of the instance space (n) and concept class C (size(c)).
21 Block Diagram of PAC Learning Model
[Figure: learning algorithm L receives control parameters ε and δ together with a training sample, and outputs a hypothesis h]
22 Examples of Second Requirement
- Consider the executables problem, where instances are conjunctions of boolean features
  - a1=yes ∧ a2=no ∧ a3=yes ∧ a4=no
- Concepts are conjunctions of a subset of the features
  - a1=yes ∧ a3=yes ∧ a4=yes
23 Using the Concept of PAC Learning in Practice
- We often want to know how many training instances we need in order to achieve a certain level of accuracy with a specified probability.
- If L requires some minimum processing time per training example, then for C to be PAC-learnable by L, L must learn from a polynomial number of training examples.
24 Sample Complexity
- The sample complexity of a learning problem is the growth in the required number of training examples with problem size.
- We will determine the sample complexity for consistent learners.
- A learner is consistent if it outputs hypotheses that perfectly fit the training data whenever possible.
- All algorithms in Chapter 2 are consistent learners.
25 Recall Definition of VS
- The version space, denoted VSH,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H consistent with the training examples in D
26 VS and PAC Learning by Consistent Learners
- Every consistent learner outputs a hypothesis belonging to the version space, regardless of the instance space X, hypothesis space H, or training data D.
- To bound the number of examples needed by any consistent learner, we need only bound the number of examples needed to assure that the version space contains no unacceptable hypotheses.
27 ε-exhausted
- Definition: Consider a hypothesis space H, target concept c, instance distribution D, and set of training examples D of c. The version space VSH,D is said to be ε-exhausted with respect to c and D if every hypothesis h in VSH,D has error less than ε with respect to c and D.
28 Exhausting the Version Space
[Figure: hypothesis space H, with the version space VSH,D as the subset of hypotheses whose training error r is 0. Each hypothesis is labeled with (true error, training error r): inside VSH,D are (error = 0.1, r = 0) and (error = 0.2, r = 0); outside are (0.1, 0.2), (0.2, 0.3), (0.3, 0.2), and (0.3, 0.4)]
29 Exhausting the Version Space
- Only an observer who knows the identity of the target concept can determine with certainty whether the version space is ε-exhausted.
- But we can bound the probability that the version space will be ε-exhausted after a given number of training examples
  - without knowing the identity of the target concept
  - without knowing the distribution from which training examples were drawn
30 Theorem 7.1
- Theorem 7.1 (ε-exhausting the version space): If the hypothesis space H is finite, and D is a sequence of m ≥ 1 independent randomly drawn examples of some target concept c, then for any 0 ≤ ε ≤ 1, the probability that the version space VSH,D is not ε-exhausted (with respect to c) is less than or equal to
  |H|e^(-εm)
31 Proof of Theorem 7.1
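A sketch of the standard union-bound argument behind Theorem 7.1 (a reconstruction following the usual textbook proof, not the original slide body):

```latex
% Fix any ``bad'' hypothesis h \in H with error_D(h) \ge \epsilon.
% It survives one random example with probability at most (1 - \epsilon):
\Pr[h \text{ consistent with one example}] \le 1 - \epsilon
% The m examples are drawn independently, so:
\Pr[h \text{ consistent with all } m \text{ examples}] \le (1 - \epsilon)^m
% There are at most |H| such bad hypotheses; by the union bound:
\Pr[VS_{H,D} \text{ not } \epsilon\text{-exhausted}] \le |H|(1 - \epsilon)^m
% Finally, since (1 - \epsilon) \le e^{-\epsilon} for 0 \le \epsilon \le 1:
|H|(1 - \epsilon)^m \le |H|\, e^{-\epsilon m}
```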
32 Number of Training Examples (Eq. 7.2)
- Setting |H|e^(-εm) ≤ δ and solving for m gives
  m ≥ (1/ε)(ln|H| + ln(1/δ))
33 Summary of Result
- The inequality on the previous slide provides a general bound on the number of training examples sufficient for any consistent learner to successfully learn any target concept in H, for any desired values of ε and δ.
- This number m of training examples is sufficient to assure that any consistent hypothesis will be
  - probably (with probability 1-δ)
  - approximately (within error ε) correct.
- The value of m grows
  - linearly with 1/ε
  - logarithmically with 1/δ
  - logarithmically with |H|
- The bound can be a substantial overestimate.
34 Problem
- Suppose we have the instance space described for the EnjoySport problem
  - Sky (Sunny, Cloudy, Rainy)
  - AirTemp (Warm, Cold)
  - Humidity (Normal, High)
  - Wind (Strong, Weak)
  - Water (Warm, Cold)
  - Forecast (Same, Change)
- Hypotheses can be as before
  - (?, Warm, Normal, ?, ?, Same), (∅, ∅, ∅, ∅, ∅, ∅)
- How many training examples do we need to have an error rate of less than 10% with a probability of 95%?
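A worked sketch of this problem using Eq. 7.2 with ε = 0.1 and δ = 0.05. The count |H| = 973 assumes the semantically distinct EnjoySport hypothesis space from Chapter 2 (each attribute takes one of its values or "?", plus a single all-∅ hypothesis):

```python
import math

def sample_complexity(h_size, epsilon, delta):
    """Eq. 7.2: m >= (1/epsilon) * (ln|H| + ln(1/delta)), rounded up."""
    return math.ceil((1.0 / epsilon) * (math.log(h_size) + math.log(1.0 / delta)))

# Sky has 3 values (+ '?'), the other five attributes have 2 values (+ '?');
# all hypotheses containing an empty set collapse to one, giving:
H = 1 + 4 * 3 * 3 * 3 * 3 * 3   # = 973
m = sample_complexity(H, epsilon=0.1, delta=0.05)
```

With these numbers the bound comes out to 99 examples, which is modest; the next slides show where this style of bound stops applying.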
35 Limits of Equation 7.2
- Equation 7.2 tells us how many training examples suffice to ensure (with probability (1-δ)) that every hypothesis having zero training error will have a true error of at most ε.
- Problem: there may be no hypothesis that is consistent with the training examples if the concept is not in H. In this case, we want the minimum-error hypothesis.
36 Agnostic Learning and Inconsistent Hypotheses
- An agnostic learner does not make the assumption that the concept is contained in the hypothesis space.
- We may want to consider the hypothesis with the minimum error
- Can derive a bound similar to the previous one
37 Concepts that are PAC-Learnable
- Proofs that a type of concept is PAC-learnable usually consist of two steps:
  - Show that each target concept in C can be learned from a polynomial number of training examples
  - Show that the processing time per training example is also polynomially bounded
38 PAC Learnability of Conjunctions of Boolean Literals
- Class C of target concepts described by conjunctions of boolean literals
  - GUI_Present ∧ ¬Opens_files
- Is C PAC learnable? Yes.
- Will prove by
  - showing that a polynomial number of training examples suffices to learn each concept
  - demonstrating an algorithm that uses polynomial time per training example
39 Examples Needed to Learn Each Concept
- Consider a consistent learner that uses hypothesis space H = C
- Compute the number m of random training examples sufficient to ensure that L will, with probability (1 - δ), output a hypothesis with maximum error ε.
- We will use m ≥ (1/ε)(ln|H| + ln(1/δ))
- What is the size of the hypothesis space?
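For conjunctions of n boolean literals, each variable appears as a positive literal, a negated literal, or not at all, so |H| = 3^n, and substituting into Eq. 7.2 gives a bound polynomial in n (this substitution is Mitchell's Eq. 7.4). A sketch:

```python
import math

def conjunction_sample_complexity(n, epsilon, delta):
    """Eq. 7.4: with |H| = 3^n for conjunctions of boolean literals,
    Eq. 7.2 becomes m >= (1/epsilon) * (n*ln(3) + ln(1/delta)),
    which grows only linearly in the number of variables n."""
    return math.ceil((1.0 / epsilon) * (n * math.log(3) + math.log(1.0 / delta)))

h_size = 3 ** 4                                            # |H| = 81 for n = 4
m = conjunction_sample_complexity(n=10, epsilon=0.1, delta=0.05)
```

The key point is the growth rate: doubling n roughly doubles m, rather than squaring |H|, because m depends on ln|H| = n·ln 3.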
40 Complexity Per Example
- We just need to show that for some algorithm, we can spend a polynomial amount of time per training example.
- One way to do this is to give an algorithm.
- In this case, we can use Find-S as the learning algorithm.
- Find-S incrementally computes the most specific hypothesis consistent with each training example.
  - Old ∧ Tired
  - Old ∧ Happy
  - Tired
  - Old ∧ ¬Tired
  - Rich ∧ Happy
- What is a bound on the time per example?
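A minimal sketch of FIND-S for conjunctions of boolean literals (the index-based literal encoding and the example data are illustrative assumptions): the hypothesis is a set of literals, and each positive example is processed in time linear in the number of features n, which is exactly the polynomial per-example bound the PAC argument needs.

```python
def find_s(examples, n):
    """FIND-S for conjunctions of boolean literals. A literal is a pair
    (i, True) meaning a_i or (i, False) meaning NOT a_i. Start with the
    most specific hypothesis (all 2n literals) and drop every literal
    falsified by a positive example -- O(n) work per example."""
    h = {(i, v) for i in range(n) for v in (True, False)}
    for x, label in examples:
        if label == 1:  # FIND-S ignores negative examples
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

# Positive examples consistent with the (hypothetical) concept a0 AND NOT a1:
examples = [
    ((True, False, True), 1),
    ((False, True, False), 0),   # negative example: ignored
    ((True, False, False), 1),
]
h = find_s(examples, n=3)
```

After the two positive examples, only the literals a0 and ¬a1 survive, i.e. h = {(0, True), (1, False)}.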
41 Theorem 7.2
- Theorem 7.2 (PAC-learnability of boolean conjunctions): The class C of conjunctions of boolean literals is PAC-learnable by the FIND-S algorithm using H = C
42 Proof of Theorem 7.2
- Equation 7.4 shows that the sample complexity for this concept class is polynomial in n, 1/ε, and 1/δ, and independent of size(c). To incrementally process each training example, the FIND-S algorithm requires effort linear in n and independent of 1/ε, 1/δ, and size(c). Therefore, this concept class is PAC-learnable by the FIND-S algorithm.
43 Interesting Results
- Unbiased learners are not PAC learnable because they require an exponential number of examples.
- k-term Disjunctive Normal Form (k-term DNF) is not PAC learnable
- k-CNF (Conjunctive Normal Form with clauses of up to k literals) is a superset of k-term DNF, yet it is PAC learnable
44 Sample Complexity with Infinite Hypothesis Spaces
- Two drawbacks to the previous result
  - It often does not give a very tight bound on the sample complexity
  - It only applies to finite hypothesis spaces
- Vapnik-Chervonenkis dimension of H (VC dimension)
  - Will give tighter bounds
  - Applies to many infinite hypothesis spaces.
45 Shattering a Set of Instances
- Consider a subset of instances S from the instance space X.
- Every hypothesis imposes a dichotomy on S:
  - {x ∈ S : h(x) = 1}
  - {x ∈ S : h(x) = 0}
- Given a set of instances S, there are 2^|S| possible dichotomies.
- The ability of H to shatter a set of instances is a measure of its capacity to represent target concepts defined over these instances.
46 Shattering a Hypothesis Space
- Definition: A set of instances S is shattered by hypothesis space H if and only if for every dichotomy of S there exists some hypothesis in H consistent with this dichotomy.
47 Vapnik-Chervonenkis Dimension
- The ability to shatter a set of instances is closely related to the inductive bias of the hypothesis space.
- An unbiased hypothesis space is one that shatters the instance space X.
- Sometimes X cannot be shattered, but a large subset of it can.
48 Vapnik-Chervonenkis Dimension
- Definition: The Vapnik-Chervonenkis dimension, VC(H), of hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H. If arbitrarily large finite subsets of X can be shattered by H, then VC(H) = ∞.
49 Shattered Instance Space
50 Example 1 of VC Dimension
- Instance space X is the set of real numbers: X = R.
- H is the set of intervals on the real number line. Hypotheses have the form
  - a < x < b
- What is VC(H)?
51 Shattering the Real Number Line
[Figure: the points -1.2, 3.4, and 6.7 on the real number line]
What is VC(H)? What is |H|?
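The answer VC(H) = 2 for intervals a < x < b can be checked by brute force: any two points can be shattered, but no three, since the dichotomy that labels the two outer points positive and the middle point negative would need a disconnected interval. A sketch (the candidate-endpoint trick is an implementation choice, not from the slides):

```python
from itertools import product

def interval_hypothesis(a, b):
    """A hypothesis of the form a < x < b."""
    return lambda x: 1 if a < x < b else 0

def shattered(points):
    """Brute-force check: can intervals realize every dichotomy of `points`?
    It suffices to try endpoints just outside and between the sorted points."""
    pts = sorted(points)
    cuts = [pts[0] - 1] + [(p + q) / 2 for p, q in zip(pts, pts[1:])] + [pts[-1] + 1]
    for dichotomy in product([0, 1], repeat=len(points)):
        realizable = any(
            all(interval_hypothesis(a, b)(x) == d for x, d in zip(points, dichotomy))
            for a in cuts for b in cuts
        )
        if not realizable:
            return False
    return True

two_ok = shattered([-1.2, 3.4])          # every dichotomy of 2 points works
three_ok = shattered([-1.2, 3.4, 6.7])   # the (1, 0, 1) dichotomy fails
```

Note that H itself is infinite here, yet VC(H) = 2 is finite, which is exactly why the VC-based bounds reach hypothesis spaces the ln|H| bound cannot.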
52 Example 2 of VC Dimension
- Set X of instances corresponding to points on the x,y plane
- H is the set of all linear decision surfaces
- What is VC(H)?
53 Shattering the x-y Plane
[Figure: dichotomies of 2 instances and of 3 instances in the plane, each realized by a linear decision surface]
VC(H)? |H|?
54 Proving Limits on VC Dimension
- If we find any set of instances of size d that can be shattered, then VC(H) ≥ d.
- To show that VC(H) < d, we must show that no set of size d can be shattered.
55 General Result for r-Dimensional Space
- The VC dimension of linear decision surfaces in an r-dimensional space is r + 1.
56 Example 3 of VC Dimension
- Set X of instances are conjunctions of exactly three boolean literals
  - young ∧ happy ∧ single
- H is the set of hypotheses described by a conjunction of up to 3 boolean literals.
- What is VC(H)?
57 Shattering Conjunctions of Literals
- Approach: construct a set of instances of size 3 that can be shattered. Let instance i have positive literal li and all other literals negative. Representing instances over the literals l1, l2, and l3 as bit strings:
  - Instance1: 100
  - Instance2: 010
  - Instance3: 001
- Construction of a dichotomy: to exclude instance i, add the appropriate ¬li to the hypothesis.
- The argument extends to n literals.
- Can VC(H) be greater than n (the number of literals)?
58 Sample Complexity and the VC Dimension
- Can derive a new bound for the number of randomly drawn training examples that suffice to probably approximately learn a target concept (how many examples do we need to ε-exhaust the version space with probability (1-δ)?)
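One standard upper bound in terms of the VC dimension (Mitchell's Eq. 7.7, due to Blumer et al.; quoted here from the book, since the slide does not show the formula) is m ≥ (1/ε)(4 log2(2/δ) + 8·VC(H)·log2(13/ε)). A quick computation for the interval example, where VC(H) = 2:

```python
import math

def vc_sample_complexity(vc_dim, epsilon, delta):
    """VC-based sufficient sample size (Eq. 7.7 in Mitchell):
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil((1.0 / epsilon) *
                     (4 * math.log2(2.0 / delta) +
                      8 * vc_dim * math.log2(13.0 / epsilon)))

m = vc_sample_complexity(vc_dim=2, epsilon=0.1, delta=0.05)
```

Note how the roles mirror Eq. 7.2: the dependence on 1/δ is still logarithmic, but ln|H| (undefined for the infinite interval class) is replaced by a term linear in VC(H).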
59 Comparing the Bounds
60 Lower Bound on Sample Complexity
- Theorem 7.3 (Lower bound on sample complexity): Consider any concept class C such that VC(C) ≥ 2, any learner L, and any 0 < ε < 1/8 and 0 < δ < 1/100. Then there exists a distribution D and a target concept in C such that if L observes fewer examples than
  max[(1/ε) log(1/δ), (VC(C)-1)/(32ε)]
  then with probability at least δ, L outputs a hypothesis h having errorD(h) > ε.