Title: Multimedia search: From Lab to Web
1. KI2 - 5
- Computational Learning Theory
- PAC
- IID
- VC Dimension
- SVM
Marius Bulacu
prof. dr. Lambert Schomaker
Kunstmatige Intelligentie / RuG
2. Learning
- Learning is essential for unknown environments
  - i.e., when designer lacks omniscience
- Learning is useful as a system construction method
  - i.e., expose the agent to reality rather than trying to write it down
- Learning modifies the agent's decision mechanisms to improve performance
3. Learning Agents
4. Learning Element
- Design of a learning element is affected by
  - Which components of the performance element are to be learned
  - What feedback is available to learn these components
  - What representation is used for the components
- Type of feedback
  - Supervised learning: correct answers for each example
  - Unsupervised learning: correct answers not given
  - Reinforcement learning: occasional rewards
5. Inductive Learning
- Simplest form: learn a function from examples
  - f is the target function
  - an example is a pair (x, f(x))
- Problem: find a hypothesis h
  - such that h ≈ f
  - given a training set of examples
- This is a highly simplified model of real learning
  - ignores prior knowledge
  - assumes examples are given
6. Inductive Learning Method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting
7-11. Inductive Learning Method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting (slides 7-11 repeat this text; the original figures show different candidate curves fitted to the same data points)
Occam's razor: prefer the simplest hypothesis consistent with the data
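To make the curve-fitting picture concrete, here is a minimal sketch (my own illustration, not from the slides) using hypothetical data points: polynomials of increasing degree are fitted to the same examples, the highest degree interpolates them exactly, and yet Occam's razor favours the simplest hypothesis that fits.

```python
# Minimal sketch: fit polynomials of increasing degree to the same points.
# A degree-4 polynomial passes through all 5 points (a consistent hypothesis),
# but Occam's razor prefers the nearly-linear fit that explains them as well.
import numpy as np

# hypothetical training examples (x, f(x))
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])

for degree in (1, 2, 4):
    coeffs = np.polyfit(x, y, degree)                 # least-squares fit
    max_err = np.abs(np.polyval(coeffs, x) - y).max() # training error
    print(f"degree {degree}: max training error = {max_err:.3f}")
```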
12. Occam's Razor
- If two theories explain the facts equally well, then the simpler theory is to be preferred.
- Rationale
  - There are fewer short hypotheses than long hypotheses.
  - A short hypothesis that fits the data is unlikely to be a coincidence.
  - A long hypothesis that fits the data may be a coincidence.
- Formal treatment in computational learning theory
William of Occam (1285-1349, England)
13. The Problem
- Why does learning work?
- How do we know that the learned hypothesis h is close to the target function f if we do not know what f is?
The answer is provided by computational learning theory.
14. The Answer
- Any hypothesis h that is consistent with a sufficiently large number of training examples is unlikely to be seriously wrong.
Therefore it must be Probably Approximately Correct (PAC).
15. The Stationarity Assumption
- The training and test sets are drawn randomly from the same population of examples using the same probability distribution.
Therefore training and test data are Independently and Identically Distributed (IID): the future is like the past.
16. How many examples are needed?
- Probability that some wrong hypothesis is consistent with all examples: at most |H| (1 - ε)^N
  - |H|: size of the hypothesis space
  - N: number of examples
  - ε: probability that h and f disagree on a randomly drawn example
- Requiring this probability to stay below a small δ gives the sample complexity: N ≥ (1/ε) (ln(1/δ) + ln |H|)
17. Formal Derivation
- H: the set of all possible hypotheses
- H_bad: the set of wrong hypotheses, i.e., those that disagree with f on a randomly drawn example with probability greater than ε
- A hypothesis in H_bad is consistent with one random example with probability at most (1 - ε), hence with N independent examples with probability at most (1 - ε)^N
- P(H_bad contains a hypothesis consistent with all N examples) ≤ |H_bad| (1 - ε)^N ≤ |H| (1 - ε)^N
- Setting this bound ≤ δ and solving for N yields N ≥ (1/ε) (ln(1/δ) + ln |H|)
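A minimal sketch (not part of the slides) that plugs numbers into this bound; the choices of ε, δ and |H| below are illustrative assumptions.

```python
# PAC sample-complexity bound for a finite hypothesis space:
#   N >= (1/eps) * (ln(1/delta) + ln|H|)
from math import ceil, log

def pac_sample_size(eps: float, delta: float, h_size: int) -> int:
    """Examples needed so that, with probability >= 1 - delta, any hypothesis
    consistent with them all has error <= eps."""
    return ceil((log(1.0 / delta) + log(h_size)) / eps)

# e.g. all Boolean functions of n attributes: |H| = 2 ** (2 ** n)
for n in (5, 10):
    print(n, pac_sample_size(eps=0.1, delta=0.05, h_size=2 ** (2 ** n)))
```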
18. What if the hypothesis space is infinite?
- Can't use our result for finite H
- Need some other measure of complexity for H
  - Vapnik-Chervonenkis (VC) dimension
19-21. (Figure slides; no transcript available)
22. Shattering two binary dimensions over a number of classes
- In order to understand the principle of shattering sample points into classes, we will look at the simple case of
  - two dimensions
  - of binary value
23-37. A sequence of figure slides, each showing the same 2×2 binary feature space (axes f1 and f2, values 0 and 1); the titles indicate the class configuration drawn:
- 23. 2-D feature space
- 24. 2-D feature space, 2 classes
- 25. the other class
- 26. 2 left vs 2 right
- 27. top vs bottom
- 28. right vs left
- 29. bottom vs top
- 30. lower-right outlier
- 31. lower-left outlier
- 32. upper-left outlier
- 33. upper-right outlier
- 34. etc.
- 35-37. 2-D feature space
38-39. XOR configuration A / XOR configuration B
[Figures: the two diagonal (XOR) labelings of the 2×2 binary feature space, axes f1 and f2]
40. 2-D feature space, two classes: 16 hypotheses
[Table: hypotheses numbered 0-15, one column per hypothesis; rows give the class assigned to the cells f1 = 0/1, f2 = 0/1]
- A hypothesis is a possible class partitioning of all data samples; with 4 cells and 2 classes there are 2^4 = 16 of them.
41. 2-D feature space, two classes, 16 hypotheses
[Table: the same 16 hypotheses]
- The two XOR class configurations (2 of the 16 hypotheses) require a non-linear separatrix, as the sketch below verifies.
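A minimal sketch (my own illustration, not from the slides) that enumerates all 16 labelings of the four cells and brute-forces a linear separatrix over a small candidate grid of weights; the two labelings for which none is found are exactly the XOR configurations, which are known to be non-linearly separable.

```python
# Enumerate all 2^4 = 16 two-class labelings ("hypotheses") of the four cells
# of the binary f1 x f2 space and test which ones a linear separatrix
# sign(w1*f1 + w2*f2 + b) can realize. For these four points the small
# candidate grid below finds a separator whenever one exists.
from itertools import product

points = [(0, 0), (0, 1), (1, 0), (1, 1)]          # the four cells
weights = [-1.0, 0.0, 1.0]
biases = [-1.5, -0.5, 0.5, 1.5]

def linearly_separable(labels):
    for w1, w2, b in product(weights, weights, biases):
        if all((w1 * f1 + w2 * f2 + b > 0) == bool(c)
               for (f1, f2), c in zip(points, labels)):
            return True
    return False

non_separable = [labels for labels in product((0, 1), repeat=4)
                 if not linearly_separable(labels)]
print(len(non_separable), non_separable)
# -> 2 [(0, 1, 1, 0), (1, 0, 0, 1)]   i.e. the two XOR configurations
```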
42-43. XOR, a possible non-linear separation
[Figures: non-linear separation boundaries for the XOR configuration in the 2×2 binary feature space, axes f1 and f2]
44-45. 2-D feature space, three classes: how many hypotheses?
[Table: enumeration of hypotheses 0, 1, 2, ... over the cells f1 = 0/1, f2 = 0/1]
- 3^4 = 81 possible hypotheses
46. Maximum, discrete space
- Four classes: 4^4 = 256 hypotheses
- Assume that there are no more classes than discrete cells
- N_hyp,max = n_classes^n_cells (each of the n_cells cells can be assigned any of the n_classes labels)
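A quick check (my own, not from the slides) of the counts quoted on slides 40-46, reading the formula as n_classes raised to the power n_cells, which is the order consistent with 16, 81 and 256:

```python
# Number of possible class partitionings of a discretised feature space:
# each of the n_cells cells independently takes one of n_classes labels.
def max_hypotheses(n_cells: int, n_classes: int) -> int:
    return n_classes ** n_cells

print(max_hypotheses(4, 2), max_hypotheses(4, 3), max_hypotheses(4, 4))
# -> 16 81 256
```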
47. 2-D feature space, three classes
In this example, one class is linearly separable from the rest, as is a second; but the third class is not linearly separable from the rest of the classes.
[Figure: 2×2 binary feature space, axes f1 and f2, with three class labels]
48. 2-D feature space, four classes
Minsky & Papert: simple table lookup or logic will do nicely.
[Figure: 2×2 binary feature space, axes f1 and f2, with four class labels]
49. 2-D feature space, four classes
Spheres or radial-basis functions may offer a compact class encapsulation in case of limited noise and limited overlap (but in the end the data will tell: experimentation required!)
[Figure: 2×2 binary feature space, axes f1 and f2, with class regions enclosed by circular boundaries]
50. SVM (1): Kernels
- Implicit mapping to a higher-dimensional space where linear separation is possible.
[Figure: left, the original f1 × f2 space with a complicated separation boundary; right, the mapped f1 × f2 × f3 space with a simple separation boundary (a hyperplane)]
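A minimal sketch (my own illustration, not from the slides) of an explicit version of such a mapping: adding a third feature f3 = f1·f2 to the XOR data makes the two classes linearly separable, which is the effect a kernel achieves implicitly.

```python
# The explicit map (f1, f2) -> (f1, f2, f1*f2) makes the XOR labeling
# linearly separable; kernels perform such mappings without building them.
points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]                                # XOR configuration

def phi(f1, f2):
    return (f1, f2, f1 * f2)                         # mapped 3-D feature vector

# the hyperplane f1 + f2 - 2*f3 - 0.5 = 0 separates the mapped points
w, b = (1.0, 1.0, -2.0), -0.5
for (f1, f2), c in zip(points, labels):
    score = sum(wi * xi for wi, xi in zip(w, phi(f1, f2))) + b
    print((f1, f2), "class", c, "predicted positive:", score > 0)
```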
51. SVM (2): Max Margin
[Figure: f1 × f2 space showing the best separating hyperplane, its max margin, and the support vectors lying on the margin; a large margin gives good generalization]
- From all the possible separating hyperplanes, select the one that gives the max margin.
- The solution is found by quadratic optimization (learning).
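To close, a minimal sketch (not from the slides) that fits a max-margin linear SVM on illustrative toy data with scikit-learn; the quadratic-optimization step is handled inside SVC.fit, and support_vectors_ exposes the points that define the margin.

```python
# Fit a max-margin (linear-kernel) SVM on toy 2-D data with scikit-learn.
import numpy as np
from sklearn.svm import SVC

# two illustrative point clouds in the (f1, f2) plane
X = np.array([[0.0, 0.2], [0.3, 0.1], [0.2, 0.4],     # class 0
              [1.0, 0.9], [0.8, 1.1], [1.1, 0.8]])    # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e3)   # large C approximates a hard margin
clf.fit(X, y)

print("support vectors:\n", clf.support_vectors_)
print("separating hyperplane: w =", clf.coef_[0], " b =", clf.intercept_[0])
print("prediction for (0.5, 0.5):", clf.predict([[0.5, 0.5]])[0])
```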