Title: On generalization bounds, projection profile, and margin distribution
1. On generalization bounds, projection profile, and margin distribution
- Chien-I Liao
- Jan. 24, 2006
2. Quotes
- Theory is where one knows everything but nothing works.
- Practice is where everything works but nobody knows why.
- In my research, I worked on both theory and practice.
- Therefore, nothing works and nobody knows why.
3. Outline
- Generalization bounds
- (1) PAC Learning Model
- (2) VC dimension
- (3) Support Vector Machine (SVM)
- Projection profile
- Margin distribution
- Experiments
- Conclusion
4. Generalization Bounds
- Definitions
- X: the set of all possible instances
- c: the target concept to learn
- C: the collection of concepts
- h: the hypothesis trying to approximate the concept to be learned
- H: the collection of concept hypotheses
- D: a fixed probability distribution over X; training and testing samples are drawn according to D
- T: the set of training examples
5. Generalization Bounds
- Example: Snake Classification
- X: all snakes in the world, each represented by (L, c, h), where L = length, c = color, h = head shape
- C: all functions f: X → {Poisonous, Not poisonous}; so if |X| = 10000, then |C| = 2^10000
- H: a subset of C
- D: uniform distribution over X
- T: a subset of X
6. Generalization Bounds
- Generalization bounds: bounds on the generalization error
- Generalization error: assume c is the actual concept and h is the hypothesis returned by the learning algorithm; then the error is
- error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]
7. Generalization Bounds
- NOTE
- Any meaningful generalization bound shouldn't be greater than 0.5
- Otherwise, it is no better than a fair coin!
8. PAC Learning Model
9. PAC Learning Model
- PAC: Probably Approximately Correct
- A concept class C over X is PAC-learnable iff there exists an algorithm L such that for all c in C, for all distributions D on X, and for all 0 < ε < 1/2 and 0 < δ < 1/2, after observing m examples drawn according to D, with m polynomial in 1/ε and 1/δ, L outputs with probability at least (1 - δ) a hypothesis h with generalization error at most ε, i.e.
- Pr[error_D(h) > ε] < δ
10. PAC Learning Model: Example
- Guessing the legal age to drink
- [Number line with points A, B, ?, C, D in increasing order; the true threshold '?' lies between B and C]
- Any consistent hypothesis can go wrong only in the range (B, C)
- Assume Pr_D[B < x < C] = ε; then the probability that m samples all fall outside (B, C) is (1 - ε)^m < e^{-εm} < δ
- So it suffices to test (1/ε)·ln(1/δ) training samples.
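The step from the inequality above to the stated sample size is just solving e^{-εm} ≤ δ for m; a short derivation, with an illustrative choice of ε and δ (my numbers, not the slide's):

    \[
    (1-\varepsilon)^m \le e^{-\varepsilon m} \le \delta
    \;\Longrightarrow\;
    -\varepsilon m \le \ln\delta
    \;\Longrightarrow\;
    m \ge \frac{1}{\varepsilon}\ln\frac{1}{\delta}.
    \]

For example, ε = δ = 0.05 gives m ≥ 20·ln(20) ≈ 60 samples.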
11. PAC Learning Model
- For the consistent case, it suffices to draw m samples with
- m ≥ (ln|H| + ln(1/δ)) / ε
- For the inconsistent case, it suffices to draw m samples with
- m ≥ (ln(2|H|) + ln(1/δ)) / (2ε²)
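As a quick numeric check of these two formulas, here is a small Python sketch; the values of |H|, ε, and δ are illustrative assumptions, not from the slides:

    import math

    def m_consistent(H_size, eps, delta):
        # m >= (ln|H| + ln(1/delta)) / eps  (consistent case)
        return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

    def m_inconsistent(H_size, eps, delta):
        # m >= (ln(2|H|) + ln(1/delta)) / (2 eps^2)  (inconsistent case)
        return math.ceil((math.log(2 * H_size) + math.log(1 / delta)) / (2 * eps ** 2))

    # Illustrative finite class of one million hypotheses.
    print(m_consistent(10**6, eps=0.05, delta=0.05))    # 337
    print(m_inconsistent(10**6, eps=0.05, delta=0.05))  # 3501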
12. PAC Learning Model
- A naïve bound, almost meaningless in most cases since |H| is usually very large in real-world problems
- Cannot be applied to an infinite hypothesis space
- Let's add some flavor to it
13. VC dimension
14. VC dimension
- If the concept class C = H has VC dimension d, and hypothesis h is consistent with all m (> d) training data, the generalization error of h is bounded by ...
- For the inconsistent case, the bound would be ...
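The two bound expressions did not survive this transcript. A commonly quoted form of these VC bounds (standard results; the slide's exact constants may differ) is, with probability at least 1 - δ:

    \[
    \text{consistent case:}\quad
    \mathrm{error}_D(h) \le \frac{2}{m}\left(d\log_2\frac{2em}{d} + \log_2\frac{2}{\delta}\right),
    \]
    \[
    \text{inconsistent case:}\quad
    \mathrm{error}_D(h) \le \widehat{\mathrm{error}}_T(h) + \sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right) + \ln\frac{4}{\delta}}{m}}.
    \]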
15. VC dimension
- More accurate estimation with the VC dimension. A lower bound is also available!
- If the concept class C = H has VC dimension d, then for any learning algorithm L there exists a distribution D such that, with probability at least δ, given m random examples, the error of the hypothesis output by L is at least
- Ω((d + 1/δ)/m)
16. VC dimension
- The upper bound does not apply if the VC dimension d is infinite
- Using a more powerful hypothesis set can describe the concept more accurately, but it also yields higher error on some extreme distributions
17. VC dimension
Adapted from Prof. Mohri's lecture notes
18. Support Vector Machine
19. Support Vector Machine (SVM)
- Solving binary classification problems with maximum-margin hyperplanes.
- (x1, y1), (x2, y2), ..., (xm, ym) ∈ R^N × {-1, +1}
- h(x) = w·x + b, w ∈ R^N, b ∈ R
- Classifier: sign(h(x))
- Optimization problem:
- min_w ||w||²/2
- subject to y_i(x_i·w + b) ≥ 1
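A minimal sketch of this optimization using scikit-learn; the toy data and the large C value (to approximate the hard-margin problem) are illustrative assumptions, not from the slides. It also prints the fraction of support vectors used by the bound on slide 21:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy separable data in R^2 with labels in {-1, +1}.
    X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(+2, 0.5, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    # Linear SVM; a very large C approximates min ||w||^2/2 s.t. y_i(x_i.w + b) >= 1.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("geometric margin =", 1.0 / np.linalg.norm(w))            # 1/||w||
    print("support vectors  =", len(clf.support_))                  # N_SV
    print("N_SV / (m + 1)   =", len(clf.support_) / (len(X) + 1))   # cf. slide 21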
20. Support Vector Machine (SVM)
Adapted and modified from Prof. Mohri's lecture notes
21. Support Vector Machine (SVM)
- In the separable case, the expected generalization error of h trained on m data points is bounded by the expected fraction of support vectors for a training set of size m + 1:
- E[error(h_m)] ≤ E[N_SV] / (m + 1)
- (N_SV: number of support vectors)
22. General Margin Bound
- Let H = {h : x ↦ w·x, ||w|| ≤ Λ} and ||x|| ≤ R. If the output classifier sign(h) has margin at least γ/||w|| on m training data, then there is a constant c such that for any distribution D, with probability at least 1 - δ, ...
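The bound expression is missing from the transcript. The margin bound of this type usually cited (and, as far as I can tell, the paper's bound (2)) has roughly the following shape, where γ denotes the geometric margin; the exact constant c and log factors may differ on the original slide:

    \[
    \mathrm{error}_D(h) \;\le\; \frac{c}{m}\left(\frac{R^2}{\gamma^2}\,\log^2 m \;+\; \log\frac{1}{\delta}\right).
    \]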
23. Comparison of two bounds
- VC bound (1)
- depends only on the VC dimension, independent of the training data
- Margin bound (2)
- depends only on the training data, independent of the VC dimension
24. Comparison of two bounds
- In real-world problems, the feature-space dimension n is usually very high, which leads to a large VC dimension d. For (1) to be meaningful, we usually have to observe at least 17d examples.
- The margin bound, however, is so loose that we need to observe 10^6 examples even when the margin is as large as 0.3.
25. Can we find a new bound by combining these two aspects?
26. Projection Profile
27. Projection Profile
- Project the original R^n vectors onto a much lower-dimensional space R^k
- Projector: random k × n matrices
- Distortion: some correctly classified data would be misclassified in the new space, and vice versa
- The larger the margin in the original space, the smaller the distortion
28. Projection Profile
- Random matrix: a k × n matrix R in which each entry r_ij ~ N(0, 1/k). The projection is denoted x' = Rx, where x ∈ R^n and x' ∈ R^k
- For any constant c, ... (3)
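A small Python sketch of this projection (the dimensions and vectors are illustrative assumptions). It shows empirically that inner products between unit vectors are roughly preserved, which is the kind of statement the omitted formula (3) makes precise:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 10_000, 200          # original and projected dimensions (illustrative)

    # Two unit vectors in R^n.
    u = rng.normal(size=n); u /= np.linalg.norm(u)
    v = rng.normal(size=n); v /= np.linalg.norm(v)

    # Random projection matrix with entries r_ij ~ N(0, 1/k).
    R = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))
    u_p, v_p = R @ u, R @ v

    print("original  u.v   =", float(u @ v))
    print("projected u'.v' =", float(u_p @ v_p))   # close to u.v with high probability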
29. Projection Profile
- Clearly, we can assume w.l.o.g. that all data come from the surface of the unit sphere and that ||h|| = 1. (Why?)
- Note that if ||u|| = ||v|| = 1, then ||u - v||² = 2 - 2u·v. Therefore (3) can be viewed as stating that, with high probability, a random projection preserves the angles between vectors lying on the unit sphere. (Why?)
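For the second "(Why?)": expanding the squared norm shows why distance preservation and angle preservation coincide on the unit sphere:

    \[
    \|u-v\|^2 = \|u\|^2 - 2\,u\cdot v + \|v\|^2 = 2 - 2\,u\cdot v
    \quad\text{when } \|u\|=\|v\|=1,
    \]

so a projection that approximately preserves distances between unit vectors also approximately preserves their inner products, and u·v is exactly the cosine of the angle between them.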
30. Projection Profile
- Let the classifier be sign(h^T x + b), x_j a sample point, ||h|| = ||x_j|| = 1, and γ_j = h^T x_j. Let h' = Rh and x_j' = Rx_j be their images in the projected space R^k. Then
- Pr[sign(h^T x_j + b) ≠ sign(h'^T x_j' + b)] ≤ ...
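The bound on this flip probability is not reproduced above, but its qualitative behavior (a larger margin γ_j makes sign flips rarer) is easy to check with a small Monte Carlo sketch; all settings below are illustrative assumptions, with b = 0 for simplicity:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, trials = 500, 100, 1000

    def flip_rate(gamma):
        # Unit vectors h, x_j in R^n with h.x_j = gamma.
        h = np.zeros(n); h[0] = 1.0
        x = np.zeros(n); x[0] = gamma; x[1] = np.sqrt(1 - gamma ** 2)
        flips = 0
        for _ in range(trials):
            R = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))
            flips += np.sign(h @ x) != np.sign((R @ h) @ (R @ x))
        return flips / trials

    print(flip_rate(0.05))   # small margin: flips are fairly common
    print(flip_rate(0.5))    # large margin: flips are rare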
31. Projection Profile
- Define the projection error PE_k(h, R, T) as the fraction of data points that are classified differently in the original and projected spaces. Then, with probability at least 1 - δ (over the choice of R), PE_k(h, R, T) is upper bounded by ...
32. Projection Profile
- Since the VC dimension of k-dimensional hyperplanes is k + 1, substituting k + 1 for d in formula (1), the error contributed by the VC component can be bounded by ...
33. Projection Profile (cont.)
- Symmetrization
- ge: generalization error
- te_S: testing error on S
- Pr[|ge - te_T1| > ε] < 2 Pr[|te_T1 - te_T2| > ε/2]
34. Projection Profile
- Finally, combining the two components via symmetrization, with probability at least 1 - 4δ the generalization error can be bounded as follows: ...
35. Projection Profile
- Trade-off between the random-projection error and the VC-dimension error
36. Margin Distribution
37. Margin Distribution
(d) might be a better choice than (c)
38. Margin Distribution
The contribution of data points to the
generalization error as a function of margin
39. Margin Distribution
- Weight function
- Objective function to minimize
40. Margin Distribution
The weight given to the data points by the MDO algorithm as a function of margin
41. Margin Distribution
- Comparing the two functions again
- α should be thought of as the optimal projection dimension and could be optimized.
42. Margin Distribution
- But in fact, α and β are chosen from experimental results.
- Observation: in most cases, setting
- α = 1/(γ̄b)², β = 1/(γ̄b)
- gave good results, where γ̄ = Σγ_i/m is an estimate of the average margin for some h.
43. MDO algorithm
- Minimize L(h, b)
- subject to ||h|| = 1
- Difficulty: not convex, so it could get trapped in a local minimum.
- Choosing a good initial classifier is important!
- Solution: use SVM to obtain the initial classifier, then use gradient-descent methods to reach the optimum.
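A rough Python sketch of this recipe. The transcript does not reproduce the actual objective L(h, b) or the weight function, so the exponentially weighted margin loss below is only a placeholder assumption; what follows the slide is the overall structure: SVM initialization, then gradient descent while keeping ||h|| = 1.

    import numpy as np
    from sklearn.svm import SVC

    def mdo_like(X, y, alpha=1.0, steps=500, lr=0.01):
        """SVM-initialized gradient descent on a margin-distribution-style loss.
        The loss is a placeholder, not the paper's exact L(h, b)."""
        svm = SVC(kernel="linear", C=1.0).fit(X, y)
        h, b = svm.coef_[0].astype(float), float(svm.intercept_[0])
        h /= np.linalg.norm(h)                       # enforce ||h|| = 1

        for _ in range(steps):
            margins = y * (X @ h + b)                # gamma_i = y_i (h.x_i + b)
            w = np.exp(-alpha * margins)             # weight small/negative margins more
            grad_h = -alpha * (w * y) @ X / len(X)   # gradient of mean exp(-alpha * gamma_i)
            grad_b = -alpha * (w * y).mean()
            h -= lr * grad_h
            b -= lr * grad_b
            h /= np.linalg.norm(h)                   # project back onto the unit sphere
        return h, b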
44. Experiments
45. Experiments
- Considered 17,000-dimensional data taken from the problem of context-sensitive spelling correction.
- The margin bound is not useful since the margin is quite small.
- To gain confidence from the VC bounds, we would need over 120,000 data points.
- The random-projection term is below 0.5 already after 2,000 samples.
46. Experiments
Histogram of margin distribution
Projection error as a function of dimension k
47. Experiments
- Correlation between margin and test error
- Correlation between training margin and test
margin
48. Experiments
- MDO algorithm: margin vs. iterations
49. Experiments
- Training/testing error vs. iterations
50. Experiments
51. Conclusion
52. Conclusion (Theirs)
- New theoretical bound in this paper.
- New algorithm focusing on the margin distribution rather than the typical use of the notion of margin in machine learning.
- The bound is still loose, and more research is needed to match the observed performance on real data.
- Any new algorithmic implications?
53. Conclusion (Mine)
- They did not apply the projection-profile technique in the real experiments! :-(
- Actually, I don't think these two papers are well linked. The proposed algorithm cannot be analyzed with their theoretical results.
- But the idea of trying to derive good bounds from the margin distribution is useful.
54. Conclusion (Mine)
- With the new theoretical bounds, we could apply them to different existing methods (like SVM and boosting).
- If it turns out that the results are not as good as the original ones, perhaps we could fix the theoretical results and return to the previous step.
55. Quotes
- Nothing is more practical than good theory!
56. Questions? Comments?
57. Thank you!