Title: On generalization bounds, projection profile, and margin distribution
1. On generalization bounds, projection profile, and margin distribution
- Chien-I Liao
- Jan. 24, 2006
2. Quotes
- Theory is where one knows everything but nothing works.
- Practice is where everything works but nobody knows why.
- In my research, I worked on both theory and practice.
- Therefore, nothing works and nobody knows why.
3. Outline
- Generalization bounds
- (1) PAC Learning Model
- (2) VC dimension
- (3) Support Vector Machine (SVM)
- Projection profile
- Margin distribution
- Experiments
- Conclusion
4. Generalization Bounds
- Definitions
- X: the set of all possible instances
- c: the target concept to learn
- C: the collection of concepts
- h: the hypothesis trying to approximate the concept to be learned
- H: the collection of concept hypotheses
- D: a fixed probability distribution over X; training and testing samples are drawn according to D
- T: the set of training examples
5. Generalization Bounds
- Example: Snake Classification
- X: all snakes in the world, each represented by (L, c, h), where L = length, c = color, h = head shape
- C: all functions f: X → {Poisonous, Not poisonous}; so if |X| = 10000, then |C| = 2^10000
- H: a subset of C
- D: uniform distribution over X
- T: a subset of X
6. Generalization Bounds
- Generalization bounds: bounds on the generalization error
- Generalization error: assume c is the actual concept and h is the hypothesis returned by the learning algorithm; then the error is
- error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]
7. Generalization Bounds
- NOTE
- Any meaningful generalization bound shouldn't be greater than 0.5
- Otherwise, it is no better than a fair coin!
8. PAC Learning Model
9. PAC Learning Model
- PAC: Probably Approximately Correct
- A concept class C over X is PAC-learnable iff there exists an algorithm L such that for all c in C, for all distributions D on X, and for all 0 < ε < 1/2 and 0 < δ < 1/2, after observing m examples drawn according to D, with m polynomial in 1/ε and 1/δ, L outputs with probability at least (1 - δ) a hypothesis h with generalization error at most ε, i.e.
- Pr[error_D(h) > ε] < δ
10. PAC Learning Model: Example
- Guessing the legal age to drink
- [Number line with points A, B, ?, C, D in increasing order; the true threshold '?' lies between B and C]
- Any consistent hypothesis can go wrong only in the range (B, C)
- Assume Pr_D[B < x < C] = ε; then the probability that m samples all fall outside (B, C) is (1 - ε)^m < e^{-εm} < δ
- So it suffices to test (1/ε)·ln(1/δ) training samples.
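The step from the inequality above to the stated sample size is just solving e^{-εm} ≤ δ for m; a short derivation, with an illustrative choice of ε and δ (my numbers, not the slide's):

    \[
    (1-\varepsilon)^m \le e^{-\varepsilon m} \le \delta
    \;\Longrightarrow\;
    -\varepsilon m \le \ln\delta
    \;\Longrightarrow\;
    m \ge \frac{1}{\varepsilon}\ln\frac{1}{\delta}.
    \]

For example, ε = δ = 0.05 gives m ≥ 20·ln(20) ≈ 60 samples.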
11. PAC Learning Model
- For the consistent case, it suffices to draw m samples with
- m ≥ (ln|H| + ln(1/δ)) / ε
- For the inconsistent case, it suffices to draw m samples with
- m ≥ (ln(2|H|) + ln(1/δ)) / (2ε²)
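As a quick numeric check of these two formulas, here is a small Python sketch; the values of |H|, ε, and δ are illustrative assumptions, not from the slides:

    import math

    def m_consistent(H_size, eps, delta):
        # m >= (ln|H| + ln(1/delta)) / eps  (consistent case)
        return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

    def m_inconsistent(H_size, eps, delta):
        # m >= (ln(2|H|) + ln(1/delta)) / (2 eps^2)  (inconsistent case)
        return math.ceil((math.log(2 * H_size) + math.log(1 / delta)) / (2 * eps ** 2))

    # Illustrative finite class of one million hypotheses.
    print(m_consistent(10**6, eps=0.05, delta=0.05))    # 337
    print(m_inconsistent(10**6, eps=0.05, delta=0.05))  # 3501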
12. PAC Learning Model
- A naïve bound, almost meaningless in most cases since |H| is usually very large in real-world problems
- Cannot be applied to an infinite hypothesis space
- Let's add some flavor to it
13. VC dimension
14. VC dimension
- If the concept class C = H has VC dimension d, and hypothesis h is consistent with all m (> d) training data, the generalization error of h is bounded by ...
- For the inconsistent case, the bound would be ...
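The two bound expressions did not survive this transcript. A commonly quoted form of these VC bounds (standard results; the slide's exact constants may differ) is, with probability at least 1 - δ:

    \[
    \text{consistent case:}\quad
    \mathrm{error}_D(h) \le \frac{2}{m}\left(d\log_2\frac{2em}{d} + \log_2\frac{2}{\delta}\right),
    \]
    \[
    \text{inconsistent case:}\quad
    \mathrm{error}_D(h) \le \widehat{\mathrm{error}}_T(h) + \sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right) + \ln\frac{4}{\delta}}{m}}.
    \]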
15. VC dimension
- More accurate estimation with the VC dimension. A lower bound is also available!
- If the concept class C = H has VC dimension d, then for any learning algorithm L there exists a distribution D such that, with probability at least δ, given m random examples, the error of the hypothesis output by L is at least
- Ω((d + 1/δ)/m)
16. VC dimension
- The upper bound does not apply if the VC dimension d is infinite
- Using a more powerful hypothesis set can describe the concept more accurately, but it also yields higher error on some extreme distributions
17. VC dimension
Adapted from Prof. Mohri's lecture notes
18. Support Vector Machine
19. Support Vector Machine (SVM)
- Solving binary classification problems with maximum-margin hyperplanes.
- (x1, y1), (x2, y2), ..., (xm, ym) ∈ R^N × {-1, +1}
- h(x) = w·x + b, w ∈ R^N, b ∈ R
- Classifier: sign(h(x))
- Optimization problem:
- min_w ||w||²/2
- subject to y_i(x_i·w + b) ≥ 1
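A minimal sketch of this optimization using scikit-learn; the toy data and the large C value (to approximate the hard-margin problem) are illustrative assumptions, not from the slides. It also prints the fraction of support vectors used by the bound on slide 21:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)

    # Toy separable data in R^2 with labels in {-1, +1}.
    X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(+2, 0.5, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    # Linear SVM; a very large C approximates min ||w||^2/2 s.t. y_i(x_i.w + b) >= 1.
    clf = SVC(kernel="linear", C=1e6).fit(X, y)

    w, b = clf.coef_[0], clf.intercept_[0]
    print("geometric margin =", 1.0 / np.linalg.norm(w))            # 1/||w||
    print("support vectors  =", len(clf.support_))                  # N_SV
    print("N_SV / (m + 1)   =", len(clf.support_) / (len(X) + 1))   # cf. slide 21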
20. Support Vector Machine (SVM)
Adapted and modified from Prof. Mohri's lecture notes
21. Support Vector Machine (SVM)
- In the separable case, the expected generalization error of h trained on m data points is bounded by the expected fraction of support vectors for a training set of size m + 1:
- E[error(h_m)] ≤ E[N_SV] / (m + 1)
- (N_SV: number of support vectors)
22. General Margin Bound
- Let H = {h : x ↦ w·x, ||w|| ≤ Λ} and ||x|| ≤ R. If the output classifier sign(h) has margin at least γ/||w|| on m training data, then there is a constant c such that for any distribution D, with probability at least 1 - δ, ...
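The bound expression is missing from the transcript. The margin bound of this type usually cited (and, as far as I can tell, the paper's bound (2)) has roughly the following shape, where γ denotes the geometric margin; the exact constant c and log factors may differ on the original slide:

    \[
    \mathrm{error}_D(h) \;\le\; \frac{c}{m}\left(\frac{R^2}{\gamma^2}\,\log^2 m \;+\; \log\frac{1}{\delta}\right).
    \]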
23. Comparison of two bounds
- VC bound (1)
- depends only on the VC dimension, independent of the training data
- Margin bound (2)
- depends only on the training data, independent of the VC dimension
24. Comparison of two bounds
- In real-world problems, the feature-space dimension n is usually very high, which leads to a large VC dimension d. For (1) to be meaningful, we usually have to observe at least 17d examples.
- The margin bound, however, is so loose that we need to observe 10^6 examples even when the margin is as large as 0.3.
25. Can we find a new bound by combining these two aspects?
26. Projection Profile
27. Projection Profile
- Project the original R^n vectors onto a much lower-dimensional space R^k
- Projector: random k × n matrices
- Distortion: some correctly classified data would be misclassified in the new space, and vice versa
- The larger the margin in the original space, the smaller the distortion
28. Projection Profile
- Random matrix: a k × n matrix R in which each entry r_ij ~ N(0, 1/k). The projection is denoted x' = Rx, where x ∈ R^n and x' ∈ R^k
- For any constant c, ... (3)
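A small Python sketch of this projection (the dimensions and vectors are illustrative assumptions). It shows empirically that inner products between unit vectors are roughly preserved, which is the kind of statement the omitted formula (3) makes precise:

    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 10_000, 200          # original and projected dimensions (illustrative)

    # Two unit vectors in R^n.
    u = rng.normal(size=n); u /= np.linalg.norm(u)
    v = rng.normal(size=n); v /= np.linalg.norm(v)

    # Random projection matrix with entries r_ij ~ N(0, 1/k).
    R = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))
    u_p, v_p = R @ u, R @ v

    print("original  u.v   =", float(u @ v))
    print("projected u'.v' =", float(u_p @ v_p))   # close to u.v with high probability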
29. Projection Profile
- Clearly, we can assume w.l.o.g. that all data come from the surface of the unit sphere and that ||h|| = 1. (Why?)
- Note that if ||u|| = ||v|| = 1, then ||u - v||² = 2 - 2u·v. Therefore (3) can be viewed as stating that, with high probability, a random projection preserves the angles between vectors lying on the unit sphere. (Why?)
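For the second "(Why?)": expanding the squared norm shows why distance preservation and angle preservation coincide on the unit sphere:

    \[
    \|u-v\|^2 = \|u\|^2 - 2\,u\cdot v + \|v\|^2 = 2 - 2\,u\cdot v
    \quad\text{when } \|u\|=\|v\|=1,
    \]

so a projection that approximately preserves distances between unit vectors also approximately preserves their inner products, and u·v is exactly the cosine of the angle between them.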
30. Projection Profile
- Let the classifier be sign(h^T x + b), x_j a sample point, ||h|| = ||x_j|| = 1, and γ_j = h^T x_j. Let h' = Rh and x_j' = Rx_j be their images in the projected space R^k. Then
- Pr[sign(h^T x_j + b) ≠ sign(h'^T x_j' + b)] ≤ ...
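The bound on this flip probability is not reproduced above, but its qualitative behavior (a larger margin γ_j makes sign flips rarer) is easy to check with a small Monte Carlo sketch; all settings below are illustrative assumptions, with b = 0 for simplicity:

    import numpy as np

    rng = np.random.default_rng(1)
    n, k, trials = 500, 100, 1000

    def flip_rate(gamma):
        # Unit vectors h, x_j in R^n with h.x_j = gamma.
        h = np.zeros(n); h[0] = 1.0
        x = np.zeros(n); x[0] = gamma; x[1] = np.sqrt(1 - gamma ** 2)
        flips = 0
        for _ in range(trials):
            R = rng.normal(0.0, np.sqrt(1.0 / k), size=(k, n))
            flips += np.sign(h @ x) != np.sign((R @ h) @ (R @ x))
        return flips / trials

    print(flip_rate(0.05))   # small margin: flips are fairly common
    print(flip_rate(0.5))    # large margin: flips are rare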
31. Projection Profile
- Define the projection error PE_k(h, R, T) as the fraction of data points that are classified differently in the original and projected spaces. Then, with probability at least 1 - δ (over the choice of R), PE_k(h, R, T) is upper bounded by ...
32. Projection Profile
- Since the VC dimension of k-dimensional hyperplanes is k + 1, substituting k + 1 for d in formula (1), the error contributed by the VC component can be bounded by ...
33. Projection Profile (cont.)
- Symmetrization
- ge: generalization error
- te_S: testing error on S
- Pr[|ge - te_T1| > ε] < 2 Pr[|te_T1 - te_T2| > ε/2]
34. Projection Profile
- Finally, combining the two components via symmetrization, with probability at least 1 - 4δ the generalization error can be bounded as follows: ...
35. Projection Profile
- Trade-off between the random-projection error and the VC-dimension error
36. Margin Distribution
37. Margin Distribution
(d) might be a better choice than (c)
38. Margin Distribution
The contribution of data points to the
generalization error as a function of margin
39. Margin Distribution
- Weight function
- Objective function to minimize
40. Margin Distribution
The weight given to the data points by the MDO algorithm as a function of margin
41. Margin Distribution
- Comparing the two functions again
- α should be thought of as the optimal projection dimension and could be optimized.
42. Margin Distribution
- But in fact, α and β are chosen from experimental results.
- Observation: in most cases, setting
- α = 1/(γ̄b)², β = 1/(γ̄b)
- gave good results, where γ̄ = Σγ_i/m is an estimate of the average margin for some h.
43. MDO algorithm
- Minimize L(h, b)
- subject to ||h|| = 1
- Difficulty: not convex, so it could get trapped in a local minimum.
- Choosing a good initial classifier is important!
- Solution: use SVM to obtain the initial classifier, then use gradient-descent methods to reach the optimum.
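A rough Python sketch of this recipe. The transcript does not reproduce the actual objective L(h, b) or the weight function, so the exponentially weighted margin loss below is only a placeholder assumption; what follows the slide is the overall structure: SVM initialization, then gradient descent while keeping ||h|| = 1.

    import numpy as np
    from sklearn.svm import SVC

    def mdo_like(X, y, alpha=1.0, steps=500, lr=0.01):
        """SVM-initialized gradient descent on a margin-distribution-style loss.
        The loss is a placeholder, not the paper's exact L(h, b)."""
        svm = SVC(kernel="linear", C=1.0).fit(X, y)
        h, b = svm.coef_[0].astype(float), float(svm.intercept_[0])
        h /= np.linalg.norm(h)                       # enforce ||h|| = 1

        for _ in range(steps):
            margins = y * (X @ h + b)                # gamma_i = y_i (h.x_i + b)
            w = np.exp(-alpha * margins)             # weight small/negative margins more
            grad_h = -alpha * (w * y) @ X / len(X)   # gradient of mean exp(-alpha * gamma_i)
            grad_b = -alpha * (w * y).mean()
            h -= lr * grad_h
            b -= lr * grad_b
            h /= np.linalg.norm(h)                   # project back onto the unit sphere
        return h, b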
44. Experiments
45. Experiments
- Considered 17,000-dimensional data taken from the problem of context-sensitive spelling correction.
- The margin bound is not useful since the margin is quite small.
- To gain confidence from the VC bounds, we would need over 120,000 data points.
- The random-projection term is below 0.5 already after 2,000 samples.
46. Experiments
Histogram of margin distribution
Projection error as a function of dimension k
47. Experiments
- Correlation between margin and test error
- Correlation between training margin and test
margin
48. Experiments
- MDO algorithm: margin vs. iterations
49. Experiments
- Training/testing error vs. iterations
50. Experiments
51. Conclusion
52. Conclusion (Theirs)
- New theoretical bound in this paper.
- New algorithm focusing on the margin distribution rather than the typical use of the notion of margin in machine learning.
- The bound is still loose, and more research is needed to match the observed performance on real data.
- Any new algorithmic implications?
53. Conclusion (Mine)
- They did not apply the projection-profile technique in the real experiments! :-(
- Actually, I don't think these two papers are well linked. The proposed algorithm cannot be analyzed with their theoretical results.
- But the idea of trying to derive good bounds from the margin distribution is useful.
54. Conclusion (Mine)
- With the new theoretical bounds, we could apply them to different existing methods (like SVM and boosting).
- If it turns out that the results are not as good as the original ones, perhaps we could fix the theoretical results and return to the previous step.
55. Quotes
- Nothing is more practical than good theory!
56. Questions? Comments?
57. Thank you!