1
On generalization bounds, projection profile, and
margin distribution
  • Chien-I Liao
  • Jan. 24 2006

2
Quotes
  • Theory is where one knows everything but nothing
    works.
  • Practice is where everything works but nobody
    knows why.
  • In my research, I worked on both theory and
    practice.
  • Therefore, nothing works and nobody knows why.

3
Outline
  • Generalization bounds
  • (1) PAC Learning Model
  • (2) VC dimension
  • (3) Support Vector Machine (SVM)
  • Projection profile
  • Margin distribution
  • Experiments
  • Conclusion

4
Generalization Bounds
  • Definitions
  • X: the set of all possible instances
  • c: the target concept to learn
  • C: the collection of concepts
  • h: a concept hypothesis trying to approximate the
    concept to be learned
  • H: the collection of concept hypotheses
  • D: a fixed probability distribution over X.
    Training and testing samples are drawn according
    to D.
  • T: the set of training examples

5
Generalization Bounds
  • Example: Snake Classification
  • X: all snakes in the world, each represented by
    (L, c, h), where L is length, c is color, and h is
    head shape
  • C: all functions f: X → {Poisonous, Not poisonous};
    if |X| = 10000, then |C| = 2^10000
  • H: a subset of C
  • D: the uniform distribution over X
  • T: a subset of X

6
Generalization Bounds
  • Generalization bounds: bounds on the
    generalization error
  • Generalization error: assume c is the actual
    concept and h is the hypothesis returned by the
    learning algorithm; then the error is
  • error_D(h) = Pr_{x~D}[c(x) ≠ h(x)]

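Because error_D(h) is a probability over the distribution D, it can be estimated by sampling; the sketch below is illustrative only, with toy placeholders for the concept c, the hypothesis h, and the sampler for D.

```python
import random

def estimate_error(c, h, sample_from_D, num_samples=10_000):
    """Monte Carlo estimate of error_D(h) = Pr_{x~D}[c(x) != h(x)]."""
    mistakes = sum(c(x) != h(x)
                   for x in (sample_from_D() for _ in range(num_samples)))
    return mistakes / num_samples

# Toy placeholders: D uniform on [0, 1), target c(x) = [x > 0.5],
# hypothesis h(x) = [x > 0.6]; the true error is 0.1.
err = estimate_error(lambda x: x > 0.5,
                     lambda x: x > 0.6,
                     random.random)
print(f"estimated error_D(h) ≈ {err:.3f}")
```
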
7
Generalization Bounds
  • NOTE
  • Any meaningful generalization bound shouldn't be
    greater than 0.5
  • Otherwise, it is no better than a fair coin!

8
PAC Learning Model
9
PAC Learning Model
  • PAC: Probably Approximately Correct
  • A concept class C over X is PAC-learnable iff
    there exists an algorithm L such that for all c
    in C, for all distributions D on X, and for all
    0 < ε < 1/2 and 0 < δ < 1/2, after observing m
    examples drawn according to D, with m polynomial
    in 1/ε and 1/δ, L outputs with probability at
    least (1 − δ) a hypothesis h with generalization
    error at most ε, i.e.
  • Pr[error_D(h) > ε] < δ

10
PAC Learning Model Example
  • Guessing the legal age to drink
  • (Figure: points on a line, A  B  ?  C  D, with
    negative examples up to B, positive examples from
    C onward, and the unknown threshold in between)
  • Any consistent hypothesis can go wrong only in the
    range (B, C)
  • Assume Pr_D[B < x < C] = ε; then the probability
    that m samples all fall outside (B, C) is
    (1 − ε)^m ≤ e^(−εm) < δ
  • So, it suffices to test m ≥ (1/ε) ln(1/δ) training
    samples.

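For concreteness, a quick worked instance of this sample-complexity bound (the accuracy and confidence values are chosen purely for illustration):

$$ m \;\ge\; \frac{1}{\epsilon}\ln\frac{1}{\delta} \;=\; \frac{1}{0.05}\ln\frac{1}{0.05} \;=\; 20\ln 20 \;\approx\; 60, $$

so roughly 60 consistent samples already guarantee error at most 0.05 with probability at least 0.95 for this one-dimensional threshold problem.
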
11
PAC Learning Model
  • For the consistent case, it suffices to draw m
    samples if
  • m ≥ (ln|H| + ln(1/δ)) / ε
  • For the inconsistent case, it suffices to draw m
    samples if
  • m ≥ (ln(2|H|) + ln(1/δ)) / (2ε²)

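Plugging in numbers shows why the next slide calls this bound naive: if, for illustration, H were as large as the snake-example concept class C (so |H| = 2^10000) and ε = δ = 0.1, the consistent-case bound would require

$$ m \;\ge\; \frac{\ln|H| + \ln(1/\delta)}{\epsilon} \;=\; \frac{10000\ln 2 + \ln 10}{0.1} \;\approx\; \frac{6931.5 + 2.3}{0.1} \;\approx\; 6.9\times 10^{4} \text{ samples.} $$
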
12
PAC Learning Model
  • Naïve bound, almost meaningless in most cases,
    since |H| is usually very huge in real-world
    problems
  • Cannot be applied to an infinite hypothesis space.
  • Let's add some flavor to it

13
VC dimension
14
VC dimension
  • If the concept class C = H has VC dimension d, and
    hypothesis h is consistent with all m (> d)
    training data, the generalization error of h is
    bounded by
  • For the inconsistent case, the bound would be

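Both bounds appear only as images in the original slides; one commonly quoted form of each (constants differ between references, so treat these as representative rather than the slide's exact expressions) is

$$ \text{error}_D(h) \;\le\; \frac{2}{m}\left(d\log_2\frac{2em}{d} + \log_2\frac{2}{\delta}\right) \qquad \text{(consistent case)}, $$

$$ \text{error}_D(h) \;\le\; \widehat{\text{error}}_T(h) + \sqrt{\frac{8}{m}\left(d\ln\frac{2em}{d} + \ln\frac{4}{\delta}\right)} \qquad \text{(inconsistent case)}. $$
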
15
VC dimension
  • More accurate estimation with VC dimension: a
    lower bound is also available!
  • If the concept class C = H has VC dimension d, then
    for any learning algorithm L there exists a
    distribution D such that with probability at least
    δ, given m random examples, the error of the
    hypothesis output by L is at least
  • Ω((d + ln(1/δ)) / m)

16
VC dimension
  • The upper bound does not apply if the VC dimension
    d is infinite
  • Using a more powerful hypothesis set can describe
    the concept more accurately, but it also yields
    higher error on some extreme distributions

17
VC dimension
Adapted from Prof. Mohri's lecture notes
18
Support Vector Machine
19
Support Vector Machine (SVM)
  • Solving binary classification problems with
    maximum-margin hyperplanes.
  • (x_1, y_1), (x_2, y_2), ..., (x_m, y_m) ∈ R^N × {−1, +1}
  • h(x) = w·x + b, with w ∈ R^N and b ∈ R
  • Classifier: sign(h(x))
  • Optimization problem:
  • min_{w,b} ‖w‖²/2
  • subject to y_i (w·x_i + b) ≥ 1

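A minimal sketch of solving this quadratic program in practice with scikit-learn's linear SVM; the two-blob dataset and the large C value (used to approximate the hard-margin constraint) are illustrative choices, not part of the original slides.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic separable data: two Gaussian blobs in R^2
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[+2, +2], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2, -2], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 50 + [-1] * 50)

# A very large C approximates the hard constraint y_i (w.x_i + b) >= 1
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]          # hyperplane parameters
print("w =", w, " b =", b)
print("number of support vectors:", clf.n_support_.sum())
print("geometric margin ≈", 1.0 / np.linalg.norm(w))
```

The support-vector count printed here is the quantity that the leave-one-out bound on slide 21 is about.
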
20
Support Vector Machine (SVM)
Adapted and modified from Prof. Mohri's lecture notes
21
Support Vector Machine (SVM)
  • In the separable case, the expected generalization
    error of h on m training data is bounded by the
    expected fraction of support vectors for a
    training set of size m + 1:
  • E[error(h_m)] ≤ E[N_SV] / (m + 1)
  • (N_SV: number of support vectors)

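As a quick illustration of this leave-one-out bound (the numbers are invented for the example): if training sets of size m + 1 = 1001 have on average 50 support vectors, then

$$ E[\text{error}(h_{1000})] \;\le\; \frac{E[N_{SV}]}{m+1} \;=\; \frac{50}{1001} \;\approx\; 5\%. $$
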
22
General Margin Bound
  • Let H = {h : x ↦ w·x, ‖w‖ ≤ Λ} and ‖x‖ ≤ R. If the
    output classifier sign(h) has margin at least
    ρ/‖w‖ on the m training data, then there is a
    constant c such that for any distribution D, with
    probability at least 1 − δ,

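The bound itself is an image in the original slides; the commonly quoted form of such margin (fat-shattering) bounds, up to constants and logarithmic factors that may differ from the slide's version, is

$$ \text{error}_D(h) \;\le\; \frac{c}{m}\left(\frac{R^2\Lambda^2}{\rho^2}\log^2 m \;+\; \log\frac{1}{\delta}\right). $$
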
23
Comparison of two bounds
  • VC bound
  • (1)
  • depends only on the VC dimension, and is
    independent of the training data
  • Margin bound
  • (2)
  • depends only on the training data, and is
    independent of the VC dimension

24
Comparison of two bounds
  • In real-world problems, the feature-space dimension
    n is usually very high, which leads to a large VC
    dimension d. For (1) to be meaningful, we usually
    have to observe at least 17d examples
  • The margin bound, however, is so loose that we
    would need to observe about 10^6 examples even
    when the margin is as large as 0.3

25
Can we find a new bound by combining these two
aspects?
26
Projection Profile
27
Projection Profile
  • Project the original R^n vectors to a much
    lower-dimensional space R^k
  • Projector: random k × n matrices
  • Distortion: some correctly classified data will be
    misclassified in the new space, and vice versa
  • With a larger margin in the original space, the
    distortion is smaller

28
Projection Profile
  • Random matrix: a k × n matrix R in which each entry
    r_ij ~ N(0, 1/k). The projection is denoted by
    x' = Rx, where x ∈ R^n and x' ∈ R^k
  • For any constant c,
  • (3)

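A minimal NumPy sketch of this random projection (the dimensions are arbitrary illustrative values; note that variance 1/k corresponds to standard deviation 1/√k):

```python
import numpy as np

n, k = 17_000, 200                     # original / projected dimensions (illustrative)
rng = np.random.default_rng(0)

# Each entry r_ij ~ N(0, 1/k), i.e. standard deviation 1/sqrt(k)
R = rng.normal(loc=0.0, scale=1.0 / np.sqrt(k), size=(k, n))

x = rng.normal(size=n)
x /= np.linalg.norm(x)                 # w.l.o.g. place x on the unit sphere
x_proj = R @ x                         # x' = Rx in R^k

# E[||Rx||^2] = ||x||^2, so norms and inner products are roughly preserved
print("||x|| =", np.linalg.norm(x), "  ||Rx|| ≈", np.linalg.norm(x_proj))
```
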
29
Projection Profile
  • Clearly, we can assume w.l.o.g. that all data lie
    on the surface of the unit sphere and that
    ‖h‖ = 1 (Why?)
  • Note that if ‖u‖ = ‖v‖ = 1, then ‖u − v‖² = 2 − 2 u·v.
    Therefore (3) can be viewed as stating that, with
    high probability, random projection preserves the
    angle between vectors which lie on the unit
    sphere. (Why?)

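The second "(Why?)" follows from a one-line expansion:

$$ \|u-v\|^{2} \;=\; \|u\|^{2} - 2\,u\cdot v + \|v\|^{2} \;=\; 2 - 2\,u\cdot v \quad \text{when } \|u\| = \|v\| = 1, $$

so preserving pairwise distances of unit vectors is equivalent to preserving their inner products, i.e. the angles between them.
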
30
Projection Profile
  • Let the classifier be sign(h^T x + b), let x_j be a
    sample point with ‖h‖ = ‖x_j‖ = 1, and let
    ν_j = h^T x_j. Write h' = Rh and x_j' = Rx_j in the
    projected space R^k. Then
  • P[sign(h^T x_j + b) ≠ sign(h'^T x_j' + b)]

31
Projection Profile
  • Define the projection error PE_k(h, R, T) as the
    fraction of data points that are classified
    differently in the original and projected spaces.
    Then, with probability at least 1 − δ (over the
    choice of R), PE_k(h, R, T) is upper bounded by

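PE_k can also be measured empirically by comparing predictions before and after projection; the sketch below does exactly that on synthetic data (h, b, and the sample are placeholders, not the paper's experiment):

```python
import numpy as np

def projection_error(h, b, X, k, rng):
    """Empirical PE_k(h, R, T): fraction of rows of X whose predicted sign
    changes under a random projection R with entries drawn from N(0, 1/k)."""
    n = X.shape[1]
    R = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    orig = np.sign(X @ h + b)                    # sign(h^T x_j + b)
    proj = np.sign((X @ R.T) @ (R @ h) + b)      # sign(h'^T x_j' + b), h' = Rh, x_j' = Rx_j
    return float(np.mean(orig != proj))

rng = np.random.default_rng(0)
n, m = 1000, 500
h = rng.normal(size=n); h /= np.linalg.norm(h)
X = rng.normal(size=(m, n))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # samples on the unit sphere
for k in (10, 50, 200):
    print(f"k = {k:4d}  PE_k ≈ {projection_error(h, 0.0, X, k, rng):.3f}")
```

As k grows, the measured projection error shrinks, which is the tradeoff slide 35 refers to.
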
32
Projection Profile
  • Since the VC dimension of k-dimensional
    hyperplanes is k + 1, substituting k + 1 for d in
    formula (1), the error contributed by the VC
    component can be bounded by

33
Projection Profile (Cont.)
  • Symmetrization
  • ge: generalization error
  • te_S: test error on the sample S
  • Pr[|ge − te_T1| > ε] < 2 Pr[|te_T1 − te_T2| > ε/2]

34
Projection Profile
  • Finally, combining the two components via
    symmetrization, with probability at least 1 − 4δ
    the generalization error can be bounded as
    follows

35
Projection Profile
  • Tradeoff between Random Projection Error and VC
    dimension Error

36
Margin Distribution
37
Margin Distribution
(d) might be a better choice than (c)
38
Margin Distribution
The contribution of data points to the
generalization error as a function of margin
39
Margin Distribution
  • Weight function
  • Objective function to minimize

40
Margin Distribution
The weight given to the data points by the MDO
algorithm as a function of margin
41
Margin Distribution
  • Comparing the two functions again
  • α should be thought of as the optimal projection
    dimension and could be optimized.

42
Margin Distribution
  • But in fact, α and β are chosen from experimental
    results.
  • Observation: in most cases, the setting
  • α = 1/ν̄², β = 1/ν̄
  • gave good results, where ν̄ = Σ_i ν_i / m is an
    estimate of the average margin for some h.

43
MDO algorithm
  • Minimize L(h, b)
  • subject to ‖h‖ = 1
  • Difficulty: the problem is not convex, so the
    optimization could get trapped in a local minimum.
  • Choosing a good initial classifier is important!
  • Solution: use SVM to obtain the initial classifier,
    then use gradient-descent methods to reach the
    optimum (a sketch of this procedure follows below).

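The slides do not reproduce the exact objective L(h, b), so the sketch below is only a schematic of the procedure they describe: initialize from an SVM, then run projected gradient descent on a smooth margin-based loss while keeping ‖h‖ = 1. The exponential weighting exp(−α·margin) is a placeholder surrogate, not the paper's formula.

```python
import numpy as np
from sklearn.svm import SVC

def mdo_like_optimize(X, y, alpha=1.0, lr=0.1, iters=500):
    """Schematic MDO-style optimization: SVM initialization, then projected
    gradient descent on a placeholder loss L(h, b) = sum_i exp(-alpha * nu_i),
    where nu_i = y_i (h^T x_i + b), subject to ||h|| = 1."""
    svm = SVC(kernel="linear", C=1e3).fit(X, y)
    h = svm.coef_[0] / np.linalg.norm(svm.coef_[0])   # initial classifier, unit norm
    b = float(svm.intercept_[0])

    for _ in range(iters):
        margins = y * (X @ h + b)                     # nu_i for every sample
        w = np.exp(-alpha * margins)                  # heavier weight on small margins
        grad_h = -(alpha * w * y) @ X / len(y)        # dL/dh (averaged)
        grad_b = -np.sum(alpha * w * y) / len(y)      # dL/db (averaged)
        h -= lr * grad_h
        b -= lr * grad_b
        h /= np.linalg.norm(h)                        # project back onto ||h|| = 1
    return h, b
```

Re-normalizing h after each step enforces the ‖h‖ = 1 constraint, and starting from the SVM solution is the slide's remedy for the non-convexity.
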
44
Experiments
45
Experiments
  • Considered 17,000-dimensional data taken from the
    problem of context-sensitive spelling correction.
  • The margin bound is not useful since the margin is
    quite small.
  • To gain confidence from the VC bounds, we would
    need over 120,000 data points.
  • The random projection term is already below 0.5
    after 2,000 samples.

46
Experiments
Histogram of margin distribution
Projection error as a function of dimension k
47
Experiments
  • Correlation between margin and test error
  • Correlation between training margin and test
    margin

48
Experiments
  • MDO algorithm - margin vs. iterations

49
Experiments
  • Training/testing error vs. iterations

50
Experiments
  • SVM vs. MDO

51
Conclusion
52
Conclusion (Theirs)
  • A new theoretical bound is given in this paper.
  • A new algorithm focusing on the margin
    distribution rather than the typical use of the
    notion of margin in machine learning.
  • The bound is still loose, and more research is
    needed to match the observed performance on real
    data.
  • Any new algorithmic implications?

53
Conclusion (Mine)
  • They did not apply the projection profile
    technique in the real experiments! :-(
  • Actually, I don't think these two papers are well
    linked. The proposed algorithm cannot be analyzed
    with their theoretical results.
  • But the idea of trying to derive good bounds from
    the margin distribution is useful.

54
Conclusion (Mine)
  • With new theoretical bounds, we could apply them
    to different existing methods (like SVM and
    Boosting).
  • If it turns out that the results are not as good
    as the original ones, we could probably fix the
    theoretical results and return to the previous
    step.

55
Quotes
  • Nothing is more practical than good theory!

56
Questions? Comments?
57
Thank you!