VC Dimension - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: VC Dimension


1
VC Dimension: Direct Robustness Control in
Statistical Learning Theory
Michel Béra, co-Founder and Chief Scientific Officer
21/10/2002
2
Agenda
Company positioning 15 Mins
SRM / SVM Theory 45 Mins
KXEN Analytic Framework (Demo) 15 Mins
3
Company Background
  • Founded in July 1998
  • Delaware corporation
  • Headquartered in San Francisco
  • Operations in U.S. and Europe
  • Strong Executive Team
  • R. Haddad (CEO), E. Marcade (CTO), M. Bera (CSO)
  • Active Scientific Committee
  • Includes Gregory Piatetsky-Shapiro (founder
    SigKDD), Lee Giles (Penn State Professor,
    formerly with NEC), Gilbert Saporta (French
    Statistical Society President), Yann Le Cun (NEC,
    Manager of V.Vapnik), Léon Bottou (NEC), Bernhard
    Schoelkopf (Max-Planck-Institut)

4
Go-To-Market Strategy
By embedding KXEN into major Applications and
partnering with leading SIs, KXEN is on the way to
becoming the de facto standard for Predictive
Analytics.
5
KXEN Value Proposition
[Diagram: traditional data mining approach, cost per model 30,000 - analyst and business user roles; steps: business question, prepare data, build model, test model, interpret results, understand, apply; wait for analyst time: 3 weeks]
"Our company builds hundreds of predictive models
in the same time we used to build one. KXEN
allows us to save millions of dollars with more
effective campaigns." - Financial Industry Customer
6
What is a good model?
[Chart: model quality vs. robustness - low quality / high robustness, low robustness, robust model]
7
Agenda
Company positioning 15 Mins
SRM / SVM Theory 45 Mins
KXEN Analytic Framework (Demo) - Mins
8
VC dimension - definition (1)
  • Let us consider a sample (x1, .. , xL) from Rn
  • There are 2^L different ways to separate the
    sample into two sub-samples
  • A set S of functions f(X,w) shatters the sample
    if all 2^L separations can be realized by
    functions f(X,w) from the family S

9
VC dimension - definition (2)
  • A function family S has VC dimension h (h is an
    integer) if
  • 1) there is at least one sample of h vectors from
    Rn that can be shattered by functions from S, and
  • 2) no sample of h+1 vectors can be shattered by
    functions from S
  • (a brute-force shattering check is sketched below)
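A minimal brute-force check of the shattering definition, as a sketch: it assumes numpy and scikit-learn are available and uses a linear SVC with a very large C as a stand-in for "does some separating hyperplane exist". Point sets and names are illustrative.

from itertools import product
import numpy as np
from sklearn.svm import SVC

def is_shattered(X):
    # X holds L points; try all 2^L labelings and ask whether a hyperplane
    # can realize each one with zero training error.
    for labels in product([-1, 1], repeat=len(X)):
        if len(set(labels)) == 1:
            continue                      # one-class labelings are trivially separable
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        if (clf.predict(X) != y).any():
            return False                  # this labeling cannot be realized
    return True

three_points = np.array([[0., 0.], [1., 0.], [0., 1.]])
four_points = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.]])
print(is_shattered(three_points))   # True: hyperplanes in R2 shatter these 3 points
print(is_shattered(four_points))    # False: the XOR labeling of 4 points fails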

10
Example VC dimension
  • VC dimension
  • Measures the complexity of a solution
    (function).
  • Is not directly related to the number of
    variables

11
Other examples
  • VC dimension for hyperplanes of Rn is n+1
  • VC dimension of the set of functions
  • f(x,w) = sign(sin(w.x)), c < x < 1, c > 0,
  • where w is a free parameter, is infinite.
  • VC dimensions are not always equal to the number
    of parameters that define a given family S of
    functions.

12
Key Example: linear models y = <w.x> + b
  • VC dimension of the family S of linear models
  • with ||w|| <= C
  • depends on C and can take any value between 0
    and n.

13
VC dimension interpretation
  • VC dimension of S is an integer that measures the
    dispersion or separating power (complexity) of
    the function family S
  • We shall now show that VC dimension (via a major
    theorem from Vapnik) gives a powerful indication
    of model robustness.

14
Learning Theory Problem (1)
  • A model computes a function f(X, w)
  • Problem: minimize over w the Risk Expectation
    R(w) = ∫ Q(z,w) dP(z)
  • w: a parameter that specifies the chosen model
  • z = (X, y): possible values for attributes
    (variables) and target
  • Q measures (quantifies) the model error cost
  • P(z) is the underlying probability law (unknown)
    for the data z

15
Learning Theory Problem (2)
  • We get L data points from the learning sample
    (z1, .. , zL), assumed to be sampled i.i.d. from
    the law P(z).
  • To minimize R(w), we start by minimizing the
    Empirical Risk over this sample:
    E(w) = (1/L) Σ_{i=1,..,L} Q(zi, w)
  • We shall use such an approach for
  • classification (e.g. Q can be a cost function
    based on the cost of misclassified points)
  • regression (e.g. Q can be a cost of least-squares
    type); a tiny illustration follows below
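A small numeric illustration of the empirical risk; the loss functions below are the usual illustrative choices, not something prescribed by the deck.

import numpy as np

def empirical_risk(Q, w, Z):
    # E(w) = (1/L) * sum of Q(z_i, w) over the learning sample Z = (z_1, .., z_L)
    return np.mean([Q(z, w) for z in Z])

# z = (x, y); w = (weights, bias) for a linear model
zero_one_loss = lambda z, w: float(np.sign(z[0] @ w[0] + w[1]) != z[1])  # classification
squared_loss = lambda z, w: (z[0] @ w[0] + w[1] - z[1]) ** 2             # regression

Z = [(np.array([1.0, 2.0]), 1.0), (np.array([-1.0, 0.5]), -1.0)]
w = (np.array([0.5, -0.2]), 0.1)
print(empirical_risk(zero_one_loss, w, Z), empirical_risk(squared_loss, w, Z))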

16
Learning Theory Problem (3)
  • Central problem for Statistical Learning Theory
  • What is the relation between Risk Expectation
    R(w) and Empirical Risk E(w)?
  • How to define and measure a generalization
    capacity (robustness) for a model?

17
Four Pillars for SLT (1 and 2)
  • Consistency (guarantees generalization)
  • Under what conditions will a model be consistent?
  • Model convergence speed (a measure for
    generalization)
  • How does generalization capacity improve when
    sample size L grows?

18
Four Pillars for SLT (3 and 4)
  • Generalization capacity control
  • How can we control model generalization in an
    efficient way, starting from the only information
    we have: our sample data?
  • A strategy for good learning algorithms
  • Is there a strategy that guarantees, measures and
    controls the generalization capacity of our
    learning model?

19
Consistency definition
  • A learning process (model) is said to be
    consistent if the model error, measured on new
    data sampled from the same underlying probability
    law as the original sample, converges towards the
    model error measured on the original sample, as
    the original sample size increases.

20
Consistent training?
[Two plots: test error and training error vs. number of training examples]
21
Vapnik main theorem
  • Q: Under which conditions will a learning
    process (model) be consistent?
  • A: A model will be consistent if and only if the
    function f that defines the model comes from a
    family of functions S with finite VC dimension h.
  • A finite VC dimension h not only guarantees
    generalization capacity (consistency); picking f
    in a family S with finite VC dimension h is the
    only way to build a model that generalizes.

22
Model convergence speed (generalization capacity)
  • Q: What is the nature of the difference in model
    error between learning data (sample) and test
    data, for a sample of finite size L?
  • A: This difference is no greater than a bound
    that depends only on the ratio between the VC
    dimension h of the model function family S and
    the sample size L, i.e. h/L.
  • This statement is a new theorem that belongs to
    the Kolmogorov-Smirnov type of results, i.e.
    theorems that do not depend on the data's
    underlying probability law.

23
Model convergence speed
[Plot: convergence behaviour as a function of sample size L]
24
Empirical Risk Minimization
  • With probability 1-q, the following inequality
    holds (a standard form of the bound is reproduced
    below)
  • where w0 is the value of the parameter w that
    minimizes the Empirical Risk
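The inequality on this slide was an image in the original deck and did not survive extraction. A standard form of the bound for classification losses (the deck may have shown a slightly different variant) is:

R(w0) <= E(w0) + sqrt( ( h (ln(2L/h) + 1) - ln(q/4) ) / L )

with probability 1-q, where h is the VC dimension of the family S, L the sample size, and E(w0) the empirical risk of the minimizer w0.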

25
SRM methodology: how to control model
generalization capacity
  • Risk Expectation <= Empirical Risk + Confidence
    Interval
  • Minimizing the Empirical Risk alone will not
    always give good generalization capacity; one
    will want to minimize the sum of the Empirical
    Risk and the Confidence Interval
  • What is important is not the numerical value of
    the Vapnik bound, which is most often too large
    to be of any practical use; it is the fact that
    this bound is a non-decreasing function of the
    richness of the model function family

26
SRM strategy (1)
  • With probability 1-q, the bound above holds.
  • When L/h is small (h too large), the second term
    of the bound becomes large
  • The basic idea of the SRM strategy is to minimize
    simultaneously both terms on the right-hand side
    of the above upper bound on R(w)
  • To do this, one has to make h a controlled
    parameter

27
SRM strategy (2)
  • Let us consider a sequence S1 ⊂ S2 ⊂ .. ⊂ Sn of
    model function families, with respective growing
    VC dimensions
  • h1 < h2 < .. < hn
  • For each family Si of our sequence, the
    corresponding inequality (with h = hi)
  • remains valid

28
SRM strategy (3)
SRM: find i such that the expected risk R(w)
becomes minimal, for a specific h = hi, relating
to a specific family Si of our sequence; then
build the model using f from Si (a schematic
selection loop is sketched below).
[Figure: risk vs. model complexity]
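A schematic SRM selection loop on synthetic data, as a sketch: a nested sequence of polynomial classifiers of growing degree on R, with h = d+1 and the classical confidence term. All data, constants and the choice of families are illustrative, not KXEN's algorithm.

import numpy as np

rng = np.random.default_rng(0)
L, q = 200, 0.05                                   # sample size, confidence parameter
x = rng.uniform(-1, 1, size=L)
y = np.sign(np.sin(3 * x) + 0.3 * rng.normal(size=L))   # noisy non-linear labels

def empirical_risk(degree):
    # least-squares fit of a degree-d polynomial, used as a sign classifier
    coeffs = np.polyfit(x, y, degree)
    return float(np.mean(np.sign(np.polyval(coeffs, x)) != y))

best = None
for d in range(1, 10):                             # S1, S2, ... : growing complexity
    h = d + 1                                      # VC dim of sign(degree-d polynomial) on R
    bound = empirical_risk(d) + np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(q / 4)) / L)
    if best is None or bound < best[0]:
        best = (bound, d)
print("SRM choice: degree", best[1], "guaranteed risk <=", round(best[0], 3))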
29
Putting SRM into action: linear models case (1)
  • There are many SRM-based strategies to build
    models
  • In the case of linear models
  • y = <w.x> + b,
  • one wants to make ||w|| a controlled
    parameter; let us call SC the linear model
    function family satisfying the constraint
  • ||w|| <= C
  • Vapnik's major theorem:
  • When C decreases, h(SC) decreases
  • (for sample vectors with ||x|| <= R)

30
Putting SRM into action: linear models case (2)
  • To control ||w||, one can envision two routes to
    a model (see the sketch after this list)
  • Regularization / Ridge Regression, i.e. minimize
    over w and b
  • RG(w,b) = Σ_{i=1,..,L} (yi - <w.xi> - b)²
    + λ ||w||²
  • Support Vector Machines (SVM), i.e. solve directly
    an optimization problem (hereunder: classification
    SVM, separable data)
  • Minimize ||w||², with yi = +/-1
  • and yi(<w.xi> + b) >= 1 for all i=1,..,L
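A minimal sketch of the two routes on synthetic data, assuming scikit-learn is available; Ridge stands in for the regularization route (penalizing ||w||²) and SVC with a linear kernel for the SVM route. Data and constants are illustrative.

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ np.array([1.0, -2.0, 0.5, 0.0, 0.0]) + 0.1 * rng.normal(size=200))

# Route 1: regularization / ridge regression, min sum (yi - <w.xi> - b)^2 + lambda ||w||^2
ridge = Ridge(alpha=1.0).fit(X, y)               # alpha plays the role of lambda
print("ridge ||w|| =", round(np.linalg.norm(ridge.coef_), 3))

# Route 2: SVM, min ||w||^2 subject to the margin constraints (soft margin in practice)
svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("svm   ||w|| =", round(np.linalg.norm(svm.coef_[0]), 3))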

31
Linear classifiers
  • Rules of the form: weight vector w, threshold b
  • f(x) = sign( Σ_{i=1,..,n} wi xi + b )
  • f(x) = +1 if Σ_{i=1,..,n} wi xi + b > 0
  •        -1 otherwise
  • If the L training examples (x1,y1),..,(xL,yL),
    where xi is a vector from Rn and yi = +1 or -1,
    are linearly separable, there is an infinity of
    hyperplanes f separating the +1 from the -1
    examples.
  • However, there is a unique f that defines the
    maximum-width corridor between the yi = +1 and
    yi = -1 points of the sample. Let 2d be the
    width of this optimal corridor.

32
VC dimension of "thick" hyperplanes
  • Lemma: the VC dimension of hyperplanes defined
    by f(w,b) with margin d ("thick" hyperplanes),
    on sample vectors xi verifying ||xi|| <= R,
    i = 1,..,L, is bounded by
  • VC dim <= R²/d² + 1
  • The VC dimension of such a linear classifier
    does not necessarily depend on the number of
    attributes or the number of parameters!
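A quick numeric illustration of the lemma (the numbers are illustrative, not from the deck): with R = 1 and margin d = 0.1, the bound gives VC dim <= R²/d² + 1 = 1/0.01 + 1 = 101, whatever the ambient dimension n, even if n is in the millions.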

33
Maximizing margin d: relation to SRM
  • The hypothesis space with minimal VC dimension
    according to SRM will be the hyperplane with
    maximum margin d.
  • It will be entirely defined by the parts of the
    sample with minimal distance to it: the support
    vectors
  • The number of support vectors is neither L, nor
    n, nor h, the corresponding VC dimension
  • If the number of support vectors is large
    compared to L, the model may be beautiful in
    theory, but extremely costly to apply!

34
Computing Optimal Hyperplane
  • Training examples
  • (x1,y1), ..,(xL,yL), xi from Rn, yi = +1 or -1,
    i=1,..,L
  • Requirement 1: zero training error
  • (yi = -1) => <w.xi> + b < 0
  • (yi = +1) => <w.xi> + b > 0
  • Hence in all cases: yi(<w.xi> + b) > 0
  • Requirement 2: maximum margin
  • Maximize d, with d = min_i |<w.xi> + b| / ||w||,
    i=1,..,L
  • Requirements 1 + 2:
  • Maximize d with, for every i=1,..,L,
    yi(<w.xi> + b) / ||w|| >= d

35
Notions of Duality
  • A large number of linear model optimization
    problems (this includes ridge regression and
    SVMs) can be written in a dual space of dimension
    L, the space of the coefficients a
  • With y = f(x) = <w.x> + b, let us write w = X'a
    (X transposed times a), where a is a vector from
    RL and X is the L x n matrix (xij), i=1,..,L,
    j=1,..,n
  • i.e. wj = Σ_{i=1,..,L} ai xij, j=1,..,n
  • It can be shown that for such models, a solution
    y can be written using only the scalar products
    <xi.xj> and <xi.x>, and expressed as a linear
    combination of the yi (a quick numeric check of
    the duality identity follows below):
  • f(x) = Σ_{i=1,..,L} ai yi <x.xi> + b
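A quick numeric check of the duality identity, as a sketch: if w = X'a (X transposed times a), then <w.x> equals Σ ai <xi.x> for any x. Names and data are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))        # L = 6 samples in R^3 (rows are the xi)
a = rng.normal(size=6)             # dual coefficients
x = rng.normal(size=3)             # an arbitrary new point

w = X.T @ a                        # primal weights built from the dual coefficients
print(np.isclose(w @ x, a @ (X @ x)))   # both sides of the identity agree: True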

36
SVM dual optimization problem
  • By setting d = 1/||w||, the problem becomes
  • Minimize J(w,b) = (1/2) ||w||², with
    yi(<w.xi> + b) >= 1
  • The solution can be written as a linear
    combination of the training data
  • w = Σ_{i=1,..,L} ai yi xi, with ai >= 0
  • b = -(1/2)(<w.xpos> + <w.xneg>)
  • Dual optimization problem
  • Maximize L(a) = Σ_{i=1,..,L} ai
    - (1/2) ΣΣ_{i,j=1,..,L} ai aj yi yj <xi.xj>
  • with Σ_{i=1,..,L} ai yi = 0 and ai >= 0,
    i=1,..,L
  • This is a positive semi-definite quadratic
    program (a toy numerical solution is sketched
    below)
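A toy numerical solution of this dual on a small separable sample, as a sketch; it uses scipy's general-purpose SLSQP solver (a dedicated QP solver would normally be used), and all data and tolerances are illustrative.

import numpy as np
from scipy.optimize import minimize

X = np.array([[2.0, 2.0], [3.0, 3.0], [2.5, 3.5],    # class +1
              [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # class -1
y = np.array([1., 1., 1., -1., -1., -1.])
L = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)      # G_ij = yi yj <xi.xj>

def neg_dual(a):                               # minimize the negative of L(a)
    return -(a.sum() - 0.5 * a @ G @ a)

res = minimize(neg_dual, np.zeros(L),
               bounds=[(0.0, None)] * L,                              # ai >= 0
               constraints=[{"type": "eq", "fun": lambda a: a @ y}])  # sum ai yi = 0

a = res.x
w = (a * y) @ X                                # w = sum ai yi xi
sv = a > 1e-6                                  # support vectors have ai > 0
b = float(np.mean(y[sv] - X[sv] @ w))          # b from yi(<w.xi> + b) = 1 on the SVs
print("a =", np.round(a, 3), " w =", np.round(w, 3), " b =", round(b, 3))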

37
SVM primal and dual equivalences
  • Theorem: the primal OP and the dual OP have the
    same solution.
  • Given the solution a of the dual OP,
  • w = Σ_{i=1,..,L} ai yi xi
  • b = -(1/2)(<w.xpos> + <w.xneg>)
  • is the solution of the primal OP.
  • Hence the learning result (SVM classifier) can be
    represented in two alternative ways:
  • weight vector and threshold (w, b)
  • vector of "influences" of each sample point:
    a1,..,aL

38
Properties of the SVM Dual OP
  • Dual optimization problem
  • Maximize
  • L(a) = Σ_{i=1,..,L} ai
    - (1/2) ΣΣ_{i,j=1,..,L} ai aj yi yj <xi.xj>
  • with
  • Σ_{i=1,..,L} ai yi = 0 and ai >= 0, i=1,..,L
  • There is a single solution (i.e. (w,b) is unique)
  • One factor ai for each training example
  • It describes the influence of training example i
    on the result
  • ai > 0 <=> training example i is a support vector
  • ai = 0 otherwise
  • The result depends exclusively on inner products
    between samples

39
SVM: the ugly case of non-separable training
samples
  • For some training samples there is no separating
    hyperplane
  • Complete separation is suboptimal for many
    training samples (e.g. a single "-1" close to
    the cloud of "+1", all the other "-1" far
    away)
  • There is hence a need to trade off margin size
    (robustness) against training error

40
Soft Margin SubOptimal Example
41
SVM Soft-Margin Separation
  • Same idea as regularization: maximize margin and
    minimize training error simultaneously.
  • Hard Margin
  • Minimize J(w,b) = (1/2) ||w||²
  • with constraints yi(<w.xi> + b) >= 1, i=1,..,L
  • Soft Margin
  • Minimize J(w,b,ξ) = (1/2) ||w||²
    + C Σ_{i=1,..,L} ξi
  • with constraints yi(<w.xi> + b) >= 1 - ξi and
    ξi >= 0, i=1,..,L
  • Σ_{i=1,..,L} ξi is an upper bound on the number
    of training errors
  • C is a parameter that controls the trade-off
    between margin and error
  • Dual optimization problem for soft margin
  • Maximize L(a) = Σ_{i=1,..,L} ai
    - (1/2) ΣΣ_{i,j=1,..,L} ai aj yi yj <xi.xj>
  • with constraints Σ_{i=1,..,L} ai yi = 0 and
    0 <= ai <= C, i=1,..,L (see the sketch below)
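A small sketch of the soft-margin trade-off on overlapping data, assuming scikit-learn. SVC stores ai*yi for the support vectors in dual_coef_, so the absolute values illustrate the box constraint 0 <= ai <= C. Data and C values are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=-1.0, size=(50, 2)),
               rng.normal(loc=+1.0, size=(50, 2))])   # two overlapping clouds
y = np.array([-1] * 50 + [+1] * 50)

for C in (0.1, 1.0, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    a = np.abs(clf.dual_coef_[0])                      # the ai of the support vectors
    print(f"C={C:>5}: {len(a)} support vectors, max ai = {a.max():.3f} (never exceeds C)")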

42
Properties of the Soft-Margin Dual OP
  • Dual OP
  • Maximize L(a) = Σ_{i=1,..,L} ai
    - (1/2) ΣΣ_{i,j=1,..,L} ai aj yi yj <xi.xj>
  • with constraints Σ_{i=1,..,L} ai yi = 0 and
    0 <= ai <= C, i=1,..,L
  • Single solution (i.e. (w,b) is unique)
  • One factor ai for each training example
  • The influence of a single training example is
    limited by C
  • 0 < ai < C <=> support vector with ξi = 0
  • ai = C <=> support vector with ξi > 0
  • ai = 0 otherwise
  • Results are based exclusively on inner products
    between training examples

43
Soft Margin - Support vectors
[Figure: soft-margin support vectors with slacks ξi and ξj]
44
Strategies towards non-linear problems (1/4)
  • Notion of "feature space" (Vapnik's extended
    space for attributes): a feature space is a
    manifold in which one tries to embed, through a
    homeomorphism, the original attributes
    x = (x1,..,xn) -> Φ(x) = (φ1(x),..,φN(x))
  • By doing so one will try, in the new manifold, to
    build models with a linear approach on the new
    attributes φp:
  • y = f(x) = Σ_{p=1,..,N} wp φp(x) + b

45
Strategies towards non-linear problems (2/4)
  • The dual representation then allows the
    generalized linear model obtained in this way to
    be expressed as
  • y = Σ_{i=1,..,L} ai yi <Φ(xi), Φ(x)> + b
  • The idea of Reproducing Kernels in Hilbert
    Spaces (RKHS) is to express non-linearity in an
    indirect way through the use of a function K,
    satisfying a certain number of criteria (Mercer),
    that defines the extended feature space geometry
    through its inner product
  • K(xi,xj) = <Φ(xi), Φ(xj)>
  • Our model becomes y = Σ_{i=1,..,L} ai yi K(xi,x)
    + b

46
Strategies towards non-linear problems (3/4)
  • There are many examples of Mercer kernels K, such
    as
  • Linear: K(xi,xj) = <xi.xj>
  • Polynomial: K(xi,xj) = (<xi.xj> + 1)^d
  • Radial basis functions: K(xi,xj) =
    exp(-γ ||xi - xj||²)
  • Sigmoid kernels: K(xi,xj) = tanh(γ ||xi - xj|| + c)
  • The dual approach and kernels allow one to build
    an important class of robust models for
    non-linear problems, where the number of
    attributes can be huge (thousands, millions, even
    infinite..). In the dual space one works with a
    space of finite dimension L. Such models belong
    to the family of so-called generalized linear
    models. (The kernels listed above are written out
    in the sketch below.)
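The first three kernels in the list, written out with numpy as a sketch; gamma and the degree d are arbitrary example values, and the sigmoid variant is omitted here because a tanh kernel is not positive semi-definite for all parameter choices.

import numpy as np

def linear_kernel(Xi, Xj):
    return Xi @ Xj.T                                    # <xi.xj>

def poly_kernel(Xi, Xj, d=3):
    return (Xi @ Xj.T + 1.0) ** d                       # (<xi.xj> + 1)^d

def rbf_kernel(Xi, Xj, gamma=0.5):
    sq = ((Xi[:, None, :] - Xj[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)                          # exp(-gamma ||xi - xj||^2)

X = np.random.default_rng(2).normal(size=(5, 3))
for K in (linear_kernel(X, X), poly_kernel(X, X), rbf_kernel(X, X)):
    # a Mercer kernel matrix is symmetric positive semi-definite
    print(K.shape, np.allclose(K, K.T), np.linalg.eigvalsh(K).min() >= -1e-9)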

47
Strategies towards non-linear problems (4/4)
  • Induced non-linearity: Mercer's theorem allows a
    kernel K to be expressed in the following form
  • K(x1,x2) = Σ_{i=1,2,..} λi φi(x1) φi(x2)
  • K here defines an inner product in the extended
    feature space
  • A generalized linear model then has a
    representation in the original attribute space:
  • y = Σ_{i=1,2,..} λi ψi φi(x) + b,
  • where ψi = Σ_{m=1,..,L} am ym φi(xm)

48
Soft Margin SVM with Kernels
  • Training optimization problem in the dual space
  • Maximize
  • L(a) = Σ_{i=1,..,L} ai
    - (1/2) ΣΣ_{i,j=1,..,L} ai aj yi yj K(xi,xj)
  • with constraints
  • Σ_{i=1,..,L} ai yi = 0
  • and 0 <= ai <= C, i=1,..,L
  • Classification model for a new example x (checked
    numerically in the sketch below):
  • f(x) = sign( Σ_{xi ∈ SV set} ai yi K(xi,x) + b )
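A quick check of this classification formula against a fitted soft-margin RBF SVM, assuming scikit-learn (dual_coef_ holds ai*yi for the support vectors); data and parameters are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 2))
y = np.sign(X[:, 0] * X[:, 1] + 0.1 * rng.normal(size=100))   # a non-linear target

gamma = 0.7
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

x_new = rng.normal(size=(1, 2))
K = np.exp(-gamma * ((clf.support_vectors_ - x_new) ** 2).sum(axis=1))  # K(xi, x_new)
decision = clf.dual_coef_[0] @ K + clf.intercept_[0]    # sum ai yi K(xi, x) + b
print(int(np.sign(decision)), int(clf.predict(x_new)[0]))   # the two signs agree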

49
When do SVMs Work?
  • If
  • the training error on the sample is on average
    low
  • and the margin d / R on the sample is on average
    large
  • Then
  • SVM learns a classification rule with low error
    rate with high probability (worst case)
  • SVM learns classification rules that have low
    error rate on average
  • SVM learns a classification rule for which the
    (leave-one-out) estimated error rate is low

50
Conclusion (1/2)
  • Vapnik's theory builds a new vision of the notion
    of robustness, with a set of theorems of the
    "Kolmogorov" type, which means "whatever the
    underlying probability laws of the sample data"
  • Building a model becomes, under this vision,
    negotiating a trade-off (Friedman) between an
    excellent fit and proper robustness.
    Cross-validation (a tool also used in SVMs to
    fine-tune the constant C of soft-margin SVM
    models) replaces here the tests used in the
    Fisher approach.

51
Conclusion (2/2)
  • In the first phase of the work, the statistician
    is freed by SRM (and K2C!) from tedious and
    time-consuming work: fine-tuning and testing
    probabilistic laws for the data.
  • Linear models can be controlled efficiently for
    robustness. Two roads to a model are
    Regularization (e.g. Ridge Regression, K2R) and
    Support Vector Machines (SVM and KSVM)
  • Reproducing Kernel Hilbert Space (RKHS) theory,
    together with Vapnik's vision of linear models,
    opens a major way to building efficient
    non-linear models: generalized linear models.

52
Agenda
Company positioning 15 Mins
SRM / SVM Theory 45 Mins
KXEN Analytic Framework (Demo) - Mins
53
The Business Value of Analytics

[Chart: business value vs. usability - query and reporting, OLAP, traditional data mining]
54
KXEN Analytic Framework 2.2
[Architecture diagram components: Data Access, C API, Consistent Coder (K2C)]
55
DEMO - American Census
Business Case: identify / target the persons
in my DB who are earning more than 50K
  • Available Information
  • American Census
  • International benchmark
  • Dataset
  • 15 Variables
  • 48,000 Records
  • Mix of text and numerical data

56
Enterprise and Production Modeling
  • Enterprise
  • Planning
  • KPIs
  • ROI
  • Setting Policy

57
Value Added for Professionals
Each group has different goals and
constraints. KXEN breaks down the walls between
the departments.
58
THANKS FOR YOUR TIME