What is a good model?
[Figure: fitted models labelled "Low quality / High robustness", "Low robustness" and "Robust model"]

VC dimension - definition (1)
  • Let us consider a sample $(x_1, \dots, x_L)$ from $\mathbb{R}^n$
  • There are $2^L$ different ways to separate the
    sample into two sub-samples
  • A set S of functions f(X,w) shatters the sample
    if all $2^L$ separations can be defined by different
    f(X,w) from family S

VC dimension - definition (2)
  • A function family S has VC dimension h (h is an
    integer) if
  • 1) There is at least one sample of h vectors from
    $\mathbb{R}^n$ that can be shattered by functions from S
  • 2) No sample of h+1 vectors can be shattered by
    functions from S
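
The definition can be checked by brute force on a very small family. The sketch below is an illustration only: the family of one-dimensional thresholds f(x; w, s) = s·sign(x − w) and the function names are assumptions, not taken from the slides. It enumerates every labelling the family can produce on a sample and compares the count with $2^L$.

```python
def threshold_labelings(points):
    """All labelings of distinct `points` achievable by f(x) = s * sign(x - w), s in {+1, -1}."""
    xs = sorted(points)
    # Candidate thresholds: below all points, between neighbours, above all points.
    cuts = [xs[0] - 1.0] + [(a + b) / 2.0 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]
    labelings = set()
    for w in cuts:
        for s in (+1, -1):
            labelings.add(tuple(s if x > w else -s for x in points))
    return labelings


def is_shattered(points):
    """True if every one of the 2^L separations of the sample is achievable."""
    return len(threshold_labelings(points)) == 2 ** len(points)


two_points = is_shattered([0.0, 1.0])          # a sample of 2 points
three_points = is_shattered([0.0, 1.0, 2.0])   # a sample of 3 points
```

Under these assumptions the 2-point sample is shattered while the 3-point sample is not; since only the ordering of the points matters for thresholds, this is consistent with a VC dimension of 2 for this toy family.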

Example: VC dimension
  • VC dimension
  • Measures the complexity of a solution
  • Is not directly related to the number of

Other examples
  • The VC dimension of hyperplanes of $\mathbb{R}^n$ is n+1
  • The VC dimension of the set of functions
  • $f(x,w) = \mathrm{sign}(\sin(w \cdot x))$, $c < x < 1$,
  • where w is a free parameter, is infinite.
  • VC dimensions are not always equal to the number
    of parameters that define a given family S of
    functions
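
The infinite VC dimension of the one-parameter sine family can be illustrated numerically. The sketch below relies on a classical construction that is not given in the slides and is stated here as an assumption: take the points $x_i = 10^{-i}$ (all in the interval (0, 1)) and, for any labels $y_i$, choose $w = \pi (1 + \sum_i \frac{1 - y_i}{2} 10^i)$. Function names are hypothetical.

```python
import math
from itertools import product


def sine_classifier(w, x):
    """f(x, w) = sign(sin(w * x)), with values in {+1, -1}."""
    return 1 if math.sin(w * x) > 0 else -1


def shattering_w(labels):
    """Assumed construction: w = pi * (1 + sum_i (1 - y_i)/2 * 10**i), i = 1..L."""
    return math.pi * (1 + sum((1 - y) // 2 * 10 ** i
                              for i, y in enumerate(labels, start=1)))


def shatters(n_points):
    """Check every labelling of the points x_i = 10**(-i), i = 1..n_points."""
    points = [10.0 ** (-i) for i in range(1, n_points + 1)]
    for labels in product((+1, -1), repeat=n_points):
        w = shattering_w(labels)
        if tuple(sine_classifier(w, x) for x in points) != labels:
            return False
    return True


all_shattered = all(shatters(k) for k in range(1, 6))
```

The check is limited to small samples because floating-point precision eventually breaks down; the argument itself works for any number of points, which is why a single parameter w is enough to reach an infinite VC dimension.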

Key example: linear models $y = \langle w, x \rangle + b$
  • The VC dimension of the family S of linear models
  • $y = \langle w, x \rangle + b$ with $\|w\| \le C$
  • depends on C and can take any value between 0
    and n.

VC dimension interpretation
  • The VC dimension of S is an integer that measures the
    dispersion or separating power (complexity) of the
    function family S
  • We shall now show that the VC dimension, through a
    major theorem of Vapnik, gives a powerful indication
    of model robustness.

Learning Theory Problem (1)
  • A model computes a function $y = f(X, w)$
  • Problem: minimize over w the Risk Expectation
    $R(w) = \int Q(z, w)\, dP(z)$
  • w: a parameter that specifies the chosen model
  • z = (X, y) are possible values for the attributes
  • Q measures (quantifies) the model error cost
  • P(z) is the underlying probability law (unknown)
    for data z

Learning Theory Problem (2)
  • We get L data from the learning sample $(z_1, \dots, z_L)$,
    and we suppose them i.i.d. sampled from the law P(z).
  • To minimize R(w), we start by minimizing the
    Empirical Risk over this sample,
    $E(w) = \frac{1}{L} \sum_{i=1}^{L} Q(z_i, w)$
  • We shall use such an approach for
  • classification (e.g. Q can be a cost function
    based on the cost of misclassified points)
  • regression (e.g. Q can be a least-squares cost)
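
As a minimal sketch of the Empirical Risk, the snippet below averages a cost Q over a learning sample, once with a misclassification cost and once with a least-squares cost. The two models, the data and all function names are hypothetical and chosen only for illustration.

```python
def empirical_risk(cost, model, sample):
    """E(w) = (1/L) * sum of Q(z_i, w) over the learning sample z_i = (x_i, y_i)."""
    return sum(cost(model(x), y) for x, y in sample) / len(sample)


def zero_one_cost(prediction, target):
    """Classification cost: 1 for a misclassified point, 0 otherwise."""
    return 0.0 if prediction == target else 1.0


def squared_cost(prediction, target):
    """Regression cost of least-squares type."""
    return (prediction - target) ** 2


# Hypothetical models with fixed parameters, for illustration only.
def classifier(x):
    return 1 if x > 0.5 else -1


def regressor(x):
    return 2.0 * x


classification_sample = [(0.1, -1), (0.4, -1), (0.6, 1), (0.7, -1)]
regression_sample = [(0.0, 0.0), (1.0, 2.5), (2.0, 3.5)]

risk_classification = empirical_risk(zero_one_cost, classifier, classification_sample)
risk_regression = empirical_risk(squared_cost, regressor, regression_sample)
```

Only the cost function changes between the two cases; the averaging over the sample is the same.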

Learning Theory Problem (3)
  • Central problem for Statistical Learning Theory
  • What is the relation between the Risk Expectation
    R(w) and the Empirical Risk E(w)?
  • How to define and measure a generalization
    capacity (robustness) for a model?

Four Pillars for SLT (1 and 2)
  • Consistency (guarantees generalization)
  • Under what conditions will a model be consistent?
  • Model convergence speed (a measure of
    generalization capacity)
  • How does generalization capacity improve when
    sample size L grows?

Four Pillars for SLT (3 and 4)
  • Generalization capacity control
  • How to control model generalization efficiently,
    starting from the only information we have:
    our sample data?
  • A strategy for good learning algorithms
  • Is there a strategy that guarantees, measures and
    controls the generalization capacity of our
    learning model?

Consistency definition
  • A learning process (model) is said to be
    consistent if the model error measured on new data,
    sampled from the same underlying probability law
    as the original sample, converges towards the model
    error measured on the original sample as the
    original sample size increases.

Consistent training?
[Figure: two plots of test error and training error against the number of training examples]

Vapnik's main theorem
  • Q: Under which conditions will a learning
    process (model) be consistent?
  • A: A model will be consistent if and only if the
    function f that defines the model comes from a
    family of functions S with finite VC dimension h
  • A finite VC dimension h not only guarantees a
    generalization capacity (consistency): picking f
    in a family S with finite VC dimension h is the
    only way to build a model that generalizes.

Model convergence speed (generalization capacity)
  • Q: What is the nature of the difference in model
    error between learning data (sample) and test data,
    for a sample of finite size L?
  • A: This difference is no greater than a limit
    that only depends on the ratio between the VC
    dimension h of the model function family S and the
    sample size L, i.e. h/L
  • This statement is a new theorem of the
    Kolmogorov-Smirnov type, i.e. a theorem
    that does not depend on the data's underlying
    probability law.

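Model convergence speed
[Figure: axis label "Sample size L"]

Empirical Risk Minimization
  • With probability 1-q, the following inequality
    holds: $R(w_0) \le E(w_0) + \text{Confidence Interval}$,
    the Confidence Interval depending on h/L and q,
  • where $w_0$ is the parameter w value that minimizes
    the Empirical Risk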
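
The explicit formula of the slide was not preserved. One commonly quoted form of Vapnik's bound for classification with a 0/1 cost is $R(w_0) \le E(w_0) + \sqrt{\frac{h(\ln(2L/h) + 1) - \ln(q/4)}{L}}$. The sketch below assumes that form (it may differ from the one on the original slide); function names are hypothetical.

```python
import math


def vc_confidence(h, L, q):
    """Assumed confidence term: sqrt((h * (ln(2L/h) + 1) - ln(q/4)) / L)."""
    if not (0 < h <= L and 0 < q < 1):
        raise ValueError("expects 0 < h <= L and 0 < q < 1")
    return math.sqrt((h * (math.log(2.0 * L / h) + 1.0) - math.log(q / 4.0)) / L)


def risk_bound(empirical_risk, h, L, q):
    """Upper bound on the Risk Expectation, valid with probability 1 - q."""
    return empirical_risk + vc_confidence(h, L, q)


small_sample = vc_confidence(h=10, L=1000, q=0.05)
large_sample = vc_confidence(h=10, L=10000, q=0.05)
rich_family = vc_confidence(h=100, L=1000, q=0.05)
```

With these numbers the term shrinks when L grows at fixed h and grows when h increases at fixed L, which is the qualitative behaviour the bound is used for.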

SRM methodology: how to control model
generalization capacity
  • Risk Expectation $\le$ Empirical Risk + Confidence
    Interval
  • Minimizing the Empirical Risk alone will not always
    give a good generalization capacity: one will
    want to minimize the sum of Empirical Risk and
    Confidence Interval
  • What is important is not the numerical value of the
    Vapnik limit, most often too large to be of any
    practical use, but the fact that this limit is
    a non-decreasing function of the richness of the
    model function family

SRM strategy (1)
  • With probability 1-q,
    $R(w) \le E(w) + \text{Confidence Interval}(h/L, q)$
  • When L/h is small (h too large), the second term of
    the inequality becomes large
  • The basic idea of the SRM strategy is to minimize
    simultaneously both terms on the right-hand side
    of the above inequality bounding R(w)
  • To do this, one has to make h a controlled
    parameter

SRM strategy (2)
  • Let us consider a sequence $S_1 \subset S_2 \subset \dots \subset S_n$ of
    model function families, with respective growing
    VC dimensions
  • $h_1 < h_2 < \dots < h_n$
  • For each family $S_i$ of our sequence, the inequality
    $R(w) \le E(w) + \text{Confidence Interval}(h_i/L, q)$
  • is valid

SRM strategy (3)
  • SRM: find i such that the expected risk R(w) becomes
    minimum, for a specific $h^* = h_i$, relating to a
    specific family $S_i$ of our sequence; build the model
    using f from $S_i$
[Figure: axis label "Model Complexity"]
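
The selection step of SRM can be sketched as follows: for each nested family, add a confidence term to the empirical risk and keep the index with the smallest sum. The confidence term used here is an assumed, commonly quoted form, and the VC dimensions and empirical risks are invented numbers for illustration only.

```python
import math


def confidence(h, L, q=0.05):
    # Assumed form of the confidence term; it is non-decreasing in h for h < 2L.
    return math.sqrt((h * (math.log(2.0 * L / h) + 1.0) - math.log(q / 4.0)) / L)


def srm_select(families, L, q=0.05):
    """families: list of (h_i, empirical_risk_i) for nested families S_1, S_2, ...

    Returns the index minimizing empirical risk + confidence term, and all bounds.
    """
    bounds = [emp + confidence(h, L, q) for h, emp in families]
    best = min(range(len(bounds)), key=bounds.__getitem__)
    return best, bounds


# Hypothetical numbers: richer families fit the sample better (lower empirical risk).
families = [(2, 0.30), (5, 0.12), (20, 0.08), (100, 0.02)]
best_index, bounds = srm_select(families, L=2000)
```

In this example the richest family has the lowest empirical risk, yet the bound is minimized by the second family: the confidence term penalizes the extra richness.
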
Putting SRM into action: linear models case (1)
  • There are many SRM-based strategies to build
    models
  • In the case of linear models
  • $y = \langle w, x \rangle + b$,
  • one wants to make $\|w\|$ a controlled
    parameter: let us call $S_C$ the linear model
    function family satisfying the constraint
  • $\|w\| \le C$
  • Vapnik's major theorem:
  • When C decreases, $h(S_C)$ decreases
  • (for data satisfying $\|x\| \le R$)

Putting SRM into action: linear models case (2)
  • To control $\|w\|$, one can envision two routes to
    a model
  • Regularization/Ridge Regression, i.e. minimize over w
    and b
  • $R_G(w,b) = \sum_{i=1}^{L} (y_i - \langle w, x_i \rangle - b)^2 + \lambda \|w\|^2$
  • Support Vector Machines (SVM), i.e. solve directly
    an optimization problem (hereunder classification SVM,
    separable data)
  • Minimize $\|w\|^2$, with $y_i = \pm 1$
  • and $y_i(\langle w, x_i \rangle + b) \ge 1$ for all $i = 1, \dots, L$
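
The regularization route can be illustrated with a single attribute, where the ridge minimizer has a closed form. The sketch assumes the objective $\sum_i (y_i - w x_i - b)^2 + \lambda w^2$ with an unpenalized intercept b; the data and the function name are hypothetical.

```python
def ridge_1d(xs, ys, lam):
    """Minimize sum_i (y_i - w*x_i - b)^2 + lam * w^2 for a single attribute."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)      # lam > 0 shrinks |w|
    b = y_mean - w * x_mean    # the intercept is not penalized
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w_ols, b_ols = ridge_1d(xs, ys, lam=0.0)      # plain least squares
w_ridge, b_ridge = ridge_1d(xs, ys, lam=5.0)  # regularized: smaller |w|
```

Increasing the regularization constant reduces |w|, which is how this route keeps the norm of w under control.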

Linear classifiers
  • Rules of the form: weight vector w, threshold b
  • $f(x) = \mathrm{sign}\left(\sum_{j=1}^{n} w_j x_j + b\right)$, the sum running over the n attributes of x
  • $f(x) = +1$ if $\sum_{j=1}^{n} w_j x_j + b > 0$
  • $-1$ otherwise
  • If the L training examples $(x_1,y_1), \dots, (x_L,y_L)$,
    where $x_i$ is a vector from $\mathbb{R}^n$ and $y_i = +1$ or $-1$,
    are linearly separable, there is an infinity of
    hyperplanes f separating the +1 from the -1.
  • However, there is a unique f that defines
    the maximum-width corridor between the $y_i = +1$ and
    $y_i = -1$ points of the sample. Let 2d be the width of
    this optimal corridor.

VC dimension of "thick" hyperplanes
  • Lemma: the VC dimension of hyperplanes defined
    by f = (w,b) with margin d ("thick"
    hyperplane), with sample vectors $x_i$ satisfying
    $\|x_i\| \le R$, $i = 1, \dots, L$, is bounded by
  • $\text{VC dim} \le R^2/d^2 + 1$
  • The VC dimension of such a linear classifier
    does not necessarily depend on the number of
    attributes or the number of parameters!

Maximizing margin d: relation to SRM
  • The hypothesis space with minimal VC dimension
    according to SRM will be the hyperplane with
    maximum margin d.
  • It will be entirely defined by the parts of
    the sample at minimal distance: the support
    vectors
  • The number of support vectors is neither L, nor
    n, nor h, the corresponding VC dimension
  • If the number of support vectors is large
    compared to L, the model may be beautiful in
    theory, but extremely costly to apply!

Computing Optimal Hyperplane
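  • Training examples
  • $(x_1,y_1), \dots, (x_L,y_L)$, $x_i$ from $\mathbb{R}^n$, $y_i = +1$ or $-1$
  • Requirement 1: zero training error
  • $(y_i = -1) \Rightarrow \langle w, x_i \rangle + b < 0$
  • $(y_i = +1) \Rightarrow \langle w, x_i \rangle + b > 0$
  • Hence in all cases $y_i(\langle w, x_i \rangle + b) > 0$
  • Requirement 2: maximum margin
  • Maximize d, with $d = \min_i |\langle w, x_i \rangle + b| / \|w\|$
  • Requirements 1 + 2
  • Maximize d with, for every $i = 1, \dots, L$,
    $y_i(\langle w, x_i \rangle + b) / \|w\| \ge d$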
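
A small sketch of the two requirements: the signed quantity $y_i(\langle w, x_i \rangle + b)/\|w\|$ is positive for every example exactly when the training error is zero, and its minimum over the sample is the margin d. The snippet also evaluates the lemma's bound $R^2/d^2 + 1$ for thick hyperplanes. The sample, the two candidate hyperplanes and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def geometric_margin(w, b, sample):
    """d = min_i y_i * (<w, x_i> + b) / ||w||; positive only with zero training error."""
    norm_w = math.sqrt(dot(w, w))
    return min(y * (dot(w, x) + b) / norm_w for x, y in sample)


def thick_hyperplane_vc_bound(sample, d):
    """R^2 / d^2 + 1, with R the largest norm among the sample vectors."""
    R = max(math.sqrt(dot(x, x)) for x, _ in sample)
    return R ** 2 / d ** 2 + 1


sample = [((2.0, 0.0), 1), ((3.0, 1.0), 1), ((-2.0, 0.0), -1), ((-3.0, -1.0), -1)]
d_vertical = geometric_margin((1.0, 0.0), 0.0, sample)  # hyperplane x_1 = 0
d_tilted = geometric_margin((1.0, 1.0), 0.0, sample)    # another separating hyperplane
bound_vertical = thick_hyperplane_vc_bound(sample, d_vertical)
```

Both hyperplanes separate the sample, but the first one has the larger margin and therefore the smaller bound on the VC dimension.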

Notions of Duality
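  • A large number of linear model optimization problems
    (this includes ridge regression and SVMs) can be
    written in a dual space, of dimension L, the
    space of $\alpha$
  • With $y = f(x) = \langle w, x \rangle + b$, let us have $w = X^T \alpha$,
    where $\alpha$ is a vector from $\mathbb{R}^L$, X being the matrix $(x_{ij})$,
    $i = 1, \dots, L$, $j = 1, \dots, n$
  • $w_j = \sum_{i=1}^{L} x_{ij} \alpha_i$, $j = 1, \dots, n$
  • It can be shown that for such models, a solution
    can be written using only the scalar
    products $\langle x_i, x_j \rangle$ and $\langle x_i, x \rangle$, and expressed as a
    linear combination of the $y_i$ (writing the i-th dual
    coefficient as $\alpha_i y_i$)
  • $f(x) = \sum_{i=1}^{L} \alpha_i y_i \langle x, x_i \rangle + b$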

SVM dual optimization problem
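  • By setting $d = 1/\|w\|$, the problem becomes
  • Minimize $J(w,b) = \frac{1}{2}\|w\|^2$, with $y_i(\langle w, x_i \rangle + b) \ge 1$, $i = 1, \dots, L$
  • The solution can be written as a linear
    combination of the training data
  • $w = \sum_{i=1}^{L} \alpha_i y_i x_i$, $\alpha_i \ge 0$
  • $b = -\frac{1}{2}(\langle w, x_{pos} \rangle + \langle w, x_{neg} \rangle)$, with $x_{pos}$ and $x_{neg}$ support vectors of class +1 and -1
  • Dual optimization problem
  • Maximize $L(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  • with $\sum_{i=1}^{L} \alpha_i y_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, L$
  • This is a positive semi-definite quadratic program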

SVM primal and dual equivalences
  • Theorem: the primal OP and the dual OP have the
    same solution.
  • Given the solution $\alpha$ of the dual OP,
  • $w = \sum_{i=1}^{L} \alpha_i y_i x_i$
  • $b = -\frac{1}{2}(\langle w, x_{pos} \rangle + \langle w, x_{neg} \rangle)$
  • is the solution of the primal OP.
  • Hence the learning result (SVM classifier) can be
    represented in two alternative ways
  • Weight vector and threshold (w,b)
  • Vector of "influences" $(\alpha_1, \dots, \alpha_L)$ of each sample data point
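
The equivalence can be checked by hand on a toy sample. The sketch below assumes a two-point separable sample on the real line, for which the dual solution $\alpha = (1/2, 1/2)$ was worked out by hand; it rebuilds (w, b) from $\alpha$ and compares the primal and dual objective values. Function names are hypothetical, and b is recovered as $-\frac{1}{2}(\langle w, x_{pos} \rangle + \langle w, x_{neg} \rangle)$ from one support vector of each class.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, xs, ys):
    """L(alpha) = sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j <x_i, x_j>."""
    n = len(xs)
    quad = sum(alpha[i] * alpha[j] * ys[i] * ys[j] * dot(xs[i], xs[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


def primal_from_dual(alpha, xs, ys):
    """w = sum_i alpha_i y_i x_i; b from one support vector of each class."""
    dim = len(xs[0])
    w = [sum(a * y * x[k] for a, y, x in zip(alpha, ys, xs)) for k in range(dim)]
    x_pos = next(x for a, y, x in zip(alpha, ys, xs) if a > 0 and y == 1)
    x_neg = next(x for a, y, x in zip(alpha, ys, xs) if a > 0 and y == -1)
    b = -0.5 * (dot(w, x_pos) + dot(w, x_neg))
    return w, b


# Toy separable sample: one point per class on the real line.
xs = [(1.0,), (-1.0,)]
ys = [1, -1]
alpha = [0.5, 0.5]   # dual solution worked out by hand for this sample

w, b = primal_from_dual(alpha, xs, ys)
primal_value = 0.5 * dot(w, w)
dual_value = dual_objective(alpha, xs, ys)
```

For this sample both objectives take the same value, and both points satisfy the margin constraint with equality, so both are support vectors.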

Properties of the SVM Dual OP
  • Dual optimization problem
  • Maximize
  • $L(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i=1}^{L} \sum_{j=1}^{L} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  • with
  • $\sum_{i=1}^{L} \alpha_i y_i = 0$ and $\alpha_i \ge 0$, $i = 1, \dots, L$
  • There is a single solution (i.e. (w,b) is unique)
  • One factor $\alpha_i$ for each training example
  • Describes the influence of training example i on
    the result
  • $\alpha_i > 0 \Rightarrow$ training example is a support vector
  • $\alpha_i = 0$ otherwise
  • Depends exclusively on inner products between
    training examples

SVM: the ugly case of non-separable training samples
  • For some training samples there is no separating
    hyperplane
  • Complete separation is suboptimal for many
    training samples (e.g. a single "-1" close to
    the cloud of "+1", all the other "-1" far
    away)
  • There is hence a need to trade off between margin
    size (robustness) and training error

Soft Margin: suboptimal example

SVM Soft-Margin Separation
  • Same idea as regularization: maximize margin and
    minimize training error simultaneously.
  • Hard Margin
  • Minimize $J(w,b) = \frac{1}{2}\|w\|^2$
  • With constraints $y_i(\langle w, x_i \rangle + b) \ge 1$, $i = 1, \dots, L$
  • Soft Margin
  • Minimize $J(w,b,\xi) = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{L} \xi_i$,
  • With constraints $y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, $i = 1, \dots, L$
  • $\sum_{i=1}^{L} \xi_i$ is an upper bound on the number of
    training errors
  • C is a parameter that controls the trade-off between
    margin and error
  • Dual optimization problem for soft margin
  • Maximize $L(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  • With constraints $\sum_{i=1}^{L} \alpha_i y_i = 0$ and
    $0 \le \alpha_i \le C$, $i = 1, \dots, L$
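
The role of the slack variables can be sketched numerically. For a fixed (w, b), the smallest slack satisfying the soft-margin constraints is $\xi_i = \max(0, 1 - y_i(\langle w, x_i \rangle + b))$. The sample, the choice of (w, b) and the function names below are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, sample):
    """Smallest xi_i >= 0 with y_i * (<w, x_i> + b) >= 1 - xi_i."""
    return [max(0.0, 1.0 - y * (dot(w, x) + b)) for x, y in sample]


def soft_margin_objective(w, b, sample, C):
    """J(w, b, xi) = 1/2 ||w||^2 + C * sum_i xi_i."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, sample))


def training_errors(w, b, sample):
    """Number of examples on the wrong side of the hyperplane."""
    return sum(1 for x, y in sample if y * (dot(w, x) + b) <= 0)


# One "-1" example (x = 0.25) sits inside the cloud of "+1" examples.
sample = [((2.0,), 1), ((0.5,), 1), ((-2.0,), -1), ((0.25,), -1)]
w, b = (1.0,), 0.0
xi = slacks(w, b, sample)
objective = soft_margin_objective(w, b, sample, C=1.0)
errors = training_errors(w, b, sample)
```

Here one example is misclassified (slack above 1) and one is correctly classified but inside the margin (slack between 0 and 1), so the sum of slacks exceeds the number of training errors, as the upper-bound statement says.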

Properties of the Soft-Margin Dual OP
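  • Dual OP
  • Maximize $L(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$
  • With constraints $\sum_{i=1}^{L} \alpha_i y_i = 0$ and
    $0 \le \alpha_i \le C$, $i = 1, \dots, L$
  • Single solution (i.e. (w,b) is unique)
  • One factor $\alpha_i$ for each training example
  • Influence of a single training example limited by C
  • $0 < \alpha_i < C \Rightarrow$ SV with $\xi_i = 0$
  • $\alpha_i = C \Rightarrow$ SV with $\xi_i > 0$
  • $\alpha_i = 0$ otherwise
  • Results are based exclusively on inner products
    between training examples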

Soft Margin - Support vectors

Strategies towards non-linear problems (1/4)
  • Notion of "feature space" (Vapnik: extended
    space for attributes): a feature space is a
    manifold in which one tries to embed, through a
    homeomorphism, the original attributes
    $x = (x_1, \dots, x_n) \to \Phi(x) = (\varphi_1(x), \dots, \varphi_N(x))$
  • By doing so, one will try, in the new manifold, to
    build models with a linear approach on the new
    attributes $\varphi_p$
  • $y = f(x) = \sum_{p=1}^{N} w_p \varphi_p(x) + b$

Strategies towards non-linear problems (2/4)
  • The use of the dual representation then allows one to
    express the generalized linear model thus
    obtained in the following way
  • $y = \sum_{i=1}^{L} \alpha_i y_i \langle \Phi(x_i), \Phi(x) \rangle + b$, where $\Phi$ is the map into feature space
  • The idea of Reproducing Kernel Hilbert
    Spaces (RKHS) is to express non-linearity in an
    indirect way through the use of a function K,
    satisfying a certain number of criteria (Mercer),
    that defines the geometry of the extended feature
    space through its inner product
  • $K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle$
  • Our model becomes $y = \sum_{i=1}^{L} \alpha_i y_i K(x_i, x) + b$

Strategies towards non-linear problems (3/4)
  • There are many examples of Mercer kernels K, such
    as
  • Linear: $K(x_i,x_j) = \langle x_i, x_j \rangle$
  • Polynomial: $K(x_i,x_j) = (\langle x_i, x_j \rangle + 1)^d$
  • Radial basis functions: $K(x_i,x_j) = \exp(-\gamma \|x_i - x_j\|^2)$
  • Sigmoid kernels: $K(x_i,x_j) = \tanh(\gamma \langle x_i, x_j \rangle + c)$
  • The dual approach and kernels make it possible to
    build an important class of robust models for
    non-linear problems, where the number of attributes
    can be huge (thousands, millions, even infinite). In
    dual space one works with a space of finite
    dimension L. Such models belong to the family of
    so-called generalized linear models.
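
The kernels listed above can be sketched in a few lines. The parameterizations used here (the "+1" offset in the polynomial kernel, $\exp(-\gamma \|u - v\|^2)$ for the radial basis function, $\tanh(\gamma \langle u, v \rangle + c)$ for the sigmoid) are common conventions taken as assumptions, and the function names are hypothetical. The Gram matrix of size L x L is the only view of the data that a dual-space algorithm needs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, d=2):
    return (dot(u, v) + 1.0) ** d


def rbf_kernel(u, v, gamma=0.5):
    # Assumed parameterization: exp(-gamma * ||u - v||^2).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid_kernel(u, v, gamma=0.1, c=0.0):
    # Assumed form: tanh(gamma * <u, v> + c).
    return math.tanh(gamma * dot(u, v) + c)


def gram_matrix(kernel, xs):
    """L x L matrix of kernel values between all pairs of sample vectors."""
    return [[kernel(a, b) for b in xs] for a in xs]


xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
K_poly = gram_matrix(polynomial_kernel, xs)
K_rbf = gram_matrix(rbf_kernel, xs)
```

Whatever the dimension of the feature space behind the kernel, the matrix stays L x L, which is why the dual space has finite dimension L.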

Strategies towards non-linear problems (4/4)
  • Induced non-linearity: Mercer's theorem allows one to
    express a kernel K in the following form
  • $K(x_1, x_2) = \sum_{i} \lambda_i \varphi_i(x_1) \varphi_i(x_2)$
  • K here defines an inner product in the extended
    feature space
  • A generalized linear model then has a
    representation in the original attribute space
  • $y = \sum_{i=1,2,\dots} \lambda_i \psi_i \varphi_i(x) + b$,
  • where $\psi_i = \sum_{m=1}^{L} \alpha_m y_m \varphi_i(x_m)$

Soft Margin SVM with Kernels
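  • Training: optimization problem in dual space
  • Maximize
  • $L(\alpha) = \sum_{i=1}^{L} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{L} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$
  • With constraints
  • $\sum_{i=1}^{L} \alpha_i y_i = 0$
  • and $0 \le \alpha_i \le C$, $i = 1, \dots, L$
  • Classification model for a new example x:
  • $f(x) = \mathrm{sign}\left(\sum_{x_i \in SV} \alpha_i y_i K(x_i, x) + b\right)$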
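
The classification rule can be tried on a sample that no hyperplane separates in the original attributes. The sketch below assumes an XOR-like sample with the polynomial kernel $(\langle u, v \rangle + 1)^2$; the values $\alpha_i = 1/8$ and b = 0 were checked by hand for this particular sample and kernel (they require C of at least 1/8), and the function names are hypothetical.

```python
def polynomial_kernel(u, v, d=2):
    return (sum(a * b for a, b in zip(u, v)) + 1.0) ** d


def decision_value(x, support, kernel, b=0.0):
    """Sum over support vectors of alpha_i * y_i * K(x_i, x), plus b."""
    return sum(alpha * y * kernel(x_i, x) for x_i, y, alpha in support) + b


def classify(x, support, kernel, b=0.0):
    return 1 if decision_value(x, support, kernel, b) > 0 else -1


# XOR-like sample as (x_i, y_i, alpha_i); every example is a support vector.
support = [((-1.0, -1.0), -1, 0.125), ((-1.0, 1.0), 1, 0.125),
           ((1.0, -1.0), 1, 0.125), ((1.0, 1.0), -1, 0.125)]

values = [decision_value(x, support, polynomial_kernel) for x, _, _ in support]
predictions = [classify(x, support, polynomial_kernel) for x, _, _ in support]
```

Each training example gets a decision value of exactly +1 or -1 with the right sign, so all four points lie on the margin in feature space even though the classes cannot be separated linearly in the original two attributes.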

When do SVMs Work?
  • If
  • Training error on the sample is low on average
  • And the margin d / R on the sample is large on
    average
  • Then
  • SVM learns a classification rule with low error
    rate with high probability (worst case)
  • SVM learns classification rules that have low
    error rate on average
  • SVM learns a classification rule for which the
    (leave-one-out) estimated error rate is low

Conclusion (1/2)
  • Vapnik's theory gives a new vision
    of the notion of robustness, with a set of
    theorems of the "Kolmogorov" type,
    which means they hold "whatever the underlying
    probabilistic laws of the sample data"
  • Under this vision, building a model becomes
    negotiating a trade-off (Friedman) between an
    excellent fit and proper robustness.
    Cross-validation (a tool also used in SVMs to
    fine-tune the constant C of soft margin SVM models)
    replaces here the tests used in the Fisher approach.

Conclusion (2/2)
  • In a first phase of the work, SRM (and K2C!) frees
    the statistician from a tedious and
    time-consuming task: fine-tuning and testing the
    probabilistic laws of the data.
  • The robustness of linear models can be controlled
    efficiently. Two roads to a model are Regularization
    (e.g. Ridge Regression, K2R) and Support Vector
    Machines (SVM and KSVM)
  • Reproducing Kernel Hilbert Space (RKHS)
    theory, together with Vapnik's vision of linear
    models, opens a major way to build
    efficient non-linear models (generalized linear
    models)

