Chapter 6: The Structural Risk Minimization Principle

- Junping Zhang
- jpzhang_at_fudan.edu.cn
- Intelligent Information Processing Laboratory, Fudan University
- March 23, 2004
Objectives

- Structural risk minimization
- Two other induction principles
- The scheme of the SRM induction principle
- Real-valued functions
Principle of SRM
Minimum Description Length and SRM inductive principles

- The idea about the nature of random phenomena
- The Minimum Description Length principle for the pattern recognition problem
- Bounds for the MDL
- SRM for the simplest model and MDL
- The shortcoming of the MDL
The idea about the nature of random phenomena

- Probability theory (1930s, Kolmogorov)
- Formal inference
- The axiomatization did not consider the nature of randomness
- The axioms start from given probability measures
- The model of randomness: Solomonoff (1965), Kolmogorov (1965), Chaitin (1966)
- Algorithmic (descriptive) complexity: the length of the shortest binary computer program that reproduces the object
- Up to an additive constant, it does not depend on the type of computer
- It is a universal characteristic of the object
- A relatively long string describing an object is random if the algorithmic complexity of the object is high, that is, if the given description of the object cannot be compressed significantly.
- MML (Wallace and Boulton, 1968); MDL (Rissanen, 1978)
- Algorithmic complexity is the main tool of the inductive inference of learning machines
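The compressibility view of randomness can be sketched numerically. True algorithmic (Kolmogorov) complexity is uncomputable, so the general-purpose compressor `zlib` serves below only as a rough, hedged proxy; the example strings and the function name are invented for illustration:

```python
import random
import zlib

def compression_ratio(s: bytes) -> float:
    # Compressed length over original length: a crude, computable proxy
    # for algorithmic complexity, which itself is uncomputable.
    return len(zlib.compress(s)) / len(s)

regular = b"01" * 500  # highly structured string: short program reproduces it
random.seed(0)
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # pseudo-random string

# A string is 'random' in the Solomonoff-Kolmogorov-Chaitin sense when its
# description cannot be compressed significantly: the ratio stays near 1.
print(compression_ratio(regular))   # far below 1: compressible, not random
print(compression_ratio(noisy))     # near (or above) 1: incompressible
```

The periodic string compresses to a small fraction of its length, while the pseudo-random bytes resist compression.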
The Minimum Description Length principle for the pattern recognition problem

- Given l pairs (ω_1, x_1), …, (ω_l, x_l) containing a vector x and a binary value ω
- Consider two strings: the binary string ω_1, …, ω_l (146) and the string of vectors x_1, …, x_l (147)
Question

- Q: Given the string (147), is the string (146) a random object?
- A: Analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas.
Compressing its description

- Since the ω_i, i = 1, …, l, are binary values, the string (146) is described by l bits.
- The training pairs were drawn randomly and independently.
- The value ω_i may depend on the vector x_i but not on the vector x_j.
Model
General case: the code book does not contain the perfect table.
Randomness
Bounds for the MDL

- Q: Does the compression coefficient K(T) determine the probability of the test error in classifying (decoding) the vectors x by the table T?
- A: Yes.
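The compression coefficient of a table drawn from a finite code book can be made concrete. The sketch below follows the coding idea in simplified form: transmit the index of the table plus the positions of its training errors. The function name, the example numbers, and the omission of constant-order terms are my own simplifications, not the chapter's exact scheme:

```python
from math import comb, log2

def compression_coefficient(n_tables: int, l: int, d: int) -> float:
    # Bits needed to describe the l training labels via a table T:
    #   log2(n_tables) -- index of T within the code book
    #   log2(C(l, d))  -- positions of the d labels where T errs
    # K(T) is this description length divided by the l raw label bits.
    bits = log2(n_tables) + log2(comb(l, d))
    return bits / l

# A table from a code book of 1024 tables, misclassifying 5 of 100 examples:
K = compression_coefficient(n_tables=1024, l=100, d=5)
print(K)  # well below 1: the label string was genuinely compressed
```

A coefficient well below 1 indicates that the table captures real structure in the labels, which is what the bound above ties to the test error.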
Comparison between the MDL and ERM in the simplest model
SRM for the simplest model and MDL
The power of the compression coefficient

- To obtain a bound for the probability of error, only the compression coefficient needs to be known.
- We do not need to know: how many examples were used; how the structure of the code books was organized; which code book was used and how many tables were in it; or how many errors were made by the table from the code book we used.
The MDL principle

- To minimize the probability of error, one has to minimize the coefficient of compression.
The shortcoming of the MDL

- The MDL uses code books with a finite number of tables.
- If the set of functions depends continuously on parameters, one has to first quantize that set to make the tables.
Quantization

- How do we make a smart quantization for a given number of observations?
- For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability?
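As a hedged sketch of the second question, consider the family of one-dimensional threshold rules f_t(x) = 1[x > t], which depends continuously on t. For a fixed sample, only the ordering of the points matters, so midpoints between sorted neighbors already give a code book of l + 1 tables with no loss of approximation. The data and variable names below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=50)
y = (x > 0.37).astype(int)  # labels produced by a threshold rule

# Quantize the continuous parameter t: between two neighboring sample
# points every t yields the same table of labels, so midpoints (plus one
# threshold below and one above all points) enumerate all distinct tables.
xs = np.sort(x)
code_book = np.concatenate(([xs[0] - 1.0],
                            (xs[:-1] + xs[1:]) / 2.0,
                            [xs[-1] + 1.0]))

errors = [int((y != (x > t)).sum()) for t in code_book]
best_t = code_book[int(np.argmin(errors))]
print(len(code_book), best_t, min(errors))  # 51 tables; a perfect table exists
```

The continuous family collapses to 51 tables here, and one of them reproduces the training labels exactly, so quantization loses nothing for this sample.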
The shortcoming of the MDL

- Finding a good quantization is extremely difficult; this is the main shortcoming of the MDL principle.
- The MDL principle works well when the problem of constructing reasonable code books has a good solution.
Consistency of the SRM principle and asymptotic bounds on the rate of convergence

- Q: Is the SRM consistent?
- Q: What is the bound on the (asymptotic) rate of convergence?
Consistency of the SRM principle

A simplified version
Remark

- To avoid choosing the minimum of the functional (156) over the infinite number of elements of the structure, impose an additional constraint:
- Choose the minimum from the first l elements of the structure, where l is equal to the number of observations.
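A minimal sketch of this selection rule, assuming the standard VC confidence term sqrt((h(ln(2l/h) + 1) - ln(η/4))/l) as the penalty; the function name and the example numbers are invented, and the guaranteed-risk form here is illustrative rather than the chapter's exact functional (156):

```python
from math import log, sqrt

def srm_select(emp_risks, vc_dims, l, eta=0.05):
    # Per the remark above, only the first l elements of the structure are
    # considered.  For element S_k with VC dimension h, the guaranteed risk
    # is the empirical risk plus a VC confidence term.
    best_k, best_bound = None, float("inf")
    for k, (r_emp, h) in enumerate(zip(emp_risks[:l], vc_dims[:l])):
        conf = sqrt((h * (log(2 * l / h) + 1) - log(eta / 4)) / l)
        if r_emp + conf < best_bound:
            best_k, best_bound = k, r_emp + conf
    return best_k, best_bound

# Empirical risk falls as the elements grow richer, while the confidence
# term rises: SRM balances the two summands.
emp = [0.30, 0.18, 0.10, 0.07, 0.06, 0.06]
dims = [1, 2, 4, 8, 16, 32]
k, bound = srm_select(emp, dims, l=200)
print(k, bound)  # an intermediate element wins, not the richest one
```

Note how neither the element with the smallest empirical risk nor the simplest one is chosen; the minimum of the guaranteed risk sits in between.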
Discussion and example
- The rate of convergence is determined by two contradictory requirements on the rule n = n(l).
- The first summand: the larger n(l), the smaller the deviation.
- The second summand: the larger n(l), the larger the deviation.
- For structures with a known bound on the rate of approximation, select the rule that assures the largest rate of convergence.
Bounds for the regression estimation problem
The model of regression estimation by series expansion
Example
The problem of approximating functions
- To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function which can be described as a family of functions possessing finite VC dimension.
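A hedged illustration of such a family (the Gaussian is just one bounded kernel; the construction and names below are mine, not the chapter's): approximating a smooth function by a linear combination of kernel bumps, where the error falls as the number of terms grows.

```python
import numpy as np

def kernel_approx(f, n_centers, x, width=None):
    # Least-squares fit of f by sum_k c_k * exp(-(x - t_k)^2 / (2 w^2)).
    # The Gaussian kernel is bounded, and with fixed centers the family
    # is linear in the coefficients c_k.
    centers = np.linspace(x.min(), x.max(), n_centers)
    w = width if width is not None else centers[1] - centers[0]
    design = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * w**2))
    coef, *_ = np.linalg.lstsq(design, f(x), rcond=None)
    return design @ coef

x = np.linspace(0.0, 2.0 * np.pi, 200)
errs = [np.max(np.abs(kernel_approx(np.sin, n, x) - np.sin(x)))
        for n in (3, 6, 12)]
print(errs)  # sup-norm error falls markedly from 3 to 12 kernel terms
```

With a dozen well-spaced bumps the sup-norm error on sin is already small, consistent with the high rate of approximation the slide claims for bounded kernels.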
The problem of local risk minimization
Local risk minimization model
Note

- Using local risk minimization methods, one probably does not need rich sets of approximating functions, whereas the classical semi-local methods are based on using a set of constant functions.
Note

- For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k = 0, 1, 2, 3, containing the polynomials of degree 0, 1, 2, 3.
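A minimal sketch of such a local estimator (the helper and the data below are hypothetical, written for illustration): minimize a kernel-weighted empirical risk at a point with a polynomial from S_k, where degree 0 recovers the classical locally constant (semi-local) smoother.

```python
import numpy as np

def local_poly_fit(x, y, x0, bandwidth, degree):
    # Locality weights: observations near x0 dominate the local risk.
    sw = np.sqrt(np.exp(-((x - x0) ** 2) / (2.0 * bandwidth**2)))
    # Polynomial basis 1, (x - x0), (x - x0)^2, ... from the element S_degree.
    V = np.vander(x - x0, degree + 1, increasing=True)
    # Weighted least squares: minimize sum_i w_i * (y_i - p(x_i))^2.
    coef, *_ = np.linalg.lstsq(V * sw[:, None], y * sw, rcond=None)
    return coef[0]  # value of the local polynomial at x0

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 2.0 * np.pi, 300)
y = np.sin(x) + rng.normal(0.0, 0.1, size=300)

# Degrees 0..3 all give sensible local estimates of sin(1.0) ~= 0.841:
for d in range(4):
    print(d, local_poly_fit(x, y, x0=1.0, bandwidth=0.3, degree=d))
```

Even degree 0 is serviceable here, supporting the note that rich sets of approximating functions are probably unnecessary for local estimation.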
Summary
- MDL
- SRM
- Local Risk Functional