1
Chapter 6 The Structural Risk Minimization
Principle
  • Junping Zhang
  • jpzhang@fudan.edu.cn
  • Intelligent Information Processing Laboratory,
    Fudan University
  • March 23, 2004

2
Objectives
3
Structural risk minimization
4
Two other induction principles
5
The Scheme of the SRM induction principle
6
Real-Valued functions
13
Principle of SRM
21
SRM
23
Minimum Description Length and SRM inductive
principles
  • The idea about the Nature of Random Phenomena
  • Minimum Description Length Principle for the
    Pattern Recognition Problem
  • Bounds for the MDL
  • SRM for the simplest Model and MDL
  • The Shortcoming of the MDL

24
The idea about the Nature of Random Phenomena
  • Probability theory (1930s, Kolmogorov)
  • Formal inference
  • The axiomatization does not consider the nature of randomness
  • The axioms take probability measures as given

25
The idea about the Nature of Random Phenomena
  • The model of randomness
  • Solomonoff (1965), Kolmogorov (1965), Chaitin (1966)
  • Algorithmic (descriptive) complexity: the length of the shortest binary computer program that describes the object
  • Up to an additive constant, it does not depend on the type of computer
  • It is a universal characteristic of the object

26
  • A relatively long string describing an object is random if the algorithmic complexity of the object is high, i.e., if the given description of the object cannot be compressed significantly (a formal sketch follows below)
  • MML (Wallace and Boulton, 1968); MDL (Rissanen, 1978)
  • Algorithmic complexity is the main tool of inductive inference for learning machines
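
A minimal formal sketch of these notions, following the standard Solomonoff-Kolmogorov-Chaitin formulation (the slides' own formulas were not transcribed):

```latex
% Algorithmic (descriptive) complexity of a binary string s with
% respect to a universal computer U: the length of the shortest
% program p that makes U output s.
K_U(s) = \min \{\, \lvert p \rvert : U(p) = s \,\}

% Invariance: for universal computers U and V there is a constant
% c_{U,V}, independent of s, such that
\lvert K_U(s) - K_V(s) \rvert \le c_{U,V}

% Incompressibility view of randomness: a long string s is random
% if its description cannot be shortened by more than a constant c:
K_U(s) \ge \lvert s \rvert - c
```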

27
Minimum Description Length Principle for the
Pattern Recognition Problem
  • Given l pairs (ω_1, x_1), ..., (ω_l, x_l) containing the vectors x_i and the binary values ω_i ∈ {0, 1}
  • Consider two strings: the binary string ω_1, ..., ω_l (146) and the string of vectors x_1, ..., x_l (147)

28
Question
  • Q: Given the string (147), is the binary string (146) a random object?
  • A: Analyze the complexity of the string (146) in the spirit of the Solomonoff-Kolmogorov-Chaitin ideas

29
Compress its description
  • Since the ω_i, i = 1, ..., l, are binary values, the string (146) is described by l bits (a sketch of the resulting compression coefficient follows below)
  • Since the training pairs were drawn randomly and independently, the value ω_i may depend on the vector x_i but not on the vectors x_j, j ≠ i
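
A sketch of how this description can be compressed, assuming a fixed code book of N tables that contains a table T reproducing the string (146) from (147) without error; the general case adds a correction term for the errors the table makes (the slides' formulas were not transcribed):

```latex
% The string (146) is then described by the index of T in the code
% book, i.e. by
\lceil \log_2 N \rceil \text{ bits instead of } \ell \text{ bits,}

% so the compression coefficient of the table T is
K(T) = \frac{\lceil \log_2 N \rceil}{\ell} < 1 .
```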

30
Model
34
General Case: the code book does not contain the perfect table.
36
Randomness
37
Bounds for the MDL
  • Q: Does the compression coefficient K(T) determine the probability of test error when classifying (decoding) the vectors x by the table T?
  • A: Yes (the bound is sketched below)
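
A sketch of the bound in the form it takes in Vapnik's MDL theorem, reconstructed from that formulation since the slide equations were not transcribed: for a table T with compression coefficient K(T) on a training string of length l,

```latex
% With probability at least 1 - \eta, the probability of error when
% classifying vectors x by the table T is bounded by
R(T) \le 2 \left( K(T) \ln 2 - \frac{\ln \eta}{\ell} \right).
```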

38
Comparison between the MDL and ERM in the
simplest model
41
SRM for the simplest Model and MDL
44
The power of compression coefficient
  • To obtain a bound on the probability of error, only information about the compression coefficient needs to be known

45
The power of the compression coefficient
  • To obtain the bound, one does not need to know (a numerical sketch follows this list):
  • How many examples were used
  • How the structure of the code books was organized
  • Which code book was used and how many tables were in this code book
  • How many errors were made by the table from the code book we used
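
A minimal numerical sketch of evaluating the bound; the function name and the example numbers are ours, and the bound form with explicit l and η follows the reconstruction sketched earlier:

```python
import math

def mdl_error_bound(K, l, eta=0.05):
    """Reconstructed MDL bound (an assumption, see above): with
    probability at least 1 - eta, the test error of a table with
    compression coefficient K on a training string of length l is
    at most 2 * (K * ln 2 - ln(eta) / l)."""
    return 2.0 * (K * math.log(2.0) - math.log(eta) / l)

# Hypothetical example: a code book of N = 1024 tables containing a
# perfect table, so K = log2(N) / l.
l, N = 1000, 1024
K = math.log2(N) / l                 # = 0.01
print(mdl_error_bound(K, l))         # about 0.0199
```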

46
MDL principle
  • To minimize the probability of error, one has to minimize the compression coefficient

47
The shortcoming of the MDL
  • MDL uses code books with a finite number of tables
  • If the set of functions depends continuously on parameters, one has to first quantize that set to make the tables

48
Quantization
  • How do we make a smart quantization for a given number of observations?
  • For a given set of functions, how can we construct a code book with a small number of tables but with good approximation ability? (a toy sketch follows below)
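
A toy sketch of such a quantization for the illustrative (assumed) family of one-dimensional threshold classifiers; all names here are ours:

```python
import numpy as np

def threshold_codebook(x_train, n_tables=32):
    """Quantize the continuous family of threshold classifiers
    f_t(x) = 1[x >= t] into a finite code book: each table is the
    vector of labels the quantized classifier assigns to the
    training inputs, with n_tables thresholds spread over the
    empirical range of the data."""
    thresholds = np.quantile(x_train, np.linspace(0.0, 1.0, n_tables))
    return [(t, (x_train >= t).astype(int)) for t in thresholds]

# A small code book: few tables, yet every labeling achievable by a
# threshold rule on this sample is closely approximated.
x = np.sort(np.random.rand(20))
book = threshold_codebook(x, n_tables=8)
print(len(book), book[0][1])
```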

49
The shortcoming of the MDL
  • Finding a good quantization is extremely difficult; this is the main shortcoming of the MDL principle
  • The MDL principle works well when the problem of constructing reasonable code books has a good solution

50
Consistency of the SRM principle and asymptotic
bounds on the rate of convergence
  • Q: Is the SRM principle consistent?
  • Q: What is the bound on the (asymptotic) rate of convergence?

53
Consistency of the SRM principle.
54
Simplified version
59
Remark
  • To avoid choosing the minimum of functional (156) over the infinite number of elements of the structure, an additional constraint is imposed
  • Choose the minimum over the first l elements of the structure, where l is equal to the number of observations (see the sketch below)
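
A minimal sketch of SRM selection under this constraint. The guaranteed-risk form used here, empirical risk plus a standard VC-type confidence term, is our assumption; the slide's functional (156) itself was not transcribed:

```python
import math

def srm_select(emp_risks, vc_dims, l):
    """Pick the structure element minimizing a guaranteed-risk
    bound, searching only the first l elements as the remark
    requires.

    emp_risks[k] -- empirical risk of the best function in S_k
    vc_dims[k]   -- VC dimension h_k of S_k (nondecreasing in k)
    l            -- number of observations"""
    best_k, best_bound = None, float("inf")
    for k in range(min(l, len(emp_risks))):
        h = vc_dims[k]
        # A standard VC-type confidence term (one common choice).
        confidence = math.sqrt(h * (math.log(2.0 * l / h) + 1.0) / l)
        bound = emp_risks[k] + confidence
        if bound < best_bound:
            best_k, best_bound = k, bound
    return best_k, best_bound

# Hypothetical usage: empirical risk falls as capacity grows;
# SRM balances the two terms.
print(srm_select([0.30, 0.15, 0.10, 0.09], [1, 3, 8, 25], l=100))
```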

62
Discussions and Example
63
  • The rate of convergence is determined by two contradictory requirements on the rule n = n(l)
  • The first summand: the larger n(l), the smaller the deviation
  • The second summand: the larger n(l), the larger the deviation
  • For structures with a known bound on the rate of approximation, select the rule n = n(l) that assures the largest rate of convergence (a schematic form follows below)
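
A schematic of this trade-off. The exact bound on the slides was not transcribed, so the form below is a standard sketch rather than the slide's formula; r_n denotes the rate of approximation of element S_n and h_n its VC dimension:

```latex
% Guaranteed risk of SRM with the rule n = n(l): an approximation
% term that decreases in n plus an estimation term that grows in n.
V(l) \;\asymp\; r_{n(l)} \;+\; \sqrt{\frac{h_{n(l)} \ln l}{l}}
```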

70
Bounds for the regression estimation problem
71
The model of regression estimation by series
expansion
77
Example
80
The problem of approximating functions
91
  • To get a high asymptotic rate of approximation, the only constraint is that the kernel should be a bounded function that can be described as a family of functions possessing finite VC dimension

92
Problem of local risk minimization
96
Local Risk Minimization Model
106
Note
  • Using local risk minimization methods, one probably does not need rich sets of approximating functions, whereas the classical semi-local methods are based on using a set of constant functions

107
Note
  • For local estimation of functions in the one-dimensional case, it is probably enough to consider elements S_k, k = 0, 1, 2, 3, containing the polynomials of degree 0, 1, 2, 3 (a sketch follows below)
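
A minimal sketch of such a local low-degree polynomial estimator; the hard window used as the locality function and all names are our assumptions, since the slides' model was not transcribed:

```python
import numpy as np

def local_poly_fit(x, y, x0, degree=1, width=0.1):
    """Fit a polynomial of the given degree (0..3) by least squares
    on the training points falling in a window around x0 and return
    the fitted value at x0. Degree 0 recovers the classical "local
    constant" (semi-local) method mentioned above."""
    mask = np.abs(x - x0) <= width
    if mask.sum() <= degree:          # too few local points to fit
        return float("nan")
    coeffs = np.polyfit(x[mask] - x0, y[mask], deg=degree)
    return np.polyval(coeffs, 0.0)    # value of the local fit at x0

# Hypothetical usage on noisy samples of a smooth target.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2.0 * np.pi * x) + 0.1 * rng.standard_normal(200)
print(local_poly_fit(x, y, x0=0.5, degree=2))
```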

108
Summary
  • MDL
  • SRM
  • Local Risk Functional