1
Robustness through Prior Knowledge: Using
Explanation-Based Learning to Distinguish
Handwritten Chinese Characters

Gerald DeJong
Computer Science
University of Illinois at Urbana
mrebl_at_uiuc.edu
Qiang Sun, Shiau Hong Lim, Li-Lun Wang

2
Challenges of Noisy Unstructured Text Data

Noise: working with real input
Bottom-up limitations
Some true noise
Some self-induced variability
More reliant on prior structure
Lack of structure: problem complexity
Top-down limitations
Highly structured: little variability
More reliant on input (noisy or otherwise)

3
Noise

True noise
Missing information
Extra information
Random / Normal(?)
Induced noise
Imperfect representation
Pixelization
Staircasing
Extra / missing blobs or pixels
Variability
Unmodeled / approximated world dynamics
Ignored parameters / covariates
Not random
Convenient to pretend it is true noise

4
Structured vs. Unstructured
With more structure, less induced noise
Relatively unstructured:
Call me Ishmael. Some years ago - never mind how
long precisely - having little or no money in my
purse, and nothing particular to interest me on
shore, I thought I would sail about a little and
see the watery part of the world. It is a way I
have of driving off the spleen, and regulating
the circulation
Very structured:
Name: Ishmael
Finances: Low
Problem: Bored, Spleen
Date: Recent?

5
Unstructured: Deal with the Noise

With structure → programming problem
Without structure → learning problem
Learn signal from noise via training examples
Each training example contains little information
Is there enough information?
Task dependent
Difficulty: subtlety of required processing
Two statistical NLP question types
How large is Brazil?
Will the Fed raise interest rates?
The second requires integrating lots of partial
evidence

6
Machine Learning as an Empirically Guided Search
through a Hypothesis Space
Example Space X with Training Set Z
Hypothesis Space H
7
What Makes a Learning Problem Hard?

Expressiveness of hypothesis space H
Large / Diverse / Complex H
More bad hypotheses can masquerade as good
More training examples are required for desired
confidence
Want high confidence that a learner will produce
a good approximation of the true concept
Cost: more information → more training examples
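
The link between the size of H and the number of training examples can be made concrete with the standard PAC bound for a finite hypothesis space and a consistent learner: m >= (1/epsilon)(ln|H| + ln(1/delta)). The sketch below is an illustration that is not part of the slides; the function name and the example sizes are assumptions.

```python
import math

def pac_sample_bound(hypothesis_count, epsilon, delta):
    """Training examples sufficient for a consistent learner over a finite
    hypothesis space to reach error <= epsilon with probability >= 1 - delta."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)

# A richer hypothesis space needs more examples for the same guarantee.
small_space = pac_sample_bound(2 ** 10, epsilon=0.05, delta=0.05)
large_space = pac_sample_bound(2 ** 100, epsilon=0.05, delta=0.05)
print(small_space, large_space)
```

The bound grows only logarithmically in |H|, but a kernel-induced space such as the cubic one discussed later is far richer than these toy sizes.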

8
Explanation-Based Learning: Information Beyond
Training Examples

Utilize existing domain knowledge
Treat training examples as illustrations of a
deeper pattern
Explain how the assigned class label may arise
from an example's properties
Explanations suggest the deeper patterns
Calibrate and confirm using other training
examples

9
Two Kinds of Prior Knowledge

Solution Knowledge is directly relevant to a
specific classification task.
Can be readily used to bias a learning system.
But it requires the expert to already know the
solution and to possess expertise about the
machine learner and its bias space.
Domain Knowledge is more abstract and not tied to
any particular classification task.
The same pen will leave similar-width strokes.
Only indirectly helpful for telling a 3 from a
6
Easy for human experts to articulate.
Difficult to express in a statistical learner's
bias vocabulary

10
Solution vs. Domain Knowledge

3 vs 8
Right half: little information
Left half: much more information
Solution knowledge: pay attention to the left
half
Domain knowledge
Prior idealized stroke representations
Conjecture differential information
Calibrate / verify with training data
EBL
Derive solution knowledge
Use domain knowledge
Interacting with training examples

11
The Explanation-Based Learning Approach: Transform
Domain Knowledge into Solution Knowledge.

Conjecture explanations for some training labels
using Domain Knowledge.
Evaluate explanation quality using the rest of
the training set.
Assemble statistically confirmed explanations
into Solution Knowledge.
Adjust the statistical learner's bias to reflect
the new Solution Knowledge.
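
The evaluate-and-assemble steps above can be sketched as a filter over conjectured explanations. This is a minimal illustration, not the authors' procedure: the function name, the `agrees` callback, and the simple agreement-rate threshold (standing in for whatever statistical test the talk used) are all assumptions.

```python
def confirm_explanations(candidates, held_out, agrees, min_support=0.75):
    """Keep explanations that agree with enough of the remaining training
    examples. `agrees(explanation, example)` is a hypothetical callback that
    returns True when the example is consistent with the explanation."""
    confirmed = []
    for explanation in candidates:
        hits = sum(1 for example in held_out if agrees(explanation, example))
        if held_out and hits / len(held_out) >= min_support:
            confirmed.append(explanation)
    return confirmed

# Toy data: each example records whether its left / right half was distinctive.
examples = [{"left": True, "right": False}] * 4 + [{"left": False, "right": False}]
kept = confirm_explanations(["left", "right"], examples,
                            agrees=lambda name, example: example[name])
print(kept)
```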

12
SVM Background (Support Vector Machines)

Generic: few parameters to manipulate
Linear AND nonlinear
Linear in a high dimensional dot product space
Nonlinear in the input feature space
Expressiveness: nonlinear
Cost: linear (convex optimization)
Two cute nuggets
Large margin: prefer low capacity / reduce
overfitting
Kernel function (kernel trick): compact,
efficient, expressive

13
Handwritten Digits: an ML success story(?)

Pixel input, e.g.
32 × 32 × 8 bits
x: 1024 dimensions, 256 values each
Multi-class classifiers
Ten index classifiers: 1-vs-all
Four Boolean encoders
All pairs w/ voting
Generic ANNs work poorly
Generic SVMs work better
Specially designed ANNs work well
Well: < 0.5% overall (LeCun et al. 98; Simard et
al. 03)

We are interested in generic solutions
14
Class Information

Let x be the vector of image pixels: x = (x1, x2,
x3, ..., x1024)
Distributed
No crucial input pixel
Class c: relations among many pixels
x is sufficient
Given the input x, the label is not ambiguous
(at least to people)
Entropy(c | x) ≈ 0
Separator is a function of the input pixels
It must be nonlinear: interaction / relation
among pixels determines class assignment

15
What's the Best Separating Hyperplane?
16
What's the Best Separating Hyperplane?
17
What's the Best Separating Hyperplane?
18
What's the Best Separating Hyperplane?
Margin m
Can use the radius r of the smallest enclosing
sphere. Capacity is related to (r/m)^2
19
Kernel Methods

Map to a new higher dimensional space
Can be very high
Can be infinite
Kernel functions
Introduce high dimensionality
Computation is independent of dimensionality
Defined w/ dot product of input image
vectors (information on the cosine between image
vectors)
A kernel function defines a distance metric over
the space of example images
Points not linearly separable: soft margin,
margin distributions, ...

20
SVMs for Digit Images

K(x,y) = (x · y)^3 or (x · y + 1)^3
Dot product → scalar, then cube it. Consider how
this works
Before: 32^2 features (or about 10^3)
Now: (32^2)^3 features (or about 10^9)
New feature: a monomial, i.e. a correlation among
three pixels
VC(lin sep) dimensions
Overfitting problem?
Not if the margin is large
Monitor number of support vectors
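
A small check of the kernel trick for the cubic kernel: cubing the dot product gives the same number as an ordinary dot product between explicit vectors of all three-pixel monomials, without ever building those vectors. The sketch uses a 3-dimensional input instead of 1024 pixels; the function names are illustrative assumptions.

```python
import numpy as np

def cubic_kernel(x, y):
    """Homogeneous cubic kernel: one dot product, then cube the scalar."""
    return float(np.dot(x, y) ** 3)

def cubic_features(x):
    """Explicit feature map: every ordered product x_i * x_j * x_k (d**3 values)."""
    return np.einsum("i,j,k->ijk", x, x, x).ravel()

x = np.array([1.0, 2.0, 3.0])
y = np.array([0.5, -1.0, 2.0])

implicit = cubic_kernel(x, y)                                   # O(d) work
explicit = float(np.dot(cubic_features(x), cubic_features(y)))  # O(d**3) work
print(implicit, explicit, cubic_features(x).size)
```

For 32 x 32 images the explicit map would have 1024^3 entries, which is why the computation is done through the kernel instead.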

21
Mercer's Condition / Representer Theorem

Kernel matrix is positive semidefinite
The desired hyperplane can be represented as
a linear weighted sum of distances to support
vectors
Kernel defines the distance metric
The hypothesis space is represented efficiently
by using some of the training examples: the
support vectors
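
In the usual textbook form, the resulting classifier is f(x) = sum_i alpha_i * y_i * K(s_i, x) + b over the support vectors s_i. The sketch below evaluates that expression; the function name and the two-point toy problem are assumptions for illustration, not taken from the slides.

```python
import numpy as np

def svm_decision(x, support_vectors, alphas, labels, bias, kernel):
    """Kernel expansion of the separator over the support vectors."""
    return sum(a * t * kernel(s, x)
               for a, t, s in zip(alphas, labels, support_vectors)) + bias

linear = lambda s, x: float(np.dot(s, x))

# Two support vectors on either side of the vertical axis.
support_vectors = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
alphas, labels, bias = [0.5, 0.5], [1.0, -1.0], 0.0

score = svm_decision(np.array([2.0, 3.0]), support_vectors, alphas, labels, bias, linear)
print(score)
```

Swapping `linear` for a polynomial or RBF kernel changes the geometry of the separator without changing this expression.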

22
Distinguishing Handwritten Sevens vs. Twos and
Eights
Handwritten: 32 x 32 gray scale pixels
Input feature space is inappropriate
Map inputs to a high-dimensional space
Many more features: nonlinear combinations
Linearly separable in the new space
23
Mercer Kernels

Usually start with a kernel rather than features
(s · x)^d: homogeneous polynomials
(s · x + 1)^d: complete polynomials
exp(-||s - x||^2 / 2σ^2): Gaussian / RBF
K + k
c · K
K + c
K · k
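
The kernels listed above, and the ways of combining them, can be checked numerically against Mercer's condition by confirming that the Gram matrix on a sample of points has no negative eigenvalues beyond round-off. This is a sketch under assumptions: the helper names, the random sample, the constant c = 2.5, and the tolerance are all illustrative choices.

```python
import numpy as np

def poly_kernel(s, x, degree=3, offset=0.0):
    return (np.dot(s, x) + offset) ** degree

def rbf_kernel(s, x, sigma=1.0):
    return np.exp(-np.sum((s - x) ** 2) / (2.0 * sigma ** 2))

def gram(kernel, points):
    """Matrix of kernel values between every pair of points."""
    return np.array([[kernel(a, b) for b in points] for a in points])

def is_psd(matrix, tol=1e-8):
    """Positive semidefinite up to numerical round-off."""
    eigenvalues = np.linalg.eigvalsh(matrix)
    return bool(eigenvalues.min() >= -tol * max(1.0, eigenvalues.max()))

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
K = gram(poly_kernel, points)
k = gram(rbf_kernel, points)

# Sums, positive scalings, positive shifts, and elementwise products.
combos = {"K + k": K + k, "c * K": 2.5 * K, "K + c": K + 2.5, "K * k": K * k}
results = {name: is_psd(matrix) for name, matrix in combos.items()}
print(results)
```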

24
Problems: SVMs & statistical learning generally

Little information from each training example
Signal must show through the noise
Need many training examples
Thousands are needed for handwritten digits
Much information is ignored (weak bias
vocabulary)
Compare w/ humans
Novel simple shape of similar complexity
Master with several tens (perhaps a hundred)
training examples
Exceedingly small non-fatigue error rate
Chinese characters are much more difficult than
digits

25
Two Related Classification Problems
26
Two Related Classification Problems
27
Two Related Classification Problems
To an SVM these are the same problem
Apparently the SVM ignores information crucial to people
28
Strokes Make the Difference

Explanatory hidden features
Humans know that strokes mediate between pixels
and class labels.
Statistical machine learners find the pattern
using pixel level inputs alone without knowing
about strokes.
What can this example tell us?
Statistical learning algorithms are advanced
enough to extract complex patterns from data.
But simple prior knowledge (e.g., the existence
of strokes) may help to find relevant patterns
faster and more accurately.
Inventing latent features is hard for statistics

29
Domain Knowledge

What can we say about strokes?
Within an image they are written by the same
person using the same writing instrument
They are made by a succession of simple pen
movements
They give rise to the pixels
Much Information! (suppose it did not hold)
This is not easily captured in the native bias
vocabulary (not solution knowledge)
Knowledge about strokes is imperfect so that
building a bottom-up stroke extractor is
error-prone.

30
Primary Domain: Distinguishing Handwritten
Chinese Characters

More complex than digits or Western characters
(64x63 pixels).
Thousands of different characters → few training
examples available for each (200 labeled images
for us).
Domain knowledge includes an ideal prototype
stroke representation for each character.

31
Handwritten Chinese Characters

We selected ten characters in three classes
Yields forty-five classification problems.
Classification difficulty varies significantly
by classification problem.

32
Hough Transform

Old (but good) idea
(x, y) → (m, b) given y = mx + b
Hough transform makes a poor line detector
BUT explaining is easy and reliable (class label
determines the ideal prototype stroke
representation)
We know the lines:
approximate parameters,
geometric constraints
Find / hallucinate the Hough peaks to optimize
the fit
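
A minimal sketch of the slope-intercept Hough transform named above: every pixel (x, y) votes for each (m, b) pair with b = y - m*x, and a line shows up as a peak in the vote counts. The grid of slopes, the intercept bin width, and the function name are assumptions; this parameterization cannot represent vertical lines, and the sketch does not include the prototype-guided peak search described on the slide.

```python
from collections import Counter
import numpy as np

def hough_line_peak(points, m_values, b_step=0.5):
    """Return (slope, intercept, votes) for the most-voted (m, b) cell."""
    votes = Counter()
    for x, y in points:
        for i, m in enumerate(m_values):
            b = float(y - m * x)                 # intercept implied by this slope
            votes[(i, int(round(b / b_step)))] += 1
    (i, b_bin), count = votes.most_common(1)[0]
    return float(m_values[i]), b_bin * b_step, count

# Ten pixels lying exactly on y = 2x + 1.
points = [(x, 2 * x + 1) for x in range(10)]
slope, intercept, count = hough_line_peak(points, np.linspace(-3.0, 3.0, 61))
print(slope, intercept, count)
```

With a known prototype, the search could be restricted to a small window of (m, b) cells around each expected stroke, which is one way to read "find / hallucinate the Hough peaks".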

33
Feature Kernel Functions

Design special-purpose kernel functions
Adapt distance metric to fit the task
Emphasize expected high-information content pixels

34
Explaining Chinese Characters

A pixel is judged to be informative if it is
likely to be part of an informative stroke
feature.
Stroke features are informative if they are
distinctive between the ideal prototype
characters.
Interaction between training examples and the
prior domain knowledge is crucial.

35
Constructing Explanations

From domain knowledge, the top and bottom
horizontal strokes are unlikely to be
informative.
Explanation: apply a linear Hough transformation
to identify lines in the image, and associate
pixels in the images with strokes.
Prototype stroke representations greatly aid in
identifying the pixel-stroke correspondence in
training examples (but not test examples).
High information pixels correspond to distinctive
stroke-level features

36
What is an Explanation for the Feature Kernel
Function Approach?

An account of where the class information is
expected to be found within the input image
pixels
Uniform emphasis over the disk containing 90% of the
probability mass of the fitted Gaussian
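
One way to picture this emphasis: fit a 2D Gaussian to the pixels of an informative stroke, mark every pixel inside the region holding 90% of its mass, and weight those pixels more heavily inside a cubic kernel. The slides do not give the exact kernel, so the weighting scheme, the `boost` value, and both function names below are assumptions. For a 2D Gaussian the 90% region is where the squared Mahalanobis distance is at most -2 ln(0.1).

```python
import numpy as np

def emphasis_mask(stroke_pixels, shape, mass=0.90, boost=3.0):
    """Weight `boost` inside the Gaussian's `mass` region, 1.0 elsewhere."""
    pts = np.asarray(stroke_pixels, dtype=float)
    mean = pts.mean(axis=0)
    cov = np.cov(pts, rowvar=False) + 1e-6 * np.eye(2)   # keep it invertible
    rows, cols = np.indices(shape)
    diff = np.stack([rows - mean[0], cols - mean[1]], axis=-1)
    maha2 = np.einsum("...i,ij,...j->...", diff, np.linalg.inv(cov), diff)
    threshold = -2.0 * np.log(1.0 - mass)                # chi-square, 2 dof
    return np.where(maha2 <= threshold, boost, 1.0)

def feature_kernel(x, y, weights, degree=3):
    """Cubic kernel on pixel vectors rescaled by the emphasis weights."""
    w = weights.ravel()
    return float(np.dot(w * x.ravel(), w * y.ravel()) ** degree)

stroke = [(3, 3), (3, 4), (4, 3), (4, 4), (3, 5), (5, 3)]
mask = emphasis_mask(stroke, shape=(8, 8))

rng = np.random.default_rng(1)
image_a, image_b = rng.random((8, 8)), rng.random((8, 8))
print(feature_kernel(image_a, image_b, mask))
```

Because the weights only rescale input coordinates before a standard polynomial kernel, the result is still a valid Mercer kernel.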

37
Experiments: Feature kernel function vs
conventional (cubic polynomial SVM)
FKF: similar performance with nearly an order of
magnitude less training
Performance by problem: scatter plot for 45 problems
All problems improve; FKF never hurts
Lower slope? (suggests hardest problems are
helped most)
38
Experiments: Feature kernel function vs
conventional (cubic polynomial SVM)

Learning curves by problem difficulty (as judged
by SVM accuracy)
A) Hardest B) Middle C) Easiest third

39
Experiments: Feature kernel function vs
conventional (cubic polynomial SVM)
For each problem at full training, FKF always uses
fewer support vectors
Interaction between prior knowledge and training
examples is crucial
40
Explanation-Augmented Support Vector Machine

EA-SVM: another approach
Previous approach adapted the kernel function
EA-SVM alters the SVM algorithm and uses a standard
kernel function
Explanations are integrated directly as a bias

41
EA-SVM: What is an Explanation?

An explanation is a generalization of a training
example, a proposed equivalence class of
examples.
Same explanation implies same label for the same
reason, and should be treated the same by the
classifier.
For an SVM, examples with the same explanation
should have the same margin.
A perfect explanation is a hyperplane to which
the classifier should be parallel
Explanations are not perfect.
So prefer a decision surface that is more nearly
parallel to confirmed explanations.
Penalize non-parallelness

42
Formalizing the Constraints Mathematically

Let an explanation justify the label for a given
example x using only a subset e of features. The
explained example v is defined as
The special symbol indicates that this
feature does not participate in the inner product
evaluation. With numerical features one can
simply use the value zero.
The constraints can be expressed as
or equally
Geometrically, this requires the classifier
hyperplane to be parallel to the direction x - v.
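
The slide's formulas did not survive in this transcript. One reading consistent with the surrounding text is sketched below; it is a reconstruction under assumptions, not the authors' exact notation. The weight vector w, the offset b, and the placeholder symbol (written here as an asterisk) are assumed names.

```latex
% Explained example: keep the features the explanation uses, blank out the rest.
v_j =
\begin{cases}
  x_j  & \text{if } j \in e \\
  \ast & \text{otherwise (treated as } 0 \text{ for numerical features)}
\end{cases}

% Same explanation, same margin: x and v must receive the same score,
\mathbf{w} \cdot \mathbf{x} + b = \mathbf{w} \cdot \mathbf{v} + b
% or equally
\mathbf{w} \cdot (\mathbf{x} - \mathbf{v}) = 0
```

The last line says that w is orthogonal to x - v, which is the same as the separating hyperplane being parallel to that direction.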

43
EA-SVMs: Explanation-Augmented Support Vector
Machines

Incorporate high quality explanations into a
conventional SVM
Classifier reflects information from both
examples and domain knowledge.
Optimal classifier blends
Maximal conventional margin to training examples
Maximally parallel to high quality explanations
We use soft constraints for each.
Similar analyses using two sets of slack
variables.
Linear blending via cross validation.

44
The EA-SVM Optimization Problem

Perfect knowledge
Imperfect knowledge
Introduce positive new slack variables (?i)
The optimization problem becomes
K, the confidence parameter, is determined by
cross-validation; it blends empirical and
explanation information
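
The blending role of K can be illustrated with a smoothed stand-in for the EA-SVM objective, solved by plain gradient descent. This is not the authors' quadratic program: the squared hinge loss, the squared parallelism penalty, the function name, the step size, and the four-point toy data are all assumptions chosen so the sketch stays short and converges. The objective minimized is 0.5*||w||^2 + C*sum(hinge_i^2) + K*sum((w . (x_i - v_i))^2).

```python
import numpy as np

def fit_ea_svm(X, y, V, C=1.0, K=0.0, lr=0.004, steps=5000):
    """Gradient descent on a smoothed EA-SVM-style objective (illustrative).
    The step size is tuned for this toy data; larger K needs a smaller lr."""
    D = X - V                        # directions the separator should be parallel to
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
        grad_w = w - 2.0 * C * (hinge * y) @ X + 2.0 * K * (D @ w) @ D
        grad_b = -2.0 * C * np.sum(hinge * y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -2.0], [-2.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
V = X * np.array([1.0, 0.0])         # explanations use only feature 0; the rest is zeroed

w_svm, b_svm = fit_ea_svm(X, y, V, K=0.0)    # explanations ignored
w_ea, b_ea = fit_ea_svm(X, y, V, K=10.0)     # explanations trusted
print(w_svm, w_ea)
```

With K = 0 the fit leans on the unexplained second feature; with a larger K that weight shrinks toward zero while all four points stay correctly classified.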

45
Solutions for EA-SVM

With perfect knowledge
where
With imperfect knowledge
where
When confidence parameter K goes to infinity, the
second solution reduces to the same as the first
one.
When K and the ?i are 0, the problem ignores the
explanations and reduces to a standard SVM.

46
Formal Analysis: Why EA-SVM works

EA-SVM algorithm minimizes the following error
bound

Interesting symbols in the expression of h
Rv: the radius of the ball that contains all the
explained examples. We expect Rv < R.
D: the penalty incurred when a separator (u, b) violates
the parallel constraints imposed by explanations.
D is determined by cross-validation to minimize
h.

47
A Simple Prediction

A closer look at h
With perfect knowledge, D = 0
Without knowledge
EA-SVM has most to offer when the ratio Rv / R is
small, which means explanations use few
important features to justify the label.
Intuitively, the learning problem is difficult
but the domain knowledge is informative.

48
Experiment 1: Does Explanation-Augmentation Help?
Results for 45 classifiers on pairs of Chinese
characters. Below the line means EA-SVM makes
fewer errors than SVM.
49
Experiment 2: Difficult Problems Benefit More
EA-SVM vs. SVM
Easy tasks: similar
Difficult tasks: EA-SVM wins at all training levels.
Task difficulty is highly correlated with
improvement of EA-SVM over conventional SVM.
50
Exp 3: Robustness and the Effect of Knowledge
Quality

EA-SVM benefits from good knowledge, and is not
hurt by incorrect knowledge.

51
Exp 4: Additional (Non-image) Domains.

Protein explanations: only known motif sequences
are important for protein categorization.
Text explanations: only words related to the
category label are important.
ROC (protein) and F1 (text) scores show EA-SVM
improvement.

52
Previous Work on Incorporating Knowledge into
SVMs (Solution Knowledge)

Incorporating transformation invariance into
SVMs.
Virtual support vector (Schölkopf, 1996)
Invariant kernel function (Schölkopf, 2002)
Jittered SVM (DeCoste & Schölkopf, 2002)
Tangent propagation (Simard 1992, 1998)
Locally-improved kernel function explores spatial
locality property (Schölkopf, 1998)
Convolutional networks (LeCun et al. 1998; Simard
et al. 2003)
Knowledge-based SVMs and kernels incorporate
prior rules. (Fung, Mangasarian & Shavlik, 2002,
2003; Mangasarian, Shavlik & Wild, 2004)
Extracting character high-level features from
pixel representation. (Teow 2000; Shi 2003; Kadir
2004)

53
Conclusion

Inductive learning algorithms can benefit from
domain knowledge.
This work illustrates a novel direction of using
knowledge by combining EBL ideas into a
statistical learner.
With Domain Knowledge, the expert need not also
be expert in the learning algorithms.
The EBL components are extremely simple; more can
be done.
The role of Domain Knowledge rather than Solution
Knowledge demands further study; this is an
important and little-explored direction.
Next step: IJCAI07 poster, Explanation-Based
Feature Construction, Shiau Hong Lim