Title: Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish Handwritten Chine
1Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish Handwritten Chinese Characters
- Gerald DeJong
- Computer Science
- University of Illinois at Urbana
- mrebl_at_uiuc.edu
- Qiang Sun, Shiau Hong Lim, Li-Lun Wang
2Challenges of Noisy Unstructured Text Data
- Noise: working with real input
- Bottom-up limitations
- Some true noise
- Some self-induced variability
- More reliant on prior structure
- Lack of structure: problem complexity
- Top-down limitations
- Highly structured: little variability
- More reliant on input (noisy or otherwise)
3Noise
- True noise
- Missing information
- Extra information
- Random / Normal(?)
- Induced noise
- Imperfect representation
- Pixelization
- Staircasing
- Extra / missing blobs or pixels
- Variability
- Unmodeled / approximated world dynamics
- Ignored parameters / covariates
- Not random
- Convenient to pretend it is true noise
4Structured vs. Unstructured
Relatively unstructured:
Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation
Very structured:
Name: Ishmael. Finances: Low. Problem: Bored, Spleen. Date: Recent?
With more structure, less induced noise
5Unstructured: Deal with the Noise
- With structure → programming problem
- Without structure → learning problem
- Learn signal from noise via training examples
- Each training example contains little information
- Is there enough information?
- Task dependent
- Difficulty: subtlety of required processing
- Two statistical NLP question types
- How large is Brazil?
- Will the Fed raise interest rates?
- Second requires integrating lots of partial evidence
6Machine Learning as an Empirically Guided Search
through a Hypothesis Space
Example Space X with Training Set Z
Hypothesis Space H
7What Makes a Learning Problem Hard?
- Expressiveness of hypothesis space H
- Large / Diverse / Complex H
- More bad hypotheses can masquerade as good
- More training examples are required for desired confidence
- Want high confidence that a learner will produce a good approximation of the true concept
- Cost: more information → more training examples
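The cost of an expressive H can be made concrete with the standard sample-complexity bound for a consistent learner over a finite hypothesis space, m ≥ (1/ε)(ln|H| + ln(1/δ)). The sketch below is not from the slides; the function name `pac_sample_bound` and the example sizes are illustrative assumptions.

```python
import math

def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Examples sufficient for a consistent learner over a finite H.

    Standard bound: m >= (1/epsilon) * (ln|H| + ln(1/delta)),
    where epsilon is the error tolerance and 1 - delta the confidence.
    """
    if hypothesis_count < 1 or not (0 < epsilon < 1) or not (0 < delta < 1):
        raise ValueError("need |H| >= 1 and 0 < epsilon, delta < 1")
    return math.ceil((math.log(hypothesis_count) + math.log(1 / delta)) / epsilon)

# Hypothetical hypothesis-space sizes, chosen only to show the trend.
small_h = pac_sample_bound(2 ** 10, epsilon=0.05, delta=0.05)
large_h = pac_sample_bound(2 ** 1000, epsilon=0.05, delta=0.05)
print(small_h, large_h)
```

The bound grows with ln|H|, so a richer hypothesis space needs more training examples for the same confidence.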
8Explanation-Based Learning: Information Beyond Training Examples
- Utilize existing domain knowledge
- Treat training examples as illustrations of a deeper pattern
- Explain how the assigned class label may arise from an example's properties
- Explanations suggest the deeper patterns
- Calibrate and confirm using other training examples
9Two Kinds of Prior Knowledge
- Solution Knowledge is directly relevant to a specific classification task.
- Can be readily used to bias a learning system.
- But it requires the expert to already know the solution and to possess expertise about the machine learner and its bias space.
- Domain Knowledge is more abstract and not tied to any particular classification task.
- The same pen will leave similar-width strokes.
- Only indirectly helpful for telling a 3 from a 6
- Easy for human experts to articulate.
- Difficult to express in a statistical learner's bias vocabulary
10Solution vs. Domain Knowledge
- 3 vs. 8
- Right half: little information
- Left half: much more information
- Solution knowledge: pay attention to the left half
- Domain knowledge
- Prior idealized stroke representations
- Conjecture differential information
- Calibrate / verify with training data
- EBL
- Derive solution knowledge
- Use domain knowledge
- Interacting with training examples
[Figure: handwritten 3 and 8]
11The Explanation-Based Learning Approach: Transform Domain Knowledge into Solution Knowledge
- Conjecture explanations for some training labels using Domain Knowledge.
- Evaluate explanation quality using the rest of the training set.
- Assemble statistically confirmed explanations into Solution Knowledge.
- Adjust the statistical learner's bias to reflect the new Solution Knowledge.
12SVM Background (Support Vector Machines)
- Generic: few parameters to manipulate
- Linear AND nonlinear
- Linear in a high dimensional dot product space
- Nonlinear in the input feature space
- Expressiveness: nonlinear
- Cost: linear (convex optimization)
- Two cute nuggets
- Large margin: prefer low capacity / reduce overfitting
- Kernel function (kernel trick): compact, efficient, expressive
13Handwritten Digits: an ML success story(?)
- Pixel input, e.g.
- 32 × 32 × 8 bits
- x: 1024 dimensions, 256 values
- Multi-class classifiers
- Ten index classifiers (1 vs. all)
- Four Boolean encoders
- All pairs w/ voting
- Generic ANNs work poorly
- Generic SVMs work better
- Specially designed ANNs work well
- Well: < 0.5% overall (LeCun et al. 98; Simard et al. 03)
We are interested in generic solutions
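The three multi-class schemes listed above differ in how many binary classifiers they require. The following is a small sketch of that arithmetic for ten classes; the function name and dictionary keys are illustrative assumptions, not terminology from the slides.

```python
import math
from itertools import combinations

def classifier_counts(num_classes: int) -> dict:
    """Number of binary classifiers each multi-class scheme needs."""
    return {
        # One classifier per class: "this class" vs. everything else.
        "one_vs_all": num_classes,
        # One classifier per bit of a binary code for the class index.
        "boolean_encoders": math.ceil(math.log2(num_classes)),
        # One classifier per unordered pair of classes, combined by voting.
        "all_pairs": len(list(combinations(range(num_classes), 2))),
    }

counts = classifier_counts(10)
print(counts)
```

The all-pairs count for ten classes is the same forty-five pairwise problems that appear later for the ten Chinese characters.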
14Class Information
- Let x be the vector of image pixels: $x = (x_1, x_2, x_3, \ldots, x_{1024})$
- Distributed
- No crucial input pixel
- Class c: relations among many pixels
- x is sufficient
- Given the input x, the label is not ambiguous (at least to people)
- Entropy$(c \mid x) \approx 0$
- Separator is a function of the input pixels
- It must be nonlinear: interaction / relation among pixels determines class assignment
15What's the Best Separating Hyperplane?
16What's the Best Separating Hyperplane?
17What's the Best Separating Hyperplane?
18What's the Best Separating Hyperplane?
Margin m
Can use the radius r of the smallest enclosing sphere
Capacity is related to $(r/m)^2$
19Kernel Methods
- Map to a new higher dimensional space
- Can be very high
- Can be infinite
- Kernel functions
- Introduce high dimensionality
- Computation is independent of dimensionality
- Defined w/ dot product of input image vectors (information on the cosine between image vectors)
- A kernel function defines a distance metric over the space of example images
- Points not linearly separable: soft margin, margin distributions
20SVMs for Digit Images
- $K(x, y) = (x \cdot y)^3$ or $(x \cdot y + 1)^3$
- Dot product → scalar; cube it. Consider how this works
- Before: $32^2$ features (or about $10^3$)
- Now: $(32^2)^3$ features (or about $10^9$)
- New feature: monomial, a correlation among three pixels
- VC(lin sep) ~ dimensions
- Overfitting problem?
- Not if the margin is large
- Monitor number of support vectors
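The claim that cubing one dot product stands in for a much larger feature space can be checked numerically. The sketch below uses 4 toy "pixels" instead of 1024 so that the explicit features fit in memory; the function names are illustrative assumptions.

```python
import itertools
import numpy as np

def cubic_kernel(x, y):
    """Homogeneous cubic kernel: one dot product, then cube the scalar."""
    return float(np.dot(x, y)) ** 3

def cubic_features(x):
    """Explicit map: every ordered product x[i] * x[j] * x[k] (n**3 features)."""
    n = len(x)
    return np.array([x[i] * x[j] * x[k]
                     for i, j, k in itertools.product(range(n), repeat=3)])

rng = np.random.default_rng(0)
x, y = rng.random(4), rng.random(4)   # 4 toy pixels instead of 32 * 32

implicit = cubic_kernel(x, y)                                   # O(n) work
explicit = float(np.dot(cubic_features(x), cubic_features(y)))  # O(n**3) work
print(len(cubic_features(x)), np.isclose(implicit, explicit))
```

Both routes give the same inner product, but the kernel never materializes the monomial features, which is why the computation is independent of the dimensionality of the new space.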
21Mercer's Condition / Representer Theorem
- <Kernel matrix is positive semidefinite>
- The desired hyperplane can be represented as
- Linear weighted sum of distances to support vectors
- Kernel defines the distance metric
- The hypothesis space is represented efficiently by using some of the training examples: the support vectors
22Distinguishing Handwritten Sevens vs. Twos and
Eights
Handwritten 32 x 32 gray scale pixels
Input feature space is inappropriate
Map inputs to a high-dimensional space
Many more features: nonlinear combinations
Linearly separable in the new space
23Mercer Kernels
- Usually start with a kernel rather than features
- $(s \cdot x)^d$: homogeneous polynomials
- $(s \cdot x + 1)^d$: complete polynomials
- $\exp(-\|s - x\|^2 / 2\sigma^2)$: Gaussian / RBF
- $K + k$
- $c \cdot K$
- $K + c$
- $K \cdot k$
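Reading the last four bullets as closure rules (sum of two kernels, scaling by a positive constant, adding a positive constant, and product of two kernels), a quick numerical check is that each combined Gram matrix stays positive semidefinite. This is a sketch on random toy data; the helper names and the tolerance are assumptions.

```python
import numpy as np

def poly_gram(X, degree=3):
    """Gram matrix of the complete polynomial kernel (s . x + 1)^d."""
    return (X @ X.T + 1.0) ** degree

def rbf_gram(X, sigma=1.0):
    """Gram matrix of the Gaussian kernel exp(-||s - x||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma ** 2))

def is_psd(G, tol=1e-8):
    """True if symmetric G has no eigenvalue below a small relative tolerance."""
    eigenvalues = np.linalg.eigvalsh((G + G.T) / 2.0)
    return bool(eigenvalues.min() >= -tol * max(1.0, eigenvalues.max()))

rng = np.random.default_rng(0)
X = rng.random((20, 5))           # 20 toy examples with 5 features each
K, k, c = poly_gram(X), rbf_gram(X), 2.5

# K * k is the elementwise product, i.e. the Gram matrix of the product kernel.
combined = {"K + k": K + k, "c * K": c * K, "K + c": K + c, "K * k": K * k}
results = {name: is_psd(G) for name, G in combined.items()}
print(results)
```

A passing check on one sample is only evidence, not a proof; the closure properties themselves follow from standard results on positive semidefinite matrices.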
24Problems: SVMs and statistical learning generally
- Little information from each training example
- Signal must show through the noise
- Need many training examples
- Thousands are needed for handwritten digits
- Much information is ignored (weak bias vocabulary)
- Compare w/ humans
- Novel simple shape of similar complexity
- Master with several tens (perhaps a hundred) training examples
- Exceedingly small non-fatigue error rate
- Chinese characters are much more difficult than digits
25Two Related Classification Problems
26Two Related Classification Problems
27Two Related Classification Problems
To an SVM these are the same problem
Apparently the SVM ignores information crucial to people
28Strokes Make the Difference
- Explanatory hidden features
- Humans know that strokes mediate between pixels and class labels.
- Statistical machine learners find the pattern using pixel-level inputs alone, without knowing about strokes.
- What can this example tell us?
- Statistical learning algorithms are advanced enough to extract complex patterns from data.
- But simple prior knowledge (e.g., the existence of strokes) may help to find relevant patterns faster and more accurately.
- Inventing latent features is hard for statistics
29Domain Knowledge
- What can we say about strokes?
- Within an image they are written by the same person using the same writing instrument
- They are made by a succession of simple pen movements
- They give rise to the pixels
- Much information! (suppose it did not hold)
- This is not easily captured in the native bias vocabulary (not solution knowledge)
- Knowledge about strokes is imperfect, so building a bottom-up stroke extractor is error-prone.
30Primary Domain: Distinguishing Handwritten Chinese Characters
- More complex than digits or Western characters (64x63 pixels).
- Thousands of different characters → few training examples available for each (200 labeled images for us).
- Domain knowledge includes an ideal prototype stroke representation for each character.
31Handwritten Chinese Characters
- We selected ten characters in three classes
- Yields forty-five classification problems.
- Classification difficulty varies significantly by classification problem.
32Hough Transform
- Old (but good) idea
- $\langle x, y \rangle \to \langle m, b \rangle$ given $y = mx + b$
- Hough transform makes a poor line detector
- BUT explaining is easy and reliable (class label determines the ideal prototype stroke representation)
- We know the lines
- approximate parameters,
- geometric constraints
- Find / hallucinate the Hough peaks to optimize the fit
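A minimal sketch of the idea above: each pixel votes for every (m, b) pair consistent with y = mx + b, and the peak is searched only near the slope and intercept that the prototype stroke predicts. The function names, bin sizes, and toy points are assumptions for illustration; this is not the authors' implementation, and the slope-intercept form cannot represent vertical strokes.

```python
import numpy as np

def hough_votes(points, m_values, b_edges):
    """Accumulate votes in (m, b) space: each pixel (x, y) votes for b = y - m*x."""
    votes = np.zeros((len(m_values), len(b_edges) - 1), dtype=int)
    for x, y in points:
        b = y - m_values * x                      # one intercept per candidate slope
        cols = np.digitize(b, b_edges) - 1        # intercept bin for each slope
        valid = (cols >= 0) & (cols < votes.shape[1])
        votes[np.nonzero(valid)[0], cols[valid]] += 1
    return votes

def peak_near(votes, m_values, b_edges, m_prior, b_prior, m_tol, b_tol):
    """Strongest cell inside a window around the parameters the prototype predicts."""
    b_centers = (b_edges[:-1] + b_edges[1:]) / 2.0
    window = (np.abs(m_values - m_prior)[:, None] <= m_tol) & \
             (np.abs(b_centers - b_prior)[None, :] <= b_tol)
    masked = np.where(window, votes, -1)          # cells outside the window cannot win
    row, col = np.unravel_index(np.argmax(masked), masked.shape)
    return m_values[row], b_centers[col]

# Toy stroke on y = 0.5x + 3 plus a few stray pixels (hypothetical data).
stroke = [(x, 0.5 * x + 3.0) for x in range(20)]
stray = [(2, 15.0), (7, 1.0), (12, 18.0), (17, 4.0)]
m_values = np.linspace(-2.0, 2.0, 41)             # candidate slopes, step 0.1
b_edges = np.arange(-10.5, 21.5, 1.0)             # intercept bins centered on integers

votes = hough_votes(stroke + stray, m_values, b_edges)
# The prior is deliberately a little off, as an idealized prototype would be.
m_hat, b_hat = peak_near(votes, m_values, b_edges,
                         m_prior=0.4, b_prior=2.5, m_tol=0.3, b_tol=2.0)
print(m_hat, b_hat)
```

Restricting the search to a window is what separates explaining a labeled example from detecting lines bottom-up: the class label says roughly where each stroke should be.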
33Feature Kernel Functions
- Design special-purpose kernel functions
- Adapt distance metric to fit the task
- Emphasize expected high-information content pixels
34Explaining Chinese Characters
- A pixel is judged to be informative if it is likely to be part of an informative stroke feature.
- Stroke features are informative if they are distinctive between the ideal prototype characters.
- Interaction between training examples and the prior domain knowledge is crucial.
35Constructing Explanations
- From domain knowledge, the top and bottom horizontal strokes are unlikely to be informative.
- Explanation: apply a linear Hough transform to identify lines in the image, and associate pixels in the image with strokes.
- Prototype stroke representations greatly aid in identifying the pixel-stroke correspondence in training examples (but not test examples).
- High-information pixels correspond to distinctive stroke-level features
36What is an Explanation for the Feature Kernel
Function Approach?
- An account of where the class information is expected to be found within the input image pixels
- Uniform emphasis over the disk containing 90% of the probability mass of the fitted Gaussian
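One way to picture such an explanation is a per-pixel weight map that is uniform inside the emphasized disk and small elsewhere, folded into a cubic polynomial kernel. Everything below is a hedged sketch: the isotropic Gaussian, the weight values, and all function names are assumptions, and the slides do not specify how the feature kernel function is actually built.

```python
import math
import numpy as np

def radius_for_mass(sigma, mass=0.9):
    """Radius of the disk holding `mass` of an isotropic 2-D Gaussian (assumed shape)."""
    return sigma * math.sqrt(-2.0 * math.log(1.0 - mass))

def disk_mask(shape, center, radius):
    """Boolean image that is True inside the disk."""
    rows, cols = np.ogrid[:shape[0], :shape[1]]
    return (rows - center[0]) ** 2 + (cols - center[1]) ** 2 <= radius ** 2

def emphasis_weights(shape, disks, inside=1.0, outside=0.1):
    """Per-pixel weights: `inside` within any disk, `outside` elsewhere."""
    mask = np.zeros(shape, dtype=bool)
    for center, radius in disks:
        mask |= disk_mask(shape, center, radius)
    return np.where(mask, inside, outside)

def weighted_cubic_kernel(x, y, weights):
    """Cubic polynomial kernel on pixel vectors rescaled by sqrt(weights)."""
    scale = np.sqrt(weights.ravel())
    return float(np.dot(scale * x.ravel(), scale * y.ravel()) + 1.0) ** 3

rng = np.random.default_rng(0)
shape = (8, 8)                                    # toy image size
radius = radius_for_mass(sigma=1.0)
# outside=0.0 makes the kernel ignore unemphasized pixels entirely.
weights = emphasis_weights(shape, [((2, 2), radius)], outside=0.0)

a = rng.random(shape)
b = a.copy()
b[6, 6] += 0.5                                    # change a pixel far outside the disk

same = weighted_cubic_kernel(a, a, weights)
changed = weighted_cubic_kernel(a, b, weights)
print(np.isclose(same, changed))
```

Because the weights are nonnegative, the rescaled kernel is still a polynomial kernel on transformed inputs, so it remains a valid Mercer kernel while changing the distance metric to favor the pixels the explanation points at.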
37Experiments: Feature kernel function vs. conventional (cubic polynomial SVM)
FKF: similar performance with nearly an order of magnitude less training
Performance by problem: scatter plot for 45 problems
All problems improve; FKF never hurts
Lower slope? (suggests hardest problems are helped most)
38Experiments: Feature kernel function vs. conventional (cubic polynomial SVM)
- Learning curves by problem difficulty (as judged by SVM accuracy)
- A) Hardest, B) Middle, C) Easiest third
39Experiments: Feature kernel function vs. conventional (cubic polynomial SVM)
For each problem at full training, FKF always uses fewer support vectors
Interaction between prior knowledge and training examples is crucial
40Explanation-Augmented Support Vector Machine
- EA-SVM: another approach
- Previous approach adapted the kernel function
- EA-SVM alters the SVM algorithm; uses a standard kernel function
- Explanations are integrated directly as a bias
41EA-SVM: What is an Explanation?
- An explanation is a generalization of a training example, a proposed equivalence class of examples.
- Same explanation implies same label for the same reason, and should be treated the same by the classifier.
- For an SVM, examples with the same explanation should have the same margin.
- A perfect explanation is a hyperplane to which the classifier should be parallel
- Explanations are not perfect.
- So prefer a decision surface that is more nearly parallel to confirmed explanations.
- Penalize non-parallelness
42Formalizing the Constraints Mathematically
- Let an explanation justify the label for a given example x using only a subset e of features; the explained example v is defined as
- The special symbol indicates that this feature does not participate in the inner product evaluation. With numerical features one can simply use the value zero.
- The constraints can be expressed as
- or equally
- Geometrically, this requires the classifier hyperplane to be parallel to the direction x − v.
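The equations on this slide did not survive extraction. One reconstruction consistent with the surrounding bullets is sketched below: the explained example keeps the features in e, and requiring equal margins for x and v is the same as requiring the weight vector to be orthogonal to x − v. The weight vector $w$, the offset $b$, the feature index $j$, and the placeholder symbol $*$ are assumptions, not recovered from the source.

```latex
v_j =
\begin{cases}
  x_j & \text{if } j \in e, \\
  *   & \text{otherwise,}
\end{cases}
\qquad
w \cdot x + b \;=\; w \cdot v + b
\quad\Longleftrightarrow\quad
w \cdot (x - v) \;=\; 0 .
```

With numerical features and $*$ replaced by zero, $x - v$ is simply x restricted to the unexplained features, so the constraint says those features should not influence the decision value.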
43EA-SVMs: Explanation-Augmented Support Vector Machines
- Incorporate high quality explanations into a conventional SVM
- Classifier reflects information from both examples and domain knowledge.
- Optimal classifier blends
- Maximal conventional margin to training examples
- Maximally parallel to high quality explanations
- We use soft constraints for each.
- Similar analyses using two sets of slack variables.
- Linear blending via cross validation.
44The EA-SVM Optimization Problem
- Perfect knowledge
- Imperfect knowledge
- Introduce new positive slack variables (?i)
- The optimization problem becomes
- K, the confidence parameter, is determined by cross-validation; it blends empirical and explanation information
45Solutions for EA-SVM
- With perfect knowledge
- where
- With imperfect knowledge
- where
- When the confidence parameter K goes to infinity, the second solution reduces to the first one.
- When K and the ?i are 0, the problem ignores the explanations and reduces to a standard SVM.
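The optimization problems and their solutions are missing from the extracted slides. As a rough illustration of how a confidence parameter K can blend the two sources of information, the sketch below evaluates a linear soft-margin objective with an added penalty on w · (x − v). The absolute-value penalty, the parameter C, and all names are assumptions; this is not the authors' formulation.

```python
import numpy as np

def explained_example(x, explained_idx):
    """Copy of x keeping only the explained features; the rest are set to zero."""
    v = np.zeros_like(x)
    v[explained_idx] = x[explained_idx]
    return v

def ea_svm_objective(w, b, X, y, V, C=1.0, K=1.0):
    """Soft-margin SVM objective plus a penalty for not being parallel to x - v.

    margin term:      0.5 * ||w||^2
    empirical term:   C * sum of hinge losses max(0, 1 - y_i (w . x_i + b))
    explanation term: K * sum of |w . (x_i - v_i)|   (assumed penalty form)
    """
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b))
    non_parallel = np.abs((X - V) @ w)
    return 0.5 * float(w @ w) + C * float(hinge.sum()) + K * float(non_parallel.sum())

# Toy data: feature 0 is explained, feature 1 is not (hypothetical).
X = np.array([[2.0, 1.0], [-2.0, -1.0]])
y = np.array([1.0, -1.0])
V = np.array([explained_example(x, [0]) for x in X])

w_parallel = np.array([1.0, 0.0])   # ignores the unexplained feature
w_tilted = np.array([0.6, 0.8])     # same norm, leans on the unexplained feature

plain_parallel = ea_svm_objective(w_parallel, 0.0, X, y, V, K=0.0)
plain_tilted = ea_svm_objective(w_tilted, 0.0, X, y, V, K=0.0)
blended_parallel = ea_svm_objective(w_parallel, 0.0, X, y, V, K=1.0)
blended_tilted = ea_svm_objective(w_tilted, 0.0, X, y, V, K=1.0)
print(plain_parallel, plain_tilted, blended_parallel, blended_tilted)
```

With K = 0 the two separators score the same, as in a standard SVM; with K > 0 the separator that is parallel to the explanation direction is preferred, and larger K pushes toward treating the explanations as hard constraints.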
46Formal Analysis: Why EA-SVM Works
- EA-SVM algorithm minimizes the following error bound
- Interesting symbols in the expression of h
- $R_v$: the radius of the ball that contains all the explained examples. We expect $R_v < R$.
- D: the penalty incurred when a separator $\langle u, b \rangle$ violates the parallel constraints imposed by explanations.
- D is determined by cross-validation to minimize h.
47A Simple Prediction
- A closer look at h
- With perfect knowledge, D = 0
- Without knowledge
- EA-SVM has the most to offer when the ratio $R_v / R$ is small, which means explanations use few important features to justify the label. Intuitively, the learning problem is difficult but the domain knowledge is informative.
48Experiment 1: Does Explanation-Augmentation Help?
Results for 45 classifiers on pairs of Chinese
characters. Below the line means EA-SVM makes
fewer errors than SVM.
49Experiment 2: Difficult Problems Benefit More
EA-SVM vs. SVM. Easy tasks: similar. Difficult tasks: EA-SVM wins at all training levels.
Task difficulty is highly correlated with improvement of EA-SVM over conventional SVM.
50Exp 3: Robustness and the Effect of Knowledge Quality
- EA-SVM benefits from good knowledge, and is not
hurt by incorrect knowledge.
51Exp 4: Additional (Non-image) Domains
- Protein explanations: only known motif sequences are important for protein categorization.
- Text explanations: only words related to the category label are important.
- ROC (protein) and F1 (text) scores show EA-SVM improvement.
52Previous Work on Incorporating Knowledge into
SVMs (Solution Knowledge)
- Incorporating transformation invariance into SVMs.
- Virtual support vector (Schölkopf, 1996)
- Invariant kernel function (Schölkopf, 2002)
- Jittered SVM (DeCoste & Schölkopf, 2002)
- Tangent propagation (Simard 1992, 1998)
- Locally-improved kernel function explores spatial locality property (Schölkopf, 1998)
- Convolutional networks (LeCun et al. 1998, Simard et al. 2003)
- Knowledge-based SVM and kernels incorporate prior rules. (Fung, Mangasarian & Shavlik, 2002, 2003; Mangasarian, Shavlik & Wild 2004)
- Extracting high-level character features from pixel representation. (Teow 2000, Shi 2003, Kadir 2004)
53Conclusion
- Inductive learning algorithms can benefit from domain knowledge.
- This work illustrates a novel direction of using knowledge by combining EBL ideas into a statistical learner.
- With Domain Knowledge, the expert need not also be expert in the learning algorithms.
- The EBL components are extremely simple; more can be done.
- The role of Domain Knowledge rather than Solution Knowledge demands further study; this is an important and little-explored direction.
- Next step: IJCAI07 poster, Explanation-Based Feature Construction, Shiau Hong Lim