Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish Handwritten Chinese Characters

1
Robustness through Prior Knowledge: Using
Explanation-Based Learning to Distinguish
Handwritten Chinese Characters
  • Gerald DeJong
  • Computer Science
  • University of Illinois at Urbana-Champaign
  • mrebl_at_uiuc.edu
  • Qiang Sun, Shiau Hong Lim, Li-Lun Wang

2
Challenges of Noisy Unstructured Text Data
  • Noise: working with real input
  • Bottom-up limitations
  • Some true noise
  • Some self-induced variability
  • More reliant on prior structure
  • Lack of structure: problem complexity
  • Top-down limitations
  • Highly structured: little variability
  • More reliant on input (noisy or otherwise)

3
Noise
  • True noise
  • Missing information
  • Extra information
  • Random / Normal(?)
  • Induced noise
  • Imperfect representation
  • Pixelization
  • Staircasing
  • Extra / missing blobs or pixels
  • Variability
  • Unmodeled / approximated world dynamics
  • Ignored parameters / covariates
  • Not random
  • Convenient to pretend it is true noise

4
Structured vs. Unstructured
Relatively unstructured vs. very structured:
with more structure, less induced noise.
Call me Ishmael. Some years ago - never mind how
long precisely - having little or no money in my
purse, and nothing particular to interest me on
shore, I thought I would sail about a little and
see the watery part of the world. It is a way I
have of driving off the spleen, and regulating
the circulation ...
Name: Ishmael.  Finances: Low.
Problem: Bored, Spleen.  Date: Recent?
5
Unstructured: Deal with the Noise
  • With structure → programming problem
  • Without structure → learning problem
  • Learn signal from noise via training examples
  • Each training example contains little information
  • Is there enough information?
  • Task dependent
  • Difficulty: subtlety of required processing
  • Two statistical NLP question types
  • How large is Brazil?
  • Will the Fed raise interest rates?
  • The second requires integrating lots of partial
    evidence

6
Machine Learning as an Empirically Guided Search
through a Hypothesis Space
[Figure: example space X with training set Z, mapped into hypothesis space H]
7
What Makes a Learning Problem Hard?
  • Expressiveness of hypothesis space H
  • Large / diverse / complex H
  • More bad hypotheses can masquerade as good
  • More training examples are required for desired
    confidence
  • Want high confidence that a learner will produce
    a good approximation of the true concept
  • Cost: more information → more training examples


8
Explanation-Based Learning: Information Beyond
Training Examples
  • Utilize existing domain knowledge
  • Treat training examples as illustrations of a
    deeper pattern
  • Explain how the assigned class label may arise
    from an example's properties
  • Explanations suggest the deeper patterns
  • Calibrate and confirm using other training
    examples

9
Two Kinds of Prior Knowledge
  • Solution Knowledge is directly relevant to a
    specific classification task.
  • Can be readily used to bias a learning system.
  • But it requires the expert to already know the
    solution and to possess expertise about the
    machine learner and its bias space.
  • Domain Knowledge is more abstract and not tied to
    any particular classification task.
  • The same pen will leave similar-width strokes.
  • Only indirectly helpful for telling a "3" from a
    "6"
  • Easy for human experts to articulate.
  • Difficult to express in a statistical learner's
    bias vocabulary

10
Solution vs. Domain Knowledge
  • 3 vs. 8
  • Right half: little information
  • Left half: much more information
  • Solution knowledge: pay attention to the left
    half
  • Domain knowledge
  • Prior idealized stroke representations
  • Conjecture: differential information
  • Calibrate / verify with training data
  • EBL
  • Derive solution knowledge
  • Use domain knowledge
  • Interacting with training examples

[Images of a handwritten "3" and "8"]
11
The Explanation-Based Learning Approach: Transform
Domain Knowledge into Solution Knowledge.
  • Conjecture explanations for some training labels
    using Domain Knowledge.
  • Evaluate explanation quality using the rest of
    the training set.
  • Assemble statistically confirmed explanations
    into Solution Knowledge.
  • Adjust the statistical learner's bias to reflect
    the new Solution Knowledge.

12
SVM Background (Support Vector Machines)
  • Generic: few parameters to manipulate
  • Linear AND nonlinear
  • Linear in a high-dimensional dot-product space
  • Nonlinear in the input feature space
  • Expressiveness: nonlinear
  • Cost: linear (convex optimization)
  • Two cute nuggets
  • Large margin: prefer low capacity / reduce
    overfitting
  • Kernel function (kernel trick): compact,
    efficient, expressive

13
Handwritten Digits: an ML success story(?)
  • Pixel input, e.g.
  • 32 x 32 x 8 bits
  • x: 1024 dimensions, 256 values
  • Multi-class classifiers
  • Ten index classifiers: one-vs-all
  • Four Boolean encoders
  • All pairs w/ voting
  • Generic ANNs work poorly
  • Generic SVMs work better
  • Specially designed ANNs work well
  • "Well": < 0.5% overall (LeCun et al. 1998; Simard et
    al. 2003)

We are interested in generic solutions
14
Class Information
  • Let x be the vector of image pixels: x = (x1, x2,
    x3, ..., x1024)
  • Distributed
  • No crucial input pixel
  • Class c: relations among many pixels
  • x is sufficient
  • Given the input x, the label is not ambiguous
    (at least to people)
  • Entropy H(c | x) ≈ 0
  • Separator is a function of the input pixels
  • It must be nonlinear: interaction / relation
    among pixels determines class assignment

15
What's the Best Separating Hyperplane?
16
What's the Best Separating Hyperplane?
17
What's the Best Separating Hyperplane?
18
What's the Best Separating Hyperplane?
Margin m.
Can use the radius r of the smallest enclosing
sphere. Capacity is related to (r/m)^2, as in the
bound sketched below.
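The capacity remark above matches the standard margin-based bound for gap-tolerant (large-margin) separators; the following is a reconstruction in conventional notation, not a formula copied from the slide (r is the enclosing-sphere radius, m the margin, d the feature-space dimension):

\mathrm{VCdim} \;\le\; \min\!\left( \left\lceil \frac{r^2}{m^2} \right\rceil,\; d \right) + 1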
19
Kernel Methods
  • Map to a new higher dimensional space
  • Can be very high
  • Can be infinite
  • Kernel functions
  • Introduce high dimensionality
  • Computation is independent of dimensionality
  • Defined w/ the dot product of input image
    vectors (information about the cosine between image
    vectors)
  • A kernel function defines a distance metric over
    the space of example images
  • Points not linearly separable: soft margin,
    margin distributions, ...

20
SVMs for Digit Images
  • K(x,y) = (x · y)^3 or (x · y + 1)^3 (a code sketch
    follows below)
  • Dot product → scalar; cube it. Consider how this
    works
  • Before: 32^2 features (or about 10^3)
  • Now: (32^2)^3 features (or about 10^9)
  • New feature: a monomial correlation among three
    pixels
  • VC(lin sep) ≈ number of dimensions
  • Overfitting problem?
  • Not if the margin is large
  • Monitor the number of support vectors
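As a concrete illustration of the kernel above, here is a minimal sketch (mine, not the authors') of the cubic polynomial kernel on flattened pixel vectors; the function name poly3_kernel and the 1024-dimensional example are illustrative assumptions.

import numpy as np

def poly3_kernel(x, y, c=1.0):
    """Cubic polynomial kernel K(x, y) = (x . y + c)**3.

    x, y: flattened pixel vectors (e.g., 32*32 = 1024 values).
    The kernel implicitly spans all monomials of degree <= 3
    (on the order of 10**9 features) while only ever computing
    a single 1024-dimensional dot product.
    """
    return (np.dot(x, y) + c) ** 3

# Illustrative usage on two random "images"
x = np.random.rand(1024)
y = np.random.rand(1024)
print(poly3_kernel(x, y))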

21
Mercer's Condition / Representer Theorem
  • The kernel matrix is positive semidefinite
  • The desired hyperplane can be represented as
  • A linear weighted sum of distances to support
    vectors (the decision function is sketched below)
  • The kernel defines the distance metric
  • The hypothesis space is represented efficiently
    by using some of the training examples: the
    support vectors
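The decision function implied by the Representer Theorem, reconstructed here in standard SVM notation (the slide's own formula did not survive the transcript); only the support vectors, those with alpha_i > 0, contribute:

f(x) \;=\; \operatorname{sign}\!\Big( \sum_{i \in \mathrm{SV}} \alpha_i \, y_i \, K(x_i, x) \;+\; b \Big)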

22
Distinguishing Handwritten Sevens vs. Twos and
Eights
Handwritten, 32 x 32 gray-scale pixels.
The input feature space is inappropriate: map inputs
to a high-dimensional space (many more features,
nonlinear combinations); linearly separable in the
new space.
23
Mercer Kernels
  • Usually start with a kernel rather than features
  • (s · x)^d : homogeneous polynomials
  • (s · x + 1)^d : complete polynomials
  • exp(-||s - x||^2 / 2σ^2) : Gaussian / RBF
  • Closure: new kernels from old, e.g.
  • K + k
  • c · K
  • K + c
  • K · k

24
Problems: SVMs / statistical learning generally
  • Little information from each training example
  • Signal must show through the noise
  • Need many training examples
  • Thousands are needed for handwritten digits
  • Much information is ignored (weak bias
    vocabulary)
  • Compare w/ humans
  • A novel simple shape of similar complexity
  • Mastered with several tens (perhaps a hundred)
    training examples
  • Exceedingly small non-fatigue error rate
  • Chinese characters are much more difficult than
    digits

25
Two Related Classification Problems
26
Two Related Classification Problems
27
Two Related Classification Problems
To an SVM these are the same problem. Apparently
the SVM ignores information crucial to people.
28
Strokes Make the Difference
  • Explanatory hidden features
  • Humans know that strokes mediate between pixels
    and class labels.
  • Statistical machine learners find the pattern
    using pixel level inputs alone without knowing
    about strokes.
  • What can this example tell us?
  • Statistical learning algorithms are advanced
    enough to extract complex patterns from data.
  • But simple prior knowledge (e.g., the existence
    of strokes) may help to find relevant patterns
    faster and more accurately.
  • Inventing latent features is hard for statistics

29
Domain Knowledge
  • What can we say about strokes?
  • Within an image they are written by the same
    person using the same writing instrument
  • They are made by a succession of simple pen
    movements
  • They give rise to the pixels
  • Much Information! (suppose it did not hold)
  • This is not easily captured in the native bias
    vocabulary (not solution knowledge)
  • Knowledge about strokes is imperfect, so building
    a bottom-up stroke extractor is error-prone.

30
Primary Domain: Distinguishing Handwritten
Chinese Characters
  • More complex than digits or Western characters
    (64 x 63 pixels).
  • Thousands of different characters → few training
    examples available for each (200 labeled images
    for us).
  • Domain knowledge includes an ideal prototype
    stroke representation for each character.

31
Handwritten Chinese Characters
  • We selected ten characters in three classes
  • Yields forty-five classification problems.
  • Classification difficulty varies significantly
    by classification problem.

32
Hough Transform
  • Old (but good) idea
  • ⟨x, y⟩ → ⟨m, b⟩ given y = mx + b
  • The Hough transform makes a poor line detector
  • BUT explaining is easy and reliable (the class label
    determines the ideal prototype stroke
    representation)
  • We know the lines
  • approximate parameters,
  • geometric constraints
  • Find / hallucinate the Hough peaks to optimize
    the fit (an accumulator sketch follows below)
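For concreteness, a minimal linear Hough accumulator in the common angle-offset (rho, theta) parameterization; the slide uses the slope-intercept form ⟨m, b⟩, but the voting logic is the same. This sketch and its names (hough_lines, n_theta, n_rho) are my assumptions, not the authors' code.

import numpy as np

def hough_lines(binary_img, n_theta=180, n_rho=200):
    """Accumulate votes for lines rho = x*cos(theta) + y*sin(theta).

    binary_img: 2-D array of 0/1 ink pixels.
    Returns the accumulator plus the theta/rho grids, so peaks
    (candidate strokes) can be found or, as in the slides,
    "hallucinated" near the prototype's expected parameters.
    """
    h, w = binary_img.shape
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    diag = np.hypot(h, w)
    rhos = np.linspace(-diag, diag, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=int)
    ys, xs = np.nonzero(binary_img)           # ink pixel coordinates
    for x, y in zip(xs, ys):
        for j, t in enumerate(thetas):
            rho = x * np.cos(t) + y * np.sin(t)
            i = int(np.round((rho + diag) / (2 * diag) * (n_rho - 1)))
            acc[i, j] += 1                    # vote for this line
    return acc, thetas, rhos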

33
Feature Kernel Functions
  • Design special-purpose kernel functions
  • Adapt distance metric to fit the task
  • Emphasize expected high-information-content pixels
    (a sketch of one possible realization follows below)
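One plausible realization of such a feature kernel function, offered as an assumption rather than the authors' exact construction: reweight pixels by an explanation-derived emphasis mask and then apply the usual polynomial kernel. Because the mask is a fixed linear rescaling of the inputs, the result remains a valid Mercer kernel.

import numpy as np

def feature_kernel(x, y, weights, degree=3, c=1.0):
    """Polynomial kernel over pixels reweighted by an emphasis mask.

    weights: nonnegative per-pixel emphasis derived from an
    explanation (e.g., high inside the 90%-mass disks of fitted
    Gaussians over distinctive stroke regions, low elsewhere).
    """
    xw = x * weights
    yw = y * weights
    return (np.dot(xw, yw) + c) ** degree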

34
Explaining Chinese Characters
  • A pixel is judged to be informative if it is
    likely to be part of an informative stroke
    feature.
  • Stroke features are informative if they are
    distinctive between the ideal prototype
    characters.
  • Interaction between training examples and the
    prior domain knowledge is crucial.

35
Constructing Explanations
  • From domain knowledge, the top and bottom
    horizontal strokes are unlikely to be
    informative.
  • Explanation: apply a linear Hough transform
    to identify lines in the image and associate
    pixels in the image with strokes.
  • Prototype stroke representations greatly aid in
    identifying the pixel-stroke correspondence in
    training examples (but not test examples).
  • High-information pixels correspond to distinctive
    stroke-level features

36
What is an Explanation for the Feature Kernel
Function Approach?
  • An account of where the class information is
    expected to be found within the input image
    pixels
  • Uniform emphasis over the disk containing 90% of the
    probability mass of the fitted Gaussian

37
Experiments: Feature kernel function vs.
conventional (cubic polynomial) SVM
FKF: similar performance with nearly an order of
magnitude less training data.
Performance by problem: scatter plot for the 45
problems. All problems improve; FKF never
hurts. Lower slope? (suggests the hardest problems are
helped most)
38
Experiments: Feature kernel function vs.
conventional (cubic polynomial) SVM
  • Learning curves by problem difficulty (as judged
    by SVM accuracy)
  • (A) hardest, (B) middle, (C) easiest third

39
Experiments: Feature kernel function vs.
conventional (cubic polynomial) SVM
For each problem at full training, FKF always uses
fewer support vectors.
The interaction between prior knowledge and training
examples is crucial.
40
Explanation-Augmented Support Vector Machine
  • EA-SVM: another approach
  • The previous approach adapted the kernel function
  • EA-SVM alters the SVM algorithm and uses a standard
    kernel function
  • Explanations are integrated directly as a bias

41
EA-SVM: What is an Explanation?
  • An explanation is a generalization of a training
    example, a proposed equivalence class of
    examples.
  • Same explanation implies same label for the same
    reason, and should be treated the same by the
    classifier.
  • For an SVM, examples with the same explanation
    should have the same margin.
  • A perfect explanation is a hyperplane to which
    the classifier should be parallel
  • Explanations are not perfect.
  • So prefer a decision surface that is more nearly
    parallel to confirmed explanations.
  • Penalize non-parallelness

42
Formalizing the Constraints Mathematically
  • Let an explanation justify the label for a given
    example x using only a subset e of the features; the
    explained example v is then defined as below.
  • A special symbol indicates that a
    feature does not participate in the inner-product
    evaluation; with numerical features one can
    simply use the value zero.
  • The constraints can then be expressed as in the
    reconstruction after this list.
  • Geometrically, this requires the classifier
    hyperplane to be parallel to the direction x -
    v.
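A reconstruction of the missing formulas in conventional notation (mine, since the slide's symbols were lost); w is the weight vector of the classifier and e the set of explained features:

v_j = \begin{cases} x_j & j \in e \\ 0 & j \notin e \end{cases}
\qquad
w \cdot x \;=\; w \cdot v
\;\;\Longleftrightarrow\;\;
w \cdot (x - v) \;=\; 0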

43
EA-SVMs Explanation-Augmented Support Vector
Machines
  • Incorporate high quality explanations into a
    conventional SVM
  • Classifier reflects information from both
    examples and domain knowledge.
  • Optimal classifier blends
  • Maximal conventional margin to training examples
  • Maximally parallel to high-quality explanations
  • We use soft constraints for each.
  • Similar analyses using two sets of slack
    variables.
  • Linear blending via cross validation.

44
The EA-SVM Optimization Problem
  • Perfect knowledge
  • Imperfect knowledge
  • Introduce positive new slack variables (ηi)
  • The optimization problem becomes the program
    sketched after this list
  • K, the confidence parameter, is determined by
    cross-validation; it blends empirical and
    explanation information
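A hedged reconstruction of the program in standard soft-margin notation; the authors' exact formulation may differ (the symbols C, xi_i, eta_i and the way the parallelness penalty enters are my assumptions):

\min_{w,\,b,\,\xi,\,\eta}\;\; \tfrac{1}{2}\lVert w\rVert^{2}
  \;+\; C\sum_i \xi_i \;+\; K\sum_i \eta_i
\quad\text{s.t.}\quad
  y_i\,(w\cdot x_i + b) \ge 1 - \xi_i,\;\;
  \lvert\, w\cdot(x_i - v_i)\rvert \le \eta_i,\;\;
  \xi_i \ge 0,\; \eta_i \ge 0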

45
Solutions for EA-SVM
  • With perfect knowledge
  • where
  • With imperfect knowledge
  • where
  • When the confidence parameter K goes to infinity,
    the second solution reduces to the same as the first
    one.
  • When K and the ηi are 0, the problem ignores the
    explanations and reduces to a standard SVM.

46
Formal Analysis Why EA-SVM works
  • The EA-SVM algorithm minimizes an error
    bound h
  • Interesting symbols in the expression of h:
  • Rv: the radius of the ball that contains all the
    explained examples. We expect Rv < R.
  • D: the penalty when a separator ⟨u, b⟩ violates the
    parallelness constraints imposed by explanations.
  • D is determined by cross-validation to minimize
    h.

47
A Simple Prediction
  • A closer look at h
  • With perfect knowledge, D = 0
  • Without knowledge
  • EA-SVM has the most to offer when the ratio Rv / R is
    small, which means explanations use few
    important features to justify the label.
    Intuitively, the learning problem is difficult
    but the domain knowledge is informative.

48
Experiment 1 Does Explanation-Augmentation Help?
Results for 45 classifiers on pairs of Chinese
characters. Below the line means EA-SVM makes
fewer errors than SVM.
49
Experiment 2 Difficult Problems Benefit More
EA-SVM vs. SVM: easy tasks are similar; on difficult
tasks EA-SVM wins at all training levels.
Task difficulty is highly correlated with the
improvement of EA-SVM over conventional SVM.
50
Exp 3: Robustness and the Effect of Knowledge
Quality
  • EA-SVM benefits from good knowledge, and is not
    hurt by incorrect knowledge.

51
Exp 4: Additional (Non-image) Domains
  • Protein explanations: only known motif sequences
    are important for protein categorization.
  • Text explanations: only words related to the
    category label are important.
  • ROC (protein) and F1 (text) scores show EA-SVM
    improvement.

52
Previous Work on Incorporating Knowledge into
SVMs (Solution Knowledge)
  • Incorporating transformation invariance into
    SVMs:
  • Virtual support vectors (Schölkopf, 1996)
  • Invariant kernel functions (Schölkopf, 2002)
  • Jittered SVM (DeCoste & Schölkopf, 2002)
  • Tangent propagation (Simard, 1992, 1998)
  • Locally-improved kernel function exploiting the spatial
    locality property (Schölkopf, 1998)
  • Convolutional networks (LeCun et al., 1998; Simard
    et al., 2003)
  • Knowledge-based SVMs and kernels incorporating
    prior rules (Fung, Mangasarian & Shavlik, 2002,
    2003; Mangasarian, Shavlik & Wild, 2004)
  • Extracting high-level character features from the
    pixel representation (Teow, 2000; Shi, 2003; Kadir,
    2004)

53
Conclusion
  • Inductive learning algorithms can benefit from
    domain knowledge.
  • This work illustrates a novel direction of using
    knowledge by combining EBL ideas into a
    statistical learner.
  • With Domain Knowledge, the expert need not also
    be expert in the learning algorithms.
  • The EBL components are extremely simple; more can
    be done.
  • The role of Domain Knowledge rather than Solution
    Knowledge demands further study; this is an
    important and little-explored direction.
  • Next step: IJCAI-07 poster, "Explanation-Based
    Feature Construction" (Shiau Hong Lim)