Title: Nonlinear Knowledge in Kernel Approximation
1. Nonlinear Knowledge in Kernel Approximation
- Olvi Mangasarian
- UW Madison and UCSD La Jolla
- Edward Wild
- UW Madison
2. Objectives
- Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation, without transforming (kernelizing) the knowledge
- Secondary objective: achieve transparency of the prior knowledge for practical applications
3. Outline
- Use kernels for function approximation
- Incorporate prior knowledge
  - Previous approaches require transformation of the knowledge
  - The new approach does not require any transformation of the knowledge
  - Knowledge given over completely arbitrary sets
- Experimental results
  - Two synthetic examples and one real-world example related to breast cancer prognosis
  - Approximations with prior knowledge are more accurate than approximations without prior knowledge
4. Function Approximation
- Given a set of m points in n-dimensional real space R^n and corresponding function values in R
  - Points are represented by the rows of a matrix A ∈ R^{m×n}
  - Exact or approximate function values for the points are given by a corresponding vector y ∈ R^m
- Find a function f: R^n → R based on the given data, such that f(A_i') ≈ y_i
5. Function Approximation
- Using only the given data may result in a poor approximation
  - Points may be noisy
  - Sampling may be costly
- Solution: use prior knowledge to improve the approximation
6. Adding Prior Knowledge
- Standard approach: fit a function at the given data points, without knowledge
- Constrained approach: satisfy inequalities at given points
- 2004 MSW paper: satisfy linear inequalities over polyhedral regions
- Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
7. Kernel Approximation
- Approximate f by a nonlinear kernel function K
  - f(x) ≈ K(x', A')α + b
  - Gaussian kernel: K(x', A')α = Σ_i exp(-μ‖x - A_i'‖²) α_i
- Error in the kernel approximation of the given data
  - -s ≤ K(A, A')α + be - y ≤ s
  - e is a vector of ones in R^m
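As a minimal sketch of the kernel above (the kernel-width parameter μ is a hypothetical choice, not given on the slide), the Gaussian kernel matrix K(A, A') can be computed as:

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.5):
    """Entry (i, j) is exp(-mu * ||X_i - A_j||^2) for rows X_i, A_j."""
    # Pairwise squared Euclidean distances between rows of X and rows of A.
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

A = np.array([[0.0], [1.0], [2.0]])  # m = 3 points in R^1 (rows of A)
K = gaussian_kernel(A, A)            # K(A, A') is m x m
# Each diagonal entry is exp(0) = 1: zero distance from a point to itself.
```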
8. Kernel Approximation
- Trade off solution complexity (‖α‖₁) against data fitting (‖s‖₁)
- Convert to a linear program: minimize e'a + ν e's over (α, b, s, a), subject to -s ≤ K(A, A')α + be - y ≤ s and -a ≤ α ≤ a
- At the solution
  - e'a = ‖α‖₁
  - e's = ‖s‖₁
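A sketch of that linear program with scipy.optimize.linprog; the values of μ and ν and the variable ordering z = (α, b, s, a) are illustrative implementation choices, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: m = 3 one-dimensional points with exact values y.
A = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
m = A.shape[0]
mu, nu = 0.5, 10.0  # illustrative kernel-width and trade-off parameters

d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-mu * d2)  # Gaussian kernel K(A, A')
e = np.ones(m)

# Variables z = (alpha, b, s, a); minimize e'a + nu * e's.
c = np.concatenate([np.zeros(m), [0.0], nu * e, e])
I, Z = np.eye(m), np.zeros((m, m))
A_ub = np.block([
    [ K,  e[:, None],           -I,  Z],  #  K alpha + b e - s <= y
    [-K, -e[:, None],           -I,  Z],  # -K alpha - b e - s <= -y
    [ I,  np.zeros((m, 1)),      Z, -I],  #  alpha - a <= 0
    [-I,  np.zeros((m, 1)),      Z, -I],  # -alpha - a <= 0
])
b_ub = np.concatenate([y, -y, np.zeros(2 * m)])
bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)  # s, a >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
alpha, b = res.x[:m], res.x[m]
pred = K @ alpha + b  # fitted values at the given points
```

With a large ν the slack s is driven to zero and the data are fit exactly; a smaller ν trades fitting accuracy for a sparser, smaller-‖α‖₁ solution.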
9. Incorporating Nonlinear Prior Knowledge (MSW 2004)
- Linear prior knowledge: Bx ≤ d ⇒ α'Ax + b ≥ h'x + β
- Need to kernelize the knowledge from the input space to the feature space of the kernel
  - Requires the change of variable x = A't
  - BA't ≤ d ⇒ α'AA't + b ≥ h'A't + β
  - K(B, A')t ≤ d ⇒ α'K(A, A')t + b ≥ h'A't + β
- Motzkin's theorem of the alternative gives an equivalent linear system of inequalities, which is added to a linear program
- Achieves good numerical results, but the kernelization is not readily interpretable in the original space
10. Incorporating Nonlinear Prior Knowledge: New Approach
- Start with an arbitrary nonlinear knowledge implication
  - g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ ⊆ R^n
  - g and h are arbitrary functions: g: Γ → R^k, h: Γ → R
- Problem: need to add this knowledge to the optimization problem
- Logically equivalent system: g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has no solution x ∈ Γ
11. Prior Knowledge as a System of Linear Inequalities
- Use a theorem of the alternative for convex functions. Assume that g(x) and K(x', A')α + b - h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either
  - I. g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has a solution x ∈ Γ, or
  - II. ∃v ∈ R^k, v ≥ 0: K(x', A')α + b - h(x) + v'g(x) ≥ 0, ∀x ∈ Γ,
- but never both.
- If we can find v ≥ 0 with K(x', A')α + b - h(x) + v'g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has no solution x ∈ Γ, or equivalently
  - g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ
12. Proof
- (¬I ⇒ II)
  - Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0.
- (II ⇒ ¬I)
  - Suppose not, i.e., suppose both systems hold: there exist x ∈ Γ and v ∈ R^k such that
    - g(x) ≤ 0, K(x', A')α + b - h(x) < 0, (I)
    - v ≥ 0, v'g(x) + K(x', A')α + b - h(x) ≥ 0, ∀x ∈ Γ. (II)
  - Since v ≥ 0 and g(x) ≤ 0 give v'g(x) ≤ 0, we have the contradiction 0 > v'g(x) + K(x', A')α + b - h(x) ≥ 0.
- The direction (II ⇒ ¬I) requires no assumptions on g, h, K, or Γ whatsoever.
13. Example: g(x) = 1250 - x³, f(x) = x⁴, h(x) = x² + 5000
- [Figure: exhibiting a v ≥ 0 with v·g(x) + f(x) - h(x) ≥ 0 (system II) certifies the implication g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000 (the negation of system I).]
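A quick numerical check of this example; the multiplier v = 8 is an illustrative choice, not taken from the slides:

```python
import numpy as np

g = lambda x: 1250 - x**3      # g(x) <= 0  <=>  x^3 >= 1250
f = lambda x: x**4
h = lambda x: x**2 + 5000

v = 8.0                        # illustrative multiplier v >= 0
x = np.linspace(-20.0, 20.0, 100001)
cert = v * g(x) + f(x) - h(x)  # certificate of alternative II

# cert stays nonnegative on the sampled interval, so on that interval the
# implication x^3 >= 1250  =>  x^4 >= x^2 + 5000 follows from the theorem.
```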
14. Incorporating Prior Knowledge
- The constraint K(x', A')α + b - h(x) + v'g(x) ≥ 0 for all x ∈ Γ gives a linear semi-infinite program: an infinite number of constraints
- Discretize to obtain a finite linear program
- Slacks allow the knowledge to be satisfied inexactly
- A term is added to the objective function to drive the slacks to zero
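The discretized program can be sketched on a toy 1-D problem. Everything here is an illustrative assumption rather than the slides' experiments: the data come from f(x) = x², the knowledge implication is g(x) = -x ≤ 0 ⇒ f(x) ≥ h(x) = x² - 1, and μ, ν, σ are made-up weights. Variables are (α, b, s, a, v, z), with knowledge slacks z penalized by σ e'z:

```python
import numpy as np
from scipy.optimize import linprog

def kern(X, A, mu=0.5):  # Gaussian kernel matrix
    return np.exp(-mu * ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2))

# Toy data: fit f(x) = x^2 from 5 exact samples.
A = np.linspace(-2, 2, 5).reshape(-1, 1)
y = A[:, 0] ** 2
m = A.shape[0]
K = kern(A, A)

# Knowledge g(x) = -x <= 0 => f(x) >= h(x) = x^2 - 1, discretized on [0, 2].
T = np.linspace(0, 2, 9).reshape(-1, 1)
p = T.shape[0]
Kt = kern(T, A)
gT = -T[:, 0]            # g at the discretization points (all <= 0)
hT = T[:, 0] ** 2 - 1.0  # h at the discretization points

nu, sigma = 10.0, 100.0  # fit and knowledge penalty weights
# Variables (alpha, b, s, a, v, z) of sizes (m, 1, m, m, 1, p).
c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m),
                    [0.0], sigma * np.ones(p)])
I = np.eye(m)
A_ub = np.vstack([
    np.hstack([ K,  np.ones((m, 1)), -I, np.zeros((m, m)), np.zeros((m, 1 + p))]),
    np.hstack([-K, -np.ones((m, 1)), -I, np.zeros((m, m)), np.zeros((m, 1 + p))]),
    np.hstack([ I,  np.zeros((m, 1)), np.zeros((m, m)), -I, np.zeros((m, 1 + p))]),
    np.hstack([-I,  np.zeros((m, 1)), np.zeros((m, m)), -I, np.zeros((m, 1 + p))]),
    # Knowledge rows: -(Kt alpha + b e - hT + v gT + z) <= 0.
    np.hstack([-Kt, -np.ones((p, 1)), np.zeros((p, 2 * m)),
               -gT[:, None], -np.eye(p)]),
])
b_ub = np.concatenate([y, -y, np.zeros(2 * m), -hT])
bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + 1 + p)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
alpha, b = res.x[:m], res.x[m]
v, z = res.x[3 * m + 1], res.x[3 * m + 2:]
```

When the slacks z reach zero, v ≥ 0 and g ≤ 0 at the discretization points give K(x', A')α + b ≥ h(x) there, i.e. the knowledge holds at the discretized points.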
15. Numerical Experience
- Evaluate on three datasets
  - Two synthetic datasets
  - Wisconsin Prognostic Breast Cancer Database
- Compare the approximation with prior knowledge to one without prior knowledge
  - Prior knowledge leads to an improved approximation
  - The prior knowledge used cannot be handled exactly by previous work
  - No kernelization is needed on the knowledge set
16. Two-Dimensional Hyperboloid
17. Two-Dimensional Hyperboloid
- [Figure: locations of the given data in the (x1, x2) plane]
- Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {-5, ..., 5}
18. Two-Dimensional Hyperboloid: Approximation without Prior Knowledge
19. Two-Dimensional Hyperboloid
- Add prior knowledge: x1·x2 ≥ 1 ⇒ f(x1, x2) ≥ x1·x2
- The nonlinear term x1·x2 cannot be handled exactly by any previous approach
- Discretization used only 11 points along the line x1 = -x2, x1 ∈ {-5, -4, ..., 4, 5}
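The data and knowledge-discretization sets for this experiment can be written down directly. This is a reconstruction of the setup described above; taking the hyperboloid to be f(x1, x2) = x1·x2 is an assumption consistent with the knowledge implication:

```python
import numpy as np

x1 = np.arange(-5, 6)                  # x1 in {-5, -4, ..., 4, 5}
data_pts = np.column_stack([x1, x1])   # 11 data points on the line x1 = x2
y = data_pts[:, 0] * data_pts[:, 1]    # exact hyperboloid values f = x1 * x2

know_pts = np.column_stack([x1, -x1])  # 11 knowledge points on the line x1 = -x2
# The implication x1*x2 >= 1 => f(x1, x2) >= x1*x2 is enforced only at know_pts.
```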
20. Two-Dimensional Hyperboloid: Approximation with Prior Knowledge
21. Two-Dimensional Tower Function
22. Two-Dimensional Tower Function: Data
- Given 400 points on the grid [-4, 4] × [-4, 4]
- Values are min{g(x), 2}, where g(x) is the exact tower function
23. Two-Dimensional Tower Function: Approximation without Prior Knowledge
24. Two-Dimensional Tower Function: Prior Knowledge
- Add prior knowledge: (x1, x2) ∈ [-4, 4] × [-4, 4] ⇒ f(x) = g(x)
- The prior knowledge is the exact function value, enforced at 2500 points on the grid [-4, 4] × [-4, 4] through the above implication
- The principal objective of the prior knowledge here is to overcome the poor given data
25. Two-Dimensional Tower Function: Approximation with Prior Knowledge
26. Predicting Lymph Node Metastasis
- The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
  - It is determined by surgery, in addition to the removal of the tumor
- Wisconsin Prognostic Breast Cancer (WPBC) data
  - Lymph node metastasis for 194 patients
  - 30 cytological features from a fine-needle aspirate
  - Tumor size, obtained during surgery
- Task: predict the number of metastasized lymph nodes given tumor size alone
27. Predicting Lymph Node Metastasis
- Split the data into two portions
  - Past data (20%): used to find the prior knowledge
  - Present data (80%): used to evaluate performance
- Simulates acquiring prior knowledge from an expert's experience
28. Prior Knowledge for Lymph Node Metastasis
- Use kernel approximation without knowledge on the past data
  - f1(x) = K(x', A1')α1 + b1
  - A1 is the matrix of the past data points
- Use density estimation to decide where to enforce the knowledge
  - p(x) is the empirical density of the past data
- Knowledge: the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01
  - p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) - 0.01
29. Predicting Lymph Node Metastasis: Results
- The table shows the root-mean-squared error (RMSE) of the past-data (20%) approximation f1(x) on the present data (80%)
- Leave-one-out (LOO) RMSE is reported for approximations with and without knowledge
- Improvement due to knowledge: 14.8%
30. Conclusion
- Added general nonlinear prior knowledge to kernel approximation
  - Implemented as linear inequalities in a linear programming problem
  - Knowledge is incorporated transparently
- Demonstrated effectiveness
  - Two synthetic examples
  - A real-world problem from breast cancer prognosis
- Future work
  - More general prior knowledge, with the inequalities replaced by more general functions
  - Apply to classification problems
31. Questions
- Websites linking to papers and talks
  - http://www.cs.wisc.edu/~olvi/
  - http://www.cs.wisc.edu/~wildt/