Title: Nonlinear Knowledge in Kernel Approximation
1. Nonlinear Knowledge in Kernel Approximation
- Olvi Mangasarian
- UW Madison and UCSD La Jolla
- Edward Wild
- UW Madison
2. Objectives
- Primary objective: incorporate prior knowledge over completely arbitrary sets into function approximation, without transforming (kernelizing) the knowledge
- Secondary objective: achieve transparency of the prior knowledge for practical applications
3. Outline
- Use kernels for function approximation
- Incorporate prior knowledge
  - Previous approaches require transformation of the knowledge
  - The new approach does not require any transformation of the knowledge
  - Knowledge given over completely arbitrary sets
- Experimental results
  - Two synthetic examples and one real-world example related to breast cancer prognosis
  - Approximations with prior knowledge are more accurate than approximations without prior knowledge
4. Function Approximation
- Given a set of m points in n-dimensional real space R^n and corresponding function values in R
  - Points are represented by the rows of a matrix A ∈ R^{m×n}
  - Exact or approximate function values for the points are given by a corresponding vector y ∈ R^m
- Find a function f: R^n → R based on the given data, such that f(A_i') ≈ y_i
5. Function Approximation
- Using only the given data may result in a poor approximation
  - Points may be noisy
  - Sampling may be costly
- Solution: use prior knowledge to improve the approximation
6. Adding Prior Knowledge
- Standard approach: fit a function at the given data points, without knowledge
- Constrained approach: satisfy inequalities at given points
- 2004 MSW paper: satisfy linear inequalities over polyhedral regions
- Proposed new approach: satisfy nonlinear inequalities over arbitrary regions
7. Kernel Approximation
- Approximate f by a nonlinear kernel function K
  - f(x) ≈ K(x', A')α + b
  - Gaussian kernel: K(x', A')α = Σ_i exp(-μ‖x - A_i'‖²) α_i
- Error in the kernel approximation of the given data
  - -s ≤ K(A, A')α + be - y ≤ s
  - e is a vector of ones in R^m
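As a minimal sketch of the kernel above (the kernel-width parameter μ is a hypothetical choice, not given on the slide), the Gaussian kernel matrix K(A, A') can be computed as:

```python
import numpy as np

def gaussian_kernel(X, A, mu=0.5):
    """Entry (i, j) is exp(-mu * ||X_i - A_j||^2) for rows X_i, A_j."""
    # Pairwise squared Euclidean distances between rows of X and rows of A.
    d2 = ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-mu * d2)

A = np.array([[0.0], [1.0], [2.0]])  # m = 3 points in R^1 (rows of A)
K = gaussian_kernel(A, A)            # K(A, A') is m x m
# Each diagonal entry is exp(0) = 1: zero distance from a point to itself.
```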
8. Kernel Approximation
- Trade off solution complexity (‖α‖₁) against data fitting (‖s‖₁)
- Convert to a linear program: minimize e'a + ν e's over (α, b, s, a), subject to -s ≤ K(A, A')α + be - y ≤ s and -a ≤ α ≤ a
- At the solution
  - e'a = ‖α‖₁
  - e's = ‖s‖₁
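A sketch of that linear program with scipy.optimize.linprog; the values of μ and ν and the variable ordering z = (α, b, s, a) are illustrative implementation choices, not from the slides:

```python
import numpy as np
from scipy.optimize import linprog

# Toy data: m = 3 one-dimensional points with exact values y.
A = np.array([[0.0], [1.0], [2.0]])
y = np.array([0.0, 1.0, 2.0])
m = A.shape[0]
mu, nu = 0.5, 10.0  # illustrative kernel-width and trade-off parameters

d2 = ((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-mu * d2)  # Gaussian kernel K(A, A')
e = np.ones(m)

# Variables z = (alpha, b, s, a); minimize e'a + nu * e's.
c = np.concatenate([np.zeros(m), [0.0], nu * e, e])
I, Z = np.eye(m), np.zeros((m, m))
A_ub = np.block([
    [ K,  e[:, None],           -I,  Z],  #  K alpha + b e - s <= y
    [-K, -e[:, None],           -I,  Z],  # -K alpha - b e - s <= -y
    [ I,  np.zeros((m, 1)),      Z, -I],  #  alpha - a <= 0
    [-I,  np.zeros((m, 1)),      Z, -I],  # -alpha - a <= 0
])
b_ub = np.concatenate([y, -y, np.zeros(2 * m)])
bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m)  # s, a >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
alpha, b = res.x[:m], res.x[m]
pred = K @ alpha + b  # fitted values at the given points
```

With a large ν the slack s is driven to zero and the data are fit exactly; a smaller ν trades fitting accuracy for a sparser, smaller-‖α‖₁ solution.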
9. Incorporating Nonlinear Prior Knowledge (MSW 2004)
- Linear prior knowledge: Bx ≤ d ⇒ α'Ax + b ≥ h'x + β
- Need to kernelize the knowledge from the input space to the feature space of the kernel
  - Requires the change of variable x = A't
  - BA't ≤ d ⇒ α'AA't + b ≥ h'A't + β
  - K(B, A')t ≤ d ⇒ α'K(A, A')t + b ≥ h'A't + β
- Motzkin's theorem of the alternative gives an equivalent linear system of inequalities, which is added to a linear program
- Achieves good numerical results, but the kernelization is not readily interpretable in the original space
10. Incorporating Nonlinear Prior Knowledge: New Approach
- Start with an arbitrary nonlinear knowledge implication
  - g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ ⊆ R^n
  - g and h are arbitrary functions: g: Γ → R^k, h: Γ → R
- Problem: need to add this knowledge to the optimization problem
- Logically equivalent system: g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has no solution x ∈ Γ
11. Prior Knowledge as a System of Linear Inequalities
- Use a theorem of the alternative for convex functions. Assume that g(x) and K(x', A')α + b - h(x) are convex functions of x, that Γ is convex, and that ∃x ∈ Γ: g(x) < 0. Then either
  - I. g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has a solution x ∈ Γ, or
  - II. ∃v ∈ R^k, v ≥ 0: K(x', A')α + b - h(x) + v'g(x) ≥ 0, ∀x ∈ Γ,
- but never both.
- If we can find v ≥ 0 with K(x', A')α + b - h(x) + v'g(x) ≥ 0, ∀x ∈ Γ, then by the above theorem g(x) ≤ 0, K(x', A')α + b - h(x) < 0 has no solution x ∈ Γ, or equivalently
  - g(x) ≤ 0 ⇒ K(x', A')α + b ≥ h(x), ∀x ∈ Γ
12. Proof
- (¬I ⇒ II)
  - Follows from OLM 1969, Corollary 4.2.2 and the existence of an x ∈ Γ such that g(x) < 0.
- (II ⇒ ¬I)
  - Suppose not, i.e., suppose both systems hold: there exist x ∈ Γ and v ∈ R^k such that
    - g(x) ≤ 0, K(x', A')α + b - h(x) < 0, (I)
    - v ≥ 0, v'g(x) + K(x', A')α + b - h(x) ≥ 0, ∀x ∈ Γ. (II)
  - Since v ≥ 0 and g(x) ≤ 0 give v'g(x) ≤ 0, we have the contradiction 0 > v'g(x) + K(x', A')α + b - h(x) ≥ 0.
- The direction (II ⇒ ¬I) requires no assumptions on g, h, K, or Γ whatsoever.
13. Example: g(x) = 1250 - x³, f(x) = x⁴, h(x) = x² + 5000
- [Figure: exhibiting a v ≥ 0 with v·g(x) + f(x) - h(x) ≥ 0 (system II) certifies the implication g(x) ≤ 0 ⇒ f(x) ≥ h(x), i.e., x³ ≥ 1250 ⇒ x⁴ ≥ x² + 5000 (the negation of system I).]
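A quick numerical check of this example; the multiplier v = 8 is an illustrative choice, not taken from the slides:

```python
import numpy as np

g = lambda x: 1250 - x**3      # g(x) <= 0  <=>  x^3 >= 1250
f = lambda x: x**4
h = lambda x: x**2 + 5000

v = 8.0                        # illustrative multiplier v >= 0
x = np.linspace(-20.0, 20.0, 100001)
cert = v * g(x) + f(x) - h(x)  # certificate of alternative II

# cert stays nonnegative on the sampled interval, so on that interval the
# implication x^3 >= 1250  =>  x^4 >= x^2 + 5000 follows from the theorem.
```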
14. Incorporating Prior Knowledge
- The constraint K(x', A')α + b - h(x) + v'g(x) ≥ 0 for all x ∈ Γ gives a linear semi-infinite program: an infinite number of constraints
- Discretize to obtain a finite linear program
- Slacks allow the knowledge to be satisfied inexactly
- A term is added to the objective function to drive the slacks to zero
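The discretized program can be sketched on a toy 1-D problem. Everything here is an illustrative assumption rather than the slides' experiments: the data come from f(x) = x², the knowledge implication is g(x) = -x ≤ 0 ⇒ f(x) ≥ h(x) = x² - 1, and μ, ν, σ are made-up weights. Variables are (α, b, s, a, v, z), with knowledge slacks z penalized by σ e'z:

```python
import numpy as np
from scipy.optimize import linprog

def kern(X, A, mu=0.5):  # Gaussian kernel matrix
    return np.exp(-mu * ((X[:, None, :] - A[None, :, :]) ** 2).sum(axis=2))

# Toy data: fit f(x) = x^2 from 5 exact samples.
A = np.linspace(-2, 2, 5).reshape(-1, 1)
y = A[:, 0] ** 2
m = A.shape[0]
K = kern(A, A)

# Knowledge g(x) = -x <= 0 => f(x) >= h(x) = x^2 - 1, discretized on [0, 2].
T = np.linspace(0, 2, 9).reshape(-1, 1)
p = T.shape[0]
Kt = kern(T, A)
gT = -T[:, 0]            # g at the discretization points (all <= 0)
hT = T[:, 0] ** 2 - 1.0  # h at the discretization points

nu, sigma = 10.0, 100.0  # fit and knowledge penalty weights
# Variables (alpha, b, s, a, v, z) of sizes (m, 1, m, m, 1, p).
c = np.concatenate([np.zeros(m + 1), nu * np.ones(m), np.ones(m),
                    [0.0], sigma * np.ones(p)])
I = np.eye(m)
A_ub = np.vstack([
    np.hstack([ K,  np.ones((m, 1)), -I, np.zeros((m, m)), np.zeros((m, 1 + p))]),
    np.hstack([-K, -np.ones((m, 1)), -I, np.zeros((m, m)), np.zeros((m, 1 + p))]),
    np.hstack([ I,  np.zeros((m, 1)), np.zeros((m, m)), -I, np.zeros((m, 1 + p))]),
    np.hstack([-I,  np.zeros((m, 1)), np.zeros((m, m)), -I, np.zeros((m, 1 + p))]),
    # Knowledge rows: -(Kt alpha + b e - hT + v gT + z) <= 0.
    np.hstack([-Kt, -np.ones((p, 1)), np.zeros((p, 2 * m)),
               -gT[:, None], -np.eye(p)]),
])
b_ub = np.concatenate([y, -y, np.zeros(2 * m), -hT])
bounds = [(None, None)] * (m + 1) + [(0, None)] * (2 * m + 1 + p)

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
alpha, b = res.x[:m], res.x[m]
v, z = res.x[3 * m + 1], res.x[3 * m + 2:]
```

When the slacks z reach zero, v ≥ 0 and g ≤ 0 at the discretization points give K(x', A')α + b ≥ h(x) there, i.e. the knowledge holds at the discretized points.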
15. Numerical Experience
- Evaluate on three datasets
  - Two synthetic datasets
  - Wisconsin Prognostic Breast Cancer Database
- Compare the approximation with prior knowledge to one without prior knowledge
  - Prior knowledge leads to an improved approximation
  - The prior knowledge used cannot be handled exactly by previous work
  - No kernelization is needed on the knowledge set
16. Two-Dimensional Hyperboloid
17. Two-Dimensional Hyperboloid
- [Figure: locations of the given data in the (x1, x2) plane]
- Given exact values only at 11 points along the line x1 = x2, at x1 ∈ {-5, ..., 5}
18. Two-Dimensional Hyperboloid: Approximation without Prior Knowledge
19. Two-Dimensional Hyperboloid
- Add prior knowledge: x1·x2 ≥ 1 ⇒ f(x1, x2) ≥ x1·x2
- The nonlinear term x1·x2 cannot be handled exactly by any previous approach
- Discretization used only 11 points along the line x1 = -x2, x1 ∈ {-5, -4, ..., 4, 5}
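The data and knowledge-discretization sets for this experiment can be written down directly. This is a reconstruction of the setup described above; taking the hyperboloid to be f(x1, x2) = x1·x2 is an assumption consistent with the knowledge implication:

```python
import numpy as np

x1 = np.arange(-5, 6)                  # x1 in {-5, -4, ..., 4, 5}
data_pts = np.column_stack([x1, x1])   # 11 data points on the line x1 = x2
y = data_pts[:, 0] * data_pts[:, 1]    # exact hyperboloid values f = x1 * x2

know_pts = np.column_stack([x1, -x1])  # 11 knowledge points on the line x1 = -x2
# The implication x1*x2 >= 1 => f(x1, x2) >= x1*x2 is enforced only at know_pts.
```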
20. Two-Dimensional Hyperboloid: Approximation with Prior Knowledge
21. Two-Dimensional Tower Function
22. Two-Dimensional Tower Function: Data
- Given 400 points on the grid [-4, 4] × [-4, 4]
- Values are min{g(x), 2}, where g(x) is the exact tower function
23. Two-Dimensional Tower Function: Approximation without Prior Knowledge
24. Two-Dimensional Tower Function: Prior Knowledge
- Add prior knowledge: (x1, x2) ∈ [-4, 4] × [-4, 4] ⇒ f(x) = g(x)
- The prior knowledge is the exact function value, enforced at 2500 points on the grid [-4, 4] × [-4, 4] through the above implication
- The principal objective of the prior knowledge here is to overcome the poor given data
25. Two-Dimensional Tower Function: Approximation with Prior Knowledge
26. Predicting Lymph Node Metastasis
- The number of metastasized lymph nodes is an important prognostic indicator for breast cancer recurrence
  - It is determined by surgery, in addition to the removal of the tumor
- Wisconsin Prognostic Breast Cancer (WPBC) data
  - Lymph node metastasis for 194 patients
  - 30 cytological features from a fine-needle aspirate
  - Tumor size, obtained during surgery
- Task: predict the number of metastasized lymph nodes given tumor size alone
27. Predicting Lymph Node Metastasis
- Split the data into two portions
  - Past data (20%): used to find the prior knowledge
  - Present data (80%): used to evaluate performance
- Simulates acquiring prior knowledge from an expert's experience
28. Prior Knowledge for Lymph Node Metastasis
- Use kernel approximation without knowledge on the past data
  - f1(x) = K(x', A1')α1 + b1
  - A1 is the matrix of the past data points
- Use density estimation to decide where to enforce the knowledge
  - p(x) is the empirical density of the past data
- Knowledge: the number of metastasized lymph nodes is greater than the value predicted on the past data, with a tolerance of 0.01
  - p(x) ≥ 0.1 ⇒ f(x) ≥ f1(x) - 0.01
29. Predicting Lymph Node Metastasis: Results
- The table shows the root-mean-squared error (RMSE) of the past-data (20%) approximation f1(x) on the present data (80%)
- Leave-one-out (LOO) RMSE is reported for approximations with and without knowledge
- Improvement due to knowledge: 14.8%
30. Conclusion
- Added general nonlinear prior knowledge to kernel approximation
  - Implemented as linear inequalities in a linear programming problem
  - Knowledge is incorporated transparently
- Demonstrated effectiveness
  - Two synthetic examples
  - A real-world problem from breast cancer prognosis
- Future work
  - More general prior knowledge, with the inequalities replaced by more general functions
  - Apply to classification problems
31. Questions
- Websites linking to papers and talks
  - http://www.cs.wisc.edu/~olvi/
  - http://www.cs.wisc.edu/~wildt/