Title: Support Vector Machines for Data Fitting and Classification
1 Support Vector Machines for Data Fitting and Classification
- David R. Musicant
- with Olvi L. Mangasarian
UW-Madison Data Mining Institute Annual Review, June 2, 2000
2 Overview
- Regression and its role in data mining
- Robust support vector regression
- Our general formulation
- Tolerant support vector regression
- Our contributions
- Massive support vector regression
- Integration with data mining tools
- Active support vector machines
- Other research and future directions
3 What is regression?
- Regression forms a rule for predicting an unknown numerical feature from known ones.
- Example: predicting purchase habits.
- Can we use...
- age, income, level of education
- To predict...
- purchasing patterns?
- And simultaneously...
- avoid the pitfalls that standard statistical
regression falls into?
4 Regression example
5 Role in data mining
- Goal: Find new relationships in data
- e.g. customer behavior, scientific experimentation
- Regression explores the importance of each known feature in predicting the unknown one.
- Feature selection
- Regression is a form of supervised learning
- Use data where the predictive value is known for given instances, to form a rule
- Massive datasets
Regression is a fundamental task in data mining.
6 Part I: Robust Regression
7 Standard Linear Regression
Find w, b such that
8 Optimization problem
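The formula on the original slide did not survive extraction; the standard problem is to find w and b minimizing the squared error ||Xw + b - y||^2. A minimal sketch via the normal equations, on synthetic data (all names and values here are illustrative):

```python
import numpy as np

# Synthetic regression data: 100 points, 3 known features, known true rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=100)

# Absorb the intercept b as a constant column, then solve least squares.
Xb = np.hstack([X, np.ones((100, 1))])
coef = np.linalg.lstsq(Xb, y, rcond=None)[0]
w, b = coef[:3], coef[3]
```

With low noise the recovered w and b sit close to the true coefficients used to generate y.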
9 Examining the loss function
- Standard regression uses a squared-error loss function.
- Points which are far from the predicted line (outliers) are overemphasized.
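The overemphasis is easy to see numerically: under squared error a single outlier can account for almost all of the total loss (the residual values below are illustrative):

```python
import numpy as np

# Residuals from a hypothetical fit: four points near the line, one outlier.
residuals = np.array([0.1, -0.2, 0.15, -0.1, 5.0])

squared_total = np.sum(residuals ** 2)
absolute_total = np.sum(np.abs(residuals))

# Share of the total loss contributed by the single outlier.
sq_share = residuals[-1] ** 2 / squared_total     # ~0.997 of the total
abs_share = abs(residuals[-1]) / absolute_total   # ~0.901 of the total
```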
10 Alternative loss function
- Instead of squared error, try the absolute value of the error.
This is called the 1-norm loss function.
11 1-Norm Problems and Solution
- Overemphasizes error on points close to the predicted line
- Solution: the Huber loss function, a hybrid approach
- Quadratic for small errors, linear for large ones
Many practitioners prefer the Huber loss function.
12 Mathematical Formulation
- g indicates the switchover from quadratic to linear: larger g means more quadratic.
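The slide's formula did not survive extraction; a direct sketch of the loss it describes (the symbol g matches the slide, the function name is my own):

```python
import numpy as np

def huber_loss(r, g):
    """Huber loss: quadratic for |r| <= g, linear (slope g) beyond,
    joined so the two pieces meet continuously at |r| = g."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= g, 0.5 * r ** 2, g * (np.abs(r) - 0.5 * g))
```

Larger g widens the quadratic region, so more points are treated quadratically, matching the slide's remark.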
13 Regression Approach Summary
- Quadratic Loss Function
- Standard method in statistics
- Over-emphasizes outliers
- Linear Loss Function (1-norm)
- Formulates well as a linear program
- Over-emphasizes small errors
- Huber Loss Function (hybrid approach)
- Appropriate emphasis on large and small errors
14 Previous attempts were complicated
- Earlier efforts to solve Huber regression:
- Huber: Gauss-Seidel method
- Madsen/Nielsen: Newton method
- Li: conjugate gradient method
- Smola: dual quadratic program
- Our new approach: a convex quadratic program
Our new approach is simpler and faster.
15 Experimental Results: Census20k
20,000 points, 11 features
[Chart: time (CPU sec) vs. g; our method is faster.]
16 Experimental Results: CPUSmall
8,192 points, 12 features
[Chart: time (CPU sec) vs. g; our method is faster.]
17 Introduce nonlinear kernel!
- Begin with previous formulation
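The usual kernel substitution replaces inner products x_i · x_j with a kernel value K(x_i, x_j); a Gaussian (RBF) kernel sketch, where mu and the data are placeholders:

```python
import numpy as np

def rbf_kernel(A, B, mu):
    """Gaussian kernel matrix: K[i, j] = exp(-mu * ||A_i - B_j||^2)."""
    sq_dists = ((A ** 2).sum(1)[:, None]
                - 2.0 * A @ B.T
                + (B ** 2).sum(1)[None, :])
    return np.exp(-mu * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X, mu=0.5)   # 5 x 5, symmetric, ones on the diagonal
```

For m data points this kernel matrix is m x m, which is the source of the O(m^2) problem size discussed later.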
18 Nonlinear results
Nonlinear kernels improve accuracy.
19 Part II: Tolerant Regression
20 Regression Approach Summary
- Quadratic Loss Function
- Standard method in statistics
- Over-emphasizes outliers
- Linear Loss Function (1-norm)
- Formulates well as a linear program
- Over-emphasizes small errors
- Huber Loss Function (hybrid approach)
- Appropriate emphasis on large and small errors
21 Optimization problem (1-norm)
22 The overfitting issue
- Noisy training data can be fitted too well
- leads to poor generalization on future data
- Prefer simpler regressions, i.e. where
- some w coefficients are zero
- line is flatter
23 Reducing overfitting
- To achieve both goals
- minimize magnitude of w vector
- C is a parameter to balance the two goals
- Chosen by experimentation
- Reduces overfitting due to points far from the surface
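A 2-norm (ridge-style) sketch of the trade-off C controls; the talk's linear-program formulation uses the 1-norm of w instead, but the role of C is the same (data and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# True rule uses only the first feature; the other coefficients are zero.
y = X @ np.array([2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

def fit_regularized(X, y, C):
    """Solve min ||w||^2 + C * ||Xw - y||^2, i.e. w = (I/C + X^T X)^{-1} X^T y.
    Larger C weights the data-fitting term more; smaller C shrinks w."""
    n = X.shape[1]
    return np.linalg.solve(np.eye(n) / C + X.T @ X, X.T @ y)

w_loose = fit_regularized(X, y, C=1e-3)   # heavy shrinkage: flatter fit
w_tight = fit_regularized(X, y, C=1e3)    # close to ordinary least squares
```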
24 Overfitting again: close points
- Close points may be wrong due to noise only
- Line should be influenced by real data, not
noise
- Ignore errors from those points which are close!
25 Tolerant regression
- Allow an interval of size e with uniform error
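This is the epsilon-insensitive loss of support vector regression; a direct sketch (e matches the slide's interval size, the function name is my own):

```python
import numpy as np

def tolerant_loss(r, e):
    """Errors inside the tolerance interval |r| <= e cost nothing;
    beyond it, the cost grows linearly in the excess."""
    return np.maximum(np.abs(np.asarray(r, dtype=float)) - e, 0.0)
```

Points whose residual lies inside the interval contribute zero loss, so they cannot pull the surface around, which is exactly the "ignore errors from close points" idea above.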
26 How about a nonlinear surface?
27 Introduce nonlinear kernel!
- Begin with previous formulation
28 Our improvements
- This formulation and interpretation is new!
- Improves intuition from prior results
- Uses fewer variables
- Solves faster!
- Computational tests run on DMI Locop2
- Dell PowerEdge 6300 server with
- Four gigabytes of memory, 36 gigabytes of disk space
- Windows NT Server 4.0
- CPLEX 6.5 solver
Donated to UW by Microsoft Corporation
29 Comparison Results
30 Problem size concerns
- How does the problem scale?
- m: number of points
- n: number of features
- For a linear kernel, problem size is O(mn)
- For a nonlinear kernel, problem size is O(m^2)
- Thousands of data points => a massive problem!
Need an algorithm that will scale well.
31 Chunking approach
- Idea Use a chunking method
- Bring as much into memory as possible
- Solve this subset of the problem
- Retain solution and integrate into next subset
- Explored in depth by Paul Bradley and O.L.
Mangasarian for linear kernels
Solve in pieces, one chunk at a time.
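The pattern can be illustrated on a toy task where "retain the solution and integrate into the next subset" is exact: keeping the k largest values of a long array, one chunk at a time. In the SVM setting the retained rows are the active constraints; everything named here is illustrative:

```python
import numpy as np

def largest_k_chunked(values, k, chunk_size):
    """Chunking pattern: solve each chunk together with the solution
    retained from previous chunks, then keep only what still matters."""
    retained = np.array([])
    for start in range(0, len(values), chunk_size):
        block = np.concatenate([retained, values[start:start + chunk_size]])
        retained = np.sort(block)[-k:]   # the retained "solution"
    return retained

vals = np.random.default_rng(1).normal(size=1000)
top5 = largest_k_chunked(vals, 5, chunk_size=100)
```

Only one chunk (plus the small retained set) ever needs to be in memory, which is the point of the approach.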
32 Row-Column Chunking
- Why column chunking also?
- If a non-linear kernel is used, chunks are very wide.
- A wide chunk must have a small number of rows to fit in memory.
Both these chunks use the same memory!
33 Chunking Experimental Results
34 Objective Value and Tuning Set Error for a Billion-Element Matrix
Given enough time, we find the right answer!
35 Integration into data mining tools
- Method runs as a stand-alone application, with data resident on disk
- With minimal effort, could sit on top of an RDBMS to manage data input/output
- Queries select a subset of data - easily SQLable
- Database queries occur infrequently
- Data mining can be performed on a different machine from the one maintaining the DBMS
- Licensing of a linear program solver is necessary
Algorithm can integrate with data mining tools.
36 Part III: Active Support Vector Machines
37 The Classification Problem
[Figure: points of two classes, A+ and A-, divided by a separating surface.]
Find surface to best separate two classes.
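A minimal sketch of finding a linear separating surface x·w + b = 0 by a regularized least-squares fit to ±1 labels; this is a simple stand-in for the linear-algebra-only solve the next slide describes, not the exact ASVM update, and all data and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical classes, shifted well apart.
A_plus = rng.normal(loc=+2.0, size=(40, 2))
A_minus = rng.normal(loc=-2.0, size=(40, 2))

X = np.vstack([A_plus, A_minus])
d = np.concatenate([np.ones(40), -np.ones(40)])   # class labels +1 / -1

# Append a constant column so b is learned along with w.
Xb = np.hstack([X, np.ones((80, 1))])
# One regularized linear solve -- no LP or QP solver needed.
wb = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(3), Xb.T @ d)

predictions = np.sign(Xb @ wb)
accuracy = (predictions == d).mean()
```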
38 Active Support Vector Machine
- Features
- Solves classification problems
- No special software tools necessary! No LP or QP!
- FAST. Works on very large problems.
- Web page www.cs.wisc.edu/musicant/asvm
- Available for download and can be integrated into data mining tools
- MATLAB integration already provided
39 Summary and Future Work
- Summary
- Robust regression can be modeled simply and efficiently as a quadratic program
- Tolerant regression can be used to solve massive regression problems
- ASVM can solve massive classification problems quickly
- Future work
- Parallel approaches
- Distributed approaches
- ASVM for various types of regression
40 Questions?