Title: Support Vector Machines for Data Fitting and Classification
1 Support Vector Machines for Data Fitting and Classification
- David R. Musicant
- with Olvi L. Mangasarian
UW-Madison Data Mining Institute Annual Review, June 2, 2000
2 Overview
- Regression and its role in data mining
- Robust support vector regression
- Our general formulation
- Tolerant support vector regression
- Our contributions
- Massive support vector regression
- Integration with data mining tools
- Active support vector machines
- Other research and future directions
3 What is regression?
- Regression forms a rule for predicting an unknown numerical feature from known ones.
- Example: predicting purchase habits.
- Can we use...
- age, income, level of education
- To predict...
- purchasing patterns?
- And simultaneously...
- avoid the pitfalls that standard statistical
regression falls into?
4 Regression example
5 Role in data mining
- Goal: Find new relationships in data
- e.g. customer behavior, scientific experimentation
- Regression explores the importance of each known feature in predicting the unknown one.
- Feature selection
- Regression is a form of supervised learning
- Use data where the predictive value is known for given instances, to form a rule
- Massive datasets
Regression is a fundamental task in data mining.
6 Part I: Robust Regression
7 Standard Linear Regression
Find w, b such that
8 Optimization problem
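The formula on the original slide did not survive extraction; the standard problem is to find w and b minimizing the squared error ||Xw + b - y||^2. A minimal sketch via the normal equations, on synthetic data (all names and values here are illustrative):

```python
import numpy as np

# Synthetic regression data: 100 points, 3 known features, known true rule.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 3.0 + 0.01 * rng.normal(size=100)

# Absorb the intercept b as a constant column, then solve least squares.
Xb = np.hstack([X, np.ones((100, 1))])
coef = np.linalg.lstsq(Xb, y, rcond=None)[0]
w, b = coef[:3], coef[3]
```

With low noise the recovered w and b sit close to the true coefficients used to generate y.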
9 Examining the loss function
- Standard regression uses a squared-error loss function.
- Points which are far from the predicted line (outliers) are overemphasized.
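The overemphasis is easy to see numerically: under squared error a single outlier can account for almost all of the total loss (the residual values below are illustrative):

```python
import numpy as np

# Residuals from a hypothetical fit: four points near the line, one outlier.
residuals = np.array([0.1, -0.2, 0.15, -0.1, 5.0])

squared_total = np.sum(residuals ** 2)
absolute_total = np.sum(np.abs(residuals))

# Share of the total loss contributed by the single outlier.
sq_share = residuals[-1] ** 2 / squared_total     # ~0.997 of the total
abs_share = abs(residuals[-1]) / absolute_total   # ~0.901 of the total
```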
10 Alternative loss function
- Instead of squared error, try the absolute value of the error.
This is called the 1-norm loss function.
11 1-Norm Problems and Solution
- Overemphasizes error on points close to the predicted line
- Solution: the Huber loss function, a hybrid approach
- Quadratic for small errors, linear for large ones
Many practitioners prefer the Huber loss function.
12 Mathematical Formulation
- g indicates the switchover from quadratic to linear: larger g means more quadratic.
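The slide's formula did not survive extraction; a direct sketch of the loss it describes (the symbol g matches the slide, the function name is my own):

```python
import numpy as np

def huber_loss(r, g):
    """Huber loss: quadratic for |r| <= g, linear (slope g) beyond,
    joined so the two pieces meet continuously at |r| = g."""
    r = np.asarray(r, dtype=float)
    return np.where(np.abs(r) <= g, 0.5 * r ** 2, g * (np.abs(r) - 0.5 * g))
```

Larger g widens the quadratic region, so more points are treated quadratically, matching the slide's remark.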
13 Regression Approach Summary
- Quadratic Loss Function
- Standard method in statistics
- Over-emphasizes outliers
- Linear Loss Function (1-norm)
- Formulates well as a linear program
- Over-emphasizes small errors
- Huber Loss Function (hybrid approach)
- Appropriate emphasis on large and small errors
14 Previous attempts were complicated
- Earlier efforts to solve Huber regression:
- Huber: Gauss-Seidel method
- Madsen/Nielsen: Newton method
- Li: conjugate gradient method
- Smola: dual quadratic program
- Our new approach: a convex quadratic program
Our new approach is simpler and faster.
15 Experimental Results: Census20k
20,000 points, 11 features
[Chart: time (CPU sec) vs. g; our method is faster.]
16 Experimental Results: CPUSmall
8,192 points, 12 features
[Chart: time (CPU sec) vs. g; our method is faster.]
17 Introduce nonlinear kernel!
- Begin with previous formulation
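The usual kernel substitution replaces inner products x_i · x_j with a kernel value K(x_i, x_j); a Gaussian (RBF) kernel sketch, where mu and the data are placeholders:

```python
import numpy as np

def rbf_kernel(A, B, mu):
    """Gaussian kernel matrix: K[i, j] = exp(-mu * ||A_i - B_j||^2)."""
    sq_dists = ((A ** 2).sum(1)[:, None]
                - 2.0 * A @ B.T
                + (B ** 2).sum(1)[None, :])
    return np.exp(-mu * sq_dists)

X = np.random.default_rng(0).normal(size=(5, 3))
K = rbf_kernel(X, X, mu=0.5)   # 5 x 5, symmetric, ones on the diagonal
```

For m data points this kernel matrix is m x m, which is the source of the O(m^2) problem size discussed later.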
18 Nonlinear results
Nonlinear kernels improve accuracy.
19 Part II: Tolerant Regression
20 Regression Approach Summary
- Quadratic Loss Function
- Standard method in statistics
- Over-emphasizes outliers
- Linear Loss Function (1-norm)
- Formulates well as a linear program
- Over-emphasizes small errors
- Huber Loss Function (hybrid approach)
- Appropriate emphasis on large and small errors
21 Optimization problem (1-norm)
22 The overfitting issue
- Noisy training data can be fitted too well
- leads to poor generalization on future data
- Prefer simpler regressions, i.e. where
- some w coefficients are zero
- line is flatter
23 Reducing overfitting
- To achieve both goals
- minimize magnitude of w vector
- C is a parameter to balance the two goals
- Chosen by experimentation
- Reduces overfitting due to points far from the surface
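A 2-norm (ridge-style) sketch of the trade-off C controls; the talk's linear-program formulation uses the 1-norm of w instead, but the role of C is the same (data and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
# True rule uses only the first feature; the other coefficients are zero.
y = X @ np.array([2.0, 0.0, 0.0]) + 0.1 * rng.normal(size=50)

def fit_regularized(X, y, C):
    """Solve min ||w||^2 + C * ||Xw - y||^2, i.e. w = (I/C + X^T X)^{-1} X^T y.
    Larger C weights the data-fitting term more; smaller C shrinks w."""
    n = X.shape[1]
    return np.linalg.solve(np.eye(n) / C + X.T @ X, X.T @ y)

w_loose = fit_regularized(X, y, C=1e-3)   # heavy shrinkage: flatter fit
w_tight = fit_regularized(X, y, C=1e3)    # close to ordinary least squares
```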
24 Overfitting again: close points
- Close points may be wrong due to noise only
- Line should be influenced by real data, not
noise
- Ignore errors from those points which are close!
25 Tolerant regression
- Allow an interval of size e with uniform error
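This is the epsilon-insensitive loss of support vector regression; a direct sketch (e matches the slide's interval size, the function name is my own):

```python
import numpy as np

def tolerant_loss(r, e):
    """Errors inside the tolerance interval |r| <= e cost nothing;
    beyond it, the cost grows linearly in the excess."""
    return np.maximum(np.abs(np.asarray(r, dtype=float)) - e, 0.0)
```

Points whose residual lies inside the interval contribute zero loss, so they cannot pull the surface around, which is exactly the "ignore errors from close points" idea above.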
26 How about a nonlinear surface?
27 Introduce nonlinear kernel!
- Begin with previous formulation
28 Our improvements
- This formulation and interpretation is new!
- Improves intuition from prior results
- Uses fewer variables
- Solves faster!
- Computational tests run on DMI Locop2
- Dell PowerEdge 6300 server with
- Four gigabytes of memory, 36 gigabytes of disk space
- Windows NT Server 4.0
- CPLEX 6.5 solver
Donated to UW by Microsoft Corporation
29 Comparison Results
30 Problem size concerns
- How does the problem scale?
- m: number of points
- n: number of features
- For a linear kernel, problem size is O(mn)
- For a nonlinear kernel, problem size is O(m^2)
- Thousands of data points => a massive problem!
Need an algorithm that will scale well.
31 Chunking approach
- Idea Use a chunking method
- Bring as much into memory as possible
- Solve this subset of the problem
- Retain solution and integrate into next subset
- Explored in depth by Paul Bradley and O.L.
Mangasarian for linear kernels
Solve in pieces, one chunk at a time.
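The pattern can be illustrated on a toy task where "retain the solution and integrate into the next subset" is exact: keeping the k largest values of a long array, one chunk at a time. In the SVM setting the retained rows are the active constraints; everything named here is illustrative:

```python
import numpy as np

def largest_k_chunked(values, k, chunk_size):
    """Chunking pattern: solve each chunk together with the solution
    retained from previous chunks, then keep only what still matters."""
    retained = np.array([])
    for start in range(0, len(values), chunk_size):
        block = np.concatenate([retained, values[start:start + chunk_size]])
        retained = np.sort(block)[-k:]   # the retained "solution"
    return retained

vals = np.random.default_rng(1).normal(size=1000)
top5 = largest_k_chunked(vals, 5, chunk_size=100)
```

Only one chunk (plus the small retained set) ever needs to be in memory, which is the point of the approach.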
32 Row-Column Chunking
- Why column chunking also?
- If a non-linear kernel is used, chunks are very wide.
- A wide chunk must have a small number of rows to fit in memory.
Both these chunks use the same memory!
33 Chunking Experimental Results
34 Objective Value and Tuning Set Error for a Billion-Element Matrix
Given enough time, we find the right answer!
35 Integration into data mining tools
- Method runs as a stand-alone application, with data resident on disk
- With minimal effort, could sit on top of an RDBMS to manage data input/output
- Queries select a subset of data - easily SQLable
- Database queries occur infrequently
- Data mining can be performed on a different machine from the one maintaining the DBMS
- Licensing of a linear program solver is necessary
Algorithm can integrate with data mining tools.
36 Part III: Active Support Vector Machines
37 The Classification Problem
[Figure: points of two classes, A+ and A-, divided by a separating surface.]
Find surface to best separate two classes.
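A minimal sketch of finding a linear separating surface x·w + b = 0 by a regularized least-squares fit to ±1 labels; this is a simple stand-in for the linear-algebra-only solve the next slide describes, not the exact ASVM update, and all data and constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
# Two hypothetical classes, shifted well apart.
A_plus = rng.normal(loc=+2.0, size=(40, 2))
A_minus = rng.normal(loc=-2.0, size=(40, 2))

X = np.vstack([A_plus, A_minus])
d = np.concatenate([np.ones(40), -np.ones(40)])   # class labels +1 / -1

# Append a constant column so b is learned along with w.
Xb = np.hstack([X, np.ones((80, 1))])
# One regularized linear solve -- no LP or QP solver needed.
wb = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(3), Xb.T @ d)

predictions = np.sign(Xb @ wb)
accuracy = (predictions == d).mean()
```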
38 Active Support Vector Machine
- Features
- Solves classification problems
- No special software tools necessary! No LP or QP!
- FAST. Works on very large problems.
- Web page www.cs.wisc.edu/musicant/asvm
- Available for download and can be integrated into data mining tools
- MATLAB integration already provided
39 Summary and Future Work
- Summary
- Robust regression can be modeled simply and efficiently as a quadratic program
- Tolerant regression can be used to solve massive regression problems
- ASVM can solve massive classification problems quickly
- Future work
- Parallel approaches
- Distributed approaches
- ASVM for various types of regression
40 Questions?