Support Vector Machines for Data Fitting and Classification
Transcript and Presenter's Notes
1
Support Vector Machines for Data Fitting and Classification
  • David R. Musicant
  • with Olvi L. Mangasarian

UW-Madison Data Mining Institute Annual Review
June 2, 2000
2
Overview
  • Regression and its role in data mining
  • Robust support vector regression
  • Our general formulation
  • Tolerant support vector regression
  • Our contributions
  • Massive support vector regression
  • Integration with data mining tools
  • Active support vector machines
  • Other research and future directions

3
What is regression?
  • Regression forms a rule for predicting an unknown
    numerical feature from known ones.
  • Example: Predicting purchase habits.
  • Can we use...
  • age, income, level of education
  • To predict...
  • purchasing patterns?
  • And simultaneously...
  • avoid the pitfalls that standard statistical
    regression falls into?

4
Regression example
[Slide shows an example: demographic features (age, income, education) used to predict purchasing patterns.]

5
Role in data mining
  • Goal: Find new relationships in data
  • e.g. customer behavior, scientific
    experimentation
  • Regression explores importance of each known
    feature in predicting the unknown one.
  • Feature selection
  • Regression is a form of supervised learning
  • Use data where the predictive value is known for
    given instances, to form a rule
  • Massive datasets

Regression is a fundamental task in data mining.
6
Part I: Robust Regression
  • a.k.a. Huber Regression

7
Standard Linear Regression
Find w, b such that Aw + eb ≈ y
(A is the m × n matrix of known features, y the m-vector of values to predict, and e a column of ones so that b acts as the intercept)
8
Optimization problem
  • Find w, b minimizing the squared error: min over w, b of ||Aw + eb - y||²
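A minimal numpy sketch of this least-squares fit (the data here is invented for illustration; A, w, b, y follow the slides' notation):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 3))             # m x n matrix of known features
    y = A @ np.array([2.0, -1.0, 0.5]) + 3.0  # values to predict

    # Append a column of ones so the intercept b is fit along with w.
    A1 = np.hstack([A, np.ones((A.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)  # minimizes ||A1 c - y||^2
    w, b = coef[:-1], coef[-1]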

9
Examining the loss function
  • Standard regression uses a squared error loss
    function.
  • Points which are far from the predicted line
    (outliers) are overemphasized.

10
Alternative loss function
  • Instead of squared error, try absolute value of
    the error

This is called the 1-norm loss function.
11
1-Norm Problems And Solution
  • Overemphasizes error on points close to the
    predicted line
  • Solution: the Huber loss function, a hybrid approach

[Plot of the Huber loss: quadratic near zero error, linear for large errors.]
Many practitioners prefer the Huber loss function.
12
Mathematical Formulation
  • γ indicates the switchover from quadratic to linear:

    L_γ(z) = z²/2           if |z| ≤ γ
             γ|z| - γ²/2    if |z| > γ

Larger γ means more quadratic.
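A small sketch of this loss function (the standard Huber definition; the helper name is mine):

    import numpy as np

    def huber_loss(z, gamma):
        # Quadratic for |z| <= gamma, linear beyond it.
        z = np.abs(z)
        return np.where(z <= gamma,
                        0.5 * z ** 2,                  # small errors
                        gamma * z - 0.5 * gamma ** 2)  # outliers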
13
Regression Approach Summary
  • Quadratic Loss Function
  • Standard method in statistics
  • Over-emphasizes outliers
  • Linear Loss Function (1-norm)
  • Formulates well as a linear program
  • Over-emphasizes small errors
  • Huber Loss Function (hybrid approach)
  • Appropriate emphasis on large and small errors

14
Previous attempts were complicated
  • Earlier efforts to solve Huber regression:
  • Huber: Gauss-Seidel method
  • Madsen/Nielsen: Newton method
  • Li: conjugate gradient method
  • Smola: dual quadratic program
  • Our new approach: a convex quadratic program

Our new approach is simpler and faster.
15
Experimental Results: Census20k
20,000 points, 11 features
[Chart: time (CPU sec) vs. γ, showing our method is faster.]
16
Experimental Results: CPUSmall
8,192 points, 12 features
[Chart: time (CPU sec) vs. γ, again showing our method is faster.]
17
Introduce nonlinear kernel!
  • Begin with the previous formulation, substituting w = A′u so the linear term AA′u can be replaced by a nonlinear kernel K(A, A′)u
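A sketch of this substitution with a Gaussian kernel, one common choice (the helper and the width parameter mu are illustrative, not specified on the slide):

    import numpy as np

    def gaussian_kernel(A, B, mu):
        # K[i, j] = exp(-mu * ||A_i - B_j||^2)
        sq = (np.sum(A ** 2, axis=1)[:, None]
              + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-mu * sq)

    # The linear fit Aw + eb, with w = A'u, becomes:
    # y_hat = gaussian_kernel(A, A, mu) @ u + b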

18
Nonlinear results
Nonlinear kernels improve accuracy.
19
Part II: Tolerant Regression
  • a.k.a. Tolerant Training

20
Regression Approach Summary
  • Quadratic Loss Function
  • Standard method in statistics
  • Over-emphasizes outliers
  • Linear Loss Function (1-norm)
  • Formulates well as a linear program
  • Over-emphasizes small errors
  • Huber Loss Function (hybrid approach)
  • Appropriate emphasis on large and small errors

21
Optimization problem (1-norm)
  • Find w, b such that Aw + eb ≈ y
  • Bound the error by s: minimize e′s subject to -s ≤ Aw + eb - y ≤ s (sketched below)
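A minimal sketch of this linear program with scipy (the helper name and solver choice are mine; variables are stacked as x = [w, b, s]):

    import numpy as np
    from scipy.optimize import linprog

    def l1_regression(A, y):
        m, n = A.shape
        e = np.ones((m, 1))
        I = np.eye(m)
        c = np.concatenate([np.zeros(n + 1), np.ones(m)])    # minimize sum(s)
        # -s <= Aw + eb - y <= s, split into two one-sided constraints
        A_ub = np.block([[A, e, -I], [-A, -e, -I]])
        b_ub = np.concatenate([y, -y])
        bounds = [(None, None)] * (n + 1) + [(0, None)] * m  # w, b free; s >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
        return res.x[:n], res.x[n]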

22
The overfitting issue
  • Noisy training data can be fitted too well
  • leads to poor generalization on future data
  • Prefer simpler regressions, i.e. those where
  • some w coefficients are zero
  • the line is flatter

23
Reducing overfitting
  • To achieve both goals:
  • minimize the magnitude of the w vector
  • C is a parameter that balances the two goals
  • chosen by experimentation
  • Reduces overfitting due to points far from the surface

24
Overfitting again: close points
  • Close points may be wrong due to noise only
  • Line should be influenced by real data, not
    noise
  • Ignore errors from those points which are close!

25
Tolerant regression
  • Allow an interval of size ε within which errors are uniformly ignored
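In code, the tolerance is a small change to the loss; a sketch that also folds in slide 23's C-weighted tradeoff (helper names and the exact weighting are illustrative):

    import numpy as np

    def eps_insensitive(z, eps):
        # Errors inside the interval [-eps, +eps] cost nothing.
        return np.maximum(np.abs(z) - eps, 0.0)

    def tolerant_objective(w, residuals, C, eps):
        # Balance a simple surface (small ||w||_1) against the data fit.
        return np.abs(w).sum() + C * eps_insensitive(residuals, eps).sum()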

26
How about a nonlinear surface?
27
Introduce nonlinear kernel!
  • Begin with the previous formulation and introduce the kernel K(A, A′) as before

28
Our improvements
  • This formulation and interpretation is new!
  • Improves intuition over prior results
  • Uses fewer variables
  • Solves faster!
  • Computational tests run on DMI Locop2:
  • Dell PowerEdge 6300 server with
  • Four gigabytes of memory, 36 gigabytes of disk
    space
  • Windows NT Server 4.0
  • CPLEX 6.5 solver

Donated to UW by Microsoft Corporation
29
Comparison Results
30
Problem size concerns
  • How does the problem scale?
  • m = number of points
  • n = number of features
  • For a linear kernel, problem size is O(mn)
  • For a nonlinear kernel, problem size is O(m²)
  • Thousands of data points ⇒ massive problem!

Need an algorithm that will scale well.
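For a sense of scale: m = 32,000 points with a nonlinear kernel means roughly 10^9 matrix entries, about 8 GB at double precision, which is the billion-element scale revisited on slide 34.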
31
Chunking approach
  • Idea: Use a chunking method:
  • Bring as much into memory as possible
  • Solve this subset of the problem
  • Retain solution and integrate into next subset
  • Explored in depth by Paul Bradley and O.L.
    Mangasarian for linear kernels

Solve in pieces, one chunk at a time, as sketched below.
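A schematic of this loop (solve_subproblem and the rule for which rows to retain are placeholders, not the talk's actual code):

    import numpy as np

    def row_chunking(A, y, chunk_rows, solve_subproblem):
        m = A.shape[0]
        kept = np.empty(0, dtype=int)   # rows retained from earlier chunks
        solution = None
        for start in range(0, m, chunk_rows):
            new = np.arange(start, min(start + chunk_rows, m))
            work = np.union1d(kept, new)    # prior solution rows + new chunk
            solution, support = solve_subproblem(A[work], y[work])
            kept = work[support]            # retain only the rows that matter
        return solution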
32
Row-Column Chunking
  • Why column chunking also?
  • If a nonlinear kernel is used, chunks are very
    wide.
  • A wide chunk must have a small number of rows to
    fit in memory.

A tall, narrow chunk and a short, wide chunk can use the same memory!
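For instance, a 100-row by 100,000-column block and a 10,000-row by 1,000-column block both hold 10^7 kernel entries.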
33
Chunking Experimental Results
34
Objective Value and Tuning Set Error for a Billion-Element Matrix
Given enough time, we find the right answer!
35
Integration into data mining tools
  • Method runs as a stand-alone application, with
    data resident on disk
  • With minimal effort, could sit on top of a RDBMS
    to manage data input/output
  • Queries select a subset of data: easily expressed in SQL
  • Database queries occur infrequently
  • Data mining can be performed on a different
    machine from the one maintaining the DBMS
  • Licensing of a linear program solver is necessary

Algorithm can integrate with data mining tools.
36
Part III: Active Support Vector Machines
  • a.k.a. ASVM

37
The Classification Problem
[Plot: two classes of points, A+ and A-, with a separating surface between them.]
Find the surface that best separates the two classes.
38
Active Support Vector Machine
  • Features
  • Solves classification problems
  • No special software tools necessary! No LP or QP!
  • FAST. Works on very large problems.
  • Web page: www.cs.wisc.edu/musicant/asvm
  • Available for download and can be integrated into
    data mining tools
  • MATLAB integration already provided

39
Summary and Future Work
  • Summary
  • Robust regression can be modeled simply and
    efficiently as a quadratic program
  • Tolerant regression can be used to solve massive
    regression problems
  • ASVM can solve massive classification problems
    quickly
  • Future work
  • Parallel approaches
  • Distributed approaches
  • ASVM for various types of regression

40
Questions?