Transcript and Presenter's Notes

Title: Classification: Support Vector Machine


1
Classification: Support Vector Machine
  • 10/10/07

2
What hyperplane (line) can separate the two
classes of data?
3
What hyperplane (line) can separate the two
classes of data?
But there are many other choices! Which one is
the best?
4
(Figure: M = margin)
What hyperplane (line) can separate the two
classes of data?
But there are many other choices! Which one is
the best?
5
Optimal separating hyperplane
(Figure: the optimal separating hyperplane, with margin M on each side)
The best hyperplane is the one that maximizes the
margin, M.
6
Computing the margin width
  • A hyperplane is the set of points with x^T β + β_0 = 0.

Find x^+ and x^- on the plus plane (x^T β + β_0 = 1) and the minus
plane (x^T β + β_0 = -1), so that x^+ - x^- is perpendicular to β.
Then M = ||x^+ - x^-||.

(Figure: the planes x^T β + β_0 = 1, 0, -1, the normal vector β, and
the points x^+ and x^-)
7
Computing the margin width
A hyperplane is the set of points with x^T β + β_0 = 0.
Find x^+ and x^- on the plus and minus planes, so that x^+ - x^- is
perpendicular to β. Then M = ||x^+ - x^-||.
Since (x^+)^T β + β_0 = 1 and (x^-)^T β + β_0 = -1, subtracting gives
(x^+ - x^-)^T β = 2.

(Figure: the planes x^T β + β_0 = 1, 0, -1, the normal vector β, and
the points x^+ and x^-)

M = ||x^+ - x^-|| = 2 / ||β||   (checked numerically below)
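A quick numerical check of M = 2/||β|| (an editor's illustration, not
from the slides; the particular β and β_0 below are made up), in Python:

import numpy as np

beta = np.array([3.0, 4.0])          # normal vector of the hyperplane
beta_0 = -2.0
unit = beta / np.linalg.norm(beta)   # unit normal

# Points on the plus and minus planes, reached along the unit normal.
x_plus = ((+1 - beta_0) / np.linalg.norm(beta)) * unit   # x^T beta + beta_0 = +1
x_minus = ((-1 - beta_0) / np.linalg.norm(beta)) * unit  # x^T beta + beta_0 = -1

M = np.linalg.norm(x_plus - x_minus)
print(M, 2 / np.linalg.norm(beta))   # both print 0.4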
8
Computing the margin width
The hyperplane is separating if y_i (x_i^T β + β_0) ≥ 1 for every
training point (x_i, y_i), with y_i = ±1. The maximizing problem is
then max over β, β_0 of M = 2 / ||β||, subject to these constraints.

(Figure: the margin M; the points lying on the margin boundaries are
the support vectors)
9
Optimal separating hyperplane
  • Rewrite the problem as min over β, β_0 of (1/2) ||β||^2
  • subject to y_i (x_i^T β + β_0) ≥ 1 for all i
  • Lagrange function
    L_P = (1/2) ||β||^2 - Σ_i α_i [ y_i (x_i^T β + β_0) - 1 ]
  • To minimize, set partial derivatives to be 0:
    β = Σ_i α_i y_i x_i and Σ_i α_i y_i = 0
  • Can be solved by quadratic programming (see the sketch below).
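A hedged sketch of this step in Python: scikit-learn's SVC solves the
quadratic program internally, and a very large C (an assumption here)
approximates the hard-margin problem on invented, separable toy data.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Linearly separable toy data (illustration only).
X, y = make_blobs(n_samples=40, centers=2, cluster_std=0.6, random_state=0)

# Very large C ~ hard margin: essentially no slack is tolerated.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

beta, beta_0 = clf.coef_[0], clf.intercept_[0]
print("margin M = 2/||beta|| =", 2 / np.linalg.norm(beta))
print("number of support vectors:", len(clf.support_vectors_))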

10
When the two classes are non-separable
What is the best hyperplane?
Idea: allow some points to lie on the wrong side,
but not by much.
11
Support vector machine
  • When the two classes are not separable, the
    problem is slightly modified with slack variables ξ_i ≥ 0
  • Find min over β, β_0 of (1/2) ||β||^2 + C Σ_i ξ_i
  • subject to y_i (x_i^T β + β_0) ≥ 1 - ξ_i and ξ_i ≥ 0 for all i
  • Can be solved using quadratic programming (see the sketch below).
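A minimal soft-margin sketch (the overlapping toy data and the two C
values are assumptions for illustration): C controls how heavily
points on the wrong side of the margin are penalized.

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Overlapping classes: no hyperplane separates them exactly.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=3.0, random_state=1)

for C in (0.01, 100.0):
    # Small C tolerates many margin violations (many support vectors);
    # large C penalizes slack heavily (fewer support vectors).
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, len(clf.support_vectors_), clf.score(X, y))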

12
Convert a nonseparable to separable case by
nonlinear transformation
non-separable in 1D
13
Convert a nonseparable to separable case by
nonlinear transformation
separable in 1D
14
Kernel function
  • Introduce nonlinear transformation functions h(x), and
    work on the transformed features.
  • Then the separating function is
    f(x) = Σ_i α_i y_i ⟨h(x), h(x_i)⟩ + β_0
  • In fact, all you need is the kernel function
    K(x, x') = ⟨h(x), h(x')⟩
  • Common kernels: dth-degree polynomial K(x, x') = (1 + ⟨x, x'⟩)^d,
    radial basis K(x, x') = exp(-||x - x'||^2 / c), and
    neural network K(x, x') = tanh(κ1 ⟨x, x'⟩ + κ2)
    (a kernel sketch follows below)
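A hedged kernel sketch, reusing the 1-D idea from the earlier slides
(one class in the middle, the other on the outside, so no single
threshold on x separates them); the data and kernel settings are
invented for illustration.

import numpy as np
from sklearn.svm import SVC

# 1-D data: class +1 near the origin, class -1 farther out.
x = np.linspace(-3, 3, 61).reshape(-1, 1)
y = np.where(np.abs(x.ravel()) < 1.5, 1, -1)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, C=10.0, gamma=1.0).fit(x, y)
    # The linear kernel cannot separate these classes; the RBF kernel
    # (or an explicit map such as h(x) = (x, x^2)) essentially can.
    print(kernel, "training accuracy:", clf.score(x, y))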

15
Applications
16
Prediction of central nervous system embryonic
tumor outcome
  • 42 patient samples
  • 5 cancer types
  • Array contains 6817 genes
  • Question: are different tumor types
    distinguishable from their gene expression patterns?

(Pomeroy et al. 2002)
17
(Pomeroy et al. 2002)
18
Gene expressions within a cancer type cluster
together
(Pomeroy et al. 2002)
19
PCA based on all genes
(Pomeroy et al. 2002)
20
PCA based on a subset of informational genes
(Pomeroy et al. 2002)
21
(No Transcript)
22
(No Transcript)
23
Classification and diagnostic prediction of
cancers using gene expression profiling and
artificial neural networks
  • Four different cancer types.
  • 88 samples
  • 6567 genes
  • Goal: to predict cancer types from gene
    expression data

(Khan et al. 2001)
24
Classification and diagnostic prediction of
cancers using gene expression profiling and
artificial neural networks
(Khan et al. 2001)
25
Procedures
  • Filter out genes that have low expression values
    (retain 2308 genes)
  • Dimension reduction using PCA --- select the top
    10 principal components
  • 3-fold cross-validation

(Khan et al. 2001)
26
Artificial Neural Network
27
(No Transcript)
28
(Khan et al. 2001)
29
Procedures
  • Filter out genes that have low expression values
    (retain 2308 genes)
  • Dimension reduction using PCA --- select the top
    10 principal components
  • 3-fold cross-validation
  • Repeat 1250 times (see the sketch below).

(Khan et al. 2001)
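A hedged sketch of this pipeline in Python: synthetic numbers stand in
for the 88 x 6567 expression matrix, the filtering threshold is an
assumption, and a linear SVM stands in for the ANN committee that
Khan et al. actually trained.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.lognormal(mean=2.0, sigma=1.0, size=(88, 6567))  # fake expression values
y = rng.integers(0, 4, size=88)                          # 4 tumor classes

# 1. Filter out genes with low expression (threshold chosen arbitrarily here).
keep = X.mean(axis=0) > np.quantile(X.mean(axis=0), 0.65)
X = np.log2(X[:, keep])

# 2. Dimension reduction: keep the top 10 principal components.
X10 = PCA(n_components=10).fit_transform(X)

# 3. 3-fold cross-validation (the paper repeats this with reshuffled folds).
for train, test in StratifiedKFold(n_splits=3, shuffle=True, random_state=0).split(X10, y):
    clf = SVC(kernel="linear").fit(X10[train], y[train])
    print("fold accuracy:", clf.score(X10[test], y[test]))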
30
(Khan et al. 2001)
31
(Khan et al. 2001)
32
Acknowledgement
  • Sources of slides:
  • Cheng Li
  • http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf
  • www.cse.msu.edu/lawhiu/intro_SVM_new.ppt

33
Aggregating predictors
  • Sometimes aggregating several predictors can
    perform better than any single predictor alone.
    Aggregation is achieved by a weighted sum of
    different predictors, which can be the same kind
    of predictor fit to slightly perturbed
    training datasets.
  • The key to the improvement in accuracy is the
    instability of the individual classifiers, such as
    classification trees (see the sketch below).
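A minimal aggregation sketch, assuming trees as the unstable base
classifiers and bootstrap resampling as the perturbation; the dataset
and ensemble size are invented for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, y_train, X_test, y_test = X[:200], y[:200], X[200:], y[200:]

rng = np.random.default_rng(0)
votes = np.zeros((len(X_test), 2))
for _ in range(50):
    idx = rng.integers(0, len(X_train), len(X_train))          # perturbed (bootstrap) training set
    tree = DecisionTreeClassifier().fit(X_train[idx], y_train[idx])
    votes[np.arange(len(X_test)), tree.predict(X_test)] += 1   # equal-weight vote

single = DecisionTreeClassifier().fit(X_train, y_train)
print("single tree :", (single.predict(X_test) == y_test).mean())
print("aggregated  :", (votes.argmax(axis=1) == y_test).mean())  # typically at least as good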

34
AdaBoost
  • Step 1: Initialize the observation weights w_i = 1/N, i = 1, ..., N
  • Step 2: For m = 1 to M,
  • Fit a classifier G_m(x) to the training data using
    weights w_i
  • Compute the weighted error
    err_m = Σ_i w_i I(y_i ≠ G_m(x_i)) / Σ_i w_i
  • Compute α_m = log( (1 - err_m) / err_m )
  • Set w_i ← w_i exp[ α_m I(y_i ≠ G_m(x_i)) ]
  • Step 3: Output G(x) = sign[ Σ_m α_m G_m(x) ]
    (a runnable sketch follows below)

(The weight update is how misclassified observations are given more
weight.)
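A runnable AdaBoost sketch matching the steps above (decision stumps,
the number of rounds M, and the toy data are assumptions for
illustration):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=1)
y = 2 * y - 1                                  # labels in {-1, +1}

N, M = len(X), 25
w = np.full(N, 1.0 / N)                        # Step 1: initialize weights
stumps, alphas = [], []
for m in range(M):                             # Step 2
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)  # weighted training error
    alpha = np.log((1 - err) / err)
    w = w * np.exp(alpha * (pred != y))        # misclassified obs get more weight
    stumps.append(stump)
    alphas.append(alpha)

# Step 3: weighted-majority vote G(x) = sign(sum_m alpha_m G_m(x))
G = np.sign(sum(a * s.predict(X) for a, s in zip(alphas, stumps)))
print("training accuracy:", (G == y).mean())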
35
Boosting
36
Optimal separating hyperplane
  • Substituting, we get the Lagrange (Wolfe) dual
    function L_D = Σ_i α_i - (1/2) Σ_i Σ_k α_i α_k y_i y_k x_i^T x_k
  • subject to α_i ≥ 0 and Σ_i α_i y_i = 0
  • To complete the steps, see Burges et al.
  • If α_i > 0, then y_i (x_i^T β + β_0) = 1
  • These x_i's are called the support vectors.
  • β = Σ_i α_i y_i x_i is determined only by
    the support vectors (see the check below)
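A small check of the last point, using scikit-learn's fitted
attributes (the attribute names are scikit-learn's, not the slides'):
the weight vector can be rebuilt from the support vectors alone.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, cluster_std=1.0, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for the support vectors only, so
# beta = sum_i alpha_i y_i x_i uses no other training points.
beta_from_sv = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(beta_from_sv, clf.coef_))   # True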

37
Support vector machine
  • The Lagrange function is
    L_P = (1/2) ||β||^2 + C Σ_i ξ_i
          - Σ_i α_i [ y_i (x_i^T β + β_0) - (1 - ξ_i) ] - Σ_i μ_i ξ_i
  • Setting the partial derivatives to be 0 gives
    β = Σ_i α_i y_i x_i,  Σ_i α_i y_i = 0,  and  α_i = C - μ_i
  • Substituting, we get
    L_D = Σ_i α_i - (1/2) Σ_i Σ_k α_i α_k y_i y_k x_i^T x_k
  • Subject to 0 ≤ α_i ≤ C and Σ_i α_i y_i = 0