Title: Classification: Support Vector Machine
1. Classification: Support Vector Machine
2. What hyperplane (line) can separate the two classes of data?
3. What hyperplane (line) can separate the two classes of data?
But there are many other choices! Which one is the best?
4. What hyperplane (line) can separate the two classes of data?
But there are many other choices! Which one is the best?
(Figure: a candidate hyperplane with its margin M.)
5. Optimal separating hyperplane
The best hyperplane is the one that maximizes the margin, M.
(Figure: the optimal hyperplane with margin M on either side.)
6. Computing the margin width
Find x⁺ and x⁻ on the plus plane xᵀβ + β₀ = 1 and the minus plane xᵀβ + β₀ = −1, so that x⁺ − x⁻ is perpendicular to β. Then M = ‖x⁺ − x⁻‖.
(Figure: the planes xᵀβ + β₀ = 1, 0, −1, with β, x⁺, and x⁻ marked.)
7. Computing the margin width
A hyperplane is the set {x : xᵀβ + β₀ = 0}.
Find x⁺ and x⁻ on the plus and minus planes, so that x⁺ − x⁻ is perpendicular to β. Then M = ‖x⁺ − x⁻‖.
Since (x⁺)ᵀβ + β₀ = 1 and (x⁻)ᵀβ + β₀ = −1, subtracting gives (x⁺ − x⁻)ᵀβ = 2.
Because x⁺ − x⁻ is parallel to β, this yields M = ‖x⁺ − x⁻‖ = 2/‖β‖.
(Figure: the planes xᵀβ + β₀ = 1, 0, −1, with β, x⁺, and x⁻ marked.)
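A quick numerical check of M = 2/‖β‖ (a minimal sketch assuming Python with NumPy and scikit-learn; the toy data and the very large C value, used only to approximate a hard-margin fit, are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data, labels coded as +1 / -1 (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)   # very large C approximates a hard-margin fit
beta = clf.coef_.ravel()                      # fitted beta
print("margin width M = 2/||beta|| =", 2.0 / np.linalg.norm(beta))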
8. Computing the margin width
The hyperplane is separating if yᵢ(xᵢᵀβ + β₀) > 0 for all i.
The maximization problem is: maximize M over β, β₀ with ‖β‖ = 1, subject to yᵢ(xᵢᵀβ + β₀) ≥ M for all i.
The points lying exactly on the margin boundary are the support vectors.
9. Optimal separating hyperplane
- Rewrite the problem as: minimize (1/2)‖β‖² over β, β₀
- subject to yᵢ(xᵢᵀβ + β₀) ≥ 1 for all i
- Lagrange (primal) function: L_P = (1/2)‖β‖² − Σᵢ αᵢ[yᵢ(xᵢᵀβ + β₀) − 1]
- To minimize, set the partial derivatives to 0: β = Σᵢ αᵢyᵢxᵢ and Σᵢ αᵢyᵢ = 0
- Can be solved by quadratic programming (see the sketch below).
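A minimal sketch of the quadratic-programming step (assuming Python with NumPy and SciPy; the toy data, the variable names, and the use of SciPy's general-purpose SLSQP solver are illustrative assumptions, not the slides' method). It solves the dual of the problem above, which appears later on slide 36, and recovers β and β₀ from the α's:

import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data, labels coded as +1 / -1 (illustrative only).
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [0.0, 0.0], [0.5, 1.0], [1.0, 0.2]])
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])

# Dual objective: maximize sum(alpha) - 1/2 alpha' Q alpha, with Q_ij = y_i y_j x_i . x_j.
Q = (y[:, None] * X) @ (y[:, None] * X).T

def neg_dual(a):
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                       # alpha_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})  # sum_i alpha_i y_i = 0

alpha = res.x
beta = ((alpha * y)[:, None] * X).sum(axis=0)   # beta = sum_i alpha_i y_i x_i
sv = alpha > 1e-6                               # support vectors have alpha_i > 0
beta0 = np.mean(y[sv] - X[sv] @ beta)           # from y_i (x_i' beta + beta0) = 1
print("beta =", beta, " beta0 =", beta0, " margin =", 2 / np.linalg.norm(beta))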
10. When the two classes are non-separable
What is the best hyperplane?
Idea: allow some points to lie on the wrong side of the margin, but not by much.
11. Support vector machine
- When the two classes are not separable, the problem is slightly modified: find the minimum of (1/2)‖β‖² + C Σᵢ ξᵢ over β, β₀, ξ
- subject to yᵢ(xᵢᵀβ + β₀) ≥ 1 − ξᵢ and ξᵢ ≥ 0 for all i (the ξᵢ are slack variables; C controls how heavily violations are penalized)
- Can be solved using quadratic programming (see the sketch below).
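A minimal sketch of the soft-margin fit (assuming Python with NumPy and scikit-learn; the simulated overlapping classes and the particular C values are illustrative assumptions):

import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian classes, so no hyperplane separates them perfectly.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 1.0, 10.0):                     # larger C tolerates fewer margin violations
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, "support vectors per class:", clf.n_support_,
          "training accuracy:", clf.score(X, y))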
12. Convert a non-separable to a separable case by nonlinear transformation
(Figure: data that are non-separable in 1D.)
13. Convert a non-separable to a separable case by nonlinear transformation
(Figure: after the nonlinear transformation, the data are separable in 1D.)
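A minimal illustration of this idea (assuming Python with NumPy and scikit-learn; the particular mapping x → x², with "+" points on the outside and "−" points in the middle, is an assumed textbook example, not necessarily the one shown on the slide):

import numpy as np
from sklearn.svm import SVC

x = np.array([-3.0, -2.5, -2.0, 2.0, 2.5, 3.0, -0.5, 0.0, 0.5])
y = np.array([1, 1, 1, 1, 1, 1, -1, -1, -1])   # "+" on the outside, "-" in the middle

raw = SVC(kernel="linear").fit(x[:, None], y)             # no threshold separates raw x perfectly
mapped = SVC(kernel="linear").fit((x ** 2)[:, None], y)   # after x -> x^2, one threshold does
print("accuracy on raw x:    ", raw.score(x[:, None], y))
print("accuracy on x squared:", mapped.score((x ** 2)[:, None], y))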
14. Kernel function
- Introduce nonlinear basis functions h(x), and work with the transformed features.
- Then the separating function is f(x) = Σᵢ αᵢyᵢ⟨h(x), h(xᵢ)⟩ + β₀.
- In fact, all you need is the kernel function K(x, x′) = ⟨h(x), h(x′)⟩; the transformation h never has to be computed explicitly.
- Common kernels: dth-degree polynomial K(x, x′) = (1 + ⟨x, x′⟩)^d; radial basis K(x, x′) = exp(−γ‖x − x′‖²); neural network K(x, x′) = tanh(κ₁⟨x, x′⟩ + κ₂).
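A minimal sketch of kernels in use (assuming Python with scikit-learn; the concentric-circles data and the specific kernel parameters are illustrative assumptions). Only the kernel is ever evaluated, never an explicit h(x):

from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not linearly separable in the original two features.
X, y = make_circles(n_samples=200, factor=0.4, noise=0.1, random_state=0)

for kernel, params in [("linear", {}), ("poly", {"degree": 3}), ("rbf", {"gamma": 1.0})]:
    clf = SVC(kernel=kernel, **params).fit(X, y)
    print(f"{kernel:6s} kernel, training accuracy: {clf.score(X, y):.2f}")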
15. Applications
16. Prediction of central nervous system embryonic tumor outcome
- 42 patient samples
- 5 cancer types
- Array contains 6817 genes
- Question: are the different tumor types distinguishable from their gene expression patterns?
(Pomeroy et al. 2002)
17. (Pomeroy et al. 2002)
18. Gene expressions within a cancer type cluster together
(Pomeroy et al. 2002)
19. PCA based on all genes
(Pomeroy et al. 2002)
20. PCA based on a subset of informative genes
(Pomeroy et al. 2002)
21. (Figure slide, no transcript)
22. (Figure slide, no transcript)
23. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
- Four different cancer types
- 88 samples
- 6567 genes
- Goal: predict cancer types from gene expression data
(Khan et al. 2001)
24. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks
(Khan et al. 2001)
25. Procedures
- Filter out genes that have low expression values (retain 2308 genes)
- Dimension reduction using PCA: select the top 10 principal components
- 3-fold cross-validation (a sketch of this pipeline follows below)
(Khan et al. 2001)
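A minimal sketch of this kind of pipeline (assuming Python with NumPy and scikit-learn; the random placeholder data, the variance filter, and the small stand-in classifier are illustrative assumptions, not the authors' actual filtering rule or network):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(88, 6567))    # placeholder for 88 samples x 6567 genes
y = rng.integers(0, 4, size=88)    # placeholder labels for 4 cancer types (random,
                                   # so accuracy here will be near chance)

pipe = make_pipeline(
    VarianceThreshold(),           # placeholder for the low-expression filter
    StandardScaler(),
    PCA(n_components=10),          # keep the top 10 principal components
    MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000),  # small stand-in network
)
print("3-fold CV accuracies:", cross_val_score(pipe, X, y, cv=3))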
26. Artificial Neural Network
27. (Figure slide, no transcript)
28. (Khan et al. 2001)
29. Procedures
- Filter out genes that have low expression values (retain 2308 genes)
- Dimension reduction using PCA: select the top 10 principal components
- 3-fold cross-validation
- Repeat 1250 times.
(Khan et al. 2001)
30. (Khan et al. 2001)
31. (Khan et al. 2001)
32. Acknowledgement
- Sources of slides:
- Cheng Li
- http://www.cs.cornell.edu/johannes/papers/2001/kdd2001-tutorial-final.pdf
- www.cse.msu.edu/lawhiu/intro_SVM_new.ppt
33. Aggregating predictors
- Sometimes aggregating several predictors can perform better than any single predictor alone. Aggregation is achieved by a weighted sum of different predictors, which can be the same kind of predictor obtained from slightly perturbed training datasets (a sketch follows below).
- The key to the improvement in accuracy is the instability of the individual classifiers, such as classification trees.
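A minimal bagging-style sketch of this idea (assuming Python with scikit-learn; the synthetic data, the use of bootstrap resampling as the perturbation, and equal rather than general weights are illustrative assumptions):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

single = DecisionTreeClassifier(random_state=0)               # one unstable classifier
bagged = BaggingClassifier(DecisionTreeClassifier(),          # many trees, each fit on a
                           n_estimators=100, random_state=0)  # bootstrap-perturbed sample

print("single tree CV accuracy: ", cross_val_score(single, X, y, cv=5).mean())
print("bagged trees CV accuracy:", cross_val_score(bagged, X, y, cv=5).mean())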
34. AdaBoost
- Step 1: Initialize the observation weights wᵢ = 1/N, i = 1, …, N.
- Step 2: For m = 1 to M,
- Fit a classifier Gₘ(x) to the training data using the weights wᵢ
- Compute the weighted error errₘ = Σᵢ wᵢ I(yᵢ ≠ Gₘ(xᵢ)) / Σᵢ wᵢ
- Compute αₘ = log((1 − errₘ)/errₘ)
- Set wᵢ ← wᵢ · exp(αₘ · I(yᵢ ≠ Gₘ(xᵢ))), i = 1, …, N
- Step 3: Output G(x) = sign(Σₘ αₘ Gₘ(x))
In the weight update of Step 2, misclassified observations are given more weight.
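A minimal sketch of these steps (assuming Python with NumPy and scikit-learn; the synthetic data and the choice of depth-1 trees as the weak classifiers Gₘ are illustrative assumptions, not from the slides):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y01 = make_classification(n_samples=300, n_features=10, random_state=0)
y = 2 * y01 - 1                      # recode labels as -1 / +1
N, M = len(y), 50

w = np.full(N, 1.0 / N)              # Step 1: equal observation weights
alphas, stumps = [], []
for m in range(M):                   # Step 2
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    miss = (stump.predict(X) != y)
    err = np.sum(w * miss) / np.sum(w)           # weighted error err_m
    alpha = np.log((1 - err) / max(err, 1e-12))  # classifier weight alpha_m
    w *= np.exp(alpha * miss)                    # misclassified obs get more weight
    alphas.append(alpha)
    stumps.append(stump)

# Step 3: weighted vote of the M classifiers
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))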
35. Boosting
36. Optimal separating hyperplane
- Substituting, we get the Lagrange (Wolfe) dual function L_D = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ
- subject to αᵢ ≥ 0 and Σᵢ αᵢyᵢ = 0
- To complete the steps, see Burges et al.
- If αᵢ > 0, then yᵢ(xᵢᵀβ + β₀) = 1, i.e. xᵢ lies on the boundary of the margin.
- These xᵢ are called the support vectors.
- The solution β = Σᵢ αᵢyᵢxᵢ is determined only by the support vectors.
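A minimal sketch of this property (assuming Python with scikit-learn; the blob data and the C value are illustrative assumptions). scikit-learn stores αᵢyᵢ for the support vectors in dual_coef_, so β can be rebuilt from the support vectors alone:

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
clf = SVC(kernel="linear", C=10.0).fit(X, y)

print("support vectors per class:", clf.n_support_)
beta = clf.dual_coef_ @ clf.support_vectors_   # sum_i alpha_i y_i x_i over support vectors
print("beta from support vectors:", beta.ravel())
print("matches clf.coef_:        ", clf.coef_.ravel())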
37. Support vector machine
- The Lagrange (primal) function is L_P = (1/2)‖β‖² + C Σᵢ ξᵢ − Σᵢ αᵢ[yᵢ(xᵢᵀβ + β₀) − (1 − ξᵢ)] − Σᵢ μᵢξᵢ
- Setting the partial derivatives to 0 gives β = Σᵢ αᵢyᵢxᵢ, Σᵢ αᵢyᵢ = 0, and αᵢ = C − μᵢ
- Substituting, we get the dual L_D = Σᵢ αᵢ − (1/2) Σᵢ Σⱼ αᵢαⱼyᵢyⱼ xᵢᵀxⱼ
- Subject to 0 ≤ αᵢ ≤ C and Σᵢ αᵢyᵢ = 0.