Title: Bayesian Support Vector Machine Classification
1. Bayesian Support Vector Machine Classification
- Vasilis A. Sotiris
- AMSC663 Midterm Presentation
- December 2007
- University of Maryland
- College Park, MD 20783
2. Objectives
- Develop an algorithm to detect anomalies in (multivariate) electronic systems
- Improve the detection sensitivity of classical Support Vector Machines (SVMs)
- Decrease false alarms
- Predict future system performance
3. Methodology
- Use linear Principal Component Analysis (PCA) to decompose and compress the raw data into two models: a) a PCA model and b) a residual model
- Use Support Vector Machines (SVMs) to classify the data in each model into normal and abnormal classes
- Assign probabilities to the classification output of the SVMs using a sigmoid function
- Use Maximum Likelihood Estimation (MLE) to find the optimal sigmoid function parameters in each model
- Determine the joint class probability from both models
- Track changes to the joint probability to
  - improve detection sensitivity
  - decrease false alarms
  - predict future system performance
4. Flow Chart of the Probabilistic SVC Detection Methodology
[Flow chart: training data in the input space R^(n x m) is decomposed by PCA into a PCA model (R^(k x m)) and a residual model (R^(l x m)). In each model an SVC produces a decision value, D1(y1) or D2(y2), which a likelihood function converts into a probability matrix. The joint probabilities from the two models are trended against a baseline population database to reach a health decision for each new observation in R^(1 x m).]
5. Principal Component Analysis
6. Principal Component Analysis: Statistical Properties
- Decompose the data into two models
  - PCA model (maximum variance), y1
  - Residual model, y2
- The direction of y1 is the eigenvector with the largest associated eigenvalue λ
- The vector a is chosen as an eigenvector of the covariance matrix C
[Figure: principal directions y1 (PC1) and y2 (PC2) in the (x1, x2) plane]
7. Singular Value Decomposition (SVD) and Eigenanalysis
- SVD is used in this algorithm to perform PCA
- SVD
  - performs the eigenanalysis without first computing the covariance matrix
  - speeds up the computations
  - computes the basis functions (used in the projection step next)
- The output of the SVD, X = U L V^T, is (see the sketch below)
  - U (n x n): basis functions for the PCA and residual models
  - L (n x m): singular values, related to the eigenvalues of the covariance matrix
  - V (m x m): eigenvectors of the covariance matrix
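As a minimal sketch of this step (Python/NumPy here rather than the presentation's MATLAB code; the data matrix and the choice of k = 2 are purely illustrative): with rows as observations, the principal directions come from the right singular vectors, which are named Uk below to line up with the projector H = Uk Uk^T used on the following slides.

```python
import numpy as np

# PCA of a centered data matrix via SVD: Xc = U @ diag(s) @ Vt.
# Rows are observations, columns are the m monitored parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # stand-in for raw training data (n x m)
Xc = X - X.mean(axis=0)                    # center each parameter

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix C = Xc.T @ Xc / (n - 1)
eigvals = s**2 / (Xc.shape[0] - 1)

k = 2                                      # number of principal components kept
Uk = Vt[:k].T                              # m x k basis of the PCA (signal) subspace
print("fraction of variance in the PCA model:", eigvals[:k].sum() / eigvals.sum())
```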
8. Subspace Decomposition
- S: PCA model subspace
  - detects the dominant parameter variation
- R: residual subspace
  - detects hidden anomalies
- Therefore, the analysis of the system behavior can be decoupled into what are called the signal subspace and the residual subspace
- To get xS and xR we project the input data onto S and R
[Figure: the raw data are projected onto the signal subspace S (PCA model projection xS) and the residual subspace R (residual model projection xR)]
9. Least Squares Projections
- u: basis vector for PC1 and PC2
- v: vector from the centered training data to the new observation
- Objective: find the optimal p that minimizes ||v - p u||
- This gives the projected vector vp
- The projection equation is finally put in terms of the SVD: H = Uk Uk^T
- k: number of principal components (dimensions of the PCA model)
- The projection pursuit is optimized based on the PCA model
[Figure: least-squares projection of a new observation onto the plane S spanned by PC1 and PC2]
10. Data Decomposition
- With the projection matrix H we can project any incoming signal onto the signal subspace S and the residual subspace R
- G is a matrix analogous to H that creates the projection onto R
- H is the projection onto S and G is the projection onto R (as sketched below)
  - xS = H x: projection onto S
  - xR = G x: projection onto R
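A minimal sketch of this decomposition, assuming H = Uk Uk^T and taking G = I - H as the complementary projector (the slides only state that G is analogous to H); the data are synthetic.

```python
import numpy as np

# Project a new observation onto the signal subspace S and the residual
# subspace R using the projectors H and G built from the training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 2
Uk = Vt[:k].T                      # m x k basis of S (named Uk to match the slides)
H = Uk @ Uk.T                      # projection onto the signal subspace S
G = np.eye(X.shape[1]) - H         # projection onto the residual subspace R (assumed G = I - H)

x_new = rng.normal(size=5) - mean  # centered new observation
x_S = H @ x_new                    # PCA model projection
x_R = G @ x_new                    # residual model projection
assert np.allclose(x_S + x_R, x_new)   # the two projections reconstruct the observation
```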
11. Support Vector Machines
12. Support Vector Machines
- The performance of a system can be fully explained by the distribution of its parameters
- SVMs estimate the decision boundary for the given distribution
- Areas with less information are allowed a larger margin of error
- New observations can be classified using the decision boundary and are labeled as
  - (-1) outside
  - (+1) inside
[Figure: hard and soft decision boundaries around the data in the (x1, x2) plane]
13. Linear Classification: Separable Input Space
- The SVM finds a function D(x) that best separates the two classes (maximizes the margin M)
- D(x) can be used as a classifier
- Through the support vectors we can compress the input space by excluding all data except the support vectors
- The support vectors tell us everything we need to know about the system in order to perform detection
- By minimizing the norm of w we find the line or linear surface that best separates the two classes
- The decision function is the linear combination of the weight vector w (see the sketch below)
[Figure: separating line D(x) with margin M and weight vector w between the normal and abnormal classes, the training support vectors ai, and a new observation vector]
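A minimal sketch of the separable linear case (scikit-learn's SVC is used here as a stand-in for the presentation's own code; the two Gaussian clusters and all parameter values are illustrative).

```python
import numpy as np
from sklearn.svm import SVC

# Linear SVM: labels follow the slides, +1 = normal, -1 = abnormal.
rng = np.random.default_rng(2)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_abnormal = rng.normal(loc=4.0, scale=1.0, size=(50, 2))
X = np.vstack([X_normal, X_abnormal])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# D(x) = w . x + b; the support vectors alone determine the boundary
w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([[1.0, 1.0]])
d_new = x_new @ w + b
print("D(x_new) =", d_new[0])                  # same sign as clf.predict(x_new)
print("number of support vectors:", len(clf.support_vectors_))
```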
14. Linear Classification: Inseparable Input Space
- For inseparable data the SVM finds a function D(x) that best separates the two classes by maximizing the margin M and minimizing the sum of the slack errors ξi
- The function D(x) can be used as a classifier
- In this illustration, a new observation point that falls to the right of D(x) is considered abnormal; points below and to the left are considered normal
- By minimizing the norm of w and the sum of the slack errors ξi we find the line or linear surface that best separates the two classes (the standard formulation is written out below)
[Figure: soft-margin separating line D(x) with margin M, slack errors ξ1 and ξ2, the normal and abnormal classes, the training support vectors, and a new observation vector]
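For reference, the soft-margin problem this slide describes can be written in its standard textbook form (not copied from the slides), where C weighs the slack errors ξi against the margin:

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\qquad \text{subject to} \qquad
y_{i}\left(w^{\top}x_{i} + b\right) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0 .
```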
15. Nonlinear Classification
- For inseparable data the SVM finds a nonlinear function D(x) that best separates the two classes by use of a kernel map k(., .)
- K(xi, x) = Φ(xi)·Φ(x)
- Feature map Φ(x) = (x^2, √2 x, 1)^T
- The decision function D(x) requires only the dot product of the feature map Φ, so it uses the same mathematical framework as the linear classifier
- This is called the kernel trick (a worked example follows on slide 28)
16. SVM Training
17. Training SVMs for Classification
- We need an effective way to train the SVM without the presence of negative-class data
- Convert the outer distribution of the positive class into a negative class
- Confidence limit training uses a defined confidence level around which a negative class is generated
- One-class training takes a percentage of the positive-class data and converts it to the negative class; it
  - is an optimization problem
  - minimizes the volume VS enclosed by the decision surface
  - does not need negative-class information (see the sketch below)
[Figure: confidence limit training yields decision boundary D1(x) enclosing volume VS1; one-class training yields decision boundary D2(x) enclosing volume VS2, with VS1 > VS2]
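A minimal sketch of one-class training on positive (healthy) data only, using scikit-learn's OneClassSVM as a stand-in for the presentation's own training scheme; nu, gamma, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-class training with only positive (healthy) baseline data.
# nu bounds the fraction of training points treated as outliers, i.e. the part
# of the positive class that effectively plays the role of the negative class.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 2))               # healthy baseline data only

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

x_new = np.array([[0.2, -0.1], [5.0, 5.0]])
print(ocsvm.predict(x_new))                       # +1 = inside (normal), -1 = outside (abnormal)
print(ocsvm.decision_function(x_new))             # signed distance to the decision boundary
```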
18. One-Class Training
- The negative class is important for SVM accuracy
- The data is partitioned using k-means
- The negative class is computed around each cluster centroid
- The negative class is selected from the positive-class data as the points that have the fewest neighbors, denoted by D
- Computationally this is done by maximizing the sum of Euclidean distances between all points (one reading of this selection rule is sketched below)
[Figure: performance region in the (x1, x2) plane with the selected negative-class points around each cluster]
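The sketch below is one reading of this selection rule: cluster with k-means, then relabel the points with the fewest neighbors within a radius eps as the negative class. The clustering parameters, eps, and the relabeled fraction are illustrative, not taken from the presentation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Partition the positive data with k-means, then flag the points with the
# fewest neighbors inside radius eps in each cluster as candidate negatives.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
eps, frac = 1.0, 0.10                            # illustrative parameters
negative_idx = []
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    pts = X[idx]
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    neighbors = (d < eps).sum(axis=1) - 1        # exclude the point itself
    n_neg = max(1, int(frac * len(idx)))
    negative_idx.extend(idx[np.argsort(neighbors)[:n_neg]])  # fewest neighbors first

print("points relabeled as negative class:", len(negative_idx))
```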
19. Class Prediction Probabilities and Maximum Likelihood Estimation
20. Fitting a Sigmoid Function
- In this project we are interested in the probability that our class prediction is correct, i.e., in modeling the misclassification rate
- The class prediction in PHM is the prediction of normality or abnormality
- With an MLE estimate of the density function of these class probabilities we can determine the uncertainty of the prediction
[Figure: the hard decision boundary D(x) and the class probability as a function of the distance from it]
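The sigmoid used on the following slides has the standard Platt form (assumed here, since the slides do not reproduce the formula), with D(x) the SVM decision value and a, b the parameters fitted by MLE:

```latex
P\big(y = 1 \mid D(x)\big) \;=\; \frac{1}{1 + \exp\big(a\,D(x) + b\big)}
```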
21. MLE and SVMs
- Using a semi-parametric approach, a sigmoid function S is fitted along the hard decision boundary to model the class probability
- We are interested in determining the density function that best describes this probability
- The likelihood is computed based on knowledge of the decision function values D(xi) in the parameter space
[Figure: the decision boundary D(x) in the (x1, x2) plane and the likelihood function P(y | D(xi)) plotted against D(x)]
22. MLE and the Sigmoid Function
- The parameters a and b are determined by solving a maximum likelihood estimation (MLE) problem for y
- The minimization is a two-parameter optimization problem of F, a function of a and b
- Depending on the parameters a and b, the shape of the sigmoid will change
- It can be proven that the MLE optimization problem is convex
- We can use Newton's method with a backtracking line search (a sketch of the fit follows)
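A minimal sketch of this fit on synthetic decision values, using a general-purpose SciPy optimizer in place of the hand-coded Newton iteration with backtracking mentioned above; labels are encoded 0/1 for the likelihood.

```python
import numpy as np
from scipy.optimize import minimize

# Fit P(y=1 | D) = 1 / (1 + exp(a*D + b)) by maximizing the likelihood.
# The decision values D(x_i) below are synthetic stand-ins.
rng = np.random.default_rng(5)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])  # D(x_i)
y = np.concatenate([np.zeros(100), np.ones(100)])                    # 0 = abnormal, 1 = normal

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(a * D + b))
    eps = 1e-12                                   # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method="BFGS")
a_hat, b_hat = res.x
print("fitted sigmoid parameters a, b:", round(a_hat, 3), round(b_hat, 3))
```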
23. Joint Probability Model
24. Joint Probability Model
[Diagram: the final class probability P(y | xS, xR) for the classification of x combines the projection onto the PCA model and the projection onto the residual model]
- The class prediction P(y | xS, xR) is based on the joint class probabilities from
  - the PCA model, p(y | xS)
  - the residual model, p(y | xR)
- p(y = c | xS): the probability that a point xS is classified as c in the PCA model
- p(y = c | xR): the probability that a point xR is classified as c in the residual model
- P(y = c | xS, xR): the final probability that a point x is classified as c
- We anticipate better accuracy and sensitivity to the onset of anomalies
25. Joint Probability Model
[Equations: Bayes' rule and the independence assumption]
- The joint probability model depends on the results of the SVC from both models (PCA and residual)
- Assumption: the data in the two models are linearly independent
- Changes in the joint classification probability can be used as a precursor to anomalies and used for prediction (a sketch of one such combination rule follows)
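The sketch below shows one way to fuse the two per-model posteriors under the independence assumption (a naive-Bayes style combination with a tunable prior); it is an illustration of the idea, not the slide's own Bayes-rule equations.

```python
# Fuse p(y=normal | xS) and p(y=normal | xR) into a single posterior,
# assuming the two models contribute independent evidence.
def joint_posterior(p_s, p_r, prior=0.5):
    """p_s = p(y = normal | xS), p_r = p(y = normal | xR)."""
    num = p_s * p_r / prior
    den = num + (1.0 - p_s) * (1.0 - p_r) / (1.0 - prior)
    return num / den

print(joint_posterior(0.9, 0.8))   # two confident models reinforce each other (~0.97)
print(joint_posterior(0.9, 0.4))   # a weak residual-model vote pulls the estimate down (~0.86)
```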
26. Schedule / Progress
27. SVM Classification Example
28. Example: Nonlinear Classification
- We have four 1-D data points in a vector x with a label vector y, given by
  - x = [1, 2, 5, 6]^T
  - y = [-1, -1, 1, -1]^T
- This means that the coordinates x(1), x(2), and x(4) belong to the same class I (circles) and x(3) is its own class II (squares)
- The decision function D(x) is given as the nonlinear combination of the weight vector, which is expressed in terms of the Lagrange multipliers
- The Lagrange multipliers are computed from the quadratic optimization problem
- We are going to use a polynomial kernel of degree two because we can see that some kind of parabola will separate the classes
29. Example: Nonlinear Classification - Constructing the Hessian for the Quadratic Optimization
- Notice that in order to calculate the scalar product Φ^T Φ in the feature space, we do not need to perform the mapping using the equation for Φ. Instead, we calculate this product directly in the input space from the input data by computing the kernel of the map
- This is called the kernel trick (the Hessian for this example is built below)
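A short sketch of the Hessian construction for the worked example, H[i, j] = y_i y_j K(x_i, x_j), with the degree-2 polynomial kernel evaluated directly in the input space.

```python
import numpy as np

# Build the Hessian for the dual problem of the worked example using
# the degree-2 polynomial kernel K(u, v) = (u*v + 1)**2.
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])

K = (np.outer(x, x) + 1.0) ** 2        # kernel matrix, computed in the input space
H = np.outer(y, y) * K                 # Hessian of the quadratic objective
print(H)
```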
30. Example: Nonlinear Classification - The Kernel Trick
- Let x belong to the real 2-D input space
- Choose a mapping function Φ of degree two
- The required dot product of the map functions can be expressed as a dot product in the input space
- This is the kernel trick
- The kernel trick basically says that the mapping can be expressed in terms of a dot product of the input-space data raised to some degree, here to the second degree (checked numerically below)
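A numerical check of the kernel trick in 2-D, assuming the standard degree-2 polynomial feature map (the slide's explicit map is not reproduced here).

```python
import numpy as np

# For phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1),
# the dot product phi(x).phi(z) equals (x.z + 1)**2.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))          # 4.0, computed in the feature space
print((x @ z + 1.0) ** 2)       # 4.0, the same value computed in the input space
```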
31. Example: Nonlinear Classification - The Decision Function D(x)
- Compute the Lagrange multipliers α from the quadratic optimization problem
- Plug them into the equation for D(x)
- Determine b using the class constraints
  - y = [-1, -1, 1, -1]
  - b = -9
- The end result is a nonlinear (quadratic) decision function
- For x(1) = 1: D(x) = -4.33, sign(D(x)) < 0, so class I
- For x(2) = 2: D(x) = -1.00, sign(D(x)) < 0, so class I
- For x(3) = 5: D(x) = 0.994, sign(D(x)) > 0, so class II
- For x(4) = 6: D(x) = -1.009, sign(D(x)) < 0, so class I
- The nonlinear classifier correctly classified the data! (the full calculation is reproduced numerically below)
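The whole calculation can be reproduced numerically; the sketch below solves the dual quadratic program with SciPy in place of MATLAB's quadprog and rebuilds D(x), recovering b ≈ -9 and the decision values quoted above.

```python
import numpy as np
from scipy.optimize import minimize

# Verification sketch of the worked example: solve the dual QP and rebuild D(x).
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])
K = (np.outer(x, x) + 1.0) ** 2                  # degree-2 polynomial kernel matrix
H = np.outer(y, y) * K                           # Hessian of the dual objective

def dual(alpha):                                 # negative of the dual objective
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i alpha_i * y_i = 0
res = minimize(dual, x0=np.ones(4), bounds=[(0, None)] * 4, constraints=cons)
alpha = res.x

sv = alpha > 1e-3                                # support vectors (alpha > 0)
b = np.mean(y[sv] - (alpha * y) @ K[:, sv])      # from y_i * D(x_i) = 1 on the SVs
D = lambda t: float(np.sum(alpha * y * (x * t + 1.0) ** 2) + b)

print("b =", round(b, 3))                        # approx -9, as on the slide
print([round(D(t), 2) for t in x])               # approx [-4.33, -1.0, 1.0, -1.0]
```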
32. Quadratic Optimization and Global Solutions
- What do all these methods have in common?
- A quadratic optimization for the weight vector w, where H is the Hessian matrix and y is the class membership of each training point
- This type of problem is defined as a quadratic optimization problem, the solution of which gives the Lagrange multipliers α, which in turn are used in D(x)
- In MATLAB, quadprog is used to solve the quadratic optimization
- Because the quadratic problem is convex, the solution it yields is guaranteed to be a global solution (the standard dual form is written out below)