Title: Bayesian Support Vector Machine Classification
1. Bayesian Support Vector Machine Classification
- Vasilis A. Sotiris
- AMSC663 Midterm Presentation
- December 2007
- University of Maryland
- College Park, MD 20783
2. Objectives
- Develop an algorithm to detect anomalies in (multivariate) electronic systems
- Improve the detection sensitivity of classical Support Vector Machines (SVMs)
- Decrease false alarms
- Predict future system performance
3. Methodology
- Use linear Principal Component Analysis (PCA) to decompose and compress the raw data into two models: a) a PCA model and b) a residual model
- Use Support Vector Machines (SVMs) to classify the data in each model into normal and abnormal classes
- Assign probabilities to the classification output of the SVMs using a sigmoid function
- Use Maximum Likelihood Estimation (MLE) to find the optimal sigmoid function parameters in each model
- Determine the joint class probability from both models
- Track changes to the joint probability to
  - improve detection sensitivity
  - decrease false alarms
  - predict future system performance
4. Flow Chart of the Probabilistic SVC Detection Methodology
[Flow chart: training data in the input space R^(n x m) is decomposed by PCA into a PCA model (R^(k x m)) and a residual model (R^(l x m)). In each model an SVC produces a decision value, D1(y1) or D2(y2), which a likelihood function converts into a probability matrix. The joint probabilities from the two models are trended against a baseline population database to reach a health decision for each new observation in R^(1 x m).]
5. Principal Component Analysis
6. Principal Component Analysis: Statistical Properties
- Decompose the data into two models
  - PCA model (maximum variance), y1
  - Residual model, y2
- The direction of y1 is the eigenvector with the largest associated eigenvalue λ
- The vector a is chosen as an eigenvector of the covariance matrix C
[Figure: principal directions y1 (PC1) and y2 (PC2) in the (x1, x2) plane]
7. Singular Value Decomposition (SVD) and Eigenanalysis
- SVD is used in this algorithm to perform PCA
- SVD
  - performs the eigenanalysis without first computing the covariance matrix
  - speeds up the computations
  - computes the basis functions (used in the projection step next)
- The output of the SVD, X = U L V^T, is (see the sketch below)
  - U (n x n): basis functions for the PCA and residual models
  - L (n x m): singular values, related to the eigenvalues of the covariance matrix
  - V (m x m): eigenvectors of the covariance matrix
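As a minimal sketch of this step (Python/NumPy here rather than the presentation's MATLAB code; the data matrix and the choice of k = 2 are purely illustrative): with rows as observations, the principal directions come from the right singular vectors, which are named Uk below to line up with the projector H = Uk Uk^T used on the following slides.

```python
import numpy as np

# PCA of a centered data matrix via SVD: Xc = U @ diag(s) @ Vt.
# Rows are observations, columns are the m monitored parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))              # stand-in for raw training data (n x m)
Xc = X - X.mean(axis=0)                    # center each parameter

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Eigenvalues of the covariance matrix C = Xc.T @ Xc / (n - 1)
eigvals = s**2 / (Xc.shape[0] - 1)

k = 2                                      # number of principal components kept
Uk = Vt[:k].T                              # m x k basis of the PCA (signal) subspace
print("fraction of variance in the PCA model:", eigvals[:k].sum() / eigvals.sum())
```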
8. Subspace Decomposition
- S: PCA model subspace
  - detects the dominant parameter variation
- R: residual subspace
  - detects hidden anomalies
- Therefore, the analysis of the system behavior can be decoupled into what are called the signal subspace and the residual subspace
- To get xS and xR we project the input data onto S and R
[Figure: the raw data are projected onto the signal subspace S (PCA model projection xS) and the residual subspace R (residual model projection xR)]
9. Least Squares Projections
- u: basis vector for PC1 and PC2
- v: vector from the centered training data to the new observation
- Objective: find the optimal p that minimizes ||v - p u||
- This gives the projected vector vp
- The projection equation is finally put in terms of the SVD: H = Uk Uk^T
- k: number of principal components (dimensions of the PCA model)
- The projection pursuit is optimized based on the PCA model
[Figure: least-squares projection of a new observation onto the plane S spanned by PC1 and PC2]
10. Data Decomposition
- With the projection matrix H we can project any incoming signal onto the signal subspace S and the residual subspace R
- G is a matrix analogous to H that creates the projection onto R
- H is the projection onto S and G is the projection onto R (as sketched below)
  - xS = H x: projection onto S
  - xR = G x: projection onto R
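A minimal sketch of this decomposition, assuming H = Uk Uk^T and taking G = I - H as the complementary projector (the slides only state that G is analogous to H); the data are synthetic.

```python
import numpy as np

# Project a new observation onto the signal subspace S and the residual
# subspace R using the projectors H and G built from the training data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

k = 2
Uk = Vt[:k].T                      # m x k basis of S (named Uk to match the slides)
H = Uk @ Uk.T                      # projection onto the signal subspace S
G = np.eye(X.shape[1]) - H         # projection onto the residual subspace R (assumed G = I - H)

x_new = rng.normal(size=5) - mean  # centered new observation
x_S = H @ x_new                    # PCA model projection
x_R = G @ x_new                    # residual model projection
assert np.allclose(x_S + x_R, x_new)   # the two projections reconstruct the observation
```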
11. Support Vector Machines
12. Support Vector Machines
- The performance of a system can be fully explained by the distribution of its parameters
- SVMs estimate the decision boundary for the given distribution
- Areas with less information are allowed a larger margin of error
- New observations can be classified using the decision boundary and are labeled as
  - (-1) outside
  - (+1) inside
[Figure: hard and soft decision boundaries around the data in the (x1, x2) plane]
13. Linear Classification: Separable Input Space
- The SVM finds a function D(x) that best separates the two classes (maximizes the margin M)
- D(x) can be used as a classifier
- Through the support vectors we can compress the input space by excluding all data except the support vectors
- The support vectors tell us everything we need to know about the system in order to perform detection
- By minimizing the norm of w we find the line or linear surface that best separates the two classes
- The decision function is the linear combination of the weight vector w (see the sketch below)
[Figure: separating line D(x) with margin M and weight vector w between the normal and abnormal classes, the training support vectors ai, and a new observation vector]
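A minimal sketch of the separable linear case (scikit-learn's SVC is used here as a stand-in for the presentation's own code; the two Gaussian clusters and all parameter values are illustrative).

```python
import numpy as np
from sklearn.svm import SVC

# Linear SVM: labels follow the slides, +1 = normal, -1 = abnormal.
rng = np.random.default_rng(2)
X_normal = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
X_abnormal = rng.normal(loc=4.0, scale=1.0, size=(50, 2))
X = np.vstack([X_normal, X_abnormal])
y = np.hstack([np.ones(50), -np.ones(50)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# D(x) = w . x + b; the support vectors alone determine the boundary
w, b = clf.coef_[0], clf.intercept_[0]
x_new = np.array([[1.0, 1.0]])
d_new = x_new @ w + b
print("D(x_new) =", d_new[0])                  # same sign as clf.predict(x_new)
print("number of support vectors:", len(clf.support_vectors_))
```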
14. Linear Classification: Inseparable Input Space
- For inseparable data the SVM finds a function D(x) that best separates the two classes by maximizing the margin M and minimizing the sum of the slack errors ξi
- The function D(x) can be used as a classifier
- In this illustration, a new observation point that falls to the right of D(x) is considered abnormal; points below and to the left are considered normal
- By minimizing the norm of w and the sum of the slack errors ξi we find the line or linear surface that best separates the two classes (the standard formulation is written out below)
[Figure: soft-margin separating line D(x) with margin M, slack errors ξ1 and ξ2, the normal and abnormal classes, the training support vectors, and a new observation vector]
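For reference, the soft-margin problem this slide describes can be written in its standard textbook form (not copied from the slides), where C weighs the slack errors ξi against the margin:

```latex
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_{i}
\qquad \text{subject to} \qquad
y_{i}\left(w^{\top}x_{i} + b\right) \ge 1 - \xi_{i},\quad \xi_{i} \ge 0 .
```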
15. Nonlinear Classification
- For inseparable data the SVM finds a nonlinear function D(x) that best separates the two classes by use of a kernel map k(., .)
- K(xi, x) = Φ(xi)·Φ(x)
- Feature map Φ(x) = (x^2, √2 x, 1)^T
- The decision function D(x) requires only the dot product of the feature map Φ, so it uses the same mathematical framework as the linear classifier
- This is called the kernel trick (a worked example follows on slide 28)
16. SVM Training
17. Training SVMs for Classification
- We need an effective way to train the SVM without the presence of negative-class data
- Convert the outer distribution of the positive class into a negative class
- Confidence limit training uses a defined confidence level around which a negative class is generated
- One-class training takes a percentage of the positive-class data and converts it to the negative class; it
  - is an optimization problem
  - minimizes the volume VS enclosed by the decision surface
  - does not need negative-class information (see the sketch below)
[Figure: confidence limit training yields decision boundary D1(x) enclosing volume VS1; one-class training yields decision boundary D2(x) enclosing volume VS2, with VS1 > VS2]
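A minimal sketch of one-class training on positive (healthy) data only, using scikit-learn's OneClassSVM as a stand-in for the presentation's own training scheme; nu, gamma, and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# One-class training with only positive (healthy) baseline data.
# nu bounds the fraction of training points treated as outliers, i.e. the part
# of the positive class that effectively plays the role of the negative class.
rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 2))               # healthy baseline data only

ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_train)

x_new = np.array([[0.2, -0.1], [5.0, 5.0]])
print(ocsvm.predict(x_new))                       # +1 = inside (normal), -1 = outside (abnormal)
print(ocsvm.decision_function(x_new))             # signed distance to the decision boundary
```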
18. One-Class Training
- The negative class is important for SVM accuracy
- The data is partitioned using k-means
- The negative class is computed around each cluster centroid
- The negative class is selected from the positive-class data as the points that have the fewest neighbors, denoted by D
- Computationally this is done by maximizing the sum of Euclidean distances between all points (one reading of this selection rule is sketched below)
[Figure: performance region in the (x1, x2) plane with the selected negative-class points around each cluster]
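The sketch below is one reading of this selection rule: cluster with k-means, then relabel the points with the fewest neighbors within a radius eps as the negative class. The clustering parameters, eps, and the relabeled fraction are illustrative, not taken from the presentation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Partition the positive data with k-means, then flag the points with the
# fewest neighbors inside radius eps in each cluster as candidate negatives.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
eps, frac = 1.0, 0.10                            # illustrative parameters
negative_idx = []
for c in range(km.n_clusters):
    idx = np.where(km.labels_ == c)[0]
    pts = X[idx]
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
    neighbors = (d < eps).sum(axis=1) - 1        # exclude the point itself
    n_neg = max(1, int(frac * len(idx)))
    negative_idx.extend(idx[np.argsort(neighbors)[:n_neg]])  # fewest neighbors first

print("points relabeled as negative class:", len(negative_idx))
```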
19. Class Prediction Probabilities and Maximum Likelihood Estimation
20. Fitting a Sigmoid Function
- In this project we are interested in the probability that our class prediction is correct, i.e., in modeling the misclassification rate
- The class prediction in PHM is the prediction of normality or abnormality
- With an MLE estimate of the density function of these class probabilities we can determine the uncertainty of the prediction
[Figure: the hard decision boundary D(x) and the class probability as a function of the distance from it]
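The sigmoid used on the following slides has the standard Platt form (assumed here, since the slides do not reproduce the formula), with D(x) the SVM decision value and a, b the parameters fitted by MLE:

```latex
P\big(y = 1 \mid D(x)\big) \;=\; \frac{1}{1 + \exp\big(a\,D(x) + b\big)}
```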
21. MLE and SVMs
- Using a semi-parametric approach, a sigmoid function S is fitted along the hard decision boundary to model the class probability
- We are interested in determining the density function that best describes this probability
- The likelihood is computed based on knowledge of the decision function values D(xi) in the parameter space
[Figure: the decision boundary D(x) in the (x1, x2) plane and the likelihood function P(y | D(xi)) plotted against D(x)]
22. MLE and the Sigmoid Function
- The parameters a and b are determined by solving a maximum likelihood estimation (MLE) problem for y
- The minimization is a two-parameter optimization problem of F, a function of a and b
- Depending on the parameters a and b, the shape of the sigmoid will change
- It can be proven that the MLE optimization problem is convex
- We can use Newton's method with a backtracking line search (a sketch of the fit follows)
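A minimal sketch of this fit on synthetic decision values, using a general-purpose SciPy optimizer in place of the hand-coded Newton iteration with backtracking mentioned above; labels are encoded 0/1 for the likelihood.

```python
import numpy as np
from scipy.optimize import minimize

# Fit P(y=1 | D) = 1 / (1 + exp(a*D + b)) by maximizing the likelihood.
# The decision values D(x_i) below are synthetic stand-ins.
rng = np.random.default_rng(5)
D = np.concatenate([rng.normal(-2, 1, 100), rng.normal(2, 1, 100)])  # D(x_i)
y = np.concatenate([np.zeros(100), np.ones(100)])                    # 0 = abnormal, 1 = normal

def neg_log_likelihood(params):
    a, b = params
    p = 1.0 / (1.0 + np.exp(a * D + b))
    eps = 1e-12                                   # guard against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method="BFGS")
a_hat, b_hat = res.x
print("fitted sigmoid parameters a, b:", round(a_hat, 3), round(b_hat, 3))
```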
23. Joint Probability Model
24. Joint Probability Model
[Diagram: the final class probability P(y | xS, xR) for the classification of x combines the projection onto the PCA model and the projection onto the residual model]
- The class prediction P(y | xS, xR) is based on the joint class probabilities from
  - the PCA model, p(y | xS)
  - the residual model, p(y | xR)
- p(y = c | xS): the probability that a point xS is classified as c in the PCA model
- p(y = c | xR): the probability that a point xR is classified as c in the residual model
- P(y = c | xS, xR): the final probability that a point x is classified as c
- We anticipate better accuracy and sensitivity to the onset of anomalies
25. Joint Probability Model
[Equations: Bayes' rule and the independence assumption]
- The joint probability model depends on the results of the SVC from both models (PCA and residual)
- Assumption: the data in the two models are linearly independent
- Changes in the joint classification probability can be used as a precursor to anomalies and used for prediction (a sketch of one such combination rule follows)
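The sketch below shows one way to fuse the two per-model posteriors under the independence assumption (a naive-Bayes style combination with a tunable prior); it is an illustration of the idea, not the slide's own Bayes-rule equations.

```python
# Fuse p(y=normal | xS) and p(y=normal | xR) into a single posterior,
# assuming the two models contribute independent evidence.
def joint_posterior(p_s, p_r, prior=0.5):
    """p_s = p(y = normal | xS), p_r = p(y = normal | xR)."""
    num = p_s * p_r / prior
    den = num + (1.0 - p_s) * (1.0 - p_r) / (1.0 - prior)
    return num / den

print(joint_posterior(0.9, 0.8))   # two confident models reinforce each other (~0.97)
print(joint_posterior(0.9, 0.4))   # a weak residual-model vote pulls the estimate down (~0.86)
```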
26. Schedule / Progress
27. SVM Classification Example
28. Example: Nonlinear Classification
- We have four 1-D data points in a vector x with a label vector y, given by
  - x = [1, 2, 5, 6]^T
  - y = [-1, -1, 1, -1]^T
- This means that the coordinates x(1), x(2), and x(4) belong to the same class I (circles) and x(3) is its own class II (squares)
- The decision function D(x) is given as the nonlinear combination of the weight vector, which is expressed in terms of the Lagrange multipliers
- The Lagrange multipliers are computed from the quadratic optimization problem
- We are going to use a polynomial kernel of degree two because we can see that some kind of parabola will separate the classes
29. Example: Nonlinear Classification - Constructing the Hessian for the Quadratic Optimization
- Notice that in order to calculate the scalar product Φ^T Φ in the feature space, we do not need to perform the mapping using the equation for Φ. Instead, we calculate this product directly in the input space from the input data by computing the kernel of the map
- This is called the kernel trick (the Hessian for this example is built below)
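A short sketch of the Hessian construction for the worked example, H[i, j] = y_i y_j K(x_i, x_j), with the degree-2 polynomial kernel evaluated directly in the input space.

```python
import numpy as np

# Build the Hessian for the dual problem of the worked example using
# the degree-2 polynomial kernel K(u, v) = (u*v + 1)**2.
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])

K = (np.outer(x, x) + 1.0) ** 2        # kernel matrix, computed in the input space
H = np.outer(y, y) * K                 # Hessian of the quadratic objective
print(H)
```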
30. Example: Nonlinear Classification - The Kernel Trick
- Let x belong to the real 2-D input space
- Choose a mapping function Φ of degree two
- The required dot product of the map functions can be expressed as a dot product in the input space
- This is the kernel trick
- The kernel trick basically says that the mapping can be expressed in terms of a dot product of the input-space data raised to some degree, here to the second degree (checked numerically below)
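A numerical check of the kernel trick in 2-D, assuming the standard degree-2 polynomial feature map (the slide's explicit map is not reproduced here).

```python
import numpy as np

# For phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2, sqrt(2)*x1, sqrt(2)*x2, 1),
# the dot product phi(x).phi(z) equals (x.z + 1)**2.
def phi(x):
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1**2, x2**2, s * x1 * x2, s * x1, s * x2, 1.0])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
print(phi(x) @ phi(z))          # 4.0, computed in the feature space
print((x @ z + 1.0) ** 2)       # 4.0, the same value computed in the input space
```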
31. Example: Nonlinear Classification - The Decision Function D(x)
- Compute the Lagrange multipliers α from the quadratic optimization problem
- Plug them into the equation for D(x)
- Determine b using the class constraints
  - y = [-1, -1, 1, -1]
  - b = -9
- The end result is a nonlinear (quadratic) decision function
- For x(1) = 1: D(x) = -4.33, sign(D(x)) < 0, so class I
- For x(2) = 2: D(x) = -1.00, sign(D(x)) < 0, so class I
- For x(3) = 5: D(x) = 0.994, sign(D(x)) > 0, so class II
- For x(4) = 6: D(x) = -1.009, sign(D(x)) < 0, so class I
- The nonlinear classifier correctly classified the data! (the full calculation is reproduced numerically below)
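The whole calculation can be reproduced numerically; the sketch below solves the dual quadratic program with SciPy in place of MATLAB's quadprog and rebuilds D(x), recovering b ≈ -9 and the decision values quoted above.

```python
import numpy as np
from scipy.optimize import minimize

# Verification sketch of the worked example: solve the dual QP and rebuild D(x).
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])
K = (np.outer(x, x) + 1.0) ** 2                  # degree-2 polynomial kernel matrix
H = np.outer(y, y) * K                           # Hessian of the dual objective

def dual(alpha):                                 # negative of the dual objective
    return 0.5 * alpha @ H @ alpha - alpha.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}    # sum_i alpha_i * y_i = 0
res = minimize(dual, x0=np.ones(4), bounds=[(0, None)] * 4, constraints=cons)
alpha = res.x

sv = alpha > 1e-3                                # support vectors (alpha > 0)
b = np.mean(y[sv] - (alpha * y) @ K[:, sv])      # from y_i * D(x_i) = 1 on the SVs
D = lambda t: float(np.sum(alpha * y * (x * t + 1.0) ** 2) + b)

print("b =", round(b, 3))                        # approx -9, as on the slide
print([round(D(t), 2) for t in x])               # approx [-4.33, -1.0, 1.0, -1.0]
```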
32. Quadratic Optimization and Global Solutions
- What do all these methods have in common?
- A quadratic optimization for the weight vector w, where H is the Hessian matrix and y is the class membership of each training point
- This type of problem is defined as a quadratic optimization problem, the solution of which gives the Lagrange multipliers α, which in turn are used in D(x)
- In MATLAB, quadprog is used to solve the quadratic optimization
- Because the quadratic problem is convex, the solution it yields is guaranteed to be a global solution (the standard dual form is written out below)