1
Bayesian Support Vector Machine Classification
  • Vasilis A. Sotiris
  • AMSC663 Midterm Presentation
  • December 2007
  • University of Maryland
  • College Park, MD 20783

2
Objectives
  • Develop an algorithm to detect anomalies in
    electronic systems (multivariate)
  • Improve detection sensitivity of classical
    Support Vector Machines (SVM)
  • Decrease false alarms
  • Predict future system performance

3
Methodology
  • Use linear Principal Component Analysis (PCA) to
    decompose and compress the raw data into two
    models: (a) a PCA model and (b) a residual model
  • Use Support Vector Machines to classify the data
    (in each model) into normal and abnormal classes
  • Assign probabilities to the classification output
    of the SVMs using a sigmoid function
  • Use maximum likelihood estimation to find the
    optimal sigmoid function parameters (in each
    model)
  • Determine the joint class probability from both
    models
  • Track changes to the joint probability to
  • improve detection sensitivity
  • decrease false alarms
  • Predict future system performance

4
Flow chart of Probabilistic SVC Detection
Methodology
[Flow chart: training data from a baseline population database (input space R^(n x m)) is decomposed by PCA into a PCA model (R^(k x m)) and a residual model (R^(l x m)). An SVC in each model produces a decision boundary, D1(y1) and D2(y2). For a new observation (R^(1 x m)), a likelihood function converts each SVC output into a probability matrix and a class probability p; the probability model combines these into joint probabilities, whose distributions are trended to reach a health decision.]
5
Principal Component Analysis
6
Principal Component Analysis - Statistical Properties
  • Decompose data into two models
  • PCA model (Maximum variance) y1
  • Residual model y2
  • Direction of y1 is the eigenvector with the
    largest associated eigenvalue λ
  • Vector a is chosen as the eigenvector of the
    covariance matrix C

[Figure: data scatter in the (x1, x2) plane with the principal directions y1 (PC1) and y2 (PC2).]
7
Singular Value Decomposition (SVD) - Eigenanalysis
  • SVD is used in this algorithm to perform PCA
  • SVD
  • performs eigenanalysis without first computing
    the covariance matrix
  • Speeds up computations
  • Computes basis functions (used in projection
    next)
  • The output of the SVD is
  • U - basis functions for the PCA and residual
    models
  • Λ - eigenvalues of the covariance matrix
  • V - eigenvectors of the covariance matrix

X = U Λ Vᵀ, where U is n x n, Λ is n x m, and V is m x m
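A minimal numpy sketch of this step (not the project's MATLAB code): SVD of the centered data matrix yields the eigenvalues and eigenvectors of the covariance matrix without ever forming C. The data values are made up for illustration.

```python
import numpy as np

# Illustrative training matrix: n observations (rows) of m parameters (columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                 # center each parameter

# SVD performs the eigenanalysis without first forming the covariance matrix
# C = X.T @ X / (n - 1).
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigvals = s**2 / (X.shape[0] - 1)   # eigenvalues of C (the Λ output above)
directions = Vt.T                   # eigenvectors of C (principal directions / basis vectors)
print(eigvals)
```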
8
Subspace Decomposition
  • S - PCA model (signal) subspace
  • detects dominant parameter variation
  • R - residual subspace
  • detects hidden anomalies
  • Analysis of the system behavior can therefore be
    decoupled into what is called the signal subspace
    and the residual subspace
  • To get xS and xR we project the input data onto
    S and R

[Figure: raw data x is split into its PCA model projection xS onto S and its residual model projection xR onto R.]
9
Least Squares Projections
  • u - basis vectors for PC1 and PC2
  • v - vector from the centered training data to the
    new observation
  • Objective
  • find the optimal p that minimizes ||v - pu||
  • this gives the projected vector vp
  • The projection equation is finally put in terms
    of the SVD
  • H = Uk Ukᵀ
  • k - number of principal components (dimensions of
    the PCA model)
  • The projection pursuit is optimized based on the
    PCA model

[Figure: a new observation is projected onto the plane S spanned by PC1 and PC2; R denotes the orthogonal residual direction.]
10
Data Decomposition
  • With the projection matrix H, we can project any
    incoming signal onto the signal S and residual
    R subspaces
  • G is an analogous matrix to H that is used to
    create the projection onto R
  • H is the projection onto S, and G is the
    projection onto R

xS = H x  (projection onto S),   xR = G x  (projection onto R)
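A minimal numpy sketch of the two projections, assuming the construction H = Uk Ukᵀ from the previous slide and the standard complement G = I - H (the explicit form of G is my assumption; the slides only state that G is analogous to H):

```python
import numpy as np

def projection_matrices(X, k):
    """Build H (projection onto the PCA subspace S) and G (projection onto the
    residual subspace R) from training data X with n observations of m parameters.
    G = I - H is an assumption; the slides only say G is analogous to H."""
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    Uk = Vt[:k].T                 # k leading principal directions (m x k)
    H = Uk @ Uk.T                 # projection onto S
    G = np.eye(X.shape[1]) - H    # projection onto R
    return H, G

# Any centered observation x then decomposes as x = xS + xR.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
H, G = projection_matrices(X, k=2)
x = rng.normal(size=4)
xS, xR = H @ x, G @ x
```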
11
Support Vector Machines
12
Support Vector Machines
  • The performance of a system can be fully
    explained with the distribution of its parameters
  • SVMs estimate the decision boundary for the given
    distribution
  • Areas with less information are allowed a larger
    margin of error
  • New observations can be classified using the
    decision boundary and are labeled as
  • (-1) outside
  • (1) inside

[Figure: data in the (x1, x2) plane with a hard decision boundary drawn tightly around the data and a soft decision boundary allowing a larger margin where there is less information.]
13
Linear Classification - Separable Input Space
  • SVM finds a function D(x) that best separates the
    two classes (max M)
  • D(x) can be used as a classifier
  • Through the support vectors we can
  • compress the input space by excluding all other
    data except for the support vectors.
  • The SVs tell us everything we need to know about
    the system in order to perform detection
  • By minimizing the norm of w we find the line or
    linear surface that best separates the two
    classes
  • The decision function D(x) is linear in the
    input and is defined by the weight vector w

[Figure: separable data in the (x1, x2) plane. The normal and abnormal classes are separated by D(x) with margin M; w is normal to the boundary; the circled training points are the support vectors (multipliers ai); a new observation vector is classified by the side of D(x) on which it falls.]
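A minimal sketch of linear classification on separable data, using scikit-learn as an illustrative stand-in for the project's MATLAB implementation; the data, the large-C setting (to approximate a hard margin), and the variable names are all assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated classes in the (x1, x2) plane (made-up data).
rng = np.random.default_rng(0)
normal = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))     # class +1 (normal)
abnormal = rng.normal(loc=[3, 3], scale=0.5, size=(50, 2))   # class -1 (abnormal)
X = np.vstack([normal, abnormal])
y = np.hstack([np.ones(50), -np.ones(50)])

# A very large C approximates the hard-margin classifier D(x) = w.x + b.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Only the support vectors are needed to define D(x).
print(clf.support_vectors_)

# A new observation is labeled by the sign of D(x): +1 normal, -1 abnormal.
x_new = np.array([[0.4, -0.2]])
print(np.sign(w @ x_new[0] + b), clf.predict(x_new))
```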
14
Linear Classification - Inseparable Input Space
  • For inseparable data the SVM finds a function
    D(x) that best separates the two classes by
  • maximizing the margin M and minimizing the sum of
    slack errors ξi
  • Function D(x) can be used as a classifier
  • In this illustration, a new observation point
    that falls to the right of D(x) is considered
    abnormal
  • Points below and to the left are considered
    normal
  • By minimizing the norm of w and the sum of slack
    errors ξi we find the line or linear surface that
    best separates the two classes

[Figure: inseparable data in the (x1, x2) plane. The boundary D(x) with margin M separates the normal and abnormal classes; points inside the margin or on the wrong side carry slack errors ξ1, ξ2; circled points are the support vectors; a new observation vector is shown.]
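A short companion sketch for the inseparable case, again with scikit-learn as a stand-in: the penalty C weights the sum of slack errors ξi against the margin width 2/||w||, so a small C tolerates more slack than a large C. The data and parameter values are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping (inseparable) classes (made-up data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, (100, 2)),
               rng.normal([2, 2], 1.0, (100, 2))])
y = np.hstack([np.ones(100), -np.ones(100)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2.0 / np.linalg.norm(clf.coef_[0])     # margin width M = 2 / ||w||
    print(f"C={C}: margin={margin:.2f}, support vectors={len(clf.support_)}")
```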
15
Nonlinear classification
  • For inseparable data the SVM finds a nonlinear
    function D(x) that best separates the two classes
    by
  • Use of a kernel map k(·)
  • K = Φ(xi)ᵀΦ(x)
  • Feature map Φ(x) = [x², √2·x, 1]ᵀ
  • The decision function D(x) requires only the dot
    product of the feature map Φ, so it uses the same
    mathematical framework as the linear classifier
  • This is called the Kernel Trick
  • (example)

16
SVM Training
17
Training SVMs for Classification
  • Need effective way to train SVM without the
    presence of negative class data
  • Convert outer distribution of positive class to
    negative
  • Confidence limit training uses a defined
    confidence level around which a negative class is
    generated
  • One-class training takes a percentage of the
    positive class data and converts it to the
    negative class
  • is an optimization problem
  • minimizes the volume VS enclosed by the decision
    surface
  • does not need negative class information

[Figure: two training schemes in the (x1, x2) plane. Confidence-limit training yields a decision surface D1(x) enclosing volume VS1; one-class training yields D2(x) enclosing the smaller volume VS2, with VS1 > VS2.]
18
One Class Training
[Figure: clustered positive-class data in the (x1, x2) plane within the performance region; the sparsest points around each cluster are relabeled as the negative class.]
  • The negative class is important for SVM accuracy
  • The data is partitioned using k-means
  • The negative class is computed around each
    cluster centroid
  • The negative class is selected from the positive
    class data as the points that have
  • the fewest neighbors
  • denoted by D
  • Computationally this is done by maximizing the
    sum of Euclidean distances between all points

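A minimal sketch of this one-class selection step, assuming scikit-learn's KMeans for the partitioning and a fixed fraction of relabeled points per cluster; the cluster count, the fraction, and the function name are illustrative choices, not taken from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_negative_class(X_pos, n_clusters=3, fraction=0.1):
    """Relabel the 'loneliest' positive points (largest summed Euclidean distance
    to the other points in their k-means cluster) as the negative class."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X_pos)
    negative_idx = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        Xc = X_pos[idx]
        # summed pairwise distances within the cluster
        d = np.linalg.norm(Xc[:, None, :] - Xc[None, :, :], axis=2).sum(axis=1)
        n_neg = max(1, int(fraction * len(idx)))
        negative_idx.extend(idx[np.argsort(d)[-n_neg:]])   # fewest neighbors = largest distances
    y = np.ones(len(X_pos), dtype=int)
    y[negative_idx] = -1
    return y

rng = np.random.default_rng(2)
X_pos = rng.normal(size=(300, 2))
print(np.sum(select_negative_class(X_pos) == -1), "points relabeled as negative")
```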
19
Class Prediction Probabilities and Maximum
Likelihood Estimation
20
Fitting a Sigmoid Function
  • In this project we are interested in finding the
    probability that our class prediction is correct
  • Modeling the misclassification rate
  • The class prediction in PHM is the prediction of
    normality or abnormality
  • With an MLE estimate of the density function of
    these class probabilities we can determine the
    uncertainty of the prediction

[Figure: left, data in the (x1, x2) plane with the hard decision boundary D(x); right, probability as a function of distance from D(x).]
21
MLE and SVMs
  • Using a semi-parametric approach, a sigmoid
    function S is fitted along the hard decision
    boundary to model the class probability
  • We are interested in determining the density
    function that best prescribes this probability
  • The likelihood is computed based on the knowledge
    of the decision function values D(xi), in the
    parameter space

[Figure: data in the (x1, x2) plane with boundary D(x), and the likelihood function P(y|D(xi)) plotted against D(x).]
22
MLE and the Sigmoid Function
  • Parameters a and b are determined by maximum
    likelihood estimation (MLE) of y
  • The minimization is a two-parameter optimization
    problem of F, a function of a and b
  • Depending on the parameters a and b, the shape of
    the sigmoid will change
  • It can be proven that the MLE optimization
    problem is convex
  • Can use Newton's method with a backtracking line
    search
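A minimal sketch of the sigmoid fit, assuming the usual Platt parameterization P(y=1 | D(x)) = 1/(1 + exp(a·D(x) + b)) and using scipy's BFGS optimizer rather than the Newton method with backtracking mentioned above; the data values are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(D, y):
    """Fit P(y=1 | D(x)) = 1 / (1 + exp(a*D + b)) by maximum likelihood.
    D: decision-function values D(x_i); y: labels in {-1, +1}."""
    t = (y + 1) / 2.0                        # targets in {0, 1}

    def neg_log_likelihood(params):
        a, b = params
        p = 1.0 / (1.0 + np.exp(a * D + b))
        eps = 1e-12                          # guard against log(0)
        return -np.sum(t * np.log(p + eps) + (1 - t) * np.log(1 - p + eps))

    res = minimize(neg_log_likelihood, x0=[-1.0, 0.0], method="BFGS")
    return res.x                             # fitted (a, b)

# Example: positive decision values should map to probabilities near 1.
D = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([-1, -1, -1, 1, 1, 1])
a, b = fit_sigmoid(D, y)
print(a, b)
```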

23
Joint Probability Model
24
Joint Probability Model
P(y | xS, xR) - the final class probability, where y is the classification for x, xS is the projection onto the PCA model, and xR is the projection onto the residual model
  • Class prediction P(y|xS,xR) is based on the joint
    class probabilities from the
  • PCA model, p(y|xS)
  • Residual model, p(y|xR)
  • p(y=c|xS) - the probability that a point xS is
    classified as c in the PCA model
  • p(y=c|xR) - the probability that a point is
    classified as c in the residual model
  • P(y=c|xS,xR) - the final probability that a point
    x is classified as c
  • Anticipate better accuracy and sensitivity to the
    onset of anomalies

25
Joint Probability Model
Bayes Rule
Assumption
  • The joint probability model depends on the
    results of the SVC from both models (PCA and
    Residual)
  • Assumption: the data in the two models is
    linearly independent
  • Changes in the joint classification probability
    can be used as a precursor to anomalies and used
    for prediction
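The equations on this slide are not recoverable from the transcript; one standard way to write the combination implied by Bayes' rule and the independence assumption (a sketch, not necessarily the author's exact formula) is:

```latex
P(y \mid x_S, x_R)
  = \frac{p(x_S, x_R \mid y)\, P(y)}{p(x_S, x_R)}
  = \frac{p(x_S \mid y)\, p(x_R \mid y)\, P(y)}{p(x_S, x_R)}
  \;\propto\; \frac{p(y \mid x_S)\, p(y \mid x_R)}{P(y)}
```

where the second equality uses the assumed independence of the two models.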

26
Schedule/Progress
27
SVM Classification Example
28
Example Non-Linear Classification
  • Have four 1-D data points represented in a
    vector x and a label vector y given by
  • x = [1, 2, 5, 6]ᵀ
  • y = [-1, -1, 1, -1]ᵀ
  • This means that coordinates x(1), x(2) and x(4)
    belong to the same class I (circles) and x(3) is
    its own class II (squares)
  • The decision function D(x) is given as the
    nonlinear combination of the weight vector, which
    is expressed in terms of the Lagrange multipliers
  • The Lagrange multipliers are computed in the
    quadratic optimization problem
  • We are going to use a polynomial kernel of degree
    two because we can see that some kind of parabola
    will separate the classes

29
Example Non-Linear Classification - Construct
Hessian for quadratic optimization
  • Notice that in order to calculate the scalar
    product ΦᵀΦ in the feature space, we do not need
    to perform the mapping using the equation for Φ.
    Instead we calculate this product directly in the
    input space, using the input data, by computing
    the kernel of the map
  • This is called the kernel trick

30
Example Non-Linear Classification - The Kernel
Trick
  • Let x belong to the real 2-D input space
  • Choose a mapping function Φ of degree two
  • The required dot product of the map function can
    be expressed as
  • a dot product in the input space
  • This is the kernel trick
  • The kernel trick says that the dot product of the
    mapped data can be computed in terms of a dot
    product of the input space data raised to some
    degree
  • here to the second degree
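A small numeric check of this identity, assuming the usual degree-2 polynomial feature map Φ(x) = [x1², x2², √2·x1·x2, √2·x1, √2·x2, 1]ᵀ (the slide's exact map is not shown in the transcript), for which Φ(x)·Φ(z) = (x·z + 1)²:

```python
import numpy as np

def phi(x):
    """Degree-2 polynomial feature map for 2-D input (one standard choice)."""
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2)*x1*x2,
                     np.sqrt(2)*x1, np.sqrt(2)*x2, 1.0])

def kernel(x, z):
    """Kernel trick: the same dot product computed directly in the input space."""
    return (np.dot(x, z) + 1.0) ** 2

x, z = np.array([1.0, 2.0]), np.array([-0.5, 3.0])
print(np.dot(phi(x), phi(z)), kernel(x, z))   # identical values
```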

31
Example Non-Linear Classification - Decision function D(x)
  • Compute the Lagrange multipliers a through the
    quadratic optimization problem
  • Plug them into the equation for D(x)
  • Determine b using the class constraints
  • y = [-1, -1, 1, -1]
  • b = -9
  • The end result is a nonlinear (quadratic)
    decision function
  • For x(1) = 1, D(x) = -4.33 < 0 → C1
  • For x(2) = 2, D(x) = -1.00 < 0 → C1
  • For x(3) = 5, D(x) = 0.994 > 0 → C2
  • For x(4) = 6, D(x) = -1.009 < 0 → C1
  • The nonlinear classifier correctly classified the
    data!
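A short numerical sketch that reproduces this worked example, using scipy to solve the dual quadratic program in place of MATLAB's quadprog mentioned on the next slide; the variable names are mine, and the printed multipliers and bias should approximately match b = -9 and the decision values reported above.

```python
import numpy as np
from scipy.optimize import minimize

# Data and labels from the worked example (slides 28-31).
x = np.array([1.0, 2.0, 5.0, 6.0])
y = np.array([-1.0, -1.0, 1.0, -1.0])

# Degree-2 polynomial kernel and the Hessian H_ij = y_i y_j K(x_i, x_j).
K = (np.outer(x, x) + 1.0) ** 2
H = np.outer(y, y) * K

# Dual problem: minimize 0.5 a'Ha - sum(a) subject to a >= 0 and sum(a_i y_i) = 0.
res = minimize(lambda a: 0.5 * a @ H @ a - a.sum(),
               x0=np.zeros(4),
               bounds=[(0, None)] * 4,
               constraints={"type": "eq", "fun": lambda a: a @ y},
               method="SLSQP")
alpha = res.x

# Bias b from a support vector i (alpha_i > 0): y_i = D(x_i)  =>  b = y_i - sum_j a_j y_j K_ij.
sv = np.argmax(alpha)
b = y[sv] - np.sum(alpha * y * K[:, sv])

# Decision function D(x) and the resulting classification.
def D(x_new):
    return np.sum(alpha * y * (x * x_new + 1.0) ** 2) + b

print(alpha, b)                      # roughly [0, 2.5, 7.33, 4.83] and b ~ -9
print([np.sign(D(xi)) for xi in x])  # [-1, -1, +1, -1], matching the labels
```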

32
Quadratic Optimization and Global solutions
  • What do all these methods have in common?
  • Quadratic optimization of the weight vector w
  • where H is the Hessian matrix
  • y is the class membership of each training point
  • This type of equation is defined as a quadratic
    optimization problem, the solution to which gives
  • the Lagrange multipliers a, which in turn are
    used in D(x)
  • In MATLAB, quadprog is used to solve the
    quadratic optimization
  • Because the quadratic problem is convex, the
    solution found is guaranteed to be a global
    solution