Outlier Detection Using SemiDiscrete Decomposition

1
Outlier Detection Using SemiDiscrete Decomposition
S. McConnell and D.B. Skillicorn
  • Presented by Bryan Decaire

2
Dataset / Table
  • n rows (objects) each with m columns
    (attributes)
  • each object a point in m-dimensional space
  • clustering a method to find some structure in the
    data
  • high dimensional spaces are difficult to work
    with
  • distance between nearest and farthest neighbor
    almost the same
  • large m creates problems (large here meaning m > 15)
  • Dimension-reduction techniques (e.g. SVD)
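The "nearest and farthest neighbor" claim can be checked empirically. A small sketch (my own illustration, not from the slides) using uniform random data:

```python
# Illustration: in high dimensions, distances from a query point to its
# nearest and farthest neighbours concentrate (their ratio approaches 1).
import numpy as np

rng = np.random.default_rng(0)
for m in (2, 15, 500):                     # number of attributes (columns)
    points = rng.random((1000, m))         # 1000 objects in m-dim space
    query = rng.random(m)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"m={m:3d}  farthest/nearest = {dists.max() / dists.min():.2f}")
```

The ratio shrinks toward 1 as m grows, which is why distance-based clustering degrades in high-dimensional spaces.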

3
SVD
  • Transforms from an m-dimensional space to new
    m-dimensional space
  • A = U S Vᵀ
  • A is an n x m matrix
  • U is an n x m matrix
  • S is an m x m diagonal matrix of non-negative,
    decreasing singular values, r of which are
    non-zero, where r is the rank of A (r ≤ m)
  • V is an m x m matrix

4
SVD
  • Properties of new space are
  • Axes (U and V) are orthonormal
  • Axes are ordered such that variation that exists
    is ordered from largest to smallest
  • Later dimensions may thus be ignored
  • Rank k (k ≤ r) approximation to A
  • Multiply the following
  • first k columns of U
  • upper left k x k submatrix of S
  • First k columns of V, transposed
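The rank-k recipe above can be written directly with NumPy (a sketch; the variable names and test matrix are mine):

```python
# Rank-k approximation of A from its SVD, following the recipe above.
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))                       # n=8 objects, m=5 attributes

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
assert np.all(np.diff(s) <= 0)               # singular values are decreasing

k = 3
A_k = (U[:, :k]                              # first k columns of U
       @ np.diag(s[:k])                      # upper-left k x k block of S
       @ Vt[:k, :])                          # first k rows of V, transposed

# Frobenius error equals the energy in the discarded singular values.
err = np.linalg.norm(A - A_k)
```

Because variation is ordered largest to smallest, dropping the later dimensions loses as little as possible.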

5
SDD
  • Weak analogue of SVD
  • Axes are NOT orthonormal
  • Coordinates of points are from the set {-1, 0, 1}
  • Compact representation of transformed space
    (2-bits/element of X and Y)
  • O((nm)³) complexity

6
SDD
  • Mixed integer programming (MIP) problem
  • Linear programming problem
  • linear function to be optimized
  • problem constraints
  • non-negative variables
  • Mixed integer because some of the decision
    variables are constrained to have only integer
    values at the optimal solution

7
SDD - Equation
  • A_k = X_k D_k Y_k
  • equation for a k-dimensional SDD
  • A_k is an n x m matrix
  • X_k is an n x k matrix, elements from {-1, 0, 1}
  • D_k is a k x k diagonal matrix
  • Y_k is a k x m matrix, elements from {-1, 0, 1}
  • k is no longer related to the rank of matrix A
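A toy instance of this factorization (the values of X, D, and Y are invented purely for illustration):

```python
# Toy SDD-form product A_k = X_k D_k Y_k with ternary X and Y.
import numpy as np

X = np.array([[ 1,  0],
              [ 1,  1],
              [-1,  0],
              [ 0, -1]])                 # n x k, entries in {-1, 0, 1}
D = np.diag([5.0, 2.0])                  # k x k diagonal
Y = np.array([[1, 1,  0],
              [0, 1, -1]])               # k x m, entries in {-1, 0, 1}

A_k = X @ D @ Y                          # n x m approximation
# Each entry of X and Y fits in 2 bits, hence the compact representation.
```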

8
SDD - Algorithm
  • Let A_k = the kth-term approximation
  • Let R_k = the residual at the kth step
  • Let x_i be the ith column of X, d_i the ith
    diagonal element of D, and y_i the ith row of Y
  • X, D, Y are initialized to 0
  • For (i = 1 to k)
  • A_{i-1} = X D Y
  • R_i = A - A_{i-1}
  • Find a triple (x_i, d_i, y_i) that minimizes
    ||R_i - d_i x_i y_i||_F
  • Choose an initial y_i
  • Solve for x_i and d_i using the selected y_i
  • Solve for y_i using d_i and x_i from the
    previous step
  • Repeat until the convergence criterion is met
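The loop above can be sketched in NumPy. This is a simplified version of the standard alternating heuristic (in the style of Kolda and O'Leary); the function names, the initial choice of y_i, and the fixed inner-iteration count standing in for a convergence test are my own choices, not from the slides:

```python
# Sketch of the alternating SDD algorithm: greedily remove rank-one
# "slices" d_i * x_i * y_i with ternary x_i and y_i.
import numpy as np

def best_ternary(s, other_norm_sq):
    # For a fixed other factor, the best ternary vector z maximizes
    # (z.s)^2 / (|z|^2 * other_norm_sq): take the signs of the
    # largest-magnitude entries of s, trying every cutoff J.
    order = np.argsort(-np.abs(s))
    best_val, best_J, csum = -1.0, 1, 0.0
    for J, idx in enumerate(order, start=1):
        csum += abs(s[idx])
        val = csum * csum / (J * other_norm_sq)
        if val > best_val:
            best_val, best_J = val, J
    z = np.zeros_like(s)
    z[order[:best_J]] = np.sign(s[order[:best_J]])
    return z

def sdd(A, k, inner_iters=20):
    n, m = A.shape
    X, D, Y = np.zeros((n, k)), np.zeros(k), np.zeros((k, m))
    R = A.astype(float).copy()                    # residual R_i
    for i in range(k):
        y = np.zeros(m)
        y[np.argmax(np.abs(R).sum(axis=0))] = 1.0  # initial y_i
        for _ in range(inner_iters):               # stand-in convergence test
            x = best_ternary(R @ y, y @ y)         # solve for x_i given y_i
            y = best_ternary(R.T @ x, x @ x)       # solve for y_i given x_i
        d = (x @ R @ y) / ((x @ x) * (y @ y))      # optimal d_i for (x_i, y_i)
        X[:, i], D[i], Y[i] = x, d, y
        R -= d * np.outer(x, y)                    # remove the slice
    return X, np.diag(D), Y
```

Each removed slice reduces the Frobenius norm of the residual, so the approximation never gets worse as k grows.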

9
SDD - Properties
  • Rows of X represent the coordinates of the object
    in the new space defined by Y
  • Unsupervised learning of a ternary decision tree
  • objects divided by value in column 1
  • continue dividing objects by values in each
    successive column
  • clusters identified via decision tree creation
  • leaves require examination
  • the tree does not contain every possible branch
  • new data falling on a missing branch cannot be
    classified
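The ternary-tree reading of X can be sketched like this (the X values are hypothetical, chosen only to illustrate the grouping):

```python
# Group objects by their rows of X: each distinct ternary row is a path
# through the decision tree (split on column 1, then column 2, ...).
import numpy as np

X = np.array([[ 1,  0],     # hypothetical coordinates from a k=2 SDD
              [ 1,  0],
              [ 1,  1],
              [-1,  0],
              [ 0, -1],
              [ 0, -1]])

clusters = {}
for obj, path in enumerate(map(tuple, X)):
    clusters.setdefault(path, []).append(obj)

# Each leaf (distinct path) is a candidate cluster and still needs
# examination; paths absent from X simply have no leaf in the tree.
for path, members in sorted(clusters.items()):
    print(path, members)
```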

10
SDD - Example
  • Given a matrix A as

11
SDD Example (cont)
  • The first five columns of X
  • The first five columns of Y
  • The values of D are

12
SDD Example (cont)
  • Resultant Decision Tree

13
SDD Example Analysis
  • The first approximation subtracted from A
  • A_1 = d_1 x_1 y_1
  • A - A_1

14
SDD Example Analysis (cont)
  • A
  • A - A_1

15
SDD Example Analysis (cont)
  • General effect: find regions where the
    difference in values is large
  • Volume of the difference is key
  • both the magnitude of the difference and
  • the size of the region of occurrence affect the
    decision
  • With the 2-norm as the objective function, height
    has a greater impact than region
  • Tends to find outlier clusters in data
  • SDD is a form of bump hunting
  • Finds regions that are above/below the rest of
    the data

16
Observations
  • When structure of dataset is many small clusters
  • SVD and SDD produce similar results
  • SVD is an accurate low-dimensional representation
  • SDD recognizes the clusters as bumps
  • Example comparison plot of sparse clusters on a
    text dataset. SVD generated the coordinates
    while SDD set shape and color

17
Observations (cont)
  • SDD useful in latent semantic indexing (LSI)
  • Document-text matrices tend to contain many small
    clusters
  • SDD is an outlier detector
  • Emphasizes most unusual patterns in a dataset

18
SDD - Problems
  • Algorithm is sensitive to the initial choice of y_i
  • Largest possible slice not always removed
  • Therefore, values of D are not always decreasing
  • Authors propose a reorder of the D values after
    the algorithm completes by accounting for the
    total effect of the individual slice
  • D values are dependent upon the current matrix
    state
  • Reordering at the end cannot reproduce the same
    results as having chosen a different removal
    order
  • Computation of D is dependent on the average
    height of clusters being removed
  • Partitioning of the original matrix may not be
    optimal

19
Example Outlier Discovery
  • In a geochemical dataset containing 1670 sample
    points, measuring the concentration of 33
    elements
  • No other technique found outlier clusters such as
    these

20
Example Outlier Discovery
  • In a dataset containing observed properties of
    galaxies, with 8 attributes and 460 rows
  • SDD partitioned a single SVD cluster into a set
    of subclusters reflecting finer detail

21
Related Work
  • PRIM is a bump hunting technique
  • Top-down, removing slices in a single dimension
  • Rule-based techniques
  • Some use exhaustive search, may not scale well
  • SVD based
  • Principal Direction Divisive Partitioning
  • Partitions a dataset based on variation along the
    direction of the first singular vector, repeating
    for each subpartition independently

22
Related Work
  • Comparison of PDDP and SDD
  • Given a dataset A

23
Related Work (cont)
24
Conclusion
  • SVD
  • Space transforming technique
  • Models decreasing amounts of variation
  • SDD is a bump hunting technique
  • Finds regions with large values and extracts them
  • Removes bumps with largest volume
  • Detects outliers
  • SVD and SDD
  • Agree on datasets that contain many small
    clusters
  • Disagree on datasets that contain few large
    clusters