Outlier Detection Using SemiDiscrete Decomposition

1
Outlier Detection Using SemiDiscrete Decomposition
S. McConnell and D.B. Skillicorn
  • Presented by Bryan Decaire

2
Dataset / Table
  • n rows (objects) each with m columns
    (attributes)
  • each object a point in m-dimensional space
  • clustering a method to find some structure in the
    data
  • high dimensional spaces are difficult to work
    with
  • distance between nearest and farthest neighbor
    almost the same
  • large m creates problems (large here meaning m > 15)
  • Dimension-reduction techniques (e.g. SVD)
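The "nearest and farthest neighbor" claim can be checked empirically. A small sketch (my own illustration, not from the slides) using uniform random data:

```python
# Illustration: in high dimensions, distances from a query point to its
# nearest and farthest neighbours concentrate (their ratio approaches 1).
import numpy as np

rng = np.random.default_rng(0)
for m in (2, 15, 500):                     # number of attributes (columns)
    points = rng.random((1000, m))         # 1000 objects in m-dim space
    query = rng.random(m)
    dists = np.linalg.norm(points - query, axis=1)
    print(f"m={m:3d}  farthest/nearest = {dists.max() / dists.min():.2f}")
```

The ratio shrinks toward 1 as m grows, which is why distance-based clustering degrades in high-dimensional spaces.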

3
SVD
  • Transforms from an m-dimensional space to new
    m-dimensional space
  • A = U S Vᵀ
  • A is an n x m matrix
  • U is an n x m matrix
  • S is an m x m diagonal matrix of non-negative,
    decreasing singular values, r of which are
    non-zero, where r is the rank of A (r ≤ m)
  • V is an m x m matrix

4
SVD
  • Properties of new space are
  • Axes (U and V) are orthonormal
  • Axes are ordered such that variation that exists
    is ordered from largest to smallest
  • Later dimensions may thus be ignored
  • Rank k (k ≤ r) approximation to A
  • Multiply the following
  • first k columns of U
  • upper left k x k submatrix of S
  • First k columns of V, transposed
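The rank-k recipe above can be written directly with NumPy (a sketch; the variable names and test matrix are mine):

```python
# Rank-k approximation of A from its SVD, following the recipe above.
import numpy as np

rng = np.random.default_rng(1)
A = rng.random((8, 5))                       # n=8 objects, m=5 attributes

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U @ diag(s) @ Vt
assert np.all(np.diff(s) <= 0)               # singular values are decreasing

k = 3
A_k = (U[:, :k]                              # first k columns of U
       @ np.diag(s[:k])                      # upper-left k x k block of S
       @ Vt[:k, :])                          # first k rows of V, transposed

# Frobenius error equals the energy in the discarded singular values.
err = np.linalg.norm(A - A_k)
```

Because variation is ordered largest to smallest, dropping the later dimensions loses as little as possible.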

5
SDD
  • Weak analogue of SVD
  • Axes are NOT orthonormal
  • Coordinates of points are from the set {-1, 0, 1}
  • Compact representation of transformed space
    (2-bits/element of X and Y)
  • O((nm)³) complexity

6
SDD
  • Mixed integer programming (MIP) problem
  • Linear programming problem
  • linear function to be optimized
  • problem constraints
  • non-negative variables
  • Mixed integer because some of the decision
    variables are constrained to have only integer
    values at the optimal solution

7
SDD - Equation
  • A_k = X_k D_k Y_k
  • equation for a k-dimensional SDD
  • A_k is an n x m matrix
  • X_k is an n x k matrix, elements from {-1, 0, 1}
  • D_k is a k x k diagonal matrix
  • Y_k is a k x m matrix, elements from {-1, 0, 1}
  • k is no longer related to the rank of matrix A
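A toy instance of this factorization (the values of X, D, and Y are invented purely for illustration):

```python
# Toy SDD-form product A_k = X_k D_k Y_k with ternary X and Y.
import numpy as np

X = np.array([[ 1,  0],
              [ 1,  1],
              [-1,  0],
              [ 0, -1]])                 # n x k, entries in {-1, 0, 1}
D = np.diag([5.0, 2.0])                  # k x k diagonal
Y = np.array([[1, 1,  0],
              [0, 1, -1]])               # k x m, entries in {-1, 0, 1}

A_k = X @ D @ Y                          # n x m approximation
# Each entry of X and Y fits in 2 bits, hence the compact representation.
```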

8
SDD - Algorithm
  • Let A_k = the kth-term approximation
  • Let R_k = the residual at the kth step
  • Let x_i be the ith column of X, d_i the ith
    diagonal element of D, and y_i the ith row of Y
  • X, D, Y are initialized to 0
  • For (i = 1 to k)
  • A_{i-1} = X D Y
  • R_i = A - A_{i-1}
  • Find a triple (x_i, d_i, y_i) that minimizes
    ||R_i - d_i x_i y_i||_F
  • Choose an initial y_i
  • Solve for x_i and d_i using the selected y_i
  • Solve for y_i using d_i and x_i from the
    previous step
  • Repeat until the convergence criterion is met
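The loop above can be sketched in NumPy. This is a simplified version of the standard alternating heuristic (in the style of Kolda and O'Leary); the function names, the initial choice of y_i, and the fixed inner-iteration count standing in for a convergence test are my own choices, not from the slides:

```python
# Sketch of the alternating SDD algorithm: greedily remove rank-one
# "slices" d_i * x_i * y_i with ternary x_i and y_i.
import numpy as np

def best_ternary(s, other_norm_sq):
    # For a fixed other factor, the best ternary vector z maximizes
    # (z.s)^2 / (|z|^2 * other_norm_sq): take the signs of the
    # largest-magnitude entries of s, trying every cutoff J.
    order = np.argsort(-np.abs(s))
    best_val, best_J, csum = -1.0, 1, 0.0
    for J, idx in enumerate(order, start=1):
        csum += abs(s[idx])
        val = csum * csum / (J * other_norm_sq)
        if val > best_val:
            best_val, best_J = val, J
    z = np.zeros_like(s)
    z[order[:best_J]] = np.sign(s[order[:best_J]])
    return z

def sdd(A, k, inner_iters=20):
    n, m = A.shape
    X, D, Y = np.zeros((n, k)), np.zeros(k), np.zeros((k, m))
    R = A.astype(float).copy()                    # residual R_i
    for i in range(k):
        y = np.zeros(m)
        y[np.argmax(np.abs(R).sum(axis=0))] = 1.0  # initial y_i
        for _ in range(inner_iters):               # stand-in convergence test
            x = best_ternary(R @ y, y @ y)         # solve for x_i given y_i
            y = best_ternary(R.T @ x, x @ x)       # solve for y_i given x_i
        d = (x @ R @ y) / ((x @ x) * (y @ y))      # optimal d_i for (x_i, y_i)
        X[:, i], D[i], Y[i] = x, d, y
        R -= d * np.outer(x, y)                    # remove the slice
    return X, np.diag(D), Y
```

Each removed slice reduces the Frobenius norm of the residual, so the approximation never gets worse as k grows.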

9
SDD - Properties
  • Rows of X represent the coordinates of the object
    in the new space defined by Y
  • Unsupervised learning of a ternary decision tree
  • objects divided by value in column 1
  • continue dividing objects by values in each
    successive column
  • clusters identified via decision tree creation
  • leaves require examination
  • the tree does not contain every possible branch
  • new data falling on a missing branch cannot be
    classified
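The ternary-tree reading of X can be sketched like this (the X values are hypothetical, chosen only to illustrate the grouping):

```python
# Group objects by their rows of X: each distinct ternary row is a path
# through the decision tree (split on column 1, then column 2, ...).
import numpy as np

X = np.array([[ 1,  0],     # hypothetical coordinates from a k=2 SDD
              [ 1,  0],
              [ 1,  1],
              [-1,  0],
              [ 0, -1],
              [ 0, -1]])

clusters = {}
for obj, path in enumerate(map(tuple, X)):
    clusters.setdefault(path, []).append(obj)

# Each leaf (distinct path) is a candidate cluster and still needs
# examination; paths absent from X simply have no leaf in the tree.
for path, members in sorted(clusters.items()):
    print(path, members)
```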

10
SDD - Example
  • Given a matrix A as

11
SDD Example (cont)
  • The first five columns of X
  • The first five columns of Y
  • The values of D are

12
SDD Example (cont)
  • Resultant Decision Tree

13
SDD Example Analysis
  • The first approximation subtracted from A
  • A_1 = d_1 x_1 y_1
  • A - A_1

14
SDD Example Analysis (cont)
  • A
  • A - A_1

15
SDD Example Analysis (cont)
  • General effect: find regions where the
    difference in values is large
  • Volume of the difference is key
  • both the magnitude of the difference and
  • the size of the region of occurrence affect the
    decision
  • With the 2-norm as the objective function, height
    has a greater impact than region
  • Tends to find outlier clusters in data
  • SDD is a form of bump hunting
  • Finds regions that are above/below the rest of
    the data

16
Observations
  • When structure of dataset is many small clusters
  • SVD and SDD produce similar results
  • SVD is an accurate low-dimensional representation
  • SDD recognizes the clusters as bumps
  • Example comparison plot of sparse clusters on a
    text dataset. SVD generated the coordinates
    while SDD set shape and color

17
Observations (cont)
  • SDD useful in latent semantic indexing (LSI)
  • Document-text matrices tend to contain many small
    clusters
  • SDD is an outlier detector
  • Emphasizes most unusual patterns in a dataset

18
SDD - Problems
  • Algorithm is sensitive to the initial choice of y_i
  • Largest possible slice not always removed
  • Therefore, values of D are not always decreasing
  • Authors propose a reorder of the D values after
    the algorithm completes by accounting for the
    total effect of the individual slice
  • D values are dependent upon the current matrix
    state
  • Reordering at the end cannot reproduce the same
    results as having chosen a different removal
    order
  • Computation of D is dependent on the average
    height of clusters being removed
  • Partitioning of the original matrix may not be
    optimal

19
Example Outlier Discovery
  • In a geochemical dataset containing 1670 sample
    points, measuring the concentration of 33
    elements
  • No other technique found outlier clusters such as
    these

20
Example Outlier Discovery
  • In a dataset containing observed properties of
    galaxies, with 8 attributes and 460 rows
  • SDD partitioned a single SVD cluster into a set
    of subclusters reflecting finer detail

21
Related Work
  • PRIM is a bump hunting technique
  • Top-down, removing slices in a single dimension
  • Rule-based techniques
  • Some use exhaustive search, may not scale well
  • SVD based
  • Principal Direction Divisive Partitioning
  • Partitions a dataset based on variation along the
    direction of the first singular vector, repeating
    for each subpartition independently

22
Related Work
  • Comparison of PDDP and SDD
  • Given a dataset A

23
Related Work (cont)
24
Conclusion
  • SVD
  • Space transforming technique
  • Models decreasing amounts of variation
  • SDD is a bump hunting technique
  • Finds regions with large values and extracts them
  • Removes bumps with largest volume
  • Detects outliers
  • SVD and SDD
  • Agree on datasets that contain many small
    clusters
  • Disagree on datasets that contain few large
    clusters