Title: Dual data driven SIMCA as a one-class classifier
1. Dual data driven SIMCA as a one-class classifier
Alexey Pomerantsev, ICP RAS
2. One-class classifier, e.g. SIMCA
3. Standard bivariate normal distribution
4. Extremes and Outliers
Significance levels shown: 0.01 and 0.05
α is the extreme significance level
γ is the outlier significance level
5. Extreme plot
6. Principal Component Analysis
Karl Pearson, 1901
7. Score and Orthogonal Distances
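The two distances named on this slide can be illustrated with a minimal sketch. It assumes the PCA scores matrix T (objects by components, with mutually orthogonal columns) and the residual matrix E are already available, and it uses one common convention: the score distance is the sum of squared scores, each divided by its component's sum of squares, and the orthogonal distance is the squared residual norm. Other normalizations exist, and the function names are illustrative only.

```python
def score_distances(T):
    """Score distance h_i = sum_a t_ia^2 / lambda_a, where lambda_a = sum_i t_ia^2.

    Assumes the columns of T are orthogonal, as PCA scores are.
    """
    n_comp = len(T[0])
    # lambda_a: sum of squared scores of component a over all objects
    lam = [sum(row[a] ** 2 for row in T) for a in range(n_comp)]
    return [sum(row[a] ** 2 / lam[a] for a in range(n_comp)) for row in T]


def orthogonal_distances(E):
    """Orthogonal distance v_i = sum_j e_ij^2 (squared residual norm of object i)."""
    return [sum(e ** 2 for e in row) for row in E]


# Toy input: 4 objects, 2 components, 3 variables (values are made up).
T = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
E = [[0.1, 0.2, 0.2], [0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 0.1, 0.0]]
h = score_distances(T)
v = orthogonal_distances(E)
```

With this convention the score distances sum to the number of components, which is a quick sanity check on any implementation.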
8. SD and OD distributions
[Figure labels: OD, SD]
9. Data Driven SIMCA
[Figure labels: SD, OD]
10. Total Distance
11. Tolerance Areas
α is the extreme significance level
γ is the outlier significance level
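The total distance and the two tolerance areas can be sketched as follows. This is not taken from the slides, whose formulas did not survive extraction. It assumes the form commonly used in the DD-SIMCA literature: c = Nh·h/h0 + Nv·v/v0 is treated as chi-squared with Nh + Nv degrees of freedom, the extreme cut-off is the (1 − α) quantile, and the outlier cut-off is the (1 − γ)^(1/I) quantile. To stay within the standard library, the chi-squared quantile is approximated with the Wilson-Hilferty formula, so the thresholds are approximate.

```python
from statistics import NormalDist


def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-squared quantile (approximate, not exact)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def total_distance(h, v, h0, v0, Nh, Nv):
    """Total distance c = Nh*h/h0 + Nv*v/v0 (assumed DD-SIMCA form)."""
    return Nh * h / h0 + Nv * v / v0


def extreme_threshold(alpha, Nh, Nv):
    """Objects with c above this value are counted as extremes."""
    return chi2_quantile(1.0 - alpha, Nh + Nv)


def outlier_threshold(gamma, n_objects, Nh, Nv):
    """Objects with c above this value are treated as outliers."""
    return chi2_quantile((1.0 - gamma) ** (1.0 / n_objects), Nh + Nv)
```

The outlier threshold depends on the number of objects I, so for the same significance value it lies beyond the extreme threshold; an object can therefore be an extreme without being an outlier.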
12. Classical Data Driven (CDD) SIMCA
Classical Method of Moments
Given
Then
Where
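The formulas of this slide were lost in extraction. The sketch below shows one standard way to apply the classical method of moments, assuming a distance u (either h or v) follows a scaled chi-squared model, u ~ (u0/N)·χ²(N). Under that assumption the mean equals u0 and the variance equals 2·u0²/N, so N is estimated as 2·u0²/s². The function name and the rounding of N to an integer are illustrative choices.

```python
from statistics import mean, variance


def classical_moments(u):
    """Estimate the scale u0 and degrees of freedom N of a scaled chi-squared sample.

    Model assumption: u ~ (u0 / N) * chi2(N), so E[u] = u0 and Var[u] = 2 * u0**2 / N.
    """
    u0 = mean(u)
    s2 = variance(u)  # sample variance (n - 1 in the denominator)
    dof = max(1, round(2.0 * u0 ** 2 / s2))
    return u0, dof


u0_hat, dof_hat = classical_moments([1.0, 2.0, 3.0, 4.0, 5.0])
```

Because both the mean and the variance are sensitive to a few large values, these estimates can shift noticeably when the data contain outliers.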
13. Robust Data Driven (RDD) SIMCA
Robust Method of Moments
Given
Then
Where
M = median(u), R = interquartile range of u
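A robust counterpart can be sketched from the median M and the interquartile range R. The estimator on the slide is not recoverable, so this is only one possible realization under the same scaled chi-squared assumption, u ~ (u0/N)·χ²(N). The ratio R/M does not depend on u0, so N is chosen as the integer whose theoretical ratio is closest to the observed one, and u0 then follows from the median. Chi-squared quantiles use the Wilson-Hilferty approximation. `simulate_scaled_chi2` is a hypothetical helper used only for the demonstration.

```python
import random
from statistics import NormalDist, median, quantiles


def chi2_quantile(p, dof):
    """Wilson-Hilferty approximation to the chi-squared quantile (approximate)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * dof)
    return dof * (1.0 - c + z * c ** 0.5) ** 3


def robust_moments(u, max_dof=200):
    """Estimate (u0, N) from the median and interquartile range of u."""
    M = median(u)
    q1, _, q3 = quantiles(u, n=4)  # sample quartiles
    target = (q3 - q1) / M         # observed R / M, free of the scale u0

    def ratio(dof):
        iqr = chi2_quantile(0.75, dof) - chi2_quantile(0.25, dof)
        return iqr / chi2_quantile(0.5, dof)

    dof = min(range(1, max_dof + 1), key=lambda n: abs(ratio(n) - target))
    u0 = dof * M / chi2_quantile(0.5, dof)
    return u0, dof


def simulate_scaled_chi2(u0, dof, n, seed=0):
    """Hypothetical helper: draw n values of (u0 / dof) * chi2(dof)."""
    rng = random.Random(seed)
    return [u0 / dof * rng.gammavariate(dof / 2.0, 2.0) for _ in range(n)]


sample = simulate_scaled_chi2(u0=2.0, dof=5, n=5000) + [50.0, 60.0, 70.0]
u0_hat, dof_hat = robust_moments(sample)
```

The three large values appended to the sample barely move the median and the quartiles, which is the point of using them in place of the mean and variance.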
14. Dual Data Driven SIMCA
Given
X = TP^t + E, h = (h1, ..., hI), v = (v1, ..., vI)
Then
CDD SIMCA, RDD SIMCA
Yes: CDD SIMCA; No: RDD SIMCA
15. Case study I. Simulated data with outliers
Number of variables: J = 3
Number of objects: I = 100
Number of principal components: A = 2
Structured part (covariance matrix V): zero mean, v11 = v22 = v33 = 0.28, rank(V) = 2
Noise component: zero mean, σ = 0.05 (first 97 objects); zero mean, σ = 0.2 (last 3 objects)
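A data set with these properties could be generated along the following lines. The slide gives only the diagonal of V and its rank, so the off-diagonal structure here is an assumption: the three variables are built from two latent normal factors with loadings at 0°, 120° and 240°, which gives variances of 0.28, covariances of −0.14 and rank 2. The function name, the seed and the use of Gaussian noise are also illustrative.

```python
import math
import random


def simulate_case_study(n_objects=100, n_outliers=3, var=0.28,
                        sigma_clean=0.05, sigma_out=0.2, seed=0):
    """Return (signal, data): rank-2 signal in 3 variables plus noise.

    The last n_outliers objects get the larger noise level sigma_out.
    """
    rng = random.Random(seed)
    angles = [0.0, 2.0 * math.pi / 3.0, 4.0 * math.pi / 3.0]  # assumed loadings
    scale = math.sqrt(var)
    signal, data = [], []
    for i in range(n_objects):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)  # two latent factors
        s = [scale * (math.cos(a) * z1 + math.sin(a) * z2) for a in angles]
        sigma = sigma_clean if i < n_objects - n_outliers else sigma_out
        signal.append(s)
        data.append([sj + rng.gauss(0.0, sigma) for sj in s])
    return signal, data


signal, data = simulate_case_study()
```

Each signal row sums to zero by construction, which confirms that the structured part lies in a two-dimensional subspace.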
16. SIMCA plots
17. Reference: RDD-SIMCA
18. Totals over 10 data sets with outliers
Expected
19. Case study II. Real world data with 2 groups
Substance in closed PE bags, 82 drums, measured by NIR.
246 spectra in total
Group G1: 200 objects
Group G2: 46 objects
ACA 642 (2009) 222-227
20. Probe position effect
21. Extreme plots
Expected number of extremes: N = αI
Clean subset G1
Contaminated dataset G1 + G2
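The comparison behind an extreme plot can be sketched as below: for a given α, count the objects whose total distance exceeds the critical value and compare that count with the expected number αI. The tolerance band is an assumption, not taken from the slides. It treats the count as binomial with I trials and success probability α and uses a band of two standard deviations. All function names are illustrative.

```python
import math


def expected_extremes(alpha, n_objects):
    """Expected number of extremes, N = alpha * I."""
    return alpha * n_objects


def extremes_tolerance(alpha, n_objects, width=2.0):
    """Assumed tolerance band: N +/- width * sqrt(I * alpha * (1 - alpha))."""
    n = expected_extremes(alpha, n_objects)
    sd = math.sqrt(n_objects * alpha * (1.0 - alpha))
    return max(0.0, n - width * sd), min(float(n_objects), n + width * sd)


def count_extremes(distances, c_crit):
    """Observed number of objects whose total distance exceeds c_crit."""
    return sum(1 for c in distances if c > c_crit)


low, high = extremes_tolerance(0.05, 200)
```

For a clean data set the observed counts should stay inside the band over the whole range of α; a systematic excess points to contamination.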
22. Results of separation
Subset G1 revealed
Subset G2 revealed
23. Reference
24. One-class classification
Alternatives
Type II error 1- Type I error
25. How to find β when the alternative class (AC) is known
Target
Alternative
26. Two-class discrimination: plums and apples
27. Errors of Type I and Type II
28. Type II error β
Target
PCA
29. Non-central chi-squared distribution
chi-squared distribution
non-central chi-squared distribution
the noncentrality parameter
30. Calculation of β
Total distance of Target class (TC)
h0′, v0′, Nh′, Nv′
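The calculation of β can be illustrated with a simplified sketch. It assumes the total distance of alternative-class objects to the target model has already been scaled so that it follows a non-central chi-squared distribution with `dof` degrees of freedom and noncentrality `lam`. Both inputs are hypothetical here; how they follow from the primed parameters is not recoverable from the slide. β is then the probability that such a distance falls below the critical value c_crit. The non-central CDF is computed as a Poisson-weighted sum of central chi-squared CDFs.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma function P(a, x) by its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-16 * total:
        term *= x / (a + n)
        total += term
        n += 1
    return total * math.exp(-x + a * math.log(x) - math.lgamma(a))


def chi2_cdf(x, dof):
    """CDF of the central chi-squared distribution."""
    return reg_lower_gamma(dof / 2.0, x / 2.0)


def ncx2_cdf(x, dof, lam, max_terms=1000):
    """CDF of the non-central chi-squared distribution (Poisson mixture).

    Suitable for moderate lam; exp(-lam / 2) underflows for very large values.
    """
    half = lam / 2.0
    weight = math.exp(-half)  # Poisson(half) probability of j = 0
    total = 0.0
    for j in range(max_terms):
        total += weight * chi2_cdf(x, dof + 2 * j)
        weight *= half / (j + 1)
        if j > half and weight < 1e-16:
            break
    return total


def type2_error(c_crit, dof, lam):
    """beta = P(distance of an alternative object <= c_crit)."""
    return ncx2_cdf(c_crit, dof, lam)


c_crit = -2.0 * math.log(0.05)  # exact 95% chi-squared quantile for 2 dof
```

With zero noncentrality the alternative coincides with the target and β equals 1 − α; β decreases as the noncentrality grows, that is, as the alternative class moves away from the target.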
31. Case study II. Real world data with 2 groups
Substance in closed PE bags, 82 drums, measured by NIR.
246 spectra in total
Group G1: 200 objects
Group G2: 46 objects
Type II error estimation
32. G2, AC1, AC2, AC3, AC4
33. Distributions of the total distance c
34. Type II validation
35. Risk management
Given α: calculate c_crit, then find β
Given β: find α, then calculate c_crit
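Both directions of this trade-off can be shown with a deliberately simplified toy model, which is an assumption made only to keep the example exact and short: the distance has a single degree of freedom, so the target distance is z² and the alternative distance is (z + μ)², with z standard normal and μ a hypothetical shift of the alternative class. Given α, the critical value and β follow directly. Given β, α is found by bisection, because β decreases monotonically as α grows.

```python
from statistics import NormalDist

_N = NormalDist()


def c_crit_for_alpha(alpha):
    """Critical value of a 1-dof chi-squared distance at type I error alpha."""
    return _N.inv_cdf(1.0 - alpha / 2.0) ** 2


def beta_for_alpha(alpha, mu):
    """Type II error: P((z + mu)**2 <= c_crit) for the shifted alternative."""
    z = c_crit_for_alpha(alpha) ** 0.5
    return _N.cdf(z - mu) - _N.cdf(-z - mu)


def alpha_for_beta(beta, mu, tol=1e-12):
    """Find alpha that yields the requested beta (bisection on a decreasing function)."""
    lo, hi = 1e-12, 1.0 - 1e-12
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if beta_for_alpha(mid, mu) > beta:
            lo = mid  # beta still too large: alpha must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


beta = beta_for_alpha(0.05, 3.0)
alpha_back = alpha_for_beta(beta, 3.0)
```

The round trip recovers the original α, which shows that either error can be fixed in advance and the other one, together with c_crit, derived from it.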
36. Conclusion 1
Extreme objects play an important role in data
analysis and should not be confused with outliers.
The observed number of extremes should be compared
with the number expected at the chosen
significance level α.
Clean dataset
Contaminated dataset
37. Conclusion 2
Errors in decision making are inevitable:
reducing one error increases the other. The
researcher's task is to find the balance of
risks. Our approach provides such an
opportunity. Examples will be presented in
Oxana's lecture.
38. Conclusion 3
The proposed Dual Data Driven PCA/SIMCA approach
appears to be a competitive alternative to both
the purely classical and the strictly robust
methods. It has performed properly in the analysis
of both regular and contaminated data sets.
Clean dataset
Contaminated dataset
39. Thank you for your attention
A Lawyer's Mistake