Title: L2 and L1 Criteria for K-Means Bilinear Clustering
1. L2 and L1 Criteria for K-Means Bilinear Clustering
- B. Mirkin
- School of Computer Science
- Birkbeck College, University of London
- Advert of a Special Issue of The Computer Journal, Profiling Expertise and Behaviour. Deadline 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt
2. Outline: More of Properties than Methods
- Clustering, K-Means and Issues
- Data recovery: PCA model and clustering
- Data scatter decompositions for L2 and L1
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
3. Book contents
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
- WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
- DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
- GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
4. Outline
- Clustering, K-Means and Issues
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
5. Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
- Pluto doesn't fit in the two clusters of planets; it originated another cluster (2006)
6. Clustering algorithms
- Nearest neighbour
- Ward's
- Conceptual clustering
- K-Means
- Kohonen SOM
- Spectral clustering
- ...
7-10. K-Means: a generic clustering method
- Entities are presented as multidimensional points
- 0. Put K hypothetical centroids (seeds); K = 3 hypothetical centroids (@) in the illustration
- 1. Assign points to the centroids according to the minimum distance rule
- 2. Put centroids in the gravity centres of the thus obtained clusters
- 3. Iterate 1 and 2 until convergence
- 4. Output final centroids and clusters
11. Advantages of K-Means
- Models typology building
- Computationally efficient
- Can be utilised incrementally, on-line
Shortcomings (?) of K-Means
- Initialisation affects results
- Convex cluster shape
12. Initial Centroids: Correct (two-cluster case)
13. Initial Centroids: Correct (initial and final positions shown)
14. Different Initial Centroids
15. Different Initial Centroids: Wrong (initial and final positions shown)
16. Issues
K-Means gives no advice on:
- Number of clusters
- Initial setting
- Data normalisation
- Mixed variable scales
- Multiple data sets
K-Means gives limited advice on:
- Interpretation of results
These can all be addressed with the data recovery approach.
17. Outline
- Clustering, K-Means and Issues
- Data recovery: PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
18. Data recovery for data mining (discovery of patterns in data)
- Type of data: Similarity; Temporal; Entity-to-feature
- Type of model: Regression; Principal components; Clusters
Model: Data = Model_Derived_Data + Residual
Pythagoras: |Data|^m = |Model_Derived_Data|^m + |Residual|^m, m = 1, 2
The better the fit, the better the model: a natural source of optimisation problems.
19. K-Means as a data recovery method
20. Representing a partition
- Cluster k: centroid c_kv (v, a feature)
- Binary 1/0 membership z_ik (i, an entity)
21. Basic equations (same as for PCA, but score vectors z_k constrained to be binary)
y_iv = Σ_k c_kv z_ik + e_iv
where y_iv is a data entry; z_ik is a 1/0 membership, not a score; c_kv is a cluster centroid; N is the cardinality; i indexes entities, v features/categories, k clusters.
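To make the notation concrete, here is a minimal numpy sketch of this equation; the data, the partition and all variable names are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
N, V, K = 6, 2, 2                      # entities, features, clusters
labels = np.array([0, 0, 0, 1, 1, 1])  # a hypothetical partition

Z = np.zeros((N, K))                   # binary membership matrix, z_ik in {0, 1}
Z[np.arange(N), labels] = 1
Y = rng.normal(size=(N, V))            # (standardised) data matrix
C = (Z.T @ Y) / Z.sum(axis=0)[:, None] # centroids c_kv: within-cluster means

E = Y - Z @ C                          # residuals e_iv
print(np.allclose(Y, Z @ C + E))       # the model recovers the data: True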
22. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: L2 and L1
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
23. Quadratic data scatter decomposition (classic)
T(Y) = Σ_i Σ_v y_iv² = Σ_k N_k Σ_v c_kv² + Σ_k Σ_{i ∈ S_k} Σ_v (y_iv − c_kv)²
K-Means: alternating LS minimisation of the rightmost term.
Here y_iv is a data entry; z is a 1/0 membership; c_kv a cluster centroid; N_k a cluster cardinality; i indexes entities, v features/categories, k clusters.
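A quick numeric check of this decomposition, on made-up data and an arbitrary two-cluster partition:

import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 3))
labels = np.repeat([0, 1], 5)

T = (Y ** 2).sum()                     # data scatter T(Y)
B = W = 0.0
for k in range(2):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)               # L2-optimal centroid: the mean
    B += len(Yk) * (ck ** 2).sum()     # explained part: N_k * sum_v c_kv^2
    W += ((Yk - ck) ** 2).sum()        # K-Means square-error criterion
print(np.isclose(T, B + W))            # True: scatter = explained + unexplained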
24. Absolute data scatter decomposition (Mirkin 1997)
- The c_kv are medians
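In the L1 case the optimal "centroid" entries are within-cluster medians; a brute-force check on illustrative values that the median minimises the absolute-error criterion:

import numpy as np

y = np.array([1.0, 2.0, 2.5, 7.0, 11.0])
grid = np.linspace(0.0, 12.0, 1201)                    # candidate centre values
errors = np.abs(y[None, :] - grid[:, None]).sum(axis=1)
print(grid[errors.argmin()], np.median(y))             # both 2.5: the median wins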
25. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: L2 and L1
- Implications for data pre-processing
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
26. Meaning of the data scatter
- m = 1, 2: the sum of contributions of features, which is the basis for feature pre-processing (dividing by range rather than std)
- Proportional to the summary variance (L2) / absolute deviation from the median (L1)
27. Standardisation of features
y_ik = (x_ik − A_k) / B_k
- X: original data
- Y: standardised data
- i: entities
- k: features
- A_k: shift of the origin, typically the average
- B_k: rescaling factor, traditionally the standard deviation, but range may be better in clustering
28. Normalising
- By std: decreases the effect of the more useful feature 2 (in the illustration)
- By range: keeps the effect of distribution shape in T(Y)
- B = range × √(number of categories) (for the L2 case)
- (under the equality-of-variables assumption)
29. Data standardisation
- Categories as one/zero variables
- Subtracting the average
- All features: normalising by range
- Categories: sometimes also by the number of them
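A sketch of this recipe in Python; the dummy columns get the extra √(number of categories) divisor named on the following slides, and the dataset and column names are invented.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weight": [1.2, 3.4, 2.2, 5.1],
    "colour": ["red", "blue", "red", "green"],
})

quant = df[["weight"]].astype(float)
dummies = pd.get_dummies(df["colour"]).astype(float)  # categories as 1/0 variables

Y = pd.concat([quant, dummies], axis=1)
Y = (Y - Y.mean()) / (Y.max() - Y.min())              # centre by mean, scale by range
Y[dummies.columns] /= np.sqrt(dummies.shape[1])       # balance the nominal feature
print(Y.round(2))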
30. Illustration of data pre-processing: mixed scale data table
31. Conventional quantitative coding + data standardisation
32. No normalisation (Tom Sawyer)
33. Z-scoring (scaling by std) (Tom Sawyer)
34. Normalising by range × √(number of categories) (Tom Sawyer)
35. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
36. Contribution of a feature F to a partition (m = 2)
Contrib(F) is proportional to:
- the correlation ratio η², if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal:
  - Pearson chi-squared (Poisson-normalised)
  - Goodman-Kruskal tau-b (range-normalised)
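An illustrative computation of the two measures just named, η² and Pearson chi-squared, on made-up data:

import numpy as np

labels = np.array([0, 0, 0, 1, 1, 1])            # cluster partition

x = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.9])     # a quantitative feature
between = sum((labels == k).sum() * (x[labels == k].mean() - x.mean()) ** 2
              for k in (0, 1))
eta2 = between / ((x - x.mean()) ** 2).sum()     # correlation ratio eta^2

g = np.array([0, 0, 1, 1, 1, 0])                 # a nominal feature, 2 categories
obs = np.array([[np.sum((labels == k) & (g == c)) for c in (0, 1)]
                for k in (0, 1)])                # contingency table
exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / obs.sum()
chi2 = ((obs - exp) ** 2 / exp).sum()            # Pearson chi-squared
print(eta2, chi2)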
37. Contribution of a quantitative feature to a partition (m = 2)
- Proportional to the correlation ratio η²
38. Contribution of a pair (nominal feature, partition), L2 case
- Proportional to a contingency coefficient:
  - Pearson chi-squared (Poisson-normalised)
  - Goodman-Kruskal tau-b (range-normalised; B_j = 1, the range of a 0/1 variable)
- Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature
39. Contribution of a pair (nominal feature, partition), L1 case
- A highly original contingency coefficient
- Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature
40. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
41. Equivalent criteria (1)
- A. Bilinear residuals squared, MIN: minimising the difference between data and cluster structure
- B. Distance-to-centre squared, MIN: minimising the difference between data and cluster structure
42. Equivalent criteria (2)
- C. Within-group error squared, MIN: minimising the difference between data and cluster structure
- D. Within-group variance, MIN: minimising within-cluster variance
43. Equivalent criteria (3)
- E. Semi-averaged within-cluster distance squared, MIN: minimising dissimilarities within clusters
- F. Semi-averaged within-cluster similarity, MAX: maximising similarities within clusters
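Criteria B/C and E are tied by the identity Σ_{i∈S} ||y_i − c_S||² = Σ_{i,j∈S} ||y_i − y_j||² / (2|S|); a quick numeric check on random points:

import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(7, 3))                          # one cluster of 7 points

err = ((S - S.mean(axis=0)) ** 2).sum()              # distance-to-centre form
pair = ((S[:, None, :] - S[None, :, :]) ** 2).sum()  # all ordered pairs
print(np.isclose(err, pair / (2 * len(S))))          # True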
44. Equivalent criteria (4)
- G. Distant centroids, MAX: finding anomalous types
- H. Consensus partition, MAX: maximising correlation between the sought partition and given variables
45. Equivalent criteria (5)
- I. Spectral clusters, MAX: maximising the summary Rayleigh quotient over binary vectors
46. Gower's controversy: 2N + 1 entities
Three groups: N entities at c1, N at c2, and a singleton at c3.
Two-cluster possibilities (W taken as the summed within-cluster distances):
- W(c1,c2 / c3) ≈ N² d(c1,c2)
- W(c1 / c2,c3) ≈ N d(c2,c3)
- Hence W(c1/c2,c3) = o(W(c1,c2/c3))
Moral: separation over the grand mean/median rather than just over distances (in the most general d setting).
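A numeric illustration of the effect, under the reading of W as summed within-cluster pairwise distances; the group positions are made up:

import numpy as np

N = 200
pts = np.concatenate([np.zeros(N), np.ones(N), [100.0]])  # c1 = 0, c2 = 1, c3 = 100

def W(cluster):
    d = np.abs(cluster[:, None] - cluster[None, :])
    return d.sum() / 2                    # each unordered pair counted once

print(W(pts[:2 * N]))   # split c1,c2 | c3: N^2 * d(c1,c2) = 40000
print(W(pts[N:]))       # split c1 | c2,c3: N * d(c2,c3) = 19800
# Minimising W prefers merging c2 with the far singleton c3 rather than with
# the adjacent c1: the counterintuitive behaviour the slide points at.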
47. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction strategy: Anomalous Patterns and iK-Means
- Comments on optimisation problems
- Issue of the number of clusters
- Conclusion and future work
48. PCA-inspired Anomalous Pattern clustering
y_iv = c_v z_i + e_iv,
where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S.
With squared Euclidean distance, c_S must be anomalous, that is, interesting.
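A minimal sketch of the Anomalous Pattern step, assuming the data have been centred so that the reference point is the origin (function and variable names are mine, not the author's):

import numpy as np

def anomalous_pattern(Y):
    # seed: the entity farthest from the reference point 0
    c = Y[np.argmax((Y ** 2).sum(axis=1))]
    while True:
        # assign i to S when it is closer to the centroid than to the origin
        S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        if not S.any():
            return S, c
        c_new = Y[S].mean(axis=0)          # update the centroid
        if np.allclose(c_new, c):
            return S, c
        c = c_new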
49. Spectral clustering can be not optimal (1)
- Spectral clustering (becoming popular, in a different setting)
- Find the maximum eigenvector by maximising the Rayleigh quotient x'Ax / x'x over all possible x
- Define z_i = 1 if x_i > a and z_i = 0 if x_i ≤ a, for some threshold a
50. Spectral clustering can be not optimal (2)
- Example (for similarity data; the similarity matrix itself was a figure): the maximum eigenvector has components z = 0.681 (i = 1), 0.260 (i = 2), 0.126 (i = 3-5), 0.168 (i = 6-20), so entities 3-5 score below entities 6-20 and no threshold a separates 1-5 from the rest
- This cannot be typical
51. Initial setting with Anomalous Pattern cluster
52. Anomalous Pattern clusters: iterate (reference point 0)
53. iK-Means = Anomalous clusters + K-Means: after extracting 2 clusters (how can one know that 2 is right?); final result
54. iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering
- Find all Anomalous Pattern clusters
- Remove smaller (e.g., singleton) clusters
- Put the number of remaining clusters as K and initialise K-Means with their centres
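Combining these three steps with the anomalous_pattern sketch above gives an illustrative (not the author's) initialisation routine:

import numpy as np

def ik_means_init(Y, min_size=2):
    Y = Y - Y.mean(axis=0)                 # grand mean becomes the reference point 0
    remaining = np.arange(len(Y))
    centroids = []
    while len(remaining):
        S, c = anomalous_pattern(Y[remaining])
        if not S.any():
            break
        if S.sum() >= min_size:            # drop singletons and tiny patterns
            centroids.append(c)
        remaining = remaining[~S]
    return np.array(centroids)             # K = len(centroids) seeds for K-Means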
55. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
56. Study of eight number-of-clusters methods (joint work with Mark Chiang)
- Variance based: Hartigan (HK); Calinski & Harabasz (CH); Jump Statistic (JS)
- Structure based: Silhouette Width (SW)
- Consensus based: Consensus Distribution area (CD); Consensus Distribution mean (DD)
- Sequential extraction of APs (iK-Means): Least Squares (LS); Least Moduli (LM)
57. Experimental results at 9 Gaussian clusters (3 size patterns), 1000 × 15 data
[Table: rows HK, CH, JS, SW, CD, DD, LS, LM; columns: estimated number of clusters and Adjusted Rand Index, each under large and small spread. The cell values were conveyed by colour coding and are not recoverable.]
Two winners counted each time: 1-time winner, 2-times winner, 3-times winner.
58. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
59. Some other data recovery clustering models
- Hierarchical clustering: Ward agglomerative and Ward-like divisive; relation to wavelets and the Haar basis (1997)
- Additive clustering: partition and one-by-one clustering (1987)
- Biclustering: box clustering (1995)
60. Hierarchical clustering for conventional and spatial data
- Model: same
- Cluster structure: a 3-valued z representing a split
- A split S = S1 ∪ S2 of a node S into children S1, S2:
  z_i = 0 if i ∉ S; a if i ∈ S1; −b if i ∈ S2
- If a and b are taken so that z is centred, the node vectors for a hierarchy form an orthogonal basis (an analogue of the SVD)
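A small check of the orthogonality claim, on a hypothetical 4-leaf hierarchy with root split {0,1} | {2,3} followed by {0} | {1} and {2} | {3}; choosing a = 1/|S1| and b = 1/|S2| makes each z centred:

import numpy as np

def split_vector(n, S1, S2):
    z = np.zeros(n)                        # 3-valued z: a on S1, -b on S2, 0 elsewhere
    z[S1], z[S2] = 1.0 / len(S1), -1.0 / len(S2)
    return z

z_root = split_vector(4, [0, 1], [2, 3])
z_left = split_vector(4, [0], [1])
z_right = split_vector(4, [2], [3])
for u, v in [(z_root, z_left), (z_root, z_right), (z_left, z_right)]:
    print(np.isclose(u @ v, 0.0))          # all True: pairwise orthogonal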
61. Last
- The data recovery K-Means-wise model is an adequate tool that involves a wealth of interesting criteria for mathematical investigation
- The L1 criterion remains a mystery, even for the most popular method, PCA
- Greedy-wise approaches remain a vital element, both theoretically and practically
- Evolutionary approaches have started sneaking in; they should be given more attention
- Extending the approach to other data types is a promising direction