Title: L2 and L1 Criteria for K-Means Bilinear Clustering
1. L2 and L1 Criteria for K-Means Bilinear Clustering
- B. Mirkin
- School of Computer Science
- Birkbeck College, University of London
- Advert of a Special Issue of The Computer Journal, Profiling Expertise and Behaviour. Deadline 15 Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt
2. Outline: More of Properties than Methods
- Clustering, K-Means and Issues
- Data recovery: PCA model and clustering
- Data scatter decompositions for L2 and L1
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
3. Book contents
- WHAT IS CLUSTERING; WHAT IS DATA
- K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
- WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
- DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
- DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
- GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability
4. Outline
- Clustering, K-Means and Issues
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
5. Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
- Pluto doesn't fit in the two clusters of planets; it originated another cluster (2006)
6. Clustering algorithms
- Nearest neighbour
- Ward's
- Conceptual clustering
- K-Means
- Kohonen SOM
- Spectral clustering
- ...
7-10. K-Means: a generic clustering method
- Entities are presented as multidimensional points
- 0. Put K hypothetical centroids (seeds); K = 3 hypothetical centroids (@) in the illustration
- 1. Assign points to the centroids according to the minimum distance rule
- 2. Put centroids in the gravity centres of the thus obtained clusters
- 3. Iterate 1 and 2 until convergence
- 4. Output final centroids and clusters
11. Advantages of K-Means
- Models typology building
- Computationally efficient
- Can be utilised incrementally, on-line
Shortcomings (?) of K-Means
- Initialisation affects results
- Convex cluster shape
12. Initial Centroids: Correct (two-cluster case)
13. Initial Centroids: Correct (initial and final positions shown)
14. Different Initial Centroids
15. Different Initial Centroids: Wrong (initial and final positions shown)
16. Issues
K-Means gives no advice on:
- Number of clusters
- Initial setting
- Data normalisation
- Mixed variable scales
- Multiple data sets
K-Means gives limited advice on:
- Interpretation of results
These can all be addressed with the data recovery approach.
17. Outline
- Clustering, K-Means and Issues
- Data recovery: PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
18. Data recovery for data mining (discovery of patterns in data)
- Type of data: Similarity; Temporal; Entity-to-feature
- Type of model: Regression; Principal components; Clusters
Model: Data = Model_Derived_Data + Residual
Pythagoras: |Data|^m = |Model_Derived_Data|^m + |Residual|^m, m = 1, 2
The better the fit, the better the model: a natural source of optimisation problems.
19. K-Means as a data recovery method
20. Representing a partition
- Cluster k: centroid c_kv (v, a feature)
- Binary 1/0 membership z_ik (i, an entity)
21. Basic equations (same as for PCA, but score vectors z_k constrained to be binary)
y_iv = Σ_k c_kv z_ik + e_iv
where y_iv is a data entry; z_ik is a 1/0 membership, not a score; c_kv is a cluster centroid; N is the cardinality; i indexes entities, v features/categories, k clusters.
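To make the notation concrete, here is a minimal numpy sketch of this equation; the data, the partition and all variable names are illustrative, not from the slides.

import numpy as np

rng = np.random.default_rng(0)
N, V, K = 6, 2, 2                      # entities, features, clusters
labels = np.array([0, 0, 0, 1, 1, 1])  # a hypothetical partition

Z = np.zeros((N, K))                   # binary membership matrix, z_ik in {0, 1}
Z[np.arange(N), labels] = 1
Y = rng.normal(size=(N, V))            # (standardised) data matrix
C = (Z.T @ Y) / Z.sum(axis=0)[:, None] # centroids c_kv: within-cluster means

E = Y - Z @ C                          # residuals e_iv
print(np.allclose(Y, Z @ C + E))       # the model recovers the data: True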
22. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: L2 and L1
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
23. Quadratic data scatter decomposition (classic)
T(Y) = Σ_i Σ_v y_iv² = Σ_k N_k Σ_v c_kv² + Σ_k Σ_{i ∈ S_k} Σ_v (y_iv − c_kv)²
K-Means: alternating LS minimisation of the rightmost term.
Here y_iv is a data entry; z is a 1/0 membership; c_kv a cluster centroid; N_k a cluster cardinality; i indexes entities, v features/categories, k clusters.
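A quick numeric check of this decomposition, on made-up data and an arbitrary two-cluster partition:

import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(10, 3))
labels = np.repeat([0, 1], 5)

T = (Y ** 2).sum()                     # data scatter T(Y)
B = W = 0.0
for k in range(2):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)               # L2-optimal centroid: the mean
    B += len(Yk) * (ck ** 2).sum()     # explained part: N_k * sum_v c_kv^2
    W += ((Yk - ck) ** 2).sum()        # K-Means square-error criterion
print(np.isclose(T, B + W))            # True: scatter = explained + unexplained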
24. Absolute data scatter decomposition (Mirkin 1997)
- The c_kv are medians
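In the L1 case the optimal "centroid" entries are within-cluster medians; a brute-force check on illustrative values that the median minimises the absolute-error criterion:

import numpy as np

y = np.array([1.0, 2.0, 2.5, 7.0, 11.0])
grid = np.linspace(0.0, 12.0, 1201)                    # candidate centre values
errors = np.abs(y[None, :] - grid[:, None]).sum(axis=1)
print(grid[errors.argmin()], np.median(y))             # both 2.5: the median wins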
25. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: L2 and L1
- Implications for data pre-processing
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
26. Meaning of the data scatter
- m = 1, 2: the sum of contributions of features, which is the basis for feature pre-processing (dividing by range rather than std)
- Proportional to the summary variance (L2) / absolute deviation from the median (L1)
27. Standardisation of features
y_ik = (x_ik − A_k) / B_k
- X: original data
- Y: standardised data
- i: entities
- k: features
- A_k: shift of the origin, typically the average
- B_k: rescaling factor, traditionally the standard deviation, but range may be better in clustering
28. Normalising
- By std: decreases the effect of the more useful feature 2 (in the illustration)
- By range: keeps the effect of distribution shape in T(Y)
- B = range × √(number of categories) (for the L2 case)
- (under the equality-of-variables assumption)
29. Data standardisation
- Categories as one/zero variables
- Subtracting the average
- All features: normalising by range
- Categories: sometimes also by the number of them
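A sketch of this recipe in Python; the dummy columns get the extra √(number of categories) divisor named on the following slides, and the dataset and column names are invented.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "weight": [1.2, 3.4, 2.2, 5.1],
    "colour": ["red", "blue", "red", "green"],
})

quant = df[["weight"]].astype(float)
dummies = pd.get_dummies(df["colour"]).astype(float)  # categories as 1/0 variables

Y = pd.concat([quant, dummies], axis=1)
Y = (Y - Y.mean()) / (Y.max() - Y.min())              # centre by mean, scale by range
Y[dummies.columns] /= np.sqrt(dummies.shape[1])       # balance the nominal feature
print(Y.round(2))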
30. Illustration of data pre-processing: mixed scale data table
31. Conventional quantitative coding + data standardisation
32. No normalisation (Tom Sawyer)
33. Z-scoring (scaling by std) (Tom Sawyer)
34. Normalising by range × √(number of categories) (Tom Sawyer)
35. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
36. Contribution of a feature F to a partition (m = 2)
Contrib(F) is proportional to:
- the correlation ratio η², if F is quantitative
- a contingency coefficient between the cluster partition and F, if F is nominal:
  - Pearson chi-squared (Poisson-normalised)
  - Goodman-Kruskal tau-b (range-normalised)
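An illustrative computation of the two measures just named, η² and Pearson chi-squared, on made-up data:

import numpy as np

labels = np.array([0, 0, 0, 1, 1, 1])            # cluster partition

x = np.array([1.0, 1.2, 0.8, 3.0, 3.3, 2.9])     # a quantitative feature
between = sum((labels == k).sum() * (x[labels == k].mean() - x.mean()) ** 2
              for k in (0, 1))
eta2 = between / ((x - x.mean()) ** 2).sum()     # correlation ratio eta^2

g = np.array([0, 0, 1, 1, 1, 0])                 # a nominal feature, 2 categories
obs = np.array([[np.sum((labels == k) & (g == c)) for c in (0, 1)]
                for k in (0, 1)])                # contingency table
exp = obs.sum(1, keepdims=True) * obs.sum(0, keepdims=True) / obs.sum()
chi2 = ((obs - exp) ** 2 / exp).sum()            # Pearson chi-squared
print(eta2, chi2)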
37. Contribution of a quantitative feature to a partition (m = 2)
- Proportional to the correlation ratio η²
38. Contribution of a pair (nominal feature, partition), L2 case
- Proportional to a contingency coefficient:
  - Pearson chi-squared (Poisson-normalised)
  - Goodman-Kruskal tau-b (range-normalised; B_j = 1, the range of a 0/1 variable)
- Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature
39. Contribution of a pair (nominal feature, partition), L1 case
- A highly original contingency coefficient
- Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature
40. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
41. Equivalent criteria (1)
- A. Bilinear residuals squared, MIN: minimising the difference between data and cluster structure
- B. Distance-to-centre squared, MIN: minimising the difference between data and cluster structure
42. Equivalent criteria (2)
- C. Within-group error squared, MIN: minimising the difference between data and cluster structure
- D. Within-group variance, MIN: minimising within-cluster variance
43. Equivalent criteria (3)
- E. Semi-averaged within-cluster distance squared, MIN: minimising dissimilarities within clusters
- F. Semi-averaged within-cluster similarity, MAX: maximising similarities within clusters
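Criteria B/C and E are tied by the identity Σ_{i∈S} ||y_i − c_S||² = Σ_{i,j∈S} ||y_i − y_j||² / (2|S|); a quick numeric check on random points:

import numpy as np

rng = np.random.default_rng(2)
S = rng.normal(size=(7, 3))                          # one cluster of 7 points

err = ((S - S.mean(axis=0)) ** 2).sum()              # distance-to-centre form
pair = ((S[:, None, :] - S[None, :, :]) ** 2).sum()  # all ordered pairs
print(np.isclose(err, pair / (2 * len(S))))          # True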
44. Equivalent criteria (4)
- G. Distant centroids, MAX: finding anomalous types
- H. Consensus partition, MAX: maximising correlation between the sought partition and given variables
45. Equivalent criteria (5)
- I. Spectral clusters, MAX: maximising the summary Rayleigh quotient over binary vectors
46. Gower's controversy: 2N + 1 entities
Three groups: N entities at c1, N at c2, and a singleton at c3.
Two-cluster possibilities (W taken as the summed within-cluster distances):
- W(c1,c2 / c3) ≈ N² d(c1,c2)
- W(c1 / c2,c3) ≈ N d(c2,c3)
- Hence W(c1/c2,c3) = o(W(c1,c2/c3))
Moral: separation over the grand mean/median rather than just over distances (in the most general d setting).
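A numeric illustration of the effect, under the reading of W as summed within-cluster pairwise distances; the group positions are made up:

import numpy as np

N = 200
pts = np.concatenate([np.zeros(N), np.ones(N), [100.0]])  # c1 = 0, c2 = 1, c3 = 100

def W(cluster):
    d = np.abs(cluster[:, None] - cluster[None, :])
    return d.sum() / 2                    # each unordered pair counted once

print(W(pts[:2 * N]))   # split c1,c2 | c3: N^2 * d(c1,c2) = 40000
print(W(pts[N:]))       # split c1 | c2,c3: N * d(c2,c3) = 19800
# Minimising W prefers merging c2 with the far singleton c3 rather than with
# the adjacent c1: the counterintuitive behaviour the slide points at.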
47. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction strategy: Anomalous Patterns and iK-Means
- Comments on optimisation problems
- Issue of the number of clusters
- Conclusion and future work
48. PCA-inspired Anomalous Pattern clustering
y_iv = c_v z_i + e_iv,
where z_i = 1 if i ∈ S and z_i = 0 if i ∉ S.
With squared Euclidean distance, c_S must be anomalous, that is, interesting.
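A minimal sketch of the Anomalous Pattern step, assuming the data have been centred so that the reference point is the origin (function and variable names are mine, not the author's):

import numpy as np

def anomalous_pattern(Y):
    # seed: the entity farthest from the reference point 0
    c = Y[np.argmax((Y ** 2).sum(axis=1))]
    while True:
        # assign i to S when it is closer to the centroid than to the origin
        S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        if not S.any():
            return S, c
        c_new = Y[S].mean(axis=0)          # update the centroid
        if np.allclose(c_new, c):
            return S, c
        c = c_new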
49. Spectral clustering can be not optimal (1)
- Spectral clustering (becoming popular, in a different setting)
- Find the maximum eigenvector by maximising the Rayleigh quotient x'Ax / x'x over all possible x
- Define z_i = 1 if x_i > a and z_i = 0 if x_i ≤ a, for some threshold a
50. Spectral clustering can be not optimal (2)
- Example (for similarity data; the similarity matrix itself was a figure): the maximum eigenvector has components z = 0.681 (i = 1), 0.260 (i = 2), 0.126 (i = 3-5), 0.168 (i = 6-20), so entities 3-5 score below entities 6-20 and no threshold a separates 1-5 from the rest
- This cannot be typical
51. Initial setting with Anomalous Pattern cluster
52. Anomalous Pattern clusters: iterate (reference point 0)
53. iK-Means = Anomalous clusters + K-Means: after extracting 2 clusters (how can one know that 2 is right?); final result
54. iK-Means: defining K and the initial setting with iterative Anomalous Pattern clustering
- Find all Anomalous Pattern clusters
- Remove smaller (e.g., singleton) clusters
- Put the number of remaining clusters as K and initialise K-Means with their centres
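Combining these three steps with the anomalous_pattern sketch above gives an illustrative (not the author's) initialisation routine:

import numpy as np

def ik_means_init(Y, min_size=2):
    Y = Y - Y.mean(axis=0)                 # grand mean becomes the reference point 0
    remaining = np.arange(len(Y))
    centroids = []
    while len(remaining):
        S, c = anomalous_pattern(Y[remaining])
        if not S.any():
            break
        if S.sum() >= min_size:            # drop singletons and tiny patterns
            centroids.append(c)
        remaining = remaining[~S]
    return np.array(centroids)             # K = len(centroids) seeds for K-Means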
55. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
56. Study of eight number-of-clusters methods (joint work with Mark Chiang)
- Variance based: Hartigan (HK); Calinski & Harabasz (CH); Jump Statistic (JS)
- Structure based: Silhouette Width (SW)
- Consensus based: Consensus Distribution area (CD); Consensus Distribution mean (DD)
- Sequential extraction of APs (iK-Means): Least Squares (LS); Least Moduli (LM)
57. Experimental results at 9 Gaussian clusters (3 size patterns), 1000 × 15 data
[Table: rows HK, CH, JS, SW, CD, DD, LS, LM; columns: estimated number of clusters and Adjusted Rand Index, each under large and small spread. The cell values were conveyed by colour coding and are not recoverable.]
Two winners counted each time: 1-time winner, 2-times winner, 3-times winner.
58. Outline
- Clustering: general and K-Means
- Bilinear PCA model and clustering
- Data scatter decompositions: Quadratic and Absolute
- Contributions of nominal features
- Explications of the quadratic criterion
- One-by-one cluster extraction: Anomalous Patterns and iK-Means
- Issue of the number of clusters
- Comments on optimisation problems
- Conclusion and future work
59. Some other data recovery clustering models
- Hierarchical clustering: Ward agglomerative and Ward-like divisive; relation to wavelets and the Haar basis (1997)
- Additive clustering: partition and one-by-one clustering (1987)
- Biclustering: box clustering (1995)
60. Hierarchical clustering for conventional and spatial data
- Model: same
- Cluster structure: a 3-valued z representing a split
- A split S = S1 ∪ S2 of a node S into children S1, S2:
  z_i = 0 if i ∉ S; a if i ∈ S1; −b if i ∈ S2
- If a and b are taken so that z is centred, the node vectors for a hierarchy form an orthogonal basis (an analogue of the SVD)
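A small check of the orthogonality claim, on a hypothetical 4-leaf hierarchy with root split {0,1} | {2,3} followed by {0} | {1} and {2} | {3}; choosing a = 1/|S1| and b = 1/|S2| makes each z centred:

import numpy as np

def split_vector(n, S1, S2):
    z = np.zeros(n)                        # 3-valued z: a on S1, -b on S2, 0 elsewhere
    z[S1], z[S2] = 1.0 / len(S1), -1.0 / len(S2)
    return z

z_root = split_vector(4, [0, 1], [2, 3])
z_left = split_vector(4, [0], [1])
z_right = split_vector(4, [2], [3])
for u, v in [(z_root, z_left), (z_root, z_right), (z_left, z_right)]:
    print(np.isclose(u @ v, 0.0))          # all True: pairwise orthogonal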
61. Last
- The data recovery K-Means-wise model is an adequate tool that involves a wealth of interesting criteria for mathematical investigation
- The L1 criterion remains a mystery, even for the most popular method, PCA
- Greedy-wise approaches remain a vital element, both theoretically and practically
- Evolutionary approaches have started sneaking in; they should be given more attention
- Extending the approach to other data types is a promising direction