L2 and L1 Criteria for K-Means Bilinear Clustering (presentation transcript)

1
L2 and L1 Criteria for K-Means Bilinear Clustering
  • B. Mirkin
  • School of Computer Science
  • Birkbeck College, University of London
  • Advert: Special Issue of The Computer Journal, Profiling Expertise and Behaviour. Deadline 15 Nov. 2006. To submit, see http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt

2
Outline: More Properties than Methods
  • Clustering, K-Means and issues
  • Data-recovery PCA model and clustering
  • Data scatter decompositions for L2 and L1
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

3
  • WHAT IS CLUSTERING; WHAT IS DATA
  • K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
  • WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
  • DATA RECOVERY MODELS: Statistics Modelling as Data Recovery
  • Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
  • DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
  • GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

4
  • Clustering, K-Means and Issues
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

5
Example: W. Jevons (1857) planet clusters, updated (Mirkin, 1996)
  • Pluto doesn't fit in the two clusters of planets: it originated another cluster (2006)
6
Clustering algorithms
  • Nearest neighbour
  • Ward's
  • Conceptual clustering
  • K-means
  • Kohonen SOM
  • Spectral clustering
  • …

7
K-Means: a generic clustering method
  • Entities are represented as multidimensional points
  • 0. Put K hypothetical centroids (seeds)
  • 1. Assign points to the centroids according to the minimum-distance rule
  • 2. Put centroids at the gravity centres of the clusters thus obtained
  • 3. Iterate 1. and 2. until convergence
  • K = 3 hypothetical centroids (@)
  [Plot: data points and three centroid markers @]

8
K-Means: a generic clustering method
  • Entities are represented as multidimensional points
  • 0. Put K hypothetical centroids (seeds)
  • 1. Assign points to the centroids according to the minimum-distance rule
  • 2. Put centroids at the gravity centres of the clusters thus obtained
  • 3. Iterate 1. and 2. until convergence
  [Plot: data points and centroid markers @]

9
K-Means: a generic clustering method
  • Entities are represented as multidimensional points
  • 0. Put K hypothetical centroids (seeds)
  • 1. Assign points to the centroids according to the minimum-distance rule
  • 2. Put centroids at the gravity centres of the clusters thus obtained
  • 3. Iterate 1. and 2. until convergence
  [Plot: data points and centroid markers @]

10
K-Means: a generic clustering method
  • Entities are represented as multidimensional points
  • 0. Put K hypothetical centroids (seeds)
  • 1. Assign points to the centroids according to the minimum-distance rule
  • 2. Put centroids at the gravity centres of the clusters thus obtained
  • 3. Iterate 1. and 2. until convergence
  • 4. Output final centroids and clusters
  [Plot: data points and centroid markers @]
11
Advantages of K-Means
  • Models typology building
  • Computationally effective
  • Can be utilised incrementally, on-line
Shortcomings (?) of K-Means
  • Initialisation affects results
  • Convex cluster shape

12
Initial Centroids: Correct
Two-cluster case
13
Initial Centroids: Correct
[Plot: initial and final centroid positions]
14
Different Initial Centroids
15
Different Initial Centroids: Wrong
[Plot: initial and final centroid positions]
16
Issues
K-Means gives no advice on:
  • Number of clusters
  • Initial setting
  • Data normalisation
  • Mixed variable scales
  • Multiple data sets
K-Means gives limited advice on:
  • Interpretation of results
All of these can be addressed with the data recovery approach
17
  • Clustering, K-Means and Issues
  • Data-recovery PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

18
Data recovery for data mining (discovery of
patterns in data)
  • Type of Data
  • Similarity
  • Temporal
  • Entity-to-feature
  • Type of Model
  • Regression
  • Principal components
  • Clusters

Model: Data = Model_Derived_Data + Residual
Pythagoras: |Data|^m = |Model_Derived_Data|^m + |Residual|^m, m = 1, 2
The better the fit, the better the model: a natural source of optimisation problems (a small numerical illustration follows)
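As a minimal numerical illustration (not from the slides; a two-cluster partition and variable names are assumed), the m = 2 decomposition holds exactly when the model-derived data replace each entity by its cluster centroid:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(10, 3))                        # entity-to-feature data
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # an assumed 2-cluster partition

# Model-derived data: each entity is replaced by its cluster centroid
C = np.vstack([Y[labels == k].mean(axis=0) for k in (0, 1)])
M = C[labels]
E = Y - M                                           # residuals

# Pythagorean decomposition for m = 2: scatter = model part + residual part
print(np.sum(Y**2), np.sum(M**2) + np.sum(E**2))    # the two numbers coincide
```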
19
K-Means as a data recovery method

20
Representing a partition
  • Cluster k: centroid c_kv (v = feature)
  • Binary 1/0 membership z_ik (i = entity)
21
Basic equations (same as for PCA, but score vectors z_k constrained to be binary):
  y_iv = Σ_k c_kv z_ik + e_iv
where y = data entry; z = 1/0 membership, not score; c = cluster centroid; N = cardinality; i = entity; v = feature/category; k = cluster
22
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: L2 and L1
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

23
Quadratic data scatter decomposition (classic):
  Σ_i Σ_v y_iv² = Σ_k N_k Σ_v c_kv² + Σ_k Σ_{i ∈ S_k} Σ_v (y_iv − c_kv)²
K-Means: alternating LS minimisation of the residual term (a sketch and a numerical check follow)
where y = data entry, z = 1/0 membership, c = cluster centroid, N_k = cluster cardinality, i = entity, v = feature/category, k = cluster
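A sketch of the alternating minimisation, with a numerical check that the data scatter T(Y) splits into the explained part B plus the within-cluster part W (illustrative code; the name `kmeans_l2` is assumed, and no cluster is assumed to go empty):

```python
import numpy as np

def kmeans_l2(Y, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), K, replace=False)]       # 0. hypothetical seeds
    for _ in range(n_iter):
        d = ((Y[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        z = d.argmin(axis=1)                          # 1. minimum-distance rule
        C = np.vstack([Y[z == k].mean(axis=0)         # 2. gravity centres
                       for k in range(K)])            # (assumes no empty cluster)
    return z, C

Y = np.random.default_rng(1).normal(size=(50, 4))
z, C = kmeans_l2(Y, 3)
T = (Y ** 2).sum()                                             # data scatter T(Y)
B = sum((z == k).sum() * (C[k] ** 2).sum() for k in range(3))  # explained part
W = sum(((Y[z == k] - C[k]) ** 2).sum() for k in range(3))     # within-cluster part
print(T, B + W)                                                # equal up to rounding
```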
24
Absolute Data Scatter Decomposition (Mirkin 1997)
Here the optimal c_kv are within-cluster medians: the L1 analogue of the quadratic decomposition above (a sketch follows)
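Under the L1 criterion, the same alternating scheme uses city-block distances and medians; a minimal sketch by analogy with the L2 version above (the name `kmedians_l1` is assumed):

```python
import numpy as np

def kmedians_l1(Y, K, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    C = Y[rng.choice(len(Y), K, replace=False)]                # seeds
    for _ in range(n_iter):
        d = np.abs(Y[:, None, :] - C[None, :, :]).sum(axis=2)  # city-block distance
        z = d.argmin(axis=1)
        C = np.vstack([np.median(Y[z == k], axis=0)            # medians, not means
                       for k in range(K)])                     # (assumes no empty cluster)
    return z, C
```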
25
Outline
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: L2 and L1
  • Implications for data pre-processing
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

26
Meaning of the Data scatter
  • m1,2 The sum of contributions of features the
    basis for feature pre-processing (dividing by
    range rather than std)
  • Proportional to the summary variance (L2) /
    absolute deviation from the median (L1)

27
Standardisation of features
  • Y_ik = (X_ik − A_k) / B_k
  • X = original data
  • Y = standardised data
  • i = entities
  • k = features
  • A_k = shift of the origin, typically the average
  • B_k = rescaling factor, traditionally the standard deviation, but range may be better in clustering (a sketch follows)
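A minimal sketch of this step (illustrative names; range scaling as the default, std as the traditional alternative):

```python
import numpy as np

def standardise(X, scale="range"):
    A = X.mean(axis=0)                        # A_k: shift, the feature averages
    if scale == "range":
        B = X.max(axis=0) - X.min(axis=0)     # B_k: range, preferred in clustering
    else:
        B = X.std(axis=0)                     # B_k: std, the traditional choice
    return (X - A) / B
```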

28
Normalising
  • by std: decreases the effect of the more useful feature 2
  • by range: keeps the effect of the distribution shape in T(Y)
  • B = range × √(number of categories) for category columns (in the L2 case, under the equality-of-variables assumption)

29
Data standardisation
  • Categories coded as one/zero variables
  • Subtracting the average
  • All features: normalising by range
  • Categories: sometimes also by the number of them

30
Illustration of data pre-processing
Mixed scale data table
31
Conventional quantitative coding and data standardisation
32
No normalisation
[Plot of the entities; the point labelled Tom Sawyer is marked]
33
Z-scoring (scaling by std)
[Plot of the entities; the point labelled Tom Sawyer is marked]
34
Normalising by range × √(number of categories)
[Plot of the entities; the point labelled Tom Sawyer is marked]
35
Outline
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

36
Contribution of a feature F to a partition (m = 2)
Contrib(F) is proportional to:
  • the correlation ratio η², if F is quantitative
  • a contingency coefficient between the cluster partition and F, if F is nominal:
  • Pearson chi-squared (Poisson-normalised)
  • Goodman-Kruskal tau-b (range-normalised)

37
Contribution of a quantitative feature to a partition (m = 2)
  • Proportional to the correlation ratio η² between F and the partition (a small check follows)
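The correlation ratio is the between-cluster share of the feature's variance; a small helper to compute it (illustrative name):

```python
import numpy as np

def correlation_ratio(y, z):
    """eta^2: between-cluster part of the variance of y over its total."""
    total = ((y - y.mean()) ** 2).sum()
    between = sum((z == k).sum() * (y[z == k].mean() - y.mean()) ** 2
                  for k in np.unique(z))
    return between / total
```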

38
Contribution of a pair (nominal feature, partition), L2 case
  • Proportional to a contingency coefficient:
  • Pearson chi-squared (Poisson-normalised)
  • Goodman-Kruskal tau-b (range-normalised, B_j = 1)
  • Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature

39
Contribution of a pair (nominal feature, partition), L1 case
  • A highly original contingency coefficient
  • Still needs to be normalised by the square root of the number of categories, to balance the contribution of a numerical feature

40
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

41
Equivalent criteria (1)
  • A. Bilinear residuals squared: MIN
    Minimizing the difference between the data and the cluster structure
  • B. Distance-to-centre squared: MIN
    Minimizing the difference between the data and the cluster structure

42
Equivalent criteria (2)
  • C. Within-group error squared: MIN
    Minimizing the difference between the data and the cluster structure
  • D. Within-group variance: MIN
    Minimizing within-cluster variance

43
Equivalent criteria (3)
  • E. Semi-averaged within distance squared: MIN
    Minimizing dissimilarities within clusters (a numerical check of the identity with B follows)
  • F. Semi-averaged within similarity squared: MAX
    Maximizing similarities within clusters
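Criterion E equals criterion B because, within each cluster, squared distances to the centroid sum to the semi-averaged pairwise squared distances; a quick numerical check of that identity (illustrative):

```python
import numpy as np

Y = np.random.default_rng(2).normal(size=(20, 3))    # one cluster S of 20 entities
c = Y.mean(axis=0)
lhs = ((Y - c) ** 2).sum()                           # sum of d^2 to the centroid
D = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
rhs = D.sum() / (2 * len(Y))                         # semi-averaged pairwise d^2
print(lhs, rhs)                                      # identical
```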

44
Equivalent criteria (4)
  • G. Distant centroids: MAX
    Finding anomalous types
  • H. Consensus partition: MAX
    Maximizing correlation between the sought partition and the given variables

45
Equivalent criteria (5)
  • I. Spectral clusters: MAX
    Maximizing the summary Rayleigh quotient over binary vectors

46
Gower's controversy: 2N + 1 entities
[Diagram: clusters c1, c2, c3 of sizes N, N, 1]
  • Two-cluster possibilities:
  • W(c1, c2 / c3) = N² d(c1, c2)
  • W(c1 / c2, c3) = N d(c2, c3)
  • ⇒ W(c1 / c2, c3) = o(W(c1, c2 / c3))
  • Separation over the grand mean/median rather than just over distances (in the most general d setting)

47
Outline
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction strategy: Anomalous Patterns and iK-Means
  • Comments on optimisation problems
  • The issue of the number of clusters
  • Conclusion and future work

48
PCA-inspired Anomalous Pattern Clustering
  • y_iv = c_v z_i + e_iv,
  • where z_i = 1 if i ∈ S, z_i = 0 if i ∉ S
  • Fitted with Euclidean distance squared
  • c_S must be anomalous, that is, interesting (a sketch follows)
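A sketch of one Anomalous Pattern extraction, under the assumed reading of these slides: the data are pre-standardised so the origin is the grand mean, the seed is the entity farthest from it, and assignment alternates against the fixed reference point 0 (the function name is illustrative):

```python
import numpy as np

def anomalous_pattern(Y, n_iter=100):
    """One Anomalous Pattern cluster; Y is assumed pre-standardised,
    so the origin 0 is the grand mean (the fixed reference point)."""
    c = Y[np.argmax((Y ** 2).sum(axis=1))]           # seed: entity farthest from 0
    for _ in range(n_iter):
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        c_new = Y[in_S].mean(axis=0)                 # centroid of the pattern
        if np.allclose(c_new, c):
            break
        c = c_new
    return in_S, c
```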
49
Spectral clustering can be non-optimal (1)
  • Spectral clustering (becoming popular, in a different setting):
  • Find the maximum eigenvector x by maximising the Rayleigh quotient over all possible x
  • Define z_i = 1 if x_i > a, z_i = 0 if x_i ≤ a, for some threshold a

50
Spectral clustering can be non-optimal (2)
  • Example (for similarity data):
[Diagram: a similarity graph on entities 1-20]
Maximum eigenvector components:
  i:    1      2      3-5    6-20
  x_i:  0.681  0.260  0.126  0.168
Entities 3-5 score below entities 6-20, so no threshold a recovers the intended cluster. This cannot be typical.
51
Initial setting with Anomalous Pattern Cluster
52
Anomalous Pattern Clusters: Iterate
[Plot: the reference point 0 and successive anomalous clusters]
53
iK-Means: Anomalous clusters + K-Means
After extracting 2 clusters (how can one know that 2 is right?)
[Plot: final clusters]
54
iK-Means: Defining K and the Initial Setting with Iterative Anomalous Pattern Clustering
  • Find all Anomalous Pattern clusters
  • Remove smaller (e.g., singleton) clusters
  • Set K to the number of remaining clusters and initialise K-Means with their centres (a sketch follows)
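A sketch of this wrapper over the `anomalous_pattern` routine above (the name and the `min_size` rule are illustrative):

```python
import numpy as np

def ik_means_init(Y, min_size=2):
    """Iterated AP extraction; Y pre-standardised as above."""
    rest = np.arange(len(Y))
    centres = []
    while rest.size > 0:
        in_S, c = anomalous_pattern(Y[rest])       # extract one AP cluster
        if in_S.sum() >= min_size:                 # discard e.g. singletons
            centres.append(c)
        rest = rest[~in_S]                         # continue on the remainder
    return len(centres), np.array(centres)         # K and the initial centroids

# Usage: K, C0 = ik_means_init(Y); then run K-Means started from C0
```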

55
Outline
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

56
Study of eight number-of-clusters methods (joint work with Mark Chiang)
  • Variance based:
  • Hartigan (HK)
  • Calinski and Harabasz (CH)
  • Jump Statistic (JS)
  • Structure based:
  • Silhouette Width (SW)
  • Consensus based:
  • Consensus Distribution area (CD)
  • Consensus Distribution mean (DD)
  • Sequential extraction of APs (iK-Means):
  • Least Squares (LS)
  • Least Moduli (LM)

57
Experimental results
On 9 Gaussian clusters (3 size patterns), 1000 × 15 data
[Table: rows HK, CH, JS, SW, CD, DD, LS, LM; columns: estimated number of clusters and Adjusted Rand Index, each under large and small spread. The two winners in each column are counted each time; shading marks 1-, 2- and 3-time winners.]
58
  • Clustering: general and K-Means
  • Bilinear PCA model and clustering
  • Data Scatter Decompositions: Quadratic and Absolute
  • Contributions of nominal features
  • Explications of the quadratic criterion
  • One-by-one cluster extraction: Anomalous Patterns and iK-Means
  • The issue of the number of clusters
  • Comments on optimisation problems
  • Conclusion and future work

59
Some other data recovery clustering models
  • Hierarchical clustering: Ward agglomerative and Ward-like divisive; relation to wavelets and the Haar basis (1997)
  • Additive clustering: partition and one-by-one clustering (1987)
  • Biclustering: box clustering (1995)

60
Hierarchical clustering for conventional and spatial data
  • Model: the same
  • Cluster structure: a 3-valued vector z representing a split
  • A split S = S1 ∪ S2 of a node S into children S1, S2:
    z_i = 0 if i ∉ S, a if i ∈ S1, −b if i ∈ S2
  • If a and b are taken so that z is centred, the node vectors of a hierarchy form an orthogonal basis (an analogue to SVD; a quick check follows)
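A quick check of the orthogonality claim (illustrative): choosing a = |S2| and b = |S1| centres z over S, and split vectors of nested nodes come out mutually orthogonal:

```python
import numpy as np

def split_vector(N, S1, S2):
    """Three-valued z: a on S1, -b on S2, 0 outside S = S1 + S2."""
    z = np.zeros(N)
    z[S1], z[S2] = len(S2), -len(S1)   # a = |S2|, b = |S1| makes z centred
    return z

# Root split {0..7} -> {0..3} + {4..7}; child split {0..3} -> {0,1} + {2,3}
z1 = split_vector(8, [0, 1, 2, 3], [4, 5, 6, 7])
z2 = split_vector(8, [0, 1], [2, 3])
print(z1 @ z2)   # 0.0: node vectors of a hierarchy are mutually orthogonal
```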

61
Last
  • The data-recovery, K-Means-wise model is an adequate tool that involves a wealth of interesting criteria for mathematical investigation
  • The L1 criterion remains a mystery, even for the most popular method, PCA
  • Greedy-wise approaches remain a vital element, both theoretically and practically
  • Evolutionary approaches have started sneaking in and should be given more attention
  • Extending the approach to other data types is a promising direction