Different Perspectives at Clustering: The Number-of-Clusters Case

Transcript and Presenter's Notes

1
Different Perspectives at Clustering: The
Number-of-Clusters Case
  • B. Mirkin
  • School of Computer Science
  • Birkbeck College, University of London
  • IFCS 2006

2
Different Perspectives at Number of Clusters
Talk Outline
  • Clustering and K-Means: a discussion
  • Clustering goals and four perspectives
  • Number of clusters in
  • - Classical statistics perspective
  • - Machine learning perspective
  • - Data Mining perspective
  • (including a simulation study with 8 methods)
  • - Knowledge discovery perspective
  • (including a comparative genomics project)

3
  • WHAT IS CLUSTERING; WHAT IS DATA
  • K-MEANS CLUSTERING: Conventional K-Means,
    Initialization of K-Means, Intelligent K-Means,
    Interpretation Aids
  • WARD HIERARCHICAL CLUSTERING: Agglomeration,
    Divisive Clustering with Ward Criterion,
    Extensions of Ward Clustering
  • DATA RECOVERY MODELS: Statistics Modelling as
    Data Recovery, Data Recovery Model for K-Means,
    for Ward, Extensions to Other Data Types,
    One-by-One Clustering
  • DIFFERENT CLUSTERING APPROACHES: Extensions of
    K-Means, Graph-Theoretic Approaches, Conceptual
    Description of Clusters
  • GENERAL ISSUES: Feature Selection and Extraction,
    Similarity on Subsets and Partitions, Validity
    and Reliability

4
Example: W. Jevons (1835-1882), updated in
Mirkin 1996
  • Pluto doesn't fit in the two clusters of planets

5
Example: A Few Clusters
  • Clustering interface to WEB search engines
    (Grouper)
  • Query: Israel (after O. Zamir and O. Etzioni,
    2001)

Cluster  Number of sites  Interpretation
1        24               Society, religion (Israel and Judaism; Judaica collection)
2        12               Middle East, war, history (the state of Israel; Arabs and Palestinians)
3        31               Economy, travel (Israel Hotel Association; electronics in Israel)
6
Clustering: Main Steps
  • Data collecting
  • Data pre-processing
  • Finding clusters (the only step appreciated in
    conventional clustering)
  • Interpretation
  • Drawing conclusions

7
Conventional Clustering: Cluster Algorithms
  • Single Linkage (Nearest Neighbour)
  • Ward Agglomeration
  • Conceptual Clustering
  • K-Means
  • Kohonen SOM
  • ...

8
K-Means a generic clustering method
  • Entities are presented as multidimensional
    points ()
  • 0. Put K
    hypothetical centroids (seeds)
  • 1. Assign
    points to the centroids
  • according
    to minimum distance rule
  • 2. Put
    centroids in gravity centres of
  • thus
    obtained clusters
  • 3. Iterate 1.
    and 2. until convergence
  • K 3
    hypothetical centroids (_at_)
  • _at_ _at_
  • _at_

9-10
(Slides 9 and 10 repeat the algorithm of slide 8, animating
successive positions of the three centroids @.)

11
K-Means: a generic clustering method (final step)
  • 4. Output final centroids and clusters

(Figure: final positions of the three centroids @)
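
The steps above translate directly into code. A minimal NumPy sketch,
with function and variable names of my own (not from the slides):

```python
import numpy as np

def k_means(Y, K, max_iter=100, seed=0):
    """Generic K-Means on an N x V entity-to-feature matrix Y."""
    Y = np.asarray(Y, dtype=float)
    rng = np.random.default_rng(seed)
    # 0. Put K hypothetical centroids (seeds): here, K distinct entities
    centroids = Y[rng.choice(len(Y), size=K, replace=False)].copy()
    for _ in range(max_iter):
        # 1. Assign points to the centroids by the minimum-distance rule
        dists = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # 2. Put centroids at the gravity centres of the clusters obtained
        new_centroids = centroids.copy()
        for k in range(K):
            members = Y[labels == k]
            if len(members):  # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        # 3. Iterate 1. and 2. until convergence
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # 4. Output final centroids and clusters
    return centroids, labels
```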
12
Advantages of K-Means
  • Conventional
  • Models typology building
  • Computationally efficient
  • Can be incremental, on-line
  • Unconventional
  • Associates feature salience with feature scales
    and correlation/association
  • Applicable to mixed scale data

13
Drawbacks of K-Means
  • No advice on:
  • Data pre-processing
  • Number of clusters
  • Initial setting
  • Instability of results
  • Criterion can be inadequate
  • Insufficient interpretation aids

14
Initial Centroids: Correct
Two-cluster case
15
Initial Centroids: Correct
Final
Initial
16
Different Initial Centroids
17
Different Initial Centroids: Wrong, even though
in different clusters
Initial
Final
18
Two types of goals (with no clear-cut
borderline)
  • Engineering goals
  • Data analysis goals

19
Engineering goals (examples)
  • Devising a market segmentation to minimise the
    promotion and advertisement expenses
  • Dividing a large scheme into modules to minimise
    the cost
  • Organisation structure design

20
Data analysis goals (examples)
  • Recovery of the distribution function
  • Prediction
  • Revealing patterns in data
  • Enhancing knowledge with additional concepts
  • and regularities
  • Each of these is realised
  • in a different perspective at clustering

21
Clustering Perspectives
  • Classical statistics:
    Recovery of a multimodal distribution function
  • Machine learning: Prediction
  • Data mining: Revealing patterns in data
  • Knowledge discovery: additional concepts
    and regularities

22
Clustering Perspectives at Clusters
  • Classical statistics
  • As many as meaningful modes (mixture items)
  • Machine learning
  • As many as needed for acceptable prediction
  • Data mining
  • As many as meaningful patterns in data
    (including incomplete clustering)
  • Knowledge discovery
  • As many as needed to produce concepts and
    regularities adequate to the domain

23
Main Sources for Deriving Clusters
  • Classical statistics
  • Model of the world
  • Machine learning
  • Cost/accuracy trade-off
  • Data mining
  • Data
  • Knowledge discovery
  • Domain knowledge

24
Classical Statistics Perspective
  • There must be a model of data generation
  • E.g., Mixture of Gaussians
  • The task: identify all parameters of the model
    by using observed data
  • E.g., The number of Gaussians and their
    probabilities, means and covariances

25
Mixture of 3 Gaussian densities
26
Classical statistics perspective on K-Means
  • K-Means amounts to a maximum likelihood method with
    spherical Gaussians of the same variance:
  • - within a cluster, all variables are
    independent and Gaussian with the same
    cluster-independent variance (z-scoring is a
    must then)
  • - the issue of the number of clusters can be
    approached with conventional approaches to
    hypothesis testing

27
Machine learning perspective
  • Clusters should be of help in learning
    incrementally generated data
  • The number should be specified by the trade-off
    between accuracy and cost
  • A criterion should guarantee partitioning of the
    feature space with clearly separated high density
    areas
  • A method should be proven to be consistent with
    the criterion on the population

28
Machine learning on K-Means
  • The number of clusters is to be specified
    according to prediction goals
  • Pre-processing: no advice
  • An incremental version of K-Means converges to a
    minimum of the summary within-cluster variance,
    under conventional assumptions of data generation
    (MacQueen 1967 is the major reference, though the
    method can be traced back a decade or two earlier)

29
Data mining perspective

30
Data recovery framework for data mining methods
  • Type of Data
  • Similarity
  • Temporal
  • Entity-to-feature
  • Co-occurrence
  • Type of Model
  • Regression
  • Principal components
  • Clusters

Model: Data = Model_Derived_Data + Residual
Pythagoras: Data² = Model_Derived_Data² + Residual²
The better the fit, the better the model
31
K-Means as a data recovery method
32
Representing a partition
Cluster k: centroid c_kv (v - feature);
binary 1/0 membership z_ik (i - entity)
33
Basic equations (analogous to PCA)
y - data entry, z - membership, c - cluster
centroid, N - cardinality; i - entity,
v - feature/category, k - cluster
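
The slide's formula did not survive the transcript; the following
reconstruction follows the stated notation and the single-cluster model
of the Anomalous Pattern slide, so read it as a sketch of the intended
equations:

```latex
% K-Means data-recovery model: data = cluster structure + residuals
y_{iv} \;=\; \sum_{k=1}^{K} c_{kv}\, z_{ik} \;+\; e_{iv}
% Least-squares fitting splits the data scatter Pythagorean-wise
% (N_k denotes the cardinality of cluster k):
\sum_{i,v} y_{iv}^{2}
  \;=\; \sum_{k=1}^{K} N_k \sum_{v} c_{kv}^{2}
  \;+\; \sum_{i,v} e_{iv}^{2}
```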
34
Meaning of Data scatter
  • The sum of contributions of features: the basis
    for feature pre-processing (dividing by range,
    not std)
  • Proportional to the summary variance

35
Contribution of a feature F to a partition
Contrib(F)
  • Proportional to:
  • the correlation ratio η² if F is quantitative
  • a contingency coefficient between the cluster
    partition and F, if F is nominal:
  • Pearson chi-square (Poisson-normalised)
  • Goodman-Kruskal tau-b (range-normalised)

36
Contribution of a quantitative feature to a
partition
  • Proportional to the correlation ratio η²
    if F is quantitative

37
Contribution of a nominal feature to
a partition
  • Proportional to a contingency coefficient:
  • Pearson chi-square (Poisson-normalised)
  • Goodman-Kruskal tau-b (range-normalised)
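
For the quantitative case, a small sketch (the function name is mine)
computing η² as the between-cluster share of a feature's variance; the
nominal coefficients play the same contribution role for contingency
tables:

```python
import numpy as np

def correlation_ratio(x, labels):
    """eta^2: the share of the variance of a quantitative feature x
    explained by the cluster partition given in labels."""
    x, labels = np.asarray(x, dtype=float), np.asarray(labels)
    grand = x.mean()
    # between-cluster scatter: cluster sizes times squared mean deviations
    between = sum((labels == k).sum() * (x[labels == k].mean() - grand) ** 2
                  for k in np.unique(labels))
    total = ((x - grand) ** 2).sum()
    return between / total
```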

38
Pythagorean Decomposition of data scatter for
interpretation
39
Contribution-based description of clusters
  • C. Dickens: FCon = 0
  • M. Twain: LenD < 28
  • L. Tolstoy: NumCh > 3 or
    Direct = 1

40
Principal Cluster Analysis (Anomalous Pattern)
Method
  • y_iv = c_v z_i + e_iv,
  • where z_i = 1 if i ∈ S, z_i = 0 if i ∉ S
  • With Euclidean distance squared

c_S must be anomalous, that is, interesting
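
A minimal sketch of the Anomalous Pattern extraction as I read it
(names mine): the data are assumed pre-centred, so the reference point
is the origin, and the cluster is seeded at the entity farthest from it.

```python
import numpy as np

def anomalous_pattern(Y, max_iter=100):
    """One Anomalous Pattern cluster from Y (rows = entities), assumed
    pre-centred so that the reference point is the origin 0."""
    Y = np.asarray(Y, dtype=float)
    # seed the centroid at the entity farthest from the reference point
    c = Y[np.argmax((Y ** 2).sum(axis=1))].copy()
    for _ in range(max_iter):
        # an entity belongs to S if it is closer to c than to 0
        in_S = ((Y - c) ** 2).sum(axis=1) < (Y ** 2).sum(axis=1)
        new_c = Y[in_S].mean(axis=0)  # gravity centre of S
        if np.allclose(new_c, c):
            break
        c = new_c
    return in_S, c
```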
41
Initial setting with Anomalous Single Cluster
for iK-Means
42
iK-Means with Anomalous Single Clusters
(Figure: anomalous single clusters around the reference point 0)
43
Anomalous clusters + K-Means
After extracting 2 clusters (how can one know
that 2 is right?)
Final
44
Simulation study of 8 methods (joint work with
Mark Chiang): Number-of-clusters methods
  • Variance based:
  • Hartigan (HK)
  • Calinski & Harabasz (CH)
  • Jump Statistic (JS)
  • Structure based:
  • Silhouette Width (SW)
  • Consensus based:
  • Consensus Distribution area (CD)
  • Consensus Distribution mean (DD)
  • Sequential extraction of APs:
  • Least Squares (LS)
  • Least Moduli (LM)

45
Data generation for the experiment
  • Gaussian Mixture (6, 7, 9 clusters) with:
  • Cluster spatial size:
  • - constant (spherical)
  • - k-proportional
  • - k²-proportional
  • Cluster spread (distance between centroids)

Spread   Spherical   k-proportional (PPCA model)   k²-proportional (PPCA model)
Large    2 (?)       10 (?)                        10 (?)
Small    0.2 (?)     0.5 (?)                       2 (?)
46
Evaluation of results: estimated clustering
versus the generated one
  • Number of clusters
  • Distance between centroids
  • Similarity between partitions

47
Distance between estimated centroids (e) and
generated ones (g)
  • Prime Assignment

(Figure: generated centroids G1(p1), G2(p2), G3(p3) and estimated
centroids e1(q1), ..., e5(q5); assignment g1 - e2, g2 - e4, g3 - e5)
48
Distance between estimated centroids (e) and
generated ones (g)
  • Final Assignment

(Figure: the same centroids; final assignment g1 - e2, e1;
g2 - e4, e3; g3 - e5)
49
Distance between centroids: quadratic and
city-block
  • 1. Assignment: g1(p1) - e1(q1), e2(q2);
    g2(p2) - e3(q3), e4(q4); g3(p3) - e5(q5)
  • 2. Distancing:

d1 = (q1 d(g1,e1) + q2 d(g1,e2)) / (q1 + q2)
d2 = (q3 d(g2,e3) + q4 d(g2,e4)) / (q3 + q4)
d3 = q5 d(g3,e5) / q5
50
Distance between centroids: quadratic and
city-block
  • 1. Assignment
  • 2. Distancing
  • 3. Averaging:

d = p1 d1 + p2 d2 + p3 d3
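
A sketch of the three steps in one function; the names are mine, the
assignment is simplified to nearest-centroid, and dist may be squared
Euclidean or city-block:

```python
def centroid_distance(G, p, E, q, dist):
    """Distance between generated centroids G (weights p) and estimated
    centroids E (weights q), given a metric dist(a, b)."""
    # 1. assignment: each estimated centroid goes to its nearest g
    assign = [min(range(len(G)), key=lambda j: dist(G[j], e)) for e in E]
    total = 0.0
    for j in range(len(G)):
        idx = [i for i, a in enumerate(assign) if a == j]
        if not idx:
            continue
        # 2. distancing: q-weighted average distance of assigned e's to g_j
        d_j = (sum(q[i] * dist(G[j], E[i]) for i in idx)
               / sum(q[i] for i in idx))
        # 3. averaging over generated clusters with weights p
        total += p[j] * d_j
    return total
```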
51
Similarity between partitions according to their
confusion table
  • Relative distance (Mirkin-Cherny 1970)
  • Tchouprov coefficient (Cramer 1943)
  • Adjusted Rand Index (Arabie-Hubert, 1985)
  • Average Overlap (Mirkin 2005)

52
Results
at 9 clusters, 1000 entities, 20 features
generated

(Table: for each method HK, CH, JS, SW, CD, DD, LS and LM, the slide
reports the estimated number of clusters, the distance between
centroids, and the adjusted Rand index, each under large and small
spread)
53
Knowledge discovery perspective on clustering
  • Conforming to and enhancing domain knowledge
  • Informal considerations so far
  • Relevant items
  • Decision trees
  • External validation

54
A case to generalise
  • Entities with a similarity measure
  • Clustering interpretation tool developed
  • Clustering method using a similarity threshold
    leading to a number of clusters
  • Domain knowledge leading to constraints on the
    similarity threshold
  • Best fitting interpretation provides for the best
    number of clusters

55
Entities with a similarity measure
  • 740 Homologous Protein Families (HPFs)
  • (in 30 herpes-virus genomes)
  • Homology defined by a protein sequence fragment
  • Sequence neighbourhood based similarity measure
    on HPFs

(Figure: protein sequence fragments F1, F2, F3)
56
Interpretation tool: mapping to an evolutionary
tree over genomes
(Figure: HPFs F1, F2, F3 mapped to the tree)
57
Algorithm ADDI-S (Mirkin, JoC 1987), a data
approximation technique
  • Criterion to maximize: contribution to the data
    scatter, the average within-cluster similarity c
    multiplied by the cluster's size |S|
  • Algorithm ADDI-S:
  • Take S = {j} for an arbitrary j
  • Given S, find c = c(S) and the average similarities
    b(i,S) to S for all entities i in and out of S
  • Check the differences b(i,S) - c/2. If they are
    consistent, change the state of a most
    contributing entity. Else, stop and output S.
  • The resulting S has a tightness property.
  • Related work: Holzinger (1941) B-coefficient;
    Arkadiev & Braverman (1964, 1967) Specter; Mirkin
    (1976, 1987) ADDI family; Ben-Dor, Shamir,
    Yakhini (1999) CAST
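
A loose sketch of ADDI-S: the criterion c·|S| and the b(i,S) - c/2 test
are from the slide, while the zero-diagonal similarity matrix and the
details of the flip rule are my simplifications:

```python
import numpy as np

def addi_s(B, j):
    """Grow a cluster S around seed j, given a symmetric similarity
    matrix B with zero diagonal (my assumptions)."""
    B = np.asarray(B, dtype=float)
    N = len(B)
    S = {j}
    for _ in range(10 * N):  # safety bound on the number of state changes
        members = sorted(S)
        # c(S): average within-cluster similarity
        c = (B[np.ix_(members, members)].sum() / (len(S) * (len(S) - 1))
             if len(S) > 1 else 0.0)
        b = B[:, members].mean(axis=1)  # b(i,S) for every entity i
        # b(i,S) - c/2 says whether flipping i's state helps:
        # positive for an outsider joining, negative for a member leaving
        gain = np.array([(c / 2 - b[i]) if i in S else (b[i] - c / 2)
                         for i in range(N)])
        i_best = int(np.argmax(gain))
        if gain[i_best] <= 0:  # all differences consistent: stop, output S
            break
        S ^= {i_best}  # change the state of the most contributing entity
    return S, c
```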

58
Algorithm ADDI-S (Mirkin 1987), a data
approximation technique
  • Number of clusters: depends on the similarity
    shift threshold b
  • b(i,j) ← b(i,j) - b

59
Domain knowledge: function is known at some HPFs
  • 287 pairs of HPFs with known function, of which 86
    are SYNONYMOUS (same function)

(Figure: density of similarity for non-synonymous vs synonymous pairs;
two threshold values: 0.42, minimising the error, and 0.67, leaving no
non-synonymous pairs)
60
Knowledge enhancing
  • Analyzing the reconstructed contents of the 3
    family ancestors and HUCA (the root)
  • Analyzing differences between the b = 0.42 and
    b = 0.67 cluster reconstructions
  • Analyzing gene arrangement within genomes
  • Glycoprotein L's HPFs are sequence-dissimilar,
    but they are always followed in genomes by a
    glycolase that is mapped to HUCA
  • Glyc L -> Glycolase
  • Therefore glycoprotein L must be in HUCA too

61
Final HPFs and APFs
  • HPFs with a sequence-based similarity measure
  • Interpretation: parsimonious histories
  • Clustering: ADDI-S using a similarity threshold
    leading to a number of clusters
  • Domain knowledge: 86 pairs should be in the same
    clusters, and 201 in different clusters -> 2
    suggested similarity thresholds
  • Best fitting: 102 APFs (aggregating 249 HPFs) and
    491 singleton HPFs

62
  • The whole HPF aggregation method's structure
    (joint work with R. Camargo, T. Fenner,
    P. Kellam, G. Loizou)

63
Conclusion I: Number of clusters?
  • Engineering perspective: defined by cost/effect
  • Classical statistics perspective: can and should
    be determined from the data, with a model
  • Machine learning perspective: can be specified
    according to the prediction accuracy to achieve
  • Data mining perspective: not to be pre-specified;
    only those numbers are of interest that bear
    interesting patterns
  • Knowledge discovery perspective: not to be
    pre-specified; those numbers are best that are
    best in knowledge enhancing

64
Conclusion II: The same holds for every other data
analysis concept
  • Classical statistics perspective: can be
    determined from the data, with a model
  • Machine learning perspective: prediction accuracy
    to achieve
  • Data mining perspective: data approximation
  • Knowledge discovery perspective: knowledge
    enhancing

65
Variance based methods
  • Hartigan (HK):
  • calculate HT(k) = (W_k / W_{k+1} - 1)(N - k - 1),
    where N is the number of entities
  • find the first k at which HT is less than a
    threshold, 10
  • Calinski and Harabasz (CH):
  • calculate CH(k) = ((T - W_k)/(k - 1)) / (W_k/(N - k)),
    where T is the data scatter
  • find the k which maximizes CH

W_k is, for given k, the smallest within-cluster
summary distance to centroids among those found
at different K-Means initializations
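
Both rules transcribe directly; a sketch assuming W is a list with
W[k-1] holding W_k for k = 1, 2, ...:

```python
import numpy as np

def hartigan(W, N, threshold=10):
    """HK: return the first k with HT(k) < threshold."""
    for k in range(1, len(W)):
        HT = (W[k - 1] / W[k] - 1) * (N - k - 1)
        if HT < threshold:
            return k
    return len(W)

def calinski_harabasz(W, T, N):
    """CH: T is the data scatter; return the k >= 2 maximising
    CH(k) = ((T - W_k)/(k - 1)) / (W_k/(N - k))."""
    CH = [((T - W[k - 1]) / (k - 1)) / (W[k - 1] / (N - k))
          for k in range(2, len(W) + 1)]
    return 2 + int(np.argmax(CH))
```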
66
Variance based methods
  • Jump Statistic (JS):
  • for each entity i, clustering S = {S_1, S_2, ..., S_k}
    and centroids C = {C_1, C_2, ..., C_k},
  • calculate d(i, S_k) = (y_i - C_k)^T G^{-1} (y_i - C_k)
    and d_k = (Σ_i d(i, S_k)) / (P N),
  • where P is the number of features, N is the
    number of rows and G is the covariance matrix of y
  • select a transformation power, typically P/2
  • calculate the jumps JS(k) = d_k^{-P/2} - d_{k-1}^{-P/2}
  • find the k which maximizes JS
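
A sketch of the final two steps, assuming the distortions d_k have
already been computed (d_0^{-P/2} is taken as 0):

```python
import numpy as np

def jump_statistic(d, P):
    """JS: d[k-1] holds the distortion d_k for k = 1, 2, ... clusters;
    P is the number of features.  Returns the k maximising the jump."""
    Y = P / 2.0                                  # typical transformation power
    t = np.asarray(d, dtype=float) ** (-Y)       # transformed distortions
    jumps = np.diff(np.concatenate([[0.0], t]))  # d_0^{-Y} taken as 0
    return 1 + int(np.argmax(jumps))
```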

67
Structure based methods
  • Silhouette Width (SW):
  • for each entity i, a(i) = the average dissimilarity
    between i and all other entities of the cluster to
    which i belongs, and b(i) = the minimum average
    dissimilarity of i to the entities of other clusters:
  • for each other cluster S_k, d(i, S_k) = the average
    dissimilarity between i and all entities of S_k;
  • b(i) = min over S_k of d(i, S_k)
  • s(i) = (b(i) - a(i)) / max(a(i), b(i))
  • calculate the average s = Σ_i s(i) / N
  • find the K maximizing the average s
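
A sketch from a precomputed dissimilarity matrix (names mine; a
singleton cluster gets s(i) = 0 by the usual convention):

```python
import numpy as np

def average_silhouette(D, labels):
    """Average silhouette width from an N x N dissimilarity matrix D."""
    D, labels = np.asarray(D, dtype=float), np.asarray(labels)
    N = len(D)
    s = np.zeros(N)
    for i in range(N):
        own = labels == labels[i]
        own[i] = False
        if not own.any():        # singleton cluster: s(i) = 0 by convention
            continue
        a = D[i][own].mean()     # a(i): avg dissimilarity within own cluster
        b = min(D[i][labels == k].mean()          # b(i): nearest other cluster
                for k in np.unique(labels) if k != labels[i])
        s[i] = (b - a) / max(a, b)
    return s.mean()              # pick the K that maximises this average
```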
68
Consensus based methods
Consensus Distribution area (CD)
  • For each of a number of different K-Means initializations:
  • Find the connectivity matrix
  • Calculate the consensus matrix
  • Calculate its cumulative distribution function CDF
  • Calculate the area under the CDF, A(k)
  • Calculate the relative change Δ(k+1) = (A(k+1) - A(k)) / A(k)
  • Find the k which maximizes Δ(k)
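
A sketch of the two consensus quantities as I read the slide (this
matches the consensus-distribution idea of Monti et al.; names mine):

```python
import numpy as np

def cdf_area(M):
    """Area under the empirical CDF of the off-diagonal entries of a
    consensus matrix M (entries in [0, 1])."""
    vals = np.sort(M[np.triu_indices_from(M, k=1)])
    cdf = np.arange(1, len(vals) + 1) / len(vals)  # CDF at sorted values
    # integrate the step-function CDF between consecutive sorted values
    return float(np.sum(np.diff(vals) * cdf[:-1]))

def best_k_cd(areas):
    """areas: dict K -> A(K).  CD picks the K with the largest relative
    change Delta(K) = (A(K) - A(K-1)) / A(K-1)."""
    Ks = [K for K in sorted(areas) if K - 1 in areas]
    return max(Ks, key=lambda K: (areas[K] - areas[K - 1]) / areas[K - 1])
```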

69
Consensus based methods
µ_K is the mean of the consensus matrix; σ_K² is the
variance of the consensus matrix
avdis(K) = µ_K (1 - µ_K) - σ_K²
davdis(K) = (avdis(K) - avdis(K+1)) / avdis(K+1)
Find the K which maximizes davdis (DD)
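
The same sketch style for DD, assuming a dict mapping each K to its
consensus matrix:

```python
import numpy as np

def avdis(M):
    """avdis(K) = mu_K (1 - mu_K) - sigma_K^2 for a consensus matrix M."""
    v = M[np.triu_indices_from(M, k=1)]
    return v.mean() * (1 - v.mean()) - v.var()

def best_k_dd(consensus):
    """DD: pick the K maximising the relative drop
    davdis(K) = (avdis(K) - avdis(K+1)) / avdis(K+1)."""
    Ks = [K for K in sorted(consensus) if K + 1 in consensus]
    return max(Ks, key=lambda K: (avdis(consensus[K])
                                  - avdis(consensus[K + 1]))
                                 / avdis(consensus[K + 1]))
```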
70
Sequential cluster extraction
  • Intelligent K-Means:
  • Anomalous Patterns (initial clusters)
  • Removal of singletons
  • K-Means
  • Euclidean distance and the within-cluster mean
    -> Least Squares criterion (LS)
  • Manhattan distance and the within-cluster median
    -> Least Moduli criterion (LM)