Clustering: Tackling Challenges with Data Recovery Approach - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Clustering: Tackling Challenges with Data
Recovery Approach
  • B. Mirkin
  • School of Computer Science
  • Birkbeck University of London
  • Advert of a Special Issue: The Computer Journal,
    Profiling Expertise and Behaviour. Deadline: 15
    Nov. 2006. To submit: http://www.dcs.bbk.ac.uk/mark/cfp_cj_profiling.txt

2
  • WHAT IS CLUSTERING? WHAT IS DATA?
  • K-MEANS CLUSTERING: Conventional K-Means;
    Initialization of K-Means; Intelligent K-Means;
    Mixed Data; Interpretation Aids
  • WARD HIERARCHICAL CLUSTERING: Agglomeration;
    Divisive Clustering with Ward Criterion;
    Extensions of Ward Clustering
  • DATA RECOVERY MODELS: Statistics Modelling as
    Data Recovery; Data Recovery Model for K-Means
    and for Ward; Extensions to Other Data Types;
    One-by-One Clustering
  • DIFFERENT CLUSTERING APPROACHES: Extensions of
    K-Means; Graph-Theoretic Approaches; Conceptual
    Description of Clusters
  • GENERAL ISSUES: Feature Selection and Extraction;
    Similarity on Subsets and Partitions; Validity
    and Reliability

3
What is clustering?
  • Finding homogeneous fragments, mostly sets of
    entities, in data for further analysis

4
Example: W. Jevons (1857) planet clusters,
updated (Mirkin, 1996)
  • Pluto doesn't fit in the two clusters of planets

5
Example: A Few Clusters
  • Clustering interface to Web search engines
    (Grouper)
  • Query: Israel (after O. Zamir and O. Etzioni,
    2001)

Cluster   Sites   Interpretation (sample sites)
1         24      Society, religion (Israel and Judaism; Judaica collection)
2         12      Middle East, war, history (The state of Israel; Arabs and Palestinians)
3         31      Economy, travel (Israel Hotel Association; Electronics in Israel)
6
Clustering algorithms
  • Nearest neighbour
  • Ward
  • Conceptual clustering
  • K-means
  • Kohonen SOM
  • Etc.

7
K-Means: a generic clustering method
  • Entities are presented as multidimensional
    points (·)
  • 0. Put K hypothetical centroids (seeds)
  • 1. Assign points to the centroids
    according to the minimum distance rule
  • 2. Put centroids in gravity centres of
    thus obtained clusters
  • 3. Iterate 1. and 2. until convergence
  • K = 3 hypothetical centroids (@)
  • (Figure: data points (·) with three centroids (@);
    a code sketch of the method follows below)
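A minimal sketch of these steps in Python/NumPy (the function name and the random seeding in step 0 are illustrative, not from the slides):

```python
import numpy as np

def k_means(Y, K, seeds=None, max_iter=100, rng=None):
    """Generic K-Means: alternate the minimum-distance rule and
    gravity-centre updates until the assignment stops changing."""
    rng = np.random.default_rng(rng)
    # 0. Put K hypothetical centroids (seeds); random entities if none given
    c = Y[rng.choice(len(Y), K, replace=False)] if seeds is None \
        else np.asarray(seeds, dtype=float).copy()
    labels = np.full(len(Y), -1)
    for _ in range(max_iter):
        # 1. Assign points to the centroids by the minimum-distance rule
        d = ((Y[:, None, :] - c[None, :, :]) ** 2).sum(axis=2)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # 3. convergence: assignments no longer change
        labels = new_labels
        # 2. Put centroids at the gravity centres of the obtained clusters
        for k in range(K):
            if (labels == k).any():
                c[k] = Y[labels == k].mean(axis=0)
    # 4. Output final centroids and clusters
    return c, labels
```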

8
K-Means: a generic clustering method (continued)
  • (Slide repeats steps 0-3; the animation shows points
    reassigned by the minimum distance rule and the
    centroids moved.)

9
K-Means: a generic clustering method (continued)
  • (Slide repeats steps 0-3; the animation shows a
    further reassignment and centroid update.)

10
K-Means: a generic clustering method (concluded)
  • Steps 0-3 as on slide 7
  • 4. Output final centroids and clusters

(Figure: final centroids (@) and their clusters)
11
Advantages of K-Means
  • Models typology building
  • Computationally efficient
  • Can be utilised incrementally, on-line
Shortcomings of K-Means
  • Instability of results
  • Convex cluster shape

12
Initial Centroids: Correct
Two-cluster case
13
Initial Centroids: Correct
(Figure: initial seeds and the final centroids)
14
Different Initial Centroids
15
Different Initial Centroids: Wrong
(Figure: these seeds lead to a wrong final clustering)
16
Clustering issues
K-Means gives no advice on:
  • Number of clusters
  • Initial setting
  • Data normalisation
  • Mixed variable scales
  • Multiple data sets
K-Means gives limited advice on:
  • Interpretation of results
17
Data recovery for data mining (discovery of
patterns in data)
  • Type of Data
  • Similarity
  • Temporal
  • Entity-to-feature
  • Co-occurrence
  • Type of Model
  • Regression
  • Principal components
  • Clusters

Model: Data = Model_Derived_Data + Residual
Pythagoras: Data² = Model_Derived_Data² + Residual²
The better the fit, the better the model
(in symbols below)
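In symbols, the decomposition the slide appeals to (it holds exactly when the model-derived data are a least-squares fit, by orthogonality of fit and residual):

```latex
\[
\text{Data} \;=\; \text{Model\_Derived\_Data} + \text{Residual},
\qquad
\|\text{Data}\|^2 \;=\; \|\text{Model\_Derived\_Data}\|^2 + \|\text{Residual}\|^2 .
\]
```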
18
Pythagorean decomposition in the data recovery
approach provides for:
  • Data scatter: a unique data characteristic (a
    perspective on data normalisation)
  • Additive contributions of entities or features to
    clusters (a perspective for interpretation)
  • Feature contributions are correlation/association
    measures affected by scaling (mixed scale data
    treatable)
  • Clusters can be extracted one-by-one (data mining
    perspective, incomplete clustering, number of
    clusters)
  • Multiple data sets can be approximated as well as
    single-sourced ones (not covered today)

19
Example
Mixed scale data table
20
Conventional quantitative coding and data
standardisation
21
Standardisation of features
  • Y_ik = (X_ik − A_k) / B_k
  • X: original data
  • Y: standardised data
  • i: entities
  • k: features
  • A_k: shift of the origin, typically the average
  • B_k: rescaling factor, traditionally the standard
    deviation, but range may be better in clustering
    (a code sketch follows below)
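A sketch of this standardisation in Python (the function name is illustrative; the range option reflects the slide's recommendation for clustering):

```python
import numpy as np

def standardise(X, scale="range"):
    """Y_ik = (X_ik - A_k) / B_k: shift each feature by its average A_k
    and rescale by B_k, either the range (often better in clustering)
    or the standard deviation (conventional z-scoring)."""
    A = X.mean(axis=0)
    B = X.max(axis=0) - X.min(axis=0) if scale == "range" else X.std(axis=0)
    return (X - A) / B
```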

22
No standardisation
(Figure; label: Tom Sawyer)
23
Z-scoring (scaling by std)
(Figure; label: Tom Sawyer)
24
Standardising by range
(Figure; label: Tom Sawyer)
25
K-Means as a data recovery method
26
Representing a partition
Cluster k: centroid c_kv (v - feature);
binary 1/0 membership z_ik (i - entity)
27
Basic equations (analogous to PCA, with score
vectors z_k constrained to be binary):
  y_iv = Σ_k c_kv z_ik + e_iv
y - data entry; z - membership, not score;
c - cluster centroid; N - cardinality;
i - entity; v - feature/category; k - cluster
(the resulting decomposition is shown below)
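Fitting this model by least squares splits the data scatter additively, a standard identity for the K-Means data recovery model (N_k is the cardinality of cluster k):

```latex
\[
\sum_{i,v} y_{iv}^2
\;=\;
\underbrace{\sum_{k=1}^{K} N_k \sum_{v} c_{kv}^2}_{\text{cluster contributions}}
\;+\;
\underbrace{\sum_{i,v} e_{iv}^2}_{\text{residual}} .
\]
```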
28
Meaning of Data scatter
  • The sum of contributions of features: the basis
    for feature pre-processing (dividing by range
    rather than std)
  • Proportional to the summary variance

29
Contribution of a feature F to a partition,
Contrib(F)
  • Proportional to:
  • the correlation ratio η² if F is quantitative
  • a contingency coefficient between the cluster
    partition and F, if F is nominal:
  • Pearson chi-square (Poisson normalised)
  • Goodman-Kruskal tau-b (range normalised)

30
Contribution of a quantitative feature to a
partition
  • Proportional to the correlation ratio η²
    (a code sketch follows below)
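A sketch of the correlation ratio in Python (the standard definition: between-cluster variance over total variance):

```python
import numpy as np

def correlation_ratio(labels, x):
    """eta^2: the share of the variance of quantitative feature x
    that the cluster partition explains."""
    total = ((x - x.mean()) ** 2).sum()
    between = sum((labels == k).sum() * (x[labels == k].mean() - x.mean()) ** 2
                  for k in np.unique(labels))
    return between / total
```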

31
Contribution of a nominal feature to
a partition
  • Proportional to a contingency coefficient:
  • Pearson chi-square (Poisson normalised)
  • Goodman-Kruskal tau-b (range normalised)
    (sketches of both measures follow below)
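Both association measures can be computed from the partition-by-feature contingency table; a sketch with the standard formulas (the deck's exact normalising constants may differ):

```python
import numpy as np

def contingency(labels, f):
    """Counts N_kv of entities in cluster k having category v of f."""
    ks, vs = np.unique(labels), np.unique(f)
    return np.array([[np.sum((labels == k) & (f == v)) for v in vs] for k in ks])

def pearson_chi2(labels, f):
    """Pearson chi-squared between the partition and nominal feature f."""
    N = contingency(labels, f)
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / N.sum()  # expected counts
    return ((N - E) ** 2 / E).sum()

def goodman_kruskal_tau(labels, f):
    """Goodman-Kruskal tau-b: proportional reduction in the error of
    predicting f's category from the cluster label."""
    N = contingency(labels, f)
    p = N / N.sum()
    pk, pv = p.sum(axis=1), p.sum(axis=0)       # marginals
    return ((p ** 2 / pk[:, None]).sum() - (pv ** 2).sum()) / (1 - (pv ** 2).sum())
```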

32
Pythagorean Decomposition of data scatter for
interpretation
33
Contribution based description of clusters
  • C. Dickens: FCon = 0
  • M. Twain: LenD < 28
  • L. Tolstoy: NumCh > 3 or Direct = 1

34
PCA based Anomalous Pattern Clustering
  • y_iv = c_v z_i + e_iv,
  • where z_i = 1 if i ∈ S, z_i = 0 if i ∉ S
  • With Euclidean squared distance

c_S must be anomalous, that is, interesting
(a code sketch follows below)
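A sketch of Anomalous Pattern clustering in Python, assuming the data have been centred so the origin represents the "norm" (the function name is illustrative):

```python
import numpy as np

def anomalous_pattern(Y):
    """Grow one cluster around the entity farthest from the origin:
    an entity joins when it is closer to the cluster centroid c than
    to 0; c is updated until it stabilises."""
    d0 = (Y ** 2).sum(axis=1)        # squared distances to the origin
    c = Y[d0.argmax()].copy()        # seed: the most anomalous entity
    while True:
        in_S = ((Y - c) ** 2).sum(axis=1) < d0   # closer to c than to 0
        if not in_S.any():                       # degenerate data: keep the seed
            in_S[d0.argmax()] = True
        new_c = Y[in_S].mean(axis=0)
        if np.allclose(new_c, c):
            return in_S, c
        c = new_c
```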
35
Initial setting with Anomalous Pattern Cluster
36
Anomalous Pattern Clusters: Iterate
(Figure: reference point 0 and successive anomalous clusters)
37
iK-Means = Anomalous clusters + K-Means
After extracting 2 clusters (how can one know
that 2 is right?)
(Figure: final clustering)
38
Example of iK-Means: Media-Mirrored Russian
Corruption (55 cases), with M. Levin and E.
Bakaleinik
  • Features
  • Corrupt office (1)
  • Client (1)
  • Rendered service (6)
  • Mechanism of corruption (2)
  • Environment (1)

39
A schema for Bribery

(Diagram: Environment, Interaction, Office,
Client, Service)
40
Data standardisation
  • Categories as one/zero variables
  • Subtracting the average
  • All features: normalising by range
  • Categories: sometimes also by the number of them

41
iK-Means: Initial Setting with Iterative
Anomalous Pattern Clustering
  • 13 clusters found with Anomalous Pattern
    clustering, of which 8 do not fit (4 singletons,
    4 doublets)
  • 5 clusters remain, to get initial seeds from
  • Cluster centroids are taken as seeds
    (a code sketch follows below)
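A sketch of the whole iK-Means pipeline, reusing the k_means and anomalous_pattern functions sketched earlier (min_size=3 mirrors this slide's discarding of singletons and doublets):

```python
import numpy as np

def i_k_means(Y, min_size=3):
    """Peel off Anomalous Pattern clusters one by one, discard the
    small ones, and run K-Means seeded with the survivors' centroids."""
    Yc = Y - Y.mean(axis=0)          # centre once: the origin is the grand mean
    remaining = np.arange(len(Yc))
    seeds = []
    while remaining.size:
        in_S, c = anomalous_pattern(Yc[remaining])
        if in_S.sum() >= min_size:   # keep only clusters that "fit"
            seeds.append(c)
        remaining = remaining[~in_S]
    return k_means(Yc, K=len(seeds), seeds=np.array(seeds))
```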

42
Interpretation II: Patterning (Interpretation I:
Representatives; Interpretation III: Conceptual
description)
  • Patterns in centroid values of salient features
  • Salience of feature v at cluster k:
    (grand mean − within-cluster mean)²

43
Interpretation II + III
  • Cluster 1 (7 cases)
  • Other branch (877)
  • Improper categorisation (439)
  • Level of client (242)
  • Cluster 2 (19 cases)
  • Obstruction of justice (467)
  • Law enforcement (379)
  • Occasional (251)

Conceptual descriptions:
Cluster 1: Branch = Other
Cluster 2: Branch = Law Enforc., Service: No
Cover-Up, Client Level = Organisation
44
Interpretation II (pattern) + III (APPCOD)
  • Cluster 3 (10 cases)
  • Extortion (474)
  • Organisation (289)
  • Government (275)

0 < Extort − Obstruct < 1
2 < Extort + Bribe < 3
No Inspection, No Protection
NO ERRORS
45
Overall Description: It is Branch that matters
  • Government
  • Extortion for free services (Cluster 3)
  • Protection (Cluster 4)
  • Law enforcement
  • Obstruction of justice (Cluster 2)
  • Cover-up (Cluster 5)
  • Other
  • Category change (Cluster 1)
  • Is this knowledge enhancement?

46
Data recovery clustering of similarities
  • Example:
  • Similarities between algebraic functions in an
    experimental method for knowledge evaluation
  •        lnx   x²   x³   x½   x¼
  • lnx     -    1    1    2.5  2.5
  • x²      1    -    6    2.5  2.5
  • x³      1    6    -    3    3
  • x½      2.5  2.5  3    -    4
  • x¼      2.5  2.5  3    4    -
  • Similarities between algebraic functions scored
    by a 6th-grade student on a scale of 1 to 7

47
Additive clustering
  • Similarities are the sum of intensities of
    clusters
  • Cl. 0: All are functions: lnx, x², x³, x½, x¼
  • Intensity 1 (upper sub-matrix)
  •        lnx   x²   x³   x½   x¼
  • lnx     -    1    1    1    1
  • x²      1    -    1    1    1
  • x³      1    6    -    1    1
  • x½      2.5  2.5  3    -    1
  • x¼      2.5  2.5  3    4    -
  • Similarities scored by a 6th-grade student on a
    scale of 1 to 7 (lower sub-matrix)

48
Additive clustering
  • Similarities are the sum of intensities of
    clusters
  • Cl. 1: Power functions: x², x³, x½, x¼
  • Intensity 2 (upper sub-matrix)
  •        lnx   x²   x³   x½   x¼
  • lnx     -    0    0    0    0
  • x²      1    -    2    2    2
  • x³      1    6    -    2    2
  • x½      2.5  2.5  3    -    2
  • x¼      2.5  2.5  3    4    -
  • Similarities scored by a 6th-grade student on a
    scale of 1 to 7 (lower sub-matrix)

49
Additive clustering
  • Similarities are the sum of intensities of
    clusters
  • Cl. 2: Sub-linear functions: lnx, x½, x¼
  • Intensity 1 (upper sub-matrix)
  •        lnx   x²   x³   x½   x¼
  • lnx     -    0    0    1    1
  • x²      1    -    0    0    0
  • x³      1    6    -    0    0
  • x½      2.5  2.5  3    -    1
  • x¼      2.5  2.5  3    4    -
  • Similarities scored by a 6th-grade student on a
    scale of 1 to 7 (lower sub-matrix)

50
Additive clustering
  • Similarities are the sum of intensities of
    clusters
  • Cl. 3: Fast growing functions: x², x³
  • Intensity 3 (upper sub-matrix)
  •        lnx   x²   x³   x½   x¼
  • lnx     -    0    0    0    0
  • x²      1    -    3    0    0
  • x³      1    6    -    0    0
  • x½      2.5  2.5  3    -    0
  • x¼      2.5  2.5  3    4    -
  • Similarities scored by a 6th-grade student on a
    scale of 1 to 7 (lower sub-matrix)

51
Additive clustering
  • Similarities are the sum of intensities of
    clusters
  • Residuals: relatively small
    (upper sub-matrix)
  •        lnx   x²   x³   x½   x¼
  • lnx     -    0    0    .5   .5
  • x²      1    -    0    -.5  -.5
  • x³      1    6    -    0    0
  • x½      2.5  2.5  3    -    0
  • x¼      2.5  2.5  3    4    -
  • Similarities scored by a 6th-grade student on a
    scale of 1 to 7 (lower sub-matrix)

52
Data recovery Additive clustering
  • Observed similarity matrix:
  • B = A_g + A_1 + A_2 + A_3 + E
  • Problem: given B, find the A's to minimize E, the
    difference between B and the summary of the A's:
  • ‖B − (A_g + A_1 + A_2 + A_3)‖² → min over the A's
    (see the elementwise form below)
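Written elementwise, with each A_k in the rank-one form c_k z_k z_kᵀ used on the next two slides (z_k a binary membership vector, c_k its intensity, c_g the background intensity):

```latex
\[
b_{ij} \;=\; c_g \;+\; \sum_{k=1}^{m} c_k\, z_{ik} z_{jk} \;+\; e_{ij},
\qquad
\sum_{i,j} e_{ij}^2 \;\to\; \min .
\]
```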

53
Doubly greedy strategy
  • OUTER LOOP: One cluster at a time
  • Find real c and binary z to minimize L²(B, c, z)
  • Take cluster S = {i : z_i = 1}
  • Update B: B ← B − c z zᵀ
  • Reiterate
  • After m iterations: clusters S_k with cardinalities
    N_k = |S_k| and intensities c_k
  • T(B) = c₁²N₁² + … + c_m²N_m² + L²   (*)

54
Inner loop: finding a cluster
  • Maximize the contribution to (*): max (c N_S)²
  • Property: the average similarity b(i, S) of i to S
    is > c/2 if i ∈ S and < c/2 if i ∉ S
  • Algorithm ADDI-S:
  • Take S = {i} for an arbitrary i
  • Given S, find c = c(S) and b(i, S) for all i
  • If b(i, S) − c/2 is > 0 for some i ∉ S, or < 0 for
    some i ∈ S, change the state of i. Else, stop and
    output S. (A code sketch follows below.)
  • The resulting S satisfies the property.
  • Holzinger (1941): B-coefficient; Arkadiev &
    Braverman (1964, 1967): Specter; Mirkin (1976,
    1987): ADDI-S; Ben-Dor, Shamir, Yakhini (1999):
    CAST
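A sketch of ADDI-S in Python over a symmetric similarity matrix B with a zero diagonal (the flip rule follows the b(i, S) vs c/2 property above; the tie-breaking and the singleton intensity c = 0 are illustrative choices):

```python
import numpy as np

def addi_s(B, start):
    """Grow/shrink cluster S one entity at a time: flip the membership
    of the entity whose average similarity to S most violates the c/2
    rule; stop when no flip helps."""
    z = np.zeros(B.shape[0], dtype=bool)
    z[start] = True
    while True:
        idx = np.flatnonzero(z)
        m = idx.size
        # intensity c(S): the average within-cluster similarity
        c = B[np.ix_(idx, idx)].sum() / (m * (m - 1)) if m > 1 else 0.0
        sums = B[:, idx].sum(axis=1)
        b = np.where(z, sums / max(m - 1, 1), sums / m)  # b(i, S)
        gain = np.where(z, c / 2 - b, b - c / 2)         # improvement if i flips
        i = gain.argmax()
        if gain[i] <= 0:
            return idx, c    # stable cluster S and its intensity c
        z[i] = ~z[i]
```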

55
DRA on: Mixed variable scales and normalisation
  • Feature normalisation: any measure not tied to
    the distribution, e.g., the range
  • Nominal scale: binary categories normalised to get
    the total feature contribution right, e.g., by the
    square root of the number of categories
56
DRA on: Interpretation
Cluster centroids are supplemented with
contributions of feature/cluster pairs or
entity/cluster pairs.
K-Means: What is representative?
  • Distance: Min (conventional)
  • Inner product: Max (data recovery)
57
DRA on: Incomplete clustering
  • With the model assigning un-clustered entities to
    the norm (e.g., the gravity centre): Anomalous
    Pattern clustering (iterated)

58
DRA on: Number of clusters
  • iK-Means (under the assumption that every
    cluster, in sequence, contributes more than the
    next one: a planetary model)
  • Otherwise, the issue is rather bleak

59
Failure of statistically sound criteria
  • Ming-Tso Chiang (2006): 100 entities in 6D, 4
    clusters, between-cluster distances 50 times
    greater than within-cluster distances
  • Hartigan's F coefficient and the Jump statistic fail

60
Conclusion
  • The data recovery approach should be the major
    mathematical underpinning for data mining as a
    framework for finding patterns in data