Title: Course on Data Mining (581550-4)
1 Course on Data Mining (581550-4)
Intro/Ass. Rules
Clustering
Episodes
KDD Process
Text Mining
Appl./Summary
2 Course on Data Mining (581550-4)
Today 14.11.2001
- Today's subject: Classification, clustering
- Next week's program:
  - Lecture: Data mining process
  - Exercise: Classification, clustering
  - Seminar: Classification, clustering
3 Classification and clustering
- Classification and prediction
- Clustering and similarity
4 Cluster analysis
- What is cluster analysis?
- Similarity and dissimilarity
- Types of data in cluster analysis
- Major clustering methods
- Partitioning methods
- Hierarchical methods
- Outlier analysis
- Summary
Overview
5 What is cluster analysis?
- Cluster: a collection of data objects that are
  - similar to one another within the same cluster
  - dissimilar to the objects in the other clusters
- Aim of clustering: to group a set of data objects into clusters
6 Typical uses of clustering
- As a stand-alone tool to get insight into the data distribution
- As a preprocessing step for other algorithms
7 Applications of clustering
- Marketing: discovering distinct customer groups in a purchase database
- Land use: identifying areas of similar land use in an earth observation database
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: identifying groups of houses according to their house type, value, and geographical location
8 What is good clustering?
- A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
- The quality of a clustering result depends on
  - the similarity measure used
  - the implementation of the similarity measure
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns
9 Requirements of clustering in data mining (1)
- Scalability
- Ability to deal with different types of attributes
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
10 Requirements of clustering in data mining (2)
- Ability to deal with noise and outliers
- Insensitivity to order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
11 Similarity and dissimilarity between objects (1)
- There is no single definition of similarity or dissimilarity between data objects
- The definition of similarity or dissimilarity between objects depends on
  - the type of the data considered
  - what kind of similarity we are looking for
12 Similarity and dissimilarity between objects (2)
- Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y)
- Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions:
  - $d(x,y) \geq 0$, and $d(x,y) = 0$ if and only if $x = y$
  - $d(x,y) = d(y,x)$ (symmetry)
  - $d(x,z) \leq d(x,y) + d(y,z)$ (triangle inequality)
13 Types of data in cluster analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
- Complex data types
14 Interval-scaled variables (1)
- Continuous measurements on a roughly linear scale
- For example, weight, height, and age
- The measurement unit can affect the cluster analysis
- To avoid dependence on the measurement unit, we should standardize the data
15 Interval-scaled variables (2)
- To standardize the measurements:
  - calculate the mean absolute deviation
    $s_f = \frac{1}{n}(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|)$,
    where $m_f = \frac{1}{n}(x_{1f} + x_{2f} + \dots + x_{nf})$
  - calculate the standardized measurement (z-score)
    $z_{if} = \frac{x_{if} - m_f}{s_f}$
16 Interval-scaled variables (3)
- One group of popular distance measures for interval-scaled variables are the Minkowski distances
  $d(i,j) = (|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q)^{1/q}$,
  where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer
17 Interval-scaled variables (4)
- If q = 1, the distance measure is the Manhattan (or city block) distance
- If q = 2, the distance measure is the Euclidean distance
18 Binary variables (1)
- A binary variable has only two states: 0 or 1
- A contingency table for binary data, counting over all variables how often objects i and j take each pair of values:

                 Object j
                   1     0    sum
  Object i   1     a     b    a+b
             0     c     d    c+d
           sum    a+c   b+d    p
19 Binary variables (2)
- Simple matching coefficient (invariant similarity, if the binary variable is symmetric):
  $d(i,j) = \frac{b + c}{a + b + c + d}$
- Jaccard coefficient (noninvariant similarity, if the binary variable is asymmetric):
  $d(i,j) = \frac{b + c}{a + b + c}$
20 Binary variables (3)
- Example: dissimilarity between binary variables
  - a patient record table with eight attributes, of which
    - gender is a symmetric attribute, and
    - the remaining attributes are asymmetric binary
21 Binary variables (4)
- Let the values Y and P be set to 1, and the value N be set to 0
- Compute distances between patients based on the asymmetric variables by using the Jaccard coefficient
22 Nominal variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: simple matching
  - $d(i,j) = \frac{p - m}{p}$, where m is the number of matches and p the total number of variables
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
23 Ordinal variables
- An ordinal variable can be discrete or continuous
- Order of values is important, e.g., rank
- Can be treated like interval-scaled variables:
  - replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
  - compute the dissimilarity using methods for interval-scaled variables
24 Ratio-scaled variables
- A positive measurement on a nonlinear scale, approximately at exponential scale
  - for example, $Ae^{Bt}$ or $Ae^{-Bt}$
- Methods:
  - treat them like interval-scaled variables: not a good choice! (why?)
  - apply a logarithmic transformation: $y_{if} = \log(x_{if})$
  - treat them as continuous ordinal data and treat their ranks as interval-scaled
25 Variables of mixed types (1)
- A database may contain all six types of variables
- One may use a weighted formula to combine their effects:
  $d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)} d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$,
  where the indicator $\delta_{ij}^{(f)} = 0$ if $x_{if}$ or $x_{jf}$ is missing (or if $x_{if} = x_{jf} = 0$ and variable f is asymmetric binary), and $\delta_{ij}^{(f)} = 1$ otherwise
26 Variables of mixed types (2)
- Contribution of variable f to the distance d(i,j):
  - if f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and 1 otherwise
  - if f is interval-based: use the normalized distance
  - if f is ordinal or ratio-scaled:
    - compute the ranks $r_{if}$ and $z_{if} = \frac{r_{if} - 1}{M_f - 1}$
    - treat $z_{if}$ as interval-scaled
27 Complex data types
- Not all objects considered in data mining are relational => complex types of data
  - examples of such data are spatial data, multimedia data, genetic data, time-series data, text data, and data collected from the World-Wide Web
- These often require totally different similarity or dissimilarity measures than the ones above
  - this can, for example, mean using string and/or sequence matching, or methods of information retrieval
28 Major clustering methods
- Partitioning methods
- Hierarchical methods
- Density-based methods
- Grid-based methods
- Model-based methods (conceptual clustering,
neural networks)
29 Partitioning methods
- A partitioning method constructs a partition of a database D of n objects into a set of k clusters such that
  - each cluster contains at least one object
  - each object belongs to exactly one cluster
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
30 Criteria for judging the quality of partitions
- Global optimum: exhaustively enumerate all partitions
- Heuristic methods:
  - k-means (MacQueen'67): each cluster is represented by the center of the cluster (centroid)
  - k-medoids (Kaufman & Rousseeuw'87): each cluster is represented by one of the objects in the cluster (medoid)
31 K-means clustering method (1)
- Input to the algorithm: the number of clusters k, and a database of n objects
- The algorithm consists of four steps:
  1. partition the objects into k nonempty subsets/clusters
  2. compute a seed point as the centroid (the mean of the objects in the cluster) for each cluster in the current partition
  3. assign each object to the cluster with the nearest centroid
  4. go back to Step 2; stop when there are no more new assignments
32 K-means clustering method (2)
- An alternative algorithm also consists of four steps:
  1. arbitrarily choose k objects as the initial cluster centers (centroids)
  2. (re)assign each object to the cluster with the nearest centroid
  3. update the centroids
  4. go back to Step 2; stop when there are no more new assignments
33 K-means clustering method - Example
34 Strengths of the K-means clustering method
- Relatively scalable in processing large data sets
- Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t << n
- Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms
35 Weaknesses of the K-means clustering method
- Applicable only when the mean of the objects is defined
- Need to specify k, the number of clusters, in advance
- Unable to handle noisy data and outliers
- Not suitable for discovering clusters with non-convex shapes, or clusters of very different sizes
36 Variations of the K-means clustering method (1)
- A few variants of k-means differ in
  - the selection of the initial k centroids
  - the dissimilarity calculations
  - the strategies for calculating cluster centroids
37 Variations of the K-means clustering method (2)
- Handling categorical data: k-modes (Huang'98)
  - replaces the means of clusters with modes
  - uses new dissimilarity measures to deal with categorical objects
  - uses a frequency-based method to update the modes of clusters
- A mixture of categorical and numerical data: the k-prototype method
38 K-medoids clustering method
- Input to the algorithm: the number of clusters k, and a database of n objects
- The algorithm consists of four steps:
  1. arbitrarily choose k objects as the initial medoids (representative objects)
  2. assign each remaining object to the cluster with the nearest medoid
  3. select a nonmedoid and replace one of the medoids with it if this improves the clustering
  4. go back to Step 2; stop when there are no more new assignments
39 Hierarchical methods
- A hierarchical method constructs a hierarchy of clusterings, not just a single partition of the objects
- The number of clusters k is not required as an input
- Uses a distance matrix as the clustering criterion
- A termination condition can be used (e.g., a desired number of clusters)
40 A tree of clusterings
- The hierarchy of clusterings is often given as a clustering tree, also called a dendrogram
  - leaves of the tree represent the individual objects
  - internal nodes of the tree represent the clusters
41 Two types of hierarchical methods (1)
- Two main types of hierarchical clustering techniques:
  - agglomerative (bottom-up):
    - place each object in its own cluster (a singleton)
    - in each step, merge the two most similar clusters until there is only one cluster left or the termination condition is satisfied
  - divisive (top-down):
    - start with one big cluster containing all the objects
    - divide the most distinctive cluster into smaller clusters and proceed until there are n clusters or the termination condition is satisfied
42 Two types of hierarchical methods (2)
43 Inter-cluster distances
- Three widely used ways of defining the inter-cluster distance, i.e., the distance between two separate clusters, are:
  - the single linkage method (nearest neighbor)
  - the complete linkage method (furthest neighbor)
  - the average linkage method (unweighted pair-group average)
44 Strengths of hierarchical methods
- Conceptually simple
- Theoretical properties are well understood
- When clusters are merged/split, the decision is permanent => the number of different alternatives that need to be examined is reduced
45 Weaknesses of hierarchical methods
- Merging/splitting of clusters is permanent => erroneous decisions are impossible to correct later
- Divisive methods can be computationally hard
- Methods are not (necessarily) scalable for large data sets
46 Outlier analysis (1)
- Outliers
  - are objects that are considerably dissimilar from the remainder of the data
  - can be caused by a measurement or execution error, or
  - are the result of inherent data variability
- Many data mining algorithms try
  - to minimize the influence of outliers
  - to eliminate the outliers
47 Outlier analysis (2)
- Minimizing the effect of outliers and/or eliminating the outliers may cause information loss
- Outliers themselves may be of interest => outlier mining
- Applications of outlier mining:
  - Fraud detection
  - Customized marketing
  - Medical treatments
48 Summary (1)
- Cluster analysis groups objects based on their similarity
- Cluster analysis has wide applications
- Measures of similarity can be computed for various types of data
- The selection of a similarity measure depends on the data used and the type of similarity we are searching for
49 Summary (2)
- Clustering algorithms can be categorized into
- partitioning methods,
- hierarchical methods,
- density-based methods,
- grid-based methods, and
- model-based methods
- There are still lots of research issues on
cluster analysis
50 Seminar Presentations / Groups 7-8
Classification of spatial data
K. Koperski, J. Han, N. Stefanovic: "An Efficient Two-Step Method of Classification of Spatial Data", SDH'98
51 Seminar Presentations / Groups 7-8
WEBSOM
K. Lagus, T. Honkela, S. Kaski, T. Kohonen: "Self-organizing Maps of Document Collections: A New Approach to Interactive Exploration", KDD'96
T. Honkela, S. Kaski, K. Lagus, T. Kohonen: "WEBSOM: Self-Organizing Maps of Document Collections", WSOM'97
52 Course on Data Mining
Thanks to Jiawei Han from Simon Fraser University for his slides, which greatly helped in preparing this lecture!
53 References - clustering
- R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
- M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
- M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
- P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
- M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
- M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
- D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
- D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.
54 References - clustering
- S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
- L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
- E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
- G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley & Sons, 1988.
- P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
- R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
- E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
- G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
- W. Wang, J. Yang, and R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
- T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.