Title: Data Warehousing and Data Mining: Chapter 4
1Data Warehousing and Data Mining Chapter 4
2What is Cluster Analysis?
- Cluster a collection of data objects
- Similar to one another within the same cluster
- Dissimilar to the objects in other clusters
- Cluster analysis
- Finding similarities between data according to
the characteristics found in the data and
grouping similar data objects into clusters - Clustering is unsupervised classification no
predefined classes - Typical applications
- As a stand-alone tool to get insight into data
distribution - As a preprocessing step for other algorithms
- Measuring the performance of supervised learning
algorithms
3Basic Measures for Clustering
- Clustering: Given a database D = {t1, t2, ..., tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D -> {1, ..., k} where each ti is assigned to one cluster Kf, 1 <= f <= k
5General Applications of Clustering
- Pattern Recognition
- Spatial Data Analysis
- create thematic maps in GIS by clustering feature spaces
- detect spatial clusters and explain them in spatial data mining
- Image Processing
- Economic Science (especially market research)
- WWW
- Document classification
- Cluster Weblog data to discover groups of similar access patterns
6Examples of Clustering Applications
- Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Land use: Identification of areas of similar land use in an earth observation database
- Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
- City-planning: Identifying groups of houses according to their house type, value, and geographical location
- Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
7Constraint-Based Clustering Analysis
- Constraint-based clustering analysis: fewer parameters but more user-desired constraints, e.g., an ATM allocation problem
8Clustering Cities
- Clustering Turkish cities
- Based on political, demographic, and economic characteristics
- Political: general elections of 1999 and 2002
- Demographic: population, urbanization rates
- Economic: GNP per head, growth rate of GNP
9What Is Good Clustering?
- A good clustering method will produce high quality clusters with
- high intra-class similarity
- low inter-class similarity
- The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
10Requirements of Clustering in Data Mining
- Scalability
- Ability to deal with different types of attributes
- Ability to handle dynamic data
- Discovery of clusters with arbitrary shape
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- High dimensionality
- Incorporation of user-specified constraints
- Interpretability and usability
11Chapter 5. Cluster Analysis
- What is Cluster Analysis?
- Types of Data in Cluster Analysis
- A Categorization of Major Clustering Methods
- Partitioning Methods
- Hierarchical Methods
- Density-Based Methods
- Grid-Based Methods
- Model-Based Clustering Methods
- Outlier Analysis
- Summary
12Partitioning Algorithms Basic Concept
- Partitioning method: Construct a partition of a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
- Global optimal: exhaustively enumerate all partitions
- Heuristic methods: k-means and k-medoids algorithms
- k-means (MacQueen'67): Each cluster is represented by the center of the cluster
- k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw'87): Each cluster is represented by one of the objects in the cluster
13The K-Means Clustering Algorithm
- choose k, the number of clusters to be determined
- Choose k objects randomly as the initial cluster centers
- repeat
- Assign each object to its closest cluster center
- using Euclidean distance
- Compute new cluster centers
- calculate mean points
- until
- no change in cluster centers, or
- no object changes its cluster
- (a code sketch of these steps follows below)
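A minimal NumPy sketch of the loop above (an illustration, not code from the slides; the function name k_means and the use of the six points from the worked example of slide 15 are my own choices, and empty clusters would need extra handling in general):

    import numpy as np

    def k_means(X, k, max_iter=100, seed=0):
        """Plain k-means on an (n, d) array X with k clusters."""
        rng = np.random.default_rng(seed)
        # choose k objects at random as the initial cluster centers
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each object to its closest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # recompute each center as the mean of the objects assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):  # centers unchanged -> stop
                break
            centers = new_centers
        return labels, centers

    # the six instances of the worked example that follows (slide 15)
    X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                  [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
    print(k_means(X, k=2))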
14The K-Means Clustering Method
(figure: k = 2 example; arbitrarily choose k objects as initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the assignment stabilizes)
15Example TBP Sec 3.3 page 84 Table 3.6
- Instance  X    Y
- 1         1.0  1.5
- 2         1.0  4.5
- 3         2.0  1.5
- 4         2.0  3.5
- 5         3.0  2.5
- 6         5.0  6.0
- k is chosen as 2 (k = 2)
- Choose two points at random representing initial cluster centers
- Objects 1 and 3 are chosen as cluster centers
16(figure: the six instances plotted in the X-Y plane)
17Example cont.
- Euclidean distance between points i and j
- D(i, j) = ((Xi - Xj)^2 + (Yi - Yj)^2)^(1/2)
- Initial cluster centers
- C1 = (1.0, 1.5), C2 = (2.0, 1.5)
- D(C1, 1) = 0.00, D(C2, 1) = 1.00 -> C1
- D(C1, 2) = 3.00, D(C2, 2) = 3.16 -> C1
- D(C1, 3) = 1.00, D(C2, 3) = 0.00 -> C2
- D(C1, 4) = 2.24, D(C2, 4) = 2.00 -> C2
- D(C1, 5) = 2.24, D(C2, 5) = 1.41 -> C2
- D(C1, 6) = 6.02, D(C2, 6) = 5.41 -> C2
- C1 = {1, 2}, C2 = {3, 4, 5, 6}
18(figure: the six instances plotted with the current assignment C1 = {1, 2}, C2 = {3, 4, 5, 6})
19Example cont.
- Recomputing cluster centers
- For C1
- XC1 = (1.0 + 1.0)/2 = 1.0
- YC1 = (1.5 + 4.5)/2 = 3.0
- For C2
- XC2 = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0
- YC2 = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
- Thus the new cluster centers are
- C1 = (1.0, 3.0) and C2 = (3.0, 3.375)
- As the cluster centers have changed
- The algorithm performs another iteration
20(figure: the six instances plotted with the updated cluster centers)
21Example cont.
- New cluster centers
- C1 = (1.0, 3.0) and C2 = (3.0, 3.375)
- D(C1, 1) = 1.50, D(C2, 1) = 2.74 -> C1
- D(C1, 2) = 1.50, D(C2, 2) = 2.29 -> C1
- D(C1, 3) = 1.80, D(C2, 3) = 2.13 -> C1
- D(C1, 4) = 1.12, D(C2, 4) = 1.01 -> C2
- D(C1, 5) = 2.06, D(C2, 5) = 0.88 -> C2
- D(C1, 6) = 5.00, D(C2, 6) = 3.30 -> C2
- C1 = {1, 2, 3}, C2 = {4, 5, 6}
22Example cont.
- Computing new cluster centers
- For C1
- XC1 = (1.0 + 1.0 + 2.0)/3 = 1.33
- YC1 = (1.5 + 4.5 + 1.5)/3 = 2.50
- For C2
- XC2 = (2.0 + 3.0 + 5.0)/3 = 3.33
- YC2 = (3.5 + 2.5 + 6.0)/3 = 4.00
- Thus the new cluster centers are
- C1 = (1.33, 2.50) and C2 = (3.33, 4.00)
- As the cluster centers have changed
- The algorithm performs another iteration
23Exercise
- Perform the third iteration
24Comments
- Each choice of initial cluster centers may end up with a different final cluster configuration
- Finds a local optimum but not necessarily the global optimum
- Based on the sum of squared error (SSE): differences between objects and their cluster centers
- Choose a terminating criterion such as
- Maximum acceptable SSE
- Execute the K-Means algorithm until the condition is satisfied
25Comments on the K-Means Method
- Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
- Comparing: PAM: O(k(n-k)^2), CLARA: O(ks^2 + k(n-k))
- Comment: Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
26Weaknesses of K-Means Algorithm
- Applicable only when the mean is defined; then what about categorical data?
- Need to specify k, the number of clusters, in advance
- run the algorithm with different k values
- Unable to handle noisy data and outliers
- Not suitable to discover clusters with non-convex shapes
- Works best when clusters are of approximately equal size
27Presence of Outliers
(figure: a point set clustered with k = 2 and with k = 3; when k = 2, the two natural clusters are not captured)
28Quality of clustering depends on unit of measure
(figure: income vs. age scatter plots; with income measured in TL versus YTL, and age in years, the clusters found differ. So what to do?)
29Variations of the K-Means Method
- A few variants of the k-means which differ in
- Selection of the initial k means
- Dissimilarity calculations
- Strategies to calculate cluster means
- Handling categorical data: k-modes (Huang'98)
- Replacing means of clusters with modes
- Using new dissimilarity measures to deal with categorical objects
- Using a frequency-based method to update modes of clusters
- A mixture of categorical and numerical data: the k-prototype method
30Exercise
- Show by designing simple examples that
- a) the K-means algorithm may converge to different local optima starting from different initial assignments of objects into clusters
- b) in the case of clusters of unequal size, the K-means algorithm may fail to catch the obvious (natural) solution
32How to choose K
- for reasonable values of k
- e.g. from 2 to 15
- plot k versus SSE (sum of squared error)
- visually inspect the plot and
- as k increases, SSE falls
- choose the breaking point (see the sketch below)
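A sketch of this procedure (assuming scikit-learn and matplotlib are available; the data here is a random stand-in, not from the slides):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(300, 2))  # stand-in data set

    ks = range(2, 16)
    sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

    plt.plot(list(ks), sse, marker="o")
    plt.xlabel("k")
    plt.ylabel("SSE (within-cluster sum of squared errors)")
    plt.show()  # choose the k where the curve bends (the "elbow")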
33(figure: SSE plotted against k = 2, 4, 6, 8, 10, 12; the bend in the curve suggests the value of k)
34Validation of clustering
- partition the data into two equal groups
- apply clustering to one of these partitions
- compare cluster centers with those of the overall data
- or
- apply clustering to each of these groups
- compare cluster centers
35Data Structures
- Data matrix
- (two modes)
- Dissimilarity matrix
- (one mode)
- (the usual forms of these matrices are sketched below)
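The matrices themselves appear as images on the slide; their standard forms for n objects and p variables are:

Data matrix (n x p, two modes: rows are objects, columns are variables):

    \begin{bmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{bmatrix}

Dissimilarity matrix (n x n, one mode: both rows and columns are objects):

    \begin{bmatrix} 0 & & & & \\ d(2,1) & 0 & & & \\ d(3,1) & d(3,2) & 0 & & \\ \vdots & \vdots & \vdots & \ddots & \\ d(n,1) & d(n,2) & \cdots & \cdots & 0 \end{bmatrix}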
36Properties of Dissimilarity Measures
- Properties
- d(i,j) >= 0 for i != j
- d(i,i) = 0
- d(i,j) = d(j,i) (symmetry)
- d(i,j) <= d(i,k) + d(k,j) (triangle inequality)
- Exercise: Can you find examples where distances between objects do not obey the symmetry property?
37Measure the Quality of Clustering
- Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically a metric d(i, j)
- There is a separate quality function that measures the goodness of a cluster.
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"
- the answer is typically highly subjective.
38Type of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
39Classification by Scale
- Nominal scale: merely distinguishes classes; with respect to objects A and B, XA = XB or XA != XB
- e.g. color: red, blue, green, ...
- gender: male, female
- occupation: engineering, management, ...
- Ordinal scale: indicates ordering of objects in addition to distinguishing them
- XA = XB or XA != XB; XA > XB or XA < XB
- e.g. education: no school < primary sch. < high sch. < undergrad < grad
- age: young < middle < old
- income: low < medium < high
40- Interval scale: assigns a meaningful measure of difference between two objects
- Not only XA > XB, but XA is (XA - XB) units different from XB
- e.g. specific gravity
- temperature in °C or °F
- The boiling point of water is 100 °C (or 180 °F) higher than its melting point
- Ratio scale: an interval scale with a meaningful zero point
- Not only XA > XB, but XA is XA/XB times greater than XB
- e.g. height, weight, age (as an integer)
- temperature in K or °R
- Water boils at 373 K and melts at 273 K
- The boiling point of water is 1.37 times hotter than its melting point
41Comparison of Scales
- Strongest scale is ratio; weakest scale is nominal
- Ahmet's height is 2.00 meters: HA
- Mehmet's height is 1.50 meters: HM
- HA != HM (nominal): their heights are different
- HA > HM (ordinal): Ahmet is taller than Mehmet
- HA - HM = 0.50 meters (interval): Ahmet is 50 cm taller than Mehmet
- HA / HM = 1.333 (ratio): the ratio is the same no matter whether height is measured in meters or inches
42Interval-valued variables
- Standardize data
- Calculate the mean absolute deviation
- Calculate the standardized measurement (z-score)
- Using the mean absolute deviation is more robust than using the standard deviation
- (the formulas, shown as images on the slide, are given below)
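In the usual notation, for variable f with measurements x_{1f}, ..., x_{nf}:

    m_f = \frac{1}{n} (x_{1f} + x_{2f} + \cdots + x_{nf})

    s_f = \frac{1}{n} \left( |x_{1f} - m_f| + |x_{2f} - m_f| + \cdots + |x_{nf} - m_f| \right)

    z_{if} = \frac{x_{if} - m_f}{s_f}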
43(figure: two scatter plots of X1 versus X2, cases I and II, both standardized by z-scores)
Both have zero mean and are standardized by z-scores. The standard deviations of X1 and X2 are unity in both cases I and II; the correlation between X1 and X2 is 0 in case I, whereas it is approximately 1 in case II. Shall we use the same distance measure in both cases after obtaining the z-scores?
44Exercise
(figure: point A plotted against X1 and X2 in cases I and II of the previous slide)
Suppose d(A, O) = 0.5 in cases I and II. Does it reflect the distance between A and the origin? Suggest a transformation so as to handle correlation between variables.
45Other Standardizations
- Min-max: scale between 0 and 1, or -1 and 1
- Decimal scaling
- For ratio-scaled variables
- Mean transformation
- zi,f = xi,f / mean_f
- measured in terms of the mean of variable f
- Log transformation
- zi,f = log(xi,f)
- (formulas for the min-max and decimal versions are given below)
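The min-max and decimal-scaling transformations mentioned above are not spelled out on the slide; the standard definitions are:

    z_{if} = \frac{x_{if} - \min_f}{\max_f - \min_f}  (min-max scaling onto [0, 1])

    z_{if} = \frac{x_{if}}{10^{j}}  (decimal scaling, with j the smallest integer such that \max_i |z_{if}| < 1)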
46Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects
- Some popular ones include the Minkowski distance (given below)
- where i = (xi1, xi2, ..., xip) and j = (xj1, xj2, ..., xjp) are two p-dimensional data objects, and q >= 1
- If q = 1, d is the Manhattan distance
47Similarity and Dissimilarity Between Objects
(Cont.)
- If q = 2, d is the Euclidean distance
- Properties
- d(i,j) >= 0
- d(i,i) = 0
- d(i,j) = d(j,i)
- d(i,j) <= d(i,k) + d(k,j)
- Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
48Similarity and Dissimilarity Between Objects
(Cont.)
- Weights can be assigned to variables (one common weighted form is given below)
- where wi, i = 1..p, are weights showing the importance of each variable
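The weighted formula itself is an image on the slide; one common form (weighted Euclidean distance) is:

    d(i, j) = \sqrt{ w_1 (x_{i1} - x_{j1})^2 + w_2 (x_{i2} - x_{j2})^2 + \cdots + w_p (x_{ip} - x_{jp})^2 }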
49(figure: Manhattan distance vs. Euclidean distance between two points XA and XB)
50Binary Variables
- Symmetric / asymmetric
- Symmetric: both of its states are equally valuable and carry the same weight
- gender: male / female
- 0 = male, 1 = female, arbitrarily coded as 0 or 1
- Asymmetric variables
- outcomes are not equally important
- encoded by 0 and 1
- e.g. patient is a smoker or not
- 1 for smoker, 0 for nonsmoker: asymmetric
- positive and negative outcomes of a disease test
- HIV positive coded by 1, HIV negative by 0
51Binary Variables
- A contingency table for binary data (objects i and j)
- Simple matching coefficient (invariant, if the binary variable is symmetric)
- Jaccard coefficient (noninvariant, if the binary variable is asymmetric)
- (the table and both coefficients, shown as images on the slide, are sketched below)
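In the standard form: for objects i and j, let a = number of variables equal to 1 for both, b = number equal to 1 for i and 0 for j, c = number equal to 0 for i and 1 for j, and d = number equal to 0 for both. Then:

    d(i, j) = \frac{b + c}{a + b + c + d}  (simple matching coefficient)

    d(i, j) = \frac{b + c}{a + b + c}  (Jaccard coefficient)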
52Dissimilarity between Binary Variables
- Example
- gender is a symmetric attribute
- the remaining attributes are asymmetric binary
- let the values Y and P be set to 1, and the value
N be set to 0
53Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
- Method 1: Simple matching
- d(i,j) = (p - m)/p, where m = # of matches and p = total # of variables
- Higher weights can be assigned to variables with a large number of states
- Method 2: use a large number of binary variables
- creating a new binary variable for each of the M nominal states
54Example
- 2 nominal variables
- Faculty and country for students
- Faculty: Eng., Applied Sc., Pure Sc., Admin., ... (5 distinct values)
- Country: Turkey, USA, ... (10 distinct values)
- p = 2, just two variables
- The weight of country may be increased
- Student A = (Eng, Turkey), B = (Applied Sc, Turkey)
- m = 1: in one variable A and B are similar
- D(A,B) = (2 - 1)/2 = 1/2
55Example cont.
- Different binary variables for each faculty
- Eng = 1 if the student is in engineering, 0 otherwise
- AppSc = 1 if the student is in applied sciences, 0 otherwise
- Different binary variables for each country
- Turkey = 1 if the student is Turkish, 0 otherwise
- USA = 1 if the student is from the USA, 0 otherwise
57Ordinal Variables
- An ordinal variable can be discrete or continuous
- Order is important, e.g., rank
- Can be treated like interval-scaled
- replace xif by its rank rif
- map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable as shown below
- compute the dissimilarity using methods for interval-scaled variables
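The rank mapping referred to above is the one used in the example on the next slide:

    z_{if} = \frac{r_{if} - 1}{M_f - 1}, \quad r_{if} \in \{1, \ldots, M_f\}

where r_{if} is the rank of object i on variable f and M_f is the number of states of variable f.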
58Example
- Credit card type: gold > silver > bronze > normal, 4 states
- Education: grad > undergrad > high school > primary school > no school, 5 states
- Two customers
- A = (gold, high school)
- B = (normal, no school)
- rA,card = 1, rB,card = 4
- rA,edu = 3, rB,edu = 5
- zA,card = (1-1)/(4-1) = 0
- zB,card = (4-1)/(4-1) = 1
- zA,edu = (3-1)/(5-1) = 0.5
- zB,edu = (5-1)/(5-1) = 1
- Use any interval-scale distance measure on the z values
59Exercise
- Find an attribute having both ordinal and nominal characteristics
- define a similarity or dissimilarity measure for two objects A and B
60Ratio-Scaled Variables
- Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
- Methods
- treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
- apply logarithmic transformation
- yif = log(xif)
- treat them as continuous ordinal data and treat their rank as interval-scaled
61Example
- Cluster individuals based on age, weight and height
- All are ratio-scale variables
- Mean transformation
- zp,i = xp,i / mean_p
- As an absolute zero makes sense, measure distance in units of the mean of each variable
- Then you may apply z = log(z)
- Then use any distance measure for interval scales
62Example cont.
- A weight difference of 0.5 kg is much more important for babies than for adults
- d(3 kg, 3.5 kg) = 0.5 = d(71.5 kg, 71.0 kg) in absolute terms
- but the percentage difference (3.5 - 3)/3 is very significant, approximately log(3.5) - log(3)
- while (71.5 - 71.0)/71.0 is not important: log(71.5) - log(71) is almost zero
63Examples from Sports
- 48 48
- 51 52
- 54 56
- 57 62
- 60 68
- 63.5 74
- 67 82
- 71 90
- 75 100
- 81 130
64Basic Measures for Clustering
- Clustering: Given a database D = {t1, t2, ..., tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D -> {1, ..., k} where each ti is assigned to one cluster Kf, 1 <= f <= k, such that for all tfp, tfq in Kf and ts not in Kf, dis(tfp, tfq) <= dis(tfp, ts)
- Centroid, radius, diameter
- Typical alternatives to calculate the distance between clusters
- Single link, complete link, average, centroid, medoid
65Centroid, Radius and Diameter of a Cluster (for numerical data sets)
- Centroid: the "middle" of a cluster
- Radius: square root of the average squared distance from the points of the cluster to its centroid
- Diameter: square root of the average squared distance between all pairs of points in the cluster
- (the usual formulas follow below)
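The formulas appear only as images on the slide; the usual definitions for a cluster K_m = {t_{m1}, ..., t_{mN}} are:

    C_m = \frac{\sum_{i=1}^{N} t_{mi}}{N}  (centroid)

    R_m = \sqrt{ \frac{\sum_{i=1}^{N} (t_{mi} - C_m)^2}{N} }  (radius)

    D_m = \sqrt{ \frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{mi} - t_{mj})^2}{N (N - 1)} }  (diameter)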
66Typical Alternatives to Calculate the Distance
between Clusters
- Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)
- Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)
- Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)
- Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
- Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
- Medoid: one chosen, centrally located object in the cluster
67Major Clustering Approaches
- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Grid-based: based on a multiple-level granularity structure
- Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to the given model
68The K-Medoids Clustering Method
- Find representative objects, called medoids, in clusters
- PAM (Partitioning Around Medoids, 1987)
- starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
- PAM works effectively for small data sets, but does not scale well for large data sets
- CLARA (Kaufmann & Rousseeuw, 1990)
- CLARANS (Ng & Han, 1994): Randomized sampling
- Focusing + spatial data structure (Ester et al., 1995)
69CLARA (Clustering Large Applications) (1990)
- CLARA (Kaufmann and Rousseeuw in 1990)
- Built into statistical analysis packages, such as S-Plus
- It draws multiple samples of the data set, applies PAM on each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness
- Efficiency depends on the sample size
- A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
70CLARANS (Randomized CLARA) (1994)
- CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han'94)
- CLARANS draws a sample of neighbors dynamically
- The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
- If a local optimum is found, CLARANS starts with a new randomly selected node in search for a new local optimum
- It is more efficient and scalable than both PAM and CLARA
- Focusing techniques and spatial access structures may further improve its performance (Ester et al.'95)
71Hierarchical Clustering
- Use distance matrix as clustering criteria. This
method does not require the number of clusters k
as an input, but needs a termination condition
72AGNES (Agglomerative Nesting)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages, e.g., S-Plus
- Uses the single-link method and the dissimilarity matrix
- Merge nodes that have the least dissimilarity
- Go on in a non-descending fashion
- Eventually all nodes belong to the same cluster
73A Dendrogram Shows How the Clusters are Merged
Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
74Example
- Dissimilarity matrix for objects A-E:
-      A  B  C  D  E
- A    0
- B    1  0
- C    2  2  0
- D    2  4  1  0
- E    3  3  5  3  0
75Single Link Distance Measure
(figure: single-link agglomeration on the matrix of slide 74; {A,B} and {C,D} each merge at distance 1, {A,B} and {C,D} merge with each other at distance 2, and E joins last at distance 3, giving the dendrogram over A, B, C, D, E)
76Complete Link Distance Measure
(figure: complete-link agglomeration on the same matrix; {A,B} and {C,D} each merge at distance 1, {A,B} and E merge at distance 3, and {A,B,E} joins {C,D} at distance 5)
77Average Link Distance Measure
(figure: average-link agglomeration on the same matrix; {A,B} and {C,D} each merge at distance 1, the average link between {A,B} and {C,D} is 2.5 so they merge next, and the average link between {A,B,C,D} and E is 3.5 so E joins last)
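A quick way to reproduce these three dendrograms is SciPy's hierarchical clustering applied to the dissimilarity matrix of slide 74 (a sketch, assuming SciPy is available):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    # dissimilarity matrix of slide 74, object order A, B, C, D, E
    D = np.array([[0, 1, 2, 2, 3],
                  [1, 0, 2, 4, 3],
                  [2, 2, 0, 1, 5],
                  [2, 4, 1, 0, 3],
                  [3, 3, 5, 3, 0]], dtype=float)

    condensed = squareform(D)  # SciPy expects the condensed (vector) form
    for method in ("single", "complete", "average"):
        Z = linkage(condensed, method=method)
        print(method)
        print(Z)  # each row: the two merged clusters, the merge distance, the new size
    # dendrogram(Z, labels=list("ABCDE"))  # draws the tree (needs matplotlib)

The printed merge distances should match the figures above: 1, 1, 2, 3 for single link; 1, 1, 3, 5 for complete link; 1, 1, 2.5, 3.5 for average link.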
78DIANA (Divisive Analysis)
- Introduced in Kaufmann and Rousseeuw (1990)
- Implemented in statistical analysis packages,
e.g., Splus - Inverse order of AGNES
- Eventually each node forms a cluster on its own
79More on Hierarchical Clustering Methods
- Major weaknesses of agglomerative clustering methods
- do not scale well: time complexity of at least O(n^2), where n is the number of total objects
- can never undo what was done previously
- Integration of hierarchical with distance-based clustering
- BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
- CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
- CHAMELEON (1999): hierarchical clustering using dynamic modeling
80Probability Based Clustering
- A mixture is a set of k probability distributions where each distribution represents a cluster
- The mixture model assigns each object a probability of being a member of cluster k
- rather than assigning the object to cluster k or not
- as in k-means and k-medoids
- e.g. P(C1|X)
81Simplest Case
- K = 2: two clusters
- There is one real-valued attribute X
- Each cluster is normally distributed
- Five parameters to determine
- mean and standard deviation for cluster A: muA, sigmaA
- mean and standard deviation for cluster B: muB, sigmaB
- sampling probability for cluster A: pA
- as P(A) + P(B) = 1
82- There are two populations, each normally distributed with its own mean and standard deviation
- An object comes from one of them, but which one is unknown
- Assign a probability that an object comes from A or B
- That is equivalent to assigning a cluster to each object
83- In parametric statistics
- Assume that objects come from a distribution, usually
- a normal distribution with unknown parameters mu and sigma
- Estimate the parameters from the data
- Test hypotheses based on the normality assumption
- Test the normality assumption
84Two normal distributions: bimodal. Arbitrary shapes can be captured by a mixture of normals.
85- Variable: age
- Clusters: C1, C2, C3
- P(C1 | age = 25)
- P(C2 | age = 35)
- P(C3 | age = 65)
86Returning to Mixtures
- Given the five parameters
- Finding the probabilities that a given object comes from each distribution or cluster is easy
- Given an object X,
- the probability that it belongs to each cluster is
- P(A|X) = P(A,X)/P(X) = P(X|A)P(A)/P(X) = f(X; muA, sigmaA) P(A) / P(X)
- P(B|X) = P(B,X)/P(X) = P(X|B)P(B)/P(X) = f(X; muB, sigmaB) P(B) / P(X)
87- f(X; muA, sigmaA) is the normal density function for cluster A (given below)
The denominator P(X) will disappear: calculate the numerators for both P(A|X) and P(B|X) and normalize them by dividing each by their sum.
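For reference, the one-dimensional normal density used here is:

    f(x; \mu_A, \sigma_A) = \frac{1}{\sqrt{2\pi}\, \sigma_A} \exp\!\left( -\frac{(x - \mu_A)^2}{2 \sigma_A^2} \right)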
88- P(A|X) = P(X|A)P(A) / (P(X|A)P(A) + P(X|B)P(B))
- P(B|X) = P(X|B)P(B) / (P(X|A)P(A) + P(X|B)P(B))
- Note that the final outcomes P(A|X) and P(B|X) give the probabilities that the given object X belongs to cluster A and to cluster B
- Note that these can be calculated only by knowing the prior probabilities P(A) and P(B)
89EM Algorithm for the Simple Case
- We know neither of these things
- the distribution that each object came from
- P(X|A), P(X|B)
- nor the five parameters of the mixture model
- muA, muB, sigmaA, sigmaB, P(A)
- Adapt the procedure used for the k-means algorithm
90EM Algorithm for the Simple Case
- Initially guess values for the five parameters
- muA, muB, sigmaA, sigmaB, P(A)
- Until a specified termination criterion
- Compute P(A|Xi) and P(B|Xi) for each Xi, i = 1..n
- class probabilities for each object
- as these are normal distributions
- the expectation step
- Use these probabilities to re-estimate the parameters
- muA, muB, sigmaA, sigmaB, P(A)
- maximization of the likelihood of the distributions
- Until the likelihood converges or no longer improves significantly
- (a code sketch of this procedure follows below)
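A minimal sketch of this EM loop for the simple two-cluster, one-attribute case (an illustration, not the slides' code; the helper name em_two_gaussians and the synthetic data are my own choices):

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, n_iter=100, seed=0):
        """EM for a 1-D mixture of two Gaussians (clusters A and B)."""
        rng = np.random.default_rng(seed)
        mu_a, mu_b = rng.choice(x, size=2, replace=False)  # initial guesses for the means
        sd_a = sd_b = x.std()
        p_a = 0.5
        for _ in range(n_iter):
            # E-step: class probabilities w_iA = P(A|x_i), w_iB = P(B|x_i)
            num_a = norm.pdf(x, mu_a, sd_a) * p_a
            num_b = norm.pdf(x, mu_b, sd_b) * (1 - p_a)
            w_a = num_a / (num_a + num_b)
            w_b = 1.0 - w_a
            # M-step: re-estimate the five parameters from the weights
            mu_a = (w_a * x).sum() / w_a.sum()
            mu_b = (w_b * x).sum() / w_b.sum()
            sd_a = np.sqrt((w_a * (x - mu_a) ** 2).sum() / w_a.sum())
            sd_b = np.sqrt((w_b * (x - mu_b) ** 2).sum() / w_b.sum())
            p_a = w_a.mean()
        return mu_a, sd_a, mu_b, sd_b, p_a

    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 200), rng.normal(5, 1, 200)])
    print(em_two_gaussians(x))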
91Estimation of new means
- If wiA is the probability that object i belongs to cluster A
- wiA = P(A|Xi), calculated in the E step
- muA = (w1A x1 + w2A x2 + ... + wnA xn) / (w1A + w2A + ... + wnA)
- wiB = P(B|Xi), calculated in the E step
- muB = (w1B x1 + w2B x2 + ... + wnB xn) / (w1B + w2B + ... + wnB)
92Estimation of new standard deviations
- sigmaA^2 = (w1A(x1 - muA)^2 + w2A(x2 - muA)^2 + ... + wnA(xn - muA)^2) / (w1A + w2A + ... + wnA)
- sigmaB^2 = (w1B(x1 - muB)^2 + w2B(x2 - muB)^2 + ... + wnB(xn - muB)^2) / (w1B + w2B + ... + wnB)
- The xi are all the objects, not just those belonging to A or B
- These are maximum likelihood estimators for the variance
- If the weights are equal, the denominator sums to n rather than n - 1
93Estimation of new prior probabilities
- P(A) = (P(A|X1) + P(A|X2) + ... + P(A|Xn)) / n
- pA = (w1A + w2A + ... + wnA) / n
- P(B) = (P(B|X1) + P(B|X2) + ... + P(B|Xn)) / n
- pB = (w1B + w2B + ... + wnB) / n
- or P(B) = 1 - P(A)
94General Considerations
- EM converges to a local maximum of the likelihood score
- which may not be the global one
- EM may be run with different initial values of the parameters, and the clustering structure with the highest likelihood is chosen
- Or an initial k-means run is conducted to get the mean and standard deviation parameters
95Density-Based Clustering Methods
- Clustering based on density (a local cluster criterion), such as density-connected points
- Major features
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
- Several interesting studies
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)
96Density Concepts
- Core object (CO): an object with at least M objects within its radius-E neighborhood
- Directly density-reachable (DDR): x is a CO, y is in x's E-neighborhood
- Density-reachable: there exists a chain of DDR objects from x to y
- Density-based cluster: a set of density-connected objects, maximal w.r.t. reachability
97Density-Based Clustering Background
- Two parameters
- Eps: maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
- N_Eps(p) = {q belongs to D | dist(p,q) <= Eps}
- Directly density-reachable: A point p is directly density-reachable from a point q wrt. Eps, MinPts if
- 1) p belongs to N_Eps(q)
- 2) core point condition: |N_Eps(q)| >= MinPts
98Density-Based Clustering Background (II)
- Density-reachable
- A point p is density-reachable from a point q wrt. Eps, MinPts if there is a chain of points p1, ..., pn, with p1 = q and pn = p, such that p(i+1) is directly density-reachable from pi
- Density-connected
- A point p is density-connected to a point q wrt. Eps, MinPts if there is a point o such that both p and q are density-reachable from o wrt. Eps and MinPts.
(figure: a chain p1, ..., pn from q to p illustrating density-reachability)
99Let MinPts = 3
(figure: points P, M and Q with their Eps-neighbourhoods)
M and P are core objects, since each has an Eps-neighbourhood containing at least 3 points. Q is directly density-reachable from M; M is directly density-reachable from P and vice versa. Q is (indirectly) density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P; but P is not density-reachable from Q, since Q is not a core object.
100DBSCAN Density Based Spatial Clustering of
Applications with Noise
- Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points
- Discovers clusters of arbitrary shape in spatial databases with noise
101DBSCAN The Algorithm
- Arbitrarily select a point p
- Retrieve all points density-reachable from p wrt Eps and MinPts.
- If p is a core point, a cluster is formed.
- If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database.
- Continue the process until all of the points have been processed.
- (a usage sketch with a library implementation follows below)
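For comparison, a short usage sketch with scikit-learn's DBSCAN implementation (the data is a synthetic stand-in; eps and min_samples play the roles of Eps and MinPts above):

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.3, size=(100, 2)),   # dense blob 1
                   rng.normal(4, 0.3, size=(100, 2)),   # dense blob 2
                   rng.uniform(-2, 6, size=(10, 2))])   # scattered noise points

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)  # label -1 marks noise
    print("clusters:", len(set(labels) - {-1}), "noise points:", int((labels == -1).sum()))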