Data Warehousing and Data Mining: Chapter 4
1
Data Warehousing and Data Mining Chapter 4
  • BIS 541
  • Summer 2008

2
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Clustering is unsupervised classification: no predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms
  • Measuring the performance of supervised learning algorithms

3
Basic Measures for Clustering
  • Clustering: given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k

4
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Clustering is unsupervised learning: no predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

5
General Applications of Clustering
  • Pattern Recognition
  • Spatial Data Analysis
  • create thematic maps in GIS by clustering feature
    spaces
  • detect spatial clusters and explain them in
    spatial data mining
  • Image Processing
  • Economic Science (especially market research)
  • WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar
    access patterns

6
Examples of Clustering Applications
  • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • Land use: identification of areas of similar land use in an earth observation database
  • Insurance: identifying groups of motor insurance policy holders with a high average claim cost
  • City-planning: identifying groups of houses according to their house type, value, and geographical location
  • Earthquake studies: observed earthquake epicenters should be clustered along continent faults

7
Constraint-Based Clustering Analysis
  • Clustering analysis with fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

8
Clustering Cities
  • Clustering Turkish cities
  • Based on political, demographic, and economic characteristics:
  • Political: general elections 1999, 2002
  • Demographic: population, urbanization rates
  • Economic: GNP per head, growth rate of GNP

9
What Is Good Clustering?
  • A good clustering method will produce high
    quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its implementation.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns.

10
Requirements of Clustering in Data Mining
  • Scalability
  • Ability to deal with different types of
    attributes
  • Ability to handle dynamic data
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

11
Chapter 5. Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Outlier Analysis
  • Summary

12
Partitioning Algorithms: Basic Concept
  • Partitioning method: construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized
  • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: k-means and k-medoids algorithms
  • k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  • k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

13
The K-Means Clustering Algorithm
  • Choose k, the number of clusters to be determined
  • Choose k objects randomly as the initial cluster centers
  • Repeat
  • Assign each object to its closest cluster center
  • using Euclidean distance
  • Compute new cluster centers
  • calculate mean points
  • Until
  • no change in cluster centers, or
  • no object changes its cluster (a sketch follows below)
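A minimal NumPy sketch of this loop (the function name and random-initialization scheme are illustrative assumptions, and it does not guard against empty clusters):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Plain k-means: random initial centers, Euclidean assignment,
        mean update, stop when the centers no longer move."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each object to its closest cluster center
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # recompute each center as the mean of its assigned objects
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers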

14
The K-Means Clustering Method
  • Example

[Figure: the classic k-means illustration on a 10 x 10 grid. Arbitrarily choose k objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until stable.]
15
Example (TBP Sec 3.3, page 84, Table 3.6)
  • Instance X Y
  • 1 1.0 1.5
  • 2 1.0 4.5
  • 3 2.0 1.5
  • 4 2.0 3.5
  • 5 3.0 2.5
  • 6 5.0 6.0
  • k is chosen as 2 (k = 2)
  • Choose two points at random as the initial cluster centers
  • Objects 1 and 3 are chosen as the cluster centers

16
[Figure: scatter plot of the six instances, with objects 1 and 3 marked as the initial cluster centers.]
17
Example cont.
  • Euclidean distance between points i and j:
  • d(i, j) = ((Xi - Xj)² + (Yi - Yj)²)^(1/2)
  • Initial cluster centers:
  • C1 = (1.0, 1.5), C2 = (2.0, 1.5)
  • d(C1, 1) = 0.00, d(C2, 1) = 1.00 → C1
  • d(C1, 2) = 3.00, d(C2, 2) = 3.16 → C1
  • d(C1, 3) = 1.00, d(C2, 3) = 0.00 → C2
  • d(C1, 4) = 2.24, d(C2, 4) = 2.00 → C2
  • d(C1, 5) = 2.24, d(C2, 5) = 1.41 → C2
  • d(C1, 6) = 6.02, d(C2, 6) = 5.41 → C2
  • C1 = {1, 2}, C2 = {3, 4, 5, 6}

18
[Figure: the six instances after the first assignment; C1 = {1, 2}, C2 = {3, 4, 5, 6}.]
19
Example cont.
  • Recomputing the cluster centers:
  • For C1: XC1 = (1.0 + 1.0)/2 = 1.0, YC1 = (1.5 + 4.5)/2 = 3.0
  • For C2: XC2 = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0, YC2 = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
  • Thus the new cluster centers are C1 = (1.0, 3.0) and C2 = (3.0, 3.375)
  • As the cluster centers have changed, the algorithm performs another iteration

20
[Figure: the six instances with the updated centers C1 = (1.0, 3.0) and C2 = (3.0, 3.375).]
21
Example cont.
  • New cluster centers: C1 = (1.0, 3.0), C2 = (3.0, 3.375)
  • d(C1, 1) = 1.50, d(C2, 1) = 2.74 → C1
  • d(C1, 2) = 1.50, d(C2, 2) = 2.29 → C1
  • d(C1, 3) = 1.80, d(C2, 3) = 2.13 → C1
  • d(C1, 4) = 1.12, d(C2, 4) = 1.01 → C2
  • d(C1, 5) = 2.06, d(C2, 5) = 0.88 → C2
  • d(C1, 6) = 5.00, d(C2, 6) = 3.30 → C2
  • C1 = {1, 2, 3}, C2 = {4, 5, 6}

22
Example cont.
  • Computing the new cluster centers:
  • For C1: XC1 = (1.0 + 1.0 + 2.0)/3 = 1.33, YC1 = (1.5 + 4.5 + 1.5)/3 = 2.50
  • For C2: XC2 = (2.0 + 3.0 + 5.0)/3 = 3.33, YC2 = (3.5 + 2.5 + 6.0)/3 = 4.00
  • Thus the new cluster centers are C1 = (1.33, 2.50) and C2 = (3.33, 4.00)
  • As the cluster centers have changed, the algorithm performs another iteration

23
Exercise
  • Perform the third iteration
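The first two iterations can be replayed, and the third produced, with a short NumPy sketch; it assumes instance 6 is (5.0, 6.0), the value the hand computations above use:

    import numpy as np

    X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                  [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
    centers = X[[0, 2]]                      # objects 1 and 3 as initial centers
    for it in range(1, 4):                   # three iterations
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # 0 -> C1, 1 -> C2
        centers = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        print(f"iteration {it}: assignments {labels + 1}, centers\n{centers}")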

24
Comments
  • Each choice of initial cluster centers may end up with a different final cluster configuration
  • Finds a local optimum, but not necessarily the global optimum
  • Based on the sum of squared error (SSE): the differences between objects and their cluster centers
  • Choose a terminating criterion, such as a maximum acceptable SSE
  • Execute the k-means algorithm until the condition is satisfied

25
Comments on the K-Means Method
  • Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t ≪ n
  • Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
  • Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

26
Weaknesses of K-Means Algorithm
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • run the algorithm with different k values
  • Unable to handle noisy data and outliers
  • Not suitable to discover clusters with non-convex shapes
  • Works best when clusters are of approximately equal size

27
Presence of Outliers
[Figure: points with outliers clustered with k = 2 and with k = 3. When k = 2 the two natural clusters are not captured.]
28
Quality of clustering depends on unit of measure
[Figure: the same income/age data plotted twice, once with income measured in TL and once in YTL, age in years in both. The clusters found change with the unit of measure. So what to do?]
29
Variations of the K-Means Method
  • A few variants of k-means differ in:
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes (Huang, 1998)
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency-based method to update modes of clusters
  • A mixture of categorical and numerical data: the k-prototype method

30
Exercise
  • Show by designing simple examples
  • a) K-means algorithm may converge to different
    local optima starting from different initial
    assignments of objects into different clusters
  • b) In the case of clusters of unequal size,
    K-means algorithm may fail to catch the obvious
    (natural) solution

32
How to Choose k
  • For reasonable values of k, e.g., from 2 to 15:
  • plot k versus SSE (sum of squared error)
  • visually inspect the plot; as k increases, SSE falls
  • choose the breaking point ("elbow") of the curve (a sketch follows below)
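As a sketch, the inspection can be automated with scikit-learn and matplotlib (the library choice and the k range are assumptions; any k-means implementation that reports SSE works):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def elbow_plot(X, k_range=range(2, 16)):
        # inertia_ is scikit-learn's name for the SSE to the cluster centers
        sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in k_range]
        plt.plot(list(k_range), sse, marker="o")
        plt.xlabel("k")
        plt.ylabel("SSE")
        plt.show()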

33
[Figure: plot of SSE versus k (k = 2, 4, 6, 8, 10, 12); SSE falls as k increases, and the break point of the curve suggests the choice of k.]
34
Validation of Clustering
  • Partition the data into two equal groups
  • Apply clustering to one of these partitions
  • Compare the cluster centers with those of the overall data
  • Or:
  • apply clustering to each of the groups
  • compare the cluster centers

35
Data Structures
  • Data matrix (two modes): n objects described by p variables, an n × p table
  • Dissimilarity matrix (one mode): the n × n table of pairwise dissimilarities d(i, j); only the lower triangle is needed, since d(i, j) = d(j, i)

36
Properties of Dissimilarity Measures
  • Properties
  • d(i, j) ≥ 0 for i ≠ j
  • d(i, i) = 0
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
  • Exercise: can you find examples where the distances between objects do not obey the symmetry property?

37
Measure the Quality of Clustering
  • Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j)
  • There is a separate quality function that measures the goodness of a cluster
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables
  • Weights should be associated with different variables based on applications and data semantics
  • It is hard to define "similar enough" or "good enough"
  • the answer is typically highly subjective

38
Type of data in clustering analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

39
Classification by Scale
  • Nominal scale: merely distinguishes classes; with respect to A and B, XA = XB or XA ≠ XB
  • e.g. color: red, blue, green, …
  • gender: male, female
  • occupation: engineering, management, …
  • Ordinal scale: indicates an ordering of objects in addition to distinguishing them
  • XA = XB or XA ≠ XB; XA > XB or XA < XB
  • e.g. education: no school < primary sch. < high sch. < undergrad < grad
  • age: young < middle < old
  • income: low < medium < high

40
  • Interval scale: assigns a meaningful measure of the difference between two objects
  • Not only XA > XB, but XA is XA - XB units different from XB
  • e.g. specific gravity
  • temperature in °C or °F
  • The boiling point of water is 100 °C above its melting point, or 180 °F
  • Ratio scale: an interval scale with a meaningful zero point
  • XA > XB, and XA is XA/XB times greater than XB
  • e.g. height, weight, age (as an integer)
  • temperature in K or °R
  • Water boils at 373 K and melts at 273 K
  • The boiling point of water is 1.37 times hotter than the melting point (373/273 ≈ 1.37)

41
Comparison of Scales
  • The strongest scale is ratio; the weakest is nominal
  • Ahmet's height is 2.00 meters: HA
  • Mehmet's height is 1.50 meters: HM
  • HA ≠ HM (nominal): their heights are different
  • HA > HM (ordinal): Ahmet is taller than Mehmet
  • HA - HM = 0.50 meters (interval): Ahmet is 50 cm taller than Mehmet
  • HA / HM = 1.333 (ratio scale): holds no matter whether height is measured in meters or inches

42
Interval-valued variables
  • Standardize data
  • Calculate the mean absolute deviation: sf = (1/n)(|x1f - mf| + |x2f - mf| + … + |xnf - mf|), where mf = (1/n)(x1f + x2f + … + xnf)
  • Calculate the standardized measurement (z-score): zif = (xif - mf) / sf
  • Using the mean absolute deviation is more robust than using the standard deviation (a sketch follows below)
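A small NumPy sketch of this standardization (the function name is an illustrative assumption):

    import numpy as np

    def standardize_mad(X):
        """z-scores using the mean absolute deviation of each column f."""
        m = X.mean(axis=0)               # m_f
        s = np.abs(X - m).mean(axis=0)   # mean absolute deviation s_f
        return (X - m) / s               # z_if = (x_if - m_f) / s_f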

43
[Figure: two scatter plots, case I and case II, after z-score standardization. Both have zero mean, and σX1 and σX2 are unity in both cases; but ρX1,X2 = 0 in case I, whereas ρX1,X2 ≈ 1 in case II. Shall we use the same distance measure in both cases after obtaining the z-scores?]
44
Exercise
[Figure: the same two cases, with a point A marked in each. Suppose d(A, O) = 0.5 from the origin O in both case I and case II. Does it reflect the distance between A and the origin? Suggest a transformation so as to handle correlation between variables.]
45
Other Standardizations
  • Min-max: scale between 0 and 1, or -1 and 1
  • Decimal scaling
  • For ratio-scaled variables:
  • Mean transformation: zi,f = xi,f / mean_f
  • measure in terms of the mean of variable f
  • Log transformation: zi,f = log xi,f

46
Similarity and Dissimilarity Between Objects
  • Distances are normally used to measure the similarity or dissimilarity between two data objects
  • A popular one is the Minkowski distance:
  • d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
  • where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q ≥ 1
  • If q = 1, d is the Manhattan distance (a sketch follows below)
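A one-function sketch of the Minkowski distance (the function name is an assumption); q = 1 gives Manhattan, q = 2 Euclidean:

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional points, q >= 1."""
        return float((np.abs(np.asarray(x) - np.asarray(y)) ** q).sum() ** (1.0 / q))

    # e.g. minkowski((1.0, 1.5), (2.0, 1.5), q=1) == 1.0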

47
Similarity and Dissimilarity Between Objects
(Cont.)
  • If q = 2, d is the Euclidean distance:
  • d(i, j) = (|xi1 - xj1|² + |xi2 - xj2|² + … + |xip - xjp|²)^(1/2)
  • Properties
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
  • Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures

48
Similarity and Dissimilarity Between Objects
(Cont.)
  • Weights can be assigned to variables:
  • d(i, j) = (w1 |xi1 - xj1|^q + w2 |xi2 - xj2|^q + … + wp |xip - xjp|^q)^(1/q)
  • where wi, i = 1..p, are weights expressing the importance of each variable

49
[Figure: two points XA and XB in the plane; the Manhattan distance follows the axis-parallel path between them, the Euclidean distance the straight line.]
50
Binary Variables
  • Symmetric vs. asymmetric binary variables
  • Symmetric: both of its states are equally valuable and carry the same weight
  • e.g. gender: male or female, arbitrarily coded as 0 or 1
  • Asymmetric: the outcomes are not equally important; encoded by 0 and 1
  • e.g. patient is a smoker or not: 1 for smoker, 0 for nonsmoker (asymmetric)
  • positive and negative outcomes of a disease test: HIV positive coded 1, HIV negative coded 0

51
Binary Variables
  • A contingency table for binary data, counting agreements and disagreements over the p variables of objects i and j:

                     object j
                      1    0
        object i  1   q    r
                  0   s    t

  • Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (r + s) / (q + r + s + t)
  • Jaccard coefficient (noninvariant, if the binary variable is asymmetric): d(i, j) = (r + s) / (q + r + s) (a sketch follows below)
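A hedged sketch of both coefficients for 0/1 vectors (the function names are assumptions):

    import numpy as np

    def simple_matching_dissim(x, y):
        """(r + s) / (q + r + s + t): fraction of variables that disagree."""
        x, y = np.asarray(x), np.asarray(y)
        return (x != y).mean()

    def jaccard_dissim(x, y):
        """(r + s) / (q + r + s): ignores the 0-0 matches t (asymmetric case)."""
        x, y = np.asarray(x), np.asarray(y)
        q = ((x == 1) & (y == 1)).sum()   # 1-1 matches
        rs = (x != y).sum()               # disagreements r + s
        return rs / (q + rs)              # assumes q + rs > 0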
52
Dissimilarity between Binary Variables
  • Example
  • gender is a symmetric attribute
  • the remaining attributes are asymmetric binary
  • let the values Y and P be set to 1, and the value
    N be set to 0

53
Nominal Variables
  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
  • Method 1: simple matching
  • d(i, j) = (p - m) / p, where m = number of matches and p = total number of variables
  • Higher weights can be assigned to variables with a large number of states
  • Method 2: use a large number of binary variables
  • create a new binary variable for each of the M nominal states

54
Example
  • 2 nominal variables for students: faculty and country
  • Faculty: Eng., Applied Sc., Pure Sc., Admin., … (5 distinct values)
  • Country: Turkey, USA, … (10 distinct values)
  • p = 2, just two variables
  • The weight of country may be increased
  • Student A = (Eng, Turkey), B = (Applied Sc, Turkey)
  • m = 1: in one variable, A and B are similar
  • d(A, B) = (2 - 1)/2 = 1/2

55
Example cont.
  • A different binary variable for each faculty:
  • Eng: 1 if the student is in engineering, 0 otherwise
  • AppSc: 1 if the student is in applied sciences, 0 otherwise
  • A different binary variable for each country:
  • Turkey: 1 if the student is Turkish, 0 otherwise
  • USA: 1 if the student is from the USA, 0 otherwise

57
Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled:
  • replace xif by its rank rif ∈ {1, …, Mf}
  • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by zif = (rif - 1) / (Mf - 1)
  • compute the dissimilarity using methods for interval-scaled variables

58
Example
  • Credit card type: gold > silver > bronze > normal (4 states)
  • Education: grad > undergrad > high school > primary school > no school (5 states)
  • Two customers:
  • A = (gold, high school)
  • B = (normal, no school)
  • rA,card = 1, rB,card = 4
  • rA,edu = 3, rB,edu = 5
  • zA,card = (1 - 1)/(4 - 1) = 0
  • zB,card = (4 - 1)/(4 - 1) = 1
  • zA,edu = (3 - 1)/(5 - 1) = 0.5
  • zB,edu = (5 - 1)/(5 - 1) = 1
  • Use any interval-scale distance measure on the z values (checked below)
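Continuing the example, the Euclidean distance between A and B on the z values can be checked in a few lines:

    import math
    zA, zB = (0.0, 0.5), (1.0, 1.0)   # (card, edu) z-scores from above
    print(math.dist(zA, zB))          # sqrt(1 + 0.25) ≈ 1.118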

59
Exercise
  • Find an attribute having both ordinal and nominal characteristics
  • Define a similarity or dissimilarity measure for two objects A and B

60
Ratio-Scaled Variables
  • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
  • Methods:
  • treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
  • apply a logarithmic transformation: yif = log(xif)
  • treat them as continuous ordinal data and treat their rank as interval-scaled

61
Example
  • Cluster individuals based on age, weight, and height
  • All are ratio-scale variables
  • Mean transformation: zp,i = xp,i / mean_p
  • As an absolute zero makes sense, measure distance in units of the mean of each variable
  • Then you may apply a log transformation: z' = log(z)
  • Then use any distance measure for interval scales

62
Example cont.
  • A weight difference of 0.5 kg is much more important for babies than for adults
  • d(3 kg, 3.5 kg) = 0.5 and d(71.0 kg, 71.5 kg) = 0.5: the same absolute difference
  • But as a percentage, (3.5 - 3)/3 ≈ 17% is very significant, approximately log(3.5) - log(3)
  • whereas (71.5 - 71.0)/71.0 ≈ 0.7% is not important: log(71.5) - log(71) is almost zero

63
Examples from Sports
  • 48 48
  • 51 52
  • 54 56
  • 57 62
  • 60 68
  • 63.5 74
  • 67 82
  • 71 90
  • 75 100
  • 81 130

64
Basic Measures for Clustering
  • Clustering: given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k, such that for all tjp, tjq ∈ Kj and ts ∉ Kj, dis(tjp, tjq) ≤ dis(tjp, ts)
  • Centroid, radius, diameter
  • Typical alternatives to calculate the distance between clusters:
  • single link, complete link, average, centroid, medoid

65
Centroid, Radius and Diameter of a Cluster (for
numerical data sets)
  • Centroid: the "middle" of a cluster
  • Radius: square root of the average distance from any point of the cluster to its centroid
  • Diameter: square root of the average mean squared distance between all pairs of points in the cluster (formulas sketched below)
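The formulas behind these definitions were images in the original deck; in one standard form (an assumption: a cluster of N points t_i with centroid C_m) they read:

    C_m = \frac{\sum_{i=1}^{N} t_i}{N}, \qquad
    R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - C_m)^2}{N}}, \qquad
    D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}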

66
Typical Alternatives to Calculate the Distance
between Clusters
  • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)
  • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)
  • Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)
  • Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
  • Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
  • Medoid: one chosen, centrally located object in the cluster

67
Major Clustering Approaches
  • Partitioning algorithms: construct various partitions and then evaluate them by some criterion
  • Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
  • Density-based: based on connectivity and density functions
  • Grid-based: based on a multiple-level granularity structure
  • Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the model

68
The K-Medoids Clustering Method
  • Find representative objects, called medoids, in
    clusters
  • PAM (Partitioning Around Medoids, 1987)
  • starts from an initial set of medoids and
    iteratively replaces one of the medoids by one of
    the non-medoids if it improves the total distance
    of the resulting clustering
  • PAM works effectively for small data sets, but
    does not scale well for large data sets
  • CLARA (Kaufmann & Rousseeuw, 1990)
  • CLARANS (Ng & Han, 1994): randomized sampling
  • Focusing + spatial data structure (Ester et al., 1995)

69
CLARA (Clustering Large Applications) (1990)
  • CLARA (Kaufmann and Rousseeuw in 1990)
  • Built in statistical analysis packages, such as
    S
  • It draws multiple samples of the data set,
    applies PAM on each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weaknesses:
  • Efficiency depends on the sample size
  • A good clustering based on samples will not
    necessarily represent a good clustering of the
    whole data set if the sample is biased

70
CLARANS (Randomized CLARA) (1994)
  • CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
  • CLARANS draws a sample of neighbors dynamically
  • The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
  • If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
  • It is more efficient and scalable than both PAM and CLARA
  • Focusing techniques and spatial access structures may further improve its performance (Ester et al., 1995)

71
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

72
AGNES (Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Use the Single-Link method and the dissimilarity
    matrix.
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

73
A Dendrogram Shows How the Clusters are Merged
Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
74
Example
  • Dissimilarity matrix (lower triangle; a SciPy sketch follows below):
  •    A  B  C  D  E
  • A  0
  • B  1  0
  • C  2  2  0
  • D  2  4  1  0
  • E  3  3  5  3  0
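The next three slides trace single, complete, and average link on this matrix by hand; a hedged SciPy sketch reproduces the merge sequences (the condensed vector lists the pairs in the order AB, AC, AD, AE, BC, BD, BE, CD, CE, DE):

    from scipy.cluster.hierarchy import linkage

    d = [1, 2, 2, 3, 2, 4, 3, 1, 5, 3]   # condensed distances for A..E
    for method in ("single", "complete", "average"):
        # each output row: the two merged cluster ids, merge distance, new size
        print(method, linkage(d, method=method), sep="\n")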

75
Single Link Distance Measure
[Figure: step-by-step single-link merging on the matrix above. A and B merge at distance 1, and C and D merge at distance 1; the smallest single-link distance between {A, B} and {C, D} is 2, so they merge next; finally E joins at distance 3, the minimum of d(E, A) = 3, d(E, B) = 3, d(E, C) = 5, d(E, D) = 3.]
76
Complete Link Distance Measure
[Figure: step-by-step complete-link merging on the same matrix. A and B merge at distance 1, and C and D merge at distance 1; the complete-link distance between {A, B} and {C, D} is max(2, 2, 2, 4) = 4, while d({A, B}, E) = max(3, 3) = 3, so {A, B} merges with E at distance 3; finally {A, B, E} and {C, D} merge at distance 5.]
77
Average Link Distance Measure
[Figure: step-by-step average-link merging on the same matrix. A and B merge at distance 1, and C and D merge at distance 1; the average link between {A, B} and {C, D} is (2 + 2 + 2 + 4)/4 = 2.5, so they merge next; finally the average link between {A, B, C, D} and E is (3 + 3 + 5 + 3)/4 = 3.5, so E joins last.]
78
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

79
More on Hierarchical Clustering Methods
  • Major weakness of agglomerative clustering
    methods
  • do not scale well: time complexity of at least O(n²), where n is the number of total objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH (1996) uses CF-tree and incrementally
    adjusts the quality of sub-clusters
  • CURE (1998) selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction
  • CHAMELEON (1999) hierarchical clustering using
    dynamic modeling

80
Probability Based Clustering
  • A mixture is a set of k probability distributions, where each distribution represents a cluster
  • The mixture model assigns each object a probability of being a member of each cluster, e.g., P(C1|X)
  • rather than assigning the object to a cluster or not, as k-means and k-medoids do

81
Simplest Case
  • k = 2: two clusters
  • There is one real-valued attribute X
  • Each cluster is normally distributed
  • Five parameters to determine:
  • mean and standard deviation for cluster A: μA, σA
  • mean and standard deviation for cluster B: μB, σB
  • sampling probability for cluster A: pA = P(A)
  • since P(A) + P(B) = 1, this fixes P(B) as well

82
  • There are two populations, each normally distributed with its own mean and standard deviation
  • An object comes from one of them, but which one is unknown
  • Assign a probability that the object comes from A or from B
  • That is equivalent to assigning a cluster to each object

83
  • In parametric statistics:
  • Assume that objects come from a distribution, usually a normal distribution with unknown parameters μ and σ
  • Estimate the parameters from data
  • Test hypotheses based on the normality assumption
  • Test the normality assumption

84
[Figure: two normal distributions forming a bimodal density; arbitrary shapes can be captured by a mixture of normals.]
85
  • Variable: age
  • Clusters: C1, C2, C3
  • P(C1 | age = 25)
  • P(C2 | age = 35)
  • P(C3 | age = 65)

86
Returning to Mixtures
  • Given the five parameters, finding the probability that a given object comes from each distribution (i.e., belongs to each cluster) is easy
  • Given an object X, the probability that it belongs to cluster A:
  • P(A|X) = P(A, X) / P(X) = P(X|A) P(A) / P(X) = f(X; μA, σA) P(A) / P(X)
  • and likewise for cluster B:
  • P(B|X) = P(B, X) / P(X) = P(X|B) P(B) / P(X) = f(X; μB, σB) P(B) / P(X)

87
  • f(X; μA, σA) is the normal density function for cluster A

The denominator P(X) will disappear: calculate the numerators for both P(A|X) and P(B|X), and normalize them by dividing by their sum
88
  • P(A|X) = P(X|A) P(A) / (P(X|A) P(A) + P(X|B) P(B))
  • P(B|X) = P(X|B) P(B) / (P(X|A) P(A) + P(X|B) P(B))
  • Note that the final outcomes P(A|X) and P(B|X) give the probabilities that the given object X belongs to cluster A and to cluster B
  • Note that these can be calculated only by knowing the prior probabilities P(A) and P(B)

89
EM Algorithm for the Simple Case
  • We know neither of these things:
  • the distribution each object came from, P(X|A) and P(X|B),
  • nor the five parameters of the mixture model: μA, μB, σA, σB, P(A)
  • Adapt the procedure used for the k-means algorithm

90
EM Algorithm for the Simple Case
  • Initially, guess values for the five parameters: μA, μB, σA, σB, P(A)
  • Until a specified termination criterion is met:
  • compute P(A|Xi) and P(B|Xi) for each Xi, i = 1..n, from the normal densities with the current parameters
  • these class probabilities for each object are the expectation (E) step
  • use these probabilities to re-estimate the parameters μA, μB, σA, σB, P(A)
  • this maximization of the likelihood of the distributions is the (M) step
  • Stop when the likelihood converges or no longer improves much (a sketch follows below)
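A compact NumPy/SciPy sketch of this loop for the one-attribute, two-cluster case (the initialization and the fixed iteration count are illustrative assumptions; a real run would monitor the log-likelihood):

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, iters=100):
        """EM for a two-component 1-D Gaussian mixture."""
        muA, muB = x.min(), x.max()   # crude initial guesses
        sdA = sdB = x.std()
        pA = 0.5
        for _ in range(iters):
            # E step: w_iA = P(A|x_i), w_iB = P(B|x_i)
            a = pA * norm.pdf(x, muA, sdA)
            b = (1 - pA) * norm.pdf(x, muB, sdB)
            wA = a / (a + b)
            wB = 1 - wA
            # M step: weighted re-estimates of the five parameters
            muA, muB = np.average(x, weights=wA), np.average(x, weights=wB)
            sdA = np.sqrt(np.average((x - muA) ** 2, weights=wA))
            sdB = np.sqrt(np.average((x - muB) ** 2, weights=wB))
            pA = wA.mean()
        return muA, muB, sdA, sdB, pA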

91
Estimation of new means
  • If wiA is the probability that object i belongs to cluster A, wiA = P(A|Xi), calculated in the E step, then
  • μA = (w1A x1 + w2A x2 + … + wnA xn) / (w1A + w2A + … + wnA)
  • Likewise, with wiB = P(B|Xi) calculated in the E step,
  • μB = (w1B x1 + w2B x2 + … + wnB xn) / (w1B + w2B + … + wnB)

92
Estimation of new standard deviations
  • σA² = (w1A (x1 - μA)² + w2A (x2 - μA)² + … + wnA (xn - μA)²) / (w1A + w2A + … + wnA)
  • σB² = (w1B (x1 - μB)² + w2B (x2 - μB)² + … + wnB (xn - μB)²) / (w1B + w2B + … + wnB)
  • The xi range over all the objects, not just those belonging to A or B
  • These are maximum likelihood estimators for the variance
  • If the weights were all equal, the denominator would sum to n rather than n - 1

93
Estimation of new priori probabilities
  • P(A) = (P(A|X1) + P(A|X2) + … + P(A|Xn)) / n
  • i.e., pA = (w1A + w2A + … + wnA) / n
  • P(B) = (P(B|X1) + P(B|X2) + … + P(B|Xn)) / n
  • i.e., pB = (w1B + w2B + … + wnB) / n
  • Or simply P(B) = 1 - P(A)

94
General Considerations
  • EM converges to a local maximum of the likelihood score, which may not be the global one
  • EM may be run with different initial values of the parameters, and the clustering structure with the highest likelihood is chosen
  • Or an initial k-means run is used to obtain the mean and standard deviation parameters

95
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

96
Density Concepts
  • Core object (CO): an object with at least MinPts objects within its Eps-neighborhood
  • Directly density-reachable (DDR): x is a CO and y is in x's Eps-neighborhood
  • Density-reachable: there exists a chain of DDR objects from x to y
  • Density-based cluster: a set of density-connected objects that is maximal w.r.t. reachability

97
Density-Based Clustering Background
  • Two parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an Eps-neighbourhood of that point
  • NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
  • Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • 1) p belongs to NEps(q), and
  • 2) the core point condition holds: |NEps(q)| ≥ MinPts

98
Density-Based Clustering Background (II)
  • Density-reachable:
  • a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
  • Density-connected:
  • a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: a chain q = p1 → … → p illustrating density-reachability, and two points density-connected through a common point o.]
99
Let MinPts = 3
[Figure: a set of points with Eps-neighborhoods drawn around M, P, and Q.]
M and P are core objects, since each lies in an Eps-neighborhood containing at least 3 points. Q is directly density-reachable from M; M is directly density-reachable from P, and vice versa. Q is (indirectly) density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.
100
DBSCAN Density Based Spatial Clustering of
Applications with Noise
  • Relies on a density-based notion of cluster A
    cluster is defined as a maximal set of
    density-connected points
  • Discovers clusters of arbitrary shape in spatial
    databases with noise

101
DBSCAN The Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed (a usage sketch follows below)
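A hedged usage sketch with scikit-learn's DBSCAN (the data and parameter values are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(200, 2)                       # toy 2-D data
    labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(X)
    # labels[i] is the cluster id of point i; -1 marks noise points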