Data Warehousing and Data Mining: Chapter 4
1
Data Warehousing and Data Mining Chapter 4
  • BIS 541
  • Summer 2008

2
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Clustering is unsupervised classification: no predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data distribution
  • As a preprocessing step for other algorithms
  • Measuring the performance of supervised learning algorithms

3
Basic Measures for Clustering
  • Clustering: given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k

4
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to one another within the same cluster
  • Dissimilar to the objects in other clusters
  • Cluster analysis
  • Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters
  • Clustering is unsupervised learning: no predefined classes
  • Typical applications
  • As a stand-alone tool to get insight into data
    distribution
  • As a preprocessing step for other algorithms

5
General Applications of Clustering
  • Pattern Recognition
  • Spatial Data Analysis
  • create thematic maps in GIS by clustering feature
    spaces
  • detect spatial clusters and explain them in
    spatial data mining
  • Image Processing
  • Economic Science (especially market research)
  • WWW
  • Document classification
  • Cluster Weblog data to discover groups of similar
    access patterns

6
Examples of Clustering Applications
  • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • Land use: identification of areas of similar land use in an earth observation database
  • Insurance: identifying groups of motor insurance policy holders with a high average claim cost
  • City-planning: identifying groups of houses according to their house type, value, and geographical location
  • Earthquake studies: observed earthquake epicenters should be clustered along continent faults

7
Constraint-Based Clustering Analysis
  • Clustering analysis with fewer parameters but more user-desired constraints, e.g., an ATM allocation problem

8
Clustering Cities
  • Clustering Turkish cities
  • Based on political, demographic, and economic characteristics:
  • Political: general elections 1999, 2002
  • Demographic: population, urbanization rates
  • Economic: GNP per head, growth rate of GNP

9
What Is Good Clustering?
  • A good clustering method will produce high
    quality clusters with
  • high intra-class similarity
  • low inter-class similarity
  • The quality of a clustering result depends on
    both the similarity measure used by the method
    and its implementation.
  • The quality of a clustering method is also
    measured by its ability to discover some or all
    of the hidden patterns.

10
Requirements of Clustering in Data Mining
  • Scalability
  • Ability to deal with different types of
    attributes
  • Ability to handle dynamic data
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

11
Chapter 5. Cluster Analysis
  • What is Cluster Analysis?
  • Types of Data in Cluster Analysis
  • A Categorization of Major Clustering Methods
  • Partitioning Methods
  • Hierarchical Methods
  • Density-Based Methods
  • Grid-Based Methods
  • Model-Based Clustering Methods
  • Outlier Analysis
  • Summary

12
Partitioning Algorithms: Basic Concept
  • Partitioning method: construct a partition of a database D of n objects into a set of k clusters, s.t. the sum of squared distances is minimized
  • Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
  • Global optimum: exhaustively enumerate all partitions
  • Heuristic methods: k-means and k-medoids algorithms
  • k-means (MacQueen, 1967): each cluster is represented by the center of the cluster
  • k-medoids or PAM (Partition Around Medoids) (Kaufman & Rousseeuw, 1987): each cluster is represented by one of the objects in the cluster

13
The K-Means Clustering Algorithm
  • Choose k, the number of clusters to be determined
  • Choose k objects randomly as the initial cluster centers
  • Repeat
  • Assign each object to its closest cluster center
  • using Euclidean distance
  • Compute new cluster centers
  • calculate mean points
  • Until
  • no change in cluster centers, or
  • no object changes its cluster (a sketch follows below)
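A minimal NumPy sketch of this loop (the function name and random-initialization scheme are illustrative assumptions, and it does not guard against empty clusters):

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Plain k-means: random initial centers, Euclidean assignment,
        mean update, stop when the centers no longer move."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # assign each object to its closest cluster center
            d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # recompute each center as the mean of its assigned objects
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers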

14
The K-Means Clustering Method
  • Example

[Figure: the classic k-means illustration on a 10 x 10 grid. Arbitrarily choose k objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; repeat until stable.]
15
Example (TBP Sec 3.3, page 84, Table 3.6)
  • Instance X Y
  • 1 1.0 1.5
  • 2 1.0 4.5
  • 3 2.0 1.5
  • 4 2.0 3.5
  • 5 3.0 2.5
  • 6 5.0 6.0
  • k is chosen as 2 (k = 2)
  • Choose two points at random as the initial cluster centers
  • Objects 1 and 3 are chosen as the cluster centers

16
[Figure: scatter plot of the six instances, with objects 1 and 3 marked as the initial cluster centers.]
17
Example cont.
  • Euclidean distance between points i and j:
  • d(i, j) = ((Xi - Xj)² + (Yi - Yj)²)^(1/2)
  • Initial cluster centers:
  • C1 = (1.0, 1.5), C2 = (2.0, 1.5)
  • d(C1, 1) = 0.00, d(C2, 1) = 1.00 → C1
  • d(C1, 2) = 3.00, d(C2, 2) = 3.16 → C1
  • d(C1, 3) = 1.00, d(C2, 3) = 0.00 → C2
  • d(C1, 4) = 2.24, d(C2, 4) = 2.00 → C2
  • d(C1, 5) = 2.24, d(C2, 5) = 1.41 → C2
  • d(C1, 6) = 6.02, d(C2, 6) = 5.41 → C2
  • C1 = {1, 2}, C2 = {3, 4, 5, 6}

18
[Figure: the six instances after the first assignment; C1 = {1, 2}, C2 = {3, 4, 5, 6}.]
19
Example cont.
  • Recomputing the cluster centers:
  • For C1: XC1 = (1.0 + 1.0)/2 = 1.0, YC1 = (1.5 + 4.5)/2 = 3.0
  • For C2: XC2 = (2.0 + 2.0 + 3.0 + 5.0)/4 = 3.0, YC2 = (1.5 + 3.5 + 2.5 + 6.0)/4 = 3.375
  • Thus the new cluster centers are C1 = (1.0, 3.0) and C2 = (3.0, 3.375)
  • As the cluster centers have changed, the algorithm performs another iteration

20
[Figure: the six instances with the updated centers C1 = (1.0, 3.0) and C2 = (3.0, 3.375).]
21
Example cont.
  • New cluster centers: C1 = (1.0, 3.0), C2 = (3.0, 3.375)
  • d(C1, 1) = 1.50, d(C2, 1) = 2.74 → C1
  • d(C1, 2) = 1.50, d(C2, 2) = 2.29 → C1
  • d(C1, 3) = 1.80, d(C2, 3) = 2.13 → C1
  • d(C1, 4) = 1.12, d(C2, 4) = 1.01 → C2
  • d(C1, 5) = 2.06, d(C2, 5) = 0.88 → C2
  • d(C1, 6) = 5.00, d(C2, 6) = 3.30 → C2
  • C1 = {1, 2, 3}, C2 = {4, 5, 6}

22
Example cont.
  • Computing the new cluster centers:
  • For C1: XC1 = (1.0 + 1.0 + 2.0)/3 = 1.33, YC1 = (1.5 + 4.5 + 1.5)/3 = 2.50
  • For C2: XC2 = (2.0 + 3.0 + 5.0)/3 = 3.33, YC2 = (3.5 + 2.5 + 6.0)/3 = 4.00
  • Thus the new cluster centers are C1 = (1.33, 2.50) and C2 = (3.33, 4.00)
  • As the cluster centers have changed, the algorithm performs another iteration

23
Exercise
  • Perform the third iteration
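The first two iterations can be replayed, and the third produced, with a short NumPy sketch; it assumes instance 6 is (5.0, 6.0), the value the hand computations above use:

    import numpy as np

    X = np.array([[1.0, 1.5], [1.0, 4.5], [2.0, 1.5],
                  [2.0, 3.5], [3.0, 2.5], [5.0, 6.0]])
    centers = X[[0, 2]]                      # objects 1 and 3 as initial centers
    for it in range(1, 4):                   # three iterations
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # 0 -> C1, 1 -> C2
        centers = np.array([X[labels == j].mean(axis=0) for j in (0, 1)])
        print(f"iteration {it}: assignments {labels + 1}, centers\n{centers}")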

24
Comments
  • Each choice of initial cluster centers may end up with a different final cluster configuration
  • Finds a local optimum, but not necessarily the global optimum
  • Based on the sum of squared error (SSE): the differences between objects and their cluster centers
  • Choose a terminating criterion, such as a maximum acceptable SSE
  • Execute the k-means algorithm until the condition is satisfied

25
Comments on the K-Means Method
  • Strength: relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally, k, t ≪ n
  • Comparing: PAM: O(k(n-k)²), CLARA: O(ks² + k(n-k))
  • Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

26
Weaknesses of K-Means Algorithm
  • Applicable only when the mean is defined; what about categorical data?
  • Need to specify k, the number of clusters, in advance
  • run the algorithm with different k values
  • Unable to handle noisy data and outliers
  • Not suitable to discover clusters with non-convex shapes
  • Works best when clusters are of approximately equal size

27
Presence of Outliers
[Figure: points with outliers clustered with k = 2 and with k = 3. When k = 2 the two natural clusters are not captured.]
28
Quality of clustering depends on unit of measure
[Figure: the same income/age data plotted twice, once with income measured in TL and once in YTL, age in years in both. The clusters found change with the unit of measure. So what to do?]
29
Variations of the K-Means Method
  • A few variants of k-means differ in:
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes (Huang, 1998)
  • Replacing means of clusters with modes
  • Using new dissimilarity measures to deal with categorical objects
  • Using a frequency-based method to update modes of clusters
  • A mixture of categorical and numerical data: the k-prototype method

30
Exercise
  • Show by designing simple examples
  • a) K-means algorithm may converge to different
    local optima starting from different initial
    assignments of objects into different clusters
  • b) In the case of clusters of unequal size,
    K-means algorithm may fail to catch the obvious
    (natural) solution

32
How to Choose k
  • For reasonable values of k, e.g., from 2 to 15:
  • plot k versus SSE (sum of squared error)
  • visually inspect the plot; as k increases, SSE falls
  • choose the breaking point ("elbow") of the curve (a sketch follows below)
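As a sketch, the inspection can be automated with scikit-learn and matplotlib (the library choice and the k range are assumptions; any k-means implementation that reports SSE works):

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    def elbow_plot(X, k_range=range(2, 16)):
        # inertia_ is scikit-learn's name for the SSE to the cluster centers
        sse = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in k_range]
        plt.plot(list(k_range), sse, marker="o")
        plt.xlabel("k")
        plt.ylabel("SSE")
        plt.show()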

33
[Figure: plot of SSE versus k (k = 2, 4, 6, 8, 10, 12); SSE falls as k increases, and the break point of the curve suggests the choice of k.]
34
Validation of Clustering
  • Partition the data into two equal groups
  • Apply clustering to one of these partitions
  • Compare the cluster centers with those of the overall data
  • Or:
  • apply clustering to each of the groups
  • compare the cluster centers

35
Data Structures
  • Data matrix (two modes): n objects described by p variables, an n × p table
  • Dissimilarity matrix (one mode): the n × n table of pairwise dissimilarities d(i, j); only the lower triangle is needed, since d(i, j) = d(j, i)

36
Properties of Dissimilarity Measures
  • Properties
  • d(i, j) ≥ 0 for i ≠ j
  • d(i, i) = 0
  • d(i, j) = d(j, i) (symmetry)
  • d(i, j) ≤ d(i, k) + d(k, j) (triangle inequality)
  • Exercise: can you find examples where the distances between objects do not obey the symmetry property?

37
Measure the Quality of Clustering
  • Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, which is typically a metric d(i, j)
  • There is a separate quality function that measures the goodness of a cluster
  • The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal, and ratio variables
  • Weights should be associated with different variables based on applications and data semantics
  • It is hard to define "similar enough" or "good enough"
  • the answer is typically highly subjective

38
Type of data in clustering analysis
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

39
Classification by Scale
  • Nominal scale: merely distinguishes classes; with respect to A and B, XA = XB or XA ≠ XB
  • e.g. color: red, blue, green, …
  • gender: male, female
  • occupation: engineering, management, …
  • Ordinal scale: indicates an ordering of objects in addition to distinguishing them
  • XA = XB or XA ≠ XB; XA > XB or XA < XB
  • e.g. education: no school < primary sch. < high sch. < undergrad < grad
  • age: young < middle < old
  • income: low < medium < high

40
  • Interval scale: assigns a meaningful measure of the difference between two objects
  • Not only XA > XB, but XA is XA - XB units different from XB
  • e.g. specific gravity
  • temperature in °C or °F
  • The boiling point of water is 100 °C above its melting point, or 180 °F
  • Ratio scale: an interval scale with a meaningful zero point
  • XA > XB, and XA is XA/XB times greater than XB
  • e.g. height, weight, age (as an integer)
  • temperature in K or °R
  • Water boils at 373 K and melts at 273 K
  • The boiling point of water is 1.37 times hotter than the melting point (373/273 ≈ 1.37)

41
Comparison of Scales
  • The strongest scale is ratio; the weakest is nominal
  • Ahmet's height is 2.00 meters: HA
  • Mehmet's height is 1.50 meters: HM
  • HA ≠ HM (nominal): their heights are different
  • HA > HM (ordinal): Ahmet is taller than Mehmet
  • HA - HM = 0.50 meters (interval): Ahmet is 50 cm taller than Mehmet
  • HA / HM = 1.333 (ratio scale): holds no matter whether height is measured in meters or inches

42
Interval-valued variables
  • Standardize data
  • Calculate the mean absolute deviation: sf = (1/n)(|x1f - mf| + |x2f - mf| + … + |xnf - mf|), where mf = (1/n)(x1f + x2f + … + xnf)
  • Calculate the standardized measurement (z-score): zif = (xif - mf) / sf
  • Using the mean absolute deviation is more robust than using the standard deviation (a sketch follows below)
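A small NumPy sketch of this standardization (the function name is an illustrative assumption):

    import numpy as np

    def standardize_mad(X):
        """z-scores using the mean absolute deviation of each column f."""
        m = X.mean(axis=0)               # m_f
        s = np.abs(X - m).mean(axis=0)   # mean absolute deviation s_f
        return (X - m) / s               # z_if = (x_if - m_f) / s_f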

43
[Figure: two scatter plots, case I and case II, after z-score standardization. Both have zero mean, and σX1 and σX2 are unity in both cases; but ρX1,X2 = 0 in case I, whereas ρX1,X2 ≈ 1 in case II. Shall we use the same distance measure in both cases after obtaining the z-scores?]
44
Exercise
[Figure: the same two cases, with a point A marked in each. Suppose d(A, O) = 0.5 from the origin O in both case I and case II. Does it reflect the distance between A and the origin? Suggest a transformation so as to handle correlation between variables.]
45
Other Standardizations
  • Min-max: scale between 0 and 1, or -1 and 1
  • Decimal scaling
  • For ratio-scaled variables:
  • Mean transformation: zi,f = xi,f / mean_f
  • measure in terms of the mean of variable f
  • Log transformation: zi,f = log xi,f

46
Similarity and Dissimilarity Between Objects
  • Distances are normally used to measure the similarity or dissimilarity between two data objects
  • A popular one is the Minkowski distance:
  • d(i, j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + … + |xip - xjp|^q)^(1/q)
  • where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q ≥ 1
  • If q = 1, d is the Manhattan distance (a sketch follows below)
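A one-function sketch of the Minkowski distance (the function name is an assumption); q = 1 gives Manhattan, q = 2 Euclidean:

    import numpy as np

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional points, q >= 1."""
        return float((np.abs(np.asarray(x) - np.asarray(y)) ** q).sum() ** (1.0 / q))

    # e.g. minkowski((1.0, 1.5), (2.0, 1.5), q=1) == 1.0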

47
Similarity and Dissimilarity Between Objects
(Cont.)
  • If q = 2, d is the Euclidean distance:
  • d(i, j) = (|xi1 - xj1|² + |xi2 - xj2|² + … + |xip - xjp|²)^(1/2)
  • Properties
  • d(i, j) ≥ 0
  • d(i, i) = 0
  • d(i, j) = d(j, i)
  • d(i, j) ≤ d(i, k) + d(k, j)
  • Also, one can use weighted distance, parametric Pearson product-moment correlation, or other dissimilarity measures

48
Similarity and Dissimilarity Between Objects
(Cont.)
  • Weights can be assigned to variables:
  • d(i, j) = (w1 |xi1 - xj1|^q + w2 |xi2 - xj2|^q + … + wp |xip - xjp|^q)^(1/q)
  • where wi, i = 1..p, are weights expressing the importance of each variable

49
[Figure: two points XA and XB in the plane; the Manhattan distance follows the axis-parallel path between them, the Euclidean distance the straight line.]
50
Binary Variables
  • Symmetric vs. asymmetric binary variables
  • Symmetric: both of its states are equally valuable and carry the same weight
  • e.g. gender: male or female, arbitrarily coded as 0 or 1
  • Asymmetric: the outcomes are not equally important; encoded by 0 and 1
  • e.g. patient is a smoker or not: 1 for smoker, 0 for nonsmoker (asymmetric)
  • positive and negative outcomes of a disease test: HIV positive coded 1, HIV negative coded 0

51
Binary Variables
  • A contingency table for binary data, counting agreements and disagreements over the p variables of objects i and j:

                     object j
                      1    0
        object i  1   q    r
                  0   s    t

  • Simple matching coefficient (invariant, if the binary variable is symmetric): d(i, j) = (r + s) / (q + r + s + t)
  • Jaccard coefficient (noninvariant, if the binary variable is asymmetric): d(i, j) = (r + s) / (q + r + s) (a sketch follows below)
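A hedged sketch of both coefficients for 0/1 vectors (the function names are assumptions):

    import numpy as np

    def simple_matching_dissim(x, y):
        """(r + s) / (q + r + s + t): fraction of variables that disagree."""
        x, y = np.asarray(x), np.asarray(y)
        return (x != y).mean()

    def jaccard_dissim(x, y):
        """(r + s) / (q + r + s): ignores the 0-0 matches t (asymmetric case)."""
        x, y = np.asarray(x), np.asarray(y)
        q = ((x == 1) & (y == 1)).sum()   # 1-1 matches
        rs = (x != y).sum()               # disagreements r + s
        return rs / (q + rs)              # assumes q + rs > 0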
52
Dissimilarity between Binary Variables
  • Example
  • gender is a symmetric attribute
  • the remaining attributes are asymmetric binary
  • let the values Y and P be set to 1, and the value
    N be set to 0

53
Nominal Variables
  • A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
  • Method 1: simple matching
  • d(i, j) = (p - m) / p, where m = number of matches and p = total number of variables
  • Higher weights can be assigned to variables with a large number of states
  • Method 2: use a large number of binary variables
  • create a new binary variable for each of the M nominal states

54
Example
  • 2 nominal variables for students: faculty and country
  • Faculty: Eng., Applied Sc., Pure Sc., Admin., … (5 distinct values)
  • Country: Turkey, USA, … (10 distinct values)
  • p = 2, just two variables
  • The weight of country may be increased
  • Student A = (Eng, Turkey), B = (Applied Sc, Turkey)
  • m = 1: in one variable, A and B are similar
  • d(A, B) = (2 - 1)/2 = 1/2

55
Example cont.
  • A different binary variable for each faculty:
  • Eng: 1 if the student is in engineering, 0 otherwise
  • AppSc: 1 if the student is in applied sciences, 0 otherwise
  • A different binary variable for each country:
  • Turkey: 1 if the student is Turkish, 0 otherwise
  • USA: 1 if the student is from the USA, 0 otherwise

57
Ordinal Variables
  • An ordinal variable can be discrete or continuous
  • Order is important, e.g., rank
  • Can be treated like interval-scaled:
  • replace xif by its rank rif ∈ {1, …, Mf}
  • map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by zif = (rif - 1) / (Mf - 1)
  • compute the dissimilarity using methods for interval-scaled variables

58
Example
  • Credit card type: gold > silver > bronze > normal (4 states)
  • Education: grad > undergrad > high school > primary school > no school (5 states)
  • Two customers:
  • A = (gold, high school)
  • B = (normal, no school)
  • rA,card = 1, rB,card = 4
  • rA,edu = 3, rB,edu = 5
  • zA,card = (1 - 1)/(4 - 1) = 0
  • zB,card = (4 - 1)/(4 - 1) = 1
  • zA,edu = (3 - 1)/(5 - 1) = 0.5
  • zB,edu = (5 - 1)/(5 - 1) = 1
  • Use any interval-scale distance measure on the z values (checked below)
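Continuing the example, the Euclidean distance between A and B on the z values can be checked in a few lines:

    import math
    zA, zB = (0.0, 0.5), (1.0, 1.0)   # (card, edu) z-scores from above
    print(math.dist(zA, zB))          # sqrt(1 + 0.25) ≈ 1.118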

59
Exercise
  • Find an attribute having both ordinal and nominal characteristics
  • Define a similarity or dissimilarity measure for two objects A and B

60
Ratio-Scaled Variables
  • Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt)
  • Methods:
  • treat them like interval-scaled variables: not a good choice! (why? the scale can be distorted)
  • apply a logarithmic transformation: yif = log(xif)
  • treat them as continuous ordinal data and treat their rank as interval-scaled

61
Example
  • Cluster individuals based on age, weight, and height
  • All are ratio-scale variables
  • Mean transformation: zp,i = xp,i / mean_p
  • As an absolute zero makes sense, measure distance in units of the mean of each variable
  • Then you may apply a log transformation: z' = log(z)
  • Then use any distance measure for interval scales

62
Example cont.
  • A weight difference of 0.5 kg is much more important for babies than for adults
  • d(3 kg, 3.5 kg) = 0.5 and d(71.0 kg, 71.5 kg) = 0.5: the same absolute difference
  • But as a percentage, (3.5 - 3)/3 ≈ 17% is very significant, approximately log(3.5) - log(3)
  • whereas (71.5 - 71.0)/71.0 ≈ 0.7% is not important: log(71.5) - log(71) is almost zero

63
Examples from Sports
  • 48 48
  • 51 52
  • 54 56
  • 57 62
  • 60 68
  • 63.5 74
  • 67 82
  • 71 90
  • 75 100
  • 81 130

64
Basic Measures for Clustering
  • Clustering: given a database D = {t1, t2, …, tn}, a distance measure dis(ti, tj) defined between any two objects ti and tj, and an integer value k, the clustering problem is to define a mapping f : D → {1, …, k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k, such that for all tjp, tjq ∈ Kj and ts ∉ Kj, dis(tjp, tjq) ≤ dis(tjp, ts)
  • Centroid, radius, diameter
  • Typical alternatives to calculate the distance between clusters:
  • single link, complete link, average, centroid, medoid

65
Centroid, Radius and Diameter of a Cluster (for
numerical data sets)
  • Centroid: the "middle" of a cluster
  • Radius: square root of the average distance from any point of the cluster to its centroid
  • Diameter: square root of the average mean squared distance between all pairs of points in the cluster (formulas sketched below)
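The formulas behind these definitions were images in the original deck; in one standard form (an assumption: a cluster of N points t_i with centroid C_m) they read:

    C_m = \frac{\sum_{i=1}^{N} t_i}{N}, \qquad
    R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_i - C_m)^2}{N}}, \qquad
    D_m = \sqrt{\frac{\sum_{i=1}^{N}\sum_{j=1}^{N} (t_i - t_j)^2}{N(N-1)}}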

66
Typical Alternatives to Calculate the Distance
between Clusters
  • Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = min dis(tip, tjq)
  • Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = max dis(tip, tjq)
  • Average: average distance between an element in one cluster and an element in the other, i.e., dis(Ki, Kj) = avg dis(tip, tjq)
  • Centroid: distance between the centroids of two clusters, i.e., dis(Ki, Kj) = dis(Ci, Cj)
  • Medoid: distance between the medoids of two clusters, i.e., dis(Ki, Kj) = dis(Mi, Mj)
  • Medoid: one chosen, centrally located object in the cluster

67
Major Clustering Approaches
  • Partitioning algorithms: construct various partitions and then evaluate them by some criterion
  • Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
  • Density-based: based on connectivity and density functions
  • Grid-based: based on a multiple-level granularity structure
  • Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the model

68
The K-Medoids Clustering Method
  • Find representative objects, called medoids, in
    clusters
  • PAM (Partitioning Around Medoids, 1987)
  • starts from an initial set of medoids and
    iteratively replaces one of the medoids by one of
    the non-medoids if it improves the total distance
    of the resulting clustering
  • PAM works effectively for small data sets, but
    does not scale well for large data sets
  • CLARA (Kaufmann & Rousseeuw, 1990)
  • CLARANS (Ng & Han, 1994): randomized sampling
  • Focusing + spatial data structure (Ester et al., 1995)

69
CLARA (Clustering Large Applications) (1990)
  • CLARA (Kaufmann and Rousseeuw in 1990)
  • Built in statistical analysis packages, such as
    S
  • It draws multiple samples of the data set,
    applies PAM on each sample, and gives the best
    clustering as the output
  • Strength: deals with larger data sets than PAM
  • Weaknesses:
  • Efficiency depends on the sample size
  • A good clustering based on samples will not
    necessarily represent a good clustering of the
    whole data set if the sample is biased

70
CLARANS (Randomized CLARA) (1994)
  • CLARANS (A Clustering Algorithm based on Randomized Search) (Ng and Han, 1994)
  • CLARANS draws a sample of neighbors dynamically
  • The clustering process can be presented as searching a graph where every node is a potential solution, that is, a set of k medoids
  • If a local optimum is found, CLARANS starts with a new randomly selected node in search of a new local optimum
  • It is more efficient and scalable than both PAM and CLARA
  • Focusing techniques and spatial access structures may further improve its performance (Ester et al., 1995)

71
Hierarchical Clustering
  • Use distance matrix as clustering criteria. This
    method does not require the number of clusters k
    as an input, but needs a termination condition

72
AGNES (Agglomerative Nesting)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Use the Single-Link method and the dissimilarity
    matrix.
  • Merge nodes that have the least dissimilarity
  • Go on in a non-descending fashion
  • Eventually all nodes belong to the same cluster

73
A Dendrogram Shows How the Clusters are Merged
Hierarchically
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram. A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
74
Example
  • Dissimilarity matrix (lower triangle; a SciPy sketch follows below):
  •    A  B  C  D  E
  • A  0
  • B  1  0
  • C  2  2  0
  • D  2  4  1  0
  • E  3  3  5  3  0
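The next three slides trace single, complete, and average link on this matrix by hand; a hedged SciPy sketch reproduces the merge sequences (the condensed vector lists the pairs in the order AB, AC, AD, AE, BC, BD, BE, CD, CE, DE):

    from scipy.cluster.hierarchy import linkage

    d = [1, 2, 2, 3, 2, 4, 3, 1, 5, 3]   # condensed distances for A..E
    for method in ("single", "complete", "average"):
        # each output row: the two merged cluster ids, merge distance, new size
        print(method, linkage(d, method=method), sep="\n")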

75
Single Link Distance Measure
[Figure: step-by-step single-link merging on the matrix above. A and B merge at distance 1, and C and D merge at distance 1; the smallest single-link distance between {A, B} and {C, D} is 2, so they merge next; finally E joins at distance 3, the minimum of d(E, A) = 3, d(E, B) = 3, d(E, C) = 5, d(E, D) = 3.]
76
Complete Link Distance Measure
[Figure: step-by-step complete-link merging on the same matrix. A and B merge at distance 1, and C and D merge at distance 1; the complete-link distance between {A, B} and {C, D} is max(2, 2, 2, 4) = 4, while d({A, B}, E) = max(3, 3) = 3, so {A, B} merges with E at distance 3; finally {A, B, E} and {C, D} merge at distance 5.]
77
Average Link Distance Measure
[Figure: step-by-step average-link merging on the same matrix. A and B merge at distance 1, and C and D merge at distance 1; the average link between {A, B} and {C, D} is (2 + 2 + 2 + 4)/4 = 2.5, so they merge next; finally the average link between {A, B, C, D} and E is (3 + 3 + 5 + 3)/4 = 3.5, so E joins last.]
78
DIANA (Divisive Analysis)
  • Introduced in Kaufmann and Rousseeuw (1990)
  • Implemented in statistical analysis packages,
    e.g., Splus
  • Inverse order of AGNES
  • Eventually each node forms a cluster on its own

79
More on Hierarchical Clustering Methods
  • Major weakness of agglomerative clustering
    methods
  • do not scale well: time complexity of at least O(n²), where n is the number of total objects
  • can never undo what was done previously
  • Integration of hierarchical with distance-based
    clustering
  • BIRCH (1996) uses CF-tree and incrementally
    adjusts the quality of sub-clusters
  • CURE (1998) selects well-scattered points from
    the cluster and then shrinks them towards the
    center of the cluster by a specified fraction
  • CHAMELEON (1999) hierarchical clustering using
    dynamic modeling

80
Probability Based Clustering
  • A mixture is a set of k probability distributions, where each distribution represents a cluster
  • The mixture model assigns each object a probability of being a member of each cluster, e.g., P(C1|X)
  • rather than assigning the object to a cluster or not, as k-means and k-medoids do

81
Simplest Case
  • k = 2: two clusters
  • There is one real-valued attribute X
  • Each cluster is normally distributed
  • Five parameters to determine:
  • mean and standard deviation for cluster A: μA, σA
  • mean and standard deviation for cluster B: μB, σB
  • sampling probability for cluster A: pA = P(A)
  • since P(A) + P(B) = 1, this fixes P(B) as well

82
  • There are two populations, each normally distributed with its own mean and standard deviation
  • An object comes from one of them, but which one is unknown
  • Assign a probability that the object comes from A or from B
  • That is equivalent to assigning a cluster to each object

83
  • In parametric statistics:
  • Assume that objects come from a distribution, usually a normal distribution with unknown parameters μ and σ
  • Estimate the parameters from data
  • Test hypotheses based on the normality assumption
  • Test the normality assumption

84
[Figure: two normal distributions forming a bimodal density; arbitrary shapes can be captured by a mixture of normals.]
85
  • Variable: age
  • Clusters: C1, C2, C3
  • P(C1 | age = 25)
  • P(C2 | age = 35)
  • P(C3 | age = 65)

86
Returning to Mixtures
  • Given the five parameters, finding the probability that a given object comes from each distribution (i.e., belongs to each cluster) is easy
  • Given an object X, the probability that it belongs to cluster A:
  • P(A|X) = P(A, X) / P(X) = P(X|A) P(A) / P(X) = f(X; μA, σA) P(A) / P(X)
  • and likewise for cluster B:
  • P(B|X) = P(B, X) / P(X) = P(X|B) P(B) / P(X) = f(X; μB, σB) P(B) / P(X)

87
  • f(X; μA, σA) is the normal density function for cluster A

The denominator P(X) will disappear: calculate the numerators for both P(A|X) and P(B|X), and normalize them by dividing by their sum
88
  • P(A|X) = P(X|A) P(A) / (P(X|A) P(A) + P(X|B) P(B))
  • P(B|X) = P(X|B) P(B) / (P(X|A) P(A) + P(X|B) P(B))
  • Note that the final outcomes P(A|X) and P(B|X) give the probabilities that the given object X belongs to cluster A and to cluster B
  • Note that these can be calculated only by knowing the prior probabilities P(A) and P(B)

89
EM Algorithm for the Simple Case
  • We know neither of these things:
  • the distribution each object came from, P(X|A) and P(X|B),
  • nor the five parameters of the mixture model: μA, μB, σA, σB, P(A)
  • Adapt the procedure used for the k-means algorithm

90
EM Algorithm for the Simple Case
  • Initially, guess values for the five parameters: μA, μB, σA, σB, P(A)
  • Until a specified termination criterion is met:
  • compute P(A|Xi) and P(B|Xi) for each Xi, i = 1..n, from the normal densities with the current parameters
  • these class probabilities for each object are the expectation (E) step
  • use these probabilities to re-estimate the parameters μA, μB, σA, σB, P(A)
  • this maximization of the likelihood of the distributions is the (M) step
  • Stop when the likelihood converges or no longer improves much (a sketch follows below)
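A compact NumPy/SciPy sketch of this loop for the one-attribute, two-cluster case (the initialization and the fixed iteration count are illustrative assumptions; a real run would monitor the log-likelihood):

    import numpy as np
    from scipy.stats import norm

    def em_two_gaussians(x, iters=100):
        """EM for a two-component 1-D Gaussian mixture."""
        muA, muB = x.min(), x.max()   # crude initial guesses
        sdA = sdB = x.std()
        pA = 0.5
        for _ in range(iters):
            # E step: w_iA = P(A|x_i), w_iB = P(B|x_i)
            a = pA * norm.pdf(x, muA, sdA)
            b = (1 - pA) * norm.pdf(x, muB, sdB)
            wA = a / (a + b)
            wB = 1 - wA
            # M step: weighted re-estimates of the five parameters
            muA, muB = np.average(x, weights=wA), np.average(x, weights=wB)
            sdA = np.sqrt(np.average((x - muA) ** 2, weights=wA))
            sdB = np.sqrt(np.average((x - muB) ** 2, weights=wB))
            pA = wA.mean()
        return muA, muB, sdA, sdB, pA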

91
Estimation of new means
  • If wiA is the probability that object i belongs to cluster A, wiA = P(A|Xi), calculated in the E step, then
  • μA = (w1A x1 + w2A x2 + … + wnA xn) / (w1A + w2A + … + wnA)
  • Likewise, with wiB = P(B|Xi) calculated in the E step,
  • μB = (w1B x1 + w2B x2 + … + wnB xn) / (w1B + w2B + … + wnB)

92
Estimation of new standard deviations
  • σA² = (w1A (x1 - μA)² + w2A (x2 - μA)² + … + wnA (xn - μA)²) / (w1A + w2A + … + wnA)
  • σB² = (w1B (x1 - μB)² + w2B (x2 - μB)² + … + wnB (xn - μB)²) / (w1B + w2B + … + wnB)
  • The xi range over all the objects, not just those belonging to A or B
  • These are maximum likelihood estimators for the variance
  • If the weights were all equal, the denominator would sum to n rather than n - 1

93
Estimation of new priori probabilities
  • P(A) = (P(A|X1) + P(A|X2) + … + P(A|Xn)) / n
  • i.e., pA = (w1A + w2A + … + wnA) / n
  • P(B) = (P(B|X1) + P(B|X2) + … + P(B|Xn)) / n
  • i.e., pB = (w1B + w2B + … + wnB) / n
  • Or simply P(B) = 1 - P(A)

94
General Considerations
  • EM converges to a local maximum of the likelihood score, which may not be the global one
  • EM may be run with different initial values of the parameters, and the clustering structure with the highest likelihood is chosen
  • Or an initial k-means run is used to obtain the mean and standard deviation parameters

95
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Major features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Several interesting studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg & Keim (KDD'98)
  • CLIQUE: Agrawal et al. (SIGMOD'98)

96
Density Concepts
  • Core object (CO): an object with at least MinPts objects within its Eps-neighborhood
  • Directly density-reachable (DDR): x is a CO and y is in x's Eps-neighborhood
  • Density-reachable: there exists a chain of DDR objects from x to y
  • Density-based cluster: a set of density-connected objects that is maximal w.r.t. reachability

97
Density-Based Clustering Background
  • Two parameters:
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an Eps-neighbourhood of that point
  • NEps(p) = {q ∈ D | dist(p, q) ≤ Eps}
  • Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  • 1) p belongs to NEps(q), and
  • 2) the core point condition holds: |NEps(q)| ≥ MinPts

98
Density-Based Clustering Background (II)
  • Density-reachable:
  • a point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn with p1 = q, pn = p such that pi+1 is directly density-reachable from pi
  • Density-connected:
  • a point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both p and q are density-reachable from o w.r.t. Eps and MinPts

[Figure: a chain q = p1 → … → p illustrating density-reachability, and two points density-connected through a common point o.]
99
Let MinPts = 3
[Figure: a set of points with Eps-neighborhoods drawn around M, P, and Q.]
M and P are core objects, since each lies in an Eps-neighborhood containing at least 3 points. Q is directly density-reachable from M; M is directly density-reachable from P, and vice versa. Q is (indirectly) density-reachable from P, since Q is directly density-reachable from M and M is directly density-reachable from P. But P is not density-reachable from Q, since Q is not a core object.
100
DBSCAN Density Based Spatial Clustering of
Applications with Noise
  • Relies on a density-based notion of cluster A
    cluster is defined as a maximal set of
    density-connected points
  • Discovers clusters of arbitrary shape in spatial
    databases with noise

101
DBSCAN The Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p w.r.t. Eps and MinPts
  • If p is a core point, a cluster is formed
  • If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database
  • Continue the process until all of the points have been processed (a usage sketch follows below)
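A hedged usage sketch with scikit-learn's DBSCAN (the data and parameter values are illustrative):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(200, 2)                       # toy 2-D data
    labels = DBSCAN(eps=0.1, min_samples=3).fit_predict(X)
    # labels[i] is the cluster id of point i; -1 marks noise points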