GAKREM: A Clustering Algorithm that Automatically Generates a Number of Clusters
1
GAKREM: A Clustering Algorithm that Automatically
Generates a Number of Clusters
Cao Dang Nguyen August 2007
2
Introduction
  • Clustering means grouping similar objects into
    groups
  • A cluster is a set of entities that are alike,
    while entities from different clusters are not
    alike
  • Clustering is a very important data mining
    technique
  • Applications:
    • Feature selection
    • Image segmentation
    • Speech recognition
    • Information retrieval
    • DNA analysis
    • Market studies
  • Requirements:
    • Scalability
    • Dealing with different types of attributes
    • High dimensionality
    • Interpretability and usability

3
Introduction
  • Clustering algorithms:
    • Agglomerative algorithms
    • Divisive algorithms
    • K-means
    • Probabilistic algorithms
    • Neural networks (SOM)
  • K-means is widely used because of its simplicity
    and computational efficiency
  • Expectation-Maximization (EM) is an iterative
    statistical algorithm for locating a maximum-
    likelihood estimate of the mixture parameters
  • Both EM and K-means have several drawbacks:
    • They are very sensitive to initialization
    • They require the number of clusters as user input

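The initialization sensitivity noted above is easy to demonstrate: the minimal 1-D K-means sketch below (an illustration, not the paper's code) converges to different partitions from different starting centroids on the same data.

```python
def kmeans_1d(points, centroids, iters=20):
    """Minimal 1-D K-means; `centroids` is the initial guess."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if the cluster is empty).
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids

data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
print(kmeans_1d(data, [0.0, 10.0]))   # converges to [0.5, 15.5]
print(kmeans_1d(data, [20.0, 21.0]))  # converges to [5.5, 20.5]
```

Both runs reach a fixed point, but only the first separates the three natural groups sensibly; this is the local-optimum behavior GAKREM's genetic initialization is designed to avoid.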
4
Introduction
5
Introduction
6
Introduction
  • To overcome these drawbacks of EM and K-means,
    we propose a novel algorithm, namely GAKREM
  • We use a genetic algorithm for estimating
    parameters and initializing starting points for
    the EM algorithm, to avoid convergence toward
    locally optimal points
  • The log-likelihood of each configuration of
    parameters and number of clusters resulting
    from the EM is used as the fitness value of each
    individual (candidate) in the population
  • We approximate the log-likelihood of the EM for
    each configuration by using logarithmic
    regression instead of running the EM until it
    converges
  • We use the simple K-means algorithm to initially
    assign data points to clusters and to speed up
    convergence of the EM for each candidate
  • The effectiveness of GAKREM (Genetic Algorithm
    K-means Logarithmic Regression Expectation
    Maximization) is evaluated by comparing its
    performance with the original EM, K-means and
    LCV algorithms on several datasets

7
Background
  • Suppose that X = {x_1, x_2, ..., x_N} is independent and
    identically distributed with density p(x|Θ), therefore
      p(X|Θ) = ∏_{i=1}^{N} p(x_i|Θ) = L(Θ|X)                       (1)
  • We assume Z = (X, Y) is a complete data set, with X observed
    and Y unknown, and specify a joint density function
      p(z|Θ) = p(x, y|Θ)                                           (2)
  • We need to maximize the log-likelihood function L(Θ|X) of the
    observed data, where
      L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log p(x_i|Θ)  (3)
  • To maximize L subject to the constraint that the mixture
    weights sum to one, a Lagrange multiplier is solved            (4)
8
Background
  • The EM algorithm is a simpler iterative algorithm for
    obtaining the ML estimate
  • E-step: find the expected value of the complete-data
    log-likelihood log p(Z|Θ), denoted by
      Q(Θ, Θ^(t)) = E[log p(Z|Θ) | X, Θ^(t)]                       (5)
  • M-step: maximize the expected value computed in the E-step
      Θ^(t+1) = argmax_Θ Q(Θ, Θ^(t))                               (6)
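For concreteness, the two steps above can be sketched for a 1-D Gaussian mixture (an illustrative simplification; the slides work with the general multivariate case):

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, iters=50):
    K, N = len(mus), len(xs)
    for _ in range(iters):
        # E-step: responsibilities r[i][h] = P(component h | x_i).
        r = []
        for x in xs:
            p = [weights[h] * gauss(x, mus[h], variances[h]) for h in range(K)]
            s = sum(p)
            r.append([ph / s for ph in p])
        # M-step: re-estimate weights, means and variances from the
        # responsibilities.
        for h in range(K):
            nh = sum(r[i][h] for i in range(N))
            weights[h] = nh / N
            mus[h] = sum(r[i][h] * xs[i] for i in range(N)) / nh
            variances[h] = max(
                sum(r[i][h] * (xs[i] - mus[h]) ** 2 for i in range(N)) / nh,
                1e-9)  # guard against variance collapse
    return weights, mus, variances

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
weights, mus, variances = em_gmm_1d(xs, [0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in mus])  # ≈ [1.0, 11.0]
```

With well-separated initial means the iteration converges quickly; with poor initialization it can settle on a local optimum, which motivates GAKREM's GA-based initialization.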
9
Background
  • In the case of the mixture-density parameter estimation
    problem, we assume
      p(x|Θ) = ∑_{h=1}^{K} α_h p_h(x|θ_h)                          (7)
    where Θ = (α_1, ..., α_K, θ_1, ..., θ_K) such that
    ∑_{h=1}^{K} α_h = 1, θ_h = (μ_h, Σ_h), and p_h(x|θ_h) is a
    Gaussian density function
  • The log-likelihood function of the incomplete data is
      log L(Θ|X) = ∑_{i=1}^{N} log ∑_{h=1}^{K} α_h p_h(x_i|θ_h)    (8)
10
Background
  • The unobserved data Y is a set of N labels Y = {y_1, ..., y_N},
    where y_i = (y_{i,1}, ..., y_{i,K}) is a binary vector of K
    dimensions, i.e. if cluster h generates data point x_i then
    y_{i,h} = 1 and y_{i,j} = 0 for j ≠ h
  • The log-likelihood function of the complete data is
      log L(Θ|X, Y) = ∑_{i=1}^{N} ∑_{h=1}^{K} y_{i,h} log(α_h p_h(x_i|θ_h))   (9)
  • Given Θ^(t) = (α_1^(t), ..., α_K^(t), θ_1^(t), ..., θ_K^(t)),
    we calculate the expectation of the unobserved data Y
      E[y_{i,h}] = α_h^(t) p_h(x_i|θ_h^(t)) / ∑_{j=1}^{K} α_j^(t) p_j(x_i|θ_j^(t))   (10)
11
Background
  • Then, we maximize the log-likelihood function of the complete
    data, which yields the update formulas
      α_h^(t+1) = (1/N) ∑_{i=1}^{N} E[y_{i,h}]
      μ_h^(t+1) = ∑_{i=1}^{N} E[y_{i,h}] x_i / ∑_{i=1}^{N} E[y_{i,h}]
      Σ_h^(t+1) = ∑_{i=1}^{N} E[y_{i,h}] (x_i − μ_h^(t+1))(x_i − μ_h^(t+1))^T
                  / ∑_{i=1}^{N} E[y_{i,h}]                          (11)
12
GAKREM Algorithm
  • Phase I: compute the initial guess of parameters
  • We use a GA and simple K-means to guess the
    parameter Θ^(0)
  • Each chromosome j in the population is encoded as
    a binary vector of dimension N, (g_{j,1}, ..., g_{j,N}),
    where g_{j,i} = 1 if data point x_i is set up as a
    centroid point of a cluster
  • The number of 1s in a chromosome is the number K
    of components
  • Examples: suppose the size of the dataset is
    N = 10; then
    • Chromosome 1 = 0001000010 encodes 2 clusters
      whose centroid points are the 4th and 9th data points
    • Chromosome 2 = 0010001001 encodes 3 clusters
      whose centroid points are the 3rd, 7th and 10th data points
  • To optimize the use of memory and time, we encode
    only the positions of the 1-loci in the chromosomes:
    {4, 9} and {3, 7, 10}

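The 1-loci encoding in the examples above can be sketched as follows (the helper names are illustrative, not from the paper):

```python
def loci_from_bits(bits):
    """Compact a binary chromosome to the 1-based positions of its 1-genes.

    The number of loci returned is the number K of components."""
    return [i + 1 for i, b in enumerate(bits) if b == "1"]

def centroids_from_loci(loci, data):
    """Decode a compact chromosome back into its initial centroid points."""
    return [data[i - 1] for i in loci]

print(loci_from_bits("0001000010"))  # [4, 9]      -> K = 2
print(loci_from_bits("0010001001"))  # [3, 7, 10]  -> K = 3
```

Storing only the loci keeps each chromosome proportional to K rather than to the dataset size N, which is the memory saving the slide refers to.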
13
(No Transcript)
14
  • Phase II: evaluate the chromosomes' fitness
  • We perform a partial EM of r iterations for each of the above
    candidates and estimate the expected log-likelihood by
    logarithmic regression:
      E(C) = a log(t) + b                                          (12)
    where L_t, the log-likelihood at iteration t, supplies the
    points the regression is fitted to; experimentally, we set
    r = 5, t = 1, 2, 3, 4, 5
  • The log-likelihood is of no direct use by itself because the
    log-likelihood of the data can always be increased by
    increasing the number of clusters k
  • Based on Occam's razor ("entities should not be multiplied
    beyond necessity"), we define the fitness value of each
    candidate as
      fit(C) = E(C) − log(k)                                       (13)

15
  • Phase III: evolutionary generation for finding the
    optimal fitness value
  • Step 1: initialize the population and evaluate the
    fitness values of the chromosomes

16
  • Step 2: generate the new population

17
Results
  • We have conducted extensive experiments using
    GAKREM on two kinds of data: manually-generated
    datasets and automatically-generated datasets, as
    used in Xu and Jordan (1996) and Jain and Dubes
    (1988)
  • The probability of mutation was set
    experimentally to 0.15 for all experiments
  • For comparison in determining the best number of
    clusters in each generated dataset, we selected
    the likelihood cross-validation (LCV) technique
    (Smyth, 1988) implemented in the Weka package:
      k ← 1
      repeat
        k ← k + 1
        divide dataset D into v folds (v = 10)
        for i ← 1 to v
          train the model in the ith fold and test the
          model in D \ ith fold (with k)
        l_avg ← the average log-likelihood
      until l_avg is not increasing or k = k_max
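The LCV selection loop above can be sketched as follows; the per-fold EM training is abstracted behind a caller-supplied `avg_loglik` function, a hypothetical stand-in for Weka's implementation:

```python
def select_k_by_lcv(data, avg_loglik, k_max=10, v=10):
    """Pick the number of clusters by likelihood cross-validation.

    avg_loglik(data, k, v) returns the cross-validated average
    log-likelihood of a k-component model (stand-in for per-fold EM)."""
    best_k, best_l = 1, avg_loglik(data, 1, v)
    k = 1
    while k < k_max:
        k += 1
        l = avg_loglik(data, k, v)
        if l <= best_l:
            break  # stop once the average log-likelihood stops increasing
        best_k, best_l = k, l
    return best_k

# Usage with a toy scoring function that peaks at k = 3:
print(select_k_by_lcv(None, lambda d, k, v: -abs(k - 3)))  # 3
```

Note the greedy stopping rule: the loop halts at the first k whose score does not improve, which is one reason LCV can underestimate the number of clusters on hard datasets.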

18
Results
  • For comparison in determining the optimal
    log-likelihood in each generated dataset, we
    implemented the EM and K-means algorithms
  • To test the robustness of the algorithms, we
    repeated the experiments 100 times for each
    dataset

19
Experiments on manually-generated data
  • The result of GAKREM on a 2-cluster mixture
    derived from the Old Faithful dataset in Bishop
    (1995)
  • This dataset has two dimensions: the duration of
    an eruption of the geyser in minutes and the
    waiting time between eruptions
  • GAKREM precisely recognizes the 2-cluster mixture,
    of course, without a pre-defined number of
    clusters

20
Experiments on manually-generated data
  • The behavior of GAKREM on the Old Faithful
    dataset: it performs a heuristic search over the
    global space. At the 1728th generation in this
    trial, the optimal fitness value stabilizes at
    -7.5568.

21
Experiments on manually-generated data
(Figure: clustering results of GAKREM and LCV)
22
Experiments on manually-generated data
Average maximum log-likelihood of GAKREM, EM and
K-means tested on 2-, 3-, 4-, 5-, 6-, 7-, 8- and
9-cluster datasets, respectively.
23
Experiments on manually-generated data
(Figure: clustering results of K-means and EM)
24
Experiments on automatically-generated data
  • Accuracy of GAKREM is 97%
  • Accuracy of LCV is 31%

25
Experiments on automatically-generated data
26
Experiments on automatically-generated data
(Figure: clustering results of K-means and EM)
27
(No Transcript)
28
Conclusions and Future Work
  • We presented a powerful new algorithm, called
    GAKREM, that combines the best characteristics of
    the K-means and EM algorithms but avoids their
    weaknesses
  • We have tested the algorithm extensively on both
    manually and automatically generated datasets
  • We plan to use GAKREM to discover new pathways in
    chromosome 21 proteins by using heterogeneous
    data combining microarray data and interaction
    data
  • A demonstration is available at
    http://isl.cudenver.edu/GAKREM