GAKREM: A Clustering Algorithm that Automatically Generates a Number of Clusters
1
GAKREM: A Clustering Algorithm that Automatically
Generates a Number of Clusters
Cao Dang Nguyen August 2007
2
Introduction
  • Clustering means grouping similar objects into
    groups
  • A cluster is a set of entities that are alike,
    while entities from different clusters are not
    alike
  • Clustering is a very important data mining
    technique
  • Applications:
    • Feature selection
    • Image segmentation
    • Speech recognition
    • Information retrieval
    • DNA analysis
    • Market studies
  • Requirements:
    • Scalability
    • Dealing with different types of attributes
    • High dimensionality
    • Interpretability and usability

3
Introduction
  • Clustering algorithms:
    • Agglomerative algorithms
    • Divisive algorithms
    • K-means
    • Probabilistic algorithms
    • Neural networks (SOM)
  • K-means is widely used because of its simplicity
    and computational efficiency
  • Expectation-Maximization (EM) is an iterative
    statistical algorithm for locating a maximum-
    likelihood estimate of the mixture parameters
  • Both EM and K-means have several drawbacks:
    • They are very sensitive to initialization
    • They require the number of clusters as user input

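The initialization sensitivity noted above is easy to demonstrate: the minimal 1-D K-means sketch below (an illustration, not the paper's code) converges to different partitions from different starting centroids on the same data.

```python
def kmeans_1d(points, centroids, iters=20):
    """Minimal 1-D K-means; `centroids` is the initial guess."""
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda j: abs(p - centroids[j]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (keep the old centroid if the cluster is empty).
        centroids = [sum(c) / len(c) if c else m
                     for c, m in zip(clusters, centroids)]
    return centroids

data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
print(kmeans_1d(data, [0.0, 10.0]))   # converges to [0.5, 15.5]
print(kmeans_1d(data, [20.0, 21.0]))  # converges to [5.5, 20.5]
```

Both runs reach a fixed point, but only the first separates the three natural groups sensibly; this is the local-optimum behavior GAKREM's genetic initialization is designed to avoid.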
4
Introduction
5
Introduction
6
Introduction
  • To overcome these drawbacks of EM and K-means,
    we propose a novel algorithm, namely GAKREM
  • We use a genetic algorithm for estimating
    parameters and initializing starting points for
    the EM algorithm, to avoid convergence toward
    locally optimal points
  • The log-likelihood of each configuration of
    parameters and number of clusters resulting
    from the EM is used as the fitness value of each
    individual (candidate) in the population
  • We approximate the log-likelihood of the EM for
    each configuration by using logarithmic
    regression instead of running the EM until it
    converges
  • We use the simple K-means algorithm to initially
    assign data points to clusters and to speed up
    convergence of the EM for each candidate
  • The effectiveness of GAKREM (Genetic Algorithm
    K-means Logarithmic Regression Expectation
    Maximization) is evaluated by comparing its
    performance with the original EM, K-means and
    LCV algorithms on several datasets

7
Background
  • Suppose that X = {x_1, x_2, ..., x_N} is independent and
    identically distributed with density p(x|Θ), therefore
      p(X|Θ) = ∏_{i=1}^{N} p(x_i|Θ) = L(Θ|X)                       (1)
  • We assume Z = (X, Y) is a complete data set, with X observed
    and Y unknown, and specify a joint density function
      p(z|Θ) = p(x, y|Θ)                                           (2)
  • We need to maximize the log-likelihood function L(Θ|X) of the
    observed data, where
      L(Θ|X) = log ∏_{i=1}^{N} p(x_i|Θ) = ∑_{i=1}^{N} log p(x_i|Θ)  (3)
  • To maximize L subject to the constraint that the mixture
    weights sum to one, a Lagrange multiplier is solved            (4)
8
Background
  • The EM algorithm is a simpler iterative algorithm for
    obtaining the ML estimate
  • E-step: find the expected value of the complete-data
    log-likelihood log p(Z|Θ), denoted by
      Q(Θ, Θ^(t)) = E[log p(Z|Θ) | X, Θ^(t)]                       (5)
  • M-step: maximize the expected value computed in the E-step
      Θ^(t+1) = argmax_Θ Q(Θ, Θ^(t))                               (6)
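For concreteness, the two steps above can be sketched for a 1-D Gaussian mixture (an illustrative simplification; the slides work with the general multivariate case):

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, mus, variances, weights, iters=50):
    K, N = len(mus), len(xs)
    for _ in range(iters):
        # E-step: responsibilities r[i][h] = P(component h | x_i).
        r = []
        for x in xs:
            p = [weights[h] * gauss(x, mus[h], variances[h]) for h in range(K)]
            s = sum(p)
            r.append([ph / s for ph in p])
        # M-step: re-estimate weights, means and variances from the
        # responsibilities.
        for h in range(K):
            nh = sum(r[i][h] for i in range(N))
            weights[h] = nh / N
            mus[h] = sum(r[i][h] * xs[i] for i in range(N)) / nh
            variances[h] = max(
                sum(r[i][h] * (xs[i] - mus[h]) ** 2 for i in range(N)) / nh,
                1e-9)  # guard against variance collapse
    return weights, mus, variances

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
weights, mus, variances = em_gmm_1d(xs, [0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
print([round(m, 2) for m in mus])  # ≈ [1.0, 11.0]
```

With well-separated initial means the iteration converges quickly; with poor initialization it can settle on a local optimum, which motivates GAKREM's GA-based initialization.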
9
Background
  • In the case of the mixture-density parameter estimation
    problem, we assume
      p(x|Θ) = ∑_{h=1}^{K} α_h p_h(x|θ_h)                          (7)
    where Θ = (α_1, ..., α_K, θ_1, ..., θ_K) such that
    ∑_{h=1}^{K} α_h = 1, θ_h = (μ_h, Σ_h), and p_h(x|θ_h) is a
    Gaussian density function
  • The log-likelihood function of the incomplete data is
      log L(Θ|X) = ∑_{i=1}^{N} log ∑_{h=1}^{K} α_h p_h(x_i|θ_h)    (8)
10
Background
  • The unobserved data Y is a set of N labels Y = {y_1, ..., y_N},
    where y_i = (y_{i,1}, ..., y_{i,K}) is a binary vector of K
    dimensions, i.e. if cluster h generates data point x_i then
    y_{i,h} = 1 and y_{i,j} = 0 for j ≠ h
  • The log-likelihood function of the complete data is
      log L(Θ|X, Y) = ∑_{i=1}^{N} ∑_{h=1}^{K} y_{i,h} log(α_h p_h(x_i|θ_h))   (9)
  • Given Θ^(t) = (α_1^(t), ..., α_K^(t), θ_1^(t), ..., θ_K^(t)),
    we calculate the expectation of the unobserved data Y
      E[y_{i,h}] = α_h^(t) p_h(x_i|θ_h^(t)) / ∑_{j=1}^{K} α_j^(t) p_j(x_i|θ_j^(t))   (10)
11
Background
  • Then, we maximize the log-likelihood function of the complete
    data, which yields the update formulas
      α_h^(t+1) = (1/N) ∑_{i=1}^{N} E[y_{i,h}]
      μ_h^(t+1) = ∑_{i=1}^{N} E[y_{i,h}] x_i / ∑_{i=1}^{N} E[y_{i,h}]
      Σ_h^(t+1) = ∑_{i=1}^{N} E[y_{i,h}] (x_i − μ_h^(t+1))(x_i − μ_h^(t+1))^T
                  / ∑_{i=1}^{N} E[y_{i,h}]                          (11)
12
GAKREM Algorithm
  • Phase I: compute the initial guess of parameters
  • We use a GA and simple K-means to guess the
    parameter Θ^(0)
  • Each chromosome j in the population is encoded as
    a binary vector of dimension N, (g_{j,1}, ..., g_{j,N}),
    where g_{j,i} = 1 if data point x_i is set up as a
    centroid point of a cluster
  • The number of 1s in a chromosome is the number K
    of components
  • Examples: suppose the size of the dataset is
    N = 10; then
    • Chromosome 1 = 0001000010 encodes 2 clusters
      whose centroid points are the 4th and 9th data points
    • Chromosome 2 = 0010001001 encodes 3 clusters
      whose centroid points are the 3rd, 7th and 10th data points
  • To optimize the use of memory and time, we encode
    only the positions of the 1-loci in the chromosomes:
    {4, 9} and {3, 7, 10}

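The 1-loci encoding in the examples above can be sketched as follows (the helper names are illustrative, not from the paper):

```python
def loci_from_bits(bits):
    """Compact a binary chromosome to the 1-based positions of its 1-genes.

    The number of loci returned is the number K of components."""
    return [i + 1 for i, b in enumerate(bits) if b == "1"]

def centroids_from_loci(loci, data):
    """Decode a compact chromosome back into its initial centroid points."""
    return [data[i - 1] for i in loci]

print(loci_from_bits("0001000010"))  # [4, 9]      -> K = 2
print(loci_from_bits("0010001001"))  # [3, 7, 10]  -> K = 3
```

Storing only the loci keeps each chromosome proportional to K rather than to the dataset size N, which is the memory saving the slide refers to.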
13
(No Transcript)
14
  • Phase II: evaluate the chromosomes' fitness
  • We perform a partial EM of r iterations for each of the above
    candidates and estimate the expected log-likelihood by
    logarithmic regression:
      E(C) = a log(t) + b                                          (12)
    where L_t, the log-likelihood at iteration t, supplies the
    points the regression is fitted to; experimentally, we set
    r = 5, t = 1, 2, 3, 4, 5
  • The log-likelihood is of no direct use by itself because the
    log-likelihood of the data can always be increased by
    increasing the number of clusters k
  • Based on Occam's razor ("entities should not be multiplied
    beyond necessity"), we define the fitness value of each
    candidate as
      fit(C) = E(C) − log(k)                                       (13)

15
  • Phase III: evolutionary generation for finding the
    optimal fitness value
  • Step 1: initialize the population and evaluate the
    fitness values of the chromosomes

16
  • Step 2: generate the new population

17
Results
  • We have conducted extensive experiments using
    GAKREM on two kinds of data: manually-generated
    datasets and automatically-generated datasets, as
    used in Xu and Jordan (1996) and Jain and Dubes
    (1988)
  • The probability of mutation was set
    experimentally to 0.15 for all experiments
  • For comparison in determining the best number of
    clusters in each generated dataset, we selected
    the likelihood cross-validation (LCV) technique
    (Smyth, 1988) implemented in the Weka package:
      k ← 1
      repeat
        k ← k + 1
        divide dataset D into v folds (v = 10)
        for i ← 1 to v
          train the model in the ith fold and test the
          model in D \ ith fold (with k)
        l_avg ← the average log-likelihood
      until l_avg is not increasing or k = k_max
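The LCV selection loop above can be sketched as follows; the per-fold EM training is abstracted behind a caller-supplied `avg_loglik` function, a hypothetical stand-in for Weka's implementation:

```python
def select_k_by_lcv(data, avg_loglik, k_max=10, v=10):
    """Pick the number of clusters by likelihood cross-validation.

    avg_loglik(data, k, v) returns the cross-validated average
    log-likelihood of a k-component model (stand-in for per-fold EM)."""
    best_k, best_l = 1, avg_loglik(data, 1, v)
    k = 1
    while k < k_max:
        k += 1
        l = avg_loglik(data, k, v)
        if l <= best_l:
            break  # stop once the average log-likelihood stops increasing
        best_k, best_l = k, l
    return best_k

# Usage with a toy scoring function that peaks at k = 3:
print(select_k_by_lcv(None, lambda d, k, v: -abs(k - 3)))  # 3
```

Note the greedy stopping rule: the loop halts at the first k whose score does not improve, which is one reason LCV can underestimate the number of clusters on hard datasets.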

18
Results
  • For comparison in determining the optimal
    log-likelihood in each generated dataset, we
    implemented the EM and K-means algorithms
  • To test the robustness of the algorithms, we
    repeated the experiments 100 times for each
    dataset

19
Experiments on manually-generated data
  • The result of GAKREM on a 2-cluster mixture
    derived from the Old Faithful dataset in Bishop
    (1995)
  • This dataset has two dimensions: the duration of
    an eruption of the geyser in minutes and the
    waiting time between eruptions
  • GAKREM precisely recognizes the 2-cluster mixture,
    of course, without a pre-defined number of
    clusters

20
Experiments on manually-generated data
  • The behavior of GAKREM on the Old Faithful
    dataset: it performs a heuristic search over the
    global space. At the 1728th generation in this
    trial, the optimal fitness value stabilizes at
    -7.5568.

21
Experiments on manually-generated data
(Figure: clustering results of GAKREM and LCV)
22
Experiments on manually-generated data
Average maximum log-likelihood of GAKREM, EM and
K-means tested on 2-, 3-, 4-, 5-, 6-, 7-, 8- and
9-cluster datasets, respectively.
23
Experiments on manually-generated data
(Figure: clustering results of K-means and EM)
24
Experiments on automatically-generated data
  • Accuracy of GAKREM is 97%
  • Accuracy of LCV is 31%

25
Experiments on automatically-generated data
26
Experiments on automatically-generated data
(Figure: clustering results of K-means and EM)
27
(No Transcript)
28
Conclusions and Future Work
  • We presented a powerful new algorithm, called
    GAKREM, that combines the best characteristics of
    the K-means and EM algorithms but avoids their
    weaknesses
  • We have tested the algorithm extensively on both
    manually and automatically generated datasets
  • We plan to use GAKREM to discover new pathways in
    chromosome 21 proteins by using heterogeneous
    data combining microarray data and interaction
    data
  • A demonstration is available at
    http://isl.cudenver.edu/GAKREM