Genetic algorithms (GA) for clustering
Clustering Methods: Part 2e
Pasi Fränti
Speech and Image Processing Unit, School of Computing, University of Eastern Finland
General structure

Genetic Algorithm:
  Generate S initial solutions
  REPEAT Z iterations
    Select best solutions
    Create new solutions by crossover
    Mutate solutions
  END-REPEAT
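The loop above can be sketched in Python. Everything concrete here is an illustrative assumption, not the operators the slides discuss later: the toy "solution" is a single 1-D centroid, crossover is averaging, and mutation is Gaussian noise.

```python
import random

def genetic_algorithm(data, S=10, Z=20, seed=0):
    """GA skeleton following the slide: S solutions, Z generations.
    Toy setting: a solution is one 1-D centroid (a float)."""
    rng = random.Random(seed)

    def distortion(c):
        # Total absolute distance of all data points to the centroid.
        return sum(abs(x - c) for x in data)

    # Generate S initial solutions.
    population = [rng.uniform(min(data), max(data)) for _ in range(S)]
    for _ in range(Z):                       # REPEAT Z iterations
        population.sort(key=distortion)      # select best solutions
        best = population[:S // 2]
        # Create new solutions by crossover (here: average of two parents).
        children = [(rng.choice(best) + rng.choice(best)) / 2
                    for _ in range(S - len(best))]
        # Mutate solutions (small random perturbation).
        children = [c + rng.gauss(0, 0.1) for c in children]
        population = best + children
    return min(population, key=distortion)
```

Because the best half of the population is carried over unchanged, the best solution never gets worse between generations.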
Components of GA
- Representation of solution
- Selection method
- Crossover method (most critical!)
- Mutation
Representation of solution
- Partition (P)
  - Optimal centroids can be calculated from P.
  - Only local changes can be made.
- Codebook (C)
  - Optimal partition can be calculated from C.
  - Calculation of P takes O(NM) → slow.
- Combined (C, P)
  - Both data structures are needed anyway.
  - Computationally more efficient.
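The two conversions behind this trade-off can be sketched as follows, assuming 2-D points as tuples (function names are illustrative). `optimal_partition` is the O(NM) step mentioned above.

```python
def optimal_centroids(points, partition, M):
    """C from P: each centroid is the mean of its cluster's points."""
    sums = [[0.0, 0.0] for _ in range(M)]
    counts = [0] * M
    for (x, y), j in zip(points, partition):
        sums[j][0] += x
        sums[j][1] += y
        counts[j] += 1
    return [(sx / c, sy / c) if c else (0.0, 0.0)
            for (sx, sy), c in zip(sums, counts)]

def optimal_partition(points, centroids):
    """P from C: map each point to its nearest centroid -- O(N*M)."""
    def d2(p, c):
        return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2
    return [min(range(len(centroids)), key=lambda j: d2(p, centroids[j]))
            for p in points]
```

Keeping both C and P (the combined representation) means neither conversion has to be recomputed from scratch after every change.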
Selection method
- To select which solutions will be used in crossover for generating new solutions.
- Main principle: good solutions should be used rather than weak solutions.
- Two main strategies:
  - Roulette wheel selection
  - Elitist selection
- Exact implementation is not so important.
Roulette wheel selection
- Select two candidate solutions for the crossover randomly.
- The probability for a solution to be selected is weighted according to its distortion.
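A minimal sketch of roulette wheel selection, assuming the weight is taken as the inverse of the distortion (one common choice; the slides do not fix the exact weighting):

```python
import random

def roulette_select(solutions, distortions, rng):
    """Pick one solution with probability proportional to 1/distortion,
    so lower-distortion (better) solutions are more likely to be chosen."""
    weights = [1.0 / d for d in distortions]
    total = sum(weights)
    r = rng.uniform(0, total)
    acc = 0.0
    for s, w in zip(solutions, weights):
        acc += w
        if r <= acc:
            return s
    return solutions[-1]   # guard against floating-point rounding
```

Calling it twice gives the two parents for a crossover.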
Elitist selection
- Main principle: select all possible pairs among the best candidates.
- Elitist approach using zigzag scanning among the best solutions.
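One way to enumerate all pairs among the best candidates so that pairs of better-ranked solutions come first (a zigzag-like ordering; the exact key is an assumption) is:

```python
def elitist_pairs(n_best):
    """All pairs (i, j), i < j, among the n_best top-ranked solutions,
    ordered so that pairs involving only the best ranks come first."""
    pairs = [(i, j) for i in range(n_best) for j in range(i + 1, n_best)]
    # Sort by the worse rank in the pair, then by the better rank:
    # (0,1), (0,2), (1,2), (0,3), (1,3), (2,3), ...
    pairs.sort(key=lambda p: (p[1], p[0]))
    return pairs
```

Each pair is then fed to the crossover, so n_best candidates yield n_best·(n_best-1)/2 children.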
Crossover methods
- Different variants for crossover:
  - Random crossover
  - Centroid distance
  - Pairwise crossover
  - Largest partitions
  - PNN
- Local fine-tuning:
  - All methods give a new allocation of the centroids.
  - Local fine-tuning must be made by K-means.
  - Two iterations of K-means are enough.
Random crossover
Select M/2 centroids randomly from each of the two parents.
[Figure: parent solutions 1 and 2]
Parent solution A + Parent solution B → New solution

How to create a new solution? Pick M/2 randomly chosen cluster centroids from each of the two parents in turn.
How many solutions are there? With M = 4 there are C(4,2) × C(4,2) = 36 possible ways to create a new solution.
What is the probability to select a good one? Not high: some are good but still need K-means, and most are bad. See the statistics below.

Some possibilities (M = 4):

  Parent A   Parent B   Rating
  c2, c4     c1, c4     Optimal
  c1, c2     c3, c4     Good (K-means)
  c2, c3     c2, c3     Bad

Rough statistics: Optimal 1, Good 7, Bad 28.
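The random crossover step itself can be sketched as follows (codebooks as lists of 2-D tuples; the helper name is illustrative):

```python
import random

def random_crossover(parent_a, parent_b, rng):
    """Child codebook: M/2 centroids picked at random from each parent."""
    m = len(parent_a)
    # sample() picks without replacement within each parent.
    return rng.sample(parent_a, m // 2) + rng.sample(parent_b, m - m // 2)
```

As the statistics above suggest, the child usually still needs a couple of K-means iterations to become competitive.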
[Figure: parent solutions A and B, with child solutions rated optimal, good, and bad]
Centroid distance crossover (Pan, McInnes & Jack, 1995, Electronics Letters; Scheunders, 1997, Pattern Recognition Letters)
- For each centroid, calculate its distance to the center point of the entire data set.
- Sort the centroids according to the distance.
- Divide into two sets: central vectors (M/2 closest) and distant vectors (M/2 furthest).
- Take the central vectors from one codebook and the distant vectors from the other.
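The four steps can be sketched as follows for the variant that takes central vectors from parent A and distant vectors from parent B; 2-D tuples and passing the data-set center in as a parameter are assumptions:

```python
def centroid_distance_crossover(parent_a, parent_b, data_center):
    """Central vectors (M/2 closest to the data center) from parent A,
    distant vectors (M/2 furthest) from parent B."""
    def dist2(c):
        return (c[0] - data_center[0]) ** 2 + (c[1] - data_center[1]) ** 2
    m = len(parent_a)
    a_sorted = sorted(parent_a, key=dist2)   # closest to center first
    b_sorted = sorted(parent_b, key=dist2)
    return a_sorted[:m // 2] + b_sorted[m // 2:]
```

Swapping the roles of the two parents gives the other variant.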
Parent solution A
Parent solution B
New solution:
- Variant (a): take the central vectors from parent solution A and the distant vectors from parent solution B, OR
- Variant (b): take the distant vectors from parent solution A and the central vectors from parent solution B.
Child: variant (a)
Child: variant (b)
Pairwise crossover (Fränti et al., 1997, The Computer Journal)
- Greedy approach:
  - For each centroid, find its nearest centroid in the other parent solution that is not yet used.
  - From each pair, select one of the two centroids randomly.
- Small improvement:
  - No reason to consider the parents as separate solutions.
  - Take the union of all centroids.
  - Make the pairing independent of the parents.
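The greedy pairing can be sketched as below (the restricted version, pairing A's centroids against B's; 2-D tuples assumed):

```python
import random

def pairwise_crossover(parent_a, parent_b, rng):
    """For each centroid of A, greedily pair it with its nearest
    still-unused centroid of B, then keep one of each pair at random."""
    used = set()
    child = []
    for ca in parent_a:
        # Nearest not-yet-used centroid of the other parent.
        j = min((k for k in range(len(parent_b)) if k not in used),
                key=lambda k: (ca[0] - parent_b[k][0]) ** 2
                            + (ca[1] - parent_b[k][1]) ** 2)
        used.add(j)
        child.append(ca if rng.random() < 0.5 else parent_b[j])
    return child
```

The improved version instead pairs centroids within the union of both codebooks, removing the A-vs-B restriction.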
Pairwise crossover example
Initial parent solutions: MSE = 11.92×10^9 and MSE = 8.79×10^9
Pairwise crossover example
Pairing between parent solutions: MSE = 7.34×10^9
Pairwise crossover example
Pairing without restrictions: MSE = 4.76×10^9
Largest partitions (Fränti et al., 1997, The Computer Journal)
- Select centroids that represent the largest clusters.
- Selection is done in a greedy manner.
- (illustration to appear later)
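A simplified sketch of the idea, assuming each centroid carries the size of its cluster; the actual greedy method also re-evaluates partitions between selections, which is omitted here:

```python
def largest_partitions_crossover(parent_a, parent_b, sizes_a, sizes_b, m):
    """From the union of both codebooks, greedily keep the m centroids
    whose clusters are largest (cluster sizes supplied by the caller)."""
    pool = list(zip(parent_a, sizes_a)) + list(zip(parent_b, sizes_b))
    pool.sort(key=lambda cs: -cs[1])        # largest clusters first
    return [c for c, _ in pool[:m]]
```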
PNN crossover for GA (Fränti et al., 1997, The Computer Journal)
[Figure: parent codebooks (Initial 1, Initial 2), their combined union, and the result after PNN merging]
The PNN crossover method (1) (Fränti, 2000, Pattern Recognition Letters)
The PNN crossover method (2)
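The PNN crossover, taking the union of the two codebooks (2M centroids) and merging back to M by pairwise nearest neighbor, can be sketched as below. The weighted merge cost follows the standard PNN formulation; the brute-force pair search is for clarity only (the cited papers use much faster variants):

```python
def pnn_crossover(parent_a, parent_b):
    """Union of the two parent codebooks, then PNN merging back to M:
    repeatedly merge the pair of centroids with the lowest merge cost."""
    m = len(parent_a)
    # Each entry: (centroid, weight); weights start at 1 per centroid.
    cents = [(c, 1) for c in parent_a + parent_b]
    while len(cents) > m:
        best = None
        for i in range(len(cents)):
            for j in range(i + 1, len(cents)):
                (xi, yi), wi = cents[i]
                (xj, yj), wj = cents[j]
                # PNN merge cost: wi*wj/(wi+wj) * squared distance.
                cost = wi * wj / (wi + wj) * ((xi - xj) ** 2 + (yi - yj) ** 2)
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        _, i, j = best
        (xi, yi), wi = cents[i]
        (xj, yj), wj = cents[j]
        merged = (((wi * xi + wj * xj) / (wi + wj),
                   (wi * yi + wj * yj) / (wi + wj)), wi + wj)
        cents = [c for k, c in enumerate(cents) if k not in (i, j)] + [merged]
    return [c for c, _ in cents]
```

Because every merge picks the cheapest pair, near-duplicate centroids contributed by both parents collapse first, which is exactly what makes this crossover effective.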
Importance of K-means (random crossover)
[Figure: worst and best results on Bridge]
Effect of crossover method (with K-means iterations)
[Figure: results on Bridge]

Effect of crossover method (with K-means iterations)
[Figure: results on binary data (Bridge2)]
Mutations
- The purpose is to implement small random changes to the solutions.
- Happens with a small probability.
- Sensible approach: change the location of one centroid by a random swap!
- The role of mutations is to simulate local search.
- If mutations are needed → the crossover method is not very good.
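The random swap mutation mentioned above, replacing one randomly chosen centroid with a randomly chosen data point, can be sketched as:

```python
import random

def random_swap_mutation(centroids, data, rng):
    """Random swap: move one randomly chosen centroid to the location
    of a randomly chosen data point. Returns a new codebook."""
    mutated = list(centroids)                      # leave the parent intact
    mutated[rng.randrange(len(mutated))] = rng.choice(data)
    return mutated
```

In the GA this would only be applied with a small probability per child.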
Effect of K-means and mutations
- K-means improves the result but is not vital.
- Mutations alone are better than random crossover!
Pseudo code of GAIS (Virmajoki & Fränti, 2006, Pattern Recognition)
PNN vs. IS crossovers
- IS gives a further improvement of about 1%.
Optimized GAIS variants
- GAIS short (optimized for speed):
  - Create new generations only as long as the best solution keeps improving (T).
  - Use a small population size (Z = 10).
  - Apply two iterations of K-means (G = 2).
- GAIS long (optimized for quality):
  - Create a large number of generations (T = 100).
  - Use a large population size (Z = 100).
  - Iterate K-means relatively long (G = 10).
Comparison of algorithms
Variation of the result
Time vs. quality comparison (Bridge)
Conclusions
- Best clustering obtained by GA.
- Crossover method most important.
- Mutations not needed.
References
- P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765, May 2006.
- P. Fränti, "Genetic algorithm with deterministic crossover for vector quantization", Pattern Recognition Letters, 21 (1), 61-68, January 2000.
- P. Fränti, J. Kivijärvi, T. Kaukoranta and O. Nevalainen, "Genetic algorithms for large scale clustering problems", The Computer Journal, 40 (9), 547-554, 1997.
- J. Kivijärvi, P. Fränti and O. Nevalainen, "Self-adaptive genetic algorithm for clustering", Journal of Heuristics, 9 (2), 113-129, 2003.
- J.S. Pan, F.R. McInnes and M.A. Jack, "VQ codebook design using genetic algorithms", Electronics Letters, 31, 1418-1419, August 1995.
- P. Scheunders, "A genetic Lloyd-Max quantization algorithm", Pattern Recognition Letters, 17, 547-556, 1996.