Title: Solving the Set Covering Problem based on a new Clustering Heuristic
1. Solving the Set Covering Problem based on a new Clustering Heuristic
- Nikolaos Mastrogiannis
- Ioannis Giannikos
- Basilis Boutsinas
2. The Set Covering Problem
- The Set Covering Problem (SCP) is the problem of covering the rows of an m-row, n-column, zero-one matrix A = (a_ij) by a subset of the columns at minimum cost. Defining
  - x_j = 1 if column j is in the solution (with cost c_j)
  - x_j = 0 otherwise
- the SCP is:
  Minimize Σ_{j=1,…,n} c_j x_j
  subject to Σ_{j=1,…,n} a_ij x_j ≥ 1 for i = 1,…,m, and x_j ∈ {0, 1} for j = 1,…,n.
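The 0-1 formulation above can be sketched directly; the matrix, costs, and chosen columns below are toy data for illustration only:

```python
import numpy as np

# Toy 0-1 coverage matrix A (m = 4 rows, n = 5 columns) and column costs c.
A = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])
c = np.array([3, 2, 4, 1, 2])

def is_cover(A, x):
    """x is a 0-1 vector over columns; a cover needs A @ x >= 1 in every row."""
    return bool(np.all(A @ x >= 1))

def cost(c, x):
    """Objective value of the 0-1 selection x."""
    return int(c @ x)

x = np.array([1, 1, 0, 1, 0])      # pick columns 0, 1, 3
print(is_cover(A, x), cost(c, x))  # True 6
```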
3. Clustering Heuristic: General Description
- Given the set covering problem as stated above, a clustering heuristic must specify:
  - a method for partitioning the column set into homogeneous clusters formed by similar columns;
  - a rule for selecting a best column from each cluster.
- If the set J of all the selected columns forms a cover, then a prime cover P_J is extracted from J. Otherwise, the current partition is modified and the process is repeated.
- The proposed Clustering Heuristic is based on the general principles of:
  - the k-means clustering algorithm (MacQueen, 1967)
  - the k-modes clustering algorithm (Huang, 1998)
  - the ELECTRE methods, especially the ELECTRE I multicriteria method (Roy, 1968)
4. The Clustering Heuristic: Introduction (1)
- The Clustering Heuristic consists of the following steps:
  - Select k initial centroids, one for each cluster.
  - Assign each column to the proper cluster according to:
    - the distance of the column to be clustered from the best of the centroids, for each of the rows (attributes) that describe both the column and the centroid;
    - the importance of the best of the centroids in terms of its row (attribute) weights;
    - the possible veto that some of the rows (attributes) might oppose to the result of the assignment process.
  - Update the centroid of each cluster after each assignment.
5. The Clustering Heuristic: Introduction (2)
- After all columns have been assigned to clusters, retest the dissimilarity of the columns against the current centroids according to Step 2. If a column is found to be nearer to another cluster than to its current one, re-assign the column to that cluster and update the centroids of both clusters.
- Repeat Step 3 until no column has changed clusters after a full cycle through the whole dataset.
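The assign/update/retest loop of Steps 1-4 can be sketched as follows; this is a minimal k-modes-style skeleton that substitutes a plain Hamming distance for the deck's frequency- and threshold-based assignment rule:

```python
import numpy as np

def simple_dist(col, centroid):
    # Placeholder dissimilarity: Hamming distance over rows. The deck instead
    # uses frequency-based distances with concordance/veto thresholds.
    return int(np.sum(col != centroid))

def cluster_columns(A, k, max_iter=50):
    """k-modes-style clustering of the columns of the 0-1 matrix A."""
    n = A.shape[1]
    # Step 1: take the first k columns as initial centroids (the deck selects
    # the initial centroids among the columns to be clustered).
    centroids = A[:, :k].copy()
    assign = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        changed = False
        # Step 2: assign each column to the nearest centroid.
        for j in range(n):
            best = min(range(k), key=lambda t: simple_dist(A[:, j], centroids[:, t]))
            if best != assign[j]:
                assign[j] = best
                changed = True
        # Update each centroid to the per-row mode of its cluster's columns.
        for t in range(k):
            members = A[:, assign == t]
            if members.size:
                centroids[:, t] = (members.mean(axis=1) >= 0.5).astype(int)
        # Steps 3-4: stop once a full pass changes no assignment.
        if not changed:
            break
    return assign, centroids
```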
6. The Clustering Heuristic: Introduction (3)
- After partitioning the set of columns into homogeneous clusters, the best column for each cluster is chosen according to Chvátal's selector.
- When the best column of each cluster has been identified, we solve the SCP as an integer programming problem in MS Excel Solver in order to extract a prime cover from the above partition. If no cover can be extracted, we modify the partition (by changing the number of clusters) and repeat the process.
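Chvátal's selector is the classical cost-per-covered-row ratio rule; a minimal sketch of applying it to one cluster's columns (toy data for illustration):

```python
import numpy as np

def chvatal_best(A, c, cols):
    """Chvátal's ratio selector: among the columns `cols` of one cluster,
    pick the one with the smallest cost per covered row, c_j / |{i: a_ij = 1}|."""
    def ratio(j):
        covered = int(A[:, j].sum())
        return c[j] / covered if covered else float("inf")
    return min(cols, key=ratio)

# Toy data: one cluster holding columns 0, 1, 2.
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
c = np.array([4, 2, 3])
print(chvatal_best(A, c, [0, 1, 2]))  # column 1: ratio 2/2 = 1.0
```

Applying the selector once per cluster yields the candidate set J from which a prime cover is extracted.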
7. Description of the Clustering Heuristic (1)
- Step 1: The initial centroids K_t, t = 1,…,k, are selected among the columns to be clustered.
- Step 2.1: For every row (attribute) l = 1,…,m common between column y_j, j = 1,…,n, and each of the centroids K_t, t = 1,…,k, we calculate the distance
  d_l(y_j, K_t) = |f(K_tl) − f(y_jl)|,
  where f(K_tl) and f(y_jl) are the relative frequencies of the values K_tl and y_jl. If d_l(y_j, K_t) ≤ q_l, where q_l is an indifference threshold, then row (attribute) l belongs to the concordance coalition; that is, it supports the proposition "column y_j is similar to centroid K_t on row (attribute) l". Otherwise, row (attribute) l belongs to the discordance coalition.
- Observation: The indifference threshold q_l is a variable valued automatically according to the number of the discrete non-zero distances d_l(y_j, K_t).
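Step 2.1 can be sketched as follows, assuming the distance is the absolute difference of relative frequencies and that the thresholds q_l are supplied directly rather than valued automatically:

```python
import numpy as np

def row_frequencies(A):
    """f[l][v] = relative frequency of value v (0 or 1) on row l of A."""
    n = A.shape[1]
    ones = A.sum(axis=1) / n
    return np.stack([1 - ones, ones], axis=1)  # shape (m, 2)

def coalitions(A, col, centroid, q):
    """Split rows into concordance / discordance coalitions (Step 2.1).
    Row l is concordant when the frequency-based distance
    |f(K_tl) - f(y_jl)| does not exceed the indifference threshold q_l."""
    f = row_frequencies(A)
    m = A.shape[0]
    dist = np.array([abs(f[l, centroid[l]] - f[l, col[l]]) for l in range(m)])
    concordant = dist <= q
    return concordant, ~concordant
```

Note that rows where the column and the centroid take the same value have distance zero and are always concordant.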
8. Description of the Clustering Heuristic (2)
- Step 2.2: In order to choose the best of the centroids K_t, t = 1,…,k, and then assign the column to be clustered to its cluster, we calculate the concordance index CI_t, t = 1,…,k, and threshold CT_t, t = 1,…,k, where w_c, w_f, w_p are weights of rows (attributes) and bonus is a parameter valued automatically, set to favour those clusters t = 1,…,k that contain as many zero distances as possible on every row.
- We sort in descending order the concordance index CI_t and its corresponding threshold CT_t for every cluster.
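The slide's exact combination of the weights w_c, w_f, w_p did not survive, so the following is only an illustrative sketch of an ELECTRE-I-style concordance index: the share of total row weight carried by the concordance coalition, plus the automatic bonus term:

```python
import numpy as np

def concordance_index(weights, concordant, bonus=0.0):
    """ELECTRE-I-style concordance index: weight of the concordance
    coalition over the total row weight, plus an optional bonus term.
    (Illustrative only; the deck's exact w_c/w_f/w_p scheme is not shown.)"""
    return (weights[concordant].sum() + bonus) / weights.sum()
```

Computing CI_t against every centroid and sorting in descending order then ranks the candidate clusters, as in the slide.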
9. Description of the Clustering Heuristic (3)
- If CI_b ≥ CT_b, where b = 1,…,k denotes the k positions of the k centroids and the corresponding indices and thresholds in the descending order stated above, then column j is assigned to cluster b. Otherwise, the process is repeated until the column is clustered.
- Observation 1: Parameter m1 is the only parameter defined by the user. If m1 is 0.7, this means that for each column, the best of the centroids incorporates 70% of the strength of the weights of the attributes that belong to both the concordance and the discordance coalitions.
- Observation 2: The weighting of the rows l = 1,…,m in the proposed algorithm is based on the density of 1s in matrix A.
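One natural reading of the density-based row weighting of Observation 2 (the slide's exact formula is not shown, so this is an assumption) is to weight each row by its share of 1s:

```python
import numpy as np

def row_weights(A):
    """Weight each row l by its density of 1s: (number of 1s on row l) / n.
    One plausible reading of the density-based weighting; the exact
    formula on the slide is not reproduced here."""
    return A.sum(axis=1) / A.shape[1]
```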
10. Description of the Clustering Heuristic (4)
- Step 2.3: This step confirms or rejects the allocation of a column to a cluster.
- If j is the column to be clustered, t is the cluster assigned to the column according to Step 2.2, and l ranges over all the rows that belong to the discordance coalition, then: if d_l(y_j, K_t) ≤ U_l for every row l that belongs to the discordance coalition, the clustering is confirmed, the centroid of the cluster is updated and we proceed to the next column. Otherwise, we return to Step 2.2.
- Observation: Parameter U_l is called the veto threshold; it is valued automatically, and for every row l, veto threshold U_l > indifference threshold q_l.
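Step 2.3 can be sketched as a simple veto test over the discordance coalition, assuming the same per-row distances as Step 2.1 and veto thresholds U_l given as input:

```python
import numpy as np

def veto_confirms(dist, discordant, U):
    """Step 2.3: the assignment stands only if no discordant row exercises
    its veto, i.e. dist_l <= U_l for every row in the discordance coalition."""
    return bool(np.all(dist[discordant] <= U[discordant]))
```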
11. Description of the Clustering Heuristic (5)
- Step 3: We retest the dissimilarity of the columns against the current centroids according to Step 2, re-assign every column as needed to the proper cluster, and update the centroids of both clusters.
- Step 4: We repeat Step 3 until no column has changed clusters after a full cycle test.
- Finding a new centroid for a cluster:
  - Column x_j is a centroid for the zero-one matrix A = (a_ij) if it minimizes the total dissimilarity DIS to the columns of the cluster.
  - If n_l(v) is the number of columns with the discrete value v on row l, and f(v) = n_l(v)/n is the relative frequency of appearance of v on the set of columns, then DIS is minimized iff, for every l = 1,…,m, the centroid takes on row l the value with the highest relative frequency.
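The centroid-update result above is the usual k-modes observation: the per-row mode minimizes the summed dissimilarity. For a zero-one matrix it reduces to a majority vote per row:

```python
import numpy as np

def new_centroid(cols):
    """Column-wise mode of a cluster: on each row, keep the most frequent
    value among the cluster's columns (ties here go to 1). `cols` is the
    m x p block of the 0-1 matrix holding one cluster's columns."""
    return (cols.mean(axis=1) >= 0.5).astype(int)
```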
12. Computational Experimentation
- The algorithm was tested using datasets from the OR-Library of J.E. Beasley.
- The datasets include problems of size 50 × 500 and 200 × 1000.
- In 80% of the tested datasets, the optimal solution was found, in both the 50 × 500 and the 200 × 1000 problems.
- The final SCP was solved using Premium Solver V.8, in particular the XPRESS-MP Solver Engine.
13. Conclusions
- The new clustering heuristic:
  - combines three different scientific fields (set covering, data mining, multicriteria analysis);
  - takes into consideration the weight of each row (attribute) in the clustering process, rather than treating every row as equally weighted;
  - calculates these weights according to the density of 1s in the set;
  - analyzes the dataset in detail through the pairwise comparison of each column with each centroid on each row;
  - takes into consideration the possible objections of a minority of the rows (attributes) to the clustering process;
  - produces covering results as well as processing times that are very promising.
14. Bibliography
- Beasley, J.E. (1987), An algorithm for set covering problem, European Journal of Operational Research, 31, pp. 85-93.
- Huang, Z. (1998), Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304.
- Huang, Z., Ng, M.K., Rong, H. and Li, Z. (2005), Automated variable weighting in k-means type clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668.
- MacQueen, J.B. (1967), Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
- Roy, B. (1968), Classement et choix en présence de points de vue multiples: la méthode ELECTRE, R.I.R.O., 8, 57-75.
- Roy, B. (1991), The outranking approach and the foundations of ELECTRE methods, Theory and Decision, 31, 49-73.
15. Thank you for your attention