Title: Solving the Set Covering Problem based on a new Clustering Heuristic
1. Solving the Set Covering Problem based on a new Clustering Heuristic
- Nikolaos Mastrogiannis
- Ioannis Giannikos
- Basilis Boutsinas
2. The Set Covering Problem
- The Set Covering Problem (SCP) is the problem of covering the rows of an m-row, n-column, zero-one matrix A = (a_ij) by a subset of the columns at minimum cost. Defining
  - x_j = 1 if column j is in the solution (with cost c_j)
  - x_j = 0 otherwise
- the SCP is:
  Minimize Σ_{j=1,…,n} c_j x_j
  subject to Σ_{j=1,…,n} a_ij x_j ≥ 1 for i = 1,…,m, and x_j ∈ {0, 1} for j = 1,…,n.
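The 0-1 formulation above can be sketched directly; the matrix, costs, and chosen columns below are toy data for illustration only:

```python
import numpy as np

# Toy 0-1 coverage matrix A (m = 4 rows, n = 5 columns) and column costs c.
A = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 1, 1],
])
c = np.array([3, 2, 4, 1, 2])

def is_cover(A, x):
    """x is a 0-1 vector over columns; a cover needs A @ x >= 1 in every row."""
    return bool(np.all(A @ x >= 1))

def cost(c, x):
    """Objective value of the 0-1 selection x."""
    return int(c @ x)

x = np.array([1, 1, 0, 1, 0])      # pick columns 0, 1, 3
print(is_cover(A, x), cost(c, x))  # True 6
```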
3. Clustering Heuristic: General Description
- Given the set covering problem as stated above, a clustering heuristic must specify:
  - a method for partitioning the column set into homogeneous clusters formed by similar columns;
  - a rule for selecting a best column from each cluster.
- If the set J of all the selected columns forms a cover, then a prime cover P_J is extracted from J. Otherwise, the current partition is modified and the process is repeated.
- The proposed Clustering Heuristic is based on the general principles of:
  - the k-means clustering algorithm (MacQueen, 1967)
  - the k-modes clustering algorithm (Huang, 1998)
  - the ELECTRE methods, especially the ELECTRE I multicriteria method (Roy, 1968)
4. The Clustering Heuristic: Introduction (1)
- The Clustering Heuristic consists of the following steps:
  - Select k initial centroids, one for each cluster.
  - Assign each column to the proper cluster according to:
    - the distance of the column to be clustered from the best of the centroids, for each of the rows (attributes) that describe both the column and the centroid;
    - the importance of the best of the centroids in terms of its row (attribute) weights;
    - the possible veto that some of the rows (attributes) might oppose to the result of the assignment process.
  - Update the centroid of each cluster after each assignment.
5. The Clustering Heuristic: Introduction (2)
- After all columns have been assigned to clusters, retest the dissimilarity of the columns against the current centroids according to Step 2. If a column is found to be nearer to another cluster than to its current one, re-assign the column to that cluster and update the centroids of both clusters.
- Repeat Step 3 until no column has changed clusters after a full cycle through the whole dataset.
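The assign/update/retest loop of Steps 1-4 can be sketched as follows; this is a minimal k-modes-style skeleton that substitutes a plain Hamming distance for the deck's frequency- and threshold-based assignment rule:

```python
import numpy as np

def simple_dist(col, centroid):
    # Placeholder dissimilarity: Hamming distance over rows. The deck instead
    # uses frequency-based distances with concordance/veto thresholds.
    return int(np.sum(col != centroid))

def cluster_columns(A, k, max_iter=50):
    """k-modes-style clustering of the columns of the 0-1 matrix A."""
    n = A.shape[1]
    # Step 1: take the first k columns as initial centroids (the deck selects
    # the initial centroids among the columns to be clustered).
    centroids = A[:, :k].copy()
    assign = np.zeros(n, dtype=int)
    for _ in range(max_iter):
        changed = False
        # Step 2: assign each column to the nearest centroid.
        for j in range(n):
            best = min(range(k), key=lambda t: simple_dist(A[:, j], centroids[:, t]))
            if best != assign[j]:
                assign[j] = best
                changed = True
        # Update each centroid to the per-row mode of its cluster's columns.
        for t in range(k):
            members = A[:, assign == t]
            if members.size:
                centroids[:, t] = (members.mean(axis=1) >= 0.5).astype(int)
        # Steps 3-4: stop once a full pass changes no assignment.
        if not changed:
            break
    return assign, centroids
```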
6. The Clustering Heuristic: Introduction (3)
- After partitioning the set of columns into homogeneous clusters, the best column for each cluster is chosen according to Chvátal's selector.
- When the best column of each cluster has been identified, we solve the SCP as an integer programming problem in MS Excel Solver in order to extract a prime cover from the above partition. If no cover can be extracted, we modify the partition (by changing the number of clusters) and repeat the process.
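Chvátal's selector is the classical cost-per-covered-row ratio rule; a minimal sketch of applying it to one cluster's columns (toy data for illustration):

```python
import numpy as np

def chvatal_best(A, c, cols):
    """Chvátal's ratio selector: among the columns `cols` of one cluster,
    pick the one with the smallest cost per covered row, c_j / |{i: a_ij = 1}|."""
    def ratio(j):
        covered = int(A[:, j].sum())
        return c[j] / covered if covered else float("inf")
    return min(cols, key=ratio)

# Toy data: one cluster holding columns 0, 1, 2.
A = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1]])
c = np.array([4, 2, 3])
print(chvatal_best(A, c, [0, 1, 2]))  # column 1: ratio 2/2 = 1.0
```

Applying the selector once per cluster yields the candidate set J from which a prime cover is extracted.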
7. Description of the Clustering Heuristic (1)
- Step 1: The initial centroids K_t, t = 1,…,k, are selected among the columns to be clustered.
- Step 2.1: For every row (attribute) l = 1,…,m common between column y_j, j = 1,…,n, and each of the centroids K_t, t = 1,…,k, we calculate the distance
  d_l(y_j, K_t) = |f(K_tl) − f(y_jl)|,
  where f(K_tl) and f(y_jl) are the relative frequencies of the values K_tl and y_jl. If d_l(y_j, K_t) ≤ q_l, where q_l is an indifference threshold, then row (attribute) l belongs to the concordance coalition; that is, it supports the proposition "column y_j is similar to centroid K_t on row (attribute) l". Otherwise, row (attribute) l belongs to the discordance coalition.
- Observation: The indifference threshold q_l is a variable valued automatically according to the number of the discrete non-zero distances d_l(y_j, K_t).
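Step 2.1 can be sketched as follows, assuming the distance is the absolute difference of relative frequencies and that the thresholds q_l are supplied directly rather than valued automatically:

```python
import numpy as np

def row_frequencies(A):
    """f[l][v] = relative frequency of value v (0 or 1) on row l of A."""
    n = A.shape[1]
    ones = A.sum(axis=1) / n
    return np.stack([1 - ones, ones], axis=1)  # shape (m, 2)

def coalitions(A, col, centroid, q):
    """Split rows into concordance / discordance coalitions (Step 2.1).
    Row l is concordant when the frequency-based distance
    |f(K_tl) - f(y_jl)| does not exceed the indifference threshold q_l."""
    f = row_frequencies(A)
    m = A.shape[0]
    dist = np.array([abs(f[l, centroid[l]] - f[l, col[l]]) for l in range(m)])
    concordant = dist <= q
    return concordant, ~concordant
```

Note that rows where the column and the centroid take the same value have distance zero and are always concordant.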
8. Description of the Clustering Heuristic (2)
- Step 2.2: In order to choose the best of the centroids K_t, t = 1,…,k, and then assign the column to be clustered to its cluster, we calculate the concordance index CI_t, t = 1,…,k, and threshold CT_t, t = 1,…,k, where w_c, w_f, w_p are weights of rows (attributes) and bonus is a parameter valued automatically, set to favour those clusters t = 1,…,k that contain as many zero distances as possible on every row.
- We sort in descending order the concordance index CI_t and its corresponding threshold CT_t for every cluster.
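The slide's exact combination of the weights w_c, w_f, w_p did not survive, so the following is only an illustrative sketch of an ELECTRE-I-style concordance index: the share of total row weight carried by the concordance coalition, plus the automatic bonus term:

```python
import numpy as np

def concordance_index(weights, concordant, bonus=0.0):
    """ELECTRE-I-style concordance index: weight of the concordance
    coalition over the total row weight, plus an optional bonus term.
    (Illustrative only; the deck's exact w_c/w_f/w_p scheme is not shown.)"""
    return (weights[concordant].sum() + bonus) / weights.sum()
```

Computing CI_t against every centroid and sorting in descending order then ranks the candidate clusters, as in the slide.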
9. Description of the Clustering Heuristic (3)
- If CI_b ≥ CT_b, where b = 1,…,k denotes the k positions of the k centroids and the corresponding indices and thresholds in the descending order stated above, then column j is assigned to cluster b. Otherwise, the process is repeated until the column is clustered.
- Observation 1: Parameter m1 is the only parameter defined by the user. If m1 is 0.7, this means that for each column, the best of the centroids incorporates 70% of the strength of the weights of the attributes that belong to both the concordance and the discordance coalitions.
- Observation 2: The weighting of the rows l = 1,…,m in the proposed algorithm is based on the density of 1s in matrix A.
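One natural reading of the density-based row weighting of Observation 2 (the slide's exact formula is not shown, so this is an assumption) is to weight each row by its share of 1s:

```python
import numpy as np

def row_weights(A):
    """Weight each row l by its density of 1s: (number of 1s on row l) / n.
    One plausible reading of the density-based weighting; the exact
    formula on the slide is not reproduced here."""
    return A.sum(axis=1) / A.shape[1]
```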
10. Description of the Clustering Heuristic (4)
- Step 2.3: This step confirms or rejects the allocation of a column to a cluster.
- If j is the column to be clustered, t is the cluster assigned to the column according to Step 2.2, and l ranges over all the rows that belong to the discordance coalition, then: if d_l(y_j, K_t) ≤ U_l for every row l that belongs to the discordance coalition, the clustering is confirmed, the centroid of the cluster is updated and we proceed to the next column. Otherwise, we return to Step 2.2.
- Observation: Parameter U_l is called the veto threshold; it is valued automatically, and for every row l, veto threshold U_l > indifference threshold q_l.
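Step 2.3 can be sketched as a simple veto test over the discordance coalition, assuming the same per-row distances as Step 2.1 and veto thresholds U_l given as input:

```python
import numpy as np

def veto_confirms(dist, discordant, U):
    """Step 2.3: the assignment stands only if no discordant row exercises
    its veto, i.e. dist_l <= U_l for every row in the discordance coalition."""
    return bool(np.all(dist[discordant] <= U[discordant]))
```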
11. Description of the Clustering Heuristic (5)
- Step 3: We retest the dissimilarity of the columns against the current centroids according to Step 2, re-assign every column as needed to the proper cluster, and update the centroids of both clusters.
- Step 4: We repeat Step 3 until no column has changed clusters after a full cycle test.
- Finding a new centroid for a cluster:
  - Column x_j is a centroid for the zero-one matrix A = (a_ij) if it minimizes the total dissimilarity DIS to the columns of the cluster.
  - If n_l(v) is the number of columns with the discrete value v on row l, and f(v) = n_l(v)/n is the relative frequency of appearance of v on the set of columns, then DIS is minimized iff, for every l = 1,…,m, the centroid takes on row l the value with the highest relative frequency.
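The centroid-update result above is the usual k-modes observation: the per-row mode minimizes the summed dissimilarity. For a zero-one matrix it reduces to a majority vote per row:

```python
import numpy as np

def new_centroid(cols):
    """Column-wise mode of a cluster: on each row, keep the most frequent
    value among the cluster's columns (ties here go to 1). `cols` is the
    m x p block of the 0-1 matrix holding one cluster's columns."""
    return (cols.mean(axis=1) >= 0.5).astype(int)
```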
12. Computational Experimentation
- The algorithm was tested using datasets from the OR-Library of J.E. Beasley.
- The datasets include problems of size 50 × 500 and 200 × 1000.
- In 80% of the tested datasets, the optimal solution was found, in both the 50 × 500 and the 200 × 1000 problems.
- The final SCP was solved using Premium Solver V.8, in particular the XPRESS-MP Solver Engine.
13. Conclusions
- The new clustering heuristic:
  - combines three different scientific fields (set covering, data mining, multicriteria analysis);
  - takes into consideration the weight of each row (attribute) in the clustering process, rather than treating every row as equally weighted;
  - calculates these weights according to the density of 1s in the set;
  - analyzes the dataset in detail through the pairwise comparison of each column with each centroid on each row;
  - takes into consideration the possible objections of a minority of the rows (attributes) to the clustering process;
  - produces covering results as well as processing times that are very promising.
14. Bibliography
- Beasley, J.E. (1987), An algorithm for set covering problem, European Journal of Operational Research, 31, pp. 85-93.
- Huang, Z. (1998), Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304.
- Huang, Z., Ng, M.K., Rong, H. and Li, Z. (2005), Automated variable weighting in k-means type clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668.
- MacQueen, J.B. (1967), Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
- Roy, B. (1968), Classement et choix en présence de points de vue multiples: la méthode ELECTRE, R.I.R.O., 8, 57-75.
- Roy, B. (1991), The outranking approach and the foundations of ELECTRE methods, Theory and Decision, 31, 49-73.
15. Thank you for your attention