1
Solving the Set Covering Problem based on a new
Clustering Heuristic
  • Nikolaos Mastrogiannis
  • Ioannis Giannikos
  • Basilis Boutsinas

2
The Set Covering Problem
  • The Set Covering Problem (SCP) is the problem of
    covering the rows of an m-row, n-column, zero-one
    matrix $A = (a_{ij})$ by a subset of the columns at
    minimum cost. Defining
  • $x_j = 1$ if column j is in the
    solution (with cost $c_j$)
  • $x_j = 0$ otherwise
  • the SCP is: Minimize $\sum_{j=1}^{n} c_j x_j$
    subject to $\sum_{j=1}^{n} a_{ij} x_j \ge 1$, $i = 1, \dots, m$,
    and $x_j \in \{0, 1\}$, $j = 1, \dots, n$
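As a small illustration of this formulation (not part of the original slides), the sketch below builds a tiny 4 x 5 zero-one matrix A with illustrative costs c and checks whether a candidate 0-1 vector x covers every row, and at what cost.

```python
# Tiny illustrative SCP instance and a feasibility/cost check for a candidate x.
A = [  # a_ij = 1 means column j covers row i
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 1],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 1, 0],
]
c = [3, 2, 4, 1, 2]          # cost c_j of column j
x = [0, 1, 1, 1, 0]          # candidate solution: x_j = 1 if column j is selected

m, n = len(A), len(c)
is_cover = all(sum(A[i][j] * x[j] for j in range(n)) >= 1 for i in range(m))
cost = sum(c[j] * x[j] for j in range(n))
print(is_cover, cost)        # True 7 for this illustrative instance
```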

3
Clustering Heuristics: General Description
  • Given the set covering problem as stated above, a
    clustering heuristic must specify
  • A method for partitioning the column-set into
    homogeneous clusters formed by similar columns.
  • A rule for selecting a best column for each
    cluster.
  • If the set J of all the selected columns forms a
    cover, then a prime cover PJ is extracted from J.
    Else, the current partition is modified and the
    process is repeated.
  • The proposed Clustering Heuristic is based on
    the general principles of
  • the k-means clustering algorithm (MacQueen, 1967)
  • the k-modes clustering algorithm (Huang, 1998)
  • the ELECTRE methods, and especially the ELECTRE I
    multicriteria method (Roy, 1968)
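The overall scheme can be summarised in the following high-level sketch. The helper names cluster_columns, best_column and extract_prime_cover are placeholders (assumptions, not the authors' code) for the steps detailed on the following slides.

```python
# High-level sketch of the scheme described above: partition the columns into
# k clusters, select the best column of each cluster, and either extract a
# prime cover or retry with another number of clusters k.

def covers_all_rows(A, cols):
    """True if the selected columns cover every row of the zero-one matrix A."""
    return all(any(A[i][j] == 1 for j in cols) for i in range(len(A)))

def clustering_scp(A, c, cluster_columns, best_column, extract_prime_cover, k_values):
    # cluster_columns, best_column and extract_prime_cover stand for the
    # steps described on the following slides.
    for k in k_values:                       # modify the partition by changing k
        clusters = cluster_columns(A, c, k)  # 'homogeneous' clusters of similar columns
        J = [best_column(A, c, cluster) for cluster in clusters]
        if covers_all_rows(A, J):
            return extract_prime_cover(A, c, J)   # prime cover PJ extracted from J
    return None                              # no cover found for the tried values of k
```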

4
The Clustering Heuristic: Introduction (1)
  • The Clustering Heuristic consists of the
    following steps
  • Select k initial centroids, one for each cluster.
  • Assign each column to the proper cluster
    according to
  • - the distance of the column to be clustered
    from the best of the centroids, for each of the
    rows (attributes) that describe both the column
    and the centroid
  • - the importance of the best of the centroids
    in terms of its row (attribute) weights
  • - the possible veto that some of the rows
    (attributes) may exercise against the result of
    the assignment process
  • Update the centroid of each cluster after
    each assignment

5
The Clustering Heuristic: Introduction (2)
  • After all columns have been assigned to clusters,
    retest the dissimilarity of the columns against the
    current centroids according to Step 2. If a
    column is found to be nearer to another cluster
    than to its current one, re-assign the column
    to that cluster and update the centroids of both
    clusters.
  • Repeat Step 3 until no column has changed clusters
    after a full cycle test of the whole dataset.
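A minimal sketch of this k-means/k-modes style loop is given below; dissimilarity and update_centroid are assumed helpers that stand in for the assignment rules of Steps 2.1-2.3 and for the centroid update described later.

```python
# Sketch of the assignment / re-assignment loop of Steps 2-4 (assumed helpers).
def cluster_columns(columns, centroids, dissimilarity, update_centroid):
    # initial assignment: each column goes to its nearest centroid
    assignment = {j: min(range(len(centroids)),
                         key=lambda t: dissimilarity(col, centroids[t]))
                  for j, col in enumerate(columns)}
    changed = True
    while changed:                      # repeat until a full pass changes nothing
        changed = False
        for j, col in enumerate(columns):
            best = min(range(len(centroids)),
                       key=lambda t: dissimilarity(col, centroids[t]))
            if best != assignment[j]:   # column is nearer to another cluster
                old = assignment[j]
                assignment[j] = best
                update_centroid(centroids, old, assignment, columns)
                update_centroid(centroids, best, assignment, columns)
                changed = True
    return assignment
```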

6
The Clustering Heuristic: Introduction (3)
  • After partitioning the set of columns into
    homogeneous clusters, the best column for each
    cluster is chosen according to Chvátal's
    selector.
  • When the best column of each cluster has been
    identified and the selected columns form a cover,
    we solve the SCP as an integer programming
    problem in MS Excel Solver in order to extract a
    prime cover from the above partition. Otherwise,
    we modify the partition (by changing the number
    of clusters) and repeat the process.
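The selector formula is not reproduced on the slide. A sketch of such a selector for one cluster, assuming the classical Chvátal greedy criterion (pick the column with the smallest cost per covered row), could be:

```python
# Chvátal-style selector for one cluster (assumed criterion: minimise cost per
# row covered; rows already covered elsewhere can be excluded via uncovered_rows).
def best_column(A, c, cluster, uncovered_rows=None):
    m = len(A)
    rows = set(range(m)) if uncovered_rows is None else set(uncovered_rows)

    def score(j):
        covered = sum(1 for i in rows if A[i][j] == 1)
        return c[j] / covered if covered else float("inf")

    return min(cluster, key=score)
```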

7
Description of the Clustering Heuristic (1)
  • Step 1: The initial centroids Kt, t = 1, ..., k are
    selected among the columns to be clustered.
  • Step 2.1: For every row (attribute) l = 1, ..., m common
    to column yj, j = 1, ..., n and each of the
    centroids Kt, t = 1, ..., k, we calculate the distance
    between them; the distance is expressed in terms of
    the relative frequencies of the values Ktl and yjl.
  • If this distance does not exceed ql, where ql is an
    indifference threshold, then row (attribute) l
    belongs to the concordance coalition, that is, it
    supports the proposition "column yj is similar to
    centroid Kt on row (attribute) l". Otherwise,
    row (attribute) l belongs to the discordance
    coalition.
  • Observation: The indifference threshold ql is
    valued automatically according to the
    number of distinct non-zero distances.
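A sketch of Step 2.1 is given below. Since the slide's distance formula is not reproduced, the sketch assumes a simple distance based on the relative frequencies of the two values (absolute difference of frequencies, zero when the values coincide); q is taken as a given vector of per-row indifference thresholds.

```python
# Sketch of Step 2.1 under the assumptions stated above.
from collections import Counter

def relative_frequencies(columns, row):
    """Relative frequency of each value appearing on a given row."""
    counts = Counter(col[row] for col in columns)
    total = len(columns)
    return {value: count / total for value, count in counts.items()}

def coalitions(column, centroid, columns, q):
    concordance, discordance = [], []
    for l in range(len(column)):                 # every row (attribute) l = 1..m
        if column[l] == centroid[l]:
            d = 0.0                              # identical values: zero distance
        else:
            freq = relative_frequencies(columns, l)
            d = abs(freq.get(centroid[l], 0.0) - freq.get(column[l], 0.0))  # assumed distance
        if d <= q[l]:                            # row supports "column similar to centroid"
            concordance.append(l)
        else:
            discordance.append(l)
    return concordance, discordance
```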

8
Description of the Clustering Heuristic (2)
  • Step 2.2: In order to choose the best of the
    centroids Kt, t = 1, ..., k and then assign the column
    to be clustered to its cluster, we calculate the
    concordance index CIt, t = 1, ..., k and the
    concordance threshold CTt, t = 1, ..., k,
  • where wc, wf, wp are weights of rows (attributes)
    and bonus is an automatically valued parameter,
    set to favour those clusters (1, ..., k) that contain
    as many zero differences as possible in every
    row.
  • We sort the concordance indices CIt, t = 1, ..., k and
    the corresponding thresholds CTt, t = 1, ..., k in
    descending order.
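The exact index and threshold formulas are not reproduced on the slide. The sketch below assumes an ELECTRE I style concordance index (the share of total row weight carried by the concordance coalition), reuses the coalitions helper from the Step 2.1 sketch, and omits the bonus term.

```python
# Sketch of Step 2.2 under the assumptions stated above.
def concordance_index(concordance, weights):
    """Share of the total row weight carried by the concordance coalition."""
    total = sum(weights)
    return sum(weights[l] for l in concordance) / total if total else 0.0

def rank_centroids(column, centroids, columns, q, weights):
    # coalitions() is the helper from the Step 2.1 sketch above
    scores = {t: concordance_index(coalitions(column, K, columns, q)[0], weights)
              for t, K in enumerate(centroids)}
    # sort the centroids by descending concordance index, as on the slide
    return sorted(scores, key=scores.get, reverse=True)
```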

9
Description of the Clustering Heuristic (3)
  • If the concordance index CIb satisfies the
    corresponding threshold CTb, where
    b = 1, ..., k denotes the k positions of the k
    centroids and of the corresponding indices and
    thresholds in the descending order stated above,
    then column j is assigned to cluster b.
    Otherwise, the process is repeated until the
    column is clustered.
  • Observation 1: The parameter m1 is
    the only parameter defined by the user. If m1 is
    0.7, this means that, for each column, the best of
    the centroids incorporates 70% of the total strength of
    the weights of the attributes in the concordance
    and discordance coalitions together.
  • Observation 2: The weighting of the rows l = 1, ..., m in
    the proposed algorithm is based on the density of
    1s in matrix A.
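The weighting formula itself is not reproduced on the slide. A minimal sketch, assuming that each row weight is simply the density of 1s in that row of A, normalised over all rows, is:

```python
# Sketch of the row weighting of Observation 2 under the stated assumption.
def row_weights(A):
    n = len(A[0])
    densities = [sum(row) / n for row in A]   # fraction of 1s in each row
    total = sum(densities)
    return [d / total for d in densities]     # normalise so the weights sum to 1
```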

10
Description of the Clustering Heuristic (4)
  • Step 2.3: This step confirms or rejects the
    allocation of a column to a cluster.
  • If j is the column to be clustered, t is the
    cluster assigned to the column according to step
    2.2, and l ranges over the rows that belong to the
    discordance coalition, then
  • if the distance on every row l of the discordance
    coalition stays within the veto threshold Ul,
    the clustering is confirmed, the
    centroid of the cluster is updated and we proceed
    to the next column. Otherwise, we return to step
    2.2.
  • Observation: The parameter Ul is called the veto
    threshold; it is automatically valued, and for
    every row l, the veto threshold Ul is greater than
    the indifference threshold ql.
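A sketch of this veto test, assuming that the assignment is confirmed only when the distance on every discordant row stays within its veto threshold Ul, could be:

```python
# Sketch of the veto test of Step 2.3; distances maps each row l of the
# discordance coalition to the distance computed in Step 2.1, and U maps
# each row to its veto threshold (with U[l] > q[l]).
def veto_confirms(discordance, distances, U):
    return all(distances[l] <= U[l] for l in discordance)
```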

11
Description of the Clustering Heuristic (5)
  • Step 3: We retest the dissimilarity of the columns
    against the current centroids according to Step
    2, re-assign every column that needs it to the proper
    cluster and update the centroids of both
    clusters involved.
  • Step 4: We repeat Step 3 until no column has
    changed clusters after a full cycle test.
  • Finding a new centroid for a cluster
  • Column xj is a centroid for the zero-one
    matrix A = (a_ij) if it minimizes the total
    dissimilarity DIS to the columns of the cluster.
  • If ncl is the number of columns of the cluster
    taking a given value on row l, and fr is the relative
    frequency of appearance of that value on the set of
    columns, then DIS is minimized iff, for every row
    l = 1, ..., m, the centroid takes the value with the
    highest relative frequency in the cluster.
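Since the minimiser on each row is the most frequent value in the cluster, as in k-modes (Huang, 1998), the centroid update can be sketched as follows.

```python
# Centroid update: take, on every row, the most frequent value among the
# columns currently assigned to the cluster.
from collections import Counter

def new_centroid(cluster_columns):
    m = len(cluster_columns[0])
    return [Counter(col[l] for col in cluster_columns).most_common(1)[0][0]
            for l in range(m)]
```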

12
Computational Experimentation
  • The algorithm was tested using datasets from the
    OR-Library of J.E. Beasley.
  • The datasets include problems of size 50 x 500
    and 200 x 1000.
  • In 80% of the tested datasets, the optimal
    solution was found, both in the 50 x 500 problems
    and in the 200 x 1000 problems.
  • The final SCP was solved using Premium Solver V8,
    and in particular the XPRESS-MP Solver Engine.

13
Conclusions
  • The new clustering heuristic
  • combines three different scientific fields (set
    covering, data mining, multicriteria analysis).
  • takes into consideration the weight of each row
    (attribute) in the clustering process rather than
    treating every row as equally weighted.
  • calculates these weights according to the density
    of 1s in the dataset.
  • analyzes the dataset in detail through the
    pairwise comparison of each column with each
    centroid on each row.
  • takes into consideration the possible objections
    of a minority of the rows (attributes) during the
    clustering process.
  • presents covering results as well as processing
    times that are very promising.

14
Bibliography
  • Beasley, J.E. (1987), An algorithm for set
    covering problem, European Journal of Operational
    Research, 31, pp. 85-93.
  • Huang, Z. (1998), Extensions to the k-means
    algorithm for clustering large data sets with
    categorical values, Data Mining and Knowledge
    Discovery, vol. 2, no. 3, pp. 283-304.
  • Huang, Z., Ng, M.K., Rong, H. and Li, Z. (2005),
    Automated variable weighting in k-means type
    clustering, IEEE Transactions on Pattern Analysis
    and Machine Intelligence, vol. 27, no. 5,
    pp. 657-668.
  • MacQueen, J.B. (1967), Some methods for
    classification and analysis of multivariate
    observations, In Proceedings of the 5th Berkeley
    Symposium on Mathematical Statistics and
    Probability, pp. 281-297.
  • Roy, B. (1968), Classement et choix en présence de
    points de vue multiples: La méthode ELECTRE,
    R.I.R.O., 8, pp. 57-75.
  • Roy, B. (1991), The outranking approach and the
    foundations of ELECTRE methods, Theory and
    Decision, 31, pp. 49-73.

15
Thank you for your attention