Software Clustering Based on Information Loss Minimization
1
Software Clustering Based on Information Loss
Minimization
  • Periklis Andritsos
  • University of Toronto
  • Vassilios Tzerpos
  • York University

The 10th Working Conference on Reverse
Engineering (WCRE), November 2003
2
The Software Clustering Problem
  • Input:
  • A set of software artifacts (files, classes)
  • Structural information, i.e. interdependencies
    between the artifacts (invocations, inheritance)
  • Non-structural information (timestamps,
    ownership)
  • Goal: Partition the artifacts into meaningful
    groups in order to help understand the software
    system at hand

3
Example
(Figure: an example decomposition into program files and utility files)
4
Open questions
  • Validity of clusters discovered based on
    high-cohesion and low-coupling
  • No guarantee that legacy software was developed
    in such a way
  • Discovering utility subsystems
  • Utility subsystems are low-cohesion /
    high-coupling
  • They commonly occur in manual decompositions
  • Utilizing non-structural information
  • What types of information have value?
  • LOC, timestamps, ownership, directory structure

5
Our goals
  • Create decompositions that convey as much
    information as possible about the artifacts they
    contain
  • Discover utility subsystems as well as subsystems
    based on high-cohesion and low-coupling
  • Evaluate the usefulness of any combination of
    structural and non-structural information

6
Information Theory Basics
  • Entropy H(A)
  • Measures the uncertainty in a random variable A
  • Conditional Entropy H(B|A)
  • Measures the uncertainty of a variable B, given a
    value for variable A
  • Mutual Information I(A;B)
  • Measures the dependence of two random variables A
    and B
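All three quantities can be computed directly from a joint distribution. A minimal sketch in Python (the artifact and feature names are made up for illustration):

```python
import math
from collections import defaultdict

def H(dist):
    """Entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def marginal(joint, idx):
    """Marginal over position idx of a joint distribution {(a, b): p}."""
    m = defaultdict(float)
    for outcome, p in joint.items():
        m[outcome[idx]] += p
    return dict(m)

def cond_entropy(joint):
    """H(B|A) = H(A,B) - H(A) for a joint distribution over (a, b) pairs."""
    return H(joint) - H(marginal(joint, 0))

def mutual_info(joint):
    """I(A;B) = H(A) + H(B) - H(A,B)."""
    return H(marginal(joint, 0)) + H(marginal(joint, 1)) - H(joint)

# Toy joint distribution: each artifact determines its feature completely,
# so H(B|A) = 0 and I(A;B) = H(B) = 1 bit.
joint = {("a1", "f1"): 0.5, ("a2", "f2"): 0.5}
```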

7
Information Bottleneck (IB) Method
  • A: a random variable that ranges over the
    artifacts to be clustered
  • B: a random variable that ranges over the
    artifacts' features
  • I(A;B): the mutual information of A and B
  • Information Bottleneck Method [TPB99]
  • Compress A into a clustering Ck so that the
    information preserved about B is maximal
    (k = number of clusters)
  • Optimization criterion:
  • minimize I(A;B) - I(Ck;B) ⇔ minimize H(B|Ck) -
    H(B|A)
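The equivalence of the two criteria follows from the identity I(X;Y) = H(Y) - H(Y|X), applied once to A and once to Ck:

```latex
\begin{align*}
I(A;B) - I(C_k;B) &= \bigl[H(B) - H(B|A)\bigr] - \bigl[H(B) - H(B|C_k)\bigr] \\
                  &= H(B|C_k) - H(B|A).
\end{align*}
```

Since H(B|A) is fixed by the data, minimizing the information loss amounts to minimizing H(B|Ck).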

8
Information Bottleneck Method
9
Agglomerative IB
  • Conceptualize the graph as an n×m matrix
    (artifacts by features)
  • Compute an n×n matrix indicating the information
    loss we would incur if we joined any two
    artifacts into a cluster
  • Merge the pair of tuples with the minimum
    information loss
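A naive sketch of this agglomerative step, assuming the standard IB information-loss formula (the merged weight times the weighted Jensen-Shannon divergence of the two clusters' feature distributions). For simplicity it recomputes every pair on each round rather than maintaining the n×n matrix the slide describes:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (bits); assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)

def info_loss(w_i, d_i, w_j, d_j):
    """Information loss of merging clusters i and j: (w_i + w_j) times the
    weighted Jensen-Shannon divergence of d_i = p(B|c_i) and d_j = p(B|c_j)."""
    w = w_i + w_j
    li, lj = w_i / w, w_j / w
    mix = [li * a + lj * b for a, b in zip(d_i, d_j)]
    return w * (li * kl(d_i, mix) + lj * kl(d_j, mix))

def aib(weights, dists, k):
    """Greedy agglomerative IB: merge the minimum-loss pair until k clusters
    remain. weights[i] = p(a_i); dists[i] = p(B|a_i) as a list.
    Returns (members, weight, distribution) triples."""
    clusters = [([i], w, d) for i, (w, d) in enumerate(zip(weights, dists))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                loss = info_loss(clusters[i][1], clusters[i][2],
                                 clusters[j][1], clusters[j][2])
                if best is None or loss < best[0]:
                    best = (loss, i, j)
        _, i, j = best
        mi, wi, di = clusters[i]
        mj, wj, dj = clusters[j]
        w = wi + wj
        merged = [(wi * a + wj * b) / w for a, b in zip(di, dj)]
        del clusters[j]          # j > i, so deleting j leaves index i in place
        clusters[i] = (mi + mj, w, merged)
    return clusters
```

For example, clustering three equally weighted artifacts where the first two share an identical feature distribution merges those two first, at zero information loss.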

10
Adding Non-Structural Data
  • If we have information about the developer and
    location of files, we express the artifacts to be
    clustered using a new matrix
  • Instead of B we use B′, which also includes the
    non-structural data
  • We can compute I(A;B′) and proceed as before
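One way the extended matrix might be built, with hypothetical file names and developer data: each non-structural attribute simply contributes extra one-hot columns alongside the structural dependency columns:

```python
# Hypothetical example data: structural dependencies plus one
# non-structural attribute (developer) per artifact.
structural = {            # artifact -> artifacts it depends on
    "f1.c": ["f2.c"],
    "f2.c": ["f1.c", "f3.c"],
    "f3.c": [],
}
developer = {"f1.c": "alice", "f2.c": "alice", "f3.c": "bob"}

features = sorted({d for deps in structural.values() for d in deps})
devs = sorted(set(developer.values()))

def row(artifact):
    # Structural columns first, then one one-hot column per developer.
    return ([1 if f in structural[artifact] else 0 for f in features]
            + [1 if developer[artifact] == d else 0 for d in devs])

matrix = [row(a) for a in sorted(structural)]
```

Other attributes (directory, LOC bucket, last-update period) would extend the rows the same way.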

11
ScaLable InforMation BOttleneck
  • AIB has quadratic complexity since we need to
    compute an n×n distance matrix.
  • LIMBO algorithm
  • Produce summaries of the artifacts
  • Apply agglomerative clustering on the summaries
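A rough sketch of the summarization phase (the actual LIMBO implementation uses a summary tree; the single threshold-based pass below is a simplification for illustration):

```python
import math

def js_loss(w1, d1, w2, d2):
    """Information loss of merging two weighted feature distributions
    (weighted Jensen-Shannon divergence)."""
    w = w1 + w2
    mix = [(w1 * a + w2 * b) / w for a, b in zip(d1, d2)]
    def kl(p, q):
        return sum(x * math.log2(x / y) for x, y in zip(p, q) if x > 0)
    return w1 * kl(d1, mix) + w2 * kl(d2, mix)

def summarize(weights, dists, threshold):
    """Phase 1 of the LIMBO idea: fold each artifact into the closest existing
    summary when the information loss stays below `threshold`; otherwise open
    a new summary. Agglomerative clustering then runs on the (far fewer)
    summaries instead of the raw artifacts."""
    summaries = []  # list of [weight, distribution]
    for w, d in zip(weights, dists):
        best, best_loss = None, None
        for s in summaries:
            loss = js_loss(s[0], s[1], w, d)
            if best_loss is None or loss < best_loss:
                best, best_loss = s, loss
        if best is not None and best_loss <= threshold:
            tot = best[0] + w
            best[1] = [(best[0] * a + w * b) / tot for a, b in zip(best[1], d)]
            best[0] = tot
        else:
            summaries.append([w, list(d)])
    return summaries
```

With a small threshold, artifacts with near-identical feature distributions collapse into one summary, shrinking the input of the quadratic agglomerative phase.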

12
Experimental Evaluation
  • Data Sets
  • TOBEY: 939 files / 250,000 LOC
  • LINUX: 955 files / 750,000 LOC
  • Clustering Algorithms
  • ACDC: pattern-based
  • BUNCH: adheres to high-cohesion and low-coupling
  • NAHC, SAHC
  • Cluster Analysis Algorithms
  • Single linkage (SL)
  • Complete linkage (CL)
  • Weighted average linkage (WA)
  • Unweighted average linkage (UA)

13
Experimental Evaluation
  • Compared output of different algorithms using
    MoJo
  • MoJo measures the number of Move/Join operations
    needed to transform one clustering to another.
  • The smaller the MoJo value of a particular
    clustering, the more effective the algorithm that
    produced it.
  • We compute MoJo with respect to an authoritative
    decomposition

14
Structural Feature Results
LIMBO found utility clusters
15
Non-Structural Feature Results
  • We considered all possible combinations of
    structural and non-structural features.
  • Non-Structural Features available only for Linux
  • Developers (dev)
  • Directory (dir)
  • Lines of Code (loc)
  • Time of Last Update (time)
  • For each combination we report the number of
    clusters k at which the MoJo values for k and k+1
    clusters differ by one.

16
Non-Structural Feature Results
  • Eight combinations outperform the purely
    structural results.
  • Dir information produced better decompositions.
  • Dev information has a positive effect.
  • Time information leads to worse clusterings.