Title: Searching for the Boundaries of Concepts in Code
1Searching for the Boundaries of Concepts in Code
- Part of the CONTRACTS Project
2Concept Assignment to Raise the Abstraction Level
of Slicing
- Professor Mark Harman, King's College London
- Dr Nicolas Gold, King's College London
- Kiarash Mahdavi, King's College London
- Zheng Li, King's College London
- Professor Rob Hierons, Brunel University
- Professor Dave Binkley, Loyola College, USA
- Professor Jim Cordy, Queen's University, Canada
- DaimlerChrysler
- Knowledge Software
3Concept Assignment
- First defined in 1993
- The process of assigning descriptive terms to
their implementation in source code (and possibly
to each other), the terms being nominated by a
maintainer and usually relating to computational
intent.
4Concept Assignment
- Strength
- More abstract understanding of software.
- Useful for comprehension.
- Weakness
- Usually not executable.
- Cannot be directly applied to testing, refactoring and other useful software maintenance operations.
5Slicing
- Process of extracting or isolating the parts of a program that depend on, or are depended upon by, a specified element of the program.
- Strength
- Executable.
- Can be used for testing, maintenance and software reuse.
- Weakness
- Effective use of slicing requires good code understanding.
6Concept Assignment and Slicing
- Concept Assignment provides good code understanding, but may not be executable.
- Slicing is executable, but effective slicing requires good code understanding.
7Concept Assignment and Slicing Potential
- Potential to create more expressive slices
- Potential for new types of analysis, e.g. concept-level impact analysis
- Could also facilitate
- Reuse/reengineering
- Comprehension/reverse engineering
- Domain model improvement
8Now back to the presentation
- Our current Concept Assignment algorithm (HB-CA) provides good-quality assignments but does not consider overlapping concepts.
- But concepts do overlap.
MOVE EXAMPLE TO PRINT-LL.
MOVE POLICY-NUM TO OUT-PNUM.
MOVE 13 TO PRINT-CC.
MOVE SCHEME-REF TO OUT-SREF.
CALL PRINT USING P-PRINTLINE.
CALL WRITE USING OUT-REC.
9Hypothesis Based Concept Assignment
- HB-CA requirements
- Source code
- Library containing indicators and concept relationships, provided by the maintainer
10Library
- Provided by the maintainer
- Contains indicators and concept relationships
- Indicators are used to scan the software and flag the possible presence of concepts (hypotheses)
- Concept relationships describe how related concepts can be combined to identify complex concepts
11Concept relationship Example
[Diagram: example concept relationships. Labels in source order: Action, Write; Object, Record, Database, File; Specialisation, Transaction, PaymentFile.]
12HB-CA Process
- Hypothesis Generation: using indicators to locate concept hypotheses and create an ordered list of them (the Hypothesis List).
- Segmentation: using a Self-Organising Map to cluster the Hypothesis List and identify areas of conceptual focus.
- Concept Binding: scoring the clusters and allocating (binding) a simple or complex concept to each.
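The hypothesis generation stage can be pictured with a minimal Python sketch. The indicator table, the `Hypothesis` tuple and the plain substring matching are illustrative assumptions, not the HB-CA implementation; the COBOL lines are borrowed from the earlier overlap example.

```python
from typing import NamedTuple

class Hypothesis(NamedTuple):
    line: int       # source line on which the indicator was found
    concept: str    # concept that the indicator suggests

# Hypothetical indicator library: indicator text -> concept name.
INDICATORS = {
    "WRITE": "Write",
    "PRINT": "Print",
    "OUT-REC": "Record",
}

def generate_hypotheses(source_lines, indicators=INDICATORS):
    """Scan the source in order and emit one hypothesis per indicator hit."""
    hypotheses = []
    for number, text in enumerate(source_lines, start=1):
        upper = text.upper()
        for indicator, concept in indicators.items():
            if indicator in upper:   # assumption: simple substring match
                hypotheses.append(Hypothesis(number, concept))
    return hypotheses

source = [
    "MOVE POLICY-NUM TO OUT-PNUM.",
    "CALL PRINT USING P-PRINTLINE.",
    "CALL WRITE USING OUT-REC.",
]
hypothesis_list = generate_hypotheses(source)
```

Because the scan follows source order, the resulting list is already ordered by position, which is the property the segmentation stage relies on.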
13Hypothesis Generation
[Diagram: the Source code (lines 6 to 9 shown) is scanned against the Indicator Library to produce the Hypothesis List, e.g. ... Output, File, Output ...]
14Segmentation
- Currently, Action hypotheses are used to train the SOM.
- The SOM is used to identify non-overlapping (isolated) clusters within hard segments.
15Concept Binding
- Each cluster created by the SOM is examined.
- All possible simple or composite permutations of concepts are scored.
- The highest-scoring concept is selected and bound to the partition.
16Concept binding example
[Diagram: the Hypothesis List generated from the Source code (... Output, File, Output ...) is clustered by the Segmentation Algorithm (SOM); a cluster (Output, File) is scored against the Concept Relationship Library and bound to the concept Write Customer Record.]
17Overlapping Concept Problem Definition
- Find a set of concepts that
- Creates the strongest concept bindings.
- Is not restricted by boundaries.
- Covers as much of the segment as possible.
18Algorithms
- Genetic Algorithm
- Evolves a population of solutions (chromosomes) using operators such as mutation and crossover.
- Guided by a fitness function.
- Hill Climbing
- Searches through the local neighbourhood of a solution.
- Guided by a fitness function.
- Tries to escape local optima.
- Random search.
19Chromosome
- Represents a clustering solution to a hard segment
- Composed of genes, each representing a cluster
- Genes may have an on/off switch, which is used by the fitness function
- Variable number of genes
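One way to picture this representation is the following Python sketch. The class and field names are hypothetical, and treating a gene as an inclusive `(start, end)` span over the hypothesis list is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Gene:
    start: int          # index of the first hypothesis in the cluster (inclusive)
    end: int            # index of the last hypothesis in the cluster (inclusive)
    on: bool = True     # assumption: switched-off genes are skipped when scoring

    def length(self) -> int:
        return self.end - self.start + 1

@dataclass
class Chromosome:
    # A variable-length list of genes; clusters are allowed to overlap.
    genes: List[Gene] = field(default_factory=list)

    def active(self) -> List[Gene]:
        """Genes that are currently switched on."""
        return [g for g in self.genes if g.on]

# Three clusters over a nine-hypothesis segment, one of them switched off.
chromosome = Chromosome([Gene(0, 2), Gene(2, 5), Gene(4, 8, on=False)])
active_lengths = [g.length() for g in chromosome.active()]
```

Keeping the switch on the gene, rather than deleting the gene, means a cluster that is turned off during evaluation stays available to later search operators.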
20Chromosome Structure
[Diagram: a Chromosome drawn as a row of Genes, each mapped onto a span of the hypothesis list.]
21Fitness Function
- All algorithms use the same fitness function.
- Used to evaluate a potential solution (a set of clusters) against a set of desirable characteristics.
- Initially, a winning concept is identified for each cluster (concept binding).
- The set of clusters is then evaluated according to the fitness function.
22Fitness Function
- Concept binding strength
- Segment coverage
23Fitness Function Examples
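[Figure: worked fitness example on the hypothesis list A A B Z Z B B Z Z, with columns for cluster fitness, coverage and cluster length. The per-cluster values and the operators in the formula did not survive extraction; the reported result is Fitness = (8 7)/((211)9) = 0.483.]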
24Fitness Function
- After the fitness of every cluster (each determined by its corresponding gene) has been evaluated, overlapping clusters that share the same winning concept are compared and the lower-fitness ones are turned off.
- Complete overlap is also not allowed; the corresponding gene is turned off.
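The two switch-off rules might be sketched in Python as below. This is an illustration under stated assumptions: "complete overlap" is read as one cluster's span lying wholly inside another's, the less fit cluster of a conflicting pair is the one turned off, and clusters are compared only against clusters that are still on. The `Cluster` class and the fitness values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Cluster:
    start: int
    end: int            # inclusive
    concept: str        # winning concept from concept binding
    fitness: float
    on: bool = True

def overlaps(a, b):
    return a.start <= b.end and b.start <= a.end

def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end

def switch_off(clusters):
    # Visit fitter clusters first so that they win any conflict.
    ranked = sorted(clusters, key=lambda c: c.fitness, reverse=True)
    kept = []
    for c in ranked:
        for k in kept:
            same_concept = overlaps(k, c) and k.concept == c.concept
            complete = contains(k, c) or contains(c, k)   # assumed meaning
            if same_concept or complete:
                c.on = False
                break
        else:
            kept.append(c)   # no conflict with any fitter cluster
    return clusters

clusters = [
    Cluster(0, 3, "Write", 0.6),
    Cluster(2, 5, "Write", 0.4),   # overlaps a fitter "Write" cluster
    Cluster(4, 8, "Print", 0.5),   # wholly contains the fitter "Read" cluster
    Cluster(5, 6, "Read", 0.9),
]
flags = [c.on for c in switch_off(clusters)]
```

Partial overlap between clusters bound to different concepts is left alone, since that is the overlapping-concept case the search is trying to find.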
25Genetic Algorithm
- Tournament selection
- 0.99 coefficient
- Flexible Stopping condition
- fitness stagnation over period of 50 generations
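A minimal Python sketch of these two settings follows. The slides do not say what the 0.99 coefficient controls; the sketch assumes it is the probability that the fitter of two randomly drawn contestants wins the tournament. The function names and the `best_history` list (best fitness per generation) are hypothetical.

```python
import random

def tournament_select(population, fitness, rng, p_best=0.99):
    """Draw two distinct individuals; return the fitter one with probability p_best."""
    a, b = rng.sample(population, 2)
    better, worse = (a, b) if fitness(a) >= fitness(b) else (b, a)
    return better if rng.random() < p_best else worse

def stagnated(best_history, window=50):
    """True once the last `window` generations brought no improvement in best fitness."""
    if len(best_history) <= window:
        return False   # not enough generations to judge yet
    return max(best_history[-window:]) <= max(best_history[:-window])

# Usage: select a parent from a toy population scored by its own value.
rng = random.Random(0)
parent = tournament_select([3, 7, 5, 1], fitness=lambda x: x, rng=rng)
```

A stagnation test of this kind makes the run length depend on the segment rather than on a fixed generation count, which is one reading of "flexible stopping condition".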
26Genetic Algorithm
- Crossover
- Cluster boundaries of the genes are used to create new genes (clusters)
- Only genes that are switched on are used
- Crossover rate 0.8
- Mutation
- Consists of randomly changing the start or end location of a cluster within a gene
- Mutation rate 0.001
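The mutation operator could look roughly like this in Python. Applying the rate per gene, choosing start or end with equal probability, and keeping the new boundary inside the segment so that start never exceeds end are all assumptions made for the sketch; genes are shown as plain `(start, end)` pairs with inclusive indices.

```python
import random

def mutate(genes, segment_length, rng, rate=0.001):
    """Return a copy of `genes` with some cluster boundaries randomly changed."""
    mutated = []
    for start, end in genes:
        if rng.random() < rate:
            if rng.random() < 0.5:
                start = rng.randint(0, end)                    # new start, not after end
            else:
                end = rng.randint(start, segment_length - 1)   # new end, not before start
        mutated.append((start, end))
    return mutated

# Usage: a high rate is used here only to make the effect visible.
genes = [(0, 2), (3, 8)]
child = mutate(genes, segment_length=9, rng=random.Random(0), rate=1.0)
```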
27Hill Climbing
- Start from a random solution
- Evaluate the fitness of the solution's neighbourhood
- Move to a fitter neighbouring solution
- Try to escape on reaching a local optimum
28Hill Climbing
- Two Stages
- Local search stage (moving through neighbours)
- Evaluate the neighbourhood of current chromosomes
- Escape from local optimum
- Crossover the Genes within the current chromosome
29Hill Climbing
- Local search
- Operations to examine neighbours
- Move: increase or decrease both boundaries by 1 hypothesis
- Resize: move one of the cluster boundaries by 1 hypothesis
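The Move and Resize operations can be enumerated for a single cluster as in the Python sketch below. Representing a cluster as an inclusive `(start, end)` pair and discarding neighbours that fall outside the segment are assumptions of the sketch.

```python
def neighbours(start, end, segment_length):
    """All clusters reachable from (start, end) by one Move or one Resize."""
    candidates = [
        (start - 1, end - 1), (start + 1, end + 1),   # Move: shift both boundaries
        (start - 1, end), (start + 1, end),           # Resize: shift the start only
        (start, end - 1), (start, end + 1),           # Resize: shift the end only
    ]
    # Keep only clusters that stay inside the segment and are not empty.
    return [(s, e) for s, e in candidates if 0 <= s <= e < segment_length]

# Usage: a cluster in the middle of a nine-hypothesis segment has six neighbours.
middle = neighbours(2, 4, segment_length=9)
corner = neighbours(0, 0, segment_length=3)
```

Each cluster has at most six neighbours under these two operations, so one local search step costs at most six fitness evaluations per gene.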
30Hill Climbing
- Escape from local optimum
- Crossover Genes at random
- Add the Genes that improve on the overall fitness
to the solution
31Random Search
- For comparison purposes
- Randomly generate and evaluate chromosomes
- Stop on reaching the largest number of fitness calculations used by GA or HC
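A budget-limited random search of this kind might be sketched as follows. The chromosome generator, the `max_genes` limit and the `coverage` function are placeholders introduced for illustration; in particular, `coverage` is a toy stand-in and not the fitness function used in the presentation.

```python
import random

def random_chromosome(segment_length, rng, max_genes=5):
    """Generate a random set of (start, end) clusters, inclusive indices."""
    genes = []
    for _ in range(rng.randint(1, max_genes)):
        start = rng.randrange(segment_length)
        end = rng.randint(start, segment_length - 1)
        genes.append((start, end))
    return genes

def coverage(genes, segment_length):
    """Toy fitness: fraction of the segment covered by at least one cluster."""
    covered = {i for s, e in genes for i in range(s, e + 1)}
    return len(covered) / segment_length

def random_search(fitness, segment_length, budget, rng):
    """Evaluate `budget` random chromosomes and keep the best one seen."""
    best, best_fitness = None, float("-inf")
    for _ in range(budget):
        candidate = random_chromosome(segment_length, rng)
        score = fitness(candidate)
        if score > best_fitness:
            best, best_fitness = candidate, score
    return best, best_fitness

# Usage: the budget would be set to the largest evaluation count used by GA or HC.
best, best_fitness = random_search(
    lambda g: coverage(g, 9), segment_length=9, budget=200, rng=random.Random(1)
)
```

Giving random search the same evaluation budget as the guided searches keeps the comparison about search strategy rather than effort.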
32Experimental details
- 21 COBOL II programs
- 10 runs per segment for each search type
33GA and HC Median fitness ordered by Segment Size
34GA and Random Median Fitness ordered by Segment
Size
35HC and Random Median Fitness ordered by Segment
Size
36GA and HC Cost Comparison ordered by Segment Size
37Conclusions
- GA does best, and HC worst.
- A better definition of HC neighbouring solutions
may be helpful.
38Discussion / Future Work
- Alternative graphical representation to display the distribution of results clearly.
- Complexity of searching for overlapping clusters.
- Statistical evaluation of the significance of the results.