Title: ICSE PROMISE 2005
1. Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
2. The Problem: Why is there not more ML in Software Engineering?
- Machine Learning: 7 to 16
- Human-Based: 62 to 86 (Jørgensen 2004 [1])
3. Key Idea
- More ML in SE through a better-defined experimental process.
4. Agenda
- A better-defined process for better (quality) prediction
- Experiments: Nearest Neighbor Sampling on the PROMISE defect data sets
- Extending the approach
- Discussion
- Conclusions
5. A Better Defined Process
- Emphasis on ML approaches
- Emphasis on measuring success (sketched below):
  - PRED(X)
  - Accuracy
  - MARE
- Prediction success depends upon the relationship between training and test data.
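These measures are standard in the estimation literature; the minimal sketch below shows how they are commonly computed (the function names and NumPy usage are mine, not from the slides, and the relative-error measures assume nonzero actual values):

```python
import numpy as np

def pred(actual, predicted, x=25):
    """PRED(X): fraction of predictions within x% of the actual value."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual <= x / 100.0)

def mare(actual, predicted):
    """MARE: mean absolute relative error over all predictions."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def accuracy(actual_cls, predicted_cls):
    """Accuracy: fraction of correctly classified instances."""
    return np.mean(np.asarray(actual_cls) == np.asarray(predicted_cls))
```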
6. PROMISE Defect Data (from NASA)
- 21 inputs:
  - Size (SLOC, comments)
  - Complexity (McCabe cyclomatic complexity)
  - Vocabulary (Halstead operators, operands)
- 1 output: number of defects
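A hedged sketch of loading one of these sets; the file name and column layout here are assumptions for illustration (the real PROMISE files name each of the 21 metrics individually):

```python
import pandas as pd

data = pd.read_csv("cm1.csv")        # hypothetical file name
X = data.iloc[:, :21].to_numpy()     # the 21 size/complexity/vocabulary inputs
y = data.iloc[:, 21].to_numpy()      # output: number of defects
y_class = (y > 0).astype(int)        # binary view: defect-free vs. defective
```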
7. Data Preprocessing
8. Experiment 1
9. Experiment 1 Continued
[Diagram: the remaining vectors from the data set (those not used for training) are sampled into a Nice test set and a Nasty test set.]
10. Experiment 1 Continued
- J48 and Naïve Bayes classifiers from WEKA
- 200 trials (100 Nice test, 100 Nasty test) over five data sets:
  - CM1
  - JM1
  - KC1
  - KC2
  - PC1
- 20 Nice trials and 20 Nasty trials per data set
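The slides do not show the sampling code; the sketch below is one plausible reading of how Nice and Nasty test vectors can be separated, using the nearest-neighbor match rule defined on slide 14 (scikit-learn stands in for the original tooling, and the function name is mine):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_nice_nasty(X_train, y_train, X_test, y_test):
    """Nice: a test vector's nearest training neighbor has the same class.
    Nasty: the nearest training neighbor has a different class."""
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    nice = y_train[idx[:, 0]] == y_test
    return nice, ~nice               # boolean masks over the test vectors
```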
11. Results: Accuracy
12. Results: Average Confusion Matrix
[Table: the average confusion matrix for the Nasty results, with classes "0 defects" and "1 defects". Note the distribution.]
13. Experiment 2: 60 Train, KNN-3
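Assuming "60 Train" denotes a 60% training split and KNN-3 a k = 3 nearest-neighbor classifier, a minimal reproduction (with X and y_class as in the slide-6 sketch) might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 60% of the vectors train a k=3 nearest-neighbor classifier;
# the remaining 40% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, train_size=0.60, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))  # fraction correct on held-out data
```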
14. Assessing Experiment Difficulty
- Exp_Difficulty = 1 - (Matches / Total_Test_Instances)
- Match: a test vector's nearest neighbor in the training set belongs to the same class.
- Experimental Difficulty = 1: hard experiment; Experimental Difficulty = 0: easy experiment.
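A minimal sketch of this measure (the function name is mine; scikit-learn provides the nearest-neighbor search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def experiment_difficulty(X_train, y_train, X_test, y_test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances."""
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    matches = np.sum(y_train[idx[:, 0]] == y_test)  # same-class nearest neighbors
    return 1.0 - matches / len(y_test)              # 1 = hard, 0 = easy
```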
15. Assessing Overall Data Difficulty
- Overall_Data_Difficulty = 1 - (Matches / Total_Data_Instances)
- Match: a data vector's nearest neighbor among the other vectors in the data set belongs to the same class.
- Overall Data Difficulty = 1: difficult data; Overall Data Difficulty = 0: easy data.
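The same idea applied to a whole data set; since each vector is (normally) its own nearest neighbor, this sketch asks for two neighbors and skips the first:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overall_data_difficulty(X, y):
    """Overall_Data_Difficulty = 1 - Matches / Total_Data_Instances."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the vector itself
    matches = np.sum(y[idx[:, 1]] == y)       # nearest *other* vector, same class?
    return 1.0 - matches / len(y)             # 1 = difficult, 0 = easy
```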
16. Discussion: Anticipated Benefits
- Method for characterizing the difficulty of an experiment
- More realistic models
- Easy to implement
- Can be integrated into N-way cross-validation (see the sketch after this list)
- Can apply to various types of SE data sets:
  - Defect prediction
  - Effort estimation
- Can be extended beyond SE to other domains
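One way the cross-validation integration could look, reusing experiment_difficulty() from the slide-14 sketch (the fold count and shuffling are my assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold

def difficulty_per_fold(X, y, n_splits=10):
    """Report the experimental difficulty of each fold in N-way cross-validation."""
    X, y = np.asarray(X), np.asarray(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return [experiment_difficulty(X[tr], y[tr], X[te], y[te])
            for tr, te in kf.split(X)]
```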
17. Discussion: Potential Problems
- More work needs to be done
- Agreement needed on how to measure Experimental Difficulty
- Extra overhead
- Implicitly or explicitly data-starved domains
18. Conclusions
- How to get more ML in SE? Assess experiments and data for their difficulty.
- Benefits:
  - More credibility for the modeling process
  - More reliable predictors
  - More realistic models
19. Acknowledgements
Thanks to the reviewers for their comments!
20. References
1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.