Title: ICSE PROMISE 2005
1. Nearest Neighbor Sampling for Better Defect Prediction
Gary D. Boetticher
Department of Software Engineering
University of Houston - Clear Lake
Houston, Texas, USA
2. The Problem: Why is there not more ML in Software Engineering?
- Machine Learning: 7 to 16
- Human-Based: 62 to 86 (Jørgensen 2004 [1])
3. Key Idea
- More ML in SE through a better-defined experimental process.
4. Agenda
- A better-defined process for better (quality) prediction
- Experiments: Nearest Neighbor Sampling on the PROMISE defect data sets
- Extending the approach
- Discussion
- Conclusions
5. A Better Defined Process
- Emphasis on ML approaches
- Emphasis on measuring success (sketched below):
  - PRED(X)
  - Accuracy
  - MARE
- Prediction success depends upon the relationship between training and test data.
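These measures are standard in the estimation literature; the minimal sketch below shows how they are commonly computed (the function names and NumPy usage are mine, not from the slides, and the relative-error measures assume nonzero actual values):

```python
import numpy as np

def pred(actual, predicted, x=25):
    """PRED(X): fraction of predictions within x% of the actual value."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual <= x / 100.0)

def mare(actual, predicted):
    """MARE: mean absolute relative error over all predictions."""
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return np.mean(np.abs(actual - predicted) / actual)

def accuracy(actual_cls, predicted_cls):
    """Accuracy: fraction of correctly classified instances."""
    return np.mean(np.asarray(actual_cls) == np.asarray(predicted_cls))
```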
6. PROMISE Defect Data (from NASA)
- 21 inputs:
  - Size (SLOC, comments)
  - Complexity (McCabe cyclomatic complexity)
  - Vocabulary (Halstead operators, operands)
- 1 output: number of defects
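A hedged sketch of loading one of these sets; the file name and column layout here are assumptions for illustration (the real PROMISE files name each of the 21 metrics individually):

```python
import pandas as pd

data = pd.read_csv("cm1.csv")        # hypothetical file name
X = data.iloc[:, :21].to_numpy()     # the 21 size/complexity/vocabulary inputs
y = data.iloc[:, 21].to_numpy()      # output: number of defects
y_class = (y > 0).astype(int)        # binary view: defect-free vs. defective
```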
7. Data Preprocessing
8. Experiment 1
9. Experiment 1 Continued
[Diagram: the remaining vectors from the data set (those not used for training) are sampled into a Nice test set and a Nasty test set.]
10. Experiment 1 Continued
- J48 and Naïve Bayes classifiers from WEKA
- 200 trials (100 Nice test, 100 Nasty test) over five data sets:
  - CM1
  - JM1
  - KC1
  - KC2
  - PC1
- 20 Nice trials and 20 Nasty trials per data set
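The slides do not show the sampling code; the sketch below is one plausible reading of how Nice and Nasty test vectors can be separated, using the nearest-neighbor match rule defined on slide 14 (scikit-learn stands in for the original tooling, and the function name is mine):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_nice_nasty(X_train, y_train, X_test, y_test):
    """Nice: a test vector's nearest training neighbor has the same class.
    Nasty: the nearest training neighbor has a different class."""
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    nice = y_train[idx[:, 0]] == y_test
    return nice, ~nice               # boolean masks over the test vectors
```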
11. Results: Accuracy
12. Results: Average Confusion Matrix
[Table: the average confusion matrix for the Nasty results, with classes "0 defects" and "1 defects". Note the distribution.]
13. Experiment 2: 60 Train, KNN-3
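Assuming "60 Train" denotes a 60% training split and KNN-3 a k = 3 nearest-neighbor classifier, a minimal reproduction (with X and y_class as in the slide-6 sketch) might look like:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 60% of the vectors train a k=3 nearest-neighbor classifier;
# the remaining 40% are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_class, train_size=0.60, random_state=0)
model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("accuracy:", model.score(X_test, y_test))  # fraction correct on held-out data
```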
14. Assessing Experiment Difficulty
- Exp_Difficulty = 1 - (Matches / Total_Test_Instances)
- Match: a test vector's nearest neighbor in the training set belongs to the same class.
- Experimental Difficulty = 1: hard experiment; Experimental Difficulty = 0: easy experiment.
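A minimal sketch of this measure (the function name is mine; scikit-learn provides the nearest-neighbor search):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def experiment_difficulty(X_train, y_train, X_test, y_test):
    """Exp_Difficulty = 1 - Matches / Total_Test_Instances."""
    y_train, y_test = np.asarray(y_train), np.asarray(y_test)
    nn = NearestNeighbors(n_neighbors=1).fit(X_train)
    _, idx = nn.kneighbors(X_test)
    matches = np.sum(y_train[idx[:, 0]] == y_test)  # same-class nearest neighbors
    return 1.0 - matches / len(y_test)              # 1 = hard, 0 = easy
```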
15. Assessing Overall Data Difficulty
- Overall_Data_Difficulty = 1 - (Matches / Total_Data_Instances)
- Match: a data vector's nearest neighbor among the other vectors in the data set belongs to the same class.
- Overall Data Difficulty = 1: difficult data; Overall Data Difficulty = 0: easy data.
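The same idea applied to a whole data set; since each vector is (normally) its own nearest neighbor, this sketch asks for two neighbors and skips the first:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def overall_data_difficulty(X, y):
    """Overall_Data_Difficulty = 1 - Matches / Total_Data_Instances."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=2).fit(X)
    _, idx = nn.kneighbors(X)                 # idx[:, 0] is the vector itself
    matches = np.sum(y[idx[:, 1]] == y)       # nearest *other* vector, same class?
    return 1.0 - matches / len(y)             # 1 = difficult, 0 = easy
```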
16. Discussion: Anticipated Benefits
- Method for characterizing the difficulty of an experiment
- More realistic models
- Easy to implement
- Can be integrated into N-way cross-validation (see the sketch after this list)
- Can apply to various types of SE data sets:
  - Defect prediction
  - Effort estimation
- Can be extended beyond SE to other domains
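One way the cross-validation integration could look, reusing experiment_difficulty() from the slide-14 sketch (the fold count and shuffling are my assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold

def difficulty_per_fold(X, y, n_splits=10):
    """Report the experimental difficulty of each fold in N-way cross-validation."""
    X, y = np.asarray(X), np.asarray(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    return [experiment_difficulty(X[tr], y[tr], X[te], y[te])
            for tr, te in kf.split(X)]
```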
17. Discussion: Potential Problems
- More work needs to be done
- Agreement needed on how to measure Experimental Difficulty
- Extra overhead
- Implicitly or explicitly data-starved domains
18. Conclusions
- How to get more ML in SE? Assess experiments and data for their difficulty.
- Benefits:
  - More credibility for the modeling process
  - More reliable predictors
  - More realistic models
19. Acknowledgements
Thanks to the reviewers for their comments!
20. References
1) M. Jørgensen, "A Review of Studies on Expert Estimation of Software Development Effort," Journal of Systems and Software, Vol. 70, Issues 1-2, 2004, pp. 37-60.