1
An Approach to Software Testing of Machine
Learning Applications
  • Chris Murphy, Gail Kaiser, Marta Arias
  • Columbia University

2
Introduction
  • We are investigating the quality assurance of
    Machine Learning (ML) applications
  • Currently we are concerned with a real-world
    application for potential future use in
    predicting electrical device failures
  • Machine Learning applications fall into a class
    for which it can be said that there is no
    reliable oracle
  • These are also known as non-testable programs
    and could fall into Davis and Weyuker's class of
    programs "which were written in order to
    determine the answer in the first place. There
    would be no need to write such programs if the
    correct answer were known."

3
Introduction
  • We have developed an approach to creating test
    cases for Machine Learning applications
  • Analyze the problem domain and real-world data
    sets
  • Analyze the algorithm as it is defined
  • Analyze an implementation's runtime options
  • Our approach was designed for MartiRank and then
    generalized to other ranking algorithms such as
    Support Vector Machines (SVM)

4
Overview
  • Machine Learning Background
  • Testing Approach and Framework
  • Findings and Results
  • Evaluation and Observations
  • Future Work

5
Machine Learning Fundamentals
  • Data sets consist of a number of examples, each
    of which has attributes and a label
  • In the first phase (training), a model is
    generated that attempts to generalize how
    attributes relate to the label
  • In the second phase, the model is applied to a
    previously-unseen data set (testing data) with
    unknown labels to produce a classification (or,
    in our case, a ranking)
  • This can be used for validation or for prediction
    (the two phases are sketched below)
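To make the two phases concrete, here is a minimal, self-contained Python sketch (our own illustration, not the authors' code); the toy "model" is just the index of the one attribute that best separates the labels.

```python
# Toy two-phase workflow: train on labeled examples, then rank unseen data.
# Each example is (attributes, label); the "model" is a single attribute index.

def train(examples):
    """Phase 1 (training): pick the attribute whose descending order
    puts the most positive labels in the top half."""
    n_attrs = len(examples[0][0])
    def agreement(i):
        ranked = sorted(examples, key=lambda ex: ex[0][i], reverse=True)
        return sum(label for _, label in ranked[: len(ranked) // 2])
    return max(range(n_attrs), key=agreement)

def apply_model(model, test_attrs):
    """Phase 2 (testing): rank previously-unseen examples by the
    learned attribute; labels are unknown at this point."""
    return sorted(test_attrs, key=lambda attrs: attrs[model], reverse=True)

training = [((0.9, 0.1), 1), ((0.8, 0.7), 1), ((0.2, 0.9), 0), ((0.1, 0.3), 0)]
model = train(training)
print(apply_model(model, [(0.5, 0.2), (0.95, 0.4), (0.05, 0.8)]))
```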

6
MartiRank and SVM
  • MartiRank was specifically designed for the
    device failure application
  • Seeks to find the combination of segmenting and
    sorting the data that produces the best result
    (a rough sketch follows below)
  • SVM is typically a classification algorithm
  • Seeks to find a hyperplane that separates
    examples from different classes
  • Different kernels use different approaches
  • SVM-Light has a ranking mode based on the
    distance from the hyperplane
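A rough sketch of the segment-and-sort idea follows; the quality score and greedy search here are simplifying assumptions for illustration, not the published MartiRank algorithm.

```python
# Illustrative segment-and-sort round (an assumption-laden simplification,
# not the published MartiRank algorithm). Examples are (attributes, label).

def quality(ranked):
    """Fraction of failures (label == 1) that land in the top half."""
    top = ranked[: len(ranked) // 2]
    total = sum(label for _, label in ranked) or 1
    return sum(label for _, label in top) / total

def best_sort(segment, n_attrs):
    """Sort one segment by each attribute (both directions); keep the
    ordering with the best quality score."""
    candidates = []
    for i in range(n_attrs):
        for rev in (False, True):
            ranked = sorted(segment, key=lambda ex: ex[0][i], reverse=rev)
            candidates.append((quality(ranked), ranked))
    return max(candidates, key=lambda c: c[0])[1]

def segment_and_sort(examples, n_segments, n_attrs):
    """Split the data into segments and sort each independently,
    concatenating the per-segment rankings."""
    size = max(1, len(examples) // n_segments)
    result = []
    for j in range(0, len(examples), size):
        result.extend(best_sort(examples[j:j + size], n_attrs))
    return result
```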

7
Related Work
  • There has been much research into applying
    Machine Learning techniques to software testing,
    but not the other way around
  • Reusable real-world data sets and Machine
    Learning frameworks are available for checking
    how well a Machine Learning algorithm predicts,
    but not for testing its correctness

8
Analyzing the Problem Domain
  • Consider properties of the real-world data sets
  • Data set size: number of attributes and examples
  • Range of values: attributes and labels
  • Precision of floating-point numbers
  • Categorical data: how alphanumeric attributes
    are addressed
  • Also, repeating or missing data values (a
    profiling sketch follows below)
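The profiling sketch referenced above is a hypothetical helper, not part of the framework; it reports these properties for a row-oriented data set whose last column is the label.

```python
def profile(dataset):
    """Summarize the properties above for rows of [attr_1, ..., attr_n, label].
    None marks a missing value; strings mark categorical values."""
    cols = list(zip(*dataset))
    report = {"examples": len(dataset), "attributes": len(cols) - 1}
    for i, col in enumerate(cols[:-1]):
        numeric = [v for v in col if isinstance(v, (int, float))]
        report[f"attr_{i}"] = {
            "range": (min(numeric), max(numeric)) if numeric else None,
            "missing": sum(1 for v in col if v is None),
            "categorical": sum(1 for v in col if isinstance(v, str)),
            "repeats": len(col) - len(set(col)),
        }
    report["label_range"] = (min(cols[-1]), max(cols[-1]))
    return report

rows = [[3.5, "red", 1], [2.0, "blue", 0], [2.0, None, 0]]
print(profile(rows))
```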

9
Analyzing the Algorithm
  • Look for imprecisions in the specification, not
    necessarily bugs in the implementation
  • How to handle missing attribute values
  • How to handle negative labels
  • Consider how to construct a data set that could
    cause a predictable ranking (one construction is
    sketched below)
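One construction (our own example): make a single attribute determine the label exactly, so the correct ranking is known before the algorithm ever runs.

```python
import random

def predictable_dataset(n, n_attrs, seed=0):
    """Attribute 0 strictly decreases with row index and the first half
    of rows are failures (label 1), so any correct ranker should order
    examples by attribute 0 descending; other attributes are noise."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = 1 if i < n // 2 else 0
        attrs = [float(n - i)] + [rng.random() for _ in range(n_attrs - 1)]
        rows.append((attrs, label))
    rng.shuffle(rows)  # the presentation order should not affect the result
    return rows
```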

10
Analyzing the Runtime Options
  • Determine how the implementation may manipulate
    the input data
  • Permuting the input order
  • Reading the input in chunks
  • Consider configuration parameters
  • For example, disabling anything probabilistic
  • Need to ensure that results are deterministic and
    repeatable (a permutation check is sketched below)
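The permutation check mentioned above might look like this sketch, where train_fn is assumed to be a function from a data set to a comparable model.

```python
import random

def check_permutation_invariance(train_fn, dataset, trials=5, seed=0):
    """Re-train on shuffled copies of the same data; a deterministic,
    order-independent implementation should produce the same model
    every time. Returns True if all trials match the baseline."""
    rng = random.Random(seed)
    baseline = train_fn(dataset)
    for _ in range(trials):
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        if train_fn(shuffled) != baseline:
            return False
    return True
```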

11
The Testing Framework
  • Data set generator: # of examples, # of
    attributes, % failures, % missing values, any
    categorical data, repeat/no-repeat modes
  • Model comparison: specific to MartiRank
  • Ranking comparison: includes metrics like
    normalized equivalence and AUC (an AUC sketch
    follows below)
  • Tracing options: for generating and comparing
    outputs of debugging statements
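AUC has a standard pairwise (Mann-Whitney) formulation; the sketch below computes it for a ranked list of 0/1 labels. Normalized equivalence is defined in the paper and is not reproduced here.

```python
def auc(labels_in_rank_order):
    """AUC for a ranking given as its 0/1 labels, index 0 = top:
    the fraction of (positive, negative) pairs ranked correctly."""
    n_pos = sum(labels_in_rank_order)
    n_neg = len(labels_in_rank_order) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 1.0  # degenerate ranking: nothing to misorder
    wins, neg_seen = 0, 0
    for label in labels_in_rank_order:
        if label == 0:
            neg_seen += 1
        else:
            wins += n_neg - neg_seen  # negatives still below this positive
    return wins / (n_pos * n_neg)

print(auc([1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc([0, 1, 1, 0]))  # 0.5
print(auc([0, 0, 1, 1]))  # 0.0: fully inverted
```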

12
Equivalence Classes
  • Data sizes of different orders of magnitude
  • Repeating vs. non-repeating attribute values
  • Missing vs. no-missing attribute values
  • Categorical vs. non-categorical data
  • 0/1 labels vs. non-negative integer labels
  • Predictable vs. non-predictable data sets
  • Used the data set generator to parameterize test
    case selection criteria (sketched below)
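Parameterized selection can be expressed as a cross-product over the classes above; the parameter names and values here are illustrative assumptions, and generate_dataset stands in for the framework's generator.

```python
import itertools

SIZES       = [100, 1000, 10000]        # orders of magnitude
REPEATS     = [False, True]             # repeating attribute values
MISSING     = [0.0, 0.1]                # fraction of missing values
CATEGORICAL = [False, True]
LABELS      = ["zero_one", "non_negative_int"]
PREDICTABLE = [False, True]

for combo in itertools.product(SIZES, REPEATS, MISSING,
                               CATEGORICAL, LABELS, PREDICTABLE):
    # Each combination selects one equivalence class and would drive
    # one call to the data set generator, e.g. generate_dataset(*combo).
    print("test case:", combo)
```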

13
Testing MartiRank
  • Produced a core dump on data sets with a large
    number of attributes (over 200)
  • Implementation does not correctly handle negative
    labels
  • Does not use a stable sorting algorithm (why
    stability matters is illustrated below)
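Sort stability matters because MartiRank's rankings come from sorting: with tied attribute values, an unstable sort can order tied examples differently across runs or implementations, making results non-repeatable. A small illustration:

```python
# With tied sort keys, only a stable sort guarantees a deterministic
# ranking: ties keep their original input order. Python's sorted() is
# stable, so 'a' reliably stays ahead of 'b' here.
examples = [("a", 0.5), ("b", 0.5), ("c", 0.9)]
print(sorted(examples, key=lambda ex: ex[1], reverse=True))
# [('c', 0.9), ('a', 0.5), ('b', 0.5)]
```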

14
Regression Testing of MartiRank
  • Creating a suite of testing data allowed us to
    use it for regression testing (a golden-file
    check is sketched below)
  • Discovered that refactoring had introduced a bug
    into an important calculation
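A minimal golden-file regression check (our sketch; run_fn and the JSON storage format are assumptions, and outputs are assumed JSON-serializable):

```python
import json
import pathlib

def regression_check(run_fn, dataset, golden_path):
    """Compare the current output against a stored 'golden' output from
    a previous, trusted version; the first run records the golden file.
    Outputs must be JSON-serializable (e.g. lists, not tuples)."""
    path = pathlib.Path(golden_path)
    output = run_fn(dataset)
    if not path.exists():
        path.write_text(json.dumps(output))
        return True  # nothing to compare against yet
    return output == json.loads(path.read_text())
```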

15
Testing Multiple Implementations of MartiRank
  • We had three implementations developed by three
    different programmers
  • These can be used as pseudo-oracles for each
    other (a cross-check sketch follows below)
  • Used to discover a bug in the way one
    implementation was handling missing values
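A sketch of the pseudo-oracle idea (names and signature are hypothetical): run every implementation on the same input and flag any pair that disagrees.

```python
def cross_check(implementations, dataset):
    """implementations maps a name to a callable; with no true oracle,
    the implementations serve as pseudo-oracles for one another.
    Returns the list of (name, name) pairs whose outputs differ."""
    outputs = {name: fn(dataset) for name, fn in implementations.items()}
    names = list(outputs)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if outputs[a] != outputs[b]]
```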

16
Applying Approach to SVM-Light
  • Permuting the input data led to different models
  • Caused by chunking the data for use by an
    approximating variant of the optimization
    algorithm
  • Introduction of noise in a data set in some cases
    caused it not to find a predictable ranking
  • Different kernels also caused different results
    with predictable rankings

17
Evaluation and Observations
  • Testing approach revealed bugs and imprecision in
    the implementations, as well as discrepancies
    from the stated algorithms
  • Inspection of the algorithms led to the creation
    of predictable data sets
  • What is predictable for one algorithm may not
    lead to a predictable ranking in another
  • An algorithm's failure to address specific data
    set traits can lead to incorrect results (and/or
    inconsistent results across implementations)
  • The approach can be generalized to other Machine
    Learning ranking algorithms, as well as
    classification

18
Limitations and Future Work
  • Test suite adequacy for coverage not addressed
  • Can also include mutation testing for
    effectiveness of data sets
  • Should investigate creating large data sets that
    correlate to real-world data
  • Could also consider non-deterministic Machine
    Learning algorithms

19
Questions?