1
An Approach to Software Testing of Machine
Learning Applications
  • Chris Murphy, Gail Kaiser, Marta Arias
  • Columbia University

2
Introduction
  • We are investigating the quality assurance of
    Machine Learning (ML) applications
  • Currently we are concerned with a real-world
    application for potential future use in
    predicting electrical device failures
  • Machine Learning applications fall into a class
    for which it can be said that there is no
    reliable oracle
  • These are also known as non-testable programs
    and could fall into Davis and Weyuker's class of
    programs "which were written in order to
    determine the answer in the first place. There
    would be no need to write such programs if the
    correct answer were known."

3
Introduction
  • We have developed an approach to creating test
    cases for Machine Learning applications
  • Analyze the problem domain and real-world data
    sets
  • Analyze the algorithm as it is defined
  • Analyze an implementation's runtime options
  • Our approach was designed for MartiRank and then
    generalized to other ranking algorithms such as
    Support Vector Machines (SVM)

4
Overview
  • Machine Learning Background
  • Testing Approach and Framework
  • Findings and Results
  • Evaluation and Observations
  • Future Work

5
Machine Learning Fundamentals
  • Data sets consist of a number of examples, each
    of which has attributes and a label
  • In the first phase (training), a model is
    generated that attempts to generalize how
    attributes relate to the label
  • In the second phase, the model is applied to a
    previously-unseen data set (testing data) with
    unknown labels to produce a classification (or,
    in our case, a ranking)
  • This can be used for validation or for prediction
    (the two phases are sketched below)
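To make the two phases concrete, here is a minimal, self-contained Python sketch (our own illustration, not the authors' code); the toy "model" is just the index of the one attribute that best separates the labels.

```python
# Toy two-phase workflow: train on labeled examples, then rank unseen data.
# Each example is (attributes, label); the "model" is a single attribute index.

def train(examples):
    """Phase 1 (training): pick the attribute whose descending order
    puts the most positive labels in the top half."""
    n_attrs = len(examples[0][0])
    def agreement(i):
        ranked = sorted(examples, key=lambda ex: ex[0][i], reverse=True)
        return sum(label for _, label in ranked[: len(ranked) // 2])
    return max(range(n_attrs), key=agreement)

def apply_model(model, test_attrs):
    """Phase 2 (testing): rank previously-unseen examples by the
    learned attribute; labels are unknown at this point."""
    return sorted(test_attrs, key=lambda attrs: attrs[model], reverse=True)

training = [((0.9, 0.1), 1), ((0.8, 0.7), 1), ((0.2, 0.9), 0), ((0.1, 0.3), 0)]
model = train(training)
print(apply_model(model, [(0.5, 0.2), (0.95, 0.4), (0.05, 0.8)]))
```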

6
MartiRank and SVM
  • MartiRank was specifically designed for the
    device failure application
  • Seeks to find the combination of segmenting and
    sorting the data that produces the best result
    (a rough sketch follows below)
  • SVM is typically a classification algorithm
  • Seeks to find a hyperplane that separates
    examples from different classes
  • Different kernels use different approaches
  • SVM-Light has a ranking mode based on the
    distance from the hyperplane
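A rough sketch of the segment-and-sort idea follows; the quality score and greedy search here are simplifying assumptions for illustration, not the published MartiRank algorithm.

```python
# Illustrative segment-and-sort round (an assumption-laden simplification,
# not the published MartiRank algorithm). Examples are (attributes, label).

def quality(ranked):
    """Fraction of failures (label == 1) that land in the top half."""
    top = ranked[: len(ranked) // 2]
    total = sum(label for _, label in ranked) or 1
    return sum(label for _, label in top) / total

def best_sort(segment, n_attrs):
    """Sort one segment by each attribute (both directions); keep the
    ordering with the best quality score."""
    candidates = []
    for i in range(n_attrs):
        for rev in (False, True):
            ranked = sorted(segment, key=lambda ex: ex[0][i], reverse=rev)
            candidates.append((quality(ranked), ranked))
    return max(candidates, key=lambda c: c[0])[1]

def segment_and_sort(examples, n_segments, n_attrs):
    """Split the data into segments and sort each independently,
    concatenating the per-segment rankings."""
    size = max(1, len(examples) // n_segments)
    result = []
    for j in range(0, len(examples), size):
        result.extend(best_sort(examples[j:j + size], n_attrs))
    return result
```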

7
Related Work
  • There has been much research into applying
    Machine Learning techniques to software testing,
    but not the other way around
  • Reusable real-world data sets and Machine
    Learning frameworks are available for checking
    how well a Machine Learning algorithm predicts,
    but not for testing its correctness

8
Analyzing the Problem Domain
  • Consider properties of the real-world data sets
  • Data set size: number of attributes and examples
  • Range of values: attributes and labels
  • Precision of floating-point numbers
  • Categorical data: how alphanumeric attributes
    are addressed
  • Also, repeating or missing data values (a
    profiling sketch follows below)
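The profiling sketch referenced above is a hypothetical helper, not part of the framework; it reports these properties for a row-oriented data set whose last column is the label.

```python
def profile(dataset):
    """Summarize the properties above for rows of [attr_1, ..., attr_n, label].
    None marks a missing value; strings mark categorical values."""
    cols = list(zip(*dataset))
    report = {"examples": len(dataset), "attributes": len(cols) - 1}
    for i, col in enumerate(cols[:-1]):
        numeric = [v for v in col if isinstance(v, (int, float))]
        report[f"attr_{i}"] = {
            "range": (min(numeric), max(numeric)) if numeric else None,
            "missing": sum(1 for v in col if v is None),
            "categorical": sum(1 for v in col if isinstance(v, str)),
            "repeats": len(col) - len(set(col)),
        }
    report["label_range"] = (min(cols[-1]), max(cols[-1]))
    return report

rows = [[3.5, "red", 1], [2.0, "blue", 0], [2.0, None, 0]]
print(profile(rows))
```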

9
Analyzing the Algorithm
  • Look for imprecisions in the specification, not
    necessarily bugs in the implementation
  • How to handle missing attribute values
  • How to handle negative labels
  • Consider how to construct a data set that could
    cause a predictable ranking (one construction is
    sketched below)
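One construction (our own example): make a single attribute determine the label exactly, so the correct ranking is known before the algorithm ever runs.

```python
import random

def predictable_dataset(n, n_attrs, seed=0):
    """Attribute 0 strictly decreases with row index and the first half
    of rows are failures (label 1), so any correct ranker should order
    examples by attribute 0 descending; other attributes are noise."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = 1 if i < n // 2 else 0
        attrs = [float(n - i)] + [rng.random() for _ in range(n_attrs - 1)]
        rows.append((attrs, label))
    rng.shuffle(rows)  # the presentation order should not affect the result
    return rows
```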

10
Analyzing the Runtime Options
  • Determine how the implementation may manipulate
    the input data
  • Permuting the input order
  • Reading the input in chunks
  • Consider configuration parameters
  • For example, disabling anything probabilistic
  • Need to ensure that results are deterministic and
    repeatable (a permutation check is sketched below)
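The permutation check mentioned above might look like this sketch, where train_fn is assumed to be a function from a data set to a comparable model.

```python
import random

def check_permutation_invariance(train_fn, dataset, trials=5, seed=0):
    """Re-train on shuffled copies of the same data; a deterministic,
    order-independent implementation should produce the same model
    every time. Returns True if all trials match the baseline."""
    rng = random.Random(seed)
    baseline = train_fn(dataset)
    for _ in range(trials):
        shuffled = dataset[:]
        rng.shuffle(shuffled)
        if train_fn(shuffled) != baseline:
            return False
    return True
```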

11
The Testing Framework
  • Data set generator: # of examples, # of
    attributes, % failures, % missing values, any
    categorical data, repeat/no-repeat modes
  • Model comparison: specific to MartiRank
  • Ranking comparison: includes metrics like
    normalized equivalence and AUC (an AUC sketch
    follows below)
  • Tracing options: for generating and comparing
    outputs of debugging statements
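AUC has a standard pairwise (Mann-Whitney) formulation; the sketch below computes it for a ranked list of 0/1 labels. Normalized equivalence is defined in the paper and is not reproduced here.

```python
def auc(labels_in_rank_order):
    """AUC for a ranking given as its 0/1 labels, index 0 = top:
    the fraction of (positive, negative) pairs ranked correctly."""
    n_pos = sum(labels_in_rank_order)
    n_neg = len(labels_in_rank_order) - n_pos
    if n_pos == 0 or n_neg == 0:
        return 1.0  # degenerate ranking: nothing to misorder
    wins, neg_seen = 0, 0
    for label in labels_in_rank_order:
        if label == 0:
            neg_seen += 1
        else:
            wins += n_neg - neg_seen  # negatives still below this positive
    return wins / (n_pos * n_neg)

print(auc([1, 1, 0, 0]))  # 1.0: perfect ranking
print(auc([0, 1, 1, 0]))  # 0.5
print(auc([0, 0, 1, 1]))  # 0.0: fully inverted
```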

12
Equivalence Classes
  • Data sizes of different orders of magnitude
  • Repeating vs. non-repeating attribute values
  • Missing vs. no-missing attribute values
  • Categorical vs. non-categorical data
  • 0/1 labels vs. non-negative integer labels
  • Predictable vs. non-predictable data sets
  • Used the data set generator to parameterize test
    case selection criteria (sketched below)
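Parameterized selection can be expressed as a cross-product over the classes above; the parameter names and values here are illustrative assumptions, and generate_dataset stands in for the framework's generator.

```python
import itertools

SIZES       = [100, 1000, 10000]        # orders of magnitude
REPEATS     = [False, True]             # repeating attribute values
MISSING     = [0.0, 0.1]                # fraction of missing values
CATEGORICAL = [False, True]
LABELS      = ["zero_one", "non_negative_int"]
PREDICTABLE = [False, True]

for combo in itertools.product(SIZES, REPEATS, MISSING,
                               CATEGORICAL, LABELS, PREDICTABLE):
    # Each combination selects one equivalence class and would drive
    # one call to the data set generator, e.g. generate_dataset(*combo).
    print("test case:", combo)
```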

13
Testing MartiRank
  • Produced a core dump on data sets with a large
    number of attributes (over 200)
  • Implementation does not correctly handle negative
    labels
  • Does not use a stable sorting algorithm (why
    stability matters is illustrated below)
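Sort stability matters because MartiRank's rankings come from sorting: with tied attribute values, an unstable sort can order tied examples differently across runs or implementations, making results non-repeatable. A small illustration:

```python
# With tied sort keys, only a stable sort guarantees a deterministic
# ranking: ties keep their original input order. Python's sorted() is
# stable, so 'a' reliably stays ahead of 'b' here.
examples = [("a", 0.5), ("b", 0.5), ("c", 0.9)]
print(sorted(examples, key=lambda ex: ex[1], reverse=True))
# [('c', 0.9), ('a', 0.5), ('b', 0.5)]
```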

14
Regression Testing of MartiRank
  • Creating a suite of testing data allowed us to
    use it for regression testing (a golden-file
    check is sketched below)
  • Discovered that refactoring had introduced a bug
    into an important calculation
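A minimal golden-file regression check (our sketch; run_fn and the JSON storage format are assumptions, and outputs are assumed JSON-serializable):

```python
import json
import pathlib

def regression_check(run_fn, dataset, golden_path):
    """Compare the current output against a stored 'golden' output from
    a previous, trusted version; the first run records the golden file.
    Outputs must be JSON-serializable (e.g. lists, not tuples)."""
    path = pathlib.Path(golden_path)
    output = run_fn(dataset)
    if not path.exists():
        path.write_text(json.dumps(output))
        return True  # nothing to compare against yet
    return output == json.loads(path.read_text())
```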

15
Testing Multiple Implementations of MartiRank
  • We had three implementations developed by three
    different programmers
  • These can be used as pseudo-oracles for each
    other (a cross-check sketch follows below)
  • Used to discover a bug in the way one
    implementation was handling missing values
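A sketch of the pseudo-oracle idea (names and signature are hypothetical): run every implementation on the same input and flag any pair that disagrees.

```python
def cross_check(implementations, dataset):
    """implementations maps a name to a callable; with no true oracle,
    the implementations serve as pseudo-oracles for one another.
    Returns the list of (name, name) pairs whose outputs differ."""
    outputs = {name: fn(dataset) for name, fn in implementations.items()}
    names = list(outputs)
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if outputs[a] != outputs[b]]
```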

16
Applying Approach to SVM-Light
  • Permuting the input data led to different models
  • Caused by chunking the data for use by an
    approximating variant of the optimization
    algorithm
  • Introduction of noise in a data set in some cases
    caused it not to find a predictable ranking
  • Different kernels also caused different results
    with predictable rankings

17
Evaluation and Observations
  • Testing approach revealed bugs and imprecision in
    the implementations, as well as discrepancies
    from the stated algorithms
  • Inspection of the algorithms led to the creation
    of predictable data sets
  • What is predictable for one algorithm may not
    lead to a predictable ranking in another
  • An algorithm's failure to address specific data
    set traits can lead to incorrect results (and/or
    inconsistent results across implementations)
  • The approach can be generalized to other Machine
    Learning ranking algorithms, as well as
    classification

18
Limitations and Future Work
  • Test suite adequacy for coverage not addressed
  • Can also include mutation testing for
    effectiveness of data sets
  • Should investigate creating large data sets that
    correlate to real-world data
  • Could also consider non-deterministic Machine
    Learning algorithms

19
Questions?