Parameterizing Random Test Data According to Equivalence Classes presentation

Title: Parameterizing Random Test Data According to Equivalence Classes

1
Parameterizing Random Test Data According to
Equivalence Classes

Chris Murphy, Gail Kaiser, Marta Arias
Columbia University

2
What is random testing?

This is not part of the talk!!!!
Random testing is the notion of using random
input to test the application
As opposed to using pre-determined and manually
selected equivalence classes or partitions

3
Introduction

We are investigating the quality assurance of
Machine Learning (ML) applications
Currently we are concerned with a real-world
application for potential future use in
predicting electrical device failures
Using ranking instead of classification
Our concern is not whether an algorithm predicts
well but whether an implementation operates
correctly

4
Data Set Options

Real-world data sets
Not always accessible/available
May not necessarily contain the separation or
combination of traits that we desire to test
Hand-generation of data
Only useful for small tests
Random testing
Limited by the lack of a reliable test oracle
ML applications of interest fall into the
category of non-testable programs

5
Motivation

Without a reliable test oracle, we can only
Look for obvious faults
Consider intermediate results
Detect discrepancies in the specification
We need to restrict some properties of random
test data generation

6
Our Solution

Parameterized Random Test Data Generation
Automatically generate random data sets, but
parameterized to control the range and
characteristics of those random values
Parameterization allows us to create a hybrid
between equivalence class partitioning and random
testing

7
Overview

Machine Learning Background
Data Generation Framework
Findings and Results
Evaluation and Observations
Conclusions and Future Work

8
Machine Learning Fundamentals

Data sets consist of a number of examples, each
of which has attributes and a label
In the first phase (training), a model is
generated that attempts to generalize how
attributes relate to the label
In the second phase (validation), the model is
applied to a previously-unseen data set with
unknown labels to produce a classification (or,
in our case, a ranking)

9
Problems Faced in Testing

The testing input should be based on the problem
domain
Need to consider a way to mimic all of the traits
of the real-world data sets
Also need to keep in mind that we do not have a
reliable test oracle

10
Analyzing the Problem Domain

Consider properties of data sets in general
Data set size: number of attributes and examples
Range of values: attributes and labels
Precision of floating-point numbers
Whether values can repeat
Consider properties of real-world data sets in
the domain of interest
How alphanumeric attributes are to be interpreted
Whether data values might be missing

11
Equivalence Classes

Data sizes of different orders of magnitude
Repeating vs. non-repeating attribute values
Missing vs. non-missing attribute values
Categorical vs. non-categorical data
0/1 labels vs. non-negative integer labels
Predictable vs. non-predictable data sets
Used data set generator to parameterize test case
selection criteria
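
One way to read the list above is as a set of independent parameter axes whose combinations define the test configurations. The sketch below enumerates such combinations. The axis names and the sample values are assumptions chosen for illustration, and the "predictable vs. non-predictable" class is left out because the slides do not say how it is parameterized.

```python
from itertools import product

# Hypothetical parameter axes, one per equivalence class named on the slide.
EQUIVALENCE_CLASSES = {
    "num_examples": [10, 100, 1000],   # sizes of different orders of magnitude
    "allow_repeats": [False, True],    # non-repeating vs. repeating values
    "percent_missing": [0, 20],        # no missing vs. some missing values
    "categorical": [False, True],      # non-categorical vs. categorical data
    "binary_labels": [True, False],    # 0/1 labels vs. non-negative integers
}

def enumerate_configurations(classes):
    """Return one parameter dict per combination of equivalence-class choices."""
    names = list(classes)
    return [dict(zip(names, combo)) for combo in product(*classes.values())]

configs = enumerate_configurations(EQUIVALENCE_CLASSES)
```

Each resulting dict could then be handed to a data set generator, so that every generated data set belongs to a known combination of classes.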

12
How Data Are Generated

M attributes and N examples
No-repeat mode
Generate a list of integers from 1 to M*N and
then randomly permute them
Repeat mode
Each value in the data set is simply a random
integer between 1 and M*N
Tool ensures at least one set of repeating numbers
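
The two modes above can be sketched in Python as follows. This is a minimal illustration and not the authors' tool: the function name, the rng parameter, and the way a repeat is forced when the random draws happen to be all distinct are assumptions.

```python
import random

def generate_attributes(num_examples, num_attributes, allow_repeats, rng=None):
    """Return num_examples rows of num_attributes integer values each."""
    rng = rng or random.Random()
    total = num_examples * num_attributes
    if not allow_repeats:
        # No-repeat mode: a random permutation of 1..total.
        values = list(range(1, total + 1))
        rng.shuffle(values)
    else:
        # Repeat mode: every cell is an independent draw from 1..total.
        values = [rng.randint(1, total) for _ in range(total)]
        # Assumed strategy for guaranteeing at least one repeat:
        # if all draws came out distinct, copy one cell into another.
        if total > 1 and len(set(values)) == total:
            values[1] = values[0]
    # Cut the flat list into rows, one row per example.
    return [values[i * num_attributes:(i + 1) * num_attributes]
            for i in range(num_examples)]
```

Passing a seeded random.Random instance makes a generated data set reproducible, which helps when a failing test needs to be rerun.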

13
Generating Labels

Specify percentage of positive examples to
include in the data set
positive examples have a label of 1
negative examples have a label of 0
Data generation framework guarantees that the
number of positive examples comes out to be the
right number, even though the values are randomly
placed throughout the data set
Labels are never unknown/missing
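
A simple way to get an exact number of positive labels at random positions is to build the label list first and then shuffle it. The sketch below assumes this approach; the function name and the rounding of the percentage to a whole number of examples are illustrative choices, not details taken from the talk.

```python
import random

def generate_labels(num_examples, percent_positive, rng=None):
    """Return 0/1 labels with a fixed number of positives in random order."""
    rng = rng or random.Random()
    # Assumption: the requested percentage is rounded to a whole count.
    num_positive = round(num_examples * percent_positive / 100)
    labels = [1] * num_positive + [0] * (num_examples - num_positive)
    rng.shuffle(labels)  # positions are random, the count is not
    return labels
```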

14
Categorical Data

For some alphanumeric attributes, data
pre-processing is used to expand K distinct
values to K attributes
Same as in real-world ranking application
Input parameter to data generation tool is of the
format (a1, a2, ..., aK-1, aK, m)
a1 through aK represent the percentage
distribution of those values for the categorical
attribute
m is the percentage of unknown values
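
The following sketch shows one possible reading of the (a1, ..., aK, m) parameter. It draws each example's categorical value according to the given percentages and expands it into K 0/1 attributes. Several details are assumptions: values are sampled with weights, so the observed distribution is only approximately the requested one, and an unknown value is written as "?" in all K expanded attributes.

```python
import random

MISSING = "?"

def generate_categorical(num_examples, spec, rng=None):
    """Expand one categorical attribute into K 0/1 attributes per example.

    spec is (a1, ..., aK, m): percentages for the K values, then for unknown.
    """
    rng = rng or random.Random()
    *value_percents, missing_percent = spec
    k = len(value_percents)
    outcomes = list(range(k)) + [None]          # None stands for "unknown"
    weights = list(value_percents) + [missing_percent]
    rows = []
    for _ in range(num_examples):
        choice = rng.choices(outcomes, weights=weights)[0]
        if choice is None:
            rows.append([MISSING] * k)          # assumed encoding of unknown
        else:
            rows.append([1 if i == choice else 0 for i in range(k)])
    return rows
```

For example, generate_categorical(100, (50, 30, 10, 10)) describes a three-valued attribute with roughly 10 percent unknown values.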

15
Data Set Generator - Parameters

# of examples
# of attributes
% positive examples (label 1)
% missing
any categorical data
repeat/no-repeat modes
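
The parameters listed above could be grouped as in the sketch below, which also shows one way to apply the missing-value percentage. All names are hypothetical. Replacing an exact number of randomly chosen cells is an assumption; it is consistent with the sample data sets on the next slide, where 20% missing yields exactly 20 of 100 attribute values.

```python
import random
from dataclasses import dataclass

@dataclass
class GeneratorParams:
    """Hypothetical container for the generator's input parameters."""
    num_examples: int
    num_attributes: int
    percent_positive: float
    percent_missing: float
    categorical_specs: tuple = ()   # one (a1, ..., aK, m) tuple per attribute
    allow_repeats: bool = True

def inject_missing(rows, percent_missing, rng=None, marker="?"):
    """Return a copy of rows with a fixed share of cells replaced by marker."""
    rng = rng or random.Random()
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    num_missing = round(len(cells) * percent_missing / 100)
    result = [list(row) for row in rows]
    # sample() picks distinct cells, so the count of markers is exact.
    for r, c in rng.sample(cells, num_missing):
        result[r][c] = marker
    return result
```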

16
Sample Data Sets

10 examples, 10 attributes, 40% positive
examples, 20% missing, repeats allowed

27,81,88,59, ?,16,88, ?,41, ?,0
15,70,91,41, ?, 3, ?, ?, ?,64,0
82, ?,51,47, ?, 4, 1,99, ?,51,0
22,72,11, ?,96,24,44,92, ?,11,1
57,77, ?,86,89,77,61,76,96,98,1
76,11, 4,51,43, ?,79,21,28, ?,0
 6,33, ?, ?,52,63,94,75, 8,26,0
77,36,91, ?,47, 3,85,71,35,45,1
 ?,17,15, 2,90,70, ?, 7,41,42,0
 8,58,42,41,74,87,68,68, 1,15,1

35, 3,20,41,91, ?,32,11,43, ?,1
19,50,11,57,36,94, ?,96, 7,23,1
24,36,36,79,78,33,34, ?,32, ?,0
 ?,15, ?,19,65,80,17,78,43, ?,0
40,31,89,50,83,55,25, ?, ?,45,1
52, ?, ?, ?, ?,39,79,82,94, ?,0
86,45, ?, ?,74,68,13,66,42,56,0
 ?,53,91,23,11, ?,47,61,79, 8,0
77,11,34,44,92, ?,63,62,51,51,1
21, 1,70,14,16,40,63,94,69,83,0

17
The Testing Framework

Data set generator
Model comparison
Ranking comparison includes metrics like
normalized equivalence and AUCs
Tracing options for generating and comparing
outputs of debugging statements
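
As an illustration of the AUC metric mentioned above, the sketch below computes the area under the ROC curve for a ranking with 0/1 labels, using the standard pairwise definition: the fraction of (positive, negative) pairs in which the positive example is ranked higher. How the framework itself computes AUC is not described in the slides, and "normalized equivalence" is not defined there, so it is not sketched. The function name is an assumption.

```python
def ranking_auc(ranked_labels):
    """AUC of a ranking, given 0/1 labels listed from top rank to bottom rank."""
    num_pos = sum(ranked_labels)
    num_neg = len(ranked_labels) - num_pos
    if num_pos == 0 or num_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    correct_pairs = 0
    positives_seen = 0
    for label in ranked_labels:
        if label == 1:
            positives_seen += 1
        else:
            # Every positive already seen is ranked above this negative.
            correct_pairs += positives_seen
    return correct_pairs / (num_pos * num_neg)
```

A perfect ranking such as [1, 1, 0, 0] scores 1.0, and a fully inverted one scores 0.0.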

18
MartiRank and SVM

MartiRank was specifically designed for the
real-world device failure application
Seeks to find the sequence of attributes to
segment and sort the data to produce the best
result
SVM is typically a classification algorithm
Seeks to find a hyperplane that separates
examples from different classes
SVM-Light has a ranking mode based on the
distance from the hyperplane

19
Findings

Testing approach and framework were developed for
MartiRank then applied to SVM
Only the findings most related to parameterized
random testing are presented here
More details and case studies about the testing
of MartiRank can be found in our tech report

20
Issue 1: Repeating Values

One version of MartiRank did not use stable
sorting

...
91,41,19, 3,57,11,20,64,0
36,73,47, 3,85,71,35,45,1
...
...
...
...

stable:
...
91,41,19, 3,57,11,20,64,0
...
...
...
36,73,47, 3,85,71,35,45,1
...

unstable:
...
36,73,47, 3,85,71,35,45,1
91,41,19, 3,57,11,20,64,0
...
...
...
...
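
The difference can be reproduced with a small Python sketch. It assumes the sort key is the fourth attribute, where the two rows on the slide both hold the value 3. The third row is hypothetical and was added only so that the unstable sort has something to swap. Python's built-in sorted() is stable; the selection sort here is an illustrative unstable algorithm and is not MartiRank's code.

```python
def selection_sort(rows, key):
    """Unstable selection sort: swaps can reorder rows with equal keys."""
    rows = list(rows)
    for i in range(len(rows)):
        smallest = i
        for j in range(i + 1, len(rows)):
            if key(rows[j]) < key(rows[smallest]):
                smallest = j
        rows[i], rows[smallest] = rows[smallest], rows[i]
    return rows

row_a = [91, 41, 19, 3, 57, 11, 20, 64, 0]   # from the slide
row_b = [36, 73, 47, 3, 85, 71, 35, 45, 1]   # from the slide
row_c = [12, 58, 66, 1, 30, 27, 88, 49, 0]   # hypothetical extra row

def fourth_attribute(row):
    return row[3]

data = [row_a, row_b, row_c]
stable_result = sorted(data, key=fourth_attribute)
unstable_result = selection_sort(data, key=fourth_attribute)
```

With the stable sort, row_a stays ahead of row_b; with the selection sort, the swap that moves row_c to the front also puts row_b ahead of row_a. Such a reordering can only show up in data sets that contain repeating values.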

21
Issue 2: Sparse Data Sets

Not specifically addressed in specification

41,91, ?,32,11,43, ?,1
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
19,65,80,17,78,46, ?,0
50,83,55,25, ?, ?,45,1
 ?, ?,39,79,82,94, ?,0

sort around missing values:
41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
 ?, ?,39,79,82,94, ?,0
50,83,55,25, ?, ?,45,1
57,36,94, ?,96, 7,23,1

put missing values at end:
41,91, ?,32,11,43, ?,1
19,65,80,17,78,46, ?,0
 ?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
79,78,33,34, ?,31, ?,0
50,83,55,25, ?, ?,45,1

randomly insert missing values:
41,91, ?,32,11,43, ?,1
50,83,55,25, ?, ?,45,1
19,65,80,17,78,46, ?,0
79,78,33,34, ?,31, ?,0
 ?, ?,39,79,82,94, ?,0
57,36,94, ?,96, 7,23,1
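
The three behaviors can be sketched as follows, sorting the slide's six examples in ascending order on the fifth attribute (index 4). The function names are assumptions and this is not MartiRank's implementation. The first two functions are deterministic and reproduce the corresponding orderings shown on the slide.

```python
import random

MISSING = "?"

def known_sorted(rows, col):
    """Rows whose value in col is known, in ascending order of that value."""
    return sorted((r for r in rows if r[col] != MISSING), key=lambda r: r[col])

def sort_around_missing(rows, col):
    """Rows with a missing value keep their positions; the rest are sorted."""
    known = iter(known_sorted(rows, col))
    return [r if r[col] == MISSING else next(known) for r in rows]

def missing_at_end(rows, col):
    """Sorted known rows first, then rows with a missing value."""
    return known_sorted(rows, col) + [r for r in rows if r[col] == MISSING]

def missing_inserted_randomly(rows, col, rng=None):
    """Sorted known rows, with each missing-value row at a random position."""
    rng = rng or random.Random()
    result = known_sorted(rows, col)
    for r in rows:
        if r[col] == MISSING:
            result.insert(rng.randint(0, len(result)), r)
    return result

rows = [
    [41, 91, "?", 32, 11, 43, "?", 1],
    [57, 36, 94, "?", 96, 7, 23, 1],
    [79, 78, 33, 34, "?", 31, "?", 0],
    [19, 65, 80, 17, 78, 46, "?", 0],
    [50, 83, 55, 25, "?", "?", 45, 1],
    ["?", "?", 39, 79, 82, 94, "?", 0],
]
around = sort_around_missing(rows, 4)
at_end = missing_at_end(rows, 4)
```

Because the specification did not say which behavior is intended, implementations that pick different strategies can produce different orderings from the same sparse input.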

22
Issue 3: Categorical Data

Discovered that refactoring had introduced a bug
into an important calculation
A global variable was being used incorrectly
This bug did not appear in any of the tests with
only repeating values or only missing values
However, categorical data necessarily has
repeating values and may have missing values

23
Issue 4: Permuted Input Data

Randomly permuting the input data led to
different models (and then different rankings)
generated by SVM-Light
Caused by chunking data for use by an
approximating variant of the optimization algorithm
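
A check of this kind can be sketched as a small harness that retrains on shuffled copies of the same examples and compares the results. Here train is a hypothetical callable that returns a comparable model object; it stands in for invoking a learner and is not SVM-Light's interface. The two toy learners exist only to exercise the harness.

```python
import random

def permutation_changes_model(train, examples, trials=5, rng=None):
    """Return True if training on some shuffled copy yields a different model."""
    rng = rng or random.Random()
    baseline = train(examples)
    for _ in range(trials):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if train(shuffled) != baseline:
            return True
    return False

def order_insensitive_learner(examples):
    # Toy "model": the sorted examples, identical for any input order.
    return sorted(examples)

def order_sensitive_learner(examples):
    # Toy "model": the examples in the order they were seen.
    return list(examples)

examples = [5, 3, 8, 1, 9, 2]
stable_under_permutation = not permutation_changes_model(
    order_insensitive_learner, examples, trials=20, rng=random.Random(0))
sensitive_to_permutation = permutation_changes_model(
    order_sensitive_learner, examples, trials=20, rng=random.Random(0))
```

For a real learner whose models contain floating-point values, an exact comparison may be too strict, and a tolerance-based comparison would be a reasonable substitute.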

24
Observations

Parameterized random testing allowed us to
isolate the traits of the data sets
These traits may appear in real-world data but
not necessarily in the desired combinations
An algorithm's failure to address specific data set
traits can lead to discrepancies

25
Related Work: Machine Learning

There has been much research into applying
Machine Learning techniques to software testing,
but not the other way around
Reusable real-world data sets and Machine
Learning frameworks are available for checking
how well a Machine Learning algorithm predicts,
but not for testing its correctness

26
Related Work: Random Testing

Parameterization generally refers to specifying
data type or range of values
Our work differs from that of Thévenod-Fosse et
al. ('91) on structural statistical testing,
which focuses on path selection and coverage
testing, not system testing
Also differs from uniform statistical testing
because although we do select random data over a
uniform distribution, we parameterize it
according to equivalence classes

27
Limitations and Future Work

Test suite adequacy for coverage not addressed or
measured
Could also consider non-deterministic Machine
Learning algorithms
Can also include mutation testing for
effectiveness of data sets
Should investigate creating large data sets that
correlate to real-world data

28
Conclusion

Our contribution is an approach that combines
parameterization and randomness to control the
properties of very large data sets
Critical for limiting the scope of individual
tests and for pinpointing specific issues related
to the traits of the input data
