Post on 21-Dec-2015
Parameterizing Random Test Data According to Equivalence Classes
Chris Murphy, Gail Kaiser, Marta Arias
Columbia University
What is random testing? (This is not part of the talk.)
  Random testing is the notion of using “random” input to test the application
  As opposed to using pre-determined and manually selected “equivalence classes” or “partitions”
Introduction
  We are investigating the quality assurance of Machine Learning (ML) applications
  Currently we are concerned with a real-world application for potential future use in predicting electrical device failures
    Using ranking instead of classification
  Our concern is not whether an algorithm predicts well but whether an implementation operates correctly
Data Set Options
  Real-world data sets
    Not always accessible/available
    May not necessarily contain the separation or combination of traits that we desire to test
  Hand-generation of data
    Only useful for small tests
  Random testing
    Limited by the lack of a reliable test oracle
    ML applications of interest fall into the category of “non-testable programs”
Motivation
  Without a reliable test oracle, we can only:
    Look for obvious faults
    Consider intermediate results
    Detect discrepancies in the specification
  We need to restrict some properties of random test data generation
Our Solution
  Parameterized Random Test Data Generation
  Automatically generate random data sets, but parameterized to control the range and characteristics of those random values
  Parameterization allows us to create a hybrid between equivalence class partitioning and random testing
Overview
  Machine Learning Background
  Data Generation Framework
  Findings and Results
  Evaluation and Observations
  Conclusions and Future Work
Machine Learning Fundamentals
  Data sets consist of a number of examples, each of which has attributes and a label
  In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label
  In the second phase (“validation”), the model is applied to a previously-unseen data set with unknown labels to produce a classification (or, in our case, a ranking)
Problems Faced in Testing
  The testing input should be based on the problem domain
  Need to consider a way to mimic all of the traits of the real-world data sets
  Also need to keep in mind that we do not have a reliable test oracle
Analyzing the Problem Domain
  Consider properties of data sets in general
    Data set size: number of attributes and examples
    Range of values: attributes and labels
    Precision of floating-point numbers
    Whether values can repeat
  Consider properties of real-world data sets in the domain of interest
    How alphanumeric attributes are to be interpreted
    Whether data values might be missing
Equivalence Classes
  Data sizes of different orders of magnitude
  Repeating vs. non-repeating attribute values
  Missing vs. no missing attribute values
  Categorical vs. non-categorical data
  0/1 labels vs. non-negative integer labels
  Predictable vs. non-predictable data sets
  We used the data set generator to parameterize the test case selection criteria
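As a rough illustration of how such classes could drive test selection, the sketch below enumerates one generator configuration per combination of classes. The class values, the dictionary keys, and the choice to combine classes exhaustively are all assumptions made for illustration; the talk does not say how combinations were chosen.

```python
from itertools import product

# Hypothetical equivalence classes; names and values are illustrative only.
EQUIVALENCE_CLASSES = {
    "num_examples": [10, 100, 1000],   # different orders of magnitude
    "repeat": [False, True],           # non-repeating vs. repeating values
    "percent_missing": [0, 20],        # no missing vs. some missing values
    "categorical": [False, True],      # non-categorical vs. categorical data
}

def configurations(classes):
    """Yield one dict per combination of equivalence-class values."""
    names = list(classes)
    for combo in product(*(classes[name] for name in names)):
        yield dict(zip(names, combo))

all_configs = list(configurations(EQUIVALENCE_CLASSES))
```

Each resulting dictionary could then be passed to a data set generator, so that every test run exercises a known combination of traits.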
How Data Are Generated
  M attributes and N examples
  No-repeat mode:
    Generate a list of integers from 1 to M*N and then randomly permute them
  Repeat mode:
    Each value in the data set is simply a random integer between 1 and M*N
    Tool ensures at least one set of repeating numbers
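The two modes can be sketched roughly as follows. This is a minimal illustration, not the actual tool: the function name, the `rng` parameter, and in particular the way a repeat is forced (copying one cell into another when all values happen to be distinct) are assumptions.

```python
import random

def generate_values(num_attributes, num_examples, repeat, rng=None):
    """Return num_examples rows, each with num_attributes integers in 1..M*N."""
    rng = rng or random.Random()
    total = num_attributes * num_examples
    if not repeat:
        # No-repeat mode: permute 1..M*N so every value appears exactly once.
        values = list(range(1, total + 1))
        rng.shuffle(values)
    else:
        # Repeat mode: draw every cell independently from 1..M*N.
        values = [rng.randint(1, total) for _ in range(total)]
        if total > 1 and len(set(values)) == total:
            # Assumed strategy: force one repeat by copying a cell.
            source, target = rng.sample(range(total), 2)
            values[target] = values[source]
    # Cut the flat list into rows of M attributes each.
    return [values[row * num_attributes:(row + 1) * num_attributes]
            for row in range(num_examples)]
```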
Generating Labels
  Specify percentage of “positive examples” to include in the data set
    Positive examples have a label of 1
    Negative examples have a label of 0
  Data generation framework guarantees that the number of positive examples exactly matches the specified percentage, even though the values are randomly placed throughout the data set
  Labels are never unknown/missing
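One simple way to get an exact count of positive labels at random positions is to build the labels first and shuffle them afterwards. The sketch below assumes this approach and assumes the count is rounded to the nearest integer; the function name is hypothetical.

```python
import random

def generate_labels(num_examples, percent_positive, rng=None):
    """Return a shuffled list of 0/1 labels with an exact number of 1s."""
    rng = rng or random.Random()
    # Assumption: round to the nearest whole number of positive examples.
    num_positive = round(num_examples * percent_positive / 100)
    labels = [1] * num_positive + [0] * (num_examples - num_positive)
    rng.shuffle(labels)  # random placement, fixed count
    return labels
```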
Categorical Data
  For some alphanumeric attributes, data pre-processing is used to expand K distinct values to K attributes
    Same as in real-world ranking application
  Input parameter to data generation tool is of the format (a_1, a_2, ..., a_{K-1}, a_K, m)
    a_1 through a_K represent the percentage distribution of those values for the categorical attribute
    m is the percentage of unknown values
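A hedged sketch of how such a parameter might be turned into data follows. Several details are assumptions not stated in the talk: that the K percentages and m together sum to 100, that counts are exact rather than sampled, and that an unknown value expands to "?" in all K attributes. Both function names are hypothetical.

```python
import random

def generate_categorical_column(num_examples, spec, rng=None):
    """spec = (a_1, ..., a_K, m); returns value indices 0..K-1, or None if unknown."""
    rng = rng or random.Random()
    *shares, percent_missing = spec
    if sum(shares) + percent_missing != 100:
        raise ValueError("percentages are assumed to sum to 100")
    column = []
    for value_index, share in enumerate(shares):
        column += [value_index] * round(num_examples * share / 100)
    column = column[:num_examples]
    # Unknown values absorb whatever remains after rounding.
    column += [None] * (num_examples - len(column))
    rng.shuffle(column)
    return column

def expand_categorical(column, num_values):
    """Expand one categorical column into num_values 0/1 attributes per example."""
    rows = []
    for value in column:
        if value is None:
            rows.append(["?"] * num_values)  # assumed encoding of unknowns
        else:
            rows.append([1 if value == i else 0 for i in range(num_values)])
    return rows
```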
Data Set Generator - Parameters
  # of examples
  # of attributes
  % positive examples (label = 1)
  % missing
  any categorical data
  repeat/no-repeat modes
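The "% missing" parameter could be applied as a final step over already generated attribute values, roughly as sketched here. The function name and the choice to blank out an exact number of cells (rather than deciding per cell at random) are assumptions; labels are left out of `rows` because labels are never missing.

```python
import random

def insert_missing(rows, percent_missing, rng=None, marker="?"):
    """Return a copy of rows with an exact share of cells replaced by marker."""
    rng = rng or random.Random()
    cells = [(r, c) for r, row in enumerate(rows) for c in range(len(row))]
    count = round(len(cells) * percent_missing / 100)
    result = [list(row) for row in rows]  # leave the input untouched
    for r, c in rng.sample(cells, count):
        result[r][c] = marker
    return result
```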
Sample Data Sets
  10 examples, 10 attributes, 40% positive examples, 20% missing, repeats allowed
  Data set 1:
    27,81,88,59, ?,16,88, ?,41, ?,0
    15,70,91,41, ?, 3, ?, ?, ?,64,0
    82, ?,51,47, ?, 4, 1,99, ?,51,0
    22,72,11, ?,96,24,44,92, ?,11,1
    57,77, ?,86,89,77,61,76,96,98,1
    76,11, 4,51,43, ?,79,21,28, ?,0
     6,33, ?, ?,52,63,94,75, 8,26,0
    77,36,91, ?,47, 3,85,71,35,45,1
     ?,17,15, 2,90,70, ?, 7,41,42,0
     8,58,42,41,74,87,68,68, 1,15,1
  Data set 2:
    35, 3,20,41,91, ?,32,11,43, ?,1
    19,50,11,57,36,94, ?,96, 7,23,1
    24,36,36,79,78,33,34, ?,32, ?,0
     ?,15, ?,19,65,80,17,78,43, ?,0
    40,31,89,50,83,55,25, ?, ?,45,1
    52, ?, ?, ?, ?,39,79,82,94, ?,0
    86,45, ?, ?,74,68,13,66,42,56,0
     ?,53,91,23,11, ?,47,61,79, 8,0
    77,11,34,44,92, ?,63,62,51,51,1
    21, 1,70,14,16,40,63,94,69,83,0
The Testing Framework
  Data set generator
  Model comparison
  Ranking comparison: includes metrics like normalized equivalence and AUCs
  Tracing options: for generating and comparing outputs of debugging statements
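For readers unfamiliar with AUC as a ranking metric, the sketch below computes it from 0/1 labels listed in ranked order, using the standard pairwise definition: the fraction of (positive, negative) pairs in which the positive example is ranked higher. Whether the framework computes AUC this way is an assumption, and ties in rank are not handled here.

```python
def auc_from_ranking(ranked_labels):
    """ranked_labels: 0/1 labels in ranked order, highest-ranked example first."""
    positives = sum(ranked_labels)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    correct_pairs = 0
    negatives_below = negatives
    for label in ranked_labels:
        if label == 1:
            # This positive outranks every negative still below it.
            correct_pairs += negatives_below
        else:
            negatives_below -= 1
    return correct_pairs / (positives * negatives)
```

Two rankings of the same data set can then be compared by checking whether their AUC values agree.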
MartiRank and SVM
  MartiRank was specifically designed for the real-world device failure application
    Seeks to find the sequence of attributes to segment and sort the data to produce the best result
  SVM is typically a classification algorithm
    Seeks to find a hyperplane that separates examples from different classes
    SVM-Light has a ranking mode based on the distance from the hyperplane
Findings
  Testing approach and framework were developed for MartiRank then applied to SVM
  Only the findings most related to parameterized random testing are presented here
    More details and case studies about the testing of MartiRank can be found in our tech report
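Issue #1: Repeating Values
  One version of MartiRank did not use “stable” sorting
  Original order (the two examples shown tie on the fourth attribute, value 3):
    ...
    91,41,19, 3,57,11,20,64,0
    ...
    36,73,47, 3,85,71,35,45,1
    ...
  Stable sort (tied examples keep their original relative order):
    ...
    91,41,19, 3,57,11,20,64,0
    36,73,47, 3,85,71,35,45,1
    ...
  Unstable sort (tied examples may be swapped):
    ...
    36,73,47, 3,85,71,35,45,1
    91,41,19, 3,57,11,20,64,0
    ...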
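Whether a sort preserved the order of tied examples can be checked mechanically. The sketch below is one possible check, not part of the described framework: it assumes the rows are distinct, that the result is already sorted on the key, and that the key here is the fourth attribute (inferred from the shared value 3 in the two examples). Python's built-in `sorted` is guaranteed to be stable, so it serves as the reference.

```python
def is_stable_order(original, result, key):
    """True if rows with equal keys appear in result in their original order."""
    position = {tuple(row): i for i, row in enumerate(original)}  # distinct rows assumed
    for first, second in zip(result, result[1:]):
        if key(first) == key(second) and position[tuple(first)] > position[tuple(second)]:
            return False
    return True

original = [
    [91, 41, 19, 3, 57, 11, 20, 64, 0],
    [36, 73, 47, 3, 85, 71, 35, 45, 1],
]
fourth_attribute = lambda row: row[3]

stable_result = sorted(original, key=fourth_attribute)  # built-in sort is stable
swapped_result = list(reversed(original))               # what an unstable sort may produce
```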
Issue #2: Sparse Data Sets
  Not specifically addressed in specification
  Three possible treatments of missing values when sorting (the examples below are ordered on the fifth attribute)
  Original data:
    41,91, ?,32,11,43, ?,1
    57,36,94, ?,96, 7,23,1
    79,78,33,34, ?,31, ?,0
    19,65,80,17,78,46, ?,0
    50,83,55,25, ?, ?,45,1
     ?, ?,39,79,82,94, ?,0
  Sort “around” missing values:
    41,91, ?,32,11,43, ?,1
    19,65,80,17,78,46, ?,0
    79,78,33,34, ?,31, ?,0
     ?, ?,39,79,82,94, ?,0
    50,83,55,25, ?, ?,45,1
    57,36,94, ?,96, 7,23,1
  Put missing values at end:
    41,91, ?,32,11,43, ?,1
    19,65,80,17,78,46, ?,0
     ?, ?,39,79,82,94, ?,0
    57,36,94, ?,96, 7,23,1
    79,78,33,34, ?,31, ?,0
    50,83,55,25, ?, ?,45,1
  Randomly insert missing values:
    41,91, ?,32,11,43, ?,1
    50,83,55,25, ?, ?,45,1
    19,65,80,17,78,46, ?,0
    79,78,33,34, ?,31, ?,0
     ?, ?,39,79,82,94, ?,0
    57,36,94, ?,96, 7,23,1
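The three treatments of missing values can be written down as small functions, which makes the difference between them easy to test. This is an illustrative sketch only: the function names are hypothetical, and the choice of the fifth attribute (index 4) as the sort key is inferred from the example data rather than stated in the talk.

```python
import random

MISSING = "?"

def _known_sorted(rows, column):
    return sorted((r for r in rows if r[column] != MISSING), key=lambda r: r[column])

def sort_around_missing(rows, column):
    """Rows with a missing key stay in place; the others are sorted around them."""
    known = iter(_known_sorted(rows, column))
    return [r if r[column] == MISSING else next(known) for r in rows]

def sort_missing_at_end(rows, column):
    """Sorted known rows first, then the rows with a missing key."""
    return _known_sorted(rows, column) + [r for r in rows if r[column] == MISSING]

def sort_missing_random(rows, column, rng):
    """Sorted known rows, with missing-key rows inserted at random positions."""
    result = _known_sorted(rows, column)
    for r in (r for r in rows if r[column] == MISSING):
        result.insert(rng.randint(0, len(result)), r)
    return result

rows = [
    [41, 91, "?", 32, 11, 43, "?", 1],
    [57, 36, 94, "?", 96, 7, 23, 1],
    [79, 78, 33, 34, "?", 31, "?", 0],
    [19, 65, 80, 17, 78, 46, "?", 0],
    [50, 83, 55, 25, "?", "?", 45, 1],
    ["?", "?", 39, 79, 82, 94, "?", 0],
]
```

Because all three orderings are defensible when the specification is silent, two implementations can disagree here without either being obviously wrong.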
Issue #3: Categorical Data
  Discovered that refactoring had introduced a bug into an important calculation
    A global variable was being used incorrectly
  This bug did not appear in any of the tests that had only repeating values or only missing values
  However, categorical data necessarily has repeating values and may have missing values
Issue #4: Permuted Input Data
  Randomly permuting the input data led to different models (and then different rankings) generated by SVM-Light
  Caused by “chunking” data for use by an approximating variant of the optimization algorithm
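A check of this kind can be expressed generically: train on the data, train again on shuffled copies, and compare the resulting models. In the sketch below, `train` stands for any hypothetical function that maps a list of examples to a comparable model; the two toy trainers exist only to make the example runnable and have nothing to do with SVM-Light itself.

```python
import random

def differs_under_permutation(train, examples, trials=5, rng=None):
    """True if training on a shuffled copy ever yields a different model."""
    rng = rng or random.Random()
    baseline = train(examples)
    for _ in range(trials):
        shuffled = list(examples)
        rng.shuffle(shuffled)
        if train(shuffled) != baseline:
            return True
    return False

# Toy stand-ins for a learner (illustrative only).
def count_positives(examples):
    return sum(row[-1] for row in examples)        # ignores input order

def remember_order(examples):
    return [tuple(row) for row in examples]        # depends on input order

examples = [[i, 2 * i, i % 2] for i in range(10)]
```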
Observations
  Parameterized random testing allowed us to isolate the traits of the data sets
  These traits may appear in real-world data but not necessarily in the desired combinations
  An algorithm’s failure to address specific data set traits can lead to discrepancies
Related Work – Machine Learning
  There has been much research into applying Machine Learning techniques to software testing, but not the other way around
  Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness
Related Work – Random Testing
  Parameterization generally refers to specifying data type or range of values
  Our work differs from that of Thévenod-Fosse et al. [’91] on “structural statistical testing”, which focuses on path selection and coverage testing, not system testing
  Also differs from “uniform statistical testing” because although we do select random data over a uniform distribution, we parameterize it according to equivalence classes
Limitations and Future Work
  Test suite adequacy for coverage not addressed or measured
  Could also consider non-deterministic Machine Learning algorithms
  Can also include mutation testing for effectiveness of data sets
  Should investigate creating large data sets that correlate to real-world data
Conclusion
  Our contribution is an approach that combines parameterization and randomness to control the properties of very large data sets
  Critical for limiting the scope of individual tests and for pinpointing specific issues related to the traits of the input data