This paper is available on arxiv under CC 4.0 license.
Authors:
(1) Ali Ghanbari, Dept. of Computer Science, Iowa State University;
(2) Deepak-George Thomas, Dept. of Computer Science, Iowa State University;
(3) Muhammad Arbab Arshad, Dept. of Computer Science, Iowa State University;
(4) Hridesh Rajan, Dept. of Computer Science, Iowa State University.
Table of Links
- Abstract & Introduction
- Background
- Motivating Example
- Proposed Approach
- Supported DNN Bugs
- Evaluation
- Threats to Validity
- Related Work
- Conclusion, Acknowledgments & References
IV. PROPOSED APPROACH
Our technique deepmufl comprises four components: (1) mutation generator, (2) test case splitter, (3) mutation executor/tester, and (4) suspiciousness calculator. Figure 2 depicts these components as processes, numbered accordingly, taking inputs and producing outputs. Mutation generator (marked ⃝1 in Figure 2), applies 79 mutators on all the layers of the input buggy DNN, so as to generate a pool of mutants, i.e., variants of the original, buggy DNN model with small perturbations, e.g., replacing activation function of a layer. Test case splitter (marked ⃝2 in the figure), applies the original buggy DNN on a given set of test data points, i.e., test cases (or input values) paired with their expected output values, so as to partition the set into two subset, namely passing test cases and failing test cases. Passing test cases are referred to as those input values for which the expected output matches that of produced by the original model, whereas failing test cases are referred to as those input values for which the expected output does not match that of produced by the model. This component also stores the output of the original model on each of the test cases. Next, mutation executor (which is also called mutation tester, marked ⃝3 in the figure) applies the generated mutants on each of the passing and failing test cases and the results are compared to that of the original model recorded in the previous step. This amounts to a mutation execution matrix that is used to calculate suspiciousness values for each layer in the model (marked ⃝4 in the figure). The user may instruct deepmufl to use a specific fault localization formula, e.g., MUSE or Metallaxis with SBI or Ochiai, for calculating suspiciousness values. The layers are then ranked based on the calculated suspiciousness values for user inspection. The ranked list is accompanied with the information about the mutations conducted on each layer to facilitate debugging.
A. Mutation Generator
Mutation generator component receives the original, buggy DNN model and generates as many variants of the model, i.e., mutants, as possible, by systematically mutating every elements of the input model. This component implements 79 mutators. Mutators can be viewed as transformation operators that when applied to a given element, e.g., a neuron or a layer, in the model, returns a new variants of the model with that particular element mutated. Table 2 lists all the mutators implemented in deepmufl, the types of model elements on which they can operate, and the way each mutator affects the target element. These mutators are inspired by the ones implemented in the existing mutation analysis systems, e.g., [27], [28], [29], [49], [50], [51], [52], [53], to name a few. Ma et al. [29], and Hu et al. [28], define so-called model-level mutators that also operate on pre-trained models. Direct reuse of all of their mutators was not possible, as those mutators depend on random values which would introduce a source of nondeterminism in deepmufl: mutating the same model element with random values, e.g., Gaussian fuzzing, as in [29], would yield a different mutant each time, making deepmufl to produce different outputs on each run for the same model. In general, as far as MBFL is concerned, using variable values (whether it is deteministic or not), instead of the current hard-coded ones, for mutation of weights would not bring about any benefit, as the goal here is to break the model in some way and observe its impact on originally failing and passing test cases.
We argue that not all model bugs could be emulated using mutators at the level of pre-trained models, e.g., missing batch normalization, but the mutators listed in Table 2 are sufficient for emulating a subset of such bugs, e.g., wrong activation function or missing/redundant layer. Please see §V for a more detailed discussion on supported bugs.
Mutation generation in deepmufl is done directly on the trained model and there is no need for retraining the model. This makes deepmufl quite fast and perhaps more attractive from a practical point of view. However, this comes with a risk of not being traceable, i.e., a mutation on pre-trained .h5 model does not directly correspond to a line of source code for the user to inspect. In the Keras programs that we studied, this limitation was mitigated by the fact that the models with Sequential architecture were implemented using a wellunderstood structure and mapping layer numbers/identifiers in deepmufl’s reports to source code was trivial. In a future work, with the help of simple auto-generated annotations, e.g., for lexical scoping of the code snippet for model declaration, we will extend deepmufl to automatically map layer numbers/identifiers in its reports to actual lines of source code.
Humbatova et al. [27] argue about the importance of mutators emulating real DNN faults. We acknowledge that mutators emulating real faults would help generating more informative reports that would also give hints on how to fix the program. However, unlike mutation analysis, the main objective of an MBFL technique is to assign suspiciousness values to the program elements which can, in theory, be done using any kind of mutator, whether or not it makes sense from the standpoint of a developer. It is worth noting that the alternative design decision of using DeepCrime [27] as a mutation generation engine for deepmufl would result in finding more bugs than the current version of deepmufl, e.g., bugs related to training hyper-parameters or training dataset, but such a design is expected to be impacted by the nondeterminacy inherent in training process and, given the fact that we do not employ any training data selection, would be significantly slower due to numerous re-training. Nevertheless, finding more bugs would be an incentive for exploring this alternative as a future work.
B. Test Case Splitter
Before we introduce this component, we need to clarify certain terminology.
Definition 1.** A data point in a testing dataset for a DNN model is defined to be a pair of the form (I, O), where I ∈ R m and O ∈ R n, with m and n being positive integers. In this paper I is called test case, test input, or input, while O is called expected output or ground-truth value.
Given a test dataset, test case splitter component applies the original model on each of the test cases for the data points and checks if the model output matches to the expected output. If the two outputs match, then the corresponding test case will be marked as passing, otherwise it will be marked as failing. This component also records the output produced by the original model to be used during impact analysis, described below.
C. Mutation Executor (Mutation Tester)
We start describing this component with a definition.
Definition 2. A mutation execution matrix E is a k×l matrix, where k is the number of generated mutants, while l is the number of test cases. Each element Ei,j in the matrix is a member of the set {✓, ✗, ❍}, wherein ✓ indicates that the i th mutant impacts j th test case, whereas ✗ indicates that the mutant does not affect the test case. ❍ denotes a nonviable mutant, i.e., a mutant that fails when loading or applying it on a test case. Such mutants might be generated, e.g., due to creating a shape error [32] after the mutation.
Mutation executor component construct mutation execution matrix by applying each of the generated mutants (see §IV-A) on the failing and passing test cases to determine which of the test cases are impacted by which mutants. The impact of mutation on test cases is measured using two types of impacts, i.e., type 1 impact and type 2 impact, defined below.
Definition 3. Given a DNN model M, its mutated version M′ , and a data point (I, O), we define the two types of impacts:
• Type 1: Mutant M′ impacts the test case I if M(I) = O but M′ (I) ̸= O, or M(I) ̸= O but M′ (I) = O. In other words, type 1 impact tracks pass to fail and fail to pass test cases, a la MUSE [17]. `
• Type 2: Mutant M′ impacts the test case I if M(I) ̸= M′ (I).
In this definition, M(I) or M′ (I) denotes the operation of applying model M or M′ on the test case I.
It is worth noting that checking for equality of two values can be tricky for regression models, as those models approximate the expected values. To work around this problem, deepmufl compares values obtained from regression models using a user-defined delta threshold, i.e., the values are deemed equal if their absolute difference is no more than a threshold. By default, deepmufl uses a threshold of 0.001. This is the approach adopted by well-known testing frameworks for comparing floating-point values [54], [55]. Also, whether deepmufl uses type 1 or type 2 impact is a user preference and is determined alongside the threshold.
D. Suspiciousness Value Calculator
Armed with the above definitions, we can now give concrete definitions for the terms used in Eq. 1, 2, and 3, specialized to DNNs.
• Given a model element, i.e., neuron or a layer, e, M(e) is defined to be the set of all mutants generated by mutating e. These sets are produced by mutation generator process.
• Assuming that m is a mutant on the element e, Tf (m, e) (or Tp(m, e)) is defined as the set of failing (or passing) test cases that are impacted in type 1 or type 2 fashion by m. More concretely, Tp(m, e) = {t | Em,t = ✓ ∧ t is passing}, similarly Tf (m, e) = {t | Em,t = ✓ ∧ t is failing}. These two sets are defined using a quite similar notation; the readers are encouraged to review Definitions 2 and 3 to avoid confusion.
• Tf (or Tp) is the set of originally failing (or passing) test cases. These sets are constructed by test case splitter.
• F ⇝ P (or P ⇝ F), for a model element e, is defined as the set of originally failing (or passing) test cases that turned into passing (or failing) as a result of some mutation on e. More concretely, assuming the execution matrix E is constructed using type 1 impact, F ⇝ P is defined as {t | t is failing ∧ ∃m ∈ M(e) · Em,t = ✓}. Similarly, P ⇝ F is defined as {t | t is passing ∧ ∃m ∈ M(e)·Em,t = ✓}. In other words, these sets track all the failing/passing test cases that are type 1 impacted by some mutant of a given element. These two sets are defined using a quite similar notation; the readers are encouraged to review Definitions 2 and 3 to avoid confusion.
Having specialized definitions for the terms used in the fault localization formulas described earlier, we are now able to calculate suspiciousness values for the elements in a DNN model. Guided by the user preferences, deepmufl calculates all the values for |Tf (m, e)|, etc., and plugs into the specified formula to calculate suspiciousness values for the elements. It is worth noting that if all the mutants generated for a given element are nonviable, MUSE formula (Eq. 3) and all the variants of Metallaxis (e.g., Eq. 1), by definition, will return 0 as the suspiciousness value for the element. Nonviable mutants do not contribute toward localizing the bug, therefore they are considered overhead to the fault localization process. Fortunately, based on our observations in our dataset of buggy DNNs, nonviable mutants are rare.
Equivalent mutants are another source of overhead for deepmufl. Currently, we do not have any means of detecting equivalent mutants, but we argue that these mutants do not impact MBFL results, as they are equivalent to the original model and do not impact any passing or failing test cases.