Mutation-based Fault Localization of Deep Neural Networks: Evaluation

12 Mar 2024

This paper is available on arxiv under CC 4.0 license.

Authors:

(1) Ali Ghanbari, Dept. of Computer Science, Iowa State University;

(2) Deepak-George Thomas, Dept. of Computer Science, Iowa State University;

(3) Muhammad Arbab Arshad, Dept. of Computer Science, Iowa State University;

(4) Hridesh Rajan, Dept. of Computer Science, Iowa State University.

VI. EVALUATION

We evaluate deepmufl and compare it to state-of-the-art static and dynamic DNN fault localization techniques, by investigating the following research questions (RQs).

• RQ1 (Effectiveness):

  1. How does deepmufl compare to state-of-the-art tools in terms of the number of bugs detected?

  2. How many bugs does deepmufl detect from each subcategory of model bugs in our dataset and how does that compare to state-of-the-art tools?

  3. What is the overlap of detected bugs among deepmufl and other fault localization techniques?

• RQ2 (Efficiency):

  1. What is the impact of mutation selection on the effectiveness and efficiency of deepmufl?

  2. How does deepmufl compare to state-of-the-art tools in terms of end-to-end fault localization time?

A. Dataset of DNN Bugs

To evaluate deepmufl and compare it to state-of-the-art DNN fault localization techniques, we queried the StackOverflow Q&A website for posts about Keras that had at least one accepted answer. Details about the SQL query used to obtain the initial list of posts are available online [36]. The query resulted in 8,412 posts that we manually sifted through to find the programs with model bugs. Specifically, we kept the bugs that satisfied the following conditions.

• The program was implemented using the Sequential API of Keras,

• The bug in the program was a model bug supported by deepmufl as described in §V, and

• The bug either had a training dataset available in the post in some form (e.g., hard-coded, clearly described in the body of the post, or provided through a link to the actual data) or we could reproduce the described error using synthetic data obtained from scikit-learn’s dataset generation API.

This resulted in 102 bugs, and we paired each bug with a fix obtained from the accepted answer to the question. We further added 7 bugs from the DeepLocalize dataset [11], which also come from StackOverflow, and paired them with fixes obtained from the most up-voted answer. Thus, we ended up with 109 bugs in total. To the best of our knowledge, this is the largest dataset of model bugs obtained from StackOverflow, and it overlaps with the existing DNN bug datasets from previous research [12], [56]. Our bug dataset contains 85 classifiers (45 fully-connected DNNs, 29 CNNs, and 11 RNNs) and 24 regression models (19 fully-connected DNNs, 3 CNNs, and 2 RNNs), and each category has at least one example of model bugs. Therefore, we believe that our dataset is highly representative of model bugs, i.e., the bugs supported by deepmufl (and other tools that support this type of bug), as we have examples of each sub-category of bug in various locations of the models for various regression and classification tasks.

After loading the training dataset for each bug, we fitted the buggy model three times and stored each trained model separately in .h5 file format. The repetition was conducted to take randomness in training into account. Randomness in data generation was mitigated by using deterministic random seeds. For fault localization purposes, we used the test dataset, and if it was not available, we used the training dataset itself. When we had to use synthesized data points, we deterministically split the generated set of data into training and testing datasets.
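As an illustration of the deterministic split mentioned above, the following is a minimal sketch that uses only the Python standard library. The function name `deterministic_split`, the 30% test fraction, and the seed value are assumptions made for this example; they are not taken from the experimental scripts.

```python
import random

def deterministic_split(samples, test_fraction=0.2, seed=0):
    """Shuffle indices with a fixed seed, then cut into (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be strictly between 0 and 1")
    indices = list(range(len(samples)))
    # A private RNG keeps the split reproducible and leaves global state alone.
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(round(len(samples) * test_fraction)))
    test_idx = set(indices[:n_test])
    train = [s for i, s in enumerate(samples) if i not in test_idx]
    test = [s for i, s in enumerate(samples) if i in test_idx]
    return train, test

# Hypothetical synthetic data: (feature, label) pairs.
data = [(x, x % 2) for x in range(10)]
train_a, test_a = deterministic_split(data, test_fraction=0.3, seed=42)
train_b, test_b = deterministic_split(data, test_fraction=0.3, seed=42)
```

Calling the function twice with the same seed yields identical partitions, which is the property needed to make repeated training runs comparable.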

B. Baseline Approaches and Measures of Effectiveness

In RQ1 and RQ2, we compare five different configurations of deepmufl to recent static and dynamic DNN fault localization tools. The five configurations of deepmufl are as follows.

Metallaxis [30]: In this setting, we use the Metallaxis formula to calculate suspiciousness values of model elements. Metallaxis, by default, uses SBI [46] to calculate suspiciousness values for individual mutants. A recent study [57] provides empirical evidence on the superiority of Ochiai [47] over SBI when used within Metallaxis formula. Thus, we considered the following four combinations: type 1 impact: (1) SBI formula (i.e., Eq. 1); (2) Ochiai formula (i.e., Eq. 2), and type 2 impact: (3) SBI formula (i.e., Eq. 1); (4) Ochiai formula (i.e., Eq. 2).

MUSE [17]: We used the default formula of MUSE to calculate the suspiciousness of model elements. For this, only type 1 impact is considered, as the heuristics behind MUSE are defined based on type 1 impact.
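The Metallaxis combinations above can be sketched as follows. This is a hedged illustration, not the implementation of deepmufl: it assumes the commonly used forms of SBI and Ochiai, whose notation may differ from Eq. 1 and Eq. 2, and it assumes that each mutant has already been summarized by two counts (the number of failing and passing data points it impacts, under either impact type). The function names and the impact counts are hypothetical.

```python
import math

def sbi(failed_impacted, passed_impacted):
    """SBI-style score: share of impacted data points that were failing."""
    total = failed_impacted + passed_impacted
    return failed_impacted / total if total else 0.0

def ochiai(failed_impacted, passed_impacted, total_failed):
    """Ochiai-style score for a single mutant."""
    denom = math.sqrt(total_failed * (failed_impacted + passed_impacted))
    return failed_impacted / denom if denom else 0.0

def metallaxis(mutants_by_layer, total_failed, formula="sbi"):
    """Suspiciousness of a layer is the maximum score over its mutants."""
    scores = {}
    for layer, mutants in mutants_by_layer.items():
        per_mutant = [
            sbi(f, p) if formula == "sbi" else ochiai(f, p, total_failed)
            for f, p in mutants
        ]
        scores[layer] = max(per_mutant, default=0.0)
    return scores

# Hypothetical impact counts: layer -> [(failed_impacted, passed_impacted), ...]
impacts = {0: [(1, 3), (0, 2)], 1: [(4, 0), (2, 2)], 2: [(0, 0)]}
sbi_scores = metallaxis(impacts, total_failed=4, formula="sbi")
ochiai_scores = metallaxis(impacts, total_failed=4, formula="ochiai")
# Layers ordered from most to least suspicious.
ranking = sorted(sbi_scores, key=sbi_scores.get, reverse=True)
```

In this toy input, layer 1 has a mutant that impacts all four failing data points and no passing ones, so it is ranked first under both formulas.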

Our technique follows a more traditional way of reporting root causes for the bugs [30], [17], [57], [19], [22], [21], [15], in that it reports a list of potential root causes ranked by their likelihood of being responsible for the bug. This allows users to find the bugs faster and spend less time reading through the fault localization report, which in turn increases the practicality of the technique [58]. We have used the top-N metric, with N = 1, to measure the effectiveness of deepmufl in RQ1 and RQ2. Specifically, if the number of any of the buggy layers appeared in first place in the output of deepmufl, we reported the bug as detected; otherwise, we marked it as not detected. We emphasize that the top-1 metric gives strong evidence of the effectiveness of deepmufl, as developers usually only inspect top-ranked elements, e.g., over 70% of developers only check the top-5 ranked elements [59].
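The top-N criterion described above can be expressed in a few lines. The sketch below assumes that a fault localization report is a list of layer numbers ordered from most to least suspicious; the function name and the three sample reports are hypothetical.

```python
def detected_at_top_n(ranked_layers, buggy_layers, n=1):
    """True if any buggy layer appears among the first n ranked layers."""
    return any(layer in buggy_layers for layer in ranked_layers[:n])

# Hypothetical reports: (ranked layer numbers, set of truly buggy layer numbers).
reports = [
    ([3, 0, 1], {3, 5}),  # buggy layer 3 is ranked first
    ([0, 3, 1], {3}),     # buggy layer 3 is ranked second
    ([2, 1, 0], {4}),     # buggy layer 4 never appears
]
top1 = sum(detected_at_top_n(r, b, n=1) for r, b in reports)
top2 = sum(detected_at_top_n(r, b, n=2) for r, b in reports)
print(top1, top2)  # prints: 1 2
```

Only the first report counts as detected at top-1, which shows how strict the metric is compared to larger values of N.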

Table 3: Effectiveness of different deepmufl configurations and four other tools in detecting bugs from four sub-categories of model bugs

Our selection criteria for the studied fault localization techniques are: (1) availability; (2) reproducibility of the results reported in the original papers, so as to have a level of confidence in the correctness of the results reported here; and (3) support for model bugs in our dataset, so that we can make a meaningful comparison to deepmufl. Below we give a brief description of each of the selected tools, why we believe they support model bugs, and how we have interpreted their outputs in our experiments, i.e., when we regard a bug as detected by the tool.

1) Neuralint: A static fault localization tool that uses 23 rules to detect faults and design inefficiencies in the model. Each rule is associated with a set of rules of thumb for fixing the bug, which are shown to the user when the precondition of the rule is satisfied. The five rules described in Section 4.2.1 of the paper target model bugs. Neuralint produces outputs of the form [Layer L ==> MSG] ∗ , where L is the suspicious layer number, and MSG is a description of the detected issue and/or a suggestion on how to fix the problem. A bug is deemed detected by this tool if it is located in the layer mentioned in the output message or the messages describe any of the root causes of the bug.

2) DeepLocalize: A dynamic fault localization technique that detects numerical errors during model training. One of the three rules described in Section III.D of the paper checks for model bugs related to a wrong activation function. DeepLocalize produces a single message of the form Batch B Layer L : MSG, where B is the batch number wherein the symptom is detected and L and MSG are the same as described for Neuralint. A bug is deemed detected if it is located in the layer mentioned in the output message or the message describes any of the root causes of the bug.

3) DeepDiagnosis: A tool similar to DeepLocalize, but with more bug pattern rules and a decision procedure to give actionable fix suggestions to the users based on the observations. All 8 rules in Table 2 of the paper monitor the symptoms of model bugs. Similar to DeepLocalize, DeepDiagnosis produces a single message of the form Batch B Layer L : MSG1 [OR MSG2], where B and L are the same as described in DeepLocalize and MSG1 and MSG2 are two alternative solutions that the tool might suggest to fix the detected problem. A bug is deemed detected if it is located in the layer mentioned in the output message or the message describes any of the root causes of the bug.

4) UMLAUT: A hybrid, i.e., a combination of static and dynamic, technique that works by applying heuristic static checks on, and injecting dynamic checks in, the program, parameters, model structure, and model behavior. Violated checks raise error flags which are propagated to a web-based interface that uses visualizations, tutorial explanations, and code snippets to help users find and fix detected errors in their code. All three rules described in Section 5.2 of the paper target model bugs. The tool generates outputs of the form [< MSG1 > · · · < MSGm >] ∗ , where m > 0 and MSGi is a description of the problem detected by the tool. A bug is deemed detected if any of the messages match the fix prescribed by the ground-truth.

C. Results

To answer RQ1, we ran deepmufl (using its five configurations) and four other tools on the 109 bugs in our benchmark. We refer the reader to the repository [36] for the raw data about which bug is detected by which tool, and here we describe the summaries and provide insights.

At top-1, deepmufl detects 42, 47, 26, 37, and 53 bugs using its Metallaxis SBI + Type 1, Metallaxis Ochiai + Type 1, Metallaxis SBI + Type 2, Metallaxis Ochiai + Type 2, and MUSE configurations, respectively. Meanwhile, Neuralint, DeepLocalize, DeepDiagnosis, and UMLAUT detect 21, 26, 30, and 25 bugs, respectively. Therefore, as far as the number of bugs detected by each technique is concerned, the MUSE configuration of deepmufl is the most effective configuration, significantly outperforming the studied techniques, and Metallaxis SBI + Type 2 is the least effective one, outperformed by DeepDiagnosis. An empirical study [57], which uses a specific dataset of traditional buggy programs, concludes that Metallaxis Ochiai + Type 2 is the most effective configuration for MBFL. Meanwhile, our results for DNNs corroborate the theoretical results by Shin and Bae [60], i.e., we provide empirical evidence that, in the context of DNNs, MUSE is the most effective MBFL approach.

Table 3 reports more details and insights on the numbers discussed above. Specifically, it reports the number of bugs detected by each configuration of deepmufl and by the four other studied tools from each sub-category of model bugs present in our dataset. As we can see from the upper half of the table, MUSE is the most effective in detecting bugs related to activation function (SC1), bugs related to model type/properties (SC2), and wrong/redundant/missing layer (SC4), while the Metallaxis Ochiai + Type 1 configuration outperforms other configurations in detecting bugs related to layer properties (SC3). Similarly, from the bottom half of the table, we can see that other tools are also quite effective in detecting bugs related to activation function, with DeepDiagnosis being the most effective one among them. We can also observe that UMLAUT has been the most effective tool in detecting bugs related to layer properties. As we can see, the MUSE configuration of deepmufl is consistently more effective than other tools across all bug sub-categories.

Table 4: The fraction of bugs detected by each deepmufl configuration that are also detected by the other four tools

Table 4 provides further insights on the overlap of bugs detected by each variant of deepmufl and those detected by the other four tools. Each value in row r and column c of this table, where 2 ≤ r ≤ 5 and 2 ≤ c ≤ 6, denotes the percentage of bugs detected by the deepmufl variant corresponding to row r that are also detected by the tool corresponding to column c. The values inside the parentheses are the actual numbers of bugs. For example, 8 out of 42, i.e., 19.05%, of the bugs detected by the Metallaxis SBI + Type 1 configuration of deepmufl are also detected by DeepLocalize. The last column of the table reports the same statistics for all four of the studied tools combined. As we can see from the table, 60.38% of the bugs detected by the MUSE configuration of deepmufl are already detected by one of the four tools, yet it detects 21 (=53-32) bugs that are not detected by any other tool. This is because deepmufl approaches the fault localization problem from a fundamentally different angle, which gives it more flexibility. Specifically, instead of looking for conditions that trigger a set of hard-coded rules indicating bug patterns, deepmufl breaks the model using a set of mutators to observe how different mutations impact the model behavior. Then, by leveraging the heuristics underlying traditional MBFL techniques, it performs fault localization using the observed impacts on the model behavior. Listing 2 shows an example of a model bug that only deepmufl can detect.
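The overlap percentages discussed for Table 4 can be recomputed with plain set operations. In the sketch below, the bug IDs are placeholders chosen only so that the set sizes and intersections match the counts quoted in the text; they are not the actual bug IDs from the dataset, and the size of the combined-tools set is arbitrary.

```python
def overlap(config_bugs, tool_bugs):
    """Return (count, percentage) of config_bugs that tool_bugs also contains."""
    common = config_bugs & tool_bugs
    pct = 100.0 * len(common) / len(config_bugs) if config_bugs else 0.0
    return len(common), round(pct, 2)

# Placeholder bug IDs, chosen only so the counts match those quoted in the text.
metallaxis_sbi_t1 = set(range(42))   # 42 bugs detected by this configuration
deeplocalize = set(range(34, 60))    # 26 bugs, 8 of them shared
muse = set(range(53))                # 53 bugs detected by MUSE
all_four_tools = set(range(21, 80))  # arbitrary size; 32 shared with MUSE

shared_dl = overlap(metallaxis_sbi_t1, deeplocalize)
shared_all = overlap(muse, all_four_tools)
only_muse = len(muse - all_four_tools)
print(shared_dl, shared_all, only_muse)  # prints: (8, 19.05) (32, 60.38) 21
```

Note that the percentage is taken relative to the deepmufl configuration, not the other tool, so the measure is not symmetric.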

Listing 2: Bug 48251943 in our dataset

The problem with this regression model is that it does not output negative values, and given that the dataset contains negative target values, the model achieves a high MSE. Since this model does not result in any numerical errors during training, DeepLocalize does not issue any warning messages, and since MSE decreases and the model does not show any sign of erratic behavior, DeepDiagnosis does not detect the bug. UMLAUT’s messages instruct adding a softmax layer and checking validation accuracy, which is clearly not related to the problem, because the bug is fixed by changing the activation function of the last layer to tanh and normalizing the output values. Lastly, Neuralint issues an error message regarding an incorrect loss function, which also seems to be a false positive.

To answer RQ2, we ran deepmufl and the other four tools on a Dell workstation with Intel(R) Xeon(R) Gold 6138 CPU at 2.00 GHz, 330 GB RAM, 128 GB RAM disk, and Ubuntu 18.04.1 LTS and measured the time needed for model training as well as for the MBFL process to complete. We repeated this process four times, and in each round of deepmufl’s execution, we randomly selected 100% (i.e., no selection), 75%, 50%, and 25% of the generated mutants for testing. Random mutation selection is a common method for reducing the overhead of mutation analysis [61], [35]. During random selection, we made sure that each layer receives at least one mutant, so that we do not mask any bug. The last row in Table 5 reports the average timing (of 3 runs) of MBFL in each round of mutation selection. The table also reports the impact of mutation selection on the number of bugs detected by each configuration of deepmufl. As we can see, in the MUSE configuration of deepmufl, by using 50% of the mutants, one can halve the execution time and still detect 92.45% of the previously detected bugs. Therefore, mutation selection can be used as an effective way of curtailing MBFL time in DNNs.
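One possible way to implement random mutant selection with the per-layer guarantee described above is sketched below. It is an assumption-laden illustration rather than deepmufl's actual procedure: the function name `select_mutants`, the representation of mutants as (layer, name) pairs, and the rounding of the budget are all choices made for this example.

```python
import math
import random

def select_mutants(mutants_by_layer, fraction, seed=0):
    """Pick about `fraction` of all mutants, keeping at least one per layer."""
    rng = random.Random(seed)
    selected, pool = [], []
    for layer in sorted(mutants_by_layer):
        mutants = list(mutants_by_layer[layer])
        if not mutants:
            continue
        # Reserve one randomly chosen mutant for this layer.
        selected.append(mutants.pop(rng.randrange(len(mutants))))
        pool.extend(mutants)
    total = len(selected) + len(pool)
    budget = max(math.ceil(fraction * total), len(selected))
    # Fill the rest of the budget uniformly at random from the leftovers.
    selected.extend(rng.sample(pool, min(budget - len(selected), len(pool))))
    return selected

# Hypothetical mutants for a three-layer model, written as (layer, name) pairs.
mutants = {
    0: [(0, "m1"), (0, "m2"), (0, "m3"), (0, "m4")],
    1: [(1, "m1")],
    2: [(2, "m1"), (2, "m2"), (2, "m3"), (2, "m4"), (2, "m5")],
}
for fraction in (1.0, 0.75, 0.5, 0.25):
    chosen = select_mutants(mutants, fraction, seed=7)
    layers_covered = {layer for layer, _ in chosen}
    print(fraction, len(chosen), sorted(layers_covered))
```

Reserving one mutant per layer before sampling means that even at 25% no layer is left without a mutant, at the cost of a slightly non-uniform selection probability for layers with few mutants.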

For a fair comparison of deepmufl to state-of-the-art fault localization tools in terms of efficiency, we need to take into account the fact that deepmufl requires a pre-trained model as its input. Thus, as far as the end-to-end fault localization time from an end-user’s perspective is concerned, we want to take into consideration the time needed to train the input model in addition to the time needed to run deepmufl. With training time taken into account, deepmufl takes, on average, 1492.48, 1714.63, 1958.35, and 2192.4 seconds when we select 25%, 50%, 75%, and 100% of the generated mutants, respectively. We also emphasize that the time for DeepLocalize and DeepDiagnosis varied based on whether or not they found the bug. Given that a user could terminate the fault localization process after a few epochs when they lose hope of finding bugs with these two tools, we report two average measurements for DeepLocalize and DeepDiagnosis: (1) the average time irrespective of whether the tools succeed in finding the bug; and (2) the average time when the tools successfully find the bug. Unlike these two tools, the time for Neuralint and UMLAUT does not change based on whether they detect a bug or not. DeepLocalize takes on average 1244.09 seconds overall and 57.29 seconds when it successfully finds the bug. These numbers for DeepDiagnosis are 1510.71 and 11.05 seconds, respectively. Meanwhile, Neuralint and UMLAUT take on average 2.87 seconds and 1302.61 seconds, respectively, to perform fault localization.

D. Discussion

It is important to note that while deepmufl outperforms state-of-the-art techniques in terms of the number of bugs detected in our dataset, it is not meant to replace them. Our dataset only covers a specific type of bug, i.e., model bugs, while other studied techniques push the envelope by detecting bugs related to factors like learning rate and training data normalization, which are currently outside of deepmufl’s reach. We observed that combining all the techniques results in detecting 87 of the bugs in our dataset; exploring ways to combine various fault localization approaches by picking the right tool based on the characteristics of the bug is an interesting topic for future research. Moreover, depending on the application and resource constraints, a user might prefer one tool over another. For example, although Neuralint might be limited by its static nature, e.g., it might not be able to analyze models that use complex computed values and objects in their construction, it takes only a few seconds for the tool to conduct fault localization. Thus, in some applications, e.g., online integration with IDEs, approaches like that of Neuralint might be the best choice.

Table 5: The impact of mutation selection on the effectiveness and execution time of deepmufl

A major source of overhead in an MBFL technique is the sheer number of mutants that the technique generates and tests [62], [61]. Sufficient mutator selection [63] refers to the process of selecting a subset of mutators that achieves the same (or a similar) effect, i.e., the same or similar mutation score and the same or similar number of detected bugs, but with a smaller number of mutants generated and tested. For the mutators of Table 2, we have not yet conducted any analysis of which mutators might be redundant, as reliable mutator selection requires a larger dataset than we currently have. We postpone this study to future work.

Combining fault localization tools can also be done with the goal of improving efficiency. We see an opportunity in building faster, yet more effective, fault localization tools by predicting the likely right tool upfront for a given model, or by running tools one by one and moving on to the next tool once we have a level of confidence that the current tool will not find the bug. We postpone this study to future work.

Lastly, we would like to emphasize that comparing deepmufl to the above-mentioned techniques on a dataset of bugs that deepmufl supports is fair, as the other tools are also designed to detect bugs in the category of model bugs. However, making these tools perform better than this would require augmenting their current rule base with numerous new rules, and adding new rules comes with the obligation of justifying their generality and rationale, which might be quite a difficult undertaking. deepmufl, on the other hand, approaches the fault localization problem differently, allowing for more flexibility without the need for hard-coded rules.