Authors:
(1) Bo Wang, Beijing Jiaotong University, Beijing, China (wangbo_cs@bjtu.edu.cn);
(2) Mingda Chen, Beijing Jiaotong University, Beijing, China (23120337@bjtu.edu.cn);
(3) Youfang Lin, Beijing Jiaotong University, Beijing, China (yflin@bjtu.edu.cn);
(4) Mike Papadakis, University of Luxembourg, Luxembourg (michail.papadakis@uni.lu);
(5) Jie M. Zhang, King’s College London, London, UK (jie.zhang@kcl.ac.uk).
ABSTRACT
The question of how to generate high-utility mutations, to be used for testing purposes, forms a key challenge in the mutation testing literature. Large Language Models (LLMs) have shown great potential in code-related tasks, but their utility in mutation testing remains unexplored. To this end, we systematically investigate the performance of LLMs in generating effective mutations with respect to their usability, fault detection potential, and relationship with real bugs. In particular, we perform a large-scale empirical study involving 4 LLMs, including both open- and closed-source models, and 440 real bugs on two Java benchmarks. We find that, compared to existing approaches, LLMs generate more diverse mutations that are behaviorally closer to real bugs, which leads to approximately 18 percentage points higher fault detection than current approaches (i.e., 87% vs. 69%) on a newly collected set of bugs, purposely selected for evaluating learning-based approaches, i.e., mitigating potential data leakage concerns. Additionally, we explore alternative prompt engineering strategies and the root causes of non-compilable mutations produced by the LLMs, and provide valuable insights for the use of LLMs in the context of mutation testing.
1 INTRODUCTION
Mutation testing is one of the most effective software testing techniques [17, 18, 27, 33, 59]. It is performed by producing program variants (i.e., mutants) through simple syntactic changes (i.e., mutations) to the source code of the original program. These variants form the objectives of testing, in the sense that tests are required to trigger different behaviors on the original and the mutant programs. A mutant is called killed or live depending on whether the test execution leads to a program output (observable behavior) that differs from that of the original program or not. The proportion of killed mutants over the entire set of mutants forms a test adequacy metric called the mutation score.
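Written as a formula (a standard formulation rather than one specific to this paper), the mutation score of a test suite $T$ over a set of mutants $M$ is

$\mathit{MS}(T, M) = \frac{|\{\, m \in M : m \text{ is killed by } T \,\}|}{|M|}$,

where, in practice, mutants known to be equivalent are often excluded from the denominator.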
Since mutations are the objectives of testing, the effectiveness of the method strongly depends on the set of mutants that one is using. Traditional techniques use simple syntactic transformations (rules), called mutation operators, that introduce syntactic changes at every code location to which they apply [33, 59]. We refer to these as rule-based approaches.
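To illustrate (a hypothetical example, not taken from the studied benchmarks), a typical arithmetic-operator-replacement rule applied to a simple Java method produces a mutant such as the following:

    public class PriceCalculator {
        // Original method
        public int total(int quantity, int unitCost) {
            return quantity * unitCost;
        }

        // Mutant: the arithmetic-operator-replacement rule turns '*' into '+'
        public int totalMutant(int quantity, int unitCost) {
            return quantity + unitCost;
        }
    }

A test asserting that total(2, 3) equals 6 kills this mutant, since the mutated version returns 5.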
Unfortunately, rule-based approaches produce large numbers of redundant mutants, i.e., mutants that do not contribute to the testing process [5, 57], and thus exacerbate the method’s computational overheads as well as the human effort involved in examining live mutants. To address this issue, researchers leverage deep learning approaches that form mutation operators based on the project’s history [62, 70, 73], with the hope that these will produce few but effective mutations. While interesting, these approaches fail to produce effective mutations [56], perhaps because they are trained on few data instances.
Nevertheless, the question of which mutants should be used when performing mutation testing remains open due to the following challenges: (1) Forming effective mutations. To be effective, mutations need to be syntactically correct, which is not the case for most existing learning-based methods, and be likely fault revealing [10, 80], i.e., have a good potential to couple with real faults. While existing methods are effective, there is still significant room for improvement [15]. (2) Forming natural mutations. Mutations need to follow natural coding patterns and practices so that they meet developer expectations, i.e., developers feel that they are worth addressing [7]. It is challenging to ensure that the mutated code is not only syntactically correct but also aligns with the realistic, common coding patterns and practices observed among developers. (3) Forming diverse and killable mutations. Determining whether a mutation is equivalent is undecidable [9], making it challenging to avoid the generation of equivalent mutants, which leads to poor scalability and wasted resources [58, 76, 86, 88]. At the same time, mutations need to be diverse and to cover the entire spectrum of code and program behaviors.
To deal with these challenges, we aim to use Large Language Models (LLMs), which have been trained on massive code corpora and can produce human-like code [11]. Following existing studies [26, 49, 65], we designed a default prompt that contains task instructions, the code element to be mutated, the corresponding method code, and few-shot examples sampled from real-world bugs. We carried out a large-scale evaluation involving 4 LLMs, 2 closed-source commercial and 2 open-source ones, and 440 real-world bugs from two Java benchmarks, i.e., Defects4J [37] and ConDefects [82], and compared the results with existing techniques and tools.
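For illustration, a prompt following this structure could look as follows (a hypothetical sketch; the concrete template used in our study is described in Section 3.3, and the code element, method, and examples shown here are placeholders):

    Task: You are given a Java method and a code element inside it.
    Generate mutants by applying a small syntactic change to the code element.

    Code element to mutate:
        if (index < 0)
    Surrounding method:
        public int indexOf(Object target) { ... }

    Few-shot examples (derived from real-world bugs):
        i <= limit      ->  i < limit
        return a + b;   ->  return a - b;

    Return each mutant as a one-line replacement of the code element.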
Our evaluation includes both quantitative and qualitative analyses of the mutations generated by LLMs, comparing them with existing approaches to assess their cost and effectiveness. Additionally, we study how different prompt engineering strategies and LLMs (i.e., GPT-3.5-Turbo [4], GPT-4-Turbo, CodeLlama-13b-Instruct [64], and StarChat-𝛽-16b [45]) affect the effectiveness of mutation generation. Moreover, we analyze the root causes and error types of the non-compilable mutants generated by LLMs.
Our results reveal that the best model for mutant generation is GPT-4, as it outperforms the other LLMs on all metrics used, i.e., it produces fewer equivalent mutants and mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults. GPT-3.5 is the second best in terms of fault detection potential, leading to the detection of 96.7% of bugs in Defects4J and 86.7% in ConDefects. Among the remaining techniques, Major is the best, with the potential to reveal 91.6% of bugs in Defects4J and 68.9% in ConDefects, indicating an advantage of approximately 18 percentage points for the GPT models on ConDefects, which is a new and previously unseen dataset.
Another interesting finding of our analysis concerns the diversity of the mutants (measured in terms of newly introduced AST node types). In particular, our results show that the GPT models exhibit the greatest diversity, newly introducing 45 different AST node types, in contrast to just 2 introduced by traditional approaches such as Major, and they subsume all existing approaches, indicating a much better contextualization of the LLM-based approach.
We also analyzed the mutations generated by GPT that fail to compile, a common issue with learning-based mutations, and found that they fall into 9 compilation error types, with Usage of Unknown Methods and Code Structural Destruction being the most prevalent ones. This finding indicates that future approaches could focus on guiding LLMs to choose the right method invocations when mutating.
We packaged our method into Kumo, an LLM-based mutation generation tool for Java. Based on the experimental data, we have also built a dataset that contains high-quality LLM-generated mutants, which can be used not only in mutation testing but also in other bug-seeding applications, such as fault localization and fault prediction. The implementation of Kumo and the experimental data are available for open science [1].
In summary, our paper makes the following main contributions:
• We investigate the applicability of LLMs in mutation testing. We perform extensive and detailed comparative experiments to evaluate LLMs against existing tools/methods. Our findings indicate that GPT models excel in generating mutations that closely mimic real bugs.
• We compare different prompts, and find that few-shot learning with suitable code context achieves the best performance.
• We analyze the error types of non-compilable mutations and find that method invocations and member accesses are more likely to lead LLMs to generate non-compilable mutations.
• We build a high-quality dataset of Java mutations that includes manually analysed equivalent mutants.