Automated Program Repair and the Advent of Large Language Models in Software Engineering

Julian Aron Prenner

Within just a few years, large language models have made the jump from research labs to the mainstream. Intelligent assistants building on this technology, such as OpenAI’s ChatGPT, Google’s Gemini or Mistral’s le Chat, have already been commercialized and are today used by millions, on smartphones and in web browsers. The recent advancements in AI have also revolutionized the field of software development, where language model-based coding assistants promise to make the production of software more efficient. One of the fields that has particularly benefited from the recent progress in large language models is automated program repair, the main topic of this dissertation. As its name suggests, the goal of automated program repair is to repair programs; that is, to first find bugs in software and then to eliminate them by replacing the buggy code with correct code. As outlined in this dissertation, many different techniques have been proposed for how faults can be localized (i.e., found in a program) and how faulty code can be turned into correct code. When my Ph.D. studies started in 2020, this field was in a phase of transition, moving from previously used search-based techniques to novel deep learning-based techniques (neural program repair). An important step in this development was the introduction of general-purpose, very large language models such as OpenAI’s GPT series of models. In late summer of 2021, we were granted access to the closed beta of OpenAI’s Codex, then one of the first very large language models specifically designed for code tasks. Evaluating Codex on the small QuixBugs program repair benchmark, we found it to be competitive not only with previous search-based techniques, but also with relatively recent neural techniques; this evaluation is also presented in this dissertation. Although these early investigations were very limited, they already showed the great potential of large language models for program repair.
More extensive studies later pointed in the same direction as our findings: very large code language models were as good as or better than previous approaches. This is all the more notable if one considers that, unlike such previous approaches, these models have not been specifically designed or trained for the task of program repair but instead are general-purpose code models steered only with well-crafted prompts. In recent years, the application of large language models in automated program repair has become widespread; this is true both for smaller language models that can be specifically trained for automated program repair and for the above-mentioned larger general-purpose models. Researchers have also proposed several extensions and improvements tailored to fixing bugs in software (e.g., self-supervision, different prompt designs, integration of test results and error messages as feedback, edit-aware denoising objectives and more). An aspect that received little attention, however, was the significance of context for a successful bugfix. By context we mean the code sections preceding and succeeding the bug location, which, along with the buggy code itself, usually form the input to a language model. Context provides important information from which the model can “understand” how to carry out the repair. It is also the source of so-called ingredients, that is, code elements (e.g., variable or function names) that might be needed for a successful bugfix. In a comprehensive study, also presented in this dissertation, we investigated how strongly context affects repair success. We found, among other things, that the amount of context may have a significant impact on fixing performance. Context is thus an often overlooked and poorly documented source of bias, as its influence may be confounded with other improvements of a particular technique.
In a second, subsequent study, we then looked at the role of ingredients, in particular identifiers, in language model-based repair. We found that identifier ingredients missing from the context are often the cause of a failed bugfix. To mitigate this problem, we propose a novel method that is able to extract ingredients from larger amounts of context code. The shift to deep learning and large language models also goes hand in hand with the tendency to treat code as mere text. While execution was a central part of search-based program repair, in language model-based or, more generally, deep learning-based systems, code is often handled very much like natural language text (although execution through tests is still often used to verify the correctness of repairs). We argue that this is a waste of potential, as execution can provide a wealth of information relevant for repair. The lack of a large dataset of executable bugs was one of the motivations for the creation of RunBugRun. This dataset combines three important aspects. First, as just mentioned, all bugs in RunBugRun are executable, opening up opportunities for novel repair approaches that leverage runtime information. Second, with over 700,000 bugs, it is large enough to satisfy the data hunger of modern approaches. Finally, with bugs in nine different programming languages, RunBugRun is also one of the few multilingual bug datasets, providing a counterpoint in a research landscape otherwise dominated by a few popular programming languages (e.g., Java, Python, C/C++). Our own experiments with this dataset show, among other things, that even a small language model can transfer repair knowledge from one programming language to another. This means that less popular programming languages can benefit from programming languages with high data availability.
In summary, our contributions help to better understand the role of context and ingredients in neural program repair and, we hope, promote more execution-centric and multilingual program repair research in the future.
