[Illustration: a high-tech software lab exploring ChatGPT's impact on software engineering. A central workstation displays ChatGPT's architecture while engineers review automatically generated code fixes and discuss integrating Large Language Models into software repair workflows.]

A Critical Review of Large Language Model on Software Engineering: An Example from ChatGPT and Automated Program Repair

WorkDifferentWithAI.com Academic Paper Alert!

Written by Quanjun Zhang, Tongke Zhang, Juan Zhai, Chunrong Fang, Bowen Yu, Weisong Sun, Zhenyu Chen

Category: “AI for IT”

Article Section: AI Development and Operations; AI-Assisted Programming

Publication Date: 2023-10-13

SEO Description: “Reviewing Large Language Models like ChatGPT in Software Engineering and Automated Program Repair”

AI-Generated Paper Summary

GPT-4 API

The article is a scholarly review of the performance of Large Language Models (LLMs), particularly ChatGPT, on software engineering (SE) tasks such as Automated Program Repair (APR). The authors acknowledge the impressive performance of these LLMs across a variety of tasks, but raise concerns about potential data leakage due to the insufficient transparency of their training data. To evaluate ChatGPT fairly, they construct a new benchmark specific to APR tasks and find that ChatGPT surpasses other leading models at identifying and fixing bugs. The authors also examine how different types of prompts affect model performance. The study concludes by highlighting the need to investigate LLMs on SE tasks beyond APR and by calling for a reassessment of the achievements reported for existing black-box LLMs.

Claude.ai Full PDF Analysis

Here is a summary of the key points and my analysis of the novelty and potential commercial applications of this paper:

Summary:

  • The paper raises an important overlooked issue of data leakage when evaluating the performance of black-box large language models (LLMs) like ChatGPT on common datasets for software engineering tasks like automated program repair (APR).
  • The authors construct a new APR benchmark dataset EvalGPTFix using competitive programming submissions from 2023, after ChatGPT’s training cutoff, to avoid any data leakage issues.
  • Experiments evaluate ChatGPT’s ability to fix bugs in EvalGPTFix under different prompts and interaction strategies. ChatGPT fixes 109/151 bugs with a basic prompt, and fixes additional bugs when given more detailed prompts and dialog feedback (a sketch of this setup follows the list).
  • ChatGPT significantly outperforms existing LLMs like CodeT5 and PLBART, demonstrating its effectiveness at program repair when data leakage is controlled for.
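
To make the setup concrete, here is a minimal sketch of what prompt-based repair against a benchmark like EvalGPTFix could look like. This is an illustration under stated assumptions, not the paper's actual harness: it uses the openai Python client (v1), and the prompt wording, function names, and model choice are all placeholders.

```python
# Minimal sketch of prompt-based program repair, assuming the openai
# Python client (v1). Prompt wording and model choice are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A "basic prompt" in the paper's sense: the buggy code alone. Richer
# prompts would also include the problem description, error feedback,
# or a suspected bug location.
BASIC_PROMPT = "The following program is buggy. Please provide a fixed version.\n\n{code}"

def repair(buggy_code: str, extra_context: str = "", model: str = "gpt-3.5-turbo") -> str:
    """Ask the model for one candidate patch for one buggy program."""
    prompt = BASIC_PROMPT.format(code=buggy_code)
    if extra_context:  # e.g., problem description or bug localization hint
        prompt += f"\n\nAdditional context:\n{extra_context}"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

In a benchmark built from competitive programming problems, a candidate patch like the one returned here would typically be judged by running it against the problem's test cases, with a bug counted as fixed only when all tests pass.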

Novelty:

  • Identifies the overlooked but critical issue of potential data leakage in evaluating black-box LLMs like ChatGPT on common SE datasets. Raises awareness of this bias.
  • Constructs a novel benchmark dataset specifically designed to avoid data leakage, enabling more rigorous evaluation.
  • Provides an extensive study analyzing ChatGPT’s repair capabilities under different prompts and interaction strategies.
  • Demonstrates ChatGPT’s strong performance at program repair when data leakage is controlled, overcoming a limitation of prior work.

Commercial Applications:

  • The results highlight the potential of LLMs like ChatGPT to assist software developers in fixing bugs with minimal effort. Could be integrated into IDEs.
  • ChatGPT’s effectiveness even with limited prompts suggests it could help debug production issues where information is scarce.
  • Interactive repair with execution feedback could enable collaborative human-AI bug fixing (a sketch of such a loop follows this list).
  • Techniques to construct leakage-free benchmarks could enable evaluating LLMs rigorously for commercial tasks.
  • Findings motivate developing prompt engineering strategies and domain-specific LLMs to further improve repair.
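
As a thought experiment, the dialog-based workflow might look like the following loop: generate a candidate patch, run it against a test, and feed any failure back into the conversation. Everything here is an assumption for illustration, not the paper's implementation: the openai client usage, the Python-only test runner, the prompt wording, and the round budget.

```python
# Hypothetical dialog-based repair loop with execution feedback, in the
# spirit of the paper's interactive experiments. All names, prompts, and
# the Python-only test runner are illustrative assumptions.
import re
import subprocess
import tempfile
from typing import Optional

from openai import OpenAI

client = OpenAI()

def extract_code(reply: str) -> str:
    """Pull the first fenced code block out of the model's reply, if any."""
    match = re.search(r"```(?:\w+)?\n(.*?)```", reply, re.DOTALL)
    return match.group(1) if match else reply

def run_tests(code: str, test_input: str, expected: str) -> Optional[str]:
    """Run the candidate as a Python script on one test case.
    Return None if the output matches, else a feedback string."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(["python", path], input=test_input,
                                capture_output=True, text=True, timeout=10)
    except subprocess.TimeoutExpired:
        return "Time limit exceeded."
    if result.returncode != 0:
        return f"Runtime error:\n{result.stderr}"
    if result.stdout.strip() != expected.strip():
        return f"Wrong answer: expected {expected!r}, got {result.stdout.strip()!r}"
    return None  # test passed

def interactive_repair(buggy_code: str, test_input: str, expected: str,
                       max_rounds: int = 3) -> Optional[str]:
    """Iteratively ask the model for a fix, feeding test failures back in."""
    messages = [{"role": "user",
                 "content": f"The following program is buggy. Please fix it.\n\n{buggy_code}"}]
    for _ in range(max_rounds):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo", messages=messages,
        ).choices[0].message.content
        candidate = extract_code(reply)
        feedback = run_tests(candidate, test_input, expected)
        if feedback is None:
            return candidate  # the patch passes the test
        # Return the failure to the dialog and try another round.
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user",
                         "content": f"The fix still fails:\n{feedback}\nPlease try again."})
    return None  # no plausible patch within the round budget
```

A loop like this is also a natural seam for human collaboration: a developer can inspect each candidate, edit the feedback message, or stop the dialog at any round.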

In summary, this is a novel study identifying an important bias, rigorously evaluating ChatGPT’s repair capabilities, and demonstrating significant commercial potential to assist developers. The benchmark dataset and analysis techniques are valuable contributions.

Keywords

Large Language Models, Software Engineering, ChatGPT, Automated Program Repair, bug-fixing capabilities

Author’s Abstract

Large Language Models (LLMs) have been gaining increasing attention and have demonstrated promising performance across a variety of Software Engineering (SE) tasks, such as Automated Program Repair (APR), code summarization, and code completion. For example, ChatGPT, the latest black-box LLM, has been investigated by numerous recent research studies and has shown impressive performance in various tasks. However, there exists a potential risk of data leakage, since these LLMs are usually closed-source with unknown specific training details, e.g., pre-training datasets. In this paper, we seek to review the bug-fixing capabilities of ChatGPT on a clean APR benchmark with different research objectives. We first introduce EvalGPTFix, a new benchmark with buggy programs and their corresponding fixed versions from competitive programming problems starting from 2023, after the training cutoff point of ChatGPT. The results on EvalGPTFix show that ChatGPT is able to fix 109 out of 151 buggy programs using the basic prompt within 35 independent rounds, outperforming the state-of-the-art LLMs CodeT5 and PLBART by 27.5% and 62.4% in prediction accuracy. We also investigate the impact of three types of prompts, i.e., problem description, error feedback, and bug localization, leading to an additional 34 fixed bugs. Besides, we provide additional discussion of the interactive nature of ChatGPT to illustrate the capacity of a dialog-based repair workflow, which fixes 9 additional bugs. Inspired by the findings, we further pinpoint various challenges and opportunities for advanced SE studies equipped with such LLMs (e.g., ChatGPT) in the near future. More importantly, our work calls for more research on reevaluating the achievements obtained by existing black-box LLMs across various SE tasks, not limited to ChatGPT on APR.

Read the full paper here

Last updated on October 24th, 2023.