
Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT

WorkDifferentWithAI.com Academic Paper Alert!

Written by Burak Yetiştiren, Işık Özsoy, Miray Ayerdem, Eray Tüzün

Category: “AI for IT”

Article Section: AI Development and Operations; AI-Assisted Programming

Publication Date: 2023-10-21

SEO Description: “Study evaluates AI-assisted code generation tools like GitHub Copilot, Amazon CodeWhisperer, ChatGPT on quality metrics.”

Yetiştiren, Burak, et al. Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. arXiv:2304.10778, arXiv, 21 Oct. 2023, https://doi.org/10.48550/arXiv.2304.10778.

AI-Generated Paper Summary

GPT-4 API

The study “Evaluating the Code Quality of AI-Assisted Code Generation Tools: An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT” conducts an empirical comparison of prominent code generation tools, namely GitHub Copilot, Amazon CodeWhisperer, and OpenAI’s ChatGPT. It assesses code quality along five metrics, Code Validity, Code Correctness, Code Security, Code Reliability, and Code Maintainability, using the HumanEval Dataset. Results indicate that ChatGPT, GitHub Copilot, and Amazon CodeWhisperer correctly generate code 65.2%, 46.3%, and 31.1% of the time, respectively. The insights gained from the study aim to aid software practitioners in choosing the appropriate tool for specific tasks.
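For readers unfamiliar with this benchmark style: HumanEval pairs each programming problem with a reference unit-test suite, and a completion counts as correct only if the tests pass. Below is a minimal Python sketch of how such a correctness rate could be computed; `generate_code` is a hypothetical stand-in for whichever tool is under evaluation, and the loop is our illustration, not the authors' actual harness (which sandboxes execution).

```python
def run_candidate(candidate_src: str, test_src: str) -> bool:
    """Return True if a generated solution passes the problem's unit tests.

    WARNING: exec() on untrusted generated code is unsafe outside a sandbox;
    the real HumanEval harness executes candidates in isolated processes.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False


def correctness_rate(problems, generate_code) -> float:
    """Fraction of problems solved (a pass@1-style score).

    `problems` is a list of (prompt, test_source) pairs; `generate_code`
    is a hypothetical callable wrapping the tool being evaluated.
    """
    passed = sum(
        run_candidate(generate_code(prompt), tests) for prompt, tests in problems
    )
    return passed / len(problems)


# Tiny self-contained demo (not a real HumanEval problem):
demo_solution = "def add(a, b):\n    return a + b\n"
demo_tests = "assert add(2, 2) == 4\nassert add(-1, 1) == 0\n"
print(run_candidate(demo_solution, demo_tests))  # True
```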

Claude.ai Full PDF Analysis

Here is a summary of the key points and implications of this paper:

Novelty:

  • First comparative study evaluating code quality metrics (validity, correctness, reliability, security, maintainability) of GitHub Copilot, Amazon CodeWhisperer, and ChatGPT. Provides unique insights into the strengths and weaknesses of each tool.
  • Proposes an experimental pipeline to systematically assess code generation tools using the HumanEval benchmark dataset. This methodology could be replicated for evaluating other tools.
  • Analyzes impact of input quality (e.g. docstrings, function names) on the performance of code generators, demonstrating the importance of clear problem descriptions (a prompt-variant sketch follows this list).
  • Tracks improvement of GitHub Copilot and Amazon CodeWhisperer over time by comparing older and newer versions. Quantifies increases in code correctness.
  • Compiles comprehensive list of code smells encountered in generated code to highlight maintainability issues. Estimates average time to resolve.
  • Provides the first academic comparison table of features for Copilot, CodeWhisperer, and ChatGPT. Useful as a quick reference.
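
To make the input-quality point concrete, here is a small, hypothetical sketch of how prompt variants of decreasing descriptiveness might be constructed for a single task. The helper `make_prompt_variants` and the three tiers are our invention; the paper's actual treatments may differ.

```python
def make_prompt_variants(func_name: str, signature: str, docstring: str) -> dict[str, str]:
    """Build prompt variants of decreasing descriptiveness for one task.

    Hypothetical helper: the paper varies input quality (e.g. presence of
    docstrings and meaningful function names) and measures how correctness
    changes; these three tiers mimic that idea.
    """
    return {
        # Full context: descriptive name plus a docstring.
        "name+docstring": f'def {func_name}({signature}):\n    """{docstring}"""\n',
        # Name only: the model must infer intent from the identifier alone.
        "name_only": f"def {func_name}({signature}):\n",
        # Minimal: an uninformative name and no docstring.
        "anonymous": f"def f({signature}):\n",
    }


variants = make_prompt_variants(
    "has_close_elements",
    "numbers, threshold",
    "Check if any two numbers are closer to each other than threshold.",
)
for label, prompt in variants.items():
    print(f"--- {label} ---\n{prompt}")
```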

Commercial Applications:

  • Results help practitioners select optimal code generation tool for their needs, based on task requirements and tool strengths/weaknesses.
  • Insights into the impact of input quality assist developers in leveraging tools more effectively (e.g. by providing clearer docstrings).
  • Code smell analysis alerts developers to potential maintainability issues to address in generated code (a toy smell-detection sketch follows this list).
  • Tracking of tool improvements over time shows practitioners how quickly they are evolving.
  • Code correctness comparisons help set expectations on accuracy levels for real-world usage.
  • Methodology could be extended by companies to benchmark performance of proprietary code generation tools.
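
As a toy illustration of the kind of maintainability check this involves: the paper relies on an established static-analysis toolchain to detect code smells and estimate remediation time, but a much simpler detector conveys the idea. The two smells and the thresholds below are illustrative, not the study's configuration.

```python
import ast

# Illustrative thresholds; not the study's configuration.
MAX_FUNC_LINES = 30
MAX_PARAMS = 5


def find_smells(source: str) -> list[str]:
    """Flag two classic code smells in a Python source string."""
    smells = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            length = (node.end_lineno or node.lineno) - node.lineno + 1
            if length > MAX_FUNC_LINES:
                smells.append(f"{node.name}: long function ({length} lines)")
            if len(node.args.args) > MAX_PARAMS:
                smells.append(
                    f"{node.name}: long parameter list ({len(node.args.args)} args)"
                )
    return smells


# A six-parameter function trips the parameter-list check.
print(find_smells("def mix(a, b, c, d, e, f):\n    return a\n"))
```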

Overall, this rigorous study generates novel insights of high practical value for both researchers and practitioners applying AI code generators. The results and methodology significantly advance understanding of this emerging technology class.

Keywords

AI-assisted code generation, Code quality metrics, GitHub Copilot, Amazon CodeWhisperer, ChatGPT

Author’s Abstract

Context: AI-assisted code generation tools have become increasingly prevalent in software engineering, offering the ability to generate code from natural language prompts or partial code inputs. Notable examples of these tools include GitHub Copilot, Amazon CodeWhisperer, and OpenAI’s ChatGPT. Objective: This study aims to compare the performance of these prominent code generation tools in terms of code quality metrics, such as Code Validity, Code Correctness, Code Security, Code Reliability, and Code Maintainability, to identify their strengths and shortcomings. Method: We assess the code generation capabilities of GitHub Copilot, Amazon CodeWhisperer, and ChatGPT using the benchmark HumanEval Dataset. The generated code is then evaluated based on the proposed code quality metrics. Results: Our analysis reveals that the latest versions of ChatGPT, GitHub Copilot, and Amazon CodeWhisperer generate correct code 65.2%, 46.3%, and 31.1% of the time, respectively. Compared with their earlier versions, the newer releases of GitHub Copilot and Amazon CodeWhisperer showed improvement rates of 18% and 7%, respectively. The average technical debt, considering code smells, was found to be 8.9 minutes for ChatGPT, 9.1 minutes for GitHub Copilot, and 5.6 minutes for Amazon CodeWhisperer. Conclusions: This study highlights the strengths and weaknesses of some of the most popular code generation tools, providing valuable insights for practitioners. By comparing these generators, our results may assist practitioners in selecting the optimal tool for specific tasks, enhancing their decision-making process.

Read the full paper here

Last updated on November 8th, 2023.