[Featured image: an abstract depiction of the universe of Automated Software Engineering, with code-snippet constellations representing GPT-4, shooting stars tracing prompting techniques, and nebulae evoking the empirical study's data and findings.]

Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks

WorkDifferentWithAI.com Academic Paper Alert!

Written by Jiho Shin, Clark Tang, Tahmineh Mohati, Maleknaz Nayebi, Song Wang, Hadi Hemmati

Category: “AI for IT”

Article Section: AI Development and Operations; AI-Assisted Programming

Publication Date: 2023-10-10

SEO Description: “Assessing GPT-4’s effectiveness in Automated Software Engineering via various prompting techniques.”

Shin, Jiho, et al. Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks. arXiv:2310.10508, arXiv, 10 Oct. 2023, http://arxiv.org/abs/2310.10508.

AI-Generated Paper Summary

GPT-4 API

The article “Prompt Engineering or Fine Tuning: An Empirical Assessment of Large Language Models in Automated Software Engineering Tasks” investigates the efficacy of a state-of-the-art Large Language Model (LLM), GPT-4, on Automated Software Engineering (ASE) tasks such as code generation, code summarization, and code translation. Three prompt engineering techniques were considered: basic prompting, in-context learning, and task-specific prompting. The study found that the performance of these techniques varied across tasks and did not consistently surpass fine-tuned models. For instance, while GPT-4 with task-specific prompts beat the top-performing fine-tuned model by 8.33 percentage points in BLEU for comment generation, it was outperformed in tasks like code generation. Marked improvement was also observed when human feedback was integrated into the prompting process. The study concludes, however, that fully automated prompt engineering without human assistance still needs improvement before it can deliver optimal performance on ASE tasks.
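
To make the three automated strategies concrete, here is a minimal Python sketch of what basic, in-context (few-shot), and task-specific prompts might look like against the GPT-4 chat API, using the pre-v1 OpenAI SDK contemporary with this paper. The prompt templates, the `ask_gpt4` helper, and the example function are illustrative assumptions, not the paper's verbatim prompts.

```python
import openai  # pip install openai==0.28 (pre-v1 SDK, contemporary with this paper)

def ask_gpt4(messages):
    """Send a chat request to GPT-4 and return the text of the first choice."""
    response = openai.ChatCompletion.create(model="gpt-4", messages=messages)
    return response["choices"][0]["message"]["content"]

code = "def add(a, b):\n    return a + b"

# 1. Basic prompting: simply state the task.
basic = [{"role": "user", "content": f"Write a comment summarizing this function:\n{code}"}]

# 2. In-context learning: prepend a worked input/output example (few-shot).
in_context = [
    {"role": "user", "content": "Write a comment summarizing this function:\n"
                                "def square(x):\n    return x * x"},
    {"role": "assistant", "content": "Returns the square of x."},
    {"role": "user", "content": f"Write a comment summarizing this function:\n{code}"},
]

# 3. Task-specific prompting: a system message tailored to the ASE task.
task_specific = [
    {"role": "system", "content": "You are an expert Python documenter. Produce a "
                                  "one-line docstring-style summary of the given function."},
    {"role": "user", "content": code},
]

for name, messages in [("basic", basic), ("in-context", in_context),
                       ("task-specific", task_specific)]:
    print(name, "->", ask_gpt4(messages))
```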

Claude.ai Full PDF Analysis

Here is a summary of my analysis of the key points and insights from this paper:

Novelty

  • This is the first comprehensive empirical study comparing prompt engineering strategies vs fine-tuning for large language models (LLMs) on automated software engineering (ASE) tasks.
  • The study examines 3 representative ASE tasks: comment generation, code generation, and code translation. This provides a broad assessment of LLMs for different modalities of code and natural language.
  • Both quantitative experiments with different prompt strategies and a qualitative user study with conversational prompting are conducted. The combination provides unique insights.
  • For quantitative analysis, prompt engineering strategies like basic prompting, in-context learning, and task-specific prompting are systematically compared.
  • The qualitative study explores how developers interactively evolve prompts when using LLMs for ASE tasks. Different prompt evolution patterns are identified.

Key Findings

  • Automated prompt strategies do not consistently outperform fine-tuned models across the ASE tasks examined. Performance varies by task.
  • Conversational prompting with human feedback significantly boosts LLM performance compared to automated prompting strategies.
  • Developers tend to refine prompts by requesting improvements, adding context, or giving specific instructions, which helps guide the LLM (see the conversational-loop sketch after this list).
  • The common prompt evolution patterns are found to be aligned with patterns that produce the best LLM responses. This suggests developers can effectively guide LLMs.
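
As a rough illustration of conversational prompting, the sketch below keeps a human in the loop: the model's draft is shown to the developer, whose free-form feedback is appended to the conversation and resubmitted. It reuses the hypothetical `ask_gpt4` helper from the earlier sketch; the stopping convention (an empty reply accepts the draft) is likewise an assumption for illustration, not the paper's protocol.

```python
def conversational_prompting(initial_task: str, max_turns: int = 5) -> str:
    """Iteratively refine GPT-4 output with human feedback (human-in-the-loop)."""
    messages = [{"role": "user", "content": initial_task}]
    draft = ask_gpt4(messages)
    for _ in range(max_turns):
        print("GPT-4 draft:\n", draft)
        feedback = input("Feedback (blank to accept): ").strip()
        if not feedback:
            break  # the developer is satisfied with the draft
        # Record the model's turn and the human's refinement request, then retry.
        messages.append({"role": "assistant", "content": draft})
        messages.append({"role": "user", "content": feedback})
        draft = ask_gpt4(messages)
    return draft

final = conversational_prompting(
    "Translate this Java method to Python:\n"
    "public int add(int a, int b) { return a + b; }"
)
```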

Commercial Applications

  • The insights on conversational prompting can help design more effective human-in-the-loop systems leveraging LLMs for software engineering.
  • The prompt refinement patterns identified can inform automated prompting systems (e.g., reinforcement-learning-based prompting agents).
  • Comparative assessment of prompting vs fine-tuning provides guidance on tradeoffs for applying LLMs in industrial settings.
  • Better understanding of how to optimize prompts can help improve commercial LLM services for programming assistance.

In summary, this is a novel study providing useful empirical insights on prompting strategies for LLMs in ASE tasks, with both academic and commercial value. The findings can help advance research and applications of LLMs in software engineering.

Keywords

Prompt Engineering, Fine Tuning, Large Language Models, Automated Software Engineering Tasks, GPT-4

Author’s Abstract

In this paper, we investigate the effectiveness of state-of-the-art LLM, i.e., GPT-4, with three different prompting engineering techniques (i.e., basic prompting, in-context learning, and task-specific prompting) against 18 fine-tuned LLMs on three typical ASE tasks, i.e., code generation, code summarization, and code translation. Our quantitative analysis of these prompting strategies suggests that prompt engineering GPT-4 cannot necessarily and significantly outperform fine-tuning smaller/older LLMs in all three tasks. For comment generation, GPT-4 with the best prompting strategy (i.e., task-specific prompt) had outperformed the first-ranked fine-tuned model by 8.33% points on average in BLEU. However, for code generation, the first-ranked fine-tuned model outperforms GPT-4 with best prompting by 16.61% and 28.3% points, on average in BLEU. For code translation, GPT-4 and fine-tuned baselines tie as they outperform each other on different translation tasks. To explore the impact of different prompting strategies, we conducted a user study with 27 graduate students and 10 industry practitioners. From our qualitative analysis, we find that the GPT-4 with conversational prompts (i.e., when a human provides feedback and instructions back and forth with a model to achieve best results) showed drastic improvement compared to GPT-4 with automatic prompting strategies. Moreover, we observe that participants tend to request improvements, add more context, or give specific instructions as conversational prompts, which goes beyond typical and generic prompting strategies. Our study suggests that, at its current state, GPT-4 with conversational prompting has great potential for ASE tasks, but fully automated prompt engineering with no human in the loop requires more study and improvement.
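
Since the quantitative comparisons above are all reported in BLEU, here is a minimal sketch of how such a score might be computed for generated comments, using NLTK's `corpus_bleu`. The tokenized reference and hypothesis are invented examples, not data from the paper.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against a list of tokenized reference comments.
references = [[["returns", "the", "sum", "of", "two", "integers"]]]
hypotheses = [["return", "the", "sum", "of", "two", "numbers"]]

# Smoothing avoids zero scores when a higher-order n-gram has no match.
smooth = SmoothingFunction().method1
score = corpus_bleu(references, hypotheses, smoothing_function=smooth)
print(f"BLEU: {100 * score:.2f}")  # reported on a 0-100 scale, as in the paper
```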

Read the full paper here: http://arxiv.org/abs/2310.10508

Last updated on October 24th, 2023.