RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models

WorkDifferentWithAI.com Academic Paper Alert!

Written by Zefan Wang, Zichuan Liu, Yingying Zhang, Aoxiao Zhong, Lunting Fan, Lingfei Wu, Qingsong Wen

Category: “AI for IT”

Article Section: AI Development and Operations; AI Development Tools

Publication Date: 2023-10-24

SEO Description: “RCAgent: Autonomous, tool-augmented large language models for cloud root cause analysis optimizing AI decision-making capabilities.”

AI-Generated Paper Summary

GPT-4 API

The paper presents RCAgent, a tool-augmented large language model (LLM) autonomous agent framework for cloud root cause analysis (RCA). The authors note that current LLM-based RCA methods rely on manual workflow settings and do not make full use of LLMs’ decision-making and environment-interaction capabilities. RCAgent is designed for practical, privacy-aware industrial RCA: it runs on an internally deployed model rather than GPT-family models and supports free-form data collection and comprehensive analysis with tools. Its enhancements include Self-Consistency for action trajectories and methods for context management, stabilization, and importing domain knowledge. In the authors’ experiments, RCAgent outperforms ReAct across all aspects of RCA: predicting root causes, solutions, evidence, and responsibilities. The system has been integrated into the diagnosis and issue discovery workflow of Alibaba Cloud’s Real-time Compute Platform for Apache Flink.
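To make the Self-Consistency idea concrete, here is a minimal sketch in Python. It assumes the simplest possible aggregation, a majority vote over the final answers of several independently sampled agent runs; the paper's trajectory-level method may aggregate differently. The names `self_consistent_answer` and `run_trajectory` are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Callable, List


def self_consistent_answer(run_trajectory: Callable[[int], str],
                           n_samples: int = 5) -> str:
    """Run the agent several times and return the most frequent final answer.

    `run_trajectory` is a hypothetical stand-in for one full agent run
    (tool calls included) that returns its final root-cause string.
    """
    # Normalize lightly so trivially different spellings count as one vote.
    answers: List[str] = [
        run_trajectory(seed).strip().lower() for seed in range(n_samples)
    ]
    # most_common(1) returns [(answer, count)]; on a tie, the answer
    # seen first wins.
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    # Canned outputs standing in for five sampled trajectories.
    samples = [
        "OOM in TaskManager",
        "checkpoint timeout",
        "oom in taskmanager ",
        "OOM in TaskManager",
        "network partition",
    ]
    print(self_consistent_answer(lambda i: samples[i], n_samples=5))
```

Exact-match voting only works when answers are short and comparable. Free-form root cause descriptions would likely need a softer notion of agreement, which this sketch does not attempt.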

Claude.ai Full PDF Analysis

Here are my key takeaways from analyzing this academic paper on using large language models (LLMs) for cloud root cause analysis (RCA):

Novelty:

Proposes RCAgent, the first practical framework for using LLMs as autonomous agents for privacy-aware cloud RCA. Because it runs on an internally deployed model, it can apply the decision-making abilities of LLMs to complex AIOps tasks without transmitting sensitive data externally.
Introduces enhancements like trajectory-level self-consistency, observation snapshot keys, JSON repairing, and LLM-based expert agents to make LLMs effective for noisy, real-world data.
Demonstrates the efficacy of RCAgent on Alibaba’s production Apache Flink platform, showing its practical applicability.
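As an illustration of what the JSON repairing listed above might involve, the sketch below recovers a JSON object from a model reply that wraps it in extra text and leaves trailing commas. It is an assumption about the general technique, not the paper's implementation; `repair_json` and the `get_logs` tool name are hypothetical.

```python
import json
import re
from typing import Any


def repair_json(raw: str) -> Any:
    """Best-effort parse of a model reply that should contain one JSON object."""
    # Keep only the span from the first '{' to the last '}', which drops
    # any chatter the model put before or after the object.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found")
    text = raw[start:end + 1]
    # Remove trailing commas before a closing brace or bracket.
    # Caveat: this would also alter a string value containing ",}" or ",]".
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)


if __name__ == "__main__":
    reply = 'Sure, here is the action: {"tool": "get_logs", "args": {"job": "a",},} Done.'
    print(repair_json(reply))
```

A repair step like this lets an agent loop continue after a slightly malformed tool call instead of failing on the first parse error.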

Commercial Applications:

RCAgent could be commercialized as an AIOps software service for automated and intelligent root cause analysis of cloud infrastructure incidents.
It applies directly to cloud providers such as AWS, Azure, and GCP, which could use it to diagnose issues in their infrastructure and reduce mean time to resolution.
The techniques can also be extended to IT operations in general, like automated diagnosis of on-premise infrastructure and applications.
There is scope for commercializing the expert agents as standalone services with customized data interfaces.
The self-consistency and stabilization methods could be packaged as frameworks to make LLMs effective in other real-world autonomous agent settings.

Overall, this paper demonstrates the promise of using LLMs as autonomous agents for practical applications such as cloud RCA, through a novel framework and a set of enhancements. It has clear commercial potential as the basis for AIOps products or services for the cloud and IT infrastructure industry. The techniques also apply to other LLM agent systems that deal with complex real-world data.

Keywords

RCAgent, Cloud Root Cause Analysis, Autonomous Agents, Tool-Augmented Large Language Models, industrial privacy-aware usage

Author’s Abstract

Large language model (LLM) applications in cloud root cause analysis (RCA) have been actively explored recently. However, current methods are still reliant on manual workflow settings and do not unleash LLMs’ decision-making and environment interaction capabilities. We present RCAgent, a tool-augmented LLM autonomous agent framework for practical and privacy-aware industrial RCA usage. Running on an internally deployed model rather than GPT families, RCAgent is capable of free-form data collection and comprehensive analysis with tools. Our framework combines a variety of enhancements, including a unique Self-Consistency for action trajectories, and a suite of methods for context management, stabilization, and importing domain knowledge. Our experiments show RCAgent’s evident and consistent superiority over ReAct across all aspects of RCA — predicting root causes, solutions, evidence, and responsibilities — and tasks covered or uncovered by current rules, as validated by both automated metrics and human evaluations. Furthermore, RCAgent has already been integrated into the diagnosis and issue discovery workflow of the Real-time Compute Platform for Apache Flink of Alibaba Cloud.
