Featured image: a digital illustration of a research team working on generative language models in a sunlit modern office overlooking Westminster Palace.

Generative Language Models Exhibit Social Identity Biases

WorkDifferentWithAI.com Academic Paper Alert!

Written by Tiancheng Hu, Yara Kyrychenko, Steve Rathje, Nigel Collier, Sander van der Linden, Jon Roozenbeek

Category: “Ethical AI”

Article Section: Ethical and Responsible AI; Responsible AI Practices

Publication Date: 2023-10-24

SEO Description: “Study reveals that generative language models exhibit social identity biases, which can be mitigated by curating training data.”

Hu, Tiancheng, et al. Generative Language Models Exhibit Social Identity Biases. arXiv:2310.15819, arXiv, 24 Oct. 2023, http://arxiv.org/abs/2310.15819.

AI-Generated Paper Summary

GPT-4 API

The research paper “Generative Language Models Exhibit Social Identity Biases” examines concerns that generative language models may learn social identity biases from humans. The study investigates whether 51 large language models exhibit ingroup solidarity and outgroup hostility, two core social biases, and finds that most of these models manifest clear biases when asked to complete particular sentence prompts. It also shows that these biases can be intensified or mitigated during fine-tuning and that they can be minimized by carefully curating the training data. The research concludes that modern language models do exhibit social identity biases and emphasizes the need for further research, particularly into user interactions with large language models, to prevent potential reinforcement of biases in humans.

Claude.ai Full PDF Analysis

Based on my review, here are the key points about the novelty and potential commercial applications of this paper:

Novelty

  • First comprehensive study evaluating social identity biases like ingroup solidarity and outgroup hostility across a large set of state-of-the-art LLMs. Prior work has mainly focused on specific biases like gender or race.
  • Innovative approach using sentiment analysis of “we” vs “they” sentences to quantify ingroup and outgroup biases, similar to a human “feeling thermometer” (a minimal code sketch follows this list).
  • Establishes human baseline levels of ingroup/outgroup bias using large-scale web corpora like C4 and others that are used to train LLMs.
  • Shows most LLMs exhibit similar levels of ingroup favoritism and outgroup derogation as humans at baseline. Instruction fine-tuning can reduce but not eliminate the biases.
  • Demonstrates that fine-tuning on a partisan Twitter dataset greatly exacerbates ingroup and especially outgroup biases. Selectively removing ingroup-positive or outgroup-negative sentences from the fine-tuning data mitigates the biases.
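To make the measurement concrete, here is a minimal sketch of the “we”/“they” sentiment comparison described above. It assumes the Hugging Face transformers library, uses GPT-2 as an illustrative stand-in generator and the default SST-2 sentiment classifier, and is not the authors’ implementation (the paper evaluates 51 models against human-written baselines).

```python
# Minimal sketch (not the authors' code): elicit "We are ..." / "They are ..."
# completions from a small causal LM and score them with an off-the-shelf
# sentiment classifier, mirroring the ingroup/outgroup comparison above.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # illustrative model choice
sentiment = pipeline("sentiment-analysis")             # default SST-2 classifier

def positive_rate(prompt: str, n: int = 20) -> float:
    """Generate n completions of `prompt` and return the share labelled POSITIVE."""
    completions = generator(
        prompt,
        max_new_tokens=20,
        num_return_sequences=n,
        do_sample=True,
        pad_token_id=generator.tokenizer.eos_token_id,
    )
    labels = [sentiment(c["generated_text"])[0]["label"] for c in completions]
    return sum(label == "POSITIVE" for label in labels) / n

ingroup_positivity = positive_rate("We are")     # ingroup prompt
outgroup_positivity = positive_rate("They are")  # outgroup prompt
print(f"ingroup: {ingroup_positivity:.2f}, outgroup: {outgroup_positivity:.2f}")
```

A gap between the two rates is a rough proxy for the ingroup-favoritism/outgroup-derogation asymmetry the paper quantifies at much larger scale.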

Commercial Applications

  • Provides methodology for LLM vendors to quantify and compare social identity biases across different models. This could help guide development of less biased LLMs.
  • Shows potential to reduce biases in commercial LLMs by fine-tuning on data with ingroup-positive and outgroup-negative sentences removed (see the filtering sketch after this list).
  • Results highlight need for careful dataset curation when deploying LLMs commercially to avoid exacerbating harmful stereotypes.
  • Findings will inform design of human-AI interfaces to account for model biases and prevent potential harms like algorithmic radicalization.
  • Quantifying social biases has commercial value for auditing LLMs before deployment in fields such as healthcare, education, and finance, where fairness is crucial.
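As a rough illustration of the data-curation idea referenced in the list above, the sketch below filters ingroup-positive and outgroup-negative sentences out of a fine-tuning corpus before training. The group-marker lists and the off-the-shelf sentiment classifier are illustrative assumptions, not the paper’s pipeline.

```python
# Minimal sketch (illustrative assumptions, not the paper's pipeline): remove
# ingroup-positive and outgroup-negative sentences from a fine-tuning corpus
# before training, mirroring the mitigation result summarised above.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis")  # default SST-2 classifier

# Crude group-reference heuristics; the paper works with "we"/"they" sentences.
INGROUP_MARKERS = (" we ", " we're", " our ", " us ")
OUTGROUP_MARKERS = (" they ", " they're", " their ", " them ")

def keep_sentence(sentence: str) -> bool:
    """Drop ingroup-positive and outgroup-negative sentences; keep the rest."""
    lowered = f" {sentence.lower()} "
    label = sentiment(sentence)[0]["label"]
    if any(marker in lowered for marker in INGROUP_MARKERS) and label == "POSITIVE":
        return False
    if any(marker in lowered for marker in OUTGROUP_MARKERS) and label == "NEGATIVE":
        return False
    return True

corpus = [
    "We are building something great together.",
    "They are ruining everything for everyone.",
    "The committee meets on Tuesday.",
]
filtered = [s for s in corpus if keep_sentence(s)]
print(filtered)  # sentences that survive the filter and can be used for fine-tuning
```

In practice a vendor would run a filter like this over the full fine-tuning corpus before training, then re-measure ingroup/outgroup sentiment on the resulting model to confirm the bias reduction.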

In summary, this is a significant contribution highlighting the prevalence of social identity biases in LLMs using novel methodology, with important implications for commercial development and application of large language models. The results point to specific techniques for mitigating biases that could be valuable for companies deploying AI assistants and chatbots.

Keywords

Generative Language Models, Social Identity Biases, Ingroup Solidarity, Outgroup Hostility, Bias Reduction Training

Author’s Abstract

The surge in popularity of large language models has given rise to concerns about biases that these models could learn from humans. In this study, we investigate whether ingroup solidarity and outgroup hostility, fundamental social biases known from social science, are present in 51 large language models. We find that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative biases when prompted to complete sentences (e.g., “We are…”). A comparison of LLM-generated sentences with human-written sentences on the internet reveals that these models exhibit similar, if not greater, levels of bias than human text. To investigate where these biases stem from, we experimentally varied the amount of ingroup-positive or outgroup-negative sentences the model was exposed to during fine-tuning in the context of the United States Democrat-Republican divide. Doing so resulted in the models exhibiting a marked increase in ingroup solidarity and an even greater increase in outgroup hostility. Furthermore, removing either ingroup-positive or outgroup-negative sentences (or both) from the fine-tuning data leads to a significant reduction in both ingroup solidarity and outgroup hostility, suggesting that biases can be reduced by removing biased training data. Our findings suggest that modern language models exhibit fundamental social identity biases and that such biases can be mitigated by curating training data. Our results have practical implications for creating less biased large language models and further underscore the need for more research into user interactions with LLMs to prevent potential bias reinforcement in humans.

Read the full paper here

Last updated on November 5th, 2023.