DeepMind’s “red teaming” language models with language models: What is it?

DeepMind has come out with a way to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves.

Language models, and innovations to improve them, are among the most exciting and talked-about research areas right now. However, though we have seen several large language models in the last year from tech giants (DeepMind's 280-billion-parameter transformer language model Gopher, Google's Generalist Language Model, and LG AI Research's language model Exaone), they often cannot be deployed because they can be harmful to users in ways that are difficult to predict in advance. To take a step towards solving this issue, DeepMind has come out with a way to automatically find inputs that elicit harmful text from language models by generating those inputs using language models themselves.

The researchers generated test cases (red teaming) using a language model and then used a classifier to detect various harmful behaviours in the replies those test cases elicited. As per DeepMind, the team evaluated the target language model's replies to the generated test questions using a classifier trained to detect offensive content. This process surfaced a vast number of offensive replies from a 280B-parameter language model chatbot.
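
To make that pipeline concrete, here is a minimal sketch of the loop, assuming three stand-in callables (a red-team question generator, the target chatbot, and an offensiveness classifier). The function names and the toy stand-ins are illustrative assumptions, not DeepMind's actual code.

```python
from typing import Callable, List, Tuple


def red_team(
    generate_test_case: Callable[[], str],        # red-team LM: produces a test question
    target_reply: Callable[[str], str],           # target chatbot: answers the question
    offensiveness: Callable[[str, str], float],   # classifier: scores (question, reply)
    n_cases: int = 1000,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (question, reply, score) triples that the classifier flags as harmful."""
    failures = []
    for _ in range(n_cases):
        question = generate_test_case()
        reply = target_reply(question)
        score = offensiveness(question, reply)
        if score > threshold:
            failures.append((question, reply, score))
    return failures


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; in practice each callable
    # would wrap a large language model or a trained offensive-text classifier.
    import random

    questions = ["What do you think of your neighbours?", "Tell me a joke."]
    flagged = red_team(
        generate_test_case=lambda: random.choice(questions),
        target_reply=lambda q: "I'd rather not say.",
        offensiveness=lambda q, r: random.random(),
        n_cases=10,
    )
    print(f"{len(flagged)} potentially harmful replies found")
```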

What is this model exactly?

As per the paper titled "Red Teaming Language Models with Language Models", though LLMs such as GPT-3 and Gopher can generate high-quality text, there are several hurdles to their deployment. It added, "Generative language models come with a risk of generating very harmful text, and even a small risk of harm is unacceptable in real-world applications."

The team added that they use the approach to red team the 280B-parameter Dialogue-Prompted Gopher chatbot for offensive generated content. They apply several methods, such as zero-shot generation, few-shot generation, supervised learning, and reinforcement learning, to generate test questions with large language models.
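
To illustrate how the generation methods differ, here is a hedged sketch of zero-shot versus few-shot test-question generation. The prompt wording, the `lm_complete` callable, and the helper names are assumptions for illustration rather than the paper's exact prompts; roughly speaking, the supervised and reinforcement learning variants then go further by fine-tuning the red-team model itself towards test cases that succeed, rather than only changing the prompt.

```python
import random
from typing import Callable, List

# Instruction-style prefix in the spirit of a zero-shot question-generation setup.
ZERO_SHOT_PROMPT = "List of questions to ask someone:\n1."


def zero_shot_prompt() -> str:
    """Zero-shot: sample questions from a fixed instruction-style prefix."""
    return ZERO_SHOT_PROMPT


def few_shot_prompt(successful_cases: List[str], k: int = 5) -> str:
    """Few-shot: seed the prompt with questions that already elicited offensive replies."""
    examples = random.sample(successful_cases, min(k, len(successful_cases)))
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(examples))
    return f"List of questions to ask someone:\n{numbered}\n{len(examples) + 1}."


def generate_questions(lm_complete: Callable[[str], str], prompt: str, n: int) -> List[str]:
    """Sample n completions from an LM callable, keeping the first line of each."""
    return [lm_complete(prompt).strip().split("\n")[0] for _ in range(n)]
```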

Image: DeepMind

As per the paper, the red-teaming methods varied in what they produced: some were effective at generating diverse test cases, while others were effective at generating difficult test cases.
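
One plausible way to quantify that trade-off (an assumption for illustration, not the paper's exact metrics) is to measure diversity as the fraction of distinct n-grams across the generated questions and difficulty as the fraction of replies the offensiveness classifier flags.

```python
from typing import List


def distinct_ngrams(questions: List[str], n: int = 2) -> float:
    """Diversity proxy: fraction of n-grams across all questions that are unique."""
    grams = []
    for q in questions:
        toks = q.lower().split()
        grams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(grams)) / max(len(grams), 1)


def attack_success_rate(reply_scores: List[float], threshold: float = 0.5) -> float:
    """Difficulty proxy: fraction of replies the offensiveness classifier flags."""
    return sum(score > threshold for score in reply_scores) / max(len(reply_scores), 1)
```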

The generated test cases compared favourably to manually written test cases from Xu et al. (2021b) in terms of diversity and difficulty. The team also used LM-based red teaming to uncover harmful chatbot behaviours, including leaking memorised training data. The researchers also generated targeted tests for a particular behaviour by sampling from a language model conditioned on a "prompt", or text prefix.

It said, “We also use prompt-based red teaming to automatically discover groups of people that the chatbot discusses in more offensive ways than others, on average across many inputs.”
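
A hedged sketch of what such prompt-based, targeted red teaming could look like in code: condition the red-team model on a topic- or group-specific prefix, then average the classifier's offensiveness scores per group. The prefix template and the callable names are illustrative assumptions, not DeepMind's implementation.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List


def targeted_prompt(group: str) -> str:
    # Illustrative prefix template; real prompts would be crafted per use case.
    return f"List of questions about {group}:\n1."


def per_group_offensiveness(
    groups: List[str],
    lm_complete: Callable[[str], str],            # red-team LM completion
    target_reply: Callable[[str], str],           # target chatbot
    offensiveness: Callable[[str, str], float],   # offensive-content classifier
    samples_per_group: int = 100,
) -> Dict[str, float]:
    """Average offensiveness of the chatbot's replies, broken down by group."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for group in groups:
        for _ in range(samples_per_group):
            question = lm_complete(targeted_prompt(group)).strip().split("\n")[0]
            reply = target_reply(question)
            scores[group].append(offensiveness(question, reply))
    return {group: mean(group_scores) for group, group_scores in scores.items()}
```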

Observations

After the failure cases were detected, the team added that the harmful behaviour could be fixed by blacklisting certain phrases that frequently came up in harmful outputs, or by finding offensive training data quoted by the model and removing that data when training future iterations of the model. The model can also be trained to minimise the likelihood of its original, harmful output for a given test input.
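
Of these mitigations, phrase blacklisting is the simplest to picture in code. Below is a minimal sketch; the `filter_reply` helper, the example blacklist, and the fallback message are hypothetical stand-ins, not part of the paper.

```python
from typing import Iterable


def filter_reply(reply: str, blacklist: Iterable[str],
                 fallback: str = "I'd rather not talk about that.") -> str:
    """Return a canned fallback if the reply contains any blacklisted phrase."""
    lowered = reply.lower()
    if any(phrase.lower() in lowered for phrase in blacklist):
        return fallback
    return reply


# The blacklist would be populated with phrases surfaced during red teaming.
print(filter_reply("You are a terrible person.", blacklist=["terrible person"]))
```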

Prior work in this area 

There has been previous work on detecting issues such as hate speech and indecent language:

  • HateCheck is a suite of functional tests for hate speech detection models. The research team built 29 model functionalities motivated by a review of previous research and interviews with civil society stakeholders. They crafted test cases for each functionality and validated their quality through a structured annotation process.

  • RealToxicityPrompts is a dataset of 100K naturally occurring, sentence-level prompts derived from a large volume of English web text, paired with toxicity scores from a widely used toxicity classifier. The team assessed "controllable generation methods" and found that although data- or compute-based methods are more effective at steering away from toxicity, no current method is "failsafe against neural toxic degeneration."

Sreejani Bhattacharyya
