What is Constitutional AI: Harmlessness from AI Feedback
Anthropic is taking RLHF up a notch
Hey Everyone,
I came across Constitutional A.I. in this tweet:
I think it’s actually even more important than Sridhar Ramaswamy cares to admit, so let’s dig into it.
See GitHub: https://github.com/anthropics/hh-rlhf
Anthropic, an AI safety and research company, raised $580 million in a Series B back in April 2022. That financing will help Anthropic build large-scale experimental infrastructure to explore and improve the safety properties of computationally intensive AI models.
Anthropic is an AI safety and research company working to build reliable, interpretable, and steerable AI systems.
Constitutional A.I. is an Upgrade to RLHF
I can’t emphasize enough how important this is for A.I. alignment and the future of things like chatbots and tools built on models like GPT-4.
Anthropic is important as a company for the depth and A.I.-ethics emphasis of its research. For instance, in 2022 their research interests spanned multiple areas, including natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
Anthropic was founded in 2021 by former OpenAI VP of Research Dario Amodei, with the intention of performing research in the public interest on making AI more reliable and explainable.
Anthropic is thus laying the foundations for new approaches to how reinforcement learning works with human feedback.
Sign-Up for Constitutional AI Feedback Interface
Given the growing interest in language model-based chat interfaces, Anthropic is sharing their Constitutional AI feedback interface with a larger set of people. Sign up here: https://forms.gle/12FCefc6sHfBsP9j9.
The paper is worth reading, and they also recommend a number of related papers. It’s a clever move, since they are hiring all kinds of different people.
CAI - Constitutional AI
Anthropic A.I. shows the basic steps of their Constitutional AI (CAI) process, which consists of a supervised learning (SL) stage (the steps at the top of their figure) and a reinforcement learning (RL) stage (the sequence of steps at the bottom). Both the critiques and the AI feedback are steered by a small set of principles drawn from a ‘constitution’.
The supervised stage significantly improves the initial model, and gives some control over the initial behavior at the start of the RL phase, addressing potential exploration problems. The RL stage significantly improves performance and reliability.
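To make that supervised stage concrete, here is a minimal Python sketch of the critique-and-revision loop it describes. The `generate` stub, the prompt wording, and the two constitutional principles are my own illustrative assumptions, not Anthropic’s actual implementation.

```python
import random

# Minimal sketch of the CAI supervised (critique -> revision) stage.
# `generate` is a placeholder for sampling from a helpful RLHF model;
# the constitution below is a tiny illustrative subset, not Anthropic's actual list.
CONSTITUTION = [
    "Identify specific ways in which the response is harmful, unethical, or toxic.",
    "Identify ways the response could encourage illegal or dangerous activity.",
]

def generate(prompt: str) -> str:
    """Placeholder: call your language model here."""
    raise NotImplementedError

def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    # Start from the helpful (but possibly harmful) model's draft response.
    response = generate(f"Human: {user_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        # Ask the model to critique its own response against a constitutional principle...
        critique = generate(
            f"Response: {response}\n\nCritiqueRequest: {principle}\n\nCritique:"
        )
        # ...then to revise the response in light of that critique.
        response = generate(
            f"Response: {response}\n\nCritique: {critique}\n\n"
            "RevisionRequest: Rewrite the response to remove the harmful content.\n\nRevision:"
        )
    # The final revisions become finetuning targets for the SL-CAI model.
    return response
```

Finetuning on these revisions is what gives the model sensible starting behavior for the RL phase that follows.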
Constitutional A.I. Will Get More Important in 2023
The RL-CAI models trained with AI feedback learn to be less harmful at a given level of helpfulness.
For A.I. to be democratized, it actually needs to learn to be more aligned with human feedback and less harmful.
So this also has broad implications as human-level A.I. (HLAI) and the pursuit of AGI intensify in the years and decades to come:
AI supervision may be more efficient than collecting human feedback. It allows us to focus more on providing a small amount of legible, focused, high-quality oversight. There may also be ways for humans and AI systems to collaborate [Bowman et al., 2022] to provide better supervision than either can provide alone.
AI systems can already perform some tasks at or beyond human level (e.g. [Silver et al., 2017]), and over time more examples are likely to emerge. We need to develop methods now that can provide oversight for these powerful AI systems, and scaling supervision may be one possibility, if the capability level of the supervisor can scale proportionally with the capabilities of the actor, and the supervisor remains aligned with our intended goals and constraints.
Dialogue
They trained language assistants that are both helpful and harmless without using human feedback labels for harmlessness. They referred to the technique as ‘constitutional AI’ (CAI) since it uses a ‘constitution’ consisting of human-written principles. They established two methods:
(1) Constitutional AI, which ‘bootstraps’ a helpful RLHF model’s instruction-following abilities to critique and revise its own responses so as to remove harmful content, and (2) RL with model-generated labels for harmlessness, which further improves harmlessness.
They used this method to train models that are both harmless and non-evasive, partially resolving an issue in [Bai et al., 2022]. By removing human feedback labels for harmlessness, they have moved further away from reliance on human supervision, and closer to the possibility of a self-supervised approach to alignment.
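As a rough illustration of the second method (what the next section calls RLAIF), the sketch below shows how model-generated harmlessness labels could be collected: a feedback model is shown two candidate responses and asked, against a constitutional principle, which one is preferable. The helper names and prompt format are assumptions for illustration, not the paper’s exact setup.

```python
# Rough sketch of collecting AI feedback labels for harmlessness.
# `generate` is again a placeholder for a language model call; the prompt
# format and principle wording are illustrative assumptions.
def generate(prompt: str) -> str:
    """Placeholder: call the feedback model here."""
    raise NotImplementedError

def ai_preference(prompt: str, response_a: str, response_b: str, principle: str) -> str:
    comparison = (
        "Consider the following conversation between a human and an assistant:\n"
        f"{prompt}\n\n"
        f"{principle}\n"
        f"(A) {response_a}\n"
        f"(B) {response_b}\n"
        "The answer is:"
    )
    answer = generate(comparison)
    # Whichever option the feedback model picks is treated as the "chosen" response.
    return response_a if "(A)" in answer else response_b

# The resulting (prompt, chosen, rejected) comparisons train a preference model,
# and the SL-CAI policy is then optimized against it with RL, as in RLHF but
# with AI-generated labels for harmlessness.
```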
RLAIF
Anthropic A.I. may be among the leaders in self-supervised RL that is aligned to human feedback. Or how do you say it? ‘RL from AI Feedback’ (RLAIF).
Imagine an A.I. that can self-train to be more aligned with humans.
Importantly, Anthropic A.I. used chain-of-thought reasoning (and prompting) [Nye et al., 2021; Wei et al., 2022] to augment model performance and make AI decision-making more transparent.
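To give a feel for that, here is a hypothetical chain-of-thought variant of the feedback prompt: the feedback model is asked to reason step by step before stating which response it prefers, which makes the label easier to inspect. The exact wording is my illustration, not the paper’s prompt.

```python
# Hypothetical chain-of-thought variant of the AI feedback prompt: the model
# reasons aloud before choosing, making the judgment more transparent.
COT_FEEDBACK_TEMPLATE = (
    "Human: Consider the following conversation between a human and an assistant:\n"
    "{conversation}\n"
    "{principle}\n"
    "(A) {response_a}\n"
    "(B) {response_b}\n\n"
    "Assistant: Let's think step-by-step: "
)

def cot_feedback_prompt(conversation: str, principle: str,
                        response_a: str, response_b: str) -> str:
    # The completion (reasoning plus a final "(A)" or "(B)") is then parsed for the label.
    return COT_FEEDBACK_TEMPLATE.format(
        conversation=conversation,
        principle=principle,
        response_a=response_a,
        response_b=response_b,
    )
```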
Important Conclusions
As we pass from prompting, to RLHF, to the constitutional methods discussed here, we lower the barrier to training AI models that behave in ways their creators intend.
This means that these methods also make it easier to train pernicious systems. The supervised methods we have discussed may be particularly accessible, since they do not require an efficient RL implementation with large language models.
A.I. is going to get better at training itself to be more trustworthy, safer, and less harmful.
Anthropic was formed to examine how to better understand the AI models increasingly in use in every industry, as they grow beyond our ability to explain their logic and outcomes.
Research: Jared Kaplan developed the main ideas in discussion with Yuntao Bai, Amanda Askell, and Saurav Kadavath, and Jared carried out some of the initial experiments. Yuntao developed the method further and designed and carried out most of the experiments in this paper. Amanda helped develop the initial experiments, and Sandipan worked on harmlessness scores and automated generation of prompts.
In the grand scheme of things, Anthropic A.I. seems to be doing important research.
I hope this gave you a valuable glimpse into what Constitutional A.I. is and what’s going on in terms of A.I. alignment, harmlessness, and making accessible A.I. safer, more transparent, and more ethical. I don’t pretend to be an expert in this.
Language models (LMs) have seen wide proliferation across various applications, from chatbots to code completion to writing assistants. However, the behaviors and risks of LMs are not well understood. As we enter 2023, I think this will become more apparent than ever as the lack of regulation in A.I. intersects with generative A.I. trends.
Sort of hilarious what you find out after the fact: what I didn’t know when I wrote this was that SBF was the lead investor in their Series B. At the time, I couldn’t understand how they were able to fundraise so well given their obscured mission statement.