1. Title: Constitutional AI: Harmlessness from AI Feedback
Type: Research paper (PDF)
Source: arXiv (Anthropic authors)
As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
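The two phases described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the `Model` type, the prompt wording, the function names, and `toy_model` are all hypothetical, and the loop simply walks through the supplied principles in order. The final RL step (training against the preference model) is not shown.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a language model: prompt text in, completion out.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principles: List[str]) -> str:
    """Supervised phase: sample a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            f"CRITIQUE the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse:\n{response}"
        )
        response = model(
            f"REVISE the response using the critique.\n"
            f"Critique: {critique}\nResponse:\n{response}"
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str],
                     principles: List[str]) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs for supervised finetuning."""
    return [(p, critique_and_revise(model, p, principles)) for p in prompts]


def label_preference(judge: Model, prompt: str, a: str, b: str, principle: str) -> int:
    """RL phase: ask a model which of two samples is better; 0 means A, 1 means B."""
    verdict = judge(
        f"COMPARE two responses under this principle: {principle}\n"
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\nAnswer A or B."
    )
    return 0 if verdict.strip().upper().startswith("A") else 1


def build_preference_dataset(policy: Model, judge: Model, prompts: List[str],
                             principle: str) -> List[Tuple[str, str, str, int]]:
    """Sample two responses per prompt and record the AI-generated preference label."""
    data = []
    for p in prompts:
        a, b = policy(p), policy(p)
        data.append((p, a, b, label_preference(judge, p, a, b, principle)))
    return data


def toy_model(prompt: str) -> str:
    """Deterministic stub (an assumption for demonstration only), not a real model."""
    if prompt.startswith("CRITIQUE"):
        return "The response could explain its objection more clearly."
    if prompt.startswith("REVISE"):
        # The previous response is the last line of the revise prompt.
        return prompt.splitlines()[-1] + " (revised)"
    if prompt.startswith("COMPARE"):
        return "A"
    return "draft answer"


if __name__ == "__main__":
    print(build_sl_dataset(toy_model, ["example query"], ["be harmless"]))
    print(build_preference_dataset(toy_model, toy_model, ["example query"], "be harmless"))
```

In a real pipeline the revised pairs would be used to finetune the initial model, and the labeled comparisons would be used to train a preference model that serves as the RL reward signal.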
2. Title: Claude’s Constitution
Type: Official values/policy document
Source: Anthropic website
Format: Web / PDF (if available)
Claude’s constitution is a detailed description of Anthropic’s intentions for Claude’s values and behavior. It plays a crucial role in our training process, and its content directly shapes Claude’s behavior. It’s also the final authority on our vision for Claude, and our aim is for all of our other guidance and training to be consistent with it.
Training models is a difficult task, and Claude’s behavior might not always reflect the constitution’s ideals. We will be open—for example, in our system cards—about the ways in which Claude’s behavior comes apart from our intentions. But we think transparency about those intentions is important regardless.
The document is written with Claude as its primary audience, so it might read differently than you’d expect. For example, it’s optimized for precision over accessibility, and it covers various topics that may be of less interest to human readers. We also discuss Claude in terms normally reserved for humans (e.g., “virtue,” “wisdom”). We do this because we expect Claude’s reasoning to draw on human concepts by default, given the role of human text in Claude’s training; and we think encouraging Claude to embrace certain human-like qualities may be actively desirable.
This constitution is written for our mainline, general-access Claude models. We have some models built for specialized uses that don’t fully fit this constitution; as we continue to develop products for specialized use cases, we will continue to evaluate how to best ensure our models meet the core objectives outlined in this constitution.
For a summary of the constitution, and for more discussion of how we’re thinking about it, see our blog post “Claude’s new constitution.”
Powerful AI models will be a new kind of force in the world, and people creating them have a chance to help them embody the best in humanity. We hope this constitution is a step in that direction.
We’re releasing Claude’s constitution in full under a Creative Commons CC0 1.0 Deed, meaning it can be freely used by anyone for any purpose without asking for permission.
Many people at Anthropic and beyond contributed to the creation of this document, as did several Claude models. Amanda Askell is the primary author and wrote the majority of the text. Joe Carlsmith wrote significant parts of many sections and played a core role in revising the text. Chris Olah, Jared Kaplan, and Holden Karnofsky made significant contributions to its content and development. A more detailed contribution statement and acknowledgments appear in the full document.
3. Title: Collective Constitutional AI: Aligning a Language Model with Public Input
Type: Research paper
Source: Anthropic research / arXiv
There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from an LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of refusing. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.