Core Insight
Anthropic's CAI approach replaces the human harmlessness labels in RLHF with AI-generated feedback guided by a written set of principles (a "constitution"), achieving competitive harmlessness without sacrificing helpfulness — and dramatically reducing the need for humans to label harmful outputs.
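To make the "AI-generated feedback" concrete: in the RLAIF phase, a model is shown two candidate responses and asked which better follows a sampled principle, producing the preference label a human would otherwise provide. The sketch below is a minimal illustration under that reading; `model` is a hypothetical stand-in for a real LLM call, not the paper's actual implementation.

```python
def model(prompt: str) -> str:
    """Toy stand-in for an LLM; a real system would query a model
    for a single-token choice between the two responses."""
    return "A"

def ai_preference_label(prompt: str, resp_a: str, resp_b: str,
                        principle: str) -> int:
    """Return 0 if response A is preferred under the principle, else 1.

    These labels train a preference model, which then provides the
    reward signal for RL — no human harm labels required.
    """
    choice = model(
        f"Consider this principle: {principle}\n"
        f"Prompt: {prompt}\n"
        f"Response A: {resp_a}\n"
        f"Response B: {resp_b}\n"
        "Which response better follows the principle? Answer A or B."
    )
    return 0 if choice.strip().upper().startswith("A") else 1
```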
My Analysis
This paper is foundational for understanding how Anthropic thinks about alignment differently from OpenAI's RLHF-heavy approach. The key insight isn't the technique itself but the framing: instead of trying to enumerate all harmful outputs, define a set of principles and let the model self-improve against them.
What I found most interesting:
- The "critique → revision" loop is elegant — it's basically teaching the model to be its own editor
- The constitutional principles are surprisingly simple and readable (unlike reward model weights)
- The self-improvement loop (model critiques and revises its own outputs) means you can iterate without constantly needing human labelers
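The critique → revision loop above can be sketched roughly as follows. This is a hedged illustration, not the paper's code: `model` is a hypothetical text-completion function, the prompt templates are simplified paraphrases, and only the sampling of one principle per round mirrors the paper's setup.

```python
import random

def model(prompt: str) -> str:
    """Toy stand-in for an LLM call; a real implementation would
    query a language model with the given prompt."""
    if prompt.startswith("Critique"):
        return "The response could be more careful about potential harm."
    return "Revised response that addresses the critique."

def critique_revision_loop(prompt: str, response: str,
                           principles: list[str],
                           n_rounds: int = 2) -> str:
    """Iteratively critique a response against constitutional
    principles, then revise it; the final revision becomes a
    supervised fine-tuning target."""
    for _ in range(n_rounds):
        # Sample one principle per round, as in the paper's setup.
        principle = random.choice(principles)
        critique = model(
            f"Critique the response according to this principle:\n"
            f"Principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
        )
        response = model(
            f"Revise the response to address the critique:\n"
            f"Critique: {critique}\n"
            f"Prompt: {prompt}\nResponse: {response}\n"
        )
    return response
```

With a real model plugged in, each round sharpens the response against one principle — the "own editor" behavior noted above.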
Open questions I still have:
- How sensitive is the output to the exact wording of constitutional principles?
- Does this approach work as well for subtle harms as for obvious ones?
- How does this interact with model scale?