Papers · advanced · March 29, 2026 · 45 min read · EN

Constitutional AI: Harmlessness from AI Feedback


Core Insight

Anthropic's Constitutional AI (CAI) approach replaces the human harmlessness labels in RLHF with AI-generated feedback guided by a set of written principles (the "constitution"), achieving competitive harmlessness without sacrificing helpfulness and dramatically reducing the need for human red-teaming.
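
In the RLAIF stage, an AI "judge" compares pairs of responses against a constitutional principle to produce preference labels for reward-model training. A minimal sketch of that step, where the hypothetical `judge` function is a toy string heuristic standing in for what would really be an LLM call prompted with the principle and both responses:

```python
# Sketch of the AI-feedback labeling step: a judge model picks the
# more harmless of two responses according to a constitutional principle.

PRINCIPLE = (
    "Choose the response that is less harmful, avoiding assistance "
    "with dangerous or unethical requests."
)

def judge(prompt: str, response_a: str, response_b: str) -> str:
    """Toy judge: prefers the response that refuses a harmful request.
    A real system would prompt an LLM with PRINCIPLE and both responses."""
    harmful_markers = ("here's how", "step 1")
    a_harmful = any(m in response_a.lower() for m in harmful_markers)
    b_harmful = any(m in response_b.lower() for m in harmful_markers)
    if a_harmful and not b_harmful:
        return "B"
    if b_harmful and not a_harmful:
        return "A"
    return "A"  # tie-break arbitrarily

def make_preference_pair(prompt: str, resp_a: str, resp_b: str):
    """Emit a (chosen, rejected) pair for reward-model training."""
    choice = judge(prompt, resp_a, resp_b)
    return (resp_a, resp_b) if choice == "A" else (resp_b, resp_a)
```

The resulting (chosen, rejected) pairs feed the same preference-model training pipeline that human labels would in standard RLHF; only the source of the labels changes.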

My Analysis

This paper is foundational for understanding how Anthropic thinks about alignment differently from OpenAI's RLHF-heavy approach. The key insight isn't the technique itself but the framing: instead of trying to enumerate all harmful outputs, define a set of principles and let the model self-improve against them.

What I found most interesting:

  • The "critique → revision" loop is elegant — it's basically teaching the model to be its own editor
  • The constitutional principles are surprisingly simple and readable (unlike reward model weights)
  • The self-play aspect means you can iterate without constantly needing human labelers

Open questions I still have:

  • How sensitive is the output to the exact wording of constitutional principles?
  • Does this approach work as well for subtle harms vs obvious ones?
  • How does this interact with model scale?