Core Insight
Understanding LLM alignment requires reading papers in the right order – from the foundational RLHF work to modern techniques like Constitutional AI and DPO. This list provides a structured reading path.
My Analysis
The reading order that worked for me:
- InstructGPT (Ouyang et al., 2022) – The paper that started it all: RLHF applied to GPT-3 to follow instructions.
- RLHF original (Christiano et al., 2017) – The theoretical foundation for learning a reward model from human preference comparisons.
- Constitutional AI (Bai et al., 2022) – Anthropic's approach to scaling alignment without scaling human labeling: the model critiques and revises its own outputs against a set of written principles.
- DPO (Rafailov et al., 2023) – Direct Preference Optimization eliminates the explicit reward model and RL loop, optimizing the policy directly on preference pairs.
- RLAIF (Lee et al., 2023) – Using AI-generated feedback in place of human feedback.
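To make the DPO entry concrete, here is a minimal sketch of its per-example loss. It treats the model's and reference model's summed log-probabilities for the chosen and rejected responses as plain floats; the function name, argument names, and β = 0.1 are illustrative, not from the paper's code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Implicit rewards: beta-scaled log-prob ratios against the frozen reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry preference loss: -log sigmoid(reward margin)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is exactly the pressure an RLHF reward model plus PPO would otherwise apply.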
The remaining 5 papers are more specialized and can be read in any order based on interest.
Key observation: the field is moving toward reducing human involvement in the alignment loop. From RLHF (humans rank outputs) → CAI (AI critiques using principles) → DPO (direct optimization from preferences) → RLAIF (AI generates the preferences).