โ† Back to Library
Curated Lists · intermediate · January 20, 2026 · 10 min read · EN

10 papers for LLM alignment


Core Insight

Understanding LLM alignment requires reading papers in the right order — from the foundational RLHF work to modern techniques like Constitutional AI and DPO. This list provides a structured reading path.

My Analysis

The reading order that worked for me:

  1. InstructGPT (Ouyang et al., 2022) — The paper that started it all. RLHF applied to GPT-3.
  2. RLHF original (Christiano et al., 2017) — The theoretical foundation for learning from human preferences.
  3. Constitutional AI (Bai et al., 2022) — Anthropic's approach to scaling alignment without scaling human labeling.
  4. DPO (Rafailov et al., 2023) — Direct Preference Optimization eliminates the reward model entirely.
  5. RLAIF (Lee et al., 2023) — Using AI feedback instead of human feedback.
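To make item 4 concrete, here is a minimal sketch of the DPO objective for a single preference pair. This is not the authors' code: the function name, the scalar log-probability inputs, and the beta value are illustrative assumptions; the formula is the published per-pair loss, -log sigmoid of the beta-scaled log-ratio margin between the chosen and rejected responses.

```python
import math

def dpo_loss(pi_logp_chosen, pi_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (illustrative sketch).

    Inputs are total log-probabilities of the chosen/rejected responses
    under the trained policy (pi) and the frozen reference model (ref).
    beta=0.1 is an assumed, commonly used setting, not a prescription.
    """
    # Implicit reward of each response: beta * log-ratio vs. the reference.
    chosen_reward = beta * (pi_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (pi_logp_rejected - ref_logp_rejected)
    # Bradley-Terry negative log-likelihood that chosen beats rejected.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# When the policy favors the chosen response more than the reference does,
# the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0, beta=0.1)
```

Note there is no explicit reward model anywhere: the log-ratio against the frozen reference plays that role, which is exactly what "eliminates the reward model" means in practice.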

The remaining 5 papers are more specialized and can be read in any order based on interest.

Key observation: the field is moving toward reducing human involvement in the alignment loop. From RLHF (humans rank outputs) → CAI (AI critiques using principles) → DPO (direct optimization from preferences) → RLAIF (AI generates the preferences).
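One way to see why the label source can change without changing the pipeline: the reward-model fitting step in RLHF is a Bradley-Terry loss over preference pairs, and it is indifferent to whether a human or an AI judge decided which response was "chosen". A minimal sketch (function name and scalar inputs are assumptions for illustration):

```python
import math

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss for fitting an explicit reward model.

    r_chosen / r_rejected are scalar scores the reward model assigns to
    the preferred and dispreferred responses. In RLHF the preference
    label comes from a human ranker; in RLAIF it comes from an AI judge.
    The loss itself is identical in both cases.
    """
    # Negative log-likelihood that the chosen response wins the comparison.
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```

This is why the progression above is best read as swapping out the source of preference labels (and, in DPO's case, the explicit reward model) while the underlying preference-learning objective stays the same.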