Core Insight
Understanding LLM alignment requires reading papers in the right order – from the foundational RLHF work to modern techniques like Constitutional AI and DPO. This list provides a structured reading path.
My Analysis
The reading order that worked for me:
- InstructGPT (Ouyang et al., 2022) – The paper that started it all: RLHF applied to GPT-3 to follow instructions.
- RLHF original (Christiano et al., 2017) – The theoretical foundation for learning a reward model from human preference comparisons.
- Constitutional AI (Bai et al., 2022) – Anthropic's approach to scaling alignment without scaling human labeling: the model critiques and revises its own outputs against a set of written principles.
- DPO (Rafailov et al., 2023) – Direct Preference Optimization eliminates the explicit reward model and RL loop, optimizing the policy directly on preference pairs.
- RLAIF (Lee et al., 2023) – Using AI-generated feedback in place of human feedback.
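To make the DPO entry concrete, here is a minimal sketch of its per-example loss. It treats the model's and reference model's summed log-probabilities for the chosen and rejected responses as plain floats; the function name, argument names, and β = 0.1 are illustrative, not from the paper's code.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from summed sequence log-probabilities."""
    # Implicit rewards: beta-scaled log-prob ratios against the frozen reference model
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Bradley-Terry preference loss: -log sigmoid(reward margin)
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

The loss shrinks as the policy assigns relatively more probability to the chosen response than the reference model does, which is exactly the pressure an RLHF reward model plus PPO would otherwise apply.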
The remaining 5 papers are more specialized and can be read in any order based on interest.
Key observation: the field is moving toward reducing human involvement in the alignment loop. From RLHF (humans rank outputs) → CAI (AI critiques using principles) → DPO (direct optimization from preferences) → RLAIF (AI generates the preferences).