InstructGPT (2022) Paper Notes
Paper notes on InstructGPT covering its core method, the three-step RLHF recipe (SFT → reward model → PPO); the headline alignment result, in which human labelers prefer outputs from a 1.3B InstructGPT model over those of the 175B GPT-3; improved truthfulness and reduced toxicity; the alignment tax and its mitigation with PPO-ptx; and the paper's limitations.
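As a rough illustration of the middle step of the recipe, the reward model is trained on human comparisons with a pairwise ranking loss: for each pair, the response the labeler preferred should score higher than the rejected one. The sketch below computes that loss for a single comparison. The function name and the plain-float inputs are illustrative assumptions, not code from the paper, and a real reward model would produce these scores from a neural network and average the loss over many pairs.

```python
import math


def pairwise_rm_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Loss for one human comparison: -log(sigmoid(r_preferred - r_rejected)).

    Hypothetical helper for illustration; inputs are scalar reward scores.
    """
    diff = reward_preferred - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    # when the score gap is large in either direction.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


if __name__ == "__main__":
    # The loss is small when the preferred response already scores higher,
    # and large when the ranking is inverted.
    print(pairwise_rm_loss(2.0, 0.0))
    print(pairwise_rm_loss(0.0, 2.0))
```

The loss shrinks toward zero as the preferred response's score pulls ahead, which is what lets the trained reward model stand in for human judgment during the PPO step.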