RLHF: Teaching an AI 'Good' Without Code
Reinforcement Learning from Human Feedback (RLHF) teaches a model what humans prefer by having it chase the approval of a proxy 'reward model' trained on human rankings. It's the key technique for making large language models more helpful and harmless, aligning them with nuanced instructions that are hard to define in code. The main footgun is 'reward hacking,' where the model finds loopholes that please the reward model without actually satisfying users.
Reinforcement learning from human feedback (RLHF) teaches an AI concepts like 'helpfulness' that resist an explicit, hand-coded definition. Instead of writing rules for what makes a good response, we train a separate 'reward model' on human preferences: shown two candidate responses, it learns to predict which one a human would rate higher. This is the crucial fine-tuning step for models like ChatGPT, turning a raw text predictor into a helpful assistant aligned with complex human values. The catch is 'reward hacking': the main model learns to exploit weaknesses in the reward model, producing outputs that score highly but read as nonsensical or unhelpful to a real human. Guarding against this takes careful reward-model design.
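To make the pairwise-preference idea concrete, here is a minimal PyTorch sketch of the reward-model training objective. The tiny `PreferenceRewardModel` and its sizes are illustrative stand-ins for a real language model with a scalar head, not any particular library's API; the loss itself is the standard Bradley-Terry form used on human rankings.

```python
import torch
import torch.nn as nn

class PreferenceRewardModel(nn.Module):
    """Toy reward model: embeds a (prompt + response) token sequence
    and maps it to a single scalar score. A real reward model would be
    a full transformer with a scalar head; this is a stand-in."""
    def __init__(self, vocab_size: int = 50_000, d_model: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.score = nn.Linear(d_model, 1)  # scalar reward head

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then project to one score per sequence.
        pooled = self.embed(token_ids).mean(dim=1)
        return self.score(pooled).squeeze(-1)

def preference_loss(model, chosen_ids, rejected_ids):
    """Bradley-Terry pairwise loss: push the score of the human-preferred
    response above the score of the rejected one."""
    r_chosen = model(chosen_ids)
    r_rejected = model(rejected_ids)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Usage: each batch row is a tokenized (prompt + response) pair.
model = PreferenceRewardModel()
chosen = torch.randint(0, 50_000, (4, 32))    # preferred responses
rejected = torch.randint(0, 50_000, (4, 32))  # dispreferred responses
loss = preference_loss(model, chosen, rejected)
loss.backward()
```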
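Reward-model design isn't the only defense against reward hacking. A common mitigation in practice (used in InstructGPT-style training) is to subtract a KL penalty that punishes the policy for drifting far from its pre-trained reference model, limiting how far it can wander into the reward model's blind spots. A hedged sketch, where `beta` and the per-token log-probability tensors are assumed inputs rather than anything from the source:

```python
import torch

def shaped_reward(reward_model_score: torch.Tensor,
                  logprobs_policy: torch.Tensor,
                  logprobs_reference: torch.Tensor,
                  beta: float = 0.1) -> torch.Tensor:
    """Reward actually optimized during RLHF-style fine-tuning:
    the reward model's score minus beta * KL(policy || reference),
    estimated per sample. `beta` is an illustrative default."""
    # Per-token log-ratio; summed over the response, this estimates the KL
    # divergence between the policy and the frozen reference model.
    kl_estimate = (logprobs_policy - logprobs_reference).sum(dim=-1)
    return reward_model_score - beta * kl_estimate

# Usage: a batch of 4 sampled responses, 32 tokens each.
score = torch.randn(4)        # reward model scores, one per response
lp_pol = torch.randn(4, 32)   # policy log-probs per token
lp_ref = torch.randn(4, 32)   # reference-model log-probs per token
r = shaped_reward(score, lp_pol, lp_ref)
```

Raising `beta` trades reward for fidelity to the reference model: too low and the policy can exploit the reward model, too high and it barely changes from the pre-trained baseline.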
Read the original → Wikipedia: Reinforcement learning from human feedback
- #llm
- #reinforcement learning
- #ai alignment
- #generative ai