Value Learning
Value learning is the AI-safety approach of having a system infer what humans actually value, rather than optimizing a hand-coded proxy, so that capable agents pursue goals aligned with human intent even in novel situations.
Value learning addresses a central alignment problem: human values cannot be fully specified in code, and any fixed proxy objective can be gamed or can break down in unforeseen states. Instead, the agent learns a model of human preferences from behavior, feedback, or comparisons, and treats that model, rather than a brittle hand-written reward, as the thing to optimize. The mental model is uncertainty over the true objective: because the system knows its value estimate is incomplete and improvable, it has a reason to remain corrigible and defer to humans.
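To make "learning preferences from comparisons" concrete, here is a minimal sketch that fits a linear reward model to pairwise choices using a Bradley-Terry style likelihood. Everything in it is an illustrative assumption rather than a method described in the original: the two-feature outcomes, the hidden `true_w` used to simulate a human, and the helper names `fit_reward_weights` and `reward` are all hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def reward(w, features):
    """Linear reward: dot product of weights and outcome features."""
    return sum(wi * fi for wi, fi in zip(w, features))


def fit_reward_weights(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights from (preferred, rejected) feature pairs.

    Assumes a Bradley-Terry model:
    P(preferred beats rejected) = sigmoid(reward(preferred) - reward(rejected)).
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            diff = [p - r for p, r in zip(preferred, rejected)]
            p_correct = sigmoid(reward(w, diff))
            # Gradient ascent on the log-likelihood of the observed choice.
            for i in range(n_features):
                w[i] += lr * (1.0 - p_correct) * diff[i]
    return w


# Simulated human (an assumption for this toy): likes feature 0, dislikes feature 1.
random.seed(0)
true_w = [2.0, -1.0]
comparisons = []
for _ in range(200):
    a = [random.uniform(-1, 1) for _ in range(2)]
    b = [random.uniform(-1, 1) for _ in range(2)]
    if reward(true_w, a) >= reward(true_w, b):
        comparisons.append((a, b))
    else:
        comparisons.append((b, a))

learned_w = fit_reward_weights(comparisons, n_features=2)
print("learned weights:", learned_w)
```

The learner never sees `true_w`; it only sees which of two outcomes was chosen, and it recovers weights that point in roughly the same direction. Only the direction is identifiable here, since noise-free comparisons say nothing about the scale of the reward. A practical system would also track uncertainty over the weights, which this sketch omits.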
Read the original → https://en.wikipedia.org/wiki/Value_learning
- #ai-safety
- #alignment
- #value-learning
- #rlhf
- #reward-modeling