Constitutional AI: Teaching an AI Right from Wrong
Constitutional AI teaches a model to be harmless by making it follow a set of principles, a constitution, instead of relying on human-labeled examples of bad behavior. The model first learns to critique and revise its own outputs, then is further trained with Reinforcement Learning from AI Feedback (RLAIF), in which an AI judge supplies the reward signal. This alignment process enables powerful models to refuse harmful requests while explaining their reasoning. The entire system's safety, however, hinges on the quality and completeness of the initial human-written constitution.
Constitutional AI teaches a model to be harmless by making it follow a written set of principles, a constitution, instead of using human-labeled examples of bad behavior. The AI learns to critique and revise its own outputs so they better align with these rules. The process has two phases: first, a model is fine-tuned on its own self-corrected responses; second, Reinforcement Learning from AI Feedback (RLAIF) is applied. In RLAIF, an AI model judges pairs of responses to create a preference dataset, which is then used to train the reward model that provides the signal for reinforcement learning. This allows models to learn to be helpful and harmless without constant human oversight. The entire system's safety is only as good as its constitution; poorly written principles lead to undesirable behavior.
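The two phases above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: every model call is a hypothetical stub (`critique`, `revise`, `ai_preference`), and a real system would query an LLM at each of those points.

```python
# Sketch of the two-phase Constitutional AI pipeline.
# Stubs stand in for LLM calls; the "harmful" marker is a toy heuristic.

CONSTITUTION = [
    "Choose the response that is least harmful.",
    "Choose the response that explains refusals rather than being evasive.",
]

def critique(response: str, principle: str) -> str:
    """Stub critic: flag responses containing a 'harmful' marker."""
    if "harmful" in response:
        return f"Violates principle: {principle}"
    return ""

def revise(response: str, critique_text: str) -> str:
    """Stub reviser: replace flagged content with a reasoned refusal."""
    if critique_text:
        return "I can't help with that, because it could cause harm."
    return response

def self_correct(response: str) -> str:
    """Phase 1: critique-and-revise against each constitutional principle."""
    for principle in CONSTITUTION:
        response = revise(response, critique(response, principle))
    return response

def ai_preference(pair: tuple) -> int:
    """Phase 2 (RLAIF) stub judge: prefer the less harmful response."""
    a, _ = pair
    return 0 if "harmful" not in a else 1

# Phase 1 output: a supervised fine-tuning set of self-corrected responses.
raw = ["Here is harmful content...", "Photosynthesis converts light to energy."]
sft_data = [self_correct(r) for r in raw]

# Phase 2 output: AI-labeled preference pairs, which would train the
# reward model used during reinforcement learning.
pairs = [("The sky is blue.", "Here is harmful content...")]
preferences = [(p, ai_preference(p)) for p in pairs]
```

The key design point the sketch preserves is that no human labels appear anywhere: the critique, the revision, and the preference judgment all come from the model side, guided only by the written constitution.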
Read the original → arXiv
- #llms
- #ai-safety
- #rlhf
- #alignment