Mixture of Experts: Scaling Models by Activating Specialists
A Mixture of Experts (MoE) model acts like a team of specialists instead of one generalist. A router sends each token to a few expert sub-networks, so only a fraction of the model's parameters run per token, which speeds up training and inference for massive models.
A Mixture of Experts (MoE) model trades dense computation for selective activation. Instead of every parameter processing every token, a 'router' network directs each token to a small subset of 'expert' networks. This lets massive models like Mixtral 8x7B be pretrained with far less compute, and run inference faster, than a dense model of the same parameter count. The main footgun is memory: all experts must be loaded into VRAM, so memory requirements follow the total parameter count rather than the few experts active for any one token.
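To make the routing idea concrete, here is a minimal pure-Python sketch of top-k routing for a single token. It is an illustration under stated assumptions, not the implementation of Mixtral or any library: the experts are toy linear layers, the gate is a softmax over the selected experts' logits, and every name (`make_expert`, `moe_forward`, `param_counts`) and every size is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, rng):
    """Hypothetical expert: one random linear layer (real experts are FFN blocks)."""
    w = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(dim)]

    def expert(x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]

    return expert


def moe_forward(token, router_weights, experts, top_k=2):
    """Route one token to its top_k experts and mix their outputs."""
    # Router: one logit per expert (a single linear layer, assumed here).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Sparse activation: keep only the top_k highest-scoring experts.
    order = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)
    chosen = order[:top_k]
    # Gate weights: softmax over the selected logits only (an assumption).
    gates = softmax([logits[i] for i in chosen])
    out = [0.0] * len(token)
    for gate, i in zip(gates, chosen):
        y = experts[i](token)  # only the chosen experts do any compute
        out = [o + gate * v for o, v in zip(out, y)]
    return out, chosen, gates


def param_counts(num_experts, top_k, params_per_expert, shared_params):
    """Return (total, active) parameter counts: total drives memory, active drives compute."""
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active


rng = random.Random(0)
dim, num_experts = 4, 8  # toy sizes, chosen only for illustration
experts = [make_expert(dim, rng) for _ in range(num_experts)]
router_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_experts)]
token = [rng.gauss(0, 1) for _ in range(dim)]

out, chosen, gates = moe_forward(token, router_weights, experts, top_k=2)
print("experts used:", chosen, "gate weights:", gates)
```

Only two of the eight experts run for this token, yet all eight had to be constructed and held in memory first. `param_counts` expresses the same trade-off as arithmetic: compute follows the active count, memory follows the total.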
Read the original → huggingface.co
- #llms
- #model architecture
- #moe
- #scaling