Why Multi-Head Attention
WHAT IT TESTS: rationale for splitting attention into heads. OUTLINE: several heads attend to different subspaces and relations in parallel, whereas one wide head blends them into a single averaged pattern. RED FLAG: claiming that more heads is always better, or that splitting into heads raises total compute.
WHAT IT TESTS: why Transformers split attention rather than scale one head. ANSWER OUTLINE: a single head computes one softmax distribution per query, so it tends to blend competing relations into one averaged attention pattern, which limits what it can capture. Multi-head attention instead runs several lower-dimensional heads in parallel, each on its own learned projections of the queries, keys, and values, so each head can specialize in a different relationship or subspace. The head outputs are concatenated and passed through a final output projection. Total compute stays roughly the same because each of the h heads has dimension d_model / h, so together they cost about as much as one full-width head.
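SKETCH: the split-attend-concatenate-project flow above can be illustrated with a minimal NumPy example. This is an illustrative sketch, not a reference implementation: the function name multi_head_attention, the weight names w_q, w_k, w_v, w_o, and the toy sizes are all assumptions, and masking, batching, biases, and dropout are left out. It assumes d_model is divisible by num_heads.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Hypothetical helper: self-attention over x with num_heads heads."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads  # each head is narrower: d_model / h

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # One softmax pattern per head: shape (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    heads = weights @ v  # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights


# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model = 5, 16
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

# The same weight matrices serve both settings, so the parameter count
# (and the projection cost) is identical for 1 head and for 4 heads.
out_1, attn_1 = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=1)
out_4, attn_4 = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads=4)
n_params = sum(w.size for w in (w_q, w_k, w_v, w_o))
```

With one head there is a single seq_len x seq_len attention pattern; with four heads there are four independent patterns over the same tokens, while the output shape and the number of weights are unchanged. The score computation costs about the same too, since each head works on a slice of width d_model / h.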
- #multi-head-attention
- #transformers
- #attention
- #self-attention
- #nlp