tezvyn:

Layer Norm and Residuals in Transformer Blocks

Source: interviewadvanced

WHAT IT TESTS: how Transformer blocks stay trainable at depth. OUTLINE: residual connections preserve gradient flow, layer norm stabilizes activations per token, and layer norm is preferred over batch norm because its statistics do not depend on batch size or sequence length.

WHAT IT TESTS: detailed understanding of Transformer block internals. ANSWER OUTLINE: each sublayer (attention and feed-forward) is wrapped in a residual connection that adds the sublayer's input back to its output. This gives gradients a direct path through the stack, which eases training of deep models. Layer normalization normalizes across the feature dimension for each token independently, stabilizing activations. It is preferred over batch normalization because its statistics are computed per token: they do not depend on batch size or on variable sequence lengths, so the layer behaves identically at training and inference time.
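The outline above can be made concrete with a minimal pure-Python sketch. It rests on several assumptions that are not in the original card: the post-norm arrangement LayerNorm(x + Sublayer(x)), no learned gain or bias in the normalization, and a hypothetical `toy_feed_forward` standing in for a real attention or feed-forward sublayer.

```python
import math

def layer_norm(token, eps=1e-5):
    """Normalize one token's features to zero mean and roughly unit variance.

    Simplification (assumption): the learned gain and bias are omitted.
    """
    mean = sum(token) / len(token)
    var = sum((v - mean) ** 2 for v in token) / len(token)
    return [(v - mean) / math.sqrt(var + eps) for v in token]

def residual_sublayer(tokens, sublayer):
    """Post-norm wrapper (assumed variant): LayerNorm(x + sublayer(x))."""
    outputs = sublayer(tokens)
    return [
        layer_norm([x + y for x, y in zip(tok, out)])
        for tok, out in zip(tokens, outputs)
    ]

def toy_feed_forward(tokens):
    # Hypothetical stand-in for attention or feed-forward: scales each feature.
    return [[0.5 * v for v in tok] for tok in tokens]

# Two tokens with four features each.
sequence = [[1.0, 2.0, 3.0, 4.0], [10.0, 0.0, -10.0, 5.0]]
normalized = residual_sublayer(sequence, toy_feed_forward)

# The same first token processed on its own, outside the longer sequence.
alone = residual_sublayer([sequence[0]], toy_feed_forward)
```

In this sketch `alone[0]` matches `normalized[0]`, because each token is normalized using only its own features. Batch normalization would instead compute statistics across the examples in a batch, which is why it is sensitive to batch size and padding.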
