How would you speed up slow single-GPU training?
Source: interview · intermediate
WHAT IT TESTS: knowledge of scaling training. OUTLINE: vertical scaling to bigger or multi-GPU instances, then data-parallel or model-parallel distributed training across nodes.
WHAT IT TESTS: whether you can scale ML training thoughtfully. ANSWER OUTLINE: First, scale up: move to a larger or multi-GPU instance, and apply mixed precision and a larger batch size so the hardware is fully used. Second, scale out with distributed training: data parallelism replicates the model on every GPU and syncs gradients via all-reduce, while model parallelism splits a model that is too large to fit on a single GPU. Mention the trade-off: adding GPUs also adds communication overhead.
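To make the data-parallel step concrete, here is a minimal sketch in plain Python, with no GPU or framework involved. It assumes a toy one-weight linear model, equal-sized shards, and a hypothetical `all_reduce_mean` helper that stands in for a real collective. All names are illustrative only. It shows why averaging per-worker gradients reproduces the full-batch gradient.

```python
# Toy data parallelism: every "worker" holds the same weight w, computes a
# gradient on its own shard of the batch, then the gradients are averaged.
# This is a simulation in one process, not a real distributed setup.

def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x, w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def all_reduce_mean(values):
    """Hypothetical stand-in for all-reduce: each worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)


def data_parallel_grad(w, xs, ys, num_workers):
    """Split the batch into equal shards, compute local grads, average them."""
    assert len(xs) % num_workers == 0, "sketch assumes equal-sized shards"
    shard = len(xs) // num_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_workers)
    ]
    return all_reduce_mean(local_grads)


# Illustrative data: targets follow y = 2x, and the weight starts at 0.5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
w = 0.5

single = grad_mse(w, xs, ys)                            # one device, full batch
synced = data_parallel_grad(w, xs, ys, num_workers=4)   # four simulated workers
```

With equal shards, every entry of `synced` matches `single`, so each replica applies the same update and the copies stay in sync. In a real cluster the averaging step is where the communication cost arises, because every worker has to exchange gradients on each training step.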
- #distributed-training
- #gpu
- #scaling
- #cloud
- #machine-learning