How would you systematically diagnose high latency in an online inference service?
WHAT IT TESTS: Systems reasoning across serving stack. ANSWER OUTLINE: Check p90/p99 and TTFT to split queuing from compute; inspect queue depth, batch size, GPU, and benchmarks; check cache.
WHAT IT TESTS: End-to-end systems reasoning across the ML serving stack, separating infrastructure queuing from model compute and app overhead. ANSWER OUTLINE: Check tail percentiles p90/p99 and TTFT first to split queuing from inference; then inspect queue depth, batch size, and GPU use at the infrastructure layer; run isolated model benchmarks to establish baselines; audit app routing, serialization, and cache hits. RED FLAG: Jumping to model optimization before checking queue saturation, autoscaling cold starts, or batching misconfiguration.
Read the original → mlflow.org
- #mlops
- #infrastructure
- #latency
- #observability
- #interview
Get five bites like this every day.
Tezvyn delivers a daily feed of 60-second tech bites with quizzes to lock in what you learn.