How would you instrument and query P95 API latency by region?
This tests white-box latency instrumentation and cardinality-safe percentile aggregation. Strong answer: emit histograms labeled by region, query P95 with histogram_quantile or a log-based percentile, and keep trace IDs in logs rather than in metric labels.
This tests whether you can design white-box monitoring for request latency with dimensional labels without causing a cardinality explosion. A strong answer covers four things: instrumenting the app with histograms or timers tagged by a low-cardinality region label; computing P95 over a sliding window, either with histogram_quantile in Prometheus (applied to bucket counts summed per region, since percentiles cannot be averaged across instances) or with a percentile aggregation over structured logs; keeping high-cardinality context such as trace IDs in structured logs rather than metric labels; and validating accuracy by comparing the histogram-derived estimate against percentiles computed from raw log samples.
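The four points above can be sketched in plain Python. This is a minimal illustration, not a production client: the bucket bounds, the class name RegionLatencyHistogram, the helpers log_request and raw_percentile, and the region and trace ID values are all hypothetical. The quantile method mimics the linear interpolation that histogram_quantile performs inside a bucket, under the assumption that the lowest bucket starts at 0.

```python
import json
import math
from bisect import bisect_left
from collections import defaultdict

# Hypothetical bucket upper bounds in seconds; the last bucket is +Inf.
BUCKET_BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, float("inf")]


class RegionLatencyHistogram:
    """Fixed-bucket latency histogram keyed by a low-cardinality region label."""

    def __init__(self, bounds=BUCKET_BOUNDS):
        self.bounds = list(bounds)
        self.counts = defaultdict(lambda: [0] * len(self.bounds))

    def observe(self, region, seconds):
        # First bucket whose upper bound is >= the observation ("le" semantics).
        idx = bisect_left(self.bounds, seconds)
        self.counts[region][idx] += 1

    def quantile(self, region, q):
        """Estimate the q-quantile by linear interpolation inside a bucket."""
        counts = self.counts.get(region)
        if not counts or sum(counts) == 0:
            return None
        rank = q * sum(counts)
        seen = 0
        for i, count in enumerate(counts):
            if count and seen + count >= rank:
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                if math.isinf(upper):
                    # No finite upper bound to interpolate toward.
                    return lower
                return lower + (upper - lower) * (rank - seen) / count
            seen += count
        return None


def log_request(region, seconds, trace_id):
    """High-cardinality context (trace ID) goes in the log line, not a label."""
    return json.dumps({"region": region, "latency_s": seconds, "trace_id": trace_id})


def raw_percentile(samples, q):
    """Nearest-rank percentile over raw samples, used to validate the estimate."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Made-up sample data: 90 fast requests and 10 slow ones in one region.
hist = RegionLatencyHistogram()
samples = [0.08] * 90 + [0.4] * 10
for value in samples:
    hist.observe("eu-west", value)

estimate = hist.quantile("eu-west", 0.95)  # histogram-based P95
exact = raw_percentile(samples, 0.95)      # P95 from raw samples
record = log_request("eu-west", 0.4, "trace-abc123")
```

Here the histogram estimate falls somewhere inside the 0.25 to 0.5 second bucket while the raw-sample P95 is an observed value, so the gap between the two is bounded by the bucket width. That gap is what the validation step checks. In Prometheus itself, the equivalent query would look roughly like histogram_quantile(0.95, sum by (le, region) (rate(http_request_duration_seconds_bucket[5m]))), where the metric name and the 5m window are assumptions.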
Read the original → sre.google
- #observability
- #prometheus
- #latency
- #sre
- #metrics