
Diagnose a Prometheus cardinality explosion

Source: interview · advanced

WHAT IT TESTS: operating Prometheus at scale. OUTLINE: find the offending metrics with the TSDB stats and a query such as topk(10, count by (__name__)({__name__=~".+"})), identify the unbounded labels behind them, then drop or aggregate those labels with relabeling. RED FLAG: just scaling memory without fixing the label design.
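The "find offenders via TSDB stats" step can be sketched in Python. This is a minimal sketch, assuming the stats have already been fetched from Prometheus's TSDB status API (GET /api/v1/status/tsdb) and parsed into a dict. The sample payload, the metric and label names in it, the helper names, and the 1000-value threshold are all hypothetical and only there for illustration.

```python
# Hypothetical sample shaped like the "data" field of the TSDB status API
# response. All metric names, label names, and counts are invented.
SAMPLE_TSDB_STATUS = {
    "headStats": {"numSeries": 1250000},
    "seriesCountByMetricName": [
        {"name": "http_requests_total", "value": 300000},
        {"name": "http_request_duration_seconds_bucket", "value": 900000},
        {"name": "up", "value": 120},
    ],
    "labelValueCountByLabelName": [
        {"name": "path", "value": 48000},
        {"name": "user_id", "value": 210000},
        {"name": "job", "value": 14},
    ],
}


def top_offenders(status, key, limit=5):
    """Return (name, value) pairs from one TSDB stats list, largest first."""
    entries = status.get(key, [])
    ranked = sorted(entries, key=lambda e: e["value"], reverse=True)
    return [(e["name"], e["value"]) for e in ranked[:limit]]


def suspicious_labels(status, threshold=1000):
    """Labels with more distinct values than an (arbitrary) threshold."""
    ranked = top_offenders(status, "labelValueCountByLabelName", limit=50)
    return [name for name, count in ranked if count > threshold]


if __name__ == "__main__":
    # Metrics holding the most series, then labels that look unbounded.
    for name, count in top_offenders(SAMPLE_TSDB_STATUS, "seriesCountByMetricName"):
        print(f"{count:>10}  {name}")
    print("unbounded-looking labels:", suspicious_labels(SAMPLE_TSDB_STATUS))
```

Ranking by series count points at the worst metrics, and ranking by distinct label values points at the labels most likely to be unbounded. The threshold is a judgment call and would need tuning for a real deployment.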

WHAT IT TESTS: whether you understand that each unique label set is a separate time series. ANSWER OUTLINE: diagnose with the TSDB status page (/tsdb-status in the web UI), the prometheus_tsdb_head_series gauge, and count by (__name__)({__name__=~".+"}) to find the worst metrics. Cardinality usually explodes from unbounded labels such as user ID, pod name, request path, or error message. Mitigate by dropping labels with metric_relabel_configs, bucketing high-variance values, and adding recording rules that pre-aggregate. RED FLAG: just adding RAM, blaming Prometheus itself, or proposing per-request labels.
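To make "each unique label set is a separate time series" and "bucketing high-variance values" concrete, here is a small Python sketch. It assumes the bucketing happens in application code before values become labels, which is one option alongside relabeling. The helper names, the ":id" placeholder, and the synthetic request data are hypothetical.

```python
import re

# Matches a canonical 8-4-4-4-12 hexadecimal UUID.
UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def bucket_path(path):
    """Drop the query string and replace ID-like segments with ':id'."""
    path = path.split("?", 1)[0]
    segments = []
    for seg in path.split("/"):
        if seg.isdigit() or UUID_RE.match(seg):
            segments.append(":id")
        else:
            segments.append(seg)
    return "/".join(segments)


def bucket_status(code):
    """Collapse an HTTP status code into its class, e.g. 404 -> '4xx'."""
    return f"{code // 100}xx"


def series_count(label_sets):
    """Count unique label sets; each one would be its own time series."""
    return len({tuple(sorted(ls.items())) for ls in label_sets})


# Synthetic traffic: 1000 user IDs in the path, three status codes each.
requests = [
    {"path": f"/users/{uid}/orders", "status": status}
    for uid in range(1000)
    for status in (200, 404, 500)
]

raw = series_count(
    {"path": r["path"], "status": str(r["status"])} for r in requests
)
bucketed = series_count(
    {"path": bucket_path(r["path"]), "status": bucket_status(r["status"])}
    for r in requests
)
print(raw, bucketed)  # 3000 series before bucketing, 3 after
```

The raw labels multiply (1000 paths times 3 statuses), while the bucketed labels stay bounded however many users appear. Adding RAM does nothing to change that growth.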


