tezvyn:

Partitioning order events in a data lake

Source: interview · intermediate

WHAT IT TESTS: partition design for query pruning. OUTLINE: partition by the columns that queries filter on, typically a date hierarchy and category, and balance granularity to avoid too many tiny files. RED FLAG: partitioning on high-cardinality keys like order ID.

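WHAT IT TESTS: whether you can choose partition keys that prune scans without creating a small-files problem. ANSWER OUTLINE: partition by the fields that queries filter on. Here that means a date hierarchy such as year/month/day, optionally followed by product category, so that a monthly-by-category query reads only the relevant prefixes. Pick a granularity that keeps files reasonably sized: if daily partitions hold only a little data each, stop at month. Avoid high-cardinality keys such as order ID or customer ID, which explode the partition count and leave each partition with a handful of tiny files. Combine the layout with a columnar file format (Parquet, for example) so queries read only the columns they need, and with a catalog so query engines can find the partitions without listing the whole lake.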
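The outline above can be made concrete with a small sketch. It assumes Hive-style `key=value` directory names, which is one common convention and not something the original specifies; the function names `partition_prefix` and `prune` and the sample events are hypothetical, chosen only for illustration.

```python
from datetime import date


def partition_prefix(event_date: date, category: str) -> str:
    """Build a Hive-style partition prefix for one order event (assumed layout)."""
    return (
        f"year={event_date.year:04d}/month={event_date.month:02d}/"
        f"day={event_date.day:02d}/category={category}"
    )


def prune(prefixes, year: int, month: int, category: str):
    """Keep only the prefixes a monthly-by-category query has to read."""
    wanted_start = f"year={year:04d}/month={month:02d}/"
    wanted_end = f"/category={category}"
    return [
        p for p in prefixes
        if p.startswith(wanted_start) and p.endswith(wanted_end)
    ]


# Hypothetical sample events: (order date, product category).
events = [
    (date(2024, 3, 1), "books"),
    (date(2024, 3, 2), "books"),
    (date(2024, 3, 2), "toys"),
    (date(2024, 4, 1), "books"),
]

# One prefix per distinct (day, category) pair, not per order.
prefixes = sorted({partition_prefix(d, c) for d, c in events})

# A "March 2024, books" query touches two of the four prefixes.
selected = prune(prefixes, 2024, 3, "books")
for p in selected:
    print(p)
```

Partitioning by order ID instead would create one prefix per order, so the number of partitions would grow with the number of events rather than with the number of days and categories. That is the small-files problem the red flag refers to.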

