Partitioning order events in a data lake
WHAT IT TESTS: partition design for query pruning. OUTLINE: partition by the columns queries filter on, typically date hierarchy and category, balancing granularity to avoid too many tiny files. RED FLAG: partitioning on high-cardinality keys like order ID.
WHAT IT TESTS: whether you can choose partition keys that prune scans without creating a small-files problem. ANSWER OUTLINE: partition by the fields queries filter on, here a date hierarchy like year/month/day plus optionally product category, so a monthly-by-category query reads only the relevant prefixes; pick a granularity that keeps files reasonably sized; avoid high-cardinality keys like order ID or customer ID, which explode the partition count. Combine the layout with a columnar file format and a catalog that records which partitions exist, so query engines can skip both irrelevant partitions and irrelevant columns.
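To make the layout concrete, here is a minimal sketch of how such prefixes might be built and how a monthly-by-category query maps onto them. It assumes Hive-style `key=value` path segments; the bucket name `example-bucket`, the `orders` base path, and both function names are hypothetical and used only for illustration.

```python
import calendar
from datetime import date
from typing import List, Optional


def partition_prefix(base: str, order_date: date, category: Optional[str] = None) -> str:
    """Build a Hive-style prefix: base/year=YYYY/month=MM/day=DD[/category=X]/."""
    parts = [
        base.rstrip("/"),
        f"year={order_date.year:04d}",
        f"month={order_date.month:02d}",
        f"day={order_date.day:02d}",
    ]
    if category is not None:
        parts.append(f"category={category}")
    return "/".join(parts) + "/"


def prefixes_for_month(base: str, year: int, month: int, category: str) -> List[str]:
    """List the only prefixes a monthly-by-category query needs to read."""
    days_in_month = calendar.monthrange(year, month)[1]
    return [
        partition_prefix(base, date(year, month, day), category)
        for day in range(1, days_in_month + 1)
    ]


if __name__ == "__main__":
    # Hypothetical base path; any object-store or filesystem prefix would do.
    base = "s3://example-bucket/orders"
    for prefix in prefixes_for_month(base, 2024, 2, "books")[:3]:
        print(prefix)
```

Under this layout the February 2024 query for one category touches 29 prefixes and nothing else. As a rough illustration of the red flag, a year of daily partitions across an assumed 20 categories is 365 × 20 = 7,300 partitions, whereas partitioning by order ID creates one partition per order.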
- #cloud
- #partitioning
- #data-lake
- #s3
- #query-optimization