tezvyn:

The small files problem in data lakes

Source: interview · advanced

WHAT IT TESTS: lake performance pathology. OUTLINE: too many tiny files inflate metadata and per-file overhead, slowing queries; caused by streaming micro-batches and over-partitioning; fix with compaction and table formats like Iceberg, Delta, or Hudi.
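As a rough illustration of the compaction fix named in the outline, the sketch below plans which small files to merge into larger ones. It is a minimal first-fit-decreasing bin packer, not how Iceberg, Delta, or Hudi implement compaction. The function name `plan_compaction`, the `small_threshold` and `target_bytes` parameters, and the toy file sizes are all assumptions made for the example.

```python
from typing import List, Tuple


def plan_compaction(
    files: List[Tuple[str, int]],  # (file name, size in bytes); hypothetical input shape
    target_bytes: int,             # assumed desired size of each compacted output file
    small_threshold: int,          # assumed cutoff below which a file counts as "small"
) -> List[List[str]]:
    """Group small files into bins of at most target_bytes (first-fit decreasing)."""
    # Only files under the threshold are candidates; large files are left alone.
    small = sorted(
        (f for f in files if f[1] < small_threshold),
        key=lambda f: f[1],
        reverse=True,
    )

    bins: List[List[str]] = []  # file names per planned output file
    used: List[int] = []        # bytes already assigned to each bin

    for name, size in small:
        for i, current in enumerate(used):
            if current + size <= target_bytes:
                bins[i].append(name)
                used[i] += size
                break
        else:
            # No existing bin has room, so start a new one.
            bins.append([name])
            used.append(size)

    # Rewriting a bin that holds a single file gains nothing, so drop those.
    return [b for b in bins if len(b) > 1]


# Toy sizes (in arbitrary units) purely for illustration.
files = [("a", 40), ("b", 30), ("c", 30), ("d", 500), ("e", 60)]
plan = plan_compaction(files, target_bytes=100, small_threshold=100)
print(plan)
```

Each inner list is one planned rewrite: its files would be read and written back as a single larger file. A real table format also has to commit that swap atomically so readers never see both the old and the new files, which this sketch leaves out.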

WHAT IT TESTS: whether you understand why many tiny files cripple lake performance and how to remedy it. ANSWER OUTLINE: each file carries fixed open, listing, and metadata overhead, so with thousands of tiny files engines spend more time on bookkeeping than on reading data, which degrades both query and listing performance. Root causes are streaming micro-batches, high-frequency writes, and over-partitioning. The fix is compaction, rewriting many small files into fewer large ones, together with table formats such as Iceberg, Delta, or Hudi.
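The fixed-overhead argument can be made concrete with a back-of-the-envelope cost model. The sketch below assumes a purely serial scan and uses invented constants (50 ms of overhead per file, 200 MB/s of read throughput). Real engines parallelize reads and real overheads vary by storage system, so only the shape of the result matters, not the numbers.

```python
def scan_seconds(
    num_files: int,
    total_bytes: int,
    per_file_overhead_s: float = 0.05,      # assumed fixed open/list/metadata cost per file
    throughput_bytes_per_s: float = 200e6,  # assumed sequential read throughput
) -> float:
    """Estimate serial scan time: fixed cost per file plus time to stream the bytes."""
    return num_files * per_file_overhead_s + total_bytes / throughput_bytes_per_s


total = 10 * 1024**3  # the same 10 GiB of data in both layouts

tiny_layout = scan_seconds(num_files=100_000, total_bytes=total)  # roughly 100 KiB per file
compacted_layout = scan_seconds(num_files=80, total_bytes=total)  # 128 MiB per file

print(f"tiny files: {tiny_layout:,.0f} s")
print(f"compacted:  {compacted_layout:,.0f} s")
```

Under these assumptions the data-transfer term is identical in both layouts. The entire gap comes from the per-file term, which is the bookkeeping cost the answer outline describes.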

