Buckets and objects are not enough

May 3, 2026 · cloud

TLDR

S3 lacks a native dataset abstraction, forcing teams to rely on prefix conventions that break cost tracking, governance, and lifecycle management at scale.

Prefixes act as a proxy for datasets, but S3 cannot tell whether a given prefix is an organizational grouping, a real dataset boundary, or an implementation detail.
Tags attach to buckets or to individual objects, not to a logical dataset spanning many objects, so dataset-level metadata is expensive to maintain and easily becomes inconsistent.
Tools like Storage Lens, data catalogs, and FinOps platforms each address a slice of the problem but none treat the dataset as a first-class primitive.
Netflix and Pinterest built custom pipelines on top of S3 Inventory and Storage Lens to approximate dataset-level visibility; most teams cannot absorb that cost.
The author argues the fix requires discovery-first inference from existing partition formats and access patterns, not a parallel registry requiring manual registration.
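
As a rough illustration of what discovery-first inference could look like, the sketch below guesses dataset roots from object keys by treating the first Hive-style `key=value` path segment as the start of the partitioning, then sums bytes per inferred dataset. The heuristic, the function names, and the sample keys are all assumptions made for illustration; the summary does not describe the author's actual method, and a real system would presumably also weigh access patterns and other partition formats.

```python
import re
from collections import defaultdict

# Hive-style partition segment such as "dt=2026-05-01" (an assumed convention).
PARTITION_SEGMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*=[^/]+$")


def _as_prefix(dirs):
    """Join path segments into an S3-style prefix ending in '/', or '' if empty."""
    return "/".join(dirs) + "/" if dirs else ""


def infer_dataset_root(key):
    """Return the prefix before the first partition segment.

    Falls back to the object's parent prefix when no partition segment is found.
    """
    dirs = key.split("/")[:-1]  # drop the object's file name
    for i, segment in enumerate(dirs):
        if PARTITION_SEGMENT.match(segment):
            return _as_prefix(dirs[:i])
    return _as_prefix(dirs)


def bytes_per_dataset(objects):
    """Sum object sizes by inferred dataset root; `objects` yields (key, size) pairs."""
    totals = defaultdict(int)
    for key, size in objects:
        totals[infer_dataset_root(key)] += size
    return dict(totals)


# Hypothetical inventory rows: (object key, size in bytes).
sample = [
    ("warehouse/events/dt=2026-05-01/part-0.parquet", 100),
    ("warehouse/events/dt=2026-05-02/part-0.parquet", 150),
    ("warehouse/users/region=eu/dt=2026-05-01/part-0.parquet", 40),
    ("tmp/scratch/output.csv", 7),
]
totals = bytes_per_dataset(sample)
```

Here the two `events` partitions collapse into a single `warehouse/events/` dataset without anyone registering it, while the unpartitioned `tmp/scratch/` key simply falls back to its parent prefix. That fallback is exactly where a naive heuristic cannot tell a dataset from an organizational prefix.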

One commenter pointed out that, per the AWS docs, S3 prefixes also carry real performance implications, which gives prefix structure an operational significance beyond semantics and human readability.