Buckets and objects are not enough

TLDR

  • S3 lacks a native dataset abstraction, forcing teams to rely on prefix conventions that break cost tracking, governance, and lifecycle management at scale.

Key Takeaways

  • Prefixes act as a proxy for datasets, but S3 cannot tell whether a given prefix is an organizational grouping, a real dataset boundary, or an implementation detail.
  • Tags can be attached to buckets or to individual objects, but not to a logical dataset spanning many objects, which makes shared metadata expensive and inconsistent to maintain.
  • Tools like Storage Lens, data catalogs, and FinOps platforms each address a slice of the problem, but none treats the dataset as a first-class primitive.
  • Netflix and Pinterest built custom pipelines on top of S3 Inventory and Storage Lens to approximate dataset-level visibility; most teams cannot absorb that cost.
  • The author argues the fix requires discovery-first inference from existing partition formats and access patterns, not a parallel registry requiring manual registration.
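
To make the discovery-first idea concrete, here is a minimal sketch, not the author's method. It assumes datasets use Hive-style `key=value` partition folders and treats everything before the first such folder as the dataset root. The function names, the `(unclassified)` bucket, and the `(key, size)` input shape are hypothetical; in practice the pairs might come from an S3 Inventory report or an object listing.

```python
import re
from typing import Dict, Iterable, Optional, Tuple

# A Hive-style partition folder looks like "dt=2024-01-01" (assumption:
# this is the only partition convention we try to recognize).
PARTITION_SEGMENT = re.compile(r"^[^=/]+=[^/]*$")


def infer_dataset_root(key: str) -> Optional[str]:
    """Return the prefix before the first key=value folder, or None."""
    segments = key.split("/")
    # Skip the last segment: it is the object name, not a folder.
    for i, segment in enumerate(segments[:-1]):
        if PARTITION_SEGMENT.match(segment):
            return ("/".join(segments[:i]) + "/") if i > 0 else ""
    return None


def summarize(objects: Iterable[Tuple[str, int]]) -> Dict[str, Dict[str, int]]:
    """Total object count and bytes per inferred dataset root."""
    summary: Dict[str, Dict[str, int]] = {}
    for key, size in objects:
        root = infer_dataset_root(key)
        # Keys with no partition folder land in a catch-all bucket.
        name = root if root is not None else "(unclassified)"
        entry = summary.setdefault(name, {"objects": 0, "bytes": 0})
        entry["objects"] += 1
        entry["bytes"] += size
    return summary


if __name__ == "__main__":
    listing = [
        ("warehouse/events/dt=2024-01-01/part-0.parquet", 100),
        ("warehouse/events/dt=2024-01-02/part-0.parquet", 250),
        ("warehouse/users/region=eu/dt=2024-01-01/a.parquet", 40),
        ("tmp/scratch.csv", 7),
    ]
    for name, stats in summarize(listing).items():
        print(name, stats)
```

A heuristic like this only covers one partition format; unpartitioned data and other layouts fall into the catch-all, which is where access patterns or other signals would have to take over.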

Hacker News Comment Review

  • One commenter pointed out that, per AWS docs, S3 prefixes carry real performance implications, so prefix structure matters for operational reasons and not only for semantics or human readability.
