Removing fsync from our local storage engine

· database systems ·

TLDR

  • FractalBits built a single-node KV engine reaching 190,985 obj/s on an AWS i8g.2xlarge NVMe instance by replacing fsync with pre-allocation, O_DIRECT, and 4KB-atomic journal commits.

Key Takeaways

  • fsync is costly because it flushes not just file data but also the inode, directory entries, extent maps, and the filesystem journal, adding unpredictable tail latency.
  • The engine uses fallocate to pre-allocate fixed-size files, so file size never changes, eliminating inode metadata writes on every PUT.
  • O_DIRECT bypasses the page cache; 4KB-aligned journal commits rely on NVMe atomic-write guarantees, so a commit block either lands fully or not at all.
  • A complete PUT writes the value data first via O_DIRECT, then batches a journal commit; if power is lost between the two, the orphaned region is reclaimed by a space-map scan on restart.
  • The no-fsync design is narrowly scoped: SSD-only, fixed file sizes, single-key atomicity, and cloud or enterprise NVMe with nonvolatile write cache. HDDs or general POSIX use break the contract.

Hacker News Comment Review

  • Discussion is minimal but the author clarifies the design is not a general anti-fsync argument and depends on five specific constraints: SSD-only, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
  • One commenter flags that fsync complexity is a known pain point, linking prior HN discussion on how hard correct file I/O is in general.

Notable Comments

  • @alexhnn: “Working with files is hard… most of the complexity is from the fsync API” – frames the design as welcome relief from a notoriously tricky primitive.
