Removing fsync from our local storage engine

· database systems ·

TLDR

  • FractalBits built a single-node KV engine reaching 190,985 obj/s on an AWS i8g.2xlarge NVMe instance by replacing fsync with pre-allocation, O_DIRECT, and 4KB-atomic journal commits.

Key Takeaways

  • fsync is costly because it flushes not just file data but also the inode, directory entries, extent maps, and the filesystem journal, adding unpredictable tail latency.
  • The engine uses fallocate to pre-allocate fixed-size files, so file size never changes, eliminating inode metadata writes on every PUT.
  • O_DIRECT bypasses the page cache; 4KB-aligned journal commits rely on NVMe atomic-write guarantees, so a commit block either lands fully or not at all.
  • A complete PUT writes the value data first via O_DIRECT, then batches a journal commit; if power is lost between the two, the orphaned region is reclaimed by a space-map scan on restart.
  • The no-fsync design is narrowly scoped: SSD-only, fixed file sizes, single-key atomicity, and cloud or enterprise NVMe with nonvolatile write cache. HDDs or general POSIX use break the contract.

Hacker News Comment Review

  • Discussion is minimal but the author clarifies the design is not a general anti-fsync argument and depends on five specific constraints: SSD-only, preallocated files, O_DIRECT, single-key atomicity, and device write guarantees.
  • One commenter flags that fsync complexity is a known pain point, linking prior HN discussion on how hard correct file I/O is in general.

Notable Comments

  • @alexhnn: “Working with files is hard… most of the complexity is from the fsync API” – frames the design as welcome relief from a notoriously tricky primitive.
