Replacing a 3 GB SQLite database with a 10 MB FST (finite state transducer) binary

· databases · Source ↗

TLDR

  • A Finnish-English dictionary app swapped a 3 GB SQLite/FTS inflection database for a 10 MB FST binary, achieving a 300x size reduction using BurntSushi’s fst Rust crate.

Key Takeaways

  • FSTs compress both prefixes and suffixes; Finnish’s agglutinative morphology (millions of inflections sharing a handful of endings) is an ideal fit, unlike tries which only share prefixes.
  • The fst crate by Andrew Gallant (ripgrep author) maps conjugations and declensions back to source definitions; static load at runtime sidesteps fst’s mutability weakness.
  • A naïve trie held ~400,000 entries at ~50 MB but could not scale to 40-60 million inflected forms without blowing memory budgets for low-end hardware.
  • The SQLite hack shipped first and worked; the FST rewrite was only possible because shipping the bad solution provided 9 months of runway to learn better approaches.
  • The Pro v2 binary is now ~20 MB all-in, smaller than the free v1 build, preserving the “pocket dictionary fits on any laptop” design constraint.

Hacker News Comment Review

  • One commenter noted the FST/MADFA structure is a rediscovery of the DAWG (Directed Acyclic Word Graph), itself called a DAFSA on Wikipedia, with roots going back roughly 20 years.

Notable Comments

  • @lscharen: identifies the structure as DAFSA/DAWG, a decades-old data structure the author independently rediscovered through the fst crate.

Original | Discuss on HN