You can beat the binary search


TLDR

  • SIMD Quad, a hybrid of quaternary interpolation search and a SIMD block scan, beats std::binary_search on sorted uint16 arrays of up to 4096 elements in all tested cases.

Key Takeaways

  • The algorithm targets Roaring Bitmap’s 16-bit integer arrays (1-4096 elements), dividing them into 16-element blocks and using SIMD (NEON/SSE2) to check all 16 values simultaneously.
  • Quaternary interpolation search narrows to the right block first, then a single SIMD instruction checks the block, reducing branches and comparisons.
  • On Intel Emerald Rapids (GCC), SIMD Quad is 2x+ faster than binary search on warm cache; on Apple M4 (LLVM), the 2x+ gain appears on cold cache instead.
  • The quad (base-4) split helps mainly on Intel for large cold-cache arrays, exploiting memory-level parallelism; on Apple M4 it adds only marginal benefit.
  • Source code provided for both ARM NEON and x64 SSE2 paths; fallback linear scan handles remainder elements outside full 16-element blocks.
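The core idea in the takeaways above can be sketched as follows. This is a hypothetical simplification, not the article's implementation: it narrows to a 16-element block with a plain scalar scan of block maxima (where the article uses quaternary interpolation), then checks all 16 lanes of the candidate block with SSE2 compares, with a linear scan as the remainder fallback.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>
#include <cstddef>

// Return the index of `key` in sorted a[0..n), or -1 if absent.
// Sketch only: the block-narrowing step here is scalar; the article
// replaces it with a quaternary interpolation search.
int simd_block_search(const uint16_t* a, size_t n, uint16_t key) {
    size_t b = 0;
    // Advance block by block while the block's last element is < key.
    while (b + 16 <= n && a[b + 15] < key) b += 16;
    if (b + 16 <= n) {
        // Two 128-bit loads cover the 16-element block;
        // _mm_cmpeq_epi16 tests 8 uint16 lanes per instruction.
        __m128i k  = _mm_set1_epi16((short)key);
        __m128i lo = _mm_cmpeq_epi16(
            _mm_loadu_si128((const __m128i*)(a + b)), k);
        __m128i hi = _mm_cmpeq_epi16(
            _mm_loadu_si128((const __m128i*)(a + b + 8)), k);
        unsigned mask = (unsigned)_mm_movemask_epi8(lo)
                      | ((unsigned)_mm_movemask_epi8(hi) << 16);
        // Each matching 16-bit lane sets two bytes in the mask;
        // dividing the first set-bit position by 2 yields the lane.
        return mask ? (int)(b + __builtin_ctz(mask) / 2) : -1;
    }
    // Fallback linear scan over the tail that is not a full block.
    for (size_t i = b; i < n; ++i)
        if (a[i] == key) return (int)i;
    return -1;
}
```

The branch count inside the block is constant: one set of compares and a mask test replace up to four binary-search iterations.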

Hacker News Comment Review

  • Commenters noted quaternary search is essentially loop-unrolled binary search: the comparison count is similar, so the real gain comes from the SIMD block scan, not the split factor.
  • A recurring theme: classical CS algorithms assume uniform memory latency and no SIMD, so any hardware-aware rewrite of a textbook algorithm can yield large constant-factor wins.
  • Practical alternative raised: for uint16 search spaces, an 8 kB bitmap could be faster still, suggesting the optimal structure depends heavily on memory budget and access pattern.

Notable Comments

  • @aidenn0: points out that an 8 kB bitmap covers all uint16 values and may outperform any search algorithm for this use case.
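The bitmap alternative raised in the comments is straightforward to sketch. The struct below is a hypothetical illustration, not code from the article: one bit per possible uint16 value gives 65536 bits (8192 bytes), so membership testing is a shift, a mask, and a single memory access, with no search at all.

```cpp
#include <cstdint>

// Hypothetical 8 kB bitmap over the full uint16_t value space.
struct U16Bitmap {
    uint64_t words[1024] = {};  // 1024 x 64 bits = 65536 bits = 8 kB

    void insert(uint16_t v)         { words[v >> 6] |=  (1ULL << (v & 63)); }
    void remove(uint16_t v)         { words[v >> 6] &= ~(1ULL << (v & 63)); }
    bool contains(uint16_t v) const { return (words[v >> 6] >> (v & 63)) & 1; }
};
```

The trade-off is memory: the bitmap costs 8 kB regardless of cardinality, while a sorted array costs 2 bytes per element. Roaring itself reflects this, switching a container from a sorted array to an 8 kB bitmap once it exceeds 4096 elements.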
