Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model

· ai devtools web ·

TLDR

  • Cactus Compute distilled Gemini into a 26M-parameter Simple Attention Network for single-shot function calling, running INT4 at 14MB on-device.
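As a rough sanity check on the reported footprint, 26M parameters at 4 bits each packs into about 13MB, consistent with the quoted 14MB once quantization metadata is included. The group size and scale datatype below are assumptions for illustration, not details from the release:

```python
# Back-of-envelope: packed INT4 weight storage for a 26M-parameter model.
params = 26_000_000
bits_per_weight = 4
raw_bytes = params * bits_per_weight // 8      # packed 4-bit payload
print(f"{raw_bytes / 1e6:.1f} MB")             # 13.0 MB

# Group-wise quantization typically stores one scale per group of weights.
# Assuming a group size of 32 with fp16 scales (hypothetical values):
group_size = 32
overhead_bytes = (params // group_size) * 2    # one 2-byte scale per group
print(f"{(raw_bytes + overhead_bytes) / 1e6:.1f} MB")  # ~14.6 MB, near the reported 14MB
```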

Key Takeaways

  • Architecture: encoder-decoder with cross-attention; 8 query heads sharing 4 KV heads (grouped-query attention), ZCRMSNorm, RoPE, no FFN layers, d_model=512, BPE vocabulary of 8192.
  • Pretrained on 16 TPU v6e chips over 200B tokens (27 hours); post-trained on 2B function-call examples in 45 minutes.
  • Beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function calling benchmarks.
  • Production decode at 1200 tok/s, prefill at 6000 tok/s via Cactus; INT4 weights are 14MB, fully open.
  • MLPs dropped entirely: the team found attention-only transformers retain structured output ability when external knowledge is provided.
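The core design above — grouped-query attention with no MLP sublayer — can be sketched as a single pre-norm, attention-only layer. This is a minimal NumPy illustration under stated assumptions (RoPE omitted for brevity; dimensions, initialization, and naming are mine, not the released architecture):

```python
import numpy as np

# One attention-only layer: 8 query heads sharing 4 KV heads (grouped-query
# attention), RMS-style pre-norm, residual connection, and no feed-forward
# (MLP) sublayer. All parameters here are random; this shows structure only.
D, H_Q, H_KV = 512, 8, 4
HEAD = D // H_Q  # 64 dims per head

rng = np.random.default_rng(0)
Wq = rng.standard_normal((D, H_Q * HEAD)) * 0.02
Wk = rng.standard_normal((D, H_KV * HEAD)) * 0.02
Wv = rng.standard_normal((D, H_KV * HEAD)) * 0.02
Wo = rng.standard_normal((H_Q * HEAD, D)) * 0.02

def rmsnorm(x, eps=1e-6):
    # RMS normalization (stand-in for the post's ZCRMSNorm variant).
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def attention_only_layer(x):
    """x: (seq, D). Pre-norm causal self-attention; no MLP follows."""
    seq = x.shape[0]
    h = rmsnorm(x)
    q = (h @ Wq).reshape(seq, H_Q, HEAD)
    k = (h @ Wk).reshape(seq, H_KV, HEAD)
    v = (h @ Wv).reshape(seq, H_KV, HEAD)
    # Grouped-query attention: each KV head serves H_Q // H_KV = 2 query heads.
    k = np.repeat(k, H_Q // H_KV, axis=1)
    v = np.repeat(v, H_Q // H_KV, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(HEAD)
    mask = np.triu(np.full((seq, seq), -1e9), k=1)  # causal mask
    w = np.exp(scores + mask)
    w /= w.sum(-1, keepdims=True)
    out = np.einsum("hqk,khd->qhd", w, v).reshape(seq, H_Q * HEAD)
    return x + out @ Wo  # residual add; the MLP sublayer is simply absent

y = attention_only_layer(rng.standard_normal((10, D)))
print(y.shape)  # (10, 512)
```

Dropping the MLP removes roughly two thirds of a standard transformer layer's parameters, which is consistent with the post's claim that attention alone suffices when the tool schemas supply the needed knowledge in-context.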

Hacker News Comment Review

  • Real-world tool-selection accuracy is unverified: one tester's “contact my boss” prompt was routed to a timer tool instead of email, suggesting fragility with ambiguous multi-tool setups.
  • The “26M” label caused widespread confusion with 26B; commenters recommended writing it as 0.026B for clarity.
  • The no-FFN design drew technical interest, with independent corroboration that removing MLPs from transformers preserves input-transformation ability but eliminates parametric knowledge retention.

Notable Comments

  • @kgeist: Independent student research confirmed removing MLP from Qwen preserves input transformation but eliminates stored knowledge, aligning with Needle’s architecture claim.
  • @tomaskafka: Casual test setting an alarm and adding groceries “outperformed Siri” on those narrow tasks.
