Cactus Compute distilled Gemini into a 26M-parameter Simple Attention Network for single-shot function calling; its INT4 weights run on-device in 14MB.
Key Takeaways
Architecture: 8-head/4-KV encoder-decoder with cross-attention, ZCRMSNorm, RoPE, no FFN layers, d=512, and an 8192-entry BPE vocab (sanity-checked in the sketch after this list).
Pretrained on 16 TPU v6e chips over 200B tokens (27 hrs); post-trained on 2B function-call examples in 45 min.
Beats FunctionGemma-270m, Qwen-0.6B, Granite-350m, and LFM2.5-350m on single-shot function-calling benchmarks.
Production decode runs at 1200 tok/s and prefill at 6000 tok/s via the Cactus engine; the 14MB INT4 weights are fully open.
MLPs dropped entirely: the team found attention-only transformers retain structured-output ability when external knowledge is provided in context.
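As a rough sanity check on the headline numbers, the sketch below tallies attention-only parameters for the stated config and converts them to an INT4 footprint. The layer depths and untied embeddings are assumptions of ours; the post does not state them.

```python
# Back-of-envelope size check. d, head counts, and vocab come from the
# post; layer depths and untied embeddings are assumptions.
d, n_heads, n_kv_heads, vocab = 512, 8, 4, 8192
head_dim = d // n_heads                          # 64
kv_dim = n_kv_heads * head_dim                   # 256 (grouped-query attention)
attn = d * d + 2 * d * kv_dim + d * d            # Q, K, V, O projections

n_enc, n_dec = 8, 8                              # assumed depths (not stated)
params = (
    n_enc * attn                                 # encoder: self-attention only, no FFN
    + n_dec * 2 * attn                           # decoder: self- plus cross-attention
    + 2 * vocab * d                              # input embedding + untied output head
)
print(f"{params / 1e6:.1f}M params, "
      f"{params * 0.5 / 1e6:.1f} MB at 4 bits/weight (plus quant scales)")
# -> 27.3M params, 13.6 MB: in the neighborhood of the quoted 26M / 14MB.
```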
Hacker News Comment Review
Real-world tool-selection accuracy is unverified: one tester found it routed a “contact my boss” prompt to a timer tool instead of email, suggesting fragility in ambiguous multi-tool setups (illustrated in the sketch after this list).
The “26M” label caused widespread confusion with 26B; commenters recommended writing it as 0.026B for clarity.
The no-FFN design drew technical interest, with independent corroboration that removing the MLP layers from a transformer preserves transformation tasks but eliminates parametric knowledge retention.
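To make that failure mode concrete, here is a hypothetical single-shot setup of the kind the tester describes; the tool names and schema are illustrative, not Cactus's actual API.

```python
# Hypothetical single-shot function-calling setup; schema and tool names
# are illustrative, not Cactus's actual API.
tools = [
    {"name": "set_timer",
     "parameters": {"duration_minutes": "int"}},
    {"name": "send_email",
     "parameters": {"to": "str", "subject": "str", "body": "str"}},
]
prompt = "Contact my boss and tell him I'll be 20 minutes late."
# Intent points to send_email, but the surface cue "20 minutes" matches
# set_timer's parameter; the tester reported the model picked the timer.
```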
Notable Comments
@kgeist: Independent student research confirmed that removing the MLP layers from Qwen preserves input transformation but eliminates stored knowledge, aligning with Needle’s architecture claim (a sketch of this ablation follows the comments).
@tomaskafka: A casual test of setting an alarm and adding groceries to a list “outperformed Siri” on those narrow tasks.
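For readers who want to try that kind of ablation, a minimal sketch follows, assuming a Hugging Face Qwen2-family checkpoint; ZeroMLP is a stand-in of ours, not the student's actual code, and the ablated model likely needs finetuning before its behavior is meaningful.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class ZeroMLP(torch.nn.Module):
    """Replaces a decoder block's MLP with a zero function, so only the
    attention sublayers contribute to the residual stream."""
    def forward(self, x):
        return torch.zeros_like(x)

model_id = "Qwen/Qwen2.5-0.5B-Instruct"   # any Qwen2-family checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

for layer in model.model.layers:          # ablate every block's MLP
    layer.mlp = ZeroMLP()

# Per the comment, transformation tasks (e.g. reformatting given text)
# tend to survive, while recall of stored facts degrades.
inputs = tok("Uppercase this: hello world", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(out[0], skip_special_tokens=True))
```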