Forge is a Python reliability layer for self-hosted LLM tool-calling that lifts an 8B local model to 86.5% on its 26-scenario eval suite via rescue parsing, retry nudges, and VRAM-aware context compaction.
Key Takeaways
Three usage modes: WorkflowRunner for full agentic loops, SlotWorker for multi-agent GPU sharing, and a drop-in OpenAI-compatible proxy that injects guardrails transparently.
Top config is Ministral-3 8B Instruct Q8 on llama-server; scores 86.5% overall and 76% on the hardest advanced_reasoning tier across 26 eval scenarios.
A synthetic respond tool forces the model to stay in tool-calling mode, preventing the text-vs-tool ambiguity that breaks small models; the tool call is stripped before the client sees the response.
Supports Ollama, llama-server, Llamafile, and Anthropic backends; eval harness includes batch eval, ASCII/HTML/markdown reports, and automatic resume via JSONL output.
Published as a peer-reviewed paper: Zambelli, A. “Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling.” doi:10.1145/3786335.3813193.
Hacker News Comment Review
The backend serving choice has outsized impact: the same Mistral-Nemo 12B weights jump from 7% on llama-server native FC to 83% on Llamafile in prompt-injected mode, suggesting the format pipeline matters more than raw weights.
Commenters with their own harnesses independently confirm that structured execution plans, conversation rewind, and tool-call collapse can reduce token use 2x-10x on constrained tasks, validating Forge’s design direction.
The tool-call ambiguity problem scales up: frontier models (Claude, Codex, Gemini) also misread exit-code-1 grep results as failures rather than empty results, a limitation Forge only partially addresses since it focuses on execution, not model reasoning quality.
Notable Comments
@simonw: Flags that “guardrails” is an overloaded term; here it specifically means validating and rescuing tool calls, not safety filtering.