Forge is a Python reliability layer for self-hosted LLM tool-calling that lifts an 8B model from 53% to 99% on agentic tasks via rescue parsing, retry nudges, and VRAM-aware context management.
Key Takeaways
Best self-hosted config: Ministral-3 8B Instruct Q8 on llama-server scores 86.5% across 26 eval scenarios, 76% on the hardest tier.
Four usage modes: WorkflowRunner, SlotWorker (multi-agent shared GPU slot), guardrails middleware, and a drop-in OpenAI-compatible proxy.
The proxy injects a synthetic respond tool so small models stay in tool-calling mode; the tool call is stripped before returning to the client.
Supports Ollama, llama-server, Llamafile, and Anthropic backends; llama-server recommended for top eval performance.
Published as a peer-reviewed paper: Zambelli, A. Forge: A Reliability Layer for Self-Hosted LLM Tool-Calling (doi:10.1145/3786335.3813193).
Hacker News Comment Review
Commenters initially unclear on what “guardrails” meant here; the author clarified it catches malformed tool calls mid-workflow and injects corrective nudges rather than blocking content.
Core mechanism confirmed by author: if a model sends malformed JSON to a tool (e.g., a hotel booking API), Forge intercepts, nudges the model, and retries – but will raise a max-iterations error if the model stays stuck.
Notable Comments
@tommica: Asked for a plain-language definition of guardrails in this context – a useful proxy for how non-obvious the framing is to first-time readers.