Show HN: Find the best local LLM for your hardware, ranked by benchmarks

· ai coding hardware · Source ↗

TLDR

  • whichllm auto-detects your GPU/CPU/RAM and ranks local LLMs by merged scores from real benchmarks, not just by what fits in VRAM.

Key Takeaways

  • Scoring merges LiveBench, Artificial Analysis, Aider, Chatbot Arena Elo, and Open LLM Leaderboard results with confidence-weighted dampening (sketched after this list).
  • Recency-aware lineage demotion prevents stale 2024 leaderboard scores from outranking current-generation models.
  • VRAM estimation covers weights + GQA KV cache + activations + overhead; MoE models are ranked on active parameters for speed and total parameters for quality (see the VRAM sketch below).
  • whichllm plan does a reverse lookup of the GPU you need for a target model and context length; --json enables pipeline use with Ollama or jq (a hypothetical lookup is sketched below).
  • Evidence is graded across five levels (direct, variant, base, interpolated, self-reported), and cross-family score inheritance is rejected when parameter counts diverge by more than 2x (see the inheritance check below).
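
The post does not show the actual scoring formula, so the following is only a minimal sketch of what confidence-weighted merging with recency dampening could look like; the dataclass fields, the one-year half-life, and the 0-100 normalization are all assumptions, not whichllm's real code.

```python
from dataclasses import dataclass

@dataclass
class BenchScore:
    source: str        # e.g. "livebench", "arena_elo" (hypothetical labels)
    value: float       # score normalized to a 0-100 scale
    confidence: float  # 0-1; lower for interpolated or self-reported evidence
    age_days: int      # days since the leaderboard entry was published

def merged_score(scores: list[BenchScore], half_life_days: float = 365.0) -> float:
    """Confidence-weighted mean with a recency dampening factor (sketch)."""
    num = den = 0.0
    for s in scores:
        recency = 0.5 ** (s.age_days / half_life_days)  # older entries count less
        weight = s.confidence * recency
        num += weight * s.value
        den += weight
    return num / den if den else 0.0

# A fresh Aider result outweighs a stale 2024 leaderboard entry.
print(merged_score([
    BenchScore("aider", 62.0, confidence=0.9, age_days=30),
    BenchScore("open_llm_leaderboard", 78.0, confidence=0.5, age_days=500),
]))  # lands much closer to 62 than to 78
```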
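
The README's component list (weights, GQA KV cache, activations, overhead) implies an estimate along the lines below; the quantization bits, the fixed overhead term, and the example model shape are assumptions for illustration, not the tool's actual code.

```python
def estimate_vram_gb(
    params_b: float,          # total parameters, in billions
    bits_per_weight: float,   # e.g. ~4.5 for a Q4_K_M GGUF quant
    n_layers: int,
    n_kv_heads: int,          # GQA: fewer KV heads than attention heads
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,        # fp16 KV cache
    overhead_gb: float = 0.8, # activations + runtime buffers (rough guess)
) -> float:
    """Rough VRAM estimate: weights + GQA KV cache + fixed overhead (sketch)."""
    weight_bytes = params_b * 1e9 * bits_per_weight / 8
    kv_cache_bytes = 2 * n_layers * context_len * n_kv_heads * head_dim * kv_bytes
    return (weight_bytes + kv_cache_bytes) / 1024**3 + overhead_gb

# An 8B dense model (32 layers, 8 KV heads, head_dim 128) at 8k context, 4-bit quant:
print(f"{estimate_vram_gb(8, 4.5, 32, 8, 128, 8192):.1f} GiB")  # roughly 6 GiB
```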
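
whichllm plan presumably inverts that estimate to name a card; here is a hypothetical version that picks the smallest GPU from a toy table, where the table, the function name, and the required-VRAM input are illustrative rather than the tool's actual logic.

```python
# Toy GPU table (name, VRAM in GB); real coverage in whichllm is unknown here.
GPUS = [("RTX 4060 Ti 16GB", 16), ("RTX 3090", 24), ("RTX 5090", 32), ("A100 80GB", 80)]

def plan_gpu(required_gb: float) -> str:
    """Return the smallest listed GPU whose VRAM covers the estimate (sketch)."""
    for name, vram_gb in GPUS:
        if vram_gb >= required_gb:
            return name
    return "multi-GPU or CPU offload needed"

# required_gb would come from a VRAM estimate like the one sketched above.
print(plan_gpu(19.4))  # -> "RTX 3090"
```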
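
The five evidence grades and the 2x rule suggest a guard roughly like the one below; the grade ordering and the symmetric ratio check are assumptions.

```python
# Evidence grades from strongest to weakest, as named in the post.
EVIDENCE_GRADES = ["direct", "variant", "base", "interpolated", "self_reported"]

def can_inherit_score(src_family: str, src_params_b: float,
                      dst_family: str, dst_params_b: float) -> bool:
    """Allow score inheritance within a family; across families, reject it
    when parameter counts diverge by more than 2x (sketch)."""
    if src_family == dst_family:
        return True
    ratio = max(src_params_b, dst_params_b) / min(src_params_b, dst_params_b)
    return ratio <= 2.0

print(can_inherit_score("llama", 70, "qwen", 72))  # True: comparable sizes
print(can_inherit_score("llama", 70, "qwen", 14))  # False: 5x divergence
```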

Hacker News Comment Review

  • Commenters flagged a direct competitor, llmfit (written in Go); the main differentiator is that whichllm is Python-based and adds benchmark-aware ranking rather than pure fit detection.
  • The fixed context-length assumption in VRAM estimation is a real gap: sliding-window attention models like Mistral use substantially less KV cache at 32k context than the README implies (see the sketch after this list).
  • Early users on Apple Silicon report stale Qwen 2.5 recommendations even though Qwen 3.x runs fine, suggesting that HuggingFace data freshness or the ranking logic lags behind model releases.
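
To put a number on that gap, here is a rough comparison of KV cache size with and without a sliding window, assuming a Mistral-7B-like shape (32 layers, 8 KV heads, head_dim 128) and the 4k window used in early Mistral releases; none of these figures come from whichllm itself.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, window: int | None = None,
                 kv_bytes: int = 2) -> float:
    """KV cache size in GiB; with sliding-window attention each layer only
    keeps min(context_len, window) tokens (sketch)."""
    tokens_kept = min(context_len, window) if window else context_len
    return 2 * n_layers * tokens_kept * n_kv_heads * head_dim * kv_bytes / 1024**3

full = kv_cache_gib(32, 8, 128, 32_768)
swa = kv_cache_gib(32, 8, 128, 32_768, window=4096)
print(f"full attention: {full:.1f} GiB vs 4k sliding window: {swa:.1f} GiB")
# -> roughly 4.0 GiB vs 0.5 GiB at 32k context
```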

Notable Comments

  • @Jasssss: VRAM estimation does not account for sliding window attention, so KV cache sizing is likely overstated for Mistral-class models at long context.

Original | Discuss on HN