Running local inference on an M5 Max MacBook Pro costs roughly 3x more per million tokens than OpenRouter and is also 3-7x slower.
Key Takeaways
At 10-40 tokens/sec on Gemma4 31B, amortized hardware plus electricity lands at $0.40-$4.79 per million tokens, depending on the assumed lifespan, power draw, and throughput.
OpenRouter serves Gemma4 31B at $0.38-$0.50 per million tokens at 60-70 tokens/sec, beating local on both cost and speed in most scenarios.
Hardware depreciation dominates the cost equation; electricity at $0.20/kWh adds only ~$0.02/hr at 100W load.
Only under the most optimistic assumptions (50W, 40 tok/s, 10-year lifespan) does local inference match OpenRouter pricing.
For salaried developers, token cost is roughly 1000x lower than labor cost, making hosted APIs the pragmatic default.
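The arithmetic behind these takeaways can be sketched as follows. This is a minimal illustration, not the original post's spreadsheet: the function names are hypothetical, the hardware price of $4,000 is a placeholder that does not come from the source, and the machine is assumed to generate tokens around the clock for its whole lifespan.

```python
HOURS_PER_YEAR = 24 * 365


def electricity_usd_per_hour(watts: float, usd_per_kwh: float) -> float:
    """Hourly electricity cost for a constant power draw."""
    return watts / 1000.0 * usd_per_kwh


def hardware_usd_per_hour(hardware_usd: float, lifespan_years: float) -> float:
    """Straight-line depreciation, assuming 24/7 use (an assumption)."""
    return hardware_usd / (lifespan_years * HOURS_PER_YEAR)


def cost_per_million_tokens(
    hardware_usd: float,
    lifespan_years: float,
    watts: float,
    usd_per_kwh: float,
    tokens_per_sec: float,
) -> float:
    """Amortized hardware plus electricity, in USD per million output tokens."""
    hourly = hardware_usd_per_hour(hardware_usd, lifespan_years)
    hourly += electricity_usd_per_hour(watts, usd_per_kwh)
    tokens_per_hour = tokens_per_sec * 3600
    return hourly / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Electricity alone at 100 W and $0.20/kWh, the figure quoted above.
    print(f"electricity: ${electricity_usd_per_hour(100, 0.20):.2f}/hr")

    # Hypothetical $4,000 machine (placeholder price, not from the source).
    for years, watts, tps in [(3, 100, 10), (10, 50, 40)]:
        usd = cost_per_million_tokens(4000.0, years, watts, 0.20, tps)
        print(f"{years}y, {watts}W, {tps} tok/s: ${usd:.2f} per million tokens")
```

Varying `lifespan_years` while holding the other inputs fixed shows why depreciation, not electricity, drives the result.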
Hacker News Comment Review
The core methodology drew heavy criticism: the laptop cost should be split between its use as a workstation and as an inference machine, not allocated entirely to token generation.
Commenters noted the analysis only counts output tokens; for agentic workloads, input tokens dominate, and local inference treats them as near-free, which shifts the comparison meaningfully.
Cloud inference benefits from concurrency: a single GPU serves many users simultaneously, spreading fixed power costs across requests in ways no single local device can match. That makes the cost gap structural rather than incidental.
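Two of these objections, splitting the laptop's cost across uses and accounting for input tokens, can be expressed as a small variation on the cost model. The sketch below is illustrative only: the function names, the `inference_share` parameter, the separate prefill and decode speeds, and every number in the usage example are assumptions, not figures from the post or the comments.

```python
def hosted_bill(
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Hosted API bill in USD when input and output tokens are both charged."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


def local_bill(
    input_tokens: int,
    output_tokens: int,
    hourly_cost_usd: float,
    prefill_tps: float,
    decode_tps: float,
    inference_share: float = 1.0,
) -> float:
    """Local cost in USD, charged by machine time rather than by token.

    inference_share is the fraction of the machine's hourly cost allocated
    to inference (1.0 reproduces the "charge everything to tokens" method).
    """
    seconds = input_tokens / prefill_tps + output_tokens / decode_tps
    return hourly_cost_usd * inference_share * seconds / 3600


if __name__ == "__main__":
    # Hypothetical agentic workload: input tokens far outnumber output tokens.
    inp, out = 900_000, 100_000
    print("hosted:", round(hosted_bill(inp, out, 0.40, 0.40), 4))
    # Assumed values: $0.15/hr machine cost, 500 tok/s prefill, 20 tok/s decode.
    print("local, full allocation:", round(local_bill(inp, out, 0.15, 500, 20), 4))
    print("local, half allocation:", round(local_bill(inp, out, 0.15, 500, 20, 0.5), 4))
```

Because local cost scales with elapsed time, input tokens only look near-free when prefill is much faster than decode, so the assumed `prefill_tps` largely determines how far the comparison shifts.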
Notable Comments
@antirez: Argues that a 128GB M5 Max used as a primary workstation, which also runs DeepSeek V4 Flash offline, without censorship, and on private data, is a different value proposition than pure token cost.
@maho: Points out the post omits input token costs entirely, which dominate typical agentic workloads and are effectively free locally.