How fast is N tokens per second really?

TLDR

  • tokenspeed is an interactive browser tool that lets you feel LLM throughput, from 5 tok/s (Raspberry Pi) to 800 tok/s (Cerebras), in four modes: code, text, think, and agent.

Key Takeaways

  • Four rendering modes: code (syntax-highlighted), text (prose), think (dim-italic reasoning + code), agent (tool calls + generation pauses).
  • Keyboard shortcuts 1-9 jump between landmark speeds, including 5 tok/s (Raspberry Pi), 60 tok/s (hosted Claude/GPT), 200 tok/s (Groq), and 800 tok/s (Cerebras-class).
  • Code is more token-dense than prose, so the same tok/s rate feels different depending on content type.
  • English prose averages ~1.3 tokens per word, so 30 tok/s is roughly 23 words/s – readable but not instant.
  • The tool's tokenization approximates BPE-style splitting rather than using a vendor-specific encoder; longer identifiers such as processUserInput split into multiple tokens.
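
The arithmetic behind these takeaways can be sketched in a few lines of Python. This is a minimal illustration, not code from tokenspeed: the function names are hypothetical, and the 1.3 tokens-per-word ratio is the rough English-prose average quoted above, so code or other languages would need a different ratio.

```python
# Rough average for English prose, taken from the takeaway above (an approximation).
TOKENS_PER_WORD = 1.3


def words_per_second(tok_per_s: float, tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a token throughput into an approximate word throughput."""
    return tok_per_s / tokens_per_word


def seconds_to_stream(n_tokens: int, tok_per_s: float) -> float:
    """Time to emit n_tokens at a steady decode rate (ignores prefill and pauses)."""
    return n_tokens / tok_per_s


if __name__ == "__main__":
    # Landmark speeds mentioned in the summary, plus the 30 tok/s example.
    for rate in (5, 30, 60, 200, 800):
        wps = words_per_second(rate)
        secs = seconds_to_stream(1000, rate)
        print(f"{rate:>4} tok/s ~ {wps:6.1f} words/s, {secs:6.1f} s per 1,000 tokens")
```

At 30 tok/s this gives about 23 words/s, matching the figure above; at 5 tok/s a 1,000-token answer takes 200 s to stream.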

Hacker News Comment Review

  • Commenters note that the tool undersells real reasoning-model latency: hidden extended-thinking phases can burn 2-3x more tokens than the visible output before a single line of code appears.
  • antirez points out tok/s is underspecified without separating decoding speed, prefill speed, and how both degrade as context length grows – a model at 50 tok/s with 2k context may drop to 7 tok/s at 100k.
  • Commenters disagree on whether fast generation is even useful for human review: above ~100-150 tok/s, output outruns reading comprehension, which would make higher speeds valuable only for subagents, not interactive coding sessions.
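
To make the commenters' points concrete, here is a rough latency model in Python. It is a sketch under stated assumptions, not a benchmark: the Request fields and function names are hypothetical, the 1,000 tok/s prefill rate is an invented placeholder, hidden thinking is set at 2x the visible output (the low end of the 2-3x figure above), and the straight-line interpolation between antirez's two example points (50 tok/s at 2k context, 7 tok/s at 100k) is only a stand-in for a real measured curve.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int        # tokens processed during prefill
    hidden_tokens: int        # reasoning tokens generated but never shown
    visible_tokens: int       # tokens the user actually sees
    prefill_tok_per_s: float  # prompt-processing speed
    decode_tok_per_s: float   # generation speed


def time_to_first_visible_token(r: Request) -> float:
    """Prefill time plus the time spent generating hidden reasoning tokens."""
    return r.prompt_tokens / r.prefill_tok_per_s + r.hidden_tokens / r.decode_tok_per_s


def total_time(r: Request) -> float:
    """Wall-clock time until the last visible token has been emitted."""
    return time_to_first_visible_token(r) + r.visible_tokens / r.decode_tok_per_s


def effective_visible_rate(r: Request) -> float:
    """Visible tokens per second of total waiting time."""
    return r.visible_tokens / total_time(r)


def decode_rate_at(context_tokens: int,
                   lo: tuple[int, float] = (2_000, 50.0),
                   hi: tuple[int, float] = (100_000, 7.0)) -> float:
    """Linearly interpolate decode speed between two (context, tok/s) points.

    ASSUMPTION: real degradation curves are not necessarily linear; the two
    default points are the example figures from the comment summarized above.
    """
    (c0, r0), (c1, r1) = lo, hi
    clamped = min(max(context_tokens, c0), c1)  # no extrapolation outside the range
    frac = (clamped - c0) / (c1 - c0)
    return r0 + frac * (r1 - r0)


if __name__ == "__main__":
    req = Request(prompt_tokens=2_000, hidden_tokens=1_000, visible_tokens=500,
                  prefill_tok_per_s=1_000.0,  # placeholder value, not a measurement
                  decode_tok_per_s=decode_rate_at(2_000))
    print(f"first visible token after {time_to_first_visible_token(req):.1f} s")
    print(f"effective visible rate: {effective_visible_rate(req):.1f} tok/s")
```

With these placeholder numbers, a nominal 50 tok/s model takes 22 s to show its first visible token and delivers an effective rate of about 15.6 visible tok/s, which is why a single tok/s figure says little on its own.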

Notable Comments

  • @antirez: argues tok/s requires at minimum decoding speed, prefill speed, and the slope of both across context sizes to be actionable.
  • @charles_irl: built a parallel simulator at modal.com/llm-almanac/token-timing-simulator with similar motivation; notes tokenspeed’s content-type rendering is more realistic than his own.
