How fast is N tokens per second really?

May 20, 2026 · ai coding · Source ↗

TLDR

tokenspeed is an interactive browser tool that lets you feel LLM throughput rates from 5 tok/s (Raspberry Pi) to 800 tok/s (Cerebras) across code, text, think, and agent modes.

Four rendering modes: code (syntax-highlighted), text (prose), think (dim-italic reasoning + code), agent (tool calls + generation pauses).
Keyboard shortcuts 1-9 jump between landmark speeds, including 5 tok/s, 60 tok/s (hosted Claude/GPT), 200 tok/s (Groq), and 800 tok/s (Cerebras-class).
Code is more token-dense than prose, so the same tok/s rate feels different depending on content type.
English prose averages ~1.3 tokens per word, so 30 tok/s is roughly 23 words/s – readable but not instant.
Tokenization is a BPE-style approximation rather than any vendor-specific encoder; longer identifiers like processUserInput split into multiple tokens.
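
The words-per-second arithmetic and the identifier splitting above can be sketched in a few lines of Python. This is only an illustration: the 1.3 tokens-per-word figure comes from the summary, while the function names and the camelCase splitter are hypothetical stand-ins. A real BPE tokenizer uses learned merges and may split identifiers differently, and this is not tokenspeed's own code.

```python
import re

# Rough average for English prose, taken from the summary above.
TOKENS_PER_WORD = 1.3


def words_per_second(tok_per_s: float, tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Convert a token rate into an approximate English word rate."""
    return tok_per_s / tokens_per_word


def rough_subword_split(identifier: str) -> list[str]:
    """Split an identifier at camelCase, acronym, and digit boundaries.

    Hypothetical stand-in for BPE: it only shows why one long identifier
    costs several tokens, not how any real encoder would split it.
    """
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", identifier)


# Landmark speeds mentioned in the summary, converted to words per second.
for rate in (5, 30, 60, 200, 800):
    print(f"{rate:>3} tok/s ~ {words_per_second(rate):6.1f} words/s")

print(rough_subword_split("processUserInput"))
```

At 30 tok/s this gives about 23 words/s, matching the figure quoted above. Under this crude splitter the identifier processUserInput counts as three pieces.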

Commenters note the tool undersells real reasoning-model latency: extended hidden thinking phases can burn 2-3x more tokens than the visible output before a single line of code appears.
antirez points out tok/s is underspecified without separating decoding speed, prefill speed, and how both degrade as context length grows – a model at 50 tok/s with 2k context may drop to 7 tok/s at 100k.
There is tension around whether fast generation is even useful for human review: above ~100-150 tok/s, output outruns reading comprehension, making high speeds valuable only for subagents, not interactive coding sessions.
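
The commenters' points about hidden thinking, prefill, and context-dependent decoding can be combined into a back-of-the-envelope latency model. The sketch below is a simplification under stated assumptions: the Request class and function names are hypothetical, the prefill speed of 1000 tok/s is an invented placeholder held constant across context sizes, and the 2.5x hidden ratio is just the midpoint of the 2-3x range quoted above. Only the 50 tok/s and 7 tok/s decode figures come from the discussion.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int    # context the model must prefill before decoding
    visible_tokens: int   # tokens the user actually sees
    hidden_ratio: float   # hidden thinking tokens per visible token (assumed)


def time_to_first_visible(req: Request, prefill_tps: float, decode_tps: float) -> float:
    """Seconds until the first visible token: prefill plus hidden thinking."""
    hidden_tokens = req.visible_tokens * req.hidden_ratio
    return req.prompt_tokens / prefill_tps + hidden_tokens / decode_tps


def total_time(req: Request, prefill_tps: float, decode_tps: float) -> float:
    """Seconds until the last visible token has been decoded."""
    return time_to_first_visible(req, prefill_tps, decode_tps) + req.visible_tokens / decode_tps


# Assumed prefill speed; real values vary by hardware and usually degrade with context.
PREFILL_TPS = 1000.0

short_ctx = Request(prompt_tokens=2_000, visible_tokens=300, hidden_ratio=2.5)
long_ctx = Request(prompt_tokens=100_000, visible_tokens=300, hidden_ratio=2.5)

# Decode speeds from the example in the comments: 50 tok/s at 2k, 7 tok/s at 100k.
for label, req, decode_tps in (("2k", short_ctx, 50.0), ("100k", long_ctx, 7.0)):
    first = time_to_first_visible(req, PREFILL_TPS, decode_tps)
    total = total_time(req, PREFILL_TPS, decode_tps)
    print(f"{label:>4} context: first visible token after {first:.1f}s, done after {total:.1f}s")
```

With these assumed numbers, the short-context request shows nothing for 17 s even though it decodes at a nominal 50 tok/s, which is why a single tok/s figure says little about perceived latency.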

@antirez: argues that a tok/s figure is actionable only when it comes with, at minimum, decoding speed, prefill speed, and the slope of both across context sizes.
@charles_irl: built a parallel simulator at modal.com/llm-almanac/token-timing-simulator with similar motivation; notes tokenspeed’s content-type rendering is more realistic than his own.