Thinking Machines previews interaction models trained from scratch with 200ms micro-turns, handling audio, video, and text concurrently without external scaffolding.
Key Takeaways
Turn-based models freeze perception until a turn ends; interaction models use continuous micro-turn streams so silence, overlap, and interruption stay in context (see the session sketch after this list).
A dual-system design pairs a real-time interaction model with an async background model for deep reasoning, tool use, and browsing, with context shared throughout (sketched below).
Encoder-free early fusion uses dMel embeddings for audio and hMLP 40x40 patches for video, all co-trained from scratch with the transformer (see the embedding sketch below).
Inference streams 200ms chunks through sessions persisted in GPU memory, avoiding per-turn setup overhead; a version was upstreamed to SGLang.
Native full-duplex enables proactive interjections, live translation, and visual-cue reactions without VAD harnesses or turn-prediction components.
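To make the micro-turn framing concrete, here is a toy Python sketch of the first takeaway: a session that grows by one 200ms chunk per step, so silence is an input like any other rather than a signal to stop listening. Every name here (Session, step, the silence threshold) is an illustrative assumption, not Thinking Machines' actual interface.

```python
import numpy as np

CHUNK_MS = 200
SAMPLE_RATE = 16000
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_MS // 1000  # 3,200 samples per micro-turn

class Session:
    """Toy stand-in for a persistent streaming session: context grows by one
    micro-turn per step instead of being rebuilt when a user 'turn' ends."""
    def __init__(self):
        self.context = []  # in a real deployment this would be KV-cache state on GPU

    def step(self, audio_chunk, video_frames=None):
        self.context.append((audio_chunk, video_frames))
        # A real model would decode output here (speech, text, or nothing);
        # we just report what stayed in context.
        if np.abs(audio_chunk).max() < 1e-3:
            return "silence (kept in context, not a turn boundary)"
        return "user audio"

session = Session()
for t in range(5):  # one second of wall-clock time = five micro-turns
    chunk = np.zeros(CHUNK_SAMPLES) if t < 3 else np.random.randn(CHUNK_SAMPLES)
    print(f"micro-turn {t}: {session.step(chunk)}")
```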
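The dual-system takeaway can be sketched with plain asyncio: a fast loop that emits something every micro-turn, and a slow background task that writes its findings into a context object both sides share. This is a hypothetical shape, not the released architecture; background_reasoner and the shared dict are stand-ins.

```python
import asyncio

async def background_reasoner(shared_ctx):
    """Slow path: deep reasoning, tool use, browsing. Results are written
    back into the context the real-time model is already attending over."""
    await asyncio.sleep(1.0)  # stand-in for a long tool call or chain of thought
    shared_ctx["notes"] = "background result now in shared context"

async def realtime_loop(shared_ctx):
    """Fast path: never blocks; responds every 200ms micro-turn and picks up
    background results whenever they land."""
    for step in range(8):
        await asyncio.sleep(0.2)
        print(f"micro-turn {step}:",
              shared_ctx.get("notes", "responding without the slow result"))

async def main():
    ctx = {}
    await asyncio.gather(background_reasoner(ctx), realtime_loop(ctx))

asyncio.run(main())
```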
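For the encoder-free early-fusion takeaway, a minimal PyTorch sketch of the two input paths: dMel-style audio embedding (quantize each log-mel bin into a few uniform levels and embed the level IDs) and a 40x40 patch embedding for video frames. The post mentions an hMLP stem; a single strided conv stands in for it here, and every dimension below is an assumed placeholder.

```python
import torch
import torch.nn as nn
import torchaudio

class DMelEmbed(nn.Module):
    """dMel-style audio tokens: discretize each log-mel bin into n_levels
    uniform buckets, embed the bucket IDs, concatenate across bins."""
    def __init__(self, n_mels=80, n_levels=16, dim_per_bin=8):
        super().__init__()
        self.n_levels = n_levels
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=n_mels)
        self.embed = nn.Embedding(n_levels, dim_per_bin)

    def forward(self, wav):  # wav: (batch, samples)
        logmel = self.mel(wav).clamp(min=1e-5).log()          # (B, n_mels, frames)
        lo, hi = logmel.amin(), logmel.amax()                 # batch-wide range (a simplification)
        ids = ((logmel - lo) / (hi - lo + 1e-8) * (self.n_levels - 1)).long()
        return self.embed(ids).permute(0, 2, 1, 3).flatten(2) # (B, frames, n_mels * dim_per_bin)

class VideoPatchEmbed(nn.Module):
    """40x40 patch embedding for a video frame; the actual hMLP stem
    patchifies in stages, collapsed here to one strided conv."""
    def __init__(self, dim=640, patch=40):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, frames):  # frames: (B, 3, H, W), H and W multiples of 40
        x = self.proj(frames)                # (B, dim, H/40, W/40)
        return x.flatten(2).transpose(1, 2)  # (B, patches, dim)

# Both streams land in the same embedding width, so they can be interleaved
# into a single transformer sequence with no modality-specific encoder.
audio_tokens = DMelEmbed()(torch.randn(1, 16000))              # (1, 81, 640)
video_tokens = VideoPatchEmbed()(torch.randn(1, 3, 240, 320))  # (1, 48, 640)
tokens = torch.cat([audio_tokens, video_tokens], dim=1)
```

Co-training these embeddings with the transformer from scratch is what "encoder-free" means here: no frozen, separately trained audio or vision encoder sits in front of the model.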
Hacker News Comment Review
Little substantive HN discussion yet; one early comment stands out.
Notable Comments
@rohitpaulk notes the demos are “quirky and short”, a deliberate contrast to Anthropic and OpenAI style.