Interaction Models

Tags: ai · ai-agents · design

TLDR

  • Thinking Machines previews interaction models trained from scratch with 200ms micro-turns, handling audio, video, and text concurrently without external scaffolding.

Key Takeaways

  • Turn-based models freeze perception until a turn ends; interaction models use continuous micro-turn streams so silence, overlap, and interruption stay in context.
  • A dual-system design pairs a real-time interaction model with an async background model for deep reasoning, tool use, and browsing – sharing context throughout.
  • Encoder-free early fusion uses dMel embeddings for audio and hMLP 40x40 patches for video, all co-trained from scratch with the transformer.
  • Inference streams 200ms chunks through sessions persisted in GPU memory, avoiding per-turn setup overhead; a version of this serving path was upstreamed to SGLang.
  • Native full-duplex enables proactive interjections, live translation, and visual-cue reactions without VAD harnesses or turn-prediction components.
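The micro-turn loop above can be sketched in miniature. This is a hypothetical illustration, not Thinking Machines' implementation: `MicroTurn`, `InteractionSession`, and the toy silence-counting model are invented names, and a plain Python list stands in for the KV cache the real system keeps resident in GPU memory between chunks. The point it demonstrates is that every 200ms window — including silence — enters the context, which is what lets the model react proactively rather than waiting for a turn boundary.

```python
from dataclasses import dataclass
from typing import List, Optional

CHUNK_MS = 200  # micro-turn length from the post


@dataclass
class MicroTurn:
    """One 200ms input window (hypothetical container)."""
    audio: bytes              # raw audio for this window (dMel-embedded in the real model)
    video: Optional[bytes]    # optional frame patches for the same window
    text: str = ""            # any text that arrived in this window


class InteractionSession:
    """Sketch of a persistent full-duplex session.

    A growing `context` list stands in for the per-session state
    (KV cache) the real system persists in GPU memory.
    """

    def __init__(self, model):
        self.model = model    # stand-in callable: context -> output text ("" = stay silent)
        self.context: List[MicroTurn] = []

    def step(self, turn: MicroTurn) -> str:
        # Every micro-turn is appended, silence included, so overlap
        # and interruption remain visible to the model in context.
        self.context.append(turn)
        return self.model(self.context)


def toy_model(context: List[MicroTurn]) -> str:
    # Toy policy: interject proactively after three consecutive silent chunks.
    recent = context[-3:]
    if len(recent) == 3 and all(t.audio == b"" and t.text == "" for t in recent):
        return "Shall I continue?"
    return ""


session = InteractionSession(toy_model)
outputs = [session.step(MicroTurn(audio=b"", video=None)) for _ in range(4)]
# Silence accumulates in context until the model chooses to speak.
```

A turn-based system could not express this toy policy at all: with perception frozen until a turn ends, three consecutive silent windows never appear in context as distinct events.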

Hacker News Comment Review

  • No substantive HN discussion yet.

Notable Comments

  • @rohitpaulk: notes the demos are “quirky and short” – a deliberate contrast to the Anthropic and OpenAI demo style.
